Allow scheduling node maintenance, with the possibility to run new jobs until the maintenance begins

#41

The design looks good to me. Thanks for making the changes.

Keep in mind that even though we’re only supporting hosts, resv_nodes is a list of vnodes. For exclhost systems, you’ll need to list out all the vnodes in the (name)+(name) form.
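To illustrate the form bhroam mentions, resv_nodes joins parenthesized vnode names with ‘+’. A minimal sketch (the helper name and vnode names are made up for illustration; real resv_nodes entries may also carry resource amounts):

```python
def format_resv_nodes(vnodes):
    # Join vnode names into the (name)+(name) form used by resv_nodes.
    # Illustrative only; not actual PBS server code.
    return "+".join("({0})".format(v) for v in vnodes)

# On an exclhost system, every vnode of each host must be listed:
print(format_resv_nodes(["host1[0]", "host1[1]", "host2[0]"]))
# (host1[0])+(host1[1])+(host2[0])
```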

Bhroam

#42

Hi! I like the design (alternative two) as well, but I have a couple of suggestions (to help make it more parallel/similar to other design patterns in PBS Pro):

  1. Rather than overload “-f”, which I believe is used for other purposes in other commands (and since we cannot use “-H”), I suggest using a double-dash argument, e.g., something like “--nodes” or “--hosts” or “--SOMETHING”.

  2. Except for qsub, the PBS Pro commands don’t gobble up stdin. I would drop the ability to read the list of hosts from stdin. One can always use something like xargs (or some such) to convert a list to a command line argument.

  3. Not sure about this, but as long as it is always requesting “exclhost”, does it matter whether you are requesting hosts or vnodes (as long as you specify at least one vnode from every host)?

Thanks again!

- bill

#43

Thank you @bhroam. I mentioned the fact in the design doc.

@billnitzberg

  1. OK, will change it. (Just a short note: I checked the ref. guide, and double-dash parameters are not used in C-based commands in PBS, except for --version.) Of course, we can start to use double dashes here:) Since the prefix of this reservation is ‘M’, how about ‘--maintenance’? Otherwise, I like ‘--hosts’.
  2. OK, removed.
  3. I am not sure if we need to use the ‘exclhost’ at all. Since all the ncpus of the hosts will be requested, is it still useful? @bhroam
#44

Thanks @vchlum! I suggested --hosts, thinking about the future where we might want to use the same syntax in other commands (where we needed a list of hosts). But, we have no such use cases right now, so either one is fine with me.

#45

OK, thanks. ‘--hosts’ it is. Design doc has been changed.
V.

#46

I think there is no need to add a new attribute to reservations to indicate that a reservation is a maintenance reservation; the prefix itself is sufficient. So, I added a new ‘extend’ parameter to the IFL function pbs_submit_resv(). This parameter tells the server to set the new reservation’s prefix to ‘M’; the feature is restricted to operators and managers. Design doc changed.

Another way would be to add a new reservation attribute like ‘is_maintenance’.

Vasek

#47

I like the simplification of passing a new extend parameter and using the leading M in the name to convey the right information to the admin.

I have two (perhaps too much scope creep) suggestions which you can ignore, but here they are:

  1. At the IFL level, how about making this new extend parameter specify the full name of the reservation? It’s only available for the operator/manager, so they would be responsible for ensuring no name conflicts (like they are today with queues). On the pbs_rsub side, one could either hard-code in something like “M” for now (saving any more extensions for later), or one could add an additional parameter that specifies the name of the reservation itself, e.g., -q (or if that’s taken already, maybe --resv_name). Since these reservations take precedence over other reservations, this would give admins a way to create all sorts of named reservations, not just for maintenance, but for other things too. (For even more extra credit, include recurring/standing reservations too.)

  2. Rather than add a new command line parameter to pbs_rsub for nodes (and have that magically mean the reservation should forcibly succeed), introduce an explicit command line parameter, e.g., --force, (restricted to operator and manager only) that means make this reservation even if it conflicts with other reservations. Then, require the select & place specify the target for the reservation. In the initial version of the implementation, you could short-circuit the command to only accept lists of hosts and exclusive host, i.e., return an error unless the select/place matched the current use case of:
    -lselect=host=n21+host=n22+host=n23+… and -lplace=exclhost
    The main drawback here is that there’s no good syntax with select to ask for 1000s of hosts, so everyone would probably end up wrapping pbs_rsub in these cases. The positive would be that the infrastructure would be ready to extend to arbitrary admin-forced reservations (at some future date) without changing syntax. It also means that if/when we design a better syntax for asking for 1000s of hosts, it can be plugged in here as well as in qsub, etc. (Again, at a future date).
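For what it’s worth, a wrapper around pbs_rsub could assemble that select string from a host list; a small sketch (the function name and host names are hypothetical, for illustration only):

```python
def hosts_to_select(hosts):
    # Build the select string host=n21+host=n22+... from a list of hosts,
    # matching the short-circuited form described above. Illustrative only.
    return "+".join("host={0}".format(h) for h in hosts)

args = ["-lselect=" + hosts_to_select(["n21", "n22", "n23"]), "-lplace=exclhost"]
print(args[0])
# -lselect=host=n21+host=n22+host=n23
```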

Again, feel free to ignore. I do feel the design is good enough as is, and maintenance reservations will be really useful, so I don’t want to slow anything down with scope creep… but thought I would share these ideas in case you (@vchlum) feel like they are a great fit and not too much more work.

Thx!

#48

Thanks, @billnitzberg.

  1. I can imagine this could be useful. I like the possibility to identify the reservations so obviously. I have two notes on this:

    • First, allowing a custom name in the field that admins (and sometimes users) are used to treating as an ID could be slightly confusing.
    • Second, we already have the reservation name (parameter -N), and since the ID would no longer be an ID but a name, I think the existing name attribute would become redundant and/or misleading, because we would actually have two names.
  2. It makes sense, but I also think this can be implemented independently of ‘--hosts’, and ‘--hosts’ could still be useful. Besides, I can imagine adding --hosts to qsub with the same meaning - as a shortcut to request entire hosts. As for the short-circuit: for the maintenance use case, it would also be necessary to modify -lselect=host=n21+host=n22+host=n23+ after submission to -lselect=host=n21:ncpus=<n21_ncpus>… Do we want to modify the user-specified select after submission?

I have no problem changing the design doc to reflect your suggestions; they consider the future. What do others think?

Vasek

#49

@bhroam I am working on the implementation based on the design doc, and I would like to ask for advice concerning the scheduler.

We agreed to degrade overlapping reservations once the M-reservation is confirmed and reconfirm the overlapping reservation in the next iteration of the scheduler.

But as far as I see:

  • Once the overlapping reservation is degraded, its retry time is set to ‘now’, and RESV_ATR_resv_nodes is wiped from it, the reservation is not successfully reconfirmed. The degraded reservation only checks for down nodes via check_vnodes_down() in the scheduler; no actual reconfirmation is done.

  • If the overlapping reservation is marked back to UNCONFIRMED (and RESV_ATR_resv_nodes is wiped), it is properly reconfirmed on new nodes. This works OK.

Should I mark the overlapping reservation as UNCONFIRMED (wiping RESV_ATR_resv_nodes) instead of marking it as DEGRADED? Or is there a way to properly reschedule the degraded reservation on new nodes? (I do not see one.) Obviously, setting the overlapping reservation to DEGRADED with a retry time is not enough.

Vasek

#50

@vchlum,
Sorry about misleading you; I was unaware that reconfirmation only happened for down vnodes. The problem I have with setting the state back to unconfirmed is that if we can’t completely reconfirm the reservation, it will be deleted. Is that what we want? If a degraded reservation can’t be reconfirmed, it will stay degraded until a time we can reconfirm it, or it will start short on nodes.

I do have an idea, but you might not like it. There is new code in the scheduler that was added for ralter. When reconfirming a reservation, it sorts the nodes so the reservation’s nodes are in front. This means nodes assigned to the reservation will be reassigned to the reservation before the scheduler goes off to find new nodes. Maybe the degraded code should use this new code? Instead of looking for down nodes to replace, use the new code which does that automatically.
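The ordering bhroam describes can be sketched conceptually as a stable sort that puts the reservation’s current nodes first (this is not the scheduler’s actual code, just the idea; all names are made up):

```python
def order_nodes_for_reconfirmation(candidates, resv_nodes):
    # Stable sort: nodes already assigned to the reservation sort first
    # (key False < True), so they get reassigned to the reservation
    # before the scheduler goes off to find new nodes.
    assigned = set(resv_nodes)
    return sorted(candidates, key=lambda node: node not in assigned)

print(order_nodes_for_reconfirmation(["n1", "n2", "n3", "n4"], ["n3", "n1"]))
# ['n1', 'n3', 'n2', 'n4']
```

Because Python’s sort is stable, nodes keep their relative order within each group, which mirrors keeping the reservation’s placement intact where possible.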

It does creep the scope some since the degraded code will have to use the new code.

Bhroam

#51

@bhroam I was inaccurate before. An ordinary degraded reservation is reconfirmed if at least one of its nodes is down. Bypassing check_vnodes_down() for degraded overlapping reservations results in their correct reconfirmation. So, in order to distinguish degradation caused by a down node from degradation caused by an overlap, I added a new reservation substate, RESV_IN_CONFLICT, to the design doc. Please take a look.

And another thing… For me, it is not obvious how to deal with an already running reservation. We agreed that the admin takes care of running jobs and that it is OK to ignore such jobs. That is fine. Can we apply the same rule to already running reservations? It would mean that it is still possible to start a new job through the overlapping reservation even if the node is in the M-reservation. I think we cannot reconfirm a running reservation, because there could be jobs running in it, and it would become inconsistent. I have two ideas:

  • Apply the same rule as for running jobs and let the running reservation live. The admins will deal with it.
  • Stop the queues of overlapping running reservations so no new job will run.
#52

Hey @vchlum
I have an idea that I’m not sure is the right answer, but I thought I’d mention it anyway. What if the server rewrites the resv_nodes of the overlapping reservations when it degrades them? It can remove the overlapping nodes. The reservation is still degraded, and it still retains its original select. When the reservation is reconfirmed, it should get all the new nodes. This way a running reservation can still run work on non-overlapping nodes. If you stop the queue, nothing will run; that means if a 1000-node reservation overlaps by one node, 999 nodes will sit idle. I also dislike leaving it up to the admin, because the scheduler will keep trying to schedule work on the overlapping nodes. They’ll likely just stop the queue themselves.
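The rewrite suggested here amounts to filtering the overlapping vnodes out of resv_nodes. A conceptual sketch (the function name, the vnode naming scheme, and the host-from-vnode rule are assumptions; real resv_nodes parsing is more involved):

```python
def strip_overlapping_nodes(resv_nodes, maintenance_hosts):
    # Drop vnodes whose host is claimed by the maintenance reservation;
    # the remaining nodes can keep running work. Illustrative only:
    # assumes vnodes are named host[index].
    claimed = set(maintenance_hosts)
    return [v for v in resv_nodes if v.split("[")[0] not in claimed]

print(strip_overlapping_nodes(["n21[0]", "n21[1]", "n22[0]"], ["n21"]))
# ['n22[0]']
```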

What won’t happen in this case is that if the reservation can’t be completely reconfirmed, it won’t get any new nodes. Let’s take that example of the 1000-node reservation, and say it loses 500 nodes in an overlap. If there are currently 250 nodes free, we won’t add them to the reservation. It gets all 1000 or nothing changes. Of course, our current reconfirmation code has this same problem, so I don’t know how much we care.

If we rewrote the resv_nodes, would we need the RESV_IN_CONFLICT substate? We still might because none of the nodes in the resv_nodes are down.

Bhroam

#53

@bhroam OK, thanks. Modifying resv_nodes works well. No new jobs will start on overlapping nodes. AFAIS, the only drawback is that it brings some inconsistency for jobs already running within a reservation. Once resv_nodes is modified, the scheduler prints a warning for the reservation’s jobs running on the overlapping node. The message is: “Job has been assigned a node which doesn’t exist in its reservation”. Modification of resv_nodes is mentioned in the design doc now.

I am afraid that modifying resv_nodes doesn’t remove the need for the RESV_IN_CONFLICT substate. It is still needed for reconfirmation of CONFIRMED (not running) reservations.

Vasek

Edited: One more thing… If all the nodes are removed from a running reservation, then the scheduler prints the following message: “Reservation is invalid - ignoring for this cycle Is running w/o exec_vnode1”.

These messages could be annoying.

#54

Hey @vchlum
The job messages might be useful. The admin has overlapped a running reservation. This message will tell them which jobs they need to deal with before they take nodes down for maintenance. Another way of handling this is when the admin overlaps a reservation, we can tell them at the pbs_rsub time which running jobs are going to need to be handled.

I’m in favor of keeping the message about the reservation running without an exec_vnode. It tells the admin why jobs in that reservation aren’t running.

Now that I think about it, there is a corner case where modifying the resv_nodes might not be the right answer. If the maintenance reservation only overlaps by a little near the end of a reservation, then we’re stealing the nodes from the reservation for its entire time. This doesn’t give the admin the option to shrink the end of the reservation. The reservation may be useless now. This corner case might not be worth considering though. The admin is doing this. They should know what they are doing. I guess we could make the maintenance reservation a little less forceful unless asked. Unless a force option is provided, we can tell them what reservations will be overlapped, and let them handle them. If they want to say, “just do it”, they can use the force option. The force option will do what we are considering right now. If scheduling is not turned off while the reservations are being altered, there are race conditions where nodes could be used and the maintenance reservation needs to be moved.

Bhroam

#55

It sounds useful, @bhroam. You mean having ‘--hosts’ and ‘--force’ like this:

  • ‘--hosts’ will confirm the reservation immediately, but no resv_nodes modification is done.
  • ‘--force’ could be combined only with ‘--hosts’, and once this option is provided, the resv_nodes of overlapping reservations would be modified.

I think both are forcible, but the force level differs. Isn’t it odd to have two different options/words for forcing? It kind of leads me to @billnitzberg’s idea no. 2 above. How about implementing @billnitzberg’s idea no. 2, with ‘--force’ followed by the level, like this: ‘--force low/high’ or ‘--force 0/1’ (inspired by Torque’s mom diagnostics: ‘momctl -d 0/1/2’)?

Otherwise, I do not see any problem to add ‘--force’ as you suggested.

The affected/overlapping reservations could be printed into a reservation comment (a new attribute) regardless of the force level, and the admin could find them there.

Vasek

#56

Hi @bhroam and @vchlum,

Interesting, but I feel we should stick with an agile approach – the first version should focus on delivering the minimal amount needed, leaving “extras” that can be handled by the sysadmin to the sysadmin. That gets the feature out sooner, with less code (so less testing and fewer bugs); then, as we learn how people use the feature, we can craft enhancements to make the sysadmin’s life easier. (For example, we might find that this feature is never, or rarely, used on overlapping reservations, or we might find something we haven’t thought about yet.)

Thx!