Allowing to schedule node maintenance with a possibility to run new jobs until the maintenance begins


#41

The design looks good to me. Thanks for making the changes.

Keep in mind that since that even though we’re only supporting hosts, resv_nodes is a list of vnodes. For exclhost systems, you’ll need to list out all the vnodes in the (name)+(name) form.

Bhroam


#42

Hi! I like design (alternative two) as well, but have a couple suggestions (to help make it more parallel/similar to other design patterns in PBS Pro):

  1. Rather than overload “-f”, which I believed is used for other purposes in other commands (and since we cannot use “-H”), I suggest using a double dash argument, e.g., something like “--nodes” or “--hosts” or “--SOMETHING”.

  2. Except for qsub, the PBS Pro commands don’t gobble up stdin. I would drop the ability to read the list of hosts from stdin. One can always use something like xargs (or some such) to convert a list to a command line argument.

  3. Not sure about this, but as long as it is always requesting “exclhost”, does it matter whether you are requesting hosts or vnodes (as long as you specify at least one vnode from every host)?

Thanks again!

- bill


#43

Thank you @bhroam. I mentioned the fact in the design doc.

@billnitzberg

  1. OK, will change it. (Just a short note: I checked the ref. guide and double dash parameter is not used in C-based commands in pbs (except for --version)). Of course, we can start to use the double dash here:) Since the prefix of this reservation is ‘M’, how about ‘–maintenance’? Otherwise, I like ‘–hosts’.
  2. OK, removed.
  3. I am not sure if we need to use the ‘exclhost’ at all. Since all the ncpus of the hosts will be requested, is it still useful? @bhroam

#44

Thanks @vchlum! I suggested --hosts, thinking about the future where we might want to use the same syntax in other commands (where we needed a list of hosts). But, we have no such use cases right now, so either one is fine with me.


#45

OK, thanks. ‘--hosts’ it is. Design doc has been changed.
V.


#46

I think there is no need to add a new attribute to reservations in order to indicate that the reservation is maintenance and the prefix itself is sufficient. So, I added a new ‘extend’ parameter to IFL function pbs_submit_resv(). This parameter tells the server to set the new reservation prefix to ‘M’ and this feature is restricted to operators and managers. Design doc changed.

Another way would be to add a new reservation attribute like ‘is_maintenance’.

Vasek


#47

I like the simplification (of passing a new extend parameter and) using the leading M in the name to convey the right information to the admin).

I have two (perhaps too much scope creep) suggestions which you can ignore, but here they are:

  1. At the IFL level, how about making this new extend parameter specify the full name of the reservation? It’s only available for the operator/manager, so they would be responsible for ensuring no name conflicts (like they are today with queues). On the pbs_rsub side, one could either hard-code in something like “M” for now (and save any more extensions for later, or one could add an additional parameter that specifies the name of the reservation itself, e.g., -q (or if that’s taken already, maybe --resv_name). Since these reservations take precedence over other reservations, this would give admins a way to create all sorts of named reservations, not just for maintenance, but for other things too. (For even more extra credit, include recurring/standing reservations too.)

  2. Rather than add a new command line parameter to pbs_rsub for nodes (and have that magically mean the reservation should forcibly succeed), introduce an explicit command line parameter, e.g., --force, (restricted to operator and manager only) that means make this reservation even if it conflicts with other reservations. Then, require the select & place specify the target for the reservation. In the initial version of the implementation, you could short-circuit the command to only accept lists of hosts and exclusive host, i.e., return an error unless the select/place matched the current use case of:
    -lselect=host=n21+host=n22+host=n23+… and -lplace=exclhost
    The main drawback here is that there’s not good syntax with select to ask for 1000s of hosts, so everyone would probably end up wrapping pbs_sub in these cases. The positive would be that the infrastructure would be ready to extend to arbitrary admin-forced reservations (at some future date) without changing syntax. It also means that if/when we design a better syntax for asking for 1000s of hosts, it can be plugged in here as well as in qsub, etc. (Again, at a future date).

Again, feel free to ignore – I do feel the design is good enough as is, and maintenance reservations will be really useful, so I don’t want to slow anything down with scope creep… but though I would share these ideas in case you (@vchlum) feel like they are a great fit and not too much more work.

Thx!


#48

Thanks, @billnitzberg.

  1. I can imagine this could be useful. I like the possibility to identify the reservations so obviously. I have two notes on this:

    • First, allowing custom name in the field that admins and sometimes maybe users are used to be an ID could be slightly confusing.
    • Second, we already have the reservation name (parameter -N), and since ID is no more ID but the name, I think the existing name attribute would become kind of redundant and/or probably misleading because we would have actually two names.
  2. It makes sense, but I also think this can be implemented independently on ‘–hosts’ and ‘–hosts’ could be still useful. Besides, I can imagine adding the --hosts to qsub with the same meaning - as a shortcut to request the entire hosts. As for the short-circuit: For the maintenance use case, it would be also necessary to modify -lselect=host=n21+host=n22+host=n23+ after submission to -lselect=host=n21:ncpus=<n21_ncpus>… Do we want to modify user specified select after submission?

I have no problem to change the design doc to respect your suggestions. It considers the future. What do others think?

Vasek