Allow scheduling node maintenance while still running new jobs until the maintenance begins


#1

Hi,

I would like to suggest a new node attribute. The reason for the new attribute is to make it possible to schedule node maintenance while keeping node utilization high.

Please see the EDD for more details and let me know what you think.

Vasek


#2

Hey @vchlum
Thank you for writing up the EDD. A couple of questions

  1. Can’t you achieve this behavior by submitting a reservation on the requested nodes for the maintenance period?
  2. What if you want to schedule multiple maintenance windows close together? With only one attribute, that will be difficult to do. This lends itself to reservations again. Multiple reservations can be submitted for any time window (see the sketch after this list).
  3. How will this affect calendaring? dedicated time and reservations have a start and an end. This allows the scheduler to know when the node will be back up. It can schedule jobs on nodes after the maintenance. From the sounds of it, the maintenance has no end time. Does this mean the scheduler can’t plan to use this node until it comes back from maintenance? If this is the case, large jobs can get hurt. If enough nodes go down for maintenance that large jobs can’t run, the scheduler will give up on them. This means no resources will be saved for them. We’ll pick back up when the maintenance window is removed from the node, but we start over. If multiple maintenance windows happen in short order, large jobs may never run.
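
For reference, a sketch of what points 1 and 2 look like with reservations today (node names, times, and the select statement are placeholders):

    # One reservation per maintenance window; several windows close together
    # are simply several reservations on the same nodes.
    pbs_rsub -R 0800 -E 1200 -l select=1:host=node01+1:host=node02 -l place=exclhost
    pbs_rsub -R 1300 -E 1500 -l select=1:host=node01+1:host=node02 -l place=exclhost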

Bhroam


#3

Thank you @bhroam for the comments.

  1. We use reservations for scheduled maintenance now. The main issues with reservations are the following:

    • With a running job on a node, the reservation is not confirmed if the expected end time of the job exceeds the start time of the reservation. To get a reservation confirmed, it is always necessary to find out the latest end time of all jobs running on the nodes under maintenance, and the maintenance cannot start before the last job ends. Sometimes it is necessary to schedule the maintenance earlier than that. This is also not script- or Puppet-friendly.

    • The next and bigger issue with reservations is that if I want to schedule maintenance for a whole cluster (let’s say 10 nodes) but one node is temporarily down for some other reason, the reservation will not be confirmed. I can only create the reservation for 9 nodes, wait for the last node to come up, and then create a second reservation for that node. And if several nodes of the cluster are down, scheduling such maintenance becomes very complicated.

  2. Yes, that is true; only one maintenance window is easy to schedule with the proposed attribute.

  3. Thank you for pointing this out. I take large jobs very seriously. The idea was that the maintenance has no end time, but how about also adding an ‘available_after’ attribute…

Anyway, your questions lead me to this: what would you think of improving reservations themselves? How about adding a new ‘force’ option to pbs_rsub, so that the reservation is confirmed even if the nodes are unavailable right now, as long as they have sufficient resources? The reservation would be degraded from the beginning. Running jobs would also be ignored with such a ‘force’.
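
Just to make the idea concrete, the invocation could look roughly like this (the ‘force’ option is purely hypothetical at this point; node names and times are placeholders):

    # Hypothetical -- no such option exists in pbs_rsub today.
    # Confirm the reservation even though some nodes are down/offline or
    # still busy with running jobs; the reservation starts out degraded.
    pbs_rsub --force -R 0800 -E 1200 -l select=1:host=node01+1:host=node02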

Vasek


#4

1a is a hard situation. The job told PBS that it will take a certain amount of time. You want to take a node away from it before it is over. What should PBS do? While it is not the easiest solution, you can do a qalter -l walltime and shorten the job. This will allow you to submit the reservation at the right time.
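
For example, assuming job 1234 is the one holding the node past the planned start (the job ID and new walltime are placeholders):

    # Shorten the running job's walltime so it ends before the reservation
    # is supposed to begin:
    qalter -l walltime=01:30:00 1234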

2a is tricky. You are not the first person to complain about PBS not allowing you to submit a reservation on a down node for purposes of maintenance. Your idea of a reservation with force is interesting, but complicated. What would force mean? Do you want to allow the scheduler to confirm a reservation on a node a job is running on? Or is it only for down or offline nodes? If it is the first, you will want to do a relatively complicated node search where you grab as many free resources as possible first, and then make a second pass to use as few busy resources as possible. If it is just down or offline, then it is easier, but still not as straightforward. You would still probably want to try to satisfy the node solution with up and online nodes before trying any nodes that are offline or down.

I guess a third option is to only allow force to work with chunks that have host/vnode in them. This seems rather limiting.

I really think something like reservations is the right solution here. Just having an attribute (or a pair of attributes) for the next maintenance doesn’t allow for multiple maintenance windows that are close together. Also, you have to do either a complex qmgr command or many qmgr commands, where one pbs_rsub will do.
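
To illustrate the difference (the attribute name below is made up, since the EDD attribute is not final; node names and times are placeholders):

    # Attribute-based approach: one qmgr call per node, plus more calls
    # later to clear the attribute again (hypothetical attribute name):
    qmgr -c "set node node01 maintenance_start = '2024-04-15 08:00'"
    qmgr -c "set node node02 maintenance_start = '2024-04-15 08:00'"

    # Reservation-based approach: one command covers the whole window:
    pbs_rsub -R 0800 -E 1200 -l select=1:host=node01+1:host=node02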

When you want to ignore the scheduler for a job, you do a qsub -H. Maybe something similar here for pbs_rsub? PBS would have to be smart enough to degrade the reservation if any nodes are down or offline.

Bhroam


#5

I really like this idea (assuming it is restricted to PBSPro managers or maybe operators) – it would solve the maintenance issue nicely.

Two things to think about here:

  1. What should happen if any overlapping running jobs are not done by the time the reservation starts? Since the reservation was made by a Manager/Operator, an easy choice would be to start the reservation and let the Manager/Operator handle the potential oversubscription manually, e.g., by waiting for the job(s) to finish, suspending them, or killing them.

  2. What should happen if there are overlapping reservations already in the system? Again, one could decide that it’s the Manager/Operator’s responsibility to address this and go ahead and oversubscribe the resources, but I’m not sure that PBS Pro would support this in the current code paths. Note that it may be problematic to ask a Manager to delete a future occurrence of a standing/recurring reservation, so more thought might be needed for this case.

Thx!


#6

@billnitzberg Yes, the ‘force’ option would be only for managers or operators.

  1. I think that the manager/operator should decide what should happen with overlapping running jobs; the forced reservation does not need to handle this. We can simply let the overlapping jobs run, and once the reservation begins the manager will decide what to do.

  2. Since we would let the overlapping jobs run, I would say the correct way would be to oversubscribe the previous reservation and leave it untouched, but there is also the question of whether to oversubscribe a ‘forced’ reservation with another ‘forced’ reservation. This seems rather complicated for the scheduler.

@bhroam I think it is not necessary to prefer up nodes over down ones when nodes are selected with ‘force’. Since it is only for managers/operators, we can assume they know what they are doing.

One more thought: how about adding something different? I mean not a ‘reservation’ but a real ‘maintenance’ object - reservation-like, but this object would not allow submitting jobs into it (is that a problem? it could sometimes be useful to submit a job to a node in maintenance), and it would oversubscribe everything. Would that be easier for the scheduler?

Vasek


#7

@vchlum
The maintenance object is an interesting idea. Unlike a reservation, it would not need to go to the scheduler for confirmation. We could treat it like a qrun -H where you give a +’d list of nodes. The scheduler would need to be aware of them so as not to run jobs that would cross into them. If we provided the ability to say ‘all’, it could obsolete dedicated time.
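
For reference, the qrun -H form I mean looks roughly like this (vnode names and the job ID are placeholders; the exact vnode specification grammar is in the qrun man page):

    # Run job 123 on an explicit +'d list of vnodes, bypassing the scheduler:
    qrun -H "(node01:ncpus=8)+(node02:ncpus=8)" 123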

The way we run jobs in dedicated time is that we create a set of queues that all have the same prefix (‘ded’ by default). Any job in a dedicated time queue can’t run unless we are in dedicated time. We could do something similar, or just create a queue like we do with a reservation. The only reason I made the dedicated time prefix is so I didn’t have to add a queue name to the sched_config file. Queues can be added and deleted willy-nilly by qmgr. I didn’t want to have lingering queue names in a file.
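
As a sketch of that mechanism (paths, dates, and the queue name are placeholders; ‘ded’ is the default dedicated_prefix in sched_config):

    # $PBS_HOME/sched_priv/dedicated_time -- one window per line (FROM TO):
    04/15/2024 08:00    04/15/2024 12:00

    # A queue whose name starts with the prefix holds the jobs that are
    # only allowed to run during dedicated time:
    qmgr -c "create queue dedmaint queue_type = execution"
    qmgr -c "set queue dedmaint enabled = true"
    qmgr -c "set queue dedmaint started = true"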

The scheduler will need to be smart enough to understand when there are overlapping reservations. While the scheduler would deny reservations that were attempted over a maintenance window, any reservations that are already confirmed would still overlap.

The question comes back to reservations or these new maintenance objects. They are very similar in nature. I don’t want to create the same feature in PBS twice, but are they similar enough that overloading reservations for maintenance is good?

On one hand, there is a whole lot of machinery surrounding reservations that we wouldn’t need for maintenance objects. Maintenance objects don’t need to be confirmed or degraded/reconfirmed. On the other hand, a maintenance object is a set of resources blocked out for a certain period of time. One that only a certain set of users can run work in. That is basically the definition of a reservation.

My opinion is that reservations and maintenance objects are a little too similar to make both. What do you think?

Bhroam