Reconfirmation of running reservations

Currently, if a degraded or in-conflict reservation starts running, no attempt is made to reconfirm it. The reservation remains short of nodes for its lifetime.

I’m making changes to replace nodes of a running reservation.

See the following design document, which describes the changes to how degraded or in-conflict reservations will be reconfirmed:

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1401815043/Reconfirming+degraded+reservations+that+are+running

Bhroam

Thanks @bhroam!

One very minor comment is that “Changes to how degraded reservations” seems to be an incomplete header.

In the “New workflow of a degraded or in-conflict reservation” section, is it left purposefully vague as to HOW “PBS will determine the first time the first reservation reconfirmation will be attempted…”? (The same question applies to how resv_retry will be set, and to what value, in item 3 of the same section.)

Will there be no way to control how long PBS will wait before attempting to reconfirm a reservation?

@scc Thanks for reviewing the document. I originally had more information about how the reservations would be reconfirmed. I had a short conversation with @billnitzberg and he suggested I keep it vague and let PBS decide. If we need to expose controls we can. If you feel strongly, I can go back to my original design where the admin explicitly controls the duration between reconfirmation attempts. As a side note, it will make testing easier since we can set the duration short.

Bhroam

I’m closing in on finishing my implementation of the feature. I had to change the design a little. I found out that it is impossible to reconfirm in-conflict reservations after they start running. When reconfirming a running reservation, you keep the nodes it already has and replace the nodes that are down. The problem is that an in-conflict reservation’s resv_nodes is set to the list of nodes it still has; the conflicted nodes have already been removed. It’s too hard to map the select statement to the nodes that are left.
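
To make the keep-and-replace step concrete, here is a minimal Python sketch. It is not PBS code: the function name, the flat node lists, and the first-come choice of spare nodes are all assumptions for illustration. A real implementation would also have to match replacement nodes against the reservation's select statement, which this sketch ignores.

```python
def replace_down_nodes(assigned, down, free):
    """Hypothetical helper: keep healthy nodes, swap down nodes for free ones.

    assigned: nodes currently held by the running reservation (ordered)
    down:     set of nodes known to be unavailable
    free:     candidate replacement nodes, in order of preference
    Returns (new_node_list, nodes_that_could_not_be_replaced).
    """
    # Only consider spares that are up and not already part of the reservation.
    spare = [n for n in free if n not in assigned and n not in down]
    result = []
    unreplaced = []
    for node in assigned:
        if node not in down:
            result.append(node)          # healthy node: keep it
        elif spare:
            result.append(spare.pop(0))  # down node: take the next spare
        else:
            unreplaced.append(node)      # no spare left: still degraded
    return result, unreplaced
```

For example, `replace_down_nodes(["n1", "n2", "n3"], {"n2"}, ["n4", "n5"])` returns `(["n1", "n4", "n3"], [])`. If the second element of the result is non-empty, the reservation would remain degraded and another attempt would be needed later.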

Also, to help with testing, I changed the attributes a little. Instead of having reserve_retry_init, which is the time the first reconfirmation attempt is made, I have reserve_retry_time. This attribute is the time between attempts. reserve_retry_init is now deprecated. The first attempt will be made reserve_retry_time seconds after the reservation is first degraded.
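
As an illustration of that timing, here is a minimal Python sketch. The function name and the epoch-second arguments are hypothetical, and it assumes attempts simply repeat every reserve_retry_time seconds until the reservation ends; in practice they would stop as soon as the reservation is reconfirmed.

```python
def retry_times(degraded_at, reserve_retry_time, resv_end):
    """Hypothetical helper: list the times at which reconfirmation is attempted.

    degraded_at:        time (seconds) the reservation first became degraded
    reserve_retry_time: seconds between attempts (also the delay before the first)
    resv_end:           time (seconds) the reservation ends
    """
    if reserve_retry_time <= 0:
        raise ValueError("reserve_retry_time must be positive")
    times = []
    # First attempt is one interval after the reservation became degraded.
    t = degraded_at + reserve_retry_time
    while t < resv_end:
        times.append(t)
        t += reserve_retry_time
    return times
```

With `degraded_at=1000`, `reserve_retry_time=600`, and `resv_end=3000`, this returns `[1600, 2200, 2800]`. Setting a short interval, as suggested for testing, just produces more closely spaced attempts.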

Bhroam

Design looks good to me.

I’d prefer it if we could change “reserve_retry_time” to “reserve_retry_interval”. It’s really an interval, and it’s not too late to make its purpose easier to understand.