Reconfirmation of running reservations

Currently, if a degraded or in-conflict reservation starts running, no attempt is made to reconfirm it. The reservation remains short of nodes for its lifetime.

I’m making changes to replace nodes of a running reservation.

See the design document for the changes being made to how degraded or in-conflict reservations will be reconfirmed.


Thanks @bhroam!

One very minor comment is that “Changes to how degraded reservations” seems to be an incomplete header.

In the “New workflow of a degraded or in-conflict reservation” section, is it left purposefully vague as to HOW “PBS will determine the first time the first reservation reconfirmation will be attempted…”? (The same goes for how resv_retry will be set, and to what value, in item 3 of the same section.)

Will there be no way to control how long PBS will wait before attempting to reconfirm a reservation?

@scc Thanks for reviewing the document. I originally had more information about how the reservations would be reconfirmed. I had a short conversation with @billnitzberg, and he suggested I keep it vague and let PBS decide. If we need to expose controls, we can. If you feel strongly, I can go back to my original design, where the admin explicitly controls the duration between reconfirmation attempts. As a side note, that would make testing easier, since we could set the duration to a short value.