I have an idea that I’m not sure is the right answer, but I thought I’d mention it anyway. How about the server rewrites the resv_nodes of the overlapping reservations when it degrades them? It can remove the overlapping nodes. The reservation is still degraded, and it still retains its original select. When the reservation is reconfirmed, it should get all the new nodes. This way a running reservation can still run work on its non-overlapping nodes. If you stop the queue instead, nothing will run. This means if a 1000 node reservation overlaps by 1 node, the other 999 nodes will sit idle. I also dislike leaving it up to the admin, because the scheduler will keep trying to schedule work on the overlapping nodes. They’ll likely just stop the queue themselves.
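To make the rewrite concrete, here is a rough Python sketch of what I mean. It is not the actual server code. It assumes resv_nodes is a "+"-joined list of parenthesized chunks such as `(n1:ncpus=4)+(n2:ncpus=4)`, and the function name `strip_overlapping` is made up for illustration.

```python
import re


def strip_overlapping(resv_nodes: str, overlapping: set) -> str:
    """Return resv_nodes with every node named in `overlapping` removed.

    Assumed format (an assumption, not taken from the server source):
    "(nodeA:res=val)+(nodeB:res=val+nodeC:res=val)".
    """
    kept_chunks = []
    # Each parenthesized group is one chunk; a chunk may span several nodes.
    for chunk in re.findall(r"\(([^()]*)\)", resv_nodes):
        # The node name is everything before the first ":" in each piece.
        specs = [s for s in chunk.split("+")
                 if s.split(":", 1)[0] not in overlapping]
        if specs:  # drop chunks that lost all of their nodes
            kept_chunks.append("(" + "+".join(specs) + ")")
    return "+".join(kept_chunks)


if __name__ == "__main__":
    before = "(n1:ncpus=4)+(n2:ncpus=4)+(n3:ncpus=4)"
    print(strip_overlapping(before, {"n2"}))
```

The original select is untouched here, so a later reconfirmation still knows how many nodes the reservation is supposed to have.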
The downside in this case is that if the reservation can’t be completely reconfirmed, it won’t get any new nodes. Take the example of a 1000 node reservation that loses 500 nodes in an overlap. If there are currently 250 nodes free, we won’t add them to the reservation. It gets all 1000 or nothing changes. Of course our current reconfirmation code has this same problem, so I don’t know how much we care.
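The all-or-nothing behavior can be sketched like this. Again, this is only a toy model in Python under my own assumptions: nodes are plain names, one node per chunk, and `reconfirm` is a hypothetical helper rather than anything in the scheduler.

```python
def reconfirm(resv_nodes, needed, free_nodes):
    """Try to top a degraded reservation back up to `needed` nodes.

    Returns (nodes, confirmed). If there are not enough free nodes to
    cover the whole shortfall, the reservation is returned unchanged.
    """
    missing = needed - len(resv_nodes)
    if missing <= 0:
        return list(resv_nodes), True
    if len(free_nodes) < missing:
        # Partial fill is not attempted: all or nothing.
        return list(resv_nodes), False
    return list(resv_nodes) + list(free_nodes)[:missing], True


if __name__ == "__main__":
    remaining = [f"n{i}" for i in range(500)]   # 1000 node resv lost 500
    free = [f"f{i}" for i in range(250)]        # only 250 nodes are free
    nodes, confirmed = reconfirm(remaining, 1000, free)
    print(len(nodes), confirmed)
```

With 250 free nodes the reservation stays at 500 and remains degraded. It only changes once at least 500 nodes are free at the same time.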
If we rewrote the resv_nodes, would we still need the RESV_IN_CONFLICT substate? We might, because after the rewrite none of the nodes left in the resv_nodes would be down.