PP-725: new "keep <select>" option for "pbs_release_nodes"

This forum post is to inform the community of work being done to enhance the node ramp-down feature of PBS. The feature enables users to release most of a running job's sister nodes/vnodes early, while keeping those that satisfy a sub-select statement supplied to the pbs_release_nodes command through the new -k option.

Refer to:
RFE Ticket: PP-725
pbs_release_nodes “keep select” enhancement Design Doc

Please feel free to provide feedback on this topic.

Thank You

Nicely written, I like it!

In the example, you might change the use of mpiprocs to ncpus, which is a more common request. (Generally, mpiprocs is only specified when mpiprocs != ncpus).

Done changing mpiprocs to ncpus!

I had actually developed the example from the PP ticket which contained the mpiprocs.

Is there a hook point here to allow for customisation of what select lines are permitted at a site?

If a site has restrictions on what a user can provide in their select lines, and is enforcing that with submitjob/modifyjob hooks, this would let users end up with a select line that shouldn’t be allowed by the local policy.

There likely needs to be a way for a hook to see this and reject a change that would create a select line that shouldn’t be allowed.

Thank you for contributing your view from the hook perspective.

I feel such a hook point is not necessary, since pbs_release_nodes will NOT be used to allocate new nodes, only to keep already-allocated nodes during job runtime while releasing other sister nodes, all of which were assigned by the scheduler in accordance with the select lines as per local policy and the submitjob/modifyjob admin hooks. When the job ends, all allocated nodes are automatically released anyway. pbs_release_nodes is intended to give up allocated nodes that a job no longer needs, thereby increasing their availability for other jobs that are yet to run.

If the user provides a select line to pbs_release_nodes that does not match the select statement already associated with the running job, then the command should fail (I will add this line to the design page soon).

Also note that by invoking pbs_release_nodes -k <select string>, the user will not be able to prolong a job’s run time beyond the time decided by the scheduler/policy.
Please feel free to reply if you do not agree.
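To make the “keep-select must match the job’s select” rule concrete, here is a rough Python sketch. The helper names and the simplified select grammar (`N:res=val:res=val+...`) are illustrative assumptions, not PBS source code; it only shows the idea of checking that a -k select could be satisfied by the chunks already assigned to the job:

```python
# Hypothetical sketch: validate that a "-k" keep-select could be satisfied
# by the chunks already assigned to the job. Function names and the
# simplified select grammar here are illustrative, not PBS source code.

def parse_select(select):
    """Parse 'N:res=val:res=val+...' into a list of (count, {res: val})."""
    chunks = []
    for part in select.split("+"):
        fields = part.split(":")
        if fields[0].isdigit():
            count, fields = int(fields[0]), fields[1:]
        else:
            count = 1
        resources = dict(f.split("=", 1) for f in fields)
        chunks.append((count, resources))
    return chunks

def keep_select_is_subset(job_select, keep_select):
    """True if every keep chunk matches a job chunk with enough count left."""
    remaining = [[count, res] for count, res in parse_select(job_select)]
    for k_count, k_res in parse_select(keep_select):
        for slot in remaining:
            if slot[1] == k_res and slot[0] >= k_count:
                slot[0] -= k_count
                break
        else:
            return False  # no assigned chunk type can satisfy this keep chunk
    return True
```

For example, keeping `2:model=abc:ncpus=4` out of a job that requested `3:model=abc:ncpus=4` would pass, while keeping `1:ncpus=8` would fail, since no assigned chunk matches it.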

Hey @Shrini-h
I like your design, but I have one thought.
Up until now we have either been very exact with what to release (pbs_release_nodes -v), or we’re releasing vnodes before the job starts (hook keep_select). Now we’re going to be making a decision on releasing nodes while a job is running which isn’t an exact list of nodes. This is a bit fuzzy and might lead to us killing nodes with running processes.

For example:
Say I have 3 identical chunks, 3:model=abc:ncpus=4. At the point I run pbs_release_nodes, node 2 has completed its work, but nodes 1 and 3 are still running. From what I can tell from your design, if I release 1 node, I’ll release the first one and kill its processes. Shouldn’t we be smarter and start by releasing nodes that are finished with their work? And even if we do that, if we’re still about to release nodes with processes left, we might want to warn the user that they’re about to kill parts of their job.

Bhroam

@bhroam your thought is very valid. Thanks, I was counting on this forum to shape this interface with exactly such intricate points.

Yes, we should be smarter by defining the order criteria for selecting the target node list to be released. We should take care of this in implementation and of course document it.
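As a rough illustration of one such ordering criterion, the sketch below prefers nodes whose job chunk has already finished when picking release candidates. The node representation (name plus a finished flag) is a hypothetical simplification, not the PBS internal data structure:

```python
# Illustrative sketch only: order release candidates so that nodes whose
# job chunk has already finished are released before still-busy nodes.
# The (node_name, chunk_finished) representation is hypothetical.

def order_for_release(nodes, wanted):
    """Pick up to `wanted` nodes to release, finished chunks first.

    `nodes` is a list of (node_name, chunk_finished) pairs, in the
    job's node-list order; the sort is stable, so ties keep that order.
    """
    ranked = sorted(nodes, key=lambda n: not n[1])  # finished (True) first
    return [name for name, _ in ranked[:wanted]]
```

With bhroam’s 3-chunk example, where node 2 has finished but nodes 1 and 3 are busy, releasing one node would pick node 2 first.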

But what should happen when (after exhausting the completed nodes) there are one or more busy nodes selected for release? Let me break down the options I can think of:

  1. The easy road: just state a caveat in the documentation so that the user knows their multi-node job will lose such unfinished chunks abruptly.
  2. Add a sub-option (something like --only-finished) which performs the release only if all targeted nodes have completed their job chunks; otherwise the command returns a failure code indicating that one or more job chunks were still busy (optionally returning the busy node list?).

Maybe option 2 could also stand alone as a new RFE: “release all sister nodes that have completed their job chunk at this point”. But I have no idea if it’s a real-world use case.
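A minimal sketch of the option 2 semantics, assuming a hypothetical helper and plain node-name lists/sets (not actual PBS code): the release succeeds only when every targeted node has finished its chunk; otherwise it refuses and reports the busy nodes.

```python
# Hypothetical sketch of the proposed --only-finished behaviour: release
# succeeds only when every targeted node has finished its job chunk;
# otherwise it fails and reports which nodes are still busy.

def release_only_finished(targets, finished):
    """Return (ok, busy_list): ok is False if any target is still busy."""
    busy = [node for node in targets if node not in finished]
    if busy:
        return False, busy  # refuse to release; report busy nodes
    return True, []
```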

Let me also tag @scc, @bayucan and all others for their opinions here.

Thanks @Shrini-h, I would favor option 1. We have no strong use case for anything more sophisticated at present, and users can already release specific named nodes using pbs_release_nodes if they know which ones they are finished with.

For choosing what nodes to release when using this new interface I would either be fine to leave it as “undefined” such that anything is valid that matches the request, or simply take things from the “end” (right hand side, working left) of the job’s node list.
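The “take from the end” policy could be sketched as simply as the following (hypothetical helper, not PBS source):

```python
# Sketch of the simple "take from the right-hand end" policy suggested
# here: release the last n entries of the job's node list.

def release_from_end(node_list, n):
    """Return the n nodes at the end of the job's node list."""
    return node_list[-n:] if n else []
```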

Option 1 seemed reasonable to me, for now.

I’m fine with option 1, but I’d like to have PBS do the smart thing and if possible choose nodes that are free vs arbitrarily choosing nodes. I think users will get mad at us if they told us to free up nodes and we chose busy nodes when there were free nodes we could have chosen instead.

Bhroam

I just studied the source code with both options in mind. The place of implementation will depend on the option we decide on. Option 1 looks fairly easy to implement in the pbs_server code, whereas for option 2 we would have to forward the batch request to the Mother Superior MoM, as she holds the status of each sister node’s job chunks.

Considering the feedback and the simplicity factor, let us settle on option 1 for now. We can of course add smarter behaviour in a future RFE based on real-world feedback on option 1.

Thank you


Based on this consensus, I have added the caveats below to the design doc:

  • The order of selection of nodes/vnodes to be released or kept by the “-k <select>” option is “undefined”. Hence users/admins and their scripts/tools should not depend on, or try to predict, the order of the release/keep operation on the nodes/vnodes.
  • If one or more nodes/vnodes targeted for release still have one or more job chunks/processes running on them, the release operation will terminate them abruptly.
  • Combining the previous two caveats: users/admins should be aware that by using this new option, a running job may lose some of its running job chunks.

The same can be seen in this history diff link.