PP-725: new "keep <select>" option for "pbs_release_nodes"

This thread is to inform the community of work being done to enhance the node ramp-down feature of PBS. The enhancement lets a user release most of a running job's sister nodes/vnodes early while keeping those that satisfy a sub-select statement supplied to the pbs_release_nodes command through a new -k option.

Refer to:
RFE Ticket: PP-725
pbs_release_nodes “keep select” enhancement Design Doc

Please feel free to provide feedback on this topic.

Thank You

Nicely written, I like it!

In the example, you might change the use of mpiprocs to ncpus, which is a more common request. (Generally, mpiprocs is only specified when mpiprocs != ncpus).

Done changing mpiprocs to ncpus!

I had developed the example from the PP ticket, which used mpiprocs.

Is there a hook point here to allow for customisation of what select lines are permitted at a site?

If a site restricts what a user can provide in their select lines, and enforces that with submitjob/modifyjob hooks, this would let users end up with a select line that shouldn’t be allowed by the local policy.

There likely needs to be a way for a hook to see this and reject a change that would create a select line that shouldn’t be allowed.

Thank you for contributing your view from the hook perspective.

I feel such a hook point is not necessary, since pbs_release_nodes will NOT be used to allocate new nodes. It only keeps already-allocated nodes during job runtime while releasing other sister nodes, all of which were assigned by the scheduler in accordance with the select lines permitted by local policy and the submitjob/modifyjob admin hooks. When the job ends, all allocated nodes are released automatically anyway. pbs_release_nodes is intended to give up allocated nodes that the job no longer needs, thereby making them available to other jobs that are yet to run.

If a user invokes pbs_release_nodes with a select line that does not match the select statement already associated with the running job, then the command should fail (I will add this to the design page soon).

Also note that by invoking pbs_release_nodes -k <select string>, a user will not be able to prolong the run time of a job beyond the time decided by the scheduler/policy.
Please feel free to reply if you do not agree.

Hey @Shrini-h
I like your design, but I have one thought.
Up until now we have either been very exact with what to release (pbs_release_nodes -v), or we’re releasing vnodes before the job starts (hook keep_select). Now we’re going to be making a decision on releasing nodes while a job is running which isn’t an exact list of nodes. This is a bit fuzzy and might lead to us killing nodes with running processes.

For example:
If I have 3 identical chunks, 3:model=abc:ncpus=4. At the point I’m doing the pbs_release_nodes, node 2 has completed, but nodes 1 and 3 are still running. From what I can tell from your design, if I do a release nodes of 1 node, I’ll release the first one and kill it. Shouldn’t we be smarter and start releasing nodes that are finished with work first? Even if we do that, if we’re in a situation where we’re still about to release nodes with processes left, we might want to provide feedback to the user that they’re about to kill parts of their job.

Bhroam

@bhroam your thought is very valid. Thanks, I was kind of banking on this forum to shape this interface with such intricate points.

Yes, we should be smarter by defining the order criteria for selecting the target node list to be released. We should take care of this in implementation and of course document it.

But what should happen when (after exhausting the completed nodes) one or more busy nodes are selected for release? Let me break it down into the options I can think of:

  1. The easy road: just state a caveat in the documentation so that users know their multi-node job will lose such unfinished nodes abruptly.
  2. Add a sub-option (something like --only-finished) that performs the release only if all targeted nodes have completed their job chunk; otherwise the release command returns a failure code meaning that one or more targeted nodes were found busy (optionally returning the busy node list?).

Option 2 could perhaps also stand alone as a new RFE: “release all sister nodes that have completed their job chunk at this point”. But I have no idea if it’s a real-world use case.

Let me also tag @scc, @bayucan and all others for their opinions here.

Thanks @Shrini-h, I would favor option 1. We have no strong use case for anything more sophisticated at present and users can already release specific named nodes using pbs_release_nodes if there are specific ones they know they are finished with.

For choosing what nodes to release when using this new interface I would either be fine to leave it as “undefined” such that anything is valid that matches the request, or simply take things from the “end” (right hand side, working left) of the job’s node list.

Option 1 seemed reasonable to me, for now.

I’m fine with option 1, but I’d like to have PBS do the smart thing and if possible choose nodes that are free vs arbitrarily choosing nodes. I think users will get mad at us if they told us to free up nodes and we chose busy nodes when there were free nodes we could have chosen instead.

Bhroam
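To make the ordering suggestions above concrete, here is a minimal sketch in Python. It is not PBS code: the function name, the `busy` set, and the assumption that the first entry of the node list is the mother superior are all hypothetical. It walks the job's node list from the right-hand end, prefers nodes with no running processes, and reports which chosen nodes would still lose work.

```python
def pick_nodes_to_release(exec_hosts, busy, count):
    """Pick `count` sister nodes to release, idle ones first.

    exec_hosts: the job's node list in order; exec_hosts[0] is assumed
                to be the mother superior and is never released.
    busy:       set of node names that still run job processes
                (hypothetical input; the thread notes that only the
                mother superior knows this).
    """
    sisters = exec_hosts[1:]
    if count > len(sisters):
        raise ValueError("cannot release more nodes than the job has sisters")
    # Walk from the right-hand end of the node list, working left.
    from_end = list(reversed(sisters))
    idle = [n for n in from_end if n not in busy]
    still_busy = [n for n in from_end if n in busy]
    chosen = (idle + still_busy)[:count]
    # Nodes in this list would have their processes terminated abruptly.
    at_risk = [n for n in chosen if n in busy]
    return chosen, at_risk


chosen, at_risk = pick_nodes_to_release(
    ["ms", "n1", "n2", "n3"], busy={"n1", "n3"}, count=2)
print(chosen, at_risk)  # ['n2', 'n3'] ['n3']
```

A non-empty `at_risk` list is the point at which a tool could warn the user before going ahead.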

I just studied the source code with both options in mind. Where the implementation goes will depend on the option we decide on. Option 1 looks fairly easy to implement in the pbs_server code. For option 2, we would have to forward the batch request to the mother superior MoM, since it holds the status of each sister node’s job chunks.

Considering the feedback and the simplicity factor, let us settle on option 1 for now. We can of course add smarter options in a follow-up RFE, based on real-world feedback on option 1.

Thank you

Based on this consensus, I have added the caveats below to the design doc:

  • The order in which nodes/vnodes are selected to be released or kept by the “-k <select>” option is “undefined”. Hence users/admins and their scripts/tools should not depend on, or try to predict, the order of the release/keep operation on the nodes/vnodes.
  • If one or more nodes/vnodes targeted for release still have job chunks/processes running on them, then the release operation will terminate them abruptly.
  • Combining the previous two caveats: users/admins should be aware that by using this new option, the running job may lose some of its running job chunks.

The same can be seen in this history diff link

I added the line below under the caveats:

  • Since the mother superior cannot be ramped down, the substring of the select resource request associated with the mother superior will be internally appended to the sub-select string supplied with “-k”.
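A rough sketch of what that caveat could mean in practice, written in Python rather than taken from the PBS source. The helper names are hypothetical, and the sketch assumes that the mother superior's chunk is the first chunk of the job's select statement and that exactly one copy of it is appended. Select strings are given without the leading "select=".

```python
def parse_select(select):
    """Split 'N:res=val:...+M:res=val' into (count, {res: val}) chunks."""
    chunks = []
    for part in select.split("+"):
        fields = part.split(":")
        count = 1
        if fields[0].isdigit():
            count = int(fields.pop(0))
        resources = dict(f.split("=", 1) for f in fields)
        chunks.append((count, resources))
    return chunks


def keep_select_with_ms(job_select, keep_select):
    """Append one copy of the first job chunk (assumed to be the MS chunk)."""
    _, ms_resources = parse_select(job_select)[0]
    ms_chunk = ":".join(["1"] + [f"{k}={v}" for k, v in ms_resources.items()])
    return f"{keep_select}+{ms_chunk}"


job = "3:ncpus=4:model=abc+2:ncpus=8:model=def"
print(keep_select_with_ms(job, "1:ncpus=8:model=def"))
# 1:ncpus=8:model=def+1:ncpus=4:model=abc
```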

@scc, @bhroam, @bayucan, @billnitzberg

I’m almost done with the implementation and have a working change set. However I have one doubt:

Should we allow or deny a partial resource list in the sub-select string supplied to the -k option of the pbs_release_nodes command?

For example
Suppose we get a job by asking for:

qsub -l select=15:model=abc:mpiprocs=5+4:model=abc:bigmem=true:mpiprocs=1+30:model=def:mpiprocs=32

Should we then allow or deny the command below, given that the resource lists inside its chunks are partial?
pbs_release_nodes -k select=2:model=abc+2:model=def+1:bigmem=true

Based on the consensus (allow or deny) I will be able to easily modify the current implementation.

Hi @Shrini-h, why wouldn’t this case fall under this statement in the EDD, since this is a subset of the qsub select statement?

This new option to “pbs_release_nodes” specifies a select statement that is a subset of the job submission select statement which describes the nodes/vnodes which are to be kept assigned with the job, while releasing the remaining sister nodes/vnodes. The nodes/vnodes released will then be made available for scheduling other jobs.

I would assume that the select statement provided to pbs_release_nodes -k could be satisfied in any valid way from the job’s exec_vnode. In the example you give I think we could either:

  1. keep 2 nodes from the first qsub chunk type (both model=abc), 2 from the third (both model=def), and 1 from the second (bigmem=true)

OR

  2. keep 1 from the first qsub chunk type (model=abc, and we have to keep the primary exec host), 2 from the third (model=def), and 2 from the second (also model=abc, so they can satisfy the remaining model=abc chunk as well as the chunk for bigmem=true)

Do you agree with this?
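One way to read “satisfied in any valid way” is sketched below in Python. This is only an illustration, not the PBS implementation: the function name is hypothetical, resources the request does not mention are ignored, and purely numeric values are assumed to be consumable (the node needs at least that much) while everything else must match exactly.

```python
def chunk_matches(node, request):
    """True if `node` offers every resource named in `request`.

    Resources that the request leaves out are ignored, which is what
    makes a partial resource list acceptable.
    """
    for name, wanted in request.items():
        have = node.get(name)
        if have is None:
            return False
        if wanted.isdigit() and have.isdigit():
            # Assumed consumable: the node must have at least this much.
            if int(have) < int(wanted):
                return False
        elif have != wanted:
            return False
    return True


# One node of each chunk type from the qsub example above.
plain = {"model": "abc", "mpiprocs": "5"}
bigmem = {"model": "abc", "bigmem": "true", "mpiprocs": "1"}
other = {"model": "def", "mpiprocs": "32"}

print(chunk_matches(plain, {"model": "abc"}))     # True
print(chunk_matches(bigmem, {"model": "abc"}))    # True
print(chunk_matches(bigmem, {"bigmem": "true"}))  # True
print(chunk_matches(plain, {"bigmem": "true"}))   # False
```

Because the bigmem node matches both model=abc and bigmem=true, both of the assignments listed above are valid.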

I agree with it too; I just wanted to clarify, since the example in the ticket doesn’t cover this case.

I will add on to my implementation to handle it.
However, the logic will be crude (first fit) and won’t be able to fit things as intelligently as a scheduler does. So it could fail to satisfy the sub-spec even when a valid way of satisfying it exists.
I will try to put up some caveats that should help users get a higher chance of success.
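The limitation can be shown with a small first-fit sketch in Python. It is a toy model, not the actual implementation: the function and node names are made up, and matching is plain equality on the resources a chunk names.

```python
def first_fit(nodes, requests):
    """Assign each requested chunk to the first unused node that matches.

    nodes:    list of (name, {resource: value}) in exec_vnode order.
    requests: list of {resource: value}, one entry per chunk to keep.
    Returns the kept node names, or None if some chunk finds no node.
    """
    used = set()
    kept = []
    for request in requests:
        for name, resources in nodes:
            if name in used:
                continue
            if all(resources.get(k) == v for k, v in request.items()):
                used.add(name)
                kept.append(name)
                break
        else:
            return None  # no unused node satisfies this chunk
    return kept


nodes = [
    ("big1", {"model": "abc", "bigmem": "true"}),
    ("n1", {"model": "abc"}),
]
# model=abc grabs big1 first, leaving nothing for bigmem=true.
print(first_fit(nodes, [{"model": "abc"}, {"bigmem": "true"}]))  # None
# Asking for the more specific chunk first succeeds.
print(first_fit(nodes, [{"bigmem": "true"}, {"model": "abc"}]))  # ['big1', 'n1']
```

In this toy model, listing the most specific chunk first avoids the failure. Whether the real command behaves the same way is not stated in this thread.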

Just remembering something from the pbs_release_nodes_given_select() implementation: I was purposely building the available resources via the call add_to_resc_limit_list_sorted(&resc_limit_list, …), which adds them to a list in order of increasing number of cpus and mem. So when choosing the node resources to keep, I pick the one with the fewest cpus/mem available that can still satisfy the select spec. In this way, the other nodes with more cpus/mem available can be selected by the scheduler to assign to other jobs.
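The idea reads roughly like the Python sketch below. It only illustrates the ordering and is not a translation of add_to_resc_limit_list_sorted(); the dictionary keys and the function name are assumptions made for the example.

```python
def pick_smallest_fit(available, need_ncpus, need_mem_gb):
    """Keep the smallest node that still satisfies the request."""
    # Sort by increasing ncpus, then mem, mirroring the ordering described above.
    ordered = sorted(available, key=lambda n: (n["ncpus"], n["mem_gb"]))
    for node in ordered:
        if node["ncpus"] >= need_ncpus and node["mem_gb"] >= need_mem_gb:
            return node["name"]
    return None  # nothing in the job's allocation is large enough


available = [
    {"name": "n1", "ncpus": 32, "mem_gb": 256},
    {"name": "n2", "ncpus": 8, "mem_gb": 64},
    {"name": "n3", "ncpus": 4, "mem_gb": 16},
]
print(pick_smallest_fit(available, need_ncpus=6, need_mem_gb=32))  # n2
```

Here n3 is too small and n2 is the first node in sorted order that fits, so the larger n1 is the one released for other jobs.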

@bayucan, yes, I saw that in the code and have kept it the same. However, the problem is with multiple custom resources (including non-consumable ones like String and Boolean), which are difficult to sort. I will try my best to find a sorting order that gives better results.

I have made the changes to handle a partial resource list inside a chunk spec.