Filtering nodes per the job request


#1

Altair asked me to start a new post on the topic of PP-507: a clean start for taking a look at node filtering.

I talked with Dale Talcott and we came up with a list of high-level descriptions of qsub select scenarios. The scheduling context we’re dealing with:

a. Jobs do not share nodes.

b. We have 5+ different node models in our cluster; they differ in number of cores and amount of memory.

At present our users must specify which model(s) their job needs, and they may use two or more models in a single job.



High-level description of single-chunk scenarios

  1. Select N nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).

    e.g. select=22:ncpus=16

  2. Select N nodes, PBS must choose nodes that are all the same model (see the note after this list).

  3. Place N ranks, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).

    e.g. select=352:ncpus=1

  4. Place N ranks, PBS must choose nodes that are all the same model.
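(Scenarios 2 and 4 have no plain select expression today. As the discussion below suggests, one candidate with current PBS is a job-wide placement spec alongside the select, assuming a host-level model resource, e.g. select=22:ncpus=16 together with -l place=group=model.)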




High-level description of two-chunk scenarios

  5. chunk1: Select M nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).
    chunk2: Select N nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen; see the example after this list).

  6. chunk1: Select M nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).
    chunk2: Select N nodes, PBS must choose nodes that are all the same model.

  7. chunk1: Select M nodes, PBS must choose nodes that are all the same model.
    chunk2: Select N nodes, PBS must choose nodes that are all the same model (but may differ from chunk1).

  8. chunk1: Select M nodes, PBS must choose nodes that are all the same model.
    chunk2: Select N nodes, PBS must choose nodes that are all the same model as chunk1.

  9–12. Same as 5–8, but chunk1 is placing M ranks and chunk2 is placing N ranks.
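For concreteness, scenario 5 is just today's multi-chunk select syntax, with chunks joined by “+” (the counts here are illustrative):

    e.g. select=4:ncpus=16+8:ncpus=28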

Given those scenarios, how might a user put together their select statement to get what they're asking for?




Additional considerations

i. We are working toward a new filesystem feature for our users that will take “extra” memory on their nodes and make it available as a distributed ramfs. This is likely a straightforward arrangement when jobs request N nodes: users just need to ask for enough memory to cover both their rank/process needs and the ramfs. It gets more complicated when users instead ask for N ranks to be placed, as in 9–12 above. We currently plan to put the burden on users to craft their select in the right way, but we wonder if Altair has suggestions on how this could be made easier for users with current PBS, and whether there is some future work here to make it still easier and more intuitive.
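As an illustration of the node-count case (all values hypothetical), a job needing 64gb per node for its ranks plus 32gb per node for the ramfs would simply pad the per-chunk memory:

    e.g. select=4:ncpus=16:mem=96gb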

-Greg


#2
  • You can use a “queuejob” hook to craft the user’s select statement based on the topology of your cluster, or you can create certain select profiles based on the resources requested by the user (see the sketches after this list).
  • You can reject jobs via the queuejob hook if their request does not match your site policy (a wrapper script would be useful to get more control and create a meaningful select statement).
  • The ramfs request can be based on mom_dyn_res (MoM dynamic resources); see section 5.13.5.1, “Dynamic Host-level Resources”, in the PBS Pro Administrator’s Guide.
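A minimal sketch of such a queuejob hook, assuming a host-level model resource (as in the next section); the hook name fix_select, the default model value Test, and the rejection policy are illustrative assumptions, not prescriptions:

    # fix_select.py - queuejob hook sketch. Installed with, e.g.:
    #   qmgr -c "create hook fix_select"
    #   qmgr -c "set hook fix_select event = queuejob"
    #   qmgr -c "import hook fix_select application/x-python default fix_select.py"
    import pbs

    e = pbs.event()
    j = e.job

    sel = j.Resource_List["select"]
    if sel is None:
        # Hypothetical site policy: require an explicit select specification.
        e.reject("jobs must request resources via -l select")

    # Pin any chunk that does not name a model to a (hypothetical) default,
    # preserving the chunk structure ("+" separates chunks).
    chunks = [c if "model=" in c else c + ":model=Test"
              for c in str(sel).split("+")]
    j.Resource_List["select"] = pbs.select("+".join(chunks))

    e.accept()

And a minimal sketch of a MoM dynamic resource script for the ramfs idea, assuming a hypothetical size resource named ramfs_mem and using the kernel's MemAvailable figure as a stand-in heuristic:

    #!/usr/bin/env python
    # Hypothetical mom_dyn_res script: prints the value PBS reports for
    # resources_available.ramfs_mem on this host. Wired into mom_priv/config
    # with a line such as:  ramfs_mem !/var/spool/pbs/mom_priv/ramfs_mem.py
    import re

    with open("/proc/meminfo") as f:
        meminfo = dict(re.findall(r"^(\w+):\s+(\d+) kB", f.read(), re.M))

    # Report the kernel's estimate of allocatable memory in PBS size units.
    print("%skb" % meminfo["MemAvailable"])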

Models and Nodes:

  • You can tag the nodes with a custom resource (a string_array, host-level resource) that can represent the different models.
    – You can then craft a select statement as below:
    qsub -l select=10:ncpus=20:model=Test+10:ncpus=10:model=NoTest -- /bin/sleep 100
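For completeness, the custom resource might be created and the nodes tagged along these lines (the node name and model values are illustrative):

    qmgr -c "create resource model type=string_array, flag=h"
    qmgr -c "set node node001 resources_available.model = Test"

The resource also needs to be added to the “resources:” line in sched_config so the scheduler will match on it.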

Apologies if I have not understood your problem correctly or am not on the same page as you.


#3

Thanks for the pointer to mom_dyn_res. We currently use a custom resource for the node model, but to support the scenarios I provided, some change will need to be made to the scheduler (at least).


#4

Hi Greg,

First of all, thank you so much for posting your use cases on the forum. They will really help kick-start the discussion on this feature.
I have read through the use cases you mentioned and have the following observations:

  • For use cases 1, 2, 3, and 4: I think PBS can do 1 and 3 even today without any change, and it can also do 2 and 4 when a specific job-wide placement spec is requested.
  • For use cases 5 and 9: one can do this with PBS by not providing any specific placement; PBS will then choose the first set of nodes it can find.
  • For use cases 6 and 10: it seems like what we will need is a chunk-level placement spec, i.e. a user could specify that a chunk needs to run on nodes with the same resource value (place=group=model).
  • Use cases 7 and 11 are, I think, the tricky ones. You mention that the models may differ between chunks. If it is OK for the models to be the same in some cases, then users can specify a placement set for both chunks.
  • Use cases 8 and 12 can again be achieved by giving a job-wide placement spec, and then all chunks will run on the same model (see the example after this list).
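To illustrate the job-wide case with today's syntax (the counts are hypothetical, and model is the custom node resource discussed earlier in the thread):

    e.g. qsub -l select=12:ncpus=16+8:ncpus=28 -l place=group=model ...

Every chunk then lands on nodes that share a single model value; the chunk-level variant needed for use cases 6 and 10 has no equivalent syntax today.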

I know that there is nothing like a chunk-wide placement set in PBS today, but looking at the use cases I feel that what may be needed here is a chunk-wide placement set rather than a filter (especially for the cases where you want the model to be the same across a chunk).

Please let me know what you think.


#5

Sounds good, I’ll need to read up and play around more with placement specs.

Sounds reasonable.

I don’t quite follow: what is trickier about 7 and 11?

-Greg


#6

I did not call it out correctly. Initially I thought that both chunks must differ in model, but then I realized you wrote “may”. If the requirement were that they must differ between chunks, it would become interesting, because while finding a solution for one chunk we would have to make sure we do not use the same placement set as was used in the previous one.


#7

Ahh, gotcha. At this time we don’t need a mechanism to enforce different choices.


#8

Just wanted to point out that it’s critical for this feature to work with scheduling buckets (New node placement algorithm).

