PP-507: Add support for filtering nodes as per the job request



As discussed on the design discussion page for PP-506, I have now separated out the design proposal for PP-507 to this page.

Please review the design proposal and provide comments.

Arun Grover

PP-506, PP-507: Add support for requesting resources with logical 'or' and conditional operators
Filtering nodes per the job request

Thanks for separating this from PP-506

We’d like to see nfilter applied at the chunk level in the select statement. We certainly have jobs that require 2+ different types of nodes in a single job, so a job-wide nfilter would not suffice.

Chunk-level nfilter leads to further complications (opportunities?), something I’ve pondered a bit as we’ve considered how to move toward an nfilter-like request mechanism. Here are a couple of use cases I’m considering. I’ll use the resources_available resource "model" in my examples; at present model can be one of [san, ivy, has, bro] (representing Sandy Bridge, Ivy Bridge, Haswell, and Broadwell).

uc1) Multi-chunk job that chooses different MPI rank distributions per chunk based on memory, an example request might look like:

qsub -l select=1:mpiprocs=1:model=has:nfilter="node['resources_available']['mem']>=512gb"+100:mpiprocs=12:nfilter="node['resources_available']['ncpus']>=12 and node['resources_available']['gigs_per_core']>=2gb"

uc2) Multi-chunk job that chooses different MPI rank distributions per chunk. The user doesn’t care which model is chosen, but they do want the same model to be chosen across all chunks. An example request might look like:

qsub -l select=1:mpiprocs=2:nfilter="node['resources_available']['model']=pbs.same and bigmem=true"+100:mpiprocs=16:nfilter="node['resources_available']['model']=pbs.same and node['resources_available']['ncpus']>=16"

In the above I’ve introduced a magic value pbs.same to accomplish the intent of the use case. I imagine another magic value pbs.any could satisfy a different use case where the user doesn’t care which model is provided by PBS. These values are illustrative; I’m not pushing for their use as presented here. The general notion is expressing an inter-chunk relationship. This is similar to the description of "like nodes" in the design of PP-506, but as I understand PP-506 a job-set would need to enumerate all requests that can satisfy the particular inter-chunk relationship, whereas here we constrain the matches a filter can come up with.
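To make the chunk-level idea concrete, here is a minimal sketch of what evaluating an nfilter string against candidate nodes might look like. The node dictionary layout and the filter_nodes helper are hypothetical (they mirror the node['resources_available'] syntax in the examples above, not any real PBS interface), and a real implementation would need a safer evaluator than bare eval:

```python
# Hypothetical sketch: evaluating a chunk-level nfilter string against
# candidate nodes. The dict layout mirrors the examples above; none of
# these names are real PBS interfaces.
nodes = [
    {"resources_available": {"model": "has", "ncpus": 24, "mem_gb": 512}},
    {"resources_available": {"model": "san", "ncpus": 16, "mem_gb": 64}},
]

def filter_nodes(nodes, expr):
    """Return only the nodes for which the filter expression is true."""
    return [n for n in nodes
            if eval(expr, {"__builtins__": {}}, {"node": n})]

big = filter_nodes(nodes, "node['resources_available']['mem_gb'] >= 512")
print([n["resources_available"]["model"] for n in big])  # → ['has']
```

Each chunk would run its own expression against the remaining candidate set, which is what makes inter-chunk constraints like pbs.same interesting: they couple otherwise independent evaluations.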



@gmatthew Thank you for providing your inputs.

I understand that having the filter run on each chunk would probably help in the use cases you mentioned. Although in the case of uc2, would having place=group= not help in getting the "pbs.same" effect?

I actually tried a small prototype that filters 1000 nodes for each job, and submitted 1000 jobs in the setup. It takes around 24 seconds for the scheduler to run a job-wide filter on all the jobs once per scheduling cycle. Having to do this on each chunk will bring performance down further. I think having Python do the filtering may not be such a good option, although Python does allow users to do a lot more while filtering (like using the math module).
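On the performance point, one mitigation worth measuring (purely a sketch, not something tested against the actual prototype) is compiling the filter expression once per job instead of re-parsing the string for every node:

```python
import timeit

# Synthetic stand-in for the 1000-node prototype described above.
nodes = [{"resources_available": {"ncpus": c}} for c in range(1000)]
expr = "node['resources_available']['ncpus'] >= 500"

# Variant 1: eval() re-parses the string for every node.
def per_node():
    return [n for n in nodes if eval(expr, {}, {"node": n})]

# Variant 2: compile the expression once, evaluate the code object per node.
code = compile(expr, "<nfilter>", "eval")
def compiled_once():
    return [n for n in nodes if eval(code, {}, {"node": n})]

assert per_node() == compiled_once()
t1 = timeit.timeit(per_node, number=10)
t2 = timeit.timeit(compiled_once, number=10)
print(f"re-parse: {t1:.3f}s  compiled: {t2:.3f}s")  # compiling once is typically faster
```

Whether parse overhead is actually a large slice of the 24 seconds is an open question, but it is cheap to rule in or out.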

Do you see a use case where filtering on the basis of “resources_assigned” would help too? If it is only resources_available then maybe we can do something outside of scheduler.



If a magic value like pbs.same could be used as place=group=pbs.same, yes I could see that working for my use case. I wonder, though, how quickly we would want filtering that effectively acts as 2+ group-like resources, requiring something beyond the place=group= approach.

There are many interesting questions around the performance of nfilter based on design/use, but probably too early to get into this in great detail. For example the various bits of python overhead (spinning up an interpreter, marshaling, etc) may represent a sizable portion of that 24 seconds - or not. :slight_smile: Bhroam has done some great work implementing our design for scheduling faster with node equivalence, that may (or may not) reduce the performance loss if python is involved. Does python have to be involved, even if it ends up not being performant in this application?

I don’t see a use case involving resources_assigned at the moment for our site, but I could easily imagine resources_assigned being handy for other sites.


@gmatthew Just to again touch base upon uc1 you previously mentioned -
Is that use case the result of a one-off job that you intend to submit once in a while (maybe to benchmark), or is it something needed in every other job that intends to run on bigger-memory nodes in its first chunk, and so on?


uc1 is something our users are doing today, and we see a general trend toward setting up MPI jobs in that way in order to avoid overloading the head node with both the generally higher/different workload of rank 0 and the workload of additional ranks.


Can you name it ndfilter instead of nfilter? nfilter sounds like numeric filter :slight_smile:

ndfilter might sound more like node-filter.


Sure, I can make that change. @billnitzberg also suggested changes to "nfilter" and asked to change it to either "node_filter" or just "filter".

So do you two have any preference between “filter”, “node_filter” or “ndfilter”?


I’d prefer node_filter. Second preference would be ndfilter.


@billnitzberg asked me to provide more use cases for this feature, based on what our users are doing now. Here’s a quick bit of background on how our users request resources, then I’ll get to the use cases. Users pick from one of roughly 7 primary types of nodes using resources_available.model. If they have a need for some specialized subset of a primary type, e.g. bigmem, they request that resource as well. In some cases we have different bigmem sizes, so a user might additionally specify resources_available.mem to request the minimum mem size needed. Then we have onesy/twosey special nodes that might have a different resources_available.model, etc.

uc3) Multi-chunk job that picks a higher memory node for the head node and cheaper (lower accounting charge rate) nodes for the rest:

qsub -l select=1:model=bro:mpiprocs=4+100:model=san:mpiprocs=16

uc4) Multi-chunk job that picks all nodes from the same type, but places fewer MPI ranks on the head node:

qsub -l select=1:model=has:mpiprocs=8+199:model=has:mpiprocs=24

uc5) Multi-chunk job that picks a special onesy node for the last node of the job, and a different type for all other nodes:

qsub -l select=299:model=ivy+1:model=fasthas

Some observations:

a) In general there will be several model types that a user considers “cheaper” as used in uc3, so a node_filter for that particular chunk could express the acceptable set

b) For a request like uc4 we expect that users are concerned more with overall and per-core memory rather than the particular model.

c) We have thought about trying to reformulate select statements to more exactly describe the MPI/mem ratio needed and have PBS parcel out ranks to match. This would potentially recast uc3 to something like:

qsub -l select=1:model=bro:mpiprocs=4+1600:ncpus=1:mpiprocs=1:gigs_per_core=2gb

and PBS could find nodes that match, with possible constraints like pbs.same mentioned earlier in the discussion.
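A pbs.same-style constraint is ultimately just a post-condition on the scheduler's chunk assignments. As a rough illustration (hypothetical helper and data, not a proposed implementation), the check could look like:

```python
# Hypothetical check for a pbs.same-style constraint: every assigned node
# whose chunk requested pbs.same for a resource must resolve to one value.
def satisfies_same(assignments, resource="model"):
    """True if all assigned nodes share the same value for `resource`."""
    values = {a["resources_available"][resource] for a in assignments}
    return len(values) == 1

chunks = [
    {"resources_available": {"model": "has", "ncpus": 24}},
    {"resources_available": {"model": "has", "ncpus": 24}},
]
print(satisfies_same(chunks))  # → True
```

The hard part, of course, is not checking the constraint but searching efficiently for an assignment that satisfies it.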


We have currently moved resources away from this project, so it is stalled for now.


Our current understanding of the situation is that the original design proposal does not impact the requirements posted by @gmatthew (thanks for posting). So we may continue our discussion.

My main concern is the use of ‘<’ and ‘>’ in the current proposal. Most shells use these characters to redirect I/O. Users will inevitably forget to quote or escape them, leading to potential loss of data. Could you propose some safer alternatives for expressing these conditionals? For example, ‘.gt.’ instead of ‘>’ and ‘.ge.’ instead of ‘>=’.


I share your concern @mkaro. Since this is a new feature and nothing similar exists in PBS, would it help if we enforce the quoted syntax irrespective of which operator the user is trying to use? If you think that would not be reasonable, then we can go with your proposal to use gt, ge, lt, and le as operators. This would also mean we have to rewrite the expression to substitute the operators, because we will end up running it as a Python expression.
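If we do go the .gt./.ge. route, the rewrite back to Python operators could be a straightforward token substitution. A minimal sketch (the helper name and token spellings are assumptions for illustration, not settled design):

```python
import re

# Hypothetical mapping from shell-safe operator tokens to Python operators.
_OPS = {".gt.": ">", ".ge.": ">=", ".lt.": "<", ".le.": "<=",
        ".eq.": "==", ".ne.": "!="}

_TOKEN_RE = re.compile("|".join(re.escape(t) for t in _OPS))

def to_python_expr(filter_str):
    """Replace .gt.-style tokens so the string can run as a Python expression."""
    return _TOKEN_RE.sub(lambda m: _OPS[m.group(0)], filter_str)

print(to_python_expr("node['resources_available']['ncpus'] .ge. 16"))
# → node['resources_available']['ncpus'] >= 16
```

The substitution pass is cheap, so the cost argument against the safer spellings is negligible; the real question is user ergonomics.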

Also, during a recent discussion with @bhroam we found that we have overlooked one major issue in the approach: security. If we treat node_filter as a Python expression and the scheduler runs it in an interpreter with root privileges, that is a problem.

Some of the things we could do are the following. @bhroam please correct me if I’m wrong:

  • Run the scheduler as a non-root user. That way the expression will not be executed as root. But it will still be executed with the scheduler user’s privileges, which means the fairshare usage file, resource group file, and sched_config would still be accessible to the user.
  • Have the scheduler impersonate the job user before running the expression. This is achievable, but it will have a huge impact on the scheduler’s performance.

There could be other ways to address this problem as well, but these are the two that came out of the discussion.


That’s a great point about security @arungrover. While it’s one thing for an admin to define a Python snippet in a PBS Pro configuration file that later gets evaluated, it’s a completely different story when users are able to pass Python code as part of their job submission. There are plenty of ways to crash the interpreter or (as you mentioned) obtain elevated privileges… https://stackoverflow.com/questions/35804961/python-eval-is-it-still-dangerous-if-i-disable-builtins-and-attribute-access

For this very significant reason, we should not use Python to evaluate anything a user has provided. Would you agree?