To optimize job resource polling, discontinue reporting of resources_used values of PBS root jobs

Please review my design change proposal in "Design - optimize PBS job resource polling".

Your design looks good to me.

Hey Al, looks good. Do you think we need a mom switch (one that admins can optionally set) to report root jobs? There could be clusters with smaller nodes that do not need this optimization and might depend on root-job reporting for some reason (I have no idea why they would, though). Just a thought.

There was this other idea of looking only inside the cgroup folder to limit the processes we need to walk through (at least on systems with cgroups enabled). Did that not work out?

@subhasisb
The design that I’m advocating is for simplicity’s sake, allowing it to work on all platforms, with or without cgroups or cpuset subsystems.

I think if there’s a push in this forum to report root jobs’ resources_used values on a system without the pbs_cgroups hook enabled, then two options are possible:

  1. A new mom_priv/config file option which, if set to true, would put root processes back into the mom_get_sample() pool.
  2. Have mom_get_sample() automatically check each running job’s owner (pj->ji_qs.ji_un.ji_momt.ji_exuid) and, if it is 0 (root), add root processes to the pool as well (see the sketch below).
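
A minimal sketch of option 2, assuming hypothetical list and flag names (only ji_exuid above is taken from the actual PBS job structure; the $report_root_jobs knob stands in for option 1 and is an assumption, not an existing option):

```c
#include <sys/types.h>

/* Trimmed-down stand-in for the PBS job structure; only the owner-uid
 * path referenced above is kept. */
typedef struct job {
	struct {
		union {
			struct { uid_t ji_exuid; } ji_momt;
		} ji_un;
	} ji_qs;
	struct job *ji_next;		/* hypothetical list link */
} job;

static job *running_jobs;		/* hypothetical list head */
static int report_root_jobs;		/* hypothetical $report_root_jobs knob (option 1) */

/* Option 2: report root processes whenever a running job executes as root. */
static int
any_root_jobs(void)
{
	job *pj;

	for (pj = running_jobs; pj != NULL; pj = pj->ji_next)
		if (pj->ji_qs.ji_un.ji_momt.ji_exuid == 0)
			return 1;
	return 0;
}

/* In mom_get_sample(): keep root-owned /proc entries in the sampling
 * pool only when the admin override or a running root job requires it. */
static int
keep_root_pids(void)
{
	return report_root_jobs || any_root_jobs();
}
```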

In regards to still having some pid caching done: yes, that was also shown to be optimal, comparable to not looking at root processes. Implementation-wise, though, it is not straightforward:

  • PBS mom would first need to see whether the pbs_cgroups hook is enabled by checking for mom_priv/hooks/pbs_cgroups.HK. PBS jobs use cgroups and cpusets when the pbs_cgroups hook is enabled.
  • Look for <cgroup_mount_point> in /proc/mounts (the default is “/sys/fs/cgroup”).
  • Then build the pidcache containing the pids found in <cgroup_mount_point>/cpuset/<cgroup_prefix>.slice/<cgroup_prefix>-<jobid>/tasks.

<cgroup_prefix> is by default “pbspro” but can be changed in the pbs_cgroups.CF file, so pbs_mom would also need to read the CF file to find out whether <cgroup_prefix> has been set to something else. A sketch of this lookup is below.
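
A minimal sketch of the pidcache lookup, assuming a cgroup v1 cpuset hierarchy mounted in the conventional place and the default “pbspro” prefix (the function names here are illustrative, not PBS source):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Find where the cpuset cgroup subsystem is mounted by scanning
 * /proc/mounts; fall back to the conventional default. */
static const char *
cpuset_mount_point(char *buf, size_t buflen)
{
	FILE *fp = fopen("/proc/mounts", "r");
	char dev[64], path[256], type[32], opts[256];

	snprintf(buf, buflen, "/sys/fs/cgroup/cpuset");	/* default */
	if (fp == NULL)
		return buf;
	while (fscanf(fp, "%63s %255s %31s %255s %*d %*d",
		      dev, path, type, opts) == 4) {
		if (strcmp(type, "cgroup") == 0 && strstr(opts, "cpuset") != NULL) {
			snprintf(buf, buflen, "%s", path);
			break;
		}
	}
	fclose(fp);
	return buf;
}

/* Cache the pids listed in one job's cpuset tasks file; "prefix" would
 * come from pbs_cgroups.CF (default "pbspro"). Returns the pid count. */
static int
cache_job_pids(const char *mnt, const char *prefix, const char *jobid,
	       pid_t *pids, int maxpids)
{
	char path[512];
	FILE *fp;
	long pid;
	int n = 0;

	snprintf(path, sizeof(path), "%s/%s.slice/%s-%s/tasks",
		 mnt, prefix, prefix, jobid);
	if ((fp = fopen(path, "r")) == NULL)
		return 0;
	while (n < maxpids && fscanf(fp, "%ld", &pid) == 1)
		pids[n++] = (pid_t)pid;
	fclose(fp);
	return n;
}
```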

In summary, if we can get by with just not looking at root processes, then it’s a simple, less complicated fix, albeit with the caveat about root jobs.

I like the design.
Do you have a way to test the performance of the feature?

@vstumpf: yes. Before choosing this option, I wrote a standalone test program with several variations of mom_get_sample():

  a) no pidcache (no caching of pids)
  b) no pidcache + no root processes (as done in the current design)
  c) no pidcache + libproc (openproc, readproc, …) to walk through /proc
  d) pidcache (pids cached from /sys/fs/cgroup/cpuset/pbspro.slice/pbspro-*/tasks)

I ran the tests on an HPE UV with 16 NUMA nodes of 36 CPUs each, for a total of 576 cores, with 36 single-node cpuset PBS jobs running and the system averaging 5000 pids. The results showed options b) and d) used the least walltime and cputime and were comparable to each other. The graphs below show 5 trials and averages, in seconds:

[two graphs: walltime and cputime across 5 trials, with averages]
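
For reference, a minimal sketch of variant c), walking /proc with libprocps (procps-ng; link with -lprocps); the root filter from variant b) is shown inline:

```c
#include <proc/readproc.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	PROCTAB *pt = openproc(PROC_FILLSTAT | PROC_FILLSTATUS);
	proc_t proc;

	if (pt == NULL)
		return 1;
	memset(&proc, 0, sizeof(proc));
	while (readproc(pt, &proc) != NULL) {
		if (proc.ruid == 0)	/* variant b): skip root-owned pids */
			continue;
		printf("pid %d uid %d\n", proc.tid, proc.ruid);
	}
	closeproc(pt);
	return 0;
}
```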


Looks great. Thanks for your explanation, Al.