To optimize job resource polling, discontinue reporting of resources_used values of PBS root jobs

Please review my design change proposal in Design - optimize PBS job resource polling

Your design looks good to me.

Hey Al, looks good. Do you think we need a mom switch to (that can be set optionally by admins) to report root jobs? There could be clusters with smaller nodes that do not need this optimization and might depend on root jobs report, for some reason (i have no idea why they might need that, though). Just a thought.

There was this other idea of looking inside only cgroup folder to limit the processes we need to walk through (atleast for a system which has cgroups enabled) - did that not work out?

@subhasisb
The design that I’m advocating is for simplicity sake, allowing it to work on
all platforms with our without cgroups, cpuset subystems.

I think if there’s a push in this forum to report root jobs’ resources_used values in a system without pbs_cgroups hook enabled, then 2 options are possible:

  1. New mom_priv/config file option where .if set to true, would put back in the mom_get_sample pool the root processes.
  2. Have mom_get_sample() to automatically check running job’s owner (pj->ji_qs.ji_un.ji_momt.ji_exuid) and if 0 (root), then root processes will also be added to the pool.

In regards to still having some pidcaching done, yes, that was also proven optimal
comparable to not looking at root processes. Implementation-wise, though, it is not straightforward:

  • PBS mom would first need to see if pbs_cgroups hook is enabled by checking mom_priv/hooks/pbs_cgroups.HK. PBS jobs use cgroups and cpusets when pbs_cgroups hook is enabled.
  • Look for <cgroup_mount_point> in /proc/mounts (default is “/sys/fs/cgroup”)
  • Then build the pidcache ontaining pids found in <cgroup_mount_point>/cpuset/<cgroup_prefix>.slice/<cgroup_prefix>–<jobid>/tasks

<cgroup_prefix> is by default “pbspro” but could actually be changed in the pbs_cgroups.CF file. So pbs_mom would also need to look at CF file to find out if <cgroup_prefix> has been set to something else.

In summary, if we can get by with just not looking at root processes, then it’s a simple, less complicated fix but with the caveat on root jobs.

I like the design.
Do you have a way to test the performance of the feature?

@vstumpf: yes, actually before choosing this option, I had a standalone test program that had many variations of mom_get_sample(): a) no pidcache (no caching of pids), b) no pidcache + no root (as done in the current design), c) no pidcache + using libproc (openproc, getproc,…) to walk through /proc, d) pidcache (pids cached from /sys/fs/cgroup/cpuset/pbspro.slice/pbspro-*/tasks). I ran the test under HPE UV with 16 numa nodes, 36 cpus each for a total of 576 cores, with 36 single node cpuset PBS jobs running, and system was averaging 5000 pids. The result showed option b) and d) utilize less walltime and cputime and actually comparable to each other. Graphs here show 5 trials and averages in seconds:

image
image

2 Likes

Looks great. Thanks for your explanation Al

Looks good, thanks for the graphs.

Perfect, that’s exactly what I was looking for!

Design looks good, I’ll sign off.

There’s also the fact that a cgroup-based pidcache as it was implemented by the cpuset MoM some time also broke tm_attach, because processes spawned outside of the cpuset cgroups for jobs were seen as “not there” (anything not being polled in that code is seen as ‘not relevant’), which also prevented them from being attached to become part of the cpuset cgroup. Chicken and egg.

If we do not poll root processes, and we have no pidcache, will we also get rid of that bug, even for root jobs? Not seeing rss used being reported is one thing, but not being able to use pbs_attach is another…

Another question: would it make sense (and make things faster) if we had a flag to say “don’t bother recording cputime and memory when polling”? Because if you enable cgroups then most of the times (depending on the cgroup configuration file) these are going to overwritten by the cgroup hook and the values obtained by polling are going to be ignored anyway.

Since we read /proc/[tgid]/stat with a scanf throwing away some fields, I think the kernel will still compute the values to be passed to scanf, but we might save some time storing these (and summing them across processes).

With this update where proc_info is only tracking non-root processes, root jobs would not be able to do pbs_attach. The fix then would be to have pbs_mom, upon receiving a TM_ATTACH request, to look into /proc/ if it doesn’t find the process in proc_info. It’s just a one file check. I’ll update the code.

The problem with not including cputime-related fields (cutime, cstime, utime, stime) and memory-related fields (vsize, rss) is that the PBS resource monitor library (rm(3b) depends on these fields when mom is directly queried. The pbs_cgroups hook may record the actual values for cput and mem (depending on how pbs_cgoups hook config file is set) for each job, but directly asking pbs_mom for “cput”, “mem” values using openrm(), addreq(), getreq(), closerm() would provide an estimated values based on proc_info. Plus, I don’t think it would give us a major performance improvement by the extra arithmetic calculation, and also by not storing extra fields, given that the fields are already part of the malloc-ed proc_stat_t blob.