PP-325: Design document review for cgroups hook


#1

This message is to inform the community that the design document for the cgroups hook has been updated and expanded. You may review the document here:

https://pbspro.atlassian.net/wiki/display/PD/PP-325%3A+Support+Cgroups

Please provide comments in response to this post. Thank you!


#2

Mike,

In reviewing the document, I believe the following is not correct:

use_hyperthreads false
When set to true, hyperthreads are treated as though they were physical cores. When false, hyperthreads are not counted as physical cores.

I believe that when false, hyperthreads are not added to the cpuset.cpus list even if the cpu has hyperthreading enabled.
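To illustrate what I mean, here is a minimal sketch of the behavior I would expect. The topology mapping and function below are hypothetical (they are not the hook's code); the sibling data would in practice come from something like /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```python
# Sketch (not the hook's actual code): which CPU ids go into cpuset.cpus
# depending on use_hyperthreads. core_siblings is a hypothetical,
# pre-parsed topology: one list per physical core, first entry being the
# physical thread and the rest its hyperthread siblings.

def cpus_for_cpuset(core_siblings, use_hyperthreads):
    """Return the sorted CPU ids to write into cpuset.cpus."""
    cpus = []
    for siblings in core_siblings:
        if use_hyperthreads:
            cpus.extend(siblings)     # every hardware thread counts
        else:
            cpus.append(siblings[0])  # physical thread only
    return sorted(cpus)

# Two physical cores, each with one hyperthread sibling
topology = [[0, 2], [1, 3]]
print(cpus_for_cpuset(topology, use_hyperthreads=True))   # [0, 1, 2, 3]
print(cpus_for_cpuset(topology, use_hyperthreads=False))  # [0, 1]
```

That is, with use_hyperthreads false the siblings simply never appear in cpuset.cpus, rather than being "not counted" in some capacity calculation.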

Other than that the document looks good.


#3

Thanks for the comment. I have addressed it.


#4

Looks good. I have no further questions/comments.


#5

Would it make sense for exclude_hosts, run_only_on_hosts, and possibly exclude_vntypes to accept lists of regular expressions, rather than just individual names? This could be useful on large clusters, where similar hosts have similar names.
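For example, something along these lines (a sketch only; the parameter and function names are illustrative, not part of the design):

```python
import re

# Hypothetical sketch of regex-based exclusion, should the enhancement
# be adopted. Patterns are matched against the full hostname.

def host_excluded(hostname, exclude_patterns):
    """Return True if hostname fully matches any pattern in the list."""
    return any(re.fullmatch(p, hostname) for p in exclude_patterns)

exclude_hosts = [r"node0\d\d", r"gpu-rack[12]-.*"]
print(host_excluded("node042", exclude_hosts))  # True
print(host_excluded("login1", exclude_hosts))   # False
```

On a cluster with thousands of similarly named hosts, two or three patterns could replace a very long explicit list.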


#6

It would make sense for those items to accept regular expressions. I think that is an enhancement we should consider implementing in the future. Thank you for your suggestion. I have filed a ticket on your behalf here: https://pbspro.atlassian.net/browse/PP-678


#7

The document looks good, Mike! One thing: under ‘kill_timeout’, be sure to specify the time unit. I’m sure it’s in seconds.


#8

Good point @bayucan. I have updated the document.


#9

Mike, is this design intended to cover all eleven of the user stories under PP-325?


#10

  • the Cgroup Configuration File found here contains an example of a
    configuration file, but it’s not labeled as an example
  • if the global exclude_hosts value lists node001 and node002, why is
    it necessary to have entries in cgroup:cpuacct and cgroup:cpuset for
    them?
  • why is the cgroup hook run on every node assigned to the job since
    it’s certainly possible that one or more of the assigned nodes does
    not support cgroups?
  • if vnode_per_numa_node is true, how are the NUMA-node vnodes named?
  • how does the cpuset subsystem’s enabled parameter interact with the
    PBS pbs_mom.standard vs. pbs_mom.cpuset convention? Does/should the
    upgrade process deal with the convention transition? It seems wrong
    that the default would be false if installing on an existing PBS
    configuration that was previously using CPU sets

#11

@altair4: I attempted to address your first four comments by updating the document. I added a note in the “enabled” section of the cpuset subsystem attempting to explain the options administrators may choose from on SGI systems. The installation process does not perform any modification to the cgroup hook configuration file. If an existing configuration file is present, it will not be overwritten. Once the administrator alters the file, their changes will be preserved during future upgrades.

Thank you for your comments, and please let me know if the changes meet with your approval.


#12

@mkaro: Yes, I think you’ve covered my questions. Thanks.


#13

I have a couple of questions:

  • How does the use of cgroups affect the normal rlimits the mom places on the job?
  • What do you mean by vntypes? Do you mean the builtin vntype resource? You might want to make this more clear.
  • What is the difference between creating cpusets and using cgroup cpus/mems/memsw subsystems? If you create a cpuset for a job, you’re boxing the job into its own little world of memory and cpus. If you use the cpus/mems/memsw subsystems you create limits on what the job can use. This is essentially the same thing, right? Is there some reason you’d want to use both together?
  • Under the memsw subsystem, it says 0MB is the default. Is this correct? In the memory description, it said if memsw wasn’t provided, the memory and memsw limits would be the same. Also, it says physical memory. Shouldn’t that say virtual memory?
  • I’m wondering why Public/Experimental? This doesn’t mean it’s any easier to change than a Public/Stable interface. It just means you can add it in a patch release. If you’re looking for something easier to change, I think you have to go looking in the PBS Private area, and that doesn’t look like it applies here.

One last thing: From the last time I looked at the code, there were a ton of debug log messages. Since log messages are interfaces, they will need to be documented.


#14

Really nicely written – all designs should start with Overview and example – thanks @mkaro!

A few comments:

  • For both the memory and memsw subsystems, it’s not clear to me what “available physical memory” means, e.g., (1) is it the amount configured into the system (e.g., if I put in 64 GB of memory, is it 64 GB, or is it some lesser amount returned by the kernel)?, and (2) is the reserve_percent calculated before or after deducting reserve_amount? (Also, the last sentence defining reserve_amount is self-referential for both memory & memsw).

  • For nmics, ngpus, vmem, and hpmem, the design states the admin must manually add these to the resources line in the scheduler configuration file for this resource “to be considered”. I suggest appending “for scheduling”. Also, is there any way to make this automatic – seems if users are submitting jobs with “nmics”, for example, they will definitely want them to be scheduled. (Perhaps beyond the scope of this design…)

  • I’m not sure of the prevalence of using exclusions, but it feels overkill to have both exclude_hosts & exclude_vntypes: why not just have one exclude? Having whichever is more prevalent would certainly suffice (and put a slight additional burden on those admins that need the other type of exclusion).

  • Finally, a broader comment. I understand that a goal for core PBS Pro is to move all configuration into a single system, e.g., qmgr. This design goes the other way – adding configuration into a separate hook file. Having a separate hook config file for a non-core feature is a great idea (so people can work on extensions without worrying about core PBS Pro), but adding a totally new type of configuration methodology to core PBS Pro seems like moving the wrong direction. Not sure what others think about this…

Again, thanks!


#15

The cgroup limits do not affect the rlimits because they are independent mechanisms within the kernel. A process could be denied access to a resource if it violates either limit. The cgroup limits apply to groups of processes (i.e. all processes in a job) while the rlimits are per process.

Document updated.

Creating cpusets from within MoM via libcpuset is exactly the same thing as creating them via the cgroups hook. I added an Administrator Notes section at the bottom of the document and indicated that the cgroup hook should not be used in conjunction with the cpuset MoM.

Yes, 0MB is the default when no configuration file is present. The default in the configuration file packaged with PBS Pro is 256MB. You are correct about “physical” memory… it has been updated to “virtual” memory.

Agreed. Public/experimental seemed the best fit, which probably just confused anyone reading this who does not work for Altair.

Good grief, that would make this document about 100x longer! Not to mention the amount of time required.


#16

One of the comments from @bhroam identified that “physical” should have been “virtual” for memsw. Physical and virtual memory are obtained from /proc/meminfo as MemTotal and (MemTotal + SwapTotal), respectively.
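In code terms, the derivation is roughly as follows (a sketch only, using sample /proc/meminfo-style text rather than the live file; values in /proc/meminfo are reported in kB):

```python
# Sketch of deriving the two totals per the convention above:
# physical = MemTotal, virtual = MemTotal + SwapTotal.

def mem_totals(meminfo_text):
    """Parse /proc/meminfo-style text; returns (physical, virtual) in kB."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])
    physical = fields["MemTotal"]
    virtual = physical + fields.get("SwapTotal", 0)
    return physical, virtual

sample = "MemTotal:  65536000 kB\nSwapTotal: 2097152 kB\n"
print(mem_totals(sample))  # (65536000, 67633152)
```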

I updated the document to reflect that reserve_percent is calculated prior to reserve_amount to obtain the total amount reserved. I also corrected the self-reference issues.
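To make the ordering concrete, the arithmetic works out like this (an illustration of the percent-then-amount ordering only; the function name is mine, not the hook's):

```python
# Illustration: reserve_percent is applied to the total first, then
# reserve_amount is added, and jobs may use what remains. Units are
# whatever the totals are expressed in (e.g. MB).

def available_after_reserve(total, reserve_percent, reserve_amount):
    reserved = total * reserve_percent / 100.0 + reserve_amount
    return total - reserved

# 64000 MB total, reserving 5% (3200 MB) plus 1024 MB for the OS
print(available_after_reserve(64000, 5, 1024))  # 59776.0
```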

I added the suggested text. It would be considerably easier to update the scheduler’s resources list if it were accessible via qmgr. I’d have to agree that this is out of scope for this design, but a valid point nonetheless.

The ability to exclude vnode types was primarily targeted for Cray systems where vnode types are commonly defined. The ability to exclude individual hosts was to target individual nodes where the complex does not define vnode types.

I agree that a central repository for all things related to PBS Pro configuration would be of benefit. At the time the cgroups hook was first conceived, the hook configuration file seemed the best fit because the server would push the changes out to all of the MoMs in the complex when it was modified.


#17

@mkaro, thanks for updating the document.
One quick thing: vntype is a resource, not an attribute.

As for cpusets, I really was just curious about the differences between a cpuset and a cgroup limit and why one would want to use both. From my point of view, they look like they are providing the same service. Is it that a cpuset will allow you to hammer down the exact cpus and memory where a cgroup just limits the amount used?

I’m still confused about the sentence in the memory subsystem that says if there isn’t a virtual memory limit requested, the limit will be set to the physical memory limit. How does this mix with the default of 0MB/256MB? Does the sentence in the memory subsystem just take effect if the memory subsystem is enabled and memsw is not?

As for the log messages, I’d take a pass over them and remove the less important ones and document the rest. I remember some that were less useful than others. Either that or take this up with Ian.


#18

Bhroam’s advice is sound.


#19

Thanks @bhroam. I updated the document replacing “vntype attribute” with “vntype resource”.

The cpuset subsystem in cgroups is the very same thing that used to be referred to simply as cpusets. SGI provided a library that could be used to manipulate them, but that was more for convenience than anything else. Manually creating the directories and populating the files achieves the same result. The default location of the cpuset filesystem has changed over time, and a prefix is now applied to some of the files by default. It is possible to mount the cpuset filesystem with the “noprefix” flag to restore backward compatibility.
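By way of illustration, the "manually creating the directories and populating the files" approach looks roughly like this. The sketch below uses a temporary directory as a stand-in for the real cpuset mount point (commonly /sys/fs/cgroup/cpuset), so it only demonstrates the file layout; on a real mount the kernel interprets the writes. The prefixed file names (cpuset.cpus, cpuset.mems) are the kernel's default; a “noprefix” mount exposes them as plain cpus and mems.

```python
import os
import tempfile

# Stand-in for the cpuset mount point; path names below are illustrative.
mount = tempfile.mkdtemp()
job_dir = os.path.join(mount, "pbs_jobs", "1234.server")
os.makedirs(job_dir)

# Populate the cpuset files for the job
with open(os.path.join(job_dir, "cpuset.cpus"), "w") as f:
    f.write("0-3")   # CPUs assigned to the job
with open(os.path.join(job_dir, "cpuset.mems"), "w") as f:
    f.write("0")     # NUMA node(s) assigned to the job

with open(os.path.join(job_dir, "cpuset.cpus")) as f:
    print(f.read())  # 0-3
```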

If we set a cgroup limit of 1GB for mem and don’t set the memsw limit, then the kernel will allow the processes to allocate 1GB of RAM and unlimited swap. If we set a limit of 1GB for both mem and memsw, the kernel will allow the processes to access exactly 1GB of memory, regardless of whether it’s in RAM or swap. If we set a cgroup limit of 1GB for mem and 2GB for memsw, the kernel will allow the processes to allocate up to 2GB with no more than 1GB being RAM. Finally, if we set a cgroup limit of 2GB for mem and 1GB for memsw, the kernel will only allow the processes to access 1GB. In terms of a PBS Pro job, if you set your vmem limit lower than your mem limit you’ve made a mistake and the cgroups hook will reject the job. When a user submits a job specifying -lselect=1:mem=1gb we don’t want to grant them access to unlimited swap, but we do want them to have access to the full 1GB they requested, so memsw is set to the same as mem. Hopefully, that description helps.
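The rules above can be condensed into a few lines. This is a sketch of the logic as I've described it, not the hook's actual API (the function name is illustrative):

```python
# Sketch of the mem/memsw rules described above.

def cgroup_limits(mem, vmem=None):
    """Return (memory limit, memsw limit) for a job request, or raise
    if the request is inconsistent (vmem lower than mem)."""
    if vmem is None:
        return mem, mem   # no swap beyond the requested memory
    if vmem < mem:
        raise ValueError("vmem must not be lower than mem")
    return mem, vmem      # up to vmem total, at most mem in RAM

GB = 1024 ** 3
print(cgroup_limits(1 * GB))          # memsw defaults to mem
print(cgroup_limits(1 * GB, 2 * GB))  # up to 2GB total, 1GB in RAM
```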

I’ll add interfaces for relevant log messages.


#20

My question has more to do with the nature of a cpuset vs a cgroup limit. I understand that turning on the cpuset subsystem will create cpusets just like the pbs_mom.cpuset would. My question is what is the difference between a cpuset and the cpu/memory/memsw subsystems? Why would someone use one instead of the other, or would they use both at the same time?

Thank you for your explanation. It helped me understand the subsystems better. I still have a question about the default. If a job doesn’t request vmem, we set the memory/memsw limit to the same thing. My question is, if this is true, where does the default come into play? When do we use that default if we’re always setting the memory/memsw limit to the same thing?

Bhroam