PP-586: On a Cray X-series, create a vnode per compute node


#1

This message is to inform developers that there is a new design document available for review:

https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=43548691

Please provide your comments to the community.

Thanks.


#2

Hey Dennis,
I have a few suggestions to make your EDD more clear.

  • I’d move the bullet “New interface: new mom_priv configure variable” up to the top of the details section. Maybe even promote it out one bullet level. People do not know what kind of configuration variable it is until they reach this bullet, which leads to confusion.
  • Change “mom_priv” on the same line to “mom_priv/config”
  • What does quiesce the system mean? Be more clear on this. I suspect it means to turn scheduling off and requeue/kill all running jobs. You might give a hint that the dedicated time feature can help with this.
  • Change “Set the mom_priv configuration variable” to “Set the mom_priv/config variable”
  • Change “HUP the mom” to “HUP or restart the mom”
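
Taken together, the last two bullets amount to something like the following sketch (the variable name $alps_create_vnode_per_numa comes from a later post in this thread; the PBS_HOME path is an assumption):

```
# In $PBS_HOME/mom_priv/config (path is an assumption; often /var/spool/pbs):
$alps_create_vnode_per_numa false
```

followed by sending the MoM a HUP (or restarting it) so that it rereads the file.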

#3

Thanks Bhroam for the review. I have addressed your comments. Please take a look at the updated EDD.


#4

Two thoughts:

  1. There has been a general drive to use the qmgr interface for configuration, with the goal to move everything into qmgr, eventually. Rather than implement this twice (once in MOM’s config file and then later within qmgr), it would be ideal to put it in qmgr from the start.

  2. If this type of configuration has uses outside of a Cray X* system, I suggest making the name generic (using neither “Cray” nor “ALPS” in the keyword itself) so that the setting can (at some future time) also be applied to other hardware. For example, is there a useful mapping onto generic big-memory cluster nodes, or Intel MIC, or something else?


#5

I agree that putting all configuration variables in qmgr is the right thing to do, but the moms are going to be a particularly difficult problem. For pbs.conf variables, there is a server object. For scheduler variables, there is a sched object. There is no mom object. There are only vnodes where multiple vnodes can be owned by one mom. Until we create some sort of object where we can put mom variables, we’re stuck with the current mom_priv/config file.

Even if we wanted to create a mom object right now, it would greatly scope creep this feature. As an example, we’d have to come up with some sort of new communication pathway between the server and the mom. The scheduler handles this communication by querying the sched object every cycle. The mom has a persistent datastore. The server would have to communicate to the mom in some new fashion every time a variable changed.

So I agree, putting everything into qmgr is a wonderful goal. I just don’t see putting mom variables in qmgr as a viable option right now.


#6

Dennis, I just have one last comment.
Near the bottom you say to set the variable as desired per mom. Does this mean you can set the variables to different settings on each mom? I thought all the moms reported the same set of vnodes. If one mom reported vnodes per numa nodes and another mom reported vnodes per compute nodes, wouldn’t the server get confused?


#7

I don’t think the server would be confused (as in do the wrong thing). But it sure would be confusing, since it would likely mean that mom1, which has $alps_create_vnode_per_numa set to false, and mom2, which doesn’t have it set (so it defaults to true), would cause the following to be created:
vnode1 (Mom=mom1)
vnode1_0 (Mom=mom2)
vnode1_1 (Mom=mom2)

And it would defeat the purpose of reducing the number of vnodes reported. :)
It would be best if we added some text to say set the same value on all moms.
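
Concretely, the mismatch described above would correspond to the two MoMs’ mom_priv/config files differing like this (a sketch; everything else in the files is omitted):

```
# mom1’s mom_priv/config — one vnode per compute node
$alps_create_vnode_per_numa false

# mom2’s mom_priv/config — the line is absent, so the default (true) applies
```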


#8

Hi Bhroam/Lisa, I’ve updated the text to say ‘all Moms’ instead of ‘each Mom’.


#9

Hi Bill,

Regarding point number 1, I’ll simply defer to Bhroam’s earlier response on this subject.

Regarding 2, I’m not aware of any uses this setting has outside of Cray systems. If the setting becomes applicable to other hardware at some future time, I think that should be addressed at that time and not as part of this feature.

My assumption, as far as the scope of the Cray merge effort is concerned, has been that the functionality present in the Cray branch will be transferred as-is to OSS master (of course, there may be some code refactoring, bug fixes, etc. along the way to ensure working, quality code is merged into master). As far as new functionality is concerned, I assumed that such work would be done subsequently, in order to allow the current merge effort to complete in a timely manner.

Dennis.


#10

FYI, there is a similar, non-Cray-specific configuration parameter for cgroups called “vnode_per_numa_node”, but of course that controls the behavior of a hook and is set in the hook’s config file.
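
For reference, that cgroups hook parameter lives in the hook’s JSON configuration file; a minimal sketch (all other keys omitted, and defaults may differ by version) would be:

```
{
    "vnode_per_numa_node": true
}
```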


#11

Dennis,
I might be misunderstanding what you are saying here, but I think I have to disagree. The process followed to merge RFEs into master follows a set process. This includes design reviews that are not constrained to just keeping the same functionality as before. I understand the desire to keep things simple and limited in scope. That needs to be done here as well, but I just don’t think it is a given that no extra functionality will be added to RFEs being merged from the cray branch.

Bhroam


#12

Hi Bhroam,
My comment was primarily based on my experience with the Cray branch so far. During this time, the process (to the best of my knowledge) has been followed.

My comment regarding ‘new functionality’ could have been clearer. I wasn’t assuming that absolutely no extra functionality should be added to existing features during the merge effort. For instance, if a reviewer felt that merging a Cray feature as-is to OSS master would compromise or break existing OSS functionality, then that certainly needs to be addressed somehow. But ‘brand new’ functionality that would be nice to have, or a ‘better’ design of an existing working feature, could be done subsequently (i.e., tickets could be filed and prioritized appropriately to be done after the merge).
In other words, the scope of the review would be a ‘sanity check’ to ensure that merging Cray code to OSS master does not break anything, i.e., that the end result of the merge is working code.

The estimation done by our team has been with regards to ‘as-is’ merge of Cray branch features into OSS. If minor/major redesign of existing features is recommended during review, there will obviously be an additional cost in terms of time/effort, which may very well be needed/acceptable.

The rest of the community may agree/disagree. Sooner rather than later a consensus should emerge regarding the scope of the Cray merge effort.

Dennis.


#13

Perhaps a question that should have been asked earlier…

Do we need the *_create_vnode_per_numa setting at all? Could we simply have PBS Pro always create one vnode per node, with no option to create one vnode per numa node? That’s the configuration that Cray sites have been requesting. Who, if anyone, is actually depending on “one vnode per numa node” functionality, and what is their use case? If there is not enough call for the capability of one vnode per numa node, then I strongly suggest we do not support it with a setting.

(If we do keep such a setting, we should ensure that we are not duplicating similar functionality with two similar settings, re: @scc’s note above about the Cgroup hook setting ‘vnode_per_numa_node’.)


#14

The ability to have one vnode per numa node is used by SGI UV systems. It is also used by sites that want the scheduler to ensure that jobs are not split across numa nodes (e.g., a GPU on numa node 1 being assigned the cores and memory of numa node 0) if a given chunk could fit on a single numa node. Using cgroups, sites can now enforce this.

In the cgroup hooks, we added the single-vnode-per-numa-node capability. We did this using an exechost_startup hook, which allowed us to partition nodes for sites that wanted this capability. In the long run it would make sense to split this functionality out of the cgroups hook into a separate hook.


#15

Interesting question @billnitzberg. If the following are not common use cases, then perhaps we can change the behavior of PBS to only create one vnode per node:

  • Specify how many NUMA nodes per compute node to allocate (aprun -sn)
  • Specify the specific NUMA nodes to use on the assigned compute nodes (aprun -sl)
  • Specify how many cpus per NUMA node to use (aprun -S)
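
Illustratively, those options would appear on aprun command lines roughly as follows (the commands are sketches, not taken from any site configuration):

```
aprun -sn 1  -n 16 ./app   # allocate 1 NUMA node per compute node
aprun -sl 0,1 -n 16 ./app  # use only NUMA nodes 0 and 1 on the assigned compute nodes
aprun -S 8   -n 16 ./app   # use 8 cpus per NUMA node
```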

We would like to hear from our users on Cray machines.


#16

That is also because of current ALPS limitations. If ALPS actually allowed people to share a node between different jobs, each potentially using less than a socket, then yes, the functionality would be useful, just as it is crucial on large SGI machines (and even ICE nodes).

Right now, most Cray sites use MAMU nodes for workloads smaller than a node (and even use cgroups on them!), and the regular compute nodes are reserved for highly parallel workloads. That may change in the future.

We shouldn’t chop off code that isn’t useful now just for the sake of it. That’s over-adaptation, which in the long run can actually be harmful to the survival of the species.

I’d vote to make “one vnode per compute node” the default, though. That is not a principled stance: having one vnode per socket is more expressive and richer, thus “better”, but it changes the scheduler’s behaviour (as soon as one node in a complex has in_multivnode_host set, the algorithm for placing chunks changes!) and it makes the scheduler much less scalable for large sites (and some operations are much worse than O(#nodes)).


#17

One way to avoid different settings at different moms (if we even need a switch) could be to have the switch at a much more global level, like, say, the server. That would then also be achievable via qmgr.
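
If the switch lived on the server, it might be set once via qmgr along these lines (the attribute name here is purely hypothetical — no such attribute exists today):

```
# Hypothetical server-level attribute, pushed to all MoMs by the server
qmgr -c "set server create_vnode_per_numa = false"
```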


#18

We have relied on one vnode per NUMA node in our clusters for going on 5 years now, and it’s absolutely imperative to giving us good performance on shared cluster nodes. We’re using SGI CPUsets now, but anticipate using cgroups in the future, and it’s important to us that we be able to continue the practice of using a vnode per NUMA node.

Gabe


#19

Thanks @gabe.

This RFE is specifically targeting how to represent nodes on Cray X-series systems. (@djohnpbs – perhaps the Summary could be changed to include “Cray X-series” to make this clear here and in JIRA?)

This RFE was not intended to change functionality on generic Linux nodes nor for SGI UV – PBS Pro should continue to support one vnode per NUMA node on generic Linux nodes and SGI UV. (On UV in particular, this allows the PBS Scheduler to leverage placement sets for better, and more predictable, application performance.)

From the discussion so far, it is clear that the default (on Cray X-series) should be “one vnode per compute node” (and not “one vnode per NUMA node”).

Whether there are any actual users of PBS Pro that depend on “one vnode per NUMA node” on Cray X-series is still an open question in my mind, as nobody has put forward even one case of anyone depending on this configuration, nor any examples of capabilities that could not also be achieved under the “one vnode per compute node” configuration.

If there is a compelling reason to support “one vnode per NUMA node” configurations on Cray X-series, then I propose not inventing a new name, but using the same name as the Cgroups hook for the setting: “vnode_per_numa_node”. (It’s bad design to have two different names for the same thing.)


#20

@bhroam
There is another way that such data might get from server to MoM: the server sends some data via the (multiple) IS_CLUSTER_ADDRS interserver request messages. It’s pretty ugly, but have a look at the is_request() function.

@lisa-altair
This is another way that an interserver request (which would push the same value to all the MoMs) would be less problematic than using individual mom_priv/config file settings.