PP-586: On a Cray X-series, create a vnode per compute node


#21

Quibbles:

  • values used in setting mom_priv/config variables are not quoted - TRUE, not "TRUE", etc.
  • is it $vnode_per_numa or $vnode_per_numa_node? Both occur twice.

#22

Based on the feedback, I have updated the purpose of the feature including updating the summary of the JIRA ticket. I have updated the external design. Please have a look at the external design: https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=43548691

Provide your comments in this forum.

Thanks!


#23

Thanks for pointing it out @altair4. I have removed the quotes around TRUE and FALSE, and I have updated the design to use $vnode_per_numa_node consistently (hopefully).
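To make the unquoted form concrete, here is a minimal Python sketch of how a `$vnode_per_numa_node` line in mom_priv/config could be read. It is not the MoM's real parser: the function name `read_vnode_per_numa_node` and the choice to reject anything other than a bare TRUE or FALSE are assumptions made for illustration. The only behavior taken from this thread is that values are written without quotes and that unset is treated like FALSE.

```python
def read_vnode_per_numa_node(config_text):
    """Return the value of $vnode_per_numa_node, or False if it is unset.

    Hypothetical helper for illustration only; not PBS source code.
    """
    value = False  # unset is treated the same as FALSE
    for raw_line in config_text.splitlines():
        parts = raw_line.split(None, 1)
        if not parts or parts[0] != "$vnode_per_numa_node":
            continue  # blank line or some other directive
        arg = parts[1].strip() if len(parts) > 1 else ""
        if arg == "TRUE":
            value = True
        elif arg == "FALSE":
            value = False
        else:
            # Assumption of this sketch: a quoted value such as "TRUE" is an error.
            raise ValueError(f"expected bare TRUE or FALSE, got {arg!r}")
    return value


example_config = "$clienthost server1\n$vnode_per_numa_node TRUE\n"
print(read_vnode_per_numa_node(example_config))
```

With this sketch, a line written as `$vnode_per_numa_node "TRUE"` raises a ValueError instead of being silently accepted.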


#24

I think we’d want all moms to have the same vnode_per_numa_node setting. Shouldn’t there be a corresponding statement under “Administrator’s instructions”?


#25

@vccardenas, I removed it because I didn’t want it documented in our commercial documentation guides. But now that I think about it, I should have something to that effect in the interface sections. Thanks!


#26

Having one vnode_per_numa_node setting across an entire cluster might be too restrictive in some cases. If an administrator is using PBS Pro to schedule across two independent clusters of nodes, they may want different settings for each cluster. This could apply to both Cray and non-Cray systems. It should really come down to the way mom reports the resources. That implies it should be a per-mom setting.


#27

Interesting point @mkaro. The need for the cgroups hook to have vnode_per_numa_node set to different values per mom conflicts with the need for Cray X-series vnode creation to have the mom_priv/config vnode_per_numa_node set to the same value on all the moms within a single Cray system.
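The two requirements can coexist if consistency is only enforced within a single Cray system. A minimal Python sketch of such a check follows. The function name, the tuple layout, and the example host and system names are all hypothetical, and nothing like this exists in PBS as far as this thread states.

```python
from collections import defaultdict


def find_inconsistent_systems(moms):
    """Return the Cray systems whose moms disagree on vnode_per_numa_node.

    moms: iterable of (mom_name, cray_system, setting) tuples, where
    cray_system is None for moms that are not part of a Cray system.
    Hypothetical helper for illustration only.
    """
    by_system = defaultdict(dict)
    for mom_name, cray_system, setting in moms:
        if cray_system is None:
            continue  # non-Cray moms are free to choose their own value
        by_system[cray_system][mom_name] = setting
    # A system is inconsistent when its moms hold more than one distinct value.
    return {
        system: settings
        for system, settings in by_system.items()
        if len(set(settings.values())) > 1
    }


moms = [
    ("login1", "crayA", True),
    ("login2", "crayA", False),  # disagrees with login1 on the same system
    ("login3", "crayB", False),
    ("linux1", None, True),      # not checked against any Cray system
]
print(find_inconsistent_systems(moms))
```

Here crayA is flagged because its two moms differ, while crayB and the non-Cray mom are left alone even though their values differ from crayA's.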


#28

Agreed. I think that if we were to generalize this RFE to apply to more than only Cray X-series systems, it would make the most sense to have one vnode_per_numa_node be configurable at the host level rather than at the MOM level. I feel MOM is too PBS-implementation specific; in addition, one MOM can represent many hosts (and vnodes), and one host (or vnode) can be served by multiple MOMs.

Whether it makes sense to expand this RFE to include something like a per-host setting of one vnode_per_numa_node now versus later depends on available resources and on whether it is possible to design a solution for the Cray X-series now that is also extensible (or at least sensibly deprecatable) in the future.


#29

I agree with @bhroam on both points. I couldn’t find an existing RFE for doing configuration using qmgr, so I filed PP-611 to track that feature request.


#30

We currently have multiple sites that are running a combination of multiple Cray systems and either non-Cray or Cray MAMU nodes as part of a single PBS complex. It’s quite conceivable that a site might want different settings for this feature on different MOMs hanging off a single server. Would putting this in qmgr make that more difficult (or even impossible)?


#31

Those are interesting questions that will have to be kept in mind when PP-611 is being designed. Please add your use cases to PP-611 so they are considered for that RFE.
This discussion is about the external design for PP-586, available at:
https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=43548691
That design adds an option in mom_priv/config that causes one vnode to be created per NUMA node for Cray X-series compute nodes.


#32

Just a nit: “Cray numa” should be “Cray NUMA”. Other than this the EDD looks good to me.


#33

Question: This config attribute does not have any effect on how MAMU nodes are reported, does it? I believe it is only useful for MoMs on login nodes that get the compute node information from ALPS.


#34

Thanks for bringing it up @smgoosen. The configuration attribute only has an effect on the nodes reported by ALPS. I added some clarification to the design on this. PBS treats MAMU nodes like standard Linux nodes, so $vnode_per_numa_node has no effect on how MAMU nodes are reported within PBS.


#35

I seem to remember that MAMU nodes are often former ALPS nodes, that is, they have just been removed from ALPS’s control. What happens if vnode_per_numa_node is still set on a MAMU or non-Cray node (i.e. one that can’t talk to ALPS)? Is there some error that gets reported, or is it just silently ignored?


#36

$vnode_per_numa_node has no effect if there is no ALPS information to act on. There is no log message either. Do you think there should be a log message when $vnode_per_numa_node is set, but there is no ALPS information to act on?
I should also mention that this feature is available only when PBS is built with configure --enable-alps. I will add that to the external design.
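To frame the log-message question, here is a minimal Python sketch of the current behavior next to the possible alternative. Everything in it is hypothetical: the function name, the `warn_if_ignored` flag, and the message text are invented for illustration and do not describe existing PBS code or log output.

```python
def setting_takes_effect(vnode_per_numa_node, alps_nodes, warn_if_ignored=False):
    """Return (takes_effect, message) for a MoM's $vnode_per_numa_node value.

    alps_nodes: the compute nodes reported by ALPS (empty on a MAMU or
    non-Cray host). Hypothetical helper for illustration only.
    """
    if alps_nodes:
        # There is ALPS information to act on, so the setting applies as given.
        return vnode_per_numa_node, None
    message = None
    if vnode_per_numa_node and warn_if_ignored:
        # Proposed behavior only; today the setting is ignored without a message.
        message = "$vnode_per_numa_node is set but there is no ALPS information to act on"
    return False, message


# Current behavior: silently ignored when there is no ALPS inventory.
print(setting_takes_effect(True, []))
# Possible alternative under discussion: same outcome, plus a message to log.
print(setting_takes_effect(True, [], warn_if_ignored=True))
```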


#37

Regarding the statement below:
“resources_available.PBScrayseg will be set to 0 when vnode_per_numa_node is unset or set to FALSE”

Do we really need to set PBScrayseg to 0, since the whole compute node is the vnode, not just NUMA ordinal 0? It seems that it should be unset.

This is not currently mentioned in the EDD, but are there other PBScray* attributes that need to exist or not exist depending on the setting of vnode_per_numa_node?


#38

The original use case for PBScrayseg was to allow users to request a specific segment on each compute node. I believe if there is only one vnode then it isn’t necessary to set PBScrayseg.

The EDD looks good to me!


#39

I agree with @vccardenas and @smgoosen about not needing to set PBScrayseg when vnode_per_numa_node is unset or set to FALSE. I have updated the external design accordingly.
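As a rough picture of the outcome, here is a minimal Python sketch of the two modes. The inventory layout, the function name `build_vnodes`, the host name `cray1`, and the vnode naming pattern are assumptions for illustration and are not taken from the design document. The only point it is meant to show is that PBScrayseg is set when there is one vnode per NUMA node and left unset when the whole compute node is a single vnode.

```python
def build_vnodes(host, alps_nodes, vnode_per_numa_node=False):
    """Map vnode names to their resources_available values.

    alps_nodes: list of dicts such as {"nid": 8, "segments": 2}, standing in
    for the compute nodes reported by ALPS. Hypothetical helper for
    illustration only; the naming scheme is an assumption of this sketch.
    """
    vnodes = {}
    for node in alps_nodes:
        nid = node["nid"]
        if vnode_per_numa_node:
            # One vnode per NUMA node; each records its segment ordinal.
            for seg in range(node["segments"]):
                vnodes[f"{host}_{nid}_{seg}"] = {"PBScrayseg": seg}
        else:
            # One vnode for the whole compute node; PBScrayseg stays unset.
            vnodes[f"{host}_{nid}"] = {}
    return vnodes


inventory = [{"nid": 8, "segments": 2}]
print(build_vnodes("cray1", inventory, vnode_per_numa_node=True))
print(build_vnodes("cray1", inventory, vnode_per_numa_node=False))
```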

@vccardenas I don’t think so. Please let me know if you are concerned about any attributes in particular.


#40

I still think the EDD is OK