So let me see if I understand the logic here… two MoMs, moma and momb, running on different login nodes on a Cray system, sharing the same set of compute nodes (cn1, cn2, cn3, …). A hook is configured on both MoMs with offline_vnodes enabled for failure. The MoMs know they are running on a Cray. The hook fails on moma, and moma is marked offline. Because the MoM is running on a Cray, it only marks itself offline, not the vnodes. The other MoM and the compute nodes are still available. Now the hook fails on momb, and momb is marked offline. Again, the MoM knows it’s running on a Cray and only marks itself offline, not the vnodes. The server would then recognize that ALL MoMs providing access to the compute nodes are offline and proceed to mark the compute nodes (vnodes) offline.
Do I have that right?
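
To make sure we mean the same thing, here is the rule as I understand it, written as a toy Python model. This is only a sketch of my reading of the behavior, not PBS code: `vnode_states` and the "free"/"offline" strings are names I made up for illustration, and the real server logic may differ.

```python
# Toy model of the behavior described above. None of these names are PBS APIs;
# they are hypothetical and exist only to illustrate the rule.

def vnode_states(moms_for_vnode, offline_moms):
    """Mark a vnode offline only when every MoM that serves it is offline."""
    states = {}
    for vnode, moms in moms_for_vnode.items():
        all_moms_offline = all(mom in offline_moms for mom in moms)
        states[vnode] = "offline" if all_moms_offline else "free"
    return states


# Both MoMs share the same compute nodes, as on the Cray in question.
compute_nodes = {cn: ["moma", "momb"] for cn in ("cn1", "cn2", "cn3")}

# Hook fails on moma only: momb still provides access to the compute nodes.
after_first_failure = vnode_states(compute_nodes, {"moma"})

# Hook then fails on momb too: no MoM is left for any compute node.
after_second_failure = vnode_states(compute_nodes, {"moma", "momb"})

print(after_first_failure)
print(after_second_failure)
```

In this model the compute nodes stay available after the first failure and only go offline once both moma and momb are offline, which is the sequence I am asking about.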