External design document for PP-824: Cray - Ramp rate limiting


#41

Hey Mike, I haven’t read the EDD and I don’t understand the feature; I was just curious about your last comment. You said that the MoM will not be in communication with the server until something wakes it up. Who exactly will wake it up? It’s not talking to the server, so will it be the scheduler that wakes it up? Or will it require a stimulus from the client? Or will it be somebody else?


#42

@agrawalravi90 In this particular context, the MoM is no longer in communication because PBS has put it to sleep or powered it down. So PBS knows the node’s current state and also knows that it can bring the node up whenever it needs to. For this feature, a periodic hook wakes it up, with some help from the scheduler.


#43

Thanks for clarifying, Ashwath, but I’m still a bit confused. The server periodic hook being able to “wake” the MoM up implies that the MoM is still listening to messages from the server, right? Maybe I’m just confused about the terminology. Is it that the MoM will reject any messages from the server except the “wake up” message that the server periodic hook sends it, at which point it will resume all communication again? Or is the MoM process just killed when the node goes to “sleep”, and “waking” a node up makes the cluster manager start a new MoM process, at which point it will just resume all communication again?


#44

I did not include technical details in my previous reply. Cray provided us with a command-line power management tool that can power a compute node up or down. We have written a few APIs to communicate with this tool. When I said PBS puts a node down, I meant that the hook talks to this tool to power down the nodes, and it can bring them back up when needed in a similar fashion. Think of how we do OS provisioning: the provisioning hook is responsible for rebooting nodes with different images.
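
To make that a bit more concrete, here is a rough sketch of what such a wrapper could look like. It is not the actual hook code: the tool name `power_tool`, its argument layout, and the helper names are all placeholders made up for illustration, since the real Cray command and its options are not spelled out in this thread.

```python
import subprocess
from typing import Callable, List, Sequence

# Placeholder name: the real Cray power management CLI is not named here.
POWER_TOOL = "power_tool"


def build_power_command(action: str, nodes: Sequence[str]) -> List[str]:
    """Build the argument list for a power up/down request.

    The "<tool> <action> <comma-separated nodes>" layout is an assumption.
    """
    if action not in ("up", "down"):
        raise ValueError("action must be 'up' or 'down'")
    if not nodes:
        raise ValueError("at least one node is required")
    return [POWER_TOOL, action, ",".join(nodes)]


def set_node_power(action: str, nodes: Sequence[str],
                   runner: Callable = subprocess.run) -> bool:
    """Ask the external tool to change node power state.

    Returns True when the tool exits with status 0. The runner is
    injectable so the logic can be exercised without the real tool.
    """
    cmd = build_power_command(action, nodes)
    result = runner(cmd, check=False)
    return result.returncode == 0
```

A periodic hook could call something like `set_node_power("up", [...])` for nodes it wants back. The point is only that the request goes to the external tool, not to the MoM on the sleeping node.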


#45

I like sleep much better than something like inactive, which could have dual meanings and be confusing. Sleep (or asleep) is clear: it is a state where it is not communicating with PBS, so it covers powered down, sleep states, etc.


#46

Got it, thanks for clarifying.