External design document for PP-824: Cray - Ramp rate limiting


#21

Have a look now @visheshh


#22

Thanks @ashwathraop , Changes look good.


#23

I have updated the EDD to make existing provisioning hook event to have multiple hooks. Please have a look and let me know if something needs to be changed.


#24

A couple questions/suggestions regarding the updated design (v.21):

  1. It’s not clear how all the individual pieces of the design fit together. Would it be possible to describe the overall design and add some examples in addition to simply listing the individual interfaces? Better would be an overall design description (with examples) that combined ramp rate limiting, power provisioning, aoe, and new KNL work. (Feel free to point at it is it exists already – I might have missed it.)

  2. The design mixes elements of a built-in feature with elements of an add-on feature (hooks), but without gaining much flexibility from leveraging the hooks interface. I suggest either going all add-on or all built-in. Going all add-on would mean replacing all the built-in new resources and attributes with custom resources that the admin would define as part of enabling this capability (and, if you want to be nice, a script that an admin can use to create them as well). Going all built-in would mean implementing the ramp rate limiting in internal code – this could even be done as a PBS hook, but the hook would not be described in the “external” design, as it would be internal; it could even be done directly in C or in Python without using the hook interface at all. It just seems like the design could be made simpler…

  3. The idea to not create a new provision hook event is a good one. So, removing the new “power_provision” hook event and using the existing “provision” hook event with multiple hooks is great. However, adding the new variable (“pbs.event().prov_type”) is really equivalent to adding the new hook event “power_provision”, so there is no simplification gained. In fact, for the hook writer, it’s pretty complicated to have to have a case statement on prov_type, then either access vnode_list or are and vnode, but not both. If the design is going to re-use the provision hook event, it should be done in a way that doesn’t require a switch/case statement inside that is the equivalent of adding a new hook event.

  4. Some keywords are not defined in the v.21 design: power_provisioning (perhaps a leftover from a prior version?), $max_ramprate_limit (drop the “$”?), $freq (is this the hook frequency?).

  5. Regarding the new node states – are there any backward compatibly concerns with adding these new states? Many customers already have tooling that processes the existing node states – will these new states break existing tooling? Perhaps these could be node sub-states (if PBS Pro has node substates – sorry, I don’t recall whether nodes have substates or only jobs have them.)?

Thanks!


#25

@ashwathraop I feel reusing the existing provisioning hook is great. I agree with Bill that introducing the prov_type is equivalent to adding a new hook. Like i mentioned in a different forum discussion it would mean different section of PBS code will call this hook with different prov_type - which defeats the purpose of making it generic.

I feel at some point of time we should make the provisioning hook a proper PBS hook in that that it should support multiple hook scripts. That way, if we supply a “PBS” hook packaged with PBS, it won’t need to be merged (code-wise) with a site implemented hook that might have already existed.

However, I am not sure how many current customers might face this problem. If it is only a couple then we can just help them update their hook scripts, and implement this in the future based on customer feedback. If there are more we should implement the multiple hook scripts right away.


#26

Hey Bill, thank you for looking into this.

For this project we have a brief description at UCR page here. I like your idea of connecting the features and coming up with a design description. I’ll work with @dilip-krishnan and have a page setup which will have brief design description of power work, provisioning and knl work. With this we will have a single location where all these features are connected and explained.

The current implementation and design of power provisioning and ramp rate feature follows built-in model. We have some part of the code at PBS core (mostly decision making) and action level as part of PBS hook. This is the main reason for making provsion event to support multiple hooks so that we can ship power work as PBS hook and site can put their need on a site hook.

Sounds fine, restored the interface to have only vnode and aoe value. If we are extending aoe capabilities, it should serve all oe requirements in future and we will still have a simple and minimal inputs to the hook.

Yes, they are from previous revision. Fixed them now.

Introduction of these new states wont break backward compatibility. And these states will be seen only when feature is enabled so tools monitoring the node states shouldn’t get affected when the feature is off. Even when the feature is ON new states will appear when the cluster idle or we provision some of the idle nodes to run the jobs. I need more understanding on how these tools monitor the node states. Also we don’t have node sub states, its only jobs that support them.

Thanks.


#27

I have a few minor comments (I know I’m coming to the table late). Don’t worry, they’re pretty minor.

  1. The provisioning feature has an attribute to say it’s enabled, but it is set by the server itself. It is enabled when the provisioning hook is enabled. Ramp rate will have an attribute which is set by the admin. This isn’t consistent with each other. I’d either make it consistent with the provisioning feature, or make the provisioning feature consistent with the current ramp rate EDD.
  2. Speaking of the current provisioning enable attribute, since it gets enabled when the hook is enabled, what happens now that there are multiple hooks? Will it get enabled when one is enabled? What if that hook is a power provisioning hook and not a traditional aoe hook?
  3. You list several stable log messages that contain data to an external command we don’t have any control over. There is nothing saying Cray can’t completely change that command and now we’re stuck. We have a stable interface that needs to change. I’d make them unstable.
  4. Enabling an attribute has an existing log message. You list it as a new interface. I’d remove that log message.
  5. Can you explain to me the difference between the power node states asleep and powered-off? Is one a low power state where the other is truly powered off?

Bhroam


#28

Thank you for your comments Bhroam.

Yes, that is a good idea. I am changing the design and this is one of the thing in those changes. We will not have server attribute but enabling the hook itself should enable the feature too.

Even having one provision hook enabled should keep the provision attribute enabled. Since power provision will also use same infrastructure as aoe hook, it should be okay to keep provision attribute enabled.

Okay. Makes sense.

Removed.

Yes. State powered-off as its name states is quite straight forward, you just shutdown the node. But asleep is something where node is idle or sleeping but when a job lands on the node, it can start working on it immediately. This could cause a power spike if the job involves a high cpu intensity work. To avoid this Cray designed multiple sleep states which can be stepped into. As you move over from complete idle or higher sleep state to lower sleep state energy/power consumption also will increase. With ramp rate limiting you step through these available sleep states and achieve a smoother energy consumption curve.


#29

After revisiting the design I have updated the EDD. If compared to previous version of the document you’ll see lot of differences. Mainly server attributes are moved to json hook configuration file. We discarded the idea of using provisioning for ramping up the nodes. Both ramp up and down will be done at server periodic hook.

Other change that can be seen is inclusion of power on/off feature as same design would serve when we need power nodes on/off based the cluster work load.

Here is the link to EDD: https://pbspro.atlassian.net/wiki/spaces/PD/pages/70123530/External+Interface+Design+for+Ramp+Rate+Limiting+and+power+on+off+nodes.

Please have a look at the updated design and let me know how it looks.


#30

Hi @ashwathraop,

  1. Please add power on/off related log messages too .
  2. There is a typo at qmgr -c “set node <node_name> poweroff_eligible=True”
  3. add the following message to last_state_change_time also
    Node status command pbsnodes will convert internal date format (seconds since epoch) to human readable format and display the value of this attribute in “MON DD YY HH:MM:SS” format.

#31

Addressed your comments.


#32

Hi @ashwathraop, I have some minor comments -

  • The JSON example in interface 2 and the variable names in the table do not match.
  • Interface 5, Details section, 6th bullet has a typo in the work ‘help’
  • Does Last_state_change_time also gets updated when a node changes state from say free to job-busy? or is it only applicable for power states (like sleep)?
  • I could not establish the use of last_state_change_time attribute’s use in the design. Can you please elaborate a little. It is also uncertain if node_idle_limit is applicable to last_state_change_time or last_used_time.
  • Requirements talk about minimum C state (C-1) and sleep state (C-6) but the design does not mention in which state is the hook going to put the node in. Can you please add that information.

#33

Thanks @arungrover for the feedback. I have addressed your comments.

Regarding last_state_change_time, yes it does change when nodes from free to job_busy too or rather on any state change it will get updated. Earlier we named it powered_off_time and it was very specific to power on/off feature. In this context we will refer this value to see when the node went to “sleep” state and see if it is okay to bring it back to available or free state. I felt this is very much an internal design so did not mention the same on EDD.

node_idle_limit refers to last_used_time as we want to identify how long node has been idle for.

Hope this makes sense. Let me know if you need further details on this.


#34

@ashwathraop Thanks for explanation it makes sense now. I agree that the details are related to internal design but the behavior is very much external and it sometimes becomes difficult to create a holistic view by just looking at interfaces.

I have no more comments on design.


#35

@ashwathraop Overall it looks good. I have a few recommendations for you to consider. In interface 2, I would change min_power_down_delay to min_node_down_delay since this is referring to how long a node should be down before powering back up. I would also change max_concurrent_power_limit to either max_concurrent_nodes since this is the number of nodes that can be operated on at a time.

In interface 3, you introduce a new node attribute called poweroff_eligible. Does it make sense to call it power_mgmt_eligible? This could then be used by the administrator to disable a node from power on/off and power ramp up/down.

In interface 6: you introduce a new node state called sleep. Yet the node is not sleeping but in a ineligible state to run jobs. I would suggest that we call this new state something like inactive, ineligible, or power-offline?


#36

@ashwathraop A few questions/suggestions:

  • Is it still an error to have both ramp rate limiting and power on/off enabled?
  • Interface 1 doesn’t say what type of hook PBS_power is (though I see later in Interface 6 you mention it is a server periodic hook)
  • I had the same question as Arun re: last_state_change_time. Since it is visible perhaps a little more explanation that it changes for PBS state changes and not, for example, for sleep state (e.g. C1 -> C2) changes.
  • Re: Jon’s suggestion for Interface 3 - I disagree that it should be generic, ramp rate limiting is a system wide setting, having nodes able to swing from c6 to c1 to c6 kind of defeats the purpose. I would like to see the attribute stay power off specific. Thoughts?
  • Question about your statement “Scheduler can consider the nodes in sleep state to run jobs now.” Does that mean the ramp up/power up would be considered as part of the jobs walltime or does it mean that job would be assigned the nodes but not actually be put into execution until all nodes are “free”? This also goes to Jon’s suggestion to change state name, are they really ineligible or eligible for scheduling but “not quite ready”

#37

Thanks Jon for the comments.

Fixed the edd with new names.

As Sam said we might want to keep ramp rate at cluster level as we are trying to control power spikes and I feel it would not make sense to not have it enabled on few nodes.

We came up with sleep as it felt like generic enough for sleeping or power-down nodes. But name “inactive” seems okay to me.


#38

Thanks Sam for the feedback.

Power off will take precedence over ramp rate if both enabled. To keep things simple we can allow only one of them at a time. Thoughts?

Its going to be an extension to existing power hook which uses almost every mom hook event available. But for this feature its server periodic hook only. I’ll mention it.

Added a note under Interface 2.

I have same thoughts.

Scheduler assumes nodes will be available in future and adds them to jobs exec vnode list. Hook at next iteration will look at jobs exec vnode list and estimated start time of the job to decide if it needs to bring up any nodes. So basically they are ineligible to run jobs before hook brings them up. So we can either call this state “inactive” or “ineligible”. What would you suggest?


#39

For now I’ll keep the state name as “sleep” and see we can have further discussion on this from other reviewers.


#40

@jon: The architecture team suggested the name “sleep” as a generic term for “power ineligible”. One of the main differences between “sleep” and other states is that the MoM is no longer in communication with the server until something wakes it up, but the server still knows of its existence.