PP-1006: Introduce new provisioning capabilities


#1

Hi All,

I have posted a UCR for the introduction of a new provisioning attribute. Please review it and provide feedback.

Note the title has changed to: Introduce new provisioning capabilities

Thanks,
smgoosen


#2

Thanks for posting this. A few questions. How many ioe provisioning hooks can the system have? If more than one, how many can be enabled? Does a PBS ioe provisioning hook take precedence over a site ioe provisioning hook?


#3

This just adds another ‘oe’ attribute. Each ‘oe’ resource adds special case code that bloats PBS. They are all the same concept. We should have one common code path for all provisioning resources. Adding ioe might fix the current problem, but what happens when we need another provisioning resource? Do we add a fourth resource with another set of special case code?

This is a maintenance nightmare.

We can create resources. We should create a new type of resource called a provisioning resource. This gives us one code path for all provisioning resources that a site wants. They can create 1000 provisioning resources if they want. We won’t care since we created a single code path that handles provisioning.

Bhroam


#4

@Jon The requirement is that the job can request more than one ioe, each on separate chunks. For example:

select=1:vntype=cray_compute_knl:ioe=a2a_50+8:vntype=cray_compute:ioe=HT_enable

PBS must also be aware of what ioe is being requested by a chunk so that chunks with different ioe’s don’t end up on the same node.
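To make the per-chunk requirement concrete, here is a minimal sketch in plain Python (not PBS code). It splits a select spec like the one above into chunks and flags nodes that would receive chunks with different ioe values. The function names and the placement format are made up for illustration, and the parser ignores quoting and other select syntax.

```python
def parse_select(select):
    """Split a select spec into a list of (count, resources) chunks."""
    chunks = []
    for chunk in select.split("+"):
        parts = chunk.split(":")
        # The leading field is the chunk count if it is numeric; default is 1.
        if parts[0].isdigit():
            count, parts = int(parts[0]), parts[1:]
        else:
            count = 1
        resources = dict(p.split("=", 1) for p in parts)
        chunks.append((count, resources))
    return chunks


def conflicting_nodes(placement):
    """Return nodes that were assigned chunks with different ioe values.

    placement is a list of (node_name, ioe) pairs, one per placed chunk
    (a hypothetical format, chosen only for this sketch).
    """
    seen = {}
    conflicts = set()
    for node, ioe in placement:
        if ioe is None:
            continue  # chunk did not request an ioe
        if seen.setdefault(node, ioe) != ioe:
            conflicts.add(node)
    return sorted(conflicts)
```

For the example request, parse_select yields one chunk asking for ioe=a2a_50 and eight asking for ioe=HT_enable; a placement that puts both kinds on one node would show up in conflicting_nodes.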

So there could be one script that takes the ioe as an option and has a case statement or something to handle each ioe. Or we could have multiple hook scripts more like a standard hook event, in which case we’d need to be able to have more than one enabled. In either case we could package different PBS hook or hooks on different platforms. I think it’s a design decision.
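As a rough illustration of the single-script option, the "case statement" could be a simple dispatch table keyed by ioe. This is only a sketch: the handler functions are hypothetical placeholders that return a description instead of touching a node, and nothing here uses the real PBS hook API.

```python
def provision_a2a_50(node):
    # Placeholder: a real script would reconfigure the node here.
    return f"{node}: apply a2a_50"


def provision_ht_enable(node):
    # Placeholder: a real script would reconfigure the node here.
    return f"{node}: apply HT_enable"


# One entry per ioe the script knows how to handle.
IOE_HANDLERS = {
    "a2a_50": provision_a2a_50,
    "HT_enable": provision_ht_enable,
}


def provision(node, ioe):
    """Dispatch to the handler for the requested ioe, or reject it."""
    handler = IOE_HANDLERS.get(ioe)
    if handler is None:
        raise ValueError(f"unsupported ioe: {ioe!r}")
    return handler(node)
```

With multiple hooks instead, each handler would live in its own hook script and the dispatch would be done by PBS rather than by the script.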

And if the design specifies that there can be multiple hooks, then, as with other hook events, I would think a PBS hook would take precedence over a site hook.

@Bhroam There will be differences in the behavior of an ioe hook vs aoe or eoe (e.g. per chunk vs per job). Perhaps in the future we can work out all the corner cases for aoe and eoe provisioning such that it will be worth consolidating into a single code path.


#5

@smgoosen I disagree that just adding another ‘oe’ resource is the right answer. Each ‘oe’ resource requires special case code throughout PBS. If we add ioe, this will be three sets of special case code. This is not maintainable. If there are differences between them, there should be options provided at resource generation.

There is nothing saying provisioning can’t be a single robust code path that is easy to maintain.

We’ve already added two ‘oe’ resources with two distinct code paths. Why is adding a third the right answer? Why isn’t now the right time to make it better?

Bhroam


#6

@smgoosen, thanks for posting the requirements document. I see that you have mentioned that there is need for a different resource/type of provisioning called ioe (for infrastructure provisioning). I, however, do not see the actual “requirements” stated. Why is there a need for a new type of provisioning? What is the customer use case? Why does the “aoe” resource itself not suffice?

Regards,
Subhasis


#7

I have updated the UCR and Jira issue to make them more generic. Whether we need a new attribute or can just extend the existing aoe attribute will be left to a design discussion.

I also mentioned @Bhroam’s suggestion as a potential implementation.


#8

Notes from the offline discussion that happened over email and in a meeting:

Q1. What is the provisioning policy for ioe?

  • Same as AOE (combined or individual?)
    [samg] I believe you’re asking if there should be separate options for configuring ioe vs aoe? I don’t think that would be necessary.

  • In case of the avoid provisioning policy, if both are not available on the same node, which will get preference? (Jon says ioe should get preference.)
    [samg] I agree w/ Jon. The ioe (e.g. KNL) provisioning takes a really long time vs the others and so should be the one we’re really trying to avoid.

Q2. Is ioe provisioning accounted for in max_concurrent_provisioning?
[samg] max_concurrent_provisioning exists to prevent too many nodes from rebooting at the same time, which might result in excessive power spikes. Since ioe provisioning also needs to reboot the system, it should be included when we count nodes for max_concurrent_provisioning.
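A small sketch of what "included when we count nodes" could mean, in plain Python. The function name and arguments are hypothetical; the only point is that nodes being provisioned for ioe and for aoe draw from the same limit.

```python
def nodes_allowed_to_provision(in_progress, candidates, max_concurrent_provisioning):
    """Return the candidates that may start provisioning now.

    in_progress: nodes currently being provisioned, of any type (aoe or ioe).
    candidates: nodes waiting to start provisioning, in priority order.
    """
    free_slots = max(0, max_concurrent_provisioning - len(in_progress))
    return candidates[:free_slots]
```

For example, with a limit of 3 and two nodes already provisioning (one for aoe, one for ioe), only one more candidate would be allowed to start.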

Q3. Will the mom node reboot during ioe provisioning?
[samg] For the most common case we have right now (KNL), yes, it will.

Q4. What happens when ioe provisioning fails?
Note: Since this is infrastructure provisioning, the associated job node might still be working fine even if ioe provisioning fails.
[samg] If provisioning fails we should treat it as a serious issue, the node should be marked offline. If you leave it up the next cycle another (or the same?) job could very well end up getting assigned to the same node, over and over. It would be a potential black hole.

Q5. Will there be separate hooks for ioe and aoe?

  • Note: The provisioning event currently allows only one hook to be associated with it.
    [samg] One of the advantages of having separate provisioning attributes is that you could have different, modular provisioning scripts for each. For example, one script that handles ioe provisioning, another that handles aoe provisioning, and a third that handles eoe provisioning. It would also allow for different timeouts, etc. If we go this route, Solution 1 (below) looks like it could cover things nicely, with the default of “aoe” if there is no type specified.

Q6. Can current_aoe change after ioe provisioning?
[samg] Not sure exactly what you’re asking, so let me know if I’m missing the point. It should be possible to have ioe provisioning occur, setting current_ioe, then aoe provisioning, setting current_aoe, and finally eoe provisioning, setting current_eoe. But ioe provisioning shouldn’t change current_aoe, aoe provisioning shouldn’t change current_ioe, etc.
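The independence described in this answer can be sketched in a few lines of plain Python. The node state is modeled as a simple dict with the three current_* values; this is an illustration only, not how PBS stores vnode attributes.

```python
PROVISIONING_TYPES = ("ioe", "aoe", "eoe")


def apply_provisioning(node_state, kind, value):
    """Return a copy of node_state with only current_<kind> updated."""
    if kind not in PROVISIONING_TYPES:
        raise ValueError(f"unknown provisioning type: {kind!r}")
    new_state = dict(node_state)
    new_state[f"current_{kind}"] = value
    return new_state
```

Applying ioe, then aoe, then eoe provisioning in sequence sets each attribute in turn, and no step changes an attribute belonging to another type.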

Q7. Since we are providing a PBS_HOOK_xxx for KNL, how will a customer use their own script?
[samg] They could disable the PBS hook and install their own modified version as a site hook. If they need help they could contact Altair support.

Q8. As of now, a mix of provision and non-provision job requests is not allowed. What happens if a user requests a few ioe-only and a few aoe-only resources?
[samg] A requirement has been added to the UCR to be able to supply an ioe on a per chunk basis.


#9

Update to Q4 above:

Note that for a Cray we should offline only the vnode (aka compute node) for which provisioning failed. We should not offline the mom, which would then offline ALL the Cray compute nodes it sees as its vnodes (see also the issue re: “pbsnodes -o” offlines ALL vnodes/compute nodes on a Cray).


#10

That sounds fine. The current provisioning code only offlines nodes/vnodes, not moms.


#11

One more quick comment - an example of a provisioning failure we’re currently seeing at a customer where we should offline the node to prevent further problems would be “aoe mismatch” after provisioning has allegedly completed.


#12

I think it can have a simple implementation, i.e., only aoe for everything. We can have only one provisioning hook (sure, one is welcome to have more when supported, but I don’t think that is required now).

The provisioning hook can be given 2 inputs:

  1. The node name.
  2. The set of aoes to get it to, say something like (KnL1, power_state2, Rhel10), in no particular order.

It is then the script’s choice, and its freedom, to get that node to the requested combination using the best possible method(s) (or to reject the request if the combination is a bad one).

Separating the aoe’s and calling the hooks one by one (and in some order) is bad:

  1. PBS code will need to know which aoe is which type, which does not actually serve any use case.
  2. Each hook will only be able to do work for the aoe that is passed to it. If instead the final desired state is known to one hook, it can do a bunch of optimization. For example, a site can decide to first apply the settings for Knl1 and Rhel10 and then do a reboot into power_state2, combining the actions in an intelligent way to make it faster.
  3. PBS will have to maintain the order of hooks, and then at some point people will ask to maintain the order of aoes per node.
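To illustrate the single-hook idea, here is a minimal sketch of a script that receives the node name and the full set of aoes and builds one combined plan. It is plain Python, not the PBS hook API. The set of known aoes is hypothetical, and the sketch assumes that every setting can be applied first and then activated by a single reboot at the end.

```python
# Hypothetical list of aoes this site's script knows how to apply.
KNOWN_AOES = {"KnL1", "power_state2", "Rhel10"}


def plan_provisioning(node, aoes):
    """Build one ordered action list for the whole requested combination."""
    unknown = set(aoes) - KNOWN_AOES
    if unknown:
        # The script is free to reject a combination it cannot handle.
        raise ValueError(f"cannot provision {node}: unknown aoe(s) {sorted(unknown)}")
    steps = [f"apply {aoe} on {node}" for aoe in sorted(aoes)]
    if steps:
        # Assumption: one reboot at the end activates all settings.
        steps.append(f"reboot {node}")
    return steps
```

Because the script sees the final desired state, three requested aoes result in one reboot here, whereas three separate hooks called one by one would each have to decide about rebooting on their own.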

#13

@subhasisb and others, have a look at this EDD, where I proposed to have multiple provision hooks with extended capabilities: https://pbspro.atlassian.net/wiki/spaces/PD/pages/70123530/External+Interface+Design+for+Ramp+Rate+Limiting
It might be relevant to the discussions happening here.


#14

Hi all,
Please review the EDD, which satisfies the above KNL node provisioning use case by extending the aoe provisioning resource request.

Warm regards
Dilip


#15

Thanks @Dilip. So we are saying that the aoe is a simple string. Which sounds perfect.

Which means a site could use “sles+Knl1” as the string. PBS will attach no special meaning to this, and only the site hook would know that it is asking for 2 separate things and do the right set of operations. Or, it could be a word like “knl1_on_sles” that depicts the final state of the node. That should be left to the site to implement.

This sounds fine to me.
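As a sketch of the site-side interpretation, assuming the site picks "+" as its own separator and defines its own alias names (both are assumptions for illustration, and PBS would know nothing about either):

```python
# Hypothetical site-defined names for combined node states.
SITE_ALIASES = {"knl1_on_sles": ["sles", "Knl1"]}


def interpret_aoe(aoe):
    """Return the list of site-level settings an aoe string stands for."""
    if aoe in SITE_ALIASES:
        return list(SITE_ALIASES[aoe])
    # Otherwise treat "+" as a site-chosen separator between settings.
    return [part for part in aoe.split("+") if part]
```

Either spelling ends up as the same list of operations for the hook to carry out, which is the point: the meaning lives in the site hook, not in PBS.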


#16

Looks good – I like that, for this first enhancement, PBS Pro is being limited to having the same aoe for all chunks that request an aoe, as it satisfies the current requirement and can be easily extended in the future (if it is needed), without any backward compatibility issues (with respect to the language).

I suggest also disallowing the mixing of per-chunk aoe’s and job-wide aoe’s in the same job, so a job can either request -laoe=XYZ or have chunks requesting aoe=XYZ, but not both.
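One reading of these two rules, written as a small check in plain Python. The function and its arguments are hypothetical and only meant to pin down the intended behavior; this is not the qsub implementation.

```python
def validate_aoe_request(job_wide_aoe, chunk_aoes):
    """Check a job's aoe request against the two rules discussed here.

    job_wide_aoe: value of -laoe=..., or None if not given.
    chunk_aoes: one entry per chunk, the chunk's aoe or None.
    Returns (ok, reason).
    """
    requested = [aoe for aoe in chunk_aoes if aoe is not None]
    if job_wide_aoe is not None and requested:
        return False, "job-wide aoe and per-chunk aoe cannot be mixed"
    if len(set(requested)) > 1:
        return False, "all chunks that request an aoe must request the same aoe"
    return True, ""
```

Chunks that do not request an aoe at all are accepted alongside chunks that do, as long as the requested values agree.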


#17

Overall I like the change. It’s a nice enhancement to aoe.

I have a few minor comments that you can choose to update or not. The document is fine by itself.

  • Change the second bullet to: Each job subchunk can request at most one aoe. If it requests an aoe, the aoe has to match the aoe of all other subchunks in the select.
  • Change the examples to move “Valid” or “Invalid” to the front of the qsub line. This will tell the people who are skimming through the document that something is valid or not. It looked weird to me to see the last invalid line until I got to the text that explained it was invalid.

Thanks!
Bhroam


#18

Hi @billnitzberg @bhroam,
I’ve updated the EDD as per your suggestion, please review and let me know if any more changes are needed.

I have a doubt regarding this requirement. I guess you meant that we should allow either -lresource or -lselect, but not both (this seems to be the existing behavior). So a job request like:
qsub -laoe=rhel -lncpus=10 J1 is acceptable,
while
qsub -laoe=rhel -lselect=1:ncpus=10 J1 is not acceptable.

Please let me know if I misunderstood anything. Sorry for replying so late, I was on vacation.

Warm regards
Dilip