PP-946: On Cray a PTL script setting select specification on a job using set_attributes method makes the job fail to run


#1

Hi All,

Please review the design document created for:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/68780081/PP-946+On+Cray+a+PTL+script+setting+select+specification+on+a+job+using+set+attributes+method+makes+the+job+fail+to+run

Please do let me know of any comments or suggestions.

Thanks and Regards,
Sanket


#2

Looks good @borlesanket


#3

Hi All,

As per the current design, we have the following two cases:
case-1: select without vntype:
select=1:ncpus=1+1:ncpus=2
This will be modified to:
select=1:ncpus=1:vntype=cray_compute+1:ncpus=2:vntype=cray_compute

case-2: select where one of the chunks already contains vntype:
select=1:ncpus=1+1:ncpus=2:vntype=cray_compute
In this case, the select statement is not modified.


However, there is some confusion about case-2:

  1. Should it stay as above (select statement not modified), or
  2. Should we also add vntype to the chunks that do not already have it (select statement modified)?
    That is, for case-2 should it be modified to:
    select=1:ncpus=1:vntype=cray_compute+1:ncpus=2:vntype=cray_compute

In short, should we add vntype to every chunk of the select statement that does not already have it, regardless of what the other chunks specify?

Let me know your opinion regarding this.

Regards,
Sanket


#4

It seems odd to raise a concern on something I just signed off :slight_smile:

IMO, on a Cray system we should add vntype to weighted chunks that do not specify vntype. Otherwise some chunks of the job may end up going to a login node, which is not what we want.


#5

This thread has been pending for a long time now… We need to come to a conclusion soon. What do others (@lisa-altair, @kjakkali, @vccardenas, @hirenvadalia and @borlesanket) think about this change?


#6

Hi, my answer depends on how the job script ended up mixed like that…

  • If PTL causes vntype to be set on some weighted chunks (the resource requests between the + signs) but not others, then I agree with @arungrover and PTL should add vntype=cray_compute to all of the chunks in the job request.

  • If the job itself caused the strange combination, then I’m not sure what the job expects. And I’m not sure what PTL should do.

However, please be aware that there are some cases where PTL should not add vntype=cray_compute to the weighted chunks. Adding vntype=cray_compute could conflict with the requested resource, which could then prevent the job from being scheduled. Do not add vntype=cray_compute when the weighted chunk requests any of the following resources:

  • vntype
  • host
  • vnode

#7

@arungrover and @lisa-altair, taking all of the above suggestions together, my understanding is that we should add “vntype=cray_compute” to every chunk that does not already specify vntype, host, or vnode. Please correct me if I am wrong.

For example, if I take the select statement below:
select=1:ncpus=1:vntype=pqr+1:ncpus=2:host=xyz+1:ncpus=3:vnode=abc+1:ncpus=4

Then this will become:
select=1:ncpus=1:vntype=pqr+1:ncpus=2:host=xyz+1:ncpus=3:vnode=abc+1:ncpus=4:vntype=cray_compute
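
To make the rule concrete, here is a rough Python sketch of it. This is not the actual PTL change: the function name add_cray_vntype is hypothetical, it takes the select value without the leading "select=", and it assumes resource values never contain a quoted "+" or ":".

```python
def add_cray_vntype(select, vntype="cray_compute"):
    """Append vntype=<vntype> to each chunk that names no vntype, host, or vnode.

    Hypothetical helper for illustration only; not part of PTL.
    """
    # Resources that already pin a chunk to a particular kind of node.
    placement_keys = {"vntype", "host", "vnode"}
    chunks = []
    for chunk in select.split("+"):
        # A chunk looks like "N:res=val:res=val"; collect the resource names.
        names = {part.split("=", 1)[0]
                 for part in chunk.split(":") if "=" in part}
        if names & placement_keys:
            # Leave the chunk alone to avoid a conflicting request.
            chunks.append(chunk)
        else:
            chunks.append(f"{chunk}:vntype={vntype}")
    return "+".join(chunks)


if __name__ == "__main__":
    print(add_cray_vntype(
        "1:ncpus=1:vntype=pqr+1:ncpus=2:host=xyz+1:ncpus=3:vnode=abc+1:ncpus=4"))
```

With the select value from the example above, only the last chunk should gain vntype=cray_compute; with case-1 (no placement resources anywhere), every chunk should.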


#8

Yes @borlesanket, I believe your understanding is correct!


#9

Yes, your example looks correct!


#10

@arungrover and @lisa-altair, I have updated the document accordingly. Please have a look.


#11

looks good @borlesanket


#12

Thanks for making the changes. Looks good.