PP-741: Need a default job specific to Cray in PTL


#1

Hi All,

Please review the design document created for “Default Cray job” as part of Cray PTL support.
https://pbspro.atlassian.net/wiki/display/PD/PP-741%3A+Need+a+default+job+specific+to+Cray+in+PTL

Please let me know if you have any comments or suggestions.

Thanks and Regards,
Sanket


#2

Hi @borlesanket,
I think it would be better if we use “aprun -B sleep 100” as the aprun default instead of “aprun -n 1”.

I am a bit confused, probably because I am still learning PTL. It seems like Job.create_script is unsetting whatever your design says should be set in Job(). Can you add more text to explain what is being unset in create_script and why?


#3

@borlesanket, could you give examples? In Job.create_script(), why do you need a cleanup process? And when do you run the three bulleted steps: before or after the job is submitted? Is there any error checking, and what happens in case of errors?


#4

@lisa-altair, I have updated the document, please have a look.

@vccardenas, I have added some more information regarding create_script. All three bulleted steps are executed before job submission. Regarding error checking, create_script internally calls a DshUtils function, which handles errors.

Example:

j = Job(TEST_USER) - This calls “cray_script”, which internally calls “create_script” and sets “script” to the default Cray script. It also sets “Resource_List.select” to the default value “1:ncpus=1:vntype=cray_compute”.

After this, if we call create_script inside a testcase as:
j.create_script(script body) - This executes all three bulleted steps.
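
To make the example concrete, here is a rough Python sketch of the behaviour described. It is a toy model and not the PTL code: the class SketchJob, the two constants, and the placeholder script bodies are made up for illustration. It also assumes that the three cleanup steps are clearing Resource_List.select, clearing Resource_List.vntype, and dropping the previous “script”.

```python
# Toy model only; SketchJob is hypothetical and is not the PTL Job class.
DEFAULT_SELECT = "1:ncpus=1:vntype=cray_compute"  # default value from the design
DEFAULT_BODY = "aprun -B sleep 100"  # placeholder for the default Cray script


class SketchJob:
    def __init__(self, username):
        self.username = username
        self.attributes = {}
        self.script = None
        self.cray_script()  # on Cray, Job() sets up the defaults

    def cray_script(self):
        # Build the default script through create_script, then set select.
        self.create_script(DEFAULT_BODY)
        self.attributes["Resource_List.select"] = DEFAULT_SELECT

    def create_script(self, body):
        if self.script is not None:
            # Assumed cleanup steps, run before the new script is stored.
            self.attributes.pop("Resource_List.select", None)
            self.attributes.pop("Resource_List.vntype", None)
            self.script = None
        self.script = body


j = SketchJob("pbsuser")
print(j.attributes, j.script)  # Cray defaults are in place

j.create_script("aprun -n 2 hostname")
print(j.attributes, j.script)  # defaults cleared, user script stored
```

In this sketch the first create_script call (made from cray_script) finds no script yet, so nothing is cleaned; only a later call from the testcase triggers the cleanup.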

Let me know if you have more questions/doubts.


#5

Hi @borlesanket, I think I understand now where my confusion is coming from. I thought a new interface called Job.create_script() was being introduced and I couldn’t figure out why a creation script was unsetting things. I do realize the Synopsis says “modifications to add cleanup”. But I truly thought that the create_script() function’s purpose was to make modifications to something called cleanup. It took me a while to realize that you are modifying the existing create_script() to remove a different script that doesn’t apply in the Cray/Cray ALPS simulator case.

But here’s where I’m still confused: the Details section of Job.create_script() says "the new interface “cray_script”, which internally will call “create_script” and will set “script” as a default Cray specific script."
Okay, at this point “script” is set to something good for a Cray/Cray ALPS simulator.
But then the last few bullets in that section say: if “script” is set and platform is a Cray/Cray ALPS simulator then we remove the things we tried to set when we first called cray_script().
Any clarification would be appreciated. Thanks!


#6

I should also mention, the external design should use the term “Cray ALPS simulator”


#7

@borlesanket, thanks for adding the examples and more information. So if I understand correctly, on Cray or the Cray ALPS simulator, Job.create_script() needs to clear out the default or previous settings for Resource_List.select, Resource_List.vntype, and “script”, because the script body that is Aprun_param may already contain Resource_List.select, Resource_List.vntype, and “script”?


#8

@vccardenas, you are right, it will work that way.

@lisa-altair, I have updated the document to mention the Cray ALPS simulator instead of only the Cray simulator.
Regarding the last three bullets in the create_script() interface, let me explain:

  • The first thing we do before job submission is define a job object inside the testcase.

  • On Cray, defining the job object in the usual way inside the testcase, j=Job(TEST_USER), calls the scripts internally in the following sequence:
    j=Job(TEST_USER) -> init -> cray_script -> create_script()

This means that every time we define a job object, it eventually calls create_script() and sets “script” to the Cray-specific body.

  • Now suppose the user wants to call create_script inside the testcase. Then we should start from a clean setup, on the assumption that the user will take care of all the Cray-specific parameters inside create_script. That is why I have added those three cleanup steps.
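
The call sequence above can be traced with a small stand-in class. This is only a sketch: TracedJob, the calls list, and the placeholder script body are hypothetical, and the real PTL Job class does much more at each step.

```python
# Hypothetical stand-in used only to trace the call order; not PTL code.
calls = []


class TracedJob:
    def __init__(self, username):
        calls.append("__init__")
        self.username = username
        self.script = None
        self.cray_script()  # on Cray, Job() goes on to set the defaults

    def cray_script(self):
        calls.append("cray_script")
        self.create_script("aprun -B sleep 100")  # placeholder default body

    def create_script(self, body):
        calls.append("create_script")
        self.script = body


j = TracedJob("pbsuser")
print(" -> ".join(calls))  # __init__ -> cray_script -> create_script
```

So by the time the testcase gets the job object back, create_script() has already run once with the default body.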

Let me know if this clears your doubt.


I am unsure whether Resource_List.select should be cleaned inside create_script, because the user can set it during job object initialization (j = Job(TEST_USER, attrs={'Resource_List.select': '1:ncpus=2'})) and then does not need to add it again inside create_script when using j.create_script inside the testcase.
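
The trade-off can be shown with a small helper. This is a sketch under assumptions: create_script_cleanup and its clear_select flag are hypothetical names used only to compare the two options, and they do not exist in PTL.

```python
# Hypothetical helper comparing the two cleanup options; not PTL code.
def create_script_cleanup(attrs, clear_select):
    """Return a copy of attrs after the assumed cleanup step."""
    cleaned = dict(attrs)
    cleaned.pop("Resource_List.vntype", None)
    if clear_select:
        cleaned.pop("Resource_List.select", None)
    return cleaned


# Select value the user passed when creating the job object.
user_attrs = {"Resource_List.select": "1:ncpus=2"}

# Option 1: cleanup also clears select, so the user's value is lost.
print(create_script_cleanup(user_attrs, clear_select=True))

# Option 2: cleanup leaves select alone, so the user's value survives.
print(create_script_cleanup(user_attrs, clear_select=False))
```

With the first option the user would have to repeat the select inside the script body; with the second, the value given at initialization is kept.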

Let me know your suggestions.

Regards,
Sanket


#9

Hi @vccardenas, @lisa-altair, I have added two more interface change details to the doc. They cover the use of a new parameter, PWD, with sudo, and a default value for ‘submit_dir’ inside the submit function for Cray. This avoids the need for the -o option for successful job submission.

Also, as I said in my last comment about my uncertainty over unsetting the Resource_List.select value in the Interface: Job.create_script() section, I have removed it to match the existing PTL configuration, which allows users to use Resource_List.select with their own script created inside the testcase using ‘create_script’.


#10

Hi, I have removed the two new interfaces mentioned in my last comment, as they work properly only in certain scenarios. I also realized that we should run PTL from the login node, so that the environment matches what users currently set up for their job submissions. In that case we don’t need these extra changes for successful job submission, as there is no issue with copying stageout files back to the server node.


#11

Hi All, I have updated the document to follow a new approach, which is more correct than the previous one. Under this approach I will be modifying Job() and Server.submit(), so I have updated the document to describe these two interfaces. Please review.


#12

This looks fine to me.


#13

EDD looks good to me. I sign off.


#14

@borlesanket, the current design in page v.10 looks good to me.