PP-928: Reliable Job Startup


#21

So increment_chunks() should just take care of that and document it. If the first chunk specification has a count of 1, no change will be made to it. If its count is 2 or more, the increase will apply only to the chunks beyond the MS chunk (e.g. if select is 3:ncpus=2, then increment_chunks(“50%”) would return 4:ncpus=2).


#22

I see. I never even thought of it this way. Sure, I can make increment_chunks() behave this way. So given select=3:ncpus=2 as the first chunk, increment_chunks(“50%”) would leave “1:ncpus=2” alone but apply the “50%” to the remaining “2:ncpus=2” so it becomes “3:ncpus=2”, and then put it all back together as “4:ncpus=2”. I’ll update the EDD and code.
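
To make the arithmetic concrete, here is a minimal sketch of that rule in plain Python. It is not the actual PBS implementation: the function name increment_chunks_sketch is hypothetical, it only handles a single chunk specification of the form "N:resources", and rounding a fractional percentage up is an assumption that the thread does not settle.

```python
import math


def increment_chunks_sketch(select, increment):
    """Hypothetical stand-in for increment_chunks() on one chunk spec.

    select    -- a single chunk specification such as "3:ncpus=2"
    increment -- a percentage string such as "50%" or a plain integer
    """
    count_str, sep, resources = select.partition(":")
    if not sep or not count_str.isdigit():
        # No explicit count: treat it as a single chunk and leave it alone.
        return select

    count = int(count_str)
    rest = count - 1  # chunks beyond the one set aside for the MS
    if rest == 0:
        # A single first chunk is never changed.
        return select

    if isinstance(increment, str) and increment.endswith("%"):
        # Assumption: fractional results are rounded up.
        extra = math.ceil(rest * float(increment[:-1]) / 100)
    else:
        extra = int(increment)

    # Put the MS chunk back together with the grown remainder.
    return f"{1 + rest + extra}:{resources}"


print(increment_chunks_sketch("3:ncpus=2", "50%"))  # 4:ncpus=2
print(increment_chunks_sketch("1:ncpus=2", "50%"))  # 1:ncpus=2
```

With select=3:ncpus=2, one chunk is set aside, 50% of the remaining 2 is 1, and the pieces recombine to 4:ncpus=2, matching the example above.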


#23

I’ve updated the design to incorporate comments from Bhroam and Greg.

  • The ‘tolerate_node_failures’ attribute’s type has been changed from boolean to a string with valid values: “all”, “job_start”, and “none”. “all” is for tolerating node failures at any point in the job run. “job_start” is for tolerating failures or errors only during job start.
  • The increment_chunks() select method has been updated to leave the first chunk, the one that is assigned by primary mom, as is.

The updated design is in:
Reliable Job Startup Design v13