PP-928: Reliable Job Startup


So increment_chunks() should just take care of that, and we should document it. If the first chunk spec asks for only a single chunk, no change will be made to it. If it asks for 2 or more, the increase will apply to the chunks beyond the MS chunk (e.g. if the select is 3:ncpus=2, then increment_chunks(“50%”) would return 4:ncpus=2).


I see. I never even thought of it this way. Sure, I can make increment_chunks() behave this way. So given select=3:ncpus=2 as the first chunk spec, increment_chunks(“50%”) would leave “1:ncpus=2” alone, apply the “50%” to the remaining “2:ncpus=2” so it becomes “3:ncpus=2”, and then put it all back together as “4:ncpus=2”. I’ll update the EDD and code.
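
To make the arithmetic concrete, here is a minimal standalone sketch of the behavior described above. increment_chunks() is the actual select-spec method under discussion, but the helper below is only illustrative: its name, string handling, and ceiling-based rounding are my assumptions, not the real implementation.

    import math

    def increment_chunks_sketch(select, pct_str):
        """Illustrative only: apply a percentage increase to every chunk
        except the single chunk kept for the MS (primary mom)."""
        percent = float(pct_str.rstrip('%'))
        out = []
        for i, chunk in enumerate(select.split('+')):
            count, _, rest = chunk.partition(':')
            count = int(count)
            if i == 0:
                if count > 1:
                    # Keep 1 chunk for the MS; increase only the remaining ones.
                    count += int(math.ceil((count - 1) * percent / 100.0))
            else:
                count += int(math.ceil(count * percent / 100.0))
            out.append("%d:%s" % (count, rest) if rest else str(count))
        return '+'.join(out)

    # Example from the discussion: "1:ncpus=2" stays for the MS, the other
    # "2:ncpus=2" grows by 50% to "3:ncpus=2", giving "4:ncpus=2" overall.
    print(increment_chunks_sketch("3:ncpus=2", "50%"))   # -> 4:ncpus=2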


I’ve updated the design to incorporate comments from Bhroam and Greg.

  • The ‘tolerate_node_failures’ attribute’s type has been changed from boolean to a string with the valid values “all”, “job_start”, and “none”. “all” tolerates node failures at any point in the job’s run, while “job_start” tolerates failures or errors only during job start (see the sketch after this list).
  • The increment_chunks() select method has been updated to leave the first chunk, the one assigned to the primary mom, as is.
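
As a hedged illustration of the first item above (not taken from this thread): assuming ‘tolerate_node_failures’ is settable from a hook the way other job attributes are, a queuejob hook could request tolerance during job start only. The hook body below is an assumption for illustration.

    # Hypothetical queuejob hook body; assumes tolerate_node_failures is
    # hook-settable like other job attributes (not stated in this thread).
    import pbs

    e = pbs.event()
    e.job.tolerate_node_failures = "job_start"   # or "all" / "none"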

The updated design is in:
Reliable Job Startup Design v13


I’ve updated the design, adding clarifications; the latest version is below (the previous version was v13):

Reliable Job Startup Design v23

  • interface 1: Added the “from <node_host>” string to the message: “ignoring from <node_host> error as job is tolerant of node failures”

  • interface 4: Added a restriction on the $job_launch_delay option under Windows as follows:
    “This option is currently not supported under Windows. NOTE: Allowing it would cause the primary mom to hang waiting on the job_launch_delay timeout, preventing other jobs from starting. This is because jobs are not pre-started in a forked child process, unlike in Linux/Unix systems.”

  • interface 5: Added details on which sister hosts the primary mom considers non-healthy, whose vnodes are not chosen when the job is pruned via pbs.release_nodes(keep_select=X):

    • Any sister node that is able to join the job will be considered healthy.
    • The success of the join-job request may be the result of a check made by a remote execjob_begin hook. After successfully joining the job, the node may further check its status via a remote execjob_prologue hook. A reject by the remote prologue hook will cause the primary mom to treat the sister node as a problem node and mark it as unhealthy (a hedged prologue hook sketch appears at the end of this section). Unhealthy nodes are not selected when pruning a job’s request via the pbs.release_nodes(keep_select) call (see interface 8 below).
    • If there’s an execjob_prologue hook in place, the primary mom will track the node hosts that have given an IM_ALL_OKAY acknowledgement for their execution of the execjob_prologue hook. Then, after some ‘job_launch_delay’ amount of time into job startup (interface 4), the primary mom will start reporting as failed those nodes that have not given a positive acknowledgement of prologue hook execution. This info is communicated to the child mom running on behalf of the job, so that vnodes from the failed hosts are not used when pruning the job (i.e. the pbs.release_nodes(keep_select=X) call).
    • If after some time, a node’s host comes back with an acknowledgement of successful prologue hook execution, the primary mom would add back the host to the healthy list.
  • interface 8: Regarding the pbs.event().job.release_nodes(keep_select=X) call:

    • This call makes sense only when the job is node-failure tolerant (i.e. tolerate_node_failures=job_start or tolerate_node_failures=all), since that is when the lists of healthy and failed nodes are gathered to be consulted by release_nodes() for determining which chunks should remain assigned and which should be freed.

    • Since the execjob_launch hook will also get called when spawning tasks via pbsdsh or tm_spawn, ensure that an execjob_launch hook invoking the release_nodes() call first checks for ‘PBS_NODEFILE’ in the pbs.event().env list. The presence of ‘PBS_NODEFILE’ in the environment ensures that the primary mom is executing on behalf of starting the top-level job, and not spawning a sister task. One can just add the following at the top of the hook:

      import pbs
      e = pbs.event()
      # Only prune when starting the top-level job (PBS_NODEFILE present);
      # otherwise accept the event and exit the hook early.
      if 'PBS_NODEFILE' not in e.env:
          e.accept()
      j = e.job
      pj = j.release_nodes(keep_select=...)
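
As a hedged illustration of the interface 5 behavior above (see the sketch reference in its second sub-bullet), a remote execjob_prologue hook on a sister mom can reject the event when a node-local check fails, which, per the description, leads the primary mom to mark that host unhealthy so its vnodes are skipped by release_nodes(). The specific check below (a writable scratch directory) and the path are invented for illustration.

      # Hypothetical execjob_prologue hook body on a sister mom; the health
      # check and the /scratch path are made up for illustration only.
      import os
      import pbs

      e = pbs.event()
      scratch = "/scratch"   # assumed site-specific path
      if not (os.path.isdir(scratch) and os.access(scratch, os.W_OK)):
          # Rejecting here makes the primary mom treat this host as a problem
          # node, so its vnodes are not kept by pbs.release_nodes(keep_select=...).
          e.reject("scratch directory %s is not usable" % scratch)
      e.accept()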