Re: PP-928 v.6
“Reliable Job Start” is a great goal – thanks for working on this enhancement! To add a bit more context, my understanding is that the main use case is this:
- Big (wide) jobs often have significant queue waiting times, as other jobs must finish and relinquish their resources to make room for a “big” job. (Generally, this is the result of common large-site policies and how they trade off the goals of minimizing waiting time and maximizing utilization. This is generally the right approach, and works well in most cases.)
- When PBS Pro detects a “bad” node, any job assigned to that node is terminated and (generally) re-queued. (“Generally,” because this is again a policy decision left to the site.)
- In the unfortunate situation where a node is detected as “bad” during the startup of a “big” job, the “big” job is re-queued. If sufficient additional nodes are not available, the “big” job will again wait a significant time for another chance to start.
- This enhancement proposes to eliminate that additional waiting time (caused by the “big” job being re-queued without enough additional nodes available).
(Please let me know if I got this wrong or missed additional nuances.)
A big suggestion…
Since a lot of effort is left to the SysAdmin, and it seems the core issue is that PBS Pro kills a job when it detects a “bad” node, how about adding a new feature to PBS Pro that allows a job to continue running, even if a node is detected as “bad”?
Allowing a job to continue running (despite detecting “bad” nodes) is a generally useful feature, not just for this enhancement, and has been requested, e.g., to support fault-tolerant MPI. With this capability (plus some recovery bits), a SysAdmin could implement the use case for Reliable Job Startup, e.g.
- A queuejob hook copies the user-submitted “select” into a new custom resource, “requested_select”, then appends extra chunks to the “select” itself
- An execjob_launch or execjob_prolog hook gets final information on any newly detected “bad” nodes (perhaps using the proposed Interface 5), then uses the new Node Ramp Down feature (recently added to PBS Pro) to “free” all the “bad” nodes plus any additional unwanted nodes. (The hook would also need to update the PBS_NODEFILE. Perhaps PBS Pro could supply a hook routine that generates a PBS_NODEFILE from a select to make that easy too.)
- Heck, one could even keep the extra nodes around until the job has been running for 10 minutes, then free the extras – that would also handle the case where something in the application startup itself causes a “bad” node to be detected.
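To make the bookkeeping concrete, here is a minimal pure-Python sketch of two of the steps above. It deliberately avoids the PBS hook API; the function names (pad_select, nodefile_lines) and the (host, ncpus) tuple shape are hypothetical illustrations, not anything PBS Pro provides today:

```python
def pad_select(user_select, pad_chunks):
    """Append spare chunks to a select spec (chunks are '+'-joined
    in PBS select syntax, e.g. '4:ncpus=8:mem=2gb+2:ncpus=4').

    Mirrors the queuejob-hook step: the original spec would be saved
    in a custom resource (e.g. "requested_select") and this padded
    spec stored back into "select".
    """
    if not pad_chunks:
        return user_select
    return user_select + "+" + pad_chunks


def nodefile_lines(assigned_hosts, bad_hosts):
    """Rebuild PBS_NODEFILE contents after 'bad' hosts are freed.

    assigned_hosts: ordered (host, ncpus) pairs, one per assigned
    chunk; a PBS_NODEFILE lists each host once per CPU assigned.
    Returns the surviving hostnames in order, bad hosts dropped.
    """
    bad = set(bad_hosts)
    return [h for h, n in assigned_hosts if h not in bad
            for _ in range(n)]
```

For example, `pad_select("4:ncpus=8", "2:ncpus=8")` yields a spec requesting two spare chunks, and `nodefile_lines([("n1", 2), ("n2", 1), ("n3", 2)], ["n2"])` drops the bad host n2 while keeping the CPU multiplicity of the good hosts.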
I guess I’m struggling with the current design, as it forces the SysAdmin to do a lot of the bookkeeping work (something that, in principle, PBS Pro is really good at doing), and it doesn’t address the “whole” problem (e.g., what if Mother Superior is “bad”, and what if a “bad” node is detected in the first 30 seconds of the job). I’m assuming this approach is being proposed to create something useful as a first step, while minimizing development effort. I feel like it should be possible to find a better way to test this direction without adding baggage to the overall PBS Pro design…
If the above is not compelling, then …
Suggest renaming “rjselect”. Generally, common practice for PBS Pro has been to avoid abbreviations whenever possible. Alternative suggestions: “select_with_padding”, “select_startup_pad”, “select_reliable_job_startup”, …
Suggest adding more details on which hooks run before and after the select is updated. Does all of the reliable-job-startup processing run before any execjob hooks? If not, which hooks run before and which run after?
Suggest making the new “s” record in addition to the existing “S” record (though I’m not positive about this). The idea is to preserve as much backward compatibility with existing accounting tools as possible. In any case, please explicitly state whether the new record supplements or replaces the existing one.