PP-506 & PP-507 — Update from 2017-02-06 Public Design Discussion
@billnitzberg, @scc, @jon, @bhroam, @mkaro, @smgoosen, @arungrover
Consensus on guiding principles for this design:
- C1 - Extend the resource selection options with a focus on flexibility. So, although many of the use cases are site-wide policies, and could be (better?) implemented via sched_config options, this design proposes using per-job resource requests, as that is more flexible overall.
- C2 - Target optimizing job start times. It is understood that optimizing for other measures such as end time, cost, power, utilization, and reliability is important, and other parts of PBS Pro focus on these areas; they’re just not the focus of this feature.
- C3 - Target SysAdmin use cases (90% of focus); secondarily, target power users (10%).
- C4 - Target automation and tools when designing new language syntax (90% of focus) above the secondary consideration of human readability (10%). E.g., make parsing and tooling easier to code, and don’t worry about shell quoting.
- C5 - As always, consider backward compatibility to protect existing investments by the PBS Pro community in tooling, e.g., hooks and tools that parse qstat output and accounting logs. So, it is better to offer an entirely new syntax that supports these new capabilities, and to continue supporting the existing syntax (-l, select, place), unmodified, for backward compatibility.
General Use Cases:
- G1 - Start jobs sooner by providing additional allocation choices (with preferences to partly balance versus other competing goals)
- G2 - Increase efficiency (better cost, speed, power use, reliability) by better matching job classes with preferred resources (e.g., XYZZY jobs run better on big memory nodes, ABCDE jobs run better on AMD CPUs, but they all run everywhere, just not quite so well)
- G3 - Prevent erroneous allocations (e.g., application requires Linux 2.6 or later)
- G4 - Adjust request based on the allocation itself.
Specific example use cases:
- U1 - Run on new hardware if available… if not, run on older hardware
- U2 - Don’t use big memory nodes… unless there’s nothing else
- U3 - Run low priority jobs on old hardware, if available… but, if it isn’t available, then use the new hardware. And vice versa for high priority jobs.
- U4 - XYZZY software runs better on big memory nodes, but is OK on small memory nodes, and depending on where XYZZY runs, it needs different numbers of licenses
- U5 - Do what LSF does in terms of boolean & conditional resource requests — note: it was agreed that this was not well-defined enough, and would need a lot more definition to be truly useful.
- U6 - My job needs Red Hat 7 or higher; or only runs on Linux
- U7 - Assuming nodes are appropriately labeled, ensure my job gets whole nodes and fills them up (e.g., ensure the job gets 16 cores on 16 core nodes or 24 cores on 24 core nodes)
- U8 - (Deleted)
- U9 - Adjust execution time limits based on which type of nodes are allocated (for example, a job may need less walltime on new hardware than on old hardware)
[Most of these use cases can be met even today with existing PBS; it's just that it may not be able to satisfy all of them together.]
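To make the "preferred, then fallback" pattern behind U1, U2, and U9 concrete, here is a minimal sketch in Python. It is not PBS Pro syntax or scheduler code: the `Alternative` structure, the `first_satisfiable` helper, and the "new"/"old" node labels are all hypothetical names invented for illustration. It assumes a job carries an ordered list of acceptable requests, each with its own walltime, and that the first one that fits is chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Alternative:
    """One acceptable way to run the job (hypothetical structure)."""
    node_type: str      # site-defined label, e.g. "new" or "old" (assumed)
    nodes: int          # number of nodes of that type
    walltime_min: int   # walltime limit if this alternative is chosen (U9)


def first_satisfiable(alternatives: List[Alternative],
                      free_nodes: Dict[str, int]) -> Optional[Alternative]:
    """Return the first alternative, in preference order, that fits the free nodes."""
    for alt in alternatives:
        if free_nodes.get(alt.node_type, 0) >= alt.nodes:
            return alt
    return None  # nothing fits right now; the job keeps waiting


# U1 + U9: prefer new hardware; fall back to old hardware with a longer walltime.
job = [
    Alternative("new", nodes=4, walltime_min=60),
    Alternative("old", nodes=4, walltime_min=90),
]

chosen = first_satisfiable(job, {"new": 2, "old": 8})
print(chosen)
```

With only two "new" nodes free, the sketch falls through to the "old" alternative and picks up its longer walltime, which is the kind of allocation-dependent adjustment G4 describes.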
Concerns / Discussion:
CD1 - Filter idea: another workload manager breaks this problem into two steps: a filter step (to choose a subset of nodes) and an allocation step (to allocate resources from the chosen subset). A strategy like this might allow PBS Pro to provide good backward compatibility (by adding only a new “--filter” pseudo-resource, for example), but it would not support one of the major use cases (G4). So this direction was abandoned.
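The two-step strategy in CD1 can be sketched roughly as follows. This is an illustration only, not how PBS Pro or any other workload manager implements it: `filter_nodes`, `allocate`, and the node dictionaries with `name`, `ncpus`, and `mem_gb` keys are hypothetical. The point to notice is that the request (`ncpus_needed`) is fixed before the allocation step and is the same whichever nodes pass the filter, which is why this shape cannot adjust the request based on the allocation (G4).

```python
from typing import Callable, Dict, List, Optional

Node = Dict[str, object]  # hypothetical node record: name, ncpus, mem_gb


def filter_nodes(nodes: List[Node], predicate: Callable[[Node], bool]) -> List[Node]:
    """Step 1: choose the subset of nodes the job is willing to run on."""
    return [n for n in nodes if predicate(n)]


def allocate(candidates: List[Node], ncpus_needed: int) -> Optional[List[str]]:
    """Step 2: greedily take whole nodes from the subset until the request is met."""
    chosen: List[str] = []
    total = 0
    for node in candidates:
        if total >= ncpus_needed:
            break
        chosen.append(node["name"])
        total += node["ncpus"]
    return chosen if total >= ncpus_needed else None


nodes = [
    {"name": "n1", "ncpus": 16, "mem_gb": 64},
    {"name": "n2", "ncpus": 16, "mem_gb": 512},  # a "big memory" node
    {"name": "n3", "ncpus": 16, "mem_gb": 64},
]

# Filter out big memory nodes (the strict half of U2), then allocate 32 CPUs.
small = filter_nodes(nodes, lambda n: n["mem_gb"] < 256)
print(allocate(small, 32))
```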
CD2 - There is a potential conflict between SysAdmin defined policy (e.g., big jobs get top priority) and individual job “policy” (e.g., job X requests ncpus=10 || ncpus=1000 ). One way to address this “conflict” could be to treat each resource request separately (e.g., job X is prioritized by the scheduler as two separate jobs, one asking for ncpus=10 and another asking for ncpus=1000). More thought needed here…
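One way to picture the "treat each resource request separately" idea from CD2 is the sketch below. It is purely illustrative and assumes a simple "bigger request first" site policy; `expand_alternatives`, `order_by_size`, and the second job "Y" are hypothetical and not part of any PBS Pro interface. Each alternative of job X becomes its own entry, and the entries are then ordered by the site policy like independent jobs.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # (job id, ncpus requested by this alternative)


def expand_alternatives(job_id: str, ncpus_options: List[int]) -> List[Entry]:
    """Turn one job with several alternative requests into one entry per alternative."""
    return [(job_id, ncpus) for ncpus in ncpus_options]


def order_by_size(entries: List[Entry]) -> List[Entry]:
    """Assumed site policy: big jobs get top priority, so sort by ncpus, largest first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Job X asks for ncpus=10 || ncpus=1000; job Y (hypothetical) asks for ncpus=500.
queue = expand_alternatives("X", [10, 1000]) + expand_alternatives("Y", [500])
ordered = order_by_size(queue)
print(ordered)
```

In this sketch job X appears both ahead of and behind job Y, which illustrates the open question: once one of X's entries starts, the other would have to be withdrawn, and the notes leave that behavior undecided.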
- Review alternative resource request language, e.g., OGF JSDL, OASIS, … then hold follow-up meeting