PP-339 and PP-647: release vnodes early from running jobs


#1

This note is to inform the community of work being done to introduce the node ramp down feature to PBS, which basically releases no longer needed sister nodes/vnodes early from running jobs.

Node Ramp Down Design

Please feel free to provide feedback on this topic.


#2

I have a few comments. Nothing major:

  1. The log messages in interface 1 should be interfaces of their own with change control/visibility of their own
  2. What happens if you provide the -a option and a list of vnodes? Does the -a option take precedence or does the command throw an error?
  3. This gets a little bit into the internals, but what happens if you try to release nodes from a suspended job? If I recall correctly, suspended jobs are in state running, substate 43/45. Do you still get the invalid state message for suspended jobs? I suspect so.
  4. In the accounting log message, I’d rather see a change from “next_” to “updated_”. It makes more sense to me.
  5. Interface 4 should be PBS Private/Unstable. If an interface can change and people shouldn’t depend on it, it should be an unstable interface. Public/Experimental is still a Public interface. Experimental just says you can add such an interface to non-major releases.

#3

@bhroam: Thanks Bhroam for the comments! Here are my replies:

  1. The log messages in interface 1 should be interfaces of their own with change control/visibility of their own
    [Al] Ok, I’ll update.
  2. What happens if you provide the -a option and a list of vnodes? Does the -a option take precedence or does the command throw an error?
    [Al] It will actually throw an error:
    % pbs_release_nodes -j 252 -a federer
    usage: pbs_release_nodes [-j job_identifier] host_or_vnode1 host_or_vnode2 …
    usage: pbs_release_nodes [-j job_identifier] -a
    pbs_release_nodes --version
    I’ll update the EDD.
  3. This gets a little bit into the internals, but what happens if you try to release nodes from a suspended job? If I recall correctly, suspended jobs are in state running, substate 43/45. Do you still get the invalid state message for suspended jobs? I suspect so.
    [Al] Yes, we still get the error message:

% qstat
Job id Name User Time Use S Queue


253.borg STDIN bayucan 00:00:00 S workq
bayucan@borg:~> pbs_release_nodes -j 253 federer
pbs_release_nodes: Request invalid for state of job


#4

[Al] Ok. The site that wanted this change seemed happy with the “next_” prefix, but I’ve noted your suggestion.

[Al] Ok.


#5

I think updated_* over next_* makes more sense.

Also, not sure if I missed it but did you mention how the resources used values get updated? Do they start over on an update or does something else happen?


#6

I think updated_* over next_* makes more sense.
[Al] Ok, then.

Also, not sure if I missed it but did you mention how the resources used values get updated? Do they start over on an update or does something else happen?
[Al] I didn’t really talk about it but I should. Basically, when a node is released, it reports to the MS its resources_used* values for the job as its final action. That released node no longer updates the resources_used values for that job since it’s no longer part of the job. But MS will hold onto the data, and it will be added in during the final aggregation of resources_used values when the job exits. I’ll update the design doc with this info.
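To make the bookkeeping described above concrete, here is a minimal Python sketch. It is not the PBS implementation: the `JobUsage` class, its method names, and the plain per-key summation are all assumptions made for illustration, and the numbers are made up.

```python
from collections import defaultdict


class JobUsage:
    """Hypothetical mother-superior-side bookkeeping for one job."""

    def __init__(self):
        self.live = {}      # node name -> latest resources_used report
        self.released = {}  # node name -> final report captured at release

    def report(self, node, usage):
        """Periodic update from a node; ignored once the node is released."""
        if node in self.released:
            return
        self.live[node] = dict(usage)

    def release(self, node, final_usage):
        """Record the node's final report and stop tracking it as live."""
        self.live.pop(node, None)
        self.released[node] = dict(final_usage)

    def aggregate(self):
        """Sum usage over live nodes plus the held values of released nodes."""
        total = defaultdict(int)
        for usage in list(self.live.values()) + list(self.released.values()):
            for key, value in usage.items():
                total[key] += value
        return dict(total)


job = JobUsage()
job.report("ms", {"cput": 100, "mem_kb": 500})
job.report("federer", {"cput": 40, "mem_kb": 200})
job.release("federer", {"cput": 50, "mem_kb": 220})  # final action on release
job.report("federer", {"cput": 999, "mem_kb": 999})  # ignored: no longer in job
job.report("ms", {"cput": 150, "mem_kb": 520})
print(job.aggregate())
```

The point of the sketch is only that the released node's last report is kept and still counts toward the total when the job exits.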


#7

Should we create a new attribute that displays the duration from the last update/start for each update record? I believe this will make accounting/analytics easier.


#8

I think the site that was trying this feature just parses the accounting logs, going from one update record to the next record, and figures out the duration that way. It’s something to do with figuring out the incremental values. But I’m not sure if we should add an attribute that holds that info automatically…
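For illustration, a rough Python sketch of that kind of parsing follows. It assumes each accounting log line has the form `timestamp;record_type;job_id;message` with an `MM/DD/YYYY HH:MM:SS` timestamp. The function name `phase_durations` and the sample lines are made up for the example and are not taken from the site's parser.

```python
from datetime import datetime


def phase_durations(lines, job_id):
    """Return seconds elapsed between consecutive S/u/E records of one job."""
    stamps = []
    for line in lines:
        when, rtype, jid, _message = line.split(";", 3)
        if jid == job_id and rtype in ("S", "u", "E"):
            stamps.append(datetime.strptime(when, "%m/%d/%Y %H:%M:%S"))
    # Pair each record with the next one and take the difference.
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# Made-up sample lines, only to exercise the function.
sample = [
    "03/01/2017 10:00:00;S;5501.borg;user=bayucan",
    "03/01/2017 10:30:00;S;5502.borg;user=bayucan",
    "03/01/2017 11:00:00;u;5501.borg;user=bayucan",
    "03/01/2017 12:00:00;u;5501.borg;user=bayucan",
    "03/01/2017 13:00:00;E;5501.borg;user=bayucan",
]
print(phase_durations(sample, "5501.borg"))
```

A per-record duration attribute would remove the need for this record-to-record walk, which is the trade-off being discussed.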


#9

Unless I am mistaken, I don’t believe that we can modify an ALPS reservation on a Cray XC system. If this is the case then we need to note this in the EDD, and I would recommend that we return a message to the user that “pbs_release_nodes is not currently supported on Cray XC systems”. Once Cray adds that capability to ALPS, we can remove the error.


#10

I talked to our PBS expert on Cray systems and confirmed what you said is true. I’ve also gotten an idea on how the PBS server can figure out if a node/vnode specified in ‘pbs_release_nodes’ is actually a Cray X series compute node. Then we can reject the request, returning the message that you suggested…


#11

Correct, we want the “u” accounting record to represent incremental values.

An example: suppose job A runs 3 hours, initially starting on 4 nodes but releasing 1 node each hour. For this job I’d expect an S record, 2 u records, and an E record (I’ll ignore other records like Q). The S record is fine as-is.

The first u record should contain resources_used.walltime=01:00:00, note the exec_host/vnode of the just-completed phase with 4 nodes, and provide exec_host/vnode of the new phase with 3 nodes. It should also include as many fields from the E/R record as are reasonable.

The second u record also contains resources_used.walltime=01:00:00, but has exec_host/vnode of the just-completed phase with 3 nodes, and provides exec_host/vnode of the new phase with 2 nodes. Again it should also include as many fields from the E/R record as are reasonable.

The pbs server will need to be smart and create the E record with resources_used.walltime=01:00:00 (rather than 03:00:00).
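The example above can be written out as a small Python sketch that lists the expected records with their incremental walltimes. The helper names `fmt` and `expected_records` and the tuple layout are assumptions for illustration only; they say nothing about how the server builds its records.

```python
def fmt(seconds):
    """Format a second count as HH:MM:SS."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def expected_records(start_nodes, release_times, end_time):
    """List (record_type, incremental_walltime, nodes_in_concluded_phase).

    release_times and end_time are seconds since job start; each release
    drops one node, as in the example above.
    """
    records = [("S", None, start_nodes)]
    prev, nodes = 0, start_nodes
    for t in release_times:
        records.append(("u", fmt(t - prev), nodes))
        prev, nodes = t, nodes - 1
    # The E record covers only the time since the last release.
    records.append(("E", fmt(end_time - prev), nodes))
    return records


for rec in expected_records(4, [3600, 7200], 10800):
    print(rec)
```

For the 3-hour job this yields one S record, two u records, and an E record, each of the last three carrying a walltime of 01:00:00.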


#12

Thanks, Greg. I’ll also need to add this info to the document.


#13

Hi Greg,
This is what I plan to add to the design document. Does this look right to you, that is, is it consistent with what you expect from the feature:

The ‘u’ (update) accounting record represents a just concluded phase in the job, just before a pbs_release_nodes action was successfully done on the job causing a release of nodes/vnodes.
A concluded phase consists of a set of resources assigned to the job (i.e. exec_vnode, exec_host, Resource_List), amount of resources used during that phase (i.e. resources_used), and a set of resources to be assigned in the new phase of the job (i.e. next_*).
The ‘u’ record will show what was assigned (exec_vnode, exec_host, Resource_List), and how much of each resource was used (resources_used) in the just concluded phase. Note that the resources_used.walltime will simply reflect the amount of walltime used by the job during that phase.

The ‘E’ (end) accounting record will show what was still assigned to the job when the job exited, and how much of each resource the job had used when it reached the end point. If there are previous ‘u’ accounting records logged for the job, the resources_used.walltime will only show the amount of walltime used since the previous phase of the job.


#14

Looks good to me, Al.


#15

Thanks again, Greg. I just want to run by you one more thing in regards to the resources_used values.
I’ve found some old notes on the implementation of this feature where I mentioned:

bayucan@borg:~/projects/nasa/node_ramp_down> pbs_release_nodes -j 5501 federer federer[1]
ignoring the following nodes that are not part of job: federer[1]
NOTE: Above will generate the following accounting ‘u’ (update) record since ‘federer’ is part of the job and has been released. The update accounting record is now similar to an E record. exec_host/exec_vnode/Resource_List show the values in the completed phase of the job (previous), while next_exec_host/next_exec_vnode/next_Resource_List* show the values for the next phase of the job. Resources_used* values are still a summation of what’s used by the job at this point, with the exception of ‘walltime’ being the incremental value between phases:

The accounting end record shows the “next” values from the previous ‘u’ record, and walltime is actually the time accumulated from the previous ‘u’ record:


That is, on the end record, in terms of the resources_used values, only the ‘walltime’ value shows an incremental value, while the rest are still the captured summed values at the end point of that job. Or should I just be capturing the amount used in the current phase minus the amount used in the previous phase? I’m not sure how to program it that way: suppose mother superior reported resources_used.mem = 1.03 gb for the job in the previous phase, and in the current phase it again got resources_used.mem = 1.03 gb, which would make it 0 memory used. This is unlike walltime, where the value will always be increasing…


#16

Thinking about this a little more, there are several log record semantic issues that need to be resolved when it comes to reporting phases versus whole-jobs. Your first crack at it was NAS-specific and good enough, but I don’t think it’ll support the general case very well.

  1. walltime - under the current scheme the E record walltime will be an incremental value, but a log parser would only know that if it had noted the presence of u records for that job. Maybe incremental-vs-whole should be spelled out in these records, such as “resources_used.incremental_walltime” ?

  2. consumable resources - may need the same treatment as walltime, but with the wrinkle that whole-job values will be aggregations of incrementals. e.g. a job with a 1 hour phase of 10 ncpus followed by a 1 hour phase of 11 ncpus would have a 2 hour whole-job ncpus of 10.5 ncpus - right? Perhaps whole-job values just won’t be sensible in some cases, and computing them would be left as an exercise for the log parser. :slight_smile:
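The 10.5 ncpus figure in item 2 is a time-weighted average over the phases. A minimal Python sketch of that arithmetic follows; the function name `time_weighted_average` and the `(duration, value)` pair layout are assumptions, and this is only one possible way a log parser might define a whole-job value.

```python
def time_weighted_average(phases):
    """phases: list of (duration_hours, value) pairs, one per job phase."""
    total_time = sum(duration for duration, _ in phases)
    weighted = sum(duration * value for duration, value in phases)
    return weighted / total_time


# 1 hour at 10 ncpus followed by 1 hour at 11 ncpus.
print(time_weighted_average([(1, 10), (1, 11)]))  # 10.5
```

With unequal phase lengths the result shifts toward the longer phase, which is why a parser needs the per-phase durations and not just the per-phase values.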

-Greg


#17

Yes, a new name makes better sense here, resolving ambiguity.

Perhaps for backwards compatibility, I can also spell out in the ‘E’ (end) record the incremental consumable resources_used values at the last phase of the job. I can just name them as “resources_used_incr” (or whatever name is better). This will include the resources_used_incr.walltime value.
Then I can leave as is the resources_used* values calculation in the ‘E’ record as before, which is the job end - job start whole job aggregations. Then the log parser can just choose to take the resources_used_incr values from the ‘u’ record and the ‘E’ record and either sum them up, or average them out, whichever makes sense… The log parser also has the choice of using the resources_used* values for the whole job aggregation.
So in summary, both the ‘u’ record and the ‘E’ record will have the “resources_used_incr” (incremental) values and the snapshot “resources_used” values. Does this make sense?
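As a sketch of what the log parser side could look like under this proposal, the Python below sums the incremental walltime across the ‘u’ and ‘E’ records of one job. It assumes the record message is a space-separated list of `key=value` pairs and uses the proposed `resources_used_incr.walltime` name; the helper names and the sample records are made up for illustration.

```python
def parse_attrs(message):
    """Split a space-separated key=value message into a dict."""
    return dict(item.split("=", 1) for item in message.split())


def hms_to_seconds(hms):
    """Convert HH:MM:SS to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def whole_job_walltime(records):
    """records: list of (record_type, message) tuples for a single job."""
    total = 0
    for rtype, message in records:
        if rtype in ("u", "E"):
            attrs = parse_attrs(message)
            total += hms_to_seconds(attrs["resources_used_incr.walltime"])
    return total


# Made-up records for a 3-hour job with two releases.
records = [
    ("S", "user=bayucan"),
    ("u", "resources_used_incr.walltime=01:00:00"),
    ("u", "resources_used_incr.walltime=01:00:00"),
    ("E", "resources_used_incr.walltime=01:00:00 resources_used.walltime=03:00:00"),
]
print(whole_job_walltime(records))  # 10800
```

Here the summed increments match the snapshot `resources_used.walltime` on the ‘E’ record, so a parser could use either depending on whether it wants per-phase or whole-job figures.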


#18

Works for me.

(completely off-topic, but this 20 character minimum response length gets annoying at times)


#19

@gmatthew – Not sure if this has reached the level of “standard usage” yet, but I have been "like"ing a post (using the heart icon) when I want to agree with it (and I don’t want to type an additional 15 characters).


#20

I’ve now updated the design of the node rampdown feature. It’s now v.8: You might want to go to “Page History” and compare v.3 against v.8 (current).

Node Ramp Down Design (v.8)

The things that got updated are:

  1. In the accounting_logs, changed the next_* keywords to updated_*.
  2. Added the info that pbs_release_nodes will not work if one of the nodes/vnodes being released is a Cray XC node.
  3. The ‘u’ (update) and ‘E’ accounting records will now show the incremental values via the resources_used_incr.* items.