PP-339 and PP-647: release vnodes early from running jobs


#61

I can only see bad things happening if someone released a node that was part of a Cray reservation. PBS would think it was free and Cray wouldn’t. PBS would try to use it again and Cray would reject it. We’d end up with a lot of held jobs.

What should we do about it? I don’t know. I don’t think there are any good answers. The only one I see is what Lisa says: look for “cray_” and document that if you want to use the node ramp down feature, don’t put “cray_” in the vntype of any node you want to use with the feature.

Bhroam


#62

First, just an implementation idea as an alternative to checking for “cray_” in vntype (which is not currently mandated): can we have the server issue the release nodes request to the job’s primary execution host and have any pbs_mom that is running with $alps_client set reject the request?
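
Roughly the check I have in mind, as a standalone C sketch rather than actual pbs_mom code (the helper name and the apbasil path are just illustrative):

#include <stdio.h>

/* Sketch only: decide whether a mom should honor a release-nodes request.
 * "alps_client_path" stands in for the value of the $alps_client directive
 * in the mom config (NULL if the directive is not set). */
static int
release_allowed(const char *alps_client_path)
{
    /* If $alps_client is set, this mom fronts an ALPS-managed Cray system,
     * and the underlying ALPS reservation cannot be shrunk mid-job. */
    return alps_client_path == NULL;
}

int
main(void)
{
    printf("%d\n", release_allowed(NULL));                                  /* 1: ordinary cluster mom */
    printf("%d\n", release_allowed("/opt/cray/alps/default/bin/apbasil"));  /* 0: Cray login mom */
    return 0;
}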

Interface 2 is silent on whether or not it is supported for jobs involving Cray XC nodes, and I think it needs to be explicit since it is wholly separate from interface 1 from a user/admin perspective (though of course the back end code is in common). I believe it would not work on such jobs at present since my understanding is that the ALPS reservation persists until after file stageout completes (I may be wrong about that, though, please feel free to set me straight if I am!).

In the future I think it may be worth investigating if we can continue to disallow incremental shrinking of a job while it is still running for jobs running on Cray XC systems (since we can’t modify the ALPS reservation) but allow the “release_nodes_on_stageout” feature to be used by destroying the ALPS reservation before file stageout (and also freeing the compute nodes in PBS of course). In my mind this may be possible to support since we know the job is done with the compute nodes by the time file stageout happens, whereas with interface 1 it can be called at any time in the job’s lifecycle. This would be out of scope for the current work in my opinion, though.


#64

Your suggestion seemed reasonable.


#65

This is possible, although will Cray login nodes also have $alps_client set on the mom side, or will it be set only for Cray compute nodes? (@lisa-altair) I think it’s best not to ramp down vnodes with vntype ‘cray_compute’, ‘cray_login’, or ‘cray_compile’, and perhaps any vnode with a ‘cray_*’ vntype in the future.


#66

There is no pbs_mom running on the hosts that get represented as “vntype=cray_compute”; those vnodes are represented and accessed through the pbs_moms running on the hosts represented by the vnodes with vntype=cray_login, where $alps_client must be set. (And remember those vntype values CAN be changed arbitrarily; those are just the default values.)
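
For concreteness, “where $alps_client must be set” means the mom config on those login hosts contains a line along these lines (the exact apbasil path varies by site; this one is only an example):

$alps_client /opt/cray/alps/default/bin/apbasil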

It is certainly possible (though as far as I know uncommon) for a cray_login type node to serve as the primary execution host for a job that does not actually use any cray_compute type nodes, where a node release would technically be possible but disallowed by this approach. If others see this as an unacceptable drawback please speak up.

An unstated benefit of this proposed approach is that it keeps all of the Cray-specific code in the mom, which is (I believe) still the only place we have anything truly specific to this platform.


#67

I see, OK, this approach makes sense then. Should I explicitly state in the EDD that for Cray X* series nodes, it will be those managed by a pbs_mom with $alps_client set, or is that too much of an implementation detail?


#68

‘release_nodes_on_stageout’ will have the same restrictions as pbs_release_nodes. OK, I’ll make it explicit in the doc.


#69

Interesting thought…


#70

I’d like to hear from @lisa-altair and/or @bhroam (and maybe @mkaro) on their views of this implementation idea before we go much further, but as for how to state this in the EDD I’d like to be as precise as possible with something like “this interface is not supported for jobs which have a primary execution host running a pbs_mom with $alps_client set, and will return this error …”.


#71

Interesting point. This would also work for pbs_release_nodes -a (just not a partial release).

Crays are a funny beast. If a job doesn’t request a Cray mom node as its first chunk, it will be assigned one (but I don’t think it shows up as MS). While it is common practice that the login nodes are also the mom nodes, do they have to be? Could a site do something like put their mom on the sdb node? Should we be assuming the moms are on login nodes?

As for what the EDD says, I’d use language we use in the docs. I seem to remember the term ‘inventory mom’ used somewhere. I might be misremembering though.

Bhroam


#72

It’s probably sufficient to state that Cray compute nodes are excluded without being more specific. I assume that MAMU nodes will be supported.


#73

I’m okay with this.

This phrasing seems good to me.


#74

@scc: pbs_release_nodes will not allow a node to be released that is managed by the mother superior, which is the primary execution host. There’s a specific error for this that is mentioned in the EDD:

EDD excerpt:
pbs_release_nodes will report an error if any of the nodes specified are managed by a mother superior mom.
Example:
% pbs_release_nodes -j 241 borg[0]
pbs_release_nodes: Can’t free ‘borg[0]’ since it’s on an MS host

I can just add a statement that nodes tied to Cray X* series systems are those managed by a mom with $alps_client set, and pbs_release_nodes will return the appropriate, existing message about not allowing sister nodes tied to Cray X* series systems to be released.

existing EDD excerpt:
% pbs_release_nodes -j 253 cray_node
"pbs_release_nodes: not currently supported on Cray X* series nodes: <cray_node>"


#75

  • I don’t understand why the -j <job identifier> circumlocution should be required. Is it not anticipated that releasing the sister nodes will be a very common usage case? If so, then why shouldn’t we be able to effect this using simply pbs_release_nodes -a? The job identifier is readily available in the running job’s environment.
  • regarding the “the MS will hold onto the data, and will be added during final aggregation of resources_used values when job exits” note - what happens to those data if the MS is restarted?
  • what does “partial release of vnodes may result in leftover cpusets” mean? I can understand that we might not want to attempt to shrink a CPU set on a host using pbs_mom.cpuset, but given that we always allocate a single, indivisible CPU set to a job on an Altix host running the CPU set MoM, I can’t see why there’s a difficulty in clearing all the vnodes from a CPU set-enabled MoM’s host. Is it perhaps just a matter of documenting that when releasing a CPU set-enabled MoM from a job, the host name must be specified rather than the individual vnode name(s)?
  • is there information missing after the “those vnodes that are part of a cgroup would not get automatically released” note? Should an “until the entire cgroup is released” clause be added? Or ...?
  • should the “pbs_release_nodes: Unauthorized Request” message include information about the originator of the attempted action?
  • nit: use of “these” when referring to a single item (as in the example of freeing a single node, lendl249: “these nodes are not part of the job: lendl”) is bad English
  • regarding the use of pbs_release_nodes on certain Cray or CPU set-enabled systems: is this check made first (good) or after having partially completed the task of releasing nodes (bad)?
  • why must pbs_release_nodes always be verbose? Can it not be given a flag that says to report success or failure only via exit status?
  • so “At every successful pbs_release_nodes call, qstat will show […]” implies that pbs_release_nodes always also causes qstat to be called? Ick. This seems like a choice that ought to be left to individual sites’ discretion.
  • regarding “Note that the execjob_end hook will not execute on the host in this case”: this seems unkind to any sites that have actions embedded in such a hook and now might have to choose whether to relocate whatever was in the hook or forego pbs_release_nodes.
  • regarding the pbs_relnodesjob() API: it’s safer for the extend parameter to be of type void * rather than char * - we are not prescient and cannot anticipate what we might want to eventually use it to convey (see the sketch after this list).
  • nit: in the line following tail -f /var/spool/PBS/server_priv/accounting/201701231, an 01/ has been omitted from the log message date stamp; also note that it would be helpful to describe (or highlight) the differences the user should see between the two tail -f invocations
  • does
    “The ’c’ accounting record will show the next assigned exec_vnode, exec_host, Resource_List. along with the job attributes in the new/next phase of the job”
    imply that as MoMs are added to a job’s sisterhood and confirm their membership with the Mother Superior, there’s a new record for each of the next assigned exec_vnode, exec_host […]? I assume not, but please clarify this.
  • interface 5 uses the term “phased job”, for which I see no definition; is this intended to be a term of art?
  • interface 5 also refers to “the log parser”. Is this intended to refer to a PBS facility or generically (for the latter, logging via syslog would be one such example of many log parser back ends)? If the latter, that may be quite a burden that’s being imposed on sites.
  • interface 5 ends with “showing the job’s values in total at the end of the job:”. Nothing follows that terminating colon.
  • regarding interface 6: will the message always be “Node<sister-mom-hostname>;deallocating 1 cpus from job <job-id>” - that is, will it always report one CPU? If so, don’t pluralize; if not, use plural only when appropriate.
  • interface 7 would likely be more helpful if it noted what attributes were available rather than the unhelpful “Some example internal job attributes […] are”
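
Regarding the extend parameter bullet above, here is the shape I am picturing; the exact pbs_relnodesjob() prototype is my assumption from the EDD discussion, not copied from pbs_ifl.h, and the second name is purely hypothetical:

/* Assumed prototype, matching today's IFL convention of a char *extend: */
int pbs_relnodesjob(int connect, char *jobid, char *node_list, char *extend);

/* The suggestion: an opaque pointer would let a future caller pass a richer
 * structure through 'extend' without changing the prototype. */
int pbs_relnodesjob_ext(int connect, char *jobid, char *node_list, void *extend);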

#76

Thanks, Al. I am not sure if this is what you are implying with your reply or not, but the primary execution host name of the job will appear in the Mom = line of all of the X* execution hosts, so if that is the mechanism the code is using to determine whether or not a vnode is managed by a mother superior mom, then it may already be rejected, with no additional $alps_client checking required. Even if this is already the case, the additional check in the code and more explicit detailing in the EDD may still be beneficial, though.

Also, while re-reading this, it struck me that “since it’s on an MS host” is probably not what we want in the error message; instead I think we should use “primary execution host”. I think it is OK (though not ideal) to use the two terms interchangeably in the text of the EDD, but for actual log messages in the software I would very much prefer that we stick to “primary” and “secondary” rather than “MS” and “sister”.


#77

Scott, yes, by MS host I mean the primary execution host, i.e. the one appearing as the “Mom” vnode attribute value.
I see in the admin guide that we interchangeably use “mother superior” and “primary execution host” as well as “secondary execution host” and “sister”, although “secondary” is not really defined in our reference guide; instead we mention “subordinate mom”. I’ll keep the use of “mother superior” and “sister” in the EDD, but I’ll go ahead and change “MS” to “primary execution host” in the error message pbs_release_nodes gives when releasing a mother superior vnode.

Here are the definitions in the PBS reference guide:

"Mother Superior
Mother Superior is the MoM on the head or first host of a multihost job. Mother
Superior controls the job, communicates with the server, and controls and consolidates
resource usage information. When a job is to run on more than one execution
host, the job is sent to the MoM on the primary execution host, which then starts the
job. Moved

Primary Execution Host
The execution host where a job’s top task runs, and where the MoM that manages the
job runs.

Sister
Any MoM that is not on the head or first host of a multihost job. A sister is directed
by the Mother Superior. Also called a subordinate MoM.

Subordinate MoM
Any MoM that is not on the head or first host of a multihost job. A subordinate
MoM is directed by the Mother Superior. Also called a sister.


#78

You’re the second one who has suggested not making the ‘pbs_release_nodes -j’ option required; rather, if it’s not given, the command is likely being called inside a job, where pbs_release_nodes can just get the job id from the $PBS_JOBID environment variable. Initially, I didn’t want to do this because pbs_release_nodes may not apply just to running jobs but also to reservations via a new option later, say -r. But I’m getting convinced that we should allow what you suggested. Unless someone objects, I’ll go ahead and make the EDD change.
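
A minimal sketch of the fallback behavior being suggested (illustrative only, not the actual pbs_release_nodes source):

#include <stdio.h>
#include <stdlib.h>

/* If no -j <job id> was supplied, fall back to the PBS_JOBID environment
 * variable that PBS sets inside a running job. */
static const char *
resolve_jobid(const char *jobid_opt)
{
    if (jobid_opt != NULL && *jobid_opt != '\0')
        return jobid_opt;          /* -j was given on the command line */
    return getenv("PBS_JOBID");    /* NULL if run outside a job */
}

int
main(int argc, char *argv[])
{
    const char *jobid = resolve_jobid(argc > 1 ? argv[1] : NULL);
    if (jobid == NULL) {
        fprintf(stderr, "pbs_release_nodes: no job id given and PBS_JOBID is not set\n");
        return 1;
    }
    printf("would release nodes for job %s\n", jobid);
    return 0;
}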

Since mom holds onto the data for the job, it will also be saved on disk in the internal job files. So if the mom is restarted, it will just recover the data from the job file.

Yes, the trouble is in releasing individual vnodes managed by a cpuset mom. We can enhance pbs_release_nodes to work with cpuset-enabled moms in a later release; it’s not targeted for this initial version.

Yes, I think the clause “until the entire cgroup is released for the job” should be added.

The information can be obtained from the server logs, much as it works with other PBS commands like qrun executed by non-root. This reminds me, I need to put in the EDD that if pbs_release_nodes fails with “Unauthorized User”, then the server logs would show a message like:
6/27/2017 15:13:45;0020;Server@corretja;Job;15.corretja;Unauthorized Request, request type: 90, Object: Job, Name: 15.corretja, request from: pbsuser@corretja.pbspro.com

Good point. I’ll replace “these nodes” with “node(s)” so it applies whether a single node or multiple nodes appear in the list in the error message.

As with the other cases, all vnodes specified in pbs_release_nodes must be releasable; if any one of them fails a check (like the Cray check), then none gets released.

Yes, that would be a nice option. We can add this enhancement in a future release of the node ramp down feature.

Of course qstat will not be called automatically by pbs_release_nodes! It’s not meant to be implied that way.

The release-vnodes-early request from pbs_release_nodes (i.e. IM_DELETE_JOB2) is different from a normal delete job request (i.e. qdel/IM_DELETE_JOB): the former happens while the job has not yet ended, whereas with the latter the job is at its end. So for the former, the execjob_epilogue hook executes, while for the latter the execjob_end hook executes. So yes, sites will be made aware of this via our documentation.
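
A toy illustration of that mapping (the request names come from the explanation above; the dispatch itself is not how mom is actually structured):

#include <stdio.h>

enum im_req { IM_DELETE_JOB, IM_DELETE_JOB2 };

/* Which execjob hook fires for which request, per the explanation above. */
static const char *
hook_for(enum im_req req)
{
    switch (req) {
    case IM_DELETE_JOB2: return "execjob_epilogue";  /* early vnode release; job keeps running */
    case IM_DELETE_JOB:  return "execjob_end";       /* normal end of job (e.g. qdel) */
    }
    return "?";
}

int
main(void)
{
    printf("%s\n", hook_for(IM_DELETE_JOB2));
    printf("%s\n", hook_for(IM_DELETE_JOB));
    return 0;
}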

I’m just being consistent with the other PBS API functions. All the ‘extend’ parameters are of “char *” type. Perhaps it will be an infrastructure project in PBS later to convert all the types to “void *”.

Ok, will fix this.

I’ve actually highlighted them in different colors: blue, green, red…

No, it only happens when there’s a release-nodes action. I’ll need to clarify that the new accounting records appear as a result of a release-nodes action.

Yes, this needs to be defined exactly in the EDD.

Refers to a PBS facility.

It’s supposed to be a period (.). I’ll fix.

I’ll change it to “cpu(s)” so it can be applied to both cases.

It’s a private, experimental interface showing some internal attributes. I’ve listed what’s there so far, but more could be added later.


#79

It might be good to raise the issue of the two terms having very similar meanings with our documentation department. Seems like some explanation about usage context might help.


#80

In thinking about this some more and trying it code-wise, it goes a different route from how node ramp down is currently implemented. The pbs_release_nodes request is sent to the server; the server figures out which sister nodes are allowed to be released and releases them, modifying the appropriate internal attributes and structures, and then tells the primary execution mom how the assigned vnodes now look so it can update its own internal tables/structures. Moving this logic entirely onto the primary mom side would be a major change, and many issues and subtleties arise in doing that.
So I’ll have to go back to Bhroam’s proposal to look for the “cray_” string in the vnode’s vntype value to determine that the vnode is not allowed to be released. This will be added to the EDD to replace the note about “vnodes managed by a mom with $alps_client set in the config file”.
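
A sketch of the check as I picture it (standalone, not the actual server code):

#include <stdio.h>
#include <string.h>

/* Treat any vnode whose vntype starts with "cray_" (cray_compute,
 * cray_login, cray_compile, ...) as not releasable. */
static int
vnode_is_releasable(const char *vntype)
{
    if (vntype != NULL && strncmp(vntype, "cray_", 5) == 0)
        return 0;   /* cray_* vnode: refuse to release */
    return 1;
}

int
main(void)
{
    printf("%d\n", vnode_is_releasable("cray_compute"));  /* 0 */
    printf("%d\n", vnode_is_releasable("compute_node"));  /* 1 */
    return 0;
}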


#81

Rather than have the releasability of a node special-cased for certain hard-coded vntypes, why shouldn’t releasability be a vnode attribute all by itself?
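
Purely for illustration (the attribute name “releasable” and the node name are hypothetical and do not exist today), the idea would let an admin mark vnodes directly, e.g.:

% qmgr -c "set node cray_login_1 releasable = False"

That would also cover non-Cray cases, such as CPU set-enabled hosts, without any hard-coded vntype strings in the code.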