This is a proposal to add a new hook to PBS: execjob_abort hook: EXECJOB_ABORT hook event Design - v.2
Looks good Al. Only one typo below
“execjob_end hook encountered an exception, request rejected”
-> this should be execjob_abort
Oh true, thanks! Fixed.
I’ve updated the design with details on motivation along with other things: execjob_abort hook design - v6
Simply do a page history compare between v5 and v6 to see the update…
Thank you @bayucan for adding motivation behind the proposed change. I understand that the reasoning behind introducing another event is to reduce the impact on existing setups where customers do not anticipate that execjob_end hook event will run if the job aborts. I don’t understand a whole lot about mom hooks but I have some confusion about the design.
The thing I’m confused about is what are we proposing customers to write in this new execjob_abort hook event. As I see, we want them to have an opportunity to remove any special setup they might have done for the job when it began to run. But that part of the functionality is sort of overlapping with what they would have done in an execjob_end/epilog event as well.
I think what we can do is that we can make sure that if a hook event is executed at the beginning of the job then its corresponding end event will also execute before the job leaves the mom. Like, if we run begin hook for the job then we will make sure to run end hook as well irrespective of whether job ran/deleted/aborted. Same would be the case with other pre and post job mom hook events. I know that if we do this then the post job hook events would want to know the reason of job exit/abort. To do this can’t we pass this information in job’s exit status?
@arungrover: In terms of what we are asking the customers to write into an execjob_abort hook, it’s whatever cleanup they feel is needed for the job, which may have used some resources (e.g. files) that needs to be cleaned up. It could be the same as the cleanup put into their execjob_epilogue hook or end hook. However, epilogue hook is only called at the end after job has at least started running with all nodes connected. The end hook does not run always, in particular, when sister mom fails to join job or when primary asks the job to be aborted.
In terms of why we don’t just mandate to run an execjob_end hook when there’s an execjob_begin hook, mom hooks or hooks in general have not been designed as having dependency with each other. They’re designed as just individual, stateless hooks that responds to independent events.
If we just call the execjob_end hook and letting the pbs.event().job.Exit_status determine if it was aborted, is problematic if the sister mom executed it as it’s unclear what the job’s exit status is for a “failed to join” job. The job’s exit status is more used/consulted from a primary mom executing the execjob_end hook, with the exit status < 0 likely means it was aborted. But still, this would require sites to have to update their execjob_end hook to look for this condition, which is not ideal.
Thanks for the explanation!
I thought all the start and end hooks in mom were semantically connected (or should be) to each other. I understand that as of today when a join job fails end hooks do not run on sister because MS might not be sending them a delete job request. But isn’t that the bug? When a join job fails after running a Begin hook, sister mom knows that it has failed, can’t it just run the end hook as a result of that failure?
In order to identify that the end hook is running before even the job process starts, can’t we set a PBS defined value in Exit_status and also make sister moms see it. I agree that sites might have to make sure that they check for exit_status in their end hooks even when run in the context of sister mom but I feel that is a less disruptive model than adding another hook event with sort of overlapping functionality.
It depends on the interpretation of the original execjob_end hook design, which is something that executes after job has run through its completion and about to exit mom, or when job has been interrupted by qdel or qsig, or there are communications problems between the primary mom and the sisters after job has already started running, with all sister moms successfully connected to the job. In this case, I don’t consider this as a bug.
I feel it’s more disruptive to the end users to impose this requirement on them. Calling end hook in places we don’t call before could also introduce regressions in existing tests. Code-wise, I feel it’s quite straightforward to adding new hook events to the code, and not worry about setting a job’s Exit_status value on the sister mom side so that execjob_end hook can consult it, which is new territory.
Normally, job will exit completely after the execjob_end hook runs. However, the job may actually requeue if there’s an earlier execjob_epilogue that executed, which instructed job to be requeued via the pbs.event().job.delete() call.
shouldn’t that be pbs.event().job.rerun() ?
Yes, it should be rerun(). I’ll update the design.
I’ve made small addition to the design v.8 Execjob_abort hook design
“Normally, a job will requeue after abort hook executes, but there maybe cases when the job would actually exit completely, if there are earlier execjob_begin or execjob_prologue hooks that executed, which instructed the job to be deleted via the pbs.event().job.delete() call.”
needed to be qualified further with:
“Job would also exit completely if an earlier execjob_launch hook resulted in a rejection.”