Proposal of interface regarding hold/release of subjob(s) and job array


#1

Currently the pbs commands qhold and qrls can be used on jobs and job arrays, but not on subjobs or ranges of subjobs.

I propose we allow qhold and qrls on subjob.

Interface Design Doc is here: https://pbspro.atlassian.net/wiki/spaces/PD/pages/1049854035/Changes+to+run+count+attribute+hold+and+release+operation+of+subjob+s+in+a+Job+Array

Thank you


#2

Seeing this reminded me of a use case my company has… we have an application that’s building “workflows” inside PBS by submitting multiple jobs with dependencies. Some of the “jobs” inside of this workflow are actually array jobs. For various reasons the developers have elected to use “afterok” dependencies (rather than afterany). We have cases now where some of the tasks of the array fail to complete their work (say the user underestimated the walltime for the array and not all of the tasks have the exact same runtime). As it stands now, the workflow essentially “stops” execution and abandons the jobs in the queue with unsatisfied dependencies. What the developers have decided to do is to leverage a hook that I have to run a “user epilogue” (as opposed to a system epilogue) and if that script exits with a certain status my hook will reject the job, but with a requeue. All of this works. The next issue is this can essentially create an infinite loop of tasks being requeued for ever. So the developers decided they want to enhance their “user epilogue” to check the run count and if the count is above a certain threshold they won’t request the requeue… but as of our current version of PBSPro (13.0.407) tasks don’t have a run_count at all… so they can’t do this.

So long story short… we really want to see array tasks have a run_count that goes up with each re-execution… and I think our Use Case is a little bit different then what’s described above.


#3

Since I’m not seeing any immediate critical opinions, I’m assuming its safe for me to start with design of proposed interface. Off course, meanwhile this forum will be open for down votes or suggestions or feedback on the proposal.

@arwild01, thank you for chipping in with the use case. Making run_count of subjob to increment for each rerun by server is core of the proposed solution. So I fee this will definitely help your use case.


#4

Thanks @Shrini-h. your proposal looks good to me. I have few questions on the design.

  • I am not clear with the use case behind holding the whole parent array if any of the subjob is held. is this only true in case of system hold or for user hold as well?

  • what happens if release fails on any of the subjob in case of "qrls ? will it throw error?

  • Will the subjob comment and parent job array comment be set as why they are held? also job comment will only be set for held subjob or for all subjobs?

  • " * qhold can be supplied with subjob id range as argument to apply any type of Hold on all running subjobs in the range" -> why only running subjobs?

  • please mention the format for specifying subjob ranges


#5

@anamika Thank you for the questions, pls find my answers below

Yes this is only for system hold, the use case is when there is an erroneous execution hook. Without holding the parent job, the server and scheduler will continue instantiating further subjobs which also gets held after 20 retries. Hence by putting a system hold on the parent array, the server and sched can skip the whole array till the admin or user corrects the hook.

the error will be same as with a regular job. I have now added this info in the doc

Yes the comments will be set as with a regular job. It will be set only for held subjob, however any uninstantiated queued subjobs will reflect the comment set on the parent which is the way subjob is displayed. I have now added this info in the doc

I wanted to mean the ones that have not yet finished. Thanks for catching this. I have now edited this info in the doc.

Done. I have now added it as a note at the bottom


#6

got it and makes sense. can you also add a line under system gold explicitly mentioning if user, other, or bad password hold (any hold other than system) is applied on subjobs then parent job array will not be held?

few nits:
“more like to fail to launch” should be “more likely to fail to launch”
“finsished” should be “finished”

  • Thanks for adding the error message for qrls. could you also update it to mention that subjob will remain in H state.

#7

Hey @Shrini-h,
I have one major concern. When you apply an action to a job array, you apply that action to the whole job array (running and queued subjobs). When a running job is held, it is checkpointed. This means if you apply a system hold on the entire job array, you’ll checkpoint the running jobs. Either that or you meant to just hold the queued subjobs which would not be consistent with other applications of the actions on the whole job array itself.

Also, your design document uses the old design document template. Consider using the new template since it provides a better idea of what you are thinking to do.
https://pbspro.atlassian.net/wiki/spaces/DG/pages/293076995/PBS+Pro+Design+Document+Guidelines


#8

I have now realized to fix the “bug” I just need to enable system hold release of single subjob, hence I have reduced the interfaces and moved the explanation to the design doc page. Please let me know your opinions


#9

wont be addressing these, as the interface stands deleted

Done


#10

Sorry I might have not explained it right, I didn’t mean to hold the entire array (all subjobs), but only the parent job array.

Done. (sorry, the new design template was not reflected when creating the blank page)


#11

@Shrini-h
I understand that you want to put a hold on the parent job array, but I am not sure that is consistent with how we’ve treated job arrays in the past. In the past when you act on 1234[], you are acting on all of the subjobs in 1234[]. You are putting a hold on 1234[], so shouldn’t this hold all of the queued subjobs and checkpoint all of the running subjobs? Won’t it be confusing to see the job array in state H and for there to be running subjobs? That being said, your intension is good. if one job is rejected by a hook, them most likely all of them will be since they are identical. The thing is that we usually don’t get in the business of second guessing hooks. They are meant for customers to do whatever they want. What if the begin hook rejected a subjob for a transient issue? A subsequent subjob may still be able to run.

I also think the -t option to qrls is unnecessary. We can directly act upon the subjob ids themselves. The reason we have the -t option to qstat is that it would be super-messy to print all the subjobs all the time. We added the option to print them if they were wanted.

Bhroam


#12

It is infact consistent, for example when we do qhold 1234[] it’s running subjobs continue to run and queued subjobs stay put in ‘Q’ state (kind of equivalent to a “hold”). Since only the parent job is set to ‘H’ state, I mentioned it as holding the parent job only, and I don’t intend to change this behavior.

checkpoint not supported on subjobs. (which I feel should be enabled in future)

Since we are holding after 20 retries, it will be safe to assume its not due to a transient or ephemeral issue. I feel the benefit of holding the parent Job array (assume a very big array job) outweighs the benefit of being tolerant to transient condition. Holding the array for this transient condition also act as a way to notify it to the admin who can release the hold with qrls interface.

The first point of the interface introduces a way to act upon the subjob id directly
The second point ("-t") option is similar to qstat where we give admin a way to remove hold on held subjobs and their parent all in one go.
Currently we already have a way to unhold a held job array qrls 1234[]. This we can use to unhold the parent without touching the held subjobs (i.e the other queued subjobs can continue to get run). I have added this point to the interface now


#13

Interesting. I thought doing a qhold on the parent array would act on all of the subjobs. Cool, it is consistent.

OK, so you are saying that if any one subjob failed to run more than 20 times, we’ll hold the parent job array. Sorry, I must have misread the document. This seems consistent with what we do today with normal jobs. If the use case behind this change is to fix the issue with the infinite loop of running a job on a black hole node, then do we need to hold individual subjobs? It sounds like having the run_count per subjob is enough, since we’re going to hold the parent array when we reach 20 times.

I’m not sure I agree with the first point. Why can’t you act on 1234[1] to act on the subjob directly?
As for the second point, I’m still not sure I see the need to hold an individual subjob. They’re supposed to be identical. Why would we need to differentiate between them this way? We’d also have to make the scheduler understand more about individual subjobs. Right now it knows that if a job array is in the B state, it has control to make and run subjobs based on the indices remaining. If we are making them each more special, it has to have have more special case code to handle this case.

Bhroam


#14

Not clear if you are agreeing or disagreeing.
subjobs already have run_count, but since they are not incremented between each retries, it never reaches the limit of 20 making server and mom loop infinitely (ex exec begin hook error). What I’m trying to propose with parent job is: once a subjob’s run_count breaches the limit, immediately hold the subjob and its parent array (not after parent array reaching 20 times). Hope this clarifies my point

because that is currently not allowed. Currently qrls can be used on jobs and job arrays, but not on subjobs or ranges of subjobs.
so this first point is to enable exactly what you are asking for.

firstly, I’m guessing you have might have missed the context here. All the points under Interface 3 propose changes to qrls command which is used to release hold. Either that or I’m missing the context of your query. I’m stating the second point here when qrls is invoked with a system Held Job Array id with new "-t" option, prior to releasing the hold on it, all of its system Held subjobs are released
Anyways assuming we are talking about releasing of hold on subjob, I feel that your concern on having more special case code doesn’t withstand as, the subjobs in question here have already been instantiated and broken away from the indices remaining.

If we still don’t agree on the proposal of “-t” option for qrls, I don’t mind dropping it. It made more sense when I initially proposed changes to qhold to enable it to hold multiple subjobs in subjob range.

I will wait for few days before dropping this, in case anybody else backs this proposal. Withoug this option, the admin will have to issue multiple qrls commands once for the parent array in ‘H’ and again once for each subjob that are in ‘H’.


#15

I’m sorry I wasn’t clear. I am agreeing with you. I think we should hold the parent job array if any one of the subjobs failed to run 20 times.

I’d rather you allow qrls to act on a subjob than add an option to allow it to act on a subjob. The new option doesn’t seem necessary.

I don’t understand why you can’t just act on 1234[1] or 1234[] and not add the new option. Like I said before, qstat has the -t option for cleanliness. If you had many 10000 way job arrays, the qstat output would be very messy if you always printed the subjobs. This is why we only print the job array itself unless the subjobs are asked for. I don’t see the same use case for qrls

Maybe I’m misunderstanding how array_indices_remaining will be used going forward. Right now it contains all subjobs that are not running or done. Are you saying that when a subjob is instantiated, it’ll be removed from the list? The attribute loses some of its meaning if that is the case. Now it’ll be the queued subjobs that are still part of the array. This isn’t all of the queued subjobs though.

In any case, in most commands when you type 1234[], you mean the entire job array and all of the subjobs. I’d rather this continue to be the case. Do a qrls on 1234[] and the parent job and all the held subjobs get released.

Of course I still don’t see the use case for holding a specific subjob. They are identical jobs, so why would you want to treat one differently?

Bhroam