PP-998 On subjob rejection by a hook the message for the job array is not correct


#1

Hi all,

Currently the PBS server sets a job array’s comment attribute to

Not Running: PBS Error: Execution server rejected request

when one or more of its subjobs fail. This was an intended change made as part of PP-374. However, the job array case was not clearly stated in the EDD.

I think the current message is not very intuitive for users. Some subjobs can still be in state ‘R’ while this message is set on the job array. I would like to start a discussion about the best way to handle the job array case.

Thanks,


#2

We’ll have to hear from @nithinj on this topic. The EDD only talks about jobs. It makes me think that job arrays weren’t taken into consideration when writing this EDD.

Prior to 13, the comment of a job array was handled very differently from that of a job. When a job was run, the comment would change either to a message saying the job was run or to a failed-to-run message saying the server rejected the runjob request. Job arrays were different. The job array comment would remain in flux until the first subjob was run. Once the first subjob was run, the comment would be changed to a message saying the job array has begun. From that point forward the comment would not change.

The scheduler continues with this behavior. It will not change the comment of a job array in state B.

This is all for the exact reason you state. It would be confusing for the job array to say it can’t run when there are running subjobs.
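As a minimal sketch of the pre-13 comment handling described above: the comment stays in flux while the array is queued and is frozen once the first subjob runs and the array moves to state ‘B’. The `JobArray` class, the function names, and the `<array has begun>` placeholder text are all hypothetical and used only for illustration; they are not PBS data structures or actual PBS messages.

```python
from dataclasses import dataclass


@dataclass
class JobArray:
    """Hypothetical stand-in for a job array; not a PBS data structure."""
    state: str = "Q"      # 'Q' = queued, 'B' = begun
    comment: str = ""


def set_comment(array: JobArray, text: str) -> bool:
    """Change the comment only while the array has not begun."""
    if array.state == "B":
        return False      # comment is frozen once the array is in state 'B'
    array.comment = text
    return True


def first_subjob_ran(array: JobArray, begun_text: str) -> None:
    """Record the one-time 'begun' comment and move the array to 'B'."""
    array.comment = begun_text
    array.state = "B"


reject = "Not Running: PBS Error: Execution server rejected request"
array = JobArray()
set_comment(array, reject)                     # still 'Q': comment is in flux
first_subjob_ran(array, "<array has begun>")   # placeholder text, an assumption
changed = set_comment(array, reject)           # ignored: array is in state 'B'
print(array.state, changed, array.comment)
```

With this rule a later subjob rejection never overwrites the comment of an array that already has running subjobs.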

Even if this change is intentional, we now have inconsistent behavior between the scheduler and the server. The scheduler won’t change the comment on a job array once it is in state ‘B’ and expects the comment to remain constant. The current server behavior will change the comment on the job array if the mom rejected one of its subjobs.

We should remove this inconsistency. Either things should go back to the pre-13 behavior, or the scheduler should start changing the job array comment to the reason the current subjob cannot run.

I personally think we should revert to the pre-13 behavior since it is less confusing for our users.

Bhroam


#3

Thanks, @bhroam, for explaining this. My intention while implementing PP-374 was to match the pre-13 behavior for job comments. I saw similar behavior when I tested on those versions, so I proceeded with that.

=====================================
Test logs from 12.2(job array comment):
======================================
[root@machine PBSPro_12.2.0]# qsub -J 1-3 job.scr
2[].machine
[root@machine PBSPro_12.2.0]# qstat -swnt
machine:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
1.machine                      root            workq           job.scr         --       1    1     --     --    Q --
   Not Running: PBS Error: Execution server rejected request
2[].machine                    root            workq           job.scr         --       1    1     --     --    Q --
   Not Running: PBS Error: Execution server rejected request
2[1].machine                   root            workq           job.scr         --       1    1     --     --    R --
   machine/0
2[2].machine                   root            workq           job.scr         --       1    1     --     --    Q --
   Not Running: PBS Error: Execution server rejected request
2[3].machine                   root            workq           job.scr         --       1    1     --     --    Q --
   Not Running: PBS Error: Execution server rejected request
[root@machine PBSPro_12.2.0]# qstat --version
pbs_version = PBSPro_12.2.0.x

It is news to me that the job array updates its comment only at the completion of the array. But the state of the job in my test is ‘Q’ rather than ‘B’. I’ll check what is wrong with my testing and get back to you.

Thanks!


#4

Hey @nithinj,
Thanks for the explanation. Your testing shows a rather interesting point. The reason you’re seeing the “Execution server rejected request” message is that the job array has not moved into the ‘B’ state yet. The comment stays in flux until it moves to the ‘B’ state.

Try that test again where you run one subjob and reject the second. The job array comment should remain static.

Since your intention was to restore the pre-13 behavior, we probably should only change the job array’s comment if it is not in the ‘B’ state. Will that be difficult in the server code? Has the job array already moved into the ‘B’ state by the time you’re updating the comment? If so, this could be trickier.

Bhroam


#5

What is an “execution server”? That term is not familiar to me.


#6

This is the message associated with PBSE_MOMREJECT (15041). It’s saying the mom rejected the request to run a job.

Bhroam


#7

Let’s have the message say “MoM”, rather than “execution server”.


#8

@agurban,
This isn’t a new message. It’s over 20 years old. This bug is not changing behavior. The behavior changed unintentionally in 13.0. This bug is changing us back to the pre-13 behavior.

Bhroam


#9

I agree with you. The comment of a job array does not get updated if its state is “B”.

I believe the changes should be rather straightforward. The server changes the state to “B” if it receives positive feedback from the mom; otherwise it leaves the state alone and simply updates the job comment with the error message.
https://github.com/PBSPro/pbspro/pull/176/files
That is probably where we should add a check for an array job and its state.
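A rough sketch of the check being proposed, written in Python rather than the server’s C for brevity. It assumes the server knows, at the point where it handles a mom rejection, whether the job is an array parent and what its current state is. The function name and parameters are hypothetical and do not come from the linked pull request; only the comment string is taken from this thread.

```python
# Comment text quoted earlier in this thread (message for PBSE_MOMREJECT).
MOM_REJECT_COMMENT = "Not Running: PBS Error: Execution server rejected request"


def comment_after_mom_reject(is_array: bool, state: str, current: str) -> str:
    """Return the comment a job should carry after the mom rejects a run request.

    Hypothetical helper: an array parent that has already begun (state 'B')
    keeps its existing comment; anything else gets the rejection message.
    """
    if is_array and state == "B":
        return current
    return MOM_REJECT_COMMENT


# An array with running subjobs keeps its comment; a queued array does not.
print(comment_after_mom_reject(True, "B", "<existing comment>"))
print(comment_after_mom_reject(True, "Q", "<existing comment>"))
```

Whether the array has already moved to “B” at the time the rejection is processed is the open question raised earlier in the thread, so the state passed in here is an assumption about what the server can see at that point.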

I’m not seeing the job get “held” even after it has been rejected more than 21 times. I’ve created PP-1010 to handle this.