Dependencies not being killed when job fails


#1

Hi all,

We have noticed that with PBS Pro

pbs_version=14.1.0

some jobs are being left in a held state when jobs they depend on fail.

Probably due to a chained dependency? C depends on B, which depends on A.

A sleeps for 60 seconds and kills itself (kill -TERM $$)
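
For anyone trying to reproduce this, here is a minimal sketch of what a.pbs might look like. The actual script was not posted, so the `#PBS -N A` directive and the `SLEEP_SECS` override (a made-up variable for quick local testing) are assumptions.

```shell
# Write a hypothetical a.pbs; the real job script was not posted in the thread.
cat > a.pbs <<'EOF'
#!/bin/sh
#PBS -N A
# SLEEP_SECS is an assumed override for local testing; the default is 60 seconds.
sleep "${SLEEP_SECS:-60}"
# Send SIGTERM to this shell so the job ends with a non-zero exit status.
kill -TERM $$
EOF
```

Running it locally with `SLEEP_SECS=0 sh a.pbs; echo $?` should print a non-zero status, typically 143 (128 plus 15 for SIGTERM). The value PBS records for the job may be encoded differently, but it will not be 0.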

5941.delta d.stetne normal A 65902 1 1 4gb 01:00 R 00:00 delta005/1
5942.delta d.stetne normal B -- 1 1 4gb 01:00 H -- --
5943.delta d.stetne normal C -- 1 1 4gb 01:00 H -- --

delta 13:23:05 ~/pbs_test/depend_test
d.stetner 9898 $ qstat -1xn | grep stet
5941.delta d.stetne normal A 65902 1 1 4gb 01:00 F 00:01 delta005/1
5942.delta d.stetne normal B -- 1 1 4gb 01:00 F -- --
5943.delta d.stetne normal C -- 1 1 4gb 01:00 H -- --

B is Finished but C stays around.

But, if I make C depend on both B and A:

$ qsub a.pbs
5948.delta

$ qsub -W depend=afterok:5948 b.pbs
5949.delta

$ qsub -W depend=afterok:5948:5949 c.pbs
5954.delta

5948.delta d.stetne normal A 66040 1 1 4gb 01:00 R 00:00 delta005/1
5949.delta d.stetne normal B -- 1 1 4gb 01:00 H -- --
5954.delta d.stetne normal C -- 1 1 4gb 01:00 H -- --

5948.delta d.stetne normal A 66040 1 1 4gb 01:00 F 00:01 delta005/1
5949.delta d.stetne normal B -- 1 1 4gb 01:00 F -- --
5954.delta d.stetne normal C -- 1 1 4gb 01:00 F -- --
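
The three submissions above can be scripted so the job IDs do not have to be copied by hand. This is only a sketch: `submit_chain` is a hypothetical helper name, and it assumes `qsub` prints nothing but the job ID (such as `5948.delta`) on success, as it does in the transcript.

```shell
# Submit A, then B depending on A, then C depending explicitly on both A and B.
submit_chain() {
  a_id=$(qsub a.pbs) || return 1
  b_id=$(qsub -W depend=afterok:"$a_id" b.pbs) || return 1
  c_id=$(qsub -W depend=afterok:"$a_id":"$b_id" c.pbs) || return 1
  # Print the three job IDs so the caller can follow them in qstat.
  echo "$a_id $b_id $c_id"
}
```

Calling `submit_chain` from the directory that holds the three scripts submits the whole chain. Listing A explicitly in C's dependency is what makes C finish instead of staying held.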

So I think that C should fail when B does, even when A is not explicitly listed as a dependency.

Bug??


#2

This probably isn’t a bug, as it’s how I’ve always known PBS to behave. The dependencies use the job’s exit status. In your example, if job A exits with 0, B then runs. If A exits non-zero, B can’t run and is removed. C, then, is left because B had no exit status (it never ran). Perhaps this is a deficiency in the implementation.
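
The rule described above can be written down as a toy model. This is not PBS code, only a sketch of the behaviour seen in this thread: the function name `afterok` and the labels `run`, `held` and `removed` are made up for illustration, and `none` stands for a parent that never ran and so has no exit status.

```shell
#!/bin/sh
# Toy model of an afterok dependency. Each argument is one parent's exit
# status, or "none" if that parent never ran.
afterok() {
  result=run
  for status in "$@"; do
    case "$status" in
      0) ;;                                                    # parent succeeded
      none) if [ "$result" = run ]; then result=held; fi ;;    # nothing to act on yet
      *) result=removed ;;                                     # parent failed
    esac
  done
  echo "$result"
}

a_status=1                                 # any non-zero value: A killed itself
b_state=$(afterok "$a_status")             # B depends on A
c_chained=$(afterok none)                  # C depends only on B, which never ran
c_explicit=$(afterok "$a_status" none)     # C depends on both A and B

echo "B: $b_state"                         # prints "B: removed"
echo "C (depends on B only): $c_chained"   # prints "C (depends on B only): held"
echo "C (depends on A and B): $c_explicit" # prints "C (depends on A and B): removed"
```

The model matches both transcripts: with the chain, C only sees a parent without an exit status and stays held. With A listed explicitly, C sees A's failure directly.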

Gabe