Jobs stuck in Exiting state


#1

Dear Wizards,

I have encountered what must be a pretty normal problem.

If a user submits a job from a directory where he/she does not have write access, the job apparently gets stuck in “Exiting” state, presumably because the log files cannot be written to that directory at job end. At this point the jobs cannot be deleted with qdel, and they do not seem to actually end, so they lock up resources.

This raises a few questions from me:

  1. How do I “clear” the system of such jobs?

  2. How do I ensure that lusers cannot hang the entire system in this way? (Do I need to write a submitjob hook to check the write permissions on the submit-dir or something?)
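
One way to approach question 2 without a server-side hook is a small client-side wrapper around qsub. The sketch below is not a PBS feature: `safe_qsub` and `submit_dir_writable` are hypothetical names, and it only checks the current directory, so it assumes the output/error files are returned to the submit directory.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given directory exists and is
# writable by the current user.
submit_dir_writable() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Hypothetical wrapper: refuse to submit from a directory where the
# output/error files could not be written back; otherwise pass all
# arguments through to qsub unchanged.
safe_qsub() {
    if ! submit_dir_writable "$PWD"; then
        echo "safe_qsub: no write access to $PWD, not submitting" >&2
        return 1
    fi
    qsub "$@"
}
```

A wrapper like this only helps users who actually call it, and it does not look at paths given with -o or -e, so it is a convenience check rather than an enforcement mechanism.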

I assume this is a pretty ordinary issue?

/Bjarne


#2

Hi @buchmann

Ideally the job should end after some time. Mom retries the file copy, and if it still fails after a few attempts, it logs the errors and replies to the server with a “Post job file processing error”, at which point the server removes the job.
If this is not happening in your case, check the mom logs to see what happened to the post-job file copy.
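
As a rough sketch of that log check, the function below pulls the lines for one job out of a single day's mom log. It assumes the logs live in `$PBS_HOME/mom_logs/` with one file per day named YYYYMMDD, and that PBS_HOME is /var/spool/pbs when unset; `mom_log_for_job` is a hypothetical helper name, and it would be run on the execution host.

```shell
#!/bin/sh
# Print the lines mentioning a job ID from one day's mom log.
# Arguments: job ID, optional date as YYYYMMDD (defaults to today).
# The /var/spool/pbs fallback for PBS_HOME is an assumption.
mom_log_for_job() {
    jobid=$1
    day=${2:-$(date +%Y%m%d)}
    logfile="${PBS_HOME:-/var/spool/pbs}/mom_logs/$day"
    # -F treats the job ID as a fixed string, so the dot is not a wildcard.
    grep -F -- "$jobid" "$logfile"
}
```

For example, `mom_log_for_job 1234.myserver` (a made-up job ID) would show today's entries for that job, including any copy errors that mom logged.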

Now, to clear up your system, have you tried forcefully deleting the jobs with "qdel -Wforce "?
Also, if users do not need these output/error files, you can have them submit jobs with the “-koe” parameter so that Mom does not attempt to copy these files.
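
If several jobs are stuck, the forced delete could be scripted along these lines. This is only a sketch: it assumes the default qstat listing, where the job state is the fifth whitespace-separated column and “E” means Exiting, and the function names are hypothetical.

```shell
#!/bin/sh
# Read default qstat output on stdin and print the IDs of jobs whose
# state column is "E". Assumes the state is field 5; header and
# separator lines never have "E" there, so they are skipped.
exiting_jobs() {
    awk '$5 == "E" { print $1 }'
}

# Force-delete every job currently reported in state E.
force_delete_exiting() {
    qstat | exiting_jobs | while read -r id; do
        qdel -Wforce "$id"
    done
}
```

Running `qstat | exiting_jobs` on its own first shows which jobs would be removed before anything is deleted.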

  • Arun Grover

#3

Assuming you have PBS configured to use scp to copy the files, you should know that if the scp fails PBS will fall back to trying pbs_rcp, which with some system/firewall configurations can hang. If that is the case here, you can either replace pbs_rcp with a symlink to /bin/false, or put PBS_RCP=/bin/false in pbs.conf and restart the pbs_mom processes.
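
The pbs.conf variant could be applied with something like the following sketch. `set_conf_key` is a hypothetical helper, and /etc/pbs.conf in the usage note is the usual location but still an assumption about your installation.

```shell
#!/bin/sh
# Set KEY=VALUE in a pbs.conf-style file: drop any existing KEY= line,
# then append the new one. Writes back with cat so the file keeps its
# owner and permissions.
set_conf_key() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # grep -v exits non-zero when nothing is left over; that is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

As root on each execution host this would be `set_conf_key /etc/pbs.conf PBS_RCP /bin/false`, followed by restarting pbs_mom in whatever way your installation provides.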


#4

No, I had not tried qdel -Wforce. Yes, that works. Thanks.
However, we still need to avoid it, and we do need the output/error files (just in case things go amiss, we want to be able to do a post-mortem analysis).

The pbs_rcp workaround seems to work. The Exiting state still takes abnormally long (about a minute or so), but it does finish, so the nodes get cleared for further jobs.

Kudos to you both. Thanks!

/Bjarne