Automatic re-queue of failed jobs


#1

Has anyone run into a situation where a job re-queues itself after a power failure? We had an unusual power failure that took down a set of compute nodes and the private Ethernet switch. Running jobs went back to a ‘Q’ (queued) state for some reason. Once the compute nodes were restored, the jobs automatically restarted and went to an ‘R’ state, overwriting all previously generated data.

I’ve tried to reproduce the situation and have been unsuccessful:

Any time I rebooted the slave compute node while the master compute node remained on, the job failed immediately. The data remained intact and the .o file was copied back to the head node.

Any time I rebooted the master compute node, the job hung in the queue in an ‘R’ state. All the files were kept in the working directory except for the .o file.

In my testing the scheduler did as I hoped it would: it kept the data intact without losing potentially days of work. I’m just wondering whether there is a situation where jobs go back to a ‘Q’ state on failure, or if this was just an anomaly we saw.

Note - The nodes are disk-less

Thanks in advance, Kyle


#2

Kyle,

  • Jobs are only requeueable if the qsub option ‘-r’ is set to ‘y’ (the default is ‘y’)
  • The .OU and .ER files of requeued jobs are copied to the PBS server’s $PBS_HOME/spool directory so that the next run can append to them
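
If the goal is to keep a restarted job from overwriting earlier results, one option is to mark the job as not rerunnable and to add a guard in the job script. The following is only a minimal sketch: the `-r n` directive is the qsub option mentioned above, while the `claim_run_dir` helper and the `results` directory name are hypothetical and would need adapting to your workflow.

```shell
#!/bin/sh
#PBS -r n
# The directive above marks the job as not rerunnable, so the server should
# not requeue it after a node failure.

# Hypothetical helper: claim an output directory for this run.
# mkdir without -p fails if the directory already exists, so a second
# start of the same job cannot silently reuse (and overwrite) it.
claim_run_dir() {
    mkdir "$1" 2>/dev/null
}

# Usage inside the job script (directory name is an assumption):
#   cd "$PBS_O_WORKDIR" || exit 1
#   claim_run_dir "results" || {
#       echo "results/ already exists; refusing to overwrite" >&2
#       exit 1
#   }
```

For a job that is already queued, `qalter -r n <jobid>` should change the same attribute without resubmitting.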

Could you please check the node_fail_requeue value set in your PBS configuration (qmgr -c "p s" | grep -i requeue)?
node_fail_requeue: the time the server waits for the primary execution host to come back up before it requeues or deletes that host’s jobs. Setting the value to 0 disables it. If the parameter is unset, it reverts to 310 seconds after the PBS server daemon is restarted.
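
As a rough sketch of how to check this, the helper below pulls the node_fail_requeue value out of the server settings. It assumes that qmgr prints server attributes as `set server <attribute> = <value>` lines; the function name `requeue_setting` is made up for illustration.

```shell
# Hypothetical helper: read `qmgr -c "print server"` output on stdin and
# print the node_fail_requeue value, or "unset" if no such line is present.
# Assumes lines of the form: set server node_fail_requeue = 310
requeue_setting() {
    awk '$1 == "set" && $2 == "server" && $3 == "node_fail_requeue" {
             print $5; found = 1
         }
         END { if (!found) print "unset" }'
}

# Usage on the PBS server host:
#   qmgr -c "print server" | requeue_setting
#
# To disable requeue on node failure (needs PBS manager privileges):
#   qmgr -c "set server node_fail_requeue = 0"
```

If the helper prints "unset", the description above suggests the server falls back to 310 seconds once the daemon is restarted, so setting the value explicitly is the safer choice.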