What happens to a job when a node gets shut down while the job is running


#1

Dear All,

What happens to a job when a node gets shut down while the job is running in a cluster?

Thanks,
ANS.


#2

Hi @Ans,

When the execution host goes down, the server loses contact with it. The job is then either requeued (rerunnable jobs) or deleted (non-rerunnable jobs); whether and when that happens depends on the value of the server attribute node_fail_requeue.

From the admin guide:

The node_fail_requeue attribute can take these values:
Greater than zero: The server waits for the specified number of seconds after losing contact with a primary execution host, then attempts to contact the primary execution host, and if it cannot, requeues any jobs that can be rerun and deletes any jobs that cannot be rerun.

Zero: Jobs are not requeued; they are left in the Running state until the execution vnode is recovered, whether or not the server has contact with their Mother Superior.

Less than zero: The attribute is treated as if it were set to 1, and jobs are deleted or requeued after the server has been out of contact with Mother Superior for 1 second.

The default value for this attribute is 310, meaning that the server waits 310 seconds after losing contact with Mother Superior before requeueing or deleting jobs.
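
To make the rules quoted above easier to follow, here is a rough sketch of the decision in Python. It is only a model of the documented behaviour, not PBS code: the names Action, effective_wait and job_outcome are made up for illustration, and it assumes that the server still cannot reach the host when it retries and that the action is taken as soon as the wait has elapsed.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP_RUNNING = "left in the Running state"
    REQUEUE = "requeued"
    DELETE = "deleted"


def effective_wait(node_fail_requeue: int) -> Optional[int]:
    """Seconds the server waits before acting, or None if it never acts."""
    if node_fail_requeue == 0:
        return None  # zero: jobs stay Running until the vnode recovers
    if node_fail_requeue < 0:
        return 1  # negative values are treated as if set to 1
    return node_fail_requeue


def job_outcome(node_fail_requeue: int, rerunnable: bool,
                seconds_out_of_contact: int) -> Action:
    """Model what happens to a job whose execution host went down.

    Assumption: the host is still unreachable when the server retries.
    """
    wait = effective_wait(node_fail_requeue)
    if wait is None or seconds_out_of_contact < wait:
        return Action.KEEP_RUNNING
    return Action.REQUEUE if rerunnable else Action.DELETE


if __name__ == "__main__":
    # Default of 310 seconds, host unreachable for 400 seconds.
    print(job_outcome(310, True, 400).value)   # rerunnable job
    print(job_outcome(310, False, 400).value)  # non-rerunnable job
```

With the default of 310, a job whose host has been unreachable for 400 seconds comes out as requeued if it is rerunnable and deleted if it is not, while with a value of 0 it stays in the Running state no matter how long the host is down.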

Thanks,
Prakash