Recently, a bug was filed against qrerun reporting that qrerun times out when it is used to rerun a job that has large job files.
However, in the background the server continues the process of rerunning the job.
There is a server attribute, job_requeue_timeout, which determines the timeout period for rerunning the job. If this attribute is not set, we default to 45 seconds.
On issuing a rerun, the execution host sends the job files back to the server. If these files are huge, it can take a while for them to be copied from MoM to the server.
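To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The file size and transfer rate are made-up figures for illustration, and the function names are hypothetical; only the 45-second default comes from the attribute described above.

```python
# Default job_requeue_timeout when the attribute is unset (from the post above).
DEFAULT_JOB_REQUEUE_TIMEOUT_S = 45


def estimated_copy_seconds(total_bytes, bytes_per_second):
    """Naive estimate: time to copy the job files at a constant transfer rate."""
    return total_bytes / bytes_per_second


def exceeds_timeout(total_bytes, bytes_per_second,
                    timeout_s=DEFAULT_JOB_REQUEUE_TIMEOUT_S):
    """True if the estimated copy time is longer than the requeue timeout."""
    return estimated_copy_seconds(total_bytes, bytes_per_second) > timeout_s


# Hypothetical example: 10 GB of job files over a link sustaining 100 MB/s.
job_files_bytes = 10 * 10**9
link_bytes_per_second = 100 * 10**6

print(estimated_copy_seconds(job_files_bytes, link_bytes_per_second))  # 100.0
print(exceeds_timeout(job_files_bytes, link_bytes_per_second))         # True
```

Under these assumed numbers the copy takes roughly twice the default timeout, so the timeout would be hit even though nothing is actually failing.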
This is where the timeout comes into the picture. The description of this attribute in the guide says only that it is the time allowed for a job to be requeued.
The guides do not say what happens if the timeout is hit.
So, for now, all the timeout does is throw a spurious error that causes problems for the client without anything having really gone wrong.
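The behaviour described so far can be mimicked with a toy model in Python. This is not PBS code: the function name, the thread standing in for the server, and the scaled-down durations are all assumptions made purely to illustrate "client reports a timeout, work still completes".

```python
import threading
import time


def rerun(copy_seconds, timeout_seconds):
    """Toy model of a rerun request.

    Returns what the client sees, plus an event that is set once the
    simulated server-side requeue has actually finished.
    """
    requeued = threading.Event()

    def server_side_requeue():
        # Stands in for copying the job files from MoM to the server.
        time.sleep(copy_seconds)
        requeued.set()

    threading.Thread(target=server_side_requeue, daemon=True).start()

    # The client only waits as long as the timeout allows.
    if requeued.wait(timeout_seconds):
        return "ok", requeued
    return "timeout error", requeued


# Scaled-down stand-ins for a slow copy and a shorter timeout.
client_result, requeued = rerun(copy_seconds=0.3, timeout_seconds=0.1)
print(client_result)      # timeout error

# The background work was never cancelled, so it still completes.
requeued.wait(2.0)
print(requeued.is_set())  # True
```

In this model the timeout changes only what the client is told, not the outcome of the requeue, which matches the spurious error described above.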
In the event of a network failure, the job will get requeued by node_fail_requeue.
Does anyone have any thoughts on when the job_requeue_timeout is necessary?
Would anyone have any objections if we removed the job_requeue_timeout functionality completely?
A detailed analysis can be found here.