Log of the offline discussion -
So just to be sure I understand why the timeout is useful…
For a normal qrerun request the timeout serves no purpose other than to return a spurious message and cause tests to fail.
In the case where the scheduler issues the request (e.g. for preemption), the scheduler waits (right?) for low-priority jobs to be preempted before telling the server to run the high-priority job. If the low-priority jobs get requeued quickly, the scheduler somehow knows this and goes on to run the high-priority job. If a job doesn’t get requeued quickly enough, the timeout expires and the scheduler runs the high-priority job anyway, causing oversubscription.
Assuming a) we don’t want to just wait for the requeue(s) to finish and b) “some” oversubscription is OK, then the purpose of the timeout is to allow a delay before the scheduler oversubscribes the node, right? If so, then perhaps the name of the “timeout” should be something like job_requeue_delay? And the log message would be something like “qrerun: requeue delay for job . has expired, requeue still in progress”.
In the case where a hook issues a rerun command…?
Does the hook infrastructure interpret the timeout as a failure and return or…?
Would it be possible for the “delay” to only apply if the request is coming from the scheduler or is it useful at other times that I haven’t seen mentioned yet?
@arungrover 's reply -
One more thing I’d like to mention here is that… -
If our purpose is for tests to succeed, then the qrerun -Wforce option can also be used. This way no timer is registered and the prompt is returned to the user (in this case the test script) instantly. What this also means is that no output files will be copied. If the test case makes use of an output file generated by the job’s previous run, then the -Wforce option is not a good fit.
You are right about its usage in the scheduler. It gives the server enough time to requeue the job before replying to the scheduler. Oversubscription might still happen once the timeout expires, but the timeout acts as a cushion that absorbs the requeue impact in most cases.
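To make the "cushion" idea concrete, here is a minimal Python sketch of a bounded wait. It is not PBS code: `wait_for_requeue`, `requeue_done` and `delay_s` are hypothetical names used only for illustration, and the returned strings merely echo the wording suggested earlier in this thread. It only shows the general pattern of waiting for a requeue up to a fixed delay and then moving on regardless.

```python
import threading


def wait_for_requeue(requeue_done: threading.Event, delay_s: float) -> str:
    """Wait up to delay_s seconds for a (hypothetical) requeue to finish.

    Returns a status string either way; the caller proceeds in both cases,
    which is why oversubscription is still possible after the delay expires.
    """
    # Event.wait returns True if the event was set before the timeout elapsed.
    if requeue_done.wait(timeout=delay_s):
        return "requeued"
    return "requeue delay has expired, requeue still in progress"


# Case 1: the requeue finishes well inside the delay.
fast = threading.Event()
threading.Timer(0.01, fast.set).start()
fast_result = wait_for_requeue(fast, delay_s=1.0)

# Case 2: the requeue never finishes within the delay.
slow = threading.Event()
slow_result = wait_for_requeue(slow, delay_s=0.05)

print(fast_result)
print(slow_result)
```

In the first case the caller gets its answer as soon as the requeue completes. In the second it waits out the full delay and then carries on, which matches the behaviour described above where the high-priority job may still be run while the requeue is in progress.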