PP-465: qrerun timeouts when big job files are being copied from MoM to server


#21

Hi Bhroam,

Thank you for reviewing the EDD. I have already added the new message in the second last bullet point.

The last bullet point is corrected to say 45 seconds.

Thanks,
Prakash


#22

Thanks for making the change. I totally missed the message.

I have one question. What happens if you give multiple job IDs to be requeued? Like
qrerun 12 13 14

The command should send each job individually. What does qrerun do when one or more of them hit the timeout?

Bhroam


#23

The current behaviour will continue: the message will be displayed for each job.
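Concretely, for Bhroam's example above, that means something like the following (job IDs are illustrative; the exact message text is the one specified in the EDD):

```shell
# qrerun contacts the server for each job ID in turn; a timeout on one
# job does not stop the remaining IDs from being processed, and the
# timeout message is printed once per affected job.
qrerun 12 13 14
```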


#24

Log of the offline discussion -

@smgoosen’s email
Hi Prakash,

So just to be sure I understand why the timeout is useful…

For a normal qrerun request the timeout serves no purpose other than to return a spurious message and cause tests to fail.

In the case where the scheduler issues the request (e.g. for preemption), the scheduler waits (right?) for low priority jobs to be preempted before telling the server to run the high priority job. If the low priority jobs get requeued quickly, the scheduler somehow knows this and goes on to run the high priority job. If a job doesn’t get requeued quickly enough, the timeout expires and the scheduler runs the high priority job, causing oversubscription.

Assuming a) we don’t want to just wait for the requeue(s) to finish and b) “some” oversubscription is OK, then the purpose of the timeout is to allow a delay before the scheduler oversubscribes the node, right? If so, then perhaps the “timeout” should be named something like job_requeue_delay, and the log message would be something like “qrerun: requeue delay for job . has expired, requeue still in progress”.

In the case where a hook issues a rerun command…?

Does the hook infrastructure interpret the timeout as a failure and return or…?

Would it be possible for the “delay” to only apply if the request is coming from the scheduler or is it useful at other times that I haven’t seen mentioned yet?

Thanks,
Sam Goosen

@arungrover 's reply -

One more thing I’d like to mention here:

If our purpose is for tests to succeed, then the qrerun -W force option can also be used. This way, no timer is registered and the prompt is returned to the user (in this case, the test script) instantly. What this also means is that no output files will be copied. If the test case makes use of the output file generated by the job’s previous run, then the -W force option is not beneficial.

You are right about its usage in the scheduler. It provides enough time for the server to requeue the job before replying to the scheduler. Oversubscription might still happen after the timeout expires, but the timeout acts as a cushion that handles the requeue impact in most cases.

Regards,
Arun Grover
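A quick sketch of the two modes Arun contrasts (the job ID is a placeholder; behaviour is as described in this thread, not a statement of the final design):

```shell
# Normal rerun: a timer is registered on the server, giving it time to
# copy output files back from MoM before replying to the client.
qrerun 1234.pbsserver

# Forced rerun: replies instantly with no timer, but the output files
# from the job's previous run are NOT copied back.
qrerun -W force 1234.pbsserver
```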


#25

I’m not sure delay is a good term, but I like it better than timeout. Prakash and Sam are correct that timeout makes it sound like we’re going to give up after a certain time. We have precedent in PBS for the term delay: the kill_delay attribute is the period of time between sending a job a SIGTERM and then a SIGKILL. If we take that to mean a delay is the amount of time between two actions, is it the correct term for what we’re doing here?

Unfortunately I have no better term, so I say we use Sam’s name.

The oversubscription we are talking about isn’t true oversubscription. True oversubscription is starting two jobs on the same node at the same time; they’d fight for cpus and memory, and one would probably die (or thrash). In our case, the job is dead. The cpus and memory are free. Only the network is still in use. If we oversubscribe in this way once in a while, I don’t believe it’s the end of the world.

Sam, your opinion is what matters here. How will the customers view this? Will they be heading our way with torches and pitchforks if we occasionally oversubscribe in this way?

To answer your other question about whether we can do something special just for the scheduler, the answer is yes. The server knows if a request is coming from the main scheduler (not multi-scheds). I personally like the idea of the delay for users as long as we tell them what is going on. Like Arun said, if they want something fast, they can always use -Wforce.

Bhroam
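For reference, the kill_delay precedent Bhroam mentions is configured like this today; the job_requeue_delay name below is only the proposal from this thread, not an existing attribute:

```shell
# Existing precedent: seconds between SIGTERM and SIGKILL on job delete
qmgr -c "set queue workq kill_delay = 10"

# Hypothetical rename discussed in this thread (not a real attribute):
# qmgr -c "set server job_requeue_delay = 45"
```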


#26

@smgoosen
Even though it is possible to do something special for the scheduler, I would not like the same API to behave differently depending on where the call came from. It also adds unnecessary complexity.

Thanks,
Prakash


#27

I have changed the name of our attribute.

Thanks,
Prakash


#28

@Bhroam: I believe the type of “oversubscription” we’re talking about here is acceptable.

@prakashcv13: Thanks for changing the name of the “timeout”, the rest of the design looks good!


#29

Re: “Renaming” an existing server attribute…

Please remember that we should strive to maintain 100% backward compatibility. Generally, this means we strive to ensure existing scripts (and configurations) work with newer versions of PBS Pro (without change). This is really important for users of PBS Pro, as many users have tools (sometimes tools that they did not write and even tools they cannot modify) that interact with PBS Pro. In cases where a PBS Pro parameter is wrong (or wrongly named), it is preferable to deprecate the old one and introduce a new one (and to continue to support the old one for several more versions of PBS Pro).

In very, very rare cases, it may be OK to remove an attribute name completely (not sure if this is such a case).

Thanks!

- bill


#30

This is a good point.
So if we choose to deprecate the old option, then it’s not only the attribute name that we retain; we retain the whole functionality (which includes printing the old message and exiting qrerun with an error number), because those existing scripts/tools may depend on the exit value/error message of qrerun?


#31

How about not making any code changes, but changing the documentation to say that this server attribute is deprecated? We would continue to have this “delay”, but it would be non-configurable.
The documentation change would be that the job will be re-queued and that the timeout message is an indication that the rerun process is taking time.


#32

Of all the suggestions so far, I like this one the best.

To me, log messages are mostly not part of the officially supported interface, so changing log messages is generally OK, especially if we improve them :-). So, I am also OK changing the log message to be more meaningful (if it needs that).

Thanks!


#33

Thank you @billnitzberg. By message I meant the message that is displayed on the command line, and we should not be changing that. So I believe a documentation change alone is enough.
I will update the EDD accordingly.

Thanks,
Prakash


#34

EDD is updated accordingly.


#35

I believe we can deprecate the configurable attribute, not change the error return, but still update the message to be less confusing.


#36

Ok, Sam. I understand that the error message is spurious and we need not be prisoners to a wrong message. Customer scripts that depend on this exact message (there might be very few) will need to be updated. I have updated the EDD.


#37

EDD update looks good!


#38

@smgoosen, @prakashcv13 - I have 2 questions:

  • When we mark an attribute as deprecated, are we also giving the user another way (a new attribute) with a more meaningful name? If not, how can we mark this attribute as deprecated without exposing a better way of dealing with the situation?
  • If we change the error message, wouldn’t that affect backward compatibility? There might be scripts reading this message and making decisions based on it. In the discussion above, Bill mentioned it is okay to change a log message, but what we are changing here is an error message that gets displayed on the console.

#39

Thinking about this error message in particular – it seems very rare and very unlikely to be used inside tooling that users have built around PBS Pro. So, although it does have the potential to break backward compatibility, I believe it’s OK to change this message.

Deprecate means “continue to support”, but inform users that we will likely drop support in some future release. That means we can remove it from the docs, but the setting should continue to work (in the code).


#40

In the PR for this ticket, I also suggest we change the message to:

Response timed out. Job rerun request still in progress.

IMHO, this is more clear and concise.