PP-305: If server_dyn_res script does not return, Scheduler hangs


#1

Hi All,

As of today, if a server_dyn_res program/script hangs or does not return, the scheduler keeps waiting indefinitely for the script to complete.
The following design document describes the proposed solution to this hang issue:
Design Document

Please review the proposed design and share your comments and feedback.

Regards,
Varun


#2

Hi Varun, a few comments:

  1. Currently the EDD is written to sound like there is only ever 1 server_dyn_res script (“THE server_dyn_res” is written twice, but I think it should be “A server_dyn_res”).

  2. Related to 1), the EDD should be explicit about whether this timeout applies to ALL server_dyn_res scripts together, or EACH server_dyn_res individually.

  3. The EDD should be explicit about how the resource tied to a timed-out script run is treated in the cycle in which it timed out. Does the cycle continue normally, with the value of that resource assumed to be 0?

  4. Related to 3), assuming the timed out resource value is treated as 0 for that cycle, the log message in interface 2 should be explicit about this. That is not necessary, though, if in this scenario the log will contain something like these current messages in addition to the new message in interface 2:

05/17/2017 09:54:07;0080;pbs_sched;Svr;server_dyn_res;Error piping to program /bin/get_foo.
05/17/2017 09:54:07;0100;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0

If those messages will be printed upon timeout in addition to the new one in interface 2, I think the EDD should be explicit about it.

  5. I am not sure I like the name in interface 1. I like the simpler “server_dyn_res_timeout” better, or maybe “server_dyn_res_alarm”. Did you insert the “prog” to try to make it clearer that the timeout applies to each individual script/program? I think if the EDD and docs are explicit about this then the simpler attribute name is better.

Thanks!


#3

I think that it should apply to each server_dyn_res individually, to match the behavior of the alarm in hooks.

This is a good question. Sites use server_dyn_res in various ways. Some sites use it to get license counts, others use it to alter jobs, etc. I think that we should assume zero and continue. If a site wants different behavior, it will need to add a timeout in its own script.
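To illustrate what "add a timeout in their script" could look like, here is a minimal sketch of a site-side script that enforces its own, shorter timeout and reports a fallback value instead of letting the scheduler assume zero. It assumes the script reports the resource value on stdout. The query command path, the timeout, and the fallback value are all hypothetical placeholders, not anything from the EDD.

```python
#!/usr/bin/env python3
"""Hypothetical server_dyn_res script that enforces its own timeout."""
import subprocess

# Assumptions for illustration only: a site picks its own numbers here.
QUERY_TIMEOUT_SEC = 10   # should be shorter than the scheduler's alarm
FALLBACK_VALUE = "4"     # value the site prefers over an assumed zero


def resource_value(cmd, timeout=QUERY_TIMEOUT_SEC, fallback=FALLBACK_VALUE):
    """Run cmd and return its stripped stdout, or fallback on timeout or failure."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # The query hung, exited non-zero, or could not be started.
        return fallback
    return done.stdout.strip() or fallback


if __name__ == "__main__":
    # Hypothetical license query command; replace with the site's own.
    print(resource_value(["/usr/local/bin/query_licenses"]))
```

Because the script always returns before the scheduler's alarm fires, the site keeps control of what value is used when the underlying query is slow.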

I prefer server_dyn_res_alarm


#4

In interface 1, I believe that we should set the default value to 30 sec to match the default we have for hooks.


#5

Hi @jon and @scc,
Thanks for the comments.

Yes, the timeout applies to each server_dyn_res script individually. I have added this point to the EDD.

Thanks for raising this point; I had missed it. I also agree that we should assume the value to be zero and continue.
I have added this point to the EDD as well.

I have changed the name in the EDD to "server_dyn_res_alarm".


#6

I have modified the EDD with the following messages:
… …;0080;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out
… …;0100;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0

I have not added the “Error piping to program” message because a timeout may not be a piping error, just a hang. The code has a specific condition for piping errors, which will remain intact. It is quite possible that the program is piping correctly but times out because of delays, slowness, or a hang.
So I prefer keeping only the “timed out” message.
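As a rough illustration of the behavior discussed so far (a per-program alarm, a “timed out” message, and a value of 0 for that cycle), here is a small Python sketch. It is not the scheduler code, which is written in C; the function names and the `programs` mapping are hypothetical, and mapping a launch failure to the existing “Error piping” message is my assumption.

```python
import subprocess

ALARM_SEC = 30  # default value discussed in this thread


def query_dyn_res(cmd, alarm=ALARM_SEC, log=print):
    """Return the value printed by cmd, or "0" if it times out or cannot run."""
    name = cmd[0]
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=alarm)
        value = done.stdout.strip() or "0"
    except subprocess.TimeoutExpired:
        # Hang or slowness: log only the timeout, not a piping error.
        log(f"server_dyn_res;program {name} timed out")
        value = "0"
    except OSError:
        # Assumption: a failure to start the program maps to the existing message.
        log(f"server_dyn_res;Error piping to program {name}.")
        value = "0"
    log(f"server_dyn_res;{name} = {value}")
    return value


def query_all(programs, alarm=ALARM_SEC, log=print):
    """Apply the alarm to each program separately, not to the set as a whole."""
    return {res: query_dyn_res(cmd, alarm, log) for res, cmd in programs.items()}
```

With this shape, one hung program costs at most one alarm period and the remaining resources in the cycle are still queried normally.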
Please review the updated EDD.
Thanks,
Varun


#7

Modified the default value to 30 sec. Please review the updated EDD.
Thanks,
Varun


#8

The changes look fine. One suggestion: let's log interface 3 at the default logging level.


#9

@varunsonkar unless the two log messages (interfaces 2/3) are required for automated testing, I would suggest making them Unstable. Log messages are something that should be able to change without one year's notice.

Bhroam


#10

Modified the EDD. Please review.
Thanks,
Varun


#11

Hi @bhroam,
Thanks for the input. I have updated the EDD as per your suggestion and marked interfaces 2 and 3 as Unstable.
Regards,
Varun


#12

Thanks for the changes; the EDD looks good. Since no one else has commented in the last 12 days, I suggest we wait one more day to see if there are any more comments before we end the discussion and move forward.