Wrong holidays file makes the scheduler loop infinitely


#1

If our sched_priv/holidays file is incorrect, the scheduler may behave abnormally (infinite loop).

== Example of wrong settings ==

(snip)
*
YEAR  2018
*
* Prime/Nonprime Table
*
*   Prime Non-Prime
* Day   Start Start
* 
  weekday all  all
  saturday  all  all
  sunday  all  all
*
* Day of  Calendar  Company
(snip)

== End of example settings. ==

With these settings, we see the following in the scheduler logs:

01/10/2019 14:41:52;0400;pbs_sched;Job;non-prime time;Simulation: Policy change [Thu Jan 10 14:41:20 2019]
01/10/2019 14:41:52;0400;pbs_sched;Job;prime time;Simulation: Policy change [Thu Jan 10 14:41:20 2019]
01/10/2019 14:41:52;0400;pbs_sched;Job;non-prime time;Simulation: Policy change [Thu Jan 10 14:41:20 2019]
01/10/2019 14:41:52;0400;pbs_sched;Job;prime time;Simulation: Policy change [Thu Jan 10 14:41:20 2019]
01/10/2019 14:41:52;0400;pbs_sched;Job;non-prime time;Simulation: Policy change [Thu Jan 10 14:41:20 2019]

This looks like an infinite loop!
Could you fix this?

We use version 18 (revision 9ad06407424b16c0f097f56df11fd309f39d52d50),
but please note that the file and line references below are based on
revision 2950f3399aa36c35b975d5ba9eb93722f1bee1df.

We think the infinite loop is in simulate_events,
at src/scheduler/simulate.c: 196-208.

The loop never terminates because end_prime_status returns the same value as its argument, so the simulated time never advances.
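
The stall described above can be illustrated with a minimal sketch. This is not the scheduler's code: count_policy_changes, stuck_end_of_status and hourly_end_of_status are hypothetical names, and the loop is only assumed to resemble the one in simulate_events. It shows that a loop which jumps to "the time the current status ends" cannot finish if that time equals the current time, and how a progress check turns the hang into an error.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for end_prime_status() with the broken holidays
 * file: it hands back its own argument, so time never moves forward. */
static time_t stuck_end_of_status(time_t now) { return now; }

/* Hypothetical well-behaved stand-in: the status ends one hour later. */
static time_t hourly_end_of_status(time_t now) { return now + 3600; }

/* Count prime/non-prime changes between start and end.
 * Returns -1 if the callback fails to advance time. */
static int count_policy_changes(time_t start, time_t end,
                                time_t (*end_of_status)(time_t))
{
    int changes = 0;
    time_t cur = start;

    assert(start <= end);
    while (cur < end) {
        time_t next = end_of_status(cur);
        if (next <= cur)
            return -1;  /* no progress: without this check the loop never ends */
        cur = next;
        changes++;
    }
    return changes;
}
```

With hourly_end_of_status the loop finishes after a bounded number of steps; with stuck_end_of_status it returns -1 immediately instead of spinning.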

We propose one of the following:

  1. Reject invalid settings in the holidays file.
    or
  2. Fix end_prime_status.
    Or perhaps you have a better idea than mine.

(Sample patch for 1)

diff --git a/src/scheduler/prime.c b/src/scheduler/prime.c
--- a/src/scheduler/prime.c
+++ b/src/scheduler/prime.c
@@ -438,6 +438,51 @@ parse_holidays(char *fname)
                }
                error = 0;
        }
+
+    //  Invalidate wrong settings.
+    if ( (conf.prime[WEEKDAY][PRIME].all == TRUE)
+            && (conf.prime[WEEKDAY][NON_PRIME].all == TRUE) )
+    {
+        conf.prime[WEEKDAY][NON_PRIME].all  = FALSE;
+        conf.prime[WEEKDAY][NON_PRIME].none = FALSE;
+    }
+    if ( (conf.prime[SATURDAY][PRIME].all == TRUE)
+            && (conf.prime[SATURDAY][NON_PRIME].all == TRUE) )
+    {
+        conf.prime[SATURDAY][NON_PRIME].all  = FALSE;
+        conf.prime[SATURDAY][NON_PRIME].none = FALSE;
+    }
+    if ( (conf.prime[SUNDAY][PRIME].all == TRUE)
+            && (conf.prime[SUNDAY][NON_PRIME].all == TRUE) )
+    {
+        conf.prime[SUNDAY][NON_PRIME].all  = FALSE;
+        conf.prime[SUNDAY][NON_PRIME].none = FALSE;
+    }
     (...snip...)
+    if ( (conf.prime[FRIDAY][PRIME].all == TRUE)
+            && (conf.prime[FRIDAY][NON_PRIME].all == TRUE) )
+    {
+        conf.prime[FRIDAY][NON_PRIME].all  = FALSE;
+        conf.prime[FRIDAY][NON_PRIME].none = FALSE;
+    }
+
        conf.num_holidays = hol_index + 1;
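
The per-day checks in the patch above repeat the same test, so the idea could also be written as a loop. The following is only a sketch under assumptions: the enum values, struct prime_start and count_all_all_conflicts are simplified, hypothetical stand-ins and do not match the scheduler's real definitions. It counts the days on which both the prime and non-prime start are "all", so that a caller could reject the file instead of silently rewriting it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the scheduler's types. */
enum { PRIME, NON_PRIME, NUM_STATUS };
enum { SUNDAY, SATURDAY, WEEKDAY, NUM_DAYS };

struct prime_start {
    int all;   /* nonzero: the whole day has this status */
    int none;  /* nonzero: no part of the day has this status */
};

/* Return how many days have both prime and non-prime set to "all". */
static int count_all_all_conflicts(struct prime_start table[NUM_DAYS][NUM_STATUS])
{
    int conflicts = 0;

    assert(table != NULL);
    for (int day = 0; day < NUM_DAYS; day++) {
        if (table[day][PRIME].all && table[day][NON_PRIME].all)
            conflicts++;
    }
    return conflicts;
}
```

A parser could treat any nonzero result as a configuration error and log which holidays file was rejected.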

(Sample patch for 2)

diff --git a/src/scheduler/prime.c b/src/scheduler/prime.c
--- a/src/scheduler/prime.c
+++ b/src/scheduler/prime.c
@@ -620,7 +620,10 @@ end_prime_status(time_t date)
        if (p == PRIME && conf.holiday_year == 0)
                return SCHD_INFINITY;
 
-       return end_prime_status_rec(date, date, p);
+    //  Fix infinite loop. Oops, Very very slow....
+    time_t  ret = end_prime_status_rec(date, date, p);
+    if ( ret == date ) { ret = date + 60; }
+    return  ret;
 }
 
 /**

Thank you.


#2

P.S.

In this situation, we cannot stop the scheduler with
“sudo service pbs stop” or “sudo service pbs restart”.

# sudo service pbs restart
Restarting PBS
Stopping PBS
PBS sched - was pid: 9493
Waiting for shutdown to complete
Unable to stop PBS, pbs_sched still active

#3

pbs_sched is stateless, so you can run kill -15 pbs_sched_pid and then start it again in one of the following ways:

# /etc/init.d/pbs start    # please note it is start, not restart

or

# source /etc/pbs.conf ; $PBS_EXEC/sbin/pbs_sched

or

# systemctl start pbs

There is no need to restart all of the PBS services; restarting only the PBS scheduler will do the job.


#4

I seem to remember this issue coming up in the past, but I can’t find it in our bug system. In any case, you can easily achieve what you want by just commenting out the entire holidays file. It will give you 24 hour primetime.

I’ll file a new bug for this. Thanks for reporting it.

Bhroam


#5

Hi adarsh,

Thank you for your comment.
I’m sorry, but I had already asked our system administrator to kill -9 the scheduler and start it again.

(Our system admin says that kill -15 (SIGTERM) cannot stop pbs_sched,
so we used kill -9 (SIGKILL) instead.)

This post is not for our own sake; it is a bug report for the benefit of other users.
Could you fix it so that other users
do not run into the same trouble?

Sincerely,


#6

Hi bhroam,

Thank you for your comment.

I was testing various settings
when I ran into the infinite loop bug above.

Yes, of course, the setting above is obviously a mistake.
But I never thought it would cause an infinite loop…

I hope the bug is fixed before other users hit the same problem.

I didn’t know how to use the issue tracking system,
so I posted here instead.

“I’ll file a new bug for this.”

Thank you very much! I’m glad of your help.

Sincerely,


#7

“In any case, you can easily achieve what you want by just commenting out the entire holidays file. It will give you 24 hour primetime.”

Thanks. I investigated several more patterns:

Case 0) Comment out the prime table, but leave the YEAR line:

*
YEAR  2019
*
* Prime/Nonprime Table
*
*   Prime Non-Prime
* Day   Start Start
*
*  weekday 0600  1730
*  saturday  none  all
*  sunday  none  all
*
* Day of  Calendar  Company
* Year    Date    Holiday
*

is bad. It also causes an infinite loop.

Case 1) Comment out the entire holidays file, as you suggested:

* YEAR  2019
*
* Prime/Nonprime Table
*
*   Prime Non-Prime
* Day   Start Start
*
*  weekday 0600  1730
*  saturday  none  all
*  sunday  none  all
*

is good. Yes, I got 24h primetime.
However, the following warning is logged repeatedly
because conf.holiday_year == 0.

01/15/2019 16:25:23;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:25:46;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:25:48;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:26:31;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:27:04;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:27:04;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:27:04;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:31:38;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:31:38;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
01/15/2019 16:41:38;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.

This warning is produced by fifo.c line 174 (schedinit):

if ((tmptr != NULL) && ((tmptr->tm_year + 1900) > conf.holiday_year)) 
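
To make the effect of that condition concrete, here is a small sketch that wraps the same comparison in a function. holidays_out_of_date is a hypothetical helper name, and holiday_year is passed as a parameter instead of being read from conf; only the condition itself comes from the quoted line. With no YEAR line parsed (holiday_year == 0), any current year compares greater, which would explain the repeated warning in case 1.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical wrapper around the quoted condition.
 * Returns nonzero when the current year is later than the YEAR value
 * from the holidays file (0 is assumed to mean "no YEAR line parsed"). */
static int holidays_out_of_date(const struct tm *tmptr, int holiday_year)
{
    /* tm_year counts years since 1900, hence the + 1900 */
    return (tmptr != NULL) && ((tmptr->tm_year + 1900) > holiday_year);
}
```

For a date in 2019, the function reports out-of-date for holiday_year values of 0 and 2018, and up-to-date for 2019.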

Case 2)

*
YEAR  2019
*
* Prime/Nonprime Table
*
*   Prime Non-Prime
* Day   Start Start
*
  weekday all  none
  saturday  all  none
  sunday  all  none
*

is good, and the warning above is NOT reported!
But I suspect it is a little slower than case 1.

Which do you recommend, case 1 or case 2?

Sincerely,


#8

I usually go for Case 2 by default, and I do not see any out-of-date warning messages.


#9

Thank you for your comment.

Case 2 also gives us 24 hour primetime, doesn’t it?
I will use these settings.

Sincerely,


#10

That is correct, Takahiro.
Setting every day to either 24-hour prime time or 24-hour non-prime time, with the correct YEAR, would be the best setting.
Thank you