Top job's reservation ignored and postponed by lower priority job


#1

Hi,

we are currently facing a very weird behavior. Our system uses multiple “general” queues (for short/normal/long jobs, etc.). These queues have the same priority. Besides that, we also have several “high priority” queues (their priority is higher than that of the short, normal and long queues) and job_sort_formula = queue_priority. Moreover, jobs are ordered by fair-share, and by_queue = false, i.e., we do not sort jobs in a queue-by-queue fashion but “all at once” (for the queues with the same priority).
Also, we have backfilling and strict_ordering enabled. Backfill_depth is set separately for each queue (depth = 10) and opt_backfill_fuzzy = high. The expected behavior is that the scheduler first considers those high priority queues, and next all jobs from those short/normal/long queues are considered together (ordered by fair-share). When backfilling starts to schedule jobs, it should give a reservation to every top job that cannot start immediately. Now to the problem:

We have observed a weird behavior, where a high priority job (“high_job” of a user “high_user”) obtained a reservation since it was a top job. However, this reservation was repeatedly postponed to a later time by a different lower priority job (“low_job” of a user “low_user”). According to the log, both jobs were top jobs and according to the fair-share, “high_job” had a higher priority. We’ve checked this by running the pbsfs -c command (pbsfs -c high_user low_user --> high_user).

In our understanding, if a high priority top job receives a reservation it should not be postponed by a lower priority top job because the lower priority top job is selected later in the backfill process, right? We expect the lower priority top job to “see” that there is already a reservation for another (high priority) top job, thus seeking another (later) time slot. Instead, in our case, the latter lower priority top job simply ignored the first reservation and postponed the first (high priority) top job. We are unable to identify the reason for this behavior.

Could the problem be somehow related to the fact that those two jobs were from different queues (yet having the same queue priority)? Since we have backfill_depth set per queue and by_queue=false, once all jobs are ordered by fair-share those “top jobs” can be actually quite scattered within that sorted job list (as jobs come from different queues).

The problem is quite painful, since the high priority job is being postponed repeatedly for several days now and the user is complaining…

Thank you,

Dalibor


(Note: we are aware that backfill+fair-share is not a recommended combination, yet the observed behavior is still inexplicable to us.)


#2

Hey,
There are several things that could be going on. First off, how are you determining if a job is a top job? A job in the higher priority queue isn’t necessarily a top job. It’s only if the job is one of the first 10 jobs in that queue that makes it a top job. If you look in the log, there is a message emitted when a job becomes a top job.

Two top jobs will not cause one of them to be pushed later in the calendar. The only way this happens is if a filler job is run on the same resources as a top job and extends into the top job’s time. By the nature of being a top job, it can’t run now. Even if two top jobs were added to the same resources at the same time, that would only stop other jobs from running right now. They wouldn’t cause harm to each other.

The likely culprit is your complex backfill_depth/queue setup. Top jobs do not receive a reservation in the PBS sense (see advance reservations). The scheduler adds top jobs to the calendar each scheduling cycle. The scheduler will always take the first N highest priority jobs and add them to the calendar. If the same N jobs are the highest priority from one cycle to the next, all is good. If the ordering of the highest priority jobs changes, then a different N jobs will be added to the calendar. The bottom of the list in the previous cycle will be bumped off and new ones will be added on.
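To make the "first N jobs each cycle" idea concrete, here is a minimal sketch. It is a toy model, not the scheduler's code: the job names, the numeric priorities and the `top_jobs` helper are all made up for illustration.

```python
def top_jobs(priorities, depth):
    """Return the names of the `depth` highest-priority jobs (toy model)."""
    ranked = sorted(priorities, key=lambda name: priorities[name], reverse=True)
    return ranked[:depth]


# Hypothetical priorities in two consecutive scheduling cycles.
cycle1 = {"A": 90, "B": 80, "C": 70, "D": 60}
# Between the cycles, job E is submitted with a higher priority than C.
cycle2 = {"A": 90, "B": 80, "C": 70, "D": 60, "E": 75}

calendar1 = top_jobs(cycle1, depth=3)
calendar2 = top_jobs(cycle2, depth=3)

# Jobs that were on the calendar in cycle 1 but not in cycle 2.
bumped = [job for job in calendar1 if job not in calendar2]

print(calendar1)  # ['A', 'B', 'C']
print(calendar2)  # ['A', 'B', 'E']
print(bumped)     # ['C']
```

Job C did nothing wrong; it simply fell off the bottom of the list once E arrived.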

You have several things working against you in your setup. First is fairshare. This can easily change which job is the most deserving from one cycle to the next. If the highest priority job’s entity has currently running jobs, the usage those running jobs accrue each cycle lowers the priority of the high priority job.
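A rough sketch of how that can flip the ordering, under a deliberately simplified assumption: the entity with the smallest usage/share ratio is treated as most deserving. The real fairshare calculation walks the fairshare tree and applies decay, so the rule, the numbers and the per-cycle accrual factor below are all hypothetical.

```python
def most_deserving(entities):
    """Toy rule: the entity with the smallest usage/share ratio goes first."""
    return min(entities, key=lambda name: entities[name]["usage"] / entities[name]["share"])


# Hypothetical state: high_user has running jobs, low_user has none.
entities = {
    "high_user": {"usage": 100.0, "share": 10, "running_cpus": 50},
    "low_user": {"usage": 120.0, "share": 10, "running_cpus": 0},
}

order = []
for cycle in range(3):
    order.append(most_deserving(entities))
    # Running jobs accrue usage between cycles (0.25 is an arbitrary factor).
    for entity in entities.values():
        entity["usage"] += entity["running_cpus"] * 0.25

print(order)  # ['high_user', 'high_user', 'low_user']
```

Nothing was submitted or finished here; the ordering changed only because high_user kept accumulating usage.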

Another idea is that one queue could reach its limit of 10 top jobs while other queues haven’t. This means that you’ll add some top jobs to the calendar, skip over others, and add more from another queue. Consider the following: J1 is from Q1, J2 is from Q2, J3 is from Q1. We add J1 to the calendar. Q2 is at its limit of 10 top jobs, so we don’t add J2 to the calendar, and J3 is added to the calendar. The following cycle we have the same ordering, but J2 can run. J3 hasn’t been added to the calendar yet by the time J2 runs (since J2 is higher priority), so J3 gets delayed. If this is the case, consider removing your backfill_depth per queue and adding a larger backfill_depth to the server.
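The J1/J2/J3 example can be sketched as follows. This is only an illustration of the skipping effect: `select_top_jobs` and its `already_counted` argument are invented for the example, and the assumption that Q2 already holds 10 top jobs stands in for higher priority Q2 jobs earlier in the sorted list.

```python
from collections import Counter


def select_top_jobs(sorted_jobs, depth_per_queue, already_counted=None):
    """Walk the priority-sorted job list and add a job to the calendar
    only while its queue is still under its per-queue top-job limit."""
    used = Counter(already_counted or {})
    calendar, skipped = [], []
    for name, queue in sorted_jobs:
        if used[queue] < depth_per_queue:
            used[queue] += 1
            calendar.append(name)
        else:
            skipped.append(name)
    return calendar, skipped


# Jobs in priority order, as (job, queue) pairs.
jobs = [("J1", "Q1"), ("J2", "Q2"), ("J3", "Q1")]

# Assumption: Q2 has already used up its 10 top-job slots.
calendar, skipped = select_top_jobs(jobs, depth_per_queue=10, already_counted={"Q2": 10})

print(calendar)  # ['J1', 'J3']
print(skipped)   # ['J2']
```

J2 sits between J1 and J3 in priority but has no calendar entry, so nothing protects J3's slot from J2 if J2 becomes runnable in a later cycle.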

Maybe an endless series of higher priority jobs are being submitted between scheduling cycles. These jobs could either run (similar to above) or bump the Nth job of the list. It might not even be an endless series. A single job can be very damaging to a large top job. The scheduler can take a long time collecting resources for that large top job. If the large job isn’t a top job for one cycle, all those resources can be given away and the scheduler has to start over.

Maybe jobs are starving? Once a job starves, it will gain priority over non-starving jobs (even in higher priority queues). If you don’t want this behavior, set help_starving_jobs to false.

Lastly, what do you mean by high priority queue? If your high priority queue has a priority of greater than 150, it is an express queue. These jobs get the highest priority over all other sorting methods. It also means you’ll start preempting. You would have noticed preemption, so this isn’t likely the case.

Bhroam


#3

Status: SOLVED

Dear Bhroam,

thanks for your valuable input. It put us on the right track, which meant digging deep into the scheduler’s logs.
The root cause of our problem turned out to be a different situation, one that is well documented in the literature on backfilling: a classic backfilling “mystery” related to early job completions. :slight_smile:

The problem was as follows:

  • a different job (not mentioned in my original post) had the highest priority in a queue (and was a top job)
  • this job was a demanding, multi-node job. As a consequence, its first “reservation” was placed behind “our job” (which had a lower priority) since there was not enough space in the calendar/schedule
  • in the meantime (between two scheduling cycles), some job terminated earlier (which is common)
  • as a result, a larger gap appeared in the schedule which now became sufficient for the highest priority job
  • the highest priority job now had a much earlier reservation
  • as a consequence, “our job” was not able to obtain its original (early) reservation but had to be delayed, because the original timeslot was now used by the first job
  • moreover, other jobs with even lower priority were still able to fit before our job, which led us to the original (false) conclusion that these jobs were delaying our higher priority job.
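For anyone who hits the same thing, the sequence above can be reproduced with a small toy calendar. This is a sketch under strong assumptions, not the PBS algorithm: a single pool of 10 identical nodes, integer hours, and hypothetical jobs named "big" (highest priority), "ours" and "small" (lowest priority), each placed in priority order at the earliest time it fits.

```python
def earliest_start(free, nodes, walltime):
    """First hour t at which `nodes` nodes are free for `walltime` consecutive hours."""
    for t in range(len(free) - walltime + 1):
        if all(free[u] >= nodes for u in range(t, t + walltime)):
            return t
    raise ValueError("horizon too short")


def build_calendar(total_nodes, running, top_jobs, horizon=48):
    """Place top jobs in priority order, each at its earliest feasible start."""
    free = [total_nodes] * horizon
    for nodes, end in running:           # running jobs hold nodes until `end`
        for t in range(end):
            free[t] -= nodes
    starts = {}
    for name, nodes, walltime in top_jobs:
        start = earliest_start(free, nodes, walltime)
        for t in range(start, start + walltime):
            free[t] -= nodes
        starts[name] = start
    return starts


# (name, nodes, walltime in hours), highest priority first.
top = [("big", 8, 6), ("ours", 4, 5), ("small", 2, 2)]

# Cycle 1: two running jobs, given as (nodes, expected end hour).
before = build_calendar(10, [(6, 10), (4, 4)], top)
# Cycle 2: the 6-node job has terminated early; only the 4-node job remains.
after = build_calendar(10, [(4, 4)], top)

print(before)  # {'big': 10, 'ours': 4, 'small': 9}
print(after)   # {'big': 4, 'ours': 10, 'small': 0}
```

After the early termination, "big" moves forward from hour 10 to hour 4, "ours" is pushed back from hour 4 to hour 10, and "small" now starts ahead of "ours". Looking only at "ours" and "small", it appears as if the lower priority job caused the delay.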

We appreciate your help. Clearly, our initial observation was misleading since we only considered some jobs and ignored the rest of the queue (especially other top jobs with even higher priority).

Regards,
Dalibor


#4

Hey,
I’m glad the confusion was cleared up. It’s a hard situation when a higher priority job pushes a lower priority job out of its timeslot. If this happens too often and your users get unhappy, consider using ASAP advance reservations. You do a pbs_rsub -Wqsub=. This will create an advance reservation and move the job into it. Now that this job is in an advance reservation, it has higher priority than any top job. No top jobs can move it out of the way. The only downside to ASAP reservations is that they do not move in time like top jobs. If other jobs end early, the advance reservation will still start at its original timeslot.


#5

I realize this question is quite late, but along these same lines, will PBS allow a job that is currently ineligible to run (due to queue run limits, for example) to become a top job? If so, is a top job that is ineligible to run able to delay the start time of other waiting jobs?
Thank you in advance for your time!


#6

Yes, a job that has hit a limit is eligible to become a top job. The scheduler will determine when the job is under its limit when calculating its start time.

If this is undesirable to you, use eligible time in the job sort formula. Jobs hitting their limits do not accrue eligible time, so they gain no priority from it while they wait.

Bhroam