PP-337: Multiple schedulers servicing the PBS cluster


The problem is failover. We’re already in the business of starting schedulers. Currently, in the case of failover, the secondary server will try to connect to the primary’s scheduler. If it can’t, it will start a scheduler on the secondary. If we don’t start schedulers ourselves, this part of failover will need to be rewritten. If schedulers register themselves with the primary, how will that work with the secondary? Do we need to instruct the admins to have a second set of schedulers register with the secondary? Is having the server start the schedulers the right answer? Probably not. Whatever we do needs to be convenient for the admins. How do you suggest we have the schedulers start? Having the admins modify our init script is the wrong answer.
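To make the current failover behavior concrete, here is a minimal sketch of the decision the secondary server makes today, as described above. The function and parameter names (`ensure_scheduler`, `connect_primary`, `start_local`) are hypothetical and only for illustration; they are not taken from the PBS source.

```python
from typing import Callable


def ensure_scheduler(connect_primary: Callable[[], bool],
                     start_local: Callable[[], None]) -> str:
    """Return which scheduler the secondary server ends up using.

    connect_primary: hypothetical callable, True if the secondary could
                     connect to the scheduler on the primary host.
    start_local:     hypothetical callable that starts a scheduler on
                     the secondary host.
    """
    if connect_primary():
        # The primary's scheduler is reachable, so keep using it.
        return "primary"
    # Otherwise fall back to a scheduler started on the secondary.
    start_local()
    return "secondary"
```

If schedulers register themselves instead, it is this fallback branch that would have to be redesigned.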

I disagree that the main goal is to merge the existing code into OSS. In my opinion that’s a bad mindset to start with.
The existing code was made to just work. It didn’t go through our design process to make it work right. I think the main goal is to take the current code and make it work right and merge that into OSS (within limits). If we need to do more work to make it work right, we should do that extra work. Right now, we are doing this step.

I like this. It moves us in the direction I’d like to see us go. Right now the server gives extra permission to the scheduler. It knows who the scheduler is because it opened the connection to the scheduler. If we flip it, we can keep the connections open and still know that they’re schedulers. The highest permission any command can have is a manager. There are attributes managers don’t have permission to see because they are only of use to the server and the scheduler. I want to be able to write an external command that can authenticate to the server with that level of permission. A “Hi, I’m part of PBS” authentication. By flipping the direction of the connections around, it’s moving us in this direction.




  • interface 1’s Details section says that the following are the mandatory attributes, and in the next sentence that Name of the scheduler is mandatory
  • interface 1’s Details section says that qmgr -c "c sched multi_sched_1" will create/set the following attributes for the sched object but does not explain how the values for the attributes are presented for the configuration step. A number of the subsequent lines are of the form attribute = value but it’s suspicious that there’s white space and no quoting in the examples.
  • nit: every occurrence of it's should instead be its
  • how does one list all the scheduler objects? It might involve something like qmgr -c list schedulers but no one tells us that
  • there seems to be some permissions confusion: Interface 5 says that the Node object will be settable only by Manager/operator, then two sentences later says PBS admin/manager can set node's partition attribute to an existing partition name
  • nit (?): interface 2’s dropped a quotation mark somewhere - it ends with queues attached to specified partition“ but there is no matching quotation mark.
  • nit: if no partition are specified should say partitions (or is specified)
  • regarding The default sched object is the only sched object that cannot be deleted: can a sched object whose state is SCHEDULING be deleted? If so, what happens to the jobs associated with it?
  • since scheduling and scheduler_iteration now belong to the sched object, will there be any consideration given to (temporarily - a release or two?) mapping the request to the new interface rather than being unhelpfully rude about it?
  • nit: it will try to connect to connect should only try to connect once
  • saying that certain attributes are settable only by Manager/operator ignores the fact that A PBS administrator is a person with both root privilege and Manager privilege.
  • I can’t tell exactly what this is trying to tell us:

    When an admin changes the partition value at the queue level all jobs (except in a finished state) will have their partition attribute changed to match the new value if they match the previous value for the partion. All new jobs entering the queue after that time will have their partition attribute to match the partition value set for the queue

    The all jobs … if they match conjunction is too confusing to sort out. All jobs? Only those that match?

  • nit: shouldn’t How PBS server runs scheduler now refer to multiple schedulers?
  • nit: in respective hostnames and port number, change number to numbers
  • I think there’s some implementation detail hidden in the

    If server is unable to connect to these schedulers it will check to see if the scheduler is running, try to connect 5 times, and finally restart the scheduler.

    line. Perhaps it’s a reference to a master scheduler that might be in charge of restarting the ancillary “sub-schedulers”? It would be useful to have a way of distinguishing the scheduler and a scheduler.

  • this

    If a scheduler is already running a scheduling cycle while server will just wait for the previous cycle to finish before trying to start another one.

    needs better English.

  • no one tells us what job_accumulation_time is - a server attribute? What values does it take on? Seconds? Times like 5:03 or 5m3s to express a time of 5 minutes and 3 seconds?
  • do partitions necessarily exist?
  • this

    If job_accumulation_time is set then server will wait until that time has passed after the submission of a job before starting a new cycle

    appears to presume that a single value would be suitable for all the various schedulers. Is that true? Isn’t it reasonable that some schedulers’ work might be amenable to much quicker solutions than others?

  • I don’t understand the one of in

    It gets all the running, queued, exiting jobs from the queues it is associated with one of it’s partitions.

  • all makes no sense in It gets all the list of nodes [...]
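On the partition-change sentence quoted in the list above: one possible reading is "only unfinished jobs whose partition equals the queue's previous value are retagged". The sketch below shows that reading only. The `Job` class, the state letter `"F"` for finished, and `change_queue_partition` are assumptions made for illustration, not names from the design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: str
    state: str                 # assumed: "F" marks a finished job
    partition: Optional[str]


def change_queue_partition(jobs: List[Job], old: Optional[str],
                           new: Optional[str]) -> List[str]:
    """Retag unfinished jobs whose partition equals the old queue value.

    Returns the ids of the jobs that were changed. Jobs that are
    finished, or whose partition differs from `old`, are left alone.
    """
    changed = []
    for job in jobs:
        if job.state != "F" and job.partition == old:
            job.partition = new
            changed.append(job.job_id)
    return changed
```

If the design instead means that every unfinished job in the queue is retagged regardless of its current value, the `job.partition == old` test would simply be dropped, which is why the wording needs to be pinned down.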


Today, in a discussion with @bhroam, the question came up of whether the “sched_user” that is going to run a scheduler instance should be treated as a user with “manager” privilege.

As of today, the scheduler can perform operations as a manager privilege user and much more. There are three ways of dealing with this situation:

  • Treat “sched_user” as a manager privilege user and not add it to pbs manager list.
  • Have a pre-req that only a pbs manager user can run a scheduler.
  • Add “sched_user” to pbs manager list automatically if it does not exist.

IMO we should just go with option 3 and add “sched_user” to the manager list when it is not there. This would also mean deleting it from the list when the scheduler object is deleted, and throwing an error when someone tries to delete “sched_user” from the manager list without deleting the scheduler object.
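A rough sketch of the bookkeeping option 3 implies, to show the three rules together. Everything here (`ManagerList`, its method names, the use of `PermissionError`) is hypothetical. The sketch also assumes that a user who was already a manager before the scheduler was created stays a manager afterwards, which the proposal above does not spell out.

```python
class ManagerList:
    """Toy model of the server's manager list under option 3."""

    def __init__(self):
        self.managers = set()
        self.sched_users = {}    # scheduler name -> sched_user
        self.auto_added = set()  # users added only because of a scheduler

    def create_sched(self, name, sched_user):
        self.sched_users[name] = sched_user
        if sched_user not in self.managers:
            # Rule 1: add sched_user to the manager list if missing.
            self.managers.add(sched_user)
            self.auto_added.add(sched_user)

    def delete_sched(self, name):
        user = self.sched_users.pop(name)
        still_used = user in self.sched_users.values()
        # Rule 2: drop the user again once no scheduler needs it
        # (assumption: only if we added it automatically).
        if user in self.auto_added and not still_used:
            self.managers.discard(user)
            self.auto_added.discard(user)

    def remove_manager(self, user):
        # Rule 3: refuse while a scheduler object still uses this user.
        if user in self.sched_users.values():
            raise PermissionError(f"{user} is the sched_user of a scheduler")
        self.managers.discard(user)
        self.auto_added.discard(user)
```

Whether rule 2 should also remove a pre-existing manager is exactly the kind of detail the design proposal should state.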

Whatever we decide on, I think we should add it to our design proposal.


Have you defined the authentication mechanism that the server will use to validate multiple schedulers? If the long term goal is to run schedulers on multiple nodes, then the authentication will have to take place over the network. Once authenticated, does the EUID/EGID of the scheduler process really matter to the server?

The value of sched_user is the UID that will be assigned to the running scheduler process, assuming PBS starts that process. We could require that this UID be that of a PBS manager, but do we need to if the server has already authenticated the scheduler when it connected? If that manager should get deleted, all scheduling would grind to a halt.

I think we need to define the server/scheduler authentication mechanism before we determine whether or not the server cares about the scheduler’s UID.


In my mind, I was keeping authentication and roles separate. That is, a user can be authenticated, but that does not mean that that particular user will be allowed to act as a manager.

If the server spawns schedulers and, after authentication, starts accepting all requests from that authenticated connection, that might require more changes across the IFL interfaces that the scheduler uses or intends to use. But if we go by the roles already defined in PBS and have the sched_user be part of them, then I think the change in the code will be minimal. IMO, we shouldn’t allow users with manager privileges to be deleted if they are also used as “sched_user”.



There was a design change (v.38) today to remove job_sort_formula and job_sort_formula_threshold from the sched object. I don’t object to the change: simpler is better for a first version.

However, since the design has been stable for a couple weeks (with no changes and no comments from the community), and a pull request is ongoing, it is appropriate to provide an explanation and an opportunity for the community to comment on the change. Please add some explanation (here in the forum).



Thanks for pointing this out @billnitzberg. We are keeping job_sort_formula_threshold as a scheduler attribute only for backward compatibility. The only change we made is moving job_sort_formula back to being a server attribute. We added this change to the EDD, so the document is now at v.39.

As there is a plan to move both of these attributes to a Policy object in the future anyway, it was suggested that we leave job_sort_formula as a server attribute for the time being.



Thanks for the explanation.

It seems that job_sort_formula and job_sort_formula_threshold should be defined in the same place. What is the goal of separating them?

If the eventual goal is to have job_sort_formula in a new “scheduling policy” object, then that is also where job_sort_formula_threshold should end up (in the future). If that is the goal, I suggest putting job_sort_formula_threshold in the server object (now, to match job_sort_formula), and later (when the policy object is developed), both can be transitioned to it.


The reason they were separated was that we were putting all new scheduler policy in the sched object. The job_sort_formula was already in the server object, so I left it there. I personally think we should leave everything alone until we create the policy object. Why create extra interfaces that we know we will need to deprecate in the future? I’d even be up for restricting job_sort_formula_threshold further, so that it can only be set on the default scheduler for backward compatibility reasons. There is no need to allow each scheduler to have its own.



Thanks @bhroam! I see now that I forgot about all the existing sched attributes (in v14), and job_sort_formula_threshold is just one of them (along with do_not_span_psets, opt_backfill_fuzzy, pbs_version, sched_cycle_length, sched_host, sched_preempt_enforce_resumption, and throughput_mode).

Thinking more broadly, I have a couple questions / thoughts:

  1. Why specifically call out job_sort_formula_threshold, but none of the other attributes?
  2. What is the difference between the existing “sched_host” & the newly introduced “host” attribute?
  3. Will PBS Pro support backward compatibility for setting the default scheduler’s attributes? In other words, after this feature is done, will the following be legal: "qmgr> set sched sched_cycle_length = 10:00", or will only the new syntax be accepted?
  4. What is the syntax for setting an attribute for the default scheduler?
  5. At the bottom of Interface 1, it has “… as mentioned in Interface 3.”, but Interface 3 is removed.

Thanks again!


You have a very good point @billnitzberg.

There are many attributes that are in the current sched object. What should we do with them? I agree with you that they should be explicitly called out when we do decide. When I thought about this originally, I thought we could restrict them to only be set on the default scheduler and none of the new ones. This will be problematic: how will the new schedulers know which attributes to read from their own scheduler object and which to read from the default scheduler object? I’m less happy with that restriction now. I think maybe expanding the scope a little and allowing them to be set on every sched object is the right thing to do. When we get around to implementing the policy object, you’ll be able to set these per-scheduler then, so why not now?

I can answer the sched_host part. The scheduler sets it at startup to the host it is running on. There should be no reason to have both sched_host and host. The two serve the same purpose.

There is backwards compatibility with the old set sched syntax. I saw some special case code in qmgr to allow for it. It will set attributes on the default scheduler.

To set attributes on the default scheduler you can use the old: ‘set sched’ or name it directly: ‘set sched default’


Thanks @bhroam!

OK, it seems there are only two remaining open issues before declaring that consensus has been reached on this design:

#1 – Recent feedback related to comments 70 & 71, and
#2 – Feedback related to comment 36 item 8 (which was again brought up in comment 58), but for which no additional details have been provided.

@suresht – can you provide feedback or an update (assuming you are now driving the design)?

Thanks again!


We have updated the EDD to address comment #1 above.

Regarding comment #2, the short-term decision is that the server will start additional schedulers (it already starts schedulers for failover).


Thanks @suresht!

It looks like the updated design addresses only part of comment 72 #1. Comment 70 items 2, 3, 4, and 5 are addressed by the updated design (v.42), but comment 70 item 1 had no action nor explanation.

Please address comment 70 item 1, i.e.:

1. Why specifically call out job_sort_formula_threshold, but none of the other [existing sched object] attributes?

Regarding Comment 72 #2, this was referring to what is now listed under “Notes” at the bottom of the design. (Sorry if there was confusion, the current system for tracking versions and comments and matching them up makes it hard to easily map between comments and versions…).

Please address comment 36 item 8 (which referred to design v.11 Interface 9, which became the Notes section in a later revision). In particular, the Notes section v.42 says “things might be broken” and “may not have a view”, but does not define any options as unsupported/erroneous nor does the design define the new behavior of these options. One guideline for all designs is that everything PBS Pro already supports will continue to be supported (with the same semantics as in prior versions) unless the design explicitly states otherwise. If some set of options are supported, then the design should define the actual behavior; if some set of options are not supported, then no behavior needs to be defined, but the options need to be explicitly listed as not being supported. (Ideally, unsupported options also result in errors.)

The problem with the v.42 language is that it doesn’t say anything is unsupported (so PBS Pro supports these options), but the design doesn’t explain the new behavior. Please either mark some set(s) of options as unsupported or fully explain the behavior of using these (supported) options.




Thanks for your comments. We have updated the EDD. Please take a look and let us know if you have any comments.



Thanks for the clarifications. Not sure if you are making additional changes or you are done with updates. If you are done (for now), I suggest you issue another call for review, as there have been many changes in the last month and it would be good to let the community know that this update is ready for an overall review (again).

Thanks again!



As some changes went into the EDD recently, I am requesting the community to review the updated EDD and provide feedback.


@suresht I didn’t see any mention of the other scheduler attributes and where they will exist. Does each scheduler have their own copy (e.g., do_not_span_psets) or are they only set on the default scheduler for the entire complex?


We have added only new/modified scheduler attributes to the EDD. As far as I know, our initial thinking is that each scheduler has its own copy of those attributes and the others. Please let us know your thoughts on this.



We have added the role of routing queues to Interface 6: Changes to Queues. Requesting the community to review the updated EDD and provide your feedback.

Thank you.