I don't see the schedulers running on the server host as being the single point of failure. The server has always been the single point of failure. At any point in the future when we have multiple servers on multiple hosts, they can each start up schedulers. I don't see it as limiting that we require a scheduler to be on the same host as its server.
I can't either. I just like consistency. I think we should either follow suit and error with "node busy", or change it so you can delete a node without removing its queue association first. The latter might be more work because you'd have to make sure the queue's hasnodes attribute is correctly unset when the last node is removed.
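If we go with the second option, the extra bookkeeping is roughly this (a minimal sketch with made-up names, not the actual server code):

```python
# Sketch of deleting a node without clearing its queue association first.
# Queue/Node/has_nodes are illustrative stand-ins for the server's objects.
class Queue:
    def __init__(self, name):
        self.name = name
        self.nodes = set()
        self.has_nodes = False      # mirrors the queue's hasnodes attribute

class Node:
    def __init__(self, name, queue=None):
        self.name = name
        self.queue = queue
        if queue is not None:
            queue.nodes.add(self)
            queue.has_nodes = True

def delete_node(node):
    q = node.queue
    if q is not None:
        q.nodes.discard(node)
        if not q.nodes:
            q.has_nodes = False     # unset hasnodes when the last node goes
    # ... rest of the node teardown ...
```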
I misread the document. I thought the 15211 message was printed when the scheduler was created, not when scheduling was set to true. I like that behavior.
Quick question: you said the scheduling attribute is not part of the scheduler object. Is that a typo? If not, where is it?
If "None" means no partition, should you point this out more explicitly? That there is a keyword that can't be associated to any queue/node and means no partition? I do agree that it should be a special keyword and the admin should not be able to associate nodes to the "None" partition
I just read in your document that you plan to disallow changes to the server's scheduling/scheduler_iteration/etc. attributes. These are stable interfaces and need to be deprecated first. If an admin sets one of them, I'd print a message and set it on the default scheduler's policy.
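Roughly the behavior I'm suggesting, as a sketch (the forwarding helper and attribute storage are hypothetical, not the current design):

```python
# Sketch: accept the old server-level attributes during the deprecation
# period, warn, and apply the value to the default scheduler instead.
DEPRECATED_SERVER_ATTRS = {"scheduling", "scheduler_iteration"}

def set_server_attribute(server, default_sched, name, value, log):
    if name in DEPRECATED_SERVER_ATTRS:
        log("server attribute '%s' is deprecated; "
            "applying it to the default scheduler" % name)
        default_sched.attrs[name] = value
        return
    server.attrs[name] = value
```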
In the section on failover you say the secondary will create the scheduler locally. There is no reason for it to do that. The secondary reads the database and will have the schedulers. It just needs to start them (and update the host attribute).
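In other words, something like this on failover (all names here are placeholders; the host attribute name in particular is an assumption):

```python
# Sketch: the secondary already has the scheduler objects from the database,
# so it only needs to point their host attribute at itself and start them.
def activate_schedulers_on_secondary(db, my_hostname, start_scheduler):
    for sched in db.load_schedulers():            # persisted by the primary
        sched.attrs["sched_host"] = my_hostname   # update the host attribute
        start_scheduler(sched)                    # start, don't re-create
```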
In the nodes section you say that if a node's partition attribute is unset, the node is not part of any partition and will be scheduled by the default scheduler. I'd rather you rephrase that to say simply that if it is unset, the node is part of the default scheduler's partition.
In the bullet that lists the scheduling events that cause a scheduling cycle, you missed qrun.
I don't think you should wait job_accumulation_time before starting a qrun cycle. You're not waiting for more jobs to accumulate; you're only running the one job.
There are several options for implementing job_accumulation_time. One is what you described: once you get an event, you wait that long and then start a cycle. Another is to treat it as the minimum amount of time between cycles: if you get two events close together, you'll wait; if you get two events far apart, you'll start a cycle immediately. I like the second method better (see the sketch below). Also keep automated tests in mind: this attribute cannot be turned on by default, or all automated tests will slow down. You should actually say whether this attribute is set by default or not.
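A tiny sketch of the second option, treating job_accumulation_time as the minimum spacing between cycles (illustrative only, not the server's actual event loop):

```python
import time

# Sketch of option 2: job_accumulation_time is the minimum gap between
# scheduling cycles. An event arriving too soon after the last cycle waits
# out the remainder; an event arriving later starts a cycle immediately.
class CycleTrigger:
    def __init__(self, job_accumulation_time=0.0):
        self.min_gap = job_accumulation_time   # default 0 keeps tests fast
        self.last_cycle = float("-inf")

    def on_event(self, start_cycle):
        wait = self.last_cycle + self.min_gap - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_cycle = time.monotonic()
        start_cycle()
```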
As a note, the init script reports the status of the MoM as well as the server. You said it will now only report the status of the server; I don't think that is your intent.