Under the current design, creating a partition means grouping workload and resources. The admin assigns nodes and queues to a partition by giving them the same partition name, and then assigns a scheduler to service that partition by setting the scheduler's "partition" attribute to the same name.
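To make the name-matching idea concrete, here is a minimal sketch of the grouping described above. It is a toy model, not PBS code: the `Node`, `Queue`, and `Scheduler` classes, the function `objects_served_by`, and the object names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Hypothetical stand-ins for PBS objects; only the "partition" name matters here.
@dataclass
class Node:
    name: str
    partition: Optional[str] = None


@dataclass
class Queue:
    name: str
    partition: Optional[str] = None


@dataclass
class Scheduler:
    name: str
    partition: Optional[str] = None


def objects_served_by(
    sched: Scheduler, nodes: List[Node], queues: List[Queue]
) -> Tuple[List[str], List[str]]:
    """Return the node and queue names whose partition name equals the scheduler's."""
    node_names = [n.name for n in nodes if n.partition == sched.partition]
    queue_names = [q.name for q in queues if q.partition == sched.partition]
    return node_names, queue_names


# Illustrative data: two nodes and one queue share partition "P1" with sched1.
nodes = [Node("node1", "P1"), Node("node2", "P1"), Node("node3")]
queues = [Queue("workq"), Queue("q1", "P1")]
sched1 = Scheduler("sched1", "P1")

print(objects_served_by(sched1, nodes, queues))
```

The only point of the sketch is that the association is made purely by equal partition names; nothing else links a scheduler to its nodes and queues.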
This is going to change: @suresht is working on a change that makes partition a job attribute rather than a queue attribute. That will allow admins to move queues around without having to drain all the running jobs first.
I hope that "somebody" isn't the admin who is setting up the system. Other users shouldn't be able to create directories or configure the scheduler object. PBS isn't really written with all the goofy things users can do in an HPC complex in mind. We expect the complex to be a protected, trusted environment. For that matter, we don't encrypt our messages when communicating over the network, so anyone can sniff and read what is being sent. Even today, a user can run a qstat command in a loop and hog the server.
Thanks for pointing this out.
If the scheduler object is created and it is not scheduling, its state will be IDLE; if it is scheduling, its state will be "SCHEDULING".
A configuration with multiple schedulers isn't something every customer will want. Our purpose is to have minimal to no impact on how PBS currently operates. If we add a default partition, then every job will need to have a partition associated with it (even if there is only one scheduler in the whole complex). This will also affect upgrades from a previous version to the new version.
Thanks for pointing this out. This is going to change: Suresh will make partition a job attribute.
When partition becomes a job attribute, queues will no longer be part of any partition, so this will be okay.
You are right. Jon and Alexis raised a similar point in a different form: if the scheduler dumps every now and then, how often are we going to restart it? We should have a max restart check in the server for each scheduler, and once that limit is reached the server should not try again. The problem then becomes how to let the admin know that something is wrong. That gets resolved if we have a scheduler "comment", as mentioned by Bhroam and Jon in the previous replies.
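A rough sketch of what such a max restart check could look like is below. This is not the server implementation: the `SchedulerRecord` class, the `handle_scheduler_down` function, the limit of 3, and the comment wording are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

MAX_RESTARTS = 3  # hypothetical limit; the real value would be a design decision


@dataclass
class SchedulerRecord:
    """Hypothetical per-scheduler bookkeeping kept by the server."""
    name: str
    restarts: int = 0
    comment: str = ""


def handle_scheduler_down(rec: SchedulerRecord, max_restarts: int = MAX_RESTARTS) -> bool:
    """Return True if the server should restart the scheduler, False if it gives up.

    When the limit is reached, the reason is recorded in the scheduler's
    comment so that the admin can see why it is no longer being restarted.
    """
    if rec.restarts >= max_restarts:
        rec.comment = f"not restarted: reached limit of {max_restarts} restarts"
        return False
    rec.restarts += 1
    return True


# Simulate the scheduler going down five times in a row.
rec = SchedulerRecord("sched1")
decisions = [handle_scheduler_down(rec) for _ in range(5)]
print(decisions)
print(rec.comment)
```

A real check would probably also need a way to reset the counter, for example after a period of healthy operation or when the admin intervenes; that is left out here.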
These are fine implementation suggestions. Thanks for providing them. We should consider all options to make things faster than they currently are.
The PBS server isn't really written with threading in mind. There is a plethora of globals that are used everywhere in the server. We could think about using synchronization techniques, but that would result in either maintenance havoc or poor concurrency.
I guess the same thing that happens today :) The scheduler will go down. The server will see the scheduler as down and will restart it to run a cycle.
I do not understand what you mean by "it will be rejected". The job will be queued with a comment stating "can never run". Looking at this comment, someone monitoring the system (the admin or a periodic server hook) can then decide to move jobs or resources around so that the job can run.
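As a sketch of the monitoring side, the check could be as simple as scanning queued jobs for that comment. The code below uses plain dictionaries rather than the PBS hook API; the job records, the field names, and the `jobs_needing_attention` helper are hypothetical.

```python
from typing import Dict, List


def jobs_needing_attention(jobs: List[Dict[str, str]], marker: str = "can never run") -> List[str]:
    """Return the ids of queued jobs whose comment contains the marker text."""
    return [
        job["id"]
        for job in jobs
        if job.get("state") == "Q" and marker in job.get("comment", "").lower()
    ]


# Illustrative job records; ids and comments are made up.
jobs = [
    {"id": "101.server", "state": "Q", "comment": "Can Never Run: no matching resources"},
    {"id": "102.server", "state": "R", "comment": "Job run at some earlier time"},
    {"id": "103.server", "state": "Q", "comment": ""},
]

print(jobs_needing_attention(jobs))
```

What to do with the flagged jobs (move them, move resources, or notify someone) stays a site policy decision.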