PP-506,PP-507: Add support for requesting resources with logical 'or' and conditional operators


#37

I guess my point is this:
We cannot possibly avoid additional overhead in the scheduler, because the scheduler just has more decisions to make (no matter how this is implemented). However, we can avoid (mostly) additional overhead in the server - and in my experience bogging down the server has far worse consequences than bogging down the scheduler (“split-brain” due to false failovers, total lack of responsiveness to external commands, and slow scheduling). I would take a long scheduling cycle over a brain-dead server any day of the week. A slow scheduler reduces system throughput - a slow server makes a system unusable. One is an annoying inconvenience, the other is potentially a catastrophe.


#38

I get that: queueing jobs in the server is going to affect its performance. But in the future we may have multiple servers catering to client requests, and I’m hoping queueing the jobs won’t be as expensive as it sounds today.


#39

Thanks for posting the updated v.10 design (https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=49865741).

I really like the direction this is going, and especially that backward compatibility may be more easily accommodated, e.g., the new filter resource is a restriction, so existing qsub hooks (many of which are admission control gates) are likely to behave correctly without modification, and that the select & place syntax is unchanged (so, again, no hooks code needs to be changed there). Obviously, some changes will be necessary if a site wants to support the new capabilities, but this design may lessen the “backward compatibility re-engineering load”.

A few comments:

In case the scheduler finds out that it cannot run such a job because of resource unavailability and tries to calendar the job so that resources can be reserved for it in the future, it will use only the first resource specification that it encounters in its sorted list of jobs to calendar the job.

What issue is this attempting to address? Unless there is a strong understanding of a known issue, it would be better to start by treating each job as a “regular job”: no caveats except the “only run one” behavior. (More caveats means more complexity means less resilience and less adoption – simpler is almost always better.) I would suggest dropping this (for now), and seeing what early adopters find as the real issues (ideally, during a Beta). Then, if there is an issue, fix it, and fix it right.

If a running job which was initially submitted with multiple resource specifications gets requeued for any reason (like qrerun or node_fail_requeue or preemption by requeue), the job will get reevaluated to run by looking at each of the multiple resource specifications it was initially submitted with.

If there is not a compelling use case for handling requeues, an alternative to this would be to change the semantics from “run only one to completion” to “start only one”, and once one job in a set is started, delete the rest. This would make it easier to define what happens for some operations (e.g., how to handle qmove to another server aka peer scheduling), and would also likely reduce implementation and test effort. Again, as in the above, one would want to adjust based on early adopter feedback.
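
The “start only one” semantics can be pictured with a minimal sketch. The `Job` class, its fields, and `on_job_start` are hypothetical names made up for illustration; nothing here reflects actual PBS server code.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    job_set: str          # hypothetical field: ID of the set this job belongs to
    state: str = "Q"      # "Q" queued, "R" running


def on_job_start(started_id, jobs):
    """Mark one job running and delete every other member of its set.

    `jobs` is a dict of job_id -> Job; returns the IDs that were deleted.
    """
    started = jobs[started_id]
    started.state = "R"
    siblings = [j.job_id for j in jobs.values()
                if j.job_set == started.job_set and j.job_id != started_id]
    for job_id in siblings:
        del jobs[job_id]
    return siblings


jobs = {
    "103": Job("103", "103"),
    "104": Job("104", "103"),
    "105": Job("105", "103"),
    "200": Job("200", "200"),   # unrelated job, must survive
}
deleted = on_job_start("104", jobs)
```

With this model there is nothing left to requeue or qmove except the one started job, which is where the reduction in implementation and test effort would come from.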

Interface 1: New Job attribute called “job_set” - qsub option “-s”

  • Do we really need a single character option? Why not just use -Wjob_set= as the only interface?

  • If 103.mycluster.co is a job_set and 104, 105, and 106 are members, can I submit a job that has -Wjob_set=104?

When a job is requested with multiple select specifications, PBS server will honor the queued limits set on the server/queue and run the job submission hook on each of the resource specifications. If one of the resource specifications is found to be exceeding the limits, then that resource specification will be ignored.

  • What is the output of such a command? Is it one job ID or many job IDs?

  • Not sure about ignoring a rejected request – shouldn’t the behavior be the same as if I submitted a job with qsub -Wjob_set= …, and wouldn’t that be to throw an error? What happens if all the requests are ignored?

Interface 3: Extend PBS to allow users to submit jobs with a node-filter (nfilter) resource

  • Suggest “filter” instead of “nfilter”.

To access a specific resource out of resources_available, resources_assigned inputs, users must enclose each resource name within square brackets “[ ]” like this - “resources_available[‘ncpus’]”

Q: Is this the same syntax used in PBS hooks and the Scheduler’s job_sort_formula? (Ideally, we should have only one syntax.) Sorry, I just can’t recall this one… If it is the same syntax, I suggest making that statement explicit.

Accounting logs…

  • It would be good to capture the whole workload trace data in the PBS accounting logs (while also minimizing the impact on existing accounting post processing tools). At least some way to represent that a job is part of a job set, and some way to capture that a non-run member of a set has been removed from the queue. (This probably requires more discussion.)

Wow, thanks!


#40

Hi Steve,

Yes, the current implementation of loading jobs at startup and history jobs is not scalable. That has little to do with this RFE. The server anyway does not scale when there are a large number of jobs. Sure, this change can add more jobs to the server, but it must be understood by users that using these alternate jobs does have an impact on the server. Whether we submit it as one job or not, if every job has an alternate there is anyway a huge tax on the scheduler, so end-to-end performance takes a toll anyway.

Currently, a large number of jobs at startup does not affect the server failover capability, and does not contribute to the possibility of a split-brain (besides comms between the two servers, we keep touching a shared file every few milliseconds, even between loading jobs from the database at startup). The worst that can happen is that the server runs out of memory or is unresponsive to commands for a long duration.

The point is, when you make the job alternates an explicitly understood phenomenon, we transfer responsibility to the users/admin to restrict their usage. Not just the scheduler, with job alternates there is code inside server that runs validations etc. anyway for limits and such that will anyway bog down server some. The additional overhead (beyond what would anyway happen) with multiple job entries is negligible.

We have plans to implement the server as a stateless service (pending prioritization of that work), where we basically do not keep any job data in server memory (maybe a very small cache). When we have that, we can have millions of jobs in the history without much effort on server restart timings.


#41

Arun,

Thanks for posting the updated v.10 design (https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=498657411).

I have a few questions as well.

Users can submit jobs specifying “-s” option during submission. This attribute can only take an already submitted job-id as a value.

So what would happen if I do this:

qsub sleep.sh --> 1.svr
qsub -s 1.svr sleep.sh --> 2.svr
qsub -s 2.svr sleep.sh --> 3.svr

Would the server recognize that since 3.svr has a jobset id =2.svr which is already part of jobset 1.svr, it will make the 3 as part of the same job set?

Users can specify a node filter with node resources using conditional operators like "<, >, <=, >=, !=".

Since logical operations are supported would we be supporting nested complex expressions? If not, that must be specified.

Interface 4: New job substate “JOB_SUBSTATE_RUNNING_SET” (95)

Do we need this new substate? How about using things that array jobs use, like the Array JOB BEGUN state for the set? It’s always complex in the code when one adds a new job substate. If you add a state/flag for the “job set” object, that is fine. However, adding a job substate must be done very carefully. There is code all across the PBS codebase that takes specific actions based on the substate and could start failing due to the introduction of a new substate (unless all of it is dealt with carefully).


#42

I just wanted to minimize the overhead of having more than one job from a job_set on the calendar. If the scheduler adds more than one job, then it will reserve those resources and likely not run some other job which it could have. I get your point too, so I will take this out of the document.

I’ll make this change in the document. It would really make things easy if we delete jobs as soon as one starts. I’ll keep this part of the change as “Experimental” since it can change based on feedback.

I’ll change the option to -Wjob_set, which seems more readable. Subhasis had a similar question about submitting a job where the job ID given in the “job_set” option isn’t a job_set leader. I forgot to write this up, but I think PBS server can just reject this job request. Users must know the job_set they are submitting to. What do you think?
While I am writing this, I think we need a command to list all the job_sets if we want users to submit to the right job_set. Otherwise, PBS server should just accept it and move the job under the right job_set (which in your example would be 103).

Output of such a command will be one job id which will be the ID of the job_set leader. Regarding rejecting the request, internally server will throw an error but qsub will ignore the reject and move on to the next resource request. It can also print a message on stderr about why a resource request could not be submitted but that might break backward compatibility. Server, on the other hand, will surely log the reason of rejecting a job submission.
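
The submission flow described above can be sketched roughly as follows. `submit_job_set` and the `submit_one` callable are hypothetical stand-ins for qsub and the server, and returning `None` when every request is rejected is only a placeholder for a case the thread has not settled.

```python
import sys


def submit_job_set(requests, submit_one):
    """Submit each resource request in turn, skipping rejected ones.

    `submit_one(request, job_set)` is a hypothetical callable that returns a
    job ID, or raises ValueError when the server rejects the request.
    Returns the leader's job ID, or None if every request was rejected.
    """
    leader = None
    for request in requests:
        try:
            job_id = submit_one(request, leader)
        except ValueError as err:
            # Optional: report the reason on stderr, then move on.
            print(f"request {request!r} rejected: {err}", file=sys.stderr)
            continue
        if leader is None:
            leader = job_id   # first accepted request becomes the leader
    return leader


def make_fake_server(max_ncpus):
    """Build a toy submit_one that enforces a single ncpus limit."""
    counter = iter(range(1, 1000))
    accepted = []             # (job_id, job_set) pairs, for inspection

    def submit_one(request, job_set):
        if request["ncpus"] > max_ncpus:
            raise ValueError("exceeds queued limit")
        job_id = f"{next(counter)}.svr"
        accepted.append((job_id, job_set or job_id))
        return job_id

    return submit_one, accepted


submit_one, accepted = make_fake_server(max_ncpus=8)
leader = submit_job_set([{"ncpus": 16}, {"ncpus": 8}, {"ncpus": 4}], submit_one)
```

Here the 16-CPU request is rejected and skipped, so the 8-CPU request becomes the leader and its ID is the only one returned to the user.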

I wanted to keep it as nfilter because it signifies what it is going to filter. If we extend this filter mechanism to replace limits or queues in the server, we can then call that one jfilter. Based on the prefix “n” or “j”, the whole input that is going to be passed to the filter can be easily interpreted.

It isn’t same as job_sort_formula syntax. The reason is that formula just works on the resources requested by the job, so if the formula is like this “ncpus + 2 * mem” it is safe to assume that user is talking about resources requested by the job. In this case, we are exposing resources_available and resources_assigned on the nodes to the users. Both of these can have the same resource name, so we need a way to distinguish them.
Implementation-wise, this way of specifying the filter can be easily interpreted in Python if we expose two dicts (resources_assigned and resources_available) to it.
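
As a rough illustration of the two-dict idea, a filter string can be evaluated per node as below. This is a minimal sketch: `node_matches` and the sample node data are made up, and plain `eval()` is used only to keep it short. It is not a safe way to run user-supplied expressions.

```python
def node_matches(filter_expr, resources_available, resources_assigned):
    """Evaluate a filter expression against one node's resources.

    Only the two dicts are visible to the expression. eval() is not a
    sandbox; a real implementation would need to vet the expression.
    """
    names = {
        "resources_available": resources_available,
        "resources_assigned": resources_assigned,
    }
    # An empty __builtins__ keeps builtin functions out of the namespace.
    return bool(eval(filter_expr, {"__builtins__": {}}, names))


# node name -> (resources_available, resources_assigned); sample data only
nodes = {
    "node1": ({"ncpus": 16, "mem": 64}, {"ncpus": 16, "mem": 32}),
    "node2": ({"ncpus": 8, "mem": 32}, {"ncpus": 0, "mem": 0}),
    "node3": ({"ncpus": 32, "mem": 128}, {"ncpus": 4, "mem": 16}),
}
expr = ("resources_available['ncpus'] >= 8 and "
        "(resources_assigned['ncpus'] == 0 or resources_available['mem'] > 64)")
matching = [name for name, (avail, assigned) in nodes.items()
            if node_matches(expr, avail, assigned)]
```

Because the same resource name (`ncpus`) appears in both dicts, the explicit `resources_available[...]` and `resources_assigned[...]` prefixes are what keep the expression unambiguous. Nested conditions come for free, since the Python parser handles the grouping.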

Well, my opinion is: why take a different direction in accounting too? We can probably log job_set information if a job is part of a job_set, but other than that it will just look like a bunch of jobs were submitted, one of them ran, and the others were deleted. This could happen in any normal day-to-day accounting logs too.
Exposing job_set information in the accounting record will give post-processing tools a way to correlate things and make sense of it.
What do you think?


#43

Bill had similar question too. I should have added something related to this to the document. I’m thinking to reject a job would be the right thing to do for PBS. But if we do so, there should be a way users can list down all the job_sets too.
What do you think?

I wasn’t planning on supporting a complex nested expression. But, it is going to be interpreted using python interpreter itself and I guess that does not have a limitation on a complex expression. So yes, as long as the expression can be interpreted using a python interpreter, it can be a complex expression too.

I didn’t think about using an already existing state. If I borrow a state which has been used by another feature for years, then I would have to worry a lot about breaking backward compatibility and maintaining the semantics of what that state/substate means. Creating another substate is a lot of work, but it gives us the flexibility to do something new without having to worry about breaking backward compatibility.


#44

@billnitzberg, @subhasisb Thanks for your valuable comments, I’ll wait for a day for others to review the document before making changes.


#45

Bill had similar question too. I should have added something related to this to the document. I’m thinking to reject a job would be the right thing to do for PBS. But if we do so, there should be a way users can list down all the job_sets too.
What do you think?

I think rejecting would be troublesome for users - as you said, they would then need a list to know the job sets. Instead, we can make it transparent, i.e., you can give any job ID as the job-set ID and PBS will silently figure things out. The only case we should reject is when something in the job set is already running.


#46

I’m okay to go in that direction too. It’s just that since user was already under a misconception about the job_set leader, it shouldn’t happen that they start thinking that this is a bug in PBS :slight_smile:


#47

Well, in that mode, the job-set-leader id would not be very important - basically you just associate with any job-id that is part of a job-set already and PBS figures it out …the user will not have a misconception that way.
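
A minimal sketch of that “any member names the set” mode, using the 1.svr / 2.svr / 3.svr example from earlier in the thread. The `JobSets` class is hypothetical, and it assumes that a job submitted without a job set simply leads its own set of one.

```python
class JobSets:
    """Track job sets so that any member's ID resolves to the same set."""

    def __init__(self):
        self._leader_of = {}   # job_id -> leader job_id

    def add(self, job_id, job_set=None):
        """Register job_id; job_set may be the ID of any existing member."""
        if job_set is None:
            leader = job_id                     # a new job leads its own set
        else:
            leader = self._leader_of[job_set]   # KeyError if the job is unknown
        self._leader_of[job_id] = leader
        return leader

    def members(self, job_id):
        """Return every job in the same set as job_id, sorted by ID."""
        leader = self._leader_of[job_id]
        return sorted(j for j, l in self._leader_of.items() if l == leader)


sets = JobSets()
sets.add("1.svr")
sets.add("2.svr", job_set="1.svr")
leader = sets.add("3.svr", job_set="2.svr")   # 2.svr is a member, not the leader
```

In this sketch 3.svr lands in the set led by 1.svr, and `sets.members("2.svr")` returns all three jobs, so the user never needs to know which ID is the leader.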


#48

I’m not sure this is the direction we want to go in. This would be different from job arrays. When you requeue a job array, you requeue all the jobs in the array (including all in state X). If we are considering using job sets to replace job arrays in the future, this would make the two designs incompatible.

If we keep all the jobs like the design currently states, I think we should accept jobs to the set after the set is running. It’ll likely be deleted, but if the job set is requeued, it would become a viable job.

There is something else to think about when adding jobs in a jobset to the calendar. If we add more than one, we are making our calendar less accurate. We know only one of the jobs in the jobset will be run. By adding them all to the calendar, we take up space and push other calendared jobs out later in time. There is no real good answer to this issue. Just choosing the first one is probably the best answer. It’s the most deserving job of the set, so saving resources to get it to start running is good. We’re still not sure it is the one which will eventually run though. I think this is a better answer than adding them all though.
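
The “just choose the first one” rule amounts to a single pass over the sorted job list. The sketch below assumes a hypothetical `(job_id, job_set)` pair per job and is not scheduler code.

```python
def jobs_to_calendar(sorted_jobs):
    """Return the jobs to calendar: at most one per job set.

    `sorted_jobs` is a list of (job_id, job_set) pairs already in the
    scheduler's priority order; job_set is None for ordinary jobs.
    """
    seen_sets = set()
    chosen = []
    for job_id, job_set in sorted_jobs:
        if job_set is not None:
            if job_set in seen_sets:
                continue          # a higher-priority member is already calendared
            seen_sets.add(job_set)
        chosen.append(job_id)
    return chosen


queue = [("105", "103"), ("200", None), ("103", "103"),
         ("104", "103"), ("201", None)]
calendared = jobs_to_calendar(queue)
```

Only 105, the most deserving member of set 103, takes up space on the calendar. Jobs 103 and 104 are skipped, so jobs 200 and 201 are not pushed later in time by reservations that can never all be used.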

If you reject a request for naming a non-leader job, I’d make it clear what you are doing. Say that the request is invalid because the job is part of a jobset.

If we accept it and do the right thing then we’re basically giving a job set many names (every job in the set). We’d have to do this for all the commands as well. If the user submitted to jobset and is part of set and we added it, the user would be confused if they couldn’t act upon job set in the future.

One more thing to think about: do we want to consider a job being in multiple job sets in the future? If we do, I think we want to reject the request now.


#49

One quick note: qselect now uses a long option (--job_set). getopt_long() is not supported on Windows. Actually, getopt() isn’t supported on Windows either. We have a version of it in Libwin. If you want this long option, you’ll have to get a copy of getopt_long() for Libwin as well.

Bhroam


#50

One thing to consider in this design as well is that we already have a job set (job array) with the run criteria of run all. We now are wanting to add a job set with the run_criteria of run one. There is also a third job set to consider for genomes or code breaking with a run_criteria of run all until one succeeds. I think if we add job sets then we need to be flexible to run these and more in the future. Maybe we add a new attribute called run_criteria and set it by default to run all for job arrays and run one for job request sets.
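
To make the three run criteria concrete, here is a minimal sketch of how a single attribute could drive the “is this set finished?” decision. `RunCriteria`, its value names, and `set_is_done` are hypothetical and not part of any PBS interface.

```python
from enum import Enum


class RunCriteria(Enum):
    RUN_ALL = "run_all"                       # job arrays today
    RUN_ONE = "run_one"                       # job request sets
    RUN_UNTIL_SUCCESS = "run_until_success"   # stop once one member succeeds


def set_is_done(criteria, results):
    """Decide whether a set needs no more runs.

    `results` maps member ID -> None (not finished), True (succeeded)
    or False (failed).
    """
    finished = [r for r in results.values() if r is not None]
    if criteria is RunCriteria.RUN_ALL:
        return len(finished) == len(results)
    if criteria is RunCriteria.RUN_ONE:
        return len(finished) >= 1
    # RUN_UNTIL_SUCCESS: done on the first success, or when all have failed.
    return any(finished) or len(finished) == len(results)


results = {"a": False, "b": None, "c": None}
done_run_one = set_is_done(RunCriteria.RUN_ONE, results)
done_until_success = set_is_done(RunCriteria.RUN_UNTIL_SUCCESS, results)
```

With one failed member and two unfinished ones, the set is done under “run one” but not under “run all until one succeeds”, which is the behavioral difference a run_criteria attribute would have to capture.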


#51

Also, in talking with Arun I realized that one requirement was not clear. From the user’s perspective, they will see only one job ID for the job they submitted and will be able to delete the job request set in a single command. Now, I don’t have a requirement to allow them to change one of the job requests in the request set, but if the team feels strongly that we should provide one using qalter, then OK. However, since most users don’t use qalter, why not make it so that you have to change the whole set of resource requests if you want to change one using qalter?


#52

@jon Can you please elaborate a little about the use case of seeing only one job id when the users submit a job and delete/alter them in a single command.

Would it be sufficient if, when a user submits a job with multiple resource requests, the output of the job submission is just the job ID of the first job submitted, but a qstat by the user shows all the other jobs like any other normal job?

Would it be okay if users are allowed to perform an operation like qdel on a job-set by using the command in conjunction with ‘qselect --Wjob-set=’? This way user will be able to delete all the jobs in one command.
This way all other commands can also act on a job-set in one single command like they do on any other list of job-ids.


#53

In the interest of time, Bill’s proposal is to separate the design proposal for specifying multiple resource requests from the one for supporting conditional operators.

I’m going to separate them out and take the design proposal for PP-507 (nfilters) in a different document. Please let me know if you think that we shouldn’t be separating the design proposals.

Thanks!


#54

From the user perspective,

  • I submit a single job from the command line or CM
  • the admin modifies the job in a hook or submission script to have additional resource requests.
  • I get a single job id back
  • it doesn’t start because another job id in the job set, which I knew nothing about, runs.
  • I delete the job because I am confused
  • I look in the submission directory and I see output files.

Having multiple job ids for a single job request (not resource request) is a bad idea from a user perspective. Asking admins to train their CLI user base to use qselect is also a lot to ask, not to mention the additional documentation needed to inform users about how everything changes if they want to work with job sets.


#55

I’m pretty much in agreement with Jon here - the creation of loads of phantom job ids is going to be confusing to both end users and administrators.

And I’m still concerned about the impact on server performance. Yes, I know we’re planning on some significant improvements in this area but those aren’t scheduled until well after boolean resources are scheduled to be delivered (and, since they haven’t been implemented yet, we’re not sure how significant an improvement they’re going to produce) - so we’re basically talking about wrecking server performance for at least a year or so after the feature is implemented.

That seems like a bad idea.


#56

Thanks Jon for your inputs. I’m trying to understand the use case here -

Does the user expect the job ID he/she received from submission to be in the running state? Would it help if the user had some way of knowing that one of the resource requests ran as a separate job? What if the job ID received after submission has a comment that says “Job held, job running from this job set”?

How about passing a special parameter “--job-set” with commands like qstat, qdel, qsig, qhold, etc., and then giving any job ID which is part of a job set? This would result in the action being taken on the whole job set (not only on the specified job ID).