I’m okay to go in that direction too. It’s just that since the user was already under a misconception about the job-set leader, we shouldn’t let them start thinking that this is a bug in PBS.
Well, in that mode, the job-set-leader id would not be very important - basically you just associate with any job-id that is already part of a job set and PBS figures it out … the user will not have a misconception that way.
I’m not sure this is the direction we want to go in. This would be different from job arrays. When you requeue a job array, you requeue all the jobs in the array (including all in state X). If we are considering using job sets to replace job arrays in the future, this would make the two designs incompatible.
If we keep all the jobs as the design currently states, I think we should accept jobs into the set after the set is running. A late-added job will likely be deleted, but if the job set is requeued, it would become a viable job.
There is something else to think about when adding jobs in a job set to the calendar. If we add more than one, we make our calendar less accurate: we know only one of the jobs in the job set will run, so adding them all takes up space and pushes other calendared jobs later in time. There is no really good answer to this issue. Choosing just the first one is probably best - it is the most deserving job of the set, so saving resources to get it to start running is good. We’re still not sure it is the one that will eventually run, but I think this is a better answer than adding them all.
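To make that policy concrete, here is a small Python sketch of “calendar only the first member of each job set”. The data model and function name are illustrative, not PBS scheduler code:

```python
def jobs_to_calendar(queued_jobs):
    """Pick which jobs to add to the scheduler calendar, assuming the
    input list is already ordered by priority. Only the first (most
    deserving) member of each job set is calendared, since at most one
    member will ever run."""
    seen_sets = set()
    calendared = []
    for job in queued_jobs:
        job_set = job.get("job_set")      # None for ordinary jobs
        if job_set is not None:
            if job_set in seen_sets:
                continue                  # this set already has a calendared member
            seen_sets.add(job_set)
        calendared.append(job["id"])
    return calendared
```

The trade-off described above is visible here: later members of a set are simply skipped, so the calendar stays compact but may name a member that never actually runs.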
If you reject a request because it targets a non-leader job, I’d make it clear what you are doing. Say that the request is invalid because the job is part of a job set.
If we accept it and do the right thing, then we’re basically giving a job set many names (every job in the set). We’d have to do this for all the commands as well. If the user submitted a job into a job set and we added it, the user would be confused if they couldn’t act upon the job set through that id in the future.
One more thing to think about: do we want to allow a job to be in multiple job sets in the future? If we do, I think we want to reject the request now.
One quick note: qselect now uses a long option (--job_set). getopt_long() is not supported on Windows. Actually, getopt() isn’t supported on Windows either; we have a version of it in Libwin. If you want this long option, you’ll have to get a copy of getopt_long() into Libwin as well.
One thing to consider in this design is that we already have a job set (the job array) with a run criterion of “run all”. We now want to add a job set with a run criterion of “run one”. There is also a third kind of job set to consider, for genomes or code breaking, with a run criterion of “run all until one succeeds”. I think if we add job sets then we need to be flexible enough to run these and more in the future. Maybe we add a new attribute called run_criteria and set it by default to “run all” for job arrays and “run one” for job request sets.
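To illustrate the flexibility argument, here is a rough Python model of the three run criteria. The enum values and the function are illustrative only - run_criteria is a proposed attribute, not something PBS implements today:

```python
from enum import Enum

class RunCriteria(Enum):
    RUN_ALL = "run_all"                      # today's job arrays
    RUN_ONE = "run_one"                      # proposed resource-request sets
    RUN_UNTIL_SUCCESS = "run_until_success"  # e.g. genome / code-breaking workloads

def jobs_to_run(criteria, jobs, succeeded=False):
    """Return which queued jobs remain eligible to run under a given
    criterion. `jobs` is assumed ordered by priority."""
    if criteria is RunCriteria.RUN_ALL:
        return list(jobs)                    # every member runs
    if criteria is RunCriteria.RUN_ONE:
        return jobs[:1]                      # only the most deserving member runs
    # RUN_UNTIL_SUCCESS: keep running members until one succeeds
    return [] if succeeded else list(jobs)
```

A single attribute like this would let job arrays and job request sets share one mechanism while behaving differently.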
Also, in talking with Arun I realized that one requirement was not clear. From the user perspective, they will only see one job id for the job they submitted and will be able to delete the job request set in a single command. I don’t have a requirement to allow them to change one of the job requests from the request set, but if the team feels strongly that we should provide one using qalter, then ok. However, since most users don’t use qalter, why not require changing the whole set of resource requests if you want to change one via qalter?
@jon Can you please elaborate a little on the use case of seeing only one job id when users submit a job and delete/alter it in a single command?
Would it be sufficient that if a user submits a job with multiple resource requests, the output of the job submission is just the job-id of the first job submitted, but if the user does a qstat, it shows all the other jobs like any other normal job?
Would it be okay if users are allowed to perform an operation like qdel on a job set by using the command in conjunction with ‘qselect --Wjob-set=’? This way the user will be able to delete all the jobs in one command.
This way all other commands can also act on a job-set in one single command like they do on any other list of job-ids.
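The proposal amounts to this: any member id expands to the full job set before a command acts on it. A minimal Python sketch, with a purely illustrative data model (a dict mapping set name to member ids):

```python
def expand_to_job_set(job_id, job_sets):
    """Given any member's id, return every id in its job set, so a command
    like qdel or qhold can act on the whole set in one invocation.
    A job belonging to no set expands to just itself."""
    for members in job_sets.values():
        if job_id in members:
            return list(members)
    return [job_id]
```

Under this model, `qdel <any member>` and `qdel <leader>` would behave identically, which is exactly the property being debated above.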
In the interest of time, Bill’s proposal is to separate the design proposal for specifying multiple resource requests from the one for supporting conditional operators.
I’m going to separate them out and take the design proposal for PP-507 (nfilters) in a different document. Please let me know if you think that we shouldn’t be separating the design proposals.
From the user perspective,
- I submit a single job from the command line or CM
- the admin modifies the job in a hook or submission script to have additional resource requests.
- I get a single job id back
- it doesn’t start because another job id in the job set, which I knew nothing about, runs.
- I delete the job because I am confused
- I look in the submission directory and I see output files.
Having multiple job ids for a single job request (not resource request) is a bad idea from a user perspective. Having to train users to use qselect is also a lot to ask of admins and their CLI user base, not to mention the additional documentation needed to inform users about how everything changes when they work with job sets.
I’m pretty much in agreement with Jon here - the creation of loads of phantom job ids is going to be confusing to both end users and administrators.
And I’m still concerned about the impact on server performance. Yes, I know we’re planning on some significant improvements in this area but those aren’t scheduled until well after boolean resources are scheduled to be delivered (and, since they haven’t been implemented yet, we’re not sure how significant an improvement they’re going to produce) - so we’re basically talking about wrecking server performance for at least a year or so after the feature is implemented.
That seems like a bad idea.
Thanks, Jon, for your input. I’m trying to understand the use case here -
Does the user expect the job-id he/she received from submission to be in the running state? Would it help if the user had some way of knowing that one of the resource requests ran as a separate job? What if the job-id received after submission had a comment that says “Job held, job running from this job set”?
How about passing a special parameter “--job-set” to commands like qstat, qdel, qsig, qhold, etc., and then giving any job-id which is part of a job set? This would result in the action being taken on the whole job set (not only on the specified job-id).
There is an additional functionality requirement embedded in @jon’s example that I don’t think has been considered (or mentioned) to date:
- the ability for a submission hook to add resource alternatives to an existing request
How important is that functionality? Could someone describe the use case (not the implementation, but the use case from the user/admin point of view without referencing PBS Pro, the goal they are trying to achieve by having this functionality)?
Separately, the UI can provide a single job id for the collection while also allowing multiple ids, one for each resource request. For example, imagine if PBS Pro provided individual job ids for every subjob of an array job - the user could still use a single job id, e.g., 124[] or even 124, to refer to the collection.
There has been discussion about how a hook parses the resource request. The extension of that is how the hook writer alters/appends/rejects that resource request. An application of this: if I am doing allocation management, how do I reserve an allocation for a job that is part of the “resource request set”, and how will a submission hook know about it if the server hasn’t seen the job?
As for the use case: I have a heterogeneous cluster and I want all jobs to use whole nodes. I know what my cluster has, so I change the initial job resource request into a set of acceptable “resource requests” to ensure only whole nodes will be allowed. The use cases are the same as discussed above, just driven from a hook. I see no difference between doing it from a hook and doing it from a submission web portal.
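That whole-node rewrite could look something like this. It is shown as plain Python rather than the pbs hook API, with illustrative names, since the ability for a hook to attach multiple resource requests is exactly the capability being proposed:

```python
def whole_node_alternatives(ncpus_wanted, node_sizes):
    """Turn a single ncpus request into one whole-node select alternative
    per node size present in a heterogeneous cluster, rounding the request
    up so each alternative consumes complete nodes."""
    alternatives = []
    for size in sorted(set(node_sizes)):
        nodes = -(-ncpus_wanted // size)      # ceiling division
        alternatives.append(f"{nodes}:ncpus={size}")
    return alternatives
```

A submission hook would compute something like this list and hand it to the server as the job’s set of alternative resource requests.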
Agreed. The user should only see one job id and be able to operate on a single job id. If we use the job array syntax then we will need a way to distinguish between job arrays and resource requests for qstat (i.e., qstat -t should only show job sets and not “the set of resource requests”). As for qalter, I still think that it should be required to change all requests if a user/admin is not happy with an individual resource request.
Yes. Submission portals also expect that the job id returned is the one that runs and exits before they continue to the next step, whatever that may be.
It may, but as a user I would not like to see “Job held, job running from this job set”. My first response is: what is a job set, if I did not request one? The second is that I would call the admin and ask what this means. And when the stdout and stderr files come back with different job ids, which one do I look at? It just adds more confusion. Also, if an application was specifically watching the stdout/stderr files to see that the job has continued, now what does it do? Do we cause sites to rewrite those applications as well to handle an individual job id per resource request?
I don’t think a special parameter “--job-set” is the right way to go. We don’t require this anywhere else (except when you submit a job dependency), so introducing it here seems like the wrong thing to do from the user perspective.
How is all this going to interact with automated workflow managers/monitors like Cylc and SMS?
Some of us (Bill, Bhroam, Jon and myself) worked on capturing the use cases and the motivation behind those use cases that we thought were missing from the design page.
I have added the outcome of the discussion on the design page and it now mentions the use cases for multiple resource requests and node filtering as well.
Please have a look at it.
It seems the design proposal so far is in the right direction (for use case 1 under conditional requests), where it allows users to submit jobs with multiple resource requests and also makes the PBS scheduler select one of the resource requests according to how it has been prioritized (by the admin or by the user).
Use case 2 probably needs something like “place=group=” to place the job on nodes that have the same resource value. But, since this can be requested in multiple resource requests, it needs to work in conjunction with use case 1.
Use case 3, however, can be addressed differently by providing a special way of packing the jobs on nodes, where the nodes are selected such that they can be fully and equally occupied on the basis of the cores the job has requested.
For the use cases related to filtering nodes, a filter (as proposed in the design) provided with each resource request (supporting conditional operators) can be applied to address all of them.
Currently we have moved resources away from this project. So this project is stalled for now.