PP-506,PP-507: Add support for requesting resources with logical 'or' and conditional operators


#1

Hi,

I’ve posted a design proposal to add support for requesting resources with logical ‘or’ and conditional operators while submitting jobs.

Please have a look at the design proposal and provide your feedback.

Thanks,
Arun


#2

Arun,

Thanks for posting this. Overall, it looks like a good start. Below are my thoughts.

In interface 1 you have

  • “PBS scheduler will honor the run limits (soft or hard) based of the maximum of all resources requested by the job” Does this mean that if there is a run limit of 10 ncpus and I have a job that asks for 12||8 ncpus, the scheduler will not consider this job? If the answer is yes, should this be changed so the scheduler only tries to schedule the requests that do not violate the limit instead of excluding the job from consideration?

  • “logical OR operator can not be used to submit reservations.” Should we remove this limitation if we accept conditional operations for reservations in interface 2?
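A minimal sketch of the alternative behavior suggested in the first bullet, assuming a single ncpus run limit and ignoring resources already in use. The function name `runnable_alternatives` and the bare `12||8` request format are hypothetical and only illustrate the idea; they are not part of PBS.

```python
def runnable_alternatives(request, run_limit):
    """Return the ORed ncpus alternatives that stay within run_limit.

    Hypothetical helper: 'request' is a string such as "12||8".
    """
    alternatives = [int(part) for part in request.split("||")]
    # Keep only the alternatives that do not violate the limit,
    # instead of rejecting the whole job because of its largest one.
    return [ncpus for ncpus in alternatives if ncpus <= run_limit]


print(runnable_alternatives("12||8", 10))   # [8]
print(runnable_alternatives("12||16", 10))  # []
```

With a limit of 10, the job asking for 12||8 would still be considered through its 8-ncpus alternative; only a job whose alternatives all exceed the limit would be skipped.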

In interface 2 you have

  • “Users can request for non-consumable resources with conditional operator like “<, >, <=, >=, !=””. Should this be reworded to “Users can request for non-consumable chunk level resources with conditional operator like “<, >, <=, >=, !=””?

Also, one piece that I think is missing is a way to account for job-wide resources (e.g. walltime, software licenses). I believe that including these would make this more useful: if I get 20 cores instead of 10, my job would ideally finish in 40-50% of the time, but it would require more software licenses. Does this make sense? If so, can we add this functionality?


#3

Thanks for your comments @jon!

The point about maximum resources was only valid for queued limits, not for run limits. I’ll make that change in the document.
IMO reservations are a little more specific in nature, as they run as queues, which is why I thought they might not need to be submitted with the OR operator.

I’ll reword the second interface.

You bring up a very interesting point about job wide resources, and I didn’t think of that. I can’t really mix job wide resources into the select specification, as it is not intended to carry those resources.

If we need this functionality then probably this is what can be done -

  • Create a new builtin string resource called “job_wide”. This resource can take a combination of many “ORed” job wide resources.

  • This resource can only be used when there are multiple select specifications given by the user.

  • Number of ORed job_wide resources must match the number of select specifications given by the user.
    For example, a job submission with multiple job wide resources may look like -

    qsub -lselect="2:ncpus=8:mem=12gb || 4:ncpus=3:mem=16gb" -l job_wide="walltime=00:10:00 || walltime=00:08:00" job.scr
    This means that if select spec “2:ncpus=8:mem=12gb” is applied then the walltime would be 10 mins, and if “4:ncpus=3:mem=16gb” is selected then the walltime would be 8 mins.
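The pairing rule described above can be sketched as follows. This is only an illustration under stated assumptions: `pair_alternatives` is a hypothetical name, each job_wide alternative is assumed to hold a single `name=value` resource, and the real server-side parsing may look quite different.

```python
def pair_alternatives(select, job_wide):
    """Pair each ORed select spec with the job_wide resource at the same position."""
    selects = [s.strip() for s in select.split("||")]
    wides = [w.strip() for w in job_wide.split("||")]
    # The proposal requires the two lists to have the same length.
    if len(selects) != len(wides):
        raise ValueError("job_wide alternatives must match select alternatives")
    paired = []
    for sel, wide in zip(selects, wides):
        name, _, value = wide.partition("=")
        paired.append((sel, {name: value}))
    return paired


pairs = pair_alternatives(
    "2:ncpus=8:mem=12gb || 4:ncpus=3:mem=16gb",
    "walltime=00:10:00 || walltime=00:08:00",
)
for sel, wide in pairs:
    print(sel, wide)
```

Each select specification ends up tied to the walltime that applies if that specification is the one chosen to run the job.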

I will update the document with such an approach.


#4

Jon and Arun,

I was wondering about an ability to factor in “time to results” considering various placement options for a job where time to result is something like wait time + run time. If I had a job that ran e.g. 1 hour with -l select=2:ncpus=16 -l place=scatter vs 2 hours on -l select=32:ncpus=1 -l place=free, then I might be interested in waiting 30 minutes for the preferred choice to be available, even if the job could run right away with the 2nd option. However, I might not be interested in waiting 5 hours. If I understand the suggested design, then the choice between the resource options provided by the user is to be evaluated at every cycle based on the availability of resources at the time of the cycle.
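To make the “time to results” idea concrete, here is a rough sketch that picks the option with the smallest wait time plus run time. It assumes the wait time for each option can be estimated up front, which the scheduler may not be able to do; `pick_option` and the tuple layout are hypothetical.

```python
def pick_option(options):
    """options: list of (name, estimated_wait_minutes, run_minutes).

    Returns the name of the option with the smallest wait + run time.
    """
    return min(options, key=lambda opt: opt[1] + opt[2])[0]


# Preferred placement is free in 30 minutes: 30 + 60 = 90 beats 0 + 120.
print(pick_option([("scatter", 30, 60), ("free", 0, 120)]))   # scatter

# Preferred placement is free in 5 hours: 300 + 60 = 360 loses to 0 + 120.
print(pick_option([("scatter", 300, 60), ("free", 0, 120)]))  # free
```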


#5

@rrehburg Thanks for providing your inputs. This is a very interesting thought and I can see that it is useful too in many cases.

But I also think there is another aspect to consider. With this requirement in place, the scheduler may not consider the second select specification until the job surpasses the “wait_time”. Even after the wait_time is over, it may not find resources to run the job with the second select specification because they were taken by other jobs, resulting in an even longer waiting period.

The scheduler as of today is written to run a job as soon as it can. This requirement would make the scheduler look for the best fit for a job instead of running it on the first fit, and this may add delays in running jobs.


#6

I’ve modified the document again and it is due for another review :)


#7

Hey Arun,
Overall the design looks good. It’s exciting functionality.

Here are some comments I have:

  • You probably don’t want to call your interfaces by their internal ATTR_* constant. Use their real name.

Interface 1:

  • Do you want to say that the scheduler looks at the selects from left to right? This sounds like more of an internal decision the scheduler is making.
  • I’d use the terminology ‘unset’ instead of ‘cleared’ when talking about the selectedspec resource. This is a well defined term in PBS.

Interface 2:

  • I’m not sure I understand what you mean in the second bullet. What is the string comparison function? What does it mean to say str1 > str2?

Interface 3:

  • Are you sure you want these log messages to be public stable? It’s hard to change such interfaces.
  • You might want to change the debug level of the first message to DEBUG2. DEBUG3 is for the per-job per-node messages (rather spammy).

Interface 4:

  • In the first bullet you say the schedselect will show the select provided by the user. Is this true? Will the defaults not be applied?
  • Is this interface required? Can’t you regenerate the schedselect from the original Resource_List.select? We’re only changing schedselect, not Resource_List.select

Interface 5:

  • Why limit the use of job_wide to only when multiple selects are used? It is a nice way of submitting all your job wide resources in one place.
  • How do you submit multiple job wide resources at once? Does it use the same ‘:’ delimiter as the select? If so, you’ll run into a problem with the place spec. It also uses the ‘:’
  • Drop the first sentence of the second bullet. It talks about the flags of READ_WRITE. This is an internal flag. The second sentence says what you need.

Interface 6:

  • Drop the first sentence of the first bullet. Similar to above, it lists ATTR_DFLAG_MGRD which is an internal flag. The second sentence says what it needs to say.

#8

@bhroam Thanks for giving review comments.

I’ve changed the document to not mention internal names/flag and just specify their meaning.

In “Interface 1” it is important to mention the order in which the scheduler will consider select specifications, because we want the user to know which select specification will be given preference. This will help users to specify the select they most likely want their job to run with.

In “Interface 2”, PBS can only compare two strings using basic string comparison functions, because it does not know what the resource is about. This also means that admins need to be careful when assigning string values to their resources: for example, ver_12 might turn out to be greater than ver_045. But if they make sure that all such resources have three numerical digits, then ver_012 will turn out to be less than ver_045.
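A small illustration of the ver_12 versus ver_045 pitfall, using Python's character-by-character string ordering as a stand-in for whatever comparison function PBS would use (that equivalence is an assumption). The `OPS` table and `string_matches` are hypothetical names.

```python
import operator

# Conditional operators from the proposal, mapped to plain string comparison.
OPS = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "!=": operator.ne,
}


def string_matches(node_value, op, requested):
    """Compare two resource values purely as strings, with no numeric awareness."""
    return OPS[op](node_value, requested)


# '1' sorts after '0', so ver_12 compares greater than ver_045.
print(string_matches("ver_12", ">", "ver_045"))   # True
# With zero padding to three digits the ordering matches the numeric one.
print(string_matches("ver_012", "<", "ver_045"))  # True
```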

I changed the visibility and change control for log messages.
You were also right about Interface 4 that “select” can be used to create schedselect. I’ve removed interface 4 now.

In Interface 5 - I understand you want “job_wide” to have similar behavior as select. I personally felt that submitting all job wide resources under “job_wide” does not actually give us anything apart from being more organized. In case of select it probably made sense because in select same resources can be specified in different chunks but that isn’t the case with job_wide. Also, allowing this will cause us additional work to parse it in server and have more error checking if they are specified with multiple job_wide ORed resources without any select specification.

Would it still cause a problem if the placement spec is specified in another set of quotes? Do you have a suggestion for a different delimiter?

I’ve made all other changes to the document as you suggested.

Thanks!


#9

Hey Arun,
Thanks for making the changes.

So what you mean is that PBS will use an alpha sort instead of any type of numeric sort? I think you should be explicit in saying that (and maybe put your example in). Either that or limit what operators can be used on strings to just = and !=. I can’t 100% convince myself that the other operators are useful in the case of strings.

I agree that it provides a more organized way of providing our job wide resources. I also think that having no ||s is just a degenerate case of having ||s. I don’t think we should limit the functionality. Why have a special case when we’re using ||s?

I understand it will be extra work, but the code needs to exist to handle the multiple || case, so why not use it for the no || case? I just hate special cases.

I don’t think different types of quotes will work because qsub strips them. The quotes will be gone by the time the string hits the server. I gave some thought into different delimiters and I think the ‘+’ is an option. It separates chunks in the select, and in some way different job wide resources are like separate chunks.

Bhroam


#10

Hi Bhroam,

Thanks for reviewing it again!
I agree with all your comments and have modified the document to reflect the same.

Please have a look.

Thanks!


#11

Thanks for making the changes. It looks good to me.


#12

Hi @arungrover – awesome new feature – thanks!

I started writing some low-level comments, but I realize I have a bigger design question/suggestion… The current design seems to make the allocation request language even more ugly than it already is…

Rather than invent a lot of new syntax, how about moving the “||” outside the select and eliminating the need for job_wide, e.g.,

qsub -l select=3:ncpus=1:mem=2gb -lscratch=5gb --OR -l select=1:ncpus=3:mem=6gb -lscratch=100gb

Alternatively, as long as the design is making the big changes it is suggesting with the language, inventing an entirely new language would also be OK, e.g.,

qsub --request="(((ncpus = 1) && (mem = 2gb)) || ((ncpus = 3) && (mem = 1 gb)))"

or some such. If you have time, perhaps we can discuss live (if others want to join, we can set up a web meeting).


#13

Bill,

I think these are really good points. Job-wide resources like walltime are often going to be dependent on the number and type of chunk resources allocated to the job. There needs to be a clear way to express this to PBS for this feature to be really useful. I think your first proposal really has that covered. I’m not sure how this works in your second example, but that’s something that can be hashed out.


#14

We might consider adding a JSON or Python dictionary format to express the requirements. If there are multiple elements we can assume that matching any of them constitutes success. I’d also like to avoid introducing another command line parameter by adding a new type of “select”. For example:

qsub -l jsonselect="{{'count': 3, 'ncpus': 1, 'mem': '2gb', 'scratch': '5gb'}, {'count': 1, 'ncpus': 3, 'mem': '6gb', 'scratch': '100gb'}}"

One advantage is that we can embed sets within sets so you could express alternatives within alternatives.


#15

These are all very good directions about how we want to provide resource inputs to PBS.

I personally like this one, “qsub -l select=3:ncpus=1:mem=2gb -lscratch=5gb --OR -l select=1:ncpus=3:mem=6gb -lscratch=100gb”, better than the others.

I’m not able to imagine how we can provide multiple chunks in each select with the other two approaches.

We can probably have a discussion about the appropriate way of providing resource inputs.


#16

The editor won’t preserve the indentation, so this is a bit harder to read than it should be. Using this strategy, we could further embed options within options and still be able to express multiple chunks. Just grab your favorite JSON parser, and the data is easily imported. Just use something like qsub -l jsonselect=$(cat job_reqs.json)

{ "option1": { "walltime": "01:00:00", "chunk1": { "count": 4, "ncpus": 1, "mem": "1gb", "scratch": "5gb" }, "chunk2": { "count": 1, "ncpus": 1, "mem": "4gb" } }, "option2": { "walltime": "01:00:00", "chunk1": { "count": 2, "ncpus": 2, "mem": "2gb", "scratch": "10gb" }, "chunk2": { "count": 1, "ncpus": 1, "mem": "4gb" } }, "option3": { "walltime": "02:00:00", "chunk1": { "count": 2, "ncpus": 1, "mem": "1gb", "scratch": "5gb" }, "chunk2": { "count": 1, "ncpus": 1, "mem": "4gb" } } }
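As a rough sketch of how such a document could be consumed, the snippet below loads two of the options with Python's standard `json` module and turns each one into a conventional select string. The `jsonselect` format is only a proposal in this thread, and `to_select` plus the convention that chunk keys start with "chunk" are assumptions made for illustration.

```python
import json

job_reqs = """
{"option1": {"walltime": "01:00:00",
             "chunk1": {"count": 4, "ncpus": 1, "mem": "1gb", "scratch": "5gb"},
             "chunk2": {"count": 1, "ncpus": 1, "mem": "4gb"}},
 "option2": {"walltime": "01:00:00",
             "chunk1": {"count": 2, "ncpus": 2, "mem": "2gb", "scratch": "10gb"},
             "chunk2": {"count": 1, "ncpus": 1, "mem": "4gb"}}}
"""


def to_select(option):
    """Build a select-style string ("N:res=val+...") from one option."""
    chunks = []
    for key, chunk in option.items():
        if not key.startswith("chunk"):
            continue  # job wide entries such as walltime are handled separately
        resources = ":".join(f"{k}={v}" for k, v in chunk.items() if k != "count")
        chunks.append(f"{chunk['count']}:{resources}")
    return "+".join(chunks)


options = json.loads(job_reqs)
for name, option in options.items():
    print(name, option["walltime"], to_select(option))
```

Each option carries its own walltime next to its chunks, so the job wide resources stay tied to the alternative they belong to.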


#17

How does this impact job_sort_key, the scheduling formula, and fairshare_usage_res, etc. in the scheduler’s ordering of the jobs? Like limits, would the largest values also get used?


#18

I think you would have to treat each option as though it were an independent resource request even though they all pertain to one job.


#19

If I am reading you right, I think I agree. So we really wouldn’t sort “jobs” in job_sort_key any more, for example, we’d sort individual “resource requests”. That would avoid any potential gaming of the system by having one grossly inflated resource request that you don’t actually want only to get to the front of the sorted job order and have your subsequent reasonable resource request be evaluated ahead of others who did not cheat.


#20

That’s an accurate summary, IMHO. It should not require changes to the code that does the sorting. When resource requests are selected for execution, the scheduler will have to make sure it doesn’t select multiple requests belonging to the same job.
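The idea from the last few posts can be sketched as follows: every alternative is sorted as its own resource request, and once one request of a job has been selected, the job's remaining requests are skipped. The `priority` field, the single ncpus pool, and the `schedule` function are simplifying assumptions, not how the PBS scheduler is actually structured.

```python
def schedule(requests, free_ncpus):
    """Pick at most one resource request per job, in priority order.

    requests: list of dicts with 'job', 'ncpus' and 'priority' keys.
    Returns a dict mapping job name to the request selected for it.
    """
    started = {}
    for req in sorted(requests, key=lambda r: r["priority"], reverse=True):
        if req["job"] in started:
            continue  # another request of this job was already selected
        if req["ncpus"] <= free_ncpus:
            started[req["job"]] = req
            free_ncpus -= req["ncpus"]
    return started


requests = [
    {"job": "A", "ncpus": 12, "priority": 5},
    {"job": "A", "ncpus": 8, "priority": 3},
    {"job": "B", "ncpus": 4, "priority": 4},
]
result = schedule(requests, free_ncpus=16)
print({job: req["ncpus"] for job, req in result.items()})  # {'A': 12, 'B': 4}
```

The sorting step itself is unchanged; only the selection loop needs the extra check that a job is not started twice.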