PP-734: Ability to release limited resources when a job is suspended


#21

@arungrover typing the various attribute names made me pay more attention to them…I have a few naming suggestions for you to consider.
For res_released_on_susp: reservations are often referred to as res or resv (when spoken, it is hard to hear the V). To make it clear that this means resources and not reservations, how about resc_released_on_susp, resources_released_on_susp, or resources_released_on_suspend (granted, that might be a bit long)?

And I think you should add an “s” to resource_released_list to make it “resources_released_list”. Then it will more closely match/associate with “resources_released”. You could do the same with “res_released_on_susp” (e.g. resources_released_on_*) and make a matched set.


#22

Isn’t what you’re describing just how other attributes work? If they are set to the default value, they are printed out. If they are unset, they behave as if they were set to the default but don’t get printed out. For example, mail_from is “set” to adm whether or not you see it in the Qmgr “p s” output:

smgoosen@linux-gnn7:~> qmgr
Max open servers: 49
Qmgr: p s

set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm <---------------------
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1

smgoosen@linux-gnn7:~> qsub -l select=1 -W sandbox=private -M smgoosen@linux-gnn7 pbs.script
5.linux-gnn7
smgoosen@linux-gnn7:~> su -
Password:
linux-gnn7:~ #
linux-gnn7:~ # qdel 5
linux-gnn7:~ # logout
You have mail in /var/spool/mail/smgoosen
smgoosen@linux-gnn7:~> mail
Heirloom mailx version 12.2 01/07/07. Type ? for help.
"/var/spool/mail/smgoosen": 1 message 1 new
N 1 adm@linux-gnn7.sit Thu Apr 27 14:32 16/571 PBS JOB 5.linux-gnn7
? q

smgoosen@linux-gnn7:~> su -
Password:
linux-gnn7:~ # qmgr
Max open servers: 49
Qmgr: u s mail_from
Qmgr: p s

set server scheduling = True
set server default_queue = workq
set server log_events = 511 <-------------- mail_from no longer printed out
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1

smgoosen@linux-gnn7:~> qsub -l select=1 -W sandbox=private -M smgoosen@linux-gnn7 pbs.script
6.linux-gnn7
smgoosen@linux-gnn7:~> su -
Password:
linux-gnn7:~ # qdel 6
linux-gnn7:~ # logout
smgoosen@linux-gnn7:~> mail
Heirloom mailx version 12.2 01/07/07. Type ? for help.
"/var/spool/mail/smgoosen": 2 messages 1 new 2 unread
U 1 adm@linux-gnn7.sit Thu Apr 27 14:32 17/581 PBS JOB 5.linux-gnn7
N 2 adm@linux-gnn7.sit Thu Apr 27 14:34 16/571 PBS JOB 6.linux-gnn7
?

Still get mail from adm. Other attributes also work this way (log_events, etc)

I’m just suggesting we follow existing precedent
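
The precedent shown in the transcript can be modelled in a few lines. This is only an illustrative Python sketch, not PBS code: the `ServerAttrs` class and the `DEFAULTS` table are hypothetical names, and the only default it encodes is the `mail_from = adm` behavior seen above.

```python
# Hypothetical model of "unset behaves like the default but is not printed".
# Not PBS source code; the class and table names are made up for illustration.

DEFAULTS = {"mail_from": "adm"}  # default observed in the transcript above


class ServerAttrs:
    def __init__(self):
        self._explicit = {}  # only attributes that were explicitly set

    def set(self, name, value):
        self._explicit[name] = value

    def unset(self, name):
        self._explicit.pop(name, None)

    def effective(self, name):
        # Behavior falls back to the default when nothing is set.
        return self._explicit.get(name, DEFAULTS.get(name))

    def print_server(self):
        # Like "p s": list only the explicitly set attributes.
        return [f"set server {k} = {v}" for k, v in self._explicit.items()]


attrs = ServerAttrs()
attrs.set("mail_from", "adm")
shown_when_set = attrs.print_server()
attrs.unset("mail_from")
shown_when_unset = attrs.print_server()
sender_when_unset = attrs.effective("mail_from")
```

After the unset, the printed list no longer mentions `mail_from`, yet the effective sender is still the default, which is the same pattern as in the two mail messages above.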


#23

Thanks @smgoosen for providing an example where unsetting an attribute has the same behavior as the default.
However, I see a slight difference between these attributes and the one we are talking about: none of them takes, as its default, a keyword that doesn’t match the attribute’s own semantics.

“res_released_on_susp” is supposed to contain a list of comma separated resource names. Having a keyword set in that attribute by default does not really go with the semantics of the attribute itself.
I agree with Lisa: if we end up supporting a keyword in the attribute, should we be populating the other attributes too? If so, the other attributes will end up having the same values as the exec_vnode and resource_list of the job. If not, it will look odd that one attribute says something while the other attributes are not populated.
I feel that supporting a keyword there would create more confusion. I’d request you to reconsider.

Thanks!


#24

So your belief is that what I’m suggesting is more confusing than:

to release everything you unset the value (so nothing = everything)
on the other hand having mem set means only release mem

That seems very counterintuitive. As you say, “res_released_on_susp” is supposed to contain a list of comma separated resource names. Setting it to is not a list of comma separated resource names and, on its face, would indicate “nothing” should be released (though I’ve already agreed that we can have it mean everything). At least having a keyword to indicate would be much more intuitive to a customer when “everything” is getting released. Also, setting the other attributes to what means for a particular job does not seem confusing.


#25

Attribute naming: let’s make the attribute names save the admin from having to look up the usage.

  • For Interface 1, let’s make it clear that these are the resources that are to be released. Let’s use “resources_to_release_on_susp”.
  • For Interface 2, let’s use “node_resources_released”.
  • For Interface 3, let’s use “server_resources_released”.

Attribute contents: for Interface 1, if the attribute has not been set, PBS can populate it with the list of resources that will be released on suspension. This makes the attribute a one-stop shop; again, the admin won’t have to go looking around for which resources will be released.


#26

Thank you @agurban for your review comments. Sam and I had a quick talk about adding all the resources to this attribute by default. We feel that listing every resource name by default would make the list too big, and it would be hard for anyone to search for a resource in that comma separated list.
On the other hand, I think just following your suggestion to rename the attribute to “resources_to_release_on_susp” would make things clearer.
@smgoosen what do you think about the new name that @agurban suggested for the attribute?


#27

@arungrover I like something closer to the original attribute: “resources_released_on_susp”. This is a shade shorter, and for some reason I don’t like the word ‘to’ in an attribute name. I guess that’s personal preference though.

Bhroam


#28

@bhroam Personally, I do not have any problem with the name of the attribute in its current form either. It is just that adding a “to” may take care of some of the ambiguity about which resources are going to be released. I am doing this only because I do not want to add a keyword to the attribute value to show that we release all the resources by default.


#29

Maybe it’s just me, but when I see ‘resources_released_on_susp’ I see it meaning the resources that are released on suspension. This is pretty much the same thing as resources_to_release_on_susp. It’s just a different verb tense.

I do agree that if it’s a choice between adding a keyword and changing the attribute name, we should change the attribute name. Keywords are special cases. Special cases just make the product more complicated.

Bhroam


#30

As Arun mentions I believe we can solve this with an attribute name that would make sense. I’d also like to avoid listing out every resource for the default (release everything). How about something like:

restrict_resources_to_release

That way an empty value would mean release everything (nothing is restricted), but if there is something mentioned (e.g. ncpus) then we would be restricted to only releasing that/those resource(s).
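
A rough sketch of those semantics, in Python purely for illustration. This is not PBS code: the function name `resources_to_release`, its parameters, and the sample job are all hypothetical, and the attribute value is assumed to be a comma separated string as discussed earlier in the thread.

```python
# Hypothetical sketch of the proposed semantics; not PBS source code.

def resources_to_release(restrict_value, job_resources):
    """Return the names of the job's resources released on suspension.

    restrict_value: the attribute as a comma separated string, or None/"" if unset.
    job_resources: dict of resource name -> amount assigned to the job.
    """
    if not restrict_value or not restrict_value.strip():
        # Unset or empty: nothing is restricted, so everything is released.
        return sorted(job_resources)
    allowed = {name.strip() for name in restrict_value.split(",") if name.strip()}
    # Only resources that are both listed and held by the job are released.
    return sorted(name for name in job_resources if name in allowed)


job = {"ncpus": 4, "mem": "8gb", "ngpus": 1}  # made-up job for the example
release_all = resources_to_release(None, job)
release_ncpus = resources_to_release("ncpus", job)
release_two = resources_to_release("ncpus, mem", job)
```

With the attribute unset, all three of the job's resources are released; with `ncpus` set, only `ncpus` is released.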


#31

Thanks for your reply @smgoosen. Since the feature is about releasing a set of resources when a job is suspended I think adding a bit about “suspension” in the name is necessary. So, how about this - “restrict_resc_to_release_on_susp”?


#32

The attribute name looks long, but I believe it captures all the info we need it to convey. I’m OK with it.


#33

Thanks @smgoosen
I’ve made necessary changes to the design proposal. Please have a look again.


#34

Looks good to me too.


#35

Hi – there had been a desire to avoid shortened names (and to spell keywords out fully) for all new keywords. Susp and resc go the wrong direction…

Are those two shorthands already used consistently everywhere else in PBS for suspend and resources? If not, I suggest either spelling them out completely or using something simpler, like “suspend_free” or “release_on_suspend”.


#36

EDD clarification suggestion…

Change:
A new server attribute “restrict_resc_to_release_on_susp” can be used to specify a list of comma separated list of resource names that can be release when a job is suspended.

To:
A new server attribute “restrict_resc_to_release_on_susp” is a comma separated list of resource names. The resources that get released on suspension will be restricted to the resources listed in “restrict_resc_to_release_on_susp”.


#37

Hi Sam,

Let’s stick with “resources_to_release_on_susp”.

We need to be precise with our naming. The word “restrict” is
ambiguous: are these the ones to which we are restricting ourselves to
releasing, or are these the ones restricted from being released?

In addition, “released” sounds as if they have already been released.
It’s not just a matter of tense.

Using “to_release” is clear: we’re going to release these.

-Anne


#38

@billnitzberg We can use the full names but then this already long attribute name will be even longer.

There is already precedent for using shorthand while representing “resources” as “res” in “max_user_res_soft, max_group_res_soft”, but there isn’t any for “suspend”.
If it is okay, I can change “restrict_resc_to_release_on_susp” to “restrict_res_to_release_on_suspend”. What do you think?


#39

@agurban Using that name, “resources_to_release_on_susp” introduces ambiguity when the default is “unset” to release all resources. And we definitely don’t want to list out everything. Lisa, Alexis and I agree that some form of a new attribute name like “restrict_res_to_release_on_suspend” would reduce ambiguity.


#40

Hi Sam,

I don’t quite see how this solves the problem of what to do when
it’s unset.

-Anne