Add new hook events at the server end


#21

@varunsonkar In that case, I agree with you.


#22

@sujatapatnaik52,
I have a few questions/comments:

  • For the resv_end hook, what happens when a job is running in the reservation queue and the admin deletes the reservation?

  • In the UCR I see:

A new reservation end hook event shall be introduced when a confirmed reservation(standing or advanced) ends successfully or there is a user request for deleting the same.

I think it applies to a running reservation as well. Please add that too.

  • Does the endjob hook run for all the subjobs of a job array?

#23

When the reservation is deleted by admin, the resv_end hook will run only after the running jobs are removed from the reservation queue.

Updated this one.

Yes.

Thank you for the suggestions.


#24

@sujatapatnaik52 I have some comments -

  • In the UCR, under sections 1 & 2 (reservation confirmation/end events), you mentioned that admins should have read access to the server. Can you please give an example of why server access would be needed?

  • Under section 3 it says that the admin shall be able to read job, server, node, and queue attributes. Are all these attributes related to the job that ended? That is, will the node and queue made available to the hook be only the ones set on the job that ended?

  • Section 3 requirements also say that the hook will run only when the user deletes the job or when it ends successfully. Why will the hook not run when the job fails?

  • When the scheduler issues a PBS_BATCH_ConfirmResv request to the server, does this mean that the reservation confirmation hook will make the scheduler wait for a reply and the scheduling cycle will halt?

Can you please also add the objects that you are exposing to each hook event to the external design document?

Thanks!


#25

Hi – I really like the addition of these three new hook events – thanks!

A few suggestions:

  • Could you add more details (ideally, a couple examples) of how each hook event could be useful? The Use Cases include “Admin shall be able to do additional activities”, but that is too general to be helpful. For example, can you give one or two examples of when an admin might want to reject confirmation of a reservation? What about what actual activities an admin might want to do at the end of a reservation? At the end of a job?

  • For the “reservation end” event, what happens if jobs are still running in the reservation queue? If one use case is to be able to extend the end time of a reservation occurrence to allow running jobs more time to finish, then one would need to run the hook in the case where jobs are still running (and probably provide some indication to the hook, as input, that jobs are still running).

  • For “job end”, I suggest being more specific about when this hook event is executed. For example, if a use case is to be able to alter accounting records, then it would need to run before accounting records are written. If a use case is to allow a job to be re-run on the already allocated set of nodes, then the event would need to be run before the nodes are deallocated. If the use case is for allocation management, then it would be useful to run the hook regardless of whether the job actually started (i.e., to properly adjust allocation accounting, the hook would need to run if the job was queued, then qdel’d or qmove’d to another server, even if it never actually started – basically it would be an “exiting the system” hook… which would be really useful.)

  • Both “reservation end” and “job end” say “The hook can accept or reject the job.”. I don’t think this makes sense at the end of a reservation/job, only the beginning.

  • Please cross-link this page to the Contributor’s portal design page (“PP-912, PP-913 - Adding new…”) by editing the first post and adding a link. It makes it much easier to go back and forth.

  • The Requirements page is (confusingly) showing up at the top level in the Project Documentation space on the Contributors portal. Please make the Requirements page a sub-page of the main “PP-912, PP-913 - Adding new…” page.

Thanks!


#26

Hi Arun,

Thank you for the comments and suggestions.

By read access to the server, I mean getting the server-related attributes from the hooks, for example acl_resv_users, acl_resv_hosts, etc.

Yes, these should be the job-related queue and node attributes.

I have updated this in the UCR. The hook shall also be executed when the running job fails.

Yes. The hook will make the scheduler wait and halt the scheduling cycle.

I have updated this one. Please have a look at it. Thank you.


#27

Hi Bill,

Thank you for the suggestions.

The admin can use the reservation attributes to invoke external Python APIs. On the basis of the results returned from the external API, the hook can accept or reject the confirmation request.
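As a rough illustration of that accept/reject flow, here is a minimal sketch. It assumes the reservation attributes are passed in as a plain dict and that `external_check` is a stand-in for whatever site-specific API the admin calls; both the function name `decide_confirmation` and the dict keys are hypothetical and not part of the design. Treating a failed external call as a rejection is only one possible policy.

```python
def decide_confirmation(resv_attrs, external_check):
    """Return (accept, reason) for a reservation confirmation request.

    resv_attrs     -- dict of reservation attributes (keys are hypothetical)
    external_check -- callable standing in for the site's external API
    """
    try:
        ok = bool(external_check(resv_attrs))
    except Exception as exc:
        # Policy assumption: a failing external call rejects the request.
        return False, "external check failed: %s" % exc
    if ok:
        return True, ""
    resvid = resv_attrs.get("resvid", "?")
    return False, "external check declined reservation %s" % resvid


# Inside an actual hook the result would be applied roughly like this
# (not executed here, since the pbs module only exists inside PBS):
#   e = pbs.event()
#   accept, reason = decide_confirmation(attrs, my_site_check)
#   if accept:
#       e.accept()
#   else:
#       e.reject(reason)
```

Keeping the decision in a plain function like this makes it possible to exercise the logic outside the server before wiring it into a hook.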

The end hook for a reservation/job can be used to read the attributes in real time. Apart from this, the reasoning above also applies here.

Currently, as per the UCR, there were only read permissions on the attributes of the reservation in the resv_end hook. However, the use case which you have suggested sounds good. There shall be write permission on the end time of the reservation for the resv_end hook. Please provide your inputs.

Considering that the use case is for allocation management, I have updated the UCR.

Yes.

I added the Design page link to the first post.

I have added the requirements page as a sub-page to the design page. Thank you.


#28

@sujatapatnaik52 Thanks for your reply! It is more clear now.

About accessing the server object: I do understand what it means to access the server object, but I do not understand why it is needed for this particular hook event. Also, in the latest version of the use cases, under section 1.b(ii), you mention “external APIs”, but I’m not sure which external APIs. I think you have a use case in mind that explains why this event is needed, but the UCR does not reflect it.

  • I think that blocking the scheduler would be bad, but I don’t see any way around it either, and there is already a precedent in the form of runjob hooks.

  • Now that you have removed the requirement of no running jobs present when hook ends, don’t you think you need to expose all the job information (jobs associated with the reservation) to the hook, so that the hook can look at the job attributes (like walltime left) and make the decision?

  • It is still not clear what it would mean to reject the event from the hook on a job/reservation end event.

  • Can you please mention the event name precisely in the EDD for each event?


#29

Hi Arun,

Thank you for the comments.

Okay. Suppose the requested resources are not explicitly specified by the user; I think the default resources assigned to the reservation can then be obtained by accessing the server attribute default_chunk.

I did not mention any specific use case here because the external APIs can serve any use case the admin requires. For example, a Python API which captures the reservation attributes and stores them in a message broker or a database, returning True if the insertion is successful and False otherwise.
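To make the database example concrete, here is a minimal sketch using an in-memory SQLite database from the Python standard library. The table layout, the column names, and the function name `store_reservation` are assumptions made purely for illustration; a real site would use its own broker or database and schema.

```python
import sqlite3


def store_reservation(conn, attrs):
    """Insert one reservation record; return True on success, False otherwise."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO resv (resvid, owner, end_time) VALUES (?, ?, ?)",
                (attrs["resvid"], attrs["owner"], attrs["end_time"]),
            )
        return True
    except (sqlite3.Error, KeyError):
        # Database failure or a missing attribute: report failure to the caller.
        return False


# Hypothetical schema, created in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resv (resvid TEXT PRIMARY KEY, owner TEXT, end_time INTEGER)"
)
```

A hook could then accept or reject the event depending on the boolean this function returns.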

Yes, I agree.

I liked the idea of exposing the running jobs’ attributes to the reservation end hook. However, I am wondering: if there are hundreds of running jobs in the reservation queue, could exposing all of them to the hook impact server performance? I am not really sure about this. Please provide your inputs.

I don’t think it would make any difference if the end job/reservation hook performs a reject action, considering the current use cases. @billnitzberg has provided a use case where the admin might want to extend the end time of the reservation based on the walltime left for the running jobs in the reservation queue. In that scenario, I think a hook reject could leave the reservation running with its original attributes; on the other hand, if the hook accepts, the reservation’s end time could be extended.
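The end-time extension discussed here could be computed along these lines. This is only a sketch under stated assumptions: times are epoch seconds, `remaining_walltimes` is a list of seconds left for each running job in the reservation queue, and the cap `max_extension` is a hypothetical site policy rather than anything specified in the UCR or EDD.

```python
def proposed_end_time(resv_end, now, remaining_walltimes, max_extension):
    """Return an end time that covers the longest remaining job, up to a cap.

    resv_end            -- current reservation end time (epoch seconds)
    now                 -- current time (epoch seconds)
    remaining_walltimes -- seconds left for each running job (may be empty)
    max_extension       -- hypothetical upper bound on the extension, in seconds
    """
    if not remaining_walltimes:
        return resv_end  # nothing is running, so keep the original end time
    needed = now + max(remaining_walltimes)
    # Never shorten the reservation, and never extend past the cap.
    return min(max(resv_end, needed), resv_end + max_extension)
```

If the returned value equals the original end time, the hook has nothing to change; otherwise it would write the new value, assuming the design grants write permission on the end time.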

I think I have added the hook event names in the heading. Please tell me where exactly I missed mentioning the names.

Thank you.


#30

Thanks for this EDD! I am afraid I am not at present sure of the difference between a pbs.event.resv() object and the actual reservation queue that gets created for a reservation. It is not clear to me from the EDD whether or not the reservation queue attributes (like acl_user_enable, for example) will be visible/changeable from a hook running in the resv_confirm event. Can you please comment on this?


#31

Hi Scott,

Thank you for your comments.

I think pbs.event().resv represents the actual reservation object for which the hook gets called, whereas the reservation queue object can be obtained by using the interface pbs.event().resv.queue.

As per the EDD, the reservation attributes are read-only in the reservation confirmation hook event. So, I would say the reservation queue attributes shall be visible from the hook but not changeable.


#32

Hi All,

As I am not seeing any further comments on the EDD, I feel I can go ahead with the currently proposed design.

Thank you.


#33

Thanks, and sorry about the late reply. If pbs.event().resv.queue is being added as part of this work then it would need to be added to the EDD (I don’t think any such thing exists today?).


#34

Hi Scott,

Thank you for the reply. I have updated the EDD. Please have a look at it.


#35

I’m still not clear on whether a resv_confirm event gets triggered when an attempt is made to modify a reservation via pbs_ralter. Does every alter request trigger a confirmation action? In what scenarios, if any, would we need an additional resv_alter event?

It would be helpful to get a lifecycle diagram/table of a resv that shows where each hook event gets called, something like what’s in chapter 4 of the Hooks Guide.

I also see the design for resv_end still says:

“The hook will be executed only if there are no running jobs in the reservation queue”

How can we extend a reservation to give jobs more time to finish if we don’t allow the hook to execute when there are running jobs?


#36

Hi @smgoosen,

Apologies for the delayed response.

If a reservation gets modified by using pbs_ralter, the resv_confirm event shall be triggered.

Yes.

Thank you for this suggestion. I will update the diagram soon.

I added the above statement considering the current requirements. I think @billnitzberg and @prakashcv13 have pointed out this use case in their previous replies, but I did not change anything because I was not sure whether I could add it. Please provide your inputs.

Thank you.


#37

Hi All,

Following some offline discussions, I am adding here the purpose of exposing the interface pbs.event().resv.queue.

With pbs.server().queue, we can get the reservation’s queue object only if we provide a queue name with something like this - pbs.server().queue("<queue_name>").

The purpose of adding pbs.event().resv.queue is to get the actual queue object belonging to the reservation. The queue name can be obtained with something like pbs.event().resv.queue.name (in the same way that the queue’s name is obtained in a runjob hook: pbs.event().job.queue.name) and can be used later if we want the queue from pbs.server().

It may be possible to obtain the queue name from some attribute of the reservation object (something like splitting the resvid attribute to get the reservation name); the reservation name would then be passed to pbs.server().queue to get the queue object. However, I feel getting the queue object from the event interface is better than getting it from the server interface.
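For comparison, the string-splitting alternative might look like the sketch below. It assumes the reservation ID has the form `<name>.<server>` (for example `R123.server1`) and that the reservation queue is named after the part before the first dot; the helper name `queue_name_from_resvid` is hypothetical.

```python
def queue_name_from_resvid(resvid):
    """Return the part of a reservation ID before the first dot.

    Assumes IDs look like "<name>.<server>", e.g. "R123.server1".
    """
    return resvid.split(".", 1)[0]


# In a hook, the result would then be passed to the server interface,
# roughly as follows (not executed here, pbs exists only inside PBS):
#   q = pbs.server().queue(queue_name_from_resvid(resvid))
```

This relies on a naming convention rather than on an explicit link between the reservation and its queue, which is the main argument for exposing the queue on the event object instead.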

Please provide your inputs.

Thank you.


#38

@sujatapatnaik52, could you please clarify what you are not sure about?
Also, you are only partially correct that an alter request will trigger a resv_confirm event. It happens only when a confirmation request is sent to the scheduler.

There might be use cases where we need a separate resv_alter event.


#39

Hi Prakash,

Regarding the use case presented for extending the reservation end time, I have two queries:

  1. If this use case has to be considered, how many times can the end time be modified? There should be a limit, I suppose.
  2. After the end time is modified, I presume the request shall go to the scheduler to get confirmation of the reservation. What if the response from the scheduler doesn’t come immediately? I mean, what would be the default timeout period in this case?

Can you please briefly highlight the possible use cases where the alter event doesn’t send a request to the scheduler?

Thank you.


#40

One use case could be to limit the number of times a reservation can be altered. This cannot be managed/handled by the resv_confirm event.
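A limit of that kind could be tracked per reservation roughly as follows. This is a sketch only: the class name `AlterLimiter` is hypothetical, and the in-memory counter is used just to keep the example self-contained; a real hook would presumably need to persist the counts somewhere, since nothing in this thread says hook state survives between invocations.

```python
from collections import Counter


class AlterLimiter:
    """Allow at most max_alters alter requests per reservation ID."""

    def __init__(self, max_alters):
        self.max_alters = max_alters
        self._counts = Counter()  # alter requests already allowed, per resvid

    def allow(self, resvid):
        """Return True and record the alter if the reservation is under its limit."""
        if self._counts[resvid] >= self.max_alters:
            return False
        self._counts[resvid] += 1
        return True
```

A hook on a separate alter event could reject the request whenever `allow()` returns False.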

I do not understand how this question is relevant to adding a new hook event.