I am adding two new accounting log records that will let us track resource usage correctly across a job’s suspend/resume events.
Thanks for taking this up! A few comments:
- I think it’s important to print out the resource_released_list information in the z record, if available.
- I think it might also be useful to know who suspended/resumed the job, the scheduler or a user, so I think we should add a “requestor” field as well, similar to the U record.
- I’m not sure how useful it is to record the following: ‘accounting_id’, ‘user’, ‘group’, ‘job_name’, ‘queue’, ‘ctime’, ‘qtime’, ‘etime’, ‘start’. This information is either static, or doesn’t seem important to me in the context of suspending/resuming a job. What do you think?
- ‘exec_host’ & ‘exec_vnode’ - these are actually quite relevant for suspend/resume, but this information can be HUGE. So I’d like others to chime in on whether it’s important enough to record. I kind of feel like we don’t need both; recording only one of them might be enough.
- ‘suspend time’/‘resume time’ - I don’t think we need to print this out explicitly, the timestamp of the record already gives us this information.
- ‘Resource_List resources’ - For the z record, I think it might be useful to print time-based resources like resources_used.walltime, cput, etc., but the rest of them (like ncpus, mem, etc.) are static information which the admins can get from the E record as well. We also print out all of Resource_List in the Q record now, so I’m not sure how useful it is to print them here too. Also, this information will probably not change between the z record and the r record, so I don’t think we need to print it out in the r record.
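To make the comments above concrete, here is a rough sketch of what a trimmed-down z record could look like. It is only an illustration: the helper `format_suspend_record`, the exact field spellings, and the sample values are assumptions of mine, not the actual implementation. The only thing taken as given is the usual `timestamp;record type;job id;message` layout of an accounting log line.

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"  # timestamp layout used at the start of each line


def format_suspend_record(when, job_id, requestor, resources_used,
                          resource_released_list=None):
    """Build a hypothetical slim 'z' (suspend) accounting line."""
    fields = [f"requestor={requestor}"]
    # Only time-based usage is printed; static Resource_List values are
    # assumed to be available from the Q and E records already.
    for name, value in resources_used.items():
        fields.append(f"resources_used.{name}={value}")
    # Released resources are printed only when there are any.
    for name, value in (resource_released_list or {}).items():
        fields.append(f"resource_released_list.{name}={value}")
    return f"{when.strftime(STAMP_FORMAT)};z;{job_id};{' '.join(fields)}"


when = datetime(2019, 3, 4, 10, 15, 0)
line = format_suspend_record(
    when,
    "123.server",                      # made-up job id
    "Scheduler@server",                # made-up requestor
    {"walltime": "00:05:00", "cput": "00:04:30"},
    resource_released_list={"ncpus": "2"},
)
print(line)

# The suspend time does not need its own field: it is the record timestamp.
suspend_time = datetime.strptime(line.split(";", 1)[0], STAMP_FORMAT)
```

An r record built the same way would carry just the requestor, with the resume time again read from the record timestamp.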
My first suggestion is to merge both of these records into one: a job state change record. This would catch suspend and resume as well as all other job state changes, which can be useful. For example, it could be useful to know when a job went on hold and by whom. The record can be small, since most of the things you are suggesting can be found elsewhere and we don’t need to print them again. Make it similar to the ‘a’ record, which just prints what changed and that is it. At most, include the requestor.
If you don’t want to do that, I think you are including way too much information in these records. Most if not all of it can be obtained by looking at the Q and S records. I can see resources_used being useful to show how far the job had progressed before it was suspended (no need for it on resume).
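As a rough illustration of the single state change record idea, something like the sketch below would print only what changed plus the requestor. Everything here is hypothetical: the helper name, the placeholder record type `x`, and the sample values are made up for the example and are not an existing record type.

```python
def format_state_change(stamp, job_id, changed, requestor=None, record_type="x"):
    """Build a hypothetical unified state-change accounting line.

    Only the attributes that changed are printed, plus an optional requestor.
    The record type letter 'x' is a placeholder, not a real record type.
    """
    fields = [f"{key}={value}" for key, value in changed.items()]
    if requestor is not None:
        fields.append(f"requestor={requestor}")
    return f"{stamp};{record_type};{job_id};{' '.join(fields)}"


# Suspend by the scheduler: only the new job state is recorded.
suspend = format_state_change(
    "03/04/2019 10:15:00", "123.server", {"job_state": "S"}, "Scheduler@server"
)
# Hold placed by a user: the same record shape covers a different state change.
hold = format_state_change(
    "03/04/2019 11:00:00", "123.server",
    {"job_state": "H", "Hold_Types": "u"}, "alice@host",
)
print(suspend)
print(hold)
```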
How will this interact with the admin-suspend feature?