Design a Hook to monitor the Resources utilized by a Job


#1

Hi,

qstat -fx reports the status of each job, including the resources it requested and the resources it used.

Objective: When a job finishes execution, I need to write the values below to a file and send a comparison to the user.
resources_used.mem = 4700kb
resources_used.ncpus = 1
resources_used.walltime = 00:49:100
Resource_List.mem = 1048576kb
Resource_List.ncpus = 1
Resource_List.walltime = 72:00:00

Could this be implemented using a hook? If yes, how?

As I understand it, we would have to create a hook on the execjob_end event. The documentation has no example that uses execjob_end. It would be great if a relevant example could be shared.


#2

This can be achieved in an execjob_epilogue hook:

  1. Check whether the hook is running on the mother superior node.
  2. If it is, write to a file or to the job's stdout.

Sample code:

import pbs

e = pbs.event()
j = e.job
jobid = j.id
path_to_stdout_file = j._stdout_file
pbs.logmsg(pbs.LOG_DEBUG, "The path of stdout file is >>>>> %s" % (path_to_stdout_file))
# Only the mother superior appends to the job's stdout file.
if j.in_ms_mom():
    fileopen = open(path_to_stdout_file, 'a')
    fileopen.write(str(j.Resource_List['ncpus']))
    fileopen.close()

You can use the following values to populate the file: j.resources_used.ncpus, j.Resource_List.mem, j.resources_used.mem, etc.

Please test/share and let us know
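
To turn those values into the comparison asked for in the first post, a small formatting helper may be enough. The following is only a sketch: build_comparison and its arguments are hypothetical names, and it works on plain dictionaries of strings so that it can be tried outside a hook. Inside a hook the dictionaries would be filled from j.Resource_List and j.resources_used, converting each value with str().

```python
def build_comparison(requested, used, names=("mem", "ncpus", "walltime")):
    """Return a text table with one row per resource: name, requested, used."""
    row = "%-10s %-15s %-15s"
    lines = [row % ("resource", "requested", "used")]
    for name in names:
        # A resource missing from either side is shown as "n/a".
        lines.append(row % (name, requested.get(name, "n/a"), used.get(name, "n/a")))
    return "\n".join(lines)


# Values taken from the first post of this thread.
requested = {"mem": "1048576kb", "ncpus": "1", "walltime": "72:00:00"}
used = {"mem": "4700kb", "ncpus": "1", "walltime": "00:49:100"}
report = build_comparison(requested, used)
print(report)
```

The returned string can be written to the file in the same place where the sample above writes ncpus.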


#3

Accidental reply. Re-sent to correct recipient.


#4

Hi Adarsh,

Thanks for the sample code. I will try it and share the results.


#5

Hi Adarsh,

It seems I am making an error when defining #PBS -W stageout in the job script. Could you please share code that just writes to a file?


#6

Hi Rakhen,

#PBS -W stagein=<execution_path>@<hostname>:<storage_path>
#PBS -W stageout=<execution_path>@<hostname>:<storage_path>

execution_path: filename in the job's staging and execution directory on the execution host
storage_path: filename on host hostname

The "@" character separates the execution path specification from the storage path specification.
Example:
qsub -Wstagein=.@storageservername:/root/dummy/.txt -Wstageout=.fem@storageservername:/root/temp1 -- /bin/sleep 5
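
If the directives are generated by a script, a helper along these lines can keep the separators in the right place. This is a sketch that assumes the execution_path@hostname:storage_path layout described above; stage_directive is a hypothetical name, not part of PBS, and the file name abc.txt is only an illustration.

```python
def stage_directive(kind, execution_path, hostname, storage_path):
    """Build one '#PBS -W stagein=...' or '#PBS -W stageout=...' line."""
    if kind not in ("stagein", "stageout"):
        raise ValueError("kind must be 'stagein' or 'stageout'")
    # "@" separates the execution path from the storage side,
    # ":" separates the storage host from the path on that host.
    return "#PBS -W %s=%s@%s:%s" % (kind, execution_path, hostname, storage_path)


line = stage_directive("stageout", "abc.txt", "storageservername", "/root/temp1")
print(line)
```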

Thank you


#7

Hi Adarsh,

Please refer to the below code:

import pbs
import sys
import os

try:
    e = pbs.event()
    pbs.logmsg(pbs.LOG_DEBUG, "executed epilogue hook")
    if e.job.in_ms_mom():
        f = open("abc.txt", "w+")
        f.write(str(e.job.Resource_List["ncpus"]))
        f.write("\n")
        f.write(str(e.job.Resource_List["walltime"]))
        f.write("\n")
        f.write(str(e.job.Resource_List["place"]))
        f.write("\n")
        f.close()
    else:
        f = open("abc.txt", "w+")
        f.write("This is line1")
        f.close()

except SystemExit:
    pass

except:
    e.reject("Failed to write to a file")

The code works fine for accessing Resource_List objects. When I try to access resources_used objects, the job gets aborted.

I also tried to access the resources_used objects by following another example, but the job still ends up with abort status.

Note: I have not implemented the stdout approach.


#8

Please try the following as an execjob_epilogue hook:

import pbs
import datetime

e = pbs.event()
j = e.job
ji = j.id

rule = "======================================================================================"
# Each %s is filled, in order, from the tuple passed to the % operator below.
fmt = (rule + "\n\n\t\t\tResource Usage on %s:\n\nJobId: %s \n\tJob_Name = %s \n\tJob_Owner = %s "
       "\n\tresources_used.cpupercent = %s \n\tresources_used.cput = %s \n\tresources_used.mem = %s "
       "\n\tresources_used.ncpus = %s \n\tresources_used.vmem = %s \n\tresources_used.walltime = %s "
       "\n\tqueue = %s \n\tserver = %s \n\tAccount_Name = %s \n\tError_Path = %s \n\texec_host = %s "
       "\n\texec_vnode = %s \n\tJoin_Path = %s \n\tKeep_Files = %s \n\tOutput_Path = %s \n\tRerunable = %s "
       "\n\tResource_List.ncpus = %s \n\tResource_List.nodect = %s \n\tResource_List.place = %s "
       "\n\tResource_List.mem = %s \n\tResource_List.cput = %s\n\tResource_List.walltime = %s \n\tjobdir = %s "
       "\n\tEnvironment_Variables = PBS_O_SYSTEM=%s,PBS_O_SHELL=%s,PBS_O_HOME=%s,PBS_O_LOGNAME=%s,"
       "PBS_O_WORKDIR=%s,PBS_O_LANG=%s,PBS_O_PATH=%s,PBS_O_QUEUE=%s,PBS_O_HOST=%s "
       "\n\teuser = %s \n\tegroup = %s \n\trun_count = %s \n\t\n" + rule + "\n")

f = open('/tmp/abc.txt', 'w+')
f.write(fmt % (str(datetime.datetime.now()), ji, j.Job_Name, j.Job_Owner,
               j.resources_used.cpupercent, j.resources_used.cput, j.resources_used.mem,
               j.resources_used.ncpus, j.resources_used.vmem, j.resources_used.walltime,
               j.queue, j.server, j.Account_Name, j.Error_Path, j.exec_host2, j.exec_vnode,
               j.Join_Path, j.Keep_Files, j.Output_Path, j.Rerunable,
               j.Resource_List['ncpus'], j.Resource_List['nodect'], j.Resource_List['place'],
               j.Resource_List['mem'], j.Resource_List['cput'], j.Resource_List['walltime'], j.jobdir,
               j.Variable_List["PBS_O_SYSTEM"], j.Variable_List["PBS_O_SHELL"],
               j.Variable_List["PBS_O_HOME"], j.Variable_List["PBS_O_LOGNAME"],
               j.Variable_List["PBS_O_WORKDIR"], j.Variable_List["PBS_O_LANG"],
               j.Variable_List["PBS_O_PATH"], j.Variable_List["PBS_O_QUEUE"],
               j.Variable_List["PBS_O_HOST"], j.euser, j.egroup, j.run_count))
f.close()
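
If the job is still aborted when the hook reads resources_used, it may help to log the real exception instead of rejecting with a fixed message, because a bare except hides which attribute failed. The following is a sketch only: report_failure is a hypothetical helper, and the demonstration uses a missing dictionary key as a stand-in for the failing access so that it runs outside PBS.

```python
import traceback


def report_failure(log, context):
    """Pass the active exception's traceback to `log`, one line per call."""
    log("%s failed:" % context)
    lines = traceback.format_exc().splitlines()
    for line in lines:
        log(line)
    return lines


# Stand-alone demonstration; the KeyError stands in for the real failure.
captured = []
try:
    {}["resources_used"]
except Exception:
    report_failure(captured.append, "reading resources_used")

print("\n".join(captured))
```

In a hook, the log argument could be something like lambda m: pbs.logmsg(pbs.LOG_DEBUG, m), which uses the same pbs.logmsg call as the samples in this thread.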

If you improve on it, please share it with the community.
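
For the comparison itself, the string forms of the values (for example 4700kb and 1048576kb from the first post) can be converted to numbers. The sketch below assumes sizes are written as an integer followed by b, kb, mb, gb or tb, and walltimes as HH:MM:SS; the function names are hypothetical, and other unit spellings would need extra handling.

```python
import re

_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def size_to_bytes(text):
    """Convert a size string such as '4700kb' to a number of bytes."""
    match = re.match(r"^(\d+)\s*([kmgt]?b)$", str(text).strip().lower())
    if match is None:
        raise ValueError("unrecognised size: %r" % (text,))
    return int(match.group(1)) * _UNITS[match.group(2)]


def walltime_to_seconds(text):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in str(text).split(":"))
    return hours * 3600 + minutes * 60 + seconds


def percent_used(used, requested):
    """Share of the requested memory that was actually used, in percent."""
    return 100.0 * size_to_bytes(used) / size_to_bytes(requested)


pct = percent_used("4700kb", "1048576kb")
print("mem: %.2f%% of the request was used" % pct)
print("requested walltime in seconds: %d" % walltime_to_seconds("72:00:00"))
```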