We’ve been trying to use the cgroups hook for resource management on our GPU nodes. However, since we enabled the hook, the GPU nodes have been rejecting jobs with the comment “Not Running: PBS Error: Execution server rejected request”. Each job is then re-queued until the re-run limit is reached, at which point it is held.
The MoM logs for this job:
10/31/2019 14:15:18;0080;pbs_mom;Hook;pbs_cgroups.HK;copy hook-related file request received
10/31/2019 14:15:19;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Failed to read vntype file /cm/local/apps/pbspro-ce/var/spool/mom_priv/vntype
10/31/2019 14:15:19;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype
10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;Failed to open cgroup_jobs file.
10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0075
No “vntype” file exists at the path the hook is checking, but I’m not sure that is even what leads to this error, since the documentation seems to suggest the vntype parameter is only necessary for Cray nodes. Similarly, from reading the hook script, the cgroup_jobs file appears to be written as a temporary store for job IDs, so it seems strange that the script is unable to open it.
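One way to narrow this down is to check both paths directly on an affected node. The following is a minimal sketch, not part of the hook itself. The vntype path is copied from the log above; the cgroup_jobs path is an assumption (the log does not print it) and should be replaced with whatever path the hook script actually builds. The helper name describe_path is hypothetical.

```python
import os

# Path copied from the MoM log above.
VNTYPE_FILE = "/cm/local/apps/pbspro-ce/var/spool/mom_priv/vntype"

# ASSUMPTION: the log does not show where cgroup_jobs lives. This path is a
# guess; replace it with the path the hook script actually uses.
CGROUP_JOBS_FILE = (
    "/cm/local/apps/pbspro-ce/var/spool/mom_priv/hooks/hook_data/cgroup_jobs"
)


def describe_path(path):
    """Report whether a file exists and is readable, and whether its parent
    directory exists and is writable by the current user."""
    parent = os.path.dirname(path) or "."
    return {
        "path": path,
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "parent_exists": os.path.isdir(parent),
        "parent_writable": os.access(parent, os.W_OK),
    }


if __name__ == "__main__":
    for candidate in (VNTYPE_FILE, CGROUP_JOBS_FILE):
        print(describe_path(candidate))
```

The results depend on the user running the script, so it should be run as the same user as pbs_mom (typically root). A missing or unwritable parent directory would be one possible reason the hook cannot open a file that it normally creates itself.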
Does anyone know how this error is coming about? Any suggestions would be greatly appreciated.
I can provide more information as needed; I did not want to bombard the post with information that may be extraneous.