Execution node down

Yesterday a user reported that his jobs were going into the “H” state shortly after submission. He said the last time this happened, a node had been offlined. I ran tracejob on the job: it had attempted to run 21 times and gave up, and it also said ‘execution node down’. pbsnodes -l did not report that. I looked at the history and saw that compute-0-15 had been put back online. I offlined it and the job ran. Without the history being there, how would I determine the correct compute node? PBS must know. Thanks.

  1. If PBS tried to run the job 21 times and then gave up, the cause is usually one of these:

    • user authentication, credentials, or the user's home directory failed on the compute node
    • communication/networking problems between the compute node and the server (DNS / name resolution / reverse address resolution)

    To narrow it down:

    • check the accounting logs in $PBS_HOME/server_priv/accounting for that particular job and find out which compute node it ran on (even without job history, the job ID may be known)
    • go to that compute node and check the mom logs in $PBS_HOME/mom_logs/YYYYMMDD; they should give you the details
  2. Enable job history with qmgr -c 'set server job_history_enable=true'
    Submit a couple of jobs and check whether they go into the H state. If they do, use qstat -fx or tracejob to find the compute node the job was scheduled on, log in to that node, and check the mom logs for details.
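The accounting-log lookup in step 1 can be scripted. Below is a minimal sketch, assuming each accounting record has the layout "date time;record type;job id;key=value ..." and that exec_host entries look like "host/cpu-spec" joined by "+". The function name, the job ID 1234.smaster1 and the shortened sample records are made up for illustration.

```python
import re

def exec_hosts_for_job(lines, job_id):
    """Return one list of host names per record of job_id that has exec_host."""
    attempts = []
    for line in lines:
        # Assumed record layout: "date time;type;job id;key=value key=value ..."
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4 or parts[2] != job_id:
            continue
        match = re.search(r"(?:^|\s)exec_host=(\S+)", parts[3])
        if not match:
            continue
        # Each chunk looks like "host/cpu-spec"; keep only the host name.
        hosts = [chunk.split("/", 1)[0] for chunk in match.group(1).split("+")]
        attempts.append(hosts)
    return attempts

# Hypothetical, shortened records; real ones carry many more fields.
sample = [
    "08/05/2019 16:47:09;S;1234.smaster1;user=reinecke exec_host=compute-0-13/0+compute-0-15/0 run_count=20",
    "08/05/2019 16:47:10;Q;9999.smaster1;queue=medium",
]
print(exec_hosts_for_job(sample, "1234.smaster1"))
```

To use it on a real system, pass the lines of the relevant file under $PBS_HOME/server_priv/accounting instead of the sample list.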

Some generic reasons for a job being in the “H” state:

	*  there is an issue with the user's authentication on the compute node
	*  the user's home directory is not mounted or not available on the compute nodes
	*  PBS cannot verify the user's authentication, so it keeps the job in the held state
	*  the job was put on hold manually with the qhold command
	*  the job is a dependent job

Things to check:
* Does the user have any issues logging in to the compute nodes?
* Is everything in order for the user account: username, password, home directory, etc.?

I have history enabled. The mom_logs don’t show anything. This happened yesterday and there is no log for that day:

-rw-r--r-- 1 root root 352 Aug 4 09:40 20190801
drwxr-xr-x 2 root root 12K Aug 4 09:40 .
-rw-r--r-- 1 root root 540 Aug 4 09:40 20190804
[root@compute-0-15 mom_logs]#

My question is, if tracejob shows “execution node down”, why doesn’t it tell me the node? I just guessed at the culprit, used pbsnodes to offline it and reran the job.

To offline the node, please use qmgr -c "set node NODENAME state=offline".
Do not offline the node using pbsnodes -o.
To make it available again, use qmgr -c "set node NODENAME state=free".
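If several suspect nodes need to be offlined and brought back, the qmgr calls above can be generated rather than typed. This is a minimal sketch that only builds the argument list; the helper name qmgr_node_state_cmd is made up, and the only states it accepts are the two used in this thread.

```python
import shlex

def qmgr_node_state_cmd(node, state):
    """Build the argv for: qmgr -c "set node NODE state=STATE"."""
    # Only the two states mentioned in this thread are accepted here.
    if state not in ("offline", "free"):
        raise ValueError(f"unsupported state: {state}")
    return ["qmgr", "-c", f"set node {node} state={state}"]

cmd = qmgr_node_state_cmd("compute-0-15", "offline")
print(shlex.join(cmd))  # shell-quoted form, for review before running
```

On the PBS server host the list could then be handed to subprocess.run(cmd, check=True); printing it first makes it easy to review what would be executed.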

Could you please share the output of qstat -fx, tracejob and pbs_dtj for the job, plus the accounting log snippet for it?
(jobid = the job ID of the job that was put into the H state)

Below is the output of tracejob. Once I offlined the node, I used qrun to run the job:

08/05/2019 16:47:09 A user=reinecke group=reinecke project=_pbs_project_default jobname=restart queue=medium ctime=1565048821 qtime=1565048821 etime=1565048821 start=1565048829 exec_host=compute-0-13/040+compute-0-14/040+compute-0-15/040+compute-0-16/040+compute-0-17/040+compute-0-8/040+compute-0-18/040+compute-0-19/040+compute-0-20/040+compute-0-21/040 exec_vnode=(compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 session=0 end=1565048829 run_count=20
08/05/2019 16:47:09 A user=reinecke group=reinecke project=_pbs_project_default jobname=restart queue=medium ctime=1565048821 qtime=1565048821 etime=1565048821 start=1565048829 exec_host=compute-0-13/040+compute-0-14/040+compute-0-15/040+compute-0-16/040+compute-0-17/040+compute-0-8/040+compute-0-18/040+compute-0-19/040+compute-0-20/040+compute-0-21/040 exec_vnode=(compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 resource_assigned.ncpus=400
08/05/2019 16:47:10 S Obit received momhop:21 serverhop:21 state:4 substate:41
08/05/2019 16:47:10 S Discard running job, A sister Mom failed to delete job
08/05/2019 16:47:10 S Job requeued, execution node down
08/05/2019 16:47:10 A user=reinecke group=reinecke project=_pbs_project_default jobname=restart queue=medium ctime=1565048821 qtime=1565048821 etime=1565048821 start=1565048829 exec_host=compute-0-13/040+compute-0-14/040+compute-0-15/040+compute-0-16/040+compute-0-17/040+compute-0-8/040+compute-0-18/040+compute-0-19/040+compute-0-20/040+compute-0-21/040 exec_vnode=(compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 session=0 end=1565048830 Exit_status=-3 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=400 resources_used.vmem=0kb resources_used.walltime=00:00:00 run_count=21
08/05/2019 16:47:10 A user=reinecke group=reinecke project=_pbs_project_default jobname=restart queue=medium ctime=1565048821 qtime=1565048821 etime=1565048821 start=1565048829 exec_host=compute-0-13/040+compute-0-14/040+compute-0-15/040+compute-0-16/040+compute-0-17/040+compute-0-8/040+compute-0-18/040+compute-0-19/040+compute-0-20/040+compute-0-21/040 exec_vnode=(compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 session=0 end=1565048830 run_count=21
08/05/2019 17:54:51 L Received qrun request
08/05/2019 17:54:51 L Considering job to run
08/05/2019 17:54:51 S Job Run at request of Scheduler@smaster1. on exec_vnode (compute-0-0:ncpus=40)+(compute-0-1:ncpus=40)+(compute-0-2:ncpus=40)+(compute-0-3:ncpus=40)+(compute-0-4:ncpus=40)+(compute-0-5:ncpus=40)+(compute-0-6:ncpus=40)+(compute-0-7:ncpus=40)+(compute-0-9:ncpus=40)+(compute-0-10:ncpus=40)
08/05/2019 17:54:51 S Job Modified at request of Scheduler@smaster1.
08/05/2019 17:54:51 L Job run
08/05/2019 17:54:51 A user=reinecke group=reinecke project=_pbs_project_default jobname=restart queue=medium ctime=1565048821 qtime=1565048821 etime=0 start=1565052891 exec_host=compute-0-0/040+compute-0-1/040+compute-0-2/040+compute-0-3/040+compute-0-4/040+compute-0-5/040+compute-0-6/040+compute-0-7/040+compute-0-9/040+compute-0-10/040 exec_vnode=(compute-0-0:ncpus=40)+(compute-0-1:ncpus=40)+(compute-0-2:ncpus=40)+(compute-0-3:ncpus=40)+(compute-0-4:ncpus=40)+(compute-0-5:ncpus=40)+(compute-0-6:ncpus=40)+(compute-0-7:ncpus=40)+(compute-0-9:ncpus=40)+(compute-0-10:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 resource_assigned.ncpus=400
08/05/2019 18:55:28 S Obit received momhop:22 serverhop:22 state:4 substate:42
08/05/2019 18:55:28 S Exit_status=271 resources_used.cpupercent=4008 resources_used.cput=40:11:03 resources_used.mem=4724692kb resources_used.ncpus=400 resources_used.vmem=38967208kb resources_used.walltime=01:00:37
08/05/2019 18:55:28 A user=reinecke group=reinecke project=_pbs_project_default jobname=restart queue=medium ctime=1565048821 qtime=1565048821 etime=0 start=1565052891 exec_host=compute-0-0/040+compute-0-1/040+compute-0-2/040+compute-0-3/040+compute-0-4/040+compute-0-5/040+compute-0-6/040+compute-0-7/040+compute-0-9/040+compute-0-10/040 exec_vnode=(compute-0-0:ncpus=40)+(compute-0-1:ncpus=40)+(compute-0-2:ncpus=40)+(compute-0-3:ncpus=40)+(compute-0-4:ncpus=40)+(compute-0-5:ncpus=40)+(compute-0-6:ncpus=40)+(compute-0-7:ncpus=40)+(compute-0-9:ncpus=40)+(compute-0-10:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 session=178609 end=1565056528 Exit_status=271 resources_used.cpupercent=4008 resources_used.cput=40:11:03 resources_used.mem=4724692kb resources_used.ncpus=400 resources_used.vmem=38967208kb resources_used.walltime=01:00:37 run_count=22

The system is not letting me upload regular text files.

Did the output of tracejob provide any additional info? I have made note of the suggested way to offline a node, but what is the difference between the method you provided and:

pbsnodes -o node
pbsnodes -c node

Thanks.

Both do the same thing, but doing it via qmgr is less error prone. I will update this thread once I know the reason. Probably the node incarnation is updated correctly in the datastore and in memory when we use qmgr. I trust the rest is working for you; thanks for sharing the logs above.

08/05/2019 16:47:10 S Discard running job, A sister Mom failed to delete job
08/05/2019 16:47:10 S Job requeued, execution node down

Please make sure that the jobs spool directories are clean when you bring the compute nodes back up.
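tracejob does not name the node that was down, but the records of each attempt do list the vnodes that were assigned, which gives a candidate list. Below is a minimal sketch, assuming a negative Exit_status marks an attempt that PBS itself aborted (the failed attempt in this thread shows Exit_status=-3) and that nodes from an attempt with a non-negative Exit_status were up. The function names and the shortened sample lines are made up for illustration.

```python
import re

VNODE = re.compile(r"\(([^:()]+):")

def vnodes(record):
    """Host names from the exec_vnode=(host:...)+(host:...) field of a record."""
    field = re.search(r"exec_vnode=(\S+)", record)
    return set(VNODE.findall(field.group(1))) if field else set()

def suspect_nodes(tracejob_lines):
    """Nodes seen in aborted attempts but never in an attempt that ran."""
    failed, ran = set(), set()
    for line in tracejob_lines:
        nodes = vnodes(line)
        status = re.search(r"Exit_status=(-?\d+)", line)
        if not nodes or not status:
            continue
        # Assumption: negative Exit_status means PBS aborted the attempt.
        if int(status.group(1)) < 0:
            failed |= nodes
        else:
            ran |= nodes
    return failed - ran

# Shortened, illustrative lines in the style of the tracejob output.
sample = [
    "08/05/2019 16:47:10 A exec_vnode=(compute-0-13:ncpus=40)+(compute-0-15:ncpus=40) Exit_status=-3 run_count=21",
    "08/05/2019 18:55:28 A exec_vnode=(compute-0-0:ncpus=40)+(compute-0-1:ncpus=40) Exit_status=271 run_count=22",
]
print(sorted(suspect_nodes(sample)))
```

The result is only a list of candidates, not the culprit. On the full output in this thread it would return all ten nodes of the failed attempt, compute-0-15 among them, so each one still has to be checked individually, for example through its mom logs.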