Job error: 'Exit_status = -2'


#1

Hello,

We are encountering “Exit_status = -2” on a particular node and we are not sure why…

There are no mom_logs, and the server_logs don't help either.

We did notice that the first job submitted after a node reboot just hangs from the user's perspective:

$ qsub -I -l walltime=08:00:00 -l host=node0047 -q def-devel

qsub: waiting for job 38178.bright01-thx to start
<…it just HANGS…so I quit…>
^CDo you wish to terminate the job and exit (y|[n])? y
Job 38178.bright01-thx is being deleted

Even though qstat -f 38178 says:

comment = Job run at Tue Aug 14 at 11:01 on (node0047:ncpus=1:mem=1048576kb
:mic_cores=0:ngpus=0)

And then the next job submission attempt just quits immediately with the ‘Exit_status = -2’ error:

$ qsub -I -l walltime=08:00:00 -l host=node0047 -q def-devel

qsub: waiting for job 38179.bright01-thx to start

qsub: job 38179.bright01-thx ready

qsub: job 38179.bright01-thx completed

Any ideas how we can uncover the root cause of this issue?

Thanks,
Siji


#2

Please increase the mom logging by adding the line below to $PBS_HOME/mom_priv/config, then restart the mom service on node0047 and submit a new job:

$logevent  0xfffffff

Exit_status = -2 means "Job execution failed, after files, no retry".
tracejob output and the mom logs would probably help.
It might also be related to ports / firewall.
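
If you want to rule out the ports / firewall angle quickly, a rough reachability probe from the server host toward the mom node might look like the sketch below. Note the port numbers 15001-15003 are the stock PBS defaults and your site may have overridden them, so treat them as an assumption:

```shell
# Probe the default PBS ports (15001 = server, 15002 = mom, 15003 = mom RM)
# on the mom node; adjust the host and ports to match your site.
host=node0047
for port in 15001 15002 15003; do
  if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable"
  fi
done
```

If 15002/15003 are blocked between server and node, the server can run the job but never talk to the mom about it, which can surface as odd job failures.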

Thank you


#3

@adarsh

After increasing logging, there still isn't any mom_log to review. In fact, the $PBS_MOM_HOME/mom_logs directory is empty.

tracejob 38822

Job: 38822.bright01-thx

08/16/2018 09:37:40 L Considering job to run
08/16/2018 09:37:40 S Job Queued at request of saula@login-0002, owner = saula@login-0002, job name = STDIN, queue = def-devel
08/16/2018 09:37:40 S Job Run at request of Scheduler@bright01-thx.thunder.ccast on exec_vnode (node0047:ncpus=1:mem=1048576kb:mic_cores=0:ngpus=0)
08/16/2018 09:37:40 S Job Modified at request of Scheduler@bright01-thx.thunder.ccast
08/16/2018 09:37:40 L Job run
08/16/2018 09:37:40 S enqueuing into def-devel, state 1 hop 1
08/16/2018 09:37:40 A queue=def-devel
08/16/2018 09:37:40 A user=saula group=saula_g project=_pbs_project_default jobname=STDIN queue=def-devel ctime=1534430260 qtime=1534430260 etime=1534430260
start=1534430260 exec_host=node0047/0 exec_vnode=(node0047:ncpus=1:mem=1048576kb:mic_cores=0:ngpus=0) Resource_List.host=node0047
Resource_List.mem=1gb Resource_List.mic_cores=0 Resource_List.ncpus=1 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=pack
Resource_List.select=1:host=node0047:ncpus=1 Resource_List.walltime=08:00:00 resource_assigned.mem=1048576kb resource_assigned.ncpus=1
resource_assigned.ngpus=0 resource_assigned.mic_cores=0
08/16/2018 09:39:02 S Obit received momhop:1 serverhop:1 state:4 substate:41
08/16/2018 09:39:06 S Exit_status=-2 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb
resources_used.walltime=00:00:00
08/16/2018 09:39:06 A user=saula group=saula_g project=_pbs_project_default jobname=STDIN queue=def-devel ctime=1534430260 qtime=1534430260 etime=1534430260
start=1534430260 exec_host=node0047/0 exec_vnode=(node0047:ncpus=1:mem=1048576kb:mic_cores=0:ngpus=0) Resource_List.host=node0047
Resource_List.mem=1gb Resource_List.mic_cores=0 Resource_List.ncpus=1 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=pack
Resource_List.select=1:host=node0047:ncpus=1 Resource_List.walltime=08:00:00 session=0 end=1534430346 Exit_status=-2 resources_used.cpupercent=0
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:00 run_count=1

Also, what does the "after files" part mean in your interpretation of Exit_status -2?


#4

It means the failure happened after staging of the data, i.e. after PBS had already copied the job's input files to the execution node.


#5

Please check that the pbs_mom service is up and running, and please share your /etc/pbs.conf contents.
If the mom service is up and running, there should be a $PBS_HOME/mom_logs/YYYYMMDD or $PBS_MOM_HOME/mom_logs/YYYYMMDD file.
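
One quick way to check both at once. The /var/spool/pbs fallback below is only a placeholder, so substitute the PBS_MOM_HOME value from your own /etc/pbs.conf:

```shell
# Is pbs_mom actually running, and did it write a log file today?
# Mom logs are one file per day, named YYYYMMDD.
PBS_MOM_HOME=${PBS_MOM_HOME:-/var/spool/pbs}   # placeholder path; use yours
pgrep -x pbs_mom >/dev/null && echo "pbs_mom is running" || echo "pbs_mom is NOT running"
log="$PBS_MOM_HOME/mom_logs/$(date +%Y%m%d)"
[ -f "$log" ] && echo "today's mom log: $log" || echo "no mom log for today at $log"
```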


#6

Mom is up according to systemctl, and our pbs.conf is shown below:

[node0047 ~]# systemctl status pbs
pbs.service - LSB: The Portable Batch System (PBS) is a flexible workload
   Loaded: loaded (/etc/rc.d/init.d/pbs; bad; vendor preset: disabled)
   Active: active (running) since Wed 2018-08-15 09:33:44 CDT; 1 day 1h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 72288 ExecStop=/etc/rc.d/init.d/pbs stop (code=exited, status=0/SUCCESS)
  Process: 72335 ExecStart=/etc/rc.d/init.d/pbs start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/pbs.service
           └─72405 /cm/shared/apps/pbspro-ce/current/sbin/pbs_mom

Aug 15 09:33:44 node0047 systemd[1]: Starting LSB: The Portable Batch System (PBS) is a flexible workload…
Aug 15 09:33:44 node0047 pbs[72335]: Starting PBS
Aug 15 09:33:44 node0047 pbs[72335]: PBS mom
Aug 15 09:33:44 node0047 systemd[1]: Started LSB: The Portable Batch System (PBS) is a flexible workload.

[node0047 ~]# cat /etc/pbs.conf
PBS_EXEC=/cm/shared/apps/pbspro-ce/current
PBS_HOME=/cm/local/apps/pbspro-ce/var/spool
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_MOM_HOME=/cm/local/apps/pbspro-ce/var/spool
PBS_START_MOM=1
PBS_START_COMM=0
PBS_SERVER=bright01-thx
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh
PBS_CORE_LIMIT=unlimited


#7

Thank you. Could you please search for the 20180816 file on the mom node?

find / -name "20180816" -print


#8

Still nothing…

[node0047 ~]# find / -name "20180816" -print
[node0047 ~]#

I’m doing some other digging as well, but let me know if you have other thoughts.


#9
  1. Try running this command: $PBS_EXEC/unsupported/pbs_dtj 38822
  2. Re-install the PBS mom on that node and check whether the logs are written then, or whether the logs are being redirected to syslog in your existing setup.
  3. It looks like you are using a cluster management system (judging by your /etc/pbs.conf):
  • make sure hostname resolution is correct, and check the firewall / selinux etc.
  • see the related topic: Interactive Job errors out with 'apparently deleted'
  • kill the PBS mom on the compute node and start it with strace -o /tmp/mom.txt $PBS_EXEC/sbin/pbs_mom
  • strace qsub -I -l walltime=08:00:00 -l host=node0047 -q def-devel
  • then look into the strace output and analyze it
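
Once a /tmp/mom.txt trace exists, the useful part is usually the last few failing system calls before the job died. A sketch of that scan, with a made-up two-line sample standing in for the real trace (the file contents below are invented for illustration only):

```shell
# Stand-in for the real /tmp/mom.txt produced by strace; the two lines
# below are fabricated examples of what failing syscalls look like.
cat > /tmp/mom-sample.txt <<'EOF'
open("/cm/local/apps/pbspro-ce/var/spool/mom_priv/jobs/38822.JB", O_RDONLY) = -1 ENOENT (No such file or directory)
connect(5, {sa_family=AF_INET, sin_port=htons(15001)}, 16) = -1 ECONNREFUSED (Connection refused)
EOF

# Pull out the failing calls; extend the error list as needed.
grep -E 'ENOENT|EACCES|ECONNREFUSED|EPERM' /tmp/mom-sample.txt | tail -n 20
```

On the real trace, an ENOENT on a spool path points at a missing/mis-mounted PBS_MOM_HOME, while ECONNREFUSED points back at the ports / firewall angle.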