Interactive Job errors out with 'apparently deleted'


#1

Hello,

We are seeing this issue following some recent change to queue priorities and preemptive scheduling:

$ qsub -I -l walltime=08:00:00 -l select=1:ncpus=4 -q def-devel
qsub: waiting for job 29679.bright01-thx to start
qsub: job 29679.bright01-thx apparently deleted

What could be the source of the issue?

Here’s what our queue configuration looks like:

create queue def-devel
set queue def-devel queue_type = Execution
set queue def-devel Priority = 100
set queue def-devel acl_host_enable = False
set queue def-devel resources_max.walltime = 08:00:00
set queue def-devel resources_min.walltime = 00:00:00
set queue def-devel resources_available.ngpus = 0
set queue def-devel max_run_res.condo = [o:PBS_ALL=38]
set queue def-devel max_run_res.gpuHost = [o:PBS_ALL=2]
set queue def-devel max_run_res.ncpus = [o:PBS_ALL=257]
set queue def-devel enabled = True
set queue def-devel started = True

Many thanks,
Siji


#2

Please check or share the tracejob 29679 output and mom logs of the system on which this job was running.


#3

Sure, here it is:

tracejob 29679

Job: 29679.bright01-thx

08/01/2018 14:56:26 S enqueuing into def-devel, state 1 hop 1
08/01/2018 14:56:26 A queue=def-devel
08/01/2018 14:56:27 S Job Queued at request of saula@login2, owner = saula@login2, job name = STDIN, queue = def-devel
08/01/2018 14:56:28 L Considering job to run
08/01/2018 14:56:28 S Job Run at request of Scheduler@bright01-thx.thunder.ccast on exec_vnode (node0115:ncpus=4:mem=1048576kb:mic_cores=0:ngpus=0)
08/01/2018 14:56:28 L Job run
08/01/2018 14:56:29 S Job Modified at request of Scheduler@bright01-thx.thunder.ccast
08/01/2018 14:56:29 S Obit received momhop:1 serverhop:1 state:4 substate:41
08/01/2018 14:56:29 A user=saula group=saula_g project=_pbs_project_default jobname=STDIN queue=def-devel ctime=1533153386 qtime=1533153386 etime=1533153386
start=1533153389 exec_host=node0115/04 exec_vnode=(node0115:ncpus=4:mem=1048576kb:mic_cores=0:ngpus=0) Resource_List.mem=1gb
Resource_List.mic_cores=0 Resource_List.ncpus=4 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free
Resource_List.select=1:ncpus=4 Resource_List.walltime=08:00:00 resource_assigned.mem=1048576kb resource_assigned.ncpus=4 resource_assigned.ngpus=0
resource_assigned.mic_cores=0
08/01/2018 14:56:30 S Exit_status=-1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=4 resources_used.vmem=0kb
resources_used.walltime=00:00:00
08/01/2018 14:56:30 A user=saula group=saula_g project=_pbs_project_default jobname=STDIN queue=def-devel ctime=1533153386 qtime=1533153386 etime=1533153386
start=1533153389 exec_host=node0115/04 exec_vnode=(node0115:ncpus=4:mem=1048576kb:mic_cores=0:ngpus=0) Resource_List.mem=1gb
Resource_List.mic_cores=0 Resource_List.ncpus=4 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free
Resource_List.select=1:ncpus=4 Resource_List.walltime=08:00:00 session=0 end=1533153390 Exit_status=-1 resources_used.cpupercent=0
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=4 resources_used.vmem=0kb resources_used.walltime=00:00:00 run_count=1


#4

One possible reason:

  • If pbs_mom could not resolve the FQDN of the pbs_server, this issue will happen.

In that case, the pbs_mom logs would show something like this:

08/02/2018 02:50:34;0100;pbs_mom;Job;6.BLRLAP796;allowed brema to access window station and desktop, User brema passworded
08/02/2018 02:50:34;0001;pbs_mom;Svr;pbs_mom;No error (0) in finish_exec, cannot open qsub sock for 6.BLRLAP796
08/02/2018 02:50:34;0008;pbs_mom;Job;6.BLRLAP796;cannot open qsub sock for 6.BLRLAP796
08/02/2018 02:50:35;0100;pbs_mom;Job;6.BLRLAP796;task 00000001 cput= 0:00:00
08/02/2018 02:50:35;0008;pbs_mom;Job;6.BLRLAP796;kill_job
08/02/2018 02:50:35;0100;pbs_mom;Job;6.BLRLAP796;spark cput= 0:00:00 mem=0kb
08/02/2018 02:50:35;0100;pbs_mom;Job;6.BLRLAP796;Obit sent
08/02/2018 02:50:35;0100;pbs_mom;Req;;Type 6 request received from brema@192.168.10.1:15001, sock=1
08/02/2018 02:50:35;0080;pbs_mom;Job;6.BLRLAP796;delete job request received
08/02/2018 02:50:35;0008;pbs_mom;Job;6.BLRLAP796;kill_job


#5

Hmm, that sounds really strange since we have batch jobs running on those nodes without incident. Just interactive jobs are acting this way.

I believe the issue here is specific to either the queue or the job type…


#6

Exit_status = -1 means: job execution failed, before files, no retry.

Could you please paste the mom logs for this job id 29679 from node0115 ?


#7

Interesting, there isn’t a job directory or file related to this job on node0115:

[node0115 ~]# ls -ltr /cm/local/apps/pbspro-ce/var/spool/mom_priv/jobs/29679
ls: cannot access /cm/local/apps/pbspro-ce/var/spool/mom_priv/jobs/29679: No such file or directory


#8

Please check the PBS MoM logs on node0115 as follows:

  • source /etc/pbs.conf
  • cd $PBS_HOME/mom_logs/
  • cat 20180801 | grep 29679 (or open the file with vi 20180801 and search for the job ID)
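
The steps above could be wrapped in a small helper. This is only a sketch: the function name mom_log_for_job is made up for illustration, and it assumes the MoM logs are plain-text files named by date (YYYYMMDD) under $PBS_HOME/mom_logs, as they appear in this thread.

```shell
# Hypothetical helper: print every line that mentions a job ID in one day's MoM log.
# Arguments: log directory, log file name (YYYYMMDD), job ID.
mom_log_for_job() {
    log_dir=$1
    day=$2
    job_id=$3
    # -F treats the job ID as a fixed string, so dots in it are not regex wildcards
    grep -F -- "$job_id" "$log_dir/$day"
}

# Example (run on the execution node):
#   . /etc/pbs.conf
#   mom_log_for_job "$PBS_HOME/mom_logs" 20180801 29679
```

For an interactive job that dies immediately, the lines to look for are the ones containing "cannot open qsub sock".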

#9

Hi

bremanandjk suggested “If pbs_mom could not resolve the FQDN of the pbs_server, this issue will happen.”

I had the same error with only interactive jobs failing and it turned out I had the wrong ip address for the login node in the /etc/hosts file on the head node/PBS server. Hence the FQDN lookup was incorrect.

Mike


#10

Here’s what we have:

[node0115 mom_logs]# cat 20180801 | grep 29679
08/01/2018 14:56:29;0001;pbs_mom;Svr;pbs_mom;Operation now in progress (115) in finish_exec, cannot open qsub sock for 29679.bright01-thx
08/01/2018 14:56:29;0001;pbs_mom;Job;29679.bright01-thx;job not started, Failure -1
08/01/2018 14:56:29;0100;pbs_mom;Job;29679.bright01-thx;task 00000001 cput= 0:00:00
08/01/2018 14:56:29;0008;pbs_mom;Job;29679.bright01-thx;kill_job
08/01/2018 14:56:29;0100;pbs_mom;Job;29679.bright01-thx;node0115 cput= 0:00:00 mem=0kb
08/01/2018 14:56:29;0008;pbs_mom;Job;29679.bright01-thx;no active tasks
08/01/2018 14:56:29;0100;pbs_mom;Job;29679.bright01-thx;Obit sent
08/01/2018 14:56:30;0008;pbs_mom;Job;29679.bright01-thx;no active tasks
08/01/2018 14:56:30;0080;pbs_mom;Job;29679.bright01-thx;delete job request received
08/01/2018 14:56:30;0008;pbs_mom;Job;29679.bright01-thx;kill_job


#11

As rightly mentioned by @bremanandjk, it is related to DNS issues.


#12

@bremanandjk, @speleolinux, @adarsh

You might be onto something here…

I did find the login node’s IP in /etc/hosts to be inconsistent with the IP listed in the PBS server’s /etc/hosts and in the compute node’s /etc/hosts file.

I went ahead and corrected this error in the login node’s /etc/hosts file but still received the ‘apparently deleted’ error.

Maybe the pbs_mom on this compute node needs to be restarted? I’ll try that next and check over my /etc/hosts afresh…
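
One way to re-check the name resolution on each machine is to ask the resolver directly instead of reading /etc/hosts by eye. A minimal sketch, assuming getent is available on all hosts; resolve_ip is a made-up helper name, and the ssh loop in the comments assumes passwordless ssh between the machines named in this thread:

```shell
# Hypothetical helper: print the first address this host resolves for a name.
# getent consults /etc/hosts and DNS in the order set in /etc/nsswitch.conf.
resolve_ip() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Example: compare what each machine believes the login node's address is.
#   for h in bright01-thx node0115 login2; do
#       printf '%s: ' "$h"
#       ssh "$h" getent hosts login2
#   done
# All lines should show the same IP; a mismatch points at a stale hosts entry.
```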


#13

If that does not work I’d suggest trying the following (which mimics essentially what PBS is doing to make the connection in the interactive job):

Submit an interactive job that we know will not run (give it a really high ncpus request so that it will remain queued).

While that job is queued, look at it in qstat -f and note the full value of PBS_O_HOST in the Variable_List attribute.

On the host where qsub -I is waiting for the job to start, run netstat -anp | grep qsub | grep LISTEN and note the port number (after the “0.0.0.0:”).

Now log into node0115 as root and issue the command “telnet X Y”, where X is the PBS_O_HOST value from qstat -f and Y is the port number from netstat.

Do you get “Escape character is ‘^]’.”, or something else?
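
If telnet is not installed on the compute node, the same connection test can be approximated with bash alone. This is a sketch, not what PBS itself runs: check_qsub_port is a made-up name, it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, and the 5-second limit is an arbitrary choice.

```shell
# Hypothetical helper: try a TCP connection to HOST:PORT, much like "telnet X Y".
# Returns 0 if the connection opens, 1 if it is refused, filtered, or times out.
check_qsub_port() {
    host=$1
    port=$2
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "open: $host:$port"
    else
        echo "blocked or unreachable: $host:$port"
        return 1
    fi
}

# Example (on node0115, with X = PBS_O_HOST and Y = the port from netstat):
#   check_qsub_port X Y
```

Note that a successful ping does not imply this check will pass: ping uses ICMP, while the test above needs the specific TCP port to be reachable.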


#14

I work with @sijisaula and am assisting with this as well.

I followed your instructions and did not receive the standard telnet response of “Escape character is ‘^]’.” Instead, I got the following (leaving out the actual address):

Trying [$PBS_O_HOST address] …
telnet: connect to address [$PBS_O_HOST address]: No route to host

However, the same $PBS_O_HOST can be pinged successfully from node0115 (and other compute nodes).

Does this narrow it down at all?


#15

I think I see where you’re going with this. The port that the qsub process listens on for each interactive job has to be open on the firewall? I opened that specific port while the job was still queued and telnet now returns the expected response.

However, it looks like for each new interactive job, qsub listens on a different random high-numbered port. If I were to create a general firewall rule, what range of ports should I include to guarantee that interactive qsub processes are always accepted?


#16

PBS Pro ephemeral ports for interactive job submission: Ports from 32768 to 61000 should be open on compute nodes
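
As a rough illustration, a rule for that range could look like the sketch below. Several things here are assumptions, not facts from this thread: that the firewall is managed with iptables, that the rule belongs on the host whose firewall rejected the telnet test, and the source subnet 10.141.0.0/16, which is only a placeholder for the compute network. The helper name qsub_fw_rule is made up, and it only prints the rule so it can be reviewed before being applied.

```shell
# Hypothetical helper: print (not run) an iptables rule that accepts TCP
# connections from SOURCE_NET to the port range quoted in this thread.
qsub_fw_rule() {
    source_net=$1
    echo "iptables -I INPUT -p tcp -s $source_net --dport 32768:61000 -j ACCEPT"
}

# Example, as root on the firewalled host (placeholder subnet):
#   qsub_fw_rule 10.141.0.0/16 | sh
# To see the ephemeral port range the local kernel hands out:
#   cat /proc/sys/net/ipv4/ip_local_port_range
```

Restricting the rule to the cluster's internal subnet keeps the wide port range from being exposed to other networks. A rule added this way does not survive a reboot unless it is saved with whatever mechanism the distribution provides.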


#17

Great, thank you. I can confirm that this problem has been resolved. Thanks everyone for your help!