My job stays queued


#1

Hi,

Since today, for a reason I don't know, every job I submit stays queued.

What should I do to find out what the problem is?

Yesterday everything was running perfectly.

Thanks a lot for your help.


#2

Please share the output of the commands below:

  1. qstat -answ1
  2. pbsnodes -av

Note:

  • Please check whether every node's state is free: pbsnodes -av | grep -e Mom -e state
  • Please check whether the job's resource request can be matched to the available compute resources; otherwise the job will stay in the queue.
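
As a rough sketch of the state check above, the filter below reads pbsnodes -av output on stdin and prints every vnode whose state is anything other than free. The function name nodes_not_free is made up for illustration, and it assumes the usual layout: a vnode name on its own line, followed by "key = value" attribute lines.

```shell
# Hypothetical helper: list vnodes whose state is not exactly "free".
nodes_not_free() {
    awk '
        # A non-empty line without " = " starts a new vnode record.
        NF && index($0, " = ") == 0 { node = $1; next }
        # Report the vnode when its state is anything other than "free".
        $1 == "state" && $2 == "=" && $3 != "free" { print node ": " $3 }
    '
}
```

Usage would be: pbsnodes -av | nodes_not_free. Note that a state such as job-busy is also reported, although it only means the node is fully occupied.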

#3

$ qstat -answ1

returns nothing

$ pbsnodes -av

returns:

centos7
Mom = centos7.home
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = centos7
resources_available.ncpus = 1
resources_available.vnode = centos7
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

$ pbsnodes -av | grep -e Mom -e state

returns:

 Mom = centos7.home
 state = state-unknown,down

You asked: "please check whether the job requests can be matched on to the compute resources, otherwise, job will be in the queue?"

I am resubmitting jobs that ran before, so the compute resources should normally be sufficient.

I do not understand the problem.


#4

Then there are no jobs in the queue. Please submit a sample sleep job as below:
qsub -- /bin/sleep 100

The compute node's state is down, which is why the job stays in the queue.

Why is the node down? Communication issues between the PBS Server and the PBS Mom (compute node). Possible causes:

  1. The pbs_mom service might not be running on the compute node centos7.home.
  2. A firewall might be blocking the ports (15001-15007, 17001) between the head node and the compute node, in either direction. Disable the firewall completely and check again.
  3. DNS resolution (forward and reverse) of the compute node and the head node might be failing from either system.
  4. Check that SELinux is disabled and that the system was rebooted after disabling it.
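
The four checks above could be scripted roughly as follows. This is only a sketch: the function names are invented, bash is assumed (for /dev/tcp), and firewalld is assumed to be the firewall in use, which is the CentOS 7 default but may not match your system. The host name and port list are taken from this thread.

```shell
# Ports listed above: 15001-15007 and 17001.
pbs_ports() {
    seq 15001 15007
    echo 17001
}

# Succeeds if a TCP connection to host:port opens within 3 seconds (needs bash).
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run on the head node, e.g.: check_node centos7.home
check_node() {
    node=$1
    # Item 3: forward resolution, then reverse resolution of the address found.
    addr=$(getent hosts "$node" | awk 'NR == 1 { print $1 }')
    echo "forward: $node -> ${addr:-UNRESOLVED}"
    [ -n "$addr" ] && echo "reverse: $(getent hosts "$addr")"
    # Item 2: probe each port. A port with no daemon behind it also shows as closed.
    for port in $(pbs_ports); do
        if port_open "$node" "$port"; then
            echo "port $port: open"
        else
            echo "port $port: closed or filtered"
        fi
    done
    return 0
}

# Run on the compute node itself.
check_local() {
    # Item 1: is the pbs_mom daemon running?
    if pgrep -x pbs_mom >/dev/null; then
        echo "pbs_mom: running"
    else
        echo "pbs_mom: NOT running"
    fi
    # Item 2: firewall state (prints "running" or "not running").
    command -v firewall-cmd >/dev/null && firewall-cmd --state
    # Item 4: SELinux mode (Enforcing, Permissive or Disabled).
    command -v getenforce >/dev/null && getenforce
    return 0
}
```

A closed port is not necessarily a firewall problem, since not every port in the range has a daemon listening on every host, so treat the port output as a hint rather than a verdict.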

#5

Hi. I also have a problem. When I specify a particular node name in the job submission script, the job stays in the queue:
#PBS -l nodes=compunode-0-3.local

After submitting the job, I ran qstat -answ1 and got this error:
Can Never Run: Insufficient amount of resource: host (compunode-0-3.local !=compunode-0-1,compunode-0-2,compunode.-0-3,…

We have the following restrictions on the server for every user (PBS_GENERIC):

max_run=3
max_run_res.ncpus=72
max_run_res.nodect=2
max_queued=2


#6

Please share the output of the command below:
pbsnodes computenode-0-3.local

Can you please try this command:
qsub -l host=computenode-0-3 -- /bin/sleep 100

It seems the “mom name” does not match the request.

  • The mom name should be the short hostname; please check.
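
One way to see which value the host resource actually carries is to print it next to the Mom name for each vnode. The sketch below assumes pbsnodes output with "Mom = ..." and "resources_available.host = ..." lines; host_resource is a made-up name.

```shell
# Hypothetical helper: show vnode name, Mom name and host resource side by side.
host_resource() {
    awk '
        # A non-empty line without " = " starts a new vnode record.
        NF && index($0, " = ") == 0 { node = $1; mom = ""; next }
        $1 == "Mom" { mom = $3 }
        # The value printed as host= is what "-l host=..." has to match.
        $1 == "resources_available.host" { print node, "Mom=" mom, "host=" $3 }
    '
}
```

Usage would be: pbsnodes -av | host_resource. If the host= column shows a short name while the job requests the fully qualified name, the request cannot be matched.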

#7

These are the results.

[user@login ~]# pbsnodes compute-0-3.local
Node: compute-0-3.local, Error: Unknown node
[user@login ~]# pbsnodes compute-0-3
compute-0-3.local
Mom = compute-0-3.local
ntype = PBS
state = free
pcpus = 36
resources_available.arch = linux
resources_available.host = compute-0-3
resources_available.mem = 263727076kb
resources_available.ncpus = 36
resources_available.ngpus = 2
resources_available.vnode = compute-0-3.local
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

[user@login ~]$ qsub -l host=compute-0-3 – /bin/sleep 100
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value…]
[-S path] [-u user_list] [-W otherattributes=value…]
[-v variable_list] [-V ] [-z] [script | -- command [arg1 …]]
qsub --version


#8

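The command should be:

qsub -l host=compute-0-3 -- /bin/sleep 100

The separator before /bin/sleep is two plain hyphens with no space between them (the forum may display them as a single long dash). Spelled out:

qsub < hyphen >< l for london >< space >host=compute-0-3< space >< hyphen >< hyphen >< space >/bin/sleep 100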
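
If a command was copied from a web page, the double hyphen may have been turned into a single en dash (U+2013), which qsub does not understand. A small sketch for repairing such a line before running it, with fix_dashes being an invented name:

```shell
# Hypothetical helper: replace every en dash in a command line with "--".
fix_dashes() {
    endash=$(printf '\342\200\223')   # UTF-8 bytes of U+2013 EN DASH
    printf '%s\n' "$1" | sed "s/$endash/--/g"
}
```

This only rewrites the text; inspect the result before passing it to a shell.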


#9

This is the command and output

qsub -l host=compute-0-3 -- bin/sleep 100
80126.master1.local


#10

Thank you!

Did the job run on the requested host? Please share the output of:

  • qstat -answ1
  • qstat -fx 80126

#11

Yes, it did run. Here are the other outputs:
qstat -answ1

100363.master1.local user workq STDIN 17184 1 1 -- -- R 00:00:00 compute-0-3
Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1)

qstat -fx 100363
Job Id: 100363.master1.local
Job_Name = STDIN
Job_Owner = user@login.local
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 348kb
resources_used.ncpus = 1
resources_used.vmem = 4316kb
resources_used.walltime = 00:01:40
job_state = F
queue = workq
server = master1.local
Checkpoint = u
ctime = Tue Dec 11 15:58:25 2018
Error_Path = login.local:/home/user/STDIN.e100363
exec_host = compute-0-3.local/0
exec_vnode = (compute-0-3.local:ncpus=1)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Dec 11 16:00:06 2018
Output_Path = login.local:/home/user/STDIN.o100363
Priority = 0
qtime = Tue Dec 11 15:58:25 2018
Rerunable = True
Resource_List.host = compute-0-3
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:host=compute-0-3:ncpus=1
stime = Tue Dec 11 15:58:25 2018
session_id = 17184
jobdir = /home/user
substate = 92
Variable_List = PBS_O_HOME=/home/user,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=user,
PBS_O_PATH=/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/mp
i/intel64/bin:/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/
bin/intel64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibut
ils/bin:/opt/pbs/bin:/home/user/.local/bin:/home/user/bin,
PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/user,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=login.local
comment = Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1) and
finished
etime = Tue Dec 11 15:58:25 2018
run_count = 1
Stageout_status = 1
Exit_status = 0
Submit_arguments = -l host=compute-0-3 -- /bin/sleep 100
executable = <jsdl-hpcpa:Executable>/bin/sleep</jsdl-hpcpa:Executable>
argument_list = <jsdl-hpcpa:Argument>100</jsdl-hpcpa:Argument>
history_timestamp = 1544544006
project = _pbs_project_default


#12

Thank you. It is all working now.
Do you still see any issues?


#13

But what is the syntax for selecting nodes in PBS?
I am using PBS Pro version 17 and I get an error when I use the directive below:
#PBS -l host=compute-0-3:ncpus=10 -l mem=10GB
The error is:
Illegal attribute or resource value Resource_List.select


#14

I got it now. The correct syntax is:

#PBS -l host=compute-0-3 -l ncpus=10 -l mem=2GB

Thank you very much, adarsh.
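
For comparison, the same request can also be written as a single select chunk, which is the form the server stored for the test job earlier in this thread (Resource_List.select = 1:host=compute-0-3:ncpus=1). The failing directive probably mixed the two styles: colon-separated parts such as :ncpus=10 belong inside a select statement, not after a plain -l host= value. A minimal sketch, where select_chunk is a made-up helper:

```shell
# Hypothetical helper: print a one-chunk select statement pinned to a host.
# Arguments: host, ncpus, mem (for example: compute-0-3 10 2gb)
select_chunk() {
    printf 'select=1:ncpus=%s:mem=%s:host=%s\n' "$2" "$3" "$1"
}
```

It could be used as qsub -l "$(select_chunk compute-0-3 10 2gb)" -- /bin/sleep 100, or the printed string can be placed in a job script as #PBS -l select=1:ncpus=10:mem=2gb:host=compute-0-3.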