Node-level resource not taking effect


#1

I’m trying to divide the whole cluster into three parts, each containing all the nodes of one hardware type. I created a resource Qlist (just as the Administrator’s Guide says):

create resource Qlist
set resource Qlist type = string_array
set resource Qlist flag = h

and node20, one of the nodes, is configured like this (output of qmgr -c "print node node20"):

create node node20 Mom=node20.localdomain
set node node20 state = job-busy
set node node20 resources_available.arch = linux
set node node20 resources_available.host = node20
set node node20 resources_available.mem = 97617396kb
set node node20 resources_available.ncpus = 24
set node node20 resources_available.Qlist = n24
set node node20 resources_available.Qlist += n24_96
set node node20 resources_available.vnode = node20
set node node20 resv_enable = True
set node node20 sharing = default_shared

I have checked that Qlist does appear on the resources: line in /var/spool/pbs/sched_priv/sched_config.
After all this configuration, I submitted a job whose script looks like:

#PBS -N test-hold-128
#PBS -l nodes=1:ppn=24
#PBS -l Qlist=n24_128
#PBS -q maintaince
##PBS -l Qqueue=maintaince

sleep 10000000

and PBS allocated that job onto node20. As one can see, node20 is configured with only Qlist=n24,n24_96, while the job requests n24_128. In my mind this job should never land on node20. What am I doing wrong?
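My understanding of how a string_array resource is supposed to gate placement, sketched here as plain Python (my own illustration, not PBS source code): a vnode should be eligible only if the requested value appears in its configured array.

```python
# Simplified model (my own illustration, not PBS source) of how a
# host-level string_array resource should gate placement: a vnode is
# eligible only if the requested value appears in its configured array.
def vnode_matches(requested, vnode_qlist):
    return requested in vnode_qlist

node20_qlist = ["n24", "n24_96"]               # from the qmgr output above
print(vnode_matches("n24_128", node20_qlist))  # -> False: node20 should be rejected
```

Yet the scheduler placed the job on node20 anyway.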

Here’s qstat -f of that job:

Job_Name = test-hold-128
Job_Owner = admin@node1
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 5836kb
resources_used.ncpus = 24
resources_used.vmem = 49812kb
resources_used.walltime = 00:14:43
job_state = R
queue = maintaince
server = wiz
Checkpoint = u
ctime = Mon May 21 19:59:26 2018
Error_Path = node1:/home/admin/test-hold-128.e7082
exec_host = node20/0*24
exec_vnode = (node20:ncpus=24)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon May 21 20:09:21 2018
Output_Path = node1:/home/admin/test-hold-128.o7082
Priority = 0
qtime = Mon May 21 19:59:26 2018
Rerunable = True
Resource_List.mpiprocs = 24
Resource_List.ncpus = 24
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=24
Resource_List.place = scatter
Resource_List.preempt_targets = None
Resource_List.Qlist = n24_128
Resource_List.select = 1:ncpus=24:mpiprocs=24
stime = Mon May 21 20:09:21 2018
session_id = 90842
jobdir = /home/admin
substate = 42
Variable_List = ...
comment = Job run at Mon May 21 at 20:09 on (node20:ncpus=24)
etime = Mon May 21 19:59:26 2018
run_count = 1
Submit_arguments = hold.sh
project = _pbs_project_default

Here’s tracejob of that job:

tracejob 7082

Job: 7082.node1

05/21/2018 19:59:26  L    Considering job to run
05/21/2018 19:59:26  L    Insufficient amount of resource: ncpus 
05/21/2018 19:59:26  S    enqueuing into maintaince, state 1 hop 1
05/21/2018 19:59:26  S    Job Queued at request of admin@node1, owner = admin@node1, job name = test-hold-128, queue = maintaince
05/21/2018 19:59:26  S    Job Modified at request of Scheduler@node1
05/21/2018 19:59:26  A    queue=maintaince
05/21/2018 19:59:27  L    Considering job to run
05/21/2018 19:59:27  L    Insufficient amount of resource: ncpus 
05/21/2018 19:59:28  L    Considering job to run
05/21/2018 19:59:28  L    Insufficient amount of resource: ncpus 
05/21/2018 19:59:28  L    Considering job to run
05/21/2018 19:59:28  L    Insufficient amount of resource: ncpus 
05/21/2018 20:07:45  L    Considering job to run
05/21/2018 20:07:45  L    Insufficient amount of resource: ncpus 
05/21/2018 20:09:21  L    Considering job to run
05/21/2018 20:09:21  S    Job Run at request of Scheduler@node1 on exec_vnode (node20:ncpus=24)
05/21/2018 20:09:21  S    Job Modified at request of Scheduler@node1
05/21/2018 20:09:21  L    Job run
05/21/2018 20:09:21  A    user=admin group=users project=_pbs_project_default jobname=test-hold-128 queue=maintaince ctime=1526903966 qtime=1526903966 etime=1526903966 start=1526904561 exec_host=node20/0*24
                          exec_vnode=(node20:ncpus=24) Resource_List.mpiprocs=24 Resource_List.ncpus=24 Resource_List.nodect=1 Resource_List.nodes=1:ppn=24 Resource_List.place=scatter
                          Resource_List.preempt_targets=None Resource_List.Qlist=n24_128 Resource_List.select=1:ncpus=24:mpiprocs=24 resource_assigned.ncpus=24 

(Please kindly ignore the typo in ‘maintaince’ :slight_smile: )


#2

Please note:

  • Qlist must be added to the resources: line of $PBS_HOME/sched_priv/sched_config:
    resources: "ncpus, aoe, vnodes, …, Qlist"

  • Once sched_config is modified, send the scheduler a SIGHUP: kill -HUP <PID of the PBS scheduler>

  • Qlist is a host-level resource, hence it must be part of the chunk (select) statement, as below:

qsub -l select=1:ncpus=1:mem=10mb:Qlist=n24 -- /bin/sleep 100

Your job script should be updated to:

#PBS -N test-hold-128
#PBS -l select=1:ncpus=24:Qlist=n24_128
#PBS -q maintaince
/bin/sleep 1000000

Thank you


#3

FYI:


#4

Thank you very much. Now I get it. I was confused about why Qlist was not always taking effect. You reminded me that I use a hook to translate the old nodes=xxx syntax into select=xxx syntax, but I deliberately left the maintenance queue out of that hook to give administrators more freedom.

So, all in all, a host-level resource won’t take effect when requested with the old nodes=xxx syntax, right?


#5

Thank you too, sir. We have a lot of less experienced users who still run a batch of ancient job scripts, so I’ll stick with the idea of using a hook to translate the old syntax into the new select=xxx form.
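For anyone following along, the core of such a translation can be sketched roughly as below. This is a simplified standalone version of the rewrite logic only; a real PBS queuejob hook would wrap it with `import pbs`, read the job from `pbs.event().job`, and assign the result via `pbs.select()`, and it handles just a single `N:ppn=P` chunk.

```python
import re

def nodes_to_select(nodes_spec):
    """Translate an old-style nodes=N:ppn=P[:prop...] spec into a
    select=N:ncpus=P:mpiprocs=P[:prop...] chunk string.

    Simplified sketch: assumes a single 'N:ppn=P' chunk; any other
    properties (e.g. Qlist=n24_128) are carried over verbatim.
    """
    parts = nodes_spec.split(":")
    count = parts[0]          # number of chunks, e.g. "1"
    ppn = 1                   # default when no ppn= is given
    extras = []
    for part in parts[1:]:
        m = re.fullmatch(r"ppn=(\d+)", part)
        if m:
            ppn = int(m.group(1))
        else:
            extras.append(part)   # custom resources pass through unchanged
    select = "{}:ncpus={}:mpiprocs={}".format(count, ppn, ppn)
    if extras:
        select += ":" + ":".join(extras)
    return select

print(nodes_to_select("1:ppn=24:Qlist=n24_128"))
# -> 1:ncpus=24:mpiprocs=24:Qlist=n24_128
```

Note the output matches the Resource_List.select shown in the qstat -f above (minus Qlist, since the hook skipped the maintenance queue there).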


#6

Thank you, and you are welcome.

You can use the following:
#PBS -l nodes=1:ppn=24:Qlist=n24_128

Please plan ahead and migrate the old scripts to the new syntax gradually; alternatively, a fact page / information page for your users might help expose the new syntax.