Configure PBS Pro with Multiple Execution Hosts


#1

Dear all,

I am a beginner with PBS Pro and have run into some problems installing and configuring it on my cluster. I am deploying PBS with multiple execution hosts.

I have a headnode running the PBS server and several execution hosts, but I cannot submit jobs from the headnode to the execution hosts. When I submit, I get the error:
"qsub:Bad UID for job execution"
or sometimes:
“qsub: Unknown resource: node20”

When I create a vnode for the execution host with:
"create node Vheadnode resources_available.ncpus=12, resources_available.mem=16gb, sharing=default_excl"
the error message is: Error (15007) returned from server

The configuration file on headnode (server):
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=headnode.supernodexp.hpc
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_MOM_NODE_NAME=node20
PBS_LEAF_NAME=node20

The configuration file on node20 (execution host):
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=headnode.supernodexp.hpc
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_MOM_NODE_NAME=node20
#PBS_SERVER_HOST_NAME=headnode.supernodexp.hpc

Please help me with this problem. I installed and configured PBS Pro following the official documentation.


#2

Hi,

It seems that you have not been able to add the node to PBS. Error 15007 indicates No permission.
Which user are you trying to add the node with?

“qsub:Bad UID for job execution” --> also indicates that the user is not permitted to submit jobs.

Also, please let us know the qsub command that you tried to execute.

Thanks.


#3

Hi prakashcv13,

I use the root account to add the node to the server. On the server side, I added a node as an execution host:
qmgr: create node node20 (node20 is the execution host running MOM)

and on the server side, I can list the information of node20:
qmgr: list node node20
node20
Mom = supernodexp-computenode-020
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = supernodexp-computenode-020
resources_available.mem = 131587560kb
resources_available.ncpus = 48
resources_available.vnode = node20
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l

But I still cannot submit jobs to node20 from the server:
$ cat pbs_script
#!/bin/sh
#PBS -l walltime=1:00:00
#PBS -l mem=400mb,ncpus=4
#PBS -l nodes=node20
stress -c 24 -t 60

$ qsub pbs_script

I tried submitting the job with a non-root account, but it shows the error: qsub: Bad UID for job execution
So maybe I have to declare this account with PBS. How can I add this account to PBS?
Please help me! Thank you so much for your support.


#4

This error is usually seen when you try running jobs as root.
You would need to add this user to acl_roots.
For example, if the username is myuser:
qmgr: set server acl_roots += myuser

After this you should be able to submit a job as this user.


#5

Dear prakashcv13,

I configured and added the execution host (node20) to PBS. On the server side, when I print the nodes:
$ pbsnodes -a
headnode
Mom = headnode.supernodexp.hpc
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = headnode
resources_available.mem = 131587560kb
resources_available.ncpus = 48
resources_available.vnode = headnode
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

node20
Mom = supernodexp-computenode-020
ntype = PBS
state = free
pcpus = 24
resources_available.arch = linux
resources_available.host = supernodexp-computenode-020
resources_available.mem = 263701884kb
resources_available.ncpus = 24
resources_available.vnode = node20
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

I also added my user to PBS following your instructions. But when I submit a job script like this:
#!/bin/sh
#PBS -l walltime=1:00:00
#PBS -l mem=400mb,ncpus=4
#PBS -l nodes=node20
stress -c 24 -t 60

to the other execution host (node20 above) from the server side:
headnode$ qsub pbs_script
15.headnode
headnode$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
14.headnode       pbs_script       ctminh                   0 Q workq
15.headnode       pbs_script       ctminh                   0 Q workq

The jobs do not appear to be running on the separate execution host (node20), nor on the server (headnode). I checked this with the “top” command.

(“stress” is used here to load the CPU for testing.)

Do you know the reason? Please help me!
Thank you for your attention!


#6

The output of qstat -s and qstat -f will help in understanding why the jobs are not being run.
If that doesn't help, we need to look in the server and scheduler logs.
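As a rough sketch of how to pull the scheduler's reason out of the full job listing: the helper name job_comment and the sample text below are illustrative assumptions, and real qstat -f output may wrap a long comment across several lines, which this sketch does not handle.

```shell
# Hypothetical helper: print the "comment" attribute from `qstat -f`
# style output read on stdin.
job_comment() {
    sed -n 's/^[[:space:]]*comment = //p'
}

# Illustrative sample shaped like part of a `qstat -f` listing.
# On a live system you would run: qstat -f <jobid> | job_comment
sample='Job Id: 15.headnode
    job_state = Q
    comment = Can Never Run: Insufficient amount of resource: host (node20 != headnode,supernodexp-computenode-020)
'

printf '%s\n' "$sample" | job_comment
```

The comment attribute is usually the quickest hint; the server and scheduler logs give more detail if it is empty.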

Also, I believe that your PBS directives in the script start with a ‘#’

Thanks,
Prakash


#7

Hi all,

I submitted jobs with my account “ctminh” after adding it to PBS Pro as you instructed. But the job still cannot run on the execution host, and the job comment shows:
comment = Can Never Run: Insufficient amount of resource: host (node20 != headnode,supernodexp-computenode-020)

I guess the PBS server does not know about node20 as a resource for the submission, even though I created a vnode named node20 and the PBS server recognizes its information.
// create vnode - command
$ qmgr: create node node20
// show information from server
$ pbsnodes -a

node20
Mom = supernodexp-computenode-020
ntype = PBS
state = free
pcpus = 24
resources_available.arch = linux
resources_available.host = supernodexp-computenode-020
resources_available.mem = 263701884kb

On node20 (my execution host), I only run pbs_mom for executing jobs. On headnode, I run pbs_server, pbs_sched, pbs_comm and pbs_mom. I have also shared some log files from the server and the execution host at https://www.dropbox.com/sh/kydbpbbg0f2t6z0/AAA0bwFHmFCkr7ws9eViMTRka?dl=0 (sorry, I don't see any option for attaching files directly to this post)

Thank you for your attention!


#8

Change the node name to supernodexp-computenode-020 instead of node20, and change the script accordingly to request -l nodes=supernodexp-computenode-020.

OR change resources_available.host to “node20” for node20 through qmgr:
qmgr: s n node20 resources_available.host=node20
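
To see the mismatch at a glance, here is a small sketch that pairs each vnode name with its resources_available.host value. The helper name vnode_hosts is made up for this example, and the sample is a shortened copy of the node listing posted earlier in this thread.

```shell
# Hypothetical helper: read `pbsnodes -a` style output on stdin and print
# "<vnode name> <resources_available.host>" for each node.
vnode_hosts() {
    awk '
        NF == 1 { vnode = $1 }                                # a lone word is a node name
        $1 == "resources_available.host" { print vnode, $3 }  # fields: key, "=", value
    '
}

# Shortened sample of the listing for node20.
# On a live system you would run: pbsnodes -a | vnode_hosts
sample='node20
     Mom = supernodexp-computenode-020
     resources_available.host = supernodexp-computenode-020
     resources_available.vnode = node20
'

printf '%s' "$sample" | vnode_hosts
```

When the two columns differ, a request for nodes=node20 is matched against the host value and finds nothing, which is what the "Can Never Run" comment reports.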


#9

Hi,
You need to change the job script. Instead of requesting “#PBS -l nodes=node20”,
request “#PBS -l vnode=node20”.
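
For reference, a sketch of the whole script with that one directive changed. The scratch directory is only an assumption for this example, and the qsub line is left as a comment because it needs a working PBS installation.

```shell
# Write the job script to a scratch directory (location chosen only for this sketch).
workdir=$(mktemp -d)
cat > "$workdir/pbs_script" <<'EOF'
#!/bin/sh
#PBS -l walltime=1:00:00
#PBS -l mem=400mb,ncpus=4
#PBS -l vnode=node20
stress -c 24 -t 60
EOF

# Show the directives that qsub would read from the script.
grep '^#PBS' "$workdir/pbs_script"

# On a host with PBS installed, submission would then be:
#   qsub "$workdir/pbs_script"
```

Note that each directive line must begin with #PBS, with no space between the # and PBS.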

Regards
Dilip


#10

Dear all,

Thank you so much, dilip-krishnan and prakashcv13.

I have now submitted jobs from the PBS server (headnode) to the execution host (node20).
I fixed the directive in the script file: #PBS -l vnode=node20
and now my job runs on node20 when submitted from headnode.

I still do not completely understand PBS and need more time to read the documentation. Thanks again for your attention.


#11

glad that it worked for you!


#12

I work with ctminh.
On headnode, I ran the command “qsub pbs_script -q @node20” instead of using the PBS directive (#PBS -l vnode=node20) in the pbs_script file, and I got this error:

Connection refused
qsub: cannot connect to server node20 (errno=111)

Node20 info:
node20
Mom = supernodexp-computenode-020
Port = 15002
pbs_version = 14.0.1
ntype = PBS
state = free
pcpus = 24
resources_available.arch = linux
resources_available.host = supernodexp-computenode-020
resources_available.mem = 263701884kb
resources_available.ncpus = 24
resources_available.vnode = node20
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

and config file:

PBS_SERVER=headnode.supernodexp.hpc
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

Please help me.

Thank you all.


#13

Hi kimloai,
I see that in pbs.conf PBS_START_SERVER is set to 0; please change it to 1 and restart the PBS service. It seems the PBS server is
not running, since it isn’t enabled in the pbs.conf file.

Regards
Dilip


#14

If I have understood your configuration correctly, your server is the headnode and not node20, so do not set PBS_START_SERVER to 1 on node20.
If you are not using PBS directives in the script, you can specify vnodes through qsub like this -

qsub -lvnode=node20 pbs_script


#15

Hi Kimloai,
The -q option of qsub specifies the destination to which the job is submitted, not where it is actually executed. To execute pbs_script on vnode “node20”, the command provided by Prakash is the right way to submit the job. Also, I am not sure whether the config file is from headnode or node20. If it is from headnode, then you have to set PBS_START_SERVER=1; otherwise not.


#16

More explanation of the -q option to qsub can be found in the documentation. However, what you are doing with the command --> qsub pbs_script -q @node20 is telling PBS to submit the job to the server running on node20.
You are not running the PBS server on node20 (as seen in the conf file you shared), so you see this error.
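
To make the destination forms concrete, here is a small sketch that spells out how a -q argument is read: a queue name, @server, or queue@server. The function name describe_destination is made up for illustration and is not part of PBS.

```shell
# Hypothetical helper: explain how a `qsub -q` destination string is read.
describe_destination() {
    dest=$1
    case $dest in
        @*)  echo "default queue on server ${dest#@}" ;;
        *@*) echo "queue ${dest%%@*} on server ${dest#*@}" ;;
        *)   echo "queue $dest on the default server" ;;
    esac
}

describe_destination @node20
describe_destination workq
describe_destination workq@headnode
```

So -q @node20 names a server to contact, not a node to run on, which is why qsub tried to connect to node20 and was refused.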

Also, whether it is the server or an execution host, do set the PBS_SERVER configuration option in the pbs.conf file. This option specifies the server name, which in your case is headnode.supernodexp.hpc, and it is properly set in the conf files you shared earlier, so do not change it unless you change the installation.

The PBS_START_SERVER option tells whether the PBS server should be started on a particular host or not. It is read by the init script. So this option should be set to 1 on the server host and 0 on the execution host. This setting is also correct in your case.

As for selecting a particular node for a job, one way to achieve that is through -lvnode, as I mentioned in my previous post.
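
A quick way to check what a given host is configured to start is to list the PBS_START_* options set to 1. This is only a sketch: the helper name enabled_daemons is hypothetical, and the sample file copies the node20 settings shared above instead of reading the real pbs.conf.

```shell
# Hypothetical helper: list the daemons enabled in a pbs.conf-style file.
enabled_daemons() {
    # Match uncommented PBS_START_<NAME>=1 lines and print <NAME>, space-separated.
    sed -n 's/^PBS_START_\([A-Z]*\)=1$/\1/p' "$1" | paste -sd ' ' -
}

# Sample copied from the node20 configuration in this thread.
# On a real host you would point this at the pbs.conf file itself.
conf=$(mktemp)
cat > "$conf" <<'EOF'
PBS_SERVER=headnode.supernodexp.hpc
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
EOF

enabled_daemons "$conf"
```

For an execution host like node20, only MOM should appear in the result.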

Hope this helps.

Thanks,
Prakash


#17

Hi Prakash and Dilip

I have figured out my problem and now correctly understand the -q option of qsub.

Thanks for your help.

KimLoai