Unable to submit a job on compute node (node01)

Hi,
I have installed PBS Pro 19.1.1 on master (PBS server) and node01.

On master I am able to submit a job as the pbsdata user (pbsdata exists on the master machine), and I get the STDIN.o3 and STDIN.e3 output files:
qsub -l select=1:ncpus=1:mem=100mb:host=master -- /bin/sleep 10

However, when I submit the job to node01:
qsub -l select=1:ncpus=1:mem=100mb:host=node01 -- /bin/sleep 10

I get:
[pbsdata@master ~]$ qstat -ans

master:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7.master        pbsdata  workq    STDIN         --    1   1  100mb   --  H   --
   Job held, too many failed attempts to run

My logs:
cat /var/spool/pbs/mom_logs/20200304 | grep 7.master
03/04/2020 12:57:44;0028;pbs_mom;Job;7.master;No Password Entry for User pbsdata
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;node01 cput= 0:00:00 mem=0kb
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;no active tasks
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;Obit sent
03/04/2020 12:57:44;0080;pbs_mom;Job;7.master;delete job request received
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job

[pbsdata@master ~]$ pbsnodes -a
master
Mom = master
ntype = PBS
state = free
pcpus = 2
resources_available.arch = linux
resources_available.host = master
resources_available.mem = 2046864kb
resources_available.ncpus = 2
resources_available.vnode = master
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 4 13:10:28 2020
last_used_time = Wed Mar 4 14:24:20 2020

node01
Mom = node01
ntype = PBS
state = free
pcpus = 2
resources_available.arch = linux
resources_available.host = node01
resources_available.mem = 2046864kb
resources_available.ncpus = 2
resources_available.vnode = node01
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 4 13:10:28 2020
last_used_time = Wed Mar 4 13:21:27 2020

[pbsdata@master ~]$

[pbsdata@master ~]$ cat /etc/pbs.conf
PBS_EXEC=/share/apps/platform/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=master
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

[root@node01 ~]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=master
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

Please help me submit jobs successfully as the pbsdata user (on the master machine) to node01.

Regards,
Zain

You have to have the pbsdata user on node01.
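As a minimal sketch of that fix (assuming local accounts rather than NIS/LDAP; the UID 1001 is an assumption and must match `id pbsdata` on master so file ownership agrees on any shared filesystem):

```shell
# On node01, as root: create the pbsdata account if it is missing.
# The UID (1001 here) is hypothetical -- match it to master's pbsdata.
getent passwd pbsdata >/dev/null || \
    useradd --uid 1001 --create-home pbsdata

# Verify the passwd entry that pbs_mom complained about now resolves:
getent passwd pbsdata
```

This directly addresses the "No Password Entry for User pbsdata" line in the MoM log, which means node01 has no passwd entry for the job owner.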

Henry Wu|吴光宇

+1 @wgy

Can you SSH into node01 as "pbsdata"? And does the home directory for pbsdata exist?


+1 @wgy @adarsh

By default, the job runs as the job owner (here, pbsdata), but you can use the -u option to run it as a different user on the compute nodes.

   -u user_list
           List  of usernames.  Job is run under a username from this list.  Sets job's User_List attribute to
           user_list.  Default: job owner (username on submit host.)  Format of user_list:

                  user[@host][,user@host ...]

Please follow this link

~/.rhosts should be populated with the host(s) and other username(s) if userA wants to submit job(s) as the other user(s).
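For illustration (the username "labuser" is hypothetical, not from this thread), the -u option described above would be used like this, with the target user's ~/.rhosts on the execution host authorizing the submitting user:

```shell
# ~/.rhosts of labuser on node01 would contain a line such as:
#   master pbsdata
# Then pbsdata on master can submit a job that runs as labuser:
qsub -l select=1:ncpus=1:host=node01 -u labuser@node01 -- /bin/sleep 10
```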

Thank you for your response, Adarsh.
The pbsdata user's home directory did not exist on node01. I have now created the home directory and set up passwordless SSH between the pbsdata users on master and node01.

Now I am able to submit jobs to node01.

Finally, can you explain whether a PBS Pro cluster needs to have the same user on all the VMs in the cluster?
For Ex:
Master (PBS server / headnode / login node)
Node01 (compute node)
Node02 (compute node)

Here, should all the nodes have the same test user, with passwordless SSH between the test users?

Regards,
Zain

In any cluster environment, the user needs seamless SSH access across the nodes of the cluster, with host key checking disabled or the host keys pre-approved. Basically, no password should be asked when SSH'ing from:

  • master / headnode to compute node(s)
  • compute node(s) to headnode
  • compute node(s) to compute node(s)
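A minimal sketch of setting that up for one user (run as pbsdata on master; the host names are the ones from this thread):

```shell
# Generate a key pair without a passphrase (skip if one already exists):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Install the public key on every node, including the headnode itself,
# so all three directions listed above work without a password:
for host in master node01; do
    ssh-copy-id "$host"
done

# Should print the remote hostname with no password prompt:
ssh node01 hostname
```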

Usually, in a cluster environment the user home directory is common (mounted) across all the compute nodes. Also, NIS / PBIS / other directory services might be used for storing user account details.
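As an illustration of the shared-home-directory setup (assuming master exports /home over NFS; configuring the export itself is not shown), each compute node would mount it roughly like this:

```shell
# On node01, as root -- hypothetical: master exports /home over NFS
mount -t nfs master:/home /home

# Persist it across reboots with an /etc/fstab line such as:
#   master:/home    /home    nfs    defaults,_netdev    0 0
```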

Thanks for clarifying SSH across the nodes.

Please guide me on how to restrict cores based on user/group (or for all users) when they submit jobs.
For example:
Master - 24 cores
Node01 - 24 cores
Total: 48 cores
The IT group may use only 10 cores
The Bio group may use only 20 cores
The Chem group may use only 10 cores
How can I enforce this scenario using their default queues?
And if I want to run a top-priority job using 40 cores, how can I submit it?

Please guide me in configuring these scenarios.

Regards,
Zain

Please refer to this documentation:

and this section: 5.15.1.9.ii, "Examples of Setting Server and Queue Limits".
This will cover all your use cases.

Are these (IT, Bio, Chem) Linux groups?
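Assuming IT, Bio, and Chem are Linux groups, the per-group core caps from the example above could be expressed as server-level run limits, roughly like this (a sketch of the limits that documentation section describes, not a complete policy):

```shell
# Cap the cores each Linux group may use concurrently across the complex:
qmgr -c "set server max_run_res.ncpus = [g:IT=10]"
qmgr -c "set server max_run_res.ncpus += [g:Bio=20]"
qmgr -c "set server max_run_res.ncpus += [g:Chem=10]"

# Optional generic cap for any group not listed above:
qmgr -c "set server max_run_res.ncpus += [g:PBS_GENERIC=10]"
```

The same `max_run_res` limits can also be set per queue instead of server-wide if each group has its own default queue.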

Thanks for sharing the link and the section, Adarsh.
I will check and get back to you.

Regards,
Zain