Qsub permission denied


#1

Hi Team,

I ran into a very weird issue: qsub is able to submit a job, but the job always fails, according to the stderr output. Could you please share how to debug this issue? Thanks.

zxhu@host:~$ echo 'hostname'|qsub -q lnx64
100.cictest-09

zxhu@host:~$ cat STDIN.e100
-bash: line 1: /var/spool/pbs/mom_priv/jobs/100.cictest-09.SC: Permission denied


#2

Please check and share the information below; it may help you trace the issue:

  • tracejob 100
  • on the compute node where this job ran, check the MoM logs ($PBS_HOME/mom_logs/YYYYMMDD)
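
For anyone following along, the two checks above could be scripted roughly as below. This is only a sketch: the helper name mom_lines_for_job and the variables JOB_ID and MOM_LOG_DIR are my own assumptions, and the default paths (PBS_HOME under /var/spool/pbs, MoM logs named YYYYMMDD) may differ on your installation.

```shell
#!/bin/sh
# Sketch only: gather the records for one job.
# Assumptions: PBS_HOME defaults to /var/spool/pbs and MoM logs are named
# YYYYMMDD under MOM_LOG_DIR; adjust both to your installation.
PBS_HOME=${PBS_HOME:-/var/spool/pbs}
MOM_LOG_DIR=${MOM_LOG_DIR:-$PBS_HOME/mom_logs}

# Hypothetical helper: print the lines of one MoM log file that belong to a job.
# MoM log records are semicolon-separated, with the job ID as its own field.
mom_lines_for_job() {
    log_file=$1
    job_id=$2
    grep -F ";$job_id;" "$log_file"
}

# Example run: JOB_ID=131.cictest-09 sh this_script.sh
if [ -n "${JOB_ID:-}" ]; then
    # Server side: tracejob takes the numeric part of the job ID.
    command -v tracejob >/dev/null 2>&1 && tracejob "${JOB_ID%%.*}"
    # Compute node side: today's MoM log, filtered to this job.
    today_log="$MOM_LOG_DIR/$(date +%Y%m%d)"
    [ -r "$today_log" ] && mom_lines_for_job "$today_log" "$JOB_ID"
fi
```

Run it on the server for the tracejob part and on the compute node for the MoM log part, since the two logs normally live on different hosts.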

#3

Thanks for the reply.
This is the command I used to submit the job:
zxhu@sjdpc-zxhu:~$ echo 'hostname'|qsub -q lnx64
131.cictest-09

zxhu@sjdpc-zxhu:~$ tracejob 131

Job: 131.cictest-09

07/27/2018 06:51:06 S enqueuing into lnx64, state 1 hop 1
07/27/2018 06:51:06 S Job Queued at request of zxhu@sjdpc-zxhu., owner = zxhu@sjdpc-zxhu., job name = STDIN, queue = lnx64
07/27/2018 06:51:07 L Considering job to run
07/27/2018 06:51:07 S Job Run at request of Scheduler@cictest-09 on exec_vnode (cictest-04:ncpus=1)
07/27/2018 06:51:07 S Job Modified at request of Scheduler@cictest-09
07/27/2018 06:51:07 S Exit_status=126 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:00
07/27/2018 06:51:07 L Job run
07/27/2018 06:51:07 S Obit received momhop:1 serverhop:1 state:4 substate:42

These are the mom_logs:
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 463, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0100;pbs_mom;Req;;Type 1 request received from root@172.19.197.97:15001, sock=1
07/27/2018 08:52:12;0100;pbs_mom;Req;;Type 3 request received from root@172.19.197.97:15001, sock=1
07/27/2018 08:52:12;0100;pbs_mom;Req;;Type 5 request received from root@172.19.197.97:15001, sock=1
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0008;pbs_mom;Job;131.cictest-09;Started, pid = 5370
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0080;pbs_mom;Job;131.cictest-09;task 00000001 terminated
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 463, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0008;pbs_mom;Job;131.cictest-09;Terminated
07/27/2018 08:52:12;0100;pbs_mom;Job;131.cictest-09;task 00000001 cput= 0:00:00
07/27/2018 08:52:12;0008;pbs_mom;Job;131.cictest-09;kill_job
07/27/2018 08:52:12;0100;pbs_mom;Job;131.cictest-09;cictest-04 cput= 0:00:00 mem=0kb
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0008;pbs_mom;Job;131.cictest-09;no active tasks
07/27/2018 08:52:12;0100;pbs_mom;Job;131.cictest-09;Obit sent
07/27/2018 08:52:12;0100;pbs_mom;Req;;Type 54 request received from root@172.19.197.97:15001, sock=1
07/27/2018 08:52:12;0080;pbs_mom;Job;131.cictest-09;copy file request received
07/27/2018 08:52:12;0100;pbs_mom;Job;131.cictest-09;staged 2 items out over 0:00:00
07/27/2018 08:52:12;0800;pbs_mom;n/a;mom_get_sample;nprocs: 464, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
07/27/2018 08:52:12;0008;pbs_mom;Job;131.cictest-09;no active tasks
07/27/2018 08:52:12;0100;pbs_mom;Req;;Type 6 request received from root@172.19.197.97:15001, sock=1
07/27/2018 08:52:12;0080;pbs_mom;Job;131.cictest-09;delete job request received
07/27/2018 08:52:12;0008;pbs_mom;Job;131.cictest-09;kill_job


#4

Exit_status is non-zero.
Exit code 126: command invoked cannot execute.

The command "hostname" might not be executable on that compute node by that user. Try:
echo "/bin/hostname" | qsub -q lnx64
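
As a side note on reading Exit_status: the sketch below decodes a status using the usual POSIX shell conventions (126 means the command was found but could not be executed, 127 means it was not found) and reproduces a 126 locally with a file that has no execute bit. describe_exit_status is a hypothetical helper, not a PBS command, and PBS may also report special values of its own that this does not cover.

```shell
#!/bin/sh
# Sketch: interpret an exit status using the usual POSIX shell conventions.
# describe_exit_status is a hypothetical helper, not part of PBS.
describe_exit_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -eq 126 ]; then
        echo "command found but could not be executed"
    elif [ "$status" -eq 127 ]; then
        echo "command not found"
    elif [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))"
    else
        echo "command-specific failure"
    fi
}

# Reproduce the situation locally: a file with no execute bit, run by path.
demo_file=$(mktemp)
printf 'echo hello\n' > "$demo_file"
chmod 600 "$demo_file"
"$demo_file" 2>/dev/null
demo_status=$?
echo "status $demo_status: $(describe_exit_status "$demo_status")"
rm -f "$demo_file"
```

The same status appears whether the missing permission is on the file itself or on a directory leading to it, so the number alone does not say which one is at fault.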


#5

Still the same error. Could you please advise? Thanks.

zxhu@sjdpc-zxhu:~$ echo '/bin/hostname'|qsub -V -q lnx64
193.cictest-09

zxhu@sjdpc-zxhu:~$ cat STDIN.e193
-bash: line 1: /var/spool/pbs/mom_priv/jobs/193.cictest-09.SC: Permission denied


#6
  1. Please check that password-less SSH is working for all users (seamless, without StrictHostKeyChecking prompts):
    • ssh from server to MoM should be seamless
    • ssh from MoM to server should be seamless
    • ssh between the MoMs should be seamless
  2. Please add the lines below, in the same order, to /etc/pbs.conf on the server and compute nodes, and restart the services:
    PBS_RCP=/bin/false
    PBS_SCP=/usr/bin/scp
    PBS_RSHCOMMAND=/usr/bin/ssh
  3. qmgr -c "set server flatuid=true"
  4. Try running an interactive job as below:
    qsub -l select=1:ncpus=1 -I (the last argument to qsub here is -I, that is, hyphen capital I, I as in Icecream)
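
The SSH check in the first step could be scripted along these lines. It is a sketch under assumptions: SSH_BIN and can_ssh_without_prompt are names I made up, and the host names in the usage comment are only examples. It relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of asking for a password or a host-key confirmation.

```shell
#!/bin/sh
# Sketch: check that ssh to each host works without any prompt.
# SSH_BIN is an assumption added here so the command can be swapped out.
SSH_BIN=${SSH_BIN:-ssh}

# BatchMode=yes makes ssh fail instead of prompting, so a zero exit status
# means the login needed no interaction at all.
can_ssh_without_prompt() {
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage: sh check_ssh.sh cictest-04 cictest-09   (host names are examples)
for host in "$@"; do
    if can_ssh_without_prompt "$host"; then
        echo "$host: ok"
    else
        echo "$host: FAILED"
    fi
done
```

Run it as the job owner on the server, and again on each MoM host, to cover all three directions listed above.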

#7

Just a reminder: I wonder whether the $HOME directory of this user exists.
If not, the job would fail as soon as it ran.
I can't remember exactly what the error is, but I suggest you do a quick check.
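
A quick check could look like the sketch below, run on the compute node. home_dir_exists is a hypothetical helper, and it assumes getent is available (as on most Linux systems) to look up the home directory registered for the user.

```shell
#!/bin/sh
# Sketch: check that a user's registered home directory exists on this host.
# home_dir_exists is a hypothetical helper; getent is assumed to be available.
home_dir_exists() {
    user=$1
    # Field 6 of the passwd entry is the home directory.
    home=$(getent passwd "$user" | cut -d: -f6)
    [ -n "$home" ] && [ -d "$home" ]
}

# Usage: sh check_home.sh zxhu
if [ -n "${1:-}" ]; then
    if home_dir_exists "$1"; then
        echo "$1: home directory exists"
    else
        echo "$1: home directory missing or user unknown"
    fi
fi
```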


#8

Hi Adarsh,
Sorry for the late response. The interactive job works whether flatuid is set or not.

zxhu@master:~$ qsub -l select=1:ncpus=1 -I
qsub: waiting for job 534.master to start
qsub: job 534.master ready

Login on Linux, source .bashrc profile
zxhu@node1:~$

But it still fails with the same error message when I submit a batch job:
-bash: line 1: /var/spool/pbs/mom_priv/jobs/533.sjolmbench.SC: Permission denied


#9

Thanks. The $HOME exists.


#10

Please share the permissions of /var/spool/pbs/mom_priv (while an interactive job is running).
The permissions on my system are as below:

[root@compute_node ~]# namei -mo /var/spool/pbs/spool/
f: /var/spool/pbs/spool/
dr-xr-xr-x root root /
drwxr-xr-x root root var
drwxr-xr-x root root spool
drwxr-xr-x root root pbs
drwxrwxrwt root root spool

[root@compute_node mom_priv]# namei -mo /var/spool/pbs/mom_priv/jobs
f: /var/spool/pbs/mom_priv/jobs
dr-xr-xr-x root root /
drwxr-xr-x root root var
drwxr-xr-x root root spool
drwxr-xr-x root root pbs
drwxr-x--x root root mom_priv
drwxr-x--x root root jobs


#11

root@node1:~# namei -mo /var/spool/pbs/spool/
f: /var/spool/pbs/spool/
dr-xr-xr-x root root /
drwxr-xr-x root root var
drwxr-xr-x root root spool
drwxr-xr-x root root pbs
drwxrwxrwt root root spool
root@node1:~# namei -mo /var/spool/pbs/mom_priv/jobs/
f: /var/spool/pbs/mom_priv/jobs/
dr-xr-xr-x root root /
drwxr-xr-x root root var
drwxr-xr-x root root spool
drwxr-xr-x root root pbs
drwxr-x--- root root mom_priv
drwxr-xr-x root root jobs

I also found a file that looks like the one below. It is from an interactive job. Is this correct?
-rwx------ 1 zxhu group1 22 Sep 25 01:43 372085.master1.SC

Thanks.
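
The two namei listings above do not show the same modes for mom_priv and jobs. Whether that difference explains the error is not established in this thread, but a comparison like this can be automated. The sketch below walks a path and prints the first directory that lacks the execute (search) bit for "other" users; first_dir_without_other_x is a hypothetical helper, and it deliberately ignores owner and group permissions, so it is only meaningful for users who are neither.

```shell
#!/bin/sh
# Sketch: walk a path and report the first directory that users outside the
# owner and group cannot traverse (no "other" execute bit).
# first_dir_without_other_x is a hypothetical helper name.
first_dir_without_other_x() {
    rest=$1
    case $rest in
        /*) current="" ;;     # absolute path: start from the root
        *)  current="." ;;    # relative path: start from the current directory
    esac
    while [ -n "$rest" ]; do
        part=${rest%%/*}                  # next path component
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   rest="" ;;
        esac
        [ -n "$part" ] || continue        # skip empty components
        current="$current/$part"
        [ -d "$current" ] || continue     # only directories need the search bit
        perms=$(ls -ld "$current" | cut -c1-10)
        case $perms in
            *x|*t) ;;                     # other-execute set (t = with sticky bit)
            *) echo "$current"; return 1 ;;
        esac
    done
    return 0
}

# Example: check the job directory discussed above.
first_dir_without_other_x /var/spool/pbs/mom_priv/jobs \
    || echo "the directory printed above is not traversable by other users"
```

Running it as root on both systems and comparing the output is a quick way to spot where the two installations differ.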


#12

Thank you. Could you please share the contents of the .SC file?


#13

#!/bin/bash
sleep 1000

Thanks.


#14

Can you please try these and check whether they run:

  • qsub -- /bin/hostname
  • qsub -- /bin/sleep 10

I am not sure whether you can run the same script after removing the #! line.


#15

Thanks. If I remove '#!/bin/bash', it seems to work. Do you know why?


#16

Can you run that script as that user, without using PBS, on that compute node?


#17

It runs without any issue if I run the script without PBS.


#18

Strange. Then please try this: qsub -S /bin/bash -- sleep.sh


#19

Submitting a script file always works. :(

The issue only happens when submitting from STDIN and the node runs RHEL6 or later. There is no issue on RHEL5.
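
To compare the submission styles tried in this thread side by side on a given node, something like the sketch below may help. It only prints the commands unless RUN=1 is set and qsub is on the PATH. The names QUEUE, JOB_SCRIPT, write_job_script and submission_commands are assumptions of this sketch; the queue default lnx64 is taken from the commands earlier in the thread.

```shell
#!/bin/sh
# Sketch: list (or, with RUN=1, actually submit) the submission styles
# tried in this thread, so their results can be compared on one node.
QUEUE=${QUEUE:-lnx64}
JOB_SCRIPT=${JOB_SCRIPT:-./hostname_job.sh}

# Write a minimal job script that just prints the host name.
write_job_script() {
    printf '%s\n' '#!/bin/bash' '/bin/hostname' > "$1"
}

# One command per line: STDIN mode, direct executable mode, script-file mode.
submission_commands() {
    echo "echo '/bin/hostname' | qsub -q $QUEUE"
    echo "qsub -q $QUEUE -- /bin/hostname"
    echo "qsub -q $QUEUE $JOB_SCRIPT"
}

if [ "${RUN:-0}" = 1 ] && command -v qsub >/dev/null 2>&1; then
    write_job_script "$JOB_SCRIPT"
    submission_commands | while IFS= read -r cmd; do
        echo "+ $cmd"
        sh -c "$cmd" </dev/null
    done
else
    submission_commands
fi
```

After the jobs finish, comparing the stderr files of the three jobs shows which style hits the Permission denied error on that node.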