How do I write a script to run a program on two hosts?


#1

Normally, I can run a program (hostname, for example) on two hosts with a command like this:
mpirun -N 1 -machinefile ./nodes2 hostname
This prints the 2 hostnames without PBS.

Recently I tried PBS Pro, which I installed through yum. I wrote the following script (script.sh):

#!/bin/bash
#PBS -N JOB1
#PBS -j oe
#PBS -l select=2:ncpus=1:mpiprocs=1
#PBS -l place=scatter
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > machine_file
mpirun -machinefile ./machine_file hostname

I ran "qsub script.sh" but it didn't work; the mail said "Job cannot be executed".
Then I changed "place=scatter" to "place=free" and the job ran, but both processes landed on one host.
I have checked with "pbsnodes -a" and all 7 slave nodes are there.
How can I solve this problem?


#2

Please try the script below, updating the absolute path to the mpirun command:

#!/bin/bash
#PBS -N JOB1
#PBS -j oe
#PBS -l select=2:ncpus=1:mpiprocs=1
#PBS -l place=scatter
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE .
/absolute/path/to/mpirun -machinefile $PBS_NODEFILE /bin/hostname

If the job does not run, please share the tracejob output and the pbsnodes -av output.
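As an extra sanity check (not part of the script above, just a sketch): once the job does run, you can count how many distinct hosts PBS actually allocated by de-duplicating the node file inside the job script. With select=2 and place=scatter it should report 2:

```shell
# Hypothetical sanity check: $PBS_NODEFILE has one line per MPI rank
# (hostnames may repeat), so de-duplicating it counts distinct hosts.
count_hosts() {
    sort -u "$1" | wc -l
}

# $PBS_NODEFILE is only set inside a PBS job; guard so the snippet is
# harmless when run interactively.
if [ -n "${PBS_NODEFILE:-}" ]; then
    count_hosts "$PBS_NODEFILE"   # expect 2 for select=2 ... place=scatter
fi
```

If this prints 1, the scheduler packed both ranks onto one host despite the request.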


#3

Unfortunately, it didn’t work.

[hpc@sms3 benchmark]$ pbsnodes -av
ohpc3-cn1
Mom = ohpc3-cn1
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn1
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn1
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

ohpc3-cn2
Mom = ohpc3-cn2
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn2
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn2
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

ohpc3-cn3
Mom = ohpc3-cn3
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn3
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn3
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

ohpc3-cn4
Mom = ohpc3-cn4
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn4
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn4
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

ohpc3-cn5
Mom = ohpc3-cn5
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn5
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn5
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

ohpc3-cn6
Mom = ohpc3-cn6
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn6
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn6
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

ohpc3-cn7
Mom = ohpc3-cn7
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = ohpc3-cn7
resources_available.mem = 267796992kb
resources_available.ncpus = 64
resources_available.vnode = ohpc3-cn7
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

/var/spool/mail/hpc:
From adm@sms3.localdomain Wed Sep 26 16:24:38 2018
Return-Path: adm@sms3.localdomain
X-Original-To: hpc@sms3
Delivered-To: hpc@sms3.localdomain
Received: by sms3.localdomain (Postfix, from userid 0)
id 7631240105F5; Wed, 26 Sep 2018 16:24:38 +0800 (CST)
To: hpc@sms3.localdomain
Subject: PBS JOB 25.sms3
Message-Id: 20180926082438.7631240105F5@sms3.localdomain
Date: Wed, 26 Sep 2018 16:24:38 +0800 (CST)
From: adm@sms3.localdomain (root)

PBS Job Id: 25.sms3
Job Name: JOB1
Aborted by PBS Server
Job cannot be executed
See Administrator for help


#4

Please share the output of the below commands

  1. qstat -answ1
  2. qstat -fx 25
  3. tracejob 25
  4. qstat -Bf

#5

[hpc@sms3 benchmark]$ qstat -answ1
[hpc@sms3 benchmark]$ qstat -fx 25
qstat: PBS is not configured to maintain job history 25.sms3
[hpc@sms3 benchmark]$ tracejob 25

Job: 25.sms3

09/26/2018 16:24:38 L Considering job to run
09/26/2018 16:24:38 S enqueuing into workq, state 1 hop 1
09/26/2018 16:24:38 S Job Queued at request of hpc@sms3, owner = hpc@sms3, job name = JOB1, queue = workq
09/26/2018 16:24:38 S Job Run at request of Scheduler@sms3 on exec_vnode (ohpc3-cn1:ncpus=1)+(ohpc3-cn2:ncpus=1)
09/26/2018 16:24:38 S Job Modified at request of Scheduler@sms3
09/26/2018 16:24:38 S Discard running job, A sister Mom failed to delete job
09/26/2018 16:24:38 S Exit_status=-1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=2 resources_used.vmem=0kb
resources_used.walltime=00:00:00
09/26/2018 16:24:38 L Job run
09/26/2018 16:24:38 S Obit received momhop:1 serverhop:1 state:4 substate:41
09/26/2018 16:24:38 S dequeuing from workq, state 5
[hpc@sms3 benchmark]$ qstat -Bf
Server: sms3
server_state = Active
server_host = sms3
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
:0
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
scheduler_iteration = 600
FLicenses = 2000000
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
Avail_Sockets:1000000 Unused_Sockets:1000000
pbs_version = 14.1.2
eligible_time_enable = False
max_concurrent_provision = 5

"qstat -answ1" and "qstat -fx 25" showed nothing…


#6

Thank you for this information, it is very helpful.

  1. Enable job history (history is recorded only from the moment you enable it):
    qmgr -c "set server job_history_enable=true" # by default, history is kept for two weeks
    qmgr -c "set server job_history_duration=24:00:00" # optional: keep it for 24 hours instead

The job ran on ohpc3-cn1 and ohpc3-cn2.

You appear to have hostname resolution issues.
Please make sure the hostnames resolve (and reverse-resolve) to their static IP addresses.
(If you are using dynamic IP addresses, that is not recommended.)
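One quick way to test this (a sketch, assuming getent is available; the node names are placeholders from this thread) is to check that each name resolves to an address and that the address resolves back to the same name:

```shell
# Hypothetical forward/reverse resolution check; substitute your own node names.
check_resolution() {
    for n in "$@"; do
        # forward lookup: hostname -> first matching IP address
        ip=$(getent hosts "$n" | awk '{print $1; exit}')
        if [ -z "$ip" ]; then
            echo "$n: no forward lookup"
            continue
        fi
        # reverse lookup: IP -> canonical name; should match the original
        rev=$(getent hosts "$ip" | awk '{print $2; exit}')
        echo "$n -> $ip -> $rev"
    done
}

check_resolution sms3 ohpc3-cn1 ohpc3-cn2
```

Run it on the head node and on every compute node; every line should end with the same name it started with.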


#7

Thank you for your patient replies.
I have tried what you said, but the problem is still not resolved.
1. The two qmgr commands didn't work:

[hpc@sms3 ~]$ qmgr -c "set server job_history_enable=true"
qmgr obj= svr=default: Unauthorized Request
qmgr: Error (15007) returned from server

[hpc@sms3 ~]$ qmgr -c "set server job_history_duration=24:00:00"
qmgr obj= svr=default: Unauthorized Request
qmgr: Error (15007) returned from server

  2. I do use static IP addresses, and I can ping ohpc3-cn1 and ohpc3-cn2 from sms3. I have checked /etc/hosts and it seems correct, but I'm not sure.

[hpc@sms3 benchmark]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
126.26.136.121 sms3

126.26.136.122 ohpc3-cn1
126.26.136.123 ohpc3-cn2
126.26.136.124 ohpc3-cn3
126.26.136.125 ohpc3-cn4
126.26.136.126 ohpc3-cn5
126.26.136.127 ohpc3-cn6
126.26.136.128 ohpc3-cn7

Then I ran the job again and saw somewhat different output in tracejob, but I think it is still the same problem, because the line "Discard running job, A sister Mom failed to delete job" didn't change.

[hpc@sms3 benchmark]$ qsub pbs_job.sh
33.sms3

[hpc@sms3 benchmark]$ qstat -f 33
Job Id: 33.sms3
Job_Name = JOB1
Job_Owner = hpc@sms3
job_state = H
queue = workq
server = sms3
Checkpoint = u
ctime = Wed Sep 26 18:49:40 2018
Error_Path = sms3:/home/hpc/ROMSProjects/benchmark/JOB1.e33
Hold_Types = s
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Wed Sep 26 18:49:46 2018
Output_Path = sms3:/home/hpc/ROMSProjects/benchmark/JOB1.o33
Priority = 0
qtime = Wed Sep 26 18:49:40 2018
Rerunable = True
Resource_List.mpiprocs = 2
Resource_List.ncpus = 2
Resource_List.nodect = 2
Resource_List.place = scatter
Resource_List.select = 2:ncpus=1:mpiprocs=1
stime = Wed Sep 26 18:49:46 2018
substate = 20
Variable_List = PBS_O_HOME=/home/hpc,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=hpc,
PBS_O_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu7/3.0.0/bin:/opt/ohpc/pub/uti
ls/forge/bin:/opt/ohpc/pub/utils/reports/bin:/opt/ohpc/pub/mpi/openmpi3
-gnu7/3.1.0/bin:/opt/ohpc/pub/compiler/gcc/7.3.0/bin:/opt/ohpc/pub/util
s/prun/1.2:/opt/ohpc/pub/utils/autotools/bin:/opt/ohpc/pub/bin:/usr/loc
al/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/hpc/.
local/bin:/home/hpc/bin,PBS_O_MAIL=/var/spool/mail/hpc,
PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/home/hpc/ROMSProjects/benchmark,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,PBS_O_HOST=sms3
comment = job held, too many failed attempts to run
run_count = 21
Exit_status = -3
Submit_arguments = pbs_job.sh
project = _pbs_project_default

[hpc@sms3 benchmark]$ tracejob 33

Job: 33.sms3

09/26/2018 18:49:40 S enqueuing into workq, state 1 hop 1
09/26/2018 18:49:40 S Job Queued at request of hpc@sms3, owner = hpc@sms3, job name = JOB1, queue = workq
09/26/2018 18:49:40 S Job Modified at request of Scheduler@sms3
09/26/2018 18:49:40 S Obit received momhop:1 serverhop:1 state:4 substate:41
09/26/2018 18:49:41 S Obit received momhop:2 serverhop:2 state:4 substate:41
09/26/2018 18:49:41 S Obit received momhop:3 serverhop:3 state:4 substate:41
09/26/2018 18:49:41 S Obit received momhop:4 serverhop:4 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:5 serverhop:5 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:6 serverhop:6 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:7 serverhop:7 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:8 serverhop:8 state:4 substate:41
09/26/2018 18:49:43 S Obit received momhop:9 serverhop:9 state:4 substate:41
09/26/2018 18:49:43 S Obit received momhop:10 serverhop:10 state:4 substate:41
09/26/2018 18:49:43 S Obit received momhop:11 serverhop:11 state:4 substate:41
09/26/2018 18:49:44 S Obit received momhop:12 serverhop:12 state:4 substate:41
09/26/2018 18:49:44 S Obit received momhop:13 serverhop:13 state:4 substate:41
09/26/2018 18:49:44 S Obit received momhop:14 serverhop:14 state:4 substate:41
09/26/2018 18:49:45 S Obit received momhop:15 serverhop:15 state:4 substate:41
09/26/2018 18:49:45 S Obit received momhop:16 serverhop:16 state:4 substate:41
09/26/2018 18:49:45 S Obit received momhop:17 serverhop:17 state:4 substate:41
09/26/2018 18:49:46 L Considering job to run
09/26/2018 18:49:46 S Job Run at request of Scheduler@sms3 on exec_vnode (ohpc3-cn1:ncpus=1)+(ohpc3-cn2:ncpus=1)
09/26/2018 18:49:46 S Discard running job, A sister Mom failed to delete job
09/26/2018 18:49:46 S Job requeued, execution node down
09/26/2018 18:49:46 L Job run
09/26/2018 18:49:46 S Obit received momhop:18 serverhop:18 state:4 substate:41
09/26/2018 18:49:46 S Obit received momhop:19 serverhop:19 state:4 substate:41
09/26/2018 18:49:46 S Obit received momhop:20 serverhop:20 state:4 substate:41
09/26/2018 18:49:46 S Obit received momhop:21 serverhop:21 state:4 substate:41


#8

No problem, you are welcome. :slight_smile:

Please run those qmgr commands as the root user.

Do you have the same /etc/hosts file populated on all the compute nodes?

Please note:

09/26/2018 16:24:38 S Job Run at request of Scheduler@sms3 on exec_vnode (ohpc3-cn1:ncpus=1)+(ohpc3-cn2:ncpus=1)

ohpc3-cn1 is the mother superior node
ohpc3-cn2 is the sister node

(as seen in your qstat -f 33 output)

Please check the MoM logs on both of these nodes, ohpc3-cn1 and ohpc3-cn2.
As the root user on each node:
source /etc/pbs.conf
cd $PBS_HOME/mom_logs
vi 20180926 # look for job id 33 and the log lines associated with it

This should clearly state the issue (or you can share those two files).


#9

Thank you again.

I wasn't using the same /etc/hosts on all the compute nodes; I've fixed that and run the script again.
It failed again. :frowning:

I checked mom_logs/20180926 for job id 35 (the newest job run above; there is only one file because I use OpenHPC and the root filesystems are shared), and found that the log repeats the following lines:

09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;no active tasks
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;Obit sent
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0080;pbs_mom;Job;35.sms3;delete job request received
09/26/2018 20:21:33;0001;pbs_mom;Job;35.sms3;Unable to send delete job request to one or more sisters
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;no active tasks
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;Obit sent
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0080;pbs_mom;Job;35.sms3;delete job request received
09/26/2018 20:21:33;0001;pbs_mom;Job;35.sms3;Unable to send delete job request to one or more sisters
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001


#10

There is an issue with the sister node 126.26.136.123 (ohpc3-cn2); please check that system's MoM logs for job id 35.

Error code 15059: Sister could not communicate (see the PBS Pro Reference Guide).


#11

I've reinstalled PBS on all nodes; it didn't help.
I've checked name resolution with the ping and pbs_hostn commands, and the results are correct. By the way, the firewalld service on the master node is disabled, and the compute nodes have no firewalld service at all. Does this matter?
Now I wonder if I am missing some network setting.
I changed "select=2" to "select=4" in the script and checked the MoM logs; all the compute nodes report the same errors.

I've collected the MoM log from ohpc3-cn2 (126.26.136.123) as follows:

[hpc@ohpc3-cn2 mom_logs]$ cat /var/spool/pbs/mom_logs/20180927
09/27/2018 11:01:06;0002;pbs_mom;Svr;Log;Log opened
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;pbs_version=14.1.2
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
09/27/2018 11:01:06;0100;pbs_mom;Svr;parse_config;file config
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.121 as authorized
09/27/2018 11:01:06;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
09/27/2018 11:01:06;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
09/27/2018 11:01:06;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
09/27/2018 11:01:06;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 127.0.0.1:15003,126.26.136.123:15003
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 1024
09/27/2018 11:01:06;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Max files too low - you may want to increase it.
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
09/27/2018 11:01:06;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm sms3:17001
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.123 as authorized
09/27/2018 11:01:06;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
09/27/2018 11:01:06;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
09/27/2018 11:01:06;0002;pbs_mom;n/a;initialize;pcpus=64, OS reports 64 cpu(s)
09/27/2018 11:01:06;0006;pbs_mom;Fil;pbs_mom;Version 14.1.2, started, initialization type = 0
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Mom pid = 14889 ready, using ports Server:15001 MOM:15002 RM:15003
09/27/2018 11:01:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 126.26.136.123:15003 to pbs_comm
09/27/2018 11:01:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm sms3:17001
09/27/2018 11:01:36;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
09/27/2018 11:01:36;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at sms3:15001
09/27/2018 11:01:36;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest 126.26.136.121:15001, msg=“tfd=21, pbs_comm:126.26.136.121:17001: Dest not found”
09/27/2018 11:02:43;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/27/2018 11:02:43;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.122 as authorized
09/27/2018 12:04:25;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/27/2018 12:04:25;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.124 as authorized
09/27/2018 12:04:27;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/27/2018 12:04:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.125 as authorized
09/27/2018 12:05:11;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:12;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:13;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:14;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:15;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:15;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:15;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:16;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:16;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:16;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:17;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:17;0021;pbs_mom;Job;1.sms3;rename in job_save failed
Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:18;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:18;0021;pbs_mom;Job;1.sms3;rename in job_save failed
b_save, error on open
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:28;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:28;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;no active tasks
09/27/2018 12:05:28;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:28;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:28;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:22;0021;pbs_mom;Job;1.sms3;rename in job_save failed
12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:29;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;no active tasks
09/27/2018 12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:29;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001


#12

Please check the points below again. (It seems DNS is working fine, resolving FQDNs and short names on the compute nodes; /etc/hosts might have been cached.) We will get there; there is something basic hindering the process.

Please make sure:

  1. SELinux is disabled (if you have only just disabled it, reboot the nodes)
  2. ports 15001 to 15007 and 17001 are open for communication within the cluster (head node to compute nodes and vice versa, and between the compute nodes)
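To verify point 2, one option (a sketch, assuming bash with its /dev/tcp feature and the coreutils timeout command; hostnames and ports are examples from this thread) is a small TCP probe run from each side of the connection:

```shell
# Hypothetical TCP probe: reports whether HOST:PORT accepts a connection.
probe() {
    host=$1 port=$2
    # bash opens /dev/tcp/HOST/PORT as a TCP connection; timeout bounds the wait
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed"
    fi
}

# From the head node: each MoM's RM port, e.g.
probe ohpc3-cn2 15003
# From a compute node: the server and pbs_comm ports, e.g.
probe sms3 15001
probe sms3 17001
```

Any "closed" line between hosts that PBS expects to talk to would point at the blocked path.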

We will check from the basics:

  1. for i in {1..7}; do qsub -N HOSTNAME -l select=1:ncpus=1 -l place=excl -- /bin/hostname; done
  2. cat HOSTNAME.o* # should display all the compute node hostnames
  3. cat pbs.sh
    #!/bin/bash
    env
    hostname
    sleep 10
  4. chmod +x pbs.sh
  5. for i in {1..7}; do qsub -l select=1:ncpus=1 -l place=excl -N PBS pbs.sh; done
  6. cat PBS.o*

Let us know whether the above submissions worked without any issues and whether you were able to see the stdout and stderr files.


#13

Thank you for your patience, adarsh.
I've done what you suggested and pasted some information below. The results are still discouraging…

  1. Running 'getenforce' on all nodes (the master node and the slave nodes) returns 'Disabled'.
  2. I stopped all firewalld and iptables services. The port info for all nodes is below; I believe it matches "the Ports Used by PBS in TPP Mode" section of the PBS Pro Installation Guide:

[root@sms3 ~]# netstat -anp | grep 1500*
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN 3096/pbs_server.bin
tcp 0 0 0.0.0.0:15004 0.0.0.0:* LISTEN 2289/pbs_sched
tcp 0 0 0.0.0.0:15007 0.0.0.0:* LISTEN 3085/postgres
tcp 0 0 126.26.136.121:38754 126.26.136.121:15007 ESTABLISHED 3096/pbs_server.bin
tcp 0 0 126.26.136.121:15007 126.26.136.121:38754 ESTABLISHED 3095/postgres: post
tcp6 0 0 :::15007 :::* LISTEN 3085/postgres
udp 0 0 0.0.0.0:514 0.0.0.0:* 1509/rsyslogd
udp6 0 0 :::514 :::* 1509/rsyslogd
unix 2 [ ACC ] STREAM LISTENING 64685 3085/postgres /tmp/.s.PGSQL.15007
unix 2 [ ACC ] STREAM LISTENING 64683 3085/postgres /var/run/postgresql/.s.PGSQL.15007
unix 2 [ ] DGRAM 61573 1509/rsyslogd
unix 3 [ ] STREAM CONNECTED 33968 1508/irqbalance
[root@sms3 ~]# netstat -anp | grep 1700*
tcp 0 0 0.0.0.0:17001 0.0.0.0:* LISTEN 2217/pbs_comm
tcp 0 0 126.26.136.121:126 126.26.136.121:17001 ESTABLISHED 3096/pbs_server.bin
tcp 0 0 126.26.136.121:17001 126.26.136.124:543 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.127:543 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.122:125 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.121:126 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.121:19 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.123:268 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.125:302 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:17001 126.26.136.128:127 ESTABLISHED 2217/pbs_comm
tcp 0 0 126.26.136.121:19 126.26.136.121:17001 ESTABLISHED 2289/pbs_sched
tcp 0 0 126.26.136.121:17001 126.26.136.126:714 ESTABLISHED 2217/pbs_comm

[root@sms3 ~]# pdsh -w ohpc3-cn[1-7] “netstat -anp” | grep 15003
ohpc3-cn5: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 11993/pbs_mom
ohpc3-cn4: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 64111/pbs_mom
ohpc3-cn1: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 2411/pbs_mom
ohpc3-cn2: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 14889/pbs_mom
ohpc3-cn6: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 15966/pbs_mom
ohpc3-cn3: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 3532/pbs_mom
ohpc3-cn7: tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 43678/pbs_mom
[root@sms3 ~]# pdsh -w ohpc3-cn[1-7] "netstat -anp" | grep 15002
ohpc3-cn5: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 11993/pbs_mom
ohpc3-cn3: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 3532/pbs_mom
ohpc3-cn1: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 2411/pbs_mom
ohpc3-cn7: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 43678/pbs_mom
ohpc3-cn4: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 64111/pbs_mom
ohpc3-cn2: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 14889/pbs_mom
ohpc3-cn6: tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 15966/pbs_mom

The submission results only show ohpc3-cn1:

[hpc@sms3 pbs_test]$ cat HOSTNAME.o*
ohpc3-cn1

[hpc@sms3 pbs_test]$ cat PBS.o*
MANPATH=:/opt/pbs/share/man
HOSTNAME=ohpc3-cn1
SHELL=/bin/bash
HISTSIZE=1000
PBS_JOBNAME=PBS
TMPDIR=/var/tmp/pbs.9.sms3
PBS_ENVIRONMENT=PBS_BATCH
PBS_O_WORKDIR=/home/hpc/pbs_test
NCPUS=1
USER=hpc
PBS_TASKNUM=1
LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu7/3.0.0/lib:/home/hpc/src/lib:/home/hpc/src/lib/hdf5/lib:/home/hpc/src/lib/pnetcdf/lib:/home/hpc/src/lib/netcdf/lib:
PBS_O_HOME=/home/hpc
PBS_MOMPORT=15003
PBS_O_QUEUE=workq
PATH=/opt/ohpc/pub/mpi/openmpi3-gnu7/3.0.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/hpc/.local/bin:/home/hpc/bin
PBS_O_LOGNAME=hpc
MAIL=/var/spool/mail/hpc
PBS_O_LANG=en_US.UTF-8
_=/bin/env
PBS_JOBCOOKIE=00000000793A00C700000000641429EA
PWD=/home/hpc
PBS_NODENUM=0
PBS_JOBDIR=/home/hpc
PBS_O_SHELL=/bin/bash
PBS_JOBID=9.sms3
ENVIRONMENT=BATCH
HISTCONTROL=ignoredups
HOME=/home/hpc
SHLVL=2
PBS_O_HOST=sms3
LOGNAME=hpc
PBS_QUEUE=workq
PBS_O_MAIL=/var/spool/mail/hpc
OMP_NUM_THREADS=1
LESSOPEN=||/usr/bin/lesspipe.sh %s
PBS_O_SYSTEM=Linux
PBS_NODEFILE=/var/spool/pbs/aux/9.sms3
PBS_O_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu7/3.0.0/bin:/opt/ohpc/pub/utils/forge/bin:/opt/ohpc/pub/utils/reports/bin:/opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/bin:/opt/ohpc/pub/compiler/gcc/7.3.0/bin:/opt/ohpc/pub/utils/prun/1.2:/opt/ohpc/pub/utils/autotools/bin:/opt/ohpc/pub/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/hpc/.local/bin:/home/hpc/bin
ohpc3-cn1

I also got the mom_log for the 2 jobs from ohpc3-cn1. Why does the task terminate?

09/27/2018 18:22:13;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:22:13;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:22:13;0008;pbs_mom;Job;8.sms3;Started, pid = 3005
09/27/2018 18:22:15;0080;pbs_mom;Job;8.sms3;task 00000001 terminated
09/27/2018 18:22:15;0008;pbs_mom;Job;8.sms3;Terminated
09/27/2018 18:22:15;0100;pbs_mom;Job;8.sms3;task 00000001 cput= 0:00:00
09/27/2018 18:22:15;0008;pbs_mom;Job;8.sms3;kill_job
09/27/2018 18:22:15;0100;pbs_mom;Job;8.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 18:22:15;0008;pbs_mom;Job;8.sms3;no active tasks
09/27/2018 18:22:15;0100;pbs_mom;Job;8.sms3;Obit sent
09/27/2018 18:22:15;0100;pbs_mom;Req;;Type 54 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:22:15;0080;pbs_mom;Job;8.sms3;copy file request received
09/27/2018 18:22:17;0100;pbs_mom;Job;8.sms3;staged 2 items out over 0:00:02
09/27/2018 18:22:17;0008;pbs_mom;Job;8.sms3;no active tasks
09/27/2018 18:22:17;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:22:17;0080;pbs_mom;Job;8.sms3;delete job request received
09/27/2018 18:22:17;0008;pbs_mom;Job;8.sms3;kill_job
09/27/2018 18:24:44;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:24:44;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:24:44;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:24:44;0008;pbs_mom;Job;9.sms3;Started, pid = 3020
09/27/2018 18:24:54;0080;pbs_mom;Job;9.sms3;task 00000001 terminated
09/27/2018 18:24:54;0008;pbs_mom;Job;9.sms3;Terminated
09/27/2018 18:24:54;0100;pbs_mom;Job;9.sms3;task 00000001 cput= 0:00:00
09/27/2018 18:24:54;0008;pbs_mom;Job;9.sms3;kill_job
09/27/2018 18:24:54;0100;pbs_mom;Job;9.sms3;ohpc3-cn1 cput= 0:00:00 mem=6848kb
09/27/2018 18:24:54;0008;pbs_mom;Job;9.sms3;no active tasks
09/27/2018 18:24:54;0100;pbs_mom;Job;9.sms3;Obit sent
09/27/2018 18:24:54;0100;pbs_mom;Req;;Type 54 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:24:54;0080;pbs_mom;Job;9.sms3;copy file request received
09/27/2018 18:24:55;0100;pbs_mom;Job;9.sms3;staged 2 items out over 0:00:01
09/27/2018 18:24:55;0008;pbs_mom;Job;9.sms3;no active tasks
09/27/2018 18:24:55;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=3
09/27/2018 18:24:55;0080;pbs_mom;Job;9.sms3;delete job request received
09/27/2018 18:24:55;0008;pbs_mom;Job;9.sms3;kill_job


#14

It seems ohpc3-cn1 is behaving correctly and the rest of the nodes are having some issues.

This is normal termination of jobs on ohpc3-cn1. There are no issues here.

The rest of your nodes are certainly having some issues.


#15

Request Type 54 is PBS_BATCH_CopyFiles; does that mean an rcp operation happens here? And is "staged 2 items out over 0:00:02" an error message or not?

I’ve deleted ohpc3-cn1, so now ohpc3-cn[2-7] are in the cluster, and I ran `for i in {1..6}; do qsub -N HOSTNAME -l select=1:ncpus=1 -l place=excl -- /bin/hostname; done`. The mom_log on ohpc3-cn2 is the same as the log from ohpc3-cn1 that I got yesterday. I think the issue lies in the communication between the head compute node and the sister nodes, not in a specific node such as ohpc3-cn2.

I downloaded the source code and found the message "staged 2 items out over 0:00:02" printed in request.c, in void req_cpyfile(struct batch_request *preq). The function's comment says it will "process the Copy Files request from the server to dispose of output from the job. This is done by a child of MOM since it might take time". Are there some issues with copyfile?


#16

There are no issues with copyfile; I have used 14.1.2 and recently upgraded to 18.1.2.
It is worth upgrading to 18.1.2 (if you have any upcoming maintenance scheduled).

By default PBS Pro uses RCP for file copy, unless you have the configuration below (which uses SCP):

SERVER
#cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=opencent
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

PBS MoM
#cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=opencent
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

If you have common mounts across all hosts in the PBS Pro complex, then you can use the "cp" command instead.
This is configured in $PBS_HOME/mom_priv/config with the $usecp directive.
example:
$usecp admin.default.domain:/home /home
$usecp /home admin.default.domain:/home
$usecp admin:/home /home
$usecp /home admin:/home
$usecp admin:/stage /stage
$usecp /stage admin:/stage


#17

Thanks for your advice. I will try 18.1.2 later.
But for now I should focus on this issue and solve it; I suspect it will still exist after I change to 18.1.2.

Do you have any ideas about this issue? Can I enable debug-level logging to get more information? Or can I use tcpdump to capture the packets between compute nodes?
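The tcpdump idea can be sketched as follows. The interface name (eth0) is an assumption; the node address and the port numbers (15001 server, 15002/15003 MoM, 17001 comm) come from the netstat output earlier in this thread:

```shell
# Sketch: capture PBS traffic between this host and compute node
# ohpc3-cn2 (126.26.136.122) on the PBS service ports, saving to a
# pcap file for later inspection with wireshark/tcpdump -r.
tcpdump -i eth0 -nn -w pbs_cn2.pcap \
    host 126.26.136.122 and \
    \( port 15001 or port 15002 or port 15003 or port 17001 \)
```

Running this on the server while resubmitting the test job should show whether the server and the sister MoMs are actually exchanging traffic.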


#18

Please try increasing the MoM log level:

  • edit $PBS_HOME/mom_priv/config and add the line below
  • $logevent 0xffffffff
  • restart the PBS MoM service
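The steps above can be sketched as a short shell sequence; the PBS_HOME default of /var/spool/pbs and the systemd unit name (pbs) are assumptions that may differ per site:

```shell
# Sketch: raise the MoM log event mask, run on each compute node.
PBS_HOME=${PBS_HOME:-/var/spool/pbs}
CONF="$PBS_HOME/mom_priv/config"

# Append the logevent mask only if it is not already present.
grep -q '^\$logevent' "$CONF" || echo '$logevent 0xffffffff' >> "$CONF"

# Restart the MoM so the new log level takes effect.
systemctl restart pbs
```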

We have seen these issues before and they are related to communication; you can check /var/log/messages on the respective nodes (it might help).
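Checking /var/log/messages across all nodes can be done in one shot with the same pdsh tooling used earlier in this thread (dshbak, which groups identical output, is assumed to be installed alongside pdsh):

```shell
# Sketch: compare the tail of the syslog on every compute node at once.
pdsh -w ohpc3-cn[1-7] 'tail -n 50 /var/log/messages' | dshbak -c
```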

Also, you might compare the configuration on ohpc3-cn1 (the working node) against the rest.


#19

I’ve increased the mom log level and executed the commands below (I’ve deleted ohpc3-cn1, so the node list now begins with ohpc3-cn2).

Then I get the following mom_log:

09/28/2018 16:31:30;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at sms3:15001
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.123 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.124 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.125 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.126 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.127 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.128 as authorized
09/28/2018 16:32:17;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:17;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:17;0400;pbs_mom;Node;ohpc3-cn2;implicitly added host to vmap
09/28/2018 16:32:17;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:17;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:17;0008;pbs_mom;Job;19.sms3;Started, pid = 3111
09/28/2018 16:32:19;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:19;0004;pbs_mom;Act;get_wm;libmemacct.so.1 not found
09/28/2018 16:32:19;0080;pbs_mom;Job;19.sms3;task 00000001 terminated
09/28/2018 16:32:19;0800;pbs_mom;n/a;mom_get_sample;nprocs: 666, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:19;0008;pbs_mom;Job;19.sms3;Terminated
09/28/2018 16:32:19;0100;pbs_mom;Job;19.sms3;task 00000001 cput= 0:00:00
09/28/2018 16:32:19;0008;pbs_mom;Job;19.sms3;kill_job
09/28/2018 16:32:19;0100;pbs_mom;Job;19.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/28/2018 16:32:20;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:20;0008;pbs_mom;Job;19.sms3;no active tasks
09/28/2018 16:32:20;0100;pbs_mom;Job;19.sms3;Obit sent
09/28/2018 16:32:20;0100;pbs_mom;Req;;Type 54 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:20;0080;pbs_mom;Job;19.sms3;copy file request received
09/28/2018 16:32:23;0100;pbs_mom;Job;19.sms3;staged 2 items out over 0:00:03
09/28/2018 16:32:23;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:23;0008;pbs_mom;Job;19.sms3;no active tasks
09/28/2018 16:32:23;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:23;0080;pbs_mom;Job;19.sms3;delete job request received
09/28/2018 16:32:23;0008;pbs_mom;Job;19.sms3;kill_job
09/28/2018 16:32:23;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0

What is the impact of the missing libmemacct.so.1?

I’ve checked the configuration on the compute nodes and on the server node (I mount the ‘/’ of all compute nodes from the master node 126.26.136.121:/opt/ohpc/admin/images/centos7.4, so the configuration is the same on every compute node). My configuration does not have PBS_RSHCOMMAND. Does that matter?


#20

This can be ignored; it is harmless. (pbs_mom is looking for SGI's libmemacct library, which is made available by a package called memacct.)

It does not matter.
Note: PBS_RSHCOMMAND is needed for pbs_mpirun to function correctly for users who require the use of ssh instead of rsh.
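Should pbs_mpirun over ssh ever be needed, the variables from the pbs.conf example earlier in this thread can be appended; a minimal sketch, assuming the stock /etc/pbs.conf location and the systemd unit name pbs:

```shell
# Sketch: switch PBS file copy and remote shell to scp/ssh,
# matching the pbs.conf example shown earlier. Run on each host.
cat >> /etc/pbs.conf <<'EOF'
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh
EOF
systemctl restart pbs
```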