Start pbs does not work

Hi,

I need your help. When I try to start PBS with:

$ sudo /etc/init.d/pbs start

I get the message below:

Starting PBS
/opt/pbs/sbin/pbs_comm ready (pid=6677), Proxy Name:centos7-1.home:17001, Threads:4
PBS comm
pbs_mom: addclient_byname, host centos7 not found
pbs_mom: parse_config, config[1] command "$clienthost centos7" failed, aborting
/opt/pbs/sbin/pbs_mom: config file(s) parsing failed
pbs_mom startup failed, exit 1 aborting.

Can someone help me understand what happened?

Thanks a lot.

Hi @nekcorp,

It looks like the value given for $clienthost (‘centos7’) in /var/spool/pbs/mom_priv/config is not a resolvable hostname.
pbs_mom is getting an error from gethostbyname(3).

The issue is hostname resolution:

  • a static IP and a resolvable hostname are required.
  • is the hostname “centos7” resolvable to a static IP?

Please share the output of:

  1. cat /etc/pbs.conf
  2. cat /etc/hosts
  3. ifconfig
  4. ping -c 5 centos7
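
The resolution question above can be checked against /etc/hosts with a small shell helper. This is only a sketch: `name_in_hosts` is a hypothetical function name, and it looks at the hosts file alone, whereas pbs_mom goes through the full resolver, so a name known only to DNS would be reported as missing here.

```shell
# name_in_hosts NAME [FILE]
# Succeeds if NAME appears as a hostname or alias on any line of FILE
# (default: /etc/hosts). Comments are ignored.
# For the full resolver path (files plus DNS, as configured in
# /etc/nsswitch.conf), `getent hosts NAME` is the usual check.
name_in_hosts() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                                   # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}
```

For example, `name_in_hosts centos7 || echo "centos7 is not in /etc/hosts"` prints the warning when no line of the file lists the name.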

Hi @riyazhakki

In /var/spool/pbs/mom_priv/config I have:

$clienthost centos7
$restrict_user_maxsysid 999

Hi @adarsh

  1. cat /etc/pbs.conf

PBS_SERVER=centos7-1.home
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

  2. cat /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.49 centos7-1.home centos7-1

  3. ifconfig

enp9s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.49 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::f4b9:e141:9e3d:226e prefixlen 64 scopeid 0x20
inet6 2a01:cb19:884d:4400:7986:37ee:b6b1:d910 prefixlen 64 scopeid 0x0
ether 00:21:70:70:de:25 txqueuelen 1000 (Ethernet)
RX packets 2785240 bytes 1432119255 (1.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3286872 bytes 740472676 (706.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 17

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 1000 (Local Loopback)
RX packets 520350 bytes 14971390380 (13.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 520350 bytes 14971390380 (13.9 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

  4. ping -c 5 centos7

ping: centos7: Name or service not known

Thank you. Please change the 192.168.1.49 line in /etc/hosts to:
192.168.1.49 centos7-1.home centos7-1 centos7

Then start the PBS service.

I have updated it, and now when I start the PBS service I get this message:

Starting PBS
PBS comm already running.
PBS mom
Creating usage database for fairshare.
PBS sched
Connecting to PBS dataserviceconnected to PBS dataservice@centos7-1.home
Licenses valid for 10000000 Floating hosts
Server@centos7-1: setup_nodes, could not create node centos7, error = 15062
PBS server

What is error 15062?

Thank you. Please share the output of the commands below
[ run source /etc/profile.d/pbs.sh first ]

  • ps -ef | grep pbs_
  • pbsnodes -aSj

[ You have to set PBS_SERVER=centos7 in /etc/pbs.conf and then restart the PBS services ]

ps -ef | grep pbs_

root 2027 1 0 08:51 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 2039 1 0 08:51 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 2383 1 0 08:51 ? 00:00:01 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 2476 2444 0 08:51 ? 00:00:00 postgres: postgres pbs_datastore 192.168.1.49(47432) idle
root 2497 1 0 08:51 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
root 11967 1 0 05:49 ? 00:00:00 /opt/pbs/sbin/pbs_comm
nekcorp 25934 22372 0 09:46 pts/4 00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox pbs_

$ pbsnodes -aSj

centos7 down 0 0 0 0kb/0kb 1/1 0/0 0/0 --

When I restart the PBS services, everything looks OK now:

Starting PBS
PBS comm already running.
PBS mom already running.
PBS scheduler already running.
PBS Server already running.

But when I submit a job, it stays in the queue.

Reason: the compute node (centos7) is down, so the job stays in the queue.

  1. please share the output of pbsnodes -av
  2. SELinux must be disabled (the system needs a reboot after this change)
  3. the firewall must be disabled
  4. ports 15001 to 15009 and 17001 must not be blocked
  5. qmgr -c "d n centos7"
  6. qmgr -c "c n centos7"
  7. qmgr -c "s s scheduling = t"
  8. please share the output of pbsnodes -av and qstat -answ1
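
The qmgr steps above use abbreviated syntax. Spelled out, and with plain ASCII double quotes (the shell does not treat typographic quotes as quoting), they could be wrapped as below. This is a sketch only: `recreate_node` and the `RUN` variable are hypothetical names, the function defaults to a dry run that just prints the commands, and it assumes it is run on the PBS server host by a user the server accepts as a manager.

```shell
# recreate_node NODE
# Delete and re-create a node, turn scheduling on, then list node state.
# With RUN unset this is a dry run: each command is only echoed.
# With RUN set to the empty string the commands are executed directly.
recreate_node() {
    node=$1
    run=${RUN-echo}
    $run qmgr -c "delete node $node"              # long form of "d n"
    $run qmgr -c "create node $node"              # long form of "c n"
    $run qmgr -c "set server scheduling = true"   # long form of "s s scheduling = t"
    $run pbsnodes -av
}
```

`recreate_node centos7` prints the four commands for review; `RUN= recreate_node centos7` would run them as the current user.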

Hi @adarsh

Below is the output of each command:

1. pbsnodes -av

centos7
Mom = centos7.home
ntype = PBS
state = down,unresolvable
pcpus = 1
resources_available.host = centos7
resources_available.ncpus = 1
resources_available.vnode = centos7
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Fri May 15 10:07:49 2020

2. $ sestatus

SELinux status: disabled

3. $ systemctl status firewalld

● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sun 2020-05-17 05:30:49 CEST; 51s ago
Docs: man:firewalld(1)
Process: 3859 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 3859 (code=exited, status=0/SUCCESS)

4. $ nmap -sT -O localhost

Starting Nmap 6.40 ( http://nmap.org ) at 2020-05-17 05:36 CEST
Nmap scan report for localhost (127.0.0.1)
Host is up (0.00026s latency).
Other addresses for localhost (not scanned): 127.0.0.1
Not shown: 990 closed ports
PORT STATE SERVICE
22/tcp open ssh
25/tcp open smtp
80/tcp open http
111/tcp open rpcbind
631/tcp open ipp
3306/tcp open mysql
5432/tcp open postgresql
15002/tcp open unknown
15003/tcp open unknown
15004/tcp open unknown
Device type: general purpose
Running: Linux 3.X
OS CPE: cpe:/o:linux:linux_kernel:3
OS details: Linux 3.7 - 3.9
Network Distance: 0 hops

OS detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 2.85 seconds

5. $ qmgr -c "d n centos7"

qmgr obj=centos7 svr=default: Unauthorized Request
qmgr: Error (15007) returned from server

6. $ qmgr -c "c n centos7"

qmgr obj=centos7 svr=default: Unauthorized Request
qmgr: Error (15007) returned from server

7. $ qmgr -c "s s scheduling = t"

qmgr obj= svr=default: Unauthorized Request
qmgr: Error (15007) returned from server

8. $ qstat -answ1

142.centos7                    nekcorp         workq           PYTHON               --     1     1    --    --  Q  --   --
 Not Running: Not enough free nodes available

Thanks for your help.

Thank you for the information.
Please refer to the PBS Pro admin guide https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf, in the section below:

8.5.1.1 Events Logged at Event Class 128 (0x0080)

Please try deleting the node using qmgr; it will log messages in $PBS_HOME/server_logs/YYYYMMDD.

Please share the server_logs; they will lead us to the issue.

It seems that the canonicalized hostname keeps changing.

The server sets the administrators to root on the canonical name of the hostname when you set things up. Since you now get “unauthorized request”, the qmgr commands no longer seem to come from the address that name is bound to – possibly because that name is no longer resolvable because you changed things (is the host supposed to be “centos7” or “centos7-1”? Decide once and for all…)

Ditto for your node:

centos7
Mom = centos7.home

You created the node by passing the name “centos7”. BTW: this name MUST match the output of “hostname” or your hooks won’t work well. The server resolved it to an IP address, and reached a MoM. MoM sent back a message, and the IP address it gave resolved to centos7.home, but this name currently does not map to an IP address. I bet that now centos7-1.home exists but not centos7.home.

In other words: stick to one canonical name for each IP address and don’t change it. Make sure that /etc/nsswitch.conf has “files” first for the hosts map, and then control your IP/name translations from there (the canonicalized hostname is always the first entry). Make sure names that you use only appear on one line (do not be tempted to add e.g. “centos7” to two different lines, because that will confuse a lot of software since your canonicalisation is then ambiguous).
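
Both points (the first entry on a line is the canonical name, and a name should appear on one line only) can be checked with a short script. A sketch under assumptions: `canonical_name` and `duplicate_names` are hypothetical helper names, and only the hosts file is inspected, not DNS.

```shell
# canonical_name IP [FILE]
# Print the first name listed for IP in FILE (default: /etc/hosts).
canonical_name() {
    awk -v ip="$1" '
        { sub(/#.*/, "") }            # drop comments
        $1 == ip { print $2; exit }
    ' "${2:-/etc/hosts}"
}

# duplicate_names [FILE]
# Print, sorted, every name that appears on more than one line of FILE.
duplicate_names() {
    awk '
        { sub(/#.*/, "") }
        {
            split("", seen_here)      # clear the per-line set
            for (i = 2; i <= NF; i++)
                if (!($i in seen_here)) { seen_here[$i] = 1; lines[$i]++ }
        }
        END { for (n in lines) if (lines[n] > 1) print n }
    ' "${1:-/etc/hosts}" | sort
}
```

Note that `duplicate_names` will also list names such as localhost when they sit on both the 127.0.0.1 and ::1 lines, which is the usual default layout; the entries to look at are real hostnames like centos7 showing up for more than one address.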

How you get out of this conundrum depends on what you see in the logs and what the output of qmgr -c "print server" is.