Issue starting PBS


#1

Hi!

After running “sudo /etc/init.d/pbs start” I get the following error:

Starting PBS
PBS Home directory /var/spool/pbs needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.
***
/opt/pbs/pgsql/pg_upgrade not found 
Failed to upgrade PBS Datastore

There is no pgsql folder under /opt/pbs, nor any folder pgsql/pg_upgrade on my system.
The only similar issue I have found on forums is this.
I am running CentOS 7.3 on ppc64le, have postgresql-server-9.2.18-1 and have installed PBS from source.

Any ideas about what the cause of this problem may be?


#2

From looking into where it fails in the ‘/opt/pbs/libexec/pbs_habitat’ script I think the problem is that I need pgsql version 9.3 or higher.


#3

Hello,

Assuming you are attempting a fresh PBS Pro installation, the console log you provided suggests that the PBS_HOME directory contains files and directories from a previous installation. pbs_habitat enters upgrade mode when it finds a datastore directory from a previous installation, and it then tries to upgrade the database to the newer version, which requires the pg_upgrade tool from PostgreSQL.

I suggest backing up and cleaning the PBS_HOME directory, then trying the installation again.
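
A minimal sketch of that backup-and-clean step, assuming PBS_HOME is /var/spool/pbs as in the log above and that PBS has been stopped first. The helper name backup_and_clear and the backup location are illustrative only; neither is part of PBS. The directory is moved aside rather than deleted so it can be restored.

```shell
# Sketch: archive a PBS_HOME directory, then move it aside so the next
# start sees a clean home. backup_and_clear is a hypothetical helper,
# not a PBS command.
backup_and_clear() {
    home_dir=$1      # e.g. /var/spool/pbs
    backup_dir=$2    # any directory with enough free space

    [ -d "$home_dir" ] || { echo "no such directory: $home_dir" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1

    stamp=$(date +%Y%m%d%H%M%S)
    archive="$backup_dir/pbs_home_$stamp.tar.gz"

    # Archive first; only move the directory aside if tar succeeded.
    tar -czf "$archive" -C "$(dirname "$home_dir")" "$(basename "$home_dir")" || return 1
    mv "$home_dir" "$home_dir.old.$stamp" || return 1

    # Print the archive path so the caller can record it.
    echo "$archive"
}
```

Run as root it might look like: /etc/init.d/pbs stop, then backup_and_clear /var/spool/pbs /root/pbs_backup, then retry the installation or start.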

Thanks,
Ashwath


#4

Thank you Ashwath, now I do not get that error. However, when I attempt to start PBS it seems to hang (it has been like this for almost an hour). How would you suggest debugging this?

Starting PBS
PBS Home directory /var/spool/pbs needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.
***
*** Setting default queue and resource limits.
***
Connecting to PBS dataservice..................................................................

#5

This looks like a network issue. Check that your hostname resolves to the correct IP address on the network; if it does not, update /etc/hosts with the correct IP.
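
One way to check this, as a sketch: the hypothetical helper below (lookup_hosts_file is not a PBS or system command) prints the address that a hosts-format file assigns to a name, so it can be compared with the address actually configured on the interface.

```shell
# Sketch: print the first address that a hosts-format file maps to a name.
# lookup_hosts_file is a hypothetical helper; the file defaults to /etc/hosts.
lookup_hosts_file() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }    # strip comments before looking at fields
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$file"
}
```

For example, lookup_hosts_file "$(hostname)" shows what /etc/hosts says, while getent hosts "$(hostname)" shows what the resolver order in /etc/nsswitch.conf actually returns. If either prints nothing, a loopback address, or an old address, that would be worth fixing before retrying.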

-Ashwath


#6

I am experiencing a similar problem on a new install:

[root@triton log]# /etc/init.d/pbs start
Starting PBS
PBS Home directory /var/spool/pbs needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.


*** Error initializing the PBS dataservice
Error details:
Creating the PBS Data Service…
Starting PBS Data Service…
waiting for server to start…2017-01-11 12:47:18 PSTLOG: could not bind IPv4 socket: Address already in use
2017-01-11 12:47:18 PSTHINT: Is another postmaster already running on port 15007? If not, wait a few seconds and retry.
2017-01-11 12:47:18 PSTLOG: could not bind IPv6 socket: Address already in use
2017-01-11 12:47:18 PSTHINT: Is another postmaster already running on port 15007? If not, wait a few seconds and retry.
2017-01-11 12:47:18 PSTWARNING: could not create listen socket for "*"
2017-01-11 12:47:18 PSTFATAL: could not create any TCP/IP sockets
stopped waiting
pg_ctl: could not start server
Examine the log output.
Failed to start PBS Data Service
Error starting PBS Data Service
[root@triton log]#


#7

On my system I see the following when PBS Pro is running:

# netstat -n | grep 15007
tcp        0      0 192.168.111.207:41930    192.168.111.207:15007    ESTABLISHED
tcp        0      0 192.168.111.207:15007    192.168.111.207:41930    ESTABLISHED

Please ensure that there is no other service utilizing port 15007. It’s also possible that the port is stuck in TIME_WAIT. That is why the messages suggest you wait some period of time and retry. I think the default is two minutes.
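
Note that netstat -n on its own lists only connected sockets; listening sockets are omitted unless -l or -a is given, so a postmaster that is listening on the port but has no clients would not appear in the output above. A small sketch of a filter for listeners follows. The helper name listening_on_port is illustrative, and it assumes the local address is the fourth column, as in netstat -lnt and ss -lnt output.

```shell
# Sketch: read "netstat -lnt" or "ss -lnt" output on stdin and print the
# lines whose local address (4th column) ends in the given port.
# listening_on_port is a hypothetical helper name.
listening_on_port() {
    port=$1
    awk -v p="$port" '$4 ~ (":" p "$") { print }'
}
```

Usage would be something like netstat -lnt | listening_on_port 15007. Adding -p to netstat (as root) also shows the PID and program name holding the socket.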


#8

No, nothing running on that port:

[root@triton log]#
[root@triton log]# netstat -n | grep 15007
[root@triton log]#


#9

After confirming that nothing is consuming that port and that it’s not in TIME_WAIT, do you still encounter the problem? What does the netstat output look like immediately after the problem occurs?


#10

This is a working system. I had no trouble installing this one, but it was one of the first releases:

[root@whitcomb ~]# netstat -n|grep 15007
tcp 0 0 192.160.158.181:15007 192.160.158.181:33016 ESTABLISHED
tcp 0 0 192.160.158.181:33016 192.160.158.181:15007 ESTABLISHED
[root@whitcomb ~]#


#11

I ran netstat right after it failed:

/etc/init.d/pbs start
Starting PBS
PBS Home directory /var/spool/pbs needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.


*** Error initializing the PBS dataservice
Error details:
Creating the PBS Data Service…
Starting PBS Data Service…
waiting for server to start…2017-01-11 14:56:25 PSTLOG: could not bind IPv4 socket: Address already in use
2017-01-11 14:56:25 PSTHINT: Is another postmaster already running on port 15007? If not, wait a few seconds and retry.
2017-01-11 14:56:25 PSTLOG: could not bind IPv6 socket: Address already in use
2017-01-11 14:56:25 PSTHINT: Is another postmaster already running on port 15007? If not, wait a few seconds and retry.
2017-01-11 14:56:25 PSTWARNING: could not create listen socket for "*"
2017-01-11 14:56:25 PSTFATAL: could not create any TCP/IP sockets
stopped waiting
pg_ctl: could not start server
Examine the log output.
Failed to start PBS Data Service
Error starting PBS Data Service
[root@triton log]# !444
netstat -n | grep 15007
[root@triton log]#


#12

The only information I could find about this message is in section 16.3.1 of the PostgreSQL documentation here:
https://www.postgresql.org/docs/7.4/static/postmaster-start.html

Does your system have multiple network interfaces? Are you certain that the hostname and DNS are configured properly? I’ve seen PostgreSQL fail to start when the hostname resolves to something unexpected. You might try taking DNS out of the equation by disabling it in /etc/nsswitch.conf and adding the hostname to /etc/hosts if it’s not already there.


#13

The host does have two NICs, but the second one isn’t enabled. I will read the article. Would it hurt to start the database as shown in the examples?


#14

Is this the first time you are installing PBS on this machine? If not, make sure there are no postgres processes still running from a previous installation. If the system's IP address changed while PBS was running, issuing the stop command to the services may not stop the postgres processes. You may have to find and kill them manually.
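
A sketch of how such leftover processes could be located, assuming the datastore lives in /var/spool/pbs/datastore. The helper postmaster_pids is hypothetical; it reads ps -ef output on stdin and prints the PID of any postgres parent process started against that data directory.

```shell
# Sketch: read "ps -ef" output on stdin and print the PID (2nd column) of
# every postgres parent process started with "-D <datadir>".
# postmaster_pids is a hypothetical helper name.
postmaster_pids() {
    datadir=$1
    awk -v d="$datadir" 'index($0, "postgres -D " d) { print $2 }'
}
```

Usage would be ps -ef | postmaster_pids /var/spool/pbs/datastore. A plain kill on the printed PID sends SIGTERM, which PostgreSQL treats as a request for an orderly shutdown; kill -9 is best avoided on a database process.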

-Ashwath


#15

When PBS Pro starts the database, it does so using /opt/pbs/sbin/pbs_dataservice which also employs /opt/pbs/libexec/pbs_pgsql_env.sh to set certain environment variables. The end result is a command like this:

su - postgres -c "/bin/sh -c '/bin/pg_ctl -D /var/spool/pbs/datastore -o \"-p 15007\" -w status'"

You should double check to make sure /var/spool/pbs/datastore looks like this:

# ls -ld /var/spool/pbs/datastore
drwx------. 15 postgres root 4096 Jan 10 16:42 /var/spool/pbs/datastore

When PostgreSQL is running on my system, I see the following processes:

# ps -ef | grep [p]ostgres
postgres 26871 1 0 Jan10 ? 00:00:02 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
postgres 26879 26871 0 Jan10 ? 00:00:00 postgres: logger process
postgres 26881 26871 0 Jan10 ? 00:00:00 postgres: checkpointer process
postgres 26882 26871 0 Jan10 ? 00:00:02 postgres: writer process
postgres 26883 26871 0 Jan10 ? 00:00:01 postgres: wal writer process
postgres 26884 26871 0 Jan10 ? 00:00:03 postgres: autovacuum launcher process
postgres 26885 26871 0 Jan10 ? 00:00:05 postgres: stats collector process
postgres 26889 26871 0 Jan10 ? 00:00:00 postgres: postgres pbs_datastore 192.168.111.207(41930) idle

Ultimately, something must be consuming port 15007. The most likely culprit is another instance of PostgreSQL.


#16

Hi, mkaro

I get the same error when starting PBS. I checked as you suggested, and the end of /opt/pbs/libexec/pbs_pgsql_env.sh does not look like this:

su - postgres -c "/bin/sh -c '/bin/pg_ctl -D /var/spool/pbs/datastore -o \"-p 15007\" -w status'"

[root@pbs-master pbs]# ps -ef | grep [p]ostgres
postgres 104640 1 0 09:22 ? 00:00:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
postgres 104641 104640 0 09:22 ? 00:00:00 postgres: logger process
postgres 104643 104640 0 09:22 ? 00:00:00 postgres: checkpointer process
postgres 104644 104640 0 09:22 ? 00:00:00 postgres: writer process
postgres 104645 104640 0 09:22 ? 00:00:00 postgres: wal writer process
postgres 104646 104640 0 09:22 ? 00:00:00 postgres: autovacuum launcher process
postgres 104647 104640 0 09:22 ? 00:00:00 postgres: stats collector process

So I added it, then tried to start PBS again and got the error below (I installed PBS in a fresh environment):

[root@pbs-master pbs]# /etc/init.d/pbs start
Starting PBS
PBS Home directory /var/spool/pbs needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.


pg_ctl: server is running (PID: 104640)
/usr/bin/postgres "-D" "/var/spool/pbs/datastore" "-p" "15007"
/opt/pbs/pgsql/pg_upgrade not found
Failed to upgrade PBS Datastore


#17

It appears the “postgres pbs_datastore” process is missing. The most common reason I’m aware of is permission problems within the PBS_HOME/datastore directory itself. There may be files or directories with incorrect ownership and/or permissions. Before you try anything I suggest, please back up your PBS_HOME directory. Shut down PBS Pro and then archive PBS_HOME so you can restore it later if necessary. A couple of things to try…

With PBS stopped, run the following command as root:
chown -R postgres:root /var/spool/pbs/datastore
Then try starting PBS and see if things work as expected.

If that fails, stop PBS and remove the datastore directory completely. PBS will attempt to recreate it when restarted.

The worst case scenario would be to stop PBS and remove /var/spool/pbs completely. Then restart PBS and let the scripts recreate PBS_HOME in its entirety. If that doesn’t work, I suspect you have some filesystem issue that is preventing PBS from working properly.
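
Before and after the chown step, the top-level datastore directory can be compared against the ownership and mode shown earlier in this thread (drwx------, owner postgres). The sketch below does only that; check_datastore is a hypothetical helper, it relies on GNU stat, and it does not inspect files inside the directory.

```shell
# Sketch: report whether a directory is owned by the expected user
# (default: postgres) with mode 700. check_datastore is a hypothetical
# helper and uses GNU stat's -c option.
check_datastore() {
    dir=$1
    want_owner=${2:-postgres}
    owner=$(stat -c '%U' "$dir") || return 2
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$owner" = "$want_owner" ] && [ "$mode" = "700" ]; then
        echo "ok: $dir is $owner $mode"
    else
        echo "mismatch: $dir is $owner $mode, expected $want_owner 700"
        return 1
    fi
}
```

For example: check_datastore /var/spool/pbs/datastore. A mismatch here would point at the permission problem described above; a match means the cause is more likely elsewhere.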


#18

Hi, mkaro

I have tried the measures you suggested, but it is still not working. When I run the command “pbs_probe”, the output looks like this:

[root@centos-master linux]# pbs_probe

====== System Information =======

sysname=Linux
nodename=centos-master.novalocal
release=3.10.0-327.22.2.el7.x86_64
version=#1 SMP Thu Jun 23 17:05:11 UTC 2016
machine=x86_64

====== Problems in PBS HOME Hierarchy =======

Permission/Ownership Problems:

/var/spool/pbs/spool
(drwxr-xr-t , root , root) needs to be (drwxrwxrwt , root, group id < 10)

/var/spool/pbs/spool
(drwxr-xr-t , root , root) needs to be (drwxrwxrwt , root, group id < 10)

/var/spool/pbs/undelivered
(drwxr-xr-t , root , root) needs to be (drwxrwxrwt , root, group id < 10)
Real Path Problems:
/var/spool/pbs/server_priv/tracking, No such file or directory

/var/spool/pbs/server_priv/prov_tracking, No such file or directory

====== Problems in PBS EXEC Hierarchy =======

Permission/Ownership Problems:

/opt/pbs/bin/pbs_topologyinfo
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)

/opt/pbs/sbin/pbs_mom
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)

/opt/pbs/sbin/pbs_sched
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)

/opt/pbs/sbin/pbs_server
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)

/opt/pbs/bin/nqs2pbs
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)
Real Path Problems:
/opt/pbs/bin/pbs_ds_password, No such file or directory

/opt/pbs/bin/pbs_dataservice, No such file or directory

/opt/pbs/sbin/pbs-report, No such file or directory

/opt/pbs/etc/pbs_habitat, No such file or directory

/opt/pbs/etc/pbs_init.d, No such file or directory

/opt/pbs/etc/pbs_postinstall, No such file or directory

/opt/pbs/etc/install_db, No such file or directory

/opt/pbs/etc/pbs_topologyinfo, No such file or directory

/opt/pbs/lib/pbs_sched.a, No such file or directory

/opt/pbs/lib/pm, No such file or directory

/opt/pbs/man, No such file or directory

/opt/pbs/tcltk, No such file or directory

/opt/pbs/python, No such file or directory

/opt/pbs/pgsql, No such file or directory

Do you have any ideas? Thank you again.


#19

Your PBS_HOME should look something like this…

# ls -l /var/spool/pbs
total 56
drwxr-xr-x.  2 root     root 4096 Jul 25 16:52 aux
drwx------.  2 root     root 4096 Jun 29 17:05 checkpoint
drwxr-xr-x.  2 root     root 4096 Jul 25 16:30 comm_logs
drwx------. 15 postgres root 4096 Jul 25 16:33 datastore
drwxr-xr-x.  2 root     root 4096 Jul 25 16:26 mom_logs
drwxr-x--x.  5 root     root 4096 Jun 29 17:06 mom_priv
-rw-r--r--.  1 root     root   19 Jun 29 17:05 pbs_environment
-rw-r--r--.  1 root     root    7 Jun 29 17:06 pbs_version
drwxr-xr-x.  2 root     root 4096 Jul 31 00:02 sched_logs
drwxr-x---.  2 root     root 4096 Jul 19 14:54 sched_priv
drwxr-xr-x.  2 root     root 4096 Jul 31 00:00 server_logs
drwxr-x---.  7 root     root 4096 Jul 25 16:33 server_priv
drwxrwxrwt.  2 root     root 4096 Jul 25 16:52 spool
drwxrwxrwt.  2 root     root 4096 Jun 29 17:05 undelivered

Could you please check to see what yours looks like? The pbs_probe command is complaining about the spool and undelivered directories.
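
The mode pbs_probe asks for on those two directories, drwxrwxrwt, is octal 1777 (world-writable with the sticky bit), while the reported drwxr-xr-t is 1755. A minimal sketch for checking and correcting them follows; fix_sticky_dirs is a hypothetical helper, not a PBS tool, and it relies on GNU stat.

```shell
# Sketch: make sure spool and undelivered under a PBS_HOME are mode 1777.
# fix_sticky_dirs is a hypothetical helper and uses GNU stat's -c option.
fix_sticky_dirs() {
    pbs_home=$1
    for d in spool undelivered; do
        path="$pbs_home/$d"
        [ -d "$path" ] || continue           # skip directories that do not exist
        mode=$(stat -c '%a' "$path")
        if [ "$mode" != "1777" ]; then
            echo "$path: mode $mode, setting 1777"
            chmod 1777 "$path"
        fi
    done
}
```

Run as root with PBS stopped, for example: fix_sticky_dirs /var/spool/pbs.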


#20

The PBS_HOME seems right; it looks like this:

[root@centos7-801 linux]# ls -l /var/spool/pbs
total 52
drwxr-xr-x. 2 root root 4096 Aug 1 03:33 aux
drwx------. 2 root root 4096 Aug 1 03:33 checkpoint
drwxr-xr-x. 2 root root 4096 Aug 1 03:33 comm_logs
drwx------. 15 postgres root 4096 Aug 1 03:42 datastore
drwxr-xr-x. 2 root root 4096 Aug 1 03:33 mom_logs
drwxr-x--x. 4 root root 4096 Aug 1 03:33 mom_priv
-rw-r--r--. 1 root root 19 Aug 1 03:33 pbs_environment
drwxr-xr-x. 2 root root 4096 Aug 1 03:33 sched_logs
drwxr-x---. 2 root root 4096 Aug 1 03:33 sched_priv
drwxr-xr-x. 2 root root 4096 Aug 1 03:42 server_logs
drwxr-x---. 6 root root 4096 Aug 1 03:42 server_priv
drwxrwxrwt. 2 root root 4096 Aug 1 03:44 spool
drwxrwxrwt. 2 root root 4096 Aug 1 03:33 undelivered