Below are the details of how I worked on the failover setup.
In OpenStack I created four instances:
Instance1 - Primary Server (PS)
Instance2 - Secondary Server (SS)
Instance3 - Node1 (i.e. the execution host)
Instance4 - NFS Server
Case 1: When NFS is not being used
I configured the Primary Server, Secondary Server and Node1 as per the failover setup details in Section 9 of the PBS Admin Guide. PS, SS and Node1 are able to see each other.
First, I submitted some jobs on the PS and they executed successfully. Then I stopped PBS on the PS. Within a few seconds I could see that the SS had taken over control, but it was unable to execute the jobs. This suggests that we need NFS.
Case 2: When NFS is being used
-Created a PBS_HOME directory, which is now hard-mounted on PS, SS, Node1 and the NFS Server.
-Stopped PBS on PS, SS and Node1
-Changed the PBS_HOME variable in /etc/pbs.conf to point to the shared file system path
-Copied all the files from /var/spool/pbs to /PBS_HOME/
-The /etc/hosts configuration is also in sync across the hosts
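The pbs.conf change in the steps above can be sanity-checked with a small script. This is only a minimal sketch: it assumes the shared path is /PBS_HOME/pbs (inferred from the sched_priv path in the error output in this post) and that the failover hosts are named through the PBS_PRIMARY and PBS_SECONDARY entries. The sample file contents, the host names and the function names are illustrative, not copied from my machines.

```python
def parse_pbs_conf(text):
    """Parse KEY=VALUE lines from pbs.conf-style text, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_failover_conf(settings, shared_home):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    home = settings.get("PBS_HOME")
    if home != shared_home:
        problems.append(f"PBS_HOME is {home!r}, expected {shared_home!r}")
    for key in ("PBS_PRIMARY", "PBS_SECONDARY"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    return problems


# Illustrative contents only (assumed values, not taken from a real host).
SAMPLE = """\
# /etc/pbs.conf (sample)
PBS_HOME=/PBS_HOME/pbs
PBS_PRIMARY=primaryserver
PBS_SECONDARY=secondaryserver
"""

for problem in check_failover_conf(parse_pbs_conf(SAMPLE), "/PBS_HOME/pbs"):
    print(problem)
```

On a real host the text would come from reading /etc/pbs.conf on PS, SS and Node1 in turn, so that the three files can be compared.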
When I start PBS, it fails to start on PS, SS and Node1.
Below is the error message:
Mar 29 19:30:55 primaryserver.novalocal pbs_init.d: pbs_sched startup failed, exit 1 aborting.
Mar 29 19:30:55 primaryserver.novalocal systemd: pbs.service: control process exited, code=exited status=1
Mar 29 19:30:55 primaryserver.novalocal systemd: Failed to start Portable Batch System.
-- Subject: Unit pbs.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Unit pbs.service has failed.
-- The result is failed.
Mar 29 19:30:55 primaryserver.novalocal systemd: Unit pbs.service entered failed state.
Mar 29 19:30:55 primaryserver.novalocal systemd: pbs.service failed.
Mar 29 19:30:55 primaryserver.novalocal polkitd: Unregistered Authentication Agent for unix-process:17141:1425045 (system bus name :1.132, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnect)
While troubleshooting I also encountered the following messages:
/opt/pbs/sbin/pbs_comm ready (pid=12782), Proxy Name:secondaryserver:17001, Threads:4
pbs_sched: Permission denied (13) in chk_file_sec, Security violation "/PBS_HOME/pbs/sched_priv" resolves to "/PBS_HOME"
pbs_sched startup failed, exit 1 aborting.
I made sure that the UID is the same on PS, SS, Node1 and the NFS Server, and that the shared file system directory has root privileges.
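To double-check ownership and permissions along the path named in the chk_file_sec message, each component can be inspected in turn. This is a rough sketch under my own assumption that the check cares about root ownership and about group/other write bits; it is not taken from the PBS source and may be stricter or looser than the real check. The function name and the expected_uid parameter are made up for illustration.

```python
import os
import stat


def insecure_components(path, expected_uid=0):
    """Walk from path up to '/' and list components that look unsafe.

    Returns (component, reason) pairs. A component is flagged when it is not
    owned by expected_uid or when it is writable by group or other.
    """
    findings = []
    current = os.path.abspath(path)
    while True:
        st = os.stat(current)
        if st.st_uid != expected_uid:
            findings.append((current, f"owned by uid {st.st_uid}"))
        if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            findings.append((current, "writable by group or other"))
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return findings


# Path taken from the error message; skipped if it does not exist on this host.
target = "/PBS_HOME/pbs/sched_priv"
if os.path.exists(target):
    for component, reason in insecure_components(target):
        print(f"{component}: {reason}")
```

Running this on PS, SS and Node1 shows how each host sees the shared directory, which may differ from how it looks on the NFS Server itself.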
Any leads would be helpful.