How to configure a "Failover setup" using two PBS Pro masters?


#1

Hi All,

I am new to PBS Pro. I have successfully installed and configured a cluster with a single master node and a compute node, and a job ran on the compute node.

How do I configure a fail-over setup using two PBS Pro servers?

It would be helpful if you could share the steps to configure a fail-over setup of the PBS Pro server.

The fail-over environment details are as follows:

VM1 | Primary PBS Server   | eth0 (10.0.0.1/24)
VM2 | Secondary PBS Server | eth0 (10.0.0.2/24)
VM3 | NFS and NIS server   | eth0 (10.0.0.254/24)
VM4 | Execution Host1      | eth0 (10.0.0.3/24)
VM5 | Execution Host2      | eth0 (10.0.0.4/24)

All five VMs are on the same 10.0.0.0/24 network.

Thank You,
Periasamy


#2

Have you reviewed the PBS Professional Admin Guide v14.2 Section 8 for the details of using PBS Professional’s Fail-Over configuration?

OR, are you asking how to configure PBS Professional in a High Availability environment (e.g., RHCS)?


#3

I am asking how to configure PBS Professional in a High Availability environment (e.g., RHCS).


#4

You can set it up as you would any service that has LSB scripts (with correct dependencies on the filesystem and IP aliases), but the LSB scripts shipped with PBS Pro:

- are not robust (they do not always return the correct exit codes)
- get confused when PBS_HOME is not mounted
- call PBS Pro commands, which is unwise when an unresponsive server is the reason the script is being called

So you need better ones. I’ll post some shortly.
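
To make those three points concrete, here is a minimal sketch (not the scripts mentioned above) of a "status" check that copes with an unmounted PBS_HOME, returns LSB status codes, and calls no PBS Pro command. It assumes PBS_HOME defaults to /var/spool/pbs and that the server writes its PID to server_priv/server.lock; verify both for your installation. The function name `lsb_status` is hypothetical.

```python
import os

# Exit codes defined by the LSB for the "status" action.
LSB_RUNNING = 0
LSB_DEAD_PIDFILE = 1   # program is dead but its PID file exists
LSB_NOT_RUNNING = 3
LSB_UNKNOWN = 4


def lsb_status(pbs_home="/var/spool/pbs"):
    """Return an LSB status code without calling any PBS Pro command.

    Assumptions: the default pbs_home and the server_priv/server.lock
    PID file location are illustrative, not taken from the thread.
    """
    server_priv = os.path.join(pbs_home, "server_priv")
    # If PBS_HOME is not mounted on this node, the server cannot be
    # running here: report "not running" instead of failing oddly.
    if not os.path.isdir(server_priv):
        return LSB_NOT_RUNNING

    lock_file = os.path.join(server_priv, "server.lock")
    try:
        with open(lock_file) as fh:
            pid = int(fh.read().split()[0])
    except FileNotFoundError:
        return LSB_NOT_RUNNING
    except (OSError, ValueError, IndexError):
        return LSB_UNKNOWN
    if pid <= 0:
        return LSB_UNKNOWN

    try:
        os.kill(pid, 0)  # signal 0 only tests whether the process exists
    except ProcessLookupError:
        return LSB_DEAD_PIDFILE
    except PermissionError:
        return LSB_RUNNING  # process exists but belongs to another user
    return LSB_RUNNING
```

A wrapper script would typically end with `sys.exit(lsb_status())` so that the cluster manager sees the code as the exit status.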

The server scripts I’ll post do more than check that the processes exist: they use qstat -Bf to make sure the server is also responsive. That also means the monitoring timeout should be long (60 seconds) to avoid triggering spurious failovers on any quasi-hang (DNS lookups, server-side hooks, etc.), and the “start” timeout needs to be VERY long if you have lots of jobs in qstat -x output (recovering jobs can take 10-15 minutes at sites that have millions of jobs, especially if you don’t tune the datastore’s postgresql.conf).
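
A rough sketch of such a responsiveness probe, under stated assumptions: it runs qstat -Bf with a 60-second limit and maps the outcome to an LSB status code. Treating a timeout as "not running" is my choice for illustration, and the function name `server_responsive_status` is hypothetical; the `probe` parameter exists only so the check can be tried with another command.

```python
import subprocess

# Exit codes defined by the LSB for the "status" action.
LSB_RUNNING = 0
LSB_NOT_RUNNING = 3
LSB_UNKNOWN = 4


def server_responsive_status(probe=("qstat", "-Bf"), timeout=60.0):
    """Run the probe command and translate the result to an LSB code.

    Assumption: a probe that times out or exits non-zero is reported
    as "not running", so the cluster manager may start a failover.
    """
    try:
        result = subprocess.run(
            list(probe),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return LSB_NOT_RUNNING
    except OSError:
        # The probe could not be executed at all (e.g. qstat not in PATH).
        return LSB_UNKNOWN
    return LSB_RUNNING if result.returncode == 0 else LSB_NOT_RUNNING
```

The cluster manager's own monitor timeout should be at least as long as the `timeout` used here, otherwise the manager gives up on the monitor before the probe does.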

You’d also better split off pbs_comm and the scheduler – these are effectively separate services, and pbs_comm is active/active and needs to become a clone set.

I have a document that describes most of this.


#5

Oops – can’t post PDF files and scripts. Lemme see with the admins how to get around this.


#6

Hi @alexis.cousein – not sure whether it will help, but I increased the trust level associated with your login. (Discourse does all sorts of things to keep spam off the site; one of the things it does is restrict what “new users” are allowed to do.) Let me know if it works; if not, I’ll look for other settings to try. Thx!


#7

I also added a space in Confluence where files may be uploaded and shared…
https://pbspro.atlassian.net/wiki/spaces/PBSPro/pages/269680659/Attachments