Proper way to configure PBS on a multi-NIC system

#1

I’ve always been confused by all the hostname settings, and today, tired of endless trial and error, I decided to dig into part of the source code.
I suppose it’s common to see a cluster running PBS where the master node has two NICs, one for the external network and one for intra-cluster communication. Let’s say the cluster is named “HPC”. It then seems reasonable to configure the hosts and other files as follows:

#hosts
127.0.0.1   localhost
1.2.3.4     HPC.your.domain.name           HPC
10.0.0.1   node0.local node0   node-mgmt
#hosts.equiv
HPC
HPC.your.domain.name
node0
node-mgmt
#hostname
HPC.your.domain.name
#$(hostname) -> HPC
#$(hostname -f) -> HPC.your.domain.name

Then set pbs.conf on the master node like this:

#pbs.conf on master node
PBS_SERVER=HPC
PBS_LEAF_NAME=node0

On the execution node, pbs.conf looks like:

#pbs.conf on execution node
PBS_SERVER=node0
PBS_LEAF_NAME=node1

However, this configuration causes some commands to return PBS internal errors. After checking the source code, it seems that client-side commands, such as qrun XXX, connect to the server from the external address, 1.2.3.4, rather than from 127.0.0.1 or 10.0.0.1. But pbs_sched, which actually receives the request, only accepts the hostnames localhost and node0 (the latter taken from PBS_LEAF_NAME). So it drops the connection, treating it as unauthorized.
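To see which address each name maps to, a small lookup against a hosts-format file can help. This is only a sketch: the function name hosts_lookup is made up for illustration, and it reads the file directly, so it ignores DNS and whatever order nsswitch.conf defines on a real system.

```shell
# Print the first address that a hosts-format file maps to a given name.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
# hosts_lookup is a hypothetical helper, not part of PBS.
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }              # strip comments; fields are re-split
        {
            # field 1 is the address, the remaining fields are names/aliases
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}
```

With the hosts file shown above, hosts_lookup HPC would print 1.2.3.4 and hosts_lookup node0 would print 10.0.0.1, which matches the mismatch described: the name in PBS_SERVER and the name in PBS_LEAF_NAME sit on different interfaces.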

Besides breaking the client-side commands, this also prevents the scheduler from actually scheduling any jobs, because pbs_server is also considered unauthorized. I’m not sure about this, since I haven’t dug into the pbs_server codebase. However, I did see that the logs in the sched_logs folder contain ‘pbs_sched: badconn, node0 on port 661 unauthorized host’ every few minutes.
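To get a quick picture of which hosts the scheduler is rejecting, the log directory can be summarized roughly like this. It is a sketch that assumes each rejection line contains the message quoted above in exactly that form; the function name badconn_hosts is hypothetical, and the log directory is passed in as an argument rather than assumed.

```shell
# List hosts that pbs_sched rejected, with a count per host.
# Usage: badconn_hosts LOG_DIR
# badconn_hosts is a hypothetical helper, not part of PBS.
badconn_hosts() {
    # Assumed line shape (taken from the message quoted in this post):
    #   pbs_sched: badconn, node0 on port 661 unauthorized host
    cat "$1"/* 2>/dev/null |
        sed -n 's/.*badconn, \([^ ]*\) on port [0-9]* unauthorized host.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Each output line is a hostname followed by the number of rejections, so a host that shows up repeatedly is one the scheduler does not trust under that name.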

I really don’t understand why PBS is designed this way. I’ve read the admin guide and didn’t find anything helpful. Are there any special considerations? Is my hostname configuration wrong? Or did I miss something? Any help will be appreciated.

#2

Please try this

  • make sure /etc/hosts is well populated across all the nodes (DNS and reverse DNS are working fine)
  • qmgr -c "set server flatuid=true"
  1. edit $PBS_HOME/mom_priv/config (restart pbs_mom services after updating this file)
    $clienthost HPC
    $clienthost node0

  2. create a file called clientfile on the PBS Server/Scheduler host in the below location
    /var/spool/pbs/sched_priv/clientfile

    cat /var/spool/pbs/sched_priv/clientfile
    $clienthost node0

  3. Start the PBS Scheduler as below manually or by updating the startup scripts
    /opt/pbs/sbin/pbs_sched -c /var/spool/pbs/sched_priv/clientfile
    [ if the pbs_sched has already started, then kill the pbs_sched daemon, and start it manually as above ]
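
If the client file needs to be regenerated on several clusters, it could be written by a small helper along these lines. This is a minimal sketch: write_clientfile is a made-up name, and the only format it assumes is the "$clienthost NAME" line shown in step 2.

```shell
# Write one "$clienthost NAME" line per argument to a scheduler client file.
# Usage: write_clientfile FILE HOST...
# write_clientfile is a hypothetical helper, not part of PBS.
write_clientfile() {
    file=$1
    shift
    : > "$file"                          # create the file or truncate it
    for host in "$@"; do
        # single quotes keep "$clienthost" literal in the output
        printf '$clienthost %s\n' "$host" >> "$file"
    done
}
```

Running write_clientfile /var/spool/pbs/sched_priv/clientfile node0 would reproduce the file from step 2; the scheduler still has to be restarted with -c as in step 3 for it to take effect.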