I’ve always been confused by hostname handling, and today, tired of endless trial and error, I decided to dig into part of the source code.
I suppose it’s common to see a cluster running PBS where the master node has two NICs, one for the external network and one for intra-cluster communication. Let’s say the cluster is named “HPC”. It then seems reasonable to configure the hosts and related files as follows:
#hosts
127.0.0.1 localhost
184.108.40.206 HPC.your.domain.name HPC
10.0.0.1 node0.local node0 node-mgmt

#hosts.equiv
HPC
HPC.your.domain.name
node0
node-mgmt

#hostname
HPC.your.domain.name
#$(hostname) -> HPC
#$(hostname -f) -> HPC.your.domain.name
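To double-check which address each name in the hosts file above maps to, a small parser is enough. This is only a minimal sketch that works on the file contents as text. The function name `parse_hosts` is made up for illustration, and nothing here is part of PBS.

```python
# Contents of the hosts file from the post, pasted in as text.
HOSTS_TEXT = """\
127.0.0.1 localhost
184.108.40.206 HPC.your.domain.name HPC
10.0.0.1 node0.local node0 node-mgmt
"""


def parse_hosts(text):
    """Map every hostname and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first entry seen for a name, like a top-down lookup.
            table.setdefault(name.lower(), address)
    return table


hosts = parse_hosts(HOSTS_TEXT)
for name in ("HPC", "node0", "node-mgmt"):
    print(name, "->", hosts[name.lower()])
```

With this file, HPC maps to the external address, while node0 and node-mgmt both map to 10.0.0.1.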
Then I set pbs.conf on the master node like this:
#pbs.conf on master node
PBS_SERVER=HPC
PBS_LEAF_NAME=node0
On the execution node, pbs.conf looks like:
#pbs.conf on execution node
PBS_SERVER=node0
PBS_LEAF_NAME=node1
However, with this configuration some commands return PBS Internal Errors. After checking the source code, it seems that client-side commands, such as qrun XXX, connect to the server from the external address, 220.127.116.11, rather than from 127.0.0.1 or 10.0.0.1. But pbs_sched, which actually receives the request, only accepts connections from the hostname set in PBS_LEAF_NAME, so it drops the connection as unauthorized.
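My reading of that check can be modelled roughly as follows. This is only a sketch of the behaviour I think I am seeing, not the actual pbs_sched code. The function `is_authorized`, the `ALLOWED_NAMES` set and the `HOSTS` table are all hypothetical, and I am assuming the daemon compares the peer address with the addresses of the names it trusts.

```python
# Hypothetical name-to-address table. The external address is the one
# quoted in the paragraph above; the others come from the hosts file.
HOSTS = {
    "localhost": "127.0.0.1",
    "hpc": "220.127.116.11",
    "node0": "10.0.0.1",
}

# Assumption: only the name given in PBS_LEAF_NAME is trusted.
ALLOWED_NAMES = {"node0"}


def is_authorized(peer_address, allowed_names, hosts):
    """Return True if peer_address belongs to one of the allowed names."""
    allowed_addresses = {
        hosts[name.lower()] for name in allowed_names if name.lower() in hosts
    }
    return peer_address in allowed_addresses


for peer in ("220.127.116.11", "10.0.0.1", "127.0.0.1"):
    verdict = "accepted" if is_authorized(peer, ALLOWED_NAMES, HOSTS) else "rejected"
    print(peer, verdict)
```

Under this model a connection arriving from the external address is rejected, and only one from 10.0.0.1 is accepted, which would match the unauthorized errors if the real check works in a similar way.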
Besides breaking the client-side commands, this also prevents the scheduler from actually scheduling any jobs, because pbs_server is considered unauthorized as well. I’m not sure about this part, since I haven’t dug into the pbs_server codebase. However, I do see the logs in the sched_logs folder report ‘pbs_sched: badconn, node0 on port 661 unauthorized host’ every few minutes.
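To see how often this happens and from which hosts, the messages can be counted with a short script. This is a sketch that relies only on the message text quoted above. It assumes nothing about the rest of the log line (timestamps and so on), and the function name `count_badconn` is made up.

```python
import re
from collections import Counter

# Matches only the fragment quoted in the post; anything before it on the
# line is ignored.
BADCONN = re.compile(
    r"badconn, (?P<host>\S+) on port (?P<port>\d+) unauthorized host"
)


def count_badconn(lines):
    """Count rejected connections per (host, port) pair."""
    counts = Counter()
    for line in lines:
        match = BADCONN.search(line)
        if match:
            counts[(match["host"], int(match["port"]))] += 1
    return counts


sample = [
    "pbs_sched: badconn, node0 on port 661 unauthorized host",
    "some unrelated line",
    "pbs_sched: badconn, node0 on port 661 unauthorized host",
]
print(count_badconn(sample))
```

To run it on a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines.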
I really don’t understand why PBS is designed this way. I’ve read the admin guide and didn’t find anything helpful. Are there any special considerations? Is my hostname configuration wrong? Or did I miss something? Any help would be appreciated.