PP-1040: moms cannot communicate with one another in a cloud configuration when cloud nodes resolve each other's hostnames to IP addresses not known to the PBS server/comm


#1

Hi Team,

This change relates to TPP communication between the router (PBS_COMM) and its leaves (PBS_SERVER/PBS_SCHED/PBS_MOM).

When PBS is used in a configuration where cloud nodes resolve each other's names to one set of IP addresses, but the local PBS server/comm host resolves the same names to a different set of IP addresses, the moms cannot communicate with one another for multinode jobs. When the server runs a job, it sends exec_host2 to the primary execution host to list all of the nodes in the job, and there the hostnames get resolved to the cloud addresses. When the primary execution host then tries to send messages to these addresses through the pbs_comm, delivery fails because only the VPN addresses are known to the comm.
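
To make the mismatch concrete, here is a small Python sketch of the failure. It is only an illustration, not PBS code: the hostnames, the addresses, and the `can_route` helper are all made up, and the real comm keeps its registrations internally rather than in a set like this.

```python
# Hypothetical data: every name and address below is invented for illustration.
# Addresses the leaves registered with the comm (VPN side only).
COMM_KNOWN = {"10.8.0.11", "10.8.0.12"}

# What the cloud nodes get when they resolve each other's names.
CLOUD_DNS = {"node1": "172.31.5.11", "node2": "172.31.5.12"}


def can_route(hostname, resolver, registered):
    """Return True if the address the sender resolves is registered with the comm."""
    return resolver[hostname] in registered


# The primary execution host resolves node2 to its cloud address,
# which the comm has never seen, so the message cannot be delivered.
print(can_route("node2", CLOUD_DNS, COMM_KNOWN))  # False

# If every address of each node were registered, the same lookup would succeed.
all_registered = COMM_KNOWN | set(CLOUD_DNS.values())
print(can_route("node2", CLOUD_DNS, all_registered))  # True
```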

Solution:

Earlier, PBS_LEAF_NAME accepted only one value. If a value was provided, PBS resolved that name, found the IP addresses associated with it, and reported those addresses to the comm. The problem was that if the name was not mapped to all of the host's interfaces (for example, in /etc/hosts or DNS), the unmapped interfaces/IPs were not registered.

Now multiple names (comma-separated) can be specified in the PBS_LEAF_NAME variable.
If this variable is not set, all of the machine's IP addresses are detected and registered with the pbs_comm.
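
As a rough illustration of the comma-separated form, the Python sketch below splits a PBS_LEAF_NAME-style value and collects the addresses each name resolves to. It is not the PBS implementation: the `leaf_addresses` function, its `resolve` parameter, and the example hostnames are hypothetical, and the "variable not set" case (detecting every local interface) is not covered here.

```python
import socket


def leaf_addresses(leaf_name, resolve=None):
    """Collect de-duplicated IP addresses for a comma-separated list of names.

    `resolve` is a hypothetical hook that maps one name to a list of
    addresses; by default the system resolver is used.
    """
    if resolve is None:
        def resolve(name):
            # Each getaddrinfo entry is a 5-tuple; the address is sockaddr[0].
            return [info[4][0] for info in socket.getaddrinfo(name, None)]

    addresses = []
    for name in leaf_name.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate stray commas or surrounding whitespace
        for addr in resolve(name):
            if addr not in addresses:
                addresses.append(addr)  # keep first-seen order, drop repeats
    return addresses


# Usage with a fake resolver, so the example does not depend on real DNS.
# The names and addresses are invented.
fake_hosts = {
    "node1-vpn": ["10.8.0.11"],
    "node1-cloud": ["172.31.5.11", "10.8.0.11"],
}
print(leaf_addresses("node1-vpn, node1-cloud", fake_hosts.__getitem__))
# ['10.8.0.11', '172.31.5.11']
```

With both names listed, both the VPN and the cloud address would be reported, which is the situation the multi-name PBS_LEAF_NAME is meant to allow.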

Interface Document


#2

The proposed change looks good to me @bremanandjk. Thanks!