Single-node torque hostname woes


#1

I am attempting to set up Torque to run on a single node with 20 logical cores, configured as np=16. Both the server and the single MOM node are meant to use the short hostname (hostname -s), which is dev1-linux. The setup is mostly working: a queue exists and I can submit jobs to it. But qnodes shows the node with state=down, and the jobs do not run.

I am running Torque 4.2.10 on CentOS 6.8. From having tried this in the past, I suspect a communication problem between pbs_server and pbs_mom, with some components seeing the full hostname (hostname -f) and others the short name. The log files don’t reveal any obvious errors, except that the server_logs file shows the server as ‘dev1-linux.attlocal.net’ when I have used ‘dev1-linux’ in every place I can think of where a host name is specified.
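A quick way to line up the names side by side is sketched below. It is only a diagnostic aid, not a fix. The Torque home of /var/lib/torque is an assumption (it matches the path mentioned later in this thread), server_priv/nodes is assumed to be where the np=16 entry lives, and show_file is a hypothetical helper invented for this sketch.

```shell
#!/bin/sh
# Sketch: compare the names the OS reports with the names Torque was given.
# TORQUE_HOME is an assumption; override it if Torque is installed elsewhere.
TORQUE_HOME="${TORQUE_HOME:-/var/lib/torque}"

# show_file (hypothetical helper): print a config file indented by four
# spaces, or note that it cannot be read.
show_file() {
    if [ -r "$1" ]; then
        printf '%s:\n' "$1"
        sed 's/^/    /' "$1"
    else
        printf '%s: not readable\n' "$1"
    fi
}

# Names as the operating system sees them.
printf 'short name (hostname -s): %s\n' "$(hostname -s 2>/dev/null)"
printf 'full name  (hostname -f): %s\n' "$(hostname -f 2>/dev/null)"

# Names as written into the Torque configuration.
show_file "$TORQUE_HOME/server_name"
show_file "$TORQUE_HOME/mom_priv/config"
show_file "$TORQUE_HOME/server_priv/nodes"
```

If the short name appears in the files but the full name shows up in server_logs, that at least confirms where the two views diverge.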

Any suggestions about the state=down problem in general, or about other places (besides server_name and mom_priv/config) where the host name is controlled?


#2

Greetings,

PBS Professional and Torque are independent products. If you would like to transition your cluster to PBS Professional, we would be glad to investigate any problems you experience. To get started, please visit http://www.pbspro.org/

Thanks,

Mike


#3

Actually, I solved my problem. It seems that with Torque 4.2.10, any setup that uses the same host name for pbs_server and pbs_mom is treated as a NUMA system. The key to getting the MOM node to state=free was to edit /var/lib/torque/mom_priv/mom.layout and enter the single line “nodes=0” in it. After restarting the pbs daemons (e.g., service pbs_server restart), “qnodes” shows the single 16-core compute node in state=free, and jobs now run.
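The steps above can be sketched as a small script. Treat it as an illustration under assumptions rather than a tested recipe: write_mom_layout is a hypothetical helper, the APPLY=1 guard is a convention of this sketch and not a Torque setting, and the pbs_mom service name is assumed to follow the same pattern as pbs_server.

```shell
#!/bin/sh
# Sketch of the fix described above. TORQUE_HOME defaults to the path used in
# this post; point it at a scratch directory to try the script safely first.
TORQUE_HOME="${TORQUE_HOME:-/var/lib/torque}"

# write_mom_layout (hypothetical helper): back up any existing mom.layout
# under the given Torque home, then write the single line "nodes=0" to it.
write_mom_layout() {
    layout="$1/mom_priv/mom.layout"
    if [ -f "$layout" ]; then
        cp -p "$layout" "$layout.bak" || return 1
    fi
    printf 'nodes=0\n' > "$layout"
}

# Only touch the real installation when explicitly asked to.
if [ "${APPLY:-0}" = "1" ]; then
    write_mom_layout "$TORQUE_HOME" || exit 1
    service pbs_mom restart      # service name assumed
    service pbs_server restart
    qnodes                       # check whether the node now reports state=free
fi
```

Run as root with APPLY=1 to apply it. Without that variable the script only defines the helper and changes nothing.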

I’ll have to investigate further to confirm that all CPUs and all 16 processors actually get used, but I think I’m on the right track now. Hopefully this will help other Torque users who want to set up a single-node queueing system.

Gary

#4

I’m glad your system is working, but your discovery is unlikely to be of benefit to other Torque users, since this is not a Torque-related site. :wink:

It does, however, sound like an excellent reason for you to switch to PBS Professional, where no such folderol would have been necessary to get this configuration working.


#5

good bye good bye good bye good bye!