Single-node torque hostname woes


I am attempting to set up Torque to run on a single node with 20 logical cores, configured as np=16. Both the server and the single MOM node are meant to use the short hostname (hostname -s), dev1-linux. The setup is mostly working: a queue exists and I can submit jobs to it. But qnodes shows the node with state=down, and the jobs do not run.

I am running Torque 4.2.10 on CentOS 6.8. Having tried this in the past, I suspect a communication problem between pbs_server and pbs_mom, with some components seeing the host under its full name (hostname -f) and others under the short name. The log files don’t reveal any obvious errors, except that the server_logs file shows the server as ‘’ even though I have used ‘dev1-linux’ in every place I can think of where a host name is specified.
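For anyone chasing the same short-name versus full-name suspicion, here is a rough sketch of how the names could be compared side by side. It assumes the Torque home directory is /var/lib/torque (the location used for mom.layout later in this thread) and that the MOM config names the server on a line containing "pbsserver". The compare_names helper is just an illustrative name, not part of Torque.

```shell
#!/bin/sh
# Sketch only: print the host names the server and MOM may be using.
# Assumption: Torque home is /var/lib/torque unless TORQUE_HOME is set.
TORQUE_HOME="${TORQUE_HOME:-/var/lib/torque}"

# Hypothetical helper: print "match" if the two names are identical,
# otherwise "MISMATCH".
compare_names() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "MISMATCH"
    fi
}

short_name="$(hostname -s 2>/dev/null || true)"
full_name="$(hostname -f 2>/dev/null || true)"
echo "hostname -s: $short_name"
echo "hostname -f: $full_name"

# Compare the configured server name against both forms of the host name.
if [ -r "$TORQUE_HOME/server_name" ]; then
    configured="$(cat "$TORQUE_HOME/server_name")"
    echo "server_name: $configured"
    echo "  vs short name: $(compare_names "$configured" "$short_name")"
    echo "  vs full name:  $(compare_names "$configured" "$full_name")"
fi

# Show any line in the MOM config that names the server.
if [ -r "$TORQUE_HOME/mom_priv/config" ]; then
    grep -i 'pbsserver' "$TORQUE_HOME/mom_priv/config" || true
fi
```

A MISMATCH against the short name would at least show where the full name is creeping in, though it would not by itself explain state=down.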

Any suggestions about the state=down problem in general, or about other places (besides server_name and mom_priv/config) that control the host name?



Actually, I solved my problem. It seems that with Torque 4.2.10, any setup that uses the same host name for pbs_server and pbs_mom is treated as a NUMA system. The key to getting the MOM node to state=free was to edit /var/lib/torque/mom_priv/mom.layout and enter the single line “nodes=0” in it. After restarting the pbs daemons (e.g., service pbs_server restart), qnodes shows the single 16-core compute node with state=free, and jobs now run.
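In shell terms, the fix amounts to something like the sketch below. The write_mom_layout function name is made up for illustration, and the backup step is an extra precaution rather than part of the fix. The restart commands are left as comments: "service pbs_server restart" is the one mentioned above, while the pbs_mom service name is an assumption about how the init scripts are named on a given install.

```shell
#!/bin/sh
# Sketch only: write a one-line mom.layout containing "nodes=0".
# Hypothetical helper; the argument is the Torque home directory.
write_mom_layout() {
    torque_home="$1"
    layout="$torque_home/mom_priv/mom.layout"

    # Keep a copy of any existing layout file before overwriting it.
    if [ -f "$layout" ]; then
        cp -p "$layout" "$layout.bak"
    fi

    printf 'nodes=0\n' > "$layout"
}

# Usage (as root), then restart the daemons so the change is picked up:
#   write_mom_layout /var/lib/torque
#   service pbs_mom restart      # assumed service name
#   service pbs_server restart
```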

I’ll have to investigate further to confirm that all CPUs and all 16 processors actually get used, but I think I’m on the right track now. Hopefully this will help other Torque users who want to set up a single-node queueing system.
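One possible way to start that check is to pull the state and np fields out of the qnodes output. This is a sketch that assumes the output lists attributes as "name = value" lines, one per line; node_field is a hypothetical helper name. The qsub line in the comments is only one way to request all 16 slots for a test job.

```shell
#!/bin/sh
# Sketch only: read qnodes/pbsnodes output on stdin and print the value
# of the first "NAME = value" line (whitespace around "=" is optional).
node_field() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" | head -n 1
}

# Usage on the node itself:
#   qnodes | node_field state    # hoping for "free"
#   qnodes | node_field np       # hoping for "16"
#
# A test job asking for all 16 slots on the one node could be submitted as:
#   echo "sleep 60" | qsub -l nodes=1:ppn=16
# and the node's job list inspected with qnodes while it runs.
```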

