How to resolve 'state = state-unknown,down' on MoM node


#1

Hello PBS Pro Community,

I have a pretty simple network, with FQDNs pinging and resolving fine.

However, when I try adding a node to PBS Pro, it is marked state-unknown,down and looks like this:

node0075.x.y
Mom = node0075.x.y
Port = 15002
pbs_version = unavailable
ntype = PBS
state = state-unknown,down
resources_available.host = node0075
resources_available.vnode = node0075.x.y
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

Any thoughts on how to resolve this?

Thanks,
Siji


#2
  1. Check that the firewall is not blocking the ports (15002 / 15003).
  2. Check that the pbs_mom service is running.
  3. Add the node using its own hostname: ssh into the compute node, run hostname, and use that name in qmgr -c "create node HOSTNAME-OF-THE-NODE".
  4. Check that SELinux is disabled (and that the system was rebooted after disabling it).
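
Checks 2 to 4 above can be scripted on the compute node. This is only a rough sketch: the function names are made up for illustration, it assumes pgrep is installed, and it does not test the firewall (that needs a connection attempt from the other host).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the checklist above; run as root on the compute node.

# Check 2: succeed if a process named exactly pbs_mom is running.
mom_running() {
    pgrep -x pbs_mom >/dev/null
}

# Check 3: print the qmgr command to add the node.
# Uses the given name, or the output of `hostname` if none is given.
create_node_cmd() {
    local name=${1:-$(hostname)}
    printf 'qmgr -c "create node %s"\n' "$name"
}

# Check 4: print the SELinux mode, or "unknown" if getenforce is not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unknown"
    fi
}

# Print a short summary of all three checks.
report() {
    if mom_running; then
        echo "pbs_mom: running"
    else
        echo "pbs_mom: NOT running"
    fi
    echo "add node with: $(create_node_cmd)"
    echo "SELinux: $(selinux_mode)"
}
```

After sourcing the file, call report on the compute node, then run the printed qmgr command on the server.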

Thank you


#3

Thanks for the pointers, but I still see the same issue after simplifying my setup to use only the /etc/hosts file. There is no firewall or SELinux, and the MoM is up:

03/26/2018 18:17:09;0002;pbs_mom;Svr;pbs_mom;Mom pid = 33736 ready, using ports Server:15001 MOM:15002 RM:15003

/etc/pbs.conf on the mom node looks like this:
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_MOM_HOME=/var/spool/pbs
PBS_START_MOM=1
PBS_START_COMM=1
PBS_COMM_THREADS=4
PBS_SERVER=clmgmt-01
PBS_SCP=/usr/bin/scp
PBS_CORE_LIMIT=unlimited
~

And /etc/pbs.conf on the server node looks like this:
PBS_SERVER=clmgmt-01
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_LOCALLOG=1
PBS_SYSLOG=2
PBS_SYSLOGSEVR=7
~

Do you see something I might be missing? Or does PBS have any probing tools that might help identify the issue?
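
One quick way to compare the two files is to pull individual keys out of /etc/pbs.conf on each host. The helper below is a minimal sketch; pbs_conf_get is a made-up name, not a PBS tool, and it assumes the plain KEY=value layout shown above.

```shell
#!/usr/bin/env bash
# pbs_conf_get KEY [FILE]: print the value of KEY from a pbs.conf-style file.
# FILE defaults to /etc/pbs.conf. Prints nothing if the key is absent.
pbs_conf_get() {
    local key=$1 file=${2:-/etc/pbs.conf}
    # Keep only lines that start with KEY= and strip that prefix.
    sed -n "s/^${key}=//p" "$file"
}
```

Running pbs_conf_get PBS_SERVER on both the server and the MoM node should print the same name (clmgmt-01 in the files above), and that name should resolve to the same address on both hosts.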


#4

Please change the /etc/pbs.conf on the compute node to below and restart the pbs mom services


#5

Thanks for the tip - made the modification but it didn’t resolve the issue.

Here’s what my mom config looks like:
[root@node0075 ~]# cat /var/spool/pbs/mom_priv/config
$clienthost clmgmt-01

There isn’t any server_priv/config…should there be one?


#6

The server hostname only needs to be present in /etc/pbs.conf and mom_priv/config, not in any other location.

Sorry to see that it is not working for you; the deployment is quite straightforward.

Could you please check (via telnet) that ports 15001 to 15009 and 17001 are open between the head node and the compute node?
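
If telnet is not installed, a similar check can be sketched with bash's built-in /dev/tcp redirection. The function names here are made up, and the sketch assumes bash and the coreutils timeout command are available. Note that "closed" is also reported when nothing is listening on a port, not only when a firewall blocks it.

```shell
#!/usr/bin/env bash
# port_open HOST PORT: succeed if a TCP connection opens within 2 seconds.
port_open() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_pbs_ports HOST: report each port from the post above as open or closed.
check_pbs_ports() {
    local host=$1 port
    for port in $(seq 15001 15009) 17001; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

Run check_pbs_ports node0075 from the head node and check_pbs_ports clmgmt-01 from the compute node, so both directions are covered.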

Please share the output of the commands below (run them on the server node and the compute node separately):

  1. cat /etc/hosts
  2. ifconfig
  3. pbs_hostn -v < server hostname >
  4. pbs_hostn -v < compute node hostname >
  5. ping server-hostname
  6. ping computenode-hostname
  7. netstat -tunap | grep pbs
  8. hostname
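
To gather all of the above in one go, something like the following could be run on each host. It is only a sketch: run_step and collect_diag are made-up names, and it assumes pbs_hostn is on the PATH (otherwise that step is simply reported as having failed).

```shell
#!/usr/bin/env bash
# run_step CMD [ARGS...]: print a header line, then the command's output.
# A failing or missing command is noted and does not stop the script.
run_step() {
    echo "=== $* ==="
    "$@" 2>&1 || echo "(exited non-zero: $1)"
}

# collect_diag SERVER NODE: run the eight commands listed above, in order.
collect_diag() {
    local server=$1 node=$2
    run_step cat /etc/hosts
    run_step ifconfig
    run_step pbs_hostn -v "$server"
    run_step pbs_hostn -v "$node"
    run_step ping -c 3 "$server"
    run_step ping -c 3 "$node"
    run_step sh -c 'netstat -tunap | grep pbs'
    run_step hostname
}
```

For example, collect_diag clmgmt-01 node0075 > diag.txt on each machine gives two files that can be pasted into the thread.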