Cannot run job on the execution host


#1

Dear all,
I have a head node running the PBS server and two other nodes as execution hosts, but only one node can run jobs from the default queue (workq). The configuration of these nodes is:
headnode
Mom = headnode.supernodexp.hpc
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = headnode
resources_available.mem = 131587468kb
resources_available.ncpus = 48
resources_available.vnode = headnode
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
license = l

node12
Mom = supernodexp-computenode-012
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = supernodexp-computenode-012
resources_available.mem = 263698672kb
resources_available.ncpus = 48
resources_available.vnode = node12
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
license = l

node11
Mom = supernodexp-computenode-011
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = supernodexp-computenode-011
resources_available.mem = 263698716kb
resources_available.ncpus = 48
resources_available.vnode = node11
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
license = l

node12 cannot run the jobs I submit. qstat -f on such a job shows: comment = job held, too many failed attempts to run

The mom_logs on node12 show: Failed to get fullhostname from supernodexp-computenode-012 for job 204.headnode.supernodexp.hpc

Please help me with this problem,
Thank you so much!


#2

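This looks like a hostname resolution issue. Please verify that the output of hostname -f on the compute node matches the name the server uses for that node (the Mom value in the node listing), and that the name resolves on the compute node itself.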
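If it helps, here is a rough Python sketch of that comparison, to be run on the compute node. It is not part of PBS; check_mom_name and its parameters are names made up for illustration. It assumes socket.getfqdn() returns roughly what hostname -f prints, which is usually but not always the case.

```python
import socket


def check_mom_name(mom_name, local_fqdn=None, resolve=socket.gethostbyname_ex):
    """Return a list of problems found for mom_name; empty means none found.

    mom_name   -- the name the server uses for this node (its Mom attribute)
    local_fqdn -- this host's fully qualified name; defaults to socket.getfqdn()
    resolve    -- lookup function, injectable so the check can be tested offline
    """
    if local_fqdn is None:
        # Assumption: this is usually close to what `hostname -f` prints.
        local_fqdn = socket.getfqdn()

    try:
        canonical, aliases, _addresses = resolve(mom_name)
    except OSError as exc:  # gaierror and herror are both OSError subclasses
        return [f"{mom_name} does not resolve: {exc}"]

    # Accept a match against the registered name, its canonical name, or an alias.
    known_names = {n.lower() for n in (mom_name, canonical, *aliases)}
    if local_fqdn.lower() not in known_names:
        return [f"local FQDN {local_fqdn} is not among {sorted(known_names)}"]
    return []
```

On node12 you would call check_mom_name("supernodexp-computenode-012"). An empty list means the registered name resolves and matches the local name. Anything else points to /etc/hosts or DNS on that node.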