MOM Config incomplete


#1

Hi All,

I'm a little stuck, or I have completely confused myself, with a brand new issue that is different from my previous ones!

So I have the PBS server running and all is well. The config (pbs.conf output) is:

PBS_SERVER=testserver.com
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1 (edit: 0 to 1)
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=56
PBS_SCP=/bin/scp

Now I have the execution node set up and running (as in the service is green, nothing more :unamused:), and here's my config in mom_priv:

$clienthost server head
$restrict_user_maxsysid 999
$clienthost serverhead.test.com

And that's it. Is this correct? I'm getting no response from the node: when I run pbsnodes -a, it tells me that the node is down.

netstat -n | grep 15003 gives me nothing, or sometimes a TIME_WAIT entry.

I know I'm missing something simple, but I just can't see the wood for the trees!

Cheers
Timbo


#2

PBS_SERVER should be set to the hostname of your server. You have it set to “server head”.

PBS_CORE_LIMIT should be set to the maximum allowed size for core files, or the string “unlimited” (without quotes). For example, PBS_CORE_LIMIT=536870912 sets the limit to 512MB.

$clienthost should be set to the hostname of the server. Again, you have specified “server head”.

$restrict_user_maxsysid defaults to 999. It doesn’t harm anything to specify the default, but it isn’t necessary.
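
The points above can be sanity-checked with a small script. This is only a sketch: it assumes pbs.conf is plain KEY=VALUE text and that PBS_CORE_LIMIT is given in bytes (as the 512MB example implies). The function names and the checks themselves are made up for illustration and are not part of PBS.

```python
def parse_pbs_conf(text):
    """Parse KEY=VALUE lines from pbs.conf-style text into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_pbs_conf(settings):
    """Return a list of warnings (hypothetical checks, not an official validator)."""
    warnings = []
    server = settings.get("PBS_SERVER", "")
    # A hostname is a single token; "server head" contains a space.
    if not server or any(ch.isspace() for ch in server):
        warnings.append("PBS_SERVER should be a single hostname")
    limit = settings.get("PBS_CORE_LIMIT")
    if limit is not None and limit != "unlimited" and not limit.isdigit():
        warnings.append("PBS_CORE_LIMIT should be an integer or 'unlimited'")
    return warnings


MB = 1024 * 1024
sample = "PBS_SERVER=testserver.com\nPBS_CORE_LIMIT=%d\n" % (512 * MB)
conf = parse_pbs_conf(sample)
print(conf["PBS_CORE_LIMIT"])   # 512 * 1024 * 1024 = 536870912
print(check_pbs_conf(conf))     # no warnings for this sample
```

Pointing parse_pbs_conf at the contents of the real file (usually /etc/pbs.conf) gives the same checks on a live node.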


#3

Sorry, the name “server head” really means testserver.com.

In the MoM config, is $clienthost the execution node's name or the server's name? I already have PBS_SERVER=testserver.com in pbs.conf.

I will change the core limit.

I'm just struggling to get the execution node to talk to the head server node (testserver.com).


#4

Are there messages in the MoM log file indicating what the problem may be?


#5

PBS_START_MOM=0 should be PBS_START_MOM=1 in the pbs.conf file.


#6

Sorry, that is set to “1”; I must have mistyped it.


#7

01/19/2017 09:05:35;0002;pbs_mom;Svr;Log;Log opened
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;pbs_version=14.0.1
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
01/19/2017 09:05:35;0100;pbs_mom;Svr;parse_config;file config
01/19/2017 09:05:35;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
01/19/2017 09:05:35;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 1024
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Max files too low - you may want to increase it.
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm testserver.COM
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.20.2 as authorized
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.20.1 as authorized
01/19/2017 09:05:35;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
01/19/2017 09:05:35;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:35;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
01/19/2017 09:05:35;0002;pbs_mom;n/a;initialize;pcpus=56, OS reports 56 cpu(s)
01/19/2017 09:05:35;0006;pbs_mom;Fil;pbs_mom;Version 14.0.1, started, initialization type = 0
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Mom pid = 18482 ready, using ports Server:15001 MOM:15002 RM:15003
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:37;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:37;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:37;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:39;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:39;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:39;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:41;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:43;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:43;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:43;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
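
The log above repeats the same cycle every two seconds: register, connect, "Restart sent", then the connection to pbs_comm goes down again. A minimal sketch for tallying that pattern, assuming the six semicolon-separated fields seen in the log above; the function names are made up for illustration and are not part of PBS.

```python
from collections import Counter


def parse_log_line(line):
    """Split one MoM log line into its six fields, or return None if malformed."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    keys = ("timestamp", "event_code", "daemon", "obj_type", "obj_name", "message")
    return dict(zip(keys, parts))


def summarize_comm_events(lines):
    """Count pbs_comm connect and disconnect messages in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        record = parse_log_line(line)
        if record is None:
            continue
        message = record["message"]
        if message.startswith("Connected to pbs_comm"):
            counts["connected"] += 1
        elif message.startswith("Connection to pbs_comm") and message.endswith("down"):
            counts["down"] += 1
    return counts


# Four lines copied from the log above.
sample = [
    "01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM",
    "01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down",
    "01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM",
    "01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down",
]
counts = summarize_comm_events(sample)
print(counts["connected"], counts["down"])
```

Equal "connected" and "down" counts that keep growing suggest the connection is being dropped right after it is made, rather than never being established.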


#8

Sorry, I should have posted this earlier, but it's a chore to get anything off an air-gapped network.

testserver.com is the head node. The log above is from testmom, which is set up on a separate node for execution only.


#9

Could you please check whether:

  1. there is an entry for testserver.com in the /etc/hosts file on the hosts.
  2. the firewall is blocking the communication.
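
Both checks in the list above can be scripted. A rough sketch, assuming Python 3 is available on the node; the helper names are hypothetical, and the port to test depends on the site (15001 is the server port shown in the MoM log, and the pbs_comm port would need to be substituted if it differs).

```python
import socket


def resolve(hostname):
    """Return the sorted IPv4 addresses a name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

For example, on the execution node, `print(resolve("testserver.com"), port_open("testserver.com", 15001))` shows what the name resolves to and whether the port answers. Running the same thing on the head node against the execution node's hostname checks the reverse direction.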


#10

  1. Yes, it's in the hosts file.
  2. The firewall is off.


#11

I am just wondering why the “com” in testserver.com is capitalized in the mom logs…


#12

It's been edited; it should be lowercase.