Mom states assumed to be last-known-state when pbs_comm fails?


#1

This scenario is an edge condition and not that likely once pbspro is working, but does it reveal a more fundamental logical flaw? If pbs_comm is used (i.e., a “stock” build/install where pbs_comm is the expected intra-pbs communications service) and pbs_comm is then killed, blocked, or otherwise made unavailable, the server neither reports the loss of comms with the moms, nor changes mom states, nor tracks actual mom state changes.

Expected behavior: some form of state change for moms with which expected comms fail – down-comms-lost, down-unknown, signal-stolen-by-space-aliens, anything but “no change” and certainly not state=free.

Actual behavior: jobs attempt to place and fail (sister node failed to delete, etc.) when in truth no placement occurred at all, no mom logs report any attempt to place, no comms problems are reported in the server-side logs, no errors are reported via qstat/pbsnodes/qmgr/etc., and server-reported mom states do not change. Server-side comms appear to be handed off to pbs_comm “blind”, without follow-up checks to validate the comms attempt.

To reproduce:

Start the 14.1.0 server/sched/comm and at least 1 configured mom so that pbsnodes shows the mom state=free. Kill pbs_comm. The mom stays state=free indefinitely.

Simulate a restart-persistent issue with the comm by editing pbs.conf (or the init/service script) to exclude comm startup, or firewall to drop the pbs_comm outbound port traffic, etc., then restart the server/sched. The server restarts without complaint and continues to report the mom as state=free.

Simulate mom fallout by killing the mom, dropping network connectivity, whatever. The server continues to report the mom as state=free. Compound this with the previous test – with the mom down, restart the server [such that pbs_comm does not start, or does not connect with the mom]. The server again restarts and continues reporting the mom, now effectively non-existent, as state=free.

Back out whatever measures were taken to prevent pbs_comm functionality and [re]start pbs_comm. The server begins reporting the actual mom state and All Is Well ™.
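
For anyone repeating the steps above, pulling the state out of the pbsnodes output by script makes it easier to compare what the server reports before and after pbs_comm is killed. This is a minimal sketch, assuming the "node name, then attribute = value lines" layout that pbsnodes -av prints; the function name node_state is made up for illustration and is not part of PBS.

```shell
#!/bin/sh
# node_state NODE
# Hypothetical helper: reads `pbsnodes -av`-style text on stdin and prints
# the value of the "state = ..." line that belongs to NODE.
node_state() {
    awk -v node="$1" '
        # A line with one field and no "=" is taken to be a node name.
        NF == 1 && $0 !~ /=/ { current = $1; next }
        # Print the state of the requested node, then stop reading.
        current == node && $1 == "state" && $2 == "=" { print $3; exit }
    '
}

# Possible use on a test system (assumed commands, not run here):
#   pbsnodes -av | node_state blrlap465    # before
#   pkill -x pbs_comm
#   pbsnodes -av | node_state blrlap465    # after
```

If the second call still prints free well after pbs_comm has gone away, that matches the behavior described in this post.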


#2

To clarify: is the apparent “blind”, unvalidated handoff of messages from server to comm intentional? If I took the time to roll a patch to change this behavior, would I be stepping on the toes of a previous assumption about the server<–>comm<–>mom communications path?


#3

No, this should be either a bug or a misconfiguration. The communication over pbs_comm is designed so that if the comm is killed, the nodes are immediately marked as down. I tested it just now with 14.1 and it worked perfectly fine when I killed the comm. Logs below:

obs_work/openSUSE_13.2 # ps -ef | grep pbs
root 3911 1 0 09:33 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 3928 1 0 09:33 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 4183 1 0 09:33 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 4228 1 0 09:33 ? 00:00:00 /usr/lib/postgresql94/bin/postgres -D /var/spool/pbs/datastore -p 15007
postgres 4236 4228 0 09:33 ? 00:00:00 postgres: postgres pbs_datastore 10.75.20.98(58610) idle
root 4237 1 0 09:33 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
root 4266 1 0 09:34 ? 00:00:00 /opt/pbs/sbin/pbs_mom

obs_work/openSUSE_13.2 # pbsnodes -av
blrlap465
Mom = blrlap465.india.altair.com
Port = 15002
pbs_version = 14.1.0
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = blrlap465
resources_available.mem = 16316436kb
resources_available.ncpus = 8
resources_available.vnode = blrlap465
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = force_exclhost
license = l

obs_work/openSUSE_13.2 # ps -ef | grep pbs_mom
root 4266 1 0 09:34 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 4277 3800 0 09:34 pts/0 00:00:00 grep --color=auto pbs_mom

blrlap465:/work/altair/pbs_code/obs_work/openSUSE_13.2 # kill -9 4266

blrlap465:/work/altair/pbs_code/obs_work/openSUSE_13.2 # pbsnodes -av
blrlap465
Mom = blrlap465.india.altair.com
Port = 15002
pbs_version = 14.1.0
ntype = PBS
state = down
pcpus = 8
resources_available.arch = linux
resources_available.host = blrlap465
resources_available.mem = 16316436kb
resources_available.ncpus = 8
resources_available.vnode = blrlap465
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
comment = node down: communication closed
resv_enable = True
sharing = force_exclhost
license = l
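
The manual check in the transcript above (kill a daemon, then look at pbsnodes) could also be scripted as a bounded wait, so that "never changes" and "changes after a delay" are easy to tell apart. This is only a sketch: the names wait_for_state and first_state are hypothetical, the one-second polling interval is an arbitrary choice, and first_state assumes a single-mom test system like this one.

```shell
#!/bin/sh
# first_state: print the state of the first node listed by `pbsnodes -av`.
# Assumption: only one mom is configured, as in the transcript.
first_state() {
    pbsnodes -av | awk '$1 == "state" && $2 == "=" { print $3; exit }'
}

# wait_for_state WANT TRIES CMD...
# Run CMD up to TRIES times, one second apart. Return 0 as soon as its
# output equals WANT, or 1 if that never happens.
wait_for_state() {
    want=$1
    tries=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        [ "$("$@")" = "$want" ] && return 0
        i=$((i + 1))
        # Pause only if another attempt is still to come.
        [ "$i" -lt "$tries" ] && sleep 1
    done
    return 1
}

# Possible use after killing pbs_comm or pbs_mom (not run here):
#   wait_for_state down 30 first_state && echo "marked down" || echo "still not down"
```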


#4

Can’t reproduce any more. I added some test lines around the comm startup, rebuilt the rpms, and replaced those that were previously installed, and didn’t see the problem. I then put the originally built rpms back and still don’t see it. Must be a snark…