MOM fails on startup


#1

I have a bit of a strange one here: this was working with no issues and now it has stopped.

● pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2016-12-01 08:55:24 GMT; 4min 20s ago
Docs: man:pbs(8)
Process: 81993 ExecStart=/opt/pbs/libexec/pbs_init.d start start start start start (code=exited, status=1/FAILURE)

Dec 01 08:55:24 chimera-b systemd[1]: Starting Portable Batch System…
Dec 01 08:55:24 chimera-b pbs_init.d[81993]: Starting PBS
Dec 01 08:55:24 chimera-b pbs_init.d[81993]: PBS comm already running.
Dec 01 08:55:24 chimera-b pbs_init.d[81993]: pbs_mom startup failed, exit 3 aborting.
Dec 01 08:55:24 chimera-b systemd[1]: pbs.service: control process exited, code=exited status=1
Dec 01 08:55:24 chimera-b systemd[1]: Failed to start Portable Batch System.
Dec 01 08:55:24 chimera-b systemd[1]: Unit pbs.service entered failed state.
Dec 01 08:55:24 chimera-b systemd[1]: pbs.service failed.

12/01/2016 08:55:24;0002;pbs_mom;Svr;Log;Log opened
12/01/2016 08:55:24;0002;pbs_mom;Svr;pbs_mom;pbs_version=14.0.1
12/01/2016 08:55:24;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
12/01/2016 08:55:24;0001;pbs_mom;Svr;pbs_mom;Permission denied (13) in chk_file_sec, Security violation
"/var/spool/pbs/mom_priv/jobs/" resolves to "/var/spool/pbs/mom_priv/jobs"
12/01/2016 08:55:24;0001;pbs_mom;Svr;pbs_mom;Operation not permitted (1) in chk_file_sec, Security violation
"/var/spool/pbs/spool/" resolves to "/var/spool/pbs/spool"
12/01/2016 08:55:24;0001;pbs_mom;Svr;pbs_mom;Success (0) in mom_main,
Warning: one of chk_file_sec failed: /var/spool/pbs/mom_priv/jobs/, /var/spool/pbs/spool/, /var/spool/pbs/pbs_enviro$

Looking at this error, it should be a simple permissions fix, but I have tried many ways and still get the same outcome.
I have used pbs_probe, which is great because it tells you what's missing and what needs to be changed, but it keeps telling me to change the same thing back to what I just changed it from!

Any ideas, please? This is driving me to drink!
Thanks


#2

This is usually a mismatch between the permissions PBS expects and the actual permissions on those directories or files (or the files are missing). Check each of the directories being reported.
For example, /var/spool/pbs/mom_priv/jobs/ usually has permissions 751.
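
One way to run that check is to print the mode and ownership of each reported path in one pass. This is a minimal sketch, assuming GNU `stat` (coreutils, as on RHEL 7); `show_perms` is a made-up helper name, and the two paths are the ones named in the mom log above.

```shell
# show_perms: print octal mode, owner:group and name for each path given.
# Hypothetical helper; assumes GNU stat (coreutils), as shipped on RHEL 7.
show_perms() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            stat -c '%a %U:%G %n' "$p"
        else
            echo "missing: $p"
        fi
    done
}

# Paths reported by chk_file_sec in the mom log.
show_perms /var/spool/pbs/mom_priv/jobs /var/spool/pbs/spool
```

The first column can then be compared against the expected mode (for example 751 for mom_priv/jobs, as mentioned above).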


#3

I have tried this with no success, I'm afraid. Do you have any other ideas?


#4

Please check the permissions of the parent directories as well. They could also cause this.
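
A quick way to look at every ancestor at once is to walk from the reported directory up to `/`. This is a sketch only, assuming GNU `stat`; `show_parents` is a hypothetical helper, and `namei -l PATH` from util-linux gives a similar listing where it is installed.

```shell
# show_parents: print mode and ownership of a path and each ancestor up to /.
# Hypothetical helper; assumes GNU stat (coreutils).
show_parents() {
    p=$1
    while :; do
        stat -c '%a %U:%G %n' "$p" 2>/dev/null || echo "missing: $p"
        # Stop at the filesystem root (or at "." for a relative path).
        case $p in
            /|.) break ;;
        esac
        p=$(dirname "$p")
    done
}

show_parents /var/spool/pbs/mom_priv/jobs
```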


#5

You are receiving a “permission denied” error, which means the kernel is preventing access to something. In this case, your permissions aren’t too liberal, they are too strict. As root, try to access each of the files mentioned in the log. Does your PBS_HOME happen to be a network filesystem, or is it local to the machine?

Not sure if it’s a cut-and-paste issue, but the very last character of the log is “$”. Is that the case in the actual log?

Maybe stating the obvious here, but make sure you are the root user when you start PBS Pro.


#6

This has been done, and the permissions have been checked against what they need to be. I was thinking about purging the system and reinstalling, but I have a feeling that this will just happen again. So if it's not permissions, what could it be?

I also get another odd message saying the following:

/var/spool/pbs/spool resolves to /var/spool/pbs/spool

Any ideas?


#7

Don’t worry about the “resolves to” message. It’s used for diagnostics in the case there are symbolic links involved. There are no symbolic links in this case because the directory resolves to itself.
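
To see the same comparison outside PBS, a path can be canonicalised by hand. This is an illustrative sketch, assuming GNU `readlink -f`; `resolves_to` is a made-up helper that only imitates the wording of the log message and is not how pbs_mom itself performs the check.

```shell
# resolves_to: print a path next to its canonical form, in the style of the
# pbs_mom message. Hypothetical helper; assumes GNU `readlink -f`.
resolves_to() {
    printf '"%s" resolves to "%s"\n' "$1" "$(readlink -f "$1")"
}

resolves_to /var/spool/pbs/spool/
```

If any component of the path were a symbolic link, the right-hand side would show the link target instead of the same path minus its trailing slash.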


#8

PBS_HOME is local to the machine, a RHEL7 server.

$ was a C&P issue.

Yes, I'm the root user, and that's not a silly statement; every question is welcome right now.


#9

Please run the following commands and provide their output.

cat /etc/pbs.conf
ls -laR /var/spool/pbs


#10

One additional “silly” question. Is this an SELinux system with multilevel security enabled?
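
For anyone unsure how to answer this on their own system, the SELinux userland tools report both the mode and the loaded policy. This is a minimal sketch, assuming `sestatus` or `getenforce` is installed; `selinux_summary` is a hypothetical wrapper name.

```shell
# selinux_summary: report SELinux mode and loaded policy, if the tools exist.
# Hypothetical helper; `sestatus` and `getenforce` ship with the SELinux
# userland tools and may be absent on systems without SELinux.
selinux_summary() {
    if command -v sestatus >/dev/null 2>&1; then
        sestatus
    elif command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "SELinux tools not found"
    fi
}

selinux_summary
```

On an MLS system, the "Loaded policy name" line of the `sestatus` output would be expected to read `mls` rather than `targeted`.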


#11

Yes, it is. I'm just getting the other information you asked for, so bear with me, as this system is on an air-gapped network.


#12

So glad I asked the MLS question! I didn’t expect the answer to be yes.

It’s been a while since I dealt with MLS, so please bear with me. IIRC, you’ll need to wildcard the directories to allow root full access. I don’t recall the commands necessary to do so, but I could look them up if required. Make sure your levels and compartments are set correctly. This is very likely the source of the problem.


#13

I don't suppose you have a good source for this information?
Any help would be fantastic, please, as I'm flying a bit blind on this one!


#14

Let me see if I can put you in touch with our resident expert. He should be online later today.


#15

That would be awesome. I'm in the UK, so I can respond within the next 60 minutes, but after that I will have to wait until the following day. I will respond as soon as I can.

Thanks for the help so far.


#16

output file for /var/spool/pbs

PBS.CONF output

PBS_SERVER=servername
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=56
PBS_SCP=/bin/scp


#17

PBS_CORE_LIMIT is effectively the same thing as “ulimit -c”, so 56 seems like a strange setting. It limits the number of 512-byte blocks that a core file may contain. For example, if I want to limit my core files to 100MB…

n = (100 * 1024 * 1024) / 512 = 204800
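
The same arithmetic can be checked in the shell, in both directions. This sketch takes the 512-byte block unit from the explanation above as given; `mb_to_blocks` and `blocks_to_bytes` are hypothetical helper names.

```shell
# mb_to_blocks: convert a size in MiB to a count of 512-byte blocks.
# Hypothetical helper; the 512-byte unit is taken from the post above.
mb_to_blocks() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

# blocks_to_bytes: the reverse direction, for reading an existing setting.
blocks_to_bytes() {
    echo $(( $1 * 512 ))
}

mb_to_blocks 100     # prints 204800
blocks_to_bytes 56   # prints 28672 (28 KiB)
```

Under that reading, the configured value of 56 would cap core files at 28 KiB, which is why it looks unusually small.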

I’m a bit surprised to see many of your files and directories are group writable. Here’s a sample of PBS_HOME on my development VM:

# ls -laR /var/spool/pbs | head -40
/var/spool/pbs:
total 64
drwxr-xr-x. 14 root     root 4096 Nov 29 15:43 .
drwxr-xr-x. 16 root     root 4096 Nov 29 15:43 ..
drwxr-xr-x.  2 root     root 4096 Nov 30 16:29 aux
drwx------.  2 root     root 4096 Nov 29 15:43 checkpoint
drwxr-xr-x.  2 root     root 4096 Nov 30 16:12 comm_logs
drwx------. 15 postgres root 4096 Nov 30 20:26 datastore
drwxr-xr-x.  2 root     root 4096 Nov 30 16:12 mom_logs
drwxr-x--x.  4 root     root 4096 Nov 29 15:43 mom_priv
-rw-r--r--.  1 root     root   19 Nov 29 15:43 pbs_environment
-rw-r--r--.  1 root     root    7 Nov 29 15:43 pbs_version
drwxr-xr-x.  2 root     root 4096 Nov 30 16:12 sched_logs
drwxr-x---.  2 root     root 4096 Nov 29 15:43 sched_priv
drwxr-xr-x.  2 root     root 4096 Nov 30 20:27 server_logs
drwxr-x---.  7 root     root 4096 Nov 30 20:26 server_priv
drwxrwxrwt.  2 root     root 4096 Nov 30 20:26 spool
drwxrwxrwt.  2 root     root 4096 Nov 29 15:43 undelivered

/var/spool/pbs/aux:
total 8
drwxr-xr-x.  2 root root 4096 Nov 30 16:29 .
drwxr-xr-x. 14 root root 4096 Nov 29 15:43 ..

/var/spool/pbs/checkpoint:
total 8
drwx------.  2 root root 4096 Nov 29 15:43 .
drwxr-xr-x. 14 root root 4096 Nov 29 15:43 ..

/var/spool/pbs/comm_logs:
total 16
drwxr-xr-x.  2 root root 4096 Nov 30 16:12 .
drwxr-xr-x. 14 root root 4096 Nov 29 15:43 ..
-rw-r--r--.  1 root root 2015 Nov 30 16:12 20161129
-rw-r--r--.  1 root root  911 Nov 30 20:26 20161130

/var/spool/pbs/datastore:
total 108
drwx------. 15 postgres root      4096 Nov 30 20:26 .
drwxr-xr-x. 14 root     root      4096 Nov 29 15:43 ..
# 

I still think the permission denied error is related to MLS.


#18

I guess that might be me. Making PBS work on an MLS system is a large undertaking requiring both code changes (to do things like pass a user’s security context to PBS so it can remember it and pass it along to execution hosts for the pbs_mom processes to properly reinstantiate it for the user) and a nontrivial set of policy files.

As for what your current problem might be, using ausearch to see what the OS is objecting to would be the first step I’d take.
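
A first pass with ausearch might look like the following. This is a sketch only, assuming the audit package is installed, auditd is running, and the commands are run as root shortly after a failed start; `recent_pbs_denials` is a hypothetical wrapper name.

```shell
# recent_pbs_denials: list recent SELinux AVC denials logged for pbs_mom.
# Hypothetical helper; assumes the audit package's `ausearch` is installed,
# auditd is running, and the caller is root.
recent_pbs_denials() {
    if ! command -v ausearch >/dev/null 2>&1; then
        echo "ausearch not found; is the audit package installed?"
        return 0
    fi
    # -m AVC: SELinux denial records; -ts recent: the last 10 minutes;
    # -c pbs_mom: match on the command name of the denied process.
    ausearch -m AVC -ts recent -c pbs_mom 2>&1 \
        || echo "no matching AVC records found (or the audit log is unreadable)"
}

recent_pbs_denials
```

Each AVC record names the source and target security contexts and the denied operation, which should show what the OS is objecting to when pbs_mom checks its directories.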


#19

Hello. I got this message the other day when reinstalling the PBS server host (PBS itself was on a shared drive).
I disabled SELinux, and all was fine. (Disabling SELinux is not fine, but …)

Regards
Einar


#20

Hi Timbo,

My name is Steve Gombosi and I manage support for PBS Professional for Altair in the Americas Region (yes, I know you’re in the UK, but I thought I’d chime in on this since support of all the current MLS sites is my responsibility). The mainline PBS Professional distribution doesn’t currently play nicely with SELinux (as you’ve discovered). There is a branch of the PBS Professional commercial distribution (the 13.0.20x series of releases) which contains code to be fully SELinux/MLS compliant (and is indeed certified as ICD-509 compliant, if that means anything to you). It contains code to capture and propagate the submitting user’s security context to any jobs he submits as well as to handle things like polyinstantiated directories. The distribution contains the SELinux policy files necessary to function correctly in this sort of environment. There’s also an addendum to the PBS Professional Admin Guide (available on request), which covers the additional installation steps necessary to extract and compile the policy files from the distribution prior to installation. I can get you a copy of this addendum if you PM me with an email address.

The reason it’s in its own branch right now is that the SELinux project took, as “altair4” says below, a “nontrivial” amount of special development effort and was started well before we decided to open source PBS Professional. There’s ongoing work to integrate this code back into the mainline branch, but it will take some time for that integration to be complete. If you have an immediate need for full MLS integration, you might want to talk to someone from Altair UK about this distribution.