Hostname issue setting up PBS Professional on CentOS 7

Hello,

I’m trying to set up PBS Professional on the head node (running CentOS 7) of an OHPC cluster by following the steps outlined in this document:
https://github.com/openhpc/ohpc/releases/download/v1.3.8.GA/Install_guide-CentOS7-Warewulf-PBSPro-1.3.8-x86_64.pdf

Attempting to start the PBS service using:

systemctl start pbs

yields the status:

Other folks with a related problem usually seem to have some value following "Invalid local hostname: ", even if it’s incorrect, but my instance of the service doesn’t seem to be able to read anything.

Here are the contents of my /etc/pbs.conf file, /etc/hosts file, and the outputs from hostname and hostname -f:

[Screenshot from 2019-07-10 09-48-06]

[Screenshot from 2019-07-10 09-53-22]

Both SELinux and the firewall service are disabled:

[Screenshot from 2019-07-10 09-59-23]

I’d really appreciate any help. Let me know if you need any more documentation.

The value of PBS_LEAF_NAME should be the hostname of the interface over which you want PBS Pro to communicate, not the domain name. Unless you have multiple NICs, you should not need to set this value at all. Please try removing PBS_LEAF_NAME from /etc/pbs.conf and try starting PBS Pro.
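
For anyone who would rather script that change, here is a minimal sketch. It assumes the setting sits on its own line in the form PBS_LEAF_NAME=value; the function name drop_leaf_name is hypothetical and not part of PBS Pro. It keeps a backup next to the original file.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PBS Pro): remove any PBS_LEAF_NAME
# line from a pbs.conf-style file, keeping a backup with a .bak suffix.
drop_leaf_name() {
    conf="$1"
    # Preserve the original file, including its mode and timestamps
    cp -p "$conf" "$conf.bak" || return 1
    # Write back every line that does not set PBS_LEAF_NAME.
    # grep exits 1 when nothing is left to print; only exit 2+ is an error.
    grep -v '^[[:space:]]*PBS_LEAF_NAME=' "$conf.bak" > "$conf" || [ $? -eq 1 ]
}
```

As root you would call it as drop_leaf_name /etc/pbs.conf and then start PBS again. If the result looks wrong, /etc/pbs.conf.bak still holds the previous contents.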

That worked perfectly. PBS started without issue after I removed that line.

Thanks!

Hi again,

Upon restarting the service, I receive the following status:

I’m fairly sure this issue is unrelated to the one originally presented in the thread, although the system setup remains the same. Again, any help would be appreciated, and let me know if you need more information.

Thanks

Could you please try running the commands outside of systemctl? In other words (as root)…

# /opt/pbs/libexec/pbs_init.d stop
# /opt/pbs/libexec/pbs_init.d start

The output from these commands should help to diagnose the problem. It’s very unusual that the scheduler failed to start. After you stop PBS, please make sure there are no lingering processes (e.g. ps -ef | grep pbs).

Stopping was successful and no related processes were lingering (the only grep result was the grep itself). However, startup failed. Could this be an issue with permissions?

Please run:

# ls -la /var/spool/pbs

and provide the output. PBS performs some sanity checks when it starts to make sure certain files and directories are present and have the expected permissions. Is /var/spool/pbs/sched_priv a symbolic link?
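
If it helps, the symlink question can also be answered directly rather than read off the ls listing. This is only a sketch: describe_path is a hypothetical helper name, and it relies on readlink -f from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: say whether a path is a symlink, a directory,
# some other kind of file, or missing. For symlinks, also print the
# fully resolved target.
describe_path() {
    p="$1"
    if [ -L "$p" ]; then
        echo "symlink -> $(readlink -f "$p")"
    elif [ -d "$p" ]; then
        echo "directory"
    elif [ -e "$p" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

For example, describe_path /var/spool/pbs/sched_priv and describe_path /var/spool/pbs would show whether either path is a link and where it resolves to.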

[Screenshot from 2019-07-10 15-04-44]

As the ls output shows, /var/spool/pbs/sched_priv is not a symlink, just a regular directory.

It looks to be an issue with permissions, not on the directories themselves but how the filesystem is mounted. Please run:

# df -h /var/spool/pbs

That will tell you which filesystem holds the directory and where it is mounted. Then check the mount options for that filesystem using the “mount” command.
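
The two steps can be combined into one lookup. The sketch below is an assumption-laden shortcut, not a PBS tool: mount_opts_for is a hypothetical name, it reads /proc/mounts (Linux only), and it will not handle mount points that contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the mount options of the filesystem that
# holds the given path.
mount_opts_for() {
    # df -P prints one header line, then one line whose last column
    # is the mount point of the filesystem containing the path.
    mp=$(df -P "$1" | awk 'NR==2 {print $NF}')
    [ -n "$mp" ] || return 1
    # /proc/mounts columns: device, mount point, fs type, options, ...
    # If the mount point is listed more than once, the last entry wins.
    awk -v mp="$mp" '$2 == mp {opts = $4} END {if (opts != "") print opts}' /proc/mounts
}
```

Running mount_opts_for /var/spool/pbs prints a comma-separated option list. Entries such as ro, nosuid, or noexec would be worth a closer look.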

Running:

# df -h /var/spool/pbs

yielded:

[Screenshot from 2019-07-11 09-29-52]

and the mount options for that filesystem are:

[Screenshot from 2019-07-11 09-30-21]

I don’t see anything unusual in the output you provided. The one thing that puzzles me is the log message:

pbs_sched: Operation not permitted (1) in chk_file_sec, Security violation "/var/spool/pbs/sched_priv" resolves to "/var/spool/pbs"

I would have expected the two paths to be identical. I don’t believe xfs is a problem, though I tend to use ext4. Are there hidden attributes in XFS that could be affecting permissions?

You might try moving /var/spool/pbs (PBS_HOME in /etc/pbs.conf) to another filesystem and updating /etc/pbs.conf to point to the new location.

Ever seen this before @scc?

Interestingly enough, a fresh attempt at initialization of the service appeared successful despite making no recent alterations. Restarting the service appeared equally successful. However, unless I am misunderstanding something, none of the functions of the service appear operational:

User liuxiapiao appeared to be receiving a message similar to “(to postgres) root on none” and resolved their issue by disabling their firewall.

http://community.pbspro.org/t/pbs-execution-host-couldnt-talk-to-pbs-server/1280

Our firewall is definitely disabled, but could this point to some other disruption of communication between the command line and the PBS server? Also, could this in any way be related to the previous issue with the scheduler?

I’ve checked and the server is active:

What happens when you telnet to the server port?

$ telnet -4 pbs-server 15001
Trying 192.168.1.2...
Connected to pbs-server.
Escape character is '^]'.

+2+15+15056+0+1Connection closed by foreign host.

When I try connecting to an unused port I see this:

$ telnet -4 pbs-server 15005
Trying 192.168.1.2...
telnet: connect to address 192.168.1.2: Connection refused
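
If telnet is not installed, the same reachability check can be approximated with bash's /dev/tcp redirection. This is a sketch only: port_open is a hypothetical helper, it needs bash and the coreutils timeout command, and it only tells you whether a TCP connection is accepted, not whether the PBS server answers correctly.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a TCP connection to host:port is
# accepted within 3 seconds, fail otherwise.
port_open() {
    host="$1"
    port="$2"
    # /dev/tcp/HOST/PORT is a bash feature, so run the redirection in
    # bash explicitly. Inside the -c script, $0 is the host and $1 the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For instance, port_open pbs-server 15001 && echo open || echo closed would mirror the telnet test above, with pbs-server standing in for your own server's hostname.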