Pbs_mom stopped with segfault (CentOS 7.2)


#1

Hi, everyone.

I’m new to PBS Pro.
I built a master node and a slave node as virtual machines (based on CentOS-7-x86_64-Minimal-1511.iso).
I then installed PBS Pro v14.1.0, downloaded from https://github.com/PBSPro/pbspro/archive/v14.1.0.tar.gz.

After starting pbs_mom, the status command reports “pbs_mom is not running”.
I searched around for several days and found a segfault in /var/log/messages:

Oct 18 12:05:52 centos7 kernel: pbs_mom[11201]: segfault at 0 ip 00007f3b01085346 sp 00007ffc310cd538 error 4 in libc-2.17.so[7f3b01000000+1b7000]

The mom.lock file and the core files still remain in /var/spool/pbs/mom_priv.
How should I fix this?


The installation details follow.
/etc/hosts is configured.

master node

hostname; centos7
os; CentOS 7.2
vcpu; 2
mem; 2048

/etc/pbs.conf

PBS_SERVER=centos7
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

slave node

hostname; centos7-sub
os; CentOS 7.2
vcpu; 2
mem; 2048

/etc/pbs.conf

PBS_SERVER=centos7
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

/var/spool/pbs/mom_priv/config

$clienthost centos7
$restrict_user_maxsysid 999

Any help is appreciated.

Best regards.


#2

Greetings,

Encountering a segfault while getting started with a new software package does not constitute a positive experience, so let’s see if we can help you out here. First off, are you at all familiar with the GNU debugger (gdb)? If so, posting a stack trace would be very helpful. If not, we can provide instructions on how to accomplish this. Second, could you provide a bit more context from the MoM log? Perhaps twenty or so lines before the segfault and maybe another ten lines after could prove useful.

Thanks,

Mike


#3

Hi.
Thank you for your reply.

I’m afraid I don’t have any gdb experience, so instructions would be very helpful.
Here are the logs (attaching files is not permitted, so I am pasting them).

I executed the following command:

# /etc/init.d/pbs start

The segfault was recorded in /var/log/messages:

Oct 19 10:01:12 centos7 kernel: pbs_mom[4356]: segfault at 0 ip 00007fd732841346 sp 00007ffe1c34e158 error 4 in libc-2.17.so[7fd7327bc000+1b7000]
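
A side note on reading the kernel line above: “segfault at 0” is the faulting address (a null pointer), and “error 4” is the x86 page-fault code for a user-mode read of an unmapped page. Subtracting the mapping base shown in brackets from the instruction pointer (ip) gives the offset of the crashing instruction inside libc, which makes crashes from different runs comparable. A minimal sketch, assuming bash; the function name fault_offset is made up for illustration:

```shell
#!/bin/bash
# Offset of the faulting instruction inside the mapped library:
# instruction pointer (ip) minus the base from "libc-2.17.so[base+size]".
# Both arguments are hex strings without a leading 0x.
fault_offset() {
    local ip=$1 base=$2
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

# Values copied from the two /var/log/messages lines in this thread.
fault_offset 00007f3b01085346 7f3b01000000   # Oct 18 crash
fault_offset 00007fd732841346 7fd7327bc000   # Oct 19 crash
```

Both calls print 0x85346, so the two crashes most likely hit the same instruction in libc.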

Here is the mom log. No log messages appear before or after the segfault.

10/19/2016 10:01:12;0002;pbs_mom;Svr;Log;Log opened
10/19/2016 10:01:12;0002;pbs_mom;Svr;pbs_mom;pbs_version=14.1.0
10/19/2016 10:01:12;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
10/19/2016 10:01:12;0100;pbs_mom;Svr;parse_config;file config
10/19/2016 10:01:12;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.122.32 as authorized
10/19/2016 10:01:12;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
10/19/2016 10:01:12;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
10/19/2016 10:01:12;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
10/19/2016 10:01:12;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
10/19/2016 10:01:12;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 1024
10/19/2016 10:01:12;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Max files too low - you may want to increase it.
10/19/2016 10:01:12;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
10/19/2016 10:01:12;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
10/19/2016 10:01:12;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm centos7
10/19/2016 10:01:12;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
10/19/2016 10:01:12;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
10/19/2016 10:01:12;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
10/19/2016 10:01:12;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.122.32:15003 to pbs_comm
10/19/2016 10:01:12;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm centos7

Thank you in advance.


#4

I don’t see anything unusual in the log file. Please inspect the log around that time frame to see if there are any errors or warnings. All of the timestamps on the log messages match, so we may not be looking at a large enough section.

To produce a stack trace, perform the following steps (as root user):

  1. Locate the binary that produced the core file. I’ll assume it’s /opt/pbs/sbin/pbs_mom
  2. Locate the core file. I’ll assume it’s /var/spool/pbs/mom_priv/core.1234
  3. Install gdb if it’s not already installed: yum install gdb
  4. Start gdb (substitute names as appropriate): gdb /opt/pbs/sbin/pbs_mom /var/spool/pbs/mom_priv/core.1234
  5. To get a backtrace, run the following command in gdb: bt full
  6. Exit gdb by entering “quit” at the gdb prompt.
  7. Copy and paste the output of your gdb session to your response. That’s it!
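
The steps above can also be run without an interactive session, which makes the output easier to capture and paste. A minimal sketch, assuming bash and the same assumed paths as in steps 1 and 2; backtrace_core is a hypothetical helper name. It relies on gdb's -batch and -ex options, and batch mode also disables the “Type <return> to continue” pager prompt.

```shell
#!/bin/bash
# Print a full backtrace from a core file without an interactive gdb session.
# Usage: backtrace_core <binary> <core-file>
backtrace_core() {
    local binary=$1 core=$2
    if [ ! -x "$binary" ]; then
        echo "binary not found or not executable: $binary" >&2
        return 1
    fi
    if [ ! -r "$core" ]; then
        echo "core file not readable: $core" >&2
        return 1
    fi
    # -batch exits after the given commands have run; -ex runs one gdb command.
    gdb -batch -ex 'bt full' "$binary" "$core"
}

# Example, run as root (substitute the real names as appropriate):
# backtrace_core /opt/pbs/sbin/pbs_mom /var/spool/pbs/mom_priv/core.1234 > mom_backtrace.txt 2>&1
```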

Did you produce RPM packages when you built PBS, or did you use “make install” to install the software?

One other thing I noticed (I assume it’s a typo): are your master and slave node hostnames both centos7?


#5

Thank you for the detailed instructions.

1) GDB stack trace

First of all, I checked the PID written in mom.lock, because there are several core.xxx files in /var/spool/pbs/mom_priv.

# cat /var/spool/pbs/mom_priv/mom.lock
991

Then I got the stack trace below (it is a little long…).
The segmentation fault is already reported before I type bt full.

# gdb /opt/pbs/sbin/pbs_mom /var/spool/pbs/mom_priv/core.991
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/pbs/sbin/pbs_mom...done.
[New LWP 991]
[New LWP 992]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/pbs/sbin/pbs_mom'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fda3256e346 in __strcmp_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64 hwloc-libs-1.7-5.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-12.el7_2.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libpciaccess-0.13.4-2.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 nss-softokn-freebl-3.16.2.3-14.2.el7_2.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 openssl-libs-1.0.1e-51.el7_2.7.x86_64 pcre-8.32-15.el7_2.1.x86_64 python-libs-2.7.5-39.el7_2.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt full
#0  0x00007fda3256e346 in __strcmp_sse2 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fda33e29ccb in hwloc_obj_cmp () from /lib64/libhwloc.so.5
No symbol table info available.
#2  0x00007fda33e29e87 in hwloc__insert_object_by_cpuset () from /lib64/libhwloc.so.5
No symbol table info available.
#3  0x00007fda33e4396e in summarize () from /lib64/libhwloc.so.5
No symbol table info available.
#4  0x00007fda33e44828 in hwloc_look_x86 () from /lib64/libhwloc.so.5
No symbol table info available.
#5  0x00007fda33e448a3 in hwloc_x86_discover () from /lib64/libhwloc.so.5
No symbol table info available.
#6  0x00007fda33e2c8bb in hwloc_topology_load () from /lib64/libhwloc.so.5
No symbol table info available.
#7  0x000000000043e46f in mom_topology () at mom_main.c:10437
        ret = 0
        topology = 0x1f00f20
        xmlbuf = 0x0
        xmllen = 32730
        vtp = 0x0
        __func__ = "mom_topology"
#8  0x0000000000425bef in dep_initialize () at linux/mom_mach.c:4820
        __func__ = "dep_initialize"
#9  0x00000000004377f4 in initialize () at mom_main.c:1123
        i = <optimized out>
        avl = <optimized out>
        ix = {root = 0x24f47300, keylength = 1476692478, dup_keys = 0}
        hook_msg = '\377' <repeats 16 times>, '\000' <repeats 3156 times>
        hook_buf = '\000' <repeats 512 times>
        hook_input = {pjob = 0x24f47300, progname = 0x0, argv = 0x0, env = 0x0, vnl = 0x6f70732f7261762f, pid = 1882156143, jobs_list = 0x0}
        hook_output = {reject_errcode = 0x0, last_phook = 0x0, fail_action = 0x0, progname = 0x0, argv = 0x0, env = 0x0, vnl = 0x0}
        hook_errcode = 0
        hook_rc = 0
        last_phook = 0x0
        hook_fail_action = 0
        ret = <optimized out>
        xxrp = {xrp = {recptr = 0x0, count = 0, key = '\000' <repeats 15 times>}, buf = '\000' <repeats 287 times>}
        rp = 0x7ffd72a12ad0
        none = "<unset>"
        hostval = <optimized out>
        char_in_cname = <optimized out>
        __func__ = "initialize"
#10 0x000000000041a2ce in main (argc=1, argv=<optimized out>) at mom_main.c:9057
        id = "mom_main"
        tpp_conf = {node_type = 1, routers = 0x1ef0ce0, numthreads = 1, node_name = 0x1ef0cc0 "centos7:15003", auth_type = 1 '\001', get_ext_auth_data = 0x0,
          validate_ext_auth_data = 0x0, compress = 0, tcp_keepalive = 1, tcp_keep_idle = 30, tcp_keep_intvl = 10, tcp_keep_probes = 3,
---Type <return> to continue, or q <return> to quit---
          buf_limit_per_conn = 5000, force_fault_tolerance = 0}
        errflg = <optimized out>
        c = <optimized out>
        rc = <optimized out>
        stalone = <optimized out>
        i = <optimized out>
        dummyfile = <optimized out>
        act = {__sigaction_handler = {sa_handler = 0x436ff0 <stop_me>, sa_sigaction = 0x436ff0 <stop_me>}, sa_mask = {__val = {88579, 0 <repeats 15 times>}},
          sa_flags = 536870912, sa_restorer = 0x7fda3425f000}
        ptr = 0x0
        servername = <optimized out>
        serverport = 874907448
        recover = 0
        time_state_update = 0
        tryport = <optimized out>
        rppfd = 7
        privfd = -1
        tval = {tv_sec = 1476921819, tv_usec = 750521}
        myla = 6.9453353452667086e-310
        nxpjob = <optimized out>
        pjob = <optimized out>
        configscriptaction = <optimized out>
        inputfile = 0x0
        scriptname = 0x0
        prscput = <optimized out>
        prswall = <optimized out>
        fd = <optimized out>
        ipaddr = <optimized out>
        mygid = 0
        optindinc = <optimized out>
        do_mlockall = <optimized out>
        hook_input = {pjob = 0x7ffd72a13ca0, progname = 0x7fda3405694e <_dl_map_object_from_fd+2526> "\213\025\f\264!", argv = 0x7ffd72a13cd0,
          env = 0x7fda3405ae99 <_dl_add_to_namespace_list+25>, vnl = 0x0, pid = 2147479968, jobs_list = 0x7ffd72a13cd0}
        path_hooks_rescdef = "/var/spool/pbs/mom_priv/hooks/resourcedef\000\005\064\332\177\000\000\000\000\000\000\000\000\000\000 =\241r\375\177", '\000' <repeats 14 times>, "\001\000\000\000\r\000\000\000\000\000\000\000\001\n&4\332\177\000\000\000\000\000\000\375\177\000\000\260\f&4\332\177", '\000' <repeats 26 times>, "P\n&4\332\177\000\000\000\000\000\000\375\177\000\000\260\f&4\332\177\000\000\240M\241r\375\177\000\000@\003\000\000\000\000\000\000\177ELF\002\001\001\000\000\000\000\000\000\000\000\000\003\000>\000\001\000\000\000"...
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
(gdb) quit
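
The core-file selection described at the start of this step can also be scripted. A minimal sketch, assuming bash, core files named core.<pid>, and that mom.lock still holds the PID of the crashed pbs_mom (which is only true if pbs_mom has not been restarted since); core_for_lock is a hypothetical helper name:

```shell
#!/bin/bash
# Print the path of the core file whose suffix matches the PID in mom.lock.
# Usage: core_for_lock [mom_priv directory]
core_for_lock() {
    local dir=${1:-/var/spool/pbs/mom_priv}
    local pid core
    # Strip the trailing newline and any other whitespace from the PID.
    pid=$(tr -d '[:space:]' < "$dir/mom.lock") || return 1
    core="$dir/core.$pid"
    if [ -f "$core" ]; then
        echo "$core"
    else
        echo "no core file for PID $pid in $dir" >&2
        return 1
    fi
}

# Example: gdb /opt/pbs/sbin/pbs_mom "$(core_for_lock)"
```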

2) Installation type

I installed pbspro from source. My colleague tried building RPM packages following this guide and got the same result.

3) Master and slave have the same hostname

Sorry, you’re right. It was a typo. I have fixed it.

Best regards.


#6

It appears the segmentation fault is coming from the hwloc library, and not from the PBS Pro code. As such, I’m not sure there is much we can do to help you in this case. If you don’t mind, please file a JIRA ticket to track this issue here: https://pbspro.atlassian.net/secure/Dashboard.jspa

Our engineering team will decide how best to approach this issue. Our apologies that there doesn’t appear to be an immediate solution to the problem you have encountered. You could try building your own version of the hwloc library and linking against it in hopes that the problem you encountered has already been resolved. That’s the only suggestion I have at this time.
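
One way to check whether hwloc fails on its own, independent of PBS Pro, is to run the topology tool that ships with hwloc on the affected node. This is only a sketch under assumptions: it assumes bash, that the hwloc packages provide a binary named lstopo-no-graphics or lstopo, and that the tool goes through the same topology discovery as pbs_mom. The function name check_hwloc is made up for illustration.

```shell
#!/bin/bash
# Run hwloc's own topology tool, if one is installed, and report how it exited.
check_hwloc() {
    local tool rc
    for tool in lstopo-no-graphics lstopo; do
        if command -v "$tool" >/dev/null 2>&1; then
            "$tool" >/dev/null 2>&1
            rc=$?
            # Exit status 139 = 128 + 11, i.e. the tool was killed by SIGSEGV.
            if [ "$rc" -eq 139 ]; then
                echo "$tool crashed with a segmentation fault"
            else
                echo "$tool exited with status $rc"
            fi
            return "$rc"
        fi
    done
    echo "no lstopo binary found (assumed to ship with the hwloc packages)" >&2
    return 127
}

# Example: check_hwloc
```

If the tool also crashes, that would point at hwloc and the (virtual) hardware it is probing rather than at PBS Pro.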

Thanks,

Mike


#7

I finally managed to install PBS Pro and verify that the installation works.
However, two things differ from the previous setup:

  1. PBS Pro is now installed on a physical server
  2. The kernel version is slightly different

former version

Linux version 3.10.0-327.36.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Oct 10 23:08:37 UTC 2016

current version

Linux version 3.10.0-327.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Nov 19 22:10:57 UTC 2015

I’m not sure which change matters. I’ll try building a VM instance with this OS image and installing PBS Pro again.
Thank you for your continued support with my issue!


#8

Thank you for following up once you managed to get things working. This is useful information for the community.