DAPL startup error: RLIMIT_MEMLOCK too small, only happens on cross-host calculations


#1

Dear all,

I have found a weird problem with PBS Pro (though maybe PBS Pro is not the cause).
I have the following script (simplified for brevity):

#PBS -l select=4:ncpus=1
#PBS -l place=scatter

cd $PBS_O_WORKDIR
echo ulimit -a
ulimit -a
echo hostname:
hostname
echo -e "you are acquiring the resoures:\n$(cat $PBS_NODEFILE)"

source [intel2018 script]
mpiexec [my app]

Here is my output file:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514557
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 16384
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 514557
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
hostname:
cn001
you are acquiring the resoures:
cn001
cn002
cn003
cn006

And here is the error output file:

[1] DAPL startup: RLIMIT_MEMLOCK too small
[3] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805)…: fail failed
MPID_Init(1859)…: channel initialization failed
MPIDI_CH3_Init(147)…: fail failed
dapl_rc_setup_all_connections_20(1394): generic failure with errno = 872598799
getConnInfoKVS(956)…: PMI_KVS_Get failed

I understand this is a resource limit problem, but the output file shows that locked memory is unlimited:
max locked memory (kbytes, -l) unlimited

This problem only happens when the job spans multiple hosts. Multiple chunks on the same host do not trigger it.

Has anyone had the same experience?
I would be grateful for any comments.

Thanks,
Chris

P.S.
I used TORQUE as my PBS, and the same job (sourcing the same Intel compiler script and running the same app) runs fine on TORQUE.


#2

Hello @chris,

The error you are seeing is coming from MPI attempting to initialize a thread, most likely on another node. Make sure /etc/security/limits.conf has the following entries:

* soft memlock unlimited
* hard memlock unlimited

You will need to reboot the nodes for the changes to take effect.
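
Note that the `ulimit -a` in the job script only reports the limits of the shell on the first node (cn001 in the output above), so it says nothing about the other nodes. To check every node assigned to the job, something like the following could be added to the job script. This is only a minimal sketch: `memlock_status` is a hypothetical helper name, and it assumes passwordless ssh between the nodes. The limit seen in an ssh session may also differ from the one seen by processes that your MPI launcher starts, so treat the result as a hint rather than proof.

```shell
#!/bin/bash
# memlock_status is a hypothetical helper: it classifies a `ulimit -l` value.
memlock_status() {
    if [ "$1" = "unlimited" ]; then
        echo "ok"
    else
        echo "limited"
    fi
}

# Only meaningful inside a PBS job, where PBS_NODEFILE lists the assigned hosts.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    sort -u "$PBS_NODEFILE" | while read -r node; do
        # Assumption: passwordless ssh to each node; -n keeps ssh off the loop's stdin.
        limit=$(ssh -n "$node" 'ulimit -l')
        echo "$node: max locked memory = $limit ($(memlock_status "$limit"))"
    done
fi
```

Any node reported as "limited" would be a candidate for the limits.conf change above.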

Please note that this forum is for PBS Pro related questions. TORQUE is supported by a different vendor.

Thanks,

Mike


#3

Hi @mkaro,

Thanks for your comment.
This helped. Everything works after changing /etc/security/limits.conf.
I am just wondering, because I had made sure that my host has an unlimited locked-memory size.
The output from the script shown earlier included:
max locked memory (kbytes, -l) unlimited

Why do I still need to change /etc/security/limits.conf and reboot?
I think it is a Linux issue that was bothering me, not PBS Pro.

Again, thanks for diagnosing this.


#4

Glad to be of help. You’re correct in that Linux controls the default session limits, and those may be adjusted by PBS Pro or by the user within their job… at least those that are modifiable by a non-privileged user. Let us know if you have more questions.
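
As an illustration of what a non-privileged user can adjust within a job: the soft limit can be raised, but only up to the hard limit. The sketch below assumes bash, and `raise_soft_memlock` is a hypothetical helper name. It only affects the shell it runs in and the processes started from it, not processes started separately on other nodes.

```shell
#!/bin/bash
# raise_soft_memlock is a hypothetical helper name used for illustration.
raise_soft_memlock() {
    local soft hard
    soft=$(ulimit -S -l)   # current soft limit for locked memory
    hard=$(ulimit -H -l)   # hard limit: the ceiling for a non-privileged user
    echo "before: soft=$soft hard=$hard"
    if [ "$soft" != "$hard" ]; then
        # Raising the soft limit up to the hard limit needs no privileges.
        ulimit -S -l "$hard"
    fi
    echo "after: soft=$(ulimit -S -l) hard=$(ulimit -H -l)"
}

raise_soft_memlock
```

If the hard limit itself is too small, only an administrator can raise it, for example through /etc/security/limits.conf as described earlier in this thread.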