How to prevent HUGE stdout in /var/spool, which might occupy the whole partition


#1

PBSPro stores temporary job output in its spool directory, which usually lives on the /var partition of each standalone compute node and is not covered by the /home user quota.

I ran into a problem where a user's malformed program does not exit after encountering an error. Instead it keeps printing “Error encountered. Retrying…” plus some debug info to its stdout, once every few milliseconds.
This quickly fills the /var partition and makes other jobs and system processes unstable.

Is there an option to limit the maximum size of a job's stdout and stderr (which reside on the compute node) without interfering with its other behaviour (creating other files in its CWD, on shared storage)?

Regards,


#2

PBS Pro does not monitor the filesystem state or provide the ability to limit maximum file size, but there are a few options you might consider.

Filesystem quotas have been part of Linux for longer than I can remember (ext3 supports them). Each flavor of Linux usually has its own documentation. For example, the CentOS documentation is located here: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-disk-quotas.html

If that doesn’t give you enough control, you might consider writing a hook that creates a limited scratch space for each job. Here’s a nearly fourteen-year-old article (still accurate) on how to create and mount a virtual filesystem: https://linuxgazette.net/109/chirico.html

Of course, creating a virtual filesystem doesn’t guarantee your users will direct their output there. It may be like herding cats, depending on your users.
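
As a rough illustration of the virtual filesystem idea, the steps could look something like the sketch below. The image path, mount point, and the make_job_scratch helper are made up for this example and are not part of PBS Pro. A real run needs root, so by default the script only prints the commands it would execute.

```shell
#!/bin/bash
# Sketch: build a fixed-size scratch filesystem for one job.
# All paths, names and sizes are hypothetical. DRY_RUN=1 (the default)
# only prints the commands; set DRY_RUN=0 as root to really run them.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

make_job_scratch() {
    local jobid="$1" size="$2"
    local img="/var/tmp/scratch-${jobid}.img"   # hypothetical image file
    local mnt="/scratch/${jobid}"               # hypothetical mount point

    run truncate -s "$size" "$img"      # backing file that caps the space
    run mkfs.ext4 -q -F "$img"          # -F: the target is a regular file
    run mkdir -p "$mnt"
    run mount -o loop "$img" "$mnt"     # writes under $mnt cannot exceed $size
}

plan="$(make_job_scratch 14.centos7-2 100M)"
echo "$plan"
```

A hook would also have to unmount the filesystem and delete the image when the job ends, which is left out here.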

You might also consider changing the location of your spool directory by updating the /etc/pbs.conf file. Take a look at the variables PBS_HOME and PBS_MOM_HOME in the PBS Pro documentation. Don’t forget to restart PBS Pro if you make any changes.
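
Before moving anything, it may help to check which partition the spool directory currently sits on. The sketch below assumes the usual KEY=value layout of pbs.conf, that PBS_MOM_HOME overrides PBS_HOME for MoM when it is set, and that the PBS_CONF_FILE environment variable may point at a non-default config file. Please verify these against the documentation for your version. The conf_value and mom_home helpers are names invented for this example.

```shell
#!/bin/bash
# Sketch: find the MoM home implied by pbs.conf and show its spool partition.

# Print the value of KEY from a KEY=value file (last assignment wins).
conf_value() {
    local key="$1" file="$2"
    sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Assumption: PBS_MOM_HOME, when present, is used instead of PBS_HOME.
mom_home() {
    local file="$1" home
    home="$(conf_value PBS_MOM_HOME "$file")"
    [ -n "$home" ] || home="$(conf_value PBS_HOME "$file")"
    echo "$home"
}

conf="${PBS_CONF_FILE:-/etc/pbs.conf}"
if [ -r "$conf" ]; then
    home="$(mom_home "$conf")"
    echo "MoM home: $home"
    # Show size and usage of the partition holding the spool directory.
    df -h "$home/spool" 2>/dev/null || echo "no spool directory under $home"
fi
```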

Thanks,

Mike


#3

While searching on Google I came across this article:
https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne/running-jobs/submitting-jobs-pbs

which mentions a "-l file=1GB" switch, listed as a restriction: "Limit the maximum size of any single file that the job can create".

So I was wondering whether PBSPro is capable of doing that too.

It now seems to be a private implementation, so I’ll fall back to xfs_quota.


#4

The pseudo resource “file” tells pbs_mom to set RLIMIT_FSIZE for the job. The limit applies to each file written by each job process, not to the job as a whole or to all files combined, and it covers the stdout/stderr files. The job is allowed to run to completion after the limit is hit, but only the first N bytes are captured in stdout, stderr, and any other file produced by job processes.

Example:

[user1@centos7-2 tmp]$ cat t.sh
#!/bin/bash

for x in $(seq 1 10000) ; do echo $x ; done
echo $(date) >> /tmp/finished
echo $PBS_JOBID >> /tmp/finished


[user1@centos7-2 tmp]$ qsub -lfile=100 t.sh
14.centos7-2
[user1@centos7-2 tmp]$ cat t.sh.o14
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
3[user1@centos7-2 tmp]$ cat t.sh.e14
/var/spool/pbs/mom_priv/jobs/14.centos7-2.SC: line 3: echo: write error: File too large
[user1@centos7-2 tmp]$ cat finished
Wed May 9 09:34:06 EDT 2018
14.centos7-2
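
To get a feel for how RLIMIT_FSIZE behaves without submitting a job, something like the following sketch can be run on any Linux box. It uses the shell's own `ulimit -f` rather than PBS, so it only illustrates the kernel limit, not what pbs_mom does internally. The file names are arbitrary.

```shell
#!/bin/bash
# Sketch: reproduce the RLIMIT_FSIZE behaviour without PBS, via `ulimit -f`.
# bash counts this limit in 1024-byte blocks (512-byte blocks in POSIX mode).

work="$(mktemp -d)"

# 1) A single file that tries to grow past the limit is cut off.
#    The limit is set inside a subshell so the calling shell is unaffected.
( ulimit -f 2; seq 1 10000 > "$work/capped.out" ) 2>/dev/null || true
capped_size=$(( $(wc -c < "$work/capped.out") ))

# 2) The limit is per file: three 700-byte files all succeed, although
#    together they hold more than the 2048-byte limit.
(
    ulimit -f 2
    for n in 1 2 3; do
        head -c 700 /dev/zero > "$work/small.$n"
    done
) || true
small_total=$(( $(cat "$work"/small.* | wc -c) ))

echo "capped.out: $capped_size bytes"
echo "small files together: $small_total bytes"
rm -rf "$work"
```

The second part is the reason this limit alone does not protect a partition: a job that writes many files, each below the limit, can still fill the disk.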

I hope this helps.