Job terminates when execjob_begin and execjob_prologue hook is run in batch but not when run interactively


#1

I have written a hook to set up a scratch file system using beeond (BeeGFS on Demand). In the execjob_begin part of the hook I make sure that the directory that beeond will use for storage is set up on the local SSD disk. Then in the execjob_prologue part I start beeond. If I run an interactive job, the beeond directory is mounted and works as expected. If I run the same request in batch, the job terminates before it runs my script.

In both cases this is the command I am running to start beeond.

12/06/2018 18:23:03;0800;pbs_python;Hook;pbs_python;cmd: beeond start -n /tmp/1273.ip-0A0C1004.beeond -r -d /mnt/pbs_ramdisk -c /mnt/beeond -f /etc/beeond’

The 1273.ip* file is one that I create in the hook; it lists each node associated with the job, one per line.
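
For illustration only, here is a minimal sketch of how such a node file and command line could be assembled. It is not the actual hook: the helper names `write_nodefile` and `build_beeond_cmd` are hypothetical, the host list is passed in as a plain Python list rather than read from the PBS hook event, and dropping duplicate hosts is an assumption. The beeond options are copied from the logged command above and nothing else is assumed about the beeond CLI.

```python
import os
import tempfile


def write_nodefile(job_id, hosts, directory):
    """Write one host per line to <directory>/<job_id>.beeond and return the path."""
    path = os.path.join(directory, job_id + ".beeond")
    # Assumption: a host should be listed only once, in first-seen order.
    seen = set()
    unique_hosts = []
    for host in hosts:
        if host not in seen:
            seen.add(host)
            unique_hosts.append(host)
    with open(path, "w") as fh:
        for host in unique_hosts:
            fh.write(host + "\n")
    return path


def build_beeond_cmd(nodefile):
    """Return the argument list matching the command shown in the log line."""
    return [
        "beeond", "start",
        "-n", nodefile,
        "-r",
        "-d", "/mnt/pbs_ramdisk",
        "-c", "/mnt/beeond",
        "-f", "/etc/beeond",
    ]


if __name__ == "__main__":
    # Hypothetical host names; a real hook would take them from the job.
    tmp_dir = tempfile.mkdtemp()
    nodefile = write_nodefile("1273.ip-0A0C1004", ["node1", "node2"], tmp_dir)
    print(" ".join(build_beeond_cmd(nodefile)))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later handed to `subprocess`.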

This is what I see in the logs for a batch job.

12/06/2018 17:44:03;0400;pbs_python;Svr;pbs_python;--> Stopping Python interpreter <--
12/06/2018 17:44:03;0400;pbs_mom;Hook;beeond;finished
12/06/2018 17:44:03;0800;pbs_mom;n/a;mom_get_sample;nprocs: 317, cantstat: 2, nomem: 0, skipped: 0, cached: 0
12/06/2018 17:44:03;0008;pbs_mom;Job;1268.ip-0A0C1004;Started, pid = 85978
12/06/2018 17:44:03;0800;pbs_mom;n/a;mom_get_sample;nprocs: 315, cantstat: 0, nomem: 0, skipped: 0, cached: 0
12/06/2018 17:44:03;0080;pbs_mom;Job;1268.ip-0A0C1004;task 00000001 terminated
12/06/2018 17:44:03;0800;pbs_mom;n/a;mom_get_sample;nprocs: 314, cantstat: 0, nomem: 0, skipped: 0, cached: 0
12/06/2018 17:44:03;0008;pbs_mom;Job;1268.ip-0A0C1004;Terminated
12/06/2018 17:44:03;0100;pbs_mom;Job;1268.ip-0A0C1004;task 00000001 cput= 0:00:00
12/06/2018 17:44:03;0008;pbs_mom;Job;1268.ip-0A0C1004;kill_job

When I run the same request in interactive mode I see this in the logs.

12/06/2018 18:23:34;0400;pbs_python;Svr;pbs_python;--> Stopping Python interpreter <--
12/06/2018 18:23:34;0400;pbs_mom;Hook;beeond;finished
12/06/2018 18:23:34;0800;pbs_mom;n/a;mom_get_sample;nprocs: 328, cantstat: 0, nomem: 0, skipped: 0, cached: 0
12/06/2018 18:23:34;0008;pbs_mom;Job;1273.ip-0A0C1004;Started, pid = 93525

If I comment out the beeond start command, then batch jobs start as expected but without the beeond file system.

Any idea why PBS would terminate the job in batch mode but work as expected in interactive mode?

Jon


#2

Hi Jon,

What is the job script actually doing? From the logs you provided, it appears that the script starts and then exits almost immediately. What happens if you submit a job script that does nothing more than print the date, sleep a few seconds, and print the date again? Does that also exit immediately?

Thanks,

Mike


#3

Originally the job script would take about 5 minutes to run. However, I pared it back to just “date” to make sure that there was not an issue in the script. I then looked at the size of the stdout file to see if the script ran. If I comment out the one line that runs the beeond command in the hook, then I see the date command output. My theory is that something in the beeond script, run in the execjob_prologue event, is signaling PBS that the job script has completed when in reality it has not even started. I looked at the return code of the beeond script and it only returns 0, and when I look at the output from the script in the PBS_HOME/mom_priv/hooks/tmp dir it says that the hook completed successfully. For clarity, I am running version 18.1.3.
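
As a rough sketch of the return-code check described above (not the code in the actual hook), an external command can be run with its output appended to a log file and its exit status returned. The helper name `run_logged` and the log path are hypothetical. The sketch sticks to calls available in both Python 2.7 and Python 3, since the interpreter embedded in PBS may differ between releases.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Run cmd, append its stdout and stderr to log_path, return the exit code."""
    with open(os.devnull, "r") as devnull:
        with open(log_path, "a") as log:
            # stdin comes from /dev/null so the command cannot read from
            # whatever stdin the calling process happens to have.
            return subprocess.call(
                cmd,
                stdin=devnull,
                stdout=log,
                stderr=subprocess.STDOUT,
            )


if __name__ == "__main__":
    # Stand-in command; in the thread this would be the beeond start command.
    log_file = os.path.join(tempfile.mkdtemp(), "cmd.log")
    rc = run_logged([sys.executable, "-c", "print('ok')"], log_file)
    print("return code:", rc)
```

Writing to a file rather than a pipe is a deliberate choice here: if the command leaves background processes holding its output open, reading from a pipe could block until they exit, whereas a file does not.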


#4

Hi @jon, is the hook running as root or as the job owner? Just a wild guess, but if it is running as the job owner then there may be systemd process/IPC cleanup at play (maybe beeond backgrounds itself and then exits after the hook exits, which causes systemd to kneecap all of the user’s other processes, including the job which has since started)?


#5

@scc, thanks for the suggestion. All of the hook events are running as root and the cgroup hook is not enabled, so I don’t believe that this is the case. I think it may have something to do with the prologue event and the beeond start script launching processes. I did some additional digging, and some assumptions that I made about having to create directories before launching beeond were incorrect. The good news is that I am able to create beeond in the execjob_begin hook. The bad news is that when the beeond script is run in the execjob_prologue hook, something causes PBS to not run the job in batch, although it works fine when the job is run interactively.