Checkpointing function for scheduler


#1

Hi,

Do you have any plans to extend PBS Pro so that checkpointed jobs can run on any free compute node?

thanks,

Sue


#2

Hello Sue,

There are currently no plans to migrate checkpointed jobs to alternate nodes. Migrating multinode jobs is tricky business, whereas single node jobs are more straightforward. To which are you referring? Are you using BLCR? MPI?

If the community believes this would provide sufficient value (versus the cost of the work) we could certainly consider it.

Thanks,

Mike


#3

Hi Sue,

Well, what you are asking for is, in a way, already supported in PBS.

An admin can configure PBS to use their own checkpoint scripts (which can in turn call third-party checkpointing tools like Meiosys Checkpoint and BLCR, as pointed out by @mkaro) and set the checkpoint action type to “checkpoint_abort”. This causes the job to be requeued, and the PBS scheduler may end up running the checkpointed job on a different node in a subsequent scheduling cycle.

For more information, I’d recommend reading section 9.3.2 of our admin guide - PBSPro Admin Guide

Hope it helps!

Regards,
Arun Grover


#4

Thanks.

Do you have any generic checkpoint scripts that we can have a look at?
Or what would you suggest with regard to using checkpoint scripts?
Do we have to use, for example, BLCR for checkpointing jobs?

regards,

Sue


#5

I have an example from a few years ago. I would expect it to still work because, to my knowledge, the PBS Professional interfaces for CPR have not changed. However, I do not see a way to share scripts on the forum.

I decided to publish them on my GitHub: https://github.com/scottaltair/PBS-Professional-CPR-Example


#6

Hi Sue,

I’m sorry, I think I made a mistake in answering this query. One of my colleagues pointed out that even after a requeue, a checkpointed job can only be resumed on the same node where it was checkpointed.

So for now, I’m not sure that a checkpointed job can be resumed on a different node.

Regards,
Arun


#7

Hi,

I have tested the checkpoint function with the code that Scott provided.

Two different MPI jobs run on one compute node: one submitted through a higher-priority queue and the other through a lower-priority queue. Checkpointing didn’t work, and I received these messages:

05/11/2017 17:14:54;0004;pbs_mom;Job;40.headnode;action checkpoint_abort script /var/spool/pbspro/mom_priv/checkpoint_abort.sh cannot be executed due to permissions
05/11/2017 17:14:54;0008;pbs_mom;Job;40.headnode;checkpoint failed: errno=0
05/11/2017 17:14:54;0008;pbs_mom;Job;40.headnode;req_holdjob: Checkpoint initiated.
05/11/2017 17:14:56;0080;pbs_mom;Req;req_reject;Reject reply code=15061, aux=2, type=7, from root@192.168.129.10:16001

Code 15061 is described as “not all tasks could checkpoint” in the reference manual.
What does it mean? If MPI jobs can’t be checkpointed, PBS Pro checkpointing is almost useless on clusters.

thanks,

Sue


#8

Judging from the MOM log messages, you have a file permission problem on the checkpoint_abort.sh action script. Is the execute bit set on this file?

Steve


#9

It is executable:

-rwxr-xr-x 1 sxy admin 853 Aug 11 2016 checkpoint_abort.sh

Sue


#10

It needs to be owned by root. It’s a security violation to have root (i.e. the MoM process) running scripts that are owned or writable by non-root users. See page 394 of the 14.2.1 Admin Guide:

“• Under Linux, the checkpoint script should be owned by root, and writable by root only, with permission 0755.”
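
As a quick way to verify the requirement quoted above, here is a minimal sketch of a check you could run before restarting MoM. The function name `check_action_script` is made up for illustration, and it assumes GNU `stat` (as found on typical Linux systems); it is not part of PBS.

```shell
#!/bin/sh
# check_action_script PATH   (hypothetical helper, not a PBS command)
# Returns 0 only if PATH is owned by root and has mode 755.
# Assumes GNU stat, which supports -c with %U (owner) and %a (octal mode).
check_action_script() {
    script=$1
    [ -f "$script" ] || { echo "$script: not found"; return 2; }
    owner=$(stat -c '%U' "$script")
    mode=$(stat -c '%a' "$script")
    status=0
    if [ "$owner" != "root" ]; then
        echo "$script: owner is $owner, expected root"
        status=1
    fi
    if [ "$mode" != "755" ]; then
        echo "$script: mode is $mode, expected 755"
        status=1
    fi
    return $status
}
```

For the script in the log above, that would be `check_action_script /var/spool/pbspro/mom_priv/checkpoint_abort.sh`. If it reports a problem, running `chown root` and `chmod 0755` on the file as root should bring it in line with the guide.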

By the way, if you’re using Scott’s sample checkpoint script from Github, there may be a typo in the kill command. It should read:

kill -TSTP …

not kill -SIGTSTP …

Steve


#11

Hi Steve,

Below is the checkpoint_abort.sh I am using. I can’t find the file ${PBS_JOBID}_data.chk anywhere. What does this file contain?

thanks, Sue

#!/bin/sh -x
#
# © Copyright 2012 Altair Engineering, Inc. All rights reserved.
# This code is provided "as is" without any warranty, express or implied, or
# indemnification of any kind. All other terms and conditions are as
# specified in the Altair PBS EULA.
#
# Assumption:
#
# Purpose:
#

exec >/tmp/checkpoint_abort.debug 2>&1

CHECKPOINTPATH=$1
if [ ! -d ${CHECKPOINTPATH} ]; then
    mkdir -p ${CHECKPOINTPATH} || exit 1
fi

# Source in PBS specific environment variables from pbs.conf
PBS_CONF=${PBS_CONF:-/etc/pbs.conf}
[ -f ${PBS_CONF} ] && . ${PBS_CONF}

JOB_JB=${PBS_HOME}/mom_priv/jobs/${PBS_JOBID}.JB
JOB_SC=${PBS_HOME}/mom_priv/jobs/${PBS_JOBID}.SC
PIDS=$(ps --sid ${PBS_SID} -o pid=)

cp ${JOB_SC} ${CHECKPOINTPATH}/${PBS_JOBID}.SC
#kill -SIGTSTP ${PIDS}
kill -TSTP ${PIDS}
sleep 1
cp ${PBS_JOBDIR}/${PBS_JOBID}_data.chk ${CHECKPOINTPATH}
kill -15 ${PIDS}


#12

Hi Steve,

Attached is the checkpoint_abort.sh I am using, taken from the GitHub repository maintained by Scott. I can’t find ${PBS_JOBID}_data.chk anywhere on the system.
What is this file for?

Thanks,

Sue


#13

Sue, please confirm the steps below, which are part of the README.

Steps to demo:
As root

  1. Update the $PBS_HOME/mom_priv/config with the contents of mom_priv/config.example minding the PATHs to the scripts
  2. Copy the checkpoint_abort.sh, checkpoint.sh, and restart.sh into mom_priv
  3. Chmod 755 the new scripts
  4. Restart PBS MOM

As user

  1. qsub checkpointable_app.sh
  2. tail -f ${PBS_JOBDIR}/${PBS_JOBID}.OU
    • PBS_JOBDIR can be determined by qstat -f | grep jobdir

In another terminal window, as root or the user.

  1. qhold $PBS_JOBID
    • watch the output of the tail in the user’s window
  2. qrls $PBS_JOBID
    • again watch the output of the tail
  3. Allow the job to run, and you will see the periodic checkpoint kick in, too.

I would like to see the contents of the $PBS_HOME/mom_priv/config file on the system you have deployed the cpr demo scripts on.

cat $PBS_HOME/mom_priv/config

Also, please note that the example “checkpointable” job script (pbs_cpr_demo/checkpointable_app.sh) is what writes out the file. Here is the relevant snippet from that script:

write_restart_file() {
    echo `date` Writing restart file…
    echo $number > ${PBS_JOBID}_data.chk
}
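
Since the checkpoint file only exists if the job itself writes it, the bare `cp` step in checkpoint_abort.sh will fail for any job that does not. Below is a rough sketch of a more defensive copy step; the function name `save_checkpoint_file` is hypothetical and is not in the repository.

```shell
#!/bin/sh
# save_checkpoint_file SRC DESTDIR   (hypothetical helper for illustration)
# Copies SRC into DESTDIR. Returns 1 with a message on stderr if the
# application never wrote SRC, instead of letting cp fail unexplained.
save_checkpoint_file() {
    src=$1
    destdir=$2
    if [ ! -f "$src" ]; then
        echo "checkpoint file $src not found; did the job write one?" >&2
        return 1
    fi
    mkdir -p "$destdir" || return 1
    cp "$src" "$destdir"/
}
```

In the action script this could be called as `save_checkpoint_file "${PBS_JOBDIR}/${PBS_JOBID}_data.chk" "${CHECKPOINTPATH}" || exit 1`, so a missing file shows up clearly in the debug output.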

Scott


#14

A little background:

The important thing to remember here is that true system-level checkpointing (like that found in a lot of the old vendor-supported Unix systems like Unicos, HP-UX, or AIX) doesn’t exist in Linux. There’s no “checkpoint” system call that can be used to checkpoint a generic application.

What this means is that your application has to have some method of checkpointing itself, usually in response to a signal. Scott’s checkpoint_abort script is a sample of how one might write an action script to trigger such a self-checkpoint in an application that uses SIGTSTP to trigger a checkpoint. On Github, there’s a companion script that is intended to run as a demonstration “application” - it basically sleeps and increments a counter in a loop. It traps SIGTSTP and generates a “checkpoint file”. The checkpoint_abort script is designed to work in conjunction with that “application”. It’s not a “plug-and-play” solution to checkpointing any generic application.
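
To make the idea concrete, here is a minimal sketch of a self-checkpointing “application” in the same spirit: it counts in a loop, writes its counter to a file when it receives SIGTSTP, and picks up from that file on the next start. This is not the script from the GitHub repository; `CHK_FILE`, `run_app` and `read_restart_file` are names invented for this example.

```shell
#!/bin/sh
# Toy self-checkpointing application (illustrative only, not from the repo).
# CHK_FILE is a hypothetical location for the checkpoint file.
CHK_FILE=${CHK_FILE:-./demo_data.chk}
number=0

write_restart_file() {
    # The only state this toy application has is its counter.
    echo "$number" > "$CHK_FILE"
}

read_restart_file() {
    # Resume from an earlier checkpoint if one exists.
    if [ -f "$CHK_FILE" ]; then
        number=$(cat "$CHK_FILE")
    fi
}

run_app() {
    # $1 = iterations to run, $2 = seconds to sleep per iteration
    limit=$1
    interval=$2
    read_restart_file
    # Checkpoint on SIGTSTP; the shell runs the handler once the
    # current foreground command (the sleep) has finished.
    trap write_restart_file TSTP
    i=0
    while [ "$i" -lt "$limit" ]; do
        number=$((number + 1))
        i=$((i + 1))
        sleep "$interval"
    done
    write_restart_file   # final state on normal completion
    trap - TSTP
}
```

Running `run_app 100 1` and then sending `kill -TSTP` to that shell's PID from another terminal should leave the current counter in `CHK_FILE`. A real application would need its own way of saving and restoring whatever state matters to it.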

Steve


#15

So the checkpoint function in PBS Pro doesn’t do any more than what Maui does.

Sue


#16

Generic checkpointing, as distinct from application self-checkpointing, would be very useful for preemption.
Nowadays, as big data science emerges, more and more simulations require large amounts of memory.
Generic job checkpointing would make preemption a more reasonable option than job suspension.
Furthermore, if generic job checkpointing became available, getting checkpointed jobs to run on different cores/nodes would be the next step in development; it should be one of the fundamental features of PBS.

Sue