Self-Checkpointing Applications
This section is about writing jobs for an executable which periodically saves checkpoint information, and how to make HTCondor store that information safely, in case it’s needed to continue the job on another machine or at a later time.
This section is not about how to checkpoint a given executable; that’s up to you or your software provider.
How To Run Self-Checkpointing Jobs
The best way to run self-checkpointing code is to set checkpoint_exit_code in your submit file. (Any exit code will work, but if you can choose, consider error code 85. On Linux systems, this is ERESTART, which seems appropriate.) If the executable exits with checkpoint_exit_code, HTCondor will transfer the checkpoint to the submit node, and then immediately restart the executable in the same sandbox on the same machine, with the same arguments. This immediate transfer makes the checkpoint available for continuing the job even if the job is interrupted in a way that doesn't allow for files to be transferred (e.g., power failure), or if the file transfer doesn't complete in the time allowed.
For a job to use checkpoint_exit_code successfully, its executable must meet a number of requirements.
Requirements
Your self-checkpointing code may not meet all of the following requirements. In many cases, however, you will be able to add a wrapper script, or modify an existing one, to meet these requirements. (Thus, your executable may be a script, rather than the code that's writing the checkpoint.) If you cannot, consult the Working Around the Assumptions and/or Other Options sections.
Your executable exits after taking a checkpoint with an exit code it does not otherwise use.
If your executable does not exit when it takes a checkpoint, HTCondor will not transfer its checkpoint. If your executable exits normally when it takes a checkpoint, HTCondor will not be able to tell the difference between taking a checkpoint and actually finishing; that is, if the checkpoint exit code and the code used on normal completion are the same, your job will never finish.
When restarted, your executable determines on its own if a checkpoint is available, and if so, uses it.
If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has taken a checkpoint.
Starting your executable up from a checkpoint is relatively quick.
If starting your executable up from a checkpoint is relatively slow, your job may not run efficiently enough to be useful, depending on the frequency of checkpoints and interruptions.
Using checkpoint_exit_code
The following Python script (example.py) is a toy example of code that checkpoints itself. It counts from 0 to 10 (exclusive), sleeping for 10 seconds at each step. It writes a checkpoint file (containing the next number) after each nap, and exits with code 85 at counts 3, 6, and 9. It exits with code 0 when complete.
#!/usr/bin/env python

import sys
import time

# Resume from the checkpoint file if one exists; otherwise start at 0.
value = 0
try:
    with open('example.checkpoint', 'r') as f:
        value = int(f.read())
except IOError:
    pass

print("Starting from {0}".format(value))
for i in range(value, 10):
    print("Computing timestamp {0}".format(value))
    time.sleep(10)
    value += 1

    # Write the checkpoint after each step.
    with open('example.checkpoint', 'w') as f:
        f.write("{0}".format(value))

    # Every third step, exit with the checkpoint code so that HTCondor
    # transfers the checkpoint file.
    if value % 3 == 0:
        sys.exit(85)

print("Computation complete")
sys.exit(0)
The following submit file (example.submit) commands HTCondor to transfer the file example.checkpoint to the submit node whenever the script exits with code 85. If interrupted, the job will resume from the most recent of those checkpoints. Before version 8.9.8, you must include your checkpoint file(s) in transfer_output_files; otherwise HTCondor will not transfer it (them). Starting with version 8.9.8, you may instead use transfer_checkpoint_files, as documented on the condor_submit man page.
checkpoint_exit_code = 85
transfer_output_files = example.checkpoint
should_transfer_files = yes
executable = example.py
arguments =
output = example.out
error = example.err
log = example.log
request_cpus = 1
request_memory = 512M
request_disk = 1G
queue 1
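On version 8.9.8 or later, a sketch of the same submit file using transfer_checkpoint_files instead; only the changed lines are shown, the rest of example.submit stays the same, and the condor_submit man page has the exact semantics:

checkpoint_exit_code = 85
transfer_checkpoint_files = example.checkpoint
should_transfer_files = yes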
This example does not remove the “checkpoint file” generated for timestep 9 when the executable completes. This could be done in example.py immediately before it exits, but that would cause the final file transfer to fail, if you specified the file in transfer_output_files. The script could instead remove the file and then re-create it empty, if desired, as sketched below.
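For instance, a minimal sketch of that approach, assuming the same example.checkpoint file name used above, would replace the last two lines of example.py with:

print("Computation complete")

# Truncate the checkpoint file to empty so the final transfer of
# transfer_output_files still succeeds.
open('example.checkpoint', 'w').close()

sys.exit(0)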
How Frequently to Checkpoint
Obviously, the longer the code spends writing checkpoints, and the longer your job spends transferring them, the longer it will take for you to get the job’s results. Conversely, the more frequently the job transfers new checkpoints, the less time the job loses if it’s interrupted. For most users and for most jobs, taking a checkpoint about once an hour works well, and it’s not a bad duration to start experimenting with. A number of factors will skew this interval up or down:
If your job(s) usually run on resources with strict time limits, you may want to adjust how often your job checkpoints to minimize wasted time. For instance, suppose the limit is six hours, your job writes a checkpoint after each hour of computation, and each checkpoint takes five minutes to write out and then transfer: your fifth checkpoint will finish five hours and twenty-five minutes into the run, and you won't gain any benefit from the last thirty-five minutes of computation. If you instead write a checkpoint every eighty-four minutes, your job will only waste four minutes. (See the sketch after this list.)
If a particular code writes larger checkpoints, or writes smaller checkpoints unusually slowly, you may want to take a checkpoint less frequently than you would for other jobs of a similar length, to keep the total overhead (delay) the same. The opposite is also true: if the job can take checkpoints particularly quickly, or the checkpoints are particularly small, the job could checkpoint more often for the same amount of overhead.
Some code naturally checkpoints at longer or shorter intervals. If a code writes a checkpoint every five minutes, it may make sense for the executable to wait for the code to write ten or more checkpoints before exiting (which asks HTCondor to transfer the checkpoint file(s)). If a job is a sequence of steps, the natural (or only possible) checkpoint interval may be between steps.
How long it takes to restart from a checkpoint. It should never take longer to restart from a checkpoint than to recompute from the beginning, but the restart process is part of the overhead of taking a checkpoint. The longer a code takes to restart, the less often the executable should exit.
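As a rough aid to the first point above, the following sketch (plain Python, not an HTCondor tool; the six-hour limit is the assumption from the earlier example) estimates how much time a runtime limit wastes for a given checkpoint schedule:

# Estimate wasted time, in minutes, under a fixed runtime limit.
def wasted_minutes(runtime_limit, compute_interval, checkpoint_overhead):
    cycle = compute_interval + checkpoint_overhead
    completed_cycles = runtime_limit // cycle
    # Whatever is left after the last completed checkpoint produces no
    # new checkpoint, so it is lost if the job is interrupted there.
    return runtime_limit - completed_cycles * cycle

print(wasted_minutes(360, 60, 5))   # 35 minutes wasted
print(wasted_minutes(360, 84, 5))   # 4 minutes wasted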
Measuring how long it takes to make checkpoints is left as an exercise for the reader. Since version 8.9.1, however, HTCondor will report in the job’s log (if a log is enabled for that job) how long file transfers, including checkpoint transfers, took.
Debugging Self-Checkpointing Jobs
Because a job may be interrupted at any time, it's valid to interrupt it yourself at any time and see if a valid checkpoint is transferred. To do so, use condor_vacate_job to evict the job. When that's done (watch the user log), use condor_hold to put it on hold, so that it can't restart while you're looking at the checkpoint (and potentially overwrite it). Finally, to obtain the checkpoint file(s) themselves, use the somewhat mis-named condor_evicted_files to ask where they are.
For example, if your job is ID 635.0, and is logging to the file job.log, you can copy the files in the checkpoint to a subdirectory of the current directory as follows:
$ condor_vacate_job 635.0
Wait for the job to finish being evicted; hit CTRL-C when you see ‘Job was evicted.’ and immediately hold the job.
$ tail --follow job.log
$ condor_hold 635.0
Copy the checkpoint files from the spool.
Note that _condor_stderr and _condor_stdout are the files corresponding to the job's output and error submit commands; they aren't named correctly until the job finishes.
$ condor_evicted_files get 635.0
Copied to '635.0'.
$ cd 635.0
Now examine the checkpoint files to see if they look right. When you’re done, release the job to see if it actually works right.
$ condor_release 635.0
$ condor_ssh_to_job 635.0
You may also want to remove your copy of checkpoint files:
$ cd ..; rm -fr 635.0
Working Around the Assumptions
The basic technique here is to write a wrapper script (or modify an existing one), so that the executable has the necessary behavior, even if the code does not.
Your executable exits after taking a checkpoint with an exit code it does not otherwise use.
If your code exits when it takes a checkpoint, but not with a unique code, your wrapper script will have to determine, when the executable exits, if it did so because it took a checkpoint. If so, the wrapper script will have to exit with a unique code. If the code could usefully exit with any code, and the wrapper script therefore cannot exit with a unique code, you can instead instruct HTCondor to consider being killed by a particular signal as a sign of successful checkpoint; set +SuccessCheckpointExitBySignal to TRUE and +SuccessCheckpointExitSignal to the particular signal. (If you do not set checkpoint_exit_code, you must set +WantFTOnCheckpoint.) If your code does not exit when it takes a checkpoint, the wrapper script will have to determine when a checkpoint has been made, kill the program, and then exit with a unique code. (A wrapper-script sketch appears after this list.)
When restarted, your executable determines on its own if a checkpoint is available, and if so, uses it.
If your code requires different arguments to start from a checkpoint, the wrapper script must check for the presence of a checkpoint and start the executable with correspondingly modified arguments.
Starting your executable up from a checkpoint is relatively quick.
The longer the start-up delay, the slower the job’s overall progress. If your job’s progress is too slow as a result of start-up delay, and your code can take checkpoints without exiting, read the ‘Delayed Transfers’ and ‘Manual Transfers’ sections below.
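The following is a minimal sketch of such a wrapper, in Python like the earlier example, covering the first two workarounds above (a code that exits after checkpointing but without a unique code, and that needs different arguments to resume). Every name in it is an assumption for illustration: the program my_code, its hypothetical --resume-from flag, and the convention that a complete checkpoint is signalled by a checkpoint.done marker file. Adapt it to whatever your code actually does:

#!/usr/bin/env python
# Hypothetical wrapper script: adds the behavior HTCondor expects around
# a code that does not provide it itself.
import os
import subprocess
import sys

CHECKPOINT = 'my_code.checkpoint'   # assumed checkpoint file name
DONE_FLAG = 'checkpoint.done'       # assumed marker meaning "checkpoint complete"

# When restarted, look for a checkpoint and adjust the arguments.
args = ['./my_code']                          # assumed executable name
if os.path.exists(CHECKPOINT):
    args += ['--resume-from', CHECKPOINT]     # hypothetical flag

# Clear any stale marker, then run the code.
if os.path.exists(DONE_FLAG):
    os.unlink(DONE_FLAG)
rc = subprocess.call(args)

# If the code stopped because it finished a checkpoint, exit with the
# unique code named by checkpoint_exit_code in the submit file.
if os.path.exists(DONE_FLAG):
    sys.exit(85)

# Otherwise pass the code's own exit code through unchanged.
sys.exit(rc)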
Other Options
The preceding sections of this HOWTO explain how a job meeting the requirements can take checkpoints at arbitrary intervals and transfer them back to the submit node. Although this is the method of operation most likely to result in an interrupted job continuing from a valid checkpoint, other, less reliable options exist.
Delayed Transfers
This method is risky, because it does not allow your job to recover from any failure mode other than an eviction (and sometimes not even then). It may also require changes to your executable. The advantage of this method is that it doesn't require your code to exit and restart after each checkpoint, nor does it require a recent version of HTCondor.
The basic idea is to take checkpoints as the job runs, but not transfer them back to the submit node until the job is evicted. This implies that your executable doesn't exit until the job is complete (which is the normal case). If your code has long start-up delays, you'll naturally not want it to exit after it writes a checkpoint; otherwise, the wrapper script could restart the code as necessary.
To use this method, set when_to_transfer_output to ON_EXIT_OR_EVICT instead of setting checkpoint_exit_code. This will cause HTCondor to transfer your checkpoint file(s) (which you listed in transfer_output_files, as noted above) when the job is evicted. Of course, since this is the only time your checkpoint file(s) will be transferred, if the transfer fails, your job has to start over from the beginning. One reason file transfer on eviction fails is if it takes too long, so this method may not work if your transfer_output_files contain too much data.
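For the toy example above, the relevant lines of example.submit would change to something like the following sketch:

# Delayed transfers: no checkpoint_exit_code; the checkpoint file is
# transferred only when the job exits or is evicted.
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files = example.checkpoint
should_transfer_files = yes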
Furthermore, eviction can happen at any time, including while the code is updating its checkpoint file(s). If the code does not update its checkpoint file(s) atomically, HTCondor will transfer the partially-updated checkpoint file(s), potentially overwriting the previous, complete one(s); this will probably prevent the code from picking up where it left off.
In some cases, you can work around this problem by using a wrapper script. The idea is that renaming a file is an atomic operation, so if your code writes checkpoints to one file, call it checkpoint, then your wrapper script, when it detects that the checkpoint is complete, would rename that file checkpoint.atomic. That way, checkpoint.atomic always has a complete checkpoint in it. With such a script, instead of putting checkpoint in transfer_output_files, you would put checkpoint.atomic, and HTCondor would never see a partially-complete checkpoint file. (The script would also, of course, have to copy checkpoint.atomic to checkpoint before running the code.)
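A minimal sketch of the wrapper's rename-and-restore steps, in Python; how the wrapper detects that a checkpoint is complete is code-specific and left out here, and the file names are the ones assumed above:

import os
import shutil

CHECKPOINT = 'checkpoint'        # the file the code actually writes
ATOMIC = 'checkpoint.atomic'     # the file listed in transfer_output_files

def publish_checkpoint():
    # os.replace() (Python 3.3+) renames atomically on the same filesystem,
    # so checkpoint.atomic is always either absent or complete.
    os.replace(CHECKPOINT, ATOMIC)

def restore_checkpoint():
    # Before running the code, put the last complete checkpoint back
    # where the code expects to find it.
    if os.path.exists(ATOMIC):
        shutil.copy(ATOMIC, CHECKPOINT)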
Manual Transfers
If you're comfortable with programming, instead of running a job with checkpoint_exit_code, you could use condor_chirp, or other tools, to manage your checkpoint file(s). Your executable would be responsible for downloading the checkpoint file(s) on start-up, and periodically uploading the checkpoint file(s) during execution. We don't recommend you do this for the same reasons we recommend against managing your own input and output file transfers.
Early Checkpoint Exits
If your executable's natural checkpoint interval is half or more of your pool's max job runtime, it may make sense to checkpoint and then immediately ask to be rescheduled, rather than lower your user priority doing work you know will be thrown away. In this case, you can use the OnExitRemove job attribute to determine if your job should be rescheduled after exiting. Don't set when_to_transfer_output to ON_EXIT_OR_EVICT, and don't set +WantFTOnCheckpoint; just have the job exit with a unique code after its checkpoint.
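A sketch of what that might look like in the submit file, reusing the checkpoint exit code 85 from earlier; the exact expression is a hypothetical illustration, so adjust it to how your code actually signals completion:

# Keep the job in the queue (reschedule it) when it exits with the
# checkpoint code; let it leave the queue on any other exit.
on_exit_remove = (ExitBySignal == True) || (ExitCode != 85)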
Signals
Signals offer additional options for running self-checkpointing jobs. If you’re not familiar with signals, this section may not make sense to you.
Periodic Signals
HTCondor supports transferring checkpoint file(s) for an executable which takes a checkpoint when sent a particular signal, if the executable then exits in a unique way. Set +WantCheckpointSignal to TRUE to periodically receive checkpoint signals, and +CheckpointSig to specify which one. (The interval is specified by the administrator of the execute machine.) The unique way may be a specific exit code, for which you would set checkpoint_exit_code, or a signal, for which you would set +SuccessCheckpointExitBySignal to TRUE and +SuccessCheckpointExitSignal to the particular signal. (If you do not set checkpoint_exit_code, you must set +WantFTOnCheckpoint.)
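For instance, a sketch of the relevant submit-file lines, combined with checkpoint_exit_code as elsewhere in this HOWTO. The signal name and the exact value format for +CheckpointSig are assumptions here; consult the manual for what your version expects:

# Hypothetical: ask to be sent SIGUSR2 periodically; the code takes a
# checkpoint on that signal and then exits with code 85.
+WantCheckpointSignal = True
+CheckpointSig = "SIGUSR2"
checkpoint_exit_code = 85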
Delayed Transfer with Signals
This method is very similar to, but riskier than, delayed transfers, because in addition to delaying the transfer of the checkpoint file(s), it also delays their creation. Thus, this option should almost never be used; if taking and transferring your checkpoint file(s) is fast enough to reliably complete during an eviction, you're not losing much by doing so periodically, and it's unlikely that a code which takes small checkpoints quickly takes a long time to start up. However, this method will work even with very old versions of HTCondor.
To use this method, set when_to_transfer_output to ON_EXIT_OR_EVICT and KillSig to the particular signal that causes your job to checkpoint.
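A sketch, again reusing the example checkpoint file and an assumed SIGUSR2 checkpoint signal (kill_sig is the submit-file form of the KillSig setting mentioned above):

# Hypothetical: on eviction, send SIGUSR2 so the code writes a final
# checkpoint, then transfer the checkpoint file(s).
when_to_transfer_output = ON_EXIT_OR_EVICT
kill_sig = SIGUSR2
transfer_output_files = example.checkpoint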