DAG Save Point Files

A DAG can be set up to write its current progress to a save point file at specified nodes. A save point file is written the first time the designated node starts running, so retries of that node do not save the DAG progress again. The save point file is written in the exact same format as a partial Rescue DAG, except that all node retry values are reset to their maximum value. The save point file can then be specified when re-running a DAG to start the DAG from that point of progress.

To specify a save point file, use the DAG submit description keyword SAVE_POINT_FILE followed by the name of the node designated as the save point and, optionally, a filename. If a filename is not specified, the file is written as [Node Name]-[DAG filename].save, where the DAG filename is the DAG file that the save point declaration was read from.

If the specified save point filename includes a path, then DAGMan will attempt to write the file to that location. If the condor_submit_dag useDagDir flag is used and a path is specified for a save point, then the file is written to that path relative to the DAG’s working directory. Any save point file without a specified path is written to a sub-directory called save_files, which is created alongside the other DAGMan produced files (e.g. .condor.sub, .dagman.out, etc.).

# File: savepointEx.dag
JOB A node.sub
JOB B node.sub
JOB C node.sub
JOB D node.sub

PARENT A B C CHILD D

#SAVE_POINT_FILE NodeName Filename
SAVE_POINT_FILE A
SAVE_POINT_FILE B Node-B_custom.save
SAVE_POINT_FILE C ../example/subdir/Node-C_custom.save
SAVE_POINT_FILE D ./Node-D_custom.save

Given the above example DAG file, if condor_submit_dag savepointEx.dag was run from the directory my_work shown below, then the produced files appear in the directory tree as follows:

Directory Tree Visualized
└─Home
    ├─example
    │   └─subdir
    │       └─Node-C_custom.save
    └─my_work
        ├─savepointEx.dag
        ├─savepointEx.dag.condor.sub
        ├─savepointEx.dag.dagman.out
        ├─...
        ├─Node-D_custom.save
        └─save_files
              ├─ A-savepointEx.dag.save
              └─ Node-B_custom.save
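
The placement rules above can be sketched in a few lines of Python. This is only a model of the behavior described in this section, not DAGMan's own code: the function name save_point_path and the /Home/my_work working directory are hypothetical, and the sketch assumes that the default name uses only the base name of the DAG file and that any filename containing a directory component (even ./) counts as "including a path".

```python
import posixpath


def save_point_path(node, dag_file, work_dir, filename=None):
    """Model of the documented placement rules; not DAGMan's own code."""
    if filename is None:
        # Default name: [Node Name]-[DAG filename].save
        # Assumption: only the base name of the DAG file is used.
        filename = f"{node}-{posixpath.basename(dag_file)}.save"
    if posixpath.dirname(filename):
        # A path was given (even "./"), so resolve it against the working directory.
        target = posixpath.join(work_dir, filename)
    else:
        # Bare filename: the file goes into the save_files sub-directory.
        target = posixpath.join(work_dir, "save_files", filename)
    return posixpath.normpath(target)


if __name__ == "__main__":
    declarations = [
        ("A", None),
        ("B", "Node-B_custom.save"),
        ("C", "../example/subdir/Node-C_custom.save"),
        ("D", "./Node-D_custom.save"),
    ]
    for node, name in declarations:
        print(node, save_point_path(node, "savepointEx.dag", "/Home/my_work", name))
```

Run against the four declarations in savepointEx.dag, the sketch reproduces the locations shown in the directory tree.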

Once a DAG has run and produced save point files, the DAG can be re-run from a save file by passing a filename via the -load_save flag for condor_submit_dag. If the save point file is passed with a specified path, then DAGMan will attempt to read the file from that path. If just a save point filename is given, then DAGMan will assume the file is located in the save_files directory. The path to a save point file is checked relative to the current working directory that condor_submit_dag was run from.
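
As a rough illustration of the lookup just described, the following Python sketch maps a -load_save argument to the file that would be read. The function name load_save_path and the /Home/my_work directory are hypothetical, and the sketch only models the rules stated in this section; it does not call or reproduce DAGMan itself.

```python
import posixpath


def load_save_path(arg, cwd):
    """Model of the documented -load_save lookup; not DAGMan's own code."""
    if posixpath.dirname(arg):
        # A path was given: check it relative to the submit directory.
        candidate = posixpath.join(cwd, arg)
    else:
        # Bare filename: assume it lives in the save_files directory.
        candidate = posixpath.join(cwd, "save_files", arg)
    return posixpath.normpath(candidate)


if __name__ == "__main__":
    cwd = "/Home/my_work"  # hypothetical directory condor_submit_dag is run from
    for arg in ("A-savepointEx.dag.save", "./Node-D_custom.save"):
        print(arg, "->", load_save_path(arg, cwd))
```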

When DAGMan writes a save point file and a save file with the same name already exists, DAGMan rotates the existing file to [filename].old before writing the new save. Any already existing “old” save file is removed prior to rotation and saving. So, if the above example DAG was re-run with condor_submit_dag -load_save ./Node-D_custom.save savepointEx.dag from the same directory, then once node D starts the previous save would become Node-D_custom.save.old. This behavior does not only affect save point files when re-running a DAG. If a DAG was set up as follows:

# File: progressSavefile.dag
JOB A node.sub
JOB B node.sub
JOB C node.sub
...
SAVE_POINT_FILE A dag-progress.save
SAVE_POINT_FILE B dag-progress.save
SAVE_POINT_FILE C dag-progress.save

Then, assuming the parent/child relationships are A->B->C, the first save is written to dag-progress.save at the start of node A. When node B starts, the present dag-progress.save becomes dag-progress.save.old and a new dag-progress.save is written. Finally, once node C starts, dag-progress.save.old is deleted, the present dag-progress.save becomes dag-progress.save.old, and a new dag-progress.save is written. This allows a single save file that progresses with the DAG to be created.
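
The rotation sequence can be imitated with a short Python sketch. It is a simulation under stated assumptions rather than DAGMan's implementation: write_save and simulate are hypothetical helper names, the file contents are placeholder strings and not the real save file format, and the files are created in a temporary directory.

```python
import os
import tempfile


def write_save(path, contents):
    """Rotate an existing save to <path>.old, then write the new save."""
    old = path + ".old"
    if os.path.exists(path):
        if os.path.exists(old):
            os.remove(old)      # drop the previous "old" save first
        os.rename(path, old)    # the current save becomes the "old" one
    with open(path, "w") as handle:
        handle.write(contents)


def simulate(nodes):
    """Write one save per node start and return the files left on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        save = os.path.join(tmp, "dag-progress.save")
        for node in nodes:
            # Placeholder contents; a real save file uses the Rescue DAG format.
            write_save(save, f"progress at start of node {node}\n")
        result = {}
        for name in sorted(os.listdir(tmp)):
            with open(os.path.join(tmp, name)) as handle:
                result[name] = handle.read().strip()
        return result


if __name__ == "__main__":
    for name, contents in simulate(["A", "B", "C"]).items():
        print(name, "->", contents)
```

After the three node starts, only two files remain: dag-progress.save holding the state from node C, and dag-progress.save.old holding the state from node B.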