Node Success/Failure

Progress towards completion of the DAG is based upon the success of the nodes within the DAG. The success of a node is based upon the success of the job(s), PRE script, and POST script. A job, PRE script, or POST script with an exit value not equal to 0 is considered failed. The exit value of whatever component of the node was run last determines the success or failure of the node.

Table 2.1 lists the definition of node success and failure for all variations of script and job success and failure, when DAGMAN_ALWAYS_RUN_POST is set to False. In this table, a dash (-) represents the case where a script does not exist for the DAG, S represents success, and F represents failure.

Table 2.2 lists the definition of node success and failure only for the cases where the PRE script fails, when DAGMan is configured to always run POST scripts.

PRE_SKIP

The behavior of DAGMan with respect to node success or failure can be changed with the addition of a PRE_SKIP command. A PRE_SKIP line within the DAG input file uses the syntax:

PRE_SKIP <JobName | ALL_NODES> non-zero-exit-code

The PRE script of a node identified by JobName that exits with the value given by non-zero-exit-code skips the remainder of the node entirely. Neither the job associated with the node nor the POST script will be executed, and the node will be marked as successful.

Retrying Failed Nodes

DAGMan can retry any failed node in a DAG by specifying the node in the DAG input file with the RETRY command. The use of retry is optional. The syntax for retry is

RETRY <JobName | ALL_NODES> NumberOfRetries [UNLESS-EXIT value]

where JobName identifies the node. NumberOfRetries is an integer number of times to retry the node after failure.

The implied number of retries for any node is 0, the same as not having a retry line in the file. Retry causes the whole node to be rerun (i.e. PRE Script, job, and POST Script).

Retry of a node may be short circuited using the optional keyword UNLESS-EXIT, followed by an integer exit value. If the node exits with the specified integer exit value, then no further processing will be done on the node.

The value of the current retry attempt and the maximum number of retires a node can attempt are available to PRE and POST scripts via the $RETRY and $MAX_RETRIES macros.

Stopping the DAG on Node Failure

The ABORT-DAG-ON command provides a way to abort the entire DAG if a given node returns a specific exit code. The syntax for ABORT-DAG-ON is

ABORT-DAG-ON <JobName | ALL_NODES> AbortExitValue [RETURN DAGReturnValue]

If the return value for the specified node matches AbortExitValue, the DAG is immediately aborted. Meaning the DAG stops all currently running nodes, cleans up, writes a rescue DAG, and exits with the optional specified return value. If no DAG return value is specified then DAGMan exits with the node return value that caused the abort.

A DAG return value other than 0, 1, or 2 will cause the condor_dagman job to stay in the queue after it exits and get retried, unless the on_exit_remove expression in the .condor.sub file is manually modified.

The behavior differs based on the existence of PRE and/or POST scripts. If a PRE script returns the AbortExitValue value, the DAG is immediately aborted. If the HTCondor job within a node returns the AbortExitValue value, the DAG is aborted if the node has no POST script. If the POST script returns the AbortExitValue value, the DAG is aborted.

An abort overrides node retries. If a node returns the abort exit value, the DAG is aborted, even if the node has retry specified.