Node Success/Failure
Progress towards completion of the DAG is based upon the success of the nodes within the DAG. The success of a node is based upon the success of the job(s), PRE script, and POST script. A job, PRE script, or POST script with an exit value not equal to 0 is considered failed. The exit value of whatever component of the node was run last determines the success or failure of the node.
Table 2.1 lists the definition of node success and failure for all variations
of script and job success and failure, when DAGMAN_ALWAYS_RUN_POST is set
to False
. In this table, a dash (-
) represents the case where a script
does not exist for the DAG, S represents success, and F represents
failure.
PRE |
JOB |
POST |
Node |
---|---|---|---|
- |
S |
- |
S |
- |
F |
- |
F |
- |
S |
S |
S |
- |
S |
F |
F |
- |
F |
S |
S |
- |
F |
F |
F |
S |
S |
- |
S |
S |
F |
- |
F |
S |
S |
S |
S |
S |
S |
F |
F |
S |
F |
S |
S |
S |
F |
F |
F |
F |
not run |
- |
F |
F |
not run |
not run |
F |
Table 2.1: Node Success or Failure definition with
DAGMAN_ALWAYS_RUN_POST = False
(the default).
Table 2.2 lists the definition of node success and failure only for the cases
where the PRE script fails, when DAGMAN_ALWAYS_RUN_POST is set to True
.
PRE |
JOB |
POST |
Node |
---|---|---|---|
F |
not run |
- |
F |
F |
not run |
S |
S |
F |
not run |
F |
F |
Table 2.2: Node Success or Failure definition with
DAGMAN_ALWAYS_RUN_POST = True
.
PRE_SKIP
The behavior of DAGMan with respect to node success or failure can be changed with the addition of a PRE_SKIP command. A PRE_SKIP line within the DAG input file uses the syntax:
PRE_SKIP <JobName | ALL_NODES> non-zero-exit-code
The PRE script of a node identified by JobName that exits with the value given by non-zero-exit-code skips the remainder of the node entirely. Neither the job associated with the node nor the POST script will be executed, and the node will be marked as successful.
Retrying Failed Nodes
DAGMan can retry any failed node in a DAG by specifying the node in the DAG input file with the RETRY command. The use of retry is optional. The syntax for retry is
RETRY <JobName | ALL_NODES> NumberOfRetries [UNLESS-EXIT value]
where JobName identifies the node. NumberOfRetries is an integer number of times to retry the node after failure. The implied number of retries for any node is 0, the same as not having a retry line in the file. Retry is implemented on nodes, not parts of a node.
The diamond-shaped DAG example may be modified to retry node C:
# File name: diamond.dag
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
RETRY C 3
If node C is marked as failed for any reason, then it is started over as a first retry. The node will be tried a second and third time, if it continues to fail. If the node is marked as successful, then further retries do not occur.
Retry of a node may be short circuited using the optional keyword UNLESS-EXIT, followed by an integer exit value. If the node exits with the specified integer exit value, then no further processing will be done on the node.
The macro $(RETRY)
evaluates to an integer value, set to 0 first time
a node is run, and is incremented each time for each time the node is
retried. The macro $(MAX_RETRIES)
is the value set for
NumberOfRetries. These macros may be used as arguments passed to a PRE
or POST script.
Stopping the DAG on Node Failure
The ABORT-DAG-ON command provides a way to abort the entire DAG if a given node returns a specific exit code. The syntax for ABORT-DAG-ON is
ABORT-DAG-ON <JobName | ALL_NODES> AbortExitValue [RETURN DAGReturnValue]
If the return value of the node specified by JobName matches AbortExitValue, the DAG is immediately aborted. A DAG abort differs from a node failure, in that a DAG abort causes all nodes within the DAG to be stopped immediately. This includes removing the jobs in nodes that are currently running. A node failure differs, as it would allow the DAG to continue running, until no more progress can be made due to dependencies.
The behavior differs based on the existence of PRE and/or POST scripts. If a PRE script returns the AbortExitValue value, the DAG is immediately aborted. If the HTCondor job within a node returns the AbortExitValue value, the DAG is aborted if the node has no POST script. If the POST script returns the AbortExitValue value, the DAG is aborted.
An abort overrides node retries. If a node returns the abort exit value, the DAG is aborted, even if the node has retry specified.
When a DAG aborts, by default it exits with the node return value that caused the abort. This can be changed by using the optional RETURN keyword along with specifying the desired DAGReturnValue. The DAG abort return value can be used for DAGs within DAGs, allowing an inner DAG to cause an abort of an outer DAG.
A DAG return value other than 0, 1, or 2 will cause the condor_dagman
job to stay in the queue after it exits and get retried, unless the
on_exit_remove
expression in the .condor.sub
file is manually
modified.
Adding ABORT-DAG-ON for node C in the diamond-shaped DAG
# File name: diamond.dag
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
RETRY C 3
ABORT-DAG-ON C 10 RETURN 1
causes the DAG to be aborted, if node C exits with a return value of 10. Any other currently running nodes, of which only node B is a possibility for this particular example, are stopped and removed. If this abort occurs, the return value for the DAG is 1.