DAGMan Introduction

Describing Workflows with DAGMan

A DAGMan workflow is described in a DAG description file. The input file specifies the nodes of the DAG as well as the dependencies that order the DAG.

A node within a DAG represents a unit of work. It contains the following:

  • List of Jobs: A collection of one or more HTCondor jobs, defined in a single submit description.

  • PRE Script (optional): A script that runs before the list of jobs is submitted. Typically used to verify that all inputs are valid.

  • POST Script (optional): A script that runs after the list of jobs finishes. Typically used to verify outputs and clean up temporary files.

The following diagram illustrates the elements of a node – every node must contain a list of jobs, with an optional PRE script and an optional POST script.

        flowchart LR
            Start((Start)) --> JOBS
            Start --> PRE[Pre-Script]
            subgraph DAG Node
                PRE --> JOBS[List of Jobs]
                JOBS --> POST[Post-Script]
            end
            JOBS --> End((End))
            POST --> End((End))

An edge in DAGMan describes a dependency between two nodes. DAG edges are directional; each has a parent and a child, where the parent node must finish running before the child starts. Any node can have an unlimited number of parents and children.

Example: Diamond DAG

A simple diamond-shaped DAG will be used in the examples that follow. This four-node DAG is described as follows in the DAG description file:

Example Diamond DAG description file
# File name: diamond.dag

JOB A A.sub
JOB B B.sub
JOB C C.sub
JOB D D.sub
PARENT A CHILD B C
PARENT B C CHILD D

JOB

The JOB command specifies a list of one or more HTCondor jobs that become the core of a node in the DAG. The syntax used for each JOB command is:

JOB NodeName SubmitDescription [DIR directory] [NOOP] [DONE]

A JOB entry maps a NodeName to an HTCondor submit description. The NodeName identifies the node within the DAG description file and in output messages, and each node name must be unique within the DAG.

The values defined for NodeName and SubmitDescription are case sensitive, just as file names are case sensitive in most file systems. The NodeName can be any string that contains no white space, except for the words PARENT and CHILD (in upper, lower, or mixed case). The NodeName also cannot contain the special characters (. & +), which are reserved for system use.

The optional DIR keyword specifies a working directory for this node, from which the HTCondor jobs will be submitted, and from which a PRE and/or POST script will be run. If a relative directory is specified, it is relative to the current working directory as the DAG is submitted.

Note

A DAG containing DIR specifications cannot be run in conjunction with the -usedagdir command-line argument to condor_submit_dag.

The optional NOOP keyword identifies a no-operation node, meaning the list of jobs will not be submitted to HTCondor. DAGMan will still execute any PRE and/or POST scripts associated with the node. Marking a node with NOOP is useful for debugging complex DAG structures without changing the flow of the DAG.

The optional DONE keyword identifies a node as already completed, meaning neither the list of jobs nor the scripts will be executed. This is mainly used in Rescue DAGs generated by DAGMan itself in the event of a failure to complete the workflow.
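As a sketch, the optional JOB keywords might be combined as follows (the node names, submit files, and directory are hypothetical):

```
# Hypothetical nodes illustrating the optional JOB keywords

# Submit this node's jobs from the data/prep directory
JOB Prep prep.sub DIR data/prep

# Run only this node's PRE/POST scripts; skip submitting its jobs
JOB Stage stage.sub NOOP

# Treat this node as already finished
JOB Archive archive.sub DONE
```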

PARENT/CHILD Relationships

The PARENT/CHILD command specifies the dependencies within the DAG. Nodes are parents and/or children of other nodes within the DAG. A parent node must be completed successfully before any of its children may be started. A child node may only be started once all its parents have successfully completed.

The syntax used for each dependency (PARENT/CHILD) command is

PARENT ParentNodeName [ParentNodeName2 ... ] CHILD ChildNodeName [ChildNodeName2 ... ]

The PARENT keyword is followed by one or more ParentNodeName(s). The CHILD keyword is followed by one or more ChildNodeName(s). Each child node depends on every parent node within the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. The diamond-shaped DAG example may specify the dependencies with

Example Diamond DAG description for node dependencies
PARENT A CHILD B C
PARENT B C CHILD D

An alternative specification for the diamond-shaped DAG may specify some or all of the dependencies on separate lines:

Alternate example Diamond DAG description for node dependencies
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D

Scripts

The optional SCRIPT command specifies processing to be done relative to the execution of the node’s associated jobs, depending on the script type. All scripts run on the Access Point, not on the Execution Points where the actual jobs are likely to run.

Script Types

Processing done before the list of jobs is submitted is called a PRE script. Processing done after the list of jobs completes execution is called a POST script. The PRE and POST scripts are considered part of the DAG node structure. Additionally, there is a HOLD script that runs when any job associated with the node enters the held state. HOLD scripts are not considered part of the DAG workflow and run on a best-effort basis: if one does not complete successfully, it has no effect on the overall workflow and no error is reported.

Note

The script executable does not have to be a shell script (Unix) or batch file (Windows), but it should be lightweight since it runs directly on the AP.

The syntax used for SCRIPT commands is

# PRE-Script
SCRIPT [DEFER status time] [DEBUG filename type] PRE <NodeName | ALL_NODES> ExecutableName [arguments]
# POST-Script
SCRIPT [DEFER status time] [DEBUG filename type] POST <NodeName | ALL_NODES> ExecutableName [arguments]
# HOLD-Script
SCRIPT [DEFER status time] [DEBUG filename type] HOLD <NodeName | ALL_NODES> ExecutableName [arguments]

The NodeName identifies the node to which the script is attached. The ExecutableName specifies the executable (e.g., shell script or batch file) to be executed and may not contain spaces. The optional arguments are command line arguments to the script, including delimiting spaces. Both ExecutableName and optional arguments are case sensitive.

Scripts are commonly used to do simple tasks such as the following:

  • PRE: Verify inputs for a node’s jobs that are produced by a parent node.

  • POST: Turn an execution failure of the list of jobs into a successful node completion for specific exit codes, so the DAG doesn’t fail.

  • HOLD: Notify the user of a held job via email.
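As a sketch, all three script types could be attached to node B of the diamond DAG (the script names are hypothetical):

```
# Hypothetical script attachments for node B
SCRIPT PRE  B check_inputs.sh $NODE
SCRIPT POST B check_outputs.sh $NODE $RETURN
SCRIPT HOLD B notify_hold.sh $NODE
```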

DEFER retries

The optional DEFER keyword causes the script (and only the script) to be retried if it exits with the exit code given by status. Rather than being considered failed, the script is retried after at least time seconds. While waiting for the retry, the script does not count against a maxpre or maxpost limit.

Note

The ordering of the DEFER keyword within the SCRIPT specification is fixed. It must come directly after the SCRIPT keyword; this is done to avoid backward compatibility issues for any DAG with a NodeName of DEFER.
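For instance, the PRE script from the diamond DAG example could be retried on a specific exit code (the status value 1 and the 60-second delay are illustrative):

```
# If pre.sh exits with status 1, retry it after at least 60 seconds
SCRIPT DEFER 1 60 PRE B pre.sh $NODE .gz
```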

DEBUG file

The optional DEBUG keyword captures a script’s specified standard output streams (STDOUT and/or STDERR) and writes them to a specified debug file. This keyword is followed by two pieces of information:

  1. Filename: File to write captured output into.

  2. Type: Type of output to capture. Takes one of the following options:

    1. STDOUT

    2. STDERR

    3. ALL (both STDOUT & STDERR)

The DEBUG keyword must appear before the script type (PRE, POST, HOLD) and after any declared DEFER retries.

Note

DAGMan will create the specified debug file if it does not already exist. Otherwise, the debug file is appended to.

Note

It is safe to have multiple scripts write to the same file, as DAGMan captures all of a script’s output and writes it at one time. This write also includes a dividing banner with useful information regarding that script’s execution.
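Putting the pieces together, a POST script’s output could be captured like this, reusing the stage-out example from later in this section (the debug file name is hypothetical):

```
# Capture both STDOUT and STDERR of the POST script into post_A.debug
SCRIPT DEBUG post_A.debug ALL POST A stage-out job_status $RETURN
```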

Scripts as part of a DAG workflow

Scripts are executed on the access point; the access point is not necessarily the same machine upon which the node’s jobs are run. Further, a single cluster of HTCondor jobs may be spread across several machines.

If the PRE script fails, then the HTCondor jobs associated with the node are not submitted, and the POST script is not run either (by default). However, if the list of jobs is submitted, and there is a POST script, the POST script is always run once the list of jobs finishes. The behavior when the PRE script fails may be changed to run the POST script by setting configuration variable DAGMAN_ALWAYS_RUN_POST to True or by passing the -AlwaysRunPost argument to condor_submit_dag.

Examples that use PRE or POST scripts

The following examples use the diamond-shaped DAG. The first example uses a PRE script to expand a compressed file needed as input for the associated HTCondor jobs of nodes B and C. The DAG description file:

Example Diamond DAG description using PRE Scripts
# File name: diamond.dag

JOB  A  A.sub
JOB  B  B.sub
JOB  C  C.sub
JOB  D  D.sub
SCRIPT PRE  B  pre.sh $NODE .gz
SCRIPT PRE  C  pre.sh $NODE .gz
PARENT A CHILD B C
PARENT B C CHILD D

The script pre.sh uses its command line arguments to form the file name of the compressed file. The script contains

#!/bin/sh
gunzip ${1}${2}

Therefore, the PRE script invokes

gunzip B.gz

for node B, which uncompresses file B.gz, placing the result in file B.

This second example uses the $RETURN macro. The DAG description file contains the POST script specification:

SCRIPT POST A stage-out job_status $RETURN

If the first non-successful HTCondor job of node A exits with the value -1, the POST script is invoked as

$ stage-out job_status -1

Warning

DAGMan script macros must be declared individually, with surrounding spaces, to be replaced. Providing a script argument such as job_status=$RETURN will not substitute the $RETURN macro; the entire string is passed along unmodified.

Special Script Argument Macros

DAGMan provides the following macros to be used in node script arguments. Their use is limited to individual command line arguments surrounded by spaces:

All Scripts

  • $NODE
  • $NODE_COUNT
  • $QUEUED_COUNT
  • $DONE_COUNT
  • $FAILED_COUNT
  • $FUTILE_COUNT
  • $DAGID
  • $DAG_STATUS
  • $RETRY
  • $MAX_RETRIES

POST Scripts

  • $JOBID
  • $CLUSTERID
  • $JOB_COUNT
  • $RETURN
  • $EXIT_CODES
  • $EXIT_CODE_COUNTS
  • $SUCCESS
  • $JOB_ABORT_COUNT
  • $PRE_SCRIPT_RETURN

The special macros for all scripts:

  • $NODE evaluates to the (case sensitive) string defined for NodeName.

  • $RETRY evaluates to an integer value set to 0 the first time a node is run, and is incremented each time the node is retried. See Node Success/Failure for the description of how to cause nodes to be retried.

  • $MAX_RETRIES evaluates to an integer value set to the maximum number of retries for the node. Defaults to 0 if retries aren’t specified for a node.

  • $DAGID is the DAGManJobId of the DAGMan job managing the workflow.

  • $DAG_STATUS is the status of the DAG that is recorded in the DAGMan scheduler universe job’s ClassAd as DAG_Status.

    Note

    The macro $DAG_STATUS value and definition is unrelated to the attribute named DagStatus as defined in the node status file.

  • $NODE_COUNT is the total number of nodes within the DAG (including the FINAL node).

  • $QUEUED_COUNT is the current number of nodes running jobs in the DAG.

  • $DONE_COUNT is the current number of nodes that have completed successfully in the DAG.

  • $FAILED_COUNT is the current number of nodes that have failed in the DAG.

  • $FUTILE_COUNT is the current number of nodes that will never run in the DAG.

Macros for POST Scripts only:

  • $CLUSTERID is the ClusterId of the node’s associated list of jobs.

  • $JOBID evaluates to a representation of the HTCondor job ID [ClusterId.ProcId] of the node job. For nodes with multiple jobs in the same cluster, the ProcId value is that of the last job within the cluster.

  • $JOB_COUNT evaluates to the total number of jobs associated with the node.

  • $JOB_ABORT_COUNT is the number of jobs associated with the node that exited the queue with an abort event.

  • $SUCCESS evaluates to True or False representing whether the node has been successful up to this point (PRE script and list of jobs succeeded).

  • $RETURN evaluates to the return value of the HTCondor job if there is a single job within a cluster. With multiple jobs within the same cluster, the value is 0 if all jobs within the cluster are successful. Otherwise, the value is the exit value of the first job in the cluster to write a terminate event.

    • A job that dies due to a signal is reported with a $RETURN value representing the additive inverse of the signal number. For example, SIGKILL (signal 9) is reported as -9.

    • A job whose batch system submission fails is reported as -1001.

    • A job that is externally removed from the batch system queue (by something other than condor_dagman) is reported as -1002.

    • If the node’s jobs were skipped because of failure of the PRE script, the value of $RETURN will be -1004.

  • $EXIT_CODES is an ordered, comma-separated list of the exit codes returned by the jobs associated with the node.

  • $EXIT_CODE_COUNTS is an ordered, comma-separated list of the number of jobs associated with the node that exited with a particular exit code. The information is passed as {ExitCode}:{Count}.

  • $PRE_SCRIPT_RETURN evaluates to the return value of the node’s PRE script, if there is one. If there is no PRE script, this value will be -1.
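As a sketch of how these macros reach a script, the following hypothetical POST script (post_check.sh is not part of the manual’s examples) maps specific job exit codes to node success when attached as SCRIPT POST A post_check.sh $RETURN:

```shell
#!/bin/sh
# post_check.sh - hypothetical POST script, attached in the DAG as:
#   SCRIPT POST A post_check.sh $RETURN
# $1 receives the value of the $RETURN macro (the job's exit value).
# Treat job exit codes 0 and 2 as node success; anything else fails the node.
post_check() {
    case "$1" in
        0|2) return 0 ;;
        *)   return 1 ;;
    esac
}

# The function's status becomes the script's exit code, which DAGMan
# uses to decide node success or failure (defaults to 0 when unset).
post_check "${1:-0}"
```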

Node Submit Descriptions

Inline Submit Descriptions

Instead of using a submit description file, you can include an inline submit description directly inside the .dag file. An inline submit description is wrapped in { and } braces, with each submit command appearing on a separate line, just like the contents of a regular submit file.

This can be helpful when managing many submit descriptions, since they can all be kept in the same file instead of spread across many files.
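For example, a node might carry its submit description inline (the executable, arguments, and log file here are hypothetical):

```
# Hypothetical node with an inline submit description
JOB A {
    executable = /bin/sleep
    arguments  = 60
    log        = sleep.log
}
```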

SUBMIT-DESCRIPTION command

In addition to declaring inline submit descriptions as part of a node, they can be declared independently of nodes using the SUBMIT-DESCRIPTION command. This can be helpful to reduce the size and improve the readability of a .dag file when many nodes share the same submit description.

A SUBMIT-DESCRIPTION can be defined using the following syntax:

SUBMIT-DESCRIPTION DescriptionName {
    # submit attributes go here
}

An independently declared submit description must have a unique name that is not used by any of the nodes. It can then be linked to a node as follows:

JOB NodeName DescriptionName
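For instance, several nodes might share one description (the description name and submit commands are hypothetical):

```
# Hypothetical shared description linked to several nodes
SUBMIT-DESCRIPTION SleepJob {
    executable = /bin/sleep
    arguments  = 60
    log        = sleep.log
}

JOB A SleepJob
JOB B SleepJob
JOB C SleepJob
```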

Note

Both inline submit descriptions and the SUBMIT-DESCRIPTION command do not allow a queue statement; as a result, only a single instance of the job is submitted to HTCondor.

Warning

Both inline submit descriptions and the SUBMIT-DESCRIPTION command can only be used when DAGMAN_USE_DIRECT_SUBMIT = True.

External File Descriptions

Each node in a DAG may use a submit description file like one a user might use to submit a job directly via condor_submit.

$ condor_submit submit_file.sub

A key limitation is that each submit description file must contain exactly one queue statement; multiple queue statements are not permitted.

DAGMan does allow a node’s externally described submit file to submit one or more jobs. However, it is recommended that a node contain only a single job, because DAGMan treats the entire list of jobs associated with a node as one entity: one job failure results in the entire list of jobs being considered failed, and the remaining jobs associated with the node are then removed from the queue.

If each node uses the same HTCondor submit description file, then each node within the DAG runs the same list of jobs; the $(Cluster) macro can still produce unique file names for each node’s outputs, because each node has its own cluster of jobs.
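For instance, a single submit description shared by all four diamond DAG nodes could use $(Cluster) to keep output files distinct (the file and executable names are hypothetical):

```
# File name: diamond.sub (hypothetical, shared by nodes A, B, C, and D)
executable = diamond_prog
output     = diamond.$(Cluster).out
error      = diamond.$(Cluster).err
log        = diamond.log
queue
```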

DAGMan Specific Information Macros

When submitting jobs on behalf of the user, DAGMan creates custom submit description macros that can be referenced in the job submit description:

  • JOB: The name of the node to which these jobs belong.

  • RETRY: The current retry attempt number. The first execution is 0.

  • DAGManJobId: The job’s associated DAGManJobId.

  • DAG_STATUS: The current DAG status as described by DAG_Status (intended for the FINAL node).

  • FAILED_COUNT: The current number of failed nodes in the DAG (intended for the FINAL node).

  • DAG_PARENT_NAMES: Comma-separated list of the names of the parent nodes of the node to which these jobs belong.
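A submit description could reference these macros as in the following sketch (the executable and file names are hypothetical):

```
# Hypothetical submit description referencing DAGMan-provided macros
executable = report.sh
arguments  = "node=$(JOB) retry=$(RETRY) dag=$(DAGManJobId)"
output     = $(JOB).report.out
error      = $(JOB).report.err
log        = report.log
queue
```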

DAGMan will also add the following information to the job’s ClassAd:

Note

Depending on the number of parent nodes a node has, the attribute DAGParentNodeNames and the submit macro DAG_PARENT_NAMES may not be set.