DAGMan Introduction

DAGMan is an HTCondor tool that organizes multiple jobs into a workflow, represented as a directed acyclic graph (DAG). A DAGMan workflow automatically submits jobs in a particular order, such that certain jobs must complete before others start running. This allows the outputs of some jobs to be used as inputs for others, and makes it easy to rerun the same workflow later.

Describing Workflows with DAGMan

A DAGMan workflow is described in a DAG input file. The input file specifies the nodes of the DAG as well as the dependencies that order the DAG.

A node within a DAG represents a unit of work. It contains the following:

  • Job: An HTCondor job, defined in a submit file.

  • PRE script (optional): A script that runs before the job starts. Typically used to verify that all inputs are valid.

  • POST script (optional): A script that runs after the job finishes. Typically used to verify outputs and clean up temporary files.

The following diagram illustrates the elements of a node: every node must contain a job, with an optional PRE script and an optional POST script.

flowchart LR
  Start((Start)) --> Job
  Start --> PREscript
  subgraph DAG Node
    PREscript --> Job
    Job --> POSTscript
  end
  Job --> End((End))
  POSTscript --> End((End))

An edge in DAGMan describes a dependency between two nodes. DAG edges are directional; each has a parent and a child, where the parent node must finish running before the child starts. Any node can have an unlimited number of parents and children.

Example: Diamond DAG

A simple diamond-shaped DAG, shown in the following diagram, serves as the starting point for the examples. This diamond DAG contains four nodes.

flowchart TD
  A --> B & C
  B & C --> D

A very simple DAG input file for this diamond-shaped DAG is:

# File name: diamond.dag

JOB  A  A.sub
JOB  B  B.sub
JOB  C  C.sub
JOB  D  D.sub
PARENT A CHILD B C
PARENT B C CHILD D
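
The ordering that these dependencies impose can be sketched in a few lines of Python. This is only a model of the dependency rule (a node becomes eligible once all of its parents have finished), not a description of how DAGMan schedules jobs internally. The function name run_waves and the grouping into "waves" are assumptions made for illustration.

```python
from collections import defaultdict


def run_waves(nodes, edges):
    """Group nodes into waves: a node is eligible once all its parents are done."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    done, waves = set(), []
    remaining = list(nodes)
    while remaining:
        # A node is ready when every one of its parents has already finished.
        ready = [n for n in remaining if parents[n] <= done]
        if not ready:
            raise ValueError("cycle detected: the graph is not a DAG")
        waves.append(ready)
        done.update(ready)
        remaining = [n for n in remaining if n not in done]
    return waves


# The diamond DAG from diamond.dag, written as (parent, child) pairs.
diamond_nodes = ["A", "B", "C", "D"]
diamond_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

print(run_waves(diamond_nodes, diamond_edges))  # [['A'], ['B', 'C'], ['D']]
```

Nodes B and C land in the same wave because neither depends on the other, while D has to wait for both.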

A set of basic commands appearing in a DAG input file is described below.

JOB

The JOB command specifies an HTCondor job. The syntax used for each JOB command is:

JOB JobName SubmitDescriptionFileName [DIR directory] [NOOP] [DONE]

A JOB entry maps a JobName to an HTCondor submit description file. The JobName identifies the node within the DAG input file and in output messages, so each JobName must be unique within the DAG.

The values of JobName and SubmitDescriptionFileName are case sensitive, because file names in a file system are case sensitive. The JobName can be any string that contains no white space, except for the strings PARENT and CHILD (in upper, lower, or mixed case). JobName also cannot contain the special characters ‘.’ and ‘+’, which are reserved for system use.
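
The naming rules above can be checked before a DAG file is written. The following Python sketch encodes only the restrictions stated in this section. The helper name is_valid_job_name is hypothetical, and DAGMan itself may enforce additional rules that are not modeled here.

```python
RESERVED_WORDS = {"PARENT", "CHILD"}  # rejected in upper, lower, or mixed case
RESERVED_CHARS = {".", "+"}           # reserved for system use


def is_valid_job_name(name):
    """Return True if name satisfies the JobName rules described in this section."""
    if not name or any(ch.isspace() for ch in name):
        return False
    if name.upper() in RESERVED_WORDS:
        return False
    return not any(ch in RESERVED_CHARS for ch in name)


for candidate in ["A", "child_1", "Parent", "step.1", "a+b", "my job"]:
    print(candidate, is_valid_job_name(candidate))
```

Note that child_1 passes: only the exact strings PARENT and CHILD are excluded, not names that merely contain them.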

The optional DIR keyword specifies a working directory for this node, from which the HTCondor job will be submitted, and from which a PRE and/or POST script will be run. If a relative directory is specified, it is relative to the current working directory at the time the DAG is submitted. Note that a DAG containing DIR specifications cannot be run in conjunction with the -usedagdir command-line argument to condor_submit_dag.

The optional NOOP keyword identifies that the HTCondor job within the node is not to be submitted to HTCondor. This is useful for debugging a complex DAG structure, by marking jobs as NOOP to verify that the control flow through the DAG is correct. The NOOP keywords are then removed before submitting the DAG. Any PRE and POST scripts for jobs specified with NOOP are executed; to avoid running the PRE and POST scripts, comment them out. Even though the job specified with NOOP is not submitted, its submit description file must still exist.

The optional DONE keyword identifies a node as being already completed. This is mainly used by Rescue DAGs generated by DAGMan itself, in the event of a failure to complete the workflow. Users should generally not use the DONE keyword. The NOOP keyword is more flexible in avoiding the execution of a job within a node.
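
To make the shape of a JOB line concrete, here is a minimal Python sketch that splits one into its parts. It follows only the syntax line shown above and is not DAGMan's own parser. The function name parse_job_line and the dictionary layout are assumptions, as is treating the keywords case-insensitively. Inline submit descriptions (a `{` in place of the file name) are deliberately not handled.

```python
def parse_job_line(line):
    """Parse 'JOB JobName SubmitDescriptionFileName [DIR directory] [NOOP] [DONE]'."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0].upper() != "JOB":
        raise ValueError(f"not a JOB command: {line!r}")

    node = {"name": tokens[1], "submit_file": tokens[2],
            "dir": None, "noop": False, "done": False}

    rest = tokens[3:]
    i = 0
    while i < len(rest):
        keyword = rest[i].upper()
        if keyword == "DIR":
            if i + 1 >= len(rest):
                raise ValueError("DIR requires a directory argument")
            node["dir"] = rest[i + 1]  # directory names keep their original case
            i += 2
        elif keyword == "NOOP":
            node["noop"] = True
            i += 1
        elif keyword == "DONE":
            node["done"] = True
            i += 1
        else:
            raise ValueError(f"unexpected token: {rest[i]!r}")
    return node


print(parse_job_line("JOB A A.sub DIR run_a NOOP"))
```

Here run_a is a made-up directory name. The result marks node A as a no-op that would be submitted from run_a, relative to the directory the DAG is submitted from.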

PARENT/CHILD Relationships

The PARENT … CHILD … command specifies the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any of its children may be started. A child node may only be started once all its parents have successfully completed.

The syntax used for each dependency (PARENT/CHILD) command is:

PARENT ParentJobName [ParentJobName2 ... ] CHILD  ChildJobName [ChildJobName2 ... ]

The PARENT keyword is followed by one or more ParentJobNames. The CHILD keyword is followed by one or more ChildJobNames. Each child job depends on every parent job within the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. The diamond-shaped DAG example may specify the dependencies with:

PARENT A CHILD B C
PARENT B C CHILD D

An alternative specification for the diamond-shaped DAG may specify some or all of the dependencies on separate lines:

PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D

As a further example, the line

PARENT p1 p2 CHILD c1 c2

produces four dependencies:

  1. p1 to c1

  2. p1 to c2

  3. p2 to c1

  4. p2 to c2
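
The expansion of one PARENT/CHILD line into individual dependencies is a cross product of the two name lists. The Python sketch below illustrates that rule. The function name expand_dependencies is hypothetical, and matching the PARENT and CHILD keywords case-insensitively is an assumption.

```python
def expand_dependencies(line):
    """Expand one PARENT ... CHILD ... line into (parent, child) pairs."""
    tokens = line.split()
    keywords = [t.upper() for t in tokens]
    if not keywords or keywords[0] != "PARENT" or "CHILD" not in keywords:
        raise ValueError(f"not a PARENT/CHILD command: {line!r}")

    split_at = keywords.index("CHILD")
    parents = tokens[1:split_at]
    children = tokens[split_at + 1:]
    if not parents or not children:
        raise ValueError("at least one parent and one child are required")

    # Every child depends on every parent named on the line.
    return [(p, c) for p in parents for c in children]


print(expand_dependencies("PARENT p1 p2 CHILD c1 c2"))
# [('p1', 'c1'), ('p1', 'c2'), ('p2', 'c1'), ('p2', 'c2')]
```

Applied to the two alternative specifications of the diamond DAG, the function yields the same set of four edges, which is why the two forms are equivalent.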

Node Job Submit File Contents

SUBMIT-DESCRIPTION command

Submit descriptions can be declared inline as part of a JOB command (see Inline Submit Descriptions). They can also be declared independently of jobs using the SUBMIT-DESCRIPTION command. This helps reduce the size and improve the readability of a .dag file when many nodes run the same job.

A SUBMIT-DESCRIPTION can be defined using the following syntax:

SUBMIT-DESCRIPTION DescriptionName {
    # submit attributes go here
}

An independently declared submit description must have a unique name that is not used by any of the jobs. It can then be linked to a job as follows:

JOB JobName DescriptionName

For example, the previous diamond.dag example could be written as follows:

# File name: diamond.dag

SUBMIT-DESCRIPTION DiamondDesc {
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
}

JOB A DiamondDesc
JOB B DiamondDesc
JOB C DiamondDesc
JOB D DiamondDesc

PARENT A CHILD B C
PARENT B C CHILD D
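
When many nodes share one description, the .dag file is often generated by a script. The following Python sketch renders a file in the layout shown above, using only the SUBMIT-DESCRIPTION, JOB, and PARENT/CHILD syntax from this section. The function name render_dag and its parameters are assumptions made for illustration.

```python
def render_dag(desc_name, attributes, node_names, dependencies):
    """Render a DAG input file in which every node uses one shared description."""
    lines = [f"SUBMIT-DESCRIPTION {desc_name} {{"]
    width = max((len(key) for key in attributes), default=0)
    for key, value in attributes.items():
        lines.append(f"    {key.ljust(width)} = {value}")
    lines.append("}")
    lines.append("")

    # One short JOB line per node, each pointing at the shared description.
    lines.extend(f"JOB {name} {desc_name}" for name in node_names)
    lines.append("")

    for parents, children in dependencies:
        lines.append(f"PARENT {' '.join(parents)} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


dag_text = render_dag(
    "DiamondDesc",
    {"executable": "/path/diamond.exe", "universe": "vanilla"},
    ["A", "B", "C", "D"],
    [(["A"], ["B", "C"]), (["B", "C"], ["D"])],
)
print(dag_text)
```

Adding another node costs one JOB line rather than a repeated block of submit commands.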

Inline Submit Descriptions

Instead of using a submit description file, you can include an inline submit description directly inside the .dag file. An inline submit description is wrapped in { and } braces, with each submit command appearing on a separate line, just like the contents of a regular submit file. Using the previous diamond-shaped DAG example, the diamond.dag file would look like this:

# File name: diamond.dag

JOB  A  {
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
}
JOB  B  {
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
}
JOB  C  {
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
}
JOB  D  {
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
}
PARENT A CHILD B C
PARENT B C CHILD D

This can be helpful when managing many submit descriptions, because they can all be described in the same file instead of needing to switch between many files.

The main drawback of using inline submit descriptions is that they do not support the queue statement or any variations thereof. Any job described inline in the .dag file will only have a single instance submitted.

External File Descriptions

Each node in a DAG may use its own submit description file. A key limitation is that each submit description file must submit jobs that belong to a single cluster; DAGMan cannot handle a submit description file that produces multiple job clusters.

Consider again the diamond-shaped DAG example, where each node job uses the same submit description file.

# File name: diamond.dag

JOB  A  diamond_job.sub
JOB  B  diamond_job.sub
JOB  C  diamond_job.sub
JOB  D  diamond_job.sub
PARENT A CHILD B C
PARENT B C CHILD D

Here is a sample HTCondor submit description file for this DAG:

# File name: diamond_job.sub

executable   = /path/diamond.exe
output       = diamond.out.$(cluster)
error        = diamond.err.$(cluster)
log          = diamond_condor.log
request_cpus   = 1
request_memory = 1024M
request_disk   = 10240K

queue

Because every node uses the same HTCondor submit description file, every node within the DAG runs the same job. The $(cluster) macro produces unique file names for each job’s output and error files.

The job ClassAd attribute DAGParentNodeNames is also available for use within the submit description file. It defines a comma-separated list of the JobNames of this job’s parent nodes. This attribute may be used in the arguments command for all but scheduler universe jobs. For example, if the job has two parents, with JobNames B and C, the submit description file command

arguments = $$([DAGParentNodeNames])

will pass the string "B,C" as the command line argument when invoking the job.
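
On the job side, the executable receives that string as an ordinary command-line argument. As a sketch, assuming the job's executable is a Python script and that the parent list is its first argument, it could be split back into individual JobNames as follows. The function name parent_names is hypothetical.

```python
def parent_names(argv):
    """Split the comma-separated parent list passed as the first argument."""
    if len(argv) < 2 or not argv[1]:
        return []  # no argument, or an empty parent list
    return argv[1].split(",")


# Simulated argument vector for a job whose parents are nodes B and C.
print(parent_names(["diamond.exe", "B,C"]))  # ['B', 'C']
```

In a real job script, sys.argv would be passed in place of the simulated list.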

DAGMan supports submit description files that queue multiple procs within a single cluster. For example:

queue 500

will queue 500 procs as expected.