DAGMan Introduction
DAGMan is an HTCondor tool that allows multiple jobs to be organized in workflows, each represented as a directed acyclic graph (DAG). A DAGMan workflow automatically submits jobs in a particular order, such that certain jobs must complete before others start running. This allows the outputs of some jobs to be used as inputs for others, and makes it easy to replicate a workflow multiple times in the future.
Describing Workflows with DAGMan
A DAGMan workflow is described in a DAG input file. The input file specifies the nodes of the DAG as well as the dependencies that order the DAG.
A node within a DAG represents a unit of work. It contains the following:
Job: An HTCondor job, defined in a submit file.
PRE script (optional): A script that runs before the job starts. Typically used to verify that all inputs are valid.
POST script (optional): A script that runs after the job finishes. Typically used to verify outputs and clean up temporary files.
The following diagram illustrates the elements of a node – every node must contain a job, with an optional pre and an optional post script.
flowchart LR
    Start((Start)) --> Job
    Start --> PREscript
    subgraph DAG Node
        PREscript --> Job
        Job --> POSTscript
    end
    Job --> End((End))
    POSTscript --> End((End))
An edge in DAGMan describes a dependency between two nodes. DAG edges are directional; each has a parent and a child, where the parent node must finish running before the child starts. Any node can have an unlimited number of parents and children.
Example: Diamond DAG
A simple diamond-shaped DAG, as shown in the following image, is presented as a starting point for examples. This diamond DAG contains 4 nodes.
flowchart TD
    A --> B & C
    B & C --> D
A very simple DAG input file for this diamond-shaped DAG is:
# File name: diamond.dag
JOB A A.sub
JOB B B.sub
JOB C C.sub
JOB D D.sub
PARENT A CHILD B C
PARENT B C CHILD D
A set of basic commands appearing in a DAG input file is described below.
JOB
The JOB command specifies an HTCondor job. The syntax used for each JOB command is:
JOB JobName SubmitDescriptionFileName [DIR directory] [NOOP] [DONE]
A JOB entry maps a JobName to an HTCondor submit description file. The JobName uniquely identifies nodes within the DAG input file and in output messages. Each node name, given by JobName, within the DAG must be unique.
The values defined for JobName and SubmitDescriptionFileName are case sensitive, as file names in a file system are case sensitive. The JobName can be any string that contains no white space, except for the strings PARENT and CHILD (in upper, lower, or mixed case). JobName also cannot contain special characters (‘.’, ‘+’) which are reserved for system use.
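The naming rules above can be checked before a DAG file is written. The following is a minimal sketch that covers only the rules stated in this section; the function name is_valid_job_name is hypothetical, and DAGMan itself may apply further restrictions that are not modeled here.

```python
# Sketch only: encodes the JobName rules described in this section.
# is_valid_job_name is a hypothetical helper, not part of HTCondor.
RESERVED_WORDS = {"PARENT", "CHILD"}  # disallowed in upper, lower, or mixed case
RESERVED_CHARS = {".", "+"}           # reserved for system use


def is_valid_job_name(name: str) -> bool:
    """Return True if name satisfies the JobName rules stated above."""
    if not name:
        return False
    if any(ch.isspace() for ch in name):
        return False
    if name.upper() in RESERVED_WORDS:
        return False
    if any(ch in RESERVED_CHARS for ch in name):
        return False
    return True
```

For example, is_valid_job_name("step_1") is accepted, while "parent", "my job", and "a.b" are rejected.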
The optional DIR keyword specifies a working directory for this node, from which the HTCondor job will be submitted, and from which a PRE and/or POST script will be run. If a relative directory is specified, it is relative to the current working directory at the time the DAG is submitted. Note that a DAG containing DIR specifications cannot be run in conjunction with the -usedagdir command-line argument to condor_submit_dag.
The optional NOOP keyword identifies that the HTCondor job within the node is not to be submitted to HTCondor. This is useful for debugging a complex DAG structure, by marking jobs as NOOP to verify that the control flow through the DAG is correct. The NOOP keywords are then removed before submitting the DAG. Any PRE and POST scripts for jobs specified with NOOP are executed; to avoid running the PRE and POST scripts, comment them out. Even though the job specified with NOOP is not submitted, its submit description file must still exist.
The optional DONE keyword identifies a node as being already completed. This is mainly used by Rescue DAGs generated by DAGMan itself, in the event of a failure to complete the workflow. Users should generally not use the DONE keyword. The NOOP keyword is more flexible in avoiding the execution of a job within a node.
PARENT/CHILD Relationships
The PARENT … CHILD … command specifies the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any of its children may be started. A child node may only be started once all its parents have successfully completed.
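The ordering rule can be illustrated with a small simulation. This is a sketch of the rule only, not of how DAGMan is implemented: it uses the standard-library graphlib module, assumes every node succeeds, and ignores PRE and POST scripts. The function name ready_batches is hypothetical, and the dictionary encodes the diamond DAG used throughout this section.

```python
from graphlib import TopologicalSorter

# Map each node to the set of its parents (the diamond DAG).
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def ready_batches(parent_map):
    """Group nodes into batches; a node appears once all its parents are done."""
    sorter = TopologicalSorter(parent_map)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose parents have all finished
        batches.append(ready)
        sorter.done(*ready)                 # assume every node in the batch succeeds
    return batches


print(ready_batches(parents))  # [['A'], ['B', 'C'], ['D']]
```

Node D does not appear until both B and C have been marked done, which mirrors the requirement that a child starts only after all of its parents complete successfully.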
The syntax used for each dependency (PARENT/CHILD) command is:
PARENT ParentJobName [ParentJobName2 ... ] CHILD ChildJobName [ChildJobName2 ... ]
The PARENT keyword is followed by one or more ParentJobName(s). The CHILD keyword is followed by one or more ChildJobName(s). Each child job depends on every parent job within the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. The diamond-shaped DAG example may specify the dependencies with
PARENT A CHILD B C
PARENT B C CHILD D
An alternative specification for the diamond-shaped DAG may specify some or all of the dependencies on separate lines:
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
As a further example, the line
PARENT p1 p2 CHILD c1 c2
produces four dependencies:
p1 to c1
p1 to c2
p2 to c1
p2 to c2
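The expansion of one dependency line into individual edges can be sketched as a cross product of parents and children. The function name expand_dependency is hypothetical, and treating the PARENT and CHILD keywords as case-insensitive is an assumption made for this sketch.

```python
def expand_dependency(line: str) -> list[tuple[str, str]]:
    """Expand 'PARENT p... CHILD c...' into (parent, child) pairs."""
    tokens = line.split()
    keywords = [token.upper() for token in tokens]  # assumption: keywords ignore case
    if not tokens or keywords[0] != "PARENT" or "CHILD" not in keywords:
        raise ValueError(f"not a PARENT/CHILD line: {line!r}")
    split_at = keywords.index("CHILD")
    parent_names = tokens[1:split_at]
    child_names = tokens[split_at + 1:]
    if not parent_names or not child_names:
        raise ValueError("at least one parent and one child are required")
    # Each child depends on every parent named on the line.
    return [(parent, child) for parent in parent_names for child in child_names]


print(expand_dependency("PARENT p1 p2 CHILD c1 c2"))
# [('p1', 'c1'), ('p1', 'c2'), ('p2', 'c1'), ('p2', 'c2')]
```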
Node Job Submit File Contents
SUBMIT-DESCRIPTION command
In addition to declaring inline submit descriptions as part of a job, they can be declared independently of jobs using the SUBMIT-DESCRIPTION command. This can help reduce the size and improve the readability of a .dag file when many nodes run the same job.
A SUBMIT-DESCRIPTION can be defined using the following syntax:
SUBMIT-DESCRIPTION DescriptionName {
# submit attributes go here
}
An independently declared submit description must have a unique name that is not used by any of the jobs. It can then be linked to a job as follows:
JOB JobName DescriptionName
For example, the previous diamond.dag example could be written as follows:
# File name: diamond.dag
SUBMIT-DESCRIPTION DiamondDesc {
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
}
JOB A DiamondDesc
JOB B DiamondDesc
JOB C DiamondDesc
JOB D DiamondDesc
PARENT A CHILD B C
PARENT B C CHILD D
Inline Submit Descriptions
Instead of using a submit description file, you can alternatively include an inline submit description directly inside the .dag file. An inline submit description should be wrapped in { and } braces, with each argument appearing on a separate line, just like the contents of a regular submit file.
Using the previous diamond-shaped DAG example, the diamond.dag file would look like this:
# File name: diamond.dag
JOB A {
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
}
JOB B {
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
}
JOB C {
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
}
JOB D {
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
}
PARENT A CHILD B C
PARENT B C CHILD D
This can be helpful when managing many submit descriptions, because they can all be described in the same file instead of requiring you to switch between many files.
The main drawback of using inline submit descriptions is that they do not support the queue statement or any variations thereof. Any job described inline in the .dag file will only have a single instance submitted.
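Because the four inline blocks in the diamond example are identical, a file like this could also be produced by a short script. The following is a sketch under that assumption; render_inline_job and render_diamond_dag are hypothetical helpers, and the attribute values are copied from the example above.

```python
# Submit attributes shared by every node, taken from the diamond example.
attributes = {
    "executable": "/path/diamond.exe",
    "output": "diamond.out.$(cluster)",
    "error": "diamond.err.$(cluster)",
    "log": "diamond_condor.log",
    "universe": "vanilla",
}


def render_inline_job(name, attrs):
    """Render one JOB entry with an inline submit description."""
    lines = [f"JOB {name} {{"]
    lines += [f"{key} = {value}" for key, value in attrs.items()]
    lines.append("}")
    return "\n".join(lines)


def render_diamond_dag():
    """Render the full diamond.dag text with four identical inline jobs."""
    parts = ["# File name: diamond.dag"]
    parts += [render_inline_job(name, attributes) for name in "ABCD"]
    parts += ["PARENT A CHILD B C", "PARENT B C CHILD D"]
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    print(render_diamond_dag(), end="")
```

Note that no queue statement is emitted, consistent with the limitation described above.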
External File Descriptions
Each node in a DAG may use a unique submit description file. A key limitation is that each HTCondor submit description file must submit jobs described by a single cluster number; DAGMan cannot deal with a submit description file producing multiple job clusters.
Consider again the diamond-shaped DAG example, where each node job uses the same submit description file.
# File name: diamond.dag
JOB A diamond_job.sub
JOB B diamond_job.sub
JOB C diamond_job.sub
JOB D diamond_job.sub
PARENT A CHILD B C
PARENT B C CHILD D
Here is a sample HTCondor submit description file for this DAG:
# File name: diamond_job.sub
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
request_cpus = 1
request_memory = 1024M
request_disk = 10240K
queue
Since each node uses the same HTCondor submit description file, this implies that each node within the DAG runs the same job. The $(Cluster) macro produces unique file names for each job’s output.
The job ClassAd attribute DAGParentNodeNames is also available for use within the submit description file. It defines a comma separated list of each JobName which is a parent node of this job’s node. This attribute may be used in the arguments command for all but scheduler universe jobs. For example, if the job has two parents, with JobNames B and C, the submit description file command
arguments = $$([DAGParentNodeNames])
will pass the string "B,C" as the command line argument when invoking the job.
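On the job side, the executable receives that string as its first argument and can split it on commas. The following is a minimal sketch of such a job script, assuming the executable is written in Python; the function name parse_parent_names is hypothetical.

```python
import sys


def parse_parent_names(argument: str) -> list[str]:
    """Split the comma separated DAGParentNodeNames value into node names."""
    return [name for name in argument.split(",") if name]


if __name__ == "__main__":
    # With the arguments command shown above, sys.argv[1] would be "B,C".
    names = parse_parent_names(sys.argv[1]) if len(sys.argv) > 1 else []
    print("parent nodes:", names)
```

Filtering out empty strings means a node with no parents yields an empty list rather than a list containing one empty name.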
DAGMan supports jobs with queues of multiple procs, so for example:
queue 500
will queue 500 procs as expected.