Running and Managing DAGMan
Once a workflow has been set up in a .dag file, all that is left is to submit
the prepared workflow. A key concept to understand regarding the submission and
management of a DAGMan workflow is that the DAGMan process itself runs as an
HTCondor Scheduler universe job under the schedd on the Access Point (often
referred to as the DAGMan proper job), which in turn submits and manages all the
various jobs and scripts defined in the workflow.
Basic DAG Controls
DAG Submission
To submit a DAG, simply use condor_submit_dag or htcondor dag submit with the
DAG description file, run from the working directory where the DAG description
file is stored. This will automatically generate an HTCondor Scheduler universe
job submit file to execute condor_dagman and submit this job to HTCondor. The
generated DAG submit file is named <DAG Description File>.condor.sub.
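For example, submitting a DAG described in a file named diamond.dag (the file name is illustrative):
$ condor_submit_dag diamond.dag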
If desired, the generated submit description file can be modified prior to job
submission by doing the following:
$ condor_submit_dag -no_submit diamond.dag
$ vim diamond.dag.condor.sub
$ condor_submit diamond.dag.condor.sub
Since the condor_dagman process is an actual HTCondor job, all jobs managed by DAGMan are marked with the DAGMan proper job's ClusterId. This value is set in each managed job's ClassAd attribute DAGManJobId.
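This makes it easy to query all jobs belonging to a particular workflow. For example, assuming the DAGMan proper job has cluster id 1024 (the id is illustrative):
$ condor_q -constraint 'DAGManJobId == 1024'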
Warning
Do not submit the same DAG, with the same DAG description file, from the same working directory at the same time. This will cause unpredictable behavior and failures since both DAGMan jobs will attempt to use the same files to execute.
Single Submission of Multiple, Independent DAGs
Multiple independent DAGs described in various DAG description files can be submitted in a single instance of condor_submit_dag resulting in one condor_dagman job managing all DAGs. This is done by internally combining all independent DAGs into one large DAG with no inter-dependencies between the individual DAGs. To avoid possible node name collisions when producing the large DAG, DAGMan renames all the nodes. The renaming of nodes is controlled by DAGMAN_MUNGE_NODE_NAMES.
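For example, three independent DAGs could be handed to a single condor_submit_dag invocation (the file names A.dag, B.dag, and C.dag are illustrative):
$ condor_submit_dag A.dag B.dag C.dag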
When multiple DAGs are submitted like this, DAGMan sets the first DAG description
file provided on the command line as its primary DAG file, and uses the primary
DAG file when writing various files such as the *.dagman.out. In the case of
failure, DAGMan will produce a rescue file named <Primary DAG>_multi.rescue<XXX>.
See The Rescue DAG section for more information.
The success or failure of the independent DAGs is well defined. When multiple, independent DAGs are submitted with a single command, the success of the composite DAG is defined as the logical AND of the success of each independent DAG, and failure is defined as the logical OR of the failure of any of the independent DAGs.
DAG Monitoring
After submission, the progress of the DAG can be monitored by looking at the job event log file(s), observing the e-mail that job submission to HTCondor causes, or by using condor_q. Using just condor_q while a DAGMan workflow is running will display condensed information regarding the overall workflow progress under the DAGMan proper job as follows:
$ condor_q
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
Cole diamond.dag+1024 1/1 12:34 1 2 - 4 1025.0 ... 1026.0
Using condor_q with the -dag and -nobatch flags will display information about the DAGMan proper job and all jobs currently submitted/running as part of the DAGMan workflow as follows:
$ condor_q -dag -nobatch
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
1024.0 Cole 1/1 12:34 0+01:13:19 R 0 0.4 condor_dagman ...
1025.0 |-Node_B 1/1 13:44 0+00:03:19 R 0 0.4 diamond.sh ...
1026.0 |-Node_C 1/1 13:45 0+00:02:19 R 0 0.4 diamond.sh ...
In addition to basic job management, the DAGMan proper job holds a lot of extra information within its job ClassAd that can be queried with the -l flag or the recommended -af <Attributes> flag for condor_q, in association with the DAGMan proper job's id.
$ condor_q <dagman-job-id> -af Attribute-1 ... Attribute-N
$ condor_q -l <dagman-job-id>
A large amount of information about DAG progress and errors can be found in
the debug log file named <DAG Description File>.dagman.out. This file should
be saved if errors occur. This file is not removed between DAG executions;
all logged messages are appended to the file.
Status Information for the DAG in a ClassAd
The condor_dagman job places information about its status in its ClassAd as the following job ad attributes:
The attributes are grouped into DAG information, node information, and DAG process information.
Note
Most of this information is also available in the dagman.out file, and
DAGMan updates these ClassAd attributes every 2 minutes.
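As a quick sketch, a few of these attributes can be queried directly with -af. The attribute names used here (DAG_NodesTotal, DAG_NodesDone, DAG_NodesFailed) and the cluster id 1024 are assumptions for illustration; use condor_q -l on the DAGMan proper job to see the exact names published by your version:
$ condor_q 1024 -af DAG_NodesTotal DAG_NodesDone DAG_NodesFailed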
Removing a DAG
To remove a DAG simply use condor_rm on the condor_dagman job. This will remove both the DAGMan proper job and all node jobs, including sub-DAGs, from the HTCondor queue.
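For example, assuming the DAGMan proper job has cluster id 1024 (the id is illustrative):
$ condor_rm 1024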
A removed DAG will be considered failed unless the DAG has a FINAL node that succeeds.
In the case where a machine is scheduled to go down, DAGMan will clean up memory and exit. However, it will leave any submitted jobs in the HTCondor queue.
Suspending a Running DAG
It may be desired to temporarily suspend a running DAG. For example, the load may be high on the access point, and therefore it is desired to prevent DAGMan from submitting any more jobs until the load goes down. There are two ways to suspend (and resume) a running DAG.
1. Use condor_hold/condor_release on the condor_dagman job.
After placing the condor_dagman job on hold, no new node jobs will be submitted, and no scripts will be run. Any node jobs already in the HTCondor queue will continue undisturbed. Any running PRE or POST scripts will be killed. If the condor_dagman job is left on hold, it will remain in the HTCondor queue after all of the currently running node jobs are finished. To resume the DAG, use condor_release on the condor_dagman job.
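For example, to pause and later resume a DAG whose DAGMan proper job has cluster id 1024 (the id is illustrative):
$ condor_hold 1024
$ condor_release 1024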
Note
While the condor_dagman job is on hold, no updates will be made to the
*.dagman.out file.
2. Use a DAG halt file.
A DAG can be suspended by halting it with a halt file. This is a special file named
<DAG Description Filename>.halt
whose existence DAGMan periodically checks for. If found, the DAG enters the halted state: no PRE scripts are run and no new node jobs are submitted. Running node jobs will continue undisturbed, POST scripts will run, and the *.dagman.out log will still be updated. Once all running node jobs and POST scripts have finished, DAGMan will write a Rescue DAG and exit.
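A minimal sketch of halting and resuming a DAG this way, assuming the DAG description file is named diamond.dag; to resume, the halt file must be removed before DAGMan writes the Rescue DAG and exits:
$ touch diamond.dag.halt    # halt the DAG
$ rm diamond.dag.halt       # resume the DAG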
Note
If a halt file exists at DAG submission time, it is removed.
Warning
Neither condor_hold nor a DAG halt is propagated to sub-DAGs. In other words, if a parent DAG is held or halted, any sub-DAGs will continue to submit node jobs. However, these effects do apply to DAG splices, since splices are merged into the parent DAG and are controlled by a single condor_dagman instance.
File Paths in DAGs
condor_dagman assumes all relative paths in a DAG description file and its node job submit descriptions are relative to the current working directory where condor_submit_dag was run. This means all files declared in a DAG or its jobs are expected to be found, or will be written, relative to the DAG's working directory. All jobs will be submitted and all scripts will be run from the DAG's working directory.
For simple DAG structures this may be fine, but not for complex DAGs. To help reduce confusion about where things run or where files are written, the JOB command takes an optional keyword DIR <path>. This causes DAGMan to submit the node's job(s) and run the node's scripts from the specified directory.
JOB A A.submit DIR dirA
example/
├── sample.dag
└── dirA
├── A.input
├── A.submit
└── programA
If dealing with multiple independent DAGs separated into different directories, as shown below, then a single condor_submit_dag submission from the parent directory will fail to execute successfully, since all paths are now relative to the parent directory.
parent/
├── dag1
│ ├── A.input
│ ├── A.submit
│ ├── one.dag
│ └── programA
└── dag2
├── B.input
├── B.submit
├── programB
└── two.dag
Use the condor_submit_dag -usedagdir flag to execute each individual DAG from
its own directory. For this example, one.dag would run from the dag1 directory
and two.dag would run from dag2. All produced DAGMan files will be relative to
the primary DAG (the first DAG specified on the command line).
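For the layout above, the submission from the parent directory might look like the following sketch (the DAG file names come from the example directory tree):
$ condor_submit_dag -usedagdir dag1/one.dag dag2/two.dag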
Warning
Use of -usedagdir does not work in conjunction with a JOB command that specifies a working directory via the DIR keyword. Using both will be detected and generate an error.
Managing Large Numbers of Jobs
DAGMan provides many useful mechanisms to help submit and manage large numbers of jobs. This can be useful whether a DAG is structured via dependencies or is just a bag of loose jobs. Notable features of DAGMan are:
- Throttling
Throttling limits the number of submitted jobs at any point in time.
- Retrying failed nodes
Automatically re-run failed nodes to attempt a successful execution. For more information visit Retrying Failed Nodes.
- Scripts associated with node jobs
Perform simple tasks on the Access Point before and/or after a node’s job(s) execution. For more information visit DAGMan Scripts.
It is common for a large group of similar jobs to run under a DAG. It is also very common for some external program or script to produce these large DAGs and the files they need. There are generally two ways of organizing DAGs with large numbers of jobs to manage:
- Using a unique submit description for each node in the DAG
In this setup, a single DAG description file contains n nodes, each with its own unique submit description file, such as:
Example large DAG description using unique job description files
# Large DAG Example: sweep.dag w/ unique submit files
JOB job0 job0.sub
JOB job1 job1.sub
JOB job2 job2.sub
...
JOB job999 job999.sub
The benefit of this method is that an individual node's job(s) can easily be submitted separately at any time, at the cost of producing n unique files that need to be stored and managed.
- Using a shared submit description file and Custom Job Macros for Nodes
In this setup, a single DAG description file contains n nodes that share a single submit description file and utilize custom macros, added to each node with the VARS command, to vary each job, such as:
Example large DAG using shared job description file for all nodes
# Large DAG example: sweep.dag w/ shared submit file
JOB job0 common.sub
VARS job0 runnumber="0"
JOB job1 common.sub
VARS job1 runnumber="1"
JOB job2 common.sub
VARS job2 runnumber="2"
...
JOB job999 common.sub
VARS job999 runnumber="999"
The benefit of this method is that fewer files need to be produced, stored, and managed, at the cost of more complexity and a DAG description file roughly twice as long.
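A minimal sketch of what the shared submit description might look like, assuming an executable named sweep.sh (both common.sub and sweep.sh are illustrative names); the runnumber macro defined with VARS is referenced as $(runnumber):
# common.sub (illustrative shared submit description)
executable = sweep.sh
arguments  = $(runnumber)
output     = job$(runnumber).out
error      = job$(runnumber).err
log        = sweep.log
queue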
Note
Even though DAGMan can assist with the management of large numbers of jobs, DAGs managing several thousand jobs will produce a large number of files, making directory traversal difficult. Consider how the directory structure should look for large DAGs prior to creating and running them.
DAGMan Throttling
To prevent possible overloading of the condor_schedd and of resources on the Access Point where condor_dagman executes, DAGMan comes with built-in capabilities to help throttle/limit the load on the Access Point.
Throttling at DAG Submission
- Total nodes/clusters:
The total number of DAG nodes that can be submitted to the HTCondor queue at a time. This is specified either at submit time via condor_submit_dag's -maxjobs option or via the configuration option DAGMAN_MAX_JOBS_SUBMITTED.
- Idle Jobs:
The total number of idle jobs associated with nodes managed by DAGMan in the HTCondor queue at a time. If DAGMan submits jobs and goes over this limit, it will wait until the number of idle jobs under its management drops below this maximum before submitting more ready nodes. This is specified either at submit time via condor_submit_dag's -maxidle option or via the configuration option DAGMAN_MAX_JOBS_IDLE.
- PRE/POST scripts:
The total number of PRE and POST scripts DAGMan will execute at a time on the Access Point. These limits can be specified either via condor_submit_dag's -maxpre and -maxpost options or via the configuration options DAGMAN_MAX_PRE_SCRIPTS and DAGMAN_MAX_POST_SCRIPTS (see the combined example after this list).
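The submit-time options above can be combined in a single submission. A sketch, with illustrative limits and DAG file name:
$ condor_submit_dag -maxjobs 100 -maxidle 50 -maxpre 4 -maxpost 4 sweep.dag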
Editing DAG Throttles
The following throttling properties of a running DAG can be changed after the workflow has been started. The values of these properties are published in the condor_dagman job ad; changing any of these properties using condor_qedit will also update the internal DAGMan value.
Currently, you can change the following throttles, each published as an attribute in the condor_dagman job ad:
- Maximum number of running nodes
- Maximum number of idle jobs
- Maximum number of running PRE scripts
- Maximum number of running POST scripts
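As a sketch of the editing workflow (the attribute name is deliberately left as a placeholder; inspect the DAGMan proper job's ad to find the exact names published by your version):
$ condor_q -l <dagman-job-id> | grep -i max
$ condor_qedit <dagman-job-id> <throttle-attribute> 200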
Throttling Nodes by Category
DAGMan also allows limiting the number of running nodes (submitted job clusters) within a DAG at a finer granularity with the CATEGORY and MAXJOBS commands. The CATEGORY command assigns a DAG node to a category that can be referenced by the MAXJOBS command to limit the number of submitted job clusters on a per-category basis.
If the number of submitted job clusters for a given category reaches the limit, no further job clusters in that category will be submitted until other job clusters within the category terminate. If MAXJOBS is not set for a defined category, then there is no limit placed on the number of submissions within that category.
The configuration variable DAGMAN_MAX_JOBS_SUBMITTED and the condor_submit_dag -maxjobs command-line option are still enforced if these CATEGORY and MAXJOBS throttles are used.
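A minimal sketch of per-category throttling in a DAG description file (the node, submit file, and category names are illustrative):
# Limit the "download" category to 2 submitted job clusters at a time
JOB fetch1 fetch.sub
JOB fetch2 fetch.sub
JOB fetch3 fetch.sub
CATEGORY fetch1 download
CATEGORY fetch2 download
CATEGORY fetch3 download
MAXJOBS download 2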