The optional SCRIPT command specifies processing that is done either before a job within a node is submitted, after a job within a node completes its execution, or when a job goes on hold. All scripts run on the Access Point and not the Execution Point where the node job is likely to run.
PRE and POST scripts
Processing done before a job is submitted is called a PRE script. Processing done after a job completes its execution is called a POST script.
The script executable does not have to be a shell script (Unix) or batch file (Windows); but should be light weight since it runs directly on the AP.
The syntax used for PRE or POST commands is
SCRIPT [DEFER status time] [DEBUG filename type] PRE <JobName | ALL_NODES> ExecutableName [arguments]
SCRIPT [DEFER status time] [DEBUG filename type] POST <JobName | ALL_NODES> ExecutableName [arguments]
The SCRIPT command can use the PRE or POST keyword, which specifies the relative timing of when the script is to be run. The JobName identifies the node to which the script is attached. The ExecutableName specifies the executable (e.g., shell script or batch file) to be executed, and may not contain spaces. The optional arguments are command line arguments to the script, and spaces delimiting the arguments. Both ExecutableName and optional arguments are case sensitive.
A PRE script is commonly used to verify inputs for a node job that are produced by a parent node, and POST scripts can be used to turn a job execution failure into a successful node completion so the DAG doesn’t fail given a specific node job failure.
Additionally, the SCRIPT command can take a HOLD keyword, which indicates an executable to be run when a job goes on hold. These are typically used to notify a user when something goes wrong with their jobs.
The syntax used for a HOLD command is
SCRIPT [DEFER status time] [DEBUG filename type] HOLD <JobName | ALL_NODES> ExecutableName [arguments]
Unlike PRE and POST scripts, HOLD scripts are not considered part of the DAG workflow and are run on a best-effort basis. If one does not complete successfully, it has no effect on the overall workflow and no error will be reported.
The optional DEFER keyword causes a retry of only the script, if the execution of the script exits with the exit code given by status. The retry occurs after at least time seconds, rather than being considered failed. While waiting for the retry, the script does not count against a maxpre or maxpost limit. The ordering of the DEFER keyword within the SCRIPT specification is fixed. It must come directly after the SCRIPT keyword; this is done to avoid backward compatibility issues for any DAG with a JobName of DEFER.
The optional DEBUG keyword will capture a scripts specified standard output streams (STDOUT and/or STDERR) and write them to a specified debug file. This keyword is followed by two pieces of information:
Filename: File to write captured output into.
- Type: Type of output to capture. Takes one the following options:
ALL (Both STDOUT & STDERR)
This keyword is fixed to appear prior to the script type (PRE, POST, HOLD) and after any declared DEFER retries.
It is safe to have multiple scripts write to the same file as DAGMan captures all of the scripts output and writes everything at one time. This write also includes a dividing banner with useful information regarding that scripts execution.
Scripts as part of a DAG workflow
Scripts are executed on the access point; the access point is not necessarily the same machine upon which the node’s job is run. Further, a single cluster of HTCondor jobs may be spread across several machines.
If the PRE script fails, then the HTCondor job associated with the node
is not submitted, and the POST script is not run either (by default). However,
if the job is submitted, and there is a POST script, the POST script is always
run once the job finishes. (The behavior when the PRE script fails may be
changed to run the POST script by setting configuration variable
True or by passing the -AlwaysRunPost
argument to condor_submit_dag.)
Examples that use PRE or POST scripts
Examples use the diamond-shaped DAG. A first example uses a PRE script to expand a compressed file needed as input to each of the HTCondor jobs of nodes B and C. The DAG input file:
# File name: diamond.dag
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
SCRIPT PRE B pre.sh $JOB .gz
SCRIPT PRE C pre.sh $JOB .gz
PARENT A CHILD B C
PARENT B C CHILD D
pre.sh uses its command line arguments to form the file
name of the compressed file. The script contains
Therefore, the PRE script invokes
for node B, which uncompresses file
B.gz, placing the result in file
A second example uses the
$RETURN macro. The DAG input file contains
the POST script specification:
SCRIPT POST A stage-out job_status $RETURN
If the HTCondor job of node A exits with the value -1, the POST script is invoked as
stage-out job_status -1
The slightly different example POST script specification in the DAG input file
SCRIPT POST A stage-out job_status=$RETURN
invokes the POST script with
This example shows that when there is no space between the
and the variable
$RETURN, there is no substitution of the macro’s
Special script argument macros
DAGMan provides the following macros to be used for node script arguments. The use of these macros are limited to being used as individual command line arguments surrounded by spaces:
The special macros for all scripts:
$JOBevaluates to the (case sensitive) string defined for JobName.
$RETRYevaluates to an integer value set to 0 the first time a node is run, and is incremented each time the node is retried. See Retrying Failed Nodes for the description of how to cause nodes to be retried.
$MAX_RETRIESevaluates to an integer value set to the maximum number of retries for the node. Defaults to 0 if retries aren’t specified for a node.
$DAG_STATUSis the status of the DAG. Note that this macro’s value and definition is unrelated to the attribute named
DagStatusas defined for use in a node status file. This macro’s value is the same as the job ClassAd attribute
DAG_Statusthat is defined within the condor_dagman job’s ClassAd. This macro may have the following values:
1: error; an error condition different than those listed here
2: one or more nodes in the DAG have failed
3: the DAG has been aborted by an ABORT-DAG-ON specification
4: removed; the DAG has been removed by condor_rm
5: cycle; a cycle was found in the DAG
6: halted; the DAG has been halted (see Suspending a Running DAG)
$FAILED_COUNTis defined by the number of nodes that have failed in the DAG.
Macros for POST Scripts only:
$JOBIDevaluates to a representation of the HTCondor job ID [ClusterId.ProcId] of the node job. For nodes with multiple jobs in the same cluster, the
ProcIdvalue is the one of the last job within the cluster.
$RETURNvariable evaluates to the return value of the HTCondor job, if there is a single job within a cluster. With multiple jobs within the same cluster, there are two cases to consider. In the first case, all jobs within the cluster are successful; the value of
$RETURNwill be 0, indicating success. In the second case, one or more jobs from the cluster fail. When condor_dagman sees the first terminated event for a job that failed, it assigns that job’s return value as the value of
$RETURN, and it attempts to remove all remaining jobs within the cluster. Therefore, if multiple jobs in the cluster fail with different exit codes, a race condition determines which exit code gets assigned to
A job that dies due to a signal is reported with a
$RETURNvalue representing the additive inverse of the signal number. For example, SIGKILL (signal 9) is reported as -9. A job whose batch system submission fails is reported as -1001. A job that is externally removed from the batch system queue (by something other than condor_dagman) is reported as -1002.
$PRE_SCRIPT_RETURNvariable evaluates to the return value of the PRE script of a node, if there is one. If there is no PRE script, this value will be -1. If the node job was skipped because of failure of the PRE script, the value of
$RETURNwill be -1004 and the value of
$PRE_SCRIPT_RETURNwill be the exit value of the PRE script; the POST script can use this to see if the PRE script exited with an error condition, and assign success or failure to the node, as appropriate.