Workflow Metrics

For every DAG, a metrics file is created. The metrics file is named <dag_file_name>.metrics, where <dag_file_name> is the name of the DAG input file; for example, running diamond.dag produces a file named diamond.dag.metrics. In a workflow with nested DAGs, each nested DAG creates its own metrics file.

Here is an example metrics output file:

{
    "client":"condor_dagman",
    "version":"8.1.0",
    "planner":"/lfs1/devel/Pegasus/pegasus/bin/pegasus-plan",
    "planner_version":"4.3.0cvs",
    "type":"metrics",
    "wf_uuid":"htcondor-test-job_dagman_metrics-A-subdag",
    "root_wf_uuid":"htcondor-test-job_dagman_metrics-A",
    "start_time":1375313459.603,
    "end_time":1375313491.498,
    "duration":31.895,
    "exitcode":1,
    "dagman_id":"26",
    "parent_dagman_id":"11",
    "rescue_dag_number":0,
    "jobs":4,
    "jobs_failed":1,
    "jobs_succeeded":3,
    "dag_jobs":0,
    "dag_jobs_failed":0,
    "dag_jobs_succeeded":0,
    "total_jobs":4,
    "total_jobs_run":4,
    "total_job_time":0.000,
    "dag_status":2
}
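
Because the metrics file is plain JSON, it can be read with standard tools. The following is a minimal Python sketch; the file name diamond.dag.metrics is hypothetical and simply follows the naming convention described above.

import json

# Load a DAGMan metrics file; the file name here is hypothetical and
# follows the <dag_file_name>.metrics naming convention.
with open("diamond.dag.metrics") as f:
    metrics = json.load(f)

print(metrics["client"], metrics["version"])
print("duration (seconds):", metrics["duration"])
print("exit code:", metrics["exitcode"])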

Here is an explanation of each of the items in the file:

  • client: the name of the client workflow software; in the example, it is "condor_dagman"

  • version: the version of the client workflow software

  • planner: the workflow planner, as read from the braindump.txt file

  • planner_version: the planner software version, as read from the braindump.txt file

  • type: the type of data, "metrics"

  • wf_uuid: the workflow ID, generated by pegasus-plan, as read from the braindump.txt file

  • root_wf_uuid: the root workflow ID, which is relevant for nested workflows. It is generated by pegasus-plan, as read from the braindump.txt file.

  • start_time: the start time of the client, in epoch seconds, with millisecond precision

  • end_time: the end time of the client, in epoch seconds, with millisecond precision

  • duration: the duration of the client, in seconds, with millisecond precision

  • exitcode: the condor_dagman exit code

  • dagman_id: the value of the ClusterId attribute of the condor_dagman instance

  • parent_dagman_id: the value of the ClusterId attribute of the parent condor_dagman instance of this DAG; empty if this DAG is not a SUBDAG

  • rescue_dag_number: the number of the Rescue DAG being run, or 0 if not running a Rescue DAG

  • jobs: the number of nodes in the DAG input file, not including SUBDAG nodes

  • jobs_failed: the number of failed nodes in the workflow, not including SUBDAG nodes

  • jobs_succeeded: the number of successful nodes in the workflow, not including SUBDAG nodes; this includes jobs that succeeded after retries

  • dag_jobs: the number of SUBDAG nodes in the DAG input file

  • dag_jobs_failed: the number of SUBDAG nodes that failed

  • dag_jobs_succeeded: the number of SUBDAG nodes that succeeded

  • total_jobs: the total number of jobs in the DAG input file

  • total_jobs_run: the total number of nodes executed in the DAG. It should be equal to jobs_succeeded + jobs_failed + dag_jobs_succeeded + dag_jobs_failed; the sketch after this list checks this relationship

  • total_job_time: the sum of the time between the first execute event and the terminated event for all jobs that are not SUBDAGs

  • dag_status: the final status of the DAG, with the following values:

    • 0: OK

    • 1: error; an error condition different from the others listed here

    • 2: one or more nodes in the DAG have failed

    • 3: the DAG has been aborted by an ABORT-DAG-ON specification

    • 4: removed; the DAG has been removed by condor_rm

    • 5: a cycle was found in the DAG

    • 6: the DAG has been halted; see the Suspending a Running DAG section for an explanation of halting a DAG

    Note that any dag_status other than 0 corresponds to a non-zero exit code.
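
Continuing the sketch above, the dag_status value can be decoded and the total_jobs_run relationship checked as follows. This is only an illustration; the status descriptions are paraphrased from the list above, and the file name is again hypothetical.

import json

# Meanings of the dag_status codes, paraphrased from the list above.
DAG_STATUS = {
    0: "OK",
    1: "error",
    2: "one or more nodes failed",
    3: "aborted by an ABORT-DAG-ON specification",
    4: "removed by condor_rm",
    5: "a cycle was found in the DAG",
    6: "the DAG has been halted",
}

with open("diamond.dag.metrics") as f:  # hypothetical file name
    metrics = json.load(f)

status = metrics["dag_status"]
print("dag_status:", status, "-", DAG_STATUS.get(status, "unknown"))

# total_jobs_run should equal the number of succeeded and failed nodes,
# counting ordinary nodes and SUBDAG nodes separately.
expected = (metrics["jobs_succeeded"] + metrics["jobs_failed"]
            + metrics["dag_jobs_succeeded"] + metrics["dag_jobs_failed"])
if metrics["total_jobs_run"] != expected:
    print("warning: total_jobs_run does not match the per-category counts")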

The braindump.txt file is generated by pegasus-plan; its name is specified with the PEGASUS_BRAINDUMP_FILE environment variable. If that variable is not set, the file name defaults to braindump.txt, and the file is placed in the current directory.
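
The lookup rule just described can be expressed in a couple of lines of Python; this is only an illustration of the default behavior, not code taken from DAGMan or Pegasus.

import os

# Use PEGASUS_BRAINDUMP_FILE if it is set; otherwise fall back to
# braindump.txt in the current directory, as described above.
braindump_path = os.environ.get("PEGASUS_BRAINDUMP_FILE", "braindump.txt")
print("braindump file:", os.path.abspath(braindump_path))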

Note that the total_job_time value is always zero, because the calculation of that value has not yet been implemented.