Workflow Metrics

For every DAG, a metrics file is created. The metrics file is named <dag_file_name>.metrics, where <dag_file_name> is the name of the DAG input file; for example, running diamond.dag produces a file named diamond.dag.metrics. In a workflow with nested DAGs, each nested DAG creates its own metrics file.

Here is an example metrics output file:

{
    "client":"condor_dagman",
    "version":"8.1.0",
    "planner":"/lfs1/devel/Pegasus/pegasus/bin/pegasus-plan",
    "planner_version":"4.3.0cvs",
    "type":"metrics",
    "wf_uuid":"htcondor-test-job_dagman_metrics-A-subdag",
    "root_wf_uuid":"htcondor-test-job_dagman_metrics-A",
    "start_time":1375313459.603,
    "end_time":1375313491.498,
    "duration":31.895,
    "exitcode":1,
    "dagman_id":"26",
    "parent_dagman_id":"11",
    "rescue_dag_number":0,
    "jobs":4,
    "jobs_failed":1,
    "jobs_succeeded":3,
    "dag_jobs":0,
    "dag_jobs_failed":0,
    "dag_jobs_succeeded":0,
    "total_jobs":4,
    "total_jobs_run":4,
    "total_job_time":0.000,
    "dag_status":2
}
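
Because the metrics file is plain JSON, it can be read with standard tools. The following is a minimal Python sketch; the file name diamond.dag.metrics is hypothetical and simply follows the naming convention described above.

import json

# Load a DAGMan metrics file; the file name here is hypothetical and
# follows the <dag_file_name>.metrics naming convention.
with open("diamond.dag.metrics") as f:
    metrics = json.load(f)

print(metrics["client"], metrics["version"])
print("duration (seconds):", metrics["duration"])
print("exit code:", metrics["exitcode"])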

Here is an explanation of each of the items in the file:

  • client: the name of the client workflow software; in the example, it is "condor_dagman"

  • version: the version of the client workflow software

  • planner: the workflow planner, as read from the braindump.txt file

  • planner_version: the planner software version, as read from the braindump.txt file

  • type: the type of data, "metrics"

  • wf_uuid: the workflow ID, generated by pegasus-plan, as read from the braindump.txt file

  • root_wf_uuid: the root workflow ID, which is relevant for nested workflows. It is generated by pegasus-plan, as read from the braindump.txt file.

  • start_time: the start time of the client, in epoch seconds, with millisecond precision

  • end_time: the end time of the client, in epoch seconds, with millisecond precision

  • duration: the duration of the client, in seconds, with millisecond precision

  • exitcode: the condor_dagman exit code

  • dagman_id: the value of the ClusterId attribute of the condor_dagman instance

  • parent_dagman_id: the value of the ClusterId attribute of the parent condor_dagman instance of this DAG; empty if this DAG is not a SUBDAG

  • rescue_dag_number: the number of the Rescue DAG being run, or 0 if not running a Rescue DAG

  • jobs: the number of nodes in the DAG input file, not including SUBDAG nodes

  • jobs_failed: the number of failed nodes in the workflow, not including SUBDAG nodes

  • jobs_succeeded: the number of successful nodes in the workflow, not including SUBDAG nodes; this includes jobs that succeeded after retries

  • dag_jobs: the number of SUBDAG nodes in the DAG input file

  • dag_jobs_failed: the number of SUBDAG nodes that failed

  • dag_jobs_succeeded: the number of SUBDAG nodes that succeeded

  • total_jobs: the total number of jobs in the DAG input file

  • total_jobs_run: the total number of nodes executed in the DAG. It should be equal to jobs_succeeded + jobs_failed + dag_jobs_succeeded + dag_jobs_failed; the sketch after this list checks this relationship

  • total_job_time: the sum of the time between the first execute event and the terminated event for all jobs that are not SUBDAGs

  • dag_status: the final status of the DAG, with the following values:

    • 0: OK

    • 1: error; an error condition different from the others listed here

    • 2: one or more nodes in the DAG have failed

    • 3: the DAG has been aborted by an ABORT-DAG-ON specification

    • 4: removed; the DAG has been removed by condor_rm

    • 5: a cycle was found in the DAG

    • 6: the DAG has been halted; see the Suspending a Running DAG section for an explanation of halting a DAG

    Note that any dag_status other than 0 corresponds to a non-zero exit code.
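
Continuing the sketch above, the dag_status value can be decoded and the total_jobs_run relationship checked as follows. This is only an illustration; the status descriptions are paraphrased from the list above, and the file name is again hypothetical.

import json

# Meanings of the dag_status codes, paraphrased from the list above.
DAG_STATUS = {
    0: "OK",
    1: "error",
    2: "one or more nodes failed",
    3: "aborted by an ABORT-DAG-ON specification",
    4: "removed by condor_rm",
    5: "a cycle was found in the DAG",
    6: "the DAG has been halted",
}

with open("diamond.dag.metrics") as f:  # hypothetical file name
    metrics = json.load(f)

status = metrics["dag_status"]
print("dag_status:", status, "-", DAG_STATUS.get(status, "unknown"))

# total_jobs_run should equal the number of succeeded and failed nodes,
# counting ordinary nodes and SUBDAG nodes separately.
expected = (metrics["jobs_succeeded"] + metrics["jobs_failed"]
            + metrics["dag_jobs_succeeded"] + metrics["dag_jobs_failed"])
if metrics["total_jobs_run"] != expected:
    print("warning: total_jobs_run does not match the per-category counts")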

The braindump.txt file is generated by pegasus-plan; its name is specified with the PEGASUS_BRAINDUMP_FILE environment variable. If that variable is not set, the file name defaults to braindump.txt, and the file is placed in the current directory.
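
The lookup rule just described can be expressed in a couple of lines of Python; this is only an illustration of the default behavior, not code taken from DAGMan or Pegasus.

import os

# Use PEGASUS_BRAINDUMP_FILE if it is set; otherwise fall back to
# braindump.txt in the current directory, as described above.
braindump_path = os.environ.get("PEGASUS_BRAINDUMP_FILE", "braindump.txt")
print("braindump file:", os.path.abspath(braindump_path))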

Note that the total_job_time value is always zero, because the calculation of that value has not yet been implemented.