Workflow Metrics
DAGMan creates a metrics file for every DAG.
The file is named <dag_file_name>.metrics,
where <dag_file_name> is the name of the DAG input file. In a
workflow with nested DAGs, each nested DAG creates its own metrics
file.
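The naming rule can be sketched in Python. This is only an illustration: the helper name metrics_file_for is hypothetical, and the sketch assumes the metrics file sits alongside the DAG input file, which this section does not state.

```python
from pathlib import Path


def metrics_file_for(dag_file: str) -> Path:
    """Return the metrics file name derived from a DAG input file name.

    Assumption: the metrics file lives in the same directory as the
    DAG input file. Only the ".metrics" suffix comes from the text.
    """
    dag_path = Path(dag_file)
    # Append ".metrics" to the full file name, keeping any existing
    # extension such as ".dag".
    return dag_path.with_name(dag_path.name + ".metrics")
```

For example, metrics_file_for("diamond.dag") yields the name diamond.dag.metrics; the file name diamond.dag is made up for the illustration.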
Here is an example metrics output file:
{
"client":"condor_dagman",
"version":"8.1.0",
"planner":"/lfs1/devel/Pegasus/pegasus/bin/pegasus-plan",
"planner_version":"4.3.0cvs",
"type":"metrics",
"wf_uuid":"htcondor-test-job_dagman_metrics-A-subdag",
"root_wf_uuid":"htcondor-test-job_dagman_metrics-A",
"start_time":1375313459.603,
"end_time":1375313491.498,
"duration":31.895,
"exitcode":1,
"dagman_id":"26",
"parent_dagman_id":"11",
"rescue_dag_number":0,
"jobs":4,
"jobs_failed":1,
"jobs_succeeded":3,
"dag_jobs":0,
"dag_jobs_failed":0,
"dag_jobs_succeeded":0,
"total_jobs":4,
"total_jobs_run":4,
"total_job_time":0.000,
"dag_status":2
}
Here is an explanation of each of the items in the file:
client: the name of the client workflow software; in the example, it is "condor_dagman"
version: the version of the client workflow software
planner: the workflow planner, as read from the braindump.txt file
planner_version: the planner software version, as read from the braindump.txt file
type: the type of data, "metrics"
wf_uuid: the workflow ID, generated by pegasus-plan, as read from the braindump.txt file
root_wf_uuid: the root workflow ID, which is relevant for nested workflows. It is generated by pegasus-plan, as read from the braindump.txt file.
start_time: the start time of the client, in epoch seconds, with millisecond precision
end_time: the end time of the client, in epoch seconds, with millisecond precision
duration: the duration of the client, in seconds, with millisecond precision
exitcode: the condor_dagman exit code
dagman_id: the value of the ClusterId attribute of the condor_dagman instance
parent_dagman_id: the value of the ClusterId attribute of the parent condor_dagman instance of this DAG; empty if this DAG is not a SUBDAG
rescue_dag_number: the number of the Rescue DAG being run, or 0 if not running a Rescue DAG
jobs: the number of nodes in the DAG input file, not including SUBDAG nodes
jobs_failed: the number of failed nodes in the workflow, not including SUBDAG nodes
jobs_succeeded: the number of successful nodes in the workflow, not including SUBDAG nodes; this includes jobs that succeeded after retries
dag_jobs: the number of SUBDAG nodes in the DAG input file
dag_jobs_failed: the number of SUBDAG nodes that failed
dag_jobs_succeeded: the number of SUBDAG nodes that succeeded
total_jobs: the total number of jobs in the DAG input file
total_jobs_run: the total number of nodes executed in a DAG. It should be equal to jobs_succeeded + jobs_failed + dag_jobs_succeeded + dag_jobs_failed
total_job_time: the sum of the time between the first execute event and the terminated event for all jobs that are not SUBDAGs
dag_status: the final status of the DAG, with values
    0: OK
    1: error; an error condition different than those listed here
    2: one or more nodes in the DAG have failed
    3: the DAG has been aborted by an ABORT-DAG-ON specification
    4: removed; the DAG has been removed by condor_rm
    5: a cycle was found in the DAG
    6: the DAG has been halted; see the Suspending a Running DAG section for an explanation of halting a DAG
Note that any dag_status other than 0 corresponds to a non-zero exit code.
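The relationships described above can be checked mechanically. The following Python sketch parses the text of a metrics file and tests two of them: the total_jobs_run sum and the rule that a non-zero dag_status implies a non-zero exit code. The function name summarize_metrics, the DAG_STATUS table, and the keys of the returned dictionary are hypothetical names chosen for this illustration; they are not part of condor_dagman.

```python
import json

# Short labels for the dag_status values listed above (paraphrased).
DAG_STATUS = {
    0: "OK",
    1: "error",
    2: "one or more nodes failed",
    3: "aborted by ABORT-DAG-ON",
    4: "removed by condor_rm",
    5: "cycle found in the DAG",
    6: "halted",
}


def summarize_metrics(text: str) -> dict:
    """Parse metrics file contents and check the documented relationships."""
    m = json.loads(text)
    # total_jobs_run should equal the sum of these four counters.
    expected_run = (
        m["jobs_succeeded"]
        + m["jobs_failed"]
        + m["dag_jobs_succeeded"]
        + m["dag_jobs_failed"]
    )
    return {
        "status": DAG_STATUS.get(m["dag_status"], "unknown"),
        "counts_consistent": m["total_jobs_run"] == expected_run,
        # Any dag_status other than 0 should come with a non-zero exit code.
        "nonzero_status_has_nonzero_exit": (
            m["dag_status"] == 0 or m["exitcode"] != 0
        ),
    }
```

Applied to the example file shown earlier, the counters add up (3 + 1 + 0 + 0 = 4, matching total_jobs_run), and dag_status 2 is paired with exitcode 1, so both checks pass.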
The braindump.txt file is generated by pegasus-plan. Its name is
specified with the PEGASUS_BRAINDUMP_FILE environment variable; if
the variable is not set, the file name defaults to braindump.txt,
and the file is placed in the current directory.
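The lookup rule can be sketched as follows. The helper name braindump_path and its parameters are hypothetical. The sketch also makes two assumptions that this section does not confirm: an empty variable is treated like an unset one, and a relative name given in the variable is resolved against the current directory.

```python
import os
from pathlib import Path


def braindump_path(env=None, cwd=None) -> Path:
    """Resolve the braindump file location from the environment.

    env and cwd are injectable for testing; by default the real
    process environment and current directory are used.
    """
    env = os.environ if env is None else env
    # Assumption: an empty PEGASUS_BRAINDUMP_FILE counts as "not specified".
    name = env.get("PEGASUS_BRAINDUMP_FILE") or "braindump.txt"
    base = Path.cwd() if cwd is None else Path(cwd)
    # Assumption: relative names are resolved against the current
    # directory. An absolute name replaces the base when joined.
    return base / name
```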
Note that the total_job_time value is always zero, because the
calculation of that value has not yet been implemented.