Workflow Metrics
For every DAG, a metrics file is created. This metrics file is named
<dag_file_name>.metrics, where <dag_file_name> is the name of the DAG
input file. In a workflow with nested DAGs, each nested DAG creates its
own metrics file.
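The naming rule can be sketched as follows; "diamond.dag" is a hypothetical example file name, not one taken from this manual:

```python
# Sketch of the naming rule above: the metrics file name is the DAG
# input file name with ".metrics" appended.
def metrics_file_name(dag_file_name):
    return dag_file_name + ".metrics"

print(metrics_file_name("diamond.dag"))
```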
Here is an example metrics output file:
{
"client":"condor_dagman",
"version":"8.1.0",
"planner":"/lfs1/devel/Pegasus/pegasus/bin/pegasus-plan",
"planner_version":"4.3.0cvs",
"type":"metrics",
"wf_uuid":"htcondor-test-job_dagman_metrics-A-subdag",
"root_wf_uuid":"htcondor-test-job_dagman_metrics-A",
"start_time":1375313459.603,
"end_time":1375313491.498,
"duration":31.895,
"exitcode":1,
"dagman_id":"26",
"parent_dagman_id":"11",
"rescue_dag_number":0,
"jobs":4,
"jobs_failed":1,
"jobs_succeeded":3,
"dag_jobs":0,
"dag_jobs_failed":0,
"dag_jobs_succeeded":0,
"total_jobs":4,
"total_jobs_run":4,
"total_job_time":0.000,
"dag_status":2
}
Here is an explanation of each of the items in the file:
client
: the name of the client workflow software; in the example, it is "condor_dagman"

version
: the version of the client workflow software

planner
: the workflow planner, as read from the braindump.txt file

planner_version
: the planner software version, as read from the braindump.txt file

type
: the type of data, "metrics"

wf_uuid
: the workflow ID, generated by pegasus-plan, as read from the braindump.txt file

root_wf_uuid
: the root workflow ID, which is relevant for nested workflows; it is generated by pegasus-plan, as read from the braindump.txt file

start_time
: the start time of the client, in epoch seconds, with millisecond precision

end_time
: the end time of the client, in epoch seconds, with millisecond precision

duration
: the duration of the client, in seconds, with millisecond precision

exitcode
: the condor_dagman exit code

dagman_id
: the value of the ClusterId attribute of the condor_dagman instance

parent_dagman_id
: the value of the ClusterId attribute of the parent condor_dagman instance of this DAG; empty if this DAG is not a SUBDAG

rescue_dag_number
: the number of the Rescue DAG being run, or 0 if not running a Rescue DAG

jobs
: the number of nodes in the DAG input file, not including SUBDAG nodes

jobs_failed
: the number of failed nodes in the workflow, not including SUBDAG nodes

jobs_succeeded
: the number of successful nodes in the workflow, not including SUBDAG nodes; this includes jobs that succeeded after retries

dag_jobs
: the number of SUBDAG nodes in the DAG input file

dag_jobs_failed
: the number of SUBDAG nodes that failed

dag_jobs_succeeded
: the number of SUBDAG nodes that succeeded

total_jobs
: the total number of jobs in the DAG input file

total_jobs_run
: the total number of nodes executed in the DAG; it should be equal to jobs_succeeded + jobs_failed + dag_jobs_succeeded + dag_jobs_failed

total_job_time
: the sum of the time between the first execute event and the terminated event for all jobs that are not SUBDAGs

dag_status
: the final status of the DAG, with these values:
0: OK
1: error; an error condition different than those listed here
2: one or more nodes in the DAG have failed
3: the DAG has been aborted by an ABORT-DAG-ON specification
4: removed; the DAG has been removed by condor_rm
5: a cycle was found in the DAG
6: the DAG has been halted; see the Suspending a Running DAG section for an explanation of halting a DAG
Note that any dag_status other than 0 corresponds to a non-zero exit code.
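The fields described above can be consumed programmatically. Here is a minimal sketch in Python that labels the dag_status codes listed above and checks the stated total_jobs_run identity, using the field values from the example metrics file shown earlier; the summarize helper and its output format are illustrative, not part of DAGMan:

```python
import json

# Human-readable labels for the dag_status codes listed above.
DAG_STATUS = {
    0: "OK",
    1: "error",
    2: "one or more nodes failed",
    3: "aborted by ABORT-DAG-ON",
    4: "removed by condor_rm",
    5: "cycle found in the DAG",
    6: "DAG halted",
}

def summarize(metrics):
    """Summarize a parsed metrics dict and verify the total_jobs_run identity."""
    expected = (metrics["jobs_succeeded"] + metrics["jobs_failed"]
                + metrics["dag_jobs_succeeded"] + metrics["dag_jobs_failed"])
    assert metrics["total_jobs_run"] == expected
    return "%s: dag_status=%d (%s), %d/%d jobs succeeded" % (
        metrics["wf_uuid"], metrics["dag_status"],
        DAG_STATUS.get(metrics["dag_status"], "unknown"),
        metrics["jobs_succeeded"], metrics["total_jobs_run"])

# A subset of the example metrics file from this section.
example = json.loads("""{
  "wf_uuid": "htcondor-test-job_dagman_metrics-A-subdag",
  "jobs_succeeded": 3, "jobs_failed": 1,
  "dag_jobs_succeeded": 0, "dag_jobs_failed": 0,
  "total_jobs_run": 4, "dag_status": 2
}""")
print(summarize(example))
```

In practice the dict would come from `json.load()` on the <dag_file_name>.metrics file itself rather than an inline string.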
The braindump.txt file is generated by pegasus-plan; the name of the
braindump.txt file is specified with the PEGASUS_BRAINDUMP_FILE
environment variable. If not specified, the file name defaults to
braindump.txt, and the file is placed in the current directory.
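The default-resolution rule just described can be sketched as a simple environment lookup; this is an illustration of the rule, not code from pegasus-plan:

```python
import os

# Sketch: the braindump file name comes from the PEGASUS_BRAINDUMP_FILE
# environment variable, falling back to "braindump.txt" in the current
# directory when the variable is unset.
braindump = os.environ.get("PEGASUS_BRAINDUMP_FILE", "braindump.txt")
print(braindump)
```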
Note that the total_job_time
value is always zero, because the
calculation of that value has not yet been implemented.