Information that the condor_collector collects can be used to monitor a pool. The condor_status command can be used to display a snapshot of the current state of the pool. Monitoring systems can be set up to track the state over time, and they might go further, alerting the system administrator about exceptional conditions.
Support for the Ganglia monitoring system (http://ganglia.info/) is integral to HTCondor. Nagios (http://www.nagios.org/) is often used to provide alerts based on data from the Ganglia monitoring system. The condor_gangliad daemon provides an efficient way to take information from an HTCondor pool and supply it to the Ganglia monitoring system.
The condor_gangliad gathers up data as specified by its configuration, and it streamlines getting that data to the Ganglia monitoring system. Updates sent to Ganglia are done using the Ganglia shared libraries for efficiency.
If Ganglia is already deployed in the pool, the monitoring of HTCondor
is enabled by running the condor_gangliad daemon on a single machine
within the pool. If the machine chosen is the one running Ganglia's
gmetad, then the HTCondor configuration consists of adding
GANGLIAD to the definition of configuration variable DAEMON_LIST
on that machine. It may be advantageous to run the condor_gangliad
daemon on the same machine as is running the condor_collector daemon,
because on a large pool with many ClassAds, there is likely to be less
network traffic. If the condor_gangliad daemon is to run on a
different machine than the one running Ganglia's gmetad, modify
configuration variable GANGLIA_GSTAT_COMMAND to get the list of
monitored hosts from the master gmond program.
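For the common case where condor_gangliad runs on the gmetad host, the change amounts to a single line of configuration. A minimal sketch, assuming the stock daemon name GANGLIAD:

    # Add the Ganglia daemon to the set of daemons condor_master runs
    DAEMON_LIST = $(DAEMON_LIST) GANGLIAD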
If the pool does not use Ganglia, the pool can still be monitored by a separate server running Ganglia.
By default, the condor_gangliad will only propagate metrics to hosts
that are already monitored by Ganglia. Set configuration variable
GANGLIA_SEND_DATA_FOR_ALL_HOSTS to True to set up a
Ganglia host to monitor a pool not monitored by Ganglia or to handle a
heterogeneous pool where some hosts are not monitored. In this case,
the default graphs that Ganglia provides will not be present. However,
the HTCondor metrics will appear.
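A sketch of that setting:

    # Publish metrics even for hosts that Ganglia is not already monitoring
    GANGLIA_SEND_DATA_FOR_ALL_HOSTS = True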
On large pools, setting configuration variable
GANGLIAD_PER_EXECUTE_NODE_METRICS to False will
reduce the amount of data sent to Ganglia. The execute node data is the
least important to monitor. One can also limit the amount of data by
setting configuration variable GANGLIAD_REQUIREMENTS. Be aware that
aggregate sums over the entire pool will not be accurate if this
variable limits the ClassAds queried.
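A sketch combining both knobs; the GANGLIAD_REQUIREMENTS expression shown here, which keeps only scheduler ads, is purely illustrative:

    # Drop per-execute-node metrics on a large pool
    GANGLIAD_PER_EXECUTE_NODE_METRICS = False
    # Only consider scheduler ClassAds (example constraint)
    GANGLIAD_REQUIREMENTS = MyType == "Scheduler"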
Metrics to be sent to Ganglia are specified in all files within the
directory specified by configuration variable
GANGLIAD_METRICS_CONFIG_DIR. Each file in the directory
is read, and the format within each file is that of New ClassAds. Here
is an example of a single metric definition given as a New ClassAd:
[ Name = "JobsSubmitted"; Desc = "Number of jobs submitted"; Units = "jobs"; TargetType = "Scheduler"; ]
A nice set of default metrics is in the file /etc/condor/ganglia.d/00_default_metrics.
Recognized metric attribute names and their use:
- Name: The name of this metric, which corresponds to the ClassAd attribute name. Metrics published for the same machine must have unique names.
- Value: A ClassAd expression that produces the value when evaluated. The default value is the value in the daemon ClassAd of the attribute with the same name as this metric.
- Desc: A brief description of the metric. This string is displayed when the user holds the mouse over the Ganglia graph for the metric.
- Verbosity: The integer verbosity level of this metric. Metrics with a higher verbosity level than that specified by configuration variable GANGLIAD_VERBOSITY will not be published.
- TargetType: A string containing a comma-separated list of daemon ClassAd types that this metric monitors. The specified values should match the value of MyType of the daemon ClassAd. In addition, there are special values that may be included. "Machine_slot1" may be specified to monitor the machine ClassAd for slot 1 only. This is useful when monitoring machine-wide attributes. The special value "ANY" matches any type of ClassAd.
- Requirements: A boolean expression that may restrict how this metric is incorporated. It defaults to True, which places no restrictions on the collection of this ClassAd metric.
- Title: The graph title used for this metric. The default is the metric name.
- Group: A string specifying the name of this metric's group. Metrics are arranged by group within a Ganglia web page. The default is determined by the daemon type. Metrics in different groups must have unique names.
- Cluster: A string specifying the cluster name for this metric. The default cluster name is taken from configuration variable GANGLIAD_DEFAULT_CLUSTER.
- Units: A string describing the units of this metric.
- Scale: A scaling factor that is multiplied by the value of the Value attribute. The scale factor is used when the value is not in the basic unit or a human-interpretable unit. For example, duty cycle is commonly expressed as a percent, but the HTCondor value ranges from 0 to 1, so duty cycle is scaled by 100. Some metrics are reported in KiB; scaling by 1024 allows Ganglia to pick the appropriate units, such as number of bytes rather than number of KiB. When scaling by large values, converting to the "float" type is recommended.
- Derivative: A boolean value that specifies if Ganglia should graph the derivative of this metric. Ganglia versions prior to 3.4 do not support this.
- Type: A string specifying the type of the metric. Possible values are "double", "float", "int32", "uint32", "int16", "uint16", "int8", "uint8", and "string". The default is "string" for string values, "int32" for integer values, "float" for real values, and "int8" for boolean values. Integer values can be coerced to "float" or "double". This is especially important for values stored internally as 64-bit values.
- Regex: This string value specifies a regular expression that matches attributes to be monitored by this metric. This is useful for dynamic attributes that cannot be enumerated in advance, because their names depend on dynamic information such as the users who are currently running jobs. When this is specified, one metric per matching attribute is created. The default metric name is the name of the matched attribute, and the default value is the value of that attribute. As usual, the Value expression may be used when the raw attribute value needs to be manipulated before publication. However, since the name of the attribute is not known in advance, a special ClassAd attribute in the daemon ClassAd is provided to allow the Value expression to refer to it. This special attribute is named Regex. Another special feature is the ability to refer to text matched by regular expression groups defined by parentheses within the regular expression. These may be substituted into the values of other string attributes such as Desc. This is done by putting macros in the string values. "\1" is replaced by the first group, "\2" by the second group, and so on.
- Aggregate: This string value specifies an aggregation function to apply, instead of publishing individual metrics for each daemon ClassAd. Possible values are "sum", "avg", "max", and "min". (A sketch of an aggregate metric follows this list.)
- AggregateGroup: When an aggregate function has been specified, this string value specifies which aggregation group the current daemon ClassAd belongs to. The default is the metric Name. This feature works like GROUP BY in SQL. The aggregation function produces one result per value of AggregateGroup. A single aggregate group would therefore be appropriate for a pool-wide metric. As an example, to publish the sum of an attribute across different types of slot ClassAds, make the metric name an expression that is unique to each type; the default AggregateGroup would be set accordingly. Note that the assumption is still that the result is a pool-wide metric, so by default it is associated with the condor_collector daemon's host. To group by machine and publish the result into the Ganglia page associated with each machine, make the AggregateGroup contain the machine name and override the default Machine attribute to be the daemon's machine name, rather than the condor_collector daemon's machine name.
- Machine: The name of the host associated with this metric. If configuration variable GANGLIAD_DEFAULT_MACHINE is not specified, the default is taken from the Machine attribute of the daemon ClassAd. If the daemon name is of the form name@hostname, this may indicate that there are multiple instances of HTCondor running on the same machine. To avoid the metrics from these instances overwriting each other, the default machine name is set to the daemon name in this case. For aggregate metrics, the default value of Machine will be the name of the condor_collector host.
- IP: A string containing the IP address of the host associated with this metric. If GANGLIAD_DEFAULT_IP is not specified, the default is extracted from the MyAddress attribute of the daemon ClassAd. This value must be unique for each machine published to Ganglia. It need not be a valid IP address. If the value of Machine contains an "@" sign, the default IP value will be set to the same value as Machine in order to make the IP value unique to each instance of HTCondor running on the same host.
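As a concrete illustration of aggregation, here is a sketch of a metric that sums the standard scheduler ClassAd attribute TotalRunningJobs across the pool; the metric name and description are invented for this example:

    [
      Name       = "PoolRunningJobs";
      Value      = TotalRunningJobs;
      Desc       = "Running jobs summed over all schedulers in the pool";
      Units      = "jobs";
      Aggregate  = "sum";
      TargetType = "Scheduler";
    ]

Because Aggregate is "sum" and no AggregateGroup is given, every scheduler ad falls into the single group named after the metric, yielding one pool-wide value associated with the condor_collector host.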
By default, HTCondor assumes that resources are transient: the
condor_collector will discard ClassAds older than
CLASSAD_LIFETIME seconds. Its
default configuration value is 15 minutes, and as such, the default
UPDATE_INTERVAL will pass
three times before HTCondor forgets about a resource. In some pools,
especially those with dedicated resources, this approach may make it
unnecessarily difficult to determine what the composition of the pool
ought to be, in the sense of knowing which machines would be in the
pool, if HTCondor were properly functioning on all of them.
This assumption of transient machines can be modified by the use of
absent ClassAds. When a machine ClassAd would otherwise expire, the
condor_collector evaluates the configuration variable
ABSENT_REQUIREMENTS against the
machine ClassAd. If the expression evaluates to
True, the machine ClassAd will be saved in a
persistent manner and be marked as absent; this causes the machine to
appear in the output of
condor_status -absent. When the machine
returns to the pool, its first update to the condor_collector will
invalidate the absent machine ClassAd.
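A sketch of the simplest policy, which marks every expiring machine ad as absent (the expression may be any constraint over the machine ClassAd):

    # Keep all expiring machine ads around as absent ads
    ABSENT_REQUIREMENTS = True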
Absent ClassAds, like offline ClassAds, are stored to disk to ensure
that they are remembered, even across condor_collector crashes. The
configuration variable COLLECTOR_PERSISTENT_AD_LOG defines the file
in which the ClassAds are stored, and it replaces the no longer used
variable OFFLINE_LOG. Absent ClassAds are retained on disk as
maintained by the condor_collector for a length of time in seconds
defined by configuration variable ABSENT_EXPIRE_ADS_AFTER. A value of
0 for this variable means that the ClassAds are never discarded, and
the default value is 30 days.
Absent ClassAds are only returned by the condor_collector and displayed when the -absent option to condor_status is specified, or when the Absent machine ClassAd attribute is mentioned on the condor_status command line. This renders absent ClassAds invisible to the rest of the HTCondor infrastructure.
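For example, to list absent machines and when each was last heard from (absent ads carry the attribute Absent set to True):

    condor_status -absent
    condor_status -absent -af Machine LastHeardFrom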
A daemon may inform the condor_collector that the daemon’s ClassAd
should not expire, but should be removed right away; the daemon asks for
its ClassAd to be invalidated. It may be useful to place an invalidated
ClassAd in the absent state, instead of having it removed as an
invalidated ClassAd. An example of a ClassAd that could benefit from
being absent is a system with an uninterruptible power supply that shuts
down cleanly but unexpectedly as a result of a power outage. To cause
all invalidated ClassAds to become absent instead of invalidated, set
configuration variable EXPIRE_INVALIDATED_ADS to True. Invalidated
ClassAds will instead be treated as if they expired, including when
evaluating ABSENT_REQUIREMENTS.
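The corresponding configuration:

    # Treat invalidated ads as expired, so ABSENT_REQUIREMENTS applies to them
    EXPIRE_INVALIDATED_ADS = True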
HTCondor supports monitoring GPU utilization for NVidia GPUs. This feature
is enabled by default if you set use feature : GPUs in your configuration.
Doing so will cause the condor_startd to run the condor_gpu_utilization
tool. This tool polls the (NVidia) GPU device(s) in the system and records
their utilization and memory usage values. At regular intervals, the tool
reports these values to the condor_startd, which assigns the usage of each
device to the slot(s) to which that device has been assigned.
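A minimal configuration sketch; use feature : GPUs is the standard metaknob that enables GPU detection and, with it, this monitoring:

    # Detect GPUs and start condor_gpu_utilization on the execute node
    use feature : GPUs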
Please note that condor_gpu_utilization cannot presently assign GPU
utilization directly to HTCondor jobs. As a result, jobs sharing a GPU
device, or a GPU device being used from outside HTCondor, will result
in GPU usage and utilization being misreported accordingly.
However, this approach does simplify monitoring for the owner/administrator of the GPUs, because usage is reported by the condor_startd in addition to the jobs themselves.
Currently, you need to query the condor_startd directly to see these attributes:
- DeviceGPUsAverageUsage: The number of GPU-seconds accumulated over this startd's uptime.
- DeviceGPUsMemoryPeakUsage: The largest amount of GPU memory used during this startd's uptime.
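A query sketch using condor_status autoformatting; the constraint merely filters out slots that have no GPU usage data:

    condor_status -constraint 'DeviceGPUsAverageUsage isnt undefined' \
        -af Machine DeviceGPUsAverageUsage DeviceGPUsMemoryPeakUsage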