High Availability Configuration Options

These macros affect the high availability operation of HTCondor.

MASTER_HA_LIST

Similar to DAEMON_LIST, this macro defines a list of daemons that the condor_master starts and keeps its watchful eyes on. However, the MASTER_HA_LIST daemons are run in a High Availability mode. The list is a comma or space separated list of subsystem names (as listed in Pre-Defined Macros). For example,

MASTER_HA_LIST = SCHEDD

The High Availability feature allows for several condor_master daemons (most likely on separate machines) to work together to insure that a particular service stays available. These condor_master daemons ensure that one and only one of them will have the listed daemons running.

To use this feature, the lock URL must be set with HA_LOCK_URL.

Currently, only file URLs are supported (those with file:...). The default value for MASTER_HA_LIST is the empty string, which disables the feature.

HA_LOCK_URL

This macro specifies the URL that the condor_master processes use to synchronize for the High Availability service. Currently, only file URLs are supported; for example, file:/share/spool. Note that this URL must be identical for all condor_master processes sharing this resource. For condor_schedd sharing, we recommend setting up SPOOL on an NFS share and having all High Availability condor_schedd processes sharing it, and setting the HA_LOCK_URL to point at this directory as well. For example:

MASTER_HA_LIST = SCHEDD
SPOOL = /share/spool
HA_LOCK_URL = file:/share/spool
VALID_SPOOL_FILES = SCHEDD.lock

A separate lock is created for each High Availability daemon.

There is no default value for HA_LOCK_URL.

Lock files are in the form <SUBSYS>.lock. condor_preen is not currently aware of the lock files and will delete them if they are placed in the SPOOL directory, so be sure to add <SUBSYS>.lock to VALID_SPOOL_FILES for each High Availability daemon.

HA_<SUBSYS>_LOCK_URL

This macro controls the High Availability lock URL for a specific subsystem as specified in the configuration variable name, and it overrides the system-wide lock URL specified by HA_LOCK_URL. If not defined for each subsystem, HA_<SUBSYS>_LOCK_URL is ignored, and the value of HA_LOCK_URL is used.

List of possible subsystems to set <SUBSYS> can be found at SUBSYSTEM.

HA_LOCK_HOLD_TIME

This macro specifies the number of seconds that the condor_master will hold the lock for each High Availability daemon. Upon gaining the shared lock, the condor_master will hold the lock for this number of seconds. Additionally, the condor_master will periodically renew each lock as long as the condor_master and the daemon are running. When the daemon dies, or the condor_master exists, the condor_master will immediately release the lock(s) it holds.

HA_LOCK_HOLD_TIME defaults to 3600 seconds (one hour).

HA_<SUBSYS>_LOCK_HOLD_TIME

This macro controls the High Availability lock hold time for a specific subsystem as specified in the configuration variable name, and it overrides the system wide poll period specified by HA_LOCK_HOLD_TIME. If not defined for each subsystem, HA_<SUBSYS>_LOCK_HOLD_TIME is ignored, and the value of HA_LOCK_HOLD_TIME is used.

List of possible subsystems to set <SUBSYS> can be found at SUBSYSTEM.

HA_POLL_PERIOD

This macro specifies how often the condor_master polls the High Availability locks to see if any locks are either stale (meaning not updated for HA_LOCK_HOLD_TIME seconds), or have been released by the owning condor_master. Additionally, the condor_master renews any locks that it holds during these polls.

HA_POLL_PERIOD defaults to 300 seconds (five minutes).

HA_<SUBSYS>_POLL_PERIOD

This macro controls the High Availability poll period for a specific subsystem as specified in the configuration variable name, and it overrides the system wide poll period specified by HA_POLL_PERIOD. If not defined for each subsystem, HA_<SUBSYS>_POLL_PERIOD is ignored, and the value of HA_POLL_PERIOD is used.

List of possible subsystems to set <SUBSYS> can be found at SUBSYSTEM.

MASTER_<SUBSYS>_CONTROLLER

Used only in HA configurations involving the condor_had.

The condor_master has the concept of a controlling and controlled daemon, typically with the condor_had daemon serving as the controlling process. In this case, all condor_on and condor_off commands directed at controlled daemons are given to the controlling daemon, which then handles the command, and, when required, sends appropriate commands to the condor_master to do the actual work. This allows the controlling daemon to know the state of the controlled daemon.

As of 6.7.14, this configuration variable must be specified for all configurations using condor_had. To configure the condor_negotiator controlled by condor_had:

MASTER_NEGOTIATOR_CONTROLLER = HAD

The macro is named by substituting <SUBSYS> with the appropriate subsystem string as defined by SUBSYSTEM.

HAD_LIST

A comma-separated list of all condor_had daemons in the form IP:port or hostname:port. Each central manager machine that runs the condor_had daemon should appear in this list. If HAD_USE_PRIMARY is set to True, then the first machine in this list is the primary central manager, and all others in the list are backups.

All central manager machines must be configured with an identical HAD_LIST. The machine addresses are identical to the addresses defined in COLLECTOR_HOST.

HAD_USE_PRIMARY

Boolean value to determine if the first machine in the HAD_LIST configuration variable is a primary central manager. Defaults to False.

HAD_CONTROLLEE

This variable is used to specify the name of the daemon which the condor_had daemon controls. This name should match the daemon name in the condor_master daemon’s DAEMON_LIST definition. The default value is NEGOTIATOR.

HAD_CONNECTION_TIMEOUT

The time (in seconds) that the condor_had daemon waits before giving up on the establishment of a TCP connection. The failure of the communication connection is the detection mechanism for the failure of a central manager machine. For a LAN, a recommended value is 2 seconds. The use of authentication (by HTCondor) increases the connection time. The default value is 5 seconds. If this value is set too low, condor_had daemons will incorrectly assume the failure of other machines.

HAD_ARGS

Command line arguments passed by the condor_master daemon as it invokes the condor_had daemon. To make high availability work, the condor_had daemon requires the port number it is to use. This argument is of the form

-p $(HAD_PORT_NUMBER)

where HAD_PORT_NUMBER is a helper configuration variable defined with the desired port number. Note that this port number must be the same value here as used in HAD_LIST. There is no default value.

HAD

The path to the condor_had executable. Normally it is defined relative to $(SBIN). This configuration variable has no default value.

MAX_HAD_LOG

Controls the maximum length in bytes to which the condor_had daemon log will be allowed to grow. It will grow to the specified length, then be saved to a file with the suffix .old. The .old file is overwritten each time the log is saved, thus the maximum space devoted to logging is twice the maximum length of this log file. A value of 0 specifies that this file may grow without bounds. The default is 1 MiB.

HAD_DEBUG

Logging level for the condor_had daemon. See <SUBSYS>_DEBUG for values.

HAD_LOG

Full path and file name of the log file. The default value is $(LOG)/HADLog.

HAD_FIPS_MODE

Controls what type of checksum will be sent along with files that are replicated. Set it to 0 for MD5 checksums and to 1 for SHA-2 checksums. Prior to versions 8.8.13 and 8.9.12 only MD5 checksums are supported. In the 10.0 and later release of HTCondor, MD5 support will be removed and only SHA-2 will be supported. This configuration variable is intended to provide a transition between the 8.8 and 9.0 releases. Once all machines in your pool involved in HAD replication have been upgraded to 9.0 or later, you should set the value of this configuration variable to 1. Default value is 0 in HTCondor versions before 9.12 and 1 in version 9.12 and later.

REPLICATION_LIST

A comma-separated list of all condor_replication daemons in the form IP:port or hostname:port. Each central manager machine that runs the condor_had daemon should appear in this list. All potential central manager machines must be configured with an identical REPLICATION_LIST.

STATE_FILE

A full path and file name of the file protected by the replication mechanism. When not defined, the default path and file used is

$(SPOOL)/Accountantnew.log
REPLICATION_INTERVAL

Sets how often the condor_replication daemon initiates its tasks of replicating the $(STATE_FILE). It is defined in seconds and defaults to 300 (5 minutes).

MAX_TRANSFER_LIFETIME

A timeout period within which the process that transfers the state file must complete its transfer. The recommended value is 2 * average size of state file / network rate. It is defined in seconds and defaults to 300 (5 minutes).

HAD_UPDATE_INTERVAL

Like UPDATE_INTERVAL, determines how often the condor_had is to send a ClassAd update to the condor_collector. Updates are also sent at each and every change in state. It is defined in seconds and defaults to 300 (5 minutes).

HAD_USE_REPLICATION

A boolean value that defaults to False. When True, the use of condor_replication daemons is enabled.

REPLICATION_ARGS

Command line arguments passed by the condor_master daemon as it invokes the condor_replication daemon. To make high availability work, the condor_replication daemon requires the port number it is to use. This argument is of the form

-p $(REPLICATION_PORT_NUMBER)

where REPLICATION_PORT_NUMBER is a helper configuration variable defined with the desired port number. Note that this port number must be the same value as used in REPLICATION_LIST. There is no default value.

REPLICATION

The full path and file name of the condor_replication executable. It is normally defined relative to $(SBIN). There is no default value.

MAX_REPLICATION_LOG

Controls the maximum length in bytes to which the condor_replication daemon log will be allowed to grow. It will grow to the specified length, then be saved to a file with the suffix .old. The .old file is overwritten each time the log is saved, thus the maximum space devoted to logging is twice the maximum length of this log file. A value of 0 specifies that this file may grow without bounds. The default is 1 MiB.

REPLICATION_DEBUG

Logging level for the condor_replication daemon. See <SUBSYS>_DEBUG for values.

REPLICATION_LOG

Full path and file name to the log file. The default value is $(LOG)/ReplicationLog.

TRANSFERER

The full path and file name of the condor_transferer executable. The default value is $(LIBEXEC)/condor_transferer.

TRANSFERER_LOG

Full path and file name to the log file. The default value is $(LOG)/TransfererLog.

TRANSFERER_DEBUG

Logging level for the condor_transferer daemon. See <SUBSYS>_DEBUG for values.

MAX_TRANSFERER_LOG

Controls the maximum length in bytes to which the condor_transferer daemon log will be allowed to grow. A value of 0 specifies that this file may grow without bounds. The default is 1 MiB.