Administrative Quick Start Guide
This guide does not contain step-by-step instructions for getting HTCondor. Rather, it is a guide to joining multiple machines into a single pool of computational resources for use by HTCondor jobs.
This guide begins by briefly describing the three roles required by every
HTCondor pool, as well as the resources and networking required by each
of those roles. This information will enable you to choose which machine(s)
will perform which role(s). This guide also includes instructions on how to use the
get_htcondor tool to install and configure Linux (or Mac) machines
to perform each of the roles.
If you’re curious, using Windows machines, or want to automate the
configuration of your pool using a tool like Puppet, the
last section of this guide briefly describes what the
get_htcondor tool does and provides a link to the rest of the details.
The Three Roles
Even a single-machine installation of HTCondor performs all three roles.
The Execute Role
The most common reason for adding a machine to an HTCondor pool is to make another machine execute HTCondor jobs; the first major role, therefore, is the execute role. This role is responsible for the technical aspects of actually running, monitoring, and managing the job’s executable; transferring the job’s input and output; and advertising, monitoring, and managing the resources of the execute machine. HTCondor can manage pools containing tens of thousands of execute machines, so this is by far the most common role.
The execute role itself uses very few resources, so almost any machine can contribute to a pool. The execute role can run on a machine with only outbound network connectivity, but being able to accept inbound connections from the machine(s) performing the submit role will simplify setup and reduce overhead. The execute machine does not need to allow user access, or even share user IDs with other machines in the pool (although this may be very convenient, especially on Windows).
The Submit Role
We’ll discuss what “advertising” a machine’s resources means in the next section, but the execute role leaves an obvious question unanswered: where do the jobs come from? The answer is the submit role. This role is responsible for accepting, monitoring, managing, and scheduling jobs on its assigned resources; transferring the input and output of jobs; and requesting and accepting resource assignments. (A “resource” is some reserved fraction of an execute machine.) HTCondor allows arbitrarily many submit roles in a pool, but for administrative convenience, most pools have only one machine, or a small number of machines, acting in the submit role.
A submit-role machine (also called an access point) requires a bit under a megabyte of RAM for each running job, and its ability to transfer data to and from the execute-role machines may become a performance bottleneck. We typically recommend adding another access point for every twenty thousand simultaneously running jobs. An access point must have outbound network connectivity, but an access point without inbound network connectivity can’t use execute-role machines that also lack inbound network connectivity. As execute machines are more numerous, access points typically allow inbound connections. Although you may allow users to submit jobs over the network, we recommend allowing users SSH access to the access point.
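The sizing rules of thumb above can be turned into quick arithmetic. The following is a minimal sketch, not an official sizing tool: the function names are hypothetical, the per-job memory figure is rounded up to a full megabyte, and "another access point for every twenty thousand jobs" is interpreted as a ceiling division.

```shell
#!/bin/sh
# Rough sizing sketch based on the rules of thumb in this guide.
# These are planning estimates (assumptions), not hard limits.

JOBS_PER_ACCESS_POINT=20000   # one more access point per 20,000 running jobs
MB_PER_RUNNING_JOB=1          # upper bound: "a bit under a megabyte" per job

# Number of access points for a given count of simultaneously running jobs
# (ceiling division, with a minimum of one access point).
access_points_needed() {
    n=$1
    if [ "$n" -le 0 ]; then
        echo 1
    else
        echo $(( (n + JOBS_PER_ACCESS_POINT - 1) / JOBS_PER_ACCESS_POINT ))
    fi
}

# Upper-bound estimate, in MB, of RAM needed for the running jobs
# managed by a single access point.
submit_ram_upper_bound_mb() {
    echo $(( $1 * MB_PER_RUNNING_JOB ))
}

access_points_needed 50000        # prints 3
submit_ram_upper_bound_mb 20000   # prints 20000
```

Data transfer capacity, which the estimate ignores, may become the bottleneck well before memory does.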
The Central Manager Role
Only one machine in each HTCondor pool can perform this role (barring certain high-availability configurations, where only one machine can perform this role at a time). A central manager matches resource requests – generated by the submit role based on its jobs – with the resources described by the execute machines. We refer to sending these (automatically-generated) descriptions to the central manager as “advertising” because it’s the primary way execute machines get jobs to run.
A central manager must accept connections from each execute machine and each access point in a pool. However, users should never need access to the central manager. Every machine in the pool updates the central manager every few minutes, and it answers both system and user queries about the status of the pool’s resources, so a fast network is important. For very large pools, memory may become a limiting factor.
Assigning Roles to Machines
The easiest way to assign a role to a machine is when you initially
get HTCondor. You’ll need to supply the same password for
each machine in the same pool; sharing that secret is how the machines
recognize each other as members of the same pool, and connections between
machines are encrypted with it. (HTCondor uses port 9618 to communicate,
so make sure that the machines in your pool accept TCP connections on that
port from each other.) In the command lines below, replace
$htcondor_password with the password you want to use. In addition to the
password, you must specify the name of the central manager, which may be a
host name (which must resolve on all machines in the pool) or an IP address.
In the command lines below, replace
$central_manager_name with the host
name or IP address you want to use.
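Because the machines must accept TCP connections on port 9618 from each other, it can help to test reachability once HTCondor is running on the other machine. The following is a minimal sketch, assuming bash (with its /dev/tcp redirection enabled) and the coreutils timeout command, which is not installed by default on macOS; the function name can_connect is hypothetical.

```shell
#!/bin/bash
# Check whether this machine can open a TCP connection to HTCondor's port
# on other pool members. Pass the host names or IP addresses as arguments.
CONDOR_PORT=9618

# Exit status 0 if a TCP connection to host:port opens within 5 seconds.
# The port defaults to CONDOR_PORT when the second argument is omitted.
can_connect() {
    local host=$1 port=${2:-$CONDOR_PORT}
    # Inside the child shell, $0 is the host and $1 is the port.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

for host in "$@"; do
    if can_connect "$host"; then
        echo "$host: port $CONDOR_PORT reachable"
    else
        echo "$host: port $CONDOR_PORT NOT reachable"
    fi
done
```

A refused connection only means that nothing answered. Before HTCondor is installed on the target machine, the check fails even if the firewall permits the traffic.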
When you get HTCondor, start with the central manager, then add
the access point(s), and then add the execute machine(s). You may not have
sudo installed; you may omit it from the command lines below
if you run them as root.
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager $central_manager_name
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name
At this point, users logged in on the access point should be able to see
execute machines in the pool (using
condor_status), submit jobs (using
condor_submit), and see them run (using
condor_q).
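A quick smoke test of a new pool might look like the following sketch, run as an ordinary user on the access point. The submit file name, the one-minute sleep job, and the resource requests are illustrative assumptions rather than recommended values.

```shell
#!/bin/sh
# Smoke test for a new pool; run as an ordinary user on the access point.

# Write a minimal submit description for a one-minute sleep job.
# The resource requests are small, illustrative values.
write_sleep_job() {
    cat > "$1" <<'EOF'
executable     = /bin/sleep
arguments      = 60
log            = sleep.log
output         = sleep.out
error          = sleep.err
request_cpus   = 1
request_memory = 64M
request_disk   = 64M
queue
EOF
}

write_sleep_job sleep.sub

if command -v condor_submit >/dev/null 2>&1; then
    condor_status            # list the execute resources in the pool
    condor_submit sleep.sub  # hand the job to the access point
    condor_q                 # show the job while it is idle or running
else
    echo "HTCondor tools not found in PATH; run this on the access point." >&2
fi
```

If condor_status prints nothing, the execute machines have probably not reached the central manager yet. Checking the shared password and port 9618 is a reasonable first step.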
Creating a Multi-Machine Pool using Windows or Containers
If you are creating a multi-machine HTCondor pool on Windows computers or using containerization, please see the “Setting Up a Whole Pool” section of the relevant installation guide.
Where to Go from Here
There are two major directions you can go from here, but before we discuss them, a warning.
Making Configuration Changes
HTCondor configuration files should generally be owned by root
(or Administrator, on Windows), but readable by all users. We recommend
that you don’t make changes to the configuration files established by the
installation procedure; this avoids conflicts between your changes and any
changes we may have to make to the base configuration in future
updates. Instead, you should add (or edit) files in the configuration
directory; its location can be determined on a given machine by running
condor_config_val LOCAL_CONFIG_DIR there. HTCondor will process files
in this directory in lexicographic order, so we recommend naming files
##-name.config so that a setting in a file that sorts earlier
will be overridden by the same setting in a file that sorts later.
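The effect of the naming convention can be illustrated in a scratch directory. This is a simplified sketch, not HTCondor's actual parser: the file names and the EXAMPLE_SETTING variable are hypothetical placeholders, and byte-wise sorting under LC_ALL=C is assumed to approximate the order HTCondor uses.

```shell
#!/bin/sh
# Illustrate the ##-name.config convention in a scratch directory.
# On a real machine the directory would come from:
#   condor_config_val LOCAL_CONFIG_DIR
demo_dir=$(mktemp -d)

# Hypothetical file names and a placeholder variable, for illustration only.
echo 'EXAMPLE_SETTING = site-default'  > "$demo_dir/10-site.config"
echo 'EXAMPLE_SETTING = host-override' > "$demo_dir/90-host.config"

# Print the files in byte-wise lexicographic order, one per line.
processing_order() {
    ( cd "$1" && LC_ALL=C ls -1 )
}

# Print the value a variable ends up with when the files are read in that
# order and later assignments replace earlier ones.
effective_value() {
    dir=$1
    name=$2
    value=
    for f in $(processing_order "$dir"); do
        v=$(sed -n "s/^$name *= *//p" "$dir/$f" | tail -n 1)
        if [ -n "$v" ]; then
            value=$v
        fi
    done
    echo "$value"
}

processing_order "$demo_dir"                  # 10-site.config, then 90-host.config
effective_value "$demo_dir" EXAMPLE_SETTING   # prints host-override
```

On a live pool, running condor_reconfig after adding or editing a file asks the running daemons to reread their configuration.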
Some features of HTCondor, for one reason or another, aren’t (or can’t be) enabled by default. Areas of potentially general interest include:
Although your HTCondor pool should be fully functional at this point, it may not be behaving precisely as you wish, particularly with respect to resource allocation. You can tune how HTCondor allocates resources to users, or groups of users, using the user priority and group quota systems, described in User Priorities and Negotiation. You can enforce machine-specific policies – for instance, preferring GPU jobs on machines with GPUs – using the options described in Policy Configuration for Execution Points and for Access Points.
It may be helpful to at least skim the Users’ Manual to get an idea of what your users might want or expect, particularly the sections on DAGMan Introduction, Choosing an HTCondor Universe, and Self-Checkpointing Applications.
Understanding HTCondor’s ClassAd Mechanism is essential for many administrative tasks.
Slides from past HTCondor Weeks – our annual conference – include a number of tutorials and talks on administrative topics, including monitoring and examples of policies and their implementations.
What get_htcondor Does to Configure a Role
The configuration files generated by
get_htcondor are very similar, and
only two lines long:
set the HTCondor configuration variable
CONDOR_HOST to the name (or IP address) of your central manager;
add the appropriate metaknob:
use role : get_htcondor_central_manager,
use role : get_htcondor_submit, or
use role : get_htcondor_execute.
Putting all of the pool-independent configuration into the metaknobs allows us to change the metaknobs to fix problems or work with later versions of HTCondor as you upgrade.
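For anyone reproducing this configuration by hand or with a configuration management tool, the two lines described above could be generated as in the following sketch. The function name render_role_config and the host name cm.example.org are hypothetical. Where the output should be written is not specified here; the get_htcondor documentation has the exact details.

```shell
#!/bin/sh
# Print the two-line role configuration described above.
#   $1: role (central_manager, submit, or execute)
#   $2: host name or IP address of the central manager
render_role_config() {
    case $1 in
        central_manager|submit|execute) ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
    printf 'CONDOR_HOST = %s\n' "$2"
    printf 'use role : get_htcondor_%s\n' "$1"
}

# cm.example.org is a placeholder host name.
render_role_config execute cm.example.org
# prints:
#   CONDOR_HOST = cm.example.org
#   use role : get_htcondor_execute
```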
The get_htcondor documentation describes what the configuration script does and how to determine the exact details.