Administrative Quick Start Guide
This guide does not contain step-by-step instructions for getting HTCondor. Rather, it is a guide to joining multiple machines into a single pool of computational resources for use by HTCondor jobs.
This guide begins by briefly describing the three roles required by every
HTCondor pool, as well as the resources and networking required by each
of those roles. This information will enable you to choose which machine(s)
will perform which role(s). This guide also includes instructions on how to
use the get_htcondor tool to install and configure Linux (or Mac) machines
to perform each of the roles.
If you’re curious, using Windows machines, or want to automate the
configuration of your pool using a tool like Puppet, the last section of
this guide briefly describes what the get_htcondor tool does and provides
a link to the rest of the details.
The Three Roles
Even a single-machine installation of HTCondor performs all three roles.
The Execute Role
The most common reason for adding a machine to an HTCondor pool is to make another machine execute HTCondor jobs; the first major role, therefore, is the execute role. This role is responsible for the technical aspects of actually running, monitoring, and managing the job’s executable; transferring the job’s input and output; and advertising, monitoring, and managing the resources of the execute machine. HTCondor can manage pools containing tens of thousands of execute machines, so this is by far the most common role.
The execute role itself uses very few resources, so almost any machine can contribute to a pool. The execute role can run on a machine with only outbound network connectivity, but being able to accept inbound connections from the machine(s) performing the submit role will simplify setup and reduce overhead. The execute machine does not need to allow user access, or even share user IDs with other machines in the pool (although this may be very convenient, especially on Windows).
The Submit Role
We’ll discuss what “advertising” a machine’s resources means in the next section, but the execute role leaves an obvious question unanswered: where do the jobs come from? The answer is the submit role. This role is responsible for accepting, monitoring, managing, and scheduling jobs on its assigned resources; transferring the input and output of jobs; and requesting and accepting resource assignments. (A “resource” is some reserved fraction of an execute machine.) HTCondor allows arbitrarily many submit roles in a pool, but for administrative convenience, most pools have only one machine, or a small number of machines, acting in the submit role.
A submit-role machine (also called an access point) requires a bit under a megabyte of RAM for each running job, and its ability to transfer data to and from the execute-role machines may become a performance bottleneck. We typically recommend adding another access point for every twenty thousand simultaneously running jobs. An access point must have outbound network connectivity, but an access point without inbound network connectivity can’t use execute-role machines that also lack inbound network connectivity. As execute machines are more numerous, access points typically allow inbound connections. Although you may allow users to submit jobs over the network, we recommend allowing users SSH access to the access point.
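The sizing figures above lend themselves to a quick back-of-the-envelope calculation. The following is only a sketch: the function names are hypothetical, and it treats "a bit under a megabyte" per running job as an upper bound of 1 MiB.

```shell
# Rough sizing from the guide's rules of thumb (not an official tool).
jobs_per_access_point=20000   # one access point per 20,000 running jobs

# Number of access points for N simultaneously running jobs (rounded up).
access_points_needed() {
    echo $(( ($1 + $jobs_per_access_point - 1) / $jobs_per_access_point ))
}

# Upper bound on total submit-side RAM in GiB, assuming at most 1 MiB per job.
ram_upper_bound_gib() {
    echo $(( ($1 + 1023) / 1024 ))
}

access_points_needed 50000   # prints 3
ram_upper_bound_gib 50000    # prints 49
```

The RAM figure is the total across all access points, so each of the three machines in this example would need roughly a third of it for running jobs.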
The Central Manager Role
Only one machine in each HTCondor pool can perform this role (barring certain high-availability configurations, where only one machine can perform this role at a time). A central manager matches resource requests – generated by the submit role based on its jobs – with the resources described by the execute machines. We refer to sending these (automatically-generated) descriptions to the central manager as “advertising” because it’s the primary way execute machines get jobs to run.
A central manager must accept connections from each execute machine and each access point in a pool. However, users should never need access to the central manager. Every machine in the pool updates the central manager every few minutes, and it answers both system and user queries about the status of the pool’s resources, so a fast network is important. For very large pools, memory may become a limiting factor.
Assigning Roles to Machines
The easiest way to assign a role to a machine is to do so when you initially
get HTCondor. You’ll need to supply the same password for
each machine in the same pool; sharing that secret is how the machines
recognize each other as members of the same pool, and connections between
machines are encrypted with it. (HTCondor uses port 9618 to communicate,
so make sure that the machines in your pool accept TCP connections on that
port from each other.) In the command lines below, replace $htcondor_password
with the password you want to use. In addition to the password, you must
specify the name of the central manager, which may be a host name (which must
resolve on all machines in the pool) or an IP address. In the command lines
below, replace $central_manager_name with the host name or IP address you
want to use.
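Before installing, it can help to confirm that the machines can actually reach each other on port 9618. The sketch below is not part of HTCondor: can_connect is a hypothetical helper, and it assumes bash (for its /dev/tcp redirection) and the timeout utility are available on the machine. Nothing will be listening on the port until HTCondor is running on the target machine, so a failure before installation does not necessarily indicate a firewall problem.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp pseudo-device and on the coreutils "timeout" tool.
can_connect() {
    local host="$1" port="${2:-9618}"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Falls back to localhost if $central_manager_name is not set.
central_manager_name="${central_manager_name:-localhost}"
if can_connect "$central_manager_name" 9618; then
    echo "port 9618 is reachable on $central_manager_name"
else
    echo "cannot reach $central_manager_name on port 9618"
fi
```

Run it from each access point and execute machine against the central manager, and between access points and execute machines where inbound connections are expected.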
When you get HTCondor, start with the central manager, then add
the access point(s), and then add the execute machine(s). You may
not have sudo installed; you may omit it from the command lines below
if you run them as root.
Central Manager
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager $central_manager_name
Submit
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name
Execute
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name
At this point, users logged in on the access point should be able to see
execute machines in the pool (using condor_status), submit jobs
(using condor_submit), and see them run (using condor_q).
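As a quick smoke test of the new pool, something like the following could be run on the access point. It is a minimal sketch: the file names sleep.sub and sleep.log are arbitrary, and the job simply sleeps for a minute so that there is time to see it in the queue.

```shell
# Write a minimal submit description file (the file name is arbitrary).
cat > sleep.sub <<'EOF'
executable = /bin/sleep
arguments  = 60
log        = sleep.log
queue
EOF

# Only talk to HTCondor if its tools are installed on this machine.
if command -v condor_submit >/dev/null 2>&1; then
    condor_status            # list the resources advertised by execute machines
    condor_submit sleep.sub  # submit one job
    condor_q                 # show the job in the queue
else
    echo "HTCondor tools not found on this machine" >&2
fi
```

If condor_status shows no machines, the execute machines may not yet have contacted the central manager.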
Creating a Multi-Machine Pool using Windows or Containers
If you are creating a multi-machine HTCondor pool on Windows computers or using containerization, please see the “Setting Up a Whole Pool” section of the relevant installation guide.
Where to Go from Here
There are two major directions you can go from here, but before we discuss them, a warning.
Making Configuration Changes
HTCondor configuration files should generally be owned by root
(or Administrator, on Windows), but readable by all users. We recommend
that you don’t make changes to the configuration files established by the
installation procedure; this avoids conflicts between your changes and any
changes we may have to make to the base configuration in future
updates. Instead, you should add (or edit) files in the configuration
directory; its location can be determined on a given machine by running
condor_config_val LOCAL_CONFIG_DIR there. HTCondor will process files
in this directory in lexicographic order, so we recommend naming files
##-name.config so that, for example, a setting in 00-base.config
will be overridden by a setting in 99-specific.config.
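The naming convention can be illustrated with a small sketch. It uses a temporary directory as a stand-in so that it runs anywhere; on a real machine the directory would be the output of condor_config_val LOCAL_CONFIG_DIR. EXAMPLE_KNOB is a made-up variable name, not a real HTCondor setting, and processing_order is a hypothetical helper.

```shell
# Stand-in for the real configuration directory, which you would get with:
#   config_dir="$(condor_config_val LOCAL_CONFIG_DIR)"
config_dir="$(mktemp -d)"

# EXAMPLE_KNOB is a made-up name used only to show the override order.
echo 'EXAMPLE_KNOB = base'     > "$config_dir/00-base.config"
echo 'EXAMPLE_KNOB = specific' > "$config_dir/99-specific.config"

# List the files in lexicographic order; per the guide, a setting in a
# later file overrides the same setting in an earlier one.
processing_order() {
    ( cd "$1" && LC_ALL=C ls -1 )
}

processing_order "$config_dir"
```

Here 99-specific.config is listed last, so its value for EXAMPLE_KNOB would be the one in effect. After adding or editing a file in the real directory, running condor_reconfig asks the HTCondor daemons to re-read their configuration.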
Enabling Features
Some features of HTCondor, for one reason or another, aren’t (or can’t be) enabled by default. Areas of potentially general interest include:
Configuration for Execution Points (particularly Enabling the Fetching and Use of OAuth2 Credentials and Cgroup-Based Process Tracking).
Implementing Policies
Although your HTCondor pool should be fully functional at this point, it may not be behaving precisely as you wish, particularly with respect to resource allocation. You can tune how HTCondor allocates resources to users, or groups of users, using the user priority and group quota systems, described in Configuration for Central Managers. You can enforce machine-specific policies – for instance, preferring GPU jobs on machines with GPUs – using the options described in Configuration for Execution Points.
Further Reading
It may be helpful to at least skim the Users’ Manual to get an idea of what your users might want or expect, particularly the sections on DAGMan Introduction, Choosing an HTCondor Universe, and Self-Checkpointing Applications.
Understanding HTCondor’s ClassAd Mechanism is essential for many administrative tasks.
The rest of the Administrators’ Manual, particularly the section on Monitoring with Ganglia, Elasticsearch, etc.
Slides from past HTCondor Weeks – our annual conference – include a number of tutorials and talks on administrative topics, including monitoring and examples of policies and their implementations.
What get_htcondor Does to Configure a Role
The configuration files generated by get_htcondor are very similar, and
only two lines long:
set the HTCondor configuration variable CONDOR_HOST to the name (or IP address) of your central manager;
add the appropriate metaknob: use role : get_htcondor_central_manager, use role : get_htcondor_submit, or use role : get_htcondor_execute.
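For those automating configuration with their own tooling, the two lines described above might be written out as follows for an execute machine. This is a sketch under assumptions: it writes to a temporary file rather than guessing the file name and location that get_htcondor uses, and cm.example.org is a placeholder central manager name.

```shell
# Placeholder central manager name; substitute your own.
central_manager_name="${central_manager_name:-cm.example.org}"

# A temporary file stands in for the real configuration file, whose
# name and location are not specified in this guide.
role_config="$(mktemp)"
cat > "$role_config" <<EOF
CONDOR_HOST = $central_manager_name
use role : get_htcondor_execute
EOF

cat "$role_config"
```

For the other roles, only the metaknob on the second line changes.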
Putting all of the pool-independent configuration into the metaknobs allows us to change the metaknobs to fix problems or work with later versions of HTCondor as you upgrade.
The get_htcondor documentation describes what the configuration script does and how to determine the exact details.