Administrative Quick Start Guide

This guide does not contain step-by-step instructions for getting HTCondor. Rather, it is a guide to joining multiple machines into a single pool of computational resources for use by HTCondor jobs.

This guide begins by briefly describing the three roles required by every HTCondor pool, as well as the resources and networking required by each of those roles. This information will enable you to choose which machine(s) will perform which role(s). This guide also includes instructions on how to use the get_htcondor tool to install and configure Linux (or Mac) machines to perform each of the roles.

If you’re curious, if you’re using Windows machines, or if you want to automate the configuration of your pool using a tool like Puppet, the last section of this guide briefly describes what the get_htcondor tool does and provides a link to the rest of the details.

The Three Roles

Even a single-machine installation of HTCondor performs all three roles.

The Execute Role

The most common reason for adding a machine to an HTCondor pool is to make another machine execute HTCondor jobs; the first major role, therefore, is the execute role. This role is responsible for the technical aspects of actually running, monitoring, and managing the job’s executable; transferring the job’s input and output; and advertising, monitoring, and managing the resources of the execute machine. HTCondor can manage pools containing tens of thousands of execute machines, so this is by far the most common role.

The execute role itself uses very few resources, so almost any machine can contribute to a pool. The execute role can run on a machine with only outbound network connectivity, but being able to accept inbound connections from the machine(s) performing the submit role will simplify setup and reduce overhead. The execute machine does not need to allow user access, or even share user IDs with other machines in the pool (although this may be very convenient, especially on Windows).

The Submit Role

We’ll discuss what “advertising” a machine’s resources means in the next section, but the execute role leaves an obvious question unanswered: where do the jobs come from? The answer is the submit role. This role is responsible for accepting, monitoring, managing, and scheduling jobs on its assigned resources; transferring the input and output of jobs; and requesting and accepting resource assignments. (A “resource” is some reserved fraction of an execute machine.) HTCondor allows arbitrarily many machines to perform the submit role in a pool, but for administrative convenience, most pools have only one, or a small number of, machines acting in the submit role.

A submit-role machine requires a bit under a megabyte of RAM for each running job, and its ability to transfer data to and from the execute-role machines may become a performance bottleneck. We typically recommend adding another submit machine for every twenty thousand simultaneously running jobs. A submit machine must have outbound network connectivity, but a submit machine without inbound network connectivity can’t use execute-role machines without inbound network connectivity. As execute machines are more numerous, submit machines typically allow inbound connections. Although you may allow users to submit jobs over the network, we recommend allowing users SSH access to the submit machine.
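The sizing guidance above reduces to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it treats "a bit under a megabyte" as a 1 MB-per-job upper bound rather than a measured figure.

```shell
#!/bin/sh
# Rough submit-side sizing from the rules of thumb above.
# Assumption: 1 MB of RAM per running job, used as an upper bound.
MB_PER_JOB=1
JOBS_PER_SUBMIT_MACHINE=20000

# Approximate RAM (in MB) needed to track the given number of running jobs.
submit_ram_mb() {
    echo $(( $1 * MB_PER_JOB ))
}

# Number of submit machines suggested for the given number of running jobs,
# rounding up and never returning fewer than one.
submit_machines_needed() {
    running=$1
    n=$(( (running + JOBS_PER_SUBMIT_MACHINE - 1) / JOBS_PER_SUBMIT_MACHINE ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

submit_ram_mb 20000            # about 20 GB for twenty thousand running jobs
submit_machines_needed 45000   # rounds 2.25 up to the next whole machine
```

These numbers cover only the per-job memory overhead; data transfer to and from the execute machines may become the bottleneck first.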

The Central Manager Role

Only one machine in each HTCondor pool can perform this role (barring certain high-availability configurations, where only one machine can perform this role at a time). A central manager matches resource requests – generated by the submit role based on its jobs – with the resources described by the execute machines. We refer to sending these (automatically-generated) descriptions to the central manager as “advertising” because it’s the primary way execute machines get jobs to run.

A central manager must accept connections from each execute machine and each submit machine in a pool. However, users should never need access to the central manager. Every machine in the pool updates the central manager every few minutes, and it answers both system and user queries about the status of the pool’s resources, so a fast network is important. For very large pools, memory may become a limiting factor.

Assigning Roles to Machines

The easiest time to assign a role to a machine is when you initially get HTCondor. You’ll need to supply the same password for each machine in the same pool; sharing that secret is how the machines recognize each other as members of the same pool, and connections between machines are encrypted with it. (HTCondor uses port 9618 to communicate, so make sure that the machines in your pool accept TCP connections on that port from each other.) In the command lines below, replace $htcondor_password with the password you want to use. You must also specify the central manager, either as a host name (which must resolve on all machines in the pool) or as an IP address; in the command lines below, replace $central_manager_name with that host name or IP address.

When you get HTCondor, start with the central manager, then add the submit machine(s), and then add the execute machine(s). If you don’t have sudo installed, you may omit it from the command lines below, provided you run them as root.

Central Manager

curl -fsSL https://get.htcondor.org | GET_HTCONDOR_PASSWORD="$htcondor_password" sudo /bin/bash -s -- --no-dry-run --central-manager $central_manager_name
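Before adding submit or execute machines, it may be worth confirming from each of them that the central manager accepts TCP connections on port 9618. The following is a minimal sketch rather than part of get_htcondor: can_connect is a hypothetical helper, and it assumes that bash (built with /dev/tcp support) and the coreutils timeout command are available, which may not be the case on a Mac.

```shell
#!/bin/sh
# Return 0 if a TCP connection to HOST:PORT succeeds within 5 seconds.
# PORT defaults to 9618, the port named in this guide.
# Assumption: bash with /dev/tcp support and `timeout` are installed.
can_connect() {
    cc_host=$1
    cc_port=${2:-9618}
    # Inside the inner bash, $0 is the host and $1 is the port.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$cc_host" "$cc_port" 2>/dev/null
}

# Example (run on a prospective submit or execute machine):
#   can_connect "$central_manager_name" && echo "port 9618 is reachable"
```

A failure here usually points to a firewall rule or a host name that does not resolve, either of which is easier to fix before the remaining machines are installed.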

Submit

curl -fsSL https://get.htcondor.org | GET_HTCONDOR_PASSWORD="$htcondor_password" sudo /bin/bash -s -- --no-dry-run --submit $central_manager_name

Execute

curl -fsSL https://get.htcondor.org | GET_HTCONDOR_PASSWORD="$htcondor_password" sudo /bin/bash -s -- --no-dry-run --execute $central_manager_name

At this point, users logged in on the submit machine should be able to see execute machines in the pool (using condor_status), submit jobs (using condor_submit), and see them run (using condor_q).
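One quick way to exercise all three tools is to submit a trivial job. The sketch below is not part of the original guide: the file names are arbitrary, and it assumes /bin/sleep exists on the submit machine.

```shell
#!/bin/sh
# Write a minimal submit description for a job that sleeps for 60 seconds.
# File names (sleep.sub, sleep.log, ...) are arbitrary choices for this example.
cat > sleep.sub <<'EOF'
executable = /bin/sleep
arguments  = 60
log        = sleep.log
output     = sleep.out
error      = sleep.err
queue
EOF

# Only call the HTCondor tools if they are installed on this machine.
if command -v condor_submit >/dev/null 2>&1; then
    condor_status            # list the execute resources in the pool
    condor_submit sleep.sub  # hand the job to the submit role
    condor_q                 # show the job while it is idle or running
else
    echo "condor_submit not found; run this on a submit machine" >&2
fi
```

If the job stays idle, condor_status is a reasonable first thing to check: an empty listing suggests that no execute machine has advertised itself to the central manager yet.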

Creating a Multi-Machine Pool using Windows or Containers

If you are creating a multi-machine HTCondor pool on Windows computers or using containerization, please see the “Setting Up a Whole Pool” section of the relevant installation guide.

Where to Go from Here

There are two major directions you can go from here, but before we discuss them, a warning.

Making Configuration Changes

HTCondor configuration files should generally be owned by root (or Administrator, on Windows), but readable by all users. We recommend that you don’t make changes to the configuration files established by the installation procedure; this avoids conflicts between your changes and any changes we may have to make to the base configuration in future updates. Instead, you should add (or edit) files in the configuration directory; its location can be determined on a given machine by running condor_config_val LOCAL_CONFIG_DIR there. HTCondor will process files in this directory in lexicographic order, so we recommend naming files ##-name.config so that, for example, a setting in 00-base.config will be overridden by a setting in 99-specific.config.
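The ordering rule is easy to check with a throwaway directory. The sketch below is illustrative only: it works in a temporary directory rather than the real configuration directory, EXAMPLE_SETTING is a made-up name rather than a real HTCondor variable, and byte-wise sorting with ls is used as an approximation of the order HTCondor applies.

```shell
#!/bin/sh
# Demonstrate lexicographic processing order in a scratch directory.
# This does not touch a real HTCondor installation.
demo_dir=$(mktemp -d)
echo "EXAMPLE_SETTING = from-base"     > "$demo_dir/00-base.config"
echo "EXAMPLE_SETTING = from-specific" > "$demo_dir/99-specific.config"
echo "EXAMPLE_SETTING = from-site"     > "$demo_dir/50-site.config"

# LC_ALL=C makes the sort order byte-wise, independent of the user's locale.
ordered=$(cd "$demo_dir" && LC_ALL=C ls)
echo "$ordered"

# The file sorted last is the one whose settings win on conflict.
winner=$(echo "$ordered" | tail -n 1)
echo "last processed: $winner"

rm -rf "$demo_dir"

# On a real machine, the directory can be located and changes applied with:
#   condor_config_val LOCAL_CONFIG_DIR
#   condor_reconfig
```

Keeping the two-digit prefixes zero-padded matters with this scheme, since a file named 9-name.config would sort after 10-name.config.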

Enabling Features

Some features of HTCondor, for one reason or another, aren’t (or can’t be) enabled by default. Areas of potentially general interest include:

Implementing Policies

Although your HTCondor pool should be fully functional at this point, it may not be behaving precisely as you wish, particularly with respect to resource allocation. You can tune how HTCondor allocates resources to users, or groups of users, using the user priority and group quota systems, described in User Priorities and Negotiation. You can enforce machine-specific policies – for instance, preferring GPU jobs on machines with GPUs – using the options described in Policy Configuration for Execute Hosts and for Submit Hosts.

Further Reading

What get_htcondor Does to Configure a Role

The configuration files generated by get_htcondor are very similar to one another, and each is only two lines long. The two lines, respectively:

  • set the HTCondor configuration variable CONDOR_HOST to the name (or IP address) of your central manager;

  • add the appropriate metaknob: use role : get_htcondor_central_manager, use role : get_htcondor_submit, or use role : get_htcondor_execute.

Putting all of the pool-independent configuration into the metaknobs allows us to change the metaknobs to fix problems or work with later versions of HTCondor as you upgrade.
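To make the two-line shape concrete, the sketch below prints what such a file might contain for a given role. render_role_config is a hypothetical helper, not part of get_htcondor, and cm.example.org is a placeholder host name; the exact text and file names that get_htcondor writes may differ.

```shell
#!/bin/sh
# Print the two configuration lines described above for a given role.
# render_role_config is a hypothetical helper, not part of get_htcondor.
render_role_config() {
    role=$1              # central_manager, submit, or execute
    central_manager=$2   # host name or IP address of the central manager
    case "$role" in
        central_manager|submit|execute) ;;
        *) echo "unknown role: $role" >&2; return 1 ;;
    esac
    printf 'CONDOR_HOST = %s\n' "$central_manager"
    printf 'use role : get_htcondor_%s\n' "$role"
}

render_role_config execute cm.example.org
```

A configuration-management tool such as Puppet could template the same two lines, varying only the role name per machine.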

The get_htcondor documentation describes what the configuration script does and how to determine the exact details.