Cloud Annex Customization Guide

Aside from the configuration macros (see the Cloud Annex Configuration Options section), the major way to customize htcondor annex is by customizing the default disk image. Because the implementation of htcondor annex varies from service to service, and that implementation determines the constraints on the disk image, this section is divided by service.

Amazon Web Services

Requirements for an Annex-compatible AMI are driven by how htcondor annex securely transports HTCondor configuration and security tokens to the instances; we will discuss that implementation briefly, to help you understand the requirements, even though it will hopefully never matter to you.

Resource Requests

For on-demand or Spot instances, we begin by making a single resource request whose client token is the annex name concatenated with an underscore and then a newly-generated GUID. This construction allows us to terminate on-demand instances belonging to a particular annex (by its name), as well as discover the annex name from inside an instance.

An on-demand instance may obtain its instance ID directly from the AWS metadata server, and then ask another AWS API for that instance ID’s client token. Since GUIDs do not contain underscores, we can be certain that anything to the left of the last underscore is the annex’s name.
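The token construction and the last-underscore rule can be sketched in a few lines of Python. This is an illustration only, not the code htcondor annex runs; the function names are hypothetical, and the sketch assumes the GUID is rendered in the usual hyphenated form, which contains no underscores.

```python
import uuid


def make_client_token(annex_name: str) -> str:
    """Join the annex name and a newly generated GUID with an underscore."""
    return f"{annex_name}_{uuid.uuid4()}"


def annex_name_from_token(client_token: str) -> str:
    """Return everything to the left of the last underscore.

    Because the GUID contains no underscores, this recovers the annex
    name even when the name itself contains underscores.
    """
    name, separator, _guid = client_token.rpartition("_")
    if not separator:
        raise ValueError(f"no underscore in client token: {client_token!r}")
    return name
```

Splitting on the last underscore rather than the first is what allows an annex name such as my_annex to survive the round trip.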

An instance started by a Spot Fleet has a client token generated by the Spot Fleet. Instead of performing a direct lookup, a Spot Fleet instance must therefore determine which Spot Fleet started it, and then obtain that Spot Fleet’s client token. A Spot Fleet will tag an instance with the Spot Fleet’s identity after the instance starts up. This usually only takes a few minutes, but the default image waits for up to 50 minutes, since you’re already paying for the first hour anyway.
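The wait for the Spot Fleet tag amounts to polling with a deadline. The following Python sketch shows that pattern under stated assumptions: lookup is a hypothetical callable that returns the Spot Fleet's identity once the tag is visible and None before that, and the 30-second polling interval is an arbitrary choice. Only the 50-minute limit comes from the text above.

```python
import time
from typing import Callable, Optional


def wait_for_value(lookup: Callable[[], Optional[str]],
                   timeout_s: float = 50 * 60,
                   interval_s: float = 30.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call lookup() until it returns a value or the deadline passes.

    Returns the value, or None if the deadline passed first. The clock
    and sleep parameters exist only so the loop can be tested without
    actually waiting.
    """
    deadline = clock() + timeout_s
    while True:
        value = lookup()
        if value is not None:
            return value
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

A caller would pass a function that queries the instance's tags; how that query is made is deliberately left out of this sketch.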

Secure Transport

At this point, the instance knows its annex’s name. This allows the instance to construct the name of the tarball it should download (config-AnnexName.tar.gz), but does not tell it from where a file with that name should be downloaded.

(Because the user data associated with a resource request is not secure, and because we want to leave the user data available for its normal usage, we can’t just encode the tarball or its location in the user data.)

The instance determines from which S3 bucket to download by asking the metadata server which role the instance is playing. (An instance without a role is unable to make use of any AWS services without acquiring valid AWS tokens through some other method.) The instance role created by the setup procedure includes permission to read files matching the pattern config-*.tar.gz from a particular private S3 bucket. If the instance finds permissions matching that pattern, it assumes that the corresponding S3 bucket is the one from which it should download, and does so; if successful, it untars the file in /etc/condor/config.d.
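As a rough illustration of the bucket discovery, the Python sketch below scans a role's policy document for an S3 resource whose key is the config-*.tar.gz pattern and returns the bucket name. It assumes the policy has already been retrieved and parsed into a dictionary in the standard IAM JSON layout (Statement, Effect, Resource); the function names and the example bucket are hypothetical, and the actual script may work differently.

```python
from typing import Optional

S3_ARN_PREFIX = "arn:aws:s3:::"
CONFIG_PATTERN = "config-*.tar.gz"


def tarball_name(annex_name: str) -> str:
    """Name of the configuration tarball for the given annex."""
    return f"config-{annex_name}.tar.gz"


def find_config_bucket(policy: dict) -> Optional[str]:
    """Return the bucket of the first allowed S3 resource whose key is
    exactly the config tarball pattern, or None if there is none."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # a single statement may be unwrapped
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        resources = statement.get("Resource", [])
        if isinstance(resources, str):  # a single resource may be a bare string
            resources = [resources]
        for arn in resources:
            if not arn.startswith(S3_ARN_PREFIX):
                continue
            bucket, separator, key = arn[len(S3_ARN_PREFIX):].partition("/")
            if separator and key == CONFIG_PATTERN:
                return bucket
    return None
```

For example, a statement allowing arn:aws:s3:::example-annex-bucket/config-*.tar.gz yields example-annex-bucket, from which an instance in the annex MyAnnex would download config-MyAnnex.tar.gz.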

In v8.7.1, the script executing these steps is named 49ec2-instance.sh, and is called during configuration when HTCondor first starts up.

In v8.7.2, the script executing these steps is named condor-annex-ec2, and is called during system start-up.

The HTCondor configuration and security tokens are at this point protected on the instance’s disk by the usual filesystem permissions. To prevent HTCondor jobs from using the instance’s permissions for anything, and in particular from downloading their own copy of the security tokens, the last thing the script does is use the Linux kernel firewall to forbid any non-root process from accessing the metadata server.

Image Requirements

Thus, to work with htcondor annex, an AWS AMI must:

fetch the HTCondor configuration and security tokens from S3;
configure HTCondor to turn off after it’s been idle for too long;
and turn off the instance when the HTCondor master daemon exits.

The second item could be construed as optional, but leaving it unimplemented disables the -idle command-line option.

The default disk image implements the above as follows:

with a configuration script (/etc/condor/49ec2-instance.sh);
with a single configuration item (STARTD_NOCLAIM_SHUTDOWN);
with a configuration item (DEFAULT_MASTER_SHUTDOWN_SCRIPT) and the corresponding script (/etc/condor/master_shutdown.sh), which just turns around and runs shutdown -h now.
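To make the second and third items concrete, the Python sketch below renders the two configuration lines such an image might contain. The macro names and the script path come from the list above; the function name and the idle timeout value are hypothetical, and this sketch assumes STARTD_NOCLAIM_SHUTDOWN is expressed in seconds.

```python
def render_shutdown_config(idle_seconds: int,
                           shutdown_script: str = "/etc/condor/master_shutdown.sh") -> str:
    """Render HTCondor configuration that shuts the startd down after it
    has gone unclaimed for idle_seconds, and runs shutdown_script when
    the master daemon exits."""
    if idle_seconds <= 0:
        raise ValueError("idle_seconds must be positive")
    return (
        f"STARTD_NOCLAIM_SHUTDOWN = {idle_seconds}\n"
        f"DEFAULT_MASTER_SHUTDOWN_SCRIPT = {shutdown_script}\n"
    )
```

Calling render_shutdown_config(900) produces the two lines for a hypothetical 15-minute idle limit; a custom image would place equivalent lines in its own HTCondor configuration.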

We also strongly recommend that every htcondor annex disk image:

Advertise, in the master and startd, the instance ID.
Use the instance’s public IP, by setting TCP_FORWARDING_HOST.
Turn on communications integrity and encryption.
Encrypt the run directories.
Restrict access to the EC2 metadata server to root.

The default disk image is configured to do all of this.

Instance Roles

To explain the last point immediately above, EC2 stores (temporary) credentials for the role, if any, associated with an instance on that instance’s metadata server, which may be accessed via HTTP at a well-known address (currently 169.254.169.254). Unless otherwise configured, any process in the instance can access the metadata server and thereby make use of the instance’s credentials.

Until version 8.9.0, there was no HTCondor-based reason to run an EC2 instance with an instance role. Starting in 8.9.0, however, HTCondor gained the ability to use the instance role’s credentials to run EC2 universe jobs and htcondor annex commands. This has several advantages over copying credentials into the instance: it may be more convenient, and if you’re the only user of the instance, it’s more secure, because the instance’s credentials expire when the instance does.

However, wanting to allow (other) users to run jobs on or submit jobs to your instance may not mean you want them to be able to act with the instance’s privileges (e.g., starting more instances on your account). Although securing your instances ultimately remains your responsibility, the default images we provide for htcondor annex, and the condor-annex-ec2 package, both use the kernel-level firewall to prevent access to the metadata server by any process not owned by root. Because this firewall rule is added during the boot sequence, it will be in place before HTCondor can start any user jobs, and should therefore be effective in preventing access to the instance’s credentials by normal users or their jobs.
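For readers building their own image, a firewall rule of the kind described above might look like the following. This Python sketch only constructs the iptables argument list. It is an assumption about one way to express the restriction, using the iptables owner match to reject outbound traffic to the metadata address from any process whose user ID is not 0; it is not a copy of what the default image or the condor-annex-ec2 package installs.

```python
from typing import List

METADATA_ADDRESS = "169.254.169.254"


def metadata_firewall_rule(address: str = METADATA_ADDRESS) -> List[str]:
    """Build an iptables command that rejects packets sent to the
    metadata server by any process not running as root (UID 0)."""
    return [
        "iptables", "-A", "OUTPUT",
        "-d", address,
        "-m", "owner", "!", "--uid-owner", "0",
        "-j", "REJECT",
    ]
```

The list can be handed to subprocess.run(rule, check=True) by a root-owned boot script. Running it early in the boot sequence matters for the reason given above: the rule must be in place before HTCondor starts any user jobs.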