Bringing Your Own Resources

You may have access to more, or different, resources than your HTCondor administrator. For example, you may have credits at the PATh facility, or an ACCESS allocation at Anvil, Bridges-2, Expanse, or Perlmutter; or you might have funds for AWS VMs.

If you want to make use of any of these resources, you may “bring your own resources” to any AP which has that functionality enabled. When you do, you’ll give the set of resources you’re leasing a name; we call a named set of leased resources an annex.

HTCondor provides access to two different kinds of annexes: the so-called “HPC” annexes (which use HPC resources), through the htcondor tool; and cloud annexes, through the condor_annex tool. The latter is described in some detail in the Cloud Computing section; we’ll only discuss the former here.

Recipes

The following recipes assume you have an OSG Portal account and password. If you’re using a different AP, and your AP administrator has enabled htcondor annex, just log into that AP instead of an OSG Portal AP.

htcondor annex Overview

An HTCondor pool (normally) runs the jobs you submit on resources that were provisioned by the pool administrator. Even if the pool administrator doesn’t own or operate the resources, they had to coordinate with the person who does in order to make them available to you. An “HPC” annex, in contrast, is provisioned by you without involving the pool administrator at all, and the resources you provision will only run your jobs. (This is why “HPC” annex functionality is not turned on by default; the administrator has agreed to let you use the AP’s resources to run jobs on the machines they provisioned, not necessarily on machines that you provisioned.) Unlike resources provisioned by the pool administrator, resources you provision in an annex always have a lease: some amount of time past which the resources will be returned to their owner(s), even if jobs are still running. This is a key safety tool for limiting the use of your allocation(s)/credit(s).
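The effect of a lease can be sketched in a few lines of Python. This is only an illustrative model, not how HTCondor implements leases: the function names, the one-hour lifetime, and the job runtimes are all assumptions chosen for the example. It shows the one property that matters here, namely that a job still running when the lease ends does not get to finish on those resources.

```python
from datetime import datetime, timedelta


def lease_expiry(provisioned_at: datetime, lifetime_s: int) -> datetime:
    """Time at which the leased resources go back to their owner(s)."""
    return provisioned_at + timedelta(seconds=lifetime_s)


def finishes_before_expiry(job_start: datetime, runtime_s: int, expiry: datetime) -> bool:
    """True if a job started at job_start completes before the lease ends."""
    return job_start + timedelta(seconds=runtime_s) <= expiry


# Illustrative numbers (assumptions): a one-hour lease starting at noon.
noon = datetime(2024, 1, 1, 12, 0, 0)
expiry = lease_expiry(noon, 3600)
half_past = noon + timedelta(minutes=30)

# A 20-minute job started at 12:30 ends at 12:50, inside the lease.
print(finishes_before_expiry(half_past, 20 * 60, expiry))
# A 45-minute job started at 12:30 would end at 13:15, after the lease is over.
print(finishes_before_expiry(half_past, 45 * 60, expiry))
```

When choosing a lease length, it therefore helps to leave room for the time jobs spend waiting to start, not just their runtime.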

You are not expected to know how htcondor annex works or how to administer a pool, although of course both will be useful if anything goes wrong. The key concept to grasp is that htcondor annex works (a) because it is interactive and (b) by submitting jobs to the batch system managing the resource(s) you want to use to run your jobs.

Point (a) matters because many systems – htcondor annex calls the set of resources you want to use and its associated management software a “system” – require multi-factor authentication (“MFA”) before jobs can be submitted to the corresponding batch system. When you run htcondor annex, you’ll be asked to log in to the system whose resources you want to use, because we can’t automate that process. After you do, htcondor annex will automatically transfer all the necessary files and submit the job – point (b) – that will make (some of) that system’s resources available to your AP to run your jobs.

The chosen system’s software takes over at that point, and you will (eventually) get the resources you asked for. If you want to terminate your lease on those resources early, you can use the shutdown verb to do so.

Details

(In decreasing order of general interest.)

You may need or want to use the SSH configuration file (usually ~/.ssh/config) to set your login name and/or SSH key for a given system. You can use the --login-name flag to htcondor annex create if you need to specify a login name for a particular system, but there’s no corresponding flag for specifying which SSH key to use. Basically, in order to log in successfully via htcondor annex, you need to be able to log in successfully via ssh without using any flags on the SSH command line.
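As a rough illustration, an entry in ~/.ssh/config that sets both a login name and a key might look like the fragment below. The host name, user name, and key path are placeholders, not values taken from any real system. The pattern on the Host line has to match whatever host name is actually used when logging in to the system, so check that against the system's own documentation.

```text
# ~/.ssh/config
# "login.hpc.example.edu", "jdoe", and the key path are placeholders.
Host login.hpc.example.edu
    User jdoe
    IdentityFile ~/.ssh/id_ed25519_hpc
```

With an entry like this in place, running plain ssh against that host name should log you in without any extra command-line flags, which is the condition described above.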

If you have access to more than one system supported by htcondor annex, you can add resources from more than one system to the same annex. This might be useful if, for example, you need a lot of GPUs but aren’t too picky about which particular type of GPU. Use the add verb to add resources to an existing annex.

By default, annex EPs shut themselves down after they’ve been idle – that is, have not been running a job – for more than a certain amount of time. This helps reduce the usage of your allocation(s)/credit(s). You can adjust the default (300 seconds) up or down, but if you go too low, EPs can shut down even if they could have been doing work just because it can take a few minutes to get them a job.
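The trade-off can be illustrated with a small Python sketch. It is a simplified model rather than HTCondor's actual shutdown logic: the function name is hypothetical, and the three-minute matching delay is an assumed figure standing in for "a few minutes".

```python
def ep_runs_job(idle_limit_s: int, seconds_until_job_arrives: int) -> bool:
    """Return True if an idle EP is still alive when its first job arrives.

    Simplified model (an assumption): the EP shuts itself down once it has
    been idle for more than idle_limit_s seconds.
    """
    return seconds_until_job_arrives <= idle_limit_s


# Assumed for illustration: it takes about three minutes to get the EP a job.
match_delay_s = 180

for limit in (60, 300):
    outcome = "runs the job" if ep_runs_job(limit, match_delay_s) else "shuts down first"
    print(f"idle limit {limit}s: EP {outcome}")
```

Under these assumed numbers, the 300-second default leaves enough slack for a job to arrive, while a 60-second limit would return the resources before any work reached them.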