HTCondor Introduction


Let’s start interacting with the HTCondor daemons!

We’ll cover the basics of two daemons, the Collector and the Schedd:

  • The Collector maintains an inventory of all the pieces of the HTCondor pool. For example, each machine that can run jobs will advertise a ClassAd describing its resources and state. In this module, we’ll learn the basics of querying the collector for information and displaying results.

  • The Schedd maintains a queue of jobs and is responsible for managing their execution. We’ll learn the basics of querying the schedd.

The Python bindings can also interact with several other daemons, notably the Startd and the Negotiator. We’ll cover those in the advanced modules.

If you are running these tutorials in the provided Docker container or on Binder, a local HTCondor pool has been started in the background for you to interact with.

To get started, let’s import the htcondor modules.

[1]:
import htcondor
import classad

Collector

We’ll start with the Collector, which gathers descriptions of the states of all the daemons in your HTCondor pool. The collector provides both service discovery and monitoring for these daemons.

Let’s try to find the Schedd information for your HTCondor pool. First, we’ll create a Collector object, then use the locate method:

[2]:
coll = htcondor.Collector()  # create the object representing the collector
schedd_ad = coll.locate(htcondor.DaemonTypes.Schedd) # locate the default schedd

print(schedd_ad)

    [
        CondorPlatform = "$CondorPlatform: X86_64-CentOS_5.11 $";
        CondorVersion = "$CondorVersion: 8.9.10 Nov 24 2020 BuildID: UW_Python_Wheel_Build RC $";
        Machine = "4726328b203e";
        MyType = "Scheduler";
        Name = "jovyan@4726328b203e";
        MyAddress = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=4726328b203e&noUDP&sock=schedd_16_de02>"
    ]

The locate method takes a type of daemon and (optionally) a name, returning a ClassAd that describes how to contact the daemon.

A few interesting points about the above example:

  • Because we didn’t pass any arguments to the Collector constructor, we used the default collector from the container’s configuration file. To query a non-default collector instead, we could have used htcondor.Collector("collector.example.com").

  • We used the DaemonTypes enumeration to pick the kind of daemon to return.

  • If there were multiple schedds in the pool, the locate query would have failed. In such a case, we need to provide an explicit name to the method, e.g., coll.locate(htcondor.DaemonTypes.Schedd, "schedd.example.com").

  • The MyAddress field in the ad is the actual address information. You may be surprised that this is not simply a hostname:port; to help manage addressing in today’s complicated Internet (full of NATs, private networks, and firewalls), a more flexible structure was needed. HTCondor developers sometimes refer to this as the sinful string; here, sinful is a play on a Unix data structure, not a moral judgement.
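To make the structure of the MyAddress value more concrete, here is a minimal sketch that splits the sinful string from the output above into its parts using only the Python standard library. The helper name parse_sinful is hypothetical and is not part of the HTCondor bindings; it assumes the IPv4-style `<host:port?key=value&...>` layout shown above and is not a complete parser for the format.

```python
from urllib.parse import parse_qs


def parse_sinful(sinful):
    """Split a sinful string of the form <host:port?params> into its parts.

    Hypothetical helper for illustration only; it assumes the IPv4-style
    layout seen in the example output and does not handle every variant.
    """
    body = sinful.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError(f"not a sinful string: {sinful!r}")
    body = body[1:-1]  # drop the surrounding angle brackets

    address, _, query = body.partition("?")
    host, _, port = address.rpartition(":")

    # keep_blank_values=True preserves bare flags such as "noUDP"
    parsed = parse_qs(query, keep_blank_values=True)
    params = {key: values[0] for key, values in parsed.items()}
    return host, int(port), params


example = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=4726328b203e&noUDP&sock=schedd_16_de02>"
host, port, params = parse_sinful(example)
print(host, port)
print(params)
```

In practice you rarely need to take the address apart yourself, since the ad returned by locate can be handed directly to the bindings; the sketch is only meant to show that the string carries more than a host and a port.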

The locate method often returns only enough data to contact a remote daemon. Typically, a ClassAd records significantly more attributes. For example, if we wanted to query for a few specific attributes, we would use the query method instead:

[3]:
coll.query(htcondor.AdTypes.Schedd, projection=["Name", "MyAddress", "DaemonCoreDutyCycle"])
[3]:
[[ DaemonCoreDutyCycle = 3.140653949190941E-03; Name = "jovyan@4726328b203e"; MyAddress = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=4726328b203e&noUDP&sock=schedd_16_de02>" ]]

Here, query takes an AdType (slightly more generic than the DaemonTypes, as many kinds of ads are in the collector) and several optional arguments, then returns a list of ClassAds.

We used the projection keyword argument; this indicates what attributes you want returned. The collector may automatically insert additional attributes (here, only MyType); if an ad is missing a requested attribute, it is simply not set in the returned ClassAd object. If no projection is specified, then all attributes are returned.

WARNING: when possible, utilize the projection to limit the data returned. Some ads may have hundreds of attributes, making returning the entire ad an expensive operation.
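As a rough analogy for how a projection behaves, the sketch below applies the same idea to a plain Python dictionary standing in for an ad. The project function is a hypothetical helper, not part of the bindings, and it runs locally rather than in the collector; it only illustrates that requested attributes missing from an ad are simply left out of the result.

```python
def project(ad, attributes):
    """Return only the requested attributes that are present in the ad."""
    return {name: ad[name] for name in attributes if name in ad}


# A plain dict standing in for a ClassAd (illustration only).
ad = {
    "Name": "jovyan@4726328b203e",
    "MyType": "Scheduler",
    "Machine": "4726328b203e",
}

# "DaemonCoreDutyCycle" is not in this stand-in ad, so it is omitted
# from the result instead of raising an error.
projected = project(ad, ["Name", "DaemonCoreDutyCycle"])
print(projected)
```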

The projection filters the returned keys; to filter out unwanted ads, utilize the constraint option. Let’s do the same query again, but specify our hostname explicitly:

[4]:
import socket # We'll use this to automatically fill in our hostname

name = classad.quote(f"jovyan@{socket.getfqdn()}")
coll.query(
    htcondor.AdTypes.Schedd,
    constraint=f"Name =?= {name}",
    projection=["Name", "MyAddress", "DaemonCoreDutyCycle"],
)
[4]:
[[ DaemonCoreDutyCycle = 3.140653949190941E-03; Name = "jovyan@4726328b203e"; MyAddress = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=4726328b203e&noUDP&sock=schedd_16_de02>" ]]

Notes:

  • constraint accepts either an ExprTree or a string object; the latter is automatically parsed as an expression.

  • We used the classad.quote function to properly quote the hostname string. In this example, we’re relatively certain the hostname won’t contain quotes. However, it is good practice to use the quote function to avoid possible SQL-injection-type attacks. Consider what would happen if the host’s FQDN were foo.example.com" || true, with its embedded spaces and double quote!
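The injection risk can be seen without a running pool. The sketch below builds the constraint string twice for the hostile name mentioned above, once by naive interpolation and once through a quoting helper. Note that quote_string is a simplified, hypothetical stand-in written for this illustration; it is assumed to approximate what classad.quote does (escaping backslashes and double quotes), and real code should call classad.quote instead.

```python
def quote_string(value):
    """Simplified stand-in for classad.quote (illustration only).

    Escapes backslashes and double quotes, then wraps the result in
    double quotes so it reads as a single string literal.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


hostile = 'foo.example.com" || true'

# Naive interpolation: the embedded quote ends the string literal early,
# so "|| true" leaks into the expression itself.
naive = f'Name =?= "{hostile}"'

# Quoted interpolation: the whole value stays inside one string literal.
safe = f"Name =?= {quote_string(hostile)}"

print(naive)
print(safe)
```

The naive string contains an unescaped `" || true` in the middle of the expression, while the quoted version keeps the entire hostname, quote included, as data to compare against.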

Schedd

Let’s try our hand at querying the schedd!

First, we’ll need a schedd object. You may either create one out of the ad returned by locate above or use the default in the configuration file:

[5]:
schedd = htcondor.Schedd(schedd_ad)
print(schedd)
<htcondor.htcondor.Schedd object at 0x7fcf1e3d08f0>

Unfortunately, as there are no jobs in our personal HTCondor pool, querying the schedd will be boring. Let’s submit a few jobs (note the API used below will be covered by the next module; it’s OK if you don’t understand it now):

[6]:
sub = htcondor.Submit(
    executable = "/bin/sleep",
    arguments = "5m",
)
with schedd.transaction() as txn:
    sub.queue(txn, 10)

We should now have 10 jobs in the queue, each of which should take 5 minutes to complete.

Let’s query for the jobs, paying attention to the jobs’ ID and status:

[7]:
for job in schedd.xquery(projection=['ClusterId', 'ProcId', 'JobStatus']):
    print(repr(job))
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 0; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 1; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 2; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 3; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 4; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 5; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 6; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 7; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 8; ClusterId = 8 ]
[ ServerTime = 1606229341; JobStatus = 1; ProcId = 9; ClusterId = 8 ]

The JobStatus is an integer; the integers map into the following states:

  • 1: Idle (I)

  • 2: Running (R)

  • 3: Removed (X)

  • 4: Completed (C)

  • 5: Held (H)

  • 6: Transferring Output

  • 7: Suspended

Depending on how quickly you executed the above cell, you might see all jobs idle (JobStatus = 1) or some jobs running (JobStatus = 2) above.
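When printing query results, it can be convenient to translate the JobStatus integers into readable names. The following is a minimal sketch using the mapping listed above; the names JOB_STATUS_NAMES and status_name are hypothetical helpers, not part of the bindings, and the list of status codes is made-up sample data standing in for values read from real job ads.

```python
from collections import Counter

# Status codes as listed in the mapping above.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def status_name(code):
    """Return the readable name for a JobStatus code, or a fallback label."""
    return JOB_STATUS_NAMES.get(code, f"Unknown ({code})")


# Sample JobStatus values standing in for the results of a job query.
statuses = [1, 1, 2, 1, 2]

# Tally how many jobs are in each state.
counts = Counter(status_name(code) for code in statuses)
print(dict(counts))
```

With real results, the sample list would be replaced by the JobStatus attribute of each ad returned from the schedd.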

As with the Collector’s query method, we can also filter out jobs using xquery:

[8]:
for ad in schedd.xquery(constraint = 'ProcId >= 5', projection=['ProcId']):
    print(ad.get('ProcId'))
5
6
7
8
9

Astute readers may notice that the Schedd object has both xquery and query methods. The difference between them is primarily how memory is managed:

  • query returns a list of ClassAds, meaning all objects are held in memory at once. This utilizes more memory, but the results are immediately available.

  • xquery returns an iterator that produces ClassAds. This only requires one ClassAd to be in memory at once.
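The list-versus-iterator trade-off is a general Python pattern, and the sketch below shows it with a generator in place of the schedd. The fake_job_ads function and its dictionaries are stand-ins invented for this illustration; they are not real ClassAds, and no HTCondor calls are made.

```python
def fake_job_ads(count):
    """Yield stand-in job ads one at a time (plain dicts, not real ClassAds)."""
    for proc_id in range(count):
        yield {"ClusterId": 8, "ProcId": proc_id, "JobStatus": 1}


# List style (comparable to query): every ad is built and held in memory
# before any of them is used.
all_ads = list(fake_job_ads(10))
print(len(all_ads))

# Iterator style (comparable to xquery): ads are produced on demand, so
# taking the first three never builds the remaining seven.
lazy_ads = fake_job_ads(10)
first_three = [next(lazy_ads)["ProcId"] for _ in range(3)]
print(first_three)
```

The iterator form suits large queues where each ad is handled once and discarded; the list form is simpler when the results need to be sorted, counted, or revisited.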

Finally, let’s clean up after ourselves (this will remove all of the jobs you own from the queue).

[9]:
import getpass

schedd.act(htcondor.JobAction.Remove, f'Owner == "{getpass.getuser()}"')
[9]:
[ TotalJobAds = 0; TotalPermissionDenied = 0; TotalAlreadyDone = 0; TotalNotFound = 0; TotalSuccess = 10; TotalChangedAds = 1; TotalBadStatus = 0; TotalError = 0 ]

On Job Submission

Congratulations! You can now perform simple queries against the collector for worker and submit hosts, as well as simple job queries against the submit host!

It is now time to move on to advanced job submission and management.