HTCondor Introduction
Let’s start interacting with the HTCondor daemons!
We’ll cover the basics of two daemons, the Collector and the Schedd:
- The Collector maintains an inventory of all the pieces of the HTCondor pool. For example, each machine that can run jobs will advertise a ClassAd describing its resources and state. In this module, we’ll learn the basics of querying the collector for information and displaying results.
- The Schedd maintains a queue of jobs and is responsible for managing their execution. We’ll learn the basics of querying the schedd.
There are several other daemons - particularly, the Startd and the Negotiator - that the Python bindings can interact with. We’ll cover those in the advanced modules.
If you are running these tutorials in the provided Docker container or on Binder, a local HTCondor pool has been started in the background for you to interact with.
To get started, let’s import the htcondor and classad modules.
[1]:
import htcondor
import classad
Collector
We’ll start with the Collector, which gathers descriptions of the states of all the daemons in your HTCondor pool. The collector provides both service discovery and monitoring for these daemons.
Let’s try to find the Schedd information for your HTCondor pool. First, we’ll create a Collector object, then use the locate method:
[2]:
coll = htcondor.Collector() # create the object representing the collector
schedd_ad = coll.locate(htcondor.DaemonTypes.Schedd) # locate the default schedd
print(schedd_ad)
[
CondorPlatform = "$CondorPlatform: X86_64-CentOS_7.9 $";
MyType = "Scheduler";
Machine = "fa6c829ace67";
Name = "jovyan@fa6c829ace67";
CondorVersion = "$CondorVersion: 10.7.0 2023-07-31 BuildID: UW_Python_Wheel_Build $";
MyAddress = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=fa6c829ace67&noUDP&sock=schedd_82_2342>"
]
The locate method takes a type of daemon and (optionally) a name, returning a ClassAd that describes how to contact the daemon.
A few interesting points about the above example:
- Because we didn’t pass any arguments to the Collector constructor, we used the default collector from the container’s configuration file. To query a non-default collector instead, we could have written htcondor.Collector("collector.example.com").
- We used the DaemonTypes enumeration to pick the kind of daemon to return.
- If there were multiple schedds in the pool, the locate query would have failed. In such a case, we need to provide an explicit name to the method, e.g., coll.locate(htcondor.DaemonTypes.Schedd, "schedd.example.com").
- The MyAddress field in the ad is the actual address information. You may be surprised that this is not simply a hostname:port; to help manage addressing in today’s complicated Internet (full of NATs, private networks, and firewalls), a more flexible structure was needed. HTCondor developers sometimes refer to this as the sinful string; here, sinful is a play on a Unix data structure, not a moral judgement.
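To make the shape of the MyAddress value more concrete, here is a minimal sketch that pulls apart the sinful string shown in the output above using only the Python standard library. The split_sinful helper is hypothetical and not part of the HTCondor bindings; it assumes the simple IPv4 form seen in this example and is meant for inspection only, not as a general-purpose parser.

```python
from urllib.parse import parse_qs


def split_sinful(sinful: str):
    """Hypothetical helper: informally split '<host:port?params>' into parts.

    Assumes the IPv4 form shown in this tutorial's output.
    """
    inner = sinful.strip()
    # Drop the surrounding angle brackets, if present.
    if inner.startswith("<") and inner.endswith(">"):
        inner = inner[1:-1]
    # Everything before '?' is the primary address; the rest are parameters.
    address, _, query = inner.partition("?")
    host, _, port = address.rpartition(":")
    # keep_blank_values=True retains flag-style parameters such as 'noUDP'.
    params = parse_qs(query, keep_blank_values=True)
    return host, int(port), {key: values[0] for key, values in params.items()}


# The MyAddress value copied from the locate output above.
my_address = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=fa6c829ace67&noUDP&sock=schedd_82_2342>"
host, port, params = split_sinful(my_address)
print(host, port)      # 172.17.0.2 9618
print(params["sock"])  # schedd_82_2342
```

In normal use you would not parse this string yourself; passing the ad returned by locate to the bindings is enough.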
The locate method often returns only enough data to contact a remote daemon. Typically, a ClassAd records significantly more attributes. For example, if we wanted to query for a few specific attributes, we would use the query method instead:
[3]:
coll.query(htcondor.AdTypes.Schedd, projection=["Name", "MyAddress", "DaemonCoreDutyCycle"])
[3]:
[[ DaemonCoreDutyCycle = 1.150843433815885E-02; Name = "jovyan@fa6c829ace67"; MyAddress = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=fa6c829ace67&noUDP&sock=schedd_82_2342>" ]]
Here, query takes an AdType (slightly more generic than the DaemonTypes, as many kinds of ads are in the collector) and several optional arguments, then returns a list of ClassAds.
We used the projection keyword argument; this indicates what attributes you want returned. The collector may automatically insert additional attributes (here, only MyType); if an ad is missing a requested attribute, it is simply not set in the returned ClassAd object. If no projection is specified, then all attributes are returned.
WARNING: when possible, utilize the projection to limit the data returned. Some ads may have hundreds of attributes, making returning the entire ad an expensive operation.
The projection filters the returned keys; to filter out unwanted ads, utilize the constraint option. Let’s do the same query again, but specify our hostname explicitly:
[4]:
import socket # We'll use this to automatically fill in our hostname
name = classad.quote(f"jovyan@{socket.getfqdn()}")
coll.query(
    htcondor.AdTypes.Schedd,
    constraint=f"Name =?= {name}",
    projection=["Name", "MyAddress", "DaemonCoreDutyCycle"],
)
[4]:
[[ DaemonCoreDutyCycle = 1.150843433815885E-02; Name = "jovyan@fa6c829ace67"; MyAddress = "<172.17.0.2:9618?addrs=172.17.0.2-9618&alias=fa6c829ace67&noUDP&sock=schedd_82_2342>" ]]
Notes:
- constraint accepts either an ExprTree or string object; the latter is automatically parsed as an expression.
- We used the classad.quote function to properly quote the hostname string. In this example, we’re relatively certain the hostname won’t contain quotes. However, it is good practice to use the quote function to avoid possible SQL-injection-type attacks. Consider what would happen if the host’s FQDN contained spaces and double quotes, such as:
  foo.example.com" || true
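The risk described in the notes above can be seen with plain Python strings alone. The sketch below builds a constraint by naive interpolation, without any escaping. The naive_constraint helper is hypothetical and exists only to illustrate the problem; real code should use classad.quote as in the query above.

```python
def naive_constraint(name: str) -> str:
    """Hypothetical helper: interpolate a name with no escaping (unsafe)."""
    return f'Name =?= "{name}"'


benign = naive_constraint("jovyan@fa6c829ace67")
hostile = naive_constraint('foo.example.com" || true')

print(benign)   # Name =?= "jovyan@fa6c829ace67"
print(hostile)  # Name =?= "foo.example.com" || true"
```

In the second string, the double quote embedded in the name closes the string literal early, so the text that follows it is no longer inside the quoted value. Quoting the value before interpolation avoids this.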
Schedd
Let’s try our hand at querying the schedd!
First, we’ll need a schedd object. You may either create one out of the ad returned by locate above or use the default in the configuration file:
[5]:
schedd = htcondor.Schedd(schedd_ad)
print(schedd)
<htcondor.htcondor.Schedd object at 0x7ffa2c15ee50>
Unfortunately, as there are no jobs in our personal HTCondor pool, querying the schedd will be boring. Let’s submit a few jobs (note the API used below will be covered by the next module; it’s OK if you don’t understand it now):
[6]:
sub = htcondor.Submit(
    executable = "/bin/sleep",
    arguments = "5m",
)
schedd.submit(sub, count=10)
[6]:
<htcondor.htcondor.SubmitResult at 0x7ffa3c46fc10>
We should now have 10 jobs in the queue, each of which should take 5 minutes to complete.
Let’s query for the jobs, paying attention to the jobs’ ID and status:
[7]:
for job in schedd.query(projection=['ClusterId', 'ProcId', 'JobStatus']):
    print(repr(job))
[ ProcId = 8; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
[ ProcId = 9; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
[ ProcId = 0; ClusterId = 7; JobStatus = 2; ServerTime = 1695159746 ]
[ ProcId = 1; ClusterId = 7; JobStatus = 2; ServerTime = 1695159746 ]
[ ProcId = 2; ClusterId = 7; JobStatus = 2; ServerTime = 1695159746 ]
[ ProcId = 3; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
[ ProcId = 4; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
[ ProcId = 5; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
[ ProcId = 6; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
[ ProcId = 7; ClusterId = 7; JobStatus = 1; ServerTime = 1695159746 ]
The JobStatus is an integer; the integers map into the following states:
- 1: Idle (I)
- 2: Running (R)
- 3: Removed (X)
- 4: Completed (C)
- 5: Held (H)
- 6: Transferring Output
- 7: Suspended
Depending on how quickly you executed the above cell, you might see all jobs idle (JobStatus = 1) or some jobs running (JobStatus = 2) above.
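Since the status codes are plain integers, a small lookup table makes query results easier to read. The following is a minimal sketch: JOB_STATUS_NAMES and summarize are hypothetical names, not part of the bindings, and plain dictionaries stand in for the job ads so the example runs without a pool. The sample data mirrors the output above, where ProcIds 0 through 2 are running and the rest are idle.

```python
from collections import Counter

# Status codes and labels as listed in this tutorial.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def summarize(jobs):
    """Count jobs per status label; unrecognized codes are kept visible."""
    return Counter(
        JOB_STATUS_NAMES.get(job["JobStatus"], f"Unknown ({job['JobStatus']})")
        for job in jobs
    )


# Stand-in for the ads returned by schedd.query in the cell above.
sample_jobs = [
    {"ClusterId": 7, "ProcId": proc, "JobStatus": 2 if proc < 3 else 1}
    for proc in range(10)
]

summary = summarize(sample_jobs)
print(dict(summary))  # {'Running': 3, 'Idle': 7}
```

Against a live pool, the same function should work on the ads returned by schedd.query, as long as JobStatus is included in the projection.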
As with the Collector’s query method, we can also filter jobs by passing a constraint to query:
[8]:
for ad in schedd.query(constraint = 'ProcId >= 5', projection=['ProcId']):
    print(ad.get('ProcId'))
8
9
5
6
7
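When a filter has several conditions, it can be convenient to assemble the constraint string from smaller pieces. Here is a minimal sketch, assuming a hypothetical all_of helper (not part of the bindings) that joins expression fragments with the ClassAd && operator. Parenthesizing each fragment keeps operator precedence from changing its meaning.

```python
def all_of(*clauses: str) -> str:
    """Hypothetical helper: AND together ClassAd expression fragments."""
    if not clauses:
        # With no conditions, match every ad.
        return "true"
    return " && ".join(f"({clause})" for clause in clauses)


constraint = all_of("ProcId >= 5", "JobStatus == 1")
print(constraint)  # (ProcId >= 5) && (JobStatus == 1)
```

The resulting string can be passed as the constraint argument of schedd.query. Any string value that comes from outside your program should still go through classad.quote before it is placed in a fragment.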
Finally, let’s clean up after ourselves (this will remove all of the jobs you own from the queue).
[9]:
import getpass
# Quote the username, following the same practice as for the hostname earlier.
schedd.act(htcondor.JobAction.Remove, f"Owner == {classad.quote(getpass.getuser())}")
[9]:
[ TotalChangedAds = 1; TotalJobAds = 0; TotalPermissionDenied = 0; TotalAlreadyDone = 0; TotalBadStatus = 0; TotalNotFound = 0; TotalSuccess = 10; TotalError = 0 ]
On Job Submission
Congratulations! You can now perform simple queries against the collector for worker and submit hosts, as well as simple job queries against the submit host!
It is now time to move on to advanced job submission and management.