Submitting and Managing Jobs


What is HTCondor?

An HTCondor pool provides a way for you (as a user) to submit units of work, called jobs, to be executed on a distributed network of computing resources. HTCondor provides tools to monitor your jobs as they run, and make certain kinds of changes to them after submission, which we call “managing” jobs.

In this tutorial, we will learn how to submit and manage jobs from Python. We will see how to submit jobs with various toy executables, how to ask HTCondor for information about them, and how to tell HTCondor to do things with them. All of these things are possible from the command line as well, using tools like condor_submit, condor_qedit, and condor_hold. However, working from Python instead of the command line gives us access to the full power of Python to do things like generate jobs programmatically based on user input, pass information consistently from submission to management, or even expose an HTCondor pool to a web application.

We start by importing the HTCondor Python bindings modules, which provide the functions we will need to talk to HTCondor.

[1]:
import htcondor  # for submitting jobs, querying HTCondor daemons, etc.
import classad   # for interacting with ClassAds, HTCondor's internal data format

Submitting a Simple Job

To submit a job, we must first describe it. A submit description is held in a Submit object. Submit objects consist of key-value pairs, and generally behave like Python dictionaries. If you’re familiar with HTCondor’s submit file syntax, you should think of each line in the submit file as a single key-value pair in the Submit object.

Let’s start by writing a Submit object that describes a job that executes the hostname command on an execute node, which prints out the “name” of the node. Since hostname prints its results to standard output (stdout), we will capture stdout and bring it back to the submit machine so we can see the name.

[2]:
hostname_job = htcondor.Submit({
    "executable": "/bin/hostname",  # the program to run on the execute node
    "output": "hostname.out",       # anything the job prints to standard output will end up in this file
    "error": "hostname.err",        # anything the job prints to standard error will end up in this file
    "log": "hostname.log",          # this file will contain a record of what happened to the job
    "request_cpus": "1",            # how many CPU cores we want
    "request_memory": "128MB",      # how much memory we want
    "request_disk": "128MB",        # how much disk space we want
})

print(hostname_job)
executable = /bin/hostname
output = hostname.out
error = hostname.err
log = hostname.log
request_cpus = 1
request_memory = 128MB
request_disk = 128MB

The available descriptors are documented in the condor_submit manual. The keys of the Python dictionary you pass to htcondor.Submit should be the same as for the submit descriptors, and the values should be strings containing exactly what would go on the right-hand side.
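
Because every value must be a string, it can be convenient to build the description from ordinary Python values and convert them in one place. The following is a minimal sketch, not part of the HTCondor bindings: the helper name to_submit_strings is hypothetical, and it assumes that str() of each value is exactly what you would write on the right-hand side in a submit file.

```python
def to_submit_strings(description):
    """Return a copy of `description` with every value converted to a string."""
    return {key: str(value) for key, value in description.items()}


raw_description = {
    "executable": "/bin/hostname",
    "request_cpus": 1,            # an int here, purely for convenience
    "request_memory": "128MB",
}

converted = to_submit_strings(raw_description)
print(converted)
```

The converted dictionary could then be passed to htcondor.Submit in the same way as the literal dictionary above.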

Note that we gave the Submit object several relative filepaths. These paths are relative to the directory containing this Jupyter notebook (or, more generally, the current working directory). When we run the job, you should see those files appear in the file browser on the left as HTCondor creates them.

Now that we have a description, let’s submit a job. To do so, we must ask the HTCondor scheduler to open a transaction. Once we have the transaction, we can “queue” (i.e., submit) a job via the Submit object.

[3]:
schedd = htcondor.Schedd()          # get the Python representation of the scheduler
with schedd.transaction() as txn:   # open a transaction, represented by `txn`
    cluster_id = hostname_job.queue(txn)     # queues one job in the current transaction; returns job's ClusterId

print(cluster_id)
11

The integer returned by the queue method is the ClusterId for the submission. It uniquely identifies this submission. Later in this module, we will use it to ask the HTCondor scheduler for information about our jobs.

It isn’t important to understand the transaction mechanics for now; think of it as boilerplate. (There are advanced use cases where it might be useful.)

By now, our job will hopefully have finished running. You should be able to see the files in the file browser on the left. Try opening one of them and seeing what’s inside.

We can also look at the output from inside Python:

[4]:
import os
import time

output_path = "hostname.out"

# this is a crude way to wait for the job to finish
# see the Advanced tutorial "Scalable Job Tracking" for better methods!
while not os.path.exists(output_path):
    print("Output file doesn't exist yet; sleeping for one second")
    time.sleep(1)

with open(output_path, mode = "r") as f:
    print(f.read())
Output file doesn't exist yet; sleeping for one second
Output file doesn't exist yet; sleeping for one second
Output file doesn't exist yet; sleeping for one second
Output file doesn't exist yet; sleeping for one second
Output file doesn't exist yet; sleeping for one second
Output file doesn't exist yet; sleeping for one second

If you got some text, it worked!

If the file never shows up, it means your job didn’t run. You might try looking at the log or error files specified in the submit description to see if there is any useful information in them about why the job failed.
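
Since the file may never show up, a waiting loop without a time limit can spin forever. Below is a minimal sketch of the same crude polling approach with a timeout added. The function name wait_for_file and its parameters are hypothetical, and it only checks that the file exists, not that the job has finished writing to it.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=60.0, poll_interval=1.0):
    """Return True once `path` exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while not Path(path).exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


# Possible usage with the job above (left as a comment so nothing blocks here):
# if not wait_for_file("hostname.out", timeout=120):
#     print("Gave up waiting; check hostname.log and hostname.err")
```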

Submitting Multiple Jobs

By default, each queue will submit a single job. A more common use case is to submit many jobs at once, often sharing some base submit description. Let’s write a new submit description which runs sleep.

When we have multiple jobs in a single cluster, each job will be identified not just by its ClusterId but also by a ProcId. We can use the ProcId to separate the output and error files for each individual job. Anything that looks like $(...) in a submit description is a macro, a placeholder which will be “expanded” later by HTCondor into a real value for that particular job. The $(ProcId) macro expands to a series of incrementing integers, starting at 0. So the first job in a cluster will have ProcId 0, the next will have ProcId 1, etc.

[5]:
sleep_job = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "10s",               # sleep for 10 seconds
    "output": "sleep-$(ProcId).out",  # output and error for each job, using the $(ProcId) macro
    "error": "sleep-$(ProcId).err",
    "log": "sleep.log",               # we still send all of the HTCondor logs for every job to the same file (not split up!)
    "request_cpus": "1",
    "request_memory": "128MB",
    "request_disk": "128MB",
})

print(sleep_job)
executable = /bin/sleep
arguments = 10s
output = sleep-$(ProcId).out
error = sleep-$(ProcId).err
log = sleep.log
request_cpus = 1
request_memory = 128MB
request_disk = 128MB
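
To picture what happens to the $(ProcId) placeholders in this description, here is a small pure-Python imitation of macro expansion. It is only an illustration: the function expand_macros is hypothetical, and HTCondor's own macro expansion is richer than this simple substitution.

```python
import re

# matches $(Name) and captures Name
MACRO = re.compile(r"\$\((\w+)\)")


def expand_macros(template, values):
    """Replace each $(Name) in `template` with values[Name]; unknown macros are left as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return MACRO.sub(substitute, template)


for proc_id in range(3):
    print(expand_macros("sleep-$(ProcId).out", {"ProcId": proc_id}))
# prints sleep-0.out, sleep-1.out, and sleep-2.out on separate lines
```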

We will submit 10 of these jobs. All we need to change from our previous queue call is to add the count keyword argument.

[6]:
schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = sleep_job.queue(txn, count=10)  # submit 10 jobs
print(cluster_id)
12

Now that we have a bunch of jobs in flight, we might want to check how they’re doing. We can ask the HTCondor scheduler about jobs by using its query method. We give it a constraint, which tells it which jobs to look for, and a projection (called attr_list for historical reasons), which tells it what information to return.

[7]:
schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    attr_list=["ClusterId", "ProcId", "Out"],
)
[7]:
[[ ClusterId = 12; ProcId = 3; Out = "sleep-3.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 4; Out = "sleep-4.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 5; Out = "sleep-5.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 6; Out = "sleep-6.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 7; Out = "sleep-7.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 8; Out = "sleep-8.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 9; Out = "sleep-9.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 0; Out = "sleep-0.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 1; Out = "sleep-1.out"; ServerTime = 1589900141 ],
 [ ClusterId = 12; ProcId = 2; Out = "sleep-2.out"; ServerTime = 1589900141 ]]

There are a few things to notice here:

  • Depending on how long it took you to run the cell, you may only get a few of your 10 jobs in the query. Jobs that have finished leave the queue, and will no longer show up in queries. To see those jobs, you must use the history method instead, which behaves like query, but only looks at jobs that have left the queue.
  • The results may not have come back in ProcId-sorted order. If you want to guarantee the order of the results, you must do so yourself.
  • Attributes are often renamed between the submit description and the actual job description in the queue. See the manual for a description of the job attribute names.
  • The objects returned by the query are instances of ClassAd. ClassAds are the common data exchange format used by HTCondor. In Python, they mostly behave like dictionaries.
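
Because the order of query results is not guaranteed, one way to impose it is Python's built-in sorted. The sketch below uses plain dictionaries as stand-ins for the ClassAds a real query would return, so it runs without a scheduler; the sample values are copied from the output above.

```python
# stand-in for the list returned by schedd.query(); real results are ClassAd objects
ads = [
    {"ClusterId": 12, "ProcId": 3, "Out": "sleep-3.out"},
    {"ClusterId": 12, "ProcId": 0, "Out": "sleep-0.out"},
    {"ClusterId": 12, "ProcId": 1, "Out": "sleep-1.out"},
]

# sort by ClusterId first, then by ProcId within each cluster
sorted_ads = sorted(ads, key=lambda ad: (ad["ClusterId"], ad["ProcId"]))

for ad in sorted_ads:
    print(ad["ProcId"], ad["Out"])
```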

Using Itemdata to Vary Over Parameters

By varying some part of the submit description using the ProcID, we can change how each individual job behaves. Perhaps it will use a different input file, or a different argument. However, we often want more flexibility than that. Perhaps our input files are named after different cities, or by timestamp, or some other naming scheme that already exists.

To use such information in the submit description, we need to use itemdata. Itemdata lets us pass arbitrary extra information when we queue, which we can reference with macros inside the submit description. This lets us use the full power of Python to generate the submit descriptions for our jobs.

Let’s mock this situation out by generating some files with randomly-chosen names. We’ll also switch to using pathlib.Path, Python’s more modern file path manipulation library.

[8]:
from pathlib import Path
import random
import string
import shutil

def random_string(length):
    """Produce a random lowercase ASCII string with the given length."""
    return "".join(random.choices(string.ascii_lowercase, k = length))

# make a directory to hold the input files, clearing away any existing directory
input_dir = Path.cwd() / "inputs"
shutil.rmtree(input_dir, ignore_errors = True)
input_dir.mkdir()

# make 5 input files
for idx in range(5):
    rs = random_string(5)
    input_file = input_dir / "{}.txt".format(rs)
    input_file.write_text("Hello from job {}".format(rs))

Now we’ll get a list of all the files we just created in the input directory. This is precisely the kind of situation where Python affords us a great deal of flexibility over a submit file: we can use Python instead of the HTCondor submit language to generate and inspect the information we’re going to put into the submit description.

[9]:
input_files = list(input_dir.glob("*.txt"))

for path in input_files:
    print(path)
/home/jovyan/tutorials/users/inputs/hegal.txt
/home/jovyan/tutorials/users/inputs/gpqwf.txt
/home/jovyan/tutorials/users/inputs/usqvn.txt
/home/jovyan/tutorials/users/inputs/drlxt.txt
/home/jovyan/tutorials/users/inputs/goqsk.txt

Now we’ll make our submit description. Our goal is just to print out the text held in each file, which we can do using cat.

We will tell HTCondor to transfer the input file to the execute location by including it in transfer_input_files. We also need to call cat on the right file via arguments. Keep in mind that HTCondor will move the files in transfer_input_files directly to the scratch directory on the execute machine, so instead of the full path, we just need the file’s “name”, the last component of its path. pathlib will make it easy to extract this information.

[10]:
cat_job = htcondor.Submit({
    "executable": "/bin/cat",
    "arguments": "$(input_file_name)",          # we will pass in the value for this macro via itemdata
    "transfer_input_files": "$(input_file)",    # we also need HTCondor to move the file to the execute node
    "should_transfer_files": "yes",             # force HTCondor to transfer files even though we're running entirely inside a container (and it normally wouldn't need to)
    "output": "cat-$(ProcId).out",
    "error": "cat-$(ProcId).err",
    "log": "cat.log",
    "request_cpus": "1",
    "request_memory": "128MB",
    "request_disk": "128MB",
})

print(cat_job)
executable = /bin/cat
arguments = $(input_file_name)
transfer_input_files = $(input_file)
should_transfer_files = yes
output = cat-$(ProcId).out
error = cat-$(ProcId).err
log = cat.log
request_cpus = 1
request_memory = 128MB
request_disk = 128MB

The itemdata should be passed as a list of dictionaries, where the keys are the macro names to replace in the submit description. In our case, the keys are input_file and input_file_name, so we should have a list of 5 dictionaries (one per input file), each with two entries. HTCondor expects the input file list to be a comma-separated list of POSIX-style paths, so we explicitly convert our Path to a POSIX string.

[11]:
itemdata = [{"input_file": path.as_posix(), "input_file_name": path.name} for path in input_files]

for item in itemdata:
    print(item)
{'input_file': '/home/jovyan/tutorials/users/inputs/hegal.txt', 'input_file_name': 'hegal.txt'}
{'input_file': '/home/jovyan/tutorials/users/inputs/gpqwf.txt', 'input_file_name': 'gpqwf.txt'}
{'input_file': '/home/jovyan/tutorials/users/inputs/usqvn.txt', 'input_file_name': 'usqvn.txt'}
{'input_file': '/home/jovyan/tutorials/users/inputs/drlxt.txt', 'input_file_name': 'drlxt.txt'}
{'input_file': '/home/jovyan/tutorials/users/inputs/goqsk.txt', 'input_file_name': 'goqsk.txt'}
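
A mismatch between the macro names in the submit description and the keys in the itemdata is an easy mistake to make. The following sketch checks for it before submission. Everything here is hypothetical helper code rather than part of the bindings: it treats macro names as case-sensitive, only knows about the two built-in macros used in this tutorial, and uses made-up paths under /tmp for the demonstration.

```python
import re

MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")
BUILT_IN = {"procid", "clusterid"}  # macros HTCondor fills in by itself in this tutorial


def missing_itemdata_keys(description, itemdata):
    """Return (index, missing_names) for each item lacking a macro the description uses."""
    wanted = set()
    for value in description.values():
        wanted.update(MACRO.findall(value))
    wanted = {name for name in wanted if name.lower() not in BUILT_IN}

    problems = []
    for index, item in enumerate(itemdata):
        missing = sorted(wanted - set(item))
        if missing:
            problems.append((index, missing))
    return problems


description = {
    "arguments": "$(input_file_name)",
    "transfer_input_files": "$(input_file)",
    "output": "cat-$(ProcId).out",
}
good = [{"input_file": "/tmp/a.txt", "input_file_name": "a.txt"}]
bad = [{"input_file": "/tmp/b.txt"}]

print(missing_itemdata_keys(description, good))  # []
print(missing_itemdata_keys(description, bad))   # [(0, ['input_file_name'])]
```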

Now we’ll submit the jobs, using queue_with_itemdata instead of queue:

[12]:
schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    submit_result = cat_job.queue_with_itemdata(txn, itemdata = iter(itemdata))  # submit one job for each item in the itemdata

print(submit_result.cluster())
13

Note that queue_with_itemdata returns a “submit result”, not just the ClusterId. The ClusterId can be retrieved from the submit result with its cluster() method.

Let’s do a query to make sure we got the itemdata right (these jobs run fast, so you might need to re-run the jobs if your first run has already left the queue):

[13]:
schedd.query(
    constraint=f"ClusterId == {submit_result.cluster()}",
    attr_list=["ClusterId", "ProcId", "Out", "Args", "TransferInput"],
)
[13]:
[[ Args = "hegal.txt"; ClusterId = 13; ProcId = 0; Out = "cat-0.out"; TransferInput = "/home/jovyan/tutorials/users/inputs/hegal.txt"; ServerTime = 1589900141 ],
 [ Args = "gpqwf.txt"; ClusterId = 13; ProcId = 1; Out = "cat-1.out"; TransferInput = "/home/jovyan/tutorials/users/inputs/gpqwf.txt"; ServerTime = 1589900141 ],
 [ Args = "usqvn.txt"; ClusterId = 13; ProcId = 2; Out = "cat-2.out"; TransferInput = "/home/jovyan/tutorials/users/inputs/usqvn.txt"; ServerTime = 1589900141 ],
 [ Args = "drlxt.txt"; ClusterId = 13; ProcId = 3; Out = "cat-3.out"; TransferInput = "/home/jovyan/tutorials/users/inputs/drlxt.txt"; ServerTime = 1589900141 ],
 [ Args = "goqsk.txt"; ClusterId = 13; ProcId = 4; Out = "cat-4.out"; TransferInput = "/home/jovyan/tutorials/users/inputs/goqsk.txt"; ServerTime = 1589900141 ]]

And let’s take a look at all the output:

[14]:
# again, this is very crude - see the advanced tutorials!
while len(list(Path.cwd().glob("cat-*.out"))) != len(itemdata):
    print("Not all output files exist yet; sleeping for one second")
    time.sleep(1)

for output_file in Path.cwd().glob("cat-*.out"):
    print(output_file, "->", output_file.read_text())
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
Not all output files exist yet; sleeping for one second
/home/jovyan/tutorials/users/cat-1.out -> Hello from job gpqwf
/home/jovyan/tutorials/users/cat-0.out -> Hello from job hegal
/home/jovyan/tutorials/users/cat-2.out -> Hello from job usqvn
/home/jovyan/tutorials/users/cat-4.out -> Hello from job goqsk
/home/jovyan/tutorials/users/cat-3.out -> Hello from job drlxt

Managing Jobs

Once a job is in the queue, the scheduler will try its best to execute it to completion. There are several cases where you may want to interrupt the normal flow of jobs. Perhaps the results are no longer needed; perhaps the job needs to be edited to correct a submission error. These actions fall under the purview of job management.

There are two Schedd methods dedicated to job management:

  • edit(): Change an attribute for a set of jobs.
  • act(): Change the state of a job (remove it from the queue, hold it, suspend it, etc.).

The act method takes an argument from the JobAction enum. Commonly-used values include:

  • Hold: Put a job on hold, vacating a running job if necessary. A job will stay in the hold state until told otherwise.
  • Release: Release a job from the hold state, returning it to Idle.
  • Remove: Remove a job from the queue. If it is running, it will stop running. This requires the execute node to acknowledge it has successfully vacated the job, so Remove may not be instantaneous.
  • Vacate: Cause a running job to be killed on the remote resource and return to the Idle state. With Vacate, jobs may be given significant time to cleanly shut down.

To play with this, let’s bring back our sleep submit description, but increase the sleep time significantly so that we have time to interact with the jobs.

[15]:
long_sleep_job = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "10m",                # sleep for 10 minutes
    "output": "sleep-$(ProcId).out",
    "error": "sleep-$(ProcId).err",
    "log": "sleep.log",
    "request_cpus": "1",
    "request_memory": "128MB",
    "request_disk": "128MB",
})

print(long_sleep_job)
executable = /bin/sleep
arguments = 10m
output = sleep-$(ProcId).out
error = sleep-$(ProcId).err
log = sleep.log
request_cpus = 1
request_memory = 128MB
request_disk = 128MB

[16]:
schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = long_sleep_job.queue(txn, 5)

As an experiment, let’s set an arbitrary attribute on the jobs and check that it worked. In real work, we could do things like change the amount of memory a job has requested by editing its RequestMemory attribute. The job attributes that are built into HTCondor are described in the HTCondor manual, but your site may specify additional, custom attributes as well.

[17]:
# sets attribute foo to the string "bar" for all of our jobs
# note the nested quotes around bar! The outer "" make it a Python string; the inner "" make it a ClassAd string.
schedd.edit(f"ClusterId == {cluster_id}", "foo", "\"bar\"")

# do a query to check the value of attribute foo
schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    attr_list=["ClusterId", "ProcId", "JobStatus", "foo"],
)
[17]:
[[ ClusterId = 14; ProcId = 0; foo = "bar"; JobStatus = 1; ServerTime = 1589900191 ],
 [ ClusterId = 14; ProcId = 1; foo = "bar"; JobStatus = 1; ServerTime = 1589900191 ],
 [ ClusterId = 14; ProcId = 2; foo = "bar"; JobStatus = 1; ServerTime = 1589900191 ],
 [ ClusterId = 14; ProcId = 3; foo = "bar"; JobStatus = 1; ServerTime = 1589900191 ],
 [ ClusterId = 14; ProcId = 4; foo = "bar"; JobStatus = 1; ServerTime = 1589900191 ]]
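
The nested quoting used for "bar" gets harder to read when the value itself contains quotes. Here is a minimal sketch of a helper that wraps a Python string as a ClassAd string literal. The name classad_string is hypothetical, and the sketch assumes that backslash-escaping embedded double quotes and backslashes is sufficient for the values you pass.

```python
def classad_string(text):
    """Wrap `text` in double quotes, escaping backslashes and embedded double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(classad_string("bar"))        # "bar"
print(classad_string('say "hi"'))   # "say \"hi\""

# Possible usage with the edit call above (commented out; it needs a live scheduler):
# schedd.edit(f"ClusterId == {cluster_id}", "foo", classad_string("bar"))
```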

Although the job status appears to be an attribute, we cannot edit it directly. As mentioned above, we must instead act on the job. Let’s hold the first two jobs so that they stop running, but leave the others going.

[18]:
# hold the first two jobs
schedd.act(htcondor.JobAction.Hold, f"ClusterId == {cluster_id} && ProcID <= 1")

# check the status of the jobs
ads = schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    attr_list=["ClusterId", "ProcId", "JobStatus"],
)

for ad in ads:
    # the ClassAd objects returned by the query act like dictionaries, so we can extract individual values out of them using []
    print(f"ProcID = {ad['ProcID']} has JobStatus = {ad['JobStatus']}")
ProcID = 0 has JobStatus = 5
ProcID = 1 has JobStatus = 5
ProcID = 2 has JobStatus = 1
ProcID = 3 has JobStatus = 1
ProcID = 4 has JobStatus = 1

The various job statuses are represented by numbers. 1 means Idle, 2 means Running, and 5 means Held. If you see JobStatus = 5 above for ProcID = 0 and ProcID = 1, then we succeeded!
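
Raw status numbers are easy to misread, so a lookup table can help when printing them. The sketch below is a hypothetical helper, not part of the bindings. The codes 1, 2, and 5 come from the text above; the remaining entries follow the JobStatus values listed in the HTCondor manual's job attribute documentation, which is worth checking against your version.

```python
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def describe_status(code):
    """Return a readable name for a JobStatus code, or a placeholder if it is unknown."""
    return JOB_STATUS_NAMES.get(code, f"Unknown ({code})")


for code in (1, 5, 42):
    print(code, "->", describe_status(code))
```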

The opposite of JobAction.Hold is JobAction.Release. Let’s release those jobs and let them go back to Idle.

[19]:
schedd.act(htcondor.JobAction.Release, f"ClusterId == {cluster_id}")

ads = schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    attr_list=["ClusterId", "ProcId", "JobStatus"],
)

for ad in ads:
    # the ClassAd objects returned by the query act like dictionaries, so we can extract individual values out of them using []
    print(f"ProcID = {ad['ProcID']} has JobStatus = {ad['JobStatus']}")
ProcID = 0 has JobStatus = 1
ProcID = 1 has JobStatus = 1
ProcID = 2 has JobStatus = 1
ProcID = 3 has JobStatus = 1
ProcID = 4 has JobStatus = 1

Note that we simply released all the jobs in the cluster. Releasing a job that is not held doesn’t do anything, so we don’t have to be extremely careful.
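
If you would rather target only the held jobs, the constraint can say so explicitly by also testing JobStatus. This is a small sketch under the assumption that 5 is the Held status, as described earlier; the helper name held_jobs_constraint is hypothetical.

```python
HELD = 5  # JobStatus code for Held


def held_jobs_constraint(cluster_id):
    """Build a constraint string matching only the held jobs in one cluster."""
    return f"ClusterId == {cluster_id} && JobStatus == {HELD}"


print(held_jobs_constraint(14))
# prints: ClusterId == 14 && JobStatus == 5
```

The resulting string could be passed as the second argument to schedd.act in place of the cluster-wide constraint.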

Finally, let’s clean up after ourselves:

[20]:
schedd.act(htcondor.JobAction.Remove, f"ClusterId == {cluster_id}")
[20]:
[ TotalJobAds = 3; TotalPermissionDenied = 0; TotalAlreadyDone = 0; TotalNotFound = 0; TotalSuccess = 5; TotalChangedAds = 1; TotalBadStatus = 0; TotalError = 0 ]

Exercises

Now let’s practice what we’ve learned.

  • In each exercise, you will be given a piece of code and a test that does not yet pass.
  • The exercises are roughly in order of increasing difficulty.
  • Modify the code, or add new code to it, to pass the test. Do whatever it takes!
  • You can run the test by running the block it is in.
  • Feel free to look at the test for clues as to how to modify the code.
  • Many of the exercises can be solved either by using Python to generate inputs, or by using advanced features of the ClassAd language. Either way is valid!
  • Don’t modify the test. That’s cheating!

Exercise 1: Incrementing Sleeps

Submit five jobs which sleep for 5, 6, 7, 8, and 9 seconds, respectively.

[21]:
# MODIFY OR ADD TO THIS BLOCK...

incrementing_sleep = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "1",
    "output": "ex1-$(ProcId).out",
    "error": "ex1-$(ProcId).err",
    "log": "ex1.log",
    "request_cpus": "1",
    "request_memory": "128MB",
    "request_disk": "128MB",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = incrementing_sleep.queue(txn, 5)
[22]:
# ... TO MAKE THIS TEST PASS

expected = [str(i) for i in range(5, 10)]
print("Expected ", expected)

ads = schedd.query(f"ClusterId == {cluster_id}", attr_list = ["Args"])
arguments = sorted(ad["Args"] for ad in ads)
print("Got      ", arguments)

assert arguments == expected, "Arguments were not what we expected!"
print("The test passed. Good job!")
Expected  ['5', '6', '7', '8', '9']
Got       ['1', '1', '1', '1', '1']
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-22-70a913e244a8> in <module>
      8 print("Got      ", arguments)
      9
---> 10 assert arguments == expected, "Arguments were not what we expected!"
     11 print("The test passed. Good job!")

AssertionError: Arguments were not what we expected!

Exercise 2: Echo to Target

Run a job that makes the text Echo to Target appear in a file named ex3.txt.

[23]:
# MODIFY OR ADD TO THIS BLOCK...

echo = htcondor.Submit({
    "request_cpus": "1",
    "request_memory": "128MB",
    "request_disk": "128MB",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = echo.queue(txn, 1)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-b768186838d5> in <module>
      9 schedd = htcondor.Schedd()
     10 with schedd.transaction() as txn:
---> 11     cluster_id = echo.queue(txn, 1)

RuntimeError: No 'executable' parameter was provided
[24]:
# ... TO MAKE THIS TEST PASS

does_file_exist = os.path.exists("ex3.txt")
assert does_file_exist, "ex3.txt does not exist!"

expected = "Echo to Target"
print("Expected ", expected)

contents = open("ex3.txt", mode = "r").read().strip()
print("Got      ", contents)

assert expected in contents, "Contents were not what we expected!"

print("The test passed. Good job!")
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-24-408658d31a88> in <module>
      2
      3 does_file_exist = os.path.exists("ex3.txt")
----> 4 assert does_file_exist, "ex3.txt does not exist!"
      5
      6 expected = "Echo to Target"

AssertionError: ex3.txt does not exist!

Exercise 3: Holding Odds

Hold all of the odd-numbered jobs in this large cluster.

  • Note that the test block removes all of the jobs you own when it runs, to prevent these long-running jobs from corrupting other tests!
[25]:
# MODIFY OR ADD TO THIS BLOCK...

long_sleep = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "10m",
    "output": "ex2-$(ProcId).out",
    "error": "ex2-$(ProcId).err",
    "log": "ex2.log",
    "request_cpus": "1",
    "request_memory": "128MB",
    "request_disk": "128MB",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = long_sleep.queue(txn, 100)
[26]:
# ... TO MAKE THIS TEST PASS

import getpass

try:
    ads = schedd.query(f"ClusterId == {cluster_id}", attr_list = ["ProcID", "JobStatus"])
    proc_to_status = {int(ad["ProcID"]): ad["JobStatus"] for ad in sorted(ads, key = lambda ad: ad["ProcID"])}

    for proc, status in proc_to_status.items():
        print("Proc {} has status {}".format(proc, status))

    assert len(proc_to_status) == 100, "Wrong number of jobs (perhaps you need to resubmit them?)."
    assert all(status == 5 for proc, status in proc_to_status.items() if proc % 2 != 0), "Not all odd jobs were held."
    assert all(status != 5 for proc, status in proc_to_status.items() if proc % 2 == 0), "An even job was held."

    print("The test passed. Good job!")
finally:
    schedd.act(htcondor.JobAction.Remove, f'Owner=="{getpass.getuser()}"')
Proc 0 has status 1
Proc 1 has status 1
Proc 2 has status 1
Proc 3 has status 1
Proc 4 has status 1
Proc 5 has status 1
Proc 6 has status 1
Proc 7 has status 1
Proc 8 has status 1
Proc 9 has status 1
Proc 10 has status 1
Proc 11 has status 1
Proc 12 has status 1
Proc 13 has status 1
Proc 14 has status 1
Proc 15 has status 1
Proc 16 has status 1
Proc 17 has status 1
Proc 18 has status 1
Proc 19 has status 1
Proc 20 has status 1
Proc 21 has status 1
Proc 22 has status 1
Proc 23 has status 1
Proc 24 has status 1
Proc 25 has status 1
Proc 26 has status 1
Proc 27 has status 1
Proc 28 has status 1
Proc 29 has status 1
Proc 30 has status 1
Proc 31 has status 1
Proc 32 has status 1
Proc 33 has status 1
Proc 34 has status 1
Proc 35 has status 1
Proc 36 has status 1
Proc 37 has status 1
Proc 38 has status 1
Proc 39 has status 1
Proc 40 has status 1
Proc 41 has status 1
Proc 42 has status 1
Proc 43 has status 1
Proc 44 has status 1
Proc 45 has status 1
Proc 46 has status 1
Proc 47 has status 1
Proc 48 has status 1
Proc 49 has status 1
Proc 50 has status 1
Proc 51 has status 1
Proc 52 has status 1
Proc 53 has status 1
Proc 54 has status 1
Proc 55 has status 1
Proc 56 has status 1
Proc 57 has status 1
Proc 58 has status 1
Proc 59 has status 1
Proc 60 has status 1
Proc 61 has status 1
Proc 62 has status 1
Proc 63 has status 1
Proc 64 has status 1
Proc 65 has status 1
Proc 66 has status 1
Proc 67 has status 1
Proc 68 has status 1
Proc 69 has status 1
Proc 70 has status 1
Proc 71 has status 1
Proc 72 has status 1
Proc 73 has status 1
Proc 74 has status 1
Proc 75 has status 1
Proc 76 has status 1
Proc 77 has status 1
Proc 78 has status 1
Proc 79 has status 1
Proc 80 has status 1
Proc 81 has status 1
Proc 82 has status 1
Proc 83 has status 1
Proc 84 has status 1
Proc 85 has status 1
Proc 86 has status 1
Proc 87 has status 1
Proc 88 has status 1
Proc 89 has status 1
Proc 90 has status 1
Proc 91 has status 1
Proc 92 has status 1
Proc 93 has status 1
Proc 94 has status 1
Proc 95 has status 1
Proc 96 has status 1
Proc 97 has status 1
Proc 98 has status 1
Proc 99 has status 1
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-26-90d8c213859e> in <module>
     11
     12     assert len(proc_to_status) == 100, "Wrong number of jobs (perhaps you need to resubmit them?)."
---> 13     assert all(status == 5 for proc, status in proc_to_status.items() if proc % 2 != 0), "Not all odd jobs were held."
     14     assert all(status != 5 for proc, status in proc_to_status.items() if proc % 2 == 0), "An even job was held."
     15

AssertionError: Not all odd jobs were held.