Submitting a Remote Job
Submitting a job to a remote Access Point
Usually, when you run the condor_submit command, you are logged into an Access Point (AP) which is running a condor_schedd, and your submit defaults to sending the job to the condor_schedd running on that same AP. However, it is possible to have condor_submit send the job to a condor_schedd running on some other machine. Maybe you want to run condor_submit from your laptop and send the job to an AP on some server. Maybe you are building a web portal, and you want the portal to run on one machine and the condor_schedd to run on some other machine.
The first concern is security. When you submit locally, the condor_schedd can easily determine who is submitting the job, and thus which system account it should run the condor_shadow as. This is much more difficult with a remote, over-the-network submit, so some additional setup must happen. While this authentication can be set up with SSL, Kerberos, or Windows native methods, for Linux systems we recommend HTCondor's ID tokens, as they are easy for a user to set up, and secure.
Assuming that an administrator has set up signing keys (see Token Authentication), to create a token that can authenticate you for remote submission, log in to the Access Point and run the command
$ condor_token_fetch -token name_of_your_ap
Note that name_of_your_ap is merely a filename, but if you have more than one AP, it is good to name the file containing the token clearly. When this command succeeds, there is no output, but the access token is placed into the file with that name in the tokens.d subdirectory of your personal .condor directory in your home directory.
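For example, if you named the token file my_ap (a hypothetical name), you can confirm it landed in the expected place:

$ ls ~/.condor/tokens.d
my_ap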
If you copy this directory and its contents from the AP (the machine you want to submit to) and place the directory in the same place on the machine you want to submit from, then condor_submit can submit remotely. To do so, you'll need to tell condor_submit the name of the pool (i.e. the name of the machine running the central manager) and the name of the Access Point that you ran condor_token_fetch on. If you don't know the name of the central manager, running the command

condor_config_val COLLECTOR_HOST

will tell you.
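As a sketch, assuming a hypothetical AP host name ap.example.org, the copy and the central manager lookup might look like:

$ scp -r ap.example.org:~/.condor/tokens.d ~/.condor/
$ ssh ap.example.org condor_config_val COLLECTOR_HOST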
Then, to submit the job, on the remote machine, simply run

$ condor_submit -name name-of-ap -pool cm-name submit_file

along with any other options you might want to pass to condor_submit.
After condor_submit reports the cluster id of your new job, it has been successfully submitted to the AP, and the AP is responsible for the management of the job thereafter. You can query the job with

$ condor_q -name name-of-ap -pool cm-name

and run all the related commands like condor_rm, condor_hold and condor_release in a similar way.
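For example, to remove a remotely submitted job with the hypothetical id 42.0:

$ condor_rm -name name-of-ap -pool cm-name 42.0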
File transfer with remote submission
After condor_submit successfully completes a remote submission, the machine you ran condor_submit on is not involved at all in the management of the job; the remote AP manages it. Therefore, you can disconnect that machine from the network, turn it off, or hibernate it. Even if this machine is turned off, the AP will find a matching Execution Point to run the job on, and run it to completion.
This means that any input files specified in transfer_input_files are copied off of the submitting machine as part of the submit process and stored in a safe place on the Access Point. This safe place is the spool directory. While a user can force spooling to happen by adding the -spool option to condor_submit, any remote submit (with the -name option) automatically turns on spooling. Note that files transferred via file transfer plugins are never spooled; they are always pulled by the Execution Point immediately before job execution.
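As a minimal sketch (file names are hypothetical), a submit file that mixes a spooled local input with a plugin-transferred URL might look like:

# data.csv is spooled to the AP at submit time; the https URL is
# fetched by the Execution Point just before the job runs
executable = analyze.sh
transfer_input_files = data.csv, https://example.org/reference.tar.gz
should_transfer_files = YES
log = job.log
queue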
Correspondingly, when the jobs complete, output files cannot be transferred to the submitting machine, as it may be off or disconnected from the network. These files are also stored in the spool directory of the AP machine. To indicate that a completed job still has spool files held on the AP machine, a remotely submitted job remains in the AP's queue, visible with the condor_q command after completion, in the Completed ('C') state. Jobs will stay in this state for three days by default, or until you have fetched the output files off of the machine.
You can fetch the output sandbox from the AP back to your submitting machine (or anywhere that has permissions) by running the condor_transfer_data command. This also takes a -name and a -pool option like condor_submit. You can specify a job or jobs in the usual way, often just with the cluster.proc syntax. When run, it copies the job's output sandbox from the spool on the AP back to the current directory of the machine condor_transfer_data is run on.
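For example, to fetch the output sandbox of the hypothetical job 42.0 into the current directory:

$ condor_transfer_data -name name-of-ap -pool cm-name 42.0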