Hold Reason Codes
Whenever a job is placed on hold, the job ad will be updated with a HoldReasonCode attribute, which will be set to a numeric code indicating the reason for the hold. The following table lists the possible values for the HoldReasonCode attribute, along with a brief description of each code. In addition, the HoldReasonCode attribute may be accompanied by a HoldReasonSubCode to give additional details.
The NumHoldsByReason job attribute will also be updated with the number of times the job has been held for each reason code, keyed by the NumHoldsByReason Label. For example, if a job has been held twice for reason code 12 and once for reason code 26, the NumHoldsByReason attribute will look like this:
NumHoldsByReason = [ TransferOutputError = 2; SystemPolicy = 1 ]
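The tally above can be reproduced with a short Python sketch. It is illustrative only: LABELS holds just a few code-to-label pairs copied from the table in this section, and the helper names num_holds_by_reason and format_classad are hypothetical, not part of any HTCondor API.

```python
from collections import Counter

# Partial mapping of HoldReasonCode -> NumHoldsByReason label, copied from
# the table in this section. Extend it with further rows as needed.
LABELS = {
    1: "UserRequest",
    12: "TransferOutputError",
    13: "TransferInputError",
    26: "SystemPolicy",
    34: "JobOutOfResources",
}


def num_holds_by_reason(hold_codes):
    """Count holds per label (hypothetical helper, not an HTCondor API)."""
    counts = Counter()
    for code in hold_codes:
        # Codes missing from LABELS get a placeholder key such as "Code99".
        counts[LABELS.get(code, f"Code{code}")] += 1
    return dict(counts)


def format_classad(counts):
    """Render the counts in the bracketed style used in the example above."""
    return "[ " + "; ".join(f"{k} = {v}" for k, v in counts.items()) + " ]"


# A job held for code 12, then 26, then 12 again.
history = [12, 26, 12]
print("NumHoldsByReason =", format_classad(num_holds_by_reason(history)))
```

Keys appear in the order in which each reason was first seen, so this history reproduces the example line shown above.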
| Integer HoldReasonCode [NumHoldsByReason Label] | Reason for Hold | HoldReasonSubCode | Suggestions to user to fix |
|---|---|---|---|
| 1 [UserRequest] | The user put the job on hold with condor_hold. | | |
| 3 [JobPolicy] | The | User Provided | |
| 4 [CorruptedCredential] | The credentials for the job are invalid. | | |
| 5 [JobPolicyUndefined] | A job policy expression evaluated to Undefined. | | |
| 6 [FailedToCreateProcess] | The condor_starter failed to start the executable. | Unix errno | |
| 7 [UnableToOpenOutput] | The standard output file for the job could not be opened. | Unix errno | |
| 8 [UnableToOpenInput] | The standard input file for the job could not be opened. | Unix errno | |
| 9 [UnableToOpenOutputStream] | The standard output stream for the job could not be opened. | Unix errno | |
| 10 [UnableToOpenInputStream] | The standard input stream for the job could not be opened. | Unix errno | |
| 11 [InvalidTransferAck] | An internal HTCondor protocol error was encountered when transferring files. | | |
| 12 [TransferOutputError] | An error occurred while transferring job output files or self-checkpoint files. | See note | |
| 13 [TransferInputError] | An error occurred while transferring job input files. | See note | |
| 14 [IwdError] | The initial working directory of the job cannot be accessed. | Unix errno | Verify initialdir exists and is writeable |
| 15 [SubmittedOnHold] | The user requested the job be submitted on hold. | | |
| 16 [SpoolingInput] | Input files are being spooled. | | Wait for spooling to complete |
| 17 [JobShadowMismatch] | A standard universe job is not compatible with the condor_shadow version available on the submitting machine. | | |
| 18 [InvalidTransferGoAhead] | An internal HTCondor protocol error was encountered when transferring files. | | |
| 19 [HookPrepareJobFailure] | <Keyword>_HOOK_PREPARE_JOB was defined but could not be executed or returned failure. | | |
| 20 [MissedDeferredExecutionTime] | The job missed its deferred execution time and therefore failed to run. | | |
| 21 [StartdHeldJob] | The job was put on hold because WANT_HOLD in the machine policy was true. | | |
| 22 [UnableToInitUserLog] | Unable to initialize job event log. | | Verify file in log lives in a writeable directory. |
| 23 [FailedToAccessUserAccount] | Failed to access user account. | | |
| 24 [NoCompatibleShadow] | No compatible shadow. | | |
| 25 [InvalidCronSettings] | Invalid cron settings. | | |
| 26 [SystemPolicy] | SYSTEM_PERIODIC_HOLD evaluated to true. | | |
| 27 [SystemPolicyUndefined] | The system periodic job policy evaluated to undefined. | | |
| 32 [MaxTransferInputSizeExceeded] | The maximum total input file transfer size was exceeded. (See MAX_TRANSFER_INPUT_MB) | | |
| 33 [MaxTransferOutputSizeExceeded] | The maximum total output file transfer size was exceeded. (See MAX_TRANSFER_OUTPUT_MB) | | |
| 34 [JobOutOfResources] | Job resource usage exceeded a provisioned limit; the limit exceeded is specified in the subcode. | Exceeded Resource: | |
| | Memory usage exceeded | 102 | Resubmit with larger request_memory or consider using retry_request_memory |
| | Disk usage exceeded | 104 | Resubmit with larger request_disk or consider using retry_request_disk |
| 35 [InvalidDockerImage] | Specified Docker image was invalid. | | Verify docker_image is correct in submit file |
| 36 [FailedToCheckpoint] | Job failed when sent the checkpoint signal it requested. | | |
| 37 [EC2UserError] | User error in the EC2 universe: | | |
| | Public key file not defined. | 1 | |
| | Private key file not defined. | 2 | |
| | Grid resource string missing EC2 service URL. | 4 | |
| | Failed to authenticate. | 9 | |
| | Can’t use existing SSH keypair with the given server’s type. | 10 | |
| | You, or somebody like you, cancelled this request. | 20 | |
| 38 [EC2InternalError] | Internal error in the EC2 universe: | | |
| | Grid resource type not EC2. | 3 | |
| | Grid resource type not set. | 5 | |
| | Grid job ID is not for EC2. | 7 | |
| | Unexpected remote job status. | 21 | |
| 39 [EC2AdminError] | Administrator error in the EC2 universe: | | |
| | EC2_GAHP not defined. | 6 | |
| 40 [EC2ConnectionProblem] | Connection problem in the EC2 universe | | |
| | …while creating an SSH keypair. | 11 | |
| | …while starting an on-demand instance. | 12 | |
| | …while requesting a spot instance. | 17 | |
| 41 [EC2ServerError] | Server error in the EC2 universe: | | |
| | Abnormal instance termination reason. | 13 | |
| | Unrecognized instance termination reason. | 14 | |
| | Resource was down for too long. | 22 | |
| 42 [EC2InstancePotentiallyLost] | Instance potentially lost due to an error in the EC2 universe: | | |
| | Connection error while terminating an instance. | 15 | |
| | Failed to terminate instance too many times. | 16 | |
| | Connection error while terminating a spot request. | 17 | |
| | Failed to terminate a spot request too many times. | 18 | |
| | Spot instance request purged before instance ID acquired. | 19 | |
| 43 [PreScriptFailed] | Pre script failed. | | |
| 44 [PostScriptFailed] | Post script failed. | | |
| 45 [SingularityTestFailed] | Test of singularity runtime failed before launching a job. | | |
| 46 [JobDurationExceeded] | The job’s allowed duration was exceeded. | | |
| 47 [JobExecuteExceeded] | The job’s allowed execution time was exceeded. | | |
| 48 [HookShadowPrepareJobFailure] | Prepare job shadow hook failed when it was executed; status code indicated job should be held. | | |

Note
For hold codes 12 [TransferOutputError] and 13 [TransferInputError]: file transfer may invoke file-transfer plug-ins. If it does, the hold subcodes may additionally be 62 (ETIME), if the file-transfer plug-in timed out; or the exit code of the plug-in shifted left by eight bits, otherwise.
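As a rough illustration of the note, the following Python sketch interprets a HoldReasonSubCode for hold codes 12 and 13. The function name describe_transfer_subcode is hypothetical, and the test used to recognize a shifted exit code (value of at least 256 with the low eight bits clear) is an assumption made for this sketch; the note itself does not say how to tell plug-in subcodes apart from other values, and a plug-in exit code of 0 cannot be recognized this way.

```python
ETIME = 62  # value the note gives for a file-transfer plug-in timeout


def describe_transfer_subcode(subcode):
    """Interpret HoldReasonSubCode for hold codes 12 and 13 (sketch only)."""
    if subcode == ETIME:
        return "file-transfer plug-in timed out"
    # Assumption: a plug-in exit code shifted left by eight bits is at least
    # 256 and has its low byte clear, so shifting right recovers the code.
    if subcode >= 256 and (subcode & 0xFF) == 0:
        return f"file-transfer plug-in exited with code {subcode >> 8}"
    return f"other subcode {subcode}"


for value in (62, 1 << 8, 3 << 8, 13):
    print(value, "->", describe_transfer_subcode(value))
```

Treat the result as a hint for where to look next, not as a definitive diagnosis.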