Hold Reason Codes

Whenever a job is placed on hold, the job ad is updated with a HoldReasonCode attribute, a numeric code indicating the reason for the hold. The following table lists the possible values of HoldReasonCode with a brief description of each. A HoldReasonSubCode attribute may accompany HoldReasonCode to give additional detail.

The NumHoldsByReason job attribute will also be updated with the number of times the job has been held for each reason code, keyed by the NumHoldsByReason Label. For example, if a job has been held twice for reason code 12 and once for reason code 26, the NumHoldsByReason attribute will look like this:

NumHoldsByReason = [ TransferOutputError = 2; SystemPolicy = 1 ]
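The tallying described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not HTCondor's implementation: the `HOLD_LABELS` dictionary is a hand-copied subset of the table in this section, and `tally_holds`, `format_classad`, and the `Code<N>` fallback label are hypothetical names introduced for the example.

```python
from collections import Counter

# Subset of the code-to-label mapping from the table in this section.
HOLD_LABELS = {
    1: "UserRequest",
    3: "JobPolicy",
    12: "TransferOutputError",
    13: "TransferInputError",
    26: "SystemPolicy",
    34: "JobOutOfResources",
}


def tally_holds(codes):
    """Count holds per label, mirroring the shape of NumHoldsByReason."""
    counts = Counter()
    for code in codes:
        # Hypothetical fallback label for codes missing from the subset above.
        counts[HOLD_LABELS.get(code, f"Code{code}")] += 1
    return dict(counts)


def format_classad(counts):
    """Render the tally in the bracketed form shown in the example above."""
    body = "; ".join(f"{label} = {n}" for label, n in counts.items())
    return f"[ {body} ]"


history = [12, 26, 12]  # held twice for code 12, once for code 26
print("NumHoldsByReason =", format_classad(tally_holds(history)))
```

With the hold history from the example (code 12 twice, code 26 once), the sketch prints the same attribute value shown above.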

Each entry in the table below lists, in order: the integer HoldReasonCode, the [NumHoldsByReason Label], the reason for the hold, the HoldReasonSubCode (if any), and suggestions to the user to fix (if any).

1
[UserRequest]

The user put the job on hold with condor_hold.

3
[JobPolicy]

The PERIODIC_HOLD expression evaluated to True, or ON_EXIT_HOLD was True.

User Provided

4
[CorruptedCredential]

The credentials for the job are invalid.

5
[JobPolicyUndefined]

A job policy expression evaluated to Undefined.

6
[FailedToCreateProcess]

The condor_starter failed to start the executable.

Unix errno

7
[UnableToOpenOutput]

The standard output file for the job could not be opened.

Unix errno

8
[UnableToOpenInput]

The standard input file for the job could not be opened.

Unix errno

9
[UnableToOpenOutputStream]

The standard output stream for the job could not be opened.

Unix errno

10
[UnableToOpenInputStream]

The standard input stream for the job could not be opened.

Unix errno

11
[InvalidTransferAck]

An internal HTCondor protocol error was encountered when transferring files.

12
[TransferOutputError]

An error occurred while transferring job output files or self-checkpoint files.

See note

13
[TransferInputError]

An error occurred while transferring job input files.

See note

14
[IwdError]

The initial working directory of the job cannot be accessed.

Unix errno

Verify initialdir exists and is writeable

15
[SubmittedOnHold]

The user requested the job be submitted on hold.

16
[SpoolingInput]

Input files are being spooled.

Wait for spooling to complete

17
[JobShadowMismatch]

A standard universe job is not compatible with the condor_shadow version available on the submitting machine.

18
[InvalidTransferGoAhead]

An internal HTCondor protocol error was encountered when transferring files.

19
[HookPrepareJobFailure]

<Keyword>_HOOK_PREPARE_JOB was defined but could not be executed or returned failure.

20
[MissedDeferredExecutionTime]

The job missed its deferred execution time and therefore failed to run.

21
[StartdHeldJob]

The job was put on hold because WANT_HOLD in the machine policy was true.

22
[UnableToInitUserLog]

Unable to initialize job event log.

Verify file in log lives in a writeable directory.

23
[FailedToAccessUserAccount]

Failed to access user account.

24
[NoCompatibleShadow]

No compatible shadow.

25
[InvalidCronSettings]

Invalid cron settings.

26
[SystemPolicy]

SYSTEM_PERIODIC_HOLD evaluated to true.

27
[SystemPolicyUndefined]

The system periodic job policy evaluated to undefined.

32
[MaxTransferInputSizeExceeded]

The maximum total input file transfer size was exceeded. (See MAX_TRANSFER_INPUT_MB.)

33
[MaxTransferOutputSizeExceeded]

The maximum total output file transfer size was exceeded. (See MAX_TRANSFER_OUTPUT_MB.)

34
[JobOutOfResources]

Job resource usage exceeded a provisioned limit; the limit exceeded is specified in the subcode.

Exceeded Resource:

Memory usage exceeded

102

Resubmit with larger request_memory or consider using retry_request_memory

Disk usage exceeded

104

Resubmit with larger request_disk or consider using retry_request_disk

35
[InvalidDockerImage]

Specified Docker image was invalid.

Verify docker_image is correct in submit file

36
[FailedToCheckpoint]

Job failed when sent the checkpoint signal it requested.

37
[EC2UserError]

User error in the EC2 universe:

Public key file not defined.

1

Private key file not defined.

2

Grid resource string missing EC2 service URL.

4

Failed to authenticate.

9

Can’t use existing SSH keypair with the given server’s type.

10

You, or somebody like you, cancelled this request.

20

38
[EC2InternalError]

Internal error in the EC2 universe:

Grid resource type not EC2.

3

Grid resource type not set.

5

Grid job ID is not for EC2.

7

Unexpected remote job status.

21

39
[EC2AdminError]

Administrator error in the EC2 universe:

EC2_GAHP not defined.

6

40
[EC2ConnectionProblem]

Connection problem in the EC2 universe:

…while creating an SSH keypair.

11

…while starting an on-demand instance.

12

…while requesting a spot instance.

17

41
[EC2ServerError]

Server error in the EC2 universe:

Abnormal instance termination reason.

13

Unrecognized instance termination reason.

14

Resource was down for too long.

22

42
[EC2InstancePotentiallyLost]

Instance potentially lost due to an error in the EC2 universe:

Connection error while terminating an instance.

15

Failed to terminate instance too many times.

16

Connection error while terminating a spot request.

17

Failed to terminate a spot request too many times.

18

Spot instance request purged before instance ID acquired.

19

43
[PreScriptFailed]

Pre script failed.

44
[PostScriptFailed]

Post script failed.

45
[SingularityTestFailed]

Test of the Singularity runtime failed before launching a job.

46
[JobDurationExceeded]

The job’s allowed duration was exceeded.

47
[JobExecuteExceeded]

The job’s allowed execution time was exceeded.

48
[HookShadowPrepareJobFailure]

Prepare job shadow hook failed when it was executed; status code indicated job should be held.
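
As a rough illustration of how the HoldReasonSubCode column can be read, the sketch below interprets a subcode according to its hold code. It assumes, per the table, that codes 6 through 10 and 14 carry a Unix errno and that code 34 carries 102 or 104. The function name `describe_subcode` and the wording of the fallback strings are hypothetical, and the errno text comes from the local platform, so it may differ from the machine where the hold occurred.

```python
import errno
import os

# Hold codes whose HoldReasonSubCode is a Unix errno, per the table above.
ERRNO_CODES = {6, 7, 8, 9, 10, 14}

# Subcodes listed for hold code 34 [JobOutOfResources].
RESOURCE_SUBCODES = {
    102: "Memory usage exceeded",
    104: "Disk usage exceeded",
}


def describe_subcode(code, subcode):
    """Return a human-readable reading of a (code, subcode) pair."""
    if code in ERRNO_CODES:
        # os.strerror uses the local platform's errno messages.
        return f"errno {subcode}: {os.strerror(subcode)}"
    if code == 34:
        return RESOURCE_SUBCODES.get(
            subcode, f"unlisted resource subcode {subcode}"
        )
    # Other codes: the table gives no generic interpretation.
    return f"subcode {subcode}"


print(describe_subcode(14, errno.ENOENT))
print(describe_subcode(34, 102))
```

Keeping the errno codes and the resource subcodes in separate lookups avoids confusing, say, errno 104 on one hold code with the disk limit subcode 104 on hold code 34.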

Note

For hold codes 12 [TransferOutputError] and 13 [TransferInputError], file transfer may invoke file-transfer plug-ins. If it does, the hold subcode may also be 62 (ETIME) if the plug-in timed out; otherwise it is the exit code of the plug-in shifted left by eight bits.
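
The plug-in subcode encoding in this note can be undone with a shift. The following is a minimal sketch, assuming the subcode is available as a plain integer; `decode_transfer_subcode` and its return labels are hypothetical names, and any value that is neither 62 nor a nonzero multiple of 256 is passed through unchanged because the note does not describe it.

```python
ETIME = 62  # value given in the note above


def decode_transfer_subcode(subcode):
    """Classify a subcode for hold codes 12 and 13.

    Returns a (kind, value) tuple:
      ("timeout", None)       plug-in timed out
      ("plugin_exit", code)   plug-in exit code, recovered by shifting right
      ("other", subcode)      anything the note does not describe
    """
    if subcode == ETIME:
        return ("timeout", None)
    # An exit code shifted left by eight bits has its low byte clear.
    # Zero is ambiguous (exit code 0 shifted is still 0), so treat it as other.
    if subcode >= 256 and subcode & 0xFF == 0:
        return ("plugin_exit", subcode >> 8)
    return ("other", subcode)


print(decode_transfer_subcode(62))      # timed-out plug-in
print(decode_transfer_subcode(1 << 8))  # plug-in exited with status 1
```

Because 62 is smaller than 256, a timeout can never collide with a shifted, nonzero exit code.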