Vacate Reason Codes

Whenever a job or claim is kicked off an EP slot, the job ad is updated with a VacateReasonCode attribute, set to a numeric code indicating the reason for the vacate. The following table lists the possible values of VacateReasonCode, along with a brief description of each code. VacateReasonCode may also be accompanied by a VacateReasonSubCode attribute that gives additional detail.

The NumVacatesByReason job attribute will also be updated with the number of times the job or claim has been vacated for each reason code, keyed by the NumVacatesByReason Label. For example, if a job has been vacated twice for reason code 12 and once for reason code 1023, the NumVacatesByReason attribute will look like this:

NumVacatesByReason = [ TransferOutputError = 2; JobRemoved = 1 ]
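The tally above can be reproduced with a minimal Python sketch. It is illustrative only and is not how HTCondor maintains the attribute: the `VACATE_LABELS` dictionary holds just three entries copied from the table on this page, and `tally_vacates`, `format_classad`, and the `Code<N>` fallback label are hypothetical names introduced for this example.

```python
from collections import Counter

# Small subset of the code-to-label table from this page; extend as needed.
VACATE_LABELS = {
    12: "TransferOutputError",
    13: "TransferInputError",
    1023: "JobRemoved",
}

def tally_vacates(codes):
    """Count vacates per label, mimicking the shape of NumVacatesByReason."""
    counts = Counter()
    for code in codes:
        # Codes missing from the subset get a hypothetical "Code<N>" label.
        counts[VACATE_LABELS.get(code, f"Code{code}")] += 1
    return dict(counts)

def format_classad(counts):
    """Render the tally in the nested-ClassAd style used on this page."""
    body = "; ".join(f"{label} = {n}" for label, n in counts.items())
    return f"[ {body} ]"

# A job vacated twice with code 12 and once with code 1023.
counts = tally_vacates([12, 1023, 12])
print("NumVacatesByReason =", format_classad(counts))
```

Keying the tally by label rather than by integer code keeps the result readable without consulting the table.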
Integer VacateReasonCode
[NumVacatesByReason Label]
Reason for Vacate
VacateReasonSubCode
1
[UserRequest]

The user put the job on hold with condor_hold.

3
[JobPolicy]

The PERIODIC_HOLD expression evaluated to True, or ON_EXIT_HOLD evaluated to True.

User Provided

4
[CorruptedCredential]

The credentials for the job are invalid.

5
[JobPolicyUndefined]

A job policy expression evaluated to Undefined.

6
[FailedToCreateProcess]

The condor_starter failed to start the executable.

Unix errno

7
[UnableToOpenOutput]

The standard output file for the job could not be opened.

Unix errno

8
[UnableToOpenInput]

The standard input file for the job could not be opened.

Unix errno

9
[UnableToOpenOutputStream]

The standard output stream for the job could not be opened.

Unix errno

10
[UnableToOpenInputStream]

The standard input stream for the job could not be opened.

Unix errno

11
[InvalidTransferAck]

An internal HTCondor protocol error was encountered when transferring files.

12
[TransferOutputError]

An error occurred while transferring job output files or self-checkpoint files.

See note

13
[TransferInputError]

An error occurred while transferring job input files.

See note

14
[IwdError]

The initial working directory of the job cannot be accessed.

Unix errno

15
[SubmittedOnHold]

The user requested the job be submitted on hold.

16
[SpoolingInput]

Input files are being spooled.

17
[JobShadowMismatch]

A standard universe job is not compatible with the condor_shadow version available on the submitting machine.

18
[InvalidTransferGoAhead]

An internal HTCondor protocol error was encountered when transferring files.

19
[HookPrepareJobFailure]

<Keyword>_HOOK_PREPARE_JOB was defined but could not be executed or returned failure.

20
[MissedDeferredExecutionTime]

The job missed its deferred execution time and therefore failed to run.

21
[StartdHeldJob]

The job was put on hold because WANT_HOLD in the machine policy was true.

22
[UnableToInitUserLog]

Unable to initialize job event log.

23
[FailedToAccessUserAccount]

Failed to access user account.

24
[NoCompatibleShadow]

No compatible shadow.

25
[InvalidCronSettings]

Invalid cron settings.

26
[SystemPolicy]

SYSTEM_PERIODIC_HOLD evaluated to true.

27
[SystemPolicyUndefined]

The system periodic job policy evaluated to undefined.

32
[MaxTransferInputSizeExceeded]

The maximum total input file transfer size was exceeded. (See MAX_TRANSFER_INPUT_MB.)

33
[MaxTransferOutputSizeExceeded]

The maximum total output file transfer size was exceeded. (See MAX_TRANSFER_OUTPUT_MB.)

34
[JobOutOfResources]

Job resource usage exceeded a provisioned limit; the limit exceeded is specified in the subcode.

Exceeded Resource:

Memory usage exceeded

102

Disk usage exceeded

104

35
[InvalidDockerImage]

Specified Docker image was invalid.

36
[FailedToCheckpoint]

Job failed when sent the checkpoint signal it requested.

37
[EC2UserError]

User error in the EC2 universe:

Public key file not defined.

1

Private key file not defined.

2

Grid resource string missing EC2 service URL.

4

Failed to authenticate.

9

Can’t use existing SSH keypair with the given server’s type.

10

You, or somebody like you, cancelled this request.

20

38
[EC2InternalError]

Internal error in the EC2 universe:

Grid resource type not EC2.

3

Grid resource type not set.

5

Grid job ID is not for EC2.

7

Unexpected remote job status.

21

39
[EC2AdminError]

Administrator error in the EC2 universe:

EC2_GAHP not defined.

6

40
[EC2ConnectionProblem]

Connection problem in the EC2 universe:

…while creating an SSH keypair.

11

…while starting an on-demand instance.

12

…while requesting a spot instance.

17

41
[EC2ServerError]

Server error in the EC2 universe:

Abnormal instance termination reason.

13

Unrecognized instance termination reason.

14

Resource was down for too long.

22

42
[EC2InstancePotentiallyLost]

Instance potentially lost due to an error in the EC2 universe:

Connection error while terminating an instance.

15

Failed to terminate instance too many times.

16

Connection error while terminating a spot request.

17

Failed to terminate a spot request too many times.

18

Spot instance request purged before instance ID acquired.

19

43
[PreScriptFailed]

Pre script failed.

44
[PostScriptFailed]

Post script failed.

45
[SingularityTestFailed]

Test of the Singularity runtime failed before launching a job.

46
[JobDurationExceeded]

The job’s allowed duration was exceeded.

47
[JobExecuteExceeded]

The job’s allowed execution time was exceeded.

48
[HookShadowPrepareJobFailure]

Prepare job shadow hook failed when it was executed; status code indicated job should be held.

1000
[JobPolicyVacate]

PeriodicVacate evaluated to True.

1001
[SystemPolicyVacate]

SYSTEM_PERIODIC_VACATE evaluated to True.

1002
[ShadowException]

A Shadow Exception event occurred.

1003
[JobNotStarted]

A setup step failed.

1004
[UserVacateJob]

The user requested the job be vacated.

1005
[JobShouldRequeue]

An unspecified error occurred.

1006
[FailedToActivateClaim]

The shadow failed to activate the claim.

1007
[StarterError]

The starter encountered an error.

1008
[ReconnectFailed]

The shadow failed to reconnect after a network failure.

1009
[ClaimDeactivated]

The AP requested the job to be vacated.

1010
[StartdVacateCommand]

The administrator requested the job to be vacated.

1011
[StartdPreemptExpression]

The EP’s PREEMPT expression evaluated to True.

1012
[StartdException]

The startd died due to an internal error.

1013
[StartdShutdown]

The startd was shut down.

1014
[StartdDraining]

The slot was drained.

1015
[StartdCoalesce]

The slot was coalesced with other slots by condor_now.

1016
[StartdHibernate]

The startd entered hibernation.

1017
[StartdReleaseCommand]

The AP released the claim.

1018
[StartdPreemptingClaimRank]

The slot was claimed for a job with a higher startd Rank.

1019
[StartdPreemptingClaimUserPrio]

The slot was claimed for a job with better user priority.

1020
[VMError]

The virtual machine software at the EP had an error.

1021
[ContainerError]

The container software at the EP had an error.

1022
[ScheddVacate]

The AP vacated the job.

1023
[JobRemoved]

The job was removed.

1024
[ScratchDirError]

An error occurred with the scratch directory on the EP.

1025
[SuccessfulCheckpoint]

A self-checkpoint job would have restarted but could not reactivate its claim.

1026
[ActivationRefusedBadRequest]

Activation request had missing or invalid attributes.

1027
[ActivationRefusedNoMatch]

Activation request did not match the slot requirements.

1028
[ActivationRefusedStillCleaning]

Activation request refused because the starter is still cleaning up after a job.

1029
[ActivationRefusedWorklifeExpired]

Activation request refused because the claim worklife has expired.

1030
[ActivationRefusedPreempted]

Activation request refused because the claim is being preempted.

1031
[ActivationRefusedBroken]

Activation request refused because the slot is in the broken state.

1032
[ActivationRefusedNotIdle]

Activation request refused because the slot is not idle.

1033
[ActivationRefusedUnclaimed]

Activation request refused because the slot is not claimed.

1034
[ActivationRefusedClaimNotFound]

Activation request refused because the claim ID used for the request was not found.

1035
[ActivationRefusedOldClaim]

Activation request refused because the claim ID used for the request is no longer valid.

1036
[ActivationRefusedUnhealthy]

Activation request refused because a slot health check failed.

Note

For vacate codes 12 [TransferOutputError] and 13 [TransferInputError], file transfer may invoke file-transfer plug-ins. If it does, the vacate subcodes may additionally be:

  • 62 (ETIME) if the file-transfer plug-in timed out.

  • -1001 (FileTransferPluginNotFound) if the file-transfer plug-in was not found installed on the EP.

  • -1002 (FileTransferPluginNotOperational) if the file-transfer plug-in failed to execute properly.

  • -1003 (FileTransferPluginExecFailed) if the EP was unable to launch the file-transfer plug-in (for example, the EP may be out of memory).

  • -1004 (FileTransferPluginNoResultReported) if the file-transfer plug-in failed to report a result (for example, the disk may be full).

  • Otherwise, the exit code of the plug-in shifted left by eight bits.
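As a rough illustration of the list above, the following Python sketch turns a plug-in related subcode into a readable description. The function name `describe_transfer_subcode` and the returned strings are hypothetical; the numeric values come from the note. The sketch assumes that a shifted exit code is positive with its low eight bits clear (which follows from a left shift by eight bits) and reports anything else as unrecognized, since subcodes for codes 12 and 13 may carry other meanings not covered here.

```python
ETIME = 62  # value given in the note above

# Negative subcodes named in the note above.
PLUGIN_SUBCODES = {
    -1001: "FileTransferPluginNotFound",
    -1002: "FileTransferPluginNotOperational",
    -1003: "FileTransferPluginExecFailed",
    -1004: "FileTransferPluginNoResultReported",
}

def describe_transfer_subcode(subcode):
    """Describe a VacateReasonSubCode for vacate codes 12 and 13."""
    if subcode == ETIME:
        return "plug-in timed out (ETIME)"
    if subcode in PLUGIN_SUBCODES:
        return PLUGIN_SUBCODES[subcode]
    # Assumption: a shifted exit code is positive and a multiple of 256.
    if subcode > 0 and subcode % 256 == 0:
        return f"plug-in exit code {subcode >> 8}"
    return f"unrecognized subcode {subcode}"

for value in (62, -1003, 512, 5):
    print(value, "->", describe_transfer_subcode(value))
```

Shifting right by eight bits undoes the shift described in the last bullet, so a subcode of 512 corresponds to a plug-in exit code of 2.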