Upgrading from a 24.0 LTS version to a 25.0 LTS version of HTCondor

Upgrading from a 24.0 LTS version of HTCondor to a 25.0 LTS version will also introduce changes that administrators and users of sites running an older HTCondor version should be aware of when planning an upgrade. Here is a list of items that administrators should be aware of.

Upgrading from a 24.0 LTS version of HTCondor to a 25.0 LTS version will bring new features introduced in the 24.x versions of HTCondor. These new features include the following (note that this list contains only the most significant changes; a full list of changes can be found in the version history: Version 24 Feature Releases):

  • New implementations of the Python bindings, htcondor2 and classad2. This re-implementation of the Python bindings fixes shortcomings in the original implementation and eases future maintenance. (HTCONDOR-2153)

  • Added new condor_dag_checker tool for users to check DAG files for syntactical and logical errors prior to submission. (HTCONDOR-3088)

  • Added a new configuration knob, LOCAL_UNIVERSE_CGROUP_ENFORCEMENT, which defaults to false. When it is set to true on a cgroup-enabled system, local universe jobs must specify request_memory, and a job that exceeds that limit will be put on hold. (HTCONDOR-3170)

  • Added new job ClassAd attributes NumVacates and NumVacatesByReason. These attributes provide counts about why a job left the running state without completing (i.e. was vacated from the execution point). (HTCONDOR-3204)

  • Added new job ClassAd attribute TransferInputFileCounts. (HTCONDOR-3024)

  • Improvements to condor_q for held jobs. The hold code and subcode are now displayed as part of the -hold option. A new option -hold-codes displays the first job for each unique hold code and subcode. (HTCONDOR-3127)

  • Added new -lvm option to condor_status to view current disk usage of slots enforcing disk limits. This option can be paired with -startd to show information about execution points enforcing disk limits. (HTCONDOR-3119)

  • Added new halt and resume verbs to htcondor dag, providing a first-class way to halt a DAG. (HTCONDOR-2898)

  • htcondor ap status will now show the RecentDaemonCoreDutyCycle of each reported Access Point’s condor_schedd. (HTCONDOR-3009)

  • Added SYSTEM_MAX_RELEASES, which implements an upper bound on the number of times any job can be released by a user or periodic expression. (HTCONDOR-2926)

  • Improved condor_watch_q to display information about the number of jobs actively transferring input or output files. (HTCONDOR-2958)

  • Added the ability for a docker universe job to fetch an authenticated image from the docker repository. (HTCONDOR-2870)

  • The condor_startd will now keep dynamic slots that have a SlotBrokenReason attribute in Unclaimed state rather than deleting them when they change state to Unclaimed. A new configuration variable CONTINUE_TO_ADVERTISE_BROKEN_DYNAMIC_SLOTS controls this behavior. It defaults to true but can be set to false to preserve the old behavior. This change also adds a new attribute BrokenContextAds to the daemon ad of the condor_startd. This attribute has a ClassAd for each broken resource in the startd. condor_status has been enhanced to use this new attribute to display more information about the context of broken resources when both -startd and -broken arguments are used. (HTCONDOR-2844)

  • The condor_startd will now permanently reduce the total slot resources advertised by a partitionable slot when a dynamic slot is deleted while it is marked as broken. The amount of reduction will be advertised in new attributes such as BrokenSlotCpus so that the original size of the slot can be computed. (HTCONDOR-2865)

  • The condor_startd, when running on a machine with NVIDIA GPUs, now advertises the NVIDIA driver version. (HTCONDOR-2856)

  • Improved validation and cleanup of EXECUTE directories. The EXECUTE directory must now be owned by the condor user when the daemons are started as root. The condor_startd will not attempt to clean an invalid EXECUTE directory nor will it alter the file permissions of an EXECUTE directory. (HTCONDOR-2789)

  • Added new submit command primary_unix_group, which takes a string that must be one of the user’s supplemental groups and sets the job’s primary group to that value. (HTCONDOR-2702)

  • Added a Singularity launcher wrapper script that runs inside the container and launches the job proper. If this script fails to run, HTCondor detects that the problem lies with the container runtime, not the job, and reruns the job elsewhere. This behavior is controlled by the parameter SINGULARITY_USE_LAUNCHER. (HTCONDOR-1446)

  • Added new submit command for container universe, mount_under_scratch, that allows users to create writable ephemeral directories in their otherwise read-only container images. (HTCONDOR-2728)

  • A new job attribute FirstJobMatchDate will be set for all jobs of a single submission to the current time when the first job of that submission is matched to a slot. (HTCONDOR-2676)

  • Added new job ad attribute InitialWaitDuration, recording the number of seconds from when a job was queued to when the first launch happened. (HTCONDOR-2666)
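
Several of the new job attributes listed above are simple derived quantities, and the following Python sketch illustrates the arithmetic behind two of them. It is only an illustration under stated assumptions, not HTCondor's implementation: plain dictionaries stand in for job ClassAds, the existing QDate and JobStartDate attributes are assumed to approximate the "queued" and "first launch" instants behind InitialWaitDuration, NumVacatesByReason is assumed to map a reason name to a count, and the reason names and timestamps are invented.

```python
# Illustrative sketch only: plain dicts stand in for job ClassAds, and all
# values below are invented. This is not how HTCondor computes the attributes.


def initial_wait_duration(job_ad):
    """Seconds from queue time to first start, or None if the job never started.

    Assumes QDate and JobStartDate (both Unix timestamps) approximate the
    instants that InitialWaitDuration is described as measuring.
    """
    qdate = job_ad.get("QDate")
    first_start = job_ad.get("JobStartDate")
    if qdate is None or first_start is None:
        return None
    return first_start - qdate


def total_vacates(job_ad):
    """Sum the per-reason vacate counts.

    Assumes NumVacatesByReason maps a reason name to an integer count.
    """
    return sum(job_ad.get("NumVacatesByReason", {}).values())


# Hypothetical job ad; "ReasonA" and "ReasonB" are placeholder reason names.
example_job = {
    "QDate": 1_700_000_000,
    "JobStartDate": 1_700_000_090,
    "NumVacatesByReason": {"ReasonA": 2, "ReasonB": 1},
}

print(initial_wait_duration(example_job))  # 90
print(total_vacates(example_job))          # 3
```

On a real pool the values would come straight from the job ad (for example through condor_q or the Python bindings), so helpers like these are mainly useful for cross-checking the advertised attributes.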