Version 24 Feature Releases
We release new features in these releases of HTCondor. The details of each version are described below.
Version 24.5.1
Release Notes:
HTCondor version 24.5.1 released on March 4, 2025.
New Features:
The condor_starter now advertises StdoutMtime and StderrMtime, which represent the most recent modification times, in seconds since the epoch, of the stdout and stderr of a job that uses file transfer. (HTCONDOR-2837)
The condor_startd, when running on a machine with NVIDIA GPUs, now advertises the NVIDIA driver version. (HTCONDOR-2856)
Increased the default width of condor_qusers output when redirected to a file or piped to another command to prevent truncation. (HTCONDOR-2861)
The condor_startd will no longer lose track of, and leak, logical volumes that it failed to clean up when using STARTD_ENFORCE_DISK_LIMITS. The condor_startd will now periodically retry removal of logical volumes with an exponential backoff. (HTCONDOR-2852)
The condor_startd will now keep dynamic slots that have a SlotBrokenReason attribute in Unclaimed state rather than deleting them when they change state to Unclaimed. A new configuration variable CONTINUE_TO_ADVERTISE_BROKEN_DYNAMIC_SLOTS controls this behavior. It defaults to true but can be set to false to preserve the old behavior. This change also adds a new attribute BrokenContextAds to the daemon ad of the condor_startd. This attribute has a ClassAd for each broken resource in the startd. condor_status has been enhanced to use this new attribute to display more information about the context of broken resources when both the -startd and -broken arguments are used. (HTCONDOR-2844)
The condor_startd will now permanently reduce the total slot resources advertised by a partitionable slot when a dynamic slot is deleted while it is marked as broken. The amount of reduction will be advertised in new attributes such as BrokenSlotCpus so that the original size of the slot can be computed. (HTCONDOR-2865)
Daemons will now more quickly discover when a non-responsive condor_collector has recovered and resume advertising to it. (HTCONDOR-2605)
Jobs can now request user credentials generated by any combination of the OAuth2, Local Issuer, and Vault credential monitors on the AP. Remote submitters can request these credentials without having any of the CREDMON-related parameters in their configuration files. (HTCONDOR-2851)
HTCondor tarballs now contain Pelican 7.13.0
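The periodic retry with exponential backoff described above for logical volume cleanup (HTCONDOR-2852) can be pictured with a small Python sketch. This is not the condor_startd implementation: the function names, the 60-second starting delay, the doubling factor, and the one-hour cap are all assumptions chosen for illustration.

```python
def backoff_delays(base_seconds=60, factor=2, max_seconds=3600, attempts=6):
    """Yield the wait before each retry, growing geometrically up to a cap.

    All defaults are illustrative assumptions, not HTCondor settings.
    """
    delay = base_seconds
    for _ in range(attempts):
        yield min(delay, max_seconds)
        delay *= factor


def retry_cleanup(remove, delays, sleep=lambda seconds: None):
    """Call remove() until it returns True or the delays run out.

    `remove` is a hypothetical callable standing in for "try to delete
    the logical volume". Returns the number of attempts made.
    """
    attempts = 0
    for delay in delays:
        attempts += 1
        if remove():
            return attempts
        sleep(delay)  # wait longer after each failure
    return attempts


if __name__ == "__main__":
    print(list(backoff_delays(attempts=8)))
```

Passing `time.sleep` as the `sleep` argument would make the waits real; the default no-op keeps the sketch quick to run.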
Bugs Fixed:
Fixed a bug where the condor_gridmanager would write to log file GridmanagerLog.root after a reconfig. (HTCONDOR-2846)
htcondor annex shutdown now works again. (HTCONDOR-2808)
Fixed a bug where the job state table DAGMan prints to its debug file could contain a negative number for the count of failed jobs. (HTCONDOR-2872)
Fixed a bug where chirp would not work in container universe jobs using the docker runtime. (HTCONDOR-2866)
Fixed a bug where referencing htcondor2.JobEvent.cluster could crash if the processed log event was not associated with any job (i.e. had a negative value). (HTCONDOR-2881)
Fixed a bug that caused the condor_gridmanager to abort if a job that it was managing disappeared from the job queue (e.g. due to someone running condor_rm -force). (HTCONDOR-2845)
Fixed a bug that caused grid ads from different Access Points to overwrite each other in the collector. (HTCONDOR-2876)
Fixed a memory leak that can occur in any HTCondor daemon when an invalid ClassAd expression is encountered. (HTCONDOR-2847)
Fixed a bug that caused daemons to go into infinite recursion, eventually crashing when they ran out of stack memory. (HTCONDOR-2873)
Version 24.4.0
Release Notes:
HTCondor version 24.4.0 released on February 4, 2025.
New Features:
Improved validation and cleanup of EXECUTE directories. The EXECUTE directory must now be owned by the condor user when the daemons are started as root. The condor_startd will not attempt to clean an invalid EXECUTE directory nor will it alter the file permissions of an EXECUTE directory. (HTCONDOR-2789)
For batch grid universe jobs, the PATH environment variable values from the job ad and the worker node environment are now combined. Previously, only the PATH value from the job ad was used. The old behavior can be restored by setting blah_merge_paths=no in the blah.config file. (HTCONDOR-2793)
Many small improvements to condor_q -analyze and -better-analyze for pools that use partitionable slots. As a part of this, the condor_schedd was changed to provide match information for the autocluster of the job being analyzed, which condor_q will report if it is available. (HTCONDOR-2720)
The condor_startd now advertises a new attribute, SingularityUserNamespaces, which is true when apptainer or singularity work and are using Linux user namespaces, and false when they are using setuid mode. (HTCONDOR-2818)
The condor_startd daemon ad now contains attributes showing the average and total bytes transferred to and from jobs during its lifetime. (HTCONDOR-2721)
The condor_credd daemon no longer listens on port 9620 by default, but rather uses the condor_shared_port daemon. (HTCONDOR-2763)
DAGMan will now periodically print a table of the states of jobs placed on the Access Point to the debug log (*.dagman.out). The rate at which this table is printed is dictated by DAGMAN_PRINT_JOB_TABLE_INTERVAL. (HTCONDOR-2794)
For arc grid universe jobs, the new submit command arc_data_staging can be used to supply additional elements to the DataStaging block of the ARC ADL that HTCondor constructs. (HTCONDOR-2774)
Bugs Fixed:
Changed the numeric output of htcondor job status so that the rounding to megabytes, gigabytes, etc. matches the binary definitions the rest of the tools use. (HTCONDOR-2788)
Fixed a bug in the negotiator that caused it to crash when matching offline ads. (HTCONDOR-2819)
Fixed a memory leak in the schedd that could be caused by SCHEDD_CRON scripts that generate standard error output. (HTCONDOR-2817)
Fixed a bug that caused the condor_schedd to crash with a segmentation fault if a condor_off -fast command was run while a schedd cron script was running. (HTCONDOR-2815)
Fixed an issue where EPs using STARTD_ENFORCE_DISK_LIMITS would fill up the EP’s filesystem due to excessive saving of metadata to /etc/lvm/archive. (HTCONDOR-2791)
Fixed a bug where container_service_names did not work. (HTCONDOR-2829)
Fixed a very rare bug that could cause the condor_startd to crash when the condor_collector times out queries and DNS is running very slowly. (HTCONDOR-2831)
Updated condor_upgrade_check to test for use of PASSWORD authentication and warn about the authenticated identity changing. (HTCONDOR-2823)
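The binary rounding mentioned above for htcondor job status (HTCONDOR-2788) means that each unit step is a factor of 1024 rather than 1000. A minimal Python sketch of that convention follows; the function name, the unit labels, and the one-decimal output format are assumptions for illustration and are not taken from the HTCondor tools.

```python
def format_binary_size(num_bytes):
    """Format a byte count using binary steps (1 KB = 1024 B here).

    Unit labels and precision are illustrative assumptions.
    """
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    for size in (1536, 1_000_000, 1024 ** 3):
        print(size, "->", format_binary_size(size))
```

Under this convention 1,000,000 bytes is still below one megabyte, which is the kind of difference a decimal (factor of 1000) rounding would hide.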
Version 24.3.0
Release Notes:
HTCondor version 24.3.0 released on January 6, 2025.
New Features:
Updated the condor_credmon_oauth and created a new condor-credmon-multi RPM package which, when installed, allows user credentials added via Vault and user credentials generated via a local issuer to exist simultaneously without conflict (e.g. the Vault credmon will not attempt to refresh locally issued credentials). (HTCONDOR-2408)
Added a singularity launcher wrapper script that runs inside the container and launches the job proper. If this fails to run, HTCondor detects there is a problem with the container runtime, not the job, and reruns the job elsewhere. Controlled by the parameter SINGULARITY_USE_LAUNCHER. (HTCONDOR-1446)
EPs using STARTD_ENFORCE_DISK_LIMITS will now advertise IsEnforcingDiskUsage in the machine ad. (HTCONDOR-2734)
Added a new AUTO option to LVM_HIDE_MOUNT that creates a mount namespace for ephemeral logical volumes if the job is compatible with mount hiding (i.e. not Docker jobs). The AUTO value is now the default value. (HTCONDOR-2717)
Added a new submit command for container universe, mount_under_scratch, that allows users to create writable ephemeral directories in their otherwise read-only container images. (HTCONDOR-2728)
Environment variables from the job that start with PELICAN_ will now be set in the environment of the pelican file transfer plugin when it is invoked to do file transfer. This is intended to allow jobs to turn on enhanced logging in the plugin. (HTCONDOR-2674)
When the condor_startd interrupts a job’s execution, the specific reason is now reflected in the job attributes VacateReason and VacateReasonCode. (HTCONDOR-2713)
Improved performance of condor_history by using the in-memory sort order of job attributes used by the condor_schedd. (HTCONDOR-2729)
If the startd detects that an exited or evicted job has leftover, unkillable processes, it now marks that slot as “broken”, and will not reassign the resources for that slot to any other jobs. Disabled if STARTD_LEFTOVER_PROCS_BREAK_SLOTS is set to false. (HTCONDOR-2756)
Methods in htcondor2.Schedd which take job_spec arguments now accept a cluster ID in the form of an int. These functions (htcondor2.Schedd.act(), htcondor2.Schedd.edit(), htcondor2.Schedd.export_jobs(), htcondor2.Schedd.retrieve(), and htcondor2.Schedd.unexport_jobs()) now also raise TypeError if their job_spec argument is not a str, list of str, classad2.ExprTree, or int. (HTCONDOR-2745)
Added a new knob CGROUP_POLLING_INTERVAL, which defaults to 5 (seconds), to control how often a cgroup system polls for resource usage. (HTCONDOR-2802)
Bugs Fixed:
Fixed a bug introduced in 24.2.0 where the daemons failed to start if configured to use only a network interface that didn’t have an IPv6 address. Also, the daemons will no longer bind and advertise an address that doesn’t match the value of NETWORK_INTERFACE. (HTCONDOR-2799)
The htcondor job submit command now issues credentials like condor_submit. (HTCONDOR-2745)
EPs spawned by htcondor annex no longer crash on start-up. (HTCONDOR-2745)
When resolving a hostname to a list of IP addresses, avoid using IPv6 link-local addresses. This change was done incorrectly in 23.9.6. (HTCONDOR-2746)
htcondor2.Submit.from_dag() and htcondor.Submit.from_dag() now correctly raise an HTCondor exception when the processing of DAGMan options and submit time DAG commands fails. (HTCONDOR-2736)
Fixed a confusing job hold message that would state a job requested 0.0 GB of disk via request_disk when exceeding disk usage on Execution Points using STARTD_ENFORCE_DISK_LIMITS. (HTCONDOR-2753)
You can now locate a collector daemon in the htcondor2 Python bindings. (HTCONDOR-2738)
Fixed a bug in the condor_qusers tool where the add argument would always enable rather than add a user. (HTCONDOR-2775)
Fixed a bug where cgroup systems reported current instantaneous memory instead of peak memory, as intended. (HTCONDOR-2800) (HTCONDOR-2804)
Fixed an inconsistency in cgroup v1 systems where the memory reported by condor included memory used by the kernel to cache disk pages. (HTCONDOR-2807)
Fixed a bug on cgroup v1 systems where jobs that were killed by the Out of Memory killer did not go on hold. (HTCONDOR-2806)
Fixed incompatibility of condor_adstash with v2.x of the OpenSearch Python Client. (HTCONDOR-2614)
The -subsystem argument of condor_status is once again case-insensitive for credd and defrag subsystem types. (HTCONDOR-2796)
Version 24.2.2
Release Notes:
HTCondor version 24.2.2 released on December 4, 2024.
New Features:
None.
Bugs Fixed:
Fixed a bug where, if the knob EXECUTE is explicitly set to a blank string in the configuration file, the execution point (startd) could attempt to remove all files from the root partition (everything in /) upon startup. (HTCONDOR-2760)
Version 24.2.1
Release Notes:
HTCondor version 24.2.1 released on November 26, 2024.
This version includes all the updates from Version 24.0.2.
The DAGMan metrics file now names the metrics that referred to jobs using the modern term nodes. To revert to the old terminology, set DAGMAN_METRICS_FILE_VERSION = 1. (HTCONDOR-2682)
New Features:
DAGMan will now correctly submit late materialization jobs to an Access Point when DAGMAN_USE_DIRECT_SUBMIT = True. (HTCONDOR-2673)
Added new submit command primary_unix_group, which takes a string which must be one of the user’s supplemental groups, and sets the primary group to that value. (HTCONDOR-2702)
Improved DAGMan metrics file to use updated terminology and contain more metrics. (HTCONDOR-2682)
A condor_startd which has ENABLE_STARTD_DAEMON_AD enabled will no longer abort when it cannot create the required number of slots of the correct size on startup. It will now continue to run, reporting the failure to the collector in the daemon ad. Slots that can be fully provisioned will work normally. Slots that cannot be fully provisioned will exist but advertise themselves as broken. This is now the default behavior because daemon ads are enabled by default. The condor_status tool has a new option -broken which displays broken slots and their reason for being broken. Use this option with the -startd option to display machines that are fully or partly broken. (HTCONDOR-2500)
A new job attribute FirstJobMatchDate will be set for all jobs of a single submission to the current time when the first job of that submission is matched to a slot. (HTCONDOR-2676)
Added new job ad attribute InitialWaitDuration, recording the number of seconds from when a job was queued to when the first launch happened. (HTCONDOR-2666)
condor_ssh_to_job when entering an Apptainer container now sets the supplemental unix group ids in the same way that vanilla jobs have them set. (HTCONDOR-2695)
IPv6 networking is now fully supported on Windows. (HTCONDOR-2601)
Daemons will no longer block trying to invalidate their ads in a dead collector when shutting down. (HTCONDOR-2709)
Added option FAST to configuration parameter MASTER_NEW_BINARY_RESTART. This will cause the condor_master to do a fast restart of all the daemons when it detects new binaries. (HTCONDOR-2708)
Bugs Fixed:
None.
Version 24.1.1
Release Notes:
HTCondor version 24.1.1 released on October 31, 2024.
This version includes all the updates from Version 24.0.1.
New Features:
Added get to the htcondor credential noun, which prints the contents of a stored OAuth2 credential. (HTCONDOR-2626)
Added htcondor2.set_ready_state() for those brave few writing daemons in the Python bindings. (HTCONDOR-2615)
When blah_debug_save_submit_info is set in blah.config, the stdout and stderr of the blahp’s wrapper script are saved under the given directory. (HTCONDOR-2636)
The DAG command SUBMIT-DESCRIPTION and node inline submit descriptions now work when DAGMAN_USE_DIRECT_SUBMIT = False. (HTCONDOR-2607)
Docker universe jobs now check the Architecture field in the image, and if it doesn’t match the architecture of the EP, the job is put on hold. The new parameter DOCKER_SKIP_IMAGE_ARCH_CHECK skips this. (HTCONDOR-2661)
Added a configuration template, use feature:DefaultCheckpointDestination. (HTCONDOR-2403)
Bugs Fixed:
If HTCondor detects that an invalid checkpoint has been downloaded for a self-checkpointing job using third-party storage, that checkpoint is now marked for deletion and the job is rescheduled. (HTCONDOR-1258)