| HTCondor | |
|---|---|
| Name | HTCondor |
| Developer | University of Wisconsin–Madison |
| Released | 1988 |
| Programming language | C++ |
| Operating system | Linux, Unix, macOS, Windows |
| License | Apache License 2.0 |
HTCondor is a specialized software system for high-throughput computing that manages and schedules large ensembles of computational tasks across distributed resources. It provides a workload management framework enabling organizations to harness clusters, campus grids, clouds, and heterogeneous workstations for batch-oriented and long-running computations. HTCondor is widely employed in scientific, engineering, and industry environments to maximize utilization of compute assets and to support workflows from data analysis to parameter sweeps.
HTCondor orchestrates the execution of independent jobs by matchmaking between job requirements and the resources advertised by compute nodes. It supports opportunistic use of idle workstations, dedicated clusters, and cloud instances, integrating with data staging solutions and job monitoring systems. HTCondor interoperates broadly within the high-performance and distributed computing ecosystem: it underpins or integrates with infrastructures such as the Open Science Grid, the European Grid Infrastructure, and XSEDE, and is used at National Science Foundation-funded centers and national laboratories such as Los Alamos National Laboratory and Lawrence Berkeley National Laboratory.
The architecture comprises central managers, submit nodes, execute nodes, and auxiliary services. Core daemons coordinate state and negotiation: the collector maintains pool-wide state, the negotiator performs matchmaking, and the schedd manages job queues. Execute nodes run a startd daemon that advertises their capabilities and executes tasks, while submit hosts use command-line clients such as condor_submit and condor_q to submit and monitor jobs. Data movement is handled by dedicated file-transfer components that can interoperate with storage and transfer systems such as the Hadoop Distributed File System, Globus, and Ceph. Monitoring and logging integrate with the telemetry and reporting systems used by CERN, NASA, DOE facilities, and research computing centers.
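On a working pool, the daemons above can be inspected from a submit host with HTCondor's standard command-line tools; the hostname below is hypothetical:

```shell
# List execute machines (startd advertisements) known to the collector
condor_status

# Show the full ClassAd of one execute node (hostname is illustrative)
condor_status -long exec01.example.edu

# List schedds registered with the collector
condor_status -schedd

# Inspect the local schedd's job queue
condor_q
```

Together, condor_status reflects the collector's view of the pool while condor_q reflects the local schedd's queue, mirroring the division of responsibilities among the daemons described above.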
Users express computational workloads in job description files consumed by condor_submit; these files enumerate the executable path, arguments, resource requirements, and policy directives. The scheduler's matchmaking algorithm compares job ClassAds with machine ClassAds to allocate jobs to suitable machines, and a parallel universe supports multi-node jobs comparable to those handled by batch systems such as Slurm and PBS Professional. HTCondor also supports checkpointing, DAG-based job dependencies via its DAGMan component (on which workflow systems such as Pegasus build), job arrays, and the heterogeneous workload handling needed by projects such as the LIGO Scientific Collaboration, the Large Hadron Collider experiments, and climate modeling centers.
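The submission flow above can be made concrete with a minimal submit description file; executable and file names such as `analyze` and `input.dat` are illustrative:

```
# Minimal vanilla-universe parameter sweep (a sketch; names are hypothetical)
universe                = vanilla
executable              = analyze
arguments               = --seed $(Process)

# Resource requests and a ClassAd requirements expression for matchmaking
request_cpus            = 1
request_memory          = 2048M
request_disk            = 1G
requirements            = (OpSys == "LINUX") && (Arch == "X86_64")

# File transfer and per-job output
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
output                  = out.$(Process)
error                   = err.$(Process)
log                     = job.log

queue 10
```

Submitting this file with condor_submit enqueues ten jobs whose `$(Process)` values 0–9 vary the seed, a common pattern for the parameter sweeps mentioned above.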
Resource advertisements capture attributes such as CPU, memory, disk, GPU presence, and operating system; administrators define policies to enforce fair-share, preemption, and priority schemes. Policy expressions enable site-specific behavior consistent with governance models at universities such as the University of Wisconsin–Madison, research consortia, and national infrastructures. Integration with accounting and quota services supports the reporting required by funders such as the European Research Council and the facility allocation systems used by NERSC and Argonne National Laboratory.
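Such policies are written as ClassAd expressions in the execute node's configuration; the thresholds and group name below are illustrative values, not recommended defaults:

```
# startd policy sketch: only accept jobs when the desktop looks idle
# (KeyboardIdle in seconds, LoadAvg as reported in the machine ClassAd)
START = (KeyboardIdle > 600) && (LoadAvg < 0.3)

# Prefer jobs from a hypothetical local accounting group; higher Rank wins
RANK = (AccountingGroup =?= "group_physics") * 10
```

The startd re-evaluates these expressions against job ClassAds, so the same mechanism that advertises resources also enforces site policy such as desktop-idle harvesting and local-user preference.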
Security features include authentication, authorization, and encrypted communication channels; HTCondor integrates with standard authentication technologies such as Kerberos, X.509 certificates, and operating-system credentials. It can leverage site-wide identity infrastructures such as InCommon and the federated identity systems used by universities and labs, and supports sandboxing and containerization through Docker, Singularity (Apptainer), and virtualization platforms favored by enterprise and research institutions. These mechanisms enable compliance with security practices at organizations such as the National Institutes of Health, the Department of Energy, and major academic computing centers.
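Container-based sandboxing can be requested directly in a submit description file via the docker universe; the image name and executable path here are hypothetical:

```
# Run the job inside a Docker container (image name is illustrative)
universe       = docker
docker_image   = registry.example.org/lab/analysis:latest
executable     = /usr/local/bin/run_analysis
request_cpus   = 2
request_memory = 4096M
output         = out.txt
error          = err.txt
log            = job.log
queue
```

Because the container image pins the job's userland, this pattern gives execute nodes a consistent sandbox regardless of the host operating system, complementing the authentication and authorization layers described above.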
HTCondor deployments range from campus pools of idle desktops to large-scale production grids supporting experiments at CERN, astronomy surveys run by teams at institutions like Space Telescope Science Institute, bioinformatics pipelines at Broad Institute, and commercial workloads in industries collaborating with Amazon Web Services and Google Cloud Platform. Use cases include parameter sweeps in computational chemistry groups, Monte Carlo production for particle physics collaborations, image processing pipelines for remote sensing projects with partners like USGS and NOAA, and grid-enabled applications in social sciences funded by bodies such as the National Endowment for the Humanities.
Originating in the late 1980s in academic research groups, development continued through collaborations among universities, national laboratories, and funding agencies. Key milestones track engagement with initiatives such as the NSF Information Technology Research programs, adoption in European grid projects such as EGI, and integration into national cyberinfrastructure efforts exemplified by XSEDE. The software evolved to support modern cloud and container technologies, guided by the Condor Team at the University of Wisconsin–Madison and by contributors from partner organizations across academia and industry.
Category:Distributed computing Category:Cluster computing Category:Scientific software