LLMpedia: The first transparent, open encyclopedia generated by LLMs

HTCondor

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
HTCondor
Name: HTCondor
Developer: University of Wisconsin–Madison
Released: 1988
Operating system: Linux, Microsoft Windows, macOS
Genre: High-throughput computing, Grid computing
License: Apache License 2.0

HTCondor is a specialized workload management system for compute-intensive jobs, created at the University of Wisconsin–Madison. The software framework operates as a distributed batch processing system, enabling users to harness idle computing power across networks of workstations and dedicated clusters. Its design focuses on efficiently managing large numbers of computational tasks, making it a cornerstone tool in the fields of high-throughput computing and distributed computing.

Overview

The system provides a comprehensive environment for submitting, queuing, scheduling, and monitoring computational workloads across distributed resources. It is particularly renowned for its ability to create a virtual pool of computing power from otherwise underutilized desktop machines, a concept known as cycle scavenging. This approach allows organizations to leverage existing IT infrastructure for scientific computing without requiring dedicated supercomputer resources. The project is stewarded by the Center for High Throughput Computing within the University of Wisconsin–Madison.
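Cycle scavenging is typically expressed through policy expressions in an execute node's configuration. The following is a minimal sketch of such a desktop policy, not HTCondor's shipped defaults; the thresholds are illustrative, though the attribute names (KeyboardIdle, LoadAvg, Activity) are standard machine ClassAd attributes:

```
# Illustrative opportunistic ("cycle scavenging") policy for a desktop
# execute node. Start jobs only when the console has been idle for
# 15 minutes and the machine's load average is low.
START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
# Suspend running jobs as soon as the owner returns to the keyboard.
SUSPEND  = (KeyboardIdle < 60)
# Resume once the console has been idle again for 5 minutes.
CONTINUE = (KeyboardIdle > 5 * 60)
# Evict a job that has stayed suspended for more than 10 minutes.
PREEMPT  = (Activity == "Suspended") && \
           (CurrentTime - EnteredCurrentActivity > 600)
```

Expressions like these let each resource owner retain full control over when, and for whom, their machine runs jobs.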

Architecture

The architecture is based on a central manager model that coordinates a pool of submit and execute machines. The central manager runs the *condor_collector*, which aggregates ClassAd advertisements from the pool, and the *condor_negotiator*, which matches jobs to resources. Other key components include the *condor_schedd*, which manages job queues on submit nodes, and the *condor_startd*, which governs resource allocation on execute nodes. Communication and security are handled through HTCondor's own protocols, with support for authentication mechanisms such as Kerberos and GSSAPI. The system can interoperate with other grid computing infrastructures like the Open Science Grid and can utilize cloud computing resources from providers such as Amazon Web Services.
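The matchmaking at the heart of this architecture is built on the ClassAd language: machines and jobs each advertise attributes, and the negotiator pairs them when each side's constraints are satisfied. A simplified sketch (attribute values are illustrative):

```
# A condor_startd advertises a machine ClassAd to the collector:
MyType   = "Machine"
OpSys    = "LINUX"
Arch     = "X86_64"
Memory   = 16384          # MiB of RAM on this slot

# A job ClassAd carries a Requirements expression; the negotiator
# matches the job to a machine only when the expression evaluates
# to TRUE against that machine's attributes (and vice versa):
Requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 4096)
```

Because both sides express constraints symmetrically, resource owners and job submitters can each impose policy without central coordination.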

Job Management

Users define jobs via a structured job description file that specifies the executable, its arguments, and resource requirements. The system's matchmaking scheduler then pairs jobs with suitable machines based on policies and constraints. It provides sophisticated features for job monitoring, logging, and management, including mechanisms for checkpointing and process migration to enhance reliability. Integration with Docker and other containerization technologies allows for encapsulated execution environments, while its directed acyclic graph manager, **DAGMan**, handles complex workflows with dependencies.
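A submit description file might look like the following minimal sketch; the file names are placeholders, while the keywords (universe, request_cpus, queue, and so on) are standard submit-language commands:

```
# Hypothetical submit description file, e.g. job.sub
universe       = vanilla
executable     = analyze.sh
arguments      = input.dat

# Resource requests consulted during matchmaking:
request_cpus   = 1
request_memory = 2GB

# Ship input files to the execute node and return outputs on exit:
should_transfer_files   = YES
transfer_input_files    = input.dat
when_to_transfer_output = ON_EXIT

output = job.out
error  = job.err
log    = job.log
queue 1
```

Such a file is submitted with `condor_submit job.sub` and monitored with `condor_q`; a DAGMan workflow is described in a separate file that names submit files like this one as its nodes.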

Use Cases and Applications

It is extensively used in scientific domains requiring large-scale parameter sweeps or Monte Carlo simulations, such as computational physics, bioinformatics, and astronomy. Notable projects utilizing the platform include the LIGO Scientific Collaboration for gravitational-wave data analysis and various high-energy physics experiments at CERN. Its ability to manage data-intensive computing workloads also makes it applicable in fields like machine learning model training and cryptography.

History and Development

The project originated in 1988 from the Condor research project led by Miron Livny at the University of Wisconsin–Madison. Early development was funded by the National Science Foundation and the United States Department of Energy. The software moved from its original custom license to the Apache License 2.0 to promote broader community use and contribution, and in 2012 the project was renamed from Condor to HTCondor following a trademark dispute. The software's evolution has been closely tied to advancements in wide area network technologies and the rise of e-Science.

Related Software

The ecosystem interacts with numerous other distributed computing tools. HTCondor often serves as a local resource manager for larger grid computing systems like the Open Science Grid and the European Grid Infrastructure. It shares conceptual similarities with other batch-queueing systems such as the Slurm Workload Manager and the Portable Batch System, though with a distinct focus on opportunistic resource harvesting. It interoperates with the Globus Toolkit through the Condor-G component, while HTCondor-C enables job submission between separate HTCondor schedulers.

Category:Free software Category:Job scheduling software Category:University of Wisconsin–Madison