| Condor (HTCondor) | |
|---|---|
| Name | Condor (HTCondor) |
| Developer | University of Wisconsin–Madison |
| Initial release | 1988 |
| Development status | Active |
| Programming language | C++, Python, Java |
| Operating system | Unix-like, Microsoft Windows |
| License | Apache License 2.0 |
Condor, renamed HTCondor in 2012, is an open-source high-throughput computing system developed at the University of Wisconsin–Madison and widely adopted by research institutions, national laboratories, and enterprises. It coordinates large collections of heterogeneous workstations and clusters to run batch and parallel workloads across resources administered by diverse organizations, including CERN, NASA, and national research consortia. The system combines resource management, job scheduling, fault tolerance, and workload matchmaking to support scientific computing efforts such as Large Hadron Collider analyses and bioinformatics pipelines.
Condor provides a workload management system that enables users at institutions such as Lawrence Berkeley National Laboratory, Argonne National Laboratory, and Los Alamos National Laboratory to harness idle compute resources. It supports batch processing for research in domains funded by organizations such as the National Science Foundation and collaborations including the Open Science Grid and XSEDE. The system interoperates with middleware and tools including the Globus Toolkit, Apache Hadoop, and Docker, enabling workflows that range from Human Genome Project-era bioinformatics to contemporary climate modeling in partnership with the National Oceanic and Atmospheric Administration.
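The day-to-day interface is simple: a user describes a job's executable, inputs, and resource needs, and the system finds a machine to run it. Below is a minimal sketch using the official `htcondor` Python bindings (the version 9+ API is assumed, along with a configured submit node); each field mirrors a key in a classic condor_submit description file.

```python
import htcondor

# Describe one batch job; /bin/hostname is a trivial stand-in executable.
job = htcondor.Submit({
    "executable": "/bin/hostname",
    "output": "job.out",          # stdout is shipped back here
    "error": "job.err",
    "log": "job.log",             # HTCondor's event log for this job
    "request_cpus": "1",
    "request_memory": "128MB",
})

schedd = htcondor.Schedd()        # the local condor_schedd daemon
result = schedd.submit(job)       # queue the job; returns a SubmitResult
print("Submitted as cluster", result.cluster())
```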
Development began in 1988 at the University of Wisconsin–Madison under the leadership of Miron Livny, with collaborations involving entities such as the National Center for Supercomputing Applications and funding from agencies including the Defense Advanced Research Projects Agency and the National Science Foundation. Over the following decades, Condor evolved through integrations with distributed computing initiatives, including SETI@home-era volunteer computing, coordination with the Open Grid Forum, and adoption by science facilities such as Brookhaven National Laboratory and Fermilab. Major milestones include support for opportunistic scheduling, the introduction of the ClassAd matchmaking mechanism used by communities associated with LIGO and IceCube, and transitions to modern packaging and the Apache License 2.0.
The Condor architecture comprises daemons and services that run on nodes administered by institutions such as Princeton University, Stanford University, and the Massachusetts Institute of Technology. Core components include the central manager, which runs the condor_collector and condor_negotiator daemons; submit nodes, which run the condor_schedd scheduler; and execute nodes, which run condor_startd, comparable to the compute nodes used at the European Organization for Nuclear Research (CERN). Subsystems integrate with software such as CVMFS, Singularity, and container runtimes from projects like Kubernetes to enable reproducible environments for teams such as those at the Wellcome Sanger Institute. Ancillary tools include data-staging facilities that interact with storage systems such as CERN EOS, Amazon S3, and the parallel filesystems used at Argonne National Laboratory.
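The division of labor among these daemons can be observed directly. A sketch, again with the `htcondor` Python bindings and assuming the machine is configured with its pool's central manager: the collector is queried for the machine ClassAds that each condor_startd advertises, and the local condor_schedd for its job queue.

```python
import htcondor

collector = htcondor.Collector()   # speaks to the pool's condor_collector

# Every execute node's condor_startd advertises a machine ClassAd.
machines = collector.query(
    htcondor.AdTypes.Startd,
    projection=["Name", "State", "Cpus", "Memory"],
)
for ad in machines:
    print(ad.get("Name"), ad.get("State"), ad.get("Cpus"), ad.get("Memory"))

# The submit-side condor_schedd tracks the job queue.
schedd = htcondor.Schedd()
print(len(schedd.query(projection=["ClusterId", "ProcId"])), "jobs queued")
```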
Users submit jobs via command-line tools and APIs paralleling interfaces from the Slurm Workload Manager, Torque, and PBS Professional; Condor accepts job descriptions, expressed as ClassAds, that state a job's resource needs and constraints. Its scheduler uses the ClassAd matchmaking model developed at Wisconsin and supports checkpointing, allowing long-running jobs such as the simulations run at Lawrence Livermore National Laboratory to migrate between machines. Condor manages DAG-based workflows through its DAGMan component for pipelines akin to those deployed for Human Genome Project analyses, and it integrates with workflow managers such as Pegasus, Nextflow, and Apache Airflow used by teams at the Broad Institute.
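Both mechanisms are exposed through the same bindings. In the sketch below (`analyze`, `prepare.sub`, and `analyze.sub` are hypothetical names), the requirements expression is the job-side half of matchmaking, evaluated by the negotiator against each machine ClassAd, while `Submit.from_dag` hands a DAG file to DAGMan.

```python
import htcondor

# Matchmaking: requirements is a ClassAd expression; rank orders the
# machines that satisfy it (here, prefer machines with more memory).
job = htcondor.Submit({
    "executable": "analyze",       # hypothetical user executable
    "log": "analyze.log",
    "request_cpus": "4",
    "request_memory": "2GB",
    "requirements": 'OpSys == "LINUX" && Arch == "X86_64"',
    "rank": "Memory",
})
print("cluster", htcondor.Schedd().submit(job).cluster())

# DAG workflows: DAGMan runs "prepare" to completion before "analyze".
with open("pipeline.dag", "w") as f:
    f.write("JOB prepare prepare.sub\n"    # hypothetical submit files
            "JOB analyze analyze.sub\n"
            "PARENT prepare CHILD analyze\n")
dag_job = htcondor.Submit.from_dag("pipeline.dag")
print("DAGMan cluster", htcondor.Schedd().submit(dag_job).cluster())
```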
Condor implements security mechanisms aligned with practices published by the National Institute of Standards and Technology and standards adopted by the Internet Engineering Task Force. Authentication and authorization integrate with methods familiar to administrators at facilities such as the SLAC National Accelerator Laboratory and federated infrastructures such as eduGAIN, and it supports credential management comparable to systems used by Globus. Daemon-to-daemon communication can be authenticated, encrypted, and integrity-checked, and the system's connection brokering supports the firewall and NAT traversal needed in widely distributed deployments, a problem also faced by projects like BOINC.
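These policies are set per pool through standard configuration macros. A sketch of inspecting them from Python follows; the macro names are HTCondor's standard security knobs, while the values are site-specific (the dict-like `htcondor.param` interface is assumed to be available).

```python
import htcondor

# Standard HTCondor security configuration macros; values come from the
# site's condor_config, so None here means the knob is unset locally.
for knob in (
    "SEC_DEFAULT_AUTHENTICATION",          # is authentication required?
    "SEC_DEFAULT_AUTHENTICATION_METHODS",  # e.g. IDTOKENS, SSL, KERBEROS
    "SEC_DEFAULT_ENCRYPTION",              # encrypt network channels?
    "SEC_DEFAULT_INTEGRITY",               # integrity-check network traffic?
):
    print(knob, "=", htcondor.param.get(knob))
```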
Condor is deployed for high-throughput tasks across research centers including CERN, Fermilab, Brookhaven National Laboratory, and Los Alamos National Laboratory, and at universities such as the University of California, Berkeley, the University of Michigan, and the University of Cambridge. Typical applications include parameter sweeps in computational chemistry, Monte Carlo simulations in particle physics collaborations such as the ATLAS experiment, and large-scale data-processing pipelines in genomics consortia including Genomics England. It also harvests opportunistic cycles in campus environments, enabling resource sharing between departments at institutions like Harvard University and Yale University.
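A parameter sweep or Monte Carlo campaign maps naturally onto a single HTCondor cluster of many procs. A sketch (`mc_sim` is a hypothetical simulation binary; `$(ProcId)` expands to 0, 1, 2, ... per job, giving each run a distinct seed):

```python
import htcondor

# One cluster of 100 independent Monte Carlo jobs, seeded by ProcId.
sweep = htcondor.Submit({
    "executable": "mc_sim",             # hypothetical simulation binary
    "arguments": "--seed $(ProcId)",    # distinct seed for each job
    "output": "mc_$(ProcId).out",
    "error": "mc_$(ProcId).err",
    "log": "mc.log",
    "request_cpus": "1",
    "request_memory": "256MB",
})
result = htcondor.Schedd().submit(sweep, count=100)
print("cluster", result.cluster(), "with", result.num_procs(), "procs")
```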
Performance studies compare Condor to alternatives such as the Slurm Workload Manager and Kubernetes-based batch systems in venues such as ACM SIGMETRICS and the IEEE International Parallel and Distributed Processing Symposium. Benchmarks conducted at centers such as Oak Ridge National Laboratory and Argonne National Laboratory examine throughput, job turnaround, and scalability on clusters of the kind funded by the Department of Energy. Results often highlight Condor's strength in handling large numbers of short-lived tasks for collaborations such as the Open Science Grid, along with its integration with the federated identity and data management systems employed by the European Grid Infrastructure.