LLMpedia: The first transparent, open encyclopedia generated by LLMs

Condor (software)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Borg (cluster manager), hop 5
Expansion funnel: Raw 64 → Dedup 0 → NER 0 → Enqueued 0
Condor (software)
Name: Condor
Developer: University of Wisconsin–Madison; HTCondor Project; Open Science Grid contributors
Released: 1988
Programming language: C++
Operating system: Linux, Unix, Windows
Genre: Workload management system; high-throughput computing
License: Apache License 2.0

Condor (distributed as HTCondor since 2012) is a workload management system for high-throughput computing, originally developed at the University of Wisconsin–Madison and maintained by the HTCondor Project and associated contributors. It coordinates job submission, scheduling, and resource allocation across heterogeneous clusters, campus grids, and national infrastructures such as the Open Science Grid, enabling computational research in fields ranging from particle physics to bioinformatics. Condor integrates with batch systems, cloud platforms, and identity systems to support large-scale, fault-tolerant execution of compute- and data-intensive workloads.

Overview

Condor provides a policy-driven environment for executing tasks across distributed resources, emphasizing throughput, priority, and opportunistic use of idle capacity on systems affiliated with institutions such as CERN, Fermilab, and Lawrence Livermore National Laboratory. The system interfaces with resource managers such as the Slurm Workload Manager, PBS Professional, and Sun Grid Engine to federate compute pools, and it supports scientific projects funded by organizations including the National Science Foundation and the Department of Energy. Condor's feature set addresses batch queuing, checkpoint/restart, preemption, and file staging for workflows used by researchers in astronomy, chemistry, genomics, and climate science.

Architecture and Components

The Condor architecture comprises daemons and services that coordinate resource advertisement, matchmaking, and job lifecycle management across nodes in a pool. Core components include the central manager (collector and negotiator), schedd daemons for job queues, startd daemons for resource offers, and shadow and starter processes that mediate execution and I/O. Integration points enable interaction with directory services like LDAP, identity systems such as Kerberos and OAuth, and storage systems like Hadoop Distributed File System and Ceph. Condor also ships with tools and libraries for monitoring via Prometheus, visualization with Grafana, and workflow orchestration with adapters to Apache Airflow and Pegasus (workflow management system).
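The resource offers advertised by startd daemons take the form of machine ClassAds: flat lists of named attributes and expressions. The sketch below is illustrative only; the slot name, host, and values are hypothetical, though the attribute names follow HTCondor's machine-ad conventions.

```
MyType   = "Machine"
Name     = "slot1@node01.example.edu"
OpSys    = "LINUX"
Arch     = "X86_64"
Cpus     = 8
Memory   = 16384
State    = "Unclaimed"
Activity = "Idle"
Start    = (KeyboardIdle > 15 * 60)
```

The collector aggregates ads of this form from every machine in the pool, and the negotiator evaluates them against job ads during each matchmaking cycle.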

Job Submission and Scheduling

Users submit jobs through command-line tools and APIs that translate job requirements into ClassAds consumed by the negotiator, which performs matchmaking. Scheduling employs the ClassAd language to express attributes for jobs and machines, and negotiation cycles reconcile policies for priorities, ranks, and preemption. Support exists for DAG-based workflows, array jobs, and opportunistic execution across campus grids like the OSG and federated clouds such as Amazon Web Services and Google Cloud Platform. Advanced scheduling features include resource throttling, job retirement, hold and release semantics, and integration with pilot-based frameworks used by collaborations like ATLAS and CMS.
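In practice a job is described in a submit description file and handed to condor_submit. A minimal sketch using standard submit-file keywords follows; the executable analyze.sh and the resource values are hypothetical.

```
universe       = vanilla
executable     = analyze.sh
arguments      = $(Process)
output         = analyze.$(Process).out
error          = analyze.$(Process).err
log            = analyze.log
request_cpus   = 1
request_memory = 2GB
requirements   = (OpSys == "LINUX") && (Arch == "X86_64")
queue 100
```

The final line enqueues 100 instances of the job as an array-style parameter sweep; each instance sees a distinct $(Process) value, which is commonly used to select its input.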

Resource Management and Policies

Policy enforcement in Condor is driven by configurable rank and requirement expressions, quotas, and fair-share algorithms that reflect institutional allocations and project accounting. Administrators define role-based access using groups and ClassAd-based policies, leveraging identity-mapping tools such as glexec and publish/subscribe telemetry for usage tracking by centers like XSEDE. Resource matching can factor in hardware attributes (CPU, GPU, memory), software modules (via Environment Modules), and license availability for commercial packages such as MATLAB or ANSYS. Preemption and opportunistic policies allow lending cycles to desktop grids and volunteer compute programs similar to BOINC deployments.
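Policies of this kind are written as ClassAd expressions in the startd configuration. The sketch below shows the general shape; the accounting-group name and idle thresholds are illustrative assumptions, not defaults.

```
# Accept jobs only when the desktop has been idle for 5 minutes,
# or when the job belongs to the local physics accounting group
# (illustrative group name).
START   = (KeyboardIdle > 300) || (AccountingGroup =?= "group_physics")

# Among acceptable jobs, prefer those from the physics group.
RANK    = (AccountingGroup =?= "group_physics")

# Evict running jobs as soon as the console becomes active again.
PREEMPT = (KeyboardIdle < 60)
```

The =?= operator is ClassAd meta-equality, which compares safely even when an attribute is undefined in the job ad.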

Security and Authentication

Condor supports multiple authentication mechanisms including shared secret, Kerberos tickets, and negotiated TLS using X.509 certificates compatible with infrastructures like the Globus Toolkit and trust federations such as the InCommon Federation. Authorization leverages ACLs, role mappings, and integration with identity providers including Active Directory and Okta. Secure file transfer and data staging are accomplished with tools interoperable with GridFTP, SCP, and encrypted channels monitored for compliance with mandates from agencies like the DOE and regulatory frameworks followed by national labs and universities.
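Authentication and wire-level security are typically selected through configuration macros in the SEC_* family. A hedged sketch follows; the host pattern is a placeholder, and the appropriate method list depends on the site's credential infrastructure.

```
# Require authentication, and try SSL (X.509), Kerberos, then
# filesystem-based authentication, in that order.
SEC_DEFAULT_AUTHENTICATION         = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = SSL, KERBEROS, FS

# Require encryption and integrity checking on daemon traffic.
SEC_DEFAULT_ENCRYPTION = REQUIRED
SEC_DEFAULT_INTEGRITY  = REQUIRED

# Restrict write-level operations to hosts in the local domain
# (illustrative pattern).
ALLOW_WRITE = *.example.edu
```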

Performance, Scalability, and Use Cases

Condor scales from desktops to national cyberinfrastructure, demonstrated by production deployments at facilities such as Fermilab, Lawrence Berkeley National Laboratory, and campus clusters participating in XSEDE. Performance tuning includes configuring negotiation intervals, collector hierarchies, and resource pooling to support thousands to millions of job submissions per day for workflows in high-energy physics, computational chemistry, machine learning, and seismology. Use cases encompass ensemble simulations, parameter sweeps, Monte Carlo jobs, and continuous integration for scientific software stacks maintained in repositories like GitHub and GitLab.
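The tuning described above is carried out through daemon configuration. The sketch below names knobs commonly adjusted for large pools; the values are illustrative, and suitable settings (and defaults) vary by release and deployment.

```
# How often the negotiator begins a matchmaking cycle (seconds).
NEGOTIATOR_INTERVAL = 60

# Cap the wall-clock time of a single negotiation cycle so very
# large pools cannot stall scheduling.
NEGOTIATOR_MAX_TIME_PER_CYCLE = 1200

# Upper bound on jobs a single schedd will run simultaneously.
MAX_JOBS_RUNNING = 10000
```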

History and Development

Condor originated in the late 1980s at the University of Wisconsin–Madison as part of efforts to harness idle workstation cycles for research computing. Over decades it evolved through collaborations with national projects and laboratories, contributing to standards and interoperable middleware adopted by initiatives including the Open Science Grid and TeraGrid. Major development milestones include the introduction of ClassAds, support for checkpointing, and expansion to cloud and container environments with support for Docker and Singularity. Ongoing development is driven by the HTCondor Project, academic partners, and contributors from institutions such as University of Chicago, University of California, Berkeley, and industry collaborators working on scalable scientific infrastructure.

Category:Workload management systems
Category:High-throughput computing