
PanDA Workload Management System

PanDA Workload Management System
Name: PanDA Workload Management System
Developer: CERN; Brookhaven National Laboratory; Lawrence Berkeley National Laboratory; University of Chicago; Institute of High Energy Physics, Chinese Academy of Sciences
Initial release: 2008
Written in: Python; Java
Operating system: Linux; Scientific Linux; CentOS; Ubuntu
License: MIT License; Apache License


The PanDA (Production and Distributed Analysis) Workload Management System is a distributed workload management platform developed for large-scale scientific computing in high-energy physics and beyond. It orchestrates job distribution across heterogeneous resources, including grids, clouds, and supercomputers, to support experiments such as ATLAS, facilitating coordination among institutions such as CERN, Brookhaven National Laboratory, Lawrence Berkeley National Laboratory, and Fermilab. The system integrates with middleware and resource providers, including the Globus Toolkit, HTCondor, and major cloud vendors, to automate job execution for research collaborations such as LHCb and CMS and for multi-institution projects.

Overview

PanDA enables workflow execution for experiments with high-throughput requirements, connecting experiment services at CERN to computing centers such as KIT (Karlsruhe Institute of Technology), TRIUMF, INFN Laboratori Nazionali del Gran Sasso, and DESY. It manages pilot-based job brokerage for collaborations including ATLAS, with funding and governance from agencies such as the European Research Council, the U.S. Department of Energy, and the National Science Foundation, and from national laboratories such as Brookhaven National Laboratory and SLAC National Accelerator Laboratory. The platform interfaces with identity and access systems such as CERN Single Sign-On, federated services including eduGAIN, and certificate authorities coordinated by the International Grid Trust Federation.

Architecture and Components

The architecture separates the control, execution, and data planes and comprises services such as the PanDA Server front-ends, pilot agents, workload queues, and database back-ends. Components integrate with middleware layers such as gLite, ARC, and UNICORE, and with resource managers such as PBS Professional, Slurm, and HTCondor. Data orchestration uses systems including Rucio, dCache, EOS, and Ceph, while metadata and catalogs interact with the CERN Document Server and institutional registries at Brookhaven National Laboratory. Security and authentication are mediated by VOMS and OAuth 2.0 when connecting to services at NIKHEF, RAL (Rutherford Appleton Laboratory), and CC-IN2P3.
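
The pilot model underpinning this architecture can be pictured with a short sketch: a pilot agent asks a server front-end for a matched job, executes the payload, and reports its state back. This is a hypothetical illustration only; the server URL, endpoint paths, queue name, and field names below are assumptions and do not reflect the documented PanDA REST interface.

```python
# Minimal sketch of one pilot cycle (hypothetical endpoints and fields).
import subprocess
import requests

PANDA_SERVER = "https://pandaserver.example.org:25443"  # placeholder front-end URL

def fetch_job(queue: str) -> dict | None:
    """Ask the server front-end for a job matched to this pilot's queue."""
    resp = requests.get(f"{PANDA_SERVER}/getJob",
                        params={"computingElement": queue}, timeout=30)
    resp.raise_for_status()
    job = resp.json()
    return job if job.get("jobId") else None

def report_state(job_id: int, state: str) -> None:
    """Send a heartbeat/state update back to the server."""
    requests.post(f"{PANDA_SERVER}/updateJob",
                  json={"jobId": job_id, "state": state}, timeout=30)

def run_pilot(queue: str = "EXAMPLE_QUEUE") -> None:
    """One pilot cycle: pull a job, execute the payload, report the outcome."""
    job = fetch_job(queue)
    if job is None:
        return  # nothing brokered to this resource right now
    report_state(job["jobId"], "running")
    result = subprocess.run(job["command"], shell=True)
    report_state(job["jobId"], "finished" if result.returncode == 0 else "failed")

if __name__ == "__main__":
    run_pilot()
```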

Job Submission and Scheduling

Jobs are submitted via experiment-specific interfaces and portals developed by teams at the University of Chicago, the University of California, Berkeley, and the University of Oxford, and are then brokered to pilots that validate site environments. The scheduling pipeline uses matchmaking informed by resource descriptors from OpenStack, Amazon Web Services, Google Cloud Platform, and HPC centers such as Oak Ridge National Laboratory and Lawrence Livermore National Laboratory. Policies reflect collaboration agreements among institutions such as CERN, Fermilab, INFN, and IHEP, and adapt to queuing systems at facilities including NERSC, Argonne National Laboratory, and the Pawsey Supercomputing Centre.
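
The brokerage step can be pictured as a matchmaking pass over simplified job and site descriptors. The sketch below uses generic attributes (cores, memory, an online flag) as assumptions; it does not reproduce the actual PanDA schema or policy logic.

```python
# Illustrative brokerage pass over simplified descriptors (placeholder schema).
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    cores: int
    memory_mb: int

@dataclass
class Site:
    name: str
    free_cores: int
    memory_mb_per_core: int
    online: bool

def broker(jobs: list[Job], sites: list[Site]) -> dict[int, str]:
    """Assign each job to the first online site that satisfies its resource request."""
    assignment: dict[int, str] = {}
    for job in sorted(jobs, key=lambda j: j.cores, reverse=True):
        for site in sites:
            if (site.online
                    and site.free_cores >= job.cores
                    and site.memory_mb_per_core * job.cores >= job.memory_mb):
                assignment[job.job_id] = site.name
                site.free_cores -= job.cores
                break
    return assignment

if __name__ == "__main__":
    jobs = [Job(1, 8, 16000), Job(2, 1, 2000)]
    sites = [Site("CERN-T0", 64, 2000, True), Site("BNL-T1", 4, 4000, True)]
    print(broker(jobs, sites))  # e.g. {1: 'CERN-T0', 2: 'CERN-T0'}
```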

Data Management and Storage Integration

PanDA coordinates bulk data movement across data management and transfer systems including Rucio, dCache, XRootD, SRM (Storage Resource Manager), and GridFTP, interfacing with data catalogs such as the CERN Open Data Portal and with datasets produced by detectors such as ATLAS and CMS. Integration spans archival systems at CERN and national repositories at Brookhaven National Laboratory, KIT, and TRIUMF, and leverages caching infrastructures such as content delivery network nodes hosted by centers like FNAL and CC-IN2P3.
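
As a rough illustration of input stage-in with protocol fallback, the sketch below tries a Rucio command-line download first and falls back to a direct XRootD copy. The dataset identifier, replica URL, and local paths are placeholders, and real pilot data movers are considerably more involved.

```python
# Sketch of input stage-in with protocol fallback; assumes the `rucio` and
# `xrdcp` command-line tools are available, and uses placeholder names/paths.
import shutil
import subprocess
from pathlib import Path

def stage_in(did: str, dest: Path) -> bool:
    """Try a Rucio download first, then fall back to a direct XRootD copy."""
    dest.mkdir(parents=True, exist_ok=True)
    if shutil.which("rucio"):
        result = subprocess.run(["rucio", "download", "--dir", str(dest), did])
        if result.returncode == 0:
            return True
    # Fallback: direct xrdcp from a known replica URL (placeholder endpoint).
    replica_url = f"root://eospublic.cern.ch//eos/example/{did}"
    return subprocess.run(["xrdcp", replica_url, str(dest)]).returncode == 0

if __name__ == "__main__":
    ok = stage_in("mc23_13p6TeV:EVNT.example.pool.root", Path("./inputs"))
    print("stage-in succeeded" if ok else "stage-in failed")
```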

Monitoring, Logging, and Accounting

Monitoring uses dashboards and telemetry pipelines connected to tools such as Grafana and the Elastic (ELK) Stack, with logging aggregated in installations at data centers at CERN and Brookhaven National Laboratory. Accounting and usage reports are produced for stakeholders including the European Grid Infrastructure, the DOE Office of Science, and project management offices at ATLAS and LHCb. Alerting integrates with incident response teams at the CERN Computer Centre and operations centers at RAL and TRIUMF, while provenance information is recorded for reproducibility in registries maintained by CERN Open Data and institutional archives.
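
A per-job accounting record pushed to a search index, as a dashboard back-end might consume it, can be sketched as follows. The endpoint URL, index name, and document fields are placeholders; production deployments usually ship such records through dedicated log pipelines rather than direct HTTP posts.

```python
# Hedged sketch: index one per-job accounting document via Elasticsearch's
# REST API. URL, index name, and fields are illustrative placeholders.
from datetime import datetime, timezone
import requests

ES_URL = "http://es.example.org:9200"   # hypothetical monitoring cluster
INDEX = "panda-job-accounting"

def publish_job_record(job_id: int, site: str, state: str, wall_seconds: float) -> None:
    """Post a single accounting document for dashboards and usage reports."""
    doc = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "job_id": job_id,
        "site": site,
        "state": state,
        "wall_seconds": wall_seconds,
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=doc, timeout=10)
    resp.raise_for_status()

if __name__ == "__main__":
    publish_job_record(123456789, "BNL_PROD", "finished", 5420.0)
```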

Deployment, Scalability, and Performance

PanDA has been deployed on grids, clouds, and HPC platforms, demonstrating scalability to millions of jobs per year during runs of the Large Hadron Collider, coordinated with experiments such as ATLAS and ALICE. Performance tuning draws on techniques from distributed computing projects at NERSC, OLCF, and PRACE centers, and uses containerization with Docker and Singularity and orchestration via Kubernetes in cloud deployments led by teams from Lawrence Berkeley National Laboratory and SLAC National Accelerator Laboratory. Scaling tests reference load-generation frameworks used in studies at Brookhaven National Laboratory and benchmarking suites from SPEC, while operational resilience aligns with best practices from ITIL-oriented operations in research infrastructures.
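
Container-wrapped payload execution of the kind described above can be illustrated with a small wrapper that prefers an Apptainer/Singularity runtime when one is installed. The image path and payload command are placeholders, not a documented PanDA configuration.

```python
# Sketch of container-wrapped payload execution; image path and payload
# command are placeholders, assuming an Apptainer/Singularity runtime exists.
import shutil
import subprocess

def run_in_container(image: str, command: list[str]) -> int:
    """Execute a payload inside a container, preferring apptainer over singularity."""
    runtime = shutil.which("apptainer") or shutil.which("singularity")
    if runtime is None:
        raise RuntimeError("no container runtime found on this node")
    return subprocess.run([runtime, "exec", image, *command]).returncode

if __name__ == "__main__":
    rc = run_in_container("/cvmfs/unpacked.cern.ch/example/payload:latest",
                          ["python3", "-c", "print('payload ran')"])
    print("exit code:", rc)
```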

History and Use Cases

Originally developed for the ATLAS experiment at CERN and co-developed by teams at Brookhaven National Laboratory and the University of Chicago, PanDA evolved to support diverse science domains, including astrophysics collaborations at NASA, bioinformatics pipelines at the European Molecular Biology Laboratory, and climate modeling consortia at IPCC-affiliated centers. Notable deployments include production operations during Large Hadron Collider run periods, integration with cloud pilots on Amazon Web Services and Google Cloud Platform, and adaptation for HPC campaigns at the Argonne Leadership Computing Facility and the Oak Ridge Leadership Computing Facility. The system's governance involves collaborations among agencies such as the DOE Office of Science and the European Commission, and national laboratories including Lawrence Berkeley National Laboratory and Brookhaven National Laboratory.

Category:Distributed computing Category:High-throughput computing