
GlideinWMS

GlideinWMS
Name: GlideinWMS
Developer: Fermilab; Open Science Grid contributors
Programming language: Python
Operating system: Linux, Unix
Platform: Worldwide LHC Computing Grid, Open Science Grid
License: Apache License

GlideinWMS is a pilot-based workload management system designed to provision compute resources across distributed infrastructures such as the Worldwide LHC Computing Grid (WLCG), the Open Science Grid (OSG), and national research clouds. It enables opportunistic and dedicated use of heterogeneous resources by submitting lightweight pilot jobs that create dynamic execution environments for user payloads, integrating with middleware such as HTCondor, CERN services, and cloud platforms including Amazon Web Services, Google Cloud Platform, and Microsoft Azure. GlideinWMS is developed primarily at Fermilab in collaboration with projects such as the US LHC Science Program and communities tied to experiments like CMS and ATLAS.

Overview

GlideinWMS provides a pilot factory model that supplies transient worker nodes to batch systems and cloud endpoints, coordinating with resource providers such as National Science Foundation-funded infrastructures, regional centers such as WLCG Tier-1 sites, and research consortia including Open Science Grid partners. It leverages scheduling systems like HTCondor and job submission frameworks used by collaborations such as CMS, ATLAS, and the LIGO Scientific Collaboration, as well as projects associated with High-Luminosity LHC preparations. GlideinWMS interoperates with identity and access frameworks such as CILogon and the Globus Toolkit, and with certificate authorities affiliated with DOE laboratories.

Architecture and Components

Core components include the Factory, the Frontend, and the Collector/Negotiator pair familiar from HTCondor deployments; the Factory submits pilot jobs to endpoints managed by providers such as XSEDE sites, NERSC, and regional computing centers, while the Frontend monitors user demand and applies VO and group policies reflecting governance from entities like the European Grid Infrastructure and collaboration management systems at CERN. GlideinWMS integrates with the HTCondor-G gateway, the CVMFS distribution service for software, and storage systems such as EOS and dCache. Monitoring and accounting tie into tools like Grafana, Prometheus, and logging infrastructures inspired by ELK Stack deployments at national laboratories.
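
The Collector/Negotiator interaction can be pictured as ClassAd-style matchmaking: each pilot advertises its attributes, each job carries requirements, and the Negotiator pairs them. The Python sketch below is a simplified illustration of that idea only, not HTCondor's actual implementation; the attribute names are examples.

# Simplified illustration of ClassAd-style matchmaking between a pilot's
# resource ad and a job's requirements (attribute names are examples only).

def matches(job_requirements, pilot_ad):
    """Evaluate every requirement predicate against the pilot's advertised attributes."""
    return all(check(pilot_ad) for check in job_requirements)

# A pilot, once bootstrapped, advertises its resources to the Collector.
pilot_ad = {"Cpus": 8, "Memory_MB": 16000, "GLIDEIN_Site": "ExampleSite"}

# A user job expresses its needs as predicates over those attributes.
job_requirements = [
    lambda ad: ad["Cpus"] >= 4,
    lambda ad: ad["Memory_MB"] >= 8000,
]

if matches(job_requirements, pilot_ad):
    print("Negotiator would match this job to the pilot slot")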

Installation and Configuration

Deployments commonly occur on middleware stacks prevalent at Fermilab and partner sites, with packages and configuration managed via tools like Ansible, Puppet, and YUM/APT repositories maintained by science grids. Administrators map site resources using information services akin to BDII or site-specific catalogs used by WLCG operations, aligning with vo-specific policies from collaborations such as Belle II and IceCube. Integration with cloud providers requires credentials and APIs consistent with OpenStack deployments at research clouds, and IAM mappings analogous to practices at Lawrence Berkeley National Laboratory or Brookhaven National Laboratory.
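
As an illustration of the kind of site-to-entry mapping an administrator maintains, the sketch below models a few Factory entry points as plain Python data. The entry names, attributes, and limits are hypothetical; real deployments describe entries in the Factory's XML configuration rather than in code like this.

# Hypothetical catalog of Factory entry points, one per site or batch endpoint.
# Names, attributes, and limits are illustrative only.
ENTRIES = {
    "Example_Site_CE1": {"gatekeeper": "ce1.example.edu", "max_pilots": 2000,
                         "supported_vos": {"cms", "icecube"}},
    "Example_Cloud_A":  {"gatekeeper": "cloud-a.example.org", "max_pilots": 500,
                         "supported_vos": {"ligo"}},
}

def entries_for_vo(vo):
    """Return the entry names whose policy admits the given virtual organization."""
    return [name for name, cfg in ENTRIES.items() if vo in cfg["supported_vos"]]

print(entries_for_vo("cms"))   # -> ['Example_Site_CE1']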

Operation and Workflow

A typical workflow begins with the Frontend calculating resource needs and instructing the Factory to submit pilot jobs to endpoints such as HTCondor pools, cloud orchestration services on Amazon EC2, or batch systems at NERSC. Pilots, once bootstrapped, advertise their resources to a Collector, which facilitates matchmaking through a Negotiator using HTCondor policies derived from the experiment workload managers of CMS and ATLAS. User payloads are dispatched from submitters employing workload tools such as CRAB or PanDA, and data access is achieved via grid protocols implemented by XRootD and transfer services coordinated with Fermilab data management teams.
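
The demand-driven part of this loop can be sketched schematically: the Frontend inspects the idle job queue, compares it with pilots already running or pending at each entry, and asks the Factory for the difference, capped by a per-entry limit. The function and parameter names below are hypothetical and only illustrate the control flow, not the project's actual code.

# Schematic Frontend provisioning calculation (names and limits are hypothetical).

def pilots_to_request(idle_jobs, running_pilots, pending_pilots, max_pilots):
    """Request enough pilots to cover idle demand without exceeding the per-entry cap."""
    demand = idle_jobs - (running_pilots + pending_pilots)
    headroom = max_pilots - (running_pilots + pending_pilots)
    return max(0, min(demand, headroom))

# Example: 1200 idle jobs, 800 pilots running, 100 pending, 2000 allowed.
request = pilots_to_request(idle_jobs=1200, running_pilots=800,
                            pending_pilots=100, max_pilots=2000)
print(f"Ask the Factory for {request} new pilots")   # -> 300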

Security and Authentication

GlideinWMS relies on credential management and authentication mechanisms in common use at CERN and DOE facilities, including X.509 certificates from the certificate authorities used by EGI and trust frameworks such as those promoted by CILogon. Authorization meshes with VOMS-style attribute assertions familiar to collaborations such as ALICE and LHCb, and with federated identity initiatives pursued by InCommon and academic identity providers. Secure bootstrapping of pilots leverages sandboxing techniques, filesystem isolation schemes practiced at NERSC, and workload provenance tracking akin to systems used at Brookhaven National Laboratory.
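
As a small illustration of the credential hygiene involved, the sketch below checks how much lifetime remains on a PEM-encoded X.509 proxy before it would be handed to pilots. It uses the third-party cryptography package, a hypothetical proxy path, and a hypothetical lifetime threshold; it is not the credential-handling code GlideinWMS itself ships.

# Check remaining lifetime of a PEM-encoded X.509 proxy certificate before
# using it for pilot submission (path and threshold are hypothetical).
from datetime import datetime, timedelta
from cryptography import x509

MIN_LIFETIME = timedelta(hours=6)

# A grid proxy file usually holds the proxy certificate first, followed by its
# private key and the issuer chain; parse only the first certificate block.
pem = open("/tmp/x509up_example", "rb").read()          # hypothetical proxy path
first_block = pem.split(b"-----END CERTIFICATE-----")[0] + b"-----END CERTIFICATE-----"
cert = x509.load_pem_x509_certificate(first_block)

remaining = cert.not_valid_after - datetime.utcnow()     # both naive UTC datetimes
if remaining < MIN_LIFETIME:
    raise RuntimeError(f"Proxy expires in {remaining}; renew it before submitting pilots")
print(f"Proxy valid for another {remaining}")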

Performance and Scalability

GlideinWMS has been exercised in large-scale campaigns supporting experiments such as CMS and user communities of the Open Science Grid, demonstrating scaling to tens of thousands of concurrent pilots across resources including WLCG tiers and commercial cloud fleets. Performance tuning draws on practices developed at Fermilab and on operational experience from collaborations with ATLAS and other large experiments, using metrics collected via Prometheus and visualized in Grafana to optimize pilot lifetimes, submission rates, and matchmaking throughput. Scalability strategies include hierarchical Frontend configurations and federation patterns drawn from distributed systems research at institutions such as MIT and the University of Chicago.
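
As an example of the kind of metric export that feeds such dashboards, the sketch below publishes pilot counts with the prometheus_client Python package for Prometheus to scrape. The metric and label names are illustrative only, not the ones GlideinWMS or any particular site actually exposes.

# Export hypothetical pilot-level metrics for Prometheus scraping
# (metric and label names are illustrative only).
import random
import time
from prometheus_client import Gauge, start_http_server

running_pilots = Gauge("glidein_running_pilots", "Pilots currently running", ["entry"])
idle_jobs = Gauge("frontend_idle_jobs", "User jobs waiting for a slot", ["group"])

start_http_server(8000)   # metrics served at http://localhost:8000/metrics

while True:
    # In a real exporter these values would come from Collector or schedd queries;
    # random numbers stand in for them here.
    running_pilots.labels(entry="Example_Site_CE1").set(random.randint(0, 2000))
    idle_jobs.labels(group="analysis").set(random.randint(0, 5000))
    time.sleep(30)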

History and Development

Development began within the distributed computing efforts at Fermilab to meet the needs of particle physics experiments such as CDF and later CMS, evolving alongside middleware like Condor (now HTCondor) and initiatives including the Open Science Grid. Contributions have come from institutions across the US LHC Science Program and international partners tied to WLCG operations, with feature development driven by requirements from collaborations including ATLAS, LHCb, and gravitational-wave projects such as LIGO. The project roadmap has reflected trends in cloud adoption exemplified by Amazon Web Services collaborations, containerization practices popularized by Docker, and orchestration patterns from Kubernetes, while continuing integration with community services at CERN and national laboratories such as Lawrence Livermore National Laboratory.

Category:Distributed computing