LLMpedia: The first transparent, open encyclopedia generated by LLMs

Pegasus (workflow management)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Pegasus (workflow management)
Name: Pegasus
Title: Pegasus (workflow management)
Developer: University of Southern California Information Sciences Institute, Pegasus Workflow Management System Team
Released: 2002
Programming language: Java (programming language), Python (programming language)
Operating system: Linux, macOS
License: Apache License

Pegasus (workflow management) is an open-source scientific workflow management system developed to map complex computational workflows onto distributed resources. It automates workflow planning, execution, monitoring, and provenance tracking for large-scale computations used by researchers and engineers across domains. Pegasus integrates with high-performance computing and cloud platforms to support reproducible research and data-intensive science.

Overview

Pegasus originated from research at the University of Southern California's Information Sciences Institute (USC ISI), where it was created to address the challenges of orchestrating distributed computations for collaborations such as the LIGO Scientific Collaboration and the ALMA Observatory, and for projects funded by the National Science Foundation. The project has been influenced by grid computing initiatives like the Open Science Grid and by workflow efforts such as Triana (software), Taverna (software), and Kepler (software). Pegasus provides tools for abstract workflow description, automatic mapping to concrete resources, fault tolerance, and provenance capture, and these have been used in collaborations including Large Hadron Collider experiments, the Square Kilometre Array, and climate modeling consortia.

Architecture and Components

Pegasus comprises planning, execution, and monitoring components that interact with resources managed by systems such as the Slurm Workload Manager, HTCondor, and Kubernetes, and with cloud platforms such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Core components include the planner, which transforms an abstract, resource-independent workflow into an executable directed acyclic graph (DAG); an execution layer built on HTCondor DAGMan, through which jobs can reach batch schedulers such as PBS Professional and Torque (software); and a monitoring stack compatible with Prometheus (software) and Grafana. Pegasus leverages data management tools and catalogs inspired by iRODS and integrates provenance models influenced by the W3C PROV standard. Workflow descriptions can be generated from languages and systems such as the Common Workflow Language (CWL), Nextflow, and Snakemake through adapters.
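The planning step described above can be illustrated with a minimal sketch. It is not Pegasus code and does not use the Pegasus API: the job names, file names, and the functions `infer_edges` and `topological_order` are hypothetical. It assumes only that each job declares the logical files it reads and writes, derives parent-child edges from those declarations, and orders the jobs so that every producer runs before its consumers.

```python
from collections import defaultdict, deque

# Hypothetical abstract workflow: each job lists the logical files it reads and writes.
ABSTRACT_WORKFLOW = {
    "preprocess": {"inputs": ["raw.dat"], "outputs": ["clean.dat"]},
    "analyze_a": {"inputs": ["clean.dat"], "outputs": ["a.out"]},
    "analyze_b": {"inputs": ["clean.dat"], "outputs": ["b.out"]},
    "merge": {"inputs": ["a.out", "b.out"], "outputs": ["final.out"]},
}


def infer_edges(workflow):
    """Return parent -> set(children) edges implied by shared files."""
    producer = {}
    for job, spec in workflow.items():
        for name in spec["outputs"]:
            producer[name] = job
    edges = defaultdict(set)
    for job, spec in workflow.items():
        for name in spec["inputs"]:
            parent = producer.get(name)
            # Files with no producer (e.g. raw.dat) are external inputs.
            if parent is not None and parent != job:
                edges[parent].add(job)
    return edges


def topological_order(workflow, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indegree = {job: 0 for job in workflow}
    for children in edges.values():
        for child in children:
            indegree[child] += 1
    ready = deque(sorted(job for job, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(edges.get(job, ())):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(workflow):
        raise ValueError("workflow contains a cycle")
    return order


if __name__ == "__main__":
    dag_edges = infer_edges(ABSTRACT_WORKFLOW)
    print(topological_order(ABSTRACT_WORKFLOW, dag_edges))
```

A real planner does considerably more than this, for example adding data staging and cleanup jobs and binding each job to a site. The sketch covers only the dependency structure.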

Features and Capabilities

Pegasus offers automatic workflow planning, data staging, retry and checkpointing semantics, metadata and provenance capture, and site selection informed by historical performance data from sources like XDMoD and Globus Toolkit. It supports ensemble and parameter-sweep patterns used in projects such as ENIGMA (consortium) and COPERNICUS (program), enables reproducible pipelines for consortia including the HPC User Forum, and integrates authentication with identity providers and protocols such as InCommon, CILogon, and OAuth 2.0. Security and compliance workflows interoperate with services such as OpenID Connect and with resource managers used in national laboratories such as Argonne National Laboratory, Lawrence Berkeley National Laboratory, and Oak Ridge National Laboratory.
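Site selection informed by historical performance can be sketched as follows. This is an illustration under stated assumptions, not one of Pegasus's own site selectors: the site names, the runtimes in `HISTORY`, and the function `pick_site` are hypothetical, and the policy (lowest mean runtime among sites with enough samples) is only one of many possible heuristics.

```python
from statistics import mean

# Hypothetical history: observed runtimes in seconds for one job type, per site.
HISTORY = {
    "cluster_a": [120.0, 130.0, 125.0],
    "cluster_b": [90.0, 300.0, 95.0],
    "cloud_c": [110.0, 112.0],
}


def pick_site(history, min_samples=2):
    """Return the site with the lowest mean runtime, ignoring sparse histories."""
    candidates = {
        site: mean(runs)
        for site, runs in history.items()
        if len(runs) >= min_samples
    }
    if not candidates:
        raise ValueError("no site has enough history")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    print(pick_site(HISTORY))
    print(pick_site(HISTORY, min_samples=3))
```

Requiring a minimum number of samples keeps a single lucky run from dominating the choice. A production selector would likely also weigh queue wait times and data locality.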

Use Cases and Applications

Pegasus has been applied in astrophysics workflows for the Event Horizon Telescope, genomics pipelines in projects like the 1000 Genomes Project, and physics analyses connected to CERN experiments. Earth sciences groups use Pegasus for satellite data processing from missions such as Landsat and Sentinel-2, while climate scientists integrate it with modeling campaigns coordinated by the World Climate Research Programme. Bioinformatics teams at institutions like the Broad Institute and the Wellcome Sanger Institute deploy Pegasus for variant calling and sequence alignment. Additionally, digital humanities projects at institutions such as Stanford University and Harvard University have used Pegasus to manage large-scale text analysis workflows.

Deployment, Scalability, and Performance

Pegasus scales from single-node deployments to federated infrastructures spanning national cyberinfrastructure programs like the Extreme Science and Engineering Discovery Environment (XSEDE) and the European Grid Infrastructure. Performance tuning leverages scheduler features from Slurm, resource provisioning via OpenStack, and autoscaling in clouds like Amazon EC2. Empirical benchmarking and profiling are often conducted with tools and initiatives such as SPEC (benchmarking) and Benchmarking Science Gateway, and in collaboration with centers including the National Center for Supercomputing Applications and the Texas Advanced Computing Center. Fault tolerance is achieved through retry policies and provenance-driven restart mechanisms, which are used in long-running workflows at facilities such as NERSC.
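The combination of retry policies and restart from recorded progress can be sketched in a few lines. This is a simplified illustration, not the mechanism Pegasus itself uses: the function `run_workflow`, its parameters, and the in-memory `completed` set are hypothetical stand-ins for a persistent record of finished jobs.

```python
def run_workflow(order, actions, completed, max_retries=2):
    """Run jobs in order, skipping those already in `completed`.

    `actions` maps a job name to a zero-argument callable. `completed` is a
    set updated in place, so a later call resumes after the last success.
    Each job gets one initial attempt plus up to `max_retries` retries.
    Returns the name of the job that exhausted its retries, or None.
    """
    for job in order:
        if job in completed:
            continue  # finished in an earlier run
        for _attempt in range(max_retries + 1):
            try:
                actions[job]()
            except Exception:
                continue  # retry the same job
            completed.add(job)
            break
        else:
            return job  # retries exhausted; stop so dependents do not run
    return None


if __name__ == "__main__":
    state = {"fail_b": True}

    def job_b():
        if state["fail_b"]:
            raise RuntimeError("transient failure")

    jobs = {"a": lambda: None, "b": job_b, "c": lambda: None}
    done = set()
    print(run_workflow(["a", "b", "c"], jobs, done))  # first run stops at b
    state["fail_b"] = False
    print(run_workflow(["a", "b", "c"], jobs, done))  # rerun skips a, finishes b and c
```

Because completed jobs are recorded, the second call does not repeat work that already succeeded, which is the property that matters for long-running workflows.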

Community, Development, and Licensing

Pegasus is developed by an international team with contributors from universities, national laboratories, and industry partners, including the University of Southern California, the University of Chicago, and Argonne National Laboratory, as well as organizations active in open-source ecosystems such as the Apache Software Foundation. The project engages users through workshops at conferences such as the Supercomputing Conference, PEARC, and ACM SIGMOD, and maintains community channels used by projects in the Open Science Grid and XSEDE communities. Pegasus is released under the permissive Apache License 2.0, which allows both academic and commercial use. Development is hosted on GitHub and coordinated via governance models similar to those used by Apache Software Foundation projects.

Category:Workflow management systems