REANA — LLMpedia

REANA
Name	REANA
Developer	European Organization for Nuclear Research (CERN); REANA community
Initial release	2017
Written in	Python, YAML, JSON
Operating system	Linux
License	MIT License

Contents

Overview
Architecture and Components
Workflow and Features
Use Cases and Adoption
Development and Governance

REANA is an open-source platform for reproducible, reusable, and scalable data analysis workspaces designed to support computational research workflows. It provides container-based execution, provenance tracking, and integration with persistent storage to enable researchers to compose, execute, and share multi-step analyses across distributed compute infrastructures. The project emerged from collaborations among high-energy physics, bioinformatics, and earth sciences communities and emphasizes interoperability with scientific tooling and standards.

Overview

REANA targets reproducibility challenges faced by researchers whose pipelines combine command-line tools, statistical packages, and containerized environments. It integrates computing paradigms used by communities around CERN, European Grid Infrastructure, PRACE, and Open Science Grid, as well as Docker-centric deployments on Kubernetes clusters. The platform interoperates with workflow languages and systems such as Common Workflow Language (CWL), Snakemake, Nextflow, and Galaxy, with workflow descriptions written in YAML, enabling reuse across projects like those led by European Molecular Biology Laboratory, Max Planck Society, Wellcome Trust Sanger Institute, and research groups at ETH Zurich. REANA’s design reflects the FAIR data principles and the goals of infrastructure efforts such as ELIXIR and EOSC.

Architecture and Components

The architecture decomposes functionality into microservices that orchestrate containerized job execution, storage, and provenance. Core components include a workflow repository, a job controller, a compute backend connector, and a provenance database. Container orchestration relies on Kubernetes API primitives and container runtimes like Docker and Podman; storage integration uses object stores compatible with Amazon S3, Ceph, and OpenStack Swift. Authentication and authorization integrate with identity providers and protocols such as CERN Single Sign-On and OAuth 2.0, and with identity federations such as eduGAIN. The system describes artifacts in YAML manifests and metadata schemas that capture dependencies, which facilitates reproducibility across environments, including clusters managed by Slurm and batch systems used at Lawrence Berkeley National Laboratory and Fermilab.

Workflow and Features

REANA executes workflows through a declarative specification that enumerates input data, computational steps, and container images. Supported workflow flavors include serial scripts, CWL, Snakemake, and Nextflow recipes, each mapped to an execution DAG (directed acyclic graph) and job templates. Features include dataset versioning, provenance metadata, caching and reuse of intermediate artifacts, and interactive sessions for analysis with Jupyter Notebook and RStudio. The platform captures lineage metadata compatible with the W3C provenance models and integrates with repository services like Zenodo and institutional data repositories at University of Oxford and Karolinska Institutet for dataset publication. For monitoring and debugging, REANA exposes metrics consumable by tools such as Prometheus, with visualization via Grafana. Resource management supports elastic scaling on cloud providers including Amazon Web Services, Google Cloud Platform, and Microsoft Azure, as well as on scientific clouds run by European Open Science Cloud partners.
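
The ideas of a declarative specification, its mapping to an execution DAG, and the reuse of cached intermediate results can be illustrated with a minimal Python sketch. This is not REANA's implementation or its specification format: the SPEC layout, the step names, the container image tag, and the functions execution_order, cache_key, and plan are all hypothetical and chosen only for illustration. The sketch orders steps by their declared dependencies and derives a content hash per step, so that a step whose image, command, parameters, and upstream steps are unchanged could in principle be served from a cache.

```python
import hashlib
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical declarative description: inputs, steps, container images.
# The field names are assumptions for this sketch, not REANA's schema.
SPEC = {
    "inputs": {"files": ["data/events.csv"], "parameters": {"threshold": 0.5}},
    "steps": {
        "select": {"image": "python:3.11", "command": "python select.py", "needs": []},
        "fit": {"image": "python:3.11", "command": "python fit.py", "needs": ["select"]},
        "plot": {"image": "python:3.11", "command": "python plot.py", "needs": ["fit"]},
    },
}


def execution_order(spec):
    """Return step names so that every step follows its dependencies."""
    graph = {name: set(step["needs"]) for name, step in spec["steps"].items()}
    return list(TopologicalSorter(graph).static_order())


def cache_key(spec, step_name, upstream_keys):
    """Hash what determines a step's result in this toy model.

    Input file contents are deliberately not hashed here; a real system
    would need to account for them as well.
    """
    step = spec["steps"][step_name]
    payload = {
        "image": step["image"],
        "command": step["command"],
        "parameters": spec["inputs"]["parameters"],
        "upstream": sorted(upstream_keys),
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def plan(spec):
    """Compute a cache key for every step, walking the DAG in order."""
    keys = {}
    for name in execution_order(spec):
        upstream = [keys[dep] for dep in spec["steps"][name]["needs"]]
        keys[name] = cache_key(spec, name, upstream)
    return keys


if __name__ == "__main__":
    for step_name, key in plan(SPEC).items():
        print(step_name, key[:12])
```

Because each key includes the keys of the upstream steps, editing only the final step's command changes only that step's key, while editing an early step invalidates everything downstream of it. That is the property a cache of intermediate artifacts relies on.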

Use Cases and Adoption

REANA has been adopted across multiple scientific domains to reproduce published analyses, run large-scale parameter sweeps, and enable collaborative method development. High-energy physics groups at CERN have used it to encapsulate event-selection pipelines, while genomics teams at EMBL-EBI and clinical bioinformatics groups at the Broad Institute have used it for variant-calling workflows. Climate science consortia at ECMWF and geospatial labs at the European Space Agency have employed the platform for data harmonization pipelines. Educational adopters include workflow training programs at ETH Zurich, Imperial College London, and University of Cambridge, which integrate the platform with teaching materials and reproducible assignments. REANA’s model has supported reproducible results for projects that archived artifacts on Zenodo and cited them in publications in journals such as Nature Methods, PLOS Computational Biology, and IEEE Transactions.

Development and Governance

Development is stewarded by a community anchored at CERN with contributors from academic institutions, national laboratories, and open-source organizations. Governance follows meritocratic open-source practices, with roadmap discussions held in issue trackers and in working groups aligned to areas like security, interoperability, and user experience. Continuous integration and delivery workflows use tooling familiar to open-source projects, including GitHub, Travis CI, GitLab CI, and container image registries maintained by partners. Funding and collaborative development have included grants and programs from the European Commission and partnerships with infrastructure projects such as ELIXIR and EOSC. The community publishes design documents, API specifications, and tutorials that link to existing standards from W3C, OpenAPI Initiative, and domain-specific consortia to promote long-term sustainability.

Category:Open-source software Category:Scientific computing Category:Data provenance