Cromwell (software)

Name	Cromwell
Developer	Broad Institute
Released	2016
Programming language	Scala
Operating system	Linux, macOS
License	BSD 3-Clause License

Contents

Overview
History and Development
Architecture and Design
Features and Functionality
Use Cases and Adoption
Performance and Scalability
Security and Licensing

Cromwell is an open-source workflow execution engine designed to run scientific pipelines described in the Workflow Description Language (WDL) on diverse compute backends. Developed for reproducible analysis in high-throughput biology, it integrates with orchestration systems and cloud platforms to execute complex genomic, transcriptomic, and other bioinformatics workflows. Cromwell emphasizes portability, scalability, and interoperability across platforms used by research institutions, consortia, and commercial organizations.

Overview

Cromwell operates as a workflow runner that interprets Workflow Description Language scripts and dispatches tasks to execution backends such as local compute clusters, Google Cloud Platform, and Amazon Web Services. It was created to support large-scale genomics projects at the Broad Institute and to enable workflow portability across environments, including on-premises clusters managed by Slurm and cloud-native platforms like Kubernetes. Cromwell interoperates with workflow languages, resource managers, and data services used by resources such as the Genome Aggregation Database and projects like the 1000 Genomes Project, facilitating reproducible pipelines for sequencing centers and clinical laboratories.
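As a rough illustration of what "interpreting a WDL script" looks like in practice, the sketch below writes a minimal WDL 1.0 workflow and an inputs file, then assembles a command line for Cromwell's `run` mode. The workflow itself, the `prepare_run` helper, and the `cromwell.jar` path are all hypothetical and chosen only for this example; the command is printed rather than executed, since running it requires Java and a downloaded Cromwell release.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# A minimal, hypothetical WDL 1.0 workflow with a single containerized task.
HELLO_WDL = """\
version 1.0

task say_hello {
  input {
    String name
  }
  command <<<
    echo "Hello, ~{name}"
  >>>
  output {
    String greeting = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow hello {
  input {
    String name
  }
  call say_hello { input: name = name }
  output {
    String greeting = say_hello.greeting
  }
}
"""


def prepare_run(workdir: Path, inputs: Dict[str, str],
                cromwell_jar: str = "cromwell.jar") -> List[str]:
    """Write the workflow and its inputs, and return a Cromwell `run` command.

    `cromwell_jar` is an assumed location of a Cromwell release jar.
    """
    wdl_path = workdir / "hello.wdl"
    inputs_path = workdir / "inputs.json"
    wdl_path.write_text(HELLO_WDL)
    inputs_path.write_text(json.dumps(inputs, indent=2))
    return ["java", "-jar", cromwell_jar, "run",
            str(wdl_path), "--inputs", str(inputs_path)]


# Workflow inputs are keyed by "<workflow name>.<input name>".
with tempfile.TemporaryDirectory() as tmp:
    command = prepare_run(Path(tmp), {"hello.name": "world"})
    print(" ".join(command))
```

The same workflow file could in principle be sent to a different backend by changing Cromwell's configuration rather than the WDL, which is the portability property the paragraph above describes.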

History and Development

Cromwell originated within the Broad Institute to meet the needs of groups such as the Genome Analysis Toolkit team and collaborators on initiatives like the Cancer Genome Atlas and the Clinical Genome Resource. Early development drew on experience with pipeline frameworks used at institutions including Massachusetts General Hospital and the Broad Institute's core sequencing facilities. Contributions have come from developers in the Broad Institute community, enterprise partners at Google, and academic collaborators at universities such as Harvard University and the Massachusetts Institute of Technology. Over successive releases, Cromwell added support for standards endorsed by the Global Alliance for Genomics and Health and integrated with workflow languages standardized by the Open Bioinformatics Foundation and community working groups.

Architecture and Design

Cromwell's architecture separates workflow description parsing, task scheduling, and backend execution, enabling modular integration with container runtimes such as Docker and Singularity and with orchestration platforms such as Kubernetes. The engine is implemented in Scala and runs on the Java Virtual Machine (for example, an OpenJDK runtime), using the Akka library for concurrency. Its backend abstraction permits connectors to resource managers such as LSF, PBS Pro, and HTCondor, while supporting cloud services from providers such as Microsoft Azure and Google Cloud. The design emphasizes fault tolerance, influenced by distributed systems research from groups like Google and Amazon Web Services.
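The backend abstraction can be pictured with a small conceptual sketch. This is not Cromwell's Scala implementation or its configuration format; the `TaskSpec`, `Backend`, `LocalBackend`, `SlurmBackend`, and `plan` names are hypothetical. The idea shown is only that one task description can be translated into different submission commands depending on the backend selected.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaskSpec:
    """Backend-independent description of one task (hypothetical type)."""
    command: str
    cpus: int = 1
    memory_mb: int = 1024


class Backend(ABC):
    """Turns a TaskSpec into an argv list for one execution environment."""

    @abstractmethod
    def submit_command(self, task: TaskSpec) -> List[str]:
        ...


class LocalBackend(Backend):
    def submit_command(self, task: TaskSpec) -> List[str]:
        # Run the task's shell command directly on the local machine.
        return ["/bin/bash", "-c", task.command]


class SlurmBackend(Backend):
    def submit_command(self, task: TaskSpec) -> List[str]:
        # Wrap the command in an sbatch submission with resource requests.
        return [
            "sbatch",
            f"--cpus-per-task={task.cpus}",
            f"--mem={task.memory_mb}M",
            f"--wrap={task.command}",
        ]


# Registry of available backends; the scheduler only sees the Backend interface.
BACKENDS: Dict[str, Backend] = {
    "local": LocalBackend(),
    "slurm": SlurmBackend(),
}


def plan(task: TaskSpec, backend_name: str) -> List[str]:
    """Select a backend by name and build its submission command."""
    return BACKENDS[backend_name].submit_command(task)


task = TaskSpec(command="echo aligned", cpus=4, memory_mb=8192)
print(plan(task, "local"))
print(plan(task, "slurm"))
```

Keeping the scheduler unaware of backend details is what lets new connectors be added without touching parsing or scheduling code.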

Features and Functionality

Cromwell executes workflows written in the Workflow Description Language and supports features such as scatter-gather parallelism, task retries, runtime parameterization, and metadata tracking compatible with data catalogs used by projects like dbGaP and the European Genome-phenome Archive. It integrates container technologies including Docker and supports filesystem backends such as POSIX filesystems, Amazon S3, and network storage common at institutions like the Broad Institute. Workflow provenance capabilities align with standards promoted by the Research Data Alliance and metadata profiles used by consortia like the International Cancer Genome Consortium. The engine exposes APIs that enable integration with web services and portals developed by groups such as Seven Bridges and with workflow registries maintained by initiatives like the Dockstore project.
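Scatter-gather parallelism with task retries can be illustrated in a few lines. The sketch below is a conceptual stand-in using Python's standard thread pool, not how Cromwell schedules work internally; `run_with_retries`, `scatter_gather`, and the deliberately flaky `flaky_length` task are hypothetical names invented for this example. In WDL the equivalent would be a `scatter` block, with retries requested through a runtime attribute such as Cromwell's `maxRetries`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Set, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_retries(task: Callable[[T], R], shard: T, max_retries: int) -> R:
    """Run one shard, re-running it up to `max_retries` times on failure."""
    attempt = 0
    while True:
        try:
            return task(shard)
        except Exception:
            if attempt >= max_retries:
                raise  # Retries exhausted: propagate the failure.
            attempt += 1


def scatter_gather(task: Callable[[T], R], shards: Sequence[T],
                   max_retries: int = 1, workers: int = 4) -> List[R]:
    """Scatter shards across workers, then gather results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda shard: run_with_retries(task, shard, max_retries), shards))


_failed_once: Set[str] = set()


def flaky_length(region: str) -> int:
    """Toy task: fails the first time it sees 'chr2', then succeeds."""
    if region == "chr2" and region not in _failed_once:
        _failed_once.add(region)
        raise RuntimeError("transient failure")
    return len(region)


lengths = scatter_gather(flaky_length, ["chr1", "chr2", "chr10"])
print(lengths)       # [4, 4, 5]
print(sum(lengths))  # 13 (the "gather" step combines per-shard results)
```

Because only the failed shard is re-run, a transient error on one region does not force the other shards to be recomputed.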

Use Cases and Adoption

Adopters include large-scale sequencing centers, clinical genomics laboratories, and research consortia such as the Pediatric Cancer Genome Project, as well as national programs exemplified by Genomics England. Commercial entities in the biotechnology and pharmaceutical sectors have used Cromwell for pipelines supporting assays validated against standards from agencies like the Food and Drug Administration, and in collaborations with platform providers including Terra. Academic adopters include university departments at Stanford University and the University of California, San Francisco, and infrastructure teams at national laboratories such as Lawrence Berkeley National Laboratory, which use it to run reproducible analyses and training curricula.

Performance and Scalability

Cromwell has been benchmarked for throughput on clusters managed by Slurm and cloud platforms such as Google Cloud Platform and Amazon Web Services, demonstrating linear scaling for many embarrassingly parallel workloads typical in sequencing pipelines like variant calling used by the 1000 Genomes Project. Performance characteristics depend on task granularity, container startup times influenced by Docker Hub pulls, and latency introduced by object storage APIs such as Amazon S3. For high-concurrency workloads, deployment patterns leveraging Kubernetes and autoscaling groups on providers like Google Cloud Platform and Microsoft Azure are common to achieve sustained throughput in production environments.
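The dependence on task granularity and per-task overhead can be made concrete with a toy model. The functions and numbers below are hypothetical and are not measurements of Cromwell; the model simply assumes identical tasks that run in "waves" across a fixed number of workers, with each task paying a fixed startup cost (for example, an image pull or object-storage latency).

```python
import math


def estimated_wall_time(n_tasks: int, task_seconds: float,
                        overhead_seconds: float, workers: int) -> float:
    """Wall-clock estimate: tasks run in waves of `workers` at a time."""
    waves = math.ceil(n_tasks / workers)
    return waves * (task_seconds + overhead_seconds)


def efficiency(task_seconds: float, overhead_seconds: float) -> float:
    """Fraction of each task's runtime spent on useful work."""
    return task_seconds / (task_seconds + overhead_seconds)


# 1000 one-minute tasks on 100 workers, with 30 s of startup overhead each.
print(estimated_wall_time(1000, 60, 30, 100))  # 900
print(round(efficiency(60, 30), 3))            # 0.667

# Coarser tasks (100 ten-minute tasks) amortize the same overhead better.
print(estimated_wall_time(100, 600, 30, 100))  # 630
print(round(efficiency(600, 30), 3))           # 0.952
```

Under these assumptions, the same total compute finishes faster with coarser tasks because the fixed startup cost is paid fewer times, which is one reason task granularity matters for throughput.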

Security and Licensing

Cromwell is distributed under the BSD 3-Clause License, a permissive license that allows integration into commercial and academic projects, including proprietary extensions by platform vendors and service providers. Security considerations include controlling access to cloud credentials, for example through systems like HashiCorp Vault, and encrypting data at rest on services such as Google Cloud Storage and Amazon S3. Compliance efforts by adopters often reference frameworks from agencies like the U.S. Department of Health and Human Services and standards bodies such as NIST for handling controlled-access genomic data.

Category:Workflow management systems