Common Workflow Language

Name	Common Workflow Language
Developer	Common Workflow Language Working Group
Initial release	2014
Latest release	1.2.0
Programming language	YAML, JSON, RDF
License	Apache License 2.0

Contents

Overview
History and Development
Design and Components
Execution and Implementations
Use Cases and Adoption
Governance and Community
Criticism and Limitations

The Common Workflow Language (CWL) is an open specification for describing data analysis workflows and the command-line tools they run. By providing a platform-neutral, declarative format, it enables reproducible computational pipelines in bioinformatics, clinical research, and data science. CWL integrates with container technologies, provenance systems, and workflow engines to support portability across high-performance computing clusters, cloud platforms, and laboratory informatics systems.

Overview

CWL defines a YAML- and JSON-based schema for describing command-line tools, their inputs and outputs, and the workflow steps that connect them into complex pipelines. The specification emphasizes interoperability with container formats such as Docker and Singularity, provenance frameworks like PROV, and workflow engines including Toil and Rabix. It targets domains that require reproducibility and portability, such as genomics, as exemplified by projects at the Broad Institute, EMBL-EBI, and the Wellcome Sanger Institute. CWL's design reflects engagement with standards bodies and initiatives such as the Global Alliance for Genomics and Health, the Research Data Alliance, and the Open Bioinformatics Foundation.
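
As an illustration of the kind of description this schema allows, the following minimal sketch builds a CommandLineTool document as a Python dictionary and serializes it to JSON. The tool itself (a wrapper around echo), the input name message, the output name out, and the helper name make_echo_tool are assumptions chosen for the example and are not taken from any particular project.

```python
import json


def make_echo_tool() -> dict:
    """Return a minimal CWL CommandLineTool description as a plain dict.

    Illustrative only: the wrapped program and all field values are
    assumptions made for this example.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        # Program to run; inputs that carry an inputBinding become arguments.
        "baseCommand": "echo",
        "inputs": {
            "message": {
                "type": "string",
                "inputBinding": {"position": 1},
            }
        },
        # Redirect standard output to a file and expose that file as an output.
        "stdout": "message.txt",
        "outputs": {
            "out": {"type": "stdout"},
        },
    }


if __name__ == "__main__":
    # CWL documents can be written in JSON as well as YAML.
    print(json.dumps(make_echo_tool(), indent=2))
```

Saving the printed JSON to a file gives a tool description that a CWL engine could load; the same structure is more commonly written by hand in YAML.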

History and Development

The CWL initiative began in response to fragmentation in the workflow descriptions used by institutions including the Broad Institute, Genomics England, and the European Bioinformatics Institute. Early contributors included engineers from the Broad Institute, Seven Bridges, and academic groups from Stanford University and the University of California, San Diego. Major milestones include community workshops held at conferences such as Intelligent Systems for Molecular Biology (ISMB) and collaborations with projects such as Galaxy, Taverna, and Nextflow. Governance evolved through the participation of organizations such as the Software Sustainability Institute and drew on the community norms of the Apache Software Foundation.

Design and Components

CWL separates tool descriptions from workflow composition, using document formats rooted in YAML and JSON-LD that allow semantic annotations compatible with RDF and the PROV provenance vocabularies. Core components include CommandLineTool, ExpressionTool, and Workflow, each defining typed inputs, outputs, secondary files, and runtime requirements such as resource hints for CPU and memory. CWL supports parameter sweeps and scatter-gather patterns for parallelism, which engines can map onto batch schedulers like SLURM and Grid Engine. The specification interoperates with container registries such as Docker Hub and Quay, and integrates with packaging efforts exemplified by Bioconda and BioContainers.
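
The scatter-gather pattern and resource hints described above can be sketched as a Workflow document, again built as a Python dictionary. This is a minimal sketch under stated assumptions: the step name echo_each, the file name echo-tool.cwl, the input and output names, and the resource values are hypothetical, and the referenced tool is assumed to take a string input called message and produce a file output called out.

```python
import json


def make_scatter_workflow(tool_path: str = "echo-tool.cwl") -> dict:
    """Return a CWL Workflow that runs one tool once per input string.

    Illustrative only: tool_path and every identifier below are
    assumptions made for this example.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        # Scatter is an optional feature that a workflow must declare.
        "requirements": {"ScatterFeatureRequirement": {}},
        "inputs": {"messages": "string[]"},
        "outputs": {
            # Gather: per-item outputs of the scattered step form an array.
            "logs": {"type": "File[]", "outputSource": "echo_each/out"},
        },
        "steps": {
            "echo_each": {
                "run": tool_path,
                # One job is created per element of the "message" step input.
                "scatter": "message",
                "in": {"message": "messages"},
                "out": ["out"],
                # Resource hints: at least 1 core and 256 MiB of RAM per job.
                "hints": {
                    "ResourceRequirement": {"coresMin": 1, "ramMin": 256},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(make_scatter_workflow(), indent=2))
```

Keeping the tool in its own file and referring to it through run reflects the separation of tool description from workflow composition: the same tool file can be reused unchanged by other workflows.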

Execution and Implementations

Multiple execution engines implement the CWL specification, including the cwltool reference implementation, Rabix Executor, Toil, Arvados, and Cromwell via adapters. Cloud providers and platforms that support CWL workflows include Amazon Web Services, Google Cloud Platform, Microsoft Azure, and workflow platforms like Seven Bridges, DNAnexus, and Illumina Connected Analytics. Integrations span high-performance computing centers and institutional clusters managed via SLURM, LSF, and SGE, with provenance capture enabled through frameworks such as PROV and Research Objects. Tool registries and repositories that host CWL descriptors include Dockstore, Bioschemas-driven registries, and Galaxy ToolShed mirrors.
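
To make the execution step concrete, the sketch below prepares the two things an engine typically needs besides the CWL document: a job file giving values for the inputs, and a command line. It assumes the cwltool reference runner is installed and that a tool description with a single string input named message exists; the file names, the results directory, and the helper names are hypothetical. The code only constructs the command and does not run it.

```python
import json
import shlex
from typing import List


def make_job_order(message: str) -> str:
    """Return a JSON job file that supplies a value for the 'message' input."""
    return json.dumps({"message": message}, indent=2)


def cwltool_argv(process_file: str, job_file: str,
                 outdir: str = "results") -> List[str]:
    """Build an argument list of the form: cwltool --outdir DIR PROCESS JOB."""
    return ["cwltool", "--outdir", outdir, process_file, job_file]


if __name__ == "__main__":
    # Hypothetical file names; write the job text to job.json before running.
    print(make_job_order("hello"))
    print(shlex.join(cwltool_argv("echo-tool.cwl", "job.json")))
```

Because the job file is separate from the tool or workflow description, the same description can be rerun with different inputs, or handed to a different engine, without being edited.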

Use Cases and Adoption

CWL is widely used in genomics pipelines for variant calling, RNA-seq, and metagenomics developed by consortia like the 1000 Genomes Project, The Cancer Genome Atlas, and the Human Cell Atlas. Clinical sequencing labs and translational research groups at institutions such as the Broad Institute, EMBL-EBI, and Stanford Medicine use CWL to standardize pipelines for regulatory submission and reproducible science. Other adopters include pharmaceutical companies, public health agencies, and environmental genomics initiatives that require portability between cloud providers and national computing centers such as the European Bioinformatics Institute and the National Center for Biotechnology Information.

Governance and Community

The CWL specification is maintained collaboratively by a community-led working group composed of contributors from academic institutions, commercial vendors, and non-profit organizations. Decision-making is facilitated through public repositories, issue trackers, and community meetings often associated with conferences including ISMB and bioinformatics summits. Funding and institutional support have come from foundations and research councils that sponsor interoperability efforts, while outreach leverages training at universities, workshops hosted by ELIXIR, and documentation contributed by organizations like the Open Bioinformatics Foundation.

Criticism and Limitations

Critics point to the steep learning curve for users unfamiliar with YAML, JSON-LD, or container ecosystems, citing challenges reported by small laboratories and clinical teams. Performance and scaling limitations can arise when mapping CWL semantics to heterogeneous execution backends, a concern noted in comparisons with workflow engines such as Nextflow and Snakemake. The specification’s rigidity in certain typing and serialization rules has led some projects to prefer domain-specific languages developed by commercial vendors or platform-specific abstractions. Nonetheless, ongoing community work aims to address usability, tooling, and backward compatibility.
