CWL — LLMpedia

CWL
Name	CWL
Developer	Common Workflow Language community
Initial release	2014
Latest release	1.2.0
Programming language	YAML, JSON, Python (tooling)
Operating system	Cross-platform
License	Apache License 2.0

Contents

Overview
History
Specification and Components
Implementations and Tools
Use Cases and Adoption
Community and Governance

CWL is an open specification for describing portable, declarative computational workflows and tools. It enables reproducible analysis across diverse compute environments by defining command-line tool descriptions, workflow graphs, and execution metadata that can be interpreted by multiple engines. The specification emphasizes interoperability among projects in bioinformatics, data science, and high-performance computing.

Overview

CWL provides a machine-readable format for describing command-line tools and the workflows that connect them, so that one description can run on container runtimes such as Docker and Singularity and on schedulers and orchestrators such as Apache Mesos, Kubernetes, SLURM, and HTCondor. Tool descriptions can reference container images from registries like Docker Hub or Quay, while workflow graphs connect steps through a shared type system. Major scientific projects and institutions such as the European Bioinformatics Institute, Broad Institute, Wellcome Sanger Institute, National Institutes of Health, European Molecular Biology Laboratory, Genome Canada, Swiss Institute of Bioinformatics, and Los Alamos National Laboratory have evaluated or adopted the specification for portability. CWL documents are expressed in YAML or JSON and are intended to be version-controlled in systems like Git and shared via repositories such as GitHub, GitLab, or Zenodo.
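As an illustration of the JSON form mentioned above, the following sketch builds a minimal CommandLineTool description as a Python dictionary and serializes it. The function name `make_echo_tool`, the output name `echoed`, the file name `echo-output.txt`, and the choice of the `debian:stable-slim` image are assumptions made for this example; they are not taken from the specification or from any particular project.

```python
import json


def make_echo_tool(image="debian:stable-slim"):
    """Return a minimal CWL CommandLineTool description as a plain dict.

    The image name is an illustrative assumption; any image that
    provides `echo` would serve the same purpose.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        # A hint (not a hard requirement): engines may run the tool
        # inside this container image if they support containers.
        "hints": {"DockerRequirement": {"dockerPull": image}},
        "inputs": {
            # One string input, placed as the first command-line argument.
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        # Capture standard output as a File output.
        "outputs": {"echoed": {"type": "stdout"}},
        "stdout": "echo-output.txt",
    }


if __name__ == "__main__":
    # JSON is one of the two serializations the article mentions;
    # the same structure could equally be written as YAML.
    print(json.dumps(make_echo_tool(), indent=2))
```

Saved to a file, a document like this could be handed to a CWL engine together with a value for `message`; how it is executed depends on the engine.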

History

Work on the specification began in 2014 through collaboration among research groups, tool developers, and platform providers. Early contributors included teams from the Broad Institute, Lawrence Berkeley National Laboratory, University of Cambridge, University of Melbourne, US NIH National Library of Medicine, and various open-source communities. The project aligned with broader reproducibility initiatives associated with events like Bioinformatics Open Days, GA4GH discussions, and workshops held at conferences such as ISMB and RECOMB. Over successive versions the specification added support for scatter/gather patterns, secondary-file handling, and inline expressions, while maintaining compatibility with container standards such as those of the Open Container Initiative (OCI).

Specification and Components

The core components of the specification include the "CommandLineTool", "Workflow", and "ExpressionTool" document types. CommandLineTool documents declare inputs, outputs, a base command, and runtime requirements such as a container image (for example, one pulled with Docker) and resource needs, which an engine can map onto resource managers such as SLURM or HTCondor or onto cloud services like Amazon Web Services and Google Cloud Platform. Workflow documents express directed acyclic graphs linking steps and data flow, with scatter/gather semantics akin to the map-reduce patterns used in projects at Lawrence Livermore National Laboratory or Argonne National Laboratory. The type system covers the primitive types null, boolean, int, long, float, double, and string, the File and Directory types, and array, record, and enum types; stdout and stderr are available as shorthand output types for captured streams. secondaryFiles rules keep companion files together with their primary file, for example index files produced by BWA or SAMtools, and can be applied to paired sequence files (e.g., from Illumina instruments). Hints and metadata annotations in CWL documents are also used to link runs to provenance models such as PROV-O and to metadata standards used by archives like ENA and the NCBI Sequence Read Archive.
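The workflow structure described above can be sketched as follows: a small Workflow document, held as a Python dictionary, with one scattered step and one gathering step, plus a helper that derives an execution order from the data-flow links. The step names (`align`, `summarize`), the referenced files `align.cwl` and `summarize.cwl`, and the helper functions are hypothetical, and the dependency logic is a simplification for illustration, not how any particular engine resolves the graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical two-step workflow: `align` runs once per input file
# (scatter), and `summarize` consumes the gathered array of results.
WORKFLOW = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "requirements": {"ScatterFeatureRequirement": {}},
    "inputs": {
        "reads": "File[]",
        # The reference travels with its index as a secondary file.
        "reference": {"type": "File", "secondaryFiles": [".fai"]},
    },
    "outputs": {
        "report": {"type": "File", "outputSource": "summarize/report"},
    },
    "steps": {
        "align": {
            "run": "align.cwl",
            "scatter": "reads_file",
            "in": {"reads_file": "reads", "reference": "reference"},
            "out": ["bam"],
        },
        "summarize": {
            "run": "summarize.cwl",
            "in": {"bams": "align/bam"},
            "out": ["report"],
        },
    },
}


def step_dependencies(workflow):
    """Map each step name to the set of steps whose outputs it consumes."""
    deps = {}
    for name, step in workflow["steps"].items():
        upstream = set()
        for binding in step["in"].values():
            # A binding is either a source string or a dict with "source".
            source = binding.get("source") if isinstance(binding, dict) else binding
            sources = source if isinstance(source, list) else [source]
            for src in sources:
                # "step/output" refers to another step; a bare name
                # refers to a workflow-level input.
                if src and "/" in src:
                    upstream.add(src.split("/", 1)[0])
        deps[name] = upstream
    return deps


def execution_order(workflow):
    """Return one valid step order; raises graphlib.CycleError on cycles."""
    return list(TopologicalSorter(step_dependencies(workflow)).static_order())


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

Because the steps are linked only through their declared sources, an engine is free to run independent steps, and the individual scattered jobs, in parallel; the ordering above is just one schedule consistent with the graph.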

Implementations and Tools

Multiple execution engines and toolchains implement the specification, including engines and platforms from Kitware and Seven Bridges Genomics (Rabix), as well as Arvados, Toil, Cromwell, Nextflow adapter projects, Galaxy, and cloud-native platforms supported by Google Cloud Platform and Amazon Web Services. Authoring and validation tools integrate with editors and CI systems such as JetBrains IDEs, Visual Studio Code, Travis CI, and GitHub Actions. Dockerfile-based build toolchains and registries like Docker Hub and Quay are commonly used to produce and distribute container images. Workflow provenance and visualization tools interface with projects like PROV-O and the Common Workflow Language Viewer, and with services and resources from communities including ELIXIR and Bioconda.

Use Cases and Adoption

The specification is widely used for genomic pipelines (alignment, variant calling, RNA-seq), proteomics workflows, image analysis, and ecology modeling. Large-scale consortia such as the 1000 Genomes Project, the ENCODE Project Consortium, and the International Cancer Genome Consortium, along with national infrastructures in Canada, the United Kingdom, Australia, and Germany, have standardized pipelines using CWL-conformant documents. Research groups at institutions like Stanford University, Massachusetts Institute of Technology, University of Oxford, Harvard University, University of Toronto, and the Max Planck Society use it to ensure reproducibility and portability across HPC clusters and cloud providers. In clinical bioinformatics, pipelines validated for diagnostics reference CWL descriptions to support regulatory compliance with standards from bodies such as the FDA and regional health authorities.

Community and Governance

The specification is developed collaboratively by an open community of contributors from academia, industry, and government labs, with governance organized around a steering group, working groups, and public issue trackers hosted on platforms like GitHub. Outreach and standards alignment involve organizations such as the Global Alliance for Genomics and Health, ELIXIR, the Open Bioinformatics Foundation, and national funding agencies. Regular meetings, conference workshops, and community calls engage contributors from institutions including the Wellcome Sanger Institute, Broad Institute, European Bioinformatics Institute, University of California, San Diego, and University of Cambridge to set roadmap priorities and run interoperability testing events. The specification's licensing under Apache License 2.0 facilitates integration into commercial offerings by companies like Illumina, Thermo Fisher Scientific, Seven Bridges Genomics, and cloud vendors.
