Kepler (scientific workflow system)

Name	Kepler
Developer	University of California, Santa Barbara; Lawrence Berkeley National Laboratory; University of Maryland, Baltimore County
Released	2000s
Programming language	Java
Operating system	Linux, Microsoft Windows, macOS
Platform	Java
Genre	Scientific workflow management system
License	BSD license

Contents

Overview
Architecture and Components
Workflow Design and Execution
Data Provenance and Reproducibility
Use Cases and Applications
Development, Community, and Licensing

Kepler is an open-source scientific workflow management system for designing, executing, and sharing scientific and computational workflows. It combines graphical workflow composition, execution engines, data provenance capture, and extensible component libraries to support research in disciplines such as astronomy, ecology, bioinformatics, climate science, and geoscience. Kepler was developed through collaborations among institutions including the University of California, Santa Barbara, Lawrence Berkeley National Laboratory, and the National Center for Supercomputing Applications, and has been used in projects funded by agencies such as the National Science Foundation and the Department of Energy.

Overview

Kepler belongs to a family of workflow and pipeline systems that also includes Taverna, Pegasus, Galaxy, the Common Workflow Language (CWL), and Apache Airflow, while addressing domain-specific needs from projects like Tropics and the Earth System Grid Federation. The project emphasizes modularity and scientific reproducibility, in the tradition of initiatives supported by the National Institutes of Health and international consortia such as ELIXIR, and offers integration points for resources such as HDF5 data stores and NetCDF archives. Kepler’s user community has included researchers from NASA, NOAA, and the US Geological Survey, as well as academic groups at institutions such as the University of Illinois Urbana–Champaign and the University of Washington.

Architecture and Components

Kepler’s architecture centers on a component-based model influenced by research from Los Alamos National Laboratory and by architectural patterns used in the Globus Toolkit and Apache Hadoop. Core components include a graphical editor adapted from workflows used at Lawrence Livermore National Laboratory, an execution engine that runs on the Java runtime, an actor model inspired by Ptolemy II, and libraries of actors that interface with tools such as R, Python, and MATLAB and with HPC batch schedulers such as SLURM and PBS. Actors are reusable processing steps that exchange data through typed ports, while directors control how and when the actors in a workflow execute. Kepler also includes provenance capture modules comparable to those in VisTrails and Nipype, and supports a plug-in architecture similar to that of Eclipse. Security and authentication in Kepler deployments have been integrated with middleware used by the Open Science Grid and XSEDE.
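The relationship between actors and typed ports can be illustrated with a minimal Python sketch. It is not Kepler's Java API: the `Port` and `Actor` classes, the `fire` method signature, and the `Scale` example are hypothetical names chosen for illustration, and type checking is reduced to a simple `isinstance` test.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Port:
    """A named, typed connection point on an actor (hypothetical class)."""
    name: str
    dtype: type

    def check(self, value: Any) -> Any:
        # Reject tokens whose type does not match the port's declared type.
        if not isinstance(value, self.dtype):
            raise TypeError(f"port {self.name!r} expects {self.dtype.__name__}")
        return value


@dataclass
class Actor:
    """A reusable processing step with typed input ports and one output port."""
    name: str
    inputs: List[Port]
    output: Port
    fire_fn: Callable[..., Any]

    def fire(self, **tokens: Any) -> Any:
        # Validate each input token against its port, run the step,
        # then validate the result against the output port.
        args = {p.name: p.check(tokens[p.name]) for p in self.inputs}
        return self.output.check(self.fire_fn(**args))


# Illustrative actor: multiplies a value by a factor.
scale = Actor(
    name="Scale",
    inputs=[Port("value", float), Port("factor", float)],
    output=Port("result", float),
    fire_fn=lambda value, factor: value * factor,
)
result = scale.fire(value=2.0, factor=3.0)
```

Declaring port types separately from the processing function lets mismatched connections be detected before or during execution, which is the main reason for keeping ports as explicit objects in this sketch.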

Workflow Design and Execution

Users construct workflows by composing actors and connectors on a visual canvas inspired by paradigms from Ptolemy II and Simulink. Kepler supports multiple execution models (dataflow, synchronous/reactive, and control-flow) through directors, comparable to the graph execution models used in Apache Spark and TensorFlow. Workflows can call external services through SOAP and REST endpoints, access repositories such as GitHub and Zenodo, and orchestrate tools including BLAST, GROMACS, and MODFLOW. Execution can target local resources, clusters managed by TORQUE or SLURM, and cloud platforms from providers such as Amazon Web Services and Google Cloud Platform, often mediated by middleware such as Globus or HTCondor.
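As a rough illustration of how a director can impose a dataflow execution order on a graph of actors, the following Python sketch fires each step once, after all of its upstream steps have produced results. It is a simplified stand-in, not Kepler's scheduling logic: the `run_dataflow` function, the `Workflow` type, and the step names are assumptions made for this example, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Each entry maps a step name to (function, list of upstream step names).
# All names here are hypothetical.
Workflow = Dict[str, Tuple[Callable[..., Any], List[str]]]


def run_dataflow(workflow: Workflow) -> Dict[str, Any]:
    """Toy 'director': fire each step once, after all of its upstream steps."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results: Dict[str, Any] = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        fn, deps = workflow[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


workflow: Workflow = {
    "read": (lambda: [3.0, 1.0, 2.0], []),
    "sort": (lambda xs: sorted(xs), ["read"]),
    "mean": (lambda xs: sum(xs) / len(xs), ["read"]),
    "report": (lambda s, m: {"sorted": s, "mean": m}, ["sort", "mean"]),
}
results = run_dataflow(workflow)
```

Because the steps know nothing about scheduling, replacing `run_dataflow` with a different scheduler would change the execution semantics without modifying the steps themselves, which is the separation of concerns that directors provide.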

Data Provenance and Reproducibility

Kepler captures fine-grained provenance information to support reproducibility efforts aligned with initiatives from the Research Data Alliance and with policy frameworks of the European Commission. Provenance records include actor configurations, parameter sets, input and output artifacts, and runtime metadata, comparable to what standards such as W3C PROV describe, which enables integration with systems such as Dataverse and OpenAIRE. Kepler’s provenance facilities have been used in studies published in venues such as the Journal of Open Research Software and at conferences such as the IEEE eScience Conference and ACM SIGMOD, and they complement reproducibility tools developed by groups at the University of California, Berkeley and Stanford University.
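The kind of information listed above can be sketched as a small Python wrapper that runs one step and records its parameters, content digests of its inputs and output, and basic timing. This is only an assumed record layout for illustration: the field names, the `run_with_provenance` function, and the use of SHA-256 over JSON are not taken from Kepler or from the W3C PROV vocabulary.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict


def digest(value: Any) -> str:
    """Stable SHA-256 fingerprint of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_provenance(actor_name: str, fn: Callable[..., Any],
                        inputs: Dict[str, Any],
                        params: Dict[str, Any]) -> Dict[str, Any]:
    """Run one step and return its output together with a provenance record."""
    started = time.time()
    output = fn(**inputs, **params)
    finished = time.time()
    record = {
        "actor": actor_name,                 # which step ran
        "parameters": params,                # configuration used for this run
        "input_digests": {k: digest(v) for k, v in inputs.items()},
        "output_digest": digest(output),     # fingerprint of the result
        "started_at": started,               # runtime metadata
        "duration_s": finished - started,
    }
    return {"output": output, "provenance": record}


run = run_with_provenance(
    "Scale",
    lambda values, factor: [v * factor for v in values],
    inputs={"values": [1, 2, 3]},
    params={"factor": 2},
)
```

Recording digests rather than full copies of the data keeps the record small while still allowing a later rerun to be checked: identical inputs and parameters should reproduce the same output digest.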

Use Cases and Applications

Kepler has been applied in astronomy to process data from observatories associated with the National Radio Astronomy Observatory, and in remote sensing to analyze data from satellites operated by the European Space Agency and NASA's Jet Propulsion Laboratory. In bioinformatics, it has orchestrated sequence analysis pipelines that integrate NCBI resources and tools funded by the National Human Genome Research Institute. Environmental science projects at the Smithsonian Institution and the Scripps Institution of Oceanography have used Kepler for ecosystem modeling and sensor-network data fusion, while hydrology groups at the US Geological Survey have used it to integrate models such as MODFLOW and SWAT (Soil and Water Assessment Tool). Kepler workflows have also supported computational models in the social sciences linked to datasets from ICPSR, as well as international collaborations coordinated through UNESCO programs.

Development, Community, and Licensing

Kepler’s development has been coordinated by academic and national laboratory contributors, including teams at the University of California, Santa Barbara, Lawrence Berkeley National Laboratory, and the University of Maryland, Baltimore County, together with collaborators from the National Center for Supercomputing Applications and Lawrence Livermore National Laboratory. The project uses community-driven governance similar to that of Apache Software Foundation projects, with mailing lists, workshops at conferences such as the American Geophysical Union (AGU) Fall Meeting, and code contributions tracked in repositories on hosting platforms similar to GitHub and Bitbucket. Kepler is distributed under the permissive BSD license, which encourages reuse by universities, national laboratories, startups, and companies in sectors represented by the research labs of Siemens, Boeing, and General Electric. Community training and materials have been presented at venues such as the International Supercomputing Conference and integrated into curricula at institutions such as Carnegie Mellon University and the Massachusetts Institute of Technology.
