| PROV (W3C) | |
|---|---|
| Name | PROV |
| Title | PROV (W3C) |
| Developer | World Wide Web Consortium |
| Released | 2013 |
| Latest release | 2013 Recommendation |
| Platform | Cross-platform |
| License | W3C Document License |
PROV is a family of W3C Recommendations for representing provenance information about entities, activities, and agents. It provides an interoperable data model and serializations designed for tracking the origin, derivation, and attribution of digital artifacts across workflows, archives, and data ecosystems. The specifications are intended to enable reproducible research, accountable publishing, and auditable processing in technical, institutional, and archival contexts.
PROV establishes a structured provenance vocabulary and a formal model that grounds practical provenance capture in core Web technologies such as the Hypertext Transfer Protocol, the Extensible Markup Language, and the Resource Description Framework. It builds on conceptual work from the Open Provenance Model, the Dublin Core Metadata Initiative, ISO standards, and Semantic Web research, as well as implementation experience at institutions such as CERN, NASA, the European Bioinformatics Institute, and the National Institutes of Health. PROV defines classes and relations that connect provenance notions used by Linux Foundation projects, Apache Software Foundation tools, and scholarly infrastructures such as Crossref and ORCID.
The development of PROV was driven by W3C working groups and contributors from organizations including Google, IBM, Microsoft, Siemens, and Oracle. Influences included earlier provenance efforts such as the Open Provenance Model and projects at Los Alamos National Laboratory and Sandia National Laboratories. The Working Group drew on use cases from European Commission research programs, National Science Foundation data management plans, and collaborations with DataCite and JSTOR. PROV achieved W3C Recommendation status following review processes that involved standards bodies like IETF and coordination with ISO/IEC committees.
PROV's data model centers on three primary concepts, Entity, Activity, and Agent, connected by relations such as wasGeneratedBy, used, wasDerivedFrom, wasAssociatedWith, and actedOnBehalfOf. These map to formal graph structures comparable to RDF Schema and draw on knowledge-representation research from the Semantic Web community. The model supports bundles for grouping provenance statements, roles for assigning responsibilities (applied, for example, by United Nations bodies and projects funded by Horizon 2020), and identifiers interoperable with systems such as the Handle System, Digital Object Identifiers, and the International Standard Name Identifier. PROV also formalizes temporal and attributional metadata compatible with timestamp standards from the Internet Engineering Task Force and with compliance regimes shaped by the General Data Protection Regulation.
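The core model above can be sketched as a small labeled graph in plain Python. This is an illustrative sketch only, not one of the W3C serializations or the `prov` library; all identifiers (`ex:chart`, `ex:plotting`, and so on) are hypothetical examples.

```python
# Entities, activities, and agents are identified nodes; PROV relations
# such as wasGeneratedBy and used are directed, labeled edges between them.
entities = {"ex:chart", "ex:dataset"}
activities = {"ex:plotting"}
agents = {"ex:alice"}

# Each relation is a (subject, relation-name, object) triple.
relations = [
    ("ex:chart", "wasGeneratedBy", "ex:plotting"),
    ("ex:plotting", "used", "ex:dataset"),
    ("ex:chart", "wasDerivedFrom", "ex:dataset"),
    ("ex:plotting", "wasAssociatedWith", "ex:alice"),
]

def generated_by(entity_id, rels):
    """Return the activities recorded as generating the given entity."""
    return [o for s, p, o in rels if s == entity_id and p == "wasGeneratedBy"]

print(generated_by("ex:chart", relations))  # ['ex:plotting']
```

Queries like `generated_by` illustrate why the model is graph-shaped: answering "where did this artifact come from?" is a traversal over the relation edges.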
The PROV family specifies multiple syntaxes: PROV-N as a human-readable notation, PROV-XML as an XML encoding, PROV-O as an OWL 2 ontology for RDF, and PROV-JSON for JavaScript ecosystems. These serializations enable integration with technologies and platforms such as Apache Hadoop, Apache Spark, Kubernetes, Docker, and databases like PostgreSQL and MongoDB. Mappings to JSON-LD and Turtle facilitate linkage with ecosystems around Schema.org, Wikidata, and Europeana. Interoperability testing drew on code hosted on platforms such as GitHub and Bitbucket and on collections at the Internet Archive.
Multiple open-source and commercial implementations support PROV, including libraries and platforms from communities like Apache Software Foundation and vendors like Microsoft Research and IBM Research. Notable tools include provenance capture integrated into workflow systems such as Apache Airflow, Galaxy (bioinformatics), Taverna, and Nextflow, and visualization utilities used by National Institutes of Health and European Molecular Biology Laboratory. Repositories and registries adopt PROV-compatible metadata via connectors to Dataverse, Zenodo, and institutional repositories at universities like Harvard University and Stanford University.
PROV has been applied to reproducible science initiatives supported by National Science Foundation, clinical informatics projects in institutions like Mayo Clinic and Massachusetts General Hospital, digital preservation efforts at Library of Congress and British Library, and research data management at consortia such as ELIXIR and Global Biodiversity Information Facility. It underpins auditing and compliance workflows in enterprises including Goldman Sachs, supply chain traceability pilots influenced by World Economic Forum, and scholarly publishing workflows used by Elsevier and Springer Nature. PROV also assists provenance-aware machine learning pipelines in collaborations involving Google DeepMind, OpenAI, and academic labs at MIT and UC Berkeley.
PROV is maintained within the W3C standards process and coordinated with related specifications such as Web Ontology Language, Resource Description Framework, SPARQL Protocol and RDF Query Language, and identity standards like OAuth. Its governance model reflects practices from other W3C Recommendations and liaises with organizations including ISO, IETF, and research infrastructures funded by European Research Council. Extensions and profiles of the model have been proposed in contexts like Open Geospatial Consortium and health data standards influenced by Health Level Seven International.
Category:World Wide Web Consortium standards