PROV-O — LLMpedia

PROV-O is an ontology, published as a W3C Recommendation, for representing provenance information about digital artifacts, agents, and activities. It enables interoperability across systems by providing a formal vocabulary for expressing the origin of, responsibility for, and derivation of resources used by institutions such as the Library of Congress, the United Nations, the European Commission, the National Institutes of Health, and NASA. Adopted in contexts ranging from archival workflows at the British Library to research data management at Harvard University, PROV-O underpins provenance tracking in projects involving the World Health Organization, the World Bank, the International Monetary Fund, the European Space Agency, and the Smithsonian Institution.

Overview

PROV-O is an OWL ontology that formalizes provenance using terms that interoperate with standards such as RDF Schema, SPARQL, JSON-LD, and XML Schema. It complements other initiatives from standards bodies such as the W3C, ISO, IEEE, the Open Geospatial Consortium, and OASIS. It supports the archival chains of custody used by the National Archives and Records Administration, the citation practices followed by Crossref and DataCite, and the reproducibility requirements of journals such as Nature, Science, and PLOS. PROV-O facilitates audit trails in systems ranging from GitHub repositories to institutional repositories at MIT, Stanford University, Yale University, Columbia University, and the University of Oxford.

The ontology defines core classes and properties that map to provenance constructs referenced by projects at CERN (the European Organization for Nuclear Research), including the Large Hadron Collider, as well as by the Human Genome Project and the Allen Institute for Brain Science. The core classes are Activity, Entity, and Agent, which are used in workflows at the NASA Jet Propulsion Laboratory, NOAA, the European Southern Observatory, the Max Planck Society, and Lawrence Berkeley National Laboratory. Properties express relations between instances of these classes, such as wasGeneratedBy, used, wasDerivedFrom, and wasAttributedTo, and are applied in datasets hosted by Dryad, Figshare, Zenodo, Gene Expression Omnibus, and the Protein Data Bank. The vocabulary interoperates with identifiers from ORCID, DOI, the Handle System, ISBN, ISSN, PubMed, and Crossref to support attribution and citation tracking for outputs produced by The Royal Society, the American Chemical Society, IEEE, ACM, and Springer Nature.
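
The core classes and properties named above can be illustrated with a minimal sketch that stores provenance statements as plain (subject, predicate, object) triples. The PROV namespace IRI is the one published by the W3C; everything under `http://example.org/` (the files, the `cleaning` activity, the agent `alice`) and the helper functions `objects` and `lineage` are hypothetical names chosen for illustration, and a real system would normally use an RDF library rather than tuples.

```python
# Minimal sketch: PROV-O statements as (subject, predicate, object) tuples.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

triples = {
    # Typing statements: the three core classes.
    (EX + "raw.csv", RDF_TYPE, PROV + "Entity"),
    (EX + "clean.csv", RDF_TYPE, PROV + "Entity"),
    (EX + "report.pdf", RDF_TYPE, PROV + "Entity"),
    (EX + "cleaning", RDF_TYPE, PROV + "Activity"),
    (EX + "alice", RDF_TYPE, PROV + "Agent"),
    # Relations between them.
    (EX + "cleaning", PROV + "used", EX + "raw.csv"),
    (EX + "clean.csv", PROV + "wasGeneratedBy", EX + "cleaning"),
    (EX + "clean.csv", PROV + "wasDerivedFrom", EX + "raw.csv"),
    (EX + "report.pdf", PROV + "wasDerivedFrom", EX + "clean.csv"),
    (EX + "report.pdf", PROV + "wasAttributedTo", EX + "alice"),
}


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples with this subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def lineage(graph, entity):
    """Follow chains of prov:wasDerivedFrom links starting at `entity`.

    This traversal is an application-level choice made for this sketch,
    not an inference rule taken from the ontology.
    """
    seen = []
    stack = [entity]
    while stack:
        current = stack.pop()
        for source in objects(graph, current, PROV + "wasDerivedFrom"):
            if source not in seen:
                seen.append(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    for source in lineage(triples, EX + "report.pdf"):
        print(source)
```

Calling `lineage(triples, EX + "report.pdf")` walks back from the report to `clean.csv` and then to `raw.csv`, while `objects(triples, EX + "report.pdf", PROV + "wasAttributedTo")` answers the attribution question for the same entity.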

PROV-O data is commonly written in RDF serializations such as Turtle, N-Triples, RDF/XML, and JSON-LD, which are used by platforms such as Wikidata, DBpedia, Europeana, the Digital Public Library of America, and BNE (Biblioteca Nacional de España). Implementers run SPARQL queries against triplestores including Apache Jena, Virtuoso, Stardog, Blazegraph, and GraphDB to extract provenance graphs for projects at the European Bioinformatics Institute, the Broad Institute, the Sanger Institute, the Wellcome Trust, and Gavi, the Vaccine Alliance. Integration with linked data vocabularies such as Schema.org, FOAF, Dublin Core, DCAT, and SKOS is common in information systems at WorldCat, OCLC, JSTOR, Project Gutenberg, and the Internet Archive.
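
To make the serialization and query steps concrete, the following sketch writes a few PROV-O statements as N-Triples and then filters them with a single triple pattern, which is roughly what a basic SPARQL query does. It uses only the Python standard library; the `http://example.org/` resources and the helpers `to_ntriples` and `match` are hypothetical, and the serializer assumes IRI-only triples with no characters that need escaping. A production system would use an RDF library and one of the triplestores listed above.

```python
# Minimal sketch: N-Triples output and a triple-pattern lookup.
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

triples = [
    (EX + "clean.csv", PROV + "wasGeneratedBy", EX + "cleaning"),
    (EX + "cleaning", PROV + "used", EX + "raw.csv"),
    (EX + "clean.csv", PROV + "wasAttributedTo", EX + "alice"),
]


def to_ntriples(graph):
    """Serialize IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in graph)


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None plays the role of a variable.

    Roughly comparable to the SPARQL query:
        PREFIX prov: <http://www.w3.org/ns/prov#>
        SELECT ?e ?a WHERE { ?e prov:wasGeneratedBy ?a }
    when called as match(graph, p=PROV + "wasGeneratedBy").
    """
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    print(to_ntriples(triples))
    print(match(triples, p=PROV + "wasGeneratedBy"))
```

Keeping the pattern matcher separate from the serializer mirrors the usual split between storing provenance in an exchange format and querying it through an endpoint.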

PROV-O supports reproducible research pipelines built with tools such as ReproZip, the Galaxy Project, Nextflow, Snakemake, and CWL (Common Workflow Language), as well as provenance capture in laboratory environments at Cold Spring Harbor Laboratory and the Salk Institute. In cultural heritage, institutions such as the Metropolitan Museum of Art, the Louvre, the Vatican Museums, Tate Modern, and the Rijksmuseum use provenance metadata to document conservation and acquisition histories. In journalism and fact-checking, organizations such as the BBC, The New York Times, The Guardian, ProPublica, and the Associated Press employ provenance traces for source verification. In legal and compliance settings, bodies such as the European Court of Human Rights, the International Criminal Court, the US Securities and Exchange Commission, and the World Trade Organization use provenance records for chain-of-custody and audit reporting.

The tooling ecosystem includes provenance libraries and tools such as ProvToolbox, ProvStore, and ProvViz, along with plugins for Protégé, used by researchers at the University of Manchester, the University of Edinburgh, Tsinghua University, Peking University, and the National University of Singapore. Commercial platforms integrate PROV-O with enterprise systems from Microsoft, Google, Amazon Web Services, IBM, and Oracle for cloud provenance in services used by Deutsche Bank, Goldman Sachs, JPMorgan Chase, Citibank, and HSBC. Data platforms, workflow engines, and container infrastructure such as Apache Airflow, Kubernetes, Docker, Hadoop, and Spark adopt provenance models for data lineage in projects at Netflix, Spotify, Facebook, Twitter, and LinkedIn.

PROV-O originated in the W3C Provenance Working Group, which drew on earlier provenance models and standards from communities including the Dublin Core Metadata Initiative, the Open Archives Initiative, and the International Organization for Standardization, and on research projects funded by the European Research Council and the National Science Foundation. Key design discussions involved contributors from MITRE Corporation, the University of Southampton, the University of Oxford, Rensselaer Polytechnic Institute (RPI), and Los Alamos National Laboratory. The Recommendation was published alongside companion documents in the PROV family, such as the PROV-DM data model, and use case reports informing adoption by consortia such as the Research Data Alliance, OpenAIRE, ELIXIR, EOSC (European Open Science Cloud), and GO FAIR.

Category:Ontology