CERN Open Metadata

Name	CERN Open Metadata
Developer	CERN
Released	2019
Programming language	Python, Java
Operating system	Linux
License	MIT License
Website	CERN

Contents

Overview
History and Development
Architecture and Data Model
Governance and Access Policies
Use Cases and Applications
Integration with CERN Services and Tools
Community and Contributions

CERN Open Metadata is a metadata catalog and governance platform designed for scientific datasets, digital assets, and research artifacts at CERN. It supports discovery, lineage, and stewardship across experiments such as ATLAS, CMS, ALICE, and LHCb, and integrates with storage infrastructures such as EOS and CERNBox. The project aligns with open data initiatives at the European Organization for Nuclear Research (CERN) and the European Open Science Cloud (EOSC), and involves collaboration with Invenio developers and other research organizations.

Overview

CERN Open Metadata provides a centralized metadata registry that connects the datasets, publications, software, and services used by experiments such as ATLAS and CMS, and links them to repositories such as Zenodo, GitHub, and InvenioRDM. It exposes APIs compatible with standards from DataCite, Schema.org, and the W3C (including PROV), and integrates persistent identifiers such as DOI, ORCID, ROR, and the Handle System. The platform supports curation by groups including CERN IT and the CERN Open Data team, and works with projects such as REANA and Indico.
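As an illustration of how a client might sanity-check identifiers before attaching them to a record, the following minimal sketch validates an ORCID iD using its published ISO 7064 MOD 11-2 check character and applies a loose shape test to a DOI. The function names are hypothetical and are not part of any documented CERN Open Metadata API; the DOI pattern is a heuristic assumption, not a resolver check.

```python
import re

# Heuristic DOI shape: "10.", a registrant code, a slash, and a suffix.
# This is an assumed plausibility test only; it does not resolve the DOI.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 ORCID digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the hyphenated ORCID iD has 16 characters and a correct check character."""
    compact = orcid.replace("-", "")
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


def is_plausible_doi(doi: str) -> bool:
    """Return True if the string looks like a bare DOI (no 'doi:' or URL prefix)."""
    return DOI_PATTERN.match(doi) is not None
```

A caller could reject a contributor entry when `is_valid_orcid` returns False, and normalize or flag a DOI that fails `is_plausible_doi` for manual curation.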

History and Development

CERN Open Metadata emerged from metadata and preservation work at CERN, in response to requirements from OpenAIRE and EOSC pilots and to European Commission recommendations on research data. Early efforts drew on expertise from the CERN Open Data Portal, on provenance experiments linked to ALICE preservation initiatives, and on LHCb data management. Development involved contributors from CERN IT, the Software Preservation Network, Helm, and international partners such as DESY, INRIA, KIT, and Fermilab. Releases have been timed to coincide with community events such as Research Data Alliance (RDA) meetings, the International Conference on Digital Preservation, and workshops sponsored by ERC programs.

Architecture and Data Model

The architecture combines microservices, search, graph stores, and authentication layers, built on components such as Elasticsearch, Neo4j, PostgreSQL, Docker, and Kubernetes. Metadata records follow schemas influenced by the DataCite Metadata Schema, Dublin Core elements, and PROV-O, which allows relationships to be expressed between entities such as experiments, datasets, publications, software, and people (linked via ORCID). Authentication and authorization integrate with CERN Single Sign-On, OAuth 2.0, and SAML federations including eduGAIN. The system supports export through OAI-PMH and in JSON-LD and RDF formats, to facilitate harvesting by services such as OpenAIRE and aggregators such as Crossref and Europe PMC.
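To make the data model more concrete, the sketch below assembles a JSON-LD dataset record that uses Schema.org terms for description, ORCID iDs for people, and the PROV-O `wasDerivedFrom` property for lineage. It is an assumption-laden illustration: the function name, the field selection, and the placeholder DOIs (under the 10.5072 test prefix) are hypothetical and do not reproduce the platform's actual schema.

```python
import json


def build_dataset_record(title, doi, creators, derived_from=()):
    """Build a JSON-LD dict for a dataset (hypothetical layout, not the real schema).

    creators is an iterable of (name, orcid) pairs; derived_from is an
    iterable of DOIs of source datasets.
    """
    record = {
        "@context": {
            "@vocab": "https://schema.org/",
            "prov": "http://www.w3.org/ns/prov#",
        },
        "@type": "Dataset",
        "@id": f"https://doi.org/{doi}",
        "name": title,
        "identifier": doi,
        "creator": [
            {"@type": "Person", "@id": f"https://orcid.org/{orcid}", "name": name}
            for name, orcid in creators
        ],
    }
    if derived_from:
        # Lineage is expressed with the PROV-O wasDerivedFrom property.
        record["prov:wasDerivedFrom"] = [
            {"@id": f"https://doi.org/{src}"} for src in derived_from
        ]
    return record


# Placeholder values for illustration only.
example = build_dataset_record(
    title="Example derived dataset",
    doi="10.5072/example-derived",
    creators=[("Josiah Carberry", "0000-0002-1825-0097")],
    derived_from=["10.5072/example-raw"],
)
print(json.dumps(example, indent=2))
```

Keeping lineage as links to other record identifiers, rather than as embedded copies, is one way such a record could map onto a graph store, with each `@id` becoming a node and each `prov:wasDerivedFrom` entry an edge.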

Governance and Access Policies

Governance is overseen by stakeholders including the office of the CERN Director-General, the CERN Scientific Information Service, and experiment data management boards such as those of ATLAS and CMS. Policy development references mandates from the European Commission and Horizon 2020, and from funders such as the ERC and the Wellcome Trust. Access controls reconcile open datasets on the CERN Open Data Portal with restricted premium datasets, which are managed under experiment-specific rules endorsed by collaborations such as the ALICE Collaboration and the LHCb Collaboration. Metadata licensing often uses Creative Commons terms, together with machine-actionable policies tied to identifiers such as DOIs.
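The distinction between open and experiment-restricted records could be expressed as a machine-actionable policy along the following lines. This is a minimal sketch under stated assumptions: the `RecordPolicy` fields, the "open" and "restricted" values, and the membership rule are all hypothetical and are not taken from the platform's real access model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordPolicy:
    """Hypothetical access policy attached to a metadata record."""
    access: str            # assumed values: "open" or "restricted"
    experiment: str = ""   # owning experiment for restricted records
    license: str = ""      # e.g. a Creative Commons identifier


def can_read(policy: RecordPolicy, user_experiments: frozenset = frozenset()) -> bool:
    """Open records are readable by anyone; restricted ones only by experiment members."""
    if policy.access == "open":
        return True
    return policy.experiment in user_experiments


open_policy = RecordPolicy(access="open", license="CC0-1.0")
restricted_policy = RecordPolicy(access="restricted", experiment="ALICE")
```

For example, `can_read(restricted_policy, frozenset({"ALICE"}))` grants access to a member of the owning experiment, while the same call with an empty membership set denies it.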

Use Cases and Applications

CERN Open Metadata supports the discovery of analysis-level datasets by researchers from institutions such as MIT, the University of Oxford, École Normale Supérieure, and the University of Tokyo. It enables reproducible workflows in environments such as the CERN Open Data Portal and REANA, and aids curators from libraries and archives including the Bibliothèque nationale de France and the British Library. It also powers dashboards for project managers at CERN, provenance tracking for publications indexed by INSPIRE-HEP, and semantic linking in knowledge graphs used by tools from the Invenio teams and in collaborations with Zenodo.

Integration with CERN Services and Tools

Integrations connect metadata to storage systems such as EOS, identity services such as CERN Single Sign-On, publication systems including INSPIRE-HEP, and repositories such as Zenodo and InvenioRDM. The platform interfaces with workflow systems including REANA, job schedulers such as HTCondor, and container registries such as Docker Hub and GitLab. Visualization and analysis tools such as ROOT, Jupyter Notebook, and RStudio can consume metadata for reproducibility and citation, while catalog feeds support services such as OpenAIRE and institutional portals such as that of the CERN Library.

Community and Contributions

Community contributors include researchers from experiments such as ATLAS, CMS, and ALICE; data stewards from institutions such as DESY and Fermilab; and developers from projects including Invenio and Zenodo. Governance and roadmap discussions take place in forums such as the RDA, OpenAIRE task forces, and working groups tied to EOSC. Contributions follow open-source practices common to projects hosted on platforms such as GitHub. Engagement includes training events held at CERN and participation in conferences including ICLR and NeurIPS for technical cross-pollination. Ongoing collaborations extend to national libraries, research infrastructures, and funding bodies and programs such as the ERC and Horizon Europe.
