Open Citation Corpus

Name	Open Citation Corpus
Type	Research infrastructure
Founded	2010s
Founder	OpenCitations
Headquarters	Scholarly community
Mission	Promote open scholarly citation data

Open Citation Corpus is an open scholarly citation aggregation initiative that collects, normalizes, and distributes citation links extracted from academic publications. It operates alongside services such as CrossRef, PubMed, arXiv, Microsoft Academic, and Scopus, and interfaces with repositories such as Zenodo, Figshare, Dryad, and Dataverse to improve the discoverability and reuse of citation data. The project collaborates with institutions such as the Wellcome Trust, the National Institutes of Health, the European Research Council, and the Library of Congress, and aligns with movements including Open Access, Creative Commons, Plan S, and the FAIR data principles.

History

The corpus emerged amid efforts by organizations including OpenCitations, DataCite, CrossRef, SPARC Europe, and the Directory of Open Access Journals, following debates over citation transparency that involved Elsevier, Springer Nature, Wiley-Blackwell, IEEE, and Taylor & Francis. Early development drew on tools and datasets from projects such as OpenAIRE, PubMed Central, CORD-19, and bioRxiv, and engaged scholars from the University of Oxford, Harvard University, Stanford University, and Imperial College London. Milestones included integrations with services run by Digital Science, Clarivate Analytics, and Google Scholar, as well as initiatives funded by bodies such as UK Research and Innovation, Horizon 2020, the National Science Foundation, and the European Commission. Governance and outreach involved workshops at venues including the International Conference on Scientific and Technical Information, OpenCon, Force11, and the Wikimedia Conference.

Content and coverage

The corpus aggregates citation links from publishers and repositories such as PubMed Central, PLOS, arXiv, Springer Nature, Elsevier, and Wiley-Blackwell and indexes metadata attributes used by CrossRef, DataCite, ORCID, DOI, and ISSN systems. Content covers bibliographic entities represented in taxonomies from Library of Congress, BIBFRAME, ORCID, and MeSH and spans disciplines and journals like Nature, Science, The Lancet, Cell, and Proceedings of the National Academy of Sciences. The dataset includes relationships extracted for works deposited in repositories such as Zenodo, Figshare, Dryad, and institutional archives at Massachusetts Institute of Technology, University of Cambridge, Max Planck Society, and CNRS.

Data model and standards

The corpus models citations using identifiers and schemas maintained by the CrossRef, DataCite, ORCID, DOI, and ISSN registries, and adopts semantic frameworks including RDF, JSON-LD, Schema.org, and BIBFRAME. Interoperability is achieved through mappings to W3C and Dublin Core vocabularies and through SPARQL and Linked Data technologies developed alongside initiatives such as the Semantic Web, OpenAIRE, and Europeana. The project aligns its provenance descriptions with the PROV-O ontology, follows metadata quality practices advocated by NISO, and uses persistent identifier strategies like those of the Handle System and CrossRef Event Data.
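As a rough illustration of the identifier-based modeling described above, the following Python sketch normalizes DOI strings and expresses one citation link as a JSON-LD node. It is a minimal sketch under stated assumptions, not the corpus's actual schema: the function names and the example DOIs are hypothetical, and the choice of the `cito:cites` property from the CiTO ontology is an assumption made for illustration.

```python
import json

# Assumption: CiTO (Citation Typing Ontology) is used here for the "cites" relation.
CITO = "http://purl.org/spar/cito/"

# Common ways a DOI is written before normalization.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return a bare, lowercase DOI (DOIs are case-insensitive)."""
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def citation_to_jsonld(citing_doi: str, cited_doi: str) -> dict:
    """Build a JSON-LD node stating that one DOI-identified work cites another."""
    return {
        "@context": {"cito": CITO},
        "@id": "https://doi.org/" + normalize_doi(citing_doi),
        "cito:cites": {"@id": "https://doi.org/" + normalize_doi(cited_doi)},
    }


# Hypothetical DOIs, used only to show the shape of the output.
node = citation_to_jsonld("doi:10.1234/EXAMPLE.citing", "https://doi.org/10.1234/example.cited")
print(json.dumps(node, indent=2))
```

Normalizing before building identifiers keeps variant spellings of the same DOI from producing distinct nodes in the resulting graph.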

Access and licensing

Access mechanisms mirror infrastructures operated by CrossRef, DataCite, Zenodo, and GitHub, and provide bulk and API access compatible with services from Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Licensing follows open patterns encouraged by Creative Commons, including Creative Commons Attribution (CC BY) and the Creative Commons Public Domain Dedication (CC0), and respects mandates from funders and policies such as the Wellcome Trust, Horizon 2020, the NIH Public Access Policy, and Plan S. Distribution practices draw on archival strategies used by LOCKSS, CLOCKSS, and Portico to ensure persistence and reusability across infrastructures managed by the European Open Science Cloud and national libraries such as the British Library.

Applications

Researchers and institutions use the corpus for bibliometric analyses with tools such as VOSviewer, Gephi, CiteSpace, and NetworkX, including in studies funded by the National Science Foundation, the European Research Council, and the Wellcome Trust. Publishers and platforms such as CrossRef, Scopus, Clarivate Analytics, Google Scholar, and Dimensions use the data to enhance discovery, citation indexing, and impact assessment tied to awards such as the Nobel Prize, Fields Medal, and Turing Award. Policy analysts apply the corpus to evaluate compliance with Plan S, track Open Access uptake in repositories including PubMed Central and arXiv, and inform stakeholders including Research Councils UK, the European Commission, and UNESCO. Digital humanists combine corpus links with collections from the British Library, the Library of Congress, and Europeana for citation network studies connecting works by Charles Darwin, Isaac Newton, Marie Curie, and Alan Turing.
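The simplest form of the bibliometric analysis mentioned above is counting how often each work is cited. The sketch below does this with the Python standard library only. It assumes citation links are available as (citing, cited) identifier pairs; the function names and the single-letter identifiers are hypothetical placeholders, not data from the corpus.

```python
from collections import Counter, defaultdict


def build_citation_graph(links):
    """Map each citing work to the set of works it cites.

    Duplicate links collapse into one edge and self-citations of a work
    to itself are skipped.
    """
    graph = defaultdict(set)
    for citing, cited in links:
        if citing != cited:
            graph[citing].add(cited)
    return graph


def citation_counts(graph):
    """Count incoming citations (in-degree) for every cited work."""
    counts = Counter()
    for cited_works in graph.values():
        counts.update(cited_works)
    return counts


# Hypothetical links: one duplicate (A -> C) and one self-loop (C -> C).
links = [("A", "C"), ("B", "C"), ("A", "B"), ("A", "C"), ("C", "C")]
graph = build_citation_graph(links)
counts = citation_counts(graph)
print(counts.most_common(2))  # [('C', 2), ('B', 1)]
```

The same edge list could be loaded into a graph library such as NetworkX for richer measures; the plain dictionary form is used here only to keep the example dependency-free.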

Governance and sustainability

Stewardship involves collaboration among organizations such as OpenCitations, CrossRef, DataCite, ORCID, and NISO, together with academic partners at the University of Oxford, University of Cambridge, Harvard University, and Stanford University. Community governance processes mirror models used by the W3C, the Apache Software Foundation, Creative Commons, and GitHub, and include advisory input from funders such as the Wellcome Trust, Horizon 2020, the National Institutes of Health, and UK Research and Innovation. Technical maintenance relies on continuous integration and versioning workflows familiar from projects hosted on GitHub, with releases archived via Zenodo and mirrored through infrastructure provided by the European Open Science Cloud and institutional repositories at the Max Planck Society and CNRS.
