OpenAlex — LLMpedia

OpenAlex
Name	OpenAlex
Type	Bibliographic database
Country	United States
Launched	2021
Predecessor	Microsoft Academic Graph
License	CC0 (data)

Contents

Overview
History and Development
Data Model and Content
Access and APIs
Usage and Applications
Governance and Licensing

OpenAlex is an open catalog of scholarly works, authors, venues, institutions, and concepts designed to provide comprehensive, machine-readable metadata for research. It aggregates bibliographic records, citation links, author profiles, venue metadata, institution identifiers, and concept taxonomies to support discovery, analysis, and infrastructure in scholarly communications. The project is used by researchers, libraries, publishers, funders, and startups to power recommender systems, bibliometrics, and scholarly search.

Overview

OpenAlex indexes millions of journal articles, conference papers, books, datasets, and preprints across disciplines and links them to structured records for authors, journals, publishers, repositories, and subject concepts. Major indexed sources and partners include Crossref, PubMed, PubMed Central, arXiv, Wikidata, Zenodo, DataCite, and legacy data from Microsoft Academic Graph. The database maps metadata to persistent identifiers such as DOI, ORCID, ISSN, ISBN, and ROR. Tools and infrastructure used with the data include Jupyter Notebook workflows, Elasticsearch indices, Kubernetes deployments, and Amazon Web Services, alongside research platforms such as Zotero and Hypothesis.
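Matching records across sources by persistent identifier usually requires normalizing those identifiers first. The following is a minimal sketch, not OpenAlex's own code: the helper names `normalize_doi` and `is_valid_orcid` are hypothetical, the prefix list is an assumption about common input variants, and the ORCID check uses the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers.

```python
import re

# Common ways a DOI is written in source metadata (assumed, not exhaustive).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def normalize_doi(raw: str) -> str:
    """Strip a resolver prefix and lowercase the DOI (DOIs are case-insensitive)."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[-1] == expected
```

Normalizing before comparison means that `https://doi.org/10.1000/ABC123` and `doi:10.1000/abc123` (made-up DOIs) resolve to the same key, which is what makes deduplication across sources workable.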

History and Development

The project was launched after the discontinuation of Microsoft Academic Graph and builds on work by organizations and contributors across the scholarly infrastructure ecosystem. Early development drew on Crossref metadata curation practices, on citation-linking methods exemplified by Clarivate's Web of Science and Elsevier's Scopus, and on open data philosophies similar to those of Wikimedia Foundation initiatives. Key contributors and funders include nonprofit organizations, university libraries such as those at Stanford University and Harvard University, research groups at MIT and the University of California, Berkeley, and open-source communities around GitHub and the Apache Software Foundation. Milestones include the initial dataset releases, the expansion of concept taxonomies influenced by MeSH and the ACM Computing Classification System, and integration efforts with platforms like ORCID and Crossref Event Data.

Data Model and Content

The OpenAlex data model consists of five primary entity types: works, authors, venues, institutions, and concepts. Each work record links to identifiers such as a DOI and provides metadata fields inspired by standards from Dublin Core, Schema.org, and BIBFRAME. Author records map to ORCID iDs and to institutional affiliations that reference identifiers like ROR. Venue records cover journals and conferences, carry ISSNs, and record relationships to publishers such as Springer Nature, Elsevier, Wiley-Blackwell, and IEEE. Institution records describe organizations like the University of Oxford, the University of Cambridge, the Max Planck Society, the Chinese Academy of Sciences, and the National Institutes of Health. Concept taxonomies are informed by domain authorities including Medical Subject Headings, ACM, and PASCAL, and by community ontologies on Wikidata and Wikipedia. Citation links between works form a network that supports PageRank-style ranking and bibliometric indicators of the kind used in Eigenfactor and Altmetric studies.
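The five entity types and the links between them can be sketched as plain Python data classes. This is an illustrative model only: the class and field names below (`author_ids`, `cited_work_ids`, and so on) are assumptions chosen for readability and do not reproduce the actual OpenAlex schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Institution:
    id: str
    display_name: str
    ror: Optional[str] = None  # ROR identifier, if known


@dataclass
class Author:
    id: str
    display_name: str
    orcid: Optional[str] = None
    institution_ids: List[str] = field(default_factory=list)


@dataclass
class Venue:
    id: str
    display_name: str
    issn: Optional[str] = None
    publisher: Optional[str] = None


@dataclass
class Concept:
    id: str
    display_name: str


@dataclass
class Work:
    id: str
    title: str
    doi: Optional[str] = None
    venue_id: Optional[str] = None
    author_ids: List[str] = field(default_factory=list)
    concept_ids: List[str] = field(default_factory=list)
    cited_work_ids: List[str] = field(default_factory=list)  # outgoing citations


def citation_counts(works: Iterable[Work]) -> Dict[str, int]:
    """Count incoming citations per work id from the outgoing citation lists."""
    counts: Counter = Counter()
    for work in works:
        counts.update(set(work.cited_work_ids))  # ignore duplicate references
    return dict(counts)
```

Storing only outgoing citations on each work keeps records self-contained; incoming counts are then derived, as in `citation_counts`, rather than stored twice.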

Access and APIs

OpenAlex provides bulk data snapshots and a RESTful API for programmatic access. The API returns JSON responses that are easy to consume from Python, R, Node.js, and Julia. Hosted services and mirrors may run on infrastructure from Amazon Web Services and Google Cloud Platform, with GitHub Actions used for CI/CD. Community-maintained clients and wrappers are distributed through PyPI, CRAN, and npm. Rate limiting, pagination, and query parameters follow patterns familiar from the Crossref and Europe PMC APIs. The project publishes code and documentation on GitHub, and contributions are coordinated through community governance in the style of OpenStreetMap and through issue tracking practices familiar from projects like TensorFlow and LibreOffice.
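A paginated query against the REST API might look like the sketch below. It is written under stated assumptions rather than taken from this article: the base URL `https://api.openalex.org`, the `filter`, `per-page`, and `cursor` parameters, and the `meta.next_cursor` and `results` response fields reflect the public API documentation as best understood and should be checked against the current docs. The function names are hypothetical, and the `fetch` argument exists so the paging logic can be exercised without network access.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.openalex.org"  # assumed public endpoint


def build_url(entity: str, filters: Dict[str, str],
              cursor: str = "*", per_page: int = 50) -> str:
    """Build a list-endpoint URL such as /works with filter and paging parameters."""
    params = {"per-page": per_page, "cursor": cursor}
    if filters:
        # Filters are assumed to be comma-separated key:value pairs.
        params["filter"] = ",".join(f"{k}:{v}" for k, v in filters.items())
    return f"{BASE_URL}/{entity}?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Default fetcher: perform an HTTP GET and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_results(entity: str, filters: Dict[str, str],
                 fetch: Callable[[str], dict] = fetch_json,
                 max_pages: int = 5) -> Iterator[dict]:
    """Yield records page by page, following the cursor until it runs out."""
    cursor = "*"
    for _ in range(max_pages):
        page = fetch(build_url(entity, filters, cursor))
        results = page.get("results", [])
        if not results:
            return
        yield from results
        cursor = page.get("meta", {}).get("next_cursor")
        if not cursor:
            return
```

A call such as `iter_results("works", {"publication_year": "2020"})` would then stream records lazily; the `max_pages` cap is a deliberate guard against accidentally walking an entire result set while experimenting.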

Usage and Applications

Researchers use the dataset for bibliometrics, altmetrics, citation network analysis, and trend detection in fields ranging from the biomedical literature indexed in PubMed to computer science topics cataloged by ACM. Libraries and discovery services integrate OpenAlex records into catalogs alongside systems like Ex Libris and Koha. Publishers and preprint servers such as bioRxiv and medRxiv cross-link to OpenAlex metadata to improve discoverability. Data science teams use the corpus for machine learning tasks, in the vein of natural language processing work by DeepMind and OpenAI, and train models for recommendation tasks akin to those in Semantic Scholar. Funders and policy analysts at organizations like the Wellcome Trust and the National Science Foundation use the metadata for portfolio analysis and evaluation.
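As an illustration of citation network analysis, the sketch below ranks works with a basic PageRank power iteration over a citation graph. It is a generic textbook formulation, not a method attributed to OpenAlex; the input format (a mapping from each work id to the ids it cites), the damping factor of 0.85, and the fixed iteration count are all assumptions.

```python
from typing import Dict, List


def pagerank(citations: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Rank works by PageRank, where each work passes rank to the works it cites."""
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Works that cite nothing would leak rank, so spread their mass evenly.
        dangling = sum(rank[node] for node in nodes if not citations.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for citing, cited_list in citations.items():
            if not cited_list:
                continue
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share
        rank = new_rank
    return rank
```

For a toy graph in which two works both cite a third, for example `pagerank({"W1": ["W3"], "W2": ["W3"], "W3": []})`, the cited work ends up with the highest score and the scores sum to one. Real citation graphs are far too large for dictionaries like this and would typically use sparse matrices or a graph library instead.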

Governance and Licensing

The project operates within an open-data ethos and releases its core bibliographic metadata under CC0, the Creative Commons public domain dedication. Governance involves a mix of nonprofit stewards, community contributors, academic partners, and advisory boards, modeled on governance at the Wikimedia Foundation and at collaborative infrastructures such as the OpenStreetMap Foundation. Licensing choices aim to maximize reuse by researchers, startups, and cultural institutions, in line with practices advocated by groups like SPARC and Force11. Data provenance and correction workflows draw on standards promoted by ORCID, Crossref, and open scholarship initiatives such as the Scholarly Communication Institute.
