Microsoft Academic Graph

Name	Microsoft Academic Graph
Type	Bibliographic database
Owner	Microsoft Research
Launched	2015
Discontinued	2021
Language	English (primary)
Coverage	Scholarly publications, authors, venues, institutions, fields, citations

Contents

Overview
Data Model and Content
Data Sources and Collection Methods
Access, APIs, and Tools
Impact, Uses, and Limitations
History and Discontinuation
Reception and Criticism

The Microsoft Academic Graph (MAG) was a large-scale, heterogeneous scholarly knowledge graph that mapped relationships among publications, authors, venues, institutions, concepts, and citations. It was produced by Microsoft Research and used by researchers, librarians, and companies for bibliometrics, trend analysis, and recommendation systems. The dataset interlinked entities such as authors affiliated with Stanford University, publications appearing in Nature, and conferences like NeurIPS, which enabled network analyses across disciplines.

Overview

The graph aggregated metadata for millions of records, linking publications to entities including authors from the Massachusetts Institute of Technology, institutions like Harvard University, venues such as IEEE Transactions on Pattern Analysis and Machine Intelligence, and works appearing in outlets such as Science, Proceedings of the National Academy of Sciences, and the ACM SIGIR Conference. Projects at Carnegie Mellon University, the University of Oxford, and Tsinghua University used these relationships to map citation networks, to track influence through connections to awards such as the Nobel Prize and the Turing Award, and to correlate research output with funding agencies like the National Institutes of Health and the National Science Foundation.

Data Model and Content

The schema represented entities as nodes and their relationships as edges. Node types included publications (journal articles, conference papers, patents), authors, affiliations, venues (journals, conferences), and fields of study tied to taxonomies. Citation edges linked works such as articles in The Lancet and patents from the United States Patent and Trademark Office. The graph supported author disambiguation by linking names to institutions like the University of Cambridge and to research groups at Google Research and DeepMind. It included metadata fields analogous to DOI records from registration agencies like Crossref, and indexing signals similar to those used by PubMed and Scopus.
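The node-and-edge structure described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual MAG schema: the class names, the node kinds, and the relation labels ("written_by", "published_in", "cites") are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """One entity in the graph. All field names here are hypothetical."""
    node_id: str
    kind: str  # e.g. "paper", "author", "affiliation", "venue", "field"
    name: str


@dataclass
class ScholarlyGraph:
    """A tiny heterogeneous graph: typed nodes plus labeled, directed edges."""
    nodes: dict = field(default_factory=dict)
    # Maps (source id, relation label) -> set of target ids.
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, node_id, kind, name):
        self.nodes[node_id] = Node(node_id, kind, name)

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(src, relation)].add(dst)

    def neighbors(self, src, relation):
        """Targets reachable from src over one edge with the given label."""
        return sorted(self.edges.get((src, relation), set()))

    def cited_by(self, paper_id):
        """Papers that have a 'cites' edge pointing at paper_id."""
        return sorted(
            src
            for (src, rel), targets in self.edges.items()
            if rel == "cites" and paper_id in targets
        )


# Illustrative data only; none of these records come from the real dataset.
g = ScholarlyGraph()
g.add_node("p1", "paper", "An Example Article")
g.add_node("p2", "paper", "A Follow-up Article")
g.add_node("a1", "author", "A. Author")
g.add_node("v1", "venue", "Example Journal")
g.add_edge("p1", "written_by", "a1")
g.add_edge("p1", "published_in", "v1")
g.add_edge("p2", "cites", "p1")
```

Keeping the relation label in the edge key lets one structure hold authorship, venue, and citation links side by side, which is the sense in which such a graph is heterogeneous.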

Data Sources and Collection Methods

Content was ingested from publisher feeds, open repositories, web crawls, and metadata providers, and it incorporated outputs from publishers such as Elsevier, Springer Nature, and Wiley. The pipeline harvested preprints from servers like arXiv and metadata from institutional repositories at Caltech and ETH Zurich. Automated extraction used techniques comparable to those developed at the Allen Institute for AI and relied on bibliographic identifiers governed by organizations like the International DOI Foundation. Entity linking drew on authority sources used by libraries such as the Library of Congress and on citation indices maintained by Clarivate.
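One step implied by combining several sources is reconciling records that describe the same work. A minimal sketch follows, assuming records are merged on a normalized DOI; this is a common approach, not a description of Microsoft's actual pipeline. The field names ("source", "doi", "title", "venue") and the sample records are hypothetical.

```python
import re

# Strip common resolver or scheme prefixes so the bare DOI can be compared.
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercase DOI without a resolver prefix (DOIs are case-insensitive)."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def merge_records(records):
    """Group records by normalized DOI, keeping the first non-empty value per field."""
    merged = {}
    for rec in records:
        key = normalize_doi(rec["doi"])
        target = merged.setdefault(key, {"doi": key, "sources": []})
        target["sources"].append(rec["source"])
        for name, value in rec.items():
            if name in ("doi", "source"):
                continue
            # Earlier sources win; later ones only fill in missing fields.
            if value and not target.get(name):
                target[name] = value
    return merged


# Hypothetical input: the same work seen in a publisher feed and a web crawl.
records = [
    {"source": "publisher_feed", "doi": "https://doi.org/10.1000/XYZ123",
     "title": "An Example Paper", "year": 2019},
    {"source": "web_crawl", "doi": "doi:10.1000/xyz123",
     "title": "", "venue": "Example Journal"},
]
merged = merge_records(records)
```

Works without a DOI, such as many preprints and older papers, would need a fallback such as fuzzy matching on title and author names, which this sketch does not attempt.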

Access, APIs, and Tools

Microsoft exposed the dataset through APIs and downloadable snapshots, which could be processed with the kind of tooling common on platforms such as Kaggle and Semantic Scholar. Developers and analysts queried RESTful endpoints for entities tied to conferences like ICML or journals like IEEE Transactions on Neural Networks and Learning Systems, and integrated the results into workflows built on GitHub and Jupyter Notebook. Third-party projects at OpenAI and in academic groups combined the graph with analytics tools from Tableau and with libraries like NetworkX for graph analysis.
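As a rough illustration of working with a downloaded snapshot, the sketch below counts incoming citations from a tab-separated edge list. The two-column layout (citing paper ID, cited paper ID) and the sample rows are assumptions made for the example; they should not be read as the documented format of the MAG files.

```python
import csv
import io
from collections import Counter

# Hypothetical edge list: each row is "citing_id<TAB>cited_id".
SAMPLE = "p2\tp1\np3\tp1\np3\tp2\n"


def citation_counts(handle):
    """Count how many times each paper ID appears as the cited endpoint."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        _citing, cited = row
        counts[cited] += 1
    return counts


counts = citation_counts(io.StringIO(SAMPLE))
top = counts.most_common(1)
```

Reading the file row by row keeps memory use proportional to the number of distinct cited papers. For path-based or centrality measures, the same edge list could instead be loaded into a directed graph in a library such as NetworkX.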

Impact, Uses, and Limitations

Researchers at institutions including the University of California, Berkeley and Princeton University used the graph for bibliometrics, for spotting trends in fields like computer science and medicine, and for building recommender systems comparable to services from Google Scholar and Scopus. Policymakers and evaluators at organizations such as the European Commission and the World Health Organization used bibliographic maps for strategic planning, while startups integrated author and affiliation data into products alongside profiles from LinkedIn. Limitations included coverage bias toward English-language and publisher-partner content, entity-disambiguation errors affecting authors from institutions like IIT Bombay or the University of São Paulo, and incomplete citation linking compared with curated indexes like Web of Science.

History and Discontinuation

Developed by teams within Microsoft Research beginning in the 2010s, the graph expanded through collaborations with academic partners and was linked to projects at labs such as Microsoft Research Redmond and Microsoft Research Cambridge (UK). Public releases, updates, and dataset snapshots were distributed until an announced wind-down in 2021. Continuity efforts then pointed users to alternative datasets such as Semantic Scholar, which is maintained by the Allen Institute for AI, and to community projects at the Open Knowledge Foundation.

Reception and Criticism

The graph was praised by scholars at the University of Washington and Yale University for its scale and its utility in large-scale analyses. Critics writing in venues like Communications of the ACM, along with researchers affiliated with Indiana University, noted several issues: opaque update schedules, proprietary processing steps at Microsoft Research that complicated reproducibility, and representation gaps affecting authors from regions served by universities such as Universidad Nacional Autónoma de México and the University of Cape Town. Debates compared it to Google Scholar, Scopus (Elsevier), and Web of Science (Clarivate) on openness, coverage, and sustainability.
