| YAGO | |
|---|---|
| Name | YAGO |
| Developer | Max Planck Institute for Informatics; Télécom Paris |
| Initial release | 2007 |
| Written in | Java; Python |
| Operating system | Cross-platform |
| License | Creative Commons Attribution (CC BY) |
YAGO is a large semantic knowledge base that organizes facts about entities, their types, and relationships. It combines structured information from multiple sources into a unified ontology to support semantic search, question answering, and research in natural language processing. The project emphasizes precision, provenance, and alignment with established taxonomies to enable reliable reasoning over linked data.
The knowledge base integrates facts about people, places, organizations, creative works, historical events, and scientific concepts from sources such as Wikipedia, Wikidata, WordNet, and GeoNames. Its schema maps entities to a taxonomy derived from lexical and encyclopedic resources, providing fine-grained typing and disambiguation for entities like Albert Einstein, Queen Elizabeth II, Mount Everest, New York City, and The Beatles. The dataset is published in formats compatible with semantic web technologies, including RDF dumps and SPARQL endpoints, and has been used in question-answering systems such as IBM Watson.
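Since the dataset is distributed as RDF, individual facts can be read as subject–predicate–object triples. The following sketch parses simplified N-Triples-style lines with the standard library; the triple strings and entity names are illustrative, not actual YAGO dump contents.

```python
# Minimal sketch: splitting YAGO-style triple lines into (subject, predicate, object).
# The fact strings below are invented examples, not real dump data.
import re

TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|"[^"]*"\S*)\s*\.$')

def parse_ntriple(line: str):
    """Split one simplified N-Triples line into (subject, predicate, object)."""
    m = TRIPLE_RE.match(line.strip())
    if not m:
        raise ValueError(f"not a triple: {line!r}")
    subj, pred, obj = m.groups()
    return subj, pred, obj.strip('<>')

facts = [
    '<Albert_Einstein> <rdf:type> <Physicist> .',
    '<Albert_Einstein> <wasBornIn> <Ulm> .',
]
triples = [parse_ntriple(f) for f in facts]
```

Real dumps use full IRIs and typed literals, so a production parser would rely on an RDF library rather than a regular expression.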
Initial work began at the Max Planck Institute for Informatics in the mid-2000s, building on the lexical resource WordNet and on Wikipedia. Early releases focused on high-precision extraction from Wikipedia infoboxes and category systems, later adding mappings to linked-data resources such as DBpedia and GeoNames. Subsequent versions integrated multilingual data from Wikipedia language editions, including the German, French, Spanish, and Chinese editions, and attached temporal and spatial metadata to facts.
The ontology leverages lexical hierarchies and encyclopedic classes to type entities such as Marie Curie, Pablo Picasso, Ludwig van Beethoven, Google, Harvard University, NASA, the United Nations, and the World Health Organization. Core sources include encyclopedic metadata from Wikipedia infoboxes and categories, lexical senses from WordNet, structured statements from Wikidata, and geospatial identifiers from GeoNames. The schema captures entity types, attributes (birth dates, founding dates, locations), and relations such as leadership, membership, authorship, and participation, linking entities like Nelson Mandela, Martin Luther King Jr., the Treaty of Versailles, the Nobel Prize, and the Academy Awards.
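The combination of typed entities, a subclass taxonomy, and relational facts can be sketched with plain dictionaries; the class names, relation names, and facts below are invented for illustration, not YAGO's actual vocabulary.

```python
# Sketch of a YAGO-like fact model: entities carry a direct type, types sit in
# a subclass taxonomy, and facts are (subject, relation, object) tuples.
# All names here are illustrative placeholders.
subclass_of = {
    "Physicist": "Scientist",
    "Scientist": "Person",
    "City": "Location",
}

direct_type = {
    "Albert_Einstein": "Physicist",
    "New_York_City": "City",
}

facts = [
    ("Albert_Einstein", "wonPrize", "Nobel_Prize"),
    ("Albert_Einstein", "bornIn", "Ulm"),
]

def has_type(entity: str, cls: str) -> bool:
    """True if the entity's direct type, or any superclass of it, equals cls."""
    t = direct_type.get(entity)
    while t is not None:
        if t == cls:
            return True
        t = subclass_of.get(t)
    return False
```

Walking the subclass chain is what lets a query for all instances of `Person` return entities typed only as `Physicist`.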
Automated extraction pipelines parse infoboxes, category trees, lists, and article text from Wikipedia, and align lexical senses from WordNet to disambiguate homonyms such as Paris (the French capital), Paris, Texas, and Paris Hilton. Schema mapping and taxonomy induction assign types to entities including Isaac Newton, Galileo Galilei, Ada Lovelace, and Charles Darwin. Reconciliation employs identifier matching against authority systems such as VIAF and ORCID and crosswalks to datasets like Wikidata and MusicBrainz to harmonize records for entities such as Ludwig van Beethoven, Wolfgang Amadeus Mozart, and Beyoncé. Provenance metadata records source pages and extraction patterns to support validation and conflict resolution.
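Identifier-based reconciliation of the kind described above can be sketched as a join on a shared authority key; the record fields and the VIAF number below are illustrative placeholders, not verified authority data.

```python
# Sketch of identifier-based reconciliation: records from two sources are
# merged when they share an external authority identifier (here, a VIAF key).
# Field values and the identifier are illustrative, not real authority records.
def reconcile(records_a, records_b, key="viaf"):
    """Merge records from two sources that share the same authority identifier."""
    index = {r[key]: r for r in records_a if key in r}
    merged = []
    for r in records_b:
        match = index.get(r.get(key))
        if match:
            merged.append({**match, **r})  # second source wins on conflicts
    return merged

source_a = [{"viaf": "0000001", "name": "Ludwig van Beethoven", "born": 1770}]
source_b = [{"viaf": "0000001", "name": "Beethoven, Ludwig van", "works": 722}]
merged = reconcile(source_a, source_b)
```

In practice the conflict-resolution policy matters as much as the join: provenance metadata decides which source's value survives when the two disagree.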
The dataset supports semantic search and question answering in academic and industrial settings. It underpins entity linking and named-entity disambiguation in pipelines built with NLP frameworks such as spaCy, Stanford CoreNLP, and GATE for tasks involving figures like William Shakespeare, Jane Austen, Charles Dickens, Leo Tolstoy, and Miguel de Cervantes. Researchers use it for knowledge graph embedding experiments with models such as TransE, DistMult, and ComplEx, and for downstream applications in recommendation systems, digital humanities projects on the Renaissance, the Industrial Revolution, and World War II, and biomedical knowledge discovery referencing resources such as PubMed, ClinicalTrials.gov, and DrugBank.
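The embedding models mentioned above share a simple core idea; TransE, for instance, scores a triple (h, r, t) by how close the tail vector is to head + relation. A minimal sketch with toy two-dimensional vectors (the embedding values are invented, not trained):

```python
# Minimal sketch of TransE scoring: a triple (h, r, t) is plausible when the
# tail embedding lies near head + relation. Vectors here are toy values,
# not trained embeddings.
import math

def transe_score(h, r, t):
    """L2 distance ||h + r - t||; lower means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

emb = {
    "Berlin":    [0.9, 0.1],
    "Germany":   [1.0, 1.0],
    "France":    [0.3, 1.2],
    "capitalOf": [0.1, 0.9],
}

true_score = transe_score(emb["Berlin"], emb["capitalOf"], emb["Germany"])
false_score = transe_score(emb["Berlin"], emb["capitalOf"], emb["France"])
```

Training adjusts the vectors so that observed triples score lower than corrupted ones; DistMult and ComplEx replace the additive score with multiplicative interactions.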
Evaluation studies compare coverage and accuracy against resources such as Wikidata, DBpedia, and Freebase, using benchmarks from venues such as the Semantic Web Challenge and manually annotated gold-standard corpora. Criticisms highlight biases inherited from source materials, including prominence bias toward well-covered figures such as Barack Obama or Angela Merkel, and coverage gaps for underrepresented regions and languages. Other concerns address temporal staleness, alignment inconsistencies with authority files such as VIAF and ORCID, and the challenge of maintaining high precision when integrating noisy web-derived data.
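Coverage-and-accuracy comparisons of this kind reduce to set overlap between extracted facts and a gold standard; the fact sets below are invented for illustration.

```python
# Sketch of evaluation against a gold-standard fact set: precision is the
# fraction of extracted facts that are correct, coverage (recall) the fraction
# of gold facts that were found. The facts are illustrative, not benchmark data.
gold = {("Einstein", "bornIn", "Ulm"), ("Curie", "bornIn", "Warsaw")}
extracted = {("Einstein", "bornIn", "Ulm"), ("Einstein", "bornIn", "Bern")}

def precision(pred: set, gold: set) -> float:
    return len(pred & gold) / len(pred) if pred else 0.0

def coverage(pred: set, gold: set) -> float:
    return len(pred & gold) / len(gold) if gold else 0.0
```

High-precision projects typically trade coverage for accuracy, which is why evaluations report both numbers rather than a single score.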
Category:Knowledge bases