
Google Books Ngram Corpus

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.

Google Books Ngram Corpus is a massive, publicly available dataset derived from the digitized text collection of the Google Books project. It records the yearly frequencies of word sequences, or n-grams, across a vast corpus of published works spanning several centuries. The resource enables quantitative analysis of linguistic and cultural trends over time, an approach known as culturomics. Researchers use the corpus to track the evolution of language, the rise and fall of concepts, and historical patterns in published thought.

Overview

The corpus was first released in 2010 by a collaborative team from Google, Harvard University, MIT, and Encyclopædia Britannica, Inc. Its creation was driven by the goal of providing a quantitative lens on human culture, as detailed in a seminal paper published in the journal Science. The dataset is built on scanned books from the Google Books Library Project, which partners with major institutions such as the New York Public Library and the Bodleian Library. This foundational work allows scholars to move beyond purely qualitative analysis and employ computational methods to study historical texts, effectively creating a time series of language use.

Data and Methodology

The underlying data is sourced from over 8 million books, representing more than 6% of all books ever published. The corpus includes text in multiple languages, such as English, French, German, Spanish, Russian, and Chinese. The primary methodology involves scanning and OCRing text, then counting the annual frequency of n-grams, which are contiguous sequences of words; the released datasets cover 1-grams (single words) through 5-grams. Raw counts are normalized against the total number of n-grams published each year, so results can be presented as a percentage of the annual corpus. Metadata, including publication year, is extracted from ISBN records and library catalogues, though this process can introduce dating errors and other biases.
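The count-and-normalize step described above can be sketched in a few lines of Python. This is an illustrative toy, not the production pipeline: the function names and the miniature "books" are invented for the example, and the real corpus must additionally handle OCR, tokenization, and data at library scale.

```python
from collections import Counter, defaultdict

def extract_ngrams(tokens, n):
    """Yield contiguous n-token sequences from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def relative_frequencies(books, n=2):
    """Compute per-year relative n-gram frequencies.

    `books` is an iterable of (year, tokens) pairs; deriving clean
    tokens from scanned pages is glossed over in this sketch.
    """
    counts = defaultdict(Counter)   # year -> Counter of n-grams
    totals = Counter()              # year -> total n-grams that year
    for year, tokens in books:
        for gram in extract_ngrams(tokens, n):
            counts[year][gram] += 1
            totals[year] += 1
    # Normalize raw counts by the yearly total, so years with more
    # published text remain comparable, as in the published corpus.
    return {
        year: {gram: c / totals[year] for gram, c in grams.items()}
        for year, grams in counts.items()
    }

# Toy usage: two "books" from different years.
books = [
    (1990, "the world wide web".split()),
    (2000, "the web the web grew".split()),
]
freqs = relative_frequencies(books, n=2)
print(freqs[2000][("the", "web")])  # 0.5: 2 of the 4 bigrams that year
```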

Applications and Research

This corpus has enabled groundbreaking research across numerous disciplines. In linguistics, it has been used to study grammaticalization, lexical change, and the adoption of new words like "Internet". Historians and sociologists employ it for culturomics, examining trends in fame by tracking mentions of individuals like Albert Einstein or events like the Great Depression. It has shed light on historical memory, such as the changing discussion of World War II versus the Vietnam War. The corpus also supports research in digital humanities, allowing analysis of literary themes and the reception of works by authors such as William Shakespeare and Jane Austen. Tools like the Google Books Ngram Viewer provide a public interface for exploring these trends.
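The Viewer's chart data can also be fetched programmatically as JSON. A caution before the sketch below: the endpoint is undocumented and inferred from the Viewer's public URL scheme, so the URL, parameter names, and response shape are assumptions that may change without notice.

```python
import requests

# Undocumented JSON endpoint mirroring the Ngram Viewer's query string
# (an assumption based on the web form; not an official, stable API).
URL = "https://books.google.com/ngrams/json"

def ngram_timeseries(phrases, year_start=1900, year_end=2000,
                     corpus="en-2019", smoothing=0):
    """Fetch yearly relative frequencies for the given phrases."""
    params = {
        "content": ",".join(phrases),
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    }
    resp = requests.get(URL, params=params, timeout=30)
    resp.raise_for_status()
    # Each entry carries an "ngram" label and a "timeseries" list of
    # per-year fractions of the corpus.
    return {row["ngram"]: row["timeseries"] for row in resp.json()}

# Example: trace the adoption of "Internet", as discussed above.
series = ngram_timeseries(["Internet"], 1960, 2000)
print(series["Internet"][-1])  # relative frequency in the final year
```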

Limitations and Criticisms

Despite its scale, the corpus has notable limitations. A significant issue is the overrepresentation of scientific literature, particularly from the 20th century, which can skew trends meant to reflect general culture. The optical character recognition (OCR) process for older texts can produce errors, especially with historical typefaces or damaged pages; the archaic long s (ſ) of pre-19th-century printing, for instance, is frequently misread as an "f". Furthermore, the metadata is imperfect: publication dates may be misidentified, and the corpus does not distinguish between genres such as fiction and non-fiction. Critics have argued that these biases require careful statistical correction to avoid misleading conclusions about cultural evolution. The selection of scanned books is also shaped by the holdings of partner libraries such as the University of Michigan Library.

Related Projects

Several other initiatives complement or provide alternatives to this corpus. The HathiTrust Digital Library offers a similarly massive collection of digitized texts with different research interfaces. For more curated historical text analysis, projects like the Text Creation Partnership and the Corpus of Historical American English are often used. Academic consortia such as the ARTFL Project at the University of Chicago provide deep dives into specific corpora, such as French literature. Software tools including the NLTK library in Python and the Stanford CoreNLP suite are commonly employed to analyze n-gram data. The broader field of big-data research in the humanities continues to evolve through these parallel efforts.
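As a brief illustration of the kind of tallying these toolkits support, the sketch below uses NLTK's ngrams and FreqDist helpers on a toy sentence; the sentence and variable names are invented for the example, and the same pattern scales to larger token streams.

```python
import nltk
from nltk import FreqDist, ngrams
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models (first run only)

text = "The corpus counts words, and the corpus counts phrases."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

# Tally contiguous 2-grams, the same unit the Ngram Corpus reports.
bigrams = FreqDist(ngrams(tokens, 2))
for gram, count in bigrams.most_common(3):
    print(gram, count)  # e.g. ('the', 'corpus') 2
```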

Category:Digital libraries Category:Computational linguistics Category:Google services Category:Text mining