Corpus of Historical American English

Name	Corpus of Historical American English
Abbreviation	COHA
Type	Historical text corpus
Languages	English
Years	1810–2009
Size	~400 million words
Created	2000s
Creator	Mark Davies
Institution	Brigham Young University

Contents

Overview
Composition and Periodization
Corpus Compilation and Sources
Annotation and Metadata
Access, Licensing, and Tools
Research Applications and Findings
Limitations and Criticisms

The Corpus of Historical American English (COHA) is a large diachronic corpus of American English covering the period 1810–2009 and containing roughly 400 million words. It is designed for quantitative analysis of linguistic, cultural, and literary change, and it supports studies in historical linguistics, corpus linguistics, digital humanities, and sociocultural history by providing searchable texts drawn from newspapers, magazines, fiction, and nonfiction. The corpus has been used alongside other resources in computational linguistics, lexicography, and cultural analytics.

Overview

The corpus was developed by Mark Davies at Brigham Young University as part of broader work on historical corpora and corpus tools, which also includes the Corpus of Contemporary American English. Its design allows comparisons across time and against resources such as the Google Books Ngram Viewer, the British National Corpus, and the Corpus of Global Web-based English. It has informed scholarship associated with institutions such as Oxford University Press, Cambridge University Press, Harvard University, and Yale University, and has been cited in projects at Stanford University, the Massachusetts Institute of Technology, the University of Oxford, the University of Cambridge, and the University of California, Berkeley. Funders and collaborators include the National Endowment for the Humanities, the National Science Foundation, and corporate partners such as LexisNexis and ProQuest.

Composition and Periodization

Material in the corpus is balanced across decades, spanning the Jacksonian era, the Reconstruction era, the Gilded Age, the Progressive Era, World War I, the Roaring Twenties, the Great Depression, World War II, the Cold War, the Civil Rights Movement, the Vietnam War era, and the Reagan administration, and continuing to 2009. Text genres include fiction from publishers such as Harper & Brothers and Penguin Books; magazines including The Atlantic, Harper's Magazine, and Time; newspapers such as The New York Times and The Washington Post; and nonfiction books. Authors represented include Walt Whitman, Mark Twain, Henry David Thoreau, F. Scott Fitzgerald, Ernest Hemingway, Toni Morrison, William Faulkner, and Ralph Waldo Emerson.

Corpus Compilation and Sources

Source materials were digitized from archival and commercial collections, including holdings of the Library of Congress, HathiTrust, JSTOR, Gale, ProQuest Historical Newspapers, and the Internet Archive. Texts derive from publishers and periodicals such as HarperCollins, Random House, Scribner, McClure's Magazine, Puck, and The New Yorker, and include works by public figures such as Abraham Lincoln, Theodore Roosevelt, Woodrow Wilson, Franklin D. Roosevelt, John F. Kennedy, Martin Luther King Jr., Susan B. Anthony, Frederick Douglass, and Susan Sontag. The legal and political texts sampled reflect episodes such as the Missouri Compromise, the Emancipation Proclamation, the New Deal, the Civil Rights Act of 1964, and policy debates of the Great Society era.

Annotation and Metadata

Each text in the corpus is annotated with metadata fields including publication year, genre, author, publisher, and place of publication. These records are linked to repositories such as WorldCat and to archives such as the New York Public Library. Linguistic annotation layers include tokenization, lemmatization, and part-of-speech tagging, with tags aligned to tagsets used in projects at Stanford University and to tagging frameworks influenced by the Penn Treebank. The metadata supports queries restricted by attributes such as author, for example Edgar Allan Poe, Louisa May Alcott, Nathaniel Hawthorne, Emily Dickinson, Sylvia Plath, Henry James, Jack London, or Stephen Crane.
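The relationship between text-level metadata and token-level annotation can be illustrated with a minimal Python sketch. The class names, field names, tag labels, and sample records below are hypothetical and chosen only for illustration; they do not reflect the corpus's actual file format or tagset.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    word: str   # surface form as it appears in the text
    lemma: str  # headword the form is grouped under
    pos: str    # part-of-speech tag (placeholder labels, not the corpus tagset)


@dataclass
class Text:
    year: int    # publication year
    genre: str   # e.g. "fiction", "magazine", "newspaper", "nonfiction"
    author: str
    tokens: List[Token] = field(default_factory=list)


def lemma_hits(
    texts: List[Text],
    lemma: str,
    genre: Optional[str] = None,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Return (year, word) pairs for tokens with the given lemma,
    keeping only texts that pass the metadata filters."""
    hits = []
    for text in texts:
        if genre is not None and text.genre != genre:
            continue
        if start is not None and text.year < start:
            continue
        if end is not None and text.year > end:
            continue
        for tok in text.tokens:
            if tok.lemma == lemma:
                hits.append((text.year, tok.word))
    return hits


# Toy records invented for this example.
SAMPLE = [
    Text(1850, "fiction", "Author A",
         [Token("She", "she", "PRON"), Token("walked", "walk", "VERB")]),
    Text(1950, "magazine", "Author B",
         [Token("They", "they", "PRON"), Token("walk", "walk", "VERB")]),
]
```

Calling `lemma_hits(SAMPLE, "walk", genre="fiction")` matches the inflected form "walked" in the 1850 text because the search runs on the lemma layer, while the genre filter excludes the 1950 magazine text.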

Access, Licensing, and Tools

Access to the corpus has been provided through an online interface and as downloadable files, under licensing agreements negotiated with content providers. Academic users at institutions such as the University of Michigan, Columbia University, Princeton University, and Duke University have used site licenses. Tools and APIs for querying the corpus have been integrated with concordancers and analysis platforms influenced by Sketch Engine, AntConc, and research from Google Research and Microsoft Research. Licensing arrangements reflect negotiations with rights holders including Penguin Random House and Hearst Communications, and with archival partners such as Chronicling America.
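As a rough illustration of what a concordancer does, the following Python sketch produces keyword-in-context (KWIC) lines from a list of tokens. The function name and its `width` parameter are assumptions made for this example; it is not the interface of the corpus website or of any tool named above.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for each occurrence of
    keyword, matching case-insensitively and showing up to `width` tokens
    on either side."""
    keyword = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines
```

For the token list `"the old engine ran and the new Engine stopped".split()`, a call to `kwic(tokens, "engine", width=2)` yields one line per occurrence, each with at most two words of context on either side.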

Research Applications and Findings

Researchers have used the corpus to trace lexical change (for example, frequency shifts in words studied by scholars working on the Oxford English Dictionary), semantic change demonstrated in analyses referencing authors such as Charles Dickens and Henry Adams, and sociocultural trends investigated alongside records from the U.S. Census Bureau and the Bureau of Labor Statistics. Studies combining the corpus with computational methods from groups at Carnegie Mellon University, the Max Planck Institute for Psycholinguistics, the University of Chicago, and New York University have explored topics including diachronic syntactic change, gendered language patterns linked to figures such as Susan B. Anthony and Gloria Steinem, and sentiment shifts around events such as Pearl Harbor, the Watergate scandal, and 9/11.
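Frequency shifts of this kind are usually compared as normalized rates, because the amount of text differs from decade to decade. The Python sketch below shows one common normalization, occurrences per million words by decade. The function names and the input format, a list of (year, tokens) pairs, are assumptions made for this example and are not the corpus's own tooling.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def decade_of(year: int) -> int:
    """Map a year to the first year of its decade, e.g. 1815 -> 1810."""
    return year // 10 * 10


def per_million_by_decade(
    records: Iterable[Tuple[int, List[str]]], target: str
) -> Dict[int, float]:
    """Return {decade: occurrences of target per million tokens}.

    Counts are pooled over all texts in a decade before normalizing,
    so longer texts contribute proportionally more.
    """
    target = target.lower()
    hits = Counter()
    totals = Counter()
    for year, tokens in records:
        decade = decade_of(year)
        totals[decade] += len(tokens)
        hits[decade] += sum(1 for t in tokens if t.lower() == target)
    # Skip decades with no tokens to avoid dividing by zero.
    return {d: hits[d] / totals[d] * 1_000_000 for d in sorted(totals) if totals[d]}
```

Raw counts alone would make a word look more frequent in any decade that simply has more text; dividing by the decade's total token count removes that effect before decades are compared.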

Limitations and Criticisms

Critics have noted sampling biases arising from preservation and digitization practices, which affect the representation of marginalized voices such as Native American authors, African American newspapers such as The Chicago Defender and The Pittsburgh Courier, and women writers omitted from archival pipelines. Other limitations include OCR errors in materials from sources such as Google Books and Historic Newspapers, and copyright restrictions that limit the inclusion of late 20th-century works from publishers such as Simon & Schuster and Houghton Mifflin Harcourt. Methodological critiques draw on debates among scholars at Yale Law School and Harvard Law School about corpus representativeness and intellectual property.
