Generated by GPT-5-mini

| Corpus of Historical American English | |
|---|---|
| Name | Corpus of Historical American English |
| Abbreviation | COHA |
| Type | Historical text corpus |
| Languages | English |
| Years | 1810–2009 |
| Size | ~400 million words |
| Created | 2000s |
| Creator | Mark Davies |
| Institution | Brigham Young University |

The Corpus of Historical American English (COHA) is a large diachronic corpus of American English covering the period 1810–2009, designed for quantitative analysis of linguistic, cultural, and literary change. It supports studies in historical linguistics, corpus linguistics, digital humanities, and sociocultural history by providing searchable texts drawn from newspapers, magazines, fiction, and nonfiction. The corpus has been used alongside other resources in computational linguistics, lexicography, and cultural analytics.
The corpus was developed by Mark Davies at Brigham Young University as part of broader work on historical corpora and corpus tools, which also includes the Corpus of Contemporary American English. Its design enables comparisons across time with resources such as the Google Books Ngram Viewer, the British National Corpus, and the Corpus of Global Web-based English. It has informed scholarship associated with institutions such as Oxford University Press, Cambridge University Press, Harvard University, and Yale University, and has been cited in projects at Stanford University, the Massachusetts Institute of Technology, the University of Oxford, the University of Cambridge, and the University of California, Berkeley. Funders and collaborators include the National Endowment for the Humanities, the National Science Foundation, and corporate partners such as LexisNexis and ProQuest.
Material in the corpus is balanced across decades, from the Jacksonian era through Reconstruction, the Gilded Age, the Progressive Era, World War I, the Roaring Twenties, the Great Depression, World War II, the Cold War, the Civil Rights Movement, and the Vietnam War era, and into the late 20th century, including the Reagan administration. Text genres include fiction from publishers such as Harper & Brothers and Penguin Books, magazines including The Atlantic, Harper's Magazine, and Time, newspapers such as The New York Times and The Washington Post, and nonfiction, with works by authors such as Walt Whitman, Mark Twain, Henry David Thoreau, F. Scott Fitzgerald, Ernest Hemingway, Toni Morrison, William Faulkner, and Ralph Waldo Emerson.
Source materials were digitized from archival and commercial collections, including holdings from the Library of Congress, HathiTrust, JSTOR, Gale, ProQuest Historical Newspapers, and the Internet Archive. Texts derive from publishers and periodicals such as HarperCollins, Random House, Scribner, McClure's Magazine, Puck, and The New Yorker, and include works by public figures such as Abraham Lincoln, Theodore Roosevelt, Woodrow Wilson, Franklin D. Roosevelt, John F. Kennedy, Martin Luther King Jr., Susan B. Anthony, Frederick Douglass, and Susan Sontag. The legal and political texts sampled reflect episodes such as the Missouri Compromise, the Emancipation Proclamation, the New Deal, the Civil Rights Act of 1964, and policy debates from the Great Society era.
Each text in the corpus is annotated with metadata fields including publication year, genre, author, publisher, and place of publication, linked to repositories such as WorldCat and archives such as the New York Public Library. Linguistic annotation layers include tokenization, lemmatization, and part-of-speech tagging with the CLAWS7 tagset, the same scheme used for the Corpus of Contemporary American English. Metadata supports queries restricted by attributes such as author, for example Edgar Allan Poe, Louisa May Alcott, Nathaniel Hawthorne, Emily Dickinson, Sylvia Plath, Henry James, Jack London, or Stephen Crane.
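As a rough illustration of how annotation layers and text-level metadata of this kind could be combined in a query, the sketch below models each text as a record with a few metadata fields and a sequence of annotated tokens. The class names, field names, the `find_lemma` helper, the sample records, and the part-of-speech labels are all hypothetical; they do not reflect the corpus's actual file format, interface, or tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary headword
    pos: str    # part-of-speech label (illustrative, not the corpus tagset)


@dataclass(frozen=True)
class Text:
    year: int
    genre: str
    author: str
    tokens: tuple  # tuple of Token objects


def find_lemma(texts, lemma, pos=None, genre=None, start=None, end=None):
    """Yield (year, author, word) for tokens matching a lemma and optional filters."""
    for text in texts:
        # Text-level metadata filters are applied first.
        if genre is not None and text.genre != genre:
            continue
        if start is not None and text.year < start:
            continue
        if end is not None and text.year > end:
            continue
        # Token-level annotation filters are applied second.
        for tok in text.tokens:
            if tok.lemma == lemma and (pos is None or tok.pos == pos):
                yield (text.year, text.author, tok.word)


# Toy, hand-made records; not taken from the corpus.
texts = [
    Text(1851, "fiction", "Author A", (
        Token("She", "she", "PRON"),
        Token("walked", "walk", "VERB"),
        Token("home", "home", "ADV"),
    )),
    Text(1925, "magazine", "Author B", (
        Token("A", "a", "DET"),
        Token("long", "long", "ADJ"),
        Token("walk", "walk", "NOUN"),
    )),
]

hits = list(find_lemma(texts, "walk", pos="VERB"))
print(hits)  # [(1851, 'Author A', 'walked')]
```

Searching by lemma rather than surface form is what lets a single query retrieve "walked" and "walk" together, while the part-of-speech filter separates the verb from the noun.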
Access to the corpus has been provided via an online interface and downloadable files under licensing agreements negotiated with content providers; academic users at institutions such as the University of Michigan, Columbia University, Princeton University, and Duke University have used site licenses. Tools and APIs for querying the corpus have been integrated with concordancers and analysis platforms influenced by tools such as Sketch Engine and AntConc and by research from Google Research and Microsoft Research. Licensing arrangements reflect negotiations with rights holders including Penguin Random House and Hearst Communications, and with archival partners such as Chronicling America.
Researchers have used the corpus to trace lexical change (e.g., frequency shifts in words studied in Oxford English Dictionary projects), semantic change, demonstrated in analyses referencing authors such as Charles Dickens and Henry Adams, and sociocultural trends, investigated alongside U.S. Census Bureau and Bureau of Labor Statistics records. Studies combining the corpus with computational methods from groups at Carnegie Mellon University, the Max Planck Institute for Psycholinguistics, the University of Chicago, and New York University have explored topics including diachronic syntactic change, gendered language patterns linked to figures such as Susan B. Anthony and Gloria Steinem, and sentiment shifts around events such as the attack on Pearl Harbor, the Watergate scandal, and the September 11 attacks.
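Frequency-based studies of lexical change typically normalize raw counts by the size of each time slice, since decades differ in how many words they contain. The following is a minimal sketch of that normalization, assuming the texts are available as (year, tokens) pairs; the function names and the toy data are invented for illustration and are not drawn from the corpus.

```python
from collections import Counter


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1857 -> 1850."""
    return (year // 10) * 10


def per_million_by_decade(records, target):
    """Return {decade: occurrences of target per million tokens}.

    records is an iterable of (year, tokens) pairs, where tokens is a
    list of lowercased word forms.
    """
    hits = Counter()
    totals = Counter()
    for year, tokens in records:
        decade = decade_of(year)
        totals[decade] += len(tokens)
        hits[decade] += sum(1 for tok in tokens if tok == target)
    # Decades with no tokens are skipped to avoid dividing by zero.
    return {
        decade: hits[decade] * 1_000_000 / totals[decade]
        for decade in sorted(totals)
        if totals[decade] > 0
    }


# Toy data, far too small for real conclusions.
records = [
    (1853, ["the", "steamship", "left", "port"]),
    (1857, ["a", "steamship", "and", "another", "steamship", "arrived"]),
    (1994, ["the", "ship", "left", "port", "today"]),
]

rates = per_million_by_decade(records, "steamship")
print(rates)  # {1850: 300000.0, 1990: 0.0}
```

Comparing per-million rates rather than raw counts keeps a decade with more sampled text from appearing to use a word more often simply because it is larger.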
Critics have noted sampling biases arising from preservation and digitization practices, which affect the representation of marginalized voices such as Native American authors, African American newspapers such as The Chicago Defender and The Pittsburgh Courier, and women writers omitted from archival pipelines. Other limitations include OCR errors in materials from sources such as Google Books and Historic Newspapers, and copyright restrictions affecting the inclusion of late 20th-century works from publishers such as Simon & Schuster and Houghton Mifflin Harcourt. Methodological critiques draw on debates among scholars at Yale Law School and Harvard Law School about corpus representativeness and intellectual property.