LLMpedia: the first transparent, open encyclopedia generated by LLMs


Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Less Hop 4
Expansion Funnel: Extracted 81 → After dedup 0 → After NER 0 → Enqueued 0
Corpus of Contemporary American English
Name: Corpus of Contemporary American English
Abbreviation: COCA
Created: 2008
Creator: Mark Davies
Institution: Brigham Young University
Size: 1+ billion words (growing)
Languages: English (American)
Genres: spoken, fiction, magazines, newspapers, academic

Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is a large, balanced corpus of American English compiled for linguistic, literary, and computational research. It was created by Mark Davies at Brigham Young University and has been used by researchers at institutions such as Stanford University, Harvard University, the University of California, Berkeley, the University of Oxford, and the Massachusetts Institute of Technology. The resource informs projects in collaboration with publishers and media organizations including The New York Times, The Guardian, Time magazine, Random House, and academic presses such as Oxford University Press.

Overview and history

The corpus was initiated in 2008 by Mark Davies at Brigham Young University as a successor to earlier corpora such as the Brown Corpus and the Lancaster-Oslo/Bergen Corpus. Early development drew on digitized text from partners including The New York Times Company and The Washington Post, as well as databases used by scholars at Yale University and Columbia University. Subsequent expansions incorporated material contemporary with events such as the 2008 United States presidential election, the Arab Spring, and the rise of platforms such as Facebook, Twitter, and YouTube, situating the corpus alongside resources like the Google Books Ngram corpus and the British National Corpus.

Corpus composition and design

The corpus compiles over a billion words across genres: spoken transcripts, fiction, magazines, newspapers, and academic writing. Spoken samples derive from broadcast outlets such as National Public Radio, transcripts of C-SPAN events, and television programs aired on networks including ABC, CBS, and NBC. Fiction selections draw on works published by houses like Penguin Random House, HarperCollins, and Simon & Schuster. Magazine and newspaper sections include material from Time (magazine), The Atlantic, The New Yorker, and regional papers such as the Los Angeles Times and the Chicago Tribune. Academic prose reflects articles indexed by JSTOR, PubMed, and publishers like Cambridge University Press and Springer Science+Business Media.

Design decisions mirror methodologies established in resources like the Penn Treebank and the Switchboard Corpus: systematic sampling by year, balanced genre representation, and tokenization standards influenced by initiatives at Carnegie Mellon University and University of Pennsylvania. Metadata fields record publication date, source, author when available, and genre, enabling comparisons with corpora such as the Enron Corpus and the Corpus of Historical American English.
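The metadata fields described above (publication date, source, author, genre) can be sketched as a simple record type. This is an illustrative model only; the field names and the balance check are assumptions for exposition, not COCA's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative record for one corpus text; field names are assumptions,
# not COCA's actual data format.
@dataclass
class CorpusText:
    year: int              # publication date, sampled systematically by year
    source: str            # e.g. a newspaper, magazine, or broadcast outlet
    genre: str             # spoken, fiction, magazine, newspaper, or academic
    author: Optional[str]  # recorded when available
    words: int             # token count, used for balance bookkeeping

def genre_distribution(texts):
    """Share of tokens per genre, for checking balanced genre representation."""
    totals = {}
    for t in texts:
        totals[t.genre] = totals.get(t.genre, 0) + t.words
    grand = sum(totals.values())
    return {g: n / grand for g, n in totals.items()}
```

A compiler could run a check like `genre_distribution(sampled_texts)` per year to verify that each genre holds roughly its target share before enqueuing more material.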

Data access and tools

Access is provided through a web interface and downloadable subcorpora, with query functionality inspired by concordancers used at Lancaster University and tools developed at Brigham Young University. Users can run collocation searches, generate frequency lists, and perform keyness analyses; these functions resemble services offered by Sketch Engine and the Corpus Workbench (CWB). APIs and export options support integration with programming environments at the Massachusetts Institute of Technology and Stanford University using libraries such as NLTK, spaCy, and TensorFlow for computational workflows. Educational users from institutions like the University of Michigan and the University of Toronto commonly use the corpus in coursework alongside corpora like the Corpus of Late Modern English.
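The query types mentioned above can be sketched in plain Python. This is a toy illustration of what a frequency list and a window-based collocation search compute, not COCA's interface or API:

```python
from collections import Counter

def frequency_list(tokens):
    """Rank word forms by raw frequency, as in a corpus frequency list."""
    return Counter(tokens).most_common()

def collocates(tokens, node, window=4):
    """Count words co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

tokens = "the corpus supports research on the corpus and corpus tools".split()
# frequency_list(tokens)[0] -> ('corpus', 3)
```

Real concordancers additionally rank collocates by association measures such as mutual information or log-likelihood rather than raw counts, which is what distinguishes a keyness analysis from a simple frequency comparison.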

Applications and research uses

The corpus has supported lexical, syntactic, and discourse studies by researchers at Princeton University, the University of Chicago, Duke University, and the University of Pennsylvania. Applied work includes lexicography for dictionaries such as Merriam-Webster and supplements to the Oxford English Dictionary, phraseology studies used by translators at institutions like the European Commission's language services, and natural language processing models developed by teams at Google, Microsoft Research, and Facebook AI Research. It informs sociolinguistic studies examining language variation in contexts tied to events like the 2008 recession, the 2016 United States presidential election, and debates surrounding laws such as the Affordable Care Act. Literary scholars at Columbia University and New York University employ the resource for stylometric analyses related to authors represented by Knopf and Hachette Book Group.

Evaluation, limitations, and critiques

Scholars at the University of Illinois at Urbana-Champaign and critics in venues like The Chronicle of Higher Education note strengths in size and balance but raise concerns similar to critiques of the Google Books Ngram data: representativeness, copyright biases, and metadata completeness. Limitations include under-representation of social media platforms run by corporations such as Meta Platforms and Twitter, Inc., variability in transcription quality from broadcasters like CNN and Fox News, and sampling biases relative to national surveys by agencies like the United States Census Bureau. Methodological debates reference standards from the International Corpus of English and argue for transparency in sourcing akin to discussions around the British National Corpus 2014 update.

Category:Corpora