LLMpedia
The first transparent, open encyclopedia generated by LLMs

CoQA

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: MRPC (Hop 5)
Expansion Funnel: Raw 43 → Dedup 0 → NER 0 → Enqueued 0
CoQA
Name: CoQA
Type: Dataset
Domain: Natural language processing
Released: 2018
Creators: Stanford University
Languages: English
Tasks: Conversational question answering
License: Research

CoQA

CoQA (Conversational Question Answering) is a dataset developed to evaluate machine comprehension in multi-turn dialogue. It was introduced in 2018 by researchers at Stanford University and measures conversational understanding across passages from diverse domains, including news articles, Wikipedia, and literature drawn from Project Gutenberg texts. The dataset became a standard benchmark during the period of rapid modeling progress from ELMo and BERT through GPT-3.

Introduction

CoQA was released amid a wave of datasets addressing conversational and reading-comprehension challenges, alongside SQuAD, QuAC, and HotpotQA. It focuses on free-form answers in a conversational setting where each question can depend on previous turns, a phenomenon long studied in dialogue corpora such as the Switchboard Corpus. The creators sought to close gaps highlighted by evaluations at venues such as EMNLP, NAACL, and ACL, particularly the handling of questions that cannot be understood in isolation.

Dataset Composition

The dataset comprises passages drawn from seven domains: children's stories, literature from Project Gutenberg, middle- and high-school English exam passages, news articles from CNN, Wikipedia articles, Reddit posts, and science articles; the last two domains are reserved for out-of-domain evaluation. Each passage is paired with a multi-turn dialogue in which an annotator acting as questioner produces a sequence of questions and an annotator acting as answerer supplies free-form answers grounded in the passage, together with supporting text spans as rationales. In total the dataset contains roughly 127,000 question-answer pairs across about 8,000 conversations, placing it alongside contemporaneous corpora such as SQuAD 2.0 and HotpotQA in scale and variety.
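The passage-plus-dialogue structure described above can be sketched as a small data record. The field names below follow the released JSON format closely (`story`, `questions`, `answers`, `turn_id`, `input_text`, `span_text`), but the exact schema and the example content are illustrative assumptions, not an excerpt from the dataset:

```python
# A minimal sketch of a CoQA-style record. Field names are modeled on the
# released JSON, but treat the exact schema as an assumption.
example = {
    "source": "gutenberg",
    "id": "example-0001",
    "story": ("Once upon a time there lived a princess named Anna. "
              "She had a dog called Rex."),
    "questions": [
        {"turn_id": 1, "input_text": "Who lived once upon a time?"},
        # "her" below only resolves via the previous turn:
        {"turn_id": 2, "input_text": "What was her dog called?"},
    ],
    "answers": [
        {"turn_id": 1, "input_text": "a princess named Anna",
         "span_text": "there lived a princess named Anna"},
        {"turn_id": 2, "input_text": "Rex",
         "span_text": "a dog called Rex"},
    ],
}

def history(example, turn_id):
    """Collect the (question, answer) pairs preceding a given turn."""
    qs = {q["turn_id"]: q["input_text"] for q in example["questions"]}
    ans = {a["turn_id"]: a["input_text"] for a in example["answers"]}
    return [(qs[t], ans[t]) for t in sorted(qs) if t < turn_id]
```

The `history` helper makes the conversational dependency explicit: answering turn 2 requires the question-answer pair from turn 1 to resolve the pronoun "her".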

Task and Evaluation Metrics

The CoQA task requires a model to produce an answer for each question in a conversation, given the passage and all prior turns. The primary metric is a word-level F1 score computed against multiple human reference answers, following the answer-normalization and token-overlap conventions established for SQuAD; exact-match comparison and turn-level accuracy are also reported in some papers. Leaderboard comparisons often reference baselines released by the Stanford NLP Group and systems described in papers at venues such as EMNLP, NeurIPS, and ICLR.
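The word-level F1 described above can be sketched as follows. This is a simplified version of SQuAD-style scoring (lowercasing, punctuation and article stripping, bag-of-tokens overlap, max over references); the official CoQA evaluation script applies additional conventions for handling multiple human references, so treat this as an illustration rather than the official scorer:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, references: list) -> float:
    """Score against multiple references by taking the best match."""
    return max(token_f1(prediction, ref) for ref in references)
```

For example, a prediction of "white house" scores 1.0 against the reference "the White House", since normalization removes the article and case differences before computing overlap.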

Data Collection and Annotation Protocol

Data collection relied on crowdworkers paired in real time through a chat-style annotation interface. One worker acted as questioner, asking conversational, context-dependent questions, while the other acted as answerer, highlighting a supporting span in the passage as a rationale and writing a short free-form answer. Quality control included worker qualification requirements and the collection of additional human answers for the development and test sets, which enables multi-reference evaluation and supports inter-annotator agreement measurement of the kind familiar from annotation studies in the computational linguistics literature.

Baselines and Models

The baselines in the original paper combined span-extraction readers such as DrQA with a pointer-generator network (PGNet) to produce free-form answers; other early systems adapted architectures such as BiDAF and added contextual embeddings like ELMo. Subsequent work applied transformer encoders such as BERT and RoBERTa, and later generative and encoder-decoder models in the style of GPT-2 and T5. These systems leveraged the pretraining and fine-tuning regimes popularized in papers at NeurIPS and ICML, along with attention and memory mechanisms explored in work from Facebook AI Research and DeepMind. Comparative evaluations against models for QuAC and SQuAD were common in conference papers and leaderboard reports.
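A common ingredient across the model families above is how the conversation history is folded into the model input. A widespread recipe when fine-tuning transformer readers is to prepend the last few question-answer turns to the current question before encoding it with the passage. The function below is a simplified, illustrative sketch of that recipe (the marker strings and truncation policy are assumptions, not a specific published system):

```python
def build_input(passage: str, history: list, question: str,
                max_turns: int = 2) -> str:
    """Concatenate passage, recent (question, answer) turns, and the
    current question into one string for a reader model.

    Keeping only the last `max_turns` turns is a common truncation
    heuristic: older turns rarely matter and inflate sequence length.
    """
    turns = []
    for q, a in history[-max_turns:]:
        turns.append("Q: {} A: {}".format(q, a))
    turns.append("Q: {}".format(question))
    return passage + "\n" + " ".join(turns)
```

For a conversation with three prior turns and `max_turns=2`, only the two most recent turns survive in the encoded input, which is often enough to resolve pronouns and ellipsis in the current question.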

Impact and Applications

CoQA influenced research on conversational agents deployed in contexts resembling question-answering assistants from Amazon and Microsoft, and shaped evaluation practices for systems evaluated in workshops at ACL and NAACL. It spurred methodological advances in coreference handling, contextual understanding, and multi-turn reasoning that informed later datasets and model families such as those presented by OpenAI and Google Research. CoQA's design principles also appeared in industrial evaluation suites at organizations like Facebook and academic courses at Stanford University that teach modern natural language understanding techniques.

Category:Datasets in natural language processing