LLMpedia: The first transparent, open encyclopedia generated by LLMs

Stanford Question Answering Dataset

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: BERT (Hop 5)
Expansion Funnel: Raw 51 → Dedup 0 → NER 0 → Enqueued 0
Stanford Question Answering Dataset
Name: Stanford Question Answering Dataset
Abbreviation: SQuAD
Released: 2016
Developers: Stanford University Natural Language Processing Group
Domain: Reading comprehension, machine comprehension
Tasks: Question answering, span extraction
License: CC BY-SA 4.0

The Stanford Question Answering Dataset (SQuAD) is a widely used reading comprehension dataset developed at Stanford University for evaluating machine comprehension and question answering systems. It contains crowdsourced question–answer pairs aligned to passages drawn from Wikipedia articles and has served as a benchmark for research at organizations such as Google Research, Microsoft Research, Facebook AI Research, DeepMind, and academic groups at the Massachusetts Institute of Technology and the University of California, Berkeley. The dataset catalyzed advances in question answering models from the Allen Institute for AI and Carnegie Mellon University as well as industry labs at Amazon Research and IBM Watson.

Overview

SQuAD was introduced in 2016 to measure performance on extractive question answering over natural-language passages from Wikipedia articles on subjects such as Barack Obama, World War II, The Great Gatsby, and Marie Curie. Two variants are widely used: SQuAD v1.1, which focuses on answer span prediction, and SQuAD v2.0 (2018), which adds unanswerable questions written to resemble answerable ones, motivated by adversarial analyses of reading comprehension systems. The dataset influenced benchmarks such as the General Language Understanding Evaluation (GLUE) and spurred shared tasks and competitions at venues such as the Conference on Empirical Methods in Natural Language Processing and the International Conference on Machine Learning.

Construction and Dataset Composition

Passages in SQuAD were sampled from 536 Wikipedia articles curated for breadth, spanning biographies such as Isaac Newton, events such as the French Revolution, and works such as Hamlet. Crowdworkers recruited through Amazon Mechanical Turk wrote question–answer pairs; answers are contiguous spans of the passage, a design choice that contrasts with synthetic corpora used in projects at Microsoft Research Asia and Facebook AI Research (FAIR). Version 1.1 contains over 100,000 question–answer pairs across roughly 23,000 paragraphs, and v2.0 adds more than 50,000 unanswerable questions; for the development and test sets, multiple reference answers were collected per question to make evaluation more robust.
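The released files use a nested JSON layout: articles contain paragraphs, paragraphs contain question–answer groups, and each answer records its span text plus a character offset into the context. A minimal loader might look as follows (the helper name load_squad and the flattened record layout are illustrative, not part of the official release):

```python
import json

def load_squad(path):
    """Flatten the nested SQuAD JSON into per-question records."""
    with open(path) as f:
        dataset = json.load(f)["data"]
    examples = []
    for article in dataset:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                examples.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # v2.0 marks unanswerable questions with is_impossible
                    "is_impossible": qa.get("is_impossible", False),
                    # each answer gives the span text and its character offset
                    "answers": [(a["text"], a["answer_start"])
                                for a in qa["answers"]],
                })
    return examples
```

Because answers are stored as character offsets, a loaded record can be checked directly: the slice context[start:start + len(text)] should reproduce the answer text.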

Tasks and Evaluation Metrics

The core task is extractive question answering: given a passage and a question, a model must predict the start and end indices of an answer span within the passage. Evaluation relies primarily on Exact Match (EM) and token-level F1 score, computed against one or more reference answers; these metrics are also used in benchmarks from groups at OpenAI, Google Research, DeepMind, Facebook AI Research, and the Toyota Technological Institute at Chicago. Progress on SQuAD often tracks improvements in pretraining, exemplified by models from Google AI, OpenAI, and Microsoft Research and by transformer architectures introduced by researchers at Google Brain and Carnegie Mellon University.
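The two metrics can be sketched as follows. The normalization steps (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace) mirror those in the official evaluation script, but this is a simplified single-reference version; the official script additionally takes the maximum score over all reference answers for each question:

```python
import re
import string
from collections import Counter

PUNCT = set(string.punctuation)

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between a predicted span and a gold span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "Eiffel Tower in Paris" against the gold answer "the Eiffel Tower" scores 0 on EM but earns partial F1 credit, since two of its four normalized tokens match the gold span.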

Baselines and Leaderboard

Early baselines included a logistic-regression model released with the dataset by the Stanford team, followed by recurrent neural network readers, compared against simple heuristics from groups at the University of California, Berkeley and the University of Oxford. Subsequent public-leaderboard entries adopted attention-based and transformer-based architectures, notably Google Research's BERT and contextual representations such as the Allen Institute for AI's ELMo, and later large-scale models from OpenAI and Microsoft Research. The leaderboard ecosystem includes contributions from teams at Yandex, Tencent AI Lab, Huawei Noah's Ark Lab, and academic groups at the University of Toronto and ETH Zurich.
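Transformer-based entries typically produce a start logit and an end logit for every token, then select the span maximizing their sum subject to the end not preceding the start and a cap on answer length. A minimal sketch of that decoding step (the function name and the max_answer_len cutoff are illustrative; production systems also handle the v2.0 no-answer case):

```python
import numpy as np

def decode_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) token indices maximizing
    start_logits[i] + end_logits[j] with i <= j < i + max_answer_len."""
    n = len(start_logits)
    best_score, best_span = -np.inf, (0, 0)
    for i in range(n):
        # only consider end positions within the allowed answer length
        for j in range(i, min(i + max_answer_len, n)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span
```

The brute-force double loop is O(n · max_answer_len), which is cheap at typical passage lengths; the length cap also prevents degenerate predictions that span most of the passage.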

Impact and Applications

SQuAD has shaped research in pretraining, transfer learning, and interpretability pursued by groups at Google Brain, OpenAI, Facebook AI Research, and DeepMind. It informed practical systems in industry, influencing QA features in products from Google LLC, Microsoft Corporation, Amazon.com, and search-related efforts at Baidu and Yandex. Academic applications range from curriculum development at Stanford University and Massachusetts Institute of Technology to use as a benchmark in workshops hosted by conferences such as NeurIPS, ACL, EMNLP, and ICLR.

Limitations and Criticisms

Critiques raised by teams at University of Washington, Allen Institute for AI, and Carnegie Mellon University highlight issues such as vulnerability to lexical overlap, annotation artifacts, and overfitting to span-based extraction—limitations also discussed in work associated with Google Research and Facebook AI Research. SQuAD’s reliance on Wikipedia reduces diversity compared with corpora compiled from sources like Common Crawl or domain-specific datasets used by NIH-affiliated teams and World Health Organization studies. The addition of unanswerable questions in v2.0 mitigated some shortcuts, but subsequent analyses by researchers at Stanford University and Princeton University emphasize the need for multi-hop reasoning benchmarks developed in projects from Allen Institute for AI and DeepMind.

Category:Question answering datasets