| HotpotQA | |
|---|---|
| Name | HotpotQA |
| Released | 2018 |
| Domain | Question answering, natural language processing |
| Developers | Carnegie Mellon University; Stanford University; Mila (Université de Montréal) |
| License | CC BY-SA 4.0 |
| Language | English |
| Tasks | Multi-hop question answering, explainable QA |
HotpotQA is a benchmark dataset for multi-hop question answering, developed to evaluate reasoning that combines evidence from multiple supporting documents. It was introduced at EMNLP 2018 by researchers from Carnegie Mellon University, Stanford University, and Mila (Université de Montréal). HotpotQA sits in a line of reading-comprehension benchmarks that includes SQuAD, WikiHop, Natural Questions, DROP, and RACE, and has influenced subsequent datasets and systems in that line of work.
HotpotQA contains about 113,000 question-answer pairs that require reasoning over multiple Wikipedia articles, and was released to address limitations of single-hop benchmarks such as SQuAD and TriviaQA. Each question is annotated with sentence-level supporting facts, so systems can be evaluated on explainability as well as answer accuracy. The dataset has been widely cited in research from institutions including Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, Google Research, Facebook AI Research, and Microsoft Research.
Questions were collected through crowdsourcing on Amazon Mechanical Turk. Workers were shown pairs of paragraphs from English Wikipedia, connected by hyperlinks between articles, and asked to write questions that require both paragraphs to answer, along with the supporting sentences. The dataset defines two evaluation settings: a "distractor" setting, in which the two gold paragraphs are mixed with eight distractor paragraphs retrieved with a TF-IDF retriever, and a "full wiki" setting, in which systems must retrieve evidence from the whole of Wikipedia. Annotation quality was controlled through worker screening and review of the collected examples.
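The released files are plain JSON. Below is a minimal sketch of reading one record, assuming the distractor-setting training file as distributed at hotpotqa.github.io; the filename and field names match the v1.1 release, but verify them against the copy you download.

```python
import json

# Load the distractor-setting training file (filename per the v1.1
# release at hotpotqa.github.io; adjust the path for your copy).
with open("hotpot_train_v1.1.json") as f:
    examples = json.load(f)

ex = examples[0]
print(ex["question"])           # natural-language multi-hop question
print(ex["answer"])             # answer span, or "yes"/"no"
print(ex["type"], ex["level"])  # "bridge" or "comparison"; difficulty

# "context" is a list of [title, [sentence, ...]] pairs: in the
# distractor setting, 2 gold paragraphs plus 8 distractors.
for title, sentences in ex["context"]:
    print(title, len(sentences))

# "supporting_facts" is a list of [title, sentence_index] pairs naming
# the gold sentences a system must return for the explainability score.
for title, sent_id in ex["supporting_facts"]:
    print(title, sent_id)
```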
HotpotQA supports several evaluation tasks: answer span extraction, yes/no classification, supporting-fact identification, and end-to-end QA over Wikipedia. Answers are scored with exact match (EM) and F1, following the convention established by SQuAD; supporting facts are scored with EM, precision, recall, and F1 over (title, sentence) pairs; and joint metrics combine the two so that a system is rewarded only when it both answers correctly and identifies the evidence. An official evaluation script and leaderboard are maintained by the dataset authors, and copies of the data are distributed through repositories such as Hugging Face.
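A minimal sketch of the SQuAD-style answer metrics follows; the normalization steps mirror the published convention, while the function names are illustrative. The official HotpotQA script additionally scores supporting facts as sets of (title, sentence index) pairs and derives joint metrics from the products of answer and supporting-fact precision and recall.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # SQuAD-style answer normalization: lowercase, strip punctuation,
    # drop English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: partial token overlap yields partial credit.
print(exact_match("Barack Obama", "Obama"))         # 0.0
print(round(f1_score("Barack Obama", "Obama"), 2))  # 0.67
```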
Early baselines used bi-directional attention architectures and span-prediction models in the BiDAF tradition, later supplanted by transformer encoders such as BERT. Subsequent approaches incorporated graph neural networks over entity graphs, explicit multi-hop reasoning modules, and retrieval-augmented generation. Joint models combine answer prediction with supporting-fact supervision in a multi-task setup, while pretraining strategies draw on corpora such as Common Crawl and BooksCorpus and on model checkpoints released through the Hugging Face and TensorFlow communities.
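A minimal sketch of this joint setup, under the assumption of a generic contextual encoder: a span head scores start and end positions per token, and a sentence-level head flags supporting facts. The class and function names are illustrative, not the original baseline's code, and the sentence states would in practice be pooled from token states.

```python
import torch
import torch.nn as nn

class JointQAHead(nn.Module):
    """Span-extraction head plus a sentence-level supporting-fact head,
    both reading from the states of a contextual encoder."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.span = nn.Linear(hidden_size, 2)  # start/end logit per token
        self.sp = nn.Linear(hidden_size, 1)    # supporting-fact logit per sentence

    def forward(self, token_states, sentence_states):
        # token_states: (batch, seq_len, hidden)
        # sentence_states: (batch, n_sents, hidden), pooled per sentence
        start_logits, end_logits = self.span(token_states).unbind(dim=-1)
        sp_logits = self.sp(sentence_states).squeeze(-1)
        return start_logits, end_logits, sp_logits

def joint_loss(start_logits, end_logits, sp_logits, starts, ends, sp_labels):
    # Cross-entropy over gold start/end positions, plus a binary loss
    # over which sentences are supporting facts.
    ce = nn.CrossEntropyLoss()
    bce = nn.BCEWithLogitsLoss()
    return ce(start_logits, starts) + ce(end_logits, ends) + bce(sp_logits, sp_labels)
```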
Analyses of HotpotQA revealed challenges including spurious correlations that permit single-hop shortcuts, limited resilience of the distractor setting, entity disambiguation, and interpretability, echoing concerns raised about datasets such as SQuAD and Natural Questions. Error analyses point to coreference resolution and long-context reasoning as persistent difficulties. Mitigation strategies explored in follow-up work include adversarial distractor construction and refinement of the annotation protocol.
HotpotQA has been used to benchmark multi-hop reasoning in academic projects, in industry evaluations at groups such as Google Research and Microsoft Research, and in shared tasks at venues such as NAACL and ACL. It influenced the design of subsequent datasets and models, informing work on explainable QA, retrieval-augmented generation, and the integration of structured knowledge from Wikidata and DBpedia. The dataset contributed to a better understanding of the model interpretability and retrieval challenges relevant to deployed question-answering systems at companies such as Amazon and Apple and at research labs worldwide.
Category:Question answering datasets