LLMpedia: the first transparent, open encyclopedia generated by LLMs

DROP (dataset)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: SQuAD (Hop 4)
Expansion funnel: Raw 55 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 55
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
DROP (dataset)
Name: DROP
Full name: Discrete Reasoning Over the content of Paragraphs
Creators: Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner
Released: 2019
Domain: Reading comprehension, question answering
License: Academic

DROP is a reading-comprehension benchmark designed to evaluate models on numerical, temporal, and logical reasoning over paragraphs. It was introduced to challenge models with questions involving arithmetic, counting, sorting, and comparison over passages drawn from Wikipedia. The dataset has been influential in the development of neural models focused on symbolic manipulation, compositional reasoning, and weakly supervised learning.

Introduction

DROP was introduced at NAACL 2019 by researchers at the University of California, Irvine and the Allen Institute for AI to push beyond span-selection benchmarks such as SQuAD. The dataset comprises passages drawn from Wikipedia, with heavy representation of National Football League game summaries and history articles, paired with crowdsourced questions. It builds on a line of earlier reading-comprehension datasets such as RACE, NarrativeQA, MultiRC, and HotpotQA, but emphasizes questions that require aggregation, subtraction, counting, and multi-step inference over the numbers and dates mentioned in a passage.

Dataset Creation and Annotation

Question-answer pairs were written by crowdworkers recruited on Amazon Mechanical Turk. Collection was adversarial: a baseline reading-comprehension model (BiDAF) was kept in the loop, and candidate questions that the baseline answered correctly were rejected, forcing workers to write questions that demand genuine discrete reasoning rather than surface span matching. Passages were sampled from Wikipedia, and answers were validated by collecting responses from additional annotators. The released data includes passage IDs, query IDs, and typed answer fields covering numbers, dates, and spans.
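DROP's questions were collected adversarially: a baseline model (BiDAF in the paper) ran in the loop, and questions it answered correctly were rejected. A minimal sketch of that filtering step, with an invented stand-in baseline and toy data:

```python
def adversarial_filter(candidates, baseline_answer, gold):
    """Keep only crowdsourced questions that the baseline model gets
    wrong; questions the baseline answers correctly are rejected."""
    kept = []
    for passage, question in candidates:
        if baseline_answer(passage, question) != gold[(passage, question)]:
            kept.append((passage, question))
    return kept

# Invented stand-in baseline: always answers with the passage's first token.
baseline = lambda passage, question: passage.split()[0]

cands = [
    ("14 points came in the first quarter.", "How many points came first?"),
    ("The Bears beat the Lions.", "Who lost the game?"),
]
gold = {cands[0]: "14", cands[1]: "Lions"}

# The first question is rejected (the baseline answers "14" correctly);
# the second survives the filter.
print(adversarial_filter(cands, baseline, gold))
```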

Task Definition and Formats

DROP frames reading comprehension as a mapping from a passage and question to an answer that may be a number, a date, or one or more spans of text. Supporting these answer types requires operations such as addition, subtraction, counting, and sorting, paralleling operations formalized in work on semantic parsing and program synthesis. The data is distributed as JSON, keyed by passage ID, with each entry carrying its passage text and a list of question-answer pairs whose answers are typed; the dataset also shipped crowdsourced annotation guidelines comparable to those used for HotpotQA and CoQA.
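As an illustration of the answer formats, the following sketch parses a DROP-style JSON record. The field layout (passage keyed by ID, `qa_pairs` with `number`/`date`/`spans` answer slots) follows the released files, but the passage text, IDs, and values here are invented:

```python
import json

# Illustrative DROP-style record; field names mirror the released JSON
# layout, but all content is made up for this example.
record = json.loads("""
{
  "nfl_1184": {
    "passage": "The Bears scored 14 points in the first quarter and 7 in the second.",
    "qa_pairs": [
      {
        "query_id": "q1",
        "question": "How many points did the Bears score in the first half?",
        "answer": {"number": "21",
                   "date": {"day": "", "month": "", "year": ""},
                   "spans": []}
      }
    ]
  }
}
""")

def answer_type(ans):
    """Classify an answer dict into one of DROP's answer types:
    a number, a date, or one or more text spans."""
    if ans["number"]:
        return "number"
    if any(ans["date"].values()):
        return "date"
    return "spans"

for pid, entry in record.items():
    for qa in entry["qa_pairs"]:
        print(pid, qa["query_id"], answer_type(qa["answer"]))  # nfl_1184 q1 number
```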

Evaluation Metrics and Baselines

Evaluation on DROP combines exact match with a numerically aware F1: token-level F1 is computed over normalized answers, but a prediction receives no credit when its numbers disagree with the gold answer. The paper reported baselines including BiDAF and QANet, alongside NAQANet, a numerically aware extension of QANet with output heads for counting and signed arithmetic over passage numbers. Later transformer-based models such as BERT, RoBERTa, and T5 were adapted to support discrete operations, a public leaderboard tracks progress, and ablation studies have compared end-to-end neural approaches with hybrid symbolic-neural systems.
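A simplified sketch of the numerically aware F1, assuming the rule that mismatched numbers zero out the score; the official evaluation script additionally handles dates, multi-span alignment, and richer normalization:

```python
import re
import string

def normalize(text):
    # Lowercase, strip punctuation, drop articles: a simplified version
    # of the normalization applied before token-level F1.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def numbers_match(pred_tokens, gold_tokens):
    # Numeric tokens must agree exactly; otherwise no F1 credit is given.
    num = re.compile(r"^\d+(\.\d+)?$")
    pred_nums = {t for t in pred_tokens if num.match(t)}
    gold_nums = {t for t in gold_tokens if num.match(t)}
    return pred_nums == gold_nums

def drop_f1(pred, gold):
    p, g = normalize(pred), normalize(gold)
    if not numbers_match(p, g):
        return 0.0
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 1.0 if p == g else 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(drop_f1("21 points", "21"))  # numbers agree -> partial token credit
print(drop_f1("14", "21"))         # numbers disagree -> 0.0
```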

Results and Analysis

Initial baselines showed that pretrained transformers such as BERT achieved only moderate performance and struggled on multi-step arithmetic and temporal reasoning. Subsequent models incorporating explicit arithmetic modules, program-synthesis components, or discrete reasoning layers substantially improved scores. Error analyses revealed common failure modes, including implicit coreference, multi-hop temporal reasoning akin to phenomena studied in temporal information extraction, and failure to generalize across numeric scales, echoing challenges reported in research on compositional generalization and diagnostic datasets such as SCAN.

Impact and Applications

DROP influenced research on neuro-symbolic methods, prompting the integration of arithmetic solvers and constrained decoding strategies into reading-comprehension systems. It has been used to evaluate systems in academic competitions and workshops at conferences such as NeurIPS, ICML, and ACL, and it appears in standard evaluation suites for large language models. Applications include improved question-answering modules in open-domain QA pipelines, and the dataset catalyzed follow-up benchmarks exploring numerical reasoning in domains such as finance, law, and medicine.

Category:Datasets