| SuperGLUE | |
|---|---|
| Name | SuperGLUE |
| Introduced | 2019 |
| Developers | New York University; Facebook AI Research; University of Washington; DeepMind |
| Purpose | Natural language understanding benchmark |
| Tasks | Eight NLU tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC) plus diagnostic sets |
| License | Open benchmark |
SuperGLUE
SuperGLUE is a benchmark for evaluating general-purpose natural language understanding (NLU) systems. It was introduced in 2019 (Wang et al., NeurIPS 2019) by researchers at New York University, Facebook AI Research, the University of Washington, and DeepMind as a more difficult successor to the GLUE benchmark, after rapid progress in large-scale pretraining left GLUE with little remaining headroom for measuring further improvement.
SuperGLUE was designed to address GLUE's main limitation: within roughly a year of GLUE's release, state-of-the-art models had approached or surpassed its estimated non-expert human baseline. The new benchmark retains two of the hardest GLUE tasks, recognizing textual entailment (RTE) and a Winograd-schema coreference task, and adds six new tasks selected for difficulty, diversity of task formats, and the availability of human performance estimates. The selection deliberately emphasizes reasoning-heavy problems such as commonsense inference and multi-sentence comprehension rather than single-sentence classification.
SuperGLUE comprises eight tasks drawn from different data sources: BoolQ, yes/no questions paired with Wikipedia passages; CommitmentBank (CB), a three-class textual entailment task over clause-embedded propositions; Choice of Plausible Alternatives (COPA), causal reasoning between a premise and two candidate causes or effects; MultiRC, multi-sentence reading comprehension in which each question can have several correct answer options; ReCoRD, cloze-style reading comprehension over news articles that requires commonsense reasoning; RTE, recognizing textual entailment; Word-in-Context (WiC), deciding whether a word is used with the same sense in two sentences; and the Winograd Schema Challenge (WSC), pronoun coreference resolution requiring commonsense knowledge. The suite also includes diagnostic sets: a broad-coverage linguistic analysis set inherited from GLUE and the Winogender set for measuring gender bias in coreference.
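A minimal sketch of how the task suite is commonly accessed in practice, assuming the Hugging Face `datasets` package, which distributes the benchmark under the `super_glue` name with one configuration per task (the package and field names are conventions of that library, not part of the benchmark itself):

```python
# Inspect the SuperGLUE task suite via the Hugging Face `datasets` package
# (an access path assumed here for illustration; the official data is also
# available as a download from the benchmark site).
from datasets import load_dataset

TASKS = ["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]

for task in TASKS:
    ds = load_dataset("super_glue", task)
    # Each task ships train/validation/test splits; test labels are withheld
    # (test examples carry a placeholder label of -1).
    print(task, {split: ds[split].num_rows for split in ds})

# Example: one CommitmentBank (CB) entailment instance from the validation split.
cb = load_dataset("super_glue", "cb", split="validation")
print(cb[0]["premise"])
print("=>", cb[0]["hypothesis"], "| label:", cb[0]["label"])
```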
SuperGLUE uses an aggregate scoring scheme: the overall benchmark score is the unweighted average of per-task scores. Most tasks are scored by accuracy; CB additionally uses macro-averaged F1, MultiRC uses exact match over each question's answer set together with F1 over all answer options, and ReCoRD uses F1 and exact match. When a task has two metrics, their mean is taken as that task's score. Evaluation enforces fixed train/dev/test splits, with test labels withheld and scored only through server-side submission to the public leaderboard at super.gluebenchmark.com, following the protocol established by GLUE.
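The aggregation can be illustrated with a short sketch: per-task scores are averaged over that task's metrics, and the benchmark score is the unweighted mean across tasks. The metric names follow the published setup; the numeric values below are made up for illustration.

```python
# Sketch of SuperGLUE-style score aggregation (illustrative values only).
from statistics import mean

task_metrics = {
    "BoolQ":   {"acc": 0.77},
    "CB":      {"acc": 0.83, "macro_f1": 0.75},
    "COPA":    {"acc": 0.71},
    "MultiRC": {"f1a": 0.70, "em": 0.24},   # F1 over answer options, exact match
    "ReCoRD":  {"f1": 0.72, "em": 0.71},
    "RTE":     {"acc": 0.79},
    "WiC":     {"acc": 0.70},
    "WSC":     {"acc": 0.64},
}

def task_score(metrics):
    # A task with several metrics contributes the mean of those metrics.
    return mean(metrics.values())

overall = mean(task_score(m) for m in task_metrics.values())
print(f"SuperGLUE score: {100 * overall:.1f}")
```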
Initial baselines in the SuperGLUE paper included a most-frequent-class baseline, a bag-of-words model, and BERT-based systems (BERT and BERT++, the latter adding intermediate training on related tasks), all of which left a substantial gap to the estimated human baseline. Subsequent leaderboard entries built on larger pretrained transformers such as RoBERTa, T5, and DeBERTa progressively narrowed this gap, and top submissions eventually matched or exceeded the human baseline, making SuperGLUE one of the first broad NLU benchmarks to be effectively saturated.
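As a rough illustration of how such BERT-style baselines are reproduced today, the sketch below fine-tunes a classifier on one SuperGLUE task (BoolQ) using the Hugging Face `transformers` and `datasets` packages; the model name and hyperparameters are illustrative choices, not those of the original baselines.

```python
# Fine-tune a BERT-style baseline on BoolQ (illustrative setup, not the
# configuration used in the original SuperGLUE paper).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("super_glue", "boolq")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch):
    # BoolQ pairs a passage with a yes/no question; labels are 0 (no) / 1 (yes).
    return tokenizer(batch["question"], batch["passage"],
                     truncation=True, padding="max_length", max_length=256)

encoded = dataset.map(encode, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)

args = TrainingArguments(output_dir="boolq-baseline",
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
print(trainer.evaluate())  # reports eval loss; accuracy needs a compute_metrics fn
```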
SuperGLUE accelerated work on transfer learning and large-scale pretraining for NLU, but it also attracted criticism. Researchers argued that the benchmark can be effectively gamed by ever-larger pretraining regimes, and that high leaderboard scores may reflect dataset artifacts and annotation biases rather than genuine language understanding. Others advocated complementary evaluation practices, including adversarially collected test sets, controlled diagnostic probes, and behavioral testing, to better distinguish robust generalization from surface-pattern exploitation.
Following SuperGLUE, the research community pursued broader benchmarks and challenge suites inspired by its design, including multilingual counterparts, robustness- and adversarially-oriented evaluations, and large aggregated suites for assessing general-purpose and few-shot models. The benchmark's tasks remain widely used for fine-tuning and evaluation; they are distributed through the Hugging Face datasets hub, and results are tracked on the official leaderboard and on aggregator sites such as Papers with Code.
Category:Benchmarks in natural language processing