LLMpedia: The first transparent, open encyclopedia generated by LLMs

GLUE

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Expansion funnel: 60 extracted → 0 after deduplication → 0 after NER → 0 enqueued
GLUE
Name: GLUE
Full name: General Language Understanding Evaluation
Created: 2018
Creators: Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
Field: Natural language understanding benchmarks
Hosted by: New York University


GLUE is a benchmark suite for evaluating performance on a collection of natural language understanding tasks. It provides standardized datasets and metrics used by research groups at institutions such as Google Research, OpenAI, Facebook AI Research, Microsoft Research and DeepMind to compare models such as BERT, RoBERTa, XLNet, ALBERT and T5. The benchmark influenced evaluation practices at conferences including NeurIPS, ICML and ACL.

Overview

GLUE aggregates multiple publicly available corpora into a unified evaluation protocol for measuring general language understanding. Its tasks draw on existing resources such as the Stanford Sentiment Treebank, MultiNLI, SQuAD (from which the QNLI task is derived), CoLA and RTE, enabling comparisons across model families including transformer-based architectures. The benchmark's public leaderboard, maintained by the team behind the original paper, became a focal point at workshops co-located with EMNLP and ACL.

History and Development

GLUE was introduced in 2018 by researchers at New York University, the University of Washington and DeepMind to address fragmentation in task-specific evaluations. Early systems evaluated on the benchmark included ELMo and BERT, later followed by RoBERTa and ALBERT, which reported successive improvements on the GLUE leaderboard. The benchmark follows evaluation traditions established by earlier community efforts such as the Message Understanding Conferences, the Text REtrieval Conference and the CoNLL shared tasks.

Tasks and Benchmark Components

GLUE comprises nine sentence-level and sentence-pair tasks assembled from existing datasets: acceptability judgments from CoLA, sentiment analysis from the Stanford Sentiment Treebank (SST-2), semantic similarity from STS-B, paraphrase identification from MRPC and QQP, and natural language inference from MultiNLI, QNLI (derived from SQuAD), RTE and WNLI (derived from the Winograd Schema Challenge). The benchmark also defines a hand-crafted diagnostic dataset that probes linguistic phenomena such as coreference, negation and world knowledge, building on analysis suites explored in resources like OntoNotes. The leaderboard reports both per-task scores and an aggregate metric reflecting overall performance.
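The task inventory above can be summarized programmatically. The following sketch lists the nine tasks with their task type and primary metric as given in the GLUE paper; the data structure and helper function are illustrative, not part of any official GLUE distribution.

```python
# Sketch: the nine GLUE tasks, each mapped to (task type, primary metric).
# Task names and metrics follow the GLUE paper; the dict itself is illustrative.
GLUE_TASKS = {
    "CoLA":  ("acceptability", "Matthews correlation"),
    "SST-2": ("sentiment",     "accuracy"),
    "MRPC":  ("paraphrase",    "F1 / accuracy"),
    "QQP":   ("paraphrase",    "F1 / accuracy"),
    "STS-B": ("similarity",    "Pearson / Spearman correlation"),
    "MNLI":  ("inference",     "accuracy"),
    "QNLI":  ("inference",     "accuracy"),
    "RTE":   ("inference",     "accuracy"),
    "WNLI":  ("inference",     "accuracy"),
}

def tasks_by_type(task_type):
    """Return the names of all GLUE tasks of a given type."""
    return [name for name, (ttype, _) in GLUE_TASKS.items() if ttype == task_type]

print(tasks_by_type("inference"))  # ['MNLI', 'QNLI', 'RTE', 'WNLI']
```

Grouping the tasks this way makes the suite's skew visible: four of the nine tasks are inference tasks, which is part of why natural language inference dominated GLUE-era evaluation.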

Evaluation Metrics and Results

GLUE uses task-specific metrics: Matthews correlation for CoLA, accuracy for most classification tasks, F1 alongside accuracy for the paraphrase tasks MRPC and QQP, and Pearson/Spearman correlations for the semantic similarity task STS-B. These per-task scores are averaged into a single aggregate number used to rank models on the public leaderboard maintained by the original GLUE team, with results also tracked on platforms such as Papers with Code and by the Hugging Face community. Milestone results on GLUE included BERT, followed by gains from RoBERTa, XLNet, and distilled families such as DistilBERT, with research articles at venues like EMNLP and NeurIPS documenting the performance increases.
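The metrics above are standard statistics, so they can be sketched in a few lines of plain Python. This is a minimal illustration of the formulas (Matthews correlation for CoLA-style binary labels, Pearson correlation for STS-B-style scores, and an unweighted macro-average as the aggregate); it is not the official GLUE evaluation code.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (used for CoLA)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def pearson_corr(x, y):
    """Pearson correlation (used, with Spearman, for STS-B)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def glue_aggregate(per_task_scores):
    """Unweighted macro-average of per-task scores, as on the GLUE leaderboard."""
    return sum(per_task_scores.values()) / len(per_task_scores)
```

Because the aggregate is an unweighted average, a one-point gain on a small task like WNLI moves the overall score as much as a one-point gain on the much larger MNLI, a property often cited in discussions of leaderboard gaming.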

Criticisms and Limitations

GLUE received criticism from its own authors and from subsequent researchers in academia and industry labs for encouraging overfitting to leaderboards and model tuning that exploited dataset artifacts, issues documented in analyses by teams at the Allen Institute for AI and Facebook AI Research. Other concerns included limited coverage of multilingual phenomena and insufficient evaluation of reasoning and commonsense skills, gaps later targeted by benchmarks such as CommonsenseQA. These critiques motivated calls for diagnostic datasets and stress tests modeled after tasks used in competitions such as SemEval.

Influence and Successors

GLUE catalyzed a family of successor benchmarks, most directly SuperGLUE, which was designed largely by the original GLUE authors to address shortcomings in the first suite, along with multilingual and robustness-focused benchmarks such as XTREME, CLUE, Adversarial NLI and Dynabench. It shaped evaluation practices adopted in major language model releases from OpenAI, Google Research, Microsoft Research and Facebook AI Research, and influenced dataset curation and leaderboards associated with Papers with Code and the Hugging Face community. Its evaluation paradigms continue to inform benchmarks featured at conferences such as ACL, NeurIPS and ICLR.

Category:Natural language processing benchmarks