GLUE benchmark

The GLUE (General Language Understanding Evaluation) benchmark is a widely cited evaluation suite for assessing the performance of natural language understanding systems. It was introduced in 2018 by researchers affiliated with New York University, the University of Washington, and DeepMind to provide a standardized comparison across diverse tasks and models, and it was quickly adopted as an evaluation target by academic and industrial NLP groups.
GLUE measures progress on sentence- and sentence-pair understanding by assembling multiple existing datasets into a single benchmark, promoting apples-to-apples comparisons among models from different labs. The accompanying paper was presented at the EMNLP 2018 BlackboxNLP workshop and subsequently published at ICLR 2019, and the benchmark drew broad attention at NLP and machine learning venues, spurring a large body of follow-on work.
GLUE aggregates nine English sentence- and sentence-pair tasks drawn from existing datasets: the Corpus of Linguistic Acceptability (CoLA), the Stanford Sentiment Treebank (SST-2), the Microsoft Research Paraphrase Corpus (MRPC), the Semantic Textual Similarity Benchmark (STS-B), Quora Question Pairs (QQP), Multi-Genre Natural Language Inference (MNLI), a question-answering entailment task derived from SQuAD (QNLI), data pooled from the Recognizing Textual Entailment challenges (RTE), and an inference task built from the Winograd Schema Challenge (WNLI). Together the tasks cover acceptability, inference, paraphrase and similarity, and sentiment phenomena, and the suite additionally provides a hand-crafted diagnostic set for probing specific linguistic capabilities.
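The component tasks can be inspected directly. Below is a minimal sketch using the Hugging Face datasets library, whose "glue" configuration mirrors the benchmark's tasks; the library and configuration names are conveniences of this example rather than part of the original benchmark release.

```python
# Sketch: load two GLUE tasks with the Hugging Face `datasets` library.
from datasets import load_dataset

# Single-sentence task: Stanford Sentiment Treebank (SST-2).
sst2 = load_dataset("glue", "sst2")
print(sst2)                       # train / validation / test splits (test labels hidden)
print(sst2["train"][0])           # {'sentence': ..., 'label': 1, 'idx': 0}

# Sentence-pair task: Microsoft Research Paraphrase Corpus (MRPC).
mrpc = load_dataset("glue", "mrpc")
print(mrpc["train"][0])           # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
```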
GLUE defines an aggregated scoring protocol that combines per-task metrics: Matthews correlation for CoLA, accuracy and F1 for MRPC and QQP, Pearson and Spearman correlation for STS-B, and accuracy for the remaining tasks. Where a task reports two metrics, they are averaged, and the overall GLUE score is the unweighted mean of the per-task scores. Results are tracked on a public leaderboard; to discourage overfitting, test-set labels are withheld and participants submit model predictions to an evaluation server, mirroring the held-out test practice of shared tasks such as WMT and SemEval.
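As a concrete illustration of the aggregation, the sketch below averages metrics within each task and then macro-averages across tasks; the per-task numbers are placeholders, not real leaderboard results, and MNLI matched/mismatched accuracy is collapsed into a single entry for brevity.

```python
# Sketch of GLUE-style score aggregation with placeholder per-task results.
from statistics import mean

task_scores = {
    "CoLA":  [0.52],        # Matthews correlation
    "SST-2": [0.93],        # accuracy
    "MRPC":  [0.88, 0.84],  # F1, accuracy
    "STS-B": [0.87, 0.86],  # Pearson, Spearman correlation
    "QQP":   [0.71, 0.89],  # F1, accuracy
    "MNLI":  [0.84],        # accuracy (matched/mismatched collapsed here)
    "QNLI":  [0.90],        # accuracy
    "RTE":   [0.66],        # accuracy
    "WNLI":  [0.56],        # accuracy
}

# Average within tasks that report two metrics, then macro-average across tasks.
per_task = {task: mean(scores) for task, scores in task_scores.items()}
glue_score = mean(per_task.values())
print(f"Overall GLUE score: {100 * glue_score:.1f}")
```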
Initial GLUE baselines were BiLSTM sentence encoders, with and without attention and with optional pretrained ELMo or CoVe embeddings. The emergence of pretrain-then-fine-tune regimes built on the Transformer architecture, notably OpenAI's GPT and Google's BERT, rapidly changed leaderboard dynamics: the public leaderboard tracked a steady stream of submissions from academic groups and corporate labs, and within little more than a year the strongest systems had surpassed the human performance estimates reported for the benchmark.
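The pretrain-then-fine-tune recipe that drove these gains is straightforward to reproduce in outline. The sketch below fine-tunes a generic pretrained encoder on one GLUE task with the Hugging Face transformers library; the model name and hyperparameters are illustrative assumptions, not the settings used by any leaderboard submission.

```python
# Hypothetical sketch: fine-tune a pretrained encoder on the RTE task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"   # assumption: any pretrained encoder would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "rte")   # small two-class entailment task

def tokenize(batch):
    # RTE is a sentence-pair task, so both sentences are encoded together.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rte-finetune",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
print(trainer.evaluate())   # reports eval loss; add a compute_metrics fn for accuracy
```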
GLUE shaped research agendas at universities and corporate labs alike by providing a common evaluation target. Critics argued that it rewarded benchmark performance over robustness, generalization, and interpretability, and that annotation artifacts in some component datasets let models score well by exploiting shallow heuristics rather than demonstrating understanding. Related debates, raised in reviews and at venues such as NeurIPS, ICLR, and ACL, highlighted dataset bias, representativeness, and the transferability of benchmark gains to applied settings.
Following GLUE, several successors and extensions expanded coverage to harder, multilingual, commonsense, and robustness-oriented evaluations. SuperGLUE (2019), built largely by the same group, introduced a more difficult task suite after top GLUE submissions exceeded the human baseline; multilingual suites such as Google's XTREME and Microsoft's XGLUE broadened language coverage; and adversarially collected benchmarks such as Adversarial NLI from Facebook AI Research targeted robustness. These successor efforts have been presented at venues including ACL, EMNLP, NeurIPS, and ICLR and are widely used across academic and industrial labs.