LLMpedia
The first transparent, open encyclopedia generated by LLMs

XGLUE

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: SIBYLL (hop 5)
Expansion Funnel: Raw 133 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 133
2. After dedup: 0
3. After NER: 0
4. Enqueued: 0
XGLUE
Name: XGLUE
Type: Multilingual benchmark
Domain: Natural language processing
Released: 2020
Creators: Microsoft Research
License: Various

XGLUE is a multilingual benchmark designed to evaluate cross-lingual pretraining and transfer across a range of natural language processing tasks. It aggregates datasets spanning classification, sequence labeling, and generation in many languages to provide a standardized comparison across models. XGLUE has been used in studies comparing multilingual encoders and sequence-to-sequence systems across research groups and industrial labs.

Overview

XGLUE was introduced in the context of research on multilingual models alongside work on BERT (language model), RoBERTa, mBERT, XLM, XLM-RoBERTa, mT5, T5 (machine learning model), GPT-2, GPT-3, BART, MBART, Electra (machine learning), ALBERT, DistilBERT, ERNIE (language model), LaMDA, PaLM, Claude (AI assistant), LLaMA (language model), OPT (language model), BLOOM (language model), DeBERTa, ULMFiT, Transformer (machine learning model), and Seq2Seq, and by labs such as FAIR (Facebook AI Research), Google Research, Microsoft Research, OpenAI, and DeepMind, as a way to benchmark multilingual transfer learning. The benchmark consolidates prior multilingual corpora from projects related to Wikipedia, Common Crawl, WMT (conference), OSCAR (dataset), CCNet, Universal Dependencies, GLUE (benchmark), SuperGLUE, XTREME, CLUE (benchmark), PAWS-X, XNLI, MLQA and aligns tasks for training and evaluation. The initiative drew attention from teams at Stanford University, Massachusetts Institute of Technology, Carnegie Mellon University, University of California, Berkeley, ETH Zurich, University of Cambridge, University of Oxford, Tsinghua University, Peking University, Chinese Academy of Sciences, The Alan Turing Institute, NYU (New York University), KAIST, Seoul National University, University of Toronto, University of Montreal, McGill University, University of Washington, Johns Hopkins University, Imperial College London, Columbia University, Princeton University, Yale University, Harvard University, Caltech.

Dataset Composition and Tasks

XGLUE assembles datasets covering tasks drawn from multilingual resources such as Wikipedia, Tatoeba, and parallel corpora created for WMT (conference), IWSLT (conference), and Europarl, aligned with annotation schemes used in projects like CoNLL and OntoNotes. Task types include sentence classification exemplified by XNLI and PAWS-X, sequence labeling akin to CoNLL-2003, question answering related to SQuAD and MLQA, machine reading comprehension similar to RACE (dataset), natural language inference from resources associated with SNLI, and retrieval-style tasks inspired by benchmarks like TREC. Languages covered echo those in mBERT and XLM-RoBERTa pretraining, such as English, Chinese, Arabic, French, German, Spanish, Russian, Portuguese, Italian, Dutch, Swedish, Danish, Norwegian, Polish, Turkish, Hindi, Bengali, Urdu, Vietnamese, Thai, Indonesian, Malay, Tagalog, Japanese, Korean, Persian, Kazakh, and several low-resource languages addressed in XTREME.
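The two most common task shapes above, sentence-pair classification (XNLI-like) and token-level sequence labeling (CoNLL-like), can be illustrated with a minimal sketch. The field names and examples below are assumptions for illustration, not XGLUE's official schema:

```python
# Illustrative record shapes for two XGLUE-style task types.
# Field names ("premise", "ner_tags", etc.) are hypothetical.

XNLI_LABELS = ["entailment", "neutral", "contradiction"]

def encode_nli_label(example):
    """Map a string NLI label to its integer class id."""
    return XNLI_LABELS.index(example["label"])

# Sentence-pair classification example (XNLI-like)
nli_example = {
    "premise": "The cat sat on the mat.",
    "hypothesis": "An animal is on the mat.",
    "label": "entailment",
    "language": "en",
}

# Token-level sequence labeling example (CoNLL-2003-like, BIO tags)
ner_example = {
    "tokens": ["Angela", "Merkel", "visited", "Paris", "."],
    "ner_tags": ["B-PER", "I-PER", "O", "B-LOC", "O"],
}

print(encode_nli_label(nli_example))  # entailment -> 0
```

Keeping a single integer label space per task is what lets one model head be fine-tuned in one language and evaluated zero-shot in others.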

Benchmarking and Evaluation Metrics

Evaluation in XGLUE uses metrics standard to each task: accuracy and F1 for classification tasks as in GLUE (benchmark), exact match and F1 for question answering as with SQuAD, token-level F1 for sequence labeling following CoNLL-2003 conventions, BLEU (metric) for generation and translation tasks, and ROUGE for summarization comparable to CNN/Daily Mail (dataset). Leaderboards compare systems developed by groups at Google Research, Facebook AI Research, Microsoft Research, OpenAI, DeepMind, Hugging Face, Alibaba DAMO Academy, Baidu Research, Tencent AI Lab, NVIDIA Research and universities such as Stanford University and Carnegie Mellon University. Statistical significance testing often follows protocols used in papers presented at conferences like ACL (conference), EMNLP, NAACL, NeurIPS, ICML, IJCAI, COLING.
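As a sketch of the question-answering metrics mentioned above, the following implements SQuAD-style exact match and token-overlap F1. The normalization steps (lowercasing, stripping punctuation and articles) follow common SQuAD conventions; they are not taken from the XGLUE release itself:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The answer.", "answer"))   # normalization makes these equal
print(token_f1("the cat sat", "cat sat down"))
```

Corpus-level scores are then averages of these per-example values over the evaluation set.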

Baseline Models and Results

Reported baselines for XGLUE include multilingual encoders and sequence-to-sequence models such as mBERT, XLM, XLM-RoBERTa, mT5, mBART, BERT (language model), RoBERTa, T5 (machine learning model), and distilled variants like DistilBERT. Results demonstrated transfer trends similar to findings in evaluations of XTREME and GLUE (benchmark), with larger pretraining corpora and architectures from teams at Google Research, Facebook AI Research, Microsoft Research tending to outperform smaller models from academic labs. Papers presenting baseline results were submitted to venues including ACL (conference), EMNLP, NeurIPS, ICLR, and have been cited in follow-up work from institutions such as MIT, Harvard University, Stanford University, University of Washington, and industry groups like Hugging Face.
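Cross-lingual leaderboards of this kind typically summarize a system with an unweighted average over languages per task, and then over tasks. The sketch below shows that aggregation; the task names and scores are made-up illustrations, not reported XGLUE results:

```python
def macro_average(per_language_scores):
    """Unweighted mean over languages for a single task."""
    return sum(per_language_scores.values()) / len(per_language_scores)

def benchmark_score(task_scores):
    """Mean of per-task macro averages (all tasks weighted equally)."""
    return sum(macro_average(s) for s in task_scores.values()) / len(task_scores)

# Hypothetical scores for two understanding tasks across three languages
scores = {
    "xnli":   {"en": 85.0, "de": 78.0, "zh": 74.0},
    "paws-x": {"en": 94.0, "de": 87.0, "zh": 82.0},
}

print(round(benchmark_score(scores), 2))
```

Averaging per language first prevents tasks with more language splits from dominating the headline number.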

Usage and Licensing

XGLUE aggregates datasets that originate under a variety of licenses, mirroring practices seen with resources like Common Crawl, Wikipedia, OSCAR (dataset), and corpora distributed at WMT (conference). Users applying XGLUE data must comply with the original licenses of constituent datasets, and many research groups at Google Research, Facebook AI Research, Microsoft Research, OpenAI, DeepMind include license checks in their data pipelines. Datasets have been packaged and distributed by community efforts involving organizations like Hugging Face and research labs at Stanford University and Tsinghua University.
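The license checks mentioned above can be sketched as a simple gate in a data pipeline: each constituent dataset carries its license string, and anything not on a pre-approved list is flagged for manual review. The dataset names, license identifiers, and approved list below are illustrative assumptions, not XGLUE's actual terms:

```python
# Licenses a hypothetical pipeline is pre-cleared to redistribute
APPROVED = {"CC-BY-4.0", "CC-BY-SA-3.0", "ODC-BY-1.0"}

# Illustrative constituent datasets with made-up license assignments
datasets = [
    {"name": "xnli-subset", "license": "CC-BY-SA-3.0"},
    {"name": "news-corpus", "license": "research-only"},
]

def check_licenses(datasets, approved):
    """Return the names of datasets whose license is not pre-approved."""
    return [d["name"] for d in datasets if d["license"] not in approved]

flagged = check_licenses(datasets, APPROVED)
print(flagged)  # datasets requiring manual license review
```

In practice the flagged set would be routed to a human reviewer rather than silently dropped, since license strings are often imprecise.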

Limitations and Biases

Analyses of XGLUE point to limitations common to cross-lingual benchmarks documented in studies from Stanford University, University of Oxford, Carnegie Mellon University, University of Toronto and industry groups. These include uneven language coverage similar to issues in mBERT and XLM-RoBERTa pretraining, cultural and dialectal biases noted in evaluations of Common Crawl-derived corpora, and domain mismatch problems highlighted in work from Google Research and Microsoft Research. Additional concerns echo findings from audits conducted by teams at AI Now Institute, Partnership on AI, The Alan Turing Institute, and Data & Society Research Institute regarding demographic representation and annotation consistency.

Impact and Applications

XGLUE has been used to compare multilingual pretraining strategies in papers from Google Research, Facebook AI Research, Microsoft Research, OpenAI, Stanford University, Tsinghua University, Carnegie Mellon University and others, informing choices for models deployed by organizations like Google Translate, Microsoft Translator, Baidu Translate, Amazon Translate, and services from Alibaba Group, Tencent. Its influence appears in follow-up benchmarks such as XTREME and datasets curated by Hugging Face community projects. XGLUE comparisons have guided research directions presented at conferences such as ACL (conference), EMNLP, NeurIPS, and ICLR.

Category:Multilingual natural language processing benchmarks