LLMpedia: The first transparent, open encyclopedia generated by LLMs

XLNet

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: cuDNN (Hop 5)
Expansion Funnel: Raw 57 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 57
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
XLNet
Name: XLNet
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Year: 2019
Institutions: Carnegie Mellon University; Google Brain
Type: Autoregressive pretraining model
Successors: RoBERTa; T5

XLNet is a pretrained autoregressive language model introduced in 2019 that combines permutation-based training with the Transformer-XL architecture. Developed by researchers at Carnegie Mellon University and Google Brain, the model aimed to improve over bidirectional encoders such as BERT on a range of natural language understanding benchmarks including GLUE, SQuAD, and RACE. XLNet leveraged the segment-level recurrence of Transformer-XL and a permutation language modeling objective to capture bidirectional context without masked tokens.
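The permutation language modeling objective at the core of XLNet maximizes the expected log-likelihood of a sequence over factorization orders, so each token is conditioned on every other token in expectation:

```latex
% Z_T is the set of all permutations of (1, ..., T); z_t is the position
% predicted at step t of the sampled order z, and z_{<t} the positions before it.
\max_{\theta} \; \mathbb{E}_{z \sim Z_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{z_{<t}}\right) \right]
```

Because the factorization order varies while the token order of the sequence itself is unchanged, the model sees bidirectional context without ever corrupting the input with mask tokens.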

Introduction

XLNet emerged amid rapid advances in pretrained language models such as BERT, GPT-2, ELMo, ULMFiT, and Transformer-XL, produced by teams at Google, OpenAI, the Allen Institute for AI, fast.ai, and Carnegie Mellon University. The authors sought to reconcile the advantages of autoregressive models, exemplified by GPT and GPT-2, with the bidirectional conditioning of BERT, and to address limitations of masked-token pretraining such as the train-test discrepancy introduced by the [MASK] symbol. XLNet reported state-of-the-art results on widely used benchmarks including GLUE, SQuAD, and RACE and influenced subsequent pretraining research.

Architecture and Training Objectives

XLNet's architecture builds directly on the Transformer block and on the segment-level recurrence and relative positional encoding introduced in Transformer-XL. Instead of the masked-token approach of BERT, XLNet optimizes the expected log-likelihood of the sequence over permutations of the factorization order: each token is predicted from the tokens that precede it in a sampled permutation, which lets the model learn bidirectional interactions while preserving the probabilistic consistency of the autoregressive factorization used in the GPT series. To make target-aware prediction tractable, the model introduces a two-stream self-attention mechanism in which a query stream sees the target's position but not its content, while a content stream encodes the full token.
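The visibility constraint implied by one sampled factorization order can be expressed as an attention mask. The sketch below is illustrative only (it omits the two-stream mechanism described above); the function name and shapes are ours, not the released implementation's:

```python
import numpy as np

def permutation_mask(order):
    """Attention mask for one factorization order.

    order[t] is the sequence position predicted at step t. mask[i, j] = 1
    means position i may attend to position j, i.e. j is predicted earlier
    in the permutation. Illustrative sketch: XLNet's real model adds a
    second "query" stream so a target token cannot see its own content.
    """
    T = len(order)
    rank = np.empty(T, dtype=int)
    rank[order] = np.arange(T)  # rank[pos] = step at which pos is predicted
    return (rank[None, :] < rank[:, None]).astype(int)

# A random permutation gives each token a different visible context,
# while the identity order recovers the usual left-to-right causal mask.
rng = np.random.default_rng(0)
mask = permutation_mask(rng.permutation(4))
```

Averaged over many sampled orders, every position is conditioned on every other position, which is how the objective captures bidirectional context.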

Pretraining Dataset and Implementation

XLNet was pretrained on large corpora including BooksCorpus and English Wikipedia, extended with web-scale text from sources such as Giga5, ClueWeb, and Common Crawl. The reference implementation was released in TensorFlow, and training ran on TPU accelerators via Google Cloud Platform, with optimizer and batching choices in line with contemporary large-scale pretraining practice.
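The segment-level recurrence inherited from Transformer-XL can be sketched as follows: cached hidden states from the previous segment are prepended to the keys and values (but not the queries) of the current segment, extending the effective context beyond one segment. Names and shapes here are illustrative assumptions, not taken from the released code:

```python
import numpy as np

def attend_with_memory(h, mem, W_q, W_k, W_v):
    """Single-head attention over the current segment plus cached memory.

    h:   (cur_len, d)  hidden states of the current segment
    mem: (mem_len, d)  cached states from the previous segment (the real
                       model stops gradients from flowing into the cache)
    """
    ctx = np.concatenate([mem, h], axis=0)       # keys/values span memory + segment
    q, k, v = h @ W_q, ctx @ W_k, ctx @ W_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                           # (cur_len, d)

d = 4
rng = np.random.default_rng(1)
out = attend_with_memory(rng.normal(size=(2, d)), rng.normal(size=(3, d)),
                         np.eye(d), np.eye(d), np.eye(d))
```

Because only the queries come from the current segment, compute per step stays proportional to the segment length while the receptive field grows with each cached segment.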

Performance and Benchmarks

At the time of its publication, XLNet achieved leading performance on reading comprehension datasets such as SQuAD (created at Stanford University) and RACE, and established state-of-the-art results on the GLUE benchmark of natural language understanding tasks. Its empirical gains were analyzed in follow-up work on generalization, transfer learning, and domain adaptation, and it became a standard baseline in comparative evaluations of pretrained language models.

Applications and Adaptations

Following its release, XLNet influenced applied projects at organizations such as Google, Alibaba, Tencent, and Microsoft. Adaptations included fine-tuning for information retrieval, question answering pipelines, and text generation prototypes. Researchers also explored model compression and distillation techniques for deploying XLNet-style models on resource-constrained devices, and the model informed multilingual and domain-specific variants.
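Distilling a large pretrained model into a smaller student is commonly done with a soft-target loss in the style of Hinton et al.'s knowledge distillation; the sketch below is a generic version of that loss, not a recipe tied to any particular XLNet distillation project:

```python
import math

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between temperature-softened teacher and student
    distributions, scaled by T^2 as is conventional in knowledge distillation."""
    def soften(logits):
        m = max(logits)
        exps = [math.exp((x - m) / temperature) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    p = soften(teacher_logits)   # teacher's softened class probabilities
    q = soften(student_logits)   # student's softened class probabilities
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * temperature ** 2
```

Minimizing this term alongside the usual hard-label loss lets the student absorb the teacher's relative confidence across classes rather than only its top prediction.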

Criticisms and Limitations

Critiques of XLNet noted its increased computational cost and training complexity relative to BERT; RoBERTa, developed at Facebook AI Research, later showed that much of XLNet's advantage could be matched by a more carefully tuned masked language model. Practitioners also raised concerns about reproducibility, data curation (echoing debates around GPT-2), and sensitivity to hyperparameters documented in follow-up studies and workshop papers. Ethical and societal implications were discussed in the broader context of large-scale language models developed by institutions such as Google, OpenAI, and Facebook.

Category:Language models