LLMpedia
The first transparent, open encyclopedia generated by LLMs

BERT (language model)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: MIT-IBM Watson AI Lab (hop 3)
Expansion Funnel: Raw 56 → Dedup 3 → NER 1 → Enqueued 0
1. Extracted: 56
2. After dedup: 3
3. After NER: 1 (rejected: 2, both non-named-entities)
4. Enqueued: 0 (similarity rejected: 1)
BERT (language model)
Name: BERT
Developer: Google (company)
Released: 2018
Languages: Multilingual
Type: Masked language model

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning model introduced in 2018 by researchers at Google (company), designed to produce deep bidirectional representations for natural language processing tasks. It built on prior work in deep learning, representation learning, and the Transformer architecture, and achieved state-of-the-art results on benchmarks such as GLUE and SQuAD. The model influenced subsequent models from organizations including OpenAI, Facebook AI Research, Microsoft Research, Stanford University, and Carnegie Mellon University.

Background and Development

BERT was introduced in a 2018 paper authored by researchers from Google Research and Google AI Language, following earlier advances in attention-based models from teams at University of Toronto, University of Oxford, and New York University. It drew on seminal work including the Transformer paper, produced by researchers affiliated with Google Brain in collaboration with the University of Toronto, and on representation techniques from word2vec and GloVe. The project emerged amid a surge of interest sparked by performance gains on datasets created by groups at Stanford University and Allen Institute for AI, notably datasets maintained by teams at Princeton University and Carnegie Mellon University.

Architecture and Training Objectives

BERT uses a multi-layer bidirectional Transformer encoder stack developed within Google Brain. Its core architecture follows configurations similar to large models explored by researchers at OpenAI and DeepMind, employing the multi-head self-attention mechanism introduced by authors affiliated with Google Research and tested by teams at Facebook AI Research. Training combines two objectives: Masked Language Modeling (MLM), in which a fraction of input tokens is hidden and the model must predict the originals, inspired by earlier cloze tasks used at Stanford University; and Next Sentence Prediction (NSP), in which the model predicts whether two text segments appear consecutively in the source, related to discourse work from Columbia University and Massachusetts Institute of Technology. Hyperparameters and layer counts were explored across team collaborations that included researchers from University of California, Berkeley and Carnegie Mellon University.
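
The MLM corruption step can be illustrated with a short sketch. The following Python fragment applies the 80/10/10 rule described in the BERT paper to a whitespace-tokenized sentence; the token list, toy vocabulary, and function name are illustrative placeholders, not part of any BERT implementation:

import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random.Random(0)):
    # Each token is selected for prediction with probability mask_prob.
    # Of the selected tokens: 80% become [MASK], 10% are replaced with a
    # random vocabulary token, and 10% are left unchanged; the model must
    # recover the original token at every selected position.
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)                      # prediction target
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: mask
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep original
        else:
            targets.append(None)                     # position not predicted
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=["dog", "ran", "hat", "tree"]))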

Pretraining and Fine-tuning

Pretraining used large unlabeled corpora, chiefly the English Wikipedia and the BooksCorpus, employing compute resources comparable to those used by projects at Google Cloud and research centers like NVIDIA Corporation and Intel Corporation. In fine-tuning, the pretrained weights initialize a task-specific model to which a small output layer is added, and all parameters are then updated on labeled data; the procedure was popularized by experiments comparing BERT to architectures from Stanford University and University of Washington on tasks developed by teams at Facebook AI Research and Allen Institute for AI. This transfer learning paradigm enabled BERT-based models to be adapted for classification, question answering, and sequence labeling tasks evaluated by benchmark creators at University of California, Irvine and University of Pennsylvania.
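
As one concrete illustration, the sketch below performs a single fine-tuning step for binary sentence classification using the third-party Hugging Face transformers library (a common way to fine-tune BERT, not part of the model itself); the toy texts, labels, and learning rate are placeholders:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # adds a fresh classification head

texts = ["a great movie", "a dull movie"]   # toy labeled batch
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()                  # update every parameter, head and encoder
optimizer.step()
print(float(outputs.loss))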

Applications and Performance

BERT was applied across industry and academia, powering search improvements in products managed by Google Search teams and inspiring deployments by Microsoft in enterprise tools, as well as research implementations at Amazon (company) and Adobe Inc. It achieved top results on leaderboards such as SQuAD and GLUE. Use cases included systems developed by teams at Slack Technologies, Salesforce research groups, and projects at Zillow Group and Uber Technologies for natural language understanding and information retrieval tasks.
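
For question answering in particular, a minimal usage example with the Hugging Face transformers pipeline API (a third-party wrapper; the checkpoint name is one publicly available SQuAD-fine-tuned model, chosen purely for illustration) looks like this:

from transformers import pipeline

# Extractive question answering with a BERT-family model fine-tuned on SQuAD:
# the model selects an answer span from the supplied context.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")
result = qa(question="Who introduced BERT?",
            context="BERT was introduced in 2018 by researchers at Google.")
print(result["answer"], round(result["score"], 3))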

Variants and Extensions

Following BERT's release, numerous variants and extensions were introduced by research groups at Google Research, Facebook AI Research, Microsoft Research, Allen Institute for AI, and academic labs including Carnegie Mellon University and Princeton University. Notable derivatives include RoBERTa from Facebook AI Research, ALBERT from Google Research, and DistilBERT from Hugging Face, alongside work from Stanford University, University of Washington, and ETH Zurich exploring techniques such as knowledge distillation and parameter-efficient fine-tuning, the latter developed at Google Research and Harvard University. Extensions also addressed multilingual modeling, drawing on resources such as European Parliament proceedings corpora and on collaborations with institutions like University of Edinburgh and University of Cambridge.
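
One simple form of parameter-efficient adaptation, freezing the pretrained encoder and training only the added classification head, can be sketched as follows (again using the third-party Hugging Face transformers library; this is one illustrative strategy, not the specific method of any group named above):

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze every weight in the pretrained BERT encoder; only the newly
# added classification head on top remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")

# The optimizer only ever sees the head's parameters.
optimizer = torch.optim.AdamW(trainable, lr=1e-3)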

Limitations and Criticisms

Critiques of BERT appeared in analyses from researchers at MIT, Stanford University, and University of California, Berkeley, and from scholars concerned with bias and fairness at Harvard University and the Oxford Internet Institute. Limitations cited include high compute and energy demands, echoing concerns raised about large models by teams at OpenAI and DeepMind; susceptibility to adversarial examples, studied at Cornell University and Carnegie Mellon University; and difficulty capturing long-range discourse, highlighted by work at Princeton University and University of Pennsylvania. Ethical and governance discussions referenced institutions such as European Commission panels and policy groups at Brookings Institution.

Category:Language models