| Transformer | |
|---|---|
| *(photo: Raimond Spekking, CC BY-SA 4.0)* | |
| Name | Transformer |
| Introduced | 2017 |
| Creators | Google Brain |
| Key paper | "Attention Is All You Need" |
| Core concepts | Self-attention, positional encoding, encoder–decoder |
| Notable variants | BERT, GPT, T5, Vision Transformer |
Transformer
Transformers are a class of deep learning models introduced in 2017 by researchers at Google Brain in the paper "Attention Is All You Need", notable for replacing recurrent and convolutional architectures in many sequence modeling tasks. They use self-attention mechanisms to model relationships within input sequences and have driven advances in natural language processing, computer vision, and multimodal systems across organizations such as OpenAI, DeepMind, and Facebook AI Research. Transformers underpin influential models like BERT, GPT-3, T5 (Text-to-Text Transfer Transformer), and the Vision Transformer, and have set new state-of-the-art results on benchmarks such as GLUE, SuperGLUE, and ImageNet.
The Transformer architecture emerged from work at Google to address the difficulty that earlier recurrent models, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, have with long-range dependencies. By relying on attention mechanisms developed in neural machine translation research, including sequence-to-sequence (Seq2Seq) models and the Bahdanau attention framework, the Transformer achieved highly parallelizable training on TPU and GPU accelerators. Early adopters included projects at Microsoft Research and open-source toolkits such as TensorFlow and PyTorch, enabling rapid replication and extension across academic groups at Stanford University, MIT, and Carnegie Mellon University.
A Transformer stacks encoder and decoder blocks composed of multi-head self-attention, position-wise feed-forward networks, residual connections, and layer normalization. Multi-head attention builds on the query, key, and value projections used in prior work such as Luong attention, and lets the model attend to different representational subspaces simultaneously. Positional encodings inject order information using sinusoidal functions or learned embeddings, inspired by sequence representations from early word embedding research and techniques from signal processing. Key components (attention matrices, layer normalization, and feed-forward layers) are optimized using initialization schemes from researchers at the University of Toronto and Yoshua Bengio's group, and the architecture integrates techniques popularized in systems from Google DeepMind and OpenAI.
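The self-attention and positional-encoding components described above can be sketched in a few lines of NumPy. This is an illustrative single-head version under simplified assumptions (even model dimension, no masking, projection matrices as the only parameters), not the paper's reference implementation:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings (assumes d_model is even)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (seq_len, d_model/2)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)  # even dimensions use sine
    enc[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return enc

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise similarities, scaled
    return softmax(scores) @ v               # attention-weighted sum of values
```

Multi-head attention runs several such heads with separate projections and concatenates their outputs; a full block then adds the residual connection, layer normalization, and feed-forward sublayer.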
Training large Transformers leverages pretraining objectives such as the masked language modeling of BERT and the autoregressive language modeling of the GPT series, combined with fine-tuning on supervised datasets like SQuAD and MNLI. Optimization strategies include the Adam and LAMB optimizers; learning rate schedules such as linear warmup followed by cosine decay, adopted from NVIDIA and Google Research practice; and regularization via dropout and label smoothing, influenced by work at Microsoft. Scaling laws observed by researchers at OpenAI and DeepMind describe how dataset size, model parameters, and compute interact to determine generalization, informing how projects at Anthropic and academic consortia allocate compute across pretraining and fine-tuning. Distributed training uses model-parallel and data-parallel implementations in frameworks developed by Horovod contributors and infrastructure from Amazon Web Services and Google Cloud Platform.
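As a concrete example of the scheduling practice mentioned above, a linear-warmup/cosine-decay learning rate schedule can be written as a small pure function (a common sketch; the parameter names here are illustrative and not tied to any specific framework):

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

The warmup phase keeps early Adam updates small while its moment estimates are still noisy; the cosine tail anneals the step size smoothly toward zero by the end of training.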
Transformers power applications across language, vision, audio, and multimodal domains. In natural language, they enable systems for machine translation used by Google Translate, question answering in products at Microsoft and Amazon, and conversational agents from startups and research groups such as OpenAI and Meta Platforms. In vision, the Vision Transformer and hybrid models have advanced image classification on ImageNet and object detection in pipelines adopted by teams at Facebook AI Research. Multimodal integrations appear in models combining text and image encoders used by projects at Stanford Vision and Learning Lab and industry initiatives at DeepMind for tasks like video understanding and captioning. Transformers also contribute to scientific discovery workflows at institutions like Lawrence Berkeley National Laboratory and to code synthesis systems evaluated in competitions run by ICLR and NeurIPS.
Despite broad success, Transformers face challenges, including compute and memory costs that limit accessibility for smaller labs and institutions such as regional universities and non-profits, echoing concerns raised by researchers at MIT and policy teams at OpenAI. They are susceptible to dataset biases documented in studies at Google Research and Stanford University and can produce plausible but incorrect outputs, a problem highlighted in evaluations at ACL and NAACL. Interpretability remains limited compared to the symbolic models investigated at Berkeley AI Research, and weak robustness to adversarial inputs has motivated research agendas at Carnegie Mellon University and ETH Zurich. Governance, safety, and ethical deployment concerns have prompted guidelines from bodies like the IEEE and policy discussions at the White House.
Numerous variants adapt the Transformer core to domain and efficiency constraints. Encoder-only models include BERT and domain-specialized derivatives from BioNLP groups for biomedical text. Decoder-only families such as GPT-2 and GPT-4, developed at OpenAI and studied at labs including Microsoft Research, focus on autoregressive generation. Sequence-to-sequence models like T5 and BART power summarization and translation systems developed by Google Research and Facebook AI Research. Efficient extensions, including sparse attention, linearized attention, and memory-augmented forms, have emerged from research at DeepMind, Google Research, and universities including UC Berkeley and EPFL. Cross-modal hybrids combine Transformer encoders with convolutional backbones in work at the Visual Geometry Group and the Allen Institute for AI.
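The split between encoder-only and decoder-only variants largely comes down to the attention mask. A minimal sketch of the two mask shapes, where True marks a position that may be attended to:

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Bidirectional mask (BERT-style encoder) or causal mask (GPT-style decoder)."""
    if causal:
        # Lower-triangular: position i may only attend to positions <= i,
        # so generation cannot peek at future tokens.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Full mask: every position may attend to every other position.
    return np.ones((seq_len, seq_len), dtype=bool)
```

Encoder-decoder models like T5 combine both: bidirectional masks in the encoder, and causal masks (plus cross-attention to the encoder) in the decoder.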
Category:Neural network architectures