LLMpedia: The first transparent, open encyclopedia generated by LLMs

Transformer (machine learning model)

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Nvidia H100 (hop 4)
Expansion Funnel: Raw 83 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 83
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
Transformer (machine learning model)
Name: Transformer
Year: 2017
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Institution: Google Brain, Google Research
Influenced: GPT-3, BERT (language model), T5 (Text-To-Text Transfer Transformer), DALL-E, AlphaFold

The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by researchers from Google Brain and Google Research. It relies entirely on a mechanism called self-attention to process sequential data, dispensing with the recurrent neural network and convolutional neural network layers used in prior models. This design enables superior parallelization during training, and the Transformer has become the foundational architecture for most state-of-the-art models in natural language processing, computer vision, and bioinformatics.

Overview

The model was developed to address limitations in existing sequence transduction models such as long short-term memory networks, which process data sequentially. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence regardless of their distance from one another. The architecture was first demonstrated on English-German and English-French machine translation tasks, where it achieved new state-of-the-art results. Its efficiency and performance quickly led to its adoption by major technology firms, including OpenAI, Meta Platforms, and Microsoft.

Architecture

The Transformer uses an encoder-decoder structure, where both the encoder and decoder are composed of a stack of identical layers. Each encoder layer contains a multi-head self-attention mechanism and a position-wise feedforward neural network, with a residual connection around each sub-layer followed by layer normalization. The decoder layers are similar but include an additional multi-head attention sub-layer that attends over the output of the encoder stack. A critical component is the positional encoding, which is added to the input embeddings to inject information about the order of the sequence, since the model itself has no inherent notion of recurrence.
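The two components described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the scaled dot-product attention formula, the 1/sqrt(d_k) scaling, and the sinusoidal positional encoding all follow the 2017 paper, but the function names and shapes here are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    # the core operation of each attention head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from the paper:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # The result is added element-wise to the input embeddings.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

In the full model, each of the multiple heads applies this attention to its own learned linear projections of the input, and the heads' outputs are concatenated and projected back to the model dimension.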

Training

Transformers are typically trained with self-supervised objectives, such as language modeling or masked language modeling, on large text datasets like the Wikipedia corpus or Common Crawl. Training uses backpropagation with variants of stochastic gradient descent such as the Adam optimizer. The process is computationally intensive, often requiring thousands of GPUs or TPUs running for weeks or months, as with models like OpenAI's GPT-4 and Google's PaLM. Techniques such as dropout and label smoothing are commonly used for regularization.
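The masked-language-modeling objective mentioned above can be illustrated with a short sketch: a fraction of the input tokens is hidden, and the model is trained to predict the originals at the hidden positions. The `[MASK]` token and the 15% masking rate follow the BERT paper; the function name and return format here are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    # BERT-style masking sketch: each token is independently hidden
    # with probability mask_prob. The training loss is computed only
    # at the masked positions, against the original tokens.
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = mask_token
    return masked, targets
```

In causal (left-to-right) language modeling, by contrast, no tokens are replaced: the model predicts each token from all tokens before it.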

Applications

Beyond its original use in machine translation, the Transformer architecture underpins most modern large language models. It is the core of OpenAI's GPT series, Google's BERT and T5, and Meta Platforms' Llama. The architecture has also been successfully adapted for computer vision in models like the Vision Transformer from Google Research, and for protein structure prediction in DeepMind's AlphaFold system. It is used in multi-modal systems such as OpenAI's DALL-E and CLIP.

Variants and extensions

Numerous variants have been developed to improve efficiency or adapt the architecture to new domains. The decoder-only variant, used in the GPT series, is prevalent for generative tasks. The encoder-only architecture is used in models like BERT. To address the quadratic complexity of self-attention, efficient variants like the Reformer from Google Research and the Linformer from Facebook AI Research have been proposed. Other extensions include the Transformer-XL for longer context and the Perceiver IO for general multi-modal processing.
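What makes the decoder-only (GPT-style) variant generative is its causal attention mask: each position may attend only to itself and earlier positions, so the model can be trained and sampled autoregressively. A minimal sketch, with illustrative function names:

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular boolean mask: entry (i, j) is True iff
    # position i is allowed to attend to position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_causal_mask(scores, mask):
    # Disallowed (future) positions are set to -inf before the
    # softmax, so they receive exactly zero attention weight.
    return np.where(mask, scores, -np.inf)
```

Encoder-only models like BERT omit this mask entirely, letting every position attend to the full sequence in both directions.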

Impact and limitations

The Transformer has had a revolutionary impact, driving rapid progress in artificial intelligence and becoming the de facto standard in natural language processing. It has enabled the development of powerful foundation models and generative AI tools used worldwide. However, limitations remain, including high computational costs for training and inference, massive data requirements, and challenges in modeling extremely long sequences. Issues of AI alignment, algorithmic bias, and hallucination in models like ChatGPT are active areas of research at institutions like the Allen Institute for AI and Stanford University.

Category:Machine learning models Category:Natural language processing Category:Artificial neural networks