LLMpedia: The first transparent, open encyclopedia generated by LLMs

transformer

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: George Westinghouse (hop 3)
Expansion funnel: raw 75 → dedup 45 → NER 21 → enqueued 20
1. Extracted: 75
2. After dedup: 45
3. After NER: 21
Rejected: 24 (not NE: 24)
4. Enqueued: 20
Similarity rejected: 1
Name: Transformer
Developer: Google Brain, Google Research
Released: 2017
Influenced: GPT-3, BERT, T5, DALL-E, AlphaFold

A transformer is a deep learning architecture that relies on an attention mechanism to weigh the influence of different parts of the input data. Introduced in the 2017 paper "Attention Is All You Need" by researchers at Google Brain and Google Research, it has become the foundational model for most state-of-the-art systems in natural language processing. Unlike earlier recurrent or convolutional neural network architectures, it processes all positions of the input sequence simultaneously, enabling far more efficient parallelization during training.

Overview

The core innovation of the architecture is the self-attention mechanism, which allows the model to dynamically focus on different parts of a sequence when producing an output. This design effectively handles long-range dependencies within data, a task that was challenging for earlier models like long short-term memory networks. The original model was applied to machine translation, specifically the WMT 2014 English-to-German and English-to-French benchmarks, achieving new state-of-the-art results. Its success quickly led to its adoption across a wide spectrum of AI research, influencing projects at organizations like OpenAI, Facebook AI Research, and DeepMind.
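The self-attention computation described above reduces to the paper's scaled dot-product attention: each query is compared against every key, the resulting scores are normalized with a softmax, and the output is the corresponding weighted sum of values. A minimal NumPy sketch (the shapes and toy inputs are illustrative, not from the source):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (queries, keys): similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

# Toy example: 3 query positions attending over 4 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query position
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients. Multi-head attention runs several of these computations in parallel on learned projections of Q, K, and V.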

Architecture

The architecture is composed of an encoder and a decoder, each consisting of a stack of identical layers. Each encoder layer contains a multi-head attention mechanism and a simple, position-wise feedforward neural network, with layer normalization and residual connections applied around each sub-layer. The decoder includes similar layers but adds a masked multi-head attention mechanism to prevent attending to subsequent positions, preserving the autoregressive property during generation. A critical component is the positional encoding, which injects information about the order of the sequence, since the model itself has no inherent notion of sequence order. The reference implementation was released in TensorFlow, and implementations in frameworks such as PyTorch followed soon after.
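The sinusoidal positional encoding from the original paper assigns each position a fixed vector of sines and cosines at geometrically spaced frequencies, which is added to the token embeddings. A short NumPy sketch (assumes an even model dimension, as in the paper):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) even indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): one encoding vector per position
```

Because each frequency pair behaves like a rotation, the encoding for position pos + k is a fixed linear function of the encoding for pos, which the authors hypothesized would make relative positions easy to attend to.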

Training

Training these models requires massive datasets and substantial computational resources, often utilizing hardware like graphics processing unit clusters or specialized tensor processing units. They are typically trained with self-supervised objectives such as autoregressive language modeling or masked language modeling, with optimization performed by algorithms like Adam. The training objective usually amounts to minimizing a cross-entropy loss over the vocabulary. Large-scale training runs, such as those for GPT-4 or PaLM, are conducted by major labs including Microsoft Research and Google AI, consuming significant energy and time. The release of pre-trained models, such as those distributed through Hugging Face, has democratized access to these powerful architectures.
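The cross-entropy objective mentioned above can be written down in a few lines. This sketch assumes a `(batch, vocab)` array of raw logits and integer target token ids, which is the standard setup for language-model training:

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (batch, vocab) raw scores; targets: (batch,) integer token ids.
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the correct token, averaged over the batch.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Sanity check: uniform logits over a vocabulary of size V give a loss of log(V).
logits = np.zeros((4, 7))
targets = np.array([0, 1, 2, 3])
print(cross_entropy(logits, targets))  # log(7) ≈ 1.9459
```

In practice an optimizer such as Adam updates the parameters along the gradient of this loss, with the logits produced by the model at every position of the sequence.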

Applications

Transformers have revolutionized natural language processing, forming the backbone of models like ChatGPT, Claude, and Google Bard. They are central to text-to-image generation systems such as Stable Diffusion and Midjourney. Beyond language, they have achieved groundbreaking results in bioinformatics with AlphaFold 2 for protein structure prediction, in audio processing for systems like Whisper, and in multimodal learning as seen in Flamingo. Their use extends to code generation tools like GitHub Copilot and recommendation systems at companies like Netflix and Amazon.

Variants

Numerous variants have been developed to improve efficiency or adapt to specific domains. The Vision Transformer adapts the architecture for computer vision tasks, challenging the dominance of convolutional neural networks on datasets like ImageNet. Sparse Transformers reduce computational cost through selective attention patterns. Reformer incorporates locality-sensitive hashing for more efficient attention. Transformer-XL introduces a recurrence mechanism to enable longer context lengths. Other notable variants include BART, RoBERTa, and T5, each optimized for different training paradigms and released by institutions like Meta and Google.

Category:Artificial intelligence Category:Machine learning Category:Natural language processing