| Adam (optimizer) | |
|---|---|
| Name | Adam (optimizer) |
| Introduced | 2014 |
| Authors | Diederik P. Kingma; Jimmy Ba |
| Field | Machine learning; Optimization |
| Related | Stochastic gradient descent; RMSprop; AdaGrad; AdaDelta |
Adam (optimizer)
Adam is a stochastic optimization method widely used in deep learning for training neural networks. Developed by Diederik P. Kingma and Jimmy Ba, Adam combines ideas from adaptive learning rates and momentum to accelerate convergence and stabilize training across architectures such as convolutional networks, recurrent networks, and transformers. The method has influenced research in optimization, regularization, and large-scale training across industry and academia.
Adam was proposed to address challenges encountered when training deep models on large datasets, such as those used in ImageNet competitions and speech recognition benchmarks, and in large-scale projects at companies like Google and Facebook. Kingma and Ba drew on prior work including Stochastic gradient descent, momentum methods, AdaGrad, RMSprop, and AdaDelta to create an algorithm that adapts per-parameter learning rates using estimates of the first and second moments of the gradient. The need for robust optimizers was later reinforced as unstable training dynamics appeared in architectures such as AlexNet, VGG (neural network), and ResNet, and on benchmarks including COCO, SQuAD, and tasks discussed at NeurIPS workshops.
Adam maintains exponential moving averages of gradients and squared gradients to compute parameter updates. At each iteration, Adam updates a biased first moment estimate (analogous to momentum) and a biased second moment estimate (analogous to RMSprop), applies bias correction to both, and then adjusts the parameters. The core update uses a step size hyperparameter (the learning rate), exponential decay rates commonly named beta1 and beta2 for the moment estimates, and an epsilon term for numerical stability. The algorithm is implemented in frameworks such as TensorFlow, PyTorch, MXNet, JAX, and Keras, and appears in optimizer suites alongside SGD (optimization method), AdaGrad, and RMSProp.
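The update rule described above can be sketched in plain NumPy; this is an illustrative single-step function (the function name and argument layout are for exposition, while the default hyperparameter values match the original paper's suggestions):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction for initialization at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Because the update divides the bias-corrected first moment by the square root of the bias-corrected second moment, the effective step magnitude is roughly bounded by the learning rate regardless of the raw gradient scale.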
Multiple variants and improvements extend Adam to different settings and address empirical or theoretical issues. Notable variants include AMSGrad, which modifies second moment handling to ensure nonincreasing learning rates; AdamW, which decouples weight decay from adaptive updates; Nadam, which blends Nesterov momentum with Adam; and AdaBound, which bounds adaptive learning rates with dynamic clipping. Other proposals such as RAdam, Yogi, and Shampoo aim to reduce variance, improve conditioning, or leverage second-order structure. These developments have been explored in venues like ICML, ICLR, NeurIPS, and journals associated with IEEE and ACM.
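The decoupling idea behind AdamW can be sketched by contrast with the function above; in this illustrative version (names and the weight_decay default are assumptions for exposition), the decay term shrinks the weights directly rather than being added to the gradient and passed through the adaptive scaling:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: weight decay is applied to theta directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: the weight_decay * theta term is NOT divided by
    # sqrt(v_hat), unlike L2 regularization folded into the gradient.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

With decay folded into the gradient, parameters with large second-moment estimates would be regularized less; decoupling makes the shrinkage uniform across parameters.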
Analyses of Adam address convergence in stochastic and deterministic settings and the conditions required for regret bounds. Early work identified counterexamples in which unconstrained Adam fails to converge even on simple convex problems, prompting theoretical revisions and variants like AMSGrad. Convergence proofs typically invoke standard assumptions from optimization theory, such as Lipschitz continuity, bounded gradients, and diminishing step sizes, and connect to regret analyses from the online learning literature. This line of work draws on convex optimization and online learning research from institutions including Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, Princeton University, the University of Toronto, the University of Oxford, Yahoo! Research, and Microsoft Research.
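The AMSGrad fix mentioned above can be written compactly: it replaces the second moment estimate in the denominator with a running maximum, so the effective per-parameter step size never increases (bias correction is omitted here for brevity, following the original AMSGrad presentation):

```latex
\begin{aligned}
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat{v}_t &= \max(\hat{v}_{t-1},\, v_t), \\
\theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, m_t.
\end{aligned}
```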
Implementers tune hyperparameters such as the learning rate, beta1, beta2, and epsilon; the commonly used defaults (learning rate 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) stem from the original paper. Practical training recipes for architectures like Transformer (machine learning model), BERT, GPT, ResNet, and LSTM often specify Adam or AdamW with learning rate schedules such as cosine decay, linear warmup, or polynomial decay developed in works from Google Research and OpenAI. Numerical stability and performance considerations arise on hardware from NVIDIA GPUs, TPU accelerators from Google, and CPU clusters managed by services like Amazon Web Services and Microsoft Azure. Software engineering aspects appear in repositories hosted on platforms like GitHub and in continuous integration for projects at Hugging Face and DeepMind.
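A common combination of the schedules named above is linear warmup followed by cosine decay; a minimal sketch (the function name, base learning rate, and step counts are illustrative assumptions, not values from any particular recipe):

```python
import math

def lr_schedule(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup from 0 to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

Warmup keeps early steps small while Adam's moment estimates are still noisy; the cosine tail shrinks the step size smoothly toward the end of training.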
Adam is used across applications in computer vision, natural language processing, speech, reinforcement learning, and scientific computing. Benchmarks on datasets such as ImageNet, CIFAR-10, PTB (Penn Treebank), SQuAD, and GLUE show rapid initial progress and competitive final performance in many settings. In generative modeling and large-scale language models exemplified by work at OpenAI, Adam variants and AdamW are standard choices. Empirical studies from research groups at Facebook AI Research, Google Brain, DeepMind, and academic labs have compared Adam to SGD with momentum and second-order methods, reporting trade-offs in generalization, stability, and wall-clock time to convergence.
Criticisms of Adam include sensitivity to hyperparameters in some regimes, potential generalization gaps compared to SGD with momentum on certain vision tasks, and theoretical counterexamples showing nonconvergence without modification. Concerns motivate practices like decoupled weight decay, bespoke learning rate schedules, and hybrid optimizers. The optimizer’s behavior is further scrutinized in reproducibility initiatives and empirical studies from communities centered on NeurIPS, ICLR, and ICML; ongoing work at institutions such as Berkeley, Harvard University, and ETH Zurich continues to refine understanding.
Category:Optimization algorithms