| Adam (optimization algorithm) | |
|---|---|
| Name | Adam |
| Developer | Diederik P. Kingma; Jimmy Lei Ba |
| Introduced | 2014 |
| Programming languages | Python; C++; Java; MATLAB |
| License | MIT (common implementations) |
Adam is an adaptive stochastic optimization algorithm widely used for training neural networks and other machine learning models. It combines ideas from momentum-based methods and adaptive learning-rate algorithms to provide per-parameter step sizes, yielding robust performance across a range of tasks in computer vision, natural language processing, and reinforcement learning. Adam's popularity stems from its empirical effectiveness, ease of implementation, and availability in major frameworks.
Adam was introduced by Diederik P. Kingma and Jimmy Lei Ba in 2014 as an algorithm for first-order gradient-based optimization of stochastic objective functions. It draws on earlier adaptive methods such as AdaGrad and RMSProp, and on momentum algorithms including Polyak's heavy-ball method and Nesterov accelerated gradient. Adam maintains exponential moving averages of past gradients and squared gradients and applies bias corrections to those estimates; this marries the stability of momentum with the per-parameter scaling of adaptive methods. The algorithm rapidly gained adoption and is implemented in all major deep learning frameworks, including TensorFlow and PyTorch.
Adam maintains, for each parameter, estimates of the first moment (mean) and second moment (uncentered variance) of the gradient. At timestep t the raw gradients g_t are computed by backpropagation through a computational graph, as popularized by predecessors such as Theano and Torch. The update rules compute m_t = β1 * m_{t−1} + (1−β1) * g_t and v_t = β2 * v_{t−1} + (1−β2) * g_t², where β1 and β2 are exponential decay rates. Because m and v are initialized at zero, bias-corrected estimates m̂_t = m_t / (1−β1^t) and v̂_t = v_t / (1−β2^t) are formed, and parameters θ are updated by θ_{t+1} = θ_t − α * m̂_t / (sqrt(v̂_t) + ε). The constants α and ε control the global step size and numerical stability respectively. The derivation and pseudocode are presented alongside empirical comparisons in the original paper, published at ICLR 2015.
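The update rules above can be sketched for a single scalar parameter. This is a minimal illustration following the formulas in the text, not a library implementation:

```python
import math

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter theta.

    m, v are the running first/second moment estimates; t is the
    1-based timestep used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # correct for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2, whose gradient is 2*theta, from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # settles near the minimum at 0
```

Because the bias-corrected ratio m̂_t / sqrt(v̂_t) is roughly ±1 while gradient signs stay consistent, early progress is on the order of α per step regardless of the gradient's magnitude.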
Standard default hyperparameters recommended in the original paper are α = 0.001, β1 = 0.9, β2 = 0.999, and ε = 10^−8; most library implementations use these defaults. Variants adjust one or more hyperparameters or modify the moment estimates. AdamW, proposed by Loshchilov and Hutter, decouples weight decay regularization from the gradient-based update, restoring the intended effect of weight decay under adaptive step sizes. AMSGrad, proposed by Reddi, Kale, and Kumar, modifies the second-moment update to enforce non-increasing step sizes. Other variants include AdaMax, which replaces the second moment with an exponentially weighted infinity norm, and learning-rate warmup or scheduling strategies commonly used in large-scale training recipes.
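AdamW's decoupling can be sketched by modifying the scalar update: the decay term wd * θ is subtracted from the parameter directly instead of being added to the gradient, where it would be rescaled by 1/sqrt(v̂_t). The scalar setting and the name `wd` are illustrative assumptions:

```python
import math

def adamw_step(theta, grad, m, v, t,
               alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: decoupled weight decay for a scalar parameter.

    In plain Adam with L2 regularization, wd * theta would be added to
    grad before the moment updates; here it bypasses the moments entirely.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# With a zero gradient the update reduces to pure exponential decay of theta,
# which is exactly the behavior decoupling is meant to guarantee.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adamw_step(theta, 0.0, m, v, t)
print(theta)  # (1 - alpha*wd)^100, slightly below 1.0
```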
Theoretical analysis of Adam examines convergence in convex and nonconvex settings. The convergence argument in the original paper for stochastic convex optimization was later shown to be flawed: counterexamples demonstrate non-convergence of Adam on certain online convex problems, motivating variants such as AMSGrad with provable regret bounds under assumptions standard in the online learning literature. Subsequent analyses extend to the nonconvex landscapes typical of deep learning, characterizing convergence rates to stationary points under smoothness and bounded-gradient assumptions. Recent work explores adaptive learning-rate dynamics in overparameterized regimes.
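The AMSGrad fix mentioned above amounts to a one-line change to the scalar update: the denominator uses a running maximum of the second-moment estimate, so effective step sizes never increase. Implementations differ on bias-correction details; this sketch corrects only the first moment, one common convention rather than the definitive form:

```python
import math

def amsgrad_step(theta, grad, m, v, v_max, t,
                 alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update for a scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    v_max = max(v_max, v)                       # never let the denominator shrink
    m_hat = m / (1 - beta1 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_max) + eps)
    return theta, m, v, v_max

# One large gradient followed by zeros: v decays back toward 0,
# but v_max remembers the spike, keeping later steps conservative.
theta, m, v, v_max = 0.0, 0.0, 0.0, 0.0
history = []
for t, g in enumerate([1.0, 0.0, 0.0, 0.0], start=1):
    theta, m, v, v_max = amsgrad_step(theta, g, m, v, v_max, t)
    history.append((v, v_max))
```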
Implementations of Adam appear in all major open-source frameworks and are integrated into high-level APIs such as Keras and the optimizer libraries built on JAX. Practical tips include tuning learning-rate schedules, applying weight decay via AdamW, and employing gradient clipping for stability, particularly in recurrent models. The choice of ε affects numerical stability, especially under mixed-precision training, and frameworks provide fused optimizer kernels to exploit hardware accelerators. Reproducibility and optimizer benchmarking remain active topics of community discussion.
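Gradient clipping, one of the stability measures mentioned above, is commonly done by global norm before the optimizer step. A simplified sketch over a flat list of scalar gradients; frameworks apply the same rule across whole tensors:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their combined L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_global_norm([3.0, 4.0], 1.0)    # norm 5.0, scaled to approximately [0.6, 0.8]
untouched = clip_by_global_norm([0.1, 0.1], 1.0)  # already under the limit; unchanged
```

Clipping the global norm preserves the gradient's direction while bounding its magnitude, unlike per-element clipping, which can change the direction.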
Adam has been used to train architectures such as AlexNet, VGG, ResNet, and the Transformer, as well as generative models including variational autoencoders and generative adversarial networks. It is a default choice in many natural language processing pipelines and in reinforcement learning experiments. Empirically, Adam often converges faster in early epochs than plain stochastic gradient descent, though final generalization may differ depending on regularization and learning-rate schedule.
Criticisms of Adam include reported cases of poorer generalization than stochastic gradient descent with momentum on some benchmarks. Analytical objections and pathological examples have motivated alternatives and hybrids: SGD with momentum, AdaGrad, RMSProp, and newer methods such as AdaBelief and RAdam. Common practice is to experiment with Adam variants, learning-rate schedules, and regularization strategies before adopting Adam in production systems.
Category:Optimization algorithms