LLMpedia: The first transparent, open encyclopedia generated by LLMs

AdamW

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: LLMC Hop 6
Expansion Funnel: Raw 88 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 88
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
AdamW
Name: AdamW
Introduced: 2017
Authors: Ilya Loshchilov; Frank Hutter
Institution: University of Freiburg
Related: Adam; SGD; RMSprop; AdaGrad; Momentum; L2 regularization

AdamW is an optimization algorithm for training machine learning models that decouples weight decay from adaptive moment estimation, improving generalization in deep learning. It was introduced by Ilya Loshchilov and Frank Hutter to address shortcomings observed when Adam is combined with classical L2 regularization, and it has since influenced research at institutions such as Google Research, OpenAI, and academic groups at the University of Freiburg and the Massachusetts Institute of Technology. The method has been adopted across frameworks such as PyTorch, TensorFlow, and JAX and applied to models ranging from ResNet to BERT and other Transformer architectures.

Introduction

AdamW modifies the update rule of Adam by applying weight decay as an explicit, separate step rather than folding it into the adaptive gradient updates introduced by Kingma and Ba. The algorithm responds to empirical observations on ImageNet classification, CIFAR-10, and language modeling tasks, where optimizers such as SGD with momentum often generalize better than adaptive methods. AdamW has been widely discussed in workshops at NeurIPS, ICLR, and ICML and cited in follow-up work from groups at DeepMind and universities including Stanford University and the University of Toronto.
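
In the notation of Loshchilov and Hutter (a sketch of the standard formulation; the per-step schedule multiplier is omitted for brevity), Adam with L2 regularization folds the decay coefficient into the gradient before the adaptive rescaling, whereas AdamW keeps it outside:

\[
\text{Adam with } L_2\text{:}\quad g_t = \nabla f_t(\theta_{t-1}) + \lambda\,\theta_{t-1}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\]
\[
\text{AdamW:}\quad g_t = \nabla f_t(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1}\right)
\]

Here \(\hat{m}_t\) and \(\hat{v}_t\) are the bias-corrected first and second moment estimates computed from \(g_t\), \(\eta\) is the learning rate, and \(\epsilon\) a small constant. In the first form the decay term is rescaled by the adaptive denominator; in AdamW it acts directly on the parameters.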

Background and Motivation

Prior optimizers such as SGD, Momentum, AdaGrad, RMSprop, and Adam shaped the training of deep networks exemplified by architectures such as AlexNet, VGG, Inception, and ResNet. Researchers at Google Brain and labs including Facebook AI Research observed that Adam often reached worse final test error than SGD on benchmarks like ImageNet and CIFAR-100. Loshchilov and Hutter introduced AdamW to combine the adaptive learning rates of Adam with weight decay that is decoupled from the gradient-based update, drawing on earlier work on weight decay and optimization theory by scholars at ETH Zurich and University College London. The discussion also intersected with optimization themes in textbooks and conferences associated with the Courant Institute and Princeton University.

Algorithm

The AdamW update maintains the same moment estimates as Adam, a first moment (mean) and a second moment (uncentered variance) of the gradient, while applying weight decay as a direct subtraction from the parameters. Formally, the algorithm keeps exponential moving averages analogous to the scheme of Kingma and Ba and introduces a weight decay hyperparameter comparable to the coefficient used in the classical L2 regularization literature from researchers at Bell Labs and IBM Research. AdamW is used to train models such as Transformer blocks, BERT, GPT, ResNet, DenseNet, and recurrent models like LSTM and GRU, with optimization schedules often coordinated with learning rate schemes such as cosine annealing and warmup strategies proposed at ICLR and NeurIPS.
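
The per-step computation can be summarized in the following minimal NumPy sketch; the function name, argument conventions, and default values are illustrative choices rather than the API of any particular framework, and a learning rate schedule multiplier is omitted for simplicity.

import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # One AdamW update (illustrative sketch, not a framework API).
    # theta: parameters; grad: gradient of the loss at theta;
    # m, v: running first and second moment estimates; t: 1-based step count.
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction, as in Adam
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay applied directly to the parameters.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v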

Variants and Extensions

Since its introduction, AdamW has inspired variants that carry decoupled weight decay into optimizers such as AdaBound, RAdam, NovoGrad, AdamP, AdaFactor, and LAMB. Extensions combine AdamW with techniques such as the Lookahead optimizer, Gradient Centralization, and large-batch optimization methods developed by authors at Microsoft Research and Amazon Web Services for training models such as BERT-large and ResNet-50. Research integrating AdamW with distributed training frameworks such as Horovod and TensorFlow's distribution strategies has explored scaling on compute platforms from NVIDIA and Google's TPU programs, with contributions from teams at Facebook and Microsoft Azure.

Implementation Details and Hyperparameters

Common implementations in PyTorch, TensorFlow Addons, and Hugging Face Transformers expose the hyperparameters learning rate, beta1, beta2, epsilon, and weight decay. Default beta values usually follow the conventions of Kingma and Ba (beta1=0.9, beta2=0.999), with epsilon values adapted from implementations at OpenAI and Google Research. Practical settings derive from experiments on ImageNet, CIFAR-10, the GLUE benchmark, and SQuAD, with workflows established in repositories maintained on GitHub by groups such as the Stanford NLP Group and Berkeley AI Research. Hyperparameter tuning tools such as Ray Tune, Optuna, and Hyperopt are commonly used alongside schedulers such as ReduceLROnPlateau and CosineAnnealingLR from frameworks like PyTorch and Keras.
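
As a usage sketch under common defaults (the toy model, data, and schedule length below are placeholders, not recommendations from any particular paper), PyTorch exposes the decoupled formulation as torch.optim.AdamW, which is frequently paired with a cosine annealing schedule:

import torch

model = torch.nn.Linear(128, 10)            # placeholder model
inputs = torch.randn(32, 128)               # placeholder batch
targets = torch.randint(0, 10, (32,))

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                  # learning rate
    betas=(0.9, 0.999),       # first and second moment decay rates
    eps=1e-8,
    weight_decay=0.01,        # decoupled weight decay coefficient
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()          # anneal the learning rate each step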

Applications and Performance

AdamW is widely used in computer vision, natural language processing, and speech tasks, powering training of models like ResNet, DenseNet, EfficientNet, Transformer, BERT, and GPT variants in projects by Google Research, OpenAI, Facebook AI Research, and academic labs at Carnegie Mellon University and University of Oxford. Benchmarks on ImageNet, CIFAR-100, GLUE, and SuperGLUE report improved generalization and stability compared to standard Adam with naive weight decay, and large-scale training for foundation models at Microsoft Research and DeepMind often employs AdamW or derivatives. Performance studies appear in proceedings of NeurIPS, ICLR, and ICML with empirical comparisons to SGD with momentum, LARS, and LAMB for large-batch regimes.

Criticisms and Limitations

Critiques of AdamW include sensitivity to hyperparameter tuning noted by researchers at the University of Toronto and ETH Zurich, potential suboptimality in some sparse settings explored by teams at Facebook AI Research and DeepMind, and cases where carefully tuned SGD with momentum outperforms adaptive optimizers, for example in certain ImageNet configurations studied by groups at Stanford University and Berkeley. Theoretical analyses by academics at MIT and Princeton examine its convergence properties in nonconvex settings, and practitioners have reported interactions with regularization techniques such as Dropout and with normalization layers, including work from Yann LeCun's groups. Ongoing research at Google Research, OpenAI, Microsoft Research, and various universities continues to refine when AdamW or alternative optimizers are preferable.

Category:Optimization algorithms