ADAM and AMSGrad for Stochastic Optimization
This post is based on
- Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization (arXiv:1412.6980). arXiv.
- Reddi, S. J., Kale, S., & Kumar, S. (2018, February 15). On the Convergence of Adam and Beyond. International Conference on Learning Representations.
Kingma and Ba (2017) propose ADAM, whose name derives from adaptive moment estimation. It is designed to combine the advantages of two earlier methods (a minimal update sketch follows this list):
- Duchi et al. (2011)’s AdaGrad: works well with sparse gradients.
- Tieleman and Hinton (2012)’s RMSProp: works well in online and non-stationary settings.
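For concreteness, here is a minimal NumPy sketch of one ADAM step, following Algorithm 1 of Kingma and Ba (2017); the `adam_step` function name and its interface are my own, not from either paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update (Algorithm 1, Kingma & Ba 2017); t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second raw-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-coordinate division by sqrt(v_hat) gives the RMSProp-like adaptive step size, while m_hat adds momentum; because m and v start at zero, the bias-correction factors are needed in the early steps.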
Reddi et al. (2018) analyzed a convergence issue in ADAM and pointed out that it can be fixed by endowing the algorithm with “long-term memory” of past gradients. They proposed a variant of ADAM, called AMSGrad, which not only fixes the convergence issue but often also leads to improved empirical performance.
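The AMSGrad fix amounts to a small change to the sketch above: keep a running maximum of the second-moment estimate and normalize by that, so the effective step size never grows just because recent gradients happen to be small. A sketch, following Algorithm 2 of Reddi et al. (2018), which drops the bias-correction terms for simplicity (PyTorch’s implementation keeps them); again, the function name and interface are my own:

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update (Algorithm 2, Reddi et al. 2018, without bias correction)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)              # "long-term memory": v_max is non-decreasing
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```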
PyTorch supports this algorithm via the amsgrad option of torch.optim.Adam:
```python
# https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, *, foreach=None, maximize=False, capturable=False)
```
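A minimal usage sketch (the toy linear model and random data are just for illustration): switching from ADAM to AMSGrad is a one-flag change.

```python
import torch

model = torch.nn.Linear(10, 1)                      # toy model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # random stand-in data
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```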
See also a reported case where AMSGrad succeeds where ADAM fails: Loss suddenly increases using Adam optimizer - PyTorch Forums.