ADAM and AMSGrad for Stochastic Optimization
This post is based on
- Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization (arXiv:1412.6980). arXiv.
- Reddi, S. J., Kale, S., & Kumar, S. (2018, February 15). On the Convergence of Adam and Beyond. International Conference on Learning Representations.
Kingma and Ba (2017) propose ADAM, whose name derives from adaptive moment estimation. It is designed to combine the advantages of two earlier methods (a minimal update sketch follows this list):
- Duchi et al. (2011)’s AdaGrad: works well with sparse gradients.
- Tieleman and Hinton (2012)’s RMSProp: works well in online and non-stationary settings.
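For concreteness, here is a minimal NumPy sketch of one ADAM step, following Algorithm 1 of Kingma and Ba (2017); the `adam_step` function name and its interface are my own, not from either paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update (Algorithm 1, Kingma & Ba 2017); t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second raw-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-coordinate division by sqrt(v_hat) gives the RMSProp-like adaptive step size, while m_hat adds momentum; because m and v start at zero, the bias-correction factors are needed in the early steps.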
Reddi et al. (2018) analyzed a convergence issue in ADAM and pointed out that it can be fixed by endowing the algorithm with “long-term memory” of past gradients. They proposed a variant of ADAM, called AMSGrad, which not only fixes the convergence issue but often also leads to improved empirical performance.
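The AMSGrad fix amounts to a small change to the sketch above: keep a running maximum of the second-moment estimate and normalize by that, so the effective step size never grows just because recent gradients happen to be small. A sketch, following Algorithm 2 of Reddi et al. (2018), which drops the bias-correction terms for simplicity (PyTorch’s implementation keeps them); again, the function name and interface are my own:

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update (Algorithm 2, Reddi et al. 2018, without bias correction)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)              # "long-term memory": v_max is non-decreasing
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```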
PyTorch supports this algorithm via the amsgrad option of torch.optim.Adam:
```python
# https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, *, foreach=None, maximize=False, capturable=False)
```
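A minimal usage sketch (the toy linear model and random data are just for illustration): switching from ADAM to AMSGrad is a one-flag change.

```python
import torch

model = torch.nn.Linear(10, 1)                      # toy model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # random stand-in data
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```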
See also a reported case where AMSGrad succeeds where ADAM fails: Loss suddenly increases using Adam optimizer - PyTorch Forums.