WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

ADAM and AMSGrad for Stochastic Optimization

Tags: Optimization, Gradient Descent

This post is based on

Kingma and Ba (2017) propose ADAM, whose name is derived from adaptive moment estimation. It is designed to combine the advantages of

Reddi et al. (2018) analyzed the convergence issue of ADAM and pointed out that it can be fixed by endowing the algorithm with “long-term memory” of past gradients. They proposed a variant of ADAM, called AMSGrad, which not only fixes the convergence issue but often also leads to improved empirical performance.
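To make the “long-term memory” idea concrete, here is a minimal scalar sketch of one update step in plain Python. The names `new_state` and `adam_like_step` are made up for illustration and are not from either paper. With `amsgrad=True` the step divides by the running maximum of the second-moment estimate instead of its current value, so a large past gradient is never forgotten. The sketch keeps ADAM's bias correction on both moments; that is a choice of this sketch, not something the text above specifies.

```python
import math


def new_state():
    # Hypothetical container for the optimizer state of a single scalar parameter.
    return {"t": 0, "m": 0.0, "v": 0.0, "v_max": 0.0}


def adam_like_step(x, grad, state, lr=0.001, beta1=0.9, beta2=0.999,
                   eps=1e-8, amsgrad=False):
    """Return the updated parameter after one ADAM (or AMSGrad) step."""
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    if amsgrad:
        # Long-term memory: keep the largest second-moment estimate seen so far.
        state["v_max"] = max(state["v_max"], state["v"])
        v_used = state["v_max"]
    else:
        v_used = state["v"]
    # Bias correction for the zero initialisation of m and v.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = v_used / (1 - beta2 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)


# Usage: one large gradient followed by small ones.
state = new_state()
x = 0.0
for g in [10.0, 0.1, 0.1]:
    x = adam_like_step(x, g, state, lr=0.1, amsgrad=True)
```

After the large first gradient, `state["v"]` starts to decay, while `state["v_max"]` stays at its peak. The effective step size therefore cannot grow back, which is the behaviour the fix relies on.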

PyTorch exposes AMSGrad through the amsgrad option of torch.optim.Adam:

# https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, *, foreach=None, maximize=False, capturable=False)
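As a minimal usage sketch, the option can be switched on when constructing the optimizer. The toy regression problem, the learning rate, and the number of iterations below are arbitrary choices for illustration, not recommendations.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: fit y = 2x + 1 with a single linear layer.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 2.0 * x + 1.0

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()

# amsgrad=True selects the AMSGrad variant; all other arguments keep their defaults.
optimizer = torch.optim.Adam(model.parameters(), lr=0.05, amsgrad=True)

initial_loss = loss_fn(model(x), y).item()
for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
final_loss = loss_fn(model(x), y).item()
```

Switching back to plain ADAM only requires dropping the `amsgrad=True` argument, so the two are easy to compare on the same training script.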

See also a case where AMSGrad worked better than ADAM: Loss suddenly increases using Adam optimizer - PyTorch Forums

Published in categories Note