Annealed Importance Sampling

Posted on Jan 28, 2019

This is the note for Neal, R. M. (1998). Annealed Importance Sampling. ArXiv:Physics/9803008.

Abstract

Simulated Annealing, moving from a tractable distribution to a distribution of interest via a sequence of intermediate distributions, has traditionally been used as an inexact method of handling isolated modes in Markov chain samplers. The author argues that one can use the Markov chain transitions for such an annealing sequence to define an importance sampler.

• Allow to perform acceptably even for high-dimensional problems
• Resemble the second half of the previously-studied tempered transitions, and can be seen as a generalization of a variant of SIS.
• Related to thermodynamic (热力学) integration methods for estimating ratios of normalizing constants.
• Most attractive when isolated modes are present, or when estimates of normalizing constants are required.
• May more generally useful, since its independent sampling allows one to bypass some of the problems of assessing convergence and autocorrelation in Markov chain samplers.

Introduction

To compute expectations of various quantities w.r.t. complex distributions:

• estimate expectations by sample averages based on points drawn independently from the distribution of interest.
• generate independent points from some simpler approximating distribution, and then use an importance sampling estimate, in which the points are weighted to compensate for use of the wrong distribution.
• use a sample of dependent points obtained by simulating a Markov chain that converges to the correct distribution.

The author will combine the latter two approaches by using an importance sampling distribution defined by a series of Markov chains.

Drawback of Markov chain: for several widely-separated modes, the chain will move between modes only rarely, it will take a long time to reach equilibrium, and will exhibit high autocorrelations for functions of the state variables out to long time lags.

Simulated annealing employs a sequence of distributions, with probabilities or probability densities given by $p_0(x)$ to $p_n(x)$, in which each $p_j$ differs only slightly from $p_{j+1}$. The distribution $p_0$ is the one of interest. The distributions $p_n$ is designed so that the Markov chain used to sample from it allows movement between all regions of the state space. A traditional scheme: $p_j\propto p_0(x)^{\beta_j}$ for $1=\beta_0>\beta_1\ldots>\beta_n$.

Annealing run:

1. started at some initial state, from which simulated a Markov chain designed to converge to $p_n$ for some number of iterations which are not necessarily enough to actually approach equilibrium
2. simulated some number of iterations of a Markov chain designed to converge to $p_{n-1}$, starting from the final state of the previous simulation.
3. ….
4. finally simulate the chain designed to converge to $p_0$.

The annealing process takes advantage of the freer movement possible under the other distributions. The annealed importance sampling is essentially a way of assigning weights to the states found by multiple simulated annealing runs.

The annealed importance sampling procedure

Fix $f_0$ as the distribution of interest, and fix $f_n$ as a simple distribution, let

$f_j(x) = f_0(x)^{\beta_j}f_n(x)^{1-\beta_j}$

where $1=\beta_0>\beta_1>\cdots>\beta_n=0$.

AIS produces a sample of points, $x^{(1)},\ldots,x^{(N)}$, and corresponding weights, $w^{(1)},\ldots,w^{(N)}$. To generate each point, $x^{(i)}$, and associated weight, $w^{(i)}$, generate a sequence of points, $x_{n-1},\ldots,x_0$ as follows:

1. $x_{n-1}\sim p_n$
2. $x_{n-2}\sim T_{n-1}(x_{n-1},\cdot)$
3. ……
4. $x_1\sim T_2(x_2,\cdot)$
5. $x_0\sim T_1(x_1,\cdot)$

Let $x^{(i)}=x_0$, and set

$w^{(i)} = \frac{f_{n-1}(x_{n-1})}{f_n(x_{n-1})}\frac{f_{n-2}(x_{n-2})}{f_{n-1}(x_{n-1})}\cdots \frac{f_1(x_1)}{f_2(x_1)}\frac{f_0(x_0)}{f_1(x_0)}\,.$

Validate the AIS procedure

Note that

$f(x_0,\ldots,x_{n-1}) =f_0(x_0)\tilde T_1(x_0,x_1)\tilde T_2(x_1,x_2)\cdots \tilde T_{n-1}(x_{n-2},x_{n-1})\,,$

where

$\tilde T_j(x,x') = T_j(x',x)p_j(x')/p_j(x) =T_j(x',x)f_j(x')/f_j(x)\,.$

Then

$f(x_0,\ldots,x_{n-1}) = f_0(x_0)\frac{f_1(x_1)}{f_1(x_0)}T_1(x_1,x_0)\cdots\frac{f_{n-1}(x_{n-1})}{f_{n-1}(x_{n-2})}T_{n-1}(x_{n-1},x_{n-2})\,.$

The proposal distribution is

$g(x_0,\ldots,x_{n-1}) = f_n(x_{n-1})T_{n-1}(x_{n-1},x_{n-2})\cdots T_2(x_2,x_1)T_1(x_1,x_0)\,.$

Thus this procedure can be regarded as an importance sampler on the extended state space. The importance weights are

$w^{(i)} = \frac{f(x_0,\ldots,x_{n-1})}{g(x_0,\ldots,x_{n-1})}\,.$

Accuracy of Importance sampling estimates

Suppose $\bar a$ is the importance sampling estimate for $\E_f(a)$, based on points $x^{(i)}$ drawn independently from the density proportional to $g(x)$:

$\bar a = \frac{\sum_{i=1}^Nw^{(i)}a(x^{(i)})}{\sum_{i=1}^Nw^{(i)}}\,.$

We can judge the accuracy of $\bar a$ by its variance, which can approximated as

$\Var(\bar a) \approx \frac{\E_g\big[\big(w^{(i)}(a(x^{(i)}-\E_f(a)))\big)\big]}{N\E_g\big[w^{(i)}\big]^2}\,.$

Someone proposed the following estimate:

$\widehat\Var(\bar a) = \frac{\sum\big(w^{(i)}(a(x^{(i)-\bar a}))\big)^2}{\big[ \sum w^{(i)} \big]^2}$

Relationship to tempered transitions

A tempered transition operates as follows, starting from state $\hat x_0$:

• Generate $\hat x_1\sim \hat T_1(\hat x_0,\cdot)$
• Generate $\hat x_2\sim \hat T_2(\hat x_1,\cdot)$
• Generate $\bar x_n\sim \hat T_n(\hat x_{n-1},\cdot)$
• Generate $\check x_{n-1}\sim \check T_n(\bar x_n, \cdot)$
• Generate $\check x_1\sim \check T_2(\check x_2,\cdot)$
• Generate $\check x_0\sim \check T_1(\check x_1,\cdot)$

The state $\check x_0$ is then accepted as the next state of the Markov chain with probability

$\min\Big[1, \frac{p_1(\hat x_1)}{p_0(\hat x_0)}\cdots \frac{p_n(\hat x_{n-1})}{p_{n-1}(\hat x_{n-1})}\frac{p_{n-1}(\check x_{n-1})}{p_n(\check x_{n-1})}\cdot \frac{p_0(\check x_0)}{p_1(\check x_0)}\Big]$

The second half of the tempered transition procedure is identical to the annealed importance sampling procedure, and the annealed importance sampling weight is essentially the same as the second half of the product defining the tempered transition acceptance probability.

To estimate ratios of normalizing constants, treat the weights as two parts, in which each one can be used to estimate $\int f_n(x)dx/\int f_0(x)dx$, and finally we can average them.

Relationship to SIS

Consider a model for the joint distribution of observable variables $x_1,\ldots,x_n$ along with associated latent variables $s_1,\ldots,s_n$. We wish to estimate expectations w.r.t. the conditional distribution of $s_1,\ldots,s_n$ given known values for $x_1,\ldots,x_n$.

SIS can be viewed as annealed importance sampling with a sequence of distributions, $p_0$ to $p_n$, in which $p_j$ is related to the distribution conditional on $n-j$ of the observable variables, $p_0$ is then the distribution of interest, conditional on all of $x_1,\ldots,x_n$.

Published in categories Note