
Sequential Monte Carlo Methods

Tags: Sequential Monte Carlo, Hidden Markov Model

This note covers Section 3 of Doucet, A., & Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656–704), 3. It complements my previous post.

Sequential Monte Carlo Methods

SMC methods are a general class of Monte Carlo methods that sample sequentially from a sequence of target probability densities ${\pi_n(x_{1:n})}$ of increasing dimension where each distribution $\pi_n(x_{1:n})$ is defined on the product space $\cX^n$. Writing

\[\pi_n(x_{1:n})=\frac{\gamma_n(x_{1:n})}{Z_n}\,,\]

we require only that $\gamma_n:\cX^n\rightarrow \bbR^+$ is known pointwise; the normalising constant

\[Z_n = \int \gamma_n(x_{1:n})dx_{1:n}\]

might be unknown.

Basics of Monte Carlo Methods

Consider approximating a generic probability density $\pi_n(x_{1:n})$ for some fixed $n$. If we sample $N$ independent random variables, $X_{1:n}^i\sim \pi_n(x_{1:n})$ for $i=1,\ldots,N$, then the Monte Carlo method approximates $\pi_n(x_{1:n})$ by the empirical measure

\[\hat \pi_n(x_{1:n}) = \frac 1N \sum_{i=1}^N\delta_{X_{1:n}^i}(x_{1:n})\,.\]

Based on this approximation, it is possible to approximate any marginal, say $\pi_n(x_k)$, easily using

\[\hat \pi_n(x_k) = \frac 1N\sum_{i=1}^N\delta_{X_k^i}(x_k)\,,\]

and the expectation of any test function $\varphi_n:\cX^n\rightarrow\bbR$ given by

\[I_n(\varphi_n):=\int \varphi_n(x_{1:n})\pi_n(x_{1:n})dx_{1:n}\]

is estimated by

\[I_n^{MC}(\varphi_n):=\int \varphi_n(x_{1:n})\hat \pi_n(x_{1:n})dx_{1:n}=\frac 1N\sum_{i=1}^N\varphi_n(X_{1:n}^i)\,.\]

This estimate is unbiased and its variance is given by

\[\bbV [I_n^{MC}(\varphi_n)] = \frac 1N\Big( \int \varphi^2_n(x_{1:n})\pi_n(x_{1:n})dx_{1:n} -I_n^2(\varphi_n) \Big)\]
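As a quick illustration, here is a minimal sketch of the plain Monte Carlo estimator; the standard normal target and the test function $\varphi(x)=x^2$ are my own illustrative choices, not from the note.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Illustrative choices: target pi = N(0, 1), test function phi(x) = x^2, so I(phi) = 1.
X = rng.standard_normal(N)      # X^i ~ pi, i = 1, ..., N
phi = X**2                      # phi(X^i)

I_mc = phi.mean()               # unbiased Monte Carlo estimate of I(phi)
var_mc = phi.var(ddof=1) / N    # estimated variance of the estimator, O(1/N)
print(I_mc, var_mc)
```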

The main advantage of Monte Carlo methods over standard approximation techniques is that the variance of the approximation error decreases at a rate of $O(1/N)$ regardless of the dimension of the space $\cX^n$. But there are at least two main problems:

  • If $\pi_n(x_{1:n})$ is a complex high-dimensional probability distribution, then we cannot sample from it.
  • Even if we knew how to sample exactly from $\pi_n(x_{1:n})$, the computational complexity of such a sampling scheme is typically at least linear in the number of variables $n$.

Importance Sampling

This strategy can address the first problem mentioned above. IS relies on the introduction of an importance density (or proposal density, instrumental density) $q_n(x_{1:n})$ such that

\[\pi_n(x_{1:n})>0\Rightarrow q_n(x_{1:n})>0\]

then we have

\[\pi_n(x_{1:n})=\frac{\frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})}q_n(x_{1:n})}{Z_n}=\frac{w_n(x_{1:n})q_n(x_{1:n})}{Z_n}\,.\]

where $w_n(x_{1:n}):=\gamma_n(x_{1:n})/q_n(x_{1:n})$ is the unnormalised importance weight. Assume we draw $N$ independent samples $X_{1:n}^i\sim q_n(x_{1:n})$; then

\[\begin{align} \hat \pi_n(x_{1:n}) &= \sum_{i=1}^NW_n^i\delta_{X_{1:n}^i}(x_{1:n})\label{eq23}\\ \hat Z_n &= \frac 1N \sum_{i=1}^Nw_n(X_{1:n}^i)\label{eq24} \end{align}\]

where

\[W_n^i = \frac{w_n(X_{1:n}^i)}{\sum_{j=1}^Nw_n(X_{1:n}^j)}\]

IS provides an (unbiased) estimate of the normalising constant with relative variance

\[\begin{align*} \frac{\bbV_{IS}[\hat Z_n]}{Z_n^2} & = \frac{1}{N^2Z_n^2}N\bbV(w_n(X_{1:n}^i))\\ &= \frac 1N\Big[\int \frac{w_n^2(x_{1:n})q_n(x_{1:n})}{Z_n^2}dx_{1:n}-\Big(\int \frac{w_n(x_{1:n})q_n(x_{1:n})}{Z_n}dx_{1:n}\Big)^2\Big]\\ &= \frac 1N\Big[\int \frac{\pi^2_n(x_{1:n})}{q_n(x_{1:n})}dx_{1:n}-1\Big]\,. \end{align*}\]

Also, $I_n(\varphi_n)$ can be estimated as

\[I_n^{IS}(\varphi_n) = \int \varphi_n(x_{1:n})\hat\pi_n(x_{1:n})dx_{1:n}=\sum_{i=1}^NW_n^i\varphi_n(X_{1:n}^i)\,.\]

Unlike $I_n^{MC}(\varphi_n)$, this estimate is biased for finite $N$. However, it is consistent and the asymptotic bias is

\[\lim_{N\rightarrow \infty} N(I_n^{IS}(\varphi_n)-I_n(\varphi_n)) = -\int\frac{\pi^2_n(x_{1:n})}{q_n(x_{1:n})}(\varphi_n(x_{1:n})-I_n(\varphi_n))dx_{1:n}\,.\]

How to derive?
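One way to derive this (a sketch I add here, not copied from the tutorial) is to write the self-normalised estimator as a ratio of two sample averages and expand to second order. Let

\[A_N = \frac 1N\sum_{i=1}^N\frac{w_n(X_{1:n}^i)}{Z_n}\varphi_n(X_{1:n}^i)\,,\qquad B_N = \frac 1N\sum_{i=1}^N\frac{w_n(X_{1:n}^i)}{Z_n}\,,\]

so that $I_n^{IS}(\varphi_n)=A_N/B_N$ with $\bbE[A_N]=I_n(\varphi_n)$ and $\bbE[B_N]=1$. A second-order expansion of $A_N/B_N$ around $(I_n(\varphi_n),1)$ gives

\[\bbE[I_n^{IS}(\varphi_n)]-I_n(\varphi_n) = -\mathrm{Cov}(A_N,B_N)+I_n(\varphi_n)\bbV(B_N)+O(N^{-2})\,,\]

and since

\[\mathrm{Cov}(A_N,B_N)=\frac 1N\Big(\int\frac{\pi_n^2(x_{1:n})}{q_n(x_{1:n})}\varphi_n(x_{1:n})dx_{1:n}-I_n(\varphi_n)\Big)\,,\qquad \bbV(B_N)=\frac 1N\Big(\int\frac{\pi_n^2(x_{1:n})}{q_n(x_{1:n})}dx_{1:n}-1\Big)\,,\]

multiplying by $N$ and letting $N\rightarrow\infty$ yields the stated limit.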

Furthermore, $I_n^{IS}(\varphi_n)$ satisfies a CLT with asymptotic variance

\[\frac 1N\int \frac{\pi_n^2(x_{1:n})}{q_n(x_{1:n})}(\varphi_n(x_{1:n})-I_n(\varphi_n))^2dx_{1:n}\]

Since the bias is $O(1/N)$ and the variance is $O(1/N)$, the mean-squared error, given by the squared bias plus the variance, is asymptotically dominated by the variance term.
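A minimal sketch of self-normalised IS; the unnormalised target $\gamma(x)=\exp(-x^2/2)$ (so $Z=\sqrt{2\pi}$ and $\pi$ is standard normal) and the Student-$t$ proposal are my own illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 10_000

def gamma(x):
    # unnormalised target gamma(x) = exp(-x^2 / 2), so Z = sqrt(2 * pi)
    return np.exp(-0.5 * x**2)

x = rng.standard_t(df=3, size=N)        # X^i ~ q, a heavier-tailed Student-t proposal
w = gamma(x) / stats.t.pdf(x, df=3)     # unnormalised weights w(X^i) = gamma(X^i) / q(X^i)
W = w / w.sum()                         # normalised weights W^i

Z_hat = w.mean()                        # unbiased estimate of Z = sqrt(2 * pi) ~ 2.507
I_hat = np.sum(W * x**2)                # self-normalised (consistent) estimate of E_pi[X^2] = 1
print(Z_hat, I_hat)
```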

Sequential Importance Sampling

This strategy addresses the second problem. Select an importance distribution which has the following structure

\[\begin{align*} q_n(x_{1:n}) &= q_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})\\ &= q_1(x_1)\prod_{k=2}^nq_k(x_k\mid x_{1:k-1})\,. \end{align*}\]

Practically, this means that to obtain particles $X_{1:n}^i\sim q_n(x_{1:n})$ at time $n$, we sample $X_1^i\sim q_1(x_1)$ at time 1 then $X_k^i\sim q_k(x_k\mid X_{1:k-1}^i)$ at time $k$ for $k=2,\ldots,n$. The associated unnormalised weights can be computed recursively using the decomposition

\[\begin{align*} w_n(x_{1:n}) &= \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})}\\ &= \frac{\gamma_{n-1}(x_{1:n-1})}{q_{n-1}(x_{1:n-1})}\frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})} \end{align*}\]

which can be written as

\[\begin{align*} w_n(x_{1:n}) &= w_{n-1}(x_{1:n-1})\cdot \alpha_n(x_{1:n})\\ &=w_1(x_1)\prod_{k=2}^n\alpha_k(x_{1:k}) \end{align*}\]

where the incremental importance weight function $\alpha_n(x_{1:n})$ is given by

\[\alpha_n(x_{1:n})=\frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})}\]

At any time $n$, we obtain the estimates $\hat\pi_n(x_{1:n})$ and $\hat Z_n$ given by $\eqref{eq23}$ and $\eqref{eq24}$, respectively.
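A minimal SIS sketch for a toy model of my own choosing: each target factor is a standard normal density (so $Z_n=1$) and each proposal is a zero-mean normal with standard deviation $2$; neither choice comes from the note.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, n = 1000, 50

# Illustrative toy model: gamma_n(x_{1:n}) = prod_k N(x_k; 0, 1), so Z_n = 1,
# with proposal q_k(x_k | x_{1:k-1}) = N(x_k; 0, 2^2).
X = np.zeros((N, n))
logw = np.zeros(N)                # log unnormalised weights, updated recursively

for k in range(n):
    X[:, k] = rng.normal(0.0, 2.0, size=N)    # X_k^i ~ q_k
    # incremental weight alpha_k = N(x_k; 0, 1) / N(x_k; 0, 2^2)
    logw += stats.norm.logpdf(X[:, k], 0.0, 1.0) - stats.norm.logpdf(X[:, k], 0.0, 2.0)

W = np.exp(logw - logw.max())
W /= W.sum()                      # normalised weights W_n^i
Z_hat = np.exp(logw).mean()       # hat Z_n; the true value is 1 here
print(Z_hat, W.max())             # a maximal weight near 1 signals weight degeneracy
```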

It is straightforward to check that a consistent estimate of $Z_n/Z_{n-1}$ is also provided by the same set of samples:

\[\widehat{\frac{Z_n}{Z_{n-1}}}=\sum_{i=1}^NW_{n-1}^i\alpha_n(X_{1:n}^i)\,.\]

The estimator is motivated by the fact that

\[\begin{align*} \int \alpha_n(x_{1:n})\pi_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1}) dx_{1:n} &= \int \frac{\gamma_n(x_{1:n})\pi_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})}dx_{1:n}\\ &= \frac{1}{Z_{n-1}}\int \gamma_n(x_{1:n})dx_{1:n}\\ &=\frac{Z_n}{Z_{n-1}}\,. \end{align*}\]

We would like to minimise the variance of $w_n(x_{1:n})$, which can be achieved by selecting

\[q_n^{opt}(x_n\mid x_{1:n-1}) = \pi_n(x_n\mid x_{1:n-1})\]

as in this case the variance of $w_n(x_{1:n})$ conditional upon $x_{1:n-1}$ is zero and the associated incremental weight is given by

\[\alpha_n^{opt}(x_{1:n})=\frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})\pi_n(x_n\mid x_{1:n-1})}=\frac{\int\gamma_n(x_{1:n})dx_n}{\gamma_{n-1}(x_{1:n-1})}\,.\]

It is not always possible to sample from $\pi_n(x_n\mid x_{1:n-1})$, nor to compute $\alpha_n^{opt}(x_{1:n})$. In such cases, we should use an approximation of $q_n^{opt}(x_n\mid x_{1:n-1})$ as the proposal $q_n(x_n\mid x_{1:n-1})$.
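As a concrete example, taken from the filtering setting of the tutorial rather than from this section: if $\gamma_n(x_{1:n})=p(x_{1:n},y_{1:n})$ for a hidden Markov model, then

\[q_n^{opt}(x_n\mid x_{1:n-1})=p(x_n\mid y_n,x_{n-1})\,,\qquad \alpha_n^{opt}(x_{1:n})=p(y_n\mid x_{n-1})\,,\]

which are tractable only for particular models, e.g. linear-Gaussian or finite state-space models.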

It is important to be aware that the methodology presented here suffers from severe drawbacks. Even for standard IS (SIS is nothing but a special version of IS), the variance of the resulting estimates increases exponentially with $n$.

Resampling

Resampling techniques are a key ingredient of SMC methods which partially solve this problem in some important scenarios.

Consider the IS approximation $\hat\pi_n(x_{1:n})$ of the target distribution $\pi_n(x_{1:n})$. This approximation is based on weighted samples from $q_n(x_{1:n})$ and does not provide samples approximately distributed according to $\pi_n(x_{1:n})$.

To obtain approximate samples from $\pi_n(x_{1:n})$, we can simply sample from its IS approximation $\hat\pi_n(x_{1:n})$; that is, we select $X_{1:n}^i$ with probability $W_n^i$. This operation is called resampling.

To obtain $N$ samples from $\hat\pi_n(x_{1:n})$, we can simply resample $N$ times from $\hat\pi_n(x_{1:n})$. This is equivalent to associating a number of offspring $N_n^i$ with each particle $X_{1:n}^i$ in such a way that $N_n^{1:N}=(N_n^1,\ldots,N_n^N)$ follow a multinomial distribution with parameter vector $(N, W_n^{1:N})$ and associating a weight of $1/N$ with each offspring.

We approximate $\hat\pi_n(x_{1:n})$ by the resampled empirical measure

\[\bar \pi_n(x_{1:n}) = \sum_{i=1}^N\frac{N_n^i}{N}\delta_{X_{1:n}^i}(x_{1:n})\]

where $\bbE[N_n^i\mid W_n^{1:N}]=NW_n^i$. Hence $\bar\pi_n(x_{1:n})$ is an unbiased approximation of $\hat\pi_n(x_{1:n})$.

Improved resampling schemes preserve this unbiasedness property, i.e. $\bbE[N_n^i\mid W_n^{1:N}]=NW_n^i$, while ensuring that $\bbV[N_n^i\mid W_n^{1:N}]$ is smaller than under the multinomial resampling scheme.

The three most popular schemes, in descending order of popularity/efficiency, are listed below; a short sketch of two of them (systematic and multinomial) follows the list.

  • Systematic Resampling
  • Residual Resampling
  • Multinomial Resampling
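A minimal sketch of multinomial and systematic resampling, returning offspring indices; the function names and interface are my own.

```python
import numpy as np

def multinomial_resample(W, rng):
    """Offspring indices drawn i.i.d. with probabilities W, i.e. N_n^{1:N} ~ Multinomial(N, W)."""
    N = len(W)
    return rng.choice(N, size=N, p=W)

def systematic_resample(W, rng):
    """Systematic resampling: one uniform draw shifts an evenly spaced grid of N points."""
    N = len(W)
    positions = (rng.uniform() + np.arange(N)) / N
    cumulative = np.cumsum(W)
    cumulative[-1] = 1.0          # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)

# Usage: given particles X of shape (N, n) and normalised weights W,
#   idx = systematic_resample(W, np.random.default_rng(3))
#   X_bar, W_bar = X[idx], np.full(len(W), 1.0 / len(W))
```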

Resampling allows us to obtain samples distributed approximately according to $\pi_n(x_{1:n})$, but it should be clear that if we are interested in estimating $I_n(\varphi_n)$ then we will obtain an estimate with lower variance using $\hat\pi_n(x_{1:n})$ than using $\bar \pi_n(x_{1:n})$: by resampling we indeed add some extra “noise”. However, an important advantage of resampling is that it allows us to remove, with high probability, particles with low weights.

A Generic Sequential Monte Carlo Algorithm

SMC methods are a combination of SIS and resampling.

Sequential Monte Carlo, also called Sequential Importance Resampling (SIR) or Sequential Importance Sampling and Resampling (SIS/R)

At time $n = 1$

  • Sample $X_1^i\sim q_1(x_1)$
  • Compute the weights $w_1(X_1^i)$ and $W_1^i\propto w_1(X_1^i)$
  • Resample $\{W_1^i,X_1^i\}$ to obtain $N$ equally-weighted particles $\{\frac 1N, \bar X_1^i\}$

At time $n\ge 2$

  • Sample $X_n^i\sim q_n(x_n\mid \bar X_{1:n-1}^i)$ and set $X_{1:n}^i\leftarrow (\bar X_{1:n-1}^i, X_n^i)$
  • Compute the weights $\alpha_n(X_{1:n}^i)$ and $W_n^i\propto W_{n-1}^i\alpha_n(X_{1:n}^i)$
  • Resample $\{W_n^i, X_{1:n}^i\}$ to obtain $N$ new equally-weighted particles $\{\frac 1N, \bar X_{1:n}^i\}$

At time $n$, this algorithm provides two approximations of $\pi_n(x_{1:n})$. We obtain

\[\tilde \pi_n(x_{1:n}) = \sum_{i=1}^NW_n^i\delta_{X_{1:n}^i}(x_{1:n})\]

after the sampling step and

\[\bar \pi_n(x_{1:n}) = \frac 1N\sum_{i=1}^N\delta_{\bar X_{1:n}^i}(x_{1:n})\]

after the resampling step.
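Putting the pieces together, here is a sketch of the generic SMC algorithm for the same illustrative toy model as in the SIS example above (standard normal target factors, $N(0,2^2)$ proposals), resampling at every step with multinomial resampling for simplicity.

```python
import numpy as np
from scipy import stats

def smc(N=1000, n=50, seed=4):
    """Generic SMC (SIR), resampling at every step, for an illustrative toy model."""
    rng = np.random.default_rng(seed)
    X = np.zeros((N, n))
    logZ = 0.0                                          # accumulates log hat Z_n

    for k in range(n):
        # sampling step: X_k^i ~ q_k(. | bar X_{1:k-1}^i), here N(0, 2^2)
        X[:, k] = rng.normal(0.0, 2.0, size=N)
        # incremental weights alpha_k = N(x_k; 0, 1) / N(x_k; 0, 2^2)
        logw = stats.norm.logpdf(X[:, k], 0.0, 1.0) - stats.norm.logpdf(X[:, k], 0.0, 2.0)
        m = logw.max()
        logZ += m + np.log(np.mean(np.exp(logw - m)))   # log of the estimate of Z_k / Z_{k-1}
        W = np.exp(logw - m)
        W /= W.sum()
        # resampling step: resample whole paths X_{1:k}^i
        idx = rng.choice(N, size=N, p=W)
        X = X[idx]

    return X, logZ

X_bar, logZ = smc()
print(logZ)                                             # true value is log Z_n = 0 for this toy model
```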

If particles have unnormalised weights with a small variance then the resampling step might be unnecessary. In practice, it is more sensible to resample only when the variance of the unnormalised weights exceeds a pre-specified threshold. This is often assessed with the so-called Effective Sample Size (ESS) criterion,

\[ESS = \Big(\sum_{i=1}^N(W_n^i)^2\Big)^{-1}\,.\]

In a simple IS setting, inference based on the $N$ weighted samples is approximately equivalent (in terms of estimator variance) to inference based on $ESS$ perfect samples from the target distribution. The $ESS$ takes values between $1$ and $N$ and we resample only when it is below a threshold $N_T$; typically $N_T=N/2$.

Alternative criteria can be used, such as the entropy of the weights $\{W_n^i\}$, which achieves its maximum value when $W_n^i=1/N$ for all $i$. In this case, we resample when the entropy falls below a given threshold.
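A minimal sketch of both criteria; the default threshold is just the typical choice mentioned above.

```python
import numpy as np

def ess(W):
    """Effective Sample Size: 1 / sum_i (W^i)^2, which lies between 1 and N."""
    return 1.0 / np.sum(W**2)

def entropy(W):
    """Entropy of the weights; maximal (= log N) when all W^i = 1 / N."""
    W = W[W > 0]
    return -np.sum(W * np.log(W))

def should_resample(W, threshold=None):
    """Resample when the ESS falls below the threshold N_T (default N_T = N / 2)."""
    N_T = len(W) / 2 if threshold is None else threshold
    return ess(W) < N_T
```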

Sequential Monte Carlo with Adaptive Resampling

At time $n=1$

  • Sample $X_1^i\sim q_1(x_1)$
  • Compute the weights $w_1(X_1^i)$ and $W_1^i\propto w_1(X_1^i)$
  • If resampling criterion satisfied then resample $\{W_1^i, X_1^i\}$ to obtain $N$ equally weighted particles $\{\frac 1N, \bar X_1^i\}$ and set $\{\bar W_1^i,\bar X_1^i\}\leftarrow \{\frac 1N, \bar X_1^i\}$, otherwise set $\{\bar W_1^i,\bar X_1^i\}\leftarrow \{W_1^i, X_1^i\}$

At time $n\ge 2$

  • Sample $X_n^i\sim q_n(x_n\mid \bar X_{1:n-1}^i)$ and set $X_{1:n}^i\leftarrow (\bar X_{1:n-1}^i, X_n^i)$
  • Compute the weights $\alpha_n(X_{1:n}^i)$ and $W_n^i\propto \bar W_{n-1}^i\alpha_n(X_{1:n}^i)$
  • If resampling criterion satisfied, then resample $\{W_n^i, X_{1:n}^i\}$ to obtain $N$ equally weighted particles $\{\frac 1N, \bar X_{1:n}^i\}$ and set $\{\bar W_n^i,\bar X_{1:n}^i\}\leftarrow \{\frac 1N, \bar X_{1:n}^i\}$, otherwise set $\{\bar W_n^i,\bar X_{1:n}^i\}\leftarrow \{W_n^i,X_{1:n}^i\}$.

In this context, we have two approximations of $\pi_n(x_{1:n})$

\[\begin{align*} \hat\pi_n(x_{1:n}) &= \sum_{i=1}^NW_n^i\delta_{X_{1:n}^i}(x_{1:n})\\ \bar \pi_n(x_{1:n}) &= \sum_{i=1}^N\bar W_n^i\delta_{\bar X_{1:n}^i}(x_{1:n}) \end{align*}\]

which are equal if no resampling step is used at time $n$.
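A sketch of the adaptive-resampling variant, reusing the toy model and the ESS criterion above; all modelling choices are again purely illustrative.

```python
import numpy as np
from scipy import stats

def smc_adaptive(N=1000, n=50, seed=5):
    """SMC with adaptive (ESS-triggered) resampling for the same illustrative toy model."""
    rng = np.random.default_rng(seed)
    X = np.zeros((N, n))
    logw = np.zeros(N)                                  # log unnormalised weights

    for k in range(n):
        X[:, k] = rng.normal(0.0, 2.0, size=N)          # X_k^i ~ q_k(. | bar X_{1:k-1}^i)
        logw += (stats.norm.logpdf(X[:, k], 0.0, 1.0)
                 - stats.norm.logpdf(X[:, k], 0.0, 2.0))  # multiply weights by alpha_k
        W = np.exp(logw - logw.max())
        W /= W.sum()
        if 1.0 / np.sum(W**2) < N / 2:                  # resampling criterion: ESS < N_T = N / 2
            idx = rng.choice(N, size=N, p=W)
            X, logw = X[idx], np.zeros(N)               # equally weighted particles after resampling
        # otherwise keep the weighted particles {W_n^i, X_{1:n}^i} unchanged

    return X, W

X_bar, W_final = smc_adaptive()
```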

The presence or absence of degeneracy is the factor which most often determines whether an SMC algorithm works in practice. However strong the convergence results available for limitingly large samples may be, we cannot expect good performance if the finite sample which is actually used is degenerate.

Although sample degeneracy emerges as a consequence of resampling, it is really a manifestation of a deeper problem – one which resampling actually mitigates.

The resampling mechanism “resets the system” in such a way that its representation of the final-time marginals remains well behaved, at the expense of further diminishing the quality of the path samples.

Convergence Results for SMC methods

The CLT allows us to understand clearly the benefits of the resampling step and why it works: the associated SMC estimates of $Z_n$ and of $I_n(\varphi_n)$ satisfy a CLT.

Summary

  1. Whenever it is possible to sample from $q_n(x_n\mid x_{1:n-1})$ and evaluate $\alpha_n(x_{1:n})$ in a time independent of $n$, the algorithm has a computational complexity that does not increase with $n$.
  2. For any $k$, there exists $n>k$ such that the SMC approximation of $\pi_n(x_{1:k})$ consists of a single particle, because of the successive resampling steps. It is thus impossible to get a “good” SMC approximation of the joint distributions $\{\pi_n(x_{1:n})\}$ when $n$ is too large.
  3. However, under mixing conditions, this SMC algorithm is able to provide estimates of marginal distributions of the form $\pi_n(x_{n-L+1:n})$ and estimates of $Z_n/Z_{n-1}$ whose asymptotic variance is uniformly bounded in $n$.
  4. The variance of SMC estimates can only be expected to be reasonable if the variance of the incremental weights is small. In particular, we can only expect good performance if $\pi_n(x_{1:n-1})\approx \pi_{n-1}(x_{1:n-1})$ and $q_n(x_n\mid x_{1:n-1})\approx \pi_n(x_n\mid x_{1:n-1})$.
