Sequential Monte Carlo Methods
This note covers Section 3 of Doucet, A., & Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656–704), 3. It complements my previous post.
Sequential Monte Carlo Methods
SMC methods are a general class of Monte Carlo methods that sample sequentially from a sequence of target probability densities \pi_n(x_{1:n}) of increasing dimension, where each distribution \pi_n(x_{1:n}) is defined on the product space \cX^n. Writing
\pi_n(x_{1:n})=\frac{\gamma_n(x_{1:n})}{Z_n}\,,we require only that \gamma_n:\cX^n\rightarrow\bbR^+ is known pointwise; the normalising constant
Z_n = \int \gamma_n(x_{1:n})dx_{1:n}might be unknown.
Basics of Monte Carlo Methods
Consider approximating a generic probability density \pi_n(x_{1:n}) for some fixed n. If we sample N independent random variables, X_{1:n}^i\sim \pi_n(x_{1:n}) for i=1,\ldots,N, then the Monte Carlo method approximates \pi_n(x_{1:n}) by the empirical measure
\hat \pi_n(x_{1:n}) = \frac 1N \sum_{i=1}^N\delta_{X_{1:n}^i}(x_{1:n})\,.Based on this approximation, it is possible to approximate any marginal, say \pi_n(x_k), easily using
\hat \pi_n(x_k) = \frac 1N\sum_{i=1}^N\delta_{X_k^i}(x_k)\,,and the expectation of any test function \varphi_n:\cX^n\rightarrow\bbR given by
I_n(\varphi_n):=\int \varphi_n(x_{1:n})\pi_n(x_{1:n})dx_{1:n}is estimated by
I_n^{MC}(\varphi_n):=\int \varphi_n(x_{1:n})\hat \pi_n(x_{1:n})dx_{1:n}=\frac 1N\sum_{i=1}^N\varphi_n(X_{1:n}^i)\,.This estimate is unbiased and its variance is given by
\bbV [I_n^{MC}(\varphi_n)] = \frac 1N\Big( \int \varphi^2_n(x_{1:n})\pi_n(x_{1:n})dx_{1:n} -I_n^2(\varphi_n) \Big)The main advantage of Monte Carlo methods over standard approximation techniques is that the variance of the approximation error decreases at a rate of O(1/N) regardless of the dimension of the space \cX^n. But there are at least two main problems:
- If \pi_n(x_{1:n}) is a complex high-dimensional probability distribution, then we cannot sample from it.
- Even if we knew how to sample exactly from \pi_n(x_{1:n}), the computational complexity of such a sampling scheme is typically at least linear in the number of variables n.
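Before addressing these problems, here is a minimal sketch of the plain Monte Carlo estimator I_n^{MC}(\varphi_n) described above, for a toy target we can sample from exactly; the Gaussian target and the test function are my own choices for illustration, not from the tutorial.

```python
import numpy as np

# Toy example: pi_n(x_{1:n}) is a product of n standard Gaussians, so we can
# sample from it exactly. phi_n is the sum of squared coordinates, whose true
# expectation under pi_n is n.
rng = np.random.default_rng(0)
n, N = 5, 10_000

X = rng.standard_normal((N, n))      # X_{1:n}^i ~ pi_n, i = 1, ..., N
phi = (X ** 2).sum(axis=1)           # phi_n(X_{1:n}^i)

I_mc = phi.mean()                    # I_n^{MC}(phi_n), unbiased
var_mc = phi.var(ddof=1) / N         # estimate of V[I_n^{MC}(phi_n)] = O(1/N)

print(I_mc, var_mc)                  # I_mc should be close to n = 5
```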
Importance Sampling
This strategy can address the first problem mentioned above. IS relies on the introduction of an importance density (or proposal density, instrumental density) q_n(x_{1:n}) such that
\pi_n(x_{1:n})>0\Rightarrow q_n(x_{1:n})>0then we have
\pi_n(x_{1:n})=\frac{\frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})}q_n(x_{1:n})}{Z_n}=\frac{w_n(x_{1:n})q_n(x_{1:n})}{Z_n}\,.Assume we draw N independent samples X_{1:n}^i\sim q_n(x_{1:n}); then
\begin{align} \hat \pi_n(x_{1:n}) &= \sum_{i=1}^NW_n^i\delta_{X_{1:n}^i}(x_{1:n})\label{eq23}\\ \hat Z_n &= \frac 1N \sum_{i=1}^Nw_n(X_{1:n}^i)\label{eq24} \end{align}where
W_n^i = \frac{w_n(X_{1:n}^i)}{\sum_{j=1}^Nw_n(X_{1:n}^j)}IS provides an (unbiased) estimate of the normalising constant with relative variance
\begin{align*} \frac{\bbV_{IS}[\hat Z_n]}{Z_n^2} & = \frac{1}{N^2Z_n^2}N\bbV(w_n(X_{1:n}^i))\\ &= \frac 1N\Big[\int \frac{w_n^2(x_{1:n})q_n(x_{1:n})}{Z_n^2}dx_{1:n}-\Big(\int \frac{w_n(x_{1:n})q_n(x_{1:n})}{Z_n}dx_{1:n}\Big)^2\Big]\\ &= \frac 1N\Big[\int \frac{\pi^2_n(x_{1:n})}{q_n(x_{1:n})}dx_{1:n}-1\Big]\,. \end{align*}Also, I_n(\varphi_n) can be estimated as
I_n^{IS}(\varphi_n) = \int \varphi_n(x_{1:n})\hat\pi_n(x_{1:n})dx_{1:n}=\sum_{i=1}^NW_n^i\varphi_n(X_{1:n}^i)\,.Unlike I_n^{MC}(\varphi_n), this estimate is biased for finite N. However, it is consistent and the asymptotic bias is
\lim_{N\rightarrow \infty} N(I_n^{IS}(\varphi_n)-I_n(\varphi_n)) = -\int\frac{\pi^2_n(x_{1:n})}{q_n(x_{1:n})}(\varphi_n(x_{1:n})-I_n(\varphi_n))dx_{1:n}\,.(How to derive this? One way is a delta-method expansion of the ratio estimator I_n^{IS}(\varphi_n)=\sum_i w_n(X_{1:n}^i)\varphi_n(X_{1:n}^i)/\sum_j w_n(X_{1:n}^j) around the means of its numerator and denominator.)
Furthermore, I_n^{IS}(\varphi_n) satisfies a CLT with asymptotic variance
\frac 1N\int \frac{\pi_n^2(x_{1:n})}{q_n(x_{1:n})}(\varphi_n(x_{1:n})-I_n(\varphi_n))^2dx_{1:n}\,.The bias being O(1/N) and the variance O(1/N), the mean-squared error, given by the squared bias plus the variance, is asymptotically dominated by the variance term.
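A minimal numerical sketch of these IS estimates on a toy problem (my own example, not the paper's): the unnormalised target \gamma_n is a product of standard Gaussian factors, so Z_n=(2\pi)^{n/2} is known and can be checked against \hat Z_n, and the proposal q_n is a wider Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 100_000

# Unnormalised target gamma_n(x_{1:n}) = exp(-||x||^2 / 2), so Z_n = (2*pi)^(n/2).
def gamma_n(x):
    return np.exp(-0.5 * (x ** 2).sum(axis=1))

# Proposal q_n(x_{1:n}): product of N(0, 2^2) densities, easy to sample from.
sigma_q = 2.0
X = sigma_q * rng.standard_normal((N, n))
q = np.prod(np.exp(-0.5 * (X / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi)), axis=1)

w = gamma_n(X) / q                   # unnormalised weights w_n(X_{1:n}^i)
W = w / w.sum()                      # normalised weights W_n^i

Z_hat = w.mean()                     # unbiased estimate of Z_n
phi = (X ** 2).sum(axis=1)
I_is = (W * phi).sum()               # I_n^{IS}(phi_n), consistent but biased

print(Z_hat, (2 * np.pi) ** (n / 2), I_is)   # I_is should be close to n = 5
```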
Sequential Importance Sampling
This strategy addresses the second problem. Select an importance distribution with the following structure
\begin{align*} q_n(x_{1:n}) &= q_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})\\ &= q_1(x_1)\prod_{k=2}^nq_k(x_k\mid x_{1:k-1})\,. \end{align*}Practically, this means that to obtain particles X_{1:n}^i\sim q_n(x_{1:n}) at time n, we sample X_1^i\sim q_1(x_1) at time 1 then X_k^i\sim q_k(x_k\mid X_{1:k-1}^i) at time k for k=2,\ldots,n. The associated unnormalised weights can be computed recursively using the decomposition
\begin{align*} w_n(x_{1:n}) &= \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})}\\ &= \frac{\gamma_{n-1}(x_{1:n-1})}{q_{n-1}(x_{1:n-1})}\frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})} \end{align*}which can be written as
\begin{align*} w_n(x_{1:n}) &= w_{n-1}(x_{1:n-1})\cdot \alpha_n(x_{1:n})\\ &=w_1(x_1)\prod_{k=2}^n\alpha_k(x_{1:k}) \end{align*}where the incremental importance weight function \alpha_n(x_{1:n}) is given by
\alpha_n(x_{1:n})=\frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})}\,.At any time n, we obtain the estimates \hat\pi_n(x_{1:n}) and \hat Z_n as given in \eqref{eq23} and \eqref{eq24}, respectively.
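Before discussing the normalising-constant ratio, here is a minimal sketch of this SIS recursion. The framework is generic; the particular model at the bottom (standard Gaussian target factors with a random-walk proposal) is my own choice, purely to make the sketch runnable.

```python
import numpy as np

def sis(sample_q1, sample_qk, log_alpha, n_steps, N, rng):
    """Sequential importance sampling for user-supplied proposals q_1, q_k and
    incremental weight function alpha_k (log-scale, for numerical stability)."""
    x = sample_q1(N, rng)[:, None]          # X_1^i ~ q_1(x_1)
    logw = log_alpha(1, x)                  # log w_1(X_1^i)
    for k in range(2, n_steps + 1):
        x_new = sample_qk(k, x, rng)        # X_k^i ~ q_k(. | X_{1:k-1}^i)
        x = np.hstack([x, x_new[:, None]])
        logw = logw + log_alpha(k, x)       # w_k = w_{k-1} * alpha_k
    return x, logw

# Toy model: gamma_n(x_{1:n}) is a product of standard normal factors and
# q_k(x_k | x_{1:k-1}) = N(x_{k-1}, 1) is a random-walk proposal.  Additive
# constants are dropped from the log-weights; they cancel after normalisation.
def sample_q1(N, rng):
    return rng.standard_normal(N)

def sample_qk(k, x, rng):
    return x[:, -1] + rng.standard_normal(x.shape[0])

def log_alpha(k, x):
    x_k = x[:, -1]
    x_prev = x[:, -2] if x.shape[1] > 1 else 0.0
    return -0.5 * x_k ** 2 + 0.5 * (x_k - x_prev) ** 2

rng = np.random.default_rng(0)
X, logw = sis(sample_q1, sample_qk, log_alpha, n_steps=10, N=1000, rng=rng)
W = np.exp(logw - logw.max())
W /= W.sum()                                # normalised weights W_n^i
```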
It is straightforward to check that a consistent estimate of Z_n/Z_{n-1} is also provided by the same set of samples:
\widehat{\frac{Z_n}{Z_{n-1}}}=\sum_{i=1}^NW_{n-1}^i\alpha_n(X_{1:n}^i)\,.The estimator is motivated by the fact that
\begin{align*} \int \alpha_n(x_{1:n})\pi_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1}) dx_{1:n} &= \int \frac{\gamma_n(x_{1:n})\pi_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})q_n(x_n\mid x_{1:n-1})}dx_{1:n}\\ &= \int \frac{\gamma_n(x_{1:n})}{Z_{n-1}}dx_{1:n}\\ &=\frac{Z_n}{Z_{n-1}}\,. \end{align*}The variance of w_n(x_{1:n}) is minimised by selecting
q_n^{opt}(x_n\mid x_{1:n-1}) = \pi_n(x_n\mid x_{1:n-1})as in this case the variance of w_n(x_{1:n}) conditional upon x_{1:n-1} is zero and the associated incremental weight is given by
\alpha_n^{opt}(x_{1:n})=\frac{\gamma_n(x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})}=\frac{\int\gamma_n(x_{1:n})dx_n}{\gamma_{n-1}(x_{1:n-1})}\,.It is not always possible to sample from \pi_n(x_n\mid x_{1:n-1}), nor is it always possible to compute \alpha_n^{opt}(x_{1:n}). In this case, we should employ an approximation of q_n^{opt}(x_n\mid x_{1:n-1}) as q_n(x_n\mid x_{1:n-1}).
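To see where this expression comes from (a short derivation I am adding for completeness; here \gamma_n(x_{1:n-1}) denotes \int\gamma_n(x_{1:n})dx_n), write the optimal proposal in terms of \gamma_n and substitute it into \alpha_n:
\begin{align*} \pi_n(x_n\mid x_{1:n-1}) &= \frac{\gamma_n(x_{1:n})}{\int\gamma_n(x_{1:n})dx_n}=\frac{\gamma_n(x_{1:n})}{\gamma_n(x_{1:n-1})}\,,\\ \alpha_n^{opt}(x_{1:n}) &= \frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})\pi_n(x_n\mid x_{1:n-1})}=\frac{\gamma_n(x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})}\,, \end{align*}which no longer depends on x_n; hence the variance of w_n(x_{1:n}) conditional upon x_{1:n-1} is zero.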
It is important to be aware that the methodology presented here suffers from severe drawbacks. Even for standard IS (SIS is nothing but a special version of IS), the variance of the resulting estimates increases exponentially with n.
Resampling
Resampling techniques are a key ingredient of SMC methods which partially solve this problem in some important scenarios.
Consider the IS approximation \hat\pi_n(x_{1:n}) of the target distribution \pi_n(x_{1:n}). This approximation is based on weighted samples from q_n(x_{1:n}) and does not provide samples approximately distributed according to \pi_n(x_{1:n}).
To obtain approximate samples from \pi_n(x_{1:n}), we can simply sample from its IS approximation \hat\pi_n(x_{1:n}); that is, we select X_{1:n}^i with probability W_n^i. This operation is called resampling.
To obtain N samples from \hat\pi_n(x_{1:n}), we can simply resample N times from \hat\pi_n(x_{1:n}). This is equivalent to associating a number of offspring N_n^i with each particle X_{1:n}^i in such a way that N_n^{1:N}=(N_n^1,\ldots,N_n^N) follow a multinomial distribution with parameter vector (N, W_n^{1:N}) and associating a weight of 1/N with each offspring.
We approximate \hat\pi_n(x_{1:n}) by the resampled empirical measure
\bar \pi_n(x_{1:n}) = \sum_{i=1}^N\frac{N_n^i}{N}\delta_{X_{1:n}^i}(x_{1:n})where \bbE[N_n^i\mid W_n^{1:N}]=NW_n^i. Hence \bar\pi_n(x_{1:n}) is an unbiased approximation of \hat\pi_n(x_{1:n}).
Improved resampling schemes preserve this unbiasedness property, \bbE[N_n^i\mid W_n^{1:N}]=NW_n^i, while making \bbV[N_n^i\mid W_n^{1:N}] smaller than that obtained via the multinomial resampling scheme.
The three most popular algorithms, in descending order of popularity/efficiency, are:
- Systematic Resampling
- Residual Resampling
- Multinomial Resampling
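A sketch of systematic resampling (the scheme listed first above), written as a function returning the indices of the selected particles, with multinomial resampling included for comparison. This is my own implementation of the standard algorithms, not code from the tutorial.

```python
import numpy as np

def multinomial_resample(W, rng):
    """Multinomial resampling: draw N offspring indices i.i.d. with probabilities W,
    so that N_n^{1:N} is multinomial with parameters (N, W_n^{1:N})."""
    return rng.choice(len(W), size=len(W), p=W)

def systematic_resample(W, rng):
    """Systematic resampling: one uniform draw shifts an evenly spaced grid of N
    points on [0, 1); particle i is selected once for each grid point landing in
    its weight interval.  E[N_n^i | W_n^{1:N}] = N W_n^i, with a smaller
    conditional variance than multinomial resampling."""
    N = len(W)
    cum = np.cumsum(W)
    cum[-1] = 1.0                               # guard against round-off
    u = (rng.random() + np.arange(N)) / N       # u_j = (U + j) / N, U ~ U[0, 1)
    return np.searchsorted(cum, u)

rng = np.random.default_rng(0)
W = rng.random(100)
W /= W.sum()
idx = systematic_resample(W, rng)               # ancestor indices; X_bar^i = X^{idx[i]}
```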
Resampling allows us to obtain samples distributed approximately according to \pi_n(x_{1:n}), but it should be clear that if we are interested in estimating I_n(\varphi_n) then we will obtain an estimate with lower variance using \hat\pi_n(x_{1:n}) than the one we would obtain by using \bar \pi_n(x_{1:n}). By resampling we indeed add some extra “noise”. However, an important advantage of resampling is that it allows us to remove particles with low weights with high probability.
A Generic Sequential Monte Carlo Algorithm
SMC methods are a combination of SIS and resampling.
Sequential Monte Carlo, also called Sequential Importance Resampling (SIR) or Sequential Importance Sampling and Resampling (SIS/R)
At time n = 1
- Sample X_1^i\sim q_1(x_1)
- Compute the weights w_1(X_1^i) and W_1^i\propto w_1(X_1^i)
- Resample \{W_1^i,X_1^i\} to obtain N equally-weighted particles \{\frac 1N, \bar X_1^i\}
At time n\ge 2
- Sample X_n^i\sim q_n(x_n\mid \bar X_{1:n-1}^i) and set X_{1:n}^i\leftarrow (\bar X_{1:n-1}^i, X_n^i)
- Compute the weights \alpha_n(X_{1:n}^i) and W_n^i\propto W_{n-1}^i\alpha_n(X_{1:n}^i)
- Resample \{W_n^i, X_{1:n}^i\} to obtain N new equally-weighted particles \{\frac 1N, \bar X_{1:n}^i\}
At time n, this algorithm provides two approximations of \pi_n(x_{1:n}). We obtain
\tilde \pi_n(x_{1:n}) = \sum_{i=1}^NW_n^i\delta_{X_{1:n}^i}(x_{1:n})after the sampling step and
\bar \pi_n(x_{1:n}) = \frac 1N\sum_{i=1}^N\delta_{\bar X_{1:n}^i}(x_{1:n})after the resampling step.
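Putting the sampling and resampling steps together, here is a compact sketch of this generic algorithm for the same toy Gaussian random-walk model used in the SIS sketch above (my own illustrative choices, not the paper's example). Resampling is performed at the start of each iteration using the previous weights, which is equivalent to resampling at the end of the previous step.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_steps = 1000, 50

def systematic_resample(W, rng):
    cum = np.cumsum(W)
    cum[-1] = 1.0
    u = (rng.random() + np.arange(len(W))) / len(W)
    return np.searchsorted(cum, u)

# Illustrative model: gamma_n(x_{1:n}) is a product of standard normal factors,
# q_1 = N(0, 1) and q_k(x_k | x_{1:k-1}) = N(x_{k-1}, 1).
def log_alpha(x_prev, x_new):
    return -0.5 * x_new ** 2 + 0.5 * (x_new - x_prev) ** 2

# Time n = 1: sample X_1^i ~ q_1; here w_1 is constant, so W_1^i = 1/N.
paths = rng.standard_normal((N, 1))
W = np.full(N, 1.0 / N)

for k in range(2, n_steps + 1):
    # Resample {W_{k-1}^i, X_{1:k-1}^i} to get N equally weighted particles.
    paths = paths[systematic_resample(W, rng)]
    # Sample X_k^i ~ q_k(. | bar X_{1:k-1}^i).
    x_new = paths[:, -1] + rng.standard_normal(N)
    # Compute incremental weights alpha_k and normalise (W_{k-1}^i = 1/N here).
    logw = log_alpha(paths[:, -1], x_new)
    W = np.exp(logw - logw.max())
    W /= W.sum()
    paths = np.hstack([paths, x_new[:, None]])

# tilde_pi_n: weighted particles (paths, W); bar_pi_n: the paths after one more resampling.
```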
If the particles have unnormalised weights with a small variance then the resampling step might be unnecessary. In practice, it is more sensible to resample only when the variance of the unnormalised weights is above a pre-specified threshold. For this purpose, we can use the so-called Effective Sample Size (ESS) criterion,
ESS = \Big(\sum_{i=1}^N(W_n^i)^2\Big)^{-1}\,.In a simple IS setting, inference based on the N weighted samples is approximately equivalent (in terms of estimator variance) to inference based on ESS perfect samples from the target distribution. The ESS takes values between 1 and N and we resample only when it is below a threshold N_T; typically N_T=N/2.
Alternative criteria can be used such as the entropy of the weights \{W_n^i\} which achieves its maximum value when W_n^i=1/N. In this case, we resample when the entropy is below a given threshold.
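Both criteria are cheap to compute from the normalised weights. A small sketch (my own helper functions and threshold choice):

```python
import numpy as np

def ess(W):
    """Effective sample size 1 / sum_i (W_n^i)^2; takes values between 1 and N."""
    return 1.0 / np.sum(W ** 2)

def weight_entropy(W):
    """Entropy of the normalised weights; maximal (= log N) when W_n^i = 1/N."""
    W = W[W > 0]
    return -np.sum(W * np.log(W))

W = np.array([0.7, 0.1, 0.1, 0.1])
N = len(W)
resample_now = ess(W) < N / 2          # typical threshold N_T = N / 2
```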
Sequential Monte Carlo with Adaptive Resampling
At time n=1
- Sample X_1^i\sim q_1(x_1)
- Compute the weights w_1(X_1^i) and W_1^i\propto w_1(X_1^i)
- If the resampling criterion is satisfied, resample \{W_1^i, X_1^i\} to obtain N equally weighted particles \{\frac 1N, \bar X_1^i\} and set \{\bar W_1^i,\bar X_1^i\}\leftarrow \{\frac 1N, \bar X_1^i\}, otherwise set \{\bar W_1^i,\bar X_1^i\}\leftarrow \{W_1^i, X_1^i\}
At time n\ge 2
- Sample X_n^i\sim q_n(x_n\mid \bar X_{1:n-1}^i) and set X_{1:n}^i\leftarrow (\bar X_{1:n-1}^i, X_n^i)
- Compute the weights \alpha_n(X_{1:n}^i) and W_n^i\propto \bar W_{n-1}^i\alpha_n(X_{1:n}^i)
- If the resampling criterion is satisfied, resample \{W_n^i, X_{1:n}^i\} to obtain N equally weighted particles \{\frac 1N, \bar X_{1:n}^i\} and set \{\bar W_n^i,\bar X_{1:n}^i\}\leftarrow \{\frac 1N, \bar X_{1:n}^i\}, otherwise set \{\bar W_n^i,\bar X_{1:n}^i\}\leftarrow \{W_n^i,X_{1:n}^i\}.
In this context, we have two approximations of \pi_n(x_{1:n})
\begin{align*} \hat\pi_n(x_{1:n}) &= \sum_{i=1}^NW_n^i\delta_{X_{1:n}^i}(x_{1:n})\\ \bar \pi_n(x_{1:n}) &= \sum_{i=1}^N\bar W_n^i\delta_{\bar X_{1:n}^i}(x_{1:n}) \end{align*}which are equal if no resampling step is used at time n.
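The only change relative to the previous sketch is that resampling is now triggered by the ESS criterion instead of being performed at every step; a condensed version follows (the same toy model and helper function are repeated so the snippet stands alone).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_steps = 1000, 50
N_T = N / 2                                     # ESS threshold

def systematic_resample(W, rng):
    cum = np.cumsum(W)
    cum[-1] = 1.0
    u = (rng.random() + np.arange(len(W))) / len(W)
    return np.searchsorted(cum, u)

# Same illustrative model as before: gamma_n is a product of standard normal
# factors and q_k(x_k | x_{1:k-1}) = N(x_{k-1}, 1).
paths = rng.standard_normal((N, 1))
W_bar = np.full(N, 1.0 / N)                     # bar W_1^i (ESS = N at n = 1, no resampling)

for k in range(2, n_steps + 1):
    x_new = paths[:, -1] + rng.standard_normal(N)
    # W_k^i proportional to bar W_{k-1}^i * alpha_k(X_{1:k}^i)
    logw = np.log(W_bar) - 0.5 * x_new ** 2 + 0.5 * (x_new - paths[:, -1]) ** 2
    W = np.exp(logw - logw.max())
    W /= W.sum()
    paths = np.hstack([paths, x_new[:, None]])
    if 1.0 / np.sum(W ** 2) < N_T:              # resample only if ESS < N_T
        idx = systematic_resample(W, rng)
        paths, W_bar = paths[idx], np.full(N, 1.0 / N)
    else:
        W_bar = W
```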
The presence or absence of degeneracy is the factor which most often determines whether an SMC algorithm works in practice. However strong the convergence results available for limitingly large samples may be, we cannot expect good performance if the finite sample which is actually used is degenerate.
Although sample degeneracy emerges as a consequence of resampling, it is really a manifestation of a deeper problem – one which resampling actually mitigates.
The resampling mechanism “resets the system” in such a way that its representation of the final-time marginals remains well behaved, at the expense of further diminishing the quality of the path samples.
Convergence Results for SMC Methods
The associated SMC estimates of \hat Z_n/Z_n and I_n(\varphi_n) satisfy a CLT, and this CLT allows us to understand clearly the benefits of the resampling step and why it works.
Summary
- Whenever it is possible to sample from q_n(x_n\mid x_{1:n-1}) and evaluate \alpha_n(x_{1:n}) in a time independent of n, this yields an algorithm whose computational complexity does not increase with n.
- For any k, there exists n>k such that the SMC approximation of \pi_n(x_{1:k}) consists of a single particle, because of the successive resampling steps. It is thus impossible to get a “good” SMC approximation of the joint distributions \{\pi_n(x_{1:n})\} when n is too large.
- However, under mixing conditions, this SMC algorithm is able to provide estimates of marginal distributions of the form \pi_n(x_{n-L+1:n}) and estimates of Z_n/Z_{n-1} whose asymptotic variance is uniformly bounded in n.
- The variance of SMC estimates can only be expected to be reasonable if the variance of the incremental weights is small. In particular, we can only expect to obtain good performance if \pi_n(x_{1:n-1})\approx \pi_{n-1}(x_{1:n-1}) and q_n(x_n\mid x_{1:n-1})\approx \pi_n(x_n\mid x_{1:n-1}).