Illustrate Path Sampling by Stan Programming
Posted on (Update: ) 0 Comments
This post reviewed the topic of path sampling in the lecture slides of STAT 5020, and noted a general path sampling described by Gelman and Meng (1998), then used a toy example to illustrate it with Stan programming language.
Path Sampling in SEM
This section is based on the lecture slides and textbook of STAT 5020.
The Bayes factor for comparing $M_1$ and $M_0$ is defined as
\[B_{10} = \frac{p(\Y\mid M_1)}{p(\Y\mid M_0)}\,,\]which involves the density $p(\Y\mid M_k)$. Let $\btheta_k$ be the random parameter vector associated with $M_k$, and we have
\[P(\Y\mid M_k)=\int p(\Y\mid \btheta_k,M_k)p(\btheta_k\mid M_k)d\btheta_k\,.\]It is very often difficult to obtain $B_{10}$ analytically, and various analytic and numerical approximations have been proposed in the literature, and path sampling is one of them.
To simplify notation, ‘$M_k$’ will be suppressed; hence $p(\Y)=p(\Y\mid M_k)$.
In general, let $\Y$ be the matrix of observed data, and $\Omega$ be the matrix of latent variables in the model. For SEMs which involve latent variables, direct application of path sampling in computing the Bayes factor is difficult. Similar to Bayesian estimation, we can utilize the idea of data augmentation to solve this problem.
Now consider the following class of densities which are denoted by a continuous parameter $t$ in $[0,1]$:
\[p(\bOmega,\btheta\mid \Y,t) = \frac{1}{z(t)}p(\Y,\bOmega, \btheta\mid t)\,,\]where
\[z(t) = P(\Y\mid t) = \int p(\Y,\bOmega,\btheta\mid t)d\bOmega d\btheta = \int p(\Y,\bOmega\mid \btheta,t)p(\btheta)d\bOmega d\btheta\]with $p(\btheta)$ the prior density of $\btheta$ which is assumed to be independent of $t$.
In computing the Bayes factor, we construct a path using the parameter $t\in [0,1]$ to link two competing models $M_1$ and $M_0$ together, so that $z(1)=p(\Y\mid 1)=p(\Y\mid M_1), z(0)=P(\Y\mid 0)=p(\Y\mid M_0)$, and $B_{10}=z(1)/z(0)$.
Note that
\[\begin{aligned} \log B_{10} & = \log \frac{z(1)}{z(0)} = \int \frac{d}{dt}\log z(t)dt\\ &=\int \color{orange}{\int \frac{1}{z(t)}\frac{d}{dt}p(\Y,\bOmega,\btheta\mid t)d\bOmega d\btheta}dt\\ &=\int \color{orange}{\int \frac{p(\bOmega,\btheta\mid \Y,t)}{p(\Y,\bOmega,\btheta\mid t)} \frac{d}{dt}p(\Y,\bOmega,\btheta\mid t)d\bOmega d\btheta }dt\\ &=\int \color{orange}{\int \frac{1}{dt}\log p(\Y,\bOmega,\btheta\mid t)p(\bOmega,\btheta\mid \Y,t)d\bOmega d\btheta} dt\\ &=\int \color{orange}{\E_{\bOmega,\btheta}\left[ \frac{d}{dt}\log p(\Y,\bOmega,\btheta\mid t) \right]}dt\,. \end{aligned}\]To evaluate the integral over $t$, we first order the unique values of fixed grids $\{t_{(s)}\}_{s=1}^S$ in the interval $[0,1]$ such that $0 = t_{(0)} < t_{(1)}<\cdots t_{(S)}< t_{(S+1)}=1$, and estimate $\log B_{10}$ by
\[\widehat{\log B_{10}} = \frac 12\sum_{s=1}^S(t_{(s+1)}-t_{(s)})(\bar U_{(s+1)}+\bar U_{(s)})\,,\]where $\bar U_{(s)}$ is the following average of the values of $U(\Y,\bOmega,\btheta,t)$ based on simulation draws at $t=t_{(s)}$:
\[\bar U_{(s)} = J^{-1}\sum_{j=1}^J U(\Y,\bOmega^{(j)}, \btheta^{(j)}, t_{(s)})\,,\]in which $\{(\bOmega^{(j)}, \btheta^{(j)}),j=1,\ldots,J\}$ are observations drawn from $p(\bOmega,\btheta\mid \Y, t_{(s)})$.
The main computation is in simulating the sample of observations $\{(\bOmega^{(j)},\btheta^{(j)}),j=1,\ldots,J\}$ from $p(\bOmega,\btheta\mid\Y,t_{(s)})$ for $s=0,\ldots,S+1$. This task can be done via some efficient MCMC methods, such as the Gibbs sampler and the MH algorithm.
If it is difficult to find a path that directly links the competing models. Most of the time, this difficulty can be solved by using appropriate auxiliary models, $M_a,M_b,\ldots$ in between $M_1$ and $M_0$. For example, suppose that $M_a$ and $M_b$ are appropriate auxiliary models such that $M_a$ can be linked with $M_1$ and $M_b$, and $M_b$ can be linked with $M_0$.
Hence, $\log B_{10}=\log B_{1a} + \log B_{ab} - \log B_{0b}$.
The Need For Computing Normalizing Constants
This section is based on Gelman and Meng (1998).
For the powerful MCMC methods, we can simulate from a complex probability model $p(\omega)$, where $\omega$ is in general a high-dimensional variable, without knowing its normalizing constant, i.e., one can evaluate $q(\omega)$, an unnormalized density function, but cannot directly calculate $z=\int q(\omega)\mu(d\omega)$, the normalizing constant, where $\mu$ can be a counting measure, a Lebesgue measure or a mixture of them.
In likelihood analysis with missing data, if one had all the observations, $y_{com}$, the computation of the complete-data likelihood for parameter $\psi$, $L(\psi\mid y_{com})=p(y_{com}\mid\psi)$, would be straightforward. This suggests the following method for simulating the observed-data likelihood $L(\psi\mid y_{obs})=p(y_{obs}\mid \psi)$ in the cases where it is difficult to calculate $L(\psi\mid y_{obs})$ directly. Because
\[p(y_{com}\mid y_{obs},\psi) = \frac{p(y_{com}\mid \psi)}{p(y_{obs}\mid \psi)}\equiv \frac{L(\psi\mid y_{com})}{L(\psi\mid y_{obs})}\,,\]we can treat the likelihood of interest $L(\psi\mid y_{obs})$ as the normalizing constant of $p(y_{com}\mid y_{obs},\psi)$, with the complete-data likelihood $L(\psi\mid y_{com})$ serving as the unnormalized density. In this formulation, $y_{com}$ plays the role of $\omega$ in our general notation. For example, In genetic linkage analysis, a key step is the computation of the likelihood of $\psi$, the locations of disease genes relative to a set of markers, based on the observed data $y_{obs}$ from a pedigree.
A related general problem is, given an unnormalized joint density $q(\omega, \theta)$, to evaluate the marginal density $p(\theta) = \int p(\omega,\theta)\mu(d\omega)$. The computation of a Bayes factor, which requires the calculation of two probabilities, each of which is the marginal density under an individual model.
In physics and chemistry, a well-studied problem of computing normalizing constants is known as free energy estimation. The problem starts with an unnormalized density, the system density:
\[q(\omega\mid T,\alpha) = \exp\left( -\frac{H(\omega, \alpha)}{kT} \right)\,,\]where $H(\omega,\alpha)$ is the energy function of state $\omega$, $k$ is Boltzmann’s constant, $T$ is the temperature and $\alpha$ is a vector of system characteristics. The free energy $F$ of the system is defined as
\[F(T,\alpha)=-kT\log(z(T,\alpha))\,,\]where $z(T,\alpha)$ is the normalizing constant of the system density.
In applications in both genetics and physics, the real interest is not a single normalizing constant itself, but rather ratios, or equivalently differences of the logarithms, of them (i.e., differences of log-likelihoods; free energy differences). Even when it appears that we need to deal with a single normalizing constant, we can almost always bring in a convenient completely known density on the same space as a reference point. Therefore, without loss of generality, consider a class of densities on the same space, which we denote either by a numerical index $t$ or by a continuous parameter $\theta$; that is
\[p_t(\omega) = \frac{1}{z_t}q_t(\omega) \text{ or } p(\omega\mid \theta) = \frac{1}{z(\theta)}q(\omega\mid\theta)\,.\]Three common approaches for approximating analytically intractable normalizing constants:
- analytic approximation
- numerical approximation
- Monte Carlo simulation
The purpose of this paper:
- describe the method of path sampling for estimating $\lambda$ unbiasedly.
- show that importance sampling, bridge sampling and path sampling represent a natural methodological evolution, from using no bridge densities to using an infinite number of them
- investigate the problem of optimal paths, which turns out to be closely related to the Jeffreys prior distribution and the Rao and Hellinger distances between two distributions.
- provide two application and potential of path sampling for statistical problems.
A general framework for path sampling
This section is based on Gelman and Meng (1998).
Assume that densities are indexed by a continuous (vector) parameter $\theta$. Given two unnormalized densities with the same support $q_0(\omega)$ and $q_1(\omega)$, we can always construct a continuous path to link them. For example, we can construct a geometric path using a scalar parameter $\theta\in [0,1]$,
\[q(\omega\mid \theta) = q_0^{1-\theta}(\omega) q_1^\theta(\omega)\,.\]First assume that $\theta$ is a scalar quantity, w.l.o.g., assume that $\theta\in[0,1]$ and that we are interested in computing the ratio $r=z(1)/z(0)$. Note that
\[\frac{d}{d\theta}\log z(\theta) = \int\color{orange}{\frac{1}{z(\theta)}}\frac{d}{d\theta} q(\omega\mid\theta)\mu(d\omega) = \int\color{orange}{\frac{p(\omega\mid \theta)}{q(\omega\mid \theta)}}\frac{d}{d\theta} q(\omega\mid\theta)\mu(d\omega)=\E_\theta\left[ \frac{d}{d\theta}\log q(\omega\mid \theta) \right]\,,\]where $\E_\theta$ denotes the expectation w.r.t. the sampling distribution $p(\omega\mid\theta)$. By analogy to the potential in statistical physics, label
\[U(\omega,\theta) = \frac{d}{d\theta}\log q(\omega\mid \theta)\,.\]Integrating from 0 to 1 yields,
\[\lambda = \log \left[ \frac{z(1)}{z(0)} \right] = \int_0^1 \E_\theta[U(\omega,\theta)]d\theta\,.\]If consider $\theta$ as a random variable with a uniform distribution on $[0,1]$, we can interpret the right-hand side as the expectation of $U(\omega,\theta)$ over the joint distribution of $(\omega,\theta)$. More generally, we can introduce a prior density $p(\theta)$ for $\theta\in [0,1]$, then
\[\lambda = \E\left[ \frac{U(\omega,\theta)}{p(\theta)} \right]\,,\]where the expectation is w.r.t. the joint density $p(\omega,\theta)=p(\omega\mid\theta)p(\theta)$.
Extension to multivariate $\theta$. Suppose $\theta$ is now a $d$-dimensional parameter vector and we are interested in the ratio $z(\theta_1)/z(\theta_0)$ for given vectors $\theta_0$ and $\theta_1$. We first select a continuous path in the $d$-dimensional parameter space that links $\theta_0$ and $\theta_1: \theta(t)=(\theta_1(t),\ldots,\theta_d(t))$. for $t\in [0,1]$, with $\theta(0)=\theta_0$ and $\theta(1)=\theta_1$. Then
\[\lambda = \int_0^1 \E_{\theta(t)}\left[ \frac{d}{dt}\log q(\omega\mid \theta(t)) \right]dt=\int_0^1\E_{\btheta(t)}\left[ \sum_{k=1}^d \dot\theta_k(t)U_k(\omega,\theta(t)) \right]dt\,,\]where
\[\begin{aligned} U_k(\omega,\theta) &= \frac{\partial \log q(\omega\mid\theta)}{\partial\theta_k}\\ \dot{\theta}_k(t) &= \frac{d\theta_k(t)}{dt},\quad k=1,\ldots,d\,. \end{aligned}\]Connection with Bridge Sampling
If a trial density $\tilde p(\omega)$ is completely known, e.g., an analytic approximaton of the target density $p(\omega)=q(\omega)/z$, then the importance sampling estimator of $z$ is based on the identity
\[z = E_{\tilde p}\left[\frac{q(\omega)}{\tilde p(\omega)}\right]\]and the corresponding Monte Carlo estimator is
\[\hat z = \frac 1n\sum_{i=1}^n\frac{q(\omega_i)}{\tilde p(\omega_i)}\,,\]where $\omega_i$ are draws from $\tilde p(\omega)$.
However, if the draws from densities that themselves are only known in unnormalized forms, and thus the above estimator cannot be applied directly. This is typically the case with iterative simulation, where one can produce draws from $p_t(\omega)$ while only knowing $q_t(\omega) = z_tp_t(\omega)$, with $z_t$ being the unknown quantity of interest. Various methods are based on special cases of the following identity
\[r\equiv \frac{z_1}{z_0} = \frac{E_0[q_1(\omega)\alpha(\omega)]}{E_1[q_0(\omega)\alpha(\omega)]}\,,\]where $E_t$ denotes the expectation w.r.t. $p_t(\omega), t=0,1$, and $\alpha(\omega)$ is an arbitrary function satifying
\[0 < \left\vert \int_{\Omega_0\cap \Omega_1}\alpha(\omega)p_0(\omega)p_1(\omega)\mu(d\omega)\right\vert < \infty\,,\]and $\Omega_t$ is the support of $p_t(\omega)$ and assume $\mu(\Omega_0\cap \Omega_1) > 0$.
Examples:
- taking $\alpha = q_0^{-1}$ leads to the commonly used identity \begin{equation} \frac{z_1}{z_0} = E_0\left[\frac{q_1(\omega)}{q_0(\omega)}\right]\quad \Omega_1\subset \Omega_0 \label{eq:z1-div-z0} \end{equation}
- taking $\alpha=[q_0q_1]^{-1}$ leads to a generalization of the “harmonic rule”.
- when $z_1=1$, it calls reciprocal importance sampling.
Now consider
\[\alpha(\omega) = \frac{q_{1/2}(\omega)}{q_0(\omega)q_1(\omega)},\omega \in \Omega_0 \cap \Omega_1\,,\]where $q_{1/2}(\omega)$ is an arbitrary unnormalized density having support $\Omega_0\cap \Omega_1$. Then
\[r\equiv \frac{z_1}{z_0} = \frac{z_{1/2}/z_0}{z_{1/2}/z_1}=\frac{E_0[q_{1/2}(\omega)/q_0(\omega)]}{E_1[q_{1/2}(\omega)/q_1(\omega)]}\,,\]with the corresponding estimator
\[\hat r = \frac{(1/n_0)\sum_{i=1}^{n_0}[q_{1/2}(\omega_{0i})/q_0(\omega_{0i})]}{(1/n_1)\sum_{i=1}^{n_1}[q_{1/2}(\omega_{1i})/q_1(\omega_{1i})]}\,.\]That is, instead of applying \eqref{eq:z1-div-z0} to directly estimate $z_1/z_0$, first estimate $z_{1/2}/z_0$ and $z_{1/2}/z_1$ and then take the ratio to cancel $z_{1/2}$. The gain of efficiency arises because with a sensible choice of the “bridge” density $p_{1/2}$, there is less nonoverlap between $p_t (t=0,1)$ and $p_{1/2}$ than that between $p_0$ and $p_1$. In other words, $p_{1/2}$ serves as a bridge between $p_0$ and $p_1$, hence the name bridge sampling.
It is possible that the two densities $q_0(\omega)$ and $q_1(\omega)$ are so far separated that, even with the optimal bridge density, the estimator is too variable to use in practice, or even does not exist if $p_1$ and $p_0$ are completely separated. In such cases, it is useful to construct a finite series of $L-1$ intermediate densities, $q(\omega\theta_l), l=0,\ldots,L$, including the two endpoints. Then in a telescoping fashion,
\[z \equiv \frac{z(1)}{z(0)} = \prod_{l=1}^L \frac{z(\theta_{l-1/2})/z(\theta_{l-1})}{z(\theta_{l-1/2})/z(\theta_l)}=\prod_{l=1}^L\frac{E_{\theta_{l-1}}[q(\omega\mid \theta_{l-1/2})/q(\omega\mid \theta_{l-1})]}{E_{\theta_l}[q(\omega\mid\theta_{l-1/2})/q(\omega\mid \theta_l)]}\,.\]When $L\rightarrow \infty$, it becomes the above path sampling.
Illustration of Path-sampling Method
This toy example is adapted from tbz533’s blog.
Let $n$ denote the number of coin tosses, and $k$ the number of ‘heads’. Hence, the data $D$ is given by the pair $(n,k)$. Given a prior $\theta\sim Beta(\alpha,\beta)$ on the probability of throwing heads, then we have
\[P(D\mid M_1) = \int_0^1 p(D\mid \theta)\pi(\theta)d\theta = \binom{n}{k}\int_0^1\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}d\theta = \binom{n}{k}B(k+\alpha, n-k+\beta)\,,\]and
\[P(D\mid M_0) = \binom{n}{k}\left(\frac 12\right)^k\left(1-\frac 12\right)^{n-k}\]To perform path sampling, consider a family of unnormalized distributions $Q_t$ indexed by $t\in[0,1]$, such that $Q_0(\theta)=\pi(\theta)$ and $Q_1(\theta)=p(D\mid \theta)\pi(\theta)$. The normalizing constants are denoted by $z(t)$, and we have $z(0)=1, z(1)=p(D\mid M_1)$. Explicitly,
\[p(\theta\mid D,M_1) = \frac{p(D\mid \theta)\pi(\theta)}{p(D\mid M_1)} \triangleq \frac{p(D\mid\theta)\pi(\theta)}{z(1)}\]and
\[p(\theta\mid D, M_0) = \frac{p(D\mid \theta)\pi(\theta)}{p(D\mid M_0)} = \pi(\theta) \triangleq \frac{\pi(\theta)}{z(0)}\,.\]Note that $z(t)=\int Q_t(\theta)d\theta$, and we would have
\[\frac{d}{dt}\log z(t) = \frac{1}{z(t)}\int \frac{d}{dt}Q_t(\theta)d\theta = \int \frac{Q_t(\theta)}{z(t)}\frac{d}{dt}\log Q_t(\theta)d\theta= E\left[\frac{d}{dt}\log Q_t(\theta)\right]\]where the expectation is under $p(\theta\mid D, M_t)$.
Let $U(\theta, t)=\frac{d}{dt}\log Q_t(\theta)$, then
\[\lambda = \log\frac{z(1)}{z(0)}=\int_0^1 \frac{d}{dt}\log z(t)dt = \int_0^1 E[U(\theta, t)]dt\]Here, $\lambda = \log P(D\mid M_1)$. One estimator for $\lambda$ is
\[\hat\lambda = \frac 1N\sum_{i=1}^NU(\theta_i,t_i)\,.\]One choice of $Q_t(\theta)$ is
\[Q_t(\theta) = \pi(\theta)^{1-t}(\pi(\theta)p(D\mid\theta))^{1-t}=\pi(\theta)p(D\mid\theta)^t\,,\]then
\[\frac{d}{dt}\log Q_t(\theta) = \log p(D\mid\theta)\,.\]import pystan
# define a Stan model
model = """
data {
int<lower=0> n;
int<lower=0, upper=n> k;
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0, upper=1> t;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
theta ~ beta(alpha, beta);
target += t * binomial_lpmf(k | n, theta);
}
generated quantities {
real U;
U = binomial_lpmf(k | n, theta);
}
"""
sm = pystan.StanModel(model_code=model)
import numpy as np
import multiprocessing
import scipy.stats as sts
import scipy.special as spl
### parameters
k = 10
n = 100
alpha = 1
beta = 1
K = 100
N = 200
### prepare a data dictionary and then run the stan model
def runStanModel(t):
coin_data = {
'n': n,
'k': k,
'alpha': alpha,
'beta': beta,
't': t
}
fit = sm.sampling(data = coin_data, iter = 2*K, warmup = K, chains = 1)
la = fit.extract(permuted = True)
return la['U']
### make partition of [0, 1]
ts = np.linspace(0, 1, N+1)
### start a worker pool
pool = multiprocessing.Pool(4) ## 4 threads
### run Stan model for each T
Us = np.array(pool.map(runStanModel, ts))
### take one sample for each t
lamhat = np.mean(Us[:,-1])
### compute a standard error
se_lamhat = sts.sem(Us[:,-1])
print("estimated lambda = %f +/- %f" % (lamhat, se_lamhat))
print("estimated p(D|M_1) = %f" % np.exp(lamhat))
estimated lambda = -4.753885 +/- 0.842809
estimated p(D|M_1) = 0.008618
Note that
\[P(D\mid M_1) = \int_0^1 p(D\mid \theta)\pi(\theta)d\theta = \binom{n}{k}\int_0^1\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}d\theta = \binom{n}{k}B(k+\alpha, n-k+\beta)\,,\]then we can calculate the exact $p(D\mid M_1)$ and $\lambda$.
exectMargLL = spl.beta(k+alpha, n-k+beta) * spl.binom(n, k)
exectMargLogLL = np.log(exectMargLL)
print("exact lambda = %f" % exectMargLogLL)
print("exact p(D|M_1) = %f" % exectMargLL)
exact lambda = -4.615121
exact p(D|M_1) = 0.009901