# Illustrate Path Sampling by Stan Programming

##### Posted on 0 Comments

This post reviewed the topic of path sampling in the lecture slides of STAT 5020, and noted a general path sampling described by Gelman and Meng (1998), then used a toy example to illustrate it with Stan programming language.

## Path Sampling in SEM

This section is based on the lecture slides and textbook of STAT 5020.

The Bayes factor for comparing $M_1$ and $M_0$ is defined as

which involves the density $p(\Y\mid M_k)$. Let $\btheta_k$ be the random parameter vector associated with $M_k$, and we have

It is very often difficult to obtain $B_{10}$ analytically, and various analytic and numerical approximations have been proposed in the literature, and path sampling is one of them.

To simplify notation, ‘$M_k$’ will be suppressed; hence $p(\Y)=p(\Y\mid M_k)$.

In general, let $\Y$ be the matrix of observed data, and $\Omega$ be the matrix of latent variables in the model. For SEMs which involve latent variables, direct application of path sampling in computing the Bayes factor is difficult. Similar to Bayesian estimation, we can utilize the idea of data augmentation to solve this problem.

Now consider the following class of densities which are denoted by a continuous parameter $t$ in $[0,1]$:

where

with $p(\btheta)$ the prior density of $\btheta$ which is assumed to be independent of $t$.

In computing the Bayes factor, we construct a path using the parameter $t\in [0,1]$ to link two competing models $M_1$ and $M_0$ together, so that $z(1)=p(\Y\mid 1)=p(\Y\mid M_1), z(0)=P(\Y\mid 0)=p(\Y\mid M_0)$, and $B_{10}=z(1)/z(0)$.

Note that

To evaluate the integral over $t$, we first order the unique values of fixed grids $\{t_{(s)}\}_{s=1}^S$ in the interval $[0,1]$ such that $0 = t_{(0)} < t_{(1)}<\cdots t_{(S)}< t_{(S+1)}=1$, and estimate $\log B_{10}$ by

where $\bar U_{(s)}$ is the following average of the values of $U(\Y,\bOmega,\btheta,t)$ based on simulation draws at $t=t_{(s)}$:

in which $\{(\bOmega^{(j)}, \btheta^{(j)}),j=1,\ldots,J\}$ are observations drawn from $p(\bOmega,\btheta\mid \Y, t_{(s)})$.

The main computation is in simulating the sample of observations $\{(\bOmega^{(j)},\btheta^{(j)}),j=1,\ldots,J\}$ from $p(\bOmega,\btheta\mid\Y,t_{(s)})$ for $s=0,\ldots,S+1$. This task can be done via some efficient MCMC methods, such as the Gibbs sampler and the MH algorithm.

If it is difficult to find a path that directly links the competing models. Most of the time, this difficulty can be solved by using appropriate auxiliary models, $M_a,M_b,\ldots$ in between $M_1$ and $M_0$. For example, suppose that $M_a$ and $M_b$ are appropriate auxiliary models such that $M_a$ can be linked with $M_1$ and $M_b$, and $M_b$ can be linked with $M_0$.

Hence, $\log B_{10}=\log B_{1a} + \log B_{ab} - \log B_{0b}$.

## The Need For Computing Normalizing Constants

This section is based on Gelman and Meng (1998).

For the powerful MCMC methods, we can simulate from a complex probability model $p(\omega)$, where $\omega$ is in general a high-dimensional variable, without knowing its normalizing constant, i.e., one can evaluate $q(\omega)$, an **unnormalized density function**, but cannot directly calculate $z=\int q(\omega)\mu(d\omega)$, the **normalizing constant**, where $\mu$ can be a counting measure, a Lebesgue measure or a mixture of them.

In likelihood analysis with missing data, if one had all the observations, $y_{com}$, the computation of the complete-data likelihood for parameter $\psi$, $L(\psi\mid y_{com})=p(y_{com}\mid\psi)$, would be straightforward. This suggests the following method for simulating the observed-data likelihood $L(\psi\mid y_{obs})=p(y_{obs}\mid \psi)$ in the cases where it is difficult to calculate $L(\psi\mid y_{obs})$ directly. Because

we can treat the likelihood of interest $L(\psi\mid y_{obs})$ as the normalizing constant of $p(y_{com}\mid y_{obs},\psi)$, with the complete-data likelihood $L(\psi\mid y_{com})$ serving as the unnormalized density. In this formulation, $y_{com}$ plays the role of $\omega$ in our general notation. For example, In genetic linkage analysis, a key step is the computation of the likelihood of $\psi$, the locations of disease genes relative to a set of markers, based on the observed data $y_{obs}$ from a pedigree.

A related general problem is, given an unnormalized joint density $q(\omega, \theta)$, to evaluate the marginal density $p(\theta) = \int p(\omega,\theta)\mu(d\omega)$. The computation of a Bayes factor, which requires the calculation of two probabilities, each of which is the marginal density under an individual model.

In physics and chemistry, a well-studied problem of computing normalizing constants is known as free energy estimation. The problem starts with an unnormalized density, the **system density**:

where $H(\omega,\alpha)$ is the energy function of state $\omega$, $k$ is Boltzmann’s constant, $T$ is the temperature and $\alpha$ is a vector of system characteristics. The **free energy** $F$ of the system is defined as

where $z(T,\alpha)$ is the normalizing constant of the system density.

In applications in both genetics and physics, the real interest is not a single normalizing constant itself, but rather ratios, or equivalently differences of the logarithms, of them (i.e., differences of log-likelihoods; free energy differences). Even when it appears that we need to deal with a single normalizing constant, we can almost always bring in a convenient completely known density on the same space as a reference point. Therefore, without loss of generality, consider a class of densities on the same space, which we denote either by a numerical index $t$ or by a continuous parameter $\theta$; that is

Three common approaches for approximating analytically intractable normalizing constants:

- analytic approximation
- numerical approximation
- Monte Carlo simulation

The purpose of this paper:

- describe the method of path sampling for estimating $\lambda$ unbiasedly.
- show that importance sampling, bridge sampling and path sampling represent a natural methodological evolution, from using no bridge densities to using an infinite number of them
- investigate the problem of optimal paths, which turns out to be closely related to the Jeffreys prior distribution and the Rao and Hellinger distances between two distributions.
- provide two application and potential of path sampling for statistical problems.

## A general framework for path sampling

This section is based on Gelman and Meng (1998).

Assume that densities are indexed by a continuous (vector) parameter $\theta$. Given two unnormalized densities with the same support $q_0(\omega)$ and $q_1(\omega)$, we can always construct a **continuous path** to link them. For example, we can construct a geometric path using a scalar parameter $\theta\in [0,1]$,

First assume that $\theta$ is a scalar quantity, w.l.o.g., assume that $\theta\in[0,1]$ and that we are interested in computing the ratio $r=z(1)/z(0)$. Note that

where $\E_\theta$ denotes the expectation w.r.t. the sampling distribution $p(\omega\mid\theta)$. By analogy to the potential in statistical physics, label

Integrating from 0 to 1 yields,

If consider $\theta$ as a random variable with a uniform distribution on $[0,1]$, we can interpret the right-hand side as the expectation of $U(\omega,\theta)$ over the joint distribution of $(\omega,\theta)$. More generally, we can introduce a prior density $p(\theta)$ for $\theta\in [0,1]$, then

where the expectation is w.r.t. the joint density $p(\omega,\theta)=p(\omega\mid\theta)p(\theta)$.

Extension to multivariate $\theta$. Suppose $\theta$ is now a $d$-dimensional parameter vector and we are interested in the ratio $z(\theta_1)/z(\theta_0)$ for given vectors $\theta_0$ and $\theta_1$. We first select a continuous path in the $d$-dimensional parameter space that links $\theta_0$ and $\theta_1: \theta(t)=(\theta_1(t),\ldots,\theta_d(t))$. for $t\in [0,1]$, with $\theta(0)=\theta_0$ and $\theta(1)=\theta_1$. Then

where

## Illustration of Path-sampling Method

This toy example is adapted from tbz533’s blog.

Let $n$ denote the number of coin tosses, and $k$ the number of ‘heads’. Hence, the data $D$ is given by the pair $(n,k)$. Given a prior $\theta\sim Beta(\alpha,\beta)$ on the probability of throwing heads, then we have

and

To perform path sampling, consider a family of unnormalized distributions $Q_t$ indexed by $t\in[0,1]$, such that $Q_0(\theta)=\pi(\theta)$ and $Q_1(\theta)=p(D\mid \theta)\pi(\theta)$. The normalizing constants are denoted by $z(t)$, and we have $z(0)=1, z(1)=p(D\mid M_1)$. Explicitly,

and

Note that $z(t)=\int Q_t(\theta)d\theta$, and we would have

where the expectation is under $p(\theta\mid D, M_t)$.

Let $U(\theta, t)=\frac{d}{dt}\log Q_t(\theta)$, then

Here, $\lambda = \log P(D\mid M_1)$. One estimator for $\lambda$ is

One choice of $Q_t(\theta)$ is

then

```
import pystan
# define a Stan model
model = """
data {
int<lower=0> n;
int<lower=0, upper=n> k;
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0, upper=1> t;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
theta ~ beta(alpha, beta);
target += t * binomial_lpmf(k | n, theta);
}
generated quantities {
real U;
U = binomial_lpmf(k | n, theta);
}
"""
sm = pystan.StanModel(model_code=model)
import numpy as np
import multiprocessing
import scipy.stats as sts
import scipy.special as spl
### parameters
k = 10
n = 100
alpha = 1
beta = 1
K = 100
N = 200
### prepare a data dictionary and then run the stan model
def runStanModel(t):
coin_data = {
'n': n,
'k': k,
'alpha': alpha,
'beta': beta,
't': t
}
fit = sm.sampling(data = coin_data, iter = 2*K, warmup = K, chains = 1)
la = fit.extract(permuted = True)
return la['U']
### make partition of [0, 1]
ts = np.linspace(0, 1, N+1)
### start a worker pool
pool = multiprocessing.Pool(4) ## 4 threads
### run Stan model for each T
Us = np.array(pool.map(runStanModel, ts))
### take one sample for each t
lamhat = np.mean(Us[:,-1])
### compute a standard error
se_lamhat = sts.sem(Us[:,-1])
print("estimated lambda = %f +/- %f" % (lamhat, se_lamhat))
print("estimated p(D|M_1) = %f" % np.exp(lamhat))
```

```
estimated lambda = -4.753885 +/- 0.842809
estimated p(D|M_1) = 0.008618
```

Note that

then we can calculate the exact $p(D\mid M_1)$ and $\lambda$.

```
exectMargLL = spl.beta(k+alpha, n-k+beta) * spl.binom(n, k)
exectMargLogLL = np.log(exectMargLL)
print("exact lambda = %f" % exectMargLogLL)
print("exact p(D|M_1) = %f" % exectMargLL)
```

```
exact lambda = -4.615121
exact p(D|M_1) = 0.009901
```