# Exponential Twisting in Importance Sampling

##### Posted on Sep 18, 2019 (Update: Mar 01, 2020)

Consider the problem of estimating

$\alpha = E_f(G(X)) = \int G(x)f(x)dx$

where $X$ is a random vector of $\IR^N$ with probability density $f$, and $G$ is a function from $\IR^N$ to $\IR$.

The usual MC estimate is

$\alpha_f = \frac 1n\sum_{i=1}^nG(X_i)\,,$

where $X_i$ independently draws from $f$.

If there is another probability density $g$, then alternatively

$\alpha = E_g\left[G(X)\frac{f(X)}{g(X)}\right]\,.$

In the framework of Gaussian vector, the ordinal density $f(x)$ is

$f(x) = (2\pi)^{-N/2}\exp\left(-\frac 12\Vert x\Vert^2\right)\,,$

where $x = (x_1,\ldots,x_N)$.

Some research is mainly on adding a drift vector $\mu=(\mu_1,\ldots,\mu_N)$ and the new corresponding density is

$g_\mu(x) = (2\pi)^{-N/2}\exp\left(-\frac 12\Vert x-\mu\Vert^2\right)\,,$

then the key problem is how to achieve the following minimum

$\min_\mu E_f\left[ G^2(X) \frac{f(X)}{g_\mu(X)} \right]\,,$

since

$E_g\left[ \left(G(X)\frac{f(X)}{g(X)}\right)^2 \right] =E_f\left[G(X)^2 \frac{f(X)}{g(X)} \right]\,,$

and the importance sampling estimate is an unbiased estimate.

Consider a ratio estimator

$\delta_h^n = \sum_{i=1}^nw_ih(x_i)/\sum_{i=1}^nw_i\,,$

where $X_i\sim g(y)$ and $g$ is a candidate distribution for target $f$. The $w_i$’s are realizations of random variables $W_i$ such that $\E[W_i\mid X_i=x] = \kappa f(x)/g(x)$, $\kappa$ being an arbitrary constant (that corresponds to the lack of normalizing constants in $f$ and $g$)but $f$ and $g$ are already supposed to be density functions. Denote

$S_h^n = \sum_{i=1}^n W_ih(X_i), S_1^n=\sum_{i=1}^nW_i\,.$

Then the asymptotic variance of $\delta_h^n$ is

$\var(\delta_h^n) = \frac{1}{n^2\kappa^2}(\var(S_h^n)-2\E_f[h]\cov(S_h^n, S_1^n) + (\E_f[h])^2\var(S_1^n))\,.$

The key step in the proof is the approximation of the variance of a ratio $X/Y$ by the delta method. Let $T_1,\ldots, T_k$ be random variables with means $\theta_1,\ldots,\theta_k$. Then

$\var_\theta(g(T))\approx \sum_{i=1}^k[g_i'(\theta)]^2\var_\theta T_i+2\sum_{i > j}g_i'(\theta)g_j'(\theta)\cov_\theta(T_i,T_j)\,.$

Further calculation and approximation give Liu (1996)’s ‘rule of thumb’: the effective sample size resulting from an importance sampling procedure is approximately

$\frac{N}{1+\var_g(w(X))}\,.$ For a cumulative distribution function $F$ on $\IR^N$, define the cumulant generating function of $F$,

$\psi(\vartheta) = \log\left(\int_{\IR^N}e^{\psi_\vartheta(x)}dF(x)\right)=\log E[e^{\phi_\vartheta}(X)]\,,$

where $\phi_\vartheta(x)$ is called exponential twisting function, if $\phi_\vartheta(x)$ is chosen to satisfy $\psi(\vartheta) < \infty$.

For each twisting function $\phi_\vartheta(x)$, set

$F_{\phi_\vartheta}(x) = \int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_N}e^{\phi_\vartheta(y)-\psi(\vartheta)}dF(y)\,.$

It can be showed that $F_{\phi_\vartheta}(x)$ is a probability distribution and $\{F_{\phi_\vartheta},\vartheta\}$ form an exponential family of distributions. The transformation from $F$ to $F_{\phi_\vartheta}$ is called Exponential Twisting or Exponential change of measure.

If $F$ has a density $f$, then $F_{\phi_\vartheta}$ has density

$f_{\phi_\vartheta}(x) = e^{\phi_\vartheta(x)-\psi(\vartheta)}f(x) = \frac{f(x)}{W(x)}\,,$

here the weighted function $W(x)=e^{-\phi_\vartheta(x)+\psi(\vartheta)}$, so $f(x)=W(x)f_{\phi_\vartheta}(x)$.

The aim is to compute the expectation $V=E[G(X)]$. Using exponential twisting function $\phi_\vartheta(x)$, we have

$V=E[G(X)]=E_\vartheta[G(X)W(X)]$

and the second moment of $G(X)W(X)$ under $P_{\phi_\vartheta}$

$E_\vartheta[G^2(X)W^2(X)]=E[G^2(X)W(X)]\,.$

By the Cauchy-Schwarz inequality,

$E[G^2(X)W(X)] = E[G^2(X)e^{-\phi_\vartheta}(X)]\cdot E[e^{\phi_\vartheta}(X)]\ge E^2[G(X)] = V^2\,.$

It follows that

$\Var_\vartheta[G(X)W(X)] = E_\vartheta[G^2(X)W^2(X)] - E_\vartheta^2[G(X)W(X)] =E_\vartheta[G^2(X)W^2(X)]-V^2\,.$

In order to achieve the optimal situation, we have

$\log(G(X))=\varphi_\vartheta(X)\,,$

but in this case, $\psi(\vartheta)=\log(E[G(X)])$ = \log(V) cannot be computed analytically, as $V$ is unknown.

Apply Taylor Expansion for $\log(G(X))$,

$\log G(X) = \log G(X_1^0,\ldots,X_N^0) + \sum_{i=1}^N(X_i-X_i^0)\frac{\partial \log(G(X_1^0,X_2^0,\ldots,X_N^0))}{\partial X_i} + \\\frac{1}{2!}\sum_{i,j=1}^N(X_i-X_i^0)(X_j-X_j^0)\frac{\partial^2\log(G(X_1^0,X_2^0,\ldots,X_N^0))}{\partial X_i\partial X_j}+O(\rho^3)\,,$

where $\rho=\sqrt{\sum_{i=1}^N(X_i-X_{i0})^2}$.

Write

$\log(G(X)) = \beta_0 + \sum_{i=1}^N\beta_iX_i + \sum_{i=1}^N\sum_{j=i}^N\beta_{ij}X_iX_j+\varepsilon\,.$

In order to avoid overfitting and simplify the least square regression, remove the cross terms,

$\phi_\vartheta(x)=\beta_0 + \sum_{i=1}^N\beta_ix_i + \sum_{i=1}^N\beta_{ii}x_i^2 = \beta_0 + \beta' x + x' B x$

Published in categories Memo