WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Approximations for KL between GMMs

Tags: KL-divergence, Gaussian Mixture Model, Variational Inference, Unscented Transformation

This note is based on Hershey, J. R., & Olsen, P. A. (2007). Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, 4, IV-317–IV-320. https://doi.org/10.1109/ICASSP.2007.366913

The KL divergence between two GMMs is not analytically tractable.

Some techniques cope with this problem by replacing the KL divergence with other functions that can be computed efficiently.

The paper introduces two new methods:

  • the variational approximation
  • the variational upper bound

Introduction

The KL divergence, also known as relative entropy, between two probability density functions $f(x)$ and $g(x)$ is defined as

\[D(f\Vert g) = \int f(x)\log \frac{f(x)}{g(x)} dx\]

three properties:

  • self similarity: $D(f\Vert f) = 0$
  • self identification: $D(f\Vert g)=0$ only if $f = g$
  • positivity: $D(f\Vert g)\ge 0$ for all $f, g$

For two gaussians $\hat f$ and $\hat g$, the KL divergence has a closed-form expression,

\[D(\hat f\Vert \hat g) = \frac 12\left[ \log \frac{\vert \Sigma_{\hat g}\vert}{\vert \Sigma_{\hat f}\vert} + \tr[\Sigma_{\hat g}^{-1}\Sigma_{\hat f}] - d + (\mu_{\hat f} - \mu_{\hat g})^T\Sigma_{\hat g}^{-1}(\mu_{\hat f} - \mu_{\hat g}) \right]\]

whereas for two GMMs no such closed-form expression exists.
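For concreteness, here is a minimal NumPy sketch of the closed-form expression above; the function name `kl_gauss` and the toy parameters are my own, not from the paper.

```python
import numpy as np

def kl_gauss(mu_f, Sigma_f, mu_g, Sigma_g):
    """Closed-form KL divergence between N(mu_f, Sigma_f) and N(mu_g, Sigma_g)."""
    d = mu_f.shape[0]
    Sg_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sg_inv @ Sigma_f)
                  - d
                  + diff @ Sg_inv @ diff)

# toy 2-dimensional example
mu_f, Sigma_f = np.zeros(2), np.eye(2)
mu_g, Sigma_g = np.ones(2), 2 * np.eye(2)
print(kl_gauss(mu_f, Sigma_f, mu_g, Sigma_g))
```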

Next consider $f$ and $g$ to be GMMs. The marginal densities of $x\in \IR^d$ under $f$ and $g$ are

\[f(x) = \sum_a\pi_a N(x; \mu_a,\Sigma_a)\\ g(x) = \sum_b w_b N(x; \mu_b,\Sigma_b)\]

The estimates of $D(f\Vert g)$ will make use of the KL divergence between individual components, written as $D(f_a\Vert g_b)$.

Monte Carlo Sampling

Draw i.i.d. samples $\{x_i\}_{i=1}^n$ from the pdf $f$, so that $\bbE_f[\log f(x_i) / g(x_i)] = D(f\Vert g)$; the estimate $D_{MC}(f\Vert g) = \frac 1n\sum_{i=1}^n \log \frac{f(x_i)}{g(x_i)}$ converges to $D(f\Vert g)$ as $n\to\infty$.
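A minimal sketch of this estimator is below. The tuple representation of a GMM (weights, means, covariances), the helper names, and the toy mixtures are my own conventions; `scipy` supplies the component densities.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def gmm_logpdf(x, weights, means, covs):
    """Log density of a GMM evaluated at the rows of x."""
    comp = np.stack([multivariate_normal.logpdf(x, m, S)
                     for m, S in zip(means, covs)], axis=-1)
    return logsumexp(comp + np.log(weights), axis=-1)

def sample_gmm(n, weights, means, covs):
    """Draw n samples from a GMM by sampling component labels first."""
    labels = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in labels])

def kl_monte_carlo(f, g, n=100_000):
    """(1/n) sum_i log f(x_i)/g(x_i) with x_i drawn i.i.d. from f."""
    x = sample_gmm(n, *f)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))

# toy two-component mixtures in two dimensions
f = ([0.3, 0.7], [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)])
g = ([0.5, 0.5], [np.zeros(2), 2 * np.ones(2)], [np.eye(2), 3 * np.eye(2)])
print(kl_monte_carlo(f, g))
```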

The Unscented Transformation

The unscented transform is an approach to estimate $\bbE_{f_a}[h(x)]$ in such a way that the approximation is exact for all quadratic functions $h(x)$

it is possible to pick $2d$ sigma points $\{x_{a, k}\}_{k=1}^{2d}$ such that

\[\int f_a(x) h(x) dx = \frac{1}{2d} \sum_{k=1}^{2d} h(x_{a, k})\]

one possible choice of the sigma points is

\[x_{a, k} = \mu_a + \sqrt{d\lambda_{a, k}} e_{a, k}\\ x_{a, d+k} = \mu_a - \sqrt{d\lambda_{a, k}} e_{a, k}\]

for $k = 1,\ldots, d$ where $\lambda_{a, k}$ and $e_{a, k}$ are the eigenvalues and eigenvectors of the covariance $\Sigma_a$ of the gaussian $f_a$.

The KL divergence may be written as

\[D(f\Vert g) = \sum_a \pi_a \bbE_{f_a}[h]\]

for $h = \log f/g$, so the unscented estimate is

\[D_{unscented}(f\Vert g) = \frac{1}{2d} \sum_a \pi_a \sum_{k = 1}^{2d} \log \frac{f(x_{a, k})}{g(x_{a, k})}\]

the unscented estimate satisfies the similarity property, but the identification and positivity properties do not hold in general.

the unscented estimator is similar to a Monte Carlo technique except that the samples are chosen deterministically
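A sketch of the unscented estimate, with the sigma points built from the eigendecomposition of each $\Sigma_a$ exactly as above; the GMM layout (a tuple of weights, means, covariances) and the helper names are my own conventions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_logpdf(x, weights, means, covs):
    """Log density of a GMM evaluated at the rows of x."""
    comp = np.stack([multivariate_normal.logpdf(x, m, S)
                     for m, S in zip(means, covs)], axis=-1)
    return logsumexp(comp + np.log(weights), axis=-1)

def sigma_points(mu, Sigma):
    """The 2d points mu +/- sqrt(d * lambda_k) e_k from the eigendecomposition of Sigma."""
    d = mu.shape[0]
    lam, E = np.linalg.eigh(Sigma)        # eigenvalues and eigenvectors (columns of E)
    offsets = (np.sqrt(d * lam) * E).T    # row k is sqrt(d * lambda_k) e_k
    return np.vstack([mu + offsets, mu - offsets])   # shape (2d, d)

def kl_unscented(f, g):
    """(1/2d) sum_a pi_a sum_k log f(x_{a,k}) / g(x_{a,k})."""
    total = 0.0
    for pi_a, mu_a, Sigma_a in zip(*f):
        pts = sigma_points(np.asarray(mu_a), np.asarray(Sigma_a))
        total += pi_a * np.mean(gmm_logpdf(pts, *f) - gmm_logpdf(pts, *g))
    return total

f = ([0.3, 0.7], [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)])
g = ([0.5, 0.5], [np.zeros(2), 2 * np.ones(2)], [np.eye(2), 3 * np.eye(2)])
print(kl_unscented(f, g))
```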

Gaussian Approximation

a commonly used approximation to $D(f\Vert g)$ is to replace $f$ and $g$ with gaussians, $\hat f$ and $\hat g$.

the mean and covariance of $\hat f$ are matched to those of $f$:

\[\mu_{\hat f} = \sum_a \pi_a\mu_a\\ \Sigma_{\hat f} = \sum_a \pi_a(\Sigma_a + (\mu_a - \mu_{\hat f})(\mu_a - \mu_{\hat f})^T)\]

the approximation $D_{gaussian}(f\Vert g)$ is given by the closed-form expression, $D_{gaussian}(f\Vert g) = D(\hat f\Vert \hat g)$

another popular method is to use the nearest pair of gaussians resulting in

\[D_{\min} = \min_{a, b} D(f_a\Vert g_b)\]

Both $D_{gaussian}(f\Vert g)$ and $D_{\min}(f\Vert g)$ satisfy the positivity and similarity properties, but the identification property does not hold.

although they are simple to formulate, they are both rather poor approximations
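Both approximations are easy to code; below is a minimal sketch, with the moment matching following the formulas for $\mu_{\hat f}$ and $\Sigma_{\hat f}$ above. The helper names and GMM layout are my own, and `kl_gauss` repeats the closed-form component KL so the block runs on its own.

```python
import numpy as np

def kl_gauss(mu_f, Sigma_f, mu_g, Sigma_g):
    """Closed-form KL divergence between two Gaussians."""
    d = mu_f.shape[0]
    Sg_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sg_inv @ Sigma_f) - d + diff @ Sg_inv @ diff)

def moment_match(weights, means, covs):
    """Overall mean and covariance of a GMM (law of total variance)."""
    w, mus = np.asarray(weights), np.asarray(means)
    mu = w @ mus
    Sigma = sum(w_a * (S_a + np.outer(m_a - mu, m_a - mu))
                for w_a, m_a, S_a in zip(w, mus, covs))
    return mu, Sigma

def kl_gaussian_approx(f, g):
    """D_gaussian(f||g) = D(hat f || hat g) for the moment-matched Gaussians."""
    return kl_gauss(*moment_match(*f), *moment_match(*g))

def kl_min(f, g):
    """D_min(f||g): KL divergence between the nearest pair of components."""
    return min(kl_gauss(np.asarray(ma), Sa, np.asarray(mb), Sb)
               for ma, Sa in zip(f[1], f[2])
               for mb, Sb in zip(g[1], g[2]))
```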

The Product of Gaussians Approximation

Define the expected log-likelihood of $g$ under $f$ as

\[L_f(g) = \bbE_{f(x)}[\log g(x)]\]

then

\[D(f\Vert g) = L_f(f) - L_f(g)\]

an upper bound on the likelihood results from using Jensen’s inequality to move the log outside the expected value

\[\begin{align*} L_f(g) &= \sum_a \pi_a \bbE_{f_a(x)} \log \sum_b w_b g_b(x)\\ &\le \sum_a\pi_a \log \sum_b w_b \bbE_{f_a(x)} [g_b(x)]\\ &=\sum_a\pi_a\log\sum_b w_b\int f_a(x)g_b(x)dx\\ &=\sum_a\pi_a\log\sum_bw_bz_{ab}\,, \end{align*}\]

where

\[z_{ab} = \int f_a(x)g_b(x)dx\]

is the normalizing constant for a product of Gaussians, which has a well known closed-form solution.

The KL-divergence can now be estimated in a simple closed form:

\[D_{product}(f\Vert g) = \sum_a \pi_a \log \frac{\sum_{a'}\pi_{a'}z_{aa'}}{\sum_b w_bz_{ab}}\]

$D_{product}(f\Vert g)$ satisfies the similarity property, but not the identification or positivity property.

Also, $D_{product}(f\Vert g)$ tends to greatly underestimate $D(f\Vert g)$
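A sketch of $D_{product}$ is below. It uses the standard identity $z_{ab} = \int f_a(x) g_b(x) dx = N(\mu_a; \mu_b, \Sigma_a + \Sigma_b)$ for the closed-form normalizing constant mentioned above; the helper names and GMM layout are my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def z_matrix(f, g):
    """z_ab = integral of f_a(x) g_b(x) dx = N(mu_a; mu_b, Sigma_a + Sigma_b)."""
    return np.array([[multivariate_normal.pdf(ma, mean=mb, cov=Sa + Sb)
                      for mb, Sb in zip(g[1], g[2])]
                     for ma, Sa in zip(f[1], f[2])])

def kl_product(f, g):
    """D_product(f||g) = sum_a pi_a log( sum_a' pi_a' z_aa' / sum_b w_b z_ab )."""
    pi, w = np.asarray(f[0]), np.asarray(g[0])
    z_ff, z_fg = z_matrix(f, f), z_matrix(f, g)
    return float(pi @ np.log((z_ff @ pi) / (z_fg @ w)))
```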

The Matched Bound Approximation

If $f$ and $g$ have the same number of components then by the chain rule for relative entropy, we have the following upper bound

\[D(f\Vert g) \le D(\pi\Vert w) + \sum_a\pi_a D(f_a\Vert g_a) = \sum_a\pi_a (\log \pi_a / w_a + D(f_a\Vert g_a))\]

Define a matching function, $m: \{1, 2,\ldots, n_f\}\rightarrow \{1,\ldots, n_g\}$, between the $n_f$ components of $f$ and the $n_g$ components of $g$ as follows:

\[m(a) = \arg\min_b \left\{ D(f_a\Vert g_b) -\log w_b \right\}\]

Goldberger’s approximate formula can then be written as

\[D_{goldberger}(f\Vert g) = \sum_a \pi_a(D(f_a\Vert g_{m(a)}) + \log \frac{\pi_a}{w_{m(a)}})\]

$D_{goldberger}$ turns out to work well empirically
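A sketch of $D_{goldberger}$, following the matching rule and the formula above; `kl_gauss` again repeats the closed-form component KL so the block is self-contained.

```python
import numpy as np

def kl_gauss(mu_f, Sigma_f, mu_g, Sigma_g):
    """Closed-form KL divergence between two Gaussians."""
    d = mu_f.shape[0]
    Sg_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sg_inv @ Sigma_f) - d + diff @ Sg_inv @ diff)

def kl_goldberger(f, g):
    """sum_a pi_a ( D(f_a || g_{m(a)}) + log pi_a / w_{m(a)} ),
    with m(a) = argmin_b { D(f_a || g_b) - log w_b }."""
    pi, means_f, covs_f = f
    w, means_g, covs_g = g
    total = 0.0
    for pi_a, mu_a, S_a in zip(pi, means_f, covs_f):
        scores = [kl_gauss(mu_a, S_a, mu_b, S_b) - np.log(w_b)
                  for w_b, mu_b, S_b in zip(w, means_g, covs_g)]
        b = int(np.argmin(scores))
        total += pi_a * (kl_gauss(mu_a, S_a, means_g[b], covs_g[b]) + np.log(pi_a / w[b]))
    return total
```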

The Variational Approximation

define variational parameters $\phi_{b\mid a} > 0$ such that $\sum_b \phi_{b\mid a} = 1$. By Jensen’s inequality we have

\[\begin{align*} L_f(g) &= \sum_a \pi_a \bbE_{f_a(x)} \log \sum_b w_b g_b(x)\\ &= \sum_a \pi_a \bbE_{f_a(x)} \log \sum_b \phi_{b\mid a} \frac{w_bg_b(x)}{\phi_{b\mid a}}\\ &\ge \sum_a \pi_a \sum_b \phi_{b\mid a} \bbE_{f_a(x)} \log \frac{w_bg_b(x)}{\phi_{b\mid a}}\\ &= L_f(g, \phi) \end{align*}\]

The best bound is obtained by maximizing $L_f(g, \phi)$ w.r.t. $\phi$; the maximum is attained at

\[\hat\phi_{b\mid a} = \frac{w_b \exp(-D(f_a\Vert g_b))}{\sum_{b'}w_{b'}\exp(-D(f_a\Vert g_{b'}))}\]

Likewise, define

\[L_f(f, \psi) = \sum_a \pi_a \bbE_{f_a(x)}\sum_{a'}\psi_{a'\mid a} \log \frac{\pi_{a'}f_{a'}(x)}{\psi_{a'\mid a}}\]

and find the optimal $\psi_{a'\mid a}$:

\[\hat \psi_{a'\mid a} = \frac{\pi_{a'}\exp(-D(f_a\Vert f_{a'}))}{\sum_{\hat a}\pi_{\hat a}\exp(-D(f_a\Vert f_{\hat a}))}\]

Defining $D_{variational}(f\Vert g) = L_f(f, \hat \psi) - L_f(g, \hat \phi)$, the result simplifies to

\[D_{variational}(f\Vert g) = \sum_a\pi_a \log \frac{\sum_{a'}\pi_{a'}\exp(-D(f_a\Vert f_{a'}))}{\sum_{b}w_b \exp(-D(f_a\Vert g_b))}\]
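A sketch of $D_{variational}$, computed directly from the simplified formula above in terms of the pairwise component divergences; the helper names are my own.

```python
import numpy as np

def kl_gauss(mu_f, Sigma_f, mu_g, Sigma_g):
    """Closed-form KL divergence between two Gaussians."""
    d = mu_f.shape[0]
    Sg_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sg_inv @ Sigma_f) - d + diff @ Sg_inv @ diff)

def pairwise_kl(f, g):
    """Matrix of component divergences D(f_a || g_b)."""
    return np.array([[kl_gauss(np.asarray(ma), np.asarray(Sa), np.asarray(mb), np.asarray(Sb))
                      for mb, Sb in zip(g[1], g[2])]
                     for ma, Sa in zip(f[1], f[2])])

def kl_variational(f, g):
    """sum_a pi_a log( sum_a' pi_a' e^{-D(f_a||f_a')} / sum_b w_b e^{-D(f_a||g_b)} )."""
    pi, w = np.asarray(f[0]), np.asarray(g[0])
    num = np.exp(-pairwise_kl(f, f)) @ pi   # sum_a' pi_a' exp(-D(f_a || f_a'))
    den = np.exp(-pairwise_kl(f, g)) @ w    # sum_b  w_b   exp(-D(f_a || g_b))
    return float(pi @ np.log(num / den))
```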

The Variational Upper Bound

introduce the variational parameters $\phi_{b\mid a}\ge 0$ and $\psi_{a\mid b}\ge 0$ satisfying the constraints $\sum_b \phi_{b\mid a} = \pi_a$ and $\sum_a\psi_{a\mid b}=w_b$.

Using the variational parameters we may write

\[f = \sum_a \pi_a f_a = \sum_{ab} \phi_{b\mid a} f_a\\ g = \sum_b w_b g_b = \sum_{ab} \psi_{a\mid b} g_b\]

an upper bound

\[D(f\Vert g) \le D(\phi\Vert \psi) + \sum_{ab}\phi_{b\mid a} D(f_a\Vert g_b) = D_{\phi,\psi}(f\Vert g)\]

the best possible upper bound can be attained by finding the variational parameters $\hat\phi$ and $\hat\psi$ that minimize $D_{\phi, \psi}(f\Vert g)$
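The note does not spell out how this minimization is carried out. Below is a minimal sketch of one way to do it: alternate closed-form coordinate updates obtained with Lagrange multipliers, holding one set of parameters fixed at a time ($\psi_{a\mid b}\propto \phi_{b\mid a}$ rescaled so each column sums to $w_b$, then $\phi_{b\mid a}\propto \psi_{a\mid b}e^{-D(f_a\Vert g_b)}$ rescaled so each row sums to $\pi_a$). The function name, initialization, and iteration count are my own choices.

```python
import numpy as np

def kl_gauss(mu_f, Sigma_f, mu_g, Sigma_g):
    """Closed-form KL divergence between two Gaussians."""
    d = mu_f.shape[0]
    Sg_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sg_inv @ Sigma_f) - d + diff @ Sg_inv @ diff)

def kl_variational_upper_bound(f, g, n_iter=100):
    """Minimize D_{phi,psi}(f||g) = D(phi||psi) + sum_ab phi_{b|a} D(f_a||g_b)
    over phi (rows sum to pi_a) and psi (columns sum to w_b) by coordinate descent."""
    pi, w = np.asarray(f[0]), np.asarray(g[0])
    D = np.array([[kl_gauss(np.asarray(ma), np.asarray(Sa), np.asarray(mb), np.asarray(Sb))
                   for mb, Sb in zip(g[1], g[2])]
                  for ma, Sa in zip(f[1], f[2])])     # D[a, b] = D(f_a || g_b)
    phi = np.outer(pi, w)                             # feasible start: rows sum to pi_a
    for _ in range(n_iter):
        psi = phi / phi.sum(axis=0, keepdims=True) * w            # columns sum to w_b
        phi = psi * np.exp(-D)
        phi = phi / phi.sum(axis=1, keepdims=True) * pi[:, None]  # rows sum to pi_a
    psi = phi / phi.sum(axis=0, keepdims=True) * w
    return float(np.sum(phi * (np.log(phi / psi) + D)))           # D_{phi,psi}(f||g)
```

Each coordinate update exactly minimizes the bound over one set of parameters with the other fixed, so the bound decreases monotonically and the iteration converges to a stationary point of $D_{\phi,\psi}(f\Vert g)$.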

