The First Glimpse into Pseudolikelihood
This post takes a first glimpse at the pseudolikelihood.
As Wikipedia puts it, a pseudolikelihood is an approximation to the joint probability distribution of a collection of random variables.
Given a set of random variables, $X=X_1,\ldots,X_n$ and a set $E$ of dependencies between these random variables, where $\{X_i,X_j\}\not\in E$ implies $X_i$ is conditionally independent of $X_j$ given $X_i$’s neighbors, the pseudolikelihood of $X=x=(x_1,x_2,\ldots,x_n)$ is
\[\begin{equation} \Pr(X=x) = \prod_i\Pr(X_i=x_i\mid X_j=x_j\;\text{for all $j$ for which } \{X_i,X_j\}\in E)\,. \label{eq:pseudoll} \end{equation}\]In a word, the pseudolikelihood is the product of the full conditional probabilities of the individual variables, each conditioned on the rest of the data (in practice, on its neighbors).
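As a minimal sketch of this definition (an illustration of mine, not from the Wikipedia entry), the Python snippet below evaluates $\eqref{eq:pseudoll}$ for a toy binary pairwise MRF; the chain graph, edge weights, and helper names are all assumptions made for the example.

```python
import numpy as np

# Toy binary pairwise MRF on a 3-node chain 0 - 1 - 2.
# The edge weights are arbitrary illustrative numbers.
weights = {(0, 1): 0.8, (1, 2): -0.5}

def unnorm_log_prob(x):
    """Unnormalized log probability: sum of pairwise interaction terms."""
    return sum(w * x[i] * x[j] for (i, j), w in weights.items())

def full_conditional(x, i):
    """p(X_i = x_i | X_{-i} = x_{-i}); normalizing only needs the two states of node i."""
    logits = np.array([unnorm_log_prob(x[:i] + (v,) + x[i + 1:]) for v in (0, 1)])
    probs = np.exp(logits - logits.max())
    return probs[x[i]] / probs.sum()

def pseudolikelihood(x):
    """The pseudolikelihood: the product over nodes of the full conditionals."""
    return float(np.prod([full_conditional(x, i) for i in range(len(x))]))

print(pseudolikelihood((1, 0, 1)))
```

For a model this small, the exact joint could also be computed by enumerating all $2^3$ states, which makes it easy to check that the pseudolikelihood and the true likelihood generally differ.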
Murphy (2012) provides a motivation for $\eqref{eq:pseudoll}$ based on MRFs.
Consider an MRF in log-linear form:
\[p(\y\mid \btheta) = \frac{1}{Z(\btheta)}\exp\Big(\sum_c\btheta_c^T\bphi_c(\y)\Big)\,,\]where $c$ indexes the cliques.
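As a concrete instance of this form (an illustrative example, not taken from Murphy), consider a pairwise Ising model: the cliques are the edges $(s,t)$ of the graph and each feature is the product of the two endpoint variables,

\[p(\y\mid \btheta) = \frac{1}{Z(\btheta)}\exp\Big(\sum_{(s,t)}\theta_{st}\,y_s y_t\Big)\,,\qquad \bphi_{st}(\y) = y_s y_t\,.\]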
The objective for maximum likelihood is
\[\ell_{\mathrm{ML}}(\btheta) = \frac 1N \sum_i\log p(\y_i\mid\btheta)\,.\]The derivative with respect to the weights of a particular clique, $c$, is given by
\[\begin{align*} \frac{\partial \ell_{\mathrm{ML}}}{\partial \btheta_c} &= \frac 1N\sum_i\Big[\bphi_c(\y_i)-\frac{\partial}{\partial \btheta_c}\log Z(\btheta)\Big]\\ &= \underbrace{\Big[\frac 1N \sum_i \bphi_c(\y_i)\Big]}_{\text{clamped term}}-\underbrace{\bbE[\bphi_c(\y)\mid \btheta]}_{\text{unclamped term}}\\ &= \bbE_{p_{\mathrm{emp}}}[\bphi_c(\y)]-\bbE_{p(\cdot\mid\btheta)}[\bphi_c(\y)]\,, \end{align*}\]where the second line uses the standard identity $\frac{\partial}{\partial \btheta_c}\log Z(\btheta)=\bbE[\bphi_c(\y)\mid\btheta]$. Setting the gradient to zero, we obtain the moment-matching condition:
\[\bbE_{p_{\mathrm{emp}}}[\bphi_c(\y)]=\bbE_{p(\cdot\mid\btheta)}[\bphi_c(\y)]\,.\]One alternative to MLE is to maximize the pseudolikelihood, defined as follows:
\[\ell_{\mathrm{PL}}(\btheta) = \frac 1N\sum_{i=1}^N\sum_{d=1}^D\log p(y_{id}\mid \y_{i,-d},\btheta)\,,\]which is a special case of what is more generally known as the composite likelihood.
Murphy illustrates the PL approach with a figure for a 2D grid; a small code sketch is given below. This objective is generally fast to compute, since each full conditional $p(y_{id}\mid \y_{i,-d},\btheta)$ only requires summing over the states of a single node, $y_{id}$, in order to compute the local normalizing constant.
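Here is a minimal sketch of $\ell_{\mathrm{PL}}$ for a single-parameter Ising model on a small 2D grid with synthetic data (the grid size, coupling $\theta$, and data are assumptions made purely for illustration); each term touches only one node and its neighbors, and the local normalizer is a sum over that node's two states.

```python
import numpy as np

# Illustrative single-parameter Ising model on a small 2D grid with +/-1 spins;
# grid size, coupling theta, and synthetic data are assumptions for demonstration.
rng = np.random.default_rng(0)
H, W = 4, 4
data = rng.choice([-1, 1], size=(10, H, W))  # N = 10 synthetic configurations

def neighbors(r, c):
    """4-neighborhood of grid node (r, c), respecting the boundary."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < H and 0 <= cc < W:
            yield rr, cc

def log_full_conditional(y, r, c, theta):
    """log p(y[r, c] | rest): the local normalizer sums over the two states of one node."""
    field = theta * sum(y[rr, cc] for rr, cc in neighbors(r, c))
    # For +/-1 spins this is the log-sigmoid of 2 * y[r, c] * field.
    return -np.log1p(np.exp(-2.0 * y[r, c] * field))

def pseudo_log_likelihood(theta):
    """ell_PL(theta): average over samples of the sum of log full conditionals."""
    return sum(
        log_full_conditional(y, r, c, theta)
        for y in data for r in range(H) for c in range(W)
    ) / len(data)

print(pseudo_log_likelihood(0.3))
```

Because each term is a logistic-regression-like factor, $\ell_{\mathrm{PL}}$ (and its gradient) can be evaluated without ever computing the global normalizing constant $Z(\btheta)$.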
By the way, there are many methods to compute the MLE/MAP, as shown in the following table.