WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Review of Composite Likelihood

Posted on
Tags: Likelihood, Pseudolikelihood, Composite Likelihood, Quasi-likelihood, Information Matrix

This note is based on Varin, C., Reid, N., & Firth, D. (2011). AN OVERVIEW OF COMPOSITE LIKELIHOOD METHODS. Statistica Sinica, 21(1), 5–42., a survey of recent developments in the theory and application of composite likelihood.


Composite likelihood is an inference function derived by multiplying a collection of component likelihoods.

The resulting estimating equation obtained from the derivative of the composite log-likelihood is an unbiased estimating equation since each individual component is a conditional or marginal density.

The inference function has the properties of likelihood from a specified model since the components are multiplied, whether or not they are independent.

Composite Likelihood Inference

Consider an $m$-dimensional vector random variable $Y$, with probability density function $f(y,\theta)$ for some unknown $p$-dimensional parameter vector $\theta\in \Theta$. Denote by $\{\cA_1,\ldots,\cA_K\}$ a set of marginal or conditional events with associated likelihoods $\cL_k(\theta;y)\propto f(y\in\cA_k;\theta)$. A composite likelihood is the weighted product


where $w_k$ are nonnegative weights to be chosen.

Composite conditional likelihoods

Perhaps the precedent of composite likelihood is the pseudolikelihood, which is the product of the conditional densities of a single observation given its neighbors,

\[\cL_C(\theta;y)=\prod_{r=1}^m f(y_r\mid\{y_s:y_s \text{ is neighbors of }y_r;\theta\})\]

Liang (1987) studies composite conditional likelihoods of type

\[\cL_C(\theta;y)=\prod_{r=1}^{m-1}\prod_{s=r+1}^m f(y_r\mid y_r+y_s;\theta)\,,\]

and applies them to stratified case-control studies.

Molenberghs and Verbeke (2005) in the context of longitudinal studies, and Mardia et al. (2008) in bioinformatics, construct composite likelihoods by pooling pairwise conditional densities

\[\cL_C(\theta;y)=\prod_{r=1}^m\prod_{s=1}^m f(y_r\mid y_s;\theta)\,,\]

or by pooling full conditional densities

\[\cL_C(\theta;y)=\prod_{r=1}^m f(y_r\mid y_{-r};\theta)\,,\]

where $y_{-r}$ denotes the vector of all the observations but $y_r$.

Composite marginal likelihoods

The simplest composite marginal likelihood is the pseudolikelihood constructed under working independence assumptions,

\[\cL_{\mathrm{ind}}(\theta;y)=\prod_{r=1}^m f(y_r;\theta)\,,\]

sometimes referred to in the literature as the independence likelihood.

If parameters related to dependence are also of interest it is necessary to model blocks of observations, as in the pairwise likelihood

\[\cL_{\mathrm{pair}}(\theta;y)=\prod_{r=1}^{m-1}\prod_{s=r+1}^m f(y_r,y_s;\theta)\,.\]

For continuous symmetric responses with inference focused on the dependence structure, researchers propose composite marginal likelihoods based on pairwise differences,

\[\cL_{\mathrm{diff}}(\theta;y) = \prod_{r=1}^{m-1}\prod_{s=r+1}^m f(y_r-y_s;\theta)\,.\]


Composite likelihoods are referred to with several different names:

  • pseudolikelihood
  • approximate likelihood
  • quasi-likelihood

Derived quantities

The second Bartlett identity does not hold, we need to distinguish between the sensitivity matrix

\[H(\theta) = \E_\theta\{-\Delta_\theta u(\theta; Y)\} = \int \{-\Delta_\theta u(\theta; y)\}f(y;\theta)dy\]

and the variability matrix

\[J(\theta) = \var_\theta\{u(\theta; Y)\}\,,\]

and the Fisher information needs to be substituted by the Godambe information matrix

\[G(\theta) = H(\theta)J(\theta)^{-1}H(\theta)\,,\]

also referred to as the sandwich information matrix.

Reserve the notation $I(\theta)=\var_\theta{\Delta_\theta\log f(Y;\theta)}$ for the expected Fisher information; if $c\ell(\theta)$ is a true log-likelihood function then $G=H=I$. An estimating equation $u(\theta;y)$ which has $H(\theta)=J(\theta)$ for all $\theta$ is called information unbiased.

Asymptotic theory

In the case of $n$ independent and identically distributed observations $Y_1,\ldots,Y_n$ from the model $f(y;\theta)$ on $\IR^m$, and $n\rightarrow\infty$ with $m$ fixed, under regularity conditions,

\[\sqrt n(\hat\theta_{CL}-\theta)\overset{d}{\rightarrow} N_p(0,G^{-1}(\theta))\,.\]

Analogues of the AIC and BIC information criteria for model selection:

  • $\AIC =-2c\ell(\hat \theta_{CL};y) + 2\dim(\theta)$
  • $\BIC = -2c\ell(\hat \theta_{CL};y) + \dim(\theta)\log n$

where $\dim(\theta)=\tr\{H(\theta)G(\theta)^{-1}\}$ is an effective number of parameters.


Gaussian random fields

A typical model in geostatistics applications is a Gaussian random field $Y=\{Y(c):c\in \cC\subset \IR^2\}$ with mean function $\mu(c)$ and covariance matrix $\Sigma(\theta)$ whose entries reflect spatial correlation.

MLE would be more efficient, but requires the inversion of the covariance matrix $\Sigma(\theta)$, usually with a computational cost of order $\cO(m^3)$. This cost is prohibitive with many modern spatial, or spatial-temporal, data sets.

Let $y_r=y(c_r)$ be the observation of process $Y$ at location $c_r$. Vecchia (1988) proposes approximating the full likelihood with the composite conditional likelihood

\[\cL_{CC}(\theta;y) = f(y_1;\theta)\prod_{r=2}^m f(y_r\mid \cB_r;\theta)\,,\]

where $\cB_r$ is a subset of $\{y_{r-1},\ldots,y_1\}$ chosen so as to make feasible the computation of $\cL_C$. Vecchia (1988) suggests restricting $\cB_r$ to a number of neighbours of $y_r$.

Stein, Chi and Welty (2004) further develop Vecchia’s proposal by using blocks of observations in place of single observations,

\[\cL_{CC}(\theta;y) = f(z_1;\theta)\prod_{b=2}^B f(z_b\mid \cB_b';\theta)\,,\]

where $z_1,\ldots,z_B$ are $B$ blocks of data and $\cB_b’$ is a subset of $\{z_{b-1},\ldots,z_1\}$.

Difficulties with composite likelihoods of Stein, Chi and Welty (2004) and Vecchia (1988) arises with the selection of the observation order and of the conditioning sets $\cB_b$ and $\cB_b’$. Three different likelihood approximations all based on splitting the data into blocks are proposed:

  • “big blocks likelihood”: estimating $\theta$ from the joint density of the block means.
  • “small blocks”: the composite marginal likelihood formed by the product of densities for all the observations in each block
  • hybrid method: a proposed compromise between the above two.

A major reason for concern with maximum likelihood estimation is the difficulty in checking the assumption of multivariate normality. This difficulty is also shared by these block strategies, but the pairwise likelihood and the composite likelihood of differences just require bivariate normality for pairs of observations, which is much simpler to validate.

Spatial extremes

The rise in hazardous environmental events leads to much interest in statistical modeling of spatial extremes. A flexible approach to this problem is provided by max-stable models.

Serially correlated random effects

Consider longitudinal counts $Y_{ir}$ observed at occasion $r=1,\ldots,m_i$ for subject $i=1,\ldots,n$. This type of data may naturally modelled as conditionally independent Poisson variables

\[Y_{ir}\mid U_i\sim Po\{U_i\exp(x_{ir}'\beta)\}\,,\]

where $U_i$ is a random effect, $x_{ir}$ is a covariate vector, and $\beta$ are unknown regression coefficients.

One can assume different Gamma distributed random effects $U_{ir}$ for each measurement,

\[Y_{ir}\mid U_{ir}\sim Po\{U_{ir}\exp(x_{ir}'\beta)\}\,,\]

while specifying the joint distribution of $U_{ir}$ to describe the serial dependence. Unfortunately, the higher model flexibility of the above formulation is paid for in terms of computational complexity. The likelihood function involves a number of terms exponentially increasing with the series length $m_i$. Likelihood computation is impractical except in low dimensions. Hence, researchers propose that inference be based on the pairwise likelihood

\[\cL_{pair}(\theta;y)=\prod_{i=1}^n \frac{1}{m_i-1}\prod_{r=1}^{m_i-1}\prod_{s=r+1}^{m_i}f(y_{ir},y_{is};\theta)\,.\]

Published in categories Note