WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Comparisons of Three Likelihood Criteria

Posted on
Tags: Pseudolikelihood, Quasi-likelihood, Generalized Linear Model, MLE, NBα-distribution, Dispersion

The note is for Nelder, J. A., & Lee, Y. (1992). Likelihood, Quasi-Likelihood and Pseudolikelihood: Some Comparisons. Journal of the Royal Statistical Society. Series B (Methodological), 54(1), 273–284.


Several approaches can be taken to fit models jointly to the mean and dispersion of a response.

  • If it is possible to specify a likelihood then maximum likelihood (ML) can be used.
    • in the normal case, both $\mu$ and $\sigma^2$ are modelled as functions of covariates.
    • in the non-normal case, the approach extends to the class of GLMs, but GLMs with Poisson and binomial errors have the variance as a fixed function of the mean.
  • consider models without exact likelihood in order to allow the dispersion to vary independently, where a quasi-likelihood (QL) can be defined based on the first two moments only of the distribution.

The properties of the maximum quasi-likelihood (MQL) estimators relative to the ML estimators:

  • little disagreement about the estimating equations for the model for the mean
  • the main dichotomy lies in the choice of response variable for the dispersion analysis.
    • $d$, the deviance component for each observation: extended quasi-likelihood (EQL)
    • $X^2$, the Pearson $\chi^2$ component: pseudolikelihood (PL)

Comparisons of the likelihood, EQL and PL:

  • for normal errors, these three criteria are equivalent
  • for non-normal errors, in particular the class of GLMs, there is considerable disagreement.

Three models:

  • involving a distribution for the errors which does not belong to the GLM family of distributions, but for which there exists a QL.
  • a model specified by the first two moments only
  • involve a Poisson mixture with the inverse Gaussian distribution

The three criteria

In a GLM,

\[\E(y) = \mu = g^{-1}(X\beta)\,,\]

where $g$ is the link function (monotonic), then

\[\eta = g(\mu) = \sum x_j\beta_j\,.\]

The log-likelihood can be written in the form:

\[\begin{equation} \ell(\theta;y)=\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\,, \label{eq:ll} \end{equation}\]

where $\theta$ is the canonical parameter and $a(\phi)$ has the form $\phi/m$, where $\phi$ is the dispersion parameter and $m$ is the prior weight, and take $m=1$.

Apply the properties of exponential families,

\[\begin{align*} \E\Big(\frac{y}{a(\phi)}\Big) &= \frac{1}{a(\phi)}b'(\theta)\\ \var\Big(\frac{y}{a(\phi)}\Big) &= \frac{1}{a(\phi)}b''(\theta)\,, \end{align*}\]

which can be simplified to

\[\begin{align*} \E(y) &= b'(\theta)\\ \var(y) &= \phi b''(\theta) \end{align*}\]
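These moment identities can be checked numerically. A small sketch of my own (not from the paper) for the Poisson family, where $b(\theta)=e^\theta$, $\phi=1$ and $\theta=\log\mu$ is the canonical parameter:

```python
import numpy as np

# Check E(y) = b'(theta) and var(y) = phi * b''(theta) for the Poisson
# family, with b(theta) = exp(theta), phi = 1 and theta = log(mu).

def b(theta):
    return np.exp(theta)

def numeric_deriv(f, x, h=1e-5):
    # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

mu = 3.0
theta = np.log(mu)

b1 = numeric_deriv(b, theta)                               # b'(theta) = mu
b2 = numeric_deriv(lambda t: numeric_deriv(b, t), theta)   # b''(theta) = mu

rng = np.random.default_rng(0)
y = rng.poisson(mu, size=200_000)
print(b1, y.mean())   # both close to 3
print(b2, y.var())    # both close to 3 (phi = 1)
```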

Of the distributions with likelihoods of the form $\eqref{eq:ll}$, the Poisson and binomial have $\phi=1$, i.e. fixed a priori, while the normal, gamma and inverse Gaussian distributions have $\phi$ variable and usually unknown.

For $\phi$ fixed the ML equations for the $\beta$ are independent of $\phi$ and are given by

\[\begin{equation} \sum W(y-\mu)\frac{d\eta}{d\mu}x_j=0 \label{eq:lleq} \end{equation}\]

where $W$ is the weight function defined by

\[W^{-1} = \Big(\frac{d\eta}{d\mu}\Big)^2\mathrm V(\mu)\,,\]

where $\mathrm V(\mu)=b''(\theta)$ is the variance function, so that $\var(y)=\phi\mathrm V(\mu)$. This can be motivated by the Fisher-scoring update for maximizing the GLM likelihood over $\beta$ (Art Owen’s GLM outline):

\[\beta^{(k+1)} = \beta^{(k)}-\Big(\E\Big[\frac{\partial^2\ell}{\partial\beta\partial\beta'}\Big]\Big)^{-1}\frac{\partial \ell}{\partial \beta}\,,\]

which can be simplified to

\[\beta^{(k+1)} = (X'WX)^{-1}X'Wz\]

where $W$ is a diagonal matrix with entries

\[W_{ii} = \frac{1}{g'(\mu_i)^2\mathrm V(\mu_i)}\,,\]

and $z$ is the working response with components

\[z_i = (Y_i-\mu_i)g'(\mu_i)+x_i'\beta\,.\]

Here we have $n$ sample points; equation $\eqref{eq:lleq}$ is written without explicit observation subscripts, with the sum understood to run over the observations, and the iteration above solves the resulting score equations.
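The update $\beta^{(k+1)} = (X'WX)^{-1}X'Wz$ can be sketched as a minimal IRLS loop. My own example (not from the paper) for a Poisson GLM with log link, where $g'(\mu)=1/\mu$ and $\mathrm V(\mu)=\mu$, so $W_{ii}=\mu_i$:

```python
import numpy as np

# Minimal IRLS sketch for a Poisson GLM with log link, following
# beta <- (X'WX)^{-1} X'Wz.  For the log link, g'(mu) = 1/mu and
# V(mu) = mu, hence W_ii = 1 / (g'(mu_i)^2 V(mu_i)) = mu_i.

def irls_poisson(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        W = mu                          # diagonal of the weight matrix
        z = eta + (y - mu) / mu         # working response z_i
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = irls_poisson(X, y)
print(beta_hat)   # close to [0.5, -0.3]
```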

As a measure of discrepancy of a fit for one observation, we may use the deviance component $d$, given by

\[d=-2\int_y^\mu \frac{y-u}{\mathrm V(u)}du\,.\]

The deviance $D$ is the sum of $d$ over the observations.
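For Poisson errors, $\mathrm V(u)=u$ and the integral has the closed form $d=2\{y\log(y/\mu)-(y-\mu)\}$. A quick numeric check of my own:

```python
import numpy as np
from scipy.integrate import quad

# For V(u) = u (Poisson), the deviance component
# d = -2 * int_y^mu (y - u)/V(u) du equals 2*(y*log(y/mu) - (y - mu)).

def d_numeric(y, mu):
    val, _ = quad(lambda u: (y - u) / u, y, mu)
    return -2 * val

def d_closed(y, mu):
    return 2 * (y * np.log(y / mu) - (y - mu))

y, mu = 4.0, 2.5
print(d_numeric(y, mu), d_closed(y, mu))  # equal
```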

For all GLMs, we have the relation

\[\frac{\partial \ell}{\partial \mu} = \frac{y-\mu}{\phi \mathrm V(\mu)}\,,\]

which follows from the chain rule: since $\partial\ell/\partial\theta = \{y-b'(\theta)\}/\phi = (y-\mu)/\phi$ and $d\mu/d\theta = b''(\theta) = \mathrm V(\mu)$, we get $\partial\ell/\partial\mu = (\partial\ell/\partial\theta)(d\theta/d\mu) = (y-\mu)/\{\phi\mathrm V(\mu)\}$. Thus the first derivative of the log-likelihood depends only on the first two moments of $y$. This led to the definition of the QL (more strictly, the quasi-log-likelihood) $q$ by the relation

\[\frac{\partial q}{\partial \mu} = \frac{y-\mu}{\phi \mathrm V(\mu)}\,.\]

The use of $q$ as a criterion for fitting allows the class of GLMs to be extended to models defined only by the properties of the first two moments. The QL $q$ will be a true likelihood if there is a distribution of the GLM type having $\var(y)=\phi V(\mu)$.

QLs allow two kinds of extension to GLMs:

  • GLMs with fixed $\phi=1$ can be extended to allow a variable $\phi$; e.g., log-linear models with Poisson errors with $\var(y)=\mu$ can be enlarged to allow overdispersion with $\var(y)=\phi\mu$.
  • $\mathrm{V}(\mu)$ takes a form which does not correspond to that of a standard GLM, for example $\mathrm V(\mu)=\mu^\alpha$ for variable $\alpha$.
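A sketch of my own for the first extension: a quasi-Poisson fit with $\var(y)=\phi\mu$, where the mean is fitted exactly as in a Poisson model (here an intercept-only fit, $\hat\mu=\bar y$) and $\phi$ is estimated from the Pearson statistic, $\hat\phi = X^2/(n-p)$. The data are deliberately overdispersed via gamma mixing:

```python
import numpy as np

# Quasi-Poisson sketch: var(y) = phi * mu.  The mean is fitted as in a
# Poisson model (intercept-only: mu_hat = ybar) and phi is estimated by
# the Pearson statistic divided by its degrees of freedom.

rng = np.random.default_rng(2)
n = 20_000
lam = rng.gamma(shape=2.0, scale=2.0, size=n)   # mixing => overdispersion
y = rng.poisson(lam)                            # E(y) = 4, var(y) = 12

mu_hat = y.mean()                               # intercept-only Poisson fit
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - 1)
print(mu_hat, phi_hat)   # phi_hat well above 1
```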


A quasi-(log)-likelihood $q$ can be made into a distribution by normalizing it. Such a distribution we call a quasi-distribution. If $q$ has parameter vector $\beta$, then the quasi-distribution has frequency function

\[f_q = \exp(q)/\omega\,,\]

where $\omega = \int \exp(q)dy$, and likelihood

\[\ell_q = q-\log\omega\,.\]

The ML equations for the quasi-distribution, $\partial \ell_q/\partial \beta=0$, and the MQL equations, $\partial q/\partial\beta=0$, will differ by a term

\[\begin{align*} \frac{\partial \log\omega}{\partial \beta} &= \frac 1\omega \frac{\partial \omega}{\partial \mu}\frac{\partial\mu}{\partial\beta}\\ &=\frac 1\omega \frac{\partial\mu}{\partial \beta}\int \frac{\partial}{\partial\mu}\exp q\,dy\\ &=\frac 1\omega \frac{\partial\mu}{\partial \beta}\int \exp q\,\frac{y-\mu}{\phi V(\mu)}dy\\ &=\frac{\partial \mu}{\partial\beta}\frac{\mu^*-\mu}{\phi V(\mu)}\,, \end{align*}\]

where $\mu^*$ is the quasi-mean $\int yf_qdy$. If $\mu^*-\mu$ is small compared with $y-\mu$ we can expect the MQL estimates to approximate closely the ML estimates from the quasi-distribution.
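An illustration of my own (not from the paper): take $V(\mu)=\mu^2$, so $\partial q/\partial\mu = (y-\mu)/(\phi\mu^2)$ integrates to $q = (-y/\mu - \log\mu)/\phi$. Normalizing $\exp(q)$ over $y\in(0,\infty)$ gives an exponential quasi-distribution whose quasi-mean works out to $\mu^*=\phi\mu$, so $\mu^*=\mu$ only when $\phi=1$:

```python
import numpy as np
from scipy.integrate import quad

# V(mu) = mu^2 gives q = (-y/mu - log(mu))/phi.  Normalize exp(q) over
# y in (0, inf) and compute the quasi-mean mu* = int y f_q dy.

def quasi_mean(mu, phi):
    q = lambda y: (-y / mu - np.log(mu)) / phi
    omega, _ = quad(lambda y: np.exp(q(y)), 0, np.inf)
    mu_star, _ = quad(lambda y: y * np.exp(q(y)) / omega, 0, np.inf)
    return mu_star

print(quasi_mean(2.0, 1.0))   # ~2.0: mu* = mu, MQL and quasi-ML agree
print(quasi_mean(2.0, 1.5))   # ~3.0: mu* - mu is no longer negligible
```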

Extended Quasi-likelihood and Pseudo-likelihood

QL can be used to compare different linear predictors or different link functions on the same data, but it cannot be used to compare different variance functions on the same data. For this reason we need the EQL, which can be expressed through the corresponding extended quasi-deviance $D^+$, taking the form

\[D^+ = \sum d_i/\phi_i + \sum \log \{2\pi \phi_iV(y_i)\}\,,\]

where $\phi$ is the dispersion parameter and $V(y)$ is the variance function applied to the observation.

PL gives rise to an analogous extended deviance of the form

\[D_p = \sum X^2/\phi + \sum \log\{2\pi\phi V(\mu)\}\,,\]

where $X^2$ is the Pearson $\chi^2$-component for the observation. It differs from the EQL by having $X^2$ in place of $d$ and $V(\mu)$ in place of $V(y)$.
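A small numeric comparison of my own for Poisson-type counts ($V(\mu)=\mu$): with a single common $\phi$, minimizing $D^+$ over $\phi$ gives $\hat\phi=\sum d_i/n$, while minimizing $D_p$ gives $\hat\phi=\sum X_i^2/n$.

```python
import numpy as np

# Compare the per-observation ingredients of EQL and PL for V(mu) = mu:
# the deviance component d and the Pearson component X^2.  Minimizing
# D+ (resp. D_p) over a common phi gives the mean of d (resp. of X^2).

rng = np.random.default_rng(3)
mu = 5.0
y = rng.poisson(mu, size=50_000).astype(float)

# deviance components, with the convention y*log(y/mu) = 0 when y = 0
d = 2 * (np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu))
x2 = (y - mu) ** 2 / mu

phi_eql = d.mean()
phi_pl = x2.mean()
print(phi_eql, phi_pl)   # both near 1, but not identical
```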



The standard negative binomial model assumes a gamma distribution for $u$, the mean of the Poisson distribution. If we write this gamma distribution in the form

\[\frac{1}{\Gamma(\nu)}\exp\Big(-\frac u\alpha\Big)\Big(\frac u\alpha\Big)^{\nu-1}d\Big(\frac u\alpha\Big)\,,\]

then, for the gamma mixing distribution,

\[\begin{align*} \E(u/\alpha) &= \nu\\ \var(u/\alpha) &= \nu\,, \end{align*}\]


and, for the Poisson mixture,

\[\begin{align*} \E(y) & = \E[\E(y\mid u)] = \alpha\nu \triangleq \mu\\ \var(y) & = \E[\var(y\mid u)] + \var[\E(y\mid u)] = \alpha\nu + \alpha^2\nu\,. \end{align*}\]

The standard negative binomial model for GLMs arises from assuming that $\nu$ is fixed as $\mu$ varies. This gives

\[\var(y) = \mu + \mu^2/\nu\,.\]

However, if we assume instead that $\nu$ varies with $\mu$ while $\alpha$ remains constant, we obtain a distribution that we call the NB$\alpha$-distribution; for this

\[\var(y) = \mu(1+\alpha)\]

so that we obtain a model with an error distribution resembling an overdispersed Poisson distribution, with dispersion parameter $\phi=1+\alpha$. This distribution is not a standard GLM-type exponential family, so that the ML and MQL equations are different.
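The NB$\alpha$ variance can be verified by simulation (my own check): keep $\alpha$ fixed, set $\nu=\mu/\alpha$, draw $u\sim\mathrm{Gamma}(\nu,\alpha)$ and $y\mid u\sim\mathrm{Poisson}(u)$.

```python
import numpy as np

# NB-alpha simulation check: with alpha fixed and nu = mu/alpha,
# E(y) = alpha*nu = mu and var(y) = alpha*nu + alpha^2*nu = mu*(1+alpha).

rng = np.random.default_rng(4)
alpha, mu = 0.5, 4.0
nu = mu / alpha
u = rng.gamma(shape=nu, scale=alpha, size=200_000)
y = rng.poisson(u)
print(y.mean(), y.var())   # ~4.0 and ~6.0 = mu*(1 + alpha)
```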

ML Calculation steps:

  1. define dlg, ddg and dtg
  2. write out the log-likelihood of the NB$\alpha$-distribution.
  3. write out the ML estimating equations for a log-link: $\partial \ell/\partial \beta_r=0$ and $\partial \ell/\partial \alpha=0$.
  4. Solve for $\beta$ by Newton's method for given $\alpha$, replacing the Hessian by a positive-definite approximation (e.g. its expectation) to avoid non-positive eigenvalues.
  5. Estimate $\alpha$ by a simple one-dimensional search.
  6. Carry out step 4 and step 5 alternately until convergence.
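Steps 2 and 5 can be sketched for an intercept-only model. This is my own hedged sketch, not the paper's code: the NB$\alpha$ log-likelihood is that of a negative binomial with index $\nu=\mu/\alpha$ and success probability $1/(1+\alpha)$; I take $\hat\mu=\bar y$ as a simple moment estimate of the mean (the full Newton step 4 is omitted) and find $\alpha$ by a one-dimensional search.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

# NB-alpha log-likelihood: negative binomial with index nu = mu/alpha
# and success probability 1/(1+alpha).

def nb_alpha_loglik(y, mu, alpha):
    nu = mu / alpha
    return np.sum(gammaln(y + nu) - gammaln(nu) - gammaln(y + 1)
                  + nu * np.log(1 / (1 + alpha))
                  + y * np.log(alpha / (1 + alpha)))

rng = np.random.default_rng(5)
alpha_true, mu_true = 0.5, 4.0
u = rng.gamma(shape=mu_true / alpha_true, scale=alpha_true, size=20_000)
y = rng.poisson(u)

mu_hat = y.mean()                      # moment estimate in place of step 4
res = minimize_scalar(lambda a: -nb_alpha_loglik(y, mu_hat, a),
                      bounds=(1e-3, 5.0), method="bounded")   # step 5
print(mu_hat, res.x)   # alpha_hat near 0.5
```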

QL & PL estimators of $\beta$ do not require knowledge of $\alpha$.

Counts with Power Variance Function

Although the estimators based on EQL and PL are often very close, those based on EQL may be asymptotically biased whereas those based on PL are not. The model postulates a variance function of the form

\[V(y) = \mu^\alpha\,,\]

and an estimate is required of $\alpha$, the true value being 1. The variance of $y$ can be modelled as $\var(y)=\phi \mu^\alpha$, where $\phi$ can either be assumed to be 1 or be estimated from data. In this example, the main interest is the estimation of $\alpha$.

The finite-sample-adjusted QL and PL estimators for $\alpha$ can be obtained by minimizing respectively $D^+$ and $D_p$.
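A rough sketch of my own of the EQL route: minimize $D^+$ over $\alpha$ with $\phi=1$ on Poisson data, where the true value is $\alpha=1$. The adjustment $\max(y, 1/6)$ inside $V(y)$ is one common way to handle zero counts; the exact finite-sample adjustment used in the paper may differ.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Estimate alpha in V(mu) = mu^alpha by minimizing D+ with phi = 1.
# d is computed by numeric integration of -2 * int_y^mu (y-u)/u^alpha du;
# max(y, 1/6) in the log term is an ad hoc zero-count adjustment.

def d_component(y, mu, alpha):
    val, _ = quad(lambda u: (y - u) / u ** alpha, y, mu)
    return -2 * val

def d_plus(y, mu, alpha):
    d = sum(d_component(yi, mu, alpha) for yi in y)
    return d + np.sum(np.log(2 * np.pi * np.maximum(y, 1 / 6) ** alpha))

rng = np.random.default_rng(6)
y = rng.poisson(4.0, size=500).astype(float)
mu_hat = y.mean()

res = minimize_scalar(lambda a: d_plus(y, mu_hat, a),
                      bounds=(0.2, 1.8), method="bounded")
print(res.x)   # near the true value alpha = 1
```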

Poisson-Inverse Gaussian Mixture

A Poisson distribution is mixed with an inverse Gaussian distribution such that the resulting variance function is $\mu+\alpha\mu^2$, i.e. has the same form as the gamma mixture that gives rise to the negative binomial distribution.

Estimate $\alpha$ by using the ML estimator and a modified MPL estimator.

Published in categories Note