
WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

MM algorithm for Variance Components Models

Tags: Variance Components Model, Minorization-maximization, Restricted Maximum Likelihood

The post is based on Zhou, H., Hu, L., Zhou, J., & Lange, K. (2019). MM Algorithms for Variance Components Models. Journal of Computational and Graphical Statistics, 28(2), 350–361.

Variance components estimation and mixed model analysis are central themes in statistics with applications in numerous scientific disciplines. MLE and restricted MLE of variance component models remain numerically challenging.

Building on the minorization-maximization (MM) principle, the paper presents a novel iterative algorithm for variance components estimation.

The authors claim that the MM algorithm is trivial to implement and competitive on large data problems, and the algorithm readily extends to more complicated problems such as linear mixed models, multivariate response models possibly with missing data, maximum a posteriori estimation, and penalized estimation.

They establish the global convergence of the MM algorithm to a Karush-Kuhn-Tucker (KKT) point and demonstrate that it converges faster than the classical EM algorithm when the number of variance components is greater than two and all covariance matrices are positive definite.

Introduction

Given an observed $n\times 1$ response vector $y$ and an $n\times p$ predictor matrix $X$, the simplest variance components model postulates that $Y \sim N(X\beta, \Omega)$, where $\Omega = \sum_{i=1}^m \sigma_i^2 V_i$ and $V_1, \ldots, V_m$ are $m$ fixed positive semidefinite matrices.

The parameters of the model can be divided into mean effects $\beta = (\beta_1, \ldots, \beta_p)$ and variance components $\sigma^2 = (\sigma_1^2, \ldots, \sigma_m^2)$.

Throughout, assume $\Omega$ is positive definite.

Estimation revolves around the log-likelihood function

$$L(\beta, \sigma^2) = -\frac{1}{2}\log\det\Omega - \frac{1}{2}(y - X\beta)^T \Omega^{-1}(y - X\beta).$$
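To make the notation concrete, here is a minimal numpy sketch that evaluates this log-likelihood (dropping the constant $-\frac{n}{2}\log 2\pi$); the function name and argument layout are mine, not the paper's:

```python
import numpy as np

def vc_loglik(beta, sigma2, y, X, Vs):
    """Log-likelihood of the variance components model, up to an additive constant.

    sigma2 -- array of the m variance components sigma_i^2
    Vs     -- list of the m fixed positive semidefinite matrices V_i
    """
    Omega = sum(s2 * V for s2, V in zip(sigma2, Vs))   # Omega = sum_i sigma_i^2 V_i
    r = y - X @ beta
    _, logdet = np.linalg.slogdet(Omega)               # assumes Omega is positive definite
    return -0.5 * logdet - 0.5 * r @ np.linalg.solve(Omega, r)
```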

Two popular methods for estimating variance components:

  • MLE
  • restricted (or residual) MLE (REML): first projects $y$ onto the null space of $X^T$ and then estimates the variance components from the projected responses. If the columns of a matrix $B$ span the null space of $X^T$, then REML estimates the $\sigma_i^2$ by maximizing the log-likelihood of the redefined response vector $B^T Y$, which is normally distributed with mean $0$ and covariance $B^T\Omega B = \sum_{i=1}^m \sigma_i^2 B^T V_i B$.
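As a concrete illustration of this projection step (a sketch under my own choices, not the paper's implementation), one can take $B$ from the full SVD of $X$ and evaluate the REML log-likelihood of $B^T y$:

```python
import numpy as np

def reml_loglik(sigma2, y, X, Vs):
    """REML log-likelihood: project y onto the null space of X^T, then
    evaluate the Gaussian log-likelihood of B^T y (constants dropped)."""
    r = np.linalg.matrix_rank(X)
    U, _, _ = np.linalg.svd(X, full_matrices=True)
    B = U[:, r:]                                   # columns span the null space of X^T
    w = B.T @ y                                    # redefined (n - r)-dimensional response
    Sigma = sum(s2 * B.T @ V @ B for s2, V in zip(sigma2, Vs))
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * w @ np.linalg.solve(Sigma, w)
```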

Adapted from Harville, D. A. (1977). Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72(358), 320–338.

Put $p = \mathrm{rank}(X)$. An error contrast is a linear combination $u^T y$ of the observations such that $E(u^T y) = 0$, i.e., such that $u^T X = 0$. The maximum possible number of linearly independent error contrasts in any set of error contrasts is $n - p$. A particular set of $n - p$ linearly independent error contrasts is given by $n - p$ linearly independent rows of the matrix $I - X(X^TX)^{-}X^T$. In the REML approach, inference is based on the likelihood associated with $n - p$ linearly independent error contrasts rather than on that associated with the full data vector $y$.

Borrowing a figure from ESL: the projection of $y$ onto the null space of $X^T$ is just the residual vector $y - \hat y$.

Estimation bias in the variance components originates from the degrees-of-freedom (DoF) loss incurred in estimating the mean components. For the simple linear regression $y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$,

$$\hat\sigma^2_{\text{MLE}} = \frac{1}{N}(y - X\hat\beta)^T(y - X\hat\beta) \quad \text{vs.} \quad \hat\sigma^2_{\text{unbiased}} = \frac{1}{N-k}(y - X\hat\beta)^T(y - X\hat\beta),$$

where $k = \mathrm{rank}(X(X^TX)^{-1}X^T)$. In such a simple example, we can remove the bias post hoc by multiplying by a correction factor. However, for complex problems where closed-form solutions do not exist, we need to resort to ReML.
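A quick Monte Carlo check of this bias (a toy simulation, with all constants chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma2 = 50, 5, 2.0
X = rng.standard_normal((N, p))
k = np.linalg.matrix_rank(X @ np.linalg.inv(X.T @ X) @ X.T)   # = rank(X) = p here

mle, unbiased = [], []
for _ in range(2000):
    y = X @ rng.standard_normal(p) + rng.normal(scale=np.sqrt(sigma2), size=N)
    rss = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    mle.append(rss / N)
    unbiased.append(rss / (N - k))

# E[sigma2_MLE] = sigma2 * (N - k) / N = 1.8 here, while the corrected estimator centers at 2.0
print(np.mean(mle), np.mean(unbiased))
```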

For the general linear regression model, $\varepsilon \sim N(0, H(\theta))$. If a vector $a$ is orthogonal to all columns of $X$, i.e., $a^T X = 0$, then $a^T y$ is known as an error contrast. We can find at most $N - k$ such vectors that are linearly independent. Take $A = [a_1\ a_2\ \cdots\ a_{N-k}]$. It follows that $A^T X = 0$ and $E[A^T y] = 0$.

The columns of $S = I - X(X^TX)^{-1}X^T$ provide a candidate set for $A$ (any $N - k$ linearly independent ones will do). The error contrast vector $w = A^T y = A^T\varepsilon \sim N(0, A^T H A)$ is free of $\beta$. We can then directly estimate $\theta$ by maximizing the “restricted” log-likelihood function $L(\theta \mid A^T y)$.
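A small numerical sketch of this construction (illustrative only): take an orthonormal basis of the column space of $S$ as $A$ and check that its columns really are error contrasts.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 30, 4
X = rng.standard_normal((N, k))
S = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T

# S is a projection, so its eigenvalues are 0 or 1; the eigenvectors with
# eigenvalue 1 give N - k linearly independent error contrasts.
eigval, eigvec = np.linalg.eigh(S)
A = eigvec[:, eigval > 0.5]
assert A.shape[1] == N - k

beta, eps = rng.standard_normal(k), rng.standard_normal(N)
y = X @ beta + eps
print(np.allclose(A.T @ X, 0))          # True: A^T X = 0
print(np.allclose(A.T @ y, A.T @ eps))  # True: w = A^T y = A^T eps is free of beta
```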

In the simplest linear regression, the mean estimate $\hat\beta$ is independent of the variance component $\theta$. This implies that although the ML and ReML estimates of $\sigma^2$ differ, the estimates of $\beta$ are the same. This is no longer true for more complex regression models.

There is a large literature on iterative algorithms for finding the MLE and REML estimates. Still, fitting variance components models remains a challenge for models with a large sample size $n$ or a large number of variance components $m$.

  • Newton’s method converges quickly but is numerically unstable owing to the non-concavity of the log-likelihood
  • Fisher’s scoring algorithm replaces the observed information matrix in Newton’s method by the expected information matrix and yields an ascent algorithm when safeguarded by step halving, but forming the expected information matrix is too expensive for large problems.
  • EM is easy to implement and numerically stable, but painfully slow to converge.

The paper derives a novel minorization-maximization algorithm for finding the MLE and REML estimates of variance components, proves global convergence of the MM algorithm to a Karush-Kuhn-Tucker (KKT) point, and explains why MM generally converges faster than EM for models with more than two variance components.

Preliminaries

Background on MM algorithms

The MM principle for maximizing an objective function $f(\theta)$ involves minorizing the objective function $f(\theta)$ by a surrogate function $g(\theta \mid \theta^{(t)})$ around the current iterate $\theta^{(t)}$ of the search. Minorization is defined by the two conditions

$$f(\theta^{(t)}) = g(\theta^{(t)} \mid \theta^{(t)}), \qquad f(\theta) \ge g(\theta \mid \theta^{(t)}) \text{ for all } \theta.$$

The point $\theta^{(t+1)}$ maximizing $g(\theta \mid \theta^{(t)})$ satisfies the ascent property $f(\theta^{(t+1)}) \ge f(\theta^{(t)})$.
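Spelled out, the ascent property follows by chaining the two minorization conditions with the definition of $\theta^{(t+1)}$:

$$f(\theta^{(t+1)}) \ge g(\theta^{(t+1)} \mid \theta^{(t)}) \ge g(\theta^{(t)} \mid \theta^{(t)}) = f(\theta^{(t)}),$$

where the first inequality is the domination condition, the middle one holds because $\theta^{(t+1)}$ maximizes $g(\cdot \mid \theta^{(t)})$, and the equality is the tangency condition.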

The celebrated EM principle is a special case of the MM principle. The Q function produced in the E step of an EM algorithm minorizes the log-likelihood up to an irrelevant constant. Thus, both EM and MM share the same advantages: simplicity, stability, graceful adaptation to constraints, and the tendency to avoid large matrix inversion.

Convex Matrix Functions

A matrix-valued function f is said to be matrix convex if

$$f(\lambda A + (1-\lambda)B) \preceq \lambda f(A) + (1-\lambda)f(B)$$

for all $A$, $B$, and $\lambda \in [0,1]$.

  • the matrix fractional function $f(A, B) = A^T B^{-1} A$ is jointly convex in the $m\times n$ matrix $A$ and the $m\times m$ positive-definite matrix $B$.
  • the log determinant function $f(B) = \log\det B$ is concave on the set of positive-definite matrices.
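Both facts can be spot-checked numerically; here is a small sketch on random instances (a check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 4, 3, 0.3

def rand_pd(d):
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)    # well-conditioned positive-definite matrix

A1, A2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
B1, B2 = rand_pd(m), rand_pd(m)

f = lambda A, B: A.T @ np.linalg.solve(B, A)          # matrix fractional function
gap = lam * f(A1, B1) + (1 - lam) * f(A2, B2) \
      - f(lam * A1 + (1 - lam) * A2, lam * B1 + (1 - lam) * B2)
print(np.all(np.linalg.eigvalsh(gap) >= -1e-8))       # True: convex in the PSD order

logdet = lambda B: np.linalg.slogdet(B)[1]
print(logdet(lam * B1 + (1 - lam) * B2)
      >= lam * logdet(B1) + (1 - lam) * logdet(B2))   # True: log det is concave
```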

References

  • Zhang, Xiuming. “A Tutorial on Restricted Maximum Likelihood Estimation in Linear Regression and Linear Mixed-Effects Model,” 2015

Published in categories Memo