Over-specified EM for GMMs
consider over-specified settings in which the number of fitted components is larger than the number of components in the true distribution
such mis-specified settings can lead to singularity in the Fisher information matrix
- population EM
- sample EM
Introduction
- mixture models provide a principled approach to modeling heterogeneous collections of data (usually assumed to be i.i.d.)
- it is frequently the case that the number of mixture components in the fitted model does not match the number of mixture components in the data-generating mechanism
- such a mismatch can lead to substantially slower convergence rates for the MLE of the underlying parameters
- in contrast, relatively little attention has been paid to the computational implications of this mismatch
goal of this paper: gain a fundamental understanding of the behavior of EM when used to fit overspecified mixture models
Statistical issues with overspecification
main difficulty in analyzing the MLE in such settings arises from
- label switching between the mixtures
- lack of strong concavity in the likelihood
such issues do not interfere with density estimation, since standard divergence measures like the KL and Hellinger distances remain invariant under permutations of the labels, and strong concavity is not required
Chen (1995): considered a class of overspecified finite mixture models
- in contrast to the usual $n^{-1/2}$ convergence rate for the MLE based on $n$ samples, Chen showed that for estimating scalar location parameters in this class of overspecified finite mixture models, the rate slows down to $n^{-1/4}$
Computational concerns with mixture models
while the papers discussed above address the statistical behavior of a global maximum of the log-likelihood, they do not consider the associated computational issues of obtaining such a maximum
in general settings, nonconvexity of the log-likelihood makes it impossible to guarantee that the iterative algorithms used in practice converge to the global optimum, or equivalently the MLE
For Gaussian mixture models, the population EM algorithm is known to exhibit a range of convergence rates, from linear to superlinear (quasi-Newton-like) convergence as the overlap between the mixture components tends to zero
it has been noted that the convergence of EM can be prohibitively slow when the mixtures are not well separated.
Prior work on EM
Balakrishnan et al. laid out a general theoretical framework for analyzing the EM algorithm, and in particular for proving nonasymptotic bounds on the Euclidean distance between sample-based EM iterates and the true parameter
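for reference, one standard way to write the two update rules in this framework (the operator notation $M$, $M_n$ is my addition, not from the post): the population EM operator maximizes the expected complete-data log-likelihood under the true data distribution, while the sample EM operator replaces that expectation with an empirical average over the $n$ observations

$$
\theta^{t+1} = M(\theta^t), \qquad M(\theta) := \arg\max_{\theta'} \; \mathbb{E}_{X}\Big[\mathbb{E}_{Z \mid X;\,\theta}\big[\log p(X, Z;\, \theta')\big]\Big] \quad \text{(population EM)}
$$

$$
\theta^{t+1} = M_n(\theta^t), \qquad M_n(\theta) := \arg\max_{\theta'} \; \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z \mid X_i;\,\theta}\big[\log p(X_i, Z;\, \theta')\big] \quad \text{(sample EM)}
$$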
when applied to the special case of two-component Gaussian location mixtures, assumed to be well specified and suitably separated, their theory guarantees that
- population EM updates enjoy a geometric rate of convergence to the true parameter when initialized in a sufficiently small neighborhood around the truth
- sample-based EM updates converge to an estimate at Euclidean distance of order $(d/n)^{1/2}$ from the truth, based on $n$ i.i.d. draws from a finite mixture model in $\mathbb{R}^d$ (see the sketch below)
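as a concrete illustration of this special case, here is a minimal numpy sketch of the sample-EM update for the symmetric, balanced two-component location mixture $0.5\,N(\theta^*, \sigma^2 I_d) + 0.5\,N(-\theta^*, \sigma^2 I_d)$, where the update has the closed form $\theta^{t+1} = \tfrac{1}{n}\sum_i \tanh(\langle x_i, \theta^t\rangle/\sigma^2)\, x_i$; the constants, dimensions, and variable names are my own choices, not taken from the paper

```python
# minimal sketch: sample EM for a symmetric two-component Gaussian location mixture
#   X ~ 0.5 * N(theta*, sigma^2 I_d) + 0.5 * N(-theta*, sigma^2 I_d)
# closed-form EM update in this special case:
#   theta_{t+1} = (1/n) * sum_i tanh(<x_i, theta_t> / sigma^2) * x_i
# (illustrative only; setup and constants are my own choices)
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 2, 10_000, 1.0
theta_star = np.array([1.5, 0.0])          # true location (well-separated case)

# draw n samples from the two-component mixture
signs = rng.choice([-1.0, 1.0], size=n)
X = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))

def em_update(theta, X, sigma):
    """One sample-EM step: E-step responsibilities folded into a tanh weight."""
    weights = np.tanh(X @ theta / sigma**2)    # soft assignment to the +/- component
    return (weights[:, None] * X).mean(axis=0)

theta = np.array([0.5, 0.5])                   # initialization near the truth
for t in range(20):
    theta = em_update(theta, X, sigma)
    print(t, np.linalg.norm(theta - theta_star))
# in this well-specified, well-separated setup the distance drops geometrically
# until it reaches a statistical floor of order sqrt(d/n)
```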
further work in this vein has characterized the behavior of EM in a variety of settings for two-component Gaussian mixtures, including
- convergence analysis with additional sparsity constraints
- global convergence of population EM
- guarantees of geometric convergence under less restrictive conditions on the two mixture components
- ….
an assumption common to all of this previous work is that there is no mis-specification in the fitting of the Gaussian mixtures
Contributions
nonasymptotic performance of the EM algorithm for overspecified mixtures
provide a study of overspecified mixture models when fit to a particularly simple (nonmixture) data generating mechanism, a multivariate normal distribution $N(0, \sigma^2I_d)$
- two-mixture unbalanced fit: study a mixture of two location-Gaussian distributions with unknown location, known variance, and known but unequal weights for the two components
- two-mixture balanced fit: population EM converges to the true parameter from an arbitrary initialization, but the rate of convergence varies as a function of the distance of the current iterate from the true parameter, becoming exponentially slower as the iterates approach it (see the sketch below)
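as a rough illustration of the balanced fit, the same tanh update can be run on data drawn from a single Gaussian $N(0, \sigma^2 I_d)$, so that the true parameter is $\theta^* = 0$; this is only a sketch under my own choice of constants, not code from the paper

```python
# same EM update as above, but the data have no mixture structure, so theta* = 0
# (illustrative only; setup and constants are my own choices)
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 2, 100_000, 1.0
X = sigma * rng.standard_normal((n, d))     # samples from N(0, sigma^2 I_d)

theta = np.array([1.0, 1.0])                # arbitrary initialization
for t in range(201):
    theta = (np.tanh(X @ theta / sigma**2)[:, None] * X).mean(axis=0)
    if t % 20 == 0:
        print(t, np.linalg.norm(theta))     # distance to theta* = 0
# the norm keeps decreasing, but each iteration removes a smaller and smaller
# fraction of the remaining error, consistent with the slowdown described above
# as the iterates approach the true parameter
```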