Over-specified EM for GMMs
consider over-specified settings in which the number of fitted components is larger than the number of components in the true distribution
such mis-specified settings can lead to singularity in the Fisher information matrix
- population EM
- sample EM
Introduction
- mixture models provide a principled approach to modeling heterogeneous collections of data (usually assumed to be i.i.d.)
- it is frequently the case that the number of mixture components in the fitted model does not match the number of mixture components in the data-generating mechanism
- such a mismatch can lead to substantially slower convergence rates for the MLE of the underlying parameters
- in contrast, relatively little attention has been paid to the computational implications of this mismatch
goal of this paper: gain a fundamental understanding of the behavior of EM when used to fit overspecified mixture models
Statistical issues with overspecification
main difficulty in analyzing the MLE in such settings arises from
- label switching between the mixtures
- lack of strong concavity in the likelihood
such issues do not interfere with density estimation, since standard divergence measures like the KL and Hellinger distances remain invariant under permutations of the labels, and strong concavity is not required
Chen (1995): considered a class of overspecified finite mixture models
- in contrast to the usual $n^{-1/2}$ convergence rate for the MLE based on $n$ samples, Chen showed that for estimating scalar location parameters in this class of overspecified finite mixture models, the rate slows down to $n^{-1/4}$
Computational concerns with mixture models
while the papers discussed above address the statistical behavior of a global maximum of the log-likelihood, they do not consider the associated computational issues of obtaining such a maximum
in general settings, nonconvexity of the log-likelihood makes it impossible to guarantee that the iterative algorithms used in practice converge to the global optimum, or equivalently the MLE
For Gaussian mixture models, the population EM algorithm is known to exhibit a range of convergence rates, from linear to superlinear (quasi-Newton-like) convergence as the overlap between the mixture components tends to zero
it has been noted that the convergence of EM can be prohibitively slow when the mixtures are not well separated.
Prior work on EM
Balakrishnan et al. laid out a general theoretical framework for analyzing the EM algorithm, and in particular for proving nonasymptotic bounds on the Euclidean distance between sample-based EM iterates and the true parameter
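for reference, one standard way to write the two update rules in this framework (the operator notation $M$, $M_n$ is my addition, not from the post): the population EM operator maximizes the expected complete-data log-likelihood under the true data distribution, while the sample EM operator replaces that expectation with an empirical average over the $n$ observations

$$
\theta^{t+1} = M(\theta^t), \qquad M(\theta) := \arg\max_{\theta'} \; \mathbb{E}_{X}\Big[\mathbb{E}_{Z \mid X;\,\theta}\big[\log p(X, Z;\, \theta')\big]\Big] \quad \text{(population EM)}
$$

$$
\theta^{t+1} = M_n(\theta^t), \qquad M_n(\theta) := \arg\max_{\theta'} \; \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z \mid X_i;\,\theta}\big[\log p(X_i, Z;\, \theta')\big] \quad \text{(sample EM)}
$$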
when applied to the special case of two-component Gaussian location mixtures, assumed to be well specified and suitably separated, their theory guarantees that
- population EM updates enjoy a geometric rate of convergence to the true parameter when initialized in a sufficiently small neighborhood around the truth
- sample-based EM updates converge to an estimate at Euclidean distance of order $(d/n)^{1/2}$ from the truth, based on $n$ i.i.d. draws from a finite mixture model in $\mathbb{R}^d$ (see the sketch below)
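as a concrete illustration of this special case, here is a minimal numpy sketch of the sample-EM update for the symmetric, balanced two-component location mixture $0.5\,N(\theta^*, \sigma^2 I_d) + 0.5\,N(-\theta^*, \sigma^2 I_d)$, where the update has the closed form $\theta^{t+1} = \tfrac{1}{n}\sum_i \tanh(\langle x_i, \theta^t\rangle/\sigma^2)\, x_i$; the constants, dimensions, and variable names are my own choices, not taken from the paper

```python
# minimal sketch: sample EM for a symmetric two-component Gaussian location mixture
#   X ~ 0.5 * N(theta*, sigma^2 I_d) + 0.5 * N(-theta*, sigma^2 I_d)
# closed-form EM update in this special case:
#   theta_{t+1} = (1/n) * sum_i tanh(<x_i, theta_t> / sigma^2) * x_i
# (illustrative only; setup and constants are my own choices)
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 2, 10_000, 1.0
theta_star = np.array([1.5, 0.0])          # true location (well-separated case)

# draw n samples from the two-component mixture
signs = rng.choice([-1.0, 1.0], size=n)
X = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))

def em_update(theta, X, sigma):
    """One sample-EM step: E-step responsibilities folded into a tanh weight."""
    weights = np.tanh(X @ theta / sigma**2)    # soft assignment to the +/- component
    return (weights[:, None] * X).mean(axis=0)

theta = np.array([0.5, 0.5])                   # initialization near the truth
for t in range(20):
    theta = em_update(theta, X, sigma)
    print(t, np.linalg.norm(theta - theta_star))
# in this well-specified, well-separated setup the distance drops geometrically
# until it reaches a statistical floor of order sqrt(d/n)
```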
further work in this vein has characterized the behavior of EM in a variety of settings for two-component Gaussian mixtures, including
- convergence analysis with additional sparsity constraints
- global convergence of population EM
- guarantees of geometric convergence under less restrictive conditions on the two mixture components
- ….
an assumption common to all of this previous work is that there is no mis-specification in the fitting of the Gaussian mixtures
Contributions
nonasymptotic performance of the EM algorithm for overspecified mixtures
provide a study of overspecified mixture models when fit to a particularly simple (nonmixture) data generating mechanism, a multivariate normal distribution $N(0, \sigma^2I_d)$
- two-mixture unbalanced fit: study a mixture of two location-Gaussian distributions with unknown location, known variance, and known but unequal weights for the two components
- two-mixture balanced fit: population EM converges to the true parameter from an arbitrary initialization, but the rate of convergence varies as a function of the distance of the current iterate from the true parameter, becoming exponentially slower as the iterates approach it (see the sketch below)
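as a rough illustration of the balanced fit, the same tanh update can be run on data drawn from a single Gaussian $N(0, \sigma^2 I_d)$, so that the true parameter is $\theta^* = 0$; this is only a sketch under my own choice of constants, not code from the paper

```python
# same EM update as above, but the data have no mixture structure, so theta* = 0
# (illustrative only; setup and constants are my own choices)
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 2, 100_000, 1.0
X = sigma * rng.standard_normal((n, d))     # samples from N(0, sigma^2 I_d)

theta = np.array([1.0, 1.0])                # arbitrary initialization
for t in range(201):
    theta = (np.tanh(X @ theta / sigma**2)[:, None] * X).mean(axis=0)
    if t % 20 == 0:
        print(t, np.linalg.norm(theta))     # distance to theta* = 0
# the norm keeps decreasing, but each iteration removes a smaller and smaller
# fraction of the remaining error, consistent with the slowdown described above
# as the iterates approach the true parameter
```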