Clustering of High-dim GMMs with EM
The paper studies clustering of high-dimensional Gaussian mixtures and proposes a procedure, called CHIME, based on the EM algorithm and a direct estimation method for the sparse discriminant vector. It establishes the optimal rate of convergence for the excess misclustering error and shows that CHIME is minimax rate optimal.
Introduction
- k-means and k-medians are centroid-based
- Hierarchical clustering builds a hierarchy of clusters based on empirical measures of dissimilarity among sets of observations
The paper considers clustering of data generated from Gaussian mixtures, with a focus on the high-dimensional setting.
\[Y = \begin{cases} 1 & \text{w.p. } 1-w^\star\\ 2 & \text{w.p. } w^\star \end{cases}, \qquad Z \mid Y = k \sim N_p(\mu_k^\star, \Sigma^\star)\]Suppose we observe $n$ unlabeled observations
\[z^{(1)}, z^{(2)},\ldots, z^{(n)} \sim (1-w^\star)N_p(\mu_1^\star, \Sigma^\star) + w^\star N_p(\mu_2^\star, \Sigma^\star)\]In the ideal case where the parameter $\theta^\star = \{w^\star, \mu_1^\star, \mu_2^\star, \Sigma^\star\}$ is known, the optimal procedure is Fisher's linear discriminant rule, which classifies $z$ to class 2 if and only if $\log\frac{w^\star}{1-w^\star} + \big(z - \frac{\mu_1^\star + \mu_2^\star}{2}\big)^\top \beta^\star \ge 0$, where $\beta^\star = (\Sigma^\star)^{-1}(\mu_2^\star - \mu_1^\star)$ is the discriminant direction.
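To make the setup concrete, here is a minimal numpy sketch that draws from this two-component mixture and applies the oracle Fisher rule. The dimensions and parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for illustration (not from the paper).
p, n = 50, 200
w_star = 0.4                           # mixing weight of class 2
mu1 = np.zeros(p)
mu2 = np.zeros(p); mu2[:5] = 1.5       # sparse mean difference
Sigma = np.eye(p)

# Draw latent labels and observations from the two-component mixture.
y = rng.binomial(1, w_star, size=n)    # 0 -> class 1, 1 -> class 2
z = np.where(y[:, None] == 0,
             rng.multivariate_normal(mu1, Sigma, size=n),
             rng.multivariate_normal(mu2, Sigma, size=n))

# Oracle Fisher rule: classify to class 2 iff
#   log(w/(1-w)) + (z - (mu1+mu2)/2)^T beta >= 0,
# with discriminant direction beta = Sigma^{-1} (mu2 - mu1).
beta = np.linalg.solve(Sigma, mu2 - mu1)
scores = np.log(w_star / (1 - w_star)) + (z - (mu1 + mu2) / 2) @ beta
y_hat = (scores >= 0).astype(int)

print("oracle misclustering rate:", np.mean(y_hat != y))
```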
In practice, the parameters are unknown and a data-driven method is needed.
A common approach in the low-dimensional setting is to simply plug the sample estimates into Fisher's rule.
In the high-dimensional setting where $p$ can be much larger than $n$, the sample covariance matrix may not even be invertible, and it is difficult to estimate the precision matrix $(\Sigma^\star)^{-1}$.
- One way around this is to directly estimate the discriminant direction $\beta^\star$ rather than the precision matrix (see the sketch below).
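As a concrete (hedged) illustration of direct estimation, the sketch below solves a Dantzig-type linear program in the spirit of direct sparse discriminant analysis: minimize $\|\beta\|_1$ subject to $\|\hat\Sigma\beta - \hat d\|_\infty \le \lambda$, where $\hat d = \hat\mu_2 - \hat\mu_1$. The function name and the exact constraint form are my assumptions for illustration, not necessarily the program used inside CHIME.

```python
import numpy as np
from scipy.optimize import linprog

def direct_discriminant(S, d, lam):
    """Dantzig-type direct estimate of the discriminant direction:
        minimize ||beta||_1  subject to  ||S beta - d||_inf <= lam.
    (A sketch in the spirit of direct sparse LDA; an assumption here,
    not necessarily CHIME's exact M-step.)

    S:   (p, p) covariance estimate
    d:   (p,)   estimated mean difference mu2 - mu1
    lam: tuning parameter controlling the constraint slack
    """
    p = len(d)
    # Split beta = u - v with u, v >= 0 so ||beta||_1 = sum(u + v).
    c = np.ones(2 * p)
    A = np.block([[S, -S], [-S, S]])   # encodes |S beta - d| <= lam
    b = np.concatenate([lam + d, lam - d])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None))
    u, v = res.x[:p], res.x[p:]
    return u - v
```

The $\ell_1$ objective promotes sparsity in the estimated direction, which is what makes it estimable when $p \gg n$.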
In clustering, however, the class labels are not observed, so supervised discriminant methods do not apply directly.
To handle the missing labels, the paper introduces CHIME, which imputes them iteratively with an EM-type algorithm:
- use the posterior probability that $z^{(i)}$ belongs to class $k$ as a soft label for $z^{(i)}$ (the E-step), as sketched below
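To make the soft-labeling step concrete, here is a minimal numpy/scipy sketch of one EM iteration under the model above. The function names are mine, and the regularized update of $\beta$ that distinguishes CHIME from the vanilla GMM EM is omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(z, w, mu1, mu2, Sigma):
    """Posterior probability that each z^(i) belongs to class 2,
    used as a soft label (generic GMM E-step; for large p a log-space
    computation would be needed to avoid underflow)."""
    f1 = multivariate_normal.pdf(z, mean=mu1, cov=Sigma)
    f2 = multivariate_normal.pdf(z, mean=mu2, cov=Sigma)
    return w * f2 / ((1 - w) * f1 + w * f2)

def m_step(z, gamma):
    """Update the mixing weight and means from the soft labels gamma."""
    w = gamma.mean()
    mu1 = (1 - gamma) @ z / (1 - gamma).sum()
    mu2 = gamma @ z / gamma.sum()
    return w, mu1, mu2
```

Alternating these two steps from an initial guess of the parameters gives the basic EM iteration on which CHIME builds.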