Clustering of High-dim GMMs with EM
The paper studies clustering of high-dimensional Gaussian mixtures and proposes a procedure, called CHIME, based on the EM algorithm and a direct estimation method for the sparse discriminant vector. It establishes the optimal rate of convergence for the excess misclustering error and shows that CHIME is minimax rate optimal.
Introduction
- k-means and k-medians are centroid-based
- Hierarchical clustering builds a hierarchy of clusters based on empirical measures of dissimilarity among sets of observations
The paper considers clustering of data generated from Gaussian mixtures, with a focus on the high-dimensional setting.
\[Y = \begin{cases} 1 & \text{w.p. } 1-w^\star\\ 2 & \text{w.p. } w^\star \end{cases}, \qquad Z \mid Y = k \sim N_p(\mu_k^\star, \Sigma^\star)\]Suppose we observe $n$ unlabeled observations
\[z^{(1)}, z^{(2)},\ldots, z^{(n)} \sim (1-w^\star)N_p(\mu_1^\star, \Sigma^\star) + w^\star N_p(\mu_2^\star, \Sigma^\star)\]In the ideal case where the parameter $\theta^\star = \{w^\star, \mu_1^\star, \mu_2^\star, \Sigma^\star\}$ is known, the optimal procedure is Fisher's linear discriminant rule, which classifies $z$ to class 2 if and only if $\log\frac{w^\star}{1-w^\star} + \big(z - \frac{\mu_1^\star + \mu_2^\star}{2}\big)^\top \beta^\star \ge 0$, where $\beta^\star = (\Sigma^\star)^{-1}(\mu_2^\star - \mu_1^\star)$ is the discriminant direction.
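To make the setup concrete, here is a minimal numpy sketch that draws from this two-component mixture and applies the oracle Fisher rule. The dimensions and parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for illustration (not from the paper).
p, n = 50, 200
w_star = 0.4                           # mixing weight of class 2
mu1 = np.zeros(p)
mu2 = np.zeros(p); mu2[:5] = 1.5       # sparse mean difference
Sigma = np.eye(p)

# Draw latent labels and observations from the two-component mixture.
y = rng.binomial(1, w_star, size=n)    # 0 -> class 1, 1 -> class 2
z = np.where(y[:, None] == 0,
             rng.multivariate_normal(mu1, Sigma, size=n),
             rng.multivariate_normal(mu2, Sigma, size=n))

# Oracle Fisher rule: classify to class 2 iff
#   log(w/(1-w)) + (z - (mu1+mu2)/2)^T beta >= 0,
# with discriminant direction beta = Sigma^{-1} (mu2 - mu1).
beta = np.linalg.solve(Sigma, mu2 - mu1)
scores = np.log(w_star / (1 - w_star)) + (z - (mu1 + mu2) / 2) @ beta
y_hat = (scores >= 0).astype(int)

print("oracle misclustering rate:", np.mean(y_hat != y))
```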
In practice, the parameters are unknown and a data-driven method is needed.
A common approach in the low-dimensional setting is to simply plug the sample estimates into Fisher's rule.
In the high-dimensional setting where $p$ can be much larger than $n$, the sample covariance matrix may not even be invertible, and it is difficult to estimate the precision matrix $(\Sigma^\star)^{-1}$.
- One way around this is to directly estimate the discriminant direction $\beta^\star$ rather than the precision matrix (see the sketch below).
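As a concrete (hedged) illustration of direct estimation, the sketch below solves a Dantzig-type linear program in the spirit of direct sparse discriminant analysis: minimize $\|\beta\|_1$ subject to $\|\hat\Sigma\beta - \hat d\|_\infty \le \lambda$, where $\hat d = \hat\mu_2 - \hat\mu_1$. The function name and the exact constraint form are my assumptions for illustration, not necessarily the program used inside CHIME.

```python
import numpy as np
from scipy.optimize import linprog

def direct_discriminant(S, d, lam):
    """Dantzig-type direct estimate of the discriminant direction:
        minimize ||beta||_1  subject to  ||S beta - d||_inf <= lam.
    (A sketch in the spirit of direct sparse LDA; an assumption here,
    not necessarily CHIME's exact M-step.)

    S:   (p, p) covariance estimate
    d:   (p,)   estimated mean difference mu2 - mu1
    lam: tuning parameter controlling the constraint slack
    """
    p = len(d)
    # Split beta = u - v with u, v >= 0 so ||beta||_1 = sum(u + v).
    c = np.ones(2 * p)
    A = np.block([[S, -S], [-S, S]])   # encodes |S beta - d| <= lam
    b = np.concatenate([lam + d, lam - d])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None))
    u, v = res.x[:p], res.x[p:]
    return u - v
```

The $\ell_1$ objective promotes sparsity in the estimated direction, which is what makes it estimable when $p \gg n$.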
In clustering, however, the class labels are not observed, so supervised discriminant methods do not apply directly.
To handle the missing labels, the paper introduces CHIME, which imputes them iteratively with an EM-type algorithm:
- use the posterior probability that $z^{(i)}$ belongs to class $k$ as a soft label for $z^{(i)}$ (the E-step), as sketched below
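To make the soft-labeling step concrete, here is a minimal numpy/scipy sketch of one EM iteration under the model above. The function names are mine, and the regularized update of $\beta$ that distinguishes CHIME from the vanilla GMM EM is omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(z, w, mu1, mu2, Sigma):
    """Posterior probability that each z^(i) belongs to class 2,
    used as a soft label (generic GMM E-step; for large p a log-space
    computation would be needed to avoid underflow)."""
    f1 = multivariate_normal.pdf(z, mean=mu1, cov=Sigma)
    f2 = multivariate_normal.pdf(z, mean=mu2, cov=Sigma)
    return w * f2 / ((1 - w) * f1 + w * f2)

def m_step(z, gamma):
    """Update the mixing weight and means from the soft labels gamma."""
    w = gamma.mean()
    mu1 = (1 - gamma) @ z / (1 - gamma).sum()
    mu2 = gamma @ z / gamma.sum()
    return w, mu1, mu2
```

Alternating these two steps from an initial guess of the parameters gives the basic EM iteration on which CHIME builds.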