Selective Inference for Hierarchical Clustering
This note is for Gao, L. L., Bien, J., & Witten, D. (2022). Selective Inference for Hierarchical Clustering. arXiv:2012.02936.
develop a valid test for a difference in means between two clusters estimated from the data
consider the following model for $n$ observations of $q$ features
\[\bfX \sim MN_{n\times q} (\mu, I_n, \sigma^2I_q)\,,\]where $\mu\in \IR^{n\times q}$, with rows $\mu_i$, is unknown, and $\sigma^2 > 0$ is known.
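A minimal simulation sketch (the names here are mine, not the paper's): under this matrix normal model, the rows $X_i$ are independent $N(\mu_i, \sigma^2 I_q)$, so a draw is just $\mu$ plus iid Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, sigma = 150, 2, 1.0

# mu with three true clusters of 50 rows each (an arbitrary illustration)
mu = np.repeat(np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]), 50, axis=0)

# X ~ MN_{n x q}(mu, I_n, sigma^2 I_q): rows are independent N(mu_i, sigma^2 I_q)
X = mu + sigma * rng.standard_normal((n, q))
```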
For $\cG\subset \{1,2,\ldots, n\}$, let
\[\bar\mu_{\cG} =\frac{1}{\vert \cG\vert}\sum_{i\in \cG}\mu_i\qquad \text{and}\qquad \bar X_{\cG} = \frac{1}{\vert \cG\vert}\sum_{i\in \cG}X_i\,.\]Given a realization $\bfx\in \IR^{n\times q}$ of $\bfX$, first apply a clustering algorithm $\cC$ to obtain clusters $\cC(\bfx)$; then use $\bfx$ to test, for a pair of clusters $\hat\cC_1,\hat\cC_2\in \cC(\bfx)$,
\[H_0^{\hat\cC_1,\hat \cC_2}: \bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}\,.\]A naive Wald test has $p$-value
\[P\left(\Vert \bar X_{\hat \cC_1}-\bar X_{\hat \cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2\right)\,,\]where, under $H_0^{\hat\cC_1,\hat\cC_2}$, $\Vert \bar X_{\hat \cC_1}-\bar X_{\hat \cC_2}\Vert_2 \sim (\sigma\sqrt{1/\vert \hat\cC_1\vert + 1/\vert \hat\cC_2\vert})\cdot\chi_q$. This ignores that $\hat\cC_1$ and $\hat\cC_2$ were estimated from the same data, so the naive test fails to control the type I error.
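A sketch of this naive test (the function name and signature are my own, not from the paper):

```python
import numpy as np
from scipy.stats import chi

def naive_p_value(x, C1, C2, sigma):
    """Naive Wald p-value for H0: equal cluster means, ignoring selection.

    x : (n, q) data matrix; C1, C2 : lists of row indices of the two clusters.
    """
    q = x.shape[1]
    diff = x[C1].mean(axis=0) - x[C2].mean(axis=0)
    scale = sigma * np.sqrt(1 / len(C1) + 1 / len(C2))
    # Under H0 -- and pretending C1, C2 were fixed in advance -- the scaled
    # statistic follows a chi distribution with q degrees of freedom.
    return chi.sf(np.linalg.norm(diff) / scale, df=q)
```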
key idea: account for the hypothesis selection procedure by defining a $p$-value that conditions on the event $\{\hat \cC_1,\hat \cC_2\in \cC(\bfX)\}$. This yields a correctly sized test.
a large body of work evaluates the statistical significance of a clustering by testing the goodness of fit of models under misspecification of the number of clusters, or by assessing the stability of the estimated clusters
- most of these papers rely on bootstrap resampling or on asymptotic approximations to the null distribution
the proposed framework avoids the need for resampling and provides exact finite-sample inference for the difference in means between a pair of estimated clusters, under the assumption that $\sigma$ is known
- Campbell (2018) considers testing for a difference in means after convex clustering (Hocking et al. 2011), a relatively esoteric form of clustering
the proposed framework is efficient when applied to hierarchical clustering
- Zhang et al. (2019) proposes splitting the data, clustering the training set, and applying these clusters to the test set
Selective Inference for Clustering
a test of no difference in means between two clusters
define the p-value as a conditional version
\[P_{H_0^{\hat\cC_1, \hat\cC_2}}\left(\Vert \bar X_{\hat\cC_1} - \bar X_{\hat\cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2 \mid \hat \cC_1,\hat \cC_2\in \cC(\bfX) \right)\label{eq:5}\]This amounts to asking: “Among all realizations of $\bfX$ that result in clusters $\hat \cC_1$ and $\hat \cC_2$, what proportion have a difference in cluster means at least as large as the difference in cluster means in the observed data set, when in truth $\bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}$?”
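Purely to illustrate what this conditional probability means, here is a rejection-sampling sketch. It is not the paper's method (which computes the $p$-value exactly, without resampling), and it assumes a much stronger global null $\mu = 0$; the helper names are mine.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cut_tree(x, k):
    """Average-linkage hierarchical clustering of the rows of x, cut into k clusters."""
    labels = fcluster(linkage(x, method="average"), t=k, criterion="maxclust")
    return {frozenset(np.flatnonzero(labels == j)) for j in range(1, k + 1)}

def conditional_p_value_mc(x, C1, C2, k, sigma, n_sim=10_000, rng=None):
    """Monte Carlo illustration of eq. (5), under the *global* null mu = 0.

    Simulates null data sets, keeps those whose clustering contains both
    C1 and C2 (the conditioning event), and reports the fraction with a
    difference in cluster means at least as large as the observed one.
    """
    rng = rng or np.random.default_rng()
    n, q = x.shape
    i1, i2 = sorted(C1), sorted(C2)
    observed = np.linalg.norm(x[i1].mean(axis=0) - x[i2].mean(axis=0))
    kept = extreme = 0
    for _ in range(n_sim):
        x_sim = sigma * rng.standard_normal((n, q))
        if {frozenset(C1), frozenset(C2)} <= cut_tree(x_sim, k):
            kept += 1
            diff = np.linalg.norm(x_sim[i1].mean(axis=0) - x_sim[i2].mean(axis=0))
            extreme += diff >= observed
    return extreme / kept if kept else float("nan")
```

Note that the conditioning event is typically very rare under resampling, which is one reason an exact, resampling-free characterization of the conditional distribution matters.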
One can show that rejecting $H_0^{\hat\cC_1,\hat\cC_2}$ when the above $p$-value falls below $\alpha$ controls the selective type I error rate at level $\alpha$.
For any non-overlapping groups of observations $\cG_1, \cG_2\subset\{1,2,\ldots, n\}$, let $H_0^{\cG_1,\cG_2}$ denote the null hypothesis that $\bar \mu_{\cG_1}=\bar \mu_{\cG_2}$. ….
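Spelled out, the selective type I error guarantee is (in the paper's notation, as I recall it): for any $\alpha\in(0,1)$,
\[P_{H_0^{\cG_1,\cG_2}}\left(\text{reject } H_0^{\cG_1,\cG_2}\text{ at level }\alpha \mid \cG_1,\cG_2\in\cC(\bfX)\right)\le \alpha\,.\]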
however, \eqref{eq:5} cannot be computed, since the conditional distribution of $\Vert \bar X_{\hat\cC_1}-\bar X_{\hat\cC_2}\Vert_2$ given $\hat \cC_1, \hat\cC_2\in \cC(\bfX)$ involves the unknown nuisance parameter $\pi_\nu^\bot \mu$, where $\pi_\nu^\bot = I_n - \frac{\nu\nu^\top}{\Vert\nu\Vert_2^2}$ projects onto the orthogonal complement of the vector $\nu = \nu(\hat\cC_1, \hat\cC_2)$ defined by
\[[\nu(\hat\cC_1, \hat\cC_2)]_i = 1\{i\in \hat\cC_1\}/\vert \hat\cC_1\vert - 1\{i\in \hat\cC_2\}/\vert\hat\cC_2\vert\,.\]In other words, it requires knowing aspects of $\mu$ that are not known under the null.
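For concreteness, a small sketch of the contrast vector $\nu$ and the projection $\pi_\nu^\bot$ (function names are mine):

```python
import numpy as np

def contrast_vector(n, C1, C2):
    """nu(C1, C2): satisfies X^T nu = Xbar_{C1} - Xbar_{C2} for an (n, q) matrix X."""
    nu = np.zeros(n)
    nu[sorted(C1)] = 1 / len(C1)
    nu[sorted(C2)] = -1 / len(C2)
    return nu

def orthogonal_projector(nu):
    """pi_nu^perp = I_n - nu nu^T / ||nu||_2^2: projects onto the complement of span(nu)."""
    return np.eye(len(nu)) - np.outer(nu, nu) / np.dot(nu, nu)
```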
instead, we will define the $p$-value for $H_0^{\hat\cC_1,\hat\cC_2}$ to be