Selective Inference for Hierarchical Clustering
This note is for Gao, L. L., Bien, J., & Witten, D. (2022). Selective Inference for Hierarchical Clustering (arXiv:2012.02936). arXiv.
develop a valid test for a difference in means between two clusters estimated from the data
consider the following model for $n$ observations of $q$ features
\[X \sim MN_{n\times q} (\mu, I_n, \sigma^2I_q)\]where the mean matrix $\mu\in \IR^{n\times q}$, with rows $\mu_i$, is unknown, and $\sigma^2 > 0$ is known.
For $\cG\subset \{1,2,\ldots, n\}$, let
\[\bar\mu_{\cG} =\frac{1}{\vert \cG\vert}\sum_{i\in \cG}\mu_i\qquad \text{and}\qquad \bar X_{\cG} = \frac{1}{\vert \cG\vert}\sum_{i\in \cG}X_i\,.\]Given a realization $\bfx\in \IR^{n\times q}$ of $\bfX$, first apply a clustering algorithm $\cC$ to obtain $\cC(\bfx)$. Then use $\bfx$ to test, for a pair of clusters $\hat\cC_1,\hat\cC_2\in \cC(\bfx)$,
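As a quick concrete check of the notation, here is a minimal numpy sketch (all names hypothetical). Since the matrix normal with row covariance $I_n$ and column covariance $\sigma^2 I_q$ amounts to adding i.i.d. $N(0,\sigma^2)$ noise to each entry of $\mu$, simulation is one line, and $\bar X_{\cG}$ is a row-wise mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, sigma = 10, 3, 1.0

# hypothetical mean matrix mu with rows mu_i: two groups of observations
mu = np.zeros((n, q))
mu[5:] = 2.0  # second half of the rows shifted in every feature

# MN(mu, I_n, sigma^2 I_q) reduces to i.i.d. N(0, sigma^2) entries added to mu
X = mu + sigma * rng.standard_normal((n, q))

# empirical and population means of a group G of observations
G = [0, 1, 2, 3, 4]
X_bar_G = X[G].mean(axis=0)    # \bar X_G, a q-vector
mu_bar_G = mu[G].mean(axis=0)  # \bar mu_G, here the zero vector
```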
\[H_0^{\hat\cC_1,\hat \cC_2}: \bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}\]a naive Wald test, with $p$-value given by
\[P(\Vert \bar X_{\hat \cC_1}-\bar X_{\hat \cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2)\]where $\Vert \bar X_{\hat \cC_1}-\bar X_{\hat\cC_2}\Vert_2 \sim (\sigma\sqrt{1/\vert \hat\cC_1\vert + 1/\vert \hat\cC_2\vert})\cdot\chi_q$
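The naive $p$-value is easy to compute from the scaled $\chi_q$ distribution; a sketch assuming scipy is available (`naive_wald_pvalue` is a hypothetical helper, and the clusters are treated as fixed index sets):

```python
import numpy as np
from scipy.stats import chi

def naive_wald_pvalue(x, C1, C2, sigma):
    """Naive (non-selective) Wald p-value for equal cluster means.

    x: (n, q) data matrix; C1, C2: disjoint index lists for the two clusters.
    For FIXED clusters, || xbar_C1 - xbar_C2 ||_2 under the null follows
    c * chi_q with c = sigma * sqrt(1/|C1| + 1/|C2|).
    """
    diff = x[C1].mean(axis=0) - x[C2].mean(axis=0)
    stat = np.linalg.norm(diff)
    c = sigma * np.sqrt(1 / len(C1) + 1 / len(C2))
    return chi.sf(stat / c, df=x.shape[1])

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 4))
p_null = naive_wald_pvalue(x, [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], sigma=1.0)

x_shift = x.copy()
x_shift[:5] += 10.0  # a genuine mean difference drives p toward zero
p_alt = naive_wald_pvalue(x_shift, [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], sigma=1.0)
```

When the clusters are instead estimated from the same data, this naive $p$-value is anti-conservative, which is what motivates the conditional approach below.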
key idea: account for the hypothesis selection procedure by defining a $p$-value that conditions on the event $\{\hat \cC_1,\hat \cC_2\in \cC(\bfX)\}$. This yields a correctly-sized test.
a large body of work evaluates the statistical significance of a clustering by testing the goodness-of-fit of models under misspecification of the number of clusters, or by assessing the stability of estimated clusters
most of these papers rely on bootstrap sampling or asymptotic approximations to the null distribution

the proposed framework avoids the need for resampling and provides exact finite-sample inference for the difference in means between a pair of estimated clusters, under the assumption that $\sigma$ is known
 Campbell (2018) considers testing for a difference in means after convex clustering (Hocking et al. 2011), a relatively esoteric form of clustering
the proposed framework is efficient when applied to hierarchical clustering
 Zhang et al. (2019) propose splitting the data, clustering the training set, and applying these clusters to the test set
Selective Inference for Clustering
a test of no difference in means between two clusters
define the $p$-value as a conditional version
\[P_{H_0^{\hat\cC_1, \hat\cC_2}}\left(\Vert\bar X_{\hat\cC_1} - \bar X_{\hat\cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2 \mid \hat \cC_1,\hat \cC_2\in \cC(\bfX) \right)\label{eq:5}\]This amounts to asking: “Among all realizations of $\bfX$ that result in clusters $\hat \cC_1$ and $\hat \cC_2$, what proportion have a difference in cluster means at least as large as the difference in cluster means in the observed data set, when in truth $\bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}$?”
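To build intuition for \eqref{eq:5}, one can approximate it by brute-force rejection sampling in a toy setting: simulate under the global null $\mu = 0$, keep only draws whose estimated clustering reproduces the observed pair of clusters, and compute the proportion with a larger difference in cluster means. A hypothetical sketch using scipy's average-linkage hierarchical clustering cut at two clusters (the paper's actual test is exact and needs no resampling; all helper names are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def two_clusters(x):
    """Cut average-linkage hierarchical clustering into two clusters."""
    labels = fcluster(linkage(x, method="average"), t=2, criterion="maxclust")
    return frozenset(frozenset(np.flatnonzero(labels == k).tolist()) for k in (1, 2))

def diff_norm(x, clusters):
    """|| xbar_C1 - xbar_C2 ||_2 for a pair of clusters."""
    C1, C2 = (sorted(C) for C in clusters)
    return np.linalg.norm(x[C1].mean(axis=0) - x[C2].mean(axis=0))

rng = np.random.default_rng(0)
n, q, sigma = 6, 2, 1.0

# "observed" data, drawn under the global null mu = 0
x_obs = sigma * rng.standard_normal((n, q))
clusters_obs = two_clusters(x_obs)
t_obs = diff_norm(x_obs, clusters_obs)

# rejection sampling: keep only null draws on the conditioning event
# {C1_hat, C2_hat in C(X)}, then compare the statistic to its observed value
hits = []
for _ in range(5000):
    x_new = sigma * rng.standard_normal((n, q))
    if two_clusters(x_new) == clusters_obs:
        hits.append(diff_norm(x_new, clusters_obs) >= t_obs)
p_cond = float(np.mean(hits))
```

This Monte Carlo scheme only illustrates the definition; in general the conditioning event can be extremely rare, and the conditional distribution depends on unknown nuisance parameters, which is the problem addressed below.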
One can show that rejecting $H_0^{\hat\cC_1,\hat\cC_2}$ when this $p$-value falls below $\alpha$ controls the selective type I error rate at level $\alpha$.
For any non-overlapping groups of observations $\cG_1, \cG_2\subset\{1,2,\ldots, n\}$, let $H_0^{\cG_1,\cG_2}$ denote the null hypothesis that $\bar \mu_{\cG_1}=\bar \mu_{\cG_2}$. ….
however, \eqref{eq:5} cannot be computed, since the conditional distribution of $\Vert \bar X_{\hat\cC_1}-\bar X_{\hat\cC_2}\Vert_2$ given $\hat \cC_1, \hat\cC_2\in \cC(\bfX)$ involves the unknown nuisance parameter $\pi_\nu^\perp\mu$, where $\pi_\nu^\perp = I_n - \frac{\nu\nu^\top}{\Vert\nu\Vert_2^2}$ projects onto the orthogonal complement of the vector $\nu = \nu(\hat\cC_1,\hat\cC_2)$, and where
\[[\nu(\hat\cC_1, \hat\cC_2)]_i = 1\{i\in \hat\cC_1\}/\vert \hat\cC_1\vert - 1\{i\in \hat\cC_2\}/\vert\hat\cC_2\vert\,.\]In other words, it requires knowing aspects of $\mu$ that are not known under the null.
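The contrast vector $\nu$ and the projection $\pi_\nu^\perp$ are easy to verify numerically: $\nu^\top X$ recovers the difference in cluster means, and $\pi_\nu^\perp$ annihilates $\nu$. A small sketch (helper name hypothetical):

```python
import numpy as np

def contrast_vector(n, C1, C2):
    """nu(C1, C2): nu_i = 1{i in C1}/|C1| - 1{i in C2}/|C2|."""
    nu = np.zeros(n)
    nu[list(C1)] = 1 / len(C1)
    nu[list(C2)] = -1 / len(C2)
    return nu

rng = np.random.default_rng(0)
n, q = 7, 3
X = rng.standard_normal((n, q))
C1, C2 = [0, 1, 2], [3, 4]  # hypothetical pair of disjoint clusters

nu = contrast_vector(n, C1, C2)

# nu^T X equals the difference in cluster means, a q-vector
diff = X[C1].mean(axis=0) - X[C2].mean(axis=0)
assert np.allclose(nu @ X, diff)

# pi_perp = I_n - nu nu^T / ||nu||_2^2 projects onto the orthogonal
# complement of nu, so it maps nu itself to zero
pi_perp = np.eye(n) - np.outer(nu, nu) / (nu @ nu)
assert np.allclose(pi_perp @ nu, 0.0)
```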
instead, we will define the $p$-value for $H_0^{\hat\cC_1,\hat\cC_2}$ to be