Selective Inference for Hierarchical Clustering
This note is for Gao, L. L., Bien, J., & Witten, D. (2022). Selective Inference for Hierarchical Clustering. arXiv:2012.02936.
develop a valid test for a difference in means between two clusters estimated from the data
consider the following model for $n$ observations of $q$ features
\[\bfX \sim MN_{n\times q} (\mu, I_n, \sigma^2I_q)\,,\]where $\mu\in \IR^{n\times q}$, with rows $\mu_i$, is unknown, and $\sigma^2 > 0$ is known.
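A minimal simulation sketch (the names here are mine, not the paper's): under this matrix normal model, the rows $X_i$ are independent $N(\mu_i, \sigma^2 I_q)$, so a draw is just $\mu$ plus iid Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, sigma = 150, 2, 1.0

# mu with three true clusters of 50 rows each (an arbitrary illustration)
mu = np.repeat(np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]), 50, axis=0)

# X ~ MN_{n x q}(mu, I_n, sigma^2 I_q): rows are independent N(mu_i, sigma^2 I_q)
X = mu + sigma * rng.standard_normal((n, q))
```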
For $\cG\subset \{1,2,\ldots, n\}$, let
\[\bar\mu_{\cG} =\frac{1}{\vert \cG\vert}\sum_{i\in \cG}\mu_i\qquad \text{and}\qquad \bar X_{\cG} = \frac{1}{\vert \cG\vert}\sum_{i\in \cG}X_i\,.\]Given a realization $\bfx\in \IR^{n\times q}$ of $\bfX$, first apply a clustering algorithm $\cC$ to obtain clusters $\cC(\bfx)$; then use $\bfx$ to test, for a pair of clusters $\hat\cC_1,\hat\cC_2\in \cC(\bfx)$,
\[H_0^{\hat\cC_1,\hat \cC_2}: \bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}\,.\]A naive Wald test has $p$-value
\[P\left(\Vert \bar X_{\hat \cC_1}-\bar X_{\hat \cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2\right)\,,\]where, under $H_0^{\hat\cC_1,\hat\cC_2}$, $\Vert \bar X_{\hat \cC_1}-\bar X_{\hat \cC_2}\Vert_2 \sim (\sigma\sqrt{1/\vert \hat\cC_1\vert + 1/\vert \hat\cC_2\vert})\cdot\chi_q$. This ignores that $\hat\cC_1$ and $\hat\cC_2$ were estimated from the same data, so the naive test fails to control the type I error.
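A sketch of this naive test (the function name and signature are my own, not from the paper):

```python
import numpy as np
from scipy.stats import chi

def naive_p_value(x, C1, C2, sigma):
    """Naive Wald p-value for H0: equal cluster means, ignoring selection.

    x : (n, q) data matrix; C1, C2 : lists of row indices of the two clusters.
    """
    q = x.shape[1]
    diff = x[C1].mean(axis=0) - x[C2].mean(axis=0)
    scale = sigma * np.sqrt(1 / len(C1) + 1 / len(C2))
    # Under H0 -- and pretending C1, C2 were fixed in advance -- the scaled
    # statistic follows a chi distribution with q degrees of freedom.
    return chi.sf(np.linalg.norm(diff) / scale, df=q)
```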
key idea: account for the hypothesis selection procedure by defining a $p$-value that conditions on the event $\{\hat \cC_1,\hat \cC_2\in \cC(\bfX)\}$. This yields a correctly sized test.
a large body of work evaluates the statistical significance of a clustering by testing the goodness of fit of models under misspecification of the number of clusters, or by assessing the stability of the estimated clusters
- most of these papers rely on bootstrap resampling or on asymptotic approximations to the null distribution
the proposed framework avoids the need for resampling and provides exact finite-sample inference for the difference in means between a pair of estimated clusters, under the assumption that $\sigma$ is known
- Campbell (2018) considers testing for a difference in means after convex clustering (Hocking et al. 2011), a relatively esoteric form of clustering
the proposed framework is efficient when applied to hierarchical clustering
- Zhang et al. (2019) proposes splitting the data, clustering the training set, and applying these clusters to the test set
Selective Inference for Clustering
a test of no difference in means between two clusters
define the p-value as a conditional version
\[P_{H_0^{\hat\cC_1, \hat\cC_2}}\left(\Vert \bar X_{\hat\cC_1} - \bar X_{\hat\cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2 \mid \hat \cC_1,\hat \cC_2\in \cC(\bfX) \right)\label{eq:5}\]This amounts to asking: “Among all realizations of $\bfX$ that result in clusters $\hat \cC_1$ and $\hat \cC_2$, what proportion have a difference in cluster means at least as large as the difference in cluster means in the observed data set, when in truth $\bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}$?”
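Purely to illustrate what this conditional probability means, here is a rejection-sampling sketch. It is not the paper's method (which computes the $p$-value exactly, without resampling), and it assumes a much stronger global null $\mu = 0$; the helper names are mine.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cut_tree(x, k):
    """Average-linkage hierarchical clustering of the rows of x, cut into k clusters."""
    labels = fcluster(linkage(x, method="average"), t=k, criterion="maxclust")
    return {frozenset(np.flatnonzero(labels == j)) for j in range(1, k + 1)}

def conditional_p_value_mc(x, C1, C2, k, sigma, n_sim=10_000, rng=None):
    """Monte Carlo illustration of eq. (5), under the *global* null mu = 0.

    Simulates null data sets, keeps those whose clustering contains both
    C1 and C2 (the conditioning event), and reports the fraction with a
    difference in cluster means at least as large as the observed one.
    """
    rng = rng or np.random.default_rng()
    n, q = x.shape
    i1, i2 = sorted(C1), sorted(C2)
    observed = np.linalg.norm(x[i1].mean(axis=0) - x[i2].mean(axis=0))
    kept = extreme = 0
    for _ in range(n_sim):
        x_sim = sigma * rng.standard_normal((n, q))
        if {frozenset(C1), frozenset(C2)} <= cut_tree(x_sim, k):
            kept += 1
            diff = np.linalg.norm(x_sim[i1].mean(axis=0) - x_sim[i2].mean(axis=0))
            extreme += diff >= observed
    return extreme / kept if kept else float("nan")
```

Note that the conditioning event is typically very rare under resampling, which is one reason an exact, resampling-free characterization of the conditional distribution matters.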
One can show that rejecting $H_0^{\hat\cC_1,\hat\cC_2}$ when the above $p$-value falls below $\alpha$ controls the selective type I error rate at level $\alpha$.
For any non-overlapping groups of observations $\cG_1, \cG_2\subset\{1,2,\ldots, n\}$, let $H_0^{\cG_1,\cG_2}$ denote the null hypothesis that $\bar \mu_{\cG_1}=\bar \mu_{\cG_2}$. ….
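Spelled out, the selective type I error guarantee is (in the paper's notation, as I recall it): for any $\alpha\in(0,1)$,
\[P_{H_0^{\cG_1,\cG_2}}\left(\text{reject } H_0^{\cG_1,\cG_2}\text{ at level }\alpha \mid \cG_1,\cG_2\in\cC(\bfX)\right)\le \alpha\,.\]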
however, \eqref{eq:5} cannot be computed, since the conditional distribution of $\Vert \bar X_{\hat\cC_1}-\bar X_{\hat\cC_2}\Vert_2$ given $\hat \cC_1, \hat\cC_2\in \cC(\bfX)$ involves the unknown nuisance parameter $\pi_\nu^\bot \mu$, where $\pi_\nu^\bot = I_n - \frac{\nu\nu^\top}{\Vert\nu\Vert_2^2}$ projects onto the orthogonal complement of the vector $\nu = \nu(\hat\cC_1, \hat\cC_2)$ defined by
\[[\nu(\hat\cC_1, \hat\cC_2)]_i = 1\{i\in \hat\cC_1\}/\vert \hat\cC_1\vert - 1\{i\in \hat\cC_2\}/\vert\hat\cC_2\vert\,.\]In other words, it requires knowing aspects of $\mu$ that are not known under the null.
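For concreteness, a small sketch of the contrast vector $\nu$ and the projection $\pi_\nu^\bot$ (function names are mine):

```python
import numpy as np

def contrast_vector(n, C1, C2):
    """nu(C1, C2): satisfies X^T nu = Xbar_{C1} - Xbar_{C2} for an (n, q) matrix X."""
    nu = np.zeros(n)
    nu[sorted(C1)] = 1 / len(C1)
    nu[sorted(C2)] = -1 / len(C2)
    return nu

def orthogonal_projector(nu):
    """pi_nu^perp = I_n - nu nu^T / ||nu||_2^2: projects onto the complement of span(nu)."""
    return np.eye(len(nu)) - np.outer(nu, nu) / np.dot(nu, nu)
```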
instead, we will define the $p$-value for $H_0^{\hat\cC_1,\hat\cC_2}$ to be