WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Selective Inference for Hierarchical Clustering

Posted on
Tags: Selective Inference, Hierarchical Clustering

This note is for Gao, L. L., Bien, J., & Witten, D. (2022). Selective Inference for Hierarchical Clustering (arXiv:2012.02936). arXiv.

develop a valid test for a difference in means between two clusters estimated from the data

consider the following model for $n$ observations of $q$ features

\[X \sim MN_{n\times q} (\mu, I_n, \sigma^2I_q)\]

where $\mu\in \IR^{n\times q}$, with rows $\mu_i$ is unknown, and $\sigma^2 > 0$ is known.

For $\cG\subset {1,2,\ldots, n}$, let

\[\mu_{\cG} =\frac{1}{\vert \cG\vert}\sum_{i\in \cG}\mu_i\qquad \text{and}\qquad \bar X_{\cG} = \frac{1}{\vert \cG\vert}\sum_{i\in \cG}X_i\,,\]

given a realization $\bfx\in \IR^{n\times q}$ of $\bfX$, first apply a clustering algorithm $\cC$ to obtain $\cC(\bfx)$. Then use $\bfx$ to test, for a pair of clusters $\hat\cC_1,\hat\cC_2\in \cC(\bfx)$,

\[H_0^{\hat\cC_1,\hat \cC_2}: \bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}\]

a naive Wald test, with $p$-value given by

\[P(\Vert \bar X_{\hat \cC_1}-\bar X_{\hat \cC_2}\Vert \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2)\]

where $\Vert \bar X_{\hat \cC_1}-\bar X_{\cC_2}\Vert_2 \sim (\sigma\sqrt{1/\vert \hat\cC_1\vert + 1/\vert \cC_2\vert})\cdot\chi_q$

key idea: account for the hypothesis selection procedure by defining a p-value that conditions on the event ${\hat \cC_1,\hat cC_2\in \cC(\bfX)}$. This yields a correctly-sized test.

a large body of work evaluates the statistical significance of a clustering by testing the goodness-of-fit of models under the misspecification of the number of clusters or by assessing the stability of estimated clusters

  • most of these papers conduct bootstrap sampling or asymptotic approximations to the null distribution
  • the proposed framework avoids the need for resampling and provides exact finite-sample inference for the difference in means between a pair of estimated clusters, under the assumption that $\sigma$ is known

  • Campbell (2018) considers testing for a difference in means after convex clustering (Hocking et al. 2011), a relatively esoteric form of clustering

the proposed framework is efficient when applied to hierarchical clustering

  • Zhang et al. (2019) proposes splitting the data, clustering the training set, and applying these clusters to the test set

Selective Inference for Clustering

a test of no difference in means between two clusters

define the p-value as a conditional version

\[P_{H_0^{\hat\cC_1, \hat\cC_2}}\left(\bar X_{\hat\cC_1} - \bar X_{\hat cC_2}\Vert_2 \ge \Vert \bar x_{\hat \cC_1}-\bar x_{\hat \cC_2}\Vert_2 \mid \hat \cC_1,\hat \cC_2\in \cC(\bfX) \right)\label{eq:5}\]

This amounts to asking “Among all realizations of $\bfX$ that result in clusters $\hat \cC_1$ and $\hat \cC_2$, what proportion have a difference in cluster means at least as large as the difference in cluster means in the observed data set, when in truth $\bar \mu_{\hat\cC_1} = \bar\mu_{\hat \cC_2}$”

One can show that rejecting $H_0^{\hat\cC_1,\hat\cC_2}$ when the above p-value controls the selective type I error rate.

For any non-overlapping groups of observations $\cG_1, \cG_2\subset{1,2,\ldots, n}$, let $H_0^{\cG_1,\cG_2}$ denote the null hypothesis that $\bar \mu_{\cG_1}=\bar \mu_{\cG_2}$. ….

however, \eqref{eq:5} cannot be calculated since the conditional distribution of $\Vert \bar X_{\cC_1}-\bar X_{\cC_2}\Vert_2$ given $\hat \cC_1, \hat\cC_2\in \cC(\bfX)$ involves the unknown nuisance parameters $\pi^\bot = I_n - \frac{\nu\nu^\top}{\Vert\nu\Vert_2^2}$ projects onto the orthogonal complemenet of the vector $\nu$, and where

\[[\nu(\hat\cC_1, \hat\cC_2)]_i = 1\{i\in \hat\cC_1\}/\vert \hat\cC_1\vert - 1\{i\in \hat\cC_2\}/\vert\hat\cC_2\vert\,.\]

In other words, it requires knowing aspects of $\mu$ that are not known under the null.

instead, we will define the $p$-value for $H_0^{\hat\cC_1,\hat\cC_2}$ to be




Interpreting $x’(\phi)$ and $\hat\cS$


Computing $\hat S for hierarchical clustering



Monte Carlo approximation to the p-value

Non-spherical covariance matrix

Unknown variance

Conseuqences of selective type I error rate control

Simulation results

uniform p-values under a global null

conditional power and recovery probability

Data applications



Published in categories Note