# Selective Inference for K-means

##### Posted on

This note is for Chen, Y. T., & Witten, D. M. (2022). Selective inference for k-means clustering (arXiv:2203.15267). arXiv.

The paper consider the simple and well-studied model for $n$ observations and $q$ features

\[X\sim MN_{n\times q}(\mu, I_n, \sigma^2I_q)\]where

- $\mu\in \IR^{n\times q}$ has unknown rows $\mu_i$, and $\sigma^2 > 0$ is known
- given a realization $x\in \IR^{n\times q}$ of $X$, first apply the $k$-means clustering algorithm to obtain $C(x)$, a partition of the samples ${1,\ldots, n}$.

Consider testing the null hypothesis that the mean is the same across two estimated clusters, i.e.,

\[H_0: \sum_{i\in \hat C_1}/\vert\hat C_1\vert = \sum_{i\in \hat C_2}/\vert \hat C_2\vert \quad \text{versus}\quad H_1: \sum_{i\in \hat C_1}/\vert\hat C_1\vert \neq \sum_{i\in \hat C_2}\mu_i/\vert\hat C_2\vert\]this is equivalent to testing $H_0: \mu^\top\nu = 0_q$ versus $H_1:\mu^\top \nu\neq 0_q$, where

\[\nu_i = 1{i\in \hat C_1}/\vert \hat C_1\vert - 1{i\in \hat C_2}/\vert\hat C_2\vert, i=1,\ldots, n\]key insight behind selective inference is as follows: to obtain a valid test of $H_0$, we need to condition on the aspect of the data that led us to test it.

## Selective Inference for k-means clustering

given samples $x_1,\ldots, x_n\in \IR^q$ and a positive integer $K$, $k$-means clustering partitions the $n$ samples into disjoint subsets $\hat C_1,\ldots,\hat C_K$ by solving the following optimization problem

\[\min_{C_1,\ldots, C_K} \sum_{k=1}^K\sum_{i\in C_k} \Vert x_i - \sum_{i\in C_k}x_i/\vert C_k\vert \Vert_2^2 \text{subject to} \cup_{k=1}^K C_k = \{1,\ldots, n\}, C_k\cap C_{k'} = \emptyset,\text{for all } k\neq k'\,.\]it is not typically possible to solve for the global optimum

a number of algorithms are available to find a local optimum; one such approach is Lloydâ€™s algorithm

since the k-means algorithm partitions all $n$ observations, it is natural to condition on the cluster assignments of all observations rather than just on ${\hat C_1, \hat C_2\in C(X)}$. this leads to the $p$-value

\[pr_{H_0}\left[\Vert X^T\nu\Vert_2 \ge \Vert x^T\nu\Vert_2 \mid \cap_{i=1}^n\{c_i^{(T)}(X) = c_i^{(T)}(x)\}, \Pi_{\nu}^\perp X = \Pi_{\nu}^\perp x, dir(X^\top\nu) = dir(x^\top\nu) \right]\]however, characterizing $\cap_{i=1}^n \{c_i^{(T)}(X) = c_i^{(T)}(x)\}$ is not straightforward

so, the paper condition on **all of the intermediate clustering assignments**

this p-value answers the question:

Assuming that there is no difference between the population means of $\hat C_1$ and $\hat C_2$, what is the probability of observing such a large difference between their centroids, among all the realizations of $X$ that yield identical results in every iterations of the $k$-means algorithm?

## Extensions: Non-spherical covariance matrix

For a known positive definite matrix $\Sigma$, let

\[X\sim MN_{n\times q}(\mu, I_n, \Sigma)\]we can whiten the data by applying the transformation $x_i\rightarrow \Sigma^{-1/2}x_i$. Since $\Sigma^{-1/2}\succ 0$, testing the null hypothesis is equivalent to testing

\[H_0:\sum_{i\in \hat C_1}\Sigma^{-1/2}\mu_i/\vert \hat C_1\vert = \sum_{i\in \hat C_2}\Sigma^{-1/2}\mu_i/\vert \hat C_2\vert \quad \text{versus}\quad \Sigma_{i\in\hat C_1}\Sigma^{-1/2}\mu_i/\vert \hat C_1\vert \neq \sum_{i\in \hat C_2}\Sigma^{-1/2}\mu_i / \vert\hat C_2\vert\]