
Test Difference for A Single Feature

Tags: Selective Inference, K-means, Hierarchical Clustering

This note is for Chen, Y. T., & Gao, L. L. (2023). Testing for a difference in means of a single feature after clustering (arXiv:2311.16375). arXiv.

  • $x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
  • $\mu\in\IR^{n\times q}$: unknown mean matrix with rows $\mu_i\in\IR^q$
  • $\Sigma\in \IR^{q\times q}$: known, positive-definite covariance matrix

Assume that $x$ is a realization of a random matrix $X$ whose rows are independent and drawn from multivariate normal distributions:

\[X_i \sim N_q(\mu_i, \Sigma), i=1,2,\ldots, n\]

Consider testing for a difference in the mean of feature $j$ between two pre-defined groups $G$ and $G'$:

\[H_{0j}: \bar\mu_{Gj} = \bar \mu_{G'j}\quad \text{versus} \quad H_{1j}: \bar \mu_{Gj} \neq \bar\mu_{G'j}\]

This is equivalent to testing

\[H_{0j}: [\mu^T\nu]_j = 0 \quad \text{versus} \quad H_{1j}: [\mu^T\nu]_j \neq 0\,,\]

where $\nu$ is the $n$-vector with $i$-th element given by

\[\frac{1\{i\in G\}}{\vert G\vert} - \frac{1\{i\in G'\}}{\vert G'\vert}\]

Under the normal assumption, the two-sided p-value is

\[2\left(1 - \Phi\left(\vert [x^T\nu]_j \vert \Big/ \sqrt{\Vert \nu\Vert_2^2\Sigma_{jj}}\right)\right)\]
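
As a quick self-contained illustration (mine, not the paper's), this Z-test can be computed directly from the contrast vector $\nu$; the groups and data below are made up and $\Sigma = I$ is assumed.

# Classical (non-selective) two-sample Z-test for feature j via the
# contrast vector nu; toy data with pre-defined groups and Sigma = I.
set.seed(0)
n <- 10; q <- 3; j <- 1
x <- matrix(rnorm(n * q), n, q)
G <- 1:5; Gp <- 6:10                       # two pre-defined groups
nu <- numeric(n)
nu[G] <- 1 / length(G)
nu[Gp] <- -1 / length(Gp)
Sigma_jj <- 1                              # known variance of feature j
z <- crossprod(x, nu)[j] / sqrt(sum(nu^2) * Sigma_jj)
2 * (1 - pnorm(abs(z)))                    # two-sided p-value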

Selective inference for the mean of a single feature

This motivates the following conditional version of the two-sample Z-test for $\hat H_{0j}$, where $\hat\nu$ is the contrast vector built from the estimated clusters $\hat G$ and $\hat G'$:

\[P\left(\vert [X^T\hat\nu]_j \vert \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) \right)\]

Computing this conditional probability is challenging, because

  • the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
  • the conditioning set $\{X\in \IR^{n\times q}: C(X)=C(x)\}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize

The authors therefore condition on additional events $U(X)$ and compute

\[P\left(\vert [X^T\hat\nu]_j \vert \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x)\right)\]

Conditioning on the additional event $\{U(X) = U(x)\}$ does not sacrifice control of the selective Type I error rate.
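
To see why the conditioning matters, here is a quick sanity check of my own (not from the paper): under a global null with no true clusters, the naive Z-test applied to clusters estimated by k-means rejects far more often than the nominal 5% level.

# Naive Z-test on k-means clusters under a global null (mu = 0):
# the rejection rate is far above 5% because the same data were
# used both to form the groups and to compare them.
set.seed(1)
n <- 100; q <- 2; nrep <- 500
rejections <- replicate(nrep, {
  x <- matrix(rnorm(n * q), n, q)          # no true cluster structure
  cl <- kmeans(x, centers = 2, nstart = 10)$cluster
  nu <- ifelse(cl == 1, 1 / sum(cl == 1), -1 / sum(cl == 2))
  z <- crossprod(x, nu)[1] / sqrt(sum(nu^2))   # test feature 1, Sigma = I
  2 * (1 - pnorm(abs(z))) < 0.05
})
mean(rejections)                           # empirical Type I error rate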


library(CADET)
library(MASS)

# Simulation: two true clusters of 50 observations each, q = 20 features.
# The cluster means differ in feature 1 and in feature q.
n <- 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q <- 20
rho <- 0                                   # equicorrelation between features
deltas <- 0:9
pvals <- matrix(0, nrow = length(deltas), ncol = q)

for (delta in deltas) {
  # mean matrix: one row per true cluster
  mu <- rbind(
    c(delta / 2, rep(0, q - 1)),
    c(rep(0, q - 1), delta / 2)
  )

  # known covariance matrix: unit variances, correlation rho
  sig <- 1
  cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
  diag(cov_mat_sim) <- 1
  cov_mat_sim <- cov_mat_sim * sig

  X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]

  # selective test of each feature between the two estimated clusters
  for (i in 1:q) {
    pvals[delta + 1, i] <- kmeans_inference_1f(X,
                                               k = 2,
                                               cluster_1 = 1,
                                               cluster_2 = 2,
                                               feat = i,
                                               iso = FALSE, sig = NULL,
                                               covMat = cov_mat_sim,
                                               iter.max = 15)$pval
  }
}

image(deltas, 1:q, log(pvals), xlab = "delta", ylab = "features")
grid(10, 20)

Figure: heatmap of the $\log$ p-values from the simulation above, over delta (x-axis) and the 20 features (y-axis).

  • From the figure, the test appears to be too conservative when delta is smaller than 6.
  • covMat has to be passed in; can estimation of the covariance matrix be incorporated into the whole procedure?
  • Each call tests only one feature, specified via feat = i; what if q is large? One ad-hoc fix is sketched after this list.
  • See also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2
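
For the large-q concern above, one pragmatic option (my suggestion, not something the paper or the package provides) is to collect the per-feature p-values and apply a standard multiplicity correction:

# Benjamini-Hochberg adjustment of the q per-feature p-values,
# applied row by row; pvals comes from the simulation above.
pvals_bh <- t(apply(pvals, 1, p.adjust, method = "BH"))
which(pvals_bh[10, ] < 0.05)               # features flagged at delta = 9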

Simulation study

The simulation study tests the null hypothesis

\[\hat H_{0j}: \bar \mu_{\hat Gj} = \bar \mu_{\hat G'j}\quad \text{versus}\quad \hat H_{1j}: \bar \mu_{\hat Gj} \neq \bar \mu_{\hat G'j}\]

where

  • $\hat G$ and $\hat G'$ are a randomly-chosen pair of clusters from k-means or hierarchical clustering
  • $j$ is a randomly-chosen feature (one replicate of this design is sketched after this list)
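
A minimal sketch of one replicate of this design (my reconstruction, reusing the kmeans_inference_1f interface from above; the choice k = 3 and the known isotropic variance sig = 1 are assumptions, not the paper's exact settings):

# One replicate: cluster a global-null dataset with k-means (k = 3),
# pick a random pair of estimated clusters and a random feature, then test.
# Assumes library(CADET) and library(MASS) are loaded as above.
set.seed(2)
X <- mvrnorm(100, mu = rep(0, 20), Sigma = diag(20))   # no true clusters
pair <- sample(3, 2)                       # random pair of clusters
j <- sample(20, 1)                         # random feature
kmeans_inference_1f(X, k = 3,
                    cluster_1 = pair[1], cluster_2 = pair[2],
                    feat = j, iso = TRUE, sig = 1,
                    iter.max = 15)$pval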

Why not test all features?

Selective Type I error rate

Conditional power and detection probability

  • conditional power: the probability of rejecting $\hat H_{0j}$ given that $\hat G$ and $\hat G'$ are the true clusters
  • detection probability: how often $\hat G$ and $\hat G'$ recover the true clusters (a Monte Carlo sketch follows this list)
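
A Monte Carlo sketch of the detection probability (mine, under the two-cluster setup of the simulation above; true_clusters is defined there, and delta = 5 is an arbitrary choice):

# Fraction of replicates in which k-means (k = 2) exactly recovers
# the true two-cluster partition, at a fixed separation delta.
set.seed(3)
delta <- 5
mu_det <- rbind(c(delta / 2, rep(0, 19)), c(rep(0, 19), delta / 2))
detect <- replicate(200, {
  X <- mvrnorm(100, mu = rep(0, 20), Sigma = diag(20)) + mu_det[true_clusters, ]
  cl <- kmeans(X, centers = 2, nstart = 10)$cluster
  # partition equality up to swapping the two labels
  all(cl == true_clusters) || all(3 - cl == true_clusters)
})
mean(detect)                               # estimated detection probability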
