Test Difference for A Single Feature

Posted on Apr 12, 2024 (Update: Jan 07, 2026)

Tags: Selective Inference, K-means, Hierarchical Clustering

This note is for Chen, Y. T., & Gao, L. L. (2023). Testing for a difference in means of a single feature after clustering (arXiv:2311.16375). arXiv.. Later the paper was published in the Biostatistics

$x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
$\mu\in\IR^{n\times q}$: unknown rows $\mu_i\in\IR^q$
$\Sigma\in \IR^{q\times q}$: known, positive-definite

assume that $x$ is a realization of a random matrix $X$, where rows of $X$ are independent and drawn from a multivariate normal distribution

\[X_i \sim N_q(\mu_i, \Sigma), i=1,2,\ldots, n\]

Test two pre-defined groups

\[H_{0j}: \bar\mu_{Gj} = \bar \mu_{G'j}\quad \text{versus} \quad H_{1j}: \bar \mu_{Gj} \neq \mu_{G'j}\]

This is equivalent to testing

\[H_{0j}: [\mu^T\nu]_j = 0 \quad \text{versus} \quad H_{1j}: [\mu^T\nu]_j \neq 0\,,\]

where $\nu$ is the $n$-vector with $i$-th element given by

\[\frac{1\{i\in G\}}{\vert G\vert} - \frac{1\{i\in G'\}}{\vert G'\vert}\]

under the normal assumption, the p-value is

\[1 - 2\Phi(\vert [\nu^Tx]_j \vert / (\Vert \nu\Vert_2^2\Sigma_{jj}))\]

Selective inference for the mean of a single feature

motivates the following conditional version of the two-sample Z-test to test $H_{0j}$

\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) )\]

computing is challenging, as

the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
the conditioning set $\{X\in \IR^{n\times q}: C(X)=C(x)\}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize

condition on additional events and compute

\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x))\]

$\{U(X) = U(x)\}$ does not sacrifice control of the selective Type I error rate

It follows from Thm 3.1 that computing the selective p-value involves determining the truncation set in (3.14). The next section focuses on understanding and computing this truncation set for hierarchical and k-means clustering.

4. The Truncation Set

The set $\hat S_j$ represents the values of $\phi$ for which the clustering algorithm $\calC$, when applied to $x’(\phi, j)$, yields the clustering output $C(x)$. Here $x’(\phi, j)$ can be interpreted as a perturbation to the observed data $x$.

library(CADET)
library(MASS)
n = 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q = 20
rho = 0
deltas = 0:9
pvals = matrix(0, ncol = q, nrow = length(deltas))
for (delta in deltas) {
  mu <- rbind(
    c(delta / 2, rep(0, q - 1)),
    c(rep(0, q - 1), delta / 2), c(rep(0, q - 1), delta / 2)
  )
  
  sig <- 1
  cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
  diag(cov_mat_sim) <- 1
  cov_mat_sim <- cov_mat_sim * sig
  
  X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]
  
  for (i in 1:q) {
    pvals[delta+1, i] = kmeans_inference_1f(X,
                                   k = 2,
                                   cluster_1 = 1,
                                   cluster_2 = 2,
                                   feat = i,
                                   iso = FALSE, sig = NULL, covMat = cov_mat_sim,
                                   iter.max = 15)$pval
  }
}
image(deltas, 1:q, log(pvals), ylab = "features")
grid(10, 20)

from the figure, it seems that it is too conservative when delta is smaller than 6
covMat should be passed, can we incorporate the estimation of covariance into the whole procedure?
each time only one feature is tested, we need to pass feat = i to specify which one to be tested, what if q is large？
see also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2

Simulation study

test the null hypothesis

\[\hat H_{0j}: \bar \mu_{\hat Gj} = \bar \mu_{\hat G'j }\quad \text{vs}\quad \hat H_{1j}: \bar \mu_{\hat Gj} \neq \bar \mu_{\hat G'j}\]

where

$\hat G$ and $\hat G’$ are a randomly-chosen pair of clusters from k-means or hierarchical clustering
$j$ is the randomly-chosen feature

why not test for all features

Selective Type I error rate

Conditional power and detection probability

conditional power: the probability of rejecting $\hat H_{0j}$ given that $\hat G$ and $\hat G’$ are true clusters
detection probability: how often $\hat G_1$ and $\hat G_2$ are true clusters

6 Applications to scRNA-seq Data

apply the proposed p-values to single-cell RNA-sequencing data collected by the Tabula Sapiens Consortium, which measures messenger RNA expression levels in each of 500,000 cells from 24 different tissues and organs

one unique feature of the Consortium data set is that experts annotated cell types consistently across the different tissues.

make use of the labeled cell types as the ground truth and use this information to demonstrate that the proposed p-values yield biologically reasonable results.

as per standard pre-processing techniques, first excluded cells with low numbers or total counts of expressed genes, as well as cells in which a large percentage of the expressed genes are mitochondrial

then normalized the transcripts for each cell by the total sum of counts in that cell, followed by a $\log_2(x+1)$ transformation.

to investigate the selective type I error rate in the absence of true clusters, first consider a ``no cluster’’ data set consisting of only CD4-positive, alpha-beta T cells after pre-processing
for the “cluster” data set, which encompasses memory B cells, natural killer cells, macrophages, and monocytes. applied k-means clustering to obtain four clusters

7 Discussion

the p-value can be extended to test for a difference in means between groups of features
empirically, conditioning on too much information results in a loss of power, leveraging recent developments in selective inference to compute the ``ideal’’ p-value could further enhance the power and usability of the proposal
extend the clustering algorithm and selection criteria for testing cluster means. e.g., other clustering methods widely used in scRNA-seq experiments such as graph clustering
the p-values can be used to derive selective confidence intervals for a difference in means for feature $j$

Published in categories Note

← previous next →

See all posts →

WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.