Test Difference for A Single Feature
Posted on (Update: )
This note is for Chen, Y. T., & Gao, L. L. (2023). Testing for a difference in means of a single feature after clustering (arXiv:2311.16375). arXiv.. Later the paper was published in the Biostatistics
- $x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
- $\mu\in\IR^{n\times q}$: unknown rows $\mu_i\in\IR^q$
- $\Sigma\in \IR^{q\times q}$: known, positive-definite
assume that $x$ is a realization of a random matrix $X$, where rows of $X$ are independent and drawn from a multivariate normal distribution
\[X_i \sim N_q(\mu_i, \Sigma), i=1,2,\ldots, n\]Test two pre-defined groups
\[H_{0j}: \bar\mu_{Gj} = \bar \mu_{G'j}\quad \text{versus} \quad H_{1j}: \bar \mu_{Gj} \neq \mu_{G'j}\]This is equivalent to testing
\[H_{0j}: [\mu^T\nu]_j = 0 \quad \text{versus} \quad H_{1j}: [\mu^T\nu]_j \neq 0\,,\]where $\nu$ is the $n$-vector with $i$-th element given by
\[\frac{1\{i\in G\}}{\vert G\vert} - \frac{1\{i\in G'\}}{\vert G'\vert}\]under the normal assumption, the p-value is
\[1 - 2\Phi(\vert [\nu^Tx]_j \vert / (\Vert \nu\Vert_2^2\Sigma_{jj}))\]Selective inference for the mean of a single feature
motivates the following conditional version of the two-sample Z-test to test $H_{0j}$
\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) )\]computing is challenging, as
- the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
- the conditioning set $\{X\in \IR^{n\times q}: C(X)=C(x)\}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize
condition on additional events and compute
\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x))\]$\{U(X) = U(x)\}$ does not sacrifice control of the selective Type I error rate
It follows from Thm 3.1 that computing the selective p-value involves determining the truncation set in (3.14). The next section focuses on understanding and computing this truncation set for hierarchical and k-means clustering.
4. The Truncation Set
The set $\hat S_j$ represents the values of $\phi$ for which the clustering algorithm $\calC$, when applied to $x’(\phi, j)$, yields the clustering output $C(x)$. Here $x’(\phi, j)$ can be interpreted as a perturbation to the observed data $x$.
library(CADET)
library(MASS)
n = 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q = 20
rho = 0
deltas = 0:9
pvals = matrix(0, ncol = q, nrow = length(deltas))
for (delta in deltas) {
mu <- rbind(
c(delta / 2, rep(0, q - 1)),
c(rep(0, q - 1), delta / 2), c(rep(0, q - 1), delta / 2)
)
sig <- 1
cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
diag(cov_mat_sim) <- 1
cov_mat_sim <- cov_mat_sim * sig
X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]
for (i in 1:q) {
pvals[delta+1, i] = kmeans_inference_1f(X,
k = 2,
cluster_1 = 1,
cluster_2 = 2,
feat = i,
iso = FALSE, sig = NULL, covMat = cov_mat_sim,
iter.max = 15)$pval
}
}
image(deltas, 1:q, log(pvals), ylab = "features")
grid(10, 20)
- from the figure, it seems that it is too conservative when
deltais smaller than 6 covMatshould be passed, can we incorporate the estimation of covariance into the whole procedure?- each time only one feature is tested, we need to pass
feat = ito specify which one to be tested, what ifqis large? - see also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2
Simulation study
test the null hypothesis
\[\hat H_{0j}: \bar \mu_{\hat Gj} = \bar \mu_{\hat G'j }\quad \text{vs}\quad \hat H_{1j}: \bar \mu_{\hat Gj} \neq \bar \mu_{\hat G'j}\]where
- $\hat G$ and $\hat G’$ are a randomly-chosen pair of clusters from k-means or hierarchical clustering
- $j$ is the randomly-chosen feature
why not test for all features
Selective Type I error rate
Conditional power and detection probability
- conditional power: the probability of rejecting $\hat H_{0j}$ given that $\hat G$ and $\hat G’$ are true clusters
- detection probability: how often $\hat G_1$ and $\hat G_2$ are true clusters
6 Applications to scRNA-seq Data
apply the proposed p-values to single-cell RNA-sequencing data collected by the Tabula Sapiens Consortium, which measures messenger RNA expression levels in each of 500,000 cells from 24 different tissues and organs
one unique feature of the Consortium data set is that experts annotated cell types consistently across the different tissues.
make use of the labeled cell types as the ground truth and use this information to demonstrate that the proposed p-values yield biologically reasonable results.
as per standard pre-processing techniques, first excluded cells with low numbers or total counts of expressed genes, as well as cells in which a large percentage of the expressed genes are mitochondrial
then normalized the transcripts for each cell by the total sum of counts in that cell, followed by a $\log_2(x+1)$ transformation.
- to investigate the selective type I error rate in the absence of true clusters, first consider a ``no cluster’’ data set consisting of only CD4-positive, alpha-beta T cells after pre-processing
- for the “cluster” data set, which encompasses memory B cells, natural killer cells, macrophages, and monocytes. applied k-means clustering to obtain four clusters
7 Discussion
- the p-value can be extended to test for a difference in means between groups of features
- empirically, conditioning on too much information results in a loss of power, leveraging recent developments in selective inference to compute the ``ideal’’ p-value could further enhance the power and usability of the proposal
- extend the clustering algorithm and selection criteria for testing cluster means. e.g., other clustering methods widely used in scRNA-seq experiments such as graph clustering
- the p-values can be used to derive selective confidence intervals for a difference in means for feature $j$