Test Difference for a Single Feature
This note is on Chen, Y. T., & Gao, L. L. (2023). Testing for a difference in means of a single feature after clustering. arXiv:2311.16375.
- $x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
- $\mu\in\IR^{n\times q}$: an unknown mean matrix with rows $\mu_i\in\IR^q$
- $\Sigma\in \IR^{q\times q}$: a known, positive-definite covariance matrix
Assume that $x$ is a realization of a random matrix $X$ whose rows are independent and drawn from a multivariate normal distribution:
\[X_i \sim N_q(\mu_i, \Sigma), \quad i=1,2,\ldots, n\]
Test two pre-defined groups
\[H_{0j}: \bar\mu_{Gj} = \bar \mu_{G'j}\quad \text{versus} \quad H_{1j}: \bar \mu_{Gj} \neq \bar\mu_{G'j}\]
This is equivalent to testing
\[H_{0j}: [\mu^T\nu]_j = 0 \quad \text{versus} \quad H_{1j}: [\mu^T\nu]_j \neq 0\,,\]where $\nu$ is the $n$-vector with $i$-th element given by
\[\nu_i = \frac{1\{i\in G\}}{\vert G\vert} - \frac{1\{i\in G'\}}{\vert G'\vert}\,.\]
Under the normal assumption, the two-sided p-value for $H_{0j}$ is
\[2\left(1 - \Phi\left(\frac{\vert [x^T\nu]_j \vert}{\Vert \nu\Vert_2\sqrt{\Sigma_{jj}}}\right)\right)\]
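As a sanity check, here is a minimal R sketch of this classical (non-selective) Z-test for pre-defined groups; the function name `naive_z_pval` and its arguments are mine, not part of any package.

```r
# Minimal sketch (mine, not part of CADET) of the classical two-sample Z-test
# for feature j with pre-defined groups G and G' given as row indices.
naive_z_pval <- function(x, G, Gprime, j, Sigma) {
  nu <- numeric(nrow(x))
  nu[G] <- 1 / length(G)
  nu[Gprime] <- -1 / length(Gprime)
  z <- sum(nu * x[, j]) / (sqrt(sum(nu^2)) * sqrt(Sigma[j, j]))  # [x^T nu]_j / sd
  2 * (1 - pnorm(abs(z)))  # two-sided p-value
}
```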
Selective inference for the mean of a single feature
Estimating the groups $\hat G$ and $\hat G'$ from the data with a clustering algorithm $C$ motivates the following conditional version of the two-sample Z-test for $\hat H_{0j}$:
\[P(\vert [X^T\hat\nu]_j \vert \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) )\]
Computing this p-value is challenging, because
- the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
- the conditioning set $\{X\in \IR^{n\times q}: C(X)=C(x)\}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize
The authors therefore condition on additional events and compute
\[P(\vert [X^T\hat\nu]_j \vert \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x))\,,\]
where conditioning on the extra event $\{U(X) = U(x)\}$ does not sacrifice control of the selective Type I error rate.
Their CADET package implements this test; the following simulation applies it to a two-cluster setting in which only the first and last features differ in mean between the true clusters:
```r
library(CADET)
library(MASS)

n <- 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q <- 20
rho <- 0
deltas <- 0:9
pvals <- matrix(0, ncol = q, nrow = length(deltas))
for (delta in deltas) {
  # cluster 1 shifts feature 1 by delta/2; cluster 2 shifts feature q by delta/2
  mu <- rbind(
    c(delta / 2, rep(0, q - 1)),
    c(rep(0, q - 1), delta / 2)
  )
  sig <- 1
  cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
  diag(cov_mat_sim) <- 1
  cov_mat_sim <- cov_mat_sim * sig
  X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]
  # selective p-value for every feature at this effect size
  for (i in 1:q) {
    pvals[delta + 1, i] <- kmeans_inference_1f(X,
      k = 2,
      cluster_1 = 1,
      cluster_2 = 2,
      feat = i,
      iso = FALSE, sig = NULL, covMat = cov_mat_sim,
      iter.max = 15
    )$pval
  }
}
image(deltas, 1:q, log(pvals), ylab = "features")
grid(10, 20)
```
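A quick sanity check I added (not in the original code): at `delta = 0` every feature is null, so the first row of `pvals` should look roughly Uniform(0, 1).

```r
# Empirical CDF of the q null p-values at delta = 0 versus the Uniform(0,1) CDF
plot(ecdf(pvals[1, ]), main = "Selective p-values at delta = 0", xlim = c(0, 1))
abline(0, 1, lty = 2)
```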
- from the figure, it seems that the test is too conservative when `delta` is smaller than 6
- `covMat` should be passed; can we incorporate the estimation of the covariance matrix into the whole procedure? (a rough workaround is sketched after this list)
- each call tests only one feature, so we need to pass `feat = i` to specify which one is tested; what if `q` is large? (see the same sketch)
- see also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2
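On the last two points, here is a rough sketch of what I have in mind, assuming we simply plug an estimated covariance matrix into `covMat` and loop over all features. This is my ad hoc workaround, not something CADET provides, and it is not guaranteed to preserve the selective Type I error rate.

```r
# Ad hoc workaround (mine): estimate Sigma by the pooled within-cluster
# covariance from a preliminary k-means fit, then test every feature and
# adjust the p-values. Both steps re-use the data, so the selective
# guarantee no longer formally applies. X is the last matrix simulated above.
km <- kmeans(X, centers = 2, iter.max = 15, nstart = 10)
resid_mat <- X - km$centers[km$cluster, ]
cov_hat <- crossprod(resid_mat) / (nrow(X) - 2)

pvals_all <- sapply(seq_len(ncol(X)), function(j) {
  kmeans_inference_1f(X,
    k = 2, cluster_1 = 1, cluster_2 = 2,
    feat = j, iso = FALSE, sig = NULL, covMat = cov_hat,
    iter.max = 15
  )$pval
})
p.adjust(pvals_all, method = "BH")  # adjust for testing all q features
```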
Simulation study
Test the null hypothesis
\[\hat H_{0j}: \bar \mu_{\hat Gj} = \bar \mu_{\hat G'j}\quad \text{vs}\quad \hat H_{1j}: \bar \mu_{\hat Gj} \neq \bar \mu_{\hat G'j}\,,\]
where
- $\hat G$ and $\hat G'$ are a randomly chosen pair of clusters from k-means or hierarchical clustering
- $j$ is a randomly chosen feature
Why not test for all features?
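My reading of this setup as a runnable sketch, using the k-means variant shown earlier (the paper's exact choices of $n$, $q$, and $k$, and its hierarchical-clustering variant, may differ):

```r
# Sketch of one replicate of the simulation: cluster into k = 3 groups,
# then test a randomly chosen pair of estimated clusters on a randomly
# chosen feature. Under this global null every data-dependent null holds.
library(CADET)
library(MASS)

n <- 150; q <- 10; k <- 3
X <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))

pair <- sort(sample(seq_len(k), 2))   # randomly chosen pair of clusters
j <- sample(seq_len(q), 1)            # randomly chosen feature

kmeans_inference_1f(X,
  k = k, cluster_1 = pair[1], cluster_2 = pair[2],
  feat = j, iso = FALSE, sig = NULL, covMat = diag(q),
  iter.max = 15
)$pval
```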
Selective Type I error rate
Conditional power and detection probability
- conditional power: the probability of rejecting $\hat H_{0j}$ given that $\hat G$ and $\hat G'$ are the true clusters
- detection probability: how often $\hat G$ and $\hat G'$ are the true clusters
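A Monte Carlo sketch of these two quantities for the two-cluster setting used earlier. This is my interpretation: "are the true clusters" is taken to mean that the estimated 2-means partition coincides with the true partition, which may differ from the paper's exact definition.

```r
# Estimate detection probability and conditional power by simulation (my sketch).
library(CADET)
library(MASS)

n <- 100; q <- 20; delta <- 5; alpha <- 0.05
true_clusters <- c(rep(1, n / 2), rep(2, n / 2))
mu <- rbind(c(delta / 2, rep(0, q - 1)),
            c(rep(0, q - 1), delta / 2))

n_rep <- 200
detected <- rejected <- logical(n_rep)
for (b in seq_len(n_rep)) {
  X <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q)) + mu[true_clusters, ]
  km <- kmeans(X, centers = 2, nstart = 10)
  # detection: the estimated partition equals the true one (up to label swap)
  detected[b] <- length(unique(paste(km$cluster, true_clusters))) == 2
  if (detected[b]) {
    # test the first feature, which truly differs between the clusters;
    # note that kmeans_inference_1f refits its own partition internally
    p <- kmeans_inference_1f(X, k = 2, cluster_1 = 1, cluster_2 = 2,
                             feat = 1, iso = FALSE, sig = NULL,
                             covMat = diag(q), iter.max = 15)$pval
    rejected[b] <- (p <= alpha)
  }
}
mean(detected)            # detection probability
mean(rejected[detected])  # conditional power
```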