Test Difference for A Single Feature
Posted on 0 Comments
This note is for Chen, Y. T., & Gao, L. L. (2023). Testing for a difference in means of a single feature after clustering (arXiv:2311.16375). arXiv.
- $x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
- $\mu\in\IR^{n\times q}$: unknown rows $\mu_i\in\IR^q$
- $\Sigma\in \IR^{q\times q}$: known, positive-definite
assume that $x$ is a realization of a random matrix $X$, where rows of $X$ are independent and drawn from a multivariate normal distribution
\[X_i \sim N_q(\mu_i, \Sigma), i=1,2,\ldots, n\]motivates the following conditional version of the two-sample Z-test to test $H_{0j}$
\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) )\]computing is challenging, as
- the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
- the conditioning set ${X\in \IR^{n\times q}: C(X)=C(x)}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize
condition on additional events and compute
\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x))\]${U(X) = U(x)}$ does not sacrifice control of the selective Type I error rate
library(CADET)
library(MASS)
n = 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q = 20
rho = 0
deltas = 0:9
pvals = matrix(0, ncol = q, nrow = length(deltas))
for (delta in deltas) {
mu <- rbind(
c(delta / 2, rep(0, q - 1)),
c(rep(0, q - 1), delta / 2), c(rep(0, q - 1), delta / 2)
)
sig <- 1
cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
diag(cov_mat_sim) <- 1
cov_mat_sim <- cov_mat_sim * sig
X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]
for (i in 1:q) {
pvals[delta+1, i] = kmeans_inference_1f(X,
k = 2,
cluster_1 = 1,
cluster_2 = 2,
feat = i,
iso = FALSE, sig = NULL, covMat = cov_mat_sim,
iter.max = 15)$pval
}
}
image(deltas, 1:q, log(pvals), ylab = "features")
grid(10, 20)
- from the figure, it seems that it is too conservative when
delta
is smaller than 6 covMat
should be passed, can we incorporate the estimation of covariance into the whole procedure?- each time only one feature is tested, we need to pass
feat = i
to specify which one to be tested, what ifq
is large? - see also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2