# Test Difference for A Single Feature

##### Posted on Apr 12, 2024
• $x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
• $\mu\in\IR^{n\times q}$: unknown mean matrix with rows $\mu_i\in\IR^q$
• $\Sigma\in \IR^{q\times q}$: known, positive-definite

Assume that $x$ is a realization of a random matrix $X$ whose rows are independent draws from a multivariate normal distribution:

$X_i \sim N_q(\mu_i, \Sigma), i=1,2,\ldots, n$

### Test two pre-defined groups

$H_{0j}: \bar\mu_{Gj} = \bar \mu_{G'j}\quad \text{versus} \quad H_{1j}: \bar \mu_{Gj} \neq \bar \mu_{G'j}$

This is equivalent to testing

$H_{0j}: [\mu^T\nu]_j = 0 \quad \text{versus} \quad H_{1j}: [\mu^T\nu]_j \neq 0\,,$

where $\nu$ is the $n$-vector with $i$-th element given by

$\frac{1\{i\in G\}}{\vert G\vert} - \frac{1\{i\in G'\}}{\vert G'\vert}$
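
As a quick numerical check of this equivalence, the base-R sketch below builds $\nu$ for two hypothetical groups and confirms that $[x^T\nu]_j$ is the difference in group means of feature $j$. The sizes, seed, and group indices are illustrative assumptions, not part of the setup above.

```r
# Illustrative sizes and groups (assumptions for this sketch)
set.seed(1)
n <- 10; q <- 3
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
G  <- 1:4     # row indices in group G
Gp <- 5:10    # row indices in group G'

# nu_i = 1{i in G}/|G| - 1{i in G'}/|G'|
nu <- numeric(n)
nu[G]  <- 1 / length(G)
nu[Gp] <- -1 / length(Gp)

# x^T nu is a q-vector; entry j is mean of feature j in G minus mean in G'
contrast  <- drop(t(x) %*% nu)
mean_diff <- colMeans(x[G, , drop = FALSE]) - colMeans(x[Gp, , drop = FALSE])

# squared norm of nu is 1/|G| + 1/|G'|
nu_norm_sq <- sum(nu^2)
```

The squared norm $\Vert\nu\Vert_2^2 = 1/\vert G\vert + 1/\vert G'\vert$ is what scales the variance of the contrast.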

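Under the normality assumption, $[X^T\nu]_j \sim N([\mu^T\nu]_j, \Vert\nu\Vert_2^2\Sigma_{jj})$, so the two-sided p-value is

$2\left[1 - \Phi\left(\vert [x^T\nu]_j \vert / (\Vert \nu\Vert_2\sqrt{\Sigma_{jj}})\right)\right]$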
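
A minimal base-R sketch of this two-sample Z-test, using the fact that $[X^T\nu]_j$ is normal with variance $\Vert\nu\Vert_2^2\Sigma_{jj}$. The helper name `ztest_feature` and the simulated data are hypothetical and only serve as an illustration.

```r
# Two-sided Z-test for a difference in means of feature j between
# pre-defined groups G and Gp, with known covariance Sigma.
# `ztest_feature` is a hypothetical helper, not part of any package.
ztest_feature <- function(x, G, Gp, j, Sigma) {
  nu <- numeric(nrow(x))
  nu[G]  <- 1 / length(G)
  nu[Gp] <- -1 / length(Gp)
  stat <- sum(x[, j] * nu)                # [x^T nu]_j
  se   <- sqrt(sum(nu^2) * Sigma[j, j])   # sd of [X^T nu]_j
  2 * pnorm(abs(stat) / se, lower.tail = FALSE)
}

set.seed(1)
n <- 100; q <- 5
Sigma <- diag(q)
x <- matrix(rnorm(n * q), n, q)   # start with all means equal to zero
x[1:50, 1] <- x[1:50, 1] + 2      # then shift feature 1 in the first group
p <- sapply(1:q, function(j) ztest_feature(x, 1:50, 51:100, j, Sigma))
```

Only feature 1 has a true mean difference here, so `p[1]` should be very small while the remaining p-values behave like uniform draws.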

### Selective inference for the mean of a single feature

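When the groups $\hat G$ and $\hat G'$ are instead estimated from the same data by a clustering algorithm $C$, and $\hat\nu$ is defined like $\nu$ but with $\hat G$ and $\hat G'$ in place of $G$ and $G'$, the Z-test above no longer controls the Type I error. This motivates the following conditional version of the two-sample Z-test to test $\hat H_{0j}: \bar\mu_{\hat Gj} = \bar\mu_{\hat G'j}$:

$P(\vert [X^T\hat\nu]_j \vert \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) )$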

Computing this p-value is challenging, as

• the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
• the conditioning set $\{X\in \IR^{n\times q}: C(X)=C(x)\}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize

To get around both issues, condition on additional events and compute

$P(\vert [X^T\hat\nu]_j \vert \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x))$

Conditioning on the additional event $\{U(X) = U(x)\}$ does not sacrifice control of the selective Type I error rate.
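
To see why a conditional test is needed at all, the following base-R sketch applies the unconditional Z-test to clusters estimated by `kmeans` on data with no true clusters. The simulation sizes and seed are assumptions chosen for illustration; this is not the procedure implemented in CADET.

```r
set.seed(1)
n <- 100; q <- 2; n_sim <- 200

naive_p <- replicate(n_sim, {
  x  <- matrix(rnorm(n * q), n, q)   # global null: mu = 0, Sigma = I
  cl <- kmeans(x, centers = 2, nstart = 5, iter.max = 50)$cluster
  # nu-hat built from the estimated clusters
  nu <- ifelse(cl == 1, 1 / sum(cl == 1), -1 / sum(cl == 2))
  z  <- sum(x[, 1] * nu) / sqrt(sum(nu^2))   # feature 1, Sigma_11 = 1
  2 * pnorm(abs(z), lower.tail = FALSE)
})

# proportion of naive rejections at level 0.05 under the null
naive_rejection_rate <- mean(naive_p < 0.05)
```

Because the clusters are chosen to be well separated, the naive rejection rate ends up far above the nominal 0.05 level even though every mean is zero.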

```r
library(CADET)
library(MASS)

n <- 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q <- 20
rho <- 0
deltas <- 0:9
# one row of p-values per delta, one column per feature
pvals <- matrix(0, ncol = q, nrow = length(deltas))

for (delta in deltas) {
  # cluster means: cluster 1 is shifted in feature 1, cluster 2 in feature q
  mu <- rbind(
    c(delta / 2, rep(0, q - 1)),
    c(rep(0, q - 1), delta / 2)
  )

  # known covariance: unit variances with common correlation rho
  sig <- 1
  cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
  diag(cov_mat_sim) <- 1
  cov_mat_sim <- cov_mat_sim * sig

  X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]

  # test each feature separately for a difference between k-means clusters 1 and 2
  for (i in 1:q) {
    pvals[delta + 1, i] <- kmeans_inference_1f(X,
      k = 2,
      cluster_1 = 1,
      cluster_2 = 2,
      feat = i,
      iso = FALSE, sig = NULL, covMat = cov_mat_sim,
      iter.max = 15
    )$pval
  }
}

image(deltas, 1:q, log(pvals), ylab = "features")
grid(10, 20)
```

• From the figure, the test seems too conservative when `delta` is smaller than 6.
• `covMat` has to be passed in; can the estimation of the covariance be incorporated into the whole procedure?
• Only one feature is tested per call, and `feat = i` specifies which one; what if $q$ is large?
• See also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2

## Simulation study

Test the null hypothesis

$\hat H_{0j}: \bar \mu_{\hat Gj} = \bar \mu_{\hat G'j}\quad \text{vs}\quad \hat H_{1j}: \bar \mu_{\hat Gj} \neq \bar \mu_{\hat G'j}$

where

• $\hat G$ and $\hat G'$ are a randomly-chosen pair of clusters from k-means or hierarchical clustering
• $j$ is a randomly-chosen feature

Why not test all features?

### Selective Type I error rate

### Conditional power and detection probability

• conditional power: the probability of rejecting $\hat H_{0j}$ given that $\hat G$ and $\hat G'$ are true clusters
• detection probability: how often $\hat G_1$ and $\hat G_2$ are true clusters
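
The detection probability can be estimated by simulation with base R alone. The sketch below is a simplified setting chosen for illustration: two true clusters that differ by `delta` in feature 1 only, identity covariance, and `kmeans` with `k = 2`. The helper names `same_partition` and `detect_prob` are hypothetical.

```r
set.seed(1)
n <- 100; q <- 20
true_clusters <- rep(1:2, each = n / 2)

# TRUE if the two-cluster partition `est` equals `truth` up to a label swap
same_partition <- function(est, truth) {
  tab <- table(est, truth)
  all(dim(tab) == 2) && sum(tab > 0) == 2
}

# Monte Carlo estimate of how often k-means recovers the true clusters
detect_prob <- function(delta, n_sim = 100) {
  mean(replicate(n_sim, {
    x <- matrix(rnorm(n * q), n, q)
    shifted <- true_clusters == 1
    x[shifted, 1] <- x[shifted, 1] + delta   # separate the clusters in feature 1
    est <- kmeans(x, centers = 2, nstart = 5, iter.max = 50)$cluster
    same_partition(est, true_clusters)
  }))
}

probs <- sapply(c(1, 10), detect_prob)   # weak vs. strong separation
```

Conditional power would then be the rejection rate of the selective test restricted to the simulated data sets where `same_partition` returns `TRUE`.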
