WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Test Difference for A Single Feature

Posted on 0 Comments
Tags: Selective Inference, K-means, Hierarchical Clustering

This note is for Chen, Y. T., & Gao, L. L. (2023). Testing for a difference in means of a single feature after clustering (arXiv:2311.16375). arXiv.

  • $x\in \IR^{n\times q}$: a data matrix with $n$ observations of $q$ features
  • $\mu\in\IR^{n\times q}$: unknown rows $\mu_i\in\IR^q$
  • $\Sigma\in \IR^{q\times q}$: known, positive-definite

assume that $x$ is a realization of a random matrix $X$, where rows of $X$ are independent and drawn from a multivariate normal distribution

\[X_i \sim N_q(\mu_i, \Sigma), i=1,2,\ldots, n\]

motivates the following conditional version of the two-sample Z-test to test $H_{0j}$

\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x) )\]

computing is challenging, as

  • the conditional distribution of $[X^T\hat\nu]_j$ depends on unknown parameters that are left unspecified by $\hat H_{0j}$
  • the conditioning set ${X\in \IR^{n\times q}: C(X)=C(x)}$ depends on the clustering algorithm $C$ and could be highly non-trivial to characterize

condition on additional events and compute

\[P(\Vert [X^T\hat\nu]_j \ge \vert [x^T\hat\nu]_j \vert \mid C(X) = C(x), U(X) = U(x))\]

${U(X) = U(x)}$ does not sacrifice control of the selective Type I error rate


n = 100
true_clusters <- c(rep(1, 50), rep(2, 50))
q = 20
rho = 0
deltas = 0:9
pvals = matrix(0, ncol = q, nrow = length(deltas))
for (delta in deltas) {
  mu <- rbind(
    c(delta / 2, rep(0, q - 1)),
    c(rep(0, q - 1), delta / 2), c(rep(0, q - 1), delta / 2)
  sig <- 1
  cov_mat_sim <- matrix(rho, nrow = q, ncol = q)
  diag(cov_mat_sim) <- 1
  cov_mat_sim <- cov_mat_sim * sig
  X <- mvrnorm(n, mu = rep(0, q), Sigma = cov_mat_sim) + mu[true_clusters, ]
  for (i in 1:q) {
    pvals[delta+1, i] = kmeans_inference_1f(X,
                                   k = 2,
                                   cluster_1 = 1,
                                   cluster_2 = 2,
                                   feat = i,
                                   iso = FALSE, sig = NULL, covMat = cov_mat_sim,
                                   iter.max = 15)$pval
image(deltas, 1:q, log(pvals), ylab = "features")
grid(10, 20)


  • from the figure, it seems that it is too conservative when delta is smaller than 6
  • covMat should be passed, can we incorporate the estimation of covariance into the whole procedure?
  • each time only one feature is tested, we need to pass feat = i to specify which one to be tested, what if q is largeļ¼Ÿ
  • see also the issue I raised in their GitHub repo: https://github.com/yiqunchen/CADET/issues/2

Published in categories Note