WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

GhostKnockoffs: Only Summary Statistics

Tags: Knockoff, Lasso, GWAS

This note is for Chen, Z., He, Z., Chu, B. B., Gu, J., Morrison, T., Sabatti, C., & Candès, E. (2024). Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression (arXiv:2402.12724). arXiv.

The main idea of GhostKnockoffs is to generate knockoff Z-scores directly, without ever creating knockoff variables: the method operates on only $X^\top Y$ and $\Vert Y\Vert_2^2$, where $X$ is the $n\times p$ matrix of covariates and $Y$ is the $n\times 1$ response vector.

The paper extends the family of GhostKnockoffs methods to incorporate feature importance statistics obtained from penalized regression.

  • setting: the empirical covariance of the covariate-response pair $(X, Y)$ is available, i.e., $X^\top X$, $X^\top Y$, $\Vert Y\Vert_2^2$ are available along with the sample size $n$. This yields a substantial power improvement over the method of He et al. (2022), due to far more effective test statistics; the identity below shows why these quantities suffice.
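These summary statistics are enough to evaluate any penalized least-squares objective, since the residual sum of squares expands as

\[\Vert Y - X\beta\Vert_2^2 = \Vert Y\Vert_2^2 - 2\beta^\top X^\top Y + \beta^\top X^\top X\,\beta,\]

so, e.g., the lasso objective at any $\beta$ is computable from $X^\top X$, $X^\top Y$ and $\Vert Y\Vert_2^2$ alone.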

Model-X Knockoffs and GhostKnockoffs

conditional independence hypotheses $H_0^j: X_j\ind Y\mid X_{-j}$ for $1\le j\le p$

$n$ i.i.d. samples $(X_i, Y_i), 1\le i\le n$

two conditions:

  • exchangeability: $(X_j, \tilde X_j, X_{-j}, \tilde X_{-j})\overset{d}{=}(\tilde X_j, X_j, X_{-j}, \tilde X_{-j}),\forall 1\le j\le p$
  • conditional independence: $\tilde X\ind Y\mid X$

define feature importance statistics $W = w([X, \tilde X], Y)\in \IR^p$ to be any function of $X, \tilde X, Y$ such that the following flip-sign property holds:

\[w_j([X, \tilde X]_{\text{swap}(j)}, Y) = -w_j([X,\tilde X], Y)\]

common choices include

  • marginal correlation difference statistic: $W_j = \vert X_j^\top Y\vert - \vert \tilde X_j^\top Y\vert$ (a sketch follows this list)
  • lasso coefficient difference statistic: $W_j = \vert \hat\beta_j(\lambda_{CV})\vert - \vert \hat \beta_{j+p}(\lambda_{CV})\vert$, where $\hat\beta(\lambda)$ is the lasso fit on the augmented design $[X\ \tilde X]$ and $\lambda_{CV}$ is chosen by cross-validation
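As a toy illustration (mine, not from the paper), a NumPy sketch of the first statistic; `X`, `X_tilde`, `Y` are arrays of shapes $(n, p)$, $(n, p)$, $(n,)$:

```python
import numpy as np

def marginal_corr_diff(X, X_tilde, Y):
    """Marginal correlation difference statistic W_j = |X_j' Y| - |Xtilde_j' Y|.

    Swapping column j of X with column j of X_tilde flips the sign of W_j,
    which is exactly the flip-sign property above.
    """
    return np.abs(X.T @ Y) - np.abs(X_tilde.T @ Y)
```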

Gaussian knockoff sampler

\[\tilde X = XP + EV^{1/2}\]

where

  • $E$: $n\times p$ i.i.d. standard Gaussian entries
  • $P = I-\Sigma^{-1}D$, $V = 2D-D\Sigma^{-1}D$, where $D = \diag(s)$ and $s\ge 0$ is chosen so that $V$ is positive semidefinite (a sketch of the sampler follows)
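A minimal NumPy sketch of this sampler; it assumes a valid $s$ (e.g., from the equi-correlated construction) is supplied, and the function name is mine:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng=None):
    """Gaussian knockoff sampler: Xtilde = X P + E V^{1/2}.

    P = I - Sigma^{-1} D and V = 2 D - D Sigma^{-1} D with D = diag(s).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    D = np.diag(s)
    Sinv_D = np.linalg.solve(Sigma, D)        # Sigma^{-1} D
    P = np.eye(p) - Sinv_D
    V = 2 * D - D @ Sinv_D                    # 2 D - D Sigma^{-1} D
    # symmetric square root of V (clip tiny negative eigenvalues)
    w, U = np.linalg.eigh(V)
    V_half = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T
    E = rng.standard_normal((n, p))           # i.i.d. N(0, 1) entries
    return X @ P + E @ V_half
```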

GhostKnockoffs with marginal correlation difference statistic

sample the knockoff Z-score $\tilde Z_s$ from $X^\top Y$ and $\Vert Y\Vert_2^2$ directly, in a way such that

\[\tilde Z_s\mid X, Y\overset{d}{=} \tilde X^\top Y\mid X, Y\]

where $\tilde X=G(X,\Sigma)$ is the knockoff matrix generated by the Gaussian knockoff sampler.

Then take $W_j = \vert Z_{s,j}\vert - \vert \tilde Z_{s,j}\vert$, where $Z_s = X^\top Y$. The required conditional distribution is achieved by

\[\tilde Z_s = P^\top X^\top Y + \Vert Y\Vert_2 Z\]

where $Z\sim N(0, V)$ is independent of $X$ and $Y$
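A sketch of this sampler from summary statistics only, reusing `P` and `V` from the sketch above (argument names are mine):

```python
import numpy as np

def ghost_knockoff_zscores(XtY, Y_sqnorm, P, V, rng=None):
    """Sample Ztilde_s = P' (X'Y) + ||Y||_2 * Z with Z ~ N(0, V).

    Needs only the summary statistics X'Y and ||Y||_2^2 (plus P, V from Sigma).
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.multivariate_normal(np.zeros(len(XtY)), V)
    return P.T @ XtY + np.sqrt(Y_sqnorm) * Z
```

Taking `np.abs(XtY) - np.abs(ghost_knockoff_zscores(...))` then reproduces the marginal correlation difference statistic above.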

GhostKnockoffs with Penalized Regression: Known Empirical Covariance

the idea: sample feature importance statistics with the same joint distribution as $T(X, \tilde X, Y)$, using only the Gram matrix of $[X\ Y]$, i.e., $X^\top X$, $X^\top Y$, and $\Vert Y\Vert_2^2$.

GhostKnockoffs with the square-root Lasso

the first instance uses the square-root Lasso, for which a reasonable tuning parameter can be chosen without estimating the noise level:

\[\hat\beta(\lambda) = \argmin_\beta\, \Vert Y- [X\ \tilde X]\beta\Vert_2 + \lambda \Vert \beta\Vert_1\]

and a good choice of $\lambda$ is given by

\[\lambda = \kappa\, \bbE\left[\frac{\Vert [X\ \tilde X]^\top \varepsilon\Vert_\infty}{\Vert\varepsilon\Vert_2} \,\middle|\, X, \tilde X\right],\]

where $\varepsilon\sim N(0, I_n)$ and $\kappa$ is a scalar multiplier.
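A Monte Carlo sketch of this rule, written as if the augmented design $[X\ \tilde X]$ were available (the paper's point is that the same expectation can also be simulated from the Gram matrix alone); `kappa=0.5` is an arbitrary placeholder, not the paper's choice:

```python
import numpy as np

def sqrt_lasso_lambda(X_aug, kappa=0.5, n_mc=500, rng=None):
    """Monte Carlo estimate of kappa * E[ ||X_aug' eps||_inf / ||eps||_2 ]
    with eps ~ N(0, I_n), conditional on the augmented design X_aug = [X, Xtilde]."""
    rng = np.random.default_rng() if rng is None else rng
    n = X_aug.shape[0]
    ratios = np.empty(n_mc)
    for i in range(n_mc):
        eps = rng.standard_normal(n)
        ratios[i] = np.abs(X_aug.T @ eps).max() / np.linalg.norm(eps)
    return kappa * ratios.mean()
```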

GhostKnockoffs with the Lasso-max

here $\hat\beta(\lambda)$ is the lasso path on the augmented design $[X\ \tilde X]$, and $W_j$ compares when a variable and its knockoff first enter the path:

\[W_j = \sup\{\lambda:\hat\beta_j(\lambda) \neq 0\} - \sup\{\lambda: \hat \beta_{j+p}(\lambda)\neq 0\}\]
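A sketch of this statistic via scikit-learn's exact LARS/lasso path (function name mine):

```python
import numpy as np
from sklearn.linear_model import lars_path

def lasso_max_stats(X_aug, Y):
    """Lasso-max statistic from the LARS/lasso path knots:
    W_j = sup{lambda: beta_j(lambda) != 0} - sup{lambda: beta_{j+p}(lambda) != 0}."""
    p = X_aug.shape[1] // 2
    alphas, _, coefs = lars_path(X_aug, Y, method="lasso")   # alphas decreasing
    nonzero = np.abs(coefs) > 1e-12                          # shape (2p, n_knots)
    first_nz = np.argmax(nonzero, axis=1)                    # first knot with beta_j != 0
    # a coefficient is exactly 0 at the knot where it enters, so the entry
    # point is the preceding knot; features never selected get entry time 0
    entry_idx = np.maximum(first_nz - 1, 0)
    entry = np.where(nonzero.any(axis=1), alphas[entry_idx], 0.0)
    return entry[:p] - entry[p:]
```

scikit-learn also exposes `lars_path_gram`, which computes the same path from $X^\top X$ and $X^\top Y$ alone, which matches the summary-statistics setting here.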

He et al. (2021)

  • $X_i = (X_{i1},\ldots, X_{iq})$: vector of covariates
  • $G_i = (G_{i1},\ldots, G_{ip})$: vector of genotypes
  • $Y_i$: phenotype (outcome)

\[g(\mu_i) = \alpha_0 + \alpha^\top X_i + \beta^\top G_i\]

where $\mu_i = \bbE[Y_i\mid X_i, G_i]$ and $g$ is a link function.

the per-sample score statistic is $G_iY_i$ (writing $G_i$ as a column vector); aggregating over samples, the vector of Z-scores can be written as

\[Z_{score} = \frac{1}{\sqrt n}G^\top Y\]

the knockoff counterpart for $Z_{score}$ can be directly generated by

\[\tilde Z_{score} = P^\top Z_{score} + E,\]

where $E\sim N(0, V)$, with $P$ and $V$ as in the Gaussian knockoff sampler above. The multiple-knockoff filter below uses $M$ knockoff copies $\tilde Z^1,\ldots,\tilde Z^M$, generated jointly via the corresponding multi-knockoff construction.

with $M$ knockoff copies, define a W-statistic that quantifies the magnitude of the effect on the outcome as

\[W_j = \Big(T_j-\underset{1\le m\le M}{\text{median}}\, T_j^m\Big)\, I\Big\{T_j\ge \max_{1\le m\le M} T_j^m\Big\}\,,\]

where $T_j$ and $T_j^m$ are feature importance scores (e.g., squared Z-scores) for variable $j$ and its $m$-th knockoff copy.
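A sketch of this filter statistic; `T` holds the importance scores of the original variables and `T_knock` those of the $M$ knockoff copies (names mine):

```python
import numpy as np

def multi_knockoff_W(T, T_knock):
    """W_j = (T_j - median_m T_j^m) * 1{T_j >= max_m T_j^m}.

    T: (p,) importance scores of the original variables;
    T_knock: (M, p) scores of the M knockoff copies.
    """
    med = np.median(T_knock, axis=0)
    return (T - med) * (T >= T_knock.max(axis=0))
```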
