# GhostKnockoffs: Only Summary Statistics


GhostKnockoffs: the main idea is to generate knockoff Z-scores directly, without creating knockoff variables. The method operates with only $X^\top Y$ and $\Vert Y\Vert_2^2$, where $X$ is the $n\times p$ matrix of covariates and $Y$ is the $n\times 1$ response vector.

Goal: extend the family of GhostKnockoffs methods to incorporate feature importance statistics obtained from penalized regression.

- Setting: the empirical covariance of the covariate-response pair $(X, Y)$ is available, i.e., $X^\top X$, $X^\top Y$, and $\Vert Y\Vert_2^2$ are available along with the sample size $n$. This yields a substantial power improvement over the method of He et al. (2022), due to far more effective test statistics.

## Model-X Knockoffs and GhostKnockoffs

conditional independence hypotheses $H_0^j: X_j\perp\!\!\!\perp Y\mid X_{-j}$ for $1\le j\le p$

$n$ i.i.d. samples $(X_i, Y_i), 1\le i\le n$

two conditions:

- exchangeability: $(X_j, \tilde X_j, X_{-j}, \tilde X_{-j})\overset{d}{=}(\tilde X_j, X_j, X_{-j}, \tilde X_{-j}),\forall 1\le j\le p$
- conditional independence: $\tilde X\perp\!\!\!\perp Y\mid X$

define feature importance statistics $W = w([X, \tilde X], Y)\in \mathbb{R}^p$ to be any function of $X, \tilde X, Y$ such that the flip-sign property holds:

\[w_j([X, \tilde X]_{\text{swap}(j)}, Y) = -w_j([X,\tilde X], Y)\]

Common choices include

- marginal correlation difference statistic: $W_j = \vert X_j^\top Y\vert - \vert \tilde X_j^\top Y\vert$
- lasso coefficient difference statistic: $W_j = \vert \hat\beta_j(\lambda_{CV})\vert - \vert \hat \beta_{j+p}(\lambda_{CV})\vert$
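As a quick numerical sanity check, the flip-sign property of the marginal correlation difference statistic can be verified directly. The "knockoff" copy below is just an independent Gaussian draw for illustration (not a valid knockoff construction); the flip-sign property holds for any such pair by the form of the statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Toy data: X, an illustrative "knockoff" copy X_tilde, and a response Y.
X = rng.standard_normal((n, p))
X_tilde = rng.standard_normal((n, p))
Y = X[:, 0] + 0.1 * rng.standard_normal(n)

def marginal_corr_diff(X, X_tilde, Y):
    """Marginal correlation difference statistic W_j = |X_j' Y| - |X_tilde_j' Y|."""
    return np.abs(X.T @ Y) - np.abs(X_tilde.T @ Y)

W = marginal_corr_diff(X, X_tilde, Y)

# Flip-sign check: swapping column j of X and X_tilde negates W_j only.
j = 2
Xs, Xts = X.copy(), X_tilde.copy()
Xs[:, j], Xts[:, j] = X_tilde[:, j].copy(), X[:, j].copy()
W_swap = marginal_corr_diff(Xs, Xts, Y)
assert np.allclose(W_swap[j], -W[j])
```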

### Gaussian knockoff sampler

\[\tilde X = XP + EV^{1/2}\]

where

- $E$: $n\times p$ i.i.d. standard Gaussian entries
- $P = I-\Sigma^{-1}D$, $V = 2D-D\Sigma^{-1}D$, where $D = \operatorname{diag}(s)$
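A minimal sketch of the Gaussian knockoff sampler, assuming (for illustration) a known equicorrelated $\Sigma$ and the equicorrelated choice $s_j = \min(1,\, 2\lambda_{\min}(\Sigma))$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 4

# Assumption for illustration: X ~ N(0, Sigma) with known equicorrelated Sigma.
rho = 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Equicorrelated choice s_j = min(1, 2 * lambda_min(Sigma)).
s = np.full(p, min(1.0, 2 * np.linalg.eigvalsh(Sigma).min()))
D = np.diag(s)

Sigma_inv = np.linalg.inv(Sigma)
P = np.eye(p) - Sigma_inv @ D          # P = I - Sigma^{-1} D
V = 2 * D - D @ Sigma_inv @ D          # V = 2D - D Sigma^{-1} D

# Matrix square root of V via eigendecomposition (V is PSD for valid s).
w, U = np.linalg.eigh(V)
V_half = U @ np.diag(np.sqrt(np.clip(w, 0, None))) @ U.T

E = rng.standard_normal((n, p))        # i.i.d. standard Gaussian entries
X_tilde = X @ P + E @ V_half           # tilde X = X P + E V^{1/2}

# Sanity checks: Cov(X_tilde) ~ Sigma and Cov(X, X_tilde) ~ Sigma - D.
emp_cov = np.cov(X_tilde.T)
emp_cross = X.T @ X_tilde / n
```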

## GhostKnockoffs with marginal correlation difference statistic

sample the knockoff Z-score $\tilde Z_s$ from $X^\top Y$ and $\Vert Y\Vert_2^2$ directly, in a way such that

\[\tilde Z_s\mid X, Y\overset{d}{=} \tilde X^\top Y\mid X, Y\]

where $\tilde X=G(X,\Sigma)$ is the knockoff matrix generated by the Gaussian knockoff sampler.

Then take $W_j = \vert (Z_s)_j\vert - \vert (\tilde Z_s)_j\vert$, where $Z_s = X^\top Y$.

\[\tilde Z_s = P^\top X^\top Y + \Vert Y\Vert_2 Z\]

where $Z\sim N(0, V)$ is independent of $X$ and $Y$.
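A sketch of this summary-statistics sampler, assuming $\Sigma$ is known (equicorrelated here for illustration) and using made-up values for $X^\top Y$ and $\Vert Y\Vert_2$ in place of real summary data:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4

# Assumption for illustration: a known equicorrelated Sigma, from which
# P and V are built exactly as in the Gaussian knockoff sampler.
rho = 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
s = np.full(p, min(1.0, 2 * np.linalg.eigvalsh(Sigma).min()))
D = np.diag(s)
Sigma_inv = np.linalg.inv(Sigma)
P = np.eye(p) - Sigma_inv @ D
V = 2 * D - D @ Sigma_inv @ D

def ghost_knockoff_zscores(XtY, Y_norm, P, V, rng):
    """Sample tilde Z_s = P' X'Y + ||Y||_2 Z with Z ~ N(0, V)."""
    Z = rng.multivariate_normal(np.zeros(len(XtY)), V)
    return P.T @ XtY + Y_norm * Z

# Hypothetical summary statistics (in practice these come from data).
XtY = np.array([5.0, -2.0, 0.5, 1.0])   # X'Y
Y_norm = 10.0                           # ||Y||_2

Z_s_tilde = ghost_knockoff_zscores(XtY, Y_norm, P, V, rng)
W = np.abs(XtY) - np.abs(Z_s_tilde)     # marginal correlation difference
```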

## GhostKnockoffs with Penalized Regression: Known Empirical Covariance

idea: sample the feature importance statistics $T(X, \tilde X, Y)$ from their joint distribution using the Gram matrix of $[X, Y]$ only.

## GhostKnockoffs with the square-root Lasso

Use the square-root Lasso, whose tuning parameter can be chosen conveniently (without estimating the noise level):

\[\hat\beta = \operatorname*{arg\,min}_{\beta}\, \Vert Y- [X\ \tilde X]\beta\Vert_2 + \lambda \Vert \beta\Vert_1\]

and a good choice of $\lambda$ is given by

\[\lambda = \kappa\, \mathbb{E}\left[\frac{\Vert [X\ \tilde X]^\top \varepsilon\Vert_\infty}{\Vert\varepsilon\Vert_2} \,\middle|\, X, \tilde X\right]\]

## GhostKnockoffs with the Lasso-max

\[W_j = \sup\{\lambda:\hat\beta_j(\lambda) \neq 0\} - \sup\{\lambda: \hat \beta_{j+p}(\lambda)\neq 0\}\]

## He et al. (2021)

- $X_i = (X_{i1},\ldots, X_{iq})$: vector of covariates
- $G_i = (G_{i1},\ldots, G_{ip})$: vector of genotypes
- $Y_i$: outcome (phenotype)

the per-sample score statistic can be written as $G_i^\top Y_i$. The z-scores aggregating all samples can be written as

\[Z_{score} = \frac{1}{\sqrt n}G^\top Y\]

and the knockoff counterpart of $Z_{score}$ can be generated directly by

\[\tilde Z_{score} = PZ_{score} + E,\]where $E\sim N(0, V)$.

With $M$ knockoff copies, whose importance scores are $T^1,\ldots,T^M$ (and $T$ for the original feature), define a W-statistic that quantifies the magnitude of the effect on the outcome as

\[W = (T-\text{median}_{1\le m\le M} T^m) I_{T\ge \max_{1\le m\le M} T^m}\,.\]
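The median/max construction above can be sketched with toy scores; the `T` and `T_m` values below are made up for illustration:

```python
import numpy as np

# Assumed toy inputs: T holds the original importance score per feature,
# T_m stacks the scores of M = 3 knockoff copies (rows = copies).
T = np.array([3.0, 0.2, 1.5])
T_m = np.array([[0.5, 0.3, 2.0],
                [0.4, 0.1, 1.8],
                [0.6, 0.2, 1.9]])

# W_j = (T_j - median_m T_j^m) * 1{T_j >= max_m T_j^m}:
# a feature only scores when it beats all of its knockoff copies.
med = np.median(T_m, axis=0)
W = (T - med) * (T >= T_m.max(axis=0))
# W = [2.5, 0.0, 0.0]: only feature 0 exceeds all its knockoffs.
```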