Model-X Knockoffs
the paper addresses the selection problem by considering a very general conditional model, where the response $Y$ can depend in an arbitrary fashion on the covariates $X_1,\ldots, X_p$.
the only restriction is that the observations $(X_{i1},\ldots, X_{ip}, Y_i)$ are independently and identically distributed (i.i.d.)
the model is simply
\[(X_{i1},\ldots, X_{ip}, Y_i)\sim_{i.i.d.} F_{XY}, \quad i = 1,\ldots, n\]for some arbitrary $(p+1)$-dimensional joint distribution $F_{XY}$.
assume no knowledge of the conditional distribution of $Y\mid X_1,\ldots, X_p$, but assume the joint distribution of the covariates is known, and denote it by $F_X$.
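as a toy illustration of this setting (my own sketch, not from the paper), one could draw covariates from a known Gaussian $F_X$ and let $Y$ depend on only a few of them through an arbitrary, nonlinear conditional model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100

# known covariate distribution F_X: an AR(1) Gaussian, purely as an illustrative choice
rho = 0.3
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# unknown, arbitrary conditional model Y | X: only the first few covariates matter
logits = 2.0 * X[:, 0] - 1.5 * np.sin(X[:, 1]) + X[:, 2] * X[:, 3]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
```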
the minimal such set $\cS$ is known as the Markov blanket or Markov boundary for $Y$ in the literature on graphical models
the present paper places no restriction on $p$ relative to $n$, in sharp contrast to the original knockoffs work, which required the low-dimensional setting of $n \ge p$
the conceptual difference is that the original knockoff procedure treats the $X_{ij}$ as fixed and relies on specific stochastic properties of the linear model, precluding consideration of $p > n$ or nonlinear models.
relationship with the classical setup
the usual setup for inference in conditional models is to assume a strong parametric model for the response conditional on the covariates, such as a homoscedastic linear model, but to assume as little as possible about, or even condition on, the covariates
the paper does the exact opposite: it assumes we know everything about the covariate distribution but nothing about the conditional distribution $Y\mid X_1,\ldots, X_p$
relationship with work on inference after selection
these works differ from the paper in a number of ways, so the authors largely see them as complementary activities
- the post-selection inference works presuppose a selection procedure has been chosen (for reasons that may have nothing to do with controlling Type I error) and then compute p-values for the selected variables, taking into account the selection step. In contrast, MX knockoffs is by itself a selection procedure that controls Type I error.
- inference after selection relies heavily on parametric assumptions about the conditional distribution, namely that $Y\mid X_1, \ldots, X_p \sim N(\mu(X_1,\ldots, X_p), \sigma^2)$, making it unclear how to extend it to the more general setting
- in the selection step, a subset of size $m\le n$ of the original $p$ covariates is selected, say $X_{j_1},\ldots, X_{j_m}$, and the objects of inference are the coefficients of the $X_{j_k}$'s in the projection of $\mu$ onto the linear subspace spanned by the $n\times m$ matrix of observed values of these $X_{j_k}$'s. That is, the $k$-th null hypothesis is that the aforementioned coefficient on $X_{j_k}$ is zero. In contrast, for MX in the same setting, the $j$-th null hypothesis would be that $\mu$ does not depend on $X_j$, and there would be $p$ such null hypotheses, one for each of the original features.
obstacles to obtaining p-values
the requirement of having valid p-values is quite constraining for general conditional modeling problems
regression p-value approximations
- in the low-dimensional setting ($n\ge p$), for GLMs one must resort to asymptotic p-values derived from MLE theory, which can be far from valid in practice
- in high-dimensional ($n < p$) GLMs, it is not clear how to get p-values at all
marginal testing
marginal p-values: p-values for testing unconditional (or marginal) independence between $Y$ and $X_j$
the class of conditional test statistics is far richer than that of marginal ones, and includes the most powerful statistical inference and prediction methodology available.
for example, in compressed sensing, the signal recovery guarantees for state-of-the-art $\ell_1$-based (joint) algorithms are stronger than any guarantees possible with marginal methods.
some specific drawbacks of using marginal p-values are
- power loss
- interpretability
- dependent p-values
Methodology
Model-X knockoffs
Model-X knockoffs for the family of random variables $X = (X_1,\ldots, X_p)$ are a new family of random variables $\tilde X = (\tilde X_1,\ldots, \tilde X_p)$ constructed to satisfy the following two properties:
- for any subset $S\subset \{1,\ldots, p\}$, $(X,\tilde X)_{swap(S)}\overset{d}{=}(X, \tilde X)$
- $\tilde X\ind Y\mid X$ if there is a response $Y$; this is guaranteed if $\tilde X$ is constructed without looking at $Y$
As an example of MX knockoffs, suppose that $X\sim N(0, \Sigma)$. Then a joint distribution obeying the exchangeability property is
\[(X, \tilde X)\sim N(0, G), \qquad G=\begin{bmatrix} \Sigma & \Sigma - \diag\{s\}\\ \Sigma - \diag\{s\} & \Sigma \end{bmatrix}\]where the vector $s$ is chosen so that $G$ is positive semidefinite. One way to construct the knockoff vector $\tilde X$ is to sample it from the conditional distribution
\[\tilde X\mid X\overset{d}{=} N(\mu, V)\]where $\mu$ and $V$ are given by classical regression formulas, namely
\[\mu = X-X\Sigma^{-1}\diag\{s\}\qquad V = 2\diag\{s\} - \diag\{s\}\Sigma^{-1}\diag\{s\}\]
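a minimal numpy sketch of this Gaussian construction, assuming $\Sigma$ and a valid vector $s$ (e.g. from the equicorrelated or SDP constructions described below) are already in hand:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng=None):
    """Sample model-X knockoffs for the rows of X, assuming X ~ N(0, Sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)        # Sigma^{-1} diag{s}
    mu = X - X @ Sigma_inv_S                       # conditional mean of X_tilde given X
    V = 2 * S - S @ Sigma_inv_S                    # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))  # tiny jitter for numerical safety
    return mu + rng.standard_normal((n, p)) @ L.T
```

a quick sanity check of the exchangeability property is that the empirical covariance of `np.hstack([X, gaussian_knockoffs(X, Sigma, s)])` should be close to the matrix $G$ above.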
Feature statistics

compute statistics $W_j$ for each $j\in \{1,\ldots, p\}$, a large positive value of $W_j$ providing evidence against the hypothesis that $X_j$ is null. This statistic depends on the response and the original variables but also on the knockoffs, that is
\[W_j = w_j([X, \tilde X], y)\]we impose a flip-sign property, which means that swapping the $j$-th variable with its knockoff has the effect of changing the sign of $W_j$. Formally, if $[X,\tilde X]_{swap(S)}$ is the matrix obtained by swapping the original and knockoff columns for each $j\in S$, then
\[w_j([X, \tilde X]_{swap(S)}, y) = \begin{cases} w_j([X, \tilde X], y) & j\not\in S\\ -w_j([X, \tilde X], y) & j\in S \end{cases}\]
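as an illustration of a statistic satisfying the flip-sign property, here is a sketch of the lasso coefficient-difference statistic $W_j = \vert Z_j\vert - \vert \tilde Z_j\vert$, where $Z_j$ and $\tilde Z_j$ are the lasso coefficients of $X_j$ and $\tilde X_j$ in a fit of $y$ on the augmented design (a cross-validated lasso stands in here for whatever fitting procedure one prefers):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_coefdiff_stats(X, X_tilde, y):
    """W_j = |Z_j| - |Z_tilde_j|, with Z, Z_tilde the lasso coefficients on [X, X_tilde]."""
    p = X.shape[1]
    fit = LassoCV(cv=5).fit(np.hstack([X, X_tilde]), y)
    Z, Z_tilde = np.abs(fit.coef_[:p]), np.abs(fit.coef_[p:])
    # swapping column j of X with its knockoff simply exchanges Z_j and Z_tilde_j,
    # which flips the sign of W_j, as the flip-sign property requires
    return Z - Z_tilde
```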
Constructing model-X knockoffs

Exact constructions
Approximate constructions: second-order model-X knockoffs
the equicorrelated construction uses
\[s_j^{EQ} = 2\lambda_\min(\Sigma) \land 1 \quad \text{for all } j\]which minimizes the correlation between variable-knockoff pairs subject to the constraint that all such pairs must have the same correlation
the semidefinite program (SDP) construction solves the convex program
\[\min \sum_j\vert 1 - s_j^{SDP} \vert\qquad\text{subject to } s_j^{SDP}\ge 0, \diag\{s^{SDP}\} \preceq 2\Sigma\]these constructions run into new difficulties when $p$ is large
generalize the two knockoff constructions by the following two-step procedure, which is called the approximate semidefinite program (ASDP) construction
Step 1: choose an approximation $\Sigma_{approx}$ of $\Sigma$ and solve
\[\min \sum_j \vert 1 - \hat s_j\vert \quad \text{subject to } \hat s_j\ge 0, \diag\{\hat s\}\preceq 2\Sigma_{approx}\]
Step 2: solve
\[\max \gamma \quad \text{subject to } \diag\{\gamma\hat s\}\preceq 2\Sigma\]and set $s^{ASDP} = \gamma\hat s$.
- ASDP with $\Sigma_{approx} = I$ trivially gives $\hat s_j = 1$ and $\gamma = 2\lambda_\min(\Sigma)\land 1$, reproducing the equicorrelated construction
- ASDP with $\Sigma_{approx} = \Sigma$ clearly gives $\hat s = s^{SDP}$ and $\gamma = 1$, reproducing the SDP construction
generally, one can choose $\Sigma_{approx}$ to be an $m$-block-diagonal approximation of $\Sigma$
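a rough sketch of the equicorrelated choice of $s$ together with the ASDP step-2 rescaling, here done by bisection on the smallest eigenvalue rather than with an SDP solver (assuming $\Sigma$ is a correlation matrix):

```python
import numpy as np

def s_equicorrelated(Sigma):
    """s_j = min(2*lambda_min(Sigma), 1) for all j."""
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(2 * lam_min, 1.0))

def asdp_rescale(Sigma, s_hat, tol=1e-8):
    """Step 2 of ASDP: largest gamma in [0, 1] with diag(gamma*s_hat) <= 2*Sigma (PSD sense)."""
    def feasible(gamma):
        return np.linalg.eigvalsh(2 * Sigma - np.diag(gamma * s_hat)).min() >= -tol
    if feasible(1.0):
        return s_hat
    lo, hi = 0.0, 1.0
    for _ in range(50):                       # bisection on the PSD constraint
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
    return lo * s_hat
```

with $\hat s_j = 1$ (the $\Sigma_{approx} = I$ case), `asdp_rescale` reproduces the equicorrelated construction, matching the first bullet above.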
The conditional randomization test
Numerical simulations
Logistic regression $p$-values
asymptotic maximum likelihood theory promises valid p-values for each coefficient in a GLM only when $n\ge p$
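a small illustrative check (my own sketch, not the paper's simulation) of how the classical asymptotic p-values behave under the global null when $p$ is a nonnegligible fraction of $n$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 500, 150
pvals = []
for _ in range(20):
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, 0.5, size=n)                 # y independent of X: every coefficient is null
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0, maxiter=200)
    pvals.extend(fit.pvalues[1:])                    # drop the intercept
pvals = np.array(pvals)
# with valid inference these null p-values would be roughly uniform,
# so about 5% of them should fall below 0.05
print("fraction of null p-values below 0.05:", (pvals < 0.05).mean())
```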
Alternative knockoff statistics
Adaptive knockoff statistics
Bayesian knockoff statistics
Alternative procedures
- FX knockoff procedure: can only be applied in homoscedastic Gaussian linear regression when $n \ge p$
- BHq applied to asymptotic GLM p-values: can only be applied when $n\ge p$, and although exact $p$-values can be computed for linear regression (when the MLE exists), for any other GLM these $p$-values can be far from valid unless $n \gg p$.