Model-X Knockoffs
the paper addresses the selection problem by considering a very general conditional model, where the response Y can depend in an arbitrary fashion on the covariates X1,…,Xp.
the only restriction is that the observations (Xi1,…,Xip,Yi) are independently and identically distributed (i.i.d.)
the model is simply
$$(X_{i1},\dots,X_{ip},Y_i) \overset{\text{i.i.d.}}{\sim} F_{XY},\qquad i=1,\dots,n$$
for some arbitrary (p+1)-dimensional joint distribution FXY.
assume no knowledge of the conditional distribution of Y∣X1,…,Xp, but assume the joint distribution of the covariates is known, and denote it by FX.
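a minimal numpy sketch of this setup (the AR(1)-style covariance and the nonlinear response below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 10

# known covariate distribution F_X: here a multivariate Gaussian with an
# AR(1)-style covariance (an illustrative choice)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# unknown conditional distribution Y | X: can be arbitrary, e.g. nonlinear --
# the model-X framework places no restriction on it
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.standard_normal(n)
```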
minimal set S: Markov blanket or Markov boundary for Y in the literature on graphical models
the present paper places no restriction on p relative to n, in sharp contrast to the original knockoffs work, which required the low-dimensional setting n≥p
the conceptual difference is that the original knockoff procedure treats the Xij as fixed and relies on specific stochastic properties of the linear model, precluding consideration of p>n or nonlinear models.
relationship with the classical setup
the usual setup for inference in conditional models is to assume a strong parametric model for the response conditional on the covariates, such as a homoscedastic linear model, but to assume as little as possible about, or even condition on, the covariates
the paper does the exact opposite by assuming we know everything about the covariate distribution but nothing about the conditional distribution Y∣X1,…,Xp
relationship with work on inference after selection
these works differ from the paper in a number of ways, so the authors largely see them as complementary activities
- the post-selection inference works presuppose a selection procedure has been chosen (for reasons that may have nothing to do with controlling Type I error) and then compute p-values for the selected variables, taking into account the selection step. In contrast, MX knockoffs is by itself a selection procedure that controls Type I error.
- inference after selection relies heavily on parametric assumptions about the conditional distribution, namely that Y∣X1,…,Xp∼N(μ(X1,…,Xp),σ²), making it unclear how to extend it to the more general setting
- in the selection step, a subset of size m≤n of the original p covariates is selected, say Xj1,…,Xjm, and the objects of inference are the coefficients of the Xjk’s in the projection of μ onto the linear subspace spanned by the n×m matrix of observed values of these Xjk’s. That is, the k-th null hypothesis is that the aforementioned coefficient on Xjk is zero. In contrast, for MX in the same setting, the j-th null hypothesis would be that μ does not depend on Xj, and there would be p such null hypotheses, one for each of the original features.
obstacles to obtaining p-values
the requirement of having valid p-values is quite constraining for general conditional modeling problems
regression p-value approximations
- in low-dimensional setting (n≥p)
- for GLMs, one must resort to asymptotic p-values derived from MLE theory, which could be far from valid in practice
- in high-dimensional (n<p) GLMs, it is not clear how to get p-values at all.
marginal testing
marginal p-values: p-values for testing unconditional (or marginal) independence between Y and Xj
the class of conditional test statistics is far richer than that of marginal ones, and includes the most powerful statistical inference and prediction methodology available.
for example, in compressed sensing, the signal recovery guarantees for state-of-the-art ℓ1-based (joint) algorithms are stronger than any guarantees possible with marginal methods.
some specific drawbacks of using marginal p-values are
- power loss
- interpretability
- dependent p-values
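a small numpy simulation (my own illustration, not from the paper) of why marginal tests can flag null variables: a null feature that merely correlates with a true signal shows a strong marginal association with Y, while its association with Y after accounting for the true signal is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# x1 is the only true signal; x2 is conditionally null but highly correlated with x1
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)
y = x1 + 0.1 * rng.standard_normal(n)

# marginal association of the null feature x2 with y is large
marginal_corr_x2 = np.corrcoef(x2, y)[0, 1]

# the part of x2 orthogonal to x1 carries (essentially) no association with y
partial_resid = x2 - 0.9 * x1
conditional_corr_x2 = np.corrcoef(partial_resid, y)[0, 1]
```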
Methodology
Model-X knockoffs
Model-X knockoffs for the family of random variables X=(X1,…,Xp) are a new family of random variables ˜X=(˜X1,…,˜Xp) constructed with the following two properties
- for any subset S⊂{1,…,p}, $(X,\tilde X)_{\text{swap}(S)} \overset{d}{=} (X,\tilde X)$
- ˜X⊥⊥Y∣X if there is a response Y; this is guaranteed if ˜X is constructed without looking at Y
An example of MX knockoffs: suppose that X∼N(0,Σ). Then a joint distribution obeying the exchangeability property is
$$(X,\tilde X)\sim N(0,G),\qquad G=\begin{bmatrix}\Sigma & \Sigma-\mathrm{diag}\{s\}\\ \Sigma-\mathrm{diag}\{s\} & \Sigma\end{bmatrix}$$
one way to construct the knockoff vector ˜X is to sample from the conditional distribution
$$\tilde X\mid X\sim N(\mu,V)$$
where μ and V are given by the classical regression formulas
$$\mu=X-X\Sigma^{-1}\mathrm{diag}\{s\},\qquad V=2\,\mathrm{diag}\{s\}-\mathrm{diag}\{s\}\Sigma^{-1}\mathrm{diag}\{s\}$$
Feature statistics
compute statistics Wj for each j∈{1,…,p}, a large positive value of Wj providing evidence against the hypothesis that Xj is null. The statistic depends on the response and the original variables, but also on the knockoffs:
$$W_j = w_j([X,\tilde X], y)$$
impose a flip-sign property, which means that swapping the j-th variable with its knockoff has the effect of changing the sign of Wj. Formally, if [X,˜X]swap(S) is the matrix obtained by swapping the columns in S,
$$w_j([X,\tilde X]_{\text{swap}(S)}, y)=\begin{cases} w_j([X,\tilde X],y) & j\notin S\\ -w_j([X,\tilde X],y) & j\in S\end{cases}$$
Constructing model-X knockoffs
Exact constructions
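for the Gaussian example above, the exact construction can be sketched in numpy as follows (the AR(1)-style covariance is an illustrative assumption; the equicorrelated choice of s is used for concreteness):

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng):
    """Sample exact model-X knockoffs for rows of X drawn i.i.d. from N(0, Sigma)."""
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))    # Sigma^{-1} diag{s}
    mu = X - X @ Sigma_inv_S                            # conditional mean (row convention)
    V = 2 * np.diag(s) - np.diag(s) @ Sigma_inv_S       # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(len(s)))  # jitter guards near-singular V
    return mu + rng.standard_normal(X.shape) @ L.T

rng = np.random.default_rng(0)
p, n = 5, 200_000
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # illustrative AR(1) covariance
s = np.full(p, min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0))            # equicorrelated choice of s
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Xk = gaussian_knockoffs(X, Sigma, s, rng)

# empirical check of the exchangeability structure: Cov(X, knockoffs) should be Sigma - diag{s}
cross = X.T @ Xk / n
```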
Approximate constructions: second-order model-X knockoffs
the equicorrelated construction uses
$$s_j^{EQ}=2\lambda_{\min}(\Sigma)\wedge 1\quad\text{for all } j,$$
which minimizes the correlation between variable–knockoff pairs subject to the constraint that all such pairs have the same correlation
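the equicorrelated choice is a one-liner in numpy (sketch; the AR(1)-style Σ is an illustrative assumption):

```python
import numpy as np

# illustrative covariance matrix
p, rho = 50, 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# every s_j equals 2*lambda_min(Sigma), capped at 1 (for a correlation matrix),
# which keeps diag{s} <= 2*Sigma feasible in the positive-semidefinite order
lam_min = np.linalg.eigvalsh(Sigma)[0]  # eigvalsh returns eigenvalues in ascending order
s_eq = np.full(p, min(2 * lam_min, 1.0))
```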
the semidefinite program (SDP) construction solves the convex program
$$\min \sum_j\vert 1 - s_j^{SDP}\vert\qquad\text{subject to } s_j^{SDP}\ge 0,\ \mathrm{diag}\{s^{SDP}\}\preceq 2\Sigma$$
both constructions run into new difficulties when p is large
generalize the two knockoff constructions by the following two-step procedure, called the approximate semidefinite program (ASDP) construction
Step 1: choose an approximation $\Sigma_{\text{approx}}$ of $\Sigma$ and solve
$$\min \sum_j \vert 1 - \hat s_j\vert \quad \text{subject to } \hat s_j\ge 0,\ \mathrm{diag}\{\hat s\}\preceq 2\Sigma_{\text{approx}}$$
Step 2: solve
$$\max \gamma \quad \text{subject to } \mathrm{diag}\{\gamma\hat s\}\preceq 2\Sigma$$
and set $s^{ASDP} = \gamma\hat s$.
- ASDP with $\Sigma_{\text{approx}} = I$ trivially gives $\hat s_j = 1$ and $\gamma = 2\lambda_{\min}(\Sigma)\wedge 1$, reproducing the equicorrelated construction
- ASDP with $\Sigma_{\text{approx}} = \Sigma$ clearly gives $\hat s = s^{SDP}$ and $\gamma = 1$, reproducing the SDP construction
generally, one can choose $\Sigma_{\text{approx}}$ to be an m-block-diagonal approximation of $\Sigma$
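step 2 of ASDP is a one-dimensional feasibility problem, so it can be solved by bisection on the minimum eigenvalue (a sketch, assuming the ŝ from step 1 is given; with Σapprox = I it recovers the equicorrelated γ):

```python
import numpy as np

def asdp_gamma(Sigma, s_hat, tol=1e-6):
    """Largest gamma in [0, 1] with diag{gamma * s_hat} <= 2*Sigma (PSD order),
    found by bisection on the minimum eigenvalue."""
    def feasible(gamma):
        return np.linalg.eigvalsh(2 * Sigma - gamma * np.diag(s_hat))[0] >= 0
    lo, hi = 0.0, 1.0
    if feasible(hi):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo

# example: with s_hat = 1 (i.e. Sigma_approx = I), gamma recovers 2*lambda_min(Sigma) ∧ 1
p, rho = 30, 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # illustrative covariance
gamma = asdp_gamma(Sigma, np.ones(p))
```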
The conditional randomization test
Numerical simulations
Logistic regression p-values
asymptotic maximum likelihood theory promises valid p-values for each coefficient in a GLM only when n≥p
Alternative knockoff statistics
Adaptive knockoff statistics
Bayesian knockoff statistics
Alternative procedures
- FX knockoff procedure: can only be applied in homoscedastic Gaussian linear regression when n≥p
- BHq applied to asymptotic GLM p-values: can only be applied when n≥p, and although exact p-values can be computed for linear regression (when the MLE exists), for any other GLM these p-values can be far from valid unless n≫p.