
WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Model-X Knockoffs

Tags: Knockoffs, Variable Selection

This note is for Candes, E., Fan, Y., Janson, L., & Lv, J. (2017). Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection. arXiv:1610.02351 [Math, Stat].

the paper addresses the selection problem by considering a very general conditional model, where the response $Y$ can depend in an arbitrary fashion on the covariates $X_1,\ldots,X_p$.

the only restriction is that the observations $(X_{i1},\ldots,X_{ip}, Y_i)$ are independently and identically distributed (i.i.d.)

the model is simply

$$(X_{i1},\ldots,X_{ip}, Y_i) \overset{\text{i.i.d.}}{\sim} F_{XY}, \qquad i = 1,\ldots,n$$

for some arbitrary $(p+1)$-dimensional joint distribution $F_{XY}$.

assume no knowledge of the conditional distribution of $Y \mid X_1,\ldots,X_p$, but assume the joint distribution of the covariates is known, and denote it by $F_X$.

minimal set $S$: the Markov blanket or Markov boundary for $Y$ in the literature on graphical models

the present paper places no restriction on $p$ relative to $n$, in sharp contrast to the original knockoffs work, which required the low-dimensional setting of $n \ge p$

the conceptual difference is that the original knockoff procedure treats the $X_{ij}$ as fixed and relies on specific stochastic properties of the linear model, precluding consideration of $p > n$ or nonlinear models.

relationship with the classical setup

the usual setup for inference in conditional models is to assume a strong parametric model for the response conditional on the covariates, such as a homoscedastic linear model, but to assume as little as possible about, or even condition on, the covariates

the paper does the exact opposite by assuming we know everything about the covariate distribution but nothing about the conditional distribution $Y \mid X_1,\ldots,X_p$

relationship with work on inference after selection

these works differ from the paper in a number of ways, so the authors largely see them as complementary activities

  1. the post-selection inference works presuppose a selection procedure has been chosen (for reasons that may have nothing to do with controlling Type I error) and then compute p-values for the selected variables, taking into account the selection step. In contrast, MX knockoffs is by itself a selection procedure that controls Type I error.
  2. inference after selection relies heavily on parametric assumptions about the conditional distribution, namely that $Y \mid X_1,\ldots,X_p \sim N(\mu(X_1,\ldots,X_p), \sigma^2)$, making it unclear how to extend it to the more general setting
  3. in the selection step, a subset of size $m \le n$ of the original $p$ covariates is selected, say $X_{j_1},\ldots,X_{j_m}$, and the objects of inference are the coefficients of the $X_{j_k}$'s in the projection of $\mu$ onto the linear subspace spanned by the $n \times m$ matrix of observed values of these $X_{j_k}$'s. That is, the $k$-th null hypothesis is that the aforementioned coefficient on $X_{j_k}$ is zero. In contrast, for MX in the same setting, the $j$-th null hypothesis would be that $\mu$ does not depend on $X_j$, and there would be $p$ such null hypotheses, one for each of the original features.

obstacles to obtaining p-values

the requirement of having valid p-values is quite constraining for general conditional modeling problems

regression p-value approximations

  • in the low-dimensional setting ($n \ge p$)
    • for GLMs, one must resort to asymptotic p-values derived from MLE theory, which can be far from valid in practice
  • in high-dimensional ($n < p$) GLMs, it is not clear how to get p-values at all.

marginal testing

marginal p-values: p-values for testing unconditional (or marginal) independence between $Y$ and $X_j$

the class of conditional test statistics is far richer than that of marginal ones, and includes the most powerful statistical inference and prediction methodology available.

for example, in compressed sensing, the signal recovery guarantees for state-of-the-art $\ell_1$-based (joint) algorithms are stronger than any guarantees possible with marginal methods.

some specific drawbacks of using marginal p-values are

  • power loss
  • interpretability
  • dependent p-values

Methodology

Model-X knockoffs

Model-X knockoffs for the family of random variables $X = (X_1,\ldots,X_p)$ are a new family of random variables $\tilde X = (\tilde X_1,\ldots,\tilde X_p)$ constructed with the following two properties:

  • for any subset $S \subseteq \{1,\ldots,p\}$, $(X, \tilde X)_{\text{swap}(S)} \overset{d}{=} (X, \tilde X)$
  • $\tilde X \perp Y \mid X$ if there is a response $Y$; this is guaranteed if $\tilde X$ is constructed without looking at $Y$

As an example of MX knockoffs, suppose that $X \sim N(0, \Sigma)$. Then a joint distribution obeying the exchangeability property is

$$(X, \tilde X) \sim N(0, G), \qquad G = \begin{bmatrix} \Sigma & \Sigma - \text{diag}\{s\} \\ \Sigma - \text{diag}\{s\} & \Sigma \end{bmatrix}$$

one way to construct the knockoff vector $\tilde X$ is to sample from the conditional distribution

$$\tilde X \mid X \overset{d}{=} N(\mu, V)$$

where $\mu$ and $V$ are given by classical regression formulas, namely

$$\mu = X - X\Sigma^{-1}\text{diag}\{s\}, \qquad V = 2\,\text{diag}\{s\} - \text{diag}\{s\}\Sigma^{-1}\text{diag}\{s\}$$
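as a quick illustration, here is a minimal numpy sketch of this conditional sampling step, assuming $\Sigma$ and a feasible $s$ are already in hand (`gaussian_knockoffs` is a hypothetical helper name, not the authors' code):

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng=None):
    """Sample knockoffs row by row from N(mu, V), where
    mu = X - X Sigma^{-1} diag{s} and
    V  = 2 diag{s} - diag{s} Sigma^{-1} diag{s}."""
    rng = np.random.default_rng(rng)
    D = np.diag(s)
    Sinv_D = np.linalg.solve(Sigma, D)      # Sigma^{-1} diag{s}
    mu = X - X @ Sinv_D                     # n x p conditional means
    V = 2 * D - D @ Sinv_D                  # p x p conditional covariance
    L = np.linalg.cholesky(V + 1e-12 * np.eye(len(s)))  # tiny jitter guards against round-off
    return mu + rng.standard_normal(X.shape) @ L.T
```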

Feature statistics

compute statistics $W_j$ for each $j \in \{1,\ldots,p\}$, a large positive value of $W_j$ providing evidence against the hypothesis that $X_j$ is null. This statistic depends on the response and the original variables but also on the knockoffs, that is

$$W_j = w_j([X, \tilde X], y)$$

impose a flip-sign property, which means that swapping the $j$-th variable with its knockoff has the effect of changing the sign of $W_j$. Formally, if $[X, \tilde X]_{\text{swap}(S)}$ is the matrix obtained by swapping the columns in $S$,

$$w_j([X, \tilde X]_{\text{swap}(S)}, y) = \begin{cases} w_j([X, \tilde X], y), & j \notin S \\ -w_j([X, \tilde X], y), & j \in S \end{cases}$$
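for concreteness, one statistic with the flip-sign property highlighted in the paper is the lasso coefficient difference (LCD), $W_j = \vert \hat b_j \vert - \vert \hat b_{j+p} \vert$, where $\hat b$ is a lasso fit on the augmented design; below is a sketch using scikit-learn (the helper name is illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y):
    """Lasso coefficient difference: W_j = |b_j| - |b_{j+p}|
    from a cross-validated lasso on [X, X_tilde]."""
    XX = np.hstack([X, X_tilde])
    b = LassoCV(cv=5).fit(XX, y).coef_
    p = X.shape[1]
    # swapping column j with its knockoff exchanges b_j and b_{j+p},
    # which flips the sign of W_j, as the flip-sign property requires
    return np.abs(b[:p]) - np.abs(b[p:])
```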

Constructing model-X knockoffs

Exact constructions

[Figure: Algorithm 1 of the paper, the sequential conditional independent pairs construction, which for $j = 1,\ldots,p$ samples $\tilde X_j$ from the conditional law of $X_j$ given $(X_{-j}, \tilde X_{1:j-1})$]

Approximate constructions: second-order model-X knockoffs

the equicorrelated construction uses

$$s_j^{\text{EQ}} = 2\lambda_{\min}(\Sigma) \land 1 \quad \text{for all } j$$

which minimizes the correlation between variable-knockoff pairs subject to the constraint that all such pairs must have the same correlation

the semidefinite program (SDP) construction solves the convex program

$$\min \sum_j \vert 1 - s_j^{\text{SDP}} \vert \qquad \text{subject to } s_j^{\text{SDP}} \ge 0,\; \text{diag}\{s^{\text{SDP}}\} \preceq 2\Sigma$$
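both constructions must satisfy the feasibility constraint $\text{diag}\{s\} \preceq 2\Sigma$; the small sketch below computes the equicorrelated $s$ (assuming $\Sigma$ is a correlation matrix, as in the paper) and checks feasibility of any candidate $s$:

```python
import numpy as np

def s_equicorrelated(Sigma):
    """Equicorrelated choice: s_j = min(2 * lambda_min(Sigma), 1) for all j."""
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(2.0 * lam_min, 1.0))

def is_feasible(Sigma, s, tol=1e-10):
    """diag{s} ⪯ 2Σ holds iff 2Σ - diag{s} is positive semidefinite."""
    return np.linalg.eigvalsh(2.0 * Sigma - np.diag(s)).min() >= -tol
```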

these constructions run into new computational difficulties when $p$ is large

generalize the two knockoff constructions by the following two-step procedure, which is called the approximate semidefinite program (ASDP) construction

Step 1: choose an approximation $\Sigma_{\text{approx}}$ of $\Sigma$ and solve

$$\min \sum_j \vert 1 - \hat s_j \vert \qquad \text{subject to } \hat s_j \ge 0,\; \text{diag}\{\hat s\} \preceq 2\Sigma_{\text{approx}}$$

Step 2: solve

$$\max \gamma \qquad \text{subject to } \text{diag}\{\gamma \hat s\} \preceq 2\Sigma$$

and set $s^{\text{ASDP}} = \gamma \hat s$.

  • ASDP with $\Sigma_{\text{approx}} = I$ trivially gives $\hat s_j = 1$ and $\gamma = 2\lambda_{\min}(\Sigma) \land 1$, reproducing the equicorrelated construction
  • ASDP with $\Sigma_{\text{approx}} = \Sigma$ clearly gives $\hat s = s^{\text{SDP}}$ and $\gamma = 1$, reproducing the SDP construction

generally, one can choose $\Sigma_{\text{approx}}$ to be an $m$-block-diagonal approximation of $\Sigma$
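note that Step 2 is a one-dimensional feasibility problem in $\gamma$, so it can be solved cheaply; the sketch below uses bisection on the PSD constraint (an implementation assumption, not the authors' code):

```python
import numpy as np

def asdp_gamma(Sigma, s_hat, iters=50):
    """Largest gamma in [0, 1] with diag{gamma * s_hat} ⪯ 2Σ, via bisection."""
    def feasible(gamma):
        return np.linalg.eigvalsh(2 * Sigma - gamma * np.diag(s_hat)).min() >= 0
    if feasible(1.0):
        return 1.0
    lo, hi = 0.0, 1.0              # gamma = 0 is always feasible since 2Σ ⪰ 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
    return lo
```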

The conditional randomization test

[Figure: Algorithm of the paper, the conditional randomization test]
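the idea: under the $j$-th null hypothesis $Y \perp X_j \mid X_{-j}$, resampling the $j$-th column from its conditional distribution given the others leaves the joint distribution of the data unchanged, so recomputing any test statistic on the resampled data yields a valid empirical p-value. A minimal sketch, where `sample_xj_given_rest` is a hypothetical user-supplied sampler for $X_j \mid X_{-j}$ and `statistic` measures dependence on feature $j$:

```python
import numpy as np

def crt_pvalue(X, y, j, statistic, sample_xj_given_rest, K=1000, rng=None):
    """One-sided conditional randomization p-value for feature j."""
    rng = np.random.default_rng(rng)
    t_obs = statistic(X, y)
    count = 0
    for _ in range(K):
        X_k = X.copy()
        X_k[:, j] = sample_xj_given_rest(X, j, rng)  # fresh draw of column j | X_{-j}
        count += statistic(X_k, y) >= t_obs
    return (1 + count) / (1 + K)
```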

Numerical simulations

Logistic regression p-values

asymptotic maximum likelihood theory promises valid p-values for each coefficient in a GLM only when $n \ge p$
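this is easy to see empirically; the sketch below (an illustrative setup, not necessarily the paper's exact simulation parameters) generates data with every coefficient null and shows the usual logistic regression z-test p-value rejecting far more often than its nominal level:

```python
# Illustration: asymptotic GLM p-values are miscalibrated when p is comparable to n.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p, reps = 500, 200, 100
pvals = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, 0.5, n)               # y independent of X: every coefficient is null
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    pvals.append(fit.pvalues[1])              # z-test p-value for one (null) coefficient
print(np.mean(np.array(pvals) < 0.05))        # ~0.05 if valid; typically much larger here
```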

Alternative knockoff statistics

Adaptive knockoff statistics

Bayesian knockoff statistics

Alternative procedures

  • FX knockoff procedure: can only be applied in homoscedastic Gaussian linear regression when $n \ge p$
  • BHq applied to asymptotic GLM p-values: can only be applied when $n \ge p$, and although exact p-values can be computed for linear regression (when the MLE exists), for any other GLM these p-values can be far from valid unless $n \gg p$.

Published in categories Note