WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Model-X Knockoffs

Posted on (Update: )
Tags: Knockoff, Variable Selection

This note is for Candes, E., Fan, Y., Janson, L., & Lv, J. (2017). Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection. arXiv:1610.02351 [Math, Stat].

the paper addresses the selection problem by considering a very general conditional model, where the response $Y$ can depend in an arbitrary fashion on the covariates $X_1,\ldots, X_p$.

the only restriction is that the observations $(X_{i1},\ldots, X_{ip}, Y_i)$ are independently and identically distributed (i.i.d.)

the model is simpley

\[(X_{i1},\ldots, X_{ip}, Y_i)\sim_{i.i.d.} F_{XY}, \quad i = 1,\ldots, n\]

for some arbitrary $p+1$ dimensional joint distribution $F_{XY}$.

assume no knowledge of the conditional distribution of $Y\mid X_1,\ldots, X_p$, but assume the joint distribution of the covariates is known, and denote it by $F_X$.

minimal set $\cS$: Markov blanket or Markov boundary for $Y$ in the literature on graphical models

the present paper places no restriction on $p$ relative to $n$, in sharp contrast to the original knockoffs work which requird the low-dimensional setting of $n \ge p$

the conceptual difference is that the original knockoff procedure treats the $X_{ij}$ as fixed and relies on specific stochastic properties of the linear model, precluding consideration of $p > n$ or nonlinear models.

relationship with the classical setup

the usual setup for inference in conditional models is to assume a strong parametric model for the response conditional on the covariates, such as a homoscedastic linear model, but to assume as little as possible about, or even condition on, the covariates

the paper do the exact opposite by assuming we know everything about the covariate distribution but nothing about the conditional distribution $Y\mid X_1,\ldots, X_p$

relationship with work on inference after selection

these works differe from the paper in a number of ways so that the authors largely see them as complementary activities

  1. the post-inference works presuppose a selection procedure has been chosen (for reasons that may have nothing to do with controlling Type I error) and then compute p-values for the selected variables, taking into account the selection step. In contrast, MX knockoffs is by itself a selection procedure that controls Type I error.
  2. inference after selection relies heavily on parametric assumptions about the conditional distribution, namely that \(Y\mid X_1, \ldots, X_p \sim N(\mu(X_1,\ldots, X_p), \sigma^2)\) making it unclear how to extend it to the more general setting
  3. in the selection step, a subset of size $m\le n$ of the original $p$ covariates is selected, say $X_{j_1},\ldots, X_{j_m}$, and the objects of inference are the coefficients of the $X_{jk}$’s in the projection of $\mu$ onto the linear subspace spannjie mied by the $n\times m$ matrix of observed values of these $X_{jk}$’s. That is, the $k$-th null hypothesis is that the aformentioned coefficients on $X_{jk}$ is zero. In contrast, for MX in the same setting, the $j$-th null hypothesis would be that $\mu$ does not depend on $X_j$, and there would be $p$ such null hypotheses, one for each of the original features.

obstacles to obtaining p-values

the requirement of having valid p-values is quite constraining for general conditional modeling problems

regression p-value approximations

  • in low-dimensional setting ($n\ge p$)
    • for GLMs, one must resort to asymptotic p-values derived from MLE theory, which could be far from valid in practice
  • in high-dimensional ($n < p$) GLMs, it it not clear how to get $p$-values at all.

marginal testing

marginal p-values: p-values for testing unconditional (or marginal) independence betweeen $Y$ and $X_j$

the class of conditional test statistics is far richer than that of marginal ones, and includes the most powerful statistical inference and prediction methodology available.

for example, in compressed sensing, the signal recovery guarantees for state-of-the-art $\ell_1$-based (joint) algorithms are stronger than any guarantees possible with marginal methods.

some specific drawbacks of using marginal p-values are

  • power loss
  • interpretability
  • dependent p-values

Methodology

Model-X knockoffs

Model-X knockoffs for the family of random variables $X = (X_1,\ldots, X_p)$ are a new family of random variables $\tilde X = (\tilde X_1,\ldots, \tilde X_p)$ constructed with the following two properties

  • for any subset $S\subset {1,\ldots, p}$, $(X,\tilde X)_{swap(S)}\overset{d}{=}(X, \tilde X)$
  • $\tilde X\ind Y\mid X$ if there is a response $Y$. It is guaranteed if $\tilde X$ is constructed without looking at $Y$

An example of MX knockoffs, suppose that $X\sim N(0, \Sigma)$. Then a joint distribution obey the property is

\[(X, \tilde X)\sim N(0, G), \qquad G=\begin{bmatrix} \Sigma & \Sigma - \diag\{s\}\\ \Sigma - \diag\{s\} & \Sigma \end{bmatrix}\]

one way to construct the knockoff vector $\tilde X$ from the conditional distribution

\[\tilde X\mid X\overset{d}{=} N(\mu, V)\]

where $\mu$ and $V$ are given by classical regression formulas, namely

\[\mu = X-X\Sigma^{-1}\diag\{s\}\qquad V = 2\diag\{s\} - \diag\{s\}\Sigma^{-1}\diag\{s\}\]

Feature statistics

compute statistics $W_j$ for each $j\in {1,\ldots, p}$, a large positive value of $W_j$ providing evidence against the hypothesis that $X_j$ is null. This statistic depends on the response and the original variables but also on the knockoffs, that is

\[W_j = w_j([X, \tilde X], y)\]

impose a flip-sign property, which means that swapping the $j$-th variable with its knockoff has the effect of changing the sign of $W_j$. Formally, if $[X,\tilde X]_{swap(S)}$ is the matrix obtained by swapping columns in $S$

\[w_j([X, \tilde X]_{swap(S)}, y) = \begin{cases} w_j([x, \tilde X], y) & j\not\in S\\ -w_j([X, \tilde X], y) & j\in S \end{cases}\]

Constructing model-X knockoffs

Exact constructions

image

Approximate constructions: second-order model-X knockoffs

the equicorrelated construction uses

\[s_j^{EQ} = 2\lambda_\min(\Sigma) \land 1 \text{ for all } $j$\]

which minimize the correlation between variable-knockoff pairs subject to the constraint that all such pairs must have the same correlation

the demidefinite program (SDP) construction solves the convex program

\[\min \sum_j\vert 1 - s_j^{SDP} \vert\qquad\text{subject to } s_j^{SDP}\ge 0, \diag\{s^{SDP}\} \preceq 2\Sigma\]

the constructions run into new difficulties when large $p$

generalize the two knockoff constructions by the following two-step procedure, which called the approximate semidefinite program (ASDP) construction

Step 1: choose an approximation $\Sigma_{approx}$ of $\Sigma$ and solve

\[\min \sum_j \vert 1 - \hat s_j\vert \quad \text{subject to} \hat s_j\ge 0, \diag\{\hat s\}\preceq 2\Sigma_{approx}\]

Step 2: solve

\[\max \gamma \quad \text{subject to} \diag\{\gamma\hat s\}\preceq 2\Sigma\]

and set $s^{ASDP} = \gamma\hat s$.

  • ASDP with $\Sigma_{approx} = I$ trivially gives $\hat s_j = 1$ and $\gamma = 2\lambda_\min(\Sigma)\land 1$, reproducing the equicorrelated construction
  • ASDP with $\Sigma_{approx} = \Sigma$ clearly gives $\hat s_j = s^{SDP}$ and $\gamma = 1$, reproducing the SDP construction

generally, one can choose $\Sigma_{approx}$ to be an $m$-block-diagonal approximation of $\Sigma$

The conditional randomization test

image

Numerical simulations

Logistic regression $p$-values

asymptotic maximum likelihood theory promises valid p-values for each coefficient in a GLM only when $n\ge p$

Alternative knockoff statistics

Adaptive knockoff statistics

Bayesian knockoff statistics

Alternative procedures

  • FX knockoff procedure: can only be applied in homoscedastic Gaussian linear regression when $n \ge p$
  • BHq applied to asymptotic GLM p-values: can only be applied when $n\ge p$ and although for linear regression exact $p$-values can be computed (when the MLE exists), for any other GLM these $p$-values can be far from valid unless $n » p$.

Published in categories Note