WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Power of Knockoffs

Tags: Knockoffs

This note is for Ke, Zheng Tracy, Jun S. Liu, and Yucong Ma. “Power of Knockoff: The Impact of Ranking Algorithm, Augmented Design, and Symmetric Statistic.” arXiv:2010.08132. Preprint, arXiv, February 13, 2024.


knockoff has three key components:

  • ranking algorithm
  • augmented design
  • symmetric statistic

various combinations of the three components yield a collection of knockoff variants

goal: all variants guarantee finite-sample FDR control, so the aim is to compare their power

assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate

compare the power of knockoff with its prototype: a method that uses the same ranking algorithm but has access to an ideal threshold

three components should coordinate so that the resulting importance metrics for null variables ($\beta_j=0$) have symmetric distributions and that the importance metrics for non-null variables ($\beta_j\neq 0$) are positive with high probability.

the literature has revealed insights on how to choose these components to get valid FDR control, but there is little understanding of how to design them to boost power

for each variant of knockoff, also consider its prototype, which applies the ranking algorithm to the original design $X$ and selects variables by applying an ideal threshold on the importance metrics output by the ranking algorithm

the comparison of knockoff and its prototype reveals the key difference between FDR control and variable selection — we need to pay an extra price to find a data-driven threshold
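To make the "data-driven threshold" concrete, here is a minimal sketch of the knockoff+ stopping rule of Barber and Candès. The note itself does not spell this rule out, so treat it as an assumption about which threshold is meant. The function names and the toy vector `W` are hypothetical.

```python
def knockoff_plus_threshold(W, q):
    """Smallest t in {|W_j| > 0} with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    This is the knockoff+ rule; it is an assumption here, not stated in the note.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        neg = sum(w <= -t for w in W)   # proxy count of false discoveries
        pos = sum(w >= t for w in W)    # number of selections at threshold t
        if (1 + neg) / max(1, pos) <= q:
            return t
    return float("inf")                 # no threshold meets level q


def select(W, t):
    """Indices whose importance metric is at least t."""
    return [j for j, w in enumerate(W) if w >= t]


# Toy importance metrics: six clearly positive values, four small symmetric ones.
W = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]
t = knockoff_plus_threshold(W, q=0.2)
print(t, select(W, t))
```

The prototype, by contrast, would be handed an ideal threshold chosen with knowledge of the true support; the gap between the two selections is the "extra price" mentioned above.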

regime of rare and weak signals

\[\#\{j:\beta_j\neq 0\} \sim p^{1-\vartheta}, \qquad \vert \beta_j\vert \sim \sqrt{2r\log(p)}\;\text{ if }\beta_j \neq 0\]
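A small sketch of drawing coefficients in this regime. It assumes an i.i.d. Bernoulli support with rate $p^{-\vartheta}$ and a common positive amplitude exactly equal to $\sqrt{2r\log(p)}$; both are simplifying assumptions for illustration, and `rare_weak_beta` is a hypothetical helper name.

```python
import math
import random


def rare_weak_beta(p, theta, r, seed=0):
    """Draw beta with Bernoulli(p^-theta) support and amplitude sqrt(2 r log p)."""
    rng = random.Random(seed)
    eps = p ** (-theta)                      # expected fraction of signals
    tau = math.sqrt(2.0 * r * math.log(p))   # common signal strength (assumed positive)
    return [tau if rng.random() < eps else 0.0 for _ in range(p)]


beta = rare_weak_beta(p=10_000, theta=0.5, r=2.0)
n_signals = sum(b != 0.0 for b in beta)
# The expected number of signals is p^(1 - theta) = 100 for these settings.
print(n_signals, "signals; expected about", round(10_000 ** 0.5))
```

Larger $\vartheta$ makes the signals rarer and smaller $r$ makes them weaker, which is what the two parameters control in the phase diagrams later on.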

the power of an FDR control method depends on the target FDR level $q$. instead of fixing $q$, the paper derives a trade-off diagram between FDR and TPR as $q$ varies

  • Su et al. (2017): a framework for studying the trade-off between FPR and TPR across the lasso solution path
  • Weinstein et al. (2017) and Weinstein et al. (2021): extend this framework to find a tradeoff for the knockoff filter, when the ranking algorithm is the Lasso and thresholded Lasso, respectively
    • for linear sparsity and independent Gaussian design
    • here in this paper, $\beta$ is much sparser, and the overall signal strength as characterized by $\Vert\beta\Vert$ is much smaller; the paper is also primarily interested in correlated designs
  • Liu and Rigollet (2019): gave sufficient and necessary conditions on $X$ such that knockoffs have full power for correlated designs, but they did not provide an explicit tradeoff diagram
    • their result does not apply to the orthodox knockoff but only to a variant of knockoff that uses de-biased Lasso as the ranking algorithm
  • Fan et al. (2019): study the power of model-X knockoff for arbitrary sparsity, under a stronger signal strength
  • Javanmard and Javadi (2019): using de-biased Lasso for FDR control
  • Wang and Janson (2022) and Spector and Janson (2022): consider linear sparsity and iid Gaussian designs, and found a power disadvantage when constructing the augmented design as in the orthodox knockoff
  • Li and Fithian (2021): recast the fixed-X knockoff as a conditional post-selection inference method and studied its power
  • in a series of papers (Ji and Jin, 2012; Jin et al., 2014; Ke et al., 2014), the rare/weak signal model was used to study variable selection
    • these focus on the class of Screen-and-Clean methods for variable selection and prove its optimality under various design classes

2 The knockoff filter, its variants and prototypes

three key components

  1. ranking algorithm
  2. augmented design, $[X, \tilde X]$
  3. symmetric statistic $f(\cdot, \cdot)$

$f$ can be any anti-symmetric function, i.e. $f(v, u) = -f(u, v)$. Two popular choices are the signed maximum statistic and the difference statistic:

\[f^{\mathrm{sgm}}(u, v) = \operatorname{sgn}(u-v)\cdot \max(u, v), \qquad f^{\mathrm{dif}}(u, v) = u-v\]
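The two statistics are easy to write down directly. The sketch below is a literal transcription of the formulas, with a quick check that swapping the arguments flips the sign; `f_sgm` and `f_dif` are illustrative names only.

```python
def f_sgm(u, v):
    """Signed maximum: sgn(u - v) * max(u, v)."""
    sign = (u > v) - (u < v)   # +1, 0 or -1
    return sign * max(u, v)


def f_dif(u, v):
    """Difference statistic: u - v."""
    return u - v


# Swapping the original and knockoff importance flips the sign of both statistics.
for u, v in [(3.0, 1.0), (1.0, 3.0), (2.0, 2.0)]:
    assert f_sgm(u, v) == -f_sgm(v, u)
    assert f_dif(u, v) == -f_dif(v, u)
    print(u, v, f_sgm(u, v), f_dif(u, v))
```

Note how the two differ in magnitude: for $(u, v) = (3, 1)$ the signed maximum keeps the full value 3 while the difference shrinks it to 2, so the two statistics can rank variables differently.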

4 Impact of the symmetric statistic

in the orthogonal design,

\[Z_j = \vert x_j'y\vert,\qquad \text{and} \qquad \tilde Z_j = \vert\tilde x_j'y\vert,\qquad 1\le j\le p\]
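A rough simulation of these quantities, under assumptions not stated in the note: unit noise variance, and knockoff columns chosen orthogonal to $X$, so that $x_j'y \sim N(\beta_j, 1)$ and $\tilde x_j'y \sim N(0, 1)$ independently. The helper name and the parameter values are hypothetical.

```python
import math
import random


def simulate_orthogonal(beta, seed=0):
    """Return (Z, Z_tilde) under the assumed orthogonal-design model."""
    rng = random.Random(seed)
    Z = [abs(b + rng.gauss(0.0, 1.0)) for b in beta]      # |x_j' y|
    Z_tilde = [abs(rng.gauss(0.0, 1.0)) for _ in beta]    # |x~_j' y|
    return Z, Z_tilde


p = 2000
tau = math.sqrt(2 * 3.0 * math.log(p))            # signal strength with r = 3
beta = [tau if j < 20 else 0.0 for j in range(p)]  # first 20 variables are signals
Z, Zt = simulate_orthogonal(beta)

# Nulls: Z_j and Z~_j are exchangeable, so Z_j > Z~_j about half of the time.
null_pos = sum(Z[j] > Zt[j] for j in range(20, p)) / (p - 20)
# Signals: Z_j should beat Z~_j with high probability.
sig_pos = sum(Z[j] > Zt[j] for j in range(20)) / 20
print(null_pos, sig_pos)
```

This is the symmetry that any choice of $f$ relies on: nulls get a random sign, while signals are positive with high probability.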

consider the two symmetric statistics: sgm and dif

phase diagram:

  • region of exact recovery (ER)
  • region of almost full recovery (AFR)
  • region of No Recovery (NR)

signed maximum is optimal among all symmetric statistics, because its phase diagram already matches the information-theoretic lower bound.

