
Derandomised Knockoffs from E-values

Tags: E-values, Knockoffs, False Discovery Rate, Multiple Hypothesis Testing, Variable Selection

This note is for Ren, Z., & Barber, R. F. (2024). Derandomised knockoffs: Leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 122–154. https://doi.org/10.1093/jrsssb/qkad085

Empirically, it has been observed that the output of the model-X knockoffs procedure can be highly variable from one run to another, which is a potentially undesirable property.

Ren et al. (2023) propose that, after running knockoffs $M$ times on the given dataset, the procedure computes

\[\Pi_j = \frac 1M \sum_{m=1}^M 1\{j\in S_{kn}^{(m)}\}\]

the final selected set is then given by $S_{\text{kn-derand}} = \{j:\Pi_j \ge \eta\}$, i.e., all features $X_j$ whose probability of selection under a random run of knockoffs exceeds the threshold $\eta$.
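
a minimal sketch (Python/NumPy) of this aggregation step; the list `selected_sets` of per-run selections and the default threshold `eta` are illustrative placeholders, not the paper's code:

```python
import numpy as np

def aggregate_by_frequency(selected_sets, p, eta=0.5):
    """Ren et al. (2023)-style aggregation: select j whenever Pi_j >= eta."""
    M = len(selected_sets)
    Pi = np.zeros(p)
    for S in selected_sets:                       # S is the selected set S_kn^(m) of run m
        Pi[list(S)] += 1.0 / M                    # Pi_j = (1/M) sum_m 1{j in S_kn^(m)}
    return {j for j in range(p) if Pi[j] >= eta}
```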

Ren et al. (2023) establish a guaranteed bound on the expected number of false discoveries, but it appears impossible to prove a bound directly on FDR

contribution of this work

with a simple and natural modification in the construction of the derandomised knockoffs procedure, they can restore the guarantee of FDR control

Specifically, they consider a weighted probability of selection, replacing $\Pi_j$ with

\[\frac 1M\sum_{m=1}^M \text{weight}_j^{(m)} 1\{j \in S_{kn}^{(m)}\}\]

informally, choose a lower weight if many features were selected in the $m$-th run.

the method builds on the recent e-BH procedure, which helps find an adaptive, FDR-controlling threshold for the weighted selection probabilities

model-X knockoffs

\[S_{kn} = \{j: W_j \ge T\}\,,\qquad \text{where } T=\inf\left\{t > 0: \frac{1+\sum_{j\in [p]}1(W_j\le -t) }{\sum_{j\in [p]}1(W_j\ge t)} \le \alpha \right\}\]
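
a minimal sketch of this selection rule, assuming the feature-importance statistics $W_1,\ldots,W_p$ have already been computed (the helper name is mine, not the paper's code):

```python
import numpy as np

def knockoff_select(W, alpha):
    """Knockoff filter: return the threshold T and the selected set {j: W_j >= T}."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):               # candidate thresholds t > 0
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return t, np.where(W >= t)[0]
    return np.inf, np.array([], dtype=int)             # no feasible t: select nothing
```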

e-BH procedure

\[S_{ebh} = \left\{j: e_j \ge \frac{p}{\alpha \hat k} \right\}\,, \quad \text{where } \hat k = \max\{k\in [p]: e_{(k)}\ge \frac{p}{\alpha k}\}\]
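
a minimal sketch of e-BH following the display above (again my own helper, not library code):

```python
import numpy as np

def ebh(e, alpha):
    """e-BH: reject j whenever e_j >= p / (alpha * khat)."""
    e = np.asarray(e, dtype=float)
    p = len(e)
    e_sorted = np.sort(e)[::-1]                         # e_(1) >= e_(2) >= ... >= e_(p)
    ks = np.where(e_sorted >= p / (alpha * np.arange(1, p + 1)))[0]
    if len(ks) == 0:
        return np.array([], dtype=int)                  # no valid k: reject nothing
    khat = ks.max() + 1                                 # largest k with e_(k) >= p/(alpha k)
    return np.where(e >= p / (alpha * khat))[0]
```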

Knockoffs as an e-BH procedure

the knockoffs procedure is in fact equivalent to a relaxed e-BH procedure with a class of properly defined e-values

\[e_j = p\cdot \frac{1(W_j \ge T)}{1+\sum_{k\in [p]}1(W_k\le -T)}\]

Let $S_{kn}$ be the set of selected features for the knockoff procedure, and let $S_{ebh}$ be the set of selected features for the e-BH procedure applied to $e_1,\ldots, e_p$. Then $S_{kn} = S_{ebh}$. {.thm title="Theorem 1"}
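
a quick numerical check of Theorem 1 on synthetic statistics, reusing the `knockoff_select` and `ebh` sketches above (the simulated `W` is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, alpha = 200, 0.1
W = rng.normal(size=p)
W[:20] += 3.0                                     # a handful of strong signals

T, S_kn = knockoff_select(W, alpha)
e = p * (W >= T) / (1 + np.sum(W <= -T))          # e_j = p * 1(W_j >= T) / (1 + #{k: W_k <= -T})
S_ebh = ebh(e, alpha)
print(set(S_kn) == set(S_ebh))                    # Theorem 1: the two selected sets agree
```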

Derandomising knockoffs

generate $M$ copies of the knockoff matrix

choosing some $\alpha_{kn}\in (0, 1)$, compute the threshold

\[T^{(m)} = \inf\{t>0: \frac{1+\sum_{j\in [p]}1(W_j^{(m)}\le -t)}{\sum_{k\in [p]}1(W_k^{(m)} \ge t) }\le \alpha_{kn} \}\]

let

\[e_j^{(m)} = p\cdot \frac{1(W_j^{(m)}\ge T^{(m)})}{1+\sum_{k\in [p]}1(W_k^{(m)} \le -T^{(m)})}\]

Aggregate the e-values by taking the average

\[e_j^{avg} = \frac 1M\sum_{m=1}^M e_j^{(m)}\]
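
putting the pieces together, a minimal sketch of the derandomised procedure, reusing the `knockoff_select` and `ebh` helpers sketched above; `compute_W` (which draws a fresh knockoff copy and returns $W^{(m)}$) and the default levels are hypothetical placeholders, not the paper's code:

```python
import numpy as np

def derandomised_knockoffs(X, Y, compute_W, M=30, alpha_kn=0.05, alpha_ebh=0.1):
    """M knockoff runs at level alpha_kn, averaged e-values, then e-BH at alpha_ebh."""
    p = X.shape[1]
    e_avg = np.zeros(p)
    for _ in range(M):
        W = compute_W(X, Y)                               # fresh knockoff copy -> W^(m)
        T, _ = knockoff_select(W, alpha_kn)                # per-run threshold T^(m)
        e_avg += p * (W >= T) / (1 + np.sum(W <= -T)) / M  # accumulate e_j^(m) / M
    return ebh(e_avg, alpha_ebh)                           # final selected set
```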

Generalization of $T^{(m)}$

one is free to choose the $m$-th threshold $T^{(m)}$ in a different way, as long as it is still a stopping time with respect to the filtration generated by a masked version of $W^{(m)}$; the FDR guarantee for derandomised knockoffs will still hold.

A uniform improvement on $T^{(m)}$

\[T^{(m)} = \inf\left\{ t > 0: \frac{1+\sum_{j\in [p]}1(W_j^{(m)}\le -t)}{\sum_{k\in [p]}1(W_k^{(m)}\ge t)} \le \alpha_{kn}\;\text{or}\; \sum_{j\in [p]}1(W_j^{(m)}\ge t) < \frac{1}{\alpha_{kn}} \right\}\]
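
as a sketch, this only changes the threshold search inside each run (the helper name is mine, not the paper's code):

```python
import numpy as np

def knockoff_threshold_improved(W, alpha_kn):
    """Stop at the first t where the FDP estimate is <= alpha_kn,
    or where fewer than 1/alpha_kn statistics remain above t."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        n_pos = np.sum(W >= t)
        fdp_hat = (1 + np.sum(W <= -t)) / max(n_pos, 1)
        if fdp_hat <= alpha_kn or n_pos < 1.0 / alpha_kn:
            return t
    return np.inf
```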

Comparison to Ren et al. (2023)

the weight corresponding to the m-th run is given by

\[\frac{1 / (1 + \sum_{k\in [p]}1(W_k^{(m)} \le -T^{(m)}) )}{\sum_{l\in [M]} 1 / (1 + \sum_{k\in [p]} 1(W_k^{(l)} \le -T^{(l)}) ) }\]

Comparison to Dai et al. (2022, 2023)

the proposed procedure uses e-values of the form

\[e_j = \frac{1}{M}\sum_{m=1}^M \text{weight}_j^{(m)} 1(j \in S_{kn}^{(m)})\]

where the weights are chosen as

\[\text{weight}_j^{(m)} = p\cdot \frac{1}{1+\sum_{k\in [p]} 1(W_k^{(m)} \le -T^{(m)})}\]

in contrast, the method in Dai et al. (2022, 2023) can be rewritten as

\[\text{weight}_j^{(m)} = p\cdot \frac{1}{\alpha_{kn} \cdot \sum_{k\in [p]} 1(W_k^{(m)}\ge T^{(m)}) }\]

since the knockoff threshold guarantees $1+\sum_{k\in [p]} 1(W_k^{(m)} \le -T^{(m)}) \le \alpha_{kn} \sum_{k\in [p]} 1(W_k^{(m)}\ge T^{(m)})$, the weights of the proposed method are never smaller; consequently, the proposed method is strictly more powerful.


Choices of parameters

two potentially different parameters $\alpha_{ebh}$ and $\alpha_{kn}$

  • $\alpha_{ebh}$ determines the ultimate FDR guarantee of the procedure
  • $\alpha_{kn}$ does not affect this FDR guarantee but may instead affect the power of the method

Question: should one simply choose $\alpha_{kn} = \alpha_{ebh}$ (particularly given that this choice reduces to the original knockoffs method when $M=1$)?

it turns out that choosing $\alpha_{kn} < \alpha_{ebh}$ is preferable when $M > 1$.

a simple scenario:

suppose that the data contain $s$ non-nulls, each with an extremely strong signal, so that any single run of the knockoff filter is highly likely to select all $s$ non-nulls.

for each run $m$, one expects a false discovery proportion of approximately $\alpha_{kn}$, meaning that one has $\vert S_{kn}^{(m)}\cap H_0\vert \approx \frac{\alpha_{kn}}{1-\alpha_{kn}}s$ false discoveries together with the $\approx s$ true discoveries.

since each non-null $j$ is selected in (nearly) every run of knockoffs, while for a null $j\in H_0$ it may be the case that $j$ is selected for only a small proportion of the runs, we therefore expect to see

\[e_j^{avg}\approx \frac{p}{\frac{\alpha_{kn}}{1-\alpha_{kn}}s}\quad\text{for } j\in H_1, \quad e_j^{avg}\approx 0\quad \text{for } j\in H_0\]

applying the e-BH procedure at level $\alpha_{ebh}$ to these $e$-values, the power can be high only if $\frac{\alpha_{kn}}{1-\alpha_{kn}} \le \alpha_{ebh}$.

in other words, $\alpha_{kn} = \alpha_{ebh}$ will likely lead to zero power but $\alpha_{kn} \le \frac{\alpha_{ebh}}{1+\alpha_{ebh}}$ will lead to (nearly) perfect recovery.
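
a tiny numeric illustration of this heuristic, with made-up values of $p$, $s$ and $\alpha_{ebh}$ (not taken from the paper):

```python
p, s, alpha_ebh = 1000, 50, 0.1
for alpha_kn in (0.10, 0.05):
    e_nonnull = p / (alpha_kn / (1 - alpha_kn) * s)   # approximate e_j^avg for non-nulls
    ebh_bar = p / (alpha_ebh * s)                     # e-BH cutoff when khat = s
    print(alpha_kn, e_nonnull >= ebh_bar)             # False at 0.10, True at 0.05
```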

Derandomisation: a closer look

conditional on the data $(X, Y)$, for each feature $j$, the $e_j^{(m)}$’s are i.i.d. for $m=1,\ldots,M$. by the strong law of large numbers, we then have

\[e_j^{avg} = \frac 1M\sum_{m=1}^M e_j^{(m)} \rightarrow e_j^\infty = \bbE[e_j^{(1)}\mid \bfX, \bfY]\]

almost surely as $M \rightarrow\infty$

let $S_{kn-\infty}$ denote the selected set obtained by applying e-BH to these limiting e-values $e_1^\infty, \ldots, e_p^\infty$.

the selected set $S_{kn-derand}$, obtained by averaging over $M$ runs of knockoffs, should eventually stabilise to this completely derandomised selected set $S_{kn-\infty}$.

Extensions

Fixed-X knockoffs

the fixed-X knockoffs procedure considers a setting where the design matrix $\bfX$ is fixed and the response is generated from a Gaussian linear model

\[Y=X\beta + \varepsilon\]


Robust knockoffs

the validity of the model-X knockoffs framework relies crucially on knowledge of $P_X$, but in practice this distribution might be estimated rather than known exactly.

this section investigates the robustness of the derandomised knockoffs procedure when the knockoff copies are sampled without exact knowledge of $P_X$

Multienvironment knockoffs

write the conditional independence hypothesis in environment $e$ as

\[H_j^{ci, e}: Y^e\ind X_j^e \mid X_{-j}^e\]

finding the associations that are consistently true across all environments is equivalent to testing the following consistent conditional independence hypothesis

\[H_j^{cst}: \exists e\in [E]\text{ such that the null } H_j^{ci, e} \text{ is true}\]

Li et al. (2021) propose a relaxed version that tests for partial consistency

\[H_j^{pcst, r}: \vert \left\{e\in [E]: H_j^{ci, e}\text{ is true}\right\}\vert > E-r\]


Knockoffs with side information

one way is a weighted version, which can be used when one has side information $u_j$ reflecting how likely feature $j$ is to be non-null

another way is the adaptive knockoffs procedure: it sequentially screens the hypotheses in an order determined by the side information and the partially masked feature importance statistics


