Derandomised Knockoffs from E-values
Empirically, it has been observed that the set of features selected by the knockoffs procedure can be highly variable from one run to another, which is potentially an undesirable property.
Ren et al. (2023) propose that, after running knockoffs M times on the given dataset, the procedure computes
\[\Pi_j = \frac 1M \sum_{m=1}^M 1\{j\in S_{kn}^{(m)}\}\]the final selected set is then given by $S_{\text{kn-derand}} = \{j:\Pi_j \ge \eta\}$, i.e. all features $X_j$ exceeding some threshold probability of selection for a random run of knockoffs.
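As a minimal sketch (not code from the paper; the helper name and the representation of the selected sets as index sets are assumptions), this vote-then-threshold rule could look like:

```python
import numpy as np

def derandomise_by_frequency(selected_sets, p, eta=0.5):
    """Keep feature j if it is selected in at least a fraction eta of the M runs."""
    M = len(selected_sets)
    Pi = np.zeros(p)
    for S in selected_sets:          # S_kn^(m): a set of selected feature indices
        Pi[list(S)] += 1.0 / M       # Pi_j = (1/M) * sum_m 1{j in S_kn^(m)}
    return {j for j in range(p) if Pi[j] >= eta}

# e.g. three runs over p = 5 features with selection frequency threshold eta = 0.5
print(derandomise_by_frequency([{0, 2}, {0, 1, 2}, {0, 3}], p=5, eta=0.5))  # {0, 2}
```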
Ren et al. (2023) establish a guaranteed bound on the expected number of false discoveries, but it appears impossible to prove a bound directly on the FDR.
Contribution of this work
With a simple and natural modification to the construction of the derandomised knockoffs procedure, they can restore the guarantee of FDR control.
Specifically, they consider a weighted probability of selection, replacing $\Pi_j$ with
\[\frac 1M\sum_{m=1}^M \text{weight}_j^{(m)} 1\{j \in S_{kn}^{(m)}\}\]informally, choose a lower weight if many features were selected in the $m$-th run.
the method builds on the recent e-BH procedure, which helps find an adaptive, FDR-controlling threshold for the weighted selection probabilities
model-X knockoffs
\[S_{kn} = \{j: W_j \ge T\}\,,\qquad \text{where } T=\inf\left\{t > 0: \frac{1+\sum_{j\in [p]}1(W_j\le -t) }{\sum_{j\in [p]}1(W_j\ge t)} \le \alpha \right\}\]
e-BH procedure
\[S_{ebh} = \left\{j: e_j \ge \frac{p}{\alpha \hat k} \right\}\,, \quad \text{where } \hat k = \max\left\{k\in [p]: e_{(k)}\ge \frac{p}{\alpha k}\right\}\]
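For concreteness, here is a small sketch of the e-BH selection rule (the function name `ebh_select` is just an assumed convenience, not part of any package):

```python
import numpy as np

def ebh_select(e, alpha):
    """e-BH: take k_hat = max{k : e_(k) >= p/(alpha*k)} over the descending-sorted
    e-values, then select every j with e_j >= p/(alpha*k_hat)."""
    e = np.asarray(e, dtype=float)
    p = len(e)
    order = np.sort(e)[::-1]                          # e_(1) >= ... >= e_(p)
    ks = np.arange(1, p + 1)
    valid = np.flatnonzero(order >= p / (alpha * ks))
    if len(valid) == 0:
        return np.array([], dtype=int)                # no valid k: select nothing
    k_hat = valid[-1] + 1
    return np.flatnonzero(e >= p / (alpha * k_hat))

# e.g. p = 10 e-values at level alpha = 0.1: the thresholds p/(alpha*k) are 100, 50, 33.3, ...
print(ebh_select([120, 60, 0, 0, 0, 0, 0, 0, 0, 0], alpha=0.1))  # [0 1]
```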
Knockoffs as an e-BH procedure
the knockoffs procedure is in fact equivalent to a relaxed e-BH procedure with a class of properly defined e-values
\[e_j = p\cdot \frac{1(W_j \ge T)}{1+\sum_{k\in [p]}1(W_k\le -T)}\]Let $S_{kn}$ be the set of selected features for the knockoff procedure, and let $S_{ebh}$ be the set of selected features for the e-BH procedure applied to $e_1,\ldots, e_p$. Then $S_{kn} = S_{ebh}$. {.thm title="Theorem 1"}
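The equivalence is easy to check numerically; the sketch below uses arbitrary toy values of $W$ (not statistics from a real knockoff fit), computes $T$ and the e-values exactly as above, and compares the two selected sets:

```python
import numpy as np

rng = np.random.default_rng(0)
p, alpha = 200, 0.2
# toy feature statistics: 20 strongly positive W's plus symmetric noise
W = np.concatenate([rng.uniform(3.0, 5.0, size=20), rng.normal(0.0, 1.0, size=p - 20)])

# knockoff threshold: smallest t (over the observed |W_j|) with estimated FDP <= alpha
T = np.inf
for t in np.sort(np.abs(W)):
    if t > 0 and (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= alpha:
        T = t
        break
S_kn = set(np.flatnonzero(W >= T))

# the e-values defined above: zero unless W_j clears the threshold
e = p * (W >= T) / (1 + np.sum(W <= -T))

# e-BH at the same level alpha (same rule as in the previous sketch)
order, ks = np.sort(e)[::-1], np.arange(1, p + 1)
valid = np.flatnonzero(order >= p / (alpha * ks))
S_ebh = set() if len(valid) == 0 else set(np.flatnonzero(e >= p / (alpha * (valid[-1] + 1))))

print(S_kn == S_ebh)  # True: knockoffs and e-BH select exactly the same features
```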
Derandomising knockoffs
generate $M$ copies of the knockoff matrix
choosing some $\alpha_{kn}\in (0, 1)$, compute the threshold
\[T^{(m)} = \inf\{t>0: \frac{1+\sum_{j\in [p]}1(W_j^{(m)}\le -t)}{\sum_{k\in [p]}1(W_k^{(m)} \ge t) }\le \alpha_{kn} \}\]let
\[e_j^{(m)} = p\cdot \frac{1(W_j^{(m)}\ge T^{(m)})}{1+\sum_{k\in [p]}1(W_k^{(m)} \le -T^{(m)})}\]Aggregate the e-values by taking the average
\[e_j^{avg} = \frac 1M\sum_{m=1}^M e_j^{(m)}\]
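Putting the steps together, a sketch of the aggregation stage might look as follows; the `run_knockoffs` callback, which is assumed to draw a fresh knockoff copy and return the statistics $W^{(m)}$, stands in for the model- and data-dependent part of the procedure:

```python
import numpy as np

def knockoff_evalues(W, alpha_kn):
    """One run: compute T^(m) from the statistics W^(m), then the e-values e_j^(m)."""
    p = len(W)
    T = np.inf
    for t in np.sort(np.abs(W)):            # candidate thresholds are the observed |W_j|
        if t > 0 and (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= alpha_kn:
            T = t
            break
    return p * (W >= T) / (1 + np.sum(W <= -T))

def ebh_select(e, alpha):
    """e-BH applied to e-values e_1, ..., e_p at level alpha (as defined earlier)."""
    p = len(e)
    order, ks = np.sort(e)[::-1], np.arange(1, p + 1)
    valid = np.flatnonzero(order >= p / (alpha * ks))
    return np.array([], dtype=int) if len(valid) == 0 else np.flatnonzero(e >= p / (alpha * (valid[-1] + 1)))

def derandomised_knockoffs(run_knockoffs, M, alpha_kn, alpha_ebh):
    """run_knockoffs(m) must return the statistics W^(m) computed on the m-th knockoff copy."""
    e_avg = np.mean([knockoff_evalues(run_knockoffs(m), alpha_kn) for m in range(M)], axis=0)
    return ebh_select(e_avg, alpha_ebh)
```

Note that $\alpha_{kn}$ enters only through the per-run thresholds, while $\alpha_{ebh}$ sets the level of the final e-BH step; the choice of these two parameters is discussed below.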
Generalization of $T^{(m)}$
one is free to choose the $m$-th threshold $T^{(m)}$ in a different way, as long as it is still a stopping time with respect to the filtration generated by a masked version of $W^{(m)}$; the FDR guarantee for derandomised knockoffs will still hold.
A uniform improvement on $T^{(m)}$
\[T^{(m)} = \inf\left\{ t > 0: \frac{1+\sum_{j\in [p]}1(W_j^{(m)}\le -t)}{\sum_{k\in [p]}1(W_k^{(m)}\ge t)} \le \alpha_{kn}\;\text{or}\; \sum_{j\in [p]}1(W_j^{(m)}\ge t) < \frac{1}{\alpha_{kn}} \right\}\]
Comparison to Ren et al. (2023)
the weight corresponding to the $m$-th run is given by
\[\frac{1 / (1 + \sum_{k\in [p]}1(W_k^{(m)} \le -T^{(m)}) )}{\sum_{l\in [M]} 1 / (1 + \sum_{k\in [p]} 1(W_k^{(l)} \le -T^{(l)}) ) }\]
Comparison to Dai et al. (2022, 2023)
the proposed procedure uses e-values of the form
\[e_j = \frac{1}{M}\sum_{m=1}^M \text{weight}_j^{(m)} 1(j \in S_{kn}^{(m)})\]where the weights are chosen as
\[\text{weight}_j^{(m)} = p\cdot \frac{1}{1+\sum_{k\in [p]} 1(W_k^{(m)} \le -T^{(m)})}\]in contrast, the method in Dai et al. (2022, 2023) can be rewritten as
\[\text{weight}_j^{(m)} = p\cdot \frac{1}{\alpha_{kn} \cdot \sum_{k\in [p]} 1(W_k^{(m)}\ge T^{(m)}) }\]by the definition of $T^{(m)}$, we have $1+\sum_{k\in [p]} 1(W_k^{(m)}\le -T^{(m)}) \le \alpha_{kn}\sum_{k\in [p]} 1(W_k^{(m)}\ge T^{(m)})$, so each proposed weight is at least as large as the corresponding weight of Dai et al. (2022, 2023); consequently, the proposed method is strictly more powerful.
Choices of parameters
The procedure involves two potentially different parameters, $\alpha_{ebh}$ and $\alpha_{kn}$:
- $\alpha_{ebh}$ determines the ultimate FDR guarantee of the procedure
- $\alpha_{kn}$ does not affect this FDR guarantee but may instead affect the power of the method
Question: should one simply choose $\alpha_{kn} = \alpha_{ebh}$ (particularly given that this choice reduces to the original knockoffs method when $M=1$)?
it turns out that choosing $\alpha_{kn} < \alpha_{ebh}$ is preferable when $M > 1$.
a simple scenario:
suppose that the data contain $s$ non-nulls, each of which has an extremely strong signal, so that any single run of the knockoff filter is highly likely to select all $s$ non-nulls.
for each run $m$, one expects a false discovery proportion of approximately $\alpha_{kn}$, meaning roughly $\vert S_{kn}^{(m)}\cap H_0\vert \approx \frac{\alpha_{kn}}{1-\alpha_{kn}}s$ false discoveries alongside the $\approx s$ true discoveries.
since each non-null $j$ is selected in (nearly) every run of knockoffs, while a null $j\in H_0$ may be selected in only a small proportion of the runs, we expect to see
\[e_j^{avg}\approx \frac{p}{\frac{\alpha_{kn}}{1-\alpha_{kn}}s}\quad\text{for } j\in H_1, \quad e_j^{avg}\approx 0\quad \text{for } j\in H_0\]applying the e-BH procedure at level $\alpha_{ebh}$ to these $e$-values, the power can be high only if $\frac{\alpha_{kn}}{1-\alpha_{kn}} \le \alpha_{ebh}$.
in other words, $\alpha_{kn} = \alpha_{ebh}$ will likely lead to zero power, while $\alpha_{kn} \le \frac{\alpha_{ebh}}{1+\alpha_{ebh}}$ will lead to (nearly) perfect recovery.
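As a back-of-the-envelope check of this scenario, one can plug the idealised e-values into the e-BH rule with assumed values $p = 1000$, $s = 50$ and $\alpha_{ebh} = 0.1$:

```python
import numpy as np

def ebh_select(e, alpha):
    # e-BH as before: k_hat = max{k : e_(k) >= p/(alpha*k)}, then threshold at p/(alpha*k_hat)
    p, order, ks = len(e), np.sort(e)[::-1], np.arange(1, len(e) + 1)
    valid = np.flatnonzero(order >= p / (alpha * ks))
    return np.array([], dtype=int) if len(valid) == 0 else np.flatnonzero(e >= p / (alpha * (valid[-1] + 1)))

p, s, alpha_ebh = 1000, 50, 0.1
for alpha_kn in [alpha_ebh, alpha_ebh / 2]:
    e = np.zeros(p)
    e[:s] = p / (alpha_kn / (1 - alpha_kn) * s)   # idealised e-values for the s non-nulls
    print(f"alpha_kn = {alpha_kn:.3f}: {len(ebh_select(e, alpha_ebh))} of {s} non-nulls selected")
# alpha_kn = 0.100 selects 0 of 50, while alpha_kn = 0.050 recovers all 50
```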
Derandomisation: a closer look
conditional on the data $(X, Y)$, the $e_j^{(m)}$'s are i.i.d. across $m=1,\ldots,M$ for each feature $j$. By the strong law of large numbers, we then have
\[e_j^{avg} = \frac 1M\sum_{m=1}^M e_j^{(m)} \rightarrow e_j^\infty = \mathbb{E}[e_j^{(1)}\mid X, Y]\]almost surely as $M \rightarrow\infty$.
let $S_{kn-\infty}$ denote the selected set obtained by applying e-BH to these limiting e-values $e_1^\infty, \ldots, e_p^\infty$.
the selected set $S_{\text{kn-derand}}$, obtained by averaging over $M$ runs of knockoffs, should eventually stabilise to this completely derandomised selected set $S_{kn-\infty}$.
Extensions
Fixed-X knockoffs
the fixed-X knockoffs procedure considers a setting where the design matrix $X$ is fixed and the response is generated from a Gaussian linear model
\[Y=X\beta + \varepsilon\]
Robust knockoffs
the validity of the model-X knockoffs framework relies crucially on knowledge of $P_X$, but in practice this distribution might be estimated rather than known exactly.
this section investigates the robustness of the derandomised knockoffs procedure when the knockoff copies are sampled without exact knowledge of $P_X$
Multienvironment knockoffs
write the conditional independence hypothesis in environment $e$ as
\[H_j^{ci, e}: Y^e\perp\!\!\!\perp X_j^e \mid X_{-j}^e\]finding the associations that are consistently present across all the environments is equivalent to testing the following consistent conditional independence hypothesis
\[H_j^{cst}: \exists e\in [E]\text{ such that the null } H_j^{ci, e} \text{ is true}\]Li et al. (2021) propose a relaxed version that tests for partial consistency
\[H_j^{pcst, r}: \vert \left\{e\in [E]: H_j^{ci, e}\text{ is true}\right\}\vert > E-r\]
Knockoffs with side information
one way is a weighted version, which can be used if one has side information $u_j$ reflecting how likely feature $j$ is to be a non-null
another way is the adaptive knockoffs procedure: it sequentially screens the hypotheses in an order determined by the side information and the partially masked feature importance statistics