# FDR Control via Data Splitting


This note is for Dai, C., Lin, B., Xing, X., & Liu, J. S. (2020). False Discovery Rate Control via Data Splitting. ArXiv:2002.08542 [Stat].

Existing approaches to FDR control:

- based on the Benjamini-Hochberg (BHq) procedure, which requires p-values for individual features
- based on the “knockoff filtering” idea: does not require p-values for individual features, and achieves FDR control by creating “knockoff” features, in a similar spirit to adding spike-in controls in biological experiments.
  - fixed-design knockoff filter
  - model-X knockoff filter

The paper proposes an FDR control framework based on data splitting.

In high-dimensional regressions, it can be challenging to either construct valid p-values (even asymptotically) or estimate accurately the joint distribution of features, thus limiting the applicability of both BHq and the model-X knockoff filter. **The data-splitting framework proposed here appears to fill this gap.**

- split the data into two halves
- apply a (potentially different) statistical learning procedure to each half of the data
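
The two steps above can be sketched in a few lines of Python. This is only an illustration: the per-feature marginal least-squares slope used here is a stand-in I chose for simplicity, not the estimator used in the paper, and the names `split_halves` and `marginal_slopes` as well as the toy data are hypothetical.

```python
import random

def split_halves(n, seed=0):
    """Randomly partition the row indices 0..n-1 into two disjoint halves."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]

def marginal_slopes(X, y, rows):
    """Stand-in estimator (an assumption, not the paper's choice):
    least-squares slope of y on each single feature, using `rows` only."""
    m = len(rows)
    y_mean = sum(y[i] for i in rows) / m
    slopes = []
    for j in range(len(X[0])):
        x_mean = sum(X[i][j] for i in rows) / m
        sxy = sum((X[i][j] - x_mean) * (y[i] - y_mean) for i in rows)
        sxx = sum((X[i][j] - x_mean) ** 2 for i in rows)
        slopes.append(sxy / sxx if sxx > 0 else 0.0)
    return slopes

# Toy data: only feature 0 is relevant.
rng = random.Random(1)
n, p = 200, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] + rng.gauss(0, 1) for row in X]

first, second = split_halves(n)
beta1 = marginal_slopes(X, y, first)   # importance measured on half 1
beta2 = marginal_slopes(X, y, second)  # independent measurement on half 2
```

Because the two halves share no rows, `beta1` and `beta2` are computed from independent samples, which is the property the framework relies on.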

Most existing data-splitting (DS) methods split the data in order to handle high dimensionality. In contrast, the paper uses data splitting to obtain two independent measurements of the importance of each feature. FDR control is then achieved by constructing a proper test statistic for each feature based on these two measurements.

The number of false positives is estimated with a strategy similar to that of the knockoff filter. Main idea: construct a test statistic $M_j$ for each feature $X_j$, referred to as the “mirror statistic”, with the following two key properties:

- a feature with a larger mirror statistic is more likely to be a relevant feature
- the sampling distribution of the mirror statistic of any null feature is symmetric about 0.

Multiple data splitting (MDS) is built upon multiple independent replications of DS, aiming at reducing the variability of the selection result. Instead of ranking features by their mirror statistics, MDS ranks features by their inclusion rates, which are selection frequencies adjusted by selection sizes, computed over the multiple DS replications.
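
A rough sketch of the aggregation step is given below. It assumes the inclusion rate of feature $j$ is the average over replications of $\mathbb 1(j \in \hat S^{(k)}) / (\vert \hat S^{(k)}\vert \vee 1)$, and that features are dropped from the smallest rate upward for as long as the dropped rates sum to at most $q$. Both the cutoff rule and the edge-case convention are my recollection of the paper's algorithm and should be checked against it; the function names and toy selections are hypothetical.

```python
def inclusion_rates(selections, p):
    """Average, over DS replications, of 1(j selected) / (selection size)."""
    m = len(selections)
    rates = [0.0] * p
    for sel in selections:
        size = max(len(sel), 1)
        for j in sel:
            rates[j] += 1.0 / (size * m)
    return rates

def mds_select(selections, p, q):
    """Keep features whose rate exceeds the largest dropped rate."""
    rates = inclusion_rates(selections, p)
    cumulative, cutoff = 0.0, 0.0  # cutoff stays 0.0 if nothing can be dropped
    for r in sorted(rates):
        cumulative += r
        if cumulative > q:
            break
        cutoff = r
    return rates, [j for j in range(p) if rates[j] > cutoff]

# Four DS replications on p = 6 features (selected index sets).
selections = [[0, 1, 2], [0, 1], [0, 1, 3], [0, 1, 2, 4]]
rates, selected = mds_select(selections, p=6, q=0.2)
print(selected)  # [0, 1, 2]
```

Dividing by the selection size means a feature picked in a small selection counts for more than one picked in a large selection, which is the "adjusted by selection sizes" part of the description.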

Assumption 1 (Symmetry) For each null feature index $j\in S_0$, the sampling distribution of either $\hat\beta_j^{(1)}$ or $\hat\beta_j^{(2)}$ is symmetric about 0.

A general form of the mirror statistic $M_j$ is

\[M_j = \operatorname{sign}(\hat\beta_j^{(1)}\hat\beta_j^{(2)})\, f(\vert \hat\beta_j^{(1)}\vert, \vert \hat\beta_j^{(2)}\vert)\]

where the function $f(u, v)$ is non-negative, symmetric in $u$ and $v$ (i.e., $f(u, v) = f(v, u)$), and monotonically increasing in both $u$ and $v$.
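
To make the formula concrete, here is a small sketch that computes mirror statistics from two coefficient vectors. The three candidate functions $f(u,v) = 2\min(u,v)$, $uv$, and $u+v$ are, as far as I recall, the examples discussed in the paper; they are not listed in this note, so treat them as illustrative. The helper names and toy coefficients are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

# Candidate f(u, v): each is non-negative, symmetric, and increasing for u, v >= 0.
F_CHOICES = {
    "2min": lambda u, v: 2 * min(u, v),
    "product": lambda u, v: u * v,
    "sum": lambda u, v: u + v,
}

def mirror_statistics(beta1, beta2, f):
    """M_j = sign(beta1_j * beta2_j) * f(|beta1_j|, |beta2_j|)."""
    return [sign(b1 * b2) * f(abs(b1), abs(b2)) for b1, b2 in zip(beta1, beta2)]

# Toy estimates from the two halves of the data.
beta1 = [1.5, -0.2, 0.0, -2.0]
beta2 = [1.0, 0.3, 0.4, -1.0]
mirror = mirror_statistics(beta1, beta2, F_CHOICES["sum"])
```

Features 0 and 3 have estimates that agree in sign and are large in magnitude, so their mirror statistics are large and positive; feature 1 has estimates of opposite sign and gets a negative value; feature 2 has a zero estimate in one half and gets $M_j = 0$.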

Note to self: the FDP estimate is an overestimate (!!!). What is the gap?