FDR Control under General Dependence via Symmetrization
the paper develops a new class of distribution-free multiple testing rules for FDR control under general dependence
a key element is a symmetrized data aggregation (SDA) approach to incorporating the dependence structure via sample splitting, data screening, and information pooling
the proposed SDA filter
- first constructs a sequence of ranking statistics that fulfill global symmetry properties
- then chooses a data-driven threshold along the ranking to control the FDR
the SDA filter substantially outperforms the Knockoff method in power under moderate to strong dependence, and is more robust than existing methods based on asymptotic p-values
they
- first develop finite-sample theories to provide an upper bound for the actual FDR under general dependence
- then establish the asymptotic validity of SDA for both FDR and FDP control under mild regularity conditions
conventional FDR procedures, such as the BH procedure, adaptive p-value procedure (Benjamini and Hochberg 1997) and adaptive z-value procedure based on local FDR (Efron et al. 2001; Sun and Cai 2007), are developed under the assumption that the test statistics are independent
FDR control under dependence is a critical problem that requires much research. Two key issues:
- how the dependence may affect existing FDR methods
- how to properly incorporate the dependence structure into inference
FDR Control under dependence
the impact of dependence on FDR analysis was first investigated by Benjamini and Yekutieli (2001),
- who showed that the BH procedure, when run at the adjusted level $\frac{\alpha}{\sum_{j=1}^p1/j}$, controls the FDR at level $\alpha$ under arbitrary dependence among the p-values
- however, this adjustment is often too conservative in practice
- they further proved that applying BH without any adjustment is valid for FDR control for correlated tests satisfying the PRDS (positive regression dependence on a subset) property
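To make the adjustment concrete, here is a minimal Python sketch of the BH step-up rule with an optional Benjamini-Yekutieli correction. The function name `bh_reject`, its arguments, and the toy p-values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bh_reject(pvals, alpha=0.1, arbitrary_dependence=False):
    """Boolean rejection mask from the BH step-up procedure.

    If arbitrary_dependence is True, alpha is divided by sum_{j=1}^p 1/j
    (the Benjamini-Yekutieli adjustment) before running BH.
    """
    pvals = np.asarray(pvals, dtype=float)
    p = pvals.size
    if arbitrary_dependence:
        alpha = alpha / np.sum(1.0 / np.arange(1, p + 1))
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / p.
    below = sorted_p <= alpha * np.arange(1, p + 1) / p
    reject = np.zeros(p, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # zero-based index of that rank
        reject[order[: k + 1]] = True
    return reject

# Toy p-values (made up for illustration)
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
n_bh = int(bh_reject(pvals, alpha=0.05).sum())
n_by = int(bh_reject(pvals, alpha=0.05, arbitrary_dependence=True).sum())
print(n_bh, n_by)
```

On this toy input the adjusted version rejects fewer hypotheses than plain BH, which illustrates the conservativeness noted above.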
Sarkar (2002):
- showed that the FDR control theory under positive dependence holds for a generalized class of step-wise methods
Storey, Taylor, and Siegmund (2004):
- showed that, in the asymptotic sense, BH is valid under weak dependence
Wu (2008):
- showed that, in the asymptotic sense, BH is valid under Markovian dependence
Clarke and Hall (2009):
- showed that, in the asymptotic sense, BH is valid under linear process models
Although controlling FDR does not always require independence, some key quantities in FDR analysis, such as the expectation and variance of the number of false positives, may possess substantially different properties under dependence (Owen 2005; Finner, Dickhaus, and Roters, 2007)
- this implies that conventional FDR methods such as BH can suffer from low power and high variability under strong dependence
Efron (2007) and Schwartzman and Lin (2011):
- showed that strong correlations degrade the accuracy in both estimation and testing
- in particular, positive/negative correlations can make the empirical null distributions of z-values narrower/wider, which has substantial impact on subsequent FDR analyses
high correlations can be exploited to aggregate weak signals from individuals to increase the signal-to-noise ratio (SNR)
- hence, informative dependence structures can become a blessing for FDR analysis
Benjamini and Heller (2007), Sun and Cai (2009), Sun and Wei (2011):
- showed that incorporating functional, spatial, and temporal correlations into inference can improve the power and interpretability of existing methods
- however, these methods are not applicable to general dependence structures
Efron (2007), Efron (2010), and Fan, Han, and Gu (2012)
- discussed how to obtain more accurate FDR estimates by taking into account arbitrary dependence
Leek and Storey (2008), Friguet, Kloareg, and Causeur (2009), Fan, Han, and Gu (2012)
- showed that the overall dependence can be much weakened by subtracting the common factors out,
- and factor-adjusted p-values can be employed to construct more powerful FDR procedures
Hall and Jin (2010), Jin (2012), and Li and Zhong (2017):
- showed that, under both the global testing and multiple testing contexts, the covariance structures can be used, via transformation, to construct test statistics with increased SNR
However, the above methods, e.g., Fan and Han (2017), Li and Zhong (2017):
- rely heavily on the accuracy of estimated models and the asymptotic normality of the test statistics
- in the finite-sample setting, poor estimates of model parameters or violations of the normality assumption may lead to less powerful or even invalid FDR procedures
this paper aims to develop a robust and assumption-lean method that effectively controls the FDR under general dependence with much improved power
Model and Problem Formulation
- $\xi_i = (\xi_{i1}, \ldots, \xi_{ip})^\top, i=1,\ldots, n$, follows a multivariate distribution with mean $\mu = (\mu_1,\ldots, \mu_p)^\top$ and covariance matrix $\Sigma$
- the problem of interest is to test $p$ hypotheses simultaneously
the summary statistic $\bar\xi = \frac 1n \sum_{i=1}^n \xi_i$ obeys a multivariate normal (MVN) model asymptotically
\[\bar \xi \overset{d}{\approx} MVN(\mu, n^{-1}\Sigma)\]
denote by $\Omega = \Sigma^{-1}$ the precision matrix. first assume that $\Omega$ is known; the case of an unknown precision matrix is discussed later
the problem of multiple testing under dependence can be recast as a variable selection problem in linear regression. Specifically, by taking a “whitening” transformation, the model becomes
\[Y = X\mu + \varepsilon, \quad \varepsilon\overset{d}{\approx} MVN(0, n^{-1}I)\]
where $Y = \Omega^{1/2}\bar\xi$ is the pseudo response and $X = \Omega^{1/2}$ is the design matrix
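The whitening step can be checked numerically. The sketch below assumes a known, symmetric positive definite $\Omega$ and uses a toy AR(1)-type covariance; the helper name `whiten` and the toy values are illustrative assumptions, not from the paper.

```python
import numpy as np

def whiten(xi_bar, omega):
    """Return (Y, X) with X = Omega^{1/2} and Y = Omega^{1/2} xi_bar.

    Assumes omega is symmetric positive definite, so its symmetric
    square root can be taken from the eigendecomposition.
    """
    eigvals, eigvecs = np.linalg.eigh(omega)
    omega_half = (eigvecs * np.sqrt(eigvals)) @ eigvecs.T
    return omega_half @ xi_bar, omega_half

# Toy AR(1)-type covariance: Sigma_{jk} = rho^{|j-k|} (illustrative choice)
p, rho = 5, 0.6
idx = np.arange(p)
sigma = rho ** np.abs(np.subtract.outer(idx, idx))
omega = np.linalg.inv(sigma)

xi_bar = np.arange(p, dtype=float)  # stand-in for the sample mean
Y, X = whiten(xi_bar, omega)

# After whitening, the noise covariance X (Sigma / n) X^T is I / n,
# so X Sigma X^T should be the identity matrix.
print(np.allclose(X @ sigma @ X.T, np.eye(p)))
```

The symmetric square root is only one valid choice; any matrix $X$ with $X^\top X = \Omega$ (for example a Cholesky factor) would also decorrelate the noise.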
the connection between model selection and FDR was discussed in Abramovich et al. (2006) and Bogdan et al. (2015), respectively, under the normal means model and regression model with orthogonal designs
let
- $\theta_j = I(\mu_j \neq 0)$
- $\delta_j \in \{0, 1\}$: a decision, where $\delta_j = 1$ indicates that $H_j^0$ is rejected
- $\cA = \{j:\mu_j \neq 0\}$: non-null set
define the FDP and true discovery proportion
\[FDP = \frac{\sum_{j=1}^p (1-\theta_j)\delta_j}{(\sum_{j=1}^p\delta_j) \lor 1},\quad TDP = \frac{\sum_{j=1}^p\theta_j\delta_j}{(\sum_{j=1}^p\theta_j)\lor 1}\]
then $FDR=\bbE[FDP]$ and the average power is $AP = \bbE[TDP]$
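As a quick check of the two definitions, here is a small Python sketch that computes FDP and TDP from the indicator vectors $\theta$ and $\delta$. The function name `fdp_tdp` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def fdp_tdp(theta, delta):
    """Compute (FDP, TDP) from truth indicators theta and decisions delta."""
    theta = np.asarray(theta, dtype=int)
    delta = np.asarray(delta, dtype=int)
    # The "v 1" in the denominators avoids division by zero when nothing
    # is rejected (FDP) or when there are no non-nulls (TDP).
    fdp = np.sum((1 - theta) * delta) / max(delta.sum(), 1)
    tdp = np.sum(theta * delta) / max(theta.sum(), 1)
    return fdp, tdp

# Toy example: 3 non-nulls, 3 rejections, one of them false
theta = [1, 1, 1, 0, 0, 0, 0, 0]
delta = [1, 1, 0, 1, 0, 0, 0, 0]
fdp, tdp = fdp_tdp(theta, delta)
print(fdp, tdp)
```

Averaging these two quantities over repeated simulated datasets gives Monte Carlo estimates of the FDR and the average power.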
FDR control by symmetrized data aggregation
- split the sample into two parts, both of which are used to construct statistics that assess the evidence against the null
- aggregate the two statistics to form a new ranking statistic fulfilling symmetry properties
- choose a threshold along the ranking by using the symmetry property
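The last step can be sketched in Python. These notes do not state the exact thresholding rule, so the sketch assumes a knockoff-style rule: take the smallest $t > 0$ at which the count of $W_j \le -t$ (plus one), divided by the count of $W_j \ge t$, falls below $\alpha$. The idea is that, for nulls, symmetry makes the negative tail a proxy for the number of false positives in the positive tail. The function name `symmetric_threshold` and the toy statistics are hypothetical.

```python
import numpy as np

def symmetric_threshold(W, alpha=0.1):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= alpha.

    Assumed knockoff-style rule; returns np.inf if no candidate t qualifies.
    """
    W = np.asarray(W, dtype=float)
    # Only the observed nonzero magnitudes need to be checked.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return float(t)
    return np.inf

# Toy ranking statistics (made up): large positive values mimic signals,
# small values of either sign mimic nulls.
W = np.array([5.1, 4.2, -0.3, 3.9, 0.2, -0.5, 2.8, 3.3, 0.1, 4.7, 2.5, 3.0])
t_hat = symmetric_threshold(W, alpha=0.2)
selected = np.nonzero(W >= t_hat)[0]
print(t_hat, selected)
```

The sketch starts from a given vector of ranking statistics; constructing $W_j$ from the two sample halves is the part specific to SDA and is not reproduced here.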
Connections to existing works
- classical sample splitting first divides the data into two independent parts, then uses one part to narrow down the focus, and finally uses the remainder to perform the inference tasks
- these ideas have a common theme with covariate-assisted multiple testing, where the primary statistic plays the key role to assess the significance while the side information plays an auxiliary role to assist inference
- SDA is inspired by knockoffs
Finite-Sample Theory on FDR Control
- $S = \{j: \hat\mu_{1j}\neq 0\}$
- $W_S = (W_j, j\in S)$
- $W_{-j} = W_S \setminus \{W_j\}$
the key quantity that controls the upper bound is
\[\Delta_j = \vert \Pr(W_j > 0\mid \vert W_j\vert, W_{-j}) - 1/2\vert\]