WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

FDR Control under General Dependence via Symmetrization

Tags: False Discovery Rate, Data Splitting, Knockoffs, Federated Learning

This note is for Du, L., Guo, X., Sun, W., & Zou, C. (2023). False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. Journal of the American Statistical Association, 118(541), 607–621. https://doi.org/10.1080/01621459.2021.1945459

develop a new class of distribution-free multiple testing rules for FDR control under general dependence

a key element is a symmetrized data aggregation (SDA) approach to incorporating the dependence structure via sample splitting, data screening, and information pooling

the proposed SDA filter

  1. first constructs a sequence of ranking statistics that fulfill global symmetry properties
  2. then chooses a data-driven threshold along the ranking to control the FDR

the SDA filter substantially outperforms the Knockoff method in power under moderate to strong dependence, and is more robust than existing methods based on asymptotic p-values

they

  1. first develop finite-sample theories to provide an upper bound for the actual FDR under general dependence
  2. then establish the asymptotic validity of SDA for both the FDR and FDP control under mild regularity conditions

conventional FDR procedures, such as the BH procedure, adaptive p-value procedure (Benjamini and Hochberg 1997) and adaptive z-value procedure based on local FDR (Efron et al. 2001; Sun and Cai 2007), are developed under the assumption that the test statistics are independent

FDR control under dependence is a critical problem that calls for further research. Two key issues:

  • how the dependence may affect existing FDR methods
  • how to properly incorporate the dependence structure into inference

FDR Control under dependence

the impact of dependence on FDR analysis was first investigated by Benjamini and Yekutieli (2001),

  • who showed that the BH procedure, when run at the adjusted level $\frac{\alpha}{\sum_{j=1}^p 1/j}$, controls the FDR at level $\alpha$ under arbitrary dependence among the p-values
  • however, this adjustment is often too conservative in practice
  • they further proved that applying BH without any adjustment is valid for FDR control for correlated tests satisfying the PRDS (positive regression dependence on a subset) property
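The BY adjustment is easy to compute; a quick sketch (the function name is mine, for illustration):

```python
import numpy as np

def by_adjusted_level(alpha, p):
    """Benjamini-Yekutieli adjustment: divide the target level by the
    harmonic sum H_p = sum_{j=1}^p 1/j, so that BH run at the adjusted
    level controls the FDR under arbitrary dependence."""
    harmonic = np.sum(1.0 / np.arange(1, p + 1))
    return alpha / harmonic

# The harmonic sum grows like log(p), so the effective level shrinks
# quickly: for p = 1000 tests at alpha = 0.05, BH must be run at about
# 0.05 / 7.49, i.e., roughly 0.0067.
print(by_adjusted_level(0.05, 1000))
```

The rapid shrinkage of the adjusted level is exactly why this correction is often too conservative in practice.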

Sarkar (2002):

  • showed that the FDR control theory under positive dependence holds for a generalized class of step-wise methods

Storey, Taylor, and Siegmund (2004):

  • showed that, in the asymptotic sense, BH is valid under weak dependence

Wu (2008):

  • showed that, in the asymptotic sense, BH is valid under Markovian dependence

Clarke and Hall (2009):

  • showed that, in the asymptotic sense, BH is valid under linear process models

Although controlling FDR does not always require independence, some key quantities in FDR analysis, such as the expectation and variance of the number of false positives, may possess substantially different properties under dependence (Owen 2005; Finner, Dickhaus, and Roters, 2007)

  • this implies that conventional FDR methods such as BH can suffer from low power and high variability under strong dependence

Efron (2007) and Schwartzman and Lin (2011):

  • showed that strong correlations degrade the accuracy in both estimation and testing
  • in particular, positive/negative correlations can make the empirical null distributions of z-values narrower/wider, which has substantial impact on subsequent FDR analyses

high correlations can be exploited to aggregate weak signals from individuals to increase the signal-to-noise ratio (SNR)

  • hence, informative dependence structures can become a blessing for FDR analysis

Benjamini and Heller (2007), Sun and Cai (2009), Sun and Wei (2011):

  • showed that incorporating functional, spatial, and temporal correlations into inference can improve the power and interpretability of existing methods
  • however, these methods are not applicable to general dependence structures

Efron (2007), Efron (2010), and Fan, Han, and Gu (2012)

  • discussed how to obtain more accurate FDR estimates by taking into account arbitrary dependence

Leek and Storey (2008), Friguet, Kloareg, and Causeur (2009), Fan, Han, and Gu (2012)

  • showed that the overall dependence can be much weakened by subtracting the common factors out,
  • and factor-adjusted p-values can be employed to construct more powerful FDR procedures

Hall and Jin (2010), Jin (2012), and Li and Zhong (2017):

  • showed that, under both the global testing and multiple testing contexts, the covariance structures can be used, via transformation, to construct test statistics with increased SNR

However, the above methods, e.g., Fan and Han (2017), Li and Zhong (2017):

  • rely heavily on the accuracy of estimated models and the asymptotic normality of the test statistics
  • under the finite-sample setting, poor estimates of model parameters or violations of normality assumption may lead to less powerful and even invalid FDR procedures

this paper aims to develop a robust and assumption-lean method that effectively controls the FDR under general dependence with much improved power

Model and Problem Formulation

  • $\xi_i = (\xi_{i1}, \ldots, \xi_{ip})^\top, i=1,\ldots, n$, follows a multivariate distribution with mean $\mu = (\mu_1,\ldots, \mu_p)^\top$ and covariance matrix $\Sigma$
  • the problem of interest is to test $p$ hypotheses simultaneously
\[H_j^0: \mu_j = 0\text{ versus } H_j^1: \mu_j\neq 0, \quad \text{for } j =1,\ldots, p\]

the summary statistic $\bar\xi = \frac 1n \sum_{i=1}^n \xi_i$ obeys a multivariate normal (MVN) model asymptotically

\[\bar \xi \overset{d}{\approx} MVN(\mu, n^{-1}\Sigma)\]
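As a concrete setup, a minimal simulation of this model (the AR(1) covariance, sample sizes, and sparse mean below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50

# AR(1) covariance as a stand-in for a general dependence structure
# (an assumption for illustration; the paper allows arbitrary Sigma).
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# Sparse mean: a handful of nonnull coordinates.
mu = np.zeros(p)
mu[:5] = 0.5

# n iid rows xi_i ~ MVN(mu, Sigma); the summary statistic is the row
# mean, which here is exactly MVN(mu, Sigma / n).
xi = rng.multivariate_normal(mu, Sigma, size=n)
xi_bar = xi.mean(axis=0)
```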

Denote by $\Omega = \Sigma^{-1}$ the precision matrix. First assume that $\Omega$ is known; the case of an unknown precision matrix is discussed later

the problem of multiple testing under dependence can be recast as a variable selection problem in linear regression. Specifically, by taking a “whitening” transformation, the model becomes

\[Y = X\mu + \varepsilon, \quad \varepsilon\overset{d}{\approx} MVN(0, n^{-1}I)\]

where $Y = \Omega^{1/2}\bar\xi$ is the pseudo response and $X = \Omega^{1/2}$ is the design matrix
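The whitening step can be sketched as follows, assuming $\Sigma$ is known and positive definite, and taking $\Omega^{1/2}$ to be the symmetric square root $\Sigma^{-1/2}$:

```python
import numpy as np

def whiten(xi_bar, Sigma):
    """Turn the MVN(mu, Sigma/n) summary statistic into the linear
    model Y = X mu + eps with X = Omega^{1/2} and eps ~ MVN(0, I/n),
    where Omega = Sigma^{-1}.  A sketch assuming Sigma is known;
    Omega^{1/2} is computed via the eigendecomposition of Sigma."""
    vals, vecs = np.linalg.eigh(Sigma)
    omega_half = vecs @ np.diag(vals ** -0.5) @ vecs.T  # Omega^{1/2}
    Y = omega_half @ xi_bar   # pseudo response
    X = omega_half            # design matrix
    return Y, X

# Example: equicorrelated Sigma.
p = 10
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
Y, X = whiten(np.ones(p), Sigma)
# Sanity check: whitening makes X @ Sigma @ X.T the identity.
print(np.allclose(X @ Sigma @ X.T, np.eye(p)))  # prints True
```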

the connection between model selection and FDR was discussed in Abramovich et al. (2006) and Bogdan et al. (2015), respectively, under the normal means model and regression model with orthogonal designs

let

  • $\theta_j = I(\mu_j \neq 0)$
  • $\delta_j \in \{0, 1\}$: a decision, where $\delta_j = 1$ indicates that $H_j^0$ is rejected
  • $\cA = \{j:\mu_j \neq 0\}$: nonnull set

define the FDP and true discovery proportion

\[FDP = \frac{\sum_{j=1}^p (1-\theta_j)\delta_j}{(\sum_{j=1}^p\delta_j) \lor 1},\quad TDP = \frac{\sum_{j=1}^p\theta_j\delta_j}{(\sum_{j=1}^p\theta_j)\lor 1}\]
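The two proportions are straightforward to compute from the indicator vectors; a small helper (function name is mine) mirroring the definitions above:

```python
import numpy as np

def fdp_tdp(theta, delta):
    """FDP and TDP for indicator vectors theta (truth) and delta
    (decisions); the 'v 1' in the denominators guards against
    dividing by zero when there are no rejections / no signals."""
    theta, delta = np.asarray(theta), np.asarray(delta)
    fdp = np.sum((1 - theta) * delta) / max(np.sum(delta), 1)
    tdp = np.sum(theta * delta) / max(np.sum(theta), 1)
    return fdp, tdp

# 2 true signals, 3 rejections, 1 of them false: FDP = 1/3, TDP = 1.
print(fdp_tdp([1, 1, 0, 0, 0], [1, 1, 1, 0, 0]))
```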

then $FDR=\bbE[FDP]$ and average power $AP = \bbE[TDP]$

FDR control by symmetrized data aggregation

  1. split the sample into two parts, both of which are used to construct statistics that assess the evidence against the null
  2. aggregate the two statistics into a new ranking statistic fulfilling a symmetry property
  3. choose a data-driven threshold along the ranking by using the symmetry property
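One simplified instantiation of these three steps is sketched below. The implementation choices here are illustrative assumptions, not the paper's exact recipe: screening is done by hard-thresholding the first half's standardized means (the paper uses a LASSO on the whitened regression), the ranking statistic is the product $W_j = \hat\mu_{1j}\hat\mu_{2j}$, which is symmetric about zero under the null, and the threshold is chosen knockoff-style by using the negative $W_j$'s to estimate the number of false positives.

```python
import numpy as np

def sda_filter(xi, Sigma, alpha=0.1, screen_t=1.0):
    """A minimal sketch of an SDA-style filter (illustrative, not the
    paper's exact procedure): split, screen-and-aggregate, threshold."""
    n = xi.shape[0]
    # Step 1: split the sample into two independent halves.
    xi1, xi2 = xi[: n // 2], xi[n // 2:]
    mu1, mu2 = xi1.mean(axis=0), xi2.mean(axis=0)

    # Screening with the first half: keep coordinates whose
    # standardized first-half mean exceeds screen_t (stand-in for
    # the paper's LASSO-based screening).
    se1 = np.sqrt(np.diag(Sigma) / xi1.shape[0])
    keep = np.abs(mu1 / se1) > screen_t

    # Step 2: aggregate the two halves into ranking statistics W_j;
    # nulls are (approximately) symmetric about zero, while signals
    # tend to be large and positive.
    W = np.where(keep, mu1 * mu2, 0.0)

    # Step 3: knockoff-style data-driven threshold, estimating the
    # number of false positives from the negative tail of W.
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = np.sum(W <= -t) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)

# Example: 4 strong signals among 40 independent coordinates.
rng = np.random.default_rng(1)
p = 40
mu = np.zeros(p)
mu[:4] = 1.0
xi = rng.multivariate_normal(mu, np.eye(p), size=200)
print(sda_filter(xi, np.eye(p), alpha=0.2))
```

The thresholding rule is the same symmetry trick as in knockoffs: under the null, $W_j$ is equally likely to be positive or negative, so the count of $W_j \le -t$ estimates the count of false positives among $W_j \ge t$.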

Connections to existing works

  • in the classical sample-splitting strategy, one first divides the data into two independent parts, then uses one part to narrow down the focus, and finally uses the remainder to perform the inference task
    • these ideas have a common theme with covariate-assisted multiple testing, where the primary statistic plays the key role to assess the significance while the side information plays an auxiliary role to assist inference
  • SDA is inspired by knockoffs

Finite-Sample Theory on FDR Control

  • $S = \{j: \hat\mu_{1j}\neq 0\}$
  • $W_S = (W_j, j\in S)$
  • $W_{-j} = W_S \setminus \{W_j\}$

the key quantity that controls the upper bound is

\[\Delta_j = \vert \Pr(W_j > 0\mid \vert W_j\vert, W_{-j}) - 1/2\vert\]


Asymptotic Theory on FDP Control


