# FDR Control in GLM


The generalized linear model (GLM) has been widely used to model counts and other types of non-Gaussian data.

The article introduces a framework for feature selection in the GLM that achieves robust false discovery rate (FDR) control.

Main idea: construct a mirror statistic based on data perturbation to measure the importance of each feature. FDR control is achieved by exploiting the property that the sampling distribution of the mirror statistic is (asymptotically) symmetric about zero for any null feature.

- in the moderate-dimensional setting, i.e., $p/n \rightarrow \kappa\in (0,1)$, construct the mirror statistic based on the MLE
- in the high-dimensional setting, i.e., $p\gg n$, construct the mirror statistic based on the debiased Lasso

The proposed method is scale-free as it only hinges on the symmetry of the mirror statistic; thus, it can be more robust in finite-sample cases than existing methods.

BHq requires p-values, which are difficult to construct in high dimensions.

- Javanmard and Javadi (2019) and Ma, Tony Cai, and Li (2020) considered applying BHq to high-dimensional linear and logistic regression models, respectively, with p-values obtained via the debiased Lasso. **However**, the debiased Lasso only provides asymptotically valid p-values, which often appear highly nonuniform under the null in finite-sample cases.
- Knockoff requires nearly exact knowledge of the joint distribution of all features, potentially limiting its applicability in high dimensions. If the distribution is unknown, Barber, Candes, and Samworth (2020) showed that the inflation of the FDR is proportional to the estimation error in the conditional distribution $P(X_j\mid X_{-j})$.

The article proposes a new FDR control framework for the GLM that requires neither p-values nor knowledge of the joint distribution of the features.

## FDR Control via Mirror Statistics

### Gaussian Mirrors

The idea of the Gaussian mirror is to create a pair of perturbed mirror features,

\[X_j^+ = X_j + c_jZ_j, \quad X_j^- = X_j - c_jZ_j\,,\]

in which $c_j$ is an adjustable scalar and $Z_j$ follows $N(0,1)$ independently across $j\in [p]$. The linear model with $\beta^\star$ as the true parameter vector can then be equivalently recast as

\[y = \frac{\beta_j^\star}{2}X_j^+ + \frac{\beta_j^\star}{2}X_j^- + X_{-j}\beta_{-j}^\star + \epsilon\,.\]

Obtain $\hat\beta^+$ and $\hat\beta^-$, as well as the normalized estimates $T_j^+$ and $T_j^-$, via OLS.

For any null feature $j\in S_0$, both $T_j^+$ and $T_j^-$ follow a t-distribution centered at zero.
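As a quick numerical sketch of this construction in the linear model (my own illustration, not code from the paper), the snippet below builds the perturbed pair for each feature, fits OLS, and forms the mirror statistic $M_j = \vert T_j^+ + T_j^-\vert - \vert T_j^+ - T_j^-\vert$ used in the Gaussian-mirror literature; for simplicity it uses the raw OLS coefficients in place of the normalized $T_j^\pm$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0                      # features 0-4 are signals, rest are nulls
y = X @ beta + rng.standard_normal(n)

def gaussian_mirror_stat(X, y, j, c=1.0):
    """Mirror statistic for feature j: replace X_j by the perturbed pair
    (X_j + c z, X_j - c z), fit OLS, and return |b+ + b-| - |b+ - b-|."""
    z = rng.standard_normal(X.shape[0])
    X_aug = np.column_stack([X[:, j] + c * z,
                             X[:, j] - c * z,
                             np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    b_plus, b_minus = coef[0], coef[1]
    return abs(b_plus + b_minus) - abs(b_plus - b_minus)

M = np.array([gaussian_mirror_stat(X, y, j) for j in range(p)])
# Signals yield large positive M_j; null M_j are roughly symmetric about 0.
```

Since $b^+ + b^-$ recovers the coefficient on $X_j$ while $b^+ - b^-$ only picks up the independent noise direction $z$, signals give $M_j$ near $\vert\beta_j^\star\vert$ while nulls are symmetric about zero.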

### Data Splitting

Alternatively, randomly split the data into two independent halves, compute normalized estimates $T_j^{(1)}$ and $T_j^{(2)}$ of each coefficient on the two halves, and form the mirror statistic $M_j = \mathrm{sign}(T_j^{(1)}T_j^{(2)})(\vert T_j^{(1)}\vert + \vert T_j^{(2)}\vert)$; by the independence of the two halves, $M_j$ is symmetric about zero for any null feature.

## GLM in Moderate Dimensions

Consider the following GLM with a canonical link function $\rho$:

\[p(y\mid X, \beta^\star) = \prod_{i=1}^n c(y_i)\exp(y_ix_i^T\beta^\star - \rho(x_i^T\beta^\star))\,,\]

and assume that $p/n\rightarrow \kappa\in (0, 1)$. For fixed $p$, $\kappa = 0$.

Define the Moreau envelope of the loss function, that is, the negative log-likelihood of the GLM, in the standard way: for $\lambda > 0$, $\mathrm{env}_{\lambda f}(x) = \min_z\left\{f(z) + \frac{1}{2\lambda}(x - z)^2\right\}$. The associated regularity condition holds for a wide range of GLMs and machine learning models including logistic regression, Poisson regression, and support vector machines (Abbasi 2020).

Further assume that the $x_i$'s are iid observations from a distribution with mean 0 and covariance matrix $\Sigma$, and that the signal strength converges to a constant, that is, $\gamma_n := \var(x_i^T\beta^\star)\rightarrow \gamma^2$.

### Properties of the MLE

The MLE of $\beta^\star$ can be written as

\[\hat\beta = \argmin_{\beta \in \IR^p} \left\{ \frac 1n 1^T\rho(X\beta) - \frac 1n y^TX\beta \right\}\,,\]

which can behave very differently when $\kappa > 0$ compared with the classical setting.

- First, the GLM may not be identifiable, i.e., the MLE may not exist uniquely. For instance, **for logistic regression, the MLE does not exist if the two classes are well separated.**
- Second, the MLE is asymptotically biased and its asymptotic variance also differs from the classical result.

Consider the GLM where $x_i\sim N(0, \Sigma)$. Let $\Theta = \Sigma^{-1}$ and $\tau_j^2 = \Theta_{jj}^{-1}$. Assuming that $\sqrt n\tau_j\beta_j^\star = O(1)$, we have \(P\left(\frac{\sqrt n(\hat\beta_j - \alpha_\star\beta_j^\star)}{\sigma_\star /\tau_j}\le x \mid \text{MLE exists} \right)\rightarrow \Phi(x)\,,\) where $\alpha_\star, \sigma_\star$ are two universal constants depending on the link function $\rho$, the true coefficient vector $\beta^\star$, the signal strength $\gamma$, and the ratio $\kappa$.

We need to estimate the bias and variance scaling factors $(\alpha_\star, \sigma_\star)$ in order to obtain p-values.

### FDR Control via Data Splitting

In comparison with BHq, the DS procedure does not require estimating the scaling factors $(\alpha_\star,\sigma_\star)$ and thus is applicable to all GLMs in the moderate-dimensional setting.

Normalize two independent MLEs $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ as below,

\[T_j^{(1)} = \hat\tau_j^{(1)}\hat\beta_j^{(1)}, \quad T_j^{(2)} = \hat\tau_j^{(2)}\hat\beta_j^{(2)}\,,\]

where $\hat\tau_j^{(1)}$ and $\hat\tau_j^{(2)}$ are two independent estimates of $\tau_j$.

Although the asymptotic standard deviation of $\hat\beta_j$ is $\sigma_\star/\tau_j$, we can safely drop the constant $\sigma_\star$ because the FDR control framework is scale-invariant with respect to the mirror statistics.

DS also does not require estimating the bias scaling factor $\alpha_\star$ since $\alpha_\star\beta_j^\star = 0$ under the null.
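A minimal sketch of the DS procedure for logistic regression (my own illustration, not the paper's code; it assumes an iid $N(0, I_p)$ design so that $\tau_j\approx 1$ and the normalization step can be skipped). The mirror statistic and data-driven threshold follow the recipe $M_j = \mathrm{sign}(T_j^{(1)}T_j^{(2)})(\vert T_j^{(1)}\vert+\vert T_j^{(2)}\vert)$ and $\tau_q = \min\{t>0: \#\{j: M_j<-t\}/\max(\#\{j: M_j>t\},1)\le q\}$:

```python
import numpy as np

def logistic_mle(X, y, iters=25):
    """Newton-Raphson for the (unpenalized) logistic-regression MLE."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)
        H = (X * (mu * (1.0 - mu))[:, None]).T @ X
        beta += np.linalg.solve(H, grad)
    return beta

def ds_select(X, y, q=0.1, seed=1):
    """One data split: two independent MLEs, mirror statistics
    M_j = sign(T1_j * T2_j) * (|T1_j| + |T2_j|), and the threshold
    tau_q = min{t : #{j: M_j < -t} / max(#{j: M_j > t}, 1) <= q}."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]
    t1 = logistic_mle(X[half1], y[half1])
    t2 = logistic_mle(X[half2], y[half2])
    M = np.sign(t1 * t2) * (np.abs(t1) + np.abs(t2))
    for t in np.sort(np.abs(M)):            # candidate thresholds
        if np.sum(M < -t) / max(np.sum(M > t), 1) <= q:
            return np.where(M > t)[0]       # features with M_j > tau_q
    return np.array([], dtype=int)
```

MDS repeats this over multiple random splits and aggregates the selections, which stabilizes the procedure against an unlucky single split.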

### FDR Control via Gaussian Mirror

Cover (1964, 1965) showed that the MLE exists uniquely only if $\kappa \in (0, 1/2]$; hence, DS is only applicable when $\kappa\in (0, 1/4]$, that is, $n\ge 4p$ (each half of the split retains only $n/2$ observations). The same issue also occurs for Knockoff as it doubles the number of features. Besides, even if $\kappa \in(0, 1/4]$ and the MLE exists uniquely as $n, p\rightarrow \infty$, in the finite-sample case there is still a chance that the MLE does not exist uniquely for some sample splits used in MDS.

To overcome this issue, consider the Gaussian mirror method, which extends the applicability to $\kappa \in (0, 1/2]$ as long as the MLE exists uniquely on the full data.

Normalize $\hat\beta_j^+$ and $\hat\beta_j^-$ as

\[T_j^+ = \sqrt{\hat\tau_j^2 + c_j^2}\,\hat\beta_j^+, \quad T_j^- = \sqrt{\hat \tau_j^2 + c_j^2}\,\hat\beta_j^-\,.\]

## GLM in High Dimensions

### Construction of the Mirror Statistic via Debiased Lasso

DS proceeds by first randomly splitting the data into two halves, and then calculating two independent debiased Lasso estimates.

Normalize $\hat\beta^d_j$ as $T_j = \hat\beta_j^d/\hat\sigma_j$.

For BHq, if the variance is underestimated, it risks losing FDR control; on the other hand, if the variance is overestimated, BHq can be overly conservative, leading to a significant power loss. In contrast, DS is scale-free.
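To see the scale-free property concretely, here is a small self-contained check (my own illustration, using synthetic normalized statistics rather than actual debiased-Lasso output): multiplying every statistic by a common constant, as a uniformly misestimated $\hat\sigma_j$ would, leaves the DS selection unchanged, because both the mirror statistics and the candidate thresholds scale together.

```python
import numpy as np

def mirror_select(t1, t2, q=0.1):
    """DS selection from two independent vectors of normalized statistics."""
    M = np.sign(t1 * t2) * (np.abs(t1) + np.abs(t2))
    for t in np.sort(np.abs(M)):
        if np.sum(M < -t) / max(np.sum(M > t), 1) <= q:
            return set(np.where(M > t)[0])
    return set()

rng = np.random.default_rng(3)
# 5 signal coordinates centered at 5, 45 nulls centered at 0, in both halves.
t1 = np.concatenate([rng.normal(5, 1, 5), rng.normal(0, 1, 45)])
t2 = np.concatenate([rng.normal(5, 1, 5), rng.normal(0, 1, 45)])

# Rescaling every statistic by the same constant does not change the
# selected set -- the procedure is scale-free.
assert mirror_select(t1, t2) == mirror_select(3.7 * t1, 3.7 * t2)
```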

### Theoretical Justification of the Data-Splitting Methods

Let

\[\Theta_{ij}^0 = \frac{\Theta_{ij}}{\sqrt{\Theta_{ii}\Theta_{jj}}}, \quad \varrho = \max_{i\neq j} \vert \Theta_{ij}^0\vert, \quad s = \max_{j\in [p]}\#\{i:\Theta_{ij}^0\neq 0\}\,,\]

and define $\gamma_j$ as the coefficient vector of the best linear predictor of $X_{\beta^\star, j}$ using $X_{\beta^\star, -j}$,

\[\gamma_j = \argmin_{\gamma\in \IR^{p-1}} \bbE[\Vert X_{\beta^\star, j}- X_{\beta^\star, -j}\gamma\Vert_2^2]\,.\]

Define $S_{1, strong}$ as the largest subset of $S_1$ such that

\[\sqrt{n/\log p} \min_{j\in S_{1,strong}}\vert \beta_j^\star\vert \rightarrow\infty\,.\]

There exist some constants $\alpha_1, \alpha_2, C, c$ such that the regularity conditions of Assumption 4.1 are satisfied.

For any FDR control level $q\in (0, 1)$, in the asymptotic regime $p = O(n^r)$ for some constant $r> 1$, under Assumption 4.1, we have

\[FDP \le q + o_p(1)\quad \text{and}\quad \limsup_{n, p\rightarrow \infty}FDR \le q\]

for the DS procedure.

Assume that there exists some constant $C > 0$ such that $\max_{j\in S_1\setminus S_{1, strong}}\sqrt n\vert\beta_j^\star\vert \le C$. Then, under Assumption 4.1, the MDS procedure satisfies

\(FDP \le q + o_p(1)\quad \text{and}\quad \limsup_{n, p\rightarrow\infty} FDR \le q\) in the asymptotic regime $p = O(n^r)$ for some constant $r > 1$.

## Discussion by Law and Bühlmann

This section is based on the discussion by Law and Bühlmann: Law, M., & Bühlmann, P. (2023). Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models.” Journal of the American Statistical Association, 118(543), 1578–1583.

### Comparison of Gaussian Mirrors and Data Splitting

In the settings of a Gaussian design,

\[\frac{\var\left(\frac{\hat\beta_{j,1}^{GM} + \hat\beta_{j, 2}^{GM}}{2}\right)}{\var\left(\frac{\hat\beta_{j,1}^{DS} + \hat\beta_{j,2}^{DS}}{2}\right)} = \frac{n-2p-2}{n-p-2} < 1\,.\]

- Intuitively, the Gaussian mirror estimates $p$ nuisance parameters on the basis of $n$ observations ($p-1$ parameters from $X_{-j}$ and one parameter from $z_j$),
- while data splitting estimates $2(p-1)$ nuisance parameters ($p-1$ from each of $X_{-j}^{(1)}$ and $X_{-j}^{(2)}$), each from only $n/2$ observations.
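A quick numeric check of the ratio above (the values of $n$ and $p$ are arbitrary illustrative choices in the moderate-dimensional regime $n > 2p + 2$):

```python
# Law & Buhlmann's variance ratio Var(GM average) / Var(DS average),
# evaluated at an arbitrary illustrative (n, p) with n > 2p + 2.
n, p = 1000, 200
ratio = (n - 2 * p - 2) / (n - p - 2)
print(ratio < 1)   # the Gaussian-mirror average attains the smaller variance
```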