Statistical Learning and Selective Inference
Forward Stepwise Regression
This procedure enters predictors one at a time, choosing at each stage the predictor that most decreases the residual sum of squares.
The classic statistical theory for assessing the strength of each predictor works as follows: define $RSS_k$ to be the residual sum of squares for the model containing $k$ predictors, use the change in residual sum of squares to form the test statistic
\[R_k = \frac{1}{\sigma^2}(RSS_{k-1}-RSS_k)\](with $\sigma$ assumed known here), and compare it to a $\chi^2$ distribution with one degree of freedom.
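A minimal sketch of this procedure and the naive test, assuming a known noise level $\sigma$; the data, function name, and number of steps are illustrative, not from the text.

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, sigma, n_steps):
    n, p = X.shape
    active, naive_pvals = [], []
    rss_prev = np.sum((y - y.mean()) ** 2)   # RSS_0: intercept-only model
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:                # predictor giving the biggest RSS drop
                best_j, best_rss = j, rss
        active.append(best_j)
        R_k = (rss_prev - best_rss) / sigma ** 2       # test statistic R_k
        naive_pvals.append(stats.chi2.sf(R_k, df=1))   # naive chi^2_1 p-value
        rss_prev = best_rss
    return active, naive_pvals

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2 * X[:, 3] + rng.standard_normal(100)
print(forward_stepwise(X, y, sigma=1.0, n_steps=3))
```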
A big problem: this classical theory assumes that the models being compared were prespecified before seeing the data. However, that is not the case here: we have “cherry-picked” the best predictor at each stage to maximize $R_k$.
For forward stepwise regression and many other procedures, the selection events can be written in the “polyhedral” form $Ay \le b$ for some matrix $A$ and vector $b$. Consider any new vector of outcomes, say $y^\star$: we run the forward stepwise procedure and keep track of the predictors entered at each stage. The polyhedral form says that the set of new data vectors $y^\star$ that would yield the same list of predictors can be described by the set $Ay^\star \le b$. The quantities $A$ and $b$ depend on the data and the selected variables.
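To make the polyhedral form concrete, here is a hedged sketch for the first step of forward stepwise only (with standardized predictors): selecting variable $j^\star$ with sign $s$ is the same as requiring $\pm x_k^\top y \le s\, x_{j^\star}^\top y$ for every other $k$, which stacks into $Ay \le b$ with $b = 0$. The construction for later steps is analogous but not shown; the data below are illustrative.

```python
import numpy as np

def first_step_polyhedron(X, y):
    p = X.shape[1]
    inner = X.T @ y
    jstar = np.argmax(np.abs(inner))     # variable entered at step 1
    s = np.sign(inner[jstar])
    rows = []
    for k in range(p):
        if k == jstar:
            continue
        rows.append(X[:, k] - s * X[:, jstar])    #  x_k' y <= s x_j*' y
        rows.append(-X[:, k] - s * X[:, jstar])   # -x_k' y <= s x_j*' y
    A = np.vstack(rows)
    b = np.zeros(A.shape[0])
    return jstar, A, b

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
X /= np.linalg.norm(X, axis=0)            # standardize columns
y = X[:, 2] + 0.5 * rng.standard_normal(50)
jstar, A, b = first_step_polyhedron(X, y)
print(jstar, np.all(A @ y <= b + 1e-10))  # the observed y satisfies A y <= b
```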
Under the selection event $Ay\le b$, the naive expression $\hat\beta \sim N(\beta, \tau^2)$ is replaced by a truncated normal distribution $\hat\beta \sim TN^{c, d}(\beta, \tau^2)$; this is just a normal distribution truncated to lie in the interval $(c, d)$.
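A small sketch of how the truncated normal replaces the naive normal in a p-value, assuming the truncation interval $(c, d)$ has already been computed from $Ay \le b$ (that computation is not shown here); the function name and numbers are illustrative.

```python
from scipy import stats

def naive_and_selective_pvalue(beta_hat, tau, c, d):
    # naive one-sided p-value under H0: beta = 0, ignoring selection
    naive = stats.norm.sf(beta_hat, loc=0, scale=tau)
    # scipy's truncnorm takes the truncation limits in standardized units
    a, b = c / tau, d / tau
    selective = stats.truncnorm.sf(beta_hat, a, b, loc=0, scale=tau)
    return naive, selective

# illustrative numbers: an estimate just past the truncation boundary
print(naive_and_selective_pvalue(beta_hat=2.1, tau=1.0, c=2.0, d=float("inf")))
```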
FDR and a Sequential Stopping Rule
Choose a target FDR level $\alpha$ and, denoting the successive $p$-values by $pv_1,pv_2,\ldots$, define the ForwardStop rule by
\[\hat k = \max\{k:-\frac 1k\sum_{i=1}^k\log(1-pv_i)\le \alpha\}\,.\]That is, we stop at the last step $k$ for which the average of the transformed $p$-values $-\log(1-pv_i)$ up to that point is at most the target FDR level $\alpha$.
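A minimal sketch of the ForwardStop rule above; the sequence of $p$-values is illustrative only.

```python
import numpy as np

def forward_stop(pvals, alpha):
    pvals = np.asarray(pvals, dtype=float)
    # running average of the transformed p-values -log(1 - p_i)
    avgs = np.cumsum(-np.log(1.0 - pvals)) / np.arange(1, len(pvals) + 1)
    passing = np.nonzero(avgs <= alpha)[0]
    return passing[-1] + 1 if passing.size else 0   # k-hat: steps to keep

print(forward_stop([0.001, 0.01, 0.04, 0.3, 0.6], alpha=0.10))
```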
The Lasso
For fixed predictors and a fixed value of $\lambda$, the set of response vectors $y^\star$ that would yield the same active set after applying the lasso can be written in the form $Ay^\star \le b$. Here $A$ and $b$ depend on the predictors, the active set, and $\lambda$, but not on $y$.
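A hedged sketch of reading off the lasso active set at a fixed regularization level with scikit-learn; note that sklearn's `alpha` is a rescaled version of $\lambda$, and the polyhedral set $Ay^\star \le b$ itself is not constructed here. The data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
y = 1.5 * X[:, 0] - X[:, 4] + rng.standard_normal(100)

fit = Lasso(alpha=0.1).fit(X, y)          # fixed regularization level
active = np.nonzero(fit.coef_ != 0)[0]    # the selected (active) variables
print(active)
```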
Principal Components and Beyond
In PCA, one must decide on the number of components $K$ that are “significant”. Traditionally, this is done through the so-called scree plot, which is a plot of the eigenvalues in decreasing order, from largest to smallest.
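A short sketch of a scree plot: the eigenvalues of the sample covariance matrix plotted in decreasing order, with the analyst looking for an “elbow” to pick $K$. The data below are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 8)) @ np.diag([4, 3, 2, 1, 0.5, 0.5, 0.5, 0.5])
X = X - X.mean(axis=0)

# eigvalsh returns ascending eigenvalues, so reverse for the scree plot
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("component")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()
```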
Selective inference can also be applied. Choosing the leading eigenvectors of a covariance matrix is similar in spirit to forward stepwise regression, and one can derive selection-adjusted p-values for each successive increase in the rank.