Statistical Learning and Selective Inference
Forward Stepwise Regression
This procedure enters predictors one at a time, choosing at each stage the predictor that most decreases the residual sum of squares.
The classic statistical theory for assessing the strength of each predictor works as follows: define $RSS_k$ to be the residual sum of squares for the model containing $k$ predictors, use the change in residual sum of squares to form the test statistic
\[R_k = \frac{1}{\sigma^2}(RSS_{k-1}-RSS_k)\](with $\sigma$ assumed known here), and compare it to a $\chi^2$ distribution with one degree of freedom.
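A minimal sketch of this procedure and the naive test, assuming a known noise level $\sigma$; the data, function name, and number of steps are illustrative, not from the text.

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, sigma, n_steps):
    n, p = X.shape
    active, naive_pvals = [], []
    rss_prev = np.sum((y - y.mean()) ** 2)   # RSS_0: intercept-only model
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:                # predictor giving the biggest RSS drop
                best_j, best_rss = j, rss
        active.append(best_j)
        R_k = (rss_prev - best_rss) / sigma ** 2       # test statistic R_k
        naive_pvals.append(stats.chi2.sf(R_k, df=1))   # naive chi^2_1 p-value
        rss_prev = best_rss
    return active, naive_pvals

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2 * X[:, 3] + rng.standard_normal(100)
print(forward_stepwise(X, y, sigma=1.0, n_steps=3))
```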
A big problem: this classical theory assumes that the models being compared were prespecified before seeing the data. However, that is not the case here: we have “cherry-picked” the best predictor at each stage to maximize $R_k$.
For forward stepwise regression and many other procedures, the selection events can be written in the “polyhedral” form $Ay \le b$ for some matrix $A$ and vector $b$. Consider any new vector of outcomes, say $y^\star$: we run the forward stepwise procedure and keep track of the predictors entered at each stage. The polyhedral form says that the set of new data vectors $y^\star$ that would yield the same list of predictors can be described by the set $Ay^\star \le b$. The quantities $A$ and $b$ depend on the data and the selected variables.
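To make the polyhedral form concrete, here is a hedged sketch for the first step of forward stepwise only (with standardized predictors): selecting variable $j^\star$ with sign $s$ is the same as requiring $\pm x_k^\top y \le s\, x_{j^\star}^\top y$ for every other $k$, which stacks into $Ay \le b$ with $b = 0$. The construction for later steps is analogous but not shown; the data below are illustrative.

```python
import numpy as np

def first_step_polyhedron(X, y):
    p = X.shape[1]
    inner = X.T @ y
    jstar = np.argmax(np.abs(inner))     # variable entered at step 1
    s = np.sign(inner[jstar])
    rows = []
    for k in range(p):
        if k == jstar:
            continue
        rows.append(X[:, k] - s * X[:, jstar])    #  x_k' y <= s x_j*' y
        rows.append(-X[:, k] - s * X[:, jstar])   # -x_k' y <= s x_j*' y
    A = np.vstack(rows)
    b = np.zeros(A.shape[0])
    return jstar, A, b

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
X /= np.linalg.norm(X, axis=0)            # standardize columns
y = X[:, 2] + 0.5 * rng.standard_normal(50)
jstar, A, b = first_step_polyhedron(X, y)
print(jstar, np.all(A @ y <= b + 1e-10))  # the observed y satisfies A y <= b
```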
Under the selection event $Ay\le b$, the naive expression $\hat\beta \sim N(\beta, \tau^2)$ is replaced by a truncated normal distribution $\hat\beta \sim TN^{c, d}(\beta, \tau^2)$; this is just a normal distribution truncated to lie in the interval $(c, d)$.
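A small sketch of how the truncated normal replaces the naive normal in a p-value, assuming the truncation interval $(c, d)$ has already been computed from $Ay \le b$ (that computation is not shown here); the function name and numbers are illustrative.

```python
from scipy import stats

def naive_and_selective_pvalue(beta_hat, tau, c, d):
    # naive one-sided p-value under H0: beta = 0, ignoring selection
    naive = stats.norm.sf(beta_hat, loc=0, scale=tau)
    # scipy's truncnorm takes the truncation limits in standardized units
    a, b = c / tau, d / tau
    selective = stats.truncnorm.sf(beta_hat, a, b, loc=0, scale=tau)
    return naive, selective

# illustrative numbers: an estimate just past the truncation boundary
print(naive_and_selective_pvalue(beta_hat=2.1, tau=1.0, c=2.0, d=float("inf")))
```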
FDR and a Sequential Stopping Rule
Choose a target FDR level $\alpha$ and, denoting the successive $p$-values by $pv_1,pv_2,\ldots$, define the ForwardStop rule by
\[\hat k = \max\{k:-\frac 1k\sum_{i=1}^k\log(1-pv_i)\le \alpha\}\,.\]That is, we stop at the last step $k$ for which the average of the transformed $p$-values $-\log(1-pv_i)$ up to that point is at most the target FDR level $\alpha$.
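A minimal sketch of the ForwardStop rule above; the sequence of $p$-values is illustrative only.

```python
import numpy as np

def forward_stop(pvals, alpha):
    pvals = np.asarray(pvals, dtype=float)
    # running average of the transformed p-values -log(1 - p_i)
    avgs = np.cumsum(-np.log(1.0 - pvals)) / np.arange(1, len(pvals) + 1)
    passing = np.nonzero(avgs <= alpha)[0]
    return passing[-1] + 1 if passing.size else 0   # k-hat: steps to keep

print(forward_stop([0.001, 0.01, 0.04, 0.3, 0.6], alpha=0.10))
```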
The Lasso
For fixed predictors and a fixed value of $\lambda$, the set of response vectors $y^\star$ that would yield the same active set after applying the lasso can be written in the form $Ay^\star \le b$. Here $A$ and $b$ depend on the predictors, the active set, and $\lambda$, but not on $y$.
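A hedged sketch of reading off the lasso active set at a fixed regularization level with scikit-learn; note that sklearn's `alpha` is a rescaled version of $\lambda$, and the polyhedral set $Ay^\star \le b$ itself is not constructed here. The data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
y = 1.5 * X[:, 0] - X[:, 4] + rng.standard_normal(100)

fit = Lasso(alpha=0.1).fit(X, y)          # fixed regularization level
active = np.nonzero(fit.coef_ != 0)[0]    # the selected (active) variables
print(active)
```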
Principal Components and Beyond
In PCA, one must decide on the number of components $K$ that are “significant”. Traditionally, this is done through the so-called scree plot, which is a plot of the eigenvalues in decreasing order, from largest to smallest.
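A short sketch of a scree plot: the eigenvalues of the sample covariance matrix plotted in decreasing order, with the analyst looking for an “elbow” to pick $K$. The data below are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 8)) @ np.diag([4, 3, 2, 1, 0.5, 0.5, 0.5, 0.5])
X = X - X.mean(axis=0)

# eigvalsh returns ascending eigenvalues, so reverse for the scree plot
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("component")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()
```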
Selective inference can also be applied. Choosing the leading eigenvectors of a covariance matrix is similar in spirit to forward stepwise regression, and one can derive selection-adjusted p-values for each successive increase in the rank.