Exact Post-Selection Inference for Sequential Regression Procedures
The paper proposes new inference tools for forward stepwise regression (FS), least angle regression (LAR), and the lasso.
Assuming a Gaussian model for the observation vector $y$, it first describes a general scheme to perform valid inference after any selection event that can be characterized as $y$ falling into a polyhedral set.
This framework yields conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or at any step along the lasso regularization path.
The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact Type I error control.
R package: selectiveInference
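A minimal usage sketch of the package (function names as in recent versions of selectiveInference; exact arguments may differ), on simulated data: fit the forward stepwise path, then compute selective p-values and intervals at each step.

```r
library(selectiveInference)

# simulated data (hypothetical); sigma = 1 by construction
set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)

fsfit <- fs(X, y)                 # forward stepwise path
out   <- fsInf(fsfit, sigma = 1)  # selective p-values and intervals, known sigma
print(out)
```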
Introduction
Consider observations $y\in \IR^n$ drawn from a Gaussian model,
\[y = \theta + \epsilon, \quad \epsilon\sim N(0, \sigma^2 I)\,.\]
Importantly, the true mean $\theta$ is not assumed to be linear in a given set of predictors.
Related Work
Prior work on inference for high-dimensional regression models falls roughly into two groups:
- Based on sample splitting or resampling methods:
  - Wasserman and Roeder (2009)
  - Meinshausen and Bühlmann (2010)
  - Minnier, Tian, and Cai (2011)
- Based on “debiasing” a regularized regression estimator, such as the lasso:
  - Zhang and Zhang (2014)
  - Bühlmann (2013)
  - van de Geer et al. (2014)
  - Javanmard and Montanari (2014a, 2014b)
The inferential targets considered in the aforementioned works are all fixed in advance, not post-selected.
It is clear (at least conceptually) how to use sample-splitting techniques to accommodate post-selection inferential goals; it is much less clear how to do so with the debiasing tools mentioned above.
- Berk et al. (2013) carried out valid post-selection inference (PoSI) by considering all possible model selection procedures that could have produced the given submodel.
- As the authors state, these inferences are generally conservative for any particular selection procedure, but have the advantage that they do not depend on the correctness of the selected submodel.
- Lee et al. (2016), concurrent with this paper, constructed p-values and intervals for lasso coefficients at a fixed value of the regularization parameter $\lambda$. Both works leverage the same core statistical framework, based on truncated Gaussian (TG) distributions, for exact post-selection inference, but differ in the applications pursued within that framework.
Summary of Results
Consider testing the hypothesis
\[H_0: \nu^T\theta = 0\,,\]conditional on having observed $y\in \cP$, where $\cP$ is a given polyhedral set, and $\nu$ is a given contrast vector.
The paper derives a test statistic $T(y, \cP, \nu)$ with the property that
\[T(y, \cP, \nu) \sim_{P_0} \mathrm{Unif}(0, 1)\,,\]where $P_0(\cdot) = P_{\nu^T\theta = 0}(\cdot \mid y\in \cP)$.
For many regression procedures of interest, in particular the sequential algorithms FS, LAR, and the lasso, the event that the procedure selects a given model (after a given number of steps) can be represented in this form.
For example, consider FS after one step, with $p=3$ variables in total: the FS procedure selects variable 3, and assigns it a positive coefficient, iff
\[X_3^Ty/\Vert X_3\Vert_2 \ge \pm X_1^Ty/\Vert X_1\Vert_2\,, \quad X_3^Ty/\Vert X_3\Vert_2 \ge \pm X_2^Ty / \Vert X_2\Vert_2\,.\]With $X$ considered fixed, these four inequalities can be written compactly as $\Gamma y \ge 0$.
If $\hat j_1(y)$ and $\hat s_1(y)$ denote the variable and sign selected by FS at the first step, then
\[\{y: \hat j_1(y) = 3, \hat s_1(y) = 1\} = \{y:\Gamma y\ge 0\}\,,\]for a particular matrix $\Gamma$.
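As a concrete check, here is a small sketch (simulated $X$ and $y$, purely for illustration) that stacks the four inequalities above into $\Gamma$ and verifies that $\Gamma y \ge 0$ coincides with variable 3 entering first with a positive sign.

```r
# Simulated design and response (hypothetical); theta need not be linear in X.
set.seed(1)
n <- 50
X <- matrix(rnorm(n * 3), n, 3)
y <- 0.5 * X[, 3] + rnorm(n)

# Unit-normalize the columns, then stack the four inequalities as rows of Gamma.
Xu <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
Gamma <- rbind(Xu[, 3] - Xu[, 1],
               Xu[, 3] + Xu[, 1],
               Xu[, 3] - Xu[, 2],
               Xu[, 3] + Xu[, 2])

# Gamma y >= 0 holds exactly when variable 3 attains the largest absolute
# correlation with y and its coefficient sign is positive.
j1 <- which.max(abs(crossprod(Xu, y)))
s1 <- sign(sum(Xu[, 3] * y))
all(Gamma %*% y >= 0) == (j1 == 3 && s1 == 1)
```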
To test the significance of the third variable, conditional on it being selected (with a positive sign) at the first step of FS, take the null hypothesis $H_0$ with $\nu = X_3$ and $\cP = \{y: \Gamma y \ge 0\}$.
Writing $T_1$ for the resulting test statistic, the exact uniformity property can be re-expressed as
\[P_{X_3^T\theta = 0}(T_1\le \alpha\mid \hat j_1(y) = 3, \hat s_1(y) = 1) = \alpha\,,\]for all $\alpha\in [0, 1]$.
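The sketch below (not the package implementation) computes a statistic of this form. Holding the part of $y$ orthogonal to $\nu$ fixed, the event $\Gamma y \ge 0$ restricts $\nu^Ty$ to an interval $[\mathcal{V}^-, \mathcal{V}^+]$, and the CDF of the correspondingly truncated Gaussian, evaluated at $\nu^Ty$, is uniform under the null. It assumes the `Gamma`, `X`, and `y` from the example above and a known $\sigma$.

```r
# Truncated Gaussian (TG) pivot for H0: nu^T theta = mu, conditional on Gamma y >= 0.
tg_pivot <- function(y, Gamma, nu, sigma, mu = 0) {
  vty  <- sum(nu * y)
  cvec <- nu / sum(nu^2)               # decomposition y = z + cvec * (nu^T y)
  z    <- y - cvec * vty
  rho  <- as.vector(Gamma %*% cvec)
  u    <- as.vector(Gamma %*% z)
  # truncation limits implied by Gamma y = u + rho * (nu^T y) >= 0, given z
  vlo  <- if (any(rho > 0)) max(-u[rho > 0] / rho[rho > 0]) else -Inf
  vup  <- if (any(rho < 0)) min(-u[rho < 0] / rho[rho < 0]) else  Inf
  s    <- sigma * sqrt(sum(nu^2))
  # CDF of N(mu, s^2) truncated to [vlo, vup], evaluated at nu^T y
  (pnorm((vty - mu) / s) - pnorm((vlo - mu) / s)) /
    (pnorm((vup - mu) / s) - pnorm((vlo - mu) / s))
}

# One-sided p-value T1 for the first FS step (Gamma, X, y as above; sigma known)
T1 <- 1 - tg_pivot(y, Gamma, nu = X[, 3], sigma = 1)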
A similar construction holds for a general step $k$ of FS: letting $\hat A_k(y) = [\hat j_1(y), \ldots, \hat j_k(y)]$ denote the active list after $k$ steps and $\hat s_{A_k}(y) = [\hat s_1(y), \ldots, \hat s_k(y)]$ denote the signs of the corresponding coefficients, we have, for any fixed $A_k$ and $s_{A_k}$,
\[\{y: \hat A_k(y) = A_k, \hat s_{A_k}(y) = s_{A_k}\} = \{y:\Gamma y\ge 0\}\,,\]for another matrix $\Gamma$.
Write $(M^TM)^+$ for the Moore-Penrose pseudoinverse of the square matrix $M^TM$, and $M^+ = (M^TM)^+M^T$ for the pseudoinverse of the rectangular matrix $M$.
With $\nu = (X_{A_k}^+)^Te_k$, where $e_k$ is the $k$-th standard basis vector, the null hypothesis becomes $e_k^TX_{A_k}^+\theta = 0$, i.e., it specifies that the partial regression coefficient of the last selected variable is zero in the projected linear model of $\theta$ on $X_{A_k}$.
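The contrast itself is easy to form from the active design matrix; a small sketch (the active list `Ak` here is hypothetical), using `MASS::ginv` for the pseudoinverse:

```r
library(MASS)  # ginv: Moore-Penrose pseudoinverse

k  <- 2
Ak <- c(3, 1)                  # hypothetical active list after k = 2 FS steps
nu <- t(ginv(X[, Ak]))[, k]    # nu = (X_Ak^+)^T e_k
# With the step-k polyhedron {Gamma y >= 0}, tg_pivot(y, Gamma, nu, sigma) from
# the sketch above gives the exact conditional p-value for the k-th coefficient.
```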
Conditional Confidence Intervals
By inverting the test statistic, one can obtain a conditional confidence interval $I_k$ satisfying
\[P(e_k^TX_{A_k}^+\theta\in I_k\mid \hat A_k(y) = A_k, \hat s_{A_k}(y) = s_{A_k}) = 1-\alpha\,.\]
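A grid-inversion sketch of such an interval: collect the values $\mu$ for which the TG pivot is not rejected at level $\alpha$. The inputs `vty` ($=\nu^Ty$), the truncation limits `vlo`, `vup`, and the scale `s` ($=\sigma\Vert\nu\Vert_2$) are assumed to have been computed as inside `tg_pivot` above; numerical care is needed when $\nu^Ty$ sits deep in a tail of the truncation interval.

```r
# Equal-tailed (1 - alpha) conditional interval for e_k^T X_Ak^+ theta,
# obtained by inverting the truncated-Gaussian pivot over a grid of means.
tg_interval <- function(vty, vlo, vup, s, alpha = 0.1) {
  grid  <- seq(vty - 20 * s, vty + 20 * s, length.out = 5000)
  pivot <- function(mu)                    # truncated-normal CDF at vty, mean mu
    (pnorm((vty - mu) / s) - pnorm((vlo - mu) / s)) /
      (pnorm((vup - mu) / s) - pnorm((vlo - mu) / s))
  pv   <- vapply(grid, pivot, numeric(1))
  keep <- !is.na(pv) & pv >= alpha / 2 & pv <= 1 - alpha / 2  # mu not rejected
  range(grid[keep])                                           # endpoints of I_k
}
```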