
Exact Post-Selection Inference for Sequential Regression Procedures

Tags: p-values, False Discovery Rate, Lasso, Forward Stepwise Regression, Least Angle Regression

This post is for Tibshirani, R. J., Taylor, J., Lockhart, R., & Tibshirani, R. (2016). Exact Post-Selection Inference for Sequential Regression Procedures. Journal of the American Statistical Association, 111(514), 600–620.

The paper proposes new inference tools for forward stepwise regression (FS), least angle regression (LAR), and the lasso.

Assuming a Gaussian model for the observation vector $y$, the authors first describe a general scheme to perform valid inference after any selection event that can be characterized as $y$ falling into a polyhedral set.

This framework allows one to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or at any step along the lasso regularization path.

The p-values associated with these tests are exactly uniform under the null hypothesis, in finite samples, yielding exact Type I error control.

R package: selectiveInference
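
As a quick orientation, here is a minimal usage sketch with simulated data, assuming the `fs`/`fsInf` interface as documented in the package:

```r
# Minimal sketch (assumed selectiveInference interface): fit the forward
# stepwise path, then compute exact post-selection p-values along it.
library(selectiveInference)

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 3] + rnorm(n)           # only variable 3 is truly active

fsfit <- fs(x, y)                     # forward stepwise path
out   <- fsInf(fsfit, sigma = 1)      # selective p-values (noise level known here)
out$pv                                # one p-value per step of the path
```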

Introduction

Consider observations $y\in \IR^n$ drawn from a Gaussian model,

\[y = \theta + \epsilon\,, \quad \epsilon\sim N(0, \sigma^2 I)\,.\]

The true model is not assumed to be linear, i.e., $\theta$ need not equal $X\beta_0$ for any coefficient vector $\beta_0$.

Related work on inference for high-dimensional regression models:

  • based on sample splitting or resampling methods
    • Wasserman and Roeder (2009)
    • Meinshausen and Bühlmann (2010)
    • Minnier, Tian, and Cai (2011)
  • based on “debiasing” a regularized regression estimator, like the lasso
    • Zhang and Zhang (2014)
    • Bühlmann (2013)
    • van de Geer et al. (2014)
    • Javanmard and Montanari (2014a, 2014b)

The inferential targets considered in the aforementioned works are all fixed, not post-selected.

It is clear (at least conceptually) how to use sample-splitting techniques to accommodate post-selection inferential goals; it is much less clear how to do so with the debiasing tools mentioned above.

  • Berk et al. (2013) carried out valid post-selection inference (PoSI) by considering all possible model selection procedures that could have produced the given submodel.
    • As the authors state, the inferences are generally conservative for particular selection procedures, but have the advantage that they do not depend on the correctness of the selected submodel.
  • Lee et al. (2016), concurrent with this paper, constructed p-values and intervals for lasso coefficients at a fixed value of the regularization parameter $\lambda$. Both papers leverage the same core statistical framework of truncated Gaussian (TG) distributions for exact post-selection inference, but differ in the applications pursued with it (a usage sketch for the fixed-$\lambda$ case follows this list).
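
For reference, a minimal sketch of the fixed-$\lambda$ lasso inference of Lee et al. (2016), assuming the `fixedLassoInf` interface as documented in selectiveInference (note that glmnet's penalty is scaled by $1/n$, hence `s = lambda / n` below):

```r
# Sketch: post-selection p-values for the lasso at a fixed lambda (Lee et al. 2016),
# assuming the documented fixedLassoInf interface; data simulated for illustration.
library(selectiveInference)
library(glmnet)

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 3] + rnorm(n)

lambda <- 2                                       # on the (1/2)||y - Xb||_2^2 scale
gfit   <- glmnet(x, y, standardize = FALSE)
beta   <- coef(gfit, x = x, y = y, s = lambda / n, exact = TRUE)[-1]
fixedLassoInf(x, y, beta, lambda, sigma = 1)      # p-values/intervals for the active set
```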

Summary of Results

Consider testing the hypothesis

\[H_0: \nu^T\theta = 0\,,\]

conditional on having observed $y\in \cP$, where $\cP$ is a given polyhedral set, and $\nu$ is a given contrast vector.

The authors derive a test statistic $T(y, \cP, \nu)$ with the property that

\[T(y, \cP, \nu) \sim_{P_0} \mathrm{Unif}(0, 1)\,,\]

where $P_0(\cdot) = P_{\nu^T\theta = 0}(\cdot \mid y\in \cP)$.
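
Concretely, $T$ is a truncated Gaussian (TG) pivot: conditional on $y\in\cP$ (and on the component of $y$ orthogonal to $\nu$), $\nu^Ty$ is a Gaussian variable truncated to an interval whose endpoints depend on $y$ only through that orthogonal component, and evaluating its conditional survival function at the observed $\nu^Ty$ gives a uniform variate. Below is a minimal sketch of this construction, assuming known $\sigma$ and a polyhedron written as $\cP = \{y: \Gamma y \ge 0\}$ (this representation is used repeatedly below); the function is my own illustration, not the package implementation.

```r
# Truncated Gaussian pivot behind T(y, P, nu), for a polyhedron {y : Gamma y >= 0}
# and known sigma. Own illustration of the construction.
tg_pivot <- function(y, Gamma, nu, sigma) {
  vty    <- sum(nu * y)                    # nu^T y
  vnorm2 <- sum(nu^2)
  rho    <- as.vector(Gamma %*% nu) / vnorm2
  z      <- y - (vty / vnorm2) * nu        # component of y orthogonal to nu
  gz     <- as.vector(Gamma %*% z)
  # rows of Gamma with rho != 0 translate Gamma y >= 0 into limits on nu^T y, given z
  vlo <- suppressWarnings(max(-gz[rho > 0] / rho[rho > 0]))   # -Inf if no such row
  vup <- suppressWarnings(min(-gz[rho < 0] / rho[rho < 0]))   # +Inf if no such row
  sd  <- sigma * sqrt(vnorm2)
  # conditional survival function of N(0, sd^2) truncated to [vlo, vup], evaluated
  # at the observed nu^T y; Unif(0, 1) under H0: nu^T theta = 0
  (pnorm(vup / sd) - pnorm(vty / sd)) / (pnorm(vup / sd) - pnorm(vlo / sd))
}
```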

For many regression procedures of interest, in particular the sequential algorithms FS, LAR, and the lasso, the event that the procedure selects a given model (after a given number of steps) can be represented in this form.

For example, consider FS after one step, with $p=3$ variables in total: the FS procedure selects variable 3, and assigns it a positive coefficient, iff

\[X_3^Ty/\Vert X_3\Vert_2 \ge \pm X_1^Ty/\Vert X_1\Vert_2\,,\quad X_3^Ty/\Vert X_3\Vert_2 \ge \pm X_2^Ty / \Vert X_2\Vert_2\,.\]

With $X$ considered fixed, these four inequalities can be compactly represented as $\Gamma y \ge 0$.

If $\hat j_1(y)$ and $\hat s_1(y)$ denote the variable and sign selected by FS at the first step, then

\[\{y: \hat j_1(y) = 3, \hat s_1(y) = 1\} = \{y:\Gamma y\ge 0\}\,,\]

for a particular matrix $\Gamma$.
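
To make this concrete, here is a small numerical check (on simulated data of my own, not from the paper) that the four inequalities above, stacked into $\Gamma$, pick out exactly the event that FS selects variable 3 with a positive sign at the first step:

```r
# Build Gamma whose four rows encode X3^T y/||X3|| >= +/- X1^T y/||X1|| and
# >= +/- X2^T y/||X2||, and verify that {Gamma y >= 0} matches the variable/sign
# chosen by one step of forward stepwise.
set.seed(2)
n <- 30
X <- matrix(rnorm(n * 3), n, 3)
y <- X[, 3] + rnorm(n)

u <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # unit-norm columns
Gamma <- rbind(u[, 3] - u[, 1],
               u[, 3] + u[, 1],
               u[, 3] - u[, 2],
               u[, 3] + u[, 2])

sel <- all(Gamma %*% y >= 0)                # y falls in the polyhedron?
j1  <- which.max(abs(crossprod(u, y)))      # direct FS step: largest |correlation| ...
s1  <- sign(crossprod(u, y))[j1]            # ... and its sign
c(polyhedron = sel, fs_pick = (j1 == 3 && s1 == 1))   # the two agree
```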

To test the significance of the third variable, conditional on it being selected at the first step of FS, consider the null hypothesis $H_0: \nu^T\theta = 0$ with $\nu = X_3$ and $\cP = \{y: \Gamma y \ge 0\}$.

This can be re-expressed as

\[P_{X_3^T\theta = 0}(T_1\le \alpha\mid \hat j_1(y) = 3, \hat s_1(y) = 1) = \alpha\]

for all $\alpha\in [0, 1]$.
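
Putting the two sketches above together (they share `n`, `X`, `Gamma`, and `tg_pivot`), a quick Monte Carlo check of this display: simulate under $\theta = 0$ (so the null holds), keep only the draws where FS selects variable 3 with a positive sign, and the resulting p-values should look exactly uniform.

```r
# Conditional uniformity check: theta = 0, sigma = 1, so H0: X_3^T theta = 0 holds.
# Reuses n, X, Gamma from the previous sketch and tg_pivot from the earlier one.
set.seed(3)
pvals <- replicate(20000, {
  y0 <- rnorm(n)                                      # y ~ N(0, I)
  if (all(Gamma %*% y0 >= 0))                         # selection event occurred
    tg_pivot(y0, Gamma, nu = X[, 3], sigma = 1)
  else NA
})
pvals <- pvals[!is.na(pvals)]
ks.test(pvals, "punif")                               # should not reject uniformity
```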

A similar construction holds for a general step $k$ of FS: letting $\hat A_k(y) = [\hat j_1(y), \ldots, \hat j_k(y)]$ denote the active list after $k$ steps and $\hat s_{A_k}(y) = [\hat s_1(y), \ldots, \hat s_k(y)]$ denote the signs of the corresponding coefficients, we have, for any fixed $A_k$ and $s_{A_k}$,

\[\{y: \hat A_k(y) = A_k, \hat s_{A_k}(y) = s_{A_k}\} = \{y:\Gamma y\ge 0\}\,,\]

for another matrix $\Gamma$.

Write $(M^TM)^+$ for the Moore-Penrose pseudoinverse of the square matrix $M^TM$, and $M^+ = (M^TM)^+M^T$ for the pseudoinverse of the rectangular matrix $M$.

With $\nu = (X_{A_k}^+)^Te_k$, where $e_k$ is the $k$-th standard basis vector, the hypothesis is $e_k^TX_{A_k}^+\theta = 0$, i.e., it states that the last partial regression coefficient is zero in the projected linear model of $\theta$ on $X_{A_k}$.
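
As a sanity check on this choice of $\nu$ (with simulated data and a hypothetical active list of my own), $\nu^Ty = e_k^TX_{A_k}^+y$ is simply the coefficient of the last-entered variable in the regression of $y$ on $X_{A_k}$:

```r
# The contrast nu = (X_{A_k}^+)^T e_k picks off the k-th (last-entered) partial
# regression coefficient of y on X_{A_k}. Data and active list are illustrative.
set.seed(4)
n <- 30
X <- matrix(rnorm(n * 3), n, 3)
y <- X[, 3] + rnorm(n)

Ak <- c(3, 1); k <- length(Ak)           # a hypothetical active list after 2 steps
XA <- X[, Ak]
XAplus <- solve(crossprod(XA), t(XA))    # X_{A_k}^+ = (X^T X)^{-1} X^T (full column rank)
nu <- t(XAplus)[, k]                     # (X_{A_k}^+)^T e_k
all.equal(sum(nu * y), unname(coef(lm(y ~ XA - 1)))[k])   # TRUE: last coefficient
```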

Conditional Confidence Intervals

By inverting the test statistic, one can obtain a conditional confidence interval $I_k$ satisfying

\[P(e_k^TX_{A_k}^+\theta\in I_k\mid \hat A_k(y) = A_k, \hat s_{A_k}(y) = s_{A_k}) = 1-\alpha\,.\]
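
A rough illustration of this inversion, continuing the `tg_pivot` sketch above: treat the pivot as a function of $\mu = \nu^T\theta$ (with the truncation limits and the observed $\nu^Ty$ computed as there) and collect the values of $\mu$ that are not rejected at level $\alpha$. This is a simple grid approximation of my own, not the package's interval routine.

```r
# Pivot as a function of mu = nu^T theta, for fixed truncation limits vlo, vup,
# observed vty = nu^T y, and sd = sigma * ||nu||_2 (as computed in tg_pivot).
tg_pivot_mu <- function(mu, vty, vlo, vup, sd)
  (pnorm((vup - mu) / sd) - pnorm((vty - mu) / sd)) /
  (pnorm((vup - mu) / sd) - pnorm((vlo - mu) / sd))

# Crude equal-tailed conditional interval: keep the mu's whose pivot lies in
# [alpha/2, 1 - alpha/2], scanning a grid around the observed vty.
conditional_ci <- function(vty, vlo, vup, sd, alpha = 0.1) {
  mus <- seq(vty - 10 * sd, vty + 10 * sd, length.out = 4000)
  piv <- vapply(mus, tg_pivot_mu, numeric(1),
                vty = vty, vlo = vlo, vup = vup, sd = sd)
  keep <- !is.na(piv) & piv >= alpha / 2 & piv <= 1 - alpha / 2
  range(mus[keep])
}
```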

Marginalization

