
Debiased Lasso

Tags: Lasso, High-Dimensional, Confidence Interval

This post is based on Section 6.4 of Hastie, Trevor, Robert Tibshirani, and Martin Wainwright, "Statistical Learning with Sparsity: The Lasso and Generalizations," CRC Press, 2015.

For the linear model $\y=\X\beta+\varepsilon$, consider the high-dimensional regime in which the number of parameters $p$ exceeds the sample size $N$. A prominent approach is the lasso, defined through the following convex optimization problem:

\begin{equation} \hat\beta_\lambda = \arg\min_\beta\left\{\frac{1}{2N}\Vert \y-\X\beta\Vert_2^2+\lambda\Vert\beta\Vert_1\right\}\,. \end{equation}
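For concreteness, here is a minimal sketch of fitting the lasso in Python with scikit-learn, whose `Lasso` objective matches the penalized form above with `alpha` playing the role of $\lambda$ (the simulated sizes and coefficient values are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulate a sparse high-dimensional problem with N < p (sizes are illustrative).
rng = np.random.default_rng(0)
N, p = 100, 200
X = rng.standard_normal((N, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]   # sparse true coefficients
y = X @ beta + rng.standard_normal(N)

# sklearn's Lasso minimizes (1/(2N)) ||y - X b||_2^2 + alpha * ||b||_1,
# i.e., the convex program above with alpha = lambda.
beta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
```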

A far less understood question is how to perform statistical inference in the high-dimensional setting, for instance computing confidence intervals and $p$-values for quantities of interest.

If $N > p$, we can simply fit the full model by least squares and use the standard intervals from least-squares theory,

\begin{equation} \hat\beta_j \pm z^{(\alpha)}v_j\hat\sigma\,, \end{equation}

where $\hat\beta$ is the OLS estimate, $v_j^2=[(\X^T\X)^{-1}]_{jj}$, $\hat\sigma^2=\sum_i(y_i-\hat y_i)^2/(N-p)$, and $z^{(\alpha)}$ is the $\alpha$-percentile of the standard normal distribution. However, this approach does not work when $N < p$, since $\X^T\X$ is singular and the least-squares estimate is not unique.
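As a quick reference, a sketch of these classical intervals in Python (`ols_confidence_intervals` is a hypothetical helper written for this post, not code from the book):

```python
import numpy as np
from scipy import stats

def ols_confidence_intervals(X, y, level=0.95):
    """Classical least-squares intervals beta_hat_j +/- z * v_j * sigma_hat (N > p)."""
    N, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - p))   # \hat\sigma
    v = np.sqrt(np.diag(XtX_inv))                  # v_j = sqrt([(X^T X)^{-1}]_{jj})
    z = stats.norm.ppf(1 - (1 - level) / 2)        # normal quantile for two-sided level
    return beta_hat - z * v * sigma_hat, beta_hat + z * v * sigma_hat
```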

One proposal that has been suggested is to use a debiased version of the lasso estimator,

\begin{equation} \hat\beta^d = \hat\beta_\lambda + \frac 1N\Theta\X^T(\y-\X\hat\beta_\lambda)\,, \end{equation}

where $\Theta$ is an approximate inverse of $\hat\Sigma=\frac 1N\X^T\X$. Substituting $\y=\X\beta+\varepsilon$, we can rewrite this as

\begin{equation} \hat\beta^d = \beta+\frac 1N\Theta\X^T\varepsilon + \underbrace{(\I_p-\frac 1N\Theta\X^T\X)(\hat\beta_\lambda-\beta)}_{\hat\Delta}\,.\label{eq:6.17} \end{equation}

With \eqref{eq:6.17}, one can use the approximation $\hat\beta^d\sim N(\beta,\frac{\sigma^2}{N}\Theta\hat\Sigma\Theta^T)$ to form confidence intervals for the components $\beta_j$, provided we can construct an estimate of $\Theta$ such that $\Vert \hat\Delta\Vert_\infty\rightarrow 0$.
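A minimal sketch of the debiased estimate and the resulting intervals, assuming some approximate inverse `Theta` has already been constructed (proposals for this follow below); the plug-in noise estimate here is a crude simplification for illustration only:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def debiased_lasso_ci(X, y, Theta, lam, level=0.95):
    """Debiased lasso beta^d = beta_lasso + (1/N) Theta X^T (y - X beta_lasso),
    with per-coordinate intervals from the normal approximation in the text."""
    N, p = X.shape
    beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    resid = y - X @ beta_lasso
    beta_d = beta_lasso + Theta @ X.T @ resid / N
    sigma2_hat = resid @ resid / N                 # crude plug-in for sigma^2
    Sigma_hat = X.T @ X / N
    # Diagonal of the approximate covariance (sigma^2 / N) Theta Sigma_hat Theta^T.
    se = np.sqrt(sigma2_hat / N * np.diag(Theta @ Sigma_hat @ Theta.T))
    z = stats.norm.ppf(1 - (1 - level) / 2)
    return beta_d, beta_d - z * se, beta_d + z * se
```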

There are several different proposals for estimating $\Theta$:

  • estimate $\Theta$ using neighborhood-based methods that impose sparsity on its components
  • for each $j$, define $m_j$ to be the solution to the convex program \begin{equation} \min_m m^T\hat\Sigma m\quad\text{subject to }\Vert\hat\Sigma m-e_j\Vert_\infty\le\gamma\,, \end{equation} where $e_j$ is the $j$-th unit vector and $\gamma$ is a tolerance parameter. Then set $\hat\Theta=(m_1,\ldots,m_p)$, which tries to make both $\hat\Sigma\hat\Theta\approx \I$ and the variance of $\hat\beta_j^d$ small (a code sketch of this program follows the list).
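One possible realization of the second proposal, sketched with cvxpy (the solver and the default $\gamma=0.1$ are assumptions for illustration, not prescriptions from the book):

```python
import cvxpy as cp
import numpy as np

def estimate_theta(X, gamma=0.1):
    """Solve, for each j, min_m m^T Sigma_hat m s.t. ||Sigma_hat m - e_j||_inf <= gamma,
    stacking the solutions m_j as the rows of Theta_hat (Sigma_hat is symmetric,
    so the constraint also pushes each row of Theta_hat @ Sigma_hat toward e_j^T)."""
    N, p = X.shape
    Sigma_hat = X.T @ X / N
    Theta_hat = np.zeros((p, p))
    for j in range(p):
        e_j = np.zeros(p)
        e_j[j] = 1.0
        m = cp.Variable(p)
        # m^T Sigma_hat m = ||X m||^2 / N, written so cvxpy sees a convex objective.
        objective = cp.Minimize(cp.sum_squares(X @ m) / N)
        constraints = [cp.norm(Sigma_hat @ m - e_j, "inf") <= gamma]
        cp.Problem(objective, constraints).solve()
        Theta_hat[j] = m.value
    return Theta_hat
```

Combined with the `debiased_lasso_ci` sketch above, this yields intervals for each $\beta_j$.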

TODO: the data can be found at Datasets used in SLS; try to reproduce Figure 6.13.

