# Debiased Lasso

This post is based on Section 6.4 of Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. “Statistical Learning with Sparsity,” 2016, 362.

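For the linear model $\y=\X\beta+\varepsilon$, consider the high-dimensional regime in which the number of parameters $p$ exceeds the sample size $N$. A prominent approach is the lasso, defined through the following convex optimization problem:

$$\hat\beta_\lambda=\arg\min_{\beta\in\mathbb R^p}\left\{\frac{1}{2N}\Vert \y-\X\beta\Vert_2^2+\lambda\Vert\beta\Vert_1\right\}\,.$$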
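
As a rough illustration of how the lasso estimate $\hat\beta_\lambda$ might be computed, here is a minimal sketch of cyclic coordinate descent. It assumes the usual scaling of the objective, $\frac{1}{2N}\Vert \y-\X\beta\Vert_2^2+\lambda\Vert\beta\Vert_1$; the function names `soft_threshold` and `lasso_cd`, the toy data, and the value `lam=0.3` are all made up for this example and are not taken from the book.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2N)) * ||y - X b||^2 + lam * ||b||_1."""
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / N   # (1/N) * x_j^T x_j for each column
    resid = y - X @ beta                # full residual, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                # skip all-zero columns
            # (1/N) * x_j^T (partial residual with x_j's contribution added back)
            rho = X[:, j] @ resid / N + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta


# Hypothetical toy data with N < p: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
N, p = 50, 100
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + 0.5 * rng.standard_normal(N)

beta_hat = lasso_cd(X, y, lam=0.3)
print("indices of nonzero coefficients:", np.flatnonzero(beta_hat))
```

The estimate is sparse but shrunken toward zero, which is the bias that the rest of this post tries to correct.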

A far less understood question is how to perform statistical inference in the high-dimensional setting, for instance computing confidence intervals and $p$-values for quantities of interest.

If $N > p$, we can simply fit the full model by least squares and use the standard intervals from least-squares theory,

$$\hat\beta_j\pm z^{(\alpha)}v_j\hat\sigma\,,$$

where $\hat\beta_j$ is the $j$-th component of the OLS estimate, $v_j^2=[(\X^T\X)^{-1}]_{jj}$, $\hat\sigma^2=\sum_i(y_i-\hat y_i)^2/(N-p)$, and $z^{(\alpha)}$ is the $\alpha$-percentile of the standard normal distribution. However, this approach does not work when $N < p$, because $\X^T\X$ is then singular.
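
The least-squares intervals built from these quantities can be sketched in a few lines of NumPy. This is only an illustration for the $N > p$ case: the function name `ols_intervals` and the simulated data are hypothetical, and the two-sided interval uses the $(1+\text{level})/2$ normal percentile, which is an assumption about how $z^{(\alpha)}$ is chosen.

```python
import numpy as np
from statistics import NormalDist


def ols_intervals(X, y, level=0.95):
    """Least-squares estimate with intervals beta_j +/- z * v_j * sigma_hat."""
    N, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)           # requires N > p and full column rank
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - p))
    v = np.sqrt(np.diag(XtX_inv))              # v_j = sqrt([(X^T X)^{-1}]_jj)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    half = z * v * sigma_hat
    return beta_hat, beta_hat - half, beta_hat + half


# Hypothetical low-dimensional data (N > p).
rng = np.random.default_rng(0)
N, p = 200, 5
X = rng.standard_normal((N, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.standard_normal(N)

beta_hat, lower, upper = ols_intervals(X, y)
for j in range(p):
    print(f"beta_{j}: {beta_hat[j]: .3f}  [{lower[j]: .3f}, {upper[j]: .3f}]")
```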

One proposal is to use a debiased version of the lasso estimator,

\begin{equation} \hat\beta^d=\hat\beta_\lambda+\frac 1N \Theta\X^T(\y-\X\hat\beta_\lambda)\,,\label{eq:6.16} \end{equation}

where $\Theta$ is an approximate inverse of $\hat\Sigma=\frac 1N\X^T\X$. Substituting $\y=\X\beta+\varepsilon$, this can be rewritten as

\begin{equation} \hat\beta^d = \beta+\frac 1N\Theta\X^T\varepsilon + \underbrace{(\I_p-\frac 1N\Theta\X^T\X)(\hat\beta_\lambda-\beta)}_{\hat\Delta}\,.\label{eq:6.17} \end{equation}
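
For completeness, the intermediate steps between \eqref{eq:6.16} and \eqref{eq:6.17} can be spelled out as follows; nothing is used beyond the model $\y=\X\beta+\varepsilon$:

$$\begin{aligned}
\hat\beta^d &= \hat\beta_\lambda+\frac 1N\Theta\X^T(\X\beta+\varepsilon-\X\hat\beta_\lambda)\\
&= \hat\beta_\lambda+\frac 1N\Theta\X^T\varepsilon-\frac 1N\Theta\X^T\X(\hat\beta_\lambda-\beta)\\
&= \beta+\frac 1N\Theta\X^T\varepsilon+\Big(\I_p-\frac 1N\Theta\X^T\X\Big)(\hat\beta_\lambda-\beta)\,,
\end{aligned}$$

where the last line adds and subtracts $\beta$.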

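If $N\ge p$ and $\X^T\X$ is invertible, we can take $\Theta^{-1}=\frac 1N\X^T\X$, and \eqref{eq:6.16} becomes

$$\hat\beta^d=(\X^T\X)^{-1}\X^T\y\,,$$

which is the least-squares estimate and hence unbiased for $\beta$. However, when $N < p$, $\X^T\X/N$ is not invertible and we need to find an approximate inverse.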
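
The low-dimensional case is easy to check numerically. The sketch below uses a hypothetical helper `debias` and simulated data, and shows that with $\Theta$ equal to the exact inverse of $\hat\Sigma$ the debiasing step lands on the least-squares estimate whatever the starting estimate is.

```python
import numpy as np


def debias(X, y, beta_init, Theta):
    """Debiased estimate: beta_init + (1/N) * Theta @ X^T (y - X beta_init)."""
    N = X.shape[0]
    return beta_init + Theta @ X.T @ (y - X @ beta_init) / N


# Hypothetical data with N > p so that X^T X is invertible.
rng = np.random.default_rng(1)
N, p = 100, 4
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.standard_normal(N)

Theta = np.linalg.inv(X.T @ X / N)           # exact inverse of Sigma_hat
beta_init = np.array([0.8, 0.0, -1.5, 0.0])  # made-up shrunken starting estimate
beta_d = debias(X, y, beta_init, Theta)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("matches least squares:", np.allclose(beta_d, beta_ols))
```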

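With \eqref{eq:6.17} and $\varepsilon\sim N(0,\sigma^2\I_N)$, the middle term $\frac 1N\Theta\X^T\varepsilon$ is exactly normal with covariance $\frac{\sigma^2}{N}\Theta\hat\Sigma\Theta^T$. One can therefore use the approximation $\hat\beta^d\sim N(\beta,\frac{\sigma^2}{N}\Theta\hat\Sigma\Theta^T)$ to form confidence intervals for the components $\beta_j$, provided we can find an estimate of $\Theta$ for which $\Vert \hat\Delta\Vert_\infty\rightarrow 0$.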
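
Given some choice of $\Theta$ and an estimate of $\sigma$, the resulting intervals could be computed along the following lines. This is a sketch under stated assumptions: `debiased_intervals` is a hypothetical name, `sigma_hat` must be supplied by the caller (the post does not say how to estimate $\sigma$ when $N < p$), and the demo uses $N > p$ with an exact inverse only so that the result is easy to check.

```python
import numpy as np
from statistics import NormalDist


def debiased_intervals(X, y, beta_init, Theta, sigma_hat, level=0.95):
    """Intervals from beta_d ~ N(beta, sigma^2/N * Theta Sigma_hat Theta^T)."""
    N = X.shape[0]
    Sigma_hat = X.T @ X / N
    beta_d = beta_init + Theta @ X.T @ (y - X @ beta_init) / N
    cov = (sigma_hat ** 2 / N) * (Theta @ Sigma_hat @ Theta.T)
    se = np.sqrt(np.diag(cov))                 # standard error of each component
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    return beta_d, beta_d - z * se, beta_d + z * se


# Hypothetical check in the easy case N > p with Theta the exact inverse.
rng = np.random.default_rng(2)
N, p = 100, 4
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.standard_normal(N)
Theta = np.linalg.inv(X.T @ X / N)

beta_d, lower, upper = debiased_intervals(X, y, np.zeros(p), Theta, sigma_hat=1.0)
print(np.column_stack([lower, beta_d, upper]))
```

In this special case $\Theta\hat\Sigma\Theta^T=\hat\Sigma^{-1}$, so the standard errors reduce to $\hat\sigma v_j$ from the least-squares intervals.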

There are several different proposals for estimating $\Theta$:

- estimate $\Theta$ using neighborhood-based methods to impose sparsity on the components
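- for each $j$, define $m_j$ to be the solution to the convex program

  $$\min_{m\in\mathbb R^p}\; m^T\hat\Sigma m\quad\text{subject to}\quad\Vert\hat\Sigma m-e_j\Vert_\infty\le\gamma\,,$$

  with $e_j$ being the $j$-th unit vector and $\gamma>0$ a tuning parameter. Then set $\hat\Theta=(m_1,\ldots,m_p)$, which tries to make both $\hat\Sigma\hat\Theta\approx I$ and the variance of $\hat \beta_j^d$ small.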
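
To see how the second proposal serves both goals, here is a short sketch. It assumes the program minimizes $m^T\hat\Sigma m$ subject to $\Vert\hat\Sigma m-e_j\Vert_\infty\le\gamma$ for a tuning parameter $\gamma$, and it reads $m_j^T$ as the $j$-th row of $\hat\Theta$, which is an assumption about orientation:

$$\begin{aligned}
\mathrm{Var}(\hat\beta_j^d)&\approx\frac{\sigma^2}{N}\big[\hat\Theta\hat\Sigma\hat\Theta^T\big]_{jj}=\frac{\sigma^2}{N}\,m_j^T\hat\Sigma m_j\,,\\
|\hat\Delta_j|&=\big|(e_j-\hat\Sigma m_j)^T(\hat\beta_\lambda-\beta)\big|\le\Vert\hat\Sigma m_j-e_j\Vert_\infty\,\Vert\hat\beta_\lambda-\beta\Vert_1\le\gamma\,\Vert\hat\beta_\lambda-\beta\Vert_1\,.
\end{aligned}$$

Under these assumptions the objective is the approximate variance of $\hat\beta_j^d$ up to the factor $\sigma^2/N$, while the constraint caps the bias term through Hölder's inequality.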

TODO: Reproduce Figure 6.13 using the data available at Datasets used in SLS.