# Debiased Lasso

##### Posted on

This post is based on Section 6.4 of Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. “Statistical Learning with Sparsity,” 2016, 362.

For the linear model $\y=\X\beta+\varepsilon$, consider the high-dimensional regime wherein the number of parameters $p$ exceeds the sample size $N$. A prominent approach is the Lasso defined through the following convex optimization problem:

\[\hat\beta_\lambda\equiv \arg\min_{\beta\in \IR^p}\left\{\frac{1}{2N}\Vert \y-\X\beta\Vert^2_2+\lambda \Vert \beta\Vert_1\right\}\,.\]

A far less understood question is how to perform statistical inference in the high-dimensional setting, for instance computing confidence intervals and $p$-values for quantities of interest.
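Before turning to inference, it may help to fix notation by computing $\hat\beta_\lambda$ itself. The following is a minimal sketch of cyclic coordinate descent for the lasso objective above, not a tuned solver. The function names `soft_threshold` and `lasso_cd`, the simulated data, and the choice $\lambda=0.2$ are all my own illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2N)||y - X b||_2^2 + lam * ||b||_1.

    Assumes no column of X is identically zero.
    """
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / N      # (1/N) * x_j^T x_j for each column
    resid = y - X @ beta                   # current full residual
    for _ in range(n_iter):
        for j in range(p):
            # (1/N) * x_j^T (partial residual that leaves out coordinate j)
            rho = X[:, j] @ resid / N + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep the residual in sync
            beta[j] = new
    return beta

# Simulated high-dimensional example (illustrative values only): N < p.
rng = np.random.default_rng(0)
N, p = 50, 100
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(N)
beta_hat = lasso_cd(X, y, lam=0.2)
```

Each coordinate update is an exact one-dimensional minimization, so the objective never increases from one sweep to the next; a fixed number of sweeps is used here only to keep the sketch short.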

If $N > p$, we can simply fit the full model by least squares and use standard intervals from least-squares theory

\[\hat\beta_j \pm z^{(\alpha)}v_j\hat\sigma\,,\]where $\hat\beta_j$ is the OLS estimate, $v_j^2=\left[(\X^T\X)^{-1}\right]_{jj}$, $\hat\sigma^2=\sum_i(y_i-\hat y_i)^2/(N-p)$, and $z^{(\alpha)}$ is the $\alpha$-percentile of the standard normal distribution. However, this approach does not work when $N < p$, since $\X^T\X$ is then singular.
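For concreteness, here is a small sketch of these classical intervals in the $N > p$ case. The function name `ols_intervals`, the two-sided `level` argument (which picks the normal quantile), and the simulated data are my own assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def ols_intervals(X, y, level=0.95):
    """Normal-theory intervals beta_j_hat +/- z * v_j * sigma_hat (needs N > p)."""
    N, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                    # OLS estimate
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - p))    # residual standard error
    v = np.sqrt(np.diag(XtX_inv))                   # v_j from the diagonal of (X^T X)^{-1}
    z = NormalDist().inv_cdf(0.5 + level / 2)       # two-sided normal quantile
    half = z * v * sigma_hat
    return beta_hat, beta_hat - half, beta_hat + half

# Simulated low-dimensional data (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.standard_normal(200)
beta_hat, lower, upper = ols_intervals(X, y)
```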

One proposal that has been suggested is to use a debiased version of the lasso estimator,

\begin{equation} \hat\beta^d=\hat\beta_\lambda+\frac 1N \Theta\X^T(\y-\X\hat\beta_\lambda)\,,\label{eq:6.16} \end{equation}

where $\Theta$ is an approximate inverse of $\hat\Sigma=\frac 1N\X^T\X$. Substituting $\y=\X\beta+\varepsilon$ into \eqref{eq:6.16}, we can rewrite it as

\begin{equation} \hat\beta^d = \beta+\frac 1N\Theta\X^T\varepsilon + \underbrace{(\I_p-\frac 1N\Theta\X^T\X)(\hat\beta_\lambda-\beta)}_{\hat\Delta}\,.\label{eq:6.17} \end{equation}
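In more detail, plugging $\y=\X\beta+\varepsilon$ into \eqref{eq:6.16} and writing $\hat\beta_\lambda=\beta+(\hat\beta_\lambda-\beta)$ in the last step gives

\[
\begin{aligned}
\hat\beta^d &= \hat\beta_\lambda+\frac 1N\Theta\X^T(\X\beta+\varepsilon-\X\hat\beta_\lambda)\\
&= \hat\beta_\lambda+\frac 1N\Theta\X^T\varepsilon-\frac 1N\Theta\X^T\X(\hat\beta_\lambda-\beta)\\
&= \beta+\frac 1N\Theta\X^T\varepsilon+\left(\I_p-\frac 1N\Theta\X^T\X\right)(\hat\beta_\lambda-\beta)\,.
\end{aligned}
\]

The second term is pure noise with mean zero, while the third term $\hat\Delta$ carries whatever bias remains from the lasso fit.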

If $N\ge p$ and $\X$ has full column rank, we can take $\Theta=\hat\Sigma^{-1}=\left(\frac 1N\X^T\X\right)^{-1}$, and \eqref{eq:6.16} becomes \(\hat \beta^d = (\X^T\X)^{-1}\X^T\y = \hat \beta\), the OLS estimate, which is unbiased for $\beta$. However, when $N < p$, $\X^T\X/N$ is not invertible, and we need to find an approximate inverse.
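This reduction to OLS is easy to check numerically. The sketch below applies the one-step correction of \eqref{eq:6.16} with the exact inverse $\Theta=\hat\Sigma^{-1}$. The function name `debias` and the simulated data are my own assumptions, and `beta_init` is an arbitrary vector standing in for the lasso fit, since the result does not depend on the starting point in this case.

```python
import numpy as np

def debias(X, y, beta_init, Theta):
    """One-step correction: beta_init + (1/N) * Theta X^T (y - X beta_init)."""
    N = X.shape[0]
    return beta_init + Theta @ X.T @ (y - X @ beta_init) / N

# Simulated low-dimensional data (illustrative values only): N > p.
rng = np.random.default_rng(1)
N, p = 100, 10
X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

Sigma_hat = X.T @ X / N
Theta = np.linalg.inv(Sigma_hat)          # exact inverse exists because N > p
beta_init = rng.standard_normal(p)        # arbitrary stand-in for the lasso estimate
beta_d = debias(X, y, beta_init, Theta)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_d, beta_ols))      # compares the debiased estimate with OLS
```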

With \eqref{eq:6.17}, one can use the approximation $\hat\beta^d\sim N(\beta,\frac{\sigma^2}{N}\Theta\hat\Sigma\Theta^T)$, whose covariance is that of the noise term $\frac 1N\Theta\X^T\varepsilon$, to form confidence intervals for the components $\beta_j$, provided we can find an estimate of $\Theta$ for which $\Vert \hat\Delta\Vert_\infty\rightarrow 0$.
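Given some $\Theta$ and a noise level, the normal approximation translates into intervals as sketched below. This is only an illustration under stated assumptions: the function name `debiased_intervals` is mine, `sigma` must be supplied by the user (how to estimate $\sigma$ when $N < p$ is not covered in this post), and the demo uses $N > p$ with the exact inverse purely as a sanity check against the classical least-squares widths.

```python
import numpy as np
from statistics import NormalDist

def debiased_intervals(beta_d, X, Theta, sigma, level=0.95):
    """Intervals from beta_d ~ N(beta, (sigma^2 / N) * Theta Sigma_hat Theta^T)."""
    N = X.shape[0]
    Sigma_hat = X.T @ X / N
    cov = (sigma ** 2 / N) * (Theta @ Sigma_hat @ Theta.T)
    se = np.sqrt(np.diag(cov))                     # standard error of each component
    z = NormalDist().inv_cdf(0.5 + level / 2)      # two-sided normal quantile
    return beta_d - z * se, beta_d + z * se

# Sanity-check setup (illustrative values only): N > p, exact inverse.
rng = np.random.default_rng(2)
N, p = 100, 10
X = rng.standard_normal((N, p))
Theta = np.linalg.inv(X.T @ X / N)
beta_d = rng.standard_normal(p)                    # placeholder values, not a real fit
lower, upper = debiased_intervals(beta_d, X, Theta, sigma=1.0)
```

With $\Theta=\hat\Sigma^{-1}$ the covariance collapses to $\sigma^2(\X^T\X)^{-1}$, so the half-widths agree with $z\,v_j\,\sigma$ from the least-squares intervals.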

There are several different proposals for estimating $\Theta$:

- estimate $\Theta$ using neighborhood-based methods to impose sparsity on the components
- for each $j$, define $m_j$ to be the solution to the convex program \(\min_{m\in\IR^p}m^T\hat\Sigma m\text{ subject to }\Vert \hat\Sigma m-e_j\Vert_\infty \le \gamma\,,\) where $e_j$ is the $j$-th unit vector. Then set $\hat\Theta=(m_1,\ldots,m_p)$, which tries to make both $\hat\Sigma\hat\Theta\approx I$ and the variance of $\hat \beta_j^d$ small.
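A short calculation shows why the constraint in the second proposal helps. I assume here that $m_j^T$ is used as the $j$-th row of the matrix multiplying $\X^T$ in \eqref{eq:6.16}; this row convention is my assumption. Since $\hat\Sigma$ is symmetric, the $j$-th component of $\hat\Delta$ is $(e_j-\hat\Sigma m_j)^T(\hat\beta_\lambda-\beta)$, and Hölder's inequality gives

\[
|\hat\Delta_j| \le \Vert \hat\Sigma m_j-e_j\Vert_\infty\,\Vert \hat\beta_\lambda-\beta\Vert_1 \le \gamma\,\Vert \hat\beta_\lambda-\beta\Vert_1\,.
\]

So a small $\gamma$ keeps $\Vert\hat\Delta\Vert_\infty$ small whenever the $\ell_1$ error of the lasso is controlled. Under the same convention, the objective $m_j^T\hat\Sigma m_j$ is the $j$-th diagonal entry of $\hat\Theta\hat\Sigma\hat\Theta^T$, which is the variance of $\hat\beta_j^d$ up to the factor $\sigma^2/N$.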

TODO: try to reproduce Figure 6.13; the data can be found at Datasets used in SLS.