Leave-one-out CV for Lasso
This note is on Homrighausen, D. and McDonald, D. J. (2013), "Leave-one-out cross-validation is risk consistent for lasso," arXiv:1206.6128 [math.ST].
Suppose we form the adaptive ridge regression estimator (Grandvalet, 1998)
\[\argmin_{\theta, (\lambda_j)} \Vert Y-X\theta\Vert_2^2 +\sum_{j=1}^p\lambda_j\theta_j^2\]
subject to the constraint $\lambda\sum_{j=1}^p 1/\lambda_j=p$ with each $\lambda_j > 0$. Then the solution is equivalent, under a reparameterization of $\lambda$, to the lasso solution.
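To see the equivalence, one can profile out the $(\lambda_j)$; the following is a sketch of the standard Lagrange-multiplier computation (not spelled out in the original note). Stationarity gives $\lambda_j \propto 1/\vert\theta_j\vert$, and substituting back into the penalty yields
\[\min_{(\lambda_j)}\ \sum_{j=1}^p \lambda_j\theta_j^2 = \frac{\lambda}{p}\left(\sum_{j=1}^p\vert\theta_j\vert\right)^2 = \frac{\lambda}{p}\Vert\theta\Vert_1^2,\]
so the profiled objective is $\Vert Y-X\theta\Vert_2^2 + \frac{\lambda}{p}\Vert\theta\Vert_1^2$. A squared $\ell_1$ penalty traces out the same solution path as the plain $\ell_1$ penalty, which is the reparameterization of $\lambda$ referred to above.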
The main assumptions ensure that the sequence $(X_i)_{i=1}^n$ is sufficiently regular.
- A:
\[\frac 1n X^TX \rightarrow C,\]
where $C$ is a positive definite matrix with minimum eigenvalue $c_\min > 0$.
- B:
There exists a constant $C_X < \infty$, independent of $n$, such that
\[\Vert X_i\Vert \le C_X.\]
Define the predictive risk and the leave-one-out cross-validation estimator of the risk
\[R_n(\lambda) = \frac 1n \bbE \Vert X(\hat \theta(\lambda) - \theta)\Vert^2 +\sigma^2\]
and
\[\hat R_n(\lambda) = \frac 1n \sum_{i=1}^n(Y_i - X_i^T\hat\theta^{(i)}(\lambda))^2,\]
where $\hat\theta^{(i)}(\lambda)$ denotes the lasso estimator computed with the $i$th observation held out. Let $\hat\lambda_n$ minimize $\hat R_n(\lambda)$ and let $\lambda_n$ minimize $R_n(\lambda)$ over the same range of tuning parameters.

The main result: suppose that Assumptions A and B hold and that there exists a $C_\theta < \infty$ such that $\Vert\theta\Vert_1\le C_\theta$. Also, suppose the noise $W_i$ in the linear model $Y_i = X_i^T\theta + W_i$ is sub-gaussian. Then \(R_n(\hat\lambda_n) - R_n(\lambda_n) \rightarrow 0.\)
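As a concrete illustration, here is a minimal simulation sketch of computing $\hat R_n(\lambda)$ on a grid and selecting $\hat\lambda_n$. It uses scikit-learn's Lasso, whose alpha corresponds to the penalty level only up to the scaling in its objective; the data-generating setup is invented for illustration and is not from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
theta = np.zeros(p)
theta[:3] = [2.0, -1.0, 0.5]               # sparse coefficient vector
Y = X @ theta + 0.5 * rng.normal(size=n)   # sub-gaussian noise W_i

def loo_risk(lam):
    """hat R_n(lambda): mean squared leave-one-out prediction error."""
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[mask], Y[mask])
        errs.append((Y[i] - fit.predict(X[i:i + 1])[0]) ** 2)
    return float(np.mean(errs))

lambdas = np.logspace(-3, 0, 20)
R_hat = [loo_risk(lam) for lam in lambdas]
lam_hat = lambdas[int(np.argmin(R_hat))]   # hat lambda_n minimizes hat R_n
print(f"selected lambda: {lam_hat:.4f}")
```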
To prove the theorem, one first shows that
\[\sup_\lambda \vert \hat R_n(\lambda) - R_n(\lambda)\vert \rightarrow 0\]in probability. Then the result follows as
\[\begin{align*} R_n(\hat \lambda_n) - R_n(\lambda_n) &= (R_n(\hat\lambda_n) - \hat R_n(\hat\lambda_n)) + (\hat R_n(\hat\lambda_n) - R_n(\lambda_n)) \\ &\le (R_n(\hat\lambda_n) - \hat R_n(\hat\lambda_n)) + (\hat R_n(\lambda_n) - R_n(\lambda_n))\\ &\le 2\sup_\lambda \vert R_n(\lambda) - \hat R_n(\lambda)\vert\\ &=o_p(1), \end{align*}\]where the first inequality uses $\hat R_n(\hat\lambda_n) \le \hat R_n(\lambda_n)$, which holds because $\hat\lambda_n$ minimizes $\hat R_n$. Since $R_n(\hat \lambda_n) - R_n(\lambda_n)$ is non-stochastic, convergence in probability implies convergence of the deterministic sequence.
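The sandwich bound is easy to sanity-check numerically; below is a toy verification on a grid, where the two curves are stand-ins for $R_n$ and $\hat R_n$ invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
lambdas = np.linspace(0.01, 1.0, 50)
R = (lambdas - 0.3) ** 2 + 1.0                          # stand-in true risk R_n
R_hat = R + rng.normal(scale=0.02, size=lambdas.size)   # stand-in estimate hat R_n

i_hat = int(np.argmin(R_hat))    # index of hat lambda_n
i_star = int(np.argmin(R))       # index of lambda_n

excess = R[i_hat] - R[i_star]            # R_n(hat lambda_n) - R_n(lambda_n)
bound = 2 * np.max(np.abs(R - R_hat))    # 2 sup |R_n - hat R_n|
assert excess <= bound                   # the sandwich inequality above
```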
Write
\[\begin{align*} \vert R_n(\lambda) - \hat R_n(\lambda)\vert &=\left\vert \frac 1n \bbE\Vert X\hat\theta(\lambda)\Vert_2^2 + \frac 1n \Vert X\theta\Vert_2^2 -\frac 2n \bbE (X\hat\theta(\lambda))^TX\theta + \sigma^2 - \frac 1n\sum_{i=1}^n\left(Y_i^2 + (X_i^T\hat\theta^{(i)}(\lambda))^2 - 2Y_iX_i^T\hat\theta^{(i)}(\lambda)\right) \right\vert\\ &\le \left\vert \frac 1n \bbE\Vert X\hat\theta(\lambda)\Vert_2^2 - \frac 1n\sum_{i=1}^n (X_i^T\hat\theta^{(i)}(\lambda))^2\right\vert + 2\left\vert \frac 1n \bbE (X\hat\theta(\lambda))^TX\theta - \frac 1n \sum_{i=1}^nY_iX_i^T\hat\theta^{(i)}(\lambda)\right\vert +\left\vert \frac 1n \Vert X\theta\Vert_2^2 + \sigma^2 - \frac 1n\sum_{i=1}^nY_i^2\right\vert\\ &\triangleq a + b + c. \end{align*}\]
For term $a$, observe that
\[\left\vert \frac 1n \bbE\Vert X\hat\theta(\lambda)\Vert_2^2 - \frac 1n \sum_{i=1}^n(X_i^T\hat\theta^{(i)}(\lambda))^2\right\vert \le \left\vert \frac 1n \bbE\Vert X\hat\theta(\lambda)\Vert_2^2 - \frac 1n\Vert X\hat\theta(\lambda)\Vert_2^2 \right\vert + \left\vert \frac 1n\Vert X\hat\theta(\lambda)\Vert_2^2 - \frac 1n\sum_{i=1}^n(X_i^T\hat\theta^{(i)}(\lambda))^2 \right\vert\]
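The second piece can be reduced, via the difference-of-squares identity $u^2 - v^2 = (u-v)(u+v)$, to controlling the leave-one-out perturbations; a sketch of this step (the paper's argument is more careful):
\[\frac 1n\Vert X\hat\theta(\lambda)\Vert_2^2 - \frac 1n\sum_{i=1}^n(X_i^T\hat\theta^{(i)}(\lambda))^2 = \frac 1n\sum_{i=1}^n\left(X_i^T\bigl(\hat\theta(\lambda) - \hat\theta^{(i)}(\lambda)\bigr)\right)\left(X_i^T\bigl(\hat\theta(\lambda) + \hat\theta^{(i)}(\lambda)\bigr)\right),\]
while the first piece is a concentration statement for $\frac 1n\Vert X\hat\theta(\lambda)\Vert_2^2$ about its expectation.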