# Cross-Validation for High-Dimensional Ridge and Lasso

##### Posted on Sep 16, 2021 (Update: Mar 18, 2022)

This note collects several references from the research literature on cross-validation.

$\newcommand\loo{\mathrm{loo}} \newcommand\gcv{\mathrm{gcv}} \newcommand\asto{\overset{a.s.}{\Rightarrow}} \newcommand\err{\mathrm{err}} \newcommand\Err{\mathrm{Err}} \newcommand\tr{\mathrm{tr}} \newcommand\E{\mathbb{E}} \newcommand\IR{\mathbb{R}} \newcommand\argmin{\mathop{\mathrm{arg\,min}}}$

## Ridge Regression

Examine generalized and leave-one-out cross-validation for ridge regression in a proportional asymptotics framework, where the dimension of the feature space grows proportionally with the number of observations.

• given i.i.d. samples from a linear model with an arbitrary feature covariance and a signal vector that is bounded in $\ell_2$ norm, we show that generalized cross-validation for ridge regression converges almost surely to the expected out-of-sample prediction error, uniformly over a range of ridge regularization parameters that includes zero (and even negative values)
• prove the analogous result for leave-one-out cross-validation
• ridge tuning via minimization of generalized or leave-one-out cross-validation asymptotically almost surely delivers the optimal level of regularization for predictive accuracy.
• Ridge Error Analysis
• Ridge Cross Validation
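Both criteria have closed-form shortcuts for ridge, based on the smoother matrix $S(\lambda)=X(X^TX+n\lambda I_p)^{-1}X^T$. Here is a minimal numerical sketch (my own construction, not from the paper; isotropic Gaussian design for illustration):

```python
import numpy as np

def ridge_cv(X, y, lam):
    """Shortcut GCV and LOOCV for the ridge estimate (X'X + n*lam*I)^{-1} X'y."""
    n, p = X.shape
    # smoother matrix S(lam) = X (X'X + n*lam*I)^{-1} X'
    S = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
    resid = y - S @ y
    gcv = np.mean(resid ** 2) / (1 - np.trace(S) / n) ** 2
    loo = np.mean((resid / (1 - np.diag(S))) ** 2)
    return gcv, loo

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p) / np.sqrt(p)   # signal bounded in l2 norm
y = X @ beta0 + rng.standard_normal(n)
gcv, loo = ridge_cv(X, y, 0.5)
```

For ridge the LOOCV shortcut is exact: refitting $n$ times with one observation held out reproduces `loo` to machine precision.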

### Problem Setup

Consider the out-of-sample prediction error, i.e., the prediction error conditional on the training dataset,

$\err(\lambda) = \Err(\hat\beta_\lambda)$

of the ridge estimate $\hat\beta_\lambda$. The goal is to analyze the differences between the cross-validation estimators of the risk and the risk itself,

$\loo(\lambda) - \err(\lambda)$

and

$\gcv(\lambda) - \err(\lambda)$

Also, denote the optimal parameters by $\lambda_I^\star, \hat\lambda_I^\gcv, \hat\lambda_I^\loo$, and compare the prediction errors of the models tuned using GCV and LOOCV.

### Proof of Lemma 5.1

First of all, for a symmetric and positive semidefinite matrix $A$, the largest singular value coincides with the largest eigenvalue:

$\Vert A\Vert_2 = \sigma_\max(A) = \sqrt{\lambda_\max(A^TA)} = \lambda_\max(A)\,.$

and hence $\Vert\Sigma\Vert_2 = \Vert\Sigma^{1/2}\Vert_2^2$.
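Both identities are easy to confirm numerically on a random PSD matrix (a quick check of my own, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
Sigma = M @ M.T                               # symmetric positive semidefinite
w, V = np.linalg.eigh(Sigma)
# clip tiny negative eigenvalues from roundoff before taking square roots
Sigma_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

op_norm = np.linalg.norm(Sigma, 2)            # spectral norm = sigma_max(Sigma)
lam_max = np.linalg.eigvalsh(Sigma).max()
```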

Suppose that $\{x_{jk}\}$ is a double array of i.i.d. random variables with mean zero, variance $\sigma^2$, and finite fourth moment. Let $X_n = (x_{jk})_{j\le p, k\le n}$ and $S_n=\frac 1nX_nX_n^*$, where $p/n\rightarrow\gamma\in(0,1]$. Then, almost surely,

\begin{align*} \lim_{n\rightarrow\infty} \lambda_\max(S_n) &= \sigma^2(1+\sqrt \gamma)^2\\ \lim_{n\rightarrow\infty} \lambda_\min(S_n) &= \sigma^2(1-\sqrt \gamma)^2 \end{align*}
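This is the Bai–Yin theorem. A quick simulation (my own, with unit variance and $\gamma=1/4$) shows that the extreme eigenvalues are already close to the limits at moderate size:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 4000, 1000                     # aspect ratio gamma = p/n = 0.25
X = rng.standard_normal((p, n))
eig = np.linalg.eigvalsh(X @ X.T / n)
g = p / n
edge_hi = (1 + np.sqrt(g)) ** 2       # limiting largest eigenvalue
edge_lo = (1 - np.sqrt(g)) ** 2       # limiting smallest eigenvalue
```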

Consider the matrix $B_n=\frac 1n T_n^{1/2}X_nX_n^*T_n^{1/2}$, where $T_n^{1/2}$ is a Hermitian square root of the Hermitian nonnegative definite $p\times p$ matrix $T_n$. With eigenvalues arranged in decreasing order, we have, for any index $i_n$,

$\lambda_{i_n+1}(B_n) \le \lambda_1(X_nX_n^*/n)\lambda_{i_n+1}(T_n)\quad \text{and}\quad \lambda_{i_n}(B_n)\ge \lambda_p(X_nX_n^*/n)\lambda_{i_n}(T_n)$
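These Ostrowski-type bounds hold for every index simultaneously; a numerical check on random matrices of my own choosing, with $n\ge p$ so that $X_nX_n^*/n$ is nonsingular:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.standard_normal((p, n))
S = X @ X.T / n                       # plays the role of X_n X_n^* / n
M = rng.standard_normal((p, p))
T = M @ M.T                           # nonnegative definite T_n
w, V = np.linalg.eigh(T)
T_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
B = T_half @ S @ T_half

lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]   # decreasing order
lam_T = np.sort(np.linalg.eigvalsh(T))[::-1]
lam_S = np.linalg.eigvalsh(S)
```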

Since the entries in $Z$ are i.i.d. with mean zero, variance 1, and a finite moment of order $4+\eta$, we have, almost surely,

\begin{align*} \lim_{n\rightarrow\infty} \lambda_\max(Z^TZ/n) = (1+\sqrt \gamma)^2\\ \lim_{n\rightarrow\infty} \lambda_\min(Z^TZ/n) = (1-\sqrt \gamma)^2\\ \end{align*}

Firstly,

\begin{align*} \Vert\hat\Sigma\Vert_2 &= \left\Vert \Sigma^{1/2}\frac{Z^TZ}{n}\Sigma^{1/2}\right\Vert_2 \le \Vert \Sigma^{1/2}\Vert_2 \left\Vert \frac{Z^TZ}{n}\right\Vert_2 \Vert \Sigma^{1/2}\Vert_2 =\Vert \Sigma\Vert_2 \left\Vert \frac{Z^TZ}{n}\right\Vert_2 \end{align*}

Alternatively, we can apply Fact 1,

$\Vert\hat\Sigma\Vert_2 = \lambda_\max(\hat\Sigma)\le \lambda_\max(Z^TZ/n)\lambda_\max(\Sigma) = \Vert \Sigma\Vert_2 \lambda_\max(Z^TZ/n)$

then from Theorem 5.11, almost surely for large n,

$\Vert\hat\Sigma\Vert_2 \le (1+\sqrt{\gamma})^2\Vert\Sigma\Vert_2\,.$

Next consider

\begin{align*} \Vert (\hat\Sigma + \lambda I_p)^+\Vert_2 &= \sigma_\max ((\hat \Sigma+\lambda I_p)^+)=\frac{1}{\lambda_\min(\hat\Sigma+\lambda I_p)}=\frac{1}{\lambda_\min(\hat\Sigma) + \lambda} \end{align*}

By Fact 1,

\begin{align*} \lambda_\min(\hat\Sigma) &\ge \lambda_\min (Z^TZ/n)\lambda_\min(\Sigma) = r_\min \lambda_\min(Z^TZ/n)\,, \end{align*}

where $r_\min \triangleq \lambda_\min(\Sigma)$.

It follows that almost surely for large $n$, we have

$\Vert (\hat\Sigma + \lambda I_p)^+\Vert_2 \le \frac{1}{\lambda -\lambda_\min}\,,$

where $\lambda_\min \triangleq -(1-\sqrt\gamma)^2r_\min$.
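The two deterministic ingredients here, the exact formula $\Vert(\hat\Sigma+\lambda I_p)^{-1}\Vert_2 = 1/(\lambda_\min(\hat\Sigma)+\lambda)$ and the lower bound on $\lambda_\min(\hat\Sigma)$, can be verified directly (example matrices of my own construction):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 200, 50, 0.5
Z = rng.standard_normal((n, p))
M = rng.standard_normal((p, p))
Sigma = M @ M.T + 0.1 * np.eye(p)     # positive definite, so r_min > 0
w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T
Sigma_hat = Sigma_half @ (Z.T @ Z / n) @ Sigma_half

inv_norm = np.linalg.norm(np.linalg.inv(Sigma_hat + lam * np.eye(p)), 2)
lam_min_hat = np.linalg.eigvalsh(Sigma_hat).min()
```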

### Proof of Lemma 5.2

The goal is to establish

$\gcv_c(\lambda) \asto 0$

in the proportional asymptotic limit.

Write $\gcv_c(\lambda) = b_n^T\varepsilon/n$, where

$b_n = 2X(I_p-(\hat\Sigma + \lambda I_p)^+\hat\Sigma)^2\beta_0$

then, as in the proof of Lemma 5.1, almost surely for large $n$,

$\Vert b_n\Vert^2/n\le C$

for some constant $C$.

### Proof of Lemma 5.3

First establish

$\gcv_b(\lambda) - \frac{\mathrm{err}_b(\lambda)}{(1+\tr[(\hat\Sigma+\lambda I_p)^+\Sigma]/n)^2} \asto 0\,.$

Let $B=\beta_0\beta_0^T$, then

$\beta_0^T(I_p-\hat\Sigma(\hat\Sigma + \lambda I_p)^+)\hat\Sigma (I_p-\hat\Sigma(\hat\Sigma + \lambda I_p)^+)\beta_0 = \frac 1n \sum_{i=1}^n x_i^T(I_p-\hat\Sigma (\hat\Sigma+\lambda I_p)^+)B(I_p-\hat\Sigma(\hat\Sigma+\lambda I_p)^+)x_i$

where the summands are quadratic forms in which the point of evaluation $x_i$ and the matrix are dependent. Using the standard leave-one-out trick and the Sherman–Morrison–Woodbury formula,

$x_i^T(I_p-\hat\Sigma(\hat\Sigma + \lambda I_p)^+)B(I_p - \hat\Sigma(\hat\Sigma + \lambda I_p)^+)x_i = \frac{x_i^T(I_p-\hat\Sigma_{-i}(\hat\Sigma_{-i}+\lambda I_p)^+)B(I_p-\hat\Sigma_{-i}(\hat\Sigma_{-i}+\lambda I_p)^+)x_i }{(1+x_i^T(\hat\Sigma_{-i}+\lambda I_p)^+x_i/n)^2}$
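Since $I_p-\hat\Sigma(\hat\Sigma+\lambda I_p)^{-1}=\lambda(\hat\Sigma+\lambda I_p)^{-1}$, the identity is exact, and can be checked numerically for every $i$ (random data of my own choosing, with $\hat\Sigma_{-i}=\hat\Sigma-x_ix_i^T/n$):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 50, 10, 0.7
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
B = np.outer(beta0, beta0)
I = np.eye(p)
Sig = X.T @ X / n

lhs_vals, rhs_vals = [], []
for i in range(n):
    x = X[i]
    Sig_i = Sig - np.outer(x, x) / n          # leave-one-out covariance
    A = I - Sig @ np.linalg.inv(Sig + lam * I)
    A_i = I - Sig_i @ np.linalg.inv(Sig_i + lam * I)
    q = x @ np.linalg.inv(Sig_i + lam * I) @ x / n
    lhs_vals.append(x @ A @ B @ A @ x)
    rhs_vals.append((x @ A_i @ B @ A_i @ x) / (1 + q) ** 2)
```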

Then, by Lemma S.3.1,

$1+\tr[(\hat\Sigma+\lambda I_p)^+\Sigma]/n - \frac{1}{1-\tr[(\hat\Sigma+\lambda I_p)^+\hat\Sigma]/n}\asto 0\,,$

we can obtain

$\frac{\gcv_b(\lambda)}{\gcv_d(\lambda)} - \mathrm{err}(\lambda) \asto 0\,.$
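Lemma S.3.1 itself can be sanity-checked by simulation (isotropic design $\Sigma=I_p$ and parameters of my own choosing; the gap should already be small at moderate $n$):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 2000, 1000, 1.0           # gamma = p/n = 0.5
Z = rng.standard_normal((n, p))
Sig_hat = Z.T @ Z / n                 # Sigma = I_p, so Sigma_hat = Z'Z/n
R = np.linalg.inv(Sig_hat + lam * np.eye(p))
lhs = 1 + np.trace(R) / n             # 1 + tr[(Sig_hat + lam I)^{-1} Sigma]/n
rhs = 1 / (1 - np.trace(R @ Sig_hat) / n)
gap = abs(lhs - rhs)
```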

## Lasso

There exist very few results on the properties of the Lasso estimator when $\lambda$ is chosen using cross-validation. The paper derives non-asymptotic error bounds for the Lasso estimator when its penalty parameter is chosen using $K$-fold cross-validation.

• the bounds imply that the cross-validated Lasso estimator achieves nearly optimal rates of convergence in the prediction norm

For example, when the conditional distribution of the noise $\varepsilon$ given $X$ is Gaussian, the prediction-norm bound implies the $L^2$-norm bound

$\Vert \hat\beta(\hat\lambda) - \beta\Vert_2 = O_P\left(...\right)$

• the results cover the case when $p$ is (potentially much) larger than $n$, and also allow for non-Gaussian noise.

Consider the regression model

$Y = X'\beta + \varepsilon, \E[\varepsilon \mid X] = 0$

Consider triangular array asymptotics (??), so that the distribution of $(X, Y)$, and in particular the dimension $p$ of the vector $X$, is allowed to depend on $n$ and can be larger or even much larger than $n$.

The Lasso estimator,

$\hat\beta(\lambda) = \argmin_{b\in\IR^p} \left(\frac 1n \sum_{i=1}^n(Y_i-X_i'b)^2 +\lambda \Vert b\Vert_1\right)$

If the distribution of $X$ is absolutely continuous w.r.t. the Lebesgue measure on $\IR^p$, then the optimization problem has a unique solution with probability one (??).

The paper assumes that $K$ is fixed, i.e., independent of $n$, and hence the results do not cover leave-one-out cross-validation.
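For concreteness, here is a minimal sketch of the procedure being analyzed: $K$-fold cross-validated Lasso, with the Lasso solved by proximal gradient descent (ISTA). Everything here (solver, grid, data) is my own illustration, not the paper's:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Lasso via proximal gradient (ISTA) on (1/n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 / n * X.T @ (X @ b - y)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

def cv_lasso(X, y, lams, K=5, seed=0):
    """Return the lam in lams minimizing K-fold cross-validated prediction error."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % K
    errs = []
    for lam in lams:
        err = 0.0
        for k in range(K):
            tr, te = folds != k, folds == k
            b = lasso_ista(X[tr], y[tr], lam)
            err += np.sum((y[te] - X[te] @ b) ** 2)
        errs.append(err / n)
    return lams[int(np.argmin(errs))], errs

# p > n example with a sparse signal
rng = np.random.default_rng(8)
n, p, s = 100, 150, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:s] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(n)
lam_hat, errs = cv_lasso(X, y, [0.01, 0.05, 0.1, 0.5, 5.0])
```

With a large enough penalty in the grid, the heavily shrunk (near-null) model serves as a baseline; cross-validation should pick a $\lambda$ with much smaller held-out error.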

• Homrighausen and McDonald (2013): assume that $p$ is much smaller than $n$, and only show consistency of the (leave-one-out) cross-validated Lasso estimator
• Homrighausen and McDonald (2013): give the “first” result when the smoothing parameter is chosen via cross-validation, in a high-dimensional setting where the number of predictors $p=n^\alpha$, $\alpha > 0$, grows with the number of observations
• Homrighausen and McDonald (2016): high-dimensional setting with random design wherein the number of predictors $p$ grows with the number of observations $n$

## Other

The training error is biased downward, while $K$-fold cross-validation is biased upward; the paper investigates two families of estimators that connect the training error and $K$-fold cross-validation.

This strategy reminds me of the one used in the bootstrap, as discussed in ESL.

Published in categories Note