# Cross-Validation for High-Dimensional Ridge and Lasso

##### Posted on Sep 16, 2021

This note collects several references on cross-validation research.

$\newcommand\loo{\mathrm{loo}} \newcommand\gcv{\mathrm{gcv}} \newcommand\err{\mathrm{err}} \newcommand\Err{\mathrm{Err}}$

## Ridge Regression

The paper examines generalized and leave-one-out cross-validation for ridge regression in a proportional asymptotic framework, where the dimension of the feature space grows proportionally with the number of observations.

• Given i.i.d. samples from a linear model with an arbitrary feature covariance and a signal vector that is bounded in $\ell_2$ norm, the authors show that generalized cross-validation for ridge regression converges almost surely to the expected out-of-sample prediction error, uniformly over a range of ridge regularization parameters that includes zero (and even negative values).
• They prove the analogous result for leave-one-out cross-validation.
• Ridge tuning via minimization of generalized or leave-one-out cross-validation asymptotically almost surely delivers the optimal level of regularization for predictive accuracy.
• Ridge Error Analysis
• Ridge Cross Validation

### Problem Setup

Consider the out-of-sample prediction error, i.e. the prediction error conditional on the training dataset,

$\err(\lambda) = \Err(\hat\beta_\lambda)$

of the ridge estimate $\hat\beta_\lambda$. The goal is to analyze the differences between the cross-validation estimators of risk and the risk itself,

$\loo(\lambda) - \err(\lambda)$

and

$\gcv(\lambda) - \err(\lambda)$

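Also, denoting the optimal parameters by $\lambda_I^\star, \hat\lambda_I^\gcv, \hat\lambda_I^\loo$ (the minimizers of $\err$, $\gcv$, and $\loo$, respectively, over a range $I$ of regularization parameters), the paper compares the prediction errors of the models tuned using GCV and LOOCV.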
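To make the three quantities concrete, here is a minimal simulation sketch. It is not the paper's experiment: the isotropic Gaussian features, the known `beta0` and `sigma`, the grid `lambdas`, and the helper names `ridge_fit`, `loo_gcv`, `pred_error` are all assumptions made for illustration. It uses the standard shortcut formulas based on the ridge smoother matrix $L_\lambda = X(X^\top X/n + \lambda I)^{-1}X^\top/n$, where LOOCV rescales each residual by $1 - [L_\lambda]_{ii}$ and GCV by $1 - \mathrm{tr}(L_\lambda)/n$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate minimizing (1/n)||y - X b||^2 + lam * ||b||^2."""
    n, p = X.shape
    A = X.T @ X / n + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y / n)

def loo_gcv(X, y, lam):
    """Shortcut formulas for leave-one-out and generalized cross-validation."""
    n, p = X.shape
    A = X.T @ X / n + lam * np.eye(p)
    L = X @ np.linalg.solve(A, X.T) / n          # ridge smoother matrix
    resid = y - L @ y                            # in-sample residuals
    loo = np.mean((resid / (1.0 - np.diag(L))) ** 2)
    gcv = np.mean((resid / (1.0 - np.trace(L) / n)) ** 2)
    return loo, gcv

def pred_error(beta_hat, beta0, sigma):
    """Conditional prediction error, assuming identity feature covariance."""
    return sigma ** 2 + np.sum((beta_hat - beta0) ** 2)

# Simulated data (assumed setup: isotropic features, p / n = 0.5).
rng = np.random.default_rng(0)
n, p, sigma = 200, 100, 1.0
beta0 = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta0 + sigma * rng.normal(size=n)

lambdas = np.logspace(-2, 1, 20)
err = np.array([pred_error(ridge_fit(X, y, lam), beta0, sigma) for lam in lambdas])
loo, gcv = np.array([loo_gcv(X, y, lam) for lam in lambdas]).T

# Tuned parameters: minimizers of each curve over the grid.
lam_star = lambdas[np.argmin(err)]
lam_loo = lambdas[np.argmin(loo)]
lam_gcv = lambdas[np.argmin(gcv)]
print(f"lambda*: {lam_star:.3f}, loo: {lam_loo:.3f}, gcv: {lam_gcv:.3f}")
print(f"max |loo - err|: {np.max(np.abs(loo - err)):.3f}")
print(f"max |gcv - err|: {np.max(np.abs(gcv - err)):.3f}")
```

Plotting `err`, `loo`, and `gcv` against `lambdas` should give a rough visual impression of the uniform closeness that the paper establishes asymptotically; at this finite sample size the curves will of course not coincide exactly.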

## Lasso

There exist very few results about the properties of the Lasso estimator when $\lambda$ is chosen using cross-validation.

The paper derives non-asymptotic error bounds for the Lasso estimator when the penalty parameter is chosen using $K$-fold cross-validation.

• The bounds imply that the cross-validated Lasso estimator has nearly optimal rates of convergence in the prediction, $L^2$, and $L^1$ norms.

For example, when the conditional distribution of the noise $\epsilon$ given $X$ is Gaussian, the bound in the $L^2$ norm implies that

$\Vert \hat\beta(\hat\lambda) - \beta\Vert_2 = O_P\left(...\right)$

• The results cover the case when $p$ is (potentially much) larger than $n$ and also allow for the case of non-Gaussian noise.
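The estimator $\hat\beta(\hat\lambda)$ discussed above can be sketched as follows: pick $\hat\lambda$ by minimizing the $K$-fold cross-validation error over a candidate grid, then refit on the full sample. This is only an illustration under assumed settings, not the paper's procedure in detail. The coordinate-descent solver `lasso_cd`, the objective scaling $\frac{1}{2n}\Vert y - X\beta\Vert_2^2 + \lambda\Vert\beta\Vert_1$, the grid, and the simulated sparse model are all choices made here.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100, tol=1e-6):
    """Coordinate descent for (1/(2n))||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()               # resid = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                resid += X[:, j] * (beta[j] - new)
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:                     # stop once a sweep changes little
            break
    return beta

def kfold_cv_lasso(X, y, lambdas, K=5, seed=0):
    """K-fold cross-validation error of the Lasso for each candidate lambda."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv = np.zeros(len(lambdas))
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        for k, lam in enumerate(lambdas):
            beta = lasso_cd(X[train], y[train], lam)
            cv[k] += np.sum((y[test_idx] - X[test_idx] @ beta) ** 2)
    return cv / n

# Simulated sparse model with p > n (assumed setup).
rng = np.random.default_rng(0)
n, p, s = 80, 120, 5
beta0 = np.zeros(p)
beta0[:s] = 2.0
X = rng.normal(size=(n, p))
y = X @ beta0 + rng.normal(size=n)

lam_max = np.max(np.abs(X.T @ y)) / n            # smallest lambda giving beta = 0
lambdas = lam_max * np.logspace(0, -2, 8)
cv = kfold_cv_lasso(X, y, lambdas, K=5)
lam_hat = lambdas[np.argmin(cv)]
beta_hat = lasso_cd(X, y, lam_hat)               # refit on the full sample
print(f"lambda_hat: {lam_hat:.3f}")
print(f"L2 estimation error: {np.linalg.norm(beta_hat - beta0):.3f}")
```

The solver is deliberately plain (no warm starts or screening rules), so it is only suitable for small examples like this one.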

## Other

The training error has a downward bias, while $K$-fold cross-validation has an upward bias. The paper investigates two families of estimators that connect the training error and $K$-fold cross-validation.

This strategy reminds me of the one used for the bootstrap, as mentioned in ESL.
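The opposite biases of the training error and $K$-fold cross-validation can be checked with a small Monte Carlo sketch. Everything here is an assumption for illustration rather than the paper's setting: ordinary least squares as the fitting procedure, isotropic Gaussian features, and the values of `n`, `p`, `K`, and `reps`. The two families of estimators from the paper are not implemented.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def train_error(X, y):
    """Mean squared residual on the training sample."""
    return np.mean((y - X @ ols(X, y)) ** 2)

def kfold_error(X, y, K, rng):
    """K-fold cross-validation estimate of the prediction error."""
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), K)
    sse = 0.0
    for idx in folds:
        mask = np.ones(n, dtype=bool)
        mask[idx] = False
        b = ols(X[mask], y[mask])                # fit without the held-out fold
        sse += np.sum((y[idx] - X[idx] @ b) ** 2)
    return sse / n

def true_error(X, y, beta0, sigma):
    """Conditional prediction error, assuming identity feature covariance."""
    b = ols(X, y)
    return sigma ** 2 + np.sum((b - beta0) ** 2)

rng = np.random.default_rng(1)
n, p, sigma, K, reps = 50, 10, 1.0, 5, 200
beta0 = np.ones(p)

results = []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ beta0 + sigma * rng.normal(size=n)
    results.append((train_error(X, y),
                    kfold_error(X, y, K, rng),
                    true_error(X, y, beta0, sigma)))

# Average each quantity over the Monte Carlo replications.
train_mean, cv_mean, err_mean = np.mean(results, axis=0)
print(f"training error:   {train_mean:.3f}")
print(f"{K}-fold CV error:  {cv_mean:.3f}")
print(f"prediction error: {err_mean:.3f}")
```

In this kind of setup one would typically expect the averaged training error to fall below the averaged prediction error and the averaged $K$-fold estimate to sit somewhat above it, since each fold is fit on fewer than $n$ observations.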
