# Cross-Validation for High-Dimensional Ridge and Lasso

This note collects several research references on cross-validation.

\[\newcommand\loo{\mathrm{loo}} \newcommand\gcv{\mathrm{gcv}} \newcommand\err{\mathrm{err}} \newcommand\Err{\mathrm{Err}}\]

## Ridge Regression

The work summarized here examines generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) for ridge regression in a proportional asymptotic framework, where the dimension of the feature space grows proportionally with the number of observations.

- Given i.i.d. samples from a linear model with an arbitrary feature covariance and a signal vector that is bounded in $\ell_2$ norm, generalized cross-validation for ridge regression converges almost surely to the **expected out-of-sample prediction error**, uniformly over a range of ridge regularization parameters that includes zero (and even negative values).
- The analogous result is proved for leave-one-out cross-validation.
- Ridge tuning via minimization of generalized or leave-one-out cross-validation asymptotically almost surely delivers the optimal level of regularization for predictive accuracy.
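The tuning procedure in the last bullet can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the ridge estimate is normalized as $(X^\top X/n + \lambda I)^{-1} X^\top y/n$, the cross-validation curves use the usual shortcut formulas based on the smoother matrix, and the function names `ridge_cv_curves` and `brute_force_loo`, the grid, and the simulated data are hypothetical, not taken from the references.

```python
import numpy as np


def ridge_cv_curves(X, y, lambdas):
    """Return (loo, gcv) arrays, one entry per lambda, via shortcut formulas."""
    n, p = X.shape
    gram = X.T @ X / n
    loo, gcv = [], []
    for lam in lambdas:
        # Smoother matrix H with fitted values H @ y (assumed normalization).
        H = X @ np.linalg.solve(gram + lam * np.eye(p), X.T) / n
        resid = y - H @ y
        # LOO rescales each residual by its own leverage 1 - H_ii.
        loo.append(np.mean((resid / (1.0 - np.diag(H))) ** 2))
        # GCV replaces the individual leverages by their average tr(H)/n.
        gcv.append(np.mean((resid / (1.0 - np.trace(H) / n)) ** 2))
    return np.array(loo), np.array(gcv)


def brute_force_loo(X, y, lam):
    """Refit n times with one sample held out, keeping the penalty n*lam fixed."""
    n, p = X.shape
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        b = np.linalg.solve(Xi.T @ Xi + n * lam * np.eye(p), Xi.T @ yi)
        errs.append((y[i] - X[i] @ b) ** 2)
    return np.mean(errs)


# Illustrative data: a linear model with Gaussian features and noise.
rng = np.random.default_rng(0)
n, p = 60, 20
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) / np.sqrt(p)
y = X @ beta + rng.normal(size=n)

lambdas = np.logspace(-3, 1, 20)
loo_curve, gcv_curve = ridge_cv_curves(X, y, lambdas)
print("lambda chosen by LOO:", lambdas[np.argmin(loo_curve)])
print("lambda chosen by GCV:", lambdas[np.argmin(gcv_curve)])
```

The brute-force function is included only as a sanity check that the shortcut reproduces explicit leave-one-out refitting. The grid here is restricted to positive values, whereas the result above also covers zero and negative regularization.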

### Related Work

- Ridge Error Analysis
- Ridge Cross Validation

### Problem Setup

Consider the out-of-sample prediction error, or conditional (on the training dataset) prediction error,

\[\err(\lambda) = \Err(\hat\beta_\lambda)\]

of the ridge estimate $\hat\beta_\lambda$. The goal is to analyze the differences between the cross-validation estimators of risk and the risk itself,

\[\loo(\lambda) - \err(\lambda)\]

and

\[\gcv(\lambda) - \err(\lambda).\]

Also, denote the optimal parameter and its GCV and LOOCV counterparts by $\lambda_I^\star, \hat\lambda_I^\gcv, \hat\lambda_I^\loo$, and compare the prediction errors of the models tuned using GCV and LOOCV.
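For reference, here is a sketch of the standard definitions behind these quantities. The note itself does not spell them out, so the normalization below (penalty $\lambda$ on the $1/n$-scaled least-squares loss, smoother matrix $L_\lambda$, and an independent test point $(x_0, y_0)$) is an assumption and may differ from the references by a rescaling of $\lambda$.

\[\hat\beta_\lambda = \left(X^\top X/n + \lambda I_p\right)^{-1} X^\top y/n, \qquad L_\lambda = X\left(X^\top X/n + \lambda I_p\right)^{-1} X^\top/n\]

\[\mathrm{Err}(\hat\beta_\lambda) = \mathbb{E}\left[(y_0 - x_0^\top \hat\beta_\lambda)^2 \mid X, y\right]\]

\[\loo(\lambda) = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - x_i^\top\hat\beta_\lambda}{1 - [L_\lambda]_{ii}}\right)^2, \qquad \gcv(\lambda) = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - x_i^\top\hat\beta_\lambda}{1 - \mathrm{tr}(L_\lambda)/n}\right)^2\]

The leave-one-out estimator is written in its shortcut form, which for ridge coincides with refitting $n$ times while holding the penalty term fixed. GCV differs only in replacing each leverage $[L_\lambda]_{ii}$ by the average leverage $\mathrm{tr}(L_\lambda)/n$.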

## Lasso

There exist very few results about the properties of the Lasso estimator when $\lambda$ is chosen using cross-validation. The work summarized here derives non-asymptotic error bounds for the Lasso estimator when the penalty parameter is chosen using $K$-fold cross-validation.

- The bounds imply that the cross-validated Lasso estimator has nearly optimal rates of convergence in the prediction, $L^2$, and $L^1$ norms.

For example, when the conditional distribution of the noise $\epsilon$ given $X$ is Gaussian, the bound in the $L^2$ norm implies that

\[\Vert \hat\beta(\hat\lambda) - \beta\Vert_2 = O_P\left(...\right)\]

- The results cover the case when $p$ is (potentially much) larger than $n$ and also allow for non-Gaussian noise.
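The estimator being analyzed, a Lasso whose penalty is picked from a candidate grid by $K$-fold cross-validation, can be sketched as follows. This is a minimal NumPy illustration, not the procedure of any particular reference: the solver is plain proximal gradient descent (ISTA) on the objective $\frac{1}{2n}\Vert y - Xb\Vert_2^2 + \lambda\Vert b\Vert_1$, and the names `soft_threshold`, `lasso_ista`, and `kfold_lasso`, the grid, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1, applied entrywise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize (1/(2n))||y - Xb||^2 + lam * ||b||_1 by ISTA."""
    n, p = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b


def kfold_lasso(X, y, lambdas, K=5, seed=0):
    """Choose lam from the grid by K-fold CV, then refit on the full sample."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv_err = np.zeros(len(lambdas))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for j, lam in enumerate(lambdas):
            b = lasso_ista(X[train], y[train], lam)
            cv_err[j] += np.sum((y[held_out] - X[held_out] @ b) ** 2)
    cv_err /= n
    best = lambdas[int(np.argmin(cv_err))]
    return best, lasso_ista(X, y, best), cv_err


# Illustrative data with p larger than n and a sparse signal.
rng = np.random.default_rng(0)
n, p, s = 50, 100, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 1.0
y = X @ beta + rng.normal(size=n)

lambdas = np.logspace(-2, 0, 10)
lam_hat, beta_hat, cv_err = kfold_lasso(X, y, lambdas)
print("selected lambda:", lam_hat)
print("L2 estimation error:", np.linalg.norm(beta_hat - beta))
```

A fixed iteration count keeps the sketch short. A production solver would use a convergence check or coordinate descent with warm starts along the grid.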

## Other

Also have a look at Fushiki, T. (2011). Estimation of prediction error by using K-fold cross-validation. Statistics and Computing, 21(2), 137–146.

The training error has a downward bias and K-fold cross-validation has an upward bias. The paper investigates two families that connect the training error and K-fold cross-validation.

This strategy reminds me of the one used for the bootstrap, as mentioned in ESL.
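The two opposite biases mentioned above can be illustrated with a small simulation. The sketch below does not implement the families studied by Fushiki. It only compares, for ordinary least squares on simulated Gaussian data, the average training error, the average K-fold estimate, and the average true conditional prediction error. The function names `ols_fit` and `simulate_biases` and all simulation settings are hypothetical choices for this example.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def simulate_biases(n=40, p=10, K=5, sigma=1.0, n_rep=500, seed=0):
    """Average training error, true prediction error, and K-fold estimate."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p)
    train_err, true_err, kfold_err = [], [], []
    for _ in range(n_rep):
        X = rng.normal(size=(n, p))
        y = X @ beta + sigma * rng.normal(size=n)
        b = ols_fit(X, y)
        train_err.append(np.mean((y - X @ b) ** 2))
        # With identity feature covariance the conditional prediction
        # error has the closed form sigma^2 + ||b - beta||^2.
        true_err.append(sigma**2 + np.sum((b - beta) ** 2))
        # K-fold estimate: each fold is predicted by a fit on the others.
        folds = np.array_split(rng.permutation(n), K)
        sq = 0.0
        for held_out in folds:
            mask = np.ones(n, dtype=bool)
            mask[held_out] = False
            b_k = ols_fit(X[mask], y[mask])
            sq += np.sum((y[held_out] - X[held_out] @ b_k) ** 2)
        kfold_err.append(sq / n)
    return np.mean(train_err), np.mean(true_err), np.mean(kfold_err)


train, true, kfold = simulate_biases()
print("mean training error:", train)
print("mean true error:    ", true)
print("mean K-fold error:  ", kfold)
```

In this setting the K-fold estimate is pessimistic because each fold's model is trained on fewer than $n$ samples, while the training error is optimistic because it reuses the fitting data.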