Cross-Validation for High-Dimensional Ridge and Lasso
This note collects several references on cross-validation for high-dimensional ridge and lasso regression.
\[\newcommand\loo{\mathrm{loo}} \newcommand\gcv{\mathrm{gcv}} \newcommand\err{\mathrm{err}} \newcommand\Err{\mathrm{Err}} \newcommand\tr{\mathrm{tr}} \newcommand\E{\mathbb{E}} \newcommand\IR{\mathbb{R}} \newcommand\argmin{\mathop{\mathrm{arg\,min}}} \newcommand\asto{\overset{a.s.}{\Rightarrow}}\]
Ridge Regression
The paper examines generalized and leave-one-out cross-validation for ridge regression in a proportional asymptotic framework, where the dimension of the feature space grows proportionally with the number of observations:
- given i.i.d. samples from a linear model with an arbitrary feature covariance and a signal vector that is bounded in $\ell_2$ norm, we show that generalized cross-validation for ridge regression converges almost surely to the expected out-of-sample prediction error, uniformly over a range of ridge regularization parameters that includes zero (and even negative values)
- prove the analogous result for leave-one-out cross-validation
- ridge tuning via minimization of generalized or leave-one-out cross-validation asymptotically almost surely delivers the optimal level of regularization for predictive accuracy (a small numerical sketch follows below)
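To make the claims concrete, here is a minimal numerical sketch (my own, not from the paper) comparing the GCV and LOOCV shortcut formulas for ridge regression with a test-set estimate of the out-of-sample error in a proportional regime; the covariance, signal, noise level, and $\lambda$ grid are arbitrary choices.

```python
# Minimal sketch (assumptions: AR(1) covariance, unit-norm signal, Gaussian noise).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 400, 200, 1.0                                           # gamma = p/n = 0.5
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) covariance
beta0 = rng.normal(size=p)
beta0 /= np.linalg.norm(beta0)                                        # signal bounded in l2 norm
L = np.linalg.cholesky(Sigma)
X = rng.normal(size=(n, p)) @ L.T
y = X @ beta0 + sigma * rng.normal(size=n)
Xte = rng.normal(size=(10_000, p)) @ L.T                              # fresh test set
yte = Xte @ beta0 + sigma * rng.normal(size=10_000)

Shat = X.T @ X / n
for lam in np.logspace(-3, 1, 9):
    beta = np.linalg.solve(Shat + lam * np.eye(p), X.T @ y / n)       # ridge estimate
    S = X @ np.linalg.solve(Shat + lam * np.eye(p), X.T) / n          # smoother matrix
    r = y - X @ beta
    loo = np.mean((r / (1 - np.diag(S))) ** 2)                        # exact LOOCV shortcut
    gcv = np.mean(r ** 2) / (1 - np.trace(S) / n) ** 2                # GCV
    err = np.mean((yte - Xte @ beta) ** 2)                            # out-of-sample error
    print(f"lambda={lam:8.4f}  loo={loo:.3f}  gcv={gcv:.3f}  test={err:.3f}")
```

All three curves should nearly agree over the whole grid and share essentially the same minimizer, which is the content of the uniform convergence statements above.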
Related Work
- Ridge Error Analysis
- Ridge Cross Validation
Problem Setup
Consider the out-of-sample prediction error, or conditional (on the training dataset) prediction error,
\[\err(\lambda) = \Err(\hat\beta_\lambda)\]of the ridge estimate $\hat\beta_\lambda$. The goal is to analyze the differences between the cross-validation estimators of the risk and the risk itself,
\[\loo(\lambda) - \err(\lambda)\]and
\[\gcv(\lambda) - \err(\lambda)\,.\]Also, denoting the optimal parameters by $\lambda_I^\star, \hat\lambda_I^\gcv, \hat\lambda_I^\loo$ (the subscript $I$ presumably refers to the interval of regularization parameters over which the tuning is performed), we compare the prediction errors of the models tuned using GCV and LOOCV.
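For concreteness, the standard shortcut forms of these two criteria for the ridge smoother $L_\lambda = X(X^TX + n\lambda I_p)^+X^T$ are (the usual definitions; the paper's normalization may differ slightly)
\[\loo(\lambda) = \frac 1n\sum_{i=1}^n\left(\frac{y_i - x_i^T\hat\beta_\lambda}{1-(L_\lambda)_{ii}}\right)^2\,,\qquad \gcv(\lambda) = \frac 1n\sum_{i=1}^n\left(\frac{y_i - x_i^T\hat\beta_\lambda}{1-\tr(L_\lambda)/n}\right)^2\,,\]where the LOO shortcut is exact for ridge regression.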
Proof of Lemma 5.1
First of all, for a symmetric and positive semi-definite matrix $A$, we have $\sigma_\max(A)=\lambda_\max(A)$; indeed,
\[\Vert A\Vert_2 = \sigma_\max(A) = \sqrt{\lambda_\max(A^TA)} = \lambda_\max(A)\,.\]and hence $\Vert\Sigma\Vert_2 = \Vert\Sigma^{1/2}\Vert_2^2$.
Theorem 5.11 (the Bai–Yin theorem). Suppose that $\{x_{jk}\}$ is a double array of i.i.d. random variables with mean zero, variance $\sigma^2$, and finite fourth moment. Let $X_n = (x_{jk})_{j\le p, k\le n}$ and $S_n=\frac 1nX_nX_n^*$. Then, as $n\to\infty$ with $p/n\to\gamma$, almost surely
\[\begin{align*} \lim_{n\rightarrow\infty} \lambda_\max(S_n) &= \sigma^2(1+\sqrt \gamma)^2\,,\\ \lim_{n\rightarrow\infty} \lambda_\min(S_n) &= \sigma^2(1-\sqrt \gamma)^2\,, \end{align*}\]where for $\gamma > 1$ the second limit is understood for the smallest nonzero eigenvalue.
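A quick empirical check of these limits (my own simulation, with $\sigma^2 = 1$ and an arbitrary aspect ratio $\gamma$):

```python
# Empirical check of the Bai-Yin limits for S_n = X_n X_n^* / n with p/n = gamma.
import numpy as np

rng = np.random.default_rng(5)
gamma, n = 0.25, 4000
p = int(gamma * n)
X = rng.normal(size=(p, n))                  # i.i.d. entries, mean 0, variance 1
eig = np.linalg.eigvalsh(X @ X.T / n)        # eigenvalues of S_n (ascending)
print("lambda_max =", eig[-1], "  limit:", (1 + np.sqrt(gamma)) ** 2)
print("lambda_min =", eig[0],  "  limit:", (1 - np.sqrt(gamma)) ** 2)
```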
Consider the matrix $B_n=\frac 1n T_n^{1/2}X_nX_n^*T_n^{1/2}$, where $T_n^{1/2}$ is a Hermitian square root of the Hermitian nonnegative definite $p\times p$ matrix $T_n$. For any index $i_n$, we have (this is the Fact 1 invoked below)
\[\lambda_{i_n+1}(B_n) \le \lambda_1(X_nX_n^*/n)\lambda_{i_n+1}(T_n)\quad \text{and}\quad \lambda_{i_n}(B_n)\ge \lambda_p(X_nX_n^*/n)\lambda_{i_n}(T_n)\]
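These inequalities follow from $\lambda_p(X_nX_n^*/n)\,T_n \preceq B_n \preceq \lambda_1(X_nX_n^*/n)\,T_n$, which in fact gives the same-index version $\lambda_p(X_nX_n^*/n)\lambda_i(T_n) \le \lambda_i(B_n)\le \lambda_1(X_nX_n^*/n)\lambda_i(T_n)$ for every $i$. A quick numerical check of that stronger form (my own, with an arbitrary random $T_n$):

```python
# Numerical check (assumptions: random Wishart-type T_n, eigenvalues sorted descending).
import numpy as np

rng = np.random.default_rng(1)
p, n = 30, 50
X = rng.normal(size=(p, n))
A = rng.normal(size=(p, p))
T = A @ A.T                                          # Hermitian nonnegative definite T_n
w, V = np.linalg.eigh(T)
Tsqrt = (V * np.sqrt(np.clip(w, 0, None))) @ V.T     # Hermitian square root T_n^{1/2}

S = X @ X.T / n                                      # X_n X_n^* / n
B = Tsqrt @ S @ Tsqrt                                # B_n
desc = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]
lB, lS, lT = desc(B), desc(S), desc(T)
assert np.all(lB <= lS[0] * lT + 1e-8)               # lambda_i(B_n) <= lambda_1(S) lambda_i(T_n)
assert np.all(lB >= lS[-1] * lT - 1e-8)              # lambda_i(B_n) >= lambda_p(S) lambda_i(T_n)
print("both eigenvalue inequalities hold for every index")
```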
Since the entries in $Z$ are i.i.d. with mean zero, variance 1, and finite $(4+\eta)$-th moment, we have, almost surely as $n\to\infty$ with $p/n\to\gamma$,
\[\begin{align*} \lim_{n\rightarrow\infty} \lambda_\max(Z^TZ/n) &= (1+\sqrt \gamma)^2\,,\\ \lim_{n\rightarrow\infty} \lambda_\min(Z^TZ/n) &= (1-\sqrt \gamma)^2\,. \end{align*}\]Firstly,
\[\begin{align*} \Vert\hat\Sigma\Vert_2 &= \left\Vert \Sigma^{1/2}\frac{Z^TZ}{n}\Sigma^{1/2}\right\Vert_2 \le \Vert \Sigma^{1/2}\Vert_2 \left\Vert \frac{Z^TZ}{n}\right\Vert_2 \Vert \Sigma^{1/2}\Vert_2 =\Vert \Sigma\Vert_2 \left\Vert \frac{Z^TZ}{n}\right\Vert_2 \end{align*}\]Alternatively, we can apply Fact 1,
\[\Vert\hat\Sigma\Vert_2 = \sigma_\max(\hat\Sigma) = \lambda_\max(\hat\Sigma)\le \lambda_\max(\Sigma)\lambda_\max(Z^TZ/n) = \Vert \Sigma\Vert_2 \lambda_\max(Z^TZ/n)\,.\]
Then, by Theorem 5.11, almost surely for all sufficiently large $n$,
\[\Vert\hat\Sigma\Vert_2 \le (1+\sqrt{\gamma})^2\Vert\Sigma\Vert_2\,.\]Next consider
\[\begin{align*} \Vert (\hat\Sigma + \lambda I_p)^+\Vert_2 &= \sigma_\max ((\hat \Sigma+\lambda I_p)^+)=\frac{1}{\lambda_\min(\hat\Sigma+\lambda I_p)}=\frac{1}{\lambda_\min(\hat\Sigma) + \lambda}\,, \end{align*}\]provided that $\lambda > -\lambda_\min(\hat\Sigma)$, so that $\hat\Sigma+\lambda I_p$ is positive definite and its pseudoinverse coincides with its inverse. By Fact 1,
\[\begin{align*} \lambda_\min(\hat\Sigma) &\ge \lambda_\min (Z^TZ/n)\lambda_\min(\Sigma) = r_\min \lambda_\min(Z^TZ/n)\,, \end{align*}\]where $r_\min \triangleq \lambda_\min(\Sigma)$. It follows that, almost surely for large $n$, we have
\[\Vert (\hat\Sigma + \lambda I_p)^+\Vert_2 \le \frac{1}{\lambda -\lambda_\min}\,,\]where $\lambda_\min \triangleq -(1-\sqrt\gamma)^2r_\min$.
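As a sanity check (my own simulation; the covariance $\Sigma$, aspect ratio $\gamma$, and $\lambda$ are arbitrary choices), the following sketch compares both operator norms with the Lemma 5.1 bounds at a finite sample size. The bounds hold only almost surely for large $n$, so this is purely illustrative.

```python
# Illustration of the Lemma 5.1 bounds with Sigma_hat = Sigma^{1/2} (Z^T Z / n) Sigma^{1/2}.
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 500                                    # gamma = 0.5
gamma = p / n
Sigma = np.diag(np.linspace(0.5, 2.0, p))           # r_min = 0.5, ||Sigma||_2 = 2
Sigma_half = np.sqrt(Sigma)                         # valid square root because Sigma is diagonal
Z = rng.normal(size=(n, p))
Sigma_hat = Sigma_half @ (Z.T @ Z / n) @ Sigma_half

r_min = 0.5
lam_min = -(1 - np.sqrt(gamma)) ** 2 * r_min        # threshold from Lemma 5.1
print("||Sigma_hat||_2 =", np.linalg.norm(Sigma_hat, 2),
      " bound:", (1 + np.sqrt(gamma)) ** 2 * np.linalg.norm(Sigma, 2))

lam = 0.1                                           # any lambda > lam_min
pinv_norm = np.linalg.norm(np.linalg.pinv(Sigma_hat + lam * np.eye(p)), 2)
print("||(Sigma_hat + lam I)^+||_2 =", pinv_norm, " bound:", 1 / (lam - lam_min))
```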
Proof of Lemma 5.2
The goal is to establish
\[\gcv_c(\lambda) \asto 0\]under the proportional asymptotic limit.
Write $\gcv_c(\lambda) = b_n^T\varepsilon/n$, where
\[b_n = 2X(I_p-(\hat\Sigma + \lambda I_p)^+\hat\Sigma)^2\beta_0\]then as in the proof of Lemma 5.1,
\[\Vert b_n\Vert^2/n\le C\]almost surely for large $n$. Since, conditionally on $X$, the quantity $b_n^T\varepsilon/n$ has mean zero and variance of order $\Vert b_n\Vert^2/n^2 = O(1/n)$, a standard moment argument then yields $\gcv_c(\lambda)\asto 0$.
Proof of Lemma 5.3
First, we establish
\[\gcv_b(\lambda) - \frac{\err_b(\lambda)}{(1+\tr[(\hat\Sigma+\lambda I_p)^+\Sigma]/n)^2} \asto 0\,.\]Let $B=\beta_0\beta_0^T$; then
\[\beta_0^T(I_p-\hat\Sigma(\hat\Sigma + \lambda I_p)^+)\hat\Sigma (I_p-\hat\Sigma(\hat\Sigma + \lambda I_p)^+)\beta_0 = \frac 1n \sum_{i=1}^n x_i^T(I_p-\hat\Sigma (\hat\Sigma+\lambda I_p)^+)B(I_p-\hat\Sigma(\hat\Sigma+\lambda I_p)^+)x_i\,,\]where the summands are quadratic forms in which the point of evaluation $x_i$ and the matrix are dependent. Using the standard leave-one-out trick and the Sherman–Morrison–Woodbury formula,
\[x_i^T(I_p-\hat\Sigma(\hat\Sigma + \lambda I_p)^+)B(I_p - \hat\Sigma(\hat\Sigma + \lambda I_p)^+)x_i = \frac{x_i^T(I_p-\hat\Sigma_{-i}(\hat\Sigma_{-i}+\lambda I_p)^+)B(I_p-\hat\Sigma_{-i}(\hat\Sigma_{-i}+\lambda I_p)^+)x_i }{(1+x_i^T(\hat\Sigma_{-i}+\lambda I_p)^+x_i/n)^2}\,,\]which follows because $\hat\Sigma = \hat\Sigma_{-i} + x_ix_i^T/n$ and hence, by Sherman–Morrison (whenever $\hat\Sigma_{-i}+\lambda I_p$ is invertible),
\[(I_p-\hat\Sigma(\hat\Sigma+\lambda I_p)^+)x_i = \frac{(I_p-\hat\Sigma_{-i}(\hat\Sigma_{-i}+\lambda I_p)^+)x_i}{1+x_i^T(\hat\Sigma_{-i}+\lambda I_p)^+x_i/n}\,.\]Then, by Lemma S.3.1,
\[1+\tr[(\hat\Sigma+\lambda I_p)^+\Sigma]/n - \frac{1}{1-\tr[(\hat\Sigma+\lambda I_p)^+\hat\Sigma]/n}\asto 0\,,\]we can obtain
\[\frac{\gcv_b(\lambda)}{\gcv_d(\lambda)} - \err_b(\lambda) \asto 0\,.\]
Lasso
There exist very few results on the properties of the Lasso estimator when $\lambda$ is chosen using cross-validation. The paper derives non-asymptotic error bounds for the Lasso estimator when the penalty parameter is chosen using $K$-fold cross-validation.
- the bounds imply that the cross-validated Lasso estimator has nearly optimal rates of convergence in the prediction norm
For example, when the conditional distribution of the noise $\epsilon$ given $X$ is Gaussian, the bound in the $L^2$ norm implies that
\[\Vert \hat\beta(\hat\lambda) - \beta\Vert_2 = O_P\left(...\right)\]- the results cover the case when $p$ is (potentially much) larger than $n$ and also allow for the case of non-Gaussian noise
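For clarity, the prediction norm appearing in such bounds is the empirical norm induced by the design (my notation; the paper's may differ slightly),
\[\Vert \hat\beta(\hat\lambda)-\beta\Vert_{2,n} \triangleq \left(\frac 1n\sum_{i=1}^n \big(X_i'(\hat\beta(\hat\lambda)-\beta)\big)^2\right)^{1/2}\,,\]so the first bullet is a statement about the in-sample prediction error of the cross-validated Lasso.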
Consider the regression model
\[Y = X'\beta + \varepsilon, \quad \E[\varepsilon \mid X] = 0\,.\]Consider triangular-array asymptotics, so that the distribution of $(X, Y)$, and in particular the dimension $p$ of the vector $X$, is allowed to depend on $n$ and can be larger or even much larger than $n$.
The Lasso estimator is
\[\hat\beta(\lambda) = \argmin_{b\in\IR^p} \left(\frac 1n \sum_{i=1}^n(Y_i-X_i'b)^2 +\lambda \Vert b\Vert_1\right)\,.\]If the distribution of $X$ is absolutely continuous with respect to the Lebesgue measure on $\IR^p$, then the optimization problem has a unique solution with probability one.
The paper assumes that $K$ is fixed, i.e., independent of $n$, and hence the results do not cover leave-one-out cross-validation.
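As a concrete illustration (my own sketch, not code from the paper), the $K$-fold cross-validated Lasso estimator $\hat\beta(\hat\lambda)$ with fixed $K$ can be computed, e.g., with scikit-learn's `LassoCV`; the design, sparsity level, and number of folds below are arbitrary choices.

```python
# Cross-validated Lasso with fixed K = 5 in a p > n Gaussian design.
# Note: scikit-learn's objective is (1/(2n))||y - Xb||_2^2 + alpha ||b||_1,
# so its alpha corresponds to lambda/2 in the formulation above.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p, s = 200, 1000, 5                                # p much larger than n, s-sparse signal
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 1.0
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=5, fit_intercept=False).fit(X, y)    # penalty chosen by 5-fold CV
beta_hat = fit.coef_
print("chosen penalty (sklearn alpha):", fit.alpha_)
print("l2 error ||beta_hat - beta||_2:", np.linalg.norm(beta_hat - beta))
print("prediction norm ||X(beta_hat - beta)||_2 / sqrt(n):",
      np.linalg.norm(X @ (beta_hat - beta)) / np.sqrt(n))
```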
- Homrighausen and McDonald (2013): assume that $p$ is much smaller than $n$, and only show consistency of the (leave-one-out) cross-validated Lasso estimator
- Homrighausen and McDonald (2013): give the “first” result for the Lasso when the smoothing parameter is chosen via cross-validation; they consider the high-dimensional setting in which the number of predictors $p=n^\alpha$, $\alpha > 0$, grows with the number of observations
- Homrighausen and McDonald (2016): consider a high-dimensional setting with random design in which the number of predictors $p$ grows with the number of observations $n$
Other
Also have a look at Fushiki, T. (2011). Estimation of prediction error by using K-fold cross-validation. Statistics and Computing, 21(2), 137–146.
The training error has a downward bias and $K$-fold cross-validation has an upward bias; the paper investigates two families of estimators that connect the training error and $K$-fold cross-validation.
This strategy reminds me of the one used in the bootstrap estimates of prediction error in ESL (e.g., the .632 estimator, which likewise combines the downward-biased training error with an upward-biased resampling estimate).
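As a rough illustration of the combining idea (my own toy construction, not the specific families proposed by Fushiki), one can interpolate between the downward-biased training error and the upward-biased $K$-fold CV error:

```python
# Toy family of error estimates: convex combinations of training error and K-fold CV error.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

model = Ridge(alpha=1.0)
train_err = np.mean((y - model.fit(X, y).predict(X)) ** 2)          # downward-biased

cv_errs = []
for tr, te in KFold(n_splits=5).split(X):
    model.fit(X[tr], y[tr])
    cv_errs.append(np.mean((y[te] - model.predict(X[te])) ** 2))
cv_err = np.mean(cv_errs)                                           # upward-biased

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}: combined estimate = {(1 - t) * train_err + t * cv_err:.3f}")
```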