
Kernel Ridgeless Regression Can Generalize

Tags: RKHS, Kernel Ridge Regression, Interpolation

This note is based on Liang, T., & Rakhlin, A. (2018). Just Interpolate: Kernel “Ridgeless” Regression Can Generalize. ArXiv:1808.00387 [Cs, Math, Stat].

Introduction

Conventionally, explicit regularization should be added to the least-squares objective when the Hilbert space $\cH$ is high- or infinite-dimensional:

\[\min_{f\in \cH}\sum_{i=1}^n(f(x_i)-y_i)^2+\lambda \Vert f\Vert_{\cH}^2\,.\]

The regularization term is introduced to avoid “overfitting” since kernels provide enough flexibility to fit training data exactly (i.e. interpolate it).

The parameter $\lambda$ is a knob for balancing bias and variance, and should be chosen judiciously. However, a number of researchers have noted that the best out-of-sample performance is often attained empirically by setting the regularization parameter to zero and finding the minimum-norm solution among those that interpolate the training data. The mechanism behind the good out-of-sample performance of this interpolation method has remained largely unclear.
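
A quick reminder of the standard closed form (a well-known kernel ridge regression fact, not something specific to the paper): by the representer theorem, the minimizer of the objective above is $\hat f_\lambda=\sum_{i=1}^n\alpha_i K(\cdot,x_i)$ with

\[\alpha = (K+\lambda I_n)^{-1}Y\,,\qquad K=[K(x_i,x_j)]_{ij}\,,\quad Y=(y_1,\ldots,y_n)^\top\,,\]

and when $K$ is invertible, letting $\lambda\to 0^+$ recovers exactly the minimum-norm interpolating solution studied below.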

Literature Review

  • Wahba (1990): the study of nonparametric regression in reproducing kernel Hilbert spaces (RKHS) from the computational and statistical perspectives.
    • one of the key aspects: the role of the decay of the kernel's eigenvalues (at the population level) in the rates of convergence
    • the analysis relies on explicit regularization (the ridge parameter $\lambda$) for the bias-variance trade-off. The parameter is either chosen to reflect knowledge of the spectral decay at the population level (typically unknown to the statistician), or by means of cross-validation.
    • the explicit formula of Kernel Ridge Regression was introduced as “kriging” in the earlier literature, and has been widely used in Bayesian statistics. [TODO]
  • In the learning theory community, Kernel Ridge Regression is known as a special case of Support Vector Regression; notions such as metric entropy and the “effective dimension” have been employed to analyze guarantees on the excess loss of Kernel Ridge Regression, even when the model is misspecified.
    • the analysis leans crucially on explicit regularization, as given by a careful choice of $\lambda$, and mostly focuses on the fixed-dimension, large-sample-size setting.
    • the literature stays relatively quiet about what happens to minimum-norm interpolation rules.
    • the existing bounds in nonparametric statistics and learning theory do not apply to the interpolated solution, in either the regression or the classification setting.

The paper aims to answer when and why interpolation in an RKHS works, as a starting point for explaining the good empirical performance of interpolation with kernels in practice.

Preliminaries

Problem Formulation

Observe $n$ iid pairs $(x_i, y_i), 1\le i\le n$, where $x_i$ are the covariates with values in a compact domain $\Omega\subset \IR^d$ and $y_i\in\IR$ are the responses.

Interested in estimating the conditional expectation function $f_*(x)=\E(Y\mid X=x)$, which is assumed to lie in a Reproducing Kernel Hilbert Space (RKHS) $\cH$.

The interpolation estimator studied in this paper is defined as

\[\hat f = \argmin_{f \in \cH} \Vert f\Vert_{\cH},\quad \text{s.t.}\quad f(x_i)=y_i,\ \forall i\,.\]

Let $X\in \IR^{n\times d}$ be the matrix with rows $x_1,\ldots,x_n$ and let $Y$ be the vector of responses $y_1,\ldots,y_n$. Slightly abusing notation, let

  • $K(X,X)=[K(x_i,x_j)]_{ij}\in\IR^{n\times n}$ be the kernel matrix
  • $K(x,X)=[K(x,x_1),\ldots,K(x,x_n)]\in \IR^{1\times n}$ be the row vector of kernel evaluations at $x$

If $K(X,X)$ is invertible, the solution is

\[\hat f(x) = K(x,X)K(X,X)^{-1}Y\,.\]
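
As a minimal numerical sketch of this formula (my own illustration, not code from the paper; the Gaussian kernel, bandwidth, and simulated data are arbitrary choices), one can also check that the estimator indeed interpolates the training responses:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2))
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))                      # covariates
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # noisy responses

K = rbf_kernel(X, X)                                 # K(X, X), n x n
alpha = np.linalg.solve(K, Y)                        # K(X, X)^{-1} Y

def f_hat(x):
    # interpolation estimator: K(x, X) K(X, X)^{-1} Y
    return rbf_kernel(np.atleast_2d(x), X) @ alpha

# the estimator interpolates the training data (up to numerical error)
print(np.max(np.abs(f_hat(X) - Y)))
```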

For the interpolating estimator, the paper provides high-probability (w.r.t. the draw of $X$) upper bounds on the integrated squared risk of the form

\[\E(\hat f(\x) - f_*(\x))^2 \le \phi_{n,d}(X,f_*)\,.\]

Here the expectation is over $\x\sim \mu$ and $Y\mid X$, and $\phi_{n,d}$ is a data-dependent upper bound.

Notation and Background on RKHS

For an operator $A$, its adjoint is denoted by $A^*$. For real matrices, the adjoint is the transpose. For any $x\in \Omega$, let $K_x:\IR\rightarrow \cH$ be such that

\[f(x) = \langle K_x, f\rangle_{\cH} = K_x^*f\,.\]

It follows that for any $x,z\in \Omega$,

\[K(x, z) = \langle K_x, K_z\rangle_\cH = K_x^*K_z\,.\]

These two equations express the kernel reproducing property; see ESL-CN – 5.8 Regularization and Reproducing Kernel Hilbert Spaces.
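
As a small sanity check of these identities (my own example, not from the paper): for a function with a finite kernel expansion $f=\sum_{i=1}^m \alpha_i K_{z_i}$, linearity of the inner product gives

\[\langle K_x, f\rangle_{\cH} = \sum_{i=1}^m \alpha_i \langle K_x, K_{z_i}\rangle_{\cH} = \sum_{i=1}^m \alpha_i K(x, z_i) = f(x)\,.\]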

Introduce the integral operator $\cT_\mu:L_\mu^2\rightarrow L_\mu^2$ w.r.t. the marginal measure $\mu(x)$:

\[\cT_\mu f(z) = \int K(x,z)f(x)d\mu(x)\,.\]

Denote the set of eigenfunctions of this integral operator by $e(x)=\{e_1(x),\ldots,e_p(x)\}$; they satisfy

\[\cT_\mu e_i=t_ie_i\,,\quad\text{and}\quad \int e_i(x)e_j(x)d\mu(x) = \delta_{ij}\,.\]

Denote by $T=\diag(t_1,\ldots,t_p)$ the diagonal matrix collecting the non-negative eigenvalues. Adopting the spectral notation,

\[K(x,z) = e(x)^*Te(z)\,.\]

Then, writing $e(X)=[e(x_1),\ldots,e(x_n)]\in\IR^{p\times n}$ for the matrix whose columns are the eigenfunctions evaluated at the sample points, the interpolation estimator becomes

\[\hat f(x)=e(x)^* Te(X)[ e(X)^* Te(X)]^{-1}Y\,.\]

Main Result

Under specific assumptions, with high probability (over the draw of $X$) the interpolation estimator satisfies

\[\E_{Y\mid X}\Vert \hat f-f_*\Vert_{L_\mu^2}^2\le \phi_{n,d}(X, f_*)+\epsilon(n, d)\,.\]

Remarks:

  • the upper bound is data-dependent and can serve as a certificate that interpolation will succeed
  • the two terms represent upper bounds on the variance and bias of the interpolation estimator, respectively. Unlike in the explicit regularization analysis, the two terms are not controlled by a tunable parameter $\lambda$. Rather, the choice of the non-linear kernel $K$ itself leads to an implicit control of the two terms, through the curvature of the kernel function, favorable properties of the data, and high dimensionality.

The paper refers to favorable structure of the eigenvalues of the data covariance matrix as favorable geometric properties of the data. The first term (variance) is small when the data matrix enjoys a certain decay of its eigenvalues, thanks to the implicit regularization $\gamma$ induced by the curvature of the kernel. The second term (bias) is small when the eigenvalues of the kernel matrix decay fast, or the kernel matrix is effectively low rank.
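
A rough numerical illustration of what these favorable properties look like (my own sketch, not from the paper; the anisotropic covariance, the Gaussian kernel, and the median-heuristic bandwidth are arbitrary choices): one simply inspects the eigenvalue decay of the empirical covariance $X^\top X/n$ and of the kernel matrix $K(X,X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50

# covariates whose population covariance has eigenvalues decaying like 1/j
pop_eigs = 1.0 / np.arange(1, d + 1)
X = rng.standard_normal((n, d)) * np.sqrt(pop_eigs)

# eigenvalue decay of the empirical covariance matrix (relevant for the variance term)
cov_eigs = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]
print(cov_eigs[:10])

# eigenvalue decay of a Gaussian kernel matrix with a median-heuristic bandwidth
# (relevant for the bias term: fast decay means the kernel matrix is effectively low rank)
sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-sq_dists / np.median(sq_dists))
kernel_eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print(kernel_eigs[:10] / kernel_eigs[0])
```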

