# Convergence rates of least squares

## Empirical Process

empirical measure:

Given a collection $\cF$ of measurable functions $f:\cX\rightarrow \bbR$, the empirical measure induces a map from $\cF$ to $\bbR$ given by

$f\mapsto \bbP_nf\,.$

Here, $Qf=\int fdQ$.

Let $P$ be the common distribution of the $X_i$. The centered and scaled version of the given map is the $\cF$-indexed empirical process $\bbG_n$ given by

$f\mapsto \bbG_nf = \sqrt n(\bbP-P)f = \frac{1}{\sqrt n}\sum_{i=1}^n(f(X_i)-Pf)\,.$

For a given function $f$, we have

\begin{align*} \bbP_n & f\overset{as}{\rightarrow}Pf\\ \bbG_n & \rightarrow N(0, P(f-Pf)^2)\,, \end{align*}

provided $Pf$ exists and $Pf^2<\infty$, respectively. Now we concerned with making these two statements uniform in $f$ varying over a class $\cF$.

With the notation $\Vert Q\Vert_\cF=\sup\{\vert Qf\vert:f\in\cF\}$, the uniform version of the law of large numbers becomes

$\Vert \bbP_n-P\Vert_\cF\rightarrow 0\,.$

A class $\cF$ for which this is true is called a Glivenko-Cantelli class.

Investigate conditions under which

$\bbG_n = \sqrt n(\bbP - P) \rightarrow \bbG\,\qquad \text{in }\ell^\infty(\cF)\,,$

where the limit $\bbG$ is a tight Borel measurable element in $\ell^\infty(\cF)$. A class $\cF$ for which this is the true is called a Donsker class.

Write the notation $Z_i=\delta_{X_i}-P$, the empirical central limit theorem can be written

$\frac{1}{\sqrt n}\sum_{i=1}^nZ_i\rightarrow \bbG\,,$

in $\ell^\infty(\cF)$, where $\bbG$ is a tight Brownian bridge. Given i.i.d real-valued random variables $\xi_1,\ldots,\xi_n$, which are independent of $Z_1,\ldots,Z_n$, the multiplier central limit theorem asserts that

$\frac{1}{\sqrt n}\sum_{i=1}^n\xi_iZ_i\rightarrow \bbG\,.$

If the $\xi_i$ have mean zero and variance 1, and satisfy a moment condition, then the multiplier central limit theorem is true iff $\cF$ is Donsker: in that case, the two displays are equivalent.

## Introduction

### Classical setting of nonparametric regression:

$Y_i=f_0(X_i) + \xi_i\text{ for }i=1,\ldots,n$

$X_1,\ldots,X_n$ iid, and $\xi_1,\ldots,\xi_n$ iid and independent of $X_i$.

LSE:

$\hat f_n = \argmin_{f\in \cF}\frac{1}{n}\sum_{i=1}^n (Y_i-f(X_i))^2$

The LSE is well-known to have nice properties (rate-optimality) when

• E: the errors $\{\xi_i\}$ are sub-Gaussian or at least sub-exponential
• F: the class $\cF$ of regression functions satisfies a condition slightly stronger than a Donsker condition.

In terms of a very large literature, there remains a lack of clear understanding of the properties of $\hat f_n$ in terms of assumptions concerning the heaviness of the tails of the errors and the massiveness or “size” of the class $\cF$.

The interest of the paper is in developing further tools and methods to study properties of $\hat f_n$, especially its convergence rate when the error condition E is replaced by

• E’: the errors $\{\xi_i\}$ have only a $p$-moment for some $1\le p<\infty$.

Question 1: What determines the convergence rate $b_n$ of $\hat f_n$ w.r.t. some risk or loss functions? When is this rate $b_n$ determined by $p$ (and hence the tail behavior of the $\xi_i$’s), and when is it determined by $\alpha$ (and hence the size of $\cF$)?

Classical methods used to prove consistency and rates of convergence of the LSE

Since $\hat f_n$ minimizes the functional

$f\mapsto \bbP_n(Y-f(X))^2=\frac 1n \sum_{i=1}^n (Y_i-f(X_i))^2\,.$

Adding and subtracting $f_0$ on the left side,

$\bbP_n(Y-f_0(X))^2 + 2\bbP_n(Y-f_0)(f_0-\hat f_n) + \bbP_n(f_0-\hat f_n)^2 \le \bbP_n(Y-f_0(X))^2\,.$

Since $\xi_i=Y_i-f_0(X_i)$, we conclude that

$\bbP_n(\hat f_n(X)-f_0(X))^2 \le 2\bbP_n(\xi(\hat f_n(X)-f_0(X)))\le 2\sup_{f\in \cF}\bbP_n(\xi(f(X)-f_0(X)))$

where the process

$f\mapsto n(\bbP_n-P)(\xi f(X)) = n\bbP_n(\xi f(X)) = \sum_{i=1}^n \xi_if(X_i)$

is a multiplier empirical process. (due to $\xi$ and $f$ are uncorrelated?)

Consider

$\bbM_n(f) = 2\bbP_n\xi(f-f_0) - \bbP_n(f-f_0)^2\,,$

and note that $\hat f_n$ maximizes $\bbM_n(f)$ over $\cF$.(minimize its negative which is the difference between the true model and the estimated model) Since the errors have zero mean and are independent of the $X_i$’s, this process has mean $M(f)\equiv -P(f-f_0)^2$. Since $\bbM_n(f_0)=0=M(f_0)$, centering then yields the process

$f\mapsto \bbZ_n(f) \equiv \bbM_n(f) - \bbM_n(f_0) - (M(f) - M(f_0)) = 2\bbP_n\xi(f-f_0) - (\bbP-P)(f-f_0)^2$

Establishing rates of convergence for $\hat f_n$ then boils down to bounding

$\bbE \sup_{f\in\cF:P(f-f_0)^2\le \delta^2}\bbZ_n(f)$

as a function of $n$ and $\delta$.

Question 2: Under what moment conditions on the $\xi_i$’s can we assert that the multiplier empirical process has roughly the same size as the empirical process $f\mapsto n(\bbP_n-P)(f(X))=\sum_{i=1}^n(f(X_i)-Pf)$, or equivalently the symmetrized empirical process $f\mapsto \sum_{i=1}^n\varepsilon_i f(X_i)$ for nearly all function classes $\cF$ in a non-asymptotic manner?

• provide simple moment conditions on the $\xi_i$’s
• give some comparisons to the existing multiplier inequalities which illustrate the improvement possible via the new bounds in non-asymptotic settings

### Notation

• $\Vert\xi\Vert_p\equiv (\bbE\vert\xi\vert^p)^{1/p}$
• $\Vert \xi\Vert_{p,1}\equiv \int_0^\infty \bbP(\vert \xi\vert> t)^{1/p}dt$
• $\Vert f\Vert_{L_p(P)}\equiv (P\vert f\vert^p)^{1/p}$: the usual $L_p$ norm under $P$
• $\Vert f\Vert_{\infty}=\sup_{x\in\cX}\vert f(x)\vert$
• $f$ is $P$-centered, and $\cF$ is $P$-centered if all $f\in \cF$ are $P$-centered.
• $L_p(g,B)$: the $L_p(P)$-ball centered at $g$ with radius with radius $B$.
• for any $\phi:\cF\rightarrow \bbR$, write $\Vert \phi(f)\Vert_\cF$ for $\sup_{f\in \cF}\vert \phi(f)\vert$.

Published in categories Memo