# Convergence rates of least squares


These are reading notes on Han, Q., & Wellner, J. A. (2017), *Convergence rates of least squares regression estimators with heavy-tailed errors*.

## Empirical Process

Given i.i.d. random variables $X_1,\ldots,X_n$ taking values in a measurable space $\cX$, the **empirical measure** is $\bbP_n\equiv n^{-1}\sum_{i=1}^n\delta_{X_i}$, where $\delta_x$ denotes the Dirac measure at $x$. Given a collection $\cF$ of measurable functions $f:\cX\rightarrow \bbR$, the empirical measure induces a map from $\cF$ to $\bbR$ given by

\[f\mapsto \bbP_nf\,.\]Here, for a (signed) measure $Q$, we write $Qf=\int f\,dQ$, so that $\bbP_nf=n^{-1}\sum_{i=1}^nf(X_i)$.
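A minimal numerical sketch of the empirical measure in action (my example, not the paper's): $\bbP_nf$ is just the sample average of $f(X_i)$, here with $P=\mathrm{Uniform}(0,1)$ and $f(x)=x^2$, so $Pf=1/3$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=1000)  # i.i.d. sample X_1, ..., X_n from P = Uniform(0, 1)

def Pn(f, X):
    """Empirical measure applied to f: P_n f = (1/n) * sum_i f(X_i)."""
    return np.mean(f(X))

f = lambda x: x**2
print(Pn(f, X))  # sample average of f(X_i); approximates Pf = 1/3
```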

Let $P$ be the common distribution of the $X_i$. The centered and scaled version of the given map is the $\cF$-indexed **empirical process** $\bbG_n$ given by

\[f\mapsto \bbG_nf = \sqrt n(\bbP_n-P)f = \frac{1}{\sqrt n}\sum_{i=1}^n(f(X_i)-Pf)\,.\]For a **given** function $f$, the law of large numbers and the central limit theorem give

\[\bbP_nf\rightarrow Pf\ \text{a.s.}\,,\qquad \bbG_nf\rightarrow N(0,\,P(f-Pf)^2)\,,\]provided $Pf$ exists and $Pf^2<\infty$, respectively. Now we are concerned with making these two statements uniform in $f$ varying over the class $\cF$.
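A quick simulation (my sketch, not the paper's) of both pointwise statements for the fixed function $f(x)=x$ under $P=\mathrm{Uniform}(0,1)$, where $Pf=1/2$ and $P(f-Pf)^2=1/12$:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x          # fixed function; P = Uniform(0,1): Pf = 1/2, Var f(X) = 1/12
n, reps = 2000, 5000

X = rng.uniform(size=(reps, n))
Pn_f = f(X).mean(axis=1)           # LLN: P_n f -> Pf = 1/2
Gn_f = np.sqrt(n) * (Pn_f - 0.5)   # CLT: G_n f ~ N(0, P(f - Pf)^2) = N(0, 1/12)

print(Pn_f.mean(), Gn_f.var())     # should be near 0.5 and 1/12
```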

With the notation $\Vert Q\Vert_\cF=\sup\{\vert Qf\vert:f\in\cF\}$, the uniform version of the law of large numbers becomes

\[\Vert \bbP_n-P\Vert_\cF\rightarrow 0\,.\]A class $\cF$ for which this is true is called a **Glivenko-Cantelli class**.
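The classical Glivenko–Cantelli theorem is the special case $\cF=\{\mathbf 1_{(-\infty,t]}:t\in\bbR\}$, where $\Vert\bbP_n-P\Vert_\cF$ is the Kolmogorov–Smirnov distance between the empirical CDF and the true CDF. A numerical sketch (my illustration) with $P=\mathrm{Uniform}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def ks_distance(X):
    """sup_t |F_n(t) - F(t)| for P = Uniform(0,1); the sup is attained at sample points."""
    n = len(X)
    Xs = np.sort(X)
    upper = np.arange(1, n + 1) / n - Xs  # ECDF approached from the right
    lower = Xs - np.arange(0, n) / n      # ECDF approached from the left
    return max(upper.max(), lower.max())

dists = [ks_distance(rng.uniform(size=n)) for n in (100, 10_000)]
print(dists)  # sup-distance shrinks as n grows
```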

Next, we investigate conditions under which

\[\bbG_n = \sqrt n(\bbP_n - P) \rightarrow \bbG\,\qquad \text{in }\ell^\infty(\cF)\,,\]where the limit $\bbG$ is a tight Borel measurable element in $\ell^\infty(\cF)$. A class $\cF$ for which this is true is called a **Donsker class**.

With the notation $Z_i=\delta_{X_i}-P$, the empirical central limit theorem can be written as

\[\frac{1}{\sqrt n}\sum_{i=1}^nZ_i\rightarrow \bbG\,,\]in $\ell^\infty(\cF)$, where $\bbG$ is a tight Brownian bridge. Given i.i.d. real-valued random variables $\xi_1,\ldots,\xi_n$, independent of $Z_1,\ldots,Z_n$, the multiplier central limit theorem asserts that

\[\frac{1}{\sqrt n}\sum_{i=1}^n\xi_iZ_i\rightarrow \bbG\,.\]If the $\xi_i$ have mean zero, variance $1$, and satisfy the moment condition $\Vert\xi_1\Vert_{2,1}<\infty$, then the multiplier central limit theorem holds if and only if $\cF$ is Donsker: in that case, the two displays are equivalent.
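A sketch (my illustration, not the paper's) comparing the empirical and multiplier processes at a single $f$, with standard Gaussian multipliers $\xi_i$: both fluctuations should have variance $P(f-Pf)^2=1/12$ for $f(x)=x$ under $P=\mathrm{Uniform}(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x
n, reps = 2000, 5000
Pf = 0.5  # P = Uniform(0, 1)

X = rng.uniform(size=(reps, n))
xi = rng.standard_normal(size=(reps, n))          # multipliers: mean 0, variance 1

Gn = np.sqrt(n) * (f(X).mean(axis=1) - Pf)        # empirical process at f
Mn = (xi * (f(X) - Pf)).sum(axis=1) / np.sqrt(n)  # multiplier process at f

print(Gn.var(), Mn.var())  # both should be close to Var f(X) = 1/12
```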

## Introduction

### Classical setting of nonparametric regression

\[Y_i=f_0(X_i) + \xi_i\,,\qquad i=1,\ldots,n\,,\]where $X_1,\ldots,X_n$ are i.i.d. with distribution $P$, and $\xi_1,\ldots,\xi_n$ are i.i.d. mean-zero errors independent of the $X_i$'s.

LSE:

\[\hat f_n = \argmin_{f\in \cF}\frac{1}{n}\sum_{i=1}^n (Y_i-f(X_i))^2\]The LSE is well known to have nice properties (rate-optimality) when

- E: the errors $\{\xi_i\}$ are sub-Gaussian or at least sub-exponential
- F: the class $\cF$ of regression functions satisfies a condition slightly stronger than a Donsker condition.
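As a concrete instance of the LSE (my illustration, not the paper's setting), take $\cF$ to be the class of linear functions and let the errors be $t_3$-distributed, so condition E fails but E' holds for any $p<3$. Over the linear class, the LSE is ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
f0 = lambda x: 2.0 * x + 1.0        # true regression function, assumed linear here

X = rng.uniform(size=n)
xi = rng.standard_t(df=3, size=n)   # heavy-tailed errors: only moments p < 3 exist
Y = f0(X) + xi

# LSE over F = {x -> a*x + b}: ordinary least squares via the design matrix [X, 1]
A = np.column_stack([X, np.ones(n)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(a_hat, b_hat)
```

The estimates should land near the true coefficients $(2, 1)$, despite the heavy-tailed noise.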

Despite a very large literature, there remains a lack of clear understanding of the properties of $\hat f_n$ in terms of assumptions concerning the heaviness of the tails of the errors and the massiveness or “size” of the class $\cF$.

The interest of the paper is in developing further tools and methods to study properties of $\hat f_n$, especially its convergence rate when the error condition E is replaced by

- E’: the errors $\{\xi_i\}$ have only a finite $p$-th moment for some $1\le p<\infty$.

**Question 1:** What determines the convergence rate $b_n$ of $\hat f_n$ with respect to a given risk or loss function? When is this rate $b_n$ determined by $p$ (and hence by the tail behavior of the $\xi_i$’s), and when is it determined by $\alpha$, the entropy exponent quantifying the size of $\cF$?

We now recall the classical method used to prove consistency and rates of convergence of the LSE. Since $\hat f_n$ minimizes the functional

\[f\mapsto \bbP_n(Y-f(X))^2=\frac 1n \sum_{i=1}^n (Y_i-f(X_i))^2\,,\]we have $\bbP_n(Y-\hat f_n(X))^2\le \bbP_n(Y-f_0(X))^2$ (assuming $f_0\in\cF$). Adding and subtracting $f_0(X)$ inside the square on the left side and expanding,

\[\bbP_n(Y-f_0(X))^2 + 2\bbP_n(Y-f_0)(f_0-\hat f_n) + \bbP_n(f_0-\hat f_n)^2 \le \bbP_n(Y-f_0(X))^2\,.\]Since $\xi_i=Y_i-f_0(X_i)$, we conclude that

\[\bbP_n(\hat f_n(X)-f_0(X))^2 \le 2\bbP_n(\xi(\hat f_n(X)-f_0(X)))\le 2\sup_{f\in \cF}\bbP_n(\xi(f(X)-f_0(X)))\]where the process

\[f\mapsto n(\bbP_n-P)(\xi f(X)) = n\bbP_n(\xi f(X)) = \sum_{i=1}^n \xi_if(X_i)\]is a **multiplier empirical process**. (The first equality holds because the errors are mean-zero and independent of the $X_i$’s, so $P(\xi f(X))=\bbE[\xi]\,\bbE[f(X)]=0$.)
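The basic inequality $\bbP_n(\hat f_n-f_0)^2\le 2\bbP_n\xi(\hat f_n-f_0)$ can be verified numerically; a sketch with a linear class and $f_0\in\cF$ (my choice of $\cF$ for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
f0 = lambda x: 2.0 * x + 1.0  # true f0, inside the linear class F

X = rng.uniform(size=n)
xi = rng.standard_normal(size=n)
Y = f0(X) + xi

# LSE over the linear class via ordinary least squares
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
fhat = A @ coef

# Basic inequality: P_n (fhat - f0)^2 <= 2 P_n xi (fhat - f0), with xi = Y - f0(X)
lhs = np.mean((fhat - f0(X)) ** 2)
rhs = 2 * np.mean((Y - f0(X)) * (fhat - f0(X)))
print(lhs, rhs)
```

Since $f_0$ lies in the class, the inequality follows deterministically from the optimality of $\hat f_n$, so `lhs <= rhs` holds up to floating-point error.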

Consider

\[\bbM_n(f) = 2\bbP_n\xi(f-f_0) - \bbP_n(f-f_0)^2\,,\]and note that $\hat f_n$ maximizes $\bbM_n(f)$ over $\cF$ (equivalently, it minimizes $-\bbM_n(f)$, the excess empirical squared-error loss relative to $f_0$). Since the errors have zero mean and are independent of the $X_i$’s, this process has mean $M(f)\equiv -P(f-f_0)^2$. Since $\bbM_n(f_0)=0=M(f_0)$, centering then yields the process

\[\bbZ_n(f)\equiv \bbM_n(f)-M(f) = 2\bbP_n\xi(f-f_0) - (\bbP_n-P)(f-f_0)^2\,.\]

Establishing rates of convergence for $\hat f_n$ then boils down to bounding

\[\bbE \sup_{f\in\cF:P(f-f_0)^2\le \delta^2}\bbZ_n(f)\]as a function of $n$ and $\delta$.

**Question 2:** Under what moment conditions on the $\xi_i$’s can we assert that the multiplier empirical process has roughly the same size as the empirical process $f\mapsto n(\bbP_n-P)(f(X))=\sum_{i=1}^n(f(X_i)-Pf)$, or equivalently the symmetrized empirical process $f\mapsto \sum_{i=1}^n\varepsilon_i f(X_i)$ for nearly all function classes $\cF$ in a non-asymptotic manner?

The paper’s contributions on this front are to:

- provide simple moment conditions on the $\xi_i$’s;
- give comparisons to existing multiplier inequalities, illustrating the improvement possible via the new bounds in non-asymptotic settings.

### Notation

- $\Vert\xi\Vert_p\equiv (\bbE\vert\xi\vert^p)^{1/p}$
- $\Vert \xi\Vert_{p,1}\equiv \int_0^\infty \bbP(\vert \xi\vert> t)^{1/p}dt$
- $\Vert f\Vert_{L_p(P)}\equiv (P\vert f\vert^p)^{1/p}$: the usual $L_p$ norm under $P$
- $\Vert f\Vert_{\infty}=\sup_{x\in\cX}\vert f(x)\vert$
- $f$ is $P$-centered if $Pf=0$; $\cF$ is $P$-centered if all $f\in \cF$ are $P$-centered.
- $L_p(g,B)$: the $L_p(P)$-ball centered at $g$ with radius $B$.
- for any $\phi:\cF\rightarrow \bbR$, write $\Vert \phi(f)\Vert_\cF$ for $\sup_{f\in \cF}\vert \phi(f)\vert$.
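As a concrete check of the $\Vert\cdot\Vert_{p,1}$ norm defined above (my example): if $\vert\xi\vert\sim\mathrm{Exp}(1)$, then $\bbP(\vert\xi\vert>t)=e^{-t}$ and $\Vert\xi\Vert_{2,1}=\int_0^\infty e^{-t/2}\,dt=2$.

```python
import numpy as np

def weak_p1_norm(survival, p, t_max=100.0, num=200_001):
    """Numerically integrate ||xi||_{p,1} = int_0^inf P(|xi| > t)^(1/p) dt."""
    t = np.linspace(0.0, t_max, num)
    y = survival(t) ** (1.0 / p)
    dt = t[1] - t[0]
    return dt * (y.sum() - 0.5 * (y[0] + y[-1]))  # trapezoidal rule

survival = lambda t: np.exp(-t)      # |xi| ~ Exp(1): P(|xi| > t) = e^{-t}
nrm = weak_p1_norm(survival, p=2)
print(nrm)                           # closed form: int_0^inf e^{-t/2} dt = 2
```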