
Concentration Inequality for Machine Learning

Tags: Concentration Inequality

This post is based on the material of the first lecture of STAT6050 instructed by Prof. Wicker.

First, Prof. Wicker started with measures of the complexity of a model/method and introduced the so-called “margin” bound,

\[\E(f(s))\le \hat \E_sf(s) + 2\hat \cR_m(\cF) + \sqrt{\frac{\log \frac 1\delta}{2m}}\]

which holds with probability at least $1-\delta$, and which he planned to prove in the following lectures.

Formally, the margin bounds are bounds that guarantee good generalization error (small $R(f)$) for any classifier $f$ that correctly classifies the training data with a large confidence, where confidence is measured in terms of the functional margin $yf(x)$.

Define \(\phi_\rho(u) = \begin{cases} 1 & u\in (-\infty, 0)\\ 1-\frac u\rho & u\in [0, \rho]\\ 0 & u\in (\rho, \infty) \end{cases}\) The $\rho$-margin loss is \(L_\rho(y, t) = \phi_\rho(y\cdot t)\,.\)

Note that

\[\hat R_{L_\rho}(f) = \frac 1n \sum_{i=1}^n\phi_\rho(y_if(x_i)) \le \frac 1n\sum_{i=1}^n\1(y_if(x_i)\le \rho)=:\hat R_\rho(f)\,.\]

If $f(x) = w^Tx$ and $\Vert w\Vert_2=1$, then $\vert w^Tx\vert$ is the distance from $x$ to the hyperplane $\{x: x^Tw=0\}$, which is known as the geometric margin, while $w^Tx$ itself is the signed distance.

Both LDA and SVM involve the signed distance; refer to Figure 4.15 of ESL for more details.
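To make these definitions concrete, here is a minimal numerical sketch of my own (the data and the unit-norm linear scorer are made up) that computes the $\rho$-margin loss $\hat R_{L_\rho}(f)$ and the margin error $\hat R_\rho(f)$ and checks the inequality $\hat R_{L_\rho}(f)\le \hat R_\rho(f)$ above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: n points in R^d with labels in {-1, +1}, and a unit-norm linear
# scorer f(x) = w^T x, so w^T x is the signed distance to {x : w^T x = 0}.
n, d, rho = 200, 5, 0.5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
w /= np.linalg.norm(w)                              # ||w||_2 = 1
y = np.sign(X @ w + 0.3 * rng.normal(size=n))       # noisy labels in {-1, +1}

def phi_rho(u, rho):
    """rho-margin loss: 1 on (-inf, 0), 1 - u/rho on [0, rho], 0 on (rho, inf)."""
    return np.clip(1.0 - u / rho, 0.0, 1.0)

margins = y * (X @ w)                               # functional margins y_i f(x_i)
R_L_rho = phi_rho(margins, rho).mean()              # \hat R_{L_rho}(f)
R_rho   = (margins <= rho).mean()                   # \hat R_rho(f), the margin error

print(f"rho-margin loss = {R_L_rho:.3f}")
print(f"margin error    = {R_rho:.3f}")
assert R_L_rho <= R_rho + 1e-12                     # \hat R_{L_rho}(f) <= \hat R_rho(f)
```

Increasing $\rho$ raises both empirical quantities, which is the trade-off that the margin bound below balances against the shrinking $\frac{2}{\rho}\cR_n(\cF)$ term.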

The basic margin bound is as follows:

Let $\cF\subseteq [a, b]^\cX$ and fix $\rho > 0, \delta > 0$. With probability at least $1-\delta$, for all $f\in \cF$,

\[R(f)\le \hat R_\rho(f) + \frac 2\rho \cR_n(\cF) + \sqrt{\frac{\log \frac 1\delta}{2n}}\,.\]

Then he talked about Rademacher complexity, which is a measure of the richness of a class of real-valued functions. Before this, I had only heard of the VC dimension, so I asked about its relationship to this lecture; Prof. Wicker said that he would prove the VC inequality using today’s material. It is said that, unlike the VC dimension, Rademacher complexity is not restricted to binary functions, and it also proves useful in the analysis of other learning algorithms such as kernel-based algorithms.

Formally, let $\cG\subseteq [a, b]^\cZ$ be a set of functions $\cZ \rightarrow [a, b]$ where $a, b\in \IR, a < b$. Let $Z_1,\ldots, Z_n$ be iid random variables on $\cZ$ following some distribution $P$. Denote the sample $S = (Z_1,\ldots, Z_n)$.

The empirical Rademacher complexity of $\cG$ with respect to the sample $S$ is

\[\hat \fR_S(\cG) \triangleq \E_\sigma\left[\sup_{g\in \cG}\frac 1n\sum_{i=1}^n\sigma_ig(Z_i) \right]\]

where $\sigma = (\sigma_1,\ldots, \sigma_n)^T$ with $\sigma_i\iid U\{-1,1\}$. Here $\sigma_1,\ldots,\sigma_n$ are known as Rademacher random variables.

The Rademacher complexity of $\cG$ is

\[\fR_n(\cG) = \E_S[\hat\fR_S(\cG)]\,.\]

An interpretation in the context of binary classification is that $\cG$ is rich (equivalently, $\hat \fR_S(\cG)$ or $\fR_n(\cG)$ is large) if we can choose functions $g$ that accurately match the different random sign patterns given by $\sigma$.
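As a small illustration of the definition (my own sketch, not from the lecture), the empirical Rademacher complexity of the linear class $\cG = \{z\mapsto w^Tz : \Vert w\Vert_2\le 1\}$ can be approximated by Monte Carlo over $\sigma$, since for this class the supremum has the closed form $\sup_{\Vert w\Vert_2\le 1}\frac 1n\sum_{i=1}^n\sigma_iw^Tz_i = \frac 1n\big\Vert\sum_{i=1}^n\sigma_iz_i\big\Vert_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher_linear(Z, n_mc=2000, rng=rng):
    """Monte Carlo estimate of hat R_S(G) for G = {z -> w^T z : ||w||_2 <= 1}.

    For this class, sup_{||w||<=1} (1/n) sum_i sigma_i w^T z_i
    equals (1/n) || sum_i sigma_i z_i ||_2, so each draw of sigma
    gives the inner supremum in closed form.
    """
    n = Z.shape[0]
    sups = np.empty(n_mc)
    for b in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        sups[b] = np.linalg.norm(sigma @ Z) / n   # closed-form supremum
    return sups.mean()                            # approximates E_sigma[...]

# A made-up sample S of n points in R^d
n, d = 100, 5
Z = rng.normal(size=(n, d))
print(empirical_rademacher_linear(Z),
      np.sqrt((Z ** 2).sum()) / n)   # compare with the bound sqrt(sum_i ||z_i||^2) / n
```

For a general class one would replace the closed-form supremum by an explicit maximization over $\cG$ for each draw of $\sigma$.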

Then he introduced several concentration inequalities as preliminary knowledge.

Hoeffding’s Lemma: $\E e^{tX}\le e^{\frac{t^2(b-a)^2}{8}}$ if $a\le X\le b$ and $\E X=0$.

By the convexity of the exponential,

\[e^{tX} \le \frac{X-a}{b-a}e^{tb} + \frac{b-X}{b-a}e^{ta}\]

we have

\[\E e^{tX} \le \frac{-a}{b-a}e^{tb} + \frac{b}{b-a}e^{ta} = \exp\left[ta + \log \left( \frac{-a}{b-a}e^{t(b-a)} + \frac{b}{b-a} \right)\right] \triangleq \exp\varphi(t)\,.\]

Note that $\varphi(0) = 0$, and

\[\begin{align*} \varphi'(t) &= a + \frac{-ae^{t(b-a)}}{-\frac{a}{b-a}e^{t(b-a)} + \frac{b}{b-a}}\\ \varphi''(t) &= \mu(1-\mu)(b-a)^2 \end{align*}\]

where

\(\mu = \frac{\frac{b}{b-a}}{-\frac{a}{b-a}e^{t(b-a)} + \frac{b}{b-a}}\,,\) then $\varphi'(0)=0$ and $\varphi''(t)\le \frac 14(b-a)^2$. It follows that

\[\varphi(t) = \varphi(0) + t\varphi'(0) + \frac{t^2}{2}\varphi''(\xi) \le \frac{t^2(b-a)^2}{8}\,.\]
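Here is a quick numerical sanity check of Hoeffding’s Lemma (my own sketch, not from the lecture): for a zero-mean two-point distribution on $\{a, b\}$, the moment generating function can be computed exactly and compared with the bound $e^{t^2(b-a)^2/8}$.

```python
import numpy as np

# Two-point distribution on {a, b} with mean zero:
# P(X = b) = -a / (b - a), P(X = a) = b / (b - a).
a, b = -1.0, 2.0
p_b = -a / (b - a)
p_a = b / (b - a)
assert abs(p_a * a + p_b * b) < 1e-12                   # E X = 0

for t in np.linspace(-2.0, 2.0, 9):
    mgf   = p_a * np.exp(t * a) + p_b * np.exp(t * b)   # E e^{tX}, exact
    bound = np.exp(t ** 2 * (b - a) ** 2 / 8.0)         # Hoeffding's Lemma bound
    assert mgf <= bound + 1e-12
    print(f"t={t:+.2f}  E e^(tX)={mgf:.4f}  bound={bound:.4f}")
```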

Then he gave Hoeffding’s Inequality:

If $a_i\le X_i\le b_i$ and $X_i, i=1,\ldots,m$ are independent, then

\[P(\sum_{i=1}^m(X_i-\E X_i)\ge \varepsilon) \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^m(b_i-a_i)^2}\right)\]

The proof follows from Markov’s Inequality together with the above Hoeffding’s Lemma.
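A small simulation (my own sketch, with made-up choices of $m$ and $\varepsilon$) compares the empirical tail probability of a centered sum of independent Uniform$[0,1]$ variables with the Hoeffding bound above.

```python
import numpy as np

rng = np.random.default_rng(0)

# m independent Uniform[0, 1] variables, so a_i = 0, b_i = 1 and E X_i = 1/2.
m, n_sim, eps = 50, 100_000, 5.0
X = rng.uniform(size=(n_sim, m))
deviations = (X - 0.5).sum(axis=1)          # sum_i (X_i - E X_i) for each simulation

empirical = (deviations >= eps).mean()
hoeffding = np.exp(-2 * eps ** 2 / m)       # sum_i (b_i - a_i)^2 = m

print(f"empirical tail  P = {empirical:.2e}")
print(f"Hoeffding bound   = {hoeffding:.2e}")   # the bound should dominate
```

As expected, the bound is conservative at this deviation level but valid.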

Next came the “Conditional” Hoeffding’s Lemma:

If $\E(X\mid Z)=0$ and $f(Z)\le X\le f(Z)+c$, then

\[\E(e^{tX}\mid Z)\le e^{\frac{t^2c^2}{8}}\,.\]

The proof repeats that of Hoeffding’s Lemma, conditionally on $Z$.

The next concentration inequality is Azuma’s Inequality:

If $X_i$ is a function of $Z_1,\ldots, Z_i$, $\E(X_i\mid Z_1,\ldots,Z_{i-1})=0$ and $f_i(Z_1,\ldots,Z_{i-1})\le X_i\le f_i(Z_1,\ldots,Z_{i-1})+c_i$ for some function $f_i$, $i=1,\ldots,m$, then

\[P\left(\sum_{i=1}^mX_i\ge \varepsilon\right)\le \exp\left(\frac{-2\varepsilon^2}{\sum_{i=1}^mc_i^2}\right)\,.\]

For any $t>0$,

\[\begin{align*} P\left(\sum_{i=1}^mX_i\ge \varepsilon\right) &= P\left(e^{t\sum_{i=1}^mX_i} \ge e^{t\varepsilon} \right)\\ &\le e^{-t\varepsilon}\E\left(e^{t\sum_{i=1}^mX_i}\right)\\ &=e^{-t\varepsilon}\E\left( \E\left( e^{t\sum_{i=1}^mX_i}\mid Z_1,\ldots, Z_{m-1} \right) \right)\\ &=e^{-t\varepsilon}\E\left( e^{t\sum_{i=1}^{m-1}X_i}\E\left( e^{tX_m}\mid Z_1,\ldots, Z_{m-1} \right) \right)\,. \end{align*}\]

Note that $\E(X_m\mid Z_1,\ldots,Z_{m-1})=0$ and $f_m(Z_1,\ldots,Z_{m-1})\le X_m\le f_m(Z_1,\ldots, Z_{m-1})+c_m$, so we can use the “Conditional” Hoeffding’s Lemma and get

\[P(\sum_{i=1}^mX_i\ge \varepsilon)\le e^{-t\varepsilon}\E\left(e^{t\sum_{i=1}^{m-1}X_i}\right)e^{\frac{t^2c_m^2}{8}}\le e^{-t\varepsilon}e^{\frac{t^2\sum_{i=1}^mc_i^2}{8}}\,,\]

where the last inequality follows by iterating the same argument. Finally, take the optimal $t$.
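Written out, this last (Chernoff) step minimizes the exponent over $t>0$:

\[-t\varepsilon + \frac{t^2\sum_{i=1}^m c_i^2}{8}\quad\text{is minimized at}\quad t^\ast = \frac{4\varepsilon}{\sum_{i=1}^m c_i^2}\,,\qquad -t^\ast\varepsilon + \frac{(t^\ast)^2\sum_{i=1}^m c_i^2}{8} = -\frac{2\varepsilon^2}{\sum_{i=1}^m c_i^2}\,,\]

which gives the stated bound.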

This reminds me of the concentration bounds for martingale difference sequences discussed in Wainwright (2019); in fact, the $X_i$ here constitute a martingale difference sequence.

Given a sequence $\{Y_k\}_{k=1}^\infty$ of random variables adapted to a filtration $\{\cF_k\}_{k=1}^\infty$, the pair $\{(Y_k, \cF_k)\}_{k=1}^\infty$ is a martingale if for all $k\ge 1$,

\[\E[\vert Y_k\vert] < \infty\qquad \text{and}\quad \E[Y_{k+1}\mid \cF_k] = Y_k\,.\]

A closely related notion is that of a martingale difference sequence, meaning an adapted sequence $\{(D_k,\cF_k)\}_{k=1}^\infty$ such that for all $k\ge 1$,

\[\E[\vert D_k\vert] < \infty \qquad \text{and}\quad \E[D_{k+1}\mid \cF_k] = 0\,.\]
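For a toy example of a martingale difference sequence that is not just an iid sum (my own construction, not from the lecture or Wainwright (2019)), take $D_k = \sigma_kB_k$ with $\sigma_k$ an independent Rademacher sign and $B_k\in(0,1]$ a function of $D_1,\ldots,D_{k-1}$; then $\E[D_k\mid \cF_{k-1}]=0$ and $-1\le D_k\le 1$, so Azuma’s Inequality applies with $c_k=2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mds_path(m, rng):
    """One path of a martingale difference sequence D_k = sigma_k * B_k,
    where B_k in (0, 1] depends on the past partial sum (so the D_k are not iid)."""
    D = np.empty(m)
    partial = 0.0
    for k in range(m):
        B = 1.0 / (1.0 + abs(partial))          # predictable scale in (0, 1]
        D[k] = rng.choice([-1.0, 1.0]) * B      # E[D_k | past] = 0 and |D_k| <= 1
        partial += D[k]
    return D

m, n_sim, eps = 30, 20_000, 4.0
sums = np.array([mds_path(m, rng).sum() for _ in range(n_sim)])

empirical = (sums >= eps).mean()
azuma = np.exp(-2 * eps ** 2 / (m * 2.0 ** 2))  # c_k = 2 since -1 <= D_k <= 1
print(f"empirical tail  P = {empirical:.2e}")
print(f"Azuma bound       = {azuma:.2e}")
```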

Finally, he moved to the bounded differences inequality (McDiarmid’s Inequality):

Let $s = (X_1,\ldots, X_m)$, where $X_1,\ldots,X_m$ are independent random variables; then

\[P(f(s) - \E f(s)\ge \varepsilon) \le e^{-\frac{2\varepsilon^2}{\sum_{i=1}^mc_i^2}}\]

under the hypothesis that $\forall i, x_i, x_i'$,

\[\vert f(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_m) - f(x_1,\ldots,x_{i-1},x_{i}',x_{i+1},\ldots,x_m)\vert\le c_i\,.\]

Write $f(s)-\E f(s)$ in a telescoping form to make use of the hypothesis,

\[f(s)-\E f(s) = \sum_{i=1}^m V_i\,,\]

where

\(V_i = \E[f(s)\mid X_1,\ldots,X_i] - \E[f(s)\mid X_1,\ldots,X_{i-1}]\,.\) Actually, $\{V_i\}$ is a martingale difference sequence, and hence $\E[V_i\mid X_1,\ldots,X_{i-1}]=0$. Note that

\[\underbrace{\inf_x \E[f(s)\mid X_1,\ldots,X_{i-1},x] - \E[f(s)\mid X_1,\ldots,X_{i-1}]}_{W_i}\le V_i\le \underbrace{\sup_x \E[f(s)\mid X_1,\ldots,X_{i-1},x] - \E[f(s)\mid X_1,\ldots,X_{i-1}]}_{U_i}\,,\]

and

\[\begin{align*} U_i - W_i &\le \sup_{x,x'}\left\vert\E[f(s)\mid X_1,\ldots,X_{i-1},x] - \E[f(s)\mid X_1,\ldots,X_{i-1},x']\right\vert\\ &=\sup_{x,x'}\left\vert\int \left[f(X_1,\ldots,X_{i-1}, x, X_{i+1},\ldots,X_m) - f(X_1,\ldots,X_{i-1}, x', X_{i+1},\ldots,X_m)\right] dP(X_{i+1},\ldots,X_m)\right\vert\\ &\le c_i\,, \end{align*}\]

so $W_i\le V_i\le W_i + c_i$ with $W_i$ a function of $X_1,\ldots,X_{i-1}$, and the result follows from Azuma’s Inequality.
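To close with a simulation (my own sketch; the function class and sample size are made up), take $f(s) = \sup_{g\in\cG}\big\vert\frac 1m\sum_{i=1}^mg(X_i) - \E g(X)\big\vert$ for a small finite class $\cG$ of indicator functions on $[0,1]$. Changing one coordinate of $s$ moves every empirical mean by at most $1/m$, so the bounded differences condition holds with $c_i = 1/m$, and the inequality above gives $P(f(s)-\E f(s)\ge\varepsilon)\le e^{-2m\varepsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite class G of indicators g_t(z) = 1(z <= t) on Z = [0, 1]; E g_t(Z) = t for Z ~ U[0, 1].
thresholds = np.linspace(0.1, 0.9, 9)

def f(sample):
    """f(s) = sup_{g in G} | (1/m) sum_i g(X_i) - E g |, a sup of centered empirical means.
    Changing one coordinate of s moves every empirical mean by at most 1/m, so c_i = 1/m."""
    return max(abs((sample <= t).mean() - t) for t in thresholds)

m, n_sim, eps = 50, 20_000, 0.15
values = np.array([f(rng.uniform(size=m)) for _ in range(n_sim)])

empirical = (values - values.mean() >= eps).mean()       # values.mean() as a stand-in for E f(s)
mcdiarmid = np.exp(-2 * eps ** 2 / (m * (1.0 / m) ** 2)) # bound with c_i = 1/m
print(f"empirical tail P = {empirical:.2e}")
print(f"McDiarmid bound  = {mcdiarmid:.2e}")
```

This is the type of function that appears in the proof of the margin bound, where the bounded differences inequality reduces the problem to bounding the expected supremum, which is then controlled by the Rademacher complexity.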


Published in categories Course