# Concentration Inequalities for Machine Learning

##### Posted on Jan 09, 2020

This post is based on the material of the first lecture of STAT6050, taught by Prof. Wicker.

Firstly, Prof. Wicker started with the measure of complexity of a model/method, and introduced the so-called “margin” inequality, which holds with probability larger than $1-\delta$, and which he wanted to prove in the following lecture.

Formally, the margin bounds are bounds that guarantee good generalization error (small $R(f)$) for any classifier $f$ that correctly classifies the training data with a large confidence, where confidence is measured in terms of the functional margin $yf(x)$.

Define

$\phi_\rho(t) = \begin{cases} 1 & t\le 0\,,\\ 1 - t/\rho & 0 < t < \rho\,,\\ 0 & t\ge \rho\,.\end{cases}$

The $\rho$-margin loss is $L_\rho(y, t) = \phi_\rho(y\cdot t)\,.$
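As a quick illustration, the $\rho$-margin loss takes only a couple of lines; `phi_rho` below is my own helper name, and the assertions check the standard fact that $\phi_\rho$ sits between the 0-1 indicators $1_{t\le 0}$ and $1_{t\le \rho}$.

```python
import numpy as np

def phi_rho(t, rho):
    """rho-margin loss phi_rho(t): 1 for t <= 0, 1 - t/rho on (0, rho), 0 for t >= rho."""
    return np.clip(1.0 - t / rho, 0.0, 1.0)

# Sanity check of 1{t<=0} <= phi_rho(t) <= 1{t<=rho} on a grid.
t = np.linspace(-2, 2, 401)
rho = 0.5
assert np.all((t <= 0).astype(float) <= phi_rho(t, rho) + 1e-12)
assert np.all(phi_rho(t, rho) <= (t <= rho).astype(float) + 1e-12)
```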

Note that

$1_{t\le 0}\le \phi_\rho(t)\le 1_{t\le \rho}\,,$

so the $\rho$-margin loss upper bounds the 0-1 loss.
If $f(x) = w^Tx$ and $\Vert w\Vert_2=1$, then $\vert w^Tx\vert$ is the distance from $x$ to the hyperplane defined by $\{x: x^Tw=0\}$, which is known as the geometric margin, while $w^Tx$ can be called signed distance.

Both LDA and SVM involve the signed distance; refer to Figure 4.15 of ESL for more details.
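To make the signed distance concrete, here is a tiny NumPy sketch (the particular vectors are arbitrary examples of mine):

```python
import numpy as np

# Signed distance to the hyperplane {x : w^T x = 0} with ||w||_2 = 1:
# w^T x is the signed distance, |w^T x| the geometric margin.
w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)   # normalize so that ||w||_2 = 1

x = np.array([1.0, 2.0])
signed = w @ x              # (3*1 + 4*2)/5 = 2.2
print(signed)               # 2.2
assert abs(signed - 2.2) < 1e-12
```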

The basic margin bound is as follows:

Let $\cF\subseteq [a, b]^\cX$ and fix $\rho > 0, \delta > 0$. With probability at least $1-\delta$, for all $f\in \cF$,

$R(f) \le \hat R_{S,\rho}(f) + \frac{2}{\rho}\fR_n(\cF) + \sqrt{\frac{\log(1/\delta)}{2n}}\,,$

where $\hat R_{S,\rho}(f) = \frac 1n\sum_{i=1}^n \phi_\rho(y_if(x_i))$ is the empirical margin loss.
Then, he talked about Rademacher complexity, which is a measure of the richness of a class of real-valued functions. I had previously heard of the VC dimension, and asked how it relates to this lesson; Prof. Wicker said that he would prove the VC inequality using today’s material. It is reported that, unlike the VC dimension, Rademacher complexity is not restricted to binary functions, and it also proves useful in the analysis of other learning algorithms such as kernel-based algorithms.

Formally, let $\cG\subseteq [a, b]^\cZ$ be a set of functions $\cZ \rightarrow [a, b]$ where $a, b\in \IR, a < b$. Let $Z_1,\ldots, Z_n$ be iid random variables on $\cZ$ following some distribution $P$. Denote the sample $S = (Z_1,\ldots, Z_n)$.

The empirical Rademacher complexity of $\cG$ with respect to the sample $S$ is

$\hat\fR_S(\cG) = \E_\sigma\left[\sup_{g\in\cG}\frac 1n\sum_{i=1}^n \sigma_i g(Z_i)\right]\,,$
where $\sigma = (\sigma_1,\ldots, \sigma_n)^T$ with $\sigma_i\iid U\{-1,1\}$. Here $\sigma_1,\ldots,\sigma_n$ are known as Rademacher random variables.

The Rademacher complexity of $\cG$ is

$\fR_n(\cG) = \E_S\big[\hat\fR_S(\cG)\big]\,.$
An interpretation in the context of binary classification is that $\cG$ is rich (equivalently, $\hat \fR_S(\cG)$ or $\fR_n(\cG)$ is large) if we can choose functions $g\in\cG$ that accurately match the different random sign combinations reflected by $\sigma$.
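For a finite class, the empirical Rademacher complexity can be estimated by Monte Carlo over the random signs. Below is a minimal sketch in NumPy; the name `empirical_rademacher` and the toy threshold class are my own illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(G, Z, n_mc=2000, rng=rng):
    """Monte Carlo estimate, for a *finite* class G of functions on Z,
    of E_sigma[ sup_g (1/n) sum_i sigma_i g(Z_i) ]."""
    n = len(Z)
    vals = np.array([[g(z) for z in Z] for g in G])   # |G| x n matrix of g(Z_i)
    total = 0.0
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher signs
        total += np.max(vals @ sigma) / n             # sup over the finite class
    return total / n_mc

# Toy class of threshold classifiers g_t(z) = 1{z <= t} on a small sample.
Z = rng.uniform(size=10)
G = [lambda z, t=t: float(z <= t) for t in np.linspace(0, 1, 21)]
est = empirical_rademacher(G, Z)
print(est)
```

With a richer class (more thresholds, or arbitrary sign patterns), the estimate grows toward its maximum, matching the interpretation above.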

Then he introduced several concentration inequalities as preliminary knowledge.

(Hoeffding’s Lemma) If $a\le X\le b$ and $\E X=0$, then $\E e^{tX}\le e^{\frac{t^2(b-a)^2}{8}}$.
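Before the proof, a quick numerical sanity check of the lemma (my own, using NumPy), for $X$ uniform on $\{-1,+1\}$, where the moment generating function is available in closed form:

```python
import numpy as np

# For X uniform on {-1, +1}: a = -1, b = 1, E X = 0, and
# E[e^{tX}] = cosh(t), so Hoeffding's Lemma says cosh(t) <= e^{t^2 (b-a)^2 / 8} = e^{t^2/2}.
for t in np.linspace(-3, 3, 61):
    assert np.cosh(t) <= np.exp(t**2 * 2**2 / 8) + 1e-12
```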

By the convexity of the exponential, for $x\in[a,b]$,

$e^{tx} \le \frac{b-x}{b-a}e^{ta} + \frac{x-a}{b-a}e^{tb}\,,$

so taking expectations (using $\E X=0$) we have

$\E e^{tX} \le \frac{b}{b-a}e^{ta} - \frac{a}{b-a}e^{tb} = e^{\varphi(t)}\,,\qquad \varphi(t) = ta + \log\left(\frac{b}{b-a} - \frac{a}{b-a}e^{t(b-a)}\right)\,.$

Note that $\varphi(0) = 0$, and

$\varphi''(t) = \mu(1-\mu)(b-a)^2\,,$

where

$\mu = \frac{\frac{b}{b-a}}{-\frac{a}{b-a}e^{t(b-a)} + \frac{b}{b-a}}\,;$

then $\varphi'(0)=0$ and $\varphi''(t)\le \frac 14(b-a)^2$. It follows by Taylor's theorem that, for some $\xi$ between $0$ and $t$,

$\varphi(t) = \varphi(0) + t\varphi'(0) + \frac{t^2}{2}\varphi''(\xi) \le \frac{t^2(b-a)^2}{8}\,.$
Then he gave Hoeffding’s Inequality: if $a_i\le X_i\le b_i$ and $X_i, i=1,\ldots,m$ are independent, then

$P\left(\left\vert\sum_{i=1}^m (X_i - \E X_i)\right\vert\ge \epsilon\right) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^m(b_i-a_i)^2}\right)\,.$
By Markov’s Inequality and the above Hoeffding’s Lemma, for any $t>0$,

$P\left(\sum_{i=1}^m (X_i-\E X_i)\ge \epsilon\right)\le e^{-t\epsilon}\prod_{i=1}^m \E e^{t(X_i-\E X_i)}\le \exp\left(-t\epsilon + \frac{t^2}{8}\sum_{i=1}^m (b_i-a_i)^2\right)\,,$

and taking the optimal $t = 4\epsilon/\sum_{i=1}^m(b_i-a_i)^2$ yields the one-sided bound; the two-sided version follows by also applying it to $-X_i$.
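Hoeffding’s Inequality is easy to check by simulation. A small sketch (my own toy setup with Uniform[0,1] variables):

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of Hoeffding's Inequality for m iid Uniform[0,1] variables
# (a_i = 0, b_i = 1): P(|sum (X_i - E X_i)| >= eps) <= 2 exp(-2 eps^2 / m).
m, eps, n_trials = 50, 5.0, 20000
X = rng.uniform(size=(n_trials, m))
dev = np.abs(X.sum(axis=1) - m * 0.5)   # deviation of the sum from its mean m/2
empirical = np.mean(dev >= eps)
bound = 2 * np.exp(-2 * eps**2 / m)
assert empirical <= bound
print(empirical, bound)
```

As usual for such worst-case bounds, the empirical tail probability is far below the bound.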
If $\E(X\mid Z)=0$ and $f(Z)\le X\le f(Z)+c$, then

$\E\big(e^{tX}\mid Z\big)\le e^{\frac{t^2c^2}{8}}\,.$
To prove it, repeat the proof of Hoeffding’s Lemma conditionally on $Z$.

The next concentration inequality is Azuma’s Inequality: if $X_i=f(Z_1,\ldots, Z_i)$, $\E(X_i\mid Z_1,\ldots,Z_{i-1})=0$ and $f(Z_1,\ldots,Z_{i-1})\le X_i\le f(Z_1,\ldots,Z_{i-1})+c_i$ for $i=1,\ldots,m$, then

$P\left(\left\vert\sum_{i=1}^m X_i\right\vert\ge \epsilon\right)\le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^m c_i^2}\right)\,.$
Note that $\E(X_m\mid Z_1,\ldots,Z_{m-1})=0$ and $f(Z_1,\ldots,Z_{m-1})\le X_m\le f(Z_1,\ldots, Z_{m-1})+c_m$, so we can use the “Conditional” Hoeffding’s Lemma and get

$\E e^{t\sum_{i=1}^m X_i} = \E\left[e^{t\sum_{i=1}^{m-1}X_i}\,\E\big(e^{tX_m}\mid Z_1,\ldots,Z_{m-1}\big)\right] \le e^{\frac{t^2c_m^2}{8}}\,\E e^{t\sum_{i=1}^{m-1}X_i}\le \cdots \le e^{\frac{t^2}{8}\sum_{i=1}^m c_i^2}\,,$

where the last inequality is obtained by iterating. Finally, apply Markov’s Inequality and take the optimal $t$, as in the proof of Hoeffding’s Inequality.

This reminds me of the concentration bounds for martingale difference sequences discussed in Wainwright (2019); in fact, the $X_i$ here constitute a martingale difference sequence.

Given a sequence $\{Y_k\}_{k=1}^\infty$ of random variables adapted to a filtration $\{\cF_k\}_{k=1}^\infty$, the pair $\{(Y_k, \cF_k)\}_{k=1}^\infty$ is a martingale if for all $k\ge 1$,

$\E\vert Y_k\vert < \infty\,,\qquad \E(Y_{k+1}\mid \cF_k) = Y_k\,.$
A closely related notion is that of a martingale difference sequence, meaning an adapted sequence $\{(D_k,\cF_k)\}_{k=1}^\infty$ such that for all $k\ge 1$,

$\E\vert D_k\vert<\infty\,,\qquad \E(D_{k+1}\mid\cF_k)=0\,.$
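To see a martingale difference sequence in action, here is a small simulation (my own toy example) checking the Azuma-type tail bound for a sum of iid Rademacher signs, which form a trivial martingale difference sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

# D_k iid Rademacher, independent of the past, so E[D_k | F_{k-1}] = 0.
# With -1 <= D_k <= -1 + c_k for c_k = 2, Azuma's bound gives
# P(|sum_{k=1}^m D_k| >= eps) <= 2 exp(-2 eps^2 / sum c_k^2) = 2 exp(-eps^2 / (2m)).
m, eps, n_trials = 100, 25.0, 20000
sums = np.zeros(n_trials)
for _ in range(m):
    sums += rng.choice([-1.0, 1.0], size=n_trials)
empirical = np.mean(np.abs(sums) >= eps)
assert empirical <= 2 * np.exp(-eps**2 / (2 * m))
print(empirical)
```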
Finally, he moved to McDiarmid’s Inequality (also known as the bounded differences inequality). Let $s = (x_1,\ldots, x_m)$; then

$P\big(\vert f(s) - \E f(s)\vert\ge \epsilon\big)\le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^m c_i^2}\right)$

under the hypothesis that $\forall i, x_i, x_i'$,

$\vert f(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_m) - f(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_m)\vert\le c_i\,.$

Write $f(s)-\E f(s)$ in telescoping form to make use of the hypothesis,

$f(s) - \E f(s) = \sum_{i=1}^m V_i\,,$

where

$V_i = \E[f(s)\mid X_1,\ldots,X_i] - \E[f(s)\mid X_1,\ldots,X_{i-1}]\,.$

Actually, $\{V_i\}$ is a martingale difference sequence, and hence $\E V_i=0$. Note that

$V_i \le \sup_{x}\E[f(s)\mid X_1,\ldots,X_{i-1},x] - \E[f(s)\mid X_1,\ldots,X_{i-1}]$

and

$V_i \ge \inf_{x}\E[f(s)\mid X_1,\ldots,X_{i-1},x] - \E[f(s)\mid X_1,\ldots,X_{i-1}]\,,$

and the bounded-differences hypothesis implies that the gap between these upper and lower bounds is at most $c_i$; then the result follows from Azuma’s Inequality.
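McDiarmid’s Inequality applied to the sample mean of bounded variables (bounded differences $c_i = 1/m$) recovers a Hoeffding-type bound; a quick simulation sketch (my own toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x_1,...,x_m) = (1/m) sum x_i with X_i iid Uniform[0,1]: changing one
# coordinate moves f by at most c_i = 1/m, so McDiarmid gives
# P(|f - E f| >= eps) <= 2 exp(-2 eps^2 / (m * (1/m)^2)) = 2 exp(-2 m eps^2).
m, eps, n_trials = 100, 0.1, 20000
means = rng.uniform(size=(n_trials, m)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= eps)
bound = 2 * np.exp(-2 * eps**2 / (m * (1.0 / m) ** 2))
assert empirical <= bound
print(empirical, bound)
```

The same template applies to less trivial statistics (e.g. the empirical Rademacher complexity itself, whose bounded differences drive the margin bound above).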

Published in categories Course