# WeiYa's Work Yard

##### Posted on Jan 16, 2020 (Update: Jan 17, 2020)

This post is based on the material of the second lecture of STAT 6050, instructed by Prof. Wicker, and mainly refers to the more formal treatment in the book *Foundations of Machine Learning* by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (The MIT Press, 2012).

First, he proved the following Rademacher complexity bounds.

Let $\cF$ be a family of functions mapping from $\cX$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1-\delta$, each of the following holds for all $f\in\cF$:

\begin{align} \E f(x) &\le \hat \E_S f(x) + 2\fR_m(\cF) + \sqrt{\frac{\log \frac 1\delta}{2m}}\label{eq:bound_m}\\ \E f(x) &\le \hat \E_S f(x) + 2\hat\fR_S(\cF) + 3\sqrt{\frac{\log \frac 2\delta}{2m}}\label{eq:bound_s} \end{align}
The proof starts from the decomposition

$\begin{equation} \E f(x) = \hat \E_S f(x) + \E f(x) - \hat \E_S f(x) \le \hat \E_S f(x) + \sup_f [\E f(x) - \hat \E_Sf(x)]\triangleq \hat \E_Sf(x) + \phi(S)\,,\label{eq:1} \end{equation}$

then we apply McDiarmid’s Inequality to the function $\phi(S)$. Let $S'$ be a sample that differs from $S$ only in the $i$-th point, with $x_i$ replaced by $x_i'$. Note that

$\begin{equation} \phi(S) - \phi(S')\le\sup_f[\hat \E_{S'}f(x) - \hat \E_Sf(x)] = \frac 1m\sup_f [f(x_i')-f(x_i)]\le \frac 1m\;\forall\,i=1,\ldots,m\,,\label{eq:s_sprime} \end{equation}$

follows from the sub-additivity of the supremum function,

$\sup(U+V) \le \sup(U) + \sup(V)$

and its variant,

$\sup(U+V) - \sup(U)\le \sup(V)\,.$

Similarly $\phi(S') - \phi(S) \le \frac 1m\;\forall\,i=1,\ldots,m$, hence $\vert \phi(S)-\phi(S')\vert \le \frac 1m\;\forall\,i=1,\ldots,m$. And the expectation of $\phi(S)$ can be bounded as follows,

\begin{align} \E_S\phi(S) &= \E_S\left[\sup_f[\E f(x) - \hat \E_Sf(x)]\right] = \E_S\left[\sup_f\E_{S'}[\hat \E_{S'}(f) - \hat \E_S(f) ]\right]\\ &\le \E_{S,S'}\left[\sup_{f}[\hat \E_{S'}(f) - \hat \E_S(f)]\right]\label{eq:S_Sprime}\\ &=\E_{S,S'}\left[ \sup_f\frac 1m\sum_{i=1}^m(f(x_i')-f(x_i)) \right]\\ &=\E_{\sigma,S,S'}\left[\sup_f\frac 1m\sum_{i=1}^m\sigma_i(f(x_i')-f(x_i))\right]\\ &\le \E_{\sigma,S'}\left[\sup_f\frac 1m\sum_{i=1}^m\sigma_if(x_i')\right] + \E_{\sigma,S}\left[\sup_f\frac 1m\sum_{i=1}^m-\sigma_if(x_i)\right]\\ &=2\E_{\sigma,S}\left[\sup_f\frac 1m\sum_{i=1}^m\sigma_if(x_i)\right] = 2\fR_m(\cF)\,, \end{align} where the $\sigma_i$ are Rademacher variables, i.e., independent random variables uniformly distributed on $\{-1,+1\}$. Now, we can use McDiarmid’s Inequality,

$P(\phi(S)-\E\phi(S)\ge \varepsilon) \le \exp\left(-\frac{2\varepsilon^2}{m\cdot \frac{1}{m^2}}\right) = \exp(-2m\varepsilon^2)\triangleq \delta\,,$

which implies

$\varepsilon = \sqrt{\frac{\log\frac{1}{\delta}}{2m}}\,,$

thus with at least probability $1-\delta$,

$\begin{equation} \phi(S) - \E\phi(S)\le \sqrt{\frac{\log\frac{1}{\delta}}{2m}}\,, \end{equation}$

that is,

$\phi(S) \le 2\fR_m(\cF) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}\,,$

combining with \eqref{eq:1}, we prove the first result. To derive a bound in terms of $\hat\fR_S(\cF)$, notice that

$\hat \fR_{S'}(\cF) - \hat \fR_S(\cF) \le \frac 1m\E_\bsigma\sup_f\sigma_i(f(x_i')-f(x_i)) \le \frac 1m\;\forall\,i=1,\ldots,m\,,$

and similarly $\hat\fR_S(\cF) - \hat\fR_{S'}(\cF)\le \frac 1m$, and hence $\vert \hat\fR_S(\cF) - \hat\fR_{S'}(\cF)\vert\le \frac 1m$. Using McDiarmid’s Inequality again (this time the other tail),

$P(\hat \fR_S(\cF) - \fR_m(\cF) \le -\varepsilon) \le \exp(-2m\varepsilon^2)\triangleq \frac \delta 2\,,$

which implies with at least probability $1-\delta/2$,

$\hat\fR_S(\cF)\ge \fR_m(\cF) - \sqrt{\frac{\log\frac{2}{\delta}}{2m}}\,,$

combining this with \eqref{eq:bound_m} (applied with $\delta/2$ in place of $\delta$) via a union bound yields, with probability at least $1-\delta$:

$\phi(S) \le 2\hat\fR_S(\cF) + 3\sqrt{\frac{\log\frac 2\delta}{2m}}\,,$

which proves \eqref{eq:bound_s}. I think there is some abuse of notation: in validating the bounded difference condition \eqref{eq:s_sprime}, $S$ and $S'$ should be treated as two realized samples (not random variables) differing in only one point, while in the calculation of $\E_S\phi(S)$ in \eqref{eq:S_Sprime}, $S$ and $S'$ are two i.i.d. samples, and $S'$ is also called the ghost sample.
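As a numerical sanity check on the quantities appearing in the proof, the empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo and combined with the McDiarmid deviation term. This is a minimal sketch with made-up data: the class values, sample size $m$, and $\delta$ below are arbitrary choices, not from the lecture.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(values, sigma):
    """hat R_S(F) = (1/m) E_sigma sup_f sum_i sigma_i f(x_i),
    for a finite class given as values[k, i] = f_k(x_i),
    with the expectation estimated over the rows of sigma."""
    m = values.shape[1]
    return (sigma @ values.T).max(axis=1).mean() / m

m, n_funcs = 10, 16
values = rng.random((n_funcs, m))                   # f_k(x_i) in [0, 1]
sigma = rng.choice([-1.0, 1.0], size=(50_000, m))   # Rademacher draws

rad = empirical_rademacher(values, sigma)

# Deviation term from inverting exp(-2 m eps^2) = delta
delta = 0.05
eps = math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Slack of bound (2): 2 hat R_S(F) + 3 sqrt(log(2/delta) / (2m))
slack = 2 * rad + 3 * math.sqrt(math.log(2.0 / delta) / (2.0 * m))
```

For such a small $m$ the deviation term dominates, which illustrates why these bounds are only informative for large samples.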

Then he illustrated the Rademacher complexity of linear classifiers as an application. Consider $\cF=\{x\mapsto w^Tx; \Vert w\Vert \le \Lambda\}$ and $x\in \cX$ with $\Vert x\Vert \le r$; then

\begin{align} \hat \fR_S(\cF) &= \frac 1m\E_\bsigma \sup_f\sum_{i=1}^m\sigma_if(x_i) = \frac 1m\E_\bsigma \sup_w\sum_{i=1}^m\sigma_iw^Tx_i\\ &=\frac 1m\E_\bsigma \sup_{w;\Vert w\Vert\le \Lambda}w^T\sum_{i=1}^m\sigma_ix_i\\ &=\frac\Lambda m\E_\bsigma\lVert \sum_{i=1}^m\sigma_ix_i\rVert=\frac\Lambda m\E_\bsigma\sqrt{\Vert\sum_{i=1}^m\sigma_ix_i\Vert^2}\\ &\le \frac \Lambda m\sqrt{\E_\bsigma\Vert \sum_{i=1}^m\sigma_ix_i\Vert^2}\tag{by Jensen's Inequality}\\ &=\frac\Lambda m\sqrt{\E_\bsigma[\sum_i\sum_j\sigma_i\sigma_jx_i^Tx_j]}\\ &=\frac\Lambda m\sqrt{\sum_{i=1}^m\Vert x_i\Vert^2}\label{eq:sigma_ij}\\ &\le\frac{\Lambda r}{\sqrt m} \end{align}

where \eqref{eq:sigma_ij} follows from the fact that $\sigma_i^2 = 1$, while $\E\sigma_i\sigma_j=\E\sigma_i\E\sigma_j = 0$ for $i\neq j$ since $\E\sigma_i=0$.
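The closed form $\hat\fR_S(\cF) = \frac\Lambda m\E_\bsigma\Vert\sum_i\sigma_ix_i\Vert$ from the derivation above makes this easy to check numerically. A minimal sketch with made-up points on the sphere of radius $r$ (the values of $m$, $d$, $\Lambda$, $r$ are arbitrary):

```python
import numpy as np

# Monte Carlo check that hat R_S(F) <= Lambda * r / sqrt(m) for the
# linear class F = {x -> w^T x : ||w|| <= Lambda} with ||x_i|| <= r.
rng = np.random.default_rng(1)
m, d = 50, 3
Lambda, r = 2.0, 1.0

X = rng.normal(size=(m, d))
X = r * X / np.linalg.norm(X, axis=1, keepdims=True)  # points with ||x_i|| = r

# hat R_S(F) = (Lambda / m) E_sigma || sum_i sigma_i x_i ||
# (the sup over w is attained at w = Lambda * v / ||v||, v = sum_i sigma_i x_i)
sigma = rng.choice([-1.0, 1.0], size=(20_000, m))
rad_est = Lambda / m * np.linalg.norm(sigma @ X, axis=1).mean()

bound = Lambda * r / np.sqrt(m)
```

The estimate falls strictly below the bound, reflecting the slack introduced by Jensen's inequality.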

Remarks:

• We are not interested in $\E w^Tx \le \hat \E_S(w^Tx) + \cdots$ itself, but in bounding a risk function, for example $\E_{(x, y)\sim D}\1_{y\neq f(x)}$.
• A less known inequality than Jensen’s is the Khinchin-Kahane inequality.

The Khinchin-Kahane inequality is a classical inequality in the probability literature. It was initially studied by Khinchin in the real case, and later extended by Kahane to normed linear spaces. A general version for Banach spaces (a vector space $X$ over a scalar field $K$, equipped with a norm $\Vert\cdot\Vert_X$ and complete with respect to the distance induced by the norm), together with sharp constants in some cases, is as follows:

For all $p, q\in[1, \infty)$, there exists a universal constant $C_{p,q} > 0$ depending only on $p,q$ such that for all choices of Banach spaces $\bbB$, finite sets of vectors $x_1,\ldots,x_n\in \bbB$, and independent Rademacher variables $\sigma_1,\ldots,\sigma_n$,

$\E\left[\left\Vert \sum_{i=1}^n\sigma_ix_i\right\Vert^q\right]^{1/q} \le C_{p,q} \cdot \E\left[\left\Vert\sum_{i=1}^n\sigma_ix_i \right\Vert^p\right]^{1/p}\,.$

If moreover $p=1\le q\le 2$, then the constant $C_{1,q}=2^{1-1/q}$ is optimal.

In particular, taking $p=1$ and $q=2$ (so $C_{1,2}=\sqrt 2$),

$\frac 12\E\left\Vert\sum_{i=1}^n\sigma_ix_i\right\Vert^2 \le \left[\E \left\Vert\sum_{i=1}^n\sigma_ix_i \right\Vert \right]^2 \le \E\left\Vert\sum_{i=1}^n\sigma_ix_i\right\Vert^2\,.$
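For a small $n$ both expectations can be computed exactly by enumerating all $2^n$ sign patterns, which makes the $p=1$, $q=2$ case easy to verify. A minimal sketch with made-up vectors (the choices of $n$, $d$, and the $x_i$ are arbitrary):

```python
import numpy as np

# Exact check of the p=1, q=2 Khinchin-Kahane inequality:
#   (1/2) E ||sum sigma_i x_i||^2  <=  (E ||sum sigma_i x_i||)^2
# together with the Jensen direction on the right.
rng = np.random.default_rng(2)
n, d = 6, 4
xs = rng.normal(size=(n, d))

# All 2^n sign patterns, so the expectations are exact
signs = np.array([[1.0 if (k >> i) & 1 else -1.0 for i in range(n)]
                  for k in range(2 ** n)])
norms = np.linalg.norm(signs @ xs, axis=1)

first_moment = norms.mean()          # E || sum_i sigma_i x_i ||
second_moment = (norms ** 2).mean()  # E || sum_i sigma_i x_i ||^2
```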

He said that, building on top of the results for linear functions, we can obtain results for other classes of functions, such as neural networks, that is,

$\hat\fR_m(\cF_d)\rightarrow \hat \fR_m(\cF_{d-1})\rightarrow \cdots\rightarrow\text{linear functions}\,,$

where $\cF_d$ denotes the neural networks with $d$ layers.

Let $\phi$ be an $L$-Lipschitz function. Then (this is known as Talagrand's contraction lemma) $$\fR_m(\phi\circ\cF)\le L\fR_m(\cF)\,.$$

First fix a sample $S=(x_1,\ldots,x_m)$, then, by definition,

$\hat\fR_S(\phi\circ\cF) = \frac 1m \E_\bsigma\sup_f\sum_{i=1}^m\sigma_i(\phi\circ f)(x_i)=\frac 1m\E_{\sigma_1,\ldots,\sigma_{m-1}}\left[\E_{\sigma_m}\left[ \sup_f \left(u_{m-1}(f) + \sigma_m(\phi\circ f)(x_m)\right) \right]\right]\,,$

where $u_{m-1}(f)=\sum_{i=1}^{m-1}\sigma_i(\phi\circ f)(x_i)$. By definition of supremum, for any $\epsilon > 0$, there exists $f_1,f_2\in \cF$ such that

\begin{align} u_{m-1}(f_1) + (\phi\circ f_1)(x_m) &\ge (1-\epsilon)\sup_f\left[ u_{m-1}(f) + (\phi\circ f)(x_m)\right]\label{eq:eps}\\ u_{m-1}(f_2) - (\phi\circ f_2)(x_m) &\ge (1-\epsilon)\sup_f\left[ u_{m-1}(f) - (\phi\circ f)(x_m)\right] \end{align}

(Question: how is the nonnegativity of the supremum guaranteed? The $(1-\epsilon)$ scaling only works when the supremum is nonnegative; alternatively, one can pick $f_1, f_2$ whose left-hand sides exceed the suprema minus $\epsilon$, which avoids the sign issue.)

Thus, for any $\epsilon > 0$,

\begin{align} &(1-\epsilon)\E_{\sigma_m}\left[\sup_f \left( u_{m-1}(f) + \sigma_m(\phi\circ f)(x_m)\right)\right] \\ &= (1-\epsilon)\left[\frac 12\sup_f \left(u_{m-1}(f) + (\phi\circ f)(x_m)\right) + \frac 12\sup_f \left(u_{m-1}(f) - (\phi\circ f)(x_m)\right)\right]\\ &\le \frac 12\left[u_{m-1}(f_1)+(\phi\circ f_1)(x_m) + u_{m-1}(f_2)-(\phi\circ f_2)(x_m)\right]\\ &\le \frac 12\left[u_{m-1}(f_1)+sLf_1(x_m) + u_{m-1}(f_2) - sLf_2(x_m)\right]\tag{Lipschitz Property}\\ &\le \frac 12\sup_f\left[u_{m-1}(f) + sLf(x_m)\right] + \frac 12\sup_f\left[u_{m-1}(f) - sLf(x_m)\right]\\ &=\E_{\sigma_m}\left[\sup_f \left(u_{m-1}(f)+\sigma_mLf(x_m)\right)\right]\,, \end{align}

where $s=\mathrm{sgn}(f_1(x_m)-f_2(x_m))$, so that the Lipschitz property gives $(\phi\circ f_1)(x_m)-(\phi\circ f_2)(x_m)\le sL(f_1(x_m)-f_2(x_m))$.

Since $\epsilon > 0$ is arbitrary, we have

$\E_{\sigma_m}\left[\sup_f[u_{m-1}(f)+\sigma_m(\phi\circ f)(x_m)]\right]\le \E_{\sigma_m}\left[\sup_f[u_{m-1}(f)+\sigma_m Lf(x_m)]\right]\,.$

Iterating this procedure for $i=1,\ldots,m-1$, we can conclude the proof.
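For a finite class and a concrete Lipschitz function, the contraction lemma can be verified exactly by enumerating all sign patterns. A minimal sketch with made-up values and $\phi=\tanh$ (which is $1$-Lipschitz); the class size and $m$ are arbitrary choices:

```python
import numpy as np

# Exact check of the contraction lemma: hat R_S(phi ∘ F) <= L * hat R_S(F).
rng = np.random.default_rng(3)

def emp_rad(values, sigma):
    """(1/m) E_sigma sup_f sum_i sigma_i f(x_i) for a finite class
    values[k, i] = f_k(x_i), exact over the given sign patterns."""
    m = values.shape[1]
    return (sigma @ values.T).max(axis=1).mean() / m

n_funcs, m = 5, 8
values = rng.normal(size=(n_funcs, m))  # values[k, i] = f_k(x_i)
L = 1.0                                 # Lipschitz constant of tanh

# All 2^m sign vectors, so the expectation over sigma is exact
sigma = np.array([[1.0 if (k >> i) & 1 else -1.0 for i in range(m)]
                  for k in range(2 ** m)])

lhs = emp_rad(np.tanh(values), sigma)  # hat R_S(phi ∘ F)
rhs = L * emp_rad(values, sigma)       # L * hat R_S(F)
```

Since the lemma holds for every fixed sample $S$, the inequality `lhs <= rhs` must hold exactly here, not just on average.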

He remarked that the Rademacher complexity will be our main tool for the topics of VC-dimension and covering numbers. Now let us see more properties.

• If $\alpha\in\IR$, then $\fR_m(\alpha\cF) = \vert \alpha\vert\fR_m(\cF)$
• $\fR_m(\cF+\cG)=\fR_m(\cF) +\fR_m(\cG)$
• $\fR_m(\{\max(f, g): f\in\cF, g\in\cG\})\le \fR_m(\cF) + \fR_m(\cG)$
1. By the symmetry of $\sigma$.
2. The second one can be proved via $\le$ and $\ge$: the $\le$ direction follows from the sub-additivity of the supremum, while the $\ge$ direction is proved with the $1-\epsilon$ technique as in \eqref{eq:eps}.
3. Decompose $\max(f,g)=\frac{f+g}{2} + \frac{\vert f-g\vert}{2}$, and observe that the absolute value is 1-Lipschitz.
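The first and third properties can also be checked exactly on finite classes by enumerating sign patterns. A minimal sketch with made-up classes (sizes and values arbitrary); for the max property, the class $\{\max(f,g)\}$ is built from all pairwise maxima:

```python
import numpy as np

# Exact checks of the scaling and max properties of empirical
# Rademacher complexity on made-up finite classes.
rng = np.random.default_rng(4)

def emp_rad(values, sigma):
    """(1/m) E_sigma sup_f sum_i sigma_i f(x_i), exact over the sign patterns."""
    m = values.shape[1]
    return (sigma @ values.T).max(axis=1).mean() / m

m = 6
F = rng.normal(size=(4, m))  # class F: F[k, i] = f_k(x_i)
G = rng.normal(size=(3, m))  # class G

sigma = np.array([[1.0 if (k >> i) & 1 else -1.0 for i in range(m)]
                  for k in range(2 ** m)])

# Property 1: R(alpha F) = |alpha| R(F), even for negative alpha
alpha = -2.5
scaling_lhs = emp_rad(alpha * F, sigma)
scaling_rhs = abs(alpha) * emp_rad(F, sigma)

# Property 3: R({max(f, g)}) <= R(F) + R(G)
maxFG = np.maximum(F[:, None, :], G[None, :, :]).reshape(-1, m)
max_lhs = emp_rad(maxFG, sigma)
max_rhs = emp_rad(F, sigma) + emp_rad(G, sigma)
```

The scaling check relies on the symmetry of $\sigma$: enumerating all sign vectors includes $-\sigma$ whenever it includes $\sigma$, so the equality holds exactly rather than approximately.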

Published in categories Course