WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Minimax Lower Bounds

Posted on (Update: )
Tags: Minimax

This note is based on Chapter 15 of Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge: Cambridge University Press.

Suppose that $\{\theta^1,\ldots,\theta^M\}$ is a $2\delta$-separated set contained in the space $\theta(\cP)$. Generate a r.v. $Z$ by the following procedure:

  1. Sample $J$ uniformly from the index set $[M]=\{1,\ldots,M\}$.
  2. Given $J=j$, sample $Z\sim \bbP_{\theta^j}$.

Let $\bbQ$ denote the joint distribution of the pair $(Z,J)$ generated by this procedure.

A testing function for this problem is a mapping $\psi:\cZ \rightarrow [M]$, and the associated probability of error is given by $\bbQ[\psi(Z)\neq J]$.

(From estimation to testing) For any increasing function $\Phi$ and choice of $2\delta$-separated set, the minimax risk is lower bounded as $$ \fM(\theta(\cP), \Phi\circ \rho)\ge \Phi(\delta)\inf_\psi\bbQ[\psi(Z)\neq J]\,, $$ where the infimum ranges over test functions.
For any $\bbP\in\cP$ with parameter $\theta = \theta(\bbP)$, Markov's inequality and the monotonicity of $\Phi$ give $$ \bbE[\Phi(\rho(\hat\theta,\theta))] \ge \Phi(\delta) \bbP[\Phi(\rho(\hat\theta,\theta))\ge \Phi(\delta)]\ge \Phi(\delta)\bbP[\rho(\hat\theta,\theta)\ge \delta]\,. $$ Note that $$ \sup_{\bbP\in\cP}\bbP[\rho(\hat\theta,\theta(\bbP)) \ge \delta] \ge \frac 1M \sum_{j=1}^M\bbP_{\theta^j}[\rho(\hat\theta, \theta^j)\ge \delta] = \bbQ[\rho(\hat\theta, \theta^J)\ge \delta]\,. $$ Any estimator $\hat\theta$ can be used to define a test, $$ \psi(Z) = \arg\min_{\ell\in[M]} \rho(\theta^\ell, \hat\theta)\,. $$ Suppose that the true parameter is $\theta^j$: we claim that the event $\{\rho(\theta^j,\hat\theta)<\delta\}$ ensures that the test is correct. Indeed, for any $\ell\neq j$, the triangle inequality and the $2\delta$-separation yield $\rho(\theta^\ell,\hat\theta)\ge \rho(\theta^\ell,\theta^j)-\rho(\theta^j,\hat\theta) > 2\delta - \delta = \delta$, so $\theta^j$ is the unique closest point to $\hat\theta$. Thus, conditioned on $J=j$, $$ \{\rho(\hat\theta,\theta^j) < \delta\}\subset \{\psi(Z)=j\}\,, $$ which implies that $$ \bbP_{\theta^j}[\rho(\hat\theta,\theta^j)\ge \delta] \ge \bbP_{\theta^j}[\psi(Z)\neq j]\,. $$ Taking averages over the index $j$, we find that $$ \bbQ[\rho(\hat\theta,\theta^J)\ge \delta] = \frac 1M\sum_{j=1}^M\bbP_{\theta^j}[\rho(\hat\theta,\theta^j)\ge \delta]\ge \bbQ[\psi(Z)\neq J]\,. $$ Thus, $$ \sup_{\bbP\in\cP}\bbE_\bbP[\Phi(\rho(\hat\theta,\theta))]\ge \Phi(\delta)\bbQ[\psi(Z)\neq J]\,. $$ Finally, take the infimum over all estimators $\hat\theta$ on the left-hand side, and the infimum over the induced set of tests on the right-hand side.

Some divergence measures

  • Total Variation distance: $\Vert \bbP-\bbQ\Vert_{\TV}:=\sup_{A\subseteq \calX}\vert \bbP(A)-\bbQ(A)\vert\,.$
  • Kullback-Leibler divergence: $D(\bbQ\Vert \bbP)=\int_\calX q(x)\log\frac{q(x)}{p(x)}\nu(dx)\,.$
  • squared Hellinger distance: $H^2(\bbP\Vert \bbQ)=\int(\sqrt{p(x)}-\sqrt{q(x)})^2\nu(dx)\,.$

\begin{align} \Vert \bbP-\bbQ\Vert_{\TV} &\le\sqrt{\frac 12D(\bbQ\Vert \bbP)}\\ \Vert \bbP-\bbQ\Vert_{\TV} &\le H(\bbP\Vert \bbQ)\sqrt{1-\frac{H^2(\bbP\Vert \bbQ)}{4}}\label{LeCamIneq} \end{align}
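Both inequalities are easy to check numerically. The sketch below (pure Python; the helper names are my own) verifies Pinsker's inequality and the second (Le Cam/Hellinger) bound over a grid of Bernoulli pairs, for which all three quantities have closed forms:

```python
import math

def tv_bern(p, q):
    """TV distance between Bernoulli(p) and Bernoulli(q): |p - q|."""
    return abs(p - q)

def kl_bern(q, p):
    """KL divergence D(Q || P) between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def hellinger_sq_bern(p, q):
    """Squared Hellinger distance H^2(P || Q)."""
    return ((math.sqrt(p) - math.sqrt(q)) ** 2
            + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2)

# Check Pinsker's inequality and the Le Cam (Hellinger) bound on a grid.
for p in [0.1, 0.3, 0.5, 0.7]:
    for q in [0.2, 0.4, 0.6, 0.9]:
        tv = tv_bern(p, q)
        assert tv <= math.sqrt(0.5 * kl_bern(q, p)) + 1e-9
        h2 = hellinger_sq_bern(p, q)
        assert tv <= math.sqrt(h2) * math.sqrt(1 - h2 / 4) + 1e-9
```

The pair $(p,q)=(0.1,0.9)$ nearly saturates the Hellinger bound, which is why a small numerical slack is included in the assertions.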

Let $\bbP^{1:n}=\otimes_{i=1}^n\bbP_i$ be the product measure defined on $\calX^n$, similarly $\bbQ^{1:n}$.

  • in general, it is difficult to express $\Vert \bbP^{1:n}-\bbQ^{1:n}\Vert_{\TV}$ in terms of the individual distances
  • $D(\bbP^{1:n}\Vert\bbQ^{1:n})=\sum D(\bbP_i\Vert\bbQ_i)$
  • $H^2(\bbP^{1:n}\Vert\bbQ^{1:n})/2=1-\prod (1-H^2(\bbP_i\Vert \bbQ_i)/2)$.

In the iid case

  • $D(\bbP^{1:n}\Vert \bbQ^{1:n})=nD(\bbP_1\Vert \bbQ_1)$
  • $\frac 12H^2(\bbP^{1:n}\Vert \bbQ^{1:n})=1-(1-\frac 12H^2(\bbP_1\Vert\bbQ_1))^n\le \frac n2H^2(\bbP_1\Vert \bbQ_1)$, where the last inequality follows since $f(x)=(1-x)^n-(1-nx)$ satisfies $f(x)\ge 0$ for $x\in [0,1]$.
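The iid identities can be sanity-checked in closed form: the Hellinger affinity $\int\sqrt{pq}\,d\nu$ tensorizes over product measures, so for Bernoulli marginals both sides are computable directly (a small illustrative sketch, not from the book):

```python
import math

def affinity_bern(p, q):
    """Hellinger affinity (integral of sqrt(p*q)) for two Bernoullis."""
    return math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))

p, q, n = 0.3, 0.5, 10
h2_one = 2 - 2 * affinity_bern(p, q)        # H^2 for a single sample
h2_n = 2 - 2 * affinity_bern(p, q) ** n     # H^2 for the n-fold product

# Tensorization identity: H^2(P^{1:n} || Q^{1:n})/2 = 1 - (1 - H^2/2)^n
assert abs(h2_n / 2 - (1 - (1 - h2_one / 2) ** n)) < 1e-12
# Linearization bound: 1 - (1-x)^n <= n*x on [0, 1]
assert h2_n / 2 <= n / 2 * h2_one
```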

Binary testing

The simplest type of testing problem, known as a binary hypothesis test, involves only two distributions.

In a binary testing problem with equally weighted hypotheses, we observe a random variable $Z$ drawn according to the mixture distribution $\bbQ_Z=\frac 12\bbP_0+\frac 12\bbP_1$. For a given decision rule $\psi:\cZ\rightarrow \{0,1\}$, the associated probability of error is given by $$ \bbQ[\psi(Z)\neq J] = \frac 12\bbP_0[\psi(Z)=1]+\frac 12\bbP_1[\psi(Z)=0]\,. $$

If we take the infimum of this error probability over all decision rules, we obtain a quantity known as the Bayes risk for the problem.

In the binary case, we have $$ \inf_\psi \bbQ[\psi(Z)\neq J]=\frac 12\{1-\Vert \bbP_1-\bbP_0\Vert_{\TV}\}\,. $$
Note that there is a one-to-one correspondence between decision rules $\psi$ and measurable partitions $(A,A^c)$ of the space $\cZ$. In particular, any rule $\psi$ is uniquely determined by the set $A=\{z\in \cZ\mid \psi(z)=1\}$. Thus, we have $$ \sup_\psi\bbQ[\psi(Z)=J] = \sup_{A\subseteq \cZ}\left\{\frac 12\bbP_1(A)+\frac 12\bbP_0(A^c)\right\}=\frac 12\sup_{A\subseteq \cZ}\{\bbP_1(A)-\bbP_0(A)\} + \frac 12\,. $$ Then the result follows from $$ \sup_\psi \bbQ[\psi(Z)=J]=1-\inf_\psi\bbQ[\psi(Z)\neq J]\,. $$
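For distributions on a finite alphabet, the infimum is attained by the maximum-likelihood rule, which picks whichever of $\bbP_0,\bbP_1$ assigns the larger mass pointwise; its error is $\frac 12\sum_x \min\{p_0(x),p_1(x)\}$. A short sketch (the example distributions are my own) confirms this equals $\frac 12\{1-\Vert\bbP_1-\bbP_0\Vert_{\TV}\}$:

```python
# Bayes risk of binary testing between two discrete distributions P0, P1.
# The optimal rule takes the larger likelihood at each point, so the error
# is (1/2) * sum_x min(p0(x), p1(x)) = (1 - ||P1 - P0||_TV) / 2.
p0 = [0.5, 0.3, 0.1, 0.1]
p1 = [0.1, 0.2, 0.3, 0.4]

bayes_risk = 0.5 * sum(min(a, b) for a, b in zip(p0, p1))
tv = 0.5 * sum(abs(a - b) for a, b in zip(p0, p1))

assert abs(bayes_risk - 0.5 * (1 - tv)) < 1e-12
```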

Thus, for any pair of distributions $\bbP_0,\bbP_1\in\cP$ such that $\rho(\theta(\bbP_0),\theta(\bbP_1))\ge 2\delta$, we have

\begin{equation} \fM(\theta(\cP),\Phi\circ\rho) \ge \frac{\Phi(\delta)}{2}\left[1-\Vert \bbP_1-\bbP_0\Vert_{\TV}\right]\,.\label{2point-Le-Cam} \end{equation}

For a fixed variance $\sigma^2$, let $\bbP_\theta$ be the distribution of a $\cN(\theta,\sigma^2)$ variable; letting the mean $\theta$ vary over the real line defines the Gaussian location family $\{\bbP_\theta,\theta\in\bbR\}$. Here we consider the problem of estimating $\theta$ under either the absolute error $\vert \hat\theta-\theta\vert$ or the squared error $(\hat\theta-\theta)^2$ using a collection $Z=(Y_1,\ldots,Y_n)$ of $n$ iid samples drawn from a $\cN(\theta,\sigma^2)$ distribution. Apply the two-point Le Cam bound \eqref{2point-Le-Cam} with the distributions $\bbP_0^n$ and $\bbP_\theta^n$, where $\theta=2\delta$. Note that $$ \Vert \bbP_\theta^n -\bbP_0^n\Vert_{\TV}^2 \le \frac 14\{e^{n\theta^2/\sigma^2}-1\} = \frac 14\{e^{4n\delta^2/\sigma^2}-1\}\,. $$ Setting $\delta = \frac 12\frac{\sigma}{\sqrt n}$ thus yields \begin{align} \inf_{\hat\theta}\sup_{\theta\in\bbR} \bbE_\theta[\vert\hat\theta -\theta\vert]\ge\frac{\delta}{2}\{1-\frac 12\sqrt{e-1}\}\ge \frac{\delta}{6} = \frac{1}{12}\frac{\sigma}{\sqrt n}\\ \inf_{\hat\theta}\sup_{\theta\in\bbR} \bbE_\theta[(\hat\theta-\theta)^2]\ge \frac{\delta^2}{2}\{1-\frac 12\sqrt{e-1}\}\ge \frac{\delta^2}{6}=\frac{1}{24}\frac{\sigma^2}{n}\,. \end{align} For instance, the sample mean $\tilde \theta_n$ satisfies the bounds $$ \sup_{\theta\in\bbR}\bbE_\theta[\vert \tilde \theta_n-\theta\vert] = \sqrt{\frac{2}{\pi}}\frac{\sigma}{\sqrt n}\text{ and }\sup_{\theta\in\bbR}\bbE_\theta[(\tilde \theta_n-\theta)^2] = \frac{\sigma^2}{n}\,. $$
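A quick Monte Carlo check (the particular values of $\sigma$, $n$, the trial count, and the seed are arbitrary choices of mine) confirms that the empirical squared-error risk of the sample mean matches $\sigma^2/n$ and dominates the Le Cam lower bound $\sigma^2/(24n)$:

```python
import math, random

random.seed(0)
sigma, n, trials, theta = 2.0, 25, 20000, 1.0

# Monte Carlo estimate of the risk of the sample mean under N(theta, sigma^2).
mse = 0.0
for _ in range(trials):
    ybar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    mse += (ybar - theta) ** 2
mse /= trials

lower = sigma**2 / (24 * n)   # Le Cam two-point lower bound
exact = sigma**2 / n          # exact risk of the sample mean

assert lower <= mse                    # minimax lower bound is respected
assert 0.9 * exact <= mse <= 1.1 * exact   # MC estimate close to sigma^2/n
```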

Mean-squared error decaying as $n^{-1}$ is typical for parametric problems with a certain type of regularity. For other non-regular problems, faster rates become possible.

(Uniform location family) For each $\theta\in\bbR$, let $\bbU_\theta$ be the uniform distribution over the interval $[\theta,\theta+1]$. Given a pair $\theta,\theta'\in\bbR$, compute the Hellinger distance between $\bbU_\theta$ and $\bbU_{\theta'}$. By symmetry, it suffices to consider $\theta'>\theta$. If $\theta' > \theta + 1$, the supports are disjoint and $H^2(\bbU_\theta\Vert\bbU_{\theta'})=2$. If $\theta'\in (\theta,\theta+1]$, $H^2(\bbU_\theta\Vert\bbU_{\theta'})=2\vert \theta'-\theta\vert$. Take $\vert\theta'-\theta\vert=2\delta:=\frac{1}{4n}$; this choice matches the $n^{-1}$ decay of the per-sample Hellinger distance, so that the $n$-fold distance (and hence the TV distance) stays bounded away from one: $$ \frac 12H^2(\bbU_\theta^n\Vert \bbU_{\theta'}^n)\le \frac n2 2\vert\theta'-\theta\vert = \frac 14\,. $$ By \eqref{LeCamIneq}, we find that $$ \Vert \bbU_\theta^n-\bbU_{\theta'}^n\Vert_{\TV}^2 \le H^2(\bbU_{\theta}^n\Vert\bbU_{\theta'}^n)\le \frac 12\,. $$ From \eqref{2point-Le-Cam} with $\Phi(t)=t^2$, we have $$ \inf_{\hat\theta}\sup_{\theta\in\bbR}\bbE_\theta[(\hat\theta -\theta)^2]\ge \frac{(1-1/\sqrt 2)}{128}n^{-2}\,. $$
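Both cases of the Hellinger computation can be verified by direct numerical integration (a midpoint-rule sketch; the grid size is arbitrary):

```python
import math

def hellinger_sq_uniform(theta, theta_p, grid=200000):
    """Numerically integrate (sqrt(p) - sqrt(q))^2 for the densities of
    U[theta, theta+1] and U[theta_p, theta_p+1] over a covering interval."""
    lo, hi = min(theta, theta_p), max(theta, theta_p) + 1
    dx = (hi - lo) / grid
    total = 0.0
    for i in range(grid):
        x = lo + (i + 0.5) * dx
        p = 1.0 if theta <= x <= theta + 1 else 0.0
        q = 1.0 if theta_p <= x <= theta_p + 1 else 0.0
        total += (math.sqrt(p) - math.sqrt(q)) ** 2 * dx
    return total

# Overlapping case: H^2 = 2 * |theta' - theta|
assert abs(hellinger_sq_uniform(0.0, 0.3) - 0.6) < 1e-3
# Disjoint supports: H^2 = 2
assert abs(hellinger_sq_uniform(0.0, 1.5) - 2.0) < 1e-3
```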

Le Cam’s method for nonparametric problems

Consider estimating a density at a point, say $x=0$; in this case $\theta(f):=f(0)$ is known as an evaluation functional.

The modulus of continuity (a local Lipschitz constant) of the functional $\theta$ with respect to the Hellinger distance is given by $$ \omega(\epsilon;\theta,\cF):=\sup\left\{\vert \theta(f)-\theta(g)\vert : f,g\in\cF,\; H(\bbP_f\Vert \bbP_g)\le \epsilon\right\}\,. $$

(Le Cam for functional) For any increasing function $\Phi$ on the non-negative real line and any functional $\theta:\cF\rightarrow \bbR$, we have $$ \inf_{\hat\theta}\sup_{f\in\cF}\bbE\left[ \Phi(\hat\theta-\theta(f)) \right]\ge \frac 14\Phi\left(\frac 12\omega(\frac{1}{2\sqrt n};\theta,\cF)\right)\,. $$
Let $\epsilon^2=1/4n$, choose a pair $f,g$ that achieve the supremum defining $\omega(1/(2\sqrt n))$. Note that $$ \Vert \bbP_f^n-\bbP^n_g\Vert_{\TV}^2 \le H^2(\bbP_f^n\Vert \bbP_g^n)\le nH^2(\bbP_f\Vert\bbP_g)\le \frac 14\,. $$ Thus, Le Cam's bound \eqref{2point-Le-Cam} with $\delta = \frac 12\omega(\frac{1}{2\sqrt n})$ implies that $$ \inf_{\hat \theta}\sup_{f\in\cF} \bbE\left[ \Phi(\hat\theta - \theta(f)) \right]\ge \frac 14\Phi\left(\frac 12\omega(\frac{1}{2\sqrt n})\right)\,. $$
Consider the class $\cF$ of densities on $[-\frac 12,\frac 12]$ that are bounded uniformly away from zero and are Lipschitz with constant one. Our goal is to estimate the linear functional $\theta(f)=f(0)$. It suffices to lower bound $\omega(\frac{1}{2\sqrt n};\theta,\cF)$ by choosing a pair $f_0,g\in\cF$ with $H^2(f_0\Vert g)\le\frac{1}{4n}$ and then evaluating the difference $\vert \theta(f_0)-\theta(g)\vert$. Let $f_0\equiv 1$. For a parameter $\delta\in(0,\frac 16]$, consider the function $$ \phi(x) = \begin{cases} \delta -\vert x\vert & \text{for }\vert x\vert \le \delta\\ \vert x-2\delta\vert -\delta & \text{for }x\in[\delta, 3\delta]\\ 0 & \text{otherwise.} \end{cases} $$ By construction, the function $\phi$ is $1$-Lipschitz, uniformly bounded with $\Vert \phi\Vert_\infty = \delta\le \frac 16$, and integrates to zero. Thus, the perturbed function $g=f_0+\phi$ is a density belonging to our class, and by construction $\vert \theta(f_0)-\theta(g)\vert=\delta$. To control the squared Hellinger distance, write $$ \frac 12H^2(f_0\Vert g) = 1-\int_{-1/2}^{1/2}\sqrt{1+\phi(t)}dt\,. $$ Define the function $\Psi(u)=\sqrt{1+u}$, and note that $\Psi''(u)=-\frac 14(1+u)^{-3/2}$, so that $\vert \Psi''(u)\vert \le \frac 14(1-\frac 16)^{-3/2}\le \frac 13$ for all $u\ge -\frac 16$. By a Taylor series expansion, and by [Taylor's inequality](http://mathworld.wolfram.com/TaylorsInequality.html), $$ \frac 12H^2(f_0\Vert g)=\int_{-1/2}^{1/2}[\Psi(0)-\Psi(\phi(t))]dt \le \int_{-1/2}^{1/2}\left\{-\Psi'(0)\phi(t)+\frac 16\phi^2(t)\right\}dt\,. $$ Observe that $$ \int_{-1/2}^{1/2}\phi(t)dt = 0\text{ and }\int_{-1/2}^{1/2}\phi^2(t)dt = 4\int_0^\delta(\delta -x)^2dx = \frac 43\delta^3\,. $$ Thus, $$ H^2(f_0\Vert g) \le \frac 26\cdot \frac 43\delta^3 = \frac 49\delta^3\,. $$ Consequently, setting $\delta^3 = \frac{9}{16n}$ ensures that $H^2(f_0\Vert g)\le \frac{1}{4n}$, so that $\omega(\frac{1}{2\sqrt n})\ge \delta\succsim n^{-1/3}$. Hence, $$ \inf_{\hat\theta}\sup_{f\in\cF}\bbE\left[ (\hat\theta-f(0))^2\right]\ge \frac{1}{16}\omega^2(\frac{1}{2\sqrt n})\succsim n^{-2/3}\,. $$
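As a numerical sanity check on the perturbation argument, one can integrate $(\sqrt{f_0}-\sqrt{g})^2$ directly and confirm that $H^2(f_0\Vert g)$ is of order $\delta^3$, close to $\delta^3/3$ (a midpoint-rule sketch; the exact constant depends on the Taylor remainder, so only the scaling is asserted):

```python
import math

def phi(x, delta):
    """The 1-Lipschitz, mean-zero hat perturbation from the text."""
    if abs(x) <= delta:
        return delta - abs(x)
    if delta <= x <= 3 * delta:
        return abs(x - 2 * delta) - delta
    return 0.0

def hellinger_sq(delta, grid=100000):
    """H^2(f_0 || g) with f_0 = 1 and g = 1 + phi on [-1/2, 1/2]."""
    dx = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = -0.5 + (i + 0.5) * dx
        total += (1.0 - math.sqrt(1.0 + phi(x, delta))) ** 2 * dx
    return total

delta = 1.0 / 6
h2 = hellinger_sq(delta)
# The Taylor argument shows H^2 is of order delta^3; check the scaling.
assert 0.25 * delta**3 < h2 < 0.45 * delta**3
```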

Le Cam’s convex hull method

Consider two subsets $\cP_0$ and $\cP_1$ of $\cP$ that are $2\delta$-separated, in the sense that $$ \rho(\theta(\bbP_0),\theta(\bbP_1))\ge 2\delta\qquad\text{for all }\bbP_0\in\cP_0\text{ and }\bbP_1\in\cP_1\,. $$

For any $2\delta$-separated classes of distributions $\cP_0$ and $\cP_1$ contained within $\cP$, any estimator $\hat\theta$ has worst-case risk at least $$ \sup_{\bbP\in\cP}\bbE_\bbP[\rho(\hat\theta, \theta(\bbP))] \ge \frac{\delta}{2} \sup_{\substack{\bbP_0\in\conv(\cP_0)\\ \bbP_1\in\conv(\cP_1)}}\left\{1-\Vert \bbP_0-\bbP_1\Vert_{\TV}\right\}\,. $$
(Sharpened bounds for Gaussian location family) A key step was to upper bound the TV distance $\Vert \bbP_\theta^n-\bbP_0^n\Vert_{\TV}$ between the $n$-fold product distributions based on the Gaussian models $\cN(\theta,\sigma^2)$ and $\cN(0,\sigma^2)$. Setting $\theta = 2\delta$, consider the two families $\cP_0=\{\bbP_0^n\}$ and $\cP_1=\{\bbP_\theta^n,\bbP_{-\theta}^n\}$. Note that the mixture distribution $\bar\bbP:=\frac 12\bbP_\theta^n+\frac 12\bbP_{-\theta}^n$ belongs to $\conv(\cP_1)$, and that $$ \Vert \bar\bbP-\bbP_0^n\Vert^2_{\TV}\le \frac 14 \left\{e^{\frac 12(\frac{\sqrt n\theta}{\sigma})^4}-1\right\}=\frac 14\left\{e^{\frac 12(\frac{2\sqrt n\delta}{\sigma})^4}-1\right\}\,. $$ Setting $\delta = \frac{\sigma t}{2\sqrt n}$ for some parameter $t > 0$ to be chosen, the convex hull Le Cam bound yields $$ \min_{\hat\theta}\sup_{\theta\in\bbR}\bbE_\theta[\vert \hat\theta-\theta\vert] \ge \frac{\sigma}{4\sqrt n}\sup_{t > 0}\{t(1-\frac 12\sqrt{e^{t^4/2}-1})\}\ge \frac{3}{20}\frac{\sigma}{\sqrt n}\,. $$
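The final supremum over $t$ can be checked by a simple grid search (the grid itself is an arbitrary choice of mine): the last inequality requires $\sup_{t>0}\,t\{1-\frac 12\sqrt{e^{t^4/2}-1}\}\ge \frac 35$, since dividing by $4$ then gives the constant $\frac{3}{20}$.

```python
import math

# Grid search over t > 0 for t * (1 - 0.5 * sqrt(exp(t^4 / 2) - 1)),
# the quantity appearing in the sharpened Gaussian location bound.
best = max(
    t * (1 - 0.5 * math.sqrt(math.exp(t**4 / 2) - 1))
    for t in [i / 1000 for i in range(1, 2000)]
)
assert best >= 3 / 5   # suffices for the stated constant 3/20
```

Numerically the maximum is near $t\approx 0.9$, slightly above $0.6$.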

Fano’s method

The mutual information between the pair $(Z,J)$ is defined with the KL divergence as the underlying measure of distance, $$ I(Z;J) := D(\bbQ_{Z,J}\Vert \bbQ_Z\bbQ_J)\,, $$ the divergence between the joint distribution $\bbQ_{Z,J}$ and the product of its marginals.

Our set-up: lower bounding the probability of error in an $M$-ary hypothesis testing problem, based on a family of distributions $\{\bbP_{\theta^1},\ldots,\bbP_{\theta^M}\}$.

The mutual information can be written in terms of the component distributions $\{\bbP_{\theta^j},j\in[M]\}$ and the mixture distribution $\bar \bbQ:=\frac 1M\sum_{j=1}^M\bbP_{\theta^j}$ as $$ I(Z;J) = \frac 1M\sum_{j=1}^M D(\bbP_{\theta^j}\Vert \bar\bbQ)\,, $$ i.e., the mean KL divergence between each component and the mixture.

The Fano method is based on the following lower bound on the error probability in an $M$-ary testing problem, applicable when $J$ is uniformly distributed over the index set: $$ \bbQ[\psi(Z)\neq J]\ge 1-\frac{I(Z;J)+\log 2}{\log M}\,. $$

Let $\{\theta^1,\ldots,\theta^M\}$ be a $2\delta$-separated set in the $\rho$ semi-metric on $\Theta(\cP)$, and suppose that $J$ is uniformly distributed over the index set $\{1,\ldots,M\}$, and $(Z\mid J=j)\sim \bbP_{\theta^j}$. Then for any increasing function $\Phi: [0,\infty) \rightarrow [0,\infty)$, the minimax risk is lower bounded as $$ \fM(\theta(\cP);\Phi\circ \rho)\ge \Phi(\delta)\left\{1-\frac{I(Z;J)+\log 2}{\log M}\right\}\,, $$ where $I(Z;J)$ is the mutual information between $Z$ and $J$.
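Fano's bound can be checked against the exact Bayes error on a small discrete family (a toy example of my own, with $M=4$ components on a four-letter alphabet):

```python
import math

# M-ary testing on a finite alphabet: verify Fano's inequality
#   inf_psi Q[psi(Z) != J] >= 1 - (I(Z;J) + log 2) / log M
# against the exact Bayes error of the maximum-likelihood test.
P = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
M = len(P)
mix = [sum(P[j][x] for j in range(M)) / M for x in range(len(P[0]))]

# Mutual information I(Z;J) = (1/M) * sum_j D(P_j || mixture)
info = sum(
    P[j][x] * math.log(P[j][x] / mix[x])
    for j in range(M) for x in range(len(mix))
) / M

# Exact Bayes error of the maximum-likelihood test under a uniform prior on J
bayes_error = 1 - sum(max(P[j][x] for j in range(M)) for x in range(len(mix))) / M

fano_lower = 1 - (info + math.log(2)) / math.log(M)
assert bayes_error >= fano_lower
```

Here the Bayes error is $0.3$, while the Fano lower bound evaluates to roughly $0.18$, illustrating that the bound is valid though not tight.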

Published in categories Note