WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Stein's Paradox

Tags: Paradox, Admissible, Minimax, James-Stein, Empirical Bayes

I learned about Stein’s Paradox from Larry Wasserman’s post, STEIN’S PARADOX. Perhaps I had encountered the term before, but I cannot recall anything about it. (I am guilty.)

An Awesome Introduction

As @nextgenseek commented under Larry’s post, Efron and Morris (1977) is an awesome introduction to Stein’s Paradox for anyone who is uninitiated in statistics.

The best guess about the future is usually obtained by computing the average of past events. Stein’s paradox defines circumstances in which there are estimators better than the arithmetic average.

Efron and Morris (1977)

Stein’s paradox concerns the use of observed averages to estimate unobservable quantities. Averaging is the second most basic process in statistics, the first being the simple act of counting.

Baseball example

Efron and Morris (1977) first demonstrate how to calculate the James-Stein estimator using baseball batting data, and show that the James-Stein estimator is better than the arithmetic average in terms of the sum of squared errors.

Then the article lists the questions raised by critics of Stein’s method.

  • There is nothing in the statement of the theorem that requires the component problems to have any sensible relation to one another. For example, one could estimate the percentage of imported cars by combining it with the baseball data.
  • Why should one player’s success or lack of it have any influence on the estimate of another player’s ability?

In the baseball batting averages example, the James-Stein estimator is given by

\[z = \bar y + c(y-\bar y)\,,\]

where $y$ is the average of a single set of data, $\bar y$ is the grand average of the averages, and $c$ is a “shrinking factor” given by

\[c = 1-\frac{(k-3)\sigma^2}{\sum (y-\bar y)^2}\,.\]

Here $k$ is the number of unknown means, $\sigma^2$ is the variance (the square of the standard deviation) of an individual average $y$, and $\sum(y-\bar y)^2$ is the sum of the squared deviations of the individual averages $y$ from the grand average $\bar y$.
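The two formulas above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a known common variance `sigma2` for every observed average; the function name `james_stein_grand_mean` and the simulated numbers are my own and are not the baseball data from Efron and Morris (1977).

```python
import numpy as np

def james_stein_grand_mean(y, sigma2):
    """Shrink each observed average in y toward the grand average.

    y      : 1-D array of k observed averages (k must exceed 3)
    sigma2 : variance of a single observed average (assumed known and common)
    Returns the shrunken estimates z and the shrinking factor c.
    """
    y = np.asarray(y, dtype=float)
    k = y.size
    if k <= 3:
        raise ValueError("need k > 3 so that the factor (k - 3) is positive")
    y_bar = y.mean()                      # grand average of the averages
    ss = np.sum((y - y_bar) ** 2)         # sum of squared deviations
    c = 1.0 - (k - 3) * sigma2 / ss       # shrinking factor (can be negative)
    z = y_bar + c * (y - y_bar)           # z = y_bar + c (y - y_bar)
    return z, c

# Synthetic illustration only: hypothetical true means and noise level.
rng = np.random.default_rng(0)
theta = rng.normal(0.27, 0.03, size=18)   # hypothetical true abilities
sigma2 = 0.07 ** 2
y = rng.normal(theta, np.sqrt(sigma2))    # observed averages
z, c = james_stein_grand_mean(y, sigma2)
print("shrinking factor c:", c)
print("squared error of y:", np.sum((y - theta) ** 2))
print("squared error of z:", np.sum((z - theta) ** 2))
```

Note that $c$ is not truncated at zero here, so observed averages that lie very close together can produce a negative factor; the sketch keeps the formula exactly as written above.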

In effect the James-Stein procedure makes a preliminary guess that all the unobservable means are near the grand average $\bar y$:

  • If the data support that guess in the sense that the observed averages are themselves not too far from $\bar y$, then the estimates are all shrunk further toward the grand average.
  • If the guess is contradicted, then not much shrinking is done.

There are several other expressions for the James-Stein estimator, but they differ mainly in detail. All of them have in common the shrinking factor $c$, which is the definitive characteristic of the James-Stein estimator.

It is possible to use some other, more or less arbitrary, initial guess or point of origin for the estimator instead of the grand average $\bar y$.

Estimating the true mean for an isolated city by Stein’s method creates serious errors when that mean has an atypical value. The rationale of the method is to reduce the overall risk by assuming that the true means are more similar to one another than the observed data suggest. That assumption can degrade the estimation of a genuinely atypical mean.

With some valuable extra information, such as knowing that the true batting abilities of all major-league players are approximately normally distributed with a particular mean $m$ and standard deviation (what statisticians call a “prior distribution”), it is possible to construct a superior estimator

\[Z = m + C(y-m)\,.\]

The formula for the James-Stein estimator is strikingly similar to that of Bayes’ equation. Indeed, as the number of means being averaged grows very large, the two equations become identical. The two shrinking factors $c$ and $C$ converge on the same value, and the grand average $\bar y$ becomes equal to the mean $m$ precisely when all players are included in the calculation.
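The convergence described above can be checked numerically. The sketch below assumes the standard normal-normal model, in which the true means are drawn from $N(m,\tau^2)$ and each observed average from $N(\theta,\sigma^2)$, so that the Bayes shrinking factor is $C=\tau^2/(\tau^2+\sigma^2)$. The symbol $\tau^2$ (`tau2`) and the function names are my own notation, not taken from Efron and Morris (1977).

```python
import numpy as np

def bayes_factor(tau2, sigma2):
    """Bayes shrinking factor C under a N(m, tau2) prior and N(theta, sigma2) data."""
    return tau2 / (tau2 + sigma2)

def james_stein_factor(y, sigma2):
    """James-Stein shrinking factor c = 1 - (k - 3) sigma2 / sum (y - y_bar)^2."""
    y = np.asarray(y, dtype=float)
    k = y.size
    return 1.0 - (k - 3) * sigma2 / np.sum((y - y.mean()) ** 2)

# Hypothetical prior mean, prior variance, and sampling variance.
m, tau2, sigma2 = 0.0, 1.0, 1.0
rng = np.random.default_rng(1)
for k in (10, 100, 10_000, 1_000_000):
    theta = rng.normal(m, np.sqrt(tau2), size=k)   # true means drawn from the prior
    y = rng.normal(theta, np.sqrt(sigma2))         # observed averages
    print(k, james_stein_factor(y, sigma2), y.mean(), bayes_factor(tau2, sigma2))
```

As $k$ grows, the printed $c$ should settle near $C$ and the grand average near $m$, which is the sense in which James-Stein estimates the prior from the data instead of assuming it.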

However, the James-Stein procedure has one important advantage over Bayes’ method since it can be employed without knowledge of the prior distribution. On the other hand, ignorance has a price, which must be paid in reduced accuracy of estimation.

In historical context, the James-Stein estimator can be regarded as an “empirical Bayes rule”, a term coined by Herbert E. Robbins. Robbins’s theory was immediately recognized as a fundamental breakthrough, while the closely related result of Stein has been much slower in gaining acceptance.

The James-Stein estimator is not the only one that is known to be better than the sample averages. Indeed, the James-Stein estimator is itself inadmissible!

More Statistical Way

Larry’s post provided a more statistical way to introduce Stein’s Paradox.

Recall that the risk of an estimator $\hat\theta$ of $\theta$ is

\[R_{\hat\theta}(\theta) = \bbE(\hat\theta-\theta)^2 = \int (\hat \theta(x)-\theta)^2p(x;\theta)dx\,.\]

An estimator $\hat\theta$ is inadmissible if there exists another estimator $\theta^*$ which dominates it (i.e., $R_{\theta^*}(\theta)\le R_{\hat\theta}(\theta)$ for all $\theta$, with strict inequality for at least one $\theta$), and admissible if no such estimator $\theta^*$ exists.

Consider $X\sim N(\theta,I_k)$. The paradox says that

  • if $k=1,2$, $\hat\theta \equiv X$ is admissible.
  • if $k\ge 3$, $\hat\theta \equiv X$ is inadmissible.

The proof that $X$ is inadmissible is based on defining an explicit estimator $\theta^*$ that has smaller risk than $X$. For example, the James-Stein estimator is

\[\theta^* = \Big(1-\frac{k-2}{\Vert X\Vert^2}\Big)X\,.\]

Larry illustrated the intuition in two ways: through a Bayesian explanation, similar to Efron and Morris (1977), and through function estimation, where the process of smoothing can be seen as “shrinking estimates towards 0”, as with the James-Stein estimator.

Larry also mentioned that $\hat\theta=X$ is minimax, that is, its maximum risk achieves the minimax bound

\[\inf_{\hat\theta}\sup_{\theta} R_{\hat\theta}(\theta)\,.\]

In fact, for $\hat\theta=X$, $R_{\hat\theta}(\theta)=k$ for all $\theta$, where $k$ is the dimension of $\theta$. For $k\ge 3$, the risk $R_{\theta^*}(\theta)$ of the James-Stein estimator is less than the risk of $X$, but $R_{\theta^*}(\theta)\rightarrow k$ as $\Vert \theta\Vert\rightarrow \infty$. So the two estimators have the same maximum risk.
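A small Monte Carlo experiment makes these risk statements concrete. This is only a rough sketch: the helper names `james_stein` and `mc_risk`, the choice $k=10$, and the grid of $\theta$ values are my own, and the James-Stein estimator is taken in its standard form with the squared norm $\Vert X\Vert^2$ in the denominator.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimator (1 - (k - 2) / ||x||^2) x, applied to each row of x."""
    k = x.shape[-1]
    norm2 = np.sum(x ** 2, axis=-1, keepdims=True)
    return (1.0 - (k - 2) / norm2) * x

def mc_risk(estimator, theta, n_rep=20_000, seed=0):
    """Monte Carlo estimate of E||estimator(X) - theta||^2 for X ~ N(theta, I_k)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=(n_rep, theta.size))
    return np.mean(np.sum((estimator(x) - theta) ** 2, axis=1))

k = 10
for scale in (0.0, 1.0, 5.0, 50.0):
    theta = np.full(k, scale)              # theta moves away from the origin
    risk_x = mc_risk(lambda x: x, theta)   # risk of the usual estimator X
    risk_js = mc_risk(james_stein, theta)  # risk of the James-Stein estimator
    print(scale, risk_x, risk_js)
```

The simulated risk of $X$ should stay near $k$ for every $\theta$, while the James-Stein risk should be clearly smaller near the origin and creep back up toward $k$ as $\Vert\theta\Vert$ grows.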


@Joe Blitzstein pointed out several references, two by Efron and Morris, and one by Stigler.
