Higher Criticism for Sparse Heterogeneous Mixtures
This note is for Donoho, D., & Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3), 962–994.
Tukey introduced the notion of higher criticism with a story: out of 250 independent tests, 11 were significant at the 5% level, but one would expect 12.5 significant tests even in the purely null case
he proposed a sort of second-level significance testing, based on the statistic
\[\text{HC}_{0.05, n} = \sqrt{n}\,[(\text{Fraction Significant at } 0.05) - 0.05] / \sqrt{0.05\times 0.95}\]and suggested that values of (say) 2 or greater indicate a kind of significance of the overall body of tests
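plugging Tukey's numbers into this statistic, 250 tests with 11 significant at the 5% level give
\[\text{HC}_{0.05, 250} = \sqrt{250}\,\frac{11/250 - 0.05}{\sqrt{0.05\times 0.95}} \approx -0.44\]far below 2, so the collection of tests is, if anything, slightly less significant than chance alone would predict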
we might consider not only significance at the 0.05 level, but perhaps at all levels between (say) 0 and $\alpha_0$, and so define
\[\text{HC}^\star_n = \max_{0 < \alpha \le \alpha_0} \sqrt{n}\,[(\text{Fraction Significant at } \alpha) - \alpha] / \sqrt{\alpha (1-\alpha)}\]in the paper’s setting, there are $n$ independent tests of unrelated hypotheses, $H_{0i}$ vs $H_{1i}$, where the test statistics $X_i$ obey
\[\text{under } H_{0, i}:\ X_i \sim N(0,1)\\ \text{under } H_{1, i}:\ X_i \sim N(\mu_i, 1),\qquad \mu_i > 0\]
1.1 The model and the asymptotic detection boundary
consider a special case where all the nonzero $\mu_i$ are equal, and model the data as providing $n$ i.i.d. observations from one of the two possible situations
\[H_0: X_i\sim_{iid} N(0, 1)\,,\qquad 1\le i\le n\\ H_1^{(n)}: X_i\sim_{iid} (1-\varepsilon)N(0, 1) + \varepsilon N(\mu, 1)\,,\qquad 1\le i\le n\]the paper calibrates the sparsity and signal strength to the sample size: $\varepsilon = \varepsilon_n = n^{-\beta}$ with $\frac 12 < \beta < 1$ and $\mu = \mu_n = \sqrt{2r\log n}$ with $0 < r < 1$; with $\varepsilon_n$ and $\mu_n$ fixed and known, the optimal procedure is simply the likelihood ratio test
there is a threshold effect for the likelihood ratio test: the sum of Type I and Type II errors tends to 0 or 1 depending on whether $r$ lies above or below a so-called detection boundary $r = \rho^\star(\beta)$.
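for later reference, the detection boundary in this calibration is
\[\rho^\star(\beta) = \begin{cases}\beta - \frac 12, & \frac 12 < \beta\le \frac 34\\ (1-\sqrt{1-\beta})^2, & \frac 34 < \beta < 1\end{cases}\]and the main result of the paper is that higher criticism attains this boundary throughout $\frac 12 < \beta < 1$ without knowledge of $\varepsilon_n$ or $\mu_n$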
1.2 Performance of higher criticism
let $p_i$ be the p-value for the i-th component null hypothesis, and let $p_{(i)}$ denote the p-values sorted in increasing order
write
\[\text{HC}^\star_n = \max_{1\le i\le \alpha_0\cdot n}\sqrt n\,[i/n - p_{(i)}] / \sqrt{p_{(i)}(1-p_{(i)})}\]to use $\text{HC}_n^\star$ to conduct a level-$\alpha$ test, we must find a critical value $h(n, \alpha)$ such that
\[P_{H_0}\{\text{HC}_n^\star > h(n,\alpha)\} \le \alpha\]
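a minimal numerical sketch of $\text{HC}_n^\star$ in its p-value form, together with a Monte Carlo estimate of $h(n, \alpha)$; the function names, the choice $\alpha_0 = 1/2$ and the simulation sizes are my own, not the paper's:

```python
import numpy as np
from scipy.stats import norm

def higher_criticism(pvals, alpha0=0.5):
    """HC*_n = max over the alpha0*n smallest p-values of
    sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    k = max(1, int(alpha0 * n))
    return hc[:k].max()

def null_critical_value(n, alpha=0.05, n_sim=2000, alpha0=0.5, seed=0):
    """Monte Carlo estimate of h(n, alpha): the upper-alpha quantile of
    HC*_n computed from n i.i.d. U[0, 1] p-values."""
    rng = np.random.default_rng(seed)
    sims = [higher_criticism(rng.uniform(size=n), alpha0) for _ in range(n_sim)]
    return float(np.quantile(sims, 1 - alpha))

if __name__ == "__main__":
    n, beta, r = 10_000, 0.7, 0.35              # eps_n = n^-beta, mu_n = sqrt(2 r log n)
    eps, mu = n ** (-beta), np.sqrt(2 * r * np.log(n))
    rng = np.random.default_rng(1)
    x = rng.standard_normal(n) + mu * (rng.uniform(size=n) < eps)   # sparse mixture
    pvals = norm.sf(x)                           # one-sided p-values
    print("HC*_n =", higher_criticism(pvals))
    print("h(n, 0.05) ~", null_critical_value(n))
```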
1.3 Which part of the sample contains the information?
In the null case $\text{HC}_n^\star$ is closely related to well-known functionals of the standard uniform empirical process.
Given $n$ independent random samples $U_1, \ldots, U_n$ from $U[0, 1]$, with empirical distribution function
\[F_n(t) = \frac 1n\sum_{i=1}^n1_{U_i\le t}\]the uniform empirical process is denoted by
\[U_n(t) = \sqrt n[F_n(t) - t], \qquad 0 < t < 1\]and the normalized uniform empirical process by
\[W_n(t) = \frac{U_n(t)}{\sqrt{t(1-t)}}\]Note that $W_n(t)$ is asymptotically $N(0, 1)$ for each fixed $t\in (0, 1)$
under $H_0$ the p-values are i.i.d. $U[0, 1]$, so we have, in this case, the representation
\[\text{HC}_n^\star = \max_{0 < t < \alpha_0} W_n(t)\]note the use of max instead of sup, since the maximum value of $W_n(t)$ is attained.
extend usage below by letting $W_n(t)$ also stand for the normalized empirical process starting from $F_n(t) = \frac 1n\sum_{i=1}^n 1_{p_i\le t}$, where the $p_i$ are the $n$ given $p$-values, but which are i.i.d. $U(0, 1)$ only in the null case.
accordingly, $W_n(t)$ will be asymptotically $N(0, 1)$ at each fixed $t$ only under $H_0$, and one anticipates a different limiting behavior under the alternative.
now look for the value of $t$ at which $W_n(t)$ has the most dramatically different behavior under the null and the alternative
introduce the notation $z_n(q) = \sqrt{2q\log(n)}$ for $0 < q\le 1$, and look for the value of $q$ for which $\text{Prob}(X_i > z_n(q))$ best differentiates between null and alternative.
under the alternative, $H_1^{(n)}$, the data have a sparse component with nonzero mean at $\mu_n = z_n(r)$
surprisingly, the most informative value of $q$ turns out not to be $q = r$ but $q = \min(4r, 1)$, i.e. the best threshold sits near $2\mu_n$ (when $4r\le 1$) rather than at $\mu_n$; the calculation below makes this precise
to find the most informative $q$, introduce some notation,
\[p_{n, q} = P(N(0, 1) > z_n(q)), \qquad q > 0\]and note that
\[\max_{0 < t < 1/2}W_n(t) = \max_{0 < q < \infty} W_n(p_{n,q})\]it is only necessary to consider $0 < q\le 1$
Let $N_n(q)$ count the observations exceeding $z_n(q)$:
\[N_n(q)= \#\{i: X_i \ge z_n(q) \}\]then define
\[V_n(q) = \frac{N_n(q) - np_{n, q}}{\sqrt{np_{n, q}(1-p_{n,q})}}\]in many calculations, one will find factors with polylog behavior,
to focus on the main ideas, introduce the notation $L_n$ for a generic polylog factor, which may change from occurrence to occurrence. when we do this, we think of $L_n$ essentially as if it were constant
\[P(N(\mu_n, 1) > z_n(q)) = L_n\cdot n^{-(\sqrt q - \sqrt r)^2}\\ P(N(0, 1) > z_n(q)) = L_n\cdot n^{-q}, \qquad r < q\le 1\]it follows that, under the alternative $H_1^{(n)}$, we have
\[EV_n(q) = L_n \cdot n^{[(1+q)/2 - \beta - (\sqrt q - \sqrt r)^2]}\]while under the null $EV_n(q) = 0$
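to see where this comes from: with $\varepsilon_n = n^{-\beta}$,
\[EN_n(q) - np_{n,q} = n\varepsilon_n\big[P(N(\mu_n, 1) > z_n(q)) - p_{n,q}\big] = L_n\cdot n^{1-\beta - (\sqrt q - \sqrt r)^2}\](the subtracted $p_{n,q} = L_n\cdot n^{-q}$ is of smaller order for $q > r$), while the denominator of $V_n(q)$ is $\sqrt{np_{n,q}(1-p_{n,q})} = L_n\cdot n^{(1-q)/2}$; dividing gives the exponent $(1+q)/2 - \beta - (\sqrt q-\sqrt r)^2$, whose derivative in $q$ is $\sqrt{r/q} - \frac 12$, so over $0 < q\le 1$ the exponent is maximized at $q = \min(4r, 1)$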
the most informative value of $q$ is the one that maximizes this exponent, making $EV_n(q)$ grow to $\infty$ as fast as possible
there can be considerable latitude in choosing $q$ so that $EV_n(q)$ goes to $\infty$ under $H_1^{(n)}$: the exponent must be positive, which in terms of $s = \sqrt q$ is the quadratic inequality $s^2 - 4\sqrt r\, s + 2(\beta + r - \frac 12) < 0$, i.e.
\[2\sqrt r - \sqrt{2(r-\beta + \frac 12)} < \sqrt q < 2\sqrt r + \sqrt{2(r-\beta+\frac 12)}\]
1.4 Comparison to several multiple comparison procedures
the higher criticism is just one specific approach to combining many $p$-values in search of an overall test
many other tools are available from the field of multiple comparisons and meta-analysis
1.4.1 Range/Maximum/Bonferroni
the studentized range:
\[R_n = (\max X_i - \min X_i) / S_n\]where $S_n$ is the sample standard deviation. This is frequently used in testing for homogeneity of a set of normal means.
for theoretical purposes, it is convenient to analyze the simpler statistic
\[M_n = \max_i X_i\]this focuses attention on one-sided deviations only and has no need to estimate the standard deviation, which is known under the null
note that
- in the field of meta-analysis this is the same thing as combining several $p$-values by taking the minimum $p$-value
- another equivalent term for this is Bonferroni-based inference
the maximum statistic has a critical value $m(n, \alpha)$ which obeys
\[m(n, \alpha)\sim \sqrt{2\log n}, \qquad n\rightarrow\infty\]in comparison to $\text{HC}^\star_n$, which scans over many significance thresholds, the test based on $M_n$ focuses entirely on whether there are observations exceeding roughly $\sqrt{2\log n}$
Define \(\rho_{Max}(\beta) = (1 - \sqrt{1-\beta})^2\). Suppose $r > \rho_{Max}(\beta)$ and consider a sequence of level-$\alpha_n$ tests based on $M_n$ with $\alpha_n\rightarrow 0$ slowly enough. Then \(P_{H_1^{(n)}}\{M_n \text{ rejects }H_0\} \rightarrow 1\).
In short, $\rho_{Max}$ defines the detection boundary for $M_n$. This compares to the efficient boundary as follows: \(\rho^\star(\beta) = \rho_{Max}(\beta)\) for $\beta\in [\frac 34, 1)$, so that $M_n$ attains the optimal boundary only in the range of very sparse alternatives; for $\frac 12 < \beta < \frac 34$ it is suboptimal, since $\rho_{Max}(\beta) > \rho^\star(\beta) = \beta - \frac 12$
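a small numerical check of the two boundaries (the code and function names are mine; `rho_star` uses the formula quoted in 1.1):

```python
import numpy as np

def rho_star(beta):
    """Optimal detection boundary rho*(beta), valid for 1/2 < beta < 1."""
    beta = np.asarray(beta, dtype=float)
    return np.where(beta <= 0.75, beta - 0.5, (1 - np.sqrt(1 - beta)) ** 2)

def rho_max(beta):
    """Detection boundary of the maximum / Bonferroni test."""
    beta = np.asarray(beta, dtype=float)
    return (1 - np.sqrt(1 - beta)) ** 2

# the two boundaries agree for beta >= 3/4; below that rho_Max > rho*,
# so M_n misses alternatives that HC* can detect.
betas = np.linspace(0.55, 0.95, 9)
for b, rs, rm in zip(betas, rho_star(betas), rho_max(betas)):
    print(f"beta={b:.2f}  rho*={rs:.3f}  rho_Max={rm:.3f}")
```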
1.4.2 FDR-controlling methods
consider a situation where one observes data $X_i = \theta_i + Z_i$, where the $Z_i$ are i.i.d. $N(0, 1)$. Suppose that only a fraction $\varepsilon_n$ of the means $\theta_i$ might be nonzero.
for the detection problem, the behavior of FDR-controlling procedures is no different from that of the maximum $M_n$:
\[\rho_{FDR}(\beta) = \rho_{Max}(\beta) = (1 - \sqrt{1 - \beta})^2, \qquad \frac 12 < \beta < 1\]
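for concreteness, a minimal sketch of the Benjamini–Hochberg step-up procedure, the canonical FDR-controlling method; the implementation details are mine:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject H_(1), ..., H_(k) where
    k = max{ i : p_(i) <= i*q/n }; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest i with p_(i) <= i*q/n
        reject[order[: k + 1]] = True
    return reject
```

read as a global-null test ("reject if BH makes at least one rejection"), this behaves like the maximum statistic in the rare/weak regime, which is how I read the boundary above.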
1.5 Classical methods for combining $p$-values
a classical approach to combining $p$-values is Fisher's method, which refers
\[F_n = -2\sum_{1\le i\le n} \log p_i\]to the $\chi_{2n}^2$ distribution (under the global null each $-2\log p_i\sim\chi_2^2$).
In the setting of extreme heterogeneity, Fisher’s method is unable to function well asymptotically
If $\varepsilon_n = n^{-\beta}, \beta > \frac 12$ and $\mu_n\le \sqrt{2\log n}$, asymptotically $F_n$ is unable to separate $H_1^{(n)}$ and $H_0$
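a minimal sketch of Fisher's combination test (implementation mine):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined(pvals):
    """Fisher's method: F_n = -2 * sum(log p_i) is chi^2 with 2n degrees of
    freedom under the intersection null; returns (F_n, combined p-value)."""
    p = np.asarray(pvals)
    f = -2.0 * np.sum(np.log(p))
    return f, chi2.sf(f, df=2 * len(p))
```

roughly speaking, in the rare/weak regime the combined statistic barely moves: about $n^{1-\beta}$ shifted observations perturb a sum whose null standard deviation is of order $\sqrt n$, and $n^{1-\beta}$ (even with polylog factors) is $o(\sqrt n)$ for $\beta > \frac 12$.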
1.6 Relation to goodness-of-fit testing
- Anderson and Darling defined a goodness-of-fit measure which involves the maximum of the normalized empirical process
    - the main difference is that here attention focuses near $p=0$, while Anderson and Darling maximize over $a < p \le b$ with $0 < a < b < 1$ fixed
- however, all the information needed for discrimination between $H_0^{(n)}$ and $H_1^{(n)}$ is at values of $p$ increasingly close to 0 as $n$ increases
- therefore, statistics based on $a < p < b$ with $a, b$ fixed are dramatically inefficient
 
Berk and Jones proposed a goodness-of-fit measure which, adapted to the present setting, may be written as
\[BJ_n^+ = n\cdot \max_{1\le i\le n/2} K^+(i/n, p_{(i)})\]the detection boundary of $BJ_n^+$ is the same as that of $HC^\star$
\[\rho_{BJ}(\beta) = \rho^\star(\beta)\,, \frac 12 < \beta < 1\]
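assuming the note's $K^+(x, y)$ is the usual one-sided Bernoulli Kullback–Leibler divergence of Berk and Jones, $K^+(x,y) = x\log\frac xy + (1-x)\log\frac{1-x}{1-y}$ for $x > y$ and $0$ otherwise, a minimal sketch:

```python
import numpy as np

def berk_jones_plus(pvals):
    """BJ_n^+ = n * max_{i <= n/2} K^+(i/n, p_(i)), with K^+ taken to be the
    one-sided Bernoulli Kullback-Leibler divergence (assumed definition)."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    m = n // 2
    x = np.arange(1, m + 1) / n                  # empirical fractions i/n
    y = np.clip(p[:m], 1e-300, 1 - 1e-12)        # guard against log(0)
    k = x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
    k_plus = np.where(x > y, k, 0.0)             # one-sided: only count p_(i) < i/n
    return n * k_plus.max()
```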
1.7 Generalizations
the higher criticism statistic can obviously be used in a wide variety of situations and there is no need for the p-values to be derived from normally distributed Z-scores
HC for Large-Scale Inference, Especially for Rare and Weak Effects
review the basics of HC in both the testing and feature selection settings
HC adapts easily to new situations: clique detection and bivariate outlier detection
review the theoretical ideology behind HC
The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk
In the RW model, HC has important advantages over better-known procedures such as False Discovery Rate (FDR) control and familywise error rate (FWER) control
1.1 HC Basics
generalize Tukey’s proposal and set
\[HC_{N,\alpha} = \sqrt{N}\,[(\text{Fraction significant at } \alpha) - \alpha] / \sqrt{\alpha(1-\alpha)}\]the significance of the overall body of tests is captured in the following Higher Criticism statistic
\[HC_N^\star = \max_{0 < \alpha\le \alpha_0} HC_{N,\alpha}\]
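since the note mentions the feature selection setting, here is a sketch of HC thresholding as I understand it: keep every feature whose p-value is at most the p-value at which the HC objective is maximized; the function name and details are mine, not the papers' implementation:

```python
import numpy as np

def hc_threshold_select(pvals, alpha0=0.5):
    """HC thresholding sketch: maximize the HC objective over the sorted
    p-values and keep all features up to (and including) the maximizer."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    ps = p[order]
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - ps) / np.sqrt(ps * (1 - ps))
    k = max(1, int(alpha0 * n))
    i_star = int(np.argmax(hc[:k]))           # index of the HC-maximizing p-value
    selected = np.zeros(n, dtype=bool)
    selected[order[: i_star + 1]] = True      # keep features with p <= p_(i_star)
    return selected
```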