Higher Criticism for Sparse Heterogeneous Mixtures
This note is for Donoho, D., & Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3), 962–994.
Tukey introduced the notion of higher criticism with a story: out of 250 independent tests, 11 were significant at the 5% level, but one would expect 12.5 significant tests even in the purely null case
he proposed a sort of second-level significance testing, based on the statistic
\[\text{HC}_{0.05, n} = \sqrt{n}\,[(\text{Fraction Significant at } 0.05) - 0.05] / \sqrt{0.05\times 0.95}\]and suggested that values of (say) 2 or greater indicate a kind of significance of the overall body of tests
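plugging Tukey's numbers into this statistic, 250 tests with 11 significant at the 5% level give
\[\text{HC}_{0.05, 250} = \sqrt{250}\,\frac{11/250 - 0.05}{\sqrt{0.05\times 0.95}} \approx -0.44\]far below 2, so the collection of tests is, if anything, slightly less significant than chance alone would predict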
we might consider not only significance at the 0.05 level, but perhaps at all levels between (say) 0 and $\alpha_0$, and so define
\[\text{HC}^\star_n = \max_{0 < \alpha \le \alpha_0} \sqrt{n}\,[(\text{Fraction Significant at } \alpha) - \alpha] / \sqrt{\alpha (1-\alpha)}\]in the paper’s setting, there are $n$ independent tests of unrelated hypotheses, $H_{0i}$ vs $H_{1i}$, where the test statistics $X_i$ obey
\[\text{under } H_{0, i}:\ X_i \sim N(0,1)\\ \text{under } H_{1, i}:\ X_i \sim N(\mu_i, 1),\qquad \mu_i > 0\]
1.1 The model and the asymptotic detection boundary
consider a special case where all the nonzero $\mu_i$ are equal, and model the data as providing $n$ i.i.d. observations from one of the two possible situations
\[H_0: X_i\sim_{iid} N(0, 1)\,,\qquad 1\le i\le n\\ H_1^{(n)}: X_i\sim_{iid} (1-\varepsilon)N(0, 1) + \varepsilon N(\mu, 1)\,,\qquad 1\le i\le n\]the paper calibrates the sparsity and signal strength to the sample size: $\varepsilon = \varepsilon_n = n^{-\beta}$ with $\frac 12 < \beta < 1$ and $\mu = \mu_n = \sqrt{2r\log n}$ with $0 < r < 1$; with $\varepsilon_n$ and $\mu_n$ fixed and known, the optimal procedure is simply the likelihood ratio test
there is a threshold effect for the likelihood ratio test: the sum of Type I and Type II errors tends to 0 or 1 depending on whether $r$ lies above or below a so-called detection boundary $r = \rho^\star(\beta)$.
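for later reference, the detection boundary in this calibration is
\[\rho^\star(\beta) = \begin{cases}\beta - \frac 12, & \frac 12 < \beta\le \frac 34\\ (1-\sqrt{1-\beta})^2, & \frac 34 < \beta < 1\end{cases}\]and the main result of the paper is that higher criticism attains this boundary throughout $\frac 12 < \beta < 1$ without knowledge of $\varepsilon_n$ or $\mu_n$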
1.2 Performance of higher criticism
let $p_i$ be the p-value for the i-th component null hypothesis, and let $p_{(i)}$ denote the p-values sorted in increasing order
write
\[\text{HC}^\star_n = \max_{1\le i\le \alpha_0\cdot n}\sqrt n\,[i/n - p_{(i)}] / \sqrt{p_{(i)}(1-p_{(i)})}\]to use $\text{HC}_n^\star$ to conduct a level-$\alpha$ test, we must find a critical value $h(n, \alpha)$ such that
\[P_{H_0}\{\text{HC}_n^\star > h(n,\alpha)\} \le \alpha\]
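a minimal numerical sketch of $\text{HC}_n^\star$ in its p-value form, together with a Monte Carlo estimate of $h(n, \alpha)$; the function names, the choice $\alpha_0 = 1/2$ and the simulation sizes are my own, not the paper's:

```python
import numpy as np
from scipy.stats import norm

def higher_criticism(pvals, alpha0=0.5):
    """HC*_n = max over the alpha0*n smallest p-values of
    sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    k = max(1, int(alpha0 * n))
    return hc[:k].max()

def null_critical_value(n, alpha=0.05, n_sim=2000, alpha0=0.5, seed=0):
    """Monte Carlo estimate of h(n, alpha): the upper-alpha quantile of
    HC*_n computed from n i.i.d. U[0, 1] p-values."""
    rng = np.random.default_rng(seed)
    sims = [higher_criticism(rng.uniform(size=n), alpha0) for _ in range(n_sim)]
    return float(np.quantile(sims, 1 - alpha))

if __name__ == "__main__":
    n, beta, r = 10_000, 0.7, 0.35              # eps_n = n^-beta, mu_n = sqrt(2 r log n)
    eps, mu = n ** (-beta), np.sqrt(2 * r * np.log(n))
    rng = np.random.default_rng(1)
    x = rng.standard_normal(n) + mu * (rng.uniform(size=n) < eps)   # sparse mixture
    pvals = norm.sf(x)                           # one-sided p-values
    print("HC*_n =", higher_criticism(pvals))
    print("h(n, 0.05) ~", null_critical_value(n))
```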
1.3 Which part of the sample contains the information?
In the null case $\text{HC}_n^\star$ is closely related to well-known functionals of the standard uniform empirical process.
Given $n$ independent random samples $U_1, \ldots, U_n$ from $U[0, 1]$, with empirical distribution function
\[F_n(t) = \frac 1n\sum_{i=1}^n1_{U_i\le t}\]the uniform empirical process is denoted by
\[U_n(t) = \sqrt n[F_n(t) - t], \qquad 0 < t < 1\]and the normalized uniform empirical process by
\[W_n(t) = \frac{U_n(t)}{\sqrt{t(1-t)}}\]Note that $W_n(t)$ is asymptotically $N(0, 1)$ for each fixed $t\in (0, 1)$
under $H_0$ the p-values are i.i.d. $U[0, 1]$, so we have, in this case, the representation
\[\text{HC}_n^\star = \max_{0 < t < \alpha_0} W_n(t)\]note the use of max instead of sup, since the maximum value of $W_n(t)$ is attained.
extend usage below by letting $W_n(t)$ also stand for the normalized empirical process starting from $F_n(t) = \frac 1n\sum_{i=1}^n 1_{p_i\le t}$, where the $p_i$ are the $n$ given $p$-values, but which are i.i.d. $U(0, 1)$ only in the null case.
accordingly, $W_n(t)$ will be asymptotically $N(0, 1)$ at each fixed $t$ only under $H_0$, and one anticipates a different limiting behavior under the alternative.
now look for the value of $t$ at which $W_n(t)$ has the most dramatically different behavior under the null and the alternative
introduce the notation $z_n(q) = \sqrt{2q\log(n)}$ for $0 < q\le 1$, and look for the value of $q$ for which $\text{Prob}(X_i > z_n(q))$ best differentiates between null and alternative.
under the alternative, $H_1^{(n)}$, the data have a sparse component with nonzero mean at $\mu_n = z_n(r)$
surprisingly, the most informative value of $q$ turns out not to be $q = r$ but $q = \min(4r, 1)$, i.e. the best threshold sits near $2\mu_n$ (when $4r\le 1$) rather than at $\mu_n$; the calculation below makes this precise
to find the most informative $q$, introduce some notation,
\[p_{n, q} = P(N(0, 1) > z_n(q)), \qquad q > 0\]and note that
\[\max_{0 < t < 1/2}W_n(t) = \max_{0 < q < \infty} W_n(p_{n,q})\]it is only necessary to consider $0 < q\le 1$
Let $N_n(q)$ count the observations exceeding $z_n(q)$:
\[N_n(q)= \#\{i: X_i \ge z_n(q) \}\]then define
\[V_n(q) = \frac{N_n(q) - np_{n, q}}{\sqrt{np_{n, q}(1-p_{n,q})}}\]in many calculations, one will find factors with polylog behavior,
to focus on the main ideas, introduce the notation $L_n$ for a generic polylog factor, which may change from occurrence to occurrence. when we do this, we think of $L_n$ essentially as if it were constant
\[P(N(\mu_n, 1) > z_n(q)) = L_n\cdot n^{-(\sqrt q - \sqrt r)^2}\\ P(N(0, 1) > z_n(q)) = L_n\cdot n^{-q}, \qquad r < q\le 1\]it follows that, under the alternative $H_1^{(n)}$, we have
\[EV_n(q) = L_n \cdot n^{[(1+q)/2 - \beta - (\sqrt q - \sqrt r)^2]}\]while under the null $EV_n(q) = 0$
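to see where this comes from: with $\varepsilon_n = n^{-\beta}$,
\[EN_n(q) - np_{n,q} = n\varepsilon_n\big[P(N(\mu_n, 1) > z_n(q)) - p_{n,q}\big] = L_n\cdot n^{1-\beta - (\sqrt q - \sqrt r)^2}\](the subtracted $p_{n,q} = L_n\cdot n^{-q}$ is of smaller order for $q > r$), while the denominator of $V_n(q)$ is $\sqrt{np_{n,q}(1-p_{n,q})} = L_n\cdot n^{(1-q)/2}$; dividing gives the exponent $(1+q)/2 - \beta - (\sqrt q-\sqrt r)^2$, whose derivative in $q$ is $\sqrt{r/q} - \frac 12$, so over $0 < q\le 1$ the exponent is maximized at $q = \min(4r, 1)$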
the most informative value of $q$ is the one that maximizes this exponent, making $EV_n(q)$ grow to $\infty$ as fast as possible
there can be considerable latitude in choosing $q$ so that $EV_n(q)$ goes to $\infty$ under $H_1^{(n)}$: the exponent must be positive, which in terms of $s = \sqrt q$ is the quadratic inequality $s^2 - 4\sqrt r\, s + 2(\beta + r - \frac 12) < 0$, i.e.
\[2\sqrt r - \sqrt{2(r-\beta + \frac 12)} < \sqrt q < 2\sqrt r + \sqrt{2(r-\beta+\frac 12)}\]
1.4 Comparison to several multiple comparison procedures
the higher criticism is just one specific approach to combining many $p$-values in search of an overall test
many other tools are available from the field of multiple comparisons and meta-analysis
1.4.1 Range/Maximum/Bonferroni
the studentized range:
\[R_n = (\max X_i - \min X_i) / S_n\]where $S_n$ is the sample standard deviation. This is frequently used in testing for homogeneity of a set of normal means.
for theoretical purposes, it is convenient to analyze the simpler statistic
\[M_n = \max_i X_i\]this focuses attention on one-sided deviations only and has no need to estimate the standard deviation, which is known under the null
note that
- in the field of meta-analysis this is the same thing as combining several $p$-values by taking the minimum $p$-value
- another equivalent term for this is Bonferroni-based inference
the maximum statistic has a critical value $m(n, \alpha)$ which obeys
\[m(n, \alpha)\sim \sqrt{2\log n}, \qquad n\rightarrow\infty\]in comparison to $\text{HC}^\star_n$, which scans over many significance thresholds, the test based on $M_n$ focuses entirely on whether there are observations exceeding roughly $\sqrt{2\log n}$
Define \(\rho_{Max}(\beta) = (1 - \sqrt{1-\beta})^2\). Suppose $r > \rho_{Max}(\beta)$ and consider a sequence of level-$\alpha_n$ tests based on $M_n$ with $\alpha_n\rightarrow 0$ slowly enough. Then \(P_{H_1^{(n)}}\{M_n \text{ rejects }H_0\} \rightarrow 1\).
In short, $\rho_{Max}$ defines the detection boundary for $M_n$. This compares to the efficient boundary as follows: \(\rho^\star(\beta) = \rho_{Max}(\beta)\) for $\beta\in [\frac 34, 1)$, so that $M_n$ attains the optimal boundary only in the range of very sparse alternatives; for $\frac 12 < \beta < \frac 34$ it is suboptimal, since $\rho_{Max}(\beta) > \rho^\star(\beta) = \beta - \frac 12$
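a small numerical check of the two boundaries (the code and function names are mine; `rho_star` uses the formula quoted in 1.1):

```python
import numpy as np

def rho_star(beta):
    """Optimal detection boundary rho*(beta), valid for 1/2 < beta < 1."""
    beta = np.asarray(beta, dtype=float)
    return np.where(beta <= 0.75, beta - 0.5, (1 - np.sqrt(1 - beta)) ** 2)

def rho_max(beta):
    """Detection boundary of the maximum / Bonferroni test."""
    beta = np.asarray(beta, dtype=float)
    return (1 - np.sqrt(1 - beta)) ** 2

# the two boundaries agree for beta >= 3/4; below that rho_Max > rho*,
# so M_n misses alternatives that HC* can detect.
betas = np.linspace(0.55, 0.95, 9)
for b, rs, rm in zip(betas, rho_star(betas), rho_max(betas)):
    print(f"beta={b:.2f}  rho*={rs:.3f}  rho_Max={rm:.3f}")
```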
1.4.2 FDR-controlling methods
consider a situation where one observes data $X_i = \theta_i + Z_i$, where the $Z_i$ are i.i.d. $N(0, 1)$. Suppose that only a fraction $\varepsilon_n$ of the means $\theta_i$ might be nonzero.
for the detection problem, the behavior of FDR-controlling procedures is no different from that of the maximum $M_n$:
\[\rho_{FDR}(\beta) = \rho_{Max}(\beta) = (1 - \sqrt{1 - \beta})^2, \qquad \frac 12 < \beta < 1\]
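for concreteness, a minimal sketch of the Benjamini–Hochberg step-up procedure, the canonical FDR-controlling method; the implementation details are mine:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject H_(1), ..., H_(k) where
    k = max{ i : p_(i) <= i*q/n }; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest i with p_(i) <= i*q/n
        reject[order[: k + 1]] = True
    return reject
```

read as a global-null test ("reject if BH makes at least one rejection"), this behaves like the maximum statistic in the rare/weak regime, which is how I read the boundary above.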
1.5 Classical methods for combining $p$-values
a classical approach to combining $p$-values is Fisher's method, which refers
\[F_n = -2\sum_{1\le i\le n} \log p_i\]to the $\chi_{2n}^2$ distribution (under the global null each $-2\log p_i\sim\chi_2^2$).
In the setting of extreme heterogeneity, Fisher’s method is unable to function well asymptotically
If $\varepsilon_n = n^{-\beta}, \beta > \frac 12$ and $\mu_n\le \sqrt{2\log n}$, asymptotically $F_n$ is unable to separate $H_1^{(n)}$ and $H_0$
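a minimal sketch of Fisher's combination test (implementation mine):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined(pvals):
    """Fisher's method: F_n = -2 * sum(log p_i) is chi^2 with 2n degrees of
    freedom under the intersection null; returns (F_n, combined p-value)."""
    p = np.asarray(pvals)
    f = -2.0 * np.sum(np.log(p))
    return f, chi2.sf(f, df=2 * len(p))
```

roughly speaking, in the rare/weak regime the combined statistic barely moves: about $n^{1-\beta}$ shifted observations perturb a sum whose null standard deviation is of order $\sqrt n$, and $n^{1-\beta}$ (even with polylog factors) is $o(\sqrt n)$ for $\beta > \frac 12$.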
1.6 Relation to goodness-of-fit testing
- Anderson and Darling defined a goodness-of-fit measure which involves the maximum of the normalized empirical process
    - the main difference is that here attention focuses near $p=0$, while Anderson and Darling maximize over $a < p \le b$ with $0 < a < b < 1$ fixed
- however, all the information needed for discrimination between $H_0^{(n)}$ and $H_1^{(n)}$ is at values of $p$ increasingly close to 0 as $n$ increases
- therefore, statistics based on $a < p < b$ with $a, b$ fixed are dramatically inefficient
 
Berk and Jones proposed a goodness-of-fit measure which, adapted to the present setting, may be written as
\[BJ_n^+ = n\cdot \max_{1\le i\le n/2} K^+(i/n, p_{(i)})\]the detection boundary of $BJ_n^+$ is the same as that of $HC^\star$
\[\rho_{BJ}(\beta) = \rho^\star(\beta)\,, \frac 12 < \beta < 1\]
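assuming the note's $K^+(x, y)$ is the usual one-sided Bernoulli Kullback–Leibler divergence of Berk and Jones, $K^+(x,y) = x\log\frac xy + (1-x)\log\frac{1-x}{1-y}$ for $x > y$ and $0$ otherwise, a minimal sketch:

```python
import numpy as np

def berk_jones_plus(pvals):
    """BJ_n^+ = n * max_{i <= n/2} K^+(i/n, p_(i)), with K^+ taken to be the
    one-sided Bernoulli Kullback-Leibler divergence (assumed definition)."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    m = n // 2
    x = np.arange(1, m + 1) / n                  # empirical fractions i/n
    y = np.clip(p[:m], 1e-300, 1 - 1e-12)        # guard against log(0)
    k = x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
    k_plus = np.where(x > y, k, 0.0)             # one-sided: only count p_(i) < i/n
    return n * k_plus.max()
```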
1.7 Generalizations
the higher criticism statistic can obviously be used in a wide variety of situations and there is no need for the p-values to be derived from normally distributed Z-scores
HC for Large-Scale Inference, Especially for Rare and Weak Effects
review the basics of HC in both the testing and feature selection settings
HC adapts easily to new situations: clique detection and bivariate outlier detection
review the theoretical ideology behind HC
The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk
In the RW model, HC has important advantages over better-known procedures such as False Discovery Rate (FDR) control and familywise error rate (FWER) control
1.1 HC Basics
generalize Tukey’s proposal and set
\[HC_{N,\alpha} = \sqrt{N}\,[(\text{Fraction significant at } \alpha) - \alpha] / \sqrt{\alpha(1-\alpha)}\]the significance of the overall body of tests is captured in the following Higher Criticism statistic
\[HC_N^\star = \max_{0 < \alpha\le \alpha_0} HC_{N,\alpha}\]
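since the note mentions the feature selection setting, here is a sketch of HC thresholding as I understand it: keep every feature whose p-value is at most the p-value at which the HC objective is maximized; the function name and details are mine, not the papers' implementation:

```python
import numpy as np

def hc_threshold_select(pvals, alpha0=0.5):
    """HC thresholding sketch: maximize the HC objective over the sorted
    p-values and keep all features up to (and including) the maximizer."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    ps = p[order]
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - ps) / np.sqrt(ps * (1 - ps))
    k = max(1, int(alpha0 * n))
    i_star = int(np.argmax(hc[:k]))           # index of the HC-maximizing p-value
    selected = np.zeros(n, dtype=bool)
    selected[order[: i_star + 1]] = True      # keep features with p <= p_(i_star)
    return selected
```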