
Higher Criticism for Sparse Heterogeneous Mixtures

Tags: Higher Criticism, Mixtures

This note is for Donoho, D., & Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3), 962–994.

Tukey introduced the notion of higher criticism with a story: among 250 tests, 11 were significant at the 5% level, but one would expect 12.5 significant tests even in the purely null case

he proposed a sort of second-level significance testing, based on the statistic

\[\text{HC}_{0.05, n} = \sqrt{n}\,[(\text{Fraction Significant at } 0.05) - 0.05] / \sqrt{0.05\times 0.95}\]

and suggested that values of (say) 2 or greater indicate a kind of significance of the overall body of tests

we might consider not only significance at the 0.05 level, but perhaps at all levels between (say) 0 and $\alpha_0$, and so define

\[\text{HC}^\star_n = \max_{0 < \alpha\le\alpha_0} \sqrt{n}\,[(\text{Fraction Significant at $\alpha$}) - \alpha] / \sqrt{\alpha \times (1-\alpha)}\]

in the paper’s setting, there are $n$ independent tests of unrelated hypotheses, $H_{0i}$ vs $H_{1i}$, where the test statistics $X_i$ obey

\[H_{0, i}: X_i \sim N(0,1)\\ H_{1, i}: X_i \sim N(\mu_i, 1),\qquad \mu_i > 0\]

1.1 The model and the asymptotic detection boundary

consider a special case where all the nonzero $\mu_i$ are equal, and model the data as providing $n$ i.i.d. observations from one of the two possible situations

\[H_0: X_i\sim_{iid} N(0, 1)\,,\qquad 1\le i\le n\\ H_1^{(n)}: X_i\sim_{iid} (1-\varepsilon)N(0, 1) + \varepsilon N(\mu, 1)\,,\qquad 1\le i\le n\]

with $\varepsilon_n$ and $\mu_n$ fixed and known, the optimal procedure is simply the likelihood ratio test. the paper adopts the calibration $\varepsilon_n = n^{-\beta}$ with $\frac 12 < \beta < 1$ and $\mu_n = \sqrt{2r\log n}$ with $0 < r < 1$, so that the nonzero effects are both rare and weak
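under this calibration, simulating from the alternative takes a couple of lines (under the null the data are simply i.i.d. $N(0,1)$); a minimal NumPy sketch, with function name and interface my own:

```python
import numpy as np

def sample_mixture(n, beta, r, rng=None):
    """Draw X_1, ..., X_n from H_1^(n): the (1 - eps_n) N(0,1) + eps_n N(mu_n, 1) mixture,
    with eps_n = n**(-beta) and mu_n = sqrt(2 r log n)."""
    rng = rng if rng is not None else np.random.default_rng()
    eps = n ** (-beta)                      # sparsity: expected fraction of nonnulls
    mu = np.sqrt(2 * r * np.log(n))         # common nonzero mean, mu_n = z_n(r)
    means = np.where(rng.uniform(size=n) < eps, mu, 0.0)
    return means + rng.standard_normal(n)

x = sample_mixture(n=10_000, beta=0.7, r=0.4)   # e.g., beta = 0.7, r = 0.4
```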

there is a threshold effect for the likelihood ratio test: the sum of Type I and Type II errors tends to 0 or 1 depending on whether $r$ exceeds a so-called detection boundary $\rho^\star(\beta)$ or not.
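for reference, the boundary derived in the paper is

\[\rho^\star(\beta) = \begin{cases}\beta - \frac 12, & \frac 12 < \beta \le \frac 34\\ (1-\sqrt{1-\beta})^2, & \frac 34 < \beta < 1\end{cases}\]

when $r > \rho^\star(\beta)$ the likelihood ratio test can asymptotically separate $H_0$ and $H_1^{(n)}$, and when $r < \rho^\star(\beta)$ no test can.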

1.2 Performance of higher criticism

let $p_i$ be the p-value for the i-th component null hypothesis, and let $p_{(i)}$ denote the p-values sorted in increasing order

write

\[\text{HC}^\star_n = \max_{1\le i\le \alpha_0\cdot n}\sqrt n[i/n - p_{(i)}] / \sqrt{p_{(i)}(1-p_{(i)})}\]
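in code, this sorted-$p$-value form takes only a few lines; a minimal NumPy sketch (the function name and the default $\alpha_0 = 1/2$ are my own choices, and the $p$-values are assumed to lie strictly inside $(0, 1)$):

```python
import numpy as np

def higher_criticism(pvalues, alpha0=0.5):
    """HC*_n = max over i <= alpha0 * n of sqrt(n) (i/n - p_(i)) / sqrt(p_(i) (1 - p_(i)))."""
    p = np.sort(np.asarray(pvalues))        # p_(1) <= ... <= p_(n)
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    return hc[i <= alpha0 * n].max()        # restrict to the first alpha0 * n order statistics
```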

to use $\text{HC}_n^\star$ to conduct a level-$\alpha$ test, we must find a critical value $h(n, \alpha)$ with

\[P_{H_0}\{\text{HC}_n^\star > h(n,\alpha)\} \le \alpha\]
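since the null distribution of $\text{HC}_n^\star$ involves no unknowns, $h(n, \alpha)$ can be approximated by Monte Carlo; a sketch reusing the higher_criticism helper above (all names mine):

```python
import numpy as np

def hc_critical_value(n, alpha=0.05, n_rep=10_000, alpha0=0.5, seed=0):
    """Approximate h(n, alpha): the upper-alpha quantile of HC*_n under H_0."""
    rng = np.random.default_rng(seed)
    # under H_0 the p-values are i.i.d. U[0, 1]
    stats = [higher_criticism(rng.uniform(size=n), alpha0) for _ in range(n_rep)]
    return np.quantile(stats, 1 - alpha)
```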

1.3 Which part of the sample contains the information?

In the null case, $\text{HC}_n^\star$ is closely related to well-known functionals of the standard uniform empirical process.

Given $n$ independent random samples $U_1, \ldots, U_n$ from $U[0, 1]$, with empirical distribution function

\[F_n(t) = \frac 1n\sum_{i=1}^n1_{U_i\le t}\]

the uniform empirical process is denoted by

\[U_n(t) = \sqrt n[F_n(t) - t], \qquad 0 < t < 1\]

and the normalized uniform empirical process by

\[W_n(t) = \frac{U_n(t)}{\sqrt{t(1-t)}}\]

Note that $W_n(t)$ is asymptotically $N(0, 1)$ for each fixed $t\in (0, 1)$

under $H_0$ the p-values are i.i.d. $U[0, 1]$, so we have, in this case, the representation

\[\text{HC}_n^\star = \max_{0 < t < \alpha_0} W_n(t)\]

note the use of max instead of sup, since the maximum value of $W_n(t)$ is attained.

extend usage below by letting $W_n(t)$ also stand for the normalized empirical process starting from $F_n(t) = \frac 1n\sum_{i=1}^n 1_{p_i\le t}$, where the $p_i$ are the $n$ given $p$-values, which are i.i.d. $U(0, 1)$ only in the null case.

accordingly, $W_n(t)$ will be asymptotically $N(0, 1)$ at each fixed $t$ only under $H_0$, and one anticipates a different limiting behavior under the alternative.

now look for the value of $t$ at which $W_n(t)$ has the most dramatically different behavior under the null and the alternative

introduce the notation $z_n(q) = \sqrt{2q\log(n)}$ for $0 < q\le 1$, and look for the value of $q$ for which $\text{Prob}(X_i > z_n(q))$ best differentiates between null and alternative.

under the alternative, $H_1^{(n)}$, the data have a sparse component with nonzero mean at $\mu_n = z_n(r)$

surprisingly, the most informative value of $q$ is not $q = r$, but rather $q = \min(4r, 1)$

to find the most informative $q$, introduce some notation,

\[p_{n, q} = P(N(0, 1) > z_n(q)), \qquad q > 0\]

and note that

\[\max_{0 < t < 1/2}W_n(t) = \max_{0 < q < \infty} W_n(p_{n,q})\]

it is only necessary to consider $0 < q\le 1$, since essentially no observation exceeds $z_n(1) = \sqrt{2\log n}$ under either hypothesis

Let $N_n(q)$ count the observations exceeding $z_n(q)$:

\[N_n(q)= \#\{i: X_i \ge z_n(q) \}\]

then define

\[V_n(q) = \frac{N_n(q) - np_{n, q}}{\sqrt{np_{n, q}(1-p_{n,q})}}\]

and note that $V_n(q) = W_n(p_{n, q})$.
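a small sketch of $V_n(q)$, assuming standard-normal test statistics and one-sided $p$-values as above (names mine):

```python
import numpy as np
from scipy.stats import norm

def v_n(x, q):
    """Standardized exceedance count V_n(q) for a data vector x."""
    x = np.asarray(x)
    n = len(x)
    z = np.sqrt(2 * q * np.log(n))          # threshold z_n(q)
    p = norm.sf(z)                          # p_{n,q} = P(N(0,1) > z_n(q))
    count = np.sum(x >= z)                  # N_n(q) = #{i: X_i >= z_n(q)}
    return (count - n * p) / np.sqrt(n * p * (1 - p))

# e.g., v_n(sample_mixture(10_000, beta=0.7, r=0.4), q=0.4) tends to be large under H_1
```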

in many calculations, one finds factors with polylog behavior. to focus on the main ideas, introduce the notation $L_n$ for a generic polylog factor, which may change from occurrence to occurrence; when we do this, we think of $L_n$ essentially as if it were constant. in this notation,

\[P(N(\mu_n, 1) > z_n(q)) = L_n\cdot n^{-(\sqrt q - \sqrt r)^2}, \qquad r < q\le 1\\ P(N(0, 1) > z_n(q)) = L_n\cdot n^{-q}, \qquad 0 < q\le 1\]

it follows that, under the alternative $H_1^{(n)}$, we have

\[EV_n(q) = L_n \cdot n^{[(1+q)/2 - \beta - (\sqrt q - \sqrt r)^2]}\]

while under the null $EV_n(q) = 0$

the most informative value of $q$ maximizes this exponent, and hence the rate at which $EV_n(q)$ grows to $\infty$; optimizing over $0 < q\le 1$ gives $q = \min(4r, 1)$, as noted above

there can be considerable latitude in choosing $q$ so that $EV_n(q)$ goes to $\infty$ under $H_1^{(n)}$. This requires

\[2\sqrt r - \sqrt{2(r-\beta + \frac 12)} < \sqrt q < 2\sqrt r + \sqrt{2(r-\beta+\frac 12)}\]
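numerically, this interval is easy to tabulate; a sketch that clips the bounds on $\sqrt q$ to the admissible range $(0, 1]$ (it assumes $r > \beta - \frac 12$, so that the square root is real; names mine):

```python
import numpy as np

def q_interval(beta, r):
    """Range of q (within (0, 1]) for which EV_n(q) -> infinity under H_1^(n)."""
    half_width = np.sqrt(2 * (r - beta + 0.5))   # requires r > beta - 1/2
    lo = max(2 * np.sqrt(r) - half_width, 0.0)   # the bounds apply to sqrt(q)
    hi = min(2 * np.sqrt(r) + half_width, 1.0)
    return lo ** 2, hi ** 2

print(q_interval(beta=0.6, r=0.3))               # roughly (0.21, 1.0)
```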

1.4 Comparison to several multiple comparison procedures

the higher criticism is just one specific approach to combining many $p$-values in search of an overall test

many other tools are available from the field of multiple comparisons and meta-analysis

1.4.1 Range/Maximum/Bonferroni

the studentized range:

\[R_n = (\max X_i - \min X_i) / S_n\]

where $S_n$ is the sample standard deviation. This is frequently used in testing for homogeneity of a set of normal means.

for theoretical purposes, it is convenient to analyze the simpler statistic

\[M_n = \max_i X_i\]

this focuses attention on one-sided deviations only, and avoids estimating the standard deviation, which is known under the null

note that

  • in the field of meta-analysis this is the same thing as combining several $p$-values by taking the minimum $p$-value
  • another equivalent term for this is Bonferroni-based inference

the maximum statistic has a critical value $m(n, \alpha)$ which obeys

\[m(n, \alpha)\sim \sqrt{2\log n}, \qquad n\rightarrow\infty\]

in comparison to $\text{HC}^\star$, the test based on $M_n$ focuses entirely on whether there are observations exceeding $\sqrt{2\log n}$
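the relation $m(n, \alpha)\sim\sqrt{2\log n}$ is easy to check numerically, since $P_{H_0}\{M_n\le m\} = \Phi(m)^n$ gives the exact critical value; a short sketch (names mine):

```python
import numpy as np
from scipy.stats import norm

def max_critical_value(n, alpha=0.05):
    """Exact m(n, alpha) for M_n = max_i X_i under H_0: solve Phi(m)^n = 1 - alpha."""
    return norm.ppf((1 - alpha) ** (1 / n))

for n in (100, 10_000, 1_000_000):
    print(n, max_critical_value(n), np.sqrt(2 * np.log(n)))  # the ratio tends to 1
```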

Define \(\rho_{Max}(\beta) = (1 - \sqrt{1-\beta})^2\). Suppose $r > \rho_{Max}(\beta)$ and consider a sequence of level-$\alpha_n$ tests based on $M_n$ with $\alpha_n\rightarrow 0$ slowly enough. Then \(P_{H_1^{(n)}}\{M_n \text{ rejects }H_0\} \rightarrow 1\).

In short, $\rho_{Max}$ defines the detection boundary for $M_n$. This compares to the efficient boundary as follows: \(\rho^\star(\beta) = \rho_{Max}(\beta)\) for \(\beta\in [\frac 34, 1)\), while \(\rho_{Max}(\beta) > \rho^\star(\beta)\) for \(\beta\in (\frac 12, \frac 34)\), so that $M_n$ is fully efficient only in the range of very sparse alternatives.
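a quick tabulation makes the comparison of the two boundaries concrete (a sketch, using $\rho^\star$ as displayed in section 1.1):

```python
import numpy as np

def rho_star(beta):
    """Efficient detection boundary rho*(beta)."""
    return beta - 0.5 if beta <= 0.75 else (1 - np.sqrt(1 - beta)) ** 2

def rho_max(beta):
    """Detection boundary of the max test, rho_Max(beta) = (1 - sqrt(1 - beta))^2."""
    return (1 - np.sqrt(1 - beta)) ** 2

for beta in (0.55, 0.65, 0.75, 0.85, 0.95):
    print(beta, rho_star(beta), rho_max(beta))  # equal for beta >= 3/4, rho_Max larger below
```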

1.4.2 FDR-controlling methods

consider a situation where one observes data $X_i = \theta_i + Z_i$, where the $Z_i$ are i.i.d. $N(0, 1)$. Suppose that only a fraction $\varepsilon_n$ of the means $\theta_i$ might be nonzero.

in terms of the detection boundary, the behavior of FDR-controlling procedures is no different from that of the maximum $M_n$:

\[\rho_{FDR}(\beta) = \rho_{Max}(\beta) = (1 - \sqrt{1 - \beta})^2, \qquad \frac 12 < \beta < 1\]
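the best-known such procedure is the Benjamini–Hochberg step-up rule; a minimal sketch (names mine):

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.1):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at FDR level q."""
    p = np.asarray(pvalues)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, n + 1) / n   # p_(k) <= q k / n
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k]                                  # reject the k smallest p-values
```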

1.5 Classical methods for combining $p$-values

a classical approach to combining $p$-values is Fisher's method, which refers the statistic

\[F_n = -2\sum_{1\le i\le n} \log p_i\]

to its null distribution, the $\chi_{2n}^2$ distribution.
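a minimal sketch of Fisher's method (for reference, scipy.stats.combine_pvalues with method='fisher' computes the same quantities):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined(pvalues):
    """Fisher's method: refer F_n = -2 sum(log p_i) to the chi^2_{2n} distribution."""
    p = np.asarray(pvalues)
    stat = -2 * np.sum(np.log(p))
    return stat, chi2.sf(stat, df=2 * len(p))   # statistic and combined p-value
```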

In the setting of extreme heterogeneity, Fisher’s method is unable to function well asymptotically

If $\varepsilon_n = n^{-\beta}, \beta > \frac 12$ and $\mu_n\le \sqrt{2\log n}$, asymptotically $F_n$ is unable to separate $H_1^{(n)}$ and $H_0$

1.6 Relation to goodness-of-fit testing

  • Anderson and Darling defined a goodness-of-fit measure which involves the maximum of the normalized empirical process
    • the main difference is that here we focus attention near $p=0$, while Anderson and Darling maximize over $a < p \le b$ with $0 < a < b < 1$
    • however, all the information needed for discrimination between $H_0$ and $H_1^{(n)}$ is at values of $p$ increasingly close to 0 as $n$ increases
    • therefore, statistics based on $a < p < b$ with $a, b$ fixed are dramatically inefficient

Berk and Jones proposed a goodness-of-fit measure which, adapted to the present setting, may be written as

\[BJ_n^+ = n\cdot \max_{1\le i\le n/2} K^+(i/n, p_{(i)})\]
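a sketch of $BJ_n^+$, assuming $K^+(a, b)$ denotes the one-sided Bernoulli Kullback–Leibler divergence, i.e., $K(a, b) = a\log(a/b) + (1-a)\log\frac{1-a}{1-b}$ when $a > b$ and $0$ otherwise; this reading of $K^+$, and all names, are my own:

```python
import numpy as np

def bernoulli_kl(a, b):
    """K(a, b): Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def berk_jones_plus(pvalues):
    """BJ_n^+ = n * max_{1 <= i <= n/2} K^+(i/n, p_(i)); p-values assumed in (0, 1)."""
    p = np.sort(np.asarray(pvalues))
    n = len(p)
    m = n // 2
    a, b = np.arange(1, m + 1) / n, p[:m]
    kplus = np.where(a > b, bernoulli_kl(a, b), 0.0)  # one-sided: zero unless i/n > p_(i)
    return n * kplus.max()
```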

the detection boundary of $BJ_n^+$ is the same as that of $HC^\star$

\[\rho_{BJ}(\beta) = \rho^\star(\beta), \qquad \frac 12 < \beta < 1\]

1.7 Generalizations

the higher criticism statistic can obviously be used in a wide variety of situations and there is no need for the p-values to be derived from normally distributed Z-scores

HC for Large-Scale Inference, Especially for Rare and Weak Effects

This note is for Donoho, D., & Jin, J. (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statistical Science, 30(1), 1–25.

it reviews the basics of HC in both the testing and feature selection settings

HC adapts easily to new situations: clique detection and bivariate outlier detection

review the theoretical ideology behind HC

The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk

The RW model shows that HC has important advantages over better-known procedures such as false discovery rate (FDR) control and familywise error rate (FWER) control

1.1 HC Basics

generalize Tukey’s proposal and set

\[HC_{N,\alpha} = \sqrt{N}\,[(\text{Fraction significant at $\alpha$}) - \alpha] / \sqrt{\alpha(1-\alpha)}\]

the significance of the overall body of tests is captured in the following Higher Criticism statistic

\[HC_N^\star = \max_{0 < \alpha\le \alpha_0} HC_{N,\alpha}\]
