WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Bootstrap Hypothesis Testing

Posted on 0 Comments
Tags: Bootstrap, Hypothesis Testing

This report is motivated by comments under Larry’s post, Modern Two-Sample Tests.

Larry’s post

Consider two independent samples, $X_1,\ldots,X_n\sim P$ and $Y_1,\ldots,Y_m\sim Q$. To test $H_0:P=Q$ versus $H_1:P\neq Q$.

Larry highlighted the importance and amazement of Permutation Method and three innovated test statistics.

Kernel Tests

Firstly choose a kernel $K$, such as the Gaussian kernel,

The test statistic is

where $K_h(x,y)$ can be seen as a measure of similarity between $x$ and $y$. To avoid the tunning parameter $h$, we can choose

Energy Test

Based on estimating the following distance between $P$ and $Q$:

where $X,X’\sim P$ and $Y,Y’\sim Q$. The test statistic is a sample estimate of this distance.

The Cross-Match Test

Ignore the labels and put the data into non-overlapping pairs. Let $a_0$ be the number of pairs of type $(0,0)$, let $a_2$ be the number of pairs of type $(1,1)$, and let $a_1$ be the number of pairs of type $(0,1)$ and $(1,0)$.



After reading the great post, I continue to go through the comments, and one of Larry’s responses attracts me,

The permutation test is exact. The bootstrap is only approximate.

Although Larry had explained the above conclusion, such as,

To sivaramanb

It is exact no matter how many random permutations you use. Exact means: Pr(type I error ) <= alpha

The only assumptions are i.i.d. No stronger than usual

To Anonymous

They are similar but not the same. The bootstrap samples n observations from the empirical distribution. (Actually, in a testing problem, the empirical has to be corrected to be consistent to the null hypothesis). The type I error goes to 0 as the sample size goes to infinity. In the permutation test, the type I error is less than alpha. No large sample approximation needed.

I also confused. So I resort to Google for help. Noa Haas’s slide helps me a lot, and find more formerly discussion in Efron and Tibshirani (1994).

Bootstrap Hypothesis Testing

Two sample test

Just in the setting of Larry’s post, we want to test $H_0:P=Q$ versus $H_1:P\neq Q$.

In the permutation test, the distribution under the null hypothesis $F_0$ is defined as the distribution of possible orderings of the labels, while Bootstrap uses a “plug-in” style estimate for $F_0$.

Quote Efron and Tibshirani (1994)’s discussion about the relationship between the permutation test and the bootstrap.

“A permutation test exploits special symmetry that exists under the null hypothesis to create a permutation distribution of the test statistic. For example, in the two-sample problem when testing $P=Q$, all permutations of the order statistic of the combined sample are equally probable. As a result of this symmetry, the ASL from a permutation test is exact: in the two-sample problem, $\mathrm{ASL}_{\mathrm{perm}}$ is the exact probability of obtaining a test statistic as extreme as the one observed, having fixed the data values of the combined sample.

“In contrast, the bootstrap explicitly estimates the probability mechanism under the null hypothesis, and then samples from it to estimate the ASL. The estimate $\widehat{\mathrm{ASL}}_{\mathrm{boot}}$ has no interpretation as an exact probability, but like all bootstrap estimates is only guaranteed to be accurate as the sample size goes to infinity. On the other hand, the bootstrap hypothesis test does not require the special symmetry that is needed for a permutation test, and so can be applied much more generally.

One sample test

Suppose we have sample $\x=(x_1,x_2,\ldots,x_n)$, and want to test $H_0:\mu\neq \mu_0$ against $\H_1:\mu\neq \mu_0$. In this case, we don’t have a symmetry, and hence the permutation test in unavailable. Suppose the test statistic is

To perform Bootstrap hypothesis testing, we need a distribution that estimates the population of treatment times under $H_0$. Note first that the empirical distribution $\hat F$ is not an appropriate estimate for $F$ because it does not obey $H_0$. That is, the mean of $\hat F$ is not necessarily equal to $\mu_0$. A simple way is to translate the empirical distribution $\hat F$ so that it has the desired mean, i.e.,

Then sample $\tilde x_1^*,\ldots,\tilde x_n^*$ with replacement from $\tilde x_1,\ldots,\tilde x_n$, and for each bootstrap sample compute the test statistic


Problem 16.4 of Efron and Tibshirani (1994).

Generate 100 samples of size 7 from a normal distribution with mean 129.0 and standard deviation 66.8. For each sample, perform a bootstrap hypothesis test by using the empirical distribution and the translated empirical distribution.

Compute the average of ASL for each test, averaged over the 100 simulations. And repeat with mean 170.

using StatsBase
using Distributions

# generate data
n = 7
μ = 129.0
σ = 66.8

# bootstrap test 
function bootTest(data; B = 1000, trans::Bool = true)
    t = ones(B)
    x = copy(data)
    avg = mean(x)
    t_obs = sqrt(n) * (avg - μ) / sqrt(var(data))
    if trans
        x .= x .- avg .+ μ 
    for i = 1:B
        # sample 
        idx = sample(1:n, n)
        x_star = x[idx]
        σ_star = sqrt(var(x_star))
        t[i] = sqrt(n) * (mean(x_star) - μ) / σ_star
    return sum(t .> t_obs) / B
bootTest (generic function with 1 method)
function repx(mu, n = 100)
    p1 = ones(100)
    p2 = ones(100)
    for i = 1: 100
        P = Normal(mu, σ)
        x = rand(P, n)
        p1[i] = bootTest(x, trans = false)
        p2[i] = bootTest(x, trans = true)
    return mean(p1), mean(p2)
repx (generic function with 2 methods)

If mu = 129.0, we should not reject the null hypothesis.

(0.49734, 0.5011800000000001)

If mu = 170.0, the p-value should be much smaller than the previous one, and maybe we can reject the null hypothesis.

(0.58353, 0.16352)

Similarly, for the much larger mu, it is reasonable to reject $H_0$.

(0.6929299999999999, 0.00791)

It turns out that it is necessary to use the translated empirical distribution while performing bootstrap hypothesis testing, otherwise we would make mistakes.


Efron B, Tibshirani RJ. An introduction to the bootstrap. CRC press; 1994 May 15.

Published in categories Report