Suppose that $\{\theta_1, \dots, \theta_M\}$ is a $2\delta$-separated set contained in the space $\theta(\mathcal{P})$. Generate a r.v. $Z$ by the following procedure:
Sample $J$ uniformly from the index set $[M] = \{1, \dots, M\}$.
Given $J = j$, sample $Z \sim P_{\theta_j}$.
Let $\mathbb{Q}$ denote the joint distribution of the pair $(Z, J)$ generated by this procedure.
A testing function for this problem is a mapping $\psi: \mathcal{Z} \to [M]$, and the associated probability of error is given by $\mathbb{Q}[\psi(Z) \neq J]$.
(From estimation to testing) For any increasing function $\Phi$ and any choice of $2\delta$-separated set, the minimax risk is lower bounded as
$$\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \geq \Phi(\delta) \inf_{\psi} \mathbb{Q}[\psi(Z) \neq J],$$
where the infimum ranges over all testing functions.
For any $P \in \mathcal{P}$ with parameter $\theta = \theta(P)$, Markov's inequality and the monotonicity of $\Phi$ give
$$\mathbb{E}[\Phi(\rho(\hat{\theta}, \theta))] \geq \Phi(\delta)\, \mathbb{P}[\Phi(\rho(\hat{\theta}, \theta)) \geq \Phi(\delta)] \geq \Phi(\delta)\, \mathbb{P}[\rho(\hat{\theta}, \theta) \geq \delta].$$
Note that
$$\sup_{P \in \mathcal{P}} \mathbb{P}[\rho(\hat{\theta}, \theta(P)) \geq \delta] \geq \frac{1}{M} \sum_{j=1}^{M} P_{\theta_j}[\rho(\hat{\theta}, \theta_j) \geq \delta] = \mathbb{Q}[\rho(\hat{\theta}, \theta_J) \geq \delta].$$
Any estimator $\hat{\theta}$ can be used to define a test,
$$\psi(Z) = \arg\min_{\ell \in [M]} \rho(\theta_\ell, \hat{\theta}).$$
Suppose that the true parameter is $\theta_j$: we claim that on the event $\{\rho(\theta_j, \hat{\theta}) < \delta\}$ the test is correct. Indeed, for any $\ell \neq j$, the triangle inequality and the $2\delta$-separation give $\rho(\theta_\ell, \hat{\theta}) \geq \rho(\theta_\ell, \theta_j) - \rho(\theta_j, \hat{\theta}) > 2\delta - \delta = \delta > \rho(\theta_j, \hat{\theta})$, so the minimum is achieved at $\ell = j$.
Thus, conditioned on $J = j$,
$$\{\rho(\hat{\theta}, \theta_j) < \delta\} \subseteq \{\psi(Z) = j\},$$
which implies that
$$P_{\theta_j}[\rho(\hat{\theta}, \theta_j) \geq \delta] \geq P_{\theta_j}[\psi(Z) \neq j].$$
Taking averages over the index $j$, we find that
$$\mathbb{Q}[\rho(\hat{\theta}, \theta_J) \geq \delta] = \frac{1}{M} \sum_{j=1}^{M} P_{\theta_j}[\rho(\hat{\theta}, \theta_j) \geq \delta] \geq \mathbb{Q}[\psi(Z) \neq J].$$
Thus,
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P[\Phi(\rho(\hat{\theta}, \theta))] \geq \Phi(\delta)\, \mathbb{Q}[\psi(Z) \neq J].$$
Finally, take the infimum over all estimators $\hat{\theta}$ on the left-hand side, and the infimum over the induced set of tests on the right-hand side.
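To make the reduction concrete, here is a minimal numerical sketch (not from the text): a three-point Gaussian problem in which the sample mean plays the role of $\hat{\theta}$, and the induced test is the nearest-point rule. The simulated error frequencies illustrate the inequality $\mathbb{Q}[\rho(\hat{\theta}, \theta_J) \geq \delta] \geq \mathbb{Q}[\psi(Z) \neq J]$; all parameter choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, delta = 1.0, 10, 0.5
thetas = np.array([-1.0, 0.0, 1.0])   # a 2*delta-separated set, M = 3

errors_est, errors_test = [], []
for j, theta in enumerate(thetas):
    for _ in range(2000):
        Z = rng.normal(theta, sigma, size=n)
        theta_hat = Z.mean()                          # any estimator works here
        psi = np.argmin(np.abs(thetas - theta_hat))   # induced nearest-point test
        errors_est.append(abs(theta_hat - theta) >= delta)
        errors_test.append(psi != j)

# Empirically, Q[|theta_hat - theta_J| >= delta] >= Q[psi(Z) != J].
print(np.mean(errors_est), ">=", np.mean(errors_test))
```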
Some divergence measures
Total variation distance: $\|P - Q\|_{\mathrm{TV}} := \sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|$.
Let $P_{1:n} = \bigotimes_{i=1}^{n} P_i$ be the product measure defined on $\mathcal{X}^n$, and similarly for $Q_{1:n}$.
In general, it is difficult to express $\|P_{1:n} - Q_{1:n}\|_{\mathrm{TV}}$ in terms of the individual distances $\|P_i - Q_i\|_{\mathrm{TV}}$. In contrast, the KL divergence and the Hellinger distance decompose nicely over products:
$$D(P_{1:n} \| Q_{1:n}) = \sum_{i=1}^{n} D(P_i \| Q_i),$$
$$\frac{1}{2} H^2(P_{1:n} \| Q_{1:n}) = 1 - \prod_{i=1}^{n} \left(1 - \frac{1}{2} H^2(P_i \| Q_i)\right).$$
In the i.i.d. case, these reduce to
$$D(P_{1:n} \| Q_{1:n}) = n\, D(P_1 \| Q_1),$$
$$\frac{1}{2} H^2(P_{1:n} \| Q_{1:n}) = 1 - \left(1 - \frac{1}{2} H^2(P_1 \| Q_1)\right)^n \leq \frac{n}{2} H^2(P_1 \| Q_1).$$
The final inequality follows from the fact that $f(x) = (1 - x)^n - (1 - nx) \geq 0$ for $x \in [0, 1]$: indeed, $f(0) = 0$ and $f'(x) = n[1 - (1 - x)^{n-1}] \geq 0$ on this interval.
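These identities are easy to check numerically. The sketch below (illustrative choices throughout) builds the $n$-fold product of two Bernoulli distributions explicitly and verifies the KL additivity, the Hellinger product rule, and the i.i.d. upper bound.

```python
import numpy as np
from itertools import product

def kl(p, q):
    return sum(pi * np.log(pi / qi) for pi, qi in zip(p, q))

def hell2(p, q):
    # squared Hellinger distance: H^2(p||q) = sum_i (sqrt(p_i) - sqrt(q_i))^2
    return sum((np.sqrt(pi) - np.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

P1, Q1, n = [0.3, 0.7], [0.5, 0.5], 4

# Explicit product distributions over {0,1}^n.
Pn = [np.prod([P1[b] for b in bits]) for bits in product([0, 1], repeat=n)]
Qn = [np.prod([Q1[b] for b in bits]) for bits in product([0, 1], repeat=n)]

print(kl(Pn, Qn), "=", n * kl(P1, Q1))                      # KL adds up
lhs = 0.5 * hell2(Pn, Qn)
print(lhs, "=", 1 - (1 - 0.5 * hell2(P1, Q1)) ** n)          # Hellinger product rule
print(lhs, "<=", n / 2 * hell2(P1, Q1))                      # the iid upper bound
```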
Binary testing
The simplest type of testing problem, known as a binary hypothesis test, involves only two distributions.
In a binary testing problem with equally weighted hypotheses, we observe a random variable $Z$ drawn according to the mixture distribution $\mathbb{Q}_Z = \frac{1}{2} P_0 + \frac{1}{2} P_1$. For a given decision rule $\psi: \mathcal{Z} \to \{0, 1\}$, the associated probability of error is given by
$$\mathbb{Q}[\psi(Z) \neq J] = \frac{1}{2} P_0[\psi(Z) \neq 0] + \frac{1}{2} P_1[\psi(Z) \neq 1].$$
If we take the infimum of this error probability over all decision rules, we obtain a quantity known as the Bayes risk for the problem.
In the binary case, we have
$$\inf_{\psi} \mathbb{Q}[\psi(Z) \neq J] = \frac{1}{2}\left\{1 - \|P_1 - P_0\|_{\mathrm{TV}}\right\}.$$
Note that there is a one-to-one correspondence between decision rules $\psi$ and measurable partitions $(A, A^c)$ of the space $\mathcal{X}$: any rule $\psi$ is uniquely determined by the set $A = \{x \in \mathcal{X} \mid \psi(x) = 1\}$. Thus, we have
$$\sup_{\psi} \mathbb{Q}[\psi(Z) = J] = \sup_{A \subseteq \mathcal{X}} \left\{\frac{1}{2} P_1(A) + \frac{1}{2} P_0(A^c)\right\} = \frac{1}{2} \sup_{A \subseteq \mathcal{X}} \left\{P_1(A) - P_0(A)\right\} + \frac{1}{2}.$$
The result then follows from the relation
$$\sup_{\psi} \mathbb{Q}[\psi(Z) = J] = 1 - \inf_{\psi} \mathbb{Q}[\psi(Z) \neq J],$$
together with the fact that $\sup_{A \subseteq \mathcal{X}} \{P_1(A) - P_0(A)\} = \|P_1 - P_0\|_{\mathrm{TV}}$.
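On a finite space the variational formula can be checked by brute force: every test corresponds to a subset $A$, so we can enumerate all of them. A minimal sketch with two arbitrary example distributions:

```python
import numpy as np
from itertools import product

P0 = np.array([0.5, 0.3, 0.1, 0.1])   # arbitrary example distributions
P1 = np.array([0.1, 0.2, 0.3, 0.4])

tv = 0.5 * np.abs(P0 - P1).sum()      # TV via the L1 formula on finite spaces

# Each test psi corresponds to A = {x : psi(x) = 1}; error = P0(A)/2 + P1(A^c)/2.
best = min(
    0.5 * P0[np.array(A, dtype=bool)].sum() + 0.5 * P1[~np.array(A, dtype=bool)].sum()
    for A in product([0, 1], repeat=4)
)
print(best, "=", 0.5 * (1 - tv))      # Bayes risk = (1 - TV)/2
```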
Thus, for any pair of distributions $P_0, P_1 \in \mathcal{P}$ such that $\rho(\theta(P_0), \theta(P_1)) \geq 2\delta$, we have
$$\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \geq \frac{\Phi(\delta)}{2}\left[1 - \|P_1 - P_0\|_{\mathrm{TV}}\right].$$
For a fixed variance $\sigma^2$, let $P_\theta$ be the distribution of a $N(\theta, \sigma^2)$ variable; letting the mean $\theta$ vary over the real line defines the Gaussian location family $\{P_\theta, \theta \in \mathbb{R}\}$. Here we consider the problem of estimating $\theta$ under either the absolute error $|\hat{\theta} - \theta|$ or the squared error $(\hat{\theta} - \theta)^2$, using a collection $Z = (Y_1, \dots, Y_n)$ of $n$ i.i.d. samples drawn from a $N(\theta, \sigma^2)$ distribution.
Apply the two-point Le Cam bound (2) with the distributions $P_0^n$ and $P_\theta^n$, where $\theta = 2\delta$. Via the bound $\|P - Q\|_{\mathrm{TV}}^2 \leq \frac{1}{4} \chi^2(P \| Q)$ and the tensorization of the $\chi^2$-divergence for Gaussians,
$$\|P_\theta^n - P_0^n\|_{\mathrm{TV}}^2 \leq \frac{1}{4}\left\{e^{n\theta^2/\sigma^2} - 1\right\} = \frac{1}{4}\left\{e^{4n\delta^2/\sigma^2} - 1\right\}.$$
Setting $\delta = \frac{\sigma}{2\sqrt{n}}$ thus yields
$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta[|\hat{\theta} - \theta|] \geq \frac{\delta}{2}\left\{1 - \frac{1}{2}\sqrt{e - 1}\right\} \geq \frac{\delta}{6} = \frac{1}{12} \frac{\sigma}{\sqrt{n}},$$
$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta[(\hat{\theta} - \theta)^2] \geq \frac{\delta^2}{2}\left\{1 - \frac{1}{2}\sqrt{e - 1}\right\} \geq \frac{\delta^2}{6} = \frac{1}{24} \frac{\sigma^2}{n}.$$
For instance, the sample mean $\tilde{\theta}_n$ satisfies the bounds
$$\sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta[|\tilde{\theta}_n - \theta|] = \sqrt{\frac{2}{\pi}} \frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta[(\tilde{\theta}_n - \theta)^2] = \frac{\sigma^2}{n}.$$
Mean-squared error decaying as $n^{-1}$ is typical for parametric problems with a certain type of regularity; for non-regular problems, faster rates become possible.
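As a sanity check, a quick Monte Carlo sketch (arbitrary $\sigma$, $n$, and replication count) compares the sample mean's risks with both the exact expressions above and the Le Cam lower bounds:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, reps = 2.0, 50, 200_000

# By translation invariance we may take theta = 0.
err = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)

print("abs risk:", np.abs(err).mean(),
      "exact:", np.sqrt(2 / np.pi) * sigma / np.sqrt(n),
      "lower bound:", sigma / (12 * np.sqrt(n)))
print("sq. risk:", (err ** 2).mean(),
      "exact:", sigma ** 2 / n,
      "lower bound:", sigma ** 2 / (24 * n))
```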
(Uniform location family) For each $\theta \in \mathbb{R}$, let $U_\theta$ be the uniform distribution over the interval $[\theta, \theta + 1]$.
Given a pair $\theta, \theta' \in \mathbb{R}$, we compute the Hellinger distance between $U_\theta$ and $U_{\theta'}$. By symmetry, we may assume $\theta' > \theta$.
If $\theta' > \theta + 1$, the supports are disjoint and $H^2(U_\theta \| U_{\theta'}) = 2$.
If $\theta' \in (\theta, \theta + 1]$, then $H^2(U_\theta \| U_{\theta'}) = 2|\theta' - \theta|$, since the two unit-height densities disagree on a set of total length $2|\theta' - \theta|$.
Now take $|\theta' - \theta| = 2\delta := \frac{1}{4n}$; this choice scales the separation so that the Hellinger distance between the $n$-fold products remains bounded by a constant. Then
$$\frac{1}{2} H^2(U_\theta^n \| U_{\theta'}^n) \leq \frac{n}{2} \cdot 2|\theta' - \theta| = \frac{1}{4}.$$
By (1), we find that
$$\|U_\theta^n - U_{\theta'}^n\|_{\mathrm{TV}}^2 \leq H^2(U_\theta^n \| U_{\theta'}^n) \leq \frac{1}{2}.$$
From (2) with $\Phi(t) = t^2$ and $\delta = \frac{1}{8n}$, we have
$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta[(\hat{\theta} - \theta)^2] \geq \frac{1 - 1/\sqrt{2}}{128}\, n^{-2}.$$
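The following sketch (illustrative grid and constants) checks the Hellinger formula numerically, and confirms that a simple estimator, the bias-corrected sample minimum, attains squared error of order $n^{-2}$, matching the rate of the lower bound:

```python
import numpy as np

rng = np.random.default_rng(2)

# (i) Check H^2(U_theta || U_theta') = 2|theta' - theta| on overlapping supports.
t, tp = 0.0, 0.3
grid = np.linspace(-0.5, 1.8, 2_000_001)
dx = grid[1] - grid[0]
f = ((grid >= t) & (grid <= t + 1)).astype(float)
g = ((grid >= tp) & (grid <= tp + 1)).astype(float)
print(((np.sqrt(f) - np.sqrt(g)) ** 2).sum() * dx, "=", 2 * abs(tp - t))

# (ii) With theta = 0 WLOG, min of n uniforms ~ Beta(1, n), and the
# bias-corrected minimum has variance n/((n+1)^2 (n+2)) ~ n^{-2}.
for n in [10, 100, 1000]:
    mins = rng.beta(1, n, size=100_000)
    theta_hat = mins - 1.0 / (n + 1)   # bias correction: E[min] = 1/(n+1)
    print(n, (theta_hat ** 2).mean(), "~", 1.0 / n ** 2)
```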
Le Cam’s method for nonparametric problems
Suppose we wish to estimate a density at a point, say $x = 0$; the map $\theta(f) := f(0)$ is known as an evaluation functional.
The modulus of continuity of the functional $\theta$ with respect to the Hellinger distance is given by
$$\omega(\epsilon; \theta, \mathcal{F}) := \sup_{f, g \in \mathcal{F}} \left\{ |\theta(f) - \theta(g)| \;:\; H^2(f \| g) \leq \epsilon^2 \right\}.$$
(Le Cam for functionals) For any increasing function $\Phi$ on the non-negative real line and any functional $\theta: \mathcal{F} \to \mathbb{R}$, we have
$$\inf_{\hat{\theta}} \sup_{f \in \mathcal{F}} \mathbb{E}\left[\Phi(|\hat{\theta} - \theta(f)|)\right] \geq \frac{1}{4} \Phi\left(\frac{1}{2} \omega\left(\frac{1}{2\sqrt{n}}; \theta, \mathcal{F}\right)\right).$$
Let $\epsilon^2 = \frac{1}{4n}$, and choose a pair $f, g$ that achieves the supremum defining $\omega(\frac{1}{2\sqrt{n}})$. Note that
$$\|P_f^n - P_g^n\|_{\mathrm{TV}}^2 \leq H^2(P_f^n \| P_g^n) \leq n H^2(P_f \| P_g) \leq \frac{1}{4}.$$
Thus, Le Cam's bound (2) with $\delta = \frac{1}{2} \omega(\frac{1}{2\sqrt{n}})$ implies that
$$\inf_{\hat{\theta}} \sup_{f \in \mathcal{F}} \mathbb{E}\left[\Phi(|\hat{\theta} - \theta(f)|)\right] \geq \frac{1}{4} \Phi\left(\frac{1}{2} \omega\left(\frac{1}{2\sqrt{n}}\right)\right).$$
Consider the class $\mathcal{F}$ of densities on $[-\frac{1}{2}, \frac{1}{2}]$ that are bounded uniformly away from zero and are Lipschitz with constant one. Our goal is to estimate the linear functional $\theta(f) = f(0)$.
It suffices to lower bound $\omega(\frac{1}{2\sqrt{n}}; \theta, \mathcal{F})$: we choose a pair $f_0, g \in \mathcal{F}$ with $H^2(f_0 \| g) \leq \frac{1}{4n}$, and then evaluate the difference $|\theta(f_0) - \theta(g)|$.
Let $f_0 \equiv 1$. For a parameter $\delta \in (0, \frac{1}{6}]$, consider the function
$$\phi(x) = \begin{cases} \delta - |x| & \text{for } |x| \leq \delta, \\ |x - 2\delta| - \delta & \text{for } x \in [\delta, 3\delta], \\ 0 & \text{otherwise.} \end{cases}$$
By construction, the function $\phi$ is 1-Lipschitz, uniformly bounded with $\|\phi\|_\infty = \delta \leq \frac{1}{6}$, and integrates to zero. Thus, the perturbed function $g = f_0 + \phi$ is a density belonging to our class, and by construction, $|\theta(f_0) - \theta(g)| = \phi(0) = \delta$.
To control the squared Hellinger distance, write
$$\frac{1}{2} H^2(f_0 \| g) = 1 - \int_{-1/2}^{1/2} \sqrt{1 + \phi(t)}\, dt.$$
Define the function $\Psi(u) = \sqrt{1 + u}$, and note that $|\Psi''(u)| = \frac{1}{4}(1 + u)^{-3/2} \leq \frac{3}{8}$ for all $u \geq -\frac{1}{6}$. By a Taylor series expansion and [Taylor's inequality](http://mathworld.wolfram.com/TaylorsInequality.html),
$$\frac{1}{2} H^2(f_0 \| g) = \int_{-1/2}^{1/2} \left[\Psi(0) - \Psi(\phi(t))\right] dt \leq \int_{-1/2}^{1/2} \left[-\Psi'(0) \phi(t) + \frac{3}{16} \phi^2(t)\right] dt.$$
Observe that
$$\int_{-1/2}^{1/2} \phi(t)\, dt = 0 \quad \text{and} \quad \int_{-1/2}^{1/2} \phi^2(t)\, dt = 4 \int_0^\delta (\delta - x)^2\, dx = \frac{4}{3} \delta^3.$$
Thus,
$$H^2(f_0 \| g) \leq 2 \cdot \frac{3}{16} \cdot \frac{4}{3} \delta^3 = \frac{1}{2} \delta^3.$$
Consequently, setting $\delta^3 = \frac{1}{2n}$ ensures that $H^2(f_0 \| g) \leq \frac{1}{4n}$.
Hence, since $\omega(\frac{1}{2\sqrt{n}}) \geq \delta = (2n)^{-1/3}$,
$$\inf_{\hat{\theta}} \sup_{f \in \mathcal{F}} \mathbb{E}[(\hat{\theta} - f(0))^2] \geq \frac{1}{16} \omega^2\left(\frac{1}{2\sqrt{n}}\right) \succsim n^{-2/3}.$$
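A short numerical check of the construction (the grid is an arbitrary discretization, and $\delta$ is set to its largest allowed value): it verifies that $\phi$ integrates to zero, that $\int \phi^2 = \frac{4}{3}\delta^3$, and that the Hellinger bound $H^2 \leq \frac{1}{2}\delta^3$ holds.

```python
import numpy as np

delta = 1 / 6   # largest value allowed by the construction

def phi(x):
    x = np.asarray(x)
    up = np.where(np.abs(x) <= delta, delta - np.abs(x), 0.0)
    down = np.where((x > delta) & (x <= 3 * delta),
                    np.abs(x - 2 * delta) - delta, 0.0)
    return up + down

grid = np.linspace(-0.5, 0.5, 1_000_001)
dx = grid[1] - grid[0]
p = phi(grid)

print("integral of phi  :", p.sum() * dx, "(should be ~0)")
print("integral of phi^2:", (p ** 2).sum() * dx, "=", 4 / 3 * delta ** 3)

# Squared Hellinger distance between f0 = 1 and g = 1 + phi.
H2 = ((1.0 - np.sqrt(1.0 + p)) ** 2).sum() * dx
print("H^2:", H2, "<=", 0.5 * delta ** 3)
```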
Le Cam’s convex hull method
Consider two subsets $\mathcal{P}_0$ and $\mathcal{P}_1$ of $\mathcal{P}$ that are $2\delta$-separated, in the sense that
$$\rho(\theta(P_0), \theta(P_1)) \geq 2\delta \quad \text{for all } P_0 \in \mathcal{P}_0 \text{ and } P_1 \in \mathcal{P}_1.$$
For any $2\delta$-separated classes of distributions $\mathcal{P}_0$ and $\mathcal{P}_1$ contained within $\mathcal{P}$, any estimator $\hat{\theta}$ has worst-case risk at least
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P[\rho(\hat{\theta}, \theta(P))] \geq \frac{\delta}{2} \sup_{\substack{P_0 \in \mathrm{conv}(\mathcal{P}_0) \\ P_1 \in \mathrm{conv}(\mathcal{P}_1)}} \left\{1 - \|P_0 - P_1\|_{\mathrm{TV}}\right\}.$$
(Sharpened bounds for the Gaussian location family) A key step earlier was to upper bound the TV distance $\|P_\theta^n - P_0^n\|_{\mathrm{TV}}$ between the $n$-fold product distributions based on the Gaussian models $N(\theta, \sigma^2)$ and $N(0, \sigma^2)$.
Set $\theta = 2\delta$ and consider the two families $\mathcal{P}_0 = \{P_0^n\}$ and $\mathcal{P}_1 = \{P_\theta^n, P_{-\theta}^n\}$. The mixture distribution $\bar{P} := \frac{1}{2} P_\theta^n + \frac{1}{2} P_{-\theta}^n$ belongs to $\mathrm{conv}(\mathcal{P}_1)$, and it satisfies
$$\|\bar{P} - P_0^n\|_{\mathrm{TV}}^2 \leq \frac{1}{4}\left\{e^{\frac{1}{2}\left(\frac{\sqrt{n}\,\theta}{\sigma}\right)^4} - 1\right\} = \frac{1}{4}\left\{e^{\frac{1}{2}\left(\frac{2\sqrt{n}\,\delta}{\sigma}\right)^4} - 1\right\}.$$
Setting $\delta = \frac{\sigma t}{2\sqrt{n}}$ for some parameter $t > 0$ to be chosen, the convex hull Le Cam bound yields
$$\min_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta[|\hat{\theta} - \theta|] \geq \frac{\sigma}{4\sqrt{n}} \sup_{t > 0} \left\{t \left(1 - \frac{1}{2}\sqrt{e^{t^4/2} - 1}\right)\right\} \geq \frac{3}{20} \frac{\sigma}{\sqrt{n}}.$$
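The supremum over $t$ can be located numerically; a short sketch (arbitrary grid) confirms that it exceeds $0.6$, which gives the stated constant $\frac{3}{20}$ after dividing by $4$:

```python
import numpy as np

t = np.linspace(0.01, 1.3, 10_000)
obj = t * (1 - 0.5 * np.sqrt(np.exp(t ** 4 / 2) - 1))
print("sup ~", obj.max(), "at t ~", t[np.argmax(obj)])   # ~0.62 >= 0.6
```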
Fano’s method
The mutual information between the random variables $Z$ and $J$ is defined using the KL divergence as the underlying measure of distance:
$$I(Z; J) := D(\mathbb{Q}_{Z,J} \| \mathbb{Q}_Z \mathbb{Q}_J),$$
where $\mathbb{Q}_{Z,J}$ is the joint distribution of the pair and $\mathbb{Q}_Z, \mathbb{Q}_J$ are its marginals.
Our set-up: lower bounding the probability of error in an $M$-ary hypothesis testing problem, based on a family of distributions $\{P_{\theta_1}, \dots, P_{\theta_M}\}$.
The mutual information can be written in terms of the component distributions $P_{\theta_j}, j \in [M]$, and the mixture distribution $\bar{Q} := \frac{1}{M} \sum_{j=1}^{M} P_{\theta_j}$:
$$I(Z; J) = \frac{1}{M} \sum_{j=1}^{M} D(P_{\theta_j} \| \bar{Q}).$$
The Fano method is based on the following lower bound on the error probability in an $M$-ary testing problem, applicable when $J$ is uniformly distributed over the index set:
$$\mathbb{P}[\psi(Z) \neq J] \geq 1 - \frac{I(Z; J) + \log 2}{\log M}.$$
Let $\{\theta_1, \dots, \theta_M\}$ be a $2\delta$-separated set in the $\rho$ semi-metric on $\theta(\mathcal{P})$, suppose that $J$ is uniformly distributed over the index set $\{1, \dots, M\}$, and let $(Z \mid J = j) \sim P_{\theta_j}$. Then for any increasing function $\Phi: [0, \infty) \to [0, \infty)$, the minimax risk is lower bounded as
$$\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \geq \Phi(\delta) \left\{1 - \frac{I(Z; J) + \log 2}{\log M}\right\},$$
where $I(Z; J)$ is the mutual information between $Z$ and $J$.
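As an illustration of how the bound is deployed, the sketch below applies Fano's method to the Gaussian location family with a uniformly spaced grid of means. It upper bounds the mutual information via $I(Z; J) \leq \frac{1}{M^2} \sum_{j,k} D(P_{\theta_j}^n \| P_{\theta_k}^n)$, which follows from the mixture representation above by convexity of the KL divergence in its second argument. All numerical choices are illustrative, and this naive grid does not optimize the resulting constant.

```python
import numpy as np

sigma, n, delta, M = 1.0, 100, 0.01, 16
thetas = 2 * delta * np.arange(M)        # a 2*delta-separated grid of means

# Pairwise KL divergences for n-fold Gaussian products: n (a - b)^2 / (2 sigma^2).
D = n * (thetas[:, None] - thetas[None, :]) ** 2 / (2 * sigma ** 2)
I_upper = D.mean()                       # convexity: I(Z;J) <= (1/M^2) sum_{j,k} D_{jk}

fano = 1 - (I_upper + np.log(2)) / np.log(M)
print("testing error  >=", fano)
print("minimax sq-err >=", delta ** 2 * max(fano, 0.0))
```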