

Minimax Lower Bounds

Tags: Minimax

This note is based on Chapter 15 of Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge: Cambridge University Press.

The minimax risk of estimating $\theta(P)$ over the family $\mathcal{P}$, measured either in the (semi-)metric $\rho$ or in a transformed loss $\Phi\circ\rho$ for an increasing function $\Phi$, is defined as

$$\mathfrak{M}(\theta(\mathcal{P});\rho):=\inf_{\hat\theta}\sup_{P\in\mathcal{P}}\mathbb{E}\big[\rho\big(\hat\theta,\theta(P)\big)\big],\qquad \mathfrak{M}(\theta(\mathcal{P});\Phi\circ\rho):=\inf_{\hat\theta}\sup_{P\in\mathcal{P}}\mathbb{E}\big[\Phi\big(\rho(\hat\theta,\theta(P))\big)\big].$$

Suppose that $\{\theta^1,\dots,\theta^M\}$ is a $2\delta$-separated set contained in the space $\theta(\mathcal{P})$. Generate a random variable $Z$ by the following procedure:

  1. Sample $J$ uniformly from the index set $[M]=\{1,\dots,M\}$.
  2. Given $J=j$, sample $Z\sim P_{\theta^j}$.

Let $\mathbb{Q}$ denote the joint distribution of the pair $(Z,J)$ generated by this procedure.

A testing function for this problem is a mapping $\psi:\mathcal{Z}\to[M]$, and the associated probability of error is given by $\mathbb{Q}[\psi(Z)\neq J]$.
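To make the reduction concrete, here is a minimal Monte Carlo sketch of my own (the Gaussian family, the separation $2\delta$, and the nearest-point test are arbitrary illustrative choices, not from the book): it generates $(Z,J)$ by the two-step procedure above and estimates the error probability $\mathbb{Q}[\psi(Z)\neq J]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical family: M Gaussians N(theta_j, 1) with 2*delta-separated means.
M, delta, n_trials = 5, 1.0, 100_000
thetas = 2 * delta * np.arange(M)

J = rng.integers(M, size=n_trials)            # step 1: J ~ Uniform([M])
Z = rng.normal(loc=thetas[J], scale=1.0)      # step 2: Z | J = j ~ P_{theta_j}

# A natural test psi: map Z to the closest candidate parameter.
psi = np.argmin(np.abs(Z[:, None] - thetas[None, :]), axis=1)

print("estimated error probability Q[psi(Z) != J]:", np.mean(psi != J))
```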

(From estimation to testing) For any increasing function $\Phi$ and any choice of $2\delta$-separated set, the minimax risk is lower bounded as
$$\mathfrak{M}(\theta(\mathcal{P});\Phi\circ\rho)\ge\Phi(\delta)\,\inf_\psi \mathbb{Q}[\psi(Z)\neq J],$$
where the infimum ranges over all test functions.
For any $P\in\mathcal{P}$ with parameter $\theta=\theta(P)$, we have
$$\mathbb{E}\big[\Phi(\rho(\hat\theta,\theta))\big]\ge\Phi(\delta)\,\mathbb{P}\big[\Phi(\rho(\hat\theta,\theta))\ge\Phi(\delta)\big]\ge\Phi(\delta)\,\mathbb{P}\big[\rho(\hat\theta,\theta)\ge\delta\big].$$
Note that
$$\sup_{P\in\mathcal{P}}\mathbb{P}\big[\rho(\hat\theta,\theta(P))\ge\delta\big]\ge\frac1M\sum_{j=1}^M P_{\theta^j}\big[\rho(\hat\theta,\theta^j)\ge\delta\big]=\mathbb{Q}\big[\rho(\hat\theta,\theta^J)\ge\delta\big].$$
Any estimator $\hat\theta$ can be used to define a test
$$\psi(Z)=\arg\min_{\ell\in[M]}\rho(\theta^\ell,\hat\theta).$$
Suppose that the true parameter is $\theta^j$: we then claim that the event $\{\rho(\theta^j,\hat\theta)<\delta\}$ ensures that the test is correct. Indeed, on this event, for any $\ell\neq j$ the triangle inequality and the $2\delta$-separation give $\rho(\theta^\ell,\hat\theta)\ge\rho(\theta^\ell,\theta^j)-\rho(\theta^j,\hat\theta)>2\delta-\delta=\delta>\rho(\theta^j,\hat\theta)$, so the minimizer is $j$. Thus, conditioned on $J=j$,
$$\{\rho(\hat\theta,\theta^j)<\delta\}\subseteq\{\psi(Z)=j\},$$
which implies that $P_{\theta^j}[\rho(\hat\theta,\theta^j)\ge\delta]\ge P_{\theta^j}[\psi(Z)\neq j]$. Taking averages over the index $j$, we find that
$$\mathbb{Q}\big[\rho(\hat\theta,\theta^J)\ge\delta\big]=\frac1M\sum_{j=1}^M P_{\theta^j}\big[\rho(\hat\theta,\theta^j)\ge\delta\big]\ge\mathbb{Q}\big[\psi(Z)\neq J\big].$$
Thus,
$$\sup_{P\in\mathcal{P}}\mathbb{E}_P\big[\Phi(\rho(\hat\theta,\theta))\big]\ge\Phi(\delta)\,\mathbb{Q}\big[\psi(Z)\neq J\big].$$
Finally, take the infimum over all estimators $\hat\theta$ on the left-hand side, and the infimum over the induced set of tests on the right-hand side.

Some divergence measures

  • Total variation distance: $\|P-Q\|_{\rm TV}:=\sup_{A\subseteq\mathcal{X}}|P(A)-Q(A)|$.
  • Kullback–Leibler divergence: $D(Q\,\|\,P)=\int_{\mathcal{X}}q(x)\log\frac{q(x)}{p(x)}\,\nu(dx)$.
  • Squared Hellinger distance: $H^2(P\,\|\,Q)=\int\big(\sqrt{p(x)}-\sqrt{q(x)}\big)^2\,\nu(dx)$.

$$\|P-Q\|_{\rm TV}\le\sqrt{\tfrac12 D(Q\,\|\,P)},\qquad \|P-Q\|_{\rm TV}\le H(P\,\|\,Q)\sqrt{1-\frac{H^2(P\,\|\,Q)}{4}}.\tag{1}$$
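As a sanity check on both inequalities (my own sketch; the ten-point Dirichlet-generated distributions are arbitrary), note that for discrete distributions the TV distance is half the $\ell_1$ distance:

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    return 0.5 * np.sum(np.abs(p - q))        # equals sup_A |P(A) - Q(A)|

def kl(q, p):
    return np.sum(q * np.log(q / p))          # D(Q || P)

def hellinger_sq(p, q):
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    d, H2 = tv(p, q), hellinger_sq(p, q)
    assert d <= np.sqrt(0.5 * kl(q, p)) + 1e-12              # Pinsker
    assert d <= np.sqrt(H2) * np.sqrt(1 - H2 / 4) + 1e-12    # TV vs Hellinger
print("both inequalities hold on all random trials")
```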

Let $P_{1:n}=\prod_{i=1}^n P_i$ be the product measure defined on $\mathcal{X}^n$, and similarly $Q_{1:n}$.

  • In general, it is difficult to express $\|P_{1:n}-Q_{1:n}\|_{\rm TV}$ in terms of the individual distances $\|P_i-Q_i\|_{\rm TV}$.
  • $D(P_{1:n}\,\|\,Q_{1:n})=\sum_{i=1}^n D(P_i\,\|\,Q_i)$.
  • $H^2(P_{1:n}\,\|\,Q_{1:n})/2=1-\prod_{i=1}^n\big(1-H^2(P_i\,\|\,Q_i)/2\big)$.

In the iid case

  • $D(P_{1:n}\,\|\,Q_{1:n})=n\,D(P_1\,\|\,Q_1)$.
  • $\frac12 H^2(P_{1:n}\,\|\,Q_{1:n})=1-\big(1-\frac12 H^2(P_1\,\|\,Q_1)\big)^n\le\frac n2 H^2(P_1\,\|\,Q_1)$. To see the inequality, let $f(x)=(1-x)^n-(1-nx)$; one can easily check that $f(x)\ge 0$ for $x\in[0,1]$. Both the product identities and this i.i.d. bound are checked numerically in the sketch below.
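Here is a small discrete check (a sketch of my own, using a two-factor product on a four-letter alphabet):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

def kl(p, q):
    return np.sum(p * np.log(p / q))          # D(P || Q)

def hellinger_sq(p, q):
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# Two pairs of distributions on a 4-letter alphabet.
P = [rng.dirichlet(np.ones(4)) for _ in range(2)]
Q = [rng.dirichlet(np.ones(4)) for _ in range(2)]

# Explicit product measures P_{1:2} and Q_{1:2} on the product alphabet.
P12 = np.array([P[0][i] * P[1][j] for i, j in product(range(4), repeat=2)])
Q12 = np.array([Q[0][i] * Q[1][j] for i, j in product(range(4), repeat=2)])

# KL tensorizes additively; Hellinger tensorizes multiplicatively.
print(kl(P12, Q12), kl(P[0], Q[0]) + kl(P[1], Q[1]))
print(hellinger_sq(P12, Q12) / 2,
      1 - (1 - hellinger_sq(P[0], Q[0]) / 2) * (1 - hellinger_sq(P[1], Q[1]) / 2))

# In the i.i.d. case, 1 - (1 - H^2/2)^n <= n * H^2 / 2 for every n.
H2, n = hellinger_sq(P[0], Q[0]), 10
print(1 - (1 - H2 / 2) ** n, n * H2 / 2)
```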

Binary testing

The simplest type of testing problem, known as a binary hypothesis test, involves only two distributions.

In a binary testing problem with equally weighted hypotheses, we observe a random variable $Z$ drawn according to the mixture distribution $\mathbb{Q}_Z=\frac12 P_0+\frac12 P_1$. For a given decision rule $\psi:\mathcal{Z}\to\{0,1\}$, the associated probability of error is given by

$$\mathbb{Q}[\psi(Z)\neq J]=\tfrac12 P_0[\psi(Z)\neq 0]+\tfrac12 P_1[\psi(Z)\neq 1].$$

If we take the infimum of this error probability over all decision rules, we obtain a quantity known as the Bayes risk for the problem.

In the binary case, we have $\inf_\psi \mathbb{Q}[\psi(Z)\neq J]=\frac12\big\{1-\|P_1-P_0\|_{\rm TV}\big\}$.
Note that there is a one-to-one correspondence between decision rules $\psi$ and measurable partitions $(A,A^c)$ of the space $\mathcal{X}$. In particular, any rule $\psi$ is uniquely determined by the set $A=\{x\in\mathcal{X}\mid\psi(x)=1\}$. Thus, we have
$$\sup_\psi \mathbb{Q}[\psi(Z)=J]=\sup_{A\subseteq\mathcal{X}}\Big\{\tfrac12 P_1(A)+\tfrac12 P_0(A^c)\Big\}=\tfrac12\sup_{A\subseteq\mathcal{X}}\big\{P_1(A)-P_0(A)\big\}+\tfrac12.$$
The result then follows from $\sup_\psi \mathbb{Q}[\psi(Z)=J]=1-\inf_\psi \mathbb{Q}[\psi(Z)\neq J]$.
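A quick numerical check of this identity (my own sketch on an arbitrary pair of discrete distributions): the optimal rule declares $1$ exactly where $p_1>p_0$, and its error probability coincides with $\frac12\{1-\|P_1-P_0\|_{\rm TV}\}$.

```python
import numpy as np

rng = np.random.default_rng(3)
p0 = rng.dirichlet(np.ones(8))   # P_0 on an 8-point space
p1 = rng.dirichlet(np.ones(8))   # P_1 on the same space

# Error of the likelihood-ratio rule psi(x) = 1{p1(x) > p0(x)}:
# (1/2) * P0[psi = 1] + (1/2) * P1[psi = 0] = (1/2) * sum_x min(p0, p1).
bayes_error = 0.5 * np.sum(np.minimum(p0, p1))

tv = 0.5 * np.sum(np.abs(p1 - p0))
print(bayes_error, 0.5 * (1 - tv))   # the two numbers agree
```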

Thus, for any pair of distributions $P_0,P_1\in\mathcal{P}$ such that $\rho\big(\theta(P_0),\theta(P_1)\big)\ge 2\delta$, we have

$$\mathfrak{M}(\theta(\mathcal{P});\Phi\circ\rho)\ge\frac{\Phi(\delta)}{2}\big[1-\|P_1-P_0\|_{\rm TV}\big].\tag{2}$$

For a fixed variance $\sigma^2$, let $P_\theta$ be the distribution of a $N(\theta,\sigma^2)$ variable; letting the mean $\theta$ vary over the real line defines the Gaussian location family $\{P_\theta,\theta\in\mathbb{R}\}$. Here we consider the problem of estimating $\theta$ under either the absolute error $|\hat\theta-\theta|$ or the squared error $(\hat\theta-\theta)^2$, using a collection $Z=(Y_1,\dots,Y_n)$ of $n$ i.i.d. samples drawn from a $N(\theta,\sigma^2)$ distribution. Apply the two-point Le Cam bound (2) with the distributions $P_0^n$ and $P_\theta^n$, where $\theta=2\delta$. Note that
$$\|P_\theta^n-P_0^n\|^2_{\rm TV}\le\frac14\big\{e^{n\theta^2/\sigma^2}-1\big\}=\frac14\big\{e^{4n\delta^2/\sigma^2}-1\big\}.$$
Setting $\delta=\frac12\frac{\sigma}{\sqrt n}$ thus yields
$$\inf_{\hat\theta}\sup_{\theta\in\mathbb{R}}\mathbb{E}_\theta\big[|\hat\theta-\theta|\big]\ge\frac{\delta}{2}\Big\{1-\frac12\sqrt{e-1}\Big\}\ge\frac{\delta}{6}=\frac{1}{12}\frac{\sigma}{\sqrt n},\qquad \inf_{\hat\theta}\sup_{\theta\in\mathbb{R}}\mathbb{E}_\theta\big[(\hat\theta-\theta)^2\big]\ge\frac{\delta^2}{2}\Big\{1-\frac12\sqrt{e-1}\Big\}\ge\frac{\delta^2}{6}=\frac{1}{24}\frac{\sigma^2}{n}.$$
For instance, the sample mean $\tilde\theta_n$ satisfies the bounds
$$\sup_{\theta\in\mathbb{R}}\mathbb{E}_\theta\big[|\tilde\theta_n-\theta|\big]=\sqrt{\frac{2}{\pi}}\frac{\sigma}{\sqrt n}\quad\text{and}\quad \sup_{\theta\in\mathbb{R}}\mathbb{E}_\theta\big[(\tilde\theta_n-\theta)^2\big]=\frac{\sigma^2}{n}.$$
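For comparison with the $\frac{1}{24}\frac{\sigma^2}{n}$ lower bound, here is a small simulation of my own of the sample mean's squared-error risk (the values of $\sigma$, $n$, and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, reps = 2.0, 50, 100_000

# By translation invariance, the risk of the sample mean is the same for
# every theta, so we simulate at theta = 0.
samples = rng.normal(0.0, sigma, size=(reps, n))
theta_hat = samples.mean(axis=1)

print("MC risk of the sample mean:", np.mean(theta_hat ** 2))   # ~ sigma^2 / n
print("sigma^2 / n               :", sigma ** 2 / n)
print("Le Cam lower bound        :", sigma ** 2 / (24 * n))
```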

Mean-squared error decaying as $n^{-1}$ is typical for parametric problems with a certain type of regularity. For other, non-regular problems, faster rates become possible.

(Uniform location family) For each $\theta\in\mathbb{R}$, let $U_\theta$ be the uniform distribution over the interval $[\theta,\theta+1]$. Given a pair $\theta,\theta'\in\mathbb{R}$, compute the Hellinger distance between $U_\theta$ and $U_{\theta'}$. By symmetry, we only consider $\theta'>\theta$. If $\theta'>\theta+1$, then $H^2(U_\theta\,\|\,U_{\theta'})=2$. If $\theta'\in(\theta,\theta+1]$, then $H^2(U_\theta\,\|\,U_{\theta'})=2|\theta'-\theta|$. Take $|\theta'-\theta|=2\delta:=\frac{1}{4n}$ (why? to cancel out the $n$ in the next display?), so that
$$\tfrac12 H^2(U^n_\theta\,\|\,U^n_{\theta'})\le\frac n2\cdot 2|\theta'-\theta|=\frac14.$$
By (1), we find that
$$\|U^n_\theta-U^n_{\theta'}\|^2_{\rm TV}\le H^2(U^n_\theta\,\|\,U^n_{\theta'})\le\frac12.$$
From (2) with $\Phi(t)=t^2$, we have
$$\inf_{\hat\theta}\sup_{\theta\in\mathbb{R}}\mathbb{E}_\theta\big[(\hat\theta-\theta)^2\big]\ge\frac{\big(1-1/\sqrt2\big)}{128\,n^2}.$$
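To see that this $n^{-2}$ rate is of the right order, here is a quick simulation of my own (the estimator $\hat\theta=\min_i Y_i$ is just an illustrative choice, not part of the derivation); its risk equals $\frac{2}{(n+1)(n+2)}$, which indeed decays like $n^{-2}$:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, reps = 0.3, 50_000

for n in (10, 20, 40, 80):
    Y = rng.uniform(theta, theta + 1, size=(reps, n))
    theta_hat = Y.min(axis=1)                  # simple estimator of theta
    risk = np.mean((theta_hat - theta) ** 2)   # approx 2 / ((n+1)(n+2))
    print(n, risk, risk * n ** 2)              # risk * n^2 stays roughly constant
```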

Le Cam’s method for nonparametric problems.

Estimate a density at a point, say $x=0$, in which case $\theta(f):=f(0)$ is known as an evaluation functional.

The modulus of continuity of the functional $\theta$ over $\mathcal{F}$, measured in the Hellinger distance, is given by

$$\omega(\epsilon;\theta,\mathcal{F}):=\sup_{f,g\in\mathcal{F}}\big\{|\theta(f)-\theta(g)|\;\big|\;H^2(f\,\|\,g)\le\epsilon^2\big\}.$$
(Le Cam for functionals) For any increasing function $\Phi$ on the non-negative real line and any functional $\theta:\mathcal{F}\to\mathbb{R}$, we have
$$\inf_{\hat\theta}\sup_{f\in\mathcal{F}}\mathbb{E}\Big[\Phi\big(|\hat\theta-\theta(f)|\big)\Big]\ge\frac14\,\Phi\Big(\frac12\,\omega\Big(\frac{1}{2\sqrt n};\theta,\mathcal{F}\Big)\Big).$$
Let $\epsilon^2=1/(4n)$, and choose a pair $f,g$ that achieve the supremum defining $\omega\big(\frac{1}{2\sqrt n}\big)$. Note that
$$\|P_f^n-P_g^n\|^2_{\rm TV}\le H^2(P_f^n\,\|\,P_g^n)\le n\,H^2(P_f\,\|\,P_g)\le\frac14.$$
Thus, Le Cam's bound (2) with $\delta=\frac12\omega\big(\frac{1}{2\sqrt n}\big)$ implies that
$$\inf_{\hat\theta}\sup_{f\in\mathcal{F}}\mathbb{E}\Big[\Phi\big(|\hat\theta-\theta(f)|\big)\Big]\ge\frac14\,\Phi\Big(\frac12\,\omega\Big(\frac{1}{2\sqrt n}\Big)\Big).$$
Consider the class $\mathcal{F}$ of densities on $[-\frac12,\frac12]$ that are bounded uniformly away from zero and are Lipschitz with constant one. Our goal is to estimate the linear functional $\theta(f)=f(0)$. It suffices to lower bound $\omega\big(\frac{1}{2\sqrt n};\theta,\mathcal{F}\big)$: choose a pair $f_0,g\in\mathcal{F}$ with $H^2(f_0\,\|\,g)\le\frac{1}{4n}$, and then evaluate the difference $|\theta(f_0)-\theta(g)|$. Let $f_0\equiv 1$. For a parameter $\delta\in(0,\frac16]$, consider the function
$$\phi(x)=\begin{cases}\delta-|x| & \text{for } |x|\le\delta,\\ |x-2\delta|-\delta & \text{for } x\in[\delta,3\delta],\\ 0 & \text{otherwise.}\end{cases}$$
By construction, the function $\phi$ is 1-Lipschitz, uniformly bounded with $\|\phi\|_\infty=\delta\le\frac16$, and integrates to zero. Thus, the perturbed function $g=f_0+\phi$ is a density belonging to our class, and by construction $|\theta(f_0)-\theta(g)|=\delta$. To control the squared Hellinger distance, write
$$\tfrac12 H^2(f_0\,\|\,g)=1-\int_{-1/2}^{1/2}\sqrt{1+\phi(t)}\,dt.$$
Define the function $\Psi(u)=\sqrt{1+u}$, and note that $|\Psi''(u)|\le\frac14$ over the relevant range of $u$. By a Taylor series expansion and the [Taylor inequality](http://mathworld.wolfram.com/TaylorsInequality.html),
$$\tfrac12 H^2(f_0\,\|\,g)=\int_{-1/2}^{1/2}\big[\Psi(0)-\Psi(\phi(t))\big]\,dt\le\int_{-1/2}^{1/2}\Big[-\Psi'(0)\,\phi(t)+\tfrac18\phi^2(t)\Big]\,dt.$$
Observe that $\int_{-1/2}^{1/2}\phi(t)\,dt=0$ and $\int_{-1/2}^{1/2}\phi^2(t)\,dt=4\int_0^\delta(\delta-x)^2\,dx=\frac43\delta^3$. Thus,
$$H^2(f_0\,\|\,g)\le 2\cdot\tfrac18\cdot\tfrac43\delta^3=\tfrac13\delta^3.$$
Consequently, setting $\delta^3=\frac{3}{4n}$ ensures that $H^2(f_0\,\|\,g)\le\frac{1}{4n}$. Hence,
$$\inf_{\hat\theta}\sup_{f\in\mathcal{F}}\mathbb{E}\big[(\hat\theta-f(0))^2\big]\ge\frac{1}{16}\,\omega^2\Big(\frac{1}{2\sqrt n}\Big)\gtrsim n^{-2/3}.$$
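As a quick numerical check of the properties of the bump $\phi$ (my own sketch; the sample size $n=200$ is arbitrary, chosen so that $\delta=(3/(4n))^{1/3}\le\frac16$), the following verifies that $\phi$ integrates to zero, that $\int\phi^2=\frac43\delta^3$, and that the resulting Hellinger distance is of order $\frac{1}{4n}$:

```python
import numpy as np
from scipy.integrate import quad

n = 200
delta = (3 / (4 * n)) ** (1 / 3)          # delta^3 = 3/(4n); here delta <= 1/6

def phi(x):
    # Bump perturbation from the example: positive tent on [-delta, delta],
    # matching negative tent on [delta, 3*delta], zero elsewhere.
    if abs(x) <= delta:
        return delta - abs(x)
    if delta <= x <= 3 * delta:
        return abs(x - 2 * delta) - delta
    return 0.0

kinks = [-delta, 0, delta, 2 * delta, 3 * delta]
mass, _ = quad(phi, -0.5, 0.5, points=kinks)
sq, _ = quad(lambda x: phi(x) ** 2, -0.5, 0.5, points=kinks)
H2, _ = quad(lambda x: (1 - np.sqrt(1 + phi(x))) ** 2, -0.5, 0.5, points=kinks)

print(mass)                      # ~ 0, so g = 1 + phi is still a density
print(sq, 4 * delta ** 3 / 3)    # matches the closed form
print(H2, 1 / (4 * n))           # Hellinger distance is of order 1/(4n)
```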

Le Cam’s convex hull method

Consider two subsets $\mathcal{P}_0$ and $\mathcal{P}_1$ of $\mathcal{P}$ that are $2\delta$-separated, in the sense that

$$\rho\big(\theta(P_0),\theta(P_1)\big)\ge 2\delta\qquad\text{for all } P_0\in\mathcal{P}_0 \text{ and } P_1\in\mathcal{P}_1.$$
For any $2\delta$-separated classes of distributions $\mathcal{P}_0$ and $\mathcal{P}_1$ contained within $\mathcal{P}$, any estimator $\hat\theta$ has worst-case risk at least
$$\sup_{P\in\mathcal{P}}\mathbb{E}_P\big[\rho\big(\hat\theta,\theta(P)\big)\big]\ge\frac{\delta}{2}\sup_{\substack{P_0\in\text{conv}(\mathcal{P}_0)\\ P_1\in\text{conv}(\mathcal{P}_1)}}\big\{1-\|P_0-P_1\|_{\rm TV}\big\}.$$
(Sharpened bounds for the Gaussian location family) A key step was to upper bound the TV distance $\|P_\theta^n-P_0^n\|_{\rm TV}$ between the $n$-fold product distributions based on the Gaussian models $N(\theta,\sigma^2)$ and $N(0,\sigma^2)$. Set $\theta=2\delta$, and consider the two families $\mathcal{P}_0=\{P_0^n\}$ and $\mathcal{P}_1=\{P_\theta^n,P_{-\theta}^n\}$. Note that the mixture distribution $\bar P:=\frac12 P_\theta^n+\frac12 P_{-\theta}^n$ belongs to $\text{conv}(\mathcal{P}_1)$, and that
$$\|\bar P-P_0^n\|^2_{\rm TV}\le\frac14\Big\{e^{\frac12\left(\frac{\sqrt n\,\theta}{\sigma}\right)^4}-1\Big\}=\frac14\Big\{e^{\frac12\left(\frac{2\sqrt n\,\delta}{\sigma}\right)^4}-1\Big\}.$$
Setting $\delta=\frac{\sigma t}{2\sqrt n}$ for some parameter $t>0$ to be chosen, the convex hull Le Cam bound yields
$$\min_{\hat\theta}\sup_{\theta\in\mathbb{R}}\mathbb{E}_\theta\big[|\hat\theta-\theta|\big]\ge\frac{\sigma}{4\sqrt n}\sup_{t>0}\Big\{t\Big(1-\frac12\sqrt{e^{t^4/2}-1}\Big)\Big\}\ge\frac{3}{20}\frac{\sigma}{\sqrt n}.$$
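The supremum over $t$ can be evaluated numerically (a quick sketch of my own), recovering the constant $3/20$:

```python
import numpy as np

# Objective whose supremum over t > 0 gives the constant in front of sigma/sqrt(n).
def g(t):
    return t * (1 - 0.5 * np.sqrt(np.exp(t ** 4 / 2) - 1))

ts = np.linspace(1e-3, 2.0, 20_000)
vals = g(ts)
print("argmax t ~", ts[np.argmax(vals)], " sup value ~", vals.max())
# The supremum is about 0.62, and 0.62 / 4 >= 3/20, matching the stated bound.
```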

Fano’s method

The mutual information between the random variables $Z$ and $J$ is defined via the KL divergence as the underlying measure of distance:

$$I(Z;J):=D\big(\mathbb{Q}_{Z,J}\,\|\,\mathbb{Q}_Z\,\mathbb{Q}_J\big).$$

Our set-up: lower bounding the probability of error in an $M$-ary hypothesis testing problem, based on a family of distributions $\{P_{\theta^1},\dots,P_{\theta^M}\}$.

The mutual information can be written in terms of the component distributions $P_{\theta^j}$, $j\in[M]$, and the mixture distribution $\bar{\mathbb{Q}}:=\frac1M\sum_{j=1}^M P_{\theta^j}$:

$$I(Z;J)=\frac1M\sum_{j=1}^M D\big(P_{\theta^j}\,\|\,\bar{\mathbb{Q}}\big).$$

The Fano method is based on the following lower bound on the error probability in an M-ary testing problem, applicable when J is uniformly distributed over the index set:

$$\mathbb{P}[\psi(Z)\neq J]\ge 1-\frac{I(Z;J)+\log 2}{\log M}.$$
Let $\{\theta^1,\dots,\theta^M\}$ be a $2\delta$-separated set in the $\rho$ semi-metric on $\Theta(\mathcal{P})$, suppose that $J$ is uniformly distributed over the index set $\{1,\dots,M\}$, and that $(Z\mid J=j)\sim P_{\theta^j}$. Then for any increasing function $\Phi:[0,\infty)\to[0,\infty)$, the minimax risk is lower bounded as
$$\mathfrak{M}(\theta(\mathcal{P});\Phi\circ\rho)\ge\Phi(\delta)\Big\{1-\frac{I(Z;J)+\log 2}{\log M}\Big\},$$
where $I(Z;J)$ is the mutual information between $Z$ and $J$.
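As an illustration of Fano's bound (my own sketch; the eight-point Gaussian family with unit variance and spacing $2\delta$ is an arbitrary choice), one can compute $I(Z;J)=\frac1M\sum_j D(P_{\theta^j}\,\|\,\bar{\mathbb{Q}})$ by numerical integration and plug it into the error bound above:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical family: M unit-variance Gaussians with 2*delta-separated means.
M, delta = 8, 0.5
thetas = 2 * delta * np.arange(M)

def mixture(z):
    # Density of the mixture Q-bar = (1/M) * sum_j P_{theta_j} at the point z.
    return np.mean([norm.pdf(z, loc=t) for t in thetas])

def kl_to_mixture(t):
    # D(P_theta || Q-bar) by numerical integration over an interval that
    # carries essentially all of the mass of P_theta.
    integrand = lambda z: norm.pdf(z, loc=t) * np.log(norm.pdf(z, loc=t) / mixture(z))
    val, _ = quad(integrand, t - 10, t + 10)
    return val

I_ZJ = np.mean([kl_to_mixture(t) for t in thetas])     # I(Z; J)
fano_bound = 1 - (I_ZJ + np.log(2)) / np.log(M)
print("I(Z;J) =", I_ZJ, " Fano lower bound on the error:", fano_bound)
```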
