
Sub-Gaussian Error Bounds for Hypothesis Testing
This research was supported in part by the US National Science Foundation under grant HDR: TRIPODS 19-34884.

Yan Wang Department of Statistics, Iowa State University
Ames, IA 50011, USA
Email: wangyan@iastate.edu
Abstract

We interpret likelihood-based test functions from a geometric perspective in which the Kullback-Leibler (KL) divergence is adopted to quantify the distance from one distribution to another. Such a test function can be seen as a sub-Gaussian random variable, and we propose a principled way to calculate its corresponding sub-Gaussian norm. An error bound for binary hypothesis testing can then be obtained in terms of the sub-Gaussian norm and the KL divergence, which is more informative than Pinsker's bound when the significance level is prescribed. For $M$-ary hypothesis testing, we also derive an error bound which is complementary to Fano's inequality by being more informative when the number of hypotheses or the sample size is not large.

I Introduction

Hypothesis testing is a central task in statistics. One of its simplest forms is the binary case: given $n$ independent and identically distributed (i.i.d.) random variables $X_1^n\equiv(X_1,\ldots,X_n)$, one wants to infer whether the null hypothesis $H_0: X_i\sim P_0$ or the alternative hypothesis $H_1: X_i\sim P_1$ is true. The binary case serves as an important starting point from which further results can be established, in the settings of both classical and quantum hypothesis testing [1, 2]. With $X_1^n$, one can construct the empirical distribution $\hat{P}_n=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$, where $\delta_X$ is the Dirac measure that puts unit mass at $X$. Adopting the Kullback-Leibler (KL) divergence as a distance from $\hat{P}_n$ to $P_0$ or $P_1$, one can construct a test function as

$$\Phi(X_1^n)=I\{D_{\mathrm{KL}}(\hat{P}_n\|P_0)-D_{\mathrm{KL}}(\hat{P}_n\|P_1)>c\}, \qquad (1)$$

where $I\{\cdot\}$ is the indicator function, $c\geq 0$ serves as a threshold beyond which the decision that $\hat{P}_n$ is closer to $P_1$ than to $P_0$ is made, and $D_{\mathrm{KL}}(P\|Q)=\int\ln(dP/dQ)\,dP$ is the KL divergence from probability $P$ to probability $Q$ if $P\ll Q$. Conventionally, if $P$ is not absolutely continuous with respect to $Q$, then $D_{\mathrm{KL}}(P\|Q)\equiv\infty$. Note that $\hat{P}_n$ is discrete; hence if both $P_0$ and $P_1$ are discrete with the same support, (1) is well defined. Denote the densities of $P_0$ and $P_1$ with respect to the counting measure as $p_0$ and $p_1$, respectively, and we have

$$D_{\mathrm{KL}}(\hat{P}_n\|P_0)-D_{\mathrm{KL}}(\hat{P}_n\|P_1)=\frac{1}{n}\ln\left(\frac{\prod_{i=1}^{n}p_1(X_i)}{\prod_{i=1}^{n}p_0(X_i)}\right). \qquad (2)$$

In fact, in this case, (1) is equivalent to the test function for the likelihood ratio test [4]

$$\Phi_{\mathrm{lrt}}(X_1^n)=I\left\{\frac{\prod_{i=1}^{n}p_1(X_i)}{\prod_{i=1}^{n}p_0(X_i)}>c'\right\}, \qquad (3)$$

where $c'=e^{cn}$. In the case that both $P_0$ and $P_1$ are continuous, the KL divergence difference $D_{\mathrm{KL}}(\hat{P}_n\|P_0)-D_{\mathrm{KL}}(\hat{P}_n\|P_1)$ is not well defined. The technically troublesome part, however, is only the term "$\int\hat{p}_n\ln(\hat{p}_n)\,d\mu$," where we use $\hat{p}_n$ to denote the density of $\hat{P}_n$ with respect to the Lebesgue measure $\mu$ as if it had one; this term appears twice and formally cancels out. We may therefore define the KL divergence difference in the continuous case by (2), and the equivalence between (1) and (3) still holds. Using the KL divergence in the context of hypothesis testing is beneficial in two ways. First, it gives a clear geometric meaning to the likelihood ratio test, as well as to the general idea underlying hypothesis testing. Second, it offers a geometric, or even physical, interpretation of the lower bound on the resulting statistical errors, as shown below.
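As a quick numerical illustration (not part of the formal development), the identity (2), and hence the equivalence between (1) and (3), can be checked by direct simulation. The discrete distributions and the sample size in the following minimal Python sketch are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative discrete distributions on {0, 1, 2}
p0 = np.array([0.5, 0.3, 0.2])   # null P_0
p1 = np.array([0.2, 0.3, 0.5])   # alternative P_1

n = 50
x = rng.choice(3, size=n, p=p0)  # i.i.d. sample, here drawn under H_0

# Empirical distribution \hat{P}_n
p_hat = np.bincount(x, minlength=3) / n

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 ln 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Left-hand side of (2): difference of KL divergences from the empirical distribution
lhs = kl(p_hat, p0) - kl(p_hat, p1)

# Right-hand side of (2): (1/n) times the log-likelihood ratio
rhs = float(np.mean(np.log(p1[x]) - np.log(p0[x])))

print(lhs, rhs)  # the two agree up to floating-point rounding
```

In particular, thresholding the left-hand side at $c$ is the same as thresholding the likelihood ratio at $c'=e^{cn}$, which is exactly (3).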

Under the null hypothesis $H_0$, the type I error rate (or the significance level) $\alpha$ that is incurred by applying (1) for a fixed $c$ is

$$\alpha=\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}\Phi(X_1^n), \qquad (4)$$

where $P_0^{\otimes n}$ is the product probability measure for $X_1^n$ under $H_0$. In practice, one prescribes the significance level, for example $\alpha=0.05$, derives the corresponding $c$, and thereby determines the desired test function. In this work, however, our focus is not to find a test function at a given $\alpha$; we mainly deal with the case in which $c$ is fixed and $\alpha$ is obtained in a somewhat passive way. Thanks to the Neyman-Pearson lemma [3], the likelihood ratio test is known to be optimal in the sense of statistical power. Hence, given the incurred $\alpha$, test function (1) has the minimal type II error rate $\beta$ among all possible test functions whose type I error rate is no greater than $\alpha$:

$$\beta=1-\mathbb{E}_{X_1^n\sim P_1^{\otimes n}}\Phi(X_1^n), \qquad (5)$$

where $P_1^{\otimes n}$ is the product probability measure for $X_1^n$ under the alternative hypothesis $H_1$.

Controlling statistical errors is of practical importance; however, typically one cannot suppress both types of error simultaneously. Under our i.i.d. setting, a classical result, based on Pinsker’s inequality, concerning the error bound for any (measurable) test function is that [4]

$$\alpha+\beta\geq 1-\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_1\|P_0)}. \qquad (6)$$

This result is striking in that, without going into the details of calculating $\alpha$ and $\beta$, one obtains a nontrivial lower bound on their sum in terms of the KL divergence between the two candidate probabilities, as long as the right-hand side of (6) is greater than 0. For a fixed $n$, this bound is solely determined by $D_{\mathrm{KL}}(P_1\|P_0)$, which reflects the "distance" from $P_1$ to $P_0$. This result also has a significant physical meaning. At a nonequilibrium steady state, if $P_1$ denotes the probability associated with observing a stochastic trajectory in the forward process, and $P_0$ in the backward process, then the theory of stochastic thermodynamics tells us that $D_{\mathrm{KL}}(P_1\|P_0)$ is equivalent to the average entropy production $\Delta S$ in the forward process, which is always nonnegative [5, 6]. Hence, if one wishes to infer the arrow of time based on observations, then Pinsker's result (6) implies that the chance of making an error is high when $\Delta S$ is small. In fact, $\Delta S=0$ at equilibrium, and one cannot tell the arrow of time at all; hypothesis testing then amounts to random guessing.

While (6) is useful, can we have a tighter and thus more informative bound? In this work, we show that by taking advantage of the sub-Gaussian property of $\Phi(X_1^n)$ [7, 8], one can derive a bound (17) on statistical errors in terms of its sub-Gaussian norm (as well as the KL divergence from $P_1$ to $P_0$). We call such an error bound "sub-Gaussian" to highlight this fact. It turns out to be tighter than (6) in the sense that it provides a greater lower bound for $\alpha+\beta$ (or for $\beta$ at any given $\alpha\neq 0.5$). In practice, a small $\alpha$ is commonly set as the significance level, so our result can hopefully be the more relevant one. Moreover, in the case of $M$-ary hypothesis testing where $M>2$ hypotheses are present, we also derive a bound (24) on the probability of making incorrect decisions, which is complementary to the celebrated Fano's inequality [9] when the number of hypotheses $M$ or the sample size $n$ is not large. The error bounds presented in this work are universal and easily applicable. We hope these findings can help better quantify errors in various statistical practices involving hypothesis testing.

II Main Results

We will first introduce the sub-Gaussian norm of $\Phi(X_1^n)$. Then error bounds in the binary and $M$-ary cases are established, respectively.

II-A Sub-Gaussian norm of $\Phi(X_1^n)$

Sub-Gaussian random variables are natural generalizations of Gaussian ones. The so-called sub-Gaussian property can be defined in several different but equivalent ways [7, 8]. In this work, we pick the one that best suits our purposes.

Definition 1.

A random variable $X$ with probability law $P$ is called sub-Gaussian if there exists $\sigma>0$ such that its central moment generating function satisfies

$$\mathbb{E}_P e^{s(X-\mathbb{E}_P X)}\leq e^{\sigma^2 s^2/2},\ \forall s\in\mathbb{R}.$$
Definition 2.

The associated sub-Gaussian norm $\sigma_{XP}$ of $X$ with respect to $P$ is defined as

$$\sigma_{XP}\equiv\inf\{\sigma>0:\mathbb{E}_P e^{s(X-\mathbb{E}_P X)}\leq e^{\sigma^2 s^2/2},\ \forall s\in\mathbb{R}\}.$$
Remark 1.

$\sigma_{XP}$ is a well-defined norm for the centered variable $X-\mathbb{E}_P X$ [6]. It is the same for a location family of random variables that have different means but are otherwise identical. Also, $\sigma_{XP}$ is equal to the $\psi_2$-Orlicz norm of $X-\mathbb{E}_P X$ up to a numerical constant factor.

Lemma 1.

A bounded random variable is sub-Gaussian. In particular, if $X\in[a,b]$ almost surely with respect to $P$, then $\sigma_{XP}\leq(b-a)/2$.

Proof.

This is a well known result that can be found in, for example, [7, 8]. ∎

Test function (1) is an indicator function and takes on values in $\{0,1\}$; hence it is bounded. No matter what the law of $X_1^n$ is, $\Phi(X_1^n)$ is always sub-Gaussian by Lemma 1, with a uniform upper bound on its sub-Gaussian norm:

$$\sigma_{\Phi P}\leq 0.5. \qquad (7)$$

However, if $\alpha$ is fixed as a result of some $c$ being used in (1), then a more informative sub-Gaussian norm for $\Phi(X_1^n)$ can be obtained under the situation that $X_1^n\sim P_0^{\otimes n}$. In this case, by (4),

$$\Pr(\Phi(X_1^n)=1\,|\,H_0)=\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}\Phi(X_1^n)=\alpha,$$

and one can explicitly write

$$\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}e^{s[\Phi(X_1^n)-\alpha]}=\Pr(\Phi=1\,|\,H_0)\,e^{s(1-\alpha)}+\Pr(\Phi=0\,|\,H_0)\,e^{s(0-\alpha)}=\alpha e^{s(1-\alpha)}+(1-\alpha)e^{-s\alpha}\equiv e^{f}.$$

Using $f$, one can rewrite the sub-Gaussian property as

$$h\equiv f-\frac{1}{2}\sigma^2 s^2\leq 0,\ \forall s\in\mathbb{R}. \qquad (8)$$

Since $\Phi$ is sub-Gaussian, there exists a $\sigma$ for which (8) holds; for any such $\sigma$ and any $\alpha$, we have $h(s=0)=0$, which is the maximal value of $h$. This fact implies $\partial h/\partial s|_{s=0}=0$ and $\partial^2 h/\partial s^2|_{s=0}\leq 0$. The latter poses a constraint on the $\sigma$'s for which (8) can hold:

$$\partial^2 h/\partial s^2|_{s=0}\leq 0\ \Longrightarrow\ \sigma^2\geq\partial^2 f/\partial s^2|_{s=0}=\alpha(1-\alpha). \qquad (9)$$

Since $\alpha(1-\alpha)\leq 0.25$, the minimal universal $\sigma$ valid for all $\alpha$ is $0.5$, consistent with (7).
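The conditions used above can also be checked symbolically. The following short sympy sketch (an illustration, not part of the derivation) verifies that $f(0)=0$, $\partial f/\partial s|_{s=0}=0$, and $\partial^2 f/\partial s^2|_{s=0}=\alpha(1-\alpha)$, as used in (9).

```python
import sympy as sp

s, alpha = sp.symbols('s alpha', positive=True)

# exp(f) = alpha*exp(s*(1-alpha)) + (1-alpha)*exp(-s*alpha), cf. the definition of f above (8)
f = sp.log(alpha * sp.exp(s * (1 - alpha)) + (1 - alpha) * sp.exp(-s * alpha))

print(sp.simplify(f.subs(s, 0)))                  # 0
print(sp.simplify(sp.diff(f, s).subs(s, 0)))      # 0
print(sp.simplify(sp.diff(f, s, 2).subs(s, 0)))   # simplifies to alpha*(1 - alpha), matching (9)
```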

For a specific $\alpha$, the minimal $\sigma$ that makes (8) valid is denoted as $\sigma_{\Phi 0}(\alpha)$, which is defined to be the sub-Gaussian norm of $\Phi(X_1^n)$ under the law $\Phi_{\#}P_0^{\otimes n}$, the push forward probability measure of $P_0^{\otimes n}$ induced by $\Phi$. We may also simply state that $\sigma_{\Phi 0}(\alpha)$ is the sub-Gaussian norm of $\Phi(X_1^n)$ under $H_0$. The norm $\sigma_{\Phi 0}(\alpha)$ can be numerically obtained in a principled way, as summarized in the following theorem.

Theorem 1.

For $\alpha\neq 0.5$, besides the trivial solution $(\sigma,0)$ with any $\sigma>0$, the equations

$$\left\{\begin{array}{l}f=\frac{1}{2}\sigma^2 s^2,\\ \frac{\partial f}{\partial s}=\sigma^2 s,\end{array}\right. \qquad (12)$$

have only one nontrivial solution $(\sigma^{\ast},s^{\ast})$ with $s^{\ast}\neq 0$. The sub-Gaussian norm of $\Phi(X_1^n)$ under $H_0$ is $\sigma_{\Phi 0}=\sigma^{\ast}$. For $\alpha=0.5$, $\sigma_{\Phi 0}=0.5$.

Proof.

We will consider three cases based on the value of $\alpha$.

Case I: $\alpha=0.5$. In this case, $\sigma_{\Phi 0}$ can be obtained directly by noticing

$$\exp[f]=\cosh\left(\frac{s}{2}\right)=\sum_{n=0}^{\infty}\frac{(s/2)^{2n}}{(2n)!}\leq\sum_{n=0}^{\infty}\frac{(s/2)^{2n}}{2^{n}n!}=\exp\left(\frac{1}{2}\times 0.5^2\times s^2\right).$$

Hence $\sigma_{\Phi 0}\leq 0.5$; combined with the lower bound (9) at $\alpha=0.5$, this gives $\sigma_{\Phi 0}=0.5$.

Case II: $0<\alpha<0.5$. Before diving into the proof, we briefly describe the main idea. Given $\alpha$, the function $h$ depends on both $s$ and $\sigma$. Requiring its maximum to be no greater than 0 at some $\sigma$ naturally leads to the two conditions $h(s,\sigma)=0$ and $\partial h(s,\sigma)/\partial s=0$, which are just (12). It is expected that $\sigma_{\Phi 0}$ can be obtained from the corresponding nontrivial solution, since it is the minimal $\sigma$ that satisfies (8). Fig. 1 confirms this intuition, where $\alpha=0.05$ is assumed for illustration. By tuning $\sigma$ to some $\sigma^{\ast}$, the maximum of $h$ at some $s^{\ast}>0$ can be made exactly equal to 0, i.e., $h(s^{\ast},\sigma^{\ast})=0$. At this $s^{\ast}$, $h$ is also tangent to the $s$-axis, indicating that $\partial h(s,\sigma^{\ast})/\partial s|_{s=s^{\ast}}=0$. Hence $\sigma_{\Phi 0}=\sigma^{\ast}$.


Figure 1: Assuming $\alpha=0.05$, we show the main idea underlying Theorem 1 and numerically calculate $\sigma_{\Phi 0}$, which is the minimal $\sigma$ such that $h(s)$ is no greater than 0 for all $s$, as required by the sub-Gaussian property (8).

Now we turn to the proof. It is trivial that for any $\alpha$, $h$ attains its maximal value 0 at $s=0$, no matter what $\sigma>0$ is; this alone does not provide much useful information about $\sigma_{\Phi 0}$. To proceed, we need a nontrivial local maximum of $h(s)$ at some $s\neq 0$. Our first observation is that when $0<\alpha<0.5$, no local maximum is achieved for $s<0$, because $\partial h/\partial s>0$ for all $s<0$. To see this, let $a\equiv e^{-|s|(1-\alpha)}$, $b\equiv e^{|s|\alpha}$, and $\delta\equiv 0.5-\alpha$; then we have

$$\begin{aligned}\frac{\partial h}{\partial s}&=-\alpha(1-\alpha)\frac{b-a}{\alpha a+(1-\alpha)b}+\sigma^2|s|\\ &=-\alpha(1-\alpha)\frac{1-e^{-|s|}}{(0.5+\delta)+(0.5-\delta)e^{-|s|}}+\sigma^2|s|\\ &>-2\alpha(1-\alpha)\tanh(|s|/2)+\sigma^2|s|\\ &>\left[\sigma^2-\alpha(1-\alpha)\right]|s|\geq 0,\end{aligned}$$

where $\alpha<0.5$ (hence $\delta>0$) is used in the first inequality, the second inequality is due to $\tanh(x)<x$ for $x>0$, and the last inequality follows from (9) since we already know that $\Phi(X_1^n)$ is sub-Gaussian. This result indicates that the nontrivial maximum, if any, can only be found at some $s>0$.

For $s>0$, following similar steps, we obtain

$$\frac{\partial h}{\partial s}=\frac{\alpha(1-\alpha)}{\frac{1}{2}\coth\left(\frac{s}{2}\right)-\left(\frac{1}{2}-\alpha\right)}-\sigma^2 s,$$

and the condition $\partial h/\partial s=0$ then implies

$$g(s)\equiv\frac{s}{2}\coth\left(\frac{s}{2}\right)=\left(\frac{1}{2}-\alpha\right)s+\frac{\alpha(1-\alpha)}{\sigma^2}\equiv l(s,\sigma).$$

It is straightforward to check that $g(s)\geq 1$ is a positive, monotonically increasing, and strongly convex function on $s>0$. Hence it can intersect the straight line $l(s,\sigma)$ at no more than two points. Note that $g(0^{+})=1$ and $g'(0^{+})=0$. The intercept of $l(s,\sigma)$ is $\alpha(1-\alpha)/\sigma^2\in(0,1)$ by (9), and the slope is greater than 0. Hence, by tuning $\sigma$, it is always possible to make $g(s)$ and $l(s,\sigma)$ intersect twice. Denote these two points as $s_1(\sigma)$ and $s_2(\sigma)$, respectively, with $h(s_1)<h(s_2)$. As shown in Fig. 1, $h(s_1)$ is the minimum between the two maxima $h(s=0)$ and $h(s_2)$. Further requiring $h(s_2(\sigma))=0$ at some $\sigma^{\ast}$, which is attainable since $\Phi$ is known to be sub-Gaussian, we obtain $\sigma_{\Phi 0}=\sigma^{\ast}$, and the $0<\alpha<0.5$ part of Theorem 1 is proved.

Case III: $0.5<\alpha<1$. Note that $f(s)$, and hence $h(s)$, is invariant under the simultaneous transformations $\alpha\leftrightarrow 1-\alpha$ and $s\leftrightarrow -s$. Hence $\sigma_{\Phi 0}$ is the same for $\alpha$ and $1-\alpha$.

Combining all three cases, we have proved Theorem 1. ∎


Figure 2: The sub-Gaussian norm $\sigma_{\Phi 0}$ is plotted as a function of the type I error rate $\alpha$. Since $\sigma_{\Phi 0}$ is the same for $\alpha$ and $1-\alpha$, we only plot the result for $\alpha\in(0,0.5]$.

One can calculate $\sigma_{\Phi 0}$ in a principled way for a given $\alpha$, without knowing $P_0$, $P_1$, or the constant $c$ in the test function. We summarize the relation between $\alpha$ and $\sigma_{\Phi 0}$ for (1) in Fig. 2. Error bounds for hypothesis testing can now be established based on $\sigma_{\Phi 0}$.
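For readers who wish to reproduce Fig. 2, here is one minimal numerical sketch (the helper name `sigma_phi0` and the use of scipy are our own choices, not from the paper). It relies on the equivalent characterization implied by Definition 2, namely $\sigma_{\Phi 0}^2=\sup_{s\neq 0}2f(s)/s^2$, whose maximizer corresponds to the tangency point $s^{\ast}$ of the system (12).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigma_phi0(alpha):
    """Sub-Gaussian norm of the test function under H0 at type I error rate alpha.

    Uses sigma^2 = sup_{s != 0} 2 f(s) / s^2, equivalent to solving the system (12).
    """
    if np.isclose(alpha, 0.5):
        return 0.5                      # Case I of Theorem 1
    a = min(alpha, 1.0 - alpha)         # Case III: the norm is the same for alpha and 1 - alpha

    def neg_ratio(s):
        # f(s) = ln E exp(s (Phi - alpha)) under H0; for a < 0.5 the supremum is at some s > 0
        f = np.log(a * np.exp(s * (1.0 - a)) + (1.0 - a) * np.exp(-s * a))
        return -2.0 * f / s**2

    res = minimize_scalar(neg_ratio, bounds=(1e-8, 60.0), method='bounded')
    return float(np.sqrt(-res.fun))

for a in (0.05, 0.1, 0.25, 0.5):
    print(a, round(sigma_phi0(a), 3))   # values lie between sqrt(a*(1-a)) (cf. (9)) and 0.5 (cf. (7))
```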

II-B Sub-Gaussian bound for binary hypothesis testing

Lemma 2.

Consider two general probability measures $\nu$ and $\mu$ on a common measurable space. Suppose $\nu\ll\mu$, and let $g\equiv d\nu/d\mu$ be the density of $\nu$ with respect to $\mu$. Let $Y$ be a sub-Gaussian random variable which is a function of $X$, where $X$ has law $\mu$ or $\nu$. Then we have

$$|\mathbb{E}_{X\sim\nu}Y-\mathbb{E}_{X\sim\mu}Y|\leq\sigma_{Y_{\#}\mu}\sqrt{2D_{\mathrm{KL}}(\nu\|\mu)}, \qquad (13)$$

where $\sigma_{Y_{\#}\mu}$ denotes the sub-Gaussian norm of $Y$ with respect to the push forward measure $Y_{\#}\mu$.

Recently, there have been several works with findings similar to Theorem 2, in the context of nonequilibrium statistical physics [6], data exploration and model bias analysis [10, 11], and uncertainty quantification for stochastic processes [12]. They can, however, be analyzed in a unified way in the spirit of [13].

Proof.

We have assumed $\nu\ll\mu$ and $g\equiv d\nu/d\mu$. The associated entropy functional of $g$ with respect to $\mu$ is defined as $\mathrm{Ent}_{\mu}(g)=\int g\ln g\,d\mu$. It is straightforward to find that

$$\mathrm{Ent}_{\mu}(g)=\int\frac{d\nu}{d\mu}\ln\left(\frac{d\nu}{d\mu}\right)d\mu=\int\ln\left(\frac{d\nu}{d\mu}\right)d\nu=D_{\mathrm{KL}}(\nu\|\mu). \qquad (14)$$

On the other hand, by the variational representation of $\mathrm{Ent}_{\mu}(g)$, we have

$$\mathrm{Ent}_{\mu}(g)=\sup_{\eta}\int\eta g\,d\mu,\ \text{with}\ \int e^{\eta}d\mu\leq 1, \qquad (15)$$

where $\eta$ is a measurable function. We have $\int\eta g\,d\mu=\mathbb{E}_{\mu}[\eta g]=\mathbb{E}_{\nu}\eta$ and $\int e^{\eta}d\mu=\mathbb{E}_{\mu}e^{\eta}$.

By assumption, $Y(X)$ is sub-Gaussian. If $X\sim\mu$, then the sub-Gaussian norm of $Y$ under the push forward measure $Y_{\#}\mu$ is $\sigma_{Y_{\#}\mu}$. Let us construct $\eta$ as

$$\eta=s[Y(X)-\mathbb{E}_{X\sim\mu}Y(X)]-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2.$$

By the definition of the sub-Gaussian norm, this choice satisfies $\mathbb{E}_{\mu}e^{\eta}\leq 1$. Combining $\eta$ with (14) and (15), we arrive at

$$\begin{aligned}D_{\mathrm{KL}}(\nu\|\mu)&\geq\mathbb{E}_{X\sim\mu}\,\eta(Y(X))\,g\\ &=\mathbb{E}_{X\sim\mu}\,g\left[s(Y-\mathbb{E}_{X\sim\mu}Y)-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2\right]\\ &=\mathbb{E}_{X\sim\nu}\left[s(Y-\mathbb{E}_{X\sim\mu}Y)-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2\right]\\ &=s(\mathbb{E}_{X\sim\nu}Y-\mathbb{E}_{X\sim\mu}Y)-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2,\end{aligned}$$

which holds for any $s\in\mathbb{R}$. Optimizing over $s$ then proves Lemma 2:

$$|\mathbb{E}_{X\sim\nu}Y-\mathbb{E}_{X\sim\mu}Y|\leq\inf_{|s|>0}\left[\frac{D_{\mathrm{KL}}(\nu\|\mu)}{|s|}+\frac{1}{2}\sigma_{Y_{\#}\mu}^2|s|\right]=\sigma_{Y_{\#}\mu}\sqrt{2D_{\mathrm{KL}}(\nu\|\mu)},$$

where the infimum is attained at $|s|=\sqrt{2D_{\mathrm{KL}}(\nu\|\mu)}/\sigma_{Y_{\#}\mu}$ (when $D_{\mathrm{KL}}(\nu\|\mu)>0$). ∎

Theorem 2.

Suppose $P_1\ll P_0$, and denote the sub-Gaussian norm of test function (1) under the null hypothesis $H_0$ as $\sigma_{\Phi 0}$. Then we have

$$|\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}\Phi(X_1^n)-\mathbb{E}_{X_1^n\sim P_1^{\otimes n}}\Phi(X_1^n)|\leq\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}. \qquad (16)$$
Proof.

Let $X=X_1^n$, $\nu=P_1^{\otimes n}$, and $\mu=P_0^{\otimes n}$; due to the i.i.d. setting, $D_{\mathrm{KL}}(P_1^{\otimes n}\|P_0^{\otimes n})=nD_{\mathrm{KL}}(P_1\|P_0)$. The proof is then completed by letting $Y=\Phi(X_1^n)$ in Lemma 2. ∎

Corollary 1.

One has

$$\alpha+\beta\geq 1-\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}. \qquad (17)$$
Proof.

Insert definitions (4) and (5) into (16) and then simplify to obtain the result. ∎

Remark 2.

Corollary 1 can be relaxed by replacing the sub-Gaussian norm $\sigma_{\Phi 0}$ with any of its upper bounds. In fact, if we use the universal upper bound provided by (7), then Corollary 1 reduces to Pinsker's classical result (6). Our bound is therefore at least as strong in general, and strictly stronger for any $\alpha\neq 0.5$. In particular, when controlling $\alpha$ is more important than controlling $\beta$, one might set $c>0$ to put more emphasis on it. Hence, for the same sample size $n$, the larger $c$ is, the smaller $\alpha$ and $\sigma_{\Phi 0}$ are, resulting in a tighter bound for $\beta$.
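To make Remark 2 concrete, the following self-contained sketch compares the right-hand sides of (6) and (17). The specific numbers ($n=20$, $\alpha=0.05$, $P_0=\mathrm{Bernoulli}(0.5)$, $P_1=\mathrm{Bernoulli}(0.6)$) are arbitrary illustrative choices of ours, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative choices: Bernoulli(0.5) null vs Bernoulli(0.6) alternative
n, alpha = 20, 0.05
d_kl = 0.6 * np.log(0.6 / 0.5) + 0.4 * np.log(0.4 / 0.5)   # D_KL(P_1 || P_0)

# sigma_{Phi 0}(alpha) via the same characterization used in the sketch after Fig. 2
neg_ratio = lambda s: -2.0 * np.log(alpha * np.exp(s * (1 - alpha))
                                    + (1 - alpha) * np.exp(-s * alpha)) / s**2
sigma = np.sqrt(-minimize_scalar(neg_ratio, bounds=(1e-8, 60.0), method='bounded').fun)

pinsker_lb  = 1 - np.sqrt(n * d_kl / 2)            # right-hand side of (6)
subgauss_lb = 1 - sigma * np.sqrt(2 * n * d_kl)    # right-hand side of (17)
print(f"Pinsker: {pinsker_lb:.3f}, sub-Gaussian: {subgauss_lb:.3f}")
# The sub-Gaussian lower bound on alpha + beta is the larger, i.e. more informative, one.
```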

Remark 3.

There is another inequality that follows from Theorem 2, namely $\alpha+\beta\leq 1+\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}$. It is somewhat trivial, however, because the bound is greater than 1 and in general does not provide much useful information. For example, one can always accept $H_0$; for this trivial decision rule, $\alpha=0$ and $\beta\leq 1$ by definition. Hence $\alpha+\beta\leq 1$, and the extra term $\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}$ is not informative at all.

Remark 4.

Suppose also $P_0\ll P_1$, which is the usual case in hypothesis testing. Then, by symmetry, it is straightforward to obtain

$$\alpha+\beta\geq 1-\sigma_{\Phi 1}\sqrt{2nD_{\mathrm{KL}}(P_0\|P_1)}, \qquad (18)$$

where $\sigma_{\Phi 1}$ is the sub-Gaussian norm of $\Phi(X_1^n)$ under $H_1$, which is a function of $\beta$. This result is nontrivially different from (17), not only because a different norm is involved, but also because the KL divergence is not symmetric in its two arguments. Given (18), we can either bound $\alpha$ when $\beta$ is given or bound $\beta$ in an implicit way when $\alpha$ is given.

Remark 5.

Similar to (6), our bound is nonasymptotic in nature, as it holds for any finite $n$. The price we pay, however, is that in the large-$n$, small-$\alpha$ limit, our bound for $\beta$ is not as tight as Stein's lemma, which states that $\beta\sim e^{-nD_{\mathrm{KL}}(P_0\|P_1)}$ [4].

II-C Sub-Gaussian bound for $M$-ary hypothesis testing

Our result generalizes to $M$-ary hypothesis testing. Suppose there are $M$ hypotheses, represented by the corresponding probability distributions $\{P_1,\ldots,P_M\}$, and suppose that $n$ data points $X_1^n$ are drawn independently from one of these distributions, $P_{i_0}$. Our task is to infer the hypothesis index $i_0$ from the data. Similar to (1), let us consider the test function for the $i$th hypothesis,

$$\varphi_i(X_1^n)=\prod_{j\neq i}I\{D_{\mathrm{KL}}(\hat{P}_n\|P_j)-D_{\mathrm{KL}}(\hat{P}_n\|P_i)>c_i\},$$

with $\varphi=1-\Phi$ in the binary case. We will consider the case that $c_i=0$ for all $i\in\{1,\ldots,M\}$. Unlike in the binary case, where $c>0$ can be adopted to intentionally render a small $\alpha$, the test function here is purely likelihood-based, without any prescribed preference for any particular hypothesis. It is known that this approach minimizes $\alpha+\beta$ in the binary case (the Bayes classifier). From $M$ such test functions $\varphi_i$, one can construct a random vector $\boldsymbol{\varphi}=(\varphi_1,\ldots,\varphi_M)$. Assume there always exists a single index $i_0$ such that

$$D_{\mathrm{KL}}(\hat{P}_n\|P_j)-D_{\mathrm{KL}}(\hat{P}_n\|P_{i_0})>0$$

holds for all $j\neq i_0$. In this case, $\varphi_{i_0}=1$ and $\varphi_j=0$ for $j\neq i_0$. Since $X_1^n$ is random, $i_0$ may differ from realization to realization. However, almost surely with respect to every $P_i^{\otimes n}$,

$$\sum_{i=1}^{M}\varphi_i(X_1^n)=1. \qquad (19)$$

Under the $M$ hypotheses, we can construct a matrix, denoted $\mathbb{E}\boldsymbol{\varphi}$, that encodes the errors incurred in testing:

$$\mathbb{E}\boldsymbol{\varphi}\equiv\left(\begin{array}{ccc}\mathbb{E}_1\varphi_1&\cdots&\mathbb{E}_1\varphi_M\\ \vdots&\ddots&\vdots\\ \mathbb{E}_M\varphi_1&\cdots&\mathbb{E}_M\varphi_M\end{array}\right), \qquad (23)$$

where the matrix element $\mathbb{E}_i\varphi_j\equiv\mathbb{E}_{X_1^n\sim P_i^{\otimes n}}\varphi_j$. By (19), each row of $\mathbb{E}\boldsymbol{\varphi}$ sums to 1. The diagonal elements of $\mathbb{E}\boldsymbol{\varphi}$ are the probabilities that the underlying hypothesis is correctly identified. In other words, the probability of making an incorrect decision when the data are generated from the $i$th hypothesis is $\alpha_i\equiv 1-\mathbb{E}_i\varphi_i$. We denote $\alpha_{\max}\equiv\max_i\alpha_i$; a small simulation sketch of this error matrix is given below. The following theorem then provides a lower bound on $\alpha_{\max}$ that is complementary to Fano's inequality.
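As an illustration of the objects just defined (not needed for the proof), the matrix $\mathbb{E}\boldsymbol{\varphi}$ and $\alpha_{\max}$ can be estimated by straightforward Monte Carlo. The three discrete distributions, the sample size, and the number of repetitions below are arbitrary choices for this sketch; with $c_i=0$, the rule simply picks the hypothesis with the largest log-likelihood, by the same identity as (2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Three arbitrary illustrative distributions on {0, 1, 2}
P = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
M, n, reps = 3, 10, 20000
logP = np.log(P)

E_phi = np.zeros((M, M))
for i in range(M):                            # data generated under the i-th hypothesis
    x = rng.choice(3, size=(reps, n), p=P[i])
    loglik = logP[:, x].sum(axis=2)           # shape (M, reps): sum_k ln p_m(X_k) for each m
    picked = loglik.argmax(axis=0)            # with c_i = 0, phi selects the max-likelihood index
    E_phi[i] = np.bincount(picked, minlength=M) / reps

alpha_i = 1 - np.diag(E_phi)                  # per-hypothesis error probabilities
print(np.round(E_phi, 3))                     # each row sums to 1, cf. (19)
print("alpha_max ~", round(alpha_i.max(), 3))
```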

Theorem 3.

Suppose $P_i\ll P_j$ for all $i,j\in\{1,\ldots,M\}$. For any $j\in\{1,\ldots,M\}$, we have

$$\alpha_{\max}\geq 1-\frac{1}{M}-\frac{1}{M}\sum_{i=1}^{M}\sigma_{\varphi_i}\sqrt{2nD_{\mathrm{KL}}(P_j\|P_i)}, \qquad (24)$$

where $\sigma_{\varphi_i}$ is the sub-Gaussian norm of $\varphi_i$ with respect to the $i$th hypothesis.

Proof.

First note that $\varphi_i$ is sub-Gaussian since it takes on values in $\{0,1\}$. If $\alpha_i$ is fixed, then the sub-Gaussian norm $\sigma_{\varphi_i}$ can be calculated as in the binary case. Even if $\alpha_i$ is unknown, by Lemma 2 we can formally write

$$\mathbb{E}_i\varphi_i\leq\sigma_{\varphi_i}\sqrt{2nD_{\mathrm{KL}}(P_j\|P_i)}+\mathbb{E}_j\varphi_i. \qquad (25)$$

Summing over $i$ and using (19), we find

$$M(1-\alpha_{\max})\leq\sum_{i=1}^{M}\mathbb{E}_i\varphi_i\leq 1+\sum_{i=1}^{M}\sigma_{\varphi_i}\sqrt{2nD_{\mathrm{KL}}(P_j\|P_i)}.$$

Finally, we arrive at (24) by rearranging the terms. Hence the proof is completed. ∎

If we aim at lower bounding $\alpha_{\max}$, then using the sub-Gaussian norm $\sigma_{\varphi_i}$ in Theorem 3 may not be practically useful, since $\sigma_{\varphi_i}$ itself depends on $\alpha_i$. Nonetheless, thanks to the universal upper bound (7), we have the relaxed versions of (24) stated in the corollary below.

Corollary 2.

For any $j\in\{1,\ldots,M\}$, we have

$$\alpha_{\max}\geq 1-\frac{1}{M}-\frac{1}{M}\sum_{i=1}^{M}\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_j\|P_i)}, \qquad (26)$$

or, averaging (26) over $j$, in terms of the mean square root of the KL divergences, we have

$$\alpha_{\max}\geq 1-\frac{1}{M}-\frac{1}{M^2}\sum_{i,j=1}^{M}\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_j\|P_i)}. \qquad (27)$$

Furthermore, if $D_{\mathrm{KL}}(P_i\|P_j)\leq\delta$ holds for each pair of $i$ and $j$, then

$$\alpha_{\max}\geq 1-\frac{1}{M}-\sqrt{\frac{n}{2}\delta}. \qquad (28)$$
Remark 6.

It is interesting to compare (28) with Fano's inequality [9], which, under the same assumption that all the KL divergences are uniformly bounded by $\delta$, states that

$$\alpha_{\max}^{\mathrm{Fano}}\geq 1-\frac{n\delta+\ln 2}{\ln(M-1)}. \qquad (29)$$

As evidenced by the different scalings in $M$ and $n$ of (28) and (29), there is a regime in which our result outperforms Fano's, in the sense that it provides a greater lower bound for $\alpha_{\max}$. Qualitatively, this happens when the number of hypotheses $M$ or the sample size $n$ is not large. For example, when $M=3$, Fano's inequality is trivial, while our result can still be nontrivial.
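A quick numerical comparison of (28) and (29) illustrates this crossover; the values of $n$, $\delta$, and the grid of $M$ in the sketch below are arbitrary choices of ours.

```python
import numpy as np

n, delta = 5, 0.01   # illustrative sample size and uniform KL bound

print(" M    bound (28)   Fano (29)")
for M in (3, 10, 100, 10000):
    ours = 1 - 1 / M - np.sqrt(n * delta / 2)             # sub-Gaussian bound (28)
    fano = 1 - (n * delta + np.log(2)) / np.log(M - 1)    # Fano's inequality (29)
    print(f"{M:5d}   {ours:10.3f}   {fano:9.3f}")
# For small M (e.g., M = 3) the bound (28) is the informative one;
# as M grows, Fano's bound eventually becomes the larger of the two.
```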

III Conclusion and discussion

In this work, by using the sub-Gaussian property of test functions, we uncover two universal error bounds in terms of the sub-Gaussian norm and the Kullback-Leibler divergence. In the case of binary hypothesis testing, our bound (17) is always tighter than Pinsker's bound (6) for any given $\alpha\neq 0.5$. In the case of $M$-ary hypothesis testing, our result (24) is complementary to Fano's inequality (29), providing a more informative bound when the number of hypotheses or the sample size is not large.

Given the universality of our results, we hope that, with possible generalizations, they can find applications in fields ranging from clinical trials to quantum state discrimination. In particular, the quantum extension of these bounds is of special interest. Owing to the experimental cost, it may be important to quantify statistical errors with a limited number of observations, and nonasymptotic rather than asymptotic results are thus more relevant. Both of our bounds hold for any finite sample size and can hopefully be helpful in such cases.

Acknowledgment

YW gratefully thanks Prof. Dan Nettleton for helpful discussions that stimulated this work and Prof. Huaiqing Wu for a careful review of the manuscript and insightful comments.

References

  • [1] R. W. Keener, Theoretical Statistics: Topics for a Core Course. New York, NY, USA: Springer, 2010.
  • [2] M. Hayashi, Quantum Information Theory: A Mathematical Foundation, 2nd ed. New York, NY, USA: Springer-Verlag, 2017.
  • [3] J. Neyman and E. S. Pearson, "IX. On the problem of the most efficient tests of statistical hypotheses," Phil. Trans. R. Soc. London, vol. A231, pp. 289–337, April 1933.
  • [4] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, 1991.
  • [5] U. Seifert, "Entropy production along a stochastic trajectory and an integral fluctuation theorem," Phys. Rev. Lett., vol. 95, no. 4, July 2005, Art. no. 040602.
  • [6] Y. Wang, “Sub-Gaussian and subexponential fluctuation-response inequalities,” Phys. Rev. E, vol. 102, no. 5, November 2020, Art. no. 052105.
  • [7] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge, U.K.: Cambridge University Press, 2019.
  • [8] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge, U.K.: Cambridge University Press, 2018.
  • [9] R. Fano, Transmission of Information: A Statistical Theory of Communications. Cambridge, MA, USA: M.I.T. Press, 1961.
  • [10] D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via information usage,” IEEE Trans. Inf. Theory, vol. 66, no. 1, pp. 302–323, January 2020.
  • [11] K. Gourgoulias, M. A. Katsoulakis, L. Rey-Bellet, and J. Wang, “How biased is your model? Concentration inequalities, information and model bias,” IEEE Trans. Inf. Theory, vol. 66, no. 5, pp. 3079–3097, May 2020.
  • [12] J. Birrell and L. Rey-Bellet, "Uncertainty quantification for Markov processes via variational principles and functional inequalities," SIAM/ASA J. Uncertain. Quantif., vol. 8, no. 2, pp. 539–572, April 2020.
  • [13] S. G. Bobkov and F. Götze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J. Funct. Anal., vol. 163, no. 1, pp. 1–28, April 1999.