
Sub-Gaussian Error Bounds for Hypothesis Testing
This research was supported in part by the US National Science Foundation under grant HDR: TRIPODS 19-34884.

Yan Wang Department of Statistics, Iowa State University
Ames, IA 50011, USA
Email: wangyan@iastate.edu
Abstract

We interpret likelihood-based test functions from a geometric perspective in which the Kullback-Leibler (KL) divergence is adopted to quantify the distance from one distribution to another. Such a test function can be seen as a sub-Gaussian random variable, and we propose a principled way to calculate its corresponding sub-Gaussian norm. An error bound for binary hypothesis testing can then be obtained in terms of the sub-Gaussian norm and the KL divergence, which is more informative than Pinsker's bound when the significance level is prescribed. For $M$-ary hypothesis testing, we also derive an error bound which is complementary to Fano's inequality by being more informative when the number of hypotheses or the sample size is not large.

I Introduction

Hypothesis testing is a central task in statistics. One of its simplest forms is the binary case: given $n$ independent and identically distributed (i.i.d.) random variables $X_1^n\equiv(X_1,\ldots,X_n)$, one wants to infer whether the null hypothesis $H_0: X_i\sim P_0$ or the alternative hypothesis $H_1: X_i\sim P_1$ is true. The binary case serves as an important starting point from which further results can be established, in the settings of both classical and quantum hypothesis testing [1, 2]. With $X_1^n$, one can construct the empirical distribution $\hat{P}_n=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$, where $\delta_X$ is the Dirac measure that puts unit mass at $X$. Adopting the Kullback-Leibler (KL) divergence as a distance from $\hat{P}_n$ to $P_0$ or $P_1$, one can construct a test function as

$$\Phi(X_1^n)=I\{D_{\mathrm{KL}}(\hat{P}_n\|P_0)-D_{\mathrm{KL}}(\hat{P}_n\|P_1)>c\}, \qquad (1)$$

where $I\{\cdot\}$ is the indicator function, $c\geq 0$ serves as a threshold beyond which the decision that $\hat{P}_n$ is closer to $P_1$ than to $P_0$ is made, and $D_{\mathrm{KL}}(P\|Q)=\int\ln(dP/dQ)\,dP$ is the KL divergence from probability $P$ to probability $Q$ if $P\ll Q$. Conventionally, if $P$ is not absolutely continuous with respect to $Q$, then $D_{\mathrm{KL}}(P\|Q)\equiv\infty$. Note that $\hat{P}_n$ is discrete; hence if both $P_0$ and $P_1$ are discrete with the same support, (1) is well defined. Denote the densities of $P_0$ and $P_1$ with respect to the counting measure as $p_0$ and $p_1$, respectively, and we have

$$D_{\mathrm{KL}}(\hat{P}_n\|P_0)-D_{\mathrm{KL}}(\hat{P}_n\|P_1)=\frac{1}{n}\ln\left(\frac{\prod_{i=1}^{n}p_1(X_i)}{\prod_{i=1}^{n}p_0(X_i)}\right). \qquad (2)$$

In fact, in this case, (1) is equivalent to the test function for the likelihood ratio test [4]

$$\Phi_{\mathrm{lrt}}(X_1^n)=I\left\{\frac{\prod_{i=1}^{n}p_1(X_i)}{\prod_{i=1}^{n}p_0(X_i)}>c'\right\}, \qquad (3)$$

where $c'=e^{cn}$. In the case that both $P_0$ and $P_1$ are continuous, the KL divergence difference $D_{\mathrm{KL}}(\hat{P}_n\|P_0)-D_{\mathrm{KL}}(\hat{P}_n\|P_1)$ is not well defined. The technically troublesome part, however, is only the term "$\int\hat{p}_n\ln(\hat{p}_n)\,d\mu$," where we use $\hat{p}_n$ to denote the density of $\hat{P}_n$ with respect to the Lebesgue measure $\mu$ as if it had one; this term appears twice and formally cancels out. We may therefore define the KL divergence difference in the continuous case by (2), and the equivalence between (1) and (3) still holds. Using the KL divergence in the context of hypothesis testing is beneficial in two ways. First, it gives a clear geometric meaning to the likelihood ratio test, as well as to the general idea underlying hypothesis testing. Second, it offers a geometric, or even physical, interpretation of the lower bound on the resulting statistical errors, as shown below.
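As a quick numerical illustration (not part of the formal development), the identity (2), and hence the equivalence between (1) and (3), can be checked by direct simulation. The discrete distributions and the sample size in the following minimal Python sketch are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative discrete distributions on {0, 1, 2}
p0 = np.array([0.5, 0.3, 0.2])   # null P_0
p1 = np.array([0.2, 0.3, 0.5])   # alternative P_1

n = 50
x = rng.choice(3, size=n, p=p0)  # i.i.d. sample, here drawn under H_0

# Empirical distribution \hat{P}_n
p_hat = np.bincount(x, minlength=3) / n

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 ln 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Left-hand side of (2): difference of KL divergences from the empirical distribution
lhs = kl(p_hat, p0) - kl(p_hat, p1)

# Right-hand side of (2): (1/n) times the log-likelihood ratio
rhs = float(np.mean(np.log(p1[x]) - np.log(p0[x])))

print(lhs, rhs)  # the two agree up to floating-point rounding
```

In particular, thresholding the left-hand side at $c$ is the same as thresholding the likelihood ratio at $c'=e^{cn}$, which is exactly (3).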

Under the null hypothesis $H_0$, the type I error rate (or the significance level) $\alpha$ that is incurred by applying (1) for a fixed $c$ is

$$\alpha=\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}\Phi(X_1^n), \qquad (4)$$

where $P_0^{\otimes n}$ is the product probability measure for $X_1^n$ under $H_0$. In practice, one prescribes the significance level, for example $\alpha=0.05$, derives the corresponding $c$, and thereby determines the desired test function. In this work, however, our focus is not to find a test function at a given $\alpha$; we mainly deal with the case in which $c$ is fixed and $\alpha$ is obtained in a somewhat passive way. Thanks to the Neyman-Pearson lemma [3], the likelihood ratio test is known to be optimal in the sense of statistical power. Hence, given the incurred $\alpha$, test function (1) has the minimal type II error rate $\beta$ among all possible test functions whose type I error rate is no greater than $\alpha$:

$$\beta=1-\mathbb{E}_{X_1^n\sim P_1^{\otimes n}}\Phi(X_1^n), \qquad (5)$$

where $P_1^{\otimes n}$ is the product probability measure for $X_1^n$ under the alternative hypothesis $H_1$.

Controlling statistical errors is of practical importance; however, typically one cannot suppress both types of error simultaneously. Under our i.i.d. setting, a classical result, based on Pinsker’s inequality, concerning the error bound for any (measurable) test function is that [4]

$$\alpha+\beta\geq 1-\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_1\|P_0)}. \qquad (6)$$

This result is striking in that, without going into the details of calculating $\alpha$ and $\beta$, one obtains a nontrivial lower bound on their sum in terms of the KL divergence between the two candidate probabilities, as long as the right-hand side of (6) is greater than 0. For a fixed $n$, this bound is solely determined by $D_{\mathrm{KL}}(P_1\|P_0)$, which reflects the "distance" from $P_1$ to $P_0$. This result also has a significant physical meaning. At a nonequilibrium steady state, if $P_1$ denotes the probability associated with observing a stochastic trajectory in the forward process, and $P_0$ in the backward process, then the theory of stochastic thermodynamics tells us that $D_{\mathrm{KL}}(P_1\|P_0)$ is equivalent to the average entropy production $\Delta S$ in the forward process, which is always nonnegative [5, 6]. Hence, if one wishes to infer the arrow of time based on observations, then Pinsker's result (6) implies that the chance of making an error is high when $\Delta S$ is small. In fact, $\Delta S=0$ at equilibrium, and one cannot tell the arrow of time at all; hypothesis testing then amounts to random guessing.

While (6) is useful, can we have a tighter and thus more informative bound? In this work, we show that by taking advantage of the sub-Gaussian property of $\Phi(X_1^n)$ [7, 8], one can derive a bound (17) on statistical errors in terms of its sub-Gaussian norm (as well as the KL divergence from $P_1$ to $P_0$). We call such an error bound "sub-Gaussian" to highlight this fact. It turns out to be tighter than (6) in the sense that it provides a greater lower bound for $\alpha+\beta$ (or for $\beta$ at any given $\alpha\neq 0.5$). In practice, a small $\alpha$ is commonly set as the significance level, so our result can hopefully be the more relevant one. Moreover, in the case of $M$-ary hypothesis testing where $M>2$ hypotheses are present, we also derive a bound (24) on the probability of making incorrect decisions, which is complementary to the celebrated Fano's inequality [9] when the number of hypotheses $M$ or the sample size $n$ is not large. The error bounds presented in this work are universal and easily applicable. We hope these findings can help better quantify errors in various statistical practices involving hypothesis testing.

II Main Results

We will first introduce the sub-Gaussian norm of $\Phi(X_1^n)$. Then error bounds in the binary and $M$-ary cases are established, respectively.

II-A Sub-Gaussian norm of $\Phi(X_1^n)$

Sub-Gaussian random variables are natural generalizations of Gaussian ones. The so-called sub-Gaussian property can be defined in several different but equivalent ways [7, 8]. In this work, we pick the one that best suits our purposes.

Definition 1.

A random variable $X$ with probability law $P$ is called sub-Gaussian if there exists $\sigma>0$ such that its central moment generating function satisfies

$$\mathbb{E}_P e^{s(X-\mathbb{E}_P X)}\leq e^{\sigma^2 s^2/2},\ \forall s\in\mathbb{R}.$$
Definition 2.

The associated sub-Gaussian norm $\sigma_{XP}$ of $X$ with respect to $P$ is defined as

$$\sigma_{XP}\equiv\inf\{\sigma>0:\mathbb{E}_P e^{s(X-\mathbb{E}_P X)}\leq e^{\sigma^2 s^2/2},\ \forall s\in\mathbb{R}\}.$$
Remark 1.

$\sigma_{XP}$ is a well-defined norm for the centered variable $X-\mathbb{E}_P X$ [6]. It is the same for a location family of random variables that have different means but are otherwise identical. Also, $\sigma_{XP}$ is equal to the $\psi_2$-Orlicz norm of $X-\mathbb{E}_P X$ up to a numerical constant factor.

Lemma 1.

A bounded random variable is sub-Gaussian. In particular, if $X\in[a,b]$ almost surely with respect to $P$, then $\sigma_{XP}\leq(b-a)/2$.

Proof.

This is a well known result that can be found in, for example, [7, 8]. ∎

Test function (1) is an indicator function and takes on values in $\{0,1\}$; hence it is bounded. No matter what the law of $X_1^n$ is, $\Phi(X_1^n)$ is always sub-Gaussian by Lemma 1, with a uniform upper bound on its sub-Gaussian norm:

$$\sigma_{\Phi P}\leq 0.5. \qquad (7)$$

However, if $\alpha$ is fixed as a result of some $c$ being used in (1), then a more informative sub-Gaussian norm for $\Phi(X_1^n)$ can be obtained under the situation that $X_1^n\sim P_0^{\otimes n}$. In this case, by (4),

$$\Pr(\Phi(X_1^n)=1\,|\,H_0)=\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}\Phi(X_1^n)=\alpha,$$

and one can explicitly write

$$\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}e^{s[\Phi(X_1^n)-\alpha]}=\Pr(\Phi=1\,|\,H_0)\,e^{s(1-\alpha)}+\Pr(\Phi=0\,|\,H_0)\,e^{s(0-\alpha)}=\alpha e^{s(1-\alpha)}+(1-\alpha)e^{-s\alpha}\equiv e^{f}.$$

Using $f$, one can rewrite the sub-Gaussian property as

$$h\equiv f-\frac{1}{2}\sigma^2 s^2\leq 0,\ \forall s\in\mathbb{R}. \qquad (8)$$

Since $\Phi$ is sub-Gaussian, there exists a $\sigma$ for which (8) holds; for any such $\sigma$ and any $\alpha$, we have $h(s=0)=0$, which is the maximal value of $h$. This fact implies $\partial h/\partial s|_{s=0}=0$ and $\partial^2 h/\partial s^2|_{s=0}\leq 0$. The latter poses a constraint on the $\sigma$'s for which (8) can hold:

$$\partial^2 h/\partial s^2|_{s=0}\leq 0\ \Longrightarrow\ \sigma^2\geq\partial^2 f/\partial s^2|_{s=0}=\alpha(1-\alpha). \qquad (9)$$

Since $\alpha(1-\alpha)\leq 0.25$, the minimal universal $\sigma$ valid for all $\alpha$ is $0.5$, consistent with (7).
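The conditions used above can also be checked symbolically. The following short sympy sketch (an illustration, not part of the derivation) verifies that $f(0)=0$, $\partial f/\partial s|_{s=0}=0$, and $\partial^2 f/\partial s^2|_{s=0}=\alpha(1-\alpha)$, as used in (9).

```python
import sympy as sp

s, alpha = sp.symbols('s alpha', positive=True)

# exp(f) = alpha*exp(s*(1-alpha)) + (1-alpha)*exp(-s*alpha), cf. the definition of f above (8)
f = sp.log(alpha * sp.exp(s * (1 - alpha)) + (1 - alpha) * sp.exp(-s * alpha))

print(sp.simplify(f.subs(s, 0)))                  # 0
print(sp.simplify(sp.diff(f, s).subs(s, 0)))      # 0
print(sp.simplify(sp.diff(f, s, 2).subs(s, 0)))   # simplifies to alpha*(1 - alpha), matching (9)
```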

For a specific $\alpha$, the minimal $\sigma$ that makes (8) valid is denoted as $\sigma_{\Phi 0}(\alpha)$, which is defined to be the sub-Gaussian norm of $\Phi(X_1^n)$ under the law $\Phi_{\#}P_0^{\otimes n}$, the push forward probability measure of $P_0^{\otimes n}$ induced by $\Phi$. We may also simply state that $\sigma_{\Phi 0}(\alpha)$ is the sub-Gaussian norm of $\Phi(X_1^n)$ under $H_0$. The norm $\sigma_{\Phi 0}(\alpha)$ can be numerically obtained in a principled way, as summarized in the following theorem.

Theorem 1.

For $\alpha\neq 0.5$, besides the trivial solution $(\sigma,0)$ with any $\sigma>0$, the equations

$$\left\{\begin{array}{l}f=\frac{1}{2}\sigma^2 s^2,\\ \frac{\partial f}{\partial s}=\sigma^2 s,\end{array}\right. \qquad (12)$$

have only one nontrivial solution $(\sigma^{\ast},s^{\ast})$ with $s^{\ast}\neq 0$. The sub-Gaussian norm of $\Phi(X_1^n)$ under $H_0$ is $\sigma_{\Phi 0}=\sigma^{\ast}$. For $\alpha=0.5$, $\sigma_{\Phi 0}=0.5$.

Proof.

We will consider three cases based on the value of $\alpha$.

Case I: $\alpha=0.5$. In this case, $\sigma_{\Phi 0}$ can be obtained directly by noticing

$$\exp[f]=\cosh\left(\frac{s}{2}\right)=\sum_{n=0}^{\infty}\frac{(s/2)^{2n}}{(2n)!}\leq\sum_{n=0}^{\infty}\frac{(s/2)^{2n}}{2^{n}n!}=\exp\left(\frac{1}{2}\times 0.5^2\times s^2\right).$$

Hence $\sigma_{\Phi 0}\leq 0.5$; combined with the lower bound (9) at $\alpha=0.5$, this gives $\sigma_{\Phi 0}=0.5$.

Case II: $0<\alpha<0.5$. Before diving into the proof, we briefly describe the main idea. Given $\alpha$, the function $h$ depends on both $s$ and $\sigma$. Requiring its maximum to be no greater than 0 at some $\sigma$ naturally leads to the two conditions $h(s,\sigma)=0$ and $\partial h(s,\sigma)/\partial s=0$, which are just (12). It is expected that $\sigma_{\Phi 0}$ can be obtained from the corresponding nontrivial solution, since it is the minimal $\sigma$ that satisfies (8). Fig. 1 confirms this intuition, where $\alpha=0.05$ is assumed for illustration. By tuning $\sigma$ to some $\sigma^{\ast}$, the maximum of $h$ at some $s^{\ast}>0$ can be made exactly equal to 0, i.e., $h(s^{\ast},\sigma^{\ast})=0$. At this $s^{\ast}$, $h$ is also tangent to the $s$-axis, indicating that $\partial h(s,\sigma^{\ast})/\partial s|_{s=s^{\ast}}=0$. Hence $\sigma_{\Phi 0}=\sigma^{\ast}$.


Figure 1: Assuming $\alpha=0.05$, we show the main idea underlying Theorem 1 and numerically calculate $\sigma_{\Phi 0}$, which is the minimal $\sigma$ such that $h(s)$ is no greater than 0 for all $s$, as required by the sub-Gaussian property (8).

Now we turn to the proof. It is trivial that for any $\alpha$, $h$ attains its maximal value 0 at $s=0$, no matter what $\sigma>0$ is; this alone does not provide much useful information about $\sigma_{\Phi 0}$. To proceed, we need a nontrivial local maximum of $h(s)$ at some $s\neq 0$. Our first observation is that when $0<\alpha<0.5$, no local maximum is achieved for $s<0$, because $\partial h/\partial s>0$ for all $s<0$. To see this, let $a\equiv e^{-|s|(1-\alpha)}$, $b\equiv e^{|s|\alpha}$, and $\delta\equiv 0.5-\alpha$; then we have

$$\begin{aligned}\frac{\partial h}{\partial s}&=-\alpha(1-\alpha)\frac{b-a}{\alpha a+(1-\alpha)b}+\sigma^2|s|\\ &=-\alpha(1-\alpha)\frac{1-e^{-|s|}}{(0.5+\delta)+(0.5-\delta)e^{-|s|}}+\sigma^2|s|\\ &>-2\alpha(1-\alpha)\tanh(|s|/2)+\sigma^2|s|\\ &>\left[\sigma^2-\alpha(1-\alpha)\right]|s|\geq 0,\end{aligned}$$

where $\alpha<0.5$ (hence $\delta>0$) is used in the first inequality, the second inequality is due to $\tanh(x)<x$ for $x>0$, and the last inequality follows from (9) since we already know that $\Phi(X_1^n)$ is sub-Gaussian. This result indicates that the nontrivial maximum, if any, can only be found at some $s>0$.

For $s>0$, following similar steps, we obtain

$$\frac{\partial h}{\partial s}=\frac{\alpha(1-\alpha)}{\frac{1}{2}\coth\left(\frac{s}{2}\right)-\left(\frac{1}{2}-\alpha\right)}-\sigma^2 s,$$

and the condition $\partial h/\partial s=0$ then implies

$$g(s)\equiv\frac{s}{2}\coth\left(\frac{s}{2}\right)=\left(\frac{1}{2}-\alpha\right)s+\frac{\alpha(1-\alpha)}{\sigma^2}\equiv l(s,\sigma).$$

It is straightforward to check that $g(s)\geq 1$ is a positive, monotonically increasing, and strongly convex function on $s>0$. Hence it can intersect the straight line $l(s,\sigma)$ at no more than two points. Note that $g(0^{+})=1$ and $g'(0^{+})=0$. The intercept of $l(s,\sigma)$ is $\alpha(1-\alpha)/\sigma^2\in(0,1)$ by (9), and the slope is greater than 0. Hence, by tuning $\sigma$, it is always possible to make $g(s)$ and $l(s,\sigma)$ intersect twice. Denote these two points as $s_1(\sigma)$ and $s_2(\sigma)$, respectively, with $h(s_1)<h(s_2)$. As shown in Fig. 1, $h(s_1)$ is the minimum between the two maxima $h(s=0)$ and $h(s_2)$. Further requiring $h(s_2(\sigma))=0$ at some $\sigma^{\ast}$, which is attainable since $\Phi$ is known to be sub-Gaussian, we obtain $\sigma_{\Phi 0}=\sigma^{\ast}$, and the $0<\alpha<0.5$ part of Theorem 1 is proved.

Case III: $0.5<\alpha<1$. Note that $f(s)$, and hence $h(s)$, is invariant under the simultaneous transformations $\alpha\leftrightarrow 1-\alpha$ and $s\leftrightarrow -s$. Hence $\sigma_{\Phi 0}$ is the same for $\alpha$ and $1-\alpha$.

Combining all three cases, we have proved Theorem 1. ∎


Figure 2: The sub-Gaussian norm $\sigma_{\Phi 0}$ is plotted as a function of the type I error rate $\alpha$. Since $\sigma_{\Phi 0}$ is the same for $\alpha$ and $1-\alpha$, we only plot the result for $\alpha\in(0,0.5]$.

One can calculate $\sigma_{\Phi 0}$ in a principled way for a given $\alpha$, without knowing $P_0$, $P_1$, or the constant $c$ in the test function. We summarize the relation between $\alpha$ and $\sigma_{\Phi 0}$ for (1) in Fig. 2. Error bounds for hypothesis testing can now be established based on $\sigma_{\Phi 0}$.
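For readers who wish to reproduce Fig. 2, here is one minimal numerical sketch (the helper name `sigma_phi0` and the use of scipy are our own choices, not from the paper). It relies on the equivalent characterization implied by Definition 2, namely $\sigma_{\Phi 0}^2=\sup_{s\neq 0}2f(s)/s^2$, whose maximizer corresponds to the tangency point $s^{\ast}$ of the system (12).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigma_phi0(alpha):
    """Sub-Gaussian norm of the test function under H0 at type I error rate alpha.

    Uses sigma^2 = sup_{s != 0} 2 f(s) / s^2, equivalent to solving the system (12).
    """
    if np.isclose(alpha, 0.5):
        return 0.5                      # Case I of Theorem 1
    a = min(alpha, 1.0 - alpha)         # Case III: the norm is the same for alpha and 1 - alpha

    def neg_ratio(s):
        # f(s) = ln E exp(s (Phi - alpha)) under H0; for a < 0.5 the supremum is at some s > 0
        f = np.log(a * np.exp(s * (1.0 - a)) + (1.0 - a) * np.exp(-s * a))
        return -2.0 * f / s**2

    res = minimize_scalar(neg_ratio, bounds=(1e-8, 60.0), method='bounded')
    return float(np.sqrt(-res.fun))

for a in (0.05, 0.1, 0.25, 0.5):
    print(a, round(sigma_phi0(a), 3))   # values lie between sqrt(a*(1-a)) (cf. (9)) and 0.5 (cf. (7))
```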

II-B Sub-Gaussian bound for binary hypothesis testing

Lemma 2.

Consider two general probability measures $\nu$ and $\mu$ on a common measurable space. Suppose $\nu\ll\mu$, and let $g\equiv d\nu/d\mu$ be the density of $\nu$ with respect to $\mu$. Let $Y$ be a sub-Gaussian random variable which is a function of $X$, where $X$ has law $\mu$ or $\nu$. Then we have

$$|\mathbb{E}_{X\sim\nu}Y-\mathbb{E}_{X\sim\mu}Y|\leq\sigma_{Y_{\#}\mu}\sqrt{2D_{\mathrm{KL}}(\nu\|\mu)}, \qquad (13)$$

where $\sigma_{Y_{\#}\mu}$ denotes the sub-Gaussian norm of $Y$ with respect to the push forward measure $Y_{\#}\mu$.

Recently, there have been several works with findings similar to Theorem 2, in the context of nonequilibrium statistical physics [6], data exploration and model bias analysis [10, 11], and uncertainty quantification for stochastic processes [12]. They can, however, be analyzed in a unified way in the spirit of [13].

Proof.

We have assumed $\nu\ll\mu$ and $g\equiv d\nu/d\mu$. The associated entropy functional of $g$ with respect to $\mu$ is defined as $\mathrm{Ent}_{\mu}(g)=\int g\ln g\,d\mu$. It is straightforward to find that

$$\mathrm{Ent}_{\mu}(g)=\int\frac{d\nu}{d\mu}\ln\left(\frac{d\nu}{d\mu}\right)d\mu=\int\ln\left(\frac{d\nu}{d\mu}\right)d\nu=D_{\mathrm{KL}}(\nu\|\mu). \qquad (14)$$

On the other hand, by the variational representation of $\mathrm{Ent}_{\mu}(g)$, we have

$$\mathrm{Ent}_{\mu}(g)=\sup_{\eta}\int\eta g\,d\mu,\ \text{with}\ \int e^{\eta}d\mu\leq 1, \qquad (15)$$

where $\eta$ is a measurable function. We have $\int\eta g\,d\mu=\mathbb{E}_{\mu}[\eta g]=\mathbb{E}_{\nu}\eta$ and $\int e^{\eta}d\mu=\mathbb{E}_{\mu}e^{\eta}$.

By assumption, $Y(X)$ is sub-Gaussian. If $X\sim\mu$, then the sub-Gaussian norm of $Y$ under the push forward measure $Y_{\#}\mu$ is $\sigma_{Y_{\#}\mu}$. Let us construct $\eta$ as

$$\eta=s[Y(X)-\mathbb{E}_{X\sim\mu}Y(X)]-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2.$$

By the definition of the sub-Gaussian norm, this choice satisfies $\mathbb{E}_{\mu}e^{\eta}\leq 1$. Combining $\eta$ with (14) and (15), we arrive at

$$\begin{aligned}D_{\mathrm{KL}}(\nu\|\mu)&\geq\mathbb{E}_{X\sim\mu}\,\eta(Y(X))\,g\\ &=\mathbb{E}_{X\sim\mu}\,g\left[s(Y-\mathbb{E}_{X\sim\mu}Y)-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2\right]\\ &=\mathbb{E}_{X\sim\nu}\left[s(Y-\mathbb{E}_{X\sim\mu}Y)-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2\right]\\ &=s(\mathbb{E}_{X\sim\nu}Y-\mathbb{E}_{X\sim\mu}Y)-\frac{1}{2}\sigma_{Y_{\#}\mu}^2 s^2,\end{aligned}$$

which holds for any $s\in\mathbb{R}$. Optimizing over $s$ then proves Lemma 2:

$$|\mathbb{E}_{X\sim\nu}Y-\mathbb{E}_{X\sim\mu}Y|\leq\inf_{|s|>0}\left[\frac{D_{\mathrm{KL}}(\nu\|\mu)}{|s|}+\frac{1}{2}\sigma_{Y_{\#}\mu}^2|s|\right]=\sigma_{Y_{\#}\mu}\sqrt{2D_{\mathrm{KL}}(\nu\|\mu)},$$

where the infimum is attained at $|s|=\sqrt{2D_{\mathrm{KL}}(\nu\|\mu)}/\sigma_{Y_{\#}\mu}$ (when $D_{\mathrm{KL}}(\nu\|\mu)>0$). ∎

Theorem 2.

Suppose $P_1\ll P_0$, and denote the sub-Gaussian norm of test function (1) under the null hypothesis $H_0$ as $\sigma_{\Phi 0}$. Then we have

$$|\mathbb{E}_{X_1^n\sim P_0^{\otimes n}}\Phi(X_1^n)-\mathbb{E}_{X_1^n\sim P_1^{\otimes n}}\Phi(X_1^n)|\leq\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}. \qquad (16)$$
Proof.

Let $X=X_1^n$, $\nu=P_1^{\otimes n}$, and $\mu=P_0^{\otimes n}$; due to the i.i.d. setting, $D_{\mathrm{KL}}(P_1^{\otimes n}\|P_0^{\otimes n})=nD_{\mathrm{KL}}(P_1\|P_0)$. The proof is then completed by letting $Y=\Phi(X_1^n)$ in Lemma 2. ∎

Corollary 1.

One has

$$\alpha+\beta\geq 1-\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}. \qquad (17)$$
Proof.

Insert definitions (4) and (5) into (16) and then simplify to obtain the result. ∎

Remark 2.

Corollary 1 can be relaxed by replacing the sub-Gaussian norm $\sigma_{\Phi 0}$ with any of its upper bounds. In fact, if we use the universal upper bound provided by (7), then Corollary 1 reduces to Pinsker's classical result (6). Our bound is therefore at least as strong in general, and strictly stronger for any $\alpha\neq 0.5$. In particular, when controlling $\alpha$ is more important than controlling $\beta$, one might set $c>0$ to put more emphasis on it. Hence, for the same sample size $n$, the larger $c$ is, the smaller $\alpha$ and $\sigma_{\Phi 0}$ are, resulting in a tighter bound for $\beta$.
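To make Remark 2 concrete, the following self-contained sketch compares the right-hand sides of (6) and (17). The specific numbers ($n=20$, $\alpha=0.05$, $P_0=\mathrm{Bernoulli}(0.5)$, $P_1=\mathrm{Bernoulli}(0.6)$) are arbitrary illustrative choices of ours, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative choices: Bernoulli(0.5) null vs Bernoulli(0.6) alternative
n, alpha = 20, 0.05
d_kl = 0.6 * np.log(0.6 / 0.5) + 0.4 * np.log(0.4 / 0.5)   # D_KL(P_1 || P_0)

# sigma_{Phi 0}(alpha) via the same characterization used in the sketch after Fig. 2
neg_ratio = lambda s: -2.0 * np.log(alpha * np.exp(s * (1 - alpha))
                                    + (1 - alpha) * np.exp(-s * alpha)) / s**2
sigma = np.sqrt(-minimize_scalar(neg_ratio, bounds=(1e-8, 60.0), method='bounded').fun)

pinsker_lb  = 1 - np.sqrt(n * d_kl / 2)            # right-hand side of (6)
subgauss_lb = 1 - sigma * np.sqrt(2 * n * d_kl)    # right-hand side of (17)
print(f"Pinsker: {pinsker_lb:.3f}, sub-Gaussian: {subgauss_lb:.3f}")
# The sub-Gaussian lower bound on alpha + beta is the larger, i.e. more informative, one.
```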

Remark 3.

There is another inequality that follows from Theorem 2, namely $\alpha+\beta\leq 1+\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}$. It is somewhat trivial, however, because the bound is greater than 1 and in general does not provide much useful information. For example, one can always accept $H_0$; for this trivial decision rule, $\alpha=0$ and $\beta\leq 1$ by definition. Hence $\alpha+\beta\leq 1$, and the extra term $\sigma_{\Phi 0}\sqrt{2nD_{\mathrm{KL}}(P_1\|P_0)}$ is not informative at all.

Remark 4.

Suppose also $P_0\ll P_1$, which is the usual case in hypothesis testing. Then, by symmetry, it is straightforward to obtain

$$\alpha+\beta\geq 1-\sigma_{\Phi 1}\sqrt{2nD_{\mathrm{KL}}(P_0\|P_1)}, \qquad (18)$$

where $\sigma_{\Phi 1}$ is the sub-Gaussian norm of $\Phi(X_1^n)$ under $H_1$, which is a function of $\beta$. This result is nontrivially different from (17), not only because a different norm is involved, but also because the KL divergence is not symmetric in its two arguments. Given (18), we can either bound $\alpha$ when $\beta$ is given or bound $\beta$ in an implicit way when $\alpha$ is given.

Remark 5.

Similar to (6), our bound is nonasymptotic in nature, as it holds for any finite $n$. The price we pay, however, is that in the large-$n$, small-$\alpha$ limit, our bound for $\beta$ is not as tight as Stein's lemma, which states that $\beta\sim e^{-nD_{\mathrm{KL}}(P_0\|P_1)}$ [4].

II-C Sub-Gaussian bound for $M$-ary hypothesis testing

Our result generalizes to $M$-ary hypothesis testing. Suppose there are $M$ hypotheses, represented by the corresponding probability distributions $\{P_1,\ldots,P_M\}$, and suppose that $n$ data points $X_1^n$ are drawn independently from one of these distributions, $P_{i_0}$. Our task is to infer the hypothesis index $i_0$ from the data. Similar to (1), let us consider the test function for the $i$th hypothesis,

$$\varphi_i(X_1^n)=\prod_{j\neq i}I\{D_{\mathrm{KL}}(\hat{P}_n\|P_j)-D_{\mathrm{KL}}(\hat{P}_n\|P_i)>c_i\},$$

with $\varphi=1-\Phi$ in the binary case. We will consider the case that $c_i=0$ for all $i\in\{1,\ldots,M\}$. Unlike in the binary case, where $c>0$ can be adopted to intentionally render a small $\alpha$, the test function here is purely likelihood-based, without any prescribed preference for any particular hypothesis. It is known that this approach minimizes $\alpha+\beta$ in the binary case (the Bayes classifier). From $M$ such test functions $\varphi_i$, one can construct a random vector $\boldsymbol{\varphi}=(\varphi_1,\ldots,\varphi_M)$. Assume there always exists a single index $i_0$ such that

$$D_{\mathrm{KL}}(\hat{P}_n\|P_j)-D_{\mathrm{KL}}(\hat{P}_n\|P_{i_0})>0$$

holds for all $j\neq i_0$. In this case, $\varphi_{i_0}=1$ and $\varphi_j=0$ for $j\neq i_0$. Since $X_1^n$ is random, $i_0$ may differ from realization to realization. However, almost surely with respect to every $P_i^{\otimes n}$,

$$\sum_{i=1}^{M}\varphi_i(X_1^n)=1. \qquad (19)$$

Under the $M$ hypotheses, we can construct a matrix, denoted $\mathbb{E}\boldsymbol{\varphi}$, that encodes the errors incurred in testing:

$$\mathbb{E}\boldsymbol{\varphi}\equiv\left(\begin{array}{ccc}\mathbb{E}_1\varphi_1&\cdots&\mathbb{E}_1\varphi_M\\ \vdots&\ddots&\vdots\\ \mathbb{E}_M\varphi_1&\cdots&\mathbb{E}_M\varphi_M\end{array}\right), \qquad (23)$$

where the matrix element $\mathbb{E}_i\varphi_j\equiv\mathbb{E}_{X_1^n\sim P_i^{\otimes n}}\varphi_j$. By (19), each row of $\mathbb{E}\boldsymbol{\varphi}$ sums to 1. The diagonal elements of $\mathbb{E}\boldsymbol{\varphi}$ are the probabilities that the underlying hypothesis is correctly identified. In other words, the probability of making an incorrect decision when the data are generated from the $i$th hypothesis is $\alpha_i\equiv 1-\mathbb{E}_i\varphi_i$. We denote $\alpha_{\max}\equiv\max_i\alpha_i$; a small simulation sketch of this error matrix is given below. The following theorem then provides a lower bound on $\alpha_{\max}$ that is complementary to Fano's inequality.
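As an illustration of the objects just defined (not needed for the proof), the matrix $\mathbb{E}\boldsymbol{\varphi}$ and $\alpha_{\max}$ can be estimated by straightforward Monte Carlo. The three discrete distributions, the sample size, and the number of repetitions below are arbitrary choices for this sketch; with $c_i=0$, the rule simply picks the hypothesis with the largest log-likelihood, by the same identity as (2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Three arbitrary illustrative distributions on {0, 1, 2}
P = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
M, n, reps = 3, 10, 20000
logP = np.log(P)

E_phi = np.zeros((M, M))
for i in range(M):                            # data generated under the i-th hypothesis
    x = rng.choice(3, size=(reps, n), p=P[i])
    loglik = logP[:, x].sum(axis=2)           # shape (M, reps): sum_k ln p_m(X_k) for each m
    picked = loglik.argmax(axis=0)            # with c_i = 0, phi selects the max-likelihood index
    E_phi[i] = np.bincount(picked, minlength=M) / reps

alpha_i = 1 - np.diag(E_phi)                  # per-hypothesis error probabilities
print(np.round(E_phi, 3))                     # each row sums to 1, cf. (19)
print("alpha_max ~", round(alpha_i.max(), 3))
```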

Theorem 3.

Suppose $P_i\ll P_j$ for all $i,j\in\{1,\ldots,M\}$. For any $j\in\{1,\ldots,M\}$, we have

$$\alpha_{\max}\geq 1-\frac{1}{M}-\frac{1}{M}\sum_{i=1}^{M}\sigma_{\varphi_i}\sqrt{2nD_{\mathrm{KL}}(P_j\|P_i)}, \qquad (24)$$

where $\sigma_{\varphi_i}$ is the sub-Gaussian norm of $\varphi_i$ with respect to the $i$th hypothesis.

Proof.

First note that $\varphi_i$ is sub-Gaussian since it takes on values in $\{0,1\}$. If $\alpha_i$ is fixed, then the sub-Gaussian norm $\sigma_{\varphi_i}$ can be calculated as in the binary case. Even if $\alpha_i$ is unknown, by Lemma 2 we can formally write

$$\mathbb{E}_i\varphi_i\leq\sigma_{\varphi_i}\sqrt{2nD_{\mathrm{KL}}(P_j\|P_i)}+\mathbb{E}_j\varphi_i. \qquad (25)$$

Summing over $i$ and using (19), we find

$$M(1-\alpha_{\max})\leq\sum_{i=1}^{M}\mathbb{E}_i\varphi_i\leq 1+\sum_{i=1}^{M}\sigma_{\varphi_i}\sqrt{2nD_{\mathrm{KL}}(P_j\|P_i)}.$$

Finally, we arrive at (24) by rearranging the terms. Hence the proof is completed. ∎

If we aim at lower bounding $\alpha_{\max}$, then using the sub-Gaussian norm $\sigma_{\varphi_i}$ in Theorem 3 may not be practically useful, since $\sigma_{\varphi_i}$ itself depends on $\alpha_i$. Nonetheless, thanks to the universal upper bound (7), we have the relaxed versions of (24) stated in the corollary below.

Corollary 2.

For any $j\in\{1,\ldots,M\}$, we have

$$\alpha_{\max}\geq 1-\frac{1}{M}-\frac{1}{M}\sum_{i=1}^{M}\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_j\|P_i)}, \qquad (26)$$

or, averaging (26) over $j$, in terms of the mean square root of the KL divergences, we have

$$\alpha_{\max}\geq 1-\frac{1}{M}-\frac{1}{M^2}\sum_{i,j=1}^{M}\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_j\|P_i)}. \qquad (27)$$

Furthermore, if $D_{\mathrm{KL}}(P_i\|P_j)\leq\delta$ holds for each pair of $i$ and $j$, then

$$\alpha_{\max}\geq 1-\frac{1}{M}-\sqrt{\frac{n}{2}\delta}. \qquad (28)$$
Remark 6.

It is interesting to compare (28) with Fano's inequality [9], which, under the same assumption that all the KL divergences are uniformly bounded by $\delta$, states that

$$\alpha_{\max}^{\mathrm{Fano}}\geq 1-\frac{n\delta+\ln 2}{\ln(M-1)}. \qquad (29)$$

As evidenced by the different scalings in $M$ and $n$ of (28) and (29), there is a regime in which our result outperforms Fano's, in the sense that it provides a greater lower bound for $\alpha_{\max}$. Qualitatively, this happens when the number of hypotheses $M$ or the sample size $n$ is not large. For example, when $M=3$, Fano's inequality is trivial, while our result can still be nontrivial.
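A quick numerical comparison of (28) and (29) illustrates this crossover; the values of $n$, $\delta$, and the grid of $M$ in the sketch below are arbitrary choices of ours.

```python
import numpy as np

n, delta = 5, 0.01   # illustrative sample size and uniform KL bound

print(" M    bound (28)   Fano (29)")
for M in (3, 10, 100, 10000):
    ours = 1 - 1 / M - np.sqrt(n * delta / 2)             # sub-Gaussian bound (28)
    fano = 1 - (n * delta + np.log(2)) / np.log(M - 1)    # Fano's inequality (29)
    print(f"{M:5d}   {ours:10.3f}   {fano:9.3f}")
# For small M (e.g., M = 3) the bound (28) is the informative one;
# as M grows, Fano's bound eventually becomes the larger of the two.
```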

III Conclusion and discussion

In this work, by using the sub-Gaussian property of test functions, we uncover two universal error bounds in terms of the sub-Gaussian norm and the Kullback-Leibler divergence. In the case of binary hypothesis testing, our bound (17) is always tighter than Pinsker's bound (6) for any given $\alpha\neq 0.5$. In the case of $M$-ary hypothesis testing, our result (24) is complementary to Fano's inequality (29), providing a more informative bound when the number of hypotheses or the sample size is not large.

Given the universality of our results, we hope that, with possible generalizations, they can find applications in fields ranging from clinical trials to quantum state discrimination. In particular, the quantum extension of these bounds is of special interest. Owing to the experimental cost, it may be important to quantify statistical errors with a limited number of observations, and nonasymptotic rather than asymptotic results are thus more relevant. Both of our bounds hold for any finite sample size and can hopefully be helpful in such cases.

Acknowledgment

YW gratefully thanks Prof. Dan Nettleton for helpful discussions that stimulated this work and Prof. Huaiqing Wu for a careful review of the manuscript and insightful comments.

References

  • [1] R. W. Keener, Theoretical Statistics: Topics for a Core Course. New York, NY, USA: Springer, 2010.
  • [2] M. Hayashi, Quantum Information Theory: A Mathematical Foundation, 2nd ed. New York, NY, USA: Springer-Verlag, 2017.
  • [3] J. Neyman and E. S. Pearson, "IX. On the problem of the most efficient tests of statistical hypotheses," Phil. Trans. R. Soc. London, vol. A231, pp. 289–337, April 1933.
  • [4] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, 1991.
  • [5] U. Seifert, "Entropy production along a stochastic trajectory and an integral fluctuation theorem," Phys. Rev. Lett., vol. 95, no. 4, July 2005, Art. no. 040602.
  • [6] Y. Wang, “Sub-Gaussian and subexponential fluctuation-response inequalities,” Phys. Rev. E, vol. 102, no. 5, November 2020, Art. no. 052105.
  • [7] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge, U.K.: Cambridge University Press, 2019.
  • [8] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge, U.K.: Cambridge University Press, 2018.
  • [9] R. Fano, Transmission of Information: A Statistical Theory of Communications. Cambridge, MA, USA: M.I.T. Press, 1961.
  • [10] D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via information usage,” IEEE Trans. Inf. Theory, vol. 66, no. 1, pp. 302–323, January 2020.
  • [11] K. Gourgoulias, M. A. Katsoulakis, L. Rey-Bellet, and J. Wang, “How biased is your model? Concentration inequalities, information and model bias,” IEEE Trans. Inf. Theory, vol. 66, no. 5, pp. 3079–3097, May 2020.
  • [12] J. Birrell and L. Rey-Bellet, "Uncertainty quantification for Markov processes via variational principles and functional inequalities," SIAM/ASA J. Uncertain. Quantif., vol. 8, no. 2, pp. 539–572, April 2020.
  • [13] S. G. Bobkov and F. Götze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J. Funct. Anal., vol. 163, no. 1, pp. 1–28, April 1999.