Asymptotics of Sequential Composite Hypothesis Testing under Probabilistic Constraints
Abstract
We consider the sequential composite binary hypothesis testing problem in which one of the hypotheses is governed by a single distribution while the other is governed by a family of distributions whose parameters belong to a known set . We would like to design a test to decide which hypothesis is in effect. Under the constraints that the probabilities that the length of the test, a stopping time, exceeds are bounded by a certain threshold , we obtain certain fundamental limits on the asymptotic behavior of the sequential test as tends to infinity. Assuming that is a convex and compact set, we obtain the set of all first-order error exponents for the problem. We also prove a strong converse. Additionally, we obtain the set of second-order error exponents under the assumption that the alphabet of the observations is finite. In the proof of second-order asymptotics, a main technical contribution is the derivation of a central limit-type result for a maximum of an uncountable set of log-likelihood ratios under suitable conditions. This result may be of independent interest. We also show that some important statistical models satisfy the conditions.
Index Terms:
Sequential composite hypothesis testing, Error exponents, Second-order asymptotics, Generalized sequential probability ratio test

I Introduction
Hypothesis testing is a fundamental problem in information theory and statistics [2]. Here we consider a sequential composite hypothesis testing problem in which i.i.d. observations are drawn from either a simple null hypothesis or a composite alternative hypothesis. We consider the first-order and second-order tradeoff between the two types of error probabilities under a probabilistic constraint on the stopping times. There is a vast literature on this subject [3, Part I]; however, the optimal trade-off under the probabilistic stopping-time constraint has not been determined. The probabilistic constraints mean that we constrain the probabilities (under both hypotheses) that the stopping time exceeds to be no larger than some prescribed threshold . We let tend to infinity to exploit various asymptotic and concentration results.
I-A Related works
In the classical problem of sequential hypothesis testing in the statistical literature, one seeks to minimize the expected sample size subject to bounds on the type-I and type-II error probabilities and , i.e., the sequential testing problem is to solve, for each ,
(1)
There is a vast literature on solving the above problem (see [3, Part I] for example). The dual problem corresponding to that of (1) is the minimization of the error probabilities subject to expectation constraints on the sample size. More specifically, the dual problem corresponding to (1) entails solving for each ,
(2)
The optimal tests of (1) and (2) are given by appropriate sequential probability ratio tests. However, in this paper, we consider the problem of minimizing the error probabilities subject to probabilistic constraints on the sample size. In more detail, the problem we are concerned with is the following:
(3)
As the nature of the constraints is different (expectation versus probabilistic), the proof techniques are also different. For problem (2), Wald's identity and the data-processing inequality are used to derive the achievability and the converse. For our problem (3), concentration inequalities and the central limit theorem are used to derive the achievability and the converse.
For the first-order asymptotics (exponents of the two types of error probabilities), there is a vast literature on binary hypothesis testing. In fixed-length hypothesis testing, where the length of the vector of observations is fixed, the Neyman–Pearson lemma [4] states that the likelihood ratio test is optimal, and the Chernoff–Stein lemma [5, Theorem 13.1] shows that if we constrain the type-I error to be less than any , the best (maximum) type-II error exponent is the relative entropy , where and are the distributions under the null and alternative hypotheses, respectively. If we require the type-I error exponent to be at least , i.e., the type-I error probability is upper bounded by , the maximum type-II error exponent is [2]. In this regard, we see that there is a trade-off between the two error exponents, i.e., they cannot be simultaneously large. However, in the sequential case, where the length of the test sample is a stopping time and its expectation is bounded by , the trade-off can be eradicated. Wald and Wolfowitz [6] showed that when the expectations of the sample length under and are bounded by a common integer (these are known as the expectation constraints) and tends to infinity, the set of achievable error exponents is . In addition, the corner point is attained by a sequence of sequential probability ratio tests (SPRTs). Lalitha and Javidi [7] considered an interesting setting that interpolates between fixed-length and sequential hypothesis testing. They considered the almost-fixed-length hypothesis testing problem, in which the stopping time is allowed to be larger than a prescribed integer with exponentially small probability for different . The probabilistic constraints we employ in this paper are analogous to those in [7], but instead of allowing the event that the stopping time is larger than to have exponentially small probability, we only require this event to have probability at most , a fixed constant.
This allows us to ask questions ranging from strong converses to second-order asymptotics. In [8], Haghifam, Tan, and Khisti considered sequential classification which is similar to sequential hypothesis testing apart from the fact that true distributions are only partially known in the form of training samples.
For composite hypothesis testing, Zeitouni, Ziv, and Merhav [9] investigated the generalized likelihood ratio test (GLRT) and proposed conditions for asymptotic optimality of the GLRT in the Neyman–Pearson sense. For the sequential case, Lai [10] analyzed different sequential testing problems and obtained a unified asymptotic theory that establishes certain generalized sequential likelihood ratio tests as asymptotically optimal solutions to these problems. Li, Nitinawarat, and Veeravalli [11] considered a universal outlier hypothesis testing problem in the fixed-length setting; universality here refers to the fact that the distributions are unknown and have to be estimated on the fly. They then extended their work to the sequential setting [12], but under expectation constraints on the stopping time. The work that is closest to ours is that by Li, Liu, and Ying [13], whose results can be modified to solve the composite version of the dual problem (2). They showed that the generalized sequential probability ratio test is asymptotically optimal by making use of optimality results for sequential probability ratio tests (SPRTs).
Concerning the second-order asymptotic regime, in fixed-length binary hypothesis testing in which the type-I error is bounded by a fixed constant , Strassen [14] showed that the second-order term can be quantified via the relative entropy variance [15] and the inverse of the Gaussian cumulative distribution function. For the sequential case, Li and Tan [16] recently established the second-order asymptotics of sequential binary hypothesis testing under probabilistic and expectation constraints on the stopping time, showing that the former (resp. latter) set of constraints results in a (resp. ) backoff from the relative entropies. These are estimates of the costs of operating in the finite-length setting. In this paper, we seek to extend these results to sequential composite hypothesis testing.
I-B Main contributions
Our main contributions consist in obtaining the first-order and second-order asymptotics for sequential composite hypothesis testing under the probabilistic constraints, i.e., we constrain the probability that the length of the observation sequence exceeds to be no larger than some prescribed .
• First, while the results of Li, Liu, and Ying [13] can be modified to solve the composite version of the dual problem in (2), which yields first-order asymptotic results under expectation constraints, we obtain the first-order asymptotic results under the probabilistic constraints. We show that the corner points of the optimal error exponent regions are identical under both types of constraints.
• Second, Li, Liu, and Ying [13] only proved that the generalized sequential probability ratio test is asymptotically optimal by making use of optimality results for the sequential probability ratio test (SPRT). Here we prove a strong converse result, namely that the exponents stay unchanged even if the probability that the stopping time exceeds is smaller than for all . We do so using information-theoretic ideas and, in particular, the ubiquitous change-of-measure technique (Lemma 3).
• Third, and most importantly, we obtain the second-order asymptotics of the error exponents when we assume that the observations take values on a finite alphabet. A main technical contribution here is that we obtain a new central limit-type result for a maximum of an uncountable set of log-likelihood ratios under suitable conditions (Proposition 6). We contrast our central limit-type result with classical statistical results such as Wilks' theorem [17, Chapter 16].
I-C Paper Outline
The rest of the paper is structured as follows. In Section II, we formulate the composite sequential hypothesis testing problem precisely and state the probabilistic constraints on the stopping time. In Section III, we list some mild assumptions on the distributions and the uncertainty set in order to state and prove our first-order asymptotic results. In Section IV, we consider the second-order asymptotics of the same problem by augmenting the assumptions stated in Section III. We state a central limit-type theorem for the maximum of a set of log-likelihood ratios and our main result concerning the second-order asymptotics. We relegate the more technical calculations (such as proofs of lemmata) to the appendix.
II Problem Formulation
Let be an observed i.i.d. sequence, where each follows a density with respect to a base measure on the alphabet . We consider the problem of composite hypothesis testing:
where and are density functions with respect to and . We assume that and are mutually absolutely continuous for all . Denote and as the probability measures associated to and , respectively. Let be the -algebra generated by . Let be a stopping time adapted to the filtration and let be the -algebra associated with . Let be a -valued -measurable function. The pair constitutes a sequential hypothesis test, where is called the decision function and is the stopping time. When (resp. ), the decision is made in favor of (resp. ). The type-I and maximal type-II error probabilities are defined as
In other words, is the error probability that the true density is but and is the maximal error probability over all parameters that the true density is but the decision made based on the observations up to time .
In this paper, we seek the first-order and second-order asymptotics of the exponents of the error probabilities under probabilistic constraints on the stopping time . The probabilistic constraints dictate that, for every error tolerance , there exists an integer such that for all , the stopping time satisfies
(4)
and
(5)
In the following, all logarithms are natural logarithms, i.e., taken with respect to base e.
III First-order Asymptotics
We say that an exponent pair is -achievable under the probabilistic constraints if there exists a sequence of sequential hypothesis tests that satisfies the probabilistic constraints on the stopping time in (4) and (5) and
The set of all -achievable is denoted as . For simple (non-composite) binary sequential hypothesis testing under the expectation constraints (i.e., ), the set of all achievable error exponent pairs, as shown by Wald and Wolfowitz [6] (see also [16, 7]), is
(6)
The corner point can be achieved by a sequence of sequential probability ratio tests [6].
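To make the corner-point statement concrete, here is a minimal simulation sketch of an SPRT for testing Bernoulli(0.3) against Bernoulli(0.7). The parameter values and the classical Wald threshold approximations, accept the alternative once the log-likelihood ratio reaches log((1-beta)/alpha) and accept the null once it falls to log(beta/(1-alpha)), are illustrative assumptions and not taken from this paper.

```python
import math
import random

def sprt(p0, p1, alpha, beta, rng, truth):
    """One run of an SPRT for H0: Bernoulli(p0) vs H1: Bernoulli(p1).

    Thresholds follow Wald's classical approximations:
    accept H1 once the LLR reaches A = log((1-beta)/alpha),
    accept H0 once it falls to B = log(beta/(1-alpha)).
    Returns (decision, stopping_time).
    """
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    llr, n = 0.0, 0
    while B < llr < A:
        n += 1
        x = 1 if rng.random() < (p1 if truth == 1 else p0) else 0
        # increment by log(P1(x)/P0(x))
        llr += math.log(p1 / p0) if x == 1 else math.log((1 - p1) / (1 - p0))
    return (1 if llr >= A else 0), n

rng = random.Random(0)
runs = [sprt(0.3, 0.7, 1e-3, 1e-3, rng, truth=1) for _ in range(200)]
err = sum(d != 1 for d, _ in runs) / len(runs)
avg_n = sum(n for _, n in runs) / len(runs)
print(err, avg_n)
```

With these thresholds, the average stopping time is roughly the threshold divided by the corresponding relative entropy, in line with the Wald–Wolfowitz analysis.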
We define the log-likelihood ratio and maximal log-likelihood ratio respectively as
For two positive numbers and , we define the stopping time as
and the decision rule as
We term the above test given by as a generalized sequential probability ratio test (GSPRT) with thresholds and . The stopping time is almost surely finite for any distribution within the family [13], so (5) holds for GSPRT. For the above GSPRT, we define type-I error probability and maximal type-II error probability respectively as
We introduce some assumptions on the distributions and .
(A1) The parameter set is compact.
(A2) Assume that and are twice continuously differentiable on . For each , the solutions to the minimizations and are unique. Their existence is guaranteed by the compactness of and the continuity of and on . In addition, and for some .
(A3) Let be the log-likelihood ratio. We assume that . Besides, there exist and such that for all , and
(7) where is the norm of the gradient vector
We present some examples that satisfy Conditions (A1)–(A3). We first show that Conditions (A1)–(A3) hold for the canonical exponential family under suitable assumptions and then provide an explicit example.
Example 1 (Canonical exponential families).
The general form of probability density for the canonical exponential family of probability distributions is [18]:
where is called the base measure, is the parameter vector, is referred to as the sufficient statistic and is the cumulant generating function. We define the set of valid parameters as .
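As a concrete instance of the canonical form above, the Bernoulli family can be written with natural parameter theta = log(p/(1-p)), sufficient statistic T(x) = x, and cumulant generating function A(theta) = log(1 + e^theta). The sketch below numerically checks the standard exponential-family identity that the derivative of the cumulant generating function equals the mean of the sufficient statistic; it is a generic illustration, not specific to this paper.

```python
import math

def A(theta):
    """Cumulant generating function of the Bernoulli family written in
    canonical form p(x | theta) = exp(theta * x - A(theta)), x in {0, 1}."""
    return math.log(1.0 + math.exp(theta))

def mean(theta):
    """E[T(X)] = A'(theta) = 1 / (1 + e^{-theta}), a standard identity."""
    return 1.0 / (1.0 + math.exp(-theta))

theta, h = 0.8, 1e-6
finite_diff = (A(theta + h) - A(theta - h)) / (2 * h)  # numerical derivative of A
print(abs(finite_diff - mean(theta)) < 1e-6)  # True: A'(theta) matches the mean
```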
Now we consider the test
We also assume that the exponential families under consideration satisfy the following assumptions:
(i) is a convex and compact set;
(ii) is thrice continuously differentiable with respect to ;
(iii) and are positive definite for .
For this example, Condition (A1) holds because of Assumption (i). For Condition (A2), we have
which are twice continuously differentiable with respect to in based on Assumption (ii). Besides, we have
where holds because [18]. Based on Assumption (iii), and are strongly convex in . Hence, the solutions to the minimizations are unique. Then we also have
which means that and if and only if . As , . Similarly, we have
As is assumed to be positive definite per Assumption (iii), if and only if . As , we have .
For Condition (A3), we have
Then due to Assumptions (i) and (ii). Let be the all ones vector. For all and , we have
where is based on the property , is based on Markov’s inequality and the fact that [19]. Denote . Then there exists such that when (where is the solution to if it exists, else ),
which shows that (7) holds.
Example 2 (Gaussian distributions).
For Gaussian distributions, , , and , where and are the elements of . We consider the test
We assume that is a convex and compact set and .
For this example, Assumption (i) (i.e., Condition (A1)) holds as we assume that is a convex and compact set. Besides, is thrice continuously differentiable and
which is positive definite. Besides,
which is positive definite when . Thus, Assumptions (ii) and (iii) hold, which implies Condition (A2) holds. For Condition (A3), we have
Then, choosing , we have
Denote . Then there exists such that when (where is the solution to if it exists, else ),
which shows that Condition (A3) holds.
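As a numerical companion to this Gaussian example, the following sketch checks the closed-form relative entropy between two equal-variance Gaussians, D(N(mu1, sigma^2) || N(mu0, sigma^2)) = (mu1 - mu0)^2 / (2 sigma^2), against a Monte Carlo average of the log-likelihood ratio; the particular means and variance are illustrative choices.

```python
import math
import random

def kl_gauss_same_var(mu1, mu0, sigma):
    """Closed form D(N(mu1, sigma^2) || N(mu0, sigma^2))."""
    return (mu1 - mu0) ** 2 / (2 * sigma ** 2)

def mc_kl(mu1, mu0, sigma, n, rng):
    """Monte Carlo estimate of E[log(q(X)/p(X))] with X ~ N(mu1, sigma^2);
    the Gaussian normalizing constants cancel since the variances are equal."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu1, sigma)
        total += (-(x - mu1) ** 2 + (x - mu0) ** 2) / (2 * sigma ** 2)
    return total / n

rng = random.Random(0)
exact = kl_gauss_same_var(1.0, 0.0, 1.0)       # 0.5 nats
approx = mc_kl(1.0, 0.0, 1.0, 200_000, rng)
print(exact, round(approx, 2))
```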
Our first main result is Theorem 1 which characterizes the set of first-order error exponents under the probabilistic constraints on the stopping time in (4).
Theorem 1.
For fixed and if Conditions (A1)–(A3) are satisfied, the set of -achievable pairs of error exponents is
Furthermore, the corner point of this set is achieved by an appropriately defined sequence of GSPRTs.
Theorem 1 shows that the -achievable error exponent region is a rectangle. In addition, Theorem 1 shows a strong converse result because the region does not depend on the permissible error probability .
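The GSPRT achieving this corner point tracks the maximal log-likelihood ratio over the alternative family. The following sketch replaces the continuum parameter set by a finite grid of Bernoulli parameters and uses a symmetric stopping rule on the maximal log-likelihood ratio; both simplifications, and the specific numbers, are assumptions made purely for illustration.

```python
import math
import random

def gsprt(p, grid, a, b, sample, max_n=10_000):
    """Sketch of a GSPRT: H0 is Bernoulli(p); H1 is the family
    {Bernoulli(q) : q in grid} (a finite grid standing in for Theta).

    For every candidate q we track the running log-likelihood ratio
    sum_i log(q(x_i)/p(x_i)) and stop when the maximum over the grid
    leaves (-a, b). The threshold form is an illustrative assumption.
    """
    llrs = [0.0] * len(grid)
    for n in range(1, max_n + 1):
        x = sample()
        for j, q in enumerate(grid):
            llrs[j] += math.log((q if x == 1 else 1 - q) /
                                (p if x == 1 else 1 - p))
        s = max(llrs)
        if s >= b:
            return 1, n      # decide the composite alternative
        if s <= -a:
            return 0, n      # decide the null
    return None, max_n       # no decision within the horizon

rng = random.Random(1)
grid = [0.55, 0.6, 0.65, 0.7, 0.75]
# observations are actually drawn from Bernoulli(0.7), a member of the family
dec, n = gsprt(0.4, grid, a=6.0, b=6.0,
               sample=lambda: 1 if rng.random() < 0.7 else 0)
print(dec, n)
```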
III-A Proof of Achievability of Theorem 1
Let and be two positive numbers such that and . Let be the GSPRT with the thresholds and . Since Conditions (A1)–(A3) are satisfied, from [13, Theorem 2.1] we have that
(8)
(9)
Next we prove that the two probabilistic constraints in (4) are satisfied for the GSPRT with thresholds and . We first introduce the uniform weak law of large numbers (UWLLN) [20, Theorem 6.10].
Lemma 2.
Let be a sequence of i.i.d. random vectors, and let be a nonrandom vector lying in a compact subset . Moreover, let be a Borel-measurable function on such that for each is continuous on . Finally, assume that . Then for any ,
Let . We observe that , so we have
Because
and , we have
Then by UWLLN, for , there exists an , such that when ,
Therefore,
We now prove that . Define . We also have . Then for each and , we have
where the last step follows from Chebyshev's inequality [21]. Then, based on Condition (A3), namely that and that does not depend on , there exists an such that when , the required bound holds. We have shown that when , the two probabilistic constraints (4) are satisfied. Then, together with (8), (9) and the arbitrariness of and , we conclude that any exponent pair such that and is in
III-B Proof of Strong Converse of Theorem 1
The following lemma is taken from Li and Tan [16].
Lemma 3.
Let be a sequential hypothesis test such that and . Then for any event , and for each we have
Then we use Lemma 3 to prove the converse part. Let be such that . Without loss of generality and by passing to a subsequence if necessary, we assume that there exists a sequence of sequential hypothesis tests such that and and
(10)
Let for . Then and . Using Lemma 3 with the event , for each we have that
which further implies that
(11)
Similarly, for each , we have that
and when we set , we have
(12)
Let be an arbitrary positive number and let . We first bound the term
We note that is an i.i.d. sequence. Besides, we have that and is finite based on Condition (A3). Then based on Kolmogorov’s maximal inequality [21, Theorem 2.5.5], we have that
(13)
Note that here we use Kolmogorov's maximal inequality, which only requires that the second moment of the log-likelihood ratio is finite; this is a weaker condition than assuming that the third absolute moment of the log-likelihood ratio is finite, as in [16]. Then we have that
(14)
When we set in (III-B), using similar arguments as in the derivation of (14), we obtain
From (III-B) and the fact that , we have that
From (10) it follows that , which together with (14) implies that
Similarly, we also obtain
Due to the arbitrariness of , letting , we have that and , completing the proof of the strong converse as desired.
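The Kolmogorov maximal inequality invoked in (13) bounds the probability that a mean-zero, finite-variance random walk ever leaves [-x, x] within n steps by Var(S_n)/x^2. A small Monte Carlo sanity check of this bound, with plus-or-minus-one steps and arbitrarily chosen parameters:

```python
import random

def walk_exceeds(n, x, rng):
    """One n-step walk of mean-zero +/-1 increments; True iff max_k |S_k| >= x."""
    s = 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
        if abs(s) >= x:
            return True
    return False

rng = random.Random(0)
n, x, trials = 400, 60, 2000
freq = sum(walk_exceeds(n, x, rng) for _ in range(trials)) / trials
bound = n / x ** 2   # Kolmogorov: P(max_k |S_k| >= x) <= Var(S_n)/x^2, Var(S_n) = n
print(freq, bound)   # the empirical frequency respects the bound
```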
IV Second-order Asymptotics
In the previous section, we considered the (first-order) error exponents of the sequential composite hypothesis testing problem under probabilistic constraints. While the result (Theorem 1) is conclusive, there is often substantial motivation [15] to consider higher-order asymptotics due to finite-length considerations. To wit, the probabilistically constrained observation length might be short, and thus the exponents derived in the previous section would be overly optimistic. In this section, we quantify the backoff from the optimal first-order exponents by examining the second-order asymptotics. To do so, we impose a set of somewhat more stringent conditions on the distributions and the uncertainty set . We first assume that the alphabet of the observations is the finite set . Let be the set of probability mass functions with alphabet . In other words, is the probability simplex given by
Similarly, define the open probability simplex
Under hypothesis , the underlying probability mass function is given by and we assume that for all . Under hypothesis , the underlying probability mass function belongs to the set . For any and positive constant , let be the open -neighborhood of the point . Let be such that . See Fig. 1 for an illustration of this projection.
IV-A Other Assumptions and Preliminary Results
We assume that , which contains distributions supported on , satisfies the following conditions:
(A1') The set is equal to for some piece-wise smooth convex function .
(A2') There exists a fixed constant such that for all .
(A3') The function is smooth (infinitely differentiable) on for some .
The key tool used in the derivation of the second-order terms is a central limit-type result for , the maximum of log-likelihood ratios of the observations over . To simplify this quantity, we define the empirical distribution or type [22, Chapter 11] of as , for . In the following, for the sake of brevity, we often suppress the dependence on the sequence and write in place of , but we note that is a random distribution induced by the observations . Since is a finite set, we have
(15)
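The reduction in (15) rests on the standard type identity for finite alphabets: the maximal log-likelihood ratio equals n times the gap between the divergence of the empirical type from the null pmf and the minimal divergence of the type from the alternative family. The sketch below verifies this identity numerically on a toy three-symbol alphabet, with a finite family standing in for the uncertainty set; all numbers are illustrative.

```python
import math
from collections import Counter

def kl(t, q):
    """D(t || q) for pmfs on a common finite alphabet (0 log 0 = 0)."""
    return sum(ti * math.log(ti / qi) for ti, qi in zip(t, q) if ti > 0)

def type_of(xs, k):
    """Empirical distribution (type) of xs on the alphabet {0, ..., k-1}."""
    c = Counter(xs)
    return [c[a] / len(xs) for a in range(k)]

p = [0.5, 0.3, 0.2]                                           # null pmf
family = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.3, 0.4, 0.3]]  # finite stand-in
xs = [0, 1, 1, 2, 1, 0, 2, 1, 1, 0]                           # fixed observations
n, t = len(xs), type_of(xs, 3)

# direct maximization of the log-likelihood ratio over the family
direct = max(sum(math.log(q[x] / p[x]) for x in xs) for q in family)
# the same quantity via the type: n * (D(t||p) - min_q D(t||q))
via_type = n * (kl(t, p) - min(kl(t, q) for q in family))
print(abs(direct - via_type) < 1e-9)  # True: the two computations agree
```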
The key in obtaining the central limit-type result for the sequence of random variables is to solve the optimization problem in (IV-A), or more precisely, to understand the properties of the optimizer to (IV-A). Now we study the following optimization problem for a generic :
(16)
Let be an optimizer to the optimization problem (IV-A). The properties of are provided in the following three lemmas.
Lemma 4.
If and , then the optimizer is unique.
The existence and uniqueness of the optimizer of the optimization problem (IV-A) follow from the strict convexity of the function on the compact convex (uncertainty) set .
As the optimizer is unique, we can define the function
For the sake of convenience in what follows, define
(17)
Some key properties of are provided in Lemma 10 and Lemma 11 in the Appendix. By the definition of , it follows that . Without loss of generality, we assume
Then there exists such that for , the following equation holds
Then for , the Jacobian of with respect to is
We now introduce the following regularity condition on the function at the point :
(A4') The Jacobian matrix is of full rank (i.e., ).
One may wonder whether the new assumptions we have stated are overly restrictive. In fact, they are not and there exist interesting families of uncertainty sets that satisfy Assumptions (A1’)–(A4’). A canonical example of an uncertainty set that satisfies these conditions is when is piece-wise linear on the set . Thus, is similar to a linear family [23], an important class of statistical models.
Example 3.
Let be a set of linear functions with domain and let be a set of real numbers. Let . Assume and satisfy the following three conditions:
• The set and for some ;
• The minimizer is such that and for ;
• Let for some real coefficients . One of the coefficients of , i.e., one of the numbers in the set , is not equal to .
Intuitively, defined as the intersection of halfspaces (linear inequality constraints) is a polyhedron contained in the relative interior of . Fig. 1 provides an illustration for the ternary case .
Proposition 5.
The set described in Example 3 satisfies Conditions (A1’)–(A4’).
Now we are ready to state the promised central limit-type result for , defined in (IV-A). Define the relative entropy variance [15]
and the Gaussian cumulative distribution function . Then we have
Proposition 6.
Under Conditions (A1’)–(A4’), if is a sequence of i.i.d. random variables with for all , then , defined in (IV-A), satisfies
A major result in the statistics literature that bears some semblance to Proposition 6 is known as Wilks' theorem (see [17, Chapter 16] for example). For the case in which the null hypothesis is simple (Wilks' theorem also applies to the case in which both the null and alternative hypotheses are composite, but we are only concerned with the simpler setting here), Wilks' theorem states that if the sequence of random variables is independently drawn from (the distribution of the null hypothesis), then (two times) the log-likelihood ratio statistic
where is the chi-squared distribution with degrees of freedom. This result differs from Proposition 6 because in the maximization is taken over whereas in Wilks' theorem, it is taken over . This results in different normalizations in the statements on convergence in distribution; in Proposition 6, is normalized by but there is no normalization of the log-likelihood ratio statistic in Wilks' theorem. This is because, for the former (our result), the dominant term is the first-order term in the Taylor expansion, whereas in the latter (Wilks' theorem), the dominant term is the second-order term.
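Proposition 6 concerns the maximum over an uncountable family, which is the difficult part; the square-root normalization by the relative entropy variance can already be seen in the simple-versus-simple special case. The following Monte Carlo sketch, with arbitrarily chosen finite-alphabet distributions, standardizes the log-likelihood ratio sum under the alternative and checks that roughly half of its mass falls below zero, consistent with convergence to the standard Gaussian:

```python
import math
import random

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
# relative entropy D(q||p) and relative entropy variance V(q||p)
D = sum(qa * math.log(qa / pa) for qa, pa in zip(q, p))
V = sum(qa * (math.log(qa / pa) - D) ** 2 for qa, pa in zip(q, p))

def draw(rng):
    """Sample one symbol from q by inversion."""
    u, acc = rng.random(), 0.0
    for a, qa in enumerate(q):
        acc += qa
        if u < acc:
            return a
    return len(q) - 1

rng = random.Random(0)
n, trials = 500, 1000
z = []
for _ in range(trials):
    s = 0.0
    for _ in range(n):
        a = draw(rng)
        s += math.log(q[a] / p[a])
    z.append((s - n * D) / math.sqrt(n * V))   # standardized LLR sum
frac = sum(v <= 0 for v in z) / trials
print(round(frac, 2))  # close to Phi(0) = 0.5
```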
Proposition 7.
Conditions (A1’)–(A4’) imply Conditions (A1)–(A3) in Section III.
IV-B Definition and Main Results
We say that a second-order exponent pair is -achievable under the probabilistic constraints if there exists a sequence of sequential hypothesis tests that satisfies the probabilistic constraints on the stopping time in (4) and
where , which is unique (see Proposition 7 which implies that Condition (A2) is satisfied). The set of all -achievable second-order exponent pairs is denoted as , the second-order error exponent region. The set of second-order error exponents is stated in the following theorem.
Theorem 8.
If Conditions (A1’)–(A4’) are satisfied, for any , the second-order error exponent region is
Furthermore, the boundary of this set is achieved by an appropriately defined sequence of GSPRTs.
This theorem states that the backoffs from the (first-order) error exponents are of orders and the implied constants are given by and . Thus, we have stated a set of sufficient conditions on the distributions and the uncertainty set (namely (A1’)–(A4’)) for which the second-order terms are analogous to that for simple sequential hypothesis testing under the probabilistic constraints derived by Li and Tan [16]. However, the techniques used to derive Theorem 8 are more involved compared to those for the probabilistic constraints in [16]. This is because we have to derive the asymptotic distribution of the maximum of a set of log-likelihood ratio terms (cf. Proposition 6). This constitutes our main contribution in this part of the paper.
IV-C Proof of the Achievability Part of Theorem 8
The proof of achievability consists of two parts. We first prove the desired upper bound on type-I error probability and the maximal type-II error probability under an appropriately defined sequence of GSPRTs. Then we prove that the probabilistic constraints are satisfied.
We start with the proof of the first part. Let and be such that . Let be the GSPRT with thresholds
and
Based on Proposition 7, we know that (A1)–(A3) are satisfied. Hence, from [13, Theorem 2.1] we have that
To simplify and , we introduce an approximation lemma from [24, Lemma 48].
Lemma 9.
Let be a compact metric space. Suppose and are continuous, then we have
where and .
Here we take and . Based on Lemma 9 and the fact that has a unique minimizer (see Assumption (A2) which is implied by Proposition 7), we have
(18)
Similarly, we have
(19)
Thus, based on (18) and (19), the arbitrariness of and , and the continuity of , we obtain
(20)
and
(21)
Next we prove that the probabilistic constraints for the sequence of GSPRTs are satisfied. Let . We observe that with probability 1. Thus, we have
(22)
(23)
where (22) is from Proposition 6. Hence, for sufficiently large .
We now prove that . Let . We also have with probability 1. Then by the Berry-Esseen Theorem [25], for any , we have
(24)
where is a positive finite constant depending only on and . By Condition (A2') (i.e., that ) and since , is uniformly bounded on . Then for every , there exists an integer which does not depend on , such that when , . Since is arbitrary, .
IV-D Proof of the Converse Part of Theorem 8
For each , from [16], we know that
where as . Now we want to find the optimal upper bound for all , which means we need to obtain
Similar to the analysis in the achievability part, we use Lemma 9 and obtain that
Similarly, we have that
which completes the proof of the converse.
E Properties of
Lemma 10.
If and , then the following properties of the optimizer hold.
(i) The function is continuous on ;
(ii) There exists such that for , is smooth (infinitely differentiable) at ;
(iii) For , the optimizer is such that (i.e., is on the boundary of the uncertainty set);
(iv) For , there exists a symbol such that
(25)
(v) For , and ( is the symbol that satisfies (25) in Part (iv) above),
(26)
Proof:
We first prove Part (i) of Lemma 10. Assume, to the contrary, that is not continuous at some . Then there exists a positive number and a sequence such that as and for all . From the definition of and the fact that , there exists such that
(27)
for all . From Condition (A2’) and the fact that , there exists a constant such that
which further implies that
(28)
Combining (27) and (E), we have that
which contradicts the fact that
Hence is continuous on .
We next prove Part (ii) of Lemma 10. From the continuity of (as proved above), there exists such that
which, together with Condition (A3’) implies Part (ii) of Lemma 10.
We now proceed to prove Part (iii) of Lemma 10. Recall that the optimizer is obtained from the optimization problem (IV-A). Its corresponding Lagrangian is
For , is smooth at (the previous part). Hence using the Karush–Kuhn–Tucker (KKT) conditions [26], the optimizer satisfies the first-order stationary conditions, which are
(29)
The complementary slackness condition is , which implies that either or . When , we have
which is impossible as . Thus, it holds that , which means the optimizer lies on the boundary of the set .
Lemma 11.
Proof:
Now we prove Part (i) of Lemma 11. As is smooth and is of full rank, there exists such that is of full rank for all . Then by the inverse function theorem [27, Theorem 2.11], is differentiable in . We multiply on both sides of (29) and sum from to to obtain
(34)
We differentiate the first constraint with respect to on both sides to obtain
(35)
From Part (iii) of Lemma 10 it follows that , which means that the function formed by the composition of and is constant for all . Therefore, the derivative of this composition with respect to is , i.e.,
(36)
Substituting (35) and (36) back into (34), we have that
as desired.
F Proof of Proposition 5
Assume . Without loss of generality, we assume . Conditions (A1’)–(A3’) clearly hold. Hence from Part (ii) of Lemma 10 there exists such that for all , the optimizer of the optimization problem (IV-A) is such that and that for all . Note that . Then for , using the KKT conditions, we obtain the first-order optimality conditions for the optimizer :
(37)
Hence,
(38)
Substituting (38) into (37), we obtain
Thus, the Jacobian of at is the following diagonal matrix:
Since for all , the diagonal terms in the Jacobian are non-zero. Thus, , which proves that Condition (A4’) holds for the set in Example 3.
G Proof of Proposition 6
We now prove the promised central limit-type result for the sequence of random variables . Let . Let be given as in Part (i) of Lemma 11 and define the -typical set
This is the set of sequences whose types are near . The key idea is to perform a Taylor expansion of the function (defined in (17)) at the point and analyze the asymptotics of the various terms in the expansion. For brevity, define the deviation of the type of from the true distribution at symbol as
For , let be the Hessian matrix of . This is well defined because is twice continuously differentiable on according to Part (ii) of Lemma 11. If , then . Thus for , using Taylor’s theorem we have the expansion
(39)
(40)
where lies on the line segment between and , (39) follows from (32) in Lemma 11 and (40) follows from Part (i) of Lemma 11. Note that we represent probability mass functions as row vectors.
Then for , from (40), we have that
(41)
Let and be the smallest and largest eigenvalues of , respectively. From Part (i) of Lemma 11, it follows that is smooth on , which implies that there exist two constants and such that
(42)
Then we have the upper bound shown in (45),
(43)
(44)
(45)
in which (43) follows from the fact that for and (G), (44) follows from (G), and (45) holds by the union bound and Hoeffding’s inequality. Similarly, we can obtain the lower bound shown in (46).
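The concentration step behind (45) combines Hoeffding's inequality for each empirical frequency with a union bound over the finitely many symbols; in placeholder notation, with $\hat{T}_n(a)$ the fraction of the first $n$ i.i.d. samples equal to $a$ and $\mathcal{X}$ the finite alphabet:

```latex
% Hoeffding's inequality for one empirical frequency, then a union
% bound over the finite alphabet (placeholder notation):
\Pr\bigl( \lvert \hat{T}_n(a) - P(a) \rvert \ge \delta \bigr)
    \le 2 e^{-2 n \delta^2},
\qquad
\Pr\Bigl( \max_{a \in \mathcal{X}}
    \lvert \hat{T}_n(a) - P(a) \rvert \ge \delta \Bigr)
    \le 2 \lvert \mathcal{X} \rvert \, e^{-2 n \delta^2}.
```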
(46)
One can verify that
(47)
and the variance
(48)
(49)
where (48) follows from
and (49) follows from
and
Therefore is a sum of i.i.d. random variables with mean and variance . Hence, by the central limit theorem,
Together with the fact that almost surely, this implies that
(50)
and
(51)
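The central limit theorem invoked above is the classical Lindeberg–Lévy CLT for i.i.d. sums; in placeholder notation:

```latex
% Lindeberg--Levy CLT: X_1, X_2, ... i.i.d. with mean \mu and
% variance \sigma^2 \in (0, \infty)  (placeholder symbols).
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \bigl( X_i - \mu \bigr)
    \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2)
\qquad \text{as } n \to \infty.
```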
Then combining (45), (46), (50) and (51), we have that
(52)
and
(53)
Since is arbitrary, it follows from (52) and (53) that
which completes the proof of Proposition 6.
H Proof of Proposition 7
We now show that Conditions (A1’)–(A4’) imply Conditions (A1)–(A3). Condition (A1) follows directly from Condition (A1’). As , we have
and
By Condition (A2’), which says that for all and , and are uniformly bounded and twice continuously differentiable on . As , and , which, together with the compactness of , implies that
(54)
From [22, Theorem 2.7.2], is strictly convex in , which, together with the fact that is compact and convex, implies the uniqueness of the minimizers to the two optimization problems in (54).
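The convexity fact from [22, Theorem 2.7.2] is the joint convexity of relative entropy, which in particular yields convexity in each argument with the other held fixed; in placeholder notation:

```latex
% Joint convexity of relative entropy [22, Thm. 2.7.2]: for pmfs
% p_1, p_2, q_1, q_2 on a common alphabet and \lambda \in [0, 1],
D\bigl( \lambda p_1 + (1-\lambda) p_2 \,\big\|\,
        \lambda q_1 + (1-\lambda) q_2 \bigr)
    \le \lambda\, D(p_1 \| q_1) + (1-\lambda)\, D(p_2 \| q_2).
```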
For Condition (A3), as is a finite alphabet and Condition (A2’) holds, it can be easily checked that . Note that
We can define the finite number (which is finite because Condition (A2’) mandates that for all ). With this choice, trivially, for all ,
which shows that Condition (A3) clearly holds.
References
- [1] J. Pan, Y. Li, and V. Y. F. Tan, “Asymptotics of sequential composite hypothesis testing under probabilistic constraints,” in IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 2021, pp. 172–177.
- [2] R. Blahut, “Hypothesis testing and information theory,” IEEE Transactions on Information Theory, vol. 20, no. 4, pp. 405–417, 1974.
- [3] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesis testing and changepoint detection. CRC Press, 2014.
- [4] J. Neyman and E. S. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society of London (Series A), vol. 231, pp. 289–337, 1933.
- [5] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC), 2014.
- [6] A. Wald and J. Wolfowitz, “Optimum character of the sequential probability ratio test,” Ann. Math. Statist., vol. 19, no. 3, pp. 326–339, 1948.
- [7] A. Lalitha and T. Javidi, “Reliability of sequential hypothesis testing can be achieved by an almost-fixed-length test,” in IEEE International Symposium on Information Theory. IEEE, 2016, pp. 1710–1714.
- [8] M. Haghifam, V. Y. F. Tan, and A. Khisti, “Sequential classification with empirically observed statistics,” IEEE Transactions on Information Theory, vol. 67, no. 5, pp. 3095–3113, 2021.
- [9] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio test optimal?” IEEE Transactions on Information Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
- [10] T.-L. Lai, “Asymptotic optimality of generalized sequential likelihood ratio tests in some classical sequential testing problems,” Sequential Analysis, vol. 21, no. 4, pp. 219–247, 2002.
- [11] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier hypothesis testing,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 4066–4082, 2014.
- [12] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal sequential outlier hypothesis testing,” Sequential Analysis, vol. 36, no. 3, pp. 309–344, 2017.
- [13] X. Li, J. Liu, and Z. Ying, “Generalized sequential probability ratio test for separate families of hypotheses,” Sequential Analysis, vol. 33, no. 4, pp. 539–563, 2014.
- [14] V. Strassen, “Asymptotische Abschätzungen in Shannons Informationstheorie,” in Transactions of the Third Prague Conference on Information Theory, Statistical Decision Functions, Random Processes. Prague: Czechoslovak Academy of Sciences, 1962, pp. 689–723.
- [15] V. Y. F. Tan, “Asymptotic estimates in information theory with non-vanishing error probabilities,” Foundations and Trends® in Communications and Information Theory, vol. 11, no. 1-2, pp. 1–184, 2014.
- [16] Y. Li and V. Y. F. Tan, “Second-order asymptotics of sequential hypothesis testing,” IEEE Transactions on Information Theory, vol. 66, no. 11, pp. 7222–7230, 2020.
- [17] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 1998.
- [18] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008.
- [19] A. R. Sampson, “Characterizing exponential family distributions by moment generating functions,” The Annals of Statistics, vol. 3, no. 3, pp. 747–753, 1975.
- [20] H. J. Bierens, Modes of Convergence, ser. Themes in Modern Econometrics. Cambridge University Press, 2004, pp. 137–178.
- [21] R. Durrett, Probability: Theory and Examples. Duxbury Press, 2004.
- [22] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). USA: Wiley-Interscience, 2006.
- [23] S.-I. Amari and H. Nagaoka, Methods of Information Geometry, ser. Translations of Mathematical Monographs. American Mathematical Society, 2007.
- [24] Y. Polyanskiy, Channel Coding: Non-Asymptotic Fundamental Limits. Princeton University, 2010.
- [25] A. C. Berry, “The accuracy of the Gaussian approximation to the sum of independent variates,” Transactions of the American Mathematical Society, vol. 49, no. 1, pp. 122–136, 1941.
- [26] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
- [27] M. Spivak, Calculus On Manifolds: A Modern Approach To Classical Theorems Of Advanced Calculus. Taylor & Francis Inc, 1971.
Jiachun Pan is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the National University of Singapore (NUS). She received the B.S. degree from the University of Electronic Science and Technology of China (UESTC) in 2015 and the M.Eng. degree from the University of Science and Technology of China (USTC) in 2019. Her research interests include information theory and statistical learning.
Yonglong Li is a research fellow at the Department of Electrical and Computer Engineering, National University of Singapore. He received the bachelor's degree in Mathematics from Zhengzhou University in 2011 and the Ph.D. degree in Mathematics from the University of Hong Kong in 2015. From 2017 to 2019, he was a postdoctoral fellow at the Center for Memory and Recording Research (CMRR), University of California, San Diego.
Vincent Y. F. Tan (S’07-M’11-SM’15) was born in Singapore in 1981. He received the B.A. and M.Eng. degrees in electrical and information science from Cambridge University in 2005, and the Ph.D. degree in electrical engineering and computer science (EECS) from the Massachusetts Institute of Technology (MIT) in 2011. He is currently a Dean’s Chair Associate Professor with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore (NUS). His research interests include information theory, machine learning, and statistical signal processing. Dr. Tan is a member of the IEEE Information Theory Society Board of Governors. He was an IEEE Information Theory Society Distinguished Lecturer from 2018 to 2019. He received the MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize in 2011, the NUS Young Investigator Award in 2014, the Singapore National Research Foundation (NRF) Fellowship (Class of 2018), and the NUS Young Researcher Award in 2019. He is currently serving as a Senior Area Editor for the IEEE Transactions on Signal Processing and for the IEEE Transactions on Information Theory.