
Asymptotics of Sequential Composite Hypothesis Testing under Probabilistic Constraints

Jiachun Pan, Yonglong Li, Vincent Y. F. Tan, Senior Member, IEEE This work is partially funded by a Singapore National Research Foundation Fellowship (R-263-000-D02-281). The paper was presented in part at the 2021 International Symposium on Information Theory (ISIT) [1]. Jiachun Pan is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Email: pan.jiachun@u.nus.edu. Yonglong Li is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Email: elelong@nus.edu.sg. Vincent Y. F. Tan is with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore, Singapore, Email: vtan@nus.edu.sg. Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
Abstract

We consider the sequential composite binary hypothesis testing problem in which one of the hypotheses is governed by a single distribution while the other is governed by a family of distributions whose parameters belong to a known set $\Gamma$. We would like to design a test to decide which hypothesis is in effect. Under the constraints that the probabilities that the length of the test, a stopping time, exceeds $n$ are bounded by a certain threshold $\epsilon$, we obtain certain fundamental limits on the asymptotic behavior of the sequential test as $n$ tends to infinity. Assuming that $\Gamma$ is a convex and compact set, we obtain the set of all first-order error exponents for the problem. We also prove a strong converse. Additionally, we obtain the set of second-order error exponents under the assumption that the alphabet of the observations $\mathcal{X}$ is finite. In the proof of the second-order asymptotics, a main technical contribution is the derivation of a central limit-type result for a maximum of an uncountable set of log-likelihood ratios under suitable conditions. This result may be of independent interest. We also show that some important statistical models satisfy the conditions.

Index Terms:
Sequential composite hypothesis testing, Error exponents, Second-order asymptotics, Generalized sequential probability ratio test

I Introduction

Hypothesis testing is a fundamental problem in information theory and statistics [2]. Here we consider a sequential composite hypothesis testing problem in which i.i.d. observations are drawn from either a simple null hypothesis or a composite alternative hypothesis. We consider the first-order and second-order tradeoffs between the two types of error probabilities under a probabilistic constraint on the stopping time. There is a vast literature on this subject [3, Part I]; however, the optimal trade-off under the probabilistic stopping time constraint has not been determined. The probabilistic constraint means that we constrain the probabilities (under both hypotheses) that the stopping time exceeds $n$ to be no larger than some prescribed threshold $\epsilon\in(0,1)$. We let $n$ tend to infinity to exploit various asymptotic and concentration results.

I-A Related works

In the classical problem of sequential hypothesis testing in the statistical literature, one seeks to minimize the expected sample size $\mathbb{E}_{i}[\tau(\delta)]$, $i\in\{0,1\}$, subject to bounds on the type-I and type-II error probabilities $P_{0}(\delta_{\tau}=1)\leq\alpha$ and $P_{1}(\delta_{\tau}=0)\leq\beta$, i.e., the sequential testing problem is to solve, for each $i\in\{0,1\}$,

\[
\min_{(\tau,\delta)}\mathbb{E}_{i}[\tau(\delta)]\quad\mbox{s.t.}\quad P_{0}(\delta_{\tau}=1)\leq\alpha,\; P_{1}(\delta_{\tau}=0)\leq\beta. \tag{1}
\]

There is a vast literature on solving the above problem (see [3, Part I] for example). The dual problem corresponding to (1) is the minimization of the error probabilities subject to expectation constraints on the sample size. More specifically, the dual problem corresponding to (1) entails solving, for each $i\in\{0,1\}$,

\[
\min_{(\tau,\delta)}P_{i}(\delta_{\tau}=1-i)\quad\mbox{s.t.}\quad \mathbb{E}_{i}[\tau]\leq n,\; i\in\{0,1\}. \tag{2}
\]

The optimal tests $(\tau^{*},\delta^{*})$ of (1) and (2) are given by appropriate sequential probability ratio tests. However, in this paper, we consider the problem of minimizing the error probabilities subject to probabilistic constraints on the sample size. In more detail, the problem we are concerned with is the following:

\[
\min_{(\tau,\delta)}P_{i}(\delta_{\tau}=1-i)\quad\mbox{s.t.}\quad P_{i}(\tau>n)<\epsilon,\; i\in\{0,1\}. \tag{3}
\]

As the nature of the constraints is different (expectation versus probabilistic), the proof techniques are also different. For problem (2), Wald's identity and the data-processing inequality are used to derive the achievability and the converse. For our problem (3), tools such as concentration inequalities and the central limit theorem are used to derive the achievability and the converse.

For the first-order asymptotics (exponents of the two types of error probabilities), there is a vast literature on binary hypothesis testing. In fixed-length hypothesis testing, where the length of the vector of observations is fixed, the Neyman–Pearson lemma [4] states that the likelihood ratio test is optimal, and the Chernoff–Stein lemma [5, Theorem 13.1] shows that if we constrain the type-I error to be less than any $\epsilon\in(0,1)$, the best (maximum) type-II error exponent is the relative entropy $D(p_{0}\|p_{1})$, where $p_{0}$ and $p_{1}$ are respectively the distributions under the null and alternative hypotheses. If we require the type-I error exponent to be at least $r>0$, i.e., the type-I error probability is upper bounded by $\exp(-nr)$, the maximum type-II error exponent is $\min\{D(q\|p_{1}):D(q\|p_{0})\leq r\}$ [2]. In this regard, we see that there is a trade-off between the two error exponents, i.e., they cannot be simultaneously large. However, in the sequential case, where the length of the test sample is a stopping time and its expectation is bounded by $n$, the trade-off can be eradicated. Wald and Wolfowitz [6] showed that when the expectations of the sample length under $H_{0}$ and $H_{1}$ are bounded by a common integer $n$ (these are known as the expectation constraints) and $n$ tends to infinity, the set of achievable error exponents is $\{(E_{0},E_{1}):E_{0}\leq D(p_{1}\|p_{0}),\ E_{1}\leq D(p_{0}\|p_{1})\}$. In addition, the corner point $(D(p_{1}\|p_{0}),D(p_{0}\|p_{1}))$ is attained by a sequence of sequential probability ratio tests (SPRTs). Lalitha and Javidi [7] considered an interesting setting that interpolates between fixed-length hypothesis testing and sequential hypothesis testing. They considered the almost-fixed-length hypothesis testing problem, in which the stopping time is allowed to be larger than a prescribed integer $n$ with exponentially small probability $\exp(-n\gamma)$ for different $\gamma>0$. The probabilistic constraints we employ in this paper are analogous to those in [7], but instead of allowing the event that the stopping time is larger than $n$ to have exponentially small probability, we only require this event to have probability at most $\epsilon\in(0,1)$, a fixed constant. This allows us to ask questions ranging from strong converses to second-order asymptotics. In [8], Haghifam, Tan, and Khisti considered sequential classification, which is similar to sequential hypothesis testing apart from the fact that the true distributions are only partially known in the form of training samples.

For composite hypothesis testing, Zeitouni, Ziv, and Merhav [9] investigated the generalized likelihood ratio test (GLRT) and proposed conditions for asymptotic optimality of the GLRT in the Neyman–Pearson sense. For the sequential case, Lai [10] analyzed different sequential testing problems and obtained a unified asymptotic theory showing that certain generalized sequential likelihood ratio tests are asymptotically optimal solutions to these problems. Li, Nitinawarat, and Veeravalli [11] considered a universal outlier hypothesis testing problem in the fixed-length setting; universality here refers to the fact that the distributions are unknown and have to be estimated on the fly. They then extended their work to the sequential setting [12], but under expectation constraints on the stopping time. The work that is closest to ours is that by Li, Liu, and Ying [13], whose results can be modified to solve the composite version of the dual problem (2). They showed that the generalized sequential probability ratio test is asymptotically optimal by making use of optimality results for sequential probability ratio tests (SPRTs).

Concerning the second-order asymptotic regime, in fixed-length binary hypothesis testing in which the type-I error is bounded by a fixed constant $\epsilon\in(0,1)$, Strassen [14] showed that the second-order term can be quantified via the relative entropy variance [15] and the inverse of the Gaussian cumulative distribution function. For the sequential case, Li and Tan [16] recently established the second-order asymptotics of sequential binary hypothesis testing under probabilistic and expectation constraints on the stopping time, showing that the former (resp. latter) set of constraints results in a $\Theta(1/\sqrt{n})$ (resp. $\Theta(1/n)$) backoff from the relative entropies. These are estimates of the costs of operating in the finite-length setting. In this paper, we seek to extend these results to sequential composite hypothesis testing.

I-B Main contributions

Our main contributions consist in obtaining the first-order and second-order asymptotics for sequential composite hypothesis testing under the probabilistic constraints, i.e., we constrain the probabilities that the length of observations exceeds $n$ to be no larger than some prescribed $\epsilon\in(0,1)$.

  • First, while the results of Li, Liu, and Ying [13] can be modified to solve the composite version of the dual problem in (2), which yields first-order asymptotic results under expectation constraints, we obtain the first-order asymptotic results under the probabilistic constraints. We show that the corner points of the optimal error exponent regions are identical under both types of constraints.

  • Second, Li, Liu, and Ying [13] only proved that the generalized sequential probability ratio test is asymptotically optimal by making use of the optimality results of the sequential probability ratio test (SPRT). Here we prove a strong converse result, namely that the exponents stay unchanged even if the probability that the stopping time exceeds $n$ is smaller than $\epsilon$ for all $\epsilon\in(0,1)$. We do so using information-theoretic ideas and, in particular, the ubiquitous change-of-measure technique (Lemma 3).

  • Third, and most importantly, we obtain the second-order asymptotics of the error exponents when we assume that the observations take values on a finite alphabet. A main technical contribution here is that we obtain a new central limit-type result for a maximum of an uncountable set of log-likelihood ratios under suitable conditions (Proposition 6). We contrast our central limit-type result to classical statistical results such as Wilks’ theorem [17, Chapter 16].

I-C Paper Outline

The rest of the paper is structured as follows. In Section II, we formulate the composite sequential hypothesis testing problem precisely and state the probabilistic constraints on the stopping time. In Section III, we list some mild assumptions on the distributions and uncertainty set in order to state and prove our first-order asymptotic results. In Section IV, we consider the second-order asymptotics of the same problem by augmenting the assumptions stated in Section III. We state a central limit-type theorem for the maximum of a set of log-likelihood ratios and our main result concerning the second-order asymptotics. We relegate the more technical calculations (such as proofs of lemmata) to the appendix.

II Problem Formulation

Let $\{X_{i}\}_{i=1}^{\infty}$ be an observed i.i.d. sequence, where each $X_{i}$ follows a density $p$ with respect to a base measure $\mu$ on the alphabet $\mathcal{X}$. We consider the problem of composite hypothesis testing:

\[
H_{0}:p=p_{0}\quad\mbox{and}\quad H_{1}:p\in\{p_{\gamma}:\gamma\in\Gamma\},
\]

where $p_{0}$ and $p_{\gamma}$ are density functions with respect to $\mu$ and $p_{0}\notin\{p_{\gamma}:\gamma\in\Gamma\}$. We assume that $p_{0}$ and $p_{\gamma}$ are mutually absolutely continuous for all $\gamma\in\Gamma$. Denote $P_{0}$ and $P_{\gamma}$ as the probability measures associated to $p_{0}$ and $p_{\gamma}$, respectively. Let $\mathcal{F}(X^{n})$ be the $\sigma$-algebra generated by $X^{n}=(X_{1},X_{2},\ldots,X_{n})$. Let $\tau$ be a stopping time adapted to the filtration $\{\mathcal{F}(X^{n})\}_{n=1}^{\infty}$ and let $\mathcal{F}_{\tau}$ be the $\sigma$-algebra associated with $\tau$. Let $\delta$ be a $\{0,1\}$-valued $\mathcal{F}_{\tau}$-measurable function. The pair $(\delta,\tau)$ constitutes a sequential hypothesis test, where $\delta$ is called the decision function and $\tau$ is the stopping time. When $\delta=0$ (resp. $\delta=1$), the decision is made in favor of $H_{0}$ (resp. $H_{1}$). The type-I and maximal type-II error probabilities are defined as

\[
\mathsf{P}_{1|0}(\delta,\tau):=P_{0}(\delta=1)\quad\mbox{and}\quad\mathsf{P}_{0|1}(\delta,\tau):=\sup_{\gamma\in\Gamma}P_{\gamma}(\delta=0).
\]

In other words, $\mathsf{P}_{1|0}(\delta,\tau)$ is the error probability that the true density is $p_{0}$ but $\delta=1$, and $\mathsf{P}_{0|1}(\delta,\tau)$ is the maximal error probability over all parameters $\gamma\in\Gamma$ that the true density is $p_{\gamma}$ but the decision $\delta=0$ is made based on the observations up to time $\tau$.

In this paper, we seek the first-order and second-order asymptotics of the exponents of the error probabilities under probabilistic constraints on the stopping time $\tau$. The probabilistic constraints dictate that, for every error tolerance $0<\epsilon<1$, there exists an integer $n_{0}(\epsilon)$ such that for all $n>n_{0}(\epsilon)$, the stopping time $\tau$ satisfies

\[
P_{0}(\tau>n)<\epsilon\quad\mbox{and}\quad\sup_{\gamma\in\Gamma}P_{\gamma}(\tau>n)<\epsilon, \tag{4}
\]

and

\[
P_{0}(\tau<\infty)=1\quad\mbox{and}\quad\sup_{\gamma\in\Gamma}P_{\gamma}(\tau<\infty)=1. \tag{5}
\]

In the following, all logarithms are natural logarithms, i.e., with respect to base $e$.

III First-order Asymptotics

We say that an exponent pair $(E_{0},E_{1})$ is $\epsilon$-achievable under the probabilistic constraints if there exists a sequence of sequential hypothesis tests $\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}$ that satisfies the probabilistic constraints on the stopping time in (4) and (5) and

\[
E_{0}\leq\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})},\qquad
E_{1}\leq\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}.
\]

The set of all $\epsilon$-achievable $(E_{0},E_{1})$ is denoted as $\mathcal{E}_{\epsilon}(p_{0},\Gamma)$. For simple (non-composite) binary sequential hypothesis testing under the expectation constraints (i.e., $\max_{i=0,1}\mathbb{E}_{P_{i}}[\tau]\leq n$), the set of all achievable error exponent pairs, as shown by Wald and Wolfowitz [6] (also see [16, 7]), is

\[
\tilde{\mathcal{E}}_{\epsilon}(p_{0},p_{1})=\{(E_{0},E_{1}):E_{0}\leq D(p_{1}\|p_{0}),\ E_{1}\leq D(p_{0}\|p_{1})\}. \tag{6}
\]

The corner point $(D(p_{1}\|p_{0}),D(p_{0}\|p_{1}))$ can be achieved by a sequence of sequential probability ratio tests [6].

We define the log-likelihood ratio and maximal log-likelihood ratio respectively as

\[
S_{n}(\gamma):=\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}\quad\mbox{and}\quad S_{n}:=\sup_{\gamma\in\Gamma}S_{n}(\gamma).
\]

For two positive numbers $A$ and $B$, we define the stopping time $\tau$ as

\[
\tau:=\inf\{n:S_{n}>A\ \text{or}\ S_{n}<-B\},
\]

and the decision rule as

\[
\delta:=\begin{cases}0,&\text{if }S_{\tau}<-B,\\ 1,&\text{if }S_{\tau}>A.\end{cases}
\]

We call the above test $(\delta,\tau)$ a generalized sequential probability ratio test (GSPRT) with thresholds $A$ and $B$. The stopping time $\tau$ is almost surely finite for any distribution within the family [13], so (5) holds for the GSPRT. For the above GSPRT, we define the type-I error probability and maximal type-II error probability respectively as

\[
\mathsf{P}_{1|0}(\tau,\delta):=P_{0}(S_{\tau}>A)\quad\mbox{and}\quad\mathsf{P}_{0|1}(\tau,\delta):=\sup_{\gamma\in\Gamma}P_{\gamma}(S_{\tau}<-B).
\]
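
To make the GSPRT concrete, the following is a minimal simulation sketch. It assumes a finite alphabet and approximates the uncertainty set $\Gamma$ by a hypothetical finite grid of distributions (`gamma_grid`); the grid, the toy distributions, and all names are illustrative assumptions rather than constructs from the paper.

```python
import numpy as np

def gsprt(xs, p0, gamma_grid, A, B):
    """Run a GSPRT with thresholds A, B > 0; gamma_grid (m x d) is a
    hypothetical finite stand-in for the uncertainty set Gamma."""
    llr = np.log(gamma_grid) - np.log(p0)  # log(p_gamma(x)/p0(x)), shape (m, d)
    S = np.zeros(len(gamma_grid))          # running S_n(gamma), one entry per gamma
    for n, x in enumerate(xs, start=1):
        S += llr[:, x]
        Sn = S.max()                       # S_n = max_gamma S_n(gamma)
        if Sn > A:
            return 1, n                    # decide H1
        if Sn < -B:
            return 0, n                    # decide H0
    return None, len(xs)                   # ran out of samples before deciding

# Toy usage: ternary alphabet, data drawn under H0.
rng = np.random.default_rng(0)
p0 = np.array([0.5, 0.3, 0.2])
gamma_grid = np.array([[0.20, 0.50, 0.30],
                       [0.25, 0.45, 0.30]])
xs = rng.choice(3, size=5000, p=p0)
print(gsprt(xs, p0, gamma_grid, A=20.0, B=20.0))  # typically (0, n): decide H0
```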

We introduce some assumptions on the distributions and $\Gamma$.

  • (A1)

    The parameter set $\Gamma\subset\mathbb{R}^{d}$ is compact.

  • (A2)

    Assume that $\gamma\mapsto D(p_{\gamma}\|p_{0})$ and $\gamma\mapsto D(p_{0}\|p_{\gamma})$ are twice continuously differentiable on $\Gamma$. The solutions to the minimizations $\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})$ and $\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})$ are unique. Their existence is guaranteed by the compactness of $\Gamma$ and the continuity of $\gamma\mapsto D(p_{\gamma}\|p_{0})$ and $\gamma\mapsto D(p_{0}\|p_{\gamma})$ on $\Gamma$. In addition, $\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})>\epsilon_{0}$ and $\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})>\epsilon_{0}$ for some $\epsilon_{0}>0$.

  • (A3)

    Let $\xi(\gamma)=\log p_{\gamma}(X)-\log p_{0}(X)$ be the log-likelihood ratio. We assume that $\mathbb{E}[\max_{\gamma}|\xi(\gamma)|^{2}]<\infty$. Besides, there exist $\alpha>1$ and $x_{0}\in\mathbb{R}$ such that for all $\gamma\in\Gamma$ and $x>x_{0}$,

    \[
    P_{0}\Big(\max_{\gamma\in\Gamma}|\nabla_{\gamma}\xi(\gamma)|>x\Big)\leq e^{-|\log x|^{\alpha}}, \tag{7}
    \]

    where $|\nabla_{\gamma}\xi(\gamma)|$ is the $\ell_{1}$ norm of the gradient vector $\nabla_{\gamma}\xi(\gamma)\in\mathbb{R}^{d}$.

We present some examples that satisfy Conditions (A1)–(A3). We first show that Conditions (A1)–(A3) hold for the canonical exponential family under suitable assumptions and then provide an explicit example.

Example 1 (Canonical exponential families).

The general form of the probability density for the canonical exponential family of probability distributions is [18]:

\[
p_{\bm{\gamma}}(x)=h(x)\exp(\bm{\gamma}^{\top}T(x)-A(\bm{\gamma})),
\]

where $h(x)$ is called the base measure, $\bm{\gamma}$ is the parameter vector, $T(x)$ is referred to as the sufficient statistic, and $A(\bm{\gamma})$ is the cumulant generating function. We define the set of valid parameters as $\Theta=\{\bm{\gamma}\in\mathbb{R}^{d}:A(\bm{\gamma})<\infty\}$.

Now we consider the test

\[
H_{0}:p_{\bm{\gamma}_{0}}(x)=h(x)\exp(\bm{\gamma}_{0}^{\top}T(x)-A(\bm{\gamma}_{0})),\quad\bm{\gamma}_{0}\in\Theta;
\]
\[
H_{1}:p_{\bm{\gamma}}(x)=h(x)\exp(\bm{\gamma}^{\top}T(x)-A(\bm{\gamma})),\quad\bm{\gamma}\in\Gamma,\ \bm{\gamma}_{0}\notin\Gamma.
\]

We also assume that the exponential families under consideration satisfy the following assumptions:

  1. (i)

    $\Gamma\subset\Theta$ is a convex and compact set;

  2. (ii)

    $A(\bm{\gamma})$ is thrice continuously differentiable with respect to $\bm{\gamma}$;

  3. (iii)

    $\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})$ and $\nabla((\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma}))$ are positive definite for $\bm{\gamma}\in\Gamma$.

For this example, Condition (A1) holds because of Assumption (i). For Condition (A2), we have

\[
D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=(\bm{\gamma}-\bm{\gamma}_{0})^{\top}\mathbb{E}_{\bm{\gamma}}[T(X)]-A(\bm{\gamma})+A(\bm{\gamma}_{0}),
\]
\[
D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})=(\bm{\gamma}_{0}-\bm{\gamma})^{\top}\mathbb{E}_{0}[T(X)]-A(\bm{\gamma}_{0})+A(\bm{\gamma}),
\]

which are twice continuously differentiable with respect to $\bm{\gamma}$ in $\Gamma$ based on Assumption (ii). Besides, we have

\[
\nabla_{\bm{\gamma}}^{2}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})\overset{(a)}{=}\nabla((\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})),\qquad
\nabla_{\bm{\gamma}}^{2}D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})=\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma}),
\]

where $(a)$ holds because $\mathbb{E}_{\bm{\gamma}}[T(X)]=\nabla_{\bm{\gamma}}A(\bm{\gamma})$ [18]. Based on Assumption (iii), $D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})$ and $D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})$ are strongly convex in $\bm{\gamma}$. Hence, the solutions to the minimizations are unique. Then we also have

\[
\nabla_{\bm{\gamma}}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=(\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla^{2}_{\bm{\gamma}}A(\bm{\gamma}),
\]

which means that $\nabla_{\bm{\gamma}}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=0$ and $D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=0$ if and only if $\bm{\gamma}=\bm{\gamma}_{0}$. As $\bm{\gamma}_{0}\notin\Gamma$, $\min_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})>0$. Similarly, we have

\[
\nabla_{\bm{\gamma}}D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})=-\nabla_{\bm{\gamma}}A(\bm{\gamma}_{0})+\nabla_{\bm{\gamma}}A(\bm{\gamma}).
\]

As $\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})$ is assumed to be positive definite per Assumption (iii), $\nabla_{\bm{\gamma}}A(\bm{\gamma}_{0})=\nabla_{\bm{\gamma}}A(\bm{\gamma})$ if and only if $\bm{\gamma}=\bm{\gamma}_{0}$. As $\bm{\gamma}_{0}\notin\Gamma$, we have $\min_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})>0$.

For Condition (A3), we have

\[
\xi(\bm{\gamma})=(\bm{\gamma}-\bm{\gamma}_{0})^{\top}T(X)-A(\bm{\gamma})+A(\bm{\gamma}_{0}).
\]

Then $\mathbb{E}[\max_{\bm{\gamma}}|\xi(\bm{\gamma})|^{2}]<\infty$ due to Assumptions (i) and (ii). Let $\mathbf{e}$ be the all-ones vector. For all $t>0$ such that $t\mathbf{e}+\bm{\gamma}_{0}\in\Theta$, we have

\[
\begin{aligned}
P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Big)
&=P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}\big|T(X)-\nabla_{\bm{\gamma}}A(\bm{\gamma})\big|>x\Big)\\
&\overset{(a)}{=}P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}\big|T(X)-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|>x\Big)\\
&\leq P_{0}\Big(\big|T(X)-\mathbb{E}_{0}[T(X)]\big|+\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|>x\Big)\\
&=P_{0}\Big(\big|T(X)-\mathbb{E}_{0}[T(X)]\big|>x-\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|\Big)\\
&\overset{(b)}{\leq}\exp\Big(-tx+t\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|\nabla_{\bm{\gamma}}A(\bm{\gamma})|_{\bm{\gamma}=\bm{\gamma}_{0}}+\log 2\Big),
\end{aligned}
\]

where $(a)$ is based on the property $\mathbb{E}_{\bm{\gamma}}[T(X)]=\nabla_{\bm{\gamma}}A(\bm{\gamma})$, and $(b)$ is based on Markov's inequality and the fact that $\mathbb{E}_{0}[\exp(\langle t\mathbf{e},T(X)-\mathbb{E}_{0}[T(X)]\rangle)]=\exp(A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|\nabla_{\bm{\gamma}}A(\bm{\gamma})|_{\bm{\gamma}=\bm{\gamma}_{0}})$ [19]. Denote $\tilde{x}=t\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|\nabla_{\bm{\gamma}}A(\bm{\gamma})|_{\bm{\gamma}=\bm{\gamma}_{0}}+\log 2$. Then there exists $\alpha>1$ such that when $x>x_{0}=\max\{x_{1},1\}$ (where $x_{1}$ is the solution to $tx_{1}-\tilde{x}=(\log x_{1})^{\alpha}$ if it exists, else $x_{1}=0$),

\[
P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Big)\leq e^{-(tx-\tilde{x})}\leq e^{-(\log x)^{\alpha}},
\]

which shows that (7) holds.

Example 2 (Gaussian distributions).

For Gaussian distributions, $\bm{\gamma}=[\mu/\sigma^{2},-1/(2\sigma^{2})]^{\top}$, $T(x)=[x,x^{2}]^{\top}$, $A(\bm{\gamma})=-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})$, and $h(x)=\frac{1}{\sqrt{2\pi}}$, where $\gamma_{1}$ and $\gamma_{2}$ are the elements of $\bm{\gamma}$. We consider the test

\[
H_{0}:\mathcal{N}(0,1),\quad\bm{\gamma}_{0}=[0,-1/2]^{\top};
\]
\[
H_{1}:\mathcal{N}(\mu,\sigma^{2}),\quad\bm{\gamma}=[\mu/\sigma^{2},-1/(2\sigma^{2})]^{\top}\in\Gamma,\ \bm{\gamma}_{0}\notin\Gamma.
\]

We assume that $\Gamma$ is a convex and compact set and $\sigma^{2}>\frac{4\mu^{2}+1}{3\mu+1}$.

For this example, Assumption (i) (i.e., Condition (A1)) holds as we assume that $\Gamma$ is a convex and compact set. Besides, $A(\bm{\gamma})$ is thrice continuously differentiable and

\[
\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})=\begin{bmatrix}-\frac{1}{2\gamma_{2}}&\frac{\gamma_{1}}{2\gamma_{2}^{2}}\\ \frac{\gamma_{1}}{2\gamma_{2}^{2}}&-\frac{\gamma_{1}^{2}}{2\gamma_{2}^{3}}+\frac{1}{2\gamma_{2}^{2}}\end{bmatrix},
\]

which is positive definite. Besides,

\[
\nabla_{\bm{\gamma}}\big((\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})\big)=\begin{bmatrix}\frac{1}{4\gamma_{2}^{2}}&-\frac{\gamma_{1}}{2\gamma_{2}^{3}}\\ -\frac{\gamma_{1}}{2\gamma_{2}^{3}}&\frac{3\gamma_{1}}{4\gamma_{2}^{4}}-\frac{1}{2\gamma_{2}^{3}}-\frac{1}{\gamma_{2}^{2}}\end{bmatrix},
\]

which is positive definite when $\sigma^{2}>\frac{4\mu^{2}+1}{3\mu+1}$. Thus, Assumptions (ii) and (iii) hold, which implies that Condition (A2) holds. For Condition (A3), we have

\[
\begin{aligned}
t\max_{\bm{\gamma}\in\Gamma}&\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|A^{\prime}(\bm{\gamma}_{0})|\\
&=t\max_{\bm{\gamma}\in\Gamma}\bigg|-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})\bigg|-\frac{t^{2}}{4(t-1/2)}-\frac{1}{2}\log(-2(t-1/2))-t.
\end{aligned}
\]

Then, choosing $t=\frac{1}{4}$, we have

\[
\begin{aligned}
\frac{1}{4}\max_{\bm{\gamma}\in\Gamma}&\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A\Big(\frac{1}{4}\mathbf{e}+\bm{\gamma}_{0}\Big)-A(\bm{\gamma}_{0})-\frac{1}{4}|A^{\prime}(\bm{\gamma}_{0})|\\
&=\frac{1}{4}\max_{\bm{\gamma}\in\Gamma}\bigg|-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})\bigg|-\frac{3}{16}+\frac{1}{2}\log 2.
\end{aligned}
\]

Denote $\tilde{x}=\frac{1}{4}\max_{\bm{\gamma}\in\Gamma}\Big|-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})\Big|-\frac{3}{16}+\frac{3}{2}\log 2$. Then there exists $\alpha>1$ such that when $x>x_{0}=\max\{x_{1},1\}$ (where $x_{1}$ is the solution to $\frac{1}{4}x_{1}-\tilde{x}=(\log x_{1})^{\alpha}$ if it exists, else $x_{1}=0$),

\[
P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Big)\leq e^{-(tx-\tilde{x})}\leq e^{-(\log x)^{\alpha}},
\]

which shows that Condition (A3) holds.
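
As a quick numerical sanity check of Example 2, the sketch below (a hedged illustration, not part of the paper) evaluates the two Hessians displayed above at one parameter point and verifies positive definiteness via eigenvalues. The choice $\mu=1$, $\sigma^{2}=2$ satisfies $\sigma^{2}>\frac{4\mu^{2}+1}{3\mu+1}=\frac{5}{4}$.

```python
import numpy as np

def hessians(g1, g2):
    """The two matrices from Example 2 at gamma = (g1, g2), g2 < 0:
    the Hessian of A and the derivative of (gamma - gamma0)^T grad^2 A."""
    H_A = np.array([[-1/(2*g2),           g1/(2*g2**2)],
                    [ g1/(2*g2**2), -g1**2/(2*g2**3) + 1/(2*g2**2)]])
    H_D = np.array([[ 1/(4*g2**2),        -g1/(2*g2**3)],
                    [-g1/(2*g2**3), 3*g1/(4*g2**4) - 1/(2*g2**3) - 1/g2**2]])
    return H_A, H_D

mu, var = 1.0, 2.0                 # sigma^2 = 2 > (4 mu^2 + 1)/(3 mu + 1) = 5/4
g1, g2 = mu/var, -1/(2*var)        # natural parameters
for H in hessians(g1, g2):
    print(np.linalg.eigvalsh(H))   # all eigenvalues positive => positive definite
```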

Our first main result is Theorem 1, which characterizes the set of first-order error exponents under the probabilistic constraints on the stopping time in (4).

Theorem 1.

For fixed $0<\epsilon<1$, if Conditions (A1)–(A3) are satisfied, the set of $\epsilon$-achievable pairs of error exponents is

\[
\mathcal{E}_{\epsilon}(p_{0},\Gamma)=\Big\{(E_{0},E_{1}):E_{0}\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0}),\ E_{1}\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})\Big\}.
\]

Furthermore, the corner point of this set is achieved by an appropriately defined sequence of GSPRTs.

Theorem 1 shows that the $\epsilon$-achievable error exponent region is a rectangle. In addition, Theorem 1 constitutes a strong converse result because the region does not depend on the permissible error probability $0<\epsilon<1$.

III-A Proof of Achievability of Theorem 1

Let $\varepsilon_{0}$ and $\varepsilon_{1}$ be two positive numbers such that $\varepsilon_{0}\in\big(0,\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})\big)$ and $\varepsilon_{1}\in\big(0,\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})\big)$. Let $(\delta_{n},\tau_{n})$ be the GSPRT with the thresholds $A_{n}:=n(\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})-\varepsilon_{0})$ and $B_{n}:=n(\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})-\varepsilon_{1})$. Since Conditions (A1)–(A3) are satisfied, from [13, Theorem 2.1] we have that

\[
\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{P_{0}(S_{\tau_{n}}>A_{n})}\geq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})-\varepsilon_{0}, \tag{8}
\]
\[
\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{\sup_{\gamma\in\Gamma}P_{\gamma}(S_{\tau_{n}}<-B_{n})}\geq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})-\varepsilon_{1}. \tag{9}
\]

Next we prove that the two probabilistic constraints in (4) are satisfied for the GSPRT $(\delta_{n},\tau_{n})$ with thresholds $A_{n}$ and $B_{n}$. We first introduce the uniform weak law of large numbers (UWLLN) [20, Theorem 6.10].

Lemma 2.

Let $\{X_{j}\}_{j=1}^{\infty}$ be a sequence of i.i.d. random vectors, and let $\gamma\in\Gamma$ be a nonrandom vector lying in a compact subset $\Gamma\subset\mathbb{R}^{d}$. Moreover, let $g(x,\gamma)$ be a Borel-measurable function on $\mathcal{X}\times\Gamma$ such that for each $x$, $g(x,\gamma)$ is continuous on $\Gamma$. Finally, assume that $\mathbb{E}\left[\max_{\gamma\in\Gamma}|g(X_{j},\gamma)|\right]<\infty$. Then for any $\delta>0$,

\[
\lim_{n\to\infty}\mathbb{P}\left(\max_{\gamma\in\Gamma}\bigg|\frac{1}{n}\sum_{j=1}^{n}g(X_{j},\gamma)-\mathbb{E}[g(X,\gamma)]\bigg|\geq\delta\right)=0.
\]

Let $\tau^{\prime}:=\inf\{k:S_{k}<-B_{n}\}$. We observe that $\tau^{\prime}\geq\tau_{n}$, so we have

\[
P_{0}(\tau_{n}>n)\leq P_{0}(\tau^{\prime}>n)\leq P_{0}\left(\max_{\gamma\in\Gamma}S_{n}(\gamma)\geq-B_{n}\right).
\]

Because

\[
\max_{\gamma\in\Gamma}\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}+n\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})
\leq\max_{\gamma\in\Gamma}\left(\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}-n\mathbb{E}_{0}\left[\log\frac{p_{\gamma}(X)}{p_{0}(X)}\right]\right),
\]

and $\max_{x}f(x)-\min_{x}g(x)\leq\max_{x}(f(x)-g(x))\leq\max_{x}|f(x)-g(x)|$, we have

\[
P_{0}\left(\max_{\gamma\in\Gamma}S_{n}(\gamma)\geq-n\Big(\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})-\varepsilon_{1}\Big)\right)
\leq P_{0}\left(\max_{\gamma\in\Gamma}\bigg|\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}-n\mathbb{E}_{0}\left[\log\frac{p_{\gamma}(X)}{p_{0}(X)}\right]\bigg|\geq n\varepsilon_{1}\right).
\]

Then by the UWLLN, for $0<\epsilon<1$, there exists an $n_{0}(\epsilon)$ such that when $n>n_{0}(\epsilon)$,

\[
P_{0}\left(\max_{\gamma\in\Gamma}\bigg|\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}-n\mathbb{E}_{0}\left[\log\frac{p_{\gamma}(X)}{p_{0}(X)}\right]\bigg|\geq n\varepsilon_{1}\right)<\epsilon.
\]

Therefore, $P_{0}(\tau_{n}>n)\leq P_{0}(\tau^{\prime}>n)<\epsilon$.

We now prove that $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau_{n}>n)<\epsilon$. Define $\tau^{\prime\prime}:=\inf\{k:\max_{\gamma\in\Gamma}S_{k}(\gamma)>A_{n}\}$. We also have $\tau^{\prime\prime}\geq\tau_{n}$. Then for each $\gamma_{0}\in\Gamma$, we have

\[
\begin{aligned}
P_{\gamma_{0}}(\tau_{n}>n)&\leq P_{\gamma_{0}}(\tau^{\prime\prime}>n)\\
&\leq P_{\gamma_{0}}\Big(\max_{\gamma\in\Gamma}S_{n}(\gamma)\leq A_{n}\Big)\\
&\leq P_{\gamma_{0}}\big(S_{n}(\gamma_{0})\leq n(D(p_{\gamma_{0}}\|p_{0})-\varepsilon_{0})\big)\\
&\leq\frac{\mathrm{Var}(\xi(\gamma_{0}))}{n\varepsilon_{0}^{2}},
\end{aligned}
\]

where the last step follows from Chebyshev's inequality [21]. Then, based on Condition (A3) that $\mathbb{E}[\max_{\gamma}|\xi(\gamma)|^{2}]<\infty$ and the fact that $\varepsilon_{0}$ does not depend on $\gamma_{0}$, there exists an $n_{1}(\epsilon)$ such that when $n>n_{1}(\epsilon)$, $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau_{n}>n)<\epsilon$. We have shown that when $n>\max\{n_{0}(\epsilon),n_{1}(\epsilon)\}$, the two probabilistic constraints (4) are satisfied. Together with (8), (9), and the arbitrariness of $\varepsilon_{0}$ and $\varepsilon_{1}$, this shows that any exponent pair $(E_{0},E_{1})$ such that $E_{0}\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})$ and $E_{1}\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})$ is in $\mathcal{E}_{\epsilon}(p_{0},\Gamma)$.

III-B Proof of Strong Converse of Theorem 1

The following lemma is taken from Li and Tan [16].

Lemma 3.

Let $(\delta,\tau)$ be a sequential hypothesis test such that $P_{0}(\tau<\infty)=1$ and $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau<\infty)=1$. Then for any event $F\in\mathcal{F}_{\tau}$, $\lambda>0$, and each $\gamma_{0}\in\Gamma$, we have

\[
\begin{aligned}
P_{0}(F)-\lambda P_{\gamma_{0}}(F)&\leq P_{0}(S_{\tau}(\gamma_{0})\leq-\log\lambda),\\
P_{\gamma_{0}}(F)-\frac{1}{\lambda}P_{0}(F)&\leq P_{\gamma_{0}}(S_{\tau}(\gamma_{0})\geq-\log\lambda).
\end{aligned}
\]

Then we use Lemma 3 to prove the converse part. Let $(E_{0},E_{1})\in\mathcal{E}_{\epsilon}(p_{0},\Gamma)$ be such that $\min\{E_{0},E_{1}\}>0$. Without loss of generality and by passing to a subsequence if necessary, we assume that there exists a sequence of sequential hypothesis tests $\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}$ such that $P_{0}(\tau_{n}<\infty)=1$ and $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau_{n}<\infty)=1$ and

\[
E_{0}=\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})},\qquad
E_{1}=\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}. \tag{10}
\]

Let $Z_{i}(\tau_{n})=\{\delta_{n}=i\}$ for $i=0,1$. Then $\mathsf{P}_{1|0}(\delta_{n},\tau_{n})=P_{0}(Z_{1}(\tau_{n}))$ and $\mathsf{P}_{0|1}(\delta_{n},\tau_{n})=\sup_{\gamma\in\Gamma}P_{\gamma}(Z_{0}(\tau_{n}))$. Using Lemma 3 with the event $F=Z_{0}(\tau_{n})$, for each $\gamma_{0}\in\Gamma$ we have that

\[
\begin{aligned}
1-P_{0}(Z_{1}(\tau_{n}))-\lambda P_{\gamma_{0}}(Z_{0}(\tau_{n}))
&\leq P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda)\\
&\leq P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda,\,\tau_{n}\leq n)+P_{0}(\tau_{n}>n),
\end{aligned}
\]

which further implies that

\[
\log P_{\gamma_{0}}(Z_{0}(\tau_{n}))\geq\log\bigg[\frac{1}{\lambda}\Big(1-P_{0}(Z_{1}(\tau_{n}))-P_{0}(\tau_{n}>n)-P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda,\,\tau_{n}\leq n)\Big)\bigg]. \tag{11}
\]

Similarly, for each $\gamma_{0}\in\Gamma$, we have that

\[
1-P_{\gamma_{0}}(Z_{0}(\tau_{n}))-\frac{1}{\lambda}P_{0}(Z_{1}(\tau_{n}))
\leq P_{\gamma_{0}}(S_{\tau_{n}}(\gamma_{0})\geq-\log\lambda,\,\tau_{n}\leq n)+P_{\gamma_{0}}(\tau_{n}>n),
\]

and when we set $F=Z_{1}(\tau_{n})$, we have

\[
\log P_{0}(Z_{1}(\tau_{n}))\geq\log\Big[\lambda\Big(1-P_{\gamma_{0}}(Z_{0}(\tau_{n}))-P_{\gamma_{0}}(\tau_{n}>n)-P_{\gamma_{0}}(S_{\tau_{n}}(\gamma_{0})\geq-\log\lambda,\,\tau_{n}\leq n)\Big)\Big]. \tag{12}
\]

Let $\delta$ be an arbitrary positive number and let $\log\lambda=n\left(D(p_{0}\|p_{\gamma_{0}})+\delta\right)$. We first bound the term

\[
P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda,\,\tau_{n}\leq n)
\leq P_{0}\left(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}\geq n(D(p_{0}\|p_{\gamma_{0}})+\delta)\right).
\]

We note that $\big\{\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}-D(p_{0}\|p_{\gamma_{0}})\big\}_{i=1}^{n}$ is an i.i.d. sequence. Besides, we have that $\mathbb{E}_{0}\left[-\xi(\gamma_{0})-D(p_{0}\|p_{\gamma_{0}})\right]=0$ and $\mathrm{Var}\left(\xi(\gamma_{0})\right)$ is finite based on Condition (A3). Then, based on Kolmogorov's maximal inequality [21, Theorem 2.5.5], we have that

\[
P_{0}\bigg(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}\geq n(D(p_{0}\|p_{\gamma_{0}})+\delta)\bigg)\leq\frac{\mathrm{Var}\left(\xi(\gamma_{0})\right)}{n\delta^{2}}. \tag{13}
\]

Note that here we use Kolmogorov's maximal inequality, which only requires that the second moment of the log-likelihood ratio is finite; this is a weaker condition than assuming that the third absolute moment of the log-likelihood ratio is finite, as in [16]. Then we have that

\[
\lim_{n\to\infty}\sup_{\gamma_{0}\in\Gamma}P_{0}\Bigg(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}\geq\log\lambda\Bigg)=0. \tag{14}
\]

When we set $-\log\lambda=n(D(p_{\gamma_{0}}\|p_{0})+\delta)$ in (12), using similar arguments as in the derivation of (14), we obtain

\[
\lim_{n\to\infty}\sup_{\gamma_{0}\in\Gamma}P_{\gamma_{0}}\bigg(\max_{1\leq k\leq n}S_{k}(\gamma_{0})\geq-\log\lambda\bigg)=0.
\]

From (11) and the fact that $P_{0}(\tau_{n}>n)<\epsilon$, we have that

\[
\begin{aligned}
-\frac{1}{n}\sup_{\gamma\in\Gamma}\log P_{\gamma}(Z_{0}(\tau_{n}))
&\leq\min_{\gamma\in\Gamma}(D(p_{0}\|p_{\gamma})+\delta)-\frac{1}{n}\log\Bigg(1-\mathsf{P}_{1|0}(\delta_{n},\tau_{n})-\epsilon\\
&\qquad-\sup_{\gamma\in\Gamma}P_{0}\bigg(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma}(X_{i})}\geq\log\lambda\bigg)\Bigg).
\end{aligned}
\]

From (10) it follows that $\lim_{n\to\infty}\mathsf{P}_{1|0}(\delta_{n},\tau_{n})=0$, which together with (14) implies that

\[
E_{1}=\lim_{n\to\infty}-\frac{1}{n}\log\mathsf{P}_{0|1}(\delta_{n},\tau_{n})\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})+\delta.
\]

Similarly, we also obtain

\[
E_{0}=\lim_{n\to\infty}-\frac{1}{n}\log\mathsf{P}_{1|0}(\delta_{n},\tau_{n})\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})+\delta.
\]

Due to the arbitrariness of $\delta$, letting $\delta\to 0^{+}$, we have that $E_{0}\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})$ and $E_{1}\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})$, completing the proof of the strong converse as desired.

IV Second-order Asymptotics

In the previous section, we considered the (first-order) error exponents of the sequential composite hypothesis testing problem under probabilistic constraints. While the result (Theorem 1) is conclusive, there is often substantial motivation [15] to consider higher-order asymptotics due to finite-length considerations. To wit, the length bound $n$ in the probabilistic constraint might be small, and thus the exponents derived in the previous section will be overly optimistic. In this section, we quantify the backoff from the optimal first-order exponents by examining the second-order asymptotics. To do so, we make a set of somewhat more stringent assumptions on the distributions and the uncertainty set $\Gamma$. We first assume that the alphabet of the observations is the finite set $\mathcal{X}=\{1,2,\dots,d\}$. Let $\mathcal{P}_{\mathcal{X}}$ be the set of probability mass functions on the alphabet $\mathcal{X}$. In other words, $\mathcal{P}_{\mathcal{X}}$ is the probability simplex given by

\[
\mathcal{P}_{\mathcal{X}}:=\bigg\{(q(1),q(2),\ldots,q(d)):\sum_{i=1}^{d}q(i)=1,\ q(i)\geq 0,\ \forall\,i\in\mathcal{X}\bigg\}.
\]

Similarly, define the open probability simplex

\[
\mathcal{P}_{\mathcal{X}}^{+}:=\bigg\{(q(1),q(2),\ldots,q(d)):\sum_{i=1}^{d}q(i)=1,\ q(i)>0,\ \forall\,i\in\mathcal{X}\bigg\}.
\]

Under hypothesis $H_{0}$, the underlying probability mass function is given by $\{p_{0}(i)\}_{i=1}^{d}$ and we assume that $p_{0}(i)>0$ for all $i\in\mathcal{X}$. Under hypothesis $H_{1}$, the underlying probability mass function belongs to the set $\Gamma\subset\mathcal{P}_{\mathcal{X}}$. For any $\tilde{q}\in\mathcal{P}_{\mathcal{X}}$ and positive constant $\eta$, let $\mathcal{B}(\tilde{q},\eta):=\{q\in\mathcal{P}_{\mathcal{X}}:|q(i)-\tilde{q}(i)|<\eta,\ \forall\,i\in\mathcal{X}\}$ be the open $\eta$-neighborhood of the point $\tilde{q}$. Let $\bm{\gamma}^{\prime}$ be such that $D(p_{0}\|p_{\bm{\gamma}^{\prime}})=\min_{\bm{\gamma}\in\Gamma}D(p_{0}\|p_{\bm{\gamma}})$. See Fig. 1 for an illustration of this projection.

IV-A Other Assumptions and Preliminary Results

We assume that $\Gamma$, which contains distributions supported on $\mathcal{X}$, satisfies the following conditions:

  1. (A1’)

    The set $\Gamma$ is equal to $\{\bm{\gamma}=\{\gamma_{i}\}_{i=1}^{d}\in\mathcal{P}_{\mathcal{X}}:F(\bm{\gamma})\leq 0\}$ for some piece-wise smooth convex function $F:\mathcal{P}_{\mathcal{X}}\to\mathbb{R}$.

  2. (A2’)

    There exists a fixed constant $c_{0}>0$ such that $\min_{i\in\mathcal{X}}\gamma_{i}\geq c_{0}$ for all $\bm{\gamma}\in\Gamma$.

  3. (A3’)

    The function $F$ is smooth (infinitely differentiable) on $\mathcal{B}(\bm{\gamma}^{\prime},\eta)$ for some $\eta>0$.

The key tool used in the derivation of the second-order terms is a central limit-type result for $\max_{\bm{\gamma}\in\Gamma}\sum_{k=1}^{n}\log\frac{p_{\bm{\gamma}}(X_{k})}{p_{0}(X_{k})}$, the maximum of the log-likelihood ratios of the observations over $\Gamma$. To simplify this quantity, we define the empirical distribution or type [22, Chapter 11] of $X^{n}$ as $Q(i;X^{n})=\sum_{k=1}^{n}\mathbbm{1}\{X_{k}=i\}/n$ for $i=1,2,\dots,d$. In the following, for the sake of brevity, we often suppress the dependence on the sequence $X^{n}$ and write $Q(i)$ in place of $Q(i;X^{n})$, but we note that $Q$ is a random distribution induced by the observations $X^{n}$. Since $\mathcal{X}$ is a finite set, we have

\[
S_{n}=\max_{\bm{\gamma}\in\Gamma}\sum_{k=1}^{n}\log\frac{p_{\bm{\gamma}}(X_{k})}{p_{0}(X_{k})}
=\max_{\bm{\gamma}\in\Gamma}\sum_{i=1}^{d}\sum_{k=1}^{n}\mathbbm{1}\{X_{k}=i\}\log\frac{\gamma_{i}}{p_{0}(i)}
=n\max_{\bm{\gamma}\in\Gamma}\sum_{i=1}^{d}Q(i)\log\frac{\gamma_{i}}{p_{0}(i)}. \tag{15}
\]

The key to obtaining the central limit-type result for the sequence of random variables $\{S_{n}/\sqrt{n}\}_{n\in\mathbb{N}}$ is to solve the optimization problem in (15), or more precisely, to understand the properties of the optimizer of (15). Now we study the following optimization problem for a generic $q\in\mathcal{P}_{\mathcal{X}}$:

\[
\begin{aligned}
\min_{\bm{\gamma}}\;\;&\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{\gamma_{i}}\\
\mathrm{s.t.}\;\;&\sum_{i=1}^{d}\gamma_{i}=1,\\
&F(\bm{\gamma})\leq 0.
\end{aligned} \tag{16}
\]

Let $\tilde{\bm{\gamma}}$ be an optimizer of the optimization problem (16). The properties of $\tilde{\bm{\gamma}}$ are provided in the following three lemmas.

Lemma 4.

If $q\in\mathcal{P}^{+}_{\mathcal{X}}$ and $q\not\in\Gamma$, then the optimizer $\tilde{\bm{\gamma}}$ is unique.

The existence and uniqueness of the optimizer of the optimization problem (16) follow from the strict convexity of the function $\bm{\gamma}\mapsto\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{\gamma_{i}}$ on the compact convex (uncertainty) set $\Gamma$.

As the optimizer $\tilde{\bm{\gamma}}$ is unique, we can define the function

\[
\mathbf{g}(q)=(g_{1}(q),\ldots,g_{d}(q)):=\tilde{\bm{\gamma}}.
\]

For the sake of convenience in what follows, define

\[
f(q):=\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q)}. \tag{17}
\]
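
For a given $q$, the map $\mathbf{g}(q)$ and the value $f(q)$ can be computed with an off-the-shelf convex solver. The sketch below is a minimal illustration that solves (16) for a hypothetical single linear constraint $F(\bm{\gamma})=\mathbf{w}^{\top}\bm{\gamma}-\xi\leq 0$; the vector $\mathbf{w}$, the level $\xi$, and the helper name `g_and_f` are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def g_and_f(q, p0, w, xi):
    """Solve (16) for F(gamma) = w . gamma - xi <= 0, returning the
    optimizer g(q) and the objective value f(q) from (17)."""
    d = len(q)
    obj = lambda g: np.sum(q * (np.log(p0) - np.log(g)))     # sum_i q(i) log(p0(i)/gamma_i)
    cons = [{'type': 'eq',   'fun': lambda g: g.sum() - 1},  # gamma lies on the simplex
            {'type': 'ineq', 'fun': lambda g: xi - w @ g}]   # F(gamma) <= 0
    res = minimize(obj, x0=np.full(d, 1/d), method='SLSQP',  # SLSQP tolerates an infeasible start
                   bounds=[(1e-9, 1)]*d, constraints=cons)
    return res.x, res.fun

# Toy usage on a ternary alphabet: g(p0) = gamma' and f(p0) = D(p0 || p_gamma').
p0 = np.array([0.5, 0.3, 0.2])
w, xi = np.array([1.0, -1.0, 0.0]), -0.1  # hypothetical halfspace; F(p0) = 0.2 > xi, so p0 lies outside Gamma
print(g_and_f(p0, p0, w, xi))
```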

Some key properties of $\mathbf{g}(q)$ are provided in Lemma 10 and Lemma 11 in the Appendix. By the definition of $\bm{\gamma}^{\prime}$, it follows that $\mathbf{g}(p_{0})=\bm{\gamma}^{\prime}$. Without loss of generality, we assume

\[
\frac{\partial F(\bm{\gamma}^{\prime})}{\partial\gamma_{1}}-\sum_{i=1}^{d}\gamma^{\prime}_{i}\frac{\partial F(\bm{\gamma}^{\prime})}{\partial\gamma_{i}}\neq 0.
\]

Then there exists $0<\bar{\eta}<\eta$ such that for $q\in\mathcal{B}(p_{0},\bar{\eta})$, the following holds:

\[
\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\tilde{\gamma}_{1}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\tilde{\gamma}_{i}}\neq 0.
\]

Then for $q\in\mathcal{B}(p_{0},\bar{\eta})$, the Jacobian of $(q(2),\ldots,q(d))$ with respect to $(\tilde{\gamma}_{2},\ldots,\tilde{\gamma}_{d})$ is

\[
\mathbf{J}(q)=\begin{bmatrix}\frac{\partial q(2)}{\partial\tilde{\gamma}_{2}}&\frac{\partial q(2)}{\partial\tilde{\gamma}_{3}}&\cdots&\frac{\partial q(2)}{\partial\tilde{\gamma}_{d}}\\ \frac{\partial q(3)}{\partial\tilde{\gamma}_{2}}&\frac{\partial q(3)}{\partial\tilde{\gamma}_{3}}&\cdots&\frac{\partial q(3)}{\partial\tilde{\gamma}_{d}}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial q(d)}{\partial\tilde{\gamma}_{2}}&\frac{\partial q(d)}{\partial\tilde{\gamma}_{3}}&\cdots&\frac{\partial q(d)}{\partial\tilde{\gamma}_{d}}\end{bmatrix}\in\mathbb{R}^{(d-1)\times(d-1)}.
\]

We now introduce the following regularity condition on the function $F$ at the point $p_{0}$:

  • (A4’)

    The Jacobian matrix $\mathbf{J}(p_{0})$ is of full rank (i.e., $\mathrm{rank}(\mathbf{J}(p_{0}))=d-1$).

One may wonder whether the new assumptions we have stated are overly restrictive. In fact, they are not, and there exist interesting families of uncertainty sets that satisfy Assumptions (A1')–(A4'). A canonical example of an uncertainty set $\Gamma$ that satisfies these conditions is one in which $F$ is piece-wise linear on the set $\mathcal{P}_{\mathcal{X}}^{+}$. Thus, $\Gamma$ is similar to a linear family [23], an important class of statistical models.

Example 3.

Let $\{F_{k}\}_{k=1}^{l}$ be a set of $l$ linear functions with domain $\mathbb{R}^{d}$ and let $\{\xi_{k}\}_{k=1}^{l}$ be a set of $l$ real numbers. Let $\Gamma=\bigcap_{k=1}^{l}\{(y_{1},\ldots,y_{d}):F_{k}(y_{1},\ldots,y_{d})\leq\xi_{k}\}$. Assume $\{F_{k}\}_{k=1}^{l}$ and $\{\xi_{k}\}_{k=1}^{l}$ satisfy the following three conditions:

  • The set $\Gamma\subset\mathcal{P}_{\mathcal{X}}^{+}$ and $F_{k}(p_{0})>\xi_{k}$ for some $k$;

  • The minimizer $\bm{\gamma}^{\prime}=\operatorname*{arg\,min}_{\bm{\gamma}\in\Gamma}D(p_{0}\|p_{\bm{\gamma}})$ is such that $F_{1}(\bm{\gamma}^{\prime})=\xi_{1}$ and $F_{k}(\bm{\gamma}^{\prime})<\xi_{k}$ for $k\neq 1$;

  • Let $F_{1}(y_{1},\ldots,y_{d})=\sum_{i=1}^{d}w_{i}y_{i}$ for some real coefficients $w_{1},\ldots,w_{d}$. One of the coefficients of $F_{1}$, i.e., one of the numbers in the set $\{w_{i}\}_{i=1}^{d}$, is not equal to $\xi_{1}$.

Intuitively, $\Gamma$, defined as the intersection of halfspaces (linear inequality constraints), is a polyhedron contained in the relative interior of $\mathcal{P}_{\mathcal{X}}$. Fig. 1 provides an illustration for the ternary case $\mathcal{X}=\{1,2,3\}$.

[Figure 1: the probability simplex $\mathcal{P}_{\mathcal{X}}$ with vertices $(1,0,0)$, $(0,1,0)$, $(0,0,1)$; the polyhedron $\Gamma$ cut out by the halfspaces $F_{1},F_{2},F_{3}$; the point $p_{0}$ together with the projections $\bm{\gamma}^{\prime}$ (achieving $D(p_{0}\|p_{\bm{\gamma}^{\prime}})$) and $\bm{\gamma}^{*}$ (achieving $D(p_{\bm{\gamma}^{*}}\|p_{0})$).]
Figure 1: The set $\Gamma$ formed by the intersection of three halfspaces defined by $F_{1},F_{2}$, and $F_{3}$. See Example 3.
Proposition 5.

The set $\Gamma$ described in Example 3 satisfies Conditions (A1')–(A4').

The proof of Proposition 5 is provided in Appendix F.

Now we are ready to state the promised central limit-type result for $S_{n}$, defined in (15). Define the relative entropy variance [15]

\[
V(p\|q):=\mathrm{Var}_{p}\left[\log\frac{p(X)}{q(X)}\right]
\]

and the Gaussian cumulative distribution function $\Phi(y):=\int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}}e^{-u^{2}/2}\,\mathrm{d}u$. Then we have

Proposition 6.

Under Conditions (A1’)–(A4’), if {Xi}i=1\{X_{i}\}_{i=1}^{\infty} is a sequence of i.i.d. random variables with P(X1=i)=p0(i)P(X_{1}=i)=p_{0}(i) for all i𝒳i\in\mathcal{X}, then {Sn}n=1\{S_{n}\}_{n=1}^{\infty}, defined in (IV-A), satisfies

\[
\sqrt{n}\bigg(\frac{S_{n}}{n}+D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg)\stackrel{\mathrm{d}}{\longrightarrow}\mathcal{N}\big(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})\big).
\]

The proof of Proposition 6 can be found in Appendix G.
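
As a hedged empirical check of Proposition 6, one can simulate the statistic and compare its sample mean and variance with $0$ and $V(p_{0}\|p_{\bm{\gamma}^{\prime}})$. The sketch below reuses the illustrative helper `g_and_f` (and its hypothetical halfspace $\mathbf{w}$, $\xi$) from the sketch after (17), together with the identity $S_{n}=-nf(Q)$ implied by (15) and (17).

```python
import numpy as np

rng = np.random.default_rng(1)
p0 = np.array([0.5, 0.3, 0.2])
w, xi = np.array([1.0, -1.0, 0.0]), -0.1

gamma_p, D = g_and_f(p0, p0, w, xi)        # gamma' = g(p0) and D = D(p0||p_gamma')
V = p0 @ np.log(p0 / gamma_p)**2 - D**2    # V(p0||p_gamma')

n, trials, stats = 2000, 500, []
for _ in range(trials):
    x = rng.choice(3, size=n, p=p0)        # i.i.d. observations under H0
    Q = np.bincount(x, minlength=3) / n    # empirical type of X^n
    f_Q = g_and_f(Q, p0, w, xi)[1]         # f(Q), so S_n = -n f(Q) by (15) and (17)
    stats.append(np.sqrt(n) * (D - f_Q))   # sqrt(n) (S_n/n + D(p0||p_gamma'))
print(np.mean(stats), np.var(stats), V)    # sample mean ~ 0, sample variance ~ V
```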

A major result in the statistics literature that bears some resemblance to Proposition 6 is Wilks' theorem (see [17, Chapter 16] for example). For the case in which the null hypothesis is simple (Wilks' theorem also applies to the case in which both the null and alternative hypotheses are composite, but we are only concerned with the simpler setting here), Wilks' theorem states that if the sequence of random variables $\{X_{i}\}_{i=1}^{\infty}$ is independently drawn from $p_{0}$ (the distribution of the null hypothesis), then (two times) the log-likelihood ratio statistic satisfies

\[
2\max_{\bm{\gamma}\in\Gamma\cup\{p_{0}\}}\sum_{k=1}^{n}\log\frac{p_{\bm{\gamma}}(X_{k})}{p_{0}(X_{k})}\stackrel{\mathrm{d}}{\longrightarrow}\chi^{2}_{d-1},
\]

where $\chi_{d-1}^{2}$ is the chi-squared distribution with $d-1$ degrees of freedom. This result differs from Proposition 6 because in $S_{n}$ the maximization is taken over $\Gamma$, whereas in Wilks' theorem it is taken over $\Gamma\cup\{p_{0}\}$. This results in different normalizations in the statements on convergence in distribution; in Proposition 6, $S_{n}$ is normalized by $\sqrt{n}$, but there is no normalization of the log-likelihood ratio statistic in Wilks' theorem. This is because, for the former (our result), the dominant term is the first-order term in the Taylor expansion, whereas in the latter (Wilks' theorem), the dominant term is the second-order term.

Proposition 7.

Conditions (A1’)–(A4’) imply Conditions (A1)–(A3) in Section III.

The proof of Proposition 7 is provided in Appendix H. Thus, we see that the assumptions used to derive the first-order results are less restrictive than those for the second-order result that we are going to state in the next subsection.

IV-B Definition and Main Results

We say that a second-order exponent pair $(G_{0},G_{1})$ is $\epsilon$-achievable under the probabilistic constraints if there exists a sequence of sequential hypothesis tests $\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}$ that satisfies the probabilistic constraints on the stopping time in (4) and

\[
\begin{aligned}
G_{0}&\leq\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\bigg(\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})}-nD(p_{\bm{\gamma}^{*}}\|p_{0})\bigg),\\
G_{1}&\leq\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\bigg(\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}-nD(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg),
\end{aligned}
\]

where $\bm{\gamma}^{*}=\operatorname*{arg\,min}_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}}\|p_{0})$, which is unique (see Proposition 7, which implies that Condition (A2) is satisfied). The set of all $\epsilon$-achievable second-order exponent pairs $(G_{0},G_{1})$ is denoted as $\mathcal{G}_{\epsilon}(p_{0},\Gamma)$, the second-order error exponent region. The set of second-order error exponents $\mathcal{G}_{\epsilon}(p_{0},\Gamma)$ is stated in the following theorem.

Theorem 8.

If Conditions (A1’)–(A4’) are satisfied, for any 0<ϵ<10<\epsilon<1, the second-order error exponent region is

\[
\mathcal{G}_{\epsilon}(p_{0},\Gamma)=\left\{(G_{0},G_{1})\in\mathbb{R}^{2}:\begin{array}{l}G_{0}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})}\\ G_{1}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\end{array}\right\}.
\]

Furthermore, the boundary of this set is achieved by an appropriately defined sequence of GSPRTs.

This theorem states that the backoffs from the (first-order) error exponents are of order Θ(1/n)\Theta(1/\sqrt{n}) and the implied constants are given by Φ1(ϵ)V(p𝜸p0)\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})} and Φ1(ϵ)V(p0p𝜸)\Phi^{-1}(\epsilon)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}. Thus, we have stated a set of sufficient conditions on the distributions and the uncertainty set Γ\Gamma (namely (A1’)–(A4’)) under which the second-order terms are analogous to those for simple sequential hypothesis testing under the probabilistic constraints derived by Li and Tan [16]. However, the techniques used to derive Theorem 8 are more involved than those in [16], because we have to derive the asymptotic distribution of the maximum of an uncountable set of log-likelihood ratio terms (cf. Proposition 6). This constitutes our main contribution in this part of the paper.
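In other words, along a sequence of GSPRTs attaining the boundary of the region, the type-I error probability satisfies

\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})}=nD(p_{\bm{\gamma}^{*}}\|p_{0})+\sqrt{n}\,\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})}+o(\sqrt{n}),

and analogously for \mathsf{P}_{0|1} with D(p_{0}\|p_{\bm{\gamma}^{\prime}}) and V(p_{0}\|p_{\bm{\gamma}^{\prime}}). Since \Phi^{-1}(\epsilon)<0 for \epsilon<1/2, the second-order term is a penalty on the exponent in the practically relevant regime of small \epsilon.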

IV-C Proof of the Achievability Part of Theorem 8

The proof of achievability consists of two parts. We first prove the desired upper bound on type-I error probability and the maximal type-II error probability under an appropriately defined sequence of GSPRTs. Then we prove that the probabilistic constraints are satisfied.

We start with the proof of the first part. Fix η0,η1(0,ϵ)\eta_{0},\eta_{1}\in(0,\epsilon) and let (δn,τn)(\delta_{n},\tau_{n}) be the GSPRT with thresholds

An:=nmin𝜸Γ(D(p𝜸p0)+Φ1(ϵη0)V(p𝜸p0)n),A_{n}:=n\min_{\bm{\gamma}\in\Gamma}\bigg{(}D(p_{\bm{\gamma}}\|p_{0})+\Phi^{-1}(\epsilon-\eta_{0})\sqrt{\frac{V(p_{\bm{\gamma}}\|p_{0})}{n}}\bigg{)},

and

Bn:=nmin𝜸Γ(D(p0p𝜸)+Φ1(ϵη1)V(p0p𝜸)n).B_{n}:=n\min_{\bm{\gamma}\in\Gamma}\bigg{(}D(p_{0}\|p_{\bm{\gamma}})+\Phi^{-1}(\epsilon-\eta_{1})\sqrt{\frac{V(p_{0}\|p_{\bm{\gamma}})}{n}}\bigg{)}.
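For concreteness, the thresholds A_n and B_n can be evaluated numerically. Below is a minimal Python sketch under an assumed binary-alphabet instance (these parameters are illustrative and not from the paper); D and V are computed from their definitions as the relative entropy and the variance of the log-likelihood ratio, and the minimization over \Gamma is approximated on a grid:

```python
import numpy as np
from scipy.stats import norm

def D(p, q):  # relative entropy D(p || q)
    return float(np.sum(p * np.log(p / q)))

def V(p, q):  # variance of log(p(X)/q(X)) under p
    return float(np.sum(p * np.log(p / q) ** 2) - D(p, q) ** 2)

# Hypothetical instance: p0 = Bern(0.5), Gamma = {Bern(g): g in [0.6, 0.8]}.
p0 = np.array([0.5, 0.5])
gammas = [np.array([g, 1 - g]) for g in np.linspace(0.6, 0.8, 201)]

def thresholds(n, eps, eta0, eta1):
    A_n = n * min(D(g, p0) + norm.ppf(eps - eta0) * np.sqrt(V(g, p0) / n)
                  for g in gammas)
    B_n = n * min(D(p0, g) + norm.ppf(eps - eta1) * np.sqrt(V(p0, g) / n)
                  for g in gammas)
    return A_n, B_n

print(thresholds(n=1000, eps=0.1, eta0=0.05, eta1=0.05))
```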

Based on Proposition 7, we know that (A1)–(A3) are satisfied. Hence, from [13, Theorem 2.1] we have that

P0(Sτn>An)eAnandsup𝜸ΓP𝜸(Sτn<Bn)eBn.\displaystyle P_{0}(S_{\tau_{n}}>A_{n})\leq e^{-A_{n}}\quad\mbox{and}\quad\sup_{\bm{\gamma}\in\Gamma}P_{{\bm{\gamma}}}(S_{\tau_{n}}<-B_{n})\leq e^{-B_{n}}.

To simplify AnA_{n} and BnB_{n}, we introduce an approximation lemma from [24, Lemma 48].

Lemma 9.

Let Γ\Gamma be a compact metric space. Suppose h:Γh:\Gamma\to\mathbb{R} and k:Γk:\Gamma\to\mathbb{R} are continuous, then we have

max𝜸Γ[nh(𝜸)+nk(𝜸)]=nh+nk+o(n),\displaystyle\max_{\bm{\gamma}\in\Gamma}[nh(\bm{\gamma})+\sqrt{n}k(\bm{\gamma})]=nh^{*}+\sqrt{n}k^{*}+o(\sqrt{n}),

where h:=max𝛄Γh(𝛄)h^{*}:=\max_{\bm{\gamma}\in\Gamma}h(\bm{\gamma}) and k:=sup𝛄:h(𝛄)=hk(𝛄)k^{*}:=\sup_{\bm{\gamma}:h(\bm{\gamma})=h^{*}}k(\bm{\gamma}).
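Lemma 9 can be illustrated numerically; the sketch below (with hypothetical h and k on \Gamma=[0,1]) shows that the gap between the exact maximum and nh^{*}+\sqrt{n}k^{*} is o(\sqrt{n}):

```python
import numpy as np

# Toy check of Lemma 9 on Gamma = [0, 1] with hypothetical h and k.
# h attains its maximum h* = 0 uniquely at g* = 0.3, so k* = k(0.3) = cos(0.9).
h = lambda g: -(g - 0.3) ** 2
k = lambda g: np.cos(3 * g)
grid = np.linspace(0.0, 1.0, 100_001)

for n in [10**2, 10**4, 10**6]:
    exact = np.max(n * h(grid) + np.sqrt(n) * k(grid))
    approx = np.sqrt(n) * np.cos(0.9)  # n*h* + sqrt(n)*k* with h* = 0
    print(n, (exact - approx) / np.sqrt(n))  # tends to 0 as n grows
```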

Here we take h(𝜸)=D(p0p𝜸)h(\bm{\gamma})=-D(p_{0}\|p_{\bm{\gamma}}) and k(𝜸)=Φ1(ϵη1)V(p0p𝜸)k(\bm{\gamma})=-\Phi^{-1}(\epsilon-\eta_{1})\sqrt{V(p_{0}\|p_{\bm{\gamma}})}. Based on Lemma 9 and the fact that 𝜸D(p0p𝜸)\bm{\gamma}\mapsto D(p_{0}\|p_{\bm{\gamma}}) has a unique minimizer γ\gamma^{\prime} (see Assumption (A2) which is implied by Proposition 7), we have

min𝜸Γ(nD(p0p𝜸)+nV(p0p𝜸)Φ1(ϵη1))\displaystyle\min_{\bm{\gamma}\in\Gamma}\left(nD(p_{0}\|p_{\bm{\gamma}})+\sqrt{{nV(p_{0}\|p_{\bm{\gamma}})}}\Phi^{-1}(\epsilon-\eta_{1})\right)
=nD(p0p𝜸)+Φ1(ϵη1)nV(p0p𝜸)+o(n).\displaystyle=nD(p_{0}\|p_{\bm{\gamma}^{\prime}})+{\Phi^{-1}(\epsilon-\eta_{1})}\sqrt{nV(p_{0}\|p_{\bm{\gamma}^{\prime}})}+o(\sqrt{n}). (18)

Similarly, we have

min𝜸Γ(nD(p𝜸p0)+nV(p𝜸p0)Φ1(ϵη0))\displaystyle\min_{\bm{\gamma}\in\Gamma}\left(nD(p_{\bm{\gamma}}\|p_{0})+\sqrt{{nV(p_{\bm{\gamma}}\|p_{0})}}\Phi^{-1}(\epsilon-\eta_{0})\right)
=nD(p𝜸p0)+Φ1(ϵη0)nV(p𝜸p0)+o(n).\displaystyle=nD(p_{\bm{\gamma}^{*}}\|p_{0})+{\Phi^{-1}(\epsilon-\eta_{0})}\sqrt{nV(p_{\bm{\gamma}^{*}}\|p_{0})}+o(\sqrt{n}). (19)

Thus, based on (18) and (19), the arbitrariness of η0\eta_{0} and η1\eta_{1}, and the continuity of Φ1\Phi^{-1}, we obtain

lim infn1n(log\displaystyle\!\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log 1P0(Sτn>An)nD(p𝜸p0))\displaystyle\frac{1}{P_{0}(S_{\tau_{n}}>A_{n})}-{n}D(p_{\bm{\gamma}^{*}}\|p_{0})\Big{)}
Φ1(ϵ)V(p𝜸p0),\displaystyle\geq\Phi^{-1}(\epsilon)\sqrt{{V(p_{\bm{\gamma}^{*}}\|p_{0})}}, (20)

and

lim infn1n(log\displaystyle\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log 1supγΓPγ(Sτn<Bn)nD(p0p𝜸))\displaystyle\frac{1}{\sup_{\gamma\in\Gamma}P_{{\gamma}}(S_{\tau_{n}}<-B_{n})}-nD(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(ϵ)V(p0p𝜸).\displaystyle\geq\Phi^{-1}(\epsilon)\sqrt{{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}}. (21)

Next we prove that the probabilistic constraints are satisfied for the sequence of GSPRTs {(δn,τn)}n=1\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}. Let τ:=inf{k:max𝜸ΓSk(𝜸)<Bn}\tau^{\prime}:=\inf\{k:\max_{\bm{\gamma}\in\Gamma}S_{k}(\bm{\gamma})<-B_{n}\}. We observe that ττn\tau^{\prime}\geq\tau_{n} with probability 1. Thus, we have

P0(τn>n)\displaystyle P_{0}(\tau_{n}>n)
P0(τ>n)\displaystyle\leq P_{0}(\tau^{\prime}>n)
P0(max𝜸ΓSn(𝜸)Bn)\displaystyle\leq P_{0}\left(\max_{\bm{\gamma}\in\Gamma}S_{n}(\bm{\gamma})\geq-B_{n}\right)
=P0(min𝜸Γni=1dQ(i)logp0(i)γiBn)\displaystyle=P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}n\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}\leq B_{n}\bigg{)}
P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))\displaystyle\leq P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\Big{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(ϵη1)V(p0p𝜸))\displaystyle\quad\leq{\Phi^{-1}(\epsilon-\eta_{1})}\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}
ϵη1\displaystyle\to\epsilon-\eta_{1} (22)
<ϵ,\displaystyle<\epsilon, (23)

where (22) is from Proposition 6. Hence, P0(τn>n)<ϵP_{0}(\tau_{n}>n)<\epsilon for sufficiently large nn.

We now prove that sup𝜸ΓPγ(τn>n)<ϵ\sup_{\bm{\gamma}\in\Gamma}P_{\gamma}(\tau_{n}>n)<\epsilon. Let τ′′:=inf{k:max𝜸ΓSk(𝜸)>An}\tau^{\prime\prime}:=\inf\{k:\max_{\bm{\gamma}\in\Gamma}S_{k}(\bm{\gamma})>A_{n}\}. We also have τ′′τn\tau^{\prime\prime}\geq\tau_{n} with probability 1. Then by the Berry-Esseen Theorem [25], for any 𝜸0Γ\bm{\gamma}_{0}\in\Gamma, we have

P𝜸0(τn>n)\displaystyle P_{{\bm{\gamma}_{0}}}(\tau_{n}>n)
P𝜸0(τ′′>n)\displaystyle\leq P_{{\bm{\gamma}_{0}}}(\tau^{\prime\prime}>n)
P𝜸0(max𝜸ΓSn(𝜸)An)\displaystyle\leq P_{{\bm{\gamma}_{0}}}\left(\max_{\bm{\gamma}\in\Gamma}S_{n}(\bm{\gamma})\leq A_{n}\right)
P𝜸0(Sn(𝜸0)n(D(p𝜸0p0)+V(p𝜸0p0)nΦ1(ϵη0)))\displaystyle\leq\!P_{{\bm{\gamma}_{0}}}\!\bigg{(}S_{n}(\bm{\gamma}_{0})\!\leq\!n\Big{(}D(p_{\bm{\gamma}_{0}}\!\|p_{0})\!+\!\sqrt{\frac{V\!(p_{\bm{\gamma}_{0}}\|p_{0})}{n}}\Phi^{-1}(\epsilon\!-\eta_{0})\!\Big{)}\!\bigg{)}
ϵη0+T1n,\displaystyle\leq\epsilon-\eta_{0}+\frac{T_{1}}{\sqrt{n}}, (24)

where T1T_{1} is a positive finite constant depending only on Var𝜸0(ξ(𝜸0))\mathrm{Var}_{\bm{\gamma}_{0}}(\xi(\bm{\gamma}_{0})) and 𝔼𝜸0[|ξ(𝜸0)|3]\mathbb{E}_{\bm{\gamma}_{0}}[|\xi(\bm{\gamma}_{0})|^{3}]. Since Condition (A2’) guarantees that γic0>0\gamma_{i}\geq c_{0}>0 for i=1,,di=1,\dots,d and p0(i)>0p_{0}(i)>0 for i=1,,di=1,\dots,d, the third absolute moment 𝔼𝜸[|ξ(𝜸)|3]\mathbb{E}_{\bm{\gamma}}[|\xi(\bm{\gamma})|^{3}] is uniformly bounded on Γ\Gamma. Then for every 0<ϵ<10<\epsilon<1, there exists an integer n1(ϵ)n_{1}(\epsilon) that does not depend on 𝜸\bm{\gamma} such that for all n>n1(ϵ)n>n_{1}(\epsilon), P𝜸0(τn>n)ϵη0/2<ϵP_{{\bm{\gamma}_{0}}}(\tau_{n}>n)\leq\epsilon-\eta_{0}/2<\epsilon. Since 𝜸0Γ\bm{\gamma}_{0}\in\Gamma is arbitrary, sup𝜸ΓP𝜸(τn>n)<ϵ\sup_{\bm{\gamma}\in\Gamma}P_{{\bm{\gamma}}}(\tau_{n}>n)<\epsilon for all n>n1(ϵ)n>n_{1}(\epsilon).

We have shown that the two probabilistic constraints (23) and (24) are satisfied for sufficiently large nn. Together with (20) and (21), this shows that any second-order exponent pair (G0,G1)(G_{0},G_{1}) such that G0Φ1(ϵ)V(p𝜸p0)G_{0}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})} and G1Φ1(ϵ)V(p0p𝜸)G_{1}\leq{\Phi^{-1}(\epsilon)}\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})} belongs to 𝒢ϵ(p0,Γ)\mathcal{G}_{\epsilon}(p_{0},\Gamma).

IV-D Proof of the Converse Part of Theorem 8

For each 𝜸0Γ\bm{\gamma}_{0}\in\Gamma, from [16], we know that

1n\displaystyle-\frac{1}{\sqrt{n}} logP𝜸0(Z0(τn))\displaystyle\log P_{\bm{\gamma}_{0}}(Z_{0}(\tau_{n}))
nD(p0p𝜸0)+V(p0p𝜸0)Φ1(ϵ)+αn,\displaystyle\leq\sqrt{n}D(p_{0}\|p_{\bm{\gamma}_{0}})+\sqrt{V(p_{0}\|p_{\bm{\gamma}_{0}})}\Phi^{-1}(\epsilon)+\alpha_{n},

where αn0\alpha_{n}\to 0 as nn\to\infty. Since this bound holds for every 𝜸Γ\bm{\gamma}\in\Gamma, we can optimize over Γ\Gamma to obtain the tightest upper bound:

1n\displaystyle-\frac{1}{\sqrt{n}} sup𝜸ΓlogP𝜸(Z0(τn))\displaystyle\sup_{\bm{\gamma}\in\Gamma}\log P_{\bm{\gamma}}(Z_{0}(\tau_{n}))
min𝜸Γ(nD(p0p𝜸)+V(p0p𝜸)Φ1(ϵ)+αn).\displaystyle\leq\min_{\bm{\gamma}\in\Gamma}\bigg{(}\sqrt{n}D(p_{0}\|p_{\bm{\gamma}})+\sqrt{V(p_{0}\|p_{\bm{\gamma}})}\Phi^{-1}(\epsilon)+\alpha_{n}\bigg{)}.

Similar to the analysis in achievability part, we use Lemma 9 and obtain that

\displaystyle\limsup_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}-nD(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}.

Similarly, we have that

\displaystyle\limsup_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})}-nD(p_{\bm{\gamma}^{*}}\|p_{0})\Big{)}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})},

which completes the proof of the converse.

In the appendix, we provide some key properties of 𝐠(q)\mathbf{g}(q) in Lemmas 10 and 11 and their proofs. We also present the proofs of Propositions 5, 6, and 7.

E Properties of 𝐠(q)\mathbf{g}(q)

Lemma 10.

If q𝒫𝒳+q\in\mathcal{P}^{+}_{\mathcal{X}} and qΓq\not\in\Gamma, then the following properties of the optimizer 𝛄~=𝐠(q)\tilde{\bm{\gamma}}=\mathbf{g}(q) hold.

  • (i) The function 𝐠(q)\mathbf{g}(q) is continuous on 𝒫𝒳+Γ\mathcal{P}_{\mathcal{X}}^{+}\setminus\Gamma;

  • (ii) There exists η^>0\hat{\eta}>0 such that for q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), FF is smooth (infinitely differentiable) at 𝜸~\tilde{\bm{\gamma}};

  • (iii) For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), the optimizer 𝜸~\tilde{\bm{\gamma}} satisfies F(𝜸~)=0F(\tilde{\bm{\gamma}})=0 (i.e., 𝜸~\tilde{\bm{\gamma}} lies on the boundary of the uncertainty set);

  • (iv) For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), there exists a symbol j𝒳j\in\mathcal{X} such that

    F(𝜸~)γji=1dγ~iF(𝜸~)γi0;\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{j}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{i}}\neq 0; (25)
  • (v) For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}) and i𝒳i\in\mathcal{X} with iji\not=j (where j𝒳j\in\mathcal{X} is the symbol that satisfies (25) in Part (iv) above),

    q(i)\displaystyle q(i) =γ~i+(q(j)γ~j)γ~iγ~j(F(𝜸~)γjk=1dγ~kF(𝜸~)γk)\displaystyle=\tilde{\gamma}_{i}+\frac{(q(j)-\tilde{\gamma}_{j})\tilde{\gamma}_{i}}{\tilde{\gamma}_{j}\big{(}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{j}}-\sum_{k=1}^{d}\tilde{\gamma}_{k}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{k}}\big{)}}
    ×(F(𝜸~)γik=1dγ~kF(𝜸~)γk).\displaystyle\qquad\times\bigg{(}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{i}}-\sum_{k=1}^{d}\tilde{\gamma}_{k}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{k}}\bigg{)}. (26)
Proof:

We first prove Part (i) of Lemma 10. Assume, to the contrary, that 𝐠(q)\mathbf{g}(q) is not continuous at some q𝒫𝒳+Γq\in\mathcal{P}^{+}_{\mathcal{X}}\setminus\Gamma. Then there exists a positive number κ\kappa and a sequence {qk}k=1𝒫𝒳+Γ\{q_{k}\}_{k=1}^{\infty}\subset\mathcal{P}^{+}_{\mathcal{X}}\setminus\Gamma such that qkqq_{k}\to q as kk\to\infty and i=1d|gi(qk)gi(q)|κ\sum_{i=1}^{d}|g_{i}(q_{k})-g_{i}(q)|\geq\kappa for all kk\in\mathbb{N}. From the definition of 𝐠(qk)\mathbf{g}(q_{k}) and the fact that p0𝒫+p_{0}\in\mathcal{P}^{+}, there exists κ^>0\hat{\kappa}>0 such that

i=1dqk(i)logp0(i)gi(qk)<i=1dqk(i)logp0(i)gi(q)κ^,\displaystyle\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}<\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q)}-\hat{\kappa}, (27)

for all kk\in\mathbb{N}. From Condition (A2’) and the fact that {𝐠(qk)}k=1Γ\{\mathbf{g}(q_{k})\}_{k=1}^{\infty}\subset\Gamma, there exists a constant M<M<\infty such that

supk,i𝒳|logp0(i)gi(qk)|M,\sup_{k\in\mathbb{N},i\in\mathcal{X}}\left|\log\frac{p_{0}(i)}{g_{i}(q_{k})}\right|\leq M,

which further implies that

lim supki=1dqk(i)logp0(i)gi(qk)\displaystyle\limsup_{k\to\infty}\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}
=lim supk(i=1d(q(i)+(qk(i)q(i)))logp0(i)gi(qk))\displaystyle=\limsup_{k\to\infty}\bigg{(}\sum_{i=1}^{d}\big{(}q(i)+(q_{k}(i)-q(i))\big{)}\log\frac{p_{0}(i)}{g_{i}(q_{k})}\bigg{)}
=lim supki=1dq(i)logp0(i)gi(qk).\displaystyle=\limsup_{k\to\infty}\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}. (28)

Combining (27) and (28), we have that

lim supki=1dq(i)logp0(i)gi(qk)\displaystyle\limsup_{k\to\infty}\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}
lim supki=1dqk(i)logp0(i)gi(q)κ^\displaystyle\leq\limsup_{k\to\infty}\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q)}-\hat{\kappa}
=i=1dq(i)logp0(i)gi(q)κ^,\displaystyle=\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q)}-\hat{\kappa},

which contradicts the fact that, since 𝐠(q)\mathbf{g}(q) minimizes 𝜸i=1dq(i)log(p0(i)/γi)\bm{\gamma}\mapsto\sum_{i=1}^{d}q(i)\log({p_{0}(i)}/{\gamma_{i}}) over Γ\Gamma,

i=1dq(i)logp0(i)gi(qk)i=1dq(i)logp0(i)gi(q).\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}\geq\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q)}.

Hence 𝐠(q)\mathbf{g}(q) is continuous on 𝒫𝒳+Γ\mathcal{P}^{+}_{\mathcal{X}}\setminus\Gamma.

We next prove Part (ii) of Lemma 10. From the continuity of 𝐠(q)\mathbf{g}(q) (as proved above), there exists η^>0\hat{\eta}>0 such that

{𝜸~:𝜸~=𝐠(q)for some q(p0,η^)}(𝐠(p0),η),\{\tilde{\bm{\gamma}}:\tilde{\bm{\gamma}}=\mathbf{g}(q)\,\mbox{for some $q\in\mathcal{B}(p_{0},\hat{\eta})$}\}\subset\mathcal{B}(\mathbf{g}(p_{0}),\eta),

which, together with Condition (A3’) implies Part (ii) of Lemma 10.

We now proceed to prove Part (iii) of Lemma 10. Recall that the optimizer 𝜸~\tilde{\bm{\gamma}} is obtained from the optimization problem in Section IV-A. Its corresponding Lagrangian is

L(𝜸,λ,μ)=i=1dq(i)logp0(i)γi+λ(i=1dγi1)+μF(𝜸).\displaystyle L(\bm{\gamma},\lambda,\mu)=\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{\gamma_{i}}+\lambda\bigg{(}\sum_{i=1}^{d}\gamma_{i}-1\bigg{)}+\mu F(\bm{\gamma}).

For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), F(𝜸)F(\bm{\gamma}) is smooth at 𝜸~\tilde{\bm{\gamma}} (by the previous part). Hence, using the Karush–Kuhn–Tucker (KKT) conditions [26], the optimizer 𝜸~\tilde{\bm{\gamma}} satisfies the first-order stationarity conditions

q(i)γ~i+λ+μF(𝜸)γi|𝜸=𝜸~=0,i=1,,d.\displaystyle-\frac{q(i)}{\tilde{\gamma}_{i}}+\lambda+\mu\frac{\partial F(\bm{\gamma})}{\partial\gamma_{i}}\bigg{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}=0,\quad\forall\,i=1,\dots,d. (29)

The complementary slackness condition is μF(𝜸~)=0\mu F(\tilde{\bm{\gamma}})=0, which implies that either μ=0\mu=0 or F(𝜸~)=0F(\tilde{\bm{\gamma}})=0. When μ=0\mu=0, we have

q(i)=λγ~iq(i)=\lambda\tilde{\gamma}_{i} for all ii; summing over ii and using i=1dq(i)=i=1dγ~i=1\sum_{i=1}^{d}q(i)=\sum_{i=1}^{d}\tilde{\gamma}_{i}=1 gives λ=1\lambda=1 and hence γ~i=q(i)\tilde{\gamma}_{i}=q(i),

which is impossible as qΓq\notin\Gamma. Thus, it holds that F(𝜸~)=0F(\tilde{\bm{\gamma}})=0, which means the optimizer lies on the boundary of the set Γ\Gamma.

We then proceed to prove Part (iv) of Lemma 10. If

F(𝜸)γj|𝜸=𝜸~i=1dγ~iF(𝜸)γi|𝜸=𝜸~=0\frac{\partial F(\bm{\gamma})}{\partial\gamma_{j}}\Big{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F({\bm{\gamma}})}{\partial\gamma_{i}}\Big{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}=0

for all j𝒳j\in\mathcal{X}, then {F(𝜸)γj|𝜸=𝜸~}j=1d\big{\{}\frac{\partial F(\bm{\gamma})}{\partial\gamma_{j}}\big{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}\big{\}}_{j=1}^{d} are all equal. Combining this fact with (29), we have that q=𝜸~q=\tilde{\bm{\gamma}}, which contradicts the fact that qΓq\not\in\Gamma.

Finally, we prove Part (v) of Lemma 10. Combining the constraints of the optimization problem in Section IV-A with (29), we can eliminate λ\lambda and obtain qq in terms of μ\mu as

q(j)=γ~jμγ~ji=1dγ~iF(𝜸)γ~i+μγ~jF(𝜸)γ~j\displaystyle q(j)=\tilde{\gamma}_{j}-\mu\tilde{\gamma}_{j}\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{i}}+\mu\tilde{\gamma}_{j}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{j}} (30)

for all j=1,2,,dj=1,2,\dots,d. Then we obtain μ\mu in terms of q(j)q(j) as:

μ=1γ~j(F(𝜸)γ~ji=1dγ~iF(𝜸)γ~i)1(q(j)γ~j).\displaystyle\mu=\frac{1}{\tilde{\gamma}_{j}}\bigg{(}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{j}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{i}}\bigg{)}^{-1}(q(j)-\tilde{\gamma}_{j}). (31)

Then substituting (31) into (30), we have the desired formula. This completes the proof of Lemma 10. ∎
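The boundary property established in Part (iii) is easy to check numerically. The following Python sketch (for a hypothetical linear-constraint uncertainty set in the spirit of Example 3 and Proposition 5, solved with a generic constrained optimizer rather than by the analysis above) computes \mathbf{g}(q) and confirms that the constraint is active at the optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of the KKT analysis: compute g(q), the minimizer over Gamma
# of sum_i q(i) log(p0(i)/gamma_i), for a hypothetical linear-constraint set
# Gamma = {gamma in the simplex : w . gamma <= xi}.
p0 = np.full(4, 0.25)
q = np.array([0.28, 0.24, 0.26, 0.22])       # a type near p0
w, xi = np.array([1.0, 2.0, 3.0, 4.0]), 2.2  # w.q = 2.42 and w.p0 = 2.5, so
                                             # q and p0 both lie outside Gamma

obj = lambda g: np.sum(q * np.log(p0 / g))
cons = [{"type": "eq", "fun": lambda g: np.sum(g) - 1.0},
        {"type": "ineq", "fun": lambda g: xi - w @ g}]
res = minimize(obj, x0=np.array([0.4, 0.3, 0.2, 0.1]),  # feasible start
               bounds=[(1e-6, 1.0)] * 4, constraints=cons)
g_tilde = res.x
print(g_tilde, w @ g_tilde)  # the constraint is active: w . g_tilde ~= xi,
                             # consistent with Part (iii) of Lemma 10
```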

Lemma 11.

Let η^\hat{\eta} be as given in Lemma 10. Suppose q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), and Γ\Gamma satisfies (A1’)–(A4’). Then,

  1. (i) The function 𝐠(q)\mathbf{g}(q) is smooth on (p0,ζ)\mathcal{B}(p_{0},\zeta) for some ζ>0\zeta>0 and satisfies the following equality

    j=1dq(j)gj(q)gj(q)q(i)=0,\displaystyle\sum_{j=1}^{d}\frac{q(j)}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=0,

    for all q(p0,ζ)q\in\mathcal{B}(p_{0},\zeta).

  2. (ii) The function ff, defined in (17), is smooth on (p0,ζ)\mathcal{B}(p_{0},\zeta) and its first- and second-order derivatives are

    f(q)q(j)\displaystyle\frac{\partial f(q)}{\partial q(j)} =logp0(j)gj(q)+i=1dq(i)gi(q)gi(q)q(j),\displaystyle=\log\frac{p_{0}(j)}{g_{j}(q)}+\sum_{i=1}^{d}\frac{q(i)}{g_{i}(q)}\frac{\partial g_{i}(q)}{\partial q(j)}, (32)
    2f(q)q(j)2\displaystyle\frac{\partial^{2}f(q)}{\partial q(j)^{2}} =2gj(q)gj(q)q(j)i=1d[q(i)gi(q)2\displaystyle=-\frac{2}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(j)}-\sum_{i=1}^{d}\bigg{[}-\frac{q(i)}{g_{i}(q)^{2}}
    ×(gi(q)q(j))2+q(i)gi(q)2gi(q)q(j)2],and\displaystyle\qquad\times\bigg{(}\frac{\partial g_{i}(q)}{\partial q(j)}\bigg{)}^{2}+\frac{q(i)}{g_{i}(q)}\frac{\partial^{2}g_{i}(q)}{\partial q(j)^{2}}\bigg{]},\quad\mbox{and}
    2f(q)q(j)q(i)\displaystyle\frac{\partial^{2}f(q)}{\partial q(j)\partial q(i)} =1gj(q)gj(q)q(i)1gi(q)gi(q)q(i)\displaystyle=-\frac{1}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}-\frac{1}{g_{i}(q)}\frac{\partial g_{i}(q)}{\partial q(i)}
    l=1d[q(l)gl(q)2gl(q)q(j)gl(q)q(i)\displaystyle\qquad-\sum_{l=1}^{d}\bigg{[}-\frac{q(l)}{g_{l}(q)^{2}}\frac{\partial g_{l}(q)}{\partial q(j)}\frac{\partial g_{l}(q)}{\partial q(i)}
    +q(l)gl(q)2gl(q)q(j)q(i)]forij.\displaystyle\qquad\quad+\frac{q(l)}{g_{l}(q)}\frac{\partial^{2}g_{l}(q)}{\partial q(j)\partial q(i)}\bigg{]}\quad\mbox{for}\;i\neq j. (33)
Proof:

Now we prove Part (i) of Lemma 11. As F(𝜸)F(\bm{\gamma}) is smooth (Condition (A3’)) and 𝐉(p0)\mathbf{J}(p_{0}) is of full rank (Condition (A4’)), there exists ζ>0\zeta>0 such that 𝐉(q)\mathbf{J}(q) is of full rank for all q(p0,ζ)q\in\mathcal{B}(p_{0},\zeta). Then by the inverse function theorem [27, Theorem 2.11], 𝜸~=𝐠(q)\tilde{\bm{\gamma}}=\mathbf{g}(q) is differentiable in qq. Multiplying both sides of (29) by gj(q)/q(i){\partial g_{j}(q)}/{\partial q(i)} and summing over j=1,,dj=1,\dots,d, we obtain

j=1dq(j)gj(q)gj(q)q(i)=λj=1dgj(q)q(i)+μj=1dF(𝐠(q))gj(q)gj(q)q(i).\displaystyle\sum_{j=1}^{d}\frac{q(j)}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=\lambda\sum_{j=1}^{d}\frac{\partial g_{j}(q)}{\partial q(i)}+\mu\sum_{j=1}^{d}\frac{\partial F(\mathbf{g}(q))}{\partial g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}. (34)

We differentiate the first constraint i=1dγ~i=i=1dgi(q)=1\sum_{i=1}^{d}\tilde{\gamma}_{i}=\sum_{i=1}^{d}g_{i}(q)=1 with respect to q(i)q(i) to obtain

j=1dgj(q)q(i)=0.\displaystyle\sum_{j=1}^{d}\frac{\partial g_{j}(q)}{\partial q(i)}=0. (35)

From Part (iii) of Lemma 10 it follows that F(𝜸~)=F(𝐠(q))=0F(\tilde{\bm{\gamma}})=F(\mathbf{g}(q))=0, i.e., the composition of FF and 𝐠\mathbf{g} is identically 0 on (p0,η^)\mathcal{B}(p_{0},\hat{\eta}). Therefore, the derivative of this composition with respect to qq is 0, i.e.,

F(𝐠(q))q(i)=j=1dF(𝐠(q))gj(q)gj(q)q(i)=0.\displaystyle\frac{\partial F(\mathbf{g}(q))}{\partial q(i)}=\sum_{j=1}^{d}\frac{\partial F(\mathbf{g}(q))}{\partial g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=0. (36)

Substituting (35) and (36) back into (34), we have that

j=1dq(j)gj(q)gj(q)q(i)=0,\displaystyle\sum_{j=1}^{d}\frac{q(j)}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=0,

as desired.

Part (ii) of Lemma 11 follows from straightforward, albeit tedious, calculations. This completes the proof of Lemma 11. ∎
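The identity in Part (i) can likewise be probed by finite differences; the following sketch reuses the hypothetical linear-constraint \Gamma from the previous sketch (the perturbation compensates in the last coordinate to stay on the simplex, which is harmless since the identity holds for every coordinate):

```python
import numpy as np
from scipy.optimize import minimize

# Finite-difference check of sum_j (q(j)/g_j(q)) * dg_j(q)/dq(i) = 0
# on the hypothetical linear-constraint Gamma used earlier.
p0 = np.full(4, 0.25)
w, xi = np.array([1.0, 2.0, 3.0, 4.0]), 2.2

def g_of_q(q):
    cons = [{"type": "eq", "fun": lambda g: np.sum(g) - 1.0},
            {"type": "ineq", "fun": lambda g: xi - w @ g}]
    res = minimize(lambda g: np.sum(q * np.log(p0 / g)),
                   x0=np.array([0.4, 0.3, 0.2, 0.1]),
                   bounds=[(1e-6, 1.0)] * 4, constraints=cons, tol=1e-12)
    return res.x

q = np.array([0.28, 0.24, 0.26, 0.22])
g = g_of_q(q)
eps = 1e-5
for i in range(3):
    dq = np.zeros(4)
    dq[i], dq[-1] = eps, -eps  # perturb q(i), compensate in the last coordinate
    dg = (g_of_q(q + dq) - g_of_q(q - dq)) / (2 * eps)
    print(np.sum(q / g * dg))  # ~= 0 up to solver/finite-difference tolerance
```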

F Proof of Proposition 5

Assume F1(𝜸)=F1(γ1,,γd)=i=1dwiγiF_{1}(\bm{\gamma})=F_{1}(\gamma_{1},\ldots,\gamma_{d})=\sum_{i=1}^{d}w_{i}\gamma_{i}. Without loss of generality, we assume w1ξ1w_{1}\not=\xi_{1}. Conditions (A1’)–(A3’) clearly hold. Hence from Parts (ii) and (iii) of Lemma 10 there exists η^\hat{\eta} such that for all q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), the optimizer 𝜸~\tilde{\bm{\gamma}} of the optimization problem in Section IV-A is such that F1(𝜸~)=ξ1F_{1}(\tilde{\bm{\gamma}})=\xi_{1} and Fk(𝜸~)<ξkF_{k}(\tilde{\bm{\gamma}})<\xi_{k} for all k1k\not=1. Note that F1(𝜸)/γi=wi{\partial F_{1}(\bm{\gamma})}/{\partial\gamma_{i}}=w_{i}. Then for q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), using the KKT conditions, we obtain the first-order optimality conditions for the optimizer 𝜸~\tilde{\bm{\gamma}}:

i=1dγ~i\displaystyle\sum_{i=1}^{d}\tilde{\gamma}_{i} =1,\displaystyle=1,
i=1dwiγ~i\displaystyle\sum_{i=1}^{d}w_{i}\tilde{\gamma}_{i} =ξ1,\displaystyle=\xi_{1}, (37)
q(i)\displaystyle q(i) =λ1γ~i+λ2γ~iwi.\displaystyle=\lambda_{1}\tilde{\gamma}_{i}+\lambda_{2}\tilde{\gamma}_{i}w_{i}.

Hence,

λ2=q(1)γ~1γ~1(w1ξ1).\displaystyle\lambda_{2}=\frac{q(1)-\tilde{\gamma}_{1}}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})}. (38)

Substituting (38) into (37), we obtain

q(i)=γ~i(1+(q(1)γ~1)(wiξ1)γ~1(w1ξ1)).\displaystyle q(i)=\tilde{\gamma}_{i}\bigg{(}1+\frac{(q(1)-\tilde{\gamma}_{1})(w_{i}-\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})}\bigg{)}.

Thus, the Jacobian of (q(2),,q(d))(q(2),\ldots,q(d)) with respect to (γ~2,,γ~d)(\tilde{\gamma}_{2},\ldots,\tilde{\gamma}_{d}) is the following (d1)×(d1)(d-1)\times(d-1) diagonal matrix:

𝐉(q)\displaystyle\mathbf{J}(q) =diag[1+(q(1)γ~1)(w2ξ1)γ~1(w1ξ1),1+(q(1)γ~1)(w3ξ1)γ~1(w1ξ1),\displaystyle=\mathrm{diag}\!\bigg{[}1\!+\!\frac{(q(1)\!-\!\tilde{\gamma}_{1})(w_{2}\!-\!\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})},\!1\!+\!\frac{(q(1)\!-\!\tilde{\gamma}_{1})(w_{3}\!-\!\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})},
,1+(q(1)γ~1)(wdξ1)γ~1(w1ξ1)].\displaystyle\qquad\ldots,1+\frac{(q(1)-\tilde{\gamma}_{1})(w_{d}-\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})}\bigg{]}.

Since p0(i)>0p_{0}(i)>0 for all i=1,2,,di=1,2,\dots,d, the preceding display shows that the iith diagonal term of 𝐉(p0)\mathbf{J}(p_{0}) equals p0(i)/γ~i>0p_{0}(i)/\tilde{\gamma}_{i}>0. Thus, det(𝐉(p0))0\mathrm{det}(\mathbf{J}(p_{0}))\neq 0, which proves that Condition (A4’) holds for the set Γ\Gamma in Example 3.

G Proof of Proposition 6

We now prove the promised central limit-type result for the sequence of random variables {Sn/n}n\{S_{n}/\sqrt{n}\}_{n\in\mathbb{N}}. Let z(0,1)z\in(0,1). Let ζ\zeta be given as in Part (i) of Lemma 11 and define the ζ\zeta-typical set

𝒯ζ(n)=𝒯ζ(n)(p0)\displaystyle\mathcal{T}_{\zeta}^{(n)}=\mathcal{T}_{\zeta}^{(n)}(p_{0})
={xn𝒳n:|(1nk=1n𝟙{xk=i})p0(i)|<ζ,i𝒳}.\displaystyle=\bigg{\{}x^{n}\in\mathcal{X}^{n}:\Big{|}\Big{(}\frac{1}{n}\sum_{k=1}^{n}\mathbbm{1}\{x_{k}=i\}\Big{)}-p_{0}(i)\Big{|}<\zeta,~{}\forall\,i\in\mathcal{X}\bigg{\}}.

This is the set of sequences whose types are near p0p_{0}. The key idea is to perform a Taylor expansion of the function f(Q)=i=1dQ(i)logp0(i)gi(Q)f(Q)=\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{g_{i}(Q)} (defined in (17)) at the point Q=p0Q=p_{0} and analyze the asymptotics of the various terms in the expansion. For brevity, define the deviation of the type QQ of XnX^{n} from the true distribution at symbol i𝒳i\in\mathcal{X} as

Δi:=Q(i)p0(i),\Delta_{i}:=Q(i)-p_{0}(i), and write bi:=log(p0(i)/gi(p0))b_{i}:=\log({p_{0}(i)}/{g_{i}(p_{0})}) as shorthand for the coefficients that appear in (43) and (46) below.

For q(p0,ζ)q\in\mathcal{B}(p_{0},\zeta), let 𝐇(q)d×d\mathbf{H}(q)\in\mathbb{R}^{d\times d} be the Hessian matrix of f(q)f(q). This is well defined because f()f(\cdot) is twice continuously differentiable on (p0,ζ)\mathcal{B}(p_{0},\zeta) according to Part (ii) of Lemma 11. If xn𝒯ζ(n)x^{n}\in\mathcal{T}_{\zeta}^{(n)}, then Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta). Thus for Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta), using Taylor’s theorem we have the expansion

f(Q)\displaystyle f(Q) =f(p0)+(f(p0))(Qp0)\displaystyle=f(p_{0})+(\nabla f(p_{0}))^{\top}(Q-p_{0})
+12(Qp0)𝐇(Q~)(Qp0)\displaystyle\qquad+\frac{1}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}
=i=1dp0(i)logp0(i)gi(p0)+i=1dlogp0(i)gi(p0)Δi\displaystyle=\sum_{i=1}^{d}p_{0}(i)\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\sum_{i=1}^{d}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Delta_{i}
i=1dj=1dp0(j)gj(p0)gj(q)q(i)|q=p0Δi\displaystyle\qquad-\sum_{i=1}^{d}\sum_{j=1}^{d}\frac{p_{0}(j)}{g_{j}(p_{0})}\frac{\partial g_{j}(q)}{\partial q(i)}\bigg{|}_{q=p_{0}}\Delta_{i}
+12(Qp0)𝐇(Q~)(Qp0)\displaystyle\qquad+\frac{1}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top} (39)
=D(p0p𝜸)+i=1dlogp0(i)gi(p0)Δi\displaystyle=D(p_{0}\|p_{\bm{\gamma}^{\prime}})+\sum_{i=1}^{d}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Delta_{i}
+12(Qp0)𝐇(Q~)(Qp0),\displaystyle\qquad+\frac{1}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}, (40)

where Q~\tilde{Q} lies on the line segment between QQ and p0p_{0}, (39) follows from (32) in Lemma 11 and (40) follows from Part (i) of Lemma 11. Note that we represent probability mass functions as row vectors.

Then for Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta), from (40), we have that

min𝜸Γ(ni=1dQ(i)logp0(i)γi)nD(p0p𝜸)\displaystyle\min_{\bm{\gamma}\in\Gamma}\bigg{(}\sqrt{n}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}\bigg{)}-\sqrt{n}D(p_{0}\|p_{\bm{\gamma}^{\prime}})
=n(f(Q)D(p0p𝜸))\displaystyle=\sqrt{n}\big{(}f(Q)-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\big{)}
=i=1dnΔilogp0(i)gi(p0)+n2(Qp0)𝐇(Q~)(Qp0).\displaystyle=\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\sqrt{n}}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}. (41)

Let λmin(𝐇(q))\lambda_{\min}(\mathbf{H}(q)) and λmax(𝐇(q))\lambda_{\max}(\mathbf{H}(q)) be the smallest and largest eigenvalues of 𝐇(q)\mathbf{H}(q), respectively. From Part (ii) of Lemma 11, it follows that f()f(\cdot) is smooth on (p0,ζ)\mathcal{B}(p_{0},\zeta), which implies that there exist two constants c~\tilde{c} and C~\tilde{C} such that

\displaystyle-\infty<\tilde{c}<\min_{q\in\mathcal{B}(p_{0},\zeta)}\lambda_{\min}(\mathbf{H}(q))\leq\max_{q\in\mathcal{B}(p_{0},\zeta)}\lambda_{\max}(\mathbf{H}(q))<\tilde{C}<\infty. (42)

Then we have the upper bound shown in (45),

P0\displaystyle P_{0} (min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))Φ1(z)V(p0p𝜸))\displaystyle\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\bigg{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg{)}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}
P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle\leq P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\bigg{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg{)}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)})
=P0(i=1dnbiΔi+n2(Qp0)𝐇(Q~)(Qp0)Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle=P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}b_{i}\Delta_{i}+\frac{\sqrt{n}}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)}) (43)
P0(i=1dnΔilogp0(i)gi(p0)+λmin(𝐇(Q~))2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle\leq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\lambda_{\min}(\mathbf{H}(\tilde{Q}))}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)})
P0(i=1dnΔilogp0(i)gi(p0)+c~2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle\leq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{c}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)}) (44)
P0(i=1dnΔilogp0(i)gi(p0)+c~2i=1dnΔi2Φ1(z)V(p0p𝜸))+dexp(2nζ2),\displaystyle\leq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{c}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}+d\exp(-2n\zeta^{2}), (45)

in which (43) follows from the fact that Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta) for xn𝒯ζ(n)x^{n}\in\mathcal{T}_{\zeta}^{(n)} together with (41), (44) follows from (42), and (45) holds by the union bound and Hoeffding’s inequality. Similarly, we can obtain the lower bound shown in (46).

P0\displaystyle P_{0} (min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))Φ1(z)V(p0p𝜸))\displaystyle\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\bigg{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg{)}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}
P0(i=1dnbiΔi+n2(Qp0)𝐇(Q~)(Qp0)Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}b_{i}\Delta_{i}+\frac{\sqrt{n}}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}
P0(i=1dnbiΔi+λmax(𝐇(Q~))2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}b_{i}\Delta_{i}+\frac{\lambda_{\max}(\mathbf{H}(\tilde{Q}))}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}
P0(i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}
P0(i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2Φ1(z)V(p0p𝜸))P0(Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}-P_{0}\big{(}X^{n}\not\in\mathcal{T}_{\zeta}^{(n)}\big{)}
P0(i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2Φ1(z)V(p0p𝜸))dexp(2nζ2).\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}-d\exp(-2n\zeta^{2}). (46)

One can verify that

ni=1dΔilogp0(i)gi(p0)\displaystyle n\sum_{i=1}^{d}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}
=k=1n(i=1d(𝟙{Xk=i}p0(i))logp0(i)gi(p0))\displaystyle=\sum_{k=1}^{n}\bigg{(}\sum_{i=1}^{d}\big{(}\mathbbm{1}\{X_{k}=i\}-p_{0}(i)\big{)}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{)} (47)

and the variance

Var0[i=1d(𝟙{X1=i}p0(i))logp0(i)gi(p0)]\displaystyle\mathrm{Var}_{0}\bigg{[}\sum_{i=1}^{d}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{]}
=𝔼0[(i=1d(𝟙{X1=i}p0(i))logp0(i)gi(p0))2]\displaystyle=\mathbb{E}_{0}\bigg{[}\bigg{(}\sum_{i=1}^{d}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{)}^{2}\bigg{]} (48)
=𝔼0[i=1d(logp0(i)gi(p0))2(𝟙{X1=i}p0(i))2\displaystyle=\mathbb{E}_{0}\bigg{[}\sum_{i=1}^{d}\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}^{2}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))^{2}
+2ji(𝟙{X1=i}p0(i))(𝟙{X1=j}p0(j))\displaystyle\qquad+2\sum_{j\neq i}(\mathbbm{1}\{X_{1}=i\}\!-p_{0}(i))(\mathbbm{1}\{X_{1}=j\}\!-p_{0}(j))
×(logp0(i)gi(p0))(logp0(j)gj(p0))]\displaystyle\qquad\quad\times\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}\bigg{]}
=i=1d(1p0(i))p0(i)log2p0(i)gi(p0)\displaystyle=\sum_{i=1}^{d}(1-p_{0}(i))p_{0}(i)\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}
2ijp0(i)p0(j)(logp0(i)gi(p0))(logp0(j)gj(p0))\displaystyle\qquad-2\sum_{i\neq j}p_{0}(i)p_{0}(j)\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)} (49)
=i=1dp0(i)log2p0(i)gi(p0)i=1dp0(i)2log2p0(i)gi(p0)\displaystyle=\sum_{i=1}^{d}p_{0}(i)\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}-\sum_{i=1}^{d}p_{0}(i)^{2}\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}
2ijp0(i)p0(j)(logp0(i)gi(p0))(logp0(j)gj(p0))\displaystyle\qquad-2\sum_{i\neq j}p_{0}(i)p_{0}(j)\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}
=V(p0p𝜸),\displaystyle=V(p_{0}\|p_{\bm{\gamma}^{\prime}}),

where (48) follows from

𝔼0[i=1d(𝟙{X1=i}p0(i))logp0(i)gi(p0)]=0,\mathbb{E}_{0}\bigg{[}\sum_{i=1}^{d}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{]}=0,

and (49) follows from

ij𝔼0[(𝟙{X1=i}p0(i))(𝟙{X1=j}p0(j))\displaystyle\sum_{i\neq j}\mathbb{E}_{0}\bigg{[}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))(\mathbbm{1}\{X_{1}=j\}-p_{0}(j))
×(logp0(i)gi(p0))(logp0(j)gj(p0))]\displaystyle\quad\times\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}\bigg{]}
=ijp0(i)p0(j)(logp0(i)gi(p0))(logp0(j)gj(p0))\displaystyle=-\sum_{i\neq j}p_{0}(i)p_{0}(j)\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}

and

𝔼0[i=1d(𝟙{Xk=i}p0(i))2log2p0(i)gi(p0)]\displaystyle\mathbb{E}_{0}\bigg{[}\sum_{i=1}^{d}(\mathbbm{1}\{X_{k}=i\}-p_{0}(i))^{2}\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{]}
=i=1d(1p0(i))p0(i)log2p0(i)gi(p0).\displaystyle=\sum_{i=1}^{d}(1-p_{0}(i))p_{0}(i)\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}.
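As a quick sanity check of this computation, the sketch below compares the direct variance of the log-likelihood ratio with the expanded form above (using a hypothetical p_0 and a stand-in pmf for g(p_0); the pairwise sum is taken over unordered pairs, matching the factor of 2):

```python
import numpy as np

# Sanity check of the variance identity with a hypothetical p0 and a
# stand-in pmf g for g(p0) = p_{gamma'}.
p0 = np.array([0.4, 0.35, 0.25])
g = np.array([0.5, 0.3, 0.2])
llr = np.log(p0 / g)

direct = np.sum(p0 * llr**2) - np.sum(p0 * llr) ** 2  # Var_0[log(p0(X)/g(X))]
expanded = (np.sum((1 - p0) * p0 * llr**2)
            - 2 * sum(p0[i] * p0[j] * llr[i] * llr[j]
                      for i in range(3) for j in range(i + 1, 3)))
print(direct, expanded)  # equal: both are V(p0 || p_{gamma'})
```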

Therefore ni=1dΔilogp0(i)gi(p0)n\sum_{i=1}^{d}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})} is a sum of the i.i.d. random variables {i=1d(𝟙{Xk=i}p0(i))logp0(i)gi(p0)}k=1n\big{\{}\sum_{i=1}^{d}(\mathbbm{1}\{X_{k}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\big{\}}_{k=1}^{n}, each with mean 0 and variance V(p0p𝜸)V(p_{0}\|p_{\bm{\gamma}^{\prime}}). Hence, by the central limit theorem,

i=1dnΔilogp0(i)gi(p0)d𝒩(0,V(p0p𝜸)).\displaystyle\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})).

Together with the fact that i=1dnΔi20\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\to 0 almost surely, Slutsky’s theorem implies that

i=1dnΔilogp0(i)gi(p0)+c~2i=1dnΔi2d𝒩(0,V(p0p𝜸)),\displaystyle\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{c}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})), (50)

and

i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2d𝒩(0,V(p0p𝜸)).\displaystyle\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})). (51)

Then combining (45), (46), (50) and (51), we have that

lim supn\displaystyle\limsup_{n\to\infty} P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))\displaystyle\;P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\Big{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(z)V(p0p𝜸))z,\displaystyle\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}\leq z, (52)

and

lim infn\displaystyle\liminf_{n\to\infty} P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))\displaystyle\;P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\Big{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(z)V(p0p𝜸))z.\displaystyle\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}\geq z. (53)

Since z(0,1)z\in(0,1) is arbitrary, it follows from (52) and (53) that

min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))d𝒩(0,V(p0p𝜸)),\displaystyle\!\min_{\bm{\gamma}\in\Gamma}\!\sqrt{n}\!\bigg{(}\!\sum_{i=1}^{d}\!Q(i)\!\log\frac{p_{0}(i)}{\gamma_{i}}\!-\!D(p_{0}\|p_{\bm{\gamma}^{\prime}})\!\bigg{)}\!\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\!\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})),

which completes the proof of Proposition 6.
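Proposition 6 can also be probed empirically. The following Monte Carlo sketch (a hypothetical binary instance with a grid approximation of \Gamma; none of these parameters come from the paper) estimates the mean and variance of the normalized statistic, which should be close to 0 and V(p_{0}\|p_{\bm{\gamma}^{\prime}}) respectively:

```python
import numpy as np

# Monte Carlo sketch of Proposition 6: p0 = Bern(0.5) and
# Gamma = {Bern(g): g in [0.6, 0.8]}, so gamma' corresponds to g = 0.6.
rng = np.random.default_rng(1)
p0 = np.array([0.5, 0.5])
grid = [np.array([g, 1 - g]) for g in np.linspace(0.6, 0.8, 201)]

D = lambda p, q: float(np.sum(p * np.log(p / q)))
g_prime = min(grid, key=lambda g: D(p0, g))   # the minimizer gamma'
D_star = D(p0, g_prime)
V_star = float(np.sum(p0 * np.log(p0 / g_prime) ** 2) - D_star**2)

n, trials = 5000, 2000
samples = []
for _ in range(trials):
    Q = rng.multinomial(n, p0) / n            # empirical type of X^n ~ p0
    vals = [np.sum(Q * np.log(p0 / g)) for g in grid]
    samples.append(np.sqrt(n) * (min(vals) - D_star))

# Proposition 6 predicts mean ~= 0 and variance ~= V(p0 || p_{gamma'})
print(np.mean(samples), np.var(samples), V_star)
```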

H Proof of Proposition 7

We now show that Conditions (A1’)–(A4’) imply Conditions (A1)–(A3). Condition (A1) follows directly from Condition (A1’). As 𝒳={1,2,,d}\mathcal{X}=\{1,2,\dots,d\}, we have

D(p0p𝜸)=i=1dp0(i)logp0(i)γi,\displaystyle D(p_{0}\|p_{\bm{\gamma}})=\sum_{i=1}^{d}p_{0}(i)\log\frac{p_{0}(i)}{\gamma_{i}},

and

D(p𝜸p0)=i=1dγilogγip0(i).\displaystyle D(p_{\bm{\gamma}}\|p_{0})=\sum_{i=1}^{d}\gamma_{i}\log\frac{\gamma_{i}}{p_{0}(i)}.

Combining Condition (A2’) (which states that mini=1,,dγic0>0\min_{i=1,\ldots,d}\gamma_{i}\geq c_{0}>0 for all 𝜸Γ\bm{\gamma}\in\Gamma) with the fact that mini=1,,dp0(i)>0\min_{i=1,\ldots,d}p_{0}(i)>0, we deduce that D(p0p𝜸)D(p_{0}\|p_{\bm{\gamma}}) and D(p𝜸p0)D(p_{\bm{\gamma}}\|p_{0}) are uniformly bounded and twice continuously differentiable on Γ\Gamma. As p0Γp_{0}\notin\Gamma, we have D(p0p𝜸)>0D(p_{0}\|p_{\bm{\gamma}})>0 and D(p𝜸p0)>0D(p_{\bm{\gamma}}\|p_{0})>0 for every 𝜸Γ\bm{\gamma}\in\Gamma, which, together with the compactness of Γ\Gamma, implies that

min𝜸ΓD(p0p𝜸)>0andmin𝜸ΓD(p𝜸p0)>0.\displaystyle\min_{\bm{\gamma}\in\Gamma}D(p_{0}\|p_{\bm{\gamma}})>0\quad\mbox{and}\quad\min_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}}\|p_{0})>0. (54)

From [22, Theorem 2.7.2], D(p0p𝜸)D(p_{0}\|p_{\bm{\gamma}}) is strictly convex in (p0,𝜸)(p_{0},\bm{\gamma}), which, together with the fact that Γ\Gamma is compact and convex, implies the uniqueness of the minimizers to the two optimization problems in (54).

For Condition (A3), as 𝒳\mathcal{X} is a finite alphabet and Condition (A2’) holds, it can be easily checked that 𝔼[max𝜸Γ|ξ(𝜸)|2]<\mathbb{E}[\max_{\bm{\gamma}\in\Gamma}|\xi(\bm{\gamma})|^{2}]<\infty. Note that

𝜸ξ(𝜸)=(𝟙{X=1}γ1,,𝟙{X=d}γd).\nabla_{\bm{\gamma}}\xi(\bm{\gamma})=\Big{(}\frac{\mathbbm{1}\{X=1\}}{\gamma_{1}},\ldots,\frac{\mathbbm{1}\{X=d\}}{\gamma_{d}}\Big{)}^{\top}.

We can define the finite number x0:=max𝜸Γmaxi𝒳1/γi1/c0x_{0}:=\max_{\bm{\gamma}\in\Gamma}\max_{i\in\mathcal{X}}1/\gamma_{i}\leq 1/c_{0} (because Condition (A2’) mandates that mini=1,,dγic0>0\min_{i=1,\ldots,d}\gamma_{i}\geq c_{0}>0 for all 𝜸Γ)\bm{\gamma}\in\Gamma). With this choice, trivially, for all x>x0x>x_{0},

P0(max𝜸Γ|𝜸ξ(𝜸)|>x)=0,\displaystyle P_{0}\Bigg{(}\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Bigg{)}=0,

which shows that Condition (A3) holds.

References

  • [1] J. Pan, Y. Li, and V. Y. F. Tan, “Asymptotics of sequential composite hypothesis testing under probabilistic constraints,” in IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 2021, pp. 172–177.
  • [2] R. Blahut, “Hypothesis testing and information theory,” IEEE Transactions on Information Theory, vol. 20, no. 4, pp. 405–417, 1974.
  • [3] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesis testing and changepoint detection.   CRC Press, 2014.
  • [4] J. Neyman and E. S. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society of London (Series A), vol. 231, pp. 289–337, 1933.
  • [5] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC), 2014.
  • [6] A. Wald and J. Wolfowitz, “Optimum character of the sequential probability ratio test,” Ann. Math. Statist., vol. 19, no. 3, pp. 326–339, 1948.
  • [7] A. Lalitha and T. Javidi, “Reliability of sequential hypothesis testing can be achieved by an almost-fixed-length test,” in IEEE International Symposium on Information Theory.   IEEE, 2016, pp. 1710–1714.
  • [8] M. Haghifam, V. Y. F. Tan, and A. Khisti, “Sequential classification with empirically observed statistics,” IEEE Transactions on Information Theory, vol. 67, no. 5, pp. 3095–3113, 2021.
  • [9] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio test optimal?” IEEE Transactions on Information Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
  • [10] T.-L. Lai, “Asymptotic optimality of generalized sequential likelihood ratio tests in some classical sequential testing problems,” Sequential Analysis, vol. 21, no. 4, pp. 219–247, 2002.
  • [11] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier hypothesis testing,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 4066–4082, 2014.
  • [12] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal sequential outlier hypothesis testing,” Sequential Analysis, vol. 36, no. 3, pp. 309–344, 2017.
  • [13] X. Li, J. Liu, and Z. Ying, “Generalized sequential probability ratio test for separate families of hypotheses,” Sequential Analysis, vol. 33, no. 4, pp. 539–563, 2014.
  • [14] V. Strassen, “Asymptotische Abschätzungen in Shannons Informationstheorie,” in Transactions of the Third Prague Conference on Information Theory, Statistical Decision Functions, Random Processes. Czechoslovak Academy of Sciences, Prague, 1962, pp. 689–723.
  • [15] V. Y. F. Tan, “Asymptotic estimates in information theory with non-vanishing error probabilities,” Foundations and Trends® in Communications and Information Theory, vol. 11, no. 1-2, pp. 1–184, 2014.
  • [16] Y. Li and V. Y. F. Tan, “Second-order asymptotics of sequential hypothesis testing,” IEEE Transactions on Information Theory, vol. 66, no. 11, pp. 7222–7230, 2020.
  • [17] A. W. van der Vaart, Asymptotic Statistics.   Cambridge University Press, 1998.
  • [18] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference.   Now Publishers Inc, 2008.
  • [19] A. R. Sampson, “Characterizing exponential family distributions by moment generating functions,” The Annals of Statistics, vol. 3, no. 3, pp. 747–753, 1975.
  • [20] H. J. Bierens, Modes of Convergence, ser. Themes in Modern Econometrics.   Cambridge University Press, 2004, pp. 137–178.
  • [21] R. Durrett, Probability: Theory and Examples.   Duxbury Press, 2004.
  • [22] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).   USA: Wiley-Interscience, 2006.
  • [23] S.-I. Amari and H. Nagaoka, Methods of Information Geometry, ser. Translations of Mathematical Monographs.   American Mathematical Society, 2007.
  • [24] Y. Polyanskiy, “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton University, 2010.
  • [25] A. C. Berry, “The accuracy of the Gaussian approximation to the sum of independent variates,” Transactions of the American Mathematical Society, vol. 49, no. 1, pp. 122–136, 1941.
  • [26] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge University Press, 2004.
  • [27] M. Spivak, Calculus On Manifolds: A Modern Approach To Classical Theorems Of Advanced Calculus.   Taylor & Francis Inc, 1971.
Jiachun Pan is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the National University of Singapore (NUS). She received the B.S. degree from the University of Electronic Science and Technology of China (UESTC) in 2015 and the M.Eng. degree from the University of Science and Technology of China (USTC) in 2019. Her research interests include information theory and statistical learning.
Yonglong Li is a research fellow at the Department of Electrical and Computer Engineering, National University of Singapore. He received the bachelor’s degree in Mathematics from Zhengzhou University in 2011 and the Ph.D. degree in Mathematics from the University of Hong Kong in 2015. From 2017 to 2019, he was a postdoctoral fellow at the Center for Memory and Recording Research (CMRR), University of California, San Diego.
Vincent Y. F. Tan (S’07-M’11-SM’15) was born in Singapore in 1981. He received the B.A. and M.Eng. degrees in electrical and information science from Cambridge University in 2005, and the Ph.D. degree in electrical engineering and computer science (EECS) from the Massachusetts Institute of Technology (MIT) in 2011. He is currently a Dean’s Chair Associate Professor with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore (NUS). His research interests include information theory, machine learning, and statistical signal processing. Dr. Tan is a member of the IEEE Information Theory Society Board of Governors. He was an IEEE Information Theory Society Distinguished Lecturer from 2018 to 2019. He received the MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize in 2011, the NUS Young Investigator Award in 2014, the Singapore National Research Foundation (NRF) Fellowship (Class of 2018), and the NUS Young Researcher Award in 2019. He is currently serving as a Senior Area Editor for the IEEE Transactions on Signal Processing and for the IEEE Transactions on Information Theory.