
Asymptotics of Sequential Composite Hypothesis Testing under Probabilistic Constraints

Jiachun Pan, Yonglong Li, Vincent Y. F. Tan, Senior Member, IEEE This work is partially funded by a Singapore National Research Foundation Fellowship (R-263-000-D02-281). The paper was presented in part at the 2021 International Symposium on Information Theory (ISIT) [1]. Jiachun Pan is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Email: pan.jiachun@u.nus.edu. Yonglong Li is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Email: elelong@nus.edu.sg. Vincent Y. F. Tan is with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore, Singapore, Email: vtan@nus.edu.sg. Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
Abstract

We consider the sequential composite binary hypothesis testing problem in which one of the hypotheses is governed by a single distribution while the other is governed by a family of distributions whose parameters belong to a known set $\Gamma$. We would like to design a test to decide which hypothesis is in effect. Under the constraints that the probabilities that the length of the test, a stopping time, exceeds $n$ are bounded by a certain threshold $\epsilon$, we obtain certain fundamental limits on the asymptotic behavior of the sequential test as $n$ tends to infinity. Assuming that $\Gamma$ is a convex and compact set, we obtain the set of all first-order error exponents for the problem. We also prove a strong converse. Additionally, we obtain the set of second-order error exponents under the assumption that the alphabet of the observations $\mathcal{X}$ is finite. In the proof of the second-order asymptotics, a main technical contribution is the derivation of a central limit-type result for a maximum of an uncountable set of log-likelihood ratios under suitable conditions. This result may be of independent interest. We also show that some important statistical models satisfy the conditions.

Index Terms:
Sequential composite hypothesis testing, Error exponents, Second-order asymptotics, Generalized sequential probability ratio test

I Introduction

Hypothesis testing is a fundamental problem in information theory and statistics [2]. Here we consider a sequential composite hypothesis testing problem in which i.i.d. observations are drawn from either a simple null hypothesis or a composite alternative hypothesis. We consider the first-order and second-order tradeoffs between the two types of error probabilities under a probabilistic constraint on the stopping time. There is a vast literature on this subject [3, Part I]; however, the optimal trade-off under the probabilistic stopping time constraint has not been determined. The probabilistic constraint means that we constrain the probabilities (under both hypotheses) that the stopping time exceeds $n$ to be no larger than some prescribed threshold $\epsilon\in(0,1)$. We let $n$ tend to infinity to exploit various asymptotic and concentration results.

I-A Related works

In the classical problem of sequential hypothesis testing in the statistical literature, one seeks to minimize the expected sample size $\mathbb{E}_{i}[\tau(\delta)]$, $i\in\{0,1\}$, subject to bounds on the type-I and type-II error probabilities $P_{0}(\delta_{\tau}=1)\leq\alpha$ and $P_{1}(\delta_{\tau}=0)\leq\beta$, i.e., the sequential testing problem is to solve, for each $i\in\{0,1\}$,

\[
\min_{(\tau,\delta)}\mathbb{E}_{i}[\tau(\delta)]\quad\mbox{s.t.}\quad P_{0}(\delta_{\tau}=1)\leq\alpha,\; P_{1}(\delta_{\tau}=0)\leq\beta. \tag{1}
\]

There is a vast literature on solving the above problem (see [3, Part I] for example). The dual problem corresponding to (1) is the minimization of the error probabilities subject to expectation constraints on the sample size. More specifically, the dual problem corresponding to (1) entails solving, for each $i\in\{0,1\}$,

\[
\min_{(\tau,\delta)}P_{i}(\delta_{\tau}=1-i)\quad\mbox{s.t.}\quad \mathbb{E}_{i}[\tau]\leq n,\; i\in\{0,1\}. \tag{2}
\]

The optimal tests $(\tau^{*},\delta^{*})$ of (1) and (2) are given by appropriate sequential probability ratio tests. However, in this paper, we consider the problem of minimizing the error probabilities subject to probabilistic constraints on the sample size. In more detail, the problem we are concerned with is the following:

\[
\min_{(\tau,\delta)}P_{i}(\delta_{\tau}=1-i)\quad\mbox{s.t.}\quad P_{i}(\tau>n)<\epsilon,\; i\in\{0,1\}. \tag{3}
\]

As the nature of the constraints is different (expectation versus probabilistic), the proof techniques are also different. For problem (2), Wald's identity and the data-processing inequality are used to derive the achievability and the converse. For our problem (3), tools such as concentration inequalities and the central limit theorem are used to derive the achievability and the converse.

For the first-order asymptotics (exponents of the two types of error probabilities), there is a vast literature on binary hypothesis testing. In fixed-length hypothesis testing, where the length of the vector of observations is fixed, the Neyman–Pearson lemma [4] states that the likelihood ratio test is optimal, and the Chernoff–Stein lemma [5, Theorem 13.1] shows that if we constrain the type-I error to be less than any $\epsilon\in(0,1)$, the best (maximum) type-II error exponent is the relative entropy $D(p_{0}\|p_{1})$, where $p_{0}$ and $p_{1}$ are respectively the distributions under the null and alternative hypotheses. If we require the type-I error exponent to be at least $r>0$, i.e., the type-I error probability is upper bounded by $\exp(-nr)$, the maximum type-II error exponent is $\min\{D(q\|p_{1}):D(q\|p_{0})\leq r\}$ [2]. In this regard, we see that there is a trade-off between the two error exponents, i.e., they cannot be simultaneously large. However, in the sequential case, where the length of the test sample is a stopping time and its expectation is bounded by $n$, the trade-off can be eradicated. Wald and Wolfowitz [6] showed that when the expectations of the sample length under $H_{0}$ and $H_{1}$ are bounded by a common integer $n$ (these are known as the expectation constraints) and $n$ tends to infinity, the set of achievable error exponents is $\{(E_{0},E_{1}):E_{0}\leq D(p_{1}\|p_{0}),\ E_{1}\leq D(p_{0}\|p_{1})\}$. In addition, the corner point $(D(p_{1}\|p_{0}),D(p_{0}\|p_{1}))$ is attained by a sequence of sequential probability ratio tests (SPRTs). Lalitha and Javidi [7] considered an interesting setting that interpolates between fixed-length hypothesis testing and sequential hypothesis testing. They considered the almost-fixed-length hypothesis testing problem, in which the stopping time is allowed to be larger than a prescribed integer $n$ with exponentially small probability $\exp(-n\gamma)$ for different $\gamma>0$. The probabilistic constraints we employ in this paper are analogous to those in [7], but instead of allowing the event that the stopping time is larger than $n$ to have exponentially small probability, we only require this event to have probability at most $\epsilon\in(0,1)$, a fixed constant. This allows us to ask questions ranging from strong converses to second-order asymptotics. In [8], Haghifam, Tan, and Khisti considered sequential classification, which is similar to sequential hypothesis testing apart from the fact that the true distributions are only partially known in the form of training samples.

For composite hypothesis testing, Zeitouni, Ziv, and Merhav [9] investigated the generalized likelihood ratio test (GLRT) and proposed conditions for asymptotic optimality of the GLRT in the Neyman–Pearson sense. For the sequential case, Lai [10] analyzed different sequential testing problems and obtained a unified asymptotic theory showing that certain generalized sequential likelihood ratio tests are asymptotically optimal solutions to these problems. Li, Nitinawarat, and Veeravalli [11] considered a universal outlier hypothesis testing problem in the fixed-length setting; universality here refers to the fact that the distributions are unknown and have to be estimated on the fly. They then extended their work to the sequential setting [12], but under expectation constraints on the stopping time. The work that is closest to ours is that by Li, Liu, and Ying [13], whose results can be modified to solve the composite version of the dual problem (2). They showed that the generalized sequential probability ratio test is asymptotically optimal by making use of optimality results for sequential probability ratio tests (SPRTs).

Concerning the second-order asymptotic regime, in fixed-length binary hypothesis testing in which the type-I error is bounded by a fixed constant $\epsilon\in(0,1)$, Strassen [14] showed that the second-order term can be quantified via the relative entropy variance [15] and the inverse of the Gaussian cumulative distribution function. For the sequential case, Li and Tan [16] recently established the second-order asymptotics of sequential binary hypothesis testing under probabilistic and expectation constraints on the stopping time, showing that the former (resp. latter) set of constraints results in a $\Theta(1/\sqrt{n})$ (resp. $\Theta(1/n)$) backoff from the relative entropies. These are estimates of the costs of operating in the finite-length setting. In this paper, we seek to extend these results to sequential composite hypothesis testing.

I-B Main contributions

Our main contributions consist in obtaining the first-order and second-order asymptotics for sequential composite hypothesis testing under the probabilistic constraints, i.e., we constrain the probabilities that the length of observations exceeds $n$ to be no larger than some prescribed $\epsilon\in(0,1)$.

  • First, while the results of Li, Liu, and Ying [13] can be modified to solve the composite version of the dual problem in (2), which yields first-order asymptotic results under expectation constraints, we obtain the first-order asymptotic results under the probabilistic constraints. We show that the corner points of the optimal error exponent regions are identical under both types of constraints.

  • Second, Li, Liu, and Ying [13] only proved that the generalized sequential probability ratio test is asymptotically optimal by making use of the optimality results of the sequential probability ratio test (SPRT). Here we prove a strong converse result, namely that the exponents stay unchanged even if the probability that the stopping time exceeds $n$ is smaller than $\epsilon$ for all $\epsilon\in(0,1)$. We do so using information-theoretic ideas and, in particular, the ubiquitous change-of-measure technique (Lemma 3).

  • Third, and most importantly, we obtain the second-order asymptotics of the error exponents when we assume that the observations take values on a finite alphabet. A main technical contribution here is that we obtain a new central limit-type result for a maximum of an uncountable set of log-likelihood ratios under suitable conditions (Proposition 6). We contrast our central limit-type result to classical statistical results such as Wilks’ theorem [17, Chapter 16].

I-C Paper Outline

The rest of the paper is structured as follows. In Section II, we formulate the composite sequential hypothesis testing problem precisely and state the probabilistic constraints on the stopping time. In Section III, we list some mild assumptions on the distributions and uncertainty set in order to state and prove our first-order asymptotic results. In Section IV, we consider the second-order asymptotics of the same problem by augmenting the assumptions stated in Section III. We state a central limit-type theorem for the maximum of a set of log-likelihood ratios and our main result concerning the second-order asymptotics. We relegate the more technical calculations (such as proofs of lemmata) to the appendix.

II Problem Formulation

Let $\{X_{i}\}_{i=1}^{\infty}$ be an observed i.i.d. sequence, where each $X_{i}$ follows a density $p$ with respect to a base measure $\mu$ on the alphabet $\mathcal{X}$. We consider the problem of composite hypothesis testing:

\[
H_{0}:p=p_{0}\quad\mbox{and}\quad H_{1}:p\in\{p_{\gamma}:\gamma\in\Gamma\},
\]

where $p_{0}$ and $p_{\gamma}$ are density functions with respect to $\mu$ and $p_{0}\notin\{p_{\gamma}:\gamma\in\Gamma\}$. We assume that $p_{0}$ and $p_{\gamma}$ are mutually absolutely continuous for all $\gamma\in\Gamma$. Denote $P_{0}$ and $P_{\gamma}$ as the probability measures associated to $p_{0}$ and $p_{\gamma}$, respectively. Let $\mathcal{F}(X^{n})$ be the $\sigma$-algebra generated by $X^{n}=(X_{1},X_{2},\ldots,X_{n})$. Let $\tau$ be a stopping time adapted to the filtration $\{\mathcal{F}(X^{n})\}_{n=1}^{\infty}$ and let $\mathcal{F}_{\tau}$ be the $\sigma$-algebra associated with $\tau$. Let $\delta$ be a $\{0,1\}$-valued $\mathcal{F}_{\tau}$-measurable function. The pair $(\delta,\tau)$ constitutes a sequential hypothesis test, where $\delta$ is called the decision function and $\tau$ is the stopping time. When $\delta=0$ (resp. $\delta=1$), the decision is made in favor of $H_{0}$ (resp. $H_{1}$). The type-I and maximal type-II error probabilities are defined as

\[
\mathsf{P}_{1|0}(\delta,\tau):=P_{0}(\delta=1)\quad\mbox{and}\quad\mathsf{P}_{0|1}(\delta,\tau):=\sup_{\gamma\in\Gamma}P_{\gamma}(\delta=0).
\]

In other words, $\mathsf{P}_{1|0}(\delta,\tau)$ is the error probability that the true density is $p_{0}$ but $\delta=1$, and $\mathsf{P}_{0|1}(\delta,\tau)$ is the maximal error probability over all parameters $\gamma\in\Gamma$ that the true density is $p_{\gamma}$ but the decision $\delta=0$ is made based on the observations up to time $\tau$.

In this paper, we seek the first-order and second-order asymptotics of the exponents of the error probabilities under probabilistic constraints on the stopping time $\tau$. The probabilistic constraints dictate that, for every error tolerance $0<\epsilon<1$, there exists an integer $n_{0}(\epsilon)$ such that for all $n>n_{0}(\epsilon)$, the stopping time $\tau$ satisfies

\[
P_{0}(\tau>n)<\epsilon\quad\mbox{and}\quad\sup_{\gamma\in\Gamma}P_{\gamma}(\tau>n)<\epsilon, \tag{4}
\]

and

\[
P_{0}(\tau<\infty)=1\quad\mbox{and}\quad\sup_{\gamma\in\Gamma}P_{\gamma}(\tau<\infty)=1. \tag{5}
\]

In the following, all logarithms are natural logarithms, i.e., with respect to base $e$.

III First-order Asymptotics

We say that an exponent pair $(E_{0},E_{1})$ is $\epsilon$-achievable under the probabilistic constraints if there exists a sequence of sequential hypothesis tests $\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}$ that satisfies the probabilistic constraints on the stopping time in (4) and (5) and

\[
E_{0}\leq\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})},\qquad
E_{1}\leq\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}.
\]

The set of all $\epsilon$-achievable $(E_{0},E_{1})$ is denoted as $\mathcal{E}_{\epsilon}(p_{0},\Gamma)$. For simple (non-composite) binary sequential hypothesis testing under the expectation constraints (i.e., $\max_{i=0,1}\mathbb{E}_{P_{i}}[\tau]\leq n$), the set of all achievable error exponent pairs, as shown by Wald and Wolfowitz [6] (also see [16, 7]), is

\[
\tilde{\mathcal{E}}_{\epsilon}(p_{0},p_{1})=\{(E_{0},E_{1}):E_{0}\leq D(p_{1}\|p_{0}),\ E_{1}\leq D(p_{0}\|p_{1})\}. \tag{6}
\]

The corner point $(D(p_{1}\|p_{0}),D(p_{0}\|p_{1}))$ can be achieved by a sequence of sequential probability ratio tests [6].

We define the log-likelihood ratio and maximal log-likelihood ratio respectively as

\[
S_{n}(\gamma):=\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}\quad\mbox{and}\quad S_{n}:=\sup_{\gamma\in\Gamma}S_{n}(\gamma).
\]

For two positive numbers $A$ and $B$, we define the stopping time $\tau$ as

\[
\tau:=\inf\{n:S_{n}>A\ \text{or}\ S_{n}<-B\},
\]

and the decision rule as

\[
\delta:=\begin{cases}0,&\text{if }S_{\tau}<-B,\\ 1,&\text{if }S_{\tau}>A.\end{cases}
\]

We call the above test $(\delta,\tau)$ a generalized sequential probability ratio test (GSPRT) with thresholds $A$ and $B$. The stopping time $\tau$ is almost surely finite for any distribution within the family [13], so (5) holds for the GSPRT. For the above GSPRT, we define the type-I error probability and maximal type-II error probability respectively as

\[
\mathsf{P}_{1|0}(\tau,\delta):=P_{0}(S_{\tau}>A)\quad\mbox{and}\quad\mathsf{P}_{0|1}(\tau,\delta):=\sup_{\gamma\in\Gamma}P_{\gamma}(S_{\tau}<-B).
\]
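
To make the GSPRT concrete, the following is a minimal simulation sketch. It assumes a finite alphabet and approximates the uncertainty set $\Gamma$ by a hypothetical finite grid of distributions (`gamma_grid`); the grid, the toy distributions, and all names are illustrative assumptions rather than constructs from the paper.

```python
import numpy as np

def gsprt(xs, p0, gamma_grid, A, B):
    """Run a GSPRT with thresholds A, B > 0; gamma_grid (m x d) is a
    hypothetical finite stand-in for the uncertainty set Gamma."""
    llr = np.log(gamma_grid) - np.log(p0)  # log(p_gamma(x)/p0(x)), shape (m, d)
    S = np.zeros(len(gamma_grid))          # running S_n(gamma), one entry per gamma
    for n, x in enumerate(xs, start=1):
        S += llr[:, x]
        Sn = S.max()                       # S_n = max_gamma S_n(gamma)
        if Sn > A:
            return 1, n                    # decide H1
        if Sn < -B:
            return 0, n                    # decide H0
    return None, len(xs)                   # ran out of samples before deciding

# Toy usage: ternary alphabet, data drawn under H0.
rng = np.random.default_rng(0)
p0 = np.array([0.5, 0.3, 0.2])
gamma_grid = np.array([[0.20, 0.50, 0.30],
                       [0.25, 0.45, 0.30]])
xs = rng.choice(3, size=5000, p=p0)
print(gsprt(xs, p0, gamma_grid, A=20.0, B=20.0))  # typically (0, n): decide H0
```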

We introduce some assumptions on the distributions and $\Gamma$.

  • (A1)

    The parameter set $\Gamma\subset\mathbb{R}^{d}$ is compact.

  • (A2)

    Assume that $\gamma\mapsto D(p_{\gamma}\|p_{0})$ and $\gamma\mapsto D(p_{0}\|p_{\gamma})$ are twice continuously differentiable on $\Gamma$. The solutions to the minimizations $\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})$ and $\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})$ are unique. Their existence is guaranteed by the compactness of $\Gamma$ and the continuity of $\gamma\mapsto D(p_{\gamma}\|p_{0})$ and $\gamma\mapsto D(p_{0}\|p_{\gamma})$ on $\Gamma$. In addition, $\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})>\epsilon_{0}$ and $\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})>\epsilon_{0}$ for some $\epsilon_{0}>0$.

  • (A3)

    Let $\xi(\gamma)=\log p_{\gamma}(X)-\log p_{0}(X)$ be the log-likelihood ratio. We assume that $\mathbb{E}[\max_{\gamma}|\xi(\gamma)|^{2}]<\infty$. Besides, there exist $\alpha>1$ and $x_{0}\in\mathbb{R}$ such that for all $\gamma\in\Gamma$ and $x>x_{0}$,

    \[
    P_{0}\Big(\max_{\gamma\in\Gamma}|\nabla_{\gamma}\xi(\gamma)|>x\Big)\leq e^{-|\log x|^{\alpha}}, \tag{7}
    \]

    where $|\nabla_{\gamma}\xi(\gamma)|$ is the $\ell_{1}$ norm of the gradient vector $\nabla_{\gamma}\xi(\gamma)\in\mathbb{R}^{d}$.

We present some examples that satisfy Conditions (A1)–(A3). We first show that Conditions (A1)–(A3) hold for the canonical exponential family under suitable assumptions and then provide an explicit example.

Example 1 (Canonical exponential families).

The general form of the probability density for the canonical exponential family of probability distributions is [18]:

\[
p_{\bm{\gamma}}(x)=h(x)\exp(\bm{\gamma}^{\top}T(x)-A(\bm{\gamma})),
\]

where $h(x)$ is called the base measure, $\bm{\gamma}$ is the parameter vector, $T(x)$ is referred to as the sufficient statistic, and $A(\bm{\gamma})$ is the cumulant generating function. We define the set of valid parameters as $\Theta=\{\bm{\gamma}\in\mathbb{R}^{d}:A(\bm{\gamma})<\infty\}$.

Now we consider the test

\[
H_{0}:p_{\bm{\gamma}_{0}}(x)=h(x)\exp(\bm{\gamma}_{0}^{\top}T(x)-A(\bm{\gamma}_{0})),\quad\bm{\gamma}_{0}\in\Theta;
\]
\[
H_{1}:p_{\bm{\gamma}}(x)=h(x)\exp(\bm{\gamma}^{\top}T(x)-A(\bm{\gamma})),\quad\bm{\gamma}\in\Gamma,\ \bm{\gamma}_{0}\notin\Gamma.
\]

We also assume that the exponential families under consideration satisfy the following assumptions:

  1. (i)

    $\Gamma\subset\Theta$ is a convex and compact set;

  2. (ii)

    $A(\bm{\gamma})$ is thrice continuously differentiable with respect to $\bm{\gamma}$;

  3. (iii)

    $\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})$ and $\nabla((\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma}))$ are positive definite for $\bm{\gamma}\in\Gamma$.

For this example, Condition (A1) holds because of Assumption (i). For Condition (A2), we have

\[
D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=(\bm{\gamma}-\bm{\gamma}_{0})^{\top}\mathbb{E}_{\bm{\gamma}}[T(X)]-A(\bm{\gamma})+A(\bm{\gamma}_{0}),
\]
\[
D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})=(\bm{\gamma}_{0}-\bm{\gamma})^{\top}\mathbb{E}_{0}[T(X)]-A(\bm{\gamma}_{0})+A(\bm{\gamma}),
\]

which are twice continuously differentiable with respect to $\bm{\gamma}$ in $\Gamma$ based on Assumption (ii). Besides, we have

\[
\nabla_{\bm{\gamma}}^{2}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})\overset{(a)}{=}\nabla((\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})),\qquad
\nabla_{\bm{\gamma}}^{2}D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})=\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma}),
\]

where $(a)$ holds because $\mathbb{E}_{\bm{\gamma}}[T(X)]=\nabla_{\bm{\gamma}}A(\bm{\gamma})$ [18]. Based on Assumption (iii), $D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})$ and $D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})$ are strongly convex in $\bm{\gamma}$. Hence, the solutions to the minimizations are unique. Then we also have

\[
\nabla_{\bm{\gamma}}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=(\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla^{2}_{\bm{\gamma}}A(\bm{\gamma}),
\]

which means that $\nabla_{\bm{\gamma}}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=0$ and $D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})=0$ if and only if $\bm{\gamma}=\bm{\gamma}_{0}$. As $\bm{\gamma}_{0}\notin\Gamma$, $\min_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}}\|p_{\bm{\gamma}_{0}})>0$. Similarly, we have

\[
\nabla_{\bm{\gamma}}D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})=-\nabla_{\bm{\gamma}}A(\bm{\gamma}_{0})+\nabla_{\bm{\gamma}}A(\bm{\gamma}).
\]

As $\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})$ is assumed to be positive definite per Assumption (iii), $\nabla_{\bm{\gamma}}A(\bm{\gamma}_{0})=\nabla_{\bm{\gamma}}A(\bm{\gamma})$ if and only if $\bm{\gamma}=\bm{\gamma}_{0}$. As $\bm{\gamma}_{0}\notin\Gamma$, we have $\min_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}_{0}}\|p_{\bm{\gamma}})>0$.

For Condition (A3), we have

\[
\xi(\bm{\gamma})=(\bm{\gamma}-\bm{\gamma}_{0})^{\top}T(X)-A(\bm{\gamma})+A(\bm{\gamma}_{0}).
\]

Then $\mathbb{E}[\max_{\bm{\gamma}}|\xi(\bm{\gamma})|^{2}]<\infty$ due to Assumptions (i) and (ii). Let $\mathbf{e}$ be the all-ones vector. For all $t>0$ such that $t\mathbf{e}+\bm{\gamma}_{0}\in\Theta$, we have

\[
\begin{aligned}
P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Big)
&=P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}\big|T(X)-\nabla_{\bm{\gamma}}A(\bm{\gamma})\big|>x\Big)\\
&\overset{(a)}{=}P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}\big|T(X)-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|>x\Big)\\
&\leq P_{0}\Big(\big|T(X)-\mathbb{E}_{0}[T(X)]\big|+\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|>x\Big)\\
&=P_{0}\Big(\big|T(X)-\mathbb{E}_{0}[T(X)]\big|>x-\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|\Big)\\
&\overset{(b)}{\leq}\exp\Big(-tx+t\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|\nabla_{\bm{\gamma}}A(\bm{\gamma})|_{\bm{\gamma}=\bm{\gamma}_{0}}+\log 2\Big),
\end{aligned}
\]

where $(a)$ is based on the property $\mathbb{E}_{\bm{\gamma}}[T(X)]=\nabla_{\bm{\gamma}}A(\bm{\gamma})$, and $(b)$ is based on Markov's inequality and the fact that $\mathbb{E}_{0}[\exp(\langle t\mathbf{e},T(X)-\mathbb{E}_{0}[T(X)]\rangle)]=\exp(A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|\nabla_{\bm{\gamma}}A(\bm{\gamma})|_{\bm{\gamma}=\bm{\gamma}_{0}})$ [19]. Denote $\tilde{x}=t\max_{\bm{\gamma}\in\Gamma}\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|\nabla_{\bm{\gamma}}A(\bm{\gamma})|_{\bm{\gamma}=\bm{\gamma}_{0}}+\log 2$. Then there exists $\alpha>1$ such that when $x>x_{0}=\max\{x_{1},1\}$ (where $x_{1}$ is the solution to $tx_{1}-\tilde{x}=(\log x_{1})^{\alpha}$ if it exists, else $x_{1}=0$),

\[
P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Big)\leq e^{-(tx-\tilde{x})}\leq e^{-(\log x)^{\alpha}},
\]

which shows that (7) holds.

Example 2 (Gaussian distributions).

For Gaussian distributions, $\bm{\gamma}=[\mu/\sigma^{2},-1/(2\sigma^{2})]^{\top}$, $T(x)=[x,x^{2}]^{\top}$, $A(\bm{\gamma})=-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})$, and $h(x)=\frac{1}{\sqrt{2\pi}}$, where $\gamma_{1}$ and $\gamma_{2}$ are the elements of $\bm{\gamma}$. We consider the test

\[
H_{0}:\mathcal{N}(0,1),\quad\bm{\gamma}_{0}=[0,-1/2]^{\top};
\]
\[
H_{1}:\mathcal{N}(\mu,\sigma^{2}),\quad\bm{\gamma}=[\mu/\sigma^{2},-1/(2\sigma^{2})]^{\top}\in\Gamma,\ \bm{\gamma}_{0}\notin\Gamma.
\]

We assume that $\Gamma$ is a convex and compact set and $\sigma^{2}>\frac{4\mu^{2}+1}{3\mu+1}$.

For this example, Assumption (i) (i.e., Condition (A1)) holds as we assume that $\Gamma$ is a convex and compact set. Besides, $A(\bm{\gamma})$ is thrice continuously differentiable and

\[
\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})=\begin{bmatrix}-\frac{1}{2\gamma_{2}}&\frac{\gamma_{1}}{2\gamma_{2}^{2}}\\ \frac{\gamma_{1}}{2\gamma_{2}^{2}}&-\frac{\gamma_{1}^{2}}{2\gamma_{2}^{3}}+\frac{1}{2\gamma_{2}^{2}}\end{bmatrix},
\]

which is positive definite. Besides,

\[
\nabla_{\bm{\gamma}}\big((\bm{\gamma}-\bm{\gamma}_{0})^{\top}\nabla_{\bm{\gamma}}^{2}A(\bm{\gamma})\big)=\begin{bmatrix}\frac{1}{4\gamma_{2}^{2}}&-\frac{\gamma_{1}}{2\gamma_{2}^{3}}\\ -\frac{\gamma_{1}}{2\gamma_{2}^{3}}&\frac{3\gamma_{1}}{4\gamma_{2}^{4}}-\frac{1}{2\gamma_{2}^{3}}-\frac{1}{\gamma_{2}^{2}}\end{bmatrix},
\]

which is positive definite when $\sigma^{2}>\frac{4\mu^{2}+1}{3\mu+1}$. Thus, Assumptions (ii) and (iii) hold, which implies that Condition (A2) holds. For Condition (A3), we have

\[
\begin{aligned}
t\max_{\bm{\gamma}\in\Gamma}&\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A(t\mathbf{e}+\bm{\gamma}_{0})-A(\bm{\gamma}_{0})-t|A^{\prime}(\bm{\gamma}_{0})|\\
&=t\max_{\bm{\gamma}\in\Gamma}\bigg|-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})\bigg|-\frac{t^{2}}{4(t-1/2)}-\frac{1}{2}\log(-2(t-1/2))-t.
\end{aligned}
\]

Then, choosing $t=\frac{1}{4}$, we have

\[
\begin{aligned}
\frac{1}{4}\max_{\bm{\gamma}\in\Gamma}&\big|\mathbb{E}_{0}[T(X)]-\mathbb{E}_{\bm{\gamma}}[T(X)]\big|+A\Big(\frac{1}{4}\mathbf{e}+\bm{\gamma}_{0}\Big)-A(\bm{\gamma}_{0})-\frac{1}{4}|A^{\prime}(\bm{\gamma}_{0})|\\
&=\frac{1}{4}\max_{\bm{\gamma}\in\Gamma}\bigg|-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})\bigg|-\frac{3}{16}+\frac{1}{2}\log 2.
\end{aligned}
\]

Denote $\tilde{x}=\frac{1}{4}\max_{\bm{\gamma}\in\Gamma}\Big|-\frac{\gamma_{1}^{2}}{4\gamma_{2}}-\frac{1}{2}\log(-2\gamma_{2})\Big|-\frac{3}{16}+\frac{3}{2}\log 2$. Then there exists $\alpha>1$ such that when $x>x_{0}=\max\{x_{1},1\}$ (where $x_{1}$ is the solution to $\frac{1}{4}x_{1}-\tilde{x}=(\log x_{1})^{\alpha}$ if it exists, else $x_{1}=0$),

\[
P_{0}\Big(\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Big)\leq e^{-(tx-\tilde{x})}\leq e^{-(\log x)^{\alpha}},
\]

which shows that Condition (A3) holds.
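
As a quick numerical sanity check of Example 2, the sketch below (a hedged illustration, not part of the paper) evaluates the two Hessians displayed above at one parameter point and verifies positive definiteness via eigenvalues. The choice $\mu=1$, $\sigma^{2}=2$ satisfies $\sigma^{2}>\frac{4\mu^{2}+1}{3\mu+1}=\frac{5}{4}$.

```python
import numpy as np

def hessians(g1, g2):
    """The two matrices from Example 2 at gamma = (g1, g2), g2 < 0:
    the Hessian of A and the derivative of (gamma - gamma0)^T grad^2 A."""
    H_A = np.array([[-1/(2*g2),           g1/(2*g2**2)],
                    [ g1/(2*g2**2), -g1**2/(2*g2**3) + 1/(2*g2**2)]])
    H_D = np.array([[ 1/(4*g2**2),        -g1/(2*g2**3)],
                    [-g1/(2*g2**3), 3*g1/(4*g2**4) - 1/(2*g2**3) - 1/g2**2]])
    return H_A, H_D

mu, var = 1.0, 2.0                 # sigma^2 = 2 > (4 mu^2 + 1)/(3 mu + 1) = 5/4
g1, g2 = mu/var, -1/(2*var)        # natural parameters
for H in hessians(g1, g2):
    print(np.linalg.eigvalsh(H))   # all eigenvalues positive => positive definite
```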

Our first main result is Theorem 1, which characterizes the set of first-order error exponents under the probabilistic constraints on the stopping time in (4).

Theorem 1.

For fixed $0<\epsilon<1$, if Conditions (A1)–(A3) are satisfied, the set of $\epsilon$-achievable pairs of error exponents is

\[
\mathcal{E}_{\epsilon}(p_{0},\Gamma)=\Big\{(E_{0},E_{1}):E_{0}\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0}),\ E_{1}\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})\Big\}.
\]

Furthermore, the corner point of this set is achieved by an appropriately defined sequence of GSPRTs.

Theorem 1 shows that the $\epsilon$-achievable error exponent region is a rectangle. In addition, Theorem 1 constitutes a strong converse result because the region does not depend on the permissible error probability $0<\epsilon<1$.

III-A Proof of Achievability of Theorem 1

Let $\varepsilon_{0}$ and $\varepsilon_{1}$ be two positive numbers such that $\varepsilon_{0}\in\big(0,\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})\big)$ and $\varepsilon_{1}\in\big(0,\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})\big)$. Let $(\delta_{n},\tau_{n})$ be the GSPRT with the thresholds $A_{n}:=n(\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})-\varepsilon_{0})$ and $B_{n}:=n(\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})-\varepsilon_{1})$. Since Conditions (A1)–(A3) are satisfied, from [13, Theorem 2.1] we have that

\[
\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{P_{0}(S_{\tau_{n}}>A_{n})}\geq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})-\varepsilon_{0}, \tag{8}
\]
\[
\liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{\sup_{\gamma\in\Gamma}P_{\gamma}(S_{\tau_{n}}<-B_{n})}\geq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})-\varepsilon_{1}. \tag{9}
\]

Next we prove that the two probabilistic constraints in (4) are satisfied for the GSPRT $(\delta_{n},\tau_{n})$ with thresholds $A_{n}$ and $B_{n}$. We first introduce the uniform weak law of large numbers (UWLLN) [20, Theorem 6.10].

Lemma 2.

Let $\{X_{j}\}_{j=1}^{\infty}$ be a sequence of i.i.d. random vectors, and let $\gamma\in\Gamma$ be a nonrandom vector lying in a compact subset $\Gamma\subset\mathbb{R}^{d}$. Moreover, let $g(x,\gamma)$ be a Borel-measurable function on $\mathcal{X}\times\Gamma$ such that for each $x$, $g(x,\gamma)$ is continuous on $\Gamma$. Finally, assume that $\mathbb{E}\left[\max_{\gamma\in\Gamma}|g(X_{j},\gamma)|\right]<\infty$. Then for any $\delta>0$,

\[
\lim_{n\to\infty}\mathbb{P}\left(\max_{\gamma\in\Gamma}\bigg|\frac{1}{n}\sum_{j=1}^{n}g(X_{j},\gamma)-\mathbb{E}[g(X,\gamma)]\bigg|\geq\delta\right)=0.
\]

Let $\tau^{\prime}:=\inf\{k:S_{k}<-B_{n}\}$. We observe that $\tau^{\prime}\geq\tau_{n}$, so we have

\[
P_{0}(\tau_{n}>n)\leq P_{0}(\tau^{\prime}>n)\leq P_{0}\left(\max_{\gamma\in\Gamma}S_{n}(\gamma)\geq-B_{n}\right).
\]

Because

\[
\max_{\gamma\in\Gamma}\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}+n\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})
\leq\max_{\gamma\in\Gamma}\left(\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}-n\mathbb{E}_{0}\left[\log\frac{p_{\gamma}(X)}{p_{0}(X)}\right]\right),
\]

and $\max_{x}f(x)-\min_{x}g(x)\leq\max_{x}(f(x)-g(x))\leq\max_{x}|f(x)-g(x)|$, we have

\[
P_{0}\left(\max_{\gamma\in\Gamma}S_{n}(\gamma)\geq-n\Big(\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})-\varepsilon_{1}\Big)\right)
\leq P_{0}\left(\max_{\gamma\in\Gamma}\bigg|\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}-n\mathbb{E}_{0}\left[\log\frac{p_{\gamma}(X)}{p_{0}(X)}\right]\bigg|\geq n\varepsilon_{1}\right).
\]

Then by the UWLLN, for $0<\epsilon<1$, there exists an $n_{0}(\epsilon)$ such that when $n>n_{0}(\epsilon)$,

\[
P_{0}\left(\max_{\gamma\in\Gamma}\bigg|\sum_{i=1}^{n}\log\frac{p_{\gamma}(X_{i})}{p_{0}(X_{i})}-n\mathbb{E}_{0}\left[\log\frac{p_{\gamma}(X)}{p_{0}(X)}\right]\bigg|\geq n\varepsilon_{1}\right)<\epsilon.
\]

Therefore, $P_{0}(\tau_{n}>n)\leq P_{0}(\tau^{\prime}>n)<\epsilon$.

We now prove that $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau_{n}>n)<\epsilon$. Define $\tau^{\prime\prime}:=\inf\{k:\max_{\gamma\in\Gamma}S_{k}(\gamma)>A_{n}\}$. We also have $\tau^{\prime\prime}\geq\tau_{n}$. Then for each $\gamma_{0}\in\Gamma$, we have

\[
\begin{aligned}
P_{\gamma_{0}}(\tau_{n}>n)&\leq P_{\gamma_{0}}(\tau^{\prime\prime}>n)\\
&\leq P_{\gamma_{0}}\Big(\max_{\gamma\in\Gamma}S_{n}(\gamma)\leq A_{n}\Big)\\
&\leq P_{\gamma_{0}}\big(S_{n}(\gamma_{0})\leq n(D(p_{\gamma_{0}}\|p_{0})-\varepsilon_{0})\big)\\
&\leq\frac{\mathrm{Var}(\xi(\gamma_{0}))}{n\varepsilon_{0}^{2}},
\end{aligned}
\]

where the last step follows from Chebyshev's inequality [21]. Then, based on Condition (A3) that $\mathbb{E}[\max_{\gamma}|\xi(\gamma)|^{2}]<\infty$ and the fact that $\varepsilon_{0}$ does not depend on $\gamma_{0}$, there exists an $n_{1}(\epsilon)$ such that when $n>n_{1}(\epsilon)$, $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau_{n}>n)<\epsilon$. We have shown that when $n>\max\{n_{0}(\epsilon),n_{1}(\epsilon)\}$, the two probabilistic constraints (4) are satisfied. Together with (8), (9), and the arbitrariness of $\varepsilon_{0}$ and $\varepsilon_{1}$, this shows that any exponent pair $(E_{0},E_{1})$ such that $E_{0}\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})$ and $E_{1}\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})$ is in $\mathcal{E}_{\epsilon}(p_{0},\Gamma)$.

III-B Proof of Strong Converse of Theorem 1

The following lemma is taken from Li and Tan [16].

Lemma 3.

Let $(\delta,\tau)$ be a sequential hypothesis test such that $P_{0}(\tau<\infty)=1$ and $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau<\infty)=1$. Then for any event $F\in\mathcal{F}_{\tau}$, $\lambda>0$, and each $\gamma_{0}\in\Gamma$, we have

\[
\begin{aligned}
P_{0}(F)-\lambda P_{\gamma_{0}}(F)&\leq P_{0}(S_{\tau}(\gamma_{0})\leq-\log\lambda),\\
P_{\gamma_{0}}(F)-\frac{1}{\lambda}P_{0}(F)&\leq P_{\gamma_{0}}(S_{\tau}(\gamma_{0})\geq-\log\lambda).
\end{aligned}
\]

Then we use Lemma 3 to prove the converse part. Let $(E_{0},E_{1})\in\mathcal{E}_{\epsilon}(p_{0},\Gamma)$ be such that $\min\{E_{0},E_{1}\}>0$. Without loss of generality and by passing to a subsequence if necessary, we assume that there exists a sequence of sequential hypothesis tests $\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}$ such that $P_{0}(\tau_{n}<\infty)=1$ and $\sup_{\gamma\in\Gamma}P_{\gamma}(\tau_{n}<\infty)=1$ and

\[
E_{0}=\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})},\qquad
E_{1}=\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}. \tag{10}
\]

Let $Z_{i}(\tau_{n})=\{\delta_{n}=i\}$ for $i=0,1$. Then $\mathsf{P}_{1|0}(\delta_{n},\tau_{n})=P_{0}(Z_{1}(\tau_{n}))$ and $\mathsf{P}_{0|1}(\delta_{n},\tau_{n})=\sup_{\gamma\in\Gamma}P_{\gamma}(Z_{0}(\tau_{n}))$. Using Lemma 3 with the event $F=Z_{0}(\tau_{n})$, for each $\gamma_{0}\in\Gamma$ we have that

\[
\begin{aligned}
1-P_{0}(Z_{1}(\tau_{n}))-\lambda P_{\gamma_{0}}(Z_{0}(\tau_{n}))
&\leq P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda)\\
&\leq P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda,\,\tau_{n}\leq n)+P_{0}(\tau_{n}>n),
\end{aligned}
\]

which further implies that

\[
\log P_{\gamma_{0}}(Z_{0}(\tau_{n}))\geq\log\bigg[\frac{1}{\lambda}\Big(1-P_{0}(Z_{1}(\tau_{n}))-P_{0}(\tau_{n}>n)-P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda,\,\tau_{n}\leq n)\Big)\bigg]. \tag{11}
\]

Similarly, for each $\gamma_{0}\in\Gamma$, we have that

\[
1-P_{\gamma_{0}}(Z_{0}(\tau_{n}))-\frac{1}{\lambda}P_{0}(Z_{1}(\tau_{n}))
\leq P_{\gamma_{0}}(S_{\tau_{n}}(\gamma_{0})\geq-\log\lambda,\,\tau_{n}\leq n)+P_{\gamma_{0}}(\tau_{n}>n),
\]

and when we set $F=Z_{1}(\tau_{n})$, we have

\[
\log P_{0}(Z_{1}(\tau_{n}))\geq\log\Big[\lambda\Big(1-P_{\gamma_{0}}(Z_{0}(\tau_{n}))-P_{\gamma_{0}}(\tau_{n}>n)-P_{\gamma_{0}}(S_{\tau_{n}}(\gamma_{0})\geq-\log\lambda,\,\tau_{n}\leq n)\Big)\Big]. \tag{12}
\]

Let $\delta$ be an arbitrary positive number and let $\log\lambda=n\left(D(p_{0}\|p_{\gamma_{0}})+\delta\right)$. We first bound the term

\[
P_{0}(S_{\tau_{n}}(\gamma_{0})\leq-\log\lambda,\,\tau_{n}\leq n)
\leq P_{0}\left(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}\geq n(D(p_{0}\|p_{\gamma_{0}})+\delta)\right).
\]

We note that $\big\{\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}-D(p_{0}\|p_{\gamma_{0}})\big\}_{i=1}^{n}$ is an i.i.d. sequence. Besides, we have that $\mathbb{E}_{0}\left[-\xi(\gamma_{0})-D(p_{0}\|p_{\gamma_{0}})\right]=0$ and $\mathrm{Var}\left(\xi(\gamma_{0})\right)$ is finite based on Condition (A3). Then, based on Kolmogorov's maximal inequality [21, Theorem 2.5.5], we have that

\[
P_{0}\bigg(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}\geq n(D(p_{0}\|p_{\gamma_{0}})+\delta)\bigg)\leq\frac{\mathrm{Var}\left(\xi(\gamma_{0})\right)}{n\delta^{2}}. \tag{13}
\]

Note that here we use Kolmogorov's maximal inequality, which only requires that the second moment of the log-likelihood ratio is finite; this is a weaker condition than assuming that the third absolute moment of the log-likelihood ratio is finite, as in [16]. Then we have that

\[
\lim_{n\to\infty}\sup_{\gamma_{0}\in\Gamma}P_{0}\Bigg(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma_{0}}(X_{i})}\geq\log\lambda\Bigg)=0. \tag{14}
\]

When we set $-\log\lambda=n(D(p_{\gamma_{0}}\|p_{0})+\delta)$ in (12), using similar arguments as in the derivation of (14), we obtain

\[
\lim_{n\to\infty}\sup_{\gamma_{0}\in\Gamma}P_{\gamma_{0}}\bigg(\max_{1\leq k\leq n}S_{k}(\gamma_{0})\geq-\log\lambda\bigg)=0.
\]

From (11) and the fact that $P_{0}(\tau_{n}>n)<\epsilon$, we have that

\[
\begin{aligned}
-\frac{1}{n}\sup_{\gamma\in\Gamma}\log P_{\gamma}(Z_{0}(\tau_{n}))
&\leq\min_{\gamma\in\Gamma}(D(p_{0}\|p_{\gamma})+\delta)-\frac{1}{n}\log\Bigg(1-\mathsf{P}_{1|0}(\delta_{n},\tau_{n})-\epsilon\\
&\qquad-\sup_{\gamma\in\Gamma}P_{0}\bigg(\max_{1\leq k\leq n}\sum_{i=1}^{k}\log\frac{p_{0}(X_{i})}{p_{\gamma}(X_{i})}\geq\log\lambda\bigg)\Bigg).
\end{aligned}
\]

From (10) it follows that $\lim_{n\to\infty}\mathsf{P}_{1|0}(\delta_{n},\tau_{n})=0$, which together with (14) implies that

\[
E_{1}=\lim_{n\to\infty}-\frac{1}{n}\log\mathsf{P}_{0|1}(\delta_{n},\tau_{n})\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})+\delta.
\]

Similarly, we also obtain

\[
E_{0}=\lim_{n\to\infty}-\frac{1}{n}\log\mathsf{P}_{1|0}(\delta_{n},\tau_{n})\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})+\delta.
\]

Due to the arbitrariness of $\delta$, letting $\delta\to 0^{+}$, we have that $E_{0}\leq\min_{\gamma\in\Gamma}D(p_{\gamma}\|p_{0})$ and $E_{1}\leq\min_{\gamma\in\Gamma}D(p_{0}\|p_{\gamma})$, completing the proof of the strong converse as desired.

IV Second-order Asymptotics

In the previous section, we considered the (first-order) error exponents of the sequential composite hypothesis testing problem under probabilistic constraints. While the result (Theorem 1) is conclusive, there is often substantial motivation [15] to consider higher-order asymptotics due to finite-length considerations. To wit, the length bound $n$ in the probabilistic constraint might be small, and thus the exponents derived in the previous section will be overly optimistic. In this section, we quantify the backoff from the optimal first-order exponents by examining the second-order asymptotics. To do so, we make a set of somewhat more stringent assumptions on the distributions and the uncertainty set $\Gamma$. We first assume that the alphabet of the observations is the finite set $\mathcal{X}=\{1,2,\dots,d\}$. Let $\mathcal{P}_{\mathcal{X}}$ be the set of probability mass functions on the alphabet $\mathcal{X}$. In other words, $\mathcal{P}_{\mathcal{X}}$ is the probability simplex given by

\[
\mathcal{P}_{\mathcal{X}}:=\bigg\{(q(1),q(2),\ldots,q(d)):\sum_{i=1}^{d}q(i)=1,\ q(i)\geq 0,\ \forall\,i\in\mathcal{X}\bigg\}.
\]

Similarly, define the open probability simplex

\[
\mathcal{P}_{\mathcal{X}}^{+}:=\bigg\{(q(1),q(2),\ldots,q(d)):\sum_{i=1}^{d}q(i)=1,\ q(i)>0,\ \forall\,i\in\mathcal{X}\bigg\}.
\]

Under hypothesis $H_{0}$, the underlying probability mass function is given by $\{p_{0}(i)\}_{i=1}^{d}$ and we assume that $p_{0}(i)>0$ for all $i\in\mathcal{X}$. Under hypothesis $H_{1}$, the underlying probability mass function belongs to the set $\Gamma\subset\mathcal{P}_{\mathcal{X}}$. For any $\tilde{q}\in\mathcal{P}_{\mathcal{X}}$ and positive constant $\eta$, let $\mathcal{B}(\tilde{q},\eta):=\{q\in\mathcal{P}_{\mathcal{X}}:|q(i)-\tilde{q}(i)|<\eta,\ \forall\,i\in\mathcal{X}\}$ be the open $\eta$-neighborhood of the point $\tilde{q}$. Let $\bm{\gamma}^{\prime}$ be such that $D(p_{0}\|p_{\bm{\gamma}^{\prime}})=\min_{\bm{\gamma}\in\Gamma}D(p_{0}\|p_{\bm{\gamma}})$. See Fig. 1 for an illustration of this projection.

IV-A Other Assumptions and Preliminary Results

We assume that $\Gamma$, which contains distributions supported on $\mathcal{X}$, satisfies the following conditions:

  1. (A1’)

    The set $\Gamma$ is equal to $\{\bm{\gamma}=\{\gamma_{i}\}_{i=1}^{d}\in\mathcal{P}_{\mathcal{X}}:F(\bm{\gamma})\leq 0\}$ for some piece-wise smooth convex function $F:\mathcal{P}_{\mathcal{X}}\to\mathbb{R}$.

  2. (A2’)

    There exists a fixed constant $c_{0}>0$ such that $\min_{i\in\mathcal{X}}\gamma_{i}\geq c_{0}$ for all $\bm{\gamma}\in\Gamma$.

  3. (A3’)

    The function $F$ is smooth (infinitely differentiable) on $\mathcal{B}(\bm{\gamma}^{\prime},\eta)$ for some $\eta>0$.

The key tool used in the derivation of the second-order terms is a central limit-type result for $\max_{\bm{\gamma}\in\Gamma}\sum_{k=1}^{n}\log\frac{p_{\bm{\gamma}}(X_{k})}{p_{0}(X_{k})}$, the maximum of the log-likelihood ratios of the observations over $\Gamma$. To simplify this quantity, we define the empirical distribution or type [22, Chapter 11] of $X^{n}$ as $Q(i;X^{n})=\sum_{k=1}^{n}\mathbbm{1}\{X_{k}=i\}/n$ for $i=1,2,\dots,d$. In the following, for the sake of brevity, we often suppress the dependence on the sequence $X^{n}$ and write $Q(i)$ in place of $Q(i;X^{n})$, but we note that $Q$ is a random distribution induced by the observations $X^{n}$. Since $\mathcal{X}$ is a finite set, we have

\[
S_{n}=\max_{\bm{\gamma}\in\Gamma}\sum_{k=1}^{n}\log\frac{p_{\bm{\gamma}}(X_{k})}{p_{0}(X_{k})}
=\max_{\bm{\gamma}\in\Gamma}\sum_{i=1}^{d}\sum_{k=1}^{n}\mathbbm{1}\{X_{k}=i\}\log\frac{\gamma_{i}}{p_{0}(i)}
=n\max_{\bm{\gamma}\in\Gamma}\sum_{i=1}^{d}Q(i)\log\frac{\gamma_{i}}{p_{0}(i)}. \tag{15}
\]

The key to obtaining the central limit-type result for the sequence of random variables $\{S_{n}/\sqrt{n}\}_{n\in\mathbb{N}}$ is to solve the optimization problem in (15), or more precisely, to understand the properties of the optimizer of (15). Now we study the following optimization problem for a generic $q\in\mathcal{P}_{\mathcal{X}}$:

\[
\begin{aligned}
\min_{\bm{\gamma}}\;\;&\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{\gamma_{i}}\\
\mathrm{s.t.}\;\;&\sum_{i=1}^{d}\gamma_{i}=1,\\
&F(\bm{\gamma})\leq 0.
\end{aligned} \tag{16}
\]

Let $\tilde{\bm{\gamma}}$ be an optimizer of the optimization problem (16). The properties of $\tilde{\bm{\gamma}}$ are provided in the following three lemmas.

Lemma 4.

If $q\in\mathcal{P}^{+}_{\mathcal{X}}$ and $q\not\in\Gamma$, then the optimizer $\tilde{\bm{\gamma}}$ is unique.

The existence and uniqueness of the optimizer of the optimization problem (16) follow from the strict convexity of the function $\bm{\gamma}\mapsto\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{\gamma_{i}}$ on the compact convex (uncertainty) set $\Gamma$.

As the optimizer $\tilde{\bm{\gamma}}$ is unique, we can define the function

\[
\mathbf{g}(q)=(g_{1}(q),\ldots,g_{d}(q)):=\tilde{\bm{\gamma}}.
\]

For the sake of convenience in what follows, define

\[
f(q):=\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q)}. \tag{17}
\]
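
For a given $q$, the map $\mathbf{g}(q)$ and the value $f(q)$ can be computed with an off-the-shelf convex solver. The sketch below is a minimal illustration that solves (16) for a hypothetical single linear constraint $F(\bm{\gamma})=\mathbf{w}^{\top}\bm{\gamma}-\xi\leq 0$; the vector $\mathbf{w}$, the level $\xi$, and the helper name `g_and_f` are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def g_and_f(q, p0, w, xi):
    """Solve (16) for F(gamma) = w . gamma - xi <= 0, returning the
    optimizer g(q) and the objective value f(q) from (17)."""
    d = len(q)
    obj = lambda g: np.sum(q * (np.log(p0) - np.log(g)))     # sum_i q(i) log(p0(i)/gamma_i)
    cons = [{'type': 'eq',   'fun': lambda g: g.sum() - 1},  # gamma lies on the simplex
            {'type': 'ineq', 'fun': lambda g: xi - w @ g}]   # F(gamma) <= 0
    res = minimize(obj, x0=np.full(d, 1/d), method='SLSQP',  # SLSQP tolerates an infeasible start
                   bounds=[(1e-9, 1)]*d, constraints=cons)
    return res.x, res.fun

# Toy usage on a ternary alphabet: g(p0) = gamma' and f(p0) = D(p0 || p_gamma').
p0 = np.array([0.5, 0.3, 0.2])
w, xi = np.array([1.0, -1.0, 0.0]), -0.1  # hypothetical halfspace; F(p0) = 0.2 > xi, so p0 lies outside Gamma
print(g_and_f(p0, p0, w, xi))
```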

Some key properties of $\mathbf{g}(q)$ are provided in Lemma 10 and Lemma 11 in the Appendix. By the definition of $\bm{\gamma}^{\prime}$, it follows that $\mathbf{g}(p_{0})=\bm{\gamma}^{\prime}$. Without loss of generality, we assume

\[
\frac{\partial F(\bm{\gamma}^{\prime})}{\partial\gamma_{1}}-\sum_{i=1}^{d}\gamma^{\prime}_{i}\frac{\partial F(\bm{\gamma}^{\prime})}{\partial\gamma_{i}}\neq 0.
\]

Then there exists $0<\bar{\eta}<\eta$ such that for $q\in\mathcal{B}(p_{0},\bar{\eta})$, the following holds:

\[
\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\tilde{\gamma}_{1}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\tilde{\gamma}_{i}}\neq 0.
\]

Then for $q\in\mathcal{B}(p_{0},\bar{\eta})$, the Jacobian of $(q(2),\ldots,q(d))$ with respect to $(\tilde{\gamma}_{2},\ldots,\tilde{\gamma}_{d})$ is

\[
\mathbf{J}(q)=\begin{bmatrix}\frac{\partial q(2)}{\partial\tilde{\gamma}_{2}}&\frac{\partial q(2)}{\partial\tilde{\gamma}_{3}}&\cdots&\frac{\partial q(2)}{\partial\tilde{\gamma}_{d}}\\ \frac{\partial q(3)}{\partial\tilde{\gamma}_{2}}&\frac{\partial q(3)}{\partial\tilde{\gamma}_{3}}&\cdots&\frac{\partial q(3)}{\partial\tilde{\gamma}_{d}}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial q(d)}{\partial\tilde{\gamma}_{2}}&\frac{\partial q(d)}{\partial\tilde{\gamma}_{3}}&\cdots&\frac{\partial q(d)}{\partial\tilde{\gamma}_{d}}\end{bmatrix}\in\mathbb{R}^{(d-1)\times(d-1)}.
\]

We now introduce the following regularity condition on the function $F$ at the point $p_{0}$:

  • (A4’)

    The Jacobian matrix $\mathbf{J}(p_{0})$ is of full rank (i.e., $\mathrm{rank}(\mathbf{J}(p_{0}))=d-1$).

One may wonder whether the new assumptions we have stated are overly restrictive. In fact, they are not, and there exist interesting families of uncertainty sets that satisfy Assumptions (A1')–(A4'). A canonical example of an uncertainty set $\Gamma$ that satisfies these conditions is one in which $F$ is piece-wise linear on the set $\mathcal{P}_{\mathcal{X}}^{+}$. Thus, $\Gamma$ is similar to a linear family [23], an important class of statistical models.

Example 3.

Let $\{F_{k}\}_{k=1}^{l}$ be a set of $l$ linear functions with domain $\mathbb{R}^{d}$ and let $\{\xi_{k}\}_{k=1}^{l}$ be a set of $l$ real numbers. Let $\Gamma=\bigcap_{k=1}^{l}\{(y_{1},\ldots,y_{d}):F_{k}(y_{1},\ldots,y_{d})\leq\xi_{k}\}$. Assume $\{F_{k}\}_{k=1}^{l}$ and $\{\xi_{k}\}_{k=1}^{l}$ satisfy the following three conditions:

  • The set $\Gamma\subset\mathcal{P}_{\mathcal{X}}^{+}$ and $F_{k}(p_{0})>\xi_{k}$ for some $k$;

  • The minimizer $\bm{\gamma}^{\prime}=\operatorname*{arg\,min}_{\bm{\gamma}\in\Gamma}D(p_{0}\|p_{\bm{\gamma}})$ is such that $F_{1}(\bm{\gamma}^{\prime})=\xi_{1}$ and $F_{k}(\bm{\gamma}^{\prime})<\xi_{k}$ for $k\neq 1$;

  • Let $F_{1}(y_{1},\ldots,y_{d})=\sum_{i=1}^{d}w_{i}y_{i}$ for some real coefficients $w_{1},\ldots,w_{d}$. One of the coefficients of $F_{1}$, i.e., one of the numbers in the set $\{w_{i}\}_{i=1}^{d}$, is not equal to $\xi_{1}$.

Intuitively, $\Gamma$, defined as the intersection of halfspaces (linear inequality constraints), is a polyhedron contained in the relative interior of $\mathcal{P}_{\mathcal{X}}$. Fig. 1 provides an illustration for the ternary case $\mathcal{X}=\{1,2,3\}$.

[Figure 1: the probability simplex $\mathcal{P}_{\mathcal{X}}$ with vertices $(1,0,0)$, $(0,1,0)$, $(0,0,1)$; the polyhedron $\Gamma$ cut out by the halfspaces $F_{1},F_{2},F_{3}$; the point $p_{0}$ together with the projections $\bm{\gamma}^{\prime}$ (achieving $D(p_{0}\|p_{\bm{\gamma}^{\prime}})$) and $\bm{\gamma}^{*}$ (achieving $D(p_{\bm{\gamma}^{*}}\|p_{0})$).]
Figure 1: The set $\Gamma$ formed by the intersection of three halfspaces defined by $F_{1},F_{2}$, and $F_{3}$. See Example 3.
Proposition 5.

The set $\Gamma$ described in Example 3 satisfies Conditions (A1')–(A4').

The proof of Proposition 5 is provided in Appendix F.

Now we are ready to state the promised central limit-type result for $S_{n}$, defined in (15). Define the relative entropy variance [15]

\[
V(p\|q):=\mathrm{Var}_{p}\left[\log\frac{p(X)}{q(X)}\right]
\]

and the Gaussian cumulative distribution function $\Phi(y):=\int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}}e^{-u^{2}/2}\,\mathrm{d}u$. Then we have

Proposition 6.

Under Conditions (A1’)–(A4’), if {Xi}i=1\{X_{i}\}_{i=1}^{\infty} is a sequence of i.i.d. random variables with P(X1=i)=p0(i)P(X_{1}=i)=p_{0}(i) for all i𝒳i\in\mathcal{X}, then {Sn}n=1\{S_{n}\}_{n=1}^{\infty}, defined in (IV-A), satisfies

\[
\sqrt{n}\bigg(\frac{S_{n}}{n}+D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg)\stackrel{\mathrm{d}}{\longrightarrow}\mathcal{N}\big(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})\big).
\]

The proof of Proposition 6 can be found in Appendix G.
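
As a hedged empirical check of Proposition 6, one can simulate the statistic and compare its sample mean and variance with $0$ and $V(p_{0}\|p_{\bm{\gamma}^{\prime}})$. The sketch below reuses the illustrative helper `g_and_f` (and its hypothetical halfspace $\mathbf{w}$, $\xi$) from the sketch after (17), together with the identity $S_{n}=-nf(Q)$ implied by (15) and (17).

```python
import numpy as np

rng = np.random.default_rng(1)
p0 = np.array([0.5, 0.3, 0.2])
w, xi = np.array([1.0, -1.0, 0.0]), -0.1

gamma_p, D = g_and_f(p0, p0, w, xi)        # gamma' = g(p0) and D = D(p0||p_gamma')
V = p0 @ np.log(p0 / gamma_p)**2 - D**2    # V(p0||p_gamma')

n, trials, stats = 2000, 500, []
for _ in range(trials):
    x = rng.choice(3, size=n, p=p0)        # i.i.d. observations under H0
    Q = np.bincount(x, minlength=3) / n    # empirical type of X^n
    f_Q = g_and_f(Q, p0, w, xi)[1]         # f(Q), so S_n = -n f(Q) by (15) and (17)
    stats.append(np.sqrt(n) * (D - f_Q))   # sqrt(n) (S_n/n + D(p0||p_gamma'))
print(np.mean(stats), np.var(stats), V)    # sample mean ~ 0, sample variance ~ V
```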

A major result in the statistics literature that bears some resemblance to Proposition 6 is Wilks' theorem (see [17, Chapter 16] for example). For the case in which the null hypothesis is simple (Wilks' theorem also applies to the case in which both the null and alternative hypotheses are composite, but we are only concerned with the simpler setting here), Wilks' theorem states that if the sequence of random variables $\{X_{i}\}_{i=1}^{\infty}$ is independently drawn from $p_{0}$ (the distribution of the null hypothesis), then (two times) the log-likelihood ratio statistic satisfies

\[
2\max_{\bm{\gamma}\in\Gamma\cup\{p_{0}\}}\sum_{k=1}^{n}\log\frac{p_{\bm{\gamma}}(X_{k})}{p_{0}(X_{k})}\stackrel{\mathrm{d}}{\longrightarrow}\chi^{2}_{d-1},
\]

where $\chi_{d-1}^{2}$ is the chi-squared distribution with $d-1$ degrees of freedom. This result differs from Proposition 6 because in $S_{n}$ the maximization is taken over $\Gamma$, whereas in Wilks' theorem it is taken over $\Gamma\cup\{p_{0}\}$. This results in different normalizations in the statements on convergence in distribution; in Proposition 6, $S_{n}$ is normalized by $\sqrt{n}$, but there is no normalization of the log-likelihood ratio statistic in Wilks' theorem. This is because, for the former (our result), the dominant term is the first-order term in the Taylor expansion, whereas in the latter (Wilks' theorem), the dominant term is the second-order term.

Proposition 7.

Conditions (A1’)–(A4’) imply Conditions (A1)–(A3) in Section III.

The proof of Proposition 7 is provided in Appendix H. Thus, we see that the assumptions used to derive the first-order results are less restrictive than those for the second-order result that we are going to state in the next subsection.

IV-B Definition and Main Results

We say that a second-order exponent pair $(G_{0},G_{1})$ is $\epsilon$-achievable under the probabilistic constraints if there exists a sequence of sequential hypothesis tests $\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}$ that satisfies the probabilistic constraints on the stopping time in (4) and

\[
\begin{aligned}
G_{0}&\leq\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\bigg(\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})}-nD(p_{\bm{\gamma}^{*}}\|p_{0})\bigg),\\
G_{1}&\leq\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\bigg(\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}-nD(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg),
\end{aligned}
\]

where $\bm{\gamma}^{*}=\operatorname*{arg\,min}_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}}\|p_{0})$, which is unique (see Proposition 7, which implies that Condition (A2) is satisfied). The set of all $\epsilon$-achievable second-order exponent pairs $(G_{0},G_{1})$ is denoted as $\mathcal{G}_{\epsilon}(p_{0},\Gamma)$, the second-order error exponent region. The set of second-order error exponents $\mathcal{G}_{\epsilon}(p_{0},\Gamma)$ is stated in the following theorem.

Theorem 8.

If Conditions (A1’)–(A4’) are satisfied, for any 0<ϵ<10<\epsilon<1, the second-order error exponent region is

\[
\mathcal{G}_{\epsilon}(p_{0},\Gamma)=\left\{(G_{0},G_{1})\in\mathbb{R}^{2}:\begin{array}{l}G_{0}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})}\\ G_{1}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\end{array}\right\}.
\]

Furthermore, the boundary of this set is achieved by an appropriately defined sequence of GSPRTs.

This theorem states that the backoffs from the (first-order) error exponents are of order Θ(1/n)\Theta(1/\sqrt{n}) and the implied constants are given by Φ1(ϵ)V(p𝜸p0)\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})} and Φ1(ϵ)V(p0p𝜸)\Phi^{-1}(\epsilon)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}. Thus, we have stated a set of sufficient conditions on the distributions and the uncertainty set Γ\Gamma (namely (A1’)–(A4’)) under which the second-order terms are analogous to those for simple sequential hypothesis testing under the probabilistic constraints derived by Li and Tan [16]. However, the techniques used to derive Theorem 8 are more involved than those in [16], because we have to derive the asymptotic distribution of the maximum of an uncountable set of log-likelihood ratio terms (cf. Proposition 6). This constitutes our main contribution in this part of the paper.
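In other words, along a sequence of GSPRTs attaining the boundary of the region, the type-I error probability satisfies

\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})}=nD(p_{\bm{\gamma}^{*}}\|p_{0})+\sqrt{n}\,\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})}+o(\sqrt{n}),

and analogously for \mathsf{P}_{0|1} with D(p_{0}\|p_{\bm{\gamma}^{\prime}}) and V(p_{0}\|p_{\bm{\gamma}^{\prime}}). Since \Phi^{-1}(\epsilon)<0 for \epsilon<1/2, the second-order term is a penalty on the exponent in the practically relevant regime of small \epsilon.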

IV-C Proof of the Achievability Part of Theorem 8

The proof of achievability consists of two parts. We first prove the desired upper bound on type-I error probability and the maximal type-II error probability under an appropriately defined sequence of GSPRTs. Then we prove that the probabilistic constraints are satisfied.

We start with the proof of the first part. Fix η0,η1(0,ϵ)\eta_{0},\eta_{1}\in(0,\epsilon) and let (δn,τn)(\delta_{n},\tau_{n}) be the GSPRT with thresholds

An:=nmin𝜸Γ(D(p𝜸p0)+Φ1(ϵη0)V(p𝜸p0)n),A_{n}:=n\min_{\bm{\gamma}\in\Gamma}\bigg{(}D(p_{\bm{\gamma}}\|p_{0})+\Phi^{-1}(\epsilon-\eta_{0})\sqrt{\frac{V(p_{\bm{\gamma}}\|p_{0})}{n}}\bigg{)},

and

Bn:=nmin𝜸Γ(D(p0p𝜸)+Φ1(ϵη1)V(p0p𝜸)n).B_{n}:=n\min_{\bm{\gamma}\in\Gamma}\bigg{(}D(p_{0}\|p_{\bm{\gamma}})+\Phi^{-1}(\epsilon-\eta_{1})\sqrt{\frac{V(p_{0}\|p_{\bm{\gamma}})}{n}}\bigg{)}.
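For concreteness, the thresholds A_n and B_n can be evaluated numerically. Below is a minimal Python sketch under an assumed binary-alphabet instance (these parameters are illustrative and not from the paper); D and V are computed from their definitions as the relative entropy and the variance of the log-likelihood ratio, and the minimization over \Gamma is approximated on a grid:

```python
import numpy as np
from scipy.stats import norm

def D(p, q):  # relative entropy D(p || q)
    return float(np.sum(p * np.log(p / q)))

def V(p, q):  # variance of log(p(X)/q(X)) under p
    return float(np.sum(p * np.log(p / q) ** 2) - D(p, q) ** 2)

# Hypothetical instance: p0 = Bern(0.5), Gamma = {Bern(g): g in [0.6, 0.8]}.
p0 = np.array([0.5, 0.5])
gammas = [np.array([g, 1 - g]) for g in np.linspace(0.6, 0.8, 201)]

def thresholds(n, eps, eta0, eta1):
    A_n = n * min(D(g, p0) + norm.ppf(eps - eta0) * np.sqrt(V(g, p0) / n)
                  for g in gammas)
    B_n = n * min(D(p0, g) + norm.ppf(eps - eta1) * np.sqrt(V(p0, g) / n)
                  for g in gammas)
    return A_n, B_n

print(thresholds(n=1000, eps=0.1, eta0=0.05, eta1=0.05))
```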

Based on Proposition 7, we know that (A1)–(A3) are satisfied. Hence, from [13, Theorem 2.1] we have that

P0(Sτn>An)eAnandsup𝜸ΓP𝜸(Sτn<Bn)eBn.\displaystyle P_{0}(S_{\tau_{n}}>A_{n})\leq e^{-A_{n}}\quad\mbox{and}\quad\sup_{\bm{\gamma}\in\Gamma}P_{{\bm{\gamma}}}(S_{\tau_{n}}<-B_{n})\leq e^{-B_{n}}.

To simplify AnA_{n} and BnB_{n}, we introduce an approximation lemma from [24, Lemma 48].

Lemma 9.

Let Γ\Gamma be a compact metric space. Suppose h:Γh:\Gamma\to\mathbb{R} and k:Γk:\Gamma\to\mathbb{R} are continuous, then we have

max𝜸Γ[nh(𝜸)+nk(𝜸)]=nh+nk+o(n),\displaystyle\max_{\bm{\gamma}\in\Gamma}[nh(\bm{\gamma})+\sqrt{n}k(\bm{\gamma})]=nh^{*}+\sqrt{n}k^{*}+o(\sqrt{n}),

where h:=max𝛄Γh(𝛄)h^{*}:=\max_{\bm{\gamma}\in\Gamma}h(\bm{\gamma}) and k:=sup𝛄:h(𝛄)=hk(𝛄)k^{*}:=\sup_{\bm{\gamma}:h(\bm{\gamma})=h^{*}}k(\bm{\gamma}).
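Lemma 9 can be illustrated numerically; the sketch below (with hypothetical h and k on \Gamma=[0,1]) shows that the gap between the exact maximum and nh^{*}+\sqrt{n}k^{*} is o(\sqrt{n}):

```python
import numpy as np

# Toy check of Lemma 9 on Gamma = [0, 1] with hypothetical h and k.
# h attains its maximum h* = 0 uniquely at g* = 0.3, so k* = k(0.3) = cos(0.9).
h = lambda g: -(g - 0.3) ** 2
k = lambda g: np.cos(3 * g)
grid = np.linspace(0.0, 1.0, 100_001)

for n in [10**2, 10**4, 10**6]:
    exact = np.max(n * h(grid) + np.sqrt(n) * k(grid))
    approx = np.sqrt(n) * np.cos(0.9)  # n*h* + sqrt(n)*k* with h* = 0
    print(n, (exact - approx) / np.sqrt(n))  # tends to 0 as n grows
```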

Here we take h(𝜸)=D(p0p𝜸)h(\bm{\gamma})=-D(p_{0}\|p_{\bm{\gamma}}) and k(𝜸)=Φ1(ϵη1)V(p0p𝜸)k(\bm{\gamma})=-\Phi^{-1}(\epsilon-\eta_{1})\sqrt{V(p_{0}\|p_{\bm{\gamma}})}. Based on Lemma 9 and the fact that 𝜸D(p0p𝜸)\bm{\gamma}\mapsto D(p_{0}\|p_{\bm{\gamma}}) has a unique minimizer γ\gamma^{\prime} (see Assumption (A2) which is implied by Proposition 7), we have

min𝜸Γ(nD(p0p𝜸)+nV(p0p𝜸)Φ1(ϵη1))\displaystyle\min_{\bm{\gamma}\in\Gamma}\left(nD(p_{0}\|p_{\bm{\gamma}})+\sqrt{{nV(p_{0}\|p_{\bm{\gamma}})}}\Phi^{-1}(\epsilon-\eta_{1})\right)
=nD(p0p𝜸)+Φ1(ϵη1)nV(p0p𝜸)+o(n).\displaystyle=nD(p_{0}\|p_{\bm{\gamma}^{\prime}})+{\Phi^{-1}(\epsilon-\eta_{1})}\sqrt{nV(p_{0}\|p_{\bm{\gamma}^{\prime}})}+o(\sqrt{n}). (18)

Similarly, we have

min𝜸Γ(nD(p𝜸p0)+nV(p𝜸p0)Φ1(ϵη0))\displaystyle\min_{\bm{\gamma}\in\Gamma}\left(nD(p_{\bm{\gamma}}\|p_{0})+\sqrt{{nV(p_{\bm{\gamma}}\|p_{0})}}\Phi^{-1}(\epsilon-\eta_{0})\right)
=nD(p𝜸p0)+Φ1(ϵη0)nV(p𝜸p0)+o(n).\displaystyle=nD(p_{\bm{\gamma}^{*}}\|p_{0})+{\Phi^{-1}(\epsilon-\eta_{0})}\sqrt{nV(p_{\bm{\gamma}^{*}}\|p_{0})}+o(\sqrt{n}). (19)

Thus, based on (18) and (19), the arbitrariness of η0\eta_{0} and η1\eta_{1}, and the continuity of Φ1\Phi^{-1}, we obtain

lim infn1n(log\displaystyle\!\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log 1P0(Sτn>An)nD(p𝜸p0))\displaystyle\frac{1}{P_{0}(S_{\tau_{n}}>A_{n})}-{n}D(p_{\bm{\gamma}^{*}}\|p_{0})\Big{)}
Φ1(ϵ)V(p𝜸p0),\displaystyle\geq\Phi^{-1}(\epsilon)\sqrt{{V(p_{\bm{\gamma}^{*}}\|p_{0})}}, (20)

and

lim infn1n(log\displaystyle\liminf_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log 1supγΓPγ(Sτn<Bn)nD(p0p𝜸))\displaystyle\frac{1}{\sup_{\gamma\in\Gamma}P_{{\gamma}}(S_{\tau_{n}}<-B_{n})}-nD(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(ϵ)V(p0p𝜸).\displaystyle\geq\Phi^{-1}(\epsilon)\sqrt{{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}}. (21)

Next we prove that the probabilistic constraints are satisfied for the sequence of GSPRTs {(δn,τn)}n=1\{(\delta_{n},\tau_{n})\}_{n=1}^{\infty}. Let τ:=inf{k:max𝜸ΓSk(𝜸)<Bn}\tau^{\prime}:=\inf\{k:\max_{\bm{\gamma}\in\Gamma}S_{k}(\bm{\gamma})<-B_{n}\}. We observe that ττn\tau^{\prime}\geq\tau_{n} with probability 1. Thus, we have

P0(τn>n)\displaystyle P_{0}(\tau_{n}>n)
P0(τ>n)\displaystyle\leq P_{0}(\tau^{\prime}>n)
P0(max𝜸ΓSn(𝜸)Bn)\displaystyle\leq P_{0}\left(\max_{\bm{\gamma}\in\Gamma}S_{n}(\bm{\gamma})\geq-B_{n}\right)
=P0(min𝜸Γni=1dQ(i)logp0(i)γiBn)\displaystyle=P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}n\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}\leq B_{n}\bigg{)}
P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))\displaystyle\leq P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\Big{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(ϵη1)V(p0p𝜸))\displaystyle\quad\leq{\Phi^{-1}(\epsilon-\eta_{1})}\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}
ϵη1\displaystyle\to\epsilon-\eta_{1} (22)
<ϵ,\displaystyle<\epsilon, (23)

where (22) is from Proposition 6. Hence, P0(τn>n)<ϵP_{0}(\tau_{n}>n)<\epsilon for sufficiently large nn.

We now prove that sup𝜸ΓPγ(τn>n)<ϵ\sup_{\bm{\gamma}\in\Gamma}P_{\gamma}(\tau_{n}>n)<\epsilon. Let τ′′:=inf{k:max𝜸ΓSk(𝜸)>An}\tau^{\prime\prime}:=\inf\{k:\max_{\bm{\gamma}\in\Gamma}S_{k}(\bm{\gamma})>A_{n}\}. We also have τ′′τn\tau^{\prime\prime}\geq\tau_{n} with probability 1. Then by the Berry-Esseen Theorem [25], for any 𝜸0Γ\bm{\gamma}_{0}\in\Gamma, we have

P𝜸0(τn>n)\displaystyle P_{{\bm{\gamma}_{0}}}(\tau_{n}>n)
P𝜸0(τ′′>n)\displaystyle\leq P_{{\bm{\gamma}_{0}}}(\tau^{\prime\prime}>n)
P𝜸0(max𝜸ΓSn(𝜸)An)\displaystyle\leq P_{{\bm{\gamma}_{0}}}\left(\max_{\bm{\gamma}\in\Gamma}S_{n}(\bm{\gamma})\leq A_{n}\right)
P𝜸0(Sn(𝜸0)n(D(p𝜸0p0)+V(p𝜸0p0)nΦ1(ϵη0)))\displaystyle\leq\!P_{{\bm{\gamma}_{0}}}\!\bigg{(}S_{n}(\bm{\gamma}_{0})\!\leq\!n\Big{(}D(p_{\bm{\gamma}_{0}}\!\|p_{0})\!+\!\sqrt{\frac{V\!(p_{\bm{\gamma}_{0}}\|p_{0})}{n}}\Phi^{-1}(\epsilon\!-\eta_{0})\!\Big{)}\!\bigg{)}
ϵη0+T1n,\displaystyle\leq\epsilon-\eta_{0}+\frac{T_{1}}{\sqrt{n}}, (24)

where T1T_{1} is a positive finite constant depending only on Var𝜸0(ξ(𝜸0))\mathrm{Var}_{\bm{\gamma}_{0}}(\xi(\bm{\gamma}_{0})) and 𝔼𝜸0[|ξ(𝜸0)|3]\mathbb{E}_{\bm{\gamma}_{0}}[|\xi(\bm{\gamma}_{0})|^{3}]. Since Condition (A2’) guarantees that γic0>0\gamma_{i}\geq c_{0}>0 for i=1,,di=1,\dots,d and p0(i)>0p_{0}(i)>0 for i=1,,di=1,\dots,d, the third absolute moment 𝔼𝜸[|ξ(𝜸)|3]\mathbb{E}_{\bm{\gamma}}[|\xi(\bm{\gamma})|^{3}] is uniformly bounded on Γ\Gamma. Then for every 0<ϵ<10<\epsilon<1, there exists an integer n1(ϵ)n_{1}(\epsilon) that does not depend on 𝜸\bm{\gamma} such that for all n>n1(ϵ)n>n_{1}(\epsilon), P𝜸0(τn>n)ϵη0/2<ϵP_{{\bm{\gamma}_{0}}}(\tau_{n}>n)\leq\epsilon-\eta_{0}/2<\epsilon. Since 𝜸0Γ\bm{\gamma}_{0}\in\Gamma is arbitrary, sup𝜸ΓP𝜸(τn>n)<ϵ\sup_{\bm{\gamma}\in\Gamma}P_{{\bm{\gamma}}}(\tau_{n}>n)<\epsilon for all n>n1(ϵ)n>n_{1}(\epsilon).

We have shown that the two probabilistic constraints (23) and (24) are satisfied for sufficiently large nn. Together with (20) and (21), this shows that any second-order exponent pair (G0,G1)(G_{0},G_{1}) such that G0Φ1(ϵ)V(p𝜸p0)G_{0}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})} and G1Φ1(ϵ)V(p0p𝜸)G_{1}\leq{\Phi^{-1}(\epsilon)}\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})} belongs to 𝒢ϵ(p0,Γ)\mathcal{G}_{\epsilon}(p_{0},\Gamma).

IV-D Proof of the Converse Part of Theorem 8

For each 𝜸0Γ\bm{\gamma}_{0}\in\Gamma, from [16], we know that

1n\displaystyle-\frac{1}{\sqrt{n}} logP𝜸0(Z0(τn))\displaystyle\log P_{\bm{\gamma}_{0}}(Z_{0}(\tau_{n}))
nD(p0p𝜸0)+V(p0p𝜸0)Φ1(ϵ)+αn,\displaystyle\leq\sqrt{n}D(p_{0}\|p_{\bm{\gamma}_{0}})+\sqrt{V(p_{0}\|p_{\bm{\gamma}_{0}})}\Phi^{-1}(\epsilon)+\alpha_{n},

where αn0\alpha_{n}\to 0 as nn\to\infty. Since this bound holds for every 𝜸Γ\bm{\gamma}\in\Gamma, we can optimize over Γ\Gamma to obtain the tightest upper bound:

1n\displaystyle-\frac{1}{\sqrt{n}} sup𝜸ΓlogP𝜸(Z0(τn))\displaystyle\sup_{\bm{\gamma}\in\Gamma}\log P_{\bm{\gamma}}(Z_{0}(\tau_{n}))
min𝜸Γ(nD(p0p𝜸)+V(p0p𝜸)Φ1(ϵ)+αn).\displaystyle\leq\min_{\bm{\gamma}\in\Gamma}\bigg{(}\sqrt{n}D(p_{0}\|p_{\bm{\gamma}})+\sqrt{V(p_{0}\|p_{\bm{\gamma}})}\Phi^{-1}(\epsilon)+\alpha_{n}\bigg{)}.

Similar to the analysis in achievability part, we use Lemma 9 and obtain that

\displaystyle\limsup_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log\frac{1}{\mathsf{P}_{0|1}(\delta_{n},\tau_{n})}-nD(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}.

Similarly, we have that

\displaystyle\limsup_{n\to\infty}\frac{1}{\sqrt{n}}\Big{(}\log\frac{1}{\mathsf{P}_{1|0}(\delta_{n},\tau_{n})}-nD(p_{\bm{\gamma}^{*}}\|p_{0})\Big{)}\leq\Phi^{-1}(\epsilon)\sqrt{V(p_{\bm{\gamma}^{*}}\|p_{0})},

which completes the proof of the converse.

In the appendix, we provide some key properties of 𝐠(q)\mathbf{g}(q) in Lemmas 10 and 11 and their proofs. We also present the proofs of Propositions 5, 6, and 7.

E Properties of 𝐠(q)\mathbf{g}(q)

Lemma 10.

If q𝒫𝒳+q\in\mathcal{P}^{+}_{\mathcal{X}} and qΓq\not\in\Gamma, then the following properties of the optimizer 𝛄~=𝐠(q)\tilde{\bm{\gamma}}=\mathbf{g}(q) hold.

  • (i) The function 𝐠(q)\mathbf{g}(q) is continuous on 𝒫𝒳+Γ\mathcal{P}_{\mathcal{X}}^{+}\setminus\Gamma;

  • (ii) There exists η^>0\hat{\eta}>0 such that for q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), FF is smooth (infinitely differentiable) at 𝜸~\tilde{\bm{\gamma}};

  • (iii) For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), the optimizer 𝜸~\tilde{\bm{\gamma}} satisfies F(𝜸~)=0F(\tilde{\bm{\gamma}})=0 (i.e., 𝜸~\tilde{\bm{\gamma}} lies on the boundary of the uncertainty set);

  • (iv) For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), there exists a symbol j𝒳j\in\mathcal{X} such that

    F(𝜸~)γji=1dγ~iF(𝜸~)γi0;\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{j}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{i}}\neq 0; (25)
  • (v) For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}) and i𝒳i\in\mathcal{X} with iji\not=j (where j𝒳j\in\mathcal{X} is the symbol that satisfies (25) in Part (iv) above),

    q(i)\displaystyle q(i) =γ~i+(q(j)γ~j)γ~iγ~j(F(𝜸~)γjk=1dγ~kF(𝜸~)γk)\displaystyle=\tilde{\gamma}_{i}+\frac{(q(j)-\tilde{\gamma}_{j})\tilde{\gamma}_{i}}{\tilde{\gamma}_{j}\big{(}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{j}}-\sum_{k=1}^{d}\tilde{\gamma}_{k}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{k}}\big{)}}
    ×(F(𝜸~)γik=1dγ~kF(𝜸~)γk).\displaystyle\qquad\times\bigg{(}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{i}}-\sum_{k=1}^{d}\tilde{\gamma}_{k}\frac{\partial F(\tilde{\bm{\gamma}})}{\partial\gamma_{k}}\bigg{)}. (26)
Proof:

We first prove Part (i) of Lemma 10. Assume, to the contrary, that 𝐠(q)\mathbf{g}(q) is not continuous at some q𝒫𝒳+Γq\in\mathcal{P}^{+}_{\mathcal{X}}\setminus\Gamma. Then there exists a positive number κ\kappa and a sequence {qk}k=1𝒫𝒳+Γ\{q_{k}\}_{k=1}^{\infty}\subset\mathcal{P}^{+}_{\mathcal{X}}\setminus\Gamma such that qkqq_{k}\to q as kk\to\infty and i=1d|gi(qk)gi(q)|κ\sum_{i=1}^{d}|g_{i}(q_{k})-g_{i}(q)|\geq\kappa for all kk\in\mathbb{N}. From the definition of 𝐠(qk)\mathbf{g}(q_{k}) and the fact that p0𝒫+p_{0}\in\mathcal{P}^{+}, there exists κ^>0\hat{\kappa}>0 such that

i=1dqk(i)logp0(i)gi(qk)<i=1dqk(i)logp0(i)gi(q)κ^,\displaystyle\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}<\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q)}-\hat{\kappa}, (27)

for all kk\in\mathbb{N}. From Condition (A2’) and the fact that {𝐠(qk)}k=1Γ\{\mathbf{g}(q_{k})\}_{k=1}^{\infty}\subset\Gamma, there exists a constant M<M<\infty such that

supk,i𝒳|logp0(i)gi(qk)|M,\sup_{k\in\mathbb{N},i\in\mathcal{X}}\left|\log\frac{p_{0}(i)}{g_{i}(q_{k})}\right|\leq M,

which further implies that

lim supki=1dqk(i)logp0(i)gi(qk)\displaystyle\limsup_{k\to\infty}\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}
=lim supk(i=1d(q(i)+(qk(i)q(i)))logp0(i)gi(qk))\displaystyle=\limsup_{k\to\infty}\bigg{(}\sum_{i=1}^{d}\big{(}q(i)+(q_{k}(i)-q(i))\big{)}\log\frac{p_{0}(i)}{g_{i}(q_{k})}\bigg{)}
=lim supki=1dq(i)logp0(i)gi(qk).\displaystyle=\limsup_{k\to\infty}\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}. (28)

Combining (27) and (28), we have that

lim supki=1dq(i)logp0(i)gi(qk)\displaystyle\limsup_{k\to\infty}\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}
lim supki=1dqk(i)logp0(i)gi(q)κ^\displaystyle\leq\limsup_{k\to\infty}\sum_{i=1}^{d}q_{k}(i)\log\frac{p_{0}(i)}{g_{i}(q)}-\hat{\kappa}
=i=1dq(i)logp0(i)gi(q)κ^,\displaystyle=\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q)}-\hat{\kappa},

which contradicts the fact that, since 𝐠(q)\mathbf{g}(q) minimizes 𝜸i=1dq(i)log(p0(i)/γi)\bm{\gamma}\mapsto\sum_{i=1}^{d}q(i)\log({p_{0}(i)}/{\gamma_{i}}) over Γ\Gamma,

i=1dq(i)logp0(i)gi(qk)i=1dq(i)logp0(i)gi(q).\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q_{k})}\geq\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{g_{i}(q)}.

Hence 𝐠(q)\mathbf{g}(q) is continuous on 𝒫𝒳+Γ\mathcal{P}^{+}_{\mathcal{X}}\setminus\Gamma.

We next prove Part (ii) of Lemma 10. From the continuity of 𝐠(q)\mathbf{g}(q) (as proved above), there exists η^>0\hat{\eta}>0 such that

{𝜸~:𝜸~=𝐠(q)for some q(p0,η^)}(𝐠(p0),η),\{\tilde{\bm{\gamma}}:\tilde{\bm{\gamma}}=\mathbf{g}(q)\,\mbox{for some $q\in\mathcal{B}(p_{0},\hat{\eta})$}\}\subset\mathcal{B}(\mathbf{g}(p_{0}),\eta),

which, together with Condition (A3’) implies Part (ii) of Lemma 10.

We now proceed to prove Part (iii) of Lemma 10. Recall that the optimizer 𝜸~\tilde{\bm{\gamma}} is obtained from the optimization problem in Section IV-A. Its corresponding Lagrangian is

L(𝜸,λ,μ)=i=1dq(i)logp0(i)γi+λ(i=1dγi1)+μF(𝜸).\displaystyle L(\bm{\gamma},\lambda,\mu)=\sum_{i=1}^{d}q(i)\log\frac{p_{0}(i)}{\gamma_{i}}+\lambda\bigg{(}\sum_{i=1}^{d}\gamma_{i}-1\bigg{)}+\mu F(\bm{\gamma}).

For q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), F(𝜸)F(\bm{\gamma}) is smooth at 𝜸~\tilde{\bm{\gamma}} (by the previous part). Hence, using the Karush–Kuhn–Tucker (KKT) conditions [26], the optimizer 𝜸~\tilde{\bm{\gamma}} satisfies the first-order stationarity conditions

q(i)γ~i+λ+μF(𝜸)γi|𝜸=𝜸~=0,i=1,,d.\displaystyle-\frac{q(i)}{\tilde{\gamma}_{i}}+\lambda+\mu\frac{\partial F(\bm{\gamma})}{\partial\gamma_{i}}\bigg{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}=0,\quad\forall\,i=1,\dots,d. (29)

The complementary slackness condition is μF(𝜸~)=0\mu F(\tilde{\bm{\gamma}})=0, which implies that either μ=0\mu=0 or F(𝜸~)=0F(\tilde{\bm{\gamma}})=0. When μ=0\mu=0, we have

q(i)=λγ~iq(i)=\lambda\tilde{\gamma}_{i} for all ii; summing over ii and using i=1dq(i)=i=1dγ~i=1\sum_{i=1}^{d}q(i)=\sum_{i=1}^{d}\tilde{\gamma}_{i}=1 gives λ=1\lambda=1 and hence γ~i=q(i)\tilde{\gamma}_{i}=q(i),

which is impossible as qΓq\notin\Gamma. Thus, it holds that F(𝜸~)=0F(\tilde{\bm{\gamma}})=0, which means the optimizer lies on the boundary of the set Γ\Gamma.

We then proceed to prove Part (iv) of Lemma 10. If

F(𝜸)γj|𝜸=𝜸~i=1dγ~iF(𝜸)γi|𝜸=𝜸~=0\frac{\partial F(\bm{\gamma})}{\partial\gamma_{j}}\Big{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F({\bm{\gamma}})}{\partial\gamma_{i}}\Big{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}=0

for all j𝒳j\in\mathcal{X}, then {F(𝜸)γj|𝜸=𝜸~}j=1d\big{\{}\frac{\partial F(\bm{\gamma})}{\partial\gamma_{j}}\big{|}_{\bm{\gamma}=\tilde{\bm{\gamma}}}\big{\}}_{j=1}^{d} are all equal. Combining this fact with (29), we have that q=𝜸~q=\tilde{\bm{\gamma}}, which contradicts the fact that qΓq\not\in\Gamma.

Finally, we prove Part (v) of Lemma 10. Combining the constraints of the optimization problem in Section IV-A with (29), we can eliminate λ\lambda and obtain qq in terms of μ\mu as

q(j)=γ~jμγ~ji=1dγ~iF(𝜸)γ~i+μγ~jF(𝜸)γ~j\displaystyle q(j)=\tilde{\gamma}_{j}-\mu\tilde{\gamma}_{j}\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{i}}+\mu\tilde{\gamma}_{j}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{j}} (30)

for all j=1,2,,dj=1,2,\dots,d. Then we obtain μ\mu in terms of q(j)q(j) as:

μ=1γ~j(F(𝜸)γ~ji=1dγ~iF(𝜸)γ~i)1(q(j)γ~j).\displaystyle\mu=\frac{1}{\tilde{\gamma}_{j}}\bigg{(}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{j}}-\sum_{i=1}^{d}\tilde{\gamma}_{i}\frac{\partial F(\bm{\gamma})}{\partial\tilde{\gamma}_{i}}\bigg{)}^{-1}(q(j)-\tilde{\gamma}_{j}). (31)

Then substituting (31) into (30), we have the desired formula. This completes the proof of Lemma 10. ∎
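The boundary property established in Part (iii) is easy to check numerically. The following Python sketch (for a hypothetical linear-constraint uncertainty set in the spirit of Example 3 and Proposition 5, solved with a generic constrained optimizer rather than by the analysis above) computes \mathbf{g}(q) and confirms that the constraint is active at the optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of the KKT analysis: compute g(q), the minimizer over Gamma
# of sum_i q(i) log(p0(i)/gamma_i), for a hypothetical linear-constraint set
# Gamma = {gamma in the simplex : w . gamma <= xi}.
p0 = np.full(4, 0.25)
q = np.array([0.28, 0.24, 0.26, 0.22])       # a type near p0
w, xi = np.array([1.0, 2.0, 3.0, 4.0]), 2.2  # w.q = 2.42 and w.p0 = 2.5, so
                                             # q and p0 both lie outside Gamma

obj = lambda g: np.sum(q * np.log(p0 / g))
cons = [{"type": "eq", "fun": lambda g: np.sum(g) - 1.0},
        {"type": "ineq", "fun": lambda g: xi - w @ g}]
res = minimize(obj, x0=np.array([0.4, 0.3, 0.2, 0.1]),  # feasible start
               bounds=[(1e-6, 1.0)] * 4, constraints=cons)
g_tilde = res.x
print(g_tilde, w @ g_tilde)  # the constraint is active: w . g_tilde ~= xi,
                             # consistent with Part (iii) of Lemma 10
```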

Lemma 11.

Let η^\hat{\eta} be as given in Lemma 10. Suppose q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), and Γ\Gamma satisfies (A1’)–(A4’). Then,

  1. (i) The function 𝐠(q)\mathbf{g}(q) is smooth on (p0,ζ)\mathcal{B}(p_{0},\zeta) for some ζ>0\zeta>0 and satisfies the following equality

    j=1dq(j)gj(q)gj(q)q(i)=0,\displaystyle\sum_{j=1}^{d}\frac{q(j)}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=0,

    for all q(p0,ζ)q\in\mathcal{B}(p_{0},\zeta).

  2. (ii) The function ff, defined in (17), is smooth on (p0,ζ)\mathcal{B}(p_{0},\zeta) and its first- and second-order derivatives are

    f(q)q(j)\displaystyle\frac{\partial f(q)}{\partial q(j)} =logp0(j)gj(q)+i=1dq(i)gi(q)gi(q)q(j),\displaystyle=\log\frac{p_{0}(j)}{g_{j}(q)}+\sum_{i=1}^{d}\frac{q(i)}{g_{i}(q)}\frac{\partial g_{i}(q)}{\partial q(j)}, (32)
    2f(q)q(j)2\displaystyle\frac{\partial^{2}f(q)}{\partial q(j)^{2}} =2gj(q)gj(q)q(j)i=1d[q(i)gi(q)2\displaystyle=-\frac{2}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(j)}-\sum_{i=1}^{d}\bigg{[}-\frac{q(i)}{g_{i}(q)^{2}}
    ×(gi(q)q(j))2+q(i)gi(q)2gi(q)q(j)2],and\displaystyle\qquad\times\bigg{(}\frac{\partial g_{i}(q)}{\partial q(j)}\bigg{)}^{2}+\frac{q(i)}{g_{i}(q)}\frac{\partial^{2}g_{i}(q)}{\partial q(j)^{2}}\bigg{]},\quad\mbox{and}
    2f(q)q(j)q(i)\displaystyle\frac{\partial^{2}f(q)}{\partial q(j)\partial q(i)} =1gj(q)gj(q)q(i)1gi(q)gi(q)q(i)\displaystyle=-\frac{1}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}-\frac{1}{g_{i}(q)}\frac{\partial g_{i}(q)}{\partial q(i)}
    l=1d[q(l)gl(q)2gl(q)q(j)gl(q)q(i)\displaystyle\qquad-\sum_{l=1}^{d}\bigg{[}-\frac{q(l)}{g_{l}(q)^{2}}\frac{\partial g_{l}(q)}{\partial q(j)}\frac{\partial g_{l}(q)}{\partial q(i)}
    +q(l)gl(q)2gl(q)q(j)q(i)]forij.\displaystyle\qquad\quad+\frac{q(l)}{g_{l}(q)}\frac{\partial^{2}g_{l}(q)}{\partial q(j)\partial q(i)}\bigg{]}\quad\mbox{for}\;i\neq j. (33)
Proof:

Now we prove Part (i) of Lemma 11. As F(𝜸)F(\bm{\gamma}) is smooth (Condition (A3’)) and 𝐉(p0)\mathbf{J}(p_{0}) is of full rank (Condition (A4’)), there exists ζ>0\zeta>0 such that 𝐉(q)\mathbf{J}(q) is of full rank for all q(p0,ζ)q\in\mathcal{B}(p_{0},\zeta). Then by the inverse function theorem [27, Theorem 2.11], 𝜸~=𝐠(q)\tilde{\bm{\gamma}}=\mathbf{g}(q) is differentiable in qq. Multiplying both sides of (29) by gj(q)/q(i){\partial g_{j}(q)}/{\partial q(i)} and summing over j=1,,dj=1,\dots,d, we obtain

j=1dq(j)gj(q)gj(q)q(i)=λj=1dgj(q)q(i)+μj=1dF(𝐠(q))gj(q)gj(q)q(i).\displaystyle\sum_{j=1}^{d}\frac{q(j)}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=\lambda\sum_{j=1}^{d}\frac{\partial g_{j}(q)}{\partial q(i)}+\mu\sum_{j=1}^{d}\frac{\partial F(\mathbf{g}(q))}{\partial g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}. (34)

We differentiate the first constraint i=1dγ~i=i=1dgi(q)=1\sum_{i=1}^{d}\tilde{\gamma}_{i}=\sum_{i=1}^{d}g_{i}(q)=1 with respect to q(i)q(i) to obtain

j=1dgj(q)q(i)=0.\displaystyle\sum_{j=1}^{d}\frac{\partial g_{j}(q)}{\partial q(i)}=0. (35)

From Part (iii) of Lemma 10 it follows that F(𝜸~)=F(𝐠(q))=0F(\tilde{\bm{\gamma}})=F(\mathbf{g}(q))=0, i.e., the composition of FF and 𝐠\mathbf{g} is identically 0 on (p0,η^)\mathcal{B}(p_{0},\hat{\eta}). Therefore, the derivative of this composition with respect to qq is 0, i.e.,

F(𝐠(q))q(i)=j=1dF(𝐠(q))gj(q)gj(q)q(i)=0.\displaystyle\frac{\partial F(\mathbf{g}(q))}{\partial q(i)}=\sum_{j=1}^{d}\frac{\partial F(\mathbf{g}(q))}{\partial g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=0. (36)

Substituting (35) and (36) back into (34), we have that

j=1dq(j)gj(q)gj(q)q(i)=0,\displaystyle\sum_{j=1}^{d}\frac{q(j)}{g_{j}(q)}\frac{\partial g_{j}(q)}{\partial q(i)}=0,

as desired.

Part (ii) of Lemma 11 follows from straightforward, albeit tedious, calculations. This completes the proof of Lemma 11. ∎
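The identity in Part (i) can likewise be probed by finite differences; the following sketch reuses the hypothetical linear-constraint \Gamma from the previous sketch (the perturbation compensates in the last coordinate to stay on the simplex, which is harmless since the identity holds for every coordinate):

```python
import numpy as np
from scipy.optimize import minimize

# Finite-difference check of sum_j (q(j)/g_j(q)) * dg_j(q)/dq(i) = 0
# on the hypothetical linear-constraint Gamma used earlier.
p0 = np.full(4, 0.25)
w, xi = np.array([1.0, 2.0, 3.0, 4.0]), 2.2

def g_of_q(q):
    cons = [{"type": "eq", "fun": lambda g: np.sum(g) - 1.0},
            {"type": "ineq", "fun": lambda g: xi - w @ g}]
    res = minimize(lambda g: np.sum(q * np.log(p0 / g)),
                   x0=np.array([0.4, 0.3, 0.2, 0.1]),
                   bounds=[(1e-6, 1.0)] * 4, constraints=cons, tol=1e-12)
    return res.x

q = np.array([0.28, 0.24, 0.26, 0.22])
g = g_of_q(q)
eps = 1e-5
for i in range(3):
    dq = np.zeros(4)
    dq[i], dq[-1] = eps, -eps  # perturb q(i), compensate in the last coordinate
    dg = (g_of_q(q + dq) - g_of_q(q - dq)) / (2 * eps)
    print(np.sum(q / g * dg))  # ~= 0 up to solver/finite-difference tolerance
```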

F Proof of Proposition 5

Assume F1(𝜸)=F1(γ1,,γd)=i=1dwiγiF_{1}(\bm{\gamma})=F_{1}(\gamma_{1},\ldots,\gamma_{d})=\sum_{i=1}^{d}w_{i}\gamma_{i}. Without loss of generality, we assume w1ξ1w_{1}\not=\xi_{1}. Conditions (A1’)–(A3’) clearly hold. Hence from Parts (ii) and (iii) of Lemma 10 there exists η^\hat{\eta} such that for all q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), the optimizer 𝜸~\tilde{\bm{\gamma}} of the optimization problem in Section IV-A is such that F1(𝜸~)=ξ1F_{1}(\tilde{\bm{\gamma}})=\xi_{1} and Fk(𝜸~)<ξkF_{k}(\tilde{\bm{\gamma}})<\xi_{k} for all k1k\not=1. Note that F1(𝜸)/γi=wi{\partial F_{1}(\bm{\gamma})}/{\partial\gamma_{i}}=w_{i}. Then for q(p0,η^)q\in\mathcal{B}(p_{0},\hat{\eta}), using the KKT conditions, we obtain the first-order optimality conditions for the optimizer 𝜸~\tilde{\bm{\gamma}}:

i=1dγ~i\displaystyle\sum_{i=1}^{d}\tilde{\gamma}_{i} =1,\displaystyle=1,
i=1dwiγ~i\displaystyle\sum_{i=1}^{d}w_{i}\tilde{\gamma}_{i} =ξ1,\displaystyle=\xi_{1}, (37)
q(i)\displaystyle q(i) =λ1γ~i+λ2γ~iwi.\displaystyle=\lambda_{1}\tilde{\gamma}_{i}+\lambda_{2}\tilde{\gamma}_{i}w_{i}.

Hence,

λ2=q(1)γ~1γ~1(w1ξ1).\displaystyle\lambda_{2}=\frac{q(1)-\tilde{\gamma}_{1}}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})}. (38)

Substituting (38) into (37), we obtain

q(i)=γ~i(1+(q(1)γ~1)(wiξ1)γ~1(w1ξ1)).\displaystyle q(i)=\tilde{\gamma}_{i}\bigg{(}1+\frac{(q(1)-\tilde{\gamma}_{1})(w_{i}-\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})}\bigg{)}.

Thus, the Jacobian of (q(2),,q(d))(q(2),\ldots,q(d)) with respect to (γ~2,,γ~d)(\tilde{\gamma}_{2},\ldots,\tilde{\gamma}_{d}) is the following (d1)×(d1)(d-1)\times(d-1) diagonal matrix:

𝐉(q)\displaystyle\mathbf{J}(q) =diag[1+(q(1)γ~1)(w2ξ1)γ~1(w1ξ1),1+(q(1)γ~1)(w3ξ1)γ~1(w1ξ1),\displaystyle=\mathrm{diag}\!\bigg{[}1\!+\!\frac{(q(1)\!-\!\tilde{\gamma}_{1})(w_{2}\!-\!\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})},\!1\!+\!\frac{(q(1)\!-\!\tilde{\gamma}_{1})(w_{3}\!-\!\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})},
,1+(q(1)γ~1)(wdξ1)γ~1(w1ξ1)].\displaystyle\qquad\ldots,1+\frac{(q(1)-\tilde{\gamma}_{1})(w_{d}-\xi_{1})}{\tilde{\gamma}_{1}(w_{1}-\xi_{1})}\bigg{]}.

Since p0(i)>0p_{0}(i)>0 for all i=1,2,,di=1,2,\dots,d, the preceding display shows that the iith diagonal term of 𝐉(p0)\mathbf{J}(p_{0}) equals p0(i)/γ~i>0p_{0}(i)/\tilde{\gamma}_{i}>0. Thus, det(𝐉(p0))0\mathrm{det}(\mathbf{J}(p_{0}))\neq 0, which proves that Condition (A4’) holds for the set Γ\Gamma in Example 3.

G Proof of Proposition 6

We now prove the promised central limit-type result for the sequence of random variables {Sn/n}n\{S_{n}/\sqrt{n}\}_{n\in\mathbb{N}}. Let z(0,1)z\in(0,1). Let ζ\zeta be given as in Part (i) of Lemma 11 and define the ζ\zeta-typical set

𝒯ζ(n)=𝒯ζ(n)(p0)\displaystyle\mathcal{T}_{\zeta}^{(n)}=\mathcal{T}_{\zeta}^{(n)}(p_{0})
={xn𝒳n:|(1nk=1n𝟙{xk=i})p0(i)|<ζ,i𝒳}.\displaystyle=\bigg{\{}x^{n}\in\mathcal{X}^{n}:\Big{|}\Big{(}\frac{1}{n}\sum_{k=1}^{n}\mathbbm{1}\{x_{k}=i\}\Big{)}-p_{0}(i)\Big{|}<\zeta,~{}\forall\,i\in\mathcal{X}\bigg{\}}.

This is the set of sequences whose types are near p0p_{0}. The key idea is to perform a Taylor expansion of the function f(Q)=i=1dQ(i)logp0(i)gi(Q)f(Q)=\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{g_{i}(Q)} (defined in (17)) at the point Q=p0Q=p_{0} and analyze the asymptotics of the various terms in the expansion. For brevity, define the deviation of the type QQ of XnX^{n} from the true distribution at symbol i𝒳i\in\mathcal{X} as

Δi:=Q(i)p0(i),\Delta_{i}:=Q(i)-p_{0}(i), and write bi:=log(p0(i)/gi(p0))b_{i}:=\log({p_{0}(i)}/{g_{i}(p_{0})}) as shorthand for the coefficients that appear in (43) and (46) below.

For q(p0,ζ)q\in\mathcal{B}(p_{0},\zeta), let 𝐇(q)d×d\mathbf{H}(q)\in\mathbb{R}^{d\times d} be the Hessian matrix of f(q)f(q). This is well defined because f()f(\cdot) is twice continuously differentiable on (p0,ζ)\mathcal{B}(p_{0},\zeta) according to Part (ii) of Lemma 11. If xn𝒯ζ(n)x^{n}\in\mathcal{T}_{\zeta}^{(n)}, then Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta). Thus for Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta), using Taylor’s theorem we have the expansion

f(Q)\displaystyle f(Q) =f(p0)+(f(p0))(Qp0)\displaystyle=f(p_{0})+(\nabla f(p_{0}))^{\top}(Q-p_{0})
+12(Qp0)𝐇(Q~)(Qp0)\displaystyle\qquad+\frac{1}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}
=i=1dp0(i)logp0(i)gi(p0)+i=1dlogp0(i)gi(p0)Δi\displaystyle=\sum_{i=1}^{d}p_{0}(i)\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\sum_{i=1}^{d}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Delta_{i}
i=1dj=1dp0(j)gj(p0)gj(q)q(i)|q=p0Δi\displaystyle\qquad-\sum_{i=1}^{d}\sum_{j=1}^{d}\frac{p_{0}(j)}{g_{j}(p_{0})}\frac{\partial g_{j}(q)}{\partial q(i)}\bigg{|}_{q=p_{0}}\Delta_{i}
+12(Qp0)𝐇(Q~)(Qp0)\displaystyle\qquad+\frac{1}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top} (39)
=D(p0p𝜸)+i=1dlogp0(i)gi(p0)Δi\displaystyle=D(p_{0}\|p_{\bm{\gamma}^{\prime}})+\sum_{i=1}^{d}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Delta_{i}
+12(Qp0)𝐇(Q~)(Qp0),\displaystyle\qquad+\frac{1}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}, (40)

where Q~\tilde{Q} lies on the line segment between QQ and p0p_{0}, (39) follows from (32) in Lemma 11 and (40) follows from Part (i) of Lemma 11. Note that we represent probability mass functions as row vectors.

Then for Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta), from (40), we have that

min𝜸Γ(ni=1dQ(i)logp0(i)γi)nD(p0p𝜸)\displaystyle\min_{\bm{\gamma}\in\Gamma}\bigg{(}\sqrt{n}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}\bigg{)}-\sqrt{n}D(p_{0}\|p_{\bm{\gamma}^{\prime}})
=n(f(Q)D(p0p𝜸))\displaystyle=\sqrt{n}\big{(}f(Q)-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\big{)}
=i=1dnΔilogp0(i)gi(p0)+n2(Qp0)𝐇(Q~)(Qp0).\displaystyle=\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\sqrt{n}}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}. (41)

Let λmin(𝐇(q))\lambda_{\min}(\mathbf{H}(q)) and λmax(𝐇(q))\lambda_{\max}(\mathbf{H}(q)) be the smallest and largest eigenvalues of 𝐇(q)\mathbf{H}(q), respectively. From Part (ii) of Lemma 11, it follows that f()f(\cdot) is smooth on (p0,ζ)\mathcal{B}(p_{0},\zeta), which implies that there exist two constants c~\tilde{c} and C~\tilde{C} such that

\displaystyle-\infty<\tilde{c}<\min_{q\in\mathcal{B}(p_{0},\zeta)}\lambda_{\min}(\mathbf{H}(q))\leq\max_{q\in\mathcal{B}(p_{0},\zeta)}\lambda_{\max}(\mathbf{H}(q))<\tilde{C}<\infty. (42)

Then we have the upper bound shown in (45),

P0\displaystyle P_{0} (min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))Φ1(z)V(p0p𝜸))\displaystyle\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\bigg{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg{)}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}
P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle\leq P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\bigg{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg{)}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)})
=P0(i=1dnbiΔi+n2(Qp0)𝐇(Q~)(Qp0)Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle=P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}b_{i}\Delta_{i}+\frac{\sqrt{n}}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)}) (43)
P0(i=1dnΔilogp0(i)gi(p0)+λmin(𝐇(Q~))2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle\leq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\lambda_{\min}(\mathbf{H}(\tilde{Q}))}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)})
P0(i=1dnΔilogp0(i)gi(p0)+c~2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))+P0(Xn𝒯ζ(n))\displaystyle\leq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{c}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}+P_{0}(X^{n}\notin\mathcal{T}_{\zeta}^{(n)}) (44)
P0(i=1dnΔilogp0(i)gi(p0)+c~2i=1dnΔi2Φ1(z)V(p0p𝜸))+dexp(2nζ2),\displaystyle\leq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{c}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}+d\exp(-2n\zeta^{2}), (45)

in which (43) follows from the fact that Q(p0,ζ)Q\in\mathcal{B}(p_{0},\zeta) for xn𝒯ζ(n)x^{n}\in\mathcal{T}_{\zeta}^{(n)} together with (41), (44) follows from (42), and (45) holds by the union bound and Hoeffding’s inequality. Similarly, we can obtain the lower bound shown in (46).

P0\displaystyle P_{0} (min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))Φ1(z)V(p0p𝜸))\displaystyle\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\bigg{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\bigg{)}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}
P0(i=1dnbiΔi+n2(Qp0)𝐇(Q~)(Qp0)Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}b_{i}\Delta_{i}+\frac{\sqrt{n}}{2}({Q}-p_{0})\mathbf{H}(\tilde{Q})({Q}-p_{0})^{\top}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}
P0(i=1dnbiΔi+λmax(𝐇(Q~))2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}b_{i}\Delta_{i}+\frac{\lambda_{\max}(\mathbf{H}(\tilde{Q}))}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}
P0(i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2Φ1(z)V(p0p𝜸),Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})},X^{n}\in\mathcal{T}_{\zeta}^{(n)}\bigg{)}
P0(i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2Φ1(z)V(p0p𝜸))P0(Xn𝒯ζ(n))\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}-P_{0}\big{(}X^{n}\not\in\mathcal{T}_{\zeta}^{(n)}\big{)}
P0(i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2Φ1(z)V(p0p𝜸))dexp(2nζ2).\displaystyle\geq P_{0}\bigg{(}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}-d\exp(-2n\zeta^{2}). (46)

One can verify that

ni=1dΔilogp0(i)gi(p0)\displaystyle n\sum_{i=1}^{d}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}
=k=1n(i=1d(𝟙{Xk=i}p0(i))logp0(i)gi(p0))\displaystyle=\sum_{k=1}^{n}\bigg{(}\sum_{i=1}^{d}\big{(}\mathbbm{1}\{X_{k}=i\}-p_{0}(i)\big{)}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{)} (47)

and the variance

Var0[i=1d(𝟙{X1=i}p0(i))logp0(i)gi(p0)]\displaystyle\mathrm{Var}_{0}\bigg{[}\sum_{i=1}^{d}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{]}
=𝔼0[(i=1d(𝟙{X1=i}p0(i))logp0(i)gi(p0))2]\displaystyle=\mathbb{E}_{0}\bigg{[}\bigg{(}\sum_{i=1}^{d}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{)}^{2}\bigg{]} (48)
=𝔼0[i=1d(logp0(i)gi(p0))2(𝟙{X1=i}p0(i))2\displaystyle=\mathbb{E}_{0}\bigg{[}\sum_{i=1}^{d}\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}^{2}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))^{2}
+2ji(𝟙{X1=i}p0(i))(𝟙{X1=j}p0(j))\displaystyle\qquad+2\sum_{j\neq i}(\mathbbm{1}\{X_{1}=i\}\!-p_{0}(i))(\mathbbm{1}\{X_{1}=j\}\!-p_{0}(j))
×(logp0(i)gi(p0))(logp0(j)gj(p0))]\displaystyle\qquad\quad\times\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}\bigg{]}
=i=1d(1p0(i))p0(i)log2p0(i)gi(p0)\displaystyle=\sum_{i=1}^{d}(1-p_{0}(i))p_{0}(i)\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}
2ijp0(i)p0(j)(logp0(i)gi(p0))(logp0(j)gj(p0))\displaystyle\qquad-2\sum_{i\neq j}p_{0}(i)p_{0}(j)\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)} (49)
=i=1dp0(i)log2p0(i)gi(p0)i=1dp0(i)2log2p0(i)gi(p0)\displaystyle=\sum_{i=1}^{d}p_{0}(i)\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}-\sum_{i=1}^{d}p_{0}(i)^{2}\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}
2ijp0(i)p0(j)(logp0(i)gi(p0))(logp0(j)gj(p0))\displaystyle\qquad-2\sum_{i\neq j}p_{0}(i)p_{0}(j)\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}
=V(p0p𝜸),\displaystyle=V(p_{0}\|p_{\bm{\gamma}^{\prime}}),

where (48) follows from

𝔼0[i=1d(𝟙{X1=i}p0(i))logp0(i)gi(p0)]=0,\mathbb{E}_{0}\bigg{[}\sum_{i=1}^{d}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{]}=0,

and (49) follows from

ij𝔼0[(𝟙{X1=i}p0(i))(𝟙{X1=j}p0(j))\displaystyle\sum_{i\neq j}\mathbb{E}_{0}\bigg{[}(\mathbbm{1}\{X_{1}=i\}-p_{0}(i))(\mathbbm{1}\{X_{1}=j\}-p_{0}(j))
×(logp0(i)gi(p0))(logp0(j)gj(p0))]\displaystyle\quad\times\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}\bigg{]}
=ijp0(i)p0(j)(logp0(i)gi(p0))(logp0(j)gj(p0))\displaystyle=-\sum_{i\neq j}p_{0}(i)p_{0}(j)\Big{(}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\Big{)}\Big{(}\log\frac{p_{0}(j)}{g_{j}(p_{0})}\Big{)}

and

𝔼0[i=1d(𝟙{Xk=i}p0(i))2log2p0(i)gi(p0)]\displaystyle\mathbb{E}_{0}\bigg{[}\sum_{i=1}^{d}(\mathbbm{1}\{X_{k}=i\}-p_{0}(i))^{2}\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}\bigg{]}
=i=1d(1p0(i))p0(i)log2p0(i)gi(p0).\displaystyle=\sum_{i=1}^{d}(1-p_{0}(i))p_{0}(i)\log^{2}\frac{p_{0}(i)}{g_{i}(p_{0})}.
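As a quick sanity check of this computation, the sketch below compares the direct variance of the log-likelihood ratio with the expanded form above (using a hypothetical p_0 and a stand-in pmf for g(p_0); the pairwise sum is taken over unordered pairs, matching the factor of 2):

```python
import numpy as np

# Sanity check of the variance identity with a hypothetical p0 and a
# stand-in pmf g for g(p0) = p_{gamma'}.
p0 = np.array([0.4, 0.35, 0.25])
g = np.array([0.5, 0.3, 0.2])
llr = np.log(p0 / g)

direct = np.sum(p0 * llr**2) - np.sum(p0 * llr) ** 2  # Var_0[log(p0(X)/g(X))]
expanded = (np.sum((1 - p0) * p0 * llr**2)
            - 2 * sum(p0[i] * p0[j] * llr[i] * llr[j]
                      for i in range(3) for j in range(i + 1, 3)))
print(direct, expanded)  # equal: both are V(p0 || p_{gamma'})
```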

Therefore ni=1dΔilogp0(i)gi(p0)n\sum_{i=1}^{d}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})} is a sum of the i.i.d. random variables {i=1d(𝟙{Xk=i}p0(i))logp0(i)gi(p0)}k=1n\big{\{}\sum_{i=1}^{d}(\mathbbm{1}\{X_{k}=i\}-p_{0}(i))\log\frac{p_{0}(i)}{g_{i}(p_{0})}\big{\}}_{k=1}^{n}, each with mean 0 and variance V(p0p𝜸)V(p_{0}\|p_{\bm{\gamma}^{\prime}}). Hence, by the central limit theorem,

i=1dnΔilogp0(i)gi(p0)d𝒩(0,V(p0p𝜸)).\displaystyle\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})).

Together with the fact that i=1dnΔi20\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\to 0 almost surely, Slutsky’s theorem implies that

i=1dnΔilogp0(i)gi(p0)+c~2i=1dnΔi2d𝒩(0,V(p0p𝜸)),\displaystyle\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{c}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})), (50)

and

i=1dnΔilogp0(i)gi(p0)+C~2i=1dnΔi2d𝒩(0,V(p0p𝜸)).\displaystyle\sum_{i=1}^{d}\sqrt{n}\Delta_{i}\log\frac{p_{0}(i)}{g_{i}(p_{0})}+\frac{\tilde{C}}{2}\sum_{i=1}^{d}\sqrt{n}\Delta_{i}^{2}\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})). (51)

Then combining (45), (46), (50) and (51), we have that

lim supn\displaystyle\limsup_{n\to\infty} P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))\displaystyle\;P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\Big{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(z)V(p0p𝜸))z,\displaystyle\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}\leq z, (52)

and

lim infn\displaystyle\liminf_{n\to\infty} P0(min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))\displaystyle\;P_{0}\bigg{(}\min_{\bm{\gamma}\in\Gamma}\sqrt{n}\Big{(}\sum_{i=1}^{d}Q(i)\log\frac{p_{0}(i)}{\gamma_{i}}-D(p_{0}\|p_{\bm{\gamma}^{\prime}})\Big{)}
Φ1(z)V(p0p𝜸))z.\displaystyle\leq\Phi^{-1}(z)\sqrt{V(p_{0}\|p_{\bm{\gamma}^{\prime}})}\bigg{)}\geq z. (53)

Since z(0,1)z\in(0,1) is arbitrary, it follows from (52) and (53) that

min𝜸Γn(i=1dQ(i)logp0(i)γiD(p0p𝜸))d𝒩(0,V(p0p𝜸)),\displaystyle\!\min_{\bm{\gamma}\in\Gamma}\!\sqrt{n}\!\bigg{(}\!\sum_{i=1}^{d}\!Q(i)\!\log\frac{p_{0}(i)}{\gamma_{i}}\!-\!D(p_{0}\|p_{\bm{\gamma}^{\prime}})\!\bigg{)}\!\stackrel{{\scriptstyle\mathrm{d}}}{{\longrightarrow}}\!\mathcal{N}(0,V(p_{0}\|p_{\bm{\gamma}^{\prime}})),

which completes the proof of Proposition 6.
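Proposition 6 can also be probed empirically. The following Monte Carlo sketch (a hypothetical binary instance with a grid approximation of \Gamma; none of these parameters come from the paper) estimates the mean and variance of the normalized statistic, which should be close to 0 and V(p_{0}\|p_{\bm{\gamma}^{\prime}}) respectively:

```python
import numpy as np

# Monte Carlo sketch of Proposition 6: p0 = Bern(0.5) and
# Gamma = {Bern(g): g in [0.6, 0.8]}, so gamma' corresponds to g = 0.6.
rng = np.random.default_rng(1)
p0 = np.array([0.5, 0.5])
grid = [np.array([g, 1 - g]) for g in np.linspace(0.6, 0.8, 201)]

D = lambda p, q: float(np.sum(p * np.log(p / q)))
g_prime = min(grid, key=lambda g: D(p0, g))   # the minimizer gamma'
D_star = D(p0, g_prime)
V_star = float(np.sum(p0 * np.log(p0 / g_prime) ** 2) - D_star**2)

n, trials = 5000, 2000
samples = []
for _ in range(trials):
    Q = rng.multinomial(n, p0) / n            # empirical type of X^n ~ p0
    vals = [np.sum(Q * np.log(p0 / g)) for g in grid]
    samples.append(np.sqrt(n) * (min(vals) - D_star))

# Proposition 6 predicts mean ~= 0 and variance ~= V(p0 || p_{gamma'})
print(np.mean(samples), np.var(samples), V_star)
```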

H Proof of Proposition 7

We now show that Conditions (A1’)–(A4’) imply Conditions (A1)–(A3). Condition (A1) follows directly from Condition (A1’). As 𝒳={1,2,,d}\mathcal{X}=\{1,2,\dots,d\}, we have

D(p0p𝜸)=i=1dp0(i)logp0(i)γi,\displaystyle D(p_{0}\|p_{\bm{\gamma}})=\sum_{i=1}^{d}p_{0}(i)\log\frac{p_{0}(i)}{\gamma_{i}},

and

D(p𝜸p0)=i=1dγilogγip0(i).\displaystyle D(p_{\bm{\gamma}}\|p_{0})=\sum_{i=1}^{d}\gamma_{i}\log\frac{\gamma_{i}}{p_{0}(i)}.

Combining Condition (A2’) (which states that mini=1,,dγic0>0\min_{i=1,\ldots,d}\gamma_{i}\geq c_{0}>0 for all 𝜸Γ\bm{\gamma}\in\Gamma) with the fact that mini=1,,dp0(i)>0\min_{i=1,\ldots,d}p_{0}(i)>0, we deduce that D(p0p𝜸)D(p_{0}\|p_{\bm{\gamma}}) and D(p𝜸p0)D(p_{\bm{\gamma}}\|p_{0}) are uniformly bounded and twice continuously differentiable on Γ\Gamma. As p0Γp_{0}\notin\Gamma, we have D(p0p𝜸)>0D(p_{0}\|p_{\bm{\gamma}})>0 and D(p𝜸p0)>0D(p_{\bm{\gamma}}\|p_{0})>0 for every 𝜸Γ\bm{\gamma}\in\Gamma, which, together with the compactness of Γ\Gamma, implies that

min𝜸ΓD(p0p𝜸)>0andmin𝜸ΓD(p𝜸p0)>0.\displaystyle\min_{\bm{\gamma}\in\Gamma}D(p_{0}\|p_{\bm{\gamma}})>0\quad\mbox{and}\quad\min_{\bm{\gamma}\in\Gamma}D(p_{\bm{\gamma}}\|p_{0})>0. (54)

From [22, Theorem 2.7.2], D(p0p𝜸)D(p_{0}\|p_{\bm{\gamma}}) is strictly convex in (p0,𝜸)(p_{0},\bm{\gamma}), which, together with the fact that Γ\Gamma is compact and convex, implies the uniqueness of the minimizers to the two optimization problems in (54).

For Condition (A3), as 𝒳\mathcal{X} is a finite alphabet and Condition (A2’) holds, it can be easily checked that 𝔼[max𝜸Γ|ξ(𝜸)|2]<\mathbb{E}[\max_{\bm{\gamma}\in\Gamma}|\xi(\bm{\gamma})|^{2}]<\infty. Note that

𝜸ξ(𝜸)=(𝟙{X=1}γ1,,𝟙{X=d}γd).\nabla_{\bm{\gamma}}\xi(\bm{\gamma})=\Big{(}\frac{\mathbbm{1}\{X=1\}}{\gamma_{1}},\ldots,\frac{\mathbbm{1}\{X=d\}}{\gamma_{d}}\Big{)}^{\top}.

We can define the finite number x0:=max𝜸Γmaxi𝒳1/γi1/c0x_{0}:=\max_{\bm{\gamma}\in\Gamma}\max_{i\in\mathcal{X}}1/\gamma_{i}\leq 1/c_{0} (because Condition (A2’) mandates that mini=1,,dγic0>0\min_{i=1,\ldots,d}\gamma_{i}\geq c_{0}>0 for all 𝜸Γ)\bm{\gamma}\in\Gamma). With this choice, trivially, for all x>x0x>x_{0},

P0(max𝜸Γ|𝜸ξ(𝜸)|>x)=0,\displaystyle P_{0}\Bigg{(}\max_{\bm{\gamma}\in\Gamma}|\nabla_{\bm{\gamma}}\xi(\bm{\gamma})|>x\Bigg{)}=0,

which shows that Condition (A3) holds.

References

  • [1] J. Pan, Y. Li, and V. Y. F. Tan, “Asymptotics of sequential composite hypothesis testing under probabilistic constraints,” in IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 2021, pp. 172–177.
  • [2] R. Blahut, “Hypothesis testing and information theory,” IEEE Transactions on Information Theory, vol. 20, no. 4, pp. 405–417, 1974.
  • [3] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesis testing and changepoint detection.   CRC Press, 2014.
  • [4] J. Neyman and E. S. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society of London (Series A), vol. 231, pp. 289–337, 1933.
  • [5] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC), 2014.
  • [6] A. Wald and J. Wolfowitz, “Optimum character of the sequential probability ratio test,” Ann. Math. Statist., vol. 19, no. 3, pp. 326–339, 1948.
  • [7] A. Lalitha and T. Javidi, “Reliability of sequential hypothesis testing can be achieved by an almost-fixed-length test,” in IEEE International Symposium on Information Theory.   IEEE, 2016, pp. 1710–1714.
  • [8] M. Haghifam, V. Y. F. Tan, and A. Khisti, “Sequential classification with empirically observed statistics,” IEEE Transactions on Information Theory, vol. 67, no. 5, pp. 3095–3113, 2021.
  • [9] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio test optimal?” IEEE Transactions on Information Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
  • [10] T.-L. Lai, “Asymptotic optimality of generalized sequential likelihood ratio tests in some classical sequential testing problems,” Sequential Analysis, vol. 21, no. 4, pp. 219–247, 2002.
  • [11] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier hypothesis testing,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 4066–4082, 2014.
  • [12] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal sequential outlier hypothesis testing,” Sequential Analysis, vol. 36, no. 3, pp. 309–344, 2017.
  • [13] X. Li, J. Liu, and Z. Ying, “Generalized sequential probability ratio test for separate families of hypotheses,” Sequential Analysis, vol. 33, no. 4, pp. 539–563, 2014.
  • [14] V. Strassen, “Asymptotische Abschätzungen in Shannons Informationstheorie,” in Transactions of the Third Prague Conference on Information Theory, Statistical Decision Functions, Random Processes. Czechoslovak Academy of Sciences, Prague, 1962, pp. 689–723.
  • [15] V. Y. F. Tan, “Asymptotic estimates in information theory with non-vanishing error probabilities,” Foundations and Trends® in Communications and Information Theory, vol. 11, no. 1-2, pp. 1–184, 2014.
  • [16] Y. Li and V. Y. F. Tan, “Second-order asymptotics of sequential hypothesis testing,” IEEE Transactions on Information Theory, vol. 66, no. 11, pp. 7222–7230, 2020.
  • [17] A. W. van der Vaart, Asymptotic Statistics.   Cambridge University Press, 1998.
  • [18] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference.   Now Publishers Inc, 2008.
  • [19] A. R. Sampson, “Characterizing exponential family distributions by moment generating functions,” The Annals of Statistics, vol. 3, no. 3, pp. 747–753, 1975.
  • [20] H. J. Bierens, Modes of Convergence, ser. Themes in Modern Econometrics.   Cambridge University Press, 2004, pp. 137–178.
  • [21] R. Durrett, Probability: Theory and Examples.   Duxbury Press, 2004.
  • [22] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).   USA: Wiley-Interscience, 2006.
  • [23] S.-I. Amari and H. Nagaoka, Methods of Information Geometry, ser. Translations of Mathematical Monographs.   American Mathematical Society, 2007.
  • [24] Y. Polyanskiy, “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton University, 2010.
  • [25] A. C. Berry, “The accuracy of the Gaussian approximation to the sum of independent variates,” Transactions of the American Mathematical Society, vol. 49, no. 1, pp. 122–136, 1941.
  • [26] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge University Press, 2004.
  • [27] M. Spivak, Calculus On Manifolds: A Modern Approach To Classical Theorems Of Advanced Calculus.   Taylor & Francis Inc, 1971.
Jiachun Pan is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the National University of Singapore (NUS). She received the B.S. degree from the University of Electronic Science and Technology of China (UESTC) in 2015 and the M.Eng. degree from the University of Science and Technology of China (USTC) in 2019. Her research interests include information theory and statistical learning.
Yonglong Li is a research fellow at the Department of Electrical and Computer Engineering, National University of Singapore. He received the bachelor’s degree in Mathematics from Zhengzhou University in 2011 and the Ph.D. degree in Mathematics from the University of Hong Kong in 2015. From 2017 to 2019, he was a postdoctoral fellow at the Center for Memory and Recording Research (CMRR), University of California, San Diego.
Vincent Y. F. Tan (S’07-M’11-SM’15) was born in Singapore in 1981. He received the B.A. and M.Eng. degrees in electrical and information science from Cambridge University in 2005, and the Ph.D. degree in electrical engineering and computer science (EECS) from the Massachusetts Institute of Technology (MIT) in 2011. He is currently a Dean’s Chair Associate Professor with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore (NUS). His research interests include information theory, machine learning, and statistical signal processing. Dr. Tan is a member of the IEEE Information Theory Society Board of Governors. He was an IEEE Information Theory Society Distinguished Lecturer from 2018 to 2019. He received the MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize in 2011, the NUS Young Investigator Award in 2014, the Singapore National Research Foundation (NRF) Fellowship (Class of 2018), and the NUS Young Researcher Award in 2019. He is currently serving as a Senior Area Editor for the IEEE Transactions on Signal Processing and for the IEEE Transactions on Information Theory.