
A Numerical Approach to Sequential Multi-Hypothesis Testing for Bernoulli Model

Andrey Novikov, Universidad Autónoma Metropolitana - Unidad Iztapalapa, Avenida Ferrocarril San Rafael Atlixco 186, col. Leyes de Reforma 1A Sección, C.P. 09310, Cd. de México, México. Email: an@xanum.uam.mx
Abstract

In this paper we deal with the problem of sequential testing of multiple hypotheses. The main goal is minimizing the expected sample size (ESS) under restrictions on the error probabilities.

As the criterion of minimization, we take a weighted sum of the ESS's evaluated at some points of interest in the parameter space, and we minimize it under restrictions on the error probabilities.

We use a variant of the method of Lagrange multipliers which is based on the minimization of an auxiliary objective function (called Lagrangian) combining the objective function with the restrictions, taken with some constants called multipliers. Subsequently, the multipliers are used to make the solution comply with the restrictions.

We develop a computer-oriented method of minimization of the Lagrangian function that provides, depending on the specific choice of the parameter points, optimal tests in various concrete settings, such as the Bayesian setting, the Kiefer-Weiss setting, and others.

To exemplify the proposed methods for the particular case of sampling from a Bernoulli population, we develop a set of computer algorithms for designing sequential tests that minimize the Lagrangian function and for the numerical evaluation of test characteristics like the error probabilities, the ESS, and other related quantities. We implement the algorithms in the R programming language. The program code is available in a public GitHub repository.

For the Bernoulli model, we carried out a series of computer evaluations related to the optimality of sequential multi-hypothesis tests, in the particular case of three hypotheses. A numerical comparison with the matrix sequential probability ratio test is carried out.

A method of solution of the multi-hypothesis Kiefer-Weiss problem is proposed and applied to a particular case of three hypotheses in the Bernoulli model.

keywords:
sequential analysis; hypothesis testing; optimal stopping; optimal sequential tests; multiple hypotheses; SPRT; MSPRT
AMS subject classification:

62L10, 62L15, 62F03, 60G40, 62M02

1 Introduction

The problem of testing multiple hypotheses is one of the oldest problems in sequential analysis.

A traditional approach to this problem is Bayesian. It is based on the assumption that the hypotheses occur with some probabilities, called a priori probabilities (see Blackwell and Girshick 1954; Baum and Veeravalli 1994; Tartakovsky, Nikiforov, and Basseville 2015, among many others).

Although the optimal Bayesian solution can be characterized, at least theoretically, on the basis of general principles like dynamic programming or the theory of optimal stopping (Shiryaev 1978; Chow, Robbins, and Siegmund 1971), there seems to exist a strong belief that the theoretical solution is too complex to be useful for practical purposes (see, for example, Baum and Veeravalli 1994; Tartakovsky, Nikiforov, and Basseville 2015). An exception is the case of two simple hypotheses, where the solution is given by the classical sequential probability ratio test (Wald's SPRT, see Wald and Wolfowitz 1948).

For these reasons, approximate solutions of the problem have been proposed. One of the widely used tests, due to its simplicity, is the matrix sequential probability ratio test (MSPRT) by Armitage (1950). Tartakovsky, Nikiforov, and Basseville (2015) showed that the MSPRT is asymptotically optimal, as error probabilities go to 0.

Another approach that has received considerable attention through the decades is the so-called Kiefer-Weiss problem, consisting in the minimization of the maximum value of the expected sample size (ESS) over all possible parameter points (Kiefer and Weiss 1957). Lorden (1980) showed that the Kiefer-Weiss problem can be reduced to the minimization of the ESS evaluated at a specific parameter point, different from the hypothesized parameter values (the so-called modified Kiefer-Weiss problem), and (in essence) used the method of Lagrange multipliers to characterize the solutions to the modified Kiefer-Weiss problem.

A generalization of the Kiefer-Weiss problem to the case of multiple hypotheses has been formulated in Tartakovsky, Nikiforov, and Basseville (2015) (Section 5.3) and received an asymptotic treatment in Section 5.3.1, ibid.

In this paper, we propose an approach to optimal multi-hypothesis testing based on the minimization of the weighted ESS evaluated at parameter points not necessarily coinciding with the hypothesized values, and then use the method of Lagrange multipliers to reduce the problem to the minimization of the Lagrangian function. Depending on the choice of the points at which the ESS enters the Lagrangian function, we obtain, in particular, the Bayesian and the Kiefer-Weiss settings, among others.

We apply the method of Novikov (2009b) and characterize the sequential tests minimizing the Lagrangian function, for any choice of multipliers. For practical applications, we propose the use of numerical methods for the Lagrange minimization, the evaluation of the characteristics (the error probabilities, the ESS, etc.), and for finding the multiplier values to comply with the restrictions on the error probabilities.

We illustrate the proposed methods in the particular case of sampling from a Bernoulli population, where we develop a complete set of computer algorithms for all the numerical tasks described above and implement them in the R programming language. The program code is available in a public GitHub repository in Novikov (2023).

Using the developed software, we run a series of numerical comparisons related to optimal properties of sequential multi-hypothesis tests in the Bernoulli model.

First, we evaluate the performance characteristics of the MSPRT for a particular case of three hypotheses. The MSPRT is known to be asymptotically optimal as the error probabilities go to 0, so the evaluations we carry out give an idea of how small the error probabilities should be in order for the asymptotic formulas for the ESS to give a reasonably good approximation to the calculated values. We use $N=4000$, which, apparently, is sufficient for good approximations of the characteristics of the non-truncated MSPRTs.

Another comparison we carry out is also related to the MSPRT. For a number of error probability levels, we numerically find both the MSPRT and the optimal Bayesian test (for the uniform a priori distribution) matching the given error probabilities (up to some precision). The results show a very high efficiency of the MSPRT.

We also propose a method for solving a multi-hypothesis version of the Kiefer-Weiss problem and give a numerical example.

In Section 2, we adapt the results of Novikov (2009b) to the problem of minimization of weighted ESS calculated at arbitrary parameter points. In Section 3, we derive computational formulas for the Bernoulli model. Numerical results are presented in Section 4. Section 5 is a brief list of the results and suggestions for further work.

2 Optimal sequential multi-hypothesis tests

In this section, we formulate some settings for the problem of optimal multi-hypothesis testing and use the general results of Novikov (2009b) for characterisation of the respective optimal solutions.

We assume that independent and identically distributed (i.i.d.) observations $X_{1},X_{2},\dots,X_{n},\dots$ are potentially available to the statistician on a one-by-one basis, providing information about the unknown distribution of the data. Let us denote it $P_{\theta}$, where $\theta$ is a parameter identifying the distribution in a unique manner. We are concerned with the problem of distinguishing between a finite number of simple hypotheses $H_{1}:\theta=\theta_{1}$, $H_{2}:\theta=\theta_{2}$, \dots, $H_{k}:\theta=\theta_{k}$, $k\geq 2$.

We follow Novikov (2009b) in the notation and general assumptions.

In particular, we consider a sequential multi-hypothesis test as a pair $\langle\psi,\phi\rangle$ of a stopping rule $\psi=(\psi_{1},\psi_{2},\dots,\psi_{n},\dots)$ and a (terminal) decision rule $\phi=(\phi_{1},\phi_{2},\dots,\phi_{n},\dots)$.

The elements of the stopping rule $\psi_{n}=\psi_{n}(x_{1},\dots,x_{n})$ are measurable functions taking values in $[0,1]$, where the value at $(x_{1},\dots,x_{n})$ is interpreted as the conditional probability, given the observations, to stop (randomization at the stopping time).

The elements of the decision rule $\phi_{n}=\phi_{n}(x_{1},\dots,x_{n})$ are measurable functions of the observations such that $\phi_{n}=(\phi_{n}^{1},\dots,\phi_{n}^{k})$, with $\phi_{n}^{j}\geq 0$ and $\sum_{j=1}^{k}\phi_{n}^{j}(x_{1},\dots,x_{n})\equiv 1$. Given the data $(x_{1},\dots,x_{n})$ observed, $\phi_{n}^{j}(x_{1},\dots,x_{n})$ is interpreted as the conditional probability to accept hypothesis $H_{j}$, $j=1,\dots,k$ (randomization at the decision time).

The sequential test starts by observing $X_{1}=x_{1}$ (stage $n=1$). At each stage $n=1,2,\dots$ the test procedure stops with probability $\psi_{n}(x_{1},\dots,x_{n})$, given that $X_{1}=x_{1},\dots,X_{n}=x_{n}$ are observed, and proceeds to taking a terminal decision. If it does not stop, the test takes one additional observation $X_{n+1}=x_{n+1}$ and goes to stage $n+1$, and so on, until the process eventually stops. When the test stops at some stage $n$ (this $n$ is called the stopping time), a terminal decision is taken, accepting hypothesis $H_{j}$ with probability $\phi_{n}^{j}(x_{1},\dots,x_{n})$, conditionally on $(x_{1},\dots,x_{n})$. Let us denote by $\tau_{\psi}$ the stopping time (as a random variable) generated by the described process.

Let

s_{n}^{\psi}=s_{n}^{\psi}(x_{1},\dots,x_{n})=(1-\psi_{1}(x_{1}))\dots(1-\psi_{n-1}(x_{1},\dots,x_{n-1}))\psi_{n}(x_{1},\dots,x_{n})

($s_{1}^{\psi}(x_{1})=\psi_{1}(x_{1})$ by definition).

Then the expected sample size (ESS) of the test procedure is defined as

E_{\theta}\tau_{\psi}=\sum_{n=1}^{\infty}nE_{\theta}s_{n}^{\psi}=\sum_{n=1}^{\infty}nE_{\theta}s_{n}^{\psi}(X_{1},\dots,X_{n}),

provided that $\sum_{n=1}^{\infty}E_{\theta}s_{n}^{\psi}=1$; otherwise it is infinite by definition. Here and throughout the paper, $E_{\theta}$ denotes mathematical expectation with respect to $P_{\theta}$. We also use $s_{n}^{\psi}$ (without arguments) both for $s_{n}^{\psi}(x_{1},x_{2},\dots,x_{n})$ and for $s_{n}^{\psi}(X_{1},X_{2},\dots,X_{n})$, depending on the context, and we do the same with other functions like $\psi_{n}$, $\phi_{n}$, etc.

Other characteristics of a sequential test $\langle\psi,\phi\rangle$ are the error probabilities defined as

\alpha_{ij}(\psi,\phi)=\sum_{n=1}^{\infty}E_{\theta_{i}}s_{n}^{\psi}\phi_{n}^{j},\quad 1\leq i\not=j\leq k.

Another natural way to define error probabilities is less detailed:

\alpha_{i}(\psi,\phi)=\sum_{n=1}^{\infty}E_{\theta_{i}}s_{n}^{\psi}(1-\phi_{n}^{i})=\sum_{j:j\not=i}\alpha_{ij}(\psi,\phi),\quad 1\leq i\leq k.

In the case of two hypotheses the definitions are equivalent.

For $k=2$, the classical result of Wald and Wolfowitz (1948) states that the sequential probability ratio test (SPRT) minimizes both $E_{\theta_{1}}\tau_{\psi}$ and $E_{\theta_{2}}\tau_{\psi}$ in the class of sequential tests $\langle\psi,\phi\rangle$ such that

\alpha_{1}(\psi,\phi)\leq\alpha_{1},\quad\alpha_{2}(\psi,\phi)\leq\alpha_{2},

where $\alpha_{1}$ and $\alpha_{2}$ are the error probabilities of the SPRT.

To the best of our knowledge, no direct generalizations of this result exist for $k>2$. For this reason, we propose weaker settings.

Let us choose some parameter points $\vartheta_{i}$, $i=1,\dots,K$, and weights $\gamma_{i}$, $i=1,\dots,K$, which are non-negative numbers such that $\sum_{i=1}^{K}\gamma_{i}=1$, $K\geq 1$. Formally, we propose to minimize the weighted ESS

C_{\gamma,\vartheta}(\psi)=\sum_{i=1}^{K}\gamma_{i}E_{\vartheta_{i}}\tau_{\psi} (1)

over all sequential multi-hypothesis tests subject to

\alpha_{ij}(\psi,\phi)\leq\alpha_{ij},\quad 1\leq i,j\leq k,\quad i\not=j, (2)

or to

\alpha_{i}(\psi,\phi)\leq\alpha_{i},\quad 1\leq i\leq k, (3)

where $\alpha_{ij}$ and $\alpha_{i}$ are some positive numbers.

To support this formulation, let us refer to the very practical context of optimal group-sequential testing in the case of two hypotheses. For testing the mean of a normal distribution with known variance, Eales and Jennison (1992) considered five settings for the ESS minimization under restrictions on the error probabilities. Four of them, namely $F_{1}$ to $F_{4}$ (see Eales and Jennison 1992), are of type (1), with different choices of $K$, $\vartheta_{i}$ and $\gamma_{i}$. $F_{5}$ is also a kind of weighted ESS, but of a continuous type; it could well be treated by our method, but for the time being stays beyond the scope of this paper. Generalizations of these settings to the case of more than two hypotheses and infinite horizons are straightforward.

Given that the formulated problem is a problem of minimization under restrictions, we want to use the Lagrange multipliers method. By the principle of the Lagrange method, to minimize $C_{\gamma,\vartheta}$ under restrictions (2) one should be able to minimize the Lagrangian function

L(\psi,\phi)=C_{\gamma,\vartheta}(\psi)+\sum_{1\leq i\not=j\leq k}\lambda_{ij}\alpha_{ij}(\psi,\phi), (4)

with any constant multipliers $\lambda_{ij}\geq 0$, and then find the values of the multipliers for which equalities in (2) hold. Respectively, the problem of minimization under conditions (3) reduces to the minimization of

L(\psi,\phi)=C_{\gamma,\vartheta}(\psi)+\sum_{1\leq i\leq k}\lambda_{i}\alpha_{i}(\psi,\phi), (5)

with multipliers $\lambda_{i}$, $i=1,\dots,k$, and finding the values of $\lambda_{i}$ for which equalities in (3) hold. It is easy to see that (5) is a particular case of (4) with $\lambda_{ij}=\lambda_{i}$ for all $j=1,2,\dots,k$, $j\not=i$, so in what follows we focus on the minimization of (4).

It is not difficult to see that in the particular case when $\theta_{i}=\vartheta_{i}$, $i=1,2,\dots,k=K$, the Lagrangian function (4) can be considered a Bayesian risk (see, for example, Baum and Veeravalli 1994, among many others) corresponding to the a priori distribution $(\gamma_{1},\dots,\gamma_{k})$ on the set of parameter points $\{\theta_{1},\dots,\theta_{k}\}$, where $\lambda_{ij}/\gamma_{i}$ can be interpreted as the conditional loss from accepting $H_{j}$ when $H_{i}$ is true. Thus, the minimization of (4) readily solves the problem of optimal Bayesian tests for $k$ hypotheses.

The well-known modified Kiefer-Weiss problem (see, for example, Lorden 1980) also easily embeds into this scheme by taking $\gamma_{1}=1$, $K=1$, and $\vartheta_{1}$ between the hypothesized values $\theta_{1}$ and $\theta_{2}$ (with $k=2$). This also gives rise to a multi-hypothesis version of the Kiefer-Weiss problem, starting from a modified version of it, with $\vartheta_{1},\dots,\vartheta_{k-1}$ such that $\theta_{1}<\vartheta_{1}<\theta_{2}<\vartheta_{2}<\dots<\vartheta_{k-1}<\theta_{k}$ and with some weights $\gamma_{1},\gamma_{2},\dots,\gamma_{k-1}$, adding up to 1, as additional parameters. To our knowledge, there are no non-asymptotic solutions of the multi-hypothesis Kiefer-Weiss problem, and this construction could serve as a basis for one.

Now, let us characterize the tests which minimize the Lagrangian function (4), for a given set of multipliers. It is worth noting that $L(\psi,\phi)$ implicitly depends on the Lagrange multipliers, therefore all the constructions below will also (implicitly) depend on $\lambda_{ij}$, as well as on other elements of the problem setting, like $\theta_{i}$ and $\vartheta_{i}$, etc.

First of all, in a very standard way it can be shown that there is a universal decision rule $\phi$ that minimizes $L(\psi,\phi)$ for any fixed $\psi$ (see Novikov 2009b).

Let us assume that $P_{\theta}$ is absolutely continuous with respect to a $\sigma$-finite measure $\mu$ and denote by $f_{\theta}$ its Radon-Nikodym derivative. Also denote $f_{\theta}^{n}=f_{\theta}^{n}(x_{1},\dots,x_{n})=\prod_{i=1}^{n}f_{\theta}(x_{i})$, and let $f^{n}_{\gamma\vartheta}=\sum_{i=1}^{K}\gamma_{i}f_{\vartheta_{i}}^{n}$. Define

v_{n}=\min_{1\leq j\leq k}\sum_{i:i\not=j}\lambda_{ij}f_{\theta_{i}}^{n}. (6)

Let a decision rule $\phi$ be such that

\phi_{n}^{j}=0\quad\mbox{whenever}\quad\sum_{i:i\not=j}\lambda_{ij}f_{\theta_{i}}^{n}>v_{n} (7)

(in the case of equality in (7), $\phi_{n}^{j}$ can be arbitrarily randomized among those $j$ sharing this equality, with the only requirement that $\sum_{j=1}^{k}\phi_{n}^{j}\equiv 1$). It follows from Theorem 3 in Novikov (2009b) that

L(\psi)=\inf_{\phi}L(\psi,\phi)=\sum_{n=1}^{\infty}\int s_{n}^{\psi}\left(nf_{\gamma\vartheta}^{n}+v_{n}\right)d\mu^{n}, (8)

and we have an optimal stopping problem of minimizing (8) over the stopping rules $\psi$.

The problem is first solved in the class of truncated tests, i.e. those not taking more than a finite number $N$ of observations. Let $\mathcal{S}^{N}$ be the set of all stopping rules such that $(1-\psi_{1})\dots(1-\psi_{N})\equiv 0$.

Let us define the operator $\mathcal{I}_{n}$ in the following way. For any measurable non-negative function $v=v(x_{1},\dots,x_{n})$ let

\mathcal{I}_{n}v=(\mathcal{I}_{n}v)(x_{1},\dots,x_{n-1})=\int v(x_{1},\dots,x_{n})d\mu(x_{n}).

Now, starting from

V_{N}^{N}\equiv v_{N},

define recursively over $n=N,N-1,\dots,2$

V_{n-1}^{N}=\min\{v_{n-1},f_{\gamma\vartheta}^{n-1}+\mathcal{I}_{n}V_{n}^{N}\}.

Then for any $\psi\in\mathcal{S}^{N}$

L(\psi)\geq 1+\mathcal{I}_{1}V_{1}^{N}, (9)

and there is equality in (9) if for all $n=1,2,\dots,N-1$

\psi_{n}=I_{\{v_{n}\leq f_{\gamma\vartheta}^{n}+\mathcal{I}_{n+1}V_{n+1}^{N}\}}, (10)

where $I_{A}$ denotes the indicator function of the event $A$. In this way, the stopping rule $\psi$ in (10) minimizes $L(\psi)$ in the class of truncated stopping rules $\mathcal{S}^{N}$. Any $\psi_{n}$ may be arbitrarily randomized between samples $(x_{1},\dots,x_{n})$ for which there is equality in the inequality under the indicator function in (10); this gives the same value of $L(\psi)$. The details can be found in Novikov (2009b).

The optimal non-truncated tests can be found by passing to the limit, as $N\to\infty$, provided that

\int v_{n}d\mu^{n}\to 0,\quad\mbox{as}\quad n\to\infty, (11)

(see Remark 7 in Novikov 2009b). In the case of i.i.d. observations we are considering in this paper, (11) holds without any additional conditions. The formal proof of this fact can be found in the Appendix.

The construction of the optimal non-truncated test is as follows. First of all, it is easy to see that $V_{n}^{N+1}\leq V_{n}^{N}$, so there exists $V_{n}=\lim_{N\to\infty}V_{n}^{N}$, $n=1,2,\dots$ Then it follows from (9) that

L(\psi)\geq 1+\mathcal{I}_{1}V_{1}, (12)

and the right-hand side of (12) is attained if

\psi_{n}=I_{\{v_{n}\leq f_{\gamma\vartheta}^{n}+\mathcal{I}_{n+1}V_{n+1}\}} (13)

for all $n=1,2,\dots$ In this way, we obtain tests $\langle\psi,\phi\rangle$ with $\psi$ satisfying (13) and $\phi$ satisfying (7) which minimize the Lagrangian function $L(\psi,\phi)$.

We propose using numerical methods for the construction of the truncated tests minimizing the Lagrangian function. For the Bernoulli model, we develop numerical algorithms for this task and implement them in the form of a computer program in the R programming language. Having the means for minimizing the Lagrangian function, to obtain optimal sequential tests in the conditional setting (i.e. those minimizing $C_{\gamma,\vartheta}$ under conditions (2)) we need to find Lagrange multipliers $\lambda_{ij}$, $1\leq i\not=j\leq k$, providing a test (7)-(10) for which equalities in (2) hold. Respectively, the minimization of $C_{\gamma,\vartheta}$ under conditions (3) reduces to finding $\lambda_{i}$, $i=1,\dots,k$, such that the test in (7)-(10), with $\lambda_{ij}=\lambda_{i}$ for $1\leq j\not=i\leq k$, satisfies all the equalities in (3).

In no way can one be sure that such $\lambda_{ij}$ exist for every combination of $\alpha_{ij}$ (not even in the classical case of two hypotheses). On the other hand, every combination of $\lambda_{ij}$ employed in (7)-(10) produces an optimal test $\langle\psi,\phi\rangle$ in the conditional setting, if one takes its error probabilities as the $\alpha_{ij}$ in (2) (i.e. $\alpha_{ij}=\alpha_{ij}(\psi,\phi)$) or, respectively, as the $\alpha_{i}$ in (3) (that is, $\alpha_{i}=\alpha_{i}(\psi,\phi)$).

Having at hand a computer program for the Lagrangian minimization, finding the multipliers that provide a tolerable level of the error probabilities is a matter of some trial-and-error look-ups, because, grosso modo, larger values of $\lambda_{ij}$ make $\alpha_{ij}$ smaller. Alternatively, general-purpose algorithms of numerical optimization, for example the method of Nelder and Mead (1965), can be used to get as close as possible to the desired values of $\alpha_{ij}$ by adjusting the input values of $\lambda_{ij}$.

For the non-truncated tests, we propose using approximations by truncated tests. We illustrate this technique in the particular case of the Bernoulli distribution in the subsequent sections.

3 Optimal sequential tests for sampling from a Bernoulli population

In this section, we apply the general results of Section 2 to the model of Bernoulli observations. In this way we obtain a complete set of computer algorithms for computing the tests that minimize the Lagrangian function, and their numerical characteristics, in the Bernoulli model. For the determination of the values of the Lagrange multipliers general-purpose computer algorithms will be used.

3.1 Construction of optimal tests

We apply the results of Section 2 to the model of sampling from a Bernoulli population, in which case $f_{\theta}(x)=\theta^{x}(1-\theta)^{1-x}$, $x=0,1$, and $f_{\theta}^{n}(x_{1},\dots,x_{n})=\theta^{s_{n}}(1-\theta)^{n-s_{n}}$ with $s_{n}=\sum_{i=1}^{n}x_{i}$.

Let

g_{\theta}^{n}(s)={n\choose s}\theta^{s}(1-\theta)^{n-s},\quad 0\leq s\leq n,

be the probability mass function corresponding to the sufficient statistic $S_{n}=\sum_{i=1}^{n}X_{i}$ (the binomial distribution with parameters $n$ and $\theta$). Define

u_{n}=u_{n}(s)=\min_{1\leq j\leq k}\sum_{i:i\not=j}\lambda_{ij}g_{\theta_{i}}^{n}(s),\quad 0\leq s\leq n, (14)

and let

g_{\gamma\vartheta}^{n}(s)=\sum_{i=1}^{K}\gamma_{i}g_{\vartheta_{i}}^{n}(s),\quad 0\leq s\leq n.

Let us define the operator $\mathcal{J}_{n}$, acting on any function $U(s)$, $0\leq s\leq n$, as

\mathcal{J}_{n}U(s)=U(s)\frac{n-s}{n}+U(s+1)\frac{s+1}{n},\quad 0\leq s\leq n-1, (15)

for $n=2,3,\dots$ Starting from

U_{N}^{N}(s)=u_{N}(s),\quad 0\leq s\leq N,

define recursively for $n=N-1,N-2,\dots,1$

U_{n}^{N}(s)=\min\left\{u_{n}(s),\,g_{\gamma\vartheta}^{n}(s)+\mathcal{J}_{n+1}U_{n+1}^{N}(s)\right\},\quad 0\leq s\leq n. (16)
Proposition 3.1.

For $m=1,2,\dots,N-1$,

\mathcal{J}_{m+1}U_{m+1}^{N}(s_{m})={m\choose s_{m}}\mathcal{I}_{m+1}V^{N}_{m+1}(x_{1},\dots,x_{m}), (17)

where $s_{m}=\sum_{i=1}^{m}x_{i}$.

Proof. By induction over $m=N-1,N-2,\dots,1$. For $m=N-1$ we have

\mathcal{J}_{m+1}U_{m+1}^{N}(s_{m})=\mathcal{J}_{N}U_{N}^{N}(s_{N-1})=\mathcal{J}_{N}u_{N}(s_{N-1})
=u_{N}(s_{N-1})\frac{N-s_{N-1}}{N}+u_{N}(s_{N-1}+1)\frac{s_{N-1}+1}{N}
={N\choose s_{N-1}}v_{N}(x_{1},\dots,x_{N}=0)\frac{N-s_{N-1}}{N}+{N\choose s_{N-1}+1}v_{N}(x_{1},\dots,x_{N}=1)\frac{s_{N-1}+1}{N}
={N-1\choose s_{N-1}}\mathcal{I}_{N}V^{N}_{N}(x_{1},\dots,x_{N-1})={m\choose s_{m}}\mathcal{I}_{m+1}V^{N}_{m+1}(x_{1},\dots,x_{m}).

Let us suppose now that (17) holds for some $m=n\leq N-1$. Then for $m=n-1$

\mathcal{J}_{m+1}U_{m+1}^{N}(s_{m})=\mathcal{J}_{n}U_{n}^{N}(s_{n-1})
=U_{n}^{N}(s_{n-1})\frac{n-s_{n-1}}{n}+U_{n}^{N}(s_{n-1}+1)\frac{s_{n-1}+1}{n}
={n\choose s_{n-1}}V_{n}^{N}(x_{1},\dots,x_{n}=0)\frac{n-s_{n-1}}{n}+{n\choose s_{n-1}+1}V_{n}^{N}(x_{1},\dots,x_{n}=1)\frac{s_{n-1}+1}{n}
={n-1\choose s_{n-1}}\mathcal{I}_{n}V^{N}_{n}(x_{1},\dots,x_{n-1})={m\choose s_{m}}\mathcal{I}_{m+1}V^{N}_{m+1}(x_{1},\dots,x_{m}).

\Box

It is easy to see that the optimal decision rule (7) can be expressed in terms of the sufficient statistic $s_{n}$:

\phi_{n}^{j}(s_{n})=0\quad\text{whenever}\quad\sum_{i:i\not=j}\lambda_{ij}g_{\theta_{i}}^{n}(s_{n})>u_{n}(s_{n}), (18)

and it follows from Proposition 3.1 that so can the optimal truncated stopping rule (10):

\psi_{n}(s_{n})=I_{\{u_{n}\leq g_{\gamma\vartheta}^{n}+\mathcal{J}_{n+1}U_{n+1}^{N}\}}(s_{n}), (19)

for $n=1,2,\dots,N-1$, and the optimal non-truncated one as

\psi_{n}(s_{n})=I_{\{u_{n}\leq g_{\gamma\vartheta}^{n}+\mathcal{J}_{n+1}U_{n+1}\}}(s_{n}), (20)

with $U_{n}=\lim_{N\to\infty}U_{n}^{N}$ for all $n=1,2,\dots$

Formulas (18)-(19) provide a truncated test which has an exact optimality property (neither asymptotic nor approximate), whatever be $k\geq 2$, $\theta_{1},\dots,\theta_{k}$, $\gamma_{1},\dots,\gamma_{K}$, $\vartheta_{1},\dots,\vartheta_{K}$, $K\geq 1$, $N\geq 2$, and the Lagrange multipliers $\lambda_{ij}\geq 0$, $1\leq i\not=j\leq k$.

Furthermore, they suggest a computational algorithm for evaluating the elements of the optimal sequential test: start from step $N$, calculating $\phi_{N}$ for all $0\leq s\leq N$ (which is based on weighted sums of binomial probabilities with parameters $N$ and $\theta_{i}$, $i=1,2,\dots,k$, according to (18)), and recurrently use (16) for steps $n=N-1,N-2,\dots,1$ to calculate $U_{n}^{N}(s)$ for all $0\leq s\leq n$, marking those $s$ for which

u_{n}(s)>g_{\gamma\vartheta}^{n}(s)+\mathcal{J}_{n+1}U_{n+1}^{N}(s)

as belonging to the continuation region (by virtue of (19)); for all other $s$, storing the terminal decision based on (18) as that corresponding to $s$.
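To make this backward induction concrete, the following R sketch implements (14)-(16) together with the continuation criterion (19) directly from the formulas above. It is only an illustrative reimplementation under our own naming (the function backward_induction is not part of the published package); it returns the continuation regions, while the terminal decisions follow from (18).

```r
# Illustrative sketch of the backward induction (14)-(16) and rule (19)
# for the Bernoulli model (not the published OptTest function).
# theta: hypothesized success probabilities (length k)
# lambda: k x k matrix of Lagrange multipliers (diagonal unused)
# vartheta, gamma: points and weights entering the criterion (1)
# N: truncation level
backward_induction <- function(theta, lambda, vartheta, gamma, N) {
  k <- length(theta)
  u_fun <- function(n) {                       # u_n(s), s = 0..n, as in (14)
    G <- sapply(theta, function(th) dbinom(0:n, n, th))   # (n+1) x k
    w <- sapply(1:k, function(j) {
      others <- setdiff(1:k, j)
      as.vector(G[, others, drop = FALSE] %*% lambda[others, j])
    })
    apply(w, 1, min)
  }
  g_fun <- function(n) {                       # g_{gamma,vartheta}^n(s)
    G <- cbind(sapply(vartheta, function(v) dbinom(0:n, n, v)))
    as.vector(G %*% gamma)
  }
  U <- u_fun(N)                                # U_N^N(s) = u_N(s)
  continue <- vector("list", N - 1)
  for (n in (N - 1):1) {
    s <- 0:n
    JU <- U[s + 1] * (n - s + 1) / (n + 1) + U[s + 2] * (s + 1) / (n + 1)  # J_{n+1}, cf. (15)
    cost_cont <- g_fun(n) + JU
    continue[[n]] <- s[u_fun(n) > cost_cont]   # continuation region, cf. (19)
    U <- pmin(u_fun(n), cost_cont)             # recursion (16)
  }
  continue                                     # continuation sets for n = 1..N-1
}
```

For instance, backward_induction(c(0.3, 0.4, 0.5), matrix(100, 3, 3), c(0.3, 0.4, 0.5), rep(1/3, 3), 200) would compute the continuation regions of a Bayesian-type test with uniform weights, truncated at N = 200 (the numbers are purely illustrative).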

We implemented this algorithm in the form of a function in the R programming language (R Core Team 2013); the source code is available in a public GitHub repository in Novikov (2023). The documentation can be found in the repository.

Making $N$ large enough, we can approximate the optimal non-truncated test corresponding to (20). In particular, this can be helpful when the optimal infinite-horizon test is in fact truncated. This happens, for example, in the case of the modified Kiefer-Weiss problem, corresponding (in our notation) to the case of two hypotheses with $\theta_{1}<\vartheta_{1}<\theta_{2}$, $k=2$, $K=1$ (see Lorden 1980). Below, in Section 4, we give another example of this possibility, in a multi-hypothesis context.

Although the test obtained in this subsection does not have a closed form (instead, all the values of the optimal rules (18)-(19) are stored in computer memory), we believe it can be quite practical for many applications which do not require more than some thousands of steps. If they do, one could try the algorithm with the maximum number of steps one's computer will withstand, to see if the performance requirements can be met with that reduced number of steps. If not, more computing power might be needed.

3.2 Evaluation of performance characteristics

We derive in this part computational formulas for performance characteristics of sequential multi-hypothesis tests for the Bernoulli model.

Let $\langle\psi,\phi\rangle$ be any sequential multi-hypothesis test based on the sufficient statistics: $\psi_{n}=\psi_{n}(s_{n})$, $\phi_{n}=\phi_{n}(s_{n})$, with $\psi\in\mathcal{S}^{N}$. The test $\langle\psi,\phi\rangle$ is arbitrary but will be held fixed throughout this subsection, so it is suppressed in the notation.

Proposition 3.2.

Define

a_{j}^{N}(s;\theta)=g_{\theta}^{N}(s)\phi_{N}^{j}(s),\quad s=0,1,\dots,N,\quad j=1,2,\dots,k, (21)

and, recursively over $n=N-1,N-2,\dots,1$,

a_{j}^{n}(s;\theta)=g_{\theta}^{n}(s)\psi_{n}(s)\phi_{n}^{j}(s)+\left(a_{j}^{n+1}(s;\theta)\frac{n+1-s}{n+1}+a_{j}^{n+1}(s+1;\theta)\frac{s+1}{n+1}\right)(1-\psi_{n}(s)),

$s=0,1,\dots,n$, $j=1,\dots,k$.

Then the probability to accept hypothesis $H_{j}$, given that the true parameter value is $\theta$, can be calculated as $a_{j}^{0}(\theta)=a_{j}^{1}(0;\theta)+a_{j}^{1}(1;\theta)$. In particular, $\alpha_{ij}(\psi,\phi)=a_{j}^{0}(\theta_{i})$, $i\not=j$.

Proof. Let us denote by $A_{j}^{n}=A_{j}^{n}(\psi,\phi)$ the event that hypothesis $H_{j}$ is accepted at or after step $n$ (following the rules of the test $\langle\psi,\phi\rangle$), $n=1,2,\dots,N$.

Let us first prove, by induction over $n=N,N-1,\dots,1$, that

a_{j}^{n}(S_{n};\theta)=P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n})\,g_{\theta}^{n}(S_{n}). (22)

For $n=N$, (22) follows from (21) and the definition of the decision rule $\phi$.

Let us suppose now that (22) holds for some $n\leq N$. Then

a_{j}^{n-1}(S_{n-1};\theta)=g_{\theta}^{n-1}(S_{n-1})\psi_{n-1}(S_{n-1})\phi_{n-1}^{j}(S_{n-1})+\Big[a_{j}^{n}(S_{n-1};\theta)\frac{n-S_{n-1}}{n}+a_{j}^{n}(S_{n-1}+1;\theta)\frac{S_{n-1}+1}{n}\Big](1-\psi_{n-1}(S_{n-1})). (23)

But, by the supposition,

a_{j}^{n}(S_{n-1};\theta)\frac{n-S_{n-1}}{n}+a_{j}^{n}(S_{n-1}+1;\theta)\frac{S_{n-1}+1}{n}
=P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n-1},X_{n}=0)\,g_{\theta}^{n}(S_{n-1})\frac{n-S_{n-1}}{n}+P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n-1},X_{n}=1)\,g_{\theta}^{n}(S_{n-1}+1)\frac{S_{n-1}+1}{n}
=\big(P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n-1},X_{n}=0)(1-\theta)+P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n-1},X_{n}=1)\theta\big)\,g_{\theta}^{n-1}(S_{n-1})
=P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n-1})\,g_{\theta}^{n-1}(S_{n-1}).

Therefore, the right-hand side of (23) equals

\left(\psi_{n-1}\phi_{n-1}^{j}+P_{\theta}(A_{j}^{n}|X_{1},\dots,X_{n-1})(1-\psi_{n-1})\right)g_{\theta}^{n-1}(S_{n-1})=P_{\theta}(A_{j}^{n-1}|X_{1},\dots,X_{n-1})\,g_{\theta}^{n-1}(S_{n-1}).

Now that (22) is proved, we apply it for $n=1$ and obtain

a_{j}^{1}(1;\theta)=P_{\theta}(A_{j}^{1}|X_{1}=1)\,\theta\quad\text{and}\quad a_{j}^{1}(0;\theta)=P_{\theta}(A_{j}^{1}|X_{1}=0)\,(1-\theta),

thus,

a_{j}^{1}(0;\theta)+a_{j}^{1}(1;\theta)=P_{\theta}(A_{j}^{1}|X_{1}=1)\,\theta+P_{\theta}(A_{j}^{1}|X_{1}=0)\,(1-\theta)=P_{\theta}(A_{j}^{1})=a_{j}^{0}(\theta).

\Box

In an analogous way, characteristics of sample number can be treated.

Proposition 3.3.

For any stopping rule $\psi$, define, for any $m\geq 1$,

b_{m}^{m}(s;\theta)=g_{\theta}^{m}(s)(1-\psi_{m}(s)),\quad s=0,1,\dots,m, (24)

and, recursively over $n=m-1,m-2,\dots,1$,

b_{n}^{m}(s;\theta)=\left(b_{n+1}^{m}(s;\theta)\frac{n+1-s}{n+1}+b_{n+1}^{m}(s+1;\theta)\frac{s+1}{n+1}\right)(1-\psi_{n}(s)), (25)

$s=0,1,\dots,n$. Then $P_{\theta}(\tau_{\psi}>m)=b_{1}^{m}(0;\theta)+b_{1}^{m}(1;\theta)$.

Proof. Let us denote by $B_{n}^{m}=B_{n}^{m}(\psi)$, $n=1,2,\dots,m$, the event that the test following the stopping rule $\psi$ does not stop at any step between $n$ and $m$, inclusive.

Let us first prove, by induction over $n=m,m-1,\dots,1$, that

b_{n}^{m}(S_{n};\theta)=P_{\theta}(B_{n}^{m}|X_{1},\dots,X_{n})\,g_{\theta}^{n}(S_{n}). (26)

For $n=m$, (26) follows from (24). Let us suppose now that (26) holds for some $n\leq m$. Then

b_{n-1}^{m}(S_{n-1};\theta)=\left(b_{n}^{m}(S_{n-1};\theta)\frac{n-S_{n-1}}{n}+b_{n}^{m}(S_{n-1}+1;\theta)\frac{S_{n-1}+1}{n}\right)(1-\psi_{n-1})
=\Big[P_{\theta}(B_{n}^{m}|X_{1},\dots,X_{n-1},X_{n}=0)\,g_{\theta}^{n}(S_{n-1})\frac{n-S_{n-1}}{n}+P_{\theta}(B_{n}^{m}|X_{1},\dots,X_{n-1},X_{n}=1)\,g_{\theta}^{n}(S_{n-1}+1)\frac{S_{n-1}+1}{n}\Big](1-\psi_{n-1})
=\Big[P_{\theta}(B_{n}^{m}|X_{1},\dots,X_{n-1},X_{n}=0)(1-\theta)+P_{\theta}(B_{n}^{m}|X_{1},\dots,X_{n-1},X_{n}=1)\theta\Big](1-\psi_{n-1})\,g_{\theta}^{n-1}(S_{n-1})
=P_{\theta}(B_{n}^{m}|X_{1},\dots,X_{n-1})(1-\psi_{n-1})\,g_{\theta}^{n-1}(S_{n-1})=P_{\theta}(B_{n-1}^{m}|X_{1},\dots,X_{n-1})\,g_{\theta}^{n-1}(S_{n-1}).

Now that (26) is proved, we apply it for $n=1$ and obtain

b_{1}^{m}(1;\theta)=P_{\theta}(B_{1}^{m}|X_{1}=1)\,\theta\quad\text{and}\quad b_{1}^{m}(0;\theta)=P_{\theta}(B_{1}^{m}|X_{1}=0)\,(1-\theta),

thus,

b_{1}^{m}(0;\theta)+b_{1}^{m}(1;\theta)=P_{\theta}(B_{1}^{m}|X_{1}=1)\,\theta+P_{\theta}(B_{1}^{m}|X_{1}=0)\,(1-\theta)=P_{\theta}(B_{1}^{m})=P_{\theta}(\tau_{\psi}>m).

\Box

It follows from Proposition 3.3 that if $\psi\in\mathcal{S}^{N}$, then

E_{\theta}\tau_{\psi}=\sum_{m=1}^{N}P_{\theta}(\tau_{\psi}\geq m)=1+\sum_{m=1}^{N-1}\left(b_{1}^{m}(0;\theta)+b_{1}^{m}(1;\theta)\right). (27)

If the stopping rule $\psi$ is not truncated, we can use (27) to approximate $E_{\theta}\tau_{\psi}$, noting that $E_{\theta}\min\{\tau_{\psi},N\}\to E_{\theta}\tau_{\psi}$ as $N\to\infty$, by the monotone convergence theorem, and that $\min\{\tau_{\psi},N\}$ corresponds to the truncated rule $\psi^{N}=(\psi_{1},\dots,\psi_{N-1},1,\dots)\in\mathcal{S}^{N}$. Applying (27) to $\psi^{N}$, we see that $E_{\theta}\min\{\tau_{\psi},N\}=1+\sum_{m=1}^{N-1}\left(b_{1}^{m}(0;\theta)+b_{1}^{m}(1;\theta)\right)$, thus

E_{\theta}\tau_{\psi}=1+\sum_{m=1}^{\infty}\left(b_{1}^{m}(0;\theta)+b_{1}^{m}(1;\theta)\right).

When dealing with expectations, a more direct way to evaluate (27) is to incorporate the summation in (27) into the inductive evaluations in (25). This is done in the following proposition.

Proposition 3.4.

For a stopping rule $\psi$, define

c_{N}^{N}(s;\theta)=g_{\theta}^{N}(s)(1-\psi_{N}(s)),\quad s=0,1,\dots,N,

and, recursively over $n=N-1,N-2,\dots,1$,

c_{n}^{N}(s;\theta)=\left(g_{\theta}^{n}(s)+c_{n+1}^{N}(s;\theta)\frac{n+1-s}{n+1}+c_{n+1}^{N}(s+1;\theta)\frac{s+1}{n+1}\right)(1-\psi_{n}(s)),

$s=0,1,\dots,n$. Then

E_{\theta}\min\{\tau_{\psi},N+1\}=1+c_{1}^{N}(0;\theta)+c_{1}^{N}(1;\theta). (28)

Again, passing to the limit in (28) as $N\to\infty$, we obtain

E_{\theta}\tau_{\psi}=1+\lim_{N\to\infty}\left(c_{1}^{N}(0;\theta)+c_{1}^{N}(1;\theta)\right).
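The recursions of Propositions 3.2 and 3.4 translate into a single backward pass over the sufficient statistic. The following R sketch is our own illustration (it is not the PAccept function of the published package); the vectorized helpers stop_rule(n, s) and accept_rule(n, s), describing a truncated, non-randomized test on the sufficient statistic, are assumptions of the example.

```r
# Illustrative sketch of Propositions 3.2 and 3.4 for a truncated test given
# by the 0/1-valued stopping rule stop_rule(n, s) and the index of the
# accepted hypothesis accept_rule(n, s); both must accept a vector of s values.
test_characteristics <- function(stop_rule, accept_rule, N, theta, k) {
  # a_j^N(s; theta), as in (21); stopping is forced at the truncation level N
  a <- sapply(1:k, function(j) dbinom(0:N, N, theta) * (accept_rule(N, 0:N) == j))
  # c_N^N(s; theta) of Proposition 3.4 (identically zero for a test truncated at N)
  c_vec <- dbinom(0:N, N, theta) * (1 - stop_rule(N, 0:N))
  for (n in (N - 1):1) {
    s <- 0:n
    w0 <- (n + 1 - s) / (n + 1)
    w1 <- (s + 1) / (n + 1)
    stopping <- stop_rule(n, s)
    a <- sapply(1:k, function(j)
      dbinom(s, n, theta) * stopping * (accept_rule(n, s) == j) +
        (a[s + 1, j] * w0 + a[s + 2, j] * w1) * (1 - stopping))
    c_vec <- (dbinom(s, n, theta) + c_vec[s + 1] * w0 + c_vec[s + 2] * w1) * (1 - stopping)
  }
  list(accept_prob = colSums(a),  # a_j^0(theta) = a_j^1(0;theta) + a_j^1(1;theta)
       ESS = 1 + sum(c_vec))      # E_theta min(tau, N+1), cf. (28)
}
```

Evaluating accept_prob at each $\theta_{i}$ and reading off the entries $j\not=i$ gives the error probabilities $\alpha_{ij}(\psi,\phi)$, and ESS coincides with $E_{\theta}\tau_{\psi}$ whenever the test is truncated at $N$.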

We implemented the algorithms presented in this subsection in the R programming language; the source code is available in Novikov (2023).

It should be noted that the algorithms for performance evaluations in this subsection are applicable to any truncated test based on sufficient statistics, and not only to the optimal test of Subsection 3.1. In particular, we included in the program implementation a function producing the structure of the (truncated version of) the matrix sequential probability ratio test (MSPRT), enabling in this way all the performance evaluations of this subsection for the truncated MSPRT as well. Because an MSPRT for two hypotheses is an SPRT, this also covers the performance evaluation of truncated SPRTs. Also, an implementation of the Monte Carlo simulation for the performance evaluation is provided as a part of the program code.

4 Applications. Numerical results

In this section we apply the theoretical results of the preceding sections to construction and performance evaluation of sequential tests in the Bernoulli model.

4.1 Efficiency of the MSPRT

In this subsection, we evaluate the performance of the widely-used matrix sequential probability ratio test (MSPRT) for multiple hypotheses and numerically compare its expected sample size characteristics with the asymptotic bounds for them, in a particular case of testing three hypotheses about the parameter of the Bernoulli distribution.

The idea of the MSPRT is to simultaneously run $k(k-1)/2$ SPRTs, one for each pair of the hypothesized parameter values, stopping only when all the SPRTs decide in favour of a certain hypothesis. Let $A_{ij}>1$ be some constants, $1\leq i\not=j\leq k$. Then the stopping time of the MSPRT (let us denote it $\tau^{*}$) is defined as

\min\{n\geq 1:\ \text{there is}\ i\ \text{such that}\ f_{\theta_{i}}^{n}(x_{1},\dots,x_{n})\geq A_{ij}f_{\theta_{j}}^{n}(x_{1},\dots,x_{n})\ \text{for all}\ j\not=i\}, (29)

in which case hypothesis $H_{i}$ is accepted. Armitage (1950) showed that the MSPRT stops with probability one under each $H_{i}$, and that

\alpha_{ij}^{*}\leq 1/A_{ji},\quad 1\leq i\not=j\leq k, (30)

where $\alpha_{ij}^{*}$ is the error probability of the MSPRT (29).

For $k=2$ the MSPRT is an ordinary SPRT, and (30) are the well-known Wald inequalities for its error probabilities.
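As an illustration of rule (29), the following R sketch runs the MSPRT on a given Bernoulli data vector; it is our own simulation helper (msprt and its arguments are not part of the published package) and works on the log scale to avoid underflow.

```r
# Illustrative sketch of the MSPRT stopping rule (29) for Bernoulli observations.
# x: data vector of 0/1 values; theta: hypothesized values; A: k x k threshold matrix.
msprt <- function(x, theta, A) {
  k <- length(theta)
  logf <- rep(0, k)                       # log f_{theta_i}^n accumulated so far
  for (n in seq_along(x)) {
    logf <- logf + dbinom(x[n], 1, theta, log = TRUE)
    for (i in 1:k) {                      # accept H_i as soon as (29) holds
      ok <- all(sapply((1:k)[-i], function(j) logf[i] >= log(A[i, j]) + logf[j]))
      if (ok) return(list(stop_time = n, accepted = i))
    }
  }
  list(stop_time = NA, accepted = NA)     # no decision within the available data
}
```

With the choice $A_{ji}=(k-1)/\alpha$ used later in this subsection, the threshold matrix is simply matrix((k - 1) / alpha, k, k).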

To get numerical results, we consider a particular case of $k=3$ hypotheses about the success probability $\theta$ of the Bernoulli distribution, with $\theta_{1}=0.3$, $\theta_{2}=0.4$ and $\theta_{3}=0.5$.

First of all, we are interested in calculating the performance characteristics of the MSPRT in this particular case. It is easy to see that, in the Bernoulli case, the rules of the MSPRT are based on the sufficient statistics $S_{n}$, $n=1,2,\dots$, so the formulas of Subsection 3.2 apply to the truncated version of the MSPRT. Strictly speaking, the terminal decision at the last step, when the MSPRT is truncated at time $N$, is not defined. But we calculate the exact probability that the MSPRT does not come to a decision at any earlier stage, and make this probability so small (by choosing $N$ large enough) that whatever concrete decision is taken at the last step does not affect the numerical values of the error probabilities, nor those of the ESS under any of the hypotheses.

In Tartakovsky, Nikiforov, and Basseville (2015), asymptotic formulas are obtained for the ESS of the MSPRT, so we consider this example a good opportunity to juxtapose the calculated and the asymptotic values of the corresponding numerical characteristics in various practical scenarios. We use the thresholds $A_{ji}=(k-1)/\alpha$ which make the MSPRT in (29) asymptotically optimal, as $\max_{i}\{\alpha_{i}\}=\alpha\to 0$ (see Tartakovsky, Nikiforov, and Basseville 2015, Section 4.3.1).

The results of the evaluations are presented in Table 1, where $\alpha_{i}^{*}$ and $E_{\theta_{i}}\tau^{*}$ are the evaluated characteristics of the MSPRT, and $R_{i}$ is the ratio between $E_{\theta_{i}}\tau^{*}$ and the asymptotic expression for it (according to Tartakovsky, Nikiforov, and Basseville 2015, p. 196), $i=1,2,3$.

$\alpha$ $\alpha_{1}^{*}$ $\alpha_{2}^{*}$ $\alpha_{3}^{*}$ $E_{\theta_{1}}\tau^{*}$ $E_{\theta_{2}}\tau^{*}$ $E_{\theta_{3}}\tau^{*}$ $R_{1}$ $R_{2}$ $R_{3}$
0.1 0.026091 0.089375 0.029442 134.5 211.8 142.5 1.26 1.85 1.26
0.05 0.013039 0.045384 0.014829 169.4 264.9 180 1.22 1.78 1.23
0.025 0.006498 0.022826 0.007467 203.5 313.2 216.2 1.19 1.71 1.2
0.01 0.002575 0.009172 0.002981 247.4 372.4 262.7 1.16 1.63 1.16
0.005 0.001291 0.004596 0.001504 280 414.1 297.4 1.14 1.57 1.15
0.002 0.0005 0.00184 0.000594 322.8 468.9 342.8 1.12 1.52 1.13
0.001 0.000248 0.00092 0.000296 355.1 508.8 376.9 1.11 1.48 1.11
0.0005 0.000123 0.00046 0.000147 387.2 548.5 411 1.1 1.45 1.1
5E-07 1.14E-07 4.6E-07 1.47E-07 707.1 928.5 749.5 1.05 1.29 1.05
5E-09 1.1E-09 4.6E-09 1.46E-09 920.3 1175.5 975.2 1.04 1.24 1.04
Table 1: ESS: MSPRT vs. asymptotic

4.2 Bayes vs. MSPRT

Now, let us numerically compare the optimal multi-hypothesis test with the MSPRT, provided both have the same levels of error probabilities $\alpha_{i}=\alpha$, $i=1,2,3$. To this end, we numerically find the Lagrange multipliers $\lambda_{i}$ providing the best approximation of the error probabilities of the test (7)-(10) to $\alpha$, with respect to the distance

\max_{i}\{|\alpha_{i}(\psi,\phi)-\alpha|/\alpha\}.

The gradient-free optimization method of Nelder and Mead (1965) works well for this fitting. We use $\vartheta_{i}=\theta_{i}$ and $\gamma_{i}=1/3$, $i=1,2,3$, in the criterion (1); i.e. we evaluate the Bayesian tests with the "least informative" (uniform) prior distribution. The results of the fitting are presented in Table 2 (upper block).
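A minimal sketch of this fitting step is given below, assuming a user-supplied function error_probs(lambda) that constructs the test (7)-(10) for the given multipliers (for instance with the sketches above, or with the functions of the published package) and returns the vector of its error probabilities; the optimization is done over log(lambda) to keep the multipliers positive.

```r
# Nelder-Mead fit of the multipliers lambda_i to target error probabilities.
# error_probs(lambda) is an assumed black box returning (alpha_1(psi,phi), ..., alpha_k(psi,phi)).
fit_multipliers <- function(target_alpha, start_lambda, error_probs) {
  objective <- function(log_lambda) {
    a <- error_probs(exp(log_lambda))
    max(abs(a - target_alpha) / target_alpha)   # the distance defined above
  }
  fit <- optim(log(start_lambda), objective, method = "Nelder-Mead")
  exp(fit$par)                                   # fitted multipliers
}
```

The same routine can fit the MSPRT thresholds $A_{1},A_{2},A_{3}$ considered next, after replacing error_probs accordingly.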

As the competing MSPRT we take the test (29), with $A_{ij}$ defined as $A_{ij}=A_{j}$ for all $1\leq j\not=i\leq 3$, and carry out the same fitting procedure as above, with respect to $A_{1},A_{2},A_{3}$. The results are presented in the middle block of Table 2.

In the lower block of Table 2 we give the ratios $R_{i}$ between the ESS of the MSPRT ($E_{\theta_{i}}\tau^{*}$) and that of the respective Bayesian test ($E_{\theta_{i}}\tau$), under each one of the hypotheses.

The results show an astonishingly high efficiency of the MSPRT, especially for small $\alpha$. This would not be so surprising for two hypotheses, because in that case any MSPRT is in fact an SPRT, and any Bayesian test is an SPRT too (see Wald and Wolfowitz 1948), so numerically fitting both tests to given error probabilities should give a relative efficiency of about 100%. But we see that largely the same happens for three hypotheses, at least in the case of equal error probabilities we are examining.

The question arises whether there exist Bayesian tests “essentially” outperforming MSPRTs, in the case of three hypotheses. The answer is “yes”, as the following numerical example suggests.

In a rather straightforward way, we found a Bayesian test corresponding to the very "unbalanced" weights $\gamma=(0.01,0.01,0.98)$, and an MSPRT having the same error probabilities: $\alpha_{1}=0.0051$, $\alpha_{2}=0.089$, $\alpha_{3}=0.068$. These correspond to the Lagrange multipliers $\lambda_{1}=200$, $\lambda_{2}=500$, $\lambda_{3}=200$ for the Bayesian test and the thresholds $\log(A_{1})=4.90$, $\log(A_{2})=3.00$, $\log(A_{3})=1.69$ for the MSPRT, respectively. Accordingly, we obtained $E_{\theta_{1}}\tau=320.1$, $E_{\theta_{2}}\tau=258.5$, $E_{\theta_{3}}\tau=101.3$ for the Bayesian test, and $E_{\theta_{1}}\tau^{*}=139.7$, $E_{\theta_{2}}\tau^{*}=239.0$, $E_{\theta_{3}}\tau^{*}=134.3$ for the MSPRT. Respectively, the weighted ESS evaluates to $C_{\gamma,\theta}(\tau)=105.07$ and $C_{\gamma,\theta}(\tau^{*})=135.32$, that is, nearly 29% larger for the MSPRT in comparison with the Bayesian test.

The most desirable property an optimal test could have is that it minimizes the ESS under each one of the hypotheses, in the class of tests subject to restrictions on the error probabilities. Nevertheless, we think this property is too strong to be fulfilled by any sequential test when there are three (or more) hypotheses. We base this opinion on the following simple observation. Suppose there is a "uniformly optimal" test $\langle\phi^{*},\psi^{*}\rangle$ in the sense that $\alpha_{i}(\psi^{*},\phi^{*})=\alpha_{i}$, $i=1,\dots,k$, and for any test $\langle\phi,\psi\rangle$ such that $\alpha_{i}(\psi,\phi)\leq\alpha_{i}$ for $i=1,\dots,k$, it holds that $E_{\theta_{i}}\tau_{\psi}\geq E_{\theta_{i}}\tau_{\psi^{*}}$ for all $i=1,\dots,k$. Then it is obvious that, whatever the weights $\gamma_{i}\geq 0$, $i=1,\dots,k$, it holds that $C_{\gamma,\theta}(\psi,\phi)=\sum_{i=1}^{k}\gamma_{i}E_{\theta_{i}}\tau_{\psi}\geq C_{\gamma,\theta}(\psi^{*},\phi^{*})$. Thus, for any set of weights $\gamma_{i}$, $i=1,\dots,k$, we would have a test minimizing the weighted ESS under the restrictions on the error probabilities, i.e. one test $\langle\phi^{*},\psi^{*}\rangle$ would solve all the problems of minimization of weighted ESS we formulated in Section 2 (all those with $\vartheta=\theta$ but arbitrary $\gamma$). It seems that this is "too much" for one test when there are more than two hypotheses (it is fine for two hypotheses, because it is well known that any Bayesian test is an SPRT). Unfortunately, the discrete nature of the error probabilities in the Bernoulli model seems to be a serious obstacle to constructing a formal counterexample in this case. We hope to provide one in future publications concerning continuous distribution families.

$\alpha$ 0.1 0.05 0.025 0.01 0.005 0.002 0.001 0.0005
$\log(\lambda_{1})$ 5.09 5.61 6.15 6.91 7.52 8.36 9.04 9.71
$\log(\lambda_{2})$ 5.88 6.55 7.21 8.10 8.78 9.68 10.37 11.06
$\log(\lambda_{3})$ 5.23 5.77 6.34 7.13 7.76 8.63 9.31 9.99
$E_{\theta_{1}}\tau$ 113.4 160.7 194.4 242.0 276.1 320.0 352.6 385.0
$E_{\theta_{2}}\tau$ 136.0 189.4 238.4 298.3 340.9 395.5 435.7 475.3
$E_{\theta_{3}}\tau$ 115.9 156.6 202.4 253.1 289.2 335.8 370.4 404.7
$\log(A_{1})$ 1.67 2.37 3.07 3.96 4.63 5.52 6.20 6.88
$\log(A_{2})$ 2.81 3.56 4.27 5.21 5.90 6.81 7.52 8.21
$\log(A_{3})$ 1.81 2.50 3.21 4.12 4.81 5.72 6.41 7.09
$E_{\theta_{1}}\tau^{*}$ 110.0 153.4 192.3 240.1 273.9 317.5 350.7 383.1
$E_{\theta_{2}}\tau^{*}$ 136.0 189.4 238.4 298.3 341.0 395.0 435.7 475.3
$E_{\theta_{3}}\tau^{*}$ 118.3 163.4 204.7 255.1 291.2 337.2 372.3 406.7
$R_{1}$ 0.970 0.955 0.989 0.992 0.993 0.992 0.995 0.995
$R_{2}$ 1.000 1.000 1.000 1.000 1.000 0.999 1.000 1.000
$R_{3}$ 1.010 1.043 1.011 1.007 1.008 1.004 1.005 1.005
Table 2: Relative efficiency of the MSPRT with respect to the Bayesian test

4.3 The Kiefer-Weiss problem for multi-hypothesis testing

In this subsection we propose a construction of a test which can be helpful for solving the Kiefer-Weiss problem for multiple hypotheses, and present a numerical example where the proposed test provides an approximate solution to the Kiefer-Weiss problem in the case of three hypotheses about the parameter of the Bernoulli model.

Let $\theta_{1}<\theta_{2}<\dots<\theta_{k}$ be the hypothesized parameter values, $k\geq 2$. Generalizing the Kiefer-Weiss problem from the case of $k=2$ hypotheses (see Kiefer and Weiss 1957), let us say that the Kiefer-Weiss problem for $k\geq 2$ hypotheses is to find a sequential test $\langle\psi,\phi\rangle$ which minimizes $\sup_{\theta\in(\theta_{1},\theta_{k})}E_{\theta}\tau_{\psi}$ in the class of tests subject to the restrictions (2) on the error probabilities.

Kiefer and Weiss (1957) and Weiss (1962) noted that in some symmetric cases the solution can be obtained as a solution to a much simpler problem (nowadays called the modified Kiefer-Weiss problem). This latter problem is to find a test minimizing $E_{\vartheta_{1}}\tau_{\psi}$ among the tests satisfying the restrictions on the error probabilities, where $\vartheta_{1}$ is some point in $(\theta_{1},\theta_{2})$.

For the general multi-hypothesis case we propose the following generalization of this construction. Let $\vartheta_{i}\in(\theta_{i},\theta_{i+1})$, $i=1,2,\dots,k-1$, be some parameter points, and let $\gamma_{i}\in[0,1]$, $i=1,2,\dots,k-1$, be some weights (such that $\sum_{i=1}^{k-1}\gamma_{i}=1$). Recall that

C_{\gamma,\vartheta}(\psi)=\sum_{i=1}^{k-1}\gamma_{i}E_{\vartheta_{i}}\tau_{\psi}. (31)
Proposition 4.1.

Let us suppose there is a test $\langle\psi^{*},\phi^{*}\rangle$, with some $\vartheta_{i}\in(\theta_{i},\theta_{i+1})$ and $\gamma_{i}\geq 0$, $i=1,2,\dots,k-1$, $\sum_{i=1}^{k-1}\gamma_{i}=1$, such that

C_{\gamma,\vartheta}(\psi^{*})+\sum_{i\not=j}\lambda_{ij}\alpha_{ij}(\psi^{*},\phi^{*})\leq C_{\gamma,\vartheta}(\psi)+\sum_{i\not=j}\lambda_{ij}\alpha_{ij}(\psi,\phi) (32)

for all sequential tests $\langle\psi,\phi\rangle$, and that

\alpha_{ij}(\psi^{*},\phi^{*})=\alpha_{ij}\quad\mbox{for all}\quad 1\leq i\not=j\leq k. (33)

Additionally, let us suppose that

E_{\vartheta_{i}}\tau_{\psi^{*}}=\sup_{\theta\in(\theta_{1},\theta_{k})}E_{\theta}\tau_{\psi^{*}}\quad\mbox{for all}\quad 1\leq i\leq k-1. (34)

Then for any sequential test $\langle\psi,\phi\rangle$ satisfying

\alpha_{ij}(\psi,\phi)\leq\alpha_{ij}\quad\mbox{for all}\quad 1\leq i\not=j\leq k, (35)

it holds

\sup_{\theta\in(\theta_{1},\theta_{k})}E_{\theta}\tau_{\psi^{*}}\leq\sup_{\theta\in(\theta_{1},\theta_{k})}E_{\theta}\tau_{\psi}, (36)

i.e. $\langle\psi^{*},\phi^{*}\rangle$ solves the Kiefer-Weiss problem.

Proof. It follows from (32), (33) and (35) that

C_{\gamma,\vartheta}(\psi^{*})+\sum_{i\not=j}\lambda_{ij}\alpha_{ij}=C_{\gamma,\vartheta}(\psi^{*})+\sum_{i\not=j}\lambda_{ij}\alpha_{ij}(\psi^{*},\phi^{*})\leq C_{\gamma,\vartheta}(\psi)+\sum_{i\not=j}\lambda_{ij}\alpha_{ij}(\psi,\phi)\leq C_{\gamma,\vartheta}(\psi)+\sum_{i\not=j}\lambda_{ij}\alpha_{ij}

for any test $\langle\psi,\phi\rangle$ satisfying (35), so

C_{\gamma,\vartheta}(\psi^{*})=\sum_{i=1}^{k-1}\gamma_{i}E_{\vartheta_{i}}\tau_{\psi^{*}}\leq C_{\gamma,\vartheta}(\psi)=\sum_{i=1}^{k-1}\gamma_{i}E_{\vartheta_{i}}\tau_{\psi}\leq\sup_{\theta\in(\theta_{1},\theta_{k})}E_{\theta}\tau_{\psi}.

But, due to (34),

\sum_{i=1}^{k-1}\gamma_{i}E_{\vartheta_{i}}\tau_{\psi^{*}}=\sup_{\theta\in(\theta_{1},\theta_{k})}E_{\theta}\tau_{\psi^{*}},

thus (36) follows. \Box

Remark 1.

The modification of Proposition 4.1 to be used with restrictions on $\alpha_{i}$ rather than on $\alpha_{ij}$ is straightforward: just use $\lambda_{i}$ and $\alpha_{i}$ instead of $\lambda_{ij}$ and $\alpha_{ij}$, respectively.

Remark 2.

We conjecture that, when sampling from exponential families of distributions, the tests constructed in Proposition 4.1 for multiple hypotheses (even without condition (34)) are always truncated, just as those in the modified Kiefer-Weiss problem for two hypotheses are when $\vartheta_{1}\in(\theta_{1},\theta_{2})$. Using our program in Novikov (2023), it is easy to observe this for any number of hypotheses in the Bernoulli case.

Remark 3.

Proposition 4.1 is valid for any number of hypotheses and for any parametric family of distributions.

Let us now consider an example of a numerical solution to the Kiefer-Weiss problem for the Bernoulli model, in the case of three hypotheses.

Let $\theta_{1}=0.3$, $\theta_{2}=0.5$ and $\theta_{3}=0.7$. We took $N=1200$, $\gamma_{1}=\gamma_{2}=0.5$ and $\lambda_{1}=\lambda_{2}=\lambda_{3}=200$, and used the function OptTest from the program code in Novikov (2023) to produce tests satisfying condition (32) (minimizing the Lagrangian function). To comply with (34), after a simple numerical optimization over $\vartheta_{1}=1-\vartheta_{2}$ we found that for $\vartheta_{1}=0.4026$, $\vartheta_{2}=0.5974$ it holds that

\max_{\theta\in[0.3,0.7]}E_{\theta}\tau_{\psi^{*}}=56.2=E_{\vartheta_{1}}\tau_{\psi^{*}}=E_{\vartheta_{2}}\tau_{\psi^{*}}.

To calculate the error probabilities we used the function PAccept in Novikov (2023), and obtained $\alpha_{1}(\psi^{*},\phi^{*})=\alpha_{3}(\psi^{*},\phi^{*})=0.037$ and $\alpha_{2}(\psi^{*},\phi^{*})=0.07$. Thus, we have a numerical solution of the Kiefer-Weiss problem under the restrictions $\alpha_{1}=\alpha_{3}=0.037$ and $\alpha_{2}=0.07$. The optimal test is truncated at $N=160$. The function maxNumber can be used to find the maximum number of steps a test requires.

To compare the Kiefer-Weiss solution with a Bayesian test, we used the same function OptTest, now with $\theta_{i}=\vartheta_{i}$, $i=1,2,3$, and $\gamma_{i}=1/3$, $i=1,2,3$, at the truncation level $N=1200$, using the Nelder-Mead optimization to get as close as possible to $\alpha_{1}=\alpha_{3}=0.037$ and $\alpha_{2}=0.07$. The fitted values are $\alpha_{1}=\alpha_{3}=0.0370$ and $\alpha_{2}=0.0704$, with a maximum ESS of 60.2. Thus, the Kiefer-Weiss solution saves about 10% of the observations, on average, in comparison with the optimal Bayesian test.

5 Conclusions and further work

In this paper, we proposed a computer-oriented method of construction of sequential multi-hypothesis tests minimizing a weighted expected sample size (ESS).

For the particular case of sampling from a Bernoulli population, we developed a computational scheme for evaluating the optimal tests and calculating the numerical characteristics of sequential tests based on sufficient statistics. An implementation of the algorithms in the R programming language has been published in a GitHub repository Novikov (2023).

A numerical evaluation of the widely-used matrix sequential probability ratio test (MSPRT) is carried out for the case of three simple hypotheses about the parameter of the Bernoulli distribution, and a numerical comparison is made with the asymptotic expressions for the ESS of the asymptotically optimal MSPRT.

For a series of error probabilities we evaluated the ESS of the Bayesian test and compared it with that of the MSPRT having the same error probabilities, in which case the MSPRT exhibited a very high efficiency. On the other hand, we found a numerical example where the MSPRT is substantially less efficient than the optimal Bayesian test.

We proposed a method of numerical solution of the multi-hypothesis Kiefer-Weiss problem. The proposed method is applied to a three-hypothesis Kiefer-Weiss problem for the Bernoulli model, and numerical results are given.

A very immediate extension of this work could be developing computational algorithms for construction and performance evaluation of optimal sequential multi-hypothesis tests for other parametric families, first of all for one-parameter exponential families (cf. Novikov and Farkhshatov 2022).

The method we applied in this paper for i.i.d. observations can in fact be used for much more general models. For example, it can be applied to the models considered in Liu, Gao, and Li (2016), where numerical methods of performance evaluation of the MSPRT for non-i.i.d. observations are developed. It would be interesting to carry out a comparison study between the MSPRT and our optimal tests. Extensions to models with dependent observations are also possible.

The proposed method for solution of the Kiefer-Weiss problem can be extended to other parametric families.

Another expected application is an extension of sequentially planned tests for two hypotheses (Novikov 2022) to the case of multiple hypotheses (Novikov 2009a).

Acknowledgements

The author gratefully acknowledges a partial support of the National Researchers System (SNI) by CONACyT, Mexico, for this work.

The author thanks the anonymous Reviewers and the Associate Editor for very substantial comments and suggestions on improving earlier versions of this work.

Appendix. Proof of (11)

Let us define $\alpha_{ij}(n,\phi)$ as the error probability of the fixed-sample-size test based on $n$ observations and using the decision rule from (7). It follows from Theorem 3 in Novikov (2009b) that

\int v_{n}d\mu^{n}=\sum_{i\not=j}\lambda_{ij}\alpha_{ij}(n,\phi).

Let us prove that, for any $i\not=j$ such that $\lambda_{ij}>0$, $\alpha_{ij}(n,\phi)\to 0$ as $n\to\infty$.

We have

\alpha_{ij}(n,\phi)=P_{\theta_{i}}\Big(\sum_{l:l\not=j}\lambda_{lj}f_{\theta_{l}}^{n}=v_{n}\Big)\leq P_{\theta_{i}}\Big(\sum_{l:l\not=j}\lambda_{lj}f_{\theta_{l}}^{n}\leq\sum_{l:l\not=i}\lambda_{li}f_{\theta_{l}}^{n}\Big)
\leq P_{\theta_{i}}\Big(\sum_{l:l\not=j}\lambda_{lj}\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}\leq\sum_{l:l\not=i}\lambda_{li}\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}\Big)
\leq P_{\theta_{i}}\Big(\lambda_{ij}+\sum_{l:l\not=i,j}\lambda_{lj}\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}\leq\sum_{l:l\not=i}\lambda_{li}\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}\Big)\to 0,\quad\mbox{as}\quad n\to\infty.

This latter holds because

\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}\to 0,\quad\mbox{as}\quad n\to\infty,

in $P_{\theta_{i}}$-probability, for any $l\not=i$. Indeed, by the Markov inequality,

P_{\theta_{i}}\Big(\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}>\epsilon\Big)=P_{\theta_{i}}\Big(\sqrt{\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}}>\sqrt{\epsilon}\Big)\leq E_{\theta_{i}}\sqrt{\frac{f_{\theta_{l}}^{n}}{f_{\theta_{i}}^{n}}}\Big/\sqrt{\epsilon}=\left(\int\left(f_{\theta_{i}}f_{\theta_{l}}\right)^{1/2}d\mu\right)^{n}\Big/\sqrt{\epsilon}\to 0,\quad\mbox{as}\quad n\to\infty,

because $\int\left(f_{\theta_{i}}f_{\theta_{l}}\right)^{1/2}d\mu<1$ for $l\not=i$, due to the Cauchy-Schwarz inequality.

References

  • Armitage (1950) Armitage, P. 1950. “Sequential analysis with more than two alternative hypotheses, and its relation to discriminant function analysis.” Journal of the Royal Statistical Society B 12: 137–144.
  • Baum and Veeravalli (1994) Baum, C. W., and V. V. Veeravalli. 1994. “A Sequential Procedure for Multihypothesis Testing.” IEEE Transactions on Information Theory 40 (6): 1994–2007.
  • Blackwell and Girshick (1954) Blackwell, D., and M. A. Girshick. 1954. Theory of games and statistical decisions. John Wiley and Sons, Inc.
  • Chow, Robbins, and Siegmund (1971) Chow, Y. S., H. Robbins, and D. Siegmund. 1971. Great Expectations: The Theory of Optimal Stopping. Houghton Mifflin.
  • Eales and Jennison (1992) Eales, J. D., and C. Jennison. 1992. “An improved method for deriving optimal one-sided group sequential tests.” Biometrika 79 (1): 13–24. https://doi.org/10.1093/biomet/79.1.13.
  • Kiefer and Weiss (1957) Kiefer, J., and L. Weiss. 1957. “Some properties of generalized sequential probability ratio tests.” Annals of Mathematical Statistics 28: 57–75.
  • Liu, Gao, and Li (2016) Liu, Y., Y. Gao, and X. Rong Li. 2016. “Operating Characteristic and Average Sample Number of Binary and Multi-Hypothesis Sequential Probability Ratio Test.” IEEE Transactions on Signal Processing 64 (12): 3167–3179.
  • Lorden (1980) Lorden, G. 1980. “Structure of sequential tests minimizing an expected sample size.” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 51 (3): 291–302.
  • Nelder and Mead (1965) Nelder, J. A., and R. Mead. 1965. “A simplex method for function minimization.” Computer Journal 7 (4): 308–313.
  • Novikov (2009a) Novikov, A. 2009a. “Optimal Sequential Multiple Hypothesis Testing in Presence of Control Variables.” Kybernetika 45 (3): 507–528.
  • Novikov (2009b) Novikov, A. 2009b. “Optimal Sequential Multiple Hypothesis Tests.” Kybernetika 45 (2): 309–330.
  • Novikov (2022) Novikov, A. 2022. “Optimal design and performance evaluation of sequentially planned hypothesis tests.” https://arxiv.org/abs/2210.07203.
  • Novikov (2023) Novikov, A. 2023. “ An R Project for Construction and Performance Evaluation of Sequential Multi-Hypothesis Tests.” https://github.com/HOBuKOB-MEX/multihypothesis.
  • Novikov and Farkhshatov (2022) Novikov, A., and F. Farkhshatov. 2022. “Design and performance evaluation in Kiefer-Weiss problems when sampling from discrete exponential families.” Sequential Analysis 41 (04): 417 – 434.
  • R Core Team (2013) R Core Team. 2013. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
  • Shiryaev (1978) Shiryaev, A. N. 1978. Optimal stopping rules. Berlin: Springer.
  • Tartakovsky, Nikiforov, and Basseville (2015) Tartakovsky, A. G., I. V. Nikiforov, and M. Basseville. 2015. Sequential analysis: hypothesis testing and changepoint detection. Boca Raton, Florida: Chapman & Hall/CRC Press.
  • Wald and Wolfowitz (1948) Wald, A., and J. Wolfowitz. 1948. “Optimum character of the sequential probability ratio test.” Annals of Mathematical Statistics 19 (3): 326–339.
  • Weiss (1962) Weiss, L. 1962. “On sequential tests which minimize the maximum expected sample size.” Journal of the American Statistical Association 57: 551–566.