
An Upper Confidence Bound Approach to Estimating the Maximum Mean

Kun Zhang (kunzhang@cityu.edu.hk), Institute of Statistics and Big Data, Renmin University of China, Beijing, China 100872; Guangwu Liu (msgw.liu@cityu.edu.hk), Department of Management Sciences, College of Business, City University of Hong Kong, Kowloon Tong, Hong Kong; Wen Shi (shi3wen@163.com), Business School, Central South University, Changsha, China 410083
Abstract

Estimating the maximum mean finds a variety of applications in practice. In this paper, we study estimation of the maximum mean using an upper confidence bound (UCB) approach where the sampling budget is adaptively allocated to one of the systems. We study in depth the existing grand average (GA) estimator, and propose a new largest-size average (LSA) estimator. Specifically, we establish statistical guarantees, including strong consistency, asymptotic mean squared errors, and central limit theorems (CLTs) for both estimators, which are new to the literature. We show that LSA is preferable over GA, as the bias of the former decays at a rate much faster than that of the latter when sample size increases. By using the CLTs, we further construct asymptotically valid confidence intervals for the maximum mean, and propose a single hypothesis test for a multiple comparison problem with application to clinical trials. Statistical efficiency of the resulting point and interval estimates and the proposed single hypothesis test is demonstrated via numerical examples.

1 Introduction

Estimating the maximum mean of a number of stochastic systems finds a variety of applications in broad areas, ranging from financial risk measurement and evaluation of clinical trials to Markov decision processes and reinforcement learning. In financial risk measurement, coherent risk measures such as expected shortfall play an important role in determining the risk capital charge for financial institutions, which is of great practical importance to both risk managers and regulators. In general, a coherent risk measure can be written as the maximum of expected losses (or means) under different probability measures. In practice, its estimation often involves intensive simulation, and developing efficient methods for estimating coherent risk measures has been a topic of interest in the simulation community. To improve upon a two-stage procedure in Chen and Dudewicz (1976) for constructing confidence intervals (CIs) for the maximum mean, Lesnevski et al. (2007) proposed a procedure for constructing a two-sided, fixed-width CI for a coherent risk measure, by combining stage-wise sampling and screening based on ranking-and-selection techniques. Estimating the maximum mean also finds applications in the evaluation of clinical trials. One of the goals of clinical trials is to estimate the best treatment effect among a number of treatments with unknown mean effects. Estimates of the best mean effect provide useful information about the effectiveness of the best treatment.

In both applications, financial risk measurement and clinical trials, the accuracy of the maximum mean estimates is of great concern. For instance, the U.S. Food and Drug Administration (FDA) is concerned about the bias and reliability of the estimates of treatment effects when using adaptive designs in clinical trials (FDA 2019). To address such concerns, statistical inference about the maximum mean is highly desirable, providing not only point estimates, but also CIs that offer information about the precision of the estimates. To provide theoretical guarantees for the CIs, central limit theorems (CLTs) are typically needed, which is part of the agenda in our study. In this paper, we aim to develop efficient estimators for the maximum mean, and provide theoretical results on the properties of the estimators, such as their biases, variances, and CLTs.

CLTs for maximum mean estimators can also be applied to a fundamental hypothesis testing problem in clinical trials, i.e., comparing K-1 treatments (with unknown mean effects denoted by \mu_{1},\ldots,\mu_{K-1}) to a control treatment with known mean effect \mu_{0}, aiming to determine whether any of the K-1 treatments outperforms the control; see Dmitrienko and Hsu (2006) and references therein. In the current practice, a commonly used approach is to simultaneously test K-1 null hypotheses: \mu_{k}\leq\mu_{0}, k=1,\ldots,K-1, where the Bonferroni correction is often applied to control the type I error. However, it is well known that this approach may be conservative, especially when K is large, and it comes at the cost of increasing the probability of type II errors. With CLTs for estimators of the maximum mean \mu^{*}=\max(\mu_{0},\mu_{1},\ldots,\mu_{K-1}), we propose an alternative approach by testing only a single hypothesis: \mu^{*}=\mu_{0}, thus avoiding the use of the Bonferroni correction and its drawbacks. The CLTs provide theoretical guarantees for the validity of the proposed single hypothesis test, which is numerically shown to significantly outperform the multiple hypotheses approach.

Maximum mean estimation has also been studied in the literature on Markov decision processes (MDPs) and reinforcement learning, where the value function for a given state is usually the maximum of multiple expected values taken over different actions. Estimation of the value function is a fundamental sub-task in many solution methods for MDPs and reinforcement learning; see, e.g., Chang et al. (2005), van Hasselt (2010), and Fu (2017). A double (also referred to as cross-validation) estimator was proposed in van Hasselt (2010, 2013) based on two disjoint sets of samples, where one set of samples is used to select the best system (i.e., the system with the maximum mean) and the estimator is given by the sample average of the selected system computed from the other set. D’Eramo et al. (2016) proposed an estimator that uses a weighted average of the sample means of individual systems, where the weight of a system is set as a Gaussian approximation of the probability that this system is the best. The double estimator leads to double Q-learning algorithms, and empirical studies in van Hasselt (2010) and van Hasselt et al. (2016) suggest that these algorithms may outperform conventional Q-learning and deep reinforcement learning algorithms. However, given that maximum mean estimation is only an intermediate module rather than the ultimate goal in MDPs and reinforcement learning, how the estimation errors propagate through the state-action pairs and affect the optimal policies is beyond the scope of this paper and requires dedicated future research.

Estimating the maximum mean is closely related to the problem of selecting the best system, an area of active research in the simulation community; see, e.g., Kim and Nelson (2006) for an overview. Arguably, the problem of estimating the maximum mean is more difficult than selecting the best system (Lesnevski et al. 2007), because the former requires a careful balance between two goals: selecting the best system and allocating as many samples as possible to the best system. These two goals are not necessarily complementary, because correctly selecting the best system requires some samples to be allocated to all systems, which limits the achievement of the second goal.

One of the key issues in estimating the maximum mean is how to sample from individual systems. In this regard, both static sampling (van Hasselt 2010, 2013; D’Eramo et al. 2016) and adaptive sampling (Chang et al. 2005; Kocsis and Szepesvári 2006) have been studied in the literature. The former refers to sampling policies that maintain a pre-specified sampling frequency for each system, while the latter sequentially adjusts sampling frequencies based on historical samples drawn from the systems. Many adaptive sampling methods such as the upper confidence bound (UCB) policy and Thompson sampling share a common feature that higher sampling frequencies are geared towards systems with larger means. When estimating the maximum mean, this feature implies a smaller variance for the resulting estimator because more samples are allocated to the target system, thus offering better computational efficiency. Adaptive sampling/design using policies like UCB offers benefits beyond computational efficiency in the application of clinical trials, where each patient is assigned to one of the K treatments to assess treatment effects. In this application, other than assessing treatment effects, another important goal is to treat patients as effectively as possible during the trials. Therefore, for the well-being of patients, it is desirable to adopt adaptive designs like the UCB policy so that more patients are assigned to more effective treatments. Interested readers are referred to Villar et al. (2015) and Robertson and Wason (2019) for recent work on adaptive designs in clinical trials.

In this paper, we study maximum mean estimation based on the UCB sampling policy. UCB is one of the most celebrated adaptive sampling methods used in online decision making with learning, and has been extensively used as a tool to balance the trade-off between exploration and exploitation during the course of learning; see, e.g., Agrawal (1995) and Auer et al. (2002) for early work on multi-armed bandit problems, and Hao et al. (2019) and references therein for recent related work. Under the UCB sampling policy, the expected amount of sampling budget allocated to the non-maximum systems is at most of a logarithmic order of the total budget, which implies that the majority of the sampling budget is allocated to the system with the maximum mean. Within the UCB framework, one of the estimators being considered is the average of all the samples, regardless of whether they are drawn from the system with the maximum mean or not. We refer to it as the grand average (GA) estimator. The rationale of the GA estimator stems from the fact that it is dominated by the sample average of the system with the maximum mean, from which the majority of the samples are drawn. The GA estimator is not new in the literature, and has served as a key ingredient for algorithms in MDPs; see, e.g., Chang et al. (2005) and Kocsis and Szepesvári (2006). However, to the best of our knowledge, very little is known about the statistical properties of the GA estimator, except its asymptotic unbiasedness. In this paper, we fill this gap by providing statistical guarantees for the GA estimator, including its strong consistency, mean squared error (MSE), and CLT. On the downside, a well-known drawback of the GA estimator is the bias that results from taking the grand average over all systems, including those with smaller means. While this bias gradually dies out as the sample size increases, its impact in finite-sample settings cannot be neglected.

As an attempt to remedy the above drawback, we propose a new estimator that takes the sample average of the system that has been allocated the largest number of samples, referred to as the largest-size average (LSA) estimator. This estimator is intuitively appealing, as the number of samples allocated to the system with the maximum mean dominates those allocated to other systems, ensuring a high probability that the system identified by LSA is exactly the one with the maximum mean. LSA avoids averaging over systems with smaller means, and it is thus expected to have a smaller bias than the GA estimator.

To summarize, we make the following contributions in this paper.

  • We provide statistical guarantees for the GA estimator, including its strong consistency, MSE, and CLT, which are new to the literature to the best of our knowledge.

  • We develop a new LSA estimator for estimating the maximum mean with statistical guarantees, and show that it is preferable over the GA estimator, because its bias converges at least as fast as \nu_{n}/n^{3/2}, a rate much faster than that (\nu_{n}/n) of the GA estimator, where n and \nu_{n} denote the total sample size and exploration rate, respectively. While polynomial exploration rates are allowed in our setup, both theoretical results and empirical evidence suggest that a logarithmic exploration rate is asymptotically optimal for minimizing the MSEs of both the GA and LSA estimators.

  • Using the established CLTs, we construct asymptotically valid GA and LSA CIs for the maximum mean, and propose a single hypothesis test for a multiple comparison problem with application to clinical trials. Numerical experiments confirm the validity of the CIs, and show that the proposed single test exhibits lower type I error and higher statistical power, compared to a benchmarking test with multiple hypotheses.

The rest of the paper is organized as follows. In Section 2, we formulate the problem of estimating the maximum mean. The GA and LSA estimators, and a single hypothesis test for the multiple comparison problem are presented in Section 3. Theoretical analysis of the estimators is provided in Section 4. A numerical study is presented in Section 5 to demonstrate the efficiency of the estimators and the proposed single hypothesis test, followed by conclusions in Section 6.

2 Problem and Background

Consider K stochastic systems, with their performances denoted by random variables \{X_{k},k=1,\ldots,K\}. Let \mu_{k}={\mathbb{E}}(X_{k}). For each k, we assume that X_{k} follows an unknown probability distribution from which independent samples can be drawn. This assumption is usually satisfied in many practical applications. For instance, X_{k} may represent the stochastic performance of a complex simulation model and follow an unknown probability distribution, while samples of X_{k} can be generated via simulation. We make the following assumption on the X_{k}'s.

Assumption 1.

For any k=1,\ldots,K, X_{k} follows a sub-Gaussian distribution, i.e., there exists a sub-Gaussian parameter \gamma_{k} such that {\mathbb{E}}[\exp\left(\lambda(X_{k}-\mu_{k})\right)]\leq\exp\left(\gamma_{k}^{2}\lambda^{2}/2\right) for any \lambda\in{\mathbb{R}}.

Assumption 1 is widely used in the literature on machine learning. Many commonly used distributions, such as those with bounded supports and normal distributions, are sub-Gaussian. Assumption 1 ensures the existence of a Hoeffding bound that serves as the basis for the theoretical analysis in this paper. This bound is presented in Lemma A1 in the appendix.

In this paper, we are interested in estimating the maximum mean of the K systems, defined as

\mu^{*}\triangleq\max_{k=1,\ldots,K}\mu_{k}.

Let k^{*} denote the index of the system with the largest mean, i.e., \mu^{*}=\mu_{k^{*}}. Without loss of generality, k^{*} is assumed to be unique.

Assumption 2.

The system with the largest mean is unique, i.e., \mu^{*}>\mu_{k} for any k\neq k^{*}.

Assumption 2 implies that the mean gap between systems k^{*} and k, given by \Delta_{k}\triangleq\mu^{*}-\mu_{k}, is positive for any k\neq k^{*}.

Given samples of X_{k}, denoted by \{X_{k,j},j=1,\ldots,n_{k}\}, k=1,\ldots,K, a straightforward estimator of \mu^{*}, referred to as the maximum estimator (see, e.g., D’Eramo et al. 2016), is given by

\max_{k=1,\ldots,K}\bar{X}_{k},

where \bar{X}_{k}[n_{k}]\triangleq\sum_{j=1}^{n_{k}}X_{k,j}/n_{k} is the sample average of X_{k}. For notational convenience, where no confusion arises, we simply write \bar{X}_{k} in place of \bar{X}_{k}[n_{k}].

It is well known that the maximum estimator has a positive bias and may overestimate \mu^{*}; see, e.g., Smith and Winkler (2006) and van Hasselt (2010). The positive bias may lead to adverse effects in various applications. For instance, in the estimation of coherent risk measures, it may lead to overestimation of risk, resulting in unduly high capital charges for risky activities (Lesnevski et al. 2007). In reinforcement learning, as errors propagate over all the state-action pairs, the positive bias may negatively affect the speed of learning (van Hasselt 2010). More importantly, when using the maximum estimator, it is not clear how to allocate the sampling budget to the systems, i.e., how to set the values of the n_{k}'s, so as to ensure reasonable accuracy. Therefore, it is of both theoretical and practical interest to develop alternative estimators of \mu^{*}.

3 An Upper Confidence Bound Approach

Sampling from the K systems is essential for constructing any estimator of \mu^{*}. In developing an efficient estimator, two issues are of major concern. The first is how to efficiently allocate the sampling budget, while the second is how to construct an estimator for which statistical guarantees can be established.

To address the first issue, adaptive sampling policies have been studied in the literature. A popular adaptive sampling policy is the UCB policy that was originally proposed for the multi-armed bandit (MAB) problem; see Auer et al. (2002). In the traditional MAB problem, in each round, one of the K systems is chosen and a random sample (reward) is drawn from the chosen system, and the decision maker aims to maximize the total rewards collected. At the heart of the MAB problem is to find an adaptive sampling policy that decides which system to sample from in each round so as to balance the tradeoff between exploration and exploitation. Among various sampling policies for MAB, it is well known that the UCB policy achieves the optimal rate of regret, defined as the expected loss due to the fact that a policy does not always choose the system with the highest expected reward; see Lai and Robbins (1985) and Auer et al. (2002). Specifically, let T_{k}(n) denote the number of samples drawn from system k during the first n rounds, for k=1,\dots,K. The UCB sampling policy is described in Algorithm 1.

Algorithm 1: UCB Sampling Policy

  1. Initialization: During the first K rounds, draw a sample from each system.

  2. Repeat: For n\geq K+1, draw a sample from the system indexed by

     \kappa_{n}={\text{arg}\max}_{k\in\{1,\dots,K\}}\left\{\bar{X}_{k}[T_{k}(n-1)]+\sqrt{2\log n\over T_{k}(n-1)}\right\}, \quad (1)

     where \bar{X}_{k}[T_{k}(n-1)] denotes the sample mean of system k during the first (n-1) rounds.

     Update the sample average of system \kappa_{n}.

The UCB policy offers important insights into the adaptive allocation of sampling budget. Essentially, it balances the tradeoff between exploration and exploitation using the UCB defined on the right-hand-side (RHS) of (1). On the one hand, systems with higher ongoing sample averages have a higher chance of being chosen in the current round, contributing to higher total rewards. On the other hand, systems that have been chosen less frequently in the previous rounds, i.e., with smaller T_{k}(n-1), still have a sufficient chance of being chosen, so that such systems are sufficiently explored. The function \log n in (1) can be interpreted as the exploration rate that controls the speed of exploring systems that may not have the largest mean.

In the MAB context, the exploration rate is set to be \log n simply because it leads to the optimal rate of regret. However, it should be pointed out that when the objective is not to minimize the regret, as is the case in our problem setting, it is not yet clear whether a logarithmic exploration rate is optimal. For the sake of generality, we henceforth allow the exploration rate, denoted by \nu_{n}, to take different forms as a function of n, and refer to the resulting UCB policy as a generalized UCB (GUCB) policy, which is described in Algorithm 2.

Algorithm 2 (GUCB): Generalized UCB Sampling Policy

  1. Initialization: During the first K rounds, draw a sample from each system.

  2. Repeat: For n\geq K+1, draw a sample from the system indexed by

     I_{n}={\text{arg}\max}_{k\in\{1,\dots,K\}}\left\{\bar{X}_{k}[T_{k}(n-1)]+\sqrt{2\nu_{n}\over T_{k}(n-1)}\right\},

     where \bar{X}_{k}[T_{k}(n-1)] denotes the sample mean of system k during the first (n-1) rounds.

     Update the sample average of system I_{n}.
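To make the sampling scheme concrete, the following is a minimal Python sketch of the GUCB policy in Algorithm 2. The function name gucb_sample, the representation of the K systems as a list of callables, and the default logarithmic exploration rate are illustrative assumptions rather than part of the paper's formal setup.

import numpy as np

def gucb_sample(sample_fns, n, nu=np.log):
    """Sketch of the GUCB policy (Algorithm 2).

    sample_fns -- list of K callables; calling sample_fns[k]() returns one
                  independent draw from system k (an assumed interface).
    n          -- total sampling budget (number of rounds), n >= K.
    nu         -- exploration-rate function nu_n; np.log gives nu_n = log n.
    Returns the per-system draws and the allocation counts T_k(n).
    """
    K = len(sample_fns)
    draws = [[fn()] for fn in sample_fns]          # initialization: one draw each
    for t in range(K + 1, n + 1):                  # rounds K+1, ..., n
        ucb = [np.mean(d) + np.sqrt(2.0 * nu(t) / len(d)) for d in draws]
        k_next = int(np.argmax(ucb))               # system with the largest UCB
        draws[k_next].append(sample_fns[k_next]())
    counts = np.array([len(d) for d in draws])     # T_k(n)
    return draws, counts

With nu=np.log the sketch reduces to the UCB policy of Algorithm 1, while, e.g., nu=lambda t: t**0.5 would give a polynomial exploration rate.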

3.1 Grand Average (GA) Estimator

Before developing estimators for \mu^{*}, we highlight some key properties of the GUCB policy. To convey the main idea, we focus for now on the special case where the exploration rate \nu_{n}=\log n. In this case, it is well known (see Auer et al. 2002) that, for system k (k\neq k^{*}),

{\mathbb{E}}[T_{k}(n)]\leq C\log n,

for some constant C that depends on \Delta_{k}. Then, it can be easily seen that {\mathbb{E}}[T_{k^{*}}(n)], the expected number of times system k^{*} is chosen during the first n rounds, is of order n, because the summation of the T_{k}(n)'s equals n.

In other words, it is expected that among the first n rounds, system k^{*} is chosen in the majority of rounds. Recall that our objective is to estimate \mu^{*}, the mean of system k^{*}. It is, therefore, reasonable to use the grand average of the samples from all n rounds as an estimator of \mu^{*}, i.e., the GA estimator. Although it uses samples that are not drawn from system k^{*}, the validity of the GA estimator can be justified by the fact that the estimation error due to such samples fades out when n is sufficiently large, as the number of such samples is negligible compared to the number drawn from system k^{*}.

Specifically, the GA estimator is given by

\widetilde{M}_{n}=\frac{1}{n}\sum_{k=1}^{K}\sum_{j=1}^{T_{k}(n)}X_{k,j}, \quad (2)

where \{X_{k,j},j=1,\ldots,T_{k}(n)\} denotes the samples drawn from system k during the first n rounds using the GUCB policy in Algorithm 2.

The GA estimator is not new in the literature; it has served as a key ingredient of algorithms for MDPs and reinforcement learning in the special case that \nu_{n}=\log n. However, research on the statistical properties of the estimator has been underdeveloped. To the best of our knowledge, only the asymptotic unbiasedness of the estimator has been established; see, e.g., Kocsis and Szepesvári (2006), while other statistical properties seem to be missing. One of the contributions of this paper is to fill this gap. In particular, we establish strong consistency, asymptotic MSE, and a CLT for the GA estimator, which shall be discussed in detail in Section 4.

By taking the grand average of the samples of all the systems, including those with means smaller than \mu^{*}, the GA estimator is naturally negatively biased. While this bias gradually dies out as the sample size n increases, its impact cannot be neglected with finite samples, which has been considered a main drawback of the GA estimator.
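Reusing the draws structure returned by the gucb_sample sketch above (an assumed data layout, not the paper's notation), the GA estimator in (2) reduces to a single average over all collected samples:

import numpy as np

def ga_estimator(draws):
    """Grand average (GA) estimator in (2): the mean of all n samples,
    regardless of which system each sample was drawn from."""
    all_samples = np.concatenate([np.asarray(d, dtype=float) for d in draws])
    return float(all_samples.mean())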

3.2 Largest-Size Average (LSA) Estimator

As an attempt to remedy the bias drawback of the GA estimator, we propose a new estimator based on a key observation that the number of samples drawn from system k^{*} usually dominates those from other systems. We therefore define, at the nth round,

I_{n}^{*}\triangleq\mathop{\arg\max}_{k=1,...,K}T_{k}\left(n\right). \quad (3)

In other words, I_{n}^{*} is the index of the system from which the GUCB policy has sampled most frequently so far. It is reasonable to expect that I_{n}^{*}=k^{*} with high probability. Recall that \mu^{*} is the mean of the performance of system k^{*}. It is, therefore, reasonable to propose a new estimator of \mu^{*} by taking the sample average of the system indexed by I_{n}^{*}, i.e., the LSA estimator. Specifically, the LSA estimator is given by

M_{I_{n}^{*}}=\frac{1}{T_{I_{n}^{*}}(n)}\sum_{j=1}^{T_{I_{n}^{*}}(n)}X_{I_{n}^{*},j}, \quad (4)

where the samples are drawn using the GUCB policy in Algorithm 2.

The LSA estimator and the maximum estimator share a similar structure, both being sample averages of certain identified systems. However, the way of identifying the target system differs significantly. The maximum estimator chooses the system with the largest sample average and uses this sample average directly to estimate \mu^{*}. Instead, the LSA estimator chooses the system from which the GUCB policy has sampled most frequently so far, and uses the sample average of this system to estimate \mu^{*}. The latter seems more intuitively appealing in that it is virtually an unbiased sample-mean estimator conditional on the event that system k^{*} is correctly identified. Arguably, correctly identifying k^{*} may be relatively easy, as the number of rounds in which GUCB samples from system k^{*} is usually overwhelmingly larger than for the other systems.

Compared to the GA estimator, the LSA estimator avoids, with high probability, averaging over systems with smaller means. It is thus expected that the LSA estimator may have a smaller bias, which shall be shown theoretically in Section 4.
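Continuing the same sketch and assumptions, the LSA estimator in (4) averages only the samples of the most frequently sampled system:

import numpy as np

def lsa_estimator(draws):
    """Largest-size average (LSA) estimator in (4): sample mean of the
    system that the GUCB policy has sampled most often, i.e. system I_n^*."""
    i_star = int(np.argmax([len(d) for d in draws]))   # I_n^* in (3)
    return float(np.mean(draws[i_star]))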

3.3 Hypothesis Testing using Maximum Mean Estimators

In what follows, we demonstrate how the maximum mean estimators may lead to a new approach to a multiple comparison problem that finds important application in clinical trials.

In a clinical trial, suppose that one is interested in comparing K-1 treatments with unknown mean effects \mu_{1},\ldots,\mu_{K-1} to a control treatment with known mean effect \mu_{0}, where K\geq 2 and a higher mean effect implies a better treatment. In the current practice, a commonly used approach to this comparison problem is to test simultaneously K-1 null hypotheses H_{0,k} against alternatives H_{1,k}, where

H_{0,k}:\mu_{k}\leq\mu_{0}\quad vs.\quad H_{1,k}:\mu_{k}>\mu_{0},\quad k=1,\ldots,K-1. \quad (5)

With the above multiple hypotheses framework, a meta null hypothesis is set as all of the H_{0,k}'s being true simultaneously for k=1,\ldots,K-1, which is rejected if any of the H_{0,k}'s is rejected, implying that at least one of the K-1 treatments outperforms the control with certain confidence. Suppose that the overall type I error of this test with multiple hypotheses is set to be \alpha. In this case, the type I errors associated with the individual hypotheses cannot be set as \alpha, in order to control the overall type I error. Instead, the Bonferroni correction is often applied, which assigns a type I error of \alpha/(K-1) to each of the K-1 individual tests. By doing so, it can be verified that the family-wise error rate (FWER), defined as the probability of rejecting at least one true H_{0,k}, is not larger than \alpha, provided that these individual tests are independent. To see this, suppose that there are m_{0} true null hypotheses, assumed without loss of generality to be \{H_{0,1},\ldots,H_{0,m_{0}}\}, where m_{0} takes values in \{1,\ldots,K-1\} and is unknown to the researcher. Then,

\mbox{FWER}=\Pr\left(\cup_{i=1}^{m_{0}}\{\mbox{Rejecting $H_{0,i}$}\}\right)\leq\sum_{i=1}^{m_{0}}\Pr\left(\mbox{Rejecting $H_{0,i}$}\right)=m_{0}{\alpha\over K-1}\leq\alpha.

While it is easy to implement, the Bonferroni correction has been criticized for its conservativeness, i.e., a large gap between the FWER and \alpha, especially when K is large, and for the drawback that it usually increases the probability of type II errors. It is thus desirable to develop other procedures for the comparison of treatments that avoid the Bonferroni correction.
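For concreteness, the benchmark multiple-hypotheses procedure with the Bonferroni correction could be sketched as follows; the use of one-sided z-statistics with estimated standard deviations is an illustrative assumption made here, since the per-hypothesis test is not prescribed above.

import numpy as np
from statistics import NormalDist

def bonferroni_test(treatment_samples, mu0, alpha=0.05):
    """Reject the meta null hypothesis if any of the K-1 one-sided tests of
    H_{0,k}: mu_k <= mu0 rejects at the Bonferroni-corrected level alpha/(K-1)."""
    K_minus_1 = len(treatment_samples)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / K_minus_1)
    for x in map(np.asarray, treatment_samples):
        z = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
        if z > z_crit:
            return True    # at least one treatment appears better than the control
    return False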

We note that with the estimators of the maximum mean, a more appealing approach can be developed for this multiple comparison problem. More specifically, in contrast to the multiple hypotheses formulation, we propose a single hypothesis test, with

H_{0}:\max_{k=0,\ldots,K-1}\mu_{k}=\mu_{0}\quad vs.\quad H_{1}:\max_{k=0,\ldots,K-1}\mu_{k}>\mu_{0}. \quad (6)

Relabelling \mu_{0} as \mu_{K}, we note that the GA or LSA estimator, \widetilde{M}_{n} or M_{I_{n}^{*}}, can be used to estimate \max_{k=0,\ldots,K-1}\mu_{k}\equiv\max_{k=1,\ldots,K}\mu_{k}. Essentially, by comparing \widetilde{M}_{n} (or M_{I_{n}^{*}}) to \mu_{0}, one may be able to decide whether to reject H_{0}. Nevertheless, one needs to know the asymptotic distribution of \widetilde{M}_{n} (or M_{I_{n}^{*}}) in order to formalize the test. We shall show in Section 4, more specifically in Theorems 5 and 8, that both \widetilde{M}_{n} and M_{I_{n}^{*}} are asymptotically normally distributed, thus providing theoretical support for the proposed single hypothesis test. The exact forms of the test statistics resulting from the GA and LSA estimators are deferred to (11) and (14) in the next section.

4 Theoretical Analysis

Throughout the paper, we denote \bar{\gamma}=\max\{\gamma_{1},...,\gamma_{K}\}, where the \gamma_{k}'s are the sub-Gaussian parameters defined in Assumption 1. To facilitate the analysis, we first show a result on T_{k}(n).

Proposition 1.

Suppose that Assumptions 1 and 2 hold, and \nu_{n}\geq 4\bar{\gamma}^{2}\log n. Then, for k\neq k^{*} and any positive integer p,

{\mathbb{E}}T_{k}(n)^{p}\leq\begin{cases}\displaystyle\frac{8\nu_{n}}{\Delta_{k}^{2}}+5&if\ p=1;\\ \displaystyle\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\right\}^{2}&if\ p=2;\\ \displaystyle\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\left[6\log\left(n+1\right)+O(1)\right]^{\frac{1}{3}}\right\}^{3}&if\ p=3;\\ \displaystyle\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\left[\frac{2p}{p-3}(n+1)^{p-3}+O\left(n^{p-4}\right)\right]^{\frac{1}{p}}\right\}^{p}&if\ p\geq 4,\end{cases}

where \Delta_{k}>0 for k\neq k^{*} due to Assumption 2, and the notation O(\cdot) means that \limsup_{n\rightarrow\infty}a_{n}/b_{n}\leq C for some constant C if a_{n}=O(b_{n}).

Proposition 1 provides an upper bound for the pth moment of T_{k}(n), the number of samples drawn from system k during the first n rounds, for k\neq k^{*}. This upper bound relies on the exploration rate \nu_{n}, under the assumption that the exploration rate is sufficiently fast, i.e., at least of logarithmic order. In the special case when \nu_{n}=4\bar{\gamma}^{2}\log n and p=1, the result is the same as that in Theorem 1 of Auer et al. (2002).

Essentially, Proposition 1 gives an upper bound on the rate at which the pth moment of T_{k}(n) grows as n\rightarrow\infty for k\neq k^{*}. It implies that the sampling ratios T_{k}(n)/n satisfy certain convergence properties. In particular, strong consistency of the sampling ratios is summarized in the following theorem, whose proof is provided in Section A.3 of the appendix.

Theorem 1.

Suppose that Assumptions 1 and 2 hold, and \nu_{n}\in\left[\alpha\log n,n^{1-\delta}\right] with 0<\delta<1 and \alpha\geq 4\bar{\gamma}^{2}. Then, as n\to\infty,

\frac{T_{k}(n)}{n}\overset{a.s.}{\longrightarrow}\begin{cases}0&for\ k\neq k^{*};\\ 1&for\ k=k^{*},\end{cases}

where the notation \overset{a.s.}{\longrightarrow} denotes almost sure convergence (i.e., convergence with probability 1).

Theorem 1 implies that in the long run, the number of samples drawn from system k^{*} dominates the total number of samples drawn from the other systems, in the almost sure sense. This result suggests that it is natural to use I_{n}^{*}, defined in (3), as an estimator of the unknown k^{*}. Strong consistency of I_{n}^{*} can also be established, which is summarized in the following theorem. Proof of the theorem is provided in Section A.4 of the appendix.

Theorem 2.

Under the same conditions of Theorem 1, as n\to\infty,

I^{*}_{n}\overset{a.s.}{\longrightarrow}k^{*}.

4.1 GA Estimator

In what follows, we establish strong consistency, asymptotic MSE and asymptotic normality for the GA estimator \widetilde{M}_{n}. Let X_{I_{j},j} denote the sample drawn at the jth round. Note that

\widetilde{M}_{n}=\frac{1}{n}\sum_{k=1}^{K}T_{k}(n)\bar{X}_{k}[T_{k}(n)]=\frac{1}{n}\sum_{j=1}^{n}X_{I_{j},j}=\frac{1}{n}\sum_{j=1}^{n}\left(X_{I_{j},j}-\mu_{I_{j}}\right)+\frac{1}{n}\sum_{j=1}^{n}\mu_{I_{j}}.

Define

Z_{n}\triangleq\sum_{j=1}^{n}\left(X_{I_{j},j}-\mu_{I_{j}}\right).

Then,

\widetilde{M}_{n}=\frac{Z_{n}}{n}+\frac{1}{n}\sum_{j=1}^{n}\mu_{I_{j}}. \quad (7)

Let {\cal F}_{n} be the \sigma-field generated by the first n samples for n\geq 1, and {\cal F}_{0}=\{\Omega,\emptyset\}. Note that I_{n} is {\cal F}_{n-1}-measurable. It follows that for n\geq 1,

{\mathbb{E}}\left[\left.X_{I_{n},n}\right|{\cal F}_{n-1}\right]=\mu_{I_{n}},\ {\rm and\ then}\ {\mathbb{E}}X_{I_{n},n}={\mathbb{E}}\mu_{I_{n}}.

Therefore, {\mathbb{E}}Z_{n}=0 and

{\mathbb{E}}\left[\left.Z_{n}\right|{\cal F}_{n-1}\right]=Z_{n-1}+{\mathbb{E}}\left[\left.X_{I_{n},n}-\mu_{I_{n}}\right|{\cal F}_{n-1}\right]=Z_{n-1},

i.e., Z_{n} is a martingale with mean 0.

Note that X_{k}-\mu_{k} is sub-Gaussian with parameter \gamma_{k} for k=1,...,K. Because I_{n+1} is {\cal F}_{n}-measurable, we have

{\mathbb{E}}\left[\left.\exp\{s(X_{I_{n+1},n+1}-\mu_{I_{n+1}})\}\right|{\cal F}_{n}\right]\leq\exp\left\{\frac{\bar{\gamma}^{2}s^{2}}{2}\right\},\mbox{ for }s\in{\mathbb{R}},

which implies that X_{I_{n+1},n+1}-\mu_{I_{n+1}} is sub-Gaussian with parameter \bar{\gamma}.

Recall that by the definition of Z_{n},

Z_{n+1}-Z_{n}=X_{I_{n+1},n+1}-\mu_{I_{n+1}}.

Then, we have

{\mathbb{P}}\left(\left.|Z_{n+1}-Z_{n}|>a\right|{\cal F}_{n}\right)\leq 2\exp\left\{-\frac{a^{2}}{2\bar{\gamma}^{2}}\right\},

by the Hoeffding bound in Lemma A1.

Further note that \{Z_{n+1}-Z_{n},n=1,2,...\} is a martingale difference sequence, so the condition of Theorem 2 in Shamir (2011) is satisfied. Applying this theorem, for any \delta>0, it holds that with probability at least 1-\delta,

\frac{Z_{n}}{n}\leq\sqrt{\frac{56\log(1/\delta)}{n/(2\bar{\gamma}^{2})}}=\sqrt{\frac{112\bar{\gamma}^{2}\log(1/\delta)}{n}}.

Let \varepsilon=\sqrt{\frac{112\bar{\gamma}^{2}\log(1/\delta)}{n}}. Then, \delta=\exp\left\{-\frac{n\varepsilon^{2}}{112\bar{\gamma}^{2}}\right\} and

{\mathbb{P}}\left(\frac{Z_{n}}{n}>\varepsilon\right)\leq\delta=\exp\left\{-\frac{n\varepsilon^{2}}{112\bar{\gamma}^{2}}\right\}.

Applying the same argument to the sequence \{-Z_{n},n=1,2,...\}, we symmetrically have

{\mathbb{P}}\left(\frac{Z_{n}}{n}<-\varepsilon\right)\leq\delta=\exp\left\{-\frac{n\varepsilon^{2}}{112\bar{\gamma}^{2}}\right\}.

Therefore,

{\mathbb{P}}\left(\left|\frac{Z_{n}}{n}\right|>\varepsilon\right)\leq 2\exp\left\{-\frac{n\varepsilon^{2}}{112\bar{\gamma}^{2}}\right\},

and moreover,

\sum_{n=1}^{\infty}{\mathbb{P}}\left(\left|Z_{n}/n\right|>\varepsilon\right)<\infty.

By the Borel-Cantelli lemma, Z_{n}/n\overset{a.s.}{\longrightarrow}0 as n\to\infty, i.e., the first term on the RHS of (7) converges to 0 almost surely.

Moreover, the second term on the RHS of (7) satisfies

\frac{1}{n}\sum_{j=1}^{n}\mu_{I_{j}}=\frac{1}{n}\sum_{k=1}^{K}T_{k}(n)\mu_{k}\overset{a.s.}{\longrightarrow}\mu_{k^{*}},

where the convergence follows from Theorem 1 and the continuous mapping theorem. Therefore, by (7), we establish strong consistency of \widetilde{M}_{n}, which is summarized as follows.

Theorem 3.

Under the same conditions of Theorem 1, \widetilde{M}_{n} is a strongly consistent estimator of \mu^{*}, i.e., as n\to\infty,

\widetilde{M}_{n}\overset{a.s.}{\longrightarrow}\mu^{*}.

Theorem 3 ensures that the GA estimator, \widetilde{M}_{n}, converges to \mu^{*} with probability 1 as n\rightarrow\infty, which serves as an important theoretical guarantee for the estimator. To further understand its probabilistic behavior, it is desirable to conduct error analysis, especially on its bias and variance. In the following theorem, we establish upper bounds for its bias, variance and MSE. The proof of the theorem is provided in Section A.5 of the appendix.

Theorem 4.

Suppose that Assumptions 1 and 2 hold, and \nu_{n}\in\left[\alpha\log n,n^{\frac{1}{2}-\delta}\right] with 0<\delta<1/2 and \alpha\geq 4\bar{\gamma}^{2}. Let c be a constant satisfying \frac{8\nu_{n}}{\Delta_{k}^{2}}+5\leq c\nu_{n} for k\neq k^{*}, \bar{\mu}^{2}=\max\{\mu_{1}^{2},...,\mu_{K}^{2}\}, and \sigma_{k}^{2}={\rm Var}(X_{k}) for k=1,\ldots,K. Then,

\left|{\rm Bias}\left(\widetilde{M}_{n}\right)\right|\leq\sum_{k\neq k^{*}}\frac{1}{n}\left(\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\Delta_{k}\right)=O\left(\frac{\nu_{n}}{n}\right),\quad{\rm Var}\left[\widetilde{M}_{n}\right]\leq\frac{2K\sigma_{k^{*}}^{2}}{n}+\frac{2K^{3}(\bar{\mu}^{2}+1)}{n^{2}}c^{2}\nu_{n}^{2}=O\left(\frac{1}{n}\right),

and thus

{\rm MSE}\left(\widetilde{M}_{n}\right)\leq{2K\sigma_{k^{*}}^{2}\over n}+o\left({1\over n}\right),

where the notation o(\cdot) means that \lim_{n\rightarrow\infty}a_{n}/b_{n}=0 if a_{n}=o(b_{n}).

Theorem 4 shows that the MSE of the GA estimator converges to 0 at a rate of n^{-1}. In particular, the variance is the dominant term in the MSE, compared to the square of the bias, which is of order \nu_{n}^{2}/n^{2}. This result implies that when the exploration rate takes value in the range \nu_{n}\in\left[\alpha\log n,n^{1/2-\delta}\right] with 0<\delta<1/2, the bias of the GA estimator is negligible compared to its variance, making it possible to construct CIs by ignoring its bias. Furthermore, as suggested by Theorem 4, the leading term of the variance does not depend on \nu_{n}. It is thus asymptotically optimal to choose logarithmic \nu_{n}, the slowest possible rate, so as to reduce the asymptotic bias as far as possible. The superiority of this choice of \nu_{n} is also confirmed by the numerical experiments in Section 5.

To provide formal theoretical support for asymptotically valid CIs and the single hypothesis test in (6), we establish a CLT for \widetilde{M}_{n} in the following theorem, whose proof is provided in Section A.6 of the appendix.

Theorem 5.

Suppose that Assumptions 1 and 2 hold, and \nu_{n}\in\left[\alpha\log n,n^{\frac{1}{2}-\delta}\right] with 0<\delta<1/2 and \alpha\geq 4\bar{\gamma}^{2}. Then,

\sqrt{n}\left(\widetilde{M}_{n}-\mu^{*}\right)\Rightarrow N(0,\sigma_{k^{*}}^{2}),\ \ as\ \ n\rightarrow\infty,

where \Rightarrow denotes convergence in distribution.

Theorem 5 shows that the GA estimator is asymptotically normally distributed with mean \mu^{*} and variance \sigma_{k^{*}}^{2}/n. Based on Theorem 5, an asymptotically valid CI can be constructed for \mu^{*}. To do so, a remaining issue is how to estimate the unknown \sigma_{k^{*}}^{2}. In light of the proof of Theorem 5, we estimate \sigma_{k^{*}}^{2} using

\widetilde{\sigma}_{n}^{2}=\frac{1}{n}\sum_{k=1}^{K}T_{k}(n)\hat{\sigma}_{k}^{2}, \quad (8)

where for k=1,...,K,

\hat{\sigma}_{k}^{2}=\frac{1}{T_{k}(n)}\sum_{j=1}^{T_{k}(n)}X_{k,j}^{2}-\left[\frac{1}{T_{k}(n)}\sum_{j=1}^{T_{k}(n)}X_{k,j}\right]^{2}. \quad (9)

It can be shown that \widetilde{\sigma}_{n}^{2} converges to \sigma_{k^{*}}^{2} in probability as n\rightarrow\infty. This result is summarized in the following proposition, whose proof is provided in Section A.7 of the appendix.

Proposition 2.

Under the same conditions of Theorem 5, as n\rightarrow\infty,

\widetilde{\sigma}_{n}^{2}\to\sigma_{k^{*}}^{2},\ \ {\rm in\ probability}.

From Theorem 5 and Proposition 2, it follows that

\sqrt{n}\widetilde{\sigma}_{n}^{-1}\left(\widetilde{M}_{n}-\mu^{*}\right)\Rightarrow N(0,1).

Then, an asymptotically valid 100(1-\beta)\% CI for \mu^{*} is given by

\left(\widetilde{M}_{n}-z_{1-\beta/2}\widetilde{\sigma}_{n}/\sqrt{n},\quad\widetilde{M}_{n}+z_{1-\beta/2}\widetilde{\sigma}_{n}/\sqrt{n}\right), \quad (10)

where z_{1-\beta/2} is the 1-\beta/2 quantile of the standard normal distribution.
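A minimal sketch of the variance estimator (8)-(9) and the CI (10), again under the assumed draws layout from the GUCB sketch in Section 3:

import numpy as np
from statistics import NormalDist

def ga_confidence_interval(draws, beta=0.10):
    """Asymptotic 100(1-beta)% CI for mu* based on (8)-(10)."""
    n = sum(len(d) for d in draws)
    samples = [np.asarray(d, dtype=float) for d in draws]
    m_tilde = np.concatenate(samples).mean()                # GA estimator (2)
    var_tilde = sum(len(x) * x.var() for x in samples) / n  # (8) with plug-in variances (9)
    half_width = NormalDist().inv_cdf(1.0 - beta / 2.0) * np.sqrt(var_tilde / n)
    return m_tilde - half_width, m_tilde + half_width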

The CLT in Theorem 5 and the variance estimator \widetilde{\sigma}_{n}^{2} in Proposition 2 also serve as the theoretical foundation for the proposed single hypothesis test in Section 3.3. More specifically, the following test statistic can be constructed:

\widetilde{U}_{n}\triangleq{\sqrt{n}\left(\widetilde{M}_{n}-\mu_{0}\right)\over\widetilde{\sigma}_{n}}. \quad (11)

Note that under either the null or the alternative hypothesis, \max_{k=0,\ldots,K-1}\mu_{k}\geq\mu_{0}. Therefore, a one-sided test can be conducted using the GA-based statistic \widetilde{U}_{n}. In particular, the null hypothesis is rejected when \widetilde{U}_{n}>z_{1-\alpha}, suggesting that at least one of the considered treatments outperforms the control treatment, where \alpha denotes the overall type I error.
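The corresponding GA-based test can be sketched as follows (the default level alpha=0.05 is an illustrative choice, not one fixed by the paper):

import numpy as np
from statistics import NormalDist

def ga_single_test(draws, mu0, alpha=0.05):
    """Reject H_0 in (6) when the GA-based statistic in (11) exceeds z_{1-alpha}."""
    n = sum(len(d) for d in draws)
    samples = [np.asarray(d, dtype=float) for d in draws]
    m_tilde = np.concatenate(samples).mean()                    # GA estimator (2)
    var_tilde = sum(len(x) * x.var() for x in samples) / n      # sigma_tilde_n^2 in (8)
    u_stat = np.sqrt(n) * (m_tilde - mu0) / np.sqrt(var_tilde)  # statistic (11)
    return u_stat > NormalDist().inv_cdf(1.0 - alpha)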

4.2 LSA Estimator

Strong consistency of the LSA estimator is stated in the following theorem, whose proof is provided in Section A.8 of the appendix.

Theorem 6.

Suppose that Assumptions 1 and 2 hold, and \nu_{n}\in\left[\alpha\log n,n^{1-\delta}\right] with 0<\delta<1 and \alpha\geq 4\bar{\gamma}^{2}. Then, as n\to\infty,

M_{I^{*}_{n}}\overset{a.s.}{\longrightarrow}\mu^{*}.

Theorem 6 shows that the LSA estimator converges almost surely to \mu^{*} as n\rightarrow\infty. To better understand its asymptotic error, we establish upper bounds for its asymptotic bias, variance and MSE, which are summarized in the following theorem with proof provided in Section A.9 of the appendix.

Theorem 7.

Under the same conditions of Theorem 6, {\rm Bias}(M_{I_{n}^{*}})={\mathbb{E}}\left[M_{I_{n}^{*}}-\mu_{k^{*}}\right]<0 for sufficiently large n, and there exist constants c_{1},c_{2} and c_{3} such that

\left|{\rm Bias}\left(M_{I_{n}^{*}}\right)\right|\leq{K^{2}\sigma_{k^{*}}^{2}c_{2}\nu_{n}+c_{3}\over\sqrt{n}\left(n-Kc_{1}\nu_{n}\right)}+o\left({\nu_{n}\over n^{3/2}}\right)=O\left({\nu_{n}\over n^{3/2}}\right),

{\rm Var}\left[M_{I_{n}^{*}}\right]\leq{\sigma_{k^{*}}^{2}K^{2}\over n}+o\left({1\over n}\right),

and thus

{\rm MSE}\left[M_{I_{n}^{*}}\right]\leq{\sigma_{k^{*}}^{2}K^{2}\over n}+O\left(\frac{\nu_{n}^{2}}{n^{3}}\right)=O\left(\frac{1}{n}\right).

Theorem 7 shows that the bias of the LSA estimator decays at a rate at least as fast as \nu_{n}/n^{3/2}, which is much faster than that of the GA estimator (see Theorem 4). Interestingly, the rate of convergence of its bias can be faster than n^{-1} if \nu_{n}=o(n^{1/2}). Furthermore, as suggested by Theorem 7, the leading term of the variance does not depend on \nu_{n}. It is thus asymptotically optimal to choose logarithmic \nu_{n}, the slowest possible rate, so as to reduce the asymptotic bias as far as possible. The superiority of this choice of \nu_{n} is also confirmed by the numerical experiments in Section 5.

It should be remarked that in Theorems 7 and 4, the dominant coefficients of the variance bounds for the LSA and GA estimators, i.e., \sigma_{k^{*}}^{2}K^{2} and 2K\sigma_{k^{*}}^{2}, are not tight and can be further reduced with more elaborate analysis. These coefficients do not imply that the GA estimator has a smaller variance; in fact, the numerical results in Section 5 suggest the opposite.

Similar to the case for GA, we establish a CLT for the LSA estimator, which is stated in the following theorem. Proof of the theorem is provided in Section A.10 of the appendix.

Theorem 8.

Under the same conditions of Theorem 6,

\sqrt{n}\left(M_{I^{*}_{n}}-\mu^{*}\right)\Rightarrow N(0,\sigma_{k^{*}}^{2}),\ \ as\ \ n\to\infty.

Theorem 8 shows the asymptotic normality of the LSA estimator, whose rate of convergence is n^{-1/2}. From its proof, it can also be seen that the CLT still holds if \sqrt{n} is replaced by \sqrt{T_{I^{*}_{n}}(n)}.

While both CLTs for the GA and LSA estimators (in Theorems 5 and 8, respectively) have the same rate of convergence, it should be emphasized that they require different conditions on the exploration rate. Specifically, the CLT holds for the LSA estimator so long as the exploration rate \nu_{n} is within the range \left[\alpha\log n,n^{1-\delta}\right], while the CLT of the GA estimator can only be guaranteed under the more restrictive condition that \nu_{n}\in\left[\alpha\log n,n^{1/2-\delta}\right].

In what follows, we construct an asymptotically valid CI for \mu^{*} based on Theorem 8. To do this, we propose to estimate the unknown variance \sigma_{k^{*}}^{2} by

\bar{\sigma}_{n}^{2}=\frac{1}{T_{I^{*}_{n}}(n)}\sum_{j=1}^{T_{I^{*}_{n}}(n)}X_{I^{*}_{n},j}^{2}-\left[\frac{1}{T_{I^{*}_{n}}(n)}\sum_{j=1}^{T_{I^{*}_{n}}(n)}X_{I^{*}_{n},j}\right]^{2}.

Strong consistency of the estimator \bar{\sigma}_{n}^{2} is established in the following proposition, whose proof is provided in Section A.11 of the appendix.

Proposition 3.

Under the same conditions of Theorem 6,

\bar{\sigma}_{n}^{2}\overset{a.s.}{\longrightarrow}\sigma_{k^{*}}^{2},\quad{\rm as}\ n\rightarrow\infty. \quad (12)

Similar to (10), an asymptotically valid 100(1-\beta)\% CI for \mu^{*} is given by

\left(M_{I^{*}_{n}}-z_{1-\beta/2}\bar{\sigma}_{n}/\sqrt{n},\quad M_{I^{*}_{n}}+z_{1-\beta/2}\bar{\sigma}_{n}/\sqrt{n}\right). \quad (13)

Similar to the discussion in Section 4.1, for the proposed single hypothesis test in Section 3.3, we construct the following LSA-based test statistic:

U_{n}^{*}\triangleq{\sqrt{n}\left(M_{I_{n}^{*}}-\mu_{0}\right)\over\bar{\sigma}_{n}}, \quad (14)

and reject the null hypothesis when U_{n}^{*}>z_{1-\alpha}, where \alpha denotes the overall type I error.
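Analogously, a sketch of the LSA-based CI (13) and test based on (14), under the same assumed draws layout:

import numpy as np
from statistics import NormalDist

def lsa_ci_and_test(draws, mu0, beta=0.10, alpha=0.05):
    """LSA-based confidence interval (13) and single hypothesis test via (14)."""
    i_star = int(np.argmax([len(d) for d in draws]))   # I_n^* in (3)
    x = np.asarray(draws[i_star], dtype=float)
    n = sum(len(d) for d in draws)
    m_lsa = x.mean()                                   # LSA estimator (4)
    var_bar = x.var()                                  # sigma_bar_n^2
    half_width = NormalDist().inv_cdf(1.0 - beta / 2.0) * np.sqrt(var_bar / n)
    ci = (m_lsa - half_width, m_lsa + half_width)
    reject = np.sqrt(n) * (m_lsa - mu0) / np.sqrt(var_bar) > NormalDist().inv_cdf(1.0 - alpha)
    return ci, reject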

5 Numerical Study

We consider three examples. The first example aims to illustrate the asymptotic behaviors of the GA and LSA estimators, including their errors and rates of convergence. While it has been argued in Sections 4.1 and 4.2 that the logarithmic exploration rate is asymptotically optimal for minimizing the MSEs of the GA and LSA estimators, we confirm this numerically in the first set of experiments. For the rest of the experiments in all examples, we choose the logarithmic exploration rate given its superiority on both theoretical and empirical grounds.

In the second example, we consider an application of the GA and LSA estimators to the estimation of coherent risk measures, and compare our constructed GA and LSA CIs to the fixed-width CIs produced by the procedure of Lesnevski et al. (2007).

In the third example, we consider a clinical trial using the GA- and LSA-based single hypothesis tests proposed in (6), aiming to demonstrate their relative merits compared to an existing test with multiple hypotheses in the form of (5).

5.1 Implementation Issues

In the numerical experiments, we observed that the bias of the GA estimator with finite samples could be quite large, especially when the variances of the systems are large or the exploration rate \nu_{n} is large. This phenomenon is intuitively explained by the fact that too many samples may be drawn from systems other than k^{*}, leading to a significant underestimation of the maximum mean. However, this adverse effect dies out gradually in the long run. As a means to alleviate this adverse effect in the finite-sample setting, we propose to treat a certain number of samples as a warm-up period and discard the warm-up samples in the estimation. Numerical experiments suggest that a small number of warm-up samples, e.g., 10\% of the total sampling budget, may greatly improve the estimation quality of the GA estimator. Throughout our experiments, we set 10\% of the budget as the warm-up period for the GA estimator.

By contrast, the LSA estimator is insensitive to the warm-up samples. Whether to keep or to discard these samples does not have much impact on the quality of the LSA estimator. During the implementation, we discard the warm-up samples for the LSA estimator as well, to parallel the setting for the GA estimator.

From an implementation perspective, it has also been documented that slightly tuning the UCB policy by taking the variances of the systems into consideration may substantially improve the performance of algorithms for multi-armed bandit problems; see, e.g., Auer et al. (2002). We observe the same phenomenon when estimating the maximum mean. Following the same idea as Auer et al. (2002), we propose to replace the UCB term \sqrt{2\nu_{n}/T_{k}(n-1)} by

\sqrt{{2\nu_{n}\over T_{k}(n-1)}V_{k,n}}

for k=1,\ldots,K during the implementation, where

V_{k,n}=\frac{1}{T_{k}(n-1)}\sum_{j=1}^{T_{k}(n-1)}X_{k,j}^{2}-\bar{X}_{k}^{2}[T_{k}(n-1)]+\sqrt{2\nu_{n}\over T_{k}(n-1)}

is an upper confidence bound estimate of the variance of system k.
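A sketch of this variance-tuned bonus is given below; the function name and interface are illustrative, and the returned term would simply replace the exploration bonus inside the GUCB sketch of Section 3.

import numpy as np

def tuned_bonus(samples_k, nu_n):
    """Variance-tuned exploration bonus sqrt(2*nu_n/T_k * V_{k,n}), where V_{k,n}
    is the plug-in variance of system k plus the bonus sqrt(2*nu_n/T_k)."""
    x = np.asarray(samples_k, dtype=float)
    t_k = len(x)
    v_kn = x.var() + np.sqrt(2.0 * nu_n / t_k)   # V_{k,n}
    return np.sqrt(2.0 * nu_n / t_k * v_kn)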

5.2 Maximum of K Normal Means

Consider K systems, with samples drawn from normal distributions with unknown means (\mu_{1},\mu_{2},\dots,\mu_{K}). The parameter configuration is similar to that of Fan et al. (2016), who focused on selecting the system with the largest mean rather than estimating the maximum mean. More specifically, the means of the systems are set to be \mu_{k}=1.5+0.5k, k=1,\ldots,K, and the standard deviations are set to be \sigma_{k}=\mu_{k} for all systems.

We examine the performances of different estimators when estimating \mu^{*}, by considering their biases, standard deviations, MSEs, and RRMSEs, where the RRMSE is defined as the root MSE expressed as a percentage of the true maximum mean. Reported results are based on 10^{4} independent replications unless otherwise stated.
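As an illustration only, the following sketch assembles the earlier gucb_sample and lsa_estimator sketches into a small experiment of this type; the replication count, the omission of the warm-up period and of the variance-tuned bonus described in Section 5.1, and all other implementation details are simplifying assumptions, not the paper's actual experimental code.

import numpy as np

rng = np.random.default_rng(0)
K, n, n_reps = 20, 10_000, 100
mu = np.array([1.5 + 0.5 * k for k in range(1, K + 1)])   # mu_k = 1.5 + 0.5k
sigma = mu.copy()                                         # sigma_k = mu_k
mu_star = mu.max()

errors = []
for _ in range(n_reps):
    sample_fns = [lambda k=k: rng.normal(mu[k], sigma[k]) for k in range(K)]
    draws, _ = gucb_sample(sample_fns, n, nu=np.log)       # from the earlier sketch
    errors.append(lsa_estimator(draws) - mu_star)

errors = np.array(errors)
rel_bias = 100 * errors.mean() / mu_star                   # relative bias (%)
rrmse = 100 * np.sqrt((errors ** 2).mean()) / mu_star      # RRMSE (%)
print(f"relative bias = {rel_bias:.2f}%, RRMSE = {rrmse:.2f}%")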

In the first set of experiments, we let K=20 with the total sampling budget n ranging between 10^{4} and 10^{7}. We compare biases, MSEs, and coverage probabilities for the GA and LSA estimators with respect to (w.r.t.) \nu_{n}=\log n, n^{1/2}, and n^{2/3}. The results are summarized in Figures 1, 2, and 3, illustrating the rates of convergence of absolute biases and MSEs, and coverage probabilities of the 90% CIs, for the GA (left panel) and LSA (right panel) estimators.

  • Figure 1 shows that for different choices of \nu_{n}, the bias of the LSA estimator is substantially smaller than that of the GA estimator, consistent with Theorems 4 and 7, which show that the former converges to 0 at a rate of at least \nu_{n}/n^{3/2}, much faster than the latter (with a rate of \nu_{n}/n). Furthermore, it is observed that logarithmic \nu_{n} leads to the smallest biases among the three choices, for both the GA and LSA estimators.

  • Figure 2 shows that the MSE of the LSA estimator converges to 0 at a rate of approximately n^{-1} for all choices of \nu_{n}. For the GA estimator, the same rate is observed for logarithmic \nu_{n}, while the rate tends to be slower when \nu_{n}=n^{1/2} and \nu_{n}=n^{2/3}. This coincides with the requirement in Theorem 4 that \nu_{n} must grow more slowly than n^{1/2} for the GA estimator.

  • Figure 3 shows that the LSA CIs outperform the GA ones substantially in terms of coverage probabilities, especially when sample sizes are relatively small, due to the significantly smaller bias of the LSA estimator. When \nu_{n}=\log n, both the GA and LSA CIs achieve the nominal coverage probability with sufficiently large sample sizes. However, the GA CIs fail to do so for \nu_{n}=n^{1/2} and \nu_{n}=n^{2/3}, which violate the requirement in Theorem 5.

Figure 1: (Color online) Convergence of estimated absolute biases.
Figure 2: (Color online) Convergence rate of estimated MSEs.
Figure 3: (Color online) Observed coverage probabilities of the 90% confidence intervals with respect to different sample sizes.

Given the superiority of the logarithmic exploration rate, we set \nu_{n}=\log n in the rest of the experiments in Section 5.

In the second set of experiments, we fix K=20, and compare the GA and LSA estimators to four competing estimators given as follows. We first consider two maximum average estimators that take the maximum of the sample means of the K systems. In particular, the first is a maximum average estimator based on the GUCB adaptive sampling policy, referred to as the adaptive maximum average (AMA) estimator, while the other is the maximum average estimator based on a simple static sampling policy that randomly allocates each sample to one of the systems with equal probabilities, referred to as the simple maximum average (SMA) estimator. Moreover, we consider two heuristic estimators in the form of the sample average of a selected system resulting from existing best arm identification methods. In particular, we consider two Bayesian best arm identification methods, namely the top-two probability sampling (TTPS) method of Russo (2020) and the top-two expected improvement (TTEI) method of Qin et al. (2017). For these two estimators, we assume known variances of the systems as required by the TTPS and TTEI methods, and set the tuning parameter \beta to the default value of 0.5, following Russo (2020) and Qin et al. (2017).

It should be pointed out that, in spite of being reasonable estimators, asymptotic properties of the AMA, TTPS and TTEI estimators, such as their biases, MSEs and central limit theorems, are not known in the literature. The comparison of these estimators in our experiments serves merely to numerically examine their merits, while development of theoretical underpinnings is a topic that deserves future investigation.

Relative biases and standard deviations, expressed as percentages of the true maximum mean, and RRMSEs are summarized in Table 1 (for the cases of n=10^{4} and 10^{5}, the reported errors are estimated based on 10^{5} independent replications). From Table 1, it can be seen that all the considered estimators converge as the sample size increases. Among them, LSA and AMA have almost the same biases and variances, and perform the best with the smallest MSEs. Their biases are also smaller than or comparable to those of the other estimators. GA and SMA suffer from significantly larger biases compared to the other estimators. TTPS and TTEI have biases comparable to those of LSA and AMA, but with larger variances. It is also interesting to observe that while SMA is positively biased, AMA is negatively biased, suggesting that the dependence structure among the samples drawn under the UCB policy may lead to a negative bias.

Table 1: Relative bias (%), relative standard deviation (%) and RRMSE (%) of different estimators

                        LSA                               GA
n                 10^4      10^5      10^6        10^4      10^5      10^6
Relative bias    -0.12   -2.7e-03  -1.3e-03      -2.38     -0.40     -0.03
Relative stdev    1.28      0.33      0.10        1.31      0.36      0.10
RRMSE             1.28      0.33      0.10        3.03      0.54      0.11

                        AMA                               SMA
n                 10^4      10^5      10^6        10^4      10^5      10^6
Relative bias    -0.12   -2.7e-03  -1.3e-03       0.98      0.01    3.0e-05
Relative stdev    1.27      0.33      0.10        3.76      1.40      0.45
RRMSE             1.27      0.33      0.10        3.89      1.40      0.45

                        TTPS                              TTEI
n                 10^4      10^5      10^6        10^4      10^5      10^6
Relative bias  -9.0e-03  -7.3e-03  -6.1e-03      -0.14   -2.7e-03  -7.5e-04
Relative stdev    1.51      0.49      0.14        1.77      0.45      0.14
RRMSE             1.51      0.50      0.14        1.77      0.45      0.14

5.3 Simulated CIs for Coherent Risk Measures

Consider a risk measurement application in which we are interested in estimating a coherent risk measure that is the maximum of the expected discounted portfolio loss taken over KK probability distributions. Lesnevski et al. (2007) proposed a simulation-based procedure to construct a fixed-width CI for the risk measure, and used an example with K=256K=256 to demonstrate the performance of the constructed CIs. For the sake of completeness, we briefly describe this example as follows.

A coherent risk measure can be written as

sup𝒫𝔼[Y/r],\displaystyle\sup_{{\mathbb{P}}\in{\cal P}}{\mathbb{E}}_{\mathbb{P}}[-Y/r],

where Y is the value of a portfolio at a future time horizon, 1/r is a stochastic discount factor representing the time value of money, and 𝒫 is a set of probability measures, simplified here to 𝒫 = {ℙ_1, ..., ℙ_K}. Denote X = −Y/r and μ_k = 𝔼_{ℙ_k}[X]. Let X_k be a random variable whose distribution under a probability measure ℙ is the same as that of X under ℙ_k, i.e., ℙ(X_k ≤ x) = ℙ_k(X ≤ x). Then the coherent risk measure is exactly the maximum mean:

maxk=1,,K𝔼Xk.\displaystyle\max_{k=1,...,K}{\mathbb{E}}X_{k}. (15)

Consider estimating the coherent risk measure of a portfolio consisting of European-style call and put options written on three assets with price Sj(t)S_{j}(t) at time tt for j=1,2,3j=1,2,3. Let S0(t)S_{0}(t) denote a market index. All options in the portfolio have a common terminal time TT. For each of j=0,1,2,3j=0,1,2,3, Sj(t)S_{j}(t) follows geometric Brownian motion with drift djd_{j} and volatility ζj\zeta_{j}, and therefore logSj(t)=logSj(0)+(djζj2/2)t+ζjWjt\log S_{j}(t)=\log S_{j}(0)+(d_{j}-\zeta_{j}^{2}/2)t+\zeta_{j}W_{j}\sqrt{t}, where WjW_{j} is a standard normal random variable. The assets are dependent in the sense that W0=Z0W_{0}=Z_{0}, and Wj=λjZ0+1λj2ZjW_{j}=\lambda_{j}Z_{0}+\sqrt{1-\lambda_{j}^{2}}Z_{j} for j=1,2,3j=1,2,3, where under probability measure {\mathbb{P}}, Z0Z_{0}, Z1Z_{1}, Z2Z_{2}, and Z3Z_{3} are independent standard normal random variables. In this model, Z0Z_{0} corresponds to the market factor common to all assets, while Z1Z_{1}, Z2Z_{2}, and Z3Z_{3} are idiosyncratic factors corresponding to each individual asset.

Denote by Φ the standard normal distribution function. Each Z_j for j = 0, 1, 2, 3 is sampled according to one of four cases with equal probabilities: conditional on exceeding Φ^{-1}(1 − 1/√20), conditional on being below Φ^{-1}(1/√20), conditional on being between these two numbers, or with no restriction. Therefore, K = 4^4 = 256 in (15), and each X_k corresponds to X = −Y/r under the probability measure determined by a combination of these four cases. The parameters are set as follows: T is one week, ζ_1 = 39.8%, ζ_2 = 19.3%, ζ_3 = 27.0%, λ_1 = 0.617, λ_2 = 0.368, λ_3 = 0.785, and d_j = 0 and S_j(0) = 100 for j = 1, 2, 3. The amounts of options in the portfolio can be found in Table 1 of Lesnevski et al. (2007).
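To make the scenario generation concrete, the sketch below draws the common and idiosyncratic factors conditional on one of the four cases via inverse-CDF (truncated-normal) sampling and forms the terminal asset prices. It is a simplified illustration in which T is measured in years (one week ≈ 7/365), the drift is zero as specified above, and the portfolio revaluation and discounting steps are left as placeholders because the option positions of Lesnevski et al. (2007, Table 1) are not reproduced here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a = norm.ppf(1 / np.sqrt(20))        # lower threshold Phi^{-1}(1/sqrt(20))
b = norm.ppf(1 - 1 / np.sqrt(20))    # upper threshold Phi^{-1}(1 - 1/sqrt(20))

def sample_factor(case):
    """Draw Z under one of the four cases by inverse-CDF (truncated-normal) sampling."""
    lo, hi = {"upper": (norm.cdf(b), 1.0),            # Z > b
              "lower": (0.0, norm.cdf(a)),            # Z < a
              "middle": (norm.cdf(a), norm.cdf(b)),   # a < Z < b
              "none": (0.0, 1.0)}[case]               # unrestricted
    return norm.ppf(rng.uniform(lo, hi))

def terminal_prices(cases, T=7/365, S0=100.0,
                    zeta=(0.398, 0.193, 0.270), lam=(0.617, 0.368, 0.785)):
    """Terminal prices S_j(T), j = 1, 2, 3, for the scenario given by cases[0..3]."""
    Z = [sample_factor(c) for c in cases]
    prices = []
    for j in range(3):
        W = lam[j] * Z[0] + np.sqrt(1 - lam[j] ** 2) * Z[j + 1]
        # d_j = 0, so log S_j(T) = log S_j(0) - zeta_j^2 T / 2 + zeta_j W sqrt(T)
        prices.append(S0 * np.exp(-0.5 * zeta[j] ** 2 * T + zeta[j] * W * np.sqrt(T)))
    return prices

def portfolio_loss(prices):
    """Placeholder (hypothetical): revalue the option positions and return -Y."""
    raise NotImplementedError

# One draw of the asset prices for the scenario in which all factors are unrestricted.
print(terminal_prices(["none", "none", "none", "none"]))
```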

We estimate the coherent risk measure using the GA and LSA estimators. In the experiments, we examine the performance of the estimators, and compare the CIs to those of Lesnevski et al. (2007), which are constructed via the so-called “two-stage algorithm with screening” in the literature. When implementing this algorithm, we use exactly the same algorithm parameters as in Lesnevski et al. (2007).

The simulation-based procedure of Lesnevski et al. (2007) requires a certain number of initial samples to enable the use of ranking-and-selection techniques. In our implementation, we use the same number of warm-up samples. Moreover, a fixed CI width has to be specified at the beginning of their procedure, so the total sample size required to stop the procedure may vary across replications. For fairness of the comparison, we first run the procedure of Lesnevski et al. (2007), and then use the same sample size to construct the GA and LSA CIs. We then compare the widths of our CIs to that of Lesnevski et al. (2007).

Relative errors of the GA and LSA estimators and coverage probabilities of their 95%95\% CIs w.r.t. different sample sizes are presented in Table 2. From the table, it can be seen that both the GA and LSA estimators exhibit high accuracy. In particular, when the sample size is larger than 10510^{5}, RRMSEs of both estimators are less than 1%1\%, which are contributed mainly by the variances while their biases are negligible. However, when the sample size is relatively small, e.g., n=104n=10^{4}, the bias of the LSA estimator is substantially lower than that of the GA estimator, which also explains the low coverage probabilities of the GA CIs in this case. This result suggests that the LSA estimator is preferable over the GA estimator, especially for small or medium sample sizes.

Table 2: Relative bias (%), standard deviation (%) and RRMSE (%), and coverage probabilities (%)

                         LSA                                 GA
n             10^4    10^5    10^6    10^7      10^4    10^5    10^6    10^7
Bias         -0.07   -0.01   -0.00    0.00    -10.64   -0.48   -0.03   -0.00
Stdev         1.21    0.29    0.09    0.03      1.54    0.31    0.09    0.03
RRMSE         1.21    0.29    0.09    0.03     10.75    0.57    0.10    0.03
Cover. prob.  93.1    95.7    95.3    95.8      0       60.6    93.0    95.6
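For reference, the CIs reported in Table 2 (and compared below) can be formed directly from the CLTs. The sketch below assumes the usual form, point estimate ± z_{1−α/2} σ̃_n/√n, with σ̃_n taken as the sample standard deviation of the largest-size system, which is one consistent choice suggested by Proposition 2; the GA and LSA intervals are then obtained by plugging in the corresponding point estimates.

```python
import numpy as np
from scipy.stats import norm

def clt_ci(point_estimate, samples_by_system, n, alpha=0.05):
    """CI of the form  point_estimate +/- z_{1-alpha/2} * sigma_tilde / sqrt(n).

    sigma_tilde is the sample standard deviation of the system with the largest
    sample size, assumed here as a consistent estimate of sigma_{k*}.
    """
    largest = max(samples_by_system, key=len)
    sigma_tilde = np.std(largest, ddof=1)
    half_width = norm.ppf(1 - alpha / 2) * sigma_tilde / np.sqrt(n)
    return point_estimate - half_width, point_estimate + half_width
```

With samples_by_system produced by an adaptive run (as in the earlier sketch), calling clt_ci with the LSA or GA point estimate yields the corresponding interval.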

In the comparison of CIs, we observed that the widths of the GA and LSA CIs are quite close, as the two estimators have similar variances. Hence, we only report the LSA CIs when comparing their widths to those of Lesnevski et al. (2007). The comparison results are summarized in Figure 4, with respect to varying settings of the fixed width used by Lesnevski et al. (2007). From the figure, it can be seen that the LSA CIs achieve the nominal coverage probability (95%), while the CIs constructed by Lesnevski et al. (2007) tend to be conservative, yielding coverage probabilities close to 100%. The ratio between the width of the CI of Lesnevski et al. (2007) and that of the LSA CI is also reported. The CIs of Lesnevski et al. (2007) are more than twice as wide as the LSA CIs, and the ratio can be as large as 4 when the fixed width is set to be large, implying that the LSA method produces much narrower CIs of better quality.

Figure 4: (Color online) Comparison of CIs

5.4 Hypothesis Testing in Clinical Trials

In a clinical trial, suppose that one is interested in comparing K−1 treatments with unknown mean effects μ_1, ..., μ_{K−1} to a control treatment with known mean effect μ_0. To examine the performance of different hypothesis tests, we consider the GA-based and LSA-based testing statistics in (11) and (14), respectively, within the proposed single hypothesis test in (6). We further consider a benchmarking test with multiple hypotheses in the form of (5). The benchmarking test uses a fixed randomization design that randomly assigns each patient to the treatments with equal probabilities, and is referred to as the FR test.

Our numerical setting is the same as that of Mozgunov and Jaki (2020), where the outcomes of treatments follow Bernoulli distributions with success probabilities (μ_0, μ_1, ..., μ_{K−1}), K = 4, n = 423, and the overall type I error is set to 5%. In addition to determining whether some treatments outperform the control, another important goal in the clinical trial is to treat the patients as effectively as possible during the trial (Villar et al. 2015). To quantify the extent to which this goal is achieved, two measures are commonly used: the expected proportion of patients assigned to the best treatment (POB) and the expected number of patient successes (ENS); see, e.g., Mozgunov and Jaki (2020).

To measure the performance of each test, we estimate its POB and ENS. Furthermore, for a case with true means (μ0,μ1,μ2,μ3)=(0.31,0.27,0.28,0.29)(\mu_{0},\mu_{1},\mu_{2},\mu_{3})=(0.31,0.27,0.28,0.29) where H0H_{0} is true, we estimate its overall type I error, while we estimate its power for another case with true means (μ0,μ1,μ2,μ3)=(0.3,0.3,0.3,0.5)(\mu_{0},\mu_{1},\mu_{2},\mu_{3})=(0.3,0.3,0.3,0.5) where H0H_{0} is false. Comparison results are summarized in Table 3, based on 10410^{4} independent replications.
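A rough sketch of how POB and ENS are estimated in these experiments is given below. It is a simplified stand-in for the actual designs: patients are allocated by a GUCB-type rule with Bernoulli outcomes, the control arm is treated symmetrically with the treatments for simplicity, and the decision rule of the single hypothesis test (which requires the statistics in (11) and (14)) is omitted.

```python
import numpy as np

def simulate_trial(p, n=423, nu=np.log, rng=None):
    """One trial: allocate n patients by a GUCB-type rule over Bernoulli treatments p."""
    rng = np.random.default_rng(0) if rng is None else rng
    K = len(p)
    outcomes = [[float(rng.random() < p[k])] for k in range(K)]  # one patient per arm first
    for j in range(K + 1, n + 1):
        ucb = [np.mean(o) + np.sqrt(2 * nu(j) / len(o)) for o in outcomes]
        k = int(np.argmax(ucb))
        outcomes[k].append(float(rng.random() < p[k]))
    return outcomes

def pob_ens(p, reps=1000, **kwargs):
    """Monte Carlo estimates of POB and ENS under the adaptive allocation."""
    best = int(np.argmax(p))
    pob, ens = [], []
    for r in range(reps):
        outcomes = simulate_trial(p, rng=np.random.default_rng(r), **kwargs)
        n = sum(len(o) for o in outcomes)
        pob.append(len(outcomes[best]) / n)
        ens.append(sum(sum(o) for o in outcomes))
    return np.mean(pob), np.mean(ens)

# Illustrative run for the Case 2 success probabilities.
print(pob_ens([0.3, 0.3, 0.3, 0.5]))
```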

It can be seen from Table 3 that the GA-based and LSA-based tests have an advantage over the FR test in that they consistently assign more patients to the best treatment and have higher ENS, and are thus preferable when taking into account the well-being of the patients. From a statistical perspective, when H0H_{0} is true (i.e., Case 1), the estimated type I error of the LSA-based test is close to the nominal one (5%5\%), while the rest tend to be conservative. Conservativeness of the GA-based test comes from its negative bias that is not negligible with a sample size of 423423, and that of the FR test is due to drawbacks of Bonferroni correction. In Case 2 when H0H_{0} is false, both the GA-based and LSA-based tests have much higher estimated power compared to the FR test. In summary, the LSA-based test outperforms the rest of the tests significantly in all the considered dimensions including POB, ENS, type I error, and statistical power.

Table 3: Comparison of different hypothesis tests.

        Case 1: (μ_0, μ_1, μ_2, μ_3) = (0.31, 0.27, 0.28, 0.29)     Case 2: (μ_0, μ_1, μ_2, μ_3) = (0.3, 0.3, 0.3, 0.5)
Test    Est. Type I   POB (s.e.)      ENS (s.e.)                    Est. power   POB (s.e.)      ENS (s.e.)
FR      0.014         0.25 (0.0001)   121.6 (0.03)                  0.814        0.25 (0.0001)   148.1 (0.03)
GA      0.011         0.34 (0.0006)   123.1 (0.03)                  0.966        0.78 (0.0006)   192.7 (0.02)
LSA     0.044         0.34 (0.0006)   123.1 (0.03)                  0.970        0.78 (0.0006)   192.7 (0.02)

5.5 Robust Analysis of Stochastic Simulation: A Call Center Case

An application of the proposed maximum mean estimators is robust analysis of stochastic simulation. Robust analysis aims to quantify the impact of input uncertainty (i.e., misspecification of the input models) inherent in stochastic simulation. For this purpose, Ghosh and Lam (2019) and Blanchet and Murthy (2019) considered worst-case performance bounds, i.e., the maximum (and/or minimum) of the expected simulation output taken over a set of plausible input models. While these authors consider a collection of plausible input models in the space of continuous distributions, by imposing constraints on their moments, distances from a baseline model, or other summary statistics, we consider in this paper a discrete setting where a finite number of plausible input models are specified by fitting to input data.

Specifically, the expected simulation outputs corresponding to a set of KK plausible input models are denoted by μ1,,μK\mu_{1},\ldots,\mu_{K}, respectively. From each input model, a simulation output with noise can be generated. We are interested in estimating the following worst-case bounds,

μ=mink=1,,Kμk,andμ=maxk=1,,Kμk,\mu^{{\dagger}}=\min_{k=1,\ldots,K}\mu_{k},\quad\text{and}\quad\mu^{*}=\max_{k=1,\ldots,K}\mu_{k},

which help to quantify the input uncertainty of the plausible input models.

The maximum mean estimators studied in this paper can be applied to estimate μ*, as well as μ† by taking a negative sign on the simulation outputs. To illustrate how this works, we consider the case of a call center of an anonymous bank, with real data available over a period of 12 months from January 1999 to December 1999 (Mandelbaum 2014). The workflow of the call center is as follows. Customers call in to request banking services. The call center provides both automated service (computer-generated voice information, VRU) and agent service (AS) delivered by eight agents. After the VRU service, customers either join a queue before receiving AS, proceed directly to AS, or leave the system (see Figure 5). While the VRU service is provided seven days a week, 24 hours a day, the call center is staffed by agents from 7:00 to midnight on weekdays. The information of each phone call is recorded, and each call has 17 fields in the data, e.g., Call ID (5 digits), VRU entry (6 digits), VRU exit (6 digits), Queue start (6 digits), Queue exit (6 digits), Outcome (AS or Hang), AS start (6 digits), AS exit (6 digits), etc.

Figure 5: Logic flow of the call center of an anonymous bank

We build a simulation model with the logic flow described in Figure 5 to study customers' expected waiting time during the interval 10:00–12:00 on a weekday. Key inputs to the simulation include the interarrival (IA) time distribution and the service time distributions for both the VRU and AS services. During input modeling, we consider a base dataset and an extended dataset, consisting of the input data during the specified interval on January 1, 1999 (172 phone calls) and the combined data of January 1 and 8 (311 phone calls in total), respectively. Comparing the simulation performance based on these two datasets helps to understand the impact of input data size on the magnitude of input uncertainty. Arena's Input Analyzer tool is used to fit the input distributions, and Table 4 presents three candidate distributions for each of the IA and AS times, all with p-values larger than 0.01, while the empirical distribution of the VRU service time is used because of the unsatisfactory p-values of the parametric distributions fitted by the Input Analyzer. The abandonment rates for the two scenarios are 7.1% and 5.5%, respectively, computed as the proportion of calls that hang up after the VRU service.

Table 4: Fitted distributions with p-value larger than 0.01 for IA and AS times.

Dataset             IA time                      AS time
Base dataset        -0.001 + ERLA(41.9, 1)       -0.001 + ERLA(225, 1)
                    -0.001 + EXPO(41.9)          -0.001 + EXPO(225)
                    -0.001 + WEIB(38.4, 0.82)    -0.001 + WEIB(225, 0.992)
Extended dataset    -0.001 + ERLA(41.8, 1)       -0.001 + ERLA(222, 1)
                    -0.001 + EXPO(41.8)          -0.001 + EXPO(222)
                    -0.001 + WEIB(40.9, 0.951)   -0.001 + WEIB(222, 0.996)

For each dataset, we specify the plausible input models as all possible combinations of the three candidate distributions for IA and AS times, while empirical distribution is employed for the VRU service time. In total there are K=9K=9 (=3×3=3\times 3) plausible models for each dataset, and the proposed GA and LSA estimators are applied to estimate μ\mu^{{\dagger}} and μ\mu^{*}, the worst-case bounds for expected waiting time. Using real data, we also estimate the true value of the expected waiting time.
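The sketch below illustrates how a noisy simulation output can be generated under one of the nine plausible input models, so that the GA and LSA estimators can be applied exactly as in the previous subsections. It is a heavily simplified stand-in for our simulation model: the VRU stage and time-of-day effects are ignored, abandonment is modeled as a fixed fraction of callers who never join the agent queue, and the exponential IA and AS distributions fitted to the base dataset are used as an example (times in seconds).

```python
import heapq
import numpy as np

def expected_wait(ia_sampler, as_sampler, n_customers=2000, servers=8, abandon=0.071,
                  rng=None):
    """One noisy simulation output: average waiting time (seconds) before agent service.

    Highly simplified: a first-come-first-served queue with 8 agents, a fixed
    abandonment fraction after VRU, and no time-of-day effects.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    free_at = [0.0] * servers          # earliest time each agent becomes free
    heapq.heapify(free_at)
    t, waits = 0.0, []
    for _ in range(n_customers):
        t += ia_sampler(rng)           # arrival epoch
        if rng.random() < abandon:
            continue                   # caller hangs up after the VRU service
        start = max(t, free_at[0])     # served by the earliest-free agent
        waits.append(start - t)
        heapq.heapreplace(free_at, start + as_sampler(rng))
    return float(np.mean(waits))

# One plausible input model: exponential IA and AS times fitted to the base dataset.
ia = lambda rng: rng.exponential(41.9)
svc = lambda rng: rng.exponential(225.0)
print(expected_wait(ia, svc))
```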

Table 5: Comparison of 95% CIs of the minimum mean μ†, the maximum mean μ*, the plausible performance range [μ̂†, μ̂*], and the gap between μ* and μ† for the base and extended datasets.

                                     LSA                              GA                     True value¹
Dataset            n        10^4            10^5            10^4            10^5
Base dataset     CI of μ†   [9.99, 10.87]   [10.04, 10.24]  [10.40, 10.94]  [10.20, 10.37]   10.13
                 CI of μ*   [13.70, 14.90]  [14.71, 14.97]  [13.89, 14.60]  [14.64, 14.87]   14.81
                 [μ̂†, μ̂*]   [10.43, 14.30]  [10.14, 14.84]  [10.67, 14.25]  [10.29, 14.75]   11.98
                 μ̂* − μ̂†    3.87            4.70            3.58            4.46
Extended dataset CI of μ†   [8.78, 9.13]    [8.76, 8.91]    [9.29, 9.78]    [8.76, 8.91]     8.84
                 CI of μ*   [11.69, 12.51]  [11.98, 12.21]  [11.28, 11.86]  [11.86, 12.06]   12.10
                 [μ̂†, μ̂*]   [9.05, 12.10]   [8.83, 12.10]   [9.53, 11.57]   [8.84, 11.96]    12.05
                 μ̂* − μ̂†    3.05            3.27            2.04            3.12

¹ In the last column, the first two numbers for each dataset are the (approximately) true values of μ† and μ* estimated based on a large (10^5) sample size for each plausible input model, and the remaining numbers (11.98 and 12.05) are the average waiting times obtained by taking averages from the real data.

Table 5 reports the constructed 95% CIs for μ† and μ*, the plausible performance range [μ̂†, μ̂*], as well as the gap μ̂* − μ̂† obtained by the GA and LSA estimators. Based on the table, we highlight our observations as follows.

  • Both the GA and the LSA estimators produce accurate estimates of μ† and μ* when the sample size n is sufficiently large, as exemplified by the narrow CIs when n = 10^5. However, the GA estimator may have a larger bias when the sample size is not sufficiently large. For instance, when n = 10^4, the GA CIs of μ† and μ* lie above and below their true values (10.13 and 14.81 for the base dataset), respectively, suggesting a positive bias in estimating μ† and a negative bias in estimating μ*. By contrast, the LSA CIs perform better, as they cover the true values for both the base and extended datasets.

  • The estimated plausible ranges [μ̂†, μ̂*] resulting from the LSA estimator cover the true waiting times for both datasets, while those from the GA estimator cover the true waiting times for n = 10^5 but not for n = 10^4, due to its larger bias. Compared to the GA estimator, the LSA estimator produces more accurate point estimates of both μ† and μ* (and thus of the gap μ* − μ† between the worst-case bounds) for all sample sizes and both datasets. Comparing the gaps of the worst-case bounds for the two datasets (with 172 and 311 phone calls, respectively), we see that increasing the size of the input dataset significantly reduces this gap and thus the input uncertainty; e.g., the gap is reduced from 4.70 to 3.27 using the LSA estimator with n = 10^5.

6 Conclusions

We have studied two estimators, namely, the GA and LSA estimators, for estimating the maximum mean based on the UCB sampling policy. We have established strong consistency, asymptotic MSEs, and CLTs for both estimators, and constructed asymptotically valid CIs. Furthermore, a single hypothesis test has been proposed to solve a multiple comparison problem with application to clinical trials, and shown to significantly outperform an existing benchmarking test with multiple hypotheses, as the former attains a type I error closer to the nominal level and higher statistical power.

The LSA estimator is preferable over the GA estimator, mainly due to the fact that its bias decays at a rate much faster than that of the GA estimator. The superiority of the LSA estimator over the GA estimator and several competing estimators has been demonstrated through numerical experiments.

Acknowledgements

This research is partially supported by the National Natural Science Foundation of China and Research Grants Council of Hong Kong under Joint Research Scheme project N_CityU 105/21 and GRF 11508217 and 11508620, and InnoHK initiative, The Government of the HKSAR, and Laboratory for AI-Powered Financial Technologies.

Appendix A

A.1 Lemmas

Lemma A1.

(Hoeffding bound, Proposition 2.5 in Wainwright, 2019) Suppose that random variables ξi\xi_{i}, i=1,,ni=1,...,n, are independent, and ξi\xi_{i} has mean μi\mu_{i} and sub-Gaussian parameter γi\gamma_{i}. Then for all t0t\geq 0, we have

(i=1n(ξiμi)t)exp{t22i=1nγi2}.\displaystyle{\mathbb{P}}\left(\sum_{i=1}^{n}(\xi_{i}-\mu_{i})\geq t\right)\leq\exp\left\{-\frac{t^{2}}{2\sum_{i=1}^{n}\gamma_{i}^{2}}\right\}.
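As a quick numerical illustration of Lemma A1 (not used in any proof), the sketch below compares the empirical tail probability with the Hoeffding bound for independent mean-zero Gaussian variables, whose sub-Gaussian parameter equals the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, t, reps = 50, 1.0, 10.0, 100_000

sums = rng.normal(0.0, gamma, size=(reps, n)).sum(axis=1)    # mean-zero Gaussian sums
empirical = np.mean(sums >= t)                               # empirical tail probability
hoeffding = np.exp(-t ** 2 / (2 * n * gamma ** 2))           # Hoeffding upper bound
print(empirical, hoeffding)
```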
Lemma A2.

Suppose that XiX_{i}’s are i.i.d. mean-zero random variables, and NN is a stopping time taking positive integer values with NnN\leq n w.p.1. Let St=i=1tXiS_{t}=\sum_{i=1}^{t}X_{i}. For any positive integer p2p\geq 2, if 𝔼|X1|p<{\mathbb{E}}|X_{1}|^{p}<\infty, then

𝔼|SNp|=O(np/2).{\mathbb{E}}\left|S_{N}^{p}\right|=O(n^{p/2}).
Proof of Lemma A2.

First note that it suffices to show that |𝔼(SNp)|=O(np/2)\left|{\mathbb{E}}(S_{N}^{p})\right|=O(n^{p/2}) for any p2p\geq 2, because these two are equivalent when pp is an even number, and if pp is an odd number, 𝔼|SNp|[𝔼(SNp+1)]pp+1=O(np+12pp+1)=O(np/2){\mathbb{E}}\left|S_{N}^{p}\right|\leq\left[{\mathbb{E}}\left(S_{N}^{p+1}\right)\right]^{p\over p+1}=O(n^{{p+1\over 2}\cdot{p\over p+1}})=O(n^{p/2}), following Hölder’s inequality.

For notational ease, denote Cqk=(qk)C_{q}^{k}=\binom{q}{k} and κq=𝔼[X1q]\kappa_{q}={\mathbb{E}}[X_{1}^{q}]. Note that κ1=0\kappa_{1}=0. Further let t{\cal F}_{t} denote the filtration up to tt.

We first note the following fact. With cc being a constant, ss a nonnegative integer, and VtV_{t} a martingale, if a process {Mt,t1}\{M_{t},t\geq 1\} satisfies

𝔼[Mt+1|t]=Mt+ctsVt,{\mathbb{E}}\left[M_{t+1}|{\cal F}_{t}\right]=M_{t}+ct^{s}V_{t},

then it can be seen that

WtMt+𝒂𝐭s+1VtW_{t}\triangleq M_{t}+\bm{a}^{\top}{\bf t}^{s+1}V_{t}

is a martingale, by verifying the property that 𝔼[Wt+1|t]=Wt{\mathbb{E}}[W_{t+1}|{\cal F}_{t}]=W_{t}, where the shorthand notation 𝐭s+1(t,,ts+1){\bf t}^{s+1}\triangleq(t,\ldots,t^{s+1})^{\top}, and the constant vector 𝒂=(a1,,as+1)\bm{a}=(a_{1},\ldots,a_{s+1})^{\top} satisfies

as+1=c/Cs+11,andasi=k=2i+2asi1+kCsi1+kk/Csi1,i=0,,s1.a_{s+1}=-c/C_{s+1}^{1},\quad\mbox{and}\quad a_{s-i}=-\sum_{k=2}^{i+2}a_{s-i-1+k}C_{s-i-1+k}^{k}/C_{s-i}^{1},\quad i=0,\ldots,s-1. (16)

Define Vt,01V_{t,0}\equiv 1 and Vt,1=StV_{t,1}=S_{t}. For any nonnegative integer q0q\geq 0, if 𝔼[St+1q|t]{\mathbb{E}}\left[S_{t+1}^{q}|{\cal F}_{t}\right] can be written in the form of

𝔼[St+1q|t]=Stq+j𝒮qc(j,q)ts(j,q)Vt,r(j,q),\displaystyle{\mathbb{E}}\left[S_{t+1}^{q}|{\cal F}_{t}\right]=S_{t}^{q}+\sum_{j\in{\cal S}_{q}}c(j,q)t^{s(j,q)}V_{t,r(j,q)}, (17)

where c(j,q)c(j,q)’s are constants, s(j,q)s(j,q) and r(j,q)r(j,q) are nonnegative integers depending on (j,q)(j,q), and 𝒮p{\cal S}_{p} is an index set with a finite number of elements, then, we define

Vt,q=Stq+j𝒮q𝒂(j,q)𝐭s(j,q)+1Vt,r(j,q),V_{t,q}=S_{t}^{q}+\sum_{j\in{\cal S}_{q}}\bm{a}(j,q)^{\top}{\bf t}^{s(j,q)+1}V_{t,r(j,q)},

where 𝒂(j,q)\bm{a}(j,q) is a contant vector obtained via (16) by setting s=s(j,q)s=s(j,q) and c=c(j,q)c=c(j,q).

Then, using the aforementioned fact, the process {Vt,q,t1}\{V_{t,q},t\geq 1\} is a martingale, and therefore,

Stq=Vt,qj𝒮q𝒂(j,p)𝐭s(j,q)+1Vt,r(j,q)S_{t}^{q}=V_{t,q}-\sum_{j\in{\cal S}_{q}}\bm{a}(j,p)^{\top}{\bf t}^{s(j,q)+1}V_{t,r(j,q)} (18)

is a linear combination of martingales.

In the rest of this proof, we use an induction argument. Suppose that for any integer qq satisfying 2qp2\leq q\leq p, (a) and (b) hold, where

  1. a)

    𝔼[St+1q|t]{\mathbb{E}}\left[S_{t+1}^{q}|{\cal F}_{t}\right] is of the form in (17), satisfying V0,r(j,q)=0V_{0,r(j,q)}=0 when r(j,q)1r(j,q)\geq 1, and r(j,q)q2r(j,q)\leq q-2 for any j𝒮qj\in{\cal S}_{q}; and

  2. b)

    𝔼|Ns(j,q)+1VN,r(j,q)|=O(nq/2){\mathbb{E}}\left|N^{s(j,q)+1}V_{N,r(j,q)}\right|=O(n^{q/2}) for any s(j,q)s(j,q) and r(j,q)r(j,q).,

we want to show that both (a) and (b) hold for q=p+1q=p+1.

It should be pointed out that combining (18) and (b) yields |𝔼(SNq)|=O(nq/2)\left|{\mathbb{E}}(S_{N}^{q})\right|=O(n^{q/2}), because 𝔼[VN,q]=V0,q=0{\mathbb{E}}[V_{N,q}]=V_{0,q}=0 due to the martingale stopping theorem, and

𝔼(SNq)\displaystyle{\mathbb{E}}(S_{N}^{q}) =\displaystyle= 𝔼[VN,q]j𝒮q𝒂(j,p)𝔼[𝐍s(j,q)+1VN,r(j,q)]=O(nq/2).\displaystyle{\mathbb{E}}[V_{N,q}]-\sum_{j\in{\cal S}_{q}}\bm{a}(j,p)^{\top}{\mathbb{E}}\left[{\bf N}^{s(j,q)+1}V_{N,r(j,q)}\right]=O(n^{q/2}).

Moreover, when q=2q=2, it can be easily checked that 𝔼[St+12|t]=St2+κ2Vt,0{\mathbb{E}}[S_{t+1}^{2}|{\cal F}_{t}]=S_{t}^{2}+\kappa_{2}V_{t,0} and Vt,2=St2κ2tVt,0V_{t,2}=S_{t}^{2}-\kappa_{2}tV_{t,0}, and thus both (a) and (b) are satisfied.

Therefore, it suffices to complete the induction argument, i.e., (a) and (b) must hold for q=p+1q=p+1 if they hold for any 2qp2\leq q\leq p. To see this, note that

𝔼[St+1p+1|t]=𝔼[(St+Xt+1)p+1|t]=Stp+1+j=2p+1Cp+1jκjStp+1j.\displaystyle{\mathbb{E}}\left[S_{t+1}^{p+1}|{\cal F}_{t}\right]={\mathbb{E}}\left[\left(S_{t}+X_{t+1}\right)^{p+1}|{\cal F}_{t}\right]=S_{t}^{p+1}+\sum_{j=2}^{p+1}C_{p+1}^{j}\kappa_{j}S_{t}^{p+1-j}. (19)

For any j=2,,p1j=2,\ldots,p-1, combining (a) and (18) suggests that Stp+1jS_{t}^{p+1-j} is a linear combination of Vt,p+1jV_{t,p+1-j} and Vt,r(j,p+1j)V_{t,r(j,p+1-j)}’s. Moreover, when j=pj=p and p+1p+1, Stp+1jS_{t}^{p+1-j} equals Vt,1V_{t,1} and Vt,0V_{t,0}, respectively. Therefore, by (19), assembling terms and rearranging the index set 𝒮p+1{\cal S}_{p+1} lead to

𝔼[St+1p+1|t]=Stp+1+j𝒮p+1c(j,p+1)ts(j,p+1)Vt,r(j,p+1)\displaystyle{\mathbb{E}}\left[S_{t+1}^{p+1}|{\cal F}_{t}\right]=S_{t}^{p+1}+\sum_{j\in{\cal S}_{p+1}}c(j,p+1)t^{s(j,p+1)}V_{t,r(j,p+1)}

for some c(j,p+1)c(j,p+1), s(j,p+1)s(j,p+1) and r(j,p+1)r(j,p+1), where it can be verified that V0,r(j,p+1)=0V_{0,r(j,p+1)}=0 when r(j,p+1)1r(j,p+1)\geq 1.

It can also be checked that

  1. (i)

    r(j,p+1)=r(j1,q)r(j,p+1)=r(j_{1},q) and s(j,p+1)s(j2,q)+1s(j,p+1)\leq s(j_{2},q)+1 for some qp2q\leq p-2 and j1,j2𝒮qj_{1},j_{2}\in{\cal S}_{q}; or,

  2. (ii)

    r(j,p+1)=qr(j,p+1)=q and s(j,p+1)=0s(j,p+1)=0 for some qp1q\leq p-1.

Because r(j1,q)q2r(j_{1},q)\leq q-2, it is clear that r(j,p+1)p1r(j,p+1)\leq p-1. Hence, (a) holds for q=p+1q=p+1.

Furthermore, define

Vt,p+1=Stp+1+j𝒮p+1𝒂(j,p+1)𝐭s(j,p+1)+1Vt,r(j,p+1).V_{t,p+1}=S_{t}^{p+1}+\sum_{j\in{\cal S}_{p+1}}\bm{a}(j,p+1)^{\top}{\bf t}^{s(j,p+1)+1}V_{t,r(j,p+1)}.

Then, {Vt,p+1,t1}\{V_{t,p+1},t\geq 1\} is a martingale. Using the martingale stopping theorem, we have 𝔼[VN,p+1]=V0,p+1=0{\mathbb{E}}\left[V_{N,p+1}\right]=V_{0,p+1}=0, i.e.,

𝔼[SNp+1]=j𝒮p+1𝔼[𝒂(j,p+1)𝐍s(j,p+1)+1VN,r(j,p+1)],{\mathbb{E}}\left[S_{N}^{p+1}\right]=-\sum_{j\in{\cal S}_{p+1}}{\mathbb{E}}\left[\bm{a}(j,p+1)^{\top}{\bf N}^{s(j,p+1)+1}V_{N,r(j,p+1)}\right], (20)

where 𝐍s(j,p+1)+1=(N,,Ns(j,p+1)+1){\bf N}^{s(j,p+1)+1}=(N,\ldots,N^{s(j,p+1)+1})^{\top}.

When (i) is true,

𝔼|Ns(j,p+1)+1VN,r(j,p+1)|𝔼|Ns(j2,q)+1VN,r(j1,q)|n𝔼|Ns(j2,q)VN,r(j1,q)|=O(nnq2)=O(np2),\displaystyle{\mathbb{E}}\left|N^{s(j,p+1)+1}V_{N,r(j,p+1)}\right|\leq{\mathbb{E}}\left|N^{s(j_{2},q)+1}V_{N,r(j_{1},q)}\right|\leq n{\mathbb{E}}\left|N^{s(j_{2},q)}V_{N,r(j_{1},q)}\right|=O(n\cdot n^{q\over 2})=O(n^{p\over 2}),

where the last equality follows from the fact that qp2q\leq p-2.

When (ii) is true, for some qp1q\leq p-1,

𝔼|Ns(j,p+1)+1VN,r(j,p+1)|=𝔼|NVN,q|.\displaystyle{\mathbb{E}}\left|N^{s(j,p+1)+1}V_{N,r(j,p+1)}\right|={\mathbb{E}}\left|NV_{N,q}\right|.

Note that |VN,q||SNq|+j𝒮q|𝒂(j,q)𝐍s(j,q)+1VN,r(j,q)|\left|V_{N,q}\right|\leq|S_{N}^{q}|+\sum_{j\in{\cal S}_{q}}\left|\bm{a}(j,q)^{\top}{\bf N}^{s(j,q)+1}V_{N,r(j,q)}\right|. If qq is an even number, then

𝔼|VN,q|𝔼(SNq)+j𝒮q𝔼|𝒂(j,q)𝐍s(j,q)+1VN,r(j,q)|=O(nq2)=O(np12),{\mathbb{E}}\left|V_{N,q}\right|\leq{\mathbb{E}}(S_{N}^{q})+\sum_{j\in{\cal S}_{q}}{\mathbb{E}}\left|\bm{a}(j,q)^{\top}{\bf N}^{s(j,q)+1}V_{N,r(j,q)}\right|=O(n^{q\over 2})=O(n^{p-1\over 2}),

where the first equality follows from (18) and (b). Otherwise, qq is an odd number, and by Cauchy-Schwartz inequality,

𝔼|VN,q|\displaystyle{\mathbb{E}}\left|V_{N,q}\right| \displaystyle\leq 𝔼|SNq|+j𝒮q𝔼|𝒂(j,q)𝐍s(j,q)+1VN,r(j,q)|\displaystyle{\mathbb{E}}\left|S_{N}^{q}\right|+\sum_{j\in{\cal S}_{q}}{\mathbb{E}}\left|\bm{a}(j,q)^{\top}{\bf N}^{s(j,q)+1}V_{N,r(j,q)}\right|
\displaystyle\leq [𝔼(SNq+1)]qq+1+j𝒮q𝔼|𝒂(j,q)𝐍s(j,q)+1VN,r(j,q)|\displaystyle\left[{\mathbb{E}}\left(S_{N}^{q+1}\right)\right]^{q\over q+1}+\sum_{j\in{\cal S}_{q}}{\mathbb{E}}\left|\bm{a}(j,q)^{\top}{\bf N}^{s(j,q)+1}V_{N,r(j,q)}\right|
\displaystyle\leq O(nq+12qq+1)+O(nq2)=O(np12).\displaystyle O(n^{{q+1\over 2}\cdot{q\over q+1}})+O(n^{q\over 2})=O(n^{p-1\over 2}).

Therefore, whether q is even or odd, 𝔼|N^{s(j,p+1)+1} V_{N,r(j,p+1)}| = 𝔼|N V_{N,q}| ≤ n 𝔼|V_{N,q}| = O(n^{(p+1)/2}). In other words, (b) holds for q = p+1, which completes the induction. ∎

A.2 Proof of Proposition 1

To prove the proposition, we introduce a lemma.

Lemma A3.

Given a real λ(0,1)\lambda\notin(0,1), for any positive integer nn,

j=1njλ{=nifλ=0;=n(n+1)2ifλ=1;<log(n+12)+log2ifλ=1;<1λ+1[(n+12)λ+112λ+1]ifλ1.\displaystyle\sum_{j=1}^{n}j^{\lambda}\begin{cases}\displaystyle=n&if\ \lambda=0;\\ \displaystyle=\frac{n(n+1)}{2}&if\ \lambda=1;\\ \displaystyle<\log\left(n+\frac{1}{2}\right)+\log 2&if\ \lambda=-1;\\ \displaystyle<\dfrac{1}{\lambda+1}\left[\left(n+\frac{1}{2}\right)^{\lambda+1}-\frac{1}{2^{\lambda+1}}\right]&if\ \lambda\neq-1.\end{cases}
Proof of Lemma A3.

The conclusion of the lemma for λ=0\lambda=0 or 1 is trivial, and we thus mainly examine the case when λ[0,1]\lambda\notin[0,1]. Because xλx^{\lambda} (x>0x>0) is convex for λ[0,1]\lambda\notin[0,1], by Jensen’s Inequality,

n12n+12xλdx>(n12n+12xdx)λ=nλ,ifλ[0,1].\displaystyle\int_{n-\frac{1}{2}}^{n+\frac{1}{2}}x^{\lambda}{\rm d}x>\left(\int_{n-\frac{1}{2}}^{n+\frac{1}{2}}x{\rm d}x\right)^{\lambda}=n^{\lambda},\ {\rm if}\ \lambda\notin[0,1].

It then follows that

(n+12)λ+1(n12)λ+1=(λ+1)n12n+12xλdx{>(λ+1)nλif1<λ<0,orλ>1,<(λ+1)nλ,ifλ<1,\displaystyle\left(n+\frac{1}{2}\right)^{\lambda+1}-\left(n-\frac{1}{2}\right)^{\lambda+1}=(\lambda+1)\int_{n-\frac{1}{2}}^{n+\frac{1}{2}}x^{\lambda}{\rm d}x\begin{cases}\displaystyle>(\lambda+1)n^{\lambda}&{\rm if}\ -1<\lambda<0,\ \ {\rm or}\ \ \lambda>1,\\ \displaystyle<(\lambda+1)n^{\lambda},&{\rm if}\ \lambda<-1,\end{cases}

and

log(n+12)log(n12)=n12n+12x1dx>n1,ifλ=1.\displaystyle\log\left(n+\frac{1}{2}\right)-\log\left(n-\frac{1}{2}\right)=\int_{n-\frac{1}{2}}^{n+\frac{1}{2}}x^{-1}{\rm d}x>n^{-1},\ {\rm if}\ \lambda=-1.

Therefore, for λ[0,1]\lambda\notin[0,1] and λ1\lambda\neq-1,

j=1njλ<1λ+1j=1n[(j+12)λ+1(j12)λ+1]=1λ+1[(n+12)λ+1(12)λ+1],\displaystyle\sum_{j=1}^{n}j^{\lambda}<\dfrac{1}{\lambda+1}\sum_{j=1}^{n}\left[\left(j+\frac{1}{2}\right)^{\lambda+1}-\left(j-\frac{1}{2}\right)^{\lambda+1}\right]=\dfrac{1}{\lambda+1}\left[\left(n+\frac{1}{2}\right)^{\lambda+1}-\left(\frac{1}{2}\right)^{\lambda+1}\right],

and for λ=1\lambda=-1,

j=1njλ<j=1n[log(j+12)log(j12)]=log(n+12)log(12).\displaystyle\sum_{j=1}^{n}j^{\lambda}<\sum_{j=1}^{n}\left[\log\left(j+\frac{1}{2}\right)-\log\left(j-\frac{1}{2}\right)\right]=\log\left(n+\frac{1}{2}\right)-\log\left(\frac{1}{2}\right).
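The bounds of Lemma A3 are easy to verify numerically; a small sketch is given below for a few representative values of λ (the case λ = −3 appears repeatedly in the proof of Proposition 1).

```python
import numpy as np

def lemma_a3_bound(n, lam):
    """Upper bound of Lemma A3 for lambda <= 0 or lambda >= 1."""
    if lam == 0:
        return n
    if lam == 1:
        return n * (n + 1) / 2
    if lam == -1:
        return np.log(n + 0.5) + np.log(2)
    return ((n + 0.5) ** (lam + 1) - 0.5 ** (lam + 1)) / (lam + 1)

for lam in (-3, -1, 2):
    for n in (10, 1000):
        s = sum(j ** lam for j in range(1, n + 1))
        assert s <= lemma_a3_bound(n, lam), (lam, n)   # the bound holds
        print(lam, n, s, lemma_a3_bound(n, lam))
```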

Proof of Proposition 1. We first prove the result for p=1 and then for p≥2.

(i) When p=1p=1, we want to prove that for kkk\neq k^{*}

𝔼Tk(n)8νnΔk2+5\displaystyle{\mathbb{E}}T_{k}(n)\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}+5 (21)

Let

j=max{j:Tk(j1)8νnΔk2}.\displaystyle j^{*}=\max\left\{j:T_{k}(j-1)\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}\right\}.

If j=nj^{*}=n, then (21) is true. Therefore, we just have to consider j<nj^{*}<n. Note that for j>jj>j^{*}, we have

Tk(j1)>8νnΔk2.\displaystyle T_{k}(j-1)>\frac{8\nu_{n}}{\Delta_{k}^{2}}. (22)

For j>jj>j^{*} and Ij=kI_{j}=k, we claim that at least one of the two following inequalities holds:

X¯k[Tk(j1)]+cj,Tk(j1)\displaystyle\bar{X}_{k^{*}}[T_{k^{*}}(j-1)]+c_{j,T_{k^{*}}(j-1)} \displaystyle\leq μk,\displaystyle\mu_{k^{*}}, (23)
X¯k[Tk(j1)]cj,Tk(j1)\displaystyle\bar{X}_{k}[T_{k}(j-1)]-c_{j,T_{k}(j-1)} >\displaystyle> μk,\displaystyle\mu_{k}, (24)

where cj,s=2νj/sc_{j,s}=\sqrt{2\nu_{j}/s}. In fact, by contradiction, if both the inequalities (23) and (24) are false when j>jj>j^{*} and Ij=kI_{j}=k, we have

X¯k[Tk(j1)]+cj,Tk(j1)>\displaystyle\bar{X}_{k^{*}}[T_{k^{*}}(j-1)]+c_{j,T_{k^{*}}(j-1)}> μk=μk+Δk>μk+22νnTk(j1)\displaystyle\mu_{k^{*}}=\mu_{k}+\Delta_{k}>\mu_{k}+2\sqrt{\frac{2\nu_{n}}{T_{k}(j-1)}}
\displaystyle\geq μk+22νjTk(j1)X¯k[Tk(j1)]+cj,Tk(j1),\displaystyle\mu_{k}+2\sqrt{\frac{2\nu_{j}}{T_{k}(j-1)}}\geq\bar{X}_{k}[T_{k}(j-1)]+c_{j,T_{k}(j-1)},

where the second inequality is due to (22). This contradicts I_j = k.

In the following, we prove (21). Note that

Tk(n)=Tk(j)+j=j+1n𝟙{Ij=k}=Tk(j)+j=j+1n𝟙{Ij=k,(23) or (24) true}\displaystyle T_{k}(n)=T_{k}(j^{*})+\sum_{j=j^{*}+1}^{n}\mathbbm{1}\{I_{j}=k\}=T_{k}(j^{*})+\sum_{j=j^{*}+1}^{n}\mathbbm{1}\{I_{j}=k,\mbox{(\ref{eqn:Tk11}) or (\ref{eqn:Tk12}) true}\}
Tk(j1)+1+j=j+1n𝟙{(23) or (24) true}8νnΔk2+1+j=j+1n𝟙{(23) or (24) true}.\displaystyle\leq T_{k}(j^{*}-1)+1+\sum_{j=j^{*}+1}^{n}\mathbbm{1}\{\mbox{(\ref{eqn:Tk11}) or (\ref{eqn:Tk12}) true}\}\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\sum_{j=j^{*}+1}^{n}\mathbbm{1}\{\mbox{(\ref{eqn:Tk11}) or (\ref{eqn:Tk12}) true}\}. (25)

Taking expectation, it follows that

𝔼Tk(n)\displaystyle{\mathbb{E}}T_{k}(n)\leq 8νnΔk2+1+j=j+1n((23) or (24) true)\displaystyle\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\sum_{j=j^{*}+1}^{n}{\mathbb{P}}\left(\mbox{(\ref{eqn:Tk11}) or (\ref{eqn:Tk12}) true}\right)
\displaystyle\leq 8νnΔk2+1+j=j+1n[((23) true)+((24) true)].\displaystyle\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\sum_{j=j^{*}+1}^{n}\left[{\mathbb{P}}\left(\mbox{(\ref{eqn:Tk11}) true}\right)+{\mathbb{P}}\left(\mbox{(\ref{eqn:Tk12}) true}\right)\right].

Furthermore, by Lemma A1,

((23) true)=\displaystyle{\mathbb{P}}\left(\mbox{(\ref{eqn:Tk11}) true}\right)= (X¯k[Tk(j1)]+cj,Tk(j1)μk)(mins=1,,j1X¯k[s]+cj,sμk)\displaystyle{\mathbb{P}}\left(\bar{X}_{k^{*}}[T_{k^{*}}(j-1)]+c_{j,T_{k^{*}}(j-1)}\leq\mu_{k^{*}}\right)\leq{\mathbb{P}}\left(\min_{s=1,...,j-1}\bar{X}_{k^{*}}[s]+c_{j,s}\leq\mu_{k^{*}}\right)
=\displaystyle= (s=1j1X¯k[s]+cj,sμk)s=1j1(X¯k[s]+cj,sμk)\displaystyle{\mathbb{P}}\left(\cup_{s=1}^{j-1}\bar{X}_{k^{*}}[s]+c_{j,s}\leq\mu_{k^{*}}\right)\leq\sum_{s=1}^{j-1}{\mathbb{P}}\left(\bar{X}_{k^{*}}[s]+c_{j,s}\leq\mu_{k^{*}}\right)
\displaystyle\leq s=1j1exp{(scj,s)22sγk2}=(j1)exp{νjγk2}<j3,\displaystyle\sum_{s=1}^{j-1}\exp\left\{-\frac{(sc_{j,s})^{2}}{2s\gamma_{k^{*}}^{2}}\right\}=(j-1)\exp\left\{-\frac{\nu_{j}}{\gamma_{k^{*}}^{2}}\right\}<j^{-3}, (26)

where the last inequality is due to the condition νn4γ¯2logn\nu_{n}\geq 4\bar{\gamma}^{2}\log n. Similarly, ((24) true){\mathbb{P}}\left(\mbox{(\ref{eqn:Tk12}) true}\right) has the same bound. Therefore,

𝔼Tk(n)8νnΔk2+1+2j=j+1nj38νnΔk2+1+4(n+12)28νnΔk2+5,\displaystyle{\mathbb{E}}T_{k}(n)\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+2\sum_{j=j^{*}+1}^{n}j^{-3}\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+4-\left(n+\frac{1}{2}\right)^{-2}\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}+5,

where the second inequality is by Lemma A3.

(ii) Next, we prove the result with p2p\geq 2. Denote

Rj=𝟙{Ij=k},j=j+1,,n.\displaystyle R_{j}=\mathbbm{1}\left\{I_{j}=k\right\},j=j^{*}+1,...,n. (27)

Then, by the derivation of (25), we have

Tk(n)8νnΔk2+1+j=j+1nRj.\displaystyle T_{k}(n)\leq\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\sum_{j=j^{*}+1}^{n}R_{j}. (28)

For any integer p2p\geq 2, by Minkowski’s inequality, we have

𝔼Tk(n)p\displaystyle{\mathbb{E}}T_{k}(n)^{p} \displaystyle\leq 𝔼(8νnΔk2+1+j=j+1nRj)p{8νnΔk2+1+[𝔼(j=j+1nRj)p]1p}p.\displaystyle{\mathbb{E}}\left(\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\sum_{j=j^{*}+1}^{n}R_{j}\right)^{p}\leq\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\left[{\mathbb{E}}\left(\sum_{j=j^{*}+1}^{n}R_{j}\right)^{p}\right]^{\frac{1}{p}}\right\}^{p}. (29)

In what follows, we analyze 𝔼(j=j+1nRj)p{\mathbb{E}}\left(\sum_{j=j^{*}+1}^{n}R_{j}\right)^{p}. Note that

(j=j+1nRj)p=j1,,jp{j+1,,n}Rj1Rjp\displaystyle\left(\sum_{j=j^{*}+1}^{n}R_{j}\right)^{p}=\sum_{j_{1},...,j_{p}\in\{j^{*}+1,...,n\}}R_{j_{1}}\cdots R_{j_{p}} (30)
=\displaystyle= p!j1>>jpRj1Rjp+m=1p1i1++im=pi1,,im1(pi1,,im)j1>>jmRj1i1Rj2i2Rjmim\displaystyle p!\sum_{j_{1}>\cdots>j_{p}}R_{j_{1}}\cdots R_{j_{p}}+\sum_{m=1}^{p-1}\sum_{\begin{subarray}{l}i_{1}+\cdots+i_{m}=p\\ i_{1},...,i_{m}\geq 1\end{subarray}}\binom{p}{i_{1},...,i_{m}}\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}^{i_{1}}R_{j_{2}}^{i_{2}}\cdots R_{j_{m}}^{i_{m}}
=\displaystyle= p!j1>>jpRj1Rjp+m=1p1i1++im=pi1,,im1(pi1,,im)j1>>jmRj1Rj2Rjm\displaystyle p!\sum_{j_{1}>\cdots>j_{p}}R_{j_{1}}\cdots R_{j_{p}}+\sum_{m=1}^{p-1}\sum_{\begin{subarray}{l}i_{1}+\cdots+i_{m}=p\\ i_{1},...,i_{m}\geq 1\end{subarray}}\binom{p}{i_{1},...,i_{m}}\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}R_{j_{2}}\cdots R_{j_{m}}

where (pi1,,im)=p!/(i1!im!)\binom{p}{i_{1},...,i_{m}}=p!/(i_{1}!\cdots i_{m}!), and the last equality follows from the definition of RjR_{j} in (27).

Consider j1>>jmRj1Rjm\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}\cdots R_{j_{m}} when 2mp2\leq m\leq p. By Lemma A3, for a positive integer λ\lambda,

j=1njλ<1λ+1(n+1)λ+1.\displaystyle\sum_{j=1}^{n}j^{\lambda}<\frac{1}{\lambda+1}\left(n+1\right)^{\lambda+1}. (31)

Therefore, for 2mp2\leq m\leq p, we have

j1>>jmRj1Rjm=j1>>jm1Rj1Rjm1jm=j+1jm11Rjm\displaystyle\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}\cdots R_{j_{m}}=\sum_{j_{1}>\cdots>j_{m-1}}R_{j_{1}}\cdots R_{j_{m-1}}\sum_{j_{m}=j^{*}+1}^{j_{m-1}-1}R_{j_{m}}
\displaystyle\leq j1>>jm1Rj1Rjm1jm1=j1>>jm2Rj1Rjm2jm1=j+1jm21Rjm1jm1\displaystyle\sum_{j_{1}>\cdots>j_{m-1}}R_{j_{1}}\cdots R_{j_{m-1}}\cdot j_{m-1}=\sum_{j_{1}>\cdots>j_{m-2}}R_{j_{1}}\cdots R_{j_{m-2}}\sum_{j_{m-1}=j^{*}+1}^{j_{m-2}-1}R_{j_{m-1}}\cdot j_{m-1}
\displaystyle\leq j1>>jm2Rj1Rjm2jm1=j+1jm21jm112j1>>jm2Rj1Rjm2jm22,\displaystyle\sum_{j_{1}>\cdots>j_{m-2}}R_{j_{1}}\cdots R_{j_{m-2}}\sum_{j_{m-1}=j^{*}+1}^{j_{m-2}-1}j_{m-1}\leq\frac{1}{2}\sum_{j_{1}>\cdots>j_{m-2}}R_{j_{1}}\cdots R_{j_{m-2}}\cdot j_{m-2}^{2},

where the first inequality follows from the fact that |Rj|1|R_{j}|\leq 1, and the last inequality follows from (31) by letting λ=1\lambda=1 and n=jm21n=j_{m-2}-1.

Iteratively applying the above argument leads to

j1>>jmRj1Rjm1(m1)!j1=j+1nRj1j1m1.\displaystyle\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}\cdots R_{j_{m}}\leq\frac{1}{(m-1)!}\sum_{j_{1}=j^{*}+1}^{n}R_{j_{1}}\cdot j_{1}^{m-1}. (32)

Recall that for j_1 > j*, the event {I_{j_1} = k} implies that (23) or (24) holds; combining this with (26), we know that

𝔼Rj1((23) true)+((24) true)2j13.\displaystyle{\mathbb{E}}R_{j_{1}}\leq{\mathbb{P}}\left(\mbox{(\ref{eqn:Tk11}) true}\right)+{\mathbb{P}}\left(\mbox{(\ref{eqn:Tk12}) true}\right)\leq 2j_{1}^{-3}.

Hence, combining with (32), we have, for 2mp2\leq m\leq p,

𝔼j1>>jmRj1Rjm1(m1)!j1=j+1nj1m1𝔼Rj12(m1)!j1=1nj1m4.\displaystyle{\mathbb{E}}\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}\cdots R_{j_{m}}\leq\frac{1}{(m-1)!}\sum_{j_{1}=j^{*}+1}^{n}j_{1}^{m-1}{\mathbb{E}}R_{j_{1}}\leq\frac{2}{(m-1)!}\sum_{j_{1}=1}^{n}j_{1}^{m-4}.

By Lemma A3, we have

𝔼j1>>jmRj1Rjm{2(21n+12)ifm=2;log(n+12)+log2ifm=3;2(m1)!1m3(n+12)m3ifm4.\displaystyle{\mathbb{E}}\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}\cdots R_{j_{m}}\leq\begin{cases}\displaystyle 2\left(2-\frac{1}{n+\frac{1}{2}}\right)&if\ m=2;\\ \displaystyle\log\left(n+\frac{1}{2}\right)+\log 2&if\ m=3;\\ \displaystyle\frac{2}{(m-1)!}\frac{1}{m-3}\left(n+\frac{1}{2}\right)^{m-3}&if\ m\geq 4.\\ \end{cases} (33)

By (30) and (33), it follows that for p4p\geq 4,

𝔼(j=j+1nRj)p\displaystyle{\mathbb{E}}\left(\sum_{j=j^{*}+1}^{n}R_{j}\right)^{p} =p!𝔼j1>>jpRj1Rjp+m=1p1i1++im=pi1,,im1(pi1,,im)𝔼j1>>jmRj1Rj2Rjm\displaystyle=p!\cdot{\mathbb{E}}\sum_{j_{1}>\cdots>j_{p}}R_{j_{1}}\cdots R_{j_{p}}+\sum_{m=1}^{p-1}\sum_{\begin{subarray}{l}i_{1}+\cdots+i_{m}=p\\ i_{1},...,i_{m}\geq 1\end{subarray}}\binom{p}{i_{1},...,i_{m}}\cdot{\mathbb{E}}\sum_{j_{1}>\cdots>j_{m}}R_{j_{1}}R_{j_{2}}\cdots R_{j_{m}}
2pp3(n+1)p3+O((n+1)p4).\displaystyle\leq\frac{2p}{p-3}(n+1)^{p-3}+O\left((n+1)^{p-4}\right).

Plugging the above into (29), we have, for p4p\geq 4,

𝔼Tk(n)p{8νnΔk2+1+[2pp3(n+1)p3+O(np4)]1p}p.\displaystyle{\mathbb{E}}T_{k}(n)^{p}\leq\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\left[\frac{2p}{p-3}(n+1)^{p-3}+O\left(n^{p-4}\right)\right]^{\frac{1}{p}}\right\}^{p}.

Similarly, for p=3p=3, we have

𝔼(j=j+1nRj)3\displaystyle{\mathbb{E}}\left(\sum_{j=j^{*}+1}^{n}R_{j}\right)^{3} =6[log(n+12)+log2]+O(1)6log(n+1)+O(1).\displaystyle=6\left[\log\left(n+\frac{1}{2}\right)+\log 2\right]+O(1)\leq 6\log\left(n+1\right)+O(1).

Hence,

𝔼Tk(n)3{8νnΔk2+1+[6log(n+1)+O(1)]13}3.\displaystyle{\mathbb{E}}T_{k}(n)^{3}\leq\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\left[6\log\left(n+1\right)+O(1)\right]^{\frac{1}{3}}\right\}^{3}.

For p=2p=2, it follows from (31) and (33) that

𝔼(j=j+1nRj)2=j=j+1n𝔼Rj2+2𝔼j1>j2Rj1Rj2=j=j+1n𝔼Rj+2𝔼j1>j2Rj1Rj2\displaystyle{\mathbb{E}}\left(\sum_{j=j^{*}+1}^{n}R_{j}\right)^{2}=\sum_{j=j^{*}+1}^{n}{\mathbb{E}}R_{j}^{2}+2{\mathbb{E}}\sum_{j_{1}>j_{2}}R_{j_{1}}R_{j_{2}}=\sum_{j=j^{*}+1}^{n}{\mathbb{E}}R_{j}+2{\mathbb{E}}\sum_{j_{1}>j_{2}}R_{j_{1}}R_{j_{2}}
j1=j+1n2j13+4[21n+12]<4(n+12)2+4[21n+12]<12,\displaystyle\leq\sum_{j_{1}=j^{*}+1}^{n}2j_{1}^{-3}+4\left[2-\frac{1}{n+\frac{1}{2}}\right]<4-\left(n+\frac{1}{2}\right)^{-2}+4\left[2-\frac{1}{n+\frac{1}{2}}\right]<12,

and thus

𝔼Tk(n)2{8νnΔk2+1+12}2<{8νnΔk2+5}2.\displaystyle{\mathbb{E}}T_{k}(n)^{2}\leq\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+1+\sqrt{12}\right\}^{2}<\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\right\}^{2}.

The proof is complete.

A.3 Proof of Theorem 1

When kkk\neq k^{*}, by Markov’s inequality and Proposition 1, for any ε>0\varepsilon>0 and p3p\geq 3,

(Tk(n)n>ε)1εp𝔼[Tk(n)n]p\displaystyle{\mathbb{P}}\left(\frac{T_{k}(n)}{n}>\varepsilon\right)\leq\frac{1}{\varepsilon^{p}}{\mathbb{E}}\left[\frac{T_{k}(n)}{n}\right]^{p} \displaystyle\leq 1εp{8Δk2νnn+1n+[2pp31n3+O(n4)]1p}p\displaystyle\frac{1}{\varepsilon^{p}}\left\{\frac{8}{\Delta_{k}^{2}}\frac{\nu_{n}}{n}+\frac{1}{n}+\left[\frac{2p}{p-3}\frac{1}{n^{3}}+O\left(n^{-4}\right)\right]^{\frac{1}{p}}\right\}^{p}
\displaystyle\leq 2p1εp[(8Δk2νnn+1n)p+2pp31n3+O(n4)].\displaystyle\frac{2^{p-1}}{\varepsilon^{p}}\left[\left(\frac{8}{\Delta_{k}^{2}}\frac{\nu_{n}}{n}+\frac{1}{n}\right)^{p}+\frac{2p}{p-3}\frac{1}{n^{3}}+O\left(n^{-4}\right)\right].

Because νnn1δ\nu_{n}\leq n^{1-\delta}, it follows that for p4p\geq 4,

(Tk(n)n>ε)\displaystyle{\mathbb{P}}\left(\frac{T_{k}(n)}{n}>\varepsilon\right) \displaystyle\leq 2p1εp[(8Δk2nδ+1n)p+2pp31n3+O(n4)]\displaystyle\frac{2^{p-1}}{\varepsilon^{p}}\left[\left(\frac{8}{\Delta_{k}^{2}n^{\delta}}+\frac{1}{n}\right)^{p}+\frac{2p}{p-3}\frac{1}{n^{3}}+O\left(n^{-4}\right)\right] (34)
\displaystyle\leq 2p1εp[2p1(8Δk2npδ+1np)+2pp31n3+O(n4)].\displaystyle\frac{2^{p-1}}{\varepsilon^{p}}\left[2^{p-1}\left(\frac{8}{\Delta_{k}^{2}n^{p\delta}}+\frac{1}{n^{p}}\right)+\frac{2p}{p-3}\frac{1}{n^{3}}+O\left(n^{-4}\right)\right].

Let p>max{1/δ,4}p>\max\{1/\delta,4\}. Then, for any ϵ>0\epsilon>0,

n=1(Tk(n)n>ε)<.\displaystyle\sum_{n=1}^{\infty}{\mathbb{P}}\left(\frac{T_{k}(n)}{n}>\varepsilon\right)<\infty.

By Borel-Cantelli lemma, Tk(n)/na.s.0T_{k}(n)/n\overset{a.s.}{\longrightarrow}0 as nn\to\infty, for kkk\neq k^{*}, and thus

Tk(n)n=1kkTk(n)na.s.1.\displaystyle\frac{T_{k^{*}}(n)}{n}=1-\sum_{k\neq k^{*}}\frac{T_{k}(n)}{n}\overset{a.s.}{\longrightarrow}1.

A.4 Proof of Theorem 2

It follows from Theorem 1 that

(ωΩ:limnTk(n)(ω)n=0,kk)=1.\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\lim_{n\to\infty}\frac{T_{k}(n)(\omega)}{n}=0,\ \forall k\neq k^{*}\right)=1.

Because k=1KTk(n)=n\sum_{k=1}^{K}T_{k}(n)=n, the event limnTk(n)(ω)n=0,kk\lim_{n\to\infty}\frac{T_{k}(n)(\omega)}{n}=0,\ \forall k\neq k^{*} implies that limnTk(n)n=1\lim_{n\to\infty}\frac{T_{k^{*}}(n)}{n}=1. Therefore,

(ωΩ:limnTk(n)(ω)n=0,kk,andlimnTk(n)(ω)n=1)=1.\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\lim_{n\to\infty}\frac{T_{k}(n)(\omega)}{n}=0,\ \forall k\neq k^{*},\ {\rm and}\ \lim_{n\to\infty}\frac{T_{k^{*}}(n)(\omega)}{n}=1\right)=1.

Then, (A)=1{\mathbb{P}}(A)=1, where

A={ωΩ:limn[Tk(n)(ω)nTk(n)(ω)n]=1,kk}.\displaystyle A=\left\{\omega\in\Omega:\lim_{n\to\infty}\left[\frac{T_{k^{*}}(n)(\omega)}{n}-\frac{T_{k}(n)(\omega)}{n}\right]=1,\ \forall k\neq k^{*}\right\}.

In other words, for any ωA\omega\in A and 0<ε<10<\varepsilon<1, there exists N=N(ω,ε)N=N(\omega,\varepsilon)\in{\mathbb{N}}, such that for any n>Nn>N,

0<1ε<Tk(n)(ω)nTk(n)(ω)n<1+ε,\displaystyle 0<1-\varepsilon<\frac{T_{k^{*}}(n)(\omega)}{n}-\frac{T_{k}(n)(\omega)}{n}<1+\varepsilon,

which implies

max{T1(n)(ω),,TK(n)(ω)}=Tk(n)(ω).\displaystyle\max\left\{{T_{1}(n)(\omega)},...,{T_{K}(n)(\omega)}\right\}={T_{k^{*}}(n)(\omega)}.

By the uniqueness assumption of kk^{*}, we have

(ωΩ:0<ε<1,N=N(ω,ε),s.t.n>N,In(ω)=k)=1,\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\forall 0<\varepsilon<1,\exists N=N(\omega,\varepsilon)\in{\mathbb{N}},s.t.\ \forall n>N,I^{*}_{n}(\omega)=k^{*}\right)=1, (35)

i.e., Ina.s.kI^{*}_{n}\overset{a.s.}{\longrightarrow}k^{*} as nn\rightarrow\infty.

A.5 Proof of Theorem 4

We analyze the bias and variance of M~n\widetilde{M}_{n} separately.

The Bias. We first assert that T_k(n) is a stopping time with respect to {X_{kj}, j ≥ 1}. Indeed, conditional on the samples of the other systems, i.e., on σ(X_{ij}, i ≠ k, j = 1, 2, ...), the event {T_k(n) ≥ m} = {T_k(n) ≤ m−1}^c is determined by X_{k1}, ..., X_{k,m−1} only, and hence belongs to σ(X_{k1}, ..., X_{k,m−1}). Therefore T_k(n) is a stopping time with respect to {X_{kj}, j ≥ 1}, and by Wald's equation,

𝔼[j=1Tk(n)Xkj]=μk𝔼[Tk(n)],\displaystyle{\mathbb{E}}\left[\sum_{j=1}^{T_{k}(n)}X_{kj}\right]=\mu_{k}{\mathbb{E}}\left[T_{k}(n)\right],

leading to

𝔼M~n=1nk=1Kμk𝔼[Tk(n)].{\mathbb{E}}\widetilde{M}_{n}={1\over n}\sum_{k=1}^{K}\mu_{k}{\mathbb{E}}\left[T_{k}(n)\right].

Thus,

Bias(M~n)\displaystyle{\rm Bias}\left(\widetilde{M}_{n}\right) =\displaystyle= 𝔼M~nμ=(1+𝔼Tk(n)n)μk+kk𝔼Tk(n)nμk\displaystyle{\mathbb{E}}\widetilde{M}_{n}-\mu^{*}=\left(-1+\frac{{\mathbb{E}}T_{k^{*}}(n)}{n}\right)\mu_{k^{*}}+\sum_{k\neq k^{*}}\frac{{\mathbb{E}}T_{k}(n)}{n}\mu_{k}
=\displaystyle= kk𝔼Tk(n)nμk+kk𝔼Tk(n)nμk=kk𝔼Tk(n)n(μkμk).\displaystyle-\sum_{k\neq k^{*}}\frac{{\mathbb{E}}T_{k}(n)}{n}\mu_{k^{*}}+\sum_{k\neq k^{*}}\frac{{\mathbb{E}}T_{k}(n)}{n}\mu_{k}=\sum_{k\neq k^{*}}\frac{{\mathbb{E}}T_{k}(n)}{n}\left(\mu_{k}-\mu_{k^{*}}\right).

Because μk<μk\mu_{k}<\mu_{k^{*}} for kkk\neq k^{*}, we have Bias(M~n)<0{\rm Bias}\left(\widetilde{M}_{n}\right)<0.

By Proposition 1,

|Bias(M̃_n)| ≤ ∑_{k≠k*} (1/n)(8ν_n/Δ_k² + 5)(μ_{k*} − μ_k) = ∑_{k≠k*} (1/n)(8ν_n/Δ_k + 5Δ_k) = O(ν_n/n).   (36)

The Variance. Note that

Var[M~n]=𝔼(M~n𝔼[M~n])2\displaystyle{\rm Var}\left[\widetilde{M}_{n}\right]={\mathbb{E}}\left(\widetilde{M}_{n}-{\mathbb{E}}\left[\widetilde{M}_{n}\right]\right)^{2} (37)
=\displaystyle= 𝔼{(M~n1nk=1KμkTk(n))+(1nk=1KμkTk(n)1nk=1Kμk𝔼Tk(n))}2\displaystyle{\mathbb{E}}\left\{\left(\widetilde{M}_{n}-{1\over n}\sum_{k=1}^{K}\mu_{k}T_{k}(n)\right)+\left({1\over n}\sum_{k=1}^{K}\mu_{k}T_{k}(n)-{1\over n}\sum_{k=1}^{K}\mu_{k}{\mathbb{E}}T_{k}(n)\right)\right\}^{2}
\displaystyle\leq 2𝔼(M~n1nk=1KμkTk(n))2+2𝔼(1nk=1KμkTk(n)1nk=1Kμk𝔼Tk(n))2.\displaystyle 2{\mathbb{E}}\left(\widetilde{M}_{n}-{1\over n}\sum_{k=1}^{K}\mu_{k}T_{k}(n)\right)^{2}+2{\mathbb{E}}\left({1\over n}\sum_{k=1}^{K}\mu_{k}T_{k}(n)-{1\over n}\sum_{k=1}^{K}\mu_{k}{\mathbb{E}}T_{k}(n)\right)^{2}.

We analyze the two terms on the RHS of (37) separately. By Wald's lemma (see Theorem 13.2.14 in Athreya and Lahiri 2006),

𝔼(j=1Tk(n)XkjμkTk(n))2=σk2𝔼Tk(n).\displaystyle{\mathbb{E}}\left(\sum_{j=1}^{T_{k}(n)}X_{kj}-\mu_{k}T_{k}(n)\right)^{2}=\sigma_{k}^{2}{\mathbb{E}}T_{k}(n). (38)

Then, by (7) and Cauchy-Schwarz inequality, it can be seen that

𝔼(M~n1nk=1KμkTk(n))2Kn2k=1K𝔼(j=1Tk(n)XkjμkTk(n))2=Kn2k=1Kσk2𝔼Tk(n)\displaystyle{\mathbb{E}}\left(\widetilde{M}_{n}-{1\over n}\sum_{k=1}^{K}\mu_{k}T_{k}(n)\right)^{2}\leq\frac{K}{n^{2}}\sum_{k=1}^{K}{\mathbb{E}}\left(\sum_{j=1}^{T_{k}(n)}X_{kj}-\mu_{k}T_{k}(n)\right)^{2}=\frac{K}{n^{2}}\sum_{k=1}^{K}\sigma_{k}^{2}{\mathbb{E}}T_{k}(n)
=Kn2σk2𝔼Tk(n)+Kn2kk𝔼Tk(n)Knσk2+Kn2kk𝔼Tk(n),\displaystyle=\frac{K}{n^{2}}\sigma_{k^{*}}^{2}{\mathbb{E}}T_{k^{*}}(n)+\frac{K}{n^{2}}\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}(n)\leq\frac{K}{n}\sigma_{k^{*}}^{2}+\frac{K}{n^{2}}\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}(n), (39)

where the last inequality follows from the fact that Tk(n)nT_{k^{*}}(n)\leq n.

We then analyze the second term on the RHS of (37). Note that, by Cauchy-Schwarz inequality,

𝔼(1nk=1KμkTk(n)1nk=1Kμk𝔼Tk(n))2=1n2𝔼(k=1KμkTk(n)k=1Kμk𝔼Tk(n))2\displaystyle{\mathbb{E}}\left({1\over n}\sum_{k=1}^{K}\mu_{k}T_{k}(n)-{1\over n}\sum_{k=1}^{K}\mu_{k}{\mathbb{E}}T_{k}(n)\right)^{2}={1\over n^{2}}{\mathbb{E}}\left(\sum_{k=1}^{K}\mu_{k}T_{k}(n)-\sum_{k=1}^{K}\mu_{k}{\mathbb{E}}T_{k}(n)\right)^{2} (40)
\displaystyle\leq Kn2k=1Kμk2𝔼[Tk(n)𝔼Tk(n)]2=Kn2k=1Kμk2Var[Tk(n)]\displaystyle{K\over n^{2}}\sum_{k=1}^{K}\mu_{k}^{2}{\mathbb{E}}\left[T_{k}(n)-{\mathbb{E}}T_{k}(n)\right]^{2}={K\over n^{2}}\sum_{k=1}^{K}\mu_{k}^{2}{\rm Var}\left[T_{k}(n)\right]
\displaystyle\leq Kn2kkμk2Var[Tk(n)]+Kn2μk2Var[Tk(n)].\displaystyle{K\over n^{2}}\sum_{k\neq k^{*}}\mu_{k}^{2}{\rm Var}\left[T_{k}(n)\right]+{K\over n^{2}}\mu_{k^{*}}^{2}{\rm Var}\left[T_{k^{*}}(n)\right].

Furthermore,

Var[Tk(n)]=Var[nkkTk(n)]=Var[kkTk(n)]𝔼[kkTk(n)]2(K1)kk𝔼Tk2(n).\displaystyle{\rm Var}\left[T_{k^{*}}(n)\right]={\rm Var}\left[n-\sum_{k\neq k^{*}}T_{k}(n)\right]={\rm Var}\left[\sum_{k\neq k^{*}}T_{k}(n)\right]\leq{\mathbb{E}}\left[\sum_{k\neq k^{*}}T_{k}(n)\right]^{2}\leq(K-1)\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}^{2}(n).

Therefore,

RHSof(40)Kn2kkμk2𝔼Tk2(n)+K(K1)n2μk2kk𝔼Tk2(n)K2n2μ¯2kk𝔼Tk2(n),\displaystyle{\rm RHS\ of\ }(\ref{eqn:Temp12})\leq{K\over n^{2}}\sum_{k\neq k^{*}}\mu_{k}^{2}{\mathbb{E}}T_{k}^{2}(n)+{K(K-1)\over n^{2}}\mu_{k^{*}}^{2}\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}^{2}(n)\leq\frac{K^{2}}{n^{2}}\bar{\mu}^{2}\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}^{2}(n), (41)

where μ¯2=max{μ12,,μK2}\bar{\mu}^{2}=\max\{\mu_{1}^{2},...,\mu_{K}^{2}\}.

Combining (37), (39)–(41) and Proposition 1 yields

Var[M~n]\displaystyle{\rm Var}\left[\widetilde{M}_{n}\right]\leq 2Knσk2+2Kn2kk𝔼Tk(n)+2K2μ¯2n2kk𝔼Tk2(n)\displaystyle\frac{2K}{n}\sigma_{k^{*}}^{2}+\frac{2K}{n^{2}}\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}(n)+\frac{2K^{2}\bar{\mu}^{2}}{n^{2}}\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}^{2}(n)
\displaystyle\leq 2Knσk2+2Kn2kk(8νnΔk2+5)+2K2μ¯2n2kk{8νnΔk2+5}2\displaystyle\frac{2K}{n}\sigma_{k^{*}}^{2}+\frac{2K}{n^{2}}\sum_{k\neq k^{*}}\left(\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\right)+\frac{2K^{2}\bar{\mu}^{2}}{n^{2}}\sum_{k\neq k^{*}}\left\{\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\right\}^{2}
\displaystyle\leq 2Knσk2+2K2cνnn2+2K3μ¯2n2c2νn22Knσk2+2K3(μ¯2+1)n2c2νn2,\displaystyle\frac{2K}{n}\sigma_{k^{*}}^{2}+\frac{2K^{2}c\nu_{n}}{n^{2}}+\frac{2K^{3}\bar{\mu}^{2}}{n^{2}}c^{2}\nu_{n}^{2}\leq\frac{2K}{n}\sigma_{k^{*}}^{2}+\frac{2K^{3}(\bar{\mu}^{2}+1)}{n^{2}}c^{2}\nu_{n}^{2}, (42)

where the third inequality is because of the condition 8νnΔk2+5cνn\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\leq c\nu_{n} for kkk\neq k^{*}.

Incorporating the bias bound in (36), the variance bound in (42), and the condition ν_n ≤ n^{1/2−δ} with 0 < δ < 1/2, we have

MSE(M̃_n) = Var[M̃_n] + Bias(M̃_n)² ≤ (2K/n)σ_{k*}² + o(1/n).

A.6 Proof of Theorem 5

To prove the result, we first define a martingale difference array and introduce a lemma on asymptotic normality for martingale difference arrays.

Definition A1.

(Definition 16.1.1 in Athreya and Lahiri (2006)) Let {Yn}n1\left\{Y_{n}\right\}_{n\geq 1} be a collection of random variables on a probability space (Ω,,)(\Omega,{\cal F},{\mathbb{P}}) and let {n}n1\left\{{\cal F}_{n}\right\}_{n\geq 1} be a filtration. Then, {Yn,n}n1\left\{Y_{n},{\cal F}_{n}\right\}_{n\geq 1} is called a martingale difference array if YnY_{n} is n{\cal F}_{n}-measurable and 𝔼[Yn|n1]=0{\mathbb{E}}[Y_{n}|{\cal F}_{n-1}]=0 for each n1n\geq 1.

Lemma A4 (Theorem 16.1.1 of Athreya and Lahiri (2006)).

For each n1n\geq 1, let {Yni,ni}i1\left\{Y_{ni},{\cal F}_{ni}\right\}_{i\geq 1} be a martingale difference array on (Ω,,)(\Omega,{\cal F},{\mathbb{P}}), with 𝔼|Yni|2<{\mathbb{E}}|Y_{ni}|^{2}<\infty for all n1n\geq 1 and let τn\tau_{n} be a finite stopping time w.r.t. ni{\cal F}_{ni}. Suppose that for some constant σ2>0\sigma^{2}>0, as nn\rightarrow\infty,

i=1τn𝔼[|Yni|2|ni]σ2,inprobability,\displaystyle\sum_{i=1}^{\tau_{n}}{\mathbb{E}}\left[\left.|Y_{ni}|^{2}\right|{\cal F}_{ni}\right]\to\sigma^{2},\ in\ probability,

and that for each ε>0\varepsilon>0,

i=1τn𝔼[|Yni|2𝟙{|Yni|>ε}|ni]0,inprobability,\displaystyle\sum_{i=1}^{\tau_{n}}{\mathbb{E}}\left[\left.|Y_{ni}|^{2}\mathds{1}_{\left\{\left|Y_{ni}\right|>\varepsilon\right\}}\right|{\cal F}_{ni}\right]\to 0,\ in\ probability,

Then,

i=1τnYniN(0,σ2),asn.\displaystyle\sum_{i=1}^{\tau_{n}}Y_{ni}\Rightarrow N(0,\sigma^{2}),\ as\ n\rightarrow\infty.

Proof of Theorem 5. By (7),

n(M~nμ)=Znn+n(1nj=1nμIjμ).\displaystyle\sqrt{n}\left(\widetilde{M}_{n}-\mu^{*}\right)=\frac{Z_{n}}{\sqrt{n}}+\sqrt{n}\left(\frac{1}{n}\sum_{j=1}^{n}\mu_{I_{j}}-\mu^{*}\right). (43)

We analyze the two terms on the RHS of (43) separately. We have shown that ZnZ_{n} is a martingale. So ZnZn1=XIn,nμInZ_{n}-Z_{n-1}=X_{I_{n},n}-\mu_{I_{n}} is a martingale difference array (see Definition A1). To show the asymptotic normality of Zn/nZ_{n}/\sqrt{n}, it suffices to prove that the two conditions of Lemma A4 are satisfied for Ynj=(XIj,jμIj)/nY_{nj}=\left(X_{I_{j},j}-\mu_{I_{j}}\right)/\sqrt{n}, τn=n\tau_{n}=n and nj=j{\cal F}_{nj}={\cal F}_{j}.

First, note that

j=1n𝔼[|(XIj,jμIj)/n|2|j1]=1nj=1n𝔼[XIj,j22μIjXIj,j+μIj2|j1]\displaystyle\sum_{j=1}^{n}{\mathbb{E}}\left[\left.\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)/\sqrt{n}\right|^{2}\right|{\cal F}_{j-1}\right]=\frac{1}{n}\sum_{j=1}^{n}{\mathbb{E}}\left[\left.X_{I_{j},j}^{2}-2\mu_{I_{j}}X_{I_{j},j}+\mu_{I_{j}}^{2}\right|{\cal F}_{j-1}\right]
=\displaystyle= 1nj=1n{𝔼[XIj,j2|j1]2μIj𝔼[XIj,j|j1]+μIj2}=1nj=1n{𝔼[XIj,j2|j1]μIj2}\displaystyle\frac{1}{n}\sum_{j=1}^{n}\left\{{\mathbb{E}}\left[\left.X_{I_{j},j}^{2}\right|{\cal F}_{j-1}\right]-2\mu_{I_{j}}{\mathbb{E}}\left[\left.X_{I_{j},j}\right|{\cal F}_{j-1}\right]+\mu_{I_{j}}^{2}\right\}=\frac{1}{n}\sum_{j=1}^{n}\left\{{\mathbb{E}}\left[\left.X_{I_{j},j}^{2}\right|{\cal F}_{j-1}\right]-\mu_{I_{j}}^{2}\right\}
=\displaystyle= 1nj=1nσIj2=1nk=1KTk(n)σk2a.s.σk2,asn,\displaystyle\frac{1}{n}\sum_{j=1}^{n}\sigma_{I_{j}}^{2}=\frac{1}{n}\sum_{k=1}^{K}T_{k}(n)\sigma_{k}^{2}\overset{a.s.}{\longrightarrow}\sigma_{k^{*}}^{2},\ {\rm as}\ n\to\infty,

where the convergence follows from Theorem 1.

Second, by Cauchy-Schwarz inequality,

j=1n𝔼[|(XIj,jμIj)/n|2𝟙{|(XIj,jμIj)/n|>ε}|j1]\displaystyle\sum_{j=1}^{n}{\mathbb{E}}\left[\left.\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)/\sqrt{n}\right|^{2}\mathbbm{1}\left\{\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)/\sqrt{n}\right|>\varepsilon\right\}\right|{\cal F}_{j-1}\right]
\displaystyle\leq j=1n(𝔼[|(XIj,jμIj)/n|4|j1])12(𝔼[𝟙{|(XIj,jμIj)/n|>ε}|j1])12\displaystyle\sum_{j=1}^{n}\left({\mathbb{E}}\left[\left.\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)/\sqrt{n}\right|^{4}\right|{\cal F}_{j-1}\right]\right)^{\frac{1}{2}}\left({\mathbb{E}}\left[\left.\mathbbm{1}\left\{\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)/\sqrt{n}\right|>\varepsilon\right\}\right|{\cal F}_{j-1}\right]\right)^{\frac{1}{2}}
=\displaystyle= 1nj=1n(𝔼[|XIj,jμIj|4|j1])12(|(XIj,jμIj)|>εn|j1)12\displaystyle\frac{1}{n}\sum_{j=1}^{n}\left({\mathbb{E}}\left[\left.\left|X_{I_{j},j}-\mu_{I_{j}}\right|^{4}\right|{\cal F}_{j-1}\right]\right)^{\frac{1}{2}}{\mathbb{P}}\left(\left.\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)\right|>\varepsilon\sqrt{n}\right|{\cal F}_{j-1}\right)^{\frac{1}{2}}
=\displaystyle= 1nj=1nϑIj2(|(XIj,jμIj)|>εn|j1)121nj=1nϑIj2((ε2n)1𝔼[|XIj,jμIj|2|j1])12\displaystyle\frac{1}{n}\sum_{j=1}^{n}\vartheta_{I_{j}}^{2}{\mathbb{P}}\left(\left.\left|\left(X_{I_{j},j}-\mu_{I_{j}}\right)\right|>\varepsilon\sqrt{n}\right|{\cal F}_{j-1}\right)^{\frac{1}{2}}\leq\frac{1}{n}\sum_{j=1}^{n}\vartheta_{I_{j}}^{2}\left((\varepsilon^{2}n)^{-1}{\mathbb{E}}\left[\left.\left|X_{I_{j},j}-\mu_{I_{j}}\right|^{2}\right|{\cal F}_{j-1}\right]\right)^{\frac{1}{2}}
=\displaystyle= 1nj=1nϑIj2((ε2n)1σIj2)12ϑ¯2σ¯εn0, as n,\displaystyle\frac{1}{n}\sum_{j=1}^{n}\vartheta_{I_{j}}^{2}\left((\varepsilon^{2}n)^{-1}\sigma_{I_{j}}^{2}\right)^{\frac{1}{2}}\leq\frac{\bar{\vartheta}^{2}\bar{\sigma}}{\varepsilon\sqrt{n}}\to 0,\mbox{ as }n\to\infty,

where ϑk4=𝔼[|Xkμk|4]\vartheta_{k}^{4}={\mathbb{E}}[|X_{k}-\mu_{k}|^{4}], ϑ¯4=max{ϑ14,,ϑK4}\bar{\vartheta}^{4}=\max\{\vartheta_{1}^{4},...,\vartheta_{K}^{4}\} and σ¯2=max{σ12,,σK2}\bar{\sigma}^{2}=\max\{\sigma_{1}^{2},...,\sigma_{K}^{2}\}.

Therefore, by Lemma A4, we have

Zn/nN(0,σk2),asn.\displaystyle Z_{n}/\sqrt{n}\Rightarrow N(0,\sigma_{k^{*}}^{2}),\ \ as\ \ n\rightarrow\infty.

It remains to prove that the second term on the RHS of (43) converges to 0 in probability. Note that

\displaystyle\frac{1}{n}\sum_{j=1}^{n}\mu_{I_{j}}-\mu_{k^{*}}=\frac{1}{n}\sum_{k=1}^{K}T_{k}(n)\mu_{k}-\mu_{k^{*}}=\sum_{k\neq k^{*}}\frac{T_{k}(n)}{n}\mu_{k}+\left(\frac{T_{k^{*}}(n)}{n}-1\right)\mu_{k^{*}}
=\sum_{k\neq k^{*}}\frac{T_{k}(n)}{n}\mu_{k}-\sum_{k\neq k^{*}}\frac{T_{k}(n)}{n}\mu_{k^{*}}=\sum_{k\neq k^{*}}\frac{T_{k}(n)}{n}\left(\mu_{k}-\mu_{k^{*}}\right).

Therefore, the first moment of the second term on the RHS of (43) satisfies

\displaystyle{\mathbb{E}}\left|\sqrt{n}\left(\frac{1}{n}\sum_{j=1}^{n}\mu_{I_{j}}-\mu_{k^{*}}\right)\right|={\mathbb{E}}\left|\sum_{k\neq k^{*}}\frac{T_{k}(n)}{\sqrt{n}}\left(\mu_{k}-\mu_{k^{*}}\right)\right|\leq\sum_{k\neq k^{*}}\frac{{\mathbb{E}}T_{k}(n)}{\sqrt{n}}\Delta_{k}\to 0,

where the convergence follows from Proposition 1 with $p=1$ and the condition $\nu_{n}\in\left[\alpha\log n,n^{\frac{1}{2}-\delta}\right]$ $(0<\delta<1,\ \alpha\geq 4\bar{\gamma}^{2})$.
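
To make the convergence explicit, using the bound ${\mathbb{E}}T_{k}(n)\leq 8\nu_{n}/\Delta_{k}^{2}+5$ from Proposition 1 (the same bound invoked in (52) below) together with $\nu_{n}\leq n^{\frac{1}{2}-\delta}$,

\sum_{k\neq k^{*}}\frac{{\mathbb{E}}T_{k}(n)}{\sqrt{n}}\Delta_{k}\leq\sum_{k\neq k^{*}}\frac{8\nu_{n}/\Delta_{k}+5\Delta_{k}}{\sqrt{n}}=O\left(\frac{\nu_{n}}{\sqrt{n}}\right)=O\left(n^{-\delta}\right)\to 0,\ \ {\rm as}\ \ n\to\infty.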

Applying Slutsky’s Theorem to the RHS of (43) leads to the conclusion.

A.7 Proof of Proposition 2

Note that

\displaystyle\sigma_{k}^{2}={\mathbb{E}}X_{k}^{2}-\left({\mathbb{E}}X_{k}\right)^{2}. (44)

We first prove that $\hat{\sigma}_{k}^{2}$ converges to $\sigma_{k}^{2}$ in probability. From Theorem 2 of Lai and Robbins (1985), it follows that for $k\neq k^{*}$, $T_{k}(n)\to\infty$ in probability; and from Theorem 1, it follows that $T_{k^{*}}(n)\to\infty$ a.s. By Theorem 2 in Richter (1965), we have that for $k=1,\ldots,K$,

\displaystyle\frac{1}{T_{k}(n)}\sum_{j=1}^{T_{k}(n)}X_{k,j}^{2}\overset{{\mathbb{P}}}{\longrightarrow}{\mathbb{E}}X_{k}^{2},\ \ {\rm and}\ \ \frac{1}{T_{k}(n)}\sum_{j=1}^{T_{k}(n)}X_{k,j}\overset{{\mathbb{P}}}{\longrightarrow}{\mathbb{E}}X_{k}.

By (44) and the continuous mapping theorem, it follows that $\hat{\sigma}_{k}^{2}\to\sigma_{k}^{2}$ in probability.

Then by Theorem 1 and the continuous mapping theorem, $\widetilde{\sigma}_{n}^{2}$ converges to $\sigma_{k^{*}}^{2}$ in probability.

A.8 Proof of Theorem 6

By Theorem 1, we have

\displaystyle T_{k^{*}}(n)\overset{a.s.}{\longrightarrow}\infty. (45)

By the strong law of large numbers (SLLN) and Theorem 1 in Richter (1965) or Theorem 8.2 in Chapter 6 of Gut (2013), we have

\displaystyle\frac{1}{T_{k^{*}}(n)}\sum_{j=1}^{T_{k^{*}}(n)}X_{k^{*},j}\overset{a.s.}{\longrightarrow}\mu^{*},\ \ {\rm as}\ \ n\to\infty.

Denote $M_{k}=\frac{1}{T_{k}(n)}\sum_{j=1}^{T_{k}(n)}X_{k,j}$. The definition of almost sure convergence says that

\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\forall 0<\varepsilon<1,\exists N^{\prime}=N^{\prime}(\omega,\varepsilon)\in{\mathbb{N}},\ s.t.\ \forall n>N^{\prime},\left|M_{k^{*}}-\mu^{*}\right|<\varepsilon\right)=1. (46)

From (35), it follows that

\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\forall 0<\varepsilon<1,\exists N=N(\omega,\varepsilon)\in{\mathbb{N}},\ s.t.\ \forall n>N,M_{I^{*}_{n}}=M_{k^{*}}\right)=1. (47)

Combining (46) and (47) leads to

\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\forall 0<\varepsilon<1,\exists\bar{N}=\max\{N,N^{\prime}\}\in{\mathbb{N}},\ s.t.\ \forall n>\bar{N},\left|M_{I^{*}_{n}}-\mu^{*}\right|<\varepsilon\right)=1,

implying that

\displaystyle M_{I^{*}_{n}}\overset{a.s.}{\rightarrow}\mu^{*},\ \ {\rm as}\ \ n\to\infty.

A.9 Proof of Theorem 7

For notational ease, denote

Z_{n,k}=\sum_{j=1}^{T_{k}(n)}\left(X_{k,j}-\mu_{k}\right),\quad k=1,\ldots,K.

Recall that sub-Gaussian distributions have finite moments up to any order. By Assumption 1 and the fact that $T_{k}(n)\leq n$, it can be seen that $\{X_{k,j},j\geq 1\}$ satisfies the conditions in Lemma A2 for any $k=1,\ldots,K$. We first note the following two facts that are useful in the proof. For any positive integer $p$, by Proposition 1,

\displaystyle\Pr(I_{n}^{*}\neq k^{*})\leq\sum_{k\neq k^{*}}\Pr(I_{n}^{*}=k)\leq\sum_{k\neq k^{*}}\Pr(T_{k}(n)\geq n/K)\leq K^{p+1}{\mathbb{E}}\left[T_{k}(n)^{p}\right]/n^{p}=O\left({\nu_{n}^{p}\over n^{p}}+{1\over n^{3}}\right), (48)

and by Lemma A2,

\displaystyle{\mathbb{E}}\left[\left|Z_{n,k}\right|^{p}\right]=O(n^{p/2}),\quad k=1,\ldots,K. (49)
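
In particular, for $p=2$ and $k=k^{*}$, the bound in (49) can be obtained directly from Wald's identity (Theorem 13.2.14 in Athreya and Lahiri 2006, the result also invoked in (55) below):

{\mathbb{E}}\left[Z_{n,k^{*}}^{2}\right]=\sigma_{k^{*}}^{2}\,{\mathbb{E}}\left[T_{k^{*}}(n)\right]\leq\sigma_{k^{*}}^{2}\,n.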

It suffices to analyze the bias and variance of $M_{I_{n}^{*}}$, and then the MSE follows straightforwardly. Consider the bias first. Note that

\displaystyle{\rm Bias}(M_{I_{n}^{*}})={\mathbb{E}}\left[M_{I_{n}^{*}}-\mu_{k^{*}}\right]={\mathbb{E}}\left[\frac{1}{T_{I_{n}^{*}}(n)}\sum_{j=1}^{T_{I_{n}^{*}}(n)}(X_{I_{n}^{*},j}-\mu_{k^{*}})\right]
={\mathbb{E}}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]+\sum_{k\neq k^{*}}{\mathbb{E}}\left[\left(\frac{Z_{n,k}}{T_{k}(n)}-\Delta_{k}\right)\mathbbm{1}\{I_{n}^{*}=k\}\right]
={\mathbb{E}}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)}\right]-\sum_{k\neq k^{*}}{\mathbb{E}}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)}\mathbbm{1}\{I_{n}^{*}=k\}\right]+\sum_{k\neq k^{*}}{\mathbb{E}}\left[\left(\frac{Z_{n,k}}{T_{k}(n)}-\Delta_{k}\right)\mathbbm{1}\{I_{n}^{*}=k\}\right]
=Q_{1}-Q_{2}+Q_{3},

where $Q_{1}$, $Q_{2}$ and $Q_{3}$ denote the three terms on the RHS of (A.9), respectively.

Recall that ${\mathbb{E}}[Z_{n,k^{*}}]=0$ due to Wald's equation. It follows that

\displaystyle Q_{1}={\mathbb{E}}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)}-\frac{Z_{n,k^{*}}}{{\mathbb{E}}T_{k^{*}}(n)}\right]={\mathbb{E}}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)}\left(1-\frac{T_{k^{*}}(n)}{{\mathbb{E}}T_{k^{*}}(n)}\right)\right]={\rm Cov}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)},1-\frac{T_{k^{*}}(n)}{{\mathbb{E}}T_{k^{*}}(n)}\right]
=-{\rm Cov}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)},\frac{T_{k^{*}}(n)}{{\mathbb{E}}T_{k^{*}}(n)}\right]=-\rho\cdot\frac{\sqrt{{\rm Var}[Z_{n,k^{*}}/T_{k^{*}}(n)]{\rm Var}[T_{k^{*}}(n)]}}{{\mathbb{E}}T_{k^{*}}(n)}, (51)

where $\rho$ denotes the correlation coefficient between $Z_{n,k^{*}}/T_{k^{*}}(n)$ and $T_{k^{*}}(n)$, i.e.,

\displaystyle\rho=\frac{{\rm Cov}\left[Z_{n,k^{*}}/T_{k^{*}}(n),T_{k^{*}}(n)\right]}{\sqrt{{\rm Var}[Z_{n,k^{*}}/T_{k^{*}}(n)]{\rm Var}[T_{k^{*}}(n)]}}.

It should also be noted that $\rho>0$ (and thus $Q_{1}<0$), because $Z_{n,k^{*}}/T_{k^{*}}(n)$ and $T_{k^{*}}(n)$ are positively correlated. To see this, note that

\displaystyle{\rm Cov}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)},T_{k^{*}}(n)\right]={\rm Cov}\left[\frac{\bar{X}_{k^{*}}[T_{k^{*}}(n)]T_{k^{*}}(n)}{T_{k^{*}}(n)}-\mu_{k^{*}},T_{k^{*}}(n)\right]={\rm Cov}\left[\bar{X}_{k^{*}}[T_{k^{*}}(n)],T_{k^{*}}(n)\right]>0,

where the inequality follows naturally from the way that GUCB is defined, i.e., a larger sample mean of arm $k^{*}$ leads to a higher chance of sampling from arm $k^{*}$, and hence to a larger $T_{k^{*}}(n)$.

By (51), to understand the order of $Q_{1}$, we only need to analyze ${\mathbb{E}}T_{k^{*}}(n)$, ${\rm Var}[T_{k^{*}}(n)]$ and ${\rm Var}[Z_{n,k^{*}}/T_{k^{*}}(n)]$. Note that by Proposition 1,

\displaystyle n>{\mathbb{E}}T_{k^{*}}(n)=n-\sum_{k\neq k^{*}}{\mathbb{E}}T_{k}(n)\geq n-\sum_{k\neq k^{*}}\left(\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\right)\geq n-Kc_{1}\nu_{n}, (52)

and

\displaystyle{\rm Var}[T_{k^{*}}(n)]={\rm Var}\left[n-\sum_{k\neq k^{*}}T_{k}(n)\right]={\rm Var}\left[\sum_{k\neq k^{*}}T_{k}(n)\right]\leq{\mathbb{E}}\left[\left(\sum_{k\neq k^{*}}T_{k}(n)\right)^{2}\right]
\leq(K-1){\mathbb{E}}\left[\sum_{k\neq k^{*}}T_{k}(n)^{2}\right]\leq K\sum_{k\neq k^{*}}\left(\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\right)^{2}\leq K^{2}c_{2}^{2}\nu_{n}^{2}, (53)

where the third inequality follows from Proposition 1 with $p=2$.
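
Here $c_{1}$ and $c_{2}$ are constants not depending on $n$; if they are not fixed elsewhere, one admissible choice (valid once $n$ is large enough that $\nu_{n}\geq 1$, which holds eventually since $\nu_{n}\geq\alpha\log n$) is

\frac{8\nu_{n}}{\Delta_{k}^{2}}+5\leq\left(\frac{8}{\Delta_{k}^{2}}+5\right)\nu_{n},\qquad c_{1}=c_{2}=\max_{k\neq k^{*}}\left(\frac{8}{\Delta_{k}^{2}}+5\right).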

Moreover,

\displaystyle{\rm Var}\left[\frac{Z_{n,k^{*}}}{T_{k^{*}}(n)}\right]\leq{\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{k^{*}}^{2}(n)}\right]={\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{I_{n}^{*}}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]+{\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{k^{*}}(n)^{2}}\mathbbm{1}\{I_{n}^{*}\neq k^{*}\}\right]. (54)

We analyze the RHS of (54). Because $T_{I_{n}^{*}}(n)\geq n/K$,

\displaystyle{\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{I_{n}^{*}}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]\leq{K^{2}\over n^{2}}{\mathbb{E}}\left[Z_{n,k^{*}}^{2}\right]={\sigma_{k^{*}}^{2}K^{2}\over n^{2}}{\mathbb{E}}\left[T_{k^{*}}(n)\right]\leq{\sigma_{k^{*}}^{2}K^{2}\over n}, (55)

where the first equality follows from Wald’s lemma (Theorem 13.2.14 in Athreya and Lahiri 2006).

We also note that

\displaystyle{\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{k^{*}}(n)^{2}}\mathbbm{1}\{I_{n}^{*}\neq k^{*}\}\right]\leq{\mathbb{E}}\left[Z_{n,k^{*}}^{2}\mathbbm{1}\{I_{n}^{*}\neq k^{*}\}\right] (56)
\leq{\mathbb{E}}^{p_{1}-1\over p_{1}}\left(\mathbbm{1}\{I_{n}^{*}\neq k^{*}\}\right){\mathbb{E}}^{1\over p_{1}}\left(Z_{n,k^{*}}^{2p_{1}}\right)
=O\left(\left({\nu_{n}^{p_{2}}\over n^{p_{2}}}+{1\over n^{3}}\right)^{p_{1}-1\over p_{1}}\cdot n^{{2p_{1}\over 2}\cdot{1\over p_{1}}}\right)=O\left(\left({n^{(1-\delta)p_{2}(p_{1}-1)/p_{1}}\over n^{p_{2}(p_{1}-1)/p_{1}}}+{1\over n^{3(p_{1}-1)/p_{1}}}\right)\cdot n\right)=o\left({1\over n}\right),

where the second inequality follows from Hölder's inequality, and the third-to-last equality is due to the facts in (48) and (49), with $p_{1}$ and $p_{2}$ satisfying $p_{1}=3/(1-\epsilon)$ and $p_{2}>3/\delta$ for some $\epsilon\in(0,1)$.
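
To spell out the exponent count behind the last step: with $p_{1}=3/(1-\epsilon)$ we have $(p_{1}-1)/p_{1}=(2+\epsilon)/3$, so that

3\cdot\frac{p_{1}-1}{p_{1}}=2+\epsilon>2\qquad{\rm and}\qquad\delta p_{2}\cdot\frac{p_{1}-1}{p_{1}}>\delta\cdot\frac{3}{\delta}\cdot\frac{2+\epsilon}{3}=2+\epsilon>2,

and hence, after multiplying by $n$, both terms inside the parentheses above are of order at most $n^{1-(2+\epsilon)}=n^{-1-\epsilon}=o(1/n)$.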

Combining Equations (51)-(56) yields

\displaystyle|Q_{1}|\leq{K^{2}\sigma_{k^{*}}^{2}c_{2}\nu_{n}+o(1)\over\sqrt{n}\left(n-Kc_{1}\nu_{n}\right)}=O\left({\nu_{n}\over n^{3/2}}\right).

Using the same argument as in proving (56), we have

\displaystyle|Q_{2}|\leq\sum_{k\neq k^{*}}{\mathbb{E}}\left[\frac{\left|Z_{n,k^{*}}\right|}{T_{k^{*}}(n)}\mathbbm{1}\{I_{n}^{*}=k\}\right]\leq\sum_{k\neq k^{*}}{\mathbb{E}}\left[\left|Z_{n,k^{*}}\right|\mathbbm{1}\{I_{n}^{*}=k\}\right]
\leq(K-1){\mathbb{E}}^{p_{1}-1\over p_{1}}\left(\mathbbm{1}\{I_{n}^{*}\neq k^{*}\}\right){\mathbb{E}}^{1\over p_{1}}\left(\left|Z_{n,k^{*}}\right|^{p_{1}}\right)
=O\left(\left({n^{(1-\delta)p_{2}(p_{1}-1)/p_{1}}\over n^{p_{2}(p_{1}-1)/p_{1}}}+{1\over n^{3(p_{1}-1)/p_{1}}}\right)\cdot n^{1\over 2}\right)=o\left({\nu_{n}\over n^{3/2}}\right)

with $p_{1}$ and $p_{2}$ satisfying $p_{1}=3/(1-\epsilon)$ and $p_{2}>3/(2\delta)$ for some $\epsilon\in(0,1)$.

In a similar manner, it can be shown that $|Q_{3}|=o\left({\nu_{n}\over n^{3/2}}\right)$, and the details are omitted. Therefore, by (A.9),

\left|{\rm Bias}\left(M_{I_{n}^{*}}\right)\right|\leq{K^{2}\sigma_{k^{*}}^{2}c_{2}\nu_{n}+c_{3}\over\sqrt{n}\left(n-Kc_{1}\nu_{n}\right)}+o\left({\nu_{n}\over n^{3/2}}\right)=O\left({\nu_{n}\over n^{3/2}}\right).

Note further that the order of $|Q_{1}|$ dominates those of $|Q_{2}|$ and $|Q_{3}|$. Because $Q_{1}<0$, we know that ${\rm Bias}\left(M_{I_{n}^{*}}\right)<0$ for sufficiently large $n$.

We now consider the variance. Note that

\displaystyle{\rm Var}\left[M_{I_{n}^{*}}\right]\leq{\mathbb{E}}\left[\left|M_{I_{n}^{*}}-\mu_{k^{*}}\right|^{2}\right]
={\mathbb{E}}\left[\left|M_{I_{n}^{*}}-\mu_{k^{*}}\right|^{2}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]+\sum_{k\neq k^{*}}{\mathbb{E}}\left[\left|M_{I_{n}^{*}}-\mu_{k^{*}}\right|^{2}\mathbbm{1}\{I_{n}^{*}=k\}\right]
={\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{k^{*}}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]+\sum_{k\neq k^{*}}{\mathbb{E}}\left[\left(\frac{Z_{n,k}}{T_{k}(n)}-\Delta_{k}\right)^{2}\mathbbm{1}\{I_{n}^{*}=k\}\right]
\leq{\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{k^{*}}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]+2\sum_{k\neq k^{*}}{\mathbb{E}}\left[{Z_{n,k}^{2}\over T_{k}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k\}\right]+2\sum_{k\neq k^{*}}\Delta_{k}^{2}\,{\mathbb{E}}\left[\mathbbm{1}\{I_{n}^{*}=k\}\right]. (57)

We note that by (55),

{\mathbb{E}}\left[\frac{Z_{n,k^{*}}^{2}}{T_{k^{*}}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k^{*}\}\right]\leq{\sigma_{k^{*}}^{2}K^{2}\over n}.

Similar to the proof of (56), we have

2\sum_{k\neq k^{*}}{\mathbb{E}}\left[{Z_{n,k}^{2}\over T_{k}^{2}(n)}\mathbbm{1}\{I_{n}^{*}=k\}\right]=o\left({1\over n}\right).

Moreover, using the fact in (48),

2\sum_{k\neq k^{*}}\Delta_{k}^{2}\,{\mathbb{E}}\left[\mathbbm{1}\{I_{n}^{*}=k\}\right]=O\left({\nu_{n}^{p}\over n^{p}}+{1\over n^{3}}\right)=O\left({n^{(1-\delta)p}\over n^{p}}+{1\over n^{3}}\right)=o\left({1\over n}\right),

with $p>1/\delta$.
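
The last step is just an exponent count: since $\delta p>1$,

{n^{(1-\delta)p}\over n^{p}}=n^{-\delta p}=o\left({1\over n}\right),\qquad\mbox{while}\qquad{1\over n^{3}}=o\left({1\over n}\right)\ \mbox{trivially}.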

Therefore,

{\rm Var}\left[M_{I_{n}^{*}}\right]\leq{\sigma_{k^{*}}^{2}K^{2}\over n}+o\left({1\over n}\right),

which completes the proof.

A.10 Proof of Theorem 8

Let $M_{k}=\frac{1}{T_{k}(n)}\sum_{j=1}^{T_{k}(n)}X_{k,j}$. By (47),

\displaystyle{\mathbb{P}}\left(\omega\in\Omega:\forall 0<\varepsilon<1,\exists N=N(\omega,\varepsilon)\in{\mathbb{N}},\ s.t.\ \forall n>N,\sqrt{n}M_{I^{*}_{n}}=\sqrt{n}M_{k^{*}}\right)=1,

i.e.,

\displaystyle\sqrt{n}M_{I^{*}_{n}}-\sqrt{n}M_{k^{*}}\overset{a.s.}{\longrightarrow}0,\ \ {\rm as}\ \ n\to\infty.

Note that

\displaystyle\sqrt{n}\left(M_{I^{*}_{n}}-\mu^{*}\right)=\sqrt{n}M_{I^{*}_{n}}-\sqrt{n}M_{k^{*}}+\sqrt{n}\left(M_{k^{*}}-\mu^{*}\right).

By Slutsky's Theorem, to prove the theorem, it suffices to show

\displaystyle\sqrt{n}\left(M_{k^{*}}-\mu^{*}\right)\Rightarrow N(0,\sigma_{k^{*}}^{2}),\ \ {\rm as}\ \ n\to\infty, (58)

which follows immediately from (45) and Theorem 1 of Rényi (1957) or Theorem 3.2 in Chapter 7 of Gut (2013), and the proof is completed.
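
To see how these results combine, note that Theorem 1 also gives $T_{k^{*}}(n)/n\to 1$ a.s., so that

\sqrt{n}\left(M_{k^{*}}-\mu^{*}\right)=\sqrt{\frac{n}{T_{k^{*}}(n)}}\cdot\sqrt{T_{k^{*}}(n)}\left(M_{k^{*}}-\mu^{*}\right),

where the first factor converges to 1 a.s. and, by the randomly indexed CLT of Rényi (1957), the second factor converges weakly to $N(0,\sigma_{k^{*}}^{2})$; Slutsky's Theorem then yields (58).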

A.11 Proof of Proposition 3

Note that $\sigma_{k^{*}}^{2}={\mathbb{E}}X_{k^{*}}^{2}-\left({\mathbb{E}}X_{k^{*}}\right)^{2}$. It suffices to prove that the two terms on the RHS of (12) converge to these two terms, respectively.

By Theorem 6 and the continuous mapping theorem, the second term on the RHS of (12) converges almost surely to $\left({\mathbb{E}}X_{k^{*}}\right)^{2}$. Similar to the proof of Theorem 6, we can show that the first term on the RHS of (12) converges to ${\mathbb{E}}X_{k^{*}}^{2}$ almost surely, thus completing the proof.

References

  • Agrawal, R. 1995. Sample mean based index policies by $O(\log n)$ regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4), 1054-1078.

  • Athreya, K. B., S. N. Lahiri. 2006. Measure Theory and Probability Theory. Springer Science & Business Media.

  • Auer, P., N. Cesa-Bianchi, P. Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235-256.

  • Chang, H. S., M. C. Fu, J. Hu, S. I. Marcus. 2005. An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1), 126-139.

  • Chen, H. J., E. J. Dudewicz. 1976. Procedures for fixed-width interval estimation of the largest normal mean. Journal of the American Statistical Association, 71(355), 752-756.

  • D’Eramo, C., A. Nuara, M. Restelli. 2016. Estimating the maximum expected value through Gaussian approximation. International Conference on Machine Learning, 1032-1040.

  • Dmitrienko, A., J. C. Hsu. 2006. Multiple testing in clinical trials. S. Kotz, C. B. Read, N. Balakrishnan, B. Z. Vidakovic (Eds.), Encyclopedia of Statistical Sciences. Wiley, New Jersey.

  • Fan, W., L. J. Hong, B. L. Nelson. 2016. Indifference-zone-free selection of the best. Operations Research, 64(6), 1499-1514.

  • FDA (Food and Drug Administration). 2019. Adaptive designs for clinical trials of drugs and biologics: Guidance for industry. Available at https://www.fda.gov/media/78495/download.

  • Fu, M. C. 2017. Markov decision processes, AlphaGo, and Monte Carlo tree search: Back to the future. INFORMS TutORials in Operations Research, 68-88.

  • Gut, A. 2013. Probability: A Graduate Course. Springer Science & Business Media.

  • Hao, B., Y. Abbasi-Yadkori, Z. Wen, G. Cheng. 2019. Bootstrapping upper confidence bound. Advances in Neural Information Processing Systems, 12123-12133.

  • Kim, S.-H., B. L. Nelson. 2006. Selecting the best system. In Handbook in OR & MS: Simulation, S. G. Henderson and B. L. Nelson (Eds.), 501-534, Elsevier Science.

  • Kocsis, L., C. Szepesvári. 2006. Bandit based Monte-Carlo planning. Machine Learning: ECML, 282-293.

  • Lai, T. L., H. Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4-22.

  • Lesnevski, V., B. L. Nelson, J. Staum. 2007. Simulation of coherent risk measures based on generalized scenarios. Management Science, 53(11), 1756-1769.

  • Mozgunov, P., T. Jaki. 2020. An information theoretic approach for selecting arms in clinical trials. Journal of the Royal Statistical Society Series B, 82(5), 1223-1247.

  • Qin, C., D. Klabjan, D. Russo. 2017. Improving the expected improvement algorithm. Proceedings of the 31st International Conference on Neural Information Processing Systems, 5387-5397.

  • Rényi, A. 1957. On the asymptotic distribution of the sum of a random number of independent random variables. Acta Mathematica Academiae Scientiarum Hungarica, 8(1-2), 193-199.

  • Richter, W. 1965. Limit theorems for sequences of random variables with sequences of random indices. Theory of Probability & Its Applications, 10(1), 74-84.

  • Robertson, D. S., J. M. S. Wason. 2019. Familywise error control in multi-armed response-adaptive trials. Biometrics, 75, 885-894.

  • Russo, D. 2020. Simple bayesian algorithms for best arm identification. Operations Research, 68(6), 1625-1647.

  • Shamir, O. 2011. A variant of Azuma’s inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392.

  • Smith, J. E., R. L. Winkler. 2006. The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3), 311-322.

  • van Hasselt, H. 2010. Double Q-learning. Advances in Neural Information Processing Systems (NIPS), 2613-2621.

  • van Hasselt, H. 2013. Estimating the maximum expected value: an analysis of (nested) cross validation and the maximum sample average. Available at arXiv preprint arXiv:1302.7175.

  • van Hasselt, H., A. Guez, D. Silver. 2016. Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30, 2094-2100.

  • Villar, S. S., J. Bowden, J. Wason. 2015. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30(2), 199.

  • Wainwright, M. J. 2019. High-dimensional Statistics: A Non-asymptotic Viewpoint (Vol. 48). Cambridge University Press.