
Locally Optimal Fixed-Budget Best Arm Identification
in Two-Armed Gaussian Bandits
with Unknown Variances

Masahiro Kato University of Tokyo
Abstract

We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, an arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develops a lower bound for the probability of misidentifying the best arm. They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates variances during an adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW) strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity, and the gap between the expected rewards of two arms approaches zero (small-gap regime). Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variance, is asymptotically optimal even when the variances are unknown.

1 Introduction

This study investigates the problem of best arm identification (BAI) with a fixed budget in stochastic two-armed Gaussian bandits. In this problem, we consider an adaptive experiment with a fixed number of rounds, called a budget. At each round, we can draw an arm and observe the reward. The goal of the problem is to identify the best arm with the highest expected reward at the end of the experiment (Bubeck et al., 2009; Audibert et al., 2010).

Formally, we consider the following adaptive experiment with two arms and Gaussian rewards. There are two arms 1 and 2, and each arm a\in\{1,2\} has an \mathbb{R}-valued Gaussian reward Y_{a}\sim\mathcal{N}(\mu_{a},\sigma_{a}^{2}) with mean \mu_{a}\in[-C_{\mu},C_{\mu}] and variance \sigma_{a}^{2}\in[C_{\sigma^{2}},1/C_{\sigma^{2}}] for some universal constants C_{\mu},C_{\sigma^{2}}>0. We assume that C_{\mu} and C_{\sigma^{2}} are known to us for a technical purpose; it is enough to set C_{\mu} to a sufficiently large value and C_{\sigma^{2}} to a sufficiently small value. Given fixed (\sigma^{2}_{1},\sigma^{2}_{2}), let

\mathcal{P}^{\mathrm{G}}\coloneqq\mathcal{P}^{\mathrm{G}}_{(\sigma^{2}_{1},\sigma^{2}_{2})}\coloneqq\big{\{}P=(\mathcal{N}(\mu_{1},\sigma^{2}_{1}),\mathcal{N}(\mu_{2},\sigma^{2}_{2})):\mu_{1},\mu_{2}\in(-\infty,+\infty),\ \ \mu_{1}\neq\mu_{2}\big{\}}

be a set of distributions generating the data, referred to as the Gaussian bandit models, where P\in\mathcal{P}^{\mathrm{G}} is a pair of distributions that generates (Y_{1},Y_{2}), and \mathcal{N}(\mu,\sigma^{2}) is a Gaussian distribution with mean \mu and variance \sigma^{2}. For an instance P, the best arm a^{\star}(P)\in\{1,2\} is defined as a^{\star}(P)=\operatorname*{arg\,max}_{a\in\{1,2\}}\mu_{a}, which is assumed to exist uniquely.

In the adaptive experiment, we consider a strategy to identify the best arm. A fixed budget T is given. For each round t\in[T]\coloneqq\{1,2,\dots,T\}, let (Y_{1,t},Y_{2,t}) be an independent and identically distributed (i.i.d.) copy of (Y_{1},Y_{2}). At each round t, we draw an arm A_{t}\in\{1,2\} and observe a reward Y_{t}=\sum_{a\in\{1,2\}}\mathbbm{1}[A_{t}=a]Y_{a,t}. At the end of the experiment (after round T), we recommend an estimated best arm \widehat{a}_{T}\in\{1,2\}. During the experiment, we follow a strategy that determines which arm to draw and which arm to recommend as the best arm. The performance of a strategy is evaluated by its probability of misidentification \mathbb{P}_{P}(\widehat{a}_{T}\neq a^{\star}(P)), where \mathbb{P}_{P} is the probability law under P.
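As a concrete illustration of this setup, the following minimal Python sketch simulates the two-armed Gaussian environment; the class name, parameter values, and zero-based arm indices are our own illustrative choices rather than notation from the paper.

import numpy as np

class TwoArmedGaussianBandit:
    # Two Gaussian arms; arm a has reward Y_a ~ N(mu[a], sigma[a]^2).
    def __init__(self, mu, sigma, seed=0):
        self.mu = np.asarray(mu, dtype=float)
        self.sigma = np.asarray(sigma, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # arm is 0 or 1, corresponding to arms 1 and 2 in the text.
        return self.rng.normal(self.mu[arm], self.sigma[arm])

    def best_arm(self):
        return int(np.argmax(self.mu))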

Background.

In fixed-budget BAI, it has been an important question to characterize the probability of misidentification \mathbb{P}_{P}(\widehat{a}_{T}\neq a^{\star}(P)) in the limit T\to\infty. To this end, a typical approach is to derive upper and lower bounds of the probability separately and match them.

For a lower bound of the probability of misidentification, Kaufmann et al. (2016) develops a general theory for deriving lower bounds of the probability. Their theory applies the change-of-measure argument, which has been employed in various problems (van der Vaart, 1998), including studies for regret minimization (Lai & Robbins, 1985). Their lower bound is general and can be applied to a wide range of settings, such as the fixed confidence setting (Garivier & Kaufmann, 2016) as well as the fixed budget setting.

In contrast, an upper bound of the misidentification probability has not been fully clarified. A typical way to derive upper bounds is to construct a specific strategy and evaluate its misidentification probability. Kaufmann et al. (2016) develops a strategy under a setting in which the variances (\sigma^{2}_{1},\sigma^{2}_{2}) of the rewards are known and shows that its misidentification probability matches the lower bound. However, this strategy is not available under the usual setting with unknown variances. Consequently, the current results are insufficient to establish an upper bound for the misidentification probability when the variances are unknown.

Given this situation, our interest is in strategies for the adaptive experimental setting described above whose misidentification probability can be tightly characterized. Specifically, we seek a strategy such that an upper bound on its misidentification probability aligns with the lower bound proposed in Kaufmann et al. (2016). Further, this strategy must be valid when the variances are unknown.

Our approach and contribution.

In this study, we develop a strategy whose probability of misidentification aligns with the lower bound under an additional asymptotic regime. To accomplish this, we develop the Neyman allocation-augmented inverse probability weighting (NA-AIPW) strategy. We then show that its probability of misidentification aligns with the lower bound under a small-gap regime. The details are described below.

The NA-AIPW strategy consists of a sampling rule using the Neyman allocation (NA) and a recommendation rule using the augmented inverse probability weighting (AIPW) estimator. NA is a method of sampling arms in proportion to the standard deviations (square roots of the variances) of the rewards, as utilized in Neyman (1934) and Kaufmann et al. (2016). The NA-AIPW strategy samples the arms by estimating these variances during the adaptive experiment. At the end of the experiment, the NA-AIPW strategy recommends the arm with the highest expected reward estimated using the AIPW estimator, an unbiased estimator with a small asymptotic variance.

The small-gap regime considers a situation where \mu_{1}-\mu_{2}\to 0 as T\to\infty. Although this additional setting slightly simplifies the BAI problem, the problem remains sufficiently complicated since the small gap makes it difficult to identify the best arm. This setting has been utilized in BAI with fixed confidence, such as in the analysis of lil'UCB (Jamieson et al., 2014). In the realm of statistical testing, such an evaluation framework is known as local Bahadur efficiency (Bahadur, 1960; Wieand, 1976; Akritas & Kourouklis, 1988; He & Shao, 1996). From a technical perspective, the small-gap regime is a situation where the estimation error of the variances can be ignored relative to the difficulty of identifying the best arm. Since the variance estimation error is relatively negligible in the small-gap setting, we can show that the misidentification probability of the NA-AIPW strategy matches the lower bound.

We summarize the background and our contributions. In BAI with two-armed Gaussian rewards and a fixed budget, a strategy whose misidentification probability achieves the lower bound derived by Kaufmann et al. (2016) has been sought. Although Kaufmann et al. (2016) demonstrates an asymptotically optimal strategy that satisfies this requirement with known variances, it remains an open issue to find a strategy whose upper bound matches the lower bound when the variances are unknown. For this issue, this study proposes the NA-AIPW strategy, whose probability of misidentification matches the lower bound under the small-gap regime.

Organization.

This study is organized as follows. First, in Section 2, we review the lower bound of Kaufmann et al. (2016). Then, in Section 3, we propose our NA-AIPW strategy. In Section 4, we show that the misidentification probability of the strategy asymptotically corresponds to the lower bound by Kaufmann et al. (2016) under the small-gap setting. We show the proof in Section 5, where we also provide a novel concentration inequality based on the Chernoff bound. In Section 6, we discuss the difficulty in this problem. In Section 7, we introduce related work and remaining problems, which includes an extension of our small-gap setting to a setting with multi-armed bandits and non-Gaussian rewards.

Notation. Let \mathcal{F}_{t} be the sigma-algebra generated by all observations up to round t. We define a truncation operator: for a variable v\in\mathbb{R} and constants c_{1}\leq c_{2}, \mathrm{thre}(v;c_{1},c_{2})\coloneqq\min\{\max\{v,c_{1}\},c_{2}\}.
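In code, this truncation operator is a simple clipping operation; the following minimal Python sketch (with illustrative constants in the usage comment) mirrors the definition above.

def thre(v, c1, c2):
    # Truncation operator thre(v; c1, c2) = min{max{v, c1}, c2}.
    return min(max(v, c1), c2)

# For example, thre(5.0, 0.1, 2.0) returns 2.0 and thre(-3.0, 0.1, 2.0) returns 0.1.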

2 Lower Bound of Probability of Misidentification

As a preparation, we introduce a lower bound for the probability of misidentification in BAI with a fixed budget. We call a strategy consistent if, for any P\in\mathcal{P}^{\mathrm{G}}, \mathbb{P}_{P}(\widehat{a}_{T}\neq a^{\star}(P))\to 0 as T\to\infty. To evaluate the performance of strategies, for any P\in\mathcal{P}^{\mathrm{G}}, we focus on the following metric for \mathbb{P}_{P}(\widehat{a}_{T}\neq a^{\star}(P)), used in many studies, such as Kaufmann et al. (2016):

\displaystyle-\frac{1}{T}\log\mathbb{P}_{P}(\widehat{a}_{T}\neq a^{\star}(P)).

Note that an upper bound (resp. lower bound) of this term works as a lower bound (resp. upper bound) of the probability of misidentification \mathbb{P}_{P}(\widehat{a}_{T}\neq a^{\star}(P)) since x\mapsto-\log x is a strictly decreasing function.

For two-armed Gaussian bandits, Kaufmann et al. (2016) presents the following lower bound.

Proposition 2.1 (Theorem 12 in Kaufmann et al. (2016)).

For any P^{*}\in\mathcal{P}^{\mathrm{G}} and \Delta=\mu^{*}_{1}-\mu^{*}_{2}, with known constants C_{\mu},C_{\sigma^{2}}>0 independent of T, if \{(Y_{1,t},Y_{2,t})\}_{t\in[T]} is generated from P^{*}, any consistent strategy satisfies

\displaystyle\limsup_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}(\widehat{a}_{T}\neq a^{\star}(P^{*}))\leq\frac{\Delta^{2}}{2\big{(}\sigma_{1}+\sigma_{2}\big{)}^{2}}.

Note that, for deriving the lower bound, we can remove the condition that we know constants C_{\mu},C_{\sigma^{2}}>0 independent of T such that \mu_{a}\in[-C_{\mu},C_{\mu}] and \sigma^{2}_{1},\sigma^{2}_{2}>C_{\sigma^{2}} hold. However, this condition is required to implement our strategy and to derive an upper bound. For the conditions of the lower bound to align with those of the upper bound, we add the conditions to this proposition.

There are some important aspects of this lower bound: (i) the term \Delta=\mu^{*}_{1}-\mu^{*}_{2}, referred to as the gap, appears in the numerator, and the magnitude of the error is described by the gap; (ii) the variances (\sigma_{1}^{2},\sigma_{2}^{2}) appear in the denominator and play an important role.
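As an illustrative numerical instance (our own example, not one from Kaufmann et al. (2016)), suppose \Delta=0.2, \sigma_{1}=1, and \sigma_{2}=2. Then the exponent in Proposition 2.1 is

\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}=\frac{(0.2)^{2}}{2(1+2)^{2}}=\frac{0.04}{18}\approx 0.0022,

so the misidentification probability of any consistent strategy cannot decay faster than roughly \exp(-0.0022\,T); doubling the gap would quadruple this exponent.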

A long-standing question is to find a strategy whose upper bound on the probability of misidentification coincides with the lower bound in Proposition 2.1. Although Kaufmann et al. (2016) develops a strategy that satisfies this requirement, it needs to sample arms with probabilities depending on the known variances (\sigma_{1}^{2},\sigma_{2}^{2}). To the best of our knowledge, when the variances are unknown and need to be estimated during the adaptive experiment, no such strategy has been found.

3 The NA-AIPW Strategy

In this section, we define our strategy. Formally, a strategy gives a pair ((A_{t})_{t\in[T]},\widehat{a}_{T}), where (i) (A_{t})_{t\in[T]}\in\{1,2\}^{T} is a sequence of arms generated by a sampling rule that determines which arm A_{t} is chosen in each round t based on \mathcal{F}_{t-1}, and (ii) \widehat{a}_{T}\in\{1,2\} is an arm recommended by a recommendation rule based on \mathcal{F}_{T}. Our proposed NA-AIPW strategy consists of (i) a sampling rule with the Neyman Allocation (NA) (Neyman, 1923), and (ii) a recommendation rule using the Augmented Inverse Probability Weighting (AIPW) estimator (Robins et al., 1994; Bang & Robins, 2005). Based on these rules, we refer to this strategy as the NA-AIPW strategy. (Similar strategies are often used in the context of average treatment effect estimation in adaptive experiments (van der Laan, 2008; Kato et al., 2020).)

3.1 Target Allocation Ratio

As preparation, we introduce the notion of a target allocation ratio, which will be used in the sampling rule. We define the target allocation ratios w_{1}^{*},w_{2}^{*}\in(0,1) as

\displaystyle w^{*}_{1}\coloneqq\frac{\sigma_{1}}{\sigma_{1}+\sigma_{2}},\quad\mathrm{and}\quad w^{*}_{2}\coloneqq 1-w^{*}_{1}.

A sampling rule following this target allocation ratio is known as the Neyman allocation rule (Neyman, 1934). Glynn & Juneja (2004) and Kaufmann et al. (2016) also propose this allocation. This target allocation ratio is characterized by the variances (standard deviations); therefore, the target allocation ratio is unknown when the variances are unknown. To use this ratio, we thus need to estimate it from observations.
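For instance (an illustrative example of ours), if \sigma_{1}=1 and \sigma_{2}=3, the target allocation is

w^{*}_{1}=\frac{1}{1+3}=0.25,\qquad w^{*}_{2}=0.75,

so the noisier arm 2 is drawn three times as often as arm 1.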

3.2 Sampling Rule with Neyman Allocation (NA)

We present the sampling rule with the NA. At each round t\in[T], our sampling rule randomly draws an arm a\in\{1,2\} with a probability equal to an estimated version of the target allocation ratio w_{a}^{*}. To estimate the target allocation ratio w_{a}^{*}, we estimate the variances during the adaptive experiment. For a\in\{1,2\}, let \{\widehat{\sigma}_{a,t}\}_{t\in[T]} and \{\widehat{w}_{a,t}\}_{t\in[T]} be sequences of estimators of \sigma_{a} and w_{a}^{*}, respectively, which will be defined below.

We use rounds t=1 and t=2 for initialization. Specifically, we draw arm 1 at round t=1 and arm 2 at round t=2, and set \widehat{w}_{a,1}=\widehat{w}_{a,2}=1/2 for a\in\{1,2\}.

At each round t\geq 3, we estimate the target allocation ratio (variances) w^{*}_{a} for a\in\{1,2\} using the past observations \mathcal{F}_{t-1}. For each t\geq 3, we first define an estimator of the expected reward \mu_{a} as

\widetilde{\mu}_{a,t}\coloneqq\frac{1}{\sum^{t-1}_{s=1}\mathbbm{1}[A_{s}=a]}\sum^{t-1}_{s=1}\mathbbm{1}[A_{s}=a]Y_{a,s}.

Also, we define a second-moment estimator \widetilde{\zeta}_{a,t}\coloneqq({\sum^{t-1}_{s=1}\mathbbm{1}[A_{s}=a]})^{-1}\sum^{t-1}_{s=1}\mathbbm{1}[A_{s}=a]Y^{2}_{a,s} and a standard deviation estimator \widetilde{\sigma}_{a,t}=\{\widetilde{\zeta}_{a,t}-\left(\widetilde{\mu}_{a,t}\right)^{2}\}^{1/2}. Then, we define the truncated estimator \widehat{\sigma}_{a,t}=\mathrm{thre}(\widetilde{\sigma}_{a,t};C_{\sigma^{2}}^{1/2},1/C_{\sigma^{2}}) with the constant C_{\sigma^{2}}>0 defined in Section 1. Note that this truncation is introduced for a technical purpose: it ensures that each arm is drawn infinitely many times as T\to\infty and prevents the estimators of \mu^{*}_{1} and \mu^{*}_{2}, defined below, from diverging to infinity. We simply use a sufficiently small value for C_{\sigma^{2}}>0. Also, we define the estimators \widehat{w}_{1,t} and \widehat{w}_{2,t} as

\displaystyle\widehat{w}_{1,t}\coloneqq\frac{{\widehat{\sigma}_{1,t}}}{{\widehat{\sigma}_{1,t}}+{\widehat{\sigma}_{2,t}}},\quad\mathrm{and}\quad\widehat{w}_{2,t}\coloneqq 1-\widehat{w}_{1,t}. (1)

In each round t\geq 3, we draw arm A_{t}=1 with probability \widehat{w}_{1,t} and arm A_{t}=2 with probability \widehat{w}_{2,t}.

We note the possibility of increasing the number of initialization rounds, although our strategy utilizes only the first two rounds for this purpose. Additional initialization rounds serve to stabilize the sampling rule in practical applications, akin to the concept of forced sampling (Garivier & Kaufmann, 2016). We can change the number of initialization rounds as long as the condition \widehat{w}_{a,t}\xrightarrow{\mathrm{a.s.}}w^{*}_{a} as t\to\infty is satisfied for every a\in\{1,2\}, which is crucial for our theoretical analysis. For instance, instead of using \widehat{w}_{1,t} directly, an alternative arm-drawing probability could be defined as \widetilde{w}_{1,t}=\alpha_{t}/2+(1-\alpha_{t})\widehat{w}_{1,t}, where \alpha_{t}\in[0,1] converges to zero as t approaches infinity (here, we define \widetilde{w}_{2,t}=1-\widetilde{w}_{1,t}). Moreover, the number of initialization rounds can be made dependent upon the number of arms without impacting the theoretical results.
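To make the sampling rule concrete, the following Python sketch (the function name and the default truncation constant are our own illustrative choices) computes the truncated standard deviation estimates and the arm-drawing probabilities in (1) from the rewards observed so far.

import numpy as np

def estimate_allocation(rewards_1, rewards_2, c_sigma2=0.01):
    # Estimate (w_hat_1, w_hat_2) from the lists of rewards observed for each arm.
    sigma_hat = []
    for rewards in (rewards_1, rewards_2):
        r = np.asarray(rewards, dtype=float)
        mu_tilde = r.mean()                          # sample mean of the arm
        zeta_tilde = np.mean(r ** 2)                 # second-moment estimator
        sigma_tilde = np.sqrt(max(zeta_tilde - mu_tilde ** 2, 0.0))
        # Truncation thre(.; C^{1/2}, 1/C) keeps the estimate away from 0 and infinity.
        sigma_hat.append(np.clip(sigma_tilde, np.sqrt(c_sigma2), 1.0 / c_sigma2))
    w1 = sigma_hat[0] / (sigma_hat[0] + sigma_hat[1])
    return w1, 1.0 - w1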

3.3 Recommendation Rule with the Augmented Inverse Probability Weighting (AIPW) Estimator

We present our recommendation rule. In the recommendation phase after round T, we estimate \mu^{*}_{a} for each a\in\{1,2\} and recommend the arm with the larger estimated expected reward. With a truncated version of the estimated expected reward \widehat{\mu}_{a,t}\coloneqq\mathrm{thre}(\widetilde{\mu}_{a,t};-C_{\mu},C_{\mu}), where C_{\mu}>0 is the predetermined constant defined in Section 1, we define the augmented inverse probability weighting (AIPW) estimator of \mu^{*}_{a} for each a\in\{1,2\} as

\displaystyle\widehat{\mu}^{\mathrm{AIPW}}_{a,T}\coloneqq\frac{1}{T}\sum^{T}_{t=1}\psi_{a,t},\qquad\mathrm{where}\ \psi_{a,t}\coloneqq\frac{\mathbbm{1}[A_{t}=a]\big{(}Y_{a,t}-\widehat{\mu}_{a,t}\big{)}}{\widehat{w}_{a,t}}+\widehat{\mu}_{a,t}. (2)

At the end of the experiment (after round t=T), we recommend \widehat{a}_{T} as

\displaystyle\widehat{a}_{T}\coloneqq\begin{cases}1&\mathrm{if}\ \ \widehat{\mu}^{\mathrm{AIPW}}_{1,T}\geq\widehat{\mu}^{\mathrm{AIPW}}_{2,T},\\ 2&\mathrm{otherwise}.\end{cases} (3)

We adopt the AIPW estimator for our strategy because it has several advantages. First, the AIPW estimator has the property of semiparametric efficiency, which means that it has the smallest asymptotic variance within a certain class of estimators (Hahn, 1998). This property is necessary to prove that the strategy using the AIPW estimator is optimal, in the sense that its misidentification probability is small enough to achieve the lower bound. The second reason is more technical: the AIPW estimator simplifies the theoretical analysis (see Section 6.3). Specifically, the error of the AIPW estimator can be decomposed into a sum of random variables with martingale properties, making it suitable for analysis using the central limit theorem. This property holds for the AIPW estimator but not for naive estimators such as the empirical average. Details are given in Section 5.

We provide the pseudo-code for our proposed strategy in Algorithm 1. Note that we introduce C_{\mu} and C_{\sigma^{2}} for technical purposes to bound the estimators; any sufficiently large value can be used for C_{\mu} and any sufficiently small value for C_{\sigma^{2}}.

Algorithm 1 NA-AIPW Strategy
  Parameter: Positive constants C_{\mu} and C_{\sigma^{2}}.
  Initialization:
  At t=1, sample A_{t}=1; at t=2, sample A_{t}=2. For a\in\{1,2\}, set \widehat{w}_{a,1}=\widehat{w}_{a,2}=0.5.
  for t=3 to T do
     Construct \widehat{w}_{a,t} following (1).
     Draw A_{t}=1 with probability \widehat{w}_{1,t} and A_{t}=2 with probability \widehat{w}_{2,t}=1-\widehat{w}_{1,t}.
     Observe Y_{t}.
  end for
  Construct \widehat{\mu}^{\mathrm{AIPW}}_{a,T} for a\in\{1,2\} following (2).
  Recommend \widehat{a}_{T} following (3).
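For readers who want to experiment with Algorithm 1, the following self-contained Python sketch runs the NA-AIPW strategy once; the truncation constants, the means and standard deviations in the usage comment, and zero-based arm indices are our own illustrative choices.

import numpy as np

def na_aipw(T, mu, sigma, c_mu=100.0, c_sigma2=0.01, seed=0):
    # One run of the NA-AIPW strategy; returns the recommended arm (0 or 1).
    rng = np.random.default_rng(seed)
    rewards = [[], []]                    # observed rewards per arm
    psi_sum = np.zeros(2)                 # running sums of the AIPW scores psi_{a,t}
    for t in range(T):
        mu_hat = np.zeros(2)
        w_hat = np.array([0.5, 0.5])
        if t < 2:
            arm = t                       # initialization: draw each arm once
        else:
            sigma_hat = np.zeros(2)
            for a in range(2):
                r = np.asarray(rewards[a])
                m = r.mean()
                s = np.sqrt(max(np.mean(r ** 2) - m ** 2, 0.0))
                sigma_hat[a] = np.clip(s, np.sqrt(c_sigma2), 1.0 / c_sigma2)
                mu_hat[a] = np.clip(m, -c_mu, c_mu)   # thre(mu_tilde; -C_mu, C_mu)
            w_hat[0] = sigma_hat[0] / (sigma_hat[0] + sigma_hat[1])
            w_hat[1] = 1.0 - w_hat[0]
            arm = 0 if rng.random() < w_hat[0] else 1
        y = rng.normal(mu[arm], sigma[arm])
        rewards[arm].append(y)
        for a in range(2):
            # AIPW score (2): 1[A_t = a](Y_t - mu_hat_a) / w_hat_a + mu_hat_a
            psi_sum[a] += (arm == a) * (y - mu_hat[a]) / w_hat[a] + mu_hat[a]
    return int(np.argmax(psi_sum / T))    # recommendation rule (3)

# Rough illustrative check of the misidentification frequency:
# errs = np.mean([na_aipw(2000, mu=(0.1, 0.0), sigma=(1.0, 2.0), seed=s) != 0 for s in range(200)])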

4 Misidentification Probability and Asymptotic Optimality

In this section, we show the following upper bound of the misidentification probability of the NA-AIPW strategy, which also implies that the strategy is asymptotically optimal.

Theorem 4.1 (Upper bound of the NA-AIPW strategy).

For any P^{*}\in\mathcal{P}^{\mathrm{G}}, with known constants C_{\mu},C_{\sigma^{2}}>0 independent of T, if \{(Y_{1,t},Y_{2,t})\}_{t\in[T]} is generated from P^{*}, the following holds as \Delta\to 0:

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{AIPW}}_{T}\neq a^{\star}(P^{*})\right)\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}-o\left(\Delta^{2}\right).

We note again that C_{\mu} and C_{\sigma^{2}} are introduced for technical purposes. The constant C_{\mu} guarantees the boundedness of the estimators, and it is sufficient to use a sufficiently large value for it. The constant C_{\sigma^{2}} ensures that each arm is drawn infinitely many times as T\to\infty and prevents the estimators of the means \mu^{*}_{1} and \mu^{*}_{2} from diverging to infinity, and it is sufficient to use a sufficiently small value for C_{\sigma^{2}}>0.

Note that a lower bound on -\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{AIPW}}_{T}\neq a^{\star}(P^{*})\right) implies an upper bound on \mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{AIPW}}_{T}\neq a^{\star}(P^{*})\right). This theorem enables us to evaluate the probability of misidentification up to the constant in the exponent, even when it is exponentially small, as \Delta\to 0.

This result directly implies the asymptotic optimality of the NA-AIPW strategy. As \Delta\to 0, the upper bound matches the lower bound in Proposition 2.1. This asymptotic optimality result suggests that the estimation error of the target allocation ratio (variances) w^{*} is negligible when \Delta\to 0. This is because the estimation error is insignificant compared to the difficulty of identifying the best arm under the small gap.

Although studies such as Ariu et al. (2021), Qin (2022), and Degenne (2023) point out the non-existence of optimal strategies in fixed-budget BAI against the lower bound shown by Kaufmann et al. (2016), our result does not yield a contradiction. Existing impossibility results concern the existence of a strategy that attains the lower bound uniformly. Note that the lower bounds in Kaufmann et al. (2016) are applicable to any instance in the bandit models (with some regularity conditions). In other words, if we consider the lower bound in Kaufmann et al. (2016) over all instances, there exist instances on which no strategy can attain it, so the attainable lower bound is larger than the one derived by Kaufmann et al. (2016). In contrast, we only consider bandit models where \Delta\to 0. Our result implies that if we restrict the bandit models in this way, the upper bound of our strategy within the restricted models matches the lower bound. Because our optimality is limited to the case where \Delta\to 0, we refer to it as asymptotic optimality under the small-gap regime or local asymptotic optimality.

We conjecture that even if we replace the AIPW estimator with the sample average estimator, defined as \widetilde{\mu}_{a,t}=({\sum^{t-1}_{s=1}\mathbbm{1}[A_{s}=a]})^{-1}\sum^{t-1}_{s=1}\mathbbm{1}[A_{s}=a]Y_{a,s} in Section 3.2, the upper bound of the strategy still matches the lower bound. However, the proof is an open issue. Hirano et al. (2003) and Hahn et al. (2011) show that the sample average estimator \widetilde{\mu}_{a,t} and the AIPW estimator have the same asymptotic variance (or asymptotic distribution). To show the result, we need to employ empirical process arguments. One of the problems in extending the result to the analysis of BAI is that their result focuses on the asymptotic distribution, not the tail probability. Therefore, to show the asymptotic optimality of the strategy with the sample average in the sense of the probability of misidentification, we need to modify the results in Hirano et al. (2003) and Hahn et al. (2011) to analyze the tail probability.

5 Proof of Theorem 4.1

To show Theorem 4.1, we derive an upper bound of \mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{AIPW}}_{a^{\star}(P^{*}),T}\leq\widehat{\mu}^{\mathrm{AIPW}}_{b,T}\right) for b\in\{1,2\}\backslash\{a^{\star}(P^{*})\}, which is equivalent to \mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{AIPW}}_{T}\neq a^{\star}(P^{*})\right). Without loss of generality, we assume that a^{\star}(P^{*})=1 and b=2. Let us define V\coloneqq\frac{\sigma^{2}_{1}}{w^{*}_{1}}+\frac{\sigma^{2}_{2}}{w^{*}_{2}}=\left(\sigma_{1}+\sigma_{2}\right)^{2} and

\displaystyle\Psi_{t}\coloneqq\frac{\psi_{1,t}-\psi_{2,t}-\Delta}{\sqrt{V}}.

In the following, we aim to derive an upper bound of \mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{AIPW}}_{a^{\star}(P^{*}),T}\leq\widehat{\mu}^{\mathrm{AIPW}}_{b,T}\right)=\mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{AIPW}}_{1,T}\leq\widehat{\mu}^{\mathrm{AIPW}}_{2,T}\right)=\mathbb{P}_{P^{*}}\left(\sum^{T}_{t=1}\Psi_{t}\leq-\frac{T\Delta}{\sqrt{V}}\right). Let \mathbb{E}_{P} be the expectation under P\in\mathcal{P}^{\mathrm{G}}. We derive the upper bound using the Chernoff bound. This proof is partially inspired by techniques in Hadad et al. (2021) and Kato et al. (2020).

First, because there exists a constant C>0 independent of T such that \widehat{w}_{a,t}>C by construction, the following lemma holds.

Lemma 5.1.

For any P^{*}\in\mathcal{P}^{\mathrm{G}} and all a\in\{1,2\}, \widehat{\mu}_{a,t}\xrightarrow{\mathrm{a.s.}}\mu^{*}_{a} and \widehat{\sigma}^{2}_{a,t}\xrightarrow{\mathrm{a.s.}}\sigma^{2}_{a} as t\to\infty.

Furthermore, from \widehat{\sigma}^{2}_{a,t}\xrightarrow{\mathrm{a.s.}}\sigma^{2}_{a} and the continuous mapping theorem, for any P^{*}\in\mathcal{P}^{\mathrm{G}} and all a\in\{1,2\}, \widehat{w}_{a,t}\xrightarrow{\mathrm{a.s.}}w^{*}_{a}.

Step 1: The sequence \{\Psi_{t}\}^{T}_{t=1} is a martingale difference sequence (MDS)

We prove that \{\Psi_{t}\}^{T}_{t=1} is an MDS; that is, \mathbb{E}_{P^{*}}\left[\Psi_{t}|\mathcal{F}_{t-1}\right]=0. Although this fact is well known in the causal inference literature (van der Laan, 2008; Hadad et al., 2021; Kato et al., 2020), we show the proof for the sake of completeness.

Lemma 5.2.

For any P^{*}\in\mathcal{P}^{\mathrm{G}}, \{\Psi_{t}\}^{T}_{t=1} is an MDS.

Proof.

For each t\in[T], it holds that

\displaystyle\sqrt{V}\mathbb{E}_{P^{*}}\left[\Psi_{t}|\mathcal{F}_{t-1}\right]
\displaystyle=\mathbb{E}_{P^{*}}\left[\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}}{\widehat{w}_{1,t}}+\widehat{\mu}_{1,t}\Big{|}\mathcal{F}_{t-1}\right]-\mathbb{E}_{P^{*}}\left[\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}}{\widehat{w}_{2,t}}+\widehat{\mu}_{2,t}\Big{|}\mathcal{F}_{t-1}\right]-\Delta
\displaystyle=\frac{\widehat{w}_{1,t}\big{(}\mu^{*}_{1}-\widehat{\mu}_{1,t}\big{)}}{\widehat{w}_{1,t}}+\widehat{\mu}_{1,t}-\frac{\widehat{w}_{2,t}\big{(}\mu^{*}_{2}-\widehat{\mu}_{2,t}\big{)}}{\widehat{w}_{2,t}}-\widehat{\mu}_{2,t}-\Delta=\left(\mu^{*}_{1}-\mu^{*}_{2}\right)-\left(\mu^{*}_{1}-\mu^{*}_{2}\right)=0.

Step 2: Evaluation by using the Chernoff Bound with Martingales

By applying the Chernoff bound, for any v<0 and any \lambda<0, it holds that

\displaystyle\mathbb{P}_{P^{*}}\left(\frac{1}{T}\sum^{T}_{t=1}\Psi_{t}\leq v\right)\leq\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\sum^{T}_{t=1}\Psi_{t}\right)\right]\exp\left(-T\lambda v\right).

From the Chernoff bound and a property of an MDS, we have

\displaystyle\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\sum^{T}_{t=1}\Psi_{t}\right)\right]=\mathbb{E}_{P^{*}}\left[\prod^{T}_{t=1}\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\Psi_{t}\right)|\mathcal{F}_{t-1}\right]\right]=\mathbb{E}_{P^{*}}\left[\exp\left(\sum^{T}_{t=1}\log\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\Psi_{t}\right)|\mathcal{F}_{t-1}\right]\right)\right].

Then, a Taylor expansion around \lambda=0 yields

\displaystyle\log\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\Psi_{t}\right)|\mathcal{F}_{t-1}\right]=\frac{\lambda^{2}}{2}\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]+o\left(\lambda^{2}\right) (4)

as \lambda\to 0. This is shown as follows. Since \mathbb{E}_{P^{*}}\left[\Psi^{k}_{t}\exp(\lambda\Psi_{t})|\mathcal{F}_{t-1}\right] is finite for \lambda in an open interval containing zero for k=1,2,3, the Taylor expansion yields the following (for details, see textbooks such as page 75 of Bulmer (1967) and Theorem 5.19 of Apostol (1974)):

\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\Psi_{t}\right)|\mathcal{F}_{t-1}\right]=1+\sum^{2}_{k=1}\lambda^{k}\mathbb{E}_{P^{*}}\left[\Psi^{k}_{t}/k!|\mathcal{F}_{t-1}\right]+o\left(\lambda^{2}\right)

as \lambda\to 0. Note that the finiteness of \mathbb{E}_{P^{*}}\left[\Psi^{k}_{t}\exp(\lambda\Psi_{t})|\mathcal{F}_{t-1}\right] follows from the structure of \Psi_{t}: (i) Y_{a,t} is a Gaussian random variable, (ii) \widehat{\mu}_{a,t} and \widehat{w}_{a,t} are bounded random variables by our truncation, and (iii) the lower bound of \widehat{w}_{a,t} is determined by C_{\sigma^{2}}. By using the Taylor expansion again, we approximate \log(1+z) around z=0 as \log(1+z)=z-z^{2}/2+z^{3}/3-\cdots. Therefore, we have

\displaystyle\log\mathbb{E}_{P^{*}}\left[\exp\left(\lambda\Psi_{t}\right)|\mathcal{F}_{t-1}\right]
\displaystyle=\Big{\{}\lambda\mathbb{E}_{P^{*}}\left[\Psi_{t}|\mathcal{F}_{t-1}\right]+\lambda^{2}\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}/2!|\mathcal{F}_{t-1}\right]+o\left(\lambda^{2}\right)\Big{\}}-\frac{1}{2}\left\{\lambda\mathbb{E}_{P^{*}}\left[\Psi_{t}|\mathcal{F}_{t-1}\right]+o\left(\lambda\right)\right\}^{2}
\displaystyle=\lambda^{2}\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}/2!|\mathcal{F}_{t-1}\right]+o\left(\lambda^{2}\right)

as \lambda\to 0. Here, we used \mathbb{E}_{P^{*}}\left[\Psi_{t}|\mathcal{F}_{t-1}\right]=0. Thus, (4) holds.

Step 3: Convergence of the Second Moment

We next show that \mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1\xrightarrow{\mathrm{a.s.}}0. This result is a direct consequence of Lemma 5.1.

Lemma 5.3.

For any P^{*}\in\mathcal{P}^{\mathrm{G}}, we have

\displaystyle\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1\xrightarrow{\mathrm{a.s.}}0\qquad\mathrm{as}\ \ t\to\infty.
Proof.

We have

\displaystyle V\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]=\mathbb{E}_{P^{*}}\left[\left(\psi_{1,t}-\psi_{2,t}-\Delta\right)^{2}\Big{|}\mathcal{F}_{t-1}\right]
\displaystyle=\mathbb{E}_{P^{*}}\Bigg{[}\Bigg{(}\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}}{\widehat{w}_{2,t}}\Bigg{)}^{2}
\displaystyle\qquad+2\Bigg{(}\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}}{\widehat{w}_{2,t}}\Bigg{)}\times\left(\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}-(\mu^{*}_{1}-\mu^{*}_{2})\right)
\displaystyle\qquad+\left(\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}-(\mu^{*}_{1}-\mu^{*}_{2})\right)^{2}\Big{|}\mathcal{F}_{t-1}\Bigg{]}
\displaystyle=\mathbb{E}_{P^{*}}\Bigg{[}\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}^{2}}{\big{(}\widehat{w}_{1,t}\big{)}^{2}}+\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}^{2}}{\big{(}\widehat{w}_{2,t}\big{)}^{2}}
\displaystyle\qquad+2\left(\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}}{\widehat{w}_{2,t}}\right)\left(\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}-(\mu^{*}_{1}-\mu^{*}_{2})\right)
\displaystyle\qquad+\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-(\mu^{*}_{1}-\mu^{*}_{2})\right)^{2}\Big{|}\mathcal{F}_{t-1}\Bigg{]}
\displaystyle=\sum_{a\in\{1,2\}}\mathbb{E}_{P^{*}}\left[\frac{\mathbbm{1}[A_{t}=a]\big{(}Y_{a,t}-\widehat{\mu}_{a,t}\big{)}^{2}}{\big{(}\widehat{w}_{a,t}\big{)}^{2}}\Big{|}\mathcal{F}_{t-1}\right]-\mathbb{E}_{P^{*}}\left[\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}\Big{|}\mathcal{F}_{t-1}\right].

Here, for a\in\{1,2\}, the following holds:

\displaystyle\mathbb{E}_{P^{*}}\Bigg{[}\frac{\mathbbm{1}[A_{t}=a]\big{(}Y_{a,t}-\widehat{\mu}_{a,t}\big{)}^{2}}{\big{(}\widehat{w}_{a,t}\big{)}^{2}}\Big{|}\mathcal{F}_{t-1}\Bigg{]}=\mathbb{E}_{P^{*}}\Bigg{[}\frac{\big{(}Y_{a,t}-\widehat{\mu}_{a,t}\big{)}^{2}}{\widehat{w}_{a,t}}\Big{|}\mathcal{F}_{t-1}\Bigg{]}=\frac{\mathbb{E}_{P^{*}}[(Y_{a,t})^{2}]-2\mu^{*}_{a}\widehat{\mu}_{a,t}+(\widehat{\mu}_{a,t})^{2}}{\widehat{w}_{a,t}}
\displaystyle=\frac{\mathbb{E}_{P^{*}}[(Y_{a,t})^{2}]-(\mu^{*}_{a})^{2}+(\mu^{*}_{a}-\widehat{\mu}_{a,t})^{2}}{\widehat{w}_{a,t}}=\frac{\sigma^{2}_{a}+(\mu^{*}_{a}-\widehat{\mu}_{a,t})^{2}}{\widehat{w}_{a,t}},

and

\mathbb{E}_{P^{*}}\Bigg{[}\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}}{\widehat{w}_{1,t}}\cdot\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}}{\widehat{w}_{2,t}}\Big{|}\mathcal{F}_{t-1}\Bigg{]}=0,

where we used \mathbb{E}_{P^{*}}[(Y_{a,t})^{2}]-(\mu^{*}_{a})^{2}=\sigma^{2}_{a} and the fact that \mathbbm{1}[A_{t}=1]\mathbbm{1}[A_{t}=2]=0. Therefore, the following holds:

\displaystyle\mathbb{E}_{P^{*}}\left[\frac{\big{(}Y_{1,t}-\widehat{\mu}_{1,t}\big{)}^{2}}{\widehat{w}_{1,t}}\Big{|}\mathcal{F}_{t-1}\right]+\mathbb{E}_{P^{*}}\left[\frac{\big{(}Y_{2,t}-\widehat{\mu}_{2,t}\big{)}^{2}}{\widehat{w}_{2,t}}\Big{|}\mathcal{F}_{t-1}\right]-\mathbb{E}_{P^{*}}\left[\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-(\mu^{*}_{1}-\mu^{*}_{2})\right)^{2}\Big{|}\mathcal{F}_{t-1}\right]
\displaystyle=\mathbb{E}_{P^{*}}\left[\frac{\sigma^{2}_{1}+(\mu^{*}_{1}-\widehat{\mu}_{1,t})^{2}}{\widehat{w}_{1,t}}\Big{|}\mathcal{F}_{t-1}\right]+\mathbb{E}_{P^{*}}\left[\frac{\sigma^{2}_{2}+(\mu^{*}_{2}-\widehat{\mu}_{2,t})^{2}}{\widehat{w}_{2,t}}\Big{|}\mathcal{F}_{t-1}\right]-\mathbb{E}_{P^{*}}\left[\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-(\mu^{*}_{1}-\mu^{*}_{2})\right)^{2}\Big{|}\mathcal{F}_{t-1}\right].

Because \widehat{\mu}_{a,t}\xrightarrow{\mathrm{a.s.}}\mu^{*}_{a} and \widehat{w}_{a,t}\xrightarrow{\mathrm{a.s.}}w^{*}_{a}, we have

\displaystyle\lim_{t\to\infty}\Bigg{|}\left(\frac{\sigma^{2}_{1}+(\mu^{*}_{1}-\widehat{\mu}_{1,t})^{2}}{\widehat{w}_{1,t}}\right)+\left(\frac{\sigma^{2}_{2}+(\mu^{*}_{2}-\widehat{\mu}_{2,t})^{2}}{\widehat{w}_{2,t}}\right)-\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}
\displaystyle\qquad-\left(\frac{\sigma^{2}_{1}}{w^{*}_{1}}+\frac{\sigma^{2}_{2}}{w^{*}_{2}}+\left(\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}-(\mu^{*}_{1}-\mu^{*}_{2})\right)^{2}\right)\Bigg{|}
\displaystyle\leq\lim_{t\to\infty}\left|\frac{\sigma^{2}_{1}}{\widehat{w}_{1,t}}-\frac{\sigma^{2}_{1}}{w^{*}_{1}}\right|+\lim_{t\to\infty}\left|\frac{\sigma^{2}_{2}}{\widehat{w}_{2,t}}-\frac{\sigma^{2}_{2}}{w^{*}_{2}}\right|+\lim_{t\to\infty}\frac{(\mu^{*}_{1}-\widehat{\mu}_{1,t})^{2}}{\widehat{w}_{1,t}}+\lim_{t\to\infty}\frac{(\mu^{*}_{2}-\widehat{\mu}_{2,t})^{2}}{\widehat{w}_{2,t}}
\displaystyle\qquad+\lim_{t\to\infty}\big{|}\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}-\left(\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}\big{|}=0,

with probability 1. Therefore, from Lebesgue's dominated convergence theorem, we obtain

\displaystyle V\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-V
\displaystyle=\mathbb{E}_{P^{*}}\left[\frac{\sigma^{2}_{1}+(\mu^{*}_{1}-\widehat{\mu}_{1,t})^{2}}{\widehat{w}_{1,t}}\Big{|}\mathcal{F}_{t-1}\right]+\mathbb{E}_{P^{*}}\left[\frac{\sigma^{2}_{2}+(\mu^{*}_{2}-\widehat{\mu}_{2,t})^{2}}{\widehat{w}_{2,t}}\Big{|}\mathcal{F}_{t-1}\right]
\displaystyle\qquad-\mathbb{E}_{P^{*}}\left[\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}\Big{|}\mathcal{F}_{t-1}\right]-\mathbb{E}_{P^{*}}\Bigg{[}\frac{\sigma^{2}_{1}}{w^{*}_{1}}+\frac{\sigma^{2}_{2}}{w^{*}_{2}}+\left(\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}\Big{|}\mathcal{F}_{t-1}\Bigg{]}
\displaystyle\xrightarrow{\mathrm{a.s.}}0.

This lemma immediately yields the following lemma.

Lemma 5.4.

For any P^{*}\in\mathcal{P}^{\mathrm{G}} and any \epsilon>0, there exists t(\epsilon)>0 such that for all T>t(\epsilon), we have

\displaystyle\frac{1}{T}\sum^{T}_{t=1}\Big{|}\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1\Big{|}<\epsilon

with probability one.

This result is a variant of the Cesàro lemma for a case with almost sure convergence. For completeness, we show the proof, which is based on the proof of Lemma 10 in Hadad et al. (2021).

Proof.

Let u_{t}\coloneqq\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1. Note that \frac{1}{T}\sum^{T}_{t=1}\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1=\frac{1}{T}\sum^{T}_{t=1}u_{t}.

From the proof of Lemma 5.3, we can see that u_{t} is a bounded random variable. Recall that

\displaystyle V\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]
\displaystyle=\mathbb{E}_{P^{*}}\left[\frac{\sigma^{2}_{1}+(\mu^{*}_{1}-\widehat{\mu}_{1,t})^{2}}{\widehat{w}_{1,t}}\Big{|}\mathcal{F}_{t-1}\right]+\mathbb{E}_{P^{*}}\left[\frac{\sigma^{2}_{2}+(\mu^{*}_{2}-\widehat{\mu}_{2,t})^{2}}{\widehat{w}_{2,t}}\Big{|}\mathcal{F}_{t-1}\right]-\mathbb{E}_{P^{*}}\left[\left(\big{(}\widehat{\mu}_{1,t}-\widehat{\mu}_{2,t}\big{)}-\big{(}\mu^{*}_{1}-\mu^{*}_{2}\big{)}\right)^{2}\Big{|}\mathcal{F}_{t-1}\right].

We assumed that (\mu^{*}_{1},\mu^{*}_{2},\widehat{\mu}_{1,t},\widehat{\mu}_{2,t},\widehat{w}_{1,t},\widehat{w}_{2,t}) are all bounded random variables. Let C be a constant independent of T such that |u_{t}|<C for all t\in\mathbb{N}.

Almost-sure convergence of u_{t} to zero as t\to\infty implies that for any \epsilon^{\prime}>0, there exists t(\epsilon^{\prime}) such that |u_{t}|<\epsilon^{\prime} for all t\geq t(\epsilon^{\prime}) with probability one. Let \mathcal{E}(\epsilon^{\prime}) denote the event on which this happens; that is, \mathcal{E}(\epsilon^{\prime})=\{|u_{t}|<\epsilon^{\prime}\quad\forall\ t\geq t(\epsilon^{\prime})\}. Under this event, for T>t(\epsilon^{\prime}), the following holds:

\displaystyle\frac{1}{T}\sum^{T}_{t=1}|u_{t}|\leq\frac{1}{T}\sum^{t(\epsilon^{\prime})}_{t=1}C+\frac{1}{T}\sum^{T}_{t=t(\epsilon^{\prime})+1}\epsilon^{\prime}\leq\frac{1}{T}t(\epsilon^{\prime})C+\epsilon^{\prime},

where \frac{1}{T}t(\epsilon^{\prime})C\to 0 as T\to\infty.

Therefore, for any \epsilon>0, there exists t(\epsilon) such that for all T>t(\epsilon), \frac{1}{T}\sum^{T}_{t=1}|u_{t}|<\epsilon holds with probability one. ∎

Step 4: Tail Bound with the Approximated Second Moment

Let v=\lambda. Then, we have

\displaystyle\mathbb{P}_{P^{*}}\left(\sum^{T}_{t=1}\Psi_{t}\leq Tv\right)\leq\mathbb{E}_{P^{*}}\left[\exp\left(-\frac{T\lambda^{2}}{2}+\frac{\lambda^{2}}{2}\sum^{T}_{t=1}\left\{\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1\right\}+To\left(\lambda^{2}\right)\right)\right].

From Lemma 5.4, for any \epsilon>0, there exists t(\epsilon)>0 such that for all T>t(\epsilon), we have

\displaystyle\mathbb{E}_{P^{*}}\left[\exp\left(-\frac{T\lambda^{2}}{2}+\frac{\lambda^{2}}{2}\sum^{T}_{t=1}\left\{\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1\right\}+To\left(\lambda^{2}\right)\right)\right]
\displaystyle=\exp\left(-\frac{T\lambda^{2}}{2}+To\left(\lambda^{2}\right)\right)\mathbb{E}_{P^{*}}\left[\exp\left(\frac{\lambda^{2}}{2}\sum^{T}_{t=1}\left\{\mathbb{E}_{P^{*}}\left[\Psi^{2}_{t}|\mathcal{F}_{t-1}\right]-1\right\}\right)\right]
\displaystyle\leq\exp\left(-\frac{T\lambda^{2}}{2}+To\left(\lambda^{2}\right)\right)\exp\left(\frac{\lambda^{2}}{2}T\epsilon\right)=\exp\left(-\frac{T\lambda^{2}}{2}+To\left(\lambda^{2}\right)+\frac{\lambda^{2}}{2}T\epsilon\right).

Step 5: Final Step of the Proof of Theorem 4.1

For any \epsilon>0, there exists t(\epsilon)>0 such that for all T>t(\epsilon), we obtain

\displaystyle-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{AIPW}}_{1,T}\leq\widehat{\mu}^{\mathrm{AIPW}}_{2,T}\right)\geq-\left\{-\frac{\lambda^{2}}{2}+o\left(\lambda^{2}\right)+\frac{\lambda^{2}}{2}\epsilon\right\}=\frac{\lambda^{2}}{2}-o\left(\lambda^{2}\right)-\frac{\lambda^{2}}{2}\epsilon,

as \lambda\to 0.

Let \lambda=-\frac{\Delta}{\sqrt{V}}. Then, we have

\displaystyle-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{AIPW}}_{1,T}\leq\widehat{\mu}^{\mathrm{AIPW}}_{2,T}\right)\geq\frac{\Delta^{2}}{2V}-o\left(\frac{\Delta^{2}}{V}\right)-\frac{\epsilon\Delta^{2}}{2V},

as \Delta\to 0. By letting \Delta\to 0 and T\to\infty, and then letting \epsilon\to 0 independently of T and \Delta, we have

\displaystyle-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{AIPW}}_{1,T}\leq\widehat{\mu}^{\mathrm{AIPW}}_{2,T}\right)\geq\frac{\Delta^{2}}{2V}-o\left(\Delta^{2}\right).

Thus, the proof is complete.

6 Discussion

In this section, we discuss related topics.

6.1 Neyman Allocation with Unknown Variances

For two-armed Gaussian bandits with known variances, Chen et al. (2000), Glynn & Juneja (2004), and Kaufmann et al. (2016) conclude that sampling each arm in proportion to its standard deviation is optimal, which corresponds to the Neyman allocation (Neyman, 1934).

The Neyman allocation with unknown variances has long been studied in various fields. van der Laan (2008) and Hahn et al. (2011) develop algorithms for estimating the gap parameter \Delta itself in an adaptive experiment with the Neyman allocation. They estimate the variances and show their algorithms' optimality under the framework of semiparametric efficiency, which closely connects to the Gaussian approximation of estimators using the central limit theorem. Although they show optimality under this framework, they do not investigate asymptotic optimality in the large-deviation framework. Tabord-Meehan (2022), Kato et al. (2020), and Zhao (2023) also attempt to address related problems.

Jourdan et al. (2023) examines BAI with unknown variances in a fixed-confidence setting. Beyond the difference in settings (we focus on fixed-budget BAI), the methods of deriving lower bounds differ between our approach and theirs. They determine the lower bound while incorporating the assumption that the variances are unknown. Moreover, under a large-gap regime (\Delta is fixed), they confirm a discrepancy between the lower bounds when variances are known versus unknown. Specifically, they consider alternative hypotheses related to both variances and means. In contrast, the lower bounds presented by Kaufmann et al. (2016) and ourselves are based on alternative hypotheses with fixed variances. While Jourdan et al. (2023) suggests that the upper bounds of strategies with unknown variances cannot align with the lower bound when variances are known, our findings indicate a match under the small-gap regime.

6.2 Necessity of the Small-Gap Regime

First, we discuss the necessity of the small-gap regime.

Estimation error of the variances.

The most critical reason we employ the small-gap regime is that the estimation error of the variances cannot be ignored in evaluating the probability of misidentification. To clarify this point, we review the probability of misidentification when we know the variances.

Probability of misidentification of Kaufmann et al. (2016)'s strategy with known variances.

Kaufmann et al. (2016) proposes drawing arm a in \frac{\sigma_{a}}{\sigma_{1}+\sigma_{2}}T=w^{*}_{a}T rounds (for simplicity, we treat w^{*}_{a}T as an integer). Without loss of generality, we consider drawing arm 1 in the first w^{*}_{1}T rounds and arm 2 in the following T-w^{*}_{1}T=w^{*}_{2}T rounds. Then, the best arm is estimated as \widehat{a}^{\mathrm{KCG}}_{T}\coloneqq\operatorname*{arg\,max}_{a\in\{1,2\}}\widehat{\mu}^{\mathrm{SA}}_{a,T}, where \widehat{\mu}^{\mathrm{SA}}_{a,T} is the sample average defined as

\displaystyle\widehat{\mu}^{\mathrm{SA}}_{a,T}\coloneqq\frac{1}{\sum^{T}_{t=1}\mathbbm{1}[A^{\mathrm{KCG}}_{t}=a]}\sum^{T}_{t=1}\mathbbm{1}[A^{\mathrm{KCG}}_{t}=a]Y_{t}, (5)

where A^{\mathrm{KCG}}_{t} denotes the arm drawn by Kaufmann et al. (2016)'s strategy. For this strategy, they show that the probability of misidentification satisfies

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{KCG}}_{T}\neq a^{\star}(P^{*})\right)\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}. (6)

Note that this upper bound comes from bounding

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{KCG}}_{T}\neq a^{\star}(P^{*})\right)=\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\leq 0\right), (7)

in the case where arm 1 is the best arm (a^{\star}(P^{*})=1). In contrast, in Theorem 4.1, we show that our strategy's bound is \liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{AIPW}}_{T}\neq a^{\star}(P^{*})\right)\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}-o(\Delta^{2}). The difference between our bound and theirs is the presence of the -o(\Delta^{2}) term, which vanishes as \Delta\to 0. This difference comes from the estimation error of the variances.
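For comparison with the NA-AIPW sketch above, the following short Python sketch (our own illustration) implements this known-variance allocation: arm 1 is drawn in the first w^{*}_{1}T rounds, arm 2 in the rest, and the sample averages (5) decide the recommendation.

import numpy as np

def kcg_known_variance(T, mu, sigma, seed=0):
    # Deterministic Neyman allocation with known standard deviations sigma.
    rng = np.random.default_rng(seed)
    w1 = sigma[0] / (sigma[0] + sigma[1])
    n1 = min(max(int(round(w1 * T)), 1), T - 1)   # draws of arm 1, treated as an integer
    y1 = rng.normal(mu[0], sigma[0], size=n1)
    y2 = rng.normal(mu[1], sigma[1], size=T - n1)
    # Recommend the arm with the larger sample average, as in (5).
    return 0 if y1.mean() >= y2.mean() else 1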

Intuitive explanation of the influence of the variance estimation.

To understand the influence of the variance estimation, we rewrite the sample average as

\displaystyle\widehat{\mu}^{\mathrm{SA}}_{a,T}=\frac{1}{T}\sum^{T}_{t=1}\frac{1}{\frac{1}{T}\sum^{T}_{s=1}\mathbbm{1}[A^{\mathrm{KCG}}_{s}=a]}\mathbbm{1}[A^{\mathrm{KCG}}_{t}=a]Y_{t}=\frac{1}{T}\sum^{T}_{t=1}\frac{1}{w^{*}_{a}}\mathbbm{1}[A^{\mathrm{KCG}}_{t}=a]Y_{t}, (8)

where we used \sum^{T}_{t=1}\mathbbm{1}[A^{\mathrm{KCG}}_{t}=a]=w^{*}_{a}T. Here, we consider a strategy that estimates w^{*}_{a} by estimating the variances. Let \widetilde{w}_{a} be some estimator of w^{*}_{a}. Then, we design a strategy that draws arm a in \widetilde{w}_{a}T rounds in some way. In that case, the sample average roughly becomes

\displaystyle\widetilde{\mu}^{\mathrm{SA}}_{a,T}=\frac{1}{T}\sum^{T}_{t=1}\frac{1}{\frac{1}{T}\sum^{T}_{s=1}\mathbbm{1}[\widetilde{A}_{s}=a]}\mathbbm{1}[\widetilde{A}_{t}=a]Y_{t}=\frac{1}{T}\sum^{T}_{t=1}\frac{1}{\widetilde{w}_{a}}\mathbbm{1}[\widetilde{A}_{t}=a]Y_{t}, (9)

where \widetilde{A}_{t} denotes the arm drawn by some strategy that draws arm a in \widetilde{w}_{a}T rounds. Then, if we recommend arm \widetilde{a}_{T}\coloneqq\operatorname*{arg\,max}_{a\in\{1,2\}}\widetilde{\mu}^{\mathrm{SA}}_{a,T} as the best arm, we evaluate

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widetilde{a}_{T}\neq a^{\star}(P^{*})\right)=\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\leq 0\right). (10)

From Markov’s inequality, for any λ>0\lambda>0, we have

logP(μ~1,TSAμ~2,TSA0)𝔼[exp(Tλ(μ~1,TSAμ~2,TSA))].\displaystyle\log\mathbb{P}_{P^{*}}\left(\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\leq 0\right)\leq\mathbb{E}\left[\exp\left(T\lambda\left(\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right)\right)\right]. (11)

To obtain the same upper bound as that in the Kaufmann et al. (2016)’s strategy, we consider the following decomposition:

𝔼[exp(Tλ(μ~1,TSAμ~2,TSA))]=𝔼[exp(Tλ(μ^1,TSAμ^2,TSA))exp(Tλ({μ~1,TSAμ~2,TSA}{μ^1,TSAμ^2,TSA}))].\displaystyle\mathbb{E}\left[\exp\left(T\lambda\left(\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right)\right)\right]=\mathbb{E}\left[\exp\left(T\lambda\left(\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right)\right)\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right]. (12)

Suppose that the following holds in some way:

\displaystyle\mathbb{E}\left[\exp\left(T\lambda\left(\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right)\right)\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right]
\displaystyle=\mathbb{E}\left[\exp\left(T\lambda\left(\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right)\right)\right]\mathbb{E}\left[\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right].

Note that this decomposition does not generally hold, but we assume it since it makes it easy to understand the variance estimation problem. Under the assumption, it holds that

\displaystyle-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widetilde{a}_{T}\neq a^{\star}(P^{*})\right) (13)
\displaystyle\geq-\frac{1}{T}\log\mathbb{E}\left[\exp\left(T\lambda\left(\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right)\right)\right]-\frac{1}{T}\log\mathbb{E}\left[\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right] (14)
\displaystyle\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}-\frac{1}{T}\log\mathbb{E}\left[\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right]. (15)

This inequality implies that to obtain the same upper bound as that of Kaufmann et al. (2016)’s strategy, we need to bound

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{E}\left[\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right] (16)

with an arbitrary rate of convergence; more precisely, we need to show that for any \varepsilon>0,

\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{E}\left[\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right]\geq-\varepsilon

holds. However, it is impossible to achieve that convergence rate with commonly known theorems about convergence. Therefore, we introduced the small-gap regime, which evaluates the term as

\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{E}\left[\exp\left(T\lambda\left(\left\{\widetilde{\mu}^{\mathrm{SA}}_{1,T}-\widetilde{\mu}^{\mathrm{SA}}_{2,T}\right\}-\left\{\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}\right\}\right)\right)\right]=-o(\Delta^{2}).

Note that this argument is not rigorous and is simplified for explanation.
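
Combining (15) with this small-gap evaluation, the heuristic argument yields a bound whose leading term matches the lower bound:

\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widetilde{a}_{T}\neq a^{\star}(P^{*})\right)\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}-o\left(\Delta^{2}\right)\quad\text{as}\ \Delta\to 0.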

6.3 The AIPW, IPW, and Sample Average Estimators

A key component of our analysis is the AIPW estimator, which is built from a martingale difference sequence (MDS) and attains the minimum asymptotic variance. By using the properties of an MDS, we handle the dependence among observations. The upper bound argument can also be applied to the Inverse Probability Weighting (IPW) estimator, but in this case, the upper bound may not coincide with the lower bound. This discrepancy occurs because the AIPW estimator's asymptotic variance is smaller than the IPW estimator's. The minimum variance property of the AIPW estimator stems from the efficient influence function (Hahn, 1998; Tsiatis, 2007).

We conjecture that the asymptotic optimality of strategies employing the naive sample average estimator in the recommendation rule can also be demonstrated, although we do not prove it in this study. This is because Hahn et al. (2011) shows, using the CLT, that the AIPW and sample average estimators have the same asymptotic distribution. However, because we cannot utilize MDS properties and the samples are dependent, the analysis becomes challenging when we derive a corresponding large deviation result (the exponential rate of the probability of misidentification).

For the reader’s reference, we detail the problems related to the IPW estimator and the sample average estimator.

The NA-IPW strategy.

We consider the strategy obtained from the NA-AIPW strategy by replacing the AIPW estimator with the following IPW estimator of the means:

\displaystyle\widehat{\mu}^{\mathrm{IPW}}_{a,T}\coloneqq\frac{1}{T}\sum^{T}_{t=1}\psi^{\mathrm{IPW}}_{a,t},\qquad\mathrm{where}\ \psi^{\mathrm{IPW}}_{a,t}\coloneqq\frac{\mathbbm{1}[A_{t}=a]Y_{a,t}}{\widehat{w}_{a,t}}. \quad (17)

At the end of the experiment (after round t=T), we recommend \widehat{a}^{\mathrm{IPW}}_{T} as

\displaystyle\widehat{a}^{\mathrm{IPW}}_{T}\coloneqq\begin{cases}1\quad\mathrm{if}\quad\widehat{\mu}^{\mathrm{IPW}}_{1,T}\geq\widehat{\mu}^{\mathrm{IPW}}_{2,T},\\ 2\quad\mathrm{otherwise}.\end{cases} \quad (18)
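
For concreteness, the following is a minimal sketch, not the paper's implementation, of how the estimator in (17) and the recommendation rule in (18) can be computed from logged data; the array layout and function names are illustrative assumptions.

```python
import numpy as np

def ipw_estimates(arms, rewards, w_hat):
    """IPW mean estimates as in (17), computed from the logged data of one run.

    arms    : (T,) array with entries in {1, 2}, the drawn arms A_t
    rewards : (T,) array of observed rewards Y_{A_t, t}
    w_hat   : (T, 2) array; w_hat[t, a-1] is the sampling probability of arm a
              used at round t (assumed bounded away from zero)
    """
    T = len(arms)
    mu_ipw = np.zeros(2)
    for a in (1, 2):
        ind = (arms == a).astype(float)
        mu_ipw[a - 1] = np.sum(ind * rewards / w_hat[:, a - 1]) / T
    return mu_ipw

def recommend_ipw(mu_ipw):
    """Recommendation rule (18): arm 1 if its estimate is at least that of arm 2."""
    return 1 if mu_ipw[0] >= mu_ipw[1] else 2
```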

We refer to this strategy as the NA-IPW strategy. Its probability of misidentification is bounded as follows.

Theorem 6.1 (Upper bound of the NA-IPW strategy).

For any P^{*}\in\mathcal{P}^{\mathrm{G}}, the following holds as \Delta\to 0:

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{IPW}}_{T}\neq a^{\star}(P^{*})\right)\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})\left(\frac{\zeta^{*}_{1}}{\sigma_{1}}+\frac{\zeta^{*}_{2}}{\sigma_{2}}\right)}-o\left(\Delta^{2}\right),

where \zeta^{*}_{a}\coloneqq\mathbb{E}_{P^{*}}\big[Y^{2}_{a,t}\big].

Proof.

Let us define V^{\mathrm{IPW}}\coloneqq\frac{\zeta^{*}_{1}}{w^{*}_{1}}+\frac{\zeta^{*}_{2}}{w^{*}_{2}}-\Delta^{2} and \Psi^{\mathrm{IPW}}_{t}\coloneqq\big\{\psi^{\mathrm{IPW}}_{1,t}-\psi^{\mathrm{IPW}}_{2,t}-\Delta\big\}/V^{\mathrm{IPW}}. Then, we have

\displaystyle\mathbb{E}_{P^{*}}\left[\left(\Psi^{\mathrm{IPW}}_{t}\right)^{2}\,\Big{|}\,\mathcal{F}_{t-1}\right]=\mathbb{E}_{P^{*}}\left[\left(\psi^{\mathrm{IPW}}_{1,t}-\psi^{\mathrm{IPW}}_{2,t}-\Delta\right)^{2}\,\Big{|}\,\mathcal{F}_{t-1}\right]
\displaystyle=\mathbb{E}_{P^{*}}\Bigg{[}\Bigg{(}\frac{\mathbbm{1}[A_{t}=1]Y_{1,t}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]Y_{2,t}}{\widehat{w}_{2,t}}-\Delta\Bigg{)}^{2}\,\Big{|}\,\mathcal{F}_{t-1}\Bigg{]}
\displaystyle=\mathbb{E}_{P^{*}}\Bigg{[}\Bigg{(}\frac{\mathbbm{1}[A_{t}=1]Y_{1,t}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]Y_{2,t}}{\widehat{w}_{2,t}}\Bigg{)}^{2}\,\Big{|}\,\mathcal{F}_{t-1}\Bigg{]}-\Delta^{2}
\displaystyle=\mathbb{E}_{P^{*}}\Bigg{[}\frac{Y^{2}_{1,t}}{\widehat{w}_{1,t}}+\frac{Y^{2}_{2,t}}{\widehat{w}_{2,t}}\,\Big{|}\,\mathcal{F}_{t-1}\Bigg{]}-\Delta^{2}\to V^{\mathrm{IPW}}
where the last equality uses \mathbbm{1}[A_{t}=1]\mathbbm{1}[A_{t}=2]=0 and \mathbb{E}_{P^{*}}[\mathbbm{1}[A_{t}=a]\,|\,\mathcal{F}_{t-1}]=\widehat{w}_{a,t},

as T\to\infty. By replacing V and \Psi in the proof of Theorem 4.1 with V^{\mathrm{IPW}} and \Psi^{\mathrm{IPW}}_{t}, we obtain

\displaystyle\liminf_{T\to\infty}-\frac{1}{T}\log\mathbb{P}_{P^{*}}\left(\widehat{a}^{\mathrm{IPW}}_{T}\neq a^{\star}(P^{*})\right)\geq\frac{\Delta^{2}}{2\left(\frac{\zeta^{*}_{1}}{w^{*}_{1}}+\frac{\zeta^{*}_{2}}{w^{*}_{2}}-\Delta^{2}\right)}-o\left(\Delta^{2}\right),

where the RHS equals \frac{\Delta^{2}}{2\left(\frac{\zeta^{*}_{1}}{w^{*}_{1}}+\frac{\zeta^{*}_{2}}{w^{*}_{2}}\right)}-o\left(\Delta^{2}\right) as \Delta\to 0. Since w^{*}_{a}=\sigma_{a}/(\sigma_{1}+\sigma_{2}), we have \frac{\zeta^{*}_{1}}{w^{*}_{1}}+\frac{\zeta^{*}_{2}}{w^{*}_{2}}=(\sigma_{1}+\sigma_{2})\left(\frac{\zeta^{*}_{1}}{\sigma_{1}}+\frac{\zeta^{*}_{2}}{\sigma_{2}}\right), which yields the stated bound. The proof is complete. ∎

Note that 2(\sigma_{1}+\sigma_{2})\left(\frac{\zeta^{*}_{1}}{\sigma_{1}}+\frac{\zeta^{*}_{2}}{\sigma_{2}}\right)\geq 2(\sigma_{1}+\sigma_{2})^{2} since \frac{\zeta^{*}_{1}}{\sigma_{1}}+\frac{\zeta^{*}_{2}}{\sigma_{2}}\geq\frac{\zeta^{*}_{1}-(\mu^{*}_{1})^{2}}{\sigma_{1}}+\frac{\zeta^{*}_{2}-(\mu^{*}_{2})^{2}}{\sigma_{2}}=\sigma_{1}+\sigma_{2}. Therefore, the upper bound of the probability of misidentification of the NA-IPW strategy is larger than that of the NA-AIPW strategy. (Note that the direction of the comparison is flipped by the -\frac{1}{T}\log\mathbb{P}_{P^{*}} transformation; that is, \frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}\geq\frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})\left(\frac{\zeta^{*}_{1}}{\sigma_{1}}+\frac{\zeta^{*}_{2}}{\sigma_{2}}\right)} implies that the upper bound of the probability of misidentification of the NA-AIPW strategy is smaller than that of the NA-IPW strategy.) Similar results based on CLT evaluations are known in existing studies, such as Hirano et al. (2003) and Kato et al. (2020).

The NA-SA strategy.

Next, we consider the strategy obtained from the NA-AIPW strategy by replacing the AIPW estimator with the following sample average estimator:

\displaystyle\widehat{\mu}^{\mathrm{SA}}_{a,T}\coloneqq\frac{1}{\sum^{T}_{t=1}\mathbbm{1}[A_{t}=a]}\sum^{T}_{t=1}\mathbbm{1}[A_{t}=a]Y_{t}=\frac{1}{T}\sum^{T}_{t=1}\psi^{\mathrm{SA}}_{a,t},\qquad\mathrm{where}\ \psi^{\mathrm{SA}}_{a,t}\coloneqq\frac{\mathbbm{1}[A_{t}=a]Y_{t}}{\frac{1}{T}\sum^{T}_{s=1}\mathbbm{1}[A_{s}=a]}. \quad (19)

At the end of the experiment (after round t=T), we recommend \widehat{a}^{\mathrm{SA}}_{T} as

\displaystyle\widehat{a}^{\mathrm{SA}}_{T}\coloneqq\begin{cases}1\quad\mathrm{if}\quad\widehat{\mu}^{\mathrm{SA}}_{1,T}\geq\widehat{\mu}^{\mathrm{SA}}_{2,T},\\ 2\quad\mathrm{otherwise}.\end{cases} \quad (20)
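
For comparison with the IPW sketch above, here is a minimal sketch of the sample average estimator in (19) and the recommendation rule in (20); as before, the array layout is an illustrative assumption. The key difference from the IPW estimator is that the rewards of each arm are normalized by the realized number of draws rather than by the sampling probabilities.

```python
import numpy as np

def sa_estimates(arms, rewards):
    """Sample average estimates as in (19).

    arms    : (T,) array with entries in {1, 2}
    rewards : (T,) array of observed rewards Y_{A_t, t}
    """
    mu_sa = np.zeros(2)
    for a in (1, 2):
        drawn = arms == a
        # Average the rewards of arm a over the (random) number of its draws.
        mu_sa[a - 1] = rewards[drawn].mean() if drawn.any() else 0.0
    return mu_sa

def recommend_sa(mu_sa):
    """Recommendation rule (20)."""
    return 1 if mu_sa[0] >= mu_sa[1] else 2
```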

We refer to this strategy as the NA-SA strategy. Evaluating its probability of misidentification is not easy because we cannot employ the martingale property used in the analyses of the NA-AIPW and NA-IPW strategies. To derive its upper bound, we need to evaluate \widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}. Here, note that

\displaystyle\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}=\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}-\frac{1}{T}\sum^{T}_{t=1}\left\{\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\mu^{*}_{1}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\mu^{*}_{2}\big{)}}{\widehat{w}_{2,t}}-\Delta\right\} \quad (21)
\displaystyle\qquad\qquad\qquad+\frac{1}{T}\sum^{T}_{t=1}\left\{\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\mu^{*}_{1}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\mu^{*}_{2}\big{)}}{\widehat{w}_{2,t}}-\Delta\right\} \quad (22)

holds, where the variance of \Psi^{*}_{t}\coloneqq\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\mu^{*}_{1}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\mu^{*}_{2}\big{)}}{\widehat{w}_{2,t}}-\Delta is V in the proof of Theorem 4.1, and \{\Psi^{*}_{t}\}^{T}_{t=1} constitutes an MDS. Therefore, if the first term \widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}-\frac{1}{T}\sum^{T}_{t=1}\Psi^{*}_{t} were zero, we could directly apply the proof of Theorem 4.1 and obtain the same upper bound as in Theorem 4.1. However, this term is not zero and remains as a bias term, and its evaluation is known to require several techniques. For example, to show \sqrt{T}\left(\widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}-\Delta\right)\xrightarrow{\mathrm{d}}\mathcal{N}(0,V), Hahn et al. (2011) bounds this bias term using stochastic equicontinuity, following the arguments in Hirano et al. (2003). This problem is related to the use of the Donsker condition in semiparametric analysis, as explained in Kennedy (2016). We might be able to derive an upper bound for the NA-SA strategy by an approach similar to that of Hahn et al. (2011), but there are two issues. First, it is unknown what condition plays the role of stochastic equicontinuity in the BAI setting, where the samples are dependent. Second, it is unclear whether stochastic equicontinuity or similar properties can be applied directly to a large deviation upper bound, since such conditions have been used for central limit evaluations. Therefore, although the findings of Hirano et al. (2003) and Hahn et al. (2011) may help resolve this issue, how to use them remains open. Note that this issue is caused by the bias of \widehat{\mu}^{\mathrm{SA}}_{1,T}-\widehat{\mu}^{\mathrm{SA}}_{2,T}, which is non-zero; in contrast, the bias of \widehat{\mu}^{\mathrm{AIPW}}_{1,T}-\widehat{\mu}^{\mathrm{AIPW}}_{2,T} is zero owing to the MDS property.

6.4 The Tracking Strategy

In fixed-confidence BAI, the tracking strategy is popular, as used in Garivier & Kaufmann (2016). Such a strategy has also been used in existing studies of the Neyman allocation. For example, Hahn et al. (2011) splits the whole sample into two stages. In the first stage, we draw each arm uniformly at random and estimate w^{*}_{a}. In the second stage, given the estimators \widehat{w}_{a} of w^{*}_{a}, we draw each arm so that \frac{1}{T}\sum^{T}_{t=1}\mathbbm{1}[A_{t}=a]=\widehat{w}_{a} holds. Then, Hahn et al. (2011) estimates \Delta=\mu^{*}_{1}-\mu^{*}_{2} using the sample average estimator. This strategy is quite similar to that in Garivier & Kaufmann (2016) in that it draws arms to track the ratio w^{*}_{a}.
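
A minimal sketch of such a two-stage tracking allocation is given below; the stage split, the variance estimator, and the tracking rule are illustrative assumptions of this sketch, not the exact procedures of Hahn et al. (2011) or Garivier & Kaufmann (2016). Here, sample_reward could be, for example, lambda a, rng: rng.normal(mu[a-1], sigma[a-1]).

```python
import numpy as np

def two_stage_tracking(T, sample_reward, first_stage_frac=0.2, rng=None):
    """A two-stage tracking allocation in the spirit described above.

    sample_reward(a, rng) returns one reward draw from arm a in {1, 2}.
    Returns the drawn arms, the observed rewards, and the tracked targets w_hat.
    """
    rng = np.random.default_rng() if rng is None else rng
    arms, rewards = [], []
    T1 = max(4, int(first_stage_frac * T))

    # Stage 1: draw the two arms equally often (a deterministic stand-in for
    # uniform random draws) and estimate the standard deviations.
    for t in range(T1):
        a = 1 + (t % 2)
        arms.append(a)
        rewards.append(sample_reward(a, rng))
    rew, arm_arr = np.array(rewards), np.array(arms)
    sd = np.array([rew[arm_arr == a].std(ddof=1) for a in (1, 2)])
    sd = np.clip(sd, 1e-8, None)
    w_hat = sd / sd.sum()  # estimated target allocation

    # Stage 2: at each round, draw the arm whose empirical share lags its target
    # the most, so that the realized allocation tracks w_hat.
    counts = np.array([np.sum(arm_arr == a) for a in (1, 2)], dtype=float)
    for t in range(T1, T):
        a = 1 + int(np.argmax(w_hat - counts / t))
        arms.append(a)
        rewards.append(sample_reward(a, rng))
        counts[a - 1] += 1
    return np.array(arms), np.array(rewards), w_hat
```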

However, the strategy of Hahn et al. (2011) makes the analysis of upper bounds difficult. As explained in Section 6.3, the unbiasedness of the AIPW estimator plays an important role in our analysis. In contrast, if we use the tracking strategy, we cannot employ the property \mathbb{E}_{P^{*}}\left[\frac{\mathbbm{1}[A_{t}=a]}{\widehat{w}_{a,t}}\Big{|}\mathcal{F}_{t-1}\right]=1. Note that the NA-AIPW strategy draws arm A_{t}=a with probability \widehat{w}_{a,t}, whereas the tracking strategy draws arms in a more complicated manner, under which we cannot use the martingale property.

As explained in Section 6.3, this bias term makes the analysis significantly more difficult. According to existing studies, the analysis requires techniques such as the Donsker condition (Hirano et al., 2003; Hahn et al., 2011). To avoid this issue, existing studies have proposed using the AIPW estimator, as shown in van der Laan (2008) and Kato et al. (2020).

Thus, although we acknowledge the possibility of using the tracking strategy, its analysis requires sophisticated techniques. We expect that existing studies such as Hirano et al. (2003) and Hahn et al. (2011) will help with the analysis, but it remains an open issue. Note that even under the tracking strategy, the analyses in existing studies such as Hirano et al. (2003) and Hahn et al. (2011) go through the evaluation of AIPW-type estimators. This proof procedure is also related to the semiparametric efficiency bound (Hahn, 1998), under which the semiparametric efficient score is given as \Psi^{*}_{t}=\frac{\mathbbm{1}[A_{t}=1]\big{(}Y_{1,t}-\mu^{*}_{1}\big{)}}{\widehat{w}_{1,t}}-\frac{\mathbbm{1}[A_{t}=2]\big{(}Y_{2,t}-\mu^{*}_{2}\big{)}}{\widehat{w}_{2,t}}-\Delta.
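
For reference, the leading term (in the small-gap regime) of the asymptotic variance of an estimator built from this efficient score under an allocation (w_{1},w_{2}) with w_{1}+w_{2}=1 is \frac{\sigma^{2}_{1}}{w_{1}}+\frac{\sigma^{2}_{2}}{w_{2}}, and minimizing it over the allocation recovers the Neyman allocation and the factor appearing in our bound:

\min_{w_{1}+w_{2}=1,\ w_{1},w_{2}>0}\left(\frac{\sigma^{2}_{1}}{w_{1}}+\frac{\sigma^{2}_{2}}{w_{2}}\right)=\left(\frac{\sigma^{2}_{1}}{w_{1}}+\frac{\sigma^{2}_{2}}{w_{2}}\right)\Bigg{|}_{w_{a}=\frac{\sigma_{a}}{\sigma_{1}+\sigma_{2}}}=(\sigma_{1}+\sigma_{2})^{2},

which is the quantity in the denominator of the bound \frac{\Delta^{2}}{2(\sigma_{1}+\sigma_{2})^{2}}.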

7 Related Work

This section reviews related work.

7.1 On the Asymptotic Optimality in Fixed-Budget BAI

There has been a long debate on optimal strategies for fixed-budget BAI. Glynn & Juneja (2004) develops strategies using the large deviation principles. However, while they justify their strategies via large deviation arguments, they do not provide lower bounds, so it remains unclear whether their strategies are truly asymptotically optimal.

Kaufmann et al. (2016) establishes distribution-dependent lower bounds for BAI with fixed confidence and budget, utilizing change-of-measure arguments. According to their results, we can confirm that for two-armed Gaussian bandits, the strategy of Glynn & Juneja (2004) is optimal.

However, Kaufmann et al. (2016) leaves lower bounds for multi-armed fixed-budget BAI as an open issue. Based on the arguments of Glynn & Juneja (2004) and Russo (2020), Kasy & Sautmann (2021) attempts to derive an asymptotically optimal strategy, but their attempt does not succeed. As pointed out by Ariu et al. (2021), without additional assumptions, there exists an instance P^{*} whose lower bound is larger than that of Kaufmann et al. (2016). This result is based on another lower bound discovered by Carpentier & Locatelli (2016). These arguments are summarized by Qin (2022).

To address this issue, Kato et al. (2023b) and Degenne (2023) consider a restriction such that sampling rules do not depend on P^{*}. Under this restriction, we can show the asymptotic optimality of the strategy provided by Glynn & Juneja (2004), which requires full knowledge of P^{*} and is practically infeasible.

Komiyama et al. (2022) and Atsidakou et al. (2023) discuss asymptotically optimal strategies from minimax and Bayesian perspectives, respectively, where the lower and upper bounds match only in their leading factors, ignoring some constant terms, unlike our notion of optimality, in which the bounds match including constant terms. This open issue is further explored by Komiyama et al. (2022), Wang et al. (2023a), Wang et al. (2023b), and Kato (2023).

Note that in the fixed confidence BAI setting, Garivier & Kaufmann (2016) proposes a strategy with an upper bound matching the derived lower bound. However, in the fixed-budget BAI, it remains unclear whether a strategy with an upper bound matching Kaufmann et al. (2016)’s lower bound exists.

Alternative lower bounds have been proposed by Audibert et al. (2010), Bubeck et al. (2011), Komiyama et al. (2023) and Kato et al. (2023a) for the expected simple regret minimization, which is another performance measure different from the probability of misidentification.

Some research employs local asymptotics to examine the asymptotic optimality of the Neyman allocation rule in this context, such as Armstrong (2022) and Adusumilli (2022).

Ordinal optimization in the operations research community is another related field (Ahn et al., 2021; Chen et al., 2000).

7.2 Extension to BAI in Multi-Armed Bandit (MAB) Problems

In contrast to two-armed bandit problems and BAI with fixed confidence, tight lower bounds for MAB problems remain unknown. One primary reason is the reversal of the KL divergence. Kato et al. (2023b), Degenne (2023), and Kato (2023) consider strategies whose sampling rules are (asymptotically) invariant to the instance P^{*}\in\mathcal{P}^{\mathrm{G}}. Such a class of strategies is sometimes called static in the sense that, to avoid dependence on P^{*}, the sampling rule does not use parameters estimated during the adaptive experiment. However, in Gaussian bandit models, sampling rules that are invariant to P^{*} are not necessarily non-adaptive (static), because we can still adaptively estimate the variances during the experiment (the variances are assumed to be the same for all P^{*}\in\mathcal{P}^{\mathrm{G}}).

8 Simulation Studies

This section provides simulation studies to investigate the empirical performance of the NA-AIPW strategy. For comparison, we also investigate the performance of the NA-IPW and NA-SA strategies defined in Section 6.3. Furthermore, we also conduct simulation studies of the “oracle” strategy with known variances, denoted by Oracle, and the uniform strategy that draws each arm an equal number of times, denoted by Uniform. In the Oracle and Uniform strategies, we recommend the arm with the highest sample average. (The Oracle strategy is the one proposed by Glynn & Juneja (2004) and Kaufmann et al. (2016). The Uniform strategy with this recommendation rule is referred to as the Uniform-Empirical Best Arm (EBA) strategy by Bubeck et al. (2011).)

Throughout the experiments, we set arm 1 as the best arm. We conduct experiments under three settings.

In the first experiment, we set \mu^{*}_{1}=1.00 and choose \mu^{*}_{2} from the set \{0.80,0.85,0.90,0.95,0.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (1,v_{2}) or (v_{2},1), where v_{2} is chosen from \{5,10,20,50\}. We run the strategies until T=10{,}000 and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}. We conduct 1{,}000 independent trials for each choice of parameters and plot the results in Figures 1 and 2.

In the second experiment, the standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (5,v_{2}) or (v_{2},5), where v_{2} is chosen from \{5,10,20,50\}. The other settings are the same as in the first experiment. The results are shown in Figures 3 and 4.

In the third experiment, we set \mu^{*}_{1}=10.00 and choose \mu^{*}_{2} from the set \{9.80,9.85,9.90,9.95,9.99\}. The other settings are the same as in the first experiment. The results are shown in Figures 5 and 6.
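
A minimal sketch of one simulation trial under the first setting is shown below; the sampling rule (draw arms with probabilities proportional to the estimated standard deviations) and the sample-average recommendation are simplified stand-ins for the NA-AIPW strategy and are assumptions of this sketch, not the exact procedure. The budget and the number of trials are also smaller than in the experiments above, to keep the sketch fast.

```python
import numpy as np

def run_trial(T, mu, sigma, rng):
    """One trial of a simplified Neyman-allocation strategy.

    mu, sigma : length-2 arrays of the true means and standard deviations.
    Returns True if the recommended arm equals the arm with the largest mean.
    """
    rewards = [[], []]
    # Draw each arm once so that the standard deviations can be estimated.
    for a in (0, 1):
        rewards[a].append(rng.normal(mu[a], sigma[a]))
    for t in range(2, T):
        sd = np.array([np.std(r) if len(r) > 1 else 1.0 for r in rewards])
        sd = np.clip(sd, 1e-6, None)   # keep the sampling ratio well defined
        p1 = sd[0] / sd.sum()          # estimated Neyman ratio for arm 1
        a = 0 if rng.random() < p1 else 1
        rewards[a].append(rng.normal(mu[a], sigma[a]))
    recommended = int(np.argmax([np.mean(r) for r in rewards]))
    return recommended == int(np.argmax(mu))

# Example: mu_2 = 0.95 and (sigma_1, sigma_2) = (1, 10) from the first setting.
rng = np.random.default_rng(0)
hits = [run_trial(1000, np.array([1.0, 0.95]), np.array([1.0, 10.0]), rng)
        for _ in range(100)]
print("empirical probability of misidentification:", 1.0 - np.mean(hits))
```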

Our theoretical results imply that the probabilities of misidentification of the NA-AIPW and Oracle strategies approach each other as \Delta\to 0, and we can confirm this phenomenon in the results. The Oracle strategy performs slightly better than the NA-AIPW strategy when \Delta is large, but the gap shrinks as \Delta\to 0. Note that when \Delta is large, the probability of misidentification converges very quickly for both strategies, so the gap remains small even for large \Delta.

We can also find that the performance improvement of the NA-AIPW strategy over the Uniform strategy becomes larger as the difference between the standard deviations grows. For example, in the second experiment, when (\sigma_{1},\sigma_{2})=(5,5), there is no improvement from using the NA-AIPW strategy over the Uniform strategy, because the Neyman allocation also draws each arm with an equal ratio.

The difference between the NA-AIPW and NA-IPW strategies becomes larger as the mean rewards of the arms become larger. In the third setting, the NA-IPW strategy performs poorly because \mu^{*}_{1}=10.00 and \mu^{*}_{2} is chosen from \{9.80,9.85,9.90,9.95,9.99\}, whereas \mu^{*}_{1}=1.00 and \mu^{*}_{2} is chosen from \{0.80,0.85,0.90,0.95,0.99\} in the first and second settings.
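
An illustrative calculation, using \zeta^{*}_{a}=\sigma^{2}_{a}+(\mu^{*}_{a})^{2} and taking \sigma_{1}=\sigma_{2}=1 purely for simplicity, explains this behavior:

\frac{\zeta^{*}_{1}}{\sigma_{1}}+\frac{\zeta^{*}_{2}}{\sigma_{2}}\approx\frac{1+10^{2}}{1}+\frac{1+10^{2}}{1}=202\gg\sigma_{1}+\sigma_{2}=2\quad\text{when}\ \mu^{*}_{1}\approx\mu^{*}_{2}\approx 10,

so the exponent in the NA-IPW upper bound is roughly 100 times smaller than that of the NA-AIPW strategy, whereas it is only about half as large when \mu^{*}_{1}\approx\mu^{*}_{2}\approx 1.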

Figure 1: Results under the first setting. We set \mu^{*}_{1}=1.00 and choose \mu^{*}_{2} from the set \{0.80,0.85,0.90,0.95,0.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (1,v_{2}) or (v_{2},1), where v_{2} is chosen from \{5,10\}. We conduct 1{,}000 independent trials and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}.
Figure 2: Results under the first setting. We set \mu^{*}_{1}=1.00 and choose \mu^{*}_{2} from the set \{0.80,0.85,0.90,0.95,0.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (1,v_{2}) or (v_{2},1), where v_{2} is chosen from \{20,50\}. We conduct 1{,}000 independent trials and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}.

9 Conclusion

This study investigated fixed-budget BAI for two-armed Gaussian bandits with unknown variances. We first reviewed the lower bound shown by Kaufmann et al. (2016). Then, we proposed the NA-AIPW strategy and found that its probability of misidentification matches the lower bound when the budget approaches infinity and the gap between the expected rewards of the two arms approaches zero. We referred to this setting as the small-gap regime and the optimality as the local asymptotic optimality. Although there are several remaining open questions, our result provides insight into long-standing open problems in BAI.

References

  • Adusumilli (2022) Karun Adusumilli. Neyman allocation is minimax optimal for best arm identification with two arms, 2022. arXiv:2204.05527.
  • Ahn et al. (2021) Dohyun Ahn, Dongwook Shin, and Assaf Zeevi. Online ordinal optimization under model misspecification, 2021. URL https://api.semanticscholar.org/CorpusID:235389954. SSRN.
  • Akritas & Kourouklis (1988) Michael G. Akritas and Stavros Kourouklis. Local bahadur efficiency of score tests. Journal of Statistical Planning and Inference, 19(2):187–199, 1988.
  • Apostol (1974) Tom M Apostol. Mathematical analysis; 2nd ed. Addison-Wesley series in mathematics. Addison-Wesley, 1974.
  • Ariu et al. (2021) Kaito Ariu, Masahiro Kato, Junpei Komiyama, Kenichiro McAlinn, and Chao Qin. Policy choice and best arm identification: Asymptotic analysis of exploration sampling, 2021. arXiv:2109.08229.
  • Armstrong (2022) Timothy B. Armstrong. Asymptotic efficiency bounds for a class of experimental designs, 2022. arXiv:2205.02726.
  • Atsidakou et al. (2023) Alexia Atsidakou, Sumeet Katariya, Sujay Sanghavi, and Branislav Kveton. Bayesian fixed-budget best-arm identification, 2023. arXiv:2211.08572.
  • Audibert et al. (2010) Jean-Yves Audibert, Sébastien Bubeck, and Remi Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory, pp.  41–53, 2010.
  • Bahadur (1960) R. R. Bahadur. Stochastic Comparison of Tests. The Annals of Mathematical Statistics, 31(2):276 – 295, 1960.
  • Bang & Robins (2005) Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Bubeck et al. (2009) Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory, pp.  23–37. Springer Berlin Heidelberg, 2009.
  • Bubeck et al. (2011) Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 2011.
  • Bulmer (1967) Michael George Bulmer. Principles of statistics. M.I.T. Press, 2. ed. edition, 1967.
  • Carpentier & Locatelli (2016) Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In COLT, 2016.
  • Chen et al. (2000) Chun-Hung Chen, Jianwu Lin, Enver Yücesan, and Stephen E. Chick. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10(3):251–270, 2000.
  • Degenne (2023) Rémy Degenne. On the existence of a complexity in fixed budget bandit identification. In Conference on Learning Theory, volume 195, pp. 1131–1154. PMLR, 2023.
  • Garivier & Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, 2016.
  • Glynn & Juneja (2004) Peter Glynn and Sandeep Juneja. A large deviations perspective on ordinal optimization. In Proceedings of the 2004 Winter Simulation Conference, volume 1. IEEE, 2004.
  • Hadad et al. (2021) Vitor Hadad, David A. Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences, 118(15), 2021.
  • Hahn (1998) Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331, 1998.
  • Hahn et al. (2011) Jinyong Hahn, Keisuke Hirano, and Dean Karlan. Adaptive experimental design using the propensity score. Journal of Business and Economic Statistics, 2011.
  • He & Shao (1996) Xuming He and Qi-man Shao. Bahadur efficiency and robustness of studentized score tests. Annals of the Institute of Statistical Mathematics, 48(2):295–314, 1996.
  • Hirano et al. (2003) Keisuke Hirano, Guido Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 2003.
  • Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ ucb : An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, 2014.
  • Jourdan et al. (2023) Marc Jourdan, Degenne Rémy, and Kaufmann Emilie. Dealing with unknown variances in best-arm identification. In Proceedings of The 34th International Conference on Algorithmic Learning Theory, volume 201, pp.  776–849, 2023.
  • Kasy & Sautmann (2021) Maximilian Kasy and Anja Sautmann. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.
  • Kato (2023) Masahiro Kato. Worst-case optimal multi-armed gaussian best arm identification with a fixed budget, 2023. arXiv:2310.19788.
  • Kato et al. (2020) Masahiro Kato, Takuya Ishihara, Junya Honda, and Yusuke Narita. Efficient adaptive experimental design for average treatment effect estimation, 2020. arXiv:2002.05308.
  • Kato et al. (2023a) Masahiro Kato, Masaaki Imaizumi, Takuya Ishihara, and Toru Kitagawa. Asymptotically minimax optimal fixed-budget best arm identification for expected simple regret minimization, 2023a. arXiv:2302.02988.
  • Kato et al. (2023b) Masahiro Kato, Masaaki Imaizumi, Takuya Ishihara, and Toru Kitagawa. Fixed-budget hypothesis best arm identification: On the information loss in experimental design. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023b.
  • Kaufmann et al. (2016) Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016.
  • Kennedy (2016) Edward H. Kennedy. Semiparametric theory and empirical processes in causal inference, 2016. arXiv:1510.04740.
  • Komiyama et al. (2022) Junpei Komiyama, Taira Tsuchiya, and Junya Honda. Minimax optimal algorithms for fixed-budget best arm identification. In Advances in Neural Information Processing Systems, 2022.
  • Komiyama et al. (2023) Junpei Komiyama, Kaito Ariu, Masahiro Kato, and Chao Qin. Rate-optimal bayesian simple regret in best arm identification. Mathematics of Operations Research, 2023.
  • Lai & Robbins (1985) T.L Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
  • Neyman (1923) Jerzy Neyman. Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes. Statistical Science, 5:463–472, 1923.
  • Neyman (1934) Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97:123–150, 1934.
  • Qin (2022) Chao Qin. Open problem: Optimal best arm identification with fixed-budget. In Conference on Learning Theory, 2022.
  • Robins et al. (1994) James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
  • Russo (2020) Daniel Russo. Simple bayesian algorithms for best-arm identification. Operations Research, 68(6):1625–1647, 2020.
  • Tabord-Meehan (2022) Max Tabord-Meehan. Stratification Trees for Adaptive Randomisation in Randomised Controlled Trials. The Review of Economic Studies, 90(5):2646–2673, 2022.
  • Tsiatis (2007) Anastasios Tsiatis. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer New York, 2007.
  • van der Laan (2008) Mark J. van der Laan. The construction and analysis of adaptive group sequential designs, 2008. URL https://biostats.bepress.com/ucbbiostat/paper232.
  • van der Vaart (1998) A.W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
  • Wang et al. (2023a) Po-An Wang, Kaito Ariu, and Alexandre Proutiere. On uniformly optimal algorithms for best arm identification in two-armed bandits with fixed budget, 2023a. arXiv:2308.12000.
  • Wang et al. (2023b) Po-An Wang, Ruo-Chun Tzeng, and Alexandre Proutiere. Best arm identification with fixed budget: A large deviation perspective. In Advances in Neural Information Processing Systems, 2023b.
  • Wieand (1976) Harry S. Wieand. A Condition Under Which the Pitman and Bahadur Approaches to Efficiency Coincide. The Annals of Statistics, 4(5):1003 – 1011, 1976.
  • Zhao (2023) Jinglong Zhao. Adaptive neyman allocation, 2023. arXiv:2309.08808.
Figure 3: Results under the second setting. We set \mu^{*}_{1}=1.00 and choose \mu^{*}_{2} from the set \{0.80,0.85,0.90,0.95,0.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (5,v_{2}) or (v_{2},5), where v_{2} is chosen from \{5,10\}. We conduct 1{,}000 independent trials and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}.
Figure 4: Results under the second setting. We set \mu^{*}_{1}=1.00 and choose \mu^{*}_{2} from the set \{0.80,0.85,0.90,0.95,0.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (5,v_{2}) or (v_{2},5), where v_{2} is chosen from \{20,50\}. We conduct 1{,}000 independent trials and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}.
Figure 5: Results under the third setting. We set \mu^{*}_{1}=10.00 and choose \mu^{*}_{2} from the set \{9.80,9.85,9.90,9.95,9.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (1,v_{2}) or (v_{2},1), where v_{2} is chosen from \{5,10\}. We conduct 1{,}000 independent trials and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}.
Figure 6: Results under the third setting. We set \mu^{*}_{1}=10.00 and choose \mu^{*}_{2} from the set \{9.80,9.85,9.90,9.95,9.99\}. The standard deviations (\sigma_{1},\sigma_{2}) are selected with probability 1/2 from either (1,v_{2}) or (v_{2},1), where v_{2} is chosen from \{20,50\}. We conduct 1{,}000 independent trials and report the empirical probability of misidentification at T\in\{100,200,300,\dots,9{,}900,10{,}000\}.