
Optimality of Thompson Sampling with Noninformative Priors
for Pareto Bandits

Jongyeong Lee$^{1,2}$    Junya Honda$^{3,2}$    Chao-Kai Chiang$^{1}$    Masashi Sugiyama$^{2,1}$
( $^{1}$The University of Tokyo  $^{2}$RIKEN AIP  $^{3}$Kyoto University )
Abstract

In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.

1 Introduction

In the multi-armed bandit (MAB) problem, an agent plays an arm and observes a reward only from the played arm, i.e., the agent receives only partial feedback (Thompson, 1933; Robbins, 1952). In the stochastic MAB problem, the rewards are further assumed to be generated from the distribution of the corresponding arm (Bubeck et al., 2012). Since only partial observations are available, the agent has to estimate the unknown distributions to identify the optimal arm while avoiding suboptimal arms that waste resources. Thus, the agent has to cope with the dilemma between exploration and exploitation.

In this problem, Thompson sampling (TS), a randomized Bayesian policy that plays an arm according to the posterior probability of being optimal, has been widely adopted because of its outstanding empirical performance (Chapelle and Li, 2011; Russo et al., 2018). Following its empirical success, theoretical analysis of TS has been conducted for several reward models such as Bernoulli models (Agrawal and Goyal, 2012; Kaufmann et al., 2012), one-dimensional exponential families (Korda et al., 2013), Gaussian models (Honda and Takemura, 2014), and bounded support models (Riou and Honda, 2020; Baudry et al., 2021) where asymptotic optimality of TS was established. Here, an algorithm is said to be asymptotically optimal if it can achieve the theoretical problem-dependent lower bound derived by Lai et al. (1985) for one-parameter models and Burnetas and Katehakis (1996) for multiparameter or nonparametric models. Note that the performance of any reasonable algorithms cannot be better than these lower bounds.

Apart from the problem-dependent regret analysis, several works studied the problem-independent or prior-independent bounds of TS (Bubeck and Liu, 2013; Russo and Van Roy, 2016; Agrawal and Goyal, 2017). In this paper, we study how the choice of noninformative priors affects the performance of TS for any given problem instance. In other words, we focus on the asymptotic optimality of TS depending on the choice of noninformative priors.

The asymptotic optimality of TS has mainly been considered in one-parameter models, while its optimality under multiparameter models has not been well studied. To the best of our knowledge, the asymptotic optimality of TS in a noncompact multiparameter model is only known for Gaussian bandits where both the mean and the variance are unknown (Honda and Takemura, 2014). They showed that TS with the uniform prior is optimal while TS with the Jeffreys prior and the reference prior cannot achieve the lower bound. The success of the uniform prior comes from its conservativeness: it moderately overestimates the posterior probability that currently suboptimal arms might be optimal, and therefore plays seemingly suboptimal arms relatively frequently.

In this paper, we consider the two-parameter Pareto model, which is heavy-tailed. We first derive the closed form of the problem-dependent constant that appears in the theoretical lower bound for Pareto models, which is nontrivial, unlike the corresponding constants for exponential families. Based on this result, we show that TS with certain probability matching priors achieves the optimal bound, which is, to our knowledge, the first such result for two-parameter Pareto bandit models.

We further show that TS with other choices of probability matching priors, called optimistic priors, suffers a polynomial regret in expectation. Therefore, being conservative is preferable when one chooses noninformative priors to avoid suboptimality in terms of the expected regret. Nevertheless, we show that TS with the Jeffreys prior or the reference prior can achieve the optimal regret bound if we add a truncation procedure on the shape parameter. Our contributions are summarized as follows:

  • We prove the asymptotic optimality/suboptimality of TS under different choices of priors, which shows the importance of the choice of noninformative priors in cases of two-parameter Pareto models.

  • We provide another option to achieve optimality: adding a truncation procedure to the parameter space of the posterior distribution instead of finding an optimal prior.

This paper is organized as follows. In Section 2, we formulate the stochastic MAB problems under the Pareto distribution and derive its regret lower bound. Based on the choice of noninformative priors and their corresponding posteriors, we formulate TS for the Pareto models and propose another TS-based algorithm to solve the suboptimality problem of the Jeffreys prior and the reference prior in Section 3. In Section 4, we provide the main results on the optimality of TS and TS with a truncation procedure, whose proof outline is given in Section 6. Numerical results that support our theoretical analysis are provided in Section 5.

2 Preliminaries

In this section, we formulate the stochastic MAB problem. We derive the exact form of the problem-dependent constant that appears in the lower bound of the expected regret in Pareto bandits.

2.1 Notations

We consider the stochastic $K$-armed bandit problem where the rewards are generated from Pareto distributions with fixed parameters. An agent chooses an arm $a\in[K]:=\{1,\ldots,K\}$ at each round $t\in\mathbb{N}$ and observes an independent and identically distributed reward from $\mathrm{Pa}(\kappa_{a},\alpha_{a})$, where $\mathrm{Pa}(\kappa,\alpha)$ denotes the Pareto distribution parameterized by scale $\kappa>0$ and shape $\alpha>0$. This distribution has the density function

f_{\kappa,\alpha}^{\mathrm{Pa}}(x)=\frac{\alpha\kappa^{\alpha}}{x^{\alpha+1}}\mathbbm{1}[x\geq\kappa], (1)

where $\mathbbm{1}[\cdot]$ denotes the indicator function. We consider a bandit model where the parameters $\theta_{a}=(\kappa_{a},\alpha_{a})\in\mathbb{R}_{+}\times(1,\infty)$ are unknown to the agent. We denote the mean of a random variable following $\mathrm{Pa}(\theta_{a})$ by $\mu_{a}=\mu(\theta_{a}):=\frac{\kappa_{a}\alpha_{a}}{\alpha_{a}-1}$. Note that $\alpha>1$ is a necessary condition for an arm to have a finite mean, which is required to define the sub-optimality gap $\Delta_{a}:=\max_{i\in[K]}\mu_{i}-\mu_{a}$. We assume without loss of generality that arm $1$ has the maximum mean, i.e., $\mu_{1}=\max_{i\in[K]}\mu_{i}$. Let $j(t)$ be the arm played at round $t\in\mathbb{N}$ and $N_{a}(t)=\sum_{s=1}^{t-1}\mathbbm{1}[j(s)=a]$ denote the number of rounds arm $a$ is played before round $t$. Then, the regret at round $T$ is given as

\mathrm{Reg}(T)=\sum_{t=1}^{T}\Delta_{j(t)}=\sum_{a=2}^{K}\Delta_{a}N_{a}(T+1).

Let $r_{a,n}$ be the $n$-th reward generated from arm $a$. For the Pareto distribution, the maximum likelihood estimators (MLEs) of $(\kappa,\alpha)$ for arm $a$ given $n$ rewards, together with their distributions, are given as follows (Malik, 1970):

\hat{\kappa}_{a}(n)=\min_{s\in[n]}r_{a,s}\sim\mathrm{Pa}(\kappa_{a},n\alpha_{a}),\qquad\hat{\alpha}_{a}(n)=\frac{n}{\sum_{s=1}^{n}\log(r_{a,s})-n\log\hat{\kappa}_{a}(n)}\sim\mathrm{IG}(n-1,n\alpha_{a}), (2)

where $\mathrm{IG}(n,\alpha)$ denotes the inverse-gamma distribution with shape $n>0$ and scale $\alpha>0$. Note that Malik (1970) further showed the stochastic independence of $\hat{\alpha}(n)$ and $\hat{\kappa}(n)$.
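The MLEs in (2) are simple to compute in practice. The following minimal sketch (our own illustration, not code from the paper; all variable names are ours) simulates Pareto rewards by inverse transform sampling and evaluates the MLEs.

```python
import numpy as np

def pareto_rewards(kappa, alpha, n, rng):
    # Inverse transform sampling: if U ~ Uniform(0,1), then kappa * U**(-1/alpha) ~ Pa(kappa, alpha).
    return kappa * rng.random(n) ** (-1.0 / alpha)

def pareto_mle(rewards):
    # MLEs from (2): scale = minimum observation, shape = n / sum of log-ratios to the minimum.
    n = len(rewards)
    kappa_hat = rewards.min()
    alpha_hat = n / (np.log(rewards).sum() - n * np.log(kappa_hat))
    return kappa_hat, alpha_hat

rng = np.random.default_rng(0)
rewards = pareto_rewards(kappa=1.3, alpha=1.4, n=1000, rng=rng)
print(pareto_mle(rewards))  # should be close to (1.3, 1.4)
```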

2.2 Asymptotic lower bound

Burnetas and Katehakis (1996) provided a problem-dependent lower bound on the expected regret: any uniformly fast convergent policy, that is, any policy satisfying $\mathrm{Reg}(T)=o(T^{\alpha})$ for all $\alpha\in(0,1)$, must satisfy

\liminf_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\log T}\geq\sum_{a=2}^{K}\frac{\Delta_{a}}{\inf_{\theta:\mu(\theta)>\mu_{1}}\mathrm{KL}(\mathrm{Pa}(\kappa_{a},\alpha_{a}),\mathrm{Pa}(\theta))}, (3)

where $\mathrm{KL}(\cdot,\cdot)$ denotes the Kullback-Leibler (KL) divergence. Notice that the bandit model $(\theta_{a})_{a\in[K]}$ is treated as a fixed constant in the problem-dependent analysis.

The KL divergence between Pareto distributions is given as

\mathrm{KL}(\mathrm{Pa}(\kappa_{1},\alpha_{1}),\mathrm{Pa}(\kappa_{2},\alpha_{2}))=\begin{cases}\log\left(\frac{\alpha_{1}}{\alpha_{2}}\right)+\alpha_{2}\log\left(\frac{\kappa_{1}}{\kappa_{2}}\right)+\frac{\alpha_{2}}{\alpha_{1}}-1&\text{if }\kappa_{2}\leq\kappa_{1},\\ \infty&\text{otherwise}.\end{cases}

Here the divergence can become infinite since the scale parameter $\kappa$ determines the support of the Pareto distribution. We denote the infimum in the denominator of (3) for $a\neq 1$ by

\mathrm{KL}_{\mathrm{inf}}(a):=\inf_{\theta:\mu(\theta)>\mu_{1}}\mathrm{KL}(\mathrm{Pa}(\kappa_{a},\alpha_{a}),\mathrm{Pa}(\theta))=\inf_{\theta\in\Theta_{a}}\log\frac{\alpha_{a}}{\alpha}+\alpha\log\frac{\kappa_{a}}{\kappa}+\frac{\alpha}{\alpha_{a}}-1,

where

\Theta_{a}=\left\{(\kappa,\alpha)\in(0,\kappa_{a}]\times(0,\infty):\mu(\kappa,\alpha)>\mu_{1}\right\}. (4)

Notice that $\Theta_{a}$ includes parameters whose expected rewards are infinite ($\alpha\in(0,1]$), although we consider a bandit model with $\alpha_{a}>1$ for all $a\in[K]$ so that the sub-optimality gap $\Delta_{a}$ is finite. This implies that $\mathrm{KL}_{\mathrm{inf}}(a)$ does not depend on whether the agent considers the possibility that an arm has an infinite expected reward. Then, we can simply rewrite the lower bound in (3) as

\liminf_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\log T}\geq\sum_{a=2}^{K}\frac{\Delta_{a}}{\mathrm{KL}_{\mathrm{inf}}(a)}.

The following lemma shows the closed form of this infimum, whose proof is given in Appendix B.

Lemma 1.

For any arm $a\neq 1$, it holds that

\mathrm{KL}_{\mathrm{inf}}(a)=\log\left(\alpha_{a}\frac{\mu_{1}-\kappa_{a}}{\mu_{1}}\right)+\frac{1}{\alpha_{a}}\frac{\mu_{1}}{\mu_{1}-\kappa_{a}}-1.
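As a quick numerical sanity check of Lemma 1 (our own code, not part of the paper), one can compare the closed form with a brute-force minimization of the KL divergence over $\Theta_{a}$. The example below uses arm 2 of the bandit model used in the experiments (Table 1) against $\mu_{1}=4.55$; the grid resolution and parameter ranges are arbitrary choices.

```python
import numpy as np

def kl_inf_closed_form(kappa_a, alpha_a, mu1):
    # Closed form from Lemma 1.
    return np.log(alpha_a * (mu1 - kappa_a) / mu1) + mu1 / (alpha_a * (mu1 - kappa_a)) - 1.0

def kl_inf_grid(kappa_a, alpha_a, mu1, n_grid=400):
    # Brute-force infimum of KL(Pa(kappa_a, alpha_a), Pa(kappa, alpha)) over
    # Theta_a = {(kappa, alpha): 0 < kappa <= kappa_a, mu(kappa, alpha) > mu1}.
    kappas = np.linspace(1e-3, kappa_a, n_grid)
    alphas = np.linspace(1e-3, 10.0, n_grid)
    K, A = np.meshgrid(kappas, alphas)
    means = np.where(A > 1, K * A / np.maximum(A - 1, 1e-12), np.inf)  # mu = inf for alpha <= 1
    kl = np.log(alpha_a / A) + A * np.log(kappa_a / K) + A / alpha_a - 1.0
    return kl[means > mu1].min()

print(kl_inf_closed_form(1.2, 1.6, 4.55))  # roughly 0.013
print(kl_inf_grid(1.2, 1.6, 4.55))         # should approximately agree
```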

2.3 Relation with bounded moment models

In the MAB literature, several algorithms based on the upper confidence bound (UCB) have been proposed to tackle heavy-tailed models with infinite variance under additional assumptions on moments (Bubeck et al., 2013). One major assumption is that the moments of every arm $a$ satisfy $\mathbb{E}[|r_{a,n}|^{\gamma}]\leq v$ for some fixed $\gamma\in[1,2)$ and known $v<\infty$ (Bubeck et al., 2013). Note that the $\gamma$-th raw moment of a random variable $X$ following $\mathrm{Pa}(\kappa,\alpha)$ is given as

\mathbb{E}\left[X^{\gamma}\right]=\begin{cases}\infty&\alpha\leq\gamma,\\ \frac{\alpha\kappa^{\gamma}}{\alpha-\gamma}&\alpha>\gamma,\end{cases}

which implies that neither the Pareto model nor the bounded moment model is a subset of the other.

Recently, Agrawal et al. (2021) proposed an asymptotically optimal KL-UCB-based algorithm for the bounded moment model that requires solving an optimization problem at every round. Since the bounded moment model covers only some Pareto distributions in general, the known optimality result for KL-UCB does not necessarily imply optimality in the sense of (3).

3 Thompson sampling and probability matching priors

TS is a policy based on the Bayesian viewpoint, where the choice of priors is important. Although one can utilize prior knowledge of the parameters when choosing the prior, such information is not always available in practice. To deal with such scenarios, we consider noninformative priors based on the Fisher information (FI) matrix, which do not assume any knowledge of the unknown parameters.

For a random variable $X$ with density $f(\cdot|\bm{\theta})$, the FI is defined as the variance of the score, the partial derivative of $\log f$ with respect to $\bm{\theta}$, which is given as follows (Cover and Thomas, 2006):

I_{ij}=[I(\bm{\theta})]_{ij}:=\mathbb{E}_{X}\left[\left(\frac{\partial}{\partial\theta_{i}}\log f(X|\bm{\theta})\right)\left(\frac{\partial}{\partial\theta_{j}}\log f(X|\bm{\theta})\right)\,\middle|\,\bm{\theta}\right]. (5)

It is known that the FI matrix in (5) coincides with the negative expected value of the Hessian matrix of $\log f(X|\bm{\theta})$ if the model satisfies the FI regularity condition (Schervish, 2012). However, $\mathrm{Pa}(\kappa,\alpha)$ does not satisfy this condition since it is a parametric-support family. Therefore, for $X$ with the density function in (1), one can obtain the FI matrix of $\mathrm{Pa}(\kappa,\alpha)$ based on (5) as follows (Li et al., 2022):

I(\kappa,\alpha)=\begin{bmatrix}\frac{\alpha^{2}}{\kappa^{2}}&0\\ 0&\frac{1}{\alpha^{2}}\end{bmatrix}=\begin{bmatrix}I_{11}(\kappa)I_{11}(\alpha)&0\\ 0&I_{22}(\alpha)\end{bmatrix}, (6)

where $I_{11}(\kappa)=\frac{1}{\kappa^{2}}$, $I_{11}(\alpha)=\alpha^{2}$, and $I_{22}(\alpha)=\frac{1}{\alpha^{2}}$. Note that $I_{11}$ differs from $-\mathbb{E}\left[\frac{\partial^{2}}{\partial\kappa^{2}}\log f_{\kappa,\alpha}^{\mathrm{Pa}}(X;\theta)\,\middle|\,\theta\right]=\frac{\alpha}{\kappa^{2}}$.

Based on (6), the Jeffreys prior and the reference prior are given as $\pi_{\mathrm{J}}(\kappa,\alpha)\propto\sqrt{\det(I)}=\frac{1}{\kappa}$ and $\pi_{\mathrm{R}}(\kappa,\alpha)\propto\sqrt{I_{11}(\kappa)I_{22}(\alpha)}=\frac{1}{\kappa\alpha}$, respectively. Here, the reverse reference prior is the same as the reference prior because of the orthogonality of the parameters (Datta and Ghosh, 1995; Datta, 1996).

From the orthogonality of the parameters, the probability matching prior when $\kappa$ is the parameter of interest and $\alpha$ is the nuisance parameter is given as

\pi_{\mathrm{P}}(\kappa,\alpha)\propto\sqrt{I_{11}}\,g_{1}(\alpha)=\frac{\alpha}{\kappa}g_{1}(\alpha)

for an arbitrary $g_{1}(\alpha)>0$ (Tibshirani, 1989). In this paper, we consider the priors $\pi(\kappa,\alpha)\propto\frac{\alpha^{-k}}{\kappa}$ for $k\in\mathbb{Z}$, since the cases $k=0$ and $k=1$ correspond to the Jeffreys prior and the (reverse) reference prior, respectively.

Remark 1.

The Pareto distribution discussed in this paper is sometimes called the Pareto type 1 distribution (Arnold, 2008). On the other hand, Kim et al. (2009) derived several noninformative priors for a special case of the Pareto type 2 distribution called the Lomax distribution (Lomax, 1954), where the FI matrix can be written using the negative Hessian.

In the multiparameter case, the Jeffreys prior is known to suffer from many problems (Datta and Ghosh, 1996; Ghosh, 2011). For example, it is known that the Jeffreys prior leads to inconsistent estimators of the error variance in the Neyman-Scott problem (see Berger and Bernardo, 1992, Example 3). This might be a reason why TS with the Jeffreys prior suffers a polynomial expected regret in a multiparameter distribution setting. More details on probability matching priors and the Jeffreys prior can be found in Appendix E.

Algorithm 1 $\mathrm{STS}$ / $\mathrm{STS}\text{-}\mathrm{T}$
1:  Parameter: $k\in\mathbb{Z}$, $\bar{n}=\max\{2,k+1\}$.
2:  Initialization: Select each arm $\bar{n}$ times.
3:  Loop:
4:    ($\mathrm{STS}$) Sample $\tilde{\alpha}_{a}(t)\sim\mathrm{Erlang}\left(N_{a}(t)-k,\frac{N_{a}(t)}{\hat{\alpha}_{a}(N_{a}(t))}\right)$.
5:    ($\mathrm{STS}\text{-}\mathrm{T}$) $\bar{\alpha}_{a}(N_{a}(t))\leftarrow\min(N_{a}(t),\hat{\alpha}_{a}(N_{a}(t)))$.
6:    ($\mathrm{STS}\text{-}\mathrm{T}$) Sample $\tilde{\alpha}_{a}(t)\sim\mathrm{Erlang}\left(N_{a}(t)-k,\frac{N_{a}(t)}{\bar{\alpha}_{a}(N_{a}(t))}\right)$.
7:    if $\{a\in[K]:\tilde{\alpha}_{a}(t)\leq 1\}\neq\emptyset$ then
8:      Select $j(t)=\operatorname*{arg\,min}_{a\in[K]}\tilde{\alpha}_{a}(t)$.
9:    else
10:     Sample $u_{a}\sim U(0,1)$ for every $a\in[K]$.
11:     $\tilde{\kappa}_{a}(t)=\hat{\kappa}_{a}(N_{a}(t))\,u_{a}^{1/(N_{a}(t)\tilde{\alpha}_{a}(t))}$.
12:     Play $j(t)=\operatorname*{arg\,max}_{a\in[K]}\frac{\tilde{\kappa}_{a}(t)\tilde{\alpha}_{a}(t)}{\tilde{\alpha}_{a}(t)-1}=\operatorname*{arg\,max}_{a\in[K]}\tilde{\mu}_{a}(t)$.
13:  end if

3.1 Sampling procedure

Let $\mathcal{F}_{t}:=(j(s),r_{j(s),N_{j(s)}(s)})_{s=1}^{t-1}$ be the history up to round $t$. Under the prior $\frac{\alpha^{-k}}{\kappa}$ with $k\in\mathbb{Z}$, the marginalized posterior distribution of the shape parameter of arm $a$ is given as

\alpha_{a}\mid\mathcal{F}_{t}\sim\mathrm{Erlang}\left(N_{a}(t)-k,\frac{N_{a}(t)}{\hat{\alpha}_{a}(N_{a}(t))}\right), (7)

where $\mathrm{Erlang}(s,\beta)$ denotes the Erlang distribution with shape $s$ and rate $\beta$. Note that we require $\bar{n}\geq\max\{2,k+1\}$ initial plays of every arm so that the posterior is proper and the MLE of $\alpha$ is well-defined. When the shape parameter $\alpha_{a}$ is given as $\beta$, the cumulative distribution function (CDF) of the conditional posterior of $\kappa_{a}$ is given as

\mathbb{P}\left[\kappa_{a}\leq x\mid\mathcal{F}_{t},\alpha_{a}=\beta\right]=\left(\frac{x}{\hat{\kappa}_{a}(N_{a}(t))}\right)^{\beta N_{a}(t)}, (8)

for $0<x\leq\hat{\kappa}_{a}(N_{a}(t))$. Since the posteriors can be derived by following the same steps as Sun et al. (2020), the detailed derivation is postponed to Appendix E.3. At round $t$, we denote the sampled scale and shape parameters of arm $a$ by $\tilde{\kappa}_{a}(t)$ and $\tilde{\alpha}_{a}(t)$, respectively, and the corresponding mean reward by $\tilde{\mu}_{a}(t):=\mu(\tilde{\kappa}_{a}(t),\tilde{\alpha}_{a}(t))$. We first sample the shape parameter from the marginalized posterior in (7). Then, we sample the scale parameter given the sampled shape parameter from the conditional posterior CDF in (8) by inverse transform sampling. TS based on this sequential procedure, which we call Sequential Thompson Sampling ($\mathrm{STS}$), is formulated in Algorithm 1.

In Theorem 3 given in the next section, $\mathrm{STS}$ with the Jeffreys prior or the reference prior turns out to be suboptimal in view of the lower bound in (3). Their suboptimality is mainly due to the behavior of the posterior in (7) when $\hat{\alpha}_{1}(n)$ is overestimated for small $N_{1}(t)=n$. To overcome this issue, we propose the $\mathrm{STS}\text{-}\mathrm{T}$ policy, a variant of $\mathrm{STS}$ with truncation, where we replace $\hat{\alpha}_{a}(n)$ with $\bar{\alpha}_{a}(n):=\min\left(n,\hat{\alpha}_{a}(n)\right)$ in (7). Note that this truncation is applied only within the posterior sampling via (7) and (8). We show that $\mathrm{STS}\text{-}\mathrm{T}$ with the Jeffreys prior or the reference prior can achieve the optimal regret bound in Theorem 4.
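To make the sampling step concrete, the following minimal sketch (our own illustration, not the authors' code; function and variable names are ours) draws one posterior sample of $(\kappa,\alpha)$ and the corresponding mean for a single arm, covering both $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ via a flag. The arm-selection rule of Algorithm 1 (play the arm with the smallest $\tilde{\alpha}$ if some $\tilde{\alpha}\leq 1$, otherwise the largest $\tilde{\mu}$) is omitted.

```python
import numpy as np

def sample_arm_posterior(rewards, k, truncate, rng):
    """One draw of (kappa_tilde, alpha_tilde, mu_tilde) for a single arm.
    truncate=False corresponds to STS, truncate=True to STS-T.
    Requires len(rewards) >= max(2, k + 1) so that the Erlang shape is positive."""
    n = len(rewards)
    kappa_hat = rewards.min()
    alpha_hat = n / (np.log(rewards).sum() - n * np.log(kappa_hat))  # MLE from (2)
    alpha_bar = min(n, alpha_hat) if truncate else alpha_hat          # truncation used by STS-T
    # Marginalized posterior (7): Erlang(n - k, rate) = Gamma(shape = n - k, scale = 1 / rate).
    rate = n / alpha_bar
    alpha_tilde = rng.gamma(shape=n - k, scale=1.0 / rate)
    if alpha_tilde <= 1:
        return kappa_hat, alpha_tilde, np.inf  # sampled mean is infinite
    # Conditional posterior (8) by inverse transform: kappa_tilde = kappa_hat * u**(1 / (n * alpha_tilde)).
    kappa_tilde = kappa_hat * rng.random() ** (1.0 / (n * alpha_tilde))
    mu_tilde = kappa_tilde * alpha_tilde / (alpha_tilde - 1)
    return kappa_tilde, alpha_tilde, mu_tilde

rng = np.random.default_rng(0)
rewards = 1.3 * rng.random(50) ** (-1.0 / 1.4)  # 50 observations from Pa(1.3, 1.4)
print(sample_arm_posterior(rewards, k=1, truncate=True, rng=rng))
```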

3.2 Interpretation of the prior parameter $k$

The Erlang distribution is a special case of the Gamma distribution where the shape parameter is a positive integer. If a random variable $X$ follows $\mathrm{Erlang}(s,\beta)$, then it has the density

f_{s,\beta}^{\mathrm{Er}}(x)=\frac{\beta^{s}}{\Gamma(s)}x^{s-1}e^{-\beta x}\mathbbm{1}[x\in\mathbb{R}_{+}], (9)

where $s\in\mathbb{N}$ and $\beta>0$ denote the shape and the rate parameter, respectively. Then, the CDF evaluated at $x>0$ is given as

F_{s,\beta}^{\mathrm{Er}}(x)=\frac{\int_{0}^{\beta x}t^{s-1}e^{-t}\mathrm{d}t}{\Gamma(s)}=\frac{\gamma(s,\beta x)}{\Gamma(s)}, (10)

where $\gamma(\cdot,\cdot)$ denotes the lower incomplete gamma function. Since $\gamma(s+1,x)=s\gamma(s,x)-x^{s}e^{-x}$ holds, one can observe that for any $x>0$

F_{s,\beta}^{\mathrm{Er}}(x)\geq F_{s+1,\beta}^{\mathrm{Er}}(x). (11)

From the sampling procedure of $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$, $\tilde{\mu}$ depends on $\tilde{\kappa}$ only when $\tilde{\alpha}>1$ holds, since $\tilde{\alpha}\leq 1$ results in $\tilde{\mu}(\cdot,\tilde{\alpha})=\infty$. Moreover, for any $\beta>1$ in (8), $\tilde{\kappa}$ concentrates on $\hat{\kappa}$ for sufficiently large $N_{a}(t)=n$. Thus, $\tilde{\mu}$ is mainly determined by $\tilde{\alpha}$ and $\hat{\kappa}$, and the choice of $k$ affects the sampling of $\tilde{\alpha}$ through (7). From (11), the probability of sampling a small $\tilde{\alpha}$ increases as the shape $n-k$ decreases. Therefore, for the same $n$, $\tilde{\mu}$ of suboptimal arms tends to increase as $k$ increases; in other words, the probability of sampling a large $\tilde{\mu}$ grows with $k$. Consequently, TS with a large $k$ becomes a conservative policy that frequently plays currently suboptimal arms, whereas priors with a small $k$ yield an optimistic policy that focuses on playing the current best arm.
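This effect of $k$ can be checked numerically. The snippet below (our own illustration, assuming SciPy; the numbers are chosen arbitrarily) evaluates the Erlang CDF in (10) at a fixed point for several values of $k$, confirming that a larger $k$ (smaller shape $n-k$) puts more mass on small $\tilde{\alpha}$ and hence on large $\tilde{\mu}$.

```python
from scipy.stats import gamma

# Erlang(s, rate) is Gamma with shape s and scale 1/rate.
n, alpha_hat, x = 30, 1.6, 1.0
rate = n / alpha_hat
for k in (-3, 0, 1, 3):
    shape = n - k
    # P[alpha_tilde <= x]: increases with k, i.e., larger k is more likely to draw small alpha_tilde.
    print(k, gamma.cdf(x, a=shape, scale=1.0 / rate))
```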

4 Main results

In this section, we provide regret bounds of $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with different choices of $k\in\mathbb{Z}$. At first, we show the asymptotic optimality of $\mathrm{STS}$ for the priors $\pi(\kappa,\alpha)\propto\frac{\alpha^{-k}}{\kappa}$ with $k\in\mathbb{Z}_{\geq 2}$.

Theorem 2.

Assume that arm $1$ is the unique optimal arm with a finite mean. For every $a\in[K]$, let $\varepsilon_{a}=\min\left\{\frac{\kappa_{a}}{\alpha_{a}(\kappa_{a}+1)}\frac{\kappa_{a}\delta_{a}}{\mu_{a}(\mu_{a}+\delta_{a}-\kappa_{a})+\kappa_{a}\delta_{a}},\frac{\kappa_{a}\delta_{a}}{\mu_{a}(1+\mu_{a}+\delta_{a})}\right\}$, where $\delta_{a}=\frac{\Delta_{a}}{2}$ for $a\neq 1$ and $\delta_{1}=\min_{a\neq 1}\delta_{a}$. Given an arbitrary $\epsilon\in(0,\min_{a\in[K]}\varepsilon_{a})$, the expected regret of $\mathrm{STS}$ with $k\in\mathbb{Z}_{\geq 2}$ is bounded as

\mathbb{E}[\mathrm{Reg}(T)]\leq\sum_{a=2}^{K}\frac{\Delta_{a}\log T}{D_{a,k}(\epsilon)}+\mathcal{O}\left(\epsilon^{-2}\right),

where $D_{a,k}(\epsilon)$ is a function such that $\lim_{\epsilon\to 0}D_{a,k}(\epsilon)=\mathrm{KL}_{\mathrm{inf}}(a)$ for any fixed $k\in\mathbb{Z}$.

By letting $\epsilon=o(1)$ in Theorem 2, we see that $\mathrm{STS}$ with $k\in\mathbb{Z}_{\geq 2}$ satisfies

\limsup_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\log T}\leq\sum_{a=2}^{K}\frac{\Delta_{a}}{\mathrm{KL}_{\mathrm{inf}}(a)},

which shows the asymptotic optimality of $\mathrm{STS}$ in terms of the lower bound in (3).

Next, we show in the theorem below that $\mathrm{STS}$ with $k\in\mathbb{Z}_{\leq 1}$ cannot achieve the asymptotic lower bound. Following the proof for Gaussian bandits (Honda and Takemura, 2014), we consider a two-armed bandit problem where full information on the suboptimal arm is given, to simplify the analysis. We further assume that the two arms have the same scale parameter, $\kappa_{1}=\kappa_{2}$.

Theorem 3.

Consider a two-armed bandit problem where $\kappa_{1}=\kappa_{2}$ and $1<\alpha_{1}<\alpha_{2}$. When $\tilde{\alpha}_{1}(t)$ and $\tilde{\kappa}_{1}(t)$ are sampled from the posteriors in (7) and (8), respectively, and $\tilde{\mu}_{2}(t)=\mu_{2}$ holds, there exists a constant $C(\alpha_{1},\alpha_{2})>0$ satisfying

\liminf_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\log T}\geq C(\alpha_{1},\alpha_{2}),

where $C(\alpha_{1},\alpha_{2})>\mathrm{KL}_{\mathrm{inf}}(2)$ holds for some instances. In particular, for $k\in\mathbb{Z}_{\leq 0}$, there exists $C^{\prime}(\alpha_{1},\alpha_{2})$ satisfying

\liminf_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\sqrt{T}}\geq C^{\prime}(\alpha_{1},\alpha_{2}).

From Theorems 2 and 3, we find that the prior should be conservative to some extent when one considers maximizing rewards in expectation.

Although $\mathrm{STS}$ with the Jeffreys prior ($k=0$) and the reference prior ($k=1$) was shown to be suboptimal, we show that a modified algorithm, $\mathrm{STS}\text{-}\mathrm{T}$, can achieve the optimal regret bound with $k\in\mathbb{Z}_{\geq 0}$.

Theorem 4.

With the same notation as in Theorem 2, the expected regret of $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}_{\geq 0}$ is bounded as

\mathbb{E}[\mathrm{Reg}(T)]\leq\sum_{a=2}^{K}\frac{\Delta_{a}\log T}{D_{a,k}(\epsilon)}+\mathcal{O}(\epsilon^{-m}),

where $m=\max(2,3-k)$.

From Theorems 2 and 4, we have two ways to achieve the lower bound in (3): use either conservative priors with the MLEs or moderately optimistic priors with truncated estimators. Since the initialization step requires playing every arm $\max(2,k+1)$ times, the Jeffreys prior or the reference prior with the truncated estimator would be a better choice when the number of arms $K$ is large. On the other hand, if the model can contain arms with large $\alpha$, where the truncation might be problematic for small $n$, it would be better to use $\mathrm{STS}$ with conservative priors.

5 Experiments

In this section, we present numerical results that demonstrate the performance of $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ and support our theoretical analysis. We consider the $4$-armed bandit model $\bm{\theta}_{4}$ with the parameters given in Table 1, where the suboptimal arms have smaller, equal, and larger $\kappa$ compared with the optimal arm. The model $\bm{\theta}_{4}$ has $\bm{\mu}=(4.55,\,3.2,\,2.74,\,3)$ and every arm has infinite variance. Further experimental results can be found in Appendix H.

Table 1: $4$-armed bandit model $\bm{\theta}_{4}$.
         | arm 1 | arm 2 | arm 3 | arm 4
$\kappa$ |  1.3  |  1.2  |  1.3  |  1.5
$\alpha$ |  1.4  |  1.6  |  1.9  |  2.0

Figure 1 shows the cumulative regret of the proposed policies with various choices of the prior parameter $k$. The solid lines denote the cumulative regret averaged over 100,000 independent runs for priors that can achieve the optimal lower bound in (3), whereas the dashed lines denote that for priors that cannot. The green dotted line denotes the problem-dependent lower bound, and the shaded regions denote a quarter of the standard deviation.

In Figures 2 and 3, we investigate the difference between $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with the same $k$. The solid lines denote the cumulative regret averaged over 100,000 independent runs. The shaded regions and dashed lines show the central $99\%$ interval and the upper $0.05\%$ of the regret, respectively.

The Jeffreys prior ($k=0$)

In Figure 1(a), the Jeffreys prior appears to have a larger order of regret than the priors with $k=1,3$, which performed best in this setting. As Theorem 4 states, its performance improves under $\mathrm{STS}\text{-}\mathrm{T}$, where it shows a performance similar to that of $k=1,3$.

Figure 1: (a) Cumulative regret of $\mathrm{STS}$ with various $k$; (b) cumulative regret of $\mathrm{STS}\text{-}\mathrm{T}$ with various $k$. The solid lines denote the cumulative regret averaged over 100,000 independent runs for priors that can achieve the optimal lower bound in (3); the dashed lines denote that for priors that cannot. The shaded regions show a quarter of the standard deviation. The green dotted line denotes the problem-dependent lower bound based on Lemma 1.

Figure 2: (a) The Jeffreys prior $k=0$; (b) the reference prior $k=1$; (c) the prior with $k=3$. The solid lines denote the regret averaged over 100,000 independent runs. The shaded regions and dashed lines show the central $99\%$ interval and the upper $0.05\%$ of the regret, respectively.

Figure 2(a) illustrates a possible reason for this improvement: the central $99\%$ interval of the regret noticeably shrinks under $\mathrm{STS}\text{-}\mathrm{T}$. Since the suboptimality of $\mathrm{STS}$ with the Jeffreys prior ($k=0$) is due to extreme cases that induce a polynomial regret with small probability, this shrinkage contributes to decreasing the expected regret of $\mathrm{STS}\text{-}\mathrm{T}$ with the Jeffreys prior.

The reference prior ($k=1$)

The reference prior showed a performance similar to that of the asymptotically optimal prior $k=3$, although it was shown to be suboptimal for some instances under $\mathrm{STS}$ in Theorem 3. Similarly to the Jeffreys prior ($k=0$), the reference prior ($k=1$) under $\mathrm{STS}\text{-}\mathrm{T}$ has a smaller central $99\%$ interval of the regret than under $\mathrm{STS}$, as shown in Figure 2(b), although the decrease is smaller than that for the Jeffreys prior. This would imply that the reference prior is more conservative than the Jeffreys prior.

Figure 3: (a) The prior with $k=-1$; (b) the prior with $k=-3$. The solid lines denote the regret averaged over 100,000 independent runs. The shaded regions and dashed lines show the central $99\%$ interval and the upper $0.05\%$ of the regret, respectively.

The conservative prior ($k=3$)

Interestingly, Figure 2(c) shows that the truncation procedure does not affect the central $99\%$ interval of the regret and even slightly degrades the performance in the upper $0.05\%$. Notice that the upper $0.05\%$ of the regret for $k=3$ is much lower than that for $k=0,1$, which shows the stability of the conservative prior in Figure 2.

Since the truncation procedure was adopted to prevent the extreme cases that were problematic for $k\in\mathbb{Z}_{\leq 1}$, it is natural that there is little difference between $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with $k=3$. This would imply that $k=3$ is sufficiently conservative, so the truncation procedure does not affect the overall performance.

Optimistic priors ($k<0$)

In Figure 1(a), one can see that the averaged regret for $k=-1$ and $k=-3$ increases much faster than that for $k=0,1,3$ under the $\mathrm{STS}$ policy, which illustrates the suboptimality of $\mathrm{STS}$ with priors $k\in\mathbb{Z}_{<0}$.

Since the optimistic priors ($k<0$) showed better performance under $\mathrm{STS}\text{-}\mathrm{T}$ in Figure 1, we can confirm the effectiveness of the truncation procedure in the posterior sampling with optimistic priors. However, as Figures 3(a) and 3(b) illustrate, there is no big difference in the central $99\%$ interval of the regret between $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with $k=-1,-3$, which might imply that priors with $k\in\mathbb{Z}_{<0}$ are too optimistic. Therefore, we might need to use a more conservative truncation procedure, such as one using $\bar{\alpha}_{a}(n)=\min(\sqrt{n},\hat{\alpha}_{a}(n))$ or $\min(\log n,\hat{\alpha}_{a}(n))$, which would induce a larger regret in the finite time horizon.

6 Proof outline of optimal results

In this section, we provide the proof outline of Theorems 2 and 4, whose detailed proofs are given in Appendix C. The proof of Theorem 3 is postponed to Appendix D.

Let us first consider good events on MLEs defined by

\mathcal{K}_{a,n}(\epsilon):=\{\hat{\kappa}_{a}(n)\in[\kappa_{a},\kappa_{a}+\epsilon]\},
\mathcal{A}_{a,n}(\epsilon):=\{\hat{\alpha}_{a}(n)\in[\alpha_{a}-\epsilon_{a,l}(\epsilon),\alpha_{a}+\epsilon_{a,u}(\epsilon)]\},
\mathcal{E}_{a,n}(\epsilon):=\mathcal{K}_{a,n}(\epsilon)\cap\mathcal{A}_{a,n}(\epsilon),

where $n\in\mathbb{N}$ and

\epsilon_{a,l}(\epsilon)=\frac{\epsilon\alpha_{a}^{2}}{1+\epsilon\alpha_{a}},\qquad\epsilon_{a,u}(\epsilon)=\frac{\epsilon\alpha_{a}^{2}(\kappa_{a}+1)}{\kappa_{a}-\epsilon\alpha_{a}(\kappa_{a}+1)}. (12)

Note that $\bar{\alpha}_{a}(n)=\hat{\alpha}_{a}(n)$ holds on $\mathcal{A}_{a,n}(\epsilon)$ for any $n\geq\alpha_{a}+1$. Here, we set $\varepsilon_{a}$ so that $\hat{\mu}_{a}\in\left[\mu_{a}-\delta_{a},\mu_{a}+\delta_{a}\right]$ holds on $\mathcal{E}_{a,n}(\epsilon)$ for any $\epsilon\leq\varepsilon_{a}$. Define an event on the sample of the optimal arm

\mathcal{M}_{\epsilon}(t):=\{\tilde{\mu}_{1}(t)\geq\mu_{1}-\epsilon\}.

Then, the expected regret at round $T$ can be decomposed as follows:

\mathbb{E}[\mathrm{Reg}(T)]=\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{j(t)}\right]=\sum_{a=2}^{K}\Delta_{a}\left(\bar{n}+\sum_{t=\bar{n}K+1}^{T}\mathbb{E}[\mathbbm{1}[j(t)=a]]\right)
\leq\Delta_{2}\sum_{t=\bar{n}K+1}^{T}\Big(\mathbb{E}\left[\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\right]+\mathbb{E}\left[\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\right]\Big)
\quad+\sum_{a=2}^{K}\Delta_{a}\Big\{\bar{n}+\sum_{t=\bar{n}K+1}^{T}\Big(\mathbb{E}\left[\mathbbm{1}[j(t)=a,\mathcal{M}_{\epsilon}(t),\mathcal{E}_{a,N_{a}(t)}(\epsilon)]\right]+\mathbb{E}\left[\mathbbm{1}[j(t)=a,\mathcal{M}_{\epsilon}(t),\mathcal{E}_{a,N_{a}(t)}^{c}(\epsilon)]\right]\Big)\Big\},

where $\mathcal{E}^{c}$ denotes the complement of $\mathcal{E}$. Lemmas 5–8 complete the proofs of Theorems 2 and 4; their proofs are given in Appendix C.

Lemma 5.

Under $\mathrm{STS}$ with $k\in\mathbb{Z}_{\geq 2}$,

\sum_{t=\bar{n}K+1}^{T}\mathbb{E}\left[\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\right]\leq\mathcal{O}(\epsilon^{-2}),

and under $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}_{\geq 0}$,

\sum_{t=\bar{n}K+1}^{T}\mathbb{E}\left[\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\right]\leq\mathcal{O}(\epsilon^{-m}),

where $m=\max(2,3-k)$.

Although Lemma 6 contributes the main term of the regret, the proof of Lemma 5 is the main difficulty in the regret analysis. We found that our analysis does not yield a finite upper bound for $\mathrm{STS}$ with $k\in\mathbb{Z}_{<2}$, and we designed $\mathrm{STS}\text{-}\mathrm{T}$ to resolve this problem.

Lemma 6.

Under $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}$, it holds for any $a\in[K]$ that

\sum_{t=\bar{n}K+1}^{T}\mathbb{E}[\mathbbm{1}[j(t)=a,\mathcal{M}_{\epsilon}(t),\mathcal{E}_{a,N_{a}(t)}(\epsilon)]]\leq\max(0,k)+1+\frac{1}{\alpha_{a}\epsilon}\mathbbm{1}[k>0]+\frac{\log T}{D_{a,k}(\epsilon)},

where $D_{a,k}(\epsilon)>0$ is a finite deterministic problem-dependent constant satisfying $\lim_{\epsilon\to 0}D_{a,k}(\epsilon)=\mathrm{KL}_{\mathrm{inf}}(a)$.

Since a large $k$ yields a more conservative policy and requires additional initial plays of every arm, a large $k$ might induce a larger regret for a finite time horizon $T$, which corresponds to the components of the regret discussed in Lemma 6. Thus, this lemma implies that the policy has to be conservative to some extent, while being overly conservative induces a larger regret in finite time.

Lemma 7.

Under $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}_{\geq 0}$,

\sum_{t=\bar{n}K+1}^{T}\mathbb{E}\left[\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\right]\leq\mathcal{O}(\epsilon^{-1}).

The key to Lemma 7 is to convert the event on $\tilde{\mu}_{1}(t)$, namely $\mathcal{M}_{\epsilon}(t)$, into an event on $\tilde{\alpha}_{1}(t)$. Since $\mu(\kappa,\alpha)=\infty$ holds for $\alpha\leq 1$, $\tilde{\mu}_{1}=\mu(\tilde{\kappa}_{1},\tilde{\alpha}_{1})$ becomes infinite regardless of the value of $\tilde{\kappa}_{1}$ whenever $\tilde{\alpha}_{1}\leq 1$ holds, which implies $\mathbb{P}[\mathcal{M}_{\epsilon}^{c}(t),\tilde{\alpha}_{1}(t)\leq 1]=0$. Therefore, it suffices to consider the case $\tilde{\alpha}_{1}(t)>1$ to prove Lemma 7. Although the density functions of $\tilde{\alpha}_{1}$ under $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ are different, the conditional CDFs of $\tilde{\kappa}_{1}$ given $\tilde{\alpha}_{1}$ are the same and, as in (8), are given by

\mathbb{P}[\tilde{\kappa}_{1}\leq x\mid\mathcal{F}_{t},\tilde{\alpha}_{1}]=\left(\frac{x}{\hat{\kappa}_{1}(N_{1}(t))}\right)^{\tilde{\alpha}_{1}N_{1}(t)}.

Therefore, for sufficiently large $N_{1}(t)$ and $\tilde{\alpha}_{1}(t)>1$, $\tilde{\kappa}_{1}(t)$ concentrates on $\hat{\kappa}_{1}(N_{1}(t))$ with high probability, which is close to its true value $\kappa_{1}$ under the event $\mathcal{K}_{1,N_{1}(t)}(\epsilon)$. Thus, $\tilde{\mu}_{1}=\frac{\tilde{\kappa}_{1}\tilde{\alpha}_{1}}{\tilde{\alpha}_{1}-1}\geq\frac{\kappa_{1}\tilde{\alpha}_{1}}{\tilde{\alpha}_{1}-1}=\mu(\kappa_{1},\tilde{\alpha}_{1})$ holds with high probability, which implies that $\mathbb{P}[\mathcal{K}_{1,N_{1}(t)}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)|\mathcal{F}_{t}]$ can be roughly bounded by $\mathbb{P}[\mathcal{K}_{1,N_{1}(t)}(\epsilon),\tilde{\alpha}_{1}(t)\geq c|\mathcal{F}_{t}]$ for some problem-dependent constant $c>1$. Since $\mathcal{K}_{1,N_{1}(t)}(\epsilon)$ is deterministic given $\mathcal{F}_{t}$, we have

\mathbb{P}[\mathcal{K}_{1,N_{1}(t)}(\epsilon),\,\tilde{\alpha}_{1}(t)\geq c|\mathcal{F}_{t}]=\mathbbm{1}[\mathcal{K}_{1,N_{1}(t)}(\epsilon)]\,\mathbb{P}[\tilde{\alpha}_{1}(t)\geq c|\mathcal{F}_{t}],

which implies that $\tilde{\mu}_{1}(t)$ is mainly determined by the value of $\tilde{\alpha}_{1}(t)$ under the event $\mathcal{K}_{1,N_{1}(t)}(\epsilon)$ for both policies. In such cases, $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ behave like TS for the Pareto distribution with a known scale parameter, where $\tilde{\mu}_{1}(t):=\mu(\kappa_{1},\tilde{\alpha}_{1}(t))$ for $t\in\mathbb{N}$. The Pareto distribution with a known scale parameter belongs to the one-dimensional exponential family, for which Korda et al. (2013) showed the optimality of TS with the Jeffreys prior. Since the posterior of $\alpha$ under the Jeffreys prior in the one-parameter Pareto model is the Erlang distribution with shape $N_{1}(t)+1$, we can apply the results of Korda et al. (2013) to prove Lemma 7 by using properties of the Erlang distribution such as (11).
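As a rough numerical illustration of this concentration (our own example, not from the paper), the conditional CDF in (8) gives $\mathbb{P}[\tilde{\kappa}_{1}\leq(1-\delta)\hat{\kappa}_{1}\mid\tilde{\alpha}_{1}]=(1-\delta)^{\tilde{\alpha}_{1}N_{1}(t)}$, which decays geometrically in $N_{1}(t)$:

```python
# Probability that the sampled scale falls more than 5% below its MLE,
# for a fixed sampled shape alpha_tilde = 1.2.
delta, alpha_tilde = 0.05, 1.2
for n in (10, 50, 100, 500):
    print(n, (1 - delta) ** (alpha_tilde * n))
# n = 500 gives roughly 4e-14, so kappa_tilde is essentially kappa_hat for large n.
```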

Lemma 8.

Under $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}$, it holds for any $a\neq 1$ that

\sum_{t=\bar{n}K+1}^{T}\mathbb{E}\left[\mathbbm{1}[j(t)=a,\mathcal{E}_{a,N_{a}(t)}^{c}(\epsilon)]\right]\leq\mathcal{O}\left(\epsilon^{-2}\right).

Lemma 8 controls the regret induced when the estimators of the played arm are not close to their true parameters, which is not difficult to analyze, as in the usual analysis of TS. In fact, the proof of this lemma is straightforward since upper bounds on $\mathbb{P}[\mathcal{K}_{a}^{c}]$ and $\mathbb{P}[\mathcal{K}_{a},\mathcal{A}_{a}^{c}]$ can be easily derived from the distributions of $\hat{\kappa}_{a}$ and $\hat{\alpha}_{a}$ in (2).

7 Conclusion

We considered MAB problems under the Pareto distribution, which is heavy-tailed and follows a power law. While most previous research on TS has focused on one-parameter or light-tailed distributions, we focused on the Pareto distribution characterized by unknown scale and shape parameters. By sequentially sampling the parameters from their marginalized and conditional posterior distributions, we obtain an efficient sampling procedure. We showed that TS with an appropriate choice of priors achieves the problem-dependent optimal regret bound in this setting for the first time. Although the Jeffreys prior and the reference prior are shown to be suboptimal under the direct implementation of TS, we showed that they can achieve the optimal regret bound if we add a truncation procedure. Experimental results support our theoretical analysis, showing the optimality of conservative priors and the effectiveness of the truncation procedure for the Jeffreys prior and the reference prior.

Acknowledgement

JL was supported by JST SPRING, Grant Number JPMJSP2108. CC and MS were supported by the Institute for AI and Beyond, UTokyo.

References

  • Agrawal and Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. PMLR, 2012.
  • Agrawal and Goyal (2017) Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the Association for Computing Machinery, 64(5):1–24, 2017.
  • Agrawal et al. (2021) Shubhada Agrawal, Sandeep K Juneja, and Wouter M Koolen. Regret minimization in heavy-tailed bandits. In Conference on Learning Theory, pages 26–62. PMLR, 2021.
  • Arnold (2008) Barry C Arnold. Pareto and generalized Pareto distributions. In Modeling income distributions and Lorenz curves, pages 119–145. Springer, 2008.
  • Baudry et al. (2021) Dorian Baudry, Romain Gautron, Emilie Kaufmann, and Odalric Maillard. Optimal Thompson sampling strategies for support-aware CVaR bandits. In International Conference on Machine Learning, pages 716–726. PMLR, 2021.
  • Berger and Bernardo (1992) James O Berger and José M Bernardo. On the development of the reference prior method. Bayesian Statistics, 4(4):35–60, 1992.
  • Bernardo (1979) Jose M Bernardo. Reference posterior distributions for bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):113–128, 1979.
  • Bubeck and Liu (2013) Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, volume 26, pages 638–646, 2013.
  • Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Bubeck et al. (2013) Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
  • Burnetas and Katehakis (1996) Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
  • Chapelle and Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • Cover and Thomas (2006) Thomas M Cover and Joy A Thomas. Elements of information theory. Wiley-Interscience, 2nd edition, 2006.
  • Datta (1996) Gauri Sankar Datta. On priors providing frequentist validity of Bayesian inference for multiple parametric functions. Biometrika, 83(2):287–298, 1996.
  • Datta and Ghosh (1995) Gauri Sankar Datta and Malay Ghosh. Some remarks on noninformative priors. Journal of the American Statistical Association, 90(432):1357–1363, 1995.
  • Datta and Ghosh (1996) Gauri Sankar Datta and Malay Ghosh. On the invariance of noninformative priors. The Annals of Statistics, 24(1):141–159, 1996.
  • Datta and Sweeting (2005) Gauri Sankar Datta and Trevor J Sweeting. Probability matching priors. Handbook of Statistics, 25:91–114, 2005.
  • DiCiccio et al. (2017) Thomas J DiCiccio, Todd A Kuffner, and G Alastair Young. A simple analysis of the exact probability matching prior in the location-scale model. The American Statistician, 71(4):302–304, 2017.
  • Ghosh (2011) Malay Ghosh. Objective priors: An introduction for frequentists. Statistical Science, 26(2):187–202, 2011.
  • Honda and Takemura (2014) Junya Honda and Akimichi Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. In International Conference on Artificial Intelligence and Statistics, volume 33, pages 375–383. PMLR, 2014.
  • Jeffreys (1998) Harold Jeffreys. The theory of probability. OUP Oxford, 1998.
  • Kaufmann et al. (2012) Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite time analysis. In International Conference on Algorithmic Learning Theory, volume 7568, pages 199–213. Springer, 2012.
  • Kim et al. (2009) Dal-Ho Kim, Sang-Gil Kang, and Woo-Dong Lee. Noninformative priors for Pareto distribution. Journal of the Korean Data and Information Science Society, 20(6):1213–1223, 2009.
  • Korda et al. (2013) Nathaniel Korda, Emilie Kaufmann, and Remi Munos. Thompson sampling for 1-dimensional exponential family bandits. In International Conference on Neural Information Processing Systems, volume 26, pages 1448–1456. Curran Associates, Inc., 2013.
  • Lai et al. (1985) Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Li et al. (2022) Mingming Li, Huafei Sun, and Linyu Peng. Fisher–Rao geometry and Jeffreys prior for Pareto distribution. Communications in Statistics-Theory and Methods, 51(6):1895–1910, 2022.
  • Lomax (1954) Kenneth S Lomax. Business failures: Another example of the analysis of failure data. Journal of the American Statistical Association, 49(268):847–852, 1954.
  • Malik (1970) Henrick John Malik. Estimation of the parameters of the Pareto distribution. Metrika, 15(1):126–132, 1970.
  • Mukerjee and Ghosh (1997) Rahul Mukerjee and Malay Ghosh. Second-order probability matching priors. Biometrika, 84(4):970–975, 1997.
  • Riou and Honda (2020) Charles Riou and Junya Honda. Bandit algorithms based on Thompson sampling for bounded reward distributions. In International Conference on Algorithmic Learning Theory, pages 777–826. PMLR, 2020.
  • Robbins (1952) Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • Robert et al. (2007) Christian P Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer, 2nd edition, 2007.
  • Russo and Van Roy (2016) Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
  • Russo et al. (2018) Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
  • Schervish (2012) Mark J Schervish. Theory of statistics. Springer Science & Business Media, 2012.
  • Simon and Divsalar (1998) Marvin K Simon and Dariush Divsalar. Some new twists to problems involving the Gaussian probability integral. IEEE Transactions on Communications, 46(2):200–210, 1998.
  • Stein (1959) Charles Stein. An example of wide discrepancy between fiducial and confidence intervals. The Annals of Mathematical Statistics, 30(4):877–880, 1959.
  • Sun et al. (2020) Fupeng Sun, Yueqi Cao, Shiqiang Zhang, and Huafei Sun. The Bayesian inference of Pareto models based on information geometry. Entropy, 23(1):45, 2020.
  • Thompson (1933) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Tibshirani (1989) Robert Tibshirani. Noninformative priors for one parameter of many. Biometrika, 76(3):604–608, 1989.
  • Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • Wallace (1959) David L. Wallace. Bounds on normal approximations to student’s and the chi-square distributions. The Annals of Mathematical Statistics, 30(4):1121–1130, 1959.
  • Welch and Peers (1963) BL Welch and HW Peers. On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society: Series B (Methodological), 25(2):318–329, 1963.

Appendix A Notations

Tables 2–5 summarize the symbols used in this paper.

Table 2: Notations for the bandit problem.
Symbol | Meaning
$K$ | the number of arms.
$T$ | time horizon.
$j(t)$ | the index of the arm played at round $t$.
$k\in\mathbb{Z}$ | prior parameter; see Section 3 for details.
$\bar{n}=\max(2,k+1)$ | number of initial plays to avoid improper posteriors.
$N_{a}(t)$ | the number of plays of arm $a$ until round $t$.
$r_{a,n}$ | the $n$-th reward generated from arm $a$.
$\mu(\theta)$ | the expected value of a random variable following $\mathrm{Pa}(\theta)$.
$\mu_{a}=\mu(\theta_{a})$ | the expected reward of arm $a$.
$\Delta_{a}$ | sub-optimality gap of arm $a$.
$\delta_{a}=\frac{\Delta_{a}}{2}$ for $a\neq 1$ | half of the sub-optimality gap of arm $a$.
$\delta_{1}:=\min_{a\neq 1}\delta_{a}$ | the minimum of $\delta_{a}$ over $a\neq 1$.
$\mathcal{F}_{t}=(j(s),r_{j(s),N_{j(s)}(s)})_{s=1}^{t-1}$ | the history until round $t$.
$\mathbb{P}_{t}[\cdot]:=\mathbb{P}[\cdot|\mathcal{F}_{t}]$ | conditional probability given $\mathcal{F}_{t}$.
$g_{a}(c,\alpha)$ | KL divergence from $\mathrm{Pa}\left(\frac{\kappa_{a}}{c},\alpha\right)$ to $\mathrm{Pa}(\kappa_{a},\alpha_{a})$ for $c\geq 1$.
$h_{a}(c,\bm{\mu})=h_{a}(c)$ | the upper bound on $\alpha$ satisfying $\mu\left(\frac{\kappa_{a}}{c},\alpha\right)\geq\mu_{1}$ for $c\geq 1$.
Table 3: Notations for probability distributions and estimators.
Symbol | Meaning
$\mathrm{Pa}(\kappa,\alpha)$ | Pareto distribution with scale $\kappa>0$ and shape $\alpha>0$.
$f_{\kappa,\alpha}^{\mathrm{Pa}}(x)$ | density function of $\mathrm{Pa}(\kappa,\alpha)$.
$\mathrm{Erlang}(s,\beta)$ | Erlang distribution with shape $s>0$ and rate $\beta>0$.
$f_{s,\beta}^{\mathrm{Er}}(x)$ | density function of $\mathrm{Erlang}(s,\beta)$.
$F_{s,\beta}^{\mathrm{Er}}(x)$ | CDF of $\mathrm{Erlang}(s,\beta)$ evaluated at $x>0$.
$\mathrm{IG}(s,\beta)$ | inverse-gamma distribution with shape $s>0$ and scale $\beta>0$.
$F_{n}(x)$ | CDF of the chi-squared distribution with $n$ degrees of freedom.
$\Gamma(s)$ | gamma function.
$\gamma(s,x)$ | lower incomplete gamma function.
$\Gamma(s,x)$ | upper incomplete gamma function.
$\hat{\kappa}_{a}(n),\,\hat{\alpha}_{a}(n)$ | MLEs of the scale and shape parameters of arm $a$ after $n$ observations, defined in (2).
$\tilde{\kappa}_{a}(t),\,\tilde{\alpha}_{a}(t)$ | parameters sampled at round $t$ from the posterior distributions in (8) and (7).
$\bar{\alpha}_{a}(n)=\min(\hat{\alpha}_{a}(n),n)$ | truncated estimator of the shape parameter.
$\alpha_{a,n}$ | a temporary notation that can be replaced by both $\hat{\alpha}_{a}(n)$ ($\mathrm{STS}$) and $\bar{\alpha}_{a}(n)$ ($\mathrm{STS}\text{-}\mathrm{T}$).
$\hat{\mu}_{a}(n)=\mu(\hat{\kappa}_{a}(n),\hat{\alpha}_{a}(n))$ | mean reward computed from the MLEs after $n$ observations.
$\tilde{\mu}_{a}(t)=\mu(\tilde{\kappa}_{a}(t),\tilde{\alpha}_{a}(t))$ | mean reward computed from the sampled parameters $\tilde{\kappa}_{a}(t)$ and $\tilde{\alpha}_{a}(t)$ at round $t$.
$\theta_{a}=(\kappa_{a},\alpha_{a})$ | tuple of the true parameters of arm $a\in[K]$.
$\hat{\theta}_{a,n}=(\hat{\kappa}_{a}(n),\hat{\alpha}_{a}(n))$ | tuple of the MLEs of arm $a$ after $n$ observations.
$\bar{\theta}_{a,n}=(\hat{\kappa}_{a}(n),\bar{\alpha}_{a}(n))$ | tuple of the estimators with the truncation procedure of arm $a$ after $n$ observations.
$\theta_{a,n}=(\hat{\kappa}_{a}(n),\alpha_{a,n})$ | a temporary notation that can be replaced by both $\hat{\theta}_{a,n}$ ($\mathrm{STS}$) and $\bar{\theta}_{a,n}$ ($\mathrm{STS}\text{-}\mathrm{T}$).
Table 4: Notations for the regret analysis.
Symbol | Meaning
$D_{a,k}(\epsilon)$ | a function that contributes to the main term of the regret analysis, defined in (42).
$\mathcal{K}_{a,n}(\epsilon)$ | an event where the MLE of $\kappa$ is close to its true value after $n$ observations.
$\mathcal{A}_{a,n}(\epsilon)$ | an event where the MLE of $\alpha$ is close to its true value after $n$ observations.
$\mathcal{E}_{a,n}(\epsilon)$ | intersection of $\mathcal{K}_{a,n}(\epsilon)$ and $\mathcal{A}_{a,n}(\epsilon)$.
$\mathcal{M}_{\epsilon}(t)$ | an event where the sampled mean of the optimal arm is close to its true mean reward at round $t$.
$p_{n}(x|\theta_{1,n})$ | probability of $\{\tilde{\mu}_{1}(t)\leq\mu_{1}-x\}$ after $n$ observations of arm $1$ given $\theta_{1,n}$.
$G_{k}(x;n)$ | another expression of the CDF of the Erlang distribution in (20).
Table 5: Notations for (deterministic) constants.
Symbol | Meaning
$\varepsilon_{a}$ | problem-dependent constants chosen so that $\hat{\mu}_{a}(n)\in[\mu_{a}-\delta_{a},\mu_{a}+\delta_{a}]$ holds on $\mathcal{E}_{a,n}(\epsilon)$ for any $\epsilon\leq\varepsilon_{a}$.
$\epsilon_{a,l}(\epsilon)$, $\epsilon_{a,u}(\epsilon)$ | constants that control the deviation of $\hat{\alpha}_{a}(n)$ under the event $\mathcal{A}_{a,n}(\epsilon)$.
$\rho_{\theta_{1}}(\epsilon),\,\bar{\rho}=\rho_{\theta_{1}}(\epsilon/2)$ | the deviation from the true shape parameter $\alpha_{1}$ such that $\mu(\kappa_{a},\alpha+\rho_{\mu_{1}}(\epsilon))\geq\mu_{1}-\frac{\epsilon}{2}$ holds.
$C_{1}(\mu_{1},\epsilon,n)=C_{1,n}$ | a constant smaller than $1$ in (20).
$C_{2}(\mu_{1},\epsilon,k)$ | a uniform bound on $p_{n}(\epsilon|\cdot)$ under $\mathcal{K}_{1,n}^{c}(\epsilon)\cap\mathcal{A}_{1,n}(\epsilon/2)$.
$c_{\mu_{1}}(\epsilon)$ | a small constant with $\mathcal{O}(\epsilon^{-2})$.

Appendix B Proof of Lemma 1

Lemma 1 (restated). For any arm $a\neq 1$, it holds that $\mathrm{KL}_{\mathrm{inf}}(a)=\log\left(\alpha_{a}\frac{\mu_{1}-\kappa_{a}}{\mu_{1}}\right)+\frac{1}{\alpha_{a}}\frac{\mu_{1}}{\mu_{1}-\kappa_{a}}-1$.

Proof.

Recall the definition

KLinf(a)=KLinf(Pa(θa),Pa(θ)):=infθΘalogαaα+αlogκaκ+ααa1,\mathrm{KL}_{\mathrm{inf}}(a)=\mathrm{KL}_{\mathrm{inf}}(\mathrm{Pa}(\theta_{a}),\mathrm{Pa}(\theta)):=\inf_{\theta\in\Theta_{a}}\log\frac{\alpha_{a}}{\alpha}+\alpha\log\frac{\kappa_{a}}{\kappa}+\frac{\alpha}{\alpha_{a}}-1,

where $\theta=(\kappa,\alpha)$ and $\Theta_{a}$ is defined in (4).

Here, we consider the partition of Θa\Theta_{a},

Θa(1)\displaystyle\Theta_{a}^{(1)} ={(κ,α)(0,κa]×(0,1]:μ(κ,α)>μ1}=(0,κa]×(0,1]\displaystyle=\left\{(\kappa,\alpha)\in(0,\kappa_{a}]\times(0,1]:\mu({\kappa,\alpha})>\mu_{1}\right\}=(0,\kappa_{a}]\times(0,1]
Θa(2)\displaystyle\Theta_{a}^{(2)} ={(κ,α)(0,κa]×(1,):μ(κ,α)=καα1>μ1},\displaystyle=\left\{(\kappa,\alpha)\in(0,\kappa_{a}]\times(1,\infty):\mu({\kappa,\alpha})=\frac{\kappa\alpha}{\alpha-1}>\mu_{1}\right\},{} (13)

where Θa(1)Θa(2)=Θa\Theta_{a}^{(1)}\cup\Theta_{a}^{(2)}=\Theta_{a}. Therefore, it holds that

KLinf(a)=min(infθΘa(1)logαaα+αlogκaκ+ααa1,infθΘa(2)logαaα+αlogκaκ+ααa1).\mathrm{KL}_{\mathrm{inf}}(a)=\min\left(\inf_{\theta\in\Theta_{a}^{(1)}}\log\frac{\alpha_{a}}{\alpha}+\alpha\log\frac{\kappa_{a}}{\kappa}+\frac{\alpha}{\alpha_{a}}-1,\inf_{\theta\in\Theta_{a}^{(2)}}\log\frac{\alpha_{a}}{\alpha}+\alpha\log\frac{\kappa_{a}}{\kappa}+\frac{\alpha}{\alpha_{a}}-1\right).

For (κ,α)Θa(1)({\kappa,\alpha})\in\Theta_{a}^{(1)}, μ(κ,α)=\mu({\kappa,\alpha})=\infty holds regardless of κ\kappa. Therefore, we obtain

infθΘa(1)logαaα+αlogκaκ+ααa1\displaystyle\inf_{\theta\in\Theta_{a}^{(1)}}\log\frac{\alpha_{a}}{\alpha}+\alpha\log\frac{\kappa_{a}}{\kappa}+\frac{\alpha}{\alpha_{a}}-1 =infα(0,1]logαaα+ααa1\displaystyle=\inf_{\alpha\in(0,1]}\log\frac{\alpha_{a}}{\alpha}+\frac{\alpha}{\alpha_{a}}-1
=logαa+1αa1,\displaystyle=\log\alpha_{a}+\frac{1}{\alpha_{a}}-1,

where the last equality holds since logx+1x1\log x+\frac{1}{x}-1 is an increasing function for x1x\geq 1.

Let $\frac{\kappa_{a}}{\kappa}=c\geq 1$ so that the KL divergence from $\mathrm{Pa}(\theta_{a})$ to $\mathrm{Pa}(\kappa,\alpha)$ is well-defined. From the definition of $\Theta_{a}^{(2)}$ in (13), any $\theta=(\kappa,\alpha)\in\Theta_{a}^{(2)}$ satisfies $\frac{\kappa\alpha}{\alpha-1}\geq\mu_{1}$, i.e.,

κaαc(α1)μ1αcμ1cμ1κa=:ha(c,𝝁)=ha(c).\frac{\kappa_{a}\alpha}{c(\alpha-1)}\geq\mu_{1}\Leftrightarrow\alpha\leq\frac{c\mu_{1}}{c\mu_{1}-\kappa_{a}}=:h_{a}(c,\bm{\mu})=h_{a}(c).

Note that it holds that

ha(1)=μ1μ1κaμaμaκa=αah_{a}(1)=\frac{\mu_{1}}{\mu_{1}-\kappa_{a}}\leq\frac{\mu_{a}}{\mu_{a}-\kappa_{a}}=\alpha_{a}

since xxy\frac{x}{x-y} is decreasing with respect to xyx\geq y. Then, we can rewrite the infimum of KL divergence as

KLinf(a)=min(logαa+1αa1,infc1infαha(c)ga(α,c)),\mathrm{KL}_{\mathrm{inf}}(a)=\min\left(\log\alpha_{a}+\frac{1}{\alpha_{a}}-1,\inf_{c\geq 1}\inf_{\alpha\leq h_{a}(c)}g_{a}(\alpha,c)\right),

where ga(α,c):=logαaα+αlogc+ααa1g_{a}(\alpha,c):=\log\frac{\alpha_{a}}{\alpha}+\alpha\log c+\frac{\alpha}{\alpha_{a}}-1 satisfying

ga(α,c)α=1αa+logc1α.\frac{\partial g_{a}(\alpha,c)}{\partial\alpha}=\frac{1}{\alpha_{a}}+\log c-\frac{1}{\alpha}. (14)

Then, the inner infimum can be obtained when α=αa1+αalogc\alpha=\frac{\alpha_{a}}{1+\alpha_{a}\log c} holds if αa1+αalogc<ha(c)\frac{\alpha_{a}}{1+\alpha_{a}\log c}<h_{a}(c), where ga(αa1+αalogc,c)=log(1+αalogc)g_{a}\left(\frac{\alpha_{a}}{1+\alpha_{a}\log c},c\right)=\log(1+\alpha_{a}\log c).

Let ca1c_{a}^{*}\geq 1 be a deterministic constant such that

ha(ca)=caμ1caμ1κa=αa1+αalogca(μ1αa)calogca+(μ1αaμ1)ca=αaκah_{a}(c_{a}^{*})=\frac{c_{a}^{*}\mu_{1}}{c_{a}^{*}\mu_{1}-\kappa_{a}}=\frac{\alpha_{a}}{1+\alpha_{a}\log c_{a}^{*}}\Leftrightarrow(\mu_{1}\alpha_{a})c_{a}^{*}\log c_{a}^{*}+(\mu_{1}-\alpha_{a}\mu_{1})c_{a}^{*}=-\alpha_{a}\kappa_{a} (15)

so that ha(c)αa1+αalogch_{a}(c)\geq\frac{\alpha_{a}}{1+\alpha_{a}\log c} holds for any ccac\geq c_{a}^{*}. Since the solution of axlog(x)+bx=cax\log(x)+bx=-c is given as exp(W(ceb/aa)ba)\exp\left(W\left(-\frac{ce^{b/a}}{a}\right)-\frac{b}{a}\right) for principal branch of Lambert W function W()W(\cdot), one can obtain cac_{a}^{*} by solving the equality in (15), which is

ca=exp(W(κaμ1e1αa1)+11αa).c_{a}^{*}=\exp\left(W\left(-\frac{\kappa_{a}}{\mu_{1}}e^{\frac{1}{\alpha_{a}}-1}\right)+1-\frac{1}{\alpha_{a}}\right). (16)

Notice that $\frac{\kappa_{a}}{\mu_{1}}e^{\frac{1}{\alpha_{a}}-1}\leq\frac{\kappa_{a}}{\mu_{a}}e^{\frac{1}{\alpha_{a}}-1}=\left(1-\frac{1}{\alpha_{a}}\right)e^{-\left(1-\frac{1}{\alpha_{a}}\right)}\leq e^{-1}$ holds, so that $c_{a}^{*}$ is a real value. Here, we consider the principal branch to ensure $c_{a}^{*}\geq 1$, since the solution on the other branch, $W_{-1}(\cdot)$, is less than $1$, which is not of interest here.
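As a quick numerical sanity check of (15) and (16) (illustration only; the instance $(\kappa_a,\alpha_a,\mu_1)$ is arbitrary), the following Python snippet computes $c_a^*$ with the principal branch of the Lambert $W$ function and verifies that it solves (15) and satisfies $c_a^*\geq 1$.

```python
# Illustration only: verify that c_a^* from (16), computed with the principal branch W_0,
# solves (15) and is at least 1.  The instance (kappa_a, alpha_a, mu_1) is arbitrary.
import numpy as np
from scipy.special import lambertw

kappa_a, alpha_a, mu_1 = 1.0, 3.0, 2.0

w_arg = -(kappa_a / mu_1) * np.exp(1.0 / alpha_a - 1.0)
assert w_arg >= -1.0 / np.e                        # the argument stays in the domain of W_0
c_star = np.exp(lambertw(w_arg, k=0).real + 1.0 - 1.0 / alpha_a)

# Equation (15): (mu_1 alpha_a) c log c + mu_1 (1 - alpha_a) c = -alpha_a kappa_a.
lhs = mu_1 * alpha_a * c_star * np.log(c_star) + mu_1 * (1.0 - alpha_a) * c_star
print(c_star >= 1.0, np.isclose(lhs, -alpha_a * kappa_a))   # expect: True True
```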

Let Aa=11αaA_{a}=1-\frac{1}{\alpha_{a}}, which is positive as αa>1\alpha_{a}>1 and Ba=κaμ1κaμa=αa1αa=AaB_{a}=\frac{\kappa_{a}}{\mu_{1}}\leq\frac{\kappa_{a}}{\mu_{a}}=\frac{\alpha_{a}-1}{\alpha_{a}}=A_{a}. Then, we can rewrite cac_{a}^{*} as

ca=eAaeW(BaeAa)=eAaeAaBaW(BaeAa).eW(x)=xW(x)c_{a}^{*}=e^{A_{a}}e^{W(-B_{a}e^{-A_{a}})}=e^{A_{a}}e^{-A_{a}}\frac{-B_{a}}{W(-B_{a}e^{-A_{a}})}.\hskip 30.00005pt\because e^{W(x)}=\frac{x}{W(x)}

Since the principal branch of Lambert W function, W(x)W(x), is increasing for x1ex\geq-\frac{1}{e}, we have

0>W(BaeAa)W(BaeBa)=Ba,0>W(-B_{a}e^{-A_{a}})\geq W(-B_{a}e^{-B_{a}})=-B_{a},

which implies that ca1c_{a}^{*}\geq 1. Therefore, the infimum of gag_{a} can be written as

infc1infαha(c)ga(α,c)\displaystyle\inf_{c\geq 1}\inf_{\alpha\leq h_{a}(c)}g_{a}(\alpha,c) =min(infc[1,ca]ga(ha(c),c),infccalog(1+αalogc))\displaystyle=\min\left(\inf_{c\in[1,c_{a}^{*}]}g_{a}(h_{a}(c),c),\inf_{c\geq c_{a}^{*}}\log(1+\alpha_{a}\log c)\right)
=min(infc[1,ca]ga(ha(c),c),log(1+αalogca)),\displaystyle=\min\left(\inf_{c\in[1,c_{a}^{*}]}g_{a}(h_{a}(c),c),\log(1+\alpha_{a}\log c_{a}^{*})\right),

where we follow the convention that the infimum over the empty set is defined as infinity.

By substituting cac_{a}^{*} in (16), we obtain

log(1+αalogca)=log(αa+W(κaμ1e1αa1)).\log(1+\alpha_{a}\log c_{a}^{*})=\log\left(\alpha_{a}+W\left(-\frac{\kappa_{a}}{\mu_{1}}e^{\frac{1}{\alpha_{a}}-1}\right)\right).

Let us consider the following inequalities:

log(αa+W(κaμ1e1αa1))\displaystyle\log\left(\alpha_{a}+W\left(-\frac{\kappa_{a}}{\mu_{1}}e^{\frac{1}{\alpha_{a}}-1}\right)\right) log(αa+W(κaμae1αa1))\displaystyle\geq\log\left(\alpha_{a}+W\left(-\frac{\kappa_{a}}{\mu_{a}}e^{\frac{1}{\alpha_{a}}-1}\right)\right)
=log(αa+W(1αaαae1αa1))\displaystyle=\log\left(\alpha_{a}+W\left(\frac{1-\alpha_{a}}{\alpha_{a}}e^{\frac{1}{\alpha_{a}}-1}\right)\right)
=log(αa+1αa1),\displaystyle=\log\left(\alpha_{a}+\frac{1}{\alpha_{a}}-1\right),{} (17)

where the first inequality holds since the principal branch of Lambert W function W(x)W(x) is increasing and negative with respect to x[1/e,0)x\in\left[-1/e,0\right).

It remains to find the closed form of infc[1,ca]ga(ha(c),c)\inf_{c\in[1,c_{a}^{*}]}g_{a}(h_{a}(c),c). From the definition of ha(c)=cμ1cμ1κah_{a}(c)=\frac{c\mu_{1}}{c\mu_{1}-\kappa_{a}}, we have ha(c)=μ1κa(cμ1κa)2h_{a}^{\prime}(c)=-\frac{\mu_{1}\kappa_{a}}{(c\mu_{1}-\kappa_{a})^{2}} and

ga(ha(c),c)c\displaystyle\frac{\partial g_{a}(h_{a}(c),c)}{\partial c} =c(logαaha(c)+ha(c)logc+ha(c)αa1)ga(x,c)=logαax+xlogc+xαa1\displaystyle=\frac{\partial}{\partial c}\left(\log\frac{\alpha_{a}}{h_{a}(c)}+h_{a}(c)\log c+\frac{h_{a}(c)}{\alpha_{a}}-1\right)\hskip 20.00003pt\because g_{a}(x,c)=\log\frac{\alpha_{a}}{x}+x\log c+\frac{x}{\alpha_{a}}-1
=ha(c)ha(c)+ha(c)logc+ha(c)c+1αaha(c)\displaystyle=-\frac{h_{a}^{\prime}(c)}{h_{a}(c)}+h_{a}^{\prime}(c)\log c+\frac{h_{a}(c)}{c}+\frac{1}{\alpha_{a}}h_{a}^{\prime}(c)
=κac(cμ1κa)μ1κa(cμ1κa)2logc+μ1cμ1κaκaμ1αa(cμ1κa)2\displaystyle=\frac{\kappa_{a}}{c(c\mu_{1}-\kappa_{a})}-\frac{\mu_{1}\kappa_{a}}{(c\mu_{1}-\kappa_{a})^{2}}\log c+\frac{\mu_{1}}{c\mu_{1}-\kappa_{a}}-\frac{\kappa_{a}\mu_{1}}{\alpha_{a}(c\mu_{1}-\kappa_{a})^{2}}
=κac(cμ1κa)μ1κa(cμ1κa)2logc+μ1cμ1κaκaαa(cμ1κa)2.\displaystyle=\frac{\kappa_{a}}{c(c\mu_{1}-\kappa_{a})}-\frac{\mu_{1}\kappa_{a}}{(c\mu_{1}-\kappa_{a})^{2}}\log c+\mu_{1}\frac{c\mu_{1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}}{(c\mu_{1}-\kappa_{a})^{2}}.{} (18)

Since the first term in (18) is positive for c1c\geq 1 and μ1μa>κa\mu_{1}\geq\mu_{a}>\kappa_{a}, let us consider the last two terms for c[1,ca]c\in[1,c_{a}^{*}],

μ1κa(cμ1κa)2logc+μ1cμ1κaκaαa(cμ1κa)2\displaystyle-\frac{\mu_{1}\kappa_{a}}{(c\mu_{1}-\kappa_{a})^{2}}\log c+\mu_{1}\frac{c\mu_{1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}}{(c\mu_{1}-\kappa_{a})^{2}} =μ1(cμ1κa)2(cμ1κaκaαaκalogc)\displaystyle=\frac{\mu_{1}}{(c\mu_{1}-\kappa_{a})^{2}}\left(c\mu_{1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}-\kappa_{a}\log c\right)
=μ1(cμ1κa)2(μ1κaκaαa+(c1)μ1κalogc)\displaystyle=\frac{\mu_{1}}{(c\mu_{1}-\kappa_{a})^{2}}\left(\mu_{1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}+(c-1)\mu_{1}-\kappa_{a}\log c\right)
=μ1(cμ1κa)2(μ1κaκaαa+μ1(cκaμ1logc1)).\displaystyle=\frac{\mu_{1}}{(c\mu_{1}-\kappa_{a})^{2}}\left(\mu_{1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}+\mu_{1}\left(c-\frac{\kappa_{a}}{\mu_{1}}\log c-1\right)\right).

Here,

μ1κaκaαaμaκaκaαa=κaαaαa1κaκaαa=κaαa(αa1)>0,\mu_{1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}\geq\mu_{a}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}=\frac{\kappa_{a}\alpha_{a}}{\alpha_{a}-1}-\kappa_{a}-\frac{\kappa_{a}}{\alpha_{a}}=\frac{\kappa_{a}}{\alpha_{a}(\alpha_{a}-1)}>0,

and cκaμ1logc1c-\frac{\kappa_{a}}{\mu_{1}}\log c-1 is increasing with respect to cc so that cκaμ1logc10c-\frac{\kappa_{a}}{\mu_{1}}\log c-1\geq 0 for c1c\geq 1. Therefore, cga(ha(c),c)\frac{\partial}{\partial c}g_{a}(h_{a}(c),c) is positive for c1c\geq 1, i.e., ga(ha(c),c)g_{a}(h_{a}(c),c) is an increasing function with respect to c1c\geq 1.

Thus, we have for c[1,ca]c\in[1,c_{a}^{*}],

infc[1,ca]ga(ha(c),c)=ga(ha(1),1)=ga(μ1μ1κa,1)=log(αaμ1κaμ1)+1αaμ1μ1κa1.\inf_{c\in[1,c_{a}^{*}]}g_{a}(h_{a}(c),c)=g_{a}\left(h_{a}(1),1\right)=g_{a}\left(\frac{\mu_{1}}{\mu_{1}-\kappa_{a}},1\right)=\log\left(\alpha_{a}\frac{\mu_{1}-\kappa_{a}}{\mu_{1}}\right)+\frac{1}{\alpha_{a}}\frac{\mu_{1}}{\mu_{1}-\kappa_{a}}-1.

Denote Xa=αaμ1κaμ1[1,αa)X_{a}=\alpha_{a}\frac{\mu_{1}-\kappa_{a}}{\mu_{1}}\in[1,\alpha_{a}) where Xa=1X_{a}=1 happens only when μa=μ1\mu_{a}=\mu_{1}. Then, we have for αa>1\alpha_{a}>1

log(Xa)+1Xa1logαa+1αa1log(αa+1αa1)log(1+αalogca),\log(X_{a})+\frac{1}{X_{a}}-1\leq\log\alpha_{a}+\frac{1}{\alpha_{a}}-1\leq\log\left(\alpha_{a}+\frac{1}{\alpha_{a}}-1\right)\leq\log(1+\alpha_{a}\log c_{a}^{*}),

where the last inequality comes from the result in (17). Therefore, we have

KLinf(a)\displaystyle\mathrm{KL}_{\mathrm{inf}}(a) =min(logαa+1αa1,infc[1,ca]ga(ha(c),c),log(1+αalogca))\displaystyle=\min\left(\log\alpha_{a}+\frac{1}{\alpha_{a}}-1,\inf_{c\in[1,c_{a}^{*}]}g_{a}(h_{a}(c),c),\log(1+\alpha_{a}\log c_{a}^{*})\right)
=log(αaμ1κaμ1)+1αaμ1μ1κa1,\displaystyle=\log\left(\alpha_{a}\frac{\mu_{1}-\kappa_{a}}{\mu_{1}}\right)+\frac{1}{\alpha_{a}}\frac{\mu_{1}}{\mu_{1}-\kappa_{a}}-1,

which concludes the proof. ∎
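As a sanity check of the closed form above (not part of the proof), the following Python snippet compares it with a brute-force minimization of the KL divergence over a grid of $\Theta_a$, reconstructed from the partition in (13); the instance is arbitrary.

```python
# Illustration only: compare the closed form of KL_inf(a) from Lemma 1 with a brute-force
# grid minimization of KL(Pa(theta_a), Pa(theta)) over Theta_a, reconstructed from (13) as
# {(kappa, alpha): kappa in (0, kappa_a], mu(kappa, alpha) > mu_1}.
import numpy as np

kappa_a, alpha_a = 1.0, 3.0                       # suboptimal arm Pa(1, 3), mean mu_a = 1.5
mu_1 = 2.0                                        # mean of the optimal arm (mu_1 >= mu_a)

def kl_pareto(kappa, alpha):
    return np.log(alpha_a / alpha) + alpha * np.log(kappa_a / kappa) + alpha / alpha_a - 1.0

X_a = alpha_a * (mu_1 - kappa_a) / mu_1
kl_closed = np.log(X_a) + 1.0 / X_a - 1.0         # closed form from Lemma 1

kappas = np.linspace(1e-3, kappa_a, 1000)
alphas = np.linspace(1e-3, 20.0, 2000)
K, A = np.meshgrid(kappas, alphas)
with np.errstate(divide="ignore", invalid="ignore"):
    mean = np.where(A > 1.0, K * A / (A - 1.0), np.inf)   # mu(kappa, alpha); infinite if alpha <= 1
feasible = mean > mu_1                            # membership in Theta_a
kl_grid = kl_pareto(K[feasible], A[feasible]).min()

print(f"closed form: {kl_closed:.6f}, grid search: {kl_grid:.6f}")  # should nearly agree
```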

Appendix C Proofs of lemmas for Theorems 2 and 4

In this section, we provide the proof of lemmas for the main results.

To avoid redundancy, we use a temporary notation αa,n\alpha_{a,n} when the same result holds for both α^a(n)\hat{\alpha}_{a}(n) and α¯a(n)\bar{\alpha}_{a}(n). When αa,n\alpha_{a,n} notation is used, one can replace it with either α^a(n)\hat{\alpha}_{a}(n) or α¯a(n)\bar{\alpha}_{a}(n) depending on which policy we are considering. For example, it holds that

αa,n1{α^a(n)1under STS policy,α¯a(n)1under STS-T policy.\displaystyle\alpha_{a,n}\leq 1\Leftrightarrow\begin{cases}\hat{\alpha}_{a}(n)\leq 1&\text{under $\mathrm{STS}$ policy},\\ \bar{\alpha}_{a}(n)\leq 1&\text{under $\mathrm{STS}\text{-}\mathrm{T}$ policy}.\end{cases}

Similarly, we use the notation $\theta_{a,n}:=(\hat{\kappa}_{a}(n),\alpha_{a,n})$ when it can be replaced by both $\hat{\theta}_{a,n}=(\hat{\kappa}_{a}(n),\hat{\alpha}_{a}(n))$ and $\bar{\theta}_{a,n}=(\hat{\kappa}_{a}(n),\bar{\alpha}_{a}(n))$ for any arm $a\in[K]$ and $n\in\mathbb{N}$. Based on the $\theta_{a,n}$ notation, we provide an inequality on the posterior probability that the sampled mean is smaller than a given value, whose proof is given in Appendix C.5.
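For concreteness, the following Python sketch (our illustration, not the authors' implementation) shows one posterior-sampling step consistent with the quantities used below: the MLEs, the truncation $\bar{\alpha}_a(n)=\min(\hat{\alpha}_a(n),n)$, the marginal Erlang posterior of the sampled shape, and a conditional draw of the sampled scale whose CDF $(u/\hat{\kappa}_a)^{n\tilde{\alpha}_a}$ on $(0,\hat{\kappa}_a]$ is reconstructed from the manipulation of (8) in Appendix C.2. The MLE formulas are those implied by (2) and Appendix C.3, and all function and parameter names are ours.

```python
# A schematic sketch of one sampling step of STS / STS-T (illustration under the assumptions
# stated above; not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

def pareto_mles(rewards):
    """MLEs of the Pareto scale and shape, consistent with n/alpha_hat = sum log(r/kappa_hat)."""
    n = len(rewards)
    kappa_hat = np.min(rewards)
    alpha_hat = n / np.sum(np.log(rewards / kappa_hat))
    return kappa_hat, alpha_hat

def sample_posterior_mean(rewards, k=0, truncate=False):
    """One draw of the sampled mean for STS (truncate=False) or STS-T (truncate=True)."""
    n = len(rewards)
    kappa_hat, alpha_hat = pareto_mles(rewards)
    alpha_an = min(alpha_hat, n) if truncate else alpha_hat        # bar-alpha vs hat-alpha
    alpha_tilde = rng.gamma(shape=n - k, scale=alpha_an / n)       # Erlang(n - k, n / alpha_{a,n})
    kappa_tilde = kappa_hat * rng.uniform() ** (1.0 / (n * alpha_tilde))  # inverse-CDF draw on (0, kappa_hat]
    if alpha_tilde <= 1.0:
        return np.inf                                              # sampled mean is infinite
    return kappa_tilde * alpha_tilde / (alpha_tilde - 1.0)

# Example: rewards from Pa(kappa=1, alpha=3); the sampled mean concentrates around mu = 1.5.
rewards = 1.0 * (1.0 + rng.pareto(3.0, size=200))   # Pa(1, 3) via numpy's Lomax sampler
print(sample_posterior_mean(rewards, k=0, truncate=True))
```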

Lemma 9.

For any arm a[K]a\in[K] and fixed tt\in\mathbb{N}, let Na(t)=nN_{a}(t)=n. For any positive ξyyκa\xi\leq\frac{y}{y-\kappa_{a}} and kk\in\mathbb{Z}, it holds that

𝟙[κ^a(n)y][μ~a(t)y|θa,n]ξfnk,nαa,nEr(x)dx+(yμ((κa,ξ)))n1ξfnk,nαa,nEr(x)dx,\displaystyle\mathbbm{1}[\hat{\kappa}_{a}(n)\leq y]\mathbb{P}[\tilde{\mu}_{a}(t)\leq y|\theta_{a,n}]\leq\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\mathrm{Er}}(x)\mathrm{d}x+\bigg{(}\frac{y}{\mu((\kappa_{a},\xi))}\bigg{)}^{n}\int_{1}^{\xi}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\mathrm{Er}}(x)\mathrm{d}x,

where fs,βEr()f_{s,\beta}^{\mathrm{Er}}(\cdot) denotes the probability density function of the Erlang distribution with shape ss\in\mathbb{N} and rate β>0\beta>0.

Based on the $\theta_{1,n}$ notation, we denote the probability that the sampled mean of arm $1$ after $n$ observations is smaller than $\mu_{1}-x$ by

pn(x|θ1,n):=[μ~1μ1x|κ^1(n),α1,n].p_{n}(x|\theta_{1,n}):=\mathbb{P}[\tilde{\mu}_{1}\leq\mu_{1}-x|\hat{\kappa}_{1}(n),\alpha_{1,n}]. (19)

Let $K(\epsilon)=(\kappa_{1}+\epsilon,\mu_{1}-\epsilon)$ be an open interval on $\mathbb{R}$. Lemma 10 below gives an upper bound on $p_{n}$ conditioned on $\theta_{1,n}$.

Lemma 10.

Given ϵ>0\epsilon>0, define a positive problem-dependent constant ρ=ρθ1(ϵ):=κ1ϵ2(μ1ϵ/2κ1)(μ1κ1)\rho=\rho_{\theta_{1}}(\epsilon):=\frac{\kappa_{1}\epsilon}{2(\mu_{1}-\epsilon/2-\kappa_{1})(\mu_{1}-\kappa_{1})}. If nn¯=max(2,k+1)n\geq\bar{n}=\max(2,k+1), then for k0k\in\mathbb{Z}_{\geq 0}

pn(ϵ|θ1,n){en,ifκ^1(n)μ1ϵ,h(μ1,ϵ,n),ifκ^1(n)K(ϵ),α1,nα1+ρ,C1(μ1,ϵ,n)Gk(1/α1,n;n)+1Gk(1/α1,n;n)ifκ^1(n)K(ϵ),α1,nα1+ρ,\displaystyle p_{n}(\epsilon|\theta_{1,n})\leq\begin{cases}e^{-n},&\mathrm{if}~{}\hat{\kappa}_{1}(n)\geq\mu_{1}-\epsilon,\\ h(\mu_{1},\epsilon,n),&\mathrm{if}~{}\hat{\kappa}_{1}(n)\in K(\epsilon),\alpha_{1,n}\leq\alpha_{1}+\rho,\\ C_{1}(\mu_{1},\epsilon,n)G_{k}(1/\alpha_{1,n};n)+1-G_{k}(1/\alpha_{1,n};n)&\mathrm{if}~{}\hat{\kappa}_{1}(n)\in K(\epsilon),\alpha_{1,n}\geq\alpha_{1}+\rho,\end{cases}

where

h(μ1,ϵ,n)\displaystyle h(\mu_{1},\epsilon,n) :=en3ϵ4μ1(112encμ1(ϵ))+12encμ1(ϵ)\displaystyle:=e^{-n\frac{3\epsilon}{4\mu_{1}}}\left(1-\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}\right)+\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}
C1(μ1,ϵ,n)\displaystyle C_{1}(\mu_{1},\epsilon,n) :=(μ1ϵμ1ϵ/2)nenϵ2μ1ϵ<1.\displaystyle:=\left(\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon/2}\right)^{n}\leq e^{-n\frac{\epsilon}{2\mu_{1}-\epsilon}}<1.
Gk(x;n)\displaystyle G_{k}(x;n) :=Fnk,nxEr(α1+ρ)\displaystyle:=F^{\mathrm{Er}}_{n-k,nx}(\alpha_{1}+\rho){} (20)

for $F^{\mathrm{Er}}$ defined in (10). Here, $c_{\mu_{1}}(\epsilon)=\zeta-\log(1+\zeta)=\mathcal{O}(\epsilon^{2})$ and $\zeta=\frac{\epsilon}{4\mu_{1}-2\epsilon}\in(0,1)$ are deterministic constants depending only on $\mu_{1}$ and $\epsilon$.

Notice that μ((κ1,α1+ρ))=μ1ϵ/2\mu((\kappa_{1},\alpha_{1}+\rho))=\mu_{1}-\epsilon/2 holds and there exists a problem-dependent constant C2(μ1,ϵ,k)<1C_{2}(\mu_{1},\epsilon,k)<1 such that for any nn¯=max(2,k+1)n\geq\bar{n}=\max(2,k+1) and ϵ>0\epsilon>0

h(μ1,ϵ,n)1C2(μ1,ϵ,k).h(\mu_{1},\epsilon,n)\leq 1-C_{2}(\mu_{1},\epsilon,k). (21)

C.1 Proof of Lemma 5

Let us start by stating a well-known fact that is utilized in the proof.

Fact 11.

When XErlang(n,β)X\sim\mathrm{Erlang}(n,\beta) with rate parameter β\beta, then 1X\frac{1}{X} follows the inverse gamma distribution with shape nn\in\mathbb{N} and scale β+\beta\in\mathbb{R}_{+}, i.e., 1XIG(n,β)\frac{1}{X}\sim\mathrm{IG}(n,\beta).
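The following one-off simulation (illustration only) checks Fact 11 empirically with scipy.

```python
# Illustration only: if X ~ Erlang(n, rate beta), then 1/X ~ IG(n, scale beta).
import numpy as np
from scipy import stats

n, beta = 7, 2.5
x = np.random.default_rng(1).gamma(shape=n, scale=1.0 / beta, size=200_000)  # Erlang(n, beta)
# The Kolmogorov-Smirnov distance between 1/X and IG(n, beta) should be close to zero.
print(stats.kstest(1.0 / x, stats.invgamma(a=n, scale=beta).cdf).statistic)
```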

Lemma 5 (restated).

Proof.

Let us consider the following decomposition that holds under both STS\mathrm{STS} and STS-T\mathrm{STS}\text{-}\mathrm{T}:

t=Kn¯+1T𝟙[j(t)1,𝒦1,N1(t)c(ϵ),ϵc(t)]\displaystyle\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]
=n=n¯Tt=Kn¯+1T𝟙[j(t)1,𝒦1,N1(t)c(ϵ),ϵc(t),N1(t)=n]\displaystyle=\sum_{n=\bar{n}}^{T}\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}\bigg{[}j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t),N_{1}(t)=n\bigg{]}
=n=n¯Tm=1T𝟙[mt=Kn¯+1T𝟙[j(t)1,ϵc(t),𝒦1,N1(t)c(ϵ),N1(t)=n]].\displaystyle=\sum_{n=\bar{n}}^{T}\sum_{m=1}^{T}\mathbbm{1}\Bigg{[}m\leq\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}\bigg{[}j(t)\neq 1,\mathcal{M}_{\epsilon}^{c}(t),\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),N_{1}(t)=n\bigg{]}\Bigg{]}.

Notice that

mt=Kn¯+1T𝟙[j(t)1,𝒦1,N1(t)c(ϵ),ϵc(t),N1(t)=n]m\leq\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t),N_{1}(t)=n]

implies that μ~1(t)μ1ϵ\tilde{\mu}_{1}(t)\leq\mu_{1}-\epsilon occurred mm times in a row on {t:𝒦1,N1(t)c(ϵ),ϵc(t),N1(t)=n}\{t:\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t),N_{1}(t)=n\}. Thus, we have

𝔼[t=Kn¯+1T𝟙[j(t)1,𝒦1,N1(t)c(ϵ),ϵc(t)]]\displaystyle\mathbb{E}\Bigg{[}\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\Bigg{]} 𝔼[n=n¯Tm=1T𝟙[𝒦1,nc(ϵ)]pn(ϵ|θ1,n)m]\displaystyle\leq\mathbb{E}\left[\sum_{n=\bar{n}}^{T}\sum_{m=1}^{T}\mathbbm{1}[\mathcal{K}_{1,n}^{c}(\epsilon)]p_{n}(\epsilon|{\theta}_{1,n})^{m}\right]
n=n¯T𝔼[𝟙[𝒦1,nc(ϵ)]pn(ϵ|θ1,n)1pn(ϵ|θ1,n)]\displaystyle\leq\sum_{n=\bar{n}}^{T}\mathbb{E}\left[\mathbbm{1}\left[\mathcal{K}_{1,n}^{c}(\epsilon)\right]\frac{p_{n}(\epsilon|{\theta}_{1,n})}{1-p_{n}(\epsilon|{\theta}_{1,n})}\right]{} (22)

for pn(|)p_{n}(\cdot|\cdot) defined in (19). From now on, we fix nn¯n\geq\bar{n} and drop the argument nn of κ^1(n)\hat{\kappa}_{1}(n), α^1(n)\hat{\alpha}_{1}(n) and α¯1(n)\bar{\alpha}_{1}(n) for simplicity.

Under STS\mathrm{STS} with k2k\in\mathbb{Z}_{\geq 2}:

By applying Lemma 10 and (21) under STS\mathrm{STS} with k0k\in\mathbb{Z}_{\geq 0}, we can decompose the expectation in (22) as

𝔼[𝟙[𝒦1,nc(ϵ)]pn(ϵ|θ^1,n)1pn(ϵ|θ^1,n)][κ^1μ1ϵ]en1en\displaystyle\mathbb{E}\left[\mathbbm{1}\left[\mathcal{K}_{1,n}^{c}(\epsilon)\right]\frac{p_{n}(\epsilon|\hat{\theta}_{1,n})}{1-p_{n}(\epsilon|\hat{\theta}_{1,n})}\right]\leq\mathbb{P}[\hat{\kappa}_{1}\geq\mu_{1}-\epsilon]\frac{e^{-n}}{1-e^{-n}} +[κ^1K(ϵ),α^1<α1+ρ]h(μ1,ϵ,n)C2(μ1,ϵ,k)\displaystyle+\mathbb{P}[\hat{\kappa}_{1}\in K(\epsilon),\hat{\alpha}_{1}<\alpha_{1}+\rho]\frac{h(\mu_{1},\epsilon,n)}{C_{2}(\mu_{1},\epsilon,k)}
+𝔼θ^1,n[𝟙[κ^1K(ϵ),α^1>α1+ρ]Gk(1/α^1;n)(1C1,n)](),\displaystyle\hskip 10.00002pt+\underbrace{\mathbb{E}_{\hat{\theta}_{1,n}}\left[\frac{\mathbbm{1}[\hat{\kappa}_{1}\in K(\epsilon),\hat{\alpha}_{1}>\alpha_{1}+\rho]}{G_{k}(1/\hat{\alpha}_{1};n)(1-C_{1,n})}\right]}_{(\divideontimes)},{} (23)

where we denoted C1,n=C1(μ1,ϵ,n)C_{1,n}=C_{1}(\mu_{1},\epsilon,n). For simplicity, let us define z:=1α^1z:=\frac{1}{\hat{\alpha}_{1}} where zErlang(n1,nα1)z\sim\mathrm{Erlang}(n-1,n\alpha_{1}) holds from Fact (11) since α^1IG(n1,nα1)\hat{\alpha}_{1}\sim\mathrm{IG}(n-1,n\alpha_{1}) in (2). From the independence of κ^\hat{\kappa} and α^\hat{\alpha} and distributions of zz and α^\hat{\alpha} in (9) and (2), respectively, we have

()\displaystyle(\divideontimes) =01α1+ρzn2enα1z(nα1)n1Γ(n1)κ^1K(ϵ)fκ1,nα1Pa(κ^1)Gk(z;n)(1C1,n)dκ^1dz\displaystyle=\int_{0}^{\frac{1}{\alpha_{1}+\rho}}z^{n-2}e^{-n\alpha_{1}z}\frac{(n\alpha_{1})^{n-1}}{\Gamma(n-1)}\int_{\hat{\kappa}_{1}\in K(\epsilon)}\frac{f^{\mathrm{Pa}}_{\kappa_{1},n\alpha_{1}}(\hat{\kappa}_{1})}{G_{k}(z;n)(1-C_{1,n})}\mathrm{d}\hat{\kappa}_{1}\mathrm{d}z
=[κ^1K(ϵ)]01α1+ρzn2enα1zGk(z;n)(1C1,n)(nα1)n1Γ(n1)dz.\displaystyle=\mathbb{P}[\hat{\kappa}_{1}\in K(\epsilon)]\int_{0}^{\frac{1}{\alpha_{1}+\rho}}\frac{z^{n-2}e^{-n\alpha_{1}z}}{G_{k}(z;n)(1-C_{1,n})}\frac{(n\alpha_{1})^{n-1}}{\Gamma(n-1)}\mathrm{d}z.

By substituting the CDF in (10), we obtain

Gk(z;n)\displaystyle G_{k}(z;n) =Fnk,nzEr(α1+ρ)\displaystyle=F^{\mathrm{Er}}_{n-k,nz}(\alpha_{1}+\rho)
=1Γ(nk)0n(α1+ρ)zettnk1dt\displaystyle=\frac{1}{\Gamma(n-k)}\int_{0}^{n(\alpha_{1}+\rho)z}e^{-t}t^{n-k-1}\mathrm{d}t
en(α1+ρ)zΓ(nk)0n(α1+ρ)ztnk1dt\displaystyle\geq\frac{e^{-n(\alpha_{1}+\rho)z}}{\Gamma(n-k)}\int_{0}^{n(\alpha_{1}+\rho)z}t^{n-k-1}\mathrm{d}t
=en(α1+ρ)zΓ(nk+1)(n(α1+ρ)z)nk.\displaystyle=\frac{e^{-n(\alpha_{1}+\rho)z}}{\Gamma(n-k+1)}(n(\alpha_{1}+\rho)z)^{n-k}.{} (24)

Therefore,

()[κ^K(ϵ)]\displaystyle\frac{(\divideontimes)}{\mathbb{P}[\hat{\kappa}\in K(\epsilon)]} 01α1+ρΓ(nk+1)(n(α1+ρ)z)nk(1C1,n)en(α1+ρ)z(nα1)n1Γ(n1)zn2enα1zdz\displaystyle\leq\int_{0}^{\frac{1}{\alpha_{1}+\rho}}\frac{\Gamma(n-k+1)}{(n(\alpha_{1}+\rho)z)^{n-k}(1-C_{1,n})}e^{n(\alpha_{1}+\rho)z}\frac{(n\alpha_{1})^{n-1}}{\Gamma(n-1)}z^{n-2}e^{-n\alpha_{1}z}\mathrm{d}z
=Γ(nk+1)Γ(n1)(1C1,n)(α1+ρ)k1(α1α1+ρ)n1nk101α1+ρzk2enρzdz\displaystyle=\frac{\Gamma(n-k+1)}{\Gamma(n-1)(1-C_{1,n})}(\alpha_{1}+\rho)^{k-1}\left(\frac{\alpha_{1}}{\alpha_{1}+\rho}\right)^{n-1}n^{k-1}\int_{0}^{\frac{1}{\alpha_{1}+\rho}}z^{k-2}e^{n\rho z}\mathrm{d}z
Γ(nk+1)Γ(n1)(1C1,n)(α1+ρ)k1eρα1+ρ(n1)nk1(nρ)k201α1+ρ(nρz)k2enρzdz.\displaystyle\leq\frac{\Gamma(n-k+1)}{\Gamma(n-1)(1-C_{1,n})}(\alpha_{1}+\rho)^{k-1}e^{-\frac{\rho}{\alpha_{1}+\rho}(n-1)}\frac{n^{k-1}}{(n\rho)^{k-2}}\int_{0}^{\frac{1}{\alpha_{1}+\rho}}(n\rho z)^{k-2}e^{n\rho z}\mathrm{d}z.{} (25)

By letting nρz=yn\rho z=y, we can bound the integral in (25) as

nk1(nρ)k201α1+ρ(nρz)k2enρzdz\displaystyle\frac{n^{k-1}}{(n\rho)^{k-2}}\int_{0}^{\frac{1}{\alpha_{1}+\rho}}(n\rho z)^{k-2}e^{n\rho z}\mathrm{d}z =ρ(k1)0nρα1+ρyk2eydy\displaystyle=\rho^{-(k-1)}\int_{0}^{\frac{n\rho}{\alpha_{1}+\rho}}y^{k-2}e^{y}\mathrm{d}y
ρ(k1)enρα1+ρ0nρα1+ρyk2dy\displaystyle\leq\rho^{-(k-1)}e^{n\frac{\rho}{\alpha_{1}+\rho}}\int_{0}^{\frac{n\rho}{\alpha_{1}+\rho}}y^{k-2}\mathrm{d}y{} (26)
=enρα1+ρk1(nα1+ρ)k1, if k2\displaystyle=\frac{e^{n\frac{\rho}{\alpha_{1}+\rho}}}{k-1}\left(\frac{n}{\alpha_{1}+\rho}\right)^{k-1},\quad\text{ if }k\in\mathbb{Z}_{\geq 2}{} (27)

where (27) holds only for k2k\in\mathbb{Z}_{\geq 2} since the integral in (26) diverges for k1k\in\mathbb{Z}_{\leq 1}.

By applying (27) to (25), we obtain for k2k\in\mathbb{Z}_{\geq 2}

()\displaystyle(\divideontimes) [κ^K(ϵ)]eρα1+ρ1C1,nnk1k1Γ(nk+1)Γ(n1)\displaystyle\leq\mathbb{P}[\hat{\kappa}\in K(\epsilon)]\frac{e^{\frac{\rho}{\alpha_{1}+\rho}}}{1-C_{1,n}}\frac{n^{k-1}}{k-1}\frac{\Gamma(n-k+1)}{\Gamma(n-1)}\hskip 30.00005pt
e1ϵα1κ+ϵn1C1,nΓ(nk+1)Γ(n1)nk1k1=𝒪(nenϵ),\displaystyle\leq\frac{e^{1-\frac{\epsilon\alpha_{1}}{\kappa+\epsilon}n}}{1-C_{1,n}}\frac{\Gamma(n-k+1)}{\Gamma(n-1)}\frac{n^{k-1}}{k-1}=\mathcal{O}(ne^{-n\epsilon}),{} (28)

where (28) can be obtained by Lemma 15 and ρα1+ρ<1\frac{\rho}{\alpha_{1}+\rho}<1. By combining (28) with (23) and (22), we obtain for k2k\in\mathbb{Z}_{\geq 2}

𝔼[t=Kn¯+1T𝟙[j(t)1,𝒦1,N1(t)c(ϵ),ϵc(t)]]\displaystyle\mathbb{E}\left[\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\right] n=n¯T(en1en+en3ϵ4μ1+12encμ1(ϵ)C2(μ1,ϵ,k)+())\displaystyle\leq\sum_{n=\bar{n}}^{T}\bigg{(}\frac{e^{-n}}{1-e^{-n}}+\frac{e^{-n\frac{3\epsilon}{4\mu_{1}}}+\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}}{C_{2}(\mu_{1},\epsilon,k)}+(\divideontimes)\bigg{)}
n=n¯T𝒪(en)+𝒪(enϵ)+𝒪(enϵ2)+𝒪(nenϵ)\displaystyle\leq\sum_{n=\bar{n}}^{T}\mathcal{O}(e^{-n})+\mathcal{O}(e^{-n\epsilon})+\mathcal{O}(e^{-n\epsilon^{2}})+\mathcal{O}(ne^{-n\epsilon})
=𝒪(1)+𝒪(ϵ1)+𝒪(ϵ2)+𝒪(ϵ2),\displaystyle=\mathcal{O}(1)+\mathcal{O}(\epsilon^{-1})+\mathcal{O}(\epsilon^{-2})+\mathcal{O}(\epsilon^{-2}),

which concludes the proof under STS\mathrm{STS} with k2k\in\mathbb{Z}_{\geq 2}.

Under STS-T\mathrm{STS}\text{-}\mathrm{T} with k0k\in\mathbb{Z}_{\geq 0}:

Under STS-T\mathrm{STS}\text{-}\mathrm{T}, we have the following inequality instead of (23):

𝔼[𝟙[𝒦nc(ϵ)]pn(ϵ|θ¯1,n)1pn(ϵ|θ¯1,n)][κ^1μ1ϵ]en1en\displaystyle\mathbb{E}\left[\mathbbm{1}\left[\mathcal{K}_{n}^{c}(\epsilon)\right]\frac{p_{n}(\epsilon|\bar{\theta}_{1,n})}{1-p_{n}(\epsilon|\bar{\theta}_{1,n})}\right]\leq\mathbb{P}[\hat{\kappa}_{1}\geq\mu_{1}-\epsilon]\frac{e^{-n}}{1-e^{-n}} +[κ^1K(ϵ),α¯1<α1+ρ]h(μ1,ϵ,n)C2(μ1,ϵ,k)\displaystyle+\mathbb{P}[\hat{\kappa}_{1}\in K(\epsilon),\bar{\alpha}_{1}<\alpha_{1}+\rho]\frac{h(\mu_{1},\epsilon,n)}{C_{2}(\mu_{1},\epsilon,k)}
+𝔼θ¯1,n[𝟙[κ^1K(ϵ),α¯1(α1+ρ,n]]Gk(1/α¯1;n)(1C1,n)]().\displaystyle\hskip 10.00002pt+\underbrace{\mathbb{E}_{\bar{\theta}_{1,n}}\left[\frac{\mathbbm{1}[\hat{\kappa}_{1}\in K(\epsilon),\bar{\alpha}_{1}\in(\alpha_{1}+\rho,n]]}{G_{k}(1/\bar{\alpha}_{1};n)(1-C_{1,n})}\right]}_{(\star)}.{} (29)

From 𝟙[α¯1(n)<n]=𝟙[α¯1(n)=α^1(n)]\mathbbm{1}[\bar{\alpha}_{1}(n)<n]=\mathbbm{1}[\bar{\alpha}_{1}(n)=\hat{\alpha}_{1}(n)], it holds that

()\displaystyle(\star) =𝔼θ^1,n[𝟙[κ^1K(ϵ),α^1(α1+ρ,n)]Gk(1/α^1;n)(1C1,n)]+𝔼θ¯1,n[𝟙[κ^1K(ϵ),α¯1=n]Gk(1/α¯1;n)(1C1,n)].\displaystyle=\mathbb{E}_{\hat{\theta}_{1,n}}\left[\frac{\mathbbm{1}[\hat{\kappa}_{1}\in K(\epsilon),\hat{\alpha}_{1}\in(\alpha_{1}+\rho,n)]}{G_{k}(1/\hat{\alpha}_{1};n)(1-C_{1,n})}\right]+\mathbb{E}_{\bar{\theta}_{1,n}}\left[\frac{\mathbbm{1}[\hat{\kappa}_{1}\in K(\epsilon),\bar{\alpha}_{1}=n]}{G_{k}(1/\bar{\alpha}_{1};n)(1-C_{1,n})}\right].

Since 𝟙[α¯1(n)=n]=𝟙[α^1(n)n]\mathbbm{1}[\bar{\alpha}_{1}(n)=n]=\mathbbm{1}[\hat{\alpha}_{1}(n)\geq n] holds and κ^\hat{\kappa} and α^\hat{\alpha} are independent, we have for z=1α^1Erlang(n1,nα1)z=\frac{1}{\hat{\alpha}_{1}}\sim\mathrm{Erlang}(n-1,n\alpha_{1})

()[κ^1K(ϵ)]\displaystyle\frac{(\star)}{\mathbb{P}[\hat{\kappa}_{1}\in K(\epsilon)]} 1n1α1+ρzn2enα1zGk(z;n)(1C1,n)(nα1)n1Γ(n1)dz()+1Gk(1/n;n)(1C1,n)[1α^11n](),\displaystyle\leq\underbrace{\int_{\frac{1}{n}}^{\frac{1}{\alpha_{1}+\rho}}\frac{z^{n-2}e^{-n\alpha_{1}z}}{G_{k}(z;n)(1-C_{1,n})}\frac{(n\alpha_{1})^{n-1}}{\Gamma(n-1)}\mathrm{d}z}_{(\dagger)}+\underbrace{\frac{1}{G_{k}(1/n;n)(1-C_{1,n})}\mathbb{P}\left[\frac{1}{\hat{\alpha}_{1}}\leq\frac{1}{n}\right]}_{(\diamond)},{} (30)

where the same techniques on ()(\divideontimes) can be applied to derive an upper bound of ()(\dagger). By following the same steps as (25) and (26), we obtain

\int_{\rho}^{\frac{n\rho}{\alpha_{1}+\rho}}y^{k-2}\mathrm{d}y\leq\begin{cases}\left(\frac{n\rho}{\alpha_{1}+\rho}\right)^{k-1},&\mathrm{if}~k\in\mathbb{Z}_{\geq 2},\\ \log\left(\frac{n}{\alpha_{1}+\rho}\right),&\mathrm{if}~k=1,\\ \rho^{k-1}/(1-k),&\mathrm{if}~k\in\mathbb{Z}_{\leq 0},\end{cases}

as a counterpart of the integral in (26). By following the same steps as (27) and (28), we have

(\dagger)\leq\begin{cases}\frac{\Gamma(n-k+1)}{\Gamma(n-1)}\frac{en^{k-1}}{k-1},&\mathrm{if}~k\in\mathbb{Z}_{\geq 2},\\ (n-1)\log\left(\frac{n}{\alpha_{1}+\rho}\right),&\mathrm{if}~k=1,\\ \frac{\Gamma(n-k+1)}{\Gamma(n-1)}\frac{e}{(1-k)(\alpha_{1}+\rho)^{1-k}},&\mathrm{if}~k\in\mathbb{Z}_{\leq 0}.\end{cases} (31)

Note that the probability term in ()(\diamond) is the same as the CDF of the Erlang(n1,nα1)\mathrm{Erlang}(n-1,n\alpha_{1}) with rate nα1n\alpha_{1} evaluated at 1n\frac{1}{n} since α^1IG(n1,nα1)\hat{\alpha}_{1}\sim\mathrm{IG}(n-1,n\alpha_{1}) from (2). Thus, we have

()\displaystyle(\diamond) =11C1,n1Gk(1/n;n)γ(n1,α1)Γ(n1)\displaystyle=\frac{1}{1-C_{1,n}}\frac{1}{G_{k}(1/n;n)}\frac{\gamma(n-1,\alpha_{1})}{\Gamma(n-1)}
eα1+ρ1C1,nΓ(nk+1)(α1+ρ)nkγ(n1,α1)Γ(n1)by (24)\displaystyle\leq\frac{e^{\alpha_{1}+\rho}}{1-C_{1,n}}\frac{\Gamma(n-k+1)}{(\alpha_{1}+\rho)^{n-k}}\frac{\gamma(n-1,\alpha_{1})}{\Gamma(n-1)}\hskip 10.00002pt\text{by (\ref{eq: Gkbnd})}
eα1+ρ1C1,nΓ(nk+1)Γ(n1)α1n1(α1+ρ)nk\displaystyle\leq\frac{e^{\alpha_{1}+\rho}}{1-C_{1,n}}\frac{\Gamma(n-k+1)}{\Gamma(n-1)}\frac{\alpha_{1}^{n-1}}{(\alpha_{1}+\rho)^{n-k}}{} (32)
eα1+ρ1C1,nΓ(nk+1)Γ(n1)1(α1+ρ)1k\displaystyle\leq\frac{e^{\alpha_{1}+\rho}}{1-C_{1,n}}\frac{\Gamma(n-k+1)}{\Gamma(n-1)}\frac{1}{(\alpha_{1}+\rho)^{1-k}}
=𝒪(n2k),\displaystyle=\mathcal{O}(n^{2-k}),{} (33)

where (32) holds from γ(s,x)xs\gamma(s,x)\leq x^{s} for any s1s\geq 1 and x>0x>0. Therefore, by combining (31) and (33) with (30) and [κ^K(ϵ)]=𝒪(enϵ)\mathbb{P}[\hat{\kappa}\in K(\epsilon)]=\mathcal{O}(e^{-n\epsilon}), we have

(\star)\leq\begin{cases}\mathcal{O}(ne^{-n\epsilon}),&\mathrm{if}~k\in\mathbb{Z}_{\geq 2},\\ \mathcal{O}(n\log(n)e^{-n\epsilon}),&\mathrm{if}~k=1,\\ \mathcal{O}(n^{2-k}e^{-n\epsilon}),&\mathrm{if}~k\in\mathbb{Z}_{\leq 0}.\end{cases} (34)

By combining (34) with (29) and (22), we obtain for k0k\in\mathbb{Z}_{\geq 0}

𝔼[t=Kn¯+1T𝟙[j(t)1,𝒦1,N1(t)c(ϵ),ϵc(t)]]\displaystyle\mathbb{E}\bigg{[}\sum_{t=K\bar{n}+1}^{T}\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}^{c}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]\bigg{]} n=n¯T(en1en+en3ϵ4μ1+12encμ1(ϵ)C2(μ,ϵ,k)+())\displaystyle\leq\sum_{n=\bar{n}}^{T}\bigg{(}\frac{e^{-n}}{1-e^{-n}}+\frac{e^{-n\frac{3\epsilon}{4\mu_{1}}}+\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}}{C_{2}(\mu,\epsilon,k)}+(\star)\bigg{)}
n=n¯T(𝒪(en)+𝒪(enϵ)+𝒪(enϵ2)\displaystyle\leq\sum_{n=\bar{n}}^{T}\bigg{(}\mathcal{O}(e^{-n})+\mathcal{O}(e^{-n\epsilon})+\mathcal{O}\left(e^{-n\epsilon^{2}}\right)
+𝒪(ψ(n,k)enϵ))\displaystyle\hskip 120.00018pt+\mathcal{O}\left(\psi(n,k)e^{-n\epsilon}\right)\bigg{)}
=𝒪(1)+𝒪(ϵ1)+𝒪(ϵ2)+𝒪(ϵmax(2,3k)),\displaystyle=\mathcal{O}(1)+\mathcal{O}(\epsilon^{-1})+\mathcal{O}(\epsilon^{-2})+\mathcal{O}\left(\epsilon^{-\max(2,3-k)}\right),

where

ψ(n,k)=n𝟙[k2]+nlog(n)𝟙[k=1]+n2k𝟙[k0].\psi(n,k)=n\mathbbm{1}[k\geq 2]+n\log(n)\mathbbm{1}[k=1]+n^{2-k}\mathbbm{1}[k\leq 0].

Note that the analysis of the term $(\star)$ also holds for $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}_{<0}$. However, unlike the case of $k\in\{0,1\}$, priors with $k\in\mathbb{Z}_{<0}$ cause additional problems in Lemma 10 under the event $\{\hat{\kappa}_{1}\in K(\epsilon),\bar{\alpha}_{1}(n)\leq\alpha_{1}+\rho\}$, where the upper bound becomes a constant $\frac{1}{2}$. ∎

C.2 Proof of Lemma 6

Firstly, we state a well-known fact that is utilized in the proof.

Fact 12.

When $X\sim\mathrm{Erlang}(n,\beta)$ with rate parameter $\beta$, $2\beta X$ follows the chi-squared distribution with $2n$ degrees of freedom, i.e., $2\beta X\sim\chi_{2n}^{2}$.
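Analogously to Fact 11, this fact can be checked empirically (illustration only):

```python
# Illustration only: if X ~ Erlang(n, rate beta), then 2*beta*X ~ chi^2 with 2n degrees of freedom.
import numpy as np
from scipy import stats

n, beta = 4, 0.8
x = np.random.default_rng(4).gamma(shape=n, scale=1.0 / beta, size=200_000)  # Erlang(n, beta)
print(stats.kstest(2.0 * beta * x, stats.chi2(df=2 * n).cdf).statistic)      # should be close to zero
```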

Lemma 6 (restated).

Proof.

From the sampling rule, it holds that

t=n¯K+1T[j(t)=a,μ~1(t)μ1ϵ,a,Na(t)(ϵ)]\displaystyle\sum_{t=\bar{n}K+1}^{T}\mathbb{P}\left[j(t)=a,\tilde{\mu}_{1}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,N_{a}(t)}(\epsilon)\right] t=n¯K+1T[j(t)=a,μ~a(t)μ1ϵ,a,Na(t)(ϵ)].\displaystyle\leq\sum_{t=\bar{n}K+1}^{T}\mathbb{P}\left[j(t)=a,\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,N_{a}(t)}(\epsilon)\right].

Fix a time index tt and denote t[]=[t]\mathbb{P}_{t}[\cdot]=\mathbb{P}[\cdot\mid\mathcal{F}_{t}] and Na(t)=nN_{a}(t)=n. To simplify notations, we drop the argument tt of κ~a(t),α~a(t)\tilde{\kappa}_{a}(t),\tilde{\alpha}_{a}(t) and μ~a(t)\tilde{\mu}_{a}(t) and the argument nn of κ^a(n),α^a(n),α¯a(n)\hat{\kappa}_{a}(n),\hat{\alpha}_{a}(n),\bar{\alpha}_{a}(n).

Since κ~a(0,κ^a]\tilde{\kappa}_{a}\in(0,\hat{\kappa}_{a}] holds from its posterior distribution, if α~aμ1ϵμ1ϵκ^a\tilde{\alpha}_{a}\geq\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}_{a}} holds, then μ~a=κ~aα~aα~a1μ1ϵ\tilde{\mu}_{a}=\frac{\tilde{\kappa}_{a}\tilde{\alpha}_{a}}{\tilde{\alpha}_{a}-1}\leq\mu_{1}-\epsilon holds for any κ~a\tilde{\kappa}_{a}. Recall that fnk,nαa,nEr()f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(\cdot) denotes a density function of Erlang(nk,nαa,n)\mathrm{Erlang}\left(n-k,\frac{n}{\alpha_{a,n}}\right) with rate parameter nαa,n\frac{n}{\alpha_{a,n}}, which is the marginalized posterior distribution of α~\tilde{\alpha} under STS\mathrm{STS} and STS-T\mathrm{STS}\text{-}\mathrm{T}. From the CDF of κ~\tilde{\kappa} in (8), if κ^a<μ1ϵ\hat{\kappa}_{a}<\mu_{1}-\epsilon, then

t[μ~aμ1ϵ]\displaystyle\mathbb{P}_{t}\left[\tilde{\mu}_{a}\geq\mu_{1}-\epsilon\right] =t[α~a1]+t[κ~aα~a1α~a(μ1ϵ)α~a(1,μ1ϵμ1ϵκ^a)]\displaystyle=\mathbb{P}_{t}\left[\tilde{\alpha}_{a}\leq 1\right]+\mathbb{P}_{t}\left[\tilde{\kappa}_{a}\geq\frac{\tilde{\alpha}_{a}-1}{\tilde{\alpha}_{a}}(\mu_{1}-\epsilon)\cap\tilde{\alpha}_{a}\in\left(1,\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}_{a}}\right)\right]
=01fnk,nαa,nEr(x)dx+1μ1ϵμ1ϵκ^fnk,nαa,nEr(x)t[κ~ax1x(μ1ϵ)]dx\displaystyle=\int_{0}^{1}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{1}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathbb{P}_{t}\left[\tilde{\kappa}_{a}\geq\frac{x-1}{x}(\mu_{1}-\epsilon)\right]\mathrm{d}x
=01fnk,nαa,nEr(x)dx+1μ1ϵμ1ϵκ^fnk,nαa,nEr(x)(1(x1κ^x(μ1ϵ))nx)dx\displaystyle=\int_{0}^{1}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{1}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(1-\left(\frac{x-1}{\hat{\kappa}x}(\mu_{1}-\epsilon)\right)^{nx}\right)\mathrm{d}x
=0μ1ϵμ1ϵκ^fnk,nαa,nEr(x)dx1μ1ϵμ1ϵκ^fnk,nαa,nEr(x)(x1κ^x(μ1ϵ))nxdx.\displaystyle=\int_{0}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x-\int_{1}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}x}(\mu_{1}-\epsilon)\right)^{nx}\mathrm{d}x.

Since xxy\frac{x}{x-y} is increasing with respect to y<xy<x and κ^κ+ϵ\hat{\kappa}\leq\kappa+\epsilon holds on \mathcal{E}, we have for \mathcal{E}

μ1ϵμ1ϵκ^μ1ϵμ1ϵ(κ+ϵ).\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}\leq\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-(\kappa+\epsilon)}.

Let

ηa(ϵ)=κa(Δaϵ)ϵμa(μaκa)(μ1κa2ϵ)>0\eta_{a}(\epsilon)=\frac{\kappa_{a}(\Delta_{a}-\epsilon)-\epsilon\mu_{a}}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a}-2\epsilon)}>0 (35)

be a deterministic constant that depends only on the model and ϵ\epsilon and satisfies

αaηa(ϵ)\displaystyle\alpha_{a}-\eta_{a}(\epsilon) =μaμaκaκa(Δaϵ)ϵμa(μaκa)(μ1κa2ϵ)\displaystyle=\frac{\mu_{a}}{\mu_{a}-\kappa_{a}}-\frac{\kappa_{a}(\Delta_{a}-\epsilon)-\epsilon\mu_{a}}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a}-2\epsilon)}
=μaμ1κaμa2ϵμaκa(μ1μaϵ)+ϵμa(μaκa)(μ1κa2ϵ)\displaystyle=\frac{\mu_{a}\mu_{1}-\kappa_{a}\mu_{a}-2\epsilon\mu_{a}-\kappa_{a}(\mu_{1}-\mu_{a}-\epsilon)+\epsilon\mu_{a}}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a}-2\epsilon)}
=μ1(μaκa)ϵ(μaκa)(μaκa)(μ1κa2ϵ)\displaystyle=\frac{\mu_{1}(\mu_{a}-\kappa_{a})-\epsilon(\mu_{a}-\kappa_{a})}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a}-2\epsilon)}
=μ1ϵμ1κa2ϵ.\displaystyle=\frac{\mu_{1}-\epsilon}{\mu_{1}-\kappa_{a}-2\epsilon}.

Since ηa(ϵ)>0\eta_{a}(\epsilon)>0, it holds that for any ϵ(0,εa)\epsilon\in(0,\varepsilon_{a})

αaηa(ϵ)=μ1ϵμ1κa2ϵμaμaκa=αa.\alpha_{a}-\eta_{a}(\epsilon)=\frac{\mu_{1}-\epsilon}{\mu_{1}-\kappa_{a}-2\epsilon}\leq\frac{\mu_{a}}{\mu_{a}-\kappa_{a}}=\alpha_{a}.

Note that $\frac{\mu}{\mu-\kappa}=\alpha$ holds, so changing $\mu$ to $\mu^{\prime}$ with $\kappa$ fixed, namely $\frac{\mu^{\prime}}{\mu^{\prime}-\kappa}$, determines the value of the shape parameter $\alpha^{\prime}$ that satisfies $\mu((\kappa,\alpha^{\prime}))=\mu^{\prime}$. For example, $\theta=(\kappa_{a}+\varepsilon_{a},\alpha_{a})$ satisfies $\mu(\theta)\leq\mu_{a}+\frac{\delta_{a}}{2}$. Thus, if $\mu((\kappa_{a}+\varepsilon_{a},\alpha))=\mu_{1}-\epsilon>\mu_{a}+\frac{\delta_{a}}{2}$, then $\alpha$ must be smaller than $\alpha_{a}$. Hence, we have

𝟙[a,n(ϵ)]t[μ~a\displaystyle\mathbbm{1}[\mathcal{E}_{a,n}(\epsilon)]\mathbb{P}_{t}\bigg{[}\tilde{\mu}_{a} μ1ϵ]\displaystyle\geq\mu_{1}-\epsilon\bigg{]}
𝟙[a,n(ϵ)](0μ1ϵμ1ϵκ^fnk,nαa,nEr(x)dx\displaystyle\leq\mathbbm{1}[\mathcal{E}_{a,n}(\epsilon)]\bigg{(}\int_{0}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x
1μ1ϵμ1ϵκ^fnk,nαa,nEr(x)(x1κ^x(μ1ϵ))nxdx)\displaystyle\hskip 90.00014pt-\int_{1}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}x}(\mu_{1}-\epsilon)\right)^{nx}\mathrm{d}x\bigg{)}
𝟙[a,n(ϵ)]0μ1ϵμ1ϵκ^fnk,nαa,nEr(x)dx\displaystyle\leq\mathbbm{1}[\mathcal{E}_{a,n}(\epsilon)]\int_{0}^{\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\hat{\kappa}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x{} (36)
𝟙[a,n(ϵ)]0αaηa(ϵ)fnk,nαa,nEr(x)dx=𝟙[a,n(ϵ)]t[α~a(t)αaηa(ϵ)].\displaystyle\leq\mathbbm{1}[\mathcal{E}_{a,n}(\epsilon)]\int_{0}^{\alpha_{a}-\eta_{a}(\epsilon)}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x=\mathbbm{1}[\mathcal{E}_{a,n}(\epsilon)]\mathbb{P}_{t}[\tilde{\alpha}_{a}(t)\leq\alpha_{a}-\eta_{a}(\epsilon)].{} (37)

Therefore, by taking expectation and using Fact 12, we have

[μ~a(t)μ1ϵ,a,n(ϵ)]\displaystyle\mathbb{P}\left[\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,n}(\epsilon)\right] [α~aαaηa(ϵ),a,n(ϵ)],\displaystyle\leq\mathbb{P}[\tilde{\alpha}_{a}\leq\alpha_{a}-\eta_{a}(\epsilon),\mathcal{E}_{a,n}(\epsilon)],
=[Z2nαa,n(αηa(ϵ)),a,n(ϵ)]\displaystyle=\mathbb{P}\left[Z\leq\frac{2n}{\alpha_{a,n}}\left(\alpha-\eta_{a}(\epsilon)\right),\mathcal{E}_{a,n}(\epsilon)\right]{} (38)

where ZZ is a random variable following the chi-squared distribution with 2(nk)2(n-k) degrees of freedom, i.e., Zχ2n2k2Z\sim\chi_{2n-2k}^{2}.

C.2.1 Under STS\mathrm{STS}

Here, we first consider the case of STS\mathrm{STS} where we replace αa,n\alpha_{a,n} with α^a(n)\hat{\alpha}_{a}(n).

Since α^a[αaϵa,l,αa+ϵa,u]\hat{\alpha}_{a}\in[\alpha_{a}-\epsilon_{a,l},\alpha_{a}+\epsilon_{a,u}] holds on a,n(ϵ)\mathcal{E}_{a,n}(\epsilon), we have

1αaϵ(1+1κa)=1αa+ϵa,u1α^a1αaϵa,l=1αa+ϵ\frac{1}{\alpha_{a}}-\epsilon\left(1+\frac{1}{\kappa_{a}}\right)=\frac{1}{\alpha_{a}+\epsilon_{a,u}}\leq\frac{1}{\hat{\alpha}_{a}}\leq\frac{1}{\alpha_{a}-\epsilon_{a,l}}=\frac{1}{\alpha_{a}}+\epsilon (39)

by the definition of ϵa,l(ϵ)\epsilon_{a,l}(\epsilon) and ϵa,u(ϵ)\epsilon_{a,u}(\epsilon) in (12).

By replacing αa,n\alpha_{a,n} with α^a(n)\hat{\alpha}_{a}(n) in (38) and applying (39), we have

[μ~a(t)μ1ϵ,a,n(ϵ)]\displaystyle\mathbb{P}\left[\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,n}(\epsilon)\right] [Z2nα^a(αaηa(ϵ)),a,n(ϵ)]\displaystyle\leq\mathbb{P}\left[Z\leq\frac{2n}{\hat{\alpha}_{a}}\left(\alpha_{a}-\eta_{a}(\epsilon)\right),\mathcal{E}_{a,n}(\epsilon)\right]
[Z2n(1αa+ϵ)(αaηa(ϵ))]\displaystyle\leq\mathbb{P}\left[Z\leq 2n\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right)\right]
=[Z2(nk)nnk(1αa+ϵ)(αaηa(ϵ))].\displaystyle=\mathbb{P}\left[Z\leq 2(n-k)\frac{n}{n-k}\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right)\right].{} (40)
Priors with $k\in\mathbb{Z}_{\leq 0}$.

Let us first consider the case k0k\in\mathbb{Z}_{\leq 0}, where we have nnk1\frac{n}{n-k}\leq 1. It holds that

[μ~a(t)μ1ϵ,a,n(ϵ)]\displaystyle\mathbb{P}\left[\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,n}(\epsilon)\right] [Z2(nk)nnk(1αa+ϵ)(αaηa(ϵ))]\displaystyle\leq\mathbb{P}\left[Z\leq 2(n-k)\frac{n}{n-k}\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right)\right]
[Z2(nk)(1αa+ϵ)(αaηa(ϵ))].\displaystyle\leq\mathbb{P}\left[Z\leq 2(n-k)\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right)\right].

Note that the definition of $\varepsilon_{a}$ in Theorem 2 is set to satisfy $\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right)<1$ for any $\epsilon\leq\varepsilon_{a}$. Thus, we can apply Lemma 17, which shows

[Z2(nk)(1ηa(ϵ)αa+ϵ(αaηa(ϵ)))]e(nk)Da,k(ϵ),\mathbb{P}\left[Z\leq 2(n-k)\left(1-\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+\epsilon(\alpha_{a}-\eta_{a}(\epsilon))\right)\right]\leq e^{-(n-k)D_{a,k}(\epsilon)}, (41)

where

Da,k(ϵ):=log(1ηa(ϵ)αa+(max(0,k)+1)ϵ(αaηa(ϵ)))ηa(ϵ)αa+(max(0,k)+1)ϵ(αaηa(ϵ))D_{a,k}(\epsilon):=-\log\left(1-\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+(\max(0,k)+1)\epsilon(\alpha_{a}-\eta_{a}(\epsilon))\right)\\ -\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+(\max(0,k)+1)\epsilon(\alpha_{a}-\eta_{a}(\epsilon)) (42)

is a finite constant that only depends on the model parameters, ϵ\epsilon, and prior parameter kk.
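We do not restate Lemma 17 here, but the bound applied in (41) is the standard chi-squared lower-tail (Chernoff) inequality $\mathbb{P}[Z\leq 2mx]\leq e^{-m(x-1-\log x)}$ for $Z\sim\chi^{2}_{2m}$ and $x<1$, which with $m=n-k$ and $x=1-\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+(\max(0,k)+1)\epsilon(\alpha_{a}-\eta_{a}(\epsilon))$ is exactly $e^{-(n-k)D_{a,k}(\epsilon)}$. The short check below (illustration only) confirms this inequality numerically.

```python
# Illustration only: numerical check of the chi-squared lower-tail bound applied in (41),
#   P[Z <= 2 m x] <= exp(-m (x - 1 - log x))   for Z ~ chi^2_{2m} and x < 1.
import numpy as np
from scipy import stats

for m in (5, 20, 100):
    for x in (0.3, 0.6, 0.9):
        lhs = stats.chi2(df=2 * m).cdf(2 * m * x)
        rhs = np.exp(-m * (x - 1.0 - np.log(x)))
        print(f"m={m:3d} x={x:.1f}  P[Z<=2mx]={lhs:.3e}  bound={rhs:.3e}  ok={lhs <= rhs}")
```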

For arbitrary na>0n_{a}>0, applying (41) to (38) gives

t=n¯K+1T𝔼[𝟙[j(t)=a,μ~1(t)μ1ϵ,a,Na(t)(ϵ)]]\displaystyle\sum_{t=\bar{n}K+1}^{T}\mathbb{E}[\mathbbm{1}[j(t)=a,\tilde{\mu}_{1}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,N_{a}(t)}(\epsilon)]]
t=n¯K+1T[j(t)=a,μ~a(t)μ1ϵ,a,n(ϵ)]\displaystyle\leq\sum_{t=\bar{n}K+1}^{T}\mathbb{P}[j(t)=a,\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,n}(\epsilon)]
na+t=n¯K+1T[μ~a(t)μ1ϵ,a,Na(t)(ϵ),Na(t)na]\displaystyle\leq n_{a}+\sum_{t=\bar{n}K+1}^{T}\mathbb{P}[\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,N_{a}(t)}(\epsilon),N_{a}(t)\geq n_{a}]
na+t=n¯K+1Te(nak)Da,k(ϵ)\displaystyle\leq n_{a}+\sum_{t=\bar{n}K+1}^{T}e^{-(n_{a}-k)D_{a,k}(\epsilon)}
na+t=n¯K+1TenaDa,k(ϵ)=na+TenaDa,k(ϵ).\displaystyle\leq n_{a}+\sum_{t=\bar{n}K+1}^{T}e^{-n_{a}D_{a,k}(\epsilon)}=n_{a}+Te^{-n_{a}D_{a,k}(\epsilon)}.

Letting na=logTDa,k(ϵ)n_{a}=\frac{\log T}{D_{a,k}(\epsilon)} concludes the cases of priors with k0k\in\mathbb{Z}_{\leq 0}.

Priors with $k\in\mathbb{Z}_{>0}$.

Next, consider the case $k\in\mathbb{Z}_{>0}$. Recall that we first play every arm $k+1$ times if $k>0$, which implies that $n-k>0$. For $n\geq\frac{1}{\alpha_{a}\epsilon}+k+1$, it holds that

nnk(1α+ϵ)1α+(k+1)ϵ.\frac{n}{n-k}\left(\frac{1}{\alpha}+\epsilon\right)\leq\frac{1}{\alpha}+(k+1)\epsilon. (43)

By applying (43) to (38), we have for n1αϵ+k+1n\geq\frac{1}{\alpha\epsilon}+k+1,

[α~aαaηa(ϵ),a,Na(t)(ϵ)][Z2(nk)(1ηa(ϵ)αa+(k+1)ϵ(αaηa(ϵ)))].\displaystyle\mathbb{P}[\tilde{\alpha}_{a}\leq\alpha_{a}-\eta_{a}(\epsilon),\mathcal{E}_{a,N_{a}(t)}(\epsilon)]\leq\mathbb{P}\left[Z\leq 2(n-k)\left(1-\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+(k+1)\epsilon(\alpha_{a}-\eta_{a}(\epsilon))\right)\right].

Similarly, by applying Lemma 17, one can see that for n1αaϵ+k+1n\geq\frac{1}{\alpha_{a}\epsilon}+k+1

[α~aαaηa(ϵ),a,Na(t)(ϵ)]e(nk)Da,k(ϵ),\mathbb{P}[\tilde{\alpha}_{a}\leq\alpha_{a}-\eta_{a}(\epsilon),\mathcal{E}_{a,N_{a}(t)}(\epsilon)]\leq e^{-(n-k)D_{a,k}(\epsilon)}, (44)

where Da,k(ϵ)D_{a,k}(\epsilon) is defined in (42).

When k>0k\in\mathbb{Z}_{>0}, let na1αaϵ+k+1n_{a}\geq\frac{1}{\alpha_{a}\epsilon}+k+1 be arbitrary. By applying (44) to (38) again, we have

t=n¯K+1T𝔼[𝟙[j(t)=a,μ~1(t)μ1ϵ,a,Na(t)(ϵ)]]\displaystyle\sum_{t=\bar{n}K+1}^{T}\mathbb{E}[\mathbbm{1}[j(t)=a,\tilde{\mu}_{1}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,N_{a}(t)}(\epsilon)]]
t=n¯K+1T[j(t)=a,μ~a(t)μ1ϵ,a,n(ϵ)]\displaystyle\leq\sum_{t=\bar{n}K+1}^{T}\mathbb{P}[j(t)=a,\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,n}(\epsilon)]
na+t=n¯K+1T[μ~a(t)μ1ϵ,a,Na(t)(ϵ),Na(t)na]\displaystyle\leq n_{a}+\sum_{t=\bar{n}K+1}^{T}\mathbb{P}[\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,N_{a}(t)}(\epsilon),N_{a}(t)\geq n_{a}]
na+t=n¯K+1Te(nak)Da,k(ϵ)=na+Te(nak)Da,k(ϵ).\displaystyle\leq n_{a}+\sum_{t=\bar{n}K+1}^{T}e^{-(n_{a}-k)D_{a,k}(\epsilon)}=n_{a}+Te^{-(n_{a}-k)D_{a,k}(\epsilon)}.

Letting na=k+1+1αaϵ+logTDa,k(ϵ)n_{a}=k+1+\frac{1}{\alpha_{a}\epsilon}+\frac{\log T}{D_{a,k}(\epsilon)} concludes the cases of priors with k>0k>0.

C.2.2 Under STS-T\mathrm{STS}\text{-}\mathrm{T}

Here, we consider the case of STS-T\mathrm{STS}\text{-}\mathrm{T} where we replace αa,n\alpha_{a,n} with α¯a(n)=min(α^a(n),n)\bar{\alpha}_{a}(n)=\min(\hat{\alpha}_{a}(n),n). From the definition of α¯a(n)\bar{\alpha}_{a}(n), it holds that for ϵεa\epsilon\leq\varepsilon_{a}

nαa+1:𝟙[α¯a(n)=α^a(n),𝒜a,n(ϵ)]=1.\forall n\geq\alpha_{a}+1:\mathbbm{1}[\bar{\alpha}_{a}(n)=\hat{\alpha}_{a}(n),\mathcal{A}_{a,n}(\epsilon)]=1.

Therefore, for nαa+1n\geq\alpha_{a}+1, the analysis on STS\mathrm{STS} can be applied to STS-T\mathrm{STS}\text{-}\mathrm{T} directly.

Let us consider the case where α¯a(n)=n<αa+1\bar{\alpha}_{a}(n)=n<\alpha_{a}+1 holds under the condition 𝒜a,n(ϵ)\mathcal{A}_{a,n}(\epsilon). By replacing αa,n\alpha_{a,n} with nn in (38) and following the same steps as in (38) and (41), we have for any kk\in\mathbb{Z},

[μ~a(t)μ1ϵ,a,n(ϵ)]\displaystyle\mathbb{P}\left[\tilde{\mu}_{a}(t)\geq\mu_{1}-\epsilon,\mathcal{E}_{a,n}(\epsilon)\right] [Z2nn(αaηa(ϵ)),a,n(ϵ)]\displaystyle\leq\mathbb{P}\left[Z\leq\frac{2n}{n}\left(\alpha_{a}-\eta_{a}(\epsilon)\right),\mathcal{E}_{a,n}(\epsilon)\right]
[Z2(nk)1nk(1αa+ϵ)(αaηa(ϵ)),a,n(ϵ)]\displaystyle\leq\mathbb{P}\left[Z\leq 2(n-k)\frac{1}{n-k}\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right),\mathcal{E}_{a,n}(\epsilon)\right]
[Z2(nk)(1αa+ϵ)(αaηa(ϵ)),a,n(ϵ)]\displaystyle\leq\mathbb{P}\left[Z\leq 2(n-k)\left(\frac{1}{\alpha_{a}}+\epsilon\right)\left(\alpha_{a}-\eta_{a}(\epsilon)\right),\mathcal{E}_{a,n}(\epsilon)\right]
e(nk)Da,k(ϵ),\displaystyle\leq e^{-(n-k)D_{a,k}(\epsilon)},

where Da,k(ϵ)D_{a,k}(\epsilon) defined in (42). Therefore, the same result follows by the analysis in Section C.2.1.

C.2.3 Asymptotic behavior of Da,k(ϵ)D_{a,k}(\epsilon)

Finally, we show that limϵ0Da,k(ϵ)=KLinf(a)\lim_{\epsilon\to 0}D_{a,k}(\epsilon)=\mathrm{KL}_{\mathrm{inf}}(a). From their definitions of ηa(ϵ)\eta_{a}(\epsilon) in (35) and Δa=μ1μa\Delta_{a}=\mu_{1}-\mu_{a}, we have

limϵ0ηa(ϵ)\displaystyle\lim_{\epsilon\to 0}\eta_{a}(\epsilon) =limϵ0κa(Δaϵ)ϵμa(μaκa)(μ1κa2ϵ)\displaystyle=\lim_{\epsilon\to 0}\frac{\kappa_{a}(\Delta_{a}-\epsilon)-\epsilon\mu_{a}}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a}-2\epsilon)}
=κaΔa(μaκa)(μ1κa)=κa(μ1μa)(μaκa)(μ1κa)=(αa1)μ1μaμ1κa.\displaystyle=\frac{\kappa_{a}\Delta_{a}}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a})}=\frac{\kappa_{a}(\mu_{1}-\mu_{a})}{(\mu_{a}-\kappa_{a})(\mu_{1}-\kappa_{a})}=(\alpha_{a}-1)\frac{\mu_{1}-\mu_{a}}{\mu_{1}-\kappa_{a}}.

Then, it holds that

limϵ01ηa(ϵ)αa\displaystyle\lim_{\epsilon\to 0}1-\frac{\eta_{a}(\epsilon)}{\alpha_{a}} =1(αa1αaμ1μaμ1κa)\displaystyle=1-\left(\frac{\alpha_{a}-1}{\alpha_{a}}\frac{\mu_{1}-\mu_{a}}{\mu_{1}-\kappa_{a}}\right)
=αa(μ1κa)(αa1)(μ1μa)αa(μ1κa)\displaystyle=\frac{\alpha_{a}(\mu_{1}-\kappa_{a})-(\alpha_{a}-1)(\mu_{1}-\mu_{a})}{\alpha_{a}(\mu_{1}-\kappa_{a})}
=αa(μaκa)+μ1μaαa(μ1κa)\displaystyle=\frac{\alpha_{a}(\mu_{a}-\kappa_{a})+\mu_{1}-\mu_{a}}{\alpha_{a}(\mu_{1}-\kappa_{a})}
=μaμaκa(μaκa)αa(μ1κa)+μ1μaαa(μ1κa)αa=μaμaκa\displaystyle=\frac{\mu_{a}}{\mu_{a}-\kappa_{a}}\frac{(\mu_{a}-\kappa_{a})}{\alpha_{a}(\mu_{1}-\kappa_{a})}+\frac{\mu_{1}-\mu_{a}}{\alpha_{a}(\mu_{1}-\kappa_{a})}\hskip 20.00003pt\because\alpha_{a}=\frac{\mu_{a}}{\mu_{a}-\kappa_{a}}
=μ1αa(μ1κa)=:1Xa,\displaystyle=\frac{\mu_{1}}{\alpha_{a}(\mu_{1}-\kappa_{a})}=:\frac{1}{X_{a}},

Therefore, from the definition of Da,k(ϵ)D_{a,k}(\epsilon) in (42)

limϵ0Da,k(ϵ)\displaystyle\lim_{\epsilon\to 0}D_{a,k}(\epsilon) =limϵ0[log(1ηa(ϵ)αa+(max(0,k)+1)(αaηa(ϵ))ϵ)\displaystyle=\lim_{\epsilon\to 0}\bigg{[}-\log\left(1-\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+(\max(0,k)+1)(\alpha_{a}-\eta_{a}(\epsilon))\epsilon\right)
ηa(ϵ)αa+(max(0,k)+1)(αaηa(ϵ))ϵ]\displaystyle\hskip 110.00017pt-\frac{\eta_{a}(\epsilon)}{\alpha_{a}}+(\max(0,k)+1)(\alpha_{a}-\eta_{a}(\epsilon))\epsilon\bigg{]}
=log(1limϵ0ηa(ϵ)αa)limϵ0ηa(ϵ)αa\displaystyle=-\log(1-\lim_{\epsilon\to 0}\frac{\eta_{a}(\epsilon)}{\alpha_{a}})-\lim_{\epsilon\to 0}\frac{\eta_{a}(\epsilon)}{\alpha_{a}}
=log1Xa+1Xa1=logXa+1Xa1\displaystyle=-\log\frac{1}{X_{a}}+\frac{1}{X_{a}}-1=\log X_{a}+\frac{1}{X_{a}}-1
=log(αaμ1κaμ1)+μ1αa(μ1κa)1=KLinf(a),\displaystyle=\log\left(\alpha_{a}\frac{\mu_{1}-\kappa_{a}}{\mu_{1}}\right)+\frac{\mu_{1}}{\alpha_{a}(\mu_{1}-\kappa_{a})}-1=\mathrm{KL}_{\mathrm{inf}}(a),

where the last equality comes from Lemma 1. ∎
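The convergence above can also be observed numerically; the snippet below (illustration only, arbitrary instance and prior parameter $k$) evaluates $D_{a,k}(\epsilon)$ from (42) for decreasing $\epsilon$ and compares it with the closed form of $\mathrm{KL}_{\mathrm{inf}}(a)$ from Lemma 1.

```python
# Illustration only: D_{a,k}(eps) from (42) approaches KL_inf(a) as eps -> 0.
import numpy as np

kappa_a, alpha_a, mu_1, k = 1.0, 3.0, 2.0, 0
mu_a = kappa_a * alpha_a / (alpha_a - 1.0)
Delta_a = mu_1 - mu_a

def D_ak(eps):
    eta = (kappa_a * (Delta_a - eps) - eps * mu_a) / ((mu_a - kappa_a) * (mu_1 - kappa_a - 2.0 * eps))
    x = 1.0 - eta / alpha_a + (max(0, k) + 1) * eps * (alpha_a - eta)
    return -np.log(x) + x - 1.0                     # -log(x) + x - 1 with x as in (42)

X_a = alpha_a * (mu_1 - kappa_a) / mu_1
kl_inf = np.log(X_a) + 1.0 / X_a - 1.0              # closed form from Lemma 1
for eps in (3e-2, 1e-2, 1e-3, 1e-4):
    print(f"eps={eps:.0e}  D_ak={D_ak(eps):.6f}  KL_inf={kl_inf:.6f}")
```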

C.3 Proof of Lemma 7

Firstly, we state two well-known facts that are utilized in the proof.

Fact 13.

When $X\sim\mathrm{Pa}(\kappa,\alpha)$ with scale parameter $\kappa\in\mathbb{R}_{+}$ and shape parameter $\alpha\in\mathbb{R}_{+}$, $\log\left(\frac{X}{\kappa}\right)$ follows the exponential distribution with rate $\alpha$, i.e., $\log\left(\frac{X}{\kappa}\right)\sim\mathrm{Exp}(\alpha)$.

Fact 14.

Let $X_{1},\ldots,X_{n}$ be independent and identically distributed according to the exponential distribution with rate parameter $\alpha$, i.e., $X_{i}\stackrel{\text{i.i.d.}}{\sim}\mathrm{Exp}(\alpha)$ for any $i\in[n]$. Then, their sum follows the Erlang distribution with shape parameter $n\in\mathbb{N}$ and rate parameter $\alpha$, i.e., $\sum_{i=1}^{n}X_{i}\sim\mathrm{Erlang}(n,\alpha)$.
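These two facts can be checked empirically as well (illustration only):

```python
# Illustration only: for X_i ~ Pa(kappa, alpha), log(X_i / kappa) ~ Exp(alpha) (Fact 13),
# and the sum of n such terms follows Erlang(n, alpha) (Fact 14).
import numpy as np
from scipy import stats

kappa, alpha, n = 2.0, 1.7, 6
rng = np.random.default_rng(2)
samples = kappa * (1.0 + rng.pareto(alpha, size=(100_000, n)))   # i.i.d. Pa(kappa, alpha)
sums = np.log(samples / kappa).sum(axis=1)                       # sums of n Exp(alpha) draws
# Erlang(n, rate alpha) = Gamma(shape n, scale 1/alpha); the KS distance should be close to zero.
print(stats.kstest(sums, stats.gamma(a=n, scale=1.0 / alpha).cdf).statistic)
```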

Lemma 7 (restated).

Proof.

When one considers the Pareto distribution with a known scale parameter $\kappa$, which belongs to a one-dimensional exponential family, the posterior of the shape parameter $\alpha^{\mathrm{one}}>0$ after observing $n=N_{1}(t)$ rewards is given for $k\in\mathbb{Z}$ by

αone|tErlang(nk+1,Xn),\alpha^{\mathrm{one}}~{}|~{}\mathcal{F}_{t}\sim\mathrm{Erlang}\left(n-k+1,X_{n}\right), (45)

where Xn=s=1nlog(r1,s)nlog(κ1)X_{n}=\sum_{s=1}^{n}\log(r_{1,s})-n\log(\kappa_{1}). Note that XnErlang(n,α1)X_{n}\sim\mathrm{Erlang}(n,\alpha_{1}) from Facts 13 and 14. Let α~1one\tilde{\alpha}_{1}^{\mathrm{one}} be a sample from the posterior distribution in (45). Then, for one-dimensional Pareto bandits, it holds from (10) that

[μ~1(t)μ1ϵ|t]=[α~1oneβ|t]=Γ(nk+1,βXn)Γ(nk+1),\mathbb{P}[\tilde{\mu}_{1}(t)\leq\mu_{1}-\epsilon|\mathcal{F}_{t}]=\mathbb{P}\left[\tilde{\alpha}_{1}^{\mathrm{one}}\geq\beta\,\middle|\,\mathcal{F}_{t}\right]=\frac{\Gamma\left(n-k+1,\beta X_{n}\right)}{\Gamma(n-k+1)},

where we denoted β=μ1ϵμ1ϵκ1\beta=\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\kappa_{1}} satisfying μ(κ1,β)=μ1ϵ\mu(\kappa_{1},\beta)=\mu_{1}-\epsilon. Therefore, Lemma 21 can be written as

t=1T𝔼[𝟙[j(t)1,ϵc(t)]]\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[\mathbbm{1}\left[j(t)\neq 1,\mathcal{M}_{\epsilon}^{c}(t)\right]\right] =t=1Tn=1T𝔼[𝟙[j(t)1,ϵc(t),N1(t)=n]]\displaystyle=\sum_{t=1}^{T}\sum_{n=1}^{T}\mathbb{E}\left[\mathbbm{1}\left[j(t)\neq 1,\mathcal{M}_{\epsilon}^{c}(t),N_{1}(t)=n\right]\right]
=t=1Tn=1T𝔼[[j(t)1,ϵc(t),N1(t)=n|t]]\displaystyle=\sum_{t=1}^{T}\sum_{n=1}^{T}\mathbb{E}\left[\mathbb{P}\left[j(t)\neq 1,\mathcal{M}_{\epsilon}^{c}(t),N_{1}(t)=n\,\middle|\,\mathcal{F}_{t}\right]\right]
=t=1Tn=1T0Γ(n+1,βx)Γ(n+1)α1nΓ(n)xn1eα1xdx𝒪(ϵ1),\displaystyle=\sum_{t=1}^{T}\sum_{n=1}^{T}\int_{0}^{\infty}\frac{\Gamma(n+1,\beta x)}{\Gamma(n+1)}\frac{\alpha_{1}^{n}}{\Gamma(n)}x^{n-1}e^{-\alpha_{1}x}\mathrm{d}x\leq\mathcal{O}(\epsilon^{-1}),

where we substituted the density function of the Erlang distribution in the last equality.
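The identity used above can be evaluated directly; the snippet below (illustration only, with an arbitrary instance and $k=0$) computes $\Gamma(n-k+1,\beta X_n)/\Gamma(n-k+1)$ with the regularized upper incomplete gamma function and compares it with a Monte Carlo estimate of $\mathbb{P}[\tilde{\alpha}_1^{\mathrm{one}}\geq\beta\mid\mathcal{F}_t]$ under the Erlang posterior (45).

```python
# Illustration only: the one-parameter tail probability Gamma(n-k+1, beta*X_n)/Gamma(n-k+1)
# versus a Monte Carlo estimate of P[alpha_one >= beta] under the Erlang posterior (45).
import numpy as np
from scipy.special import gammaincc

kappa_1, alpha_1, n, k, eps = 1.0, 2.0, 50, 0, 0.2
mu_1 = kappa_1 * alpha_1 / (alpha_1 - 1.0)
beta = (mu_1 - eps) / (mu_1 - eps - kappa_1)                 # mu(kappa_1, beta) = mu_1 - eps

rng = np.random.default_rng(3)
rewards = kappa_1 * (1.0 + rng.pareto(alpha_1, size=n))      # n rewards from Pa(kappa_1, alpha_1)
X_n = np.sum(np.log(rewards)) - n * np.log(kappa_1)          # X_n ~ Erlang(n, alpha_1)

closed_form = gammaincc(n - k + 1, beta * X_n)               # regularized Gamma(s, x)/Gamma(s)
monte_carlo = (rng.gamma(shape=n - k + 1, scale=1.0 / X_n, size=200_000) >= beta).mean()
print(closed_form, monte_carlo)                              # the two should nearly agree
```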

On the other hand, for two-parameter Pareto bandits where the scale parameter is unknown, it holds by the law of total expectation that

𝔼[𝟙[j(t)1,𝒦1,N1(t)(ϵ),ϵc(t)]]\displaystyle\mathbb{E}[\mathbbm{1}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)]] =𝔼κ^1,α^1[[j(t)1,𝒦1,N1(t)(ϵ),ϵc(t)|t]]\displaystyle=\mathbb{E}_{\hat{\kappa}_{1},\hat{\alpha}_{1}}\left[\mathbb{P}[j(t)\neq 1,\mathcal{K}_{1,N_{1}(t)}(\epsilon),\mathcal{M}_{\epsilon}^{c}(t)|\mathcal{F}_{t}]\right]
=𝔼κ^1,α^1[𝟙[𝒦1,N1(t)(ϵ)][j(t)1,ϵc(t)|t]],\displaystyle=\mathbb{E}_{\hat{\kappa}_{1},\hat{\alpha}_{1}}\left[\mathbbm{1}[\mathcal{K}_{1,N_{1}(t)}(\epsilon)]\mathbb{P}[j(t)\neq 1,\mathcal{M}_{\epsilon}^{c}(t)|\mathcal{F}_{t}]\right],

where the last equality holds since 𝒦\mathcal{K} is determined by the history t\mathcal{F}_{t}.

From Lemma 9 with y=μ1ϵy=\mu_{1}-\epsilon, it holds for any ξμ1ϵμ1ϵκ1=β\xi\leq\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\kappa_{1}}=\beta that

𝟙[𝒦1,n(ϵ)][μ~1(t)μ1ϵ|t]\displaystyle\mathbbm{1}[\mathcal{K}_{1,n}(\epsilon)]\mathbb{P}[\tilde{\mu}_{1}(t)\leq\mu_{1}-\epsilon|\mathcal{F}_{t}]
𝟙[𝒦1,n(ϵ)]((μ1ϵμ((κ1,ξ)))n1ξfnk,nα1,nEr(x)dx+ξfnk,nα1,nEr(x)dx)\displaystyle\leq\mathbbm{1}[\mathcal{K}_{1,n}(\epsilon)]\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\int_{1}^{\xi}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x\right)
𝟙[𝒦1,n(ϵ)]((μ1ϵμ((κ1,ξ)))n(1Γ(nk,nα1,nξ)Γ(nk))+Γ(nk,nα1,nξ)Γ(nk))\displaystyle\leq\mathbbm{1}[\mathcal{K}_{1,n}(\epsilon)]\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\left(1-\frac{\Gamma\left(n-k,\frac{n}{\alpha_{1,n}}\xi\right)}{\Gamma(n-k)}\right)+\frac{\Gamma\left(n-k,\frac{n}{\alpha_{1,n}}\xi\right)}{\Gamma(n-k)}\right){} (46)

which is a convex combination of 11 and (μ1ϵμ((κ1,ξ)))n\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}. Therefore, RHS of (46) increases as Γ(nk,nα1,nξ)Γ(nk)\frac{\Gamma\left(n-k,\frac{n}{\alpha_{1,n}}\xi\right)}{\Gamma(n-k)} increases. From the definition of Γ(n,x)\Gamma(n,x), it holds that Γ(n,x)Γ(n,x+y)\Gamma(n,x)\geq\Gamma(n,x+y) for any positive y>0y>0 and Γ(n+1,x)=nΓ(n,x)+xnex\Gamma(n+1,x)=n\Gamma(n,x)+x^{n}e^{-x}. Since nα^1(n)nα¯1(n)\frac{n}{\hat{\alpha}_{1}(n)}\leq\frac{n}{\bar{\alpha}_{1}(n)} holds for any nn\in\mathbb{N}, it holds for k0k\in\mathbb{Z}_{\geq 0} that

Γ(nk,nα¯1(n)ξ)Γ(nk)Γ(nk,nα^1(n)ξ)Γ(nk)Γ(n,nα^1(n)ξ)Γ(n).\frac{\Gamma\left(n-k,\frac{n}{\bar{\alpha}_{1}(n)}\xi\right)}{\Gamma(n-k)}\leq\frac{\Gamma\left(n-k,\frac{n}{\hat{\alpha}_{1}(n)}\xi\right)}{\Gamma(n-k)}\leq\frac{\Gamma\left(n,\frac{n}{\hat{\alpha}_{1}(n)}\xi\right)}{\Gamma(n)}.

Let us denote $Y_{n}:=\frac{n}{\hat{\alpha}_{1}(n)}=\sum_{s=1}^{n}\log(r_{1,s})-n\log(\hat{\kappa}_{1}(n))$, which follows the Erlang distribution with shape $n-1$ and rate $\alpha_{1}$ [Malik, 1970]. By taking the expectation with respect to $\hat{\kappa}_{1}(n)$, we have for any $\xi\leq\beta$ that

𝔼κ^1[𝟙[𝒦1,n(ϵ)][μ~1(t)μ1ϵ|t]]\displaystyle\mathbb{E}_{\hat{\kappa}_{1}}[\mathbbm{1}[\mathcal{K}_{1,n}(\epsilon)]\mathbb{P}[\tilde{\mu}_{1}(t)\leq\mu_{1}-\epsilon|\mathcal{F}_{t}]]
κ1κ1+ϵ((μ1ϵμ((κ1,ξ)))n(1Γ(n,ξYn)Γ(n))+Γ(n,ξYn)Γ(n))[κ^1(n)=x]dx\displaystyle\leq\int_{\kappa_{1}}^{\kappa_{1}+\epsilon}\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\left(1-\frac{\Gamma\left(n,\xi Y_{n}\right)}{\Gamma(n)}\right)+\frac{\Gamma\left(n,\xi Y_{n}\right)}{\Gamma(n)}\right)\mathbb{P}[\hat{\kappa}_{1}(n)=x]\mathrm{d}x
=[𝒦1,n(ϵ)]((μ1ϵμ((κ1,ξ)))n(1Γ(n,ξYn)Γ(n))+Γ(n,ξYn)Γ(n))\displaystyle=\mathbb{P}[\mathcal{K}_{1,n}(\epsilon)]\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\left(1-\frac{\Gamma\left(n,\xi Y_{n}\right)}{\Gamma(n)}\right)+\frac{\Gamma\left(n,\xi Y_{n}\right)}{\Gamma(n)}\right)
=(1(κ1κ1+ϵ)nα1)((μ1ϵμ((κ1,ξ)))n(1Γ(n,ξYn)Γ(n))+Γ(n,ξYn)Γ(n)),\displaystyle=\left(1-\left(\frac{\kappa_{1}}{\kappa_{1}+\epsilon}\right)^{n\alpha_{1}}\right)\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\left(1-\frac{\Gamma\left(n,\xi Y_{n}\right)}{\Gamma(n)}\right)+\frac{\Gamma\left(n,\xi Y_{n}\right)}{\Gamma(n)}\right),

where we used κ^1(n)Pa(κ1,nα1)\hat{\kappa}_{1}(n)\sim\mathrm{Pa}(\kappa_{1},n\alpha_{1}) in (2) for the last equality.

Therefore, under the two-parameter Pareto model, the following holds for any $\xi\leq\beta$ under both $\mathrm{STS}$ and $\mathrm{STS}\text{-}\mathrm{T}$ with $k\in\mathbb{Z}_{\geq 0}$:

𝔼κ^1,α^1[𝟙[𝒦1,n(ϵ)][μ~1(t)μ1ϵ|t]](1(κ1κ1+ϵ)nα1)0((μ1ϵμ((κ1,ξ)))n(1Γ(n,ξy)Γ(n))+Γ(n,ξy)Γ(n))α1n1Γ(n1)yn2eα1ydy.\mathbb{E}_{\hat{\kappa}_{1},\hat{\alpha}_{1}}[\mathbbm{1}[\mathcal{K}_{1,n}(\epsilon)]\mathbb{P}[\tilde{\mu}_{1}(t)\leq\mu_{1}-\epsilon|\mathcal{F}_{t}]]\\ \leq\left(1-\left(\frac{\kappa_{1}}{\kappa_{1}+\epsilon}\right)^{n\alpha_{1}}\right)\int_{0}^{\infty}\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\left(1-\frac{\Gamma\left(n,\xi y\right)}{\Gamma(n)}\right)+\frac{\Gamma\left(n,\xi y\right)}{\Gamma(n)}\right)\frac{\alpha_{1}^{n-1}}{\Gamma(n-1)}y^{n-2}e^{-\alpha_{1}y}\mathrm{d}y.

Therefore, Lemma 21 concludes the proof for any nn\in\mathbb{N}, by carefully selecting ξβ\xi\leq\beta satisfying

(1(κ1κ1+ϵ)nα1)0((μ1ϵμ((κ1,ξ)))n(1Γ(n,ξy)Γ(n))+Γ(n,ξy)Γ(n))α1n1Γ(n1)yn2eα1ydy0Γ(n+1,βy)Γ(n+1)α1nΓ(n)yn1eα1ydy.\left(1-\left(\frac{\kappa_{1}}{\kappa_{1}+\epsilon}\right)^{n\alpha_{1}}\right)\int_{0}^{\infty}\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\left(1-\frac{\Gamma\left(n,\xi y\right)}{\Gamma(n)}\right)+\frac{\Gamma\left(n,\xi y\right)}{\Gamma(n)}\right)\frac{\alpha_{1}^{n-1}}{\Gamma(n-1)}y^{n-2}e^{-\alpha_{1}y}\mathrm{d}y\\ \leq\int_{0}^{\infty}\frac{\Gamma(n+1,\beta y)}{\Gamma(n+1)}\frac{\alpha_{1}^{n}}{\Gamma(n)}y^{n-1}e^{-\alpha_{1}y}\mathrm{d}y.

Note that when we consider STS\mathrm{STS} with k=1k=-1, we have to find ξβ\xi^{\prime}\leq\beta such that

(1(κ1κ1+ϵ)nα1)\displaystyle\left(1-\left(\frac{\kappa_{1}}{\kappa_{1}+\epsilon}\right)^{n\alpha_{1}}\right)
×0((μ1ϵμ((κ1,ξ)))n(1Γ(n+1,ξy)Γ(n+1))+Γ(n+1,ξy)Γ(n+1))α1n1Γ(n1)yn2eα1ydy\displaystyle\times\int_{0}^{\infty}\left(\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi^{\prime}))}\right)^{n}\left(1-\frac{\Gamma\left(n+1,\xi^{\prime}y\right)}{\Gamma(n+1)}\right)+\frac{\Gamma\left(n+1,\xi^{\prime}y\right)}{\Gamma(n+1)}\right)\frac{\alpha_{1}^{n-1}}{\Gamma(n-1)}y^{n-2}e^{-\alpha_{1}y}\mathrm{d}y
0Γ(n+1,βy)Γ(n+1)α1nΓ(n)yn1eα1ydy.\displaystyle\hskip 150.00023pt\leq\int_{0}^{\infty}\frac{\Gamma(n+1,\beta y)}{\Gamma(n+1)}\frac{\alpha_{1}^{n}}{\Gamma(n)}y^{n-1}e^{-\alpha_{1}y}\mathrm{d}y.

Since Γ(n,x)≥Γ(n,x+y) holds for any x,y>0, that is, Γ(n,⋅) is nonincreasing, we have for ξ′≤β and any x>0 that

Γ(n+1,ξx)Γ(n+1)Γ(n+1,βx)Γ(n+1).\frac{\Gamma\left(n+1,\xi^{\prime}x\right)}{\Gamma(n+1)}\geq\frac{\Gamma(n+1,\beta x)}{\Gamma(n+1)}.

Therefore, for k1k\in\mathbb{Z}_{\leq-1}, we might not be able to apply the results by Korda et al. [2013]. ∎

C.4 Proof of Lemma 8

We first state two lemmas on the events 𝒦 and 𝒜.

Lemma 15.

For any algorithm and a[K]a\in[K], it holds that for all ϵ>0\epsilon>0, t>0t>0, and nn\in\mathbb{N}

[𝒦a,Na(t)c(ϵ),Na(t)=n]exp(αaϵκa+ϵn).\mathbb{P}\left[\mathcal{K}_{a,N_{a}(t)}^{c}(\epsilon),N_{a}(t)=n\right]\leq\exp\left(-\frac{\alpha_{a}\epsilon}{\kappa_{a}+\epsilon}n\right).
Lemma 16.

For any algorithm and for any a∈[K], it holds that for all ϵ∈(0, κ_a/(α_a(κ_a+1))), t>0, and n≥n̄

[𝒜a,Na(t)c(ϵ),𝒦a,Na(t)(ϵ),Na(t)=n]2exp(αa2ϵ24n),\mathbb{P}\left[\mathcal{A}_{a,N_{a}(t)}^{c}(\epsilon),\mathcal{K}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n\right]\leq 2\exp\left(-\frac{\alpha_{a}^{2}\epsilon^{2}}{4}n\right),

Lemma 8 (restated).

Proof.

From Lemmas 15 and 16, one can see that for n≥n̄,

[a,Na(t)c(ϵ),Na(t)=n]\displaystyle\mathbb{P}\left[\mathcal{E}_{a,N_{a}(t)}^{c}(\epsilon),N_{a}(t)=n\right] =[𝒦a,Na(t)c(ϵ),Na(t)=n]+[𝒜a,Na(t)c(ϵ),𝒦a,Na(t)(ϵ),Na(t)=n]\displaystyle=\mathbb{P}\left[\mathcal{K}_{a,N_{a}(t)}^{c}(\epsilon),N_{a}(t)=n\right]+\mathbb{P}\left[\mathcal{A}_{a,N_{a}(t)}^{c}(\epsilon),\mathcal{K}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n\right]
exp(αaϵκa+ϵn)+2exp(αa2ϵ24n).\displaystyle\leq\exp\left(-\frac{\alpha_{a}\epsilon}{\kappa_{a}+\epsilon}n\right)+2\exp\left(-\frac{\alpha_{a}^{2}\epsilon^{2}}{4}n\right).

Since the event {j(t)=a, ℰ_{a,n}^c(ϵ), N_a(t)=n} occurs at most once for each n∈ℕ, it holds that

t=n¯K+1T𝔼[𝟙[j(t)=a,a,Na(t)c(ϵ)]]\displaystyle\sum_{t=\bar{n}K+1}^{T}\mathbb{E}\left[\mathbbm{1}\left[j(t)=a,\mathcal{E}^{c}_{a,N_{a}(t)}(\epsilon)\right]\right] =t=n¯K+1Tn=n¯T𝔼[𝟙[j(t)=a,a,Na(t)c(ϵ),Na(t)=n]]\displaystyle=\sum_{t=\bar{n}K+1}^{T}\sum_{n=\bar{n}}^{T}\mathbb{E}\left[\mathbbm{1}\left[j(t)=a,\mathcal{E}^{c}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n\right]\right]
n=n¯𝔼[𝟙[a,Na(t)c(ϵ),Na(t)=n]]\displaystyle\leq\sum_{n=\bar{n}}^{\infty}\mathbb{E}\left[\mathbbm{1}\left[\mathcal{E}_{a,N_{a}(t)}^{c}(\epsilon),N_{a}(t)=n\right]\right]
=n=n¯[𝒦a,Na(t)c(ϵ),Na(t)=n]\displaystyle=\sum_{n=\bar{n}}^{\infty}\mathbb{P}\left[\mathcal{K}_{a,N_{a}(t)}^{c}(\epsilon),N_{a}(t)=n\right]
+[𝒜a,Na(t)c(ϵ)𝒦a,Na(t)(ϵ),Na(t)=n]\displaystyle\hskip 30.00005pt+\mathbb{P}\left[\mathcal{A}_{a,N_{a}(t)}^{c}(\epsilon)\cap\mathcal{K}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n\right]
n=n¯exp(αaϵκa+ϵn)+2exp(αa2ϵ24n).\displaystyle\leq\sum_{n=\bar{n}}^{\infty}\exp\left(-\frac{\alpha_{a}\epsilon}{\kappa_{a}+\epsilon}n\right)+2\exp\left(-\frac{\alpha_{a}^{2}\epsilon^{2}}{4}n\right).

Since exp(an)\exp(-an) with a>0a>0 is a decreasing function with respect to nn, it holds that

n=2exp(an)1exp(an)dn=exp(a)a,\sum_{n=2}^{\infty}\exp(-an)\leq\int_{1}^{\infty}\exp(-an)\mathrm{d}n=\frac{\exp(-a)}{a},

which concludes the proof. ∎

C.4.1 Proof of Lemma 15

Lemma 15 (restated).

Proof.

Since κ̂_a(n)∼Pa(κ_a, nα_a) holds for any n∈ℕ by (2), we have

[𝒦a,Na(t)c,Na(t)=n]\displaystyle\mathbb{P}\left[\mathcal{K}^{c}_{a,N_{a}(t)},N_{a}(t)=n\right] =[κ^a(Na(t))κa+ϵ,Na(t)=n]\displaystyle=\mathbb{P}\left[\hat{\kappa}_{a}(N_{a}(t))\geq\kappa_{a}+\epsilon,N_{a}(t)=n\right]
=(κaκa+ϵ)nαaexp(αaϵκa+ϵn),\displaystyle=\left(\frac{\kappa_{a}}{\kappa_{a}+\epsilon}\right)^{n\alpha_{a}}\leq\exp\left(-\frac{\alpha_{a}\epsilon}{\kappa_{a}+\epsilon}n\right),

which concludes the proof. ∎
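As a sanity check, the following is a small Monte Carlo sketch (our own Python/NumPy code, not from the paper; the parameter values are arbitrary examples) comparing the empirical tail probability of κ̂_a(n) with the exact expression (κ_a/(κ_a+ϵ))^{nα_a} and the exponential bound of Lemma 15.

```python
import numpy as np

# Monte Carlo check of Lemma 15 (our own sketch): for Pareto(kappa, alpha) samples,
# P[kappa_hat(n) >= kappa + eps] = (kappa / (kappa + eps))^{n * alpha}, which is at most
# exp(-n * alpha * eps / (kappa + eps)). The parameter values are arbitrary examples.
rng = np.random.default_rng(0)
kappa, alpha, eps, n, trials = 1.0, 2.0, 0.3, 10, 200_000

# Pareto(kappa, alpha) rewards via inverse-CDF sampling; the MLE kappa_hat is the minimum.
r = kappa * rng.uniform(size=(trials, n)) ** (-1.0 / alpha)
kappa_hat = r.min(axis=1)

empirical = np.mean(kappa_hat >= kappa + eps)
exact = (kappa / (kappa + eps)) ** (n * alpha)
bound = np.exp(-alpha * eps / (kappa + eps) * n)
print(f"empirical={empirical:.5f}  exact={exact:.5f}  bound={bound:.5f}")
# Expected ordering: empirical ≈ exact <= bound.
```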

C.4.2 Proof of Lemma 16

Lemma 16 (restated).

Proof.

Fix a time index t with N_a(t)=n and denote ℙ_t[⋅]=ℙ[⋅∣ℱ_t]. To simplify the notation, we drop the argument n of κ̂_a(n) and α̂_a(n).

Let r′_{a,k} be the k-th order statistic of (r_{a,s})_{s=1}^n for arm a, so that r′_{a,1}≤r′_{a,2}≤⋯≤r′_{a,n}. From the definition of the MLE of α_a,

[α^aαaϵa,l(ϵ),𝒦a,Na(t)(ϵ),Na(t)=n]\displaystyle\mathbb{P}[\hat{\alpha}_{a}\leq\alpha_{a}-\epsilon_{a,l}(\epsilon),\mathcal{K}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n] [ns=1nlogra,snlogra,1αaϵa,l(ϵ)]\displaystyle\leq\mathbb{P}\left[\frac{n}{\sum_{s=1}^{n}\log r_{a,s}^{\prime}-n\log r_{a,1}^{\prime}}\leq\alpha_{a}-\epsilon_{a,l}(\epsilon)\right]
=[nαaϵa,l(ϵ)s=1nlogra,sra,1]\displaystyle=\mathbb{P}\left[\frac{n}{\alpha_{a}-\epsilon_{a,l}(\epsilon)}\leq\sum_{s=1}^{n}\log\frac{r_{a,s}^{\prime}}{r_{a,1}^{\prime}}\right]
=[nαaϵa,l(ϵ)nlogκra,1+s=1nlogra,sκ]\displaystyle=\mathbb{P}\left[\frac{n}{\alpha_{a}-\epsilon_{a,l}(\epsilon)}\leq n\log\frac{\kappa}{r_{a,1}^{\prime}}+\sum_{s=1}^{n}\log\frac{r_{a,s}^{\prime}}{\kappa}\right]
[nαaϵa,l(ϵ)s=1nlogra,sκa]\displaystyle\leq\mathbb{P}\left[\frac{n}{\alpha_{a}-\epsilon_{a,l}(\epsilon)}\leq\sum_{s=1}^{n}\log\frac{r_{a,s}}{\kappa_{a}}\right]
[ϵ1ns=1nlogra,sκa1αa],\displaystyle\leq\mathbb{P}\left[\epsilon\leq\frac{1}{n}\sum_{s=1}^{n}\log\frac{r_{a,s}}{\kappa_{a}}-\frac{1}{\alpha_{a}}\right],

where the first equality holds from the definition of MLEs in (2), the second inequality holds since any sample generated from the Pareto distribution cannot be smaller than its scale parameter κ\kappa, and the last inequality holds from the definition of ϵa,l(ϵ)\epsilon_{a,l}(\epsilon) in (12).

Similarly, one can derive that

[α^aαa+ϵa,u(ϵ),𝒦a,Na(t)(ϵ),Na(t)=n]\displaystyle\mathbb{P}[\hat{\alpha}_{a}\geq\alpha_{a}+\epsilon_{a,u}(\epsilon),\mathcal{K}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n] [s=1nlogra,sκanαa+ϵa,u(ϵ)+nlogr1κ𝒦]\displaystyle\leq\mathbb{P}\left[\sum_{s=1}^{n}\log\frac{r_{a,s}}{\kappa_{a}}\leq\frac{n}{\alpha_{a}+\epsilon_{a,u}(\epsilon)}+n\log\frac{r_{1}^{\prime}}{\kappa}\cap\mathcal{K}\right]
[s=1nlogra,sκnαa+ϵa,u(ϵ)+nlogκa+ϵκa]\displaystyle\leq\mathbb{P}\left[\sum_{s=1}^{n}\log\frac{r_{a,s}}{\kappa}\leq\frac{n}{\alpha_{a}+\epsilon_{a,u}(\epsilon)}+n\log\frac{\kappa_{a}+\epsilon}{\kappa_{a}}\right]
[s=1nlogra,sκanαa+ϵa,u(ϵ)+nϵκa]\displaystyle\leq\mathbb{P}\left[\sum_{s=1}^{n}\log\frac{r_{a,s}}{\kappa_{a}}\leq\frac{n}{\alpha_{a}+\epsilon_{a,u}(\epsilon)}+\frac{n\epsilon}{\kappa_{a}}\right]
[1ns=1nlogra,sκa1αaϵ],\displaystyle\leq\mathbb{P}\left[\frac{1}{n}\sum_{s=1}^{n}\log\frac{r_{a,s}}{\kappa_{a}}-\frac{1}{\alpha_{a}}\leq-\epsilon\right],

where the second inequality holds since r′_{a,1}=κ̂_a≤κ_a+ϵ holds on 𝒦_{a,n}, the third inequality holds from log(1+x)≤x for x>−1, and the last inequality comes from the definition of ϵ_{a,u}(ϵ). From Fact 13, y_{a,s}:=log(r_{a,s}/κ_a)∼Exp(α_a), so the last probability can be regarded as a deviation probability of the sum of exponentially distributed random variables.

For the exponential distribution Exp(α)\mathrm{Exp}(\alpha), we say that Bernstein’s condition with parameter bb holds if

𝔼[Mk]12k!1α2bk2 for k=3,4,,\mathbb{E}\left[M_{k}\right]\leq\frac{1}{2}k!\frac{1}{\alpha^{2}}b^{k-2}\quad\text{ for }k=3,4,\ldots,

where M_k denotes the k-th central moment. For Exp(α_a), it holds that

𝔼[Mk]=!kαakk!21αa2(1αa)k2,\mathbb{E}\left[M_{k}\right]=\frac{!k}{\alpha_{a}^{k}}\leq\frac{k!}{2}\frac{1}{\alpha_{a}^{2}}\left(\frac{1}{\alpha_{a}}\right)^{k-2},

where !k!k is the subfactorial of kk such that !kk!e+12k!2!k\leq\frac{k!}{e}+\frac{1}{2}\leq\frac{k!}{2} for k3k\geq 3. Hence, the exponential distribution with parameter αa\alpha_{a} satisfies Bernstein’s condition with parameter 1αa\frac{1}{\alpha_{a}}, so that it is subexponential with parameters (2αa2,2αa)\left(\frac{2}{\alpha_{a}^{2}},\frac{2}{\alpha_{a}}\right). Therefore, by applying Lemma 18, we have

(|1ns=1nya,s1αa|ϵ)2exp(n4min{αa2ϵ2,αaϵ}).\mathbb{P}\left(\absolutevalue{\frac{1}{n}\sum_{s=1}^{n}y_{a,s}-\frac{1}{\alpha_{a}}}\geq\epsilon\right)\leq 2\exp\left(-\frac{n}{4}\min\{\alpha_{a}^{2}\epsilon^{2},\alpha_{a}\epsilon\}\right).

Note that it holds for ϵ<κaαa(κa+1)\epsilon<\frac{\kappa_{a}}{\alpha_{a}(\kappa_{a}+1)} that

[α^aαaϵa,l(ϵ)𝒦a,n]\displaystyle\mathbb{P}[\hat{\alpha}_{a}\leq\alpha_{a}-\epsilon_{a,l}(\epsilon)\cap\mathcal{K}_{a,n}] (1ns=1nya,s1αaϵ)\displaystyle\leq\mathbb{P}\left(\frac{1}{n}\sum_{s=1}^{n}y_{a,s}-\frac{1}{\alpha_{a}}\geq\epsilon\right)
[α^aαa+ϵa,u(ϵ)𝒦a,n]\displaystyle\mathbb{P}[\hat{\alpha}_{a}\geq\alpha_{a}+\epsilon_{a,u}(\epsilon)\cap\mathcal{K}_{a,n}] (1ns=1nya,s1αaϵ),\displaystyle\leq\mathbb{P}\left(\frac{1}{n}\sum_{s=1}^{n}y_{a,s}-\frac{1}{\alpha_{a}}\leq-\epsilon\right),

for ϵa,l(ϵ)=ϵαa21+ϵαa\epsilon_{a,l}(\epsilon)=\frac{\epsilon\alpha_{a}^{2}}{1+\epsilon\alpha_{a}} and ϵa,u(ϵ)=ϵαa2(κa+1)κaϵαa(κa+1)\epsilon_{a,u}(\epsilon)=\frac{\epsilon\alpha_{a}^{2}(\kappa_{a}+1)}{\kappa_{a}-\epsilon\alpha_{a}(\kappa_{a}+1)}, which satisfy limϵ0max{ϵa,l(ϵ),ϵa,u(ϵ)}=0+\lim_{\epsilon\to 0}\max\{\epsilon_{a,l}(\epsilon),\epsilon_{a,u}(\epsilon)\}=0_{+}. Hence, by recovering the original notations, we obtain

[𝒜a,Na(t)c(ϵ),𝒦a,Na(t)(ϵ),Na(t)=n]\displaystyle\mathbb{P}[\mathcal{A}_{a,N_{a}(t)}^{c}(\epsilon),\mathcal{K}_{a,N_{a}(t)}(\epsilon),N_{a}(t)=n] =[α^a(n)αaϵa,l(ϵ),𝒦a,Na(t),Na(t)=n]\displaystyle=\mathbb{P}[\hat{\alpha}_{a}(n)\leq\alpha_{a}-\epsilon_{a,l}(\epsilon),\mathcal{K}_{a,N_{a}(t)},N_{a}(t)=n]
+[α^a(n)αa+ϵa,u(ϵ),𝒦a,Na(t),Na(t)=n]\displaystyle\hskip 10.00002pt+\mathbb{P}[\hat{\alpha}_{a}(n)\geq\alpha_{a}+\epsilon_{a,u}(\epsilon),\mathcal{K}_{a,N_{a}(t)},N_{a}(t)=n]
2exp(αa2ϵ24n),\displaystyle\leq 2\exp\left(-\frac{\alpha^{2}_{a}\epsilon^{2}}{4}n\right),

for ϵ<1αa\epsilon<\frac{1}{\alpha_{a}} with αa>1\alpha_{a}>1. ∎
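The key reduction in the above proof is that the deviation of α̂_a is controlled by the deviation of the empirical mean of y_{a,s}=log(r_{a,s}/κ_a)∼Exp(α_a). A minimal numerical sketch of this step (our own code; the parameter values are arbitrary examples) is given below.

```python
import numpy as np

# Monte Carlo sketch of the concentration step in the proof of Lemma 16 (our own code):
# y_s = log(r_s / kappa) ~ Exp(alpha), and the deviation of its empirical mean from
# 1/alpha is compared with the bound 2 exp(-(n/4) min{alpha^2 eps^2, alpha eps}).
# The parameter values are arbitrary examples.
rng = np.random.default_rng(1)
alpha, eps, n, trials = 2.0, 0.2, 50, 200_000

y = rng.exponential(scale=1.0 / alpha, size=(trials, n))
deviation = np.abs(y.mean(axis=1) - 1.0 / alpha)

empirical = np.mean(deviation >= eps)
bound = 2.0 * np.exp(-n / 4.0 * min(alpha**2 * eps**2, alpha * eps))
print(f"empirical={empirical:.5f}  bound={bound:.5f}")  # empirical <= bound
```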

C.5 Proof of Lemma 9

Lemma 9 (restated).

Proof.

Fix a time index t with N_a(t)=n and denote ℙ_t[⋅]=ℙ[⋅∣ℱ_t]. To simplify the notation, we drop the argument n of κ̂_a(n) and α̂_a(n) and the argument t of κ̃_a(t), α̃_a(t), and μ̃_a(t).

When κ̂_a<y holds, μ̃_a≤y holds regardless of the value of κ̃_a if κ̂_a α̃_a/(α̃_a−1)≤y holds, since κ̃_a∈(0,κ̂_a] holds from its posterior distribution in (8). Hence, if κ̂_a<y, then

α~(t)yyκ^aμ~ay.\tilde{\alpha}(t)\geq\frac{y}{y-\hat{\kappa}_{a}}\implies\tilde{\mu}_{a}\leq y. (47)

When 1<α~(t)<yyκ^a1<\tilde{\alpha}(t)<\frac{y}{y-\hat{\kappa}_{a}},

μ~a=κ~aα~aα~a1yκ~ayα~a1α~a.\tilde{\mu}_{a}=\tilde{\kappa}_{a}\frac{\tilde{\alpha}_{a}}{\tilde{\alpha}_{a}-1}\leq y\,\Leftrightarrow\,\tilde{\kappa}_{a}\leq y\frac{\tilde{\alpha}_{a}-1}{\tilde{\alpha}_{a}}. (48)

Since α~a1\tilde{\alpha}_{a}\leq 1 implies μ~a=\tilde{\mu}_{a}=\infty, from (47) and (48), it holds that

t[μ~ay]\displaystyle\mathbb{P}_{t}[\tilde{\mu}_{a}\leq y] =1yyκ^afnk,nαa,nEr(x)t[κ~ax1xy]dx+yyκ^afnk,nαa,nEr(x)dx\displaystyle=\int_{1}^{\frac{y}{y-\hat{\kappa}_{a}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathbb{P}_{t}\left[\tilde{\kappa}_{a}\leq\frac{x-1}{x}y\right]\mathrm{d}x+\int_{\frac{y}{y-\hat{\kappa}_{a}}}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x
=1yyκ^afnk,nαa,nEr(x)(x1κ^axy)nxdx+yyκ^afnk,nαa,nEr(x)dx,\displaystyle=\int_{1}^{\frac{y}{y-\hat{\kappa}_{a}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}_{a}x}y\right)^{nx}\mathrm{d}x+\int_{\frac{y}{y-\hat{\kappa}_{a}}}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x,{} (49)

where ℙ_t[⋅]=ℙ[⋅|ℱ_t] and we recovered the CDF in (8) to obtain (49). Take any finite y′>y and let ξ:=y′/(y′−κ_a)<y/(y−κ_a), so that μ((κ_a,ξ))=y′. Since a/(a−b) is decreasing with respect to a>b>0, one can see that

t[μ~ay]\displaystyle\mathbb{P}_{t}[\tilde{\mu}_{a}\leq y] =1yyκ^afnk,nαa,nEr(x)(x1κ^axy)nxdx+yyκ^afnk,nαa,nEr(x)dx\displaystyle=\int_{1}^{\frac{y}{y-\hat{\kappa}_{a}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}_{a}x}y\right)^{nx}\mathrm{d}x+\int_{\frac{y}{y-\hat{\kappa}_{a}}}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x
1yyκ^afnk,nαa,nEr(x)(x1κ^axy)nxdx+yyκ^afnk,nαa,nEr(x)dx\displaystyle\leq\int_{1}^{\frac{y^{\prime}}{y^{\prime}-\hat{\kappa}_{a}}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}_{a}x}y\right)^{nx}\mathrm{d}x+\int_{\frac{y^{\prime}}{y^{\prime}-\hat{\kappa}_{a}}}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x
1yyκfnk,nαa,nEr(x)(x1κ^axy)nxdx+yyκfnk,nαa,nEr(x)dx\displaystyle\leq\int_{1}^{\frac{y^{\prime}}{y^{\prime}-\kappa}}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}_{a}x}y\right)^{nx}\mathrm{d}x+\int_{\frac{y^{\prime}}{y^{\prime}-\kappa}}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x
1ξfnk,nαa,nEr(x)(x1κxy)ndx+ξfnk,nαa,nEr(x)dx\displaystyle\leq\int_{1}^{\xi}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\left(\frac{x-1}{\kappa x}y\right)^{n}\mathrm{d}x+\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x{} (50)
(ξ1κξy)n1ξfnk,nαa,nEr(x)dx+ξfnk,nαa,nEr(x)dx\displaystyle\leq\left(\frac{\xi-1}{\kappa\xi}y\right)^{n}\int_{1}^{\xi}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x{} (51)
=(yμ((κ,ξ)))n1ξfnk,nαa,nEr(x)dx+ξfnk,nαa,nEr(x)dx,\displaystyle=\left(\frac{y}{\mu((\kappa,\xi))}\right)^{n}\int_{1}^{\xi}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{a,n}}}^{\text{Er}}(x)\mathrm{d}x,

where (50) comes from κ̂_a≥κ and from ((x−1)y)/(κx)≤y/μ((κ,ξ))<1 for x∈(1,ξ], which allows replacing the exponent nx with n, and we used the increasing property of (x−1)/x in (51). ∎
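As a sanity check on (49), the following Monte Carlo sketch (our own Python code; the values of n, k, κ̂_a, α_{a,n}, and y are arbitrary examples, with α_{a,n} taken to be the MLE as in the posterior of Appendix E.3) compares direct sampling from the posterior of (κ̃_a, α̃_a) with the closed-form mixture in (49).

```python
import numpy as np
from scipy import stats, integrate

# Sanity check of (49) (our own code): P_t[mu_tilde <= y] for a single arm with kappa_hat < y,
# computed (i) by direct posterior sampling and (ii) by the closed-form mixture in (49).
# alpha_{a,n} is taken to be the MLE alpha_hat; all values below are arbitrary examples.
rng = np.random.default_rng(3)
n, k, kappa_hat, alpha_hat, y = 8, 1, 1.0, 2.0, 1.8

# (i) direct sampling from the posterior of (kappa_tilde, alpha_tilde)
m = 400_000
alpha_til = rng.gamma(shape=n - k, scale=alpha_hat / n, size=m)          # Erlang(n-k, n/alpha_hat)
kappa_til = kappa_hat * rng.uniform(size=m) ** (1.0 / (n * alpha_til))   # CDF (x/kappa_hat)^{n*alpha}
mu_til = np.where(alpha_til > 1.0,
                  kappa_til * alpha_til / np.maximum(alpha_til - 1.0, 1e-12),
                  np.inf)
mc_estimate = np.mean(mu_til <= y)

# (ii) the right-hand side of (49)
erlang = stats.gamma(a=n - k, scale=alpha_hat / n)
upper = y / (y - kappa_hat)

def integrand(x):
    return erlang.pdf(x) * ((x - 1.0) * y / (kappa_hat * x)) ** (n * x)

closed_form = integrate.quad(integrand, 1.0, upper)[0] + erlang.sf(upper)
print(f"Monte Carlo ≈ {mc_estimate:.4f}   formula (49) ≈ {closed_form:.4f}")
```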

C.6 Proof of Lemma 10

Lemma 10 (restated).

Proof.

Similarly to other proofs, fix tt and let N1(t)=nN_{1}(t)=n. To simplify notations, we drop the argument tt of κ~1(t),α~1(t)\tilde{\kappa}_{1}(t),\tilde{\alpha}_{1}(t) and μ~1(t)\tilde{\mu}_{1}(t) and the argument nn of κ^1(n),α^1(n),α¯1(n)\hat{\kappa}_{1}(n),\hat{\alpha}_{1}(n),\bar{\alpha}_{1}(n).

Case 1. On {κ^1μ1ϵ}\{\hat{\kappa}_{1}\geq\mu_{1}-\epsilon\}

Under the condition {κ̂₁≥μ₁−ϵ}, whether the event {μ̃₁≤μ₁−ϵ} occurs is determined by the value of κ̃₁, since {κ̃₁∈(μ₁−ϵ, κ̂₁]} is a sufficient condition for {μ̃₁>μ₁−ϵ}. Therefore, if κ̂₁≥μ₁−ϵ, then

pn(ϵ|θ1,n)=1fnk,nα1,nEr(x)(μ1ϵκ^1x1x)nxdx.p_{n}(\epsilon|{\theta}_{1,n})=\int_{1}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\left(\frac{\mu_{1}-\epsilon}{\hat{\kappa}_{1}}\frac{x-1}{x}\right)^{nx}\mathrm{d}x.

Then,

𝟙[κ^1μ1ϵ]pn(ϵ|θ1,n)\displaystyle\mathbbm{1}[\hat{\kappa}_{1}\geq\mu_{1}-\epsilon]p_{n}(\epsilon|{\theta}_{1,n}) =𝟙[κ^1μ1ϵ](1fnk,nα1,nEr(x)(μ1ϵκ^1x1x)nxdx)\displaystyle=\mathbbm{1}[\hat{\kappa}_{1}\geq\mu_{1}-\epsilon]\left(\int_{1}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\left(\frac{\mu_{1}-\epsilon}{\hat{\kappa}_{1}}\frac{x-1}{x}\right)^{nx}\mathrm{d}x\right)
𝟙[κ^1μ1ϵ]1fnk,nα1,nEr(x)(11x)nxdx\displaystyle\leq\mathbbm{1}[\hat{\kappa}_{1}\geq\mu_{1}-\epsilon]\int_{1}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\left(1-\frac{1}{x}\right)^{nx}\mathrm{d}x
𝟙[κ^1(n)μ1ϵ]1fnk,nα1,nEr(x)endx\displaystyle\leq\mathbbm{1}[\hat{\kappa}_{1}(n)\geq\mu_{1}-\epsilon]\int_{1}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)e^{-n}\mathrm{d}x
𝟙[κ^1(n)μ1ϵ]en,\displaystyle\leq\mathbbm{1}[\hat{\kappa}_{1}(n)\geq\mu_{1}-\epsilon]e^{-n},

where the second inequality holds from μ1ϵκ^1\mu_{1}-\epsilon\leq\hat{\kappa}_{1}.

Case 2. On {κ^1K(ϵ),α1,nα1+ρ}\{\hat{\kappa}_{1}\in K(\epsilon),\alpha_{1,n}\leq\alpha_{1}+\rho\}

By applying Lemma 9 with y=μ1ϵy=\mu_{1}-\epsilon, we have for any ξμ1ϵμ1ϵκ1\xi\leq\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\kappa_{1}} that

𝟙[κ^1<μ1ϵ,α1,nα+ρ]pn(ϵ|θ1,n)\displaystyle\mathbbm{1}[\hat{\kappa}_{1}<\mu_{1}-\epsilon,\alpha_{1,n}\leq\alpha+\rho]p_{n}(\epsilon|{\theta}_{1,n}) (μ1ϵμ((κ1,ξ)))n1ξfnk,nα1,nEr(x)dx+ξfnk,nα1,nEr(x)dx\displaystyle\leq\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\int_{1}^{\xi}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x
(μ1ϵμ((κ1,ξ)))n0ξfnk,nα1,nEr(x)dx+ξfnk,nα1,nEr(x)dx.\displaystyle\leq\left(\frac{\mu_{1}-\epsilon}{\mu((\kappa_{1},\xi))}\right)^{n}\int_{0}^{\xi}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{\xi}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x.{} (52)

Let us define ρ¯:=ρθ(ϵ/2)\bar{\rho}:=\rho_{\theta}(\epsilon/2). Then, it satisfies μ((κ1,α1+ρ¯))=μ1ϵ4\mu((\kappa_{1},\alpha_{1}+\bar{\rho}))=\mu_{1}-\frac{\epsilon}{4} and

\alpha_{1}+\bar{\rho}=\frac{\mu_{1}-\epsilon/4}{\mu_{1}-\epsilon/4-\kappa_{1}}<\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon-\kappa_{1}},

where the inequality holds from the decreasing property of the function xxy\frac{x}{x-y} with respect to x>yx>y. By replacing ξ\xi with α1+ρ¯\alpha_{1}+\bar{\rho} in (52), we have

𝟙[κ^1\displaystyle\mathbbm{1}[\hat{\kappa}_{1} <μ1ϵ,α1,nα1+ρ]pn(ϵ|θ¯1,n)\displaystyle<\mu_{1}-\epsilon,\alpha_{1,n}\leq\alpha_{1}+\rho]p_{n}(\epsilon|\bar{\theta}_{1,n})
\displaystyle\leq\mathbbm{1}[\hat{\kappa}_{1}<\mu_{1}-\epsilon,\alpha_{1,n}\leq\alpha_{1}+\rho]\bigg{(}\left(\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon/4}\right)^{n}\int_{0}^{\alpha_{1}+\bar{\rho}}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x+\int_{\alpha_{1}+\bar{\rho}}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\text{Er}}(x)\mathrm{d}x\bigg{)}
𝟙[κ^1<μ1ϵ,α1,nα1+ρ](en(3ϵ4μ1ϵ)(1[α~1α1+ρ¯])+[α~1α1+ρ¯]).\displaystyle\leq\mathbbm{1}[\hat{\kappa}_{1}<\mu_{1}-\epsilon,\alpha_{1,n}\leq\alpha_{1}+\rho]\bigg{(}e^{-n\left(\frac{3\epsilon}{4\mu_{1}-\epsilon}\right)}\left(1-\mathbb{P}[\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho}]\right)+\mathbb{P}[\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho}]\bigg{)}.{} (53)

Let Z_n be a random variable that follows the chi-squared distribution with n degrees of freedom and let F_n(⋅) denote the CDF of Z_n. Then, it holds that

[α~1α1+ρ¯,α1,nα1+ρ]\displaystyle\mathbb{P}\bigg{[}\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho},\alpha_{1,n}\leq\alpha_{1}+\rho\bigg{]} =[Z2nα1,n(α1+ρ¯),α1,nα1+ρ]by Fact 12\displaystyle=\mathbb{P}\left[Z\geq\frac{2n}{\alpha_{1,n}}(\alpha_{1}+\bar{\rho}),\alpha_{1,n}\leq\alpha_{1}+\rho\right]\hskip 10.00002pt\text{by Fact~{}\ref{fact1}}
[Z2nα1+ρ¯α1+ρ]\displaystyle\leq\mathbb{P}\left[Z\geq 2n\frac{\alpha_{1}+\bar{\rho}}{\alpha_{1}+\rho}\right]
[Z2nμ1ϵ/4μ1ϵ/2]\displaystyle\leq\mathbb{P}\left[Z\geq 2n\frac{\mu_{1}-\epsilon/4}{\mu_{1}-\epsilon/2}\right]
=1F2n2k(2n(1+ζ)),\displaystyle=1-F_{2n-2k}(2n(1+\zeta)),{} (54)

where ζ=ϵ4μ12ϵ(0,1)\zeta=\frac{\epsilon}{4\mu_{1}-2\epsilon}\in(0,1). By applying Lemma 19, we have if nζ>kn\zeta>-k,

[α~1α1+ρ¯,α1,nα1+ρ]\displaystyle\mathbb{P}[\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho},\alpha_{1,n}\leq\alpha_{1}+\rho]
1F2n2k(2n(1+ζ))\displaystyle\leq 1-F_{2n-2k}\left(2n(1+\zeta)\right)
<122π(nk)nk1/2e(nk)Γ(nk)erfc(n(ζ+k)(nk)logn(1+ζ)nk),\displaystyle<\frac{1}{2}\frac{\sqrt{2\pi}(n-k)^{n-k-1/2}e^{-(n-k)}}{\Gamma(n-k)}\mathrm{erfc}\left(\sqrt{n(\zeta+k)-(n-k)\log\frac{n(1+\zeta)}{n-k}}\right),

where Γ()\Gamma(\cdot) denotes the Gamma function. For n1/2n\geq 1/2, it holds from Stirling’s formula that

2πnn1/2enΓ(n)2πe1/6nn1/2en,\sqrt{2\pi}n^{n-1/2}e^{-n}\leq\Gamma(n)\leq\sqrt{2\pi}e^{1/6}n^{n-1/2}e^{-n},

which results in

[α~1α1+ρ¯,α1,nα1+ρ]<12erfc(n(ζ+k)(nk)logn(1+ζ)nk).\mathbb{P}[\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho},\alpha_{1,n}\leq\alpha_{1}+\rho]<\frac{1}{2}\mathrm{erfc}\left(\sqrt{n(\zeta+k)-(n-k)\log\frac{n(1+\zeta)}{n-k}}\right). (55)

Notice that (n−k)log(n(1+ζ)/(n−k))>0 always holds under the assumption nζ>−k, which priors with k∈ℤ≥0 satisfy regardless of n. Thus, if ζ+k≤0, then the quantity under the square root in (55) becomes negative, which makes the upper bound in (55) greater than or equal to 1/2. Therefore, for priors with k∈ℤ<0, the right-hand side of (55) cannot be smaller than a constant since ζ∈(0,1), that is, this bound does not decay with n.

Since the complementary error function is a decreasing function, for priors with k0k\in\mathbb{Z}_{\geq 0}, it holds from (55) that

\mathbb{P}[\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho},\alpha_{1,n}\leq\alpha_{1}+\rho]\leq\frac{1}{2}\mathrm{erfc}\left(\sqrt{n(\zeta-\log(1+\zeta))}\right),

where we substituted k=0. The complementary error function is bounded for any x≥0 as follows [Simon and Divsalar, 1998]:

erfc(x)ex2,\mathrm{erfc}(x)\leq e^{-x^{2}},

which implies

[α~1α1+ρ¯,α1,nα1+ρ]12encμ1(ϵ),\mathbb{P}[\tilde{\alpha}_{1}\geq\alpha_{1}+\bar{\rho},\alpha_{1,n}\leq\alpha_{1}+\rho]\leq\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}, (56)

where cμ1(ϵ)=ζ−log(1+ζ)>0 is a deterministic constant depending only on μ1 and ϵ. By combining (56) with (53), we have

𝟙[κ^1<μ1ϵ,α1,nα1+ρ]pn(ϵ|θ1,n)en3ϵ4μ1(112encμ1(ϵ))+12encμ1(ϵ)=:h(μ1,ϵ,n).\mathbbm{1}[\hat{\kappa}_{1}<\mu_{1}-\epsilon,\alpha_{1,n}\leq\alpha_{1}+\rho]p_{n}(\epsilon|{\theta}_{1,n})\leq e^{-n\frac{3\epsilon}{4\mu_{1}}}\left(1-\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}\right)+\frac{1}{2}e^{-nc_{\mu_{1}}(\epsilon)}=:h(\mu_{1},\epsilon,n).

From the power-series expansion of log(1+x), we have log(1+x) ≤ x − x²/2 + x³/3 for x∈(0,1) and

c_{\mu_{1}}(\epsilon)=\zeta-\log(1+\zeta)\geq\frac{\zeta^{2}}{2}-\frac{\zeta^{3}}{3}=\frac{\zeta^{2}}{6}(3-2\zeta)\geq\frac{\zeta^{2}}{6},

which implies 1/cμ1(ϵ)=𝒪(ϵ^{−2}).
Case 3. On {κ^1K(ϵ),α1,nα1+ρ}\{\hat{\kappa}_{1}\in K(\epsilon),\alpha_{1,n}\geq\alpha_{1}+\rho\}

By applying Lemma 9 with y=μ1ϵy=\mu_{1}-\epsilon and ξ=α1+ρ\xi=\alpha_{1}+\rho, we have

𝟙[κ^1<μ1ϵ]pn(ϵ|θ1,n)\displaystyle\mathbbm{1}[\hat{\kappa}_{1}<\mu_{1}-\epsilon]p_{n}(\epsilon|{\theta}_{1,n}) (μ1ϵμ1ϵ/2)n1α1+ρfnk,nα1,nEr(x)dx+α1+ρfnk,nα1,nEr(x)dx\displaystyle\leq\bigg{(}\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon/2}\bigg{)}^{n}\int_{1}^{\alpha_{1}+\rho}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\mathrm{Er}}(x)\mathrm{d}x+\int_{\alpha_{1}+\rho}^{\infty}f_{n-k,\frac{n}{\alpha_{1,n}}}^{\mathrm{Er}}(x)\mathrm{d}x
=C1(μ1,ϵ,n)[α~1[1,α1+ρ]α1,n]+[α~1α1+ρα1,n]\displaystyle=C_{1}(\mu_{1},\epsilon,n)\mathbb{P}\left[\tilde{\alpha}_{1}\in[1,\alpha_{1}+\rho]\mid\alpha_{1,n}\right]+\mathbb{P}\left[\tilde{\alpha}_{1}\geq\alpha_{1}+\rho\mid\alpha_{1,n}\right]
C1(μ1,ϵ,n)[α~1α1+ρα1,n]+[α~1α1+ρα1,n]\displaystyle\leq C_{1}(\mu_{1},\epsilon,n)\mathbb{P}\left[\tilde{\alpha}_{1}\leq\alpha_{1}+\rho\mid\alpha_{1,n}\right]+\mathbb{P}\left[\tilde{\alpha}_{1}\geq\alpha_{1}+\rho\mid\alpha_{1,n}\right]
=C1(μ1,ϵ,n)Gk(1/α1,n;n)+(1Gk(1/α1,n;n)),\displaystyle=C_{1}(\mu_{1},\epsilon,n)G_{k}(1/\alpha_{1,n};n)+(1-G_{k}(1/\alpha_{1,n};n)),{} (57)

where An=C1(μ1,ϵ,n):=(μ1ϵμ1ϵ/2)nenϵ2μ1ϵ<1A_{n}=C_{1}(\mu_{1},\epsilon,n):=\left(\frac{\mu_{1}-\epsilon}{\mu_{1}-\epsilon/2}\right)^{n}\leq e^{-n\frac{\epsilon}{2\mu_{1}-\epsilon}}<1. Since α~1\tilde{\alpha}_{1} follows Erlang(nk,nα1,n)\mathrm{Erlang}\left(n-k,\frac{n}{\alpha_{1,n}}\right), it holds that

[α~1α1+ρα1,n]=γ(nk,n(α1+ρ)α1,n)Γ(nk),\displaystyle\mathbb{P}\left[\tilde{\alpha}_{1}\leq\alpha_{1}+\rho\mid\alpha_{1,n}\right]=\frac{\gamma\left(n-k,\frac{n(\alpha_{1}+\rho)}{\alpha_{1,n}}\right)}{\Gamma(n-k)},

where γ(,)\gamma(\cdot,\cdot) denotes the lower incomplete gamma function. Therefore, letting

Gk(x;n):=γ(nk,n(α1+ρ)x)Γ(nk)G_{k}(x;n):=\frac{\gamma\left(n-k,n(\alpha_{1}+\rho)x\right)}{\Gamma(n-k)}

concludes the proof. ∎

Appendix D Proof of Theorem 3

As shown in the proof of Theorem 2, the integral term in (26) diverges for k∈ℤ≤1 without the restriction on α̂. In this section, we provide the partial proof of Theorem 3 for k∈ℤ≤1, which shows the necessity of such a requirement to achieve asymptotic optimality.

Proof of Theorem 3.

We consider a two-armed bandit problem with θ1=(κ,α1) and θ2=(κ,α2). Let γ=max{⌈α2⌉,⌊α2⌋+1} and m=γ/(γ−1), so that μ2/m=κ(α2/(α2−1))((γ−1)/γ)>κ. Assume 1<α1<α2 and μ̃2(s)=μ2=κα2/(α2−1) for all s∈ℕ; note that α1<α2 implies that arm 1 is the optimal arm. Recall that STS starts by playing every arm twice for priors with k≤1, i.e., Na(s)≥2 holds for all a∈{1,2}. We have for T≥5

𝔼[Reg(T)]\displaystyle\mathbb{E}[\mathrm{Reg}(T)] =Δ2𝔼[t=1T𝟙[j(t)=2]]\displaystyle=\Delta_{2}\mathbb{E}\left[\sum_{t=1}^{T}\mathbbm{1}[j(t)=2]\right]
Δ2𝔼[t=5T𝟙[j(t)=2,N1(t)=2]].\displaystyle\geq\Delta_{2}\mathbb{E}\left[\sum_{t=5}^{T}\mathbbm{1}[j(t)=2,N_{1}(t)=2]\right].

From the definition of N1()N_{1}(\cdot), {j(s)2,N1(s)=2}\{j(s)\neq 2,N_{1}(s)=2\} implies N1(t)>2N_{1}(t)>2 for t>st>s. Therefore, for any t5t\geq 5,

{j(t)=2,N1(t)=2}\displaystyle\{j(t)=2,N_{1}(t)=2\} {s[1,t4]:j(s+4)=2}\displaystyle\Leftrightarrow\{\forall s\in[1,t-4]:j(s+4)=2\}
{s[1,t4]:μ~1(s+4)<μ2}.\displaystyle\Leftrightarrow\{\forall s\in[1,t-4]:\tilde{\mu}_{1}(s+4)<\mu_{2}\}.

Let T=T4T^{\prime}=T-4, then we have

𝔼[t=5T𝟙[j(t)=2,N1(t)=2]]\displaystyle\mathbb{E}\bigg{[}\sum_{t=5}^{T}\mathbbm{1}[j(t)=2,N_{1}(t)=2]\bigg{]} =𝔼[t=5T𝟙[s[1,t4]:μ~1(s+4)<μ2]]\displaystyle=\mathbb{E}\left[\sum_{t=5}^{T}\mathbbm{1}\left[\forall s\in[1,t-4]:\tilde{\mu}_{1}(s+4)<\mu_{2}\right]\right]
=𝔼[s=1T([μ~1μ2|κ^1(2),α^1(2)])s].\displaystyle=\mathbb{E}\left[\sum_{s=1}^{T^{\prime}}\left(\mathbb{P}[\tilde{\mu}_{1}\leq\mu_{2}|\hat{\kappa}_{1}(2),\hat{\alpha}_{1}(2)]\right)^{s}\right].{} (58)

Notice that κ̂1(N1(s))=κ̂1(2) and α̂1(N1(s))=α̂1(2) hold on the considered event since only arm 2 is played after the initial plays. Here, we first provide the lower bound on ℙ[μ̃1≤μ2|κ̂1,α̂1]. Since μ2/m=(κα2/(α2−1))((γ−1)/γ)>κ holds, we can consider the case where κ̂1(2)≤μ2/m occurs, which has positive probability since κ̂1(2)∼Pa(κ,2α1).

From (49), it holds for yκ^1(n)y\geq\hat{\kappa}_{1}(n) that

t[μ~ay]\displaystyle\mathbb{P}_{t}[\tilde{\mu}_{a}\leq y] =1yyκ^(n)fnk,nαnEr(x)(x1κ^xy)nxdx+yyκ^(n)fnk,nαnEr(x)dxby (49)\displaystyle=\int_{1}^{\frac{y}{y-\hat{\kappa}(n)}}f_{n-k,\frac{n}{\alpha_{n}}}^{\text{Er}}(x)\left(\frac{x-1}{\hat{\kappa}x}y\right)^{nx}\mathrm{d}x+\int_{\frac{y}{y-\hat{\kappa}(n)}}^{\infty}f_{n-k,\frac{n}{\alpha_{n}}}^{\text{Er}}(x)\mathrm{d}x\hskip 20.00003pt\text{by (\ref{eq: tmucase1})}
yyκ^(n)fnk,nαnEr(x)dx.\displaystyle\geq\int_{\frac{y}{y-\hat{\kappa}(n)}}^{\infty}f_{n-k,\frac{n}{\alpha_{n}}}^{\text{Er}}(x)\mathrm{d}x.

By letting n=2n=2 and y=μ2y=\mu_{2}, we have for k1k\in\mathbb{Z}_{\leq 1}

[μ~1μ2|κ^1(2),α^1(2)]\displaystyle\mathbb{P}[\tilde{\mu}_{1}\leq\mu_{2}|\hat{\kappa}_{1}(2),\hat{\alpha}_{1}(2)] 𝟙[κ^μ2m]μ2μ2κ^f2k,2α^Er(x)dx\displaystyle\geq\mathbbm{1}\left[\hat{\kappa}\leq\frac{\mu_{2}}{m}\right]\int_{\frac{\mu_{2}}{\mu_{2}-\hat{\kappa}}}^{\infty}f_{2-k,\frac{2}{\hat{\alpha}}}^{\text{Er}}(x)\mathrm{d}x
𝟙[κ^μ2m]γf2k,2α^Er(x)dx\displaystyle\geq\mathbbm{1}\left[\hat{\kappa}\leq\frac{\mu_{2}}{m}\right]\int_{\gamma}^{\infty}f_{2-k,\frac{2}{\hat{\alpha}}}^{\text{Er}}(x)\mathrm{d}x{} (59)
=𝟙[κ^μ2m]Γ(2k,2γα^)Γ(2k),\displaystyle=\mathbbm{1}\left[\hat{\kappa}\leq\frac{\mu_{2}}{m}\right]\frac{\Gamma(2-k,\frac{2\gamma}{\hat{\alpha}})}{\Gamma(2-k)},{} (60)

where (59) holds from μ2μ2κ^1(2)μ2μ2μ2/m=mm1=γ=α2\frac{\mu_{2}}{\mu_{2}-\hat{\kappa}_{1}(2)}\leq\frac{\mu_{2}}{\mu_{2}-\mu_{2}/m}=\frac{m}{m-1}=\gamma=\lceil\alpha_{2}\rceil and Γ(,)\Gamma(\cdot,\cdot) is the upper incomplete Gamma function. To simplify the notations, we drop the arguments on nn and tt of μ~1(t)\tilde{\mu}_{1}(t), κ^1(n)\hat{\kappa}_{1}(n), and α^1(n)\hat{\alpha}_{1}(n) in the following sections.

D.1 Priors k∈ℤ≤1

Note that Γ(n,x)/Γ(n) is an increasing function with respect to n for fixed x. Therefore, (60) implies that if the regret lower bound derived for the reference prior exceeds the asymptotic lower bound, then every prior with k∈ℤ≤0 is suboptimal as well. Hence, let us consider the case k=1, where we can rewrite (60) as

\mathbb{P}[\tilde{\mu}_{1}\leq\mu_{2}|\hat{\kappa}_{1},\hat{\alpha}_{1}]\geq\mathbbm{1}\left[\hat{\kappa}_{1}\leq\frac{\mu_{2}}{m}\right]\frac{\Gamma(1,\frac{2\gamma}{\hat{\alpha}_{1}})}{\Gamma(1)}=\mathbbm{1}\left[\hat{\kappa}_{1}\leq\frac{\mu_{2}}{m}\right]e^{-\frac{2\gamma}{\hat{\alpha}_{1}}}. (61)

Since α^1(2)IG(1,2α1)\hat{\alpha}_{1}(2)\sim\mathrm{IG}(1,2\alpha_{1}) in (2), z:=2γα^z:=\frac{2\gamma}{\hat{\alpha}} follows an exponential distribution with rate parameter α1/γ\alpha_{1}/\gamma, i.e., zExp(α1/γ)z\sim\mathrm{Exp}(\alpha_{1}/\gamma). By combining (61) with (58), we have

𝔼[t=5T𝟙[j(t)=2,N1(t)=2]]\displaystyle\mathbb{E}\Bigg{[}\sum_{t=5}^{T}\mathbbm{1}[j(t)=2,N_{1}(t)=2]\Bigg{]} 𝔼κ^,z[s=1T(𝟙[κ^μ2/m]ez)s]\displaystyle\geq\mathbb{E}_{\hat{\kappa},z}\left[\sum_{s=1}^{T^{\prime}}\bigg{(}\mathbbm{1}[\hat{\kappa}\leq\mu_{2}/m]e^{-z}\bigg{)}^{s}\right]
=[κ^μ2/m]𝔼zExp(α1/γ)[s=1Tezs],\displaystyle=\mathbb{P}[\hat{\kappa}\leq\mu_{2}/m]\mathbb{E}_{z\sim\mathrm{Exp}(\alpha_{1}/\gamma)}\left[\sum_{s=1}^{T^{\prime}}e^{-zs}\right],{} (62)

where we used the stochastic independence of α^\hat{\alpha} and κ^\hat{\kappa}. Here,

𝔼zExp(α1/γ)[s=1Tezs]\displaystyle\mathbb{E}_{z\sim\mathrm{Exp}(\alpha_{1}/\gamma)}\left[\sum_{s=1}^{T^{\prime}}e^{-zs}\right] =𝔼zExp(α1/γ)[(1ezT)ez1ez]\displaystyle=\mathbb{E}_{z\sim\mathrm{Exp}(\alpha_{1}/\gamma)}\left[(1-e^{-zT^{\prime}})\frac{e^{-z}}{1-e^{-z}}\right]
=0(1exT)ex1exeα1γxdx\displaystyle=\int_{0}^{\infty}(1-e^{-xT^{\prime}})\frac{e^{-x}}{1-e^{-x}}e^{-\frac{\alpha_{1}}{\gamma}x}\mathrm{d}x
0(1exT)e2x1exdxby α1γ<1\displaystyle\geq\int_{0}^{\infty}(1-e^{-xT^{\prime}})\frac{e^{-2x}}{1-e^{-x}}\mathrm{d}x\hskip 20.00003pt\text{by }\frac{\alpha_{1}}{\gamma}<1
(11e)1Te2x1exdx\displaystyle\geq\left(1-\frac{1}{e}\right)\int_{\frac{1}{T^{\prime}}}^{\infty}\frac{e^{-2x}}{1-e^{-x}}\mathrm{d}x
\displaystyle=\left(1-\frac{1}{e}\right)\left[\log(e^{x}-1)-x+e^{-x}\right]_{x=\frac{1}{T^{\prime}}}^{\infty}
\displaystyle=\left(1-\frac{1}{e}\right)\left(-\log\left(e^{\frac{1}{T^{\prime}}}-1\right)+\frac{1}{T^{\prime}}-e^{-\frac{1}{T^{\prime}}}\right)
\displaystyle\geq\left(1-\frac{1}{e}\right)\left(\log T^{\prime}-1\right),{} (63)

where the second equality uses lim_{x→∞}(log(e^x−1)−x+e^{−x})=0, and the last inequality holds since e^x−1≤xe^x implies

\log\left(e^{\frac{1}{T^{\prime}}}-1\right)\leq\log\frac{1}{T^{\prime}}+\frac{1}{T^{\prime}}

and e^{−1/T′}≤1. By combining (63) with (62) and (58) and an elementary calculation with κ̂1(2)∼Pa(κ1,2α1), we have

\displaystyle\mathbb{E}[\mathrm{Reg}(T)]\geq\Delta_{2}\left(1-\left(\frac{m\kappa}{\mu_{2}}\right)^{2\alpha_{1}}\right)\left(1-\frac{1}{e}\right)\left(\log T^{\prime}-1\right)
\displaystyle=\Delta_{2}\left(1-\left(\frac{m\kappa}{\mu_{2}}\right)^{2\alpha_{1}}\right)\left(1-\frac{1}{e}\right)\left(\log(T-4)-1\right).

Therefore, under STS with k∈ℤ≤1, there exists a constant C(α1,α2)>0 such that

lim infT𝔼[Reg(T)]logTC(α1,α2).\liminf_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\log T}\geq C(\alpha_{1},\alpha_{2}).

D.2 Priors k0k\in\mathbb{Z}_{\leq 0}

Similarly to the previous section, it is sufficient to consider the case k=0, where we can rewrite (60) as

[μ~1μ2|κ^1,α^1]𝟙[κ^1μ2m]Γ(2,2γα^1)Γ(2).\mathbb{P}[\tilde{\mu}_{1}\leq\mu_{2}|\hat{\kappa}_{1},\hat{\alpha}_{1}]\geq\mathbbm{1}\left[\hat{\kappa}_{1}\leq\frac{\mu_{2}}{m}\right]\frac{\Gamma(2,\frac{2\gamma}{\hat{\alpha}_{1}})}{\Gamma(2)}. (64)

From the definition of the upper incomplete Gamma function, we have

g(z):=Γ(2,z)=zx1exdx=ez(z+1),g(z):=\Gamma(2,z)=\int_{z}^{\infty}x^{1}e^{-x}\mathrm{d}x=e^{-z}(z+1),

as a counterpart of eze^{-z} in (62) with the same notations z=2γα^1Exp(α1γ)z=\frac{2\gamma}{\hat{\alpha}_{1}}\sim\mathrm{Exp}\left(\frac{\alpha_{1}}{\gamma}\right).

Therefore, by replacing ezse^{-zs} in (62) with g(z)sg(z)^{s}, we have

𝔼z[s=1T(g(z))s]\displaystyle\mathbb{E}_{z}\left[\sum_{s=1}^{T^{\prime}}(g(z))^{s}\right] 𝔼z[𝟙[z(0,1]]s=1T(g(z))s]\displaystyle\geq\mathbb{E}_{z}\left[\mathbbm{1}[z\in(0,1]]\sum_{s=1}^{T^{\prime}}(g(z))^{s}\right]
𝔼z[𝟙[z(0,1]]s=1T(1z2)s]\displaystyle\geq\mathbb{E}_{z}\left[\mathbbm{1}[z\in(0,1]]\sum_{s=1}^{T^{\prime}}(1-z^{2})^{s}\right]
=𝔼z[𝟙[z(0,1]](1(1z2)T)1z2z2],\displaystyle=\mathbb{E}_{z}\left[\mathbbm{1}[z\in(0,1]](1-(1-z^{2})^{T^{\prime}})\frac{1-z^{2}}{z^{2}}\right],

where the second inequality uses g(z)=e^{−z}(1+z)≥(1−z)(1+z)=1−z² for z∈[0,1]. Since (1−z²)^{T′}≤1/(1+T′z²) holds, we have 1−(1−z²)^{T′}≥T′z²/(1+T′z²).

By applying this fact, we have for T>1T^{\prime}>1,

𝔼z[s=1T(g(z))s]\displaystyle\mathbb{E}_{z}\Bigg{[}\sum_{s=1}^{T^{\prime}}(g(z))^{s}\Bigg{]} 𝔼z[T(1z2)1+Tz2𝟙[z(0,1T]]]\displaystyle\geq\mathbb{E}_{z}\left[\frac{T^{\prime}(1-z^{2})}{1+T^{\prime}z^{2}}\mathbbm{1}\left[z\in\left(0,\frac{1}{\sqrt{T^{\prime}}}\right]\right]\right]
𝔼zExp(α1/γ)[(T212)𝟙[z(0,1T]]]\displaystyle\geq\mathbb{E}_{z\sim\mathrm{Exp}(\alpha_{1}/\gamma)}\left[\left(\frac{T^{\prime}}{2}-\frac{1}{2}\right)\mathbbm{1}\left[z\in\left(0,\frac{1}{\sqrt{T^{\prime}}}\right]\right]\right]\hskip 30.00005pt
=(T212)01Tα1γeα1γzdz\displaystyle=\left(\frac{T^{\prime}}{2}-\frac{1}{2}\right)\int_{0}^{\frac{1}{\sqrt{T^{\prime}}}}\frac{\alpha_{1}}{\gamma}e^{-\frac{\alpha_{1}}{\gamma}z}\mathrm{d}z{} (65)
=(T212)(1eα1γT).\displaystyle=\left(\frac{T^{\prime}}{2}-\frac{1}{2}\right)\left(1-e^{-\frac{\alpha_{1}}{\gamma\sqrt{T^{\prime}}}}\right).

Notice that e^{−x}≤1−x/2 holds for x∈(0,1), which gives

𝔼z[s=1T(g(z))s]\displaystyle\mathbb{E}_{z}\bigg{[}\sum_{s=1}^{T^{\prime}}(g(z))^{s}\bigg{]} (T212)(1eα1γT)\displaystyle\geq\left(\frac{T^{\prime}}{2}-\frac{1}{2}\right)\left(1-e^{-\frac{\alpha_{1}}{\gamma\sqrt{T^{\prime}}}}\right)
(T212)α12γT=α14γ(T1T).\displaystyle\geq\left(\frac{T^{\prime}}{2}-\frac{1}{2}\right)\frac{\alpha_{1}}{2\gamma\sqrt{T^{\prime}}}=\frac{\alpha_{1}}{4\gamma}\left(\sqrt{T^{\prime}}-\frac{1}{\sqrt{T^{\prime}}}\right).{} (66)

By applying (66) to (58), we obtain for k0k\in\mathbb{Z}_{\leq 0} and T=T4>1T^{\prime}=T-4>1,

𝔼[Reg(T)]\displaystyle\mathbb{E}[\mathrm{Reg}(T)] Δ2α14γ(1(mκμ2)2α1)(T1T)\displaystyle\geq\Delta_{2}\frac{\alpha_{1}}{4\gamma}\left(1-\left(\frac{m\kappa}{\mu_{2}}\right)^{2\alpha_{1}}\right)\left(\sqrt{T^{\prime}}-\frac{1}{\sqrt{T^{\prime}}}\right)
\displaystyle=\Omega(\sqrt{T}).

Notice that from the definition of m=γγ1=α2α21m=\frac{\gamma}{\gamma-1}=\frac{\left\lceil{\alpha_{2}}\right\rceil}{\left\lceil{\alpha_{2}}\right\rceil-1}, mκμ2=m(11α2)<1m\frac{\kappa}{\mu_{2}}=m\left(1-\frac{1}{\alpha_{2}}\right)<1 holds. Therefore, under STS\mathrm{STS} with priors k0k\in\mathbb{Z}_{\leq 0}, there exists a constant C(α1,α2)>0C^{\prime}(\alpha_{1},\alpha_{2})>0 such that

lim infT𝔼[Reg(T)]TC(α1,α2).\liminf_{T\to\infty}\frac{\mathbb{E}[\mathrm{Reg}(T)]}{\sqrt{T}}\geq C^{\prime}(\alpha_{1},\alpha_{2}).
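The contrast between Sections D.1 and D.2 can also be observed numerically. The following Monte Carlo sketch (our own Python code; α1 and γ are arbitrary example values with α1<γ) evaluates E_z[∑_{s=1}^{T′} q^s] for the per-round factors q=e^{−z} (the case k=1) and q=e^{−z}(1+z) (the case k=0) with z∼Exp(α1/γ); the former grows logarithmically in T′ while the latter grows on the order of √T′.

```python
import numpy as np

# Numerical illustration of Sections D.1 and D.2: z ~ Exp(alpha_1 / gamma), and the
# per-round factor in (62) is e^{-z} for k = 1 and e^{-z}(1 + z) for k = 0.
# alpha_1 and gamma below are arbitrary example values with alpha_1 < gamma.
rng = np.random.default_rng(0)
alpha1, gamma = 1.2, 2.0
z = rng.exponential(scale=gamma / alpha1, size=500_000)

def expected_sum(q, T):
    """Monte Carlo estimate of E[sum_{s=1}^{T} q^s] for per-round factors q in [0, 1)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        val = q * (1.0 - q ** T) / (1.0 - q)
    return float(np.mean(np.where(q < 1.0, val, float(T))))

for T in (10**2, 10**3, 10**4, 10**5):
    ref = expected_sum(np.exp(-z), T)              # k = 1: grows like log T
    jef = expected_sum(np.exp(-z) * (1.0 + z), T)  # k = 0: grows like sqrt(T)
    print(f"T={T:>6}  k=1: {ref:8.2f}  (log T = {np.log(T):5.2f})   "
          f"k=0: {jef:9.2f}  (sqrt T = {np.sqrt(T):7.1f})")
```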

Appendix E Priors and posteriors

In this section, we provide details on the problems of the Jeffreys prior and on the probability matching prior under multiparameter models. One can find further details in the references cited in this section.

E.1 Problems of the Jeffreys prior in the presence of nuisance parameters

The Jeffreys prior was defined to be proportional to the square root of the determinant of the FI matrix so that it remains invariant under all one-to-one reparameterizations of the parameters [Jeffreys, 1998]. However, the Jeffreys prior is known to suffer from many problems when the model contains nuisance parameters [Datta and Ghosh, 1995, Ghosh, 2011]. Therefore, Jeffreys himself recommended using other priors in the case of multiparameter models [Berger and Bernardo, 1992]. For example, for the location-scale family, Jeffreys recommended an alternative prior, which coincides with the exact matching prior [DiCiccio et al., 2017].

As mentioned in the main text, it is known that the Jeffreys prior leads to inconsistent estimators for the variance in the Neyman-Scott problem [see Berger and Bernardo, 1992, Example 3]. Another example is Stein's example [Stein, 1959], where the model of the Gaussian distribution with a common variance is considered. In this example, the Jeffreys prior leads to an unsatisfactory posterior distribution since the generalized Bayesian estimator under the Jeffreys prior is dominated by other estimators under the quadratic loss [see Robert et al., 2007, Example 3.5.9]. Note that Bernardo [1979] showed that the reference prior does not suffer from such problems, which would explain why the reference prior shows better performance than the Jeffreys prior in the multiparameter bandit problems.

E.2 Probability matching prior

The probability matching prior is a type of noninformative prior designed so that the coverage probability of Bayesian interval estimates matches that of frequentist interval estimates [Welch and Peers, 1963, Tibshirani, 1989]. Therefore, under a probability matching prior, the posterior probability of certain intervals matches the frequentist coverage probability exactly or asymptotically. If the posterior probability of certain intervals exactly matches the frequentist coverage probability, such a prior is called an exact matching prior. In cases where the Bayesian interval estimate does not exactly match the frequentist coverage probability but the difference is small, it is called a k-th order matching prior. The difference between the two probabilities is measured by a remainder term, usually denoted as 𝒪(n^{−k/2}), where n is the sample size and k is the order of the matching. (Some papers call a prior a k-th order matching prior when the remainder is 𝒪(n^{−(k+1)/2}) [DiCiccio et al., 2017]. In this paper, we follow the notation used in Mukerjee and Ghosh [1997] and Datta and Sweeting [2005].)

For example, let θ∈ℝ+ be a parameter of interest. For some prior π(θ), let ψ(θ|X_n) be the posterior distribution after observing n samples X_n. Then, for any α∈(0,1), let us define θ_α>0 such that

0θαψ(θ|Xn)dθ=α.\int_{0}^{\theta_{\alpha}}\psi(\theta|X_{n})\mathrm{d}\theta=\alpha.

When π(θ) is a second-order probability matching prior, the frequentist coverage probability, evaluated over X_n under the true parameter θ, satisfies

\mathbb{P}_{\theta}[\theta\leq\theta_{\alpha}]=\alpha+\mathcal{O}(n^{-1}).

When π(θ) is an exact probability matching prior, we have

\mathbb{P}_{\theta}[\theta\leq\theta_{\alpha}]=\alpha.

For more details, we refer readers to Datta and Sweeting [2005] and Ghosh [2011] and the references therein.
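As a simple illustration, consider the rate parameter α of an exponential distribution with the scale-invariant prior π(α)∝1/α; in this one-parameter model the posterior quantile can be checked to have exact frequentist coverage. The following Monte Carlo sketch (our own code; the parameter values are arbitrary examples) verifies this numerically.

```python
import numpy as np
from scipy import stats

# Monte Carlo illustration of an exact matching prior: for X_1,...,X_n i.i.d. Exp(alpha)
# with the scale-invariant prior pi(alpha) ∝ 1/alpha, the posterior of alpha is
# Gamma(n, rate = sum_i X_i), and the frequentist coverage of the posterior quantile
# is exact. The parameter values below are arbitrary examples.
rng = np.random.default_rng(2)
alpha_true, n, level, trials = 1.5, 10, 0.9, 200_000

# the sum of n Exp(alpha_true) samples follows Gamma(n, scale = 1 / alpha_true)
S = rng.gamma(shape=n, scale=1.0 / alpha_true, size=trials)
theta_level = stats.gamma(a=n, scale=1.0 / S).ppf(level)   # posterior `level`-quantile
coverage = np.mean(alpha_true <= theta_level)
print(f"frequentist coverage ≈ {coverage:.4f} (nominal level: {level})")
```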

E.3 Details on the derivation of posteriors

In this section, we provide the detailed derivation of posteriors.

Let 𝒓=(r1,…,rn) denote the observations of an arm and let q(n)=∑_{s=1}^{n} log r_s. Then, Bayes' theorem gives the posterior probability density as

p(κ,α𝒓)=p(𝒓|κ,α)p(κ,α)00p(𝒓κ,α)p(κ,α)dκdα,\displaystyle p({\kappa,\alpha}\mid\bm{r})=\frac{p(\bm{r}|{\kappa,\alpha})p({\kappa,\alpha})}{\int_{0}^{\infty}\int_{0}^{\infty}p(\bm{r}\mid{\kappa,\alpha})p({\kappa,\alpha})\mathrm{d}\kappa\mathrm{d}\alpha},

where

p(𝒓κ,α)\displaystyle p(\bm{r}\mid{\kappa,\alpha}) =αnκnα(s=1nra,s)α1𝟙[κκ^(n)]\displaystyle=\alpha^{n}\kappa^{n\alpha}\left(\prod_{s=1}^{n}r_{a,s}\right)^{-\alpha-1}\mathbbm{1}[\kappa\leq\hat{\kappa}(n)]
=αnκnαexp(q(n)(α+1))𝟙[κκ^(n)].\displaystyle=\alpha^{n}\kappa^{n\alpha}\exp(-q(n)(\alpha+1))\mathbbm{1}[\kappa\leq\hat{\kappa}(n)].

By direct computation with the given prior π(κ,α)∝α^{−k}/κ for k∈ℤ, we have

00p(𝒓κ,α)p(κ,α)dκdα\displaystyle\int_{0}^{\infty}\int_{0}^{\infty}p(\bm{r}\mid{\kappa,\alpha})p({\kappa,\alpha})\mathrm{d}\kappa\mathrm{d}\alpha =00p(𝒓κ,α)αkκdκdα\displaystyle=\int_{0}^{\infty}\int_{0}^{\infty}p(\bm{r}\mid{\kappa,\alpha})\frac{\alpha^{-k}}{\kappa}\mathrm{d}\kappa\mathrm{d}\alpha
=0αnkexp(q(n)(α+1))0κ^κnα1dκdα\displaystyle=\int_{0}^{\infty}\alpha^{n-k}\exp(-q(n)(\alpha+1))\int_{0}^{\hat{\kappa}}\kappa^{n\alpha-1}\mathrm{d}\kappa\mathrm{d}\alpha
=0αnk1neq(n)exp(α(q(n)nlogκ^))dα\displaystyle=\int_{0}^{\infty}\frac{\alpha^{n-k-1}}{n}e^{-q(n)}\exp(-\alpha(q(n)-n\log\hat{\kappa}))\mathrm{d}\alpha
=Γ(nk)neq(n)(q(n)nlogκ^)nk.\displaystyle=\frac{\Gamma(n-k)}{n}\frac{e^{-q(n)}}{(q(n)-n\log\hat{\kappa})^{n-k}}.

Therefore, the joint posterior probability density is given as follows:

p(κ,α𝒓)=n[q(n)nlogκ^(n)]nkΓ(nk)αnkκnα1eq(n)α𝟙[0<κκ^(n)],p({\kappa,\alpha}\mid\bm{r})=\frac{n[q(n)-n\log\hat{\kappa}(n)]^{n-k}}{\Gamma(n-k)}\alpha^{n-k}\kappa^{n\alpha-1}e^{-q(n)\alpha}\mathbbm{1}[0<\kappa\leq\hat{\kappa}(n)],

which gives the marginal posterior of α\alpha as

p(α𝒓)=αnk1[q(n)nlogκ^(n)]nkΓ(nk)eα(q(n)nlogκ^(n)).p(\alpha\mid\bm{r})=\frac{\alpha^{n-k-1}[q(n)-n\log\hat{\kappa}(n)]^{n-k}}{\Gamma(n-k)}e^{-\alpha(q(n)-n\log\hat{\kappa}(n))}. (67)

Thus, a sample α̃ generated from the marginal posterior follows the Gamma distribution with shape n−k and rate q(n)−n log κ̂(n)=n/α̂, i.e., α̃∼Erlang(n−k, n/α̂), since n−k is a positive integer for n∈ℕ and k∈ℤ with n>k. When α̃ is given, the conditional posterior of κ is given as

p(κ𝒓,α)\displaystyle p(\kappa\mid\bm{r},\alpha) =p(κ,α𝒓)p(α𝒓)\displaystyle=\frac{p({\kappa,\alpha}\mid\bm{r})}{p(\alpha\mid\bm{r})}
=nακ^nακnα1𝟙[0<κκ^(n)].\displaystyle=\frac{n\alpha}{\hat{\kappa}^{n\alpha}}\kappa^{n\alpha-1}\mathbbm{1}[0<\kappa\leq\hat{\kappa}(n)].{} (68)

Hence, the cumulative distribution function (CDF) of κ\kappa given α\alpha is given as

(κx)=F(x𝒓,α=α~)=(xκ^(n))nα~, 0<xκ^(n).\mathbb{P}(\kappa\leq x)=F(x\mid\bm{r},\alpha=\tilde{\alpha})=\left(\frac{x}{\hat{\kappa}(n)}\right)^{n\tilde{\alpha}},\,0<x\leq\hat{\kappa}(n). (69)

Note that MLEs of κ,α{\kappa,\alpha} are equivalent to the maximum a posteriori (MAP) estimators when one uses the Jeffreys prior [Sun et al., 2020, Li et al., 2022].

In sum, under the aforementioned priors, we consider the marginal posterior distribution of α

p(α𝒓)=Erlang(nk,nα^)p(\alpha\mid\bm{r})=\mathrm{Erlang}\left(n-k,\frac{n}{\hat{\alpha}}\right)

and the cumulative distribution function (CDF) of the conditional posterior of κ\kappa

F(x\mid\bm{r},\alpha=\tilde{\alpha})=\left(\frac{x}{\hat{\kappa}(n)}\right)^{n\tilde{\alpha}},\,0<x\leq\hat{\kappa}(n).

Note that we require max{2, k+1} initial plays to avoid improper posteriors and ill-defined MLEs.
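For concreteness, the following is a minimal sketch (our own Python/NumPy code; the function name and interface are not from the paper) of how a single posterior sample (κ̃, α̃) and the induced mean sample μ̃ can be drawn from the above marginal and conditional posteriors for one arm.

```python
import numpy as np

def sample_pareto_posterior(rewards, k=1, rng=None):
    """Draw one posterior sample (kappa_tilde, alpha_tilde, mu_tilde) for a single arm.

    rewards : observed Pareto rewards of the arm; requires len(rewards) >= max(2, k + 1).
    k       : prior exponent, i.e., prior(kappa, alpha) ∝ alpha^{-k} / kappa
              (k = 0: Jeffreys prior, k = 1: reference prior).
    """
    rng = np.random.default_rng() if rng is None else rng
    r = np.asarray(rewards, dtype=float)
    n = r.size
    kappa_hat = r.min()                                            # MLE of the scale
    alpha_hat = n / (np.log(r).sum() - n * np.log(kappa_hat))      # MLE of the shape

    # alpha_tilde ~ Erlang(n - k, rate n / alpha_hat), i.e., Gamma with scale alpha_hat / n
    alpha_tilde = rng.gamma(shape=n - k, scale=alpha_hat / n)
    # kappa_tilde | alpha_tilde has CDF (x / kappa_hat)^{n * alpha_tilde} on (0, kappa_hat]:
    # inverse-transform sampling
    kappa_tilde = kappa_hat * rng.uniform() ** (1.0 / (n * alpha_tilde))

    # posterior sample of the mean; it is finite only when alpha_tilde > 1
    mu_tilde = np.inf if alpha_tilde <= 1.0 else kappa_tilde * alpha_tilde / (alpha_tilde - 1.0)
    return kappa_tilde, alpha_tilde, mu_tilde
```

In an STS step, one such sample is drawn for every arm and the arm with the largest μ̃ is played; STS-T additionally applies the truncation procedure described in the main text, which is not included in this sketch.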

Appendix F Technical lemma

In this section, we present technical lemmas used in the proofs of the main lemmas.

Lemma 17.

Let Z be a random variable following the chi-squared distribution with 2n degrees of freedom. Then, for any x∈(0,1),

[Z2nx]enh(x),\mathbb{P}[Z\leq 2nx]\leq e^{-nh(x)},

where h(x)=(x1logx)0h(x)=(x-1-\log x)\geq 0.

Proof.

Let XiX_{i} be random variables following the standard normal distribution, so that Z=i=12nXi2Z=\sum_{i=1}^{2n}X_{i}^{2} holds. From Lemma 20, one can derive

[Z2nx]=[12ni=12nXi2x]exp((2ninfzxΛ(z))).\mathbb{P}[Z\leq 2nx]=\mathbb{P}\left[\frac{1}{2n}\sum_{i=1}^{2n}X_{i}^{2}\leq x\right]\leq\exp{\left(-2n\inf_{z\leq x}\Lambda^{*}(z)\right)}.

From the definition of the moment generating function, one can see that

Λ(z)=supλλzlog𝔼[eλX12]=supλλz+12log(12λ)=12(z1logz),\Lambda^{*}(z)=\sup_{\lambda\in\mathbb{R}}\lambda z-\log\mathbb{E}\left[e^{\lambda X_{1}^{2}}\right]=\sup_{\lambda\in\mathbb{R}}\lambda z+\frac{1}{2}\log(1-2\lambda)=\frac{1}{2}(z-1-\log z),

which is decreasing in z on (0,1), so that inf_{z≤x} Λ*(z)=Λ*(x)=h(x)/2 for any x∈(0,1). This concludes the proof. ∎
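A quick numerical check of Lemma 17 (our own code; the values of n and x are arbitrary examples) is given below.

```python
import numpy as np
from scipy import stats

# Numerical check of Lemma 17: P[Z <= 2nx] <= exp(-n * h(x)) with h(x) = x - 1 - log(x),
# where Z follows the chi-squared distribution with 2n degrees of freedom.
def chernoff_bound(n, x):
    return np.exp(-n * (x - 1.0 - np.log(x)))

for n in (5, 20, 100):
    for x in (0.2, 0.5, 0.8):
        lhs = stats.chi2(df=2 * n).cdf(2 * n * x)
        rhs = chernoff_bound(n, x)
        print(f"n={n:3d} x={x:.1f}  P[Z<=2nx]={lhs:.3e}  bound={rhs:.3e}  ok={lhs <= rhs}")
```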

Appendix G Known results

In this section, we present known lemmas that we use without proof.

Lemma 18 (Bernstein’s inequality).

Let XX be a (σ2,b)(\sigma^{2},b)-subexponential random variable with 𝔼[X]=μ\mathbb{E}[X]=\mu and Var[X]=σ2Var[X]=\sigma^{2}, which satisfies

\mathbb{E}[e^{\lambda(X-\mu)}]\leq\exp\left(\frac{\lambda^{2}\sigma^{2}}{2}\right)\quad\text{ for }|\lambda|\leq\frac{1}{b}.

Let X_1,…,X_n be independent (σ²,b)-subexponential random variables with mean μ. Then, it holds that

\mathbb{P}\left(\absolutevalue{\frac{1}{n}\sum_{i=1}^{n}X_{i}-\mu}\geq t\right)\leq 2\exp\left(-\frac{n}{2}\min\left\{\frac{t^{2}}{\sigma^{2}},\frac{t}{b}\right\}\right).

For more details, we refer the reader to Vershynin [2018].

Lemma 19 (Theorem 4.1. in Wallace [1959]).

Let FnF_{n} be the distribution function of the chi-squared distribution on nn degrees of freedom. For all t>nt>n, all n>0n>0, and with w(t)=tnnlog(t/n)w(t)=\sqrt{t-n-n\log(t/n)},

1Fn(t)<dn2erfc(w(t)2),1-F_{n}(t)<\frac{d_{n}}{2}\mathrm{erfc}\left(\frac{w(t)}{\sqrt{2}}\right),

where dn=(n2)n12en22πΓ(n/2)d_{n}=\frac{\left(\frac{n}{2}\right)^{\frac{n-1}{2}}e^{-\frac{n}{2}}\sqrt{2\pi}}{\Gamma(n/2)} and erfc()\mathrm{erfc}(\cdot) is the complementary error function.

Lemma 20 (Cramér’s theorem).

Let X_1,…,X_n be i.i.d. random variables on ℝ. Then, for any convex set C⊆ℝ,

[1ni=1nXiC]exp((ninfzCΛ(z))),\mathbb{P}\left[\frac{1}{n}\sum_{i=1}^{n}X_{i}\in C\right]\leq\exp{\left(-n\inf_{z\in C}\Lambda^{*}(z)\right)},

where Λ(z)=supλλzlog𝔼[eλX1]\Lambda^{*}(z)=\sup_{\lambda\in\mathbb{R}}\lambda z-\log\mathbb{E}[e^{\lambda X_{1}}].

Lemma 21 (Result of term (A) in Korda et al. [2013]).

When one uses the Jeffreys prior as a prior distribution under the Pareto distribution with known scale parameter, TS satisfies that for sufficiently small ϵ>0\epsilon>0,

t=1T𝔼[𝟙[j(t)1,ϵc(t)]]𝒪(ϵ1).\sum_{t=1}^{T}\mathbb{E}\left[\mathbbm{1}[j(t)\neq 1,\mathcal{M}_{\epsilon}^{c}(t)]\right]\leq\mathcal{O}\left(\epsilon^{-1}\right).

Appendix H Additional experimental results

Figure 4: (a) the Jeffreys prior k=0; (b) the reference prior k=1; (c) prior with k=3; (d) prior with k=−1; (e) prior with k=−3. The solid lines denote the averaged regret over 100,000 independent runs. The shaded regions show a quarter standard deviation.

From Figure 4, one can observe that the performance difference between STS and STS-T becomes larger as k decreases. Since the truncation procedure aims to prevent extreme cases that can occur under STS with priors k∈ℤ≤1, it is quite natural that there is no difference between STS and STS-T with the prior k=3. We can further see that the improvement of STS-T becomes more dramatic as k decreases, where such extreme cases can easily occur.

Figure 5: (a) cumulative regret of STS with various k under θ4′; (b) cumulative regret of STS-T with various k under θ4′. The solid lines denote the averaged cumulative regret over 100,000 independent runs of priors that can achieve the optimal lower bound in (3). The dashed lines denote that of priors that cannot achieve the optimal lower bound in (3). The green dotted line denotes the problem-dependent lower bound based on Lemma 1.
Figure 6: (a) the Jeffreys prior k=0; (b) the reference prior k=1; (c) prior with k=3; (d) prior with k=−1; (e) prior with k=−3. The solid lines denote the averaged regret over 10,000 independent runs. The shaded regions and dashed lines show the central 99% interval and the upper 0.05% of regret, respectively.

We further consider another 4-armed bandit problem θ4′ with 𝜿=(1.0,1.5,2.0,2.0) and 𝜶=(1.2,1.5,1.8,2.0), where 𝝁=(5.0,4.5,4.5,4.0). θ4′ would be a more challenging problem than θ4 in the sense that κ determines the left boundary of the support, where a larger κ implies a larger minimum reward of the arm. Therefore, if κ of a suboptimal arm is larger than that of the optimal arm, the problem becomes difficult in the first few rounds. Figures 5 and 6 show the numerical results with time horizon T=50,000 and 10,000 independent runs. Although STS with the reference prior shows performance similar to that of the conservative prior k=3, its performance varies considerably.

Figures 6(a) and 6(b) show the effectiveness of the truncation procedure, where STS-T achieves a much smaller upper 0.05% regret than STS. Although the truncation also yields huge improvements in the central 99% interval of regret for k=−1, as shown in Figure 6(d), STS-T with k=−1 still shows worse performance than the priors with k∈ℤ≥0 in Figure 5(b).
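For reference, the following is a minimal, self-contained sketch (our own code, not the implementation behind the reported figures) of an STS simulation on a Pareto bandit instance such as θ4′ above; the truncation procedure of STS-T is omitted here.

```python
import numpy as np

def run_sts(kappas, alphas, T=50_000, k=1, seed=0):
    """Minimal STS simulation sketch on a Pareto bandit instance (STS-T truncation omitted)."""
    rng = np.random.default_rng(seed)
    K = len(kappas)
    means = [kap * a / (a - 1.0) for kap, a in zip(kappas, alphas)]
    best = max(means)
    rewards = [[] for _ in range(K)]
    n_init = max(2, k + 1)                  # initial plays keeping posteriors/MLEs proper
    cum_regret, history = 0.0, []
    for t in range(T):
        if t < K * n_init:
            arm = t % K                     # initialization: play every arm n_init times
        else:
            samples = []
            for a in range(K):
                r = np.asarray(rewards[a])
                n = r.size
                kappa_hat = r.min()
                alpha_hat = n / (np.log(r).sum() - n * np.log(kappa_hat))
                alpha_til = rng.gamma(shape=n - k, scale=alpha_hat / n)           # marginal posterior
                kappa_til = kappa_hat * rng.uniform() ** (1.0 / (n * alpha_til))  # conditional posterior
                samples.append(np.inf if alpha_til <= 1.0
                               else kappa_til * alpha_til / (alpha_til - 1.0))
            arm = int(np.argmax(samples))
        # Pareto(kappa, alpha) reward via inverse-CDF sampling
        rewards[arm].append(kappas[arm] * rng.uniform() ** (-1.0 / alphas[arm]))
        cum_regret += best - means[arm]
        history.append(cum_regret)
    return np.array(history)

# Example: the instance theta_4' described above, with the reference prior k = 1.
regret = run_sts(kappas=(1.0, 1.5, 2.0, 2.0), alphas=(1.2, 1.5, 1.8, 2.0), T=10_000, k=1)
print(f"cumulative regret after {regret.size} rounds: {regret[-1]:.1f}")
```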