
Best Arm Identification in Restless Markov Multi–Armed Bandits

P. N. Karthik, Kota Srinivas Reddy, and Vincent Y. F. Tan
National University of Singapore
Email: karthik@nus.edu.sg, ksreddy@nus.edu.sg, vtan@nus.edu.sg
Abstract

We study the problem of identifying the best arm in a multi-armed bandit environment when each arm is a time-homogeneous and ergodic discrete-time Markov process on a common, finite state space. The state evolution on each arm is governed by the arm’s transition probability matrix (TPM). A decision entity that knows the set of arm TPMs but not the exact mapping of the TPMs to the arms wishes to find the index of the best arm as quickly as possible, subject to an upper bound on the error probability. The decision entity selects one arm at a time sequentially, and all the unselected arms continue to undergo state evolution (restless arms). For this problem, we derive the first-known problem instance-dependent asymptotic lower bound on the growth rate of the expected time required to find the index of the best arm, where the asymptotics is as the error probability vanishes. Further, we propose a sequential policy that, for an input parameter RR, forcibly selects an arm that has not been selected for RR consecutive time instants. We show that this policy achieves an upper bound that depends on RR and is monotonically non-increasing as RR increases. The question of whether, in general, the limiting value of the upper bound as RR\to\infty matches the lower bound remains open. We identify a special case in which the upper and the lower bounds match. Prior works on best arm identification have dealt with (a) independent and identically distributed observations from the arms, and (b) rested Markov arms, whereas our work deals with the more difficult setting of restless Markov arms.


I Introduction

Consider a multi-armed bandit with K2K\geq 2 arms in which each arm is associated with a time-homogeneous and ergodic discrete-time Markov process evolving on a common, finite state space. The state evolutions on each arm are governed by the arm’s transition probability matrix (TPM). Given a function ff on the common state space of the arms, we define the best arm as the arm with the largest average value of ff, averaged over the arm’s stationary distribution. A decision entity that has knowledge of the set of TPMs of the arms, but does not know the exact mapping of the TPMs to the arms, wishes to find the index of the best arm as quickly as possible, subject to an upper bound on the error probability. Our interest is in the asymptotics as the error probability vanishes.

The above problem, popularly known in the literature as best arm identification, is an instance of an optimal stopping problem in decision theory, and can be embedded within the framework of active sequential hypothesis testing studied in the classical works of Chernoff [1] and Albert [2]. Prior works on best arm identification have dealt with (a) independent and identically distributed (i.i.d.) observations from the arms, as in [3], and (b) rested Markov arms [4] in which the arms yield Markovian observations and an arm evolves only when selected, and remains frozen otherwise. In this work, we extend the results of [3, 5, 4] to the more difficult setting in which the unselected arms continue to evolve (restless arms). An examination of the results in [3, 5, 4] shows that given an error probability threshold ϵ>0\epsilon>0, the minimum expected time required to find the best arm (or, expected stopping time) with an error probability no more than ϵ\epsilon grows as O(log(1/ϵ))O(\log(1/\epsilon)) in the limit as ϵ0\epsilon\downarrow 0. We anticipate a similar growth rate for the expected stopping time in the setting of restless arms. Our goal is to characterise, or at least bound, the exact constant multiplying log(1/ϵ)\log(1/\epsilon) in the setting of restless arms in the limit as ϵ0\epsilon\downarrow 0. Additionally, we aim to devise an efficient policy that achieves this fundamental limit.

I-A A Preliminary Trembling Hand Model

The continued evolution of the unselected arms in our work makes it necessary for the decision entity to keep a record of (a) the time elapsed since each arm was previously selected (the arm’s delay), and (b) the state of each arm recorded at its previous selection instant (the arm’s last observed state). The notion of arm delays is superfluous when the arms are rested because the unobserved arms remain frozen. They are also redundant when the arms yield i.i.d. observations because the observation from an arm at any given time is independent of all its previous observations. When the arms are restless, the arm delays are non-negative, integer-valued, and introduce a countably-infinite dimension to the problem. In a related problem of identifying an anomalous or odd arm in a restless multi-armed bandit, Karthik and Sundaresan [6, 7] demonstrated that the arm delays and the last observed states constitute a controlled Markov process. A key aspect of the works [6, 7] is the notion of a trembling hand for selecting the arms that is defined in the system model. Probabilistically, a trembling hand with parameter γ[0,1]\gamma\in[0,1] selects an arm uniformly at random with probability γ\gamma, and selects the intended arm with probability 1γ1-\gamma. As the authors note in [6], when γ>0\gamma>0, the controlled Markov process (of arm delays and last observed states) satisfies a key ergodicity property (see [6, Lemma 1]) which is pivotal in establishing matching upper and lower bounds on the expected time to find the odd arm. An understanding of whether the lower bound for the case γ=0\gamma=0 admits a matching upper bound remains open.

I-B Forgoing the Trembling Hand Model and Constraining the Maximum Delay of Each Arm

The trembling hand model of [6, 7] implies that at any given time, the probability of selecting any arm is at least γ/K\gamma/K, irrespective of the arm selection scheme used. In this work, we forgo the trembling hand assumption and allow for arm selection schemes to put zero mass on certain arms. We derive a lower bound on the expected time to find the best arm over all arm selection schemes. However, such a generic lower bound may not be achievable. In order to analyse the achievability of the lower bound, we restrict attention to those arm selection policies which, for an input parameter R>KR>K, forcefully select an arm if its delay is equal to RR. For this class of policies, we derive an upper bound in terms of RR and show that the sequence of upper bounds, one for each value of RR, is monotonically non-increasing as RR increases. An advantage of our method over the methods of [6, 7] is that while the arm selection schemes in [6, 7] are not practically implementable, the arm selection schemes we propose in this paper are easy to implement.

The question of whether, in general, the limiting value of the upper bounds matches the generic lower bound appears to be a difficult problem and remains open. Notwithstanding this, we identify some special cases when the limiting value of the upper bounds matches with the generic lower bound.

I-C Prior Works on Multi-Armed Bandits and Best Arm Identification

The problem of minimising regret for multi-armed bandits was introduced in the seminal work of Lai and Robbins [8] when each arm yields i.i.d. observations. Anantharam et al. [9] extended the results of Lai and Robbins to the setting of rested arms. We refer the reader to [10] for an extensive survey of regret minimization problems. In [11, 3], the authors derived a lower bound on the sample complexity of best arm identification in multi-armed bandits with i.i.d. observations from arms. Many follow-up works proposed algorithms to achieve the lower bounds; notable among them are action-elimination algorithms [12, 13], upper confidence bound (UCB) algorithms [14, 15], and lower and upper confidence bound (LUCB) algorithms [16, 17]. Several extensions to the classical setup of best-arm identification in multi-armed bandits have appeared in the literature; notable among them are correlated bandits [18], cascading bandits [19], bandits with switching costs [20], and bandits with corrupted rewards [21].

For the problem of finding the best arm as quickly as possible subject to an upper bound on the error probability, the paper [3] provided a problem instance-dependent lower bound on the expected time required to find the best arm for the setting of i.i.d. observations from the arms, and a track-and-stop scheme that meets the lower bound asymptotically as the error probability vanishes. Moulos [4] extended the results of [3] to the setting of rested arms with hidden Markov observations. However, the lower and the upper bounds in [4] differ by a constant multiplicative factor; the achievability analysis therein uses novel concentration inequalities for Markov processes derived by the author. For a related problem of finding the anomalous (or odd) arm in multi-armed bandits, the works [22, 6, 7] obtain matching upper and lower bounds on the expected time required to find the odd arm subject to an upper bound on the error probability. While [22] studies the setting of rested arms, the works [6, 7] focus on the setting of restless arms. A key aspect of the works [6, 7] is a certain trembling hand model for selecting the arms, motivated from a certain visual science experiment. The paper [6] assumes that the TPMs of the odd arm and the non-odd arms are known beforehand to the decision entity, whereas the works [22, 7] deal with the case when the arm TPMs are unknown.

The recent works [23, 24, 25] study a more general problem of sequential hypothesis testing in multi-armed bandits, some special cases of which are the problems of best arm identification and odd arm identification, in the context of i.i.d. observations from each arm. An extension of the results of [23, 24, 25] to the settings of rested and restless arms is a possible direction of future work.

I-D Contributions

In this paper, we study the problem of finding the best arm in a restless multi-armed bandit as quickly as possible, subject to an upper bound on the error probability. Our technical contributions are as follows:

  • In Section V, we show that given any ϵ>0\epsilon>0, the expected time required to find the best arm with an error probability no more than ϵ\epsilon grows as O(log(1/ϵ))O(\log(1/\epsilon)) in the limit as ϵ0\epsilon\downarrow 0. We explicitly characterise the problem instance-dependent constant multiplying log(1/ϵ)\log(1/\epsilon) that is a function of the arm TPMs.

  • For a problem instance CC, the constant T(C)T^{\star}(C) appearing in our lower bound is the optimal value of a sup-min optimisation problem, where the supremum is over all possible state-action occupancy measures associated with the countable-state controlled Markov process of arm delays and last observed states; see (21) for details. The question of whether the supremum in the expression for T(C)T^{\star}(C) is attained is still open. The key difficulty in showing this is the presence of the countably infinite-valued arm delays appearing in the expression for T(C)T^{\star}(C), which makes further simplifications of the expression for T(C)T^{\star}(C) difficult. In the prior works [3, 5, 4], further simplification of the expression for the constants governing the lower bound is possible because the notion of arm delays is superfluous in the settings of those works.

    In a related problem of odd arm identification in restless multi-armed bandits, Karthik and Sundaresan [6, 7] show that under a trembling hand model for arms selection, the supremum in the expression for the lower bounds in these works may be further simplified by restricting the supremum to the class of all stationary arm selection policies; see [6, 7] for more details. However, it is unclear if such a simplification is possible in the absence of the trembling hand (as is the setting of this paper).

  • In our achievability analysis presented in Section VI, to ameliorate the difficulty arising from the countably infinite-valued arm delays, we constrain the maximum delay of each arm to be no more than RR, and focus our attention on those policies which select an arm forcibly if its delay equals RR. This constraint on the maximum delay of each arm makes the state-action space finite and amenable to further analyses. To the best of our knowledge, this is the first work to analyse delay-constrained policies for restless multi-armed bandits.

    It is worth noting that the achievability analyses in the works [6, 7] rely crucially on a key ergodicity property (see [6, Lemma 1]) for the controlled Markov process of arm delays and last observed states that is satisfied under a trembling hand model for arms selection. This ergodicity property is pivotal to the achievability analysis in [6, 7], and it is unclear if the same property holds in the absence of the trembling hand.

  • We devise a policy that, for an input parameter RR, forcibly pulls an arm if its delay equals RR, and whose expected stopping time grows as Θ(log(1/ϵ))\Theta(\log(1/\epsilon)). We show that the best (smallest) constant multiplying log(1/ϵ)\log(1/\epsilon) is equal to 1/TR(C)1/T_{R}^{\star}(C) under the problem instance CC; see (27) for the exact expression of TR(C)T_{R}^{\star}(C). Our results imply that 1/TR(C)1/T_{R}^{\star}(C) is a valid asymptotic upper bound on the growth rate of the expected stopping time (with respect to log(1/ϵ)\log(1/\epsilon)) for every admissible RR.

  • We show that TR(C)T_{R}^{\star}(C) is non-decreasing in RR, and that TR(C)T(C)T_{R}^{*}(C)\leq T^{\star}(C) for all RR, thus implying that limRTR(C)\lim_{R\to\infty}\ T_{R}^{\star}(C) exists. Thus, the lower bound of our work is given by T(C)1T^{\star}(C)^{-1}, whereas the upper bound is governed by limRTR(C)1\lim_{R\to\infty}T_{R}^{\star}(C)^{-1}. While it is certainly true that limRTR(C)T(C)\lim_{R\to\infty}\ T_{R}^{\star}(C)\leq T^{\star}(C), showing that, in general, this inequality is an equality seems to be a difficult problem and remains open. In the special case when the TPM of each arm has identical rows, which is akin to obtaining i.i.d. observations from the arms, the above inequality is an equality, thus leading to matching upper and lower bounds in this special setting. We are thus able to recover the results of the prior works [5, 3] for the case when the i.i.d. observations from each arm come from a finite alphabet.

I-E Paper Organisation

The rest of this paper is organised as follows. In Section II, we set up the notation and state our central goal. In Section III, we introduce the notions of arm delays, last observed states, and the Markov decision problem arising from the arm delays and the last observed states. We provide expressions for the log-likelihoods and the log-likelihood ratios, the basic quantities of analysis, in Section IV, and present the asymptotic lower bound on the growth rate of the expected time required to find the best arm in Section V. In Section VI, we constrain the maximum delay of each arm to be no more than RR for some R(K,)R\in\mathbb{N}\cap(K,\infty), and describe a policy that forcibly samples an arm whose delay equals RR. We provide results on the performance of the policy in Section VII, and state the main result in Section VIII. We discuss the convergence of TR(C)T_{R}^{\star}(C) to T(C)T^{\star}(C) as RR\to\infty in Section IX, where we show that the convergence takes place in the special case when the TPM of each arm has identical rows. We conclude the paper in Section X. The proofs of all the results are contained in Appendices A-I.

II Notations and Preliminaries

We consider a multi-armed bandit with K2K\geq 2 arms, and define 𝒜{1,,K}\mathcal{A}\coloneqq\{1,\ldots,K\} to be the set of arms. We associate with each arm an ergodic discrete-time Markov process on a common, finite state space 𝒮\mathcal{S}. We assume that the Markov process of each arm is independent of those of the other arms. We write {Xta:t0}\{X_{t}^{a}:t\geq 0\} to denote the Markov process of arm a𝒜a\in\mathcal{A}; throughout the paper, time t{0,1,2,}t\in\{0,1,2,\ldots\}. The state evolution on each arm is governed by its transition probability matrix (TPM). Given TPMs P1,,PKP_{1},\ldots,P_{K} and a permutation σ:{1,,K}{1,,K}\sigma:\{1,\ldots,K\}\to\{1,\ldots,K\}, let C=(Pσ(1),,Pσ(K))C=(P_{\sigma(1)},\ldots,P_{\sigma(K)}) denote an assignment of the TPMs to the arms in which the TPM assigned to arm aa is Pσ(a)P_{\sigma(a)}. In the sequel, we refer to CC as an assignment of the TPMs, and we let 𝒞\mathcal{C} denote the collection of all possible assignments of the TPMs, i.e.,

𝒞={(Pσ(1),,Pσ(K)):σ is a permutation on 𝒜}.\mathcal{C}=\{(P_{\sigma(1)},\ldots,P_{\sigma(K)}):\ \sigma\text{ is a permutation on }\mathcal{A}\}. (1)

For each k=1,,Kk=1,\ldots,K, let μk={μk(i):i𝒮}\mu_{k}=\{\mu_{k}(i):\ i\in\mathcal{S}\} denote the unique stationary distribution of the TPM PkP_{k}. Given a function f:𝒮f:\mathcal{S}\to\mathbb{R}, let

νai𝒮f(i)μa(i),a𝒜,\nu_{a}\coloneqq\sum_{i\in\mathcal{S}}\ f(i)\,\mu_{a}(i),\quad a\in\mathcal{A}, (2)

denote the average value of ff under μa\mu_{a}. Define the best arm a𝒜a^{\star}\in\mathcal{A} as

aargmaxa𝒜νa=argmaxa𝒜i𝒮f(i)μa(i).a^{\star}\coloneqq\arg\max_{a\in\mathcal{A}}\ \nu_{a}=\arg\max_{a\in\mathcal{A}}\ \sum_{i\in\mathcal{S}}\ f(i)\,\mu_{a}(i). (3)

We assume throughout the paper that aa^{\star} is unique. Without loss of generality, let Pa=P1P_{a^{\star}}=P_{1}; the reader may recognise that this does not necessarily imply a=1a^{\star}=1.
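To make the definitions in (2)-(3) concrete, the following Python sketch computes the stationary distribution of each TPM and the index of the best arm. The TPMs P1, P2 and the function f below are hypothetical placeholders used purely for illustration, and are not taken from the paper.

import numpy as np

def stationary_distribution(P):
    # Solve mu P = mu together with sum(mu) = 1 for an ergodic TPM P.
    S = P.shape[0]
    A = np.vstack([P.T - np.eye(S), np.ones(S)])
    b = np.append(np.zeros(S), 1.0)
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

# Hypothetical example: K = 2 arms on the common state space {0, 1}, with f the identity map.
P1 = np.array([[0.2, 0.8], [0.6, 0.4]])
P2 = np.array([[0.7, 0.3], [0.5, 0.5]])
f = np.array([0.0, 1.0])
nu = [f @ stationary_distribution(P) for P in (P1, P2)]
best_arm = int(np.argmax(nu))   # arm with the largest stationary average of f, as in (3)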

For an integer d1d\geq 1 and a matrix PP, let PdP^{d} denote the ddth power of the matrix PP. For i,j𝒮i,j\in\mathcal{S} and d1d\geq 1, let Pd(j|i)P^{d}(j|i) denote the (i,j)(i,j)th element of PdP^{d}. For a𝒜a\in\mathcal{A}, we assume that each row of PaP_{a} is mutually absolutely continuous with the corresponding row of PaP_{a^{\prime}} for all aaa^{\prime}\neq a. It is easy to see that this implies that for all d1d\geq 1, each row of PadP_{a}^{d} is mutually absolutely continuous with the corresponding row of PadP_{a^{\prime}}^{d}. The above assumption implies that the decision entity cannot infer the best arm merely by observing certain specific state(s) or state-transition(s) on the arm.

A decision entity that knows P1,,PKP_{1},\ldots,P_{K} only up to a permutation wishes to find the index of the best arm (i.e., the arm whose TPM is P1P_{1}) as quickly as possible, subject to an upper bound on the error probability. Clearly, this is accomplished if the decision entity finds C𝒞C\in\mathcal{C} that defines the problem instance. Given C𝒞C\in\mathcal{C}, we write Alt(C)\textsf{Alt}(C) to denote the set of all assignments of the TPMs alternative to CC, i.e., those assignments of the TPMs in which the location of the best arm is different from the one in CC. In order to find the index of the best arm in a problem instance CC, the decision entity selects the arms sequentially, one at each time tt. Let AtA_{t} be the arm selected at time tt. The decision entity observes the state of the arm AtA_{t}, denoted by X¯t\bar{X}_{t}. In contrast to the previous works [3, 5, 4] that deal with i.i.d. observations from the arms and rested arms, we assume that each arm continues to undergo state evolution whether or not it is selected (restless arms). Let (A0:t,X¯0:t)(A0,X¯0,,At,X¯t)(A_{0:t},\bar{X}_{0:t})\coloneqq(A_{0},\bar{X}_{0},\ldots,A_{t},\bar{X}_{t}) denote the history of all the arm selections and observations seen up to time tt. All random variables are defined on a common probability space (Ω,,)(\Omega,\mathcal{F},\mathbb{P}). Define the filtration

0{Ω,},tσ(A0:t1,X¯0:t1),t1.\mathcal{F}_{0}\coloneqq\{\Omega,\emptyset\},\quad\mathcal{F}_{t}\coloneqq\sigma(A_{0:t-1},\bar{X}_{0:t-1}),\quad t\geq 1. (4)

II-A Policy and Problem Definition

A policy π\pi is defined by a collection of functions {πt:t0}\{\pi_{t}:t\geq 0\}. At each time tt, πt\pi_{t} does one of the following based on the history t\mathcal{F}_{t}:

  • stop and declare the index of the best arm;

  • choose to pull arm AtA_{t} according to a deterministic or a randomised rule.

Let π\pi denote a generic policy, and let τ(π)\tau(\pi) denote the stopping time of policy π\pi (defined with respect to the filtration (4)). Let θ(τ(π))\theta(\tau(\pi)) denote the index of the best arm declared by the policy π\pi at the stopping time.

Let PCπ()P_{C}^{\pi}(\cdot) denote the probability computed under the assignment of the TPMs CC and under the policy π\pi. For a𝒜a\in\mathcal{A}, let 𝒞a𝒞\mathcal{C}_{a}\subset\mathcal{C} denote the collection of all permutations in which Pa=P1P_{a}=P_{1}. Clearly, the collection {𝒞a:a𝒜}\{\mathcal{C}_{a}:\ a\in\mathcal{A}\} is a partition of 𝒞\mathcal{C}. Given an error probability threshold ϵ>0\epsilon>0, let

Π(ϵ){π:for all a𝒜,PCπ(θ(τ(π))a)ϵC𝒞a}\Pi(\epsilon)\coloneqq\{\pi:\ \text{for all }a\in\mathcal{A},\ \ P_{C}^{\pi}(\theta(\tau(\pi))\neq a)\leq\epsilon\ \forall C\in\mathcal{C}_{a}\} (5)

denote the collection of all policies whose error probability at the stopping time is no more than ϵ\epsilon for all possible assignments of TPMs. We anticipate from similar results in the prior works [3, 5, 4] that

infπΠ(ϵ)𝔼Cπ[τ(π)]=Θ(log(1/ϵ)).\inf_{\pi\in\Pi(\epsilon)}\mathbb{E}_{C}^{\pi}[\tau(\pi)]=\Theta(\log(1/\epsilon)).

Here, 𝔼Cπ[]\mathbb{E}_{C}^{\pi}[\cdot] denotes the expectation under the assignment of the TPMs CC and policy π\pi. Our interest is in characterising, or at least bounding, the value of

limϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ).\lim_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\ \frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}. (6)

For simplicity, we assume that every policy starts by selecting arm 11 at time t=0t=0, arm 22 at time t=1t=1, etc., and arm KK at time t=K1t=K-1. This ensures that the Markov process of each arm is observed at least once.

III Delays, Last Observed States, and a Markov Decision Problem

The contents of this section are mostly borrowed from [6, Section II-B, II-C], but modified appropriately to reflect the absence of the trembling hand assumption of [6], with an intent to keep the material in the paper self-contained. Recall that the decision entity observes only one of the arms at each time tt, while the unobserved arms continue to undergo state evolution. This means that at any time tt, the probability of the observation X¯t\bar{X}_{t} on the selected arm AtA_{t} given the history t\mathcal{F}_{t} is a function of (a) the time elapsed since the previous time instant of selection of arm AtA_{t} (called the delay of arm AtA_{t}), and (b) the state of arm AtA_{t} at its previous selection time instant (called the last observed state of arm AtA_{t}). Notice that when the arms are rested, the notion of arm delays is superfluous since each arm remains frozen at its previously observed state until its next selection time instant. Also, the notion of arm delays is redundant in the setting of iid observations because the current state of the arm selected is independent of the state at its previous selection. The notion of arm delays is a key distinguishing feature of the setting of restless arms.

For tKt\geq K, let da(t)d_{a}(t) and ia(t)i_{a}(t) respectively denote the delay and the last observed state of arm aa at time tt. Let d¯(t)(d1(t),,dK(t))\underline{d}(t)\coloneqq(d_{1}(t),\ldots,d_{K}(t)) and i¯(t)(i1(t),,iK(t))\underline{i}(t)\coloneqq(i_{1}(t),\ldots,i_{K}(t)) denote the vectors of arm delays and the last observed states at time tt. Note that arm delays and last observed states are defined only for tKt\geq K as these quantities are well-defined only when at least one observation is available from each arm. Set d¯(K)=(K,K1,,1)\underline{d}(K)=(K,K-1,\ldots,1). Thus, da(t)1d_{a}(t)\geq 1 for all tKt\geq K, and da(t)=1d_{a}(t)=1 if and only if arm aa is selected at time t1t-1.

The rule for updating the arm delays and last observed states is as follows: if At=aA_{t}=a^{\prime}, then

da(t+1)={da(t)+1,aa,1,a=a,ia(t+1)={ia(t),aa,Xta,a=a,\displaystyle{d}_{a}(t+1)=\begin{cases}d_{a}(t)+1,&a\neq a^{\prime},\\ 1,&a=a^{\prime},\end{cases}\qquad\qquad i_{a}(t+1)=\begin{cases}i_{a}(t),&a\neq a^{\prime},\\ X_{t}^{a},&a=a^{\prime},\end{cases} (7)

where Xta=X¯tX_{t}^{a}=\bar{X}_{t} is the state of the arm At=aA_{t}=a^{\prime} at time tt. Thus, for all tKt\geq K, based on {(d¯(s),i¯(s)):Kst}\{(\underline{d}(s),\underline{i}(s)):K\leq s\leq t\}, the decision entity chooses to pull AtA_{t} and observes X¯t\bar{X}_{t} (note that specifying {(d¯(s),i¯(s)):Kst}\{(\underline{d}(s),\underline{i}(s)):K\leq s\leq t\} is equivalent to specifying (A0:t1,X¯0:t1)(A_{0:t-1},\bar{X}_{0:t-1}) for all tKt\geq K). It then forms (d¯(t+1),i¯(t+1))(\underline{d}(t+1),\underline{i}(t+1)). This repeats until the stopping time, at which time it declares θ(τ(π))\theta(\tau(\pi)) (under policy π\pi) as the candidate best arm.
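A minimal Python sketch of the update rule (7) is given below, assuming 0-indexed arms and that the initial round-robin phase has already produced one observation from each arm; the variable names are illustrative only.

def update_delays_and_states(d, i, selected_arm, observed_state):
    # Rule (7): the selected arm's delay resets to 1 and its last observed state is refreshed;
    # every other arm's delay increases by 1 and its last observed state is unchanged.
    K = len(d)
    d_next = [d[a] + 1 for a in range(K)]
    i_next = list(i)
    d_next[selected_arm] = 1
    i_next[selected_arm] = observed_state
    return d_next, i_next

# Example with K = 3 arms, starting from the delays (K, K-1, ..., 1) set after the round-robin phase.
d, i = [3, 2, 1], [0, 1, 0]
d, i = update_delays_and_states(d, i, selected_arm=0, observed_state=1)
# now d == [1, 3, 2] and i == [1, 1, 0]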

From the update rule in (7), it is clear that the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} takes values in a subset 𝕊\mathbb{S} of the countable set K×𝒮K\mathbb{N}^{K}\times\mathcal{S}^{K}, where ={1,2,}\mathbb{N}=\{1,2,\ldots\} denotes the set of natural numbers. The subset 𝕊\mathbb{S} is formed based on the constraint that at any time tKt\geq K, exactly one of the components of d¯(t)\underline{d}(t) is equal to 11, and all the other components are >1>1. Given any assignment of the TPMs C𝒞C\in\mathcal{C} and policy π\pi, note that for all (d¯,i¯)𝕊(\underline{d}^{\prime},\underline{i}^{\prime})\in\mathbb{S} and tKt\geq K,

PCπ(d¯(t+1)=d¯,i¯(t+1)=i¯{(d¯(s),i¯(s)),Kst},A0:t)\displaystyle P_{C}^{\pi}(\underline{d}(t+1)=\underline{d}^{\prime},\underline{i}(t+1)=\underline{i}^{\prime}\mid\{(\underline{d}(s),\underline{i}(s)),\ K\leq s\leq t\},A_{0:t})
=PCπ(d¯(t+1)=d¯,i¯(t+1)=i¯(d¯(t),i¯(t)),At).\displaystyle=P_{C}^{\pi}(\underline{d}(t+1)=\underline{d}^{\prime},\underline{i}(t+1)=\underline{i}^{\prime}\mid(\underline{d}(t),\underline{i}(t)),A_{t}). (8)

On account of (8) being satisfied, we say that under any policy π\pi, the evolution of the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is controlled by the sequence {At}t0\{A_{t}\}_{t\geq 0} of intended arm selections under policy π\pi. Alternatively, {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is a controlled Markov process, with {At}t0\{A_{t}\}_{t\geq 0} being the sequence of controls (the terminology used here follows that of Borkar [26]). Thus, we are in a Markov decision problem (MDP) setting. We now make precise the state space, the action space, the transition probabilities, and our objective.

The state space of the MDP is 𝕊\mathbb{S}, with the state at time tt denoted (d¯(t),i¯(t))(\underline{d}(t),\underline{i}(t)). The action space of the MDP is 𝒜\mathcal{A}, with action AtA_{t} at time tt possibly depending on the history t\mathcal{F}_{t}. The transition probabilities for the MDP under an assignment of the TPMs C𝒞C\in\mathcal{C} and a policy π\pi are given by

PCπ(d¯(t+1)=d¯,i¯(t+1)=i¯d¯(t)=d¯,i¯(t)=i¯,At=a)\displaystyle P_{C}^{\pi}(\underline{d}(t+1)=\underline{d}^{\prime},\underline{i}(t+1)=\underline{i}^{\prime}\mid\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a)
={(PCa)da(ia|ia),if da=1 and da~=da~+1 for all a~a,ia~=ia~ for all a~a,0,otherwise,\displaystyle=\begin{cases}(P_{C}^{a})^{d_{a}}(i_{a}^{\prime}|i_{a}),&\text{if }d_{a}^{\prime}=1\text{ and }d^{\prime}_{\tilde{a}}=d_{\tilde{a}}+1\text{ for all }\tilde{a}\neq a,\\ &i_{\tilde{a}}^{\prime}=i_{\tilde{a}}\text{ for all }\tilde{a}\neq a,\\ 0,&\text{otherwise},\end{cases} (9)

where PCaP_{C}^{a} denotes the TPM of arm aa under the assignment of the TPMs CC. For instance, if C=(P1,,PK)C=(P_{1},\ldots,P_{K}), then PCa=PaP_{C}^{a}=P_{a} for all a𝒜a\in\mathcal{A}. Note that the right hand side of (9) is not a function of tt or π\pi. Let QC(d¯,i¯|d¯,i¯,a)Q_{C}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a) denote the transition probabilities in (9).
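As an illustration of (9), the following Python sketch evaluates the transition probability of the MDP for a given current state, action, and next state; the arm TPMs are assumed to be supplied as a list of numpy arrays, and all names are illustrative.

import numpy as np

def transition_prob(tpms, d, i, a, d_next, i_next):
    # Q_C(d', i' | d, i, a) from (9): the selected arm's delay resets to 1 and its state moves
    # from i[a] to i_next[a] in d[a] steps of its TPM; every other arm's delay increases by 1
    # and its last observed state is unchanged.  Any other transition has probability 0.
    K = len(d)
    for b in range(K):
        if b == a and d_next[b] != 1:
            return 0.0
        if b != a and (d_next[b] != d[b] + 1 or i_next[b] != i[b]):
            return 0.0
    P_da = np.linalg.matrix_power(tpms[a], d[a])
    return float(P_da[i[a], i_next[a]])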

IV Log-Likelihoods and Log-Likelihood Ratios

Given C𝒞C\in\mathcal{C}, let

ZCπ(n)logPCπ(A0:n,X¯0:n)\displaystyle Z_{C}^{\pi}(n)\coloneqq\log P_{C}^{\pi}(A_{0:n},\bar{X}_{0:n}) (10)

denote the log-likelihood of all the controls and observations seen up to time nn under the policy π\pi when CC is the assignment of the TPMs. Under the assumption that π\pi selects arm 11 at time t=0t=0, arm 22 at time t=1t=1, etc., and arm KK at time K1K-1, (10) may be expressed as

ZCπ(n)\displaystyle Z_{C}^{\pi}(n) =a=1KlogPCπ(Xa1a)\displaystyle=\sum_{a=1}^{K}\log P_{C}^{\pi}(X_{a-1}^{a}) (11)
+t=KnlogPCπ(At|A0:t1,X¯0:t1)\displaystyle\hskip 28.45274pt+\sum_{t=K}^{n}\ \log P_{C}^{\pi}(A_{t}|A_{0:t-1},\,\bar{X}_{0:t-1}) (12)
+t=KnlogPCπ(X¯t|A0:t,X¯0:t1).\displaystyle\hskip 56.9055pt+\sum_{t=K}^{n}\ \log P_{C}^{\pi}(\bar{X}_{t}|A_{0:t},\,\bar{X}_{0:t-1}). (13)

Because π\pi is oblivious to the assignment of the TPMs CC, (12) does not depend on CC. Furthermore, (13) may be expressed as

t=KnlogPCπ(X¯t|A0:t,X¯0:t1)\displaystyle\sum_{t=K}^{n}\ \log P_{C}^{\pi}(\bar{X}_{t}|A_{0:t},\,\bar{X}_{0:t-1}) =t=Kna=1K𝕀{At=a}logPCπ(X¯t|A0:t1,At=a,X¯0:t1)\displaystyle=\sum_{t=K}^{n}\ \sum_{a=1}^{K}\ \mathbb{I}_{\{A_{t}=a\}}\ \log P_{C}^{\pi}(\bar{X}_{t}|A_{0:t-1},A_{t}=a,\bar{X}_{0:t-1})
=t=Kn(d¯,i¯)𝕊a=1K𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a}log(PCa)da(X¯t|ia)\displaystyle=\sum_{t=K}^{n}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i},\,A_{t}=a\}}\,\log(P_{C}^{a})^{d_{a}}(\bar{X}_{t}|i_{a})
=t=Kn(d¯,i¯)𝕊a=1Kj𝒮𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a,X¯t=j}log(PCa)da(j|ia)\displaystyle=\sum_{t=K}^{n}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i},\,A_{t}=a,\,\bar{X}_{t}=j\}}\,\log(P_{C}^{a})^{d_{a}}(j|i_{a})
=(d¯,i¯)𝕊a=1Kj𝒮N(n,d¯,i¯,a,j)log(PCa)da(j|ia),\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j)\ \log(P_{C}^{a})^{d_{a}}(j|i_{a}), (14)

where

N(n,d¯,i¯,a,j)t=Kn𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a,X¯t=j}\displaystyle N(n,\underline{d},\underline{i},a,j)\coloneqq\sum_{t=K}^{n}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i},\,A_{t}=a,\,\bar{X}_{t}=j\}} (15)

denotes the number of times up to time nn the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):\,t\geq K\} is in the state (d¯,i¯)(\underline{d},\underline{i}), arm aa is selected subsequently, and the state of arm aa is observed to be jj. Let

N(n,d¯,i¯,a)j𝒮N(n,d¯,i¯,a,j),N(n,d¯,i¯)a=1Kj𝒮N(n,d¯,i¯,a,j).\displaystyle N(n,\underline{d},\underline{i},a)\coloneqq\sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j),\quad N(n,\underline{d},\underline{i})\coloneqq\sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j). (16)

Given C,C𝒞C,C^{\prime}\in\mathcal{C}, let

ZCCπ(n)ZCπ(n)ZCπ(n)=logPCπ(A0:n,X¯0:n)PCπ(A0:n,X¯0:n)\displaystyle Z_{CC^{\prime}}^{\pi}(n)\coloneqq Z_{C}^{\pi}(n)-Z_{C^{\prime}}^{\pi}(n)=\log\frac{P_{C}^{\pi}(A_{0:n},\bar{X}_{0:n})}{P_{C^{\prime}}^{\pi}(A_{0:n},\bar{X}_{0:n})} (17)

denote the log-likelihood ratio (LLR) of the controls and observations seen under the policy π\pi when the assignment of the TPMs is CC with respect to that when the assignment of the TPMs is CC^{\prime}. Using (14) in conjunction with the fact that (12) does not depend on CC, we get

ZCCπ(n)=a=1KlogPCπ(Xa1a)PCπ(Xa1a)+(d¯,i¯)𝕊a=1Kj𝒮N(n,d¯,i¯,a,j)log(PCa)da(j|ia)(PCa)da(j|ia).\displaystyle Z_{CC^{\prime}}^{\pi}(n)=\sum_{a=1}^{K}\log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}+\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}. (18)
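For concreteness, the second summation in (18) can be computed directly from the counts defined in (15). The Python sketch below does this, assuming the counts are stored in a dictionary keyed by (d, i, a, j); the first summation in (18), involving the initial observations, is omitted here, and all names are illustrative.

import numpy as np

def llr_second_term(counts, tpms_C, tpms_Cprime):
    # counts maps (d, i, a, j) -> N(n, d, i, a, j), with d and i tuples of arm delays and last
    # observed states, a the selected arm, and j the observed state.  tpms_C[a] and
    # tpms_Cprime[a] are the TPMs of arm a under the assignments C and C'.
    total = 0.0
    for (d, i, a, j), count in counts.items():
        p = np.linalg.matrix_power(tpms_C[a], d[a])[i[a], j]
        q = np.linalg.matrix_power(tpms_Cprime[a], d[a])[i[a], j]
        total += count * np.log(p / q)
    return total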

V Lower Bound

We now present a lower bound for (6). Given two probability distributions μ\mu and ν\nu on the finite state space 𝒮\mathcal{S}, the Kullback–Leibler (KL) divergence (also called the relative entropy) between μ\mu and ν\nu is defined as

DKL(μν)i𝒮μ(i)logμ(i)ν(i),D_{\textsf{KL}}(\mu\|\nu)\coloneqq\sum_{i\in\mathcal{S}}\mu(i)\log\frac{\mu(i)}{\nu(i)}, (19)

where, by convention, 0log00=00\log\frac{0}{0}=0.
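A small helper implementing (19), with the convention 0 log(0/0) = 0 handled by masking out the zero-probability states, is sketched below for reference.

import numpy as np

def kl_divergence(mu, nu):
    # D_KL(mu || nu) as in (19); states with mu(i) = 0 contribute 0 by convention.
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))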

Proposition 1

Suppose C𝒞C\in\mathcal{C} is the underlying assignment of the TPMs. Then,

lim infϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ)1T(C),\liminf_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}\geq\frac{1}{T^{\star}(C)}, (20)

where T(C)T^{\star}(C) is given by

T(C)supνminCAlt(C)(d¯,i¯)𝕊a=1Kν(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia)).\displaystyle T^{\star}(C)\coloneqq\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})). (21)

In (21), the supremum is over all ν={ν(d¯,i¯,a):(d¯,i¯)𝕊,a𝒜}\nu=\{\nu(\underline{d},\underline{i},a):(\underline{d},\underline{i})\in\mathbb{S},a\in\mathcal{A}\} satisfying

(d¯,i¯)𝕊a=1Kν(d¯,i¯,a)=1,\displaystyle\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\leavevmode\nobreak\ \sum_{a=1}^{K}\,\nu(\underline{d},\underline{i},a)=1, (22)
ν(d¯,i¯,a)0for all (d¯,i¯,a)𝕊×𝒜.\displaystyle\nu(\underline{d},\underline{i},a)\geq 0\quad\text{for all }(\underline{d},\underline{i},a)\in\mathbb{S}\times\mathcal{A}. (23)
Proof:

The proof proceeds in several steps. First, we derive an analogue of the ubiquitous change-of-measure result [5, Lemma 18] for the setting of restless arms. We then lower bound the expected LLR of any policy whose stoppage error probability is at most ϵ\epsilon by d(ϵ,1ϵ)d(\epsilon,1-\epsilon), where d(x,y)d(x,y) denotes the relative entropy between two Bernoulli distributions with parameters xx and yy. Next, we derive an upper bound on the expected LLR in terms of the expected stopping time. This involves deriving an analogue of Wald’s identity for the setting of restless arms. Combining the upper and the lower bounds for the expected LLR, we get (20). The details are in Appendix A. ∎

V-A A Flow Constraint

Notice that T(C)T^{\star}(C) is the optimal value of an infinite-dimensional linear program (LP). The question of whether there exists ν\nu attaining the supremum in (21) remains open. The key difficulty in showing this is that because the set 𝕊\mathbb{S} is countably infinite, it is not clear whether the set of all ν\nu satisfying (22)-(23) (which is akin to the space of all probability distributions on 𝕊×𝒜\mathbb{S}\times\mathcal{A}) is compact. Also, because this supremum is over all probability distributions on 𝕊×𝒜\mathbb{S}\times\mathcal{A} which is a large class of distributions, the lower bound in (20) may not be achievable. It seems necessary to introduce additional constraints on ν\nu to render the lower bound achievable. Indeed, given δ>0\delta>0, suppose νδ\nu_{\delta} is a probability distribution on 𝕊×𝒜\mathbb{S}\times\mathcal{A} such that

minCAlt(C)(d¯,i¯)𝕊a=1Kνδ(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))T(C)δ.\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \nu_{\delta}(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\geq T^{\star}(C)-\delta.

One way to achieve the quantity on the left hand side of the above equation is to ensure that for all (d¯,i¯,a)𝕊×𝒜(\underline{d},\underline{i},a)\in\mathbb{S}\times\mathcal{A}, the value of the fraction N(n,d¯,i¯,a)/nN(n,\underline{d},\underline{i},a)/n is close to νδ(d¯,i¯,a)\nu_{\delta}(\underline{d},\underline{i},a) for all nn large (the regime of large nn is akin to the regime of vanishing error probabilities). It seems difficult to accomplish this in the absence of more structure on νδ\nu_{\delta}. It is worth noting here that in a related problem of odd arm identification, the authors of [6] are confronted with a similar difficulty in showing the achievability of the lower bound therein. To ameliorate the difficulty, they introduce a version of the following flow constraint on ν\nu:

flow constraint:a=1Kν(d¯,i¯,a)=(d¯,i¯)𝕊a=1Kν(d¯,i¯,a)QC(d¯,i¯|d¯,i¯,a)for all (d¯,i¯)𝕊.\textsf{flow constraint}:\quad\sum_{a=1}^{K}\nu(\underline{d}^{\prime},\underline{i}^{\prime},a)=\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ Q_{C}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a)\quad\text{for all }(\underline{d}^{\prime},\underline{i}^{\prime})\in\mathbb{S}. (24)

In (24), QCQ_{C} is the MDP transition matrix defined in (9). The flow constraint is, in fact, a global balance equation, and dictates that for any (d¯,i¯)𝕊(\underline{d}^{\prime},\underline{i}^{\prime})\in\mathbb{S}, the long-term probability of a transition from (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}) (the flow out of (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}), captured by the left hand side of (24)) is equal to the probability of a transition to (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}) (the flow into (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}), captured by the right hand side of (24)). The authors of [6] show that their lower bound, after including the flow constraint, can be achieved by a certain trembling hand-based policy; see [6, Section V] for a description of the policy. However, the policy in [6] is not practically implementable.

With an end goal of showing achievability of our lower bound, we take (24) into consideration along with (22)-(23) when evaluating the supremum in (21), and wish to design a policy that (a) is computationally feasible/tractable and easy-to-implement, and (b) achieves the lower bound in (20). This forms the content of the next section.

VI Achievability

As alluded to in the previous section, the trembling hand-based policy of [6] is not practically implementable. One reason for this is that the arm delays (which appear in the policy of [6]) take countably infinitely many values and therefore cannot be handled on a machine with finite-size memory. To alleviate the difficulty arising from the countably infinite-valued arm delays, we study a simplified setting where the maximum delay of each arm is restricted to be at most RR for some R(K,)R\in\mathbb{N}\cap(K,\infty) (we consider R>KR>K to be consistent with our assumption that each of the arms is sampled once in the first KK time instants), and an arm whose delay is equal to RR at any given time is forcibly selected in the following time instant. Let 𝕊R\mathbb{S}_{R} denote the subset of 𝕊\mathbb{S} in which the delay of each arm is no more than RR. Further, for a𝒜a\in\mathcal{A}, let 𝕊R,a\mathbb{S}_{R,a} denote the subset of 𝕊R\mathbb{S}_{R} in which the delay of arm aa is equal to RR. Notice that 𝕊R,a\mathbb{S}_{R,a} is a finite set for each aa and that 𝕊R,a\mathbb{S}_{R,a} and 𝕊R,a\mathbb{S}_{R,a^{\prime}} are disjoint for all aaa^{\prime}\neq a.

VI-A Modifications to the MDP Transition Probabilities Under Maximum Delay Constraint

Recall that {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} and {At:t0}\{A_{t}:t\geq 0\} together define a Markov decision problem (MDP) whose state space is 𝕊\mathbb{S}, the action space is 𝒜\mathcal{A}, the state at time tt is (d¯(t),i¯(t))(\underline{d}(t),\underline{i}(t)), and the control (or action) at time tt is AtA_{t}. From Section III, we know that the transition probabilities of the MDP are given by (9). When the delay of each arm is constrained to be no more than RR, the modified state space of the MDP is 𝕊R\mathbb{S}_{R}, and the modified transition probabilities for the MDP are as follows:

  • Case 1: (d¯,i¯)a=1K𝕊R,a(\underline{d},\underline{i})\notin\bigcup_{a=1}^{K}\ \mathbb{S}_{R,a}. In this case, the transition probabilities are as in (9).

  • Case 2: (d¯,i¯)𝕊R,a(\underline{d},\underline{i})\in\mathbb{S}_{R,a} for some a𝒜a\in\mathcal{A}. In this case, when At=aA_{t}=a,

    PCπ(d¯(t+1)=d¯,i¯(t+1)=i¯d¯(t)=d¯,i¯(t)=i¯,At=a)\displaystyle P_{C}^{\pi}(\underline{d}(t+1)=\underline{d}^{\prime},\underline{i}(t+1)=\underline{i}^{\prime}\mid\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a)
    ={(PCa)R(ia|ia),if da=1 and da~=da~+1 for all a~a,ia~=ia~ for all a~a,0,otherwise,\displaystyle=\begin{cases}(P_{C}^{a})^{R}(i_{a}^{\prime}|i_{a}),&\text{if }d_{a}^{\prime}=1\text{ and }d^{\prime}_{\tilde{a}}=d_{\tilde{a}}+1\text{ for all }\tilde{a}\neq a,\\ &i_{\tilde{a}}^{\prime}=i_{\tilde{a}}\text{ for all }\tilde{a}\neq a,\\ 0,&\text{otherwise},\end{cases} (25)

    and when AtaA_{t}\neq a, the transition probabilities are undefined.

We write QC,R(d¯,i¯|d¯,i¯,a)Q_{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a) to denote the transition probabilities in (25).

VI-B Capturing the Maximum Delay Constraint and a Finite Dimensional Linear Program

Recall that in the absence of any constraints on the maximum delay of each arm, the lower bound is as in (20), with the constant T(C)T^{\star}(C) in (20) as given in (21). Further, the supremum in (21) is over all ν\nu satisfying (22)-(24). When the delay of each arm is constrained to be no more than RR, the following additional constraint on ν\nu comes into play:

R-max-delay-constraint:ν(d¯,i¯,a)=a=1Kν(d¯,i¯,a)for all (d¯,i¯)𝕊R,a.R\textsf{-max-delay-constraint}:\quad\nu(\underline{d},\underline{i},a)=\sum_{a^{\prime}=1}^{K}\ \nu(\underline{d},\underline{i},a^{\prime})\quad\text{for all }(\underline{d},\underline{i})\in\mathbb{S}_{R,a}. (26)

The condition in (26) captures the observation that any occurrence of the state (d¯,i¯)𝕊R,a(\underline{d},\underline{i})\in\mathbb{S}_{R,a} is followed by selecting arm aa forcibly (i.e., with probability 11), thus implying that ν(d¯,i¯,a)=0\nu(\underline{d},\underline{i},a^{\prime})=0 for all aaa^{\prime}\neq a which is equivalent to (26). Let TR(C)T_{R}^{\star}(C) be the optimal value of the following optimisation problem:

supνminCAlt(C)(d¯,i¯)𝕊Ra=1Kν(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia)),\displaystyle\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})), (27)
subject to
a=1Kν(d¯,i¯,a)=(d¯,i¯)𝕊Ra=1Kν(d¯,i¯,a)QC,R(d¯,i¯|d¯,i¯,a)for all (d¯,i¯)𝕊R,\displaystyle\sum_{a=1}^{K}\nu(\underline{d}^{\prime},\underline{i}^{\prime},a)=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\,\nu(\underline{d},\underline{i},a)\ Q_{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a)\quad\text{for all }(\underline{d}^{\prime},\underline{i}^{\prime})\in\mathbb{S}_{R}, (28)
(d¯,i¯)𝕊Ra=1Kν(d¯,i¯,a)=1,\displaystyle\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)=1, (29)
ν(d¯,i¯,a)0for all (d¯,i¯,a)𝕊R×𝒜,\displaystyle\nu(\underline{d},\underline{i},a)\geq 0\quad\text{for all }(\underline{d},\underline{i},a)\in\mathbb{S}_{R}\times\mathcal{A}, (30)
ν(d¯,i¯,a)=a=1Kν(d¯,i¯,a)for all (d¯,i¯)𝕊R,a,a𝒜.\displaystyle\nu(\underline{d},\underline{i},a)=\sum_{a^{\prime}=1}^{K}\ \nu(\underline{d},\underline{i},a^{\prime})\quad\text{for all }(\underline{d},\underline{i})\in\mathbb{S}_{R,a},\quad a\in\mathcal{A}. (31)

Notice that the above optimisation problem is a finite-dimensional LP, and is the analogue of the infinite-dimensional LP in (21) with the additional condition in (31) to account for the case when an arm is forcibly selected if its delay equals RR. Because (a) 𝕊R×𝒜\mathbb{S}_{R}\times\mathcal{A} is finite, (b) the space of all probability distributions on 𝕊R×𝒜\mathbb{S}_{R}\times\mathcal{A} (say, 𝒫(𝕊R×𝒜)\mathscr{P}(\mathbb{S}_{R}\times\mathcal{A})) is compact with respect to the topology arising from the Euclidean metric in |𝕊R×𝒜|\mathbb{R}^{|\mathbb{S}_{R}\times\mathcal{A}|}, (c) the set of all ν\nu satisfying (28)-(31) is a closed subset of 𝒫(𝕊R×𝒜)\mathscr{P}(\mathbb{S}_{R}\times\mathcal{A}) (and therefore compact), and (d) the expression

minCAlt(C)(d¯,i¯)𝕊Ra=1Kν(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))

is continuous in ν\nu, it follows by Weierstrass extreme value theorem that there exists νC,R={νC,R(d¯,i¯,a):(d¯,i¯,a)𝕊R×𝒜}\nu_{C,R}^{\star}=\{\nu_{C,R}^{\star}(\underline{d},\underline{i},a):(\underline{d},\underline{i},a)\in\mathbb{S}_{R}\times\mathcal{A}\} that attains the supremum in (27). Although a closed-form expression for νC,R\nu_{C,R}^{\star} is not currently available, it can easily be evaluated numerically. In the next section, we fix R(K,)R\in\mathbb{N}\cap(K,\infty) and design a policy for finding the best arm that samples an arm forcibly if its delay is equal to RR. Additionally, we demonstrate that our policy (a) stops in finite time almost surely, (b) satisfies the desired error probability, and (c) achieves an upper bound of 1/TR(C)1/T_{R}^{\star}(C) asymptotically as the error probability vanishes. We shall see that our policy is easy to implement as it operates on the finite set 𝕊R\mathbb{S}_{R} instead of the countable set 𝕊\mathbb{S}.
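Before moving on, we note that the sup-min in (27) can be cast as a single LP by introducing an epigraph variable: maximise an auxiliary variable t subject to t being no larger than the objective evaluated at each alternative assignment in Alt(C), together with the constraints (28)-(31). The Python sketch below, using scipy.optimize.linprog, illustrates this reformulation; the matrix D_alt of KL-divergence coefficients (one row per alternative assignment, one column per state-action pair) and the equality-constraint data A_eq, b_eq encoding (28), (29), and (31) are assumed to be precomputed, and all names are illustrative.

import numpy as np
from scipy.optimize import linprog

def solve_T_R_star(D_alt, A_eq, b_eq):
    # Maximise t subject to D_alt[c] @ nu >= t for every alternative c,
    # A_eq @ nu == b_eq (constraints (28), (29), (31)), and nu >= 0 (constraint (30)).
    n_alt, dim = D_alt.shape
    c = np.zeros(dim + 1)
    c[-1] = -1.0                                     # decision vector x = (nu, t); minimise -t
    A_ub = np.hstack([-D_alt, np.ones((n_alt, 1))])  # t - D_alt[c] @ nu <= 0 for every c
    b_ub = np.zeros(n_alt)
    A_eq_x = np.hstack([A_eq, np.zeros((A_eq.shape[0], 1))])
    bounds = [(0, None)] * dim + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq_x, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:-1]                     # (T_R_star, optimal occupancy measure nu)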

VI-C A Policy for Finding the Best Arm Under a Maximum Delay Constraint

Fix an R(K,)R\in\mathbb{N}\cap(K,\infty). In this section, we design a policy for finding the best arm when the delay of each arm is constrained to be no more than RR. Towards this, we first analyse a uniform arm selection policy that, for all tKt\geq K, selects the arms uniformly whenever the delay of each arm is <R<R, and forcibly selects an arm whose delay is equal to RR. Let this policy be denoted πRunif\pi_{R}^{\textsf{unif}}. It is clear that {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is a Markov process under πRunif\pi_{R}^{\textsf{unif}} with 𝕊R\mathbb{S}_{R} as its state space. The following result shows that this Markov process is, in fact, ergodic.

Lemma 1

Fix R(K,)R\in\mathbb{N}\cap(K,\infty). Under every assignment of the TPMs C𝒞C\in\mathcal{C}, the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):\ t\geq K\} is ergodic under the policy πRunif\pi_{R}^{\textsf{unif}}.

Proof:

See Appendix B. ∎

As a consequence of Lemma 1, the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):\ t\geq K\} has a unique stationary distribution, say μC,Runif={μC,Runif(d¯,i¯):(d¯,i¯)𝕊R}\mu_{C,R}^{\textsf{unif}}=\{\mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i}):(\underline{d},\underline{i})\in\mathbb{S}_{R}\}, under the policy πRunif\pi_{R}^{\textsf{unif}} and under the assignment of the TPMs CC. We note that μC,Runif(d¯,i¯)>0\mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i})>0 for all (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R}. Let

νC,Runif(d¯,i¯,a){μC,Runif(d¯,i¯)K,(d¯,i¯)a=1K𝕊R,a,μC,Runif(d¯,i¯),(d¯,i¯)𝕊R,a,0,(d¯,i¯)aa𝕊R,a.\nu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i},a)\coloneqq\begin{cases}\frac{\mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i})}{K},&(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\ \mathbb{S}_{R,a^{\prime}},\\ \mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i}),&(\underline{d},\underline{i})\in\mathbb{S}_{R,a},\\ 0,&(\underline{d},\underline{i})\in\bigcup_{a^{\prime}\neq a}\ \mathbb{S}_{R,a^{\prime}}.\end{cases} (32)

denote the corresponding ergodic state-action occupancy measure. Observe that νC,Runif\nu_{C,R}^{\textsf{unif}} satisfies (28)-(31).

For η(0,1]\eta\in(0,1] and C𝒞C\in\mathcal{C}, let

νη,R,C(d¯,i¯,a)\displaystyle\nu_{\eta,R,C}(\underline{d},\underline{i},a) ηνC,Runif(d¯,i¯,a)+(1η)νC,R(d¯,i¯,a),\displaystyle\coloneqq\eta\ \nu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i},a)+(1-\eta)\ \nu_{C,R}^{\star}(\underline{d},\underline{i},a), (33)
μη,R,C(d¯,i¯)\displaystyle\mu_{\eta,R,C}(\underline{d},\underline{i}) a=1Kνη,R,C(d¯,i¯,a).\displaystyle\coloneqq\sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a). (34)

Observe that μη,R,C(d¯,i¯)ηKμC,Runif(d¯,i¯)>0\mu_{\eta,R,C}(\underline{d},\underline{i})\geq\frac{\eta}{K}\ \mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i})>0 for all (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R}. Also, νη,R,C\nu_{\eta,R,C} satisfies (28)-(31) by virtue of the fact that both νC,Runif\nu_{C,R}^{\textsf{unif}} and νC,R\nu_{C,R}^{\star} satisfy (28)-(31). Let λη,R,C={λη,R,C(a|d¯,i¯):a𝒜,(d¯,i¯)𝕊R}\lambda_{\eta,R,C}=\{\lambda_{\eta,R,C}(a|\underline{d},\underline{i}):a\in\mathcal{A},\ (\underline{d},\underline{i})\in\mathbb{S}_{R}\} be defined as

λη,R,C(a|d¯,i¯)=νη,R,C(d¯,i¯,a)μη,R,C(d¯,i¯)(d¯,i¯)𝕊R,a𝒜.\lambda_{\eta,R,C}(a|\underline{d},\underline{i})=\frac{\nu_{\eta,R,C}(\underline{d},\underline{i},a)}{\mu_{\eta,R,C}(\underline{d},\underline{i})}\quad(\underline{d},\underline{i})\in\mathbb{S}_{R},\ a\in\mathcal{A}. (35)

Notice that for all (d¯,i¯)a=1K𝕊R,a(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\ \mathbb{S}_{R,a^{\prime}},

λη,R,C(a|d¯,i¯)ηKμC,Runif(d¯,i¯)\displaystyle\lambda_{\eta,R,C}(a|\underline{d},\underline{i})\geq\frac{\eta}{K}\ \mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i}) ηKμRmin,\displaystyle\geq\frac{\eta}{K}\ \mu_{R}^{\textsf{min}}, (36)

where

μRminminC𝒞min(d¯,i¯)𝕊RμC,Runif(d¯,i¯)>0.\mu_{R}^{\textsf{min}}\coloneqq\min_{C\in\mathcal{C}}\ \min_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\mu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i})>0. (37)

That is, whenever the delay of each arm is <R<R, the distribution λη,R,C(|d¯,i¯)\lambda_{\eta,R,C}(\cdot|\underline{d},\underline{i}) puts a strictly positive mass on each of the arms. Using the arm selection rule in (35) instead of the uniform sampling rule and following the proof template in Appendix B, it can be shown that the policy πλη,R,C\pi^{\lambda_{\eta,R,C}}, which selects the arms at each time instant according to the rule in (35), renders the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} ergodic. We now claim that the stationary distribution of the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} under πλη,R,C\pi^{\lambda_{\eta,R,C}} is μη,R,C\mu_{\eta,R,C}. Indeed, suppose QC,R={QC,R(d¯,i¯|d¯,i¯):(d¯,i¯),(d¯,i¯)𝕊R}Q^{C,R}=\{Q^{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i}):(\underline{d}^{\prime},\underline{i}^{\prime}),(\underline{d},\underline{i})\in\mathbb{S}_{R}\} denotes the transition probability matrix of the Markov process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} under the policy πλη,R,C\pi^{\lambda_{\eta,R,C}}. Then,

QC,R(d¯,i¯|d¯,i¯)\displaystyle Q^{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i}) =a=1Kλη,R,C(a|d¯,i¯)QC,R(d¯,i¯|d¯,i¯,a),\displaystyle=\sum_{a=1}^{K}\ \lambda_{\eta,R,C}(a|\underline{d},\underline{i})\ Q_{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a), (38)

from which it follows that for all (d¯,i¯)𝕊R(\underline{d}^{\prime},\underline{i}^{\prime})\in\mathbb{S}_{R},

(d¯,i¯)𝕊Rμη,R,C(d¯,i¯)QC,R(d¯,i¯|d¯,i¯)\displaystyle\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \mu_{\eta,R,C}(\underline{d},\underline{i})\ Q^{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i}) =(d¯,i¯)𝕊Ra=1Kμη,R,C(d¯,i¯)λη,R,C(a|d¯,i¯)QC,R(d¯,i¯|d¯,i¯,a)\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \mu_{\eta,R,C}(\underline{d},\underline{i})\ \lambda_{\eta,R,C}(a|\underline{d},\underline{i})\ Q_{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a)
=(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)QC,R(d¯,i¯|d¯,i¯,a)\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ Q_{C,R}(\underline{d}^{\prime},\underline{i}^{\prime}|\underline{d},\underline{i},a)
=(a)a=1Kνη,R,C(d¯,i¯,a)\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d}^{\prime},\underline{i}^{\prime},a)
=μη,R,C(d¯,i¯),\displaystyle=\mu_{\eta,R,C}(\underline{d}^{\prime},\underline{i}^{\prime}), (39)

thus proving the claim. In the above set of equations, (a)(a) follows from the fact that νη,R,C\nu_{\eta,R,C} satisfies (28).
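A short Python sketch of the construction in (33)-(35) is given below, assuming that the two occupancy measures entering the mixture are available as dictionaries keyed by (d, i, a); the names are illustrative.

def mixture_sampling_rule(nu_unif, nu_star, eta):
    # nu_eta per (33), its marginal mu per (34), and the conditional distribution lambda per (35).
    nu_eta = {key: eta * nu_unif[key] + (1 - eta) * nu_star.get(key, 0.0)
              for key in nu_unif}
    mu = {}
    for (d, i, a), val in nu_eta.items():
        mu[(d, i)] = mu.get((d, i), 0.0) + val
    lam = {key: val / mu[key[:2]] for key, val in nu_eta.items()}
    return nu_eta, mu, lam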

Remark 1

Observe that νη,R,C\nu_{\eta,R,C} in (33) is a mixture of two terms, one arising from the uniform arm selection rule under maximum delay constraints (the term corresponding to νC,Runif\nu_{C,R}^{\textsf{unif}}), and the other arising from the optimal solution to the finite-dimensional LP under a constraint on the delay of each arm (the term corresponding to νC,R\nu_{C,R}^{\star}). A similar, trembling hand-based mixture term appears in the works [6, 7]. While the mixtures in [6, 7] arise from a restrictive system model that forces each arm to be selected with a strictly positive probability, the mixture in (33) ensures that each arm is selected with a strictly positive probability (except when the delay of an arm is equal to the maximum allowed delay RR in which case it is forcibly selected) without any restrictions on the system model. It is also worth noting that the mixtures in [6, 7] give rise to a conditional probability distribution on the arms, conditioned on the arm delays and the last observed states (a quantity akin to λ(a|d¯,i¯\lambda(a|\underline{d},\underline{i})), whereas the mixture in (33) gives rise to a joint probability distribution on the set 𝕊R×𝒜\mathbb{S}_{R}\times\mathcal{A} (a quantity akin to ν(d¯,i¯,a)\nu(\underline{d},\underline{i},a)).

Our policy, which we call RR-Delay-Constrained-Restless-BAI (RR-DCR-BAI for short) and denote by π(L,η,R)\pi^{\star}(L,\eta,R), is as follows. Here, L>1L>1, η(0,1]\eta\in(0,1], and R(K,)R\in\mathbb{N}\cap(K,\infty) are parameters of the policy. As we shall see, the parameter LL controls the policy’s error probability; in fact, we will see from Lemma 3 that if we set L=1/ϵL=1/\epsilon, the error probability is bounded above by ϵ\epsilon.

Policy RR-DCR-BAI / π(L,η,R)\pi^{\star}(L,\eta,R):
Fix L>1L>1, η(0,1]\eta\in(0,1], and R(K,)R\in\mathbb{N}\cap(K,\infty). Assume that A0=1A_{0}=1, A1=2A_{1}=2, …, AK1=KA_{K-1}=K. Let

MCπ(L,η,R)(n)=minCAlt(C)ZCCπ(L,η,R)(n),C𝒞.M_{C}^{\pi^{\star}(L,\eta,R)}(n)=\min_{C^{\prime}\in\textsf{Alt}(C)}Z_{CC^{\prime}}^{\pi^{\star}(L,\eta,R)}(n),\quad C\in\mathcal{C}.

Implement the following steps for all nKn\geq K.
(1) Compute C¯(n)argmaxC𝒞MCπ(L,η,R)(n)\bar{C}(n)\in\arg\max_{C\in\mathcal{C}}\ M_{C}^{\pi^{\star}(L,\eta,R)}(n). Resolve ties uniformly at random.
(2) If MC¯(n)π(L,η,R)(n)log(L(K1)(K1)!)M^{\pi^{\star}(L,\eta,R)}_{\bar{C}(n)}(n)\geq\log(L(K-1)(K-1)!), stop and declare the index of the best arm in C¯(n)\bar{C}(n).
(3) If MC¯(n)π(L,η,R)(n)<log(L(K1)(K1)!)M^{\pi^{\star}(L,\eta,R)}_{\bar{C}(n)}(n)<\log(L(K-1)(K-1)!), select arm AnA_{n} according to the distribution

Pπ(L,η,R)(An=a|A0:n1,X¯0:n1)=λη,R,C¯(n)(a|d¯(n),i¯(n)).P^{\pi^{\star}(L,\eta,R)}(A_{n}=a|A_{0:n-1},\bar{X}_{0:n-1})=\lambda_{\eta,R,\bar{C}(n)}(a|\underline{d}(n),\underline{i}(n)).

  In item (1) in RR-DCR-BAI, C¯(n)\bar{C}(n) denotes the estimate of the underlying assignment of the TPMs based on all the controls (arm selections) and observations seen up to time nn. If the LLR between C¯(n)\bar{C}(n) and its nearest alternative assignment of the TPMs exceeds a certain threshold (i.e., log(L(K1)(K1)!)\geq\log(L(K-1)(K-1)!)), then the policy is sufficiently confident that C¯(n)\bar{C}(n) is indeed the underlying assignment of the TPMs, and therefore stops and declares the index of the best arm in C¯(n)\bar{C}(n). Else, it samples the next arm based on the value of (d¯(n),i¯(n))(\underline{d}(n),\underline{i}(n)) according to the distribution λη,R,C¯(n)(|d¯(n),i¯(n))\lambda_{\eta,R,\bar{C}(n)}(\cdot|\underline{d}(n),\underline{i}(n)).
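The following Python sketch outlines the main loop of RR-DCR-BAI, i.e., steps (1)-(3) above. The helper callables min_llr, sampling_rule, and pull_arm are hypothetical stand-ins for, respectively, the computation of the minimum LLR over the alternative assignments, the arm-selection distribution in (35) (with the parameters η and R baked in), and the act of selecting an arm and observing its state; the sketch illustrates the structure of the policy and is not the authors' implementation.

import math
import numpy as np

def r_dcr_bai(assignments, L, K, min_llr, sampling_rule, pull_arm):
    # assignments: the finite set of candidate TPM assignments C.
    # min_llr(C, history): min over C' in Alt(C) of Z_{CC'}(n) given the history so far.
    # sampling_rule(C, d, i): length-K probability vector lambda_{eta,R,C}(.|d, i).
    # pull_arm(a): select arm a and return its observed state.
    threshold = math.log(L * (K - 1) * math.factorial(K - 1))
    i = [pull_arm(a) for a in range(K)]          # round-robin phase: one pull per arm
    d = list(range(K, 0, -1))                    # delays (K, K-1, ..., 1) at time t = K
    history = list(enumerate(i))
    while True:
        scores = {C: min_llr(C, history) for C in assignments}
        C_bar = max(scores, key=scores.get)      # step (1): current estimate (ties broken arbitrarily)
        if scores[C_bar] >= threshold:
            return C_bar                         # step (2): stop; the best arm is read off C_bar
        lam = sampling_rule(C_bar, d, i)         # step (3): lambda forces arm a whenever d[a] == R
        arm = int(np.random.choice(K, p=lam))
        obs = pull_arm(arm)
        history.append((arm, obs))
        d = [1 if a == arm else d[a] + 1 for a in range(K)]
        i = [obs if a == arm else i[a] for a in range(K)]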

In the following section, we demonstrate that for a suitable choice of LL, the policy RR-DCR-BAI achieves the desired error probability. Further, letting LL\to\infty, we show that the growth rate of its expected stopping time satisfies an asymptotic upper bound that is arbitrarily close to 1/TR(C)1/T_{R}^{\star}(C) under the assignment of the TPMs CC for a suitable choice of η\eta.

VII Results on the Performance of Policy RR-DCR-BAI

This section is organised as follows. In Section VII-A, we establish that for a given C𝒞C\in\mathcal{C} and any CAlt(C)C^{\prime}\in\textsf{Alt}(C), the LLR ZCCπ(n)Z_{CC^{\prime}}^{\pi}(n) has a strictly positive drift almost surely under the assignment of the TPMs CC (Lemma 2), and therefore the policy RR-DCR-BAI stops in finite time almost surely. In Section VII-B, we show that RR-DCR-BAI satisfies any desired error probability for a suitable choice of LL (Lemma 3). In Section VII-C, we strengthen the result of Section VII-A by showing that the asymptotic drift of the LLR of CC with respect to its nearest alternative CAlt(C)C^{\prime}\in\textsf{Alt}(C) is a certain constant that, in the limit as η0\eta\downarrow 0, converges to TR(C)T_{R}^{\star}(C) (Proposition 2). In Section VII-D, we show that the stopping time of RR-DCR-BAI grows without bound almost surely as LL\to\infty (Lemma 4). In Section VII-E, we derive an almost sure upper bound on the stopping time of RR-DCR-BAI that, in the limit as η0\eta\downarrow 0, converges to 1/TR(C)1/T_{R}^{\star}(C). In Section VII-F, we show that the expected stopping time of RR-DCR-BAI satisfies the same upper bound as that derived in Section VII-E. This is based on (a) showing that the family {τ(π(L,η,R))/logL:L>1}\{\tau(\pi^{\star}(L,\eta,R))/\log L:L>1\} is uniformly integrable, and (b) combining the almost sure upper bound with the uniform integrability result to obtain an upper bound in expectation. The proofs of all the results are relegated to the appendices.

VII-A Strictly Positive Drift for the LLRs

Let πNS(L,η,R)\pi_{\textsf{NS}}^{\star}(L,\eta,R) denote a version of RR-DCR-BAI that never stops, i.e., it omits the stopping rule in step (2) and executes the sampling rule in step (3) indefinitely. We now show that under πNS(L,η,R)\pi_{\textsf{NS}}^{\star}(L,\eta,R), the LLRs have a strictly positive drift as the number of rounds of arm selection nn\to\infty.

Lemma 2

Fix L>1L>1, η(0,1]\eta\in(0,1], R(K,)R\in\mathbb{N}\cap(K,\infty), and C𝒞C\in\mathcal{C}. Under the assignment of the TPMs CC and the policy π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R),

lim infnZCCπ(n)n>0almost surelyfor all CAlt(C).\liminf_{n\to\infty}\frac{Z_{CC^{\prime}}^{\pi}(n)}{n}>0\quad\text{almost surely}\quad\text{for all }C^{\prime}\in\textsf{Alt}(C). (40)
Proof:

See Appendix C. ∎

Lemma 2 asserts that under the assignment of the TPMs CC, we have

lim infnMCπ(n)n>0almost surely\liminf_{n\to\infty}\frac{M_{C}^{\pi}(n)}{n}>0\quad\text{almost surely} (41)

when π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R). This means that MCπ(n)log(L(K1)(K1)!)M_{C}^{\pi}(n)\geq\log(L(K-1)(K-1)!) for all nn large, almost surely. This proves that RR-DCR-BAI stops in finite time with probability 11.

VII-B Desired Error Probability

In this section, we show that for an appropriate choice of the parameter LL, the policy RR-DCR-BAI achieves any desired error probability.

Lemma 3

Fix an error probability threshold ϵ>0\epsilon>0. If L=1/ϵL=1/\epsilon, then π(L,η,R)Π(ϵ)\pi^{\star}(L,\eta,R)\in\Pi(\epsilon) for all η(0,1]\eta\in(0,1] and R(K,)R\in\mathbb{N}\cap(K,\infty). Here, Π(ϵ)\Pi(\epsilon) is as defined in (5).

Proof:

The proof uses the fact that the policy stops in finite time almost surely, and is given in Appendix D. ∎

VII-C The Correct Asymptotic Drift of the LLRs

In this section, we strengthen the result of Section VII-A by showing that under the constraint that the delay of each arm is at most RR, the asymptotic drift of the LLRs is arbitrarily close to TR(C)T_{R}^{\star}(C) for sufficiently small η\eta.

Proposition 2

Fix L>1L>1, η(0,1]\eta\in(0,1], R(K,)R\in\mathbb{N}\cap(K,\infty), and C𝒞C\in\mathcal{C}. Consider the policy π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R). Under the assignment of the TPMs CC, for all CAlt(C)C^{\prime}\in\textsf{Alt}(C),

limnZCCπ(n)n=(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))almost surely.\lim_{n\to\infty}\frac{Z_{CC^{\prime}}^{\pi}(n)}{n}=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\quad\text{almost surely}. (42)

Consequently, it follows that

limnMCπ(n)n=minCAlt(C)(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))almost surely.\displaystyle\lim_{n\to\infty}\frac{M_{C}^{\pi}(n)}{n}=\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\quad\text{almost surely}. (43)
Proof:

See Appendix E. ∎

From (33), we note that the right hand side of (43) may be lower bounded by

ηminCAlt(C)(d¯,i¯)𝕊Ra=1KνC,Runif(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))\displaystyle\eta\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
+(1η)minCAlt(C)(d¯,i¯)𝕊Ra=1KνC,R(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))\displaystyle\hskip 28.45274pt+(1-\eta)\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{C,R}^{\star}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
=ηminCAlt(C)(d¯,i¯)𝕊Ra=1KνC,Runif(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))+(1η)TR(C),\displaystyle=\eta\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))+(1-\eta)\ T_{R}^{\star}(C), (44)

which, as η0\eta\downarrow 0, converges to TR(C)T_{R}^{\star}(C). Using this observation, we shall show later that our policy achieves an upper bound of 1/TR(C)1/T_{R}^{\star}(C) in the limit as η0\eta\downarrow 0.
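To illustrate how expressions of the form (42)-(44) can be evaluated numerically, the following Python sketch (an illustration under assumed data layouts, not part of the policy) computes \min_{C^{\prime}\in\textsf{Alt}(C)}\sum_{(\underline{d},\underline{i})}\sum_{a}\nu(\underline{d},\underline{i},a)\,D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})) for a given occupancy measure \nu, using matrix powers for the d_{a}-step transition rows. With \nu=\nu_{C,R}^{\textsf{unif}} or \nu=\nu_{C,R}^{\star}, this evaluates the two terms appearing in (44).

import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between two probability vectors (small eps guards against log 0)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def weighted_kl_objective(nu, C, C_alt):
    # nu      : dict mapping ((d_1,...,d_K), (i_1,...,i_K), a) -> probability mass
    # C, C_alt: lists of K transition matrices (one per arm) for the two assignments
    total = 0.0
    for (d, i, a), mass in nu.items():
        if mass == 0.0:
            continue
        Pd = np.linalg.matrix_power(C[a], d[a])       # (P_C^a)^{d_a}
        Qd = np.linalg.matrix_power(C_alt[a], d[a])   # (P_{C'}^a)^{d_a}
        total += mass * kl(Pd[i[a]], Qd[i[a]])
    return total

def drift_lower_term(nu, C, alternatives):
    # min over alternative assignments C' of the weighted-KL objective
    return min(weighted_kl_objective(nu, C, C_alt) for C_alt in alternatives)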

VII-D Asymptotic Growth of Stopping Time

In this section, we demonstrate that τ(π(L,η,R))\tau(\pi^{\star}(L,\eta,R)) grows without bound as LL\to\infty (equivalently, as ϵ0\epsilon\downarrow 0).

Lemma 4

Fix η(0,1]\eta\in(0,1], R(K,)R\in\mathbb{N}\cap(K,\infty), and C𝒞C\in\mathcal{C}. Under the assignment of the TPMs CC,

lim infLτ(π(L,η,R))= almost surely.\liminf_{L\to\infty}\tau(\pi^{\star}(L,\eta,R))=\infty\text{ almost surely.} (45)
Proof:

See Appendix F. ∎

Combining Lemma 4 with Proposition 2, we get that under the assignment of the TPMs CC and under π=π(L,η,R)\pi=\pi^{\star}(L,\eta,R),

\lim_{L\to\infty}\ \frac{M_{C}^{\pi}(\tau(\pi))}{\tau(\pi)}=\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\quad\text{almost surely}. (46)

VII-E Almost Sure Asymptotic Upper Bound on the Stopping Time

In this section, we derive an almost sure asymptotic upper bound on the stopping time of RR-DCR-BAI as LL\to\infty, and show that this upper bound is arbitrarily close to 1/TR(C)1/T_{R}^{\star}(C) under the assignment of the TPMs CC. In the next section, we combine the almost sure upper bound of this section with a certain uniform integrability result to claim that the expected stopping time of RR-DCR-BAI satisfies the same upper bound as that derived in this section.

Lemma 5

Fix η(0,1]\eta\in(0,1], R(K,)R\in\mathbb{N}\cap(K,\infty), and C𝒞C\in\mathcal{C}. Under the assignment of the TPMs CC and under the policy π=π(L,η,R)\pi=\pi^{*}(L,\eta,R),

lim supLτ(π)logL\displaystyle\limsup_{L\to\infty}\,\frac{\tau(\pi)}{\log L} 1minCAlt(C)(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))\displaystyle\leq\frac{1}{\min\limits_{C^{\prime}\in\textsf{Alt}(C)}\ \sum\limits_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum\limits_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))}
1ηTRunif(C)+(1η)TR(C)almost surely,\displaystyle\leq\frac{1}{\eta\,T_{R}^{\textsf{unif}}(C)+(1-\eta)\ T_{R}^{\star}(C)}\quad\text{almost surely}, (47)

where

TRunif(C)minCAlt(C)(d¯,i¯)𝕊Ra=1KνC,Runif(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia)).T_{R}^{\textsf{unif}}(C)\coloneqq\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})).
Proof:

See Appendix G. ∎

VII-F Asymptotic Upper Bound on the Expected Stopping Time

In this section, we show that the expected value of the stopping time of policy RR-DCR-BAI satisfies an asymptotic upper bound that matches with the right hand side of (47) as LL\to\infty.

Proposition 3

Fix η(0,1]\eta\in(0,1], R(K,)R\in\mathbb{N}\cap(K,\infty), and C𝒞C\in\mathcal{C}. Under the assignment of the TPMs CC, the policy RR-DCR-BAI satisfies

lim supL𝔼Cπ[τ(π)]logL1ηTRunif(C)+(1η)TR(C).\limsup_{L\to\infty}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log L}\leq\frac{1}{\eta\,T_{R}^{\textsf{unif}}(C)+(1-\eta)\ T_{R}^{\star}(C)}. (48)
Proof:

In the proof, which we provide in Appendix H, we first show that the family {τ(π)/logL:L>1}\{\tau(\pi)/\log L:L>1\} is uniformly integrable. Combining (47) with the uniform integrability result yields (48). ∎

VIII A Key Monotonicity Property and the Main Result

In this section, we establish a key monotonicity property for TR(C)T_{R}^{\star}(C), which is that TR(C)TR(C)T_{R}^{\star}(C)\leq T_{R^{\prime}}^{\star}(C) for all R<RR<R^{\prime}. This, combined with the fact that TR(C)T(C)T_{R}^{\star}(C)\leq T^{\star}(C), implies that limRTR(C)\lim_{R\to\infty}\ T_{R}^{\star}(C) exists. We conclude the section by stating the main result of the paper.

VIII-A A Key Monotonicity Property

The below result asserts that TR(C)T_{R}^{\star}(C) is monotonically non-decreasing as RR increases.

Lemma 6

TR(C)TR+1(C)T_{R}^{\star}(C)\leq T_{R+1}^{\star}(C) for all R(K,)R\in\mathbb{N}\cap(K,\infty).

Proof:

Fix R(K,)R\in\mathbb{N}\cap(K,\infty). The key idea behind the proof is to note that (a) 𝕊R𝕊R+1\mathbb{S}_{R}\subset\mathbb{S}_{R+1}, and (b) any ν\nu that satisfies (28)-(31) with parameter RR also satisfies them with parameter R+1R+1. The details follow. Let νC,R\nu_{C,R}^{\star} and νC,R+1\nu_{C,R+1}^{\star} be the optimal state-action measures when the arm delays are constrained to no more than RR and R+1R+1 respectively. Note that both νC,R\nu_{C,R}^{\star} and νC,R+1\nu_{C,R+1}^{\star} satisfy (28)-(31) (with the corresponding parameters RR and R+1R+1). Further, νC,R\nu_{C,R}^{\star} satisfies (28)-(31) with parameter R+1R+1. Define ν~C,R+1\widetilde{\nu}_{C,R+1} as

ν~C,R+1(d¯,i¯,a){νC,R(d¯,i¯,a) if (d¯,i¯)𝕊R,0, otherwise.\displaystyle\widetilde{\nu}_{C,R+1}(\underline{d},\underline{i},a)\coloneqq\begin{cases}\nu_{C,R}^{\star}(\underline{d},\underline{i},a)&\text{ if }(\underline{d},\underline{i})\in\mathbb{S}_{R},\\ 0,&\text{ otherwise}.\\ \end{cases}

Clearly, ν~C,R+1\widetilde{\nu}_{C,R+1} satisfies (28)-(31) with parameter R+1R+1. We therefore have

TR+1(C)\displaystyle T_{R+1}^{\star}(C) =minCAlt(C)(d¯,i¯)𝕊R+1a=1KνC,R+1(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\displaystyle=\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R+1}}\ \sum_{a=1}^{K}\ \nu_{C,R+1}^{\star}(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
(a)minCAlt(C)(d¯,i¯)𝕊R+1a=1Kν~C,R+1(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\displaystyle\stackrel{{\scriptstyle(a)}}{{\geq}}\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R+1}}\ \sum_{a=1}^{K}\ \widetilde{\nu}_{C,R+1}(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
=minCAlt(C)(d¯,i¯)𝕊Ra=1Kν~C,R+1(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\displaystyle=\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \widetilde{\nu}_{C,R+1}(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
=minCAlt(C)(d¯,i¯)𝕊Ra=1KνC,R(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\displaystyle=\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ {\nu}_{C,R}^{\star}(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
=TR(C),\displaystyle=T_{R}^{\star}(C),

thus establishing the desired result. ∎

Lemma 6, in conjunction with the observation that TR(C)T(C)T_{R}^{\star}(C)\leq T^{\star}(C) for all RR, implies that limRTR(C)\lim_{R\to\infty}T_{R}^{\star}(C) exists and limRTR(C)T(C)\lim_{R\to\infty}T_{R}^{\star}(C)\leq T^{\star}(C). Whether this inequality is, in general, an equality appears difficult to resolve; we discuss this further in Section IX.

VIII-B Main Result

We are now ready to state the main result of the paper.

Theorem 4

Consider a multi-armed bandit with K2K\geq 2 arms in which each arm is a time homogeneous and ergodic discrete-time Markov process on the finite state space 𝒮\mathcal{S}. Given TPMs P1,,PKP_{1},\ldots,P_{K} and a permutation σ:{1,,K}{1,,K}\sigma:\{1,\ldots,K\}\to\{1,\ldots,K\}, let C=(Pσ(1),,Pσ(K))C=(P_{\sigma(1)},\ldots,P_{\sigma(K)}) be the underlying assignment of the TPMs where Pσ(a)P_{\sigma(a)} denotes the TPM of arm aa. The growth rate of the expected time required to find the best arm in CC satisfies the lower bound

lim infϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ)1T(C).\displaystyle\liminf_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}\geq\frac{1}{T^{\star}(C)}. (49)

Further, given any ϵ>0\epsilon>0, the policy π(1/ϵ,η,R)Π(ϵ)\pi^{\star}(1/\epsilon,\eta,R)\in\Pi(\epsilon) for all η(0,1]\eta\in(0,1] and R(K,)R\in\mathbb{N}\cap(K,\infty). Additionally,

lim supRlim supη0lim supL𝔼Cπ(L,η,R)[τ(π(L,η,R))]logL1limRTR(C),\displaystyle\limsup_{R\to\infty}\ \limsup_{\eta\downarrow 0}\ \limsup_{L\to\infty}\ \frac{\mathbb{E}_{C}^{\pi^{\star}(L,\eta,R)}[\tau(\pi^{\star}(L,\eta,R))]}{\log L}\leq\frac{1}{\lim_{R\to\infty}\ T_{R}^{\star}(C)}, (50)

thereby yielding

1T(C)\displaystyle\frac{1}{T^{\star}(C)} lim infϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ)\displaystyle\leq\liminf_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}
lim supϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ)\displaystyle\leq\limsup_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}
lim supRlim supη0lim supL𝔼Cπ(L,η,R)[τ(π(L,η,R))]logL1limRTR(C).\displaystyle\leq\limsup_{R\to\infty}\ \limsup_{\eta\downarrow 0}\ \limsup_{L\to\infty}\ \frac{\mathbb{E}_{C}^{\pi^{\star}(L,\eta,R)}[\tau(\pi^{\star}(L,\eta,R))]}{\log L}\leq\frac{1}{\lim_{R\to\infty}\ T_{R}^{\star}(C)}. (51)

Thus, the lower bound on the growth rate of the expected stopping time is 1/T(C)1/T^{\star}(C), and the upper bound is 1/(limRTR(C))1/(\lim_{R\to\infty}\ T_{R}^{\star}(C)).

Proof:

The asymptotic lower bound in (49) follows from Proposition 1. From Lemma 3, we know that for any ϵ>0\epsilon>0, the policy π(1/ϵ,η,R)Π(ϵ)\pi^{\star}(1/\epsilon,\eta,R)\in\Pi(\epsilon) for all η(0,1]\eta\in(0,1] and R(K,)R\in\mathbb{N}\cap(K,\infty). Therefore, it follows that

\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log\left(1/\epsilon\right)}\leq\frac{\mathbb{E}_{C}^{\pi^{\star}(1/\epsilon,\eta,R)}[\tau(\pi^{\star}(1/\epsilon,\eta,R))]}{\log(1/\epsilon)}. (52)

Fixing η\eta, RR, and letting ϵ0\epsilon\downarrow 0 (or equivalently, substituting L=1/ϵL=1/\epsilon and letting LL\to\infty) in (52), and using the upper bound in (48), we get

lim supϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ)lim supL𝔼Cπ(L,η,R)[τ(π(L,η,R))]logL1ηTRunif(C)+(1η)TR(C).\displaystyle\limsup_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}\leq\limsup_{L\to\infty}\frac{\mathbb{E}_{C}^{\pi^{\star}(L,\eta,R)}[\tau(\pi^{\star}(L,\eta,R))]}{\log L}\leq\frac{1}{\eta\ T_{R}^{\textsf{unif}}(C)+(1-\eta)\ T_{R}^{\star}(C)}. (53)

Letting η0\eta\downarrow 0 in (53) and noting that the leftmost term in (53) does not depend on η\eta, we get

lim supϵ0infπΠ(ϵ)𝔼Cπ[τ(π)]log(1/ϵ)lim supη0lim supL𝔼Cπ(L,η,R)[τ(π(L,η,R))]logL1TR(C).\displaystyle\limsup_{\epsilon\downarrow 0}\inf_{\pi\in\Pi(\epsilon)}\frac{\mathbb{E}_{C}^{\pi}[\tau(\pi)]}{\log(1/\epsilon)}\leq\limsup_{\eta\downarrow 0}\ \limsup_{L\to\infty}\frac{\mathbb{E}_{C}^{\pi^{\star}(L,\eta,R)}[\tau(\pi^{\star}(L,\eta,R))]}{\log L}\leq\frac{1}{T_{R}^{\star}(C)}. (54)

Finally, letting RR\to\infty in (54), we arrive at (51). ∎

IX On the Convergence of TR(C)T_{R}^{\star}(C) to T(C)T^{\star}(C) as RR\to\infty

Recall that T(C)T^{\star}(C) is the optimal value of the infinite-dimensional LP in (21), where the supremum in (21) is over all ν\nu satisfying (22)-(24), and TR(C)T_{R}^{\star}(C) is the optimal value of the finite-dimensional LP in (27) that arises when the delay of each arm is constrained to be no more than RR. From our exposition in Section VIII-A, we know that limRTR(C)T(C)\lim_{R\to\infty}T_{R}^{\star}(C)\leq T^{\star}(C). Showing that, in general, this inequality is an equality appears to be difficult. In this section, we show that in the special case when the arm TPMs P1,,PKP_{1},\ldots,P_{K} have identical rows, which is akin to obtaining i.i.d. observations from the arms, we have limRTR(C)=T(C)\lim_{R\to\infty}T_{R}^{\star}(C)=T^{\star}(C), thus leading to matching upper and lower bounds in this special setting.

Assume that the TPMs P1,,PKP_{1},\ldots,P_{K} have identical rows, and suppose that μ1,,μK\mu_{1},\ldots,\mu_{K} are the unique stationary distributions associated with P1,,PKP_{1},\ldots,P_{K} respectively. Then, by the convergence result [27, Theorem 4.9] for finite-state Markov processes, each row of PkP_{k} must be equal to μk\mu_{k}, k=1,,Kk=1,\ldots,K. In this special setting, the below result states that TR(C)=T(C)T_{R}^{\star}(C)=T^{\star}(C) for all R(K,)R\in\mathbb{N}\cap(K,\infty), and therefore limRTR(C)=T(C)\lim_{R\to\infty}T_{R}^{\star}(C)=T^{\star}(C).

Lemma 7

Suppose each row of PkP_{k} is equal to μk\mu_{k}, k=1,,Kk=1,\ldots,K. In this special setting, T(C)=TR(C)T^{\star}(C)=T_{R}^{\star}(C) for all R(K,)R\in\mathbb{N}\cap(K,\infty). Consequently, limRTR(C)=T(C)\lim_{R\to\infty}T_{R}^{\star}(C)=T^{\star}(C).

Proof:

The proof uses the key idea that for a given C𝒞C\in\mathcal{C} and for all dd\in\mathbb{N}, i𝒮i\in\mathcal{S}, and CAlt(C)C^{\prime}\in\textsf{Alt}(C),

DKL((PCa)d(|i)(PCa)d(|i))=DKL(μCaμCa),D_{\textsf{KL}}((P_{C}^{a})^{d}(\cdot|i)\|(P_{C^{\prime}}^{a})^{d}(\cdot|i))=D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a}),

where μCa\mu_{C}^{a} denotes the stationary distribution associated with the TPM PCaP_{C}^{a}. The complete proof of Lemma 7 is given in Appendix I. ∎
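As a quick numerical check of this identity: when every row of a TPM equals its stationary distribution, every d-step row equals that distribution as well, so the d-step KL divergence reduces to D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a}). The sketch below, with hypothetical distributions, verifies this.

import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

mu_C  = np.array([0.6, 0.3, 0.1])   # hypothetical stationary distribution under C
mu_Cp = np.array([0.2, 0.5, 0.3])   # hypothetical stationary distribution under C'
P_C  = np.tile(mu_C,  (3, 1))       # TPM whose rows all equal mu_C
P_Cp = np.tile(mu_Cp, (3, 1))       # TPM whose rows all equal mu_Cp

for d in (1, 2, 5):
    Pd, Qd = np.linalg.matrix_power(P_C, d), np.linalg.matrix_power(P_Cp, d)
    for i in range(3):
        assert np.isclose(kl(Pd[i], Qd[i]), kl(mu_C, mu_Cp))   # identity of Lemma 7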

As a consequence of Lemma 7, the common expressions for T(C)T^{\star}(C) and TR(C)T_{R}^{\star}(C) for all R(K,)R\in\mathbb{N}\cap(K,\infty) specialise to the lower bounds of [5, 3] for the case when the i.i.d. observations from each arm come from a finite alphabet, thus enabling us to recover the results of [5, 3]. For general arm TPMs with non-identical rows, we leave open the question of whether limRTR(C)\lim_{R\to\infty}T_{R}^{\star}(C) equals T(C)T^{\star}(C) for future study.

X Concluding Remarks and Discussion

  1. 1.

    We studied the problem of finding the best arm in a restless Markov multi-armed bandit as quickly as possible, subject to an upper bound on the error probability. For this optimal stopping problem, we showed that under the problem instance CC, the expected time required to find the best arm with an error probability no more than ϵ\epsilon is lower bounded by log(1/ϵ)/T(C)\log(1/\epsilon)/T^{\star}(C) in the limit as ϵ0\epsilon\downarrow 0 (converse). Here, T(C)T^{\star}(C) is a problem-instance dependent constant that captures the hardness of the problem. We also devised a policy that, for an input parameter R(K,)R\in\mathbb{N}\cap(K,\infty), forcibly selects an arm which has not been selected for RR consecutive time instants, and finds the best arm in at most log(1/ϵ)/TR(C)\log(1/\epsilon)/T_{R}^{\star}(C) time instants on the average as ϵ0\epsilon\downarrow 0 (achievability).

  2. 2.

    We showed that TR(C)T_{R}^{\star}(C) is monotonically non-decreasing in RR, and that limRTR(C)T(C)\lim_{R\to\infty}T_{R}^{\star}(C)\leq T^{\star}(C). Showing that, in general, this inequality is an equality appears to be a difficult problem and remains open. Notwithstanding this, we showed that in the special case when the TPM of each arm has identical rows (which is akin to obtaining i.i.d. observations from each arm), the above inequality is indeed an equality. We were thus able to recover the results of [5, 3] for the case of arms with a common, finite alphabet.

  3. 3.

    The trembling hand-based policy of [6] is not practically implementable because it operates on the countable set 𝕊\mathbb{S} of all delays and last observed states which cannot be handled on a machine with finite-size memory. However, for any given R(K,)R\in\mathbb{N}\cap(K,\infty), our policy operates on the finite set 𝕊R\mathbb{S}_{R} which can easily be stored in finite-size memory on a machine, thereby making it practically implementable.

  4. 4.

    In our achievability analysis, we assumed that the initial state of each arm follows a certain distribution ϕ\phi that is independent of the underlying assignment of the TPMs. However, this may not actually be the case. For instance, if the Markov process of each arm has evolved for a sufficiently long time and reached stationarity before the decision entity begins sampling the arms at t=0t=0, then the initial state of each arm follows the arm’s stationary distribution. This will lead to a mismatch between the LLR expressions in our work (resulting from ϕ\phi) and the actual LLR expressions (resulting from the stationary distributions). Suppose Z¯CCπ(n)\bar{Z}_{CC^{\prime}}^{\pi}(n) denotes the analogue of (18) resulting from using the stationary distributions in place of ϕ\phi. Then, fixing C𝒞C\in\mathcal{C}, it can be shown that

    limnZCCπ(n)nZ¯CCπ(n)n=0.\lim_{n\to\infty}\frac{Z_{CC^{\prime}}^{\pi}(n)}{n}-\frac{\bar{Z}_{CC^{\prime}}^{\pi}(n)}{n}=0.

    That is, the asymptotic drift of ZCCπ(n)Z_{CC^{\prime}}^{\pi}(n) is identical to that of Z¯CCπ(n)\bar{Z}_{CC^{\prime}}^{\pi}(n); this is because the two LLRs differ only in the initial-state terms, which are bounded and hence vanish upon normalisation by nn. Consequently, the assumption X0aϕX_{0}^{a}\sim\phi does not affect the asymptotic analysis in any way.

  5. 5.

    It will be interesting to extend the results of our paper to the case when the arm TPMs P1,,PKP_{1},\ldots,P_{K} are not known to the decision entity beforehand. The difficulty here is that for any given C𝒞C\in\mathcal{C}, the set of alternatives Alt(C)\textsf{Alt}(C) is uncountably infinite. Also, the arm TPMs must be estimated on the fly using the observations from the arms. In this case, showing that the TPM estimates converge to their true values is the key challenge. We leave the exploration of these issues for future work.

  6. 6.

    The function ff appearing in the definition of the best arm in (3), also appears implicitly in the analyses of the lower and the upper bounds wherever one evaluates Alt(C)\textsf{Alt}(C) for any given CC. A more realistic setting where we anticipate that ff will appear explicitly in the analyses of the lower and the upper bounds is one in which the decision entity only observes Yta=f(Xta)Y_{t}^{a}=f(X_{t}^{a}) and not the underlying state XtaX_{t}^{a} of arm aa at time tt, i.e., the arms yield hidden Markov observations. More generally, suppose that Yta|XtaPa(|Xta)Y_{t}^{a}|X_{t}^{a}\sim P^{a}(\cdot|X_{t}^{a}) for some conditional probability distribution PaP^{a}, a𝒜a\in\mathcal{A}. Because {Yta:t0}\{Y_{t}^{a}:t\geq 0\} is not a Markov process in general, the analyses of the lower and the upper bounds in this setting appear to be quite challenging. It will be interesting to explore this setting in more detail.

Acknowledgements

The authors are supported by a Singapore National Research Foundation Fellowship (under grant number R-263-000-D02-281) and by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-RP-2020-018).

Appendix A Proof of Proposition 1

It suffices to prove (20) for all π\pi such that 𝔼Cπ[τ(π)]<\mathbb{E}_{C}^{\pi}[\tau(\pi)]<\infty, as (20) trivially holds when 𝔼Cπ[τ(π)]=\mathbb{E}_{C}^{\pi}[\tau(\pi)]=\infty. This proof is organised as follows. In Section A-A, we derive a change-of-measure result that is the analogue of [5, Lemma 18] for the setting of restless arms (see (56)). Using the change-of-measure result together with [5, Lemma 19], we derive in Section A-B a lower bound for the expected LLR in terms of the error probability. In Section A-C, we derive an upper bound for the expected LLR in terms of the expected stopping time (see (73)). Combining the lower bound of Section A-B and the upper bound of Section A-C, and letting the error probability vanish, we arrive at the lower bound (20).

A-A A Change-of-Measure Result for Restless Arms

The following change-of-measure result is the analogue of [5, Lemma 18] for the setting of restless arms. The proof technique is along the lines of the proof of [5, Lemma 18].

Lemma 8

Fix C,C𝒞C,C^{\prime}\in\mathcal{C}. Given a policy π\pi with stopping time τ(π)\tau(\pi) such that PCπ(τ(π)<)=1P_{C}^{\pi}(\tau(\pi)<\infty)=1, PCπ(τ(π)<)=1P_{C^{\prime}}^{\pi}(\tau(\pi)<\infty)=1, let

τ(π){E:E{τ(π)=t}t for all t0},\mathcal{F}_{\tau(\pi)}\coloneqq\{E\in\mathcal{F}:E\cap\{\tau(\pi)=t\}\in\mathcal{F}_{t}\text{ for all }t\geq 0\}, (55)

where {t:t0}\{\mathcal{F}_{t}:t\geq 0\} is as defined in (4). Then,

PCπ(E)=𝔼Cπ[𝕀Eexp(ZCCπ(τ(π)))],Eτ(π).P_{C^{\prime}}^{\pi}(E)=\mathbb{E}_{C}^{\pi}\big{[}\mathbb{I}_{E}\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(\tau(\pi))\big{)}\big{]},\quad E\in\mathcal{F}_{\tau(\pi)}. (56)
Proof:

We prove (56) by first demonstrating, through mathematical induction, that the relation

\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t},\bar{X}_{0:t})]=\mathbb{E}_{C}^{\pi}\big{[}g(A_{0:t},\bar{X}_{0:t})\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\big{]} (57)

holds for all t0t\geq 0 and for all measurable functions g:𝒜t+1×𝒮t+1g:\mathcal{A}^{t+1}\times\mathcal{S}^{t+1}\to\mathbb{R}. Then, (56) follows from (57) by noting that for any Eτ(π)E\in\mathcal{F}_{\tau(\pi)},

PCπ(E)\displaystyle P_{C^{\prime}}^{\pi}(E) =𝔼Cπ[𝕀E]\displaystyle=\mathbb{E}_{C^{\prime}}^{\pi}[\mathbb{I}_{E}]
=𝔼Cπ[t0𝕀E{τ(π)=t}]\displaystyle=\mathbb{E}_{C^{\prime}}^{\pi}\bigg{[}\sum_{t\geq 0}\ \mathbb{I}_{E\cap\{\tau(\pi)=t\}}\bigg{]}
=(a)t0𝔼Cπ[𝕀E{τ(π)=t}]\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{t\geq 0}\ \mathbb{E}_{C^{\prime}}^{\pi}\left[\mathbb{I}_{E\cap\{\tau(\pi)=t\}}\right]
=(b)t0𝔼Cπ[𝕀E{τ(π)=t}exp(ZCCπ(t))]\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\sum_{t\geq 0}\ \mathbb{E}_{C}^{\pi}\left[\mathbb{I}_{E\cap\{\tau(\pi)=t\}}\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\right]
=t0𝔼Cπ[𝕀E{τ(π)=t}exp(ZCCπ(τ(π)))]\displaystyle=\sum_{t\geq 0}\ \mathbb{E}_{C}^{\pi}\left[\mathbb{I}_{E\cap\{\tau(\pi)=t\}}\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(\tau(\pi))\big{)}\right]
=𝔼Cπ[𝕀Eexp(ZCCπ(τ(π)))],\displaystyle=\mathbb{E}_{C}^{\pi}\left[\mathbb{I}_{E}\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(\tau(\pi))\big{)}\right], (58)

where (a)(a) is due to the monotone convergence theorem, and (b)(b) above follows from (57) and the fact that Eτ(π)E\in\mathcal{F}_{\tau(\pi)} implies that E{τ(π)=t}tE\cap\{\tau(\pi)=t\}\in\mathcal{F}_{t} for all t0t\geq 0.

The proof of (57) for the case t=0t=0 may be obtained as follows: for any measurable g:𝒜×𝒮g:\mathcal{A}\times\mathcal{S}\to\mathbb{R},

𝔼Cπ[g(A0,X¯0)]\displaystyle\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0},\bar{X}_{0})] =a=1Ki𝒮g(a,i)PCπ(A0=a,X¯0=i)\displaystyle=\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(a,i)\ P_{C^{\prime}}^{\pi}(A_{0}=a,\bar{X}_{0}=i)
=a=1Ki𝒮g(a,i)PCπ(A0=a)PCπ(X¯0=i|A0=a)\displaystyle=\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(a,i)\ P_{C^{\prime}}^{\pi}(A_{0}=a)\ P_{C^{\prime}}^{\pi}(\bar{X}_{0}=i|A_{0}=a)
=(a)a=1Ki𝒮g(a,i)PCπ(A0=a)ϕ(i)\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(a,i)\ P_{C}^{\pi}(A_{0}=a)\ \phi(i)
=a=1Ki𝒮g(a,i)PCπ(A0=a)PCπ(X0a=i|A0=a)\displaystyle=\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(a,i)\ P_{C}^{\pi}(A_{0}=a)\ P_{C}^{\pi}(X_{0}^{a}=i|A_{0}=a)
=𝔼Cπ[g(A0,X¯0)]\displaystyle=\mathbb{E}_{C}^{\pi}[g(A_{0},\bar{X}_{0})]
=(b)𝔼Cπ[g(A0,X¯0)exp(ZCCπ(0))],\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\mathbb{E}_{C}^{\pi}\big{[}g(A_{0},\bar{X}_{0})\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(0)\big{)}\big{]}, (59)

where in writing (a)(a), we make use of (i) the fact that PCπ(A0=a)=PCπ(A0=a)P_{C^{\prime}}^{\pi}(A_{0}=a)=P_{C}^{\pi}(A_{0}=a) because the policy π\pi selects arms without the knowledge of the underlying assignment of the TPMs, and (ii) the assumption that X0aϕX_{0}^{a}\sim\phi for all a𝒜a\in\mathcal{A}, where ϕ\phi is a probability distribution on 𝒮\mathcal{S} that does not depend on the underlying assignment of the TPMs. In writing (b)(b) above, we make use of the observation that

ZCCπ(0)=logPCπ(A0,X¯0)PCπ(A0,X¯0)=0.\displaystyle Z_{CC^{\prime}}^{\pi}(0)=\log\frac{P_{C}^{\pi}(A_{0},\bar{X}_{0})}{P_{C^{\prime}}^{\pi}(A_{0},\bar{X}_{0})}=0. (60)

We now assume that (57) is true for some t0t\geq 0, and show that it is also true for t+1t+1. By the law of iterated expectations, 𝔼Cπ[g(A0:t+1,X¯0:t+1)]=𝔼Cπ[𝔼Cπ[g(A0:t+1,X¯0:t+1)|t+1]]\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t+1},\bar{X}_{0:t+1})]=\mathbb{E}_{C^{\prime}}^{\pi}[\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t+1},\bar{X}_{0:t+1})|\mathcal{F}_{t+1}]]. Because 𝔼Cπ[g(A0:t+1,X¯0:t+1)|t+1]\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t+1},\bar{X}_{0:t+1})|\mathcal{F}_{t+1}] is a measurable function of (A0:t,X¯0:t)(A_{0:t},\bar{X}_{0:t}), by the induction hypothesis, we have

\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t+1},\bar{X}_{0:t+1})] =\mathbb{E}_{C}^{\pi}\big{[}\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t+1},\bar{X}_{0:t+1})\,\big{|}\,\mathcal{F}_{t+1}]\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\big{]}
=\mathbb{E}_{C}^{\pi}\big{[}\mathbb{E}_{C^{\prime}}^{\pi}\big{[}g(A_{0:t+1},\bar{X}_{0:t+1})\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\,\big{|}\,\mathcal{F}_{t+1}\big{]}\big{]}, (61)

where the last line above follows by noting that ZCCπ(t)Z_{CC^{\prime}}^{\pi}(t) is measurable with respect to t+1\mathcal{F}_{t+1}. We now note that

𝔼Cπ[g(A0:t+1,X¯0:t+1)exp(ZCCπ(t))|t+1]\displaystyle\mathbb{E}_{C^{\prime}}^{\pi}\big{[}g(A_{0:t+1},\bar{X}_{0:t+1})\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\leavevmode\nobreak\ \big{|}\leavevmode\nobreak\ \mathcal{F}_{t+1}\big{]}
=a=1Ki𝒮g(A0:t,X¯0:t,a,i)PCπ(At+1=a|t+1)PCπ(X¯t+1=i|At+1=a,t+1)exp(ZCCπ(t))\displaystyle=\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(A_{0:t},\bar{X}_{0:t},a,i)\ P_{C^{\prime}}^{\pi}(A_{t+1}=a|\mathcal{F}_{t+1})\ P_{C^{\prime}}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}
=(a)a=1Ki𝒮g(A0:t,X¯0:t,a,i)PCπ(At+1=a|t+1)PCπ(X¯t+1=i|At+1=a,t+1)exp(ZCCπ(t)),\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(A_{0:t},\bar{X}_{0:t},a,i)\ P_{C}^{\pi}(A_{t+1}=a|\mathcal{F}_{t+1})\ P_{C^{\prime}}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}, (62)

where in writing (a)(a) above, we make use of the fact that PCπ(At+1=a|t+1)=PCπ(At+1=a|t+1)P_{C^{\prime}}^{\pi}(A_{t+1}=a|\mathcal{F}_{t+1})=P_{C}^{\pi}(A_{t+1}=a|\mathcal{F}_{t+1}) because π\pi selects arms without the knowledge of the underlying assignment of the TPMs. Also, we note that

PCπ(X¯t+1=i|At+1=a,t+1)exp(ZCCπ(t))\displaystyle P_{C^{\prime}}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}
=PCπ(X¯t+1=i|At+1=a,t+1)PCπ(X¯t+1=i|At+1=a,t+1)exp(ZCCπ(t))PCπ(X¯t+1=i|At+1=a,t+1)\displaystyle=\frac{P_{C^{\prime}}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})}{P_{C}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})}\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\ P_{C}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})
=exp(ZCCπ(t+1))PCπ(X¯t+1=i|At+1=a,t+1).\displaystyle=\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t+1)\big{)}\,P_{C}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1}). (63)

Substituting (63) in (62) and simplifying, we get

𝔼Cπ[g(A0:t+1,X¯0:t+1)exp(ZCCπ(t))|t+1]\displaystyle\mathbb{E}_{C^{\prime}}^{\pi}\big{[}g(A_{0:t+1},\bar{X}_{0:t+1})\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t)\big{)}\leavevmode\nobreak\ \big{|}\leavevmode\nobreak\ \mathcal{F}_{t+1}\big{]}
=a=1Ki𝒮g(A0:t,X¯0:t,a,i)PCπ(At+1=a|t+1)PCπ(X¯t+1=i|At+1=a,t+1)exp(ZCCπ(t+1))\displaystyle=\sum_{a=1}^{K}\ \sum_{i\in\mathcal{S}}\ g(A_{0:t},\bar{X}_{0:t},a,i)\ P_{C}^{\pi}(A_{t+1}=a|\mathcal{F}_{t+1})\ P_{C}^{\pi}(\bar{X}_{t+1}=i|A_{t+1}=a,\mathcal{F}_{t+1})\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t+1)\big{)}
=𝔼Cπ[g(A0:t+1,X¯0:t+1)exp(ZCCπ(t+1))|t+1].\displaystyle=\mathbb{E}_{C}^{\pi}\big{[}g(A_{0:t+1},\bar{X}_{0:t+1})\ \exp\big{(}-Z_{CC^{\prime}}^{\pi}(t+1)\big{)}\leavevmode\nobreak\ \big{|}\leavevmode\nobreak\ \mathcal{F}_{t+1}\big{]}. (64)

Substituting (64) in (61), we get

\mathbb{E}_{C^{\prime}}^{\pi}[g(A_{0:t+1},\bar{X}_{0:t+1})] =\mathbb{E}_{C}^{\pi}\big{[}\mathbb{E}_{C}^{\pi}\big{[}g(A_{0:t+1},\bar{X}_{0:t+1})\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t+1)\big{)}\,\big{|}\,\mathcal{F}_{t+1}\big{]}\big{]}
=\mathbb{E}_{C}^{\pi}\big{[}g(A_{0:t+1},\bar{X}_{0:t+1})\,\exp\big{(}-Z_{CC^{\prime}}^{\pi}(t+1)\big{)}\big{]}. (65)

This establishes (57) for t+1t+1, thereby completing the induction and the proof. ∎
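At its core, the identity (57) is the familiar likelihood-ratio (change-of-measure) identity. The following toy Python check, for a single observation on a finite state space with hypothetical laws p (under C) and q (under C'), illustrates the mechanism.

import numpy as np

p = np.array([0.5, 0.3, 0.2])    # law of one observation under C (hypothetical)
q = np.array([0.2, 0.5, 0.3])    # law of the same observation under C' (hypothetical)
g = np.array([1.0, -2.0, 4.0])   # an arbitrary bounded test function

# E_{C'}[g(X)] versus E_C[g(X) exp(-Z)], with Z = log(p(X)/q(X)).
lhs = float(np.sum(q * g))
rhs = float(np.sum(p * g * np.exp(-(np.log(p) - np.log(q)))))
assert np.isclose(lhs, rhs)      # the two expectations coincide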

A-B A Lower Bound for 𝔼Cπ[ZCCπ(τ(π))]\mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))] when πΠ(ϵ)\pi\in\Pi(\epsilon)

We note the following lower bound on the expected LLR. We omit the proof as it follows directly from the proof of [5, Lemma 19].

Lemma 9

Fix C𝒞C\in\mathcal{C}, CAlt(C)C^{\prime}\in\textsf{Alt}(C), and π\pi such that PCπ(τ(π)<)=1P_{C}^{\pi}(\tau(\pi)<\infty)=1, PCπ(τ(π)<)=1P_{C^{\prime}}^{\pi}(\tau(\pi)<\infty)=1. Then,

  1. 1.

    PCπP_{C}^{\pi} and PCπP_{C^{\prime}}^{\pi} are mutually absolutely continuous, and

  2. 2.

    for all Eτ(π)E\in\mathcal{F}_{\tau(\pi)} such that PCπ(E)>0P_{C}^{\pi}(E)>0, PCπ(E)>0P_{C^{\prime}}^{\pi}(E)>0,

    𝔼Cπ[ZCCπ(τ(π))]d(PCπ(E),PCπ(E)),\mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))]\geq d(P_{C}^{\pi}(E),P_{C^{\prime}}^{\pi}(E)), (66)

    where d(x,y)d(x,y) denotes the relative entropy between two Bernoulli distributions with parameters xx and yy.

Fix ϵ>0\epsilon>0. Recall the set Π(ϵ)\Pi(\epsilon) in (5). A direct consequence of Lemma 9 is that for any πΠ(ϵ)\pi\in\Pi(\epsilon), setting E={ωΩ:θ(τ(π))=a(C)}E=\{\omega\in\Omega:\theta(\tau(\pi))=a^{\star}(C)\}, where a(C)a^{\star}(C) is the index of the best arm in CC, noting that PCπ(E)1ϵP_{C}^{\pi}(E)\geq 1-\epsilon, PCπ(E)ϵP_{C^{\prime}}^{\pi}(E)\leq\epsilon for all CAlt(C)C^{\prime}\in\textsf{Alt}(C), and using the facts that x\mapsto d(x,y) is monotone increasing in xx for x>yx>y and that y\mapsto d(x,y) is monotone decreasing in yy for y<xy<x, we get

𝔼Cπ[ZCCπ(τ(π))]d(ϵ,1ϵ)\mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))]\geq d(\epsilon,1-\epsilon) (67)

for all CAlt(C)C^{\prime}\in\textsf{Alt}(C). Thus, it follows that minCAlt(C)𝔼Cπ[ZCCπ(τ(π))]d(ϵ,1ϵ)\min_{C^{\prime}\in\textsf{Alt}(C)}\ \mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))]\geq d(\epsilon,1-\epsilon) whenever πΠ(ϵ)\pi\in\Pi(\epsilon).
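For intuition about the size of the bound (67), the following sketch computes the binary relative entropy d(x,y) and checks numerically that d(\epsilon,1-\epsilon) behaves like \log(1/\epsilon) as \epsilon\downarrow 0, which is the fact used in Section A-D below.

import math

def bin_kl(x, y):
    # relative entropy between Bernoulli(x) and Bernoulli(y)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

for eps in (1e-2, 1e-4, 1e-6):
    ratio = bin_kl(eps, 1 - eps) / math.log(1 / eps)
    print(f"eps={eps:.0e}  d(eps,1-eps)/log(1/eps)={ratio:.4f}")   # tends to 1 as eps -> 0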

A-C An Upper Bound for 𝔼Cπ[ZCCπ(τ(π))]\mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))] in terms of 𝔼Cπ[τ(π)]\mathbb{E}_{C}^{\pi}[\tau(\pi)]

We first note the following result.

Lemma 10

Fix π\pi and C𝒞C\in\mathcal{C}. For all (d¯,i¯)𝕊(\underline{d},\underline{i})\in\mathbb{S}, a𝒜a\in\mathcal{A}, and j𝒮j\in\mathcal{S},

𝔼Cπ[N(τ(π),d¯,i¯,a,j)]=(PCa)da(j|ia)𝔼Cπ[N(τ(π),d¯,i¯,a)].\mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a,j)\big{]}=(P_{C}^{a})^{d_{a}}(j|i_{a})\ \mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a)\big{]}. (68)
Proof:

We note that

𝔼Cπ[𝔼Cπ[N(τ(π),d¯,i¯,a,j)|Xa1a]|τ(π)]\displaystyle\mathbb{E}_{C}^{\pi}[\mathbb{E}_{C}^{\pi}[N(\tau(\pi),\underline{d},\underline{i},a,j)|X_{a-1}^{a}]|\tau(\pi)] =𝔼Cπ[𝔼Cπ[t=Kτ(π)1{d¯(t)=d¯,i¯(t)=i¯,At=a,Xta=j}|Xa1a]|τ(π)]\displaystyle=\mathbb{E}_{C}^{\pi}\bigg{[}\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{t=K}^{\tau(\pi)}1_{\{\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a,X_{t}^{a}=j\}}\bigg{|}X_{a-1}^{a}\bigg{]}\bigg{|}\tau(\pi)\bigg{]}
=𝔼Cπ[t=Kτ(π)PCπ(d¯(t)=d¯,i¯(t)=i¯,At=a,Xta=j|Xa1a)|τ(π)].\displaystyle=\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{t=K}^{\tau(\pi)}P_{C}^{\pi}(\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a,X_{t}^{a}=j|X_{a-1}^{a})\,\bigg{|}\tau(\pi)\bigg{]}. (69)

For each tt in the range of the summation in (69), the conditional probability term for tt may be expressed as

PCπ(d¯(t)=d¯,i¯(t)=i¯,At=a,Xta=j|Xa1a)\displaystyle P_{C}^{\pi}(\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a,X_{t}^{a}=j|X_{a-1}^{a})
=PCπ(d¯(t)=d¯,i¯(t)=i¯,At=a|Xa1a)PCπ(Xta=j|At=a,d¯(t)=d¯,i¯(t)=i¯,Xa1a)\displaystyle=P_{C}^{\pi}(\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a|X_{a-1}^{a})\cdot P_{C}^{\pi}(X_{t}^{a}=j|A_{t}=a,\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},X_{a-1}^{a})
=PCπ(d¯(t)=d¯,i¯(t)=i¯,At=a|Xa1a)(PCa)da(j|ia).\displaystyle=P_{C}^{\pi}(\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{t}=a|X_{a-1}^{a})\cdot(P_{C}^{a})^{d_{a}}(j|i_{a}). (70)

Plugging (70) back in (69) and taking 𝔼Cπ[]\mathbb{E}_{C}^{\pi}[\cdot] on both sides of (69), we arrive at (68). ∎

From (18), we note that for all CAlt(C)C^{\prime}\in\textsf{Alt}(C),

𝔼Cπ[ZCCπ(τ(π))]\displaystyle\mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))]
=(a)𝔼Cπ[a=1KlogPCπ(Xa1a)PCπ(Xa1a)]+𝔼Cπ[(d¯,i¯)𝕊a=1Kj𝒮N(τ(π),d¯,i¯,a,j)log(PCa)da(j|ia)(PCa)da(j|ia)]\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}+\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(\tau(\pi),\underline{d},\underline{i},a,j)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\bigg{]}
=𝔼Cπ[a=1KlogPCπ(Xa1a)PCπ(Xa1a)]+(d¯,i¯)𝕊a=1Kj𝒮𝔼Cπ[N(τ(π),d¯,i¯,a,j)]log(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle=\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}+\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a,j)\big{]}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
=(b)𝔼Cπ[a=1KlogPCπ(Xa1a)PCπ(Xa1a)]+(d¯,i¯)𝕊a=1Kj𝒮𝔼Cπ[N(τ(π),d¯,i¯,a)](PCa)da(j|ia)log(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}+\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a)\big{]}\ (P_{C}^{a})^{d_{a}}(j|i_{a})\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
=\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}+\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a)\big{]}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})). (71)

In the above chain of equations, (a)(a) follows from the dominated convergence theorem (noting that each row of (PCa)d(P_{C}^{a})^{d} is mutually absolutely continuous with respect to the corresponding row of (PCa)d(P_{C^{\prime}}^{a})^{d} for all d1d\geq 1), and (b)(b) follows from Lemma 10. Continuing with (71), we have

𝔼Cπ[ZCCπ(τ(π))]𝔼Cπ[a=1Klog1PCπ(Xa1a)]\displaystyle\mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))]\leq\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{1}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}
\displaystyle\hskip 85.35826pt+\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a)\big{]}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
𝔼Cπ[a=1Klog1PCπ(Xa1a)]\displaystyle\leq\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{1}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}
\displaystyle\hskip 42.67912pt+(\mathbb{E}_{C}^{\pi}[\tau(\pi)-K+1])\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \frac{\mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a)\big{]}}{\mathbb{E}_{C}^{\pi}[\tau(\pi)-K+1]}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})), (72)

for all CAlt(C)C^{\prime}\in\textsf{Alt}(C), from which it follows that

minCAlt(C)𝔼Cπ[ZCCπ(τ(π))]minCAlt(C)𝔼Cπ[a=1Klog1PCπ(Xa1a)]\displaystyle\min_{C^{\prime}\in\textsf{Alt}(C)}\ \mathbb{E}_{C}^{\pi}[Z_{CC^{\prime}}^{\pi}(\tau(\pi))]\leq\min_{C^{\prime}\in\textsf{Alt}(C)}\ \mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{1}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}
\displaystyle\hskip 28.45274pt+(\mathbb{E}_{C}^{\pi}[\tau(\pi)-K+1])\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \frac{\mathbb{E}_{C}^{\pi}\big{[}N(\tau(\pi),\underline{d},\underline{i},a)\big{]}}{\mathbb{E}_{C}^{\pi}[\tau(\pi)-K+1]}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
minCAlt(C)𝔼Cπ[a=1Klog1PCπ(Xa1a)]\displaystyle\leq\min_{C^{\prime}\in\textsf{Alt}(C)}\ \mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{1}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}
\displaystyle\hskip 28.45274pt+(\mathbb{E}_{C}^{\pi}[\tau(\pi)-K+1])\ \bigg{\{}\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\bigg{\}}, (73)

where the supremum in (73) is over all ν\nu that satisfy (22)-(23). The expression within braces in (73) is T(C)T^{\star}(C).

A-D The Final Steps

Combining the results of Sections A-B and A-C, we get

d(ϵ,1ϵ)minCAlt(C)𝔼Cπ[a=1Klog1PCπ(Xa1a)]+(𝔼Cπ[τ(π)K+1])T(C).\displaystyle d(\epsilon,1-\epsilon)\leq\min_{C^{\prime}\in\textsf{Alt}(C)}\ \mathbb{E}_{C}^{\pi}\bigg{[}\sum_{a=1}^{K}\ \log\frac{1}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\bigg{]}+(\mathbb{E}_{C}^{\pi}[\tau(\pi)-K+1])\ T^{\star}(C). (74)

Noting that (a) d(ϵ,1ϵ)/log(1/ϵ)1d(\epsilon,1-\epsilon)/\log(1/\epsilon)\to 1 as ϵ0\epsilon\downarrow 0, and (b) the first term on the right hand side of (74) is bounded from above, by dividing (74) throughout by d(ϵ,1ϵ)d(\epsilon,1-\epsilon) and letting ϵ0\epsilon\downarrow 0, we arrive at the lower bound (20).

Appendix B Proof of Lemma 1

Fix an assignment of the TPMs C𝒞C\in\mathcal{C}. In this proof, we establish that under the policy πRunif\pi_{R}^{\textsf{unif}}, the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):\ t\geq K\} is irreducible, aperiodic, positive recurrent, and therefore ergodic.

Proof:

Consider any two states (d¯,i¯),(d¯,i¯)𝕊R(\underline{d},\underline{i}),(\underline{d}^{\prime},\underline{i}^{\prime})\in\mathbb{S}_{R}, and suppose that the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is in the state (d¯,i¯)(\underline{d},\underline{i}) at time t=T0t=T_{0}. We now demonstrate that there exists an integer NN (possibly depending on (d¯,i¯)(\underline{d},\underline{i}) and (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime})) such that

PCπRunif(d¯(T0+N)=d¯,i¯(T0+N)=i¯|d¯(T0)=d¯,i¯(T0)=i¯)>0.P_{C}^{\pi_{R}^{\textsf{unif}}}(\underline{d}(T_{0}+N)=\underline{d}^{\prime},\ \underline{i}(T_{0}+N)=\underline{i}^{\prime}|\underline{d}(T_{0})=\underline{d},\ \underline{i}(T_{0})=\underline{i})>0.

Assume without loss of generality that d¯\underline{d}^{\prime}, the vector of arm delays in the destination state (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}), is such that d1>d2>>dK=1d_{1}^{\prime}>d_{2}^{\prime}>\cdots>d_{K}^{\prime}=1. Noting that P1,,PKP_{1},\ldots,P_{K} are TPMs on the finite set 𝒮\mathcal{S}, we use [27, Proposition 1.7] for finite state Markov processes to deduce that there exist integers M1,,MKM_{1},\ldots,M_{K} such that for all mMmax{M1,,MK}m\geq M\coloneqq\max\{M_{1},\ldots,M_{K}\},

P1m(j|i)>0,,PKm(j|i)>0 for all i,j𝒮.P_{1}^{m}(j|i)>0,\ldots,P_{K}^{m}(j|i)>0\quad\text{ for all }i,j\in\mathcal{S}. (75)

Order the components of d¯\underline{d}, the vector of arm delays in the starting state (d¯,i¯)(\underline{d},\underline{i}), in decreasing order. Under πRunif\pi_{R}^{\textsf{unif}}, consider the sequence of arm selections and observations as follows: for a total of MM time instants, from t=T0t=T_{0} to t=T0+M1t=T_{0}+M-1, select the arms in a round robin fashion in the decreasing order of their component values in d¯\underline{d}. At time t=T0+Mt=T_{0}+M, select arm 11 and observe the state i1i_{1}^{\prime} on it. Thereafter, select arms 2,,K2,\ldots,K in a round robin fashion in the decreasing order of their component values in d¯\underline{d} until time t=T0+M+d1d21t=T_{0}+M+d_{1}^{\prime}-d_{2}^{\prime}-1. At time t=T0+M+d1d2t=T_{0}+M+d_{1}^{\prime}-d_{2}^{\prime}, select arm 22 and observe the state i2i_{2}^{\prime} on it. Continue the round robin sampling on arms 3,,K3,\ldots,K till time t=T0+M+d1d31t=T_{0}+M+d_{1}^{\prime}-d_{3}^{\prime}-1. At time t=T0+M+d1d3t=T_{0}+M+d_{1}^{\prime}-d_{3}^{\prime}, select arm 33 and observe the state i3i_{3}^{\prime} on it. Continue this process till arm KK is selected at time t=T0+M+d11t=T_{0}+M+d_{1}^{\prime}-1 and the state iKi_{K}^{\prime} is observed on it.

Clearly, under the above selection procedure the delay of each arm never exceeds RR at any time, so the constructed sequence of arm selections is admissible under πRunif\pi_{R}^{\textsf{unif}}. Also, the above sequence of arm selections and observations leads to the state (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}) at time t=T0+M+d1t=T_{0}+M+d_{1}^{\prime}. Thus, the probability of starting from the state (d¯,i¯)(\underline{d},\underline{i}) and reaching the state (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}) may be lower bounded by the probability that the above sequence of actions and observations occurs under πRunif\pi_{R}^{\textsf{unif}}, which in turn may be lower bounded by

(1K)M+d1[a=1K(PCa)M+da+d1da(ia|ia)].\displaystyle\bigg{(}\frac{1}{K}\bigg{)}^{M+d_{1}^{\prime}}\cdot\left[\prod_{a=1}^{K}(P_{C}^{a})^{M+d_{a}+d_{1}^{\prime}-d_{a}^{\prime}}(i_{a}^{\prime}|i_{a})\right]. (76)

Noting that M\leq M+d_{a}+d_{1}^{\prime}-d_{a}^{\prime}\leq M+2R-1 for every a\in\mathcal{A}, let

ε¯min{(PCa)m(j|i):i,j𝒮,a𝒜,MmM+2R1}.\bar{\varepsilon}\coloneqq\min\Big{\{}(P_{C}^{a})^{m}(j|i):\ i,j\in\mathcal{S},\ a\in\mathcal{A},\ M\leq m\leq M+2R-1\Big{\}}. (77)

It is clear that ε¯>0\bar{\varepsilon}>0, and (76) may further be lower bounded by

(1K)M+d1ε¯K>0.\bigg{(}\frac{1}{K}\bigg{)}^{M+d_{1}^{\prime}}\bar{\varepsilon}^{K}>0. (78)

Thus, we see that the Markov process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is in the state (d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime}) after N=M+d1N=M+d_{1}^{\prime} time instants with a strictly positive probability. This establishes irreducibility. ∎
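The argument above relies on the fact (75) that, for an ergodic TPM on a finite state space, all sufficiently high matrix powers have strictly positive entries. The following sketch, for a hypothetical TPM, finds such a threshold M numerically.

import numpy as np

def positivity_threshold(P, m_max=200):
    # smallest M such that P^m has all entries > 0 for every M <= m <= m_max
    n = P.shape[0]
    first, Pm = None, np.eye(n)
    for m in range(1, m_max + 1):
        Pm = Pm @ P
        if np.all(Pm > 0):
            if first is None:
                first = m
        else:
            first = None      # positivity must persist for all larger powers
    return first

P = np.array([[0.0, 1.0, 0.0],    # hypothetical irreducible and aperiodic chain
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
print(positivity_threshold(P))    # a finite threshold exists for this chain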

Proof:

It suffices to show that for each (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R}, there exists NN (possibly depending on (d¯,i¯)(\underline{d},\underline{i})) such that the probability of the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} starting from the state (d¯,i¯)(\underline{d},\underline{i}) at some time t=T0t=T_{0} and returning to the state (d¯,i¯)(\underline{d},\underline{i}) after NN time instants and also after N+1N+1 time instants is strictly positive. This follows directly from the proof of irreducibility presented above by setting (d¯,i¯)=(d¯,i¯)(\underline{d}^{\prime},\underline{i}^{\prime})=(\underline{d},\underline{i}) and N=M+d1N=M+d_{1}, where MM is such that (75) holds for all mMm\geq M; since (75) also holds with MM replaced by M+1M+1, repeating the construction with M+1M+1 in place of MM yields a return to (d¯,i¯)(\underline{d},\underline{i}) after N+1N+1 time instants as well. ∎

Proof:

This follows from the facts that (a) 𝕊R\mathbb{S}_{R} is finite, (b) {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is irreducible under πRunif\pi_{R}^{\textsf{unif}}, and (c) an irreducible Markov process evolving on a finite state space is positive recurrent. ∎

Appendix C Proof of Lemma 2

This proof is organised as follows. First, we show in Section C-A that for all (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R},

lim infnN(n,d¯,i¯)n>0almost surely.\liminf_{n\to\infty}\frac{N(n,\underline{d},\underline{i})}{n}>0\quad\text{almost surely}. (79)

Next, we show in Section C-B that almost surely,

lim infnN(n,d¯,i¯,a)n{>0,if (d¯,i¯)𝕊R,a or (d¯,i¯)a=1K𝕊R,a,=0,if (d¯,i¯)𝕊R,a for some aa.\liminf_{n\to\infty}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\begin{cases}>0,&\text{if }(\underline{d},\underline{i})\in\mathbb{S}_{R,a}\text{ or }(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\ \mathbb{S}_{R,a^{\prime}},\\ =0,&\text{if }(\underline{d},\underline{i})\in\mathbb{S}_{R,a^{\prime}}\text{ for some }a^{\prime}\neq a.\end{cases} (80)

Using the above results, we establish (40) in Section C-C.

C-A Limiting Drift of N(n,d¯,i¯)N(n,\underline{d},\underline{i})

Let MM be sufficiently large so that (75) holds for all mMm\geq M. Fix an arbitrary (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R}, and assume without loss of generality that d¯\underline{d} is such that d1>d2>>dK=1d_{1}>d_{2}>\cdots>d_{K}=1. Let pC(d¯,i¯)p_{C}(\underline{d},\underline{i}) denote the probability of the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} starting in the state (d¯,i¯)(\underline{d},\underline{i}) and returning back to the state (d¯,i¯)(\underline{d},\underline{i}) under the assignment of the TPMs CC. Following the exposition in Appendix B, it can be shown that pC(d¯,i¯)>0p_{C}(\underline{d},\underline{i})>0. Now, the term N(n,d¯,i¯)N(n,\underline{d},\underline{i}) may be lower bounded almost surely by the number of visits to the state (d¯,i¯)(\underline{d},\underline{i}) measured only at times t=K+M+d1,K+2(M+d1),K+3(M+d1)t=K+M+d_{1},K+2(M+d_{1}),K+3(M+d_{1}) and so on until time t=nt=n. At each of these time instants, the probability that the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} is in the state (d¯,i¯)(\underline{d},\underline{i}) under the assignment of the TPMs CC is equal to pC(d¯,i¯)p_{C}(\underline{d},\underline{i}). Thus, we have

N(n,d¯,i¯)Bin(nK+1M+d1,pC(d¯,i¯))almost surely,\displaystyle N(n,\underline{d},\underline{i})\geq\text{Bin}\left(\frac{n-K+1}{M+d_{1}},\ p_{C}(\underline{d},\underline{i})\right)\quad\text{almost surely}, (81)

where the notation Bin(m,q)\text{Bin}(m,q) denotes a Binomial random variable with parameters mm and qq. It then follows that, almost surely,

lim infnN(n,d¯,i¯)n\displaystyle\liminf_{n\to\infty}\frac{N(n,\underline{d},\underline{i})}{n} lim infnBin(nK+1M+d1,pC(d¯,i¯))n\displaystyle\geq\liminf_{n\to\infty}\frac{\text{Bin}\left(\frac{n-K+1}{M+d_{1}},\leavevmode\nobreak\ \leavevmode\nobreak\ p_{C}(\underline{d},\underline{i})\right)}{n}
=lim infnBin(nK+1M+d1,pC(d¯,i¯))nK+1M+d1nK+1n1M+d1\displaystyle=\liminf_{n\to\infty}\frac{\text{Bin}\left(\frac{n-K+1}{M+d_{1}},\ p_{C}(\underline{d},\underline{i})\right)}{\frac{n-K+1}{M+d_{1}}}\cdot\frac{n-K+1}{n}\cdot\frac{1}{M+d_{1}}
=(a)pC(d¯,i¯)M+d1\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\frac{p_{C}(\underline{d},\underline{i})}{M+d_{1}}
>0,\displaystyle>0, (82)

where (a)(a) above is due to the strong law of large numbers. This establishes (79).

C-B Limiting Drift of N(n,d¯,i¯,a)N(n,\underline{d},\underline{i},a)

If (d¯,i¯)𝕊R,a(\underline{d},\underline{i})\in\mathbb{S}_{R,a}, then N(n,d¯,i¯,a)=N(n,d¯,i¯)N(n,\underline{d},\underline{i},a)=N(n,\underline{d},\underline{i}), and consequently

lim infnN(n,d¯,i¯,a)n=lim infnN(n,d¯,i¯)n>0almost surely.\liminf_{n\to\infty}\ \frac{N(n,\underline{d},\underline{i},a)}{n}=\liminf_{n\to\infty}\ \frac{N(n,\underline{d},\underline{i})}{n}>0\quad\text{almost surely}.

If (d¯,i¯)𝕊R,a(\underline{d},\underline{i})\in\mathbb{S}_{R,a^{\prime}} for some aaa^{\prime}\neq a, then N(n,d¯,i¯,a)=0N(n,\underline{d},\underline{i},a)=0 for all nKn\geq K. Thus, it remains to show (80) for (d¯,i¯)a=1K𝕊R,a(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\ \mathbb{S}_{R,a^{\prime}}. Fix one such arbitrary (d¯,i¯)(\underline{d},\underline{i}) and define

S(n,d¯,i¯,a)t=Kn[𝕀{At=a,d¯(t)=d¯,i¯(t)=i¯}PCπ(At=a,d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)].S(n,\underline{d},\underline{i},a)\coloneqq\sum_{t=K}^{n}\Big{[}\mathbb{I}_{\{A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}-P_{C}^{\pi}(A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\bar{X}_{0:t-1})\Big{]}. (83)

For each tKt\geq K, because |𝕀{At=a,d¯(t)=d¯,i¯(t)=i¯}PCπ(At=a,d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)|2|\mathbb{I}_{\{A_{t}=a,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}-P_{C}^{\pi}(A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\bar{X}_{0:t-1})|\leq 2 almost surely, and 𝔼Cπ[𝕀{At=a,d¯(t)=d¯,i¯(t)=i¯}PCπ(At=a,d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)|A0:t1,X¯0:t1]=0\mathbb{E}_{C}^{\pi}[\mathbb{I}_{\{A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}-P_{C}^{\pi}(A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\bar{X}_{0:t-1})|A_{0:t-1},\bar{X}_{0:t-1}]=0 almost surely, the collection {𝕀{At=a,d¯(t)=d¯,i¯(t)=i¯}PCπ(At=a,d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)}tK\{\mathbb{I}_{\{A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}-P_{C}^{\pi}(A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\,\bar{X}_{0:t-1})\}_{t\geq K} is a bounded martingale difference sequence. Using the concentration result [28, Theorem 1.2A] for bounded martingale difference sequences, and subsequently applying the Borel–Cantelli lemma, we get that

S(n,d¯,i¯,a)n0as n,almost surely.\frac{S(n,\underline{d},\underline{i},a)}{n}\longrightarrow 0\quad\text{as }n\to\infty,\quad\text{almost surely}. (84)

This implies that for every choice of ε>0\varepsilon>0, there exists Nε=Nε(d¯,i¯,a)N_{\varepsilon}=N_{\varepsilon}(\underline{d},\underline{i},a) sufficiently large such that

N(n,d¯,i¯,a)n1nt=KnPCπ(At=a,d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)εnNε, almost surely.\frac{N(n,\underline{d},\underline{i},a)}{n}\geq\frac{1}{n}\ \sum_{t=K}^{n}P_{C}^{\pi}(A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\bar{X}_{0:t-1})-\varepsilon\quad\forall\ n\geq N_{\varepsilon},\text{ almost surely}. (85)

Now, for each tKt\geq K, under π=π(L,η,R)\pi=\pi^{\star}(L,\eta,R),

PCπ(At=a,d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)\displaystyle P_{C}^{\pi}(A_{t}=a,\,\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\,\bar{X}_{0:t-1})
=PCπ(At=a|d¯(t)=d¯,i¯(t)=i¯,A0:t1,X¯0:t1)PCπ(d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)\displaystyle=P_{C}^{\pi}(A_{t}=a|\underline{d}(t)=\underline{d},\underline{i}(t)=\underline{i},A_{0:t-1},\bar{X}_{0:t-1})\cdot P_{C}^{\pi}(\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\bar{X}_{0:t-1})
=λη,R,C¯(t)(a|d¯,i¯)PCπ(d¯(t)=d¯,i¯(t)=i¯|A0:t1,X¯0:t1)\displaystyle=\lambda_{\eta,R,\bar{C}(t)}(a|\underline{d},\underline{i})\cdot P_{C}^{\pi}(\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}|A_{0:t-1},\bar{X}_{0:t-1})
\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\frac{\eta\ \nu_{\bar{C}(t),R}^{\textsf{unif}}(\underline{d},\underline{i},a)+(1-\eta)\ \nu_{\bar{C}(t),R}^{\star}(\underline{d},\underline{i},a)}{\eta\ \mu_{\bar{C}(t),R}^{\textsf{unif}}(\underline{d},\underline{i})+(1-\eta)\ \sum_{a^{\prime}=1}^{K}\ \nu_{\bar{C}(t),R}^{\star}(\underline{d},\underline{i},a^{\prime})}\cdot\mathbb{I}_{\{\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}
\displaystyle\geq\frac{\eta}{K}\cdot\mu_{\bar{C}(t),R}^{\textsf{unif}}(\underline{d},\underline{i})\cdot\mathbb{I}_{\{\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}
\displaystyle\geq\frac{\eta}{K}\cdot\mu_{R}^{\textsf{min}}\cdot\mathbb{I}_{\{\underline{d}(t)=\underline{d},\,\underline{i}(t)=\underline{i}\}}, (86)

where in writing (a)(a) above, we use the fact that (d¯(t),i¯(t))(\underline{d}(t),\underline{i}(t)) is measurable with respect to the history (A0:t1,X¯0:t1)(A_{0:t-1},\,\bar{X}_{0:t-1}), and μRmin\mu_{R}^{\textsf{min}} in (86) is as defined in (37).
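
As a quick numerical sanity check of the lower bound in (86) (an illustrative aside, not part of the proof), the sketch below draws random occupancy measures and verifies that the sampling probability is at least (eta/K) times mu^unif(d,i). Each pseudo-state s stands for a pair (d,i), and the identification nu^unif(d,i,a) = mu^unif(d,i)/K (uniform arm choice under the uniform occupancy measure) is an assumption made only for this illustration.

import numpy as np

# Numerical sanity check of the lower bound in (86).  Each pseudo-state s stands
# for a pair (d, i); the identification nu_unif(s, a) = mu_unif(s) / K (uniform
# arm choice under the uniform occupancy measure) is an assumption made only for
# this illustration.
rng = np.random.default_rng(1)
num_states, K, eta = 12, 4, 0.3

mu_unif = rng.dirichlet(np.ones(num_states))                              # mu^unif(d, i)
nu_unif = np.outer(mu_unif, np.ones(K)) / K                               # nu^unif(d, i, a)
nu_star = rng.dirichlet(np.ones(num_states * K)).reshape(num_states, K)   # nu^star(d, i, a)

numer = eta * nu_unif + (1 - eta) * nu_star
denom = eta * mu_unif + (1 - eta) * nu_star.sum(axis=1)
lam = numer / denom[:, None]                                              # lambda(a | d, i)

lower_bound = (eta / K) * mu_unif[:, None]
print("lower bound in (86) holds everywhere:", bool(np.all(lam >= lower_bound - 1e-12)))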

Plugging (86) in (85), we get

N(n,d¯,i¯,a)n\displaystyle\frac{N(n,\underline{d},\underline{i},a)}{n} ηKμRminN(n,d¯,i¯)nεnNε,almost surely.\displaystyle\geq\frac{\eta}{K}\cdot\mu_{R}^{\textsf{min}}\cdot\frac{N(n,\underline{d},\underline{i})}{n}-\varepsilon\quad\forall\leavevmode\nobreak\ n\geq N_{\varepsilon},\quad\text{almost surely}. (87)

Using (82) in (87), we get that

N(n,d¯,i¯,a)nK+1ηKμRminpC(d¯,i¯)2(M+d1)ε\frac{N(n,\underline{d},\underline{i},a)}{n-K+1}\geq\frac{\eta}{K}\cdot\mu_{R}^{\textsf{min}}\cdot\frac{p_{C}(\underline{d},\underline{i})}{2(M+d_{1})}-\varepsilon (88)

for all nn large, almost surely. Setting ε=η2KμRminpC(d¯,i¯)2(M+d1)\varepsilon=\frac{\eta}{2K}\cdot\mu_{R}^{\textsf{min}}\cdot\frac{p_{C}(\underline{d},\underline{i})}{2(M+d_{1})} establishes (80).

C-C Completing the Proof of Lemma 2

From (80), it follows that whenever lim infnN(n,d¯,i¯,a)/n>0\liminf_{n\to\infty}\ N(n,\underline{d},\underline{i},a)/n>0 almost surely, we may apply the ergodic theorem to deduce that under the assignment of the TPMs CC,

N(n,d¯,i¯,a,j)N(n,d¯,i¯,a)(PCa)da(j|ia)as n,almost surely.\frac{N(n,\underline{d},\underline{i},a,j)}{N(n,\underline{d},\underline{i},a)}\longrightarrow(P_{C}^{a})^{d_{a}}(j|i_{a})\quad\text{as }n\to\infty,\quad\text{almost surely}. (89)
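
The convergence in (89) is the familiar statement that empirical d-step transition frequencies converge to the entries of the d-step TPM. The following sketch (an illustrative aside) checks this for an arbitrary 3-state chain; the chain, the delay d, and the state i are illustrative choices.

import numpy as np

# Illustration of (89): empirical d-step transition frequencies converge to the
# entries of the d-step TPM.  The chain, the delay d, and the state i below are
# arbitrary illustrative choices.
rng = np.random.default_rng(2)
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
d, i, n = 3, 0, 200_000

X = np.empty(n, dtype=int)
X[0] = 0
for t in range(1, n):
    X[t] = rng.choice(3, p=P[X[t - 1]])

mask = X[:-d] == i                        # times whose state d steps earlier was i
counts = np.bincount(X[d:][mask], minlength=3)
print("empirical d-step frequencies:", counts / counts.sum())
print("row of P^d at state i       :", np.linalg.matrix_power(P, d)[i])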

Under the constraint that the delay of each arm is at most RR,

ZCCπ(n)n=1na=1KlogPCπ(Xa1a)PCπ(Xa1a)+(d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j)nlog(PCa)da(j|ia)(PCa)da(j|ia).\displaystyle\frac{Z^{\pi}_{CC^{\prime}}(n)}{n}=\frac{1}{n}\ \sum_{a=1}^{K}\log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \frac{N(n,\underline{d},\underline{i},a,j)}{n}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}. (90)

The first term in (90) may be lower bounded as follows: assuming that X0aϕX_{0}^{a}\sim\phi for all a𝒜a\in\mathcal{A}, where ϕ\phi is a probability distribution on 𝒮\mathcal{S} that is independent of the underlying TPMs CC and puts a strictly positive mass on each state in 𝒮\mathcal{S}, it follows that

\displaystyle\frac{1}{n}\ \sum_{a=1}^{K}\log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\geq\frac{1}{n}\ \sum_{a=1}^{K}\log P_{C}^{\pi}(X_{a-1}^{a})
\displaystyle=\frac{1}{n}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{X_{a-1}^{a}=j\}}\ \log\left(\sum_{i\in\mathcal{S}}\ \phi(i)\ (P_{C}^{a})^{a-1}(j|i)\right). (91)

Because the right hand side of (91) converges to 0 as nn\to\infty, given any ε>0\varepsilon>0, there exists N1=N1(ε)N_{1}=N_{1}(\varepsilon) such that

\displaystyle\frac{1}{n}\ \sum_{a=1}^{K}\log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}\geq-\varepsilon\quad\text{for all }n\geq N_{1},\quad\text{almost surely}. (92)

The second term in (90) may be expressed as

\displaystyle\sum_{(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\mathbb{S}_{R,a^{\prime}}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \frac{N(n,\underline{d},\underline{i},a,j)}{n}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}+\sum_{a=1}^{K}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a}}\ \sum_{j\in\mathcal{S}}\ \frac{N(n,\underline{d},\underline{i},a,j)}{n}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}. (93)

Using the convergence in (89) and noting that 𝕊R×𝒜\mathbb{S}_{R}\times\mathcal{A} is finite, we get that there exists N2=N2(ε)N_{2}=N_{2}(\varepsilon) such that for all nN2n\geq N_{2}, (93) is almost surely lower bounded by

\displaystyle\sum_{(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\mathbb{S}_{R,a^{\prime}}}\ \sum_{a=1}^{K}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
\displaystyle\hskip 85.35826pt+\sum_{a=1}^{K}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a}}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))-\varepsilon. (94)

Combining (92) and (94), we get that

\displaystyle\frac{Z^{\pi}_{CC^{\prime}}(n)}{n}\geq-2\varepsilon+\sum_{(\underline{d},\underline{i})\notin\bigcup_{a^{\prime}=1}^{K}\mathbb{S}_{R,a^{\prime}}}\ \sum_{a=1}^{K}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
\displaystyle\hskip 85.35826pt+\sum_{a=1}^{K}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a}}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a})) (95)

for all nmax{N1,N2}n\geq\max\{N_{1},N_{2}\}, almost surely. Using the results in (79) and (80), we see that the limit infimum of the last two terms in (95) is strictly positive, almost surely. Because ε>0\varepsilon>0 is arbitrary, the desired result follows.

Appendix D Proof of Lemma 3

The policy π=π(L,η,R)\pi=\pi^{\star}(L,\eta,R) commits an error if one of the following events is true:

  1. The policy does not stop in finite time.

  2. The policy stops in finite time and declares an incorrect best arm index.

The event in item 1 above has zero probability, thanks to Lemma 2. Thus, the probability of error of the policy \pi^{\star}(L,\eta,R) may be evaluated as follows: for C\in\mathcal{C}, recall that a^{\star}(C) is the index of the best arm in C. Then, under the assignment of the TPMs C, the error probability of \pi^{\star}(L,\eta,R) is given by

PCπ(θ(τ(π))a)\displaystyle P^{\pi}_{C}(\theta(\tau(\pi))\neq a)
=PCπ(n and aa such that τ(π)=n and θ(n)=a)\displaystyle=P_{C}^{\pi}\bigg{(}\exists\ n\text{ and }a^{\prime}\neq a\text{ such that }\tau(\pi)=n\text{ and }\theta(n)=a^{\prime}\bigg{)}
=PCπ(n and CAlt(C) such that τ(π)=n and θ(n)=a(C)).\displaystyle=P_{C}^{\pi}\bigg{(}\exists\ n\text{ and }C^{\prime}\in\textsf{Alt}(C)\text{ such that }\tau(\pi)=n\text{ and }\theta(n)=a^{\star}(C^{\prime})\bigg{)}. (96)

Let a(n){ωΩ:τ(π)(ω)=n,θ(n)(ω)=a}\mathcal{R}_{a}(n)\coloneqq\{\omega\in\Omega:\tau(\pi)(\omega)=n,\,\theta(n)(\omega)=a\}, a𝒜a\in\mathcal{A}, denote the set of all sample paths for which the policy stops at time nn and declares aa as the index of the best arm. Clearly, {a(n):a𝒜,n0}\{\mathcal{R}_{a}(n):a\in\mathcal{A},\ n\geq 0\} is a collection of mutually disjoint sets. Therefore, we have

PCπ(θ(τ(π))a)\displaystyle P_{C}^{\pi}(\theta(\tau(\pi))\neq a)
=PCπ(aan=0a(n))\displaystyle=P_{C}^{\pi}\left(\bigcup_{a^{\prime}\neq a}\,\bigcup_{n=0}^{\infty}\mathcal{R}_{a^{\prime}}(n)\right)
=aan=0PCπ(τ(π)=n,θ(n)=a)\displaystyle=\sum_{a^{\prime}\neq a}\ \sum_{n=0}^{\infty}\ P_{C}^{\pi}(\tau(\pi)=n,\ \theta(n)=a^{\prime})
=CAlt(C)n=0PCπ(τ(π)=n,θ(n)=a(C))\displaystyle=\sum_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{n=0}^{\infty}\ P_{C}^{\pi}(\tau(\pi)=n,\ \theta(n)=a^{\star}(C^{\prime}))
=CAlt(C)n=0a(C)(n)𝑑PCπ(ω)\displaystyle=\sum_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{n=0}^{\infty}\ \int_{\mathcal{R}_{a^{\star}(C^{\prime})}(n)}\,dP_{C}^{\pi}(\omega)
=CAlt(C)n=0a(C)(n)exp(ZCπ(n,ω))d(A0:n(ω),X¯0:n(ω))\displaystyle=\sum_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{n=0}^{\infty}\leavevmode\nobreak\ \int_{\mathcal{R}_{a^{\star}(C^{\prime})}(n)}\ \exp(Z_{C}^{\pi}(n,\omega))\ d(A_{0:n}(\omega),\bar{X}_{0:n}(\omega))
=CAlt(C)n=0a(C)(n)exp(ZCCπ(n,ω))exp(ZCπ(n,ω))d(A0:n(ω),X¯0:n(ω))\displaystyle=\sum_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{n=0}^{\infty}\ \int_{\mathcal{R}_{a^{\star}(C^{\prime})}(n)}\ \exp({-Z_{C^{\prime}C}^{\pi}(n,\omega)})\ \exp(Z_{C^{\prime}}^{\pi}(n,\omega))\ d(A_{0:n}(\omega),\bar{X}_{0:n}(\omega))
(a)CAlt(C)n=0a(C)(n)1L(K1)(K1)!𝑑PCπ(ω)\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\sum_{C^{\prime}\in\textsf{Alt}(C)}\sum_{n=0}^{\infty}\ \int_{\mathcal{R}_{a^{\star}(C^{\prime})}(n)}\ \frac{1}{L(K-1)(K-1)!}\leavevmode\nobreak\ dP^{\pi}_{C^{\prime}}(\omega)
=CAlt(C)1L(K1)(K1)!PCπ(n=0a(C)(n))\displaystyle=\sum_{C^{\prime}\in\textsf{Alt}(C)}\frac{1}{L(K-1)(K-1)!}\leavevmode\nobreak\ P_{C^{\prime}}^{\pi}\left(\bigcup_{n=0}^{\infty}\mathcal{R}_{a^{\star}(C^{\prime})}(n)\right)
1L,\displaystyle\leq\frac{1}{L}, (97)

where (a) above follows by noting that for any C^{\prime}\in\textsf{Alt}(C), the condition M_{C^{\prime}}^{\pi}(n)\geq\log(L(K-1)(K-1)!) holds at the stopping time \tau(\pi)=n on the set \mathcal{R}_{a^{\star}(C^{\prime})}(n). In particular, this implies that Z^{\pi}_{C^{\prime}C}(n)\geq\log(L(K-1)(K-1)!) on this set. Setting L=1/\epsilon in (97) yields the desired result.
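
As an aside, the factor (K-1)(K-1)! in the threshold matches the cardinality of Alt(C) under the assumption (suggested by the threshold itself) that the candidate assignments are the K! ways of placing K distinct TPMs on the K arms and that Alt(C) collects those whose best arm differs from that of C; this is exactly what makes the final union bound collapse to 1/L. A minimal check of this count, under that stated assumption, is below.

from itertools import permutations
from math import factorial

# Count of alternative assignments, assuming the assignments are the K! ways of
# placing K distinct TPMs on K arms, that TPM label 0 is the unique best TPM, and
# that Alt(C) collects the permutations whose best arm differs from that of the
# reference assignment (under which arm 0 is best).
for K in range(2, 7):
    alt = sum(1 for sigma in permutations(range(K)) if sigma.index(0) != 0)
    assert alt == (K - 1) * factorial(K - 1)
    print(f"K = {K}:  |Alt(C)| = {alt} = (K-1)(K-1)!, so the union bound gives 1/L")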

Appendix E Proof of Proposition 2

We note that for all CAlt(C)C^{\prime}\in\textsf{Alt}(C), under the policy π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R), almost surely,

lim supnMCπ(n)\displaystyle\limsup_{n\to\infty}M_{C^{\prime}}^{\pi}(n) =lim supnminC′′Alt(C)ZCC′′π(n)\displaystyle=\limsup_{n\to\infty}\min_{C^{\prime\prime}\in\textsf{Alt}(C^{\prime})}\ Z_{C^{\prime}C^{\prime\prime}}^{\pi}(n)
lim supnZCCπ(n)\displaystyle\leq\limsup_{n\to\infty}\ Z_{C^{\prime}C}^{\pi}(n)
=lim supnZCCπ(n)\displaystyle=\limsup_{n\to\infty}\ -Z_{CC^{\prime}}^{\pi}(n)
=lim infnZCCπ(n)\displaystyle=-\liminf_{n\to\infty}\ Z_{CC^{\prime}}^{\pi}(n)
lim infnMCπ(n)\displaystyle\leq-\liminf_{n\to\infty}\ M_{C}^{\pi}(n)
<0,\displaystyle<0, (98)

where the last line above is due to Lemma 2. Furthermore, for any \bar{C}\neq C such that a^{\star}(\bar{C})=a^{\star}(C) (recall that a^{\star}(C) denotes the index of the best arm in C), following the exposition in Appendix C-C with C^{\prime} replaced by \bar{C}, we get that \liminf_{n\to\infty}Z_{C\bar{C}}^{\pi}(n)/n>0 almost surely, and therefore \liminf_{n\to\infty}Z_{C\bar{C}}^{\pi}(n)>0 almost surely. Because Z_{CC^{\prime}}^{\pi}(n)-Z_{\bar{C}C^{\prime}}^{\pi}(n)=Z_{C\bar{C}}^{\pi}(n), this implies that \liminf_{n\to\infty}\big[Z_{CC^{\prime}}^{\pi}(n)-Z_{\bar{C}C^{\prime}}^{\pi}(n)\big]>0 almost surely for all C^{\prime}\in\textsf{Alt}(C), which in turn implies that \liminf_{n\to\infty}Z_{CC^{\prime}}^{\pi}(n)>\limsup_{n\to\infty}Z_{\bar{C}C^{\prime}}^{\pi}(n) almost surely for all C^{\prime}\in\textsf{Alt}(C). Noting that \textsf{Alt}(C)=\textsf{Alt}(\bar{C}), it follows that

lim infnMCπ(n)>lim supnMC¯π(n)almost surely\displaystyle\liminf_{n\to\infty}M_{C}^{\pi}(n)>\limsup_{n\to\infty}M_{\bar{C}}^{\pi}(n)\quad\text{almost surely} (99)

for all C¯C\bar{C}\neq C such that a(C¯)=a(C)a^{\star}(\bar{C})=a^{\star}(C). Combining (99) and (98), we get that under π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R),

C¯(n)=Cfor all n large,almost surely,\bar{C}(n)=C\quad\text{for all }n\text{ large},\quad\text{almost surely}, (100)

when the underlying assignment of the TPMs is CC, from which we may deduce that under π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R),

limnPπ(An=a|A0:n1,{(d¯(s),i¯(s)):Ksn1},d¯(n)=d¯,i¯(n)=i¯)\displaystyle\lim_{n\to\infty}\ P^{\pi}(A_{n}=a|A_{0:n-1},\{(\underline{d}(s),\underline{i}(s)):K\leq s\leq n-1\},\underline{d}(n)=\underline{d},\underline{i}(n)=\underline{i})
=limnλη,R,C¯(n)(a|d¯,i¯)\displaystyle=\lim_{n\to\infty}\ \lambda_{\eta,R,\bar{C}(n)}(a|\underline{d},\underline{i})
=λη,R,C(a|d¯,i¯)\displaystyle=\lambda_{\eta,R,C}(a|\underline{d},\underline{i}) (101)

almost surely. This shows that πNS(L,η,R)\pi_{\textsf{NS}}^{\star}(L,\eta,R) eventually makes the process {(d¯(t),i¯(t)):tK}\{(\underline{d}(t),\underline{i}(t)):t\geq K\} ergodic with νη,R,C\nu_{\eta,R,C} as its ergodic state-action occupancy measure. As a consequence, for all (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R} and a𝒜a\in\mathcal{A}, it follows that

limnN(n,d¯,i¯,a)n=νη,R,C(d¯,i¯,a)almost surely.\lim_{n\to\infty}\ \frac{N(n,\underline{d},\underline{i},a)}{n}=\nu_{\eta,R,C}(\underline{d},\underline{i},a)\ \text{almost surely}. (102)

Therefore, under π=πNS(L,η,R)\pi=\pi_{\textsf{NS}}^{\star}(L,\eta,R), we have

limnZCCπ(n)n\displaystyle\lim_{n\to\infty}\ \frac{Z_{CC^{\prime}}^{\pi}(n)}{n} =limn(d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j)nlog(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle=\lim_{n\to\infty}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\frac{N(n,\underline{d},\underline{i},a,j)}{n}\,\log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
=(d¯,i¯)𝕊Ra=1Kj𝒮limnN(n,d¯,i¯,a,j)nlog(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \lim_{n\to\infty}\ \frac{N(n,\underline{d},\underline{i},a,j)}{n}\,\log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
=(a)(d¯,i¯)𝕊Ra=1Kj𝒮νη,R,C(d¯,i¯,a)(PCa)da(j|ia)log(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\cdot(P_{C}^{a})^{d_{a}}(j|i_{a})\cdot\log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
=(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))almost surely,\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\,D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\quad\text{almost surely}, (103)

where (a) follows from (89) together with (102). Eq. (43) is then immediate from (103).
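
As an illustrative aside, the limiting drift in (103) is a weighted sum of KL divergences between rows of powers of the TPMs and is straightforward to evaluate numerically. The sketch below does this for a single arm; the two TPMs, the (delay, state) pairs, and the occupancy weights nu are arbitrary illustrative choices.

import numpy as np

def kl(p, q):
    """KL divergence between two probability vectors with full support."""
    return float(np.sum(p * np.log(p / q)))

# Sketch of the limiting drift in (103) for a single arm.  The two TPMs, the list
# of (delay, state) pairs, and the occupancy weights nu are illustrative choices.
P = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])   # arm's TPM under C
Q = np.array([[0.4, 0.4, 0.2], [0.3, 0.3, 0.4], [0.1, 0.6, 0.3]])   # arm's TPM under C'

pairs = [(1, 0), (1, 2), (2, 1), (3, 0)]     # (delay d, last observed state i)
nu = [0.4, 0.3, 0.2, 0.1]                    # occupancy weights, summing to 1

drift = sum(w * kl(np.linalg.matrix_power(P, d)[i], np.linalg.matrix_power(Q, d)[i])
            for w, (d, i) in zip(nu, pairs))
print(f"limiting value of Z_CC'(n)/n for this occupancy measure: {drift:.4f}")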

Appendix F Proof of Lemma 4

Because π=π(L,η,R)\pi=\pi^{\star}(L,\eta,R) selects arm 11 at time t=0t=0, arm 22 at time t=1t=1, etc., and arm KK at time t=K1t=K-1, in order to prove the lemma, it suffices to prove that for all mKm\geq K,

limLPCπ(τ(π)m)=0.\lim_{L\to\infty}P_{C}^{\pi}(\tau(\pi)\leq m)=0. (104)

Fix an arbitrary mKm\geq K. Then,

lim supLPCπ(τ(π)m)\displaystyle\limsup_{L\to\infty}\,P_{C}^{\pi}(\tau(\pi)\leq m)
=lim supLPCπ(Knm and C𝒞 such that MCπ(n)log(L(K1)(K1)!))\displaystyle=\limsup_{L\to\infty}\,P_{C}^{\pi}\bigg{(}\exists\leavevmode\nobreak\ K\leq n\leq m\text{ and }C^{\prime}\in\mathcal{C}\text{ such that }M_{C^{\prime}}^{\pi}(n)\geq\log(L(K-1)(K-1)!)\bigg{)}
lim supLC𝒞n=KmPCπ(MCπ(n)log(L(K1)(K1)!))\displaystyle\leq\limsup_{L\to\infty}\sum_{C^{\prime}\in\mathcal{C}}\ \sum_{n=K}^{m}\ P_{C}^{\pi}(M_{C^{\prime}}^{\pi}(n)\geq\log(L(K-1)(K-1)!))
lim supL1log(L(K1)(K1)!)C𝒞n=Km𝔼Cπ[MCπ(n)],\displaystyle\leq\limsup_{L\to\infty}\ \frac{1}{\log(L(K-1)(K-1)!)}\ \sum_{C^{\prime}\in\mathcal{C}}\ \sum_{n=K}^{m}\ \mathbb{E}_{C}^{\pi}[M_{C^{\prime}}^{\pi}(n)], (105)

where the first and the second inequalities above follow from the union bound and Markov’s inequality respectively. We now show that for each n{K,,m}n\in\{K,\ldots,m\}, the expectation term inside the summation in (105) is finite. This will then imply that the limit supremum on the right-hand side of (105) is equal to 0, thus proving the desired result.

Note that

MCπ(n)=minC~Alt(C)ZCC~π(n)ZCC~π(n) for all C~Alt(C).\displaystyle M_{C^{\prime}}^{\pi}(n)=\min_{\tilde{C}\in\textsf{Alt}(C^{\prime})}Z_{C^{\prime}\tilde{C}}^{\pi}(n)\leq Z_{C^{\prime}\tilde{C}}^{\pi}(n)\text{ for all }\tilde{C}\in\textsf{Alt}(C^{\prime}). (106)

Fix an arbitrary C~Alt(C)\tilde{C}\in\textsf{Alt}(C^{\prime}) and recall that

ZCC~π(n)\displaystyle Z_{C^{\prime}\tilde{C}}^{\pi}(n) =a=1KlogPCπ(Xa1a)PC~π(Xa1a)+(d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j)log(PCa)da(j|ia)(PC~a)da(j|ia).\displaystyle=\sum_{a=1}^{K}\log\frac{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}{P_{\tilde{C}}^{\pi}(X_{a-1}^{a})}+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j)\ \log\frac{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}{(P_{\tilde{C}}^{a})^{d_{a}}(j|i_{a})}. (107)

Because each row of (PCa)d(P_{C^{\prime}}^{a})^{d} is mutually absolutely continuous with the corresponding row of (PC~a)d(P_{\tilde{C}}^{a})^{d} for all d1d\geq 1, we may upper bound the second term in (107) as

(d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j)log(PCa)da(j|ia)(PC~a)da(j|ia)\displaystyle\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j)\ \log\frac{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}{(P_{\tilde{C}}^{a})^{d_{a}}(j|i_{a})} A((d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j))\displaystyle\leq A\cdot\left(\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ N(n,\underline{d},\underline{i},a,j)\right)
=A(nK+1)almost surely,\displaystyle=A\cdot(n-K+1)\quad\text{almost surely}, (108)

where

A=max{logPad(j|i)Pad(j|i):Pad(j|i)0,Pad(j|i)0,d,i,j𝒮,a,a𝒜}<.A=\max\left\{\log\frac{P_{a}^{d}(j|i)}{P_{a^{\prime}}^{d}(j|i)}:\ P_{a}^{d}(j|i)\neq 0,\ P_{a^{\prime}}^{d}(j|i)\neq 0,\ d\in\mathbb{N},\ i,j\in\mathcal{S},\ a,a^{\prime}\in\mathcal{A}\right\}<\infty. (109)
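
As an aside, the constant A is finite because, by ergodicity, the d-step rows converge to strictly positive stationary rows as d grows, so the log-ratios stabilise. The sketch below approximates A by truncating the maximum over d; the two TPMs and the truncation level are illustrative choices.

import numpy as np

# Truncated approximation of the constant A in (109).  The two TPMs and the
# truncation level d_max are illustrative; finiteness of A rests on the d-step
# rows converging to strictly positive stationary rows as d grows.
P1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])
P2 = np.array([[0.4, 0.4, 0.2], [0.3, 0.3, 0.4], [0.1, 0.6, 0.3]])
d_max = 200

A = 0.0
for Pa, Pb in [(P1, P2), (P2, P1)]:
    for d in range(1, d_max + 1):
        ratio = np.linalg.matrix_power(Pa, d) / np.linalg.matrix_power(Pb, d)
        A = max(A, float(np.max(np.log(ratio))))
print(f"truncated approximation of A: {A:.4f}")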

Furthermore, suppose that X0aϕX_{0}^{a}\sim\phi for all a𝒜a\in\mathcal{A}, where ϕ\phi is a probability distribution on 𝒮\mathcal{S} that is independent of aa and the underlying assignment of the TPMs CC. Without loss of generality, let ϕ(i)>0\phi(i)>0 for all i𝒮i\in\mathcal{S}. Then, the first term in (107) may be upper bounded as

a=1KlogPCπ(Xa1a)PC~π(Xa1a)\displaystyle\sum_{a=1}^{K}\log\frac{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}{P_{\tilde{C}}^{\pi}(X_{a-1}^{a})} =logPCπ(X01)PC~π(X01)+a=2KlogPCπ(Xa1a)PC~π(Xa1a)\displaystyle=\log\frac{P_{C^{\prime}}^{\pi}(X_{0}^{1})}{P_{\tilde{C}}^{\pi}(X_{0}^{1})}+\sum_{a=2}^{K}\log\frac{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}{P_{\tilde{C}}^{\pi}(X_{a-1}^{a})}
=a=2Kj𝒮𝕀{Xa1a=j}logPCπ(Xa1a=j)PC~π(Xa1a=j)\displaystyle=\sum_{a=2}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{X_{a-1}^{a}=j\}}\ \log\frac{P_{C^{\prime}}^{\pi}(X_{a-1}^{a}=j)}{P_{\tilde{C}}^{\pi}(X_{a-1}^{a}=j)}
=a=2Kj𝒮𝕀{Xa1a=j}logi𝒮ϕ(i)(PCa)a1(j|i)i𝒮ϕ(i)(PC~a)a1(j|i)\displaystyle=\sum_{a=2}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{X_{a-1}^{a}=j\}}\ \log\frac{\sum_{i\in\mathcal{S}}\ \phi(i)\cdot(P_{C^{\prime}}^{a})^{a-1}(j|i)}{\sum_{i\in\mathcal{S}}\ \phi(i)\cdot(P_{\tilde{C}}^{a})^{a-1}(j|i)}
(a)a=2Kj𝒮𝕀{Xa1a=j}i𝒮ϕ(i)(PCa)a1(j|i)i𝒮ϕ(i)(PCa)a1(j|i)log(PCa)a1(j|i)(PC~a)a1(j|i)\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\sum_{a=2}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{X_{a-1}^{a}=j\}}\ \sum_{i\in\mathcal{S}}\ \frac{\phi(i)\,(P_{C^{\prime}}^{a})^{a-1}(j|i)}{\sum_{i^{\prime}\in\mathcal{S}}\phi(i^{\prime})\,(P_{C^{\prime}}^{a})^{a-1}(j|i^{\prime})}\,\log\frac{(P_{C^{\prime}}^{a})^{a-1}(j|i)}{(P_{\tilde{C}}^{a})^{a-1}(j|i)}
A(K1)almost surely,\displaystyle\leq A\,(K-1)\quad\text{almost surely}, (110)

where (a) above follows from the log-sum inequality [29, Theorem 2.7.1], and the first term in the first line of (110) vanishes because X_{0}^{1}\sim\phi under both C^{\prime} and \tilde{C}. Combining (108) and (110), we get

ZCC~π(n)Analmost surely,Z_{C^{\prime}\tilde{C}}^{\pi}(n)\leq A\cdot n\quad\text{almost surely}, (111)

from which it follows that \mathbb{E}_{C}^{\pi}[Z_{C^{\prime}\tilde{C}}^{\pi}(n)]\leq A\cdot n. Together with (106), this yields \mathbb{E}_{C}^{\pi}[M_{C^{\prime}}^{\pi}(n)]\leq A\cdot n<\infty for each n\in\{K,\ldots,m\}, as required.
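
For completeness, a quick numerical check of the log-sum inequality used in step (a) of (110) is given below (an illustrative aside); the vectors phi, p, and q are arbitrary.

import numpy as np

# Quick check of the log-sum inequality used in step (a) of (110):
#   sum_i a_i log(a_i / b_i)  >=  (sum_i a_i) log(sum_i a_i / sum_i b_i),
# applied there with a_i = phi(i) p(j|i) and b_i = phi(i) q(j|i).  The vectors
# below are arbitrary positive illustrative choices.
rng = np.random.default_rng(4)
phi = rng.dirichlet(np.ones(4))
p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))

a, b = phi * p, phi * q
lhs = float(np.sum(a * np.log(a / b)))
rhs = float(a.sum() * np.log(a.sum() / b.sum()))
print(f"sum a_i log(a_i/b_i) = {lhs:.4f}  >=  (sum a_i) log(sum a_i / sum b_i) = {rhs:.4f}")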

Appendix G Proof of Lemma 5

By the definition of τ(π)\tau(\pi), we know that under the assignment of the TPMs CC,

MCπ(τ(π)1)<log(L(K1)(K1)!).M_{C}^{\pi}(\tau(\pi)-1)<\log(L(K-1)(K-1)!).

Therefore, almost surely,

1\displaystyle 1 =lim supLlog(L(K1)(K1)!)logL\displaystyle=\limsup_{L\to\infty}\ \frac{\log(L(K-1)(K-1)!)}{\log L}
lim supLMCπ(τ(π)1)logL\displaystyle\geq\limsup_{L\to\infty}\ \frac{M_{C}^{\pi}(\tau(\pi)-1)}{\log L}
=lim supLMCπ(τ(π)1)τ(π)1τ(π)1logL\displaystyle=\limsup_{L\to\infty}\ \frac{M_{C}^{\pi}(\tau(\pi)-1)}{\tau(\pi)-1}\cdot\frac{\tau(\pi)-1}{\log L}
(lim supLτ(π)1logL)(minCAlt(C)(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))),\displaystyle\geq\left(\limsup_{L\to\infty}\ \frac{\tau(\pi)-1}{\log L}\right)\cdot\left(\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\right), (112)

where the last line above is due to (46) and the fact that the increment MCπ(n)MCπ(n1)M_{C}^{\pi}(n)-M_{C}^{\pi}(n-1) is bounded for all nKn\geq K. We then note that

minCAlt(C)(d¯,i¯)𝕊Ra=1Kνη,R,C(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia))\displaystyle\min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{\eta,R,C}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
ηminCAlt(C)((d¯,i¯)𝕊Ra=1KνC,Runif(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia)))\displaystyle\geq\eta\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \left(\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{C,R}^{\textsf{unif}}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\right)
+(1η)minCAlt(C)((d¯,i¯)𝕊Ra=1KνC,R(d¯,i¯,a)D((PCa)da(|ia)(PCa)da(|ia)))\displaystyle\hskip 85.35826pt+(1-\eta)\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \left(\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu_{C,R}^{\star}(\underline{d},\underline{i},a)\ D((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))\right)
ηTRunif(C)+(1η)TR(C),\displaystyle\geq\eta\ T_{R}^{\textsf{unif}}(C)+(1-\eta)\ T_{R}^{\star}(C), (113)

thus establishing the desired result.
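
The first inequality in (113) is the elementary fact that the minimum of a convex combination is at least the convex combination of the minima; a short numerical check with arbitrary numbers is below (an illustrative aside).

import numpy as np

# Elementary fact behind the first inequality in (113):
#   min_c [eta X(c) + (1 - eta) Y(c)]  >=  eta min_c X(c) + (1 - eta) min_c Y(c).
# X and Y are arbitrary nonnegative arrays standing in for the two KL sums
# indexed by C' in Alt(C).
rng = np.random.default_rng(3)
eta = 0.25
X, Y = rng.random(10), rng.random(10)
lhs = float(np.min(eta * X + (1 - eta) * Y))
rhs = float(eta * X.min() + (1 - eta) * Y.min())
print(f"min of mixture = {lhs:.4f}  >=  mixture of mins = {rhs:.4f}: {lhs >= rhs}")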

Appendix H Proof of Proposition 3

We prove here that the family {τ(π)/logL:L>1}\{\tau(\pi)/\log L:L>1\}, where π=π(L,η,R)\pi=\pi^{\star}(L,\eta,R), is uniformly integrable. Towards this, we show that

lim supL𝔼Cπ[(τ(π)logL)2]<.\limsup_{L\to\infty}\ \mathbb{E}_{C}^{\pi}\bigg{[}\bigg{(}\frac{\tau(\pi)}{\log L}\bigg{)}^{2}\bigg{]}<\infty. (114)

Then, (114) together with [30, Lemma 3, p. 227] (applied with G(t)=t^{2}) implies the desired uniform integrability. Let

ψ(L)log(L(K1)(K1)!)logL,\displaystyle\psi(L)\coloneqq\frac{\log(L(K-1)(K-1)!)}{\log L}, (115)

and let πC=πC(L,η,R)\pi^{\star}_{C}=\pi^{\star}_{C}(L,\eta,R) denote the version of the policy RR-DCR-BAI that stops only when the event

MCπC(n)log(L(K1)(K1)!)M_{C}^{\pi^{\star}_{C}}(n)\geq\log(L(K-1)(K-1)!)

occurs. Clearly, τ(πC)τ(π)\tau(\pi^{\star}_{C})\geq\tau(\pi) almost surely. Then,

lim supL𝔼Cπ[(τ(π)logL)2]\displaystyle\limsup_{L\to\infty}\ \mathbb{E}_{C}^{\pi}\bigg{[}\bigg{(}\frac{\tau(\pi)}{\log L}\bigg{)}^{2}\bigg{]}
=lim supL0PCπ((τ(π)logL)2>x)𝑑x\displaystyle=\limsup_{L\to\infty}\int_{0}^{\infty}P_{C}^{\pi}\bigg{(}\left(\frac{\tau(\pi)}{\log L}\right)^{2}>x\bigg{)}\,dx
=lim supL0PCπ(τ(π)>(x)(logL))𝑑x\displaystyle=\limsup_{L\to\infty}\int_{0}^{\infty}P_{C}^{\pi}\bigg{(}\tau(\pi)>(\sqrt{x})({\log L})\bigg{)}\,dx
lim supL0PCπ(τ(πC)>(x)(logL))𝑑x\displaystyle\leq\limsup_{L\to\infty}\int_{0}^{\infty}P_{C}^{\pi}\bigg{(}{\tau(\pi^{\star}_{C})}>(\sqrt{x})({\log L})\bigg{)}\,dx
(a)lim supL{ψ(L)+ψ(L)PCπ(τ(πC)>(x)(logL))𝑑x}\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\limsup_{L\to\infty}\bigg{\{}\psi(L)+\int_{\psi(L)}^{\infty}P_{C}^{\pi}\bigg{(}{\tau(\pi^{\star}_{C})}>(\sqrt{x})({\log L})\bigg{)}\,dx\bigg{\}}
lim supLψ(L)+lim supLnψ(L)logL2n+1(logL)2PCπ(τ(πC)>n)\displaystyle\leq\limsup_{L\to\infty}\ \psi(L)+\limsup_{L\to\infty}\sum_{n\geq\sqrt{\psi(L)}\,\log L}^{\infty}\frac{2n+1}{(\log L)^{2}}\ P_{C}^{\pi}(\tau(\pi_{C}^{\star})>n)
1+lim supLnψ(L)logL2n+1(logL)2PCπ(MCπ(n)<log(L(K1)(K1)!)),\displaystyle\leq 1+\limsup_{L\to\infty}\sum_{n\geq\sqrt{\psi(L)}\,\log L}^{\infty}\frac{2n+1}{(\log L)^{2}}\ P_{C}^{\pi}(M_{C}^{\pi}(n)<\log(L(K-1)(K-1)!)), (116)

where (a)(a) above follows by upper bounding the probability term by 11 for all xψ(L)x\leq\psi(L). Below, we show that PCπ(MCπ(n)<log(L(K1)(K1)!))P_{C}^{\pi}(M_{C}^{\pi}(n)<\log(L(K-1)(K-1)!)) is O(1/n3)O(1/n^{3}), which implies that the infinite sum in (116) is finite. This will then prove that the right-hand side of (116) is finite. Note that

PCπ(MCπ(n)<log(L(K1)(K1)!))\displaystyle P_{C}^{\pi}(M_{C}^{\pi}(n)<\log(L(K-1)(K-1)!)) =PCπ(minCAlt(C)ZCCπ(n)<log(L(K1)(K1)!))\displaystyle=P_{C}^{\pi}\left(\min_{C^{\prime}\in\textsf{Alt}(C)}Z_{CC^{\prime}}^{\pi}(n)<\log(L(K-1)(K-1)!)\right)
CAlt(C)PCπ(ZCCπ(n)<log(L(K1)(K1)!)),\displaystyle\leq\sum_{C^{\prime}\in\textsf{Alt}(C)}\ P_{C}^{\pi}\left(Z_{CC^{\prime}}^{\pi}(n)<\log(L(K-1)(K-1)!)\right), (117)

where the last line above follows from the union bound.

We now show that each term inside the summation in (117) is O(1/n3)O(1/n^{3}). Fix CAlt(C)C^{\prime}\in\textsf{Alt}(C) and observe that

PCπ(ZCCπ(n)<log(L(K1)(K1)!))\displaystyle P_{C}^{\pi}\left(Z_{CC^{\prime}}^{\pi}(n)<\log(L(K-1)(K-1)!)\right)
=PCπ(ZCCπ(n)n<log(L(K1)(K1)!)n)\displaystyle=P_{C}^{\pi}\left(\frac{Z_{CC^{\prime}}^{\pi}(n)}{n}<\frac{\log(L(K-1)(K-1)!)}{n}\right)
=PCπ(1na=1KlogPCπ(Xa1a)PCπ(Xa1a)\displaystyle=P_{C}^{\pi}\bigg{(}\frac{1}{n}\ \sum_{a=1}^{K}\ \log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}
+(d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j)nlog(PCa)da(j|ia)(PCa)da(j|ia)<log(L(K1)(K1)!)n)\displaystyle\hskip 56.9055pt+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \frac{N(n,\underline{d},\underline{i},a,j)}{n}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}<\frac{\log(L(K-1)(K-1)!)}{n}\bigg{)}
PCπ(1na=1KlogPCπ(Xa1a)PCπ(Xa1a)<ε)\displaystyle\leq P_{C}^{\pi}\left(\frac{1}{n}\ \sum_{a=1}^{K}\ \log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})}<-\varepsilon\right) (118)
+PCπ((d¯,i¯)𝕊Ra=1Kj𝒮N(n,d¯,i¯,a,j)nlog(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle+P_{C}^{\pi}\bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \frac{N(n,\underline{d},\underline{i},a,j)}{n}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
(d¯,i¯)𝕊Ra=1KN(n,d¯,i¯,a)nDKL((PCa)da(|ia)(PCa)da(|ia))<ε)\displaystyle\hskip 85.35826pt-\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\,D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))<-\varepsilon\bigg{)} (119)
+PCπ((d¯,i¯)𝕊Ra=1KN(n,d¯,i¯,a)nDKL((PCa)da(|ia)(PCa)da(|ia))2ε<log(L(K1)(K1)!)n)\displaystyle+P_{C}^{\pi}\bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \frac{N(n,\underline{d},\underline{i},a)}{n}\,D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))-2\varepsilon<\frac{\log(L(K-1)(K-1)!)}{n}\bigg{)} (120)

for all ε>0\varepsilon>0. Fixing ε\varepsilon, we handle (118)-(120) individually. We also show how to choose ε\varepsilon (later in (130)).

  1. Handling (118): The quantity inside the probability in (118) is W/n, where W\coloneqq\sum_{a=1}^{K}\log\frac{P_{C}^{\pi}(X_{a-1}^{a})}{P_{C^{\prime}}^{\pi}(X_{a-1}^{a})} does not depend on n and takes only finitely many values, all of them finite (by the mutual absolute continuity of the corresponding rows of the TPMs). Hence W/n\geq-\varepsilon surely for all n sufficiently large, whereas the right-hand side is -\varepsilon; therefore (118) equals 0 for all sufficiently large values of n.

  2. Handling (119): This term may be expressed as

    PCπ((d¯,i¯)𝕊Ra=1Kj𝒮[N(n,d¯,i¯,a,j)nN(n,d¯,i¯,a)n(PCa)da(j|ia)]log(PCa)da(j|ia)(PCa)da(j|ia)<ε)\displaystyle P_{C}^{\pi}\bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \bigg{[}\frac{N(n,\underline{d},\underline{i},a,j)}{n}-\frac{N(n,\underline{d},\underline{i},a)}{n}\ (P_{C}^{a})^{d_{a}}(j|i_{a})\bigg{]}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}<-\varepsilon\bigg{)} (121)

    We note that

    Mn{(d¯,i¯)𝕊Ra=1Kj𝒮[N(n,d¯,i¯,a,j)N(n,d¯,i¯,a)(PCa)da(j|ia)]log(PCa)da(j|ia)(PCa)da(j|ia)}nKM_{n}\coloneqq\left\{\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \bigg{[}N(n,\underline{d},\underline{i},a,j)-N(n,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\bigg{]}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\right\}_{n\geq K}

    is a martingale. Indeed, because each row of (PCa)d(P_{C}^{a})^{d} is mutually absolutely continuous with respect to the corresponding row of (PCa)d(P_{C^{\prime}}^{a})^{d} for all d1d\geq 1,

\displaystyle\mathbb{E}_{C}^{\pi}\bigg{[}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \left(N(n,\underline{d},\underline{i},a,j)-N(n,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\bigg{|}\ A_{0:n-1},\bar{X}_{0:n-1}\bigg{]}
    =(d¯,i¯)𝕊Ra=1Kj𝒮(N(n1,d¯,i¯,a,j)N(n1,d¯,i¯,a)(PCa)da(j|ia))log(PCa)da(j|ia)(PCa)da(j|ia)\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \left(N(n-1,\underline{d},\underline{i},a,j)-N(n-1,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}
    +(d¯,i¯)𝕊Ra=1Kj𝒮[𝕀{d¯(n)=d¯,i¯(n)=i¯}(PCπ(An=a,X¯n=j|A0:n1,X¯0:n1)\displaystyle+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \bigg{[}\mathbb{I}_{\{\underline{d}(n)=\underline{d},\leavevmode\nobreak\ \underline{i}(n)=\underline{i}\}}\ \bigg{(}P_{C}^{\pi}(A_{n}=a,\ \bar{X}_{n}=j|A_{0:n-1},\bar{X}_{0:n-1})
    PCπ(An=a|A0:n1,X¯0:n1)(PCa)da(j|ia))log(PCa)da(j|ia)(PCa)da(j|ia)]\displaystyle\hskip 156.49014pt-P_{C}^{\pi}(A_{n}=a|A_{0:n-1},\bar{X}_{0:n-1})\ (P_{C}^{a})^{d_{a}}(j|i_{a})\bigg{)}\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\bigg{]}
    =(d¯,i¯)𝕊Ra=1Kj𝒮(N(n1,d¯,i¯,a,j)N(n1,d¯,i¯,a)(PCa)da(j|ia))log(PCa)da(j|ia)(PCa)da(j|ia),\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \left(N(n-1,\underline{d},\underline{i},a,j)-N(n-1,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}, (122)

    where the last line follows by noting that when (d¯(n),i¯(n))=(d¯,i¯)(\underline{d}(n),\underline{i}(n))=(\underline{d},\underline{i}), under the assignment of the TPMs CC,

    PCπ(X¯n=j|An=a,A0:n1,X¯0:n1)=(PCa)da(j|ia).P_{C}^{\pi}(\bar{X}_{n}=j|A_{n}=a,A_{0:n-1},\bar{X}_{0:n-1})=(P_{C}^{a})^{d_{a}}(j|i_{a}).

Further, the above martingale has bounded increments, and its quadratic variation \langle M_{n}\rangle satisfies

    Mn\displaystyle\langle M_{n}\rangle
    t=Kn𝔼Cπ[((d¯,i¯)𝕊Ra=1Kj𝒮𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a}(𝕀{X¯t=j}(PCa)da(j|ia))log(PCa)da(j|ia)(PCa)da(j|ia))2|t]\displaystyle\coloneqq\sum_{t=K}^{n}\ \mathbb{E}_{C}^{\pi}\bigg{[}\bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d},\ \underline{i}(t)=\underline{i},\ A_{t}=a\}}\ \left(\mathbb{I}_{\{\bar{X}_{t}=j\}}-(P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\bigg{)}^{2}\bigg{|}\mathcal{F}_{t}\bigg{]}
    t=Kn𝔼Cπ[(d¯,i¯)𝕊Ra=1Kj𝒮𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a}(𝕀{X¯t=j}(PCa)da(j|ia))2(log(PCa)da(j|ia)(PCa)da(j|ia))2|t]\displaystyle\leq\sum_{t=K}^{n}\ \mathbb{E}_{C}^{\pi}\bigg{[}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d},\ \underline{i}(t)=\underline{i},\ A_{t}=a\}}\ \left(\mathbb{I}_{\{\bar{X}_{t}=j\}}-(P_{C}^{a})^{d_{a}}(j|i_{a})\right)^{2}\ \bigg{(}\log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\bigg{)}^{2}\bigg{|}\mathcal{F}_{t}\bigg{]}
    (a)4A2t=Kn𝔼Cπ[(d¯,i¯)𝕊Ra=1Kj𝒮𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a}|t]\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}4A^{2}\ \sum_{t=K}^{n}\ \mathbb{E}_{C}^{\pi}\bigg{[}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d},\leavevmode\nobreak\ \underline{i}(t)=\underline{i},\leavevmode\nobreak\ A_{t}=a\}}\bigg{|}\mathcal{F}_{t}\bigg{]}
    n(4A2|𝒮|)\displaystyle\leq n\ (4A^{2}|\mathcal{S}|) (123)

    almost surely, where AA above is as defined in (109). We then have

    PCπ((d¯,i¯)𝕊Ra=1Kj𝒮(N(n,d¯,i¯,a,j)N(n,d¯,i¯,a)(PCa)da(j|ia))log(PCa)da(j|ia)(PCa)da(j|ia)<nε)\displaystyle P_{C}^{\pi}\left(\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \left(N(n,\underline{d},\underline{i},a,j)-N(n,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}<-n\varepsilon\right)
    PCπ(|(d¯,i¯)𝕊Ra=1Kj𝒮(N(n,d¯,i¯,a,j)N(n,d¯,i¯,a)(PCa)da(j|ia))log(PCa)da(j|ia)(PCa)da(j|ia)|>nε)\displaystyle\leq P_{C}^{\pi}\left(\left\lvert\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \left(N(n,\underline{d},\underline{i},a,j)-N(n,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\right\rvert>n\varepsilon\right)
\displaystyle\leq P_{C}^{\pi}\left(\sup_{K\leq t\leq n}\ \left\lvert\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \sum_{j\in\mathcal{S}}\ \left(N(t,\underline{d},\underline{i},a,j)-N(t,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\right\rvert>n\varepsilon\right)
\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\frac{\mathbb{E}_{C}^{\pi}\left[\left(\sup\limits_{K\leq t\leq n}\ \left\lvert\sum\limits_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum\limits_{a=1}^{K}\ \sum\limits_{j\in\mathcal{S}}\ \left(N(t,\underline{d},\underline{i},a,j)-N(t,\underline{d},\underline{i},a)\ (P_{C}^{a})^{d_{a}}(j|i_{a})\right)\ \log\frac{(P_{C}^{a})^{d_{a}}(j|i_{a})}{(P_{C^{\prime}}^{a})^{d_{a}}(j|i_{a})}\right\rvert\right)^{6}\right]}{n^{6}\ \varepsilon^{6}}
    (b)Bn6ε6𝔼Cπ[|Mn|3]\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\frac{B}{n^{6}\ \varepsilon^{6}}\ \mathbb{E}_{C}^{\pi}[|\langle M_{n}\rangle|^{3}]
    (c)Bn6ε6n3(4A2|𝒮|)3\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\frac{B}{n^{6}\ \varepsilon^{6}}\cdot n^{3}\cdot\left(4A^{2}|\mathcal{S}|\right)^{3}
    =An3,\displaystyle=\frac{A^{\prime}}{n^{3}}, (124)

    where (a)(a) above is due to Markov’s inequality, (b)(b) is due to Burkholder’s inequality [31, pp. 414], and (c)(c) follows from (123). We have thus shown that (119) is O(1/n3)O(1/n^{3}).

  3. Handling (120): Observe that (120) may be upper bounded by

    PCπ(N(n,d¯,i¯,a)nDKL((PCa)da(|ia)(PCa)da(|ia))2ε<log(L(K1)(K1)!)n)P_{C}^{\pi}\bigg{(}\frac{N(n,\underline{d},\underline{i},a)}{n}\,D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))-2\varepsilon<\frac{\log(L(K-1)(K-1)!)}{n}\bigg{)} (125)

    for all (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R} and a𝒜a\in\mathcal{A}. Fix d¯=d¯=(K,K1,,1)\underline{d}=\underline{d}^{\star}=(K,K-1,\ldots,1), i¯=i¯𝒮K\underline{i}=\underline{i}^{\star}\in\mathcal{S}^{K}, and a𝒜a\in\mathcal{A}. Using the convergence in (102), we get that for every ϵ>0\epsilon^{\prime}>0, under the policy πNS(L,η,R)\pi_{\textsf{NS}}^{\star}(L,\eta,R),

    N(n,d¯,i¯,a)nνη,R,C(d¯,i¯,a)(1ϵ)\frac{N(n,\underline{d}^{\star},\underline{i}^{\star},a)}{n}\geq\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime}) (126)

    for all nn large, almost surely. Leveraging this, we define

    En{ωΩ:N(n,d¯,i¯,a,ω)nνη,R,C(d¯,i¯,a)(1ϵ)},E_{n}\coloneqq\left\{\omega\in\Omega:\frac{N(n,\underline{d}^{\star},\underline{i}^{\star},a,\omega)}{n}\geq\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})\right\}, (127)

    and write (125) as the sum of two terms, one of which is

    PCπ(En{N(n,d¯,i¯,a)nDKL((PCa)da(|ia)(PCa)da(|ia))2ε<log(L(K1)(K1)!)n})\displaystyle P_{C}^{\pi}\bigg{(}E_{n}\bigcap\left\{\frac{N(n,\underline{d}^{\star},\underline{i}^{\star},a)}{n}\,D_{\textsf{KL}}((P_{C}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a})\|(P_{C^{\prime}}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a}))-2\varepsilon<\frac{\log(L(K-1)(K-1)!)}{n}\right\}\bigg{)}
    PCπ((1ϵ)νη,R,C(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))2ε<log(L(K1)(K1)!)n),\displaystyle\leq P_{C}^{\pi}\left((1-\epsilon^{\prime})\ \nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a})\|(P_{C^{\prime}}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a}))-2\varepsilon<\frac{\log(L(K-1)(K-1)!)}{n}\right), (128)

    and the other is

    PCπ(Enc{N(n,d¯,i¯,a)nDKL((PCa)da(|ia)(PCa)da(|ia))2ε<log(L(K1)(K1)!)n})\displaystyle P_{C}^{\pi}\bigg{(}E_{n}^{c}\bigcap\left\{\frac{N(n,\underline{d}^{\star},\underline{i}^{\star},a)}{n}\,D_{\textsf{KL}}((P_{C}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a})\|(P_{C^{\prime}}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a}))-2\varepsilon<\frac{\log(L(K-1)(K-1)!)}{n}\right\}\bigg{)}
    PCπ(N(n,d¯,i¯,a)n<νη,R,C(d¯,i¯,a)(1ϵ)).\displaystyle\leq P_{C}^{\pi}\left(\frac{N(n,\underline{d}^{\star},\underline{i}^{\star},a)}{n}<\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})\right). (129)

    We shall see how to choose ϵ\epsilon^{\prime} (later in (136)). Choosing ε\varepsilon such that

    (1ϵ)νη,R,C(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))2ε>0,(1-\epsilon^{\prime})\ \nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a})\|(P_{C^{\prime}}^{a})^{d^{\star}_{a}}(\cdot|i^{\star}_{a}))-2\varepsilon>0, (130)

    we see that the left hand side of the probability term in (128) is strictly positive, whereas the right hand side goes to 0 as nn\to\infty. Thus, for all sufficiently large values of nn, (128) equals 0.

    It now remains to show that (129) is O(1/n3)O(1/n^{3}).

    Showing that (129) is O(1/n3)O(1/n^{3}):
    Let MM be a large, positive integer such that (75) holds for all mMm\geq M. Along the lines of the proof of irreducibility presented in Appendix B, it can be shown that for all (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R} and T0T_{0},

    PCπ(d¯(T0+N)=d¯,i¯(T0+N)=i¯|d¯(T0)=d¯,i¯(T0)=i¯)(ηKμRmin)M+Kε¯K\displaystyle P_{C}^{\pi}(\underline{d}(T_{0}+N)=\underline{d}^{\star},\underline{i}(T_{0}+N)=\underline{i}^{\star}|\underline{d}(T_{0})=\underline{d},\underline{i}(T_{0})=\underline{i})\geq\left(\frac{\eta}{K}\ \mu_{R}^{\textsf{min}}\right)^{M+K}\ \bar{\varepsilon}^{K} (131)

    for N=M+KN=M+K, where μRmin\mu_{R}^{\textsf{min}} is as defined in (37), and ε¯\bar{\varepsilon} is as defined in (77). Eq. (131) states that under the policy RR-DCR-BAI, the probability of starting from any (d¯,i¯)𝕊R(\underline{d},\underline{i})\in\mathbb{S}_{R} and reaching the state (d¯,i¯)(\underline{d}^{\star},\underline{i}^{\star}) after N=M+KN=M+K time instants may be lower bounded uniformly over all starting states. In the literature on controlled Markov processes, such a phenomenon is referred to as Doeblin’s minorisation condition [32, Eq. (5)]. Writing ρ\rho to denote the constant on the right hand side of (131), we get that for all nM+Kn\geq M+K,

    PCπ(d¯(n)=d¯,i¯(n)=i¯)\displaystyle P_{C}^{\pi}(\underline{d}(n)=\underline{d}^{\star},\underline{i}(n)=\underline{i}^{\star})
    =(d¯,i¯)𝕊R[PCπ(d¯(n)=d¯,i¯(n)=i¯|d¯(nMK)=d¯,i¯(nMK)=i¯)\displaystyle=\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \Big{[}P_{C}^{\pi}(\underline{d}(n)=\underline{d}^{\star},\underline{i}(n)=\underline{i}^{\star}|\underline{d}(n-M-K)=\underline{d},\underline{i}(n-M-K)=\underline{i})
    PCπ(d¯(nMK)=d¯,i¯(nMK)=i¯)]\displaystyle\hskip 142.26378pt\cdot P_{C}^{\pi}(\underline{d}(n-M-K)=\underline{d},\underline{i}(n-M-K)=\underline{i})\Big{]}
    ρ(d¯,i¯)𝕊RPCπ(d¯(nMK)=d¯,i¯(nMK)=i¯)\displaystyle\geq\rho\cdot\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ P_{C}^{\pi}(\underline{d}(n-M-K)=\underline{d},\underline{i}(n-M-K)=\underline{i})
    =ρ.\displaystyle=\rho. (132)

    We now note that for all nM+Kn\geq M+K,

    N(n,d¯,i¯,a)\displaystyle N(n,\underline{d}^{\star},\underline{i}^{\star},a) =t=Kn𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a}\displaystyle=\sum_{t=K}^{n}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d}^{\star},\ \underline{i}(t)=\underline{i}^{\star},\ A_{t}=a\}}
    t=M+Kn𝕀{d¯(t)=d¯,i¯(t)=i¯,At=a}almost surely.\displaystyle\geq\sum_{t=M+K}^{n}\ \mathbb{I}_{\{\underline{d}(t)=\underline{d}^{\star},\ \underline{i}(t)=\underline{i}^{\star},\ A_{t}=a\}}\quad\text{almost surely}. (133)

    Denoting the right hand side of (133) by N(n,d¯,i¯,a)N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a), we note that

    𝔼Cπ[N(n,d¯,i¯,a)]\displaystyle\mathbb{E}_{C}^{\pi}[N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)] =t=M+KnPCπ(d¯(t)=d¯,i¯(t)=i¯,At=a)\displaystyle=\sum_{t=M+K}^{n}\ P_{C}^{\pi}(\underline{d}(t)=\underline{d}^{\star},\ \underline{i}(t)=\underline{i}^{\star},\ A_{t}=a)
    t=M+KnρPCπ(At=a|d¯(t)=d¯,i¯(t)=i¯)\displaystyle\geq\sum_{t=M+K}^{n}\ \rho\cdot P_{C}^{\pi}(A_{t}=a|\underline{d}(t)=\underline{d}^{\star},\ \underline{i}(t)=\underline{i}^{\star})
    =t=M+Knρλη,R,C¯(t)(a|d¯,i¯)\displaystyle=\sum_{t=M+K}^{n}\ \rho\cdot\lambda_{\eta,R,\bar{C}(t)}(a|\underline{d}^{\star},\underline{i}^{\star})
    (nMK+1)ρ(ηKμRmin),\displaystyle\geq(n-M-K+1)\cdot\rho\cdot\left(\frac{\eta}{K}\ \mu_{R}^{\textsf{min}}\right), (134)

    where μRmin\mu_{R}^{\textsf{min}} is as defined in (37).

    For all nM+Kn\geq M+K, we then have

    PCπ(N(n,d¯,i¯,a)<nνη,R,C(d¯,i¯,a)(1ϵ))\displaystyle P_{C}^{\pi}\bigg{(}N(n,\underline{d}^{\star},\underline{i}^{\star},a)<n\,\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})\bigg{)}
    PCπ(N(n,d¯,i¯,a)<nνη,R,C(d¯,i¯,a)(1ϵ))\displaystyle\leq P_{C}^{\pi}\bigg{(}N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)<n\,\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})\bigg{)}
    =PCπ(N(n,d¯,i¯,a)𝔼Cπ[N(n,d¯,i¯,a)]<nνη,R,C(d¯,i¯,a)(1ϵ)𝔼Cπ[N(n,d¯,i¯,a)])\displaystyle=P_{C}^{\pi}\bigg{(}N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)-\mathbb{E}_{C}^{\pi}[N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)]<n\,\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})-\mathbb{E}_{C}^{\pi}[N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)]\bigg{)}
    PCπ(N(n,d¯,i¯,a)𝔼Cπ[N(n,d¯,i¯,a)]<nνη,R,C(d¯,i¯,a)(1ϵ)\displaystyle\leq P_{C}^{\pi}\bigg{(}N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)-\mathbb{E}_{C}^{\pi}[N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)]<n\,\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})
    (nMK+1)ρηKμRmin)\displaystyle\hskip 241.84842pt-(n-M-K+1)\cdot\rho\cdot\frac{\eta}{K}\cdot\mu_{R}^{\textsf{min}}\bigg{)}
    PCπ(N(n,d¯,i¯,a)𝔼Cπ[N(n,d¯,i¯,a)]<n{νη,R,C(d¯,i¯,a)(1ϵ)\displaystyle\leq P_{C}^{\pi}\bigg{(}N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)-\mathbb{E}_{C}^{\pi}[N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a)]<n\bigg{\{}\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})
    (nMK+1n)ρημRminK}).\displaystyle\hskip 227.62204pt-\left(\frac{n-M-K+1}{n}\right)\cdot\frac{\rho\,\eta\,\mu_{R}^{\textsf{min}}}{K}\bigg{\}}\bigg{)}. (135)

    Noting that

    nMK+1n12\frac{n-M-K+1}{n}\geq\frac{1}{2}

    for all nn sufficiently large, we choose ϵ\epsilon^{\prime} such that

    νη,R,C(d¯,i¯,a)(1ϵ)ρημRmin2K<0.\nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)(1-\epsilon^{\prime})-\frac{\rho\,\eta\,\mu_{R}^{\textsf{min}}}{2K}<0.

    For instance, it suffices to set

    ϵ=112ρημRmin2Kνη,R,C(d¯,i¯,a).\epsilon^{\prime}=1-\frac{1}{2}\ \frac{\rho\,\eta\,\mu_{R}^{\textsf{min}}}{2K\ \nu_{\eta,R,C}(\underline{d}^{\star},\underline{i}^{\star},a)}. (136)

    For this choice of \epsilon^{\prime}, the probability term in (135) may be bounded above by a term that decays exponentially in n, using concentration inequalities for sub-Gaussian random variables [33, p. 25]; here, N^{\prime}(n,\underline{d}^{\star},\underline{i}^{\star},a) is a sum of (not necessarily independent) indicator random variables, each of which is sub-Gaussian with variance factor 1/4, and is therefore itself sub-Gaussian [7, Lemma 17]. Because an exponentially decaying bound is eventually smaller than 1/n^{3} (a small numerical comparison follows this list), the probability term in (135), and hence (129), is O(1/n^{3}). Combined with items 1 and 2 above, this shows that each term inside the summation in (117) is O(1/n^{3}), which establishes (114) and completes the proof.
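
As an illustrative aside referenced in item 3 above, the following sketch compares a Hoeffding-type exponential tail bound for a sum of n indicator variables with 1/n^3; the gap delta is an illustrative stand-in for the strictly positive margin produced by the choice of \epsilon^{\prime} in (136).

import numpy as np

# Comparison of an exponentially decaying tail bound with 1/n^3.  The gap delta
# is an illustrative stand-in for the strictly positive margin produced by the
# choice of eps' in (136); any fixed delta > 0 eventually wins against 1/n^3.
delta = 0.1
for n in (500, 1_000, 2_000, 5_000, 10_000):
    tail = np.exp(-2 * n * delta ** 2)   # Hoeffding-type bound for a sum of n indicators
    print(f"n = {n:6d}:  exp(-2 n delta^2) = {tail:.3e}   vs   1/n^3 = {1 / n ** 3:.3e}")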

Appendix I Proof of Lemma 7

Suppose that Pk(|i)=μk()P_{k}(\cdot|i)=\mu_{k}(\cdot) for all k=1,,Kk=1,\ldots,K and i𝒮i\in\mathcal{S}. Fixing C𝒞C\in\mathcal{C}, it follows that for all dd\in\mathbb{N}, i𝒮i\in\mathcal{S}, and CAlt(C)C^{\prime}\in\textsf{Alt}(C),

DKL((PCa)d(|i)(PCa)d(|i))=DKL(μCaμCa),\displaystyle D_{\textsf{KL}}((P_{C}^{a})^{d}(\cdot|i)\|(P_{C^{\prime}}^{a})^{d}(\cdot|i))=D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a}), (137)

where μCa\mu_{C}^{a} denotes the stationary distribution associated with the TPM PCaP_{C}^{a}. As a consequence of (137), we have

T(C)\displaystyle T^{\star}(C) =supνminCAlt(C)(d¯,i¯)𝕊a=1Kν(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
=supνminCAlt(C)(d¯,i¯)𝕊a=1Kν(d¯,i¯,a)DKL(μCaμCa)\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})
=supκminCAlt(C)a=1Kκ(a)DKL(μCaμCa),\displaystyle=\sup_{\kappa}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{a=1}^{K}\ \kappa(a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a}), (138)

where κ(a)(d¯,i¯)𝕊ν(d¯,i¯,a)\kappa(a)\coloneqq\sum_{(\underline{d},\underline{i})\in\mathbb{S}}\nu(\underline{d},\underline{i},a) for all a𝒜a\in\mathcal{A}, and the supremum in (138) is over all κ\kappa which are probability distributions on the set of arms 𝒜\mathcal{A}.

For a fixed R(K,)R\in\mathbb{N}\cap(K,\infty), suppose that

𝕊1=a=1K𝕊R,a,𝕊2=𝕊R𝕊1.\mathbb{S}_{1}=\bigcup_{a=1}^{K}\ \mathbb{S}_{R,a},\quad\mathbb{S}_{2}=\mathbb{S}_{R}\setminus\mathbb{S}_{1}.

Then,

TR(C)\displaystyle T_{R}^{\star}(C)
=supνminCAlt(C)(d¯,i¯)𝕊Ra=1Kν(d¯,i¯,a)DKL((PCa)da(|ia)(PCa)da(|ia))\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}((P_{C}^{a})^{d_{a}}(\cdot|i_{a})\|(P_{C^{\prime}}^{a})^{d_{a}}(\cdot|i_{a}))
=supνminCAlt(C)(d¯,i¯)𝕊Ra=1Kν(d¯,i¯,a)DKL(μCaμCa)\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})
=supνminCAlt(C){(d¯,i¯)𝕊1a=1Kν(d¯,i¯,a)DKL(μCaμCa)+(d¯,i¯)𝕊2a=1Kν(d¯,i¯,a)DKL(μCaμCa)}\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\bigg{\{}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{1}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{2}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})\bigg{\}}
=supνminCAlt(C){a=1K(d¯,i¯)𝕊R,aν(d¯,i¯,a)DKL(μCaμCa)+(d¯,i¯)𝕊2a=1Kν(d¯,i¯,a)DKL(μCaμCa)}\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\bigg{\{}\sum_{a=1}^{K}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a}}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{2}}\ \sum_{a=1}^{K}\ \nu(\underline{d},\underline{i},a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})\bigg{\}}
=supνminCAlt(C){a=1KDKL(μCaμCa)((d¯,i¯)𝕊R,aν(d¯,i¯,a)+(d¯,i¯)𝕊2ν(d¯,i¯,a))}\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\bigg{\{}\sum_{a=1}^{K}\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})\ \bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a}}\nu(\underline{d},\underline{i},a)+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{2}}\ \nu(\underline{d},\underline{i},a)\bigg{)}\bigg{\}}
=(a)supνminCAlt(C){a=1KDKL(μCaμCa)((d¯,i¯)𝕊R,aν(d¯,i¯,a)+aa(d¯,i¯)𝕊R,aν(d¯,i¯,a)+(d¯,i¯)𝕊2ν(d¯,i¯,a))}\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\bigg{\{}\sum_{a=1}^{K}\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})\ \bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a}}\nu(\underline{d},\underline{i},a)+\sum_{a^{\prime}\neq a}\ \sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R,a^{\prime}}}\nu(\underline{d},\underline{i},a)+\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{2}}\ \nu(\underline{d},\underline{i},a)\bigg{)}\bigg{\}}
=supνminCAlt(C){a=1KDKL(μCaμCa)((d¯,i¯)𝕊Rν(d¯,i¯,a))}\displaystyle=\sup_{\nu}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\bigg{\{}\sum_{a=1}^{K}\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a})\ \bigg{(}\sum_{(\underline{d},\underline{i})\in\mathbb{S}_{R}}\nu(\underline{d},\underline{i},a)\bigg{)}\bigg{\}}
=supκminCAlt(C)a=1Kκ(a)DKL(μCaμCa),\displaystyle=\sup_{\kappa}\ \min_{C^{\prime}\in\textsf{Alt}(C)}\ \sum_{a=1}^{K}\ \kappa(a)\ D_{\textsf{KL}}(\mu_{C}^{a}\|\mu_{C^{\prime}}^{a}), (139)

where (a) above follows from the observation that any \nu participating in the supremum in (a) meets the R-max-delay constraint in (26), and therefore satisfies \nu(\underline{d},\underline{i},a)=0 for all (\underline{d},\underline{i})\in\mathbb{S}_{R,a^{\prime}}, a^{\prime}\neq a. From (138) and (139), we see that T_{R}^{\star}(C)=T^{\star}(C) for all R, thus proving that \lim_{R\to\infty}T_{R}^{\star}(C)=T^{\star}(C) in the special case when each of the arm TPMs has identical rows.
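
As an illustrative aside, the common value in (138) and (139) can be computed numerically in the identical-rows special case by a grid search over the arm-selection distribution kappa. In the sketch below, each arm's TPM has all rows equal to mu_a, the best arm maximises the stationary average of f, and Alt(C) is taken to be the set of permutation assignments whose best arm differs; the distributions mu_a, the function f, and the grid resolution are illustrative choices.

import itertools
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Grid-search evaluation of (138)-(139) in the identical-rows special case.  Each
# arm's TPM has all rows equal to mu_a, so every d-step row equals mu_a as well.
# The three distributions, the function f, and the grid over kappa are illustrative.
mu = [np.array([0.7, 0.2, 0.1]),
      np.array([0.3, 0.4, 0.3]),
      np.array([0.1, 0.3, 0.6])]
f = np.array([1.0, 2.0, 3.0])
K = len(mu)

means = [float(m @ f) for m in mu]
best_tpm = int(np.argmax(means))      # index of the best TPM
best_arm = best_tpm                   # under the identity assignment C

# Alternative assignments: permutations sigma (arm a holds TPM sigma[a]) whose
# best arm differs from best_arm.
alt = [sigma for sigma in itertools.permutations(range(K))
       if sigma.index(best_tpm) != best_arm]

def inner(kappa):
    return min(sum(kappa[a] * kl(mu[a], mu[sigma[a]]) for a in range(K) if sigma[a] != a)
               for sigma in alt)

grid = np.linspace(0.0, 1.0, 51)
best_val, best_kappa = -1.0, None
for k1 in grid:
    for k2 in grid:
        if k1 + k2 <= 1.0 + 1e-9:
            kappa = (k1, k2, max(0.0, 1.0 - k1 - k2))
            val = inner(kappa)
            if val > best_val:
                best_val, best_kappa = val, kappa
print(f"T*(C) (approx.) = {best_val:.4f} at kappa = "
      f"{tuple(round(float(k), 2) for k in best_kappa)}")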

References

  • [1] H. Chernoff, “Sequential design of experiments,” The Annals of Mathematical Statistics, vol. 30, no. 3, pp. 755–770, 1959.
  • [2] A. E. Albert, “The sequential design of experiments for infinitely many states of nature,” The Annals of Mathematical Statistics, pp. 774–799, 1961.
  • [3] A. Garivier and E. Kaufmann, “Optimal best arm identification with fixed confidence,” in Conference on Learning Theory. PMLR, 2016, pp. 998–1027.
  • [4] V. Moulos, “Optimal best Markovian arm identification with fixed confidence,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [5] E. Kaufmann, O. Cappé, and A. Garivier, “On the complexity of best-arm identification in multi-armed bandit models,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1–42, 2016.
  • [6] P. N. Karthik and R. Sundaresan, “Detecting an odd restless Markov arm with a trembling hand,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 5230–5258, 2021.
  • [7] ——, “Learning to detect an odd restless Markov arm with a trembling hand,” arXiv preprint arXiv:2105.03603, 2021.
  • [8] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
  • [9] V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part II: Markovian rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 977–982, 1987.
  • [10] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
  • [11] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” Journal of Machine Learning Research, vol. 5, no. 6, pp. 623–648, 2004.
  • [12] E. Even-Dar, S. Mannor, and Y. Mansour, “PAC bounds for multi-armed bandit and Markov decision processes,” in International Conference on Computational Learning Theory. Springer, 2002, pp. 255–270.
  • [13] Z. Karnin, T. Koren, and O. Somekh, “Almost optimal exploration in multi-armed bandits,” in International Conference on Machine Learning. PMLR, 2013, pp. 1238–1246.
  • [14] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck, “lil’UCB: An optimal exploration algorithm for multi-armed bandits,” in Conference on Learning Theory. PMLR, 2014, pp. 423–439.
  • [15] J.-Y. Audibert, S. Bubeck, and R. Munos, “Best arm identification in multi-armed bandits,” in Conference on Learning Theory. JMLR, 2010, pp. 41–53.
  • [16] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone, “PAC subset selection in stochastic multi-armed bandits,” in International Conference on Machine Learning, vol. 12. PMLR, 2012, pp. 655–662.
  • [17] E. Kaufmann and S. Kalyanakrishnan, “Information complexity in bandit subset selection,” in Conference on Learning Theory. PMLR, 2013, pp. 228–251.
  • [18] S. Gupta, G. Joshi, and O. Yağan, “Best arm identification in correlated multi-armed bandits,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 2, pp. 549–563, 2021.
  • [19] Z. Zhong, W. C. Cheung, and V. Y. F. Tan, “Best arm identification for cascading bandits in the fixed confidence setting,” in International Conference on Machine Learning. PMLR, 2020, pp. 11481–11491.
  • [20] O. Dekel, J. Ding, T. Koren, and Y. Peres, “Bandits with switching costs: T^{2/3} regret,” in Proceedings of the ACM Symposium on Theory of Computing, 2014, pp. 459–467.
  • [21] Z. Zhong, W. C. Cheung, and V. Y. F. Tan, “Probabilistic sequential shrinking: A best arm identification algorithm for stochastic bandits with corruptions,” in International Conference on Machine Learning. PMLR, 2021, pp. 12772–12781.
  • [22] P. N. Karthik and R. Sundaresan, “Learning to detect an odd Markov arm,” IEEE Transactions on Information Theory, vol. 66, no. 7, pp. 4324–4348, 2020.
  • [23] A. Deshmukh, S. Bhashyam, and V. V. Veeravalli, “Controlled sensing for composite multihypothesis testing with application to anomaly detection,” in Asilomar Conference on Signals, Systems, and Computers. IEEE, 2018, pp. 2109–2113.
  • [24] A. Deshmukh, V. V. Veeravalli, and S. Bhashyam, “Sequential controlled sensing for composite multihypothesis testing,” Sequential Analysis, vol. 40, no. 2, pp. 259–289, 2021.
  • [25] G. R. Prabhu, S. Bhashyam, A. Gopalan, and R. Sundaresan, “Sequential multi-hypothesis testing in multi-armed bandit problems: An approach for asymptotic optimality,” IEEE Transactions on Information Theory, 2022.
  • [26] V. S. Borkar, “Control of Markov chains with long-run average cost criterion,” in Stochastic Differential Systems, Stochastic Control Theory and Applications. Springer, 1988, pp. 57–77.
  • [27] D. A. Levin and Y. Peres, Markov Chains and Mixing Times. American Mathematical Society, 2017, vol. 107.
  • [28] V. H. de la Peña, “A general class of exponential inequalities for martingales and ratios,” The Annals of Probability, vol. 27, no. 1, pp. 537–564, 1999.
  • [29] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.
  • [30] A. N. Shiryaev, Probability-1. Springer, 2016, vol. 95.
  • [31] Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales. Springer Science & Business Media, 2012.
  • [32] I. Kontoyiannis, L. A. Lastras-Montaño, and S. P. Meyn, “Relative entropy and exponential deviation bounds for general Markov chains,” in Proceedings of the International Symposium on Information Theory. IEEE, 2005, pp. 1563–1567.
  • [33] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.