
Transfer in Sequential Multi-armed Bandits via Reward Samples

NR Rahul and Vaibhav Katewa

NR Rahul is with the Department of Electrical Communication Engineering (ECE) at the Indian Institute of Science, Bengaluru, India. Email: rahulnr@iisc.ac.in. Vaibhav Katewa is with the Robert Bosch Center for Cyber-Physical Systems and the Department of ECE at the Indian Institute of Science, Bengaluru, India. Email: vkatewa@iisc.ac.in.
Abstract

We consider a sequential stochastic multi-armed bandit problem where the agent interacts with the bandit over multiple episodes. The reward distributions of the arms remain constant throughout an episode but can change across episodes. We propose an algorithm based on UCB that transfers reward samples from previous episodes to improve the cumulative regret over all episodes. We provide a regret analysis and empirical results for our algorithm, which show a significant improvement over the standard UCB algorithm without transfer.

I INTRODUCTION

The Multi-armed Bandit (MAB) problem [1, 2, 3] is a popular sequential decision-making problem where an agent interacts with the environment by taking an action at every time step and in return receives a random reward. The goal of the agent is to maximize the average reward received. Recently, there has been considerable interest in applying the MAB problem to online advertisements and recommender systems [4, 5]. One of the problems highlighted in [5] is the user cold-start problem, i.e., the inability of a recommender system to make good recommendations for a new user in the absence of any prior information. In this scenario, it is useful to transfer knowledge from other related users in order to make better initial recommendations to the new user. In the context of a MAB problem, transfer learning uses knowledge from one bandit problem to improve the performance on another related bandit problem [6, 7]. In particular, it helps to accelerate learning and enables good decisions to be made early on.

In this paper, we consider a sequential stochastic MAB problem where the agent interacts with the environment sequentially in episodes (similar to [6]), where different episodes are synonymous with different tasks or different bandit problems. The reward distributions of the arms remain constant throughout an episode but change over different episodes. This scenario is useful, for instance, in recommender systems, where the reward distributions of recommended items change over time, reflecting changing user preferences. The goal is to leverage the knowledge from previous episodes in order to improve the performance in the current episode, thereby leading to an overall performance improvement. Towards this, we use reward samples from previous episodes to make decisions in the current episode. Our algorithm is based on the UCB algorithm for bandits [8].

Related Work: Transfer learning in the context of MAB has been studied in the frameworks of multi-task learning [6, 9, 10, 11] and meta-learning [12, 13, 14]. In multi-task learning, the set of tasks is fixed and the tasks are repeatedly encountered by the learning algorithm, whereas in meta-learning, the algorithm learns to adapt to a new task after learning from a few tasks drawn from the same task distribution. For instance, the authors in [6] consider a finite set of bandit problems which are encountered repeatedly over time. In contrast, we consider an infinite set of bandit problems but assume that the problems are “similar” (we define the notion of “similarity” later). The idea of transferring knowledge using samples is also used in the SW-UCB algorithm of [15], but it can suffer from negative transfer, where the transferred knowledge degrades performance. In contrast, our algorithm facilitates knowledge transfer while guaranteeing that there is no negative transfer.

The main contributions of the paper are:
(i) We develop an algorithm based on UCB to transfer knowledge using the reward samples from previous episodes in a sequential stochastic MAB setting. Our algorithm achieves better performance than UCB with no transfer.
(ii) We provide a regret analysis for the proposed algorithm, and our regret upper bound explicitly captures the performance improvement due to transfer.
(iii) We show via numerical simulations that our algorithm is able to effectively transfer knowledge from previous episodes.

Notations: $\mathds{1}\{E\}$ denotes the indicator function, whose value is $1$ if the event (condition) $E$ is true and $0$ otherwise. $\emptyset$ denotes the null set.

II PRELIMINARIES AND PROBLEM STATEMENT

We consider the Multi-Armed Bandit problem with $K$ arms and $J$ episodes. The length of each episode is $n$. Define $[K]\triangleq\{1,2,\cdots,K\}$ and $[J]\triangleq\{1,2,\cdots,J\}$. At any given integer time $t>0$, one among the $K$ arms is pulled and a random reward is received. Let $I_{t}\in[K]$ and $r_{I_{t}}$ denote the arm pulled at time $t$ and the corresponding random reward, respectively. We assume that $r_{I_{t}}\in[0,1]$ and that the rewards are independent across time and across all arms. In any given episode, the distributions of the arms do not change. However, they are allowed to be different over different episodes.

Let $\mu_{k}^{j}$ be the mean reward of arm $k$ in episode $j$. Let $\mu^{j}\triangleq[\mu_{1}^{j},\mu_{2}^{j},\cdots,\mu_{K}^{j}]^{T}$ denote the vector containing the mean rewards of all arms for episode $j$. Further, let $k^{j}_{*}\in\mathcal{A}^{j}\triangleq\underset{k\in[K]}{\arg\max}\{\mu_{k}^{j}\}$ and $\mu^{j}_{*}=\max_{k\in[K]}\{\mu_{k}^{j}\}$ denote an optimal arm in episode $j$ and its mean reward, respectively (there may be more than one optimal arm with the same maximum mean reward). Define $\Delta_{k}^{j}=\mu_{*}^{j}-\mu_{k}^{j}>0$ as the sub-optimality gap of arm $k\notin\mathcal{A}^{j}$ in episode $j$. Note that the mean rewards of the arms are unknown.

We assume that the episodes in the MAB problem are related in the sense that the mean rewards of the arms across episodes do not change considerably. We capture this by the following assumption.

Assumption 1.

We assume that $||\mu^{j_{1}}-\mu^{j_{2}}||_{\infty}\leq\epsilon$ for any $j_{1},j_{2}\in[J]$, where the parameter $0<\epsilon\leq 1$ is assumed to be known.

This assumption implies that for each arm, the mean rewards across all episodes do not differ by more than $\epsilon$. In applications like online advertising and recommender systems, the user preferences change over time only gradually, and therefore, the parameter $\epsilon$ can be used to capture this behaviour.
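For concreteness, here is a small hypothetical instance satisfying Assumption 1 with $K=2$ arms and $\epsilon=0.1$: take $\mu^{1}=(0.40,0.60)$, $\mu^{2}=(0.45,0.55)$ and $\mu^{3}=(0.42,0.58)$, so that, for example,
\[
||\mu^{1}-\mu^{2}||_{\infty}=\max\{|0.40-0.45|,\,|0.60-0.55|\}=0.05\leq\epsilon .
\]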

Let $N_{k}^{j}(t)$ denote the number of pulls of arm $k$ in the time interval $[(j-1)n+1,t]$. Thus, $N_{k}^{j}(t)$ counts the number of times arm $k$ is pulled from the beginning of episode $j$ until time $t$. Note that for episode $j$, the allowable values of $t$ in $N_{k}^{j}(t)$ are $[(j-1)n+1,jn]$. Further, let $S_{k}(t)$ denote the number of pulls of arm $k$ in the time interval $[1,t]$. Thus, $S_{k}(t)$ counts the number of times arm $k$ is pulled from the beginning of episode $1$ until time $t$. For example, if $n=5$ and $j=2$, then $N_{k}^{2}(8)$ counts the number of times arm $k$ is pulled at time instants $6,7,$ and $8$. Further, $S_{k}(8)$ counts the number of times arm $k$ is pulled in the interval $[1,8]$.

The goal of the agent is to decide which arm to pull (what the value of $I_{t}$ should be) at any given time $t$ based on the information $\{r_{I_{1}},r_{I_{2}},\cdots,r_{I_{t-1}}\}$ in order to maximize the average reward over all episodes. This is captured by the pseudo-regret $R_{J}$ of the MAB problem over $J$ episodes:

\[
\begin{aligned}
R_{J} &=\sum_{j=1}^{J}\mathbb{E}\left[\sum_{t=(j-1)n+1}^{jn}(r_{k_{*}^{j}}-r_{I_{t}})\right]\\
&=\sum_{j=1}^{J}\left(n\mu_{*}^{j}-\mathbb{E}\left[\sum_{t=(j-1)n+1}^{jn}\mu^{j}_{I_{t}}\right]\right)\\
&=\sum_{j=1}^{J}\sum_{k=1}^{K}\Delta_{k}^{j}\,\mathbb{E}[N_{k}^{j}(jn)],
\end{aligned}\tag{1}
\]

where the last equality follows since $\sum_{k=1}^{K}N_{k}^{j}(jn)=n$ for any $j\in[J]$. Thus, the goal is to make decisions $\{I_{t}:1\leq t\leq nJ\}$ to minimize the regret in (1).

In this paper, we exploit the relation among the mean rewards of the arms in different episodes (cf. Assumption 1) in order to minimize the regret $R_{J}$. This is achieved by reusing (transferring) reward samples from previous episodes to make decisions in the current episode. We describe the approach and the proposed algorithm in detail in the next section.

III ALL SAMPLE TRANSFER UCB (AST-UCB)

Our approach of reusing samples from previous episodes builds on the standard UCB algorithm for bandits. In this section, we first describe the UCB algorithm and then our proposed algorithm, which we call All Sample Transfer UCB (AST-UCB).

III-A UCB Algorithm [8]

Intuitively, the arm-pulling decisions should be based on the reward samples obtained from each arm. Since the mean rewards of the arms are unknown, the UCB algorithm computes their sample-average estimates and the corresponding confidence intervals. Then, based on the principle of optimism in the face of uncertainty, the upper (maximum) value in the confidence interval of each arm is treated as the optimistic mean reward of that arm. Finally, the arm with the highest optimistic mean reward is pulled.

As time progresses and more reward samples are received, the estimates become better and the confidence intervals become smaller. Thus, the upper value in the confidence interval approaches the true mean. Eventually, the optimistic mean reward of the optimal arm becomes larger than that of all sub-optimal arms, and thereafter, only the optimal arm is pulled.

The standard UCB algorithm is used when the arm distributions are assumed to be the same at all times. However, in our setting, the distributions change over episodes. Therefore, one approach would be to implement the UCB algorithm separately in each episode by using only the samples of that particular episode. In other words, the UCB algorithm is restarted at the beginning of every episode and uses only the reward samples received during the current episode. We call this approach the No Transfer UCB (NT-UCB) algorithm. Next, we explain the NT-UCB algorithm for episode $j$.

Let $\hat{\mu}_{1k}^{j}(t)$ denote the sample-average estimate of the mean reward of arm $k$ at time $t$, which is computed as

\[
\hat{\mu}_{1k}^{j}(t)=\frac{\sum_{\tau=(j-1)n+1}^{t}r_{I_{\tau}}\mathds{1}\{I_{\tau}=k\}}{\max\{1,N_{k}^{j}(t)\}},\tag{2}
\]

where $N_{k}^{j}(t)$ denotes the number of times arm $k$ is pulled until time $t$ since the beginning of episode $j$. Next, we compute the optimistic mean reward corresponding to $\hat{\mu}_{1k}^{j}(t)$. For this, we require the following result.

Lemma 1.

Let $\alpha>1$. For episode $j$, time $t\in[(j-1)n+1,jn]$ and arm $k$, with probability at least $1-\frac{2}{(t-(j-1)n)^{\alpha}}$, the following is satisfied:

\[
|\hat{\mu}_{1k}^{j}(t)-\mu_{k}^{j}|\leq p_{1k}^{j}(t)\triangleq\sqrt{\frac{\alpha\log{(t-(j-1)n)}}{2N_{k}^{j}(t)}}.\tag{3}
\]
  • Proof.

    The rewards are independent random variables with support $[0,1]$. Using Hoeffding’s inequality [16] for the estimate $\hat{\mu}_{1k}^{j}(t)$, we get

\[
\text{Pr}\{|\hat{\mu}_{1k}^{j}(t)-\mu_{k}^{j}|\geq\delta\}\leq 2\exp(-2N_{k}^{j}(t){\delta}^{2}).
\]

    Setting $\delta=\sqrt{\frac{\alpha\log(t-(j-1)n)}{2N_{k}^{j}(t)}}$ for $N_{k}^{j}(t)\geq 1$, the lemma follows. ∎
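As a quick numerical illustration with hypothetical values $\alpha=2$, $N_{k}^{j}(t)=25$ and $t-(j-1)n=100$, the radius in (3) evaluates to
\[
p_{1k}^{j}(t)=\sqrt{\frac{2\log 100}{2\cdot 25}}\approx 0.43,
\]
and it shrinks as $1/\sqrt{N_{k}^{j}(t)}$ as arm $k$ is pulled more often.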

Using Lemma 1, we form a confidence interval for the mean reward $\mu_{k}^{j}$ using the estimate $\hat{\mu}_{1k}^{j}(t)$ at time $t$ in episode $j$ as

\[
D_{1k}^{j}(t)=[\hat{\mu}_{1k}^{j}(t)-p_{1k}^{j}(t),\;\hat{\mu}_{1k}^{j}(t)+p_{1k}^{j}(t)].
\]

Next, the NT-UCB algorithm pulls the arm with maximum optimistic reward:

\[
I_{t}=\underset{k\in[K]}{\arg\max}\left\{\hat{\mu}_{1k}^{j}(t-1)+p_{1k}^{j}(t-1)\right\}.
\]
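A minimal sketch of this selection rule is given below (variable names are ours; it assumes each arm has been pulled at least once in the current episode and that only the current episode's samples are stored):

```python
import math
import numpy as np

def nt_ucb_arm(episode_rewards, t_in_episode, alpha=2.0):
    """Pick an arm via the NT-UCB rule using only the current episode's samples.

    episode_rewards: list of K lists; episode_rewards[k] holds the rewards
                     observed for arm k since the start of the current episode.
    t_in_episode:    number of elapsed rounds in the current episode, i.e. t - (j-1)n.
    """
    indices = []
    for samples in episode_rewards:
        mu_hat = np.mean(samples)                                                # estimate (2)
        radius = math.sqrt(alpha * math.log(t_in_episode) / (2 * len(samples)))  # p_{1k}^j(t) in (3)
        indices.append(mu_hat + radius)                                          # optimistic reward
    return int(np.argmax(indices))                                               # arm with the largest index
```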

The above steps are repeated until the end of episode $j$. Next, we provide an upper bound on the pseudo-regret of the NT-UCB algorithm.

Lemma 2.

Let $\alpha>1$. The pseudo-regret of NT-UCB satisfies

\[
R_{J}\leq\sum_{k=1}^{K}\left[2\alpha\log{n}\bigg{(}\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{1}{\Delta_{k}^{j}}\bigg{)}+J_{k}\,\frac{\alpha+1}{\alpha-1}\right],\quad\text{where}\;\;J_{k}=\sum_{j=1}^{J}\Delta_{k}^{j}.\tag{4}
\]
  • Proof.

    An upper bound on the regret over all episodes is obtained by adding the per-episode regret bound of the standard UCB algorithm, which is given as [1] (the second term in (4) differs from the corresponding term in [1], since additional union bounds are used to obtain the result in [1]):

\[
R_{j}\leq\sum_{\begin{subarray}{c}k=1\\ \Delta_{k}^{j}>0\end{subarray}}^{K}\left(\frac{2\alpha\log{n}}{\Delta_{k}^{j}}+\frac{\alpha+1}{\alpha-1}\Delta_{k}^{j}\right).
\]

    The result then follows. ∎

III-B AST-UCB Algorithm

For any particular episode, the NT-UCB algorithm mentioned above uses samples only in that episode to compute the estimates. However, as per Assumption 1, the mean rewards across the episodes are related, and therefore, reward samples in previous episodes carry information about the mean reward in the current episode. In order to capture this information, we construct an auxiliary estimate (in addition to the UCB estimate) that uses the reward samples from the beginning of the first episode. Then, we combine these two estimates to make the decisions. Next, we describe this approach for episode $j$.

Let $\hat{\mu}_{2k}(t)$ denote the auxiliary sample-average estimate of the mean reward of arm $k$ at time $t$, computed as:

\[
\hat{\mu}_{2k}(t)=\frac{\sum_{\tau=1}^{t}r_{I_{\tau}}\mathds{1}\{I_{\tau}=k\}}{\max\{1,S_{k}(t)\}},\tag{5}
\]

where $S_{k}(t)$ denotes the number of times arm $k$ is pulled until time $t$ since the beginning of episode $1$. Note that the estimate $\hat{\mu}_{2k}(t)$ captures the information of the reward samples of arm $k$ from all previous episodes (an alternate strategy would be to construct the auxiliary estimate from a fixed number of previous episodes; however, our strategy is better since the confidence interval corresponding to the estimate (5) is always better than that of this alternate strategy). Next, we compute the optimistic mean reward corresponding to $\hat{\mu}_{2k}(t)$. For this, we require the following result.

Lemma 3.

Let $\alpha>1$. For episode $j$, time $t\in[(j-1)n+1,jn]$ and arm $k$, with probability at least $1-\frac{2}{(t-(j-1)n)^{\alpha}}$, the following is satisfied:

\[
|\hat{\mu}_{2k}(t)-\mu_{k}^{j}|\leq p_{2k}^{j}(t)\triangleq\sqrt{\frac{\alpha\log{(t-(j-1)n)}}{2S_{k}(t)}}+U_{k}^{j}(t)\,\epsilon,\quad\text{where}\;\;U_{k}^{j}(t)=\frac{S_{k}(t)-N_{k}^{j}(t)}{S_{k}(t)}.\tag{6}
\]
  • Proof.

    The rewards are independent random variables with support $[0,1]$. Using McDiarmid’s inequality [17] for the estimate $\hat{\mu}_{2k}(t)$, we get

\[
\text{Pr}\{\lvert\hat{\mu}_{2k}(t)-\mathbb{E}[\hat{\mu}_{2k}(t)]\rvert\geq\delta\}\leq 2\exp(-2S_{k}(t){\delta}^{2}).
\]

    Setting $\delta=\sqrt{\frac{\alpha\log(t-(j-1)n)}{2S_{k}(t)}}$ for $S_{k}(t)\geq 1$, we get

\[
\text{Pr}\left\{|\hat{\mu}_{2k}(t)-\mathbb{E}[\hat{\mu}_{2k}(t)]|\geq\sqrt{\frac{\alpha\log(t-(j-1)n)}{2S_{k}(t)}}\right\}\leq\frac{2}{(t-(j-1)n)^{\alpha}}.
\]

    Hence, with probability at least $1-\frac{2}{(t-(j-1)n)^{\alpha}}$, the following holds:

\[
|\hat{\mu}_{2k}(t)-\mathbb{E}[\hat{\mu}_{2k}(t)]|\leq\sqrt{\frac{\alpha\log(t-(j-1)n)}{2S_{k}(t)}}.\tag{7}
\]

    Next, we bound $\mathbb{E}[\hat{\mu}_{2k}(t)]$ for $S_{k}(t)\geq 1$, $t\in[(j-1)n+1,jn]$:

\[
\begin{aligned}
\mathbb{E}[\hat{\mu}_{2k}(t)]&=\frac{\sum_{l=1}^{j-1}N_{k}^{l}(ln)\mu_{k}^{l}+N_{k}^{j}(t)\mu_{k}^{j}}{S_{k}(t)}\\
&=\mu_{k}^{j}+\frac{\sum_{l=1}^{j-1}N_{k}^{l}(ln)(\mu_{k}^{l}-\mu_{k}^{j})}{S_{k}(t)}\\
&\leq\mu_{k}^{j}+\frac{(S_{k}(t)-N_{k}^{j}(t))\,\epsilon}{S_{k}(t)}\\
&=\mu_{k}^{j}+U_{k}^{j}(t)\,\epsilon,
\end{aligned}\tag{8}
\]

    where the inequality follows from $\mu_{k}^{l}-\mu_{k}^{j}\leq\epsilon$ (Assumption 1). Similarly, using $\mu_{k}^{l}-\mu_{k}^{j}\geq-\epsilon$ (Assumption 1), we get

\[
\mathbb{E}[\hat{\mu}_{2k}(t)]\geq\mu_{k}^{j}-U_{k}^{j}(t)\,\epsilon.\tag{9}
\]

    Conditions (8) and (9) yield $|\mathbb{E}[\hat{\mu}_{2k}(t)]-\mu_{k}^{j}|\leq U_{k}^{j}(t)\,\epsilon$. Combining this with (7) via the triangle inequality, we get the result in (6). ∎
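To see how pooling samples helps, consider hypothetical values $\alpha=2$, $\epsilon=0.05$, $t-(j-1)n=100$, $N_{k}^{j}(t)=25$ and $S_{k}(t)=400$. Then $U_{k}^{j}(t)=375/400\approx 0.94$ and
\[
p_{2k}^{j}(t)=\sqrt{\frac{2\log 100}{2\cdot 400}}+0.94\,\epsilon\approx 0.107+0.047\approx 0.15,
\]
which is much smaller than the corresponding $p_{1k}^{j}(t)\approx 0.43$ from Lemma 1; the auxiliary interval is tighter whenever many past samples are available and $\epsilon$ is small.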

Using Lemma 3, we form a confidence interval for the mean reward $\mu_{k}^{j}$ using the estimate $\hat{\mu}_{2k}(t)$ at time step $t$ in episode $j$ as

\[
D_{2k}^{j}(t)=[\hat{\mu}_{2k}(t)-p_{2k}^{j}(t),\;\hat{\mu}_{2k}(t)+p_{2k}^{j}(t)].
\]

Next, we present two key steps of the AST-UCB algorithm.

(i) Combine the optimistic rewards of the two estimates $\hat{\mu}_{1k}^{j}(t)$ and $\hat{\mu}_{2k}(t)$ given in (2) and (5) as:

\[
q_{k}^{j}(t)=\min\{\hat{\mu}_{1k}^{j}(t)+p_{1k}^{j}(t),\;\hat{\mu}_{2k}(t)+p_{2k}^{j}(t)\}.\tag{10}
\]

(ii) Pull arm

\[
I_{t}=\underset{k\in[K]}{\arg\max}\{q_{k}^{j}(t-1)\}.
\]

The above steps are repeated until the end of episode $j$. All the steps of AST-UCB are given below in Algorithm 1.

Algorithm 1 AST-UCB
Input: episode length $n$, number of episodes $J$, parameters $\alpha$, $\epsilon$, and number of arms $K$
for episode $j=1,2,\ldots,J$ do
     for $t=(j-1)n+1,\cdots,(j-1)n+K$ do
          $I_{t}=t-(j-1)n$ (pull each arm once)
     end for
     for $t=(j-1)n+K+1,\cdots,jn$ do
          compute $\hat{\mu}_{1k}^{j}(t-1)$, $p_{1k}^{j}(t-1)$ using (2), (3)
          compute $\hat{\mu}_{2k}(t-1)$, $p_{2k}^{j}(t-1)$ using (5), (6)
          compute the optimistic reward $q_{k}^{j}(t-1)$ using (10)
          select arm $I_{t}=\underset{k\in[K]}{\arg\max}\{q_{k}^{j}(t-1)\}$
          update the number of pulls $N_{k}^{j}(t)$ and $S_{k}(t)$
     end for
end for
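A minimal Python sketch of Algorithm 1 is given below; it is a simulation under an assumed Bernoulli reward model (any $[0,1]$-valued reward model works), and the variable names and reward model are ours, not part of the paper:

```python
import math
import numpy as np

def ast_ucb(mean_rewards, n, alpha=2.0, eps=0.1, seed=0):
    """Run AST-UCB (Algorithm 1) on a J x K array of per-episode mean rewards.

    Returns the pseudo-regret accumulated over all J episodes.
    """
    rng = np.random.default_rng(seed)
    J, K = mean_rewards.shape
    S = np.zeros(K)            # S_k(t): pulls of each arm since episode 1
    tot = np.zeros(K)          # reward sums of each arm since episode 1
    regret = 0.0
    for j in range(J):
        mu = mean_rewards[j]
        N = np.zeros(K)        # N_k^j(t): pulls in episode j
        ep = np.zeros(K)       # reward sums in episode j
        for t in range(1, n + 1):                  # t - (j-1)n, time within the episode
            if t <= K:
                arm = t - 1                        # pull each arm once
            else:
                q = np.empty(K)
                for k in range(K):
                    mu1 = ep[k] / N[k]                                          # estimate (2)
                    p1 = math.sqrt(alpha * math.log(t) / (2 * N[k]))            # radius (3)
                    mu2 = tot[k] / S[k]                                         # estimate (5)
                    U = (S[k] - N[k]) / S[k]
                    p2 = math.sqrt(alpha * math.log(t) / (2 * S[k])) + U * eps  # radius (6)
                    q[k] = min(mu1 + p1, mu2 + p2)                              # combined index (10)
                arm = int(np.argmax(q))
            r = float(rng.random() < mu[arm])      # Bernoulli reward in [0,1]
            N[arm] += 1; S[arm] += 1
            ep[arm] += r; tot[arm] += r
            regret += mu.max() - mu[arm]
    return regret
```

For instance, `ast_ucb(np.array([[0.4, 0.6], [0.45, 0.55]]), n=1000)` simulates two episodes of a two-armed instance satisfying Assumption 1 with $\epsilon=0.1$.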

Next, we explain the motivation for Step (i). We combine the confidence intervals $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ by taking their intersection to get a better confidence interval. Note that by taking the intersection, the new confidence interval $D_{1k}^{j}(t)\cap D_{2k}^{j}(t)$ is always smaller than the original two confidence intervals, as illustrated in Figure 1. This smaller interval results in a better estimate of $\mu_{k}^{j}$. We then pick the optimistic reward in the new confidence interval (note that Step (i) is valid even when $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ do not intersect). Further, Step (ii) is similar to the UCB algorithm, where we pull the arm with the maximum optimistic reward. The next result bounds the probability of $\mu_{k}^{j}$ lying in the new confidence interval (and of the new confidence interval being non-empty).

Figure 1: The blue and green intervals represent the confidence intervals $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ for the mean $\mu_{k}^{j}$, respectively. The orange interval is the intersection of the two intervals, which is clearly smaller (and hence better). The optimistic reward of the orange interval is given by $q_{k}^{j}(t)$.
Lemma 4.

For episode $j$, time $t\in[(j-1)n+1,jn]$ and arm $k$, with probability at least $1-\frac{4}{(t-(j-1)n)^{\alpha}}$, the following are satisfied:

\[
(i)\quad \mu_{k}^{j}\in D_{1k}^{j}(t)\cap D_{2k}^{j}(t),\tag{11}
\]
\[
(ii)\quad D_{1k}^{j}(t)\cap D_{2k}^{j}(t)\neq\emptyset.\tag{12}
\]
  • Proof.

    Define the events $\mathcal{E}_{1}=\{\mu_{k}^{j}\notin D_{1k}^{j}(t)\}$ and $\mathcal{E}_{2}=\{\mu_{k}^{j}\notin D_{2k}^{j}(t)\}$. Then we have

\[
\begin{aligned}
\text{Pr}\{\mu_{k}^{j}\notin D_{1k}^{j}(t)\cap D_{2k}^{j}(t)\}&=\text{Pr}\{\mathcal{E}_{1}\cup\mathcal{E}_{2}\}\\
&\leq\text{Pr}\{\mathcal{E}_{1}\}+\text{Pr}\{\mathcal{E}_{2}\}\\
&\leq\frac{4}{(t-(j-1)n)^{\alpha}},
\end{aligned}
\]

    where the last inequality follows from Lemmas 1 and 3. Hence, condition (i) in the lemma follows. Condition (ii) follows from the same arguments, since $\{D_{1k}^{j}(t)\cap D_{2k}^{j}(t)=\emptyset\}\subseteq\mathcal{E}_{1}\cup\mathcal{E}_{2}$. ∎

Note that although the new confidence interval is smaller, Lemma 4 shows that the probability bound of the mean reward belonging to this new interval has reduced as compared to that in (3) or (6). However, we show in Theorem 1 that the negative effect of the reduction of the probability is not significant, and the smaller interval leads to an overall reduction in the regret.

IV REGRET ANALYSIS

In this section, we derive a regret bound for the AST-UCB algorithm and then provide an analysis of the result.

Theorem 1.

Let $\Delta_{k}^{\max}\triangleq\max_{j\in[J]}\{\Delta_{k}^{j}\}$ and $\Delta_{k}^{\min}\triangleq\min_{j\in[J],\,\Delta_{k}^{j}>0}\{\Delta_{k}^{j}\}$. The pseudo-regret of AST-UCB with $\alpha>1$ and $0\leq\epsilon<\frac{1}{2}\min_{k\in[K]}\{\Delta_{k}^{\min}\}$ satisfies

\[
R_{J}\leq\sum_{k=1}^{K}\Delta_{k}^{\max}\Bigg{[}\min\bigg{\{}\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{2\alpha\log{(n)}}{(\Delta_{k}^{j})^{2}},\;\frac{2\alpha\log{(n)}}{(\Delta_{k}^{\min}-2\epsilon)^{2}}\bigg{\}}+J\,\frac{\alpha+3}{\alpha-1}\Bigg{]}.\tag{13}
\]
  • Proof.

    Refer to the appendix. ∎

Next, we compare the regret bounds of our algorithm (13) and NT-UCB (4), and highlight the benefit of transfer. The transfer happens due to the first term in (13). Hence, we compare the first terms of the two regret bounds. To this end, we define the following terms that capture the dependence on $J$:

\[
A_{k}^{J}=\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{\Delta_{k}^{\max}}{(\Delta_{k}^{j})^{2}},\qquad B_{k}^{J}=\frac{\Delta_{k}^{\max}}{(\Delta_{k}^{\min}-2\epsilon)^{2}},\qquad C_{k}^{J}=\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{1}{\Delta_{k}^{j}}.
\]

Several comments are in order. First, observe that, for transfer to be beneficial, we need $\min\{A_{k}^{J},B_{k}^{J}\}<C_{k}^{J}$. Since $A_{k}^{J}\geq C_{k}^{J}$, this can happen only if $B_{k}^{J}<C_{k}^{J}$. The term $B_{k}^{J}$ behaves like a constant, whereas $C_{k}^{J}$ increases as the total number of episodes $J$ increases. Therefore, for all $J$ larger than some $J^{m}(\epsilon)$, we get $B_{k}^{J}<C_{k}^{J}$, which leads to a decrease in the regret as compared to NT-UCB. Second, as $\epsilon$ increases (the episodes become increasingly unrelated), $J^{m}(\epsilon)$ increases since more episodes (samples) are required for the transfer to be beneficial. Third, the regret depends only logarithmically on the episode length $n$ (which is the case with NT-UCB as well). Fourth, the second term in the regret bound of AST-UCB (13) is higher than the corresponding term in the NT-UCB bound (4) due to the decreased probability bound in Lemma 4 as compared to Lemmas 1 and 3.
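As a concrete (hypothetical) illustration, suppose $\Delta_{k}^{j}=0.3$ for every episode $j$ and $\epsilon=0.05$. Then
\[
A_{k}^{J}=\frac{0.3\,J}{(0.3)^{2}}=\frac{J}{0.3},\qquad B_{k}^{J}=\frac{0.3}{(0.3-0.1)^{2}}=7.5,\qquad C_{k}^{J}=\frac{J}{0.3},
\]
so $B_{k}^{J}<C_{k}^{J}$ as soon as $J\geq 3$: the first term of (13) stops growing with $J$, while the corresponding term of (4) keeps increasing linearly.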

V Numerical Simulations

Figure 2: Empirical regret of NT-UCB and AST-UCB for different values of $\epsilon$ for Case I. (a) Regret as a function of the episode length $n$. (b) Regret as a function of the total number of episodes $J$.

Figure 3: Empirical regret of NT-UCB and AST-UCB for different values of $\epsilon$ for Case II. (a) Regret as a function of the episode length $n$. (b) Regret as a function of the total number of episodes $J$.

In this section, we present numerical results for the AST-UCB algorithm. We consider a $K=4$ armed bandit problem. In the simulations, we need to select the mean reward $\mu_{k}^{j}$ of each arm for each episode such that Assumption 1 is satisfied. Towards this end, we fix a seed interval of length $\epsilon$ for each arm. Then, at the beginning of each episode, we uniformly sample the value of $\mu_{k}^{j}$ from this seed interval. This ensures that Assumption 1 is satisfied. Once the mean reward value $\mu_{k}^{j}$ is obtained, we construct a uniform distribution with mean $\mu_{k}^{j}$ and width $d=0.2$. In case the support of this uniform distribution lies outside the interval $[0,1]$, we reduce $d$ to avoid this. The reward samples are then generated from this uniform distribution. For each scenario, we compute the regret $R_{J}$ by taking an empirical average over $30$ independent realizations of that scenario.

We simulate AST-UCB and NT-UCB for two cases (two sets of seed intervals). Note that the seed intervals for each arm are of length $\epsilon$. The mid-points of the seed intervals of the four arms for Case I and Case II are $(0.4,0.6,0.6,0.4)$ and $(0.35,0.7,0.3,0.4)$, respectively.
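A minimal sketch of this mean-generation procedure (our own variable and function names, not part of the paper; Case I mid-points shown) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode_means(midpoints, eps):
    """Draw one episode's mean rewards uniformly from seed intervals of length eps."""
    mids = np.asarray(midpoints)
    return rng.uniform(mids - eps / 2, mids + eps / 2)

def uniform_reward(mean, d=0.2):
    """Reward ~ Uniform(mean - d/2, mean + d/2), with d shrunk so the support stays in [0, 1]."""
    d = min(d, 2 * mean, 2 * (1 - mean))
    return rng.uniform(mean - d / 2, mean + d / 2)

case1_midpoints = [0.4, 0.6, 0.6, 0.4]                   # Case I seed-interval mid-points
mu_j = sample_episode_means(case1_midpoints, eps=0.1)    # means for one episode
print(mu_j, uniform_reward(mu_j[0]))
```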

In Figure 2(a), we observe that the regret of AST-UCB is considerably smaller than that of NT-UCB. This is particularly true for smaller values of $\epsilon$. As $\epsilon$ increases (since the reward support is $[0,1]$, values of $\epsilon>1$ are not valid in our setting), the regret of AST-UCB approaches that of NT-UCB. This is in accordance with the fact that when $\epsilon$ is large, the confidence interval of the auxiliary estimate in (6) is large and transfer is not very beneficial. Further, we observe a logarithmic dependence of the regret on $n$, as quantified by the regret bounds in (4) and (13).

In Figure 2(b), we again observe that AST-UCB performs better than NT-UCB, particularly for small values of $\epsilon$. We also observe that the regret has an approximately linear dependence on $J$. The plots in Figure 2 show that, for any value of $\epsilon$, the difference between the regrets of NT-UCB and AST-UCB increases with the episode length $n$ or the total number of episodes $J$. This is because a larger number of reward samples from previous episodes becomes available, thereby increasing the benefit of transfer.

Similar observations can be made from Figures 3(a) and 3(b) for Case II. However, the improvement of AST-UCB over NT-UCB in terms of regret is larger in Case II than in Case I. This happens because the seed intervals in Case II are farther apart, which helps in distinguishing the best arm more quickly using the samples of previous episodes.

VI CONCLUSION

We analyzed the transfer of reward samples in a sequential stochastic multi-armed bandit setting. We proposed a transfer algorithm based on UCB and showed that its regret is lower than that of UCB with no transfer. We provided a regret analysis of our algorithm and validated our approach via numerical experiments. Future research directions include extending the work to the case where the parameter $\epsilon$ is unknown, and studying a similar transfer problem in the context of reinforcement learning.

APPENDIX: Proof of Theorem 1

To simplify the notation, we re-denote several variables as $\mu=\mu_{k}^{j}$, $\mu_{*}=\mu_{*}^{j}$, $\hat{\mu}_{1}=\hat{\mu}_{1k}^{j}(t-1)$, $\hat{\mu}_{1*}=\hat{\mu}_{1k_{*}^{j}}^{j}(t-1)$, $\hat{\mu}_{2}=\hat{\mu}_{2k}(t-1)$, $\hat{\mu}_{2*}=\hat{\mu}_{2k_{*}^{j}}(t-1)$, $t_{j}^{n}=(j-1)n+1$,

\[
\begin{aligned}
&p_{1}=\sqrt{\frac{\alpha\log{(t-t_{j}^{n})}}{2N_{k}^{j}(t-1)}},\qquad p_{1*}=\sqrt{\frac{\alpha\log{(t-t_{j}^{n})}}{2N_{k_{*}^{j}}^{j}(t-1)}},\\
&p_{2}=\sqrt{\frac{\alpha\log{(t-t_{j}^{n})}}{2S_{k}(t-1)}}+U_{k}^{j}(t)\,\epsilon,\qquad p_{2*}=\sqrt{\frac{\alpha\log{(t-t_{j}^{n})}}{2S_{k_{*}^{j}}(t-1)}}+U_{k_{*}^{j}}^{j}(t)\,\epsilon,\\
&u_{1k}^{j}=\frac{2\alpha\log{(n)}}{(\Delta_{k}^{j})^{2}},\qquad u_{2k}^{j}=\frac{2\alpha\log{(n)}}{(\Delta_{k}^{j}-2\epsilon)^{2}}.
\end{aligned}
\]

For arm $k$ to be pulled at time $t$ (i.e., $I_{t}=k$), at least one of the following five conditions should be true:

\[
\hat{\mu}_{1}-p_{1}>\mu,\tag{14}
\]
\[
\hat{\mu}_{1*}+p_{1*}\leq\mu_{*},\tag{15}
\]
\[
\hat{\mu}_{2*}+p_{2*}\leq\mu_{*},\tag{16}
\]
\[
\hat{\mu}_{2}-p_{2}>\mu,\tag{17}
\]
\[
\sqrt{\frac{\alpha\log{n}}{2N_{k}^{j}(t-1)}}>\frac{\Delta_{k}^{j}}{2}\quad\text{and}\quad\sqrt{\frac{\alpha\log{(n)}}{2S_{k}(t-1)}}+\epsilon>\frac{\Delta_{k}^{j}}{2}.\tag{18}
\]

We show this by contradiction. Assume that none of the conditions in (14)-(17) is true and the first condition in (18) is false. Then, using the fact that $p_{1}<\sqrt{\frac{\alpha\log{n}}{2N_{k}^{j}(t-1)}}$, we have

\[
\hat{\mu}_{1*}+p_{1*}>\mu_{*}=\Delta_{k}^{j}+\mu\geq 2p_{1}+\mu\geq\hat{\mu}_{1}+p_{1},\tag{19}
\]
\[
\hat{\mu}_{2*}+p_{2*}>\mu_{*}=\Delta_{k}^{j}+\mu\geq 2p_{1}+\mu\geq\hat{\mu}_{1}+p_{1}.\tag{20}
\]

Conditions in (19) and (20) imply

\[
\min\{\hat{\mu}_{1*}+p_{1*},\;\hat{\mu}_{2*}+p_{2*}\}>\hat{\mu}_{1}+p_{1}.\tag{21}
\]

Similarly, when none of the conditions in (14)-(17) is true and the second condition in (18) is false, we get

\[
\min\{\hat{\mu}_{1*}+p_{1*},\;\hat{\mu}_{2*}+p_{2*}\}>\hat{\mu}_{2}+p_{2}.\tag{22}
\]

Thus, at least one of the conditions in (21) and (22) is true, and this yields

\[
\min\{\hat{\mu}_{1*}+p_{1*},\;\hat{\mu}_{2*}+p_{2*}\}>\min\{\hat{\mu}_{1}+p_{1},\;\hat{\mu}_{2}+p_{2}\}.
\]

The above condition implies that the AST-UCB algorithm will not pull arm $k$, and hence, we have a contradiction. The cumulative regret after $J$ episodes (each of length $n$) is given by

\[
R_{J}=\sum_{j=1}^{J}\sum_{k=1}^{K}\Delta_{k}^{j}\,\mathbb{E}[N_{k}^{j}(jn)]\leq\sum_{k=1}^{K}\Delta_{k}^{\max}\,\mathbb{E}[\tilde{S}_{k}(Jn)],
\]

where $\tilde{S}_{k}(Jn)$ is the total number of sub-optimal pulls of arm $k$ over all episodes. Next, we bound the regret by bounding the term $\mathbb{E}[\tilde{S}_{k}(Jn)]$. For an arbitrary sequence $I_{t}$, $t=1,2,\cdots,Jn$, we have

\[
\begin{aligned}
\tilde{S}_{k}(Jn)&=\sum_{j=1}^{J}\sum_{t=t_{j}^{n}}^{jn}\mathds{1}\{I_{t}=k;\,k\neq k_{*}^{j}\},\\
&=\sum_{j=1}^{J}\bigg{(}\mathds{1}\{k\neq k_{*}^{j}\}+\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{I_{t}=k;\,k\neq k_{*}^{j}\}\bigg{)},\\
&=\sum_{j=1}^{J}\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{I_{t}=k;\,k\neq k_{*}^{j};\,\text{(18) is True}\}\\
&\hspace{14pt}+\sum_{j=1}^{J}\bigg{(}\mathds{1}\{k\neq k_{*}^{j}\}+\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{I_{t}=k;\,k\neq k_{*}^{j};\,\text{(18) is False}\}\bigg{)}.
\end{aligned}\tag{23}
\]

\[
\begin{aligned}
\text{First term in (23)}&=\sum_{j=1}^{J}\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{I_{t}=k,\,k\neq k_{*}^{j};\,N_{k}^{j}(t-1)<u_{1k}^{j},\,S_{k}(t-1)<u_{2k}^{j}\},\\
&=\sum_{j=1}^{J}\sum_{t=t_{j}^{n}+K}^{jn}\min\bigg{\{}\mathds{1}\{I_{t}=k,\,k\neq k_{*}^{t};\,N_{k}^{j}(t-1)<u_{1k}^{j}\},\;\mathds{1}\{I_{t}=k,\,k\neq k_{*}^{t};\,S_{k}(t-1)<u_{2k}^{j}\}\bigg{\}},\\
&\leq\min\bigg{\{}\sum_{j=1}^{J}\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{I_{t}=k,\,k\neq k_{*}^{t};\,N_{k}^{j}(t-1)<u_{1k}^{j}\},\;\sum_{j=1}^{J}\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{I_{t}=k,\,k\neq k_{*}^{t};\,S_{k}(t-1)<u_{2k}^{j}\}\bigg{\}},\\
&\leq\min\bigg{\{}\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{2\alpha\log{(n)}}{(\Delta_{k}^{j})^{2}},\;\sum_{t=1}^{Jn}\mathds{1}\bigg{\{}I_{t}=k,\,k\neq k_{*}^{t};\,S_{k}(t-1)<\frac{2\alpha\log{(n)}}{(\Delta_{k}^{\min}-2\epsilon)^{2}}\bigg{\}}\bigg{\}},\\
&\leq\min\bigg{\{}\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{2\alpha\log{(n)}}{(\Delta_{k}^{j})^{2}},\;\frac{2\alpha\log{(n)}}{(\Delta_{k}^{\min}-2\epsilon)^{2}}\bigg{\}},
\end{aligned}\tag{24}
\]

\[
\text{Second term of (23)}\leq\sum_{j=1}^{J}\bigg{(}1+\sum_{t=t_{j}^{n}+K}^{jn}\mathds{1}\{\text{(14) or (15) or (16) or (17) is True}\}\bigg{)}.\tag{25}
\]

Using (24) and (25) in (23) and taking expectation, we get

\[
\mathbb{E}[\tilde{S}_{k}(Jn)]\leq\min\left\{W_{J},V_{J}\right\}+\sum_{j=1}^{J}\bigg{(}1+\sum_{t=t_{j}^{n}+K}^{jn}\text{Pr}\{\text{(14) or (15) or (16) or (17) is True}\}\bigg{)},\tag{26}
\]
\[
\text{where}\quad W_{J}=\sum_{\begin{subarray}{c}j=1\\ \Delta_{k}^{j}>0\end{subarray}}^{J}\frac{2\alpha\log{(n)}}{(\Delta_{k}^{j})^{2}}\quad\text{and}\quad V_{J}=\frac{2\alpha\log{(n)}}{(\Delta_{k}^{\min}-2\epsilon)^{2}}.
\]

Next, we bound the probability of the event that at least one of (14), (15), (16) or (17) is true. We use the union bound, followed by an application of the one-sided Hoeffding inequality (the steps are similar to the proofs of Lemmas 1 and 3), to get

\[
\begin{aligned}
\text{Pr}\{\text{(14) or (15) or (16) or (17) is True}\}&\leq\text{Pr}\{\text{(14) is True}\}+\text{Pr}\{\text{(15) is True}\}+\text{Pr}\{\text{(16) is True}\}+\text{Pr}\{\text{(17) is True}\}\\
&\leq\frac{4}{(t-t_{j}^{n})^{\alpha}}.
\end{aligned}\tag{27}
\]

Using (26) and (27), we have

\[
\begin{aligned}
\mathbb{E}[\tilde{S}_{k}(Jn)]&\leq\min\left\{W_{J},V_{J}\right\}+\sum_{j=1}^{J}\bigg{(}1+\sum_{t=t_{j}^{n}+K}^{jn}\frac{4}{(t-t_{j}^{n})^{\alpha}}\bigg{)},\\
&\leq\min\left\{W_{J},V_{J}\right\}+\sum_{j=1}^{J}\bigg{(}1+\int_{s=(j-1)n+K}^{\infty}\frac{4}{(s-t_{j}^{n})^{\alpha}}\,ds\bigg{)},\\
&=\min\left\{W_{J},V_{J}\right\}+\sum_{j=1}^{J}\bigg{(}1+\frac{4(K-1)^{1-\alpha}}{\alpha-1}\bigg{)},\\
&\leq\min\left\{W_{J},V_{J}\right\}+\sum_{j=1}^{J}\bigg{(}1+\frac{4}{\alpha-1}\bigg{)},\\
&\leq\min\left\{W_{J},V_{J}\right\}+J\,\frac{\alpha+3}{\alpha-1}.
\end{aligned}
\]

Hence the theorem follows.


References

  • [1] S. Bubeck, N. Cesa-Bianchi, et al., “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
  • [2] T. Lattimore and C. Szepesvári, Bandit algorithms. Cambridge University Press, 2020.
  • [3] H. Robbins, “Some aspects of the sequential design of experiments,” 1952.
  • [4] D. Bouneffouf, I. Rish, and C. Aggarwal, “Survey on applications of multi-armed and contextual bandits,” in 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, 2020.
  • [5] N. Silva, H. Werneck, T. Silva, A. C. Pereira, and L. Rocha, “Multi-armed bandits in recommendation systems: A survey of the state-of-the-art and future directions,” Expert Systems with Applications, vol. 197, p. 116669, 2022.
  • [6] A. Lazaric, E. Brunskill, et al., “Sequential transfer in multi-armed bandit with finite set of models,” Advances in Neural Information Processing Systems, vol. 26, 2013.
  • [7] A. Shilton, S. Gupta, S. Rana, and S. Venkatesh, “Regret bounds for transfer learning in bayesian optimisation,” in Artificial Intelligence and Statistics, pp. 307–315, PMLR, 2017.
  • [8] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, pp. 235–256, 2002.
  • [9] J. Zhang and E. Bareinboim, “Transfer learning in multi-armed bandit: a causal approach,” in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1778–1780, 2017.
  • [10] A. A. Deshmukh, U. Dogan, and C. Scott, “Multi-task learning for contextual bandits,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [11] B. Liu, Y. Wei, Y. Zhang, Z. Yan, and Q. Yang, “Transferable contextual bandit for cross-domain recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  • [12] L. Cella, A. Lazaric, and M. Pontil, “Meta-learning with stochastic linear bandits,” in International Conference on Machine Learning, pp. 1360–1370, PMLR, 2020.
  • [13] L. Cella and M. Pontil, “Multi-task and meta-learning with sparse linear bandits,” in Uncertainty in Artificial Intelligence, pp. 1692–1702, PMLR, 2021.
  • [14] J. Azizi, B. Kveton, M. Ghavamzadeh, and S. Katariya, “Meta-learning for simple regret minimization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 6709–6717, 2023.
  • [15] A. Garivier and E. Moulines, “On upper-confidence bound policies for switching bandit problems,” in International Conference on Algorithmic Learning Theory, pp. 174–188, Springer, 2011.
  • [16] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” The collected works of Wassily Hoeffding, pp. 409–426, 1994.
  • [17] C. McDiarmid et al., “On the method of bounded differences,” Surveys in combinatorics, vol. 141, no. 1, pp. 148–188, 1989.