
Communication Efficient Parallel Reinforcement Learning

Mridul Agarwal, Bhargav Ganguly, and Vaneet Aggarwal
{agarw180,bganguly,vaneet}@purdue.edu
Purdue University, West Lafayette IN 47907
Abstract

We consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions using reinforcement learning for $T$ rounds. The agents share their data with a central server to minimize their regret. We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds. We provide dist-UCRL, which runs at each agent, and prove that the total cumulative regret of the $M$ agents is upper bounded as $\tilde{O}(DS\sqrt{MAT})$ for a Markov Decision Process with diameter $D$, $S$ states, and $A$ actions. The agents synchronize after their visitations to any state-action pair exceed a certain threshold. Using this, we obtain a bound of $O\left(MSA\log(MT)\right)$ on the total number of communication rounds. Finally, we evaluate the algorithm in multiple environments and demonstrate that the proposed algorithm performs on par with an always-communicating version of the UCRL2 algorithm, while requiring significantly less communication.

I Introduction

Reinforcement Learning (RL) is being increasingly deployed and trained with parallel agents. In many cases, each agent interacts with an identical and independent environment. For example, autonomous cars do not share an environment, as they are possibly located far apart [Kiran et al., 2020]. Similarly, in ride/freight sharing services, different RL agents (vehicles) make decisions in parallel to minimize their costs and maximize their profits [Al-Abbasi et al., 2019, Manchella et al., 2021]. Further, consider an e-commerce company using RL agents in servers to recommend products to customers; based on the location of a customer, each query may be routed to a particular server (or agent) [Sankararaman et al., 2019]. Finally, in the field of robotics, parallel robots are often deployed in practice [Hu et al., 2020, Sartoretti et al., 2019]. From these examples, we note that every agent would gain by sharing the data it collects; otherwise, the knowledge accumulated by the remaining agents is wasted. Such parallel policy learning with a limited amount of communication is the focus of this paper.

Sharing data across agents poses a new set of problems arising from communicating the samples. If the agents communicate the observation tuple (state, action, reward, next state) at every time step, their communication cost grows accordingly. For various power-constrained devices, such as small robots, this luxury might not always be available [Sankararaman et al., 2019]. In this paper, we aim to reduce the number of communication steps while obtaining the same regret bounds as a strategy in which the agents always communicate.

For a system with $M$ parallel agents, the agents generate $M$ times more data. From the lens of regret analysis, this setup loosely translates to a setup where a single agent runs for a time horizon of $MT$. For the setup where a single agent runs for $MT$ steps, the regret is lower bounded by $O(\sqrt{DSMAT})$ and upper bounded by $\tilde{O}(DS\sqrt{MAT})$ [Jaksch et al., 2010, Agrawal and Jia, 2017]. We show that, if the agents take their actions sequentially and communicate after every interaction with their respective environments, then the agents can collectively obtain a regret bound of $\tilde{O}(DS\sqrt{MAT})$. This results in a faster convergence rate of $O(1/\sqrt{MT})$ to the optimal policy, compared to the convergence rate of $O(1/\sqrt{T})$ when using one RL agent. Note that, in practice, sequential decision making cannot be guaranteed for parallel independent agents. Thus, this result acts as a lower bound for parallel reinforcement learning. In this paper, we further provide an algorithm, in which agents work in parallel, that achieves these bounds with limited communication.

We consider a setup where $M$ reinforcement learning agents interact with $M$ identical environments or Markov Decision Processes (MDPs) [Sutton and Barto, 2018, Puterman, 1994]. Similar to the examples considered above, the $M$ environments are independent. We assume that there also exists a central coordinator (server), although we suggest a method to relax this assumption. All the agents report their experiences to the central coordinator, which then computes and shares a policy with the agents. We consider that the central coordinator uses a model-based algorithm to compute the policy and that each agent stores the number of visitations made to any particular (state, action, next state) tuple. The central coordinator also shares the total visitations to each state-action pair back to all the agents.

Based on this setup, we provide a novel communication-efficient dist-UCRL algorithm. In the dist-UCRL algorithm, the agents communicate with the central coordinator whenever the number of visits by any agent to any state-action pair $(s,a)$ in the current epoch (i.e., since the last synchronization) reaches a fraction $1/M$ of the total visitation count of $(s,a)$ up to the last synchronization, instead of communicating every $N\geq 1$ time steps. Using this synchronization strategy, for an MDP with diameter $D$, $S$ states, and $A$ actions, we show that the cumulative regret of the $M$ agents over a time horizon $T$ under the dist-UCRL algorithm scales as $O(DS\sqrt{MAT})$, using only $O(MAS\log_{2}(MT))$ synchronization steps (we use the terms communication round and synchronization step interchangeably). Note that dist-UCRL not only achieves the lower-bound regret scaling of $\tilde{O}(\sqrt{MT})$ in $M$ and $T$, but does so with limited communication.

To obtain our results, we also derive a concentration bound on $M$ independent Martingale sequences, which may be of independent interest. To the best of our knowledge, this is the first work to obtain performance guarantees via regret analysis for a setup where $M$ agents interact and collaborate.

We also evaluate our algorithms empirically. We run the proposed dist-UCRL algorithm in multiple environments and compare it against the modified UCRL algorithm in which the agents communicate after every time step. We show that the dist-UCRL algorithm obtains a similar regret with reduced communication.

The rest of the paper is organized as follows. Section II summarizes the key related works. Section III describes the complete system model. Section IV describes the dist-UCRL algorithm, and the regret guarantees of the algorithm are provided in Section V. Section VI describes a modification of UCRL2, mod-UCRL2, and its regret guarantees. Section VII tests the proposed algorithms in multiple environments, and Section VIII concludes the paper with some possible directions for future work.

II Related Works

Optimal planning using Markov Decision Processes has seen numerous significant contributions. Many algorithms have been proposed to find optimal policies; the major ones include Q-learning and policy iteration [Howard, 1960, Puterman, 1994, Bertsekas, 1995, Sutton and Barto, 2018]. Fundamentally, these algorithms work by calculating the utility of taking actions in each state and selecting the actions that maximize the utility of the state. These algorithms provide an optimal policy when the transition probabilities of the Markov Decision Process are known, or an $\epsilon$-optimal policy when iterative algorithms are used [Puterman, 1994].

A large body of work also exists for the setup where the transition probabilities are not known. Model-based algorithms work by reducing the number of samples required to obtain a close estimate of the transition probabilities [Bartlett and Tewari, 2009, Jaksch et al., 2010]. Model-free algorithms work by directly estimating the utilities obtained by taking an action in a state [Jin et al., 2018]. These algorithms apply the optimism-in-the-face-of-uncertainty principle to find a model near the empirical estimates that provides the highest reward. There are also algorithms that sample the transition probabilities using posterior sampling and obtain regret bounds [Ian et al., 2013, Agrawal and Jia, 2017]. These analyses suggest that using parallel agents interacting with independent and identical environments will provide tighter concentration inequalities and hence will help in reducing the regret.

In the domain of multiple agents using RL, most of the work considers agents that interact with a common environment, where the decision of one agent impacts all the other agents. This area is known as Multi-Agent Reinforcement Learning (MARL) [Gupta et al., 2017, Zhang et al., 2018]. The agents may cooperate with each other, for example, in the case of autonomous vehicles yielding on a busy road. Or the agents may compete with each other, for example, in the case of car racing where only one car may win. Although we have multiple agents in our setup, the environments are independent. Hence, we refrain from using MARL terminology in this paper.

With the introduction of deep learning to the field of reinforcement learning, many “Deep RL” algorithms have been proposed to minimize the sample complexity of finding optimal policies [Mnih et al., 2013, Schulman et al., 2015, Mnih et al., 2016, Schulman et al., 2017, Haarnoja et al., 2018]. Recently, various algorithms have used parallel actors to learn better policies using deep reinforcement learning [Nair et al., 2015, Clemente et al., 2017, Horgan et al., 2018, Espeholt et al., 2018, Assran et al., 2019]. These algorithms consist of parallel agents that share the entire sequence of (state, action, reward, next state) tuples every $n$ epochs and update a common neural network with the gradients computed by possibly parallel learners. Compared to these works, we consider a model-based setup and obtain regret guarantees with only $O(MAS\log_{2}(MT))$ synchronization rounds, with a common controller learning the policy.

Recently, in the area of Bandits [Lattimore and Szepesvári, 2020], there has been a thrust towards distributed bandits with a reduced number of communication rounds. Various algorithms have been proposed to minimize the regret in setups where the agents synchronize with a central coordinator [Kanade et al., 2012, Hillel et al., 2013, Wang et al., 2019, Dubey and Pentland, 2020]. Recently, Wang et al. [2019] showed that it is possible to obtain optimal regret guarantees with a number of rounds that is independent of $T$ for stochastic Multi-Armed Bandits and Linear Bandits. Further, Dubey and Pentland [2020] considered a linear bandit setup and aimed to protect the privacy of the collaborating agents while minimizing the number of communication rounds. Moreover, [Chawla et al., 2020, Sankararaman et al., 2019] considered a gossiping setup in which only the index of the best arm is communicated, thus reducing not only the number of communication rounds but also the number of bits in each communication round. Similar to the bandit setting, we aim to find rigorous regret guarantees for the dist-UCRL algorithm along with a bound on the number of communication rounds.

III System Model

Let $[K]$ denote the set of $K$ elements, i.e., $[K]=\{1,2,\cdots,K\}$. We consider an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},P,\bar{r})$, where $\mathcal{S}=[S]$ is the finite set of states and $\mathcal{A}=[A]$ is the finite set of actions. $P$ denotes the transition probabilities, i.e., on taking action $a\in\mathcal{A}$ in state $s\in\mathcal{S}$, the next state $s^{\prime}\in\mathcal{S}$ follows the distribution $P(\cdot|s,a)$. Also, on taking action $a\in\mathcal{A}$ in state $s\in\mathcal{S}$, an agent receives a stochastic reward $r$ drawn from a distribution over $[0,1]$ with mean $\bar{r}(s,a)$.

We consider $M$ agents in the system, each interacting with one of $M$ identical environments. Let $i\in[M]$ index the agents and their corresponding environments. For an agent $i$, at time step $t$, let $s_{i,t}$ denote the state of the agent, $a_{i,t}$ the action taken by the agent, and $r_{i,t}$ the reward obtained by the agent on taking action $a_{i,t}$ in state $s_{i,t}$. We assume that the $M$ environments are independent. Mathematically, for all $i\in[M]$ and all $t\geq 1$, we have,

$\mathbb{P}(s_{i,t+1}|s_{1,t},a_{1,t},\cdots,s_{M,t},a_{M,t})=\mathbb{P}(s_{i,t+1}|s_{i,t},a_{i,t}),$

and we make a similar assumption on the rewards $r_{i,t}$ as well. This means that the distribution of the next state $s_{i,t+1}$ and the reward $r_{i,t}$ of agent $i\in[M]$ is conditioned only on the current state and action $(s_{i,t},a_{i,t})$ of agent $i$ and is independent of the states and actions of the remaining agents.
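
To make the independence assumption concrete, the following minimal Python sketch (our illustration; the class and sampling scheme are not from the paper) instantiates $M$ copies of the same tabular MDP, each with its own random stream, so that the next state and reward of agent $i$ depend only on its own current state-action pair.

import numpy as np

class TabularMDP:
    """One copy of the common MDP; M identical, independent copies are created below."""
    def __init__(self, P, r_bar, rng):
        self.P = P          # P[s, a, s'] = transition probability
        self.r_bar = r_bar  # r_bar[s, a] = mean reward in [0, 1]
        self.rng = rng      # private random stream, so copies are independent
        self.state = 0

    def step(self, a):
        s = self.state
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        # Bernoulli reward with mean r_bar(s, a); any distribution over [0, 1] works.
        reward = float(self.rng.random() < self.r_bar[s, a])
        self.state = s_next
        return reward, s_next

# M independent and identical environments, one per agent.
M, S, A = 4, 6, 2
model_rng = np.random.default_rng(0)
P = model_rng.dirichlet(np.ones(S), size=(S, A))  # arbitrary valid kernel of shape (S, A, S)
r_bar = model_rng.random((S, A))
envs = [TabularMDP(P, r_bar, np.random.default_rng(100 + i)) for i in range(M)]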

Let a policy $\pi:\mathcal{S}\to\mathcal{A}$ be a function that determines which action to select in state $s\in\mathcal{S}$. Note that each policy induces a Markov chain on the states $\mathcal{S}$ with transition probabilities $P_{s,s^{\prime}}=P(s^{\prime}|s,\pi(s))$. Having defined a policy, we can now define the diameter of the MDP $\mathcal{M}$:

Definition 1 (Diameter).

Consider the Markov chain induced by the policy $\pi$ on the MDP $\mathcal{M}$. Let $T(s^{\prime}|\mathcal{M},\pi,s)$ be a random variable that denotes the first time step at which this Markov chain enters state $s^{\prime}$ starting from state $s$. Then, the diameter of the MDP $\mathcal{M}$ is defined as:

$D(\mathcal{M})=\max_{s^{\prime}\neq s}\min_{\pi}\mathbb{E}\left[T(s^{\prime}|\mathcal{M},\pi,s)\right]$ (1)

We assume that the MDP $\mathcal{M}$ has a finite diameter, which means that there exists a policy under which all states $s\in\mathcal{S}$ communicate with each other.
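
For small MDPs, the diameter in Definition 1 can be computed directly: for each target state $s^{\prime}$, the minimal expected hitting time $h(s)=\min_{\pi}\mathbb{E}[T(s^{\prime}|\mathcal{M},\pi,s)]$ satisfies the shortest-path style fixed point $h(s)=1+\min_{a}\sum_{s^{\prime\prime}}P(s^{\prime\prime}|s,a)h(s^{\prime\prime})$ with $h(s^{\prime})=0$. A minimal Python sketch of this computation (assuming the $(S,A,S)$ transition array P from the previous illustration, and assuming the MDP is communicating so the hitting times are finite) is:

import numpy as np

def diameter(P, iters=100_000, tol=1e-9):
    """Estimate D(M) = max_{s != s'} min_pi E[T(s' | M, pi, s)] by value iteration
    on minimal expected hitting times (one unit-cost shortest-path problem per target)."""
    S, A, _ = P.shape
    D = 0.0
    for target in range(S):
        h = np.zeros(S)  # h[s] approximates the minimal expected time to reach `target` from s
        for _ in range(iters):
            h_new = 1.0 + (P @ h).min(axis=1)  # Bellman update; cost 1 per step
            h_new[target] = 0.0                # the target state is absorbing
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, h.max())
    return D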

Any agent $i\in[M]$, starting from an initial random state $s_{i,1}=s$, follows an algorithm $\mathscr{A}$ for $T$ time steps and collects a cumulative reward of $R_{i}$. Also, let $\rho_{i}$ denote the average reward of the agent following algorithm $\mathscr{A}$. That is,

$R_{i}(\mathcal{M},\mathscr{A},s,T)=\sum\nolimits_{t=0}^{T}r_{i,t}$ (2)
$\rho_{i}(\mathcal{M},\mathscr{A},s)=\lim_{T\to\infty}\frac{1}{T}\sum\nolimits_{t=0}^{T}\mathbb{E}\left[r_{i,t}\right]$ (3)

Let there be an algorithm $\mathscr{A}$ which always selects actions according to a stationary policy $\pi:\mathcal{S}\to\mathcal{A}$. Then, we write $\rho(\mathcal{M},\mathscr{A},s)=\rho(\mathcal{M},\pi,s)$. The optimal average reward does not depend on the state [Puterman, 1994, Section 8.3.3], and hence for the optimal policy $\pi^{*}$ which maximizes the average reward $\rho(\mathcal{M},\pi,s)$ we have,

$\rho^{*}(\mathcal{M})\coloneqq\rho^{*}(\mathcal{M},s)\coloneqq\max\nolimits_{\pi}\rho(\mathcal{M},\pi,s).$ (4)

Further, the optimal average reward satisfies [Puterman, 1994, Theorem 8.4.7],

$\rho^{*}+v(s)=\bar{r}(s,a^{*})+\sum\nolimits_{s^{\prime}}P(s^{\prime}|s,a^{*})v(s^{\prime})\quad\forall\, s\in\mathcal{S}$ (5)

where $a^{*}=\pi^{*}(s)$, and $v(s)$ is called the bias of the state $s$; it denotes the extra reward obtained from starting in the state $s$. Note that $v(s)$ is not unique: if $v(s)$ for all $s$ satisfies Equation (5), then so does $v(s)+c$ for any $c\in\mathbb{R}$, and hence the bias is translation invariant.

We aim to maximize the cumulative reward collected by all the agents. Hence, we want to develop an algorithm that, starting from no knowledge about the system, learns a policy that minimizes the regret. The regret of an algorithm $\mathscr{A}$, for starting state $s$ and running for time $T$, is defined as:

$\Delta(\mathcal{M},\mathscr{A},s,T):=\rho^{*}MT-\sum\nolimits_{i\in[M]}R_{i}(\mathcal{M},\mathscr{A},s,T)$

We will now present our algorithm dist-UCRL, which uses upper confidence bounds to bound the regret $\Delta(\mathcal{M},\textsc{dist-UCRL},s,T)$ with high probability using only $O(MAS\log_{2}(MT))$ communication rounds.

IV dist-UCRL Algorithm

We consider that each agent runs an instance of the dist-UCRL algorithm. The dist-UCRL algorithm running at agent $i$ is described in Algorithm 1. The algorithm proceeds in epochs indexed as $k=1,2,\cdots$. The start of every epoch coincides with a synchronization step, at which every agent communicates with the central node to share data and update policies. This also implies that the number of synchronization rounds required by the algorithm equals the number of epochs for which the algorithm runs.

Algorithm 1 dist-UCRL at agent $i$
1:  Input: $S,A,M$
2:  Set parameters $P_{i}(s,a,s^{\prime})=0\ \forall\,(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$ and $\hat{r}_{i}(s,a)=0\ \forall\,(s,a)\in\mathcal{S}\times\mathcal{A}$.
3:  for Epochs $k=1,2,\cdots$ do
4:     Set $\nu_{i,k}(s,a)=0\ \forall\,(s,a)\in\mathcal{S}\times\mathcal{A}$.
5:     $\pi_{k},N_{k}=$ synchronize$(P_{i},\hat{r}_{i},t)$
6:     while $\nu_{i,k}(s,a)<\max(1,N_{k}(s,a))/M\ \forall\,(s,a)$ and synchronization not requested do
7:        Play action $a_{i,t}=\pi_{k}(s_{i,t})$, observe reward $r_{t}$ and next state $s_{i,t+1}$.
8:        Set $\nu_{i,k}(s_{i,t},a_{i,t})=\nu_{i,k}(s_{i,t},a_{i,t})+1$, $P_{i}(s_{i,t},a_{i,t},s_{i,t+1})=P_{i}(s_{i,t},a_{i,t},s_{i,t+1})+1$, $\hat{r}_{i}(s_{i,t},a_{i,t})=\hat{r}_{i}(s_{i,t},a_{i,t})+r_{t}$.
9:        Set $t=t+1$.
10:     end while
11:     Request synchronization
12:  end for

Algorithm 1, running at agent $i$, maintains two counters, $\nu_{i,k}(s,a)$ and $P_{i}(s,a,s^{\prime})$. $\nu_{i,k}(s,a)$ counts the number of visitations to state-action pair $(s,a)$ in epoch $k$, and $P_{i}(s,a,s^{\prime})$ counts the instances when the agent moves to state $s^{\prime}$ on taking action $a$ in state $s$. The agent also stores $\hat{r}_{i}(s,a)$, the cumulative reward obtained in $(s,a)$.
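
A minimal Python sketch of this agent-side loop is given below; the communication primitives `synchronize`, `sync_requested`, and `request_sync`, as well as the environment interface (the one sketched in Section III), are hypothetical placeholders for the paper's communication layer, not part of its specification.

import numpy as np

def dist_ucrl_agent(env, S, A, M, T, synchronize, sync_requested, request_sync):
    """Sketch of Algorithm 1 at one agent. `synchronize(P_i, r_hat_i, t)` ships the
    counts to the central node and returns (policy, N); `sync_requested()` and
    `request_sync()` model the broadcast synchronization signal."""
    P_i = np.zeros((S, A, S))   # counts of (s, a, s') transitions observed by this agent
    r_hat_i = np.zeros((S, A))  # cumulative reward collected in each (s, a)
    t, s = 1, env.state
    while t <= T:
        nu = np.zeros((S, A))   # visit counts in the current epoch
        policy, N = synchronize(P_i, r_hat_i, t)
        # Stay in the epoch while every pair's epoch count is below max(1, N(s,a)) / M
        # and no agent (including this one) has asked to synchronize.
        while np.all(nu < np.maximum(1, N) / M) and not sync_requested():
            a = policy[s]
            r, s_next = env.step(a)
            nu[s, a] += 1
            P_i[s, a, s_next] += 1
            r_hat_i[s, a] += r
            s, t = s_next, t + 1
            if t > T:
                return
        request_sync()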

Algorithm 2 synchronize at central node
1:  Input: $P_{i},\hat{r}_{i}$ from all agents $i\in[M]$, $t$.
2:  for $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
3:     Set $N(s,a)=\sum_{i}\sum_{s^{\prime}}P_{i}(s,a,s^{\prime})$.
4:     Set $\hat{p}(s,a,s^{\prime})=\frac{\sum_{i}P_{i}(s,a,s^{\prime})}{\max\{1,N(s,a)\}}$
5:     Set $\hat{\bar{r}}(s,a)=\frac{\sum_{i}\hat{r}_{i}(s,a)}{\max\{1,N(s,a)\}}$
6:     Set $\tilde{r}(s,a)=\hat{\bar{r}}(s,a)+\sqrt{\frac{7\log(2MSAt)}{2\max\{1,N(s,a)\}}}$
7:     Set $d(s,a)=\sqrt{\frac{14S\log(2MAt)}{\max\{1,N(s,a)\}}}$
8:  end for
9:  Set $\pi$ = Extended Value Iteration$(\hat{p},d,\tilde{r},\frac{1}{\sqrt{Mt}})$
10:  Return $\pi,N$

Let $N_{k}(s,a)=\sum_{i=1}^{M}\sum_{k^{\prime}=1}^{k-1}\nu_{i,k^{\prime}}(s,a)$ be the total number of visitations by all agents until the start of epoch $k$. Hence, $N_{1}(s,a)=0$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. At the start of every epoch $k\geq 1$, the agents obtain the policy for epoch $k$ and the total visitation counts $N_{k}(s,a)$. We denote by $t_{k}$ the time step at which epoch $k$ starts and the agents synchronize for the $k^{th}$ time. At $t=1$, the algorithm synchronizes all the agents for the very first time, i.e., $t_{1}=1$. Afterwards, a new epoch is triggered whenever any of the agents requests synchronization (the synchronization request can be sent to the server, which will in turn pause and synchronize the entire system). An agent $i$ requests synchronization whenever $\nu_{i,k}(s,a)$ becomes at least $1/M$ of $N_{k}(s,a)$ for any state-action pair. We assume that every agent is able to receive the synchronization signal instantly and stop further processing for the current epoch. The algorithm calls the synchronize algorithm every time a new epoch starts and updates the policy $\pi_{k}$ and the $N_{k}(s,a)$ values. Every agent then selects actions according to the policy $\pi_{k}$ in epoch $k$.

The synchronize algorithm is described in Algorithm 2. This algorithm calculates the estimates of the transition probabilities $\hat{p}(\cdot|s,a)$ and the mean rewards $\hat{\bar{r}}(s,a)$ using the samples from all $M$ agents. We then consider the set of all plausible MDPs $\mathscr{M}(t)$ that lie in the neighborhood of the estimated MDP $\widehat{\mathcal{M}}=(\mathcal{S},\mathcal{A},\hat{p},\hat{\bar{r}})$. The mean rewards $r^{\prime}(s,a)$ and the transition probabilities $p^{\prime}(\cdot|s,a)$ of every MDP in the set $\mathscr{M}(t)$ satisfy:

$|\hat{\bar{r}}(s,a)-r^{\prime}(s,a)|\leq\sqrt{\frac{7\log(2MSAt)}{2\max\{1,N(s,a)\}}}$ (6)
$\|\hat{p}(\cdot|s,a)-p^{\prime}(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(2MAt)}{\max\{1,N(s,a)\}}}$ (7)

After obtaining $\mathscr{M}(t)$, Algorithm 2 calls the Extended Value Iteration algorithm, which computes the optimal policy for the optimistic MDP $\tilde{\mathcal{M}}_{t}$ in the set $\mathscr{M}(t)$. The optimistic MDP satisfies $\rho^{*}(\tilde{\mathcal{M}})=\sup_{\mathrm{M}\in\mathscr{M}(t)}\rho^{*}(\mathrm{M})$. As described in [Jaksch et al., 2010], it is not trivial to directly find the optimistic MDP in $\mathscr{M}(t)$. Hence, we consider an extended MDP $\mathcal{M}_{t}^{+}$ constructed with the same state space and a continuous action space $(a,q(\cdot|\cdot,a))\in\mathcal{A}\times\mathscr{P}_{t}$, where $\mathscr{P}_{t}$ is the set of transition probabilities for action $a\in\mathcal{A}$ that satisfy Equation (7). When $\mathscr{M}(t)$ contains the true MDP $\mathcal{M}$, the diameter of the extended MDP $\mathcal{M}_{t}^{+}$ is bounded by $D$, since a policy under which all states communicate in the MDP $\mathcal{M}$ also ensures that all states communicate in the extended MDP $\mathcal{M}_{t}^{+}$.
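
The server-side computation behind Equations (6) and (7), i.e., pooling the counts of all agents and forming the empirical model together with its confidence radii as in Algorithm 2, can be sketched as follows (our own simplified illustration, with variable names chosen to mirror the notation above):

import numpy as np

def aggregate(P_list, r_hat_list, t, M, S, A):
    """Pool the per-agent counts and compute the empirical model and the
    confidence radii of Equations (6) and (7)."""
    P_sum = np.sum(P_list, axis=0)      # pooled transition counts, shape (S, A, S)
    r_sum = np.sum(r_hat_list, axis=0)  # pooled cumulative rewards, shape (S, A)
    N = P_sum.sum(axis=2)               # pooled visit counts N(s, a)
    N_safe = np.maximum(1, N)

    p_hat = P_sum / N_safe[:, :, None]  # empirical transition probabilities
    r_bar_hat = r_sum / N_safe          # empirical mean rewards
    r_tilde = r_bar_hat + np.sqrt(7 * np.log(2 * M * S * A * t) / (2 * N_safe))
    d = np.sqrt(14 * S * np.log(2 * M * A * t) / N_safe)  # L1 radius around p_hat
    return p_hat, r_tilde, d, N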

Algorithm 3 Extended Value Iteration
1:  Input: $\hat{p},d,\tilde{r},\epsilon$.
2:  Set $u_{0}(s)=0,u_{1}(s)=\max_{a}\tilde{r}(s,a),i=1$.
3:  Set $\pi(s)=\arg\max_{a}\tilde{r}(s,a)$
4:  while $\max_{s}\{u_{i}(s)-u_{i-1}(s)\}-\min_{s}\{u_{i}(s)-u_{i-1}(s)\}\geq\epsilon$ do
5:     Sort $s_{1}^{\prime},\cdots,s_{S}^{\prime}$ such that $u_{i}(s_{1}^{\prime})\geq\cdots\geq u_{i}(s_{S}^{\prime})$.
6:     Set $p(s_{1}^{\prime})=\min\{1,\hat{p}(s_{1}^{\prime}|s,a)+d(s,a)/2\}$
7:     Set $p(s_{n}^{\prime})=\hat{p}(s_{n}^{\prime}|s,a),n=2,3,\cdots,S$.
8:     Set $l=S$
9:     while $\sum_{s^{\prime}}p(s^{\prime})>1$ do
10:        Set $p(s_{l}^{\prime})=\max\{0,1-\sum_{s_{n}^{\prime}\neq s_{l}^{\prime}}p(s_{n}^{\prime})\}$
11:        Set $l=l-1$
12:     end while
13:     $i=i+1$
14:     $u_{i}(s)=\max_{a}\left\{\tilde{r}(s,a)+\sum_{s^{\prime}}p(s^{\prime})u_{i-1}(s^{\prime})\right\}$
15:     $\pi(s)=\arg\max_{a}\left\{\tilde{r}(s,a)+\sum_{s^{\prime}}p(s^{\prime})u_{i-1}(s^{\prime})\right\}$
16:  end while
17:  Return $\pi$

The Extended Value Iteration algorithm (Algorithm 3) follows the design of the Extended Value Iteration of the UCRL2 algorithm by Jaksch et al. [2010]. As described by Jaksch et al. [2010], Extended Value Iteration (EVI) obtains a policy $\pi$ that is $\epsilon$-optimal for the extended MDP $\mathcal{M}_{t}^{+}$, and in turn for the optimistic MDP $\tilde{\mathcal{M}}$. Algorithm 3 calculates the values of the states of the extended MDP $\mathcal{M}_{t}^{+}$, i.e., the utilities of the states and the actions that achieve these utilities. Note that, unlike the extended value iteration in the UCRL2 algorithm, we consider Algorithm 3 to have converged when $\max_{s}(u_{i+1}(s)-u_{i}(s))-\min_{s}(u_{i+1}(s)-u_{i}(s))\leq\epsilon=1/\sqrt{Mt}$. This is because we now have $M$ times more samples up to any time step $t$ than the UCRL2 algorithm. The EVI, at the start of epoch $k$, returns a policy $\pi_{k}$ that satisfies $\rho(\tilde{\mathcal{M}},\pi_{k})\geq\rho^{*}(\tilde{\mathcal{M}}_{t_{k}})-1/\sqrt{Mt_{k}}$.
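
The inner step of Algorithm 3, choosing within the $\ell_{1}$ ball of radius $d(s,a)$ around $\hat{p}(\cdot|s,a)$ the transition vector that maximizes $\sum_{s^{\prime}}p(s^{\prime})u(s^{\prime})$, can be sketched in Python as follows (our implementation of the standard construction from Jaksch et al. [2010], not code from the paper):

import numpy as np

def optimistic_transition(p_hat_sa, d_sa, u):
    """Return the probability vector within L1 distance d_sa of p_hat_sa that
    maximizes sum_{s'} p(s') u(s'): put as much extra mass as allowed on the
    highest-value state, then remove mass from the lowest-value states."""
    order = np.argsort(-u)  # states sorted by decreasing value u
    p = p_hat_sa.astype(float).copy()
    best = order[0]
    p[best] = min(1.0, p_hat_sa[best] + d_sa / 2.0)
    for s in order[::-1]:   # walk from the lowest-value state upwards
        if p.sum() <= 1.0:
            break
        p[s] = max(0.0, 1.0 - (p.sum() - p[s]))
    return p

Line 14 of Algorithm 3 then takes, for each state, the maximum over actions of $\tilde{r}(s,a)+\sum_{s^{\prime}}p(s^{\prime})u_{i-1}(s^{\prime})$ with $p$ produced by this routine for the corresponding $(s,a)$.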

We also note that a central controller is not necessarily required: if the agents form a completely connected network, they can share their data with each other and run the central controller's computations themselves. Further, the completely connected assumption can be relaxed by considering a setup in which all agents forward the messages they receive, which allows the information to be broadcast. Hence, the proposed algorithm can be generalized to any network structure as long as all the agents are connected via some path.

V Results for dist-UCRL 

Having described the algorithm, we now bound the regret of the dist-UCRL algorithm and show that the bound holds with high probability, in the form of the following theorem.

Theorem 1.

For an MDP $\mathcal{M}=([S],[A],P,r)$ with diameter $D$ and any starting state $s$, the regret of the dist-UCRL algorithm, running on $M$ agents for $T$ time steps, is upper bounded with probability at least $1-\frac{1}{(MT)^{5/4}}$ as:

$\Delta(\mathcal{M},\textsc{dist-UCRL},s,T)\leq\tilde{O}(DS\sqrt{MAT})$ (8)

where $\tilde{O}$ hides poly-log terms in $M,S,A,$ and $T$.

Let $m$ be the total number of synchronizations performed by the agents running the dist-UCRL algorithm up to time $T$. We bound $m$ deterministically in the following theorem.

Theorem 2.

The total number of communication rounds $m$ for dist-UCRL up to step $T\geq SA/M$ is upper bounded as

$m\leq 1+2MAS+MAS\log_{2}\left(MT\right)$ (9)
Proof.

We use the fact that when $\nu_{i,k}(s,a)\geq N_{k}(s,a)/M$ for some state-action pair $(s,a)$ and for some agent $i$, the total visitation count satisfies $N_{k+1}(s,a)\geq N_{k}(s,a)+\nu_{i,k}(s,a)\geq N_{k}(s,a)\left(1+\frac{1}{M}\right)$. This gives exponential growth of the total visitation count of the triggering state-action pair. Also, since the total visitation count over all state-action pairs is upper bounded by $MT$, using Jensen's inequality we bound the number of epochs by a logarithmic order of $MT$. A complete proof is provided in Appendix D. ∎
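
To make the counting explicit (an informal version of the argument in Appendix D): once $N_{k}(s,a)\geq 1$, every synchronization triggered by the pair $(s,a)$ multiplies its pooled count by at least $(1+1/M)$, and the pooled count never exceeds $MT$, so the number of epochs $K(s,a)$ triggered by $(s,a)$ satisfies

$(1+1/M)^{K(s,a)}\leq MT\quad\Longrightarrow\quad K(s,a)\leq\frac{\log(MT)}{\log(1+1/M)}\leq M\log_{2}(MT),$

using $\log(1+x)\geq x\log 2$ for $x\in[0,1]$. Summing over the $SA$ state-action pairs recovers the $MAS\log_{2}(MT)$ term in Equation (9).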

We now state the lemmas required for the proof of Theorem 1. The first three lemmas handle the stochastic nature of the algorithm and the environment. The first lemma provides a concentration bound on the $\ell_{1}$-deviation of the transition probability estimates $\hat{p}(\cdot|s,a)$ for any $(s,a)$.

Lemma 1.

The $\ell_{1}$-deviation between the true distribution and the empirical distribution over the next states, estimated using $n$ samples, given the current state $s$ and action $a$, is bounded as

$\mathbb{P}\left(\|\hat{p}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\geq\epsilon\right)\leq 2^{S}\exp\left(-\frac{n\epsilon^{2}}{2}\right)$ (10)
Proof.

The proof follows along the lines of [Weissman et al., 2003, Theorem 2.1], with the distribution taken to be the transition probabilities. ∎

Lemma 2 (Hoeffding’s Inequality, [Hoeffding, 1994]).

Let $\{X_{t}\}_{t=1}^{T}$ be i.i.d. random variables taking values in $[0,1]$ with mean $\mu$. Then, we have,

$P\left(\sum_{t=1}^{T}(X_{t}-\mu)\geq\epsilon\right)\leq\exp\left(-\frac{2\epsilon^{2}}{T}\right)$ (11)

The next lemma provides a concentration bound on the sum of $M$ independent Martingale sequences of length $T$.

Lemma 3.

Let $\{X_{i,t}\}_{t=1}^{T}$ be a zero-mean Martingale sequence for $i=1,\cdots,M$, adapted to the filtration $\{\mathcal{F}_{t}\}_{t=0}^{T}$. Then, if $\{X_{i,t}\}_{t=1}^{T}$ and $\{X_{j,t}\}_{t=1}^{T}$ are independent for all $i\neq j$ and $|X_{i,t}|\leq c$ conditioned on $X_{i,t-1}$, for all $i,t$, we have,

$P\left(\sum_{t=1}^{T}\sum_{i=1}^{M}X_{i,t}\geq\epsilon\right)\leq\exp\left(-\frac{2\epsilon^{2}}{MTc^{2}}\right)$ (12)
Proof Sketch.

We prove this lemma similarly to the proof of the Azuma-Hoeffding inequality [Hoeffding, 1994]. A detailed proof is provided in Appendix A. ∎
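
As a quick numerical sanity check of Lemma 3 (not a proof), the following sketch uses i.i.d. zero-mean increments bounded in $[-c/2,c/2]$, a special case satisfying the lemma's assumptions with conditional range $c$, and compares the empirical tail of $\sum_{t}\sum_{i}X_{i,t}$ with the bound $\exp(-2\epsilon^{2}/(MTc^{2}))$; all constants are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
M, T, c = 4, 200, 1.0
eps, trials = 20.0, 5000

# i.i.d. zero-mean increments, uniform on [-c/2, c/2], independent across agents and time.
X = rng.uniform(-c / 2, c / 2, size=(trials, M, T))
sums = X.sum(axis=(1, 2))

empirical = np.mean(sums >= eps)
bound = np.exp(-2 * eps**2 / (M * T * c**2))
print(f"empirical tail = {empirical:.4f}, Lemma 3 bound = {bound:.4f}")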

We now bound the growth of the total number of visitations to a state-action pair $(s,a)$, $\sum_{i=1}^{M}\nu_{i,k}(s,a)$, in any epoch $k$. If the total visitations are large, then the agents may incur a large regret from a possibly sub-optimal policy. Hence, we have the following lemma:

Lemma 4.

For any epoch $k$, we have,

$\sum\nolimits_{i=1}^{M}\nu_{i,k}(s,a)\leq N_{k}(s,a)+M-1$ (13)
Proof.

Note that agent $i$ requests synchronization, and triggers a new epoch, whenever $\nu_{i,k}(s,a)=\lceil N_{k}(s,a)/M\rceil\leq N_{k}(s,a)/M+(M-1)/M$. Summing over all agents $i$ gives the bound. ∎

Lemma 5 (Lemma 19 [Jaksch et al., 2010]).

For any sequence of numbers $z_{1},\cdots,z_{n}$ with $0\leq z_{k}\leq Z_{k-1}\coloneqq\max\left\{1,\sum_{k^{\prime}=1}^{k-1}z_{k^{\prime}}\right\}$, we have,

$\sum_{k=1}^{n}\frac{z_{k}}{\sqrt{Z_{k-1}}}\leq\left(\sqrt{2}+1\right)\sqrt{Z_{n}}$ (14)

The last lemma states that the span of the bias of the optimal policy, defined as $\max_{s}v(s)-\min_{s}v(s)$, is bounded by the diameter $D$.

Lemma 6 (Remark 8 from Jaksch et al. [2010]).

The span of the bias $v:\mathcal{S}\to\mathbb{R}$ of the optimal policy $\pi$ for any MDP $\mathcal{M}$ is upper bounded by its diameter $D(\mathcal{M})$, i.e.,

$sp(v)=\max_{s}v(s)-\min_{s}v(s)\leq D(\mathcal{M})$ (15)

Having stated the necessary lemmas, we are now ready to prove the regret bound of the dist-UCRL algorithm, i.e., to bound $\Delta(\mathcal{M},\textsc{dist-UCRL},s,T)$. We provide a detailed sketch here and the complete proof in Appendix C.

Proof Sketch of Theorem 1.

We break the regret expression into four different sources of regret:

1. Regret from deviating from the expected reward: Note that the regret compares the expected optimal gain $\rho^{*}$ with the observed rewards $r_{i,t}$. Since $r_{i,t}$ is a random variable in $[0,1]$, the agents suffer regret if the observed rewards are lower than their means. Hence, we use Hoeffding's inequality [Hoeffding, 1994] to bound the regret generated by the randomness of the observed rewards. This gives a regret contribution bounded by $\tilde{O}(\sqrt{MT})$.

2. Regret from deviating from the expected next state: When transitioning to the next state, the algorithm expects a certain bias given the current state. However, the bias of the realized state may differ from, and even be lower than, the expected bias. Hence, we bound the deviation of the realized bias from the expected bias as the algorithm moves through states. Starting from state $s$, the deviation process is modelled as a zero-mean Martingale sequence over the states visited by an agent $i$. Also, we have $M$ independent agents interacting with $M$ independent environments. We use Lemma 3 to bound the total deviation between the realized bias and the expected bias. This gives a regret contribution bounded by $\tilde{O}(D\sqrt{MT})$.

3. Regret from not optimizing for the true MDP: For epoch $k$, we run a $\frac{1}{\sqrt{Mt_{k}}}$-optimal policy for an optimistic MDP from the set of MDPs $\mathscr{M}(t_{k})$ for $\sum_{i\in[M]}\sum_{s,a}\nu_{i,k}(s,a)$ steps. The transition probabilities and the mean rewards of the MDPs in $\mathscr{M}(t_{k})$ satisfy Equation (7) and Equation (6). When the true MDP $\mathcal{M}$ lies in $\mathscr{M}(t_{k})$, we bound the regret due to using an incorrect MDP by the product of the diameter of the set $\mathscr{P}_{t_{k}}$, the diameter $D$ of the MDP $\mathcal{M}$, and the number of visitations to any state-action pair $(s,a)$ in epoch $k$, using Lemma 6. It then remains to bound the sum of this product over all epochs, which we do using Lemma 4 and Lemma 5. This gives a regret contribution bounded by $\tilde{O}(DS\sqrt{MAT})$.

4. Regret when the estimated MDP is far from the true MDP $\mathcal{M}$: We use Lemma 1 to bound the $\ell_{1}$ distance between the estimated and the true transition probabilities for any state-action pair. Further, we use Hoeffding's inequality (Lemma 2) to bound the distance between the true mean rewards and the estimated rewards. Taking union bounds over all time steps up to $T$, we obtain the bound for all possible values of $N(s,a)$. Further, taking union bounds over all states and actions provides the desired concentration bounds for all states and actions. Finally, we take a union bound over all values of $t\geq(MT)^{1/4}$. This bounds the probability that the true MDP $\mathcal{M}$ does not lie in $\mathscr{M}(t)$ as:

$\mathbb{P}\left(\exists\, t\geq(MT)^{1/4}:\ \mathcal{M}\notin\mathscr{M}(t)\right)\leq\frac{1}{(MT)^{5/4}}.$ (16)

Now summing over all the regret sources, we get the regret bound of the dist-UCRL algorithm. ∎
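
A compact way to read the sketch: on the event that $\mathcal{M}\in\mathscr{M}(t)$ for all relevant $t$, which by part 4 fails with probability at most $(MT)^{-5/4}$, the first three contributions add up as

$\Delta(\mathcal{M},\textsc{dist-UCRL},s,T)\leq\tilde{O}(\sqrt{MT})+\tilde{O}(D\sqrt{MT})+\tilde{O}(DS\sqrt{MAT})=\tilde{O}(DS\sqrt{MAT}),$

since the third term dominates.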

The proofs of Theorem 1 and Theorem 2 suggest that the analysis could potentially be extended to various other algorithms that follow the epoch-termination condition of the UCRL2 algorithm of Jaksch et al. [2010].

VI Naive Approach: mod-UCRL2 for multiple agents

The proposed dist-UCRL algorithm does not require the agents to act sequentially, and hence the $M$ agents can truly work in parallel. For comparison with the proposed algorithm, and for completeness, we also consider an extension of the UCRL2 algorithm of Jaksch et al. [2010] to $M$ parallel agents, which we call mod-UCRL2. In the mod-UCRL2 algorithm, we assume that all the agents communicate with a centralized server at each time step $t$, and the server decides the actions for the agents at each time $t$. Thus, the number of communication rounds for the mod-UCRL2 algorithm is $O(T)$, while the regret analysis is not a priori clear and is the focus of this section. We also note that even though this algorithm is an easy extension of UCRL2, the analysis of its regret is not straightforward and, because of the presence of $M$ agents, uses the approach developed to prove the regret guarantees of dist-UCRL.

At every time step $t$, every agent $i\in[M]$ observes its state $s_{i,t}$ and sends it to the server. The server, after receiving all the states, processes the requests sequentially in the order $s_{1,t},s_{2,t},\cdots,s_{M,t}$. The central server runs an instance of the UCRL2 algorithm with the state sequence $s_{1,1},s_{2,1},\cdots,s_{M,1},s_{1,2},\cdots$ and the corresponding action sequence $a_{1,1},a_{2,1},\cdots,a_{M,1},a_{1,2},\cdots$. The UCRL2 algorithm, running at the server, proceeds in epochs, with epoch $1$ starting at $(i,t)=(1,1)$. We consider that an epoch $k$ contains the observations for $\{(\underline{i}_{k},\underline{t}_{k}),(\underline{i}_{k},\underline{t}_{k})+1,\cdots,(\bar{i}_{k},\bar{t}_{k})\}$ for some $\underline{i}_{k},\underline{t}_{k},\bar{i}_{k},\bar{t}_{k}$. The server maintains a counter $\nu_{k}(s,a)$ denoting the number of times a state-action pair $(s,a)$ is visited in epoch $k$, and a counter $N_{k}(s,a)$ denoting the number of times a state-action pair $(s,a)$ is visited before the start of epoch $k$:

$\nu_{k}(s,a)=\sum\nolimits_{(i,t)=(\underline{i}_{k},\underline{t}_{k})}^{(\bar{i}_{k},\bar{t}_{k})}\bm{1}\{s_{i,t}=s,a_{i,t}=a\}$ (17)
$N_{k}(s,a)=\sum\nolimits_{k^{\prime}=1}^{k-1}\nu_{k^{\prime}}(s,a)$ (18)

The server starts a new epoch whenever $\nu_{k}(s,a)=\max\{1,N_{k}(s,a)\}$ for some $(s,a)$. Following the UCRL2 algorithm, the server updates the policy at the beginning of every epoch using the observations collected until then. The complete algorithm is provided in Appendix E.
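
A minimal Python sketch of this round-robin interface is given below; `envs` is a list of $M$ environments with the interface sketched in Section III, and `ucrl2_learner` is a hypothetical single-stream UCRL2-style learner exposing `act` and `update` methods (a placeholder, not an API defined in the paper).

def mod_ucrl2(envs, ucrl2_learner, T):
    """The server treats the M agents as interfaces to M environments and feeds a
    single UCRL2 instance the flattened stream s_{1,t}, s_{2,t}, ..., s_{M,t}."""
    states = [env.state for env in envs]
    total_reward = 0.0
    for t in range(1, T + 1):
        for i, env in enumerate(envs):   # process agents sequentially at time t
            a = ucrl2_learner.act(states[i])
            r, s_next = env.step(a)
            ucrl2_learner.update(states[i], a, r, s_next)
            states[i] = s_next
            total_reward += r
    return total_reward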

Note that in the mod-UCRL2 algorithm the agents only act as interfaces to the $M$ independent environments, and the algorithm is essentially impractical because of the sequential interface. In the following result, we formally state that the regret is upper bounded by $\tilde{O}(DS\sqrt{MAT})$.

Theorem 3.

For an MDP $\mathcal{M}=([S],[A],P,r)$ with diameter $D$ and any starting state $s$, the regret of the mod-UCRL2 algorithm, running on $M$ agents for $T$ time steps, is upper bounded with probability at least $1-\frac{1}{(MT)^{5/4}}$ as:

$\Delta(\mathcal{M},\textsc{mod-UCRL2},s,T)\leq\tilde{O}(DS\sqrt{MAT})$

where $\tilde{O}$ hides poly-log terms in $M,S,A,$ and $T$.

Proof Outline.

Similar to the proof of Theorem 1, we again consider the four sources of regret. Then, breaking the regret into the episodes at which the mod-UCRL2 algorithm updates its policy, we obtain the required bound. The complete proof is provided in Appendix F. ∎

VII Evaluations

In this section, we analyze the performance of the proposed dist-UCRL algorithm empirically. We test the dist-UCRL algorithm in multiple environments and vary the number of agents in all cases to study the growth of the regret with respect to $M$. Further, we also evaluate the average number of communication steps used by the dist-UCRL algorithm up to time step $T$.

Figure 1: Average cumulative regret per agent under various communication strategies for (a) the RiverSwim environment, (b) the RiverSwim-12 environment, and (c) the Gridworld environment. The empirical regret of the dist-UCRL algorithm and the mod-UCRL2 algorithm are almost identical, which is expected from the regret analysis of the two algorithms.

We first run the dist-UCRL algorithm on the RiverSwim environment, a standard benchmark for model-based RL algorithms [Ian et al., 2013, Tossou et al., 2019] with $6$ states and $2$ actions. Next, we construct an extended RiverSwim environment with $12$ states and $2$ actions. Finally, we use a Grid-World environment [Sutton and Barto, 2018] on a $7\times 7$ grid, which amounts to $20$ states and $4$ actions.

We compare the dist-UCRL algorithm against the mod-UCRL2 algorithm. We also compare with the standard UCRL2 algorithm for $M=1$. Note that for $M=1$, both the dist-UCRL algorithm and the mod-UCRL2 algorithm reduce to the UCRL2 algorithm.

Figure 2: Total number of communication rounds required by the dist-UCRL algorithm for multiple numbers of agents across various environments: (a) RiverSwim, (b) RiverSwim-12, and (c) Gridworld.

We run $50$ independent iterations of the algorithm. We plot the average per-agent regret, $\Delta(\mathcal{M},\textsc{dist-UCRL},s,T)/M$, over the $50$ iterations, along with error bars, for both the dist-UCRL and the mod-UCRL2 algorithms in Figure 1. We vary the number of agents $M$ over $1,4$, and $16$ to reduce clutter in the figures. From Figure 1(a) and Figure 1(c), we note that the per-agent regret decreases approximately by a factor of $2$ for every $4$-fold increase in the number of agents. This is expected and was indeed the goal.

Note that in Figure 1(b), the per-agent regret goes from being linear for UCRL2 ($M=1$) to significantly sublinear as the number of agents is increased to $M=4$ and $M=16$. For the extended RiverSwim environment with $12$ states, the diameter is approximately $5\times 10^{4}$. Hence, the available time horizon of $T=10^{5}$ was not sufficient for UCRL2. However, the distributed RL algorithms still achieve a sub-linear regret by pooling their knowledge. This further demonstrates the advantage of deploying multiple parallel agents in complex environments to enhance exploration.

We also empirically evaluate the number of communication rounds required by the dist-UCRL algorithm. Note that the number of communications for the mod-UCRL2 algorithm equals the total number of time steps for which the algorithm runs. Hence, we only plot the number of communication rounds required by the dist-UCRL algorithm. We vary the number of agents $M$ in multiples of $2$, starting from $M=2$ agents. We again run the experiment for $50$ independent iterations and plot the average number of synchronization rounds up to time step $t$.

We plot the number of communication rounds of the dist-UCRL algorithm in Figure 2. We observe that the number of synchronization rounds increases very slowly with $t$. Further, as suggested by Theorem 2, the number of synchronization rounds increases with $M$. However, we note that for large values of $t$, the increase in the number of communication rounds is sub-linear. This is because, for the bound on the number of communication rounds, we take a pessimistic estimate of the growth of the visitation counts $N_{k}(s,a)$, in which $N_{k}(s,a)$ grows by a factor of $(1+1/M)$ only when some agent $i$ triggers the synchronization round for $(s,a)$. In practice, $N_{k}(s,a)$ grows faster than $(1+1/M)$, since synchronization rounds triggered by other state-action pairs and other agents also contribute to $N_{k}(s,a)$.

VIII Conclusion

In this work, we considered the problem of simultaneously reducing the cumulative regret and the number of communication rounds between $M$ agents. The $M$ agents interact with $M$ independent and identical Markov Decision Processes and share their data to learn the optimal policy faster. To this end, we proposed dist-UCRL, an epoch-based algorithm in which the agents communicate at the beginning of every epoch. The data collected from the $M$ agents allows obtaining tighter deviation bounds and hence a smaller confidence set. Further, the agents trigger an epoch only after collecting sufficiently many samples in every epoch. This allows us to bound the regret of the dist-UCRL algorithm as $\tilde{O}(DS\sqrt{MAT})$ and the number of communication rounds as $O(MAS\log_{2}(MT))$. To analyze our algorithm, we also provide a concentration inequality for $M$ independent Martingale sequences of equal length, which may be of independent interest. We also evaluated the algorithm empirically and found that the average per-agent regret decreases as $O(1/\sqrt{M})$.

For comparison, we considered an extension of the UCRL2 algorithm for $M$ parallel agents working in a round-robin sequence, and denoted this algorithm mod-UCRL2. We showed that this algorithm also achieves the same regret bound as the dist-UCRL algorithm, albeit with $O(T)$ communication rounds; the regret guarantee required the techniques developed for dist-UCRL. The evaluation results demonstrate the effectiveness of the proposed dist-UCRL algorithm in achieving the same regret bound as mod-UCRL2 with lower communication. Thus, the proposed dist-UCRL algorithm can be utilized for training multiple parallel power-constrained devices using reinforcement learning.

Possible future work includes analyzing parallel RL and the dist-UCRL algorithm under differential privacy, where agents may add noise to the data they send in order to protect privacy.

References

  • Agrawal and Jia [2017] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1184–1194, 2017.
  • Al-Abbasi et al. [2019] Abubakr O Al-Abbasi, Arnob Ghosh, and Vaneet Aggarwal. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12):4714–4727, 2019.
  • Assran et al. [2019] Mahmoud Assran, Joshua Romoff, Nicolas Ballas, Joelle Pineau, and Michael Rabbat. Gossip-based actor-learner architectures for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Bartlett and Tewari [2009] Peter L Bartlett and Ambuj Tewari. Regal: a regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42, 2009.
  • Bertsekas [1995] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. 1995.
  • Chawla et al. [2020] Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. The gossiping insert-eliminate algorithm for multi-agent bandits. In International Conference on Artificial Intelligence and Statistics, pages 3471–3481. PMLR, 2020.
  • Clemente et al. [2017] Alfredo V Clemente, Humberto N Castejón, and Arjun Chandra. Efficient parallel methods for deep reinforcement learning. arXiv preprint arXiv:1705.04862, 2017.
  • Dubey and Pentland [2020] Abhimanyu Dubey and Alex ‘Sandy’ Pentland. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33, 2020.
  • Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR, 2018.
  • Gupta et al. [2017] Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66–83. Springer, 2017.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
  • Hillel et al. [2013] Eshcar Hillel, Zohar Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 1, pages 854–862, 2013.
  • Hoeffding [1994] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
  • Horgan et al. [2018] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
  • Howard [1960] Ronald A Howard. Dynamic programming and markov processes. 1960.
  • Hu et al. [2020] Junyan Hu, Hanlin Niu, Joaquin Carrasco, Barry Lennox, and Farshad Arvin. Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning. IEEE Transactions on Vehicular Technology, 2020.
  • Ian et al. [2013] Osband Ian, Van Roy Benjamin, and Russo Daniel. (more) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, pages 3003–3011, 2013.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
  • Kanade et al. [2012] Varun Kanade, Zhenming Liu, and Božidar Radunović. Distributed non-stochastic experts. In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1, pages 260–268, 2012.
  • Kiran et al. [2020] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. arXiv preprint arXiv:2002.00444, 2020.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Manchella et al. [2021] Kaushik Manchella, Abhishek K Umrawal, and Vaneet Aggarwal. Flexpool: A distributed model-free deep reinforcement learning algorithm for joint passengers and goods transportation. IEEE Transactions on Intelligent Transportation Systems, 2021.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016.
  • Nair et al. [2015] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
  • Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.
  • Sankararaman et al. [2019] Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–35, 2019.
  • Sartoretti et al. [2019] Guillaume Sartoretti, Yue Wu, William Paivine, TK Satish Kumar, Sven Koenig, and Howie Choset. Distributed reinforcement learning for multi-robot decentralized collective construction. In Distributed autonomous robotic systems, pages 35–49. Springer, 2019.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tossou et al. [2019] Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical bernstein inequalities. CoRR, abs/1905.12425, 2019.
  • Wang et al. [2019] Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2019.
  • Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. 2003.
  • Zhang et al. [2018] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881. PMLR, 2018.

Appendix A Proof of the concentration bound of sum of independent Martingales

Lemma 7.

Let $\{X_{i,t}\}_{t=1}^{T}$ be a zero-mean Martingale sequence for $i=1,\cdots,M$, adapted to the filtration $\{\mathcal{F}_{t}\}_{t=0}^{T}$. Then, if $\{X_{i,t}\}_{t=1}^{T}$ and $\{X_{j,t}\}_{t=1}^{T}$ are independent for all $i\neq j$ and $|X_{i,t}|\leq c$ conditioned on $X_{i,t-1}$, for all $i,t$, we have,

$P\left(\sum_{t=1}^{T}\sum_{i=1}^{M}X_{i,t}\geq\epsilon\right)\leq\exp\left(-\frac{2\epsilon^{2}}{MTc^{2}}\right)$ (19)
Proof.

For any $s>0$, we can write

$P\left(\sum_{t=1}^{T}\sum_{i=1}^{M}X_{i,t}\geq\epsilon\right)$ (20)
$=P\left(\exp\left(s\sum_{i=1}^{M}\sum_{t=1}^{T}X_{i,t}\right)\geq\exp(s\epsilon)\right)$ (21)
$\leq\frac{\mathbb{E}\left[\exp\left(s\sum_{i=1}^{M}\sum_{t=1}^{T}X_{i,t}\right)\right]}{\exp(s\epsilon)}$ (22)
$=\frac{\mathbb{E}\left[\prod_{i=1}^{M}\exp\left(s\sum_{t=1}^{T}X_{i,t}\right)\right]}{\exp(s\epsilon)}$ (23)
$=\frac{\prod_{i=1}^{M}\mathbb{E}\left[\exp\left(s\sum_{t=1}^{T}X_{i,t}\right)\right]}{\exp(s\epsilon)}$ (24)
$=\frac{\prod_{i=1}^{M}\mathbb{E}\left[\exp\left(s\sum_{t=1}^{T-1}X_{i,t}\right)\mathbb{E}\left[\exp\left(sX_{i,T}\right)\middle|X_{i,T-1}\right]\right]}{\exp(s\epsilon)}$
$\leq\frac{\prod_{i=1}^{M}\mathbb{E}\left[\exp\left(s\sum_{t=1}^{T-1}X_{i,t}\right)\right]\exp(s^{2}c^{2}/8)}{\exp(s\epsilon)}$ (25)
$\leq\frac{\prod_{i=1}^{M}\prod_{t=1}^{T}\exp\left(s^{2}c^{2}/8\right)}{\exp(s\epsilon)}$ (26)
$=\frac{\exp\left(MTs^{2}c^{2}/8\right)}{\exp(s\epsilon)}$ (27)
$=\exp\left(-\frac{2\epsilon^{2}}{MTc^{2}}\right)\quad\text{for the choice }s=\frac{4\epsilon}{MTc^{2}}$ (28)

Equation (24) follows from the independence of the sequences from each other. Equation (25) follows from Hoeffding's lemma [Hoeffding, 1994] applied to the last random variable of the sequence, and Equation (26) follows from applying Hoeffding's lemma iteratively. Choosing $s$ to minimize the expression in Equation (27), i.e., $s=4\epsilon/(MTc^{2})$, gives the required result. ∎

Appendix B Bounds on the Probability of the Event $\mathcal{M}\notin\mathscr{M}(t)$

Lemma 8.

The probability of the event that the set $\mathscr{M}(t)$ does not contain the true MDP $\mathcal{M}$ is bounded as:

$\mathbb{P}\left(\mathcal{M}\notin\mathscr{M}(t)\right)\leq\frac{1}{15(Mt)^{6}}$ (29)
Proof.

From Lemma 1, the $\ell_{1}$ distance between a probability distribution over $S$ events and its empirical estimate from $n$ samples is bounded as:

$\mathbb{P}\left(\|P(\cdot|s,a)-\hat{P}(\cdot|s,a)\|_{1}\geq\epsilon\right)\leq(2^{S}-2)\exp\left(-\frac{n\epsilon^{2}}{2}\right)\leq 2^{S}\exp\left(-\frac{n\epsilon^{2}}{2}\right)$ (30)

For a given number of visits $n(s,a)$ to the state-action pair $(s,a)$ and $\epsilon=\sqrt{\frac{2}{n(s,a)}\log(2^{S}20SA(Mt)^{7})}\leq\sqrt{\frac{14S}{n(s,a)}\log(2A(Mt))}$, this gives,

$\mathbb{P}\left(\|P(\cdot|s,a)-\hat{P}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{n(s,a)}\log(2A(Mt))}\right)\leq 2^{S}\exp\left(-\frac{n(s,a)}{2}\cdot\frac{2}{n(s,a)}\log(2^{S}20SA(Mt)^{7})\right)$ (31)
$=2^{S}\cdot\frac{1}{2^{S}20SA(Mt)^{7}}=\frac{1}{20SA(Mt)^{7}}$ (32)

We sum over all possible values of $n(s,a)$ up to time step $t$ to bound the probability that the confidence bound on the transition probabilities fails as:

$\sum_{n(s,a)=1}^{t}\frac{1}{20SA(Mt)^{7}}\leq\frac{1}{20SA(Mt)^{6}}$ (33)

Also, the rewards $r_{t}$, conditioned on $(s_{t},a_{t})=(s,a)$, are independent and identically distributed over $[0,1]$ with mean $\bar{r}(s,a)$. Hence, using Hoeffding's concentration bound, the deviation of the reward estimate $\hat{r}(s,a)$ from $\bar{r}(s,a)$ with $n$ samples is bounded as:

(|r^(s,a)r¯(s,a)|ϵ)2exp(2nϵ2)\displaystyle\mathbb{P}\left(|\hat{r}(s,a)-\bar{r}(s,a)|\geq\epsilon\right)\leq 2\exp{\left(-2n\epsilon^{2}\right)} (34)

Then, choosing \epsilon=\sqrt{\frac{1}{2n(s,a)}\log(120SA(Mt)^{7})}\leq\sqrt{\frac{7}{2n(s,a)}\log(2SA(Mt))} gives,

(|r^(s,a)r¯(s,a)|ϵ)\displaystyle\mathbb{P}\left(|\hat{r}(s,a)-\bar{r}(s,a)|\geq\epsilon\right) 2exp(2nlog(120SA(Mt)7)2n(s,a))\displaystyle\leq 2\exp{\left(-2n\frac{\log(120SA(Mt)^{7})}{2n(s,a)}\right)} (35)
21120SA(Mt)7=160SA(Mt)7\displaystyle\leq 2\frac{1}{120SA(Mt)^{7}}=\frac{1}{60SA(Mt)^{7}} (36)

Summing over all the possible values of n(s,a) up to time step t, the probabilities that the confidence bounds in Equations (7) and (6) fail for the pair (s,a) are bounded as:

(P(|s,a)P^(|s,a)114SN(s,a)log(2A(Mt)))n(s,a)=1t120SA(Mt)7120SA(Mt)6\displaystyle\mathbb{P}\left(\|P(\cdot|s,a)-\hat{P}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{N(s,a)}\log(2A(Mt))}\right)\leq\sum_{n(s,a)=1}^{t}\frac{1}{20SA(Mt)^{7}}\leq\frac{1}{20SA(Mt)^{6}} (37)
(|r^(s,a)r¯(s,a)|72N(s,a)log(2SA(Mt)))n(s,a)=1t160SA(Mt)7160SA(Mt)6\displaystyle\mathbb{P}\left(|\hat{r}(s,a)-\bar{r}(s,a)|\geq\sqrt{\frac{7}{2N(s,a)}\log(2SA(Mt))}\right)\leq\sum_{n(s,a)=1}^{t}\frac{1}{60SA(Mt)^{7}}\leq\frac{1}{60SA(Mt)^{6}} (38)

where N(s,a)N(s,a) denotes the number of visitations to s,as,a till tt.

Finally, summing the two failure probabilities over all SA state-action pairs, we get \left(\frac{1}{20}+\frac{1}{60}\right)\frac{1}{(Mt)^{6}}=\frac{1}{15(Mt)^{6}}, i.e.,

\displaystyle\mathbb{P}\left(\mathcal{M}\notin\mathscr{M}(t)\right)\leq\frac{1}{15(Mt)^{6}} (39)
∎
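To make the construction of the confidence set \mathscr{M}(t) concrete, the following sketch computes the two confidence radii used in Lemma 8 from the aggregated visit count of a single state-action pair. It is a minimal illustration in Python with our own (hypothetical) function and variable names; it is not part of the dist-UCRL specification.

import numpy as np

def confidence_radii(N_sa, S, A, M, t):
    # N_sa: number of visits to (s, a) aggregated over the M agents up to time t.
    # Returns the l1 radius for P(.|s, a) and the radius for r(s, a), matching
    # sqrt(14 S log(2 A M t) / N) and sqrt(7 log(2 S A M t) / (2 N)) above.
    n = max(1, N_sa)
    d_p = np.sqrt(14.0 * S * np.log(2.0 * A * M * t) / n)
    d_r = np.sqrt(7.0 * np.log(2.0 * S * A * M * t) / (2.0 * n))
    return d_p, d_r

# Example: S = 10 states, A = 5 actions, M = 4 agents, t = 1000, 200 visits to (s, a).
print(confidence_radii(200, S=10, A=5, M=4, t=1000))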

Appendix C Proof of the Regret Bound

Proof.

We want to bound \Delta(\mathcal{M},\textsc{dist-UCRL},s,T). The dist-UCRL algorithm proceeds in epochs and generates a new policy for each epoch k using an optimistic MDP from the set \mathscr{M}(t_{k}). We consider the following two cases:

  1. (a)

    Case where the true MDP \mathcal{M} lies in the set \mathscr{M}(t_{k}) for all k: In this case, we bound the regret by the terms corresponding to parts 1, 2, and 3 of the four sources of regret mentioned in the main text, and analyze each of them.

    From the definition of regret, we have:

    Δ(,dist-UCRL,s,T)\displaystyle\Delta(\mathcal{M},\textsc{dist-UCRL},s,T) =i=1Mt=1Tρi=1Mt=1Tri,t\displaystyle=\sum_{i=1}^{M}\sum_{t=1}^{T}\rho^{*}-\sum_{i=1}^{M}\sum_{t=1}^{T}r_{i,t}
    =i=1Mt=1T(ρ𝔼[ri,t])+i=1Mt=1T(𝔼[ri,t]ri,t)\displaystyle=\sum_{i=1}^{M}\sum_{t=1}^{T}(\rho^{*}-\mathbb{E}[r_{i,t}])+\sum_{i=1}^{M}\sum_{t=1}^{T}\left(\mathbb{E}[r_{i,t}]-r_{i,t}\right)
    =i=1Mk=1ms,aνi,k(s,a)(ρr¯(s,a))+i=1Mt=1T(𝔼[ri,t]ri,t)\displaystyle=\sum_{i=1}^{M}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)\left(\rho^{*}-\bar{r}(s,a)\right)+\sum_{i=1}^{M}\sum_{t=1}^{T}\left(\mathbb{E}[r_{i,t}]-r_{i,t}\right) (40)
    i=1Mk=1ms,aνi,k(s,a)(ρ~tk+1Mtkr¯(s,a))+i=1Mt=1T(𝔼[ri,t]ri,t)\displaystyle\leq\sum_{i=1}^{M}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)\left(\tilde{\rho}_{t_{k}}+\frac{1}{\sqrt{Mt_{k}}}-\bar{r}(s,a)\right)+\sum_{i=1}^{M}\sum_{t=1}^{T}\left(\mathbb{E}[r_{i,t}]-r_{i,t}\right) (41)

    Equation (40) follows from the fact that at time t, playing action a in state s, the expected reward \mathbb{E}[r_{i,t}]=\bar{r}(s,a) for any agent i\in[M]. Equation (41) follows from the fact that the Extended Value Iteration algorithm returns a 1/\sqrt{Mt_{k}}-optimal policy for the optimistic MDP \tilde{\mathcal{M}}_{t_{k}}\in\mathscr{M}(t_{k}) at time t_{k}; hence, since \mathcal{M}\in\mathscr{M}(t_{k}), the gain \tilde{\rho}_{t_{k}} of the policy \pi_{k} for \tilde{\mathcal{M}}_{t_{k}} satisfies \tilde{\rho}_{t_{k}}\geq\rho^{*}(\tilde{\mathcal{M}}_{t_{k}})-\frac{1}{\sqrt{Mt_{k}}}\geq\rho^{*}(\mathcal{M})-\frac{1}{\sqrt{Mt_{k}}}=\rho^{*}-\frac{1}{\sqrt{Mt_{k}}}. Further, the algorithm runs the policy \pi_{k} for the entire epoch k. We now separate the terms in Equation (41) into the different parts explained below.

    1. Regret from deviating from expected reward: This denotes the second summation in Equation (41). Since the reward ri,t[0,1]r_{i,t}\in[0,1], we use Hoeffding’s Inequality (Lemma 2) to bound the deviation of observed rewards from expected rewards with high probability:

    \displaystyle\mathbb{P}\left(\sum_{i=1}^{M}\sum_{t=1}^{T}\left(\mathbb{E}[r_{i,t}]-r_{i,t}\right)\geq\sqrt{\frac{5}{8}MT\log(8MT)}\right)\leq\exp{\left(-\frac{2}{MT}\cdot\frac{5}{8}MT\log(8MT)\right)} (42)
    \displaystyle=\left(\frac{1}{8MT}\right)^{5/4}<\frac{1}{12(MT)^{5/4}} (43)

    This completes part 1.

    We now want to bound the first term in Equation (41). Recall that t_{k} was defined as the time step at which epoch k starts. Let \tilde{\mathcal{M}}_{k}=(\mathcal{S},\mathcal{A},\tilde{P}_{k},\tilde{r}_{k}) be the optimistic MDP in \mathscr{M}(t_{k}). Then, we overload the notation and let \tilde{P}_{k}(s^{\prime}|s)=\tilde{P}_{k}(s^{\prime}|s,\pi_{k}(s)) denote the transition probability matrix of the Markov chain induced by the policy \pi_{k} on the MDP \tilde{\mathcal{M}}_{k}. Also, let \tilde{v}_{k} denote the bias vector of the policy \pi_{k} for the MDP \tilde{\mathcal{M}}_{k}. Then \tilde{P}_{k}\tilde{v}_{k} is a vector and \tilde{P}_{k}\tilde{v}_{k}(s) denotes its s^{th} element. Similarly, we overload the true transition probability P into a matrix P_{k} and obtain the corresponding vector P_{k}\tilde{v}_{k}. Further, both \bar{r} and \tilde{r}_{k} satisfy Equation (6) for the estimated mean reward \hat{\bar{r}}_{k}. Hence, for all s,a, we have:

    |r~k(s,a)r¯(s,a)|\displaystyle|\tilde{r}_{k}(s,a)-\bar{r}(s,a)| |r~k(s,a)r¯^k(s,a)|+|r¯^k(s,a)r¯(s,a)|\displaystyle\leq|\tilde{r}_{k}(s,a)-\hat{\bar{r}}_{k}(s,a)|+|\hat{\bar{r}}_{k}(s,a)-\bar{r}(s,a)| (44)
    27log(2MSAtk)max{1,Nk(s,a)}dk(s,a)\displaystyle\leq 2\sqrt{\frac{7\log(2MSAt_{k})}{\max\{1,N_{k}(s,a)\}}}\eqqcolon d_{k}(s,a) (45)

    We now consider the gain-bias relationship from [Puterman, 1994] as:

    i[M]k=1ms,aνi,k(s,a)(ρ~tk+1Mtkr¯(s,a))\displaystyle\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)\left(\tilde{\rho}_{t_{k}}+\frac{1}{\sqrt{Mt_{k}}}-\bar{r}(s,a)\right)
    =i[M]k=1ms,aνi,k(s,a)(ρ~tkr~k(s,a)+(r~k(s,a)r¯(s,a))+1Mtk)\displaystyle=\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)\left(\tilde{\rho}_{t_{k}}-\tilde{r}_{k}(s,a)+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{Mt_{k}}}\right)
    =i[M]k=1ms,aνi,k(s,a)(P~kv~k(s)v~k(s)+(r~k(s,a)r¯(s,a))+1Mtk)\displaystyle=\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)\left(\tilde{P}_{k}\tilde{v}_{k}(s)-\tilde{v}_{k}(s)+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{Mt_{k}}}\right) (46)
    =i[M]k=1ms,aνi,k(s,a)(Pkv~k(s)v~k(s))\displaystyle=\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)(P_{k}\tilde{v}_{k}(s)-\tilde{v}_{k}(s))
    +i[M]k=1ms,aνi,k(s,a)((P~kv~k(s)Pkv~k(s))+(r~k(s,a)r¯(s,a))+1Mtk)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)\left((\tilde{P}_{k}\tilde{v}_{k}(s)-P_{k}\tilde{v}_{k}(s))+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{Mt_{k}}}\right) (47)

    where Equation (46) follows from Equation (5). Note that the first term in Equation (47) denotes part 2, or the regret from deviating from the expected next state. The second term in Equation (47) denotes part 3, or the regret when the algorithm is not accurately optimizing for the true MDP.

    2. Regret from deviating from the expected next state: Note that tk+11t_{k+1}-1 is the last time step of the epoch. Using this, we can upper bound the first term of the Equation (47) as:

    i[M]k=1ms,aνi,k(s,a)(Pv~k(s)v~k(s))\displaystyle\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)(P\tilde{v}_{k}(s)-\tilde{v}_{k}(s)) =i[M]k=1mt=tktk+11(Pv~k(si,t)v~k(si,t))\displaystyle=\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{t=t_{k}}^{t_{k+1}-1}(P\tilde{v}_{k}(s_{i,t})-\tilde{v}_{k}(s_{i,t})) (48)
    =i[M]k=1mt=tktk+11(Pv~k(si,t)v~k(si,t+1))\displaystyle=\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{t=t_{k}}^{t_{k+1}-1}(P\tilde{v}_{k}(s_{i,t})-\tilde{v}_{k}(s_{i,t+1}))
    +i[M]k=1m(v~k(si,tk+1)v~k(si,tk))\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{i\in[M]}\sum_{k=1}^{m}(\tilde{v}_{k}(s_{i,t_{k+1}})-\tilde{v}_{k}(s_{i,t_{k}})) (49)
    \displaystyle\leq\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{t=t_{k}}^{t_{k+1}-1}(P\tilde{v}_{k}(s_{i,t})-\tilde{v}_{k}(s_{i,t+1}))+MmD (50)
    =i[M]k=1mt=tktk+11(Xi,t)+MmD\displaystyle=\sum_{i\in[M]}\sum_{k=1}^{m}\sum_{t=t_{k}}^{t_{k+1}-1}(X_{i,t})+MmD (51)
    =i[M]t=1T(Xi,t)+MmD\displaystyle=\sum_{i\in[M]}\sum_{t=1}^{T}(X_{i,t})+MmD (52)

    where Equation (48) follows from the fact that \sum_{a}\nu_{i,k}(s,a)=\sum_{t=t_{k}}^{t_{k+1}-1}\bm{1}\{s_{i,t}=s\}. Equation (50) follows from Lemma 6. Now, note that X_{i,t} is a Martingale difference sequence for each i\in[M]. Further, the processes are independent as the environments of the agents are independent. Hence, the first term in Equation (52) is the sum of M independent Martingale difference sequences of length T with 0\leq\tilde{v}_{k}(s_{i,t})\leq D. Hence, using Lemma 3 with c=2D, with probability at least 1-1/(12(MT)^{5/4}), we get:

    i=1Mk=1ms,aνi,k(s,a)(Pv~k(s)v~k(s))\displaystyle\sum_{i=1}^{M}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)(P\tilde{v}_{k}(s)-\tilde{v}_{k}(s)) D2MT54log(8MT)+MmD\displaystyle\leq D\sqrt{2MT\frac{5}{4}\log(8MT)}+MmD (53)

    This completes part 2.

    3. Regret from not optimizing for the true MDP:
    We now bound the second term in Equation (47), which involves the deviation of the estimated transition probabilities and rewards from the true ones:

    i=1Mk=1ms,aνi,k(s,a)((P~kv~k(s)Pv~k(s))+(r~k(s,a)r¯(s,a))+1Mtk)\displaystyle\sum_{i=1}^{M}\sum_{k=1}^{m}\sum_{s,a}\nu_{i,k}(s,a)((\tilde{P}_{k}\tilde{v}_{k}(s)-P\tilde{v}_{k}(s))+({\tilde{r}}_{k}(s,a)-\bar{r}(s,a))+\frac{1}{\sqrt{Mt_{k}}})
    s,ak=1mi=1Mνi,k(s,a)((P~kv~k(s)Pv~k(s))+dk(s,a)+1Mtk)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)((\tilde{P}_{k}\tilde{v}_{k}(s)-P\tilde{v}_{k}(s))+d_{k}(s,a)+\frac{1}{\sqrt{Mt_{k}}}) (54)
    s,ak=1mi=1Mνi,k(s,a)(P~kP1v~k(s)+dk(s,a)+1Mtk)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\left(\|\tilde{P}_{k}-P\|_{1}\|\tilde{v}_{k}(s)\|_{\infty}+d_{k}(s,a)+\frac{1}{\sqrt{Mt_{k}}}\right) (55)
    s,ak=1mi=1Mνi,k(s,a)(DP~kP1+dk(s,a)+1Mtk)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\left(D\|\tilde{P}_{k}-P\|_{1}+d_{k}(s,a)+\frac{1}{\sqrt{Mt_{k}}}\right) (56)
    \displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\left(D\|\tilde{P}_{k}-P\|_{1}+d_{k}(s,a)\right)+\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\frac{1}{\sqrt{Mt_{k}}} (57)

    where Equation (54) follows from Equation (45). Equation (55) follows from Hölder's inequality. Equation (56) follows from Lemma 6 and from noting that \tilde{v} is translation invariant, and hence we can choose \min_{s}\tilde{v}(s)\geq 0. Now, note that \tilde{P}_{k} and P_{k} both satisfy Equation (7). Hence, \|\tilde{P}_{k}-P_{k}\|_{1} is upper bounded by the diameter of the set \mathscr{P}_{t_{k}}. Further, N_{k}(s,a)\leq\sum_{s,a}N_{k}(s,a)=Mt_{k}. This gives us:

    s,ak=1mi=1Mνi,k(s,a)27log(2MSAtk)+2D14Slog(2MAtk)+1Nk(s,a)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\frac{2\sqrt{7\log(2MSAt_{k})}+2D\sqrt{14S\log(2MAt_{k})}+1}{\sqrt{N_{k}(s,a)}}
    +s,ak=1mi=1Mνi,k(s,a)1Nk(s,a)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{s,a}\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\frac{1}{\sqrt{N_{k}(s,a)}} (58)
    2(7log(2MSAT)+1+D14Slog(2MAT))\displaystyle\leq 2\left(\sqrt{7\log(2MSAT)+1}+D\sqrt{14S\log(2MAT)}\right)
    ×(s,ak=1mi=1Mνi,k(s,a)Nk(s,a)+s,ak=1mi=1Mνi,k(s,a)Nk(s,a))\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \times\left(\sum_{s,a}\sum_{k=1}^{m}\frac{\sum_{i=1}^{M}\nu_{i,k}(s,a)}{\sqrt{N_{k}(s,a)}}+\sum_{s,a}\sum_{k=1}^{m}\frac{\sum_{i=1}^{M}\nu_{i,k}(s,a)}{\sqrt{N_{k}(s,a)}}\right)
    (2+1)2(7log(2MSAT)+1+D14Slog(2MAT))s,a(N(s,a)+k=1mM1Nk(s,a))\displaystyle\leq(\sqrt{2}+1)2\left(\sqrt{7\log(2MSAT)+1}+D\sqrt{14S\log(2MAT)}\right)\sum\nolimits_{s,a}\left(\sqrt{N(s,a)}+\sum_{k=1}^{m}\frac{M-1}{\sqrt{N_{k}(s,a)}}\right) (59)
    (2+1)2(7log(2MSAT)+1+D14Slog(2MAT))s,a(N(s,a)+k=1m(M1))\displaystyle\leq(\sqrt{2}+1)2\left(\sqrt{7\log(2MSAT)+1}+D\sqrt{14S\log(2MAT)}\right)\sum\nolimits_{s,a}\left(\sqrt{N(s,a)}+\sum_{k=1}^{m}(M-1)\right) (60)
    (2+1)2(7log(2MSAT)+1+D14Slog(2MAT))\displaystyle\leq(\sqrt{2}+1)2\left(\sqrt{7\log(2MSAT)+1}+D\sqrt{14S\log(2MAT)}\right)
    ×((s,a1)(s,aN(s,a))+m(M1)SA)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \times\left(\sqrt{\left(\sum\nolimits_{s,a}1\right)\left(\sum\nolimits_{s,a}N(s,a)\right)}+m(M-1)SA\right) (61)
    (2+1)2(7log(2MSAT)+1+D14Slog(2MAT))(SAMT+m(M1)SA)\displaystyle\leq(\sqrt{2}+1)2\left(\sqrt{7\log(2MSAT)+1}+D\sqrt{14S\log(2MAT)}\right)\left(\sqrt{SAMT}+m(M-1)SA\right) (62)

    Equation (59) follows from Lemma 5 and Lemma 4. Further, we note that if N(s,a)=0, then N_{k}(s,a)=0 and \nu_{i,k}(s,a)=0 for all i,k, so such pairs contribute nothing to the sum; for the remaining terms we have \max\{1,N_{k}(s,a)\}\geq 1, which gives Equation (60). Equation (61) follows from the Cauchy-Schwarz inequality.

    This completes part 3.

  2. (b)

    Case where the true MDP \mathcal{M}\notin\mathscr{M}(t_{k}) for some k: For this case, we use a trivial bound of 1 for each agent i at each time step t. This is because the rewards r_{i,t} lie in [0,1] for all i\in[M] and for all t=1,2,\cdots,T. Using this, we show that the regret remains bounded by \sqrt{MT} with high probability. We bound the regret incurred in this case in part 4.

    4. Regret when the estimated MDP is far from the true MDP \mathcal{M}: In the following, we bound the probability of the event that the \ell_{1}-deviation bounds in Equation (7) or the bounds in Equation (6) fail to hold. Note that the regret at any time step is upper bounded by 1 as r_{t}\in[0,1]. Hence,

    i=1Mk=1mνi,k(s,a)𝟏{(tk)}\displaystyle\sum_{i=1}^{M}\sum_{k=1}^{m}\nu_{i,k}(s,a)\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k})\}} =k=1mi=1Mνi,k(s,a)𝟏{(tk)}\displaystyle=\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a)\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k})\}} (63)
    k=1mNk(s,a)𝟏{(tk)}\displaystyle\leq\sum_{k=1}^{m}N_{k}(s,a)\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k})\}} (64)
    k=1mMtk𝟏{(tk)}\displaystyle\leq\sum_{k=1}^{m}Mt_{k}\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k})\}} (65)
    t=1TMt𝟏{(t)}\displaystyle\leq\sum_{t=1}^{T}Mt\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t)\}} (66)
    Mt=1(T/M)1/4t+t=(T/M)1/4+1Tt𝟏{(t)}\displaystyle\leq M\sum\nolimits_{t=1}^{(T/M)^{1/4}}t+\sum\nolimits_{t=(T/M)^{1/4}+1}^{T}t\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t)\}}
    MT+t=(T/M)1/4+1Tt𝟏{(t)}\displaystyle\leq\sqrt{MT}+\sum\nolimits_{t=(T/M)^{1/4}+1}^{T}t\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t)\}} (67)

    where (64) follows from Lemma 4. Now, we bound the probability of the event \{\mathcal{M}\notin\mathscr{M}(t)\} by 1/(15(Mt)^{6}) for all t using Lemma 8 (which uses Lemma 1; refer to Appendix B for a detailed proof). Taking a union bound over \{\mathcal{M}\notin\mathscr{M}(t)\} for t\geq(T/M)^{1/4}+1, we get

    \displaystyle\sum_{t=(T/M)^{1/4}+1}^{T}\mathbb{P}\left(\mathcal{M}\notin\mathscr{M}(t)\right)\leq\sum_{t=(T/M)^{1/4}+1}^{\infty}\mathbb{P}\left(\mathcal{M}\notin\mathscr{M}(t)\right)
    \displaystyle\leq\int_{(T/M)^{1/4}}^{\infty}\frac{1}{15(Mu)^{6}}\,du
    \displaystyle=\frac{M^{5/4}}{75M^{6}T^{5/4}}\leq\frac{1}{12(MT)^{5/4}}

    This completes part 4 of the regret sources and all the cases pertaining to the true MDP \mathcal{M}.

Summing over all the possible sources of regret, we obtain the required bound on the regret. Further, using a union bound over all the events where the concentration bounds fail to hold, we establish that the regret bound in Equation (8) holds with high probability. ∎

Appendix D Bound on number of communication rounds

Proof.

Let N(s,a) be the total number of visits to the state-action pair (s,a) across all agents when algorithm dist-UCRL terminates. Let K(s,a) denote the number of synchronization rounds requested because \nu_{i,k}(s,a)\geq N_{k}(s,a)/M for some agent i\in[M]. Now, assume that agent i triggers a communication round because \nu_{i,k}(s,a)\geq N_{k}(s,a)/M for some (s,a) with N_{k}(s,a)>0. Then for this (s,a) we have N_{k+1}(s,a)\geq N_{k}(s,a)+\nu_{i,k}(s,a)\geq N_{k}(s,a)(1+1/M). Then, for K(s,a)\geq 1, we have:

N(s,a)\displaystyle N(s,a) =k=1mi=1Mνi,k(s,a)\displaystyle=\sum_{k=1}^{m}\sum_{i=1}^{M}\nu_{i,k}(s,a) (68)
(1M+k:νi,k(s,a)Nk(s,a)/M for any iNk(s,a)M)\displaystyle\geq\left(\frac{1}{M}+\sum_{k:\nu_{i,k}(s,a)\geq N_{k}(s,a)/M\text{ for any }i}\frac{N_{k}(s,a)}{M}\right)
1M(1+jNj(s,a))\displaystyle\geq\frac{1}{M}\left(1+\sum_{j}N_{j}(s,a)\right) (69)
1M(1+j=1K(s,a)(1+1M)j1)\displaystyle\geq\frac{1}{M}\left(1+\sum_{j=1}^{K(s,a)}\left(1+\frac{1}{M}\right)^{j-1}\right) (70)
(1+1M)K(s,a)\displaystyle\geq\left(1+\frac{1}{M}\right)^{K(s,a)} (71)

where Equation (69) follows by keeping only the epochs j\in\{k:\nu_{i,k}(s,a)\geq N_{k}(s,a)/M\text{ for some }i\}. Equation (70) follows by using the K(s,a) smallest such epochs and lower bounding the count at the l^{th} of them by (1+1/M)^{l-1}, since the count is at least 1 at the first such epoch and grows by a factor of at least (1+1/M) at each subsequent one.

Now, we note (71) holds when N(s,a)1N(s,a)\geq 1. Further, we note that whenever N(s,a)=0N(s,a)=0 or N(s,a)=1N(s,a)=1, we have K(s,a)=0K(s,a)=0. So, to include the case of N(s,a)=0N(s,a)=0, we have

N(s,a)(1+1M)K(s,a)1\displaystyle N(s,a)\geq(1+\frac{1}{M})^{K(s,a)}-1 (72)

Also, note that the total number of visits N(s,a), summed over all state-action pairs, equals the total number of interactions of the M agents with their corresponding environments, which is MT. Hence, we have

MT\displaystyle MT =s,aN(s,a)\displaystyle=\sum_{s,a}N(s,a) (73)
s,a[(1+1M)K(s,a)1]\displaystyle\geq\sum_{s,a}\left[\left(1+\frac{1}{M}\right)^{K(s,a)}-1\right] (74)
AS(1+1M)s,aK(s,a)/ASSA\displaystyle\geq AS\left(1+\frac{1}{M}\right)^{\sum_{s,a}K(s,a)/AS}-SA (75)
s,aK(s,a)\displaystyle\implies\sum_{s,a}K(s,a) ASlog2(1+1M)log2(MTSA+1)\displaystyle\leq\frac{AS}{\log_{2}\left(1+\frac{1}{M}\right)}\log_{2}\left(\frac{MT}{SA}+1\right) (76)
MASlog2(MTSA+1)\displaystyle\leq MAS\log_{2}\left(\frac{MT}{SA}+1\right) (77)

where Equation (75) follows from Jensen’s inequality. Equation (77) follows from the fact that 1Mlog2(1+1M)M1\frac{1}{M}\leq\log_{2}(1+\frac{1}{M})\forall M\geq 1.
Now, a new epoch is triggered when \nu_{i,k}(s,a)\geq\max\{1,N_{k}(s,a)\}/M. This means that a new epoch will also be triggered whenever N_{k}(s,a)=0 and any agent visits (s,a) for the very first time. Hence, the algorithm can trigger SA epochs even when K(s,a)=0 for all (s,a). Also, epoch 1 trivially starts when the algorithm starts. This gives:

m\displaystyle m 1+AS+s,aK(s,a)\displaystyle\leq 1+AS+\sum\nolimits_{s,a}K(s,a) (78)
1+2MAS+MASlog2(MTSA),\displaystyle\leq 1+2MAS+MAS\log_{2}\left(\frac{MT}{SA}\right), (79)

for TSA/MT\geq SA/M. This completes the proof. ∎
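To illustrate the growth argument above, the following Python sketch simulates the synchronization trigger \nu_{i,k}(s,a)\geq\max\{1,N_{k}(s,a)\}/M for a single state-action pair and counts the triggered rounds. The random visit model, the parameter values, and all names are our own illustration choices, not part of dist-UCRL itself.

import random

def count_sync_rounds(M, T, p_visit=0.3, seed=0):
    # Counts synchronization rounds for one (s, a) pair: agent i requests a
    # synchronization once its local count nu_i reaches max(1, N_k) / M, where
    # N_k is the global count at the start of the current epoch.
    rng = random.Random(seed)
    N_k = 0            # global count of visits to (s, a) at the start of the epoch
    nu = [0] * M       # local counts of the M agents in the current epoch
    rounds = 0
    for _ in range(T):
        for i in range(M):
            if rng.random() < p_visit:        # agent i visits (s, a) in this round
                nu[i] += 1
            if nu[i] >= max(1, N_k) / M:      # synchronization triggered
                rounds += 1
                N_k += sum(nu)                # server aggregates all local counts
                nu = [0] * M                  # next epoch starts
    return rounds, N_k

print(count_sync_rounds(M=4, T=10_000))

In such runs the number of rounds grows roughly logarithmically in the total visit count, scaled by M, which is consistent with the per-pair bound K(s,a)\leq M\log_{2}(N(s,a)+1) that follows from Equation (72).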

Appendix E mod-UCRL2 algorithm description

Algorithm 4 Modified UCRL 2 at central server
1:  Input: S,A,MS,A,M
2:  Set t=1, i=1, and parameters P(s,a,s^{\prime})=0\ \forall\ (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, \hat{r}(s,a)=0, N(s,a)=0\ \forall\ (s,a)\in\mathcal{S}\times\mathcal{A}.
3:  for Epochs: k=1,2,k=1,2,\cdots do
4:     Set νk(s,a)=0(s,a)𝒮×𝒜\nu_{k}(s,a)=0\forall(s,a)\in\mathcal{S}\times\mathcal{A}.
5:     for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} do
6:        Set N(s,a)=sP(s,a,s)N(s,a)=\sum_{s^{\prime}}P(s,a,s^{\prime}).
7:        Set p^(s,a,s)=P(s,a,s)max{1,N(s,a)}\hat{p}(s,a,s^{\prime})=\frac{P(s,a,s^{\prime})}{\max\{1,N(s,a)\}}
8:        Set r¯^(s,a)=r^(s,a)max{1,N(s,a)}\hat{\bar{r}}(s,a)=\frac{\hat{r}(s,a)}{\max\{1,N(s,a)\}}
9:        Set r~(s,a)=r¯^(s,a)+7log(2MSAt)2max{1,N(s,a)}\tilde{r}(s,a)=\hat{\bar{r}}(s,a)+\sqrt{\frac{7\log(2MSAt)}{2\max\{1,N(s,a)\}}}
10:        Set d(s,a)=14Slog(2MAt)max{1,N(s,a)}d(s,a)=\sqrt{\frac{14S\log(2MAt)}{\max\{1,N(s,a)\}}}
11:     end for
12:     Set π\pi = Extended Value Iteration(p^,d,r~,1Mt\hat{p},d,\tilde{r},\frac{1}{\sqrt{Mt}})
13:  while \nu_{k}(s,a)<\max\{1,N_{k}(s,a)\}\ \forall\ (s,a) do
14:        Receive state si,ts_{i,t} from agent ii
15:        Send action ai,t=πk(si,t)a_{i,t}=\pi_{k}(s_{i,t}) to agent ii and receive back reward rtr_{t} and next state si,t+1s_{i,t+1}.
16:        Set
νk(si,t,ai,t)=νk(si,t,ai,t)+1\nu_{k}(s_{i,t},a_{i,t})=\nu_{k}(s_{i,t},a_{i,t})+1
P(s_{i,t},a_{i,t},s_{i,t+1})=P(s_{i,t},a_{i,t},s_{i,t+1})+1
r^(si,t,ai,t)=r^(si,t,ai,t)+rt\hat{r}(s_{i,t},a_{i,t})=\hat{r}(s_{i,t},a_{i,t})+r_{t}
17:        Set i=i+1i=i+1.
18:        if i>Mi>M then
19:           i=1i=1
20:           t=t+1t=t+1
21:        end if
22:     end while
23:  end for
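To make the serving loop concrete, the following Python sketch mirrors lines 13–22 of Algorithm 4. The step interface, the dictionary-based counters, and all names are our own assumptions, not part of the paper's specification; the construction of the confidence bounds and the extended value iteration call (lines 3–12) are unchanged from the pseudocode and are omitted here.

from collections import defaultdict
import random

def serve_epoch(pi, N, P, R, states, M, step, t):
    # One epoch of the mod-UCRL2 server. pi maps states to actions; states[i] is
    # agent i's current state; step(s, a) returns (reward, next_state) and stands
    # in for the agent-environment interaction. N, P, R are dict-like counters.
    nu = defaultdict(int)
    i = 0
    while True:
        s = states[i]
        a = pi[s]                                # a_{i,t} = pi_k(s_{i,t})
        r, s_next = step(s, a)                   # reward and next state from agent i
        nu[(s, a)] += 1
        P[(s, a, s_next)] += 1
        R[(s, a)] += r
        states[i] = s_next
        stop = nu[(s, a)] >= max(1, N[(s, a)])   # epoch stopping rule
        i += 1
        if i == M:                               # all agents served once: advance t
            i, t = 0, t + 1
        if stop:
            break
    for sa, v in nu.items():                     # fold this epoch's counts into N
        N[sa] += v
    return t

# Toy usage (two states, one action), with hypothetical names:
# step = lambda s, a: (random.random(), 1 - s)
# t = serve_epoch({0: 0, 1: 0}, defaultdict(int), defaultdict(int),
#                 defaultdict(int), [0, 1], M=2, step=step, t=1)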

Appendix F Proof of the Regret Bound for the Modified UCRL2 Algorithm

Proof.

We want to bound Δ(,mod-UCRL2,s,T)\Delta(\mathcal{M},\textsc{mod-UCRL2},s,T). For this, we reuse most of the proof of Theorem 1.

First note that the number of epochs of the mod-UCRL2 algorithm, m^{\prime}, is upper bounded by 1+AS+AS\log_{2}(MT/SA). The proof mostly follows Appendix D; however, the server now performs the update whenever \nu_{k}(s,a) reaches N_{k}(s,a), which results in a growth factor of 2 instead of (1+1/M) in Equation (71).

We again use the fact that the EVI algorithm finds an \epsilon-optimal policy of the optimistic MDP in the neighborhood of the true MDP \mathcal{M}. Recall that we define a server time t^{\prime} to be a pair (i,t). Let the server time t_{k}^{\prime} be the pair (\underline{i}_{k},\underline{t}_{k}) at which epoch k starts. Then, all the definitions from the analysis of Theorem 1 translate here with the subscripts changed from t to t^{\prime} and from t_{k} to t_{k}^{\prime}. Further, we define |t^{\prime}|=|(i,t)|=M(t-1)+i as the total number of interactions the agents collectively made with their environments up to server time t^{\prime}. We also define t^{\prime}+1 and t^{\prime}-1 as:

\displaystyle t^{\prime}+1=\begin{cases}(1,t+1)&\text{ for }t^{\prime}=(i,t),\ i=M,\\ (i+1,t)&\text{ for }t^{\prime}=(i,t),\ i<M.\end{cases} (80)
\displaystyle t^{\prime}-1=\begin{cases}(i-1,t)&\text{ for }t^{\prime}=(i,t),\ i>1,\\ (M,t-1)&\text{ for }t^{\prime}=(i,t),\ i=1.\end{cases} (81)
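The server-time bookkeeping in Equations (80) and (81) is simply a lexicographic counter over (agent index, round); the following minimal Python sketch, with hypothetical helper names and agents indexed 1,\cdots,M, makes this explicit.

def flat(i, t, M):
    # |t'| = M(t-1) + i : total number of interactions up to server time (i, t).
    return M * (t - 1) + i

def succ(i, t, M):
    # t' + 1 as in Equation (80).
    return (1, t + 1) if i == M else (i + 1, t)

def pred(i, t, M):
    # t' - 1 as in Equation (81).
    return (M, t - 1) if i == 1 else (i - 1, t)

# Sanity check: advancing the server time increases |t'| by exactly one.
assert flat(*succ(3, 7, M=4), M=4) == flat(3, 7, M=4) + 1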

The mod-UCRL2 algorithm proceeds in epochs and generates a new policy for each epoch k using an optimistic MDP from the set \mathscr{M}(t_{k}^{\prime}). Now, similar to the proof of Theorem 1, we consider the following two cases:

  1. (a)

    Case where the true MDP \mathcal{M} lies in the set (tk)\mathscr{M}(t_{k}^{\prime}) for all kk: In this case, we will bound the regret with the terms corresponding to parts 1, 2, and 3, of the four sources of regret as mentioned in the main text for Theorem 1, and analyze each of them.

    Again, from the definition of regret, we have:

    Δ(,mod-UCRL2,s,T)\displaystyle\Delta(\mathcal{M},\textsc{mod-UCRL2},s,T) =t=1(M,T)ρt=1(M,T)rt\displaystyle=\sum_{t^{\prime}=1}^{(M,T)}\rho^{*}-\sum_{t^{\prime}=1}^{(M,T)}r_{t^{\prime}}
    =t=1(M,T)(ρ𝔼[rt])+t=1(M,T)(𝔼[rt]rt)\displaystyle=\sum_{t^{\prime}=1}^{(M,T)}(\rho^{*}-\mathbb{E}[r_{t^{\prime}}])+\sum_{t^{\prime}=1}^{(M,T)}\left(\mathbb{E}[r_{t^{\prime}}]-r_{t^{\prime}}\right)
    =k=1ms,aνk(s,a)(ρr¯(s,a))+t=1(M,T)(𝔼[rt]rt)\displaystyle=\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)\left(\rho^{*}-\bar{r}(s,a)\right)+\sum_{t^{\prime}=1}^{(M,T)}\left(\mathbb{E}[r_{t^{\prime}}]-r_{t^{\prime}}\right) (82)
    k=1ms,aνk(s,a)(ρ~tk+1|tk|r¯(s,a))+t=1(M,T)(𝔼[rt]rt)\displaystyle\leq\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)\left(\tilde{\rho}_{t_{k}}+\frac{1}{\sqrt{|t_{k}^{\prime}|}}-\bar{r}(s,a)\right)+\sum_{t^{\prime}=1}^{(M,T)}\left(\mathbb{E}[r_{t^{\prime}}]-r_{t^{\prime}}\right) (83)

    Equation (82) follows from the fact that at server step t^{\prime}, playing action a in state s, the expected reward \mathbb{E}[r_{t^{\prime}}]=\bar{r}(s,a) for any agent i\in[M]. Equation (83) follows from the fact that the Extended Value Iteration algorithm returns a 1/\sqrt{|t_{k}^{\prime}|}-optimal policy for the optimistic MDP \tilde{\mathcal{M}}_{t_{k}^{\prime}}\in\mathscr{M}(t_{k}^{\prime}) at server time t_{k}^{\prime}; hence, since \mathcal{M}\in\mathscr{M}(t_{k}^{\prime}), the gain \tilde{\rho}_{t_{k}^{\prime}} of the policy \pi_{k} for \tilde{\mathcal{M}}_{t_{k}^{\prime}} satisfies \tilde{\rho}_{t_{k}^{\prime}}\geq\rho^{*}(\tilde{\mathcal{M}}_{t_{k}^{\prime}})-\frac{1}{\sqrt{|t_{k}^{\prime}|}}\geq\rho^{*}(\mathcal{M})-\frac{1}{\sqrt{|t_{k}^{\prime}|}}=\rho^{*}-\frac{1}{\sqrt{|t_{k}^{\prime}|}}. Further, the algorithm runs the policy \pi_{k} for the entire epoch k.

    Separating terms in Equation (83), similar to that in the proof of Theorem 1, we reduce the problem into sum of the three parts as explained below.

    1. Regret from deviating from expected reward: Since the reward rt[0,1]r_{t^{\prime}}\in[0,1], we first bound the second summation in Equation (83) with high probability using Hoeffding’s inequality (Lemma 2):

    \displaystyle\mathbb{P}\left(\sum_{t^{\prime}=1}^{(M,T)}\left(\mathbb{E}[r_{t^{\prime}}]-r_{t^{\prime}}\right)\geq\sqrt{\frac{5}{8}MT\log(8MT)}\right)=\mathbb{P}\left(\sum_{i=1}^{M}\sum_{t=1}^{T}\left(\mathbb{E}[r_{i,t}]-r_{i,t}\right)\geq\sqrt{\frac{5}{8}MT\log(8MT)}\right) (84)
    \displaystyle\leq\exp{\left(-\frac{2}{MT}\cdot\frac{5}{8}MT\log(8MT)\right)} (85)
    \displaystyle=\left(\frac{1}{8MT}\right)^{5/4}<\frac{1}{12(MT)^{5/4}} (86)

    This completes part 1.

    We now want to bound the first term in Equation (83). By the assumption of this case, the true MDP \mathcal{M} lies in the set \mathscr{M}(t_{k}^{\prime}). Let \tilde{\mathcal{M}}_{k}=(\mathcal{S},\mathcal{A},\tilde{P}_{k},\tilde{r}_{k}) be the optimistic MDP in \mathscr{M}(t_{k}^{\prime}). Then, we overload the notation and let \tilde{P}_{k}(s^{\prime}|s)=\tilde{P}_{k}(s^{\prime}|s,\pi_{k}(s)) denote the transition probability matrix of the Markov chain induced by the policy \pi_{k} on the MDP \tilde{\mathcal{M}}_{k}. Also, let \tilde{v}_{k} denote the bias vector of the policy \pi_{k} for the MDP \tilde{\mathcal{M}}_{k}. Then \tilde{P}_{k}\tilde{v}_{k} is a vector and \tilde{P}_{k}\tilde{v}_{k}(s) denotes its s^{th} element. Similarly, we overload the true transition probability P into a matrix P_{k} and obtain the corresponding vector P_{k}\tilde{v}_{k}. Also, similar to the proof of Theorem 1, we have

    \displaystyle|\tilde{r}_{k}(s,a)-\bar{r}(s,a)|\leq 2\sqrt{\frac{7\log(2SA|t_{k}^{\prime}|)}{\max\{1,N_{k}(s,a)\}}}\eqqcolon d_{k}(s,a) (87)

    We now consider the gain-bias relationship from [Puterman, 1994] as:

    k=1ms,aνk(s,a)(ρ~tk+1|tk|r¯(s,a))\displaystyle\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)\left(\tilde{\rho}_{t_{k}^{\prime}}+\frac{1}{\sqrt{|t_{k}^{\prime}|}}-\bar{r}(s,a)\right) (88)
    =k=1ms,aνk(s,a)(ρ~tkr~k(s,a)+(r~k(s,a)r¯(s,a))+1|tk|)\displaystyle=\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)\left(\tilde{\rho}_{t_{k}^{\prime}}-\tilde{r}_{k}(s,a)+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right)
    =k=1ms,aνk(s,a)(P~kv~k(s)v~k(s)+(r~k(s,a)r¯(s,a))+1|tk|)\displaystyle=\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)\left(\tilde{P}_{k}\tilde{v}_{k}(s)-\tilde{v}_{k}(s)+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right) (89)
    =k=1ms,aνk(s,a)(Pkv~k(s)v~k(s))\displaystyle=\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)(P_{k}\tilde{v}_{k}(s)-\tilde{v}_{k}(s))
    +k=1ms,aνk(s,a)((P~kv~k(s)Pkv~k(s))+(r~k(s,a)r¯(s,a))+1|tk|)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)\left((\tilde{P}_{k}\tilde{v}_{k}(s)-P_{k}\tilde{v}_{k}(s))+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right) (90)

    where Equation (89) follows from Equation (5). Again, the first term in Equation (90) denotes part 2, or the regret from deviating from the expected next state. The second term in Equation (90) denotes part 3, or the regret when the algorithm is not accurately optimizing for the true MDP.

    2. Regret from deviating from the expected next state: Let t_{k+1}^{\prime}-1=(\bar{i}_{k},\bar{t}_{k}) be the last server time step of epoch k. Using this, we can upper bound the first term of Equation (90) as:

    k=1ms,aνk(s,a)(Pv~k(s)v~k(s))\displaystyle\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)(P\tilde{v}_{k}(s)-\tilde{v}_{k}(s)) =k=1mt=tktk+11(Pv~k(st)v~k(st))\displaystyle=\sum_{k=1}^{m}\sum_{t^{\prime}=t_{k}^{\prime}}^{t_{k+1}^{\prime}-1}(P\tilde{v}_{k}(s_{t^{\prime}})-\tilde{v}_{k}(s_{t^{\prime}})) (91)
    =k=1mt=tktk+11(Pv~k(st)v~k(st+1))+k=1m(v~k(stk+1)v~k(stk))\displaystyle=\sum_{k=1}^{m}\sum_{t^{\prime}=t_{k}^{\prime}}^{t_{k+1}^{\prime}-1}(P\tilde{v}_{k}(s_{t^{\prime}})-\tilde{v}_{k}(s_{t^{\prime}+1}))+\sum_{k=1}^{m}(\tilde{v}_{k}(s_{t_{k+1}})-\tilde{v}_{k}(s_{t_{k}})) (92)
    =k=1mt=tktk+11(Pv~k(st)v~k(st+1))+k=1m(v~k(s(i¯k+1,t¯k+1))v~k(s(i¯k,t¯k)))\displaystyle=\sum_{k=1}^{m}\sum_{t^{\prime}=t_{k}^{\prime}}^{t_{k+1}^{\prime}-1}(P\tilde{v}_{k}(s_{t^{\prime}})-\tilde{v}_{k}(s_{t^{\prime}+1}))+\sum_{k=1}^{m}(\tilde{v}_{k}(s_{(\bar{i}_{k+1},\bar{t}_{k+1})})-\tilde{v}_{k}(s_{(\underline{i}_{k},\underline{t}_{k})})) (93)
    k=1mt=tktk+11(Pv~k(st)v~k(st+1))+mD\displaystyle\leq\sum_{k=1}^{m}\sum_{t^{\prime}=t_{k}^{\prime}}^{t_{k+1}^{\prime}-1}(P\tilde{v}_{k}(s_{t^{\prime}})-\tilde{v}_{k}(s_{t^{\prime}+1}))+mD (94)
    k=1mt=(i¯k,t¯k)i¯k,t¯k(Pv~k(st)v~k(st+1))+mD\displaystyle\leq\sum_{k=1}^{m}\sum_{t^{\prime}=(\underline{i}_{k},\underline{t}_{k})}^{\bar{i}_{k},\bar{t}_{k}}(P\tilde{v}_{k}(s_{t^{\prime}})-\tilde{v}_{k}(s_{t^{\prime}+1}))+mD (95)
    i=1Mk=1mt=t¯kt¯k(Pv~k(si,t)v~k(si,t+1))+mD\displaystyle\leq\sum_{i=1}^{M}\sum_{k=1}^{m}\sum_{t=\underline{t}_{k}}^{\bar{t}_{k}}(P\tilde{v}_{k}(s_{i,t})-\tilde{v}_{k}(s_{i,t+1}))+mD (96)
    i=1Mk=1mt=t¯kt¯kXi,t+mD\displaystyle\leq\sum_{i=1}^{M}\sum_{k=1}^{m}\sum_{t=\underline{t}_{k}}^{\bar{t}_{k}}X_{i,t}+mD (97)
    i=1Mt=1TXi,t+mD\displaystyle\leq\sum_{i=1}^{M}\sum_{t=1}^{T}X_{i,t}+mD (98)

    where Equation (91) follows from the fact that \sum_{a}\nu_{k}(s,a)=\sum_{t^{\prime}=t_{k}^{\prime}}^{t_{k+1}^{\prime}-1}\bm{1}\{s_{t^{\prime}}=s\}. Equation (94) follows from Lemma 6 and from the fact that, although the states s_{(\bar{i}_{k+1},\bar{t}_{k+1})} and s_{(\underline{i}_{k},\underline{t}_{k})} may belong to two different processes, the processes are identical. Now, we note that the first term in Equation (98) is the sum of M independent Martingale difference sequences of length T with 0\leq\tilde{v}_{k}(s_{(i,t)})\leq D. Hence, using Lemma 3 with c=2D, with probability at least 1-1/(12(MT)^{5/4}), we get:

    \displaystyle\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)(P\tilde{v}_{k}(s)-\tilde{v}_{k}(s))\leq D\sqrt{2MT\frac{5}{4}\log(8MT)}+mD (99)

    This completes part 2.

    3. Regret from not optimizing for the true MDP: We now bound the second term in Equation (90), which involves the deviation of the estimated transition probabilities and rewards from the true ones:

    k=1ms,aνk(s,a)((P~kv~k(s)Pv~k(s))+(r~k(s,a)r¯(s,a))+1|tk|)\displaystyle\sum_{k=1}^{m}\sum_{s,a}\nu_{k}(s,a)((\tilde{P}_{k}\tilde{v}_{k}(s)-P\tilde{v}_{k}(s))+\left(\tilde{r}_{k}(s,a)-\bar{r}(s,a)\right)+\frac{1}{\sqrt{|t_{k}^{\prime}|}})
    \displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\nu_{k}(s,a)\left((\tilde{P}_{k}\tilde{v}_{k}(s)-P\tilde{v}_{k}(s))+d_{k}(s,a)+\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right)
    \displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\nu_{k}(s,a)\left(\|\tilde{P}_{k}-P\|_{1}\|\tilde{v}_{k}\|_{\infty}+d_{k}(s,a)+\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right) (100)
    s,ak=1mνk(s,a)(DP~kP1+dk(s,a)+1|tk|)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\nu_{k}(s,a)\left(D\|\tilde{P}_{k}-P\|_{1}+d_{k}(s,a)+\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right) (101)
    s,ak=1mνk(s,a)(DP~kP1+dk(s,a))+s,ak=1mνk(s,a)(1|tk|)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\nu_{k}(s,a)\left(D\|\tilde{P}_{k}-P\|_{1}+d_{k}(s,a)\right)+\sum_{s,a}\sum_{k=1}^{m}\nu_{k}(s,a)\left(\frac{1}{\sqrt{|t_{k}^{\prime}|}}\right) (102)

    where Equation (100) follows from Hölder's inequality. Equation (101) follows from Lemma 6 and from noting that \tilde{v} is translation invariant, and hence we can choose \min_{s}\tilde{v}(s)\geq 0. Now, note that \tilde{P}_{k} and P_{k} both satisfy Equation (7). Hence, \|\tilde{P}_{k}-P_{k}\|_{1} is upper bounded by the diameter of the set \mathscr{P}_{t_{k}^{\prime}}. Further, N_{k}(s,a)\leq\sum_{s,a}N_{k}(s,a)=|t_{k}^{\prime}|. This gives us:

    s,ak=1mνk(s,a)2D14Slog(2A|tk|)+27log(2SA|tk|)+1Nk(s,a)\displaystyle\leq\sum_{s,a}\sum_{k=1}^{m}\nu_{k}(s,a)\frac{2D\sqrt{14S\log(2A|t_{k}^{\prime}|)}+2\sqrt{7\log(2SA|t_{k}^{\prime}|)}+1}{\sqrt{N_{k}(s,a)}} (103)
    2(D14Slog(2MAT)+7log(2MSAT)+1)s,ak=1mνk(s,a)Nk(s,a)\displaystyle\leq 2\left(D\sqrt{14S\log(2MAT)}+\sqrt{7\log(2MSAT)+1}\right)\sum_{s,a}\sum_{k=1}^{m}\frac{\nu_{k}(s,a)}{\sqrt{N_{k}(s,a)}}
    (2+1)(2D14Slog(2MAT)+7log(2MSAT)+1)s,a(N(s,a))\displaystyle\leq(\sqrt{2}+1)\left(2D\sqrt{14S\log(2MAT)}+\sqrt{7\log(2MSAT)+1}\right)\sum\nolimits_{s,a}\left(\sqrt{N(s,a)}\right) (104)
    (2+1)(2D14Slog(2MAT)+7log(2MSAT)+1)s,a(N(s,a))\displaystyle\leq(\sqrt{2}+1)\left(2D\sqrt{14S\log(2MAT)}+\sqrt{7\log(2MSAT)+1}\right)\sum\nolimits_{s,a}\left(\sqrt{N(s,a)}\right) (105)
    (2+1)(2D14Slog(2MAT)+7log(2MSAT)+1)((s,a1)(s,aN(s,a)))\displaystyle\leq(\sqrt{2}+1)\left(2D\sqrt{14S\log(2MAT)}+\sqrt{7\log(2MSAT)+1}\right)\left(\sqrt{\left(\sum\nolimits_{s,a}1\right)\left(\sum\nolimits_{s,a}N(s,a)\right)}\right) (106)
    (2+1)(2D14Slog(2MAT)+7log(2MSAT)+1)(SAMT)\displaystyle\leq(\sqrt{2}+1)\left(2D\sqrt{14S\log(2MAT)}+\sqrt{7\log(2MSAT)+1}\right)\left(\sqrt{SAMT}\right) (107)

    Equation (104) follows from Lemma 5. Further, we note that if N(s,a)=0, then N_{k}(s,a)=0 and \nu_{k}(s,a)=0 for all k, so such pairs contribute nothing to the sum; for the remaining terms we have \max\{1,N_{k}(s,a)\}\geq 1, which gives Equation (105). Equation (106) follows from the Cauchy-Schwarz inequality. This completes part 3.

  2. (b)

    Case where the true MDP \mathcal{M}\notin\mathscr{M}(t_{k}^{\prime}) for some k: For this case, we use a trivial bound of 1 at each server time step t^{\prime}. This is because the rewards r_{(i,t)} lie in [0,1] for all i\in[M] and for all t=1,2,\cdots,T. Using this, we show that the regret remains bounded by \sqrt{MT} with high probability. We bound the regret incurred in this case in part 4.

    4. Regret when the estimated MDP is far from the true MDP \mathcal{M}: We bound the regret contribution of the event \mathcal{M}\notin\mathscr{M}(t_{k}^{\prime}) for some k as follows:

    k=1mνk(s,a)𝟏{(tk)}\displaystyle\sum_{k=1}^{m}\nu_{k}(s,a)\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k}^{\prime})\}} =k=1mνk(s,a)𝟏{(tk)}\displaystyle=\sum_{k=1}^{m}\nu_{k}(s,a)\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k}^{\prime})\}} (108)
    k=1mNk(s,a)𝟏{(tk)}\displaystyle\leq\sum_{k=1}^{m}N_{k}(s,a)\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k}^{\prime})\}} (109)
    k=1m|tk|𝟏{(tk)}\displaystyle\leq\sum_{k=1}^{m}|t_{k}^{\prime}|\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t_{k}^{\prime})\}} (110)
    |t|=1MT|t|𝟏{(t)}\displaystyle\leq\sum_{|t^{\prime}|=1}^{MT}|t^{\prime}|\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t^{\prime})\}} (111)
    \displaystyle\leq\sum\nolimits_{|t^{\prime}|=1}^{(MT)^{1/4}}|t^{\prime}|+\sum\nolimits_{|t^{\prime}|=(MT)^{1/4}+1}^{MT}|t^{\prime}|\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t^{\prime})\}}
    MT+|t|=(MT)1/4+1MT|t|𝟏{(t)}\displaystyle\leq\sqrt{MT}+\sum\nolimits_{|t^{\prime}|=(MT)^{1/4}+1}^{MT}|t^{\prime}|\bm{1}_{\{\mathcal{M}\notin\mathscr{M}(t^{\prime})\}} (112)

    where (109) follows from Lemma 4. Now, we bound the probability of the event \{\mathcal{M}\notin\mathscr{M}(t^{\prime})\} by 1/(15|t^{\prime}|^{6}) for all t^{\prime} using Lemma 8 (which uses Lemma 1; refer to Appendix B for a detailed proof). Taking a union bound over \{\mathcal{M}\notin\mathscr{M}(t^{\prime})\} for |t^{\prime}|\geq(MT)^{1/4}+1, we get

    \displaystyle\sum_{|t^{\prime}|=(MT)^{1/4}+1}^{MT}\mathbb{P}\left(\mathcal{M}\notin\mathscr{M}(t^{\prime})\right)\leq\sum_{|t^{\prime}|=(MT)^{1/4}+1}^{\infty}\mathbb{P}\left(\mathcal{M}\notin\mathscr{M}(t^{\prime})\right)
    \displaystyle\leq\int_{(MT)^{1/4}}^{\infty}\frac{1}{15u^{6}}\,du
    \displaystyle=\frac{1}{75(MT)^{5/4}}\leq\frac{1}{12(MT)^{5/4}}

    This completes part 4 of the regret sources and all the cases pertaining to the true MDP \mathcal{M}.

Summing over all the possible sources of regret, we obtain the required bound on the regret. Further, using a union bound over all the events where the concentration bounds fail to hold, we establish that the regret bound in Equation (8) holds with high probability. ∎