
Provably Efficient Multi-Task Reinforcement Learning with Model Transfer

Chicheng Zhang
University of Arizona
chichengz@cs.arizona.edu
Zhi Wang
University of California San Diego
zhiwang@eng.ucsd.edu
Abstract

We study multi-task reinforcement learning (RL) in tabular episodic Markov decision processes (MDPs). We formulate a heterogeneous multi-player RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, with a goal of improving their collective performance through inter-player information sharing. We design and analyze an algorithm based on the idea of model transfer, and provide gap-dependent and gap-independent upper and lower bounds that characterize the intrinsic complexity of the problem.

1 Introduction

In many real-world applications, reinforcement learning (RL) agents can be deployed as a group to complete similar tasks at the same time. For example, in healthcare robotics, robots are paired with people with dementia to perform personalized cognitive training activities by learning their preferences [42, 21]; in autonomous driving, a set of autonomous vehicles learn how to navigate and avoid obstacles in various environments [27]. In these settings, each learning agent alone may only be able to acquire a limited amount of data, while the agents as a group have the potential to collectively learn faster through sharing knowledge among themselves. Multi-task learning [7] is a practical framework that can be used to model such settings, where a set of learning agents share/transfer knowledge to improve their collective performance.

Despite many empirical successes of multi-task RL (see, e.g., [51, 28, 27]) and transfer learning for RL (see, e.g., [26, 39]), a theoretical understanding of when and how information sharing or knowledge transfer can provide benefits remains limited. Exceptions include [16, 6, 11, 17, 32, 25], which study multi-task learning from parameter- or representation-transfer perspectives. However, these works still do not provide a completely satisfying answer: for example, in many application scenarios, the reward structures and the environment dynamics are only slightly different for each task—this is, however, not captured by representation transfer [11, 17] or existing works on clustering-based parameter transfer [16, 6]. In such settings, is it possible to design provably efficient multi-task RL algorithms that have guarantees never worse than agents learning individually, while outperforming the individual agents in favorable situations?

In this work, we formulate an online multi-task RL problem that is applicable to the aforementioned settings. Specifically, inspired by a recent study on multi-task multi-armed bandits [43], we formulate the $\epsilon$-Multi-Player Episodic Reinforcement Learning (abbreviated as $\epsilon$-MPERL) problem, in which all tasks share the same state and action spaces, and the tasks are assumed to be similar—i.e., the dissimilarities between the environments of different tasks (specifically, the reward distributions and transition dynamics associated with the players/tasks) are bounded in terms of a dissimilarity parameter $\epsilon \geq 0$. This problem not only models concurrent RL [34, 16] as a special case by taking $\epsilon = 0$, but also captures richer multi-task RL settings when $\epsilon$ is nonzero. We study regret minimization for the $\epsilon$-MPERL problem, specifically:

  1. We identify a problem complexity notion named subpar state-action pairs, which captures the amenability to information sharing among tasks in $\epsilon$-MPERL problem instances. As shown in the multi-task bandits literature (e.g., [43]), inter-task information sharing is not always helpful in reducing the players' collective regret. Intuitively, subpar state-action pairs are those that are clearly suboptimal for all tasks; for these pairs, a player can robustly take advantage of (possibly biased) data collected for other tasks to achieve lower regret on its own task.

  2. In the setting where the dissimilarity parameter $\epsilon$ is known, we design a model-based algorithm Multi-task-Euler (Algorithm 1), which builds upon state-of-the-art algorithms for learning single-task Markov decision processes (MDPs) [3, 46, 36], as well as algorithmic ideas of model transfer in RL [39]. Multi-task-Euler crucially utilizes the dissimilarity assumption to robustly take advantage of information sharing among tasks, and achieves regret upper bounds in terms of subpar state-action pairs, in both (value function suboptimality) gap-dependent and gap-independent fashions. Specifically, compared with a baseline algorithm that does not utilize information sharing, Multi-task-Euler has a regret guarantee that: (1) is never worse, i.e., it avoids negative transfer [33]; and (2) can be much superior when there are a large number of subpar state-action pairs.

  3. We also present gap-dependent and gap-independent regret lower bounds for the $\epsilon$-MPERL problem in terms of subpar state-action pairs. These lower bounds nearly match the upper bounds when the episode length of the MDP is a constant. Together, the upper and lower bounds can be used to characterize the intrinsic complexity of the $\epsilon$-MPERL problem.

2 Preliminaries

Throughout this paper, we denote $[n] := \{1, \ldots, n\}$. For a set $A$ in a universe $U$, we use $A^{C} = U \setminus A$ to denote its complement. Denote by $\Delta(\mathcal{X})$ the set of probability distributions over $\mathcal{X}$. For functions $f, g$, we use $f \lesssim g$ or $f = O(g)$ (resp. $f \gtrsim g$ or $f = \Omega(g)$) to denote that there exists some constant $c > 0$ such that $f \leq cg$ (resp. $f \geq cg$), and use $f \eqsim g$ to denote that $f \lesssim g$ and $f \gtrsim g$ hold simultaneously. Define $a \vee b := \max(a,b)$ and $a \wedge b := \min(a,b)$. We use $\mathbb{E}$ to denote the expectation operator, and $\mathrm{var}$ to denote the variance operator. Throughout, we use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ notation to hide polylogarithmic factors.

Multi-task RL in episodic MDPs.

We have a set of $M$ MDPs $\{\mathcal{M}_{p} = (H, \mathcal{S}, \mathcal{A}, p_{0}, \mathbb{P}_{p}, r_{p})\}_{p=1}^{M}$, each associated with a player $p \in [M]$. Each MDP $\mathcal{M}_{p}$ is regarded as a task. The MDPs share the same episode length $H \in \mathbb{N}_{+}$, finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, and initial state distribution $p_{0} \in \Delta(\mathcal{S})$. Let $\bot$ be a default terminal state that is not contained in $\mathcal{S}$. The transition probabilities $\mathbb{P}_{p} : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S} \cup \{\bot\})$ and reward distributions $r_{p} : \mathcal{S} \times \mathcal{A} \to \Delta([0,1])$ of the players are not necessarily identical. We assume that the MDPs are layered (a standard assumption, see, e.g., [44]; any episodic MDP, with possibly nonstationary transitions and rewards, can be converted to a layered MDP with stationary transitions and rewards, with the state space size being $H$ times that of the original state space), in that the state space $\mathcal{S}$ can be partitioned into disjoint subsets $(\mathcal{S}_{h})_{h=1}^{H}$, where $p_{0}$ is supported on $\mathcal{S}_{1}$, and for every $p \in [M]$, $h \in [H]$, and every $s \in \mathcal{S}_{h}, a \in \mathcal{A}$, $\mathbb{P}_{p}(\cdot \mid s,a)$ is supported on $\mathcal{S}_{h+1}$; here, we define $\mathcal{S}_{H+1} = \{\bot\}$. We denote by $S := |\mathcal{S}|$ the size of the state space, and $A := |\mathcal{A}|$ the size of the action space.
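To make the layered setup concrete, here is a minimal Python sketch of a tabular container for one task $\mathcal{M}_{p} = (H, \mathcal{S}, \mathcal{A}, p_{0}, \mathbb{P}_{p}, r_{p})$. The class name, the dictionary-based representation, and the Bernoulli reward draw are illustrative assumptions rather than anything prescribed by the paper.

import numpy as np

class LayeredTabularMDP:
    """Minimal sketch of one task M_p. States are grouped into layers S_1, ..., S_H,
    plus a terminal layer containing only the state "bot"; transitions out of layer h
    are supported on layer h+1."""

    def __init__(self, layers, num_actions, p0, P, R, rng=None):
        self.layers = layers          # list of H lists of state ids (layer h = layers[h-1])
        self.H = len(layers)
        self.A = num_actions
        self.p0 = p0                  # dict: state in layer 1 -> probability
        self.P = P                    # dict: (s, a) -> {next state: probability}
        self.R = R                    # dict: (s, a) -> mean reward in [0, 1]
        self.rng = rng or np.random.default_rng()

    def reset(self):
        states, probs = zip(*self.p0.items())
        return self.rng.choice(states, p=probs)           # s_1 ~ p_0

    def step(self, s, a):
        next_states, probs = zip(*self.P[(s, a)].items())
        s_next = self.rng.choice(next_states, p=probs)    # s' ~ P_p(. | s, a)
        r = float(self.rng.random() < self.R[(s, a)])     # Bernoulli reward with mean R_p(s, a)
        return s_next, r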

Interaction process.

The interaction process between the players and the environment is as follows: at the beginning, both $(r_{p})_{p=1}^{M}$ and $(\mathbb{P}_{p})_{p=1}^{M}$ are unknown to the players. For each episode $k \in [K]$, conditioned on the interaction history up to episode $k-1$, each player $p \in [M]$ independently interacts with its respective MDP $\mathcal{M}_{p}$; specifically, player $p$ starts at state $s_{1,p}^{k} \sim p_{0}$, and at every step (layer) $h \in [H]$, it chooses action $a_{h,p}^{k}$, transitions to the next state $s_{h+1,p}^{k} \sim \mathbb{P}_{p}(\cdot \mid s_{h,p}^{k}, a_{h,p}^{k})$, and receives a stochastic immediate reward $r_{h,p}^{k} \sim r_{p}(\cdot \mid s_{h,p}^{k}, a_{h,p}^{k})$; after all players have finished their $k$-th episode, they can communicate and share their interaction histories. The goal of the players is to maximize their expected collective reward $\mathbb{E}\left[\sum_{k=1}^{K}\sum_{p=1}^{M}\sum_{h=1}^{H} r_{h,p}^{k}\right]$.
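The per-episode protocol can be summarized in a few lines of Python; the sketch below assumes the LayeredTabularMDP container above and a placeholder policies[p] callable, and only illustrates the timing of the communication round at the end of each episode.

def run_protocol(mdps, policies, K):
    """M players act in parallel for K episodes and share trajectories after each episode.
    `policies[p](k, s)` returns player p's action at episode k in state s (placeholder)."""
    shared_history = []                      # visible to every player before episode k+1
    for k in range(1, K + 1):
        episode_data = []
        for p, mdp in enumerate(mdps):       # players act independently within episode k
            s = mdp.reset()
            trajectory = []
            for _ in range(mdp.H):
                a = policies[p](k, s)
                s_next, r = mdp.step(s, a)
                trajectory.append((s, a, r, s_next))
                s = s_next
            episode_data.append((p, trajectory))
        shared_history.append(episode_data)  # communication round at the end of episode k
    return shared_history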

Policy and value functions.

A deterministic, history-independent policy $\pi$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$, which can be used by a player to make decisions in its respective MDP. For player $p$ and step $h$, we use $V_{h,p}^{\pi} : \mathcal{S}_{h} \to [0,H]$ and $Q_{h,p}^{\pi} : \mathcal{S}_{h} \times \mathcal{A} \to [0,H]$ to denote its value and action-value functions, respectively. They satisfy the following recurrence, known as the Bellman equation:

\forall h \in [H]: \quad V_{h,p}^{\pi}(s) = Q_{h,p}^{\pi}(s, \pi(s)), \;\; Q_{h,p}^{\pi}(s,a) = R_{p}(s,a) + (\mathbb{P}_{p} V_{h+1,p}^{\pi})(s,a),

where we use the convention that $V_{H+1,p}^{\pi}(\bot) = 0$; for $f : \mathcal{S}_{h+1} \to \mathbb{R}$, $(\mathbb{P}_{p} f)(s,a) := \sum_{s^{\prime} \in \mathcal{S}_{h+1}} \mathbb{P}_{p}(s^{\prime} \mid s,a) f(s^{\prime})$; and $R_{p}(s,a) := \mathbb{E}_{\hat{r} \sim r_{p}(\cdot \mid s,a)}[\hat{r}]$ is the expected immediate reward of player $p$. For player $p$ and policy $\pi$, denote by $V_{0,p}^{\pi} = \mathbb{E}_{s_{1} \sim p_{0}}[V_{1,p}^{\pi}(s_{1})]$ its expected reward.

For player $p$, we also define its optimal value function $V_{h,p}^{\star} : \mathcal{S}_{h} \to [0,H]$ and optimal action-value function $Q_{h,p}^{\star} : \mathcal{S}_{h} \times \mathcal{A} \to [0,H]$ via the Bellman optimality equation:

\forall h \in [H]: \quad V_{h,p}^{\star}(s) = \max_{a \in \mathcal{A}} Q_{h,p}^{\star}(s,a), \;\; Q_{h,p}^{\star}(s,a) = R_{p}(s,a) + (\mathbb{P}_{p} V_{h+1,p}^{\star})(s,a), \qquad (1)

where we again use the convention that $V_{H+1,p}^{\star}(\bot) = 0$. For player $p$, denote by $V_{0,p}^{\star} = \mathbb{E}_{s_{1} \sim p_{0}}[V_{1,p}^{\star}(s_{1})]$ its optimal expected reward.

Given a policy $\pi$, as the $V_{h,p}^{\pi}$ for different $h$'s are only defined on the respective layers $\mathcal{S}_{h}$, we “collate” the value functions $(V_{h,p}^{\pi})_{h=1}^{H}$ and obtain a single value function $V_{p}^{\pi} : \mathcal{S} \cup \{\bot\} \to \mathbb{R}$. Formally, for every $h \in [H+1]$ and $s \in \mathcal{S}_{h}$,

V_{p}^{\pi}(s) := V_{h,p}^{\pi}(s).

We define $Q_{p}^{\pi}$, $V_{p}^{\star}$, and $Q_{p}^{\star}$ similarly. For player $p$, given its optimal action-value function $Q_{p}^{\star}$, any of its greedy policies $\pi_{p}^{\star}(s) \in \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} Q_{p}^{\star}(s,a)$ is optimal with respect to $\mathcal{M}_{p}$.
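For a single task, the optimal value functions, a greedy optimal policy, and the suboptimality gaps used in the next paragraph can all be computed by backward induction over the layers. The sketch below assumes the LayeredTabularMDP container introduced earlier, with the terminal state stored under the key "bot".

def optimal_values_and_gaps(mdp):
    """Backward induction: compute Q*, V*, a greedy optimal policy pi*, and
    gap(s, a) = V*(s) - Q*(s, a) for one task."""
    V = {"bot": 0.0}                              # V*_{H+1}(bot) = 0
    Q, pi, gap = {}, {}, {}
    for h in range(mdp.H - 1, -1, -1):            # layers H, H-1, ..., 1
        for s in mdp.layers[h]:
            for a in range(mdp.A):
                # Bellman optimality backup: R_p(s, a) + sum_{s'} P_p(s' | s, a) V*(s')
                Q[(s, a)] = mdp.R[(s, a)] + sum(
                    prob * V[s_next] for s_next, prob in mdp.P[(s, a)].items())
            best = max(range(mdp.A), key=lambda a: Q[(s, a)])
            pi[s], V[s] = best, Q[(s, best)]
            for a in range(mdp.A):
                gap[(s, a)] = V[s] - Q[(s, a)]
    return V, Q, pi, gap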

Suboptimality gap.

For player $p$, we define the suboptimality gap of state-action pair $(s,a)$ as $\mathrm{gap}_{p}(s,a) = V_{p}^{\star}(s) - Q_{p}^{\star}(s,a)$. We define the minimum suboptimality gap of player $p$ as $\mathrm{gap}_{p,\min} = \min_{(s,a): \mathrm{gap}_{p}(s,a) > 0} \mathrm{gap}_{p}(s,a)$, and the minimum suboptimality gap over all players as $\mathrm{gap}_{\min} = \min_{p \in [M]} \mathrm{gap}_{p,\min}$. For player $p \in [M]$, define $Z_{p,\mathrm{opt}} := \{(s,a) : \mathrm{gap}_{p}(s,a) = 0\}$ as the set of optimal state-action pairs with respect to $p$.

Performance metric.

We measure the performance of the players using their collective regret, i.e., over a total of $K$ episodes, how much extra reward they would have collected in expectation if they had executed their respective optimal policies from the beginning. Formally, suppose that in each episode $k$, player $p$ executes policy $\pi^{k}(p)$; then the collective regret of the players is defined as:

\mathrm{Reg}(K) = \sum_{p=1}^{M} \sum_{k=1}^{K} \left( V_{0,p}^{\star} - V_{0,p}^{\pi^{k}(p)} \right).

Baseline: individual Strong-Euler.

A naive baseline for multi-task RL is to let each player run a separate RL algorithm without communication. For concreteness, we let each player run the state-of-the-art Strong-Euler algorithm [36] (see also its precursor Euler [46]), which enjoys minimax gap-independent [3, 8] and gap-dependent regret guarantees, and we refer to this strategy as individual Strong-Euler. Specifically, as Strong-Euler is known to have a regret of $\tilde{O}(\sqrt{H^{2}SAK} + H^{4}S^{2}A)$, individual Strong-Euler has a collective regret of $\tilde{O}(M\sqrt{H^{2}SAK} + MH^{4}S^{2}A)$. In addition, by a union bound and by summing the gap-dependent regret guarantees of Strong-Euler over the $M$ MDPs, it can be checked that with probability $1-\delta$, individual Strong-Euler has a collective regret of the order given in (2) below. (The originally-stated gap-dependent regret bound of Strong-Euler [36, Corollary 2.1] uses a slightly different notion of suboptimality gap, which takes an extra minimum over all steps; a close examination of their proof shows that Strong-Euler satisfies regret bound (2) in layered MDPs. See also the corresponding remark in the Appendix.)

\ln\left(\frac{MSAK}{\delta}\right) \Bigg( \sum_{p\in[M]} \left( \sum_{(s,a)\in Z_{p,\mathrm{opt}}} \frac{H^{3}}{\mathrm{gap}_{p,\min}} + \sum_{(s,a)\in Z_{p,\mathrm{opt}}^{C}} \frac{H^{3}}{\mathrm{gap}_{p}(s,a)} \right) + MH^{4}S^{2}A \ln\frac{SA}{\mathrm{gap}_{\min}} \Bigg). \qquad (2)

Our goal is to design multi-task RL algorithms that can achieve collective regret strictly lower than this baseline in both gap-dependent and gap-independent fashions when the tasks are similar.

Notion of similarity.

Throughout this paper, we will consider the following notion of similarity between MDPs in the multi-task episodic RL setting.

Definition 1.

A collection of MDPs $(\mathcal{M}_{p})_{p=1}^{M}$ is said to be $\epsilon$-dissimilar for $\epsilon \geq 0$, if for all $p, q \in [M]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$,

\left| R_{p}(s,a) - R_{q}(s,a) \right| \leq \epsilon, \quad \|\mathbb{P}_{p}(\cdot \mid s,a) - \mathbb{P}_{q}(\cdot \mid s,a)\|_{1} \leq \frac{\epsilon}{H}.

If this holds, we call $(\mathcal{M}_{p})_{p=1}^{M}$ an $\epsilon$-Multi-Player Episodic Reinforcement Learning (abbrev. $\epsilon$-MPERL) problem instance.

If the MDPs in $(\mathcal{M}_{p})_{p=1}^{M}$ are $0$-dissimilar, then they are identical by definition, and our interaction protocol degenerates to the concurrent RL protocol [34]. Our dissimilarity notion is complementary to those of [6, 16]: they require the MDPs to be either identical or have well-separated parameters for at least one state-action pair; in contrast, our dissimilarity notion allows the MDPs to be nonidentical yet arbitrarily close.
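As a quick illustration of Definition 1, the following sketch computes the smallest $\epsilon$ for which a given collection of tasks is $\epsilon$-dissimilar; it assumes the LayeredTabularMDP containers above, with all tasks sharing the same $(s,a)$ keys and missing next states treated as probability zero.

def dissimilarity(mdps):
    """Smallest eps such that the tasks are eps-dissimilar in the sense of Definition 1."""
    eps, H = 0.0, mdps[0].H
    for i in range(len(mdps)):
        for j in range(i + 1, len(mdps)):
            for sa in mdps[0].R:
                # reward condition: |R_p(s, a) - R_q(s, a)| <= eps
                eps = max(eps, abs(mdps[i].R[sa] - mdps[j].R[sa]))
                # transition condition: ||P_p(. | s, a) - P_q(. | s, a)||_1 <= eps / H
                support = set(mdps[i].P[sa]) | set(mdps[j].P[sa])
                l1 = sum(abs(mdps[i].P[sa].get(s2, 0.0) - mdps[j].P[sa].get(s2, 0.0))
                         for s2 in support)
                eps = max(eps, H * l1)
    return eps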

We have the following intuitive lemma, which shows the closeness of the optimal value functions of different MDPs in terms of the dissimilarity parameter $\epsilon$:

Lemma 2.

If $(\mathcal{M}_{p})_{p=1}^{M}$ are $\epsilon$-dissimilar, then for every $p, q \in [M]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$, $\left| Q_{p}^{\star}(s,a) - Q_{q}^{\star}(s,a) \right| \leq 2H\epsilon$; consequently, $\left| \mathrm{gap}_{p}(s,a) - \mathrm{gap}_{q}(s,a) \right| \leq 4H\epsilon$.

3 Algorithm

We now describe our main algorithm, Multi-task-Euler (Algorithm 1). Our model-based algorithm builds upon recent works on episodic RL that provide algorithms with sharp instance-dependent guarantees in the single-task setting [46, 36]. In a nutshell, for each episode $k$ and each player $p$, the algorithm performs optimistic value iteration to construct high-probability upper and lower bounds for the optimal value and action-value functions $V_{p}^{\star}$ and $Q_{p}^{\star}$, and uses them to guide its exploration and decision making.

Input: Failure probability $\delta \in (0,1)$, dissimilarity parameter $\epsilon \geq 0$.
Initialize: Set $\overline{V}_{p}(\bot) = \underline{V}_{p}(\bot) = 0$ for all $p \in [M]$, where $\bot$ is the only state in $\mathcal{S}_{H+1}$.
for $k = 1, 2, \ldots, K$ do
    for $p = 1, 2, \ldots, M$ do
        // Construct optimal value estimates for player $p$
        for $h = H, H-1, \ldots, 1$ do
            for $(s,a) \in \mathcal{S}_{h} \times \mathcal{A}$ do
                Compute:
                    $\overline{\text{ind-}Q}_{p}(s,a) = \hat{R}_{p}(s,a) + (\hat{\mathbb{P}}_{p}\overline{V}_{p})(s,a) + \text{ind-}b_{p}(s,a)$;
                    $\underline{\text{ind-}Q}_{p}(s,a) = \hat{R}_{p}(s,a) + (\hat{\mathbb{P}}_{p}\underline{V}_{p})(s,a) - \text{ind-}b_{p}(s,a)$;
                    $\overline{\text{agg-}Q}_{p}(s,a) = \hat{R}(s,a) + (\hat{\mathbb{P}}\overline{V}_{p})(s,a) + \text{agg-}b_{p}(s,a)$;
                    $\underline{\text{agg-}Q}_{p}(s,a) = \hat{R}(s,a) + (\hat{\mathbb{P}}\underline{V}_{p})(s,a) - \text{agg-}b_{p}(s,a)$;
                Update the optimal action-value function upper and lower bound estimates:
                    $\overline{Q}_{p}(s,a) = \min\{H-h+1,\; \overline{\text{ind-}Q}_{p}(s,a),\; \overline{\text{agg-}Q}_{p}(s,a)\}$;
                    $\underline{Q}_{p}(s,a) = \max\{0,\; \underline{\text{ind-}Q}_{p}(s,a),\; \underline{\text{agg-}Q}_{p}(s,a)\}$;
            for $s \in \mathcal{S}_{h}$ do
                Define $\pi^{k}(p)(s) = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} \overline{Q}_{p}(s,a)$;
                Update $\overline{V}_{p}(s) = \overline{Q}_{p}(s, \pi^{k}(p)(s))$, $\underline{V}_{p}(s) = \underline{Q}_{p}(s, \pi^{k}(p)(s))$.
    // All players $p$ interact with their respective environments, and update reward and transition estimates
    for $p = 1, 2, \ldots, M$ do
        Player $p$ executes policy $\pi^{k}(p)$ on $\mathcal{M}_{p}$ and obtains trajectory $(s_{h,p}^{k}, a_{h,p}^{k}, r_{h,p}^{k})_{h=1}^{H}$;
        Update the individual estimates of transition probability $\hat{\mathbb{P}}_{p}$, reward $\hat{R}_{p}$, and count $n_{p}(\cdot,\cdot)$ using the first parts of Equations (3), (4) and (5).
    Update the aggregate estimates of transition probability $\hat{\mathbb{P}}$, reward $\hat{R}$, and count $n(\cdot,\cdot)$ using the second parts of Equations (3), (4) and (5).
Algorithm 1: Multi-task-Euler

Empirical estimates of model parameters.

For each player $p$, the construction of its value function bound estimates relies on empirical estimates of its transition probabilities and expected reward function. For both estimands, we use two estimators with complementary roles, which sit at two different points of the bias-variance tradeoff spectrum: one estimator uses only the player's own data (termed the individual estimate), which has large variance; the other uses the data collected by all players (termed the aggregate estimate), which has lower variance but can easily be biased, as the transition probabilities and reward distributions are heterogeneous. This algorithmic idea of “model transfer”, where one estimates the model of one task using data collected from other tasks, has appeared in prior works (e.g., [39]). Specifically, at the beginning of episode $k$, for every $h \in [H]$ and $(s,a) \in \mathcal{S}_{h} \times \mathcal{A}$, the algorithm maintains its empirical count of encountering $(s,a)$ for each player $p$, along with the total empirical count across all players, respectively:

n_{p}(s,a) := \sum_{l=1}^{k-1} {\bf 1}\left( (s_{h,p}^{l}, a_{h,p}^{l}) = (s,a) \right), \quad n(s,a) := \sum_{l=1}^{k-1} \sum_{p=1}^{M} {\bf 1}\left( (s_{h,p}^{l}, a_{h,p}^{l}) = (s,a) \right). \qquad (3)

The individual and aggregate estimates of the expected immediate reward are defined as:

\hat{R}_{p}(s,a) := \frac{\sum_{l=1}^{k-1} {\bf 1}\left( (s_{h,p}^{l}, a_{h,p}^{l}) = (s,a) \right) r_{h,p}^{l}}{n_{p}(s,a)}, \quad \hat{R}(s,a) := \frac{\sum_{l=1}^{k-1} \sum_{p=1}^{M} {\bf 1}\left( (s_{h,p}^{l}, a_{h,p}^{l}) = (s,a) \right) r_{h,p}^{l}}{n(s,a)}. \qquad (4)

Similarly, for every $h \in [H]$ and $(s,a,s^{\prime}) \in \mathcal{S}_{h} \times \mathcal{A} \times \mathcal{S}_{h+1}$, we also define the individual and aggregate estimates of the transition probability as:

\hat{\mathbb{P}}_{p}(s^{\prime} \mid s,a) := \frac{\sum_{l=1}^{k-1} {\bf 1}\left( (s_{h,p}^{l}, a_{h,p}^{l}, s_{h+1,p}^{l}) = (s,a,s^{\prime}) \right)}{n_{p}(s,a)}, \quad \hat{\mathbb{P}}(s^{\prime} \mid s,a) := \frac{\sum_{l=1}^{k-1} \sum_{p=1}^{M} {\bf 1}\left( (s_{h,p}^{l}, a_{h,p}^{l}, s_{h+1,p}^{l}) = (s,a,s^{\prime}) \right)}{n(s,a)}. \qquad (5)

If $n(s,a) = 0$, we define $\hat{R}(s,a) := 0$ and $\hat{\mathbb{P}}(s^{\prime} \mid s,a) := \frac{1}{|\mathcal{S}_{h+1}|}$; and if $n_{p}(s,a) = 0$, we define $\hat{R}_{p}(s,a) := 0$ and $\hat{\mathbb{P}}_{p}(s^{\prime} \mid s,a) := \frac{1}{|\mathcal{S}_{h+1}|}$. The counts and reward estimates can be maintained by Multi-task-Euler efficiently in an incremental manner.
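The bookkeeping in Equations (3)-(5) can be maintained incrementally; the following is a minimal sketch of such an update (the class and method names are ours), keeping running counts and reward sums so that the individual and aggregate estimates are available in constant time per query, with the zero/uniform defaults described above.

from collections import defaultdict

class ModelEstimates:
    """Incremental individual and aggregate model estimates (cf. Equations (3)-(5))."""

    def __init__(self, M):
        self.n_p = [defaultdict(int) for _ in range(M)]       # n_p(s, a)
        self.rsum_p = [defaultdict(float) for _ in range(M)]  # running reward sums per player
        self.cnt_p = [defaultdict(int) for _ in range(M)]     # counts of (s, a, s') per player
        self.n = defaultdict(int)                             # n(s, a), pooled over players
        self.rsum = defaultdict(float)
        self.cnt = defaultdict(int)

    def update(self, p, trajectory):
        """Fold one trajectory of player p into both the individual and aggregate statistics."""
        for (s, a, r, s_next) in trajectory:
            for n_, rsum_, cnt_ in ((self.n_p[p], self.rsum_p[p], self.cnt_p[p]),
                                    (self.n, self.rsum, self.cnt)):
                n_[(s, a)] += 1
                rsum_[(s, a)] += r
                cnt_[(s, a, s_next)] += 1

    def R_hat(self, s, a, p=None):
        """Aggregate estimate if p is None, else player p's individual estimate; 0 if unvisited."""
        n_, rsum_ = (self.n, self.rsum) if p is None else (self.n_p[p], self.rsum_p[p])
        return rsum_[(s, a)] / n_[(s, a)] if n_[(s, a)] > 0 else 0.0

    def P_hat(self, s, a, s_next, p=None, next_layer_size=1):
        """Transition estimate; uniform over the next layer if (s, a) is unvisited."""
        n_, cnt_ = (self.n, self.cnt) if p is None else (self.n_p[p], self.cnt_p[p])
        return cnt_[(s, a, s_next)] / n_[(s, a)] if n_[(s, a)] > 0 else 1.0 / next_layer_size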

Constructing value function estimates via optimistic value iteration.

For each player $p$, based on these model parameter estimates, Multi-task-Euler performs optimistic value iteration to compute the value function estimates for states at all layers (Algorithm 1). For the terminal layer $H+1$, $V_{H+1,p}^{\star}(\bot) = 0$ trivially, so nothing needs to be done. For earlier layers $h \in [H]$, Multi-task-Euler iteratively builds its value function estimates in a backward fashion. At the time of estimating values for layer $h$, the algorithm has already obtained optimal value estimates for layer $h+1$. Based on the Bellman optimality equation (1), Multi-task-Euler estimates $(Q_{p}^{\star}(s,a))_{s \in \mathcal{S}_{h}, a \in \mathcal{A}}$ using the model parameter estimates and its estimates of $(V_{p}^{\star}(s))_{s \in \mathcal{S}_{h+1}}$, namely $(\overline{V}_{p}(s))_{s \in \mathcal{S}_{h+1}}$ and $(\underline{V}_{p}(s))_{s \in \mathcal{S}_{h+1}}$.

Specifically, Multi-task-Euler constructs estimates of $Q_{p}^{\star}(s,a)$ for all $s \in \mathcal{S}_{h}, a \in \mathcal{A}$ in two different ways. First, it uses the individual model estimates of player $p$ to construct $\underline{\text{ind-}Q}_{p}$ and $\overline{\text{ind-}Q}_{p}$, lower and upper bound estimates of $Q_{p}^{\star}$; this construction is reminiscent of Euler and Strong-Euler [46, 36], in that if we were only to use $\overline{\text{ind-}Q}_{p}$ and $\underline{\text{ind-}Q}_{p}$ as our optimal action-value function estimates $\overline{Q}_{p}$ and $\underline{Q}_{p}$, our algorithm would become individual Strong-Euler. The individual value function estimates are key to establishing Multi-task-Euler's fall-back guarantees, ensuring that it never performs worse than the individual Strong-Euler baseline. Second, it uses the aggregate model estimates to construct $\underline{\text{agg-}Q}_{p}$ and $\overline{\text{agg-}Q}_{p}$, also lower and upper bound estimates of $Q_{p}^{\star}$; this construction is unique to the multi-task learning setting and is our new algorithmic contribution.

To ensure that $\overline{\text{agg-}Q}_{p}$ and $\overline{\text{ind-}Q}_{p}$ (resp. $\underline{\text{agg-}Q}_{p}$ and $\underline{\text{ind-}Q}_{p}$) are valid upper bounds (resp. lower bounds) of $Q_{p}^{\star}$, Multi-task-Euler adds bonus terms $\text{ind-}b_{p}(s,a)$ and $\text{agg-}b_{p}(s,a)$, respectively, in the optimistic value iteration process, to account for the estimation error of the model estimates against the true models. Specifically, both bonus terms comprise three parts:

\text{ind-}b_{p}(s,a) := b_{\mathrm{rw}}\left(n_{p}(s,a), 0\right) + b_{\mathrm{prob}}\left(\hat{\mathbb{P}}_{p}(\cdot \mid s,a), n_{p}(s,a), \overline{V}_{p}, \underline{V}_{p}, 0\right) + b_{\mathrm{str}}\left(\hat{\mathbb{P}}_{p}(\cdot \mid s,a), n_{p}(s,a), \overline{V}_{p}, \underline{V}_{p}, 0\right),
\text{agg-}b_{p}(s,a) := b_{\mathrm{rw}}\left(n(s,a), \epsilon\right) + b_{\mathrm{prob}}\left(\hat{\mathbb{P}}(\cdot \mid s,a), n(s,a), \overline{V}_{p}, \underline{V}_{p}, \epsilon\right) + b_{\mathrm{str}}\left(\hat{\mathbb{P}}(\cdot \mid s,a), n(s,a), \overline{V}_{p}, \underline{V}_{p}, \epsilon\right),

where

b_{\mathrm{rw}}(n, \kappa) := 1 \wedge \left( \kappa + \Theta\left( \sqrt{\frac{L(n)}{n}} \right) \right),
b_{\mathrm{prob}}(q, n, \overline{V}, \underline{V}, \kappa) := H \wedge \left( 2\kappa + \Theta\left( \sqrt{\frac{\mathrm{var}_{s^{\prime} \sim q}\left[\overline{V}(s^{\prime})\right] L(n)}{n}} + \sqrt{\frac{\mathbb{E}_{s^{\prime} \sim q}\left[(\overline{V}(s^{\prime}) - \underline{V}(s^{\prime}))^{2}\right] L(n)}{n}} + \frac{H L(n)}{n} \right) \right),
b_{\mathrm{str}}(q, n, \overline{V}, \underline{V}, \kappa) := \kappa + \Theta\left( \sqrt{\frac{S\, \mathbb{E}_{s^{\prime} \sim q}\left[(\overline{V}(s^{\prime}) - \underline{V}(s^{\prime}))^{2}\right] L(n)}{n}} + \frac{H S L(n)}{n} \right),

and $L(n) \eqsim \ln\left(\frac{MSAn}{\delta}\right)$.
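The sketch below mirrors the structure of these bonuses in Python. The absolute constants hidden inside the $\Theta(\cdot)$'s are not specified here, so they are set to $1$ purely for illustration; the function names, signatures, and the explicit $L(n)$ argument are our own conventions.

import numpy as np

def log_term(n, M, S, A, delta):
    # L(n) ~ ln(M S A n / delta), up to an unspecified constant
    return np.log(M * S * A * max(n, 1) / delta)

def b_rw(n, kappa, L):
    # reward bonus: 1 ∧ (kappa + Theta(sqrt(L(n)/n))), constant set to 1
    return min(1.0, kappa + np.sqrt(L / max(n, 1)))

def b_prob(q, n, V_up, V_lo, kappa, H, L):
    # transition bonus: H ∧ (2*kappa + sqrt(var_q[V_up] L/n) + sqrt(E_q[(V_up-V_lo)^2] L/n) + H L/n)
    states = list(q)
    probs = np.array([q[s] for s in states])
    up = np.array([V_up[s] for s in states])
    lo = np.array([V_lo[s] for s in states])
    mean_up = np.average(up, weights=probs)
    var_term = np.sqrt(np.average((up - mean_up) ** 2, weights=probs) * L / max(n, 1))
    gap_term = np.sqrt(np.average((up - lo) ** 2, weights=probs) * L / max(n, 1))
    return min(float(H), 2 * kappa + var_term + gap_term + H * L / max(n, 1))

def b_str(q, n, V_up, V_lo, kappa, H, S, L):
    # lower-order bonus: kappa + sqrt(S E_q[(V_up-V_lo)^2] L/n) + H S L/n
    states = list(q)
    probs = np.array([q[s] for s in states])
    diff2 = np.array([(V_up[s] - V_lo[s]) ** 2 for s in states])
    return kappa + np.sqrt(S * np.average(diff2, weights=probs) * L / max(n, 1)) + H * S * L / max(n, 1)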

Together, the bonus terms ensure strong optimism [36], i.e.,

\text{for any } p \text{ and } (s,a), \quad \overline{Q}_{p}(s,a) \geq R_{p}(s,a) + (\mathbb{P}_{p}\overline{V}_{p})(s,a). \qquad (6)

In short, strong optimism is a stronger form of optimism (the weaker requirement being that for any $p$ and $(s,a)$, $\overline{Q}_{p}(s,a) \geq Q^{\star}_{p}(s,a)$ and $\overline{V}_{p}(s) \geq V^{\star}_{p}(s)$), which allows us to use the clipping lemma (Lemma B.6 of [36]; see also the corresponding lemma in the Appendix) to obtain sharp gap-dependent regret guarantees. The three parts of the bonus terms serve different purposes towards establishing (6):

  1. The first component accounts for the uncertainty in reward estimation: with probability $1 - O(\delta)$, $\left| \hat{R}_{p}(s,a) - R_{p}(s,a) \right| \leq b_{\mathrm{rw}}\left(n_{p}(s,a), 0\right)$ and $\left| \hat{R}(s,a) - R_{p}(s,a) \right| \leq b_{\mathrm{rw}}\left(n(s,a), \epsilon\right)$.

  2. The second component accounts for the uncertainty in estimating $(\mathbb{P}_{p} V_{p}^{\star})(s,a)$: with probability $1 - O(\delta)$, $\left| (\hat{\mathbb{P}}_{p} V_{p}^{\star})(s,a) - (\mathbb{P}_{p} V_{p}^{\star})(s,a) \right| \leq b_{\mathrm{prob}}\left(\hat{\mathbb{P}}_{p}(\cdot \mid s,a), n_{p}(s,a), \overline{V}_{p}, \underline{V}_{p}, 0\right)$ and $\left| (\hat{\mathbb{P}} V_{p}^{\star})(s,a) - (\mathbb{P}_{p} V_{p}^{\star})(s,a) \right| \leq b_{\mathrm{prob}}\left(\hat{\mathbb{P}}(\cdot \mid s,a), n(s,a), \overline{V}_{p}, \underline{V}_{p}, \epsilon\right)$.

  3. The third component accounts for the lower-order terms for strong optimism: with probability $1 - O(\delta)$, $\left| (\hat{\mathbb{P}}_{p} - \mathbb{P}_{p})(\overline{V}_{p} - V_{p}^{\star})(s,a) \right| \leq b_{\mathrm{str}}\left(\hat{\mathbb{P}}_{p}(\cdot \mid s,a), n_{p}(s,a), \overline{V}_{p}, \underline{V}_{p}, 0\right)$ and $\left| (\hat{\mathbb{P}} - \mathbb{P}_{p})(\overline{V}_{p} - V_{p}^{\star})(s,a) \right| \leq b_{\mathrm{str}}\left(\hat{\mathbb{P}}(\cdot \mid s,a), n(s,a), \overline{V}_{p}, \underline{V}_{p}, \epsilon\right)$.

Based on the above concentration inequalities and the definitions of the bonus terms, it can be shown inductively that, with probability $1 - O(\delta)$, both $\overline{\text{agg-}Q}_{p}$ and $\overline{\text{ind-}Q}_{p}$ (resp. $\underline{\text{agg-}Q}_{p}$ and $\underline{\text{ind-}Q}_{p}$) are valid upper bounds (resp. lower bounds) of $Q_{p}^{\star}$.

Finally, observe that for any $(s,a) \in \mathcal{S}_{h} \times \mathcal{A}$, $Q_{p}^{\star}(s,a)$ has range $[0, H-h+1]$. By intersecting all the confidence bounds of $Q_{p}^{\star}$ it has obtained, Multi-task-Euler constructs its final upper and lower bound estimates of $Q_{p}^{\star}(s,a)$, namely $\overline{Q}_{p}(s,a)$ and $\underline{Q}_{p}(s,a)$, for $(s,a) \in \mathcal{S}_{h} \times \mathcal{A}$. Similar ideas of using data from multiple sources to construct confidence intervals and guide exploration have been used by [37, 43] for multi-task noncontextual and contextual bandits. Using the relationship between the optimal value $V_{p}^{\star}(s)$ and the optimal action values $\{Q_{p}^{\star}(s,a) : a \in \mathcal{A}\}$, Multi-task-Euler also constructs upper and lower bound estimates of $V_{p}^{\star}(s)$, namely $\overline{V}_{p}(s)$ and $\underline{V}_{p}(s)$, for $s \in \mathcal{S}_{h}$.
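The per-layer backup that combines the two sets of confidence bounds can be written compactly; the sketch below takes the ind-/agg- bound dictionaries (built from the estimates and bonuses above) as given and returns the clipped bounds together with the greedy policy, following the update rules in Algorithm 1.

def backup_layer(h, layer_states, A, ind_Q_up, ind_Q_lo, agg_Q_up, agg_Q_lo, H):
    """One layer of the optimistic backup for a single player: intersect the individual
    and aggregate confidence bounds, then act greedily with respect to the upper bound."""
    Q_up, Q_lo, V_up, V_lo, pi = {}, {}, {}, {}, {}
    for s in layer_states:
        for a in range(A):
            Q_up[(s, a)] = min(H - h + 1, ind_Q_up[(s, a)], agg_Q_up[(s, a)])
            Q_lo[(s, a)] = max(0.0, ind_Q_lo[(s, a)], agg_Q_lo[(s, a)])
        greedy = max(range(A), key=lambda a: Q_up[(s, a)])
        pi[s] = greedy
        V_up[s], V_lo[s] = Q_up[(s, greedy)], Q_lo[(s, greedy)]
    return Q_up, Q_lo, V_up, V_lo, pi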

Executing optimistic policies.

At each episode $k$, for each player $p$, its optimal action-value function upper bound estimate $\overline{Q}_{p}$ induces a greedy policy $\pi^{k}(p) : s \mapsto \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} \overline{Q}_{p}(s,a)$; the player then executes this policy in the episode to collect a new trajectory and uses it to update its individual model parameter estimates. After all players finish episode $k$, the algorithm also updates its aggregate model parameter estimates using Equations (3), (4) and (5), and continues to the next episode.

4 Performance guarantees

Before stating the guarantees of Algorithm 1, we define an instance-dependent complexity measure that characterizes the amenability to information sharing.

Definition 3.

The set of subpar state-action pairs is defined as:

\mathcal{I}_{\epsilon} := \left\{ (s,a) \in \mathcal{S} \times \mathcal{A} : \exists p \in [M], \; \mathrm{gap}_{p}(s,a) > 96 H \epsilon \right\},

where we recall that $\mathrm{gap}_{p}(s,a) = V^{\star}_{p}(s) - Q^{\star}_{p}(s,a)$.

Definition 3 generalizes the notion of subpar arms defined for multi-task multi-armed bandit learning [43] in two ways: first, it is with regard to state-action pairs as opposed to actions only; second, in RL, the suboptimality gaps depend on the optimal value functions, which in turn depend on both the immediate rewards and the subsequent long-term returns.

To ease our later presentation, we also present the following lemma.

Lemma 4.

For any $(s,a) \in \mathcal{I}_{\epsilon}$, we have that: (1) for all $p \in [M]$, $(s,a) \notin Z_{p,\mathrm{opt}}$, where we recall that $Z_{p,\mathrm{opt}} = \{(s,a) : \mathrm{gap}_{p}(s,a) = 0\}$ is the set of optimal state-action pairs with respect to $p$; (2) for all $p, q \in [M]$, $\mathrm{gap}_{p}(s,a) \geq \frac{1}{2} \mathrm{gap}_{q}(s,a)$.

The lemma follows directly from Lemma 2; its proof can be found in the Appendix, along with the proofs of the following theorems. Item 1 implies that any subpar state-action pair is suboptimal for all players. In other words, for every player $p$, the state-action space $\mathcal{S} \times \mathcal{A}$ can be partitioned into three disjoint sets: $\mathcal{I}_{\epsilon}$, $Z_{p,\mathrm{opt}}$, and $(\mathcal{I}_{\epsilon} \cup Z_{p,\mathrm{opt}})^{C}$. Item 2 implies that for any subpar $(s,a)$, its suboptimality gaps with respect to all players are within a constant factor of each other.
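Given the per-player gap functions, the subpar set of Definition 3 is straightforward to compute; the sketch below takes a list of gap dictionaries (e.g., as produced by the value-iteration sketch in Section 2) and applies the $96H\epsilon$ threshold from Definition 3.

def subpar_pairs(gaps, H, eps):
    """I_eps = {(s, a) : gap_p(s, a) > 96 H eps for some player p} (Definition 3).
    `gaps[p][(s, a)]` holds gap_p(s, a)."""
    all_pairs = set().union(*(g.keys() for g in gaps))
    return {sa for sa in all_pairs
            if any(g.get(sa, 0.0) > 96 * H * eps for g in gaps)}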

4.1 Upper bounds

With the above definitions, we are now ready to present the performance guarantees of Algorithm 1. We first present a gap-independent collective regret bound of Multi-task-Euler.

Theorem 5 (Gap-independent bound).

If $\{\mathcal{M}_{p}\}_{p=1}^{M}$ are $\epsilon$-dissimilar, then Multi-task-Euler satisfies that with probability $1-\delta$,

\mathrm{Reg}(K) \leq \tilde{O}\left( M\sqrt{H^{2}|\mathcal{I}_{\epsilon}^{C}|K} + \sqrt{MH^{2}|\mathcal{I}_{\epsilon}|K} + MH^{4}S^{2}A \right).

We again compare this regret upper bound with individual Strong-Euler's gap-independent regret bound. Recall that individual Strong-Euler guarantees that with probability $1-\delta$,

\mathrm{Reg}(K) \leq \tilde{O}\left( M\sqrt{H^{2}SAK} + MH^{4}S^{2}A \right).

We focus on comparing the leading terms, i.e., the $\sqrt{K}$ terms. As $M\sqrt{H^{2}SAK} \eqsim M\sqrt{H^{2}|\mathcal{I}_{\epsilon}|K} + M\sqrt{H^{2}|\mathcal{I}_{\epsilon}^{C}|K}$, we see that the improvement in the collective regret bound comes from the contributions of the subpar state-action pairs: the $M\sqrt{H^{2}|\mathcal{I}_{\epsilon}|K}$ term is reduced to $\sqrt{MH^{2}|\mathcal{I}_{\epsilon}|K}$, a factor of $\tilde{O}(\sqrt{\frac{1}{M}})$ improvement. Moreover, if $|\mathcal{I}_{\epsilon}^{C}| \ll SA$ and $M \gg 1$, Multi-task-Euler provides a regret bound of lower order than individual Strong-Euler.

We next present a gap-dependent upper bound on its collective regret.

Theorem 6 (Gap-dependent upper bound).

If $\{\mathcal{M}_{p}\}_{p=1}^{M}$ are $\epsilon$-dissimilar, then Multi-task-Euler satisfies that with probability $1-\delta$,

\mathrm{Reg}(K) \lesssim \ln\left(\frac{MSAK}{\delta}\right) \Bigg( \sum_{p\in[M]} \left( \sum_{(s,a)\in Z_{p,\mathrm{opt}}} \frac{H^{3}}{\mathrm{gap}_{p,\min}} + \sum_{(s,a)\in(\mathcal{I}_{\epsilon}\cup Z_{p,\mathrm{opt}})^{C}} \frac{H^{3}}{\mathrm{gap}_{p}(s,a)} \right) + \sum_{(s,a)\in\mathcal{I}_{\epsilon}} \frac{H^{3}}{\min_{p}\mathrm{gap}_{p}(s,a)} \Bigg) + \ln\left(\frac{MSAK}{\delta}\right) \cdot MH^{4}S^{2}A \ln\frac{MSA}{\mathrm{gap}_{\min}},

where we recall that $\mathrm{gap}_{p,\min} = \min_{(s,a): \mathrm{gap}_{p}(s,a) > 0} \mathrm{gap}_{p}(s,a)$ and $\mathrm{gap}_{\min} = \min_{p} \mathrm{gap}_{p,\min}$.

To compare this regret bound with that of the individual Strong-Euler baseline, recall that by summing the regret guarantees of Strong-Euler over all players $p \in [M]$ and taking a union bound over all $p$, individual Strong-Euler guarantees a collective regret bound of

\mathrm{Reg}(K) \lesssim \ln\left(\frac{MSAK}{\delta}\right) \Bigg( \sum_{p\in[M]} \left( \sum_{(s,a)\in Z_{p,\mathrm{opt}}} \frac{H^{3}}{\mathrm{gap}_{p,\min}} + \sum_{(s,a)\in(\mathcal{I}_{\epsilon}\cup Z_{p,\mathrm{opt}})^{C}} \frac{H^{3}}{\mathrm{gap}_{p}(s,a)} \right) + \sum_{(s,a)\in\mathcal{I}_{\epsilon}} \sum_{p\in[M]} \frac{H^{3}}{\mathrm{gap}_{p}(s,a)} \Bigg) + \ln\left(\frac{MSAK}{\delta}\right) \cdot MH^{4}S^{2}A \ln\frac{SA}{\mathrm{gap}_{\min}},

which holds with probability $1-\delta$. We again focus on comparing the leading terms, i.e., the terms that have polynomial dependence on the suboptimality gaps in the above two bounds. It can be seen that the improvement in Multi-task-Euler's regret bound comes from the contributions of the subpar state-action pairs: for each $(s,a) \in \mathcal{I}_{\epsilon}$, the corresponding term is reduced from $\sum_{p\in[M]} \frac{H^{3}}{\mathrm{gap}_{p}(s,a)}$ to $\frac{H^{3}}{\min_{p}\mathrm{gap}_{p}(s,a)}$, a factor of $O(\frac{1}{M})$ improvement. Recent work of [44] has shown that in the single-task setting, it is possible to replace $\sum_{(s,a)\in Z_{p,\mathrm{opt}}} \frac{H^{3}}{\mathrm{gap}_{p,\min}}$ with a sharper problem-dependent complexity term that depends on the multiplicity of optimal state-action pairs. We leave improving the guarantee of Theorem 6 in a similar manner as an interesting open problem.

Key to the proofs of Theorems 5 and 6 is a new bound on the surpluses [36] of the value function estimates. Our new surplus bound is a minimum of two terms: one depends on the usual state-action visitation counts of player $p$; the other depends on the task dissimilarity parameter $\epsilon$ and the state-action visitation counts of all players. Detailed proofs can be found in the Appendix.

4.2 Lower bounds

To complement the above upper bounds, we now present gap-dependent and gap-independent regret lower bounds that also depend on our subpar state-action pair notion. Our lower bounds are inspired by regret lower bounds for episodic RL [36, 8] and multi-task bandits [43].

Theorem 7 (Gap-independent lower bound).

For any $A \geq 2$, $H \geq 2$, $S \geq 4H$, $K \geq SA$, $M \in \mathbb{N}$, and $l, l^{C} \in \mathbb{N}$ with $l + l^{C} = SA$ and $l \leq SA - 4(S + HA)$, there exists some $\epsilon$ that satisfies: for any algorithm $\mathrm{Alg}$, there exists an $\epsilon$-MPERL problem instance with $S$ states, $A$ actions, $M$ players, and an episode length of $H$ such that $\left|\mathcal{I}_{\frac{\epsilon}{192H}}\right| \geq l$, and

\mathbb{E}\left[\mathrm{Reg}_{\mathrm{Alg}}(K)\right] \geq \Omega\left( M\sqrt{H^{2}l^{C}K} + \sqrt{MH^{2}lK} \right).

We also present a gap-dependent lower bound. Before that, we first formally define the notion of sublinear regret algorithms: for any fixed $\epsilon$, we say that an algorithm $\mathrm{Alg}$ is a sublinear regret algorithm for the $\epsilon$-MPERL problem if there exist some $C > 0$ (possibly depending on the state-action space, the number of players, and $\epsilon$) and $\alpha < 1$ such that for all $K$ and all $\epsilon$-MPERL environments, $\mathbb{E}\left[\mathrm{Reg}_{\mathrm{Alg}}(K)\right] \leq CK^{\alpha}$.

Theorem 8 (Gap-dependent lower bound).

Fix $\epsilon \geq 0$. For any $S \in \mathbb{N}$, $A \geq 2$, $H \geq 2$, $M \in \mathbb{N}$ with $S \geq 2(H-1)$, let $S_{1} = S - 2(H-1)$, and let $\{\Delta_{s,a,p}\}_{(s,a,p)\in[S_{1}]\times[A]\times[M]}$ be any set of values that satisfies: (1) each $\Delta_{s,a,p} \in [0, H/(48\sqrt{M})]$; (2) for every $(s,p) \in [S_{1}]\times[M]$, there exists at least one action $a \in [A]$ such that $\Delta_{s,a,p} = 0$; and (3) for every $(s,a) \in [S_{1}]\times[A]$ and $p,q \in [M]$, $\left|\Delta_{s,a,p} - \Delta_{s,a,q}\right| \leq \epsilon/4$. There exists an $\epsilon$-MPERL problem instance with $S$ states, $A$ actions, $M$ players, and an episode length of $H$, such that $\mathcal{S}_{1} = [S_{1}]$, $\left|\mathcal{S}_{h}\right| = 2$ for all $h \geq 2$, and

\mathrm{gap}_{p}(s,a) = \Delta_{s,a,p}, \quad \forall (s,a,p) \in [S_{1}]\times[A]\times[M];

for this problem instance, any sublinear regret algorithm $\mathrm{Alg}$ for the $\epsilon$-MPERL problem must satisfy:

\mathbb{E}\left[\mathrm{Reg}_{\mathrm{Alg}}(K)\right] \geq \Omega\left( \ln K \left( \sum_{p\in[M]} \sum_{\substack{(s,a)\in\mathcal{I}^{C}_{\epsilon/768H}: \\ \mathrm{gap}_{p}(s,a)>0}} \frac{H^{2}}{\mathrm{gap}_{p}(s,a)} + \sum_{(s,a)\in\mathcal{I}_{\epsilon/768H}} \frac{H^{2}}{\min_{p}\mathrm{gap}_{p}(s,a)} \right) \right).

Comparing the lower bounds with Multi-task-Euler's regret upper bounds in Theorems 5 and 6, we see that the upper and lower bounds nearly match for any constant $H$. When $H$ is large, a key difference between the upper and lower bounds is that the former are in terms of $\mathcal{I}_{\epsilon}$, while the latter are in terms of $\mathcal{I}_{\Theta(\frac{\epsilon}{H})}$. We conjecture that our upper bounds can be improved by replacing $\mathcal{I}_{\epsilon}$ with $\mathcal{I}_{\Theta(\frac{\epsilon}{H})}$—our analysis uses a clipping trick similar to [36], which may be the reason for a suboptimal dependence on $H$. We leave closing this gap as an open question.

5 Related Work

Regret minimization for MDPs.

Our work belongs to the literature on regret minimization for MDPs, e.g., [5, 18, 8, 3, 9, 19, 10, 46, 36, 49, 45, 44]. In the episodic setting, [3, 10, 46, 36, 49] achieve minimax $\sqrt{H^{2}SAK}$ regret bounds for general stationary MDPs. Furthermore, the Euler algorithm [46] achieves adaptive problem-dependent regret guarantees when the total reward within an episode is small or when the environmental norm of the MDP is small. [36] refines Euler, proposing Strong-Euler, which provides more fine-grained gap-dependent $O(\log K)$ regret guarantees. [45, 44] show that the optimistic Q-learning algorithm [19] and its variants can also achieve gap-dependent logarithmic regret guarantees. Remarkably, [44] achieves a regret bound that improves over that of [36], in that it replaces the dependence on the number of optimal state-action pairs with the number of non-unique state-action pairs.

Transfer and lifelong learning for RL.

A considerable portion of related works concerns transfer learning for RL tasks (see [40, 24, 50] for surveys from different angles), and many studies investigate a batch setting: given some source tasks and target tasks, transfer learning agents have access to batch data collected for the source tasks (and sometimes for the target tasks as well). In this setting, model-based approaches have been explored in e.g., [39]; theoretical guarantees for transfer of samples across tasks have been established in e.g., [25, 41]. Similarly, sequential transfer has been studied under the framework of lifelong RL in e.g., [38, 1, 15, 22]—in this setting, an agent faces a sequence of RL tasks and aims to take advantage of knowledge gained from previous tasks for better performance in future tasks; in particular, analyses on the sample complexity of transfer learning algorithms are presented in [6, 29] under the assumption that an upper bound on the total number of unique (and well-separated) RL tasks is known. We note that, in contrast, we study an online setting in which no prior data are available and multiple RL tasks are learned concurrently by RL agents.

Concurrent RL.

Data sharing between multiple RL agents that learn concurrently has also been investigated in the literature. For example, in [20, 35, 16, 12], a group of agents interact in parallel with identical environments. Another setting is studied in [16], in which agents solve different RL tasks (MDPs); however, similar to [6, 29], it is assumed that there is a finite number of unique tasks and that different tasks are well-separated, i.e., there is a minimum gap. In this work, we assume that players face similar but not necessarily identical MDPs, and we do not assume a minimum gap. [17] study multi-task RL with linear function approximation via representation transfer, where it is assumed that the optimal value functions of all tasks lie in a low-dimensional linear subspace. Our setting and results are most similar to [32] and [13]. [32] study concurrent exploration in similar MDPs with continuous states in the PAC setting; however, their PAC guarantee does not hold for target error rates arbitrarily close to zero; in contrast, our algorithm has a fall-back guarantee, in that it always has sublinear regret. Concurrent RL from similar linear MDPs has also been recently studied in [13]: under the assumption of small heterogeneity between different MDPs (a setting very similar to ours), the provided regret guarantee involves a term that is linear in the number of episodes, whereas our algorithm always has sublinear regret; concurrent RL under the assumption of large heterogeneity is also studied in that work, but additional contextual information is assumed to be available to the players to ensure sublinear regret.

Other related topics and models.

In many multi-agent RL models [47, 31], a set of learning agents interact with a common environment and have shared global states; in particular, [48] study the setting with heterogeneous reward distributions, and provide convergence guarantees for two policy gradient-based algorithms. In contrast, in our setting, our learning agents interact with separate environments. Multi-agent bandits with similar, heterogeneous reward distributions are investigated in [37, 43]; herein, we generalize their multi-task bandit problem setting to the episodic MDP setting.

6 Conclusion and Future Directions

In this paper, we generalize the multi-task bandit learning framework in [43] and formulate a multi-task concurrent RL problem, in which tasks are similar but not necessarily identical. We provide a provably efficient model-based algorithm that takes advantage of knowledge transfer between different tasks. Our instance-dependent regret upper and lower bounds formalize the intuition that subpar state-action pairs are amenable to information sharing among tasks.

There still remain gaps between our upper and lower bounds, which can be closed by either a finer analysis or a better algorithm: first, the dependence on $\mathcal{I}_{\epsilon}$ in the upper bounds does not match the dependence on $\mathcal{I}_{\Theta(\epsilon/H)}$ in the lower bounds when $H$ is large; second, the gap-dependent upper bound has an $O(H^{3})$ dependence, whereas the gap-dependent lower bound only has an $\Omega(H^{2})$ dependence; third, the additive dependence on the number of optimal state-action pairs can potentially be removed by new algorithmic ideas [44].

Furthermore, one major obstacle to deploying our algorithm in practice is its requirement for knowledge of $\epsilon$; an interesting avenue is to apply model selection strategies from bandits and RL to achieve adaptivity to unknown $\epsilon$. Another interesting future direction is to consider more general parameter transfer for online RL, for example, in the context of function approximation.

7 Acknowledgements

We thank Kamalika Chaudhuri for helpful initial discussions, and thank Akshay Krishnamurthy and Tongyi Cao for discussing the applicability of adaptive RL in metric spaces to the multitask RL problem studied in this paper. CZ acknowledges startup funding support from the University of Arizona. ZW thanks the National Science Foundation under IIS 1915734 and CCF 1719133 for research support.

References

  • [1] David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, and Michael Littman. Policy and value transfer in lifelong reinforcement learning. In International Conference on Machine Learning, pages 20–29. PMLR, 2018.
  • [2] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In International conference on algorithmic learning theory, pages 150–165. Springer, 2007.
  • [3] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • [4] Peter Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory-COLT 2008, pages 335–342. Omnipress, 2008.
  • [5] Peter L Bartlett and Ambuj Tewari. Regal: a regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42, 2009.
  • [6] Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 122–131, 2013.
  • [7] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • [8] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pages 2818–2826, 2015.
  • [9] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: uniform pac bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5717–5727, 2017.
  • [10] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516. PMLR, 2019.
  • [11] Carlo D’Eramo, Davide Tateo, Andrea Bonarini, Marcello Restelli, and Jan Peters. Sharing knowledge in multi-task deep reinforcement learning. In International Conference on Learning Representations, 2020.
  • [12] Maria Dimakopoulou and Benjamin Van Roy. Coordinated exploration in concurrent reinforcement learning. In International Conference on Machine Learning, pages 1271–1279. PMLR, 2018.
  • [13] Abhimanyu Dubey and Alex Pentland. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021.
  • [14] David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100–118, 1975.
  • [15] Francisco M Garcia and Philip S Thomas. A meta-mdp approach to exploration for lifelong reinforcement learning. arXiv preprint arXiv:1902.00843, 2019.
  • [16] Zhaohan Guo and Emma Brunskill. Concurrent pac rl. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • [17] Jiachen Hu, Xiaoyu Chen, Chi Jin, Lihong Li, and Liwei Wang. Near-optimal representation learning for linear bandits and linear rl. arXiv preprint arXiv:2102.04132, 2021.
  • [18] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
  • [19] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
  • [20] R Matthew Kretchmar. Parallel reinforcement learning. In The 6th World Conference on Systemics, Cybernetics, and Informatics. Citeseer, 2002.
  • [21] Alyssa Kubota, Emma IC Peterson, Vaishali Rajendren, Hadas Kress-Gazit, and Laurel D Riek. Jessie: Synthesizing social robot behaviors for personalized neurorehabilitation and beyond. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pages 121–130, 2020.
  • [22] Nicholas C Landolfi, Garrett Thomas, and Tengyu Ma. A model-based approach for sample-efficient multi-task reinforcement learning. arXiv preprint arXiv:1907.04964, 2019.
  • [23] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • [24] Alessandro Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012.
  • [25] Alessandro Lazaric and Marcello Restelli. Transfer from multiple mdps. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • [26] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 544–551, 2008.
  • [27] Xinle Liang, Yang Liu, Tianjian Chen, Ming Liu, and Qiang Yang. Federated transfer reinforcement learning for autonomous driving. arXiv preprint arXiv:1910.06001, 2019.
  • [28] Boyi Liu, Lujia Wang, and Ming Liu. Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems. IEEE Robotics and Automation Letters, 4(4):4555–4562, 2019.
  • [29] Yao Liu, Zhaohan Guo, and Emma Brunskill. Pac continuous state online multitask reinforcement learning with identification. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’16, page 438–446, 2016.
  • [30] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. COLT, 2009.
  • [31] Afshin OroojlooyJadid and Davood Hajinezhad. A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963, 2019.
  • [32] Jason Pazis and Ronald Parr. Efficient pac-optimal exploration in concurrent, continuous state mdps with delayed updates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • [33] Michael T. Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G. Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, 2005.
  • [34] David Silver, Leonard Newnham, David Barker, Suzanne Weller, and Jason McFall. Concurrent reinforcement learning from customer interactions. In International conference on machine learning, pages 924–932. PMLR, 2013.
  • [35] David Silver, Leonard Newnham, David Barker, Suzanne Weller, and Jason McFall. Concurrent reinforcement learning from customer interactions. In International conference on machine learning, pages 924–932. PMLR, 2013.
  • [36] Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. arXiv preprint arXiv:1905.03814, 2019.
  • [37] Marta Soare, Ouais Alsharif, Alessandro Lazaric, and Joelle Pineau. Multi-task linear bandits. NIPS2014 Workshop on Transfer and Multi-task Learning : Theory meets Practice, 2014.
  • [38] Fumihide Tanaka and Masayuki Yamamura. Multitask reinforcement learning on the distribution of mdps. In Proceedings 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, volume 3, pages 1108–1113. IEEE, 2003.
  • [39] Matthew E Taylor, Nicholas K Jong, and Peter Stone. Transferring instances for model-based reinforcement learning. In Joint European conference on machine learning and knowledge discovery in databases, pages 488–505. Springer, 2008.
  • [40] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
  • [41] Andrea Tirinzoni, Mattia Salvini, and Marcello Restelli. Transfer of samples in policy search via multiple importance sampling. In International Conference on Machine Learning, pages 6264–6274. PMLR, 2019.
  • [42] Konstantinos Tsiakas, Cheryl Abellanoza, and Fillia Makedon. Interactive learning and adaptation for robot assisted therapy for people with dementia. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, pages 1–4, 2016.
  • [43] Zhi Wang, Chicheng Zhang, Manish Kumar Singh, Laurel Riek, and Kamalika Chaudhuri. Multitask bandit learning through heterogeneous feedback aggregation. In International Conference on Artificial Intelligence and Statistics, pages 1531–1539. PMLR, 2021.
  • [44] Haike Xu, Tengyu Ma, and Simon S Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. arXiv preprint arXiv:2102.04692, 2021.
  • [45] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
  • [46] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
  • [47] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.
  • [48] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881. PMLR, 2018.
  • [49] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020.
  • [50] Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888, 2020.
  • [51] Hankz Hankui Zhuo, Wenfeng Feng, Qian Xu, Qiang Yang, and Yufeng Lin. Federated reinforcement learning. arXiv preprint arXiv:1901.08277, 2019.