
Towards Understanding the Benefit of Multitask Representation Learning in Decision Process

Rui Lu (r-lu21@mails.tsinghua.edu.cn), Department of Automation, Tsinghua University
Andrew Zhao (zqc21@mails.tsinghua.edu.cn), Department of Automation, Tsinghua University
Simon Du (ssdu@cs.washington.com), Paul G. Allen School of Computer Science and Engineering, University of Washington
Gao Huang (gaohuang@tsinghua.edu.cn), Department of Automation, Tsinghua University
Abstract

While multitask representation learning has become a popular approach in reinforcement learning (RL) to boost sample efficiency, the theoretical understanding of why and how it works is still limited. Most previous analytical works could only assume that the representation function is already known to the agent or belongs to a linear function class, since analyzing representations from a general function class encounters non-trivial technical obstacles, such as establishing generalization guarantees and formulating confidence bounds in an abstract function space. However, the linear-case analysis relies heavily on the particularity of the linear function class, while real-world practice usually adopts general non-linear representation functions such as neural networks, which significantly reduces its applicability. In this work, we extend the analysis to general function class representations. Specifically, we consider an agent playing $M$ contextual bandits (or MDPs) concurrently and extracting a shared representation function $\phi$ from a specific function class $\Phi$ using our proposed Generalized Functional Upper Confidence Bound algorithm (GFUCB). We theoretically validate the benefit of multitask representation learning with general function classes for bandits and linear MDPs for the first time. Lastly, we conduct experiments to demonstrate the effectiveness of our algorithm with a neural network representation.

Keywords: Multitask, Representation Learning, Reinforcement Learning, Transfer Learning, Sample Complexity.

1 Introduction

2 Related Work

In the supervised learning setting, a line of work has studied multitask learning and representation learning under various assumptions baxter2000model; du2017hypothesis; ando2005framework; ben2003exploiting; maurer2006bounds; cavallanti2010linear; maurer2016benefit; du2020few; tripuraneni2020provable. These results assume that all tasks share a joint representation function. It is also worth mentioning that tripuraneni2020provable gave the method-of-moments estimator and built the confidence ball for the feature extractor, which inspired our algorithm for the infinite-action setting.

The benefit of representation learning has been studied in sequential decision-making problems, especially in RL domains. arora2020provable proved that representation learning can reduce the sample complexity of imitation learning. d'eramo2020sharing showed that representation learning can improve the convergence rate of the value iteration algorithm. Both require a probabilistic assumption similar to that in maurer2016benefit, and the statistical rates are of similar forms to those in maurer2016benefit. Following these works, we study a special class of MDPs called linear MDPs. The linear MDP yang2019sample; jin2019provably is a popular model in RL which uses linear function approximation to generalize over large state-action spaces. zanette2020learning extended the definition to MDPs with low inherent Bellman error (IBE for short). This model assumes that both the transition and the reward are near-linear in given features.

Recently, yang2021impact showed that multitask representation learning reduces the regret in linear bandits, using the framework developed by du2020few. Moreover, several works hu2021near; lu2021power; jin2019provably proved results on the benefit of multitask representation learning in RL with a generative model or a linear representation function. However, these works either restrict the representation function class to be linear, or assume the representation function is known to the agent. This is unrealistic in practice and limits the applicability of these results.

The most relevant line of work is value approximation with general function classes for bandits and MDPs. russo2013eluder first proposed the concept of the eluder dimension to measure the complexity of a function class and gave a regret bound for general function bandits using this dimension. wang2020reinforcement further proved that it can also be adopted in MDP problems. dong2021provable extended the analysis with sequential Rademacher complexity. Inspired by these works, we adopt the eluder dimension and develop our own analysis. It should be pointed out, however, that all those works focus on the single-task setting, giving a provable bound for just one MDP or bandit problem. They lack insight into why simultaneously dealing with multiple distinct but correlated tasks is more sample efficient. Our work aims to establish a framework to explain this. By locating the ground-truth value function in the multihead function space $\mathcal{F}^{\otimes M}$ (see the detailed definition in Section 4), we are able to theoretically explain the main reason for the boost in sample efficiency. Informally speaking, the shared feature extraction backbone $\phi$ receives samples from all the tasks, therefore accelerating convergence for every single task compared with solving them separately.

3 Preliminaries

3.1 Notations

We use $[n]$ to denote the set $\{1,2,\ldots,n\}$ and $\langle\cdot,\cdot\rangle$ to denote the inner product between two vectors. We write $f(x)=O(g(x))$ if $f(x)\leq C\cdot g(x)$ holds for all $x>x_{0}$ with some $C>0$ and $x_{0}>0$, and $f(x)=\tilde{O}(g(x))$ when logarithmic factors are ignored.

3.2 Multitask Contextual Bandits

We first study multitask representation learning in contextual bandits. Each task $i\in[M]$ is associated with an unknown function $f^{(i)}\in\mathcal{F}$ from a certain function class $\mathcal{F}$. At each step $t\in[T]$, the agent is given a context vector $C_{t,i}$ from a certain context space $\mathcal{C}$ and a set of actions $\mathcal{A}_{t,i}$ selected from a certain action space $\mathcal{A}$ for each task $i$. The agent needs to choose one action $A_{t,i}\in\mathcal{A}_{t,i}$ and then receives a reward $R_{t,i}=f^{(i)}(C_{t,i},A_{t,i})+\eta_{t,i}$, where $\eta_{t,i}$ is random noise sampled from an i.i.d. distribution. The agent's goal is to learn the functions $f^{(i)}$ and maximize the cumulative reward, or equivalently, minimize the total regret over all $M$ tasks in $T$ steps, defined below.

$$\operatorname{Reg}(T)\stackrel{\text{def}}{=}\sum_{t=1}^{T}\sum_{i=1}^{M}\left(f^{(i)}(C_{t,i},A_{t,i}^{\star})-f^{(i)}(C_{t,i},A_{t,i})\right),$$

where $A_{t,i}^{\star}=\arg\max_{A\in\mathcal{A}_{t,i}}f^{(i)}(C_{t,i},A)$ is the optimal action with respect to context $C_{t,i}$ in task $i$.
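
To make the interaction protocol concrete, the following minimal Python sketch simulates the multitask bandit loop and the regret bookkeeping above under a shared linear-in-$\phi$ reward model. The random-projection $\phi$, the placeholder uniform policy, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Minimal simulation of the multitask contextual bandit protocol; everything
# here (the random-projection phi, the uniform policy) is illustrative only.
rng = np.random.default_rng(0)
M, T, k, n_actions = 4, 100, 8, 10             # tasks, steps, feature dim, |A_{t,i}|

W_phi = rng.normal(size=(k, 2 * k))            # parameters of the "true" representation
thetas = rng.normal(size=(M, k)) / np.sqrt(k)  # task-specific parameters theta_i

def phi(context, action):
    # Stand-in for the unknown shared representation phi in Phi.
    return np.tanh(W_phi @ np.concatenate([context, action]))

total_regret = 0.0
for t in range(T):
    for i in range(M):
        context = rng.normal(size=k)                     # C_{t,i}
        action_set = rng.normal(size=(n_actions, k))     # A_{t,i}
        values = np.array([phi(context, a) @ thetas[i] for a in action_set])
        chosen = rng.integers(n_actions)                 # placeholder policy
        reward = values[chosen] + rng.normal()           # R_{t,i}, fed to the learner
        total_regret += values.max() - values[chosen]    # one regret summand
print(f"Reg(T) of the uniform policy: {total_regret:.2f}")
```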

3.3 Multitask MDP

Going beyond contextual bandits, we also study how this shared low-dimensional representation can benefit sequential decision-making problems such as the Markov Decision Process (MDP). In this work, we study the undiscounted episodic finite-horizon MDP problem. Consider an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},r,H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the transition dynamics, $r(\cdot,\cdot)$ is the reward function, and $H$ is the planning horizon. The agent starts from an initial state $s_{1}$, which can be either fixed or sampled from a certain distribution, and then interacts with the environment for $H$ rounds. In the single-task framework, at each round (also called level) $h$, the agent performs an action $a_{h}$ according to a policy function $a_{h}=\pi_{h}(s_{h})$. The agent then receives a reward $R_{h}(s_{h},a_{h})=r(s_{h},a_{h})+\eta_{h}$, where $\eta_{h}$ is again a noise term. The environment then transits from state $s_{h}$ to $s_{h+1}$ according to the distribution $\mathcal{P}(\cdot|s_{h},a_{h})$. The action value function of a policy $\pi$ is defined as $Q_{h}^{\pi}(s_{h},a_{h})=r(s_{h},a_{h})+\mathbb{E}\left[\sum_{t=h+1}^{H}R_{t}(s_{t},\pi_{t}(s_{t}))\right]$, and the state value function is defined as $V_{h}^{\pi}(s_{h})=Q_{h}^{\pi}(s_{h},\pi_{h}(s_{h}))$. Note that there always exists a deterministic optimal policy $\pi^{\star}$ for which $V_{h}^{\pi^{\star}}(s)=\max_{\pi}V_{h}^{\pi}(s)$ and $Q_{h}^{\pi^{\star}}(s,a)=\max_{\pi}Q_{h}^{\pi}(s,a)$; we denote them by $V_{h}^{\star}(s)$ and $Q_{h}^{\star}(s,a)$ for simplicity.

In the multitask setting, the agent receives a batch of states $\{s_{h,t}^{(i)}\}_{i=1}^{M}$ simultaneously from $M$ different MDP tasks $\{\mathcal{M}^{(i)}\}_{i=1}^{M}$ at each round $h$ of episode $t$, then performs a batch of actions $\{\pi_{t}^{i}(s_{h,t}^{(i)})\}_{i=1}^{M}$, one for each task $i\in[M]$. Every $H$ rounds form an episode, and the agent interacts with the environment for $T$ episodes in total. The goal of the agent is to minimize the regret defined as

$$\operatorname{Reg}(T)=\sum_{t=1}^{T}\sum_{i=1}^{M}V_{1}^{(i)\star}\left(s_{1,t}^{(i)}\right)-V_{1}^{\pi_{t}^{i}}\left(s_{1,t}^{(i)}\right),$$

where $V_{1}^{(i)\star}$ is the optimal value of task $i$ and $s_{1,t}^{(i)}$ is the initial state of task $i$ at episode $t$.

For the representation function to play a role, it is assumed that all tasks share the same state space $\mathcal{S}$ and action space $\mathcal{A}$. Moreover, there exists a representation function $\phi:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}^{k}$ such that the action and state value functions of every task $\mathcal{M}^{(i)}$ are always (approximately) linear in this representation. For example, given a representation function $\phi$, the action value approximation function at level $h$ is parametrized by a vector $\boldsymbol{\theta}_{h}\in\mathbb{R}^{k}$ as $Q_{h}[\phi,\boldsymbol{\theta}_{h}]\stackrel{\text{def}}{=}\langle\phi(s,a),\boldsymbol{\theta}_{h}\rangle$, and similarly $V_{h}[\phi,\boldsymbol{\theta}_{h}](s)\stackrel{\text{def}}{=}\max_{a}\langle\phi(s,a),\boldsymbol{\theta}_{h}\rangle$. We denote the set of all such action value functions as $\mathcal{Q}_{h}=\{Q_{h}[\phi,\boldsymbol{\theta}_{h}]:\phi\in\Phi,\boldsymbol{\theta}_{h}\in\mathbb{R}^{k}\}$, and the value function approximation space as $\mathcal{V}_{h}=\{V_{h}[\phi,\boldsymbol{\theta}_{h}]:\phi\in\Phi,\boldsymbol{\theta}_{h}\in\mathbb{R}^{k}\}$. Each task $\mathcal{M}^{(i)}$ is a linear MDP, which means $\mathcal{Q}_{h}$ is always approximately closed under the Bellman operator $\mathcal{T}_{h}(Q_{h+1})(s,a)\stackrel{\text{def}}{=}r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{h}(\cdot|s,a)}\max_{a^{\prime}}Q_{h+1}(s^{\prime},a^{\prime})$.

Linear MDP Definition. A finite-horizon MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},r,H)$ is a linear MDP if there exists a representation function $\phi:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}^{k}$ and its induced value approximation function classes $\mathcal{Q}_{h},h\in[H]$, such that the inherent Bellman error zanette2020learning

$$\mathcal{I}_{h}\stackrel{\text{def}}{=}\sup_{Q_{h+1}\in\mathcal{Q}_{h+1}}\inf_{Q_{h}\in\mathcal{Q}_{h}}\sup_{s\in\mathcal{S},a\in\mathcal{A}}\left|\left(Q_{h}-\mathcal{T}_{h}\left(Q_{h+1}\right)\right)(s,a)\right|$$

is always smaller than some small constant $\mathcal{I}$.

The definition essentially assumes that for any Q-value approximation function $Q_{h+1}\in\mathcal{Q}_{h+1}$ at level $h+1$, the Q-value function $Q_{h}$ at level $h$ induced by it can always be closely approximated within class $\mathcal{Q}_{h}$, which assures accuracy through sequential levels.
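
To illustrate the definition, the following Python sketch builds the Bellman backup $\mathcal{T}_{h}$ for a small synthetic tabular MDP and measures the sup-norm residual when a linear-in-$\phi$ class tries to fit the backup of one of its own members. All quantities (the random dynamics, features, and the least-squares fit) are illustrative assumptions; since least squares does not minimize the sup-norm, the printed residual only upper-bounds the inner infimum in the definition of $\mathcal{I}_{h}$.

```python
import numpy as np

# Sketch: upper-bounding the inherent Bellman error I_h for a synthetic
# tabular MDP and a linear-in-phi Q-class.  Everything here is illustrative.
rng = np.random.default_rng(1)
S, A, k = 20, 5, 6

P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'
r = rng.uniform(0.0, 0.5, size=(S, A))           # reward function
Phi = rng.normal(size=(S, A, k))                 # feature map phi(s, a)
Phi /= np.linalg.norm(Phi, axis=-1, keepdims=True)   # enforce ||phi(s, a)|| <= 1

def bellman_backup(Q_next):
    # (T_h Q_{h+1})(s, a) = r(s, a) + E_{s'}[ max_{a'} Q_{h+1}(s', a') ]
    return r + P @ Q_next.max(axis=1)

# Take an arbitrary Q_{h+1} in the class and measure how well the class Q_h
# can fit its backup.  The least-squares fit only upper-bounds the infimum of
# the sup-norm error over theta, so this is an upper bound on the residual.
theta_next = rng.normal(size=k) / np.sqrt(k)
target = bellman_backup(Phi @ theta_next)

X = Phi.reshape(S * A, k)
theta_fit, *_ = np.linalg.lstsq(X, target.reshape(-1), rcond=None)
ibe_upper = np.abs(X @ theta_fit - target.reshape(-1)).max()
print(f"sup-norm Bellman residual: {ibe_upper:.4f}")
```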

3.4 Eluder Dimension

To measure the complexity of a general function class $\mathcal{F}$, we adopt the concept of the eluder dimension russo2013eluder. We first define $\epsilon$-dependence and independence.

Definition 1 ($\epsilon$-dependent). An input $x$ is $\epsilon$-dependent on the set $X=\{x_{1},x_{2},\ldots,x_{n}\}$ with respect to function class $\mathcal{F}$ if any pair of functions $f,\tilde{f}\in\mathcal{F}$ satisfying

$$\sqrt{\sum_{i=1}^{n}(f(x_{i})-\tilde{f}(x_{i}))^{2}}\leq\epsilon$$

also satisfies $|f(x)-\tilde{f}(x)|\leq\epsilon$. Otherwise, we say the input $x$ is $\epsilon$-independent of the set $X$.

Intuitively, $\epsilon$-dependence captures the exhaustion of the interpolation flexibility of function class $\mathcal{F}$: given an unknown function $f$'s values on the set $X=\{x_{1},x_{2},\ldots,x_{n}\}$, we are able to pin down its value on the particular input $x$ with only an $\epsilon$-scale prediction error.

Definition 2 ($\epsilon$-eluder dimension). The $\epsilon$-eluder dimension $\operatorname{dim}_{E}(\mathcal{F},\epsilon)$ is the maximum length $d$ of a sequence of inputs $x_{1},x_{2},\ldots,x_{d}\in\mathcal{X}$ such that, for some $\epsilon^{\prime}\geq\epsilon$, every element is $\epsilon^{\prime}$-independent of its predecessors.

This definition parallels the dimension of a linear space, which is the maximum length of a sequence of vectors such that each one is linearly independent of its predecessors. For instance, if $\mathcal{F}=\{f:\mathbb{R}^{d}\mapsto\mathbb{R},\,f(x)=\theta^{\top}x\}$, then $\operatorname{dim}_{E}(\mathcal{F},\epsilon)=O(d\log 1/\epsilon)$, since the values on any $d$ linearly independent inputs fully determine a linear map. We also omit $\epsilon$ and write $\operatorname{dim}_{E}(\mathcal{F})$ when the dependence on $\epsilon$ is only logarithmic.
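
For intuition, here is a brute-force Python sketch of Definitions 1 and 2 for a small finite function class, where each function is stored as its vector of values on a finite input set. The greedy construction returns a lower bound on $\operatorname{dim}_{E}(\mathcal{F},\epsilon)$ (it fixes $\epsilon'=\epsilon$ and does not search over orderings); the class and all names are illustrative.

```python
import itertools
import numpy as np

# Brute-force sketch of the eluder dimension for a *finite* function class,
# following Definitions 1 and 2.  F is a matrix with F[j, x] = f_j(x).
def is_independent(x, prefix, F, eps):
    """x is eps-independent of `prefix` iff some pair (f, f~) agrees on the
    prefix up to eps (in empirical 2-norm) yet differs by more than eps at x."""
    for fa, fb in itertools.combinations(F, 2):
        close_on_prefix = np.sqrt(np.sum((fa[prefix] - fb[prefix]) ** 2)) <= eps
        if close_on_prefix and abs(fa[x] - fb[x]) > eps:
            return True
    return False

def eluder_dimension_lower_bound(F, n_inputs, eps):
    """Length of a greedily built sequence in which every input is
    eps-independent of its predecessors (a lower bound on dim_E(F, eps))."""
    sequence = []
    for x in range(n_inputs):
        if is_independent(x, sequence, F, eps):
            sequence.append(x)
    return len(sequence)

# Example: linear functions f_theta(x) = <theta, e_x> evaluated on d basis inputs.
d = 4
thetas = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=d)))
F = thetas  # F[j, x] = thetas[j, x], i.e. evaluation at the x-th basis vector
print(eluder_dimension_lower_bound(F, n_inputs=d, eps=0.5))  # prints 4 (= d)
```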

4 Main Results for Contextual Bandits

In this section, we present our theoretical analysis of the proposed GFUCB algorithm for contextual bandits.

4.1 Assumptions

This section lists the assumptions we make for our analysis. The main assumption is the existence of a shared feature extraction function from a class $\Phi=\{\phi:\mathcal{C}\times\mathcal{A}\mapsto\mathbb{R}^{k}\}$ such that every task's value function is linear in this $\phi$.

Assumption 1.1 (Shared Space and Representation) All the tasks share the same context space $\mathcal{C}$ and action space $\mathcal{A}$. Moreover, there exists a shared representation function $\phi\in\Phi$ and a set of $k$-dimensional parameters $\{\boldsymbol{\theta}_{i}\}_{i=1}^{M}$ such that each $f^{(i)}$ has the form $f^{(i)}(\cdot,\cdot)=\langle\phi(\cdot,\cdot),\boldsymbol{\theta}_{i}\rangle$.

Following standard regularity assumptions for bandits hu2021near; yang2021impact, we make assumptions on the noise distribution and the function parameters.

Assumption 1.2 (Conditional Sub-Gaussian Noise) Denote by $\mathcal{H}_{t,i}=\sigma(C_{1,i},A_{1,i},\ldots,C_{t,i},A_{t,i})$ the $\sigma$-field summarizing the history information available before the reward $R_{t,i}$ is observed for every task $i\in[M]$. The noise $\eta_{t,i}$ is sampled from a 1-sub-Gaussian distribution, namely $\mathbb{E}\left[\exp(\lambda\eta_{t,i})\mid\mathcal{H}_{t,i}\right]\leq\exp\left(\frac{\lambda^{2}}{2}\right)$ for all $\lambda\in\mathbb{R}$.

Assumption 1.3 (Bounded-Norm Feature and Parameter) We assume that the parameter $\boldsymbol{\theta}_{i}$ and the feature vector of any context-action pair $(C,A)\in\mathcal{C}\times\mathcal{A}$ are bounded by constants for each task $i\in[M]$, namely $\|\boldsymbol{\theta}_{i}\|_{2}\leq\sqrt{k}$ for all $i\in[M]$ and $\|\phi(C,A)\|_{2}\leq 1$ for all $C\in\mathcal{C},A\in\mathcal{A}$.

Apart from these, we add an assumption to measure and constrain the complexity of the value approximation function class $\mathcal{F}=\mathcal{L}\circ\Phi$.

Assumption 1.4 (Bounded Eluder Dimension). We assume that the function class $\mathcal{F}$ has bounded eluder dimension $d$, which means $\operatorname{dim}_{E}(\mathcal{F},\epsilon)=\tilde{O}(d)$ for any $\epsilon$.

4.2 Algorithm Details

Algorithm 1 Generalized Functional UCB Algorithm
1:  for step $t:1\to T$ do
2:     Compute $\mathcal{F}_{t}$ according to (*)
3:     Receive contexts $C_{t,i}$ and action sets $\mathcal{A}_{t,i}$, $i\in[M]$
4:     $f_{t},A_{t,i}=\mathop{\mathrm{argmax}}_{f\in\mathcal{F}_{t},\ A_{i}\in\mathcal{A}_{t,i}}\sum_{i=1}^{M}f^{(i)}(C_{t,i},A_{i})$
5:     Play $A_{t,i}$ for task $i$, and get reward $R_{t,i}$, for $i\in[M]$.
6:  end for

The details are given in Algorithm 1. At each step $t$, the algorithm first solves the optimization problem below to obtain the empirically optimal solution $\hat{f}_{t}$ that best predicts the rewards on the context-action pairs seen so far.

$$\hat{f}_{t}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}^{\otimes M}}\sum_{i=1}^{M}\sum_{k=1}^{t-1}\left(f^{(i)}(C_{k,i},A_{k,i})-R_{k,i}\right)^{2}.$$

Here we abuse notation and write $\mathcal{F}^{\otimes M}=\left\{f=\left(f^{(1)},\ldots,f^{(M)}\right):f^{(i)}(\cdot)=\phi(\cdot)^{\top}\boldsymbol{w}_{i}\in\mathcal{F}\right\}$ for the $M$-head prediction version of $\mathcal{F}$, parametrized by a shared representation function $\phi(\cdot)$ and a weight matrix $\boldsymbol{W}=[\boldsymbol{w}_{1},\ldots,\boldsymbol{w}_{M}]\in\mathbb{R}^{k\times M}$. We use $f^{(i)}$ to denote the $i$-th head of function $f$, which serves task $i$.
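
As a concrete (hypothetical) instantiation of this ERM step, the sketch below parametrizes $\mathcal{F}^{\otimes M}$ as a small neural trunk $\phi$ shared across tasks with $M$ linear heads $\boldsymbol{w}_{1},\ldots,\boldsymbol{w}_{M}$, and minimizes the summed squared loss over all tasks' past context-action pairs by gradient descent. The architecture, PyTorch usage, and hyperparameters are our own illustrative choices, not prescribed by the algorithm.

```python
import torch
import torch.nn as nn

# Sketch of the ERM step that produces \hat f_t: a shared trunk phi (a small
# MLP standing in for Phi) with M task-specific linear heads, fit by the
# summed squared loss over all tasks' past (context, action, reward) samples.
class MultiHeadValueNet(nn.Module):
    def __init__(self, input_dim, k, n_tasks):
        super().__init__()
        self.phi = nn.Sequential(                    # shared representation phi(.)
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, k))
        self.W = nn.Linear(k, n_tasks, bias=False)   # columns are w_1, ..., w_M

    def forward(self, x):                            # x: (batch, input_dim)
        return self.W(self.phi(x))                   # (batch, M): all heads at once

def fit_erm(net, xs, rewards, task_ids, n_steps=500, lr=1e-2):
    """Minimize sum_i sum_k (f^{(i)}(x_{k,i}) - R_{k,i})^2 by gradient descent."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_steps):
        preds = net(xs)                                        # (N, M)
        preds_i = preds.gather(1, task_ids.view(-1, 1)).squeeze(1)
        loss = ((preds_i - rewards) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return net

# Usage: xs encodes all past (context, action) pairs; task_ids[k] in {0..M-1}
# records which task each sample came from.
M, input_dim, k, N = 4, 16, 8, 256
net = MultiHeadValueNet(input_dim, k, M)
xs, task_ids, rewards = torch.randn(N, input_dim), torch.randint(0, M, (N,)), torch.randn(N)
fit_erm(net, xs, rewards, task_ids)
```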

After obtaining $\hat{f}_{t}$, we maintain a functional confidence set $\mathcal{F}_{t}\subseteq\mathcal{F}^{\otimes M}$ of possible value approximation functions

$$\mathcal{F}_{t}\stackrel{\text{def}}{=}\left\{f\in\mathcal{F}^{\otimes M}:\left\|\hat{f}_{t}-f\right\|^{2}_{2,E_{t}}\leq\beta_{t},\ |f^{(i)}(\boldsymbol{x})|\leq 1,\ \forall\boldsymbol{x}\in\mathcal{C}\times\mathcal{A},\ i\in[M]\right\}\qquad(*)$$

Here, for the sake of simplicity, we use $\|\hat{f}_{t}-f\|^{2}_{2,E_{t}}=\sum_{i=1}^{M}\sum_{k=1}^{t-1}\big(\hat{f}_{t}^{(i)}(\boldsymbol{x}_{k,i})-f^{(i)}(\boldsymbol{x}_{k,i})\big)^{2}$ to denote the empirical 2-norm of $\hat{f}_{t}-f=\big(\hat{f}_{t}^{(1)}-f^{(1)},\ldots,\hat{f}_{t}^{(M)}-f^{(M)}\big)$. Basically, (*) contains all functions in $\mathcal{F}^{\otimes M}$ whose value estimates on all collected context-action pairs $\boldsymbol{x}_{k,i}=(C_{k,i},A_{k,i})$ differ from those of the empirical loss minimizer $\hat{f}_{t}$ by at most a preset parameter $\beta_{t}$. We show that with high probability, the true value function $f_{\theta}$ is always contained in $\mathcal{F}_{t}$ when $\beta_{t}$ is carefully chosen as $\tilde{O}\big(Mk+\log\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})\big)$, where $\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})$ is the $\alpha$-covering number of the function class $\Phi$ in the sup-norm $\|\phi\|_{\infty}=\max_{\boldsymbol{x}\in\mathcal{C}\times\mathcal{A}}\|\phi(\boldsymbol{x})\|_{2}$, and $\alpha$ is set to a small value $\frac{1}{kMT}$ (see the detailed definition and proof in Lemma 1).

For the action choice, our algorithm follows the optimism principle of OFUL: it estimates each action's value with the most optimistic function value in the confidence set $\mathcal{F}_{t}$ and chooses the action whose optimistic value estimate is the highest. In the multitask setting, we choose one action from each task to form an action tuple $(A_{1},A_{2},\ldots,A_{M})$ such that the sum of optimistic value estimates $\sum_{i=1}^{M}f^{(i)}(C_{t,i},A_{i})$ is maximized by some function $f\in\mathcal{F}_{t}$.

Intractability. Some may have concerns about the intractability of constructing the confidence set (*) and solving the optimization problems for $\hat{f}_{t}$, $f_{t}$, and $A_{t,i}$. The answer is twofold. From the theoretical perspective, since the focus of the problem is sample complexity rather than computational complexity, a computational oracle can simply be assumed to return the solution of the optimization. This is common practice in theoretical works jin2021bellman; sun2018model; agarwal2014taming; jiang2017contextual that focus on sample complexity analysis. From the empirical perspective, the optimization can often be carried out with gradient methods. For example, solving for $\hat{f}_{t}$ is a standard empirical risk minimization problem and can be solved effectively with gradient methods du2019gradient. As for $f_{t}$ and $A_{t,i}$, note that it is not necessary to explicitly build the confidence set $\mathcal{F}_{t}$ by listing all candidates. The approximation algorithm just needs to search within the confidence set via gradient methods to optimize the objective $\sum_{i=1}^{M}f^{(i)}(C_{t,i},A_{i})$. The starting point is $\hat{f}_{t}$, and the algorithm knows that it approaches the border of $\mathcal{F}_{t}$ when $\|\hat{f}_{t}-f\|_{2,E_{t}}^{2}$ approaches $\beta_{t}$. The implementation details are in Section 6.
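
The following sketch spells out one way such a gradient-based search could look, reusing the hypothetical MultiHeadValueNet from the ERM sketch above: starting from a copy of $\hat{f}_{t}$, it ascends the optimistic objective $\sum_{i}\max_{A}f^{(i)}(C_{t,i},A)$ while penalizing violations of $\|f-\hat{f}_{t}\|^{2}_{2,E_{t}}\leq\beta_{t}$. The penalty weight and step counts are illustrative; this is a penalty-based approximation of the constrained search, not the paper's exact procedure.

```python
import copy
import torch

def optimistic_step(f_hat, past_xs, past_task_ids, candidate_xs, beta_t,
                    lam=10.0, n_steps=200, lr=1e-2):
    """f_hat: a fitted MultiHeadValueNet (see the ERM sketch); past_xs and
    past_task_ids encode all previously played pairs x_{k,i}; candidate_xs[i]
    holds the encoded (context, action) pairs of task i's current action set."""
    f = copy.deepcopy(f_hat)                         # search starts at \hat f_t
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    idx = past_task_ids.view(-1, 1)
    with torch.no_grad():
        anchor = f_hat(past_xs).gather(1, idx)       # \hat f_t^{(i)}(x_{k,i})
    for _ in range(n_steps):
        # optimistic objective: best action value per task, summed over tasks
        objective = sum(f(xs_i)[:, i].max() for i, xs_i in enumerate(candidate_xs))
        # empirical 2-norm distance to \hat f_t on past data
        dist = ((f(past_xs).gather(1, idx) - anchor) ** 2).sum()
        # penalized surrogate for the constraint dist <= beta_t
        loss = -objective + lam * torch.clamp(dist - beta_t, min=0.0)
        opt.zero_grad(); loss.backward(); opt.step()
    # greedy actions under the optimistic f
    actions = [int(f(xs_i)[:, i].argmax()) for i, xs_i in enumerate(candidate_xs)]
    return f, actions
```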

Mechanism. The GFUCB algorithm handles exploration implicitly. For a context-action pair $\boldsymbol{x}=(C,A)$ in task $i$ that has not yet been fully understood and explored, the possible value estimate $f^{(i)}(\boldsymbol{x})$ varies over a large range under the constraint $\|f-\hat{f}_{t}\|_{2,E_{t}}^{2}\leq\beta_{t}$, because $\mathcal{F}_{t}$ contains many functions that agree on all past context-action pairs yet take different values on this $\boldsymbol{x}$. Therefore, the optimistic value $f^{(i)}(\boldsymbol{x})$ becomes high through a significant implicit bonus, encouraging the agent to try action $A$ under context $C$, which achieves natural exploration.

The reduction in sample complexity is achieved through joint training of the representation $\phi$. If we solved these tasks independently, the confidence set width $\beta_{t}$ would be of scale $M\log\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})$ because it would need to cover $M$ representation function spaces separately. By involving $\phi$ in the prediction for all tasks, our algorithm reduces the size of the confidence set by a factor of $M$, since samples from all tasks now contribute to learning the representation $\phi$. Usually $\log\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})$ is much greater than $k$ and $M$, hence our confidence set shrinks at a much faster speed. This explains how GFUCB achieves lower regret, since the sub-optimality at each step $t$ is proportional to the confidence set width $\beta_{t}$ whenever the true value function $f_{\theta}\in\mathcal{F}_{t}$.

4.3 Regret Bound

Based on the assumptions above, we have the regret guarantee as below.

Theorem 1. Based on Assumptions 1.1 to 1.4, denote the cumulative regret in $T$ steps as $\operatorname{Reg}(T)$. Then with probability at least $1-\delta$ we have

$$\operatorname{Reg}(T)=\tilde{O}\left(\sqrt{MdT\left(Mk+\log\mathcal{N}(\Phi,\alpha_{T},\|\cdot\|_{\infty})\right)}\right).$$

Here, $d:=\operatorname{dim}_{E}(\mathcal{F},\alpha_{T})$ is the eluder dimension of the value approximation function class $\mathcal{F}=\mathcal{L}\circ\Phi$, and $\alpha_{T}$ is a discretization scale that appears only in logarithmic terms and is thus omitted. The detailed proof is left to the appendix.

To the best of our knowledge, this is the first regret bound for general function class representation learning in contextual bandits. To get a sense of its sharpness, note that when $\Phi$ is specialized to the linear function class $\Phi=\{\phi(x)=\boldsymbol{B}\boldsymbol{x},\boldsymbol{B}\in\mathbb{R}^{k\times d}\}$, we have $\log\mathcal{N}(\Phi,\alpha_{T},\|\cdot\|_{\infty})=\tilde{O}(dk)$ and $\operatorname{dim}_{E}(\mathcal{F})=d$, so our bound reduces to $\tilde{O}(M\sqrt{dTk}+d\sqrt{MTk})$, which matches the current best provable regret bound for bandits with a linear representation class in hu2021near.
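
For completeness, the specialization follows by substituting $\log\mathcal{N}(\Phi,\alpha_{T},\|\cdot\|_{\infty})=\tilde{O}(dk)$ and $\operatorname{dim}_{E}(\mathcal{F})=d$ into Theorem 1 and using $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$:

$$\sqrt{MdT\left(Mk+\tilde{O}(dk)\right)}=\tilde{O}\left(\sqrt{M^{2}dTk+Md^{2}Tk}\right)\leq\tilde{O}\left(M\sqrt{dTk}+d\sqrt{MTk}\right).$$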

5 Main Results for MDP

5.1 Assumptions

For the multitask linear MDP setting, we adopt Assumption 3 from hu2021near, which generalizes the inherent Bellman error zanette2020learning to the multitask setting.

Assumption 2.1 (Low IBE for Multitask) The multitask IBE is defined as

$$\mathcal{I}_{h}^{\text{mul}}\stackrel{\text{def}}{=}\sup_{\left\{Q_{h+1}^{(i)}\right\}_{i=1}^{M}\in\mathcal{Q}_{h+1}}\inf_{\left\{Q_{h}^{(i)}\right\}_{i=1}^{M}\in\mathcal{Q}_{h}}\sup_{s\in\mathcal{S},a\in\mathcal{A},i\in[M]}\left|\left(Q_{h}^{(i)}-\mathcal{T}_{h}^{(i)}\left(Q_{h+1}^{(i)}\right)\right)(s,a)\right|.$$

We assume that $\mathcal{I}\stackrel{\mathrm{def}}{=}\sup_{h}\mathcal{I}_{h}^{\mathrm{mul}}$ is small for all $\mathcal{Q}_{h}$, $h\in[H]$.

Assumption 2.1 generalizes the low-IBE condition to the multitask setting. It assumes that for every task $i\in[M]$, its Q-value function space is approximately closed under the Bellman operator.

Assumption 2.2 (Parameter Regularization) We assume that

  • $\|\phi(s,a)\|\leq 1$ and $0\leq Q_{h}^{\pi}(s,a)\leq 1$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, $h\in[H]$, and all policies $\pi$.

  • There exists a constant $D$ such that $\|\boldsymbol{\theta}_{h}^{(i)}\|_{2}\leq D$ for any $h\in[H]$ and any $\boldsymbol{\theta}_{h}^{(i)}$.

  • For any fixed $\left\{Q_{h+1}^{(i)}\right\}_{i=1}^{M}\in\mathcal{Q}_{h+1}$, the random noise $z_{h}^{(i)}\stackrel{\mathrm{def}}{=}R_{h}^{(i)}(s,a)+\max_{a}Q_{h+1}^{(i)}(s^{\prime},a)-\mathcal{T}_{h}^{(i)}\left(Q_{h+1}^{(i)}\right)(s,a)$ is bounded in $[-1,1]$ and is independent of all other random variables, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, $h\in[H]$, $i\in[M]$.

These assumptions, which regularize the parameter, feature, and noise scales, are widely adopted in analytical works on linear MDPs zanette2020learning; hu2021near; lu2021power. Again, we add a bounded eluder dimension constraint for the Q-value estimation class.

Assumption 2.3 (Bounded Eluder Dimension). We assume that the function class $\mathcal{Q}_{h}$ has bounded eluder dimension $d$ for any $h\in[H]$.

5.2 Algorithm Details

Algorithm 2 Multitask Linear MDP Algorithm
1:  for episode $t:1\to T$ do
2:     $Q_{H+1}^{(i)}=0,\ i\in[M]$
3:     for $h:H\to 1$ do
4:        $\hat{\phi}_{h,t},\hat{\boldsymbol{\theta}}_{h,t}^{(i)}\leftarrow$ solving (1)
5:        $Q_{h}^{(i)}(\cdot,\cdot)=\hat{\phi}_{h,t}(\cdot,\cdot)^{\top}\hat{\boldsymbol{\theta}}_{h,t}^{(i)},\quad V_{h}^{(i)}(\cdot)=\max_{a}Q_{h}^{(i)}(\cdot,a)$
6:     end for
7:     for $h:1\to H$ do
8:        Compute $\mathcal{F}_{h,t}$ according to Lemma 4
9:        Receive states $\left\{s_{h,t}^{(i)}\right\}_{i=1}^{M}$, then $\tilde{f}_{h,t},a_{h,t}^{(i)}=\mathop{\mathrm{argmax}}_{f\in\mathcal{F}_{h,t},\ a^{(i)}\in\mathcal{A}}\sum_{i=1}^{M}f^{(i)}\left(s_{h,t}^{(i)},a^{(i)}\right)$
10:        Play $a_{h,t}^{(i)}$ and get reward $R_{h,t}^{(i)}$ for task $i\in[M]$.
11:     end for
12:  end for

The algorithm for multitask linear MDPs parallels the contextual bandit algorithm above. The optimization problem in line 4 of Algorithm 2 finds the empirically best solution for the Q-value estimate at level $h$ in episode $t$:

$$\hat{\phi}_{h,t},\hat{\boldsymbol{\Theta}}_{h,t}\leftarrow\mathop{\mathrm{argmin}}_{\phi\in\Phi,\,\boldsymbol{\Theta}=[\boldsymbol{\theta}^{(1)},\ldots,\boldsymbol{\theta}^{(M)}]}\mathcal{L}(\phi,\boldsymbol{\Theta})\qquad(1)$$
$$\text{s.t.}\quad\|\boldsymbol{\theta}^{(i)}\|\leq D,\ \forall i\in[M],\qquad 0\leq\phi(s,a)^{\top}\boldsymbol{\theta}^{(i)}\leq 1,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A},\ i\in[M],$$

where $\mathcal{L}(\phi,\boldsymbol{\Theta})$ is the empirical loss function defined as

$$\sum_{i=1}^{M}\sum_{j=1}^{t-1}\left(\phi\left(s^{(i)}_{h,j},a^{(i)}_{h,j}\right)^{\top}\boldsymbol{\theta}^{(i)}-R_{h,j}^{(i)}-V_{h+1}^{(i)}\left(s^{(i)}_{h+1,j}\right)\right)^{2}.$$

The framework of our algorithm resembles LSVI jin2019provably; lu2021power, which learns the Q-value estimates in reverse order: at each level $h$, the algorithm uses the just-learned value estimation function $V_{h+1}$ to build the regression target $R_{h,j}^{(i)}+V_{h+1}^{(i)}\left(s^{(i)}_{h+1,j}\right)$ and finds the empirically best estimate $\hat{f}_{h,t}^{(i)}=\hat{\phi}_{h,t}^{\top}\hat{\boldsymbol{\theta}}_{h,t}^{(i)}$ for each task $i\in[M]$. The optimistic value estimate of each action is again searched within the confidence set $\mathcal{F}_{h,t}$, which is centered at $\hat{f}_{h,t}$ and shrinks as the constraint $\|f-\hat{f}_{h,t}\|^{2}_{2,E_{t}}\leq\beta_{t}$ becomes increasingly tight. Note that the contextual bandit problem can be regarded as a 1-horizon MDP without transition dynamics, and our procedure at each level $h$ is indeed a copy of the procedure in Algorithm 1.
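
To make the backward recursion concrete, the sketch below implements lines 2-6 of Algorithm 2 in Python under the simplifying (and purely illustrative) assumption that the representation $\phi$ is fixed, so each per-level problem (1) reduces to a constrained least-squares regression per task; in the actual algorithm $\phi$ is also optimized over $\Phi$. The data layout, the crude norm projection, and all names are our own assumptions.

```python
import numpy as np

def backward_pass(phi, data, H, M, k, D):
    """Sketch of the level-wise regression in Algorithm 2, with phi held fixed.
    data[h][i] = (states, actions, rewards, next_states, action_set) collected
    for task i at level h (indexed h = 1..H); phi(s, a) returns a k-dim feature."""
    Theta = np.zeros((H + 2, M, k))          # Theta[h, i] approximates theta_h^(i)

    def V(h, i, s, action_set):
        if h == H + 1:
            return 0.0                        # Q_{H+1}^{(i)} = 0
        return max(phi(s, a) @ Theta[h, i] for a in action_set)

    for h in range(H, 0, -1):                 # h : H -> 1
        for i in range(M):
            states, actions, rewards, next_states, action_set = data[h][i]
            X = np.stack([phi(s, a) for s, a in zip(states, actions)])
            # regression target: R_{h,j}^{(i)} + V_{h+1}^{(i)}(s_{h+1,j}^{(i)})
            y = np.array([r + V(h + 1, i, s2, action_set)
                          for r, s2 in zip(rewards, next_states)])
            theta, *_ = np.linalg.lstsq(X, y, rcond=None)
            norm = np.linalg.norm(theta)
            # crude stand-in for the constraint ||theta^{(i)}|| <= D in (1)
            Theta[h, i] = theta if norm <= D else theta * (D / norm)
    return Theta
```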

5.3 Regret Bound

Based on Assumptions 2.1 to 2.3, we prove that our algorithm enjoys the regret bound guaranteed by the following theorem. The detailed proof is left to the appendix.

Theorem 2. Based on Assumptions 2.1 to 2.3, denote the cumulative regret in $T$ episodes as $\operatorname{Reg}(T)$. Then the following regret bound on $\operatorname{Reg}(T)$ holds for Algorithm 2 with probability at least $1-\delta$:

$$\tilde{O}\left(MH\sqrt{Tdk}+H\sqrt{MTd\log\mathcal{N}(\Phi,\alpha)}+MHT\mathcal{I}\sqrt{d}\right),$$

where $\alpha$ is a discretization scale smaller than $\frac{1}{kMT}$.

Remark. Compared with naively executing the single-task general value function approximation algorithm wang2020reinforcement on $M$ tasks separately, whose regret bound is $\tilde{O}(MHd\sqrt{T\log\mathcal{N}(\Phi)})$, our algorithm achieves the same average regret with a boost in sample efficiency of $\tilde{O}(Md)$. This benefit is mainly attributed to learning in the function space $\mathcal{F}^{\otimes M}=\mathcal{L}^{M}\circ\Phi$ instead of $\mathcal{F}^{M}=(\mathcal{L}\circ\Phi)^{M}$; the former is more compact and requires far fewer samples to learn.


6 Transfer Learning

7 Lower Bound

8 Experiments

8.1 Regret and Transfer for Bandits

8.2 Regret and Transfer for MDPs

9 Conclusion


