
Towards Understanding the Benefit of Multitask Representation Learning in Decision Process

Rui Lu (r-lu21@mails.tsinghua.edu.cn), Department of Automation, Tsinghua University
Andrew Zhao (zqc21@mails.tsinghua.edu.cn), Department of Automation, Tsinghua University
Simon Du (ssdu@cs.washington.com), Paul G. Allen School of Computer Science and Engineering, University of Washington
Gao Huang (gaohuang@tsinghua.edu.cn), Department of Automation, Tsinghua University
Abstract

While multitask representation learning has become a popular approach in reinforcement learning (RL) to boost sample efficiency, the theoretical understanding of why and how it works is still limited. Most previous analytical works could only assume that the representation function is already known to the agent or belongs to a linear function class, since analyzing representations from a general function class encounters non-trivial technical obstacles, such as establishing generalization guarantees and formulating confidence bounds in an abstract function space. However, the linear-case analysis relies heavily on the particularity of the linear function class, while real-world practice usually adopts general non-linear representation functions such as neural networks, which significantly reduces its applicability. In this work, we extend the analysis to general function class representations. Specifically, we consider an agent playing $M$ contextual bandits (or MDPs) concurrently and extracting a shared representation function $\phi$ from a specific function class $\Phi$ using our proposed Generalized Functional Upper Confidence Bound algorithm (GFUCB). We theoretically validate the benefit of multitask representation learning with general function classes for bandits and linear MDPs for the first time. Lastly, we conduct experiments to demonstrate the effectiveness of our algorithm with a neural network representation.

Keywords: Multitask, Representation Learning, Reinforcement Learning, Transfer Learning, Sample Complexity.

1 Introduction

2 Related Work

In the supervised learning setting, a line of work has studied multitask learning and representation learning under various assumptions baxter2000model; du2017hypothesis; ando2005framework; ben2003exploiting; maurer2006bounds; cavallanti2010linear; maurer2016benefit; du2020few; tripuraneni2020provable. These results assume that all tasks share a joint representation function. It is also worth mentioning that tripuraneni2020provable gave the method-of-moments estimator and built the confidence ball for the feature extractor, which inspired our algorithm for the infinite-action setting.

The benefit of representation learning has been studied in sequential decision-making problems, especially in RL domains. arora2020provable proved that representation learning can reduce the sample complexity of imitation learning. d'eramo2020sharing showed that representation learning can improve the convergence rate of the value iteration algorithm. Both require a probabilistic assumption similar to that in maurer2016benefit, and the statistical rates are of similar forms to those in maurer2016benefit. Following these works, we study a special class of MDPs called linear MDPs. The linear MDP yang2019sample; jin2019provably is a popular model in RL which uses linear function approximation to generalize over large state-action spaces. zanette2020learning extended the definition to MDPs with low inherent Bellman error (IBE for short). This model assumes that both the transition and the reward are near-linear in given features.

Recently, yang2021impact showed that multitask representation learning reduces the regret in linear bandits, using the framework developed by du2020few. Moreover, several works hu2021near; lu2021power; jin2019provably proved results on the benefit of multitask representation learning in RL with a generative model or a linear representation function. However, these works either restrict the representation function class to be linear, or assume the representation function is known to the agent. This is unrealistic in practice and limits the applicability of these results.

The most relevant line of work is value approximation with general function classes for bandits and MDPs. russo2013eluder first proposed the concept of the eluder dimension to measure the complexity of a function class and gave a regret bound for general function bandits using this dimension. wang2020reinforcement further proved that it can also be adopted in MDP problems. dong2021provable extended the analysis with sequential Rademacher complexity. Inspired by these works, we adopt the eluder dimension and develop our own analysis. It should be pointed out, however, that all those works focus on the single-task setting, giving a provable bound for just one MDP or bandit problem. They lack insight into why simultaneously dealing with multiple distinct but correlated tasks is more sample efficient. Our work aims to establish a framework to explain this. By locating the ground-truth value function in the multihead function space $\mathcal{F}^{\otimes M}$ (see the detailed definition in Section 4), we are able to theoretically explain the main reason for the boost in sample efficiency. Informally speaking, the shared feature extraction backbone $\phi$ receives samples from all the tasks, therefore accelerating convergence for every single task compared with solving them separately.

3 Preliminaries

3.1 Notations

We use $[n]$ to denote the set $\{1,2,\ldots,n\}$ and $\langle\cdot,\cdot\rangle$ to denote the inner product between two vectors. We write $f(x)=O(g(x))$ if $f(x)\leq C\cdot g(x)$ holds for all $x>x_{0}$ with some $C>0$ and $x_{0}>0$, and $f(x)=\tilde{O}(g(x))$ when logarithmic factors are ignored.

3.2 Multitask Contextual Bandits

We first study multitask representation learning in contextual bandits. Each task $i\in[M]$ is associated with an unknown function $f^{(i)}\in\mathcal{F}$ from a certain function class $\mathcal{F}$. At each step $t\in[T]$, the agent is given a context vector $C_{t,i}$ from a certain context space $\mathcal{C}$ and a set of actions $\mathcal{A}_{t,i}$ selected from a certain action space $\mathcal{A}$ for each task $i$. The agent needs to choose one action $A_{t,i}\in\mathcal{A}_{t,i}$ and then receives a reward $R_{t,i}=f^{(i)}(C_{t,i},A_{t,i})+\eta_{t,i}$, where $\eta_{t,i}$ is random noise sampled from an i.i.d. distribution. The agent's goal is to learn the functions $f^{(i)}$ and maximize the cumulative reward, or equivalently, minimize the total regret over all $M$ tasks in $T$ steps, defined below.

$$\operatorname{Reg}(T)\stackrel{\text{def}}{=}\sum_{t=1}^{T}\sum_{i=1}^{M}\left(f^{(i)}(C_{t,i},A_{t,i}^{\star})-f^{(i)}(C_{t,i},A_{t,i})\right),$$

where $A_{t,i}^{\star}=\arg\max_{A\in\mathcal{A}_{t,i}}f^{(i)}(C_{t,i},A)$ is the optimal action with respect to context $C_{t,i}$ in task $i$.
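
To make the interaction protocol concrete, the following minimal Python sketch simulates the multitask bandit loop and the regret bookkeeping above under a shared linear-in-$\phi$ reward model. The random-projection $\phi$, the placeholder uniform policy, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Minimal simulation of the multitask contextual bandit protocol; everything
# here (the random-projection phi, the uniform policy) is illustrative only.
rng = np.random.default_rng(0)
M, T, k, n_actions = 4, 100, 8, 10             # tasks, steps, feature dim, |A_{t,i}|

W_phi = rng.normal(size=(k, 2 * k))            # parameters of the "true" representation
thetas = rng.normal(size=(M, k)) / np.sqrt(k)  # task-specific parameters theta_i

def phi(context, action):
    # Stand-in for the unknown shared representation phi in Phi.
    return np.tanh(W_phi @ np.concatenate([context, action]))

total_regret = 0.0
for t in range(T):
    for i in range(M):
        context = rng.normal(size=k)                     # C_{t,i}
        action_set = rng.normal(size=(n_actions, k))     # A_{t,i}
        values = np.array([phi(context, a) @ thetas[i] for a in action_set])
        chosen = rng.integers(n_actions)                 # placeholder policy
        reward = values[chosen] + rng.normal()           # R_{t,i}, fed to the learner
        total_regret += values.max() - values[chosen]    # one regret summand
print(f"Reg(T) of the uniform policy: {total_regret:.2f}")
```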

3.3 Multitask MDP

Going beyond contextual bandits, we also study how this shared low-dimensional representation can benefit sequential decision-making problems such as the Markov Decision Process (MDP). In this work, we study the undiscounted episodic finite-horizon MDP problem. Consider an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},r,H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the transition dynamics, $r(\cdot,\cdot)$ is the reward function, and $H$ is the planning horizon. The agent starts from an initial state $s_{1}$, which can be either fixed or sampled from a certain distribution, and then interacts with the environment for $H$ rounds. In the single-task framework, at each round (also called level) $h$, the agent performs an action $a_{h}$ according to a policy function $a_{h}=\pi_{h}(s_{h})$. The agent then receives a reward $R_{h}(s_{h},a_{h})=r(s_{h},a_{h})+\eta_{h}$, where $\eta_{h}$ is again a noise term. The environment then transits from state $s_{h}$ to $s_{h+1}$ according to the distribution $\mathcal{P}(\cdot|s_{h},a_{h})$. The action value function of a policy $\pi$ is defined as $Q_{h}^{\pi}(s_{h},a_{h})=r(s_{h},a_{h})+\mathbb{E}\left[\sum_{t=h+1}^{H}R_{t}(s_{t},\pi_{t}(s_{t}))\right]$, and the state value function is defined as $V_{h}^{\pi}(s_{h})=Q_{h}^{\pi}(s_{h},\pi_{h}(s_{h}))$. Note that there always exists a deterministic optimal policy $\pi^{\star}$ for which $V_{h}^{\pi^{\star}}(s)=\max_{\pi}V_{h}^{\pi}(s)$ and $Q_{h}^{\pi^{\star}}(s,a)=\max_{\pi}Q_{h}^{\pi}(s,a)$; we denote them by $V_{h}^{\star}(s)$ and $Q_{h}^{\star}(s,a)$ for simplicity.

In the multitask setting, the agent receives a batch of states $\{s_{h,t}^{(i)}\}_{i=1}^{M}$ simultaneously from $M$ different MDP tasks $\{\mathcal{M}^{(i)}\}_{i=1}^{M}$ at each round $h$ of episode $t$, then performs a batch of actions $\{\pi_{t}^{i}(s_{h,t}^{(i)})\}_{i=1}^{M}$, one for each task $i\in[M]$. Every $H$ rounds form an episode, and the agent interacts with the environment for $T$ episodes in total. The goal of the agent is to minimize the regret defined as

$$\operatorname{Reg}(T)=\sum_{t=1}^{T}\sum_{i=1}^{M}V_{1}^{(i)\star}\left(s_{1,t}^{(i)}\right)-V_{1}^{\pi_{t}^{i}}\left(s_{1,t}^{(i)}\right),$$

where $V_{1}^{(i)\star}$ is the optimal value of task $i$ and $s_{1,t}^{(i)}$ is the initial state of task $i$ at episode $t$.

For the representation function to play a role, it is assumed that all tasks share the same state space $\mathcal{S}$ and action space $\mathcal{A}$. Moreover, there exists a representation function $\phi:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}^{k}$ such that the action and state value functions of every task $\mathcal{M}^{(i)}$ are always (approximately) linear in this representation. For example, given a representation function $\phi$, the action value approximation function at level $h$ is parametrized by a vector $\boldsymbol{\theta}_{h}\in\mathbb{R}^{k}$ as $Q_{h}[\phi,\boldsymbol{\theta}_{h}]\stackrel{\text{def}}{=}\langle\phi(s,a),\boldsymbol{\theta}_{h}\rangle$, and similarly $V_{h}[\phi,\boldsymbol{\theta}_{h}](s)\stackrel{\text{def}}{=}\max_{a}\langle\phi(s,a),\boldsymbol{\theta}_{h}\rangle$. We denote the set of all such action value functions as $\mathcal{Q}_{h}=\{Q_{h}[\phi,\boldsymbol{\theta}_{h}]:\phi\in\Phi,\boldsymbol{\theta}_{h}\in\mathbb{R}^{k}\}$, and the value function approximation space as $\mathcal{V}_{h}=\{V_{h}[\phi,\boldsymbol{\theta}_{h}]:\phi\in\Phi,\boldsymbol{\theta}_{h}\in\mathbb{R}^{k}\}$. Each task $\mathcal{M}^{(i)}$ is a linear MDP, which means $\mathcal{Q}_{h}$ is always approximately closed under the Bellman operator $\mathcal{T}_{h}(Q_{h+1})(s,a)\stackrel{\text{def}}{=}r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{h}(\cdot|s,a)}\max_{a^{\prime}}Q_{h+1}(s^{\prime},a^{\prime})$.

Linear MDP Definition. A finite-horizon MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},r,H)$ is a linear MDP if there exists a representation function $\phi:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}^{k}$ and its induced value approximation function classes $\mathcal{Q}_{h},h\in[H]$, such that the inherent Bellman error zanette2020learning

$$\mathcal{I}_{h}\stackrel{\text{def}}{=}\sup_{Q_{h+1}\in\mathcal{Q}_{h+1}}\inf_{Q_{h}\in\mathcal{Q}_{h}}\sup_{s\in\mathcal{S},a\in\mathcal{A}}\left|\left(Q_{h}-\mathcal{T}_{h}\left(Q_{h+1}\right)\right)(s,a)\right|$$

is always smaller than some small constant $\mathcal{I}$.

The definition essentially assumes that for any Q-value approximation function $Q_{h+1}\in\mathcal{Q}_{h+1}$ at level $h+1$, the Q-value function $Q_{h}$ at level $h$ induced by it can always be closely approximated within class $\mathcal{Q}_{h}$, which assures accuracy through sequential levels.
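
To illustrate the definition, the following Python sketch builds the Bellman backup $\mathcal{T}_{h}$ for a small synthetic tabular MDP and measures the sup-norm residual when a linear-in-$\phi$ class tries to fit the backup of one of its own members. All quantities (the random dynamics, features, and the least-squares fit) are illustrative assumptions; since least squares does not minimize the sup-norm, the printed residual only upper-bounds the inner infimum in the definition of $\mathcal{I}_{h}$.

```python
import numpy as np

# Sketch: upper-bounding the inherent Bellman error I_h for a synthetic
# tabular MDP and a linear-in-phi Q-class.  Everything here is illustrative.
rng = np.random.default_rng(1)
S, A, k = 20, 5, 6

P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'
r = rng.uniform(0.0, 0.5, size=(S, A))           # reward function
Phi = rng.normal(size=(S, A, k))                 # feature map phi(s, a)
Phi /= np.linalg.norm(Phi, axis=-1, keepdims=True)   # enforce ||phi(s, a)|| <= 1

def bellman_backup(Q_next):
    # (T_h Q_{h+1})(s, a) = r(s, a) + E_{s'}[ max_{a'} Q_{h+1}(s', a') ]
    return r + P @ Q_next.max(axis=1)

# Take an arbitrary Q_{h+1} in the class and measure how well the class Q_h
# can fit its backup.  The least-squares fit only upper-bounds the infimum of
# the sup-norm error over theta, so this is an upper bound on the residual.
theta_next = rng.normal(size=k) / np.sqrt(k)
target = bellman_backup(Phi @ theta_next)

X = Phi.reshape(S * A, k)
theta_fit, *_ = np.linalg.lstsq(X, target.reshape(-1), rcond=None)
ibe_upper = np.abs(X @ theta_fit - target.reshape(-1)).max()
print(f"sup-norm Bellman residual: {ibe_upper:.4f}")
```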

3.4 Eluder Dimension

To measure the complexity of a general function class $\mathcal{F}$, we adopt the concept of the eluder dimension russo2013eluder. We first define $\epsilon$-dependence and independence.

Definition 1 ($\epsilon$-dependent). An input $x$ is $\epsilon$-dependent on the set $X=\{x_{1},x_{2},\ldots,x_{n}\}$ with respect to function class $\mathcal{F}$ if any pair of functions $f,\tilde{f}\in\mathcal{F}$ satisfying

$$\sqrt{\sum_{i=1}^{n}(f(x_{i})-\tilde{f}(x_{i}))^{2}}\leq\epsilon$$

also satisfies $|f(x)-\tilde{f}(x)|\leq\epsilon$. Otherwise, we say the input $x$ is $\epsilon$-independent of the set $X$.

Intuitively, $\epsilon$-dependence captures the exhaustion of the interpolation flexibility of function class $\mathcal{F}$: given an unknown function $f$'s values on the set $X=\{x_{1},x_{2},\ldots,x_{n}\}$, we are able to pin down its value on the particular input $x$ with only an $\epsilon$-scale prediction error.

Definition 2 ($\epsilon$-eluder dimension). The $\epsilon$-eluder dimension $\operatorname{dim}_{E}(\mathcal{F},\epsilon)$ is the maximum length $d$ of a sequence of inputs $x_{1},x_{2},\ldots,x_{d}\in\mathcal{X}$ such that, for some $\epsilon^{\prime}\geq\epsilon$, every element is $\epsilon^{\prime}$-independent of its predecessors.

This definition parallels the dimension of a linear space, which is the maximum length of a sequence of vectors such that each one is linearly independent of its predecessors. For instance, if $\mathcal{F}=\{f:\mathbb{R}^{d}\mapsto\mathbb{R},\,f(x)=\theta^{\top}x\}$, then $\operatorname{dim}_{E}(\mathcal{F},\epsilon)=O(d\log 1/\epsilon)$, since the values on any $d$ linearly independent inputs fully determine a linear map. We also omit $\epsilon$ and write $\operatorname{dim}_{E}(\mathcal{F})$ when the dependence on $\epsilon$ is only logarithmic.
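
For intuition, here is a brute-force Python sketch of Definitions 1 and 2 for a small finite function class, where each function is stored as its vector of values on a finite input set. The greedy construction returns a lower bound on $\operatorname{dim}_{E}(\mathcal{F},\epsilon)$ (it fixes $\epsilon'=\epsilon$ and does not search over orderings); the class and all names are illustrative.

```python
import itertools
import numpy as np

# Brute-force sketch of the eluder dimension for a *finite* function class,
# following Definitions 1 and 2.  F is a matrix with F[j, x] = f_j(x).
def is_independent(x, prefix, F, eps):
    """x is eps-independent of `prefix` iff some pair (f, f~) agrees on the
    prefix up to eps (in empirical 2-norm) yet differs by more than eps at x."""
    for fa, fb in itertools.combinations(F, 2):
        close_on_prefix = np.sqrt(np.sum((fa[prefix] - fb[prefix]) ** 2)) <= eps
        if close_on_prefix and abs(fa[x] - fb[x]) > eps:
            return True
    return False

def eluder_dimension_lower_bound(F, n_inputs, eps):
    """Length of a greedily built sequence in which every input is
    eps-independent of its predecessors (a lower bound on dim_E(F, eps))."""
    sequence = []
    for x in range(n_inputs):
        if is_independent(x, sequence, F, eps):
            sequence.append(x)
    return len(sequence)

# Example: linear functions f_theta(x) = <theta, e_x> evaluated on d basis inputs.
d = 4
thetas = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=d)))
F = thetas  # F[j, x] = thetas[j, x], i.e. evaluation at the x-th basis vector
print(eluder_dimension_lower_bound(F, n_inputs=d, eps=0.5))  # prints 4 (= d)
```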

4 Main Results for Contextual Bandits

In this section, we present our theoretical analysis of the proposed GFUCB algorithm for contextual bandits.

4.1 Assumptions

This section lists the assumptions we make for our analysis. The main assumption is the existence of a shared feature extraction function from a class $\Phi=\{\phi:\mathcal{C}\times\mathcal{A}\mapsto\mathbb{R}^{k}\}$ such that every task's value function is linear in this $\phi$.

Assumption 1.1 (Shared Space and Representation) All the tasks share the same context space $\mathcal{C}$ and action space $\mathcal{A}$. Moreover, there exists a shared representation function $\phi\in\Phi$ and a set of $k$-dimensional parameters $\{\boldsymbol{\theta}_{i}\}_{i=1}^{M}$ such that each $f^{(i)}$ has the form $f^{(i)}(\cdot,\cdot)=\langle\phi(\cdot,\cdot),\boldsymbol{\theta}_{i}\rangle$.

Following standard regularity assumptions for bandits hu2021near; yang2021impact, we make assumptions on the noise distribution and the function parameters.

Assumption 1.2 (Conditional Sub-Gaussian Noise) Denote by $\mathcal{H}_{t,i}=\sigma(C_{1,i},A_{1,i},\ldots,C_{t,i},A_{t,i})$ the $\sigma$-field summarizing the history information available before the reward $R_{t,i}$ is observed for every task $i\in[M]$. The noise $\eta_{t,i}$ is sampled from a 1-sub-Gaussian distribution, namely $\mathbb{E}\left[\exp(\lambda\eta_{t,i})\mid\mathcal{H}_{t,i}\right]\leq\exp\left(\frac{\lambda^{2}}{2}\right)$ for all $\lambda\in\mathbb{R}$.

Assumption 1.3 (Bounded-Norm Feature and Parameter) We assume that the parameter $\boldsymbol{\theta}_{i}$ and the feature vector of any context-action pair $(C,A)\in\mathcal{C}\times\mathcal{A}$ are bounded by constants for each task $i\in[M]$, namely $\|\boldsymbol{\theta}_{i}\|_{2}\leq\sqrt{k}$ for all $i\in[M]$ and $\|\phi(C,A)\|_{2}\leq 1$ for all $C\in\mathcal{C},A\in\mathcal{A}$.

Apart from these, we add an assumption to measure and constrain the complexity of the value approximation function class $\mathcal{F}=\mathcal{L}\circ\Phi$.

Assumption 1.4 (Bounded Eluder Dimension). We assume that the function class $\mathcal{F}$ has bounded eluder dimension $d$, which means $\operatorname{dim}_{E}(\mathcal{F},\epsilon)=\tilde{O}(d)$ for any $\epsilon$.

4.2 Algorithm Details

Algorithm 1 Generalized Functional UCB Algorithm
1:  for step $t:1\to T$ do
2:     Compute $\mathcal{F}_{t}$ according to (*)
3:     Receive contexts $C_{t,i}$ and action sets $\mathcal{A}_{t,i}$, $i\in[M]$
4:     $f_{t},A_{t,i}=\mathop{\mathrm{argmax}}_{f\in\mathcal{F}_{t},\ A_{i}\in\mathcal{A}_{t,i}}\sum_{i=1}^{M}f^{(i)}(C_{t,i},A_{i})$
5:     Play $A_{t,i}$ for task $i$, and get reward $R_{t,i}$, for $i\in[M]$.
6:  end for

The details are given in Algorithm 1. At each step $t$, the algorithm first solves the optimization problem below to obtain the empirically optimal solution $\hat{f}_{t}$ that best predicts the rewards on the context-action pairs seen so far.

$$\hat{f}_{t}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}^{\otimes M}}\sum_{i=1}^{M}\sum_{k=1}^{t-1}\left(f^{(i)}(C_{k,i},A_{k,i})-R_{k,i}\right)^{2}.$$

Here we abuse notation and write $\mathcal{F}^{\otimes M}=\left\{f=\left(f^{(1)},\ldots,f^{(M)}\right):f^{(i)}(\cdot)=\phi(\cdot)^{\top}\boldsymbol{w}_{i}\in\mathcal{F}\right\}$ for the $M$-head prediction version of $\mathcal{F}$, parametrized by a shared representation function $\phi(\cdot)$ and a weight matrix $\boldsymbol{W}=[\boldsymbol{w}_{1},\ldots,\boldsymbol{w}_{M}]\in\mathbb{R}^{k\times M}$. We use $f^{(i)}$ to denote the $i$-th head of function $f$, which serves task $i$.
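
As a concrete (hypothetical) instantiation of this ERM step, the sketch below parametrizes $\mathcal{F}^{\otimes M}$ as a small neural trunk $\phi$ shared across tasks with $M$ linear heads $\boldsymbol{w}_{1},\ldots,\boldsymbol{w}_{M}$, and minimizes the summed squared loss over all tasks' past context-action pairs by gradient descent. The architecture, PyTorch usage, and hyperparameters are our own illustrative choices, not prescribed by the algorithm.

```python
import torch
import torch.nn as nn

# Sketch of the ERM step that produces \hat f_t: a shared trunk phi (a small
# MLP standing in for Phi) with M task-specific linear heads, fit by the
# summed squared loss over all tasks' past (context, action, reward) samples.
class MultiHeadValueNet(nn.Module):
    def __init__(self, input_dim, k, n_tasks):
        super().__init__()
        self.phi = nn.Sequential(                    # shared representation phi(.)
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, k))
        self.W = nn.Linear(k, n_tasks, bias=False)   # columns are w_1, ..., w_M

    def forward(self, x):                            # x: (batch, input_dim)
        return self.W(self.phi(x))                   # (batch, M): all heads at once

def fit_erm(net, xs, rewards, task_ids, n_steps=500, lr=1e-2):
    """Minimize sum_i sum_k (f^{(i)}(x_{k,i}) - R_{k,i})^2 by gradient descent."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_steps):
        preds = net(xs)                                        # (N, M)
        preds_i = preds.gather(1, task_ids.view(-1, 1)).squeeze(1)
        loss = ((preds_i - rewards) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return net

# Usage: xs encodes all past (context, action) pairs; task_ids[k] in {0..M-1}
# records which task each sample came from.
M, input_dim, k, N = 4, 16, 8, 256
net = MultiHeadValueNet(input_dim, k, M)
xs, task_ids, rewards = torch.randn(N, input_dim), torch.randint(0, M, (N,)), torch.randn(N)
fit_erm(net, xs, rewards, task_ids)
```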

After obtaining $\hat{f}_{t}$, we maintain a functional confidence set $\mathcal{F}_{t}\subseteq\mathcal{F}^{\otimes M}$ of possible value approximation functions

$$\mathcal{F}_{t}\stackrel{\text{def}}{=}\left\{f\in\mathcal{F}^{\otimes M}:\left\|\hat{f}_{t}-f\right\|^{2}_{2,E_{t}}\leq\beta_{t},\ |f^{(i)}(\boldsymbol{x})|\leq 1,\ \forall\boldsymbol{x}\in\mathcal{C}\times\mathcal{A},\ i\in[M]\right\}\qquad(*)$$

Here, for the sake of simplicity, we use $\|\hat{f}_{t}-f\|^{2}_{2,E_{t}}=\sum_{i=1}^{M}\sum_{k=1}^{t-1}\big(\hat{f}_{t}^{(i)}(\boldsymbol{x}_{k,i})-f^{(i)}(\boldsymbol{x}_{k,i})\big)^{2}$ to denote the empirical 2-norm of $\hat{f}_{t}-f=\big(\hat{f}_{t}^{(1)}-f^{(1)},\ldots,\hat{f}_{t}^{(M)}-f^{(M)}\big)$. Basically, (*) contains all functions in $\mathcal{F}^{\otimes M}$ whose value estimates on all collected context-action pairs $\boldsymbol{x}_{k,i}=(C_{k,i},A_{k,i})$ differ from those of the empirical loss minimizer $\hat{f}_{t}$ by at most a preset parameter $\beta_{t}$. We show that with high probability, the true value function $f_{\theta}$ is always contained in $\mathcal{F}_{t}$ when $\beta_{t}$ is carefully chosen as $\tilde{O}\big(Mk+\log\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})\big)$, where $\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})$ is the $\alpha$-covering number of the function class $\Phi$ in the sup-norm $\|\phi\|_{\infty}=\max_{\boldsymbol{x}\in\mathcal{C}\times\mathcal{A}}\|\phi(\boldsymbol{x})\|_{2}$, and $\alpha$ is set to a small value $\frac{1}{kMT}$ (see the detailed definition and proof in Lemma 1).

For the action choice, our algorithm follows the optimism principle of OFUL: it estimates each action's value with the most optimistic function value in the confidence set $\mathcal{F}_{t}$ and chooses the action whose optimistic value estimate is the highest. In the multitask setting, we choose one action from each task to form an action tuple $(A_{1},A_{2},\ldots,A_{M})$ such that the sum of optimistic value estimates $\sum_{i=1}^{M}f^{(i)}(C_{t,i},A_{i})$ is maximized by some function $f\in\mathcal{F}_{t}$.

Intractability. Some may have concerns about the intractability of constructing the confidence set (*) and solving the optimization problems for $\hat{f}_{t}$, $f_{t}$, and $A_{t,i}$. The answer is twofold. From the theoretical perspective, since the focus of the problem is sample complexity rather than computational complexity, a computational oracle can simply be assumed to return the solution of the optimization. This is common practice in theoretical works jin2021bellman; sun2018model; agarwal2014taming; jiang2017contextual that focus on sample complexity analysis. From the empirical perspective, the optimization can often be carried out with gradient methods. For example, solving for $\hat{f}_{t}$ is a standard empirical risk minimization problem and can be solved effectively with gradient methods du2019gradient. As for $f_{t}$ and $A_{t,i}$, note that it is not necessary to explicitly build the confidence set $\mathcal{F}_{t}$ by listing all candidates. The approximation algorithm just needs to search within the confidence set via gradient methods to optimize the objective $\sum_{i=1}^{M}f^{(i)}(C_{t,i},A_{i})$. The starting point is $\hat{f}_{t}$, and the algorithm knows that it approaches the border of $\mathcal{F}_{t}$ when $\|\hat{f}_{t}-f\|_{2,E_{t}}^{2}$ approaches $\beta_{t}$. The implementation details are in Section 6.
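
The following sketch spells out one way such a gradient-based search could look, reusing the hypothetical MultiHeadValueNet from the ERM sketch above: starting from a copy of $\hat{f}_{t}$, it ascends the optimistic objective $\sum_{i}\max_{A}f^{(i)}(C_{t,i},A)$ while penalizing violations of $\|f-\hat{f}_{t}\|^{2}_{2,E_{t}}\leq\beta_{t}$. The penalty weight and step counts are illustrative; this is a penalty-based approximation of the constrained search, not the paper's exact procedure.

```python
import copy
import torch

def optimistic_step(f_hat, past_xs, past_task_ids, candidate_xs, beta_t,
                    lam=10.0, n_steps=200, lr=1e-2):
    """f_hat: a fitted MultiHeadValueNet (see the ERM sketch); past_xs and
    past_task_ids encode all previously played pairs x_{k,i}; candidate_xs[i]
    holds the encoded (context, action) pairs of task i's current action set."""
    f = copy.deepcopy(f_hat)                         # search starts at \hat f_t
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    idx = past_task_ids.view(-1, 1)
    with torch.no_grad():
        anchor = f_hat(past_xs).gather(1, idx)       # \hat f_t^{(i)}(x_{k,i})
    for _ in range(n_steps):
        # optimistic objective: best action value per task, summed over tasks
        objective = sum(f(xs_i)[:, i].max() for i, xs_i in enumerate(candidate_xs))
        # empirical 2-norm distance to \hat f_t on past data
        dist = ((f(past_xs).gather(1, idx) - anchor) ** 2).sum()
        # penalized surrogate for the constraint dist <= beta_t
        loss = -objective + lam * torch.clamp(dist - beta_t, min=0.0)
        opt.zero_grad(); loss.backward(); opt.step()
    # greedy actions under the optimistic f
    actions = [int(f(xs_i)[:, i].argmax()) for i, xs_i in enumerate(candidate_xs)]
    return f, actions
```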

Mechanism. The GFUCB algorithm handles exploration implicitly. For a context-action pair $\boldsymbol{x}=(C,A)$ in task $i$ that has not yet been fully understood and explored, the possible value estimate $f^{(i)}(\boldsymbol{x})$ varies over a large range under the constraint $\|f-\hat{f}_{t}\|_{2,E_{t}}^{2}\leq\beta_{t}$, because $\mathcal{F}_{t}$ contains many functions that agree on all past context-action pairs yet take different values on this $\boldsymbol{x}$. Therefore, the optimistic value $f^{(i)}(\boldsymbol{x})$ becomes high through a significant implicit bonus, encouraging the agent to try action $A$ under context $C$, which achieves natural exploration.

The reduction in sample complexity is achieved through joint training of the representation $\phi$. If we solved these tasks independently, the confidence set width $\beta_{t}$ would be of scale $M\log\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})$ because it would need to cover $M$ representation function spaces separately. By involving $\phi$ in the prediction for all tasks, our algorithm reduces the size of the confidence set by a factor of $M$, since samples from all tasks now contribute to learning the representation $\phi$. Usually $\log\mathcal{N}(\Phi,\alpha,\|\cdot\|_{\infty})$ is much greater than $k$ and $M$, hence our confidence set shrinks at a much faster speed. This explains how GFUCB achieves lower regret, since the sub-optimality at each step $t$ is proportional to the confidence set width $\beta_{t}$ whenever the true value function $f_{\theta}\in\mathcal{F}_{t}$.

4.3 Regret Bound

Based on the assumptions above, we have the regret guarantee as below.

Theorem 1. Based on Assumptions 1.1 to 1.4, denote the cumulative regret in $T$ steps as $\operatorname{Reg}(T)$. Then with probability at least $1-\delta$ we have

$$\operatorname{Reg}(T)=\tilde{O}\left(\sqrt{MdT\left(Mk+\log\mathcal{N}(\Phi,\alpha_{T},\|\cdot\|_{\infty})\right)}\right).$$

Here, $d:=\operatorname{dim}_{E}(\mathcal{F},\alpha_{T})$ is the eluder dimension of the value approximation function class $\mathcal{F}=\mathcal{L}\circ\Phi$, and $\alpha_{T}$ is a discretization scale that appears only in logarithmic terms and is thus omitted. The detailed proof is left to the appendix.

To the best of our knowledge, this is the first regret bound for general function class representation learning in contextual bandits. To get a sense of its sharpness, note that when $\Phi$ is specialized to the linear function class $\Phi=\{\phi(x)=\boldsymbol{B}\boldsymbol{x},\boldsymbol{B}\in\mathbb{R}^{k\times d}\}$, we have $\log\mathcal{N}(\Phi,\alpha_{T},\|\cdot\|_{\infty})=\tilde{O}(dk)$ and $\operatorname{dim}_{E}(\mathcal{F})=d$, so our bound reduces to $\tilde{O}(M\sqrt{dTk}+d\sqrt{MTk})$, which matches the current best provable regret bound for bandits with a linear representation class in hu2021near.
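
For completeness, the specialization follows by substituting $\log\mathcal{N}(\Phi,\alpha_{T},\|\cdot\|_{\infty})=\tilde{O}(dk)$ and $\operatorname{dim}_{E}(\mathcal{F})=d$ into Theorem 1 and using $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$:

$$\sqrt{MdT\left(Mk+\tilde{O}(dk)\right)}=\tilde{O}\left(\sqrt{M^{2}dTk+Md^{2}Tk}\right)\leq\tilde{O}\left(M\sqrt{dTk}+d\sqrt{MTk}\right).$$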

5 Main Results for MDP

5.1 Assumptions

For the multitask linear MDP setting, we adopt Assumption 3 from hu2021near, which generalizes the inherent Bellman error zanette2020learning to the multitask setting.

Assumption 2.1 (Low IBE for Multitask) The multitask IBE is defined as

$$\mathcal{I}_{h}^{\text{mul}}\stackrel{\text{def}}{=}\sup_{\left\{Q_{h+1}^{(i)}\right\}_{i=1}^{M}\in\mathcal{Q}_{h+1}}\inf_{\left\{Q_{h}^{(i)}\right\}_{i=1}^{M}\in\mathcal{Q}_{h}}\sup_{s\in\mathcal{S},a\in\mathcal{A},i\in[M]}\left|\left(Q_{h}^{(i)}-\mathcal{T}_{h}^{(i)}\left(Q_{h+1}^{(i)}\right)\right)(s,a)\right|.$$

We assume that $\mathcal{I}\stackrel{\mathrm{def}}{=}\sup_{h}\mathcal{I}_{h}^{\mathrm{mul}}$ is small for all $\mathcal{Q}_{h}$, $h\in[H]$.

Assumption 2.1 generalizes the low-IBE condition to the multitask setting. It assumes that for every task $i\in[M]$, its Q-value function space is approximately closed under the Bellman operator.

Assumption 2.2 (Parameter Regularization) We assume that

  • $\|\phi(s,a)\|\leq 1$ and $0\leq Q_{h}^{\pi}(s,a)\leq 1$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, $h\in[H]$, and all policies $\pi$.

  • There exists a constant $D$ such that $\|\boldsymbol{\theta}_{h}^{(i)}\|_{2}\leq D$ for any $h\in[H]$ and any $\boldsymbol{\theta}_{h}^{(i)}$.

  • For any fixed $\left\{Q_{h+1}^{(i)}\right\}_{i=1}^{M}\in\mathcal{Q}_{h+1}$, the random noise $z_{h}^{(i)}\stackrel{\mathrm{def}}{=}R_{h}^{(i)}(s,a)+\max_{a}Q_{h+1}^{(i)}(s^{\prime},a)-\mathcal{T}_{h}^{(i)}\left(Q_{h+1}^{(i)}\right)(s,a)$ is bounded in $[-1,1]$ and is independent of all other random variables, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, $h\in[H]$, $i\in[M]$.

These assumptions, which regularize the parameter, feature, and noise scales, are widely adopted in analytical works on linear MDPs zanette2020learning; hu2021near; lu2021power. Again, we add a bounded eluder dimension constraint for the Q-value estimation class.

Assumption 2.3 (Bounded Eluder Dimension). We assume that the function class $\mathcal{Q}_{h}$ has bounded eluder dimension $d$ for any $h\in[H]$.

5.2 Algorithm Details

Algorithm 2 Multitask Linear MDP Algorithm
1:  for episode $t:1\to T$ do
2:     $Q_{H+1}^{(i)}=0,\ i\in[M]$
3:     for $h:H\to 1$ do
4:        $\hat{\phi}_{h,t},\hat{\boldsymbol{\theta}}_{h,t}^{(i)}\leftarrow$ solving (1)
5:        $Q_{h}^{(i)}(\cdot,\cdot)=\hat{\phi}_{h,t}(\cdot,\cdot)^{\top}\hat{\boldsymbol{\theta}}_{h,t}^{(i)},\quad V_{h}^{(i)}(\cdot)=\max_{a}Q_{h}^{(i)}(\cdot,a)$
6:     end for
7:     for $h:1\to H$ do
8:        Compute $\mathcal{F}_{h,t}$ according to Lemma 4
9:        Receive states $\left\{s_{h,t}^{(i)}\right\}_{i=1}^{M}$, then $\tilde{f}_{h,t},a_{h,t}^{(i)}=\mathop{\mathrm{argmax}}_{f\in\mathcal{F}_{h,t},\ a^{(i)}\in\mathcal{A}}\sum_{i=1}^{M}f^{(i)}\left(s_{h,t}^{(i)},a^{(i)}\right)$
10:        Play $a_{h,t}^{(i)}$ and get reward $R_{h,t}^{(i)}$ for task $i\in[M]$.
11:     end for
12:  end for

The algorithm for multitask linear MDPs parallels the contextual bandit algorithm above. The optimization problem in line 4 of Algorithm 2 finds the empirically best solution for the Q-value estimate at level $h$ in episode $t$:

$$\hat{\phi}_{h,t},\hat{\boldsymbol{\Theta}}_{h,t}\leftarrow\mathop{\mathrm{argmin}}_{\phi\in\Phi,\,\boldsymbol{\Theta}=[\boldsymbol{\theta}^{(1)},\ldots,\boldsymbol{\theta}^{(M)}]}\mathcal{L}(\phi,\boldsymbol{\Theta})\qquad(1)$$
$$\text{s.t.}\quad\|\boldsymbol{\theta}^{(i)}\|\leq D,\ \forall i\in[M],\qquad 0\leq\phi(s,a)^{\top}\boldsymbol{\theta}^{(i)}\leq 1,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A},\ i\in[M],$$

where $\mathcal{L}(\phi,\boldsymbol{\Theta})$ is the empirical loss function defined as

$$\sum_{i=1}^{M}\sum_{j=1}^{t-1}\left(\phi\left(s^{(i)}_{h,j},a^{(i)}_{h,j}\right)^{\top}\boldsymbol{\theta}^{(i)}-R_{h,j}^{(i)}-V_{h+1}^{(i)}\left(s^{(i)}_{h+1,j}\right)\right)^{2}.$$

The framework of our algorithm resembles LSVI jin2019provably; lu2021power, which learns the Q-value estimates in reverse order: at each level $h$, the algorithm uses the just-learned value estimation function $V_{h+1}$ to build the regression target $R_{h,j}^{(i)}+V_{h+1}^{(i)}\left(s^{(i)}_{h+1,j}\right)$ and finds the empirically best estimate $\hat{f}_{h,t}^{(i)}=\hat{\phi}_{h,t}^{\top}\hat{\boldsymbol{\theta}}_{h,t}^{(i)}$ for each task $i\in[M]$. The optimistic value estimate of each action is again searched within the confidence set $\mathcal{F}_{h,t}$, which is centered at $\hat{f}_{h,t}$ and shrinks as the constraint $\|f-\hat{f}_{h,t}\|^{2}_{2,E_{t}}\leq\beta_{t}$ becomes increasingly tight. Note that the contextual bandit problem can be regarded as a 1-horizon MDP without transition dynamics, and our procedure at each level $h$ is indeed a copy of the procedure in Algorithm 1.
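
To make the backward recursion concrete, the sketch below implements lines 2-6 of Algorithm 2 in Python under the simplifying (and purely illustrative) assumption that the representation $\phi$ is fixed, so each per-level problem (1) reduces to a constrained least-squares regression per task; in the actual algorithm $\phi$ is also optimized over $\Phi$. The data layout, the crude norm projection, and all names are our own assumptions.

```python
import numpy as np

def backward_pass(phi, data, H, M, k, D):
    """Sketch of the level-wise regression in Algorithm 2, with phi held fixed.
    data[h][i] = (states, actions, rewards, next_states, action_set) collected
    for task i at level h (indexed h = 1..H); phi(s, a) returns a k-dim feature."""
    Theta = np.zeros((H + 2, M, k))          # Theta[h, i] approximates theta_h^(i)

    def V(h, i, s, action_set):
        if h == H + 1:
            return 0.0                        # Q_{H+1}^{(i)} = 0
        return max(phi(s, a) @ Theta[h, i] for a in action_set)

    for h in range(H, 0, -1):                 # h : H -> 1
        for i in range(M):
            states, actions, rewards, next_states, action_set = data[h][i]
            X = np.stack([phi(s, a) for s, a in zip(states, actions)])
            # regression target: R_{h,j}^{(i)} + V_{h+1}^{(i)}(s_{h+1,j}^{(i)})
            y = np.array([r + V(h + 1, i, s2, action_set)
                          for r, s2 in zip(rewards, next_states)])
            theta, *_ = np.linalg.lstsq(X, y, rcond=None)
            norm = np.linalg.norm(theta)
            # crude stand-in for the constraint ||theta^{(i)}|| <= D in (1)
            Theta[h, i] = theta if norm <= D else theta * (D / norm)
    return Theta
```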

5.3 Regret Bound

Based on Assumptions 2.1 to 2.3, we prove that our algorithm enjoys the regret bound guaranteed by the following theorem. The detailed proof is left to the appendix.

Theorem 2. Based on Assumptions 2.1 to 2.3, denote the cumulative regret in $T$ episodes as $\operatorname{Reg}(T)$. Then the following regret bound on $\operatorname{Reg}(T)$ holds for Algorithm 2 with probability at least $1-\delta$:

$$\tilde{O}\left(MH\sqrt{Tdk}+H\sqrt{MTd\log\mathcal{N}(\Phi,\alpha)}+MHT\mathcal{I}\sqrt{d}\right),$$

where $\alpha$ is a discretization scale smaller than $\frac{1}{kMT}$.

Remark. Compared with naively executing the single-task general value function approximation algorithm wang2020reinforcement on $M$ tasks separately, whose regret bound is $\tilde{O}(MHd\sqrt{T\log\mathcal{N}(\Phi)})$, our algorithm achieves the same average regret with a boost in sample efficiency of $\tilde{O}(Md)$. This benefit is mainly attributed to learning in the function space $\mathcal{F}^{\otimes M}=\mathcal{L}^{M}\circ\Phi$ instead of $\mathcal{F}^{M}=(\mathcal{L}\circ\Phi)^{M}$; the former is more compact and requires far fewer samples to learn.


6 Transfer Learning

7 Lower Bound

8 Experiments

8.1 Regret and Transfer for Bandits

8.2 Regret and Transfer for MDPs

9 Conclusion


