
Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient

Yuda Song (Carnegie Mellon University, yudas@cs.cmu.edu), Yifei Zhou (Cornell University, yz639@cornell.edu), Ayush Sekhari (MIT, sekhari@mit.edu), J. Andrew Bagnell (Carnegie Mellon University, dbagnell@aurora.tech), Akshay Krishnamurthy (Microsoft Research, akshaykr@microsoft.com), Wen Sun (Cornell University, ws455@cornell.edu). The first two authors contributed equally to the paper.
Abstract

We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s Revenge.

1 Introduction

Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online RL agents are also notoriously sample inefficient, often requiring billions of environment interactions to achieve suitable performance. This issue is particularly salient when the environment requires sophisticated exploration and a high quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence, the practical success of online RL and related policy gradient/improvement methods has been largely restricted to settings where a high quality simulator is available.

To overcome the issue of sample inefficiency, attention has turned to the offline RL setting [Levine et al., 2020], where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in some other manner (e.g., by a system running in production or an expert). While these methods still require a large dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without compromising system performance. However, offline RL methods can suffer from distribution shift, where the state distribution induced by the learned policy differs significantly from the offline distribution [Wang et al., 2021]. Existing provable approaches for addressing distribution shift are computationally intractable, while empirical approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see).

Figure 1: Performance of our approach Hy-Q on Montezuma's Revenge. We consider three types of offline datasets: easy, medium, and hard. The easy one contains offline data from a high-quality policy (denoted Expert in the figure), the medium one consists of 20% data from a random policy and 80% from the expert, and the hard one consists of half random data and half expert data. All offline datasets have 100k tuples of state, action, reward, and next state (no trajectory-level information is available to the learners). With only 0.1m offline samples, our approach Hy-Q learns nearly 10x faster than Random Network Distillation (RND), an online deep RL baseline designed for Montezuma's Revenge (see Figure 6 for a visual comparison using screenshots of the game). We also note that Conservative Q-Learning (CQL), an offline RL baseline, fails completely on all three offline datasets, indicating the hardness of offline policy learning from such small offline datasets. The imitation learning baseline Behavior Cloning (BC) achieves reasonable performance on the easy and medium datasets, but fails completely on the hard dataset. See Section 6.2 for a more detailed comparison to CQL, BC, and other baselines.

In this paper, we focus on a hybrid reinforcement learning setting, which we call Hybrid RL, that draws on the favorable properties of both offline and online settings. In Hybrid RL, the agent has both an offline dataset and the ability to interact with the environment, as in the traditional online RL setting. The offline dataset helps address the exploration challenge, allowing us to greatly reduce the number of interactions required. Simultaneously, we can identify and correct distribution shift issues via online interaction. Variants of the setting have been studied in a number of empirical works [Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017] which mainly focus on using expert demonstrations as offline data.

Hybrid RL is closely related to the reset setting, where the agent can interact with the environment starting from a “nice” distribution. A number of simple and effective algorithms, including CPI [Kakade and Langford, 2002], PSDP [Bagnell et al., 2003], and policy gradient methods [Kakade, 2001, Agarwal et al., 2020b] – methods which have inspired recent powerful heuristic RL methods such as TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017]— are provably efficient in the reset setting. Yet, leveraging a reset distribution is a strong requirement (often tantamount to having access to a detailed simulation) and unlikely to be available in real world applications. Hybrid RL differs from the reset setting in that (a) we have an offline dataset, but (b) our online interactions must come from traces that start at the initial state distribution of the environment, and the initial state distribution is not assumed to have any nice properties. Both features (offline data and a nice reset distribution) facilitate algorithm design by de-emphasizing the exploration challenge. However, Hybrid RL is more practical since an offline dataset is much easier to access, while a nice reset distribution or even generative model is generally not guaranteed in practice.

We showcase the Hybrid RL setting with a new algorithm, Hybrid Q-Learning or Hy-Q (pronounced: Haiku). The algorithm is a simple adaptation of the classical fitted Q-iteration algorithm (FQI) and accommodates value-based function approximation in a relatively general setup. (We use Q-learning and Q-iteration interchangeably, although they are not, strictly speaking, the same algorithm. Our theoretical results analyze Q-iteration, but our experiments use an algorithm with an online/mini-batch flavor that is closer to Q-learning.) For our theoretical results, we prove that Hy-Q is both statistically and computationally efficient assuming that: (1) the offline distribution covers some high-quality policy, (2) the MDP has low bilinear rank, (3) the function approximator is Bellman complete, and (4) we have a least squares regression oracle. The first three assumptions are standard statistical assumptions in the RL literature, while the fourth is a widely used computational abstraction for supervised learning. No computationally efficient algorithms are known under these assumptions in pure offline or pure online settings, which highlights the advantages of the hybrid setting.

We also implement Hy-Q and evaluate it on two challenging RL benchmarks: a rich observation combination lock [Misra et al., 2020] and Montezuma’s Revenge from the Arcade Learning Environment [Bellemare et al., 2013]. Starting with an offline dataset that contains some transitions from a high quality policy, our approach outperforms: an online RL baseline with theoretical guarantees, an online deep RL baseline tuned for Montezuma’s Revenge, a pure offline RL baseline, an imitation learning baseline, and an existing hybrid method. Compared to the online methods, Hy-Q requires only a small fraction of the online experience, demonstrating its sample efficiency (e.g., Figure 1). Compared to the offline and hybrid methods, Hy-Q performs most favorably when the offline dataset also contains many interactions from low quality policies, demonstrating its robustness. These results reveal the significant benefits that can be realized by combining offline and online data.

2 Related Works

We discuss related work from four categories: pure online RL, online RL with access to a reset distribution, offline RL, and prior work in hybrid settings. We note that pure online RL refers to the setting where one can only reset the system to the initial state distribution of the environment, which is not assumed to provide any form of coverage.

Pure online RL

Beyond tabular settings, many existing statistically efficient RL algorithms are not computationally tractable, due to the difficulty of implementing optimism. This is true in the linear MDP [Jin et al., 2020] with large action spaces, the linear Bellman complete model [Zanette et al., 2020, Agarwal et al., 2019], and in the general function approximation setting [Jiang et al., 2017, Sun et al., 2019, Du et al., 2021, Jin et al., 2021a]. These computational challenges have inspired results on the intractability of aspects of online RL [Dann et al., 2018, Kane et al., 2022]. On the other hand, many simple exploration-based algorithms like $\epsilon$-greedy are computationally efficient, but they may not always work well in practice. Recent theoretical works [Dann et al., 2022, Liu and Brunskill, 2018] have explored additional structural assumptions on the underlying dynamics and value function class under which $\epsilon$-greedy succeeds, but these still do not capture all relevant practical problems.

There are several online RL algorithms that aim to tackle the computational issue via stronger structural assumptions and supervised learning-style computational oracles [Misra et al., 2020, Sekhari et al., 2021, Zhang et al., 2022c, Agarwal et al., 2020a, Uehara et al., 2021, Modi et al., 2021, Zhang et al., 2022a, Qiu et al., 2022]. Compared to these oracle-based methods, our approach operates in the more general “bilinear rank” setting and relies on a standard supervised learning primitive: least squares regression. Notably, our oracle admits efficient implementation with linear function approximation, so we obtain an end-to-end computational guarantee; this is not true for prior oracle-based methods.

There are many deep RL methods for the online setting (e.g., Schulman et al. [2015, 2017], Lillicrap et al. [2016], Haarnoja et al. [2018], Schrittwieser et al. [2020]). Apart from a few exceptions (e.g., Burda et al. [2018], Badia et al. [2020], Guo et al. [2022]), most rely on random exploration (e.g., $\epsilon$-greedy) and are not capable of strategic exploration. In fact, guarantees for $\epsilon$-greedy-like algorithms only exist under additional structural assumptions on the underlying problem.

In our experiments, we test our approach on Montezuma's Revenge and pick Rnd [Burda et al., 2018] as the deep RL exploration baseline due to its simplicity and effectiveness on this game.

Online RL with reset distributions

When an exploratory reset distribution is available, a number of statistically and computationally efficient algorithms are known. The classic algorithms are Cpi [Kakade and Langford, 2002], Psdp [Bagnell et al., 2003], Natural Policy Gradient [Kakade, 2001, Agarwal et al., 2020b], and Politex [Abbasi-Yadkori et al., 2019]. Uchendu et al. [2022] recently demonstrated that algorithms like Psdp work well when equipped with modern neural network function approximators. However, these algorithms (and their analyses) rely heavily on the reset distribution to mitigate the exploration challenge, and such a reset distribution is typically unavailable in practice unless one also has a simulator and access to its internal states. In contrast, we assume the offline data covers some high-quality policy (it need not be globally exploratory), which helps with exploration, but we do not require an exploratory reset distribution. This makes the hybrid setting much more practically appealing.

Offline RL

Offline RL methods learn policies solely from a given offline dataset, with no interaction whatsoever. When the dataset has global coverage, algorithms such as FQI [Munos and Szepesvári, 2008, Chen and Jiang, 2019] or certainty-equivalence model learning [Ross and Bagnell, 2012] can find near-optimal policies in an oracle-efficient manner, via least squares or model-fitting oracles. However, with only partial coverage, existing methods either (a) are not computationally efficient due to the difficulty of implementing pessimism, both in linear settings with large action spaces [Jin et al., 2021b, Zhang et al., 2022b, Chang et al., 2021] and in general function approximation settings [Uehara and Sun, 2021, Xie et al., 2021a, Jiang and Huang, 2020, Chen and Jiang, 2022, Zhan et al., 2022], or (b) require strong representation conditions such as policy-based Bellman completeness [Xie et al., 2021a, Zanette et al., 2021]. In contrast, in the hybrid setting, we obtain an efficient algorithm under the more natural condition of completeness w.r.t. the Bellman optimality operator only.

Among the many empirical offline RL methods (e.g., Kumar et al. [2020], Yu et al. [2021], Kostrikov et al. [2021], Fujimoto and Gu [2021]), we use Cql [Kumar et al., 2020] as a baseline in our experiments, since it has been shown to work in image-based settings such as Atari games.

Online RL with offline datasets

Ross and Bagnell [2012] developed a model-based algorithm for a similar hybrid setting. In comparison, our approach is model-free and consequently may be more suitable for high-dimensional state spaces (e.g., raw-pixel images). Xie et al. [2021b] studied hybrid RL and showed that offline data does not yield statistical improvements in tabular MDPs. Our work instead focuses on the function approximation setting and demonstrates computational benefits of hybrid RL.

On the empirical side, several works consider combining offline expert demonstrations with online interaction [Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017]. A common challenge in this setting is robustness to low-quality offline data. Previous works mostly focus on expert demonstrations and offer no rigorous guarantees of such robustness. In fact, Nair et al. [2020] showed that such performance degradation indeed happens in practice with low-quality offline data. In our experiments, we observe that DQfD [Hester et al., 2018] exhibits a similar degradation. Our algorithm, on the other hand, is robust to the quality of the offline data. Note that the core idea of our algorithm is similar to that of Vecerik et al. [2017], who adapt DDPG to the setting of combining RL with expert demonstrations for continuous control. Although Vecerik et al. [2017] do not provide any theoretical results, it may be possible to combine our theoretical insights with existing analyses for policy gradient methods to establish guarantees for their algorithm in the hybrid RL setting. We also include a detailed comparison with previous empirical work in Appendix D.

3 Preliminaries

We consider a finite-horizon Markov decision process $M({\mathcal{S}},\mathcal{A},H,R,P,d_{0})$, where ${\mathcal{S}}$ is the state space, $\mathcal{A}$ is the action space, $H$ denotes the horizon, stochastic rewards $R(s,a)\in\Delta([0,1])$ and $P(s,a)\in\Delta({\mathcal{S}})$ are the reward and transition distributions at $(s,a)$, and $d_{0}\in\Delta({\mathcal{S}})$ is the initial distribution. We assume the agent can only reset from $d_{0}$ (at the beginning of each episode). Since the optimal policy is non-stationary in this setting, we define a policy $\pi:=\{\pi_{0},\dots,\pi_{H-1}\}$ where $\pi_{h}:{\mathcal{S}}\mapsto\Delta(\mathcal{A})$. Given $\pi$, $d^{\pi}_{h}\in\Delta({\mathcal{S}}\times\mathcal{A})$ denotes the state-action occupancy induced by $\pi$ at step $h$. We define the state and state-action value functions in the usual manner: $V^{\pi}_{h}(s)=\mathbb{E}[\sum_{\tau=h}^{H-1}r_{\tau}|\pi,s_{h}=s]$ and $Q^{\pi}_{h}(s,a)=\mathbb{E}[\sum_{\tau=h}^{H-1}r_{\tau}|\pi,s_{h}=s,a_{h}=a]$. $Q^{\star}$ and $V^{\star}$ denote the optimal value functions. We write $V^{\pi}=\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi}_{0}(s_{0})]$ for the expected total reward of $\pi$. We define the Bellman operator $\mathcal{T}$ such that for any $f:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{R}$,

\[\mathcal{T}f(s,a)=\mathbb{E}[R(s,a)]+\mathbb{E}_{s^{\prime}\sim P(s,a)}\max_{a^{\prime}}f(s^{\prime},a^{\prime})\qquad\forall s,a.\]

We assume that for each $h$ we have an offline dataset of $m_{\mathrm{off}}$ samples $(s,a,r,s^{\prime})$ drawn i.i.d. via $(s,a)\sim\nu_{h}$, $r\sim R(s,a)$, $s^{\prime}\sim P(s,a)$. Here $\nu=\{\nu_{0},\dots,\nu_{H-1}\}$ denotes the corresponding offline data distributions. For a dataset $\mathcal{D}$, we use $\widehat{\mathbb{E}}_{\mathcal{D}}[\cdot]$ to denote the sample average over this dataset. For our theoretical results, we will assume that $\nu$ covers some high-quality policy. Note that covering a high-quality policy does not mean that $\nu$ itself is the distribution of some high-quality policy. For instance, $\nu$ could be a mixture of a high-quality policy and a few low-quality policies, in which case treating $\nu$ as expert demonstrations fails completely (as we show in our experiments). We consider the value-based function approximation setting, where we are given a function class $\mathcal{F}=\mathcal{F}_{0}\times\dots\times\mathcal{F}_{H-1}$ with $\mathcal{F}_{h}\subset{\mathcal{S}}\times\mathcal{A}\mapsto[0,V_{\mathrm{max}}]$ that we use to approximate the value functions of the underlying MDP. Here $V_{\mathrm{max}}$ denotes the maximum total reward of a trajectory. For ease of notation, we write $f=\{f_{0},\ldots,f_{H-1}\}$ and define $\pi^{f}$ to be the greedy policy w.r.t. $f$, which chooses actions as $\pi_{h}^{f}(s)=\mathop{\mathrm{argmax}}_{a}f_{h}(s,a)$.
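To fix notation, the following minimal sketch (with assumed vector-valued Q-functions; not from the authors' code) spells out the greedy policy $\pi^{f}_{h}$ and the regression target $r+\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime})$ that appear repeatedly below.

```python
import numpy as np

def greedy_action(f_h, s):
    """pi^f_h(s) = argmax_a f_h(s, a); here f_h(s) returns the vector of Q-values over actions."""
    return int(np.argmax(f_h(s)))

def regression_target(f_next, r, s_next):
    """Target r + max_{a'} f_{h+1}(s', a') used in the least squares updates below (0 at h = H)."""
    return r + (np.max(f_next(s_next)) if f_next is not None else 0.0)
```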

4 Hybrid Q-Learning

Algorithm 1 Hybrid Q-learning using both offline and online data (Hy-Q)
0:  Input: value class $\mathcal{F}$, number of iterations $T$, offline dataset $\mathcal{D}^{\nu}_{h}$ of size $m_{\mathrm{off}}=T$ for each $h\in[H-1]$.
1:  Initialize $f_{h}^{1}(s,a)=0$.
2:  for $t=1,\dots,T$ do
3:     Let $\pi^{t}$ be the greedy policy w.r.t. $f^{t}$, i.e., $\pi_{h}^{t}(s)=\mathop{\mathrm{argmax}}_{a}f^{t}_{h}(s,a)$.
4:     For each $h$, collect $m_{\mathrm{on}}=1$ online tuple $\mathcal{D}^{t}_{h}\sim d_{h}^{\pi^{t}}$.  // online collection
5:     Set $f_{H}^{t+1}(s,a)=0$.  // FQI using both online and offline data
6:     for $h=H-1,\dots,0$ do
7:        Estimate $f_{h}^{t+1}$ via least squares regression on the aggregated data $\mathcal{D}_{h}=\mathcal{D}^{\nu}_{h}+\sum_{\tau=1}^{t}\mathcal{D}^{\tau}_{h}$:
          $f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\big\{\widehat{\mathbb{E}}_{\mathcal{D}_{h}}\big(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime})\big)^{2}\big\}$   (1)
8:     end for
9:  end for

In this section, we present our algorithm, Hybrid Q-Learning (Hy-Q), in Algorithm 1. Hy-Q takes as input an offline dataset $\{\mathcal{D}^{\nu}_{h}\}_{h=1}^{H}$ of $(s,a,r,s^{\prime})$ tuples and a Q-function class $\mathcal{F}_{h}\subset{\mathcal{S}}\times\mathcal{A}\mapsto[0,H]$, $h=1,\dots,H$, and outputs a policy that optimizes the given reward function. The algorithm is conceptually simple: it iteratively executes the FQI procedure (lines 6-7) using the offline dataset and on-policy samples generated by the learned policies.

Specifically, at iteration $t$ and timestep $h$, we have an estimate $f_{h}^{t}$ of $Q_{h}^{\star}$ and we set $\pi_{h}^{t}$ to be the greedy policy w.r.t. $f_{h}^{t}$. We execute $\pi^{t}$ to collect a dataset $\mathcal{D}_{h}^{t}$ of online samples in line 4. More formally, we sample $s_{h}\sim d^{\pi^{t}}_{h}$, $a_{h}\sim\pi_{h}^{t}(\cdot|s_{h})$, $s_{h+1}\sim P(\cdot|s_{h},a_{h})$ and add the tuple $(s_{h},a_{h},r_{h},s_{h+1})$ to $\mathcal{D}_{h}^{t}$. Then we run FQI, a dynamic-programming-style algorithm, on both the offline dataset $\mathcal{D}^{\nu}_{h}$ and all previously collected online samples $\{\mathcal{D}_{h}^{\tau}\}_{\tau=1}^{t}$. The FQI update works backward from time step $H-1$ to $0$ and computes $f_{h}^{t+1}$ via least squares regression with input $(s,a)$ and regression target $r+\max_{a^{\prime}}f_{h+1}^{t+1}(s^{\prime},a^{\prime})$.

Let us make several remarks; here we drop the timestep subscript $h$ for simplicity. Intuitively, the FQI updates in Hy-Q try to ensure that the estimate $f^{t}$ has small Bellman error under both the offline distribution $\nu$ and the online distributions $d^{\pi^{t}}$. The standard offline version of FQI ensures the former, but this alone is insufficient when the offline dataset has poor coverage. Indeed, FQI may perform poorly in such cases [see examples in Zhan et al., 2022, Chen and Jiang, 2022]. The key insight in Hy-Q is to use online interaction to ensure that we also have small Bellman error on $d^{\pi^{t}}$. As we will see, the moment we find an $f^{t}$ that has small Bellman error on the offline distribution $\nu$ and on its own greedy policy's distribution $d^{\pi^{t}}$, FQI guarantees that $\pi^{t}$ is at least as good as any policy covered by $\nu$. This observation results in an explore-or-terminate phenomenon: either $f^{t}$ has small Bellman error on its own distribution and we are done, or $d^{\pi^{t}}$ must be significantly different from distributions we have seen previously and we make progress. Crucially, no explicit exploration is required for this argument, which is precisely how we avoid the computational difficulties of implementing optimism.

Another important point pertains to catastrophic forgetting. We will see that the size of the offline dataset $m_{\mathrm{off}}$ should be comparable to the total amount of online data $\{\mathcal{D}_{h}^{\tau}\}_{\tau=1}^{T}$, so that the offline and online portions of Eq. (1) have similar weight and we ensure low Bellman error on $\nu$ throughout the learning process. In practice, we implement this by having all model updates use a fixed (significant) number of offline samples even as we collect more online data, so that we do not "forget" the distribution $\nu$. This is quite different from warm-starting with $\mathcal{D}^{\nu}$ and then switching to online RL, which may result in catastrophic forgetting due to a vanishing proportion of offline samples being used for model training as we collect more online samples. We note that this balancing scheme is analogous to and inspired by the one used by Ross and Bagnell [2012] in the context of model-based RL with a reset distribution. Similar techniques have also been explored previously in various applications (for example, see Appendix F.3 of Kalashnikov et al. [2018]). As in Ross and Bagnell [2012], a key practical insight from our analysis is that the offline data should be used throughout training to avoid catastrophic forgetting.
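Before moving to the analysis, here is a minimal, self-contained sketch of the Hy-Q loop on a small random tabular MDP (our own illustrative construction, not the paper's benchmarks or released code). In the tabular case the least squares step reduces to averaging the regression targets per state-action pair, and the offline behavior distribution is taken to be uniform for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny random tabular MDP, used only to make the sketch runnable.
S, A, H = 5, 3, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] is a distribution over next states
R = rng.uniform(size=(H, S, A))                 # mean rewards in [0, 1]
d0 = np.ones(S) / S

def step(h, s, a):
    s_next = rng.choice(S, p=P[h, s, a])
    r = float(rng.uniform() < R[h, s, a])       # Bernoulli reward with mean R[h, s, a]
    return r, s_next

def rollout_tuple(h_target, f):
    """Roll in with the greedy policy of f and return the (s, a, r, s') tuple at step h_target."""
    s = rng.choice(S, p=d0)
    for h in range(h_target + 1):
        a = int(np.argmax(f[h, s]))             # greedy action w.r.t. f_h
        r, s_next = step(h, s, a)
        if h == h_target:
            return (s, a, r, s_next)
        s = s_next

def fit_q(data, f_next):
    """Tabular least squares: average the regression targets r + max_a' f_{h+1}(s', a')."""
    q, counts = np.zeros((S, A)), np.zeros((S, A))
    for (s, a, r, s_next) in data:
        q[s, a] += r + (0.0 if f_next is None else np.max(f_next[s_next]))
        counts[s, a] += 1
    return q / np.maximum(counts, 1)

# Offline datasets: m_off tuples per step h from a uniform behavior distribution (illustrative nu_h).
T, m_off = 200, 200
offline = {h: [] for h in range(H)}
for h in range(H):
    for _ in range(m_off):
        s, a = rng.integers(S), rng.integers(A)
        r, s_next = step(h, s, a)
        offline[h].append((s, a, r, s_next))

# Hy-Q: each iteration adds one on-policy tuple per step, then runs FQI on offline + all online data.
online = {h: [] for h in range(H)}
f = np.zeros((H, S, A))
for t in range(T):
    for h in range(H):
        online[h].append(rollout_tuple(h, f))   # online collection with the current greedy policy
    f_new = np.zeros((H, S, A))
    for h in reversed(range(H)):
        f_next = f_new[h + 1] if h + 1 < H else None
        f_new[h] = fit_q(offline[h] + online[h], f_next)
    f = f_new
```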

5 Theoretical Analysis: Low Bilinear Rank Models

In this section we present the main theoretical guarantees for Hy-Q. We start by stating the key assumptions and definitions for the function approximator, the offline data distribution, and the MDP, and then provide some discussion.

Assumption 1 (Realizability and Bellman completeness).

For any $h$, we have $Q_{h}^{\star}\in\mathcal{F}_{h}$. Additionally, for any $f_{h+1}\in\mathcal{F}_{h+1}$, we have ${\mathcal{T}}f_{h+1}\in\mathcal{F}_{h}$.

Definition 1 (Bellman error transfer coefficient).

For any policy $\pi$, define the transfer coefficient as

\[C_{\pi}:=\max\left\{0,~\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}}}\right\}. \tag{2}\]

The transfer coefficient definition above is somewhat non-standard, but is actually weaker than related notions used in prior offline RL results. First, the average Bellman error appearing in the numerator is weaker than the squared Bellman error notion of Xie et al. [2021a]; a simple calculation shows that $C_{\pi}^{2}$ is upper bounded by their coefficient. Second, by using Bellman errors, both of these are bounded by notions involving density ratios [Kakade and Langford, 2002, Munos and Szepesvári, 2008, Chen and Jiang, 2019]. Third, when the functions $f\in\mathcal{F}$ are linear in a known feature map, as is the case for models such as linear MDPs, the transfer coefficient can be refined to a relative condition number defined via the features. Finally, many works, particularly those that do not employ pessimism [Munos and Szepesvári, 2008, Chen and Jiang, 2019], require "all-policy" analogs, which place a much stronger requirement on the offline data distribution $\nu$. In contrast, we will only ask that $C_{\pi}$ is small for some high-quality policy that we hope to compete with. In Appendix 12, we showcase that our transfer coefficient is weaker than related notions used in prior works under various settings such as tabular MDPs, linear MDPs, low-rank MDPs, and MDPs with general value function approximation.
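For intuition, the definition can be evaluated exactly in small tabular problems. Below is a minimal sketch (our own illustrative helper, not from the paper) that computes $C_{\pi}$ for a tabular MDP and a finite class $\mathcal{F}$; the array layout is an assumption.

```python
import numpy as np

def transfer_coefficient(P, R, nu, d_pi, F):
    """Exact C_pi from Definition 1 for a tabular MDP and a finite class F.

    P: (H, S, A, S) transitions, R: (H, S, A) mean rewards, nu and d_pi: (H, S, A) distributions,
    F: list of candidate value functions f with shape (H + 1, S, A) and f[H] = 0.
    """
    H = R.shape[0]
    ratios = [0.0]
    for f in F:
        num, den = 0.0, 0.0
        for h in range(H):
            bellman_backup = R[h] + P[h] @ f[h + 1].max(axis=1)   # (T f_{h+1})(s, a)
            err = bellman_backup - f[h]                           # T f_{h+1} - f_h at step h
            num += np.sum(d_pi[h] * err)                          # average error under d^pi_h
            den += np.sum(nu[h] * err ** 2)                       # squared error under nu_h
        if den > 1e-12:                                           # if den = 0 and num > 0 the
            ratios.append(num / np.sqrt(den))                     # coefficient is infinite; skipped here
    return max(ratios)
```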

Definition 2 (Bilinear model [Du et al., 2021]).

We say that the MDP together with the function class $\mathcal{F}$ is a bilinear model of rank $d$ if for any $h\in[H-1]$, there exist two (unknown) mappings $X_{h},W_{h}:\mathcal{F}\mapsto\mathbb{R}^{d}$ with $\max_{f}\|X_{h}(f)\|_{2}\leq B_{X}$ and $\max_{f}\|W_{h}(f)\|_{2}\leq B_{W}$ such that:

\[\forall f,g\in\mathcal{F}:\quad\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f}}}\left[g_{h}(s,a)-\mathcal{T}g_{h+1}(s,a)\right]\right\rvert=\left\lvert\left\langle X_{h}(f),W_{h}(g)\right\rangle\right\rvert.\]

All concepts defined above are frequently used in the statistical analysis of RL methods with function approximation. Realizability is the most basic function approximation assumption, but is known to be insufficient for offline RL [Foster et al., 2021] unless other strong assumptions hold [Xie and Jiang, 2021, Zhan et al., 2022, Chen and Jiang, 2022]. Completeness is the most standard strengthening of realizability that is used routinely in both online [Jin et al., 2021a] and offline RL [Munos and Szepesvári, 2008, Chen and Jiang, 2019] and is known to hold in several settings including the linear MDP and the linear quadratic regulator. These assumptions ensure that the dynamic programming updates of FQI are stable in the presence of function approximation.

Lastly, the bilinear model was developed in a series of works [Jiang et al., 2017, Jin et al., 2021a, Du et al., 2021] on sample efficient online RL. (Jin et al. [2021a] consider the Bellman eluder dimension, which is related to but distinct from the bilinear model; our proofs can be easily translated to that setting, see Appendix C for details.) The setting is known to capture a wide class of models including linear MDPs, linear Bellman complete models, low-rank MDPs, linear quadratic regulators, reactive POMDPs, and more. As a technical note, the main paper focuses on the "Q-type" version of the bilinear model, but the algorithm and proofs easily extend to the "V-type" version. See Appendix A.3 for details.

Theorem 1 (Cumulative suboptimality).

Fix $\delta\in(0,1)$, set $m_{\mathrm{off}}=T$ and $m_{\mathrm{on}}=1$, and suppose that the function class $\mathcal{F}$ satisfies Assumption 1 and, together with the underlying MDP, admits bilinear rank $d$. Then, with probability at least $1-\delta$, Algorithm 1 obtains the following bound on cumulative suboptimality w.r.t. any comparator policy $\pi^{e}$:

\[\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}}=\widetilde{O}\left(\max\left\{C_{\pi^{e}},1\right\}V_{\max}\sqrt{dH^{2}T\cdot\log\left(\lvert\mathcal{F}\rvert/\delta\right)}\right),\]

where $\pi^{t}=\pi^{f^{t}}$ is the greedy policy w.r.t. $f^{t}$ at round $t$.

The parameter choices in the above theorem ensure that, at each FQI iteration, the ratio between the number of offline samples and the total number of online samples collected so far is at least $1$. This ensures that during learning we never forget the offline distribution. A standard online-to-batch conversion [Shalev-Shwartz and Ben-David, 2014] immediately gives the following sample complexity guarantee for Algorithm 1 for finding an $\epsilon$-suboptimal policy w.r.t. the optimal policy $\pi^{*}$ of the underlying MDP.
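To make the online-to-batch step concrete, the following is the short calculation behind Corollary 1 (a sketch; constants and logarithmic factors are suppressed):

\[
\frac{1}{T}\sum_{t=1}^{T}\left(V^{\pi^{*}}-V^{\pi^{t}}\right)
\lesssim \max\{C_{\pi^{*}},1\}\,V_{\max}\sqrt{\frac{dH^{2}\log(\lvert\mathcal{F}\rvert/\delta)}{T}}
\leq \epsilon
\quad\Longleftarrow\quad
T \gtrsim \frac{C_{\pi^{*}}^{2}V_{\max}^{2}dH^{2}\log(\lvert\mathcal{F}\rvert/\delta)}{\epsilon^{2}}.
\]

Since each iteration collects $H$ online tuples ($m_{\mathrm{on}}=1$ per step) and the offline dataset contributes $H\,m_{\mathrm{off}}=HT$ tuples, the total sample count is $O(HT)$, which gives the $H^{3}$ dependence in Corollary 1; returning a policy chosen uniformly at random from $\{\pi^{1},\dots,\pi^{T}\}$ yields the $\epsilon$-suboptimal $\widehat{\pi}$.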

Corollary 1 (Sample complexity).

Under the assumptions of Theorem 1, if $C_{\pi^{*}}<\infty$, then Algorithm 1 can find an $\epsilon$-suboptimal policy $\widehat{\pi}$, for which $V^{\pi^{*}}-V^{\widehat{\pi}}\leq\epsilon$, with total sample complexity (online + offline):

\[n=\widetilde{O}\left(C^{2}_{\pi^{*}}V^{2}_{\max}H^{3}d\log\left(\lvert\mathcal{F}\rvert/\delta\right)/\epsilon^{2}\right).\]

The results formalize the statistical properties of Hy-Q. In terms of sample complexity, a somewhat unique feature of the hybrid setting is that both the transfer coefficient and the bilinear rank are relevant, whereas these (or related) parameters typically appear in isolation in offline and online RL, respectively. In terms of coverage, Theorem 1 highlights an "oracle property" of Hy-Q: it competes with any policy that is sufficiently covered by the offline dataset.

We also highlight the computational efficiency of Hy-Q: it only requires solving least squares problems over the function class \mathcal{F}. To our knowledge, no purely online or purely offline methods are known to be efficient in this sense, except under much stronger “uniform” coverage conditions.

5.1 Proof Sketch

We now give an overview of the proof of Theorem 1. The proof starts with a simple decomposition of the regret:

\begin{align*}
\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}
&=\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-V^{\pi^{f^{t}}}_{0}(s)\right]\\
&=\sum_{t=1}^{T}\underbrace{\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f_{0}^{t}(s,a)\right]}_{A_{t}}+\sum_{t=1}^{T}\underbrace{\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right]}_{B_{t}}.
\end{align*}

Then we note that one can bound each $A_{t}$ and $B_{t}$ by the Bellman error under the comparator $\pi^{e}$'s visitation distribution and the learned policy's visitation distribution, respectively. For simplicity, define the Bellman error of function $f$ at time $h$ as $\mathcal{E}_{h}(f)(s,a)=f_{h}(s,a)-{\mathcal{T}}f_{h+1}(s,a)$; we can then show that

\[A_{t}\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi^{e}}}\left[-\mathcal{E}_{h}(f^{t})(s,a)\right]\ \text{(Lemma 5)}\qquad\text{and}\qquad B_{t}\leq\sum_{h=0}^{H-1}\left|\mathbb{E}_{s,a\sim d_{h}^{\pi^{t}}}\mathcal{E}_{h}(f^{t})(s,a)\right|\ \text{(Lemma 4)}.\]

Then, for $A_{t}$, recalling the definition of the transfer coefficient $C_{\pi}$ gives

\[A_{t}\leq C_{\pi^{e}}\cdot\sqrt{\sum_{h=0}^{H-1}\underbrace{\mathbb{E}_{s,a\sim\nu_{h}}\left[\mathcal{E}_{h}(f^{t})\right]^{2}}_{\mathcal{E}_{t;h}^{\textrm{off}}}}.\]

The terms $\mathcal{E}^{\textrm{off}}_{t;h}$ are of the order of the statistical error of least squares regression, since at every iteration $t$, FQI includes the offline data from $\nu_{h}$ in its least squares regression problems. Thus $A_{t}$ is small for every $t$ given bounded $C_{\pi^{e}}$.

To bound the online part $B_{t}$, we utilize the structure of bilinear models. For the analysis, we construct a covariance matrix $\Sigma_{t;h}=\sum_{\tau=1}^{t}X_{h}(f^{\tau})X_{h}(f^{\tau})^{\top}+\lambda\mathbb{I}$, where $X_{h}$ is as in the bilinear model definition; this matrix tracks the online learning progress. Recalling the definition of the bilinear model, we can bound $B_{t}$ as follows:

\begin{align*}
\sum_{t=1}^{T}B_{t}\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left|\mathbb{E}_{s,a\sim d_{h}^{\pi^{t}}}\mathcal{E}_{h}(f^{t})(s,a)\right|
&=\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(f^{t}),W_{h}(f^{t})\right\rangle\right\rvert\\
&\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\underbrace{\sum_{\tau=1}^{t-1}\mathbb{E}_{s,a\sim d^{\tau}_{h}}[\mathcal{E}_{h}(f^{t})(s,a)]^{2}}_{\mathcal{E}_{t;h}^{\textrm{on}}}+\lambda B_{W}^{2}}.
\end{align*}

The first term on the right hand side of the above inequality, i.e., $\sum_{t}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}$, can be shown to grow sublinearly as $\widetilde{O}(\sqrt{T})$ using the classic elliptical potential argument (Lemma 6). The term $\mathcal{E}^{\textrm{on}}_{t;h}$ can be controlled to be small since it is related to the statistical error of least squares regression (in each iteration $t$, the least squares regression uses training data sampled from policies $\pi^{1}$ through $\pi^{t-1}$). Together, this ensures that $\sum_{t}B_{t}$ grows sublinearly as $\widetilde{O}(\sqrt{T})$, which further implies that there exists an iteration $t^{\prime}$ such that $B_{t^{\prime}}\leq\widetilde{O}(1/\sqrt{T})$. Together, the above arguments show that there must exist an iteration $t^{\prime}$ at which $A_{t^{\prime}}$ and $B_{t^{\prime}}$ are small simultaneously, which implies that $\pi^{f^{t^{\prime}}}$ is close to $\pi^{e}$ in performance. The proof sketch highlights the key observation: as long as we have a function $f$ that has small Bellman residual under the offline distribution $\nu$ and small Bellman residual under its own greedy policy's distribution $d^{\pi^{f}}$, then $\pi^{f}$ must be at least as good as any policy $\pi^{e}$ that is covered by the offline distribution.

5.2 The Linear Bellman Completeness Model

We next showcase one example of a low bilinear rank model, the popular linear Bellman complete model, which captures both the linear MDP model [Yang and Wang, 2019, Jin et al., 2020] and the LQR model, and we instantiate the sample complexity bound from Corollary 1.

Definition 3.

Given a feature function $\phi:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{B}_{d}(1)$, an MDP with feature $\phi$ admits linear Bellman completeness if for any $w\in\mathbb{B}_{d}(B_{W})$, there exists a $w^{\prime}\in\mathbb{B}_{d}(B_{W})$ such that

\[\forall s,a:\qquad\left\langle w^{\prime},\phi(s,a)\right\rangle=\mathbb{E}[R(s,a)]+\mathbb{E}_{s^{\prime}\sim P(s,a)}\max_{a^{\prime}}\left\langle w,\phi(s^{\prime},a^{\prime})\right\rangle.\]

Note that the above condition implies that $Q^{\star}_{h}(s,a)=\left\langle w_{h}^{\star},\phi(s,a)\right\rangle$ with $\|w^{\star}_{h}\|_{2}\leq B_{W}$. Thus, we can define a function class $\mathcal{F}_{h}=\{\left\langle w_{h},\phi(s,a)\right\rangle:w_{h}\in\mathbb{R}^{d},\|w_{h}\|_{2}\leq B_{W}\}$, which by inspection satisfies Assumption 1. Additionally, this model is known to have bilinear rank at most $d$ [Du et al., 2021]. Thus, using Corollary 1, we immediately get the following guarantee:

Lemma 1.

Let $\delta\in(0,1)$, suppose the MDP is linear Bellman complete with $C_{\pi^{*}}<\infty$, and consider $\mathcal{F}_{h}$ defined above. Then, with probability $1-\delta$, Algorithm 1 finds an $\epsilon$-suboptimal policy with total sample complexity (offline + online):

\[n=\widetilde{O}\left(\frac{B_{W}^{2}C^{2}_{\pi^{*}}H^{4}d^{2}\log\left(B_{W}/\epsilon\delta\right)}{\epsilon^{2}}\right).\]
Proof sketch of Lemma 1.

The proof follows by invoking the result in Corollary 1 for a discretization of the class $\mathcal{F}$, denoted by $\mathcal{F}_{\epsilon}=\mathcal{F}_{0,\epsilon}\times\dots\times\mathcal{F}_{H-1,\epsilon}$. Here $\mathcal{F}_{h,\epsilon}=\{w^{\top}\phi(s,a):w\in\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})\}$, where $\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})$ is an $\epsilon$-net of $\mathbb{B}_{d}(B_{W})$ under the $\ell_{\infty}$-distance and contains $O((B_{W}/\epsilon)^{d})$ elements. Thus, we get that $\log(\left\lvert\mathcal{F}_{\epsilon}\right\rvert)=O\left(Hd\log(B_{W}/\epsilon)\right)$. ∎

On the computational side, with $\mathcal{F}$ as in Lemma 1, the regression problem in Algorithm 1 reduces to a least squares linear regression with a norm constraint on the weight vector. This can be solved by convex programming with complexity scaling polynomially in the parameters [Bubeck et al., 2015].

Remark 1 (Computational efficiency).

For linear Bellman complete models, we note that Algorithm 1 can be implemented efficiently under mild assumptions. For the class $\mathcal{F}$ in Lemma 1, the regression problem in (1) reduces to a least squares linear regression with a norm constraint on the weight vector. This regression problem can be solved by convex programming with computational cost scaling polynomially in the number of parameters (here $d$) [Bubeck et al., 2015], whenever $\max_{a}f_{h+1}(s,a)$ (or $\mathop{\mathrm{argmax}}_{a}f_{h+1}(s,a)$) can be computed efficiently.
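To illustrate the regression step in this linear setting, here is a minimal sketch of norm-constrained least squares solved by projected gradient descent on synthetic data; it is one simple convex-programming-style routine, not necessarily the one used in the paper's experiments.

```python
import numpy as np

def constrained_least_squares(Phi, y, B_W, lr=0.1, iters=2000):
    """Minimize (1/n)||Phi @ w - y||^2 over {w : ||w||_2 <= B_W} by projected gradient descent.

    Phi: (n, d) feature matrix phi(s, a); y: (n,) targets r + max_a' f_{h+1}(s', a').
    """
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2.0 / n * Phi.T @ (Phi @ w - y)   # gradient of the mean squared error
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > B_W:                           # project back onto the l2 ball of radius B_W
            w = w * (B_W / norm)
    return w

# Tiny synthetic check (illustrative data only).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 8))
w_true = rng.normal(size=8); w_true /= np.linalg.norm(w_true)
y = Phi @ w_true + 0.1 * rng.normal(size=500)
w_hat = constrained_least_squares(Phi, y, B_W=1.0)
```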

Remark 2.

(Linear MDPs) Since linear Bellman complete models generalize linear MDPs [Yang and Wang, 2019, Jin et al., 2020], as discussed above, Algorithm 1 can be implemented efficiently whenever $\max_{a}f_{h+1}(s,a)$ can be computed efficiently. The latter is tractable in the following cases:

  • When $\lvert\mathcal{A}\rvert$ is small/finite, one can simply enumerate actions to compute $\max_{a}f_{h+1}(s,a)$ for any $s$, and thus (1) can be implemented efficiently. The computational efficiency of Algorithm 1 in this case is comparable to that of prior works, e.g., Jin et al. [2020].

  • When the set $\{\phi(s,a)\mid a\in\mathcal{A}\}$ is convex and compact, one can use a linear optimization oracle to compute $\max_{a}f_{h+1}(s,a)=\max_{a}w_{h+1}^{\top}\phi(s,a)$. This linear optimization problem is itself solvable with computational cost scaling polynomially with $d$.

    Note that even with access to a linear optimization oracle, prior works, e.g., Jin et al. [2020], rely on bonuses of the form $\mathop{\mathrm{argmax}}_{a}\phi(s,a)^{\top}w+\beta\sqrt{\phi(s,a)^{\top}\Sigma\phi(s,a)}$, where $\Sigma$ is some positive definite matrix (e.g., the regularized feature covariance matrix). Computing such bonuses could be NP-hard (in the feature dimension $d$) without additional assumptions [Dani et al., 2008].

Remark 3.

(Relative condition number) A common coverage metric in these linear MDP models is the relative condition number. In Appendix A.4, we show that our coefficient $C_{\pi}$ is upper bounded by the relative condition number of $\pi$ with respect to $\nu$, defined via $\mathbb{E}_{d^{\pi}}\|\phi\|_{\Sigma^{-1}_{\nu}}$, where $\Sigma_{\nu}=\mathbb{E}_{s,a\sim\nu}\phi(s,a)\phi(s,a)^{\top}$. Concretely, we have $C_{\pi}\leq\sqrt{\max_{h}\mathbb{E}_{d^{\pi}_{h}}\|\phi\|^{2}_{\Sigma^{-1}_{\nu_{h}}}}$. Note that this quantity captures coverage in terms of features, and can be bounded even when the density-ratio-style concentrability coefficient (i.e., $\sup_{s,a}d^{\pi}(s,a)/\nu(s,a)$) is infinite.
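As a quick numerical illustration of this coverage notion, the following sketch estimates the bound $\max_{h}\mathbb{E}_{d^{\pi}_{h}}\|\phi\|^{2}_{\Sigma^{-1}_{\nu_{h}}}$ from feature samples; the synthetic features and the helper name are illustrative assumptions.

```python
import numpy as np

def relative_condition_bound(phi_pi, phi_nu, reg=1e-6):
    """Estimate sqrt(max_h E_{d^pi_h}[ ||phi||^2_{Sigma_{nu_h}^{-1}} ]) from per-step feature samples.

    phi_pi[h]: (n_pi, d) features sampled from d^pi_h; phi_nu[h]: (n_nu, d) features sampled from nu_h.
    """
    vals = []
    for X_pi, X_nu in zip(phi_pi, phi_nu):
        Sigma_nu = X_nu.T @ X_nu / len(X_nu) + reg * np.eye(X_nu.shape[1])
        # E_{d^pi}[ phi^T Sigma_nu^{-1} phi ] estimated by a sample average
        vals.append(np.mean(np.einsum("nd,de,ne->n", X_pi, np.linalg.inv(Sigma_nu), X_pi)))
    return float(np.sqrt(max(vals)))

# Synthetic example: when the two feature distributions match, the bound is roughly sqrt(d).
rng = np.random.default_rng(0)
H, d = 5, 10
phi_nu = [rng.normal(size=(2000, d)) for _ in range(H)]
phi_pi = [rng.normal(size=(2000, d)) for _ in range(H)]
print(relative_condition_bound(phi_pi, phi_nu))
```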

5.3 Low-rank MDP

In this section, we briefly introduce the low-rank MDP model [Du et al., 2021], which is captured by the V-type bilinear model discussed in Appendix A.3. Unlike the linear MDP model discussed in Section 5.2, the low-rank MDP does not assume that the feature map $\phi$ is known a priori.

Definition 4 (Low-rank MDP).

An MDP is called a low-rank MDP if there exist $\mu^{\star}:{\mathcal{S}}\mapsto\mathbb{R}^{d}$ and $\phi^{\star}:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{R}^{d}$ such that the transition dynamics satisfy $P(s^{\prime}|s,a)=\mu^{\star}(s^{\prime})^{\top}\phi^{\star}(s,a)$ for all $s,a,s^{\prime}$. We additionally assume that we are given a realizable representation class $\Phi$ such that $\phi^{\star}\in\Phi$, that $\sup_{s,a}\|\phi^{\star}(s,a)\|_{2}\leq 1$, and that $\|f^{\top}\mu^{\star}\|_{2}\leq\sqrt{d}$ for any $f:{\mathcal{S}}\mapsto[-1,1]$.

Consider the function class $\mathcal{F}_{h}=\{w^{\top}\phi(s,a):\phi\in\Phi,w\in\mathbb{B}_{d}(B_{W})\}$; through the bilinear decomposition we have $B_{W}\leq 2\sqrt{d}$. By inspection, this function class satisfies Assumption 1. Furthermore, it is well known that the low-rank MDP model has V-type bilinear rank at most $d$ [Du et al., 2021]. Invoking the sample complexity bound given in Corollary 2 for V-type bilinear models, we get the following result.

Lemma 2.

Let $\delta\in(0,1)$ and let $\Phi$ be a given representation class. Suppose that the MDP is a rank-$d$ MDP w.r.t. some $\phi^{\star}\in\Phi$, that $C_{\pi^{*}}<\infty$, and consider $\mathcal{F}_{h}$ defined above. Then, with probability $1-\delta$, Algorithm 2 finds an $\epsilon$-suboptimal policy with total sample complexity (offline + online):

\[\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}d^{2}H^{4}\left\lvert\mathcal{A}\right\rvert\log\left({HTd\lvert\Phi\rvert}/{\epsilon\delta}\right)}{\epsilon^{2}}\right).\]
Proof sketch of Lemma 2.

The proof follows by invoking the result in Corollary 1 for a discretization of the class $\mathcal{F}$, denoted by $\mathcal{F}_{\epsilon}=\mathcal{F}_{0,\epsilon}\times\dots\times\mathcal{F}_{H-1,\epsilon}$. Here $\mathcal{F}_{h,\epsilon}=\{w^{\top}\phi(s,a):\phi\in\Phi,w\in\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})\}$, where $\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})$ is an $\epsilon$-net of $\mathbb{B}_{d}(B_{W})$ under the $\ell_{\infty}$-distance and contains $O((B_{W}/\epsilon)^{d})$ elements. Thus, we get that $\log(\left\lvert\mathcal{F}_{\epsilon}\right\rvert)=O\left(Hd\log(B_{W}\left\lvert\Phi\right\rvert/\epsilon)\right)$. ∎

For low-rank MDPs, the transfer coefficient $C_{\pi}$ is upper bounded by a relative-condition-number-style quantity defined using the unknown ground truth feature $\phi^{\star}$ (see Lemma 13). On the computational side, Algorithm 1 (with the modification of $a\sim\text{Uniform}(\mathcal{A})$ in the online data collection step) requires solving a least squares regression problem at every round. The objective of this regression problem is a convex functional of the hypothesis $f$ over the constraint set $\mathcal{F}$. While this is not fully efficiently implementable due to the potentially non-convex constraint set $\mathcal{F}$ (e.g., $\phi$ could be complicated), our regression problem is still much simpler than the oracle models considered in prior works for this model [Agarwal et al., 2020a, Sekhari et al., 2021, Uehara et al., 2021, Modi et al., 2021].

Figure 2: A hard instance for offline RL [Zhan et al., 2022, reproduced with permission]

5.4 Why don’t offline RL methods work?

One may wonder why pure offline RL methods fail to learn when the transfer coefficient is bounded, and why online access helps. We illustrate with the MDP construction developed by Zhan et al. [2022] and Chen and Jiang [2022], visualized in Figure 2.

Consider two MDPs $\{M_{1},M_{2}\}$ with $H=2$, three states $\{A,B,C\}$, two actions $\{L,R\}$, and the fixed start state $A$. The two MDPs have the same dynamics but different rewards. In both, actions from state $B$ yield reward $1$. In $M_{1}$, $(C,R)$ yields reward $1$, while $(C,L)$ yields reward $1$ in $M_{2}$. All other rewards are $0$. In both $M_{1}$ and $M_{2}$, an optimal policy is $\pi^{*}(A)=L$ and $\pi^{*}(B)=\pi^{*}(C)=\mathrm{Uniform}(\{L,R\})$. With $\mathcal{F}=\{Q_{1}^{\star},Q_{2}^{\star}\}$, where $Q_{j}^{\star}$ is the optimal $Q$-function for $M_{j}$, one can easily verify that $\mathcal{F}$ satisfies Bellman completeness for both MDPs. Finally, with an offline distribution $\nu$ supported on states $A$ and $B$ only (with no coverage of state $C$), we have sufficient coverage of $d^{\pi^{\star}}$. However, samples from $\nu$ are unable to distinguish between $Q_{1}^{\star}$ and $Q_{2}^{\star}$ (or $M_{1}$ and $M_{2}$), since state $C$ is not supported by $\nu$. Unfortunately, adversarial tie-breaking may result in the greedy policies of $Q_{1}^{\star}$ and $Q_{2}^{\star}$ visiting state $C$, where we have no information about the correct action.

This issue has been documented before, and to address it with pure offline RL, existing approaches require additional structural assumptions. For instance, Chen and Jiang [2022] assume that $Q^{\star}$ has a gap, which usually does not hold when the action space is large or continuous. Xie et al. [2021a] assume policy-dependent Bellman completeness for every possible policy $\pi\in\Pi$ (which is much stronger than our assumption), and Zhan et al. [2022] assume a somewhat non-interpretable realizability condition on a "value" function that does not obey the standard Bellman equation. In contrast, by combining offline and online data, our approach focuses on functions that have small Bellman residual under both the offline distribution and the on-policy distributions, which, together with the offline data coverage assumption, ensures near optimality. It is easy to see that the hybrid approach succeeds on the instance in Figure 2.
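To make the failure mode concrete, here is a tiny numerical check of the example above; the transition structure ($A$ goes to $B$ under $L$ and to $C$ under $R$) is our reading of Figure 2, and the script is only an illustrative sketch.

```python
import numpy as np

# Hard instance of Figure 2: M1 and M2 share dynamics but differ in the rewards at state C.
A_, B_, C_ = 0, 1, 2
L_, R_ = 0, 1
next_state = {(A_, L_): B_, (A_, R_): C_}   # assumed deterministic dynamics at h = 0

def q_star(which):
    """Optimal Q-functions of M1 (which=1) or M2 (which=2); they differ only at state C."""
    q1 = np.zeros((3, 2))
    q1[B_, :] = 1.0                          # both actions at B give reward 1 in both MDPs
    q1[C_, R_ if which == 1 else L_] = 1.0   # (C, R) rewards 1 in M1; (C, L) rewards 1 in M2
    q0 = np.zeros((3, 2))
    for (s, a), s_nxt in next_state.items():
        q0[s, a] = 0.0 + np.max(q1[s_nxt])   # rewards at h = 0 are 0; back up from h = 1
    return q0, q1

def squared_bellman_error_on_nu(q0, q1):
    """Squared Bellman error on nu, which is supported on A (h = 0) and B (h = 1) only.

    The relevant rewards (0 at A, 1 at B) are identical in M1 and M2, so this quantity does not
    depend on which MDP generated the offline data.
    """
    err = 0.0
    for a in (L_, R_):
        err += (q0[A_, a] - (0.0 + np.max(q1[next_state[(A_, a)]]))) ** 2
        err += (q1[B_, a] - 1.0) ** 2
    return err

for which in (1, 2):
    print(which, squared_bellman_error_on_nu(*q_star(which)))   # both print 0.0: nu cannot tell them apart
```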

6 Experiments

In this section we discuss empirical results comparing Hy-Q to several representative RL methods on two challenging benchmarks. Our experiments focus on answering the following questions:

  1. Can Hy-Q efficiently solve problems that SOTA offline RL methods simply cannot?

  2. Can Hy-Q, via the use of offline data, significantly improve the sample efficiency of online RL?

  3. Does Hy-Q scale to challenging deep-RL benchmarks?

Figure 3: The rich observation combination lock [Misra et al., 2020, Zhang et al., 2022c].

Our empirical results provide positive answers to all of these questions. To study the first two, we consider the diabolical combination lock environment [Misra et al., 2020, Zhang et al., 2022c], a synthetic environment designed to be particularly challenging for online exploration. The synthetic nature allows us to carefully control the offline data distribution to modulate the difficulty of the setup, and also to compare with a provably efficient baseline [Zhang et al., 2022c]. To study the third question, we consider the Montezuma's Revenge benchmark from the Arcade Learning Environment, which is one of the most challenging empirical benchmarks with high-dimensional image inputs, largely due to the difficulty of exploration. Additional details are deferred to Appendix E.

Hy-Q implementation.

We largely follow Algorithm 1 in our implementation for the combination lock experiment. In particular, we use function approximation similar to Zhang et al. [2022c] and a minibatch Adam update on Eq. (1), with the same sampling proportions as in the pseudocode. For Montezuma's Revenge, in addition to minibatch optimization, since the horizon of the environment is not fixed, we deploy a discounted version of Hy-Q. Concretely, the target value in the Bellman error is calculated from the output of a periodically updated target network, multiplied by a discount factor. We refer the reader to Appendix E for more details.
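For concreteness, the following is a minimal sketch of the discounted update described above, written in PyTorch; the network size, hyperparameters, and the offline/online minibatch split are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

def make_q_net(obs_dim, num_actions):
    return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

def hyq_update(q_net, target_net, optimizer, offline_batch, online_batch, gamma=0.99):
    """One minibatch Adam step on the squared Bellman error over offline + online transitions.

    Each batch is a tuple of tensors (obs, act, rew, next_obs, done); the two batches are concatenated
    so that offline data keeps a fixed weight throughout training, as discussed in Section 4.
    """
    obs, act, rew, next_obs, done = [torch.cat([o, n]) for o, n in zip(offline_batch, online_batch)]
    with torch.no_grad():
        # Discounted target from the periodically updated target network.
        target = rew + gamma * (1.0 - done) * target_net(next_obs).max(dim=1).values
    q = q_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
    loss = ((q - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a sketch the target network would be synchronized every fixed number of gradient steps, e.g., via `target_net.load_state_dict(q_net.state_dict())`.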

Baselines.

We include representative algorithms from four categories: (1) for imitation learning we use Behavior Cloning (Bc) [Bain and Sammut, 1995]; (2) for offline RL we use Conservative Q-Learning (Cql) [Kumar et al., 2020], due to its successful demonstrations on some Atari games; (3) for online RL we use Briee [Zhang et al., 2022c] for the combination lock (Briee is currently the state-of-the-art method for this environment; in particular, Misra et al. [2020] show that many deep RL baselines fail on it) and Random Network Distillation (Rnd) [Burda et al., 2018] for Montezuma's Revenge; and (4) as a hybrid RL baseline we use Deep Q-learning from Demonstrations (Dqfd) [Hester et al., 2018]. We note that Dqfd and prior hybrid RL methods combine expert demonstrations with online interactions, but are not necessarily designed to work with general offline datasets.

Results summary.

Overall, we find that Hy-Q performs favorably against all of these baselines. Compared with offline RL, imitation learning, and prior hybrid methods, Hy-Q is significantly more robust in the presence of a low quality offline data distribution. Compared with online methods, Hy-Q offers order-of-magnitude savings in the total experience.

Reproducibility.

We release our code at https://github.com/yudasong/HyQ. We also include implementation details in Appendix E.

Figure 4: The learning curve for the combination lock with $H=100$. The plots show the median and 80th/20th quantiles for 5 replicates. Pure offline and IL methods are visualized as dashed horizontal lines (in the left plot, CQL overlaps with BC). Note that we report the number of samples, while Zhang et al. [2022c] report the number of episodes.

6.1 Combination Lock

The combination lock benchmark is depicted in Figure 3 and consists of horizon $H=100$, three latent states for each time step, and $10$ actions in each state. Each state has a single "good" action that advances down a chain of favorable latent states (white) from which the optimal reward can be obtained. A single incorrect action transitions to an absorbing chain (black latent states) with suboptimal value. The agent operates on high-dimensional continuous observations emitted from the latent states and must use function approximation to succeed. This is an extremely challenging problem for which many deep RL methods are known to fail [Misra et al., 2020], in part because (uniform) random exploration has only a $10^{-H}$ probability of obtaining the optimal reward.

On the other hand, the model has low bilinear rank, so there do exist online RL algorithms that are provably sample-efficient: Briee [Zhang et al., 2022c] currently obtains the state-of-the-art sample complexity. However, its sample complexity is still quite large, and we hope that hybrid RL can address this shortcoming. We are not aware of any experiments with offline RL methods on this benchmark.

We construct two offline datasets for the experiments, both of which are derived from the optimal policy $\pi^{\star}$, which always picks the "good" actions and stays in the chain of white states. In the optimal trajectory dataset, we collect full trajectories by following $\pi^{\star}$ with $\epsilon$-greedy exploration, with $\epsilon=1/H$; we also add some noise by making the agent act randomly at timestep $H/2$. In the optimal occupancy dataset, we collect transition tuples from the state-occupancy measure of $\pi^{\star}$ with random actions (formally, we sample $h\sim\textrm{Unif}([H])$, $s\sim d_{h}^{\pi^{\star}}$, $a\sim\textrm{Unif}(\mathcal{A})$, $r\sim R(s,a)$, $s^{\prime}\sim P(s,a)$). Both datasets have bounded concentrability coefficients (and hence transfer coefficients) with respect to $\pi^{\star}$, but the second dataset is much more challenging since the actions in the offline dataset do not directly provide information about $\pi^{\star}$, as they do in the former.
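For clarity, here is a schematic of the two dataset constructions, written against a generic episodic-simulator interface; `reset_fn`, `step_fn`, `pi_star`, and `actions` are assumed callables/lists for illustration, not the benchmark's actual API.

```python
import random

def collect_trajectory_dataset(reset_fn, step_fn, pi_star, actions, H, num_traj, eps):
    """Optimal-trajectory dataset: roll out pi_star with eps-greedy noise, acting randomly at step H/2."""
    data = {h: [] for h in range(H)}
    for _ in range(num_traj):
        s = reset_fn()
        for h in range(H):
            a = random.choice(actions) if (h == H // 2 or random.random() < eps) else pi_star(h, s)
            r, s_next = step_fn(h, s, a)
            data[h].append((s, a, r, s_next))
            s = s_next
    return data

def collect_occupancy_dataset(reset_fn, step_fn, pi_star, actions, H, num_tuples):
    """Optimal-occupancy dataset: h ~ Unif([H]), s ~ d_h^{pi*}, a ~ Unif(A), one tuple per sample."""
    data = {h: [] for h in range(H)}
    for _ in range(num_tuples):
        h = random.randrange(H)
        s = reset_fn()
        for t in range(h):                    # roll in with pi_star to obtain a sample from d_h^{pi*}
            _, s = step_fn(t, s, pi_star(t, s))
        a = random.choice(actions)
        r, s_next = step_fn(h, s, a)
        data[h].append((s, a, r, s_next))
    return data
```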

The results are presented in Figure 4. First, we observe that Hy-Q reliably solves the task under both offline distributions with relatively low sample complexity (500k offline samples and at most 25m online samples). In comparison, Bc fails completely since both datasets contain random actions. Cql can solve the task using the optimal trajectory-based dataset with a sample complexity comparable to the combined sample size of Hy-Q. However, Cql fails on the optimal occupancy-based dataset since the actions themselves are not informative. Indeed, the pessimism-inducing regularizer of Cql is constant on this dataset, so the algorithm reduces to Fqi, which provably fails when the offline data does not have global coverage (i.e., coverage of every state-action pair). Finally, Hy-Q can solve the task with a factor of 5-10 reduction in samples (online plus offline) compared with Briee. This demonstrates the robustness and sample efficiency provided by hybrid RL.

Figure 5: The learning curve for Montezuma's Revenge. The plots show the median and 80th/20th quantiles for 5 replicates. Pure offline and IL methods, as well as dataset qualities, are visualized as dashed horizontal lines. "Expert" denotes $V^{\pi^{e}}$ and "Offline" denotes the average trajectory reward in the offline dataset. The y-axis denotes the (moving) average of 100 episodes for the methods involving online interactions. Note that Cql and Bc overlap in the last plot.
Figure 6: Screenshots of the training processes of our approach Hy-Q (top row) and Rnd (bottom row). With only 0.1m offline samples of which half is from a random policy and half is from a high quality policy (with reward around 6400), our approach learns significantly faster than Rnd.

6.2 Montezuma’s Revenge

To answer the third question, we turn to Montezuma's Revenge, an extremely challenging image-based benchmark environment with sparse rewards. We follow the setup of Burda et al. [2018] and introduce stochasticity into the original dynamics: with probability 0.25 the environment executes the previous action instead of the current one. For the offline datasets, we first train an "expert policy" $\pi^{e}$ via Rnd to achieve $V^{\pi^{e}}\approx 6400$. We create three datasets by mixing samples from $\pi^{e}$ with those from a random policy: the easy dataset contains only samples from $\pi^{e}$, the medium dataset mixes in an 80/20 proportion (80% from $\pi^{e}$), and the hard dataset mixes in a 50/50 proportion. We record full trajectories from both policies in the offline dataset, but measure the proportion in terms of the number of transition tuples rather than trajectories. We provide 0.1 million offline samples to the hybrid methods, and 1 million samples to the offline and IL methods.
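The mixing rule described above (whole trajectories, with the mixture proportion measured in transition tuples) can be sketched as follows; the function and the target counts are illustrative assumptions.

```python
def mix_by_tuple_proportion(expert_trajs, random_trajs, total_tuples, expert_fraction):
    """Build an offline dataset of roughly `total_tuples` transitions, with `expert_fraction` of them
    coming from expert trajectories; whole trajectories are added until each quota is met."""
    def take(trajs, quota):
        out, count = [], 0
        for traj in trajs:                    # traj is a list of (s, a, r, s') tuples
            if count >= quota:
                break
            out.append(traj)
            count += len(traj)
        return out, count

    expert_part, n_exp = take(expert_trajs, int(expert_fraction * total_tuples))
    random_part, _ = take(random_trajs, total_tuples - n_exp)
    return expert_part + random_part

# e.g., the "medium" dataset: 100k tuples, 80% from the expert policy, 20% from a random policy:
# offline_trajs = mix_by_tuple_proportion(expert_trajs, random_trajs, 100_000, 0.8)
```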

Results are displayed in Figure 5. Cql fails completely on all datasets. Dqfd performs well on the easy dataset due to its supervised-learning-style large-margin loss [Piot et al., 2014], which imitates the policies in the offline dataset. However, Dqfd's performance drops as the quality of the offline dataset degrades (medium), and it fails when the offline dataset is of low quality (hard), where one cannot simply treat offline samples as expert samples. We also observe that Bc is a competitive baseline in the first two settings due to the high fraction of expert samples, and we thus view these problems as relatively easy to solve. Hy-Q is the only method that performs well on the hard dataset, on which Bc's performance is quite poor. We also include the comparison with Rnd in Figure 1 and Figure 6: with only 100k offline samples from any of the three datasets, Hy-Q is over 10x more efficient in terms of online sample complexity.

7 Conclusion

We demonstrate the potential of hybrid RL with Hy-Q, a simple, theoretically principled, and empirically effective algorithm. Our theoretical results showcase how Hy-Q circumvents the computational issues of pure offline or online RL, while our empirical results highlight its robustness and sample efficiency. Yet Hy-Q is perhaps the most natural hybrid algorithm, and we are optimistic that there is much more potential to unlock in the hybrid setting, which we look forward to studying in future work.

Acknowledgement

AS thanks Karthik Sridharan for useful discussions. WS acknowledges funding support from NSF IIS-2154711. We thank Simon Zhai for a careful reading of the manuscript and for improving its technical correctness. We also thank Uri Sherman for discussions regarding the computational efficiency claims in the original draft.

References

  • Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, 2019.
  • Agarwal et al. [2019] Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. 2019. URL https://rltheorybook.github.io/.
  • Agarwal et al. [2020a] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank MDPs. In Advances in Neural Information Processing Systems, 2020a.
  • Agarwal et al. [2020b] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, 2020b.
  • Badia et al. [2020] Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In International Conference on Machine Learning, 2020.
  • Bagnell et al. [2003] James Bagnell, Sham M Kakade, Jeff Schneider, and Andrew Ng. Policy search by dynamic programming. Advances in Neural Information Processing Systems, 2003.
  • Bain and Sammut [1995] Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, 1995.
  • Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.
  • Beygelzimer et al. [2011] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, 2011.
  • Bubeck et al. [2015] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 2015.
  • Burda et al. [2018] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2018.
  • Chang et al. [2021] Jonathan Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, and Wen Sun. Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 2021.
  • Chen and Jiang [2019] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 2019.
  • Chen and Jiang [2022] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps. arXiv:2203.13935, 2022.
  • Dani et al. [2008] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
  • Dann et al. [2022] Chris Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, and Karthik Sridharan. Guarantees for epsilon-greedy reinforcement learning with function approximation. In International Conference on Machine Learning, pages 4666–4689. PMLR, 2022.
  • Dann et al. [2018] Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, 2018.
  • Du et al. [2021] Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning, 2021.
  • Foster et al. [2021] Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory, 2021.
  • Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Advances in Neural Information Processing Systems, 2021.
  • Guo et al. [2022] Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Avila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, and Bilal Piot. BYOL-explore: Exploration by bootstrapped prediction. arXiv:2206.08332, 2022.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. arXiv:1812.05905, 2018.
  • Hester et al. [2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, 2018.
  • Jia et al. [2022] Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, and Hao Su. Improving policy optimization with generalist-specialist learning. In International Conference on Machine Learning, pages 10104–10119. PMLR, 2022.
  • Jiang and Huang [2020] Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 2020.
  • Jiang et al. [2017] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.
  • Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, 2020.
  • Jin et al. [2021a] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 2021a.
  • Jin et al. [2021b] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, 2021b.
  • Kakade [2001] Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 2001.
  • Kakade and Langford [2002] Sham M Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Kane et al. [2022] Daniel Kane, Sihan Liu, Shachar Lovett, and Gaurav Mahajan. Computational-statistical gaps in reinforcement learning. In Conference on Learning Theory, 2022.
  • Kostrikov et al. [2021] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv:2110.06169, 2021.
  • Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Lee et al. [2022] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv:2005.01643, 2020.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Liu and Brunskill [2018] Yao Liu and Emma Brunskill. When simple exploration is sample efficient: Identifying sufficient conditions for random exploration to yield pac rl algorithms. arXiv preprint arXiv:1805.09045, 2018.
  • Misra et al. [2020] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, 2020.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • Modi et al. [2021] Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. Model-free representation learning and exploration in low-rank MDPs. arXiv:2102.07035, 2021.
  • Munos and Szepesvári [2008] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.
  • Nair et al. [2018] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics and Automation, 2018.
  • Nair et al. [2020] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv:2006.09359, 2020.
  • Niu et al. [2022] Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, and Xianyuan Zhan. When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. arXiv preprint arXiv:2206.13464, 2022.
  • Piot et al. [2014] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014.
  • Qiu et al. [2022] Shuang Qiu, Lingxiao Wang, Chenjia Bai, Zhuoran Yang, and Zhaoran Wang. Contrastive UCB: Provably efficient contrastive self-supervised learning in online reinforcement learning. In International Conference on Machine Learning, 2022.
  • Rajeswaran et al. [2017] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv:1709.10087, 2017.
  • Ross and Bagnell [2012] Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement learning. arXiv:1203.1007, 2012.
  • Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv:1511.05952, 2015.
  • Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and Shogi by planning with a learned model. Nature, 2020.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
  • Sekhari et al. [2021] Ayush Sekhari, Christoph Dann, Mehryar Mohri, Yishay Mansour, and Karthik Sridharan. Agnostic reinforcement learning with low-rank MDPs and rich observations. Advances in Neural Information Processing Systems, 2021.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Sun et al. [2019] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on learning theory, 2019.
  • Uchendu et al. [2022] Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, and Karol Hausman. Jump-start reinforcement learning. arXiv:2204.02372, 2022.
  • Uehara and Sun [2021] Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage. In International Conference on Learning Representations, 2021.
  • Uehara et al. [2021] Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in low-rank MDPs. arXiv:2110.04652, 2021.
  • Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.
  • Vecerik et al. [2017] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv:1707.08817, 2017.
  • Wang et al. [2021] Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, and Sham Kakade. Instabilities of offline rl with pre-trained neural representation. In International Conference on Machine Learning, 2021.
  • Wang et al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.
  • Xie and Jiang [2021] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International Conference on Machine Learning, 2021.
  • Xie et al. [2021a] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 2021a.
  • Xie et al. [2021b] Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in Neural Information Processing Systems, 2021b.
  • Yang and Wang [2019] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, 2019.
  • Yu et al. [2021] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. In Advances in Neural Information Processing Systems, 2021.
  • Zanette et al. [2020] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, 2020.
  • Zanette et al. [2021] Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in Neural Information Processing Systems, 2021.
  • Zhan et al. [2022] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR, 2022.
  • Zhang et al. [2022a] Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph Gonzalez, Dale Schuurmans, and Bo Dai. Making linear MDPs practical via contrastive representation learning. In International Conference on Machine Learning, 2022a.
  • Zhang et al. [2022b] Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. In International Conference on Artificial Intelligence and Statistics, 2022b.
  • Zhang et al. [2022c] Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, and Wen Sun. Efficient reinforcement learning in block MDPs: A model-free representation learning approach. In International Conference on Machine Learning, 2022c.

Appendix A Proofs for Section 5

Additional notation.

Throughout the appendix, we define the feature covariance matrix Σt;h\Sigma_{t;h} as

Σt;h=τ=1tXh(fτ)(Xh(fτ))+λ𝕀.\displaystyle\Sigma_{t;h}=\sum_{\tau=1}^{t}X_{h}(f^{\tau})\left(X_{h}(f^{\tau})\right)^{\top}+\lambda\mathbb{I}. (3)

Furthermore, given a distribution \beta\in\Delta({\mathcal{S}}\times\mathcal{A}) and a function f:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{R}, we denote its weighted \ell_{2} norm by \|f\|_{2,\beta}:=\sqrt{\mathbb{E}_{s,a\sim\beta}f^{2}(s,a)}.

A.1 Supporting lemmas for Theorem 1

Before proving Theorem 1, we first present a few useful lemmas. We start with a standard least squares generalization bound, which we will use by noting that Algorithm 1 performs least squares regression on the empirical Bellman error. We defer the proof of Lemma 3 to Appendix B.

Lemma 3.

(Least squares generalization bound) Let R>0 and \delta\in(0,1), and consider a sequential function estimation setting with an instance space \mathcal{X} and target space \mathcal{Y}. Let \mathcal{H}:\mathcal{X}\mapsto[-R,R] be a class of real-valued functions. Let \mathcal{D}=\left\{(x_{1},y_{1}),\dots,(x_{T},y_{T})\right\} be a dataset of T points where x_{t}\sim\rho_{t}\vcentcolon={}\rho_{t}(x_{1:t-1},y_{1:t-1}), and y_{t} is sampled via the conditional probability p(\cdot\mid x_{t}):

ytp(xt):=h(xt)+εt,\displaystyle y_{t}\sim p(\cdot\mid x_{t})\vcentcolon={}h^{*}(x_{t})+\varepsilon_{t},

where the function hh^{*} satisfies approximate realizability i.e.

infh1Tt=1T𝔼xρt[(h(x)h(x))2]γ,\inf_{h\in\mathcal{H}}\frac{1}{T}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[\left(h^{*}(x)-h(x)\right)^{2}\right]\leq\gamma,

and \left\{\varepsilon_{t}\right\}_{t=1}^{T} are noise variables satisfying \mathbb{E}[y_{t}\mid x_{t}]=h^{\ast}(x_{t}). Additionally, suppose that \max_{t}\lvert y_{t}\rvert\leq R and \max_{x}\left\lvert h^{*}(x)\right\rvert\leq R. Then the least squares solution \widehat{h}\leftarrow\mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\sum_{t=1}^{T}\left(h(x_{t})-y_{t}\right)^{2} satisfies, with probability at least 1-\delta,

t=1T𝔼xρt[(h^(x)h(x))2]\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[(\widehat{h}(x)-h^{*}(x))^{2}\right] 3γT+256R2log(2||/δ).\displaystyle\leq 3\gamma T+256R^{2}\log(2\lvert\mathcal{H}\rvert/\delta).

The above lemma is an extension of the standard agnostic least squares generalization bound from the i.i.d. setting to the non-i.i.d. case, where the training data forms a martingale sequence. We state the result for the case where realizability holds only approximately, up to the approximation error \gamma; in all of our proofs, however, we invoke this result with \gamma=0.
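As a purely numerical sanity check (not part of the formal argument), the sketch below instantiates Lemma 3 with a small finite class of threshold functions and an adaptively chosen design, and compares the realized squared error against the stated bound; the function class, the adaptive design, and the noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
R, T, delta = 1.0, 2000, 0.05

# Finite class H of threshold functions h_c(x) = 0.8 * sign(x - c), so |h| <= R.
thresholds = np.linspace(-1.0, 1.0, 41)
def h(c, x):
    return 0.8 * np.sign(x - c)

c_star = thresholds[25]      # h* lies in the class, so realizability holds (gamma = 0)

# Adaptive (non-i.i.d.) design: x_t depends on the previously observed labels.
xs, ys = [], []
for t in range(T):
    center = float(np.mean(ys)) if ys else 0.0
    x_t = float(np.clip(center, -0.5, 0.5)) + rng.uniform(-1.0, 1.0)
    y_t = h(c_star, x_t) + rng.uniform(-0.2, 0.2)   # E[y_t | x_t] = h*(x_t), |y_t| <= R
    xs.append(x_t)
    ys.append(y_t)
xs, ys = np.array(xs), np.array(ys)

# Least squares empirical risk minimization over the finite class.
c_hat = thresholds[np.argmin([np.mean((h(c, xs) - ys) ** 2) for c in thresholds])]

# In-sample proxy for the left-hand side of Lemma 3, compared with its bound (gamma = 0).
realized = np.sum((h(c_hat, xs) - h(c_star, xs)) ** 2)
bound = 256 * R ** 2 * np.log(2 * len(thresholds) / delta)
print(f"realized squared error: {realized:.2f}  vs  bound: {bound:.2f}")
```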

In the next two lemmas, we bound each part of the regret decomposition in terms of the Bellman error of the value function f.

Lemma 4 (Performance difference lemma).

For any function f=(f0,,fH1)f=\left(f_{0},\dots,f_{H-1}\right) where fh:𝒮×𝒜f_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R} and h[H1]h\in[H-1], we have

𝔼sd0[maxaf0(s,a)V0πf(s)]h=0H1|𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]|,\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V_{0}^{\pi^{f}}(s)]\leq\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f}}}\left[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)\right]\right\rvert,

where we define fH(s,a)=0f_{H}(s,a)=0 for all s,as,a.

Proof.

We start the proof by noting that \pi^{f}_{0}(s)=\mathop{\mathrm{argmax}}_{a}f_{0}(s,a). Then we have:

𝔼sd0[maxaf0(s,a)Vπf(s)]\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V^{\pi^{f}}(s)] =𝔼sd0[𝔼aπ0f(s)f0(s,a)V0πf(s)]\displaystyle=\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi_{0}^{f}(s)}f_{0}(s,a)-V_{0}^{\pi^{f}}(s)]
=𝔼sd0[𝔼aπ0f(s)f0(s,a)𝒯f1(s,a)]+𝔼sd0[𝔼aπ0f(s)𝒯f1(s,a)V0πf(s)]\displaystyle=\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi_{0}^{f}(s)}f_{0}(s,a)-\mathcal{T}f_{1}(s,a)]+\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi^{f}_{0}(s)}\mathcal{T}f_{1}(s,a)-V_{0}^{\pi^{f}}(s)]
=𝔼s,ad0πf[f0(s,a)𝒯f1(s,a)]+\displaystyle=\mathbb{E}_{s,a\sim d^{\pi^{f}}_{0}}[f_{0}(s,a)-\mathcal{T}f_{1}(s,a)]+
\displaystyle\;\;\;\;\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi_{0}^{f}(s)}[R(s,a)+\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s,a)}\max_{a^{\prime}}f_{1}(s^{\prime},a^{\prime})-R(s,a)-\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s,a)}V^{\pi^{f}}_{1}(s^{\prime})]]
=𝔼s,ad0πf[f0(s,a)𝒯f1(s,a)]+𝔼sd1πf[maxaf1(s,a)V1πf(s)]\displaystyle=\mathbb{E}_{s,a\sim d^{\pi^{f}}_{0}}[f_{0}(s,a)-\mathcal{T}f_{1}(s,a)]+\mathbb{E}_{s\sim d^{\pi^{f}}_{1}}[\max_{a}f_{1}(s,a)-V_{1}^{\pi^{f}}(s)] (4)

Then by recursively applying the same procedure on the second term in (4), we have

𝔼sd0[maxaf0(s,a)Vπf(s)]\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V^{\pi^{f}}(s)] =h=0H1𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]+𝔼sdHπf[maxafH(s,a)VHπf(s)].\displaystyle=\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{f}}_{h}}[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)]+\mathbb{E}_{s\sim d^{\pi^{f}}_{H}}[\max_{a}f_{H}(s,a)-V_{H}^{\pi^{f}}(s)].

Finally, for h=H, recall that we set f_{H}(s,a)=0 and V^{\pi^{f}}_{H}=0 for notational simplicity. Thus we have:

𝔼sd0[maxaf0(s,a)Vπf(s)]\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V^{\pi^{f}}(s)] =h=0H1𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]\displaystyle=\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{f}}_{h}}[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)]
h=0H1|𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]|.\displaystyle\leq\sum_{h=0}^{H-1}\left|\mathbb{E}_{s,a\sim d^{\pi^{f}}_{h}}[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)]\right|.

We now show how to bound the other half of the regret decomposition:

Lemma 5.

Let πe=(π0e,,πH1e)\pi^{e}=(\pi^{e}_{0},\dots,\pi^{e}_{H-1}) be a comparator policy, and consider any value function f=(f0,,fH1)f=(f_{0},\dots,f_{H-1}) where fh:𝒮×𝒜f_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}. Then,

𝔼sd0[V0πe(s)maxaf0(s,a)]i=0H1𝔼s,adiπe[𝒯fi+1(s,a)fi(s,a)],\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f_{0}(s,a)\right]\leq\sum_{i=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{i}}[\mathcal{T}f_{i+1}(s,a)-f_{i}(s,a)],

where we defined fH(s,a)=0f_{H}(s,a)=0 for all s,as,a.

Proof.

The proof is similar to the proof of Lemma 4, and we start with the fact that maxaf(s,a)f(s,a),a\max_{a}f(s,a)\geq f(s,a^{\prime}),\forall a^{\prime}, including actions sampled from πe\pi^{e}:

𝔼sd0[V0πe(s)maxaf0(s,a)]\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V_{0}^{\pi^{e}}(s)-\max_{a}f_{0}(s,a)\right] 𝔼s,ad0πe[Q0πe(s,a)f0(s,a)]\displaystyle\leq\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[Q_{0}^{\pi^{e}}(s,a)-f_{0}(s,a)\right]
=𝔼s,ad0πe[Q0πe(s,a)𝒯f1(s,a)+𝒯f1(s,a)f0(s,a)]\displaystyle=\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[Q_{0}^{\pi^{e}}(s,a)-{\mathcal{T}}f_{1}(s,a)+{\mathcal{T}}f_{1}(s,a)-f_{0}(s,a)\right]
=𝔼s,ad0πe[𝔼s𝒫(s,a)V1πe(s)maxaf1(s,a)]+𝔼s,ad0πe[𝒯f1(s,a)f0(s,a)]\displaystyle=\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s,a)}V_{1}^{\pi^{e}}(s^{\prime})-\max_{a^{\prime}}f_{1}(s^{\prime},a^{\prime})\right]+\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[{\mathcal{T}}f_{1}(s,a)-f_{0}(s,a)\right]
=𝔼sd1πe[V1πe(s)maxaf1(s,a)]+𝔼s,ad0πe[𝒯f1(s,a)f0(s,a)]\displaystyle=\mathbb{E}_{s\sim d^{\pi_{e}}_{1}}\left[V_{1}^{\pi^{e}}(s)-\max_{a}f_{1}(s,a)\right]+\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[{\mathcal{T}}f_{1}(s,a)-f_{0}(s,a)\right] (5)

Again by recursively applying the same procedure on the first term in (5), we have

𝔼sd0[V0πe(s)maxaf0(s,a)]\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V_{0}^{\pi^{e}}(s)-\max_{a}f_{0}(s,a)\right] 𝔼sdHπe[VHπe(s)maxafH(s,a)]+h=0H1𝔼s,adhπe[𝒯fh+1(s,a)fh(s,a)],\displaystyle\leq\mathbb{E}_{s\sim d^{\pi_{e}}_{H}}\left[V_{H}^{\pi^{e}}(s)-\max_{a}f_{H}(s,a)\right]+\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{h}}\left[{\mathcal{T}}f_{h+1}(s,a)-f_{h}(s,a)\right],

Recalling that f_{H}(s,a)=0 and V^{\pi^{e}}_{H}=0, we have

𝔼sd0[V0πe(s)maxaf0(s,a)]\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V_{0}^{\pi^{e}}(s)-\max_{a}f_{0}(s,a)\right] h=0H1𝔼s,adhπe[𝒯fh+1(s,a)fh(s,a)].\displaystyle\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{h}}\left[{\mathcal{T}}f_{h+1}(s,a)-f_{h}(s,a)\right].

The following result is useful for bilinear models when bounding the potential functions; it follows directly from the elliptical potential lemma [Lattimore and Szepesvári, 2020, Lemma 19.4].

Lemma 6.

Let Xh(f1),,Xh(fT)dX_{h}(f^{1}),\dots,X_{h}(f^{T})\in\mathbb{R}^{d} be a sequence of vectors with Xh(ft)BX<\|X_{h}(f^{t})\|\leq B_{X}<\infty for all tTt\leq T. Then,

t=1TXh(ft)Σt1;h12dTlog(1+TBX2λd),\displaystyle\sum_{t=1}^{T}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\leq\sqrt{2dT\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right)},

where the matrix \Sigma_{t;h}\vcentcolon={}\sum_{\tau=1}^{t}X_{h}(f^{\tau})X_{h}(f^{\tau})^{\top}+\lambda\mathbb{I} for t\in[T] and \lambda\geq B^{2}_{X}, and the matrix norm is \|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}=\sqrt{X_{h}(f^{t})^{\top}\Sigma_{t-1;h}^{-1}X_{h}(f^{t})}.

Proof.

Since λBX2\lambda\geq B_{X}^{2}, we have that

Xh(ft)Σt1;h121λXh(ft)21.\displaystyle\|X_{h}(f^{t})\|^{2}_{\Sigma^{-1}_{t-1;h}}\leq\frac{1}{\lambda}\|X_{h}(f^{t})\|^{2}\leq 1.

Thus, using elliptical potential lemma [Lattimore and Szepesvári, 2020, Lemma 19.4], we get that

t=1TXh(ft)Σt1;h122dlog(1+TBX2λd).\displaystyle\sum_{t=1}^{T}\|X_{h}(f^{t})\|^{2}_{\Sigma_{t-1;h}^{-1}}\leq 2d\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right).

The desired bound follows from Jensen’s inequality which implies that

t=1TXh(ft)Σt1;h1Tt=1TXh(ft)Σt1;h122Tdlog(1+TBX2λd).\displaystyle\sum_{t=1}^{T}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\leq\sqrt{T\cdot\sum_{t=1}^{T}\|X_{h}(f^{t})\|^{2}_{\Sigma_{t-1;h}^{-1}}}\leq\sqrt{2Td\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right)}.
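The following is a small numerical sanity check of Lemma 6 (illustrative only, using synthetic unit-norm vectors): it accumulates random vectors into the covariance matrix \Sigma_{t;h} and compares the sum of potentials against the stated bound.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, B_X = 5, 500, 1.0
lam = B_X ** 2                        # lambda >= B_X^2, as required by the lemma

Sigma = lam * np.eye(d)               # Sigma_{0;h} = lambda * I
total = 0.0
for t in range(T):
    x = rng.normal(size=d)
    x = B_X * x / np.linalg.norm(x)                   # ||X_h(f^t)|| <= B_X
    total += np.sqrt(x @ np.linalg.solve(Sigma, x))   # ||X_h(f^t)||_{Sigma_{t-1;h}^{-1}}
    Sigma += np.outer(x, x)                           # update to Sigma_{t;h}

bound = np.sqrt(2 * d * T * np.log(1 + T * B_X ** 2 / (lam * d)))
print(f"sum of potentials: {total:.2f}  <=  bound: {bound:.2f}")
```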

A.2 Proof of Theorem 1

Before delving into the proof, we first state the following generalization bound for the FQI updates.

Lemma 7 (Bellman error bound for FQI).

Let \delta\in(0,1). For h\in[H-1] and t\in[T], let f^{t+1}_{h} be the value function estimate for time step h computed via the least squares regression (1) on the datasets \left(\mathcal{D}^{\nu}_{h},\mathcal{D}^{1}_{h},\dots,\mathcal{D}^{t}_{h}\right) in iteration t of Algorithm 1. Then, with probability at least 1-\delta, for any h\in[H-1] and t\in[T],

fht+1𝒯fh+1t+12,νh21moff256Vmax2log(2HT||/δ)=:Δoff,\displaystyle\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|_{2,\nu_{h}}^{2}\leq\frac{1}{m_{\mathrm{off}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\Delta_{\mathrm{off}},
and
τ=1tfht+1𝒯fh+1t+12,μhτ21mon256Vmax2log(2HT||/δ)=:Δon,\displaystyle\sum_{\tau=1}^{t}\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|^{2}_{2,\mu^{\tau}_{h}}\leq\frac{1}{m_{\mathrm{on}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\Delta_{\mathrm{on}},

where \nu_{h} denotes the offline data distribution at time h, and the distribution \mu^{\tau}_{h}\in\Delta(\mathcal{S}\times\mathcal{A}) is defined such that (s,a)\sim d^{\pi^{\tau}}_{h}.

Proof.

Fix t[T]t\in[T], h[H1]h\in[H-1] and fh+1t+1h+1f^{t+1}_{h+1}\in\mathcal{F}_{h+1} and consider the regression problem ((1) in the iteration tt of Algorithm 1):

fht+1argminfh{moff𝔼^𝒟hν(f(s,a)rmaxafh+1t+1(s,a))2+monτ=1t𝔼^𝒟hτ(f(s,a)rmaxafh+1t+1(s,a))2},\displaystyle f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\left\{m_{\mathrm{off}}\widehat{\mathbb{E}}_{\mathcal{D}^{\nu}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}+m_{\mathrm{on}}\sum_{\tau=1}^{t}\widehat{\mathbb{E}}_{\mathcal{D}^{\tau}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}\right\},

which can be thought of as regression problem

fht+1argminfh{𝔼^𝒟(f(s,a)rmaxafh+1t+1(s,a))2},\displaystyle f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\left\{\widehat{\mathbb{E}}_{\mathcal{D}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}\right\},

where dataset 𝒟\mathcal{D} consisting of n=moff+tmonn=m_{\mathrm{off}}+t\cdot m_{\mathrm{on}} samples {(xi,yi)}in\left\{(x_{i},y_{i})\right\}_{i\leq n} where

xi=(shi,ahi)andyi=ri+maxafh+1t+1(sh+1i,a).\displaystyle x_{i}=(s^{i}_{h},a^{i}_{h})\qquad\text{and}\qquad y^{i}=r^{i}+\max_{a}f_{h+1}^{t+1}(s^{i}_{h+1},a).

In particular, we define 𝒟\mathcal{D} such that the first moffm_{\mathrm{off}} samples {(xi,yi)}imoff=𝒟hν\left\{(x_{i},y_{i})\right\}_{i\leq m_{\mathrm{off}}}=\mathcal{D}^{\nu}_{h}, the next monm_{\mathrm{on}} samples {(xi,yi)}i=moff+1moff+mon=𝒟h1\left\{(x_{i},y_{i})\right\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}=\mathcal{D}^{1}_{h}, and so on where the samples {(xi,yi)}i=moff+(τ1)mon+1moff+τmon=𝒟hτ\left\{(x_{i},y_{i})\right\}_{i=m_{\mathrm{off}}+(\tau-1)m_{\mathrm{on}}+1}^{m_{\mathrm{off}}+\tau m_{\mathrm{on}}}=\mathcal{D}^{\tau}_{h}. Note that: (a) for any sample (x=(sh,ah),y=(r+maxafh+1t+1(sh+1,a)))(x=(s_{h},a_{h}),y=(r+\max_{a}f_{h+1}^{t+1}(s_{h+1},a))) in 𝒟\mathcal{D}, we have that

𝔼[yx]\displaystyle\operatorname{\mathbb{E}}\left[y\mid x\right] =𝔼sh+1P(sh,ah),rR(sh,ah)[r+maxafh+1t+1(sh+1,a)]\displaystyle=\operatorname{\mathbb{E}}_{s_{h+1}\sim P(s_{h},a_{h}),r\sim R(s_{h},a_{h})}\left[r+\max_{a}f_{h+1}^{t+1}(s_{h+1},a)\right]
\displaystyle=\mathcal{T}f_{h+1}^{t+1}(s_{h},a_{h}),

where the last equality is the definition of the Bellman operator; moreover, the Bellman completeness assumption guarantees that \mathcal{T}f^{t+1}_{h+1}\in\mathcal{F}_{h}, so realizability holds with \gamma=0 in Lemma 3; (b) for any sample, \left\lvert y\right\rvert\leq V_{\mathrm{max}} and f(s,a)\leq V_{\mathrm{max}} for all s,a; (c) our construction of \mathcal{D} implies that the i-th sample (x_{i},y_{i}) is generated as follows: x_{i} is drawn from a data generation scheme that depends on (x_{1:i-1},y_{1:i-1}), and y_{i} is drawn from a conditional probability distribution p(\cdot\mid x_{i}), exactly as required in Lemma 3; and finally, (d) the samples in \mathcal{D}^{\nu}_{h} are drawn from the offline distribution \nu_{h}, and the samples in \mathcal{D}^{\tau}_{h} are drawn such that s_{h}\sim d_{h}^{\pi^{\tau}} and a_{h}\sim\pi^{f^{\tau}}(s_{h}). Thus, using Lemma 3, we get that the least squares solution f^{t+1}_{h} satisfies

i=1n𝔼[(fht+1(si,ai)𝒯fh+1t+1(si,ai))2𝒟i]\displaystyle\sum_{i=1}^{n}\operatorname{\mathbb{E}}\left[(f^{t+1}_{h}(s^{i},a^{i})-\mathcal{T}f^{t+1}_{h+1}(s^{i},a^{i}))^{2}\mid\mathcal{D}_{i}\right] 256Vmax2log(2||/δ).\displaystyle\leq 256V_{\mathrm{max}}^{2}\log(2\lvert\mathcal{F}\rvert/\delta).

Using property (d) above, we get that

mofffht+1𝒯fh+1t+12,νh2+monτ=1tfht+1𝒯fh+1t+12,μhτ2256Vmax2log(2||/δ),\displaystyle m_{\mathrm{off}}\cdot\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|_{2,\nu_{h}}^{2}+m_{\mathrm{on}}\cdot\sum_{\tau=1}^{t}\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|^{2}_{2,\mu^{\tau}_{h}}\leq 256V_{\mathrm{max}}^{2}\log(2\lvert\mathcal{F}\rvert/\delta),

where the distribution \mu^{\tau}_{h}\in\Delta(\mathcal{S}\times\mathcal{A}) is defined by sampling s\sim d^{\pi^{\tau}}_{h} and a\sim\pi^{f^{\tau}}(s). Taking a union bound over h\in[H-1] and t\in[T], and bounding each term separately, gives the desired statement. ∎
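To make the regression step (1) concrete, the following is a minimal sketch of this update for a linear function class, where the argmin over \mathcal{F}_{h} has a closed form; the feature map, dataset containers, and function names are illustrative assumptions and not the implementation used in our experiments. Since each dataset enters (1) through its empirical mean scaled by its size, the update is equivalent to an unweighted least squares fit over the pooled offline and online samples (when |\mathcal{D}^{\nu}_{h}|=m_{\mathrm{off}} and |\mathcal{D}^{\tau}_{h}|=m_{\mathrm{on}}).

```python
import numpy as np


def hyq_regression_step(offline_h, online_batches_h, phi, f_next, num_actions):
    """One least squares update in the spirit of (1): fit w so that
    f_h(s, a) = w^T phi(s, a) regresses onto the target r + max_a' f_{h+1}(s', a').

    offline_h:        list of (s, a, r, s_next) tuples from the offline dataset D_h^nu.
    online_batches_h: list of online batches D_h^1, ..., D_h^t, each a list of tuples.
    phi:              feature map phi(s, a) -> 1-D numpy array.
    f_next:           callable f_{h+1}(s, a) -> float (use lambda s, a: 0.0 at h = H-1).
    """
    # Pooling the samples reproduces the m_off / m_on weighting in (1)
    # when |D_h^nu| = m_off and |D_h^tau| = m_on.
    data = list(offline_h)
    for batch in online_batches_h:
        data += list(batch)

    X = np.stack([phi(s, a) for (s, a, r, s_next) in data])
    y = np.array([r + max(f_next(s_next, b) for b in range(num_actions))
                  for (s, a, r, s_next) in data])

    # Ordinary least squares; lstsq handles rank-deficient feature matrices.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a, w=w: float(w @ phi(s, a))
```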

We next state a change-of-distribution lemma, which allows us to bound the expected Bellman error under the (s,a) distribution generated by f^{t} in terms of the expected squared Bellman errors under the data distributions of the previous policies, which are in turn controlled by the regression guarantee.

Lemma 8.

For any t0t\geq 0 and h[H1]h\in[H-1], we have

|Wh(ft),Xh(ft)|Xh(ft)Σt1;h1i=1t1𝔼s,adhfi[(fht𝒯fh+1t)2]+λBW2,\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[\left(f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right)^{2}\right]+\lambda B^{2}_{W}},

where \Sigma_{t-1;h} is defined in (3), and we use the notation d^{f^{i}}_{h} to denote d^{\pi^{f^{i}}}_{h}.

Proof.

Using Cauchy-Schwarz inequality, we get that

|Wh(ft),Xh(ft)|\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert Xh(ft)Σt1;h1Wh(ft)Σt1;h\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\|W_{h}(f^{t})\|_{\Sigma_{t-1;h}}
=Xh(ft)Σt1;h1(Wh(ft))Σt1Wh(ft)\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left(W_{h}(f^{t})\right)^{\top}\Sigma_{t-1}W_{h}(f^{t})}
=Xh(ft)Σt1;h1(Wh(ft))(i=1t1Xh(fi)Xh(fi)+λ𝕀)Wh(ft)\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left(W_{h}(f^{t})\right)^{\top}\left(\sum_{i=1}^{t-1}X_{h}(f^{i})X_{h}(f^{i})^{\top}+\lambda\mathbb{I}\right)W_{h}(f^{t})}
=Xh(ft)Σt1;h1i=1t1|Wh(ft),Xh(fi)|2+λWh(ft)2\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2}+\lambda\|W_{h}(f^{t})\|^{2}}
Xh(ft)Σt1;h1i=1t1|Wh(ft),Xh(fi)|2+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2}+\lambda B_{W}^{2}} (6)
Xh(ft)Σt1;h1i=1t1𝔼s,adhfi[(fht𝒯fh+1t)2]+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[\left(f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right)^{2}\right]+\lambda B_{W}^{2}}

where the inequality in the second last line holds by plugging in the bound on Wh(ft)\|W_{h}(f^{t})\|, and the last line holds by using Definition 2 which implies that

|Wh(ft),Xh(fi)|2\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2} =(𝔼s,adhfi[fht𝒯fh+1t])2𝔼s,adhfi[(fht𝒯fh+1t)2],\displaystyle=\left(\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right]\right)^{2}\leq\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[\left(f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right)^{2}\right],

where the last inequality is due to Jensen’s inequality. ∎

We now have all the tools to prove Theorem 1. We first restate the bound with the exact problem-dependent parameters, assuming that B_{W} and B_{X} are constants, which are hidden in the order notation below.

Theorem (Theorem 1 restated).

Let moff=Tm_{\mathrm{off}}=T and mon=1m_{\mathrm{on}}=1. Then, with probability at least 1δ1-\delta, the cumulative suboptimality of Algorithm 1 is bounded as

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right).
Proof of Theorem 1.

Let πe\pi^{e} be any comparator policy with bounded transfer coefficient i.e.

Cπe:=max{0,maxfh=0H1𝔼s,adhπe[fh(s,a)𝒯fh+1(s,a)]h=0H1𝔼s,aνh[(fh(s,a)𝒯fh+1(s,a))2]}<.\displaystyle C_{\pi^{e}}:=\max\left\{0,~{}\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi^{e}}}\left[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)\right)^{2}\right]}}\right\}<\infty. (7)

We start by noting that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} =t=1T𝔼sd0[V0πe(s)V0πft(s)]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-V^{\pi^{f^{t}}}_{0}(s)\right]
=t=1T𝔼sd0[V0πe(s)maxaf0t(s,a)]+t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)].\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f_{0}^{t}(s,a)\right]+\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right]. (8)

For the first term on the right-hand side of (8), using Lemma 5 for each f^{t}, 1\leq t\leq T, we get

t=1T𝔼sd0[V0πe(s)maxaf0t(s,a)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f^{t}_{0}(s,a)\right] t=1Th=0H1𝔼s,adhπe[𝒯fh+1t(s,a)fht(s,a)]\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{h}}[\mathcal{T}f^{t}_{h+1}(s,a)-f^{t}_{h}(s,a)]
t=1TCπeh=0H1𝔼s,aνh[(fht(s,a)𝒯fh+1t(s,a))2]\displaystyle\leq\sum_{t=1}^{T}C_{\pi^{e}}\cdot\sqrt{\sum_{h=0}^{H-1}\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(f_{h}^{t}(s,a)-\mathcal{T}f_{h+1}^{t}(s,a)\right)^{2}\right]}
=TCπeHΔoff,\displaystyle=TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}, (9)

where the second inequality follows from plugging in the definition of CπeC_{\pi_{e}} in (7). The last line follows from Lemma 7.

For the second term in (8), using Lemma 4 for each f^{t}, 1\leq t\leq T, we get

t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right] t=1Th=0H1|𝔼s,adhπft[fht(s,a)𝒯fh+1t(s,a)]|\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f^{t}}}}\left[f^{t}_{h}(s,a)-{\mathcal{T}}f^{t}_{h+1}(s,a)\right]\right\rvert (10)
=t=1Th=0H1|Xh(ft),Wh(ft)|\displaystyle=\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(f^{t}),W_{h}(f^{t})\right\rangle\right\rvert
t=1Th=0H1Xh(ft)Σt1;h1Δon+λBW2,\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\Delta_{\mathrm{on}}+\lambda B_{W}^{2}},

where the second line follows from Definition 2, the third line follows from Lemma 8 and by plugging in the bound in Lemma 7. Using the bound in Lemma 6 in the above, we get that

t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right] 2dH2log(1+TBX2λd)(Δon+λBW2)T\displaystyle\leq\sqrt{2dH^{2}\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right)\cdot\left(\Delta_{\mathrm{on}}+\lambda B_{W}^{2}\right)\cdot T}
2dH2log(1+Td)(Δon+BX2BW2)T,\displaystyle\leq\sqrt{2dH^{2}\log\left(1+\frac{T}{d}\right)\cdot\left(\Delta_{\mathrm{on}}+B_{X}^{2}B_{W}^{2}\right)\cdot T}, (11)

where the second line follows by plugging in λ=BX2\lambda=B_{X}^{2}.

Combining the bound (9) and (11), we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} TCπeHΔoff+2dH2log(1+Td)(Δon+BX2BW2)T\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+\sqrt{2dH^{2}\log\left(1+\frac{T}{d}\right)\cdot\left(\Delta_{\mathrm{on}}+B_{X}^{2}B_{W}^{2}\right)\cdot T}

Plugging in the values of Δon\Delta_{\mathrm{on}} and Δoff\Delta_{\mathrm{off}} in the above, and using subadditivity of square-root, we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} 16VmaxCπeTHmofflog(2HT||δ)+16Vmax2dH2Tmonlog(1+Td)log(2HT||δ)\displaystyle\leq 16V_{\mathrm{max}}C_{\pi^{e}}T\sqrt{\frac{H}{m_{\mathrm{off}}}\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}+16V_{\mathrm{max}}\sqrt{\frac{2dH^{2}T}{m_{\mathrm{on}}}\log\left(1+\frac{T}{d}\right)\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}
+HBXBW2dTlog(1+Td).\displaystyle\qquad\qquad\qquad+HB_{X}B_{W}\sqrt{2dT\log\left(1+\frac{T}{d}\right)}.

Setting moff=Tm_{\mathrm{off}}=T and mon=1m_{\mathrm{on}}=1 in the above gives the cumulative suboptimality bound

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right). (12)

Proof of Corollary 1.

We next convert the above cumulative suboptimality bound into a sample complexity bound via a standard online-to-batch conversion. Setting \pi^{e}=\pi^{*} in (12) and defining the policy \widehat{\pi}=\text{Uniform}\left(\left\{\pi^{1},\dots,\pi^{T}\right\}\right), we get that

𝔼[VπVπ^]\displaystyle\operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right] =1T(t=1TVπVπt)\displaystyle=\frac{1}{T}\left(\sum_{t=1}^{T}V^{\pi^{*}}-V^{\pi^{t}}\right)
=O(max{Cπ,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle=O\left(\max\left\{C_{\pi^{*}},1\right\}V_{\mathrm{max}}\sqrt{\frac{dH^{2}}{T}\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right).

Thus, for T\geq\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right), we have

𝔼[VπVπ^]ϵ.\operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right]\leq\epsilon.

In these TT iterations, the total number of offline samples used is

moff=T=O~(max{Cπ2,1}Vmax2dH2log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{off}}=T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

and the total number of online samples used is

monHT=O~(max{Cπ2,1}Vmax2dH3log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{on}}\cdot H\cdot T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{3}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

where the additional HH factor appears because we collect monm_{\mathrm{on}} samples for every h[H]h\in[H] in the algorithm. ∎
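For reference, these expressions can be turned into a rough sample-size calculator; the constants and logarithmic factors suppressed by the \widetilde{O}(\cdot) notation are ignored below, so the numbers are order-of-magnitude only, and the function name and defaults are illustrative.

```python
import numpy as np


def hyq_sample_sizes(eps, C, V_max, d, H, log_F, delta=0.05):
    """Order-of-magnitude sample counts implied by Corollary 1 (m_off = T, m_on = 1).
    log_F stands for log|F|; constants and the T inside the logarithm are ignored."""
    T = max(C, 1.0) ** 2 * V_max ** 2 * d * H ** 2 * (log_F + np.log(H / delta)) / eps ** 2
    offline = T            # m_off = T, as in Corollary 1
    online = H * T         # m_on * H * T online samples with m_on = 1
    return int(np.ceil(T)), int(np.ceil(offline)), int(np.ceil(online))


# Example: epsilon = 0.1, transfer coefficient 2, V_max = 1, d = 10, H = 20, |F| = 1e6.
print(hyq_sample_sizes(eps=0.1, C=2.0, V_max=1.0, d=10, H=20, log_F=np.log(1e6)))
```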

A.3 V-type Bilinear Rank

Our previous results focus on the Q-type bilinear model. Here we provide the V-type bilinear rank definition, which is essentially the same as the low Bellman rank model proposed by Jiang et al. [2017].

Definition 5 (V-type Bilinear model).

Consider any pair of functions (f,g)(f,g) with f,gf,g\in\mathcal{F}. Denote the greedy policy of ff as πf={πhf:=argmaxafh(s,a),h}\pi^{f}=\{\pi^{f}_{h}:=\mathop{\mathrm{argmax}}_{a}f_{h}(s,a),\forall h\}. We say that the MDP together with the function \mathcal{F} admits a bilinear structure of rank dd if for any h[H1]h\in[H-1], there exist two (unknown) mappings Xh:dX_{h}:\mathcal{F}\mapsto\mathbb{R}^{d} and Wh:dW_{h}:\mathcal{F}\mapsto\mathbb{R}^{d} with maxfXh(f)2BX\max_{f}\|X_{h}(f)\|_{2}\leq B_{X} and maxfWh(f)2BW\max_{f}\|W_{h}(f)\|_{2}\leq B_{W}, such that:

\displaystyle\forall f,g\in\mathcal{F}:\;\left\lvert\mathbb{E}_{s\sim d_{h}^{\pi^{f}},a\sim\pi^{g}(s)}\left[g_{h}(s,a)-\mathcal{T}g_{h+1}(s,a)\right]\right\rvert=\left\lvert\left\langle X_{h}(f),W_{h}(g)\right\rangle\right\rvert.

Note that, unlike in the Q-type definition, here the action a is taken from the greedy policy with respect to g. This way \max_{a}g(s,a) can serve as an approximation of V^{\star}, hence the name V-type.

To make Hy-Q work for the V-type bilinear model, we only need a slight change to the data collection process: when we collect the online batch \mathcal{D}_{h}, we sample s\sim d^{\pi^{t}}_{h}, a\sim\text{Uniform}(\mathcal{A}), s^{\prime}\sim P(\cdot|s,a). Namely, the action at step h is taken uniformly at random. We provide the pseudocode in Algorithm 2 (a minimal code sketch of the data collection follows the pseudocode). We refer the reader to Du et al. [2021] and Jin et al. [2021a] for a detailed discussion.

Algorithm 2 V-type Hy-Q
0:  Value function class: \mathcal{F}, #iterations: TT, Offline dataset 𝒟hν\mathcal{D}^{\nu}_{h} of size moffm_{\mathrm{off}} for h[H1]h\in[H-1].
1:  Initialize fh1(s,a)=0f_{h}^{1}(s,a)=0.
2:  for t=1,,Tt=1,\dots,T do
3:     Let \pi^{t} be the greedy policy w.r.t. f^{t}, i.e., \pi_{h}^{t}(s)=\mathop{\mathrm{argmax}}_{a}f^{t}_{h}(s,a).
4:     For each h, collect m_{\mathrm{on}} online tuples \mathcal{D}^{t}_{h}\sim d_{h}^{\pi^{t}}\circ\textrm{Uniform}(\mathcal{A}). // Online collection with uniform actions at step h
5:     Set f_{H}^{t+1}(s,a)=0. // FQI using both online and offline data
6:     for h=H1,,0h=H-1,\dots,0 do
7:        Estimate fht+1f_{h}^{t+1} using least squares regression on the aggregated data:
fht+1argminfh{𝔼^𝒟hν(f(s,a)rmaxafh+1t+1(s,a))2+τ=1t𝔼^𝒟hτ(f(s,a)rmaxafh+1t+1(s,a))2}\displaystyle f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\left\{\widehat{\mathbb{E}}_{\mathcal{D}^{\nu}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}+\sum_{\tau=1}^{t}\widehat{\mathbb{E}}_{\mathcal{D}^{\tau}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}\right\} (13)
8:     end for
9:  end for
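The only change relative to the Q-type algorithm is in the online data collection (line 4 of Algorithm 2): roll in with the greedy policy up to step h and then act uniformly at random. Below is a minimal sketch of this collection step, assuming a classic Gym-style episodic environment with discrete actions; the function and argument names are illustrative.

```python
import random


def collect_vtype_tuple(env, greedy_policy, h, num_actions):
    """Collect one online tuple for step h as in line 4 of Algorithm 2:
    roll in with the greedy policy pi^t for h steps (so s_h ~ d_h^{pi^t}),
    then take a uniformly random action a_h ~ Uniform(A)."""
    s = env.reset()
    for step in range(h):
        s, _, done, *_ = env.step(greedy_policy(s, step))
        if done:
            # Episode ended before reaching step h; start over.
            return collect_vtype_tuple(env, greedy_policy, h, num_actions)
    a = random.randrange(num_actions)
    s_next, r, *_ = env.step(a)
    return (s, a, r, s_next)
```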

A.3.1 Complexity bound for V-type Bilinear models

In this section, we give a performance analysis of Algorithm 2, extending the results developed for Q-type bilinear models in Section A.2 to V-type bilinear models.

We first note the following bound for FQI estimates in Algorithm 2.

Lemma 9.

Let \delta\in(0,1). For h\in[H-1] and t\in[T], let f^{t+1}_{h} be the value function estimate for time step h computed via the least squares regression (13) on the datasets \left(\mathcal{D}^{\nu}_{h},\mathcal{D}^{1}_{h},\dots,\mathcal{D}^{t}_{h}\right) in iteration t of Algorithm 2. Then, with probability at least 1-\delta, for any h\in[H-1] and t\in[T],

fht+1𝒯fh+1t+12,νh21moff256Vmax2log(2HT||/δ)=:Δ¯off,\displaystyle\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|_{2,\nu_{h}}^{2}\leq\frac{1}{m_{\mathrm{off}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\bar{\Delta}_{\mathrm{off}},
and
τ=1tfht+1𝒯fh+1t+12,μhτ21mon256Vmax2log(2HT||/δ)=:Δ¯on,\displaystyle\sum_{\tau=1}^{t}\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|^{2}_{2,\mu^{\tau}_{h}}\leq\frac{1}{m_{\mathrm{on}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\bar{\Delta}_{\mathrm{on}},

where \nu_{h} denotes the offline data distribution at time h, and the distribution \mu^{\tau}_{h}\in\Delta(\mathcal{S}\times\mathcal{A}) is defined such that s\sim d^{\pi^{\tau}}_{h} and a\sim\mathrm{Uniform}(\mathcal{A}).

The following change in distribution lemma is the version of Lemma 8 under V-type Bellman rank assumption.

Lemma 10.

Suppose the underlying model is a V-type bilinear model. Then, for any t0t\geq 0 and h[H1]h\in[H-1], we have

|Wh(ft),Xh(ft)|Xh(ft)Σt1;h1|𝒜|i=1t1𝔼sdhπfi,aUniform(𝒜)[(fht𝒯fh+1t)2]+λBW2,\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left\lvert\mathcal{A}\right\rvert\cdot\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\mathrm{Uniform}(\mathcal{A})}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]+\lambda B_{W}^{2}},

where \Sigma_{t-1;h} is defined in (3).

Proof.

The proof closely follows the proof of Lemma 8. Repeating the analysis up to (6), we get that

|Wh(ft),Xh(ft)|\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert Xh(ft)Σt1;h1i=1t1|Wh(ft),Xh(fi)|2+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2}+\lambda B_{W}^{2}}
=Xh(ft)Σt1;h1i=1t1(𝔼sdhπfi,aπft(s)[fht𝒯fh+1t])2+λBW2\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left(\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\pi^{f^{t}}(s)}\left[f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right]\right)^{2}+\lambda B_{W}^{2}}
Xh(ft)Σt1;h1|𝒜|i=1t1𝔼sdhπfi,aUniform(𝒜)[(fht𝒯fh+1t)2]+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left\lvert\mathcal{A}\right\rvert\cdot\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\mathrm{Uniform}(\mathcal{A})}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]+\lambda B_{W}^{2}}

where the second line above follows from the definition of V-type bilinear model in Definition 5, and the last line holds because:

(𝔼sdhπfi,aπft(s)[fht𝒯fh+1t])2\displaystyle\left(\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\pi^{f^{t}}(s)}\left[f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right]\right)^{2} 𝔼sdhπfi,aπft(s)[(fht𝒯fh+1t)2]\displaystyle\leq\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},a\sim\pi^{f^{t}}(s)}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]
|𝒜|𝔼sdhπfi,aUniform(𝒜)[(fht𝒯fh+1t)2]\displaystyle\leq\left\lvert\mathcal{A}\right\rvert\cdot\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},a\sim\mathrm{Uniform}(\mathcal{A})}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]

where the first inequality above is due to Jensen’s inequality, and the last inequality follows from a straightforward upper bound (importance weighting against the uniform distribution over actions), since each term inside the expectation is non-negative. ∎

We are finally ready to state and prove our main result in this section.

Theorem 2 (Cumulative suboptimality bound for V-type bilinear rank models).

Let mon=|𝒜|m_{\mathrm{on}}=\left\lvert\mathcal{A}\right\rvert and moff=Tm_{\mathrm{off}}=T. Then, with probability at least 1δ1-\delta, the cumulative suboptimality of Algorithm 2 is bounded as

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ))\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right)
Proof.

The proof closely follows that of Theorem 1. Repeating the analysis up to (8) and (9), we get that:

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} TCπeHΔ¯off+t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)].\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\bar{\Delta}_{\mathrm{off}}}+\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right]. (14)

For the second term in the above, using Lemma 4 for each f^{t}, 1\leq t\leq T, we get

t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right] t=1Th=0H1|𝔼s,adhπft[fht(s,a)𝒯hfh+1t(s,a)]|\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f^{t}}}}\left[f^{t}_{h}(s,a)-{\mathcal{T}}_{h}f^{t}_{h+1}(s,a)\right]\right\rvert
=t=1Th=0H1|Xh(ft),Wh(ft)|\displaystyle=\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(f^{t}),W_{h}(f^{t})\right\rangle\right\rvert
t=1Th=0H1Xh(ft)Σt1;h1|𝒜|Δ¯on+λBW2,\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left\lvert\mathcal{A}\right\rvert\cdot\bar{\Delta}_{\mathrm{on}}+\lambda B_{W}^{2}},

where the second line follows from Definition 5, and the last line follows from Lemma 10 and by plugging in the bound in Lemma 9. Using the elliptical potential Lemma 6 as in the proof of Theorem 1, we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} TCπeHΔ¯off+2dH2log(1+Td)(|𝒜|Δ¯on+BX2BW2)T\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\bar{\Delta}_{\mathrm{off}}}+\sqrt{2dH^{2}\log\left(1+\frac{T}{d}\right)\cdot\left(\left\lvert\mathcal{A}\right\rvert\cdot\bar{\Delta}_{\mathrm{on}}+B_{X}^{2}B_{W}^{2}\right)\cdot T}

Plugging in the values of Δ¯on\bar{\Delta}_{\mathrm{on}} and Δ¯off\bar{\Delta}_{\mathrm{off}} from Lemma 9 in the above, and using subadditivity of square-root, we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} 16VmaxCπeTHmofflog(2HT||δ)+16Vmax2dH2|𝒜|Tmonlog(1+Td)log(2HT||δ)\displaystyle\leq 16V_{\mathrm{max}}C_{\pi^{e}}T\sqrt{\frac{H}{m_{\mathrm{off}}}\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}+16V_{\mathrm{max}}\sqrt{\frac{2dH^{2}\left\lvert\mathcal{A}\right\rvert T}{m_{\mathrm{on}}}\log\left(1+\frac{T}{d}\right)\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}
+HBXBW2dTlog(1+Td).\displaystyle\qquad\qquad\qquad+HB_{X}B_{W}\sqrt{2dT\log\left(1+\frac{T}{d}\right)}.

Setting mon=|𝒜|m_{\mathrm{on}}=\left\lvert\mathcal{A}\right\rvert and moff=Tm_{\mathrm{off}}=T, we get the following cumulative suboptimality bound:

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right). (15)

Corollary 2 (Sample complexity).

Under the assumptions of Theorem 2, if C_{\pi^{*}}<\infty, then Algorithm 2 finds an \epsilon-suboptimal policy \widehat{\pi}, i.e., V^{\pi^{*}}-V^{\widehat{\pi}}\leq\epsilon, with a total sample complexity of:

n=O~(max{Cπ2,1}Vmax2dH3|𝒜|log(HT||/δ)ϵ2).\displaystyle n=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{3}\left\lvert\mathcal{A}\right\rvert\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right).
Proof.

The result follows from a standard online-to-batch conversion. Setting \pi^{e}=\pi^{*} in (15) and defining the policy \widehat{\pi}=\text{Uniform}\left(\left\{\pi^{1},\dots,\pi^{T}\right\}\right), we get that

\displaystyle\operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right]=\frac{1}{T}\left(\sum_{t=1}^{T}V^{\pi^{*}}-V^{\pi^{t}}\right)=O\left(\max\left\{C_{\pi^{*}},1\right\}V_{\mathrm{max}}\sqrt{\frac{dH^{2}}{T}\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right).

Thus, the policy returned after T\geq\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right) iterations satisfies \operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right]\leq\epsilon. In these T iterations, the total number of offline samples used is

moff=T=O~(max{Cπ2,1}Vmax2dH2log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{off}}=T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

and the total number of online samples collected is

monHT=O~(max{Cπ2,1}Vmax2dH3|𝒜|log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{on}}\cdot H\cdot T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{3}\left\lvert\mathcal{A}\right\rvert\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

where the additional HH factor appears because we collect monm_{\mathrm{on}} samples for every h[H]h\in[H] in the algorithm. ∎

A.4 Bounds on transfer coefficient

Note that C_{\pi} takes both the distribution shift and the function class into account, and it is smaller than the existing density-ratio-based concentrability coefficients [Kakade and Langford, 2002, Munos and Szepesvári, 2008, Chen and Jiang, 2019] as well as the existing Bellman-error-based concentrability coefficient of Xie et al. [2021a]. We formalize this in the following lemma.

Lemma 11.

For any π\pi and offline distribution ν\nu,

Cπ\displaystyle C_{\pi} maxf,hfh𝒯fh+1dhπ2fh𝒯fh+1νh2suph,s,adhπ(s,a)νh(s,a).\displaystyle\leq\sqrt{\max_{f,h}\frac{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}\leq\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)}.
Proof.

Using Jensen’s inequality, we get that

Cπ\displaystyle C_{\pi} maxfh=0H1fh𝒯fh+1dhπ2h=0H1fh𝒯fh+1νh2\displaystyle\leq\sqrt{\max_{f}\frac{\sum_{h=0}^{H-1}\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\sum_{h=0}^{H-1}\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}
maxf,hfh𝒯fh+1dhπ2fh𝒯fh+1νh2\displaystyle\leq\sqrt{\max_{f,h}\frac{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}
suph,s,adhπ(s,a)νh(s,a)\displaystyle\leq\sqrt{\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)}}
suph,s,adhπ(s,a)νh(s,a),\displaystyle\leq\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)},

where the second line follows from the mediant inequality, the third line follows from a change of measure, and the last line holds whenever suph,s,adhπ(s,a)νh(s,a)1\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)}\geq 1, which is always the case since dhπd_{h}^{\pi} and νh\nu_{h} are both probability distributions. ∎

Next we show that in the linear Bellman complete setting, CπC_{\pi} is bounded by the relative condition number using the linear features.

Lemma 12.

Consider the linear Bellman complete setting (Definition 3) with known feature ϕ\phi. Suppose that the feature covariance matrix induced by the offline distribution ν\nu, Σνh:=𝔼s,aνh[ϕ(s,a)ϕ(s,a)]\Sigma_{\nu_{h}}\vcentcolon={}\mathbb{E}_{s,a\sim\nu_{h}}[\phi(s,a)\phi(s,a)^{\top}], is invertible. Then for any policy π\pi, we have

Cπ\displaystyle C_{\pi} maxh𝔼s,adhπϕ(s,a)Σνh12.\displaystyle\leq\sqrt{\max_{h}\mathbb{E}_{s,a\sim d^{\pi}_{h}}\|\phi(s,a)\|^{2}_{\Sigma_{\nu_{h}}^{-1}}}.
Proof.

Repeating the argument in Lemma 11, we have

Cπ\displaystyle C_{\pi} maxf,hfh𝒯fh+1dhπ2fh𝒯fh+1νh2\displaystyle\leq\sqrt{\max_{f,h}\frac{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}
maxw,hwhϕwhϕdhπ2whϕwhϕνh2\displaystyle\leq\sqrt{\max_{w,h}\frac{\|w_{h}^{\top}\phi-{w^{\prime}}_{h}^{\top}\phi\|^{2}_{d_{h}^{\pi}}}{\|w_{h}^{\top}\phi-{w^{\prime}}_{h}^{\top}\phi\|^{2}_{\nu_{h}}}}
maxw,h(whwh)Σνh2𝔼dhπϕΣνh12(whwh)ϕνh2\displaystyle\leq\sqrt{\max_{w,h}\frac{\|(w_{h}-w^{\prime}_{h})\|_{\Sigma_{\nu_{h}}}^{2}\mathbb{E}_{d^{\pi}_{h}}\|\phi\|^{2}_{\Sigma_{\nu_{h}}^{-1}}}{\|(w_{h}-w^{\prime}_{h})^{\top}\phi\|^{2}_{\nu_{h}}}}
=maxh𝔼s,adhπϕ(s,a)Σνh12.\displaystyle=\sqrt{\max_{h}\mathbb{E}_{s,a\sim d^{\pi}_{h}}\|\phi(s,a)\|^{2}_{\Sigma_{\nu_{h}}^{-1}}}.

Recall that in the linear Bellman complete setting, we can write any ff\in\mathcal{F} as wϕw^{\top}\phi, and for any ww that defines ff, there exists ww^{\prime} such that 𝒯f=wϕ\mathcal{T}f=w^{\prime\top}\phi; this justifies the second line above. ∎

Now we proceed to low-rank MDPs where the feature is unknown. We show that for low-rank MDPs, CπC_{\pi} is bounded by a partial feature coverage condition defined via the unknown ground truth feature.

Lemma 13.

Consider the low-rank MDP setting (Definition 4) where the transition dynamics PP is given by P(ss,a)=μ(s),ϕ(s,a)P(s^{\prime}\mid s,a)=\left\langle\mu^{\star}(s^{\prime}),\phi^{\star}(s,a)\right\rangle for feature maps μ\mu^{\star} and ϕ\phi^{\star} taking values in d\mathbb{R}^{d}. Suppose that the offline distribution ν=(ν0,,νH1)\nu=(\nu_{0},\dots,\nu_{H-1}) satisfies maxhmaxs,aπh(a|s)νh(a|s)α\max_{h}\max_{s,a}\frac{\pi_{h}(a|s)}{\nu_{h}(a|s)}\leq\alpha. Furthermore, suppose that ν\nu is induced via trajectories, i.e., ν0(s)=d0(s)\nu_{0}(s)=d_{0}(s) and νh(s)=𝔼s¯,a¯νh1P(s|s¯,a¯)\nu_{h}(s)=\mathbb{E}_{\bar{s},\bar{a}\sim\nu_{h-1}}P(s|\bar{s},\bar{a}) for any h1h\geq 1, and that the feature covariance matrix Σνh1,ϕ:=𝔼s,aνh1[ϕ(s,a)ϕ(s,a)]\Sigma_{\nu_{h-1},\phi^{\star}}\vcentcolon={}\mathbb{E}_{s,a\sim\nu_{h-1}}[\phi^{\star}(s,a)\phi^{\star}(s,a)^{\top}] is invertible (this is for notational simplicity only; we do not assume the eigenvalues are bounded below, so they may approach 0+0^{+}). Then for any policy π\pi, we have

Cπ\displaystyle C_{\pi} αh=1H𝔼s,adh1π[ϕ(s,a)Σνh1,ϕ1]+α.\displaystyle\leq\sqrt{\alpha}\sum_{h=1}^{H}\mathbb{E}_{s,a\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(s,a)\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\right]+\sqrt{\alpha}.
Proof.

We upper bound the numerator of the transfer coefficient, treating the cases h=0h=0 and h1h\geq 1 separately. First note that for h=0h=0,

𝔼s,ad0π[𝒯f1(s,a)f0(s,a)]\displaystyle\mathbb{E}_{s,a\sim d_{0}^{\pi}}\left[\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right] 𝔼sd0,aπ(|s)[(𝒯f1(s,a)f0(s,a))2]\displaystyle\leq\sqrt{\mathbb{E}_{s\sim d_{0},a\sim\pi(\cdot|s)}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}
maxs,ad0π(s,a)ν0(s,a)𝔼s,aν0[(𝒯f1(s,a)f0(s,a))2]\displaystyle\leq\sqrt{\max_{s,a}\frac{d^{\pi}_{0}(s,a)}{\nu_{0}(s,a)}\cdot\mathbb{E}_{s,a\sim\nu_{0}}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}
α𝔼s,aν0[(𝒯f1(s,a)f0(s,a))2],\displaystyle\leq\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_{0}}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}, (16)

where the last inequality follows from our assumption since maxs,ad0π(s,a)ν0(s,a)=maxs,aπ0(a|s)ν0(a|s)α\max_{s,a}\frac{d^{\pi}_{0}(s,a)}{\nu_{0}(s,a)}=\max_{s,a}\frac{\pi_{0}(a|s)}{\nu_{0}(a|s)}\leq\alpha.

Next, for any h1h\geq 1, backing up one step and considering the pair s¯,a¯\bar{s},\bar{a} that leads to the state ss, we get that

𝔼s,adhπ[𝒯fh+1(s,a)fh(s,a)]\displaystyle\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]
=𝔼s¯,a¯dh1π,sP(s¯,a¯),aπ(s)[𝒯fh+1(s,a)fh(s,a)]\displaystyle=\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi},s\sim P(\bar{s},\bar{a}),a\sim\pi(s)}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]
=𝔼s¯,a¯dh1π[(ϕ(s¯,a¯)μ(s))aπ(a|s)[𝒯fh+1(s,a)fh(s,a)]ds]\displaystyle=\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\int\left(\phi^{\star}(\bar{s},\bar{a})^{\top}\mu^{\star}(s)\right)\sum_{a}\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\text{d}s\right]
=𝔼s¯,a¯dh1π[ϕ(s¯,a¯)aμ(s)π(a|s)[𝒯fh+1(s,a)fh(s,a)]ds]\displaystyle=\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\phi^{\star}(\bar{s},\bar{a})^{\top}\int\sum_{a}\mu^{\star}(s)\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\text{d}s\right]
𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1aμ(s)π(a|s)[𝒯fh+1(s,a)fh(s,a)]dsΣνh1,ϕ],\displaystyle\leq\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\left\|\int\sum_{a}\mu^{\star}(s)\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\mathrm{d}s\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}}\right], (17)

where the last line follows from an application of the Cauchy-Schwarz inequality. For the term inside the expectation on the right hand side above, we note that

aμ(s)π(a|s)[𝒯fh+1(s,a)fh(s,a)]dsΣνh1,ϕ2\displaystyle\left\|\int\sum_{a}\mu^{\star}(s)\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\mathrm{d}s\right\|^{2}_{\Sigma_{\nu_{h-1},\phi^{\star}}}
=(i)𝔼s¯,a¯νh1[(a(μ(s)ϕ(s¯,a¯))π(a|s)(𝒯fh+1(s,a)fh(s,a))ds)2]\displaystyle\overset{\left(i\right)}{=}\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1}}\left[\left(\int\sum_{a}\left(\mu^{\star}(s)^{\top}\phi^{*}(\bar{s},\bar{a})\right)\pi(a|s)\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)\mathrm{d}s\right)^{2}\right]
=𝔼s¯,a¯νh1[(𝔼sP(s¯,a¯),aπ(s)[𝒯fh+1(s,a)fh(s,a)])2]\displaystyle=\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1}}\left[\left(\operatorname{\mathbb{E}}_{s\sim P(\bar{s},\bar{a}),a\sim\pi(s)}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\right)^{2}\right]
(ii)𝔼s¯,a¯νh1,sP(s¯,a¯),aπ(s)[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\overset{\left(ii\right)}{\leq{}}\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1},s\sim P(\bar{s},\bar{a}),a\sim\pi(s)}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]
=(iii)𝔼sνh,aπ(s)[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\overset{\left(iii\right)}{=}\operatorname{\mathbb{E}}_{s\sim\nu_{h},a\sim\pi(s)}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]
(iv)α𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\overset{\left(iv\right)}{\leq{}}\alpha\cdot\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right] (18)

where (i)\left(i\right) follows by expanding the norm, (ii)\left(ii\right) follows from an application of Jensen’s inequality, and (iii)\left(iii\right) is due to our assumption that the offline dataset is generated using trajectories such that νh(s)=𝔼s¯,a¯νh1[P(ss¯,a¯)]\nu_{h}(s)=\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1}}\left[P(s\mid\bar{s},\bar{a})\right]. Finally, (iv)\left(iv\right) follows from the definition of α\alpha. Plugging (18) in (17), we get that for h1h\geq 1,

𝔼s,adhπ[𝒯fh+1(s,a)fh(s,a)]\displaystyle\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]
𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1α𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]]\displaystyle\leq\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\sqrt{\alpha\cdot\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}\right] (19)

We are now ready to bound the transfer coefficient. First note that using (16), for any ff,

𝔼s,ad0π[𝒯f1(s,a)f0(s,a)]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\frac{\mathbb{E}_{s,a\sim d_{0}^{\pi}}\left[\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}} α𝔼s,aν0[(𝒯f1(s,a)f0(s,a))2]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\leq\frac{\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_{0}}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}
α.\displaystyle\leq\sqrt{\alpha}.

Furthermore, for any ff, using (19), we get that

h=1H1𝔼s,adhπ[𝒯fh+1(s,a)fh(s,a)]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\frac{\sum_{h=1}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}
h=1H1𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1α𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]]\displaystyle\leq\sum_{h=1}^{H-1}\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\frac{\sqrt{\alpha\cdot\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}\right]
h=1H𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1α],\displaystyle\leq\sum_{h=1}^{H}\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\sqrt{\alpha}\right],

where the last line uses that each single-step term in the numerator is at most the full sum over hh in the denominator, and holds for an appropriate choice of λ\lambda (e.g., λ=0\lambda=0). Combining the above two bounds in the definition of CπC_{\pi}, we get that

Cπ\displaystyle C_{\pi} αh=1H𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1]+α.\displaystyle\leq\sqrt{\alpha}\sum_{h=1}^{H}\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\right]+\sqrt{\alpha}.

Note that in the above result, the transfer coefficient is upper bounded by the relative coverage under the unknown feature ϕ\phi^{\star} and a term α\alpha related to the action coverage, i.e., maxhmaxs,aπh(a|s)νh(a|s)α\max_{h}\max_{s,a}\frac{\pi_{h}(a|s)}{\nu_{h}(a|s)}\leq\alpha. This matches the coverage condition used in prior offline RL works for low-rank MDPs [Uehara and Sun, 2021].

Appendix B Auxiliary Lemmas

In this section, we provide a few results and their proofs that we used in the previous sections. We begin with the following form of Freedman’s inequality, which is a modification of a similar inequality in Beygelzimer et al. [2011].

Lemma 14 (Freedman’s Inequality).

Let {X1,,XT}\left\{X_{1},\dots,X_{T}\right\} be a sequence of real-valued random variables where each XtX_{t} is sampled from some process that depends on all previous instances, i.e., XtρtX_{t}\sim\rho_{t} with ρt=ρt(X1:t1)\rho_{t}=\rho_{t}(X_{1:t-1}). Further, suppose that |Xt|R\left\lvert X_{t}\right\rvert\leq R almost surely for all tTt\leq T. Then, for any δ>0\delta>0 and λ[0,1/2R]\lambda\in[0,1/2R], with probability at least 1δ1-\delta,

|t=1TXt𝔼[Xtρt]|\displaystyle\left\lvert\sum_{t=1}^{T}X_{t}-\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right\rvert λt=1T(2R|𝔼[Xtρt]|+𝔼[Xt2ρt])+log(2/δ)λ.\displaystyle\leq\lambda\sum_{t=1}^{T}\left(2R\left\lvert\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right\rvert+\operatorname{\mathbb{E}}\left[X_{t}^{2}\mid\rho_{t}\right]\right)+\frac{\log(2/\delta)}{\lambda}.
Proof.

Define the random variable Zt=Xt𝔼[Xtρt]Z_{t}=X_{t}-\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]. Clearly, {Zt}t=1T\left\{Z_{t}\right\}_{t=1}^{T} is a martingale difference sequence. Furthermore, we have that for any tt, |Zt|2R\left\lvert Z_{t}\right\rvert\leq 2R and that

𝔼[Zt2ρt]\displaystyle\operatorname{\mathbb{E}}\left[Z_{t}^{2}\mid\rho_{t}\right] =𝔼[(Xt𝔼[Xtρt])2ρt]2R|𝔼[Xtρt]|+𝔼[Xt2ρt].\displaystyle=\operatorname{\mathbb{E}}\left[\left(X_{t}-\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right)^{2}\mid\rho_{t}\right]\leq 2R\left\lvert\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right\rvert+\operatorname{\mathbb{E}}\left[X_{t}^{2}\mid\rho_{t}\right]. (20)

where the last inequality holds because |Xt|R\left\lvert X_{t}\right\rvert\leq R.

Using the form of Freedman’s inequality in Beygelzimer et al. [2011, Theorem 1], we get that for any λ[0,1/2R]\lambda\in[0,1/2R],

|t=1TZt|λt=1T𝔼[Zt2ρt]+log(2/δ)λ.\displaystyle\left\lvert\sum_{t=1}^{T}Z_{t}\right\rvert\leq\lambda\sum_{t=1}^{T}\operatorname{\mathbb{E}}\left[Z_{t}^{2}\mid\rho_{t}\right]+\frac{\log(2/\delta)}{\lambda}.

Plugging in the form of ZtZ_{t} and using (20), we get the desired statement. ∎

Next we give a formal proof of Lemma 3, which gives a generalization bound for least squares regression when the samples are adapted to an increasing filtration (and are not necessarily i.i.d.). The proof is similar to that of Agarwal et al. [2019, Lemma A.11].

Lemma 15 (Lemma 3 restated: Least squares generalization bound).

Let R>0R>0 and δ(0,1)\delta\in(0,1), and consider a sequential function estimation setting with an instance space 𝒳\mathcal{X} and target space 𝒴\mathcal{Y}. Let \mathcal{H} be a class of real-valued functions mapping 𝒳\mathcal{X} to [R,R][-R,R]. Let 𝒟={(x1,y1),,(xT,yT)}\mathcal{D}=\left\{(x_{1},y_{1}),\dots,(x_{T},y_{T})\right\} be a dataset of TT points where xtρt=ρt(x1:t1,y1:t1)x_{t}\sim\rho_{t}=\rho_{t}(x_{1:t-1},y_{1:t-1}), and yty_{t} is generated as

yt=h(xt)+εt,\displaystyle y_{t}=h^{*}(x_{t})+\varepsilon_{t},

where the function hh^{*} satisfies approximate realizability, i.e.,

infh1Tt=1T𝔼xρt[(h(x)h(x))2]γ,\inf_{h\in\mathcal{H}}\frac{1}{T}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[\left(h^{*}(x)-h(x)\right)^{2}\right]\leq\gamma,

and {εt}t=1T\left\{\varepsilon_{t}\right\}_{t=1}^{T} are independent mean-zero noise variables, so that 𝔼[ytxt]=h(xt)\mathbb{E}[y_{t}\mid x_{t}]=h^{\ast}(x_{t}). Additionally, suppose that maxt|yt|R\max_{t}\lvert y_{t}\rvert\leq R and maxx|h(x)|R\max_{x}\left\lvert h^{*}(x)\right\rvert\leq R. Then the least squares solution h^argminht=1T(h(xt)yt)2\widehat{h}\leftarrow\mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\sum_{t=1}^{T}\left(h(x_{t})-y_{t}\right)^{2} satisfies, with probability at least 1δ1-\delta,

t=1T𝔼xρt[(h^(x)h(x))2]\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[(\widehat{h}(x)-h^{*}(x))^{2}\right] 3γT+256R2log(2||/δ).\displaystyle\leq 3\gamma T+256R^{2}\log(2\lvert\mathcal{H}\rvert/\delta).
Proof.

Consider any fixed function hh\in\mathcal{H} and define the random variable

Zth:=(h(xt)yt)2(h(xt)yt)2.\displaystyle Z_{t}^{h}\vcentcolon={}\left(h(x_{t})-y_{t}\right)^{2}-\left(h^{*}(x_{t})-y_{t}\right)^{2}.

Define the notation 𝔼[ρt]\operatorname{\mathbb{E}}\left[\cdot\mid\rho_{t}\right] to denote 𝔼xtρt[]\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\cdot\right], and note that

𝔼[Zthρt]\displaystyle\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right] =𝔼xtρt[(h(xt)h(xt))(h(xt)+h(xt)2yt)]=𝔼xtρt[(h(xt)h(xt))2],\displaystyle=\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)\left(h(x_{t})+h^{*}(x_{t})-2y_{t}\right)\right]=\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right], (21)

where the last line holds because 𝔼[ytxt]=h(xt)\operatorname{\mathbb{E}}\left[y_{t}\mid x_{t}\right]=h^{*}(x_{t}). Furthermore, we also have that

𝔼[(Zth)2ρt]\displaystyle\operatorname{\mathbb{E}}\left[(Z_{t}^{h})^{2}\mid\rho_{t}\right] =𝔼xtρt[(h(xt)h(xt))2(h(xt)+h(xt)2yt)2]\displaystyle=\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\left(h(x_{t})+h^{*}(x_{t})-2y_{t}\right)^{2}\right]
16R2𝔼xtρt[(h(xt)h(xt))2].\displaystyle\leq 16R^{2}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]. (22)

Now we can note that the sequence of random variables {Z1h,,ZTh}\left\{Z^{h}_{1},\dots,Z^{h}_{T}\right\} satisfies the condition in Lemma 14 with |Zth|4R2\left\lvert Z_{t}^{h}\right\rvert\leq 4R^{2}. Thus we get that for any λ[0,1/8R2]\lambda\in[0,1/8R^{2}] and δ>0\delta>0, with probability at least 1δ1-\delta,

|t=1TZth𝔼[Zthρt]|\displaystyle\left\lvert\sum_{t=1}^{T}Z^{h}_{t}-\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right]\right\rvert λt=1T(8R2|𝔼[Zthρt]|+𝔼[(Zth)2ρt])+log(2/δ)λ\displaystyle\leq\lambda\sum_{t=1}^{T}\left(8R^{2}\left\lvert\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right]\right\rvert+\operatorname{\mathbb{E}}\left[\left(Z^{h}_{t}\right)^{2}\mid\rho_{t}\right]\right)+\frac{\log(2/\delta)}{\lambda}
32λR2t=1T𝔼xtρt[(h(xt)h(xt))2]+log(2/δ)λ,\displaystyle\leq 32\lambda R^{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]+\frac{\log(2/\delta)}{\lambda},

where the last inequality uses (21) and (22). Setting λ=1/64R2\lambda=1/64R^{2} in the above, and taking a union bound over hh, we get that for any hh\in\mathcal{H} and δ>0\delta>0, with probability at least 1δ1-\delta,

|t=1TZth𝔼[Zthρt]|12t=1T𝔼xtρt[(h(xt)h(xt))2]+64R2log(2||/δ).\displaystyle\left\lvert\sum_{t=1}^{T}Z^{h}_{t}-\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right]\right\rvert\leq\frac{1}{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta).

Rearranging the terms and using (21) in the above implies that,

t=1TZth32t=1T𝔼xtρt[(h(xt)h(xt))2]+64R2log(2||/δ)\displaystyle\sum_{t=1}^{T}Z_{t}^{h}\leq\frac{3}{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta)
and
t=1T𝔼xtρt[(h(xt)h(xt))2]2t=1TZth+128R2log(2||/δ).\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]\leq 2\sum_{t=1}^{T}Z_{t}^{h}+128R^{2}\log(2\lvert\mathcal{H}\rvert/\delta). (23)

For the rest of the proof, we condition on the event that (23) holds for all hh\in\mathcal{H}.

Define the function h~:=argminht=1T𝔼xtρt[(h(xt)h(xt))2]\widetilde{h}\vcentcolon={}\mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[(h(x_{t})-h^{*}(x_{t}))^{2}\right]. Using (23), we get that

t=1TZth~\displaystyle\sum_{t=1}^{T}Z_{t}^{\widetilde{h}} 32t=1T𝔼xtρt[(h~(xt)h(xt))2]+64R2log(2||/δ)\displaystyle\leq\frac{3}{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[(\widetilde{h}(x_{t})-h^{*}(x_{t}))^{2}\right]+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta)
32γT+64R2log(2||/δ),\displaystyle\leq\frac{3}{2}\gamma T+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta),

where the last inequality follows from the approximate realizability assumption. Let h^\widehat{h} denote the least squares solution on dataset {(xt,yt)}tT\left\{(x_{t},y_{t})\right\}_{t\leq T}. By definition, we have that

t=1TZth^\displaystyle\sum_{t=1}^{T}Z_{t}^{\widehat{h}} =t=1T[(h^(xt)yt)2(h(xt)yt)2]t=1T[(h~(xt)yt)2(h(xt)yt)2]=t=1TZth~.\displaystyle=\sum_{t=1}^{T}\left[(\widehat{h}(x_{t})-y_{t})^{2}-(h^{*}(x_{t})-y_{t})^{2}\right]\leq\sum_{t=1}^{T}\left[(\widetilde{h}(x_{t})-y_{t})^{2}-(h^{*}(x_{t})-y_{t})^{2}\right]=\sum_{t=1}^{T}Z_{t}^{\widetilde{h}}.

Combining the above two relations, we get that

t=1TZth^\displaystyle\sum_{t=1}^{T}Z_{t}^{\widehat{h}} 32γT+64R2log(2||/δ).\displaystyle\leq\frac{3}{2}\gamma T+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta). (24)

Finally, using (23) for the function h^\widehat{h}, we get that

t=1T𝔼xtρt[(h^(xt)h(xt))2]\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[(\widehat{h}(x_{t})-h^{*}(x_{t}))^{2}\right] 2t=1TZth^+128R2log(2||/δ)\displaystyle\leq 2\sum_{t=1}^{T}Z_{t}^{\widehat{h}}+128R^{2}\log(2\lvert\mathcal{H}\rvert/\delta)
3γT+256R2log(2||/δ),\displaystyle\leq 3\gamma T+256R^{2}\log(2\lvert\mathcal{H}\rvert/\delta),

where the last inequality uses the relation (24). ∎

Appendix C Low Bellman Eluder Dimension problems

In this section, we consider problems with low Bellman Eluder dimension [Jin et al., 2021a]. This complexity measure is a distributional version of the Eluder dimension applied to the class of Bellman residuals induced by \mathcal{F}. We show that our algorithm Hy-Q gives a similar performance guarantee for problems with small Bellman Eluder dimension. This demonstrates that Hy-Q applies to some of the most general model-free RL frameworks known in the RL literature so far.

We first introduce the key definitions:

Definition 6 (ε\varepsilon-independence between distributions [Jin et al., 2021a]).

Let 𝒢\mathcal{G} be a class of functions defined on a space 𝒳\mathcal{X}, and ν,μ1,,μn\nu,\mu_{1},\dots,\mu_{n} be probability measures over 𝒳\mathcal{X}. We say ν\nu is ε\varepsilon-independent of {μ1,μ2,,μn}\{\mu_{1},\mu_{2},\dots,\mu_{n}\} with respect to 𝒢\mathcal{G} if there exists g𝒢g\in\mathcal{G} such that i=1n(𝔼μi[g])2ε\sqrt{\sum_{i=1}^{n}(\mathbb{E}_{\mu_{i}}[g])^{2}}\leq\varepsilon, but |𝔼ν[g]|>ε|\mathbb{E}_{\nu}[g]|>\varepsilon.

Definition 7 (Distributional Eluder (DE) dimension).

Let 𝒢\mathcal{G} be a function class defined on 𝒳\mathcal{X}, and 𝒫\mathcal{P} be a family of probability measures over 𝒳\mathcal{X}. The distributional Eluder dimension dimDE(𝒢,𝒫,ε)\dim_{\operatorname{DE}}(\mathcal{G},\mathcal{P},\varepsilon) is the length of the longest sequence {ρ1,,ρn}𝒫\{\rho_{1},\dots,\rho_{n}\}\subset\mathcal{P} such that there exists εε\varepsilon^{\prime}\geq\varepsilon where ρi\rho_{i} is ε\varepsilon^{\prime}-independent of {ρ1,,ρi1}\{\rho_{1},\dots,\rho_{i-1}\} for all i[n]i\in[n].

Definition 8 (Bellman Eluder (BE) dimension [Jin et al., 2021a]).

Given a value function class \mathcal{F}, let 𝒢h:={fh𝒯fh+1fhh,fh+1h+1}\mathcal{G}_{h}\vcentcolon={}\left\{f_{h}-\mathcal{T}f_{h+1}\mid f_{h}\in\mathcal{F}_{h},f_{h+1}\in\mathcal{F}_{h+1}\right\} be the set of Bellman residuals induced by \mathcal{F} at step hh, and 𝒫={𝒫h}h=1H\mathcal{P}=\{\mathcal{P}_{h}\}_{h=1}^{H} be a collection of HH probability measure families over 𝒳×𝒜\mathcal{X}\times\mathcal{A}. The ε\varepsilon-Bellman Eluder dimension of \mathcal{F} with respect to 𝒫\mathcal{P} is defined as

dimBE(,𝒫,ε):=maxh[H]dimDE(𝒢h,𝒫h,ε).\displaystyle\dim_{\operatorname{BE}}(\mathcal{F},\mathcal{P},\varepsilon):=\max_{h\in[H]}\dim_{\operatorname{DE}}(\mathcal{G}_{h},\mathcal{P}_{h},\varepsilon)\,.

We also note the following lemma that controls the rate at which Bellman error accumulates.

Lemma 16 (Lemma 41, [Jin et al., 2021a]).

Let 𝒢\mathcal{G} be a function class defined on a space 𝒳\mathcal{X} with supg𝒢,x𝒳|g(x)|C\sup_{g\in\mathcal{G},x\in\mathcal{X}}\left\lvert g(x)\right\rvert\leq C, and let 𝒫\mathcal{P} be a set of probability measures over 𝒳\mathcal{X}. Suppose that the sequences {gk}k=1K𝒢\left\{g_{k}\right\}_{k=1}^{K}\subset\mathcal{G} and {μk}k=1K𝒫\left\{\mu_{k}\right\}_{k=1}^{K}\subset\mathcal{P} satisfy t=1k1(𝔼μt[gk])2β\sum_{t=1}^{k-1}\left(\operatorname{\mathbb{E}}_{\mu_{t}}\left[g_{k}\right]\right)^{2}\leq\beta for all k[K]k\in[K]. Then, for all k[K]k\in[K] and γ>0\gamma>0,

t=1k|𝔼μt[gt]|O(dimDE(𝒢,𝒫,γ)βk+min{k,dimDE(𝒢,𝒫,γ)C}+kγ).\displaystyle\sum_{t=1}^{k}\left\lvert\operatorname{\mathbb{E}}_{\mu_{t}}\left[g_{t}\right]\right\rvert\leq O\left(\sqrt{\mathrm{dim}_{\mathrm{DE}}(\mathcal{G},\mathcal{P},\gamma)\beta k}+\min\left\{k,\mathrm{dim}_{\mathrm{DE}}(\mathcal{G},\mathcal{P},\gamma)C\right\}+k\gamma\right).

We next state our main theorem whose proof is similar to that of Theorem 1.

Theorem 3 (Cumulative suboptimality).

Fix δ(0,1)\delta\in(0,1), moff=HT/dm_{\mathrm{off}}=HT/d and mon=H2m_{\mathrm{on}}=H^{2}, and suppose that the underlying MDP admits Bellman Eluder dimension dd and that the function class \mathcal{F} satisfies Assumption 1. Then with probability at least 1δ1-\delta, Algorithm 1 obtains the following bound on cumulative suboptimality w.r.t. any comparator policy πe\pi^{e},

t=1TVπeVπt=O~(Vmaxmax{Cπe,1}dTlog(H||/δ)),\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}}=\widetilde{O}\left(V_{\mathrm{max}}\max\left\{C_{\pi^{e}},1\right\}\sqrt{dT\cdot\log\left(H\lvert\mathcal{F}\rvert/\delta\right)}\right),

where πt=πft\pi^{t}=\pi^{f^{t}} is the greedy policy w.r.t. ftf^{t} at round tt and d=dimBE(,𝒫,1/T)d=\mathrm{dim}_{\mathrm{BE}}(\mathcal{F},\mathcal{P}_{\mathcal{F}},1/\sqrt{T}). Here 𝒫\mathcal{P}_{\mathcal{F}} is the class of occupancy measures that can be induced by greedy policies w.r.t. value functions in \mathcal{F}.

Proof.

Repeating the analysis up to (10) in the proof of Theorem 1, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} TCπeHΔoff+t=1Th=0H1|𝔼s,adhπft[fht(s,a)𝒯hfh+1t(s,a)]|\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f^{t}}}}\left[f^{t}_{h}(s,a)-{\mathcal{T}}_{h}f^{t}_{h+1}(s,a)\right]\right\rvert

Using the bound in Lemma 7 and Lemma 16 in the above, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} TCπeHΔoff+h=0H1dimDE(𝒢h,𝒫;h,γ)ΔonT\displaystyle\lesssim TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+\sum_{h=0}^{H-1}\sqrt{\mathrm{dim}_{\mathrm{DE}}(\mathcal{G}_{h},\mathcal{P}_{\mathcal{F};h},\gamma)\Delta_{\mathrm{on}}T}
+min{T,dimDE(𝒢h,𝒫;h,γ)C}+Tγ.\displaystyle\qquad\qquad\qquad\qquad+\min\left\{T,\mathrm{dim}_{\mathrm{DE}}(\mathcal{G}_{h},\mathcal{P}_{\mathcal{F};h},\gamma)C\right\}+T\gamma.

where 𝒢h:=(fh𝒯fh+1fh,fh+1h+1)\mathcal{G}_{h}\vcentcolon={}\left(f_{h}-\mathcal{T}f_{h+1}\mid f\in\mathcal{F}_{h},f_{h+1}\in\mathcal{F}_{h+1}\right) denotes the set of Bellman residuals induced by \mathcal{F} at step hh, and 𝒫={𝒫;h}h=1H\mathcal{P}=\{\mathcal{P}_{\mathcal{F};h}\}_{h=1}^{H} is the collection of occupancy measures at step hh induced by greedy policies w.r.t. value functions in \mathcal{F}. We set γ=1/T\gamma=1/\sqrt{T} and define d=dimBE(,𝒫,γ)=maxhdimDE(𝒢h,𝒫;h,γ)d=\mathrm{dim}_{\mathrm{BE}}(\mathcal{F},\mathcal{P},\gamma)=\max_{h}\mathrm{dim}_{\mathrm{DE}}(\mathcal{G}_{h},\mathcal{P}_{\mathcal{F};h},\gamma). Ignoring the lower order terms, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} TCπeHΔoff+HdΔonT\displaystyle\lesssim TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+H\sqrt{d\Delta_{\mathrm{on}}T}
TCπeVmaxHlog(HT||/δ)moff+HVmaxdTlog(HT||/δ)mon,\displaystyle\lesssim TC_{\pi^{e}}V_{\mathrm{max}}\cdot\sqrt{H\cdot\frac{\log(HT\lvert\mathcal{F}\rvert/\delta)}{m_{\mathrm{off}}}}+HV_{\mathrm{max}}\sqrt{dT\cdot\frac{\log(HT\lvert\mathcal{F}\rvert/\delta)}{m_{\mathrm{on}}}},

where \lesssim hides lower order terms, multiplicative constants, and log factors. Setting moff=HT/dm_{\mathrm{off}}=HT/d and mon=H2m_{\mathrm{on}}=H^{2}, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} =O~(CπeVmaxdTlog(HT||/δ)).\displaystyle=\widetilde{O}\left(C_{\pi^{e}}V_{\mathrm{max}}\sqrt{dT\log(HT\lvert\mathcal{F}\rvert/\delta)}\right). ∎

Appendix D Comparison with previous works

As mentioned in the main text, many previous empirical works consider combining offline expert demonstrations with online interaction [Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017, Lee et al., 2022, Jia et al., 2022, Niu et al., 2022]. The idea of running an RL algorithm on both offline data (expert demonstrations) and online data has thus been explored in some of these works: for example, Vecerik et al. [2017] runs DDPG on both the online and expert data, and Hester et al. [2018] uses DQN on both datasets but with an additional supervised loss. Since we already compared with Hester et al. [2018] in the experiments, here we focus our discussion on Vecerik et al. [2017].

We first emphasize that Vecerik et al. [2017] focuses only on expert demonstrations and their experiments rely entirely on them, while we focus on more general offline datasets that do not necessarily come from experts. That said, the DDPG-based algorithm from Vecerik et al. [2017] can potentially be used when the offline data is not from experts. Although the algorithm from Vecerik et al. [2017] and Hy-Q share the same high-level intuition that one should perform RL on both datasets, there are still a few differences: (1) Hy-Q uses Q-learning instead of deterministic policy gradients; note that deterministic policy gradient methods cannot be directly applied to the discrete action setting. (2) Hy-Q does not require an n-step TD style update: in the off-policy case, without proper importance weighting, n-step TD can incur a strong bias, and while careful tuning of n can balance bias and variance, no such tuning is needed in Hy-Q. (3) The idea of keeping a non-zero ratio for sampling from the offline dataset is also proposed in Vecerik et al. [2017]. Our buffer ratio is derived from our theoretical analysis, and it also corroborates the advantage of the similar heuristic applied in Vecerik et al. [2017]. (4) In their experiments, Vecerik et al. [2017] only considers expert demonstrations. In our experiments, we considered offline datasets with different amounts of transitions from very low-quality policies and showed that Hy-Q is robust to low-quality transitions in the offline data. Note that some of the differences may seem minor at the implementation level, but they may be important to the theory.

Regarding the experiments, our experimental evaluation adds the following insights over those in Vecerik et al. [2017]: (i) hybrid methods can succeed without expert data, (ii) hybrid methods can succeed in hard-exploration discrete-action tasks, and (iii) the choice of core algorithm (QQ-learning vs. DDPG) is not essential, although some details may matter. Due to the similarity between the two methods, we believe some of these insights may also translate to Vecerik et al. [2017], and we expect that the choice between Hy-Q and Hy-DDPG will be environment specific, as it is with the purely online versions of these methods. The fact that QQ-learning works in some situations does not immediately imply that deterministic policy gradient methods work there, nor vice versa. Nevertheless, it is beyond the scope of this paper to rigorously verify this claim, and we deem the study of actor-critic algorithms in the hybrid RL setting an interesting future direction.

Appendix E Experiment Details

E.1 Combination Lock

In this section we provide a detailed description of the combination lock experiment. The combination lock environment has horizon HH and 10 actions at each state. There are three latent states zi,h,i{0,1,2}z_{i,h},i\in\{0,1,2\} for each timestep hh, where zi,h,i{0,1}z_{i,h},i\in\{0,1\} are good states and z2,hz_{2,h} is the bad state. For each good state, we randomly pick a good action ai,ha_{i,h}, such that in latent state zi,h,i{0,1}z_{i,h},i\in\{0,1\}, taking the good action ai,ha_{i,h} results in a 0.5 probability of transiting to z0,h+1z_{0,h+1} and a 0.5 probability of transiting to z1,h+1z_{1,h+1}, while taking any other action results in a deterministic transition to z2,h+1z_{2,h+1}. At z2,hz_{2,h}, all actions lead deterministically to z2,h+1z_{2,h+1}. For the reward, we give an optimal reward of 1 for landing in zi,H,i{0,1}z_{i,H},i\in\{0,1\}. We also give an anti-shaped reward of 0.1 for all transitions from a good state to a bad state. All other transitions have a reward of 0. The initial distribution is uniform over z0,0z_{0,0} and z1,0z_{1,0}. The observation space has dimension 2log(H+1)2^{\left\lceil{\log(H+1)}\right\rceil}, created by concatenating a one-hot representation of the latent state and a one-hot representation of the horizon (appending 0s if necessary). Random noise from 𝒩(0,0.1)\mathcal{N}(0,0.1) is added to each dimension, and finally the observation is multiplied by a Hadamard matrix. Note that in this environment, the agent needs to act optimally for all HH timesteps to reach a final good state and receive the optimal reward of 1. Once the agent chooses a bad action, it stays in the bad state until the end of the episode, and the trajectory receives at most the 0.1 reward obtained while transiting from a good state to a bad state.
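For concreteness, the dynamics above can be summarized by the following minimal Python sketch. The class name, the padding of the observation vector to a power of two, and reading 𝒩(0,0.1) as a standard deviation are illustrative assumptions and may differ from the implementation used in our experiments.

import numpy as np

class CombinationLock:
    # Minimal sketch of the combination lock dynamics described above.
    def __init__(self, horizon, n_actions=10, noise_std=0.1, seed=0):
        self.H, self.A = horizon, n_actions
        self.rng = np.random.default_rng(seed)
        # one randomly chosen good action per good latent state and timestep
        self.good_actions = self.rng.integers(0, n_actions, size=(2, horizon))
        # pad the concatenated one-hot encodings up to a power of two
        self.obs_dim = 1 << int(np.ceil(np.log2(3 + horizon + 1)))
        self.hadamard = self._hadamard(self.obs_dim)
        self.noise_std = noise_std

    @staticmethod
    def _hadamard(n):
        h = np.array([[1.0]])
        while h.shape[0] < n:
            h = np.block([[h, h], [h, -h]])
        return h

    def reset(self):
        self.h = 0
        self.z = int(self.rng.integers(0, 2))  # uniform over the two good states
        return self._obs()

    def step(self, action):
        reward = 0.0
        if self.z < 2:  # currently in a good latent state
            if action == self.good_actions[self.z, self.h]:
                self.z = int(self.rng.integers(0, 2))  # move to a random good state
            else:
                self.z = 2      # fall into the absorbing bad state
                reward = 0.1    # anti-shaped reward for a good -> bad transition
        self.h += 1
        done = (self.h == self.H)
        if done and self.z < 2:
            reward = 1.0        # optimal reward for ending in a good state
        return self._obs(), reward, done

    def _obs(self):
        x = np.zeros(self.obs_dim)
        x[self.z] = 1.0          # one-hot latent state (3 dims)
        x[3 + self.h] = 1.0      # one-hot timestep
        x += self.rng.normal(0.0, self.noise_std, size=self.obs_dim)
        return self.hadamard @ x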

E.2 Implementation Details of Combination Lock experiment

We train HH separate Q-functions for all HH timesteps. Our function class consists of an encoder and a decoder. For the encoder, we feed the observation into one linear layer with 3 outputs, followed by a softmax layer to get a state-representation. This design of encoder is intended to learn a one-hot representation of the latent state. We take a Kronecker Product of the state-representation and the action, and feed the result to a linear layer with only one output, which will be our Q value. In order to stabilize the training, we warm-start the Q-function of timestep h1h-1 with the encoder from the Q-function at timestep hh of the current iteration and the decoder from the Q-function at time step h1h-1 of the previous iteration, for each iteration of training.
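Below is a minimal PyTorch-style sketch of one such per-timestep Q-function; the one-hot action encoding and the exact layer shapes are assumptions for illustration and may differ from our released code.

import torch
import torch.nn as nn

class LockQFunction(nn.Module):
    # Sketch of the per-timestep Q-function: a linear encoder with softmax
    # producing a 3-dimensional state representation, followed by a linear
    # decoder applied to the Kronecker product of the representation and a
    # one-hot action encoding.
    def __init__(self, obs_dim, n_actions=10, n_latent=3):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, n_latent)
        self.decoder = nn.Linear(n_latent * n_actions, 1)

    def forward(self, obs, action_onehot):
        z = torch.softmax(self.encoder(obs), dim=-1)                   # (B, 3)
        feat = torch.einsum("bi,bj->bij", z, action_onehot).flatten(1)
        return self.decoder(feat).squeeze(-1)                          # Q(s, a)

Under this sketch, the warm-starting described above amounts to copying the encoder weights from the timestep-h network of the current iteration and the decoder weights from the timestep-(h-1) network of the previous iteration before training.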

One remark is that since the combination lock belongs to the class of Block MDPs, we require a V-type algorithm instead of the Q-type algorithm presented in the main text. The only difference lies in the online sampling process: instead of sampling from dhπtd^{\pi^{t}}_{h}, for each hh we sample from dhπtUniform(𝒜)d_{h}^{\pi^{t}}\circ\textrm{Uniform}(\mathcal{A}), i.e., we first roll in with πt\pi^{t} up to timestep h1h-1, then take a uniformly random action, observe the transition, and collect that tuple. See Algorithm 2.
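The following sketch illustrates this V-type collection step, reusing the hypothetical CombinationLock interface above; the policy(obs, step) interface returning a greedy action index is an assumption for illustration.

import numpy as np

def collect_vtype_tuple(env, policy, h, rng):
    # Roll in with the current greedy policy up to timestep h-1, then take a
    # uniformly random action at timestep h and record the transition.
    obs = env.reset()
    for step in range(h):
        obs, _, _ = env.step(policy(obs, step))
    action = int(rng.integers(0, env.A))
    next_obs, reward, _ = env.step(action)
    return obs, action, reward, next_obs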

For CQL, we implemented the CQL-DQN variant and reported the peak of the learning curve in the main paper (in other words, we picked the best performance of CQL using online samples). While CQL is meant to be a pure offline RL baseline, we tune its hyperparameters using online samples. Thus our reported results can be regarded as an upper bound on the performance that CQL would achieve if it were trained in the pure offline setting.

E.3 Implementation Details of Montezuma’s Revenge experiment

For Montezuma’s Revenge, we follow the Atari game convention and use a discounted setting with discount factor 0.99. In this section we provide the detailed algorithm for the discounted setting. The overall algorithm is described in Algorithm 3. For the function approximation, we use a class of convolutional neural networks (parameterized by class Θ\Theta) as promoted by the original Dqn paper [Mnih et al., 2015]. We include several standard empirical design choices that are commonly used to stabilize training: we use Prioritized Experience Replay [Schaul et al., 2015] for our buffer, and we add Double Dqn [Van Hasselt et al., 2016] and Dueling Dqn [Wang et al., 2016] to our Q-update; these are standard heuristics commonly used in Q-learning implementations. We also observe that decaying schedules for the offline sample ratio β\beta and the exploration rate ϵ\epsilon help improve performance. Note that an annealed β\beta does not contradict our comment in Section 4 on catastrophic forgetting because β\beta is bounded below and never decreases to zero as online learning proceeds. In addition, we perform a value function update every nvaluen_{\textrm{value}} steps inside each episode, instead of a per-episode update, since this is the popular design choice and leads to better efficiency in practice. A minimal sketch of the mixed minibatch update (lines 10 and 11) is given after Algorithm 3.

Algorithm 3 Discounted Hy-Q
0:  Value function class: \mathcal{F} (induced by Θ\Theta), #iterations: TT, offline dataset 𝒟ν\mathcal{D}^{\nu} of size moffm_{\mathrm{off}}, discount factor γ\gamma, value function update frequency nvaluen_{\textrm{value}}, target update frequency ntargetn_{\mathrm{target}}, learning rate α\alpha, offline sample ratio β\beta, exploration rate ϵ\epsilon, action space 𝒜\mathcal{A}.
1:  Randomly initialize value function fθf^{\theta}.
2:  Initialize target value function f~=fθ\tilde{f}=f^{\theta}.
3:  Initialize online buffer 𝒟=\mathcal{D}=\emptyset.
4:  Sample initial state sd0s\sim d_{0}
5:  for t=1,,Tt=1,\dots,T do
6:     Let π\pi be the ϵ\epsilon-greedy policy w.r.t. fθf^{\theta} i.e., π(s)=argmaxafθ(s,a)\pi(s)=\mathop{\mathrm{argmax}}_{a}f^{\theta}(s,a) with probability 1ϵ1-\epsilon and π(s)=𝒰(𝒜)\pi(s)=\mathcal{U}(\mathcal{A}) with probability ϵ\epsilon. // Online collection
7:     Interact with the environment for one step:
a=π(s),sP(s,a),rR(s,a).a=\pi(s),s^{\prime}\sim P(s,a),r\sim R(s,a).
8:     Update online buffer: 𝒟=𝒟{s,a,r,s}\mathcal{D}=\mathcal{D}\cup\{s,a,r,s^{\prime}\}. // Discounted minibatch FQI using both online and offline data
9:     if tmodnvalue=0t\mod n_{\textrm{value}}=0 then
10:        With probability 1β1-\beta: Sample a minibatch DD of size nminibatchn_{\textrm{minibatch}} from the online buffer 𝒟\mathcal{D}. Otherwise: Sample a minibatch DD of size nminibatchn_{\textrm{minibatch}} from the offline buffer 𝒟ν\mathcal{D}^{\nu}.
11:        Perform one-step gradient descent on DD:
θ=θαθ𝔼^D(fθ(s,a)rγmaxaf~(s,a))2.\displaystyle\theta=\theta-\alpha\nabla_{\theta}\hat{\mathbb{E}}_{D}\left(f^{\theta}(s,a)-r-\gamma\max_{a^{\prime}}\tilde{f}(s^{\prime},a^{\prime})\right)^{2}.
12:     end if // Delayed update of target function every ntargetn_{\textrm{target}} updates
13:     if tmodntarget=0t\mod n_{\textrm{target}}=0 then
14:        Set target function to the current value function: f~=fθ\tilde{f}=f^{\theta}.
15:     end if
16:     Update sss\leftarrow s^{\prime}.
17:  end for
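For concreteness, a minimal PyTorch-style sketch of lines 10 and 11 of Algorithm 3 is given below. It omits the Prioritized Experience Replay weights and the Double/Dueling Dqn heuristics, and the buffer interface (a sample method returning (s, a, r, s') tensors) is an assumption for illustration.

import random
import torch
import torch.nn.functional as F

def hyq_value_update(f_theta, f_target, optimizer, online_buffer, offline_buffer,
                     beta, gamma, batch_size):
    # With probability 1 - beta sample the minibatch from the online buffer,
    # otherwise from the offline buffer (line 10 of Algorithm 3).
    buffer = online_buffer if random.random() > beta else offline_buffer
    s, a, r, s_next = buffer.sample(batch_size)
    # One-step gradient descent on the TD objective (line 11 of Algorithm 3).
    with torch.no_grad():
        target = r + gamma * f_target(s_next).max(dim=-1).values
    q = f_theta(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()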

E.4 Baseline implementation

E.4.1 Combination Lock

We use the open-sourced implementation https://github.com/BY571/CQL/tree/main/CQL-DQN for Cql. For Briee, we use the official code released by the authors: https://github.com/yudasong/briee, where we rely on the code there for the combination lock environment.

E.4.2 Montezuma’s Revenge

We use the open-sourced implementation
https://github.com/jcwleo/random-network-distillation-pytorch for Rnd. For Cql, we use https://github.com/takuseno/d3rlpy for their implementation of Cql for Atari. We use https://github.com/felix-kerkhoff/DQfD for Dqfd. For all baselines, we keep the hyperparameters used in these public repositories. For Cql and Dqfd, we provide the offline datasets as described in the main text instead of using the offline dataset provided in the public repositories (we note that Cql also fails completely with the original offline dataset, with 1 million samples, provided in the public repository). All baselines are tested in the same stochastic environment setup as in Burda et al. [2018].

E.5 Hardware Infrastructure

We run our experiments on a cluster of compute nodes with Nvidia RTX 3090 GPUs and various CPUs, which do not introduce any randomness into the results.

E.6 Hyperparameters

E.6.1 Combination Lock

We provide the hyperparameters of Hy-Q in Table 1. In addition, we provide the hyperparameters we tried for the Cql baseline in Table 2.

Table 1: Hyperparameters for Hy-Q in combination lock
Hyperparameter   Value Considered   Final Value
Learning rate   {1e-2, 2e-2, 1e-3}   2e-2
Buffer size   {1e8}   1e8
Optimizer   {Adam, SGD}   Adam
Number of updates per iteration   {30, 300, 500}   500
Batch size   {512}   512
Table 2: Hyperparameters for CQL(DQN) in combination lock
Hyperparameter   Value Considered   Final Value
Learning rate   {1e-3}   1e-3
Optimizer   {Adam}   Adam
Buffer size   {1e8}   1e8
Batch size   {512}   512
Discount Factor   {0.99}   0.99
Moving Average Factor τ\tau   {0.01, 0.1, 1}   0.01
Weight on CQL loss α\alpha   {0, 0.1, 0.01}   0.1

E.6.2 Montezuma’s Revenge

We provide the hyperparameters of Hy-Q in Table 3. We reuse many hyperparameter choices from Dqfd. Note that [a,b][a,b] denotes a decreasing/increasing schedule from aa to bb; a minimal sketch of such a schedule is given after Table 3.

Table 3: Hyperparameter of Discounted Hy-Q in Montezuma’s Revenge.
Hyperparameter   Value Considered   Final Value
Learning rate   {6.25e-5, [1e-4,1e-5]}   [1e-4,1e-5]
Offline Schedule β\beta   {0.5,0.2,[0.2,0.01]}   [0.2,0.01]
Exploration ϵ\epsilon rate   {[0.25,0.001]}   [0.25,0.001]
Minibatch size nminibatchn_{\textrm{minibatch}}   {32}   32
Weight decay (regularization) coefficient   {1e-5}   1e-5
Gradient Clipping   {10,20}   10
Discount factor γ\gamma   {0.99}   0.99
Value function update frequency nvaluen_{\textrm{value}}   {4}   4
Target function update frequency ntargetn_{\textrm{target}}   {1000,2000,5000,10000}   10000
Buffer size   {2202^{20}} 2202^{20}
PER Importance Sampling ratio   {[0.6,1]}  [0.6,1]
Online PER ϵ\epsilon   {0.001}  0.001
Offline PER ϵ\epsilon   {0.0001}  0.0001
Online PER Priority Coefficient   {0.4}  0.4
Offline PER Priority Coefficient   {1}  1
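As a concrete (assumed) interpretation of the [a,b][a,b] schedules in Table 3, the sketch below linearly interpolates a scheduled hyperparameter between its initial and final values over training; the released code may use a different decay shape.

def scheduled_value(start, end, step, total_steps):
    # Linearly interpolate from `start` at step 0 to `end` at `total_steps`.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

# Example: the offline ratio beta decaying from 0.2 to 0.01.
beta = scheduled_value(0.2, 0.01, step=50_000, total_steps=1_000_000)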