
Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient

Yuda Song (Carnegie Mellon University, yudas@cs.cmu.edu), Yifei Zhou (Cornell University, yz639@cornell.edu), Ayush Sekhari (MIT, sekhari@mit.edu), J. Andrew Bagnell (Carnegie Mellon University, dbagnell@aurora.tech), Akshay Krishnamurthy (Microsoft Research, akshaykr@microsoft.com), Wen Sun (Cornell University, ws455@cornell.edu). The first two authors contributed equally to the paper.
Abstract

We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s Revenge.

1 Introduction

Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online RL agents are also notoriously sample inefficient, often requiring billions of environment interactions to achieve suitable performance. This issue is particularly salient when the environment requires sophisticated exploration and a high quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence, the practical success of online RL and related policy gradient/improvement methods has been largely restricted to settings where a high quality simulator is available.

To overcome the issue of sample inefficiency, attention has turned to the offline RL setting [Levine et al., 2020], where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in some other manner (e.g., by a system running in production or an expert). While these methods still require a large dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without compromising system performance. However, offline RL methods can suffer from distribution shift, where the state distribution induced by the learned policy differs significantly from the offline distribution [Wang et al., 2021]. Existing provable approaches for addressing distribution shift are computationally intractable, while empirical approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see).

Figure 1: Performance of our approach Hy-Q on Montezuma's Revenge. We consider three types of offline datasets: easy, medium, and hard. The easy one contains offline data from a high-quality policy (denoted Expert in the figure), the medium one consists of 20% data from a random policy and 80% from the expert, and the hard one consists of half random data and half expert data. All offline datasets have 100k tuples of state, action, reward, and next state (no trajectory-level information is available to the learners). With only 0.1m offline samples, our approach Hy-Q learns nearly 10x faster than Random Network Distillation (RND), an online deep RL baseline designed for Montezuma's Revenge (see Figure 6 for a visual comparison using screenshots of the game). We also note that Conservative Q-Learning (CQL), an offline RL baseline, fails completely on all three offline datasets, indicating the hardness of offline policy learning from such small offline datasets. The imitation learning baseline Behavior Cloning (BC) achieves reasonable performance on the easy and medium datasets, but fails completely on the hard dataset. See Section 6.2 for a more detailed comparison to CQL, BC, and other baselines.

In this paper, we focus on a hybrid reinforcement learning setting, which we call Hybrid RL, that draws on the favorable properties of both offline and online settings. In Hybrid RL, the agent has both an offline dataset and the ability to interact with the environment, as in the traditional online RL setting. The offline dataset helps address the exploration challenge, allowing us to greatly reduce the number of interactions required. Simultaneously, we can identify and correct distribution shift issues via online interaction. Variants of the setting have been studied in a number of empirical works [Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017] which mainly focus on using expert demonstrations as offline data.

Hybrid RL is closely related to the reset setting, where the agent can interact with the environment starting from a “nice” distribution. A number of simple and effective algorithms, including CPI [Kakade and Langford, 2002], PSDP [Bagnell et al., 2003], and policy gradient methods [Kakade, 2001, Agarwal et al., 2020b] – methods which have inspired recent powerful heuristic RL methods such as TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017]— are provably efficient in the reset setting. Yet, leveraging a reset distribution is a strong requirement (often tantamount to having access to a detailed simulation) and unlikely to be available in real world applications. Hybrid RL differs from the reset setting in that (a) we have an offline dataset, but (b) our online interactions must come from traces that start at the initial state distribution of the environment, and the initial state distribution is not assumed to have any nice properties. Both features (offline data and a nice reset distribution) facilitate algorithm design by de-emphasizing the exploration challenge. However, Hybrid RL is more practical since an offline dataset is much easier to access, while a nice reset distribution or even generative model is generally not guaranteed in practice.

We showcase the Hybrid RL setting with a new algorithm, Hybrid Q-Learning or Hy-Q (pronounced: Haiku). The algorithm is a simple adaptation of the classical fitted Q-iteration algorithm (FQI) and accommodates value-based function approximation in a relatively general setup. (We use Q-learning and Q-iteration interchangeably, although they are not, strictly speaking, the same algorithm. Our theoretical results analyze Q-iteration, but our experiments use an algorithm with an online/mini-batch flavor that is closer to Q-learning.) For our theoretical results, we prove that Hy-Q is both statistically and computationally efficient assuming that: (1) the offline distribution covers some high-quality policy, (2) the MDP has low bilinear rank, (3) the function approximator is Bellman complete, and (4) we have a least squares regression oracle. The first three assumptions are standard statistical assumptions in the RL literature, while the fourth is a widely used computational abstraction for supervised learning. No computationally efficient algorithms are known under these assumptions in pure offline or pure online settings, which highlights the advantages of the hybrid setting.

We also implement Hy-Q and evaluate it on two challenging RL benchmarks: a rich observation combination lock [Misra et al., 2020] and Montezuma’s Revenge from the Arcade Learning Environment [Bellemare et al., 2013]. Starting with an offline dataset that contains some transitions from a high quality policy, our approach outperforms: an online RL baseline with theoretical guarantees, an online deep RL baseline tuned for Montezuma’s Revenge, a pure offline RL baseline, an imitation learning baseline, and an existing hybrid method. Compared to the online methods, Hy-Q requires only a small fraction of the online experience, demonstrating its sample efficiency (e.g., Figure 1). Compared to the offline and hybrid methods, Hy-Q performs most favorably when the offline dataset also contains many interactions from low quality policies, demonstrating its robustness. These results reveal the significant benefits that can be realized by combining offline and online data.

2 Related Works

We discuss related work from four categories: pure online RL, online RL with access to a reset distribution, offline RL, and prior work in hybrid settings. We note that pure online RL refers to the setting where one can only reset the system to the initial state distribution of the environment, which is not assumed to provide any form of coverage.

Pure online RL

Beyond tabular settings, many existing statistically efficient RL algorithms are not computationally tractable, due to the difficulty of implementing optimism. This is true in the linear MDP [Jin et al., 2020] with large action spaces, the linear Bellman complete model [Zanette et al., 2020, Agarwal et al., 2019], and in the general function approximation setting [Jiang et al., 2017, Sun et al., 2019, Du et al., 2021, Jin et al., 2021a]. These computational challenges have inspired results on the intractability of aspects of online RL [Dann et al., 2018, Kane et al., 2022]. On the other hand, many simple exploration-based algorithms like $\epsilon$-greedy are computationally efficient, but they may not always work well in practice. Recent theoretical works [Dann et al., 2022, Liu and Brunskill, 2018] have explored additional structural assumptions on the underlying dynamics and value function class under which $\epsilon$-greedy succeeds, but these still do not capture all relevant practical problems.

There are several online RL algorithms that aim to tackle the computational issue via stronger structural assumptions and supervised learning-style computational oracles [Misra et al., 2020, Sekhari et al., 2021, Zhang et al., 2022c, Agarwal et al., 2020a, Uehara et al., 2021, Modi et al., 2021, Zhang et al., 2022a, Qiu et al., 2022]. Compared to these oracle-based methods, our approach operates in the more general “bilinear rank” setting and relies on a standard supervised learning primitive: least squares regression. Notably, our oracle admits efficient implementation with linear function approximation, so we obtain an end-to-end computational guarantee; this is not true for prior oracle-based methods.

There are many deep RL methods for the online setting (e.g., Schulman et al. [2015, 2017], Lillicrap et al. [2016], Haarnoja et al. [2018], Schrittwieser et al. [2020]). Apart from a few exceptions (e.g., Burda et al. [2018], Badia et al. [2020], Guo et al. [2022]), most rely on random exploration (e.g., $\epsilon$-greedy) and are not capable of strategic exploration. In fact, guarantees for $\epsilon$-greedy-like algorithms only exist under additional structural assumptions on the underlying problem.

In our experiments, we test our approach on Montezuma's Revenge and pick Rnd [Burda et al., 2018] as the deep RL exploration baseline due to its simplicity and effectiveness on this game.

Online RL with reset distributions

When an exploratory reset distribution is available, a number of statistically and computationally efficient algorithms are known. The classic algorithms are Cpi [Kakade and Langford, 2002], Psdp [Bagnell et al., 2003], Natural Policy Gradient [Kakade, 2001, Agarwal et al., 2020b], and Politex [Abbasi-Yadkori et al., 2019]. Uchendu et al. [2022] recently demonstrated that algorithms like Psdp work well when equipped with modern neural network function approximators. However, these algorithms (and their analyses) rely heavily on the reset distribution to mitigate the exploration challenge, and such a reset distribution is typically unavailable in practice unless one also has a simulator and access to its internal states. In contrast, we assume the offline data covers some high-quality policy (it need not be globally exploratory), which helps with exploration, but we do not require an exploratory reset distribution. This makes the hybrid setting much more practically appealing.

Offline RL

Offline RL methods learn policies solely from a given offline dataset, with no interaction whatsoever. When the dataset has global coverage, algorithms such as FQI [Munos and Szepesvári, 2008, Chen and Jiang, 2019] or certainty-equivalence model learning [Ross and Bagnell, 2012] can find near-optimal policies in an oracle-efficient manner, via least squares or model-fitting oracles. However, with only partial coverage, existing methods either (a) are not computationally efficient due to the difficulty of implementing pessimism, both in linear settings with large action spaces [Jin et al., 2021b, Zhang et al., 2022b, Chang et al., 2021] and in general function approximation settings [Uehara and Sun, 2021, Xie et al., 2021a, Jiang and Huang, 2020, Chen and Jiang, 2022, Zhan et al., 2022], or (b) require strong representation conditions such as policy-based Bellman completeness [Xie et al., 2021a, Zanette et al., 2021]. In contrast, in the hybrid setting, we obtain an efficient algorithm under the more natural condition of completeness w.r.t. the Bellman optimality operator only.

Among the many empirical offline RL methods (e.g., Kumar et al. [2020], Yu et al. [2021], Kostrikov et al. [2021], Fujimoto and Gu [2021]), we use Cql [Kumar et al., 2020] as a baseline in our experiments, since it has been shown to work in image-based settings such as Atari games.

Online RL with offline datasets

Ross and Bagnell [2012] developed a model-based algorithm for a similar hybrid setting. In comparison, our approach is model-free and consequently may be more suitable for high-dimensional state spaces (e.g., raw-pixel images). Xie et al. [2021b] studied hybrid RL and showed that offline data does not yield statistical improvements in tabular MDPs. Our work instead focuses on the function approximation setting and demonstrates computational benefits of hybrid RL.

On the empirical side, several works consider combining offline expert demonstrations with online interaction [Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017]. A common challenge in this setting is robustness to low-quality offline data. Previous works mostly focus on expert demonstrations and offer no rigorous guarantees of such robustness. In fact, Nair et al. [2020] showed that such performance degradation indeed happens in practice with low-quality offline data. In our experiments, we observe that DQfD [Hester et al., 2018] exhibits a similar degradation. Our algorithm, on the other hand, is robust to the quality of the offline data. Note that the core idea of our algorithm is similar to that of Vecerik et al. [2017], who adapt DDPG to the setting of combining RL with expert demonstrations for continuous control. Although Vecerik et al. [2017] do not provide any theoretical results, it may be possible to combine our theoretical insights with existing analyses for policy gradient methods to establish guarantees for their algorithm in the hybrid RL setting. We also include a detailed comparison with previous empirical work in Appendix D.

3 Preliminaries

We consider a finite-horizon Markov decision process $M({\mathcal{S}},\mathcal{A},H,R,P,d_{0})$, where ${\mathcal{S}}$ is the state space, $\mathcal{A}$ is the action space, $H$ denotes the horizon, stochastic rewards $R(s,a)\in\Delta([0,1])$ and $P(s,a)\in\Delta({\mathcal{S}})$ are the reward and transition distributions at $(s,a)$, and $d_{0}\in\Delta({\mathcal{S}})$ is the initial distribution. We assume the agent can only reset from $d_{0}$ (at the beginning of each episode). Since the optimal policy is non-stationary in this setting, we define a policy $\pi:=\{\pi_{0},\dots,\pi_{H-1}\}$ where $\pi_{h}:{\mathcal{S}}\mapsto\Delta(\mathcal{A})$. Given $\pi$, $d^{\pi}_{h}\in\Delta({\mathcal{S}}\times\mathcal{A})$ denotes the state-action occupancy induced by $\pi$ at step $h$. We define the state and state-action value functions in the usual manner: $V^{\pi}_{h}(s)=\mathbb{E}[\sum_{\tau=h}^{H-1}r_{\tau}|\pi,s_{h}=s]$ and $Q^{\pi}_{h}(s,a)=\mathbb{E}[\sum_{\tau=h}^{H-1}r_{\tau}|\pi,s_{h}=s,a_{h}=a]$. $Q^{\star}$ and $V^{\star}$ denote the optimal value functions. We write $V^{\pi}=\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi}_{0}(s_{0})]$ for the expected total reward of $\pi$. We define the Bellman operator $\mathcal{T}$ such that for any $f:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{R}$,

\[\mathcal{T}f(s,a)=\mathbb{E}[R(s,a)]+\mathbb{E}_{s^{\prime}\sim P(s,a)}\max_{a^{\prime}}f(s^{\prime},a^{\prime})\qquad\forall s,a.\]

We assume that for each $h$ we have an offline dataset of $m_{\mathrm{off}}$ samples $(s,a,r,s^{\prime})$ drawn i.i.d. via $(s,a)\sim\nu_{h}$, $r\sim R(s,a)$, $s^{\prime}\sim P(s,a)$. Here $\nu=\{\nu_{0},\dots,\nu_{H-1}\}$ denotes the corresponding offline data distributions. For a dataset $\mathcal{D}$, we use $\widehat{\mathbb{E}}_{\mathcal{D}}[\cdot]$ to denote the sample average over this dataset. For our theoretical results, we will assume that $\nu$ covers some high-quality policy. Note that covering a high-quality policy does not mean that $\nu$ itself is the distribution of some high-quality policy. For instance, $\nu$ could be a mixture of a high-quality policy and a few low-quality policies, in which case treating $\nu$ as expert demonstrations fails completely (as we show in our experiments). We consider the value-based function approximation setting, where we are given a function class $\mathcal{F}=\mathcal{F}_{0}\times\dots\times\mathcal{F}_{H-1}$ with $\mathcal{F}_{h}\subset{\mathcal{S}}\times\mathcal{A}\mapsto[0,V_{\mathrm{max}}]$ that we use to approximate the value functions of the underlying MDP. Here $V_{\mathrm{max}}$ denotes the maximum total reward of a trajectory. For ease of notation, we write $f=\{f_{0},\ldots,f_{H-1}\}$ and define $\pi^{f}$ to be the greedy policy w.r.t. $f$, which chooses actions as $\pi_{h}^{f}(s)=\mathop{\mathrm{argmax}}_{a}f_{h}(s,a)$.
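To fix notation, the following minimal sketch (with assumed vector-valued Q-functions; not from the authors' code) spells out the greedy policy $\pi^{f}_{h}$ and the regression target $r+\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime})$ that appear repeatedly below.

```python
import numpy as np

def greedy_action(f_h, s):
    """pi^f_h(s) = argmax_a f_h(s, a); here f_h(s) returns the vector of Q-values over actions."""
    return int(np.argmax(f_h(s)))

def regression_target(f_next, r, s_next):
    """Target r + max_{a'} f_{h+1}(s', a') used in the least squares updates below (0 at h = H)."""
    return r + (np.max(f_next(s_next)) if f_next is not None else 0.0)
```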

4 Hybrid Q-Learning

Algorithm 1 Hybrid Q-learning using both offline and online data (Hy-Q)
0:  Input: value class $\mathcal{F}$, number of iterations $T$, offline dataset $\mathcal{D}^{\nu}_{h}$ of size $m_{\mathrm{off}}=T$ for each $h\in[H-1]$.
1:  Initialize $f_{h}^{1}(s,a)=0$.
2:  for $t=1,\dots,T$ do
3:     Let $\pi^{t}$ be the greedy policy w.r.t. $f^{t}$, i.e., $\pi_{h}^{t}(s)=\mathop{\mathrm{argmax}}_{a}f^{t}_{h}(s,a)$.
4:     For each $h$, collect $m_{\mathrm{on}}=1$ online tuple $\mathcal{D}^{t}_{h}\sim d_{h}^{\pi^{t}}$.  // online collection
5:     Set $f_{H}^{t+1}(s,a)=0$.  // FQI using both online and offline data
6:     for $h=H-1,\dots,0$ do
7:        Estimate $f_{h}^{t+1}$ via least squares regression on the aggregated data $\mathcal{D}_{h}=\mathcal{D}^{\nu}_{h}+\sum_{\tau=1}^{t}\mathcal{D}^{\tau}_{h}$:
          $f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\big\{\widehat{\mathbb{E}}_{\mathcal{D}_{h}}\big(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime})\big)^{2}\big\}$   (1)
8:     end for
9:  end for

In this section, we present our algorithm, Hybrid Q-Learning (Hy-Q), in Algorithm 1. Hy-Q takes as input an offline dataset $\{\mathcal{D}^{\nu}_{h}\}_{h=1}^{H}$ of $(s,a,r,s^{\prime})$ tuples and a Q-function class $\mathcal{F}_{h}\subset{\mathcal{S}}\times\mathcal{A}\mapsto[0,H]$, $h=1,\dots,H$, and outputs a policy that optimizes the given reward function. The algorithm is conceptually simple: it iteratively executes the FQI procedure (lines 6-7) using the offline dataset and on-policy samples generated by the learned policies.

Specifically, at iteration $t$ and timestep $h$, we have an estimate $f_{h}^{t}$ of $Q_{h}^{\star}$ and we set $\pi_{h}^{t}$ to be the greedy policy w.r.t. $f_{h}^{t}$. We execute $\pi^{t}$ to collect a dataset $\mathcal{D}_{h}^{t}$ of online samples in line 4. More formally, we sample $s_{h}\sim d^{\pi^{t}}_{h}$, $a_{h}\sim\pi_{h}^{t}(\cdot|s_{h})$, $s_{h+1}\sim P(\cdot|s_{h},a_{h})$ and add the tuple $(s_{h},a_{h},r_{h},s_{h+1})$ to $\mathcal{D}_{h}^{t}$. Then we run FQI, a dynamic-programming-style algorithm, on both the offline dataset $\mathcal{D}^{\nu}_{h}$ and all previously collected online samples $\{\mathcal{D}_{h}^{\tau}\}_{\tau=1}^{t}$. The FQI update works backward from time step $H-1$ to $0$ and computes $f_{h}^{t+1}$ via least squares regression with input $(s,a)$ and regression target $r+\max_{a^{\prime}}f_{h+1}^{t+1}(s^{\prime},a^{\prime})$.

Let us make several remarks; here we drop the timestep subscript $h$ for simplicity. Intuitively, the FQI updates in Hy-Q try to ensure that the estimate $f^{t}$ has small Bellman error under both the offline distribution $\nu$ and the online distributions $d^{\pi^{t}}$. The standard offline version of FQI ensures the former, but this alone is insufficient when the offline dataset has poor coverage. Indeed, FQI may perform poorly in such cases [see examples in Zhan et al., 2022, Chen and Jiang, 2022]. The key insight in Hy-Q is to use online interaction to ensure that we also have small Bellman error on $d^{\pi^{t}}$. As we will see, the moment we find an $f^{t}$ that has small Bellman error on the offline distribution $\nu$ and on its own greedy policy's distribution $d^{\pi^{t}}$, FQI guarantees that $\pi^{t}$ is at least as good as any policy covered by $\nu$. This observation results in an explore-or-terminate phenomenon: either $f^{t}$ has small Bellman error on its own distribution and we are done, or $d^{\pi^{t}}$ must be significantly different from distributions we have seen previously and we make progress. Crucially, no explicit exploration is required for this argument, which is precisely how we avoid the computational difficulties of implementing optimism.

Another important point pertains to catastrophic forgetting. We will see that the size of the offline dataset $m_{\mathrm{off}}$ should be comparable to the total amount of online data $\{\mathcal{D}_{h}^{\tau}\}_{\tau=1}^{T}$, so that the offline and online portions of Eq. (1) have similar weight and we ensure low Bellman error on $\nu$ throughout the learning process. In practice, we implement this by having all model updates use a fixed (significant) number of offline samples even as we collect more online data, so that we do not "forget" the distribution $\nu$. This is quite different from warm-starting with $\mathcal{D}^{\nu}$ and then switching to online RL, which may result in catastrophic forgetting due to a vanishing proportion of offline samples being used for model training as we collect more online samples. We note that this balancing scheme is analogous to and inspired by the one used by Ross and Bagnell [2012] in the context of model-based RL with a reset distribution. Similar techniques have also been explored previously in various applications (for example, see Appendix F.3 of Kalashnikov et al. [2018]). As in Ross and Bagnell [2012], a key practical insight from our analysis is that the offline data should be used throughout training to avoid catastrophic forgetting.
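Before moving to the analysis, here is a minimal, self-contained sketch of the Hy-Q loop on a small random tabular MDP (our own illustrative construction, not the paper's benchmarks or released code). In the tabular case the least squares step reduces to averaging the regression targets per state-action pair, and the offline behavior distribution is taken to be uniform for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny random tabular MDP, used only to make the sketch runnable.
S, A, H = 5, 3, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] is a distribution over next states
R = rng.uniform(size=(H, S, A))                 # mean rewards in [0, 1]
d0 = np.ones(S) / S

def step(h, s, a):
    s_next = rng.choice(S, p=P[h, s, a])
    r = float(rng.uniform() < R[h, s, a])       # Bernoulli reward with mean R[h, s, a]
    return r, s_next

def rollout_tuple(h_target, f):
    """Roll in with the greedy policy of f and return the (s, a, r, s') tuple at step h_target."""
    s = rng.choice(S, p=d0)
    for h in range(h_target + 1):
        a = int(np.argmax(f[h, s]))             # greedy action w.r.t. f_h
        r, s_next = step(h, s, a)
        if h == h_target:
            return (s, a, r, s_next)
        s = s_next

def fit_q(data, f_next):
    """Tabular least squares: average the regression targets r + max_a' f_{h+1}(s', a')."""
    q, counts = np.zeros((S, A)), np.zeros((S, A))
    for (s, a, r, s_next) in data:
        q[s, a] += r + (0.0 if f_next is None else np.max(f_next[s_next]))
        counts[s, a] += 1
    return q / np.maximum(counts, 1)

# Offline datasets: m_off tuples per step h from a uniform behavior distribution (illustrative nu_h).
T, m_off = 200, 200
offline = {h: [] for h in range(H)}
for h in range(H):
    for _ in range(m_off):
        s, a = rng.integers(S), rng.integers(A)
        r, s_next = step(h, s, a)
        offline[h].append((s, a, r, s_next))

# Hy-Q: each iteration adds one on-policy tuple per step, then runs FQI on offline + all online data.
online = {h: [] for h in range(H)}
f = np.zeros((H, S, A))
for t in range(T):
    for h in range(H):
        online[h].append(rollout_tuple(h, f))   # online collection with the current greedy policy
    f_new = np.zeros((H, S, A))
    for h in reversed(range(H)):
        f_next = f_new[h + 1] if h + 1 < H else None
        f_new[h] = fit_q(offline[h] + online[h], f_next)
    f = f_new
```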

5 Theoretical Analysis: Low Bilinear Rank Models

In this section we present the main theoretical guarantees for Hy-Q. We start by stating the key assumptions and definitions for the function approximator, the offline data distribution, and the MDP, and then provide some discussion.

Assumption 1 (Realizability and Bellman completeness).

For any $h$, we have $Q_{h}^{\star}\in\mathcal{F}_{h}$. Additionally, for any $f_{h+1}\in\mathcal{F}_{h+1}$, we have ${\mathcal{T}}f_{h+1}\in\mathcal{F}_{h}$.

Definition 1 (Bellman error transfer coefficient).

For any policy $\pi$, define the transfer coefficient as

\[C_{\pi}:=\max\left\{0,~\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}}}\right\}. \tag{2}\]

The transfer coefficient definition above is somewhat non-standard, but is actually weaker than related notions used in prior offline RL results. First, the average Bellman error appearing in the numerator is weaker than the squared Bellman error notion of Xie et al. [2021a]; a simple calculation shows that $C_{\pi}^{2}$ is upper bounded by their coefficient. Second, by using Bellman errors, both of these are bounded by notions involving density ratios [Kakade and Langford, 2002, Munos and Szepesvári, 2008, Chen and Jiang, 2019]. Third, when the functions $f\in\mathcal{F}$ are linear in a known feature map, as is the case for models such as linear MDPs, the transfer coefficient can be refined to a relative condition number defined via the features. Finally, many works, particularly those that do not employ pessimism [Munos and Szepesvári, 2008, Chen and Jiang, 2019], require "all-policy" analogs, which place a much stronger requirement on the offline data distribution $\nu$. In contrast, we will only ask that $C_{\pi}$ is small for some high-quality policy that we hope to compete with. In Appendix 12, we showcase that our transfer coefficient is weaker than related notions used in prior works under various settings such as tabular MDPs, linear MDPs, low-rank MDPs, and MDPs with general value function approximation.
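For intuition, the definition can be evaluated exactly in small tabular problems. Below is a minimal sketch (our own illustrative helper, not from the paper) that computes $C_{\pi}$ for a tabular MDP and a finite class $\mathcal{F}$; the array layout is an assumption.

```python
import numpy as np

def transfer_coefficient(P, R, nu, d_pi, F):
    """Exact C_pi from Definition 1 for a tabular MDP and a finite class F.

    P: (H, S, A, S) transitions, R: (H, S, A) mean rewards, nu and d_pi: (H, S, A) distributions,
    F: list of candidate value functions f with shape (H + 1, S, A) and f[H] = 0.
    """
    H = R.shape[0]
    ratios = [0.0]
    for f in F:
        num, den = 0.0, 0.0
        for h in range(H):
            bellman_backup = R[h] + P[h] @ f[h + 1].max(axis=1)   # (T f_{h+1})(s, a)
            err = bellman_backup - f[h]                           # T f_{h+1} - f_h at step h
            num += np.sum(d_pi[h] * err)                          # average error under d^pi_h
            den += np.sum(nu[h] * err ** 2)                       # squared error under nu_h
        if den > 1e-12:                                           # if den = 0 and num > 0 the
            ratios.append(num / np.sqrt(den))                     # coefficient is infinite; skipped here
    return max(ratios)
```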

Definition 2 (Bilinear model [Du et al., 2021]).

We say that the MDP together with the function class $\mathcal{F}$ is a bilinear model of rank $d$ if for any $h\in[H-1]$, there exist two (unknown) mappings $X_{h},W_{h}:\mathcal{F}\mapsto\mathbb{R}^{d}$ with $\max_{f}\|X_{h}(f)\|_{2}\leq B_{X}$ and $\max_{f}\|W_{h}(f)\|_{2}\leq B_{W}$ such that:

\[\forall f,g\in\mathcal{F}:\quad\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f}}}\left[g_{h}(s,a)-\mathcal{T}g_{h+1}(s,a)\right]\right\rvert=\left\lvert\left\langle X_{h}(f),W_{h}(g)\right\rangle\right\rvert.\]

All concepts defined above are frequently used in the statistical analysis of RL methods with function approximation. Realizability is the most basic function approximation assumption, but is known to be insufficient for offline RL [Foster et al., 2021] unless other strong assumptions hold [Xie and Jiang, 2021, Zhan et al., 2022, Chen and Jiang, 2022]. Completeness is the most standard strengthening of realizability that is used routinely in both online [Jin et al., 2021a] and offline RL [Munos and Szepesvári, 2008, Chen and Jiang, 2019] and is known to hold in several settings including the linear MDP and the linear quadratic regulator. These assumptions ensure that the dynamic programming updates of FQI are stable in the presence of function approximation.

Lastly, the bilinear model was developed in a series of works [Jiang et al., 2017, Jin et al., 2021a, Du et al., 2021] on sample efficient online RL. (Jin et al. [2021a] consider the Bellman eluder dimension, which is related to but distinct from the bilinear model; our proofs can be easily translated to that setting, see Appendix C for details.) The setting is known to capture a wide class of models including linear MDPs, linear Bellman complete models, low-rank MDPs, linear quadratic regulators, reactive POMDPs, and more. As a technical note, the main paper focuses on the "Q-type" version of the bilinear model, but the algorithm and proofs easily extend to the "V-type" version. See Appendix A.3 for details.

Theorem 1 (Cumulative suboptimality).

Fix $\delta\in(0,1)$, set $m_{\mathrm{off}}=T$ and $m_{\mathrm{on}}=1$, and suppose that the function class $\mathcal{F}$ satisfies Assumption 1 and, together with the underlying MDP, admits bilinear rank $d$. Then, with probability at least $1-\delta$, Algorithm 1 obtains the following bound on cumulative suboptimality w.r.t. any comparator policy $\pi^{e}$:

\[\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}}=\widetilde{O}\left(\max\left\{C_{\pi^{e}},1\right\}V_{\max}\sqrt{dH^{2}T\cdot\log\left(\lvert\mathcal{F}\rvert/\delta\right)}\right),\]

where $\pi^{t}=\pi^{f^{t}}$ is the greedy policy w.r.t. $f^{t}$ at round $t$.

The parameter choices in the above theorem ensure that, at each FQI iteration, the ratio between the number of offline samples and the total number of online samples collected so far is at least $1$. This ensures that during learning we never forget the offline distribution. A standard online-to-batch conversion [Shalev-Shwartz and Ben-David, 2014] immediately gives the following sample complexity guarantee for Algorithm 1 for finding an $\epsilon$-suboptimal policy w.r.t. the optimal policy $\pi^{*}$ of the underlying MDP.
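To make the online-to-batch step concrete, the following is the short calculation behind Corollary 1 (a sketch; constants and logarithmic factors are suppressed):

\[
\frac{1}{T}\sum_{t=1}^{T}\left(V^{\pi^{*}}-V^{\pi^{t}}\right)
\lesssim \max\{C_{\pi^{*}},1\}\,V_{\max}\sqrt{\frac{dH^{2}\log(\lvert\mathcal{F}\rvert/\delta)}{T}}
\leq \epsilon
\quad\Longleftarrow\quad
T \gtrsim \frac{C_{\pi^{*}}^{2}V_{\max}^{2}dH^{2}\log(\lvert\mathcal{F}\rvert/\delta)}{\epsilon^{2}}.
\]

Since each iteration collects $H$ online tuples ($m_{\mathrm{on}}=1$ per step) and the offline dataset contributes $H\,m_{\mathrm{off}}=HT$ tuples, the total sample count is $O(HT)$, which gives the $H^{3}$ dependence in Corollary 1; returning a policy chosen uniformly at random from $\{\pi^{1},\dots,\pi^{T}\}$ yields the $\epsilon$-suboptimal $\widehat{\pi}$.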

Corollary 1 (Sample complexity).

Under the assumptions of Theorem 1, if $C_{\pi^{*}}<\infty$, then Algorithm 1 can find an $\epsilon$-suboptimal policy $\widehat{\pi}$, for which $V^{\pi^{*}}-V^{\widehat{\pi}}\leq\epsilon$, with total sample complexity (online + offline):

\[n=\widetilde{O}\left(C^{2}_{\pi^{*}}V^{2}_{\max}H^{3}d\log\left(\lvert\mathcal{F}\rvert/\delta\right)/\epsilon^{2}\right).\]

The results formalize the statistical properties of Hy-Q. In terms of sample complexity, a somewhat unique feature of the hybrid setting is that both the transfer coefficient and the bilinear rank are relevant, whereas these (or related) parameters typically appear in isolation in offline and online RL, respectively. In terms of coverage, Theorem 1 highlights an "oracle property" of Hy-Q: it competes with any policy that is sufficiently covered by the offline dataset.

We also highlight the computational efficiency of Hy-Q: it only requires solving least squares problems over the function class \mathcal{F}. To our knowledge, no purely online or purely offline methods are known to be efficient in this sense, except under much stronger “uniform” coverage conditions.

5.1 Proof Sketch

We now give an overview of the proof of Theorem 1. The proof starts with a simple decomposition of the regret:

\begin{align*}
\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}
&=\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-V^{\pi^{f^{t}}}_{0}(s)\right]\\
&=\sum_{t=1}^{T}\underbrace{\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f_{0}^{t}(s,a)\right]}_{A_{t}}+\sum_{t=1}^{T}\underbrace{\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right]}_{B_{t}}.
\end{align*}

Then we note that one can bound each $A_{t}$ and $B_{t}$ by the Bellman error under the comparator $\pi^{e}$'s visitation distribution and the learned policy's visitation distribution, respectively. For simplicity, define the Bellman error of function $f$ at time $h$ as $\mathcal{E}_{h}(f)(s,a)=f_{h}(s,a)-{\mathcal{T}}f_{h+1}(s,a)$; we can then show that

\[A_{t}\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi^{e}}}\left[-\mathcal{E}_{h}(f^{t})(s,a)\right]\ \text{(Lemma 5)}\qquad\text{and}\qquad B_{t}\leq\sum_{h=0}^{H-1}\left|\mathbb{E}_{s,a\sim d_{h}^{\pi^{t}}}\mathcal{E}_{h}(f^{t})(s,a)\right|\ \text{(Lemma 4)}.\]

Then, for $A_{t}$, recalling the definition of the transfer coefficient $C_{\pi}$ gives

\[A_{t}\leq C_{\pi^{e}}\cdot\sqrt{\sum_{h=0}^{H-1}\underbrace{\mathbb{E}_{s,a\sim\nu_{h}}\left[\mathcal{E}_{h}(f^{t})\right]^{2}}_{\mathcal{E}_{t;h}^{\textrm{off}}}}.\]

The terms $\mathcal{E}^{\textrm{off}}_{t;h}$ are of the order of the statistical error of least squares regression, since at every iteration $t$, FQI includes the offline data from $\nu_{h}$ in its least squares regression problems. Thus $A_{t}$ is small for every $t$ given bounded $C_{\pi^{e}}$.

To bound the online part $B_{t}$, we utilize the structure of bilinear models. For the analysis, we construct a covariance matrix $\Sigma_{t;h}=\sum_{\tau=1}^{t}X_{h}(f^{\tau})X_{h}(f^{\tau})^{\top}+\lambda\mathbb{I}$, where $X_{h}$ is as in the bilinear model definition; this matrix tracks the online learning progress. Recalling the definition of the bilinear model, we can bound $B_{t}$ as follows:

\begin{align*}
\sum_{t=1}^{T}B_{t}\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left|\mathbb{E}_{s,a\sim d_{h}^{\pi^{t}}}\mathcal{E}_{h}(f^{t})(s,a)\right|
&=\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(f^{t}),W_{h}(f^{t})\right\rangle\right\rvert\\
&\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\underbrace{\sum_{\tau=1}^{t-1}\mathbb{E}_{s,a\sim d^{\tau}_{h}}[\mathcal{E}_{h}(f^{t})(s,a)]^{2}}_{\mathcal{E}_{t;h}^{\textrm{on}}}+\lambda B_{W}^{2}}.
\end{align*}

The first term on the right hand side of the above inequality, i.e., $\sum_{t}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}$, can be shown to grow sublinearly as $\widetilde{O}(\sqrt{T})$ using the classic elliptical potential argument (Lemma 6). The term $\mathcal{E}^{\textrm{on}}_{t;h}$ can be controlled to be small since it is related to the statistical error of least squares regression (in each iteration $t$, the least squares regression uses training data sampled from policies $\pi^{1}$ through $\pi^{t-1}$). Together, this ensures that $\sum_{t}B_{t}$ grows sublinearly as $\widetilde{O}(\sqrt{T})$, which further implies that there exists an iteration $t^{\prime}$ such that $B_{t^{\prime}}\leq\widetilde{O}(1/\sqrt{T})$. Together, the above arguments show that there must exist an iteration $t^{\prime}$ at which $A_{t^{\prime}}$ and $B_{t^{\prime}}$ are small simultaneously, which implies that $\pi^{f^{t^{\prime}}}$ is close to $\pi^{e}$ in performance. The proof sketch highlights the key observation: as long as we have a function $f$ that has small Bellman residual under the offline distribution $\nu$ and small Bellman residual under its own greedy policy's distribution $d^{\pi^{f}}$, then $\pi^{f}$ must be at least as good as any policy $\pi^{e}$ that is covered by the offline distribution.

5.2 The Linear Bellman Completeness Model

We next showcase one example of a low bilinear rank model, the popular linear Bellman complete model, which captures both the linear MDP model [Yang and Wang, 2019, Jin et al., 2020] and the LQR model, and we instantiate the sample complexity bound from Corollary 1.

Definition 3.

Given a feature function $\phi:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{B}_{d}(1)$, an MDP with feature $\phi$ admits linear Bellman completeness if for any $w\in\mathbb{B}_{d}(B_{W})$, there exists a $w^{\prime}\in\mathbb{B}_{d}(B_{W})$ such that

\[\forall s,a:\qquad\left\langle w^{\prime},\phi(s,a)\right\rangle=\mathbb{E}[R(s,a)]+\mathbb{E}_{s^{\prime}\sim P(s,a)}\max_{a^{\prime}}\left\langle w,\phi(s^{\prime},a^{\prime})\right\rangle.\]

Note that the above condition implies that $Q^{\star}_{h}(s,a)=\left\langle w_{h}^{\star},\phi(s,a)\right\rangle$ with $\|w^{\star}_{h}\|_{2}\leq B_{W}$. Thus, we can define a function class $\mathcal{F}_{h}=\{\left\langle w_{h},\phi(s,a)\right\rangle:w_{h}\in\mathbb{R}^{d},\|w_{h}\|_{2}\leq B_{W}\}$, which by inspection satisfies Assumption 1. Additionally, this model is known to have bilinear rank at most $d$ [Du et al., 2021]. Thus, using Corollary 1, we immediately get the following guarantee:

Lemma 1.

Let $\delta\in(0,1)$, suppose the MDP is linear Bellman complete with $C_{\pi^{*}}<\infty$, and consider $\mathcal{F}_{h}$ defined above. Then, with probability $1-\delta$, Algorithm 1 finds an $\epsilon$-suboptimal policy with total sample complexity (offline + online):

\[n=\widetilde{O}\left(\frac{B_{W}^{2}C^{2}_{\pi^{*}}H^{4}d^{2}\log\left(B_{W}/\epsilon\delta\right)}{\epsilon^{2}}\right).\]
Proof sketch of Lemma 1.

The proof follows by invoking the result in Corollary 1 for a discretization of the class $\mathcal{F}$, denoted by $\mathcal{F}_{\epsilon}=\mathcal{F}_{0,\epsilon}\times\dots\times\mathcal{F}_{H-1,\epsilon}$. Here $\mathcal{F}_{h,\epsilon}=\{w^{\top}\phi(s,a):w\in\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})\}$, where $\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})$ is an $\epsilon$-net of $\mathbb{B}_{d}(B_{W})$ under the $\ell_{\infty}$-distance and contains $O((B_{W}/\epsilon)^{d})$ elements. Thus, we get that $\log(\left\lvert\mathcal{F}_{\epsilon}\right\rvert)=O\left(Hd\log(B_{W}/\epsilon)\right)$. ∎

On the computational side, with $\mathcal{F}$ as in Lemma 1, the regression problem in Algorithm 1 reduces to a least squares linear regression with a norm constraint on the weight vector. This can be solved by convex programming with complexity scaling polynomially in the parameters [Bubeck et al., 2015].

Remark 1 (Computational efficiency).

For linear Bellman complete models, we note that Algorithm 1 can be implemented efficiently under mild assumptions. For the class $\mathcal{F}$ in Lemma 1, the regression problem in (1) reduces to a least squares linear regression with a norm constraint on the weight vector. This regression problem can be solved by convex programming with computational cost scaling polynomially in the number of parameters (here $d$) [Bubeck et al., 2015], whenever $\max_{a}f_{h+1}(s,a)$ (or $\mathop{\mathrm{argmax}}_{a}f_{h+1}(s,a)$) can be computed efficiently.
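To illustrate the regression step in this linear setting, here is a minimal sketch of norm-constrained least squares solved by projected gradient descent on synthetic data; it is one simple convex-programming-style routine, not necessarily the one used in the paper's experiments.

```python
import numpy as np

def constrained_least_squares(Phi, y, B_W, lr=0.1, iters=2000):
    """Minimize (1/n)||Phi @ w - y||^2 over {w : ||w||_2 <= B_W} by projected gradient descent.

    Phi: (n, d) feature matrix phi(s, a); y: (n,) targets r + max_a' f_{h+1}(s', a').
    """
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2.0 / n * Phi.T @ (Phi @ w - y)   # gradient of the mean squared error
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > B_W:                           # project back onto the l2 ball of radius B_W
            w = w * (B_W / norm)
    return w

# Tiny synthetic check (illustrative data only).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 8))
w_true = rng.normal(size=8); w_true /= np.linalg.norm(w_true)
y = Phi @ w_true + 0.1 * rng.normal(size=500)
w_hat = constrained_least_squares(Phi, y, B_W=1.0)
```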

Remark 2.

(Linear MDPs) Since linear Bellman complete models generalize linear MDPs [Yang and Wang, 2019, Jin et al., 2020], as discussed above, Algorithm 1 can be implemented efficiently whenever $\max_{a}f_{h+1}(s,a)$ can be computed efficiently. The latter is tractable in the following cases:

  • When $\lvert\mathcal{A}\rvert$ is small/finite, one can simply enumerate actions to compute $\max_{a}f_{h+1}(s,a)$ for any $s$, and thus (1) can be implemented efficiently. The computational efficiency of Algorithm 1 in this case is comparable to that of prior works, e.g., Jin et al. [2020].

  • When the set $\{\phi(s,a)\mid a\in\mathcal{A}\}$ is convex and compact, one can use a linear optimization oracle to compute $\max_{a}f_{h+1}(s,a)=\max_{a}w_{h+1}^{\top}\phi(s,a)$. This linear optimization problem is itself solvable with computational cost scaling polynomially with $d$.

    Note that even with access to a linear optimization oracle, prior works, e.g., Jin et al. [2020], rely on bonuses of the form $\mathop{\mathrm{argmax}}_{a}\phi(s,a)^{\top}w+\beta\sqrt{\phi(s,a)^{\top}\Sigma\phi(s,a)}$, where $\Sigma$ is some positive definite matrix (e.g., the regularized feature covariance matrix). Computing such bonuses could be NP-hard (in the feature dimension $d$) without additional assumptions [Dani et al., 2008].

Remark 3.

(Relative condition number) A common coverage metric in these linear MDP models is the relative condition number. In Appendix A.4, we show that our coefficient $C_{\pi}$ is upper bounded by the relative condition number of $\pi$ with respect to $\nu$, defined via $\mathbb{E}_{d^{\pi}}\|\phi\|_{\Sigma^{-1}_{\nu}}$, where $\Sigma_{\nu}=\mathbb{E}_{s,a\sim\nu}\phi(s,a)\phi(s,a)^{\top}$. Concretely, we have $C_{\pi}\leq\sqrt{\max_{h}\mathbb{E}_{d^{\pi}_{h}}\|\phi\|^{2}_{\Sigma^{-1}_{\nu_{h}}}}$. Note that this quantity captures coverage in terms of features, and can be bounded even when the density-ratio-style concentrability coefficient (i.e., $\sup_{s,a}d^{\pi}(s,a)/\nu(s,a)$) is infinite.
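As a quick numerical illustration of this coverage notion, the following sketch estimates the bound $\max_{h}\mathbb{E}_{d^{\pi}_{h}}\|\phi\|^{2}_{\Sigma^{-1}_{\nu_{h}}}$ from feature samples; the synthetic features and the helper name are illustrative assumptions.

```python
import numpy as np

def relative_condition_bound(phi_pi, phi_nu, reg=1e-6):
    """Estimate sqrt(max_h E_{d^pi_h}[ ||phi||^2_{Sigma_{nu_h}^{-1}} ]) from per-step feature samples.

    phi_pi[h]: (n_pi, d) features sampled from d^pi_h; phi_nu[h]: (n_nu, d) features sampled from nu_h.
    """
    vals = []
    for X_pi, X_nu in zip(phi_pi, phi_nu):
        Sigma_nu = X_nu.T @ X_nu / len(X_nu) + reg * np.eye(X_nu.shape[1])
        # E_{d^pi}[ phi^T Sigma_nu^{-1} phi ] estimated by a sample average
        vals.append(np.mean(np.einsum("nd,de,ne->n", X_pi, np.linalg.inv(Sigma_nu), X_pi)))
    return float(np.sqrt(max(vals)))

# Synthetic example: when the two feature distributions match, the bound is roughly sqrt(d).
rng = np.random.default_rng(0)
H, d = 5, 10
phi_nu = [rng.normal(size=(2000, d)) for _ in range(H)]
phi_pi = [rng.normal(size=(2000, d)) for _ in range(H)]
print(relative_condition_bound(phi_pi, phi_nu))
```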

5.3 Low-rank MDP

In this section, we briefly introduce the low-rank MDP model [Du et al., 2021], which is captured by the V-type bilinear model discussed in Appendix A.3. Unlike the linear MDP model discussed in Section 5.2, the low-rank MDP does not assume that the feature map $\phi$ is known a priori.

Definition 4 (Low-rank MDP).

An MDP is called a low-rank MDP if there exist $\mu^{\star}:{\mathcal{S}}\mapsto\mathbb{R}^{d}$ and $\phi^{\star}:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{R}^{d}$ such that the transition dynamics satisfy $P(s^{\prime}|s,a)=\mu^{\star}(s^{\prime})^{\top}\phi^{\star}(s,a)$ for all $s,a,s^{\prime}$. We additionally assume that we are given a realizable representation class $\Phi$ such that $\phi^{\star}\in\Phi$, that $\sup_{s,a}\|\phi^{\star}(s,a)\|_{2}\leq 1$, and that $\|f^{\top}\mu^{\star}\|_{2}\leq\sqrt{d}$ for any $f:{\mathcal{S}}\mapsto[-1,1]$.

Consider the function class $\mathcal{F}_{h}=\{w^{\top}\phi(s,a):\phi\in\Phi,w\in\mathbb{B}_{d}(B_{W})\}$; through the bilinear decomposition we have $B_{W}\leq 2\sqrt{d}$. By inspection, this function class satisfies Assumption 1. Furthermore, it is well known that the low-rank MDP model has V-type bilinear rank at most $d$ [Du et al., 2021]. Invoking the sample complexity bound given in Corollary 2 for V-type bilinear models, we get the following result.

Lemma 2.

Let $\delta\in(0,1)$ and let $\Phi$ be a given representation class. Suppose that the MDP is a rank-$d$ MDP w.r.t. some $\phi^{\star}\in\Phi$, that $C_{\pi^{*}}<\infty$, and consider $\mathcal{F}_{h}$ defined above. Then, with probability $1-\delta$, Algorithm 2 finds an $\epsilon$-suboptimal policy with total sample complexity (offline + online):

\[\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}d^{2}H^{4}\left\lvert\mathcal{A}\right\rvert\log\left({HTd\lvert\Phi\rvert}/{\epsilon\delta}\right)}{\epsilon^{2}}\right).\]
Proof sketch of Lemma 2.

The proof follows by invoking the result in Corollary 1 for a discretization of the class $\mathcal{F}$, denoted by $\mathcal{F}_{\epsilon}=\mathcal{F}_{0,\epsilon}\times\dots\times\mathcal{F}_{H-1,\epsilon}$. Here $\mathcal{F}_{h,\epsilon}=\{w^{\top}\phi(s,a):\phi\in\Phi,w\in\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})\}$, where $\widehat{\mathbb{B}}_{d,\epsilon}(B_{W})$ is an $\epsilon$-net of $\mathbb{B}_{d}(B_{W})$ under the $\ell_{\infty}$-distance and contains $O((B_{W}/\epsilon)^{d})$ elements. Thus, we get that $\log(\left\lvert\mathcal{F}_{\epsilon}\right\rvert)=O\left(Hd\log(B_{W}\left\lvert\Phi\right\rvert/\epsilon)\right)$. ∎

For low-rank MDPs, the transfer coefficient $C_{\pi}$ is upper bounded by a relative-condition-number-style quantity defined using the unknown ground truth feature $\phi^{\star}$ (see Lemma 13). On the computational side, Algorithm 1 (with the modification of $a\sim\text{Uniform}(\mathcal{A})$ in the online data collection step) requires solving a least squares regression problem at every round. The objective of this regression problem is a convex functional of the hypothesis $f$ over the constraint set $\mathcal{F}$. While this is not fully efficiently implementable due to the potentially non-convex constraint set $\mathcal{F}$ (e.g., $\phi$ could be complicated), our regression problem is still much simpler than the oracle models considered in prior works for this model [Agarwal et al., 2020a, Sekhari et al., 2021, Uehara et al., 2021, Modi et al., 2021].

Figure 2: A hard instance for offline RL [Zhan et al., 2022, reproduced with permission]

5.4 Why don’t offline RL methods work?

One may wonder why pure offline RL methods fail to learn when the transfer coefficient is bounded, and why online access helps. We illustrate with the MDP construction developed by Zhan et al. [2022] and Chen and Jiang [2022], visualized in Figure 2.

Consider two MDPs $\{M_{1},M_{2}\}$ with $H=2$, three states $\{A,B,C\}$, two actions $\{L,R\}$, and the fixed start state $A$. The two MDPs have the same dynamics but different rewards. In both, actions from state $B$ yield reward $1$. In $M_{1}$, $(C,R)$ yields reward $1$, while $(C,L)$ yields reward $1$ in $M_{2}$. All other rewards are $0$. In both $M_{1}$ and $M_{2}$, an optimal policy is $\pi^{*}(A)=L$ and $\pi^{*}(B)=\pi^{*}(C)=\mathrm{Uniform}(\{L,R\})$. With $\mathcal{F}=\{Q_{1}^{\star},Q_{2}^{\star}\}$, where $Q_{j}^{\star}$ is the optimal $Q$-function for $M_{j}$, one can easily verify that $\mathcal{F}$ satisfies Bellman completeness for both MDPs. Finally, with an offline distribution $\nu$ supported on states $A$ and $B$ only (with no coverage of state $C$), we have sufficient coverage of $d^{\pi^{\star}}$. However, samples from $\nu$ are unable to distinguish between $Q_{1}^{\star}$ and $Q_{2}^{\star}$ (or $M_{1}$ and $M_{2}$), since state $C$ is not supported by $\nu$. Unfortunately, adversarial tie-breaking may result in the greedy policies of $Q_{1}^{\star}$ and $Q_{2}^{\star}$ visiting state $C$, where we have no information about the correct action.

This issue has been documented before, and to address it with pure offline RL, existing approaches require additional structural assumptions. For instance, Chen and Jiang [2022] assume that $Q^{\star}$ has a gap, which usually does not hold when the action space is large or continuous. Xie et al. [2021a] assume policy-dependent Bellman completeness for every possible policy $\pi\in\Pi$ (which is much stronger than our assumption), and Zhan et al. [2022] assume a somewhat non-interpretable realizability condition on a "value" function that does not obey the standard Bellman equation. In contrast, by combining offline and online data, our approach focuses on functions that have small Bellman residual under both the offline distribution and the on-policy distributions, which, together with the offline data coverage assumption, ensures near optimality. It is easy to see that the hybrid approach succeeds on the instance in Figure 2.
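To make the failure mode concrete, here is a tiny numerical check of the example above; the transition structure ($A$ goes to $B$ under $L$ and to $C$ under $R$) is our reading of Figure 2, and the script is only an illustrative sketch.

```python
import numpy as np

# Hard instance of Figure 2: M1 and M2 share dynamics but differ in the rewards at state C.
A_, B_, C_ = 0, 1, 2
L_, R_ = 0, 1
next_state = {(A_, L_): B_, (A_, R_): C_}   # assumed deterministic dynamics at h = 0

def q_star(which):
    """Optimal Q-functions of M1 (which=1) or M2 (which=2); they differ only at state C."""
    q1 = np.zeros((3, 2))
    q1[B_, :] = 1.0                          # both actions at B give reward 1 in both MDPs
    q1[C_, R_ if which == 1 else L_] = 1.0   # (C, R) rewards 1 in M1; (C, L) rewards 1 in M2
    q0 = np.zeros((3, 2))
    for (s, a), s_nxt in next_state.items():
        q0[s, a] = 0.0 + np.max(q1[s_nxt])   # rewards at h = 0 are 0; back up from h = 1
    return q0, q1

def squared_bellman_error_on_nu(q0, q1):
    """Squared Bellman error on nu, which is supported on A (h = 0) and B (h = 1) only.

    The relevant rewards (0 at A, 1 at B) are identical in M1 and M2, so this quantity does not
    depend on which MDP generated the offline data.
    """
    err = 0.0
    for a in (L_, R_):
        err += (q0[A_, a] - (0.0 + np.max(q1[next_state[(A_, a)]]))) ** 2
        err += (q1[B_, a] - 1.0) ** 2
    return err

for which in (1, 2):
    print(which, squared_bellman_error_on_nu(*q_star(which)))   # both print 0.0: nu cannot tell them apart
```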

6 Experiments

In this section we discuss empirical results comparing Hy-Q to several representative RL methods on two challenging benchmarks. Our experiments focus on answering the following questions:

  1. Can Hy-Q efficiently solve problems that SOTA offline RL methods simply cannot?

  2. Can Hy-Q, via the use of offline data, significantly improve the sample efficiency of online RL?

  3. Does Hy-Q scale to challenging deep-RL benchmarks?

Figure 3: The rich observation combination lock [Misra et al., 2020, Zhang et al., 2022c].

Our empirical results provide positive answers to all of these questions. To study the first two, we consider the diabolical combination lock environment [Misra et al., 2020, Zhang et al., 2022c], a synthetic environment designed to be particularly challenging for online exploration. The synthetic nature allows us to carefully control the offline data distribution to modulate the difficulty of the setup, and also to compare with a provably efficient baseline [Zhang et al., 2022c]. To study the third question, we consider the Montezuma's Revenge benchmark from the Arcade Learning Environment, which is one of the most challenging empirical benchmarks with high-dimensional image inputs, largely due to the difficulty of exploration. Additional details are deferred to Appendix E.

Hy-Q implementation.

We largely follow Algorithm 1 in our implementation for the combination lock experiment. In particular, we use function approximation similar to Zhang et al. [2022c] and a minibatch Adam update on Eq. (1), with the same sampling proportions as in the pseudocode. For Montezuma's Revenge, in addition to minibatch optimization, since the horizon of the environment is not fixed, we deploy a discounted version of Hy-Q. Concretely, the target value in the Bellman error is calculated from the output of a periodically updated target network, multiplied by a discount factor. We refer the reader to Appendix E for more details.
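For concreteness, the following is a minimal sketch of the discounted update described above, written in PyTorch; the network size, hyperparameters, and the offline/online minibatch split are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

def make_q_net(obs_dim, num_actions):
    return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

def hyq_update(q_net, target_net, optimizer, offline_batch, online_batch, gamma=0.99):
    """One minibatch Adam step on the squared Bellman error over offline + online transitions.

    Each batch is a tuple of tensors (obs, act, rew, next_obs, done); the two batches are concatenated
    so that offline data keeps a fixed weight throughout training, as discussed in Section 4.
    """
    obs, act, rew, next_obs, done = [torch.cat([o, n]) for o, n in zip(offline_batch, online_batch)]
    with torch.no_grad():
        # Discounted target from the periodically updated target network.
        target = rew + gamma * (1.0 - done) * target_net(next_obs).max(dim=1).values
    q = q_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
    loss = ((q - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a sketch the target network would be synchronized every fixed number of gradient steps, e.g., via `target_net.load_state_dict(q_net.state_dict())`.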

Baselines.

We include representative algorithms from four categories: (1) for imitation learning we use Behavior Cloning (Bc) [Bain and Sammut, 1995]; (2) for offline RL we use Conservative Q-Learning (Cql) [Kumar et al., 2020], due to its successful demonstrations on some Atari games; (3) for online RL we use Briee [Zhang et al., 2022c] for the combination lock (Briee is currently the state-of-the-art method for this environment; in particular, Misra et al. [2020] show that many deep RL baselines fail on it) and Random Network Distillation (Rnd) [Burda et al., 2018] for Montezuma's Revenge; and (4) as a hybrid RL baseline we use Deep Q-learning from Demonstrations (Dqfd) [Hester et al., 2018]. We note that Dqfd and prior hybrid RL methods combine expert demonstrations with online interactions, but are not necessarily designed to work with general offline datasets.

Results summary.

Overall, we find that Hy-Q performs favorably against all of these baselines. Compared with offline RL, imitation learning, and prior hybrid methods, Hy-Q is significantly more robust in the presence of a low quality offline data distribution. Compared with online methods, Hy-Q offers order-of-magnitude savings in the total experience.

Reproducibility.

We release our code at https://github.com/yudasong/HyQ. We also include implementation details in Appendix E.

Figure 4: The learning curve for the combination lock with $H=100$. The plots show the median and 80th/20th quantiles for 5 replicates. Pure offline and IL methods are visualized as dashed horizontal lines (in the left plot, CQL overlaps with BC). Note that we report the number of samples, while Zhang et al. [2022c] report the number of episodes.

6.1 Combination Lock

The combination lock benchmark is depicted in Figure 3 and consists of horizon $H=100$, three latent states for each time step, and $10$ actions in each state. Each state has a single "good" action that advances down a chain of favorable latent states (white) from which the optimal reward can be obtained. A single incorrect action transitions to an absorbing chain (black latent states) with suboptimal value. The agent operates on high-dimensional continuous observations emitted from the latent states and must use function approximation to succeed. This is an extremely challenging problem for which many deep RL methods are known to fail [Misra et al., 2020], in part because (uniform) random exploration has only a $10^{-H}$ probability of obtaining the optimal reward.

On the other hand, the model has low bilinear rank, so there do exist online RL algorithms that are provably sample-efficient: Briee [Zhang et al., 2022c] currently obtains the state-of-the-art sample complexity. However, its sample complexity is still quite large, and we hope that hybrid RL can address this shortcoming. We are not aware of any experiments with offline RL methods on this benchmark.

We construct two offline datasets for the experiments, both of which are derived from the optimal policy $\pi^{\star}$, which always picks the "good" actions and stays in the chain of white states. In the optimal trajectory dataset, we collect full trajectories by following $\pi^{\star}$ with $\epsilon$-greedy exploration, with $\epsilon=1/H$; we also add some noise by making the agent act randomly at timestep $H/2$. In the optimal occupancy dataset, we collect transition tuples from the state-occupancy measure of $\pi^{\star}$ with random actions (formally, we sample $h\sim\textrm{Unif}([H])$, $s\sim d_{h}^{\pi^{\star}}$, $a\sim\textrm{Unif}(\mathcal{A})$, $r\sim R(s,a)$, $s^{\prime}\sim P(s,a)$). Both datasets have bounded concentrability coefficients (and hence transfer coefficients) with respect to $\pi^{\star}$, but the second dataset is much more challenging since the actions in the offline dataset do not directly provide information about $\pi^{\star}$, as they do in the former.
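For clarity, here is a schematic of the two dataset constructions, written against a generic episodic-simulator interface; `reset_fn`, `step_fn`, `pi_star`, and `actions` are assumed callables/lists for illustration, not the benchmark's actual API.

```python
import random

def collect_trajectory_dataset(reset_fn, step_fn, pi_star, actions, H, num_traj, eps):
    """Optimal-trajectory dataset: roll out pi_star with eps-greedy noise, acting randomly at step H/2."""
    data = {h: [] for h in range(H)}
    for _ in range(num_traj):
        s = reset_fn()
        for h in range(H):
            a = random.choice(actions) if (h == H // 2 or random.random() < eps) else pi_star(h, s)
            r, s_next = step_fn(h, s, a)
            data[h].append((s, a, r, s_next))
            s = s_next
    return data

def collect_occupancy_dataset(reset_fn, step_fn, pi_star, actions, H, num_tuples):
    """Optimal-occupancy dataset: h ~ Unif([H]), s ~ d_h^{pi*}, a ~ Unif(A), one tuple per sample."""
    data = {h: [] for h in range(H)}
    for _ in range(num_tuples):
        h = random.randrange(H)
        s = reset_fn()
        for t in range(h):                    # roll in with pi_star to obtain a sample from d_h^{pi*}
            _, s = step_fn(t, s, pi_star(t, s))
        a = random.choice(actions)
        r, s_next = step_fn(h, s, a)
        data[h].append((s, a, r, s_next))
    return data
```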

The results are presented in Figure 4. First, we observe that Hy-Q reliably solves the task under both offline distributions with relatively low sample complexity (500k offline samples and at most 25m online samples). In comparison, Bc fails completely since both datasets contain random actions. Cql can solve the task using the optimal trajectory-based dataset with a sample complexity comparable to the combined sample size of Hy-Q. However, Cql fails on the optimal occupancy-based dataset since the actions themselves are not informative. Indeed, the pessimism-inducing regularizer of Cql is constant on this dataset, so the algorithm reduces to Fqi, which provably fails when the offline data does not have global coverage (i.e., coverage of every state-action pair). Finally, Hy-Q can solve the task with a factor of 5-10 reduction in samples (online plus offline) compared with Briee. This demonstrates the robustness and sample efficiency provided by hybrid RL.

Figure 5: The learning curve for Montezuma's Revenge. The plots show the median and 80th/20th quantiles for 5 replicates. Pure offline and IL methods, as well as dataset qualities, are visualized as dashed horizontal lines. "Expert" denotes $V^{\pi^{e}}$ and "Offline" denotes the average trajectory reward in the offline dataset. The y-axis denotes the (moving) average of 100 episodes for the methods involving online interactions. Note that Cql and Bc overlap in the last plot.
Figure 6: Screenshots of the training processes of our approach Hy-Q (top row) and Rnd (bottom row). With only 0.1m offline samples of which half is from a random policy and half is from a high quality policy (with reward around 6400), our approach learns significantly faster than Rnd.

6.2 Montezuma’s Revenge

To answer the third question, we turn to Montezuma's Revenge, an extremely challenging image-based benchmark environment with sparse rewards. We follow the setup of Burda et al. [2018] and introduce stochasticity into the original dynamics: with probability 0.25 the environment executes the previous action instead of the current one. For the offline datasets, we first train an "expert policy" $\pi^{e}$ via Rnd to achieve $V^{\pi^{e}}\approx 6400$. We create three datasets by mixing samples from $\pi^{e}$ with those from a random policy: the easy dataset contains only samples from $\pi^{e}$, the medium dataset mixes in an 80/20 proportion (80% from $\pi^{e}$), and the hard dataset mixes in a 50/50 proportion. We record full trajectories from both policies in the offline dataset, but measure the proportion in terms of the number of transition tuples rather than trajectories. We provide 0.1 million offline samples to the hybrid methods, and 1 million samples to the offline and IL methods.
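The mixing rule described above (whole trajectories, with the mixture proportion measured in transition tuples) can be sketched as follows; the function and the target counts are illustrative assumptions.

```python
def mix_by_tuple_proportion(expert_trajs, random_trajs, total_tuples, expert_fraction):
    """Build an offline dataset of roughly `total_tuples` transitions, with `expert_fraction` of them
    coming from expert trajectories; whole trajectories are added until each quota is met."""
    def take(trajs, quota):
        out, count = [], 0
        for traj in trajs:                    # traj is a list of (s, a, r, s') tuples
            if count >= quota:
                break
            out.append(traj)
            count += len(traj)
        return out, count

    expert_part, n_exp = take(expert_trajs, int(expert_fraction * total_tuples))
    random_part, _ = take(random_trajs, total_tuples - n_exp)
    return expert_part + random_part

# e.g., the "medium" dataset: 100k tuples, 80% from the expert policy, 20% from a random policy:
# offline_trajs = mix_by_tuple_proportion(expert_trajs, random_trajs, 100_000, 0.8)
```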

Results are displayed in Figure 5. Cql fails completely on all datasets. Dqfd performs well on the easy dataset due to its supervised-learning-style large-margin loss [Piot et al., 2014], which imitates the policies in the offline dataset. However, Dqfd's performance drops as the quality of the offline dataset degrades (medium), and it fails when the offline dataset is of low quality (hard), where one cannot simply treat offline samples as expert samples. We also observe that Bc is a competitive baseline in the first two settings due to the high fraction of expert samples, and we thus view these problems as relatively easy to solve. Hy-Q is the only method that performs well on the hard dataset, on which Bc's performance is quite poor. We also include the comparison with Rnd in Figure 1 and Figure 6: with only 100k offline samples from any of the three datasets, Hy-Q is over 10x more efficient in terms of online sample complexity.

7 Conclusion

We demonstrate the potential of hybrid RL with Hy-Q, a simple, theoretically principled, and empirically effective algorithm. Our theoretical results showcase how Hy-Q circumvents the computational issues of pure offline or online RL, while our empirical results highlight its robustness and sample efficiency. Yet Hy-Q is perhaps the most natural hybrid algorithm, and we are optimistic that there is much more potential to unlock in the hybrid setting, which we look forward to studying in future work.

Acknowledgement

AS thanks Karthik Sridharan for useful discussions. WS acknowledges funding support from NSF IIS-2154711. We thank Simon Zhai for a careful reading of the manuscript and for improving its technical correctness. We also thank Uri Sherman for discussions regarding the computational efficiency claims in the original draft.

References

  • Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, 2019.
  • Agarwal et al. [2019] Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. 2019. URL https://rltheorybook.github.io/.
  • Agarwal et al. [2020a] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank MDPs. In Advances in Neural Information Processing Systems, 2020a.
  • Agarwal et al. [2020b] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, 2020b.
  • Badia et al. [2020] Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In International Conference on Machine Learning, 2020.
  • Bagnell et al. [2003] James Bagnell, Sham M Kakade, Jeff Schneider, and Andrew Ng. Policy search by dynamic programming. Advances in Neural Information Processing Systems, 2003.
  • Bain and Sammut [1995] Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, 1995.
  • Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.
  • Beygelzimer et al. [2011] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, 2011.
  • Bubeck et al. [2015] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 2015.
  • Burda et al. [2018] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2018.
  • Chang et al. [2021] Jonathan Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, and Wen Sun. Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 2021.
  • Chen and Jiang [2019] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 2019.
  • Chen and Jiang [2022] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps. arXiv:2203.13935, 2022.
  • Dani et al. [2008] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
  • Dann et al. [2022] Chris Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, and Karthik Sridharan. Guarantees for epsilon-greedy reinforcement learning with function approximation. In International Conference on Machine Learning, pages 4666–4689. PMLR, 2022.
  • Dann et al. [2018] Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, 2018.
  • Du et al. [2021] Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning, 2021.
  • Foster et al. [2021] Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory, 2021.
  • Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Advances in Neural Information Processing Systems, 2021.
  • Guo et al. [2022] Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Avila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, and Bilal Piot. BYOL-explore: Exploration by bootstrapped prediction. arXiv:2206.08332, 2022.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. arXiv:1812.05905, 2018.
  • Hester et al. [2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, 2018.
  • Jia et al. [2022] Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, and Hao Su. Improving policy optimization with generalist-specialist learning. In International Conference on Machine Learning, pages 10104–10119. PMLR, 2022.
  • Jiang and Huang [2020] Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 2020.
  • Jiang et al. [2017] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.
  • Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, 2020.
  • Jin et al. [2021a] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 2021a.
  • Jin et al. [2021b] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, 2021b.
  • Kakade [2001] Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 2001.
  • Kakade and Langford [2002] Sham M Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Kane et al. [2022] Daniel Kane, Sihan Liu, Shachar Lovett, and Gaurav Mahajan. Computational-statistical gaps in reinforcement learning. In Conference on Learning Theory, 2022.
  • Kostrikov et al. [2021] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv:2110.06169, 2021.
  • Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Lee et al. [2022] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv:2005.01643, 2020.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Liu and Brunskill [2018] Yao Liu and Emma Brunskill. When simple exploration is sample efficient: Identifying sufficient conditions for random exploration to yield pac rl algorithms. arXiv preprint arXiv:1805.09045, 2018.
  • Misra et al. [2020] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, 2020.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • Modi et al. [2021] Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. Model-free representation learning and exploration in low-rank MDPs. arXiv:2102.07035, 2021.
  • Munos and Szepesvári [2008] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.
  • Nair et al. [2018] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics and Automation, 2018.
  • Nair et al. [2020] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv:2006.09359, 2020.
  • Niu et al. [2022] Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, and Xianyuan Zhan. When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. arXiv preprint arXiv:2206.13464, 2022.
  • Piot et al. [2014] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014.
  • Qiu et al. [2022] Shuang Qiu, Lingxiao Wang, Chenjia Bai, Zhuoran Yang, and Zhaoran Wang. Contrastive UCB: Provably efficient contrastive self-supervised learning in online reinforcement learning. In International Conference on Machine Learning, 2022.
  • Rajeswaran et al. [2017] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv:1709.10087, 2017.
  • Ross and Bagnell [2012] Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement learning. arXiv:1203.1007, 2012.
  • Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv:1511.05952, 2015.
  • Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and Shogi by planning with a learned model. Nature, 2020.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
  • Sekhari et al. [2021] Ayush Sekhari, Christoph Dann, Mehryar Mohri, Yishay Mansour, and Karthik Sridharan. Agnostic reinforcement learning with low-rank MDPs and rich observations. Advances in Neural Information Processing Systems, 2021.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Sun et al. [2019] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on learning theory, 2019.
  • Uchendu et al. [2022] Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, and Karol Hausman. Jump-start reinforcement learning. arXiv:2204.02372, 2022.
  • Uehara and Sun [2021] Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage. In International Conference on Learning Representations, 2021.
  • Uehara et al. [2021] Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in low-rank MDPs. arXiv:2110.04652, 2021.
  • Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.
  • Vecerik et al. [2017] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv:1707.08817, 2017.
  • Wang et al. [2021] Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, and Sham Kakade. Instabilities of offline rl with pre-trained neural representation. In International Conference on Machine Learning, 2021.
  • Wang et al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.
  • Xie and Jiang [2021] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International Conference on Machine Learning, 2021.
  • Xie et al. [2021a] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 2021a.
  • Xie et al. [2021b] Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in Neural Information Processing Systems, 2021b.
  • Yang and Wang [2019] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, 2019.
  • Yu et al. [2021] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. In Advances in Neural Information Processing Systems, 2021.
  • Zanette et al. [2020] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, 2020.
  • Zanette et al. [2021] Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in Neural Information Processing Systems, 2021.
  • Zhan et al. [2022] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR, 2022.
  • Zhang et al. [2022a] Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph Gonzalez, Dale Schuurmans, and Bo Dai. Making linear MDPs practical via contrastive representation learning. In International Conference on Machine Learning, 2022a.
  • Zhang et al. [2022b] Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. In International Conference on Artificial Intelligence and Statistics, 2022b.
  • Zhang et al. [2022c] Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, and Wen Sun. Efficient reinforcement learning in block MDPs: A model-free representation learning approach. In International Conference on Machine Learning, 2022c.

Appendix A Proofs for Section 5

Additional notation.

Throughout the appendix, we define the feature covariance matrix Σt;h\Sigma_{t;h} as

Σt;h=τ=1tXh(fτ)(Xh(fτ))+λ𝕀.\displaystyle\Sigma_{t;h}=\sum_{\tau=1}^{t}X_{h}(f^{\tau})\left(X_{h}(f^{\tau})\right)^{\top}+\lambda\mathbb{I}. (3)

Furthermore, given a distribution \beta\in\Delta({\mathcal{S}}\times\mathcal{A}) and a function f:{\mathcal{S}}\times\mathcal{A}\mapsto\mathbb{R}, we denote its weighted \ell_{2} norm by \|f\|_{2,\beta}:=\sqrt{\mathbb{E}_{s,a\sim\beta}f^{2}(s,a)}.

A.1 Supporting lemmas for Theorem 1

Before proving Theorem 1, we first present a few useful lemmas. We start with a standard least squares generalization bound, which we will use by noting that Algorithm 1 performs least squares regression on the empirical Bellman error. We defer the proof of Lemma 3 to Appendix B.

Lemma 3.

(Least squares generalization bound) Let R>0 and \delta\in(0,1), and consider a sequential function estimation setting with an instance space \mathcal{X} and target space \mathcal{Y}. Let \mathcal{H}:\mathcal{X}\mapsto[-R,R] be a class of real-valued functions. Let \mathcal{D}=\left\{(x_{1},y_{1}),\dots,(x_{T},y_{T})\right\} be a dataset of T points where x_{t}\sim\rho_{t}\vcentcolon={}\rho_{t}(x_{1:t-1},y_{1:t-1}), and y_{t} is sampled via the conditional probability p(\cdot\mid x_{t}):

ytp(xt):=h(xt)+εt,\displaystyle y_{t}\sim p(\cdot\mid x_{t})\vcentcolon={}h^{*}(x_{t})+\varepsilon_{t},

where the function hh^{*} satisfies approximate realizability i.e.

infh1Tt=1T𝔼xρt[(h(x)h(x))2]γ,\inf_{h\in\mathcal{H}}\frac{1}{T}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[\left(h^{*}(x)-h(x)\right)^{2}\right]\leq\gamma,

and \left\{\varepsilon_{t}\right\}_{t=1}^{T} are noise variables satisfying \mathbb{E}[y_{t}\mid x_{t}]=h^{\ast}(x_{t}). Additionally, suppose that \max_{t}\lvert y_{t}\rvert\leq R and \max_{x}\left\lvert h^{*}(x)\right\rvert\leq R. Then the least squares solution \widehat{h}\leftarrow\mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\sum_{t=1}^{T}\left(h(x_{t})-y_{t}\right)^{2} satisfies, with probability at least 1-\delta,

t=1T𝔼xρt[(h^(x)h(x))2]\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[(\widehat{h}(x)-h^{*}(x))^{2}\right] 3γT+256R2log(2||/δ).\displaystyle\leq 3\gamma T+256R^{2}\log(2\lvert\mathcal{H}\rvert/\delta).

The above lemma is an extension of the standard agnostic least squares generalization bound from the i.i.d. setting to the non-i.i.d. case, where the training data forms a martingale sequence. We state the result for the case where realizability holds only approximately, up to the approximation error \gamma; in all of our proofs, however, we invoke this result with \gamma=0.
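As a purely numerical sanity check (not part of the formal argument), the sketch below instantiates Lemma 3 with a small finite class of threshold functions and an adaptively chosen design, and compares the realized squared error against the stated bound; the function class, the adaptive design, and the noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
R, T, delta = 1.0, 2000, 0.05

# Finite class H of threshold functions h_c(x) = 0.8 * sign(x - c), so |h| <= R.
thresholds = np.linspace(-1.0, 1.0, 41)
def h(c, x):
    return 0.8 * np.sign(x - c)

c_star = thresholds[25]      # h* lies in the class, so realizability holds (gamma = 0)

# Adaptive (non-i.i.d.) design: x_t depends on the previously observed labels.
xs, ys = [], []
for t in range(T):
    center = float(np.mean(ys)) if ys else 0.0
    x_t = float(np.clip(center, -0.5, 0.5)) + rng.uniform(-1.0, 1.0)
    y_t = h(c_star, x_t) + rng.uniform(-0.2, 0.2)   # E[y_t | x_t] = h*(x_t), |y_t| <= R
    xs.append(x_t)
    ys.append(y_t)
xs, ys = np.array(xs), np.array(ys)

# Least squares empirical risk minimization over the finite class.
c_hat = thresholds[np.argmin([np.mean((h(c, xs) - ys) ** 2) for c in thresholds])]

# In-sample proxy for the left-hand side of Lemma 3, compared with its bound (gamma = 0).
realized = np.sum((h(c_hat, xs) - h(c_star, xs)) ** 2)
bound = 256 * R ** 2 * np.log(2 * len(thresholds) / delta)
print(f"realized squared error: {realized:.2f}  vs  bound: {bound:.2f}")
```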

In the next two lemmas, we bound each part of the regret decomposition in terms of the Bellman error of the value function f.

Lemma 4 (Performance difference lemma).

For any function f=(f0,,fH1)f=\left(f_{0},\dots,f_{H-1}\right) where fh:𝒮×𝒜f_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R} and h[H1]h\in[H-1], we have

𝔼sd0[maxaf0(s,a)V0πf(s)]h=0H1|𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]|,\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V_{0}^{\pi^{f}}(s)]\leq\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f}}}\left[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)\right]\right\rvert,

where we define fH(s,a)=0f_{H}(s,a)=0 for all s,as,a.

Proof.

We start the proof by noting that \pi^{f}_{0}(s)=\mathop{\mathrm{argmax}}_{a}f_{0}(s,a). Then we have:

𝔼sd0[maxaf0(s,a)Vπf(s)]\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V^{\pi^{f}}(s)] =𝔼sd0[𝔼aπ0f(s)f0(s,a)V0πf(s)]\displaystyle=\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi_{0}^{f}(s)}f_{0}(s,a)-V_{0}^{\pi^{f}}(s)]
=𝔼sd0[𝔼aπ0f(s)f0(s,a)𝒯f1(s,a)]+𝔼sd0[𝔼aπ0f(s)𝒯f1(s,a)V0πf(s)]\displaystyle=\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi_{0}^{f}(s)}f_{0}(s,a)-\mathcal{T}f_{1}(s,a)]+\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi^{f}_{0}(s)}\mathcal{T}f_{1}(s,a)-V_{0}^{\pi^{f}}(s)]
=𝔼s,ad0πf[f0(s,a)𝒯f1(s,a)]+\displaystyle=\mathbb{E}_{s,a\sim d^{\pi^{f}}_{0}}[f_{0}(s,a)-\mathcal{T}f_{1}(s,a)]+
\displaystyle\;\;\;\;\mathbb{E}_{s\sim d_{0}}[\mathbb{E}_{a\sim\pi_{0}^{f}(s)}[R(s,a)+\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s,a)}\max_{a^{\prime}}f_{1}(s^{\prime},a^{\prime})-R(s,a)-\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s,a)}V^{\pi^{f}}_{1}(s^{\prime})]]
=𝔼s,ad0πf[f0(s,a)𝒯f1(s,a)]+𝔼sd1πf[maxaf1(s,a)V1πf(s)]\displaystyle=\mathbb{E}_{s,a\sim d^{\pi^{f}}_{0}}[f_{0}(s,a)-\mathcal{T}f_{1}(s,a)]+\mathbb{E}_{s\sim d^{\pi^{f}}_{1}}[\max_{a}f_{1}(s,a)-V_{1}^{\pi^{f}}(s)] (4)

Then by recursively applying the same procedure on the second term in (4), we have

𝔼sd0[maxaf0(s,a)Vπf(s)]\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V^{\pi^{f}}(s)] =h=0H1𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]+𝔼sdHπf[maxafH(s,a)VHπf(s)].\displaystyle=\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{f}}_{h}}[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)]+\mathbb{E}_{s\sim d^{\pi^{f}}_{H}}[\max_{a}f_{H}(s,a)-V_{H}^{\pi^{f}}(s)].

Finally, for h=H, recall that we set f_{H}(s,a)=0 and V^{\pi^{f}}_{H}=0 for notational simplicity. Thus we have:

𝔼sd0[maxaf0(s,a)Vπf(s)]\displaystyle\mathbb{E}_{s\sim d_{0}}[\max_{a}f_{0}(s,a)-V^{\pi^{f}}(s)] =h=0H1𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]\displaystyle=\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{f}}_{h}}[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)]
h=0H1|𝔼s,adhπf[fh(s,a)𝒯fh+1(s,a)]|.\displaystyle\leq\sum_{h=0}^{H-1}\left|\mathbb{E}_{s,a\sim d^{\pi^{f}}_{h}}[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)]\right|.

We now show how to bound the other half of the regret decomposition:

Lemma 5.

Let πe=(π0e,,πH1e)\pi^{e}=(\pi^{e}_{0},\dots,\pi^{e}_{H-1}) be a comparator policy, and consider any value function f=(f0,,fH1)f=(f_{0},\dots,f_{H-1}) where fh:𝒮×𝒜f_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}. Then,

𝔼sd0[V0πe(s)maxaf0(s,a)]i=0H1𝔼s,adiπe[𝒯fi+1(s,a)fi(s,a)],\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f_{0}(s,a)\right]\leq\sum_{i=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{i}}[\mathcal{T}f_{i+1}(s,a)-f_{i}(s,a)],

where we defined fH(s,a)=0f_{H}(s,a)=0 for all s,as,a.

Proof.

The proof is similar to the proof of Lemma 4, and we start with the fact that maxaf(s,a)f(s,a),a\max_{a}f(s,a)\geq f(s,a^{\prime}),\forall a^{\prime}, including actions sampled from πe\pi^{e}:

𝔼sd0[V0πe(s)maxaf0(s,a)]\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V_{0}^{\pi^{e}}(s)-\max_{a}f_{0}(s,a)\right] 𝔼s,ad0πe[Q0πe(s,a)f0(s,a)]\displaystyle\leq\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[Q_{0}^{\pi^{e}}(s,a)-f_{0}(s,a)\right]
=𝔼s,ad0πe[Q0πe(s,a)𝒯f1(s,a)+𝒯f1(s,a)f0(s,a)]\displaystyle=\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[Q_{0}^{\pi^{e}}(s,a)-{\mathcal{T}}f_{1}(s,a)+{\mathcal{T}}f_{1}(s,a)-f_{0}(s,a)\right]
=𝔼s,ad0πe[𝔼s𝒫(s,a)V1πe(s)maxaf1(s,a)]+𝔼s,ad0πe[𝒯f1(s,a)f0(s,a)]\displaystyle=\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s,a)}V_{1}^{\pi^{e}}(s^{\prime})-\max_{a^{\prime}}f_{1}(s^{\prime},a^{\prime})\right]+\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[{\mathcal{T}}f_{1}(s,a)-f_{0}(s,a)\right]
=𝔼sd1πe[V1πe(s)maxaf1(s,a)]+𝔼s,ad0πe[𝒯f1(s,a)f0(s,a)]\displaystyle=\mathbb{E}_{s\sim d^{\pi_{e}}_{1}}\left[V_{1}^{\pi^{e}}(s)-\max_{a}f_{1}(s,a)\right]+\mathbb{E}_{s,a\sim d^{\pi_{e}}_{0}}\left[{\mathcal{T}}f_{1}(s,a)-f_{0}(s,a)\right] (5)

Again by recursively applying the same procedure on the first term in (5), we have

𝔼sd0[V0πe(s)maxaf0(s,a)]\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V_{0}^{\pi^{e}}(s)-\max_{a}f_{0}(s,a)\right] 𝔼sdHπe[VHπe(s)maxafH(s,a)]+h=0H1𝔼s,adhπe[𝒯fh+1(s,a)fh(s,a)],\displaystyle\leq\mathbb{E}_{s\sim d^{\pi_{e}}_{H}}\left[V_{H}^{\pi^{e}}(s)-\max_{a}f_{H}(s,a)\right]+\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{h}}\left[{\mathcal{T}}f_{h+1}(s,a)-f_{h}(s,a)\right],

Recalling that f_{H}(s,a)=0 and V^{\pi^{e}}_{H}=0, we have

𝔼sd0[V0πe(s)maxaf0(s,a)]\displaystyle\mathbb{E}_{s\sim d_{0}}\left[V_{0}^{\pi^{e}}(s)-\max_{a}f_{0}(s,a)\right] h=0H1𝔼s,adhπe[𝒯fh+1(s,a)fh(s,a)].\displaystyle\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{h}}\left[{\mathcal{T}}f_{h+1}(s,a)-f_{h}(s,a)\right].

The following result is useful for bilinear models when bounding the potential functions; it follows directly from the elliptical potential lemma [Lattimore and Szepesvári, 2020, Lemma 19.4].

Lemma 6.

Let Xh(f1),,Xh(fT)dX_{h}(f^{1}),\dots,X_{h}(f^{T})\in\mathbb{R}^{d} be a sequence of vectors with Xh(ft)BX<\|X_{h}(f^{t})\|\leq B_{X}<\infty for all tTt\leq T. Then,

t=1TXh(ft)Σt1;h12dTlog(1+TBX2λd),\displaystyle\sum_{t=1}^{T}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\leq\sqrt{2dT\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right)},

where the matrix \Sigma_{t;h}\vcentcolon={}\sum_{\tau=1}^{t}X_{h}(f^{\tau})X_{h}(f^{\tau})^{\top}+\lambda\mathbb{I} for t\in[T] and \lambda\geq B^{2}_{X}, and the matrix norm is \|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}=\sqrt{X_{h}(f^{t})^{\top}\Sigma_{t-1;h}^{-1}X_{h}(f^{t})}.

Proof.

Since λBX2\lambda\geq B_{X}^{2}, we have that

Xh(ft)Σt1;h121λXh(ft)21.\displaystyle\|X_{h}(f^{t})\|^{2}_{\Sigma^{-1}_{t-1;h}}\leq\frac{1}{\lambda}\|X_{h}(f^{t})\|^{2}\leq 1.

Thus, using elliptical potential lemma [Lattimore and Szepesvári, 2020, Lemma 19.4], we get that

t=1TXh(ft)Σt1;h122dlog(1+TBX2λd).\displaystyle\sum_{t=1}^{T}\|X_{h}(f^{t})\|^{2}_{\Sigma_{t-1;h}^{-1}}\leq 2d\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right).

The desired bound follows from Jensen’s inequality which implies that

t=1TXh(ft)Σt1;h1Tt=1TXh(ft)Σt1;h122Tdlog(1+TBX2λd).\displaystyle\sum_{t=1}^{T}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\leq\sqrt{T\cdot\sum_{t=1}^{T}\|X_{h}(f^{t})\|^{2}_{\Sigma_{t-1;h}^{-1}}}\leq\sqrt{2Td\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right)}.
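The following is a small numerical sanity check of Lemma 6 (illustrative only, using synthetic unit-norm vectors): it accumulates random vectors into the covariance matrix \Sigma_{t;h} and compares the sum of potentials against the stated bound.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, B_X = 5, 500, 1.0
lam = B_X ** 2                        # lambda >= B_X^2, as required by the lemma

Sigma = lam * np.eye(d)               # Sigma_{0;h} = lambda * I
total = 0.0
for t in range(T):
    x = rng.normal(size=d)
    x = B_X * x / np.linalg.norm(x)                   # ||X_h(f^t)|| <= B_X
    total += np.sqrt(x @ np.linalg.solve(Sigma, x))   # ||X_h(f^t)||_{Sigma_{t-1;h}^{-1}}
    Sigma += np.outer(x, x)                           # update to Sigma_{t;h}

bound = np.sqrt(2 * d * T * np.log(1 + T * B_X ** 2 / (lam * d)))
print(f"sum of potentials: {total:.2f}  <=  bound: {bound:.2f}")
```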

A.2 Proof of Theorem 1

Before delving into the proof, we first state the following generalization bound for the FQI updates.

Lemma 7 (Bellman error bound for FQI).

Let \delta\in(0,1). For h\in[H-1] and t\in[T], let f^{t+1}_{h} be the value function estimate for time step h computed via the least squares regression (1) on the datasets \left(\mathcal{D}^{\nu}_{h},\mathcal{D}^{1}_{h},\dots,\mathcal{D}^{t}_{h}\right) in iteration t of Algorithm 1. Then, with probability at least 1-\delta, for any h\in[H-1] and t\in[T],

fht+1𝒯fh+1t+12,νh21moff256Vmax2log(2HT||/δ)=:Δoff,\displaystyle\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|_{2,\nu_{h}}^{2}\leq\frac{1}{m_{\mathrm{off}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\Delta_{\mathrm{off}},
and
τ=1tfht+1𝒯fh+1t+12,μhτ21mon256Vmax2log(2HT||/δ)=:Δon,\displaystyle\sum_{\tau=1}^{t}\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|^{2}_{2,\mu^{\tau}_{h}}\leq\frac{1}{m_{\mathrm{on}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\Delta_{\mathrm{on}},

where \nu_{h} denotes the offline data distribution at time h, and the distribution \mu^{\tau}_{h}\in\Delta(\mathcal{S}\times\mathcal{A}) is defined such that (s,a)\sim d^{\pi^{\tau}}_{h}.

Proof.

Fix t[T]t\in[T], h[H1]h\in[H-1] and fh+1t+1h+1f^{t+1}_{h+1}\in\mathcal{F}_{h+1} and consider the regression problem ((1) in the iteration tt of Algorithm 1):

fht+1argminfh{moff𝔼^𝒟hν(f(s,a)rmaxafh+1t+1(s,a))2+monτ=1t𝔼^𝒟hτ(f(s,a)rmaxafh+1t+1(s,a))2},\displaystyle f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\left\{m_{\mathrm{off}}\widehat{\mathbb{E}}_{\mathcal{D}^{\nu}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}+m_{\mathrm{on}}\sum_{\tau=1}^{t}\widehat{\mathbb{E}}_{\mathcal{D}^{\tau}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}\right\},

which can be thought of as regression problem

fht+1argminfh{𝔼^𝒟(f(s,a)rmaxafh+1t+1(s,a))2},\displaystyle f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\left\{\widehat{\mathbb{E}}_{\mathcal{D}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}\right\},

where dataset 𝒟\mathcal{D} consisting of n=moff+tmonn=m_{\mathrm{off}}+t\cdot m_{\mathrm{on}} samples {(xi,yi)}in\left\{(x_{i},y_{i})\right\}_{i\leq n} where

xi=(shi,ahi)andyi=ri+maxafh+1t+1(sh+1i,a).\displaystyle x_{i}=(s^{i}_{h},a^{i}_{h})\qquad\text{and}\qquad y^{i}=r^{i}+\max_{a}f_{h+1}^{t+1}(s^{i}_{h+1},a).

In particular, we define 𝒟\mathcal{D} such that the first moffm_{\mathrm{off}} samples {(xi,yi)}imoff=𝒟hν\left\{(x_{i},y_{i})\right\}_{i\leq m_{\mathrm{off}}}=\mathcal{D}^{\nu}_{h}, the next monm_{\mathrm{on}} samples {(xi,yi)}i=moff+1moff+mon=𝒟h1\left\{(x_{i},y_{i})\right\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}=\mathcal{D}^{1}_{h}, and so on where the samples {(xi,yi)}i=moff+(τ1)mon+1moff+τmon=𝒟hτ\left\{(x_{i},y_{i})\right\}_{i=m_{\mathrm{off}}+(\tau-1)m_{\mathrm{on}}+1}^{m_{\mathrm{off}}+\tau m_{\mathrm{on}}}=\mathcal{D}^{\tau}_{h}. Note that: (a) for any sample (x=(sh,ah),y=(r+maxafh+1t+1(sh+1,a)))(x=(s_{h},a_{h}),y=(r+\max_{a}f_{h+1}^{t+1}(s_{h+1},a))) in 𝒟\mathcal{D}, we have that

𝔼[yx]\displaystyle\operatorname{\mathbb{E}}\left[y\mid x\right] =𝔼sh+1P(sh,ah),rR(sh,ah)[r+maxafh+1t+1(sh+1,a)]\displaystyle=\operatorname{\mathbb{E}}_{s_{h+1}\sim P(s_{h},a_{h}),r\sim R(s_{h},a_{h})}\left[r+\max_{a}f_{h+1}^{t+1}(s_{h+1},a)\right]
\displaystyle=\mathcal{T}f_{h+1}^{t+1}(s_{h},a_{h}),

where the last equality is the definition of the Bellman operator; moreover, the Bellman completeness assumption guarantees that \mathcal{T}f^{t+1}_{h+1}\in\mathcal{F}_{h}, so realizability holds with \gamma=0 in Lemma 3; (b) for any sample, \left\lvert y\right\rvert\leq V_{\mathrm{max}} and f(s,a)\leq V_{\mathrm{max}} for all s,a; (c) our construction of \mathcal{D} implies that the i-th sample (x_{i},y_{i}) is generated as follows: x_{i} is drawn from a data generation scheme that depends on (x_{1:i-1},y_{1:i-1}), and y_{i} is drawn from a conditional probability distribution p(\cdot\mid x_{i}), exactly as required in Lemma 3; and finally, (d) the samples in \mathcal{D}^{\nu}_{h} are drawn from the offline distribution \nu_{h}, and the samples in \mathcal{D}^{\tau}_{h} are drawn such that s_{h}\sim d_{h}^{\pi^{\tau}} and a_{h}\sim\pi^{f^{\tau}}(s_{h}). Thus, using Lemma 3, we get that the least squares solution f^{t+1}_{h} satisfies

i=1n𝔼[(fht+1(si,ai)𝒯fh+1t+1(si,ai))2𝒟i]\displaystyle\sum_{i=1}^{n}\operatorname{\mathbb{E}}\left[(f^{t+1}_{h}(s^{i},a^{i})-\mathcal{T}f^{t+1}_{h+1}(s^{i},a^{i}))^{2}\mid\mathcal{D}_{i}\right] 256Vmax2log(2||/δ).\displaystyle\leq 256V_{\mathrm{max}}^{2}\log(2\lvert\mathcal{F}\rvert/\delta).

Using property (d) above, we get that

mofffht+1𝒯fh+1t+12,νh2+monτ=1tfht+1𝒯fh+1t+12,μhτ2256Vmax2log(2||/δ),\displaystyle m_{\mathrm{off}}\cdot\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|_{2,\nu_{h}}^{2}+m_{\mathrm{on}}\cdot\sum_{\tau=1}^{t}\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|^{2}_{2,\mu^{\tau}_{h}}\leq 256V_{\mathrm{max}}^{2}\log(2\lvert\mathcal{F}\rvert/\delta),

where the distribution \mu^{\tau}_{h}\in\Delta(\mathcal{S}\times\mathcal{A}) is defined by sampling s\sim d^{\pi^{\tau}}_{h} and a\sim\pi^{f^{\tau}}(s). Taking a union bound over h\in[H-1] and t\in[T], and bounding each term separately, gives the desired statement. ∎
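To make the regression step (1) concrete, the following is a minimal sketch of this update for a linear function class, where the argmin over \mathcal{F}_{h} has a closed form; the feature map, dataset containers, and function names are illustrative assumptions and not the implementation used in our experiments. Since each dataset enters (1) through its empirical mean scaled by its size, the update is equivalent to an unweighted least squares fit over the pooled offline and online samples (when |\mathcal{D}^{\nu}_{h}|=m_{\mathrm{off}} and |\mathcal{D}^{\tau}_{h}|=m_{\mathrm{on}}).

```python
import numpy as np


def hyq_regression_step(offline_h, online_batches_h, phi, f_next, num_actions):
    """One least squares update in the spirit of (1): fit w so that
    f_h(s, a) = w^T phi(s, a) regresses onto the target r + max_a' f_{h+1}(s', a').

    offline_h:        list of (s, a, r, s_next) tuples from the offline dataset D_h^nu.
    online_batches_h: list of online batches D_h^1, ..., D_h^t, each a list of tuples.
    phi:              feature map phi(s, a) -> 1-D numpy array.
    f_next:           callable f_{h+1}(s, a) -> float (use lambda s, a: 0.0 at h = H-1).
    """
    # Pooling the samples reproduces the m_off / m_on weighting in (1)
    # when |D_h^nu| = m_off and |D_h^tau| = m_on.
    data = list(offline_h)
    for batch in online_batches_h:
        data += list(batch)

    X = np.stack([phi(s, a) for (s, a, r, s_next) in data])
    y = np.array([r + max(f_next(s_next, b) for b in range(num_actions))
                  for (s, a, r, s_next) in data])

    # Ordinary least squares; lstsq handles rank-deficient feature matrices.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a, w=w: float(w @ phi(s, a))
```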

We next state a change-of-distribution lemma, which allows us to bound the expected Bellman error under the (s,a) distribution generated by f^{t} in terms of the expected squared Bellman errors under the data distributions of the previous policies, which are in turn controlled by the regression guarantee.

Lemma 8.

For any t0t\geq 0 and h[H1]h\in[H-1], we have

|Wh(ft),Xh(ft)|Xh(ft)Σt1;h1i=1t1𝔼s,adhfi[(fht𝒯fh+1t)2]+λBW2,\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[\left(f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right)^{2}\right]+\lambda B^{2}_{W}},

where \Sigma_{t-1;h} is defined in (3), and we use the notation d^{f^{i}}_{h} to denote d^{\pi^{f^{i}}}_{h}.

Proof.

Using Cauchy-Schwarz inequality, we get that

|Wh(ft),Xh(ft)|\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert Xh(ft)Σt1;h1Wh(ft)Σt1;h\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\|W_{h}(f^{t})\|_{\Sigma_{t-1;h}}
=Xh(ft)Σt1;h1(Wh(ft))Σt1Wh(ft)\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left(W_{h}(f^{t})\right)^{\top}\Sigma_{t-1}W_{h}(f^{t})}
=Xh(ft)Σt1;h1(Wh(ft))(i=1t1Xh(fi)Xh(fi)+λ𝕀)Wh(ft)\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left(W_{h}(f^{t})\right)^{\top}\left(\sum_{i=1}^{t-1}X_{h}(f^{i})X_{h}(f^{i})^{\top}+\lambda\mathbb{I}\right)W_{h}(f^{t})}
=Xh(ft)Σt1;h1i=1t1|Wh(ft),Xh(fi)|2+λWh(ft)2\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2}+\lambda\|W_{h}(f^{t})\|^{2}}
Xh(ft)Σt1;h1i=1t1|Wh(ft),Xh(fi)|2+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2}+\lambda B_{W}^{2}} (6)
Xh(ft)Σt1;h1i=1t1𝔼s,adhfi[(fht𝒯fh+1t)2]+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[\left(f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right)^{2}\right]+\lambda B_{W}^{2}}

where the inequality in the second last line holds by plugging in the bound on Wh(ft)\|W_{h}(f^{t})\|, and the last line holds by using Definition 2 which implies that

|Wh(ft),Xh(fi)|2\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2} =(𝔼s,adhfi[fht𝒯fh+1t])2𝔼s,adhfi[(fht𝒯fh+1t)2],\displaystyle=\left(\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right]\right)^{2}\leq\operatorname{\mathbb{E}}_{s,a\sim d^{f^{i}}_{h}}\left[\left(f^{t}_{h}-\mathcal{T}f^{t}_{h+1}\right)^{2}\right],

where the last inequality is due to Jensen’s inequality. ∎

We now have all the tools to prove Theorem 1. We first restate the bound with the exact problem-dependent parameters, assuming that B_{W} and B_{X} are constants, which are hidden in the order notation below.

Theorem (Theorem 1 restated).

Let moff=Tm_{\mathrm{off}}=T and mon=1m_{\mathrm{on}}=1. Then, with probability at least 1δ1-\delta, the cumulative suboptimality of Algorithm 1 is bounded as

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right).
Proof of Theorem 1.

Let πe\pi^{e} be any comparator policy with bounded transfer coefficient i.e.

Cπe:=max{0,maxfh=0H1𝔼s,adhπe[fh(s,a)𝒯fh+1(s,a)]h=0H1𝔼s,aνh[(fh(s,a)𝒯fh+1(s,a))2]}<.\displaystyle C_{\pi^{e}}:=\max\left\{0,~{}\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi^{e}}}\left[f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(f_{h}(s,a)-\mathcal{T}f_{h+1}(s,a)\right)^{2}\right]}}\right\}<\infty. (7)

We start by noting that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} =t=1T𝔼sd0[V0πe(s)V0πft(s)]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-V^{\pi^{f^{t}}}_{0}(s)\right]
=t=1T𝔼sd0[V0πe(s)maxaf0t(s,a)]+t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)].\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f_{0}^{t}(s,a)\right]+\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right]. (8)

For the first term on the right-hand side of (8), using Lemma 5 for each f^{t}, 1\leq t\leq T, we get

t=1T𝔼sd0[V0πe(s)maxaf0t(s,a)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[V^{\pi^{e}}_{0}(s)-\max_{a}f^{t}_{0}(s,a)\right] t=1Th=0H1𝔼s,adhπe[𝒯fh+1t(s,a)fht(s,a)]\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{e}}_{h}}[\mathcal{T}f^{t}_{h+1}(s,a)-f^{t}_{h}(s,a)]
t=1TCπeh=0H1𝔼s,aνh[(fht(s,a)𝒯fh+1t(s,a))2]\displaystyle\leq\sum_{t=1}^{T}C_{\pi^{e}}\cdot\sqrt{\sum_{h=0}^{H-1}\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(f_{h}^{t}(s,a)-\mathcal{T}f_{h+1}^{t}(s,a)\right)^{2}\right]}
=TCπeHΔoff,\displaystyle=TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}, (9)

where the second inequality follows from plugging in the definition of CπeC_{\pi_{e}} in (7). The last line follows from Lemma 7.

For the second term in (8), using Lemma 4 for each f^{t}, 1\leq t\leq T, we get

t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right] t=1Th=0H1|𝔼s,adhπft[fht(s,a)𝒯fh+1t(s,a)]|\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f^{t}}}}\left[f^{t}_{h}(s,a)-{\mathcal{T}}f^{t}_{h+1}(s,a)\right]\right\rvert (10)
=t=1Th=0H1|Xh(ft),Wh(ft)|\displaystyle=\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(f^{t}),W_{h}(f^{t})\right\rangle\right\rvert
t=1Th=0H1Xh(ft)Σt1;h1Δon+λBW2,\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\Delta_{\mathrm{on}}+\lambda B_{W}^{2}},

where the second line follows from Definition 2, the third line follows from Lemma 8 and by plugging in the bound in Lemma 7. Using the bound in Lemma 6 in the above, we get that

t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right] 2dH2log(1+TBX2λd)(Δon+λBW2)T\displaystyle\leq\sqrt{2dH^{2}\log\left(1+\frac{TB_{X}^{2}}{\lambda d}\right)\cdot\left(\Delta_{\mathrm{on}}+\lambda B_{W}^{2}\right)\cdot T}
2dH2log(1+Td)(Δon+BX2BW2)T,\displaystyle\leq\sqrt{2dH^{2}\log\left(1+\frac{T}{d}\right)\cdot\left(\Delta_{\mathrm{on}}+B_{X}^{2}B_{W}^{2}\right)\cdot T}, (11)

where the second line follows by plugging in λ=BX2\lambda=B_{X}^{2}.

Combining the bound (9) and (11), we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} TCπeHΔoff+2dH2log(1+Td)(Δon+BX2BW2)T\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+\sqrt{2dH^{2}\log\left(1+\frac{T}{d}\right)\cdot\left(\Delta_{\mathrm{on}}+B_{X}^{2}B_{W}^{2}\right)\cdot T}

Plugging in the values of Δon\Delta_{\mathrm{on}} and Δoff\Delta_{\mathrm{off}} in the above, and using subadditivity of square-root, we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} 16VmaxCπeTHmofflog(2HT||δ)+16Vmax2dH2Tmonlog(1+Td)log(2HT||δ)\displaystyle\leq 16V_{\mathrm{max}}C_{\pi^{e}}T\sqrt{\frac{H}{m_{\mathrm{off}}}\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}+16V_{\mathrm{max}}\sqrt{\frac{2dH^{2}T}{m_{\mathrm{on}}}\log\left(1+\frac{T}{d}\right)\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}
+HBXBW2dTlog(1+Td).\displaystyle\qquad\qquad\qquad+HB_{X}B_{W}\sqrt{2dT\log\left(1+\frac{T}{d}\right)}.

Setting moff=Tm_{\mathrm{off}}=T and mon=1m_{\mathrm{on}}=1 in the above gives the cumulative suboptimality bound

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right). (12)

Proof of Corollary 1.

We next convert the above cumulative suboptimality bound into a sample complexity bound via a standard online-to-batch conversion. Setting \pi^{e}=\pi^{*} in (12) and defining the policy \widehat{\pi}=\text{Uniform}\left(\left\{\pi^{1},\dots,\pi^{T}\right\}\right), we get that

𝔼[VπVπ^]\displaystyle\operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right] =1T(t=1TVπVπt)\displaystyle=\frac{1}{T}\left(\sum_{t=1}^{T}V^{\pi^{*}}-V^{\pi^{t}}\right)
=O(max{Cπ,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle=O\left(\max\left\{C_{\pi^{*}},1\right\}V_{\mathrm{max}}\sqrt{\frac{dH^{2}}{T}\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right).

Thus, for T\geq\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right), we have

𝔼[VπVπ^]ϵ.\operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right]\leq\epsilon.

In these TT iterations, the total number of offline samples used is

moff=T=O~(max{Cπ2,1}Vmax2dH2log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{off}}=T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

and the total number of online samples used is

monHT=O~(max{Cπ2,1}Vmax2dH3log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{on}}\cdot H\cdot T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{3}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

where the additional HH factor appears because we collect monm_{\mathrm{on}} samples for every h[H]h\in[H] in the algorithm. ∎
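For reference, these expressions can be turned into a rough sample-size calculator; the constants and logarithmic factors suppressed by the \widetilde{O}(\cdot) notation are ignored below, so the numbers are order-of-magnitude only, and the function name and defaults are illustrative.

```python
import numpy as np


def hyq_sample_sizes(eps, C, V_max, d, H, log_F, delta=0.05):
    """Order-of-magnitude sample counts implied by Corollary 1 (m_off = T, m_on = 1).
    log_F stands for log|F|; constants and the T inside the logarithm are ignored."""
    T = max(C, 1.0) ** 2 * V_max ** 2 * d * H ** 2 * (log_F + np.log(H / delta)) / eps ** 2
    offline = T            # m_off = T, as in Corollary 1
    online = H * T         # m_on * H * T online samples with m_on = 1
    return int(np.ceil(T)), int(np.ceil(offline)), int(np.ceil(online))


# Example: epsilon = 0.1, transfer coefficient 2, V_max = 1, d = 10, H = 20, |F| = 1e6.
print(hyq_sample_sizes(eps=0.1, C=2.0, V_max=1.0, d=10, H=20, log_F=np.log(1e6)))
```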

A.3 V-type Bilinear Rank

Our previous results focus on the Q-type bilinear model. Here we provide the V-type bilinear rank definition, which is essentially the same as the low Bellman rank model proposed by Jiang et al. [2017].

Definition 5 (V-type Bilinear model).

Consider any pair of functions (f,g)(f,g) with f,gf,g\in\mathcal{F}. Denote the greedy policy of ff as πf={πhf:=argmaxafh(s,a),h}\pi^{f}=\{\pi^{f}_{h}:=\mathop{\mathrm{argmax}}_{a}f_{h}(s,a),\forall h\}. We say that the MDP together with the function \mathcal{F} admits a bilinear structure of rank dd if for any h[H1]h\in[H-1], there exist two (unknown) mappings Xh:dX_{h}:\mathcal{F}\mapsto\mathbb{R}^{d} and Wh:dW_{h}:\mathcal{F}\mapsto\mathbb{R}^{d} with maxfXh(f)2BX\max_{f}\|X_{h}(f)\|_{2}\leq B_{X} and maxfWh(f)2BW\max_{f}\|W_{h}(f)\|_{2}\leq B_{W}, such that:

\displaystyle\forall f,g\in\mathcal{F}:\;\left\lvert\mathbb{E}_{s\sim d_{h}^{\pi^{f}},a\sim\pi^{g}(s)}\left[g_{h}(s,a)-\mathcal{T}g_{h+1}(s,a)\right]\right\rvert=\left\lvert\left\langle X_{h}(f),W_{h}(g)\right\rangle\right\rvert.

Note that, unlike in the Q-type definition, here the action a is taken from the greedy policy with respect to g. This way \max_{a}g(s,a) can serve as an approximation of V^{\star}, hence the name V-type.

To make Hy-Q work for the V-type bilinear model, we only need a slight change to the data collection process: when we collect the online batch \mathcal{D}_{h}, we sample s\sim d^{\pi^{t}}_{h}, a\sim\text{Uniform}(\mathcal{A}), s^{\prime}\sim P(\cdot|s,a). Namely, the action at step h is taken uniformly at random. We provide the pseudocode in Algorithm 2 (a minimal code sketch of the data collection follows the pseudocode). We refer the reader to Du et al. [2021] and Jin et al. [2021a] for a detailed discussion.

Algorithm 2 V-type Hy-Q
0:  Value function class: \mathcal{F}, #iterations: TT, Offline dataset 𝒟hν\mathcal{D}^{\nu}_{h} of size moffm_{\mathrm{off}} for h[H1]h\in[H-1].
1:  Initialize fh1(s,a)=0f_{h}^{1}(s,a)=0.
2:  for t=1,,Tt=1,\dots,T do
3:     Let \pi^{t} be the greedy policy w.r.t. f^{t}, i.e., \pi_{h}^{t}(s)=\mathop{\mathrm{argmax}}_{a}f^{t}_{h}(s,a).
4:     For each h, collect m_{\mathrm{on}} online tuples \mathcal{D}^{t}_{h}\sim d_{h}^{\pi^{t}}\circ\textrm{Uniform}(\mathcal{A}). // Online collection with uniform actions at step h
5:     Set f_{H}^{t+1}(s,a)=0. // FQI using both online and offline data
6:     for h=H1,,0h=H-1,\dots,0 do
7:        Estimate fht+1f_{h}^{t+1} using least squares regression on the aggregated data:
fht+1argminfh{𝔼^𝒟hν(f(s,a)rmaxafh+1t+1(s,a))2+τ=1t𝔼^𝒟hτ(f(s,a)rmaxafh+1t+1(s,a))2}\displaystyle f_{h}^{t+1}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{h}}\left\{\widehat{\mathbb{E}}_{\mathcal{D}^{\nu}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}+\sum_{\tau=1}^{t}\widehat{\mathbb{E}}_{\mathcal{D}^{\tau}_{h}}(f(s,a)-r-\max_{a^{\prime}}f^{t+1}_{h+1}(s^{\prime},a^{\prime}))^{2}\right\} (13)
8:     end for
9:  end for
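The only change relative to the Q-type algorithm is in the online data collection (line 4 of Algorithm 2): roll in with the greedy policy up to step h and then act uniformly at random. Below is a minimal sketch of this collection step, assuming a classic Gym-style episodic environment with discrete actions; the function and argument names are illustrative.

```python
import random


def collect_vtype_tuple(env, greedy_policy, h, num_actions):
    """Collect one online tuple for step h as in line 4 of Algorithm 2:
    roll in with the greedy policy pi^t for h steps (so s_h ~ d_h^{pi^t}),
    then take a uniformly random action a_h ~ Uniform(A)."""
    s = env.reset()
    for step in range(h):
        s, _, done, *_ = env.step(greedy_policy(s, step))
        if done:
            # Episode ended before reaching step h; start over.
            return collect_vtype_tuple(env, greedy_policy, h, num_actions)
    a = random.randrange(num_actions)
    s_next, r, *_ = env.step(a)
    return (s, a, r, s_next)
```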

A.3.1 Complexity bound for V-type Bilinear models

In this section, we give a performance analysis of Algorithm 2, extending the results developed for Q-type bilinear models in Section A.2 to V-type bilinear models.

We first note the following bound for FQI estimates in Algorithm 2.

Lemma 9.

Let \delta\in(0,1). For h\in[H-1] and t\in[T], let f^{t+1}_{h} be the value function estimate for time step h computed via the least squares regression (13) on the datasets \left(\mathcal{D}^{\nu}_{h},\mathcal{D}^{1}_{h},\dots,\mathcal{D}^{t}_{h}\right) in iteration t of Algorithm 2. Then, with probability at least 1-\delta, for any h\in[H-1] and t\in[T],

fht+1𝒯fh+1t+12,νh21moff256Vmax2log(2HT||/δ)=:Δ¯off,\displaystyle\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|_{2,\nu_{h}}^{2}\leq\frac{1}{m_{\mathrm{off}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\bar{\Delta}_{\mathrm{off}},
and
τ=1tfht+1𝒯fh+1t+12,μhτ21mon256Vmax2log(2HT||/δ)=:Δ¯on,\displaystyle\sum_{\tau=1}^{t}\left\|f_{h}^{t+1}-\mathcal{T}f_{h+1}^{t+1}\right\|^{2}_{2,\mu^{\tau}_{h}}\leq\frac{1}{m_{\mathrm{on}}}256V_{\mathrm{max}}^{2}\log(2HT\lvert\mathcal{F}\rvert/\delta)=\vcentcolon{}\bar{\Delta}_{\mathrm{on}},

where \nu_{h} denotes the offline data distribution at time h, and the distribution \mu^{\tau}_{h}\in\Delta(\mathcal{S}\times\mathcal{A}) is defined such that s\sim d^{\pi^{\tau}}_{h} and a\sim\mathrm{Uniform}(\mathcal{A}).

The following change in distribution lemma is the version of Lemma 8 under V-type Bellman rank assumption.

Lemma 10.

Suppose the underlying model is a V-type bilinear model. Then, for any t0t\geq 0 and h[H1]h\in[H-1], we have

|Wh(ft),Xh(ft)|Xh(ft)Σt1;h1|𝒜|i=1t1𝔼sdhπfi,aUniform(𝒜)[(fht𝒯fh+1t)2]+λBW2,\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left\lvert\mathcal{A}\right\rvert\cdot\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\mathrm{Uniform}(\mathcal{A})}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]+\lambda B_{W}^{2}},

where \Sigma_{t-1;h} is defined in (3).

Proof.

The proof closely follows the proof of Lemma 8. Repeating the analysis up to (6), we get that

|Wh(ft),Xh(ft)|\displaystyle\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{t})\right\rangle\right\rvert Xh(ft)Σt1;h1i=1t1|Wh(ft),Xh(fi)|2+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left\lvert\left\langle W_{h}(f^{t}),X_{h}(f^{i})\right\rangle\right\rvert^{2}+\lambda B_{W}^{2}}
=Xh(ft)Σt1;h1i=1t1(𝔼sdhπfi,aπft(s)[fht𝒯fh+1t])2+λBW2\displaystyle=\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\sum_{i=1}^{t-1}\left(\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\pi^{f^{t}}(s)}\left[f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right]\right)^{2}+\lambda B_{W}^{2}}
Xh(ft)Σt1;h1|𝒜|i=1t1𝔼sdhπfi,aUniform(𝒜)[(fht𝒯fh+1t)2]+λBW2\displaystyle\leq\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left\lvert\mathcal{A}\right\rvert\cdot\sum_{i=1}^{t-1}\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\mathrm{Uniform}(\mathcal{A})}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]+\lambda B_{W}^{2}}

where the second line above follows from the definition of V-type bilinear model in Definition 5, and the last line holds because:

(𝔼sdhπfi,aπft(s)[fht𝒯fh+1t])2\displaystyle\left(\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},~{}a\sim\pi^{f^{t}}(s)}\left[f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right]\right)^{2} 𝔼sdhπfi,aπft(s)[(fht𝒯fh+1t)2]\displaystyle\leq\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},a\sim\pi^{f^{t}}(s)}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]
|𝒜|𝔼sdhπfi,aUniform(𝒜)[(fht𝒯fh+1t)2]\displaystyle\leq\left\lvert\mathcal{A}\right\rvert\cdot\operatorname{\mathbb{E}}_{s\sim d_{h}^{\pi^{f^{i}}},a\sim\mathrm{Uniform}(\mathcal{A})}\left[\left(f_{h}^{t}-\mathcal{T}f_{h+1}^{t}\right)^{2}\right]

where the first inequality above is due to Jensen’s inequality, and the last inequality follows from a straightforward upper bound (importance weighting against the uniform distribution over actions), since each term inside the expectation is non-negative. ∎

We are finally ready to state and prove our main result in this section.

Theorem 2 (Cumulative suboptimality bound for V-type bilinear rank models).

Let mon=|𝒜|m_{\mathrm{on}}=\left\lvert\mathcal{A}\right\rvert and moff=Tm_{\mathrm{off}}=T. Then, with probability at least 1δ1-\delta, the cumulative suboptimality of Algorithm 2 is bounded as

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ))\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right)
Proof.

The proof closely follows that of Theorem 1. Repeating the analysis up to (8) and (9), we get that:

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} TCπeHΔ¯off+t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)].\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\bar{\Delta}_{\mathrm{off}}}+\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right]. (14)

For the second term in the above, using Lemma 4 for each f^{t}, 1\leq t\leq T, we get

t=1T𝔼sd0[maxaf0t(s,a)V0πft(s)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{s\sim d_{0}}\left[\max_{a}f_{0}^{t}(s,a)-V^{\pi^{f^{t}}}_{0}(s)\right] t=1Th=0H1|𝔼s,adhπft[fht(s,a)𝒯hfh+1t(s,a)]|\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f^{t}}}}\left[f^{t}_{h}(s,a)-{\mathcal{T}}_{h}f^{t}_{h+1}(s,a)\right]\right\rvert
=t=1Th=0H1|Xh(ft),Wh(ft)|\displaystyle=\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(f^{t}),W_{h}(f^{t})\right\rangle\right\rvert
t=1Th=0H1Xh(ft)Σt1;h1|𝒜|Δ¯on+λBW2,\displaystyle\leq\sum_{t=1}^{T}\sum_{h=0}^{H-1}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\sqrt{\left\lvert\mathcal{A}\right\rvert\cdot\bar{\Delta}_{\mathrm{on}}+\lambda B_{W}^{2}},

where the second line follows from Definition 5, and the last line follows from Lemma 10 and by plugging in the bound in Lemma 9. Using the elliptical potential Lemma 6 as in the proof of Theorem 1, we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} TCπeHΔ¯off+2dH2log(1+Td)(|𝒜|Δ¯on+BX2BW2)T\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\bar{\Delta}_{\mathrm{off}}}+\sqrt{2dH^{2}\log\left(1+\frac{T}{d}\right)\cdot\left(\left\lvert\mathcal{A}\right\rvert\cdot\bar{\Delta}_{\mathrm{on}}+B_{X}^{2}B_{W}^{2}\right)\cdot T}

Plugging in the values of Δ¯on\bar{\Delta}_{\mathrm{on}} and Δ¯off\bar{\Delta}_{\mathrm{off}} from Lemma 9 in the above, and using subadditivity of square-root, we get that

t=1TVπeVπft\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}} 16VmaxCπeTHmofflog(2HT||δ)+16Vmax2dH2|𝒜|Tmonlog(1+Td)log(2HT||δ)\displaystyle\leq 16V_{\mathrm{max}}C_{\pi^{e}}T\sqrt{\frac{H}{m_{\mathrm{off}}}\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}+16V_{\mathrm{max}}\sqrt{\frac{2dH^{2}\left\lvert\mathcal{A}\right\rvert T}{m_{\mathrm{on}}}\log\left(1+\frac{T}{d}\right)\log\left(\frac{2HT\lvert\mathcal{F}\rvert}{\delta}\right)}
+HBXBW2dTlog(1+Td).\displaystyle\qquad\qquad\qquad+HB_{X}B_{W}\sqrt{2dT\log\left(1+\frac{T}{d}\right)}.

Setting mon=|𝒜|m_{\mathrm{on}}=\left\lvert\mathcal{A}\right\rvert and moff=Tm_{\mathrm{off}}=T, we get the following cumulative suboptimality bound:

t=1TVπeVπft=O(max{Cπe,1}VmaxdH2Tlog(1+Td)log(HT||δ)).\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{f^{t}}}=O\left(\max\left\{C_{\pi^{e}},1\right\}V_{\mathrm{max}}\sqrt{dH^{2}T\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right). (15)

Corollary 2 (Sample complexity).

Under the assumptions of Theorem 2, if C_{\pi^{*}}<\infty, then Algorithm 2 finds an \epsilon-suboptimal policy \widehat{\pi}, i.e., V^{\pi^{*}}-V^{\widehat{\pi}}\leq\epsilon, with a total sample complexity of:

n=O~(max{Cπ2,1}Vmax2dH3|𝒜|log(HT||/δ)ϵ2).\displaystyle n=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{3}\left\lvert\mathcal{A}\right\rvert\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right).
Proof.

The result follows from a standard online-to-batch conversion. Setting \pi^{e}=\pi^{*} in (15) and defining the policy \widehat{\pi}=\text{Uniform}\left(\left\{\pi^{1},\dots,\pi^{T}\right\}\right), we get that

\displaystyle\operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right]=\frac{1}{T}\left(\sum_{t=1}^{T}V^{\pi^{*}}-V^{\pi^{t}}\right)=O\left(\max\left\{C_{\pi^{*}},1\right\}V_{\mathrm{max}}\sqrt{\frac{dH^{2}}{T}\cdot\log\left(1+\frac{T}{d}\right)\log\left(\frac{HT\lvert\mathcal{F}\rvert}{\delta}\right)}\right).

Thus, the policy returned after T\geq\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right) iterations satisfies \operatorname{\mathbb{E}}\left[V^{\pi^{*}}-V^{\widehat{\pi}}\right]\leq\epsilon. In these T iterations, the total number of offline samples used is

moff=T=O~(max{Cπ2,1}Vmax2dH2log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{off}}=T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{2}\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

and the total number of online samples collected is

monHT=O~(max{Cπ2,1}Vmax2dH3|𝒜|log(HT||/δ)ϵ2),\displaystyle m_{\mathrm{on}}\cdot H\cdot T=\widetilde{O}\left(\frac{\max\left\{C^{2}_{\pi^{*}},1\right\}V_{\mathrm{max}}^{2}dH^{3}\left\lvert\mathcal{A}\right\rvert\log\left({HT\lvert\mathcal{F}\rvert}/{\delta}\right)}{\epsilon^{2}}\right),

where the additional HH factor appears because we collect monm_{\mathrm{on}} samples for every h[H]h\in[H] in the algorithm. ∎

A.4 Bounds on transfer coefficient

Note that C_{\pi} takes both the distribution shift and the function class into account, and it is smaller than the existing density-ratio-based concentrability coefficients [Kakade and Langford, 2002, Munos and Szepesvári, 2008, Chen and Jiang, 2019] as well as the existing Bellman-error-based concentrability coefficient of Xie et al. [2021a]. We formalize this in the following lemma.

Lemma 11.

For any π\pi and offline distribution ν\nu,

Cπ\displaystyle C_{\pi} maxf,hfh𝒯fh+1dhπ2fh𝒯fh+1νh2suph,s,adhπ(s,a)νh(s,a).\displaystyle\leq\sqrt{\max_{f,h}\frac{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}\leq\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)}.
Proof.

Using Jensen’s inequality, we get that

Cπ\displaystyle C_{\pi} maxfh=0H1fh𝒯fh+1dhπ2h=0H1fh𝒯fh+1νh2\displaystyle\leq\sqrt{\max_{f}\frac{\sum_{h=0}^{H-1}\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\sum_{h=0}^{H-1}\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}
maxf,hfh𝒯fh+1dhπ2fh𝒯fh+1νh2\displaystyle\leq\sqrt{\max_{f,h}\frac{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}
suph,s,adhπ(s,a)νh(s,a)\displaystyle\leq\sqrt{\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)}}
suph,s,adhπ(s,a)νh(s,a),\displaystyle\leq\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)},

where the second line follows from the mediant inequality, the third line follows from a change of measure, and the last line holds whenever suph,s,adhπ(s,a)νh(s,a)1\sup_{h,s,a}\frac{d_{h}^{\pi}(s,a)}{\nu_{h}(s,a)}\geq 1, which is always the case since dhπd_{h}^{\pi} and νh\nu_{h} are both probability distributions. ∎

Next we show that in the linear Bellman complete setting, CπC_{\pi} is bounded by the relative condition number using the linear features.

Lemma 12.

Consider the linear Bellman complete setting (Definition 3) with known feature ϕ\phi. Suppose that the feature covariance matrix induced by the offline distribution ν\nu, Σνh:=𝔼s,aνh[ϕ(s,a)ϕ(s,a)]\Sigma_{\nu_{h}}\vcentcolon={}\mathbb{E}_{s,a\sim\nu_{h}}[\phi(s,a)\phi(s,a)^{\top}], is invertible. Then for any policy π\pi, we have

Cπ\displaystyle C_{\pi} maxh𝔼s,adhπϕ(s,a)Σνh12.\displaystyle\leq\sqrt{\max_{h}\mathbb{E}_{s,a\sim d^{\pi}_{h}}\|\phi(s,a)\|^{2}_{\Sigma_{\nu_{h}}^{-1}}}.
Proof.

Repeating the argument in Lemma 11, we have

Cπ\displaystyle C_{\pi} maxf,hfh𝒯fh+1dhπ2fh𝒯fh+1νh2\displaystyle\leq\sqrt{\max_{f,h}\frac{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{d_{h}^{\pi}}}{\|f_{h}-\mathcal{T}f_{h+1}\|^{2}_{\nu_{h}}}}
maxw,hwhϕwhϕdhπ2whϕwhϕνh2\displaystyle\leq\sqrt{\max_{w,h}\frac{\|w_{h}^{\top}\phi-{w^{\prime}}_{h}^{\top}\phi\|^{2}_{d_{h}^{\pi}}}{\|w_{h}^{\top}\phi-{w^{\prime}}_{h}^{\top}\phi\|^{2}_{\nu_{h}}}}
maxw,h(whwh)Σνh2𝔼dhπϕΣνh12(whwh)ϕνh2\displaystyle\leq\sqrt{\max_{w,h}\frac{\|(w_{h}-w^{\prime}_{h})\|_{\Sigma_{\nu_{h}}}^{2}\mathbb{E}_{d^{\pi}_{h}}\|\phi\|^{2}_{\Sigma_{\nu_{h}}^{-1}}}{\|(w_{h}-w^{\prime}_{h})^{\top}\phi\|^{2}_{\nu_{h}}}}
=maxh𝔼s,adhπϕ(s,a)Σνh12.\displaystyle=\sqrt{\max_{h}\mathbb{E}_{s,a\sim d^{\pi}_{h}}\|\phi(s,a)\|^{2}_{\Sigma_{\nu_{h}}^{-1}}}.

Recall that in the linear Bellman complete setting, we can write any ff\in\mathcal{F} as wϕw^{\top}\phi, and for any ww that defines ff, there exists ww^{\prime} such that 𝒯f=wϕ\mathcal{T}f=w^{\prime\top}\phi; this justifies the second line above. ∎

Now we proceed to low-rank MDPs where the feature is unknown. We show that for low-rank MDPs, CπC_{\pi} is bounded by a partial feature coverage condition defined via the unknown ground truth feature.

Lemma 13.

Consider the low-rank MDP setting (Definition 4) where the transition dynamics PP is given by P(ss,a)=μ(s),ϕ(s,a)P(s^{\prime}\mid s,a)=\left\langle\mu^{\star}(s^{\prime}),\phi^{\star}(s,a)\right\rangle for feature maps μ\mu^{\star} and ϕ\phi^{\star} taking values in d\mathbb{R}^{d}. Suppose that the offline distribution ν=(ν0,,νH1)\nu=(\nu_{0},\dots,\nu_{H-1}) satisfies maxhmaxs,aπh(a|s)νh(a|s)α\max_{h}\max_{s,a}\frac{\pi_{h}(a|s)}{\nu_{h}(a|s)}\leq\alpha. Furthermore, suppose that ν\nu is induced via trajectories, i.e., ν0(s)=d0(s)\nu_{0}(s)=d_{0}(s) and νh(s)=𝔼s¯,a¯νh1P(s|s¯,a¯)\nu_{h}(s)=\mathbb{E}_{\bar{s},\bar{a}\sim\nu_{h-1}}P(s|\bar{s},\bar{a}) for any h1h\geq 1, and that the feature covariance matrix Σνh1,ϕ:=𝔼s,aνh1[ϕ(s,a)ϕ(s,a)]\Sigma_{\nu_{h-1},\phi^{\star}}\vcentcolon={}\mathbb{E}_{s,a\sim\nu_{h-1}}[\phi^{\star}(s,a)\phi^{\star}(s,a)^{\top}] is invertible (this is for notational simplicity only; we do not assume the eigenvalues are bounded below, so they may approach 0+0^{+}). Then for any policy π\pi, we have

Cπ\displaystyle C_{\pi} αh=1H𝔼s,adh1π[ϕ(s,a)Σνh1,ϕ1]+α.\displaystyle\leq\sqrt{\alpha}\sum_{h=1}^{H}\mathbb{E}_{s,a\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(s,a)\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\right]+\sqrt{\alpha}.
Proof.

We upper bound the numerator of the transfer coefficient, treating the cases h=0h=0 and h1h\geq 1 separately. First note that for h=0h=0,

𝔼s,ad0π[𝒯f1(s,a)f0(s,a)]\displaystyle\mathbb{E}_{s,a\sim d_{0}^{\pi}}\left[\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right] 𝔼sd0,aπ(|s)[(𝒯f1(s,a)f0(s,a))2]\displaystyle\leq\sqrt{\mathbb{E}_{s\sim d_{0},a\sim\pi(\cdot|s)}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}
maxs,ad0π(s,a)ν0(s,a)𝔼s,aν0[(𝒯f1(s,a)f0(s,a))2]\displaystyle\leq\sqrt{\max_{s,a}\frac{d^{\pi}_{0}(s,a)}{\nu_{0}(s,a)}\cdot\mathbb{E}_{s,a\sim\nu_{0}}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}
α𝔼s,aν0[(𝒯f1(s,a)f0(s,a))2],\displaystyle\leq\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_{0}}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}, (16)

where the last inequality follows from our assumption since maxs,ad0π(s,a)ν0(s,a)=maxs,aπ0(a|s)ν0(a|s)α\max_{s,a}\frac{d^{\pi}_{0}(s,a)}{\nu_{0}(s,a)}=\max_{s,a}\frac{\pi_{0}(a|s)}{\nu_{0}(a|s)}\leq\alpha.

Next, for any h1h\geq 1, backing up one step and considering the pair s¯,a¯\bar{s},\bar{a} that leads to the state ss, we get that

𝔼s,adhπ[𝒯fh+1(s,a)fh(s,a)]\displaystyle\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]
=𝔼s¯,a¯dh1π,sP(s¯,a¯),aπ(s)[𝒯fh+1(s,a)fh(s,a)]\displaystyle=\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi},s\sim P(\bar{s},\bar{a}),a\sim\pi(s)}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]
=𝔼s¯,a¯dh1π[(ϕ(s¯,a¯)μ(s))aπ(a|s)[𝒯fh+1(s,a)fh(s,a)]ds]\displaystyle=\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\int\left(\phi^{\star}(\bar{s},\bar{a})^{\top}\mu^{\star}(s)\right)\sum_{a}\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\text{d}s\right]
=𝔼s¯,a¯dh1π[ϕ(s¯,a¯)aμ(s)π(a|s)[𝒯fh+1(s,a)fh(s,a)]ds]\displaystyle=\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\phi^{\star}(\bar{s},\bar{a})^{\top}\int\sum_{a}\mu^{\star}(s)\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\text{d}s\right]
𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1aμ(s)π(a|s)[𝒯fh+1(s,a)fh(s,a)]dsΣνh1,ϕ],\displaystyle\leq\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\left\|\int\sum_{a}\mu^{\star}(s)\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\mathrm{d}s\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}}\right], (17)

where the last line follows from an application of the Cauchy-Schwarz inequality. For the term inside the expectation on the right hand side above, we note that

aμ(s)π(a|s)[𝒯fh+1(s,a)fh(s,a)]dsΣνh1,ϕ2\displaystyle\left\|\int\sum_{a}\mu^{\star}(s)\pi(a|s)\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\mathrm{d}s\right\|^{2}_{\Sigma_{\nu_{h-1},\phi^{\star}}}
=(i)𝔼s¯,a¯νh1[(a(μ(s)ϕ(s¯,a¯))π(a|s)(𝒯fh+1(s,a)fh(s,a))ds)2]\displaystyle\overset{\left(i\right)}{=}\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1}}\left[\left(\int\sum_{a}\left(\mu^{\star}(s)^{\top}\phi^{*}(\bar{s},\bar{a})\right)\pi(a|s)\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)\mathrm{d}s\right)^{2}\right]
=𝔼s¯,a¯νh1[(𝔼sP(s¯,a¯),aπ(s)[𝒯fh+1(s,a)fh(s,a)])2]\displaystyle=\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1}}\left[\left(\operatorname{\mathbb{E}}_{s\sim P(\bar{s},\bar{a}),a\sim\pi(s)}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]\right)^{2}\right]
(ii)𝔼s¯,a¯νh1,sP(s¯,a¯),aπ(s)[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\overset{\left(ii\right)}{\leq{}}\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1},s\sim P(\bar{s},\bar{a}),a\sim\pi(s)}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]
=(iii)𝔼sνh,aπ(s)[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\overset{\left(iii\right)}{=}\operatorname{\mathbb{E}}_{s\sim\nu_{h},a\sim\pi(s)}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]
(iv)α𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\overset{\left(iv\right)}{\leq{}}\alpha\cdot\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right] (18)

where (i)\left(i\right) follows by expanding the norm, (ii)\left(ii\right) follows from an application of Jensen’s inequality, and (iii)\left(iii\right) is due to our assumption that the offline dataset is generated using trajectories such that νh(s)=𝔼s¯,a¯νh1[P(ss¯,a¯)]\nu_{h}(s)=\operatorname{\mathbb{E}}_{\bar{s},\bar{a}\sim\nu_{h-1}}\left[P(s\mid\bar{s},\bar{a})\right]. Finally, (iv)\left(iv\right) follows from the definition of α\alpha. Plugging (18) in (17), we get that for h1h\geq 1,

𝔼s,adhπ[𝒯fh+1(s,a)fh(s,a)]\displaystyle\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]
𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1α𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]]\displaystyle\leq\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\sqrt{\alpha\cdot\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}\right] (19)

We are now ready to bound the transfer coefficient. First note that using (16), for any ff,

𝔼s,ad0π[𝒯f1(s,a)f0(s,a)]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\frac{\mathbb{E}_{s,a\sim d_{0}^{\pi}}\left[\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}} α𝔼s,aν0[(𝒯f1(s,a)f0(s,a))2]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\leq\frac{\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_{0}}\left[\left(\mathcal{T}f_{1}(s,a)-f_{0}(s,a)\right)^{2}\right]}}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}
α.\displaystyle\leq\sqrt{\alpha}.

Furthermore, for any ff, using (19), we get that

h=1H1𝔼s,adhπ[𝒯fh+1(s,a)fh(s,a)]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]\displaystyle\frac{\sum_{h=1}^{H-1}\mathbb{E}_{s,a\sim d_{h}^{\pi}}\left[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}
h=1H1𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1α𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]h=0H1𝔼s,aνh[(𝒯fh+1(s,a)fh(s,a))2]]\displaystyle\leq\sum_{h=1}^{H-1}\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\frac{\sqrt{\alpha\cdot\operatorname{\mathbb{E}}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_{h}}\left[\left(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)\right)^{2}\right]}}\right]
h=1H𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1α],\displaystyle\leq\sum_{h=1}^{H}\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\sqrt{\alpha}\right],

where the last line uses that each single-step term in the numerator is at most the full sum over hh in the denominator, and holds for an appropriate choice of λ\lambda (e.g., λ=0\lambda=0). Combining the above two bounds in the definition of CπC_{\pi}, we get that

Cπ\displaystyle C_{\pi} αh=1H𝔼s¯,a¯dh1π[ϕ(s¯,a¯)Σνh1,ϕ1]+α.\displaystyle\leq\sqrt{\alpha}\sum_{h=1}^{H}\mathbb{E}_{\bar{s},\bar{a}\sim d_{h-1}^{\pi}}\left[\left\|\phi^{\star}(\bar{s},\bar{a})\right\|_{\Sigma_{\nu_{h-1},\phi^{\star}}^{-1}}\right]+\sqrt{\alpha}.

Note that in the above result, the transfer coefficient is upper bounded by the relative coverage under the unknown feature ϕ\phi^{\star} and a term α\alpha related to the action coverage, i.e., maxhmaxs,aπh(a|s)νh(a|s)α\max_{h}\max_{s,a}\frac{\pi_{h}(a|s)}{\nu_{h}(a|s)}\leq\alpha. This matches the coverage condition used in prior offline RL works for low-rank MDPs [Uehara and Sun, 2021].

Appendix B Auxiliary Lemmas

In this section, we provide a few results and their proofs that we used in the previous sections. We begin with the following form of Freedman’s inequality, which is a modification of a similar inequality in Beygelzimer et al. [2011].

Lemma 14 (Freedman’s Inequality).

Let {X1,,XT}\left\{X_{1},\dots,X_{T}\right\} be a sequence of real-valued random variables where each XtX_{t} is sampled from some process that depends on all previous instances, i.e., XtρtX_{t}\sim\rho_{t} with ρt=ρt(X1:t1)\rho_{t}=\rho_{t}(X_{1:t-1}). Further, suppose that |Xt|R\left\lvert X_{t}\right\rvert\leq R almost surely for all tTt\leq T. Then, for any δ>0\delta>0 and λ[0,1/2R]\lambda\in[0,1/2R], with probability at least 1δ1-\delta,

|t=1TXt𝔼[Xtρt]|\displaystyle\left\lvert\sum_{t=1}^{T}X_{t}-\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right\rvert λt=1T(2R|𝔼[Xtρt]|+𝔼[Xt2ρt])+log(2/δ)λ.\displaystyle\leq\lambda\sum_{t=1}^{T}\left(2R\left\lvert\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right\rvert+\operatorname{\mathbb{E}}\left[X_{t}^{2}\mid\rho_{t}\right]\right)+\frac{\log(2/\delta)}{\lambda}.
Proof.

Define the random variable Zt=Xt𝔼[Xtρt]Z_{t}=X_{t}-\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]. Clearly, {Zt}t=1T\left\{Z_{t}\right\}_{t=1}^{T} is a martingale difference sequence. Furthermore, we have that for any tt, |Zt|2R\left\lvert Z_{t}\right\rvert\leq 2R and that

𝔼[Zt2ρt]\displaystyle\operatorname{\mathbb{E}}\left[Z_{t}^{2}\mid\rho_{t}\right] =𝔼[(Xt𝔼[Xtρt])2ρt]2R|𝔼[Xtρt]|+𝔼[Xt2ρt].\displaystyle=\operatorname{\mathbb{E}}\left[\left(X_{t}-\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right)^{2}\mid\rho_{t}\right]\leq 2R\left\lvert\operatorname{\mathbb{E}}\left[X_{t}\mid\rho_{t}\right]\right\rvert+\operatorname{\mathbb{E}}\left[X_{t}^{2}\mid\rho_{t}\right]. (20)

where the last inequality holds because |Xt|R\left\lvert X_{t}\right\rvert\leq R.

Using the form of Freedman’s inequality in Beygelzimer et al. [2011, Theorem 1], we get that for any λ[0,1/2R]\lambda\in[0,1/2R],

|t=1TZt|λt=1T𝔼[Zt2ρt]+log(2/δ)λ.\displaystyle\left\lvert\sum_{t=1}^{T}Z_{t}\right\rvert\leq\lambda\sum_{t=1}^{T}\operatorname{\mathbb{E}}\left[Z_{t}^{2}\mid\rho_{t}\right]+\frac{\log(2/\delta)}{\lambda}.

Plugging in the form of ZtZ_{t} and using (20), we get the desired statement. ∎

Next we give a formal proof of Lemma 3, which gives a generalization bound for least squares regression when the samples are adapted to an increasing filtration (and are not necessarily i.i.d.). The proof is similar to that of Agarwal et al. [2019, Lemma A.11].

Lemma 15 (Lemma 3 restated: Least squares generalization bound).

Let R>0R>0 and δ(0,1)\delta\in(0,1), and consider a sequential function estimation setting with an instance space 𝒳\mathcal{X} and target space 𝒴\mathcal{Y}. Let \mathcal{H} be a class of real-valued functions mapping 𝒳\mathcal{X} to [R,R][-R,R]. Let 𝒟={(x1,y1),,(xT,yT)}\mathcal{D}=\left\{(x_{1},y_{1}),\dots,(x_{T},y_{T})\right\} be a dataset of TT points where xtρt=ρt(x1:t1,y1:t1)x_{t}\sim\rho_{t}=\rho_{t}(x_{1:t-1},y_{1:t-1}), and yty_{t} is generated as

yt=h(xt)+εt,\displaystyle y_{t}=h^{*}(x_{t})+\varepsilon_{t},

where the function hh^{*} satisfies approximate realizability, i.e.,

infh1Tt=1T𝔼xρt[(h(x)h(x))2]γ,\inf_{h\in\mathcal{H}}\frac{1}{T}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[\left(h^{*}(x)-h(x)\right)^{2}\right]\leq\gamma,

and {εt}t=1T\left\{\varepsilon_{t}\right\}_{t=1}^{T} are independent mean-zero noise variables, so that 𝔼[ytxt]=h(xt)\mathbb{E}[y_{t}\mid x_{t}]=h^{\ast}(x_{t}). Additionally, suppose that maxt|yt|R\max_{t}\lvert y_{t}\rvert\leq R and maxx|h(x)|R\max_{x}\left\lvert h^{*}(x)\right\rvert\leq R. Then the least squares solution h^argminht=1T(h(xt)yt)2\widehat{h}\leftarrow\mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\sum_{t=1}^{T}\left(h(x_{t})-y_{t}\right)^{2} satisfies, with probability at least 1δ1-\delta,

t=1T𝔼xρt[(h^(x)h(x))2]\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x\sim\rho_{t}}\left[(\widehat{h}(x)-h^{*}(x))^{2}\right] 3γT+256R2log(2||/δ).\displaystyle\leq 3\gamma T+256R^{2}\log(2\lvert\mathcal{H}\rvert/\delta).
Proof.

Consider any fixed function hh\in\mathcal{H} and define the random variable

Zth:=(h(xt)yt)2(h(xt)yt)2.\displaystyle Z_{t}^{h}\vcentcolon={}\left(h(x_{t})-y_{t}\right)^{2}-\left(h^{*}(x_{t})-y_{t}\right)^{2}.

Define the notation 𝔼[ρt]\operatorname{\mathbb{E}}\left[\cdot\mid\rho_{t}\right] to denote 𝔼xtρt[]\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\cdot\right], and note that

𝔼[Zthρt]\displaystyle\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right] =𝔼xtρt[(h(xt)h(xt))(h(xt)+h(xt)2yt)]=𝔼xtρt[(h(xt)h(xt))2],\displaystyle=\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)\left(h(x_{t})+h^{*}(x_{t})-2y_{t}\right)\right]=\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right], (21)

where the last line holds because 𝔼[ytxt]=h(xt)\operatorname{\mathbb{E}}\left[y_{t}\mid x_{t}\right]=h^{*}(x_{t}). Furthermore, we also have that

𝔼[(Zth)2ρt]\displaystyle\operatorname{\mathbb{E}}\left[(Z_{t}^{h})^{2}\mid\rho_{t}\right] =𝔼xtρt[(h(xt)h(xt))2(h(xt)+h(xt)2yt)2]\displaystyle=\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\left(h(x_{t})+h^{*}(x_{t})-2y_{t}\right)^{2}\right]
16R2𝔼xtρt[(h(xt)h(xt))2].\displaystyle\leq 16R^{2}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]. (22)

Now we can note that the sequence of random variables {Z1h,,ZTh}\left\{Z^{h}_{1},\dots,Z^{h}_{T}\right\} satisfies the condition in Lemma 14 with |Zth|4R2\left\lvert Z_{t}^{h}\right\rvert\leq 4R^{2}. Thus we get that for any λ[0,1/8R2]\lambda\in[0,1/8R^{2}] and δ>0\delta>0, with probability at least 1δ1-\delta,

|t=1TZth𝔼[Zthρt]|\displaystyle\left\lvert\sum_{t=1}^{T}Z^{h}_{t}-\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right]\right\rvert λt=1T(8R2|𝔼[Zthρt]|+𝔼[(Zth)2ρt])+log(2/δ)λ\displaystyle\leq\lambda\sum_{t=1}^{T}\left(8R^{2}\left\lvert\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right]\right\rvert+\operatorname{\mathbb{E}}\left[\left(Z^{h}_{t}\right)^{2}\mid\rho_{t}\right]\right)+\frac{\log(2/\delta)}{\lambda}
32λR2t=1T𝔼xtρt[(h(xt)h(xt))2]+log(2/δ)λ,\displaystyle\leq 32\lambda R^{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]+\frac{\log(2/\delta)}{\lambda},

where the last inequality uses (21) and (22). Setting λ=1/64R2\lambda=1/64R^{2} in the above, and taking a union bound over hh, we get that for any hh\in\mathcal{H} and δ>0\delta>0, with probability at least 1δ1-\delta,

|t=1TZth𝔼[Zthρt]|12t=1T𝔼xtρt[(h(xt)h(xt))2]+64R2log(2||/δ).\displaystyle\left\lvert\sum_{t=1}^{T}Z^{h}_{t}-\operatorname{\mathbb{E}}\left[Z^{h}_{t}\mid\rho_{t}\right]\right\rvert\leq\frac{1}{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta).

Rearranging the terms and using (21) in the above implies that,

t=1TZth32t=1T𝔼xtρt[(h(xt)h(xt))2]+64R2log(2||/δ)\displaystyle\sum_{t=1}^{T}Z_{t}^{h}\leq\frac{3}{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta)
and
t=1T𝔼xtρt[(h(xt)h(xt))2]2t=1TZth+128R2log(2||/δ).\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[\left(h(x_{t})-h^{*}(x_{t})\right)^{2}\right]\leq 2\sum_{t=1}^{T}Z_{t}^{h}+128R^{2}\log(2\lvert\mathcal{H}\rvert/\delta). (23)

For the rest of the proof, we condition on the event that (23) holds for all hh\in\mathcal{H}.

Define the function h~:=argminht=1T𝔼xtρt[(h(xt)h(xt))2]\widetilde{h}\vcentcolon={}\mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[(h(x_{t})-h^{*}(x_{t}))^{2}\right]. Using (23), we get that

t=1TZth~\displaystyle\sum_{t=1}^{T}Z_{t}^{\widetilde{h}} 32t=1T𝔼xtρt[(h~(xt)h(xt))2]+64R2log(2||/δ)\displaystyle\leq\frac{3}{2}\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[(\widetilde{h}(x_{t})-h^{*}(x_{t}))^{2}\right]+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta)
32γT+64R2log(2||/δ),\displaystyle\leq\frac{3}{2}\gamma T+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta),

where the last inequality follows from the approximate realizability assumption. Let h^\widehat{h} denote the least squares solution on dataset {(xt,yt)}tT\left\{(x_{t},y_{t})\right\}_{t\leq T}. By definition, we have that

t=1TZth^\displaystyle\sum_{t=1}^{T}Z_{t}^{\widehat{h}} =t=1T[(h^(xt)yt)2(h(xt)yt)2]t=1T[(h~(xt)yt)2(h(xt)yt)2]=t=1TZth~.\displaystyle=\sum_{t=1}^{T}\left[(\widehat{h}(x_{t})-y_{t})^{2}-(h^{*}(x_{t})-y_{t})^{2}\right]\leq\sum_{t=1}^{T}\left[(\widetilde{h}(x_{t})-y_{t})^{2}-(h^{*}(x_{t})-y_{t})^{2}\right]=\sum_{t=1}^{T}Z_{t}^{\widetilde{h}}.

Combining the above two relations, we get that

t=1TZth^\displaystyle\sum_{t=1}^{T}Z_{t}^{\widehat{h}} 32γT+64R2log(2||/δ).\displaystyle\leq\frac{3}{2}\gamma T+64R^{2}\log(2\lvert\mathcal{H}\rvert/\delta). (24)

Finally, using (23) for the function h^\widehat{h}, we get that

t=1T𝔼xtρt[(h^(xt)h(xt))2]\displaystyle\sum_{t=1}^{T}\operatorname{\mathbb{E}}_{x_{t}\sim\rho_{t}}\left[(\widehat{h}(x_{t})-h^{*}(x_{t}))^{2}\right] 2t=1TZth^+128R2log(2||/δ)\displaystyle\leq 2\sum_{t=1}^{T}Z_{t}^{\widehat{h}}+128R^{2}\log(2\lvert\mathcal{H}\rvert/\delta)
3γT+256R2log(2||/δ),\displaystyle\leq 3\gamma T+256R^{2}\log(2\lvert\mathcal{H}\rvert/\delta),

where the last inequality uses the relation (24). ∎

Appendix C Low Bellman Eluder Dimension problems

In this section, we consider problems with low Bellman Eluder dimension [Jin et al., 2021a]. This complexity measure is a distributional version of the Eluder dimension applied to the class of Bellman residuals induced by \mathcal{F}. We show that our algorithm Hy-Q gives a similar performance guarantee for problems with small Bellman Eluder dimension. This demonstrates that Hy-Q applies to some of the most general model-free RL frameworks known in the RL literature so far.

We first introduce the key definitions:

Definition 6 (ε\varepsilon-independence between distributions [Jin et al., 2021a]).

Let 𝒢\mathcal{G} be a class of functions defined on a space 𝒳\mathcal{X}, and ν,μ1,,μn\nu,\mu_{1},\dots,\mu_{n} be probability measures over 𝒳\mathcal{X}. We say ν\nu is ε\varepsilon-independent of {μ1,μ2,,μn}\{\mu_{1},\mu_{2},\dots,\mu_{n}\} with respect to 𝒢\mathcal{G} if there exists g𝒢g\in\mathcal{G} such that i=1n(𝔼μi[g])2ε\sqrt{\sum_{i=1}^{n}(\mathbb{E}_{\mu_{i}}[g])^{2}}\leq\varepsilon, but |𝔼ν[g]|>ε|\mathbb{E}_{\nu}[g]|>\varepsilon.

Definition 7 (Distributional Eluder (DE) dimension).

Let 𝒢\mathcal{G} be a function class defined on 𝒳\mathcal{X}, and 𝒫\mathcal{P} be a family of probability measures over 𝒳\mathcal{X}. The distributional Eluder dimension dimDE(𝒢,𝒫,ε)\dim_{\operatorname{DE}}(\mathcal{G},\mathcal{P},\varepsilon) is the length of the longest sequence {ρ1,,ρn}𝒫\{\rho_{1},\dots,\rho_{n}\}\subset\mathcal{P} such that there exists εε\varepsilon^{\prime}\geq\varepsilon where ρi\rho_{i} is ε\varepsilon^{\prime}-independent of {ρ1,,ρi1}\{\rho_{1},\dots,\rho_{i-1}\} for all i[n]i\in[n].

Definition 8 (Bellman Eluder (BE) dimension [Jin et al., 2021a]).

Given a value function class \mathcal{F}, let 𝒢h:={fh𝒯fh+1fhh,fh+1h+1}\mathcal{G}_{h}\vcentcolon={}\left\{f_{h}-\mathcal{T}f_{h+1}\mid f_{h}\in\mathcal{F}_{h},f_{h+1}\in\mathcal{F}_{h+1}\right\} be the set of Bellman residuals induced by \mathcal{F} at step hh, and 𝒫={𝒫h}h=1H\mathcal{P}=\{\mathcal{P}_{h}\}_{h=1}^{H} be a collection of HH probability measure families over 𝒳×𝒜\mathcal{X}\times\mathcal{A}. The ε\varepsilon-Bellman Eluder dimension of \mathcal{F} with respect to 𝒫\mathcal{P} is defined as

dimBE(,𝒫,ε):=maxh[H]dimDE(𝒢h,𝒫h,ε).\displaystyle\dim_{\operatorname{BE}}(\mathcal{F},\mathcal{P},\varepsilon):=\max_{h\in[H]}\dim_{\operatorname{DE}}(\mathcal{G}_{h},\mathcal{P}_{h},\varepsilon)\,.

We also note the following lemma that controls the rate at which Bellman error accumulates.

Lemma 16 (Lemma 41, [Jin et al., 2021a]).

Let 𝒢\mathcal{G} be a function class defined on a space 𝒳\mathcal{X} with supg𝒢,x𝒳|g(x)|C\sup_{g\in\mathcal{G},x\in\mathcal{X}}\left\lvert g(x)\right\rvert\leq C, and let 𝒫\mathcal{P} be a set of probability measures over 𝒳\mathcal{X}. Suppose that the sequences {gk}k=1K𝒢\left\{g_{k}\right\}_{k=1}^{K}\subset\mathcal{G} and {μk}k=1K𝒫\left\{\mu_{k}\right\}_{k=1}^{K}\subset\mathcal{P} satisfy t=1k1(𝔼μt[gk])2β\sum_{t=1}^{k-1}\left(\operatorname{\mathbb{E}}_{\mu_{t}}\left[g_{k}\right]\right)^{2}\leq\beta for all k[K]k\in[K]. Then, for all k[K]k\in[K] and γ>0\gamma>0,

t=1k|𝔼μt[gt]|O(dimDE(𝒢,𝒫,γ)βk+min{k,dimDE(𝒢,𝒫,γ)C}+kγ).\displaystyle\sum_{t=1}^{k}\left\lvert\operatorname{\mathbb{E}}_{\mu_{t}}\left[g_{t}\right]\right\rvert\leq O\left(\sqrt{\mathrm{dim}_{\mathrm{DE}}(\mathcal{G},\mathcal{P},\gamma)\beta k}+\min\left\{k,\mathrm{dim}_{\mathrm{DE}}(\mathcal{G},\mathcal{P},\gamma)C\right\}+k\gamma\right).

We next state our main theorem whose proof is similar to that of Theorem 1.

Theorem 3 (Cumulative suboptimality).

Fix δ(0,1)\delta\in(0,1), moff=HT/dm_{\mathrm{off}}=HT/d and mon=H2m_{\mathrm{on}}=H^{2}, and suppose that the underlying MDP admits Bellman Eluder dimension dd and that the function class \mathcal{F} satisfies Assumption 1. Then with probability at least 1δ1-\delta, Algorithm 1 obtains the following bound on cumulative suboptimality w.r.t. any comparator policy πe\pi^{e},

t=1TVπeVπt=O~(Vmaxmax{Cπe,1}dTlog(H||/δ)),\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}}=\widetilde{O}\left(V_{\mathrm{max}}\max\left\{C_{\pi^{e}},1\right\}\sqrt{dT\cdot\log\left(H\lvert\mathcal{F}\rvert/\delta\right)}\right),

where πt=πft\pi^{t}=\pi^{f^{t}} is the greedy policy w.r.t. ftf^{t} at round tt and d=dimBE(,𝒫,1/T)d=\mathrm{dim}_{\mathrm{BE}}(\mathcal{F},\mathcal{P}_{\mathcal{F}},1/\sqrt{T}). Here 𝒫\mathcal{P}_{\mathcal{F}} is the class of occupancy measures that can be induced by greedy policies w.r.t. value functions in \mathcal{F}.

Proof.

Repeating the analysis up to (10) in the proof of Theorem 1, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} TCπeHΔoff+t=1Th=0H1|𝔼s,adhπft[fht(s,a)𝒯hfh+1t(s,a)]|\displaystyle\leq TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+\sum_{t=1}^{T}\sum_{h=0}^{H-1}\left\lvert\mathbb{E}_{s,a\sim d_{h}^{\pi^{f^{t}}}}\left[f^{t}_{h}(s,a)-{\mathcal{T}}_{h}f^{t}_{h+1}(s,a)\right]\right\rvert

Using the bound in Lemma 7 and Lemma 16 in the above, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} TCπeHΔoff+h=0H1dimDE(𝒢h,𝒫;h,γ)ΔonT\displaystyle\lesssim TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+\sum_{h=0}^{H-1}\sqrt{\mathrm{dim}_{\mathrm{DE}}(\mathcal{G}_{h},\mathcal{P}_{\mathcal{F};h},\gamma)\Delta_{\mathrm{on}}T}
+min{T,dimDE(𝒢h,𝒫;h,γ)C}+Tγ.\displaystyle\qquad\qquad\qquad\qquad+\min\left\{T,\mathrm{dim}_{\mathrm{DE}}(\mathcal{G}_{h},\mathcal{P}_{\mathcal{F};h},\gamma)C\right\}+T\gamma.

where 𝒢h:=(fh𝒯fh+1fh,fh+1h+1)\mathcal{G}_{h}\vcentcolon={}\left(f_{h}-\mathcal{T}f_{h+1}\mid f\in\mathcal{F}_{h},f_{h+1}\in\mathcal{F}_{h+1}\right) denotes the set of Bellman residuals induced by \mathcal{F} at step hh, and 𝒫={𝒫;h}h=1H\mathcal{P}=\{\mathcal{P}_{\mathcal{F};h}\}_{h=1}^{H} is the collection of occupancy measures at step hh induced by greedy policies w.r.t. value functions in \mathcal{F}. We set γ=1/T\gamma=1/\sqrt{T} and define d=dimBE(,𝒫,γ)=maxhdimDE(𝒢h,𝒫;h,γ)d=\mathrm{dim}_{\mathrm{BE}}(\mathcal{F},\mathcal{P},\gamma)=\max_{h}\mathrm{dim}_{\mathrm{DE}}(\mathcal{G}_{h},\mathcal{P}_{\mathcal{F};h},\gamma). Ignoring the lower order terms, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} TCπeHΔoff+HdΔonT\displaystyle\lesssim TC_{\pi^{e}}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}+H\sqrt{d\Delta_{\mathrm{on}}T}
TCπeVmaxHlog(HT||/δ)moff+HVmaxdTlog(HT||/δ)mon,\displaystyle\lesssim TC_{\pi^{e}}V_{\mathrm{max}}\cdot\sqrt{H\cdot\frac{\log(HT\lvert\mathcal{F}\rvert/\delta)}{m_{\mathrm{off}}}}+HV_{\mathrm{max}}\sqrt{dT\cdot\frac{\log(HT\lvert\mathcal{F}\rvert/\delta)}{m_{\mathrm{on}}}},

where \lesssim hides lower order terms, multiplicative constants, and log factors. Setting moff=HT/dm_{\mathrm{off}}=HT/d and mon=H2m_{\mathrm{on}}=H^{2}, we get that

t=1TVπeVπt\displaystyle\sum_{t=1}^{T}V^{\pi^{e}}-V^{\pi^{t}} =O~(CπeVmaxdTlog(HT||/δ)).\displaystyle=\widetilde{O}\left(C_{\pi^{e}}V_{\mathrm{max}}\sqrt{dT\log(HT\lvert\mathcal{F}\rvert/\delta)}\right). ∎

Appendix D Comparison with previous works

As mentioned in the main text, many previous empirical works consider combining offline expert demonstrations with online interaction [Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017, Lee et al., 2022, Jia et al., 2022, Niu et al., 2022]. The idea of running an RL algorithm on both offline data (expert demonstrations) and online data has thus been explored in some of these works: for example, Vecerik et al. [2017] runs DDPG on both the online and expert data, and Hester et al. [2018] uses DQN on both datasets but with an additional supervised loss. Since we already compared with Hester et al. [2018] in the experiments, here we focus our discussion on Vecerik et al. [2017].

We first emphasize that Vecerik et al. [2017] focuses only on expert demonstrations and their experiments rely entirely on them, while we focus on more general offline datasets that do not necessarily come from experts. That said, the DDPG-based algorithm from Vecerik et al. [2017] can potentially be used when the offline data is not from experts. Although the algorithm from Vecerik et al. [2017] and Hy-Q share the same high-level intuition that one should perform RL on both datasets, there are still a few differences: (1) Hy-Q uses Q-learning instead of deterministic policy gradients; note that deterministic policy gradient methods cannot be directly applied to the discrete action setting. (2) Hy-Q does not require an n-step TD style update: in the off-policy case, without proper importance weighting, n-step TD can incur a strong bias, and while careful tuning of n can balance bias and variance, no such tuning is needed in Hy-Q. (3) The idea of keeping a non-zero ratio for sampling from the offline dataset is also proposed in Vecerik et al. [2017]. Our buffer ratio is derived from our theoretical analysis, and it also corroborates the advantage of the similar heuristic applied in Vecerik et al. [2017]. (4) In their experiments, Vecerik et al. [2017] only considers expert demonstrations. In our experiments, we considered offline datasets with different amounts of transitions from very low-quality policies and showed that Hy-Q is robust to low-quality transitions in the offline data. Note that some of the differences may seem minor at the implementation level, but they may be important to the theory.

Regarding the experiments, our experimental evaluation adds the following insights over those in Vecerik et al. [2017]: (i) hybrid methods can succeed without expert data, (ii) hybrid methods can succeed in hard-exploration discrete-action tasks, and (iii) the choice of core algorithm (QQ-learning vs. DDPG) is not essential, although some details may matter. Due to the similarity between the two methods, we believe some of these insights may also translate to Vecerik et al. [2017], and we expect that the choice between Hy-Q and Hy-DDPG will be environment specific, as it is with the purely online versions of these methods. The fact that QQ-learning works in some situations does not immediately imply that deterministic policy gradient methods work there, nor vice versa. Nevertheless, it is beyond the scope of this paper to rigorously verify this claim, and we deem the study of actor-critic algorithms in the hybrid RL setting an interesting future direction.

Appendix E Experiment Details

E.1 Combination Lock

In this section we provide a detailed description of the combination lock experiment. The combination lock environment has horizon HH and 10 actions at each state. There are three latent states zi,h,i{0,1,2}z_{i,h},i\in\{0,1,2\} for each timestep hh, where zi,h,i{0,1}z_{i,h},i\in\{0,1\} are good states and z2,hz_{2,h} is the bad state. For each good state, we randomly pick a good action ai,ha_{i,h}, such that in latent state zi,h,i{0,1}z_{i,h},i\in\{0,1\}, taking the good action ai,ha_{i,h} results in a 0.5 probability of transiting to z0,h+1z_{0,h+1} and a 0.5 probability of transiting to z1,h+1z_{1,h+1}, while taking any other action results in a deterministic transition to z2,h+1z_{2,h+1}. At z2,hz_{2,h}, all actions lead deterministically to z2,h+1z_{2,h+1}. For the reward, we give an optimal reward of 1 for landing in zi,H,i{0,1}z_{i,H},i\in\{0,1\}. We also give an anti-shaped reward of 0.1 for all transitions from a good state to a bad state. All other transitions have a reward of 0. The initial distribution is uniform over z0,0z_{0,0} and z1,0z_{1,0}. The observation space has dimension 2log(H+1)2^{\left\lceil{\log(H+1)}\right\rceil}, created by concatenating a one-hot representation of the latent state and a one-hot representation of the horizon (appending 0s if necessary). Random noise from 𝒩(0,0.1)\mathcal{N}(0,0.1) is added to each dimension, and finally the observation is multiplied by a Hadamard matrix. Note that in this environment, the agent needs to act optimally for all HH timesteps to reach a final good state and receive the optimal reward of 1. Once the agent chooses a bad action, it stays in the bad state until the end of the episode, and the trajectory receives at most the 0.1 reward obtained while transiting from a good state to a bad state.
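For concreteness, the dynamics above can be summarized by the following minimal Python sketch. The class name, the padding of the observation vector to a power of two, and reading 𝒩(0,0.1) as a standard deviation are illustrative assumptions and may differ from the implementation used in our experiments.

import numpy as np

class CombinationLock:
    # Minimal sketch of the combination lock dynamics described above.
    def __init__(self, horizon, n_actions=10, noise_std=0.1, seed=0):
        self.H, self.A = horizon, n_actions
        self.rng = np.random.default_rng(seed)
        # one randomly chosen good action per good latent state and timestep
        self.good_actions = self.rng.integers(0, n_actions, size=(2, horizon))
        # pad the concatenated one-hot encodings up to a power of two
        self.obs_dim = 1 << int(np.ceil(np.log2(3 + horizon + 1)))
        self.hadamard = self._hadamard(self.obs_dim)
        self.noise_std = noise_std

    @staticmethod
    def _hadamard(n):
        h = np.array([[1.0]])
        while h.shape[0] < n:
            h = np.block([[h, h], [h, -h]])
        return h

    def reset(self):
        self.h = 0
        self.z = int(self.rng.integers(0, 2))  # uniform over the two good states
        return self._obs()

    def step(self, action):
        reward = 0.0
        if self.z < 2:  # currently in a good latent state
            if action == self.good_actions[self.z, self.h]:
                self.z = int(self.rng.integers(0, 2))  # move to a random good state
            else:
                self.z = 2      # fall into the absorbing bad state
                reward = 0.1    # anti-shaped reward for a good -> bad transition
        self.h += 1
        done = (self.h == self.H)
        if done and self.z < 2:
            reward = 1.0        # optimal reward for ending in a good state
        return self._obs(), reward, done

    def _obs(self):
        x = np.zeros(self.obs_dim)
        x[self.z] = 1.0          # one-hot latent state (3 dims)
        x[3 + self.h] = 1.0      # one-hot timestep
        x += self.rng.normal(0.0, self.noise_std, size=self.obs_dim)
        return self.hadamard @ x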

E.2 Implementation Details of Combination Lock experiment

We train HH separate Q-functions for all HH timesteps. Our function class consists of an encoder and a decoder. For the encoder, we feed the observation into one linear layer with 3 outputs, followed by a softmax layer to get a state-representation. This design of encoder is intended to learn a one-hot representation of the latent state. We take a Kronecker Product of the state-representation and the action, and feed the result to a linear layer with only one output, which will be our Q value. In order to stabilize the training, we warm-start the Q-function of timestep h1h-1 with the encoder from the Q-function at timestep hh of the current iteration and the decoder from the Q-function at time step h1h-1 of the previous iteration, for each iteration of training.
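Below is a minimal PyTorch-style sketch of one such per-timestep Q-function; the one-hot action encoding and the exact layer shapes are assumptions for illustration and may differ from our released code.

import torch
import torch.nn as nn

class LockQFunction(nn.Module):
    # Sketch of the per-timestep Q-function: a linear encoder with softmax
    # producing a 3-dimensional state representation, followed by a linear
    # decoder applied to the Kronecker product of the representation and a
    # one-hot action encoding.
    def __init__(self, obs_dim, n_actions=10, n_latent=3):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, n_latent)
        self.decoder = nn.Linear(n_latent * n_actions, 1)

    def forward(self, obs, action_onehot):
        z = torch.softmax(self.encoder(obs), dim=-1)                   # (B, 3)
        feat = torch.einsum("bi,bj->bij", z, action_onehot).flatten(1)
        return self.decoder(feat).squeeze(-1)                          # Q(s, a)

Under this sketch, the warm-starting described above amounts to copying the encoder weights from the timestep-h network of the current iteration and the decoder weights from the timestep-(h-1) network of the previous iteration before training.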

One remark is that since the combination lock belongs to the class of Block MDPs, we require a V-type algorithm instead of the Q-type algorithm presented in the main text. The only difference lies in the online sampling process: instead of sampling from dhπtd^{\pi^{t}}_{h}, for each hh we sample from dhπtUniform(𝒜)d_{h}^{\pi^{t}}\circ\textrm{Uniform}(\mathcal{A}), i.e., we first roll in with πt\pi^{t} up to timestep h1h-1, then take a uniformly random action, observe the transition, and collect that tuple. See Algorithm 2.
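The following sketch illustrates this V-type collection step, reusing the hypothetical CombinationLock interface above; the policy(obs, step) interface returning a greedy action index is an assumption for illustration.

import numpy as np

def collect_vtype_tuple(env, policy, h, rng):
    # Roll in with the current greedy policy up to timestep h-1, then take a
    # uniformly random action at timestep h and record the transition.
    obs = env.reset()
    for step in range(h):
        obs, _, _ = env.step(policy(obs, step))
    action = int(rng.integers(0, env.A))
    next_obs, reward, _ = env.step(action)
    return obs, action, reward, next_obs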

For CQL, we implemented the CQL-DQN variant and reported the peak of the learning curve in the main paper (in other words, we picked the best performance of CQL using online samples). While CQL is meant to be a pure offline RL baseline, we tune its hyperparameters using online samples. Thus our reported results can be regarded as an upper bound on the performance that CQL would achieve if it were trained in the pure offline setting.

E.3 Implementation Details of Montezuma’s Revenge experiment

For Montezuma’s Revenge, we follow the Atari game convention and use a discounted setting with discount factor 0.99. In this section we provide the detailed algorithm for the discounted setting. The overall algorithm is described in Algorithm 3. For the function approximation, we use a class of convolutional neural networks (parameterized by class Θ\Theta) as promoted by the original Dqn paper [Mnih et al., 2015]. We include several standard empirical design choices that are commonly used to stabilize training: we use Prioritized Experience Replay [Schaul et al., 2015] for our buffer, and we add Double Dqn [Van Hasselt et al., 2016] and Dueling Dqn [Wang et al., 2016] to our Q-update; these are standard heuristics commonly used in Q-learning implementations. We also observe that decaying schedules for the offline sample ratio β\beta and the exploration rate ϵ\epsilon help improve performance. Note that an annealed β\beta does not contradict our comment in Section 4 on catastrophic forgetting because β\beta is bounded below and never decreases to zero as online learning proceeds. In addition, we perform a value function update every nvaluen_{\textrm{value}} steps inside each episode, instead of a per-episode update, since this is the popular design choice and leads to better efficiency in practice. A minimal sketch of the mixed minibatch update (lines 10 and 11) is given after Algorithm 3.

Algorithm 3 Discounted Hy-Q
0:  Value function class: \mathcal{F} (induced by Θ\Theta), #iterations: TT, offline dataset 𝒟ν\mathcal{D}^{\nu} of size moffm_{\mathrm{off}}, discount factor γ\gamma, value function update frequency nvaluen_{\textrm{value}}, target update frequency ntargetn_{\mathrm{target}}, learning rate α\alpha, offline sample ratio β\beta, exploration rate ϵ\epsilon, action space 𝒜\mathcal{A}.
1:  Randomly initialize value function fθf^{\theta}.
2:  Initialize target value function f~=fθ\tilde{f}=f^{\theta}.
3:  Initialize online buffer 𝒟=\mathcal{D}=\emptyset.
4:  Sample initial state sd0s\sim d_{0}
5:  for t=1,,Tt=1,\dots,T do
6:     Let π\pi be the ϵ\epsilon-greedy policy w.r.t. fθf^{\theta} i.e., π(s)=argmaxafθ(s,a)\pi(s)=\mathop{\mathrm{argmax}}_{a}f^{\theta}(s,a) with probability 1ϵ1-\epsilon and π(s)=𝒰(𝒜)\pi(s)=\mathcal{U}(\mathcal{A}) with probability ϵ\epsilon. // Online collection
7:     Interact with the environment for one step:
a=π(s),sP(s,a),rR(s,a).a=\pi(s),s^{\prime}\sim P(s,a),r\sim R(s,a).
8:     Update online buffer: 𝒟=𝒟{s,a,r,s}\mathcal{D}=\mathcal{D}\cup\{s,a,r,s^{\prime}\}. // Discounted minibatch FQI using both online and offline data
9:     if tmodnvalue=0t\mod n_{\textrm{value}}=0 then
10:        With probability 1β1-\beta: Sample a minibatch DD of size nminibatchn_{\textrm{minibatch}} from the online buffer 𝒟\mathcal{D}. Otherwise: Sample a minibatch DD of size nminibatchn_{\textrm{minibatch}} from the offline buffer 𝒟ν\mathcal{D}^{\nu}.
11:        Perform one-step gradient descent on DD:
θ=θαθ𝔼^D(fθ(s,a)rγmaxaf~(s,a))2.\displaystyle\theta=\theta-\alpha\nabla_{\theta}\hat{\mathbb{E}}_{D}\left(f^{\theta}(s,a)-r-\gamma\max_{a^{\prime}}\tilde{f}(s^{\prime},a^{\prime})\right)^{2}.
12:     end if // Delayed update of target function every ntargetn_{\textrm{target}} updates
13:     if tmodntarget=0t\mod n_{\textrm{target}}=0 then
14:        Set target function to the current value function: f~=fθ\tilde{f}=f^{\theta}.
15:     end if
16:     Update sss\leftarrow s^{\prime}.
17:  end for
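For concreteness, a minimal PyTorch-style sketch of lines 10 and 11 of Algorithm 3 is given below. It omits the Prioritized Experience Replay weights and the Double/Dueling Dqn heuristics, and the buffer interface (a sample method returning (s, a, r, s') tensors) is an assumption for illustration.

import random
import torch
import torch.nn.functional as F

def hyq_value_update(f_theta, f_target, optimizer, online_buffer, offline_buffer,
                     beta, gamma, batch_size):
    # With probability 1 - beta sample the minibatch from the online buffer,
    # otherwise from the offline buffer (line 10 of Algorithm 3).
    buffer = online_buffer if random.random() > beta else offline_buffer
    s, a, r, s_next = buffer.sample(batch_size)
    # One-step gradient descent on the TD objective (line 11 of Algorithm 3).
    with torch.no_grad():
        target = r + gamma * f_target(s_next).max(dim=-1).values
    q = f_theta(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()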

E.4 Baseline implementation

E.4.1 Combination Lock

We use the open-sourced implementation https://github.com/BY571/CQL/tree/main/CQL-DQN for Cql. For Briee, we use the official code released by the authors: https://github.com/yudasong/briee, where we rely on the code there for the combination lock environment.

E.4.2 Montezuma’s Revenge

We use the open-sourced implementation
https://github.com/jcwleo/random-network-distillation-pytorch for Rnd. For Cql, we use https://github.com/takuseno/d3rlpy for their implementation of Cql for Atari. We use https://github.com/felix-kerkhoff/DQfD for Dqfd. For all baselines, we keep the hyperparameters used in these public repositories. For Cql and Dqfd, we provide the offline datasets as described in the main text instead of using the offline dataset provided in the public repositories (we note that Cql also fails completely with the original offline dataset, with 1 million samples, provided in the public repository). All baselines are tested in the same stochastic environment setup as in Burda et al. [2018].

E.5 Hardware Infrastructure

We run our experiments on a cluster of compute nodes with Nvidia RTX 3090 GPUs and various CPUs, which do not introduce any randomness into the results.

E.6 Hyperparameters

E.6.1 Combination Lock

We provide the hyperparameters of Hy-Q in Table 1. In addition, we provide the hyperparameters we tried for the Cql baseline in Table 2.

Table 1: Hyperparameters for Hy-Q in combination lock
Hyperparameter   Value Considered   Final Value
Learning rate   {1e-2, 2e-2, 1e-3}   2e-2
Buffer size   {1e8}   1e8
Optimizer   {Adam, SGD}   Adam
Number of updates per iteration   {30, 300, 500}   500
Batch size   {512}   512
Table 2: Hyperparameters for CQL(DQN) in combination lock
Hyperparameter   Value Considered   Final Value
Learning rate   {1e-3}   1e-3
Optimizer   {Adam}   Adam
Buffer size   {1e8}   1e8
Batch size   {512}   512
Discount Factor   {0.99}   0.99
Moving Average Factor τ\tau   {0.01, 0.1, 1}   0.01
Weight on CQL loss α\alpha   {0, 0.1, 0.01}   0.1

E.6.2 Montezuma’s Revenge

We provide the hyperparameters of Hy-Q in Table 3. We reuse many hyperparameter choices from Dqfd. Note that [a,b][a,b] denotes a decreasing/increasing schedule from aa to bb; a minimal sketch of such a schedule is given after Table 3.

Table 3: Hyperparameter of Discounted Hy-Q in Montezuma’s Revenge.
Hyperparameter   Value Considered   Final Value
Learning rate   {6.25e-5, [1e-4,1e-5]}   [1e-4,1e-5]
Offline Schedule β\beta   {0.5,0.2,[0.2,0.01]}   [0.2,0.01]
Exploration ϵ\epsilon rate   {[0.25,0.001]}   [0.25,0.001]
Minibatch size nminibatchn_{\textrm{minibatch}}   {32}   32
Weight decay (regularization) coefficient   {1e-5}   1e-5
Gradient Clipping   {10,20}   10
Discount factor γ\gamma   {0.99}   0.99
Value function update frequency nvaluen_{\textrm{value}}   {4}   4
Target function update frequency ntargetn_{\textrm{target}}   {1000,2000,5000,10000}   10000
Buffer size   {2202^{20}} 2202^{20}
PER Importance Sampling ratio   {[0.6,1]}  [0.6,1]
Online PER ϵ\epsilon   {0.001}  0.001
Offline PER ϵ\epsilon   {0.0001}  0.0001
Online PER Priority Coefficient   {0.4}  0.4
Offline PER Priority Coefficient   {1}  1
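As a concrete (assumed) interpretation of the [a,b][a,b] schedules in Table 3, the sketch below linearly interpolates a scheduled hyperparameter between its initial and final values over training; the released code may use a different decay shape.

def scheduled_value(start, end, step, total_steps):
    # Linearly interpolate from `start` at step 0 to `end` at `total_steps`.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

# Example: the offline ratio beta decaying from 0.2 to 0.01.
beta = scheduled_value(0.2, 0.01, step=50_000, total_steps=1_000_000)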