Offline Reinforcement Learning with Additional Covering Distributions
Abstract
We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation. Despite considerable effort, existing algorithms with theoretical finite-sample guarantees typically assume exploratory data coverage or strongly realizable function classes, both of which are hard to satisfy in reality. While recent works successfully tackle these strong assumptions, they either require gap assumptions that only part of MDPs can satisfy or use behavior regularization that makes the optimality of the learned policy intractable. To address this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset and weakly realizable function classes, given the additional side information of a covering distribution. Furthermore, we demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias in the learning process.
1 Introduction and related works
In offline reinforcement learning (offline RL, also known as batch RL), the learner tries to find good policies with a pre-collected dataset. This data-driven paradigm eliminates the heavy burden of environmental interaction required in online learning, which could be dangerous or costly (e.g., in robotics [Kalashnikov et al., 2018, Sinha and Garg, 2021] and healthcare [Gottesman et al., 2018, 2019, Tang et al., 2022]), making offline RL a promising approach in real-world applications.
In early theoretical studies of offline RL (e.g., Munos [2003, 2005, 2007], Ernst et al. [2005], Antos et al. [2007], Munos and Szepesvari [2008], Farahmand et al. [2010]), researchers analyzed the finite-sample behavior of algorithms under assumptions such as exploratory datasets and realizable or Bellman-complete function classes. However, despite the error propagation bounds and sample complexity guarantees achieved in these works, the strong coverage assumptions on datasets and the non-monotonic assumptions on function classes, which are hard to satisfy in reality, have driven the search for sample-efficient offline RL algorithms under only weak assumptions on the dataset and function classes [Chen and Jiang, 2019].
From the dataset perspective, partial coverage, which means that only some specific policies (or even none) are covered by the dataset [Rashidinejad et al., 2021, Xie et al., 2021, Uehara and Sun, 2021, Song et al., 2022], has been studied. To address the problem of insufficient information, most algorithms use behavior regularization (e.g., Laroche and Trichelair [2017], Kumar et al. [2019], Zhan et al. [2022]) or pessimism in the face of uncertainty (e.g., Liu et al. [2020], Jin et al. [2020], Rashidinejad et al. [2021], Xie et al. [2021], Uehara and Sun [2021], Cheng et al. [2022], Zhu et al. [2023]) to constrain the learned policies to be close to the behavior policy. Most algorithms in this setting (except some that we will discuss later) require function assumptions in some sense of completeness: Bellman-completeness or strict realization with respect to another function class (which we classify as strong realizability).
From the function classes perspective, while the primary concern is the Bellman-completeness assumption, which is criticized for its non-monotonicity, some recent works [Zhan et al., 2022, Chen and Jiang, 2022, Ozdaglar et al., 2022] have noticed that realizability with respect to another function class is also non-monotonic. These non-monotonic properties contradict the intuition from supervised learning that rich function classes perform better (or at least no worse). Typical examples of these assumptions are the “realizability of all candidate policies’ value functions” (e.g., Jiang and Huang [2020], Zhu et al. [2023]) and the “realizability of all candidate policies’ density ratios” (e.g., Xie and Jiang [2020]). These assumptions are as strong as Bellman-completeness, and we classify them as “strong realizability” (Zhan et al. [2022], Ozdaglar et al. [2022] refer to them as “completeness-type”) for clarity. Correspondingly, we classify the assumption that a function class realizes specific elements as “weak realizability” (Chen and Jiang [2022] refers to this as “realizability-type”). We argue that this taxonomy is also justified because Bellman-completeness can be converted into a realizability assumption between two function classes via the minimax algorithm [Chen and Jiang, 2019], which aligns the behavior of Bellman-completeness with strong realizability assumptions.
On the one hand, the Bellman-completeness assumption is routinely made in classical finite-sample analyses of offline RL (e.g., the analysis of FQI [Ernst et al., 2005, Antos et al., 2007]) to ensure closed updates of value functions [Sutton and Barto, 2018, Wang et al., 2021]. This assumption is notoriously hard to mitigate, and Foster et al. [2021] even provide an information-theoretic lower bound stating that without Bellman-completeness, sample-efficient offline RL is impossible even with an exploratory dataset and a function class containing all candidate policies’ value functions. Therefore, additional assumptions are clearly required to circumvent Bellman-completeness.
On the other hand, as marginalized importance sampling (MIS, see, e.g., Liu et al. [2018], Uehara et al. [2019], Jiang and Huang [2020], Huang and Jiang [2022]) has shown its effectiveness in eliminating Bellman-completeness with only a partial coverage dataset by assuming the realizability of density ratios in off-policy evaluation (OPE), several works have tried to adapt it to policy optimization. These adaptations retain the elimination of Bellman-completeness, but most come with other drawbacks. Some works (e.g., Jiang and Huang [2020], Zhu et al. [2023]) use OPE as an intermediate evaluation step for policy optimization yet require the strong realizability assumption on the value function class. The others borrow the idea of discriminators from MIS. Lee et al. [2021], Zhan et al. [2022] take value functions as discriminators for the optimal density ratio, using MIS to approximate the linear programming approach to Markov decision processes [Manne, 1960, Puterman, 1994]. Nachum et al. [2019], Chen and Jiang [2022], Uehara et al. [2023] take distribution density ratios as discriminators for the optimal value function by replacing the Bellman equation in OPE with its optimality variant. While in most cases theoretical finite-sample guarantees with these discriminators require strongly realizable function classes (e.g., Rashidinejad et al. [2022]), Zhan et al. [2022], Chen and Jiang [2022], Uehara et al. [2023] avoid this with additional gap assumptions or an alternative criterion of optimality, namely performance degradation w.r.t. the regularized optimal policy. To the best of our knowledge, they are the only works that achieve theoretical sample-efficiency guarantees under only weak realizability and partial coverage assumptions. However, on the one hand, the gap (margin) assumption [Chen and Jiang, 2022, Uehara et al., 2023] means that only specific Markov decision processes (MDPs), those whose optimal value functions have gaps, can be solved. On the other hand, sub-optimality compared with a regularized optimal policy [Zhan et al., 2022] could be meaningless in some cases, and the actual performance of the learned policy is intractable.
As summarized above, the following question arises:
Is sample-efficient offline RL possible with only partial coverage and weak realizability assumptions for general MDPs?
We answer this question in the affirmative and propose an algorithm that achieves finite-sample guarantees under weak assumptions with the help of an additional covering distribution. We assume that the covering distribution covers all non-stationary near-optimal policies and that the dataset covers the trajectories induced by an optimal policy starting from it. The covering distribution is adaptive in the sense that both “non-stationary” and “near-optimal” above are relaxed as the gap of the optimal value function increases. The covering distribution also provides a trade-off against the data coverage assumption: the more accurate it is, the fewer redundant trajectories the dataset needs to cover. Furthermore, we can directly use the data distribution as the covering distribution, as done in Uehara et al. [2023], if the near-optimal variant of their data assumptions is also satisfied.
For comparison, we summarize algorithms with partial coverage that need neither Bellman-completeness nor model realizability (which is even stronger [Chen and Jiang, 2019, Zhan et al., 2022]) in Table 1. Necessary conversions are made to obtain the sub-optimality bounds. We omit additional definitions of notation for simplicity and refer the interested reader to the original papers for more details.
| Algorithm | Data assumptions | Function assumptions | Major drawbacks |
| --- | --- | --- | --- |
| Jiang and Huang [2020] | optimal conc. | | strong realizability |
| Zhan et al. [2022] | optimal conc. | | compares with the regularized optimal policy |
| Chen and Jiang [2022] | optimal conc. | | assumes gap (margin) |
| Rashidinejad et al. [2022] | optimal conc. | | strong realizability |
| Zhu et al. [2023] | optimal conc. | | strong realizability |
| Uehara et al. [2023] | optimal conc. from the data distribution itself | | assumes gap (margin) |
| Ours (VOPR) | optimal conc. from the covering distribution | | assumes a covering distribution |
In conclusion, our contributions are as follows:
• (Section 3) We identify the novel mechanism of non-stationary near-optimal concentrability in policy optimization under weak assumptions.
• (Section 4) We demonstrate the trade-off brought by additional covering distributions for the coverage requirement of the dataset.
• (Section 4) We propose the first algorithm that achieves finite-sample guarantees for general MDPs under only weak realizability and partial coverage assumptions.
2 Preliminaries
This section introduces the basic concepts and notation of offline RL with function approximation and MIS. See Table 2 in Appendix A for a more complete list of notation.
Markov Decision Processes (MDPs)
We consider infinite-horizon discounted MDPs defined as , where is the state space, is the action space, is the transition probability, is the deterministic reward function, is the discount factor that unravels the problem of infinite horizons, and is the initial state distribution. With a policy , we say that it induces a random trajectory if: , , and . We define the expected return of a policy as . We also denote the value function of as the expected return starting from some specific state or state-action pair, as and . We denote the set of optimal policies that achieve the maximum return from as , and a member of it as . We say a policy is optimal almost everywhere if its state value function is maximized at every state and denote it as ( is not always unique). We represent the value functions of as and . It is worth noting that and , the unique solutions of both the Bellman optimality equation and the primal part of the LP approach to MDPs [Puterman, 1994], are not the value functions of all optimal policies. For ease of discussion, we assume , , are compact measurable spaces and, with abuse of notation, we use to denote the corresponding finite uniform measure on each space (e.g., the Lebesgue measure). We use to denote the state-action transition operator for density as .
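For concreteness, the standard objects described in this paragraph can be written as follows; this is a reconstruction in conventional symbols (our own choices, which may differ from the paper's), not a quotation of the paper's notation.

```latex
% Conventional notation for the MDP, return, and value functions described above.
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma, \mu_0), \qquad
J(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\Big],
\quad s_0 \sim \mu_0,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t),
\]
\[
V^{\pi}(s) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\Big], \qquad
Q^{\pi}(s,a) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s,\, a_0 = a\Big].
\]
% V* and Q* denote the value functions of an almost-everywhere optimal policy.
```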
Offline policy learning with function approximation
In the standard theoretical setup of offline RL, we are given a dataset consisting of tuples, which is collected with state distribution and behavior policy such that . We use to denote the composed state-action distribution of the dataset. When the state space and action space become rather complex, function approximation is typically used. For this, we assume there are some function classes at hand that satisfy certain assumptions and have limited complexity (measured by cardinality, metric entropy, and so forth). The function classes considered in this paper are the state-action value function class , the state distribution ratio class , and the policy ratio class .
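In the same conventional (and assumed) notation, the data distribution and the three function classes can be summarized as follows:

```latex
\[
d^{D}(s,a) = \rho(s)\,\pi_b(a \mid s), \qquad
\mathcal{Q} \subset \{Q : \mathcal{S}\times\mathcal{A} \to \mathbb{R}\}, \quad
\mathcal{W} \subset \{w : \mathcal{S}\times\mathcal{A} \to \mathbb{R}_{\ge 0}\}, \quad
\Pi \subset \{\tau : \mathcal{S}\times\mathcal{A} \to \mathbb{R}_{\ge 0}\}.
\]
% rho: state distribution of the dataset; pi_b: behavior policy; Q: value class;
% W: distribution ratio class; Pi: policy ratio class (symbols are our choice).
```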
MIS with density discriminators and error bound
One of the most popular ways to estimate the optimal value function is via the Bellman optimality equation:
(1) |
where denotes the Bellman optimality operator. However, when we try to utilize the constraints from Eq. 1 (e.g., through the error ), the expectation in would introduce the infamous double-sampling issue [Baird, 1995], making the estimation intractable. To overcome this, previous works with MIS took weight functions as discriminators and minimized a weighted sum of Eq. 1. In fact, even the error itself can be written as a weighted sum with some sign function (take if and otherwise [Ozdaglar et al., 2022]). Namely, we approximate through
(2)
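In the same assumed notation, Eq. 1 is the standard Bellman optimality equation, and the weighted-sum surrogate of Eq. 2 has the generic MIS form sketched below (the paper's exact Eq. 2 may differ in signs and normalization):

```latex
\[
Q^{\star} = \mathcal{T} Q^{\star}, \qquad
(\mathcal{T} Q)(s,a) := R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} Q(s', a')\Big],
\]
\[
L(Q, w) := \mathbb{E}_{(s,a) \sim d^{D}}\Big[\, w(s,a)\,\big(Q(s,a) - (\mathcal{T} Q)(s,a)\big)\Big],
\quad w \in \mathcal{W}\ \text{acting as a discriminator}.
\]
```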
Since the weight function class is marginalized into the state-action space (instead of trajectories), this approach is called marginalized importance sampling (MIS) [Liu et al., 2018]. While theoretical guarantees in MIS under weak realizability and partial coverage assumptions are typically obtained for scalar values (e.g., the return [Uehara et al., 2019, Jiang and Huang, 2020]), recently Zhan et al. [2022], Huang and Jiang [2022], Uehara et al. [2023] have gone beyond this and derived error guarantees for the estimators by using strongly convex functions. Among them, the optimal value function estimator from Uehara et al. [2023] forms the basis of this work.
3 From to optimal policy, the minimum requirement
Uehara et al. [2023] shows that accurately estimating the optimal value function under is possible if covers the optimal trajectories starting from itself. This “self-covering” assumption can be relaxed and generalized if we only require an accurate estimator under some state-action distribution such that (we use and to denote the state distribution and policy decomposed from ). In fact, provides a trade-off for the coverage requirement of the dataset: the fewer state-action pairs on the support of , the weaker the data coverage assumption we need to make. Nevertheless, how much of a trade-off can provide while preserving the desired result?
In policy learning, our goal is to derive an optimal policy from the estimated (denoted as ). While there are methods (see Section 4.3 for a brief discussion) that induce policies from by exploiting pessimism or data regularization, one of the most straightforward ways is to take the actions covered by that achieve the maximum in each state. This can be done with the help of policy ratio class , via
(3) |
where (normalized if needed). With the optimal realizability of and concentrability of , Eq. 3 is actually equivalent to
(4) |
which guides us to exploit the coverage provided by . Recall that our goal is to use to trade off the coverage assumption on . Therefore, the remaining question, which forms the primary subject of this section, is
With which can we conclude that is optimal from ,
and what is the minimum requirement of it?
Since and are to provide additional coverage, we also call them “covering distributions”.
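Although the displayed forms of Eqs. 3 and 4 are not reproduced here, the surrounding discussion (an inner product whose right part is non-positive, and Lemma 3 below) suggests a condition of roughly the following shape; we record it only as an informal sketch in our own notation:

```latex
\[
\Big\langle\, \omega,\ \ \mathbb{E}_{a \sim \hat{\pi}(\cdot \mid s)}\big[\hat{Q}^{\star}(s,a)\big] - \max_{a} \hat{Q}^{\star}(s,a) \,\Big\rangle \;=\; 0.
\]
% i.e., on the support of the covering state distribution, the extracted policy places
% its mass on actions that maximize the estimated optimal value function.
```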
The remainder of this section is organized as follows: we first show why single optimal concentrability of is not enough in Section 3.1, and then we introduce the alternative “all optimal concentrability” in Section 3.2 and the adapted version of it in Section 3.3 to accommodate statistical errors.
3.1 The dilemma of single optimal concentrability
Single optimal concentrability is standard [Liu et al., 2020, Xie et al., 2021, Cheng et al., 2022] when we try to mitigate exploratory data assumptions (e.g., all-policy concentrability). However, this framework suffers from a conundrum if we only make weak realizability assumptions: we can certify that the learned policy performs well only if we are informed about the trajectories induced by it, rather than the ones induced by the covered policy.
More specifically, as the optimality of can be quantified by , the performance gap, we can telescope it through the performance difference lemma.
Lemma 1 (The performance difference lemma).
We can decompose the performance gap as
Thus, with Eq. 4, if we want (i.e., ) to be equal to zero, it might be necessary to require to cover (), since the right part of the inner product in Eq. 4 is always non-positive. However, as is estimated and is even random when it is approximated from a dataset, this is usually achieved through all-policy concentrability, i.e., for all in the hypothesis class. Single optimal concentrability fails to provide the desired result.
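For reference, the standard form of the performance difference lemma underlying this discussion, written in our assumed notation, is:

```latex
\[
J(\pi^{\star}) - J(\hat{\pi})
\;=\; \frac{1}{1-\gamma}\,
\mathbb{E}_{s \sim d^{\hat{\pi}}}\Big[ V^{\star}(s) - \mathbb{E}_{a \sim \hat{\pi}(\cdot \mid s)}\big[Q^{\star}(s,a)\big] \Big].
\]
% d^{\hat{\pi}}: normalized discounted state occupancy of \hat{\pi}. Since
% V*(s) >= Q*(s,a) for every (s,a), each term inside the expectation is non-negative,
% which is why certifying optimality requires coverage of d^{\hat{\pi}} itself.
```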
For instance, consider the counterexample in Figure 1, which is adapted from Zhan et al. [2022], Chen and Jiang [2022]. Suppose we finally obtain the following covering distribution and policy:
While achieves single optimal concentrability and achieves the maximum value of in each state on the support of , is not an optimal policy since it would end up with return.

How gap assumptions avoid this
While both Chen and Jiang [2022] and Uehara et al. [2023] consider single optimal concentrability and weak realizability assumptions (Uehara et al. [2023] also assumes additional structures of the dataset), the gap (margin) assumptions ensure that only taking as could achieve Eq. 4. Moreover, Chen and Jiang [2022] shows that with the gap assumption, we can even use a value-based algorithm to derive a near-optimal policy without accurately estimating .
3.2 All-optimal concentrability
While single optimal concentrability suffers from the hardness revealed above, there is still an alternative to the exploratory covering , which is shown in the following lemma:
Lemma 2 (From advantage to optimality).
If covers all distributions induced by non-stationary optimal policies (i.e., for any ) and Eq. 4 holds, then is optimal and .
Remark 1.
Non-stationary policies are frequently employed in the analysis of offline RL [Munos, 2003, 2005, Scherrer and Lesner, 2012, Chen and Jiang, 2019, Liu et al., 2020]. If we make the gap assumption, the “all non-stationary” requirement can be discarded since the action in each state that leads to the optimal return is unique.
Remark 2.
Wang et al. [2022] also utilizes the all-optimal concentrability assumption, but they consider the tabular setting and additionally require gap assumptions to achieve near-optimality guarantees.
We now provide a short proof of Lemma 2, showing by induction that , the non-stationary policy that adopts at the beginning -th to -th (including the -th) steps and then follows , is optimal.
Proof.
We first rewrite the telescoping equation in the performance difference lemma as
(5)
(6)
(7)
where denotes the -th to -th steps (including the -th and -th) part of . Thus, the optimality of depends only on the first -th to -th steps, and is optimal if this part is on the support of . Now we inductively show that, for any natural number , the initial -th to -th steps part is covered:
• The step- part of (i.e., ) is on the support of since there is some (non-stationary) optimal policy covered by it. Therefore, . From Eq. 7, is optimal.
• We next show that if is optimal (which means that it is covered), then the first -th to -th steps part of is covered by , which means that is optimal. This comes from the fact that the initial -th to -th steps part of the state distribution induced by a policy depends only on its previous -th to -th decisions:
From Eq. 7, is optimal.
Thus, for any , there exists a natural number such that
where denotes the -th to -th steps part of the return. Therefore, is optimal. ∎
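To make the induction concrete, one conventional way to write the composite policies used above (assuming rewards bounded by $R_{\max}$ and our own symbols) is:

```latex
\[
\pi_{0:t} := (\underbrace{\hat{\pi}, \ldots, \hat{\pi}}_{\text{steps } 0,\ldots,t},\ \pi^{\star}, \pi^{\star}, \ldots),
\qquad
J(\pi_{0:t}) = J(\pi^{\star})\ \text{for all } t
\ \Longrightarrow\
\big|J(\hat{\pi}) - J(\pi^{\star})\big| \;\le\; \frac{\gamma^{t+1} R_{\max}}{1-\gamma} \;\xrightarrow[t \to \infty]{}\; 0,
\]
% since \hat{\pi} and \pi_{0:t} agree on the first t+1 steps and can only differ
% in the return contributed after step t.
```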
Consequently, instead of the exploratory data assumption, coverage of all non-stationary optimal policies is sufficient for policy optimization.
3.3 Dealing with statistical error
While Lemma 2 is adequate at the population level (i.e., with an infinite amount of data), covering all non-stationary optimal policies is not enough when considering the empirical setting (i.e., with finite samples) due to the introduced statistical error. This motivates us to adapt Lemma 2 with a more refined .
Assumption 1 (All near-optimal concentrability).
We are given a covering distribution such that its state distribution part covers the distributions induced by any non-stationary near-optimal policy :
(8) |
We call a policy near-optimal if , and we denote by the class of all non-stationary near-optimal policies. We also define to suppress the extreme cases. With this refined , we can now derive the optimality of even with some statistical error.
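In symbols (our own notation), the objects introduced here amount to a near-optimal policy class and a concentrability requirement of roughly the following form:

```latex
\[
\Pi^{\mathrm{ns}}_{\epsilon} := \big\{\pi\ \text{(possibly non-stationary)} : J(\pi^{\star}) - J(\pi) \le \epsilon \big\},
\qquad
\sup_{\pi \in \Pi^{\mathrm{ns}}_{\epsilon}}\ \Big\| \frac{d^{\pi}}{\omega} \Big\|_{\infty} \le C_{\omega} < \infty.
\]
% omega: state distribution part of the covering distribution; C_omega: the
% corresponding concentrability coefficient (an informal sketch of Assumption 1 / Eq. 8).
```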
Lemma 3 (From advantage to optimality, with statistical errors).
If , and Assumption 1 holds with , is near-optimal.
We defer the proof of this lemma to Section C.1.
Remark 3 (The asymptotic property of ).
One of the most important properties of all near-optimal concentrability is that depends on the statistical error, which decreases as the amount of data increases.
4 Algorithm and analysis
After discussing the minimum requirement for estimating , this section demonstrates how to fulfill it and accomplish the policy learning task. Our algorithm, which is based on the optimal value estimator from Uehara et al. [2023], follows the pseudocode in Algorithm 1.
(9)
(10)
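To make the overall procedure concrete, the following is a schematic, tabular-MDP sketch of a VOPR-style pipeline: a minimax MIS estimate of the optimal value function followed by policy extraction under the covering distribution. It is not the paper's Eqs. 9-10 (which involve a strongly convex regularizer and reweighting by the covering distribution that we omit here); all names and the simple worst-case-residual objective are our own illustration.

```python
import numpy as np

# Schematic sketch only (assumptions: small tabular MDP, finite candidate classes,
# and a dataset of (s, a, r, s') tuples drawn i.i.d. from the data distribution).
# Q_CLASS: list of candidate value functions, each an array of shape (n_states, n_actions).
# W_CLASS: list of candidate density-ratio discriminators, same shape.
# PI_CLASS: list of candidate policies, arrays of action probabilities per state.
# omega: covering state distribution, array of shape (n_states,).

def bellman_residual(q, s, a, r, s_next, gamma):
    """Empirical Bellman-optimality residual: Q(s,a) - (r + gamma * max_a' Q(s',a'))."""
    return q[s, a] - (r + gamma * np.max(q[s_next]))

def estimate_q_star(dataset, Q_CLASS, W_CLASS, gamma):
    """Minimax MIS estimate: pick the Q whose worst-case weighted residual is smallest."""
    best_q, best_obj = None, np.inf
    for q in Q_CLASS:
        worst = 0.0
        for w in W_CLASS:  # density-ratio discriminators
            obj = np.mean([w[s, a] * bellman_residual(q, s, a, r, s2, gamma)
                           for (s, a, r, s2) in dataset])
            worst = max(worst, abs(obj))
        if worst < best_obj:
            best_q, best_obj = q, worst
    return best_q

def extract_policy(q_hat, PI_CLASS, omega):
    """Pick the candidate policy with the largest estimated value under omega."""
    n_states, n_actions = q_hat.shape
    def value(pi):
        return sum(omega[s] * sum(pi[s, a] * q_hat[s, a] for a in range(n_actions))
                   for s in range(n_states))
    return max(PI_CLASS, key=value)
```

In practice the candidate classes would be parametric and both the inner maximization and the outer minimization would be solved by gradient-based optimization rather than enumeration.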
We organize the rest of this section as follows: we first discuss the trade-off provided by the additional covering distribution and how to deduce it in reality in Section 4.1; we then provide the finite-sample analysis of Algorithm 1 and its proof sketch in Section 4.2; we finally conclude this section by comparing our algorithm with others in Section 4.3.
We defer the main proofs in this section to Appendix D.
4.1 Data assumptions and trade-off
As investigated in recent works [Huang and Jiang, 2022, Uehara et al., 2023], value function estimation under a given distribution requires a dataset that contains trajectories rolled out from it. Thus, our data assumption is as follows.
Assumption 2 (Partial concentrability from ).
The distribution shift from to the state-action distribution induced by from is bounded:
(11) |
It is clear that with Assumption 2, is also covered by .
Proposition 4.
If Assumption 2 holds, by definition of ,
We now clarify the order of the learning process: we are first given a dataset with some good properties; then we try to find a on the support of the state-action distribution of through some inductive bias (with necessary approximation); finally, we apply Algorithm 1 with and .
The choice of creates a trade-off between knowledge about the optimal policy and the data coverage requirement. On the one hand, the most casual choice of is (as in Uehara et al. [2023]), which means we have no prior knowledge about optimal policies. Employing as not only requires the dataset to cover unnecessary suboptimal trajectories, but also makes the data assumption non-monotonic (adding new data points to the dataset could break it). On the other hand, if we have perfect knowledge about optimal policies, Assumption 2 can be significantly alleviated. More concretely, if strictly consists of the state-action distributions of trajectories induced by near-optimal policies, our data assumption reduces to the per-step version of near-optimal concentrability.
Lemma 5.
If is a linear combination of the state-action distributions induced by non-stationary near-optimal policies under a fixed probability measure :
(12) |
And covers all admissible distributions of :
where denotes the normalized distribution of the -th step part of . The distribution shift from is bounded as
While the above case is impractical in reality, it reveals the power of this inductive bias: the more auxiliary information we obtain about optimal paths, the weaker the coverage assumptions required of the dataset.
4.2 Finite-sample guarantee
We now give the finite-sample guarantee of Algorithm 1; before proceeding, we state the necessary function class assumptions. The first are the weak realizability assumptions:
Assumption 3 (Realizability of ).
There exists state-action distribution density ratio such that .
Assumption 4 (Realizability of ).
There exists policy ratio such that and for all .
Assumption 5 (Realizability of ).
contains the optimal value function: .
Next, we gather all the boundedness assumptions here.
Assumption 6 (Boundedness of ).
For any , we assume .
Assumption 7 (Boundedness of ).
For any , we assume .
Assumption 8 (Boundedness of ).
For any , we assume .
Remark 4 (Validity).
The invertibility of is shown by Lemma 10 in Section B.1. While Assumptions 3 and 8 actually subsume Assumption 2, we make it explicit for clarity of explanation. Assumption 4 implicitly assumes that covers ; this can easily be ensured by directly choosing as .
Remark 5.
Although we include the normalization step in Assumption 4, this can also be achieved with some preprocessing steps.
Remark 6.
There is an overlap in the above assumptions: we can derive a policy ratio class directly from and .
With these prerequisites in place, we can finally state our finite-sample guarantee.
Theorem 1 (Sample complexity of learning a near-optimal policy).
If Assumptions 1, 3, 4, 5, 2, 8, 6 and 7 hold with where
then with probability at least , the output from Algorithm 1 is near-optimal:
Proof sketch of Theorem 1
As we can obtain the near-optimality guarantee via Lemma 3, the remaining task is to approximate Eq. 4, which follows from the next two lemmas.
Lemma 6 ( error of under , adapted from Theorem 2 in Uehara et al. [2023]).
If Assumptions 2, 5, 3, 8 and 6 hold, with probability at least , the estimated from Algorithm 1 satisfies
Lemma 7 (From distance to Eq. 4).
If Assumptions 4 and 7 hold,
Combining them, we have that, with probability at least ,
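Schematically, and suppressing the exact constants (which depend on the boundedness parameters in Assumptions 6-8), the two lemmas compose as follows in our assumed notation:

```latex
\[
\Big|\big\langle\, \omega,\ \mathbb{E}_{a\sim\hat{\pi}}\big[\hat{Q}^{\star}(\cdot,a)\big] - \max_{a}\hat{Q}^{\star}(\cdot,a) \,\big\rangle\Big|
\ \le\ C \cdot \big\|\hat{Q}^{\star} - Q^{\star}\big\|_{1,\nu}
\ \le\ C \cdot \epsilon_{\mathrm{stat}}(n, \delta).
\]
% first inequality: the bounded distribution shift of Lemma 7; second inequality: the
% estimation error of Lemma 6; plugging the result into Lemma 3 yields Theorem 1.
```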
4.3 Comparison with related works
We now provide a brief comparison of our method with some related algorithms.
Algorithms with gap assumptions
Chen and Jiang [2022] and Uehara et al. [2023] assume that there are (soft) gaps in the optimal value function, which only a subset of MDPs satisfies, whereas our goal is to handle general problems. Moreover, while our algorithm is based on the optimal value estimator proposed by Uehara et al. [2023], we use the policy ratio to ensure a finite distribution shift, and our near-optimality guarantee does not require the soft margin assumption. Besides, Uehara et al. [2023] use as , assuming that the dataset covers the optimal trajectories starting from itself. This assumption is non-monotonic and hard to satisfy in reality. Instead, we propose using an additional covering distribution as an alternative, which can effectively utilize prior knowledge about the optimal trajectories and trade off the dataset requirement.
Algorithms with behavior regularization
Zhan et al. [2022] use behavior regularization to ensure that the learned policy is close to the dataset. Nevertheless, the regularization makes the optimality of the learned policy intractable.
Algorithms with pessimism in the face of uncertainty
These algorithms (e.g., Jiang and Huang [2020], Liu et al. [2020], Xie et al. [2021], Cheng et al. [2022], Zhu et al. [2023]) are often closely related to approximate dynamic programming (ADP). They “pessimistically” estimate the given policies and update (or choose) policies “pessimistically” with the estimated value functions. However, the evaluation step used in these algorithms always requires the strong realizability of all candidate policies’ value functions, which our algorithm avoids.
Limitations of our algorithm
On the one hand, the additional covering distribution may be hard to access in some scenarios, leading back to using as . On the other hand, although mitigated with increasing dataset size, the assumption of covering all near-optimal policies is still stronger than the classic single-optimal concentrability. In addition, the “non-stationary” coverage requirement is also somewhat restrictive.
5 Conclusion and further work
This paper presents VOPR, a new MIS-based algorithm for offline RL with function approximation. VOPR is inspired by the optimal value estimator proposed in Uehara et al. [2023], and it circumvents the soft margin assumption of the original paper with the near-optimal coverage assumption. While it still works when the data distribution is used as the covering distribution, VOPR can trade off data assumptions with more refined choices. Compared with other algorithms considering partial coverage, VOPR does not make strong function class assumptions and works for general MDPs. Finally, despite these successes, a refined additional covering distribution may be difficult to obtain, and the near-optimal coverage assumption is still stronger than single optimal concentrability. We leave these issues for further investigation.
References
- Antos et al. [2007] András Antos, Rémi Munos, and Csaba Szepesvári. Fitted q-iteration in continuous action-space mdps. In NIPS, 2007.
- Baird [1995] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995.
- Chen and Jiang [2019] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. ArXiv, abs/1905.00360, 2019.
- Chen and Jiang [2022] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps. ArXiv, abs/2203.13935, 2022.
- Cheng et al. [2022] Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. ArXiv, abs/2202.02446, 2022.
- Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res., 6:503–556, 2005.
- Farahmand et al. [2010] Amir Massoud Farahmand, Rémi Munos, and Csaba Szepesvari. Error propagation for approximate policy and value iteration. In NIPS, 2010.
- Foster et al. [2021] Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Annual Conference Computational Learning Theory, 2021.
- Gottesman et al. [2018] Omer Gottesman, Fredrik D. Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, A. Aldo Faisal, Leo Anthony Celi, David A. Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. ArXiv, abs/1805.12298, 2018.
- Gottesman et al. [2019] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:16 – 18, 2019.
- Huang and Jiang [2022] Audrey Huang and Nan Jiang. Beyond the return: Off-policy function estimation under user-specified error-measuring distributions. ArXiv, abs/2210.15543, 2022.
- Jiang and Huang [2020] Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. arXiv: Learning, 2020.
- Jin et al. [2020] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, 2020.
- Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. ArXiv, abs/1806.10293, 2018.
- Kumar et al. [2019] Aviral Kumar, Justin Fu, G. Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, 2019.
- Laroche and Trichelair [2017] Romain Laroche and Paul Trichelair. Safe policy improvement with baseline bootstrapping. ArXiv, abs/1712.06924, 2017.
- Lee et al. [2021] Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joëlle Pineau, and Kee-Eung Kim. Optidice: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning, 2021.
- Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Neural Information Processing Systems, 2018.
- Liu et al. [2020] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. ArXiv, abs/2007.08202, 2020.
- Manne [1960] A.S. Manne. Linear programming and sequential decisions. In Management Science, volume 6, page 259–267, 1960.
- Munos [2003] Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.
- Munos [2005] Rémi Munos. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence, 2005.
- Munos [2007] Rémi Munos. Performance bounds in lp-norm for approximate value iteration. SIAM J. Control. Optim., 46:541–561, 2007.
- Munos and Szepesvari [2008] Rémi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. J. Mach. Learn. Res., 9:815–857, 2008.
- Nachum et al. [2019] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. ArXiv, abs/1912.02074, 2019.
- Ozdaglar et al. [2022] Asuman E. Ozdaglar, Sarath Pattathil, Jiawei Zhang, and K. Zhang. Revisiting the linear-programming framework for offline rl with general function approximation. ArXiv, abs/2212.13861, 2022.
- Puterman [1994] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. In Wiley Series in Probability and Statistics, 1994.
- Rashidinejad et al. [2021] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart J. Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. IEEE Transactions on Information Theory, 68:8156–8196, 2021.
- Rashidinejad et al. [2022] Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart J. Russell, and Jiantao Jiao. Optimal conservative offline rl with general function approximation via augmented lagrangian. ArXiv, abs/2211.00716, 2022.
- Scherrer and Lesner [2012] Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary infinite-horizon markov decision processes. In NIPS, 2012.
- Sinha and Garg [2021] Samarth Sinha and Animesh Garg. S4rl: Surprisingly simple self-supervision for offline reinforcement learning. ArXiv, abs/2103.06326, 2021.
- Song et al. [2022] Yuda Song, Yi Zhou, Ayush Sekhari, J. Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient. ArXiv, abs/2210.06718, 2022.
- Sutton and Barto [2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
- Tang et al. [2022] Shengpu Tang, Maggie Makar, Michael W. Sjoding, Finale Doshi-Velez, and Jenna Wiens. Leveraging factored action spaces for efficient offline reinforcement learning in healthcare. 2022.
- Uehara and Sun [2021] Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage. In International Conference on Learning Representations, 2021.
- Uehara et al. [2019] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, 2019.
- Uehara et al. [2023] Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Refined value-based offline rl under realizability and partial coverage. ArXiv, abs/2302.02392, 2023.
- Wang et al. [2021] Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, and Sham M. Kakade. Instabilities of offline rl with pre-trained neural representation. In International Conference on Machine Learning, 2021.
- Wang et al. [2022] Xinqi Wang, Qiwen Cui, and Simon Shaolei Du. On gap-dependent bounds for offline reinforcement learning. ArXiv, abs/2206.00177, 2022.
- Xie and Jiang [2020] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. ArXiv, abs/2003.03924, 2020.
- Xie et al. [2021] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. In Neural Information Processing Systems, 2021.
- Zhan et al. [2022] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D. Lee. Offline reinforcement learning with realizability and single-policy concentrability. ArXiv, abs/2202.04634, 2022.
- Zhu et al. [2023] Hanlin Zhu, Paria Rashidinejad, and Jiantao Jiao. Importance weighted actor-critic for optimal conservative offline reinforcement learning. ArXiv, abs/2301.12714, 2023.
Appendix A Notations
state space | |
action space | |
state-action value function class | |
state-action distribution ratio function class | |
policy ratio function class | |
members of | |
state value function for policy | |
state-action value function for policy | |
optimal state value function | |
optimal state-action value function | |
uniform measure of , , or , depending on the context | |
dataset used in the algorithm | |
state-action distribution of dataset | |
state distribution of dataset | |
behaviour policy | |
the additional covering distribution | |
state distribution of the additional covering distribution | |
policy of the additional covering distribution | |
inner product of and , usually as | |
, normalizing it if needed (e.g., density) | |
Bellman optimality operator, | |
initial state distribution | |
the -th to -th steps part of | |
is absolutely continuous w.r.t. | |
normalize -th step part of state-action distribution induced by | |
state-action distribution induced by from | |
policy take in the previous -th to -th (include the -th) steps, and take after this | |
the class of all non-stationary near-optimal policies | |
state-action transition kernel with policy | |
conjugate operator of some operator
While , and are mainly used to denote the Radon–Nikodym derivatives of the underlying probability measures w.r.t. , we sometimes also use them to represent the corresponding distribution measures, with abuse of notation.
Appendix B Helper Lemmas
B.1 Properties of
We first provide some properties of (for any policy ) as an operator on the -space of ; similar results also hold for transition operators with policies defined on . Note that the integrals of the absolute values of the functions considered in this subsection are always finite, which means that we can change the order of integration via Fubini’s theorem. As we will consider conjugate operators, we define the inner product as .
Lemma 8.
is linear.
Proof.
Recall the definition of ,
For any ,
This completes the proof. ∎
Lemma 9.
The adjoint operator of is
Remark 7.
Intuitively, we can see as a one-step forward version of , such that we start from , transition into , and take . Also, we can view as a one-step backward version of , such that we compute the value of through the one-step transitioned state-action distribution with the help of .
Proof.
Consider the inner products and ; we now prove that these two are equal. By definition,
and
(Fubini’s Theorem) |
This completes the proof. ∎
Lemma.
Remark 8.
This upper bound should be intuitive since can be seen as a probability transition kernel from to itself.
Proof.
Fix any , , and define . By Fubini’s theorem, we have that
For another function on such that , we can use Hölder’s inequality, which yields
Thus, for any , and function with ,
So, we have that
This completes the proof. ∎
Lemma 10.
is invertible and
Proof.
Since , converges. Taking the product,
This completes the proof. ∎
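The invertibility argument is the standard Neumann-series computation; in conventional operator notation (our reconstruction, writing the operator as $I - \gamma P^{\pi}$) it reads:

```latex
\[
(I - \gamma P^{\pi})^{-1} \;=\; \sum_{t=0}^{\infty} \gamma^{t} (P^{\pi})^{t},
\qquad\text{since}\qquad
(I - \gamma P^{\pi}) \sum_{t=0}^{\infty} \gamma^{t} (P^{\pi})^{t}
\;=\; \sum_{t=0}^{\infty} \gamma^{t} (P^{\pi})^{t} - \sum_{t=1}^{\infty} \gamma^{t} (P^{\pi})^{t} \;=\; I,
\]
% where the series converges because the operator norm of P^pi is at most 1
% (the preceding lemma) and gamma < 1.
```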
Proposition 11.
By definition, .
B.2 Other useful lemmas
Lemma (Performance difference lemma).
We can decompose the performance gap as
Proof.
By definition,
The first equality comes from the Bellman equation, and the fourth equality comes from the definition of . This completes the proof. ∎
Appendix C Detailed proofs for Section 3
C.1 Proof of Lemma 3
Lemma (From advantage to optimality, restatement of Lemma 3).
If , and Assumption 1 holds with , is near-optimal.
Proof.
We begin by using induction to prove that is near-optimal for any :
• We first show that is near-optimal. From Assumption 1, we can use any to conclude that
Thus, we can then show the optimality of by the advantage:
( is non-positive) By the performance difference lemma,
• Next, we show that if is near-optimal, then is near-optimal. Since is optimal, the distribution shift from to is bounded, which means,
Then, we have
( is non-positive) By the performance difference lemma,
Therefore, is near-optimal.
Thus, for any , there exists a natural number such that
where denotes the -th to -th steps part of the return. Therefore, is near-optimal. ∎
Appendix D Detailed proofs for Section 4
D.1 Proof of Lemma 5
Lemma (Restatement of Lemma 5).
If is a linear combination of the state-action distributions induced by near-optimal non-stationary policies under a fixed probability measure :
(13) |
And covers all admissible distributions of :
The distribution shift from is bounded as
Proof.
Define the state-action distribution of policy from at step as
Also, define the global version of it as
We can rewrite as
(Fubini’s theorem)
The last equation comes from the fact that
(Fubini’s theorem)
since
we get
Finally, ,
( indicates )
This completes the proof. ∎
D.2 Proof of Lemma 6
Note that the lemmas and proofs in this subsection are mainly adapted from Uehara et al. [2023]; similar statements can also be found in the original paper. However, since we use to replace , we present them for clarity of explanation and to keep our paper self-contained. We refer interested readers to the original paper for further details.
We first define the expected version of Eq. 10 as
where , and denotes taking expectation with respect to the reweighted data collecting process.
Lemma 12 (Expectation).
The expected value of w.r.t. the data collecting process is :
Proof.
Since only the second term of is random, we additionally define
We can rearrange the expectation as follows,
(14)
(15)
(16)
Then, by the i.i.d. assumption of samples and linear property of MIS,
This completes the proof. ∎
Lemma 13 (Concentration).
For any fixed , with probability at least , for any , ,
Proof.
The statistical error only comes from , as
(Lemma 12)
(Eq. 16)
Since each entry of is bounded:
we can apply Hoeffding’s inequality which yields that, for any , , with probability at least ,
Finally, we can use a union bound and rearrange terms to get that, for any fixed , with probability at least , for any , ,
This completes the proof. ∎
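For completeness, the concentration step presumably invokes the standard Hoeffding inequality for i.i.d. bounded terms; in generic form (our statement, with the terms bounded in $[l,u]$) it reads:

```latex
\[
\Pr\!\left( \bigg| \frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \bigg| \ \ge\ \sqrt{\frac{(u-l)^2 \log(2/\delta)}{2n}} \right) \ \le\ \delta,
\qquad X_1,\ldots,X_n\ \text{i.i.d.},\ X_i \in [l,u],
\]
% followed by a union bound over the finitely many pairs (Q, w), which replaces
% delta by delta divided by the product of the class cardinalities.
```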
Lemma 14.
If is non-negative -a.e. (e.g., ), for any ,
(17) |
Proof.
This result simply comes from the definition:
(Rewrite the expectation with inner products)
(conjugate)
This completes the proof. ∎
Lemma 15.
If Assumption 5 holds, with probability at least , for any and any state-action distribution , we have
(18) |
Proof.
We can decompose Eq. 18 as follows,
where . For the terms above, we have that:
• and are non-positive due to the optimization process.
• and can be bounded by concentration.
• For , as the Bellman optimality equation holds, we have that
Thus, we conclude that with probability at least ,
(Lemma 13) |
This completes the proof. ∎
With the lemmas above, we can now prove Lemma 6.
Lemma ( error of under , restatement of Lemma 6).
If Assumptions 2, 5, 3, 8 and 6 hold, with probability at least , the estimated from Algorithm 1 satisfies
Proof.
By Assumption 3, , and from Lemma 14 we have
Together with Lemma 15, with probability at least ,
Rearranging this, we get
This completes the proof. ∎
D.3 Proof of Lemma 7
Lemma (Restatement of Lemma 7).
If Assumptions 4 and 7 hold,
Proof.
We can rearrange the above term as
(Assumption 4)
The distribution shift comes from the fact that
and the shifts from to and are both bounded by due to Assumptions 4 and 7. This completes the proof. ∎
D.4 Proof of Theorem 1
Theorem (Finite sample guarantee of Algorithm 1, restatement of Theorem 1).
If Assumptions 1, 3, 4, 5, 2, 8, 6 and 7 hold with , then with probability at least , the output from Algorithm 1 is near-optimal: