
Offline Reinforcement Learning with Additional Covering Distributions

Chenjie Mao
School of Computer Science and Technology
Huazhong University of Science and Technology
Wuhan 430074, China
chenjiemao@hust.edu.cn
Abstract

We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation. Despite the efforts devoted, existing algorithms with theoretical finite-sample guarantees typically assume exploratory data coverage or strongly realizable function classes, which are hard to satisfy in reality. While recent works successfully relax these strong assumptions, they either require gap assumptions that hold only for a subset of MDPs or use behavior regularization that renders the optimality of the learned policy intractable. To address this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial-coverage dataset and weakly realizable function classes, given additional side information in the form of a covering distribution. Furthermore, we demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias in the learning process.

1 Introduction and related works

In offline reinforcement learning (offline RL, also known as batch RL), the learner tries to find good policies with a pre-collected dataset. This data-driven paradigm eliminates the heavy burden of environmental interaction required in online learning, which could be dangerous or costly (e.g., in robotics [Kalashnikov et al., 2018, Sinha and Garg, 2021] and healthcare [Gottesman et al., 2018, 2019, Tang et al., 2022]), making offline RL a promising approach in real-world applications.

In early theoretical studies of offline RL (e.g., Munos [2003, 2005, 2007], Ernst et al. [2005], Antos et al. [2007], Munos and Szepesvari [2008], Farahmand et al. [2010]), researchers analyzed the finite-sample behavior of algorithms under assumptions such as exploratory datasets and realizable or Bellman-complete function classes. However, despite the error propagation bounds and sample complexity guarantees achieved in these works, the strong coverage assumptions on datasets and the non-monotonic assumptions on function classes, both of which are hard to satisfy in reality, have driven the search for sample-efficient offline RL algorithms under only weak assumptions about the dataset and function classes [Chen and Jiang, 2019].

From the dataset perspective, partial coverage, which means that only some specific policies (or even none) are covered by the dataset [Rashidinejad et al., 2021, Xie et al., 2021, Uehara and Sun, 2021, Song et al., 2022], is studied. To address the problem of insufficient information, most algorithms use behavior regularization (e.g., Laroche and Trichelair [2017], Kumar et al. [2019], Zhan et al. [2022]) or pessimism in the face of uncertainty (e.g., Liu et al. [2020], Jin et al. [2020], Rashidinejad et al. [2021], Xie et al. [2021], Uehara and Sun [2021], Cheng et al. [2022], Zhu et al. [2023]) to constrain the learned policies to be close to the behavior policy. Most algorithms in this setting (except some that we will discuss later) require function assumptions with some flavor of completeness: Bellman-completeness or strict realizability with respect to another function class (which we refer to as strong realizability).

From the function classes perspective, while the primary concern is the Bellman-completeness assumption, which is criticized for its non-monotonicity, some recent works [Zhan et al., 2022, Chen and Jiang, 2022, Ozdaglar et al., 2022] have noticed that realizability with respect to another function class is also non-monotonic. These non-monotonic properties contradict the intuition from supervised learning that richer function classes perform better (or at least no worse). Typical examples of these assumptions are the “realizability of all candidate policies’ value functions” (e.g., Jiang and Huang [2020], Zhu et al. [2023]) and the “realizability of all candidate policies’ density ratios” (e.g., Xie and Jiang [2020]). These assumptions are as strong as Bellman-completeness, and we classify them as “strong realizability” (Zhan et al. [2022], Ozdaglar et al. [2022] refer to them as “completeness-type”) for clarity. Correspondingly, we classify the assumption that a function class realizes specific elements as “weak realizability” (Chen and Jiang [2022] refers to this as “realizability-type”). We argue that this taxonomy is also justified because Bellman-completeness can be converted into a realizability assumption between two function classes with the minimax algorithm [Chen and Jiang, 2019]. This conversion aligns the behavior of Bellman-completeness with strong realizability assumptions.

On the one hand, the Bellman-completeness assumption is routinely made in classical finite-sample analyses of offline RL (e.g., the analysis of FQI [Ernst et al., 2005, Antos et al., 2007]) to ensure closed updates of value functions [Sutton and Barto, 2018, Wang et al., 2021]. This assumption is notoriously hard to mitigate, and Foster et al. [2021] even provide an information-theoretic lower bound stating that without Bellman-completeness, sample-efficient offline RL is impossible even with an exploratory dataset and a function class containing all candidate policies’ value functions. Therefore, it is clear that additional assumptions are required to circumvent Bellman-completeness.

On the other hand, since marginalized importance sampling (MIS, see, e.g., Liu et al. [2018], Uehara et al. [2019], Jiang and Huang [2020], Huang and Jiang [2022]) has shown its ability to eliminate Bellman-completeness with only a partial coverage dataset by assuming the realizability of density ratios in off-policy evaluation (OPE), there are works that try to adapt it to policy optimization. These adaptations retain the elimination of Bellman-completeness, but most come with other drawbacks. Some works (e.g., Jiang and Huang [2020], Zhu et al. [2023]) use OPE as an intermediate evaluation step for policy optimization yet require the strong realizability assumption on the value function class. The others borrow the idea of discriminators from MIS. Lee et al. [2021], Zhan et al. [2022] take value functions as discriminators for the optimal density ratio, using MIS to approximate the linear programming approach to Markov decision processes [Manne, 1960, Puterman, 1994]. Nachum et al. [2019], Chen and Jiang [2022], Uehara et al. [2023] take distribution density ratios as discriminators for the optimal value function by replacing the Bellman equation in OPE with its optimality variant. While in most cases theoretical finite-sample guarantees with these discriminators require strongly realizable function classes (e.g., Rashidinejad et al. [2022]), Zhan et al. [2022], Chen and Jiang [2022], Uehara et al. [2023] avoid this with additional gap assumptions or an alternative criterion of optimality, namely performance degradation w.r.t. the regularized optimal policy. To the best of our knowledge, they are the only works that achieve theoretical sample-efficiency guarantees under only weak realizability and partial coverage assumptions. However, on the one hand, the gap (margin) assumption [Chen and Jiang, 2022, Uehara et al., 2023] means that only specific Markov decision processes (MDPs), those whose optimal value functions have gaps, can be solved. On the other hand, sub-optimality compared with a regularized optimal policy [Zhan et al., 2022] can be meaningless in some cases, and the actual performance of the learned policy remains intractable.

As summarized above, the following question arises:

Is sample-efficient offline RL possible with only partial coverage and weak realizability assumptions for general MDPs?

We answer this question in the affirmative and propose an algorithm that achieves finite-sample guarantees under weak assumptions with the help of an additional covering distribution. We assume that the covering distribution covers all non-stationary near-optimal policies, and that the dataset covers the trajectories induced by an optimal policy starting from it. The covering distribution is adaptive in the sense that both the “non-stationary” and “near-optimal” requirements above are relaxed as the gap of the optimal value function increases. The covering distribution also gives a trade-off against the data coverage assumption: the more accurate it is, the fewer redundant trajectories are required to be covered by the dataset. Furthermore, we can directly use the data distribution as the covering distribution, as done in Uehara et al. [2023], if the near-optimal variant of their data assumption is also satisfied.

For comparison, we summarize in Table 1 algorithms with partial coverage that need neither Bellman-completeness nor model realizability (which is even stronger [Chen and Jiang, 2019, Zhan et al., 2022]). Necessary conversions are made to obtain the sub-optimality bounds. We omit the definitions of additional notations for simplicity and refer the interested reader to the original papers for more details.

Table 1: Comparison of offline RL algorithms (conc. stands for concentrability)
Algorithm | Data assumptions | Function assumptions | Major drawbacks
Jiang and Huang [2020] | optimal conc. | $w^{\star}\in\mathcal{W}$, and $\forall\pi\in\Pi,\,Q_{\pi}\in\mathcal{C}(\mathcal{Q})$ | strong realizability
Zhan et al. [2022] | optimal conc. | $w^{\star}_{\alpha}\in\mathcal{W}$, and $v^{\star}_{\alpha}\in\mathcal{V}$ | compare with $\pi^{\star}_{\alpha}$
Chen and Jiang [2022] | optimal conc. | $w^{\star}\in\mathcal{W}$, and $Q^{\star}\in\mathcal{Q}$ | assume gap (margin)
Rashidinejad et al. [2022] | optimal conc. | $w^{\star}\in\mathcal{W}$, $V^{\star}\in\mathcal{V}$, $u^{\star}_{w}\in\mathcal{U}\ \forall w$, and $\zeta^{\star}_{w^{\star},u}\in\mathcal{Z}\ \forall u$ | strong realizability
Zhu et al. [2023] | optimal conc. | $w^{\star}\in\mathcal{W}$, and $\forall\pi\in\Pi,\,Q_{\pi}\in\mathcal{Q}$ | strong realizability
Uehara et al. [2023] | optimal conc. from $d^{\mathcal{D}}$ | $w^{\star}\in\mathcal{W}$, and $Q^{\star}\in\mathcal{Q}$ | assume gap (margin)
Ours (VOPR) | optimal conc. from $d_{c}$ | $w^{\star}\in\mathcal{W}$, $\beta^{\star}\in\mathcal{B}$, and $Q^{\star}\in\mathcal{Q}$ | assume a covering $d_{c}$

In conclusion, our contributions are as follows:

  • (Section 3) We identify the novel mechanism of non-stationary near-optimal concentrability in policy optimization under weak assumptions.

  • (Section 4) We demonstrate the trade-off brought by additional covering distributions for the coverage requirement of the dataset.

  • (Section 4) We propose the first algorithm that achieves finite-sample guarantees for general MDPs under only weak realizability and partial coverage assumptions.

2 Preliminaries

This section introduces base concepts and notations in offline RL with function approximation and MIS. See Table 2 in Appendix A for a more complete list of definitions of notations.

Markov Decision Processes (MDPs)

We consider infinite-horizon discounted MDPs defined as $(\mathcal{S},\mathcal{A},P,R,\gamma,\mu_{0})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P\colon\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition probability, $R\colon\mathcal{S}\times\mathcal{A}\to[0,R_{\max}]$ is the deterministic reward function, $\gamma\in(0,1)$ is the discount factor that unravels the problem of infinite horizons, and $\mu_{0}\in\Delta(\mathcal{S})$ is the initial state distribution. With a policy $\pi\colon\mathcal{S}\to\Delta(\mathcal{A})$, we say that it induces a random trajectory $\{s_{0},a_{0},r_{0},s_{1},a_{1},r_{1},\dots,s_{i},a_{i},r_{i},s_{i+1},\dots\}$ if $s_{0}\sim\mu_{0}$, $a_{i}\sim\pi(\cdot\mid s_{i})$, $r_{i}=R(s_{i},a_{i})$, and $s_{i+1}\sim P(\cdot\mid s_{i},a_{i})$. We define the expected return of a policy $\pi$ as $J_{\pi}=\mathbb{E}\big[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\mid\mu_{0},\pi\big]$. We also denote the value functions of $\pi$, the expected return starting from a specific state $s$ or state-action pair $(s,a)$, as $V_{\pi}(s)=\mathbb{E}\big[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\mid s,\pi\big]$ and $Q_{\pi}(s,a)=\mathbb{E}\big[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\mid(s,a),\pi\big]$. We denote the set of optimal policies that achieve the maximum return $J^{\star}$ from $\mu_{0}$ as $\Pi^{\star}$, and its members as $\pi^{\star}$. We say a policy is optimal almost everywhere if its state value function is maximized at every state and denote it as $\pi_{e}^{\star}$ ($\pi_{e}^{\star}$ is not always unique). We write the value functions of $\pi^{\star}_{e}$ as $V^{\star}$ and $Q^{\star}$. It is worth noting that $V^{\star}$ and $Q^{\star}$, the unique solutions of both the Bellman optimality equation and the primal part of the LP approach to MDPs [Puterman, 1994], are not the value functions of all optimal policies. For ease of discussion, we assume $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{S}\times\mathcal{A}$ are compact measurable spaces and, with abuse of notation, we use $\nu$ to denote the corresponding finite uniform measure on each space (e.g., the Lebesgue measure). We use $P_{\pi}$ to denote the state-action transition operator for a density $d$, defined as $P_{\pi}d(s^{\prime},a^{\prime})\coloneqq\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)$.
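To make these definitions concrete, here is a minimal tabular sketch (our illustration, not part of the paper's method; the transition matrix P, reward R, and discount gamma below are hypothetical placeholders) that computes $Q^{\star}$, $V^{\star}$, and $\langle\mu_{0},V^{\star}\rangle$ by value iteration.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    # Compute Q* for a tabular MDP.
    # P: (S, A, S) transition probabilities, R: (S, A) deterministic rewards.
    Q = np.zeros_like(R)
    while True:
        V = Q.max(axis=1)                 # V*(s) = max_a Q*(s, a)
        Q_new = R + gamma * P @ V         # Bellman optimality backup T* Q
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# hypothetical 2-state, 2-action MDP, used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])
gamma, mu0 = 0.9, np.array([1.0, 0.0])

Q_star = value_iteration(P, R, gamma)
V_star = Q_star.max(axis=1)
print(Q_star, V_star, mu0 @ V_star)       # the last value is the optimal return from mu0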

Offline policy learning with function approximation

In the standard theoretical setup of offline RL, we are given a dataset $\mathcal{D}$ consisting of $N$ $(s,a,r,s^{\prime})$ tuples, which is collected with state distribution $\mu^{\mathcal{D}}$ and behavior policy $\pi_{b}$ such that $s\sim\mu^{\mathcal{D}}$, $a\sim\pi_{b}(\cdot\mid s)$, $r=R(s,a)$, $s^{\prime}\sim P(\cdot\mid s,a)$. We use $d^{\mathcal{D}}(s,a)\coloneqq\mu^{\mathcal{D}}(s)\pi_{b}(a\mid s)$ to denote the composed state-action distribution of the dataset. When the state and action spaces become complex, function approximation is typically used. For this, we assume there are function classes at hand that satisfy certain assumptions and have limited complexity (measured by cardinality, metric entropy, and so forth). The function classes considered in this paper are the state-action value function class $\mathcal{Q}\subseteq(\mathcal{S}\times\mathcal{A}\to\mathbb{R})$, the state-action distribution ratio class $\mathcal{W}\subseteq(\mathcal{S}\times\mathcal{A}\to\mathbb{R})$, and the policy ratio class $\mathcal{B}\subseteq(\mathcal{S}\times\mathcal{A}\to\mathbb{R})$.
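Continuing the toy example above, the following short sketch (again our illustration; mu_D and pi_b are hypothetical) logs $N$ tuples $(s,a,r,s^{\prime})$ exactly as in this data-collection protocol and forms $d^{\mathcal{D}}=\mu^{\mathcal{D}}\times\pi_{b}$.

rng = np.random.default_rng(0)

def collect_dataset(P, R, mu_D, pi_b, N):
    # Log N (s, a, r, s') tuples with s ~ mu_D, a ~ pi_b(.|s), s' ~ P(.|s, a).
    data = []
    for _ in range(N):
        s = rng.choice(len(mu_D), p=mu_D)
        a = rng.choice(P.shape[1], p=pi_b[s])
        data.append((s, a, R[s, a], rng.choice(P.shape[2], p=P[s, a])))
    return data

mu_D = np.array([0.5, 0.5])                    # hypothetical data state distribution
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])      # hypothetical behavior policy
data = collect_dataset(P, R, mu_D, pi_b, N=5000)
d_D = mu_D[:, None] * pi_b                     # d^D(s, a) = mu^D(s) pi_b(a | s)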

MIS with density discriminators and $L^{2}$ error bound

One of the most popular ways to estimate the optimal value function is via the Bellman optimality equation:

$\forall s\in\mathcal{S},a\in\mathcal{A},\quad Q^{\star}(s,a)=T^{\star}Q^{\star}(s,a)$ (1)

where $T^{\star}q(s,a)\coloneqq R(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\max q(s^{\prime},\cdot)]$ denotes the Bellman optimality operator. However, when we try to utilize the constraints from Eq. 1 (e.g., through the $L^{1}$ error $\lVert q-T^{\star}q\rVert_{1,d^{\mathcal{D}}}$), the expectation in $T^{\star}$ would introduce the infamous double-sampling issue [Baird, 1995], making the estimation intractable. To overcome this, previous works with MIS take weight functions as discriminators and minimize a weighted sum of Eq. 1. In fact, even the $L^{1}$ error itself can be written as a weighted sum with a sign function (taking $1$ if $q\geq T^{\star}q$ and $-1$ otherwise [Ozdaglar et al., 2022]). Namely, we approximate $Q^{\star}$ through

$\hat{q}=\operatorname*{argmin}_{q\in\mathcal{Q}}\max_{w\in\mathcal{W}}\mathbb{E}_{d^{\mathcal{D}}}\big[w(s,a)\big(q(s,a)-T^{\star}q(s,a)\big)\big].$ (2)

Since the weight function class $\mathcal{W}$ is marginalized onto the state-action space (instead of trajectories), this approach is called marginalized importance sampling (MIS) [Liu et al., 2018]. While theoretical guarantees for MIS under weak realizability and partial coverage assumptions are typically made for scalar values (e.g., the return [Uehara et al., 2019, Jiang and Huang, 2020]), recently, Zhan et al. [2022], Huang and Jiang [2022], Uehara et al. [2023] have gone beyond this and derived $L^{2}$ error guarantees for the estimators by using strongly convex functions. Among them, the optimal value function estimator from Uehara et al. [2023] forms the basis of this work.
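As a concrete illustration of the minimax estimator in Eq. 2 (a sketch with hypothetical finite, tabular classes Q_class and W_class; not the exact procedure of any cited work), note that the single-sample weighted residual $w(s,a)\big(q(s,a)-r-\gamma\max q(s^{\prime},\cdot)\big)$ is an unbiased estimate of $w(s,a)\big(q(s,a)-T^{\star}q(s,a)\big)$, so the objective can be estimated from logged tuples without double sampling:

def weighted_residual(q, w, data, gamma):
    # Empirical estimate of E_{d^D}[ w(s,a) (q(s,a) - T* q(s,a)) ].
    return float(np.mean([w[s, a] * (q[s, a] - r - gamma * q[sn].max())
                          for (s, a, r, sn) in data]))

def minimax_q_estimate(Q_class, W_class, data, gamma):
    # Eq. 2 with finite classes: argmin over q of the worst-case weighted residual.
    return min(Q_class,
               key=lambda q: max(weighted_residual(q, w, data, gamma)
                                 for w in W_class))

In practice the classes would be parametric, and both the inner maximization and the outer minimization would be solved approximately (e.g., by alternating gradient steps).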

3 From $Q^{\star}$ to optimal policy, the minimum requirement

Uehara et al. [2023] show that accurately estimating the optimal value function $Q^{\star}$ under $d^{\mathcal{D}}$ is possible if $d^{\mathcal{D}}$ covers the optimal trajectories starting from itself. This "self-covering" assumption can be relaxed and generalized if we only require an accurate estimator under some state-action distribution $d_{c}$ such that $d_{c}\ll d^{\mathcal{D}}$ (we use $\mu_{c}$ and $\pi_{c}$ to denote the state distribution and policy decomposed from $d_{c}$). In fact, $d_{c}$ provides a trade-off for the coverage requirement of the dataset: the fewer state-action pairs on the support of $d_{c}$, the weaker the data coverage assumptions we need to make. Nevertheless, how much trade-off can $d_{c}$ provide while preserving the desired result?

In policy learning, our goal is to derive an optimal policy $\hat{\pi}$ from the estimated $Q^{\star}$ (denoted as $\hat{q}$). While there are methods (see Section 4.3 for a brief discussion) that induce policies from $\hat{q}$ by exploiting pessimism or data regularization, one of the most straightforward ways is to take the actions covered by $d_{c}$ that achieve the maximum of $\hat{q}$ in each state. This can be done with the help of the policy ratio class $\mathcal{B}$, via

$\hat{\beta}=\operatorname*{argmax}_{\beta\in\mathcal{B}}\langle\mu_{c},\hat{q}(\cdot,\pi_{\beta})\rangle\quad\text{and take}\quad\hat{\pi}=\pi_{\hat{\beta}},$ (3)

where $\pi_{\beta}(a\mid s)=\pi_{c}(a\mid s)\beta(s,a)$ (normalized if needed; see Table 2). With the optimal realizability of $\mathcal{B}$ and the concentrability of $\pi_{c}$, Eq. 3 is actually equivalent to

$\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle=0,$ (4)

which guides us to exploit the coverage provided by $\mu_{c}$. Recall that our goal is to use $d_{c}$ to trade off the coverage assumption on $d^{\mathcal{D}}$. Therefore, the remaining question, which forms the primary subject of this section, is

With which $\mu_{c}$ can we conclude that $\hat{\pi}$ is optimal from $\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle=0$,
and what is the minimum requirement on it?

Since $\mu_{c}$ and $d_{c}$ are to provide additional coverage, we also call them “covering distributions”.

The remainder of this section is organized as follows: we first show why single optimal concentrability of $\mu_{c}$ is not enough in Section 3.1; we then introduce the alternative “all-optimal concentrability” in Section 3.2 and its adapted version in Section 3.3 to accommodate statistical errors.

3.1 The dilemma of single optimal concentrability

Single optimal concentrability is standard [Liu et al., 2020, Xie et al., 2021, Cheng et al., 2022] when we try to mitigate exploratory data assumptions (e.g., all-policy concentrability). However, this framework suffers from a conundrum when only weak realizability assumptions are made: we can know that the learned policy performs well only if we are informed about the trajectories induced by it, rather than the ones induced by the covered policy.

More specifically, as the optimality of $\hat{\pi}$ can be quantified by the performance gap $J^{\star}-J_{\hat{\pi}}$, we can telescope it through the performance difference lemma.

Lemma 1 (The performance difference lemma).

We can decompose the performance gap as

$(1-\gamma)(J_{\pi_{1}}-J_{\pi_{2}})=\langle\mu_{\pi_{1}},Q_{\pi_{2}}(\cdot,\pi_{1})-Q_{\pi_{2}}(\cdot,\pi_{2})\rangle.$

Thus, with Eq. 4, if we want $J^{\star}-J_{\hat{\pi}}$ (i.e., $J_{\pi^{\star}_{e}}-J_{\hat{\pi}}$) to equal zero, it might be necessary to require $\mu_{c}$ to cover $\mu_{\hat{\pi}}$ ($\mu_{c}\gg\mu_{\hat{\pi}}$), since the right part of the inner product in Eq. 4 is always non-positive. However, as $\hat{\pi}$ is estimated, and is even random when approximated from a dataset, $\mu_{c}\gg\mu_{\hat{\pi}}$ is usually achieved through all-policy concentrability: $\mu_{c}\gg\mu_{\pi}$ for all $\pi$ in the hypothesis class. Single optimal concentrability fails to provide the desired result.

For instance, consider the counterexample in Figure 1, which is adapted from Zhan et al. [2022], Chen and Jiang [2022]. Suppose we finally obtain the following covering distribution and policy:

$\mu_{c}(s)=\begin{cases}\nicefrac{1}{2}&\text{if }s=1\\ \nicefrac{1}{2}&\text{if }s=2\end{cases}\qquad\text{and}\qquad\hat{\pi}(s)=\begin{cases}\text{L}&\text{if }s=1\\ \text{R}&\text{if }s=3\\ \text{Random}&\text{otherwise}.\end{cases}$

While $\mu_{c}$ achieves single optimal concentrability and $\hat{\pi}$ maximizes $Q^{\star}$ at each state on the support of $\mu_{c}$, $\hat{\pi}$ is not an optimal policy since it ends up with $0$ return.

Figure 1: The above MDP is deterministic, and we start from state $1$. We can take actions L (left) and R (right) in each state. In states $1$ and $3$, action L (R) transfers us to the state on the left (right), while taking actions in other states only causes a self-loop. We can only obtain non-zero rewards by taking actions in states $2$ and $4$, with values $1$ and $\frac{1}{\gamma}$ respectively. There are two trajectories that lead to the optimal return $\nicefrac{\gamma}{(1-\gamma)}$: $\{(1,\text{R}),2,\dots\}$ and $\{(1,\text{L}),(3,\text{L}),4,\dots\}$. We take $\gamma$ as the discount factor.
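To check the counterexample numerically, the following sketch (our illustration; it reuses the value_iteration helper from the sketch in Section 2, and the state indexing and $\gamma=0.9$ are our own choices) builds the MDP of Figure 1, verifies that $\hat{\pi}$ is $Q^{\star}$-greedy on the support of $\mu_{c}$ so that Eq. 4 holds, and yet collects return $0$.

gamma = 0.9
nS, nA = 4, 2                       # states 1..4 mapped to indices 0..3; actions L=0, R=1
P = np.zeros((nS, nA, nS))
P[0, 0, 2] = 1.0                    # state 1, L -> state 3
P[0, 1, 1] = 1.0                    # state 1, R -> state 2
P[2, 0, 3] = 1.0                    # state 3, L -> state 4
P[2, 1, 0] = 1.0                    # state 3, R -> state 1
P[1, :, 1] = 1.0                    # state 2 self-loops
P[3, :, 3] = 1.0                    # state 4 self-loops
R = np.zeros((nS, nA))
R[1, :] = 1.0                       # reward 1 for acting in state 2
R[3, :] = 1.0 / gamma               # reward 1/gamma for acting in state 4

Q_star = value_iteration(P, R, gamma)
print(np.isclose(Q_star[0, 0], Q_star[0, 1]))   # True: both L and R are Q*-greedy in state 1

pi_hat = np.array([0, 0, 1, 0])     # L in state 1, R in state 3, arbitrary elsewhere
s, J_hat = 0, 0.0
for t in range(500):                # pi_hat loops 1 -> 3 -> 1 -> ... and never reaches a reward
    a = pi_hat[s]
    J_hat += gamma ** t * R[s, a]
    s = int(P[s, a].argmax())
print(J_hat, gamma / (1 - gamma))   # 0.0 versus the optimal return gamma / (1 - gamma)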

How gap assumptions avoid this

While both Chen and Jiang [2022] and Uehara et al. [2023] consider single optimal concentrability and weak realizability assumptions (Uehara et al. [2023] also assume additional structure of the dataset), the gap (margin) assumptions ensure that only taking $\pi^{\star}$ as $\hat{\pi}$ can achieve Eq. 4. Moreover, Chen and Jiang [2022] show that with the gap assumption, we can even use a value-based algorithm to derive a near-optimal policy without accurately estimating $Q^{\star}$.

3.2 All-optimal concentrability

While single optimal concentrability suffers from the hardness revealed above, there is still an alternative to an exploratory covering $\mu_{c}$, which is shown in the following lemma:

Lemma 2 (From advantage to optimality).

If $\mu_{c}$ covers all distributions induced by non-stationary optimal policies (i.e., $\mu_{c}\gg\mu_{\pi^{\star}_{\textup{non}}}$ for any $\pi^{\star}_{\textup{non}}$) and Eq. 4 holds, then $\hat{\pi}$ is optimal and $J_{\hat{\pi}}=J^{\star}$.

Remark 1.

Non-stationary policies are frequently employed in the analysis of offline RL [Munos, 2003, 2005, Scherrer and Lesner, 2012, Chen and Jiang, 2019, Liu et al., 2020]. If we make the gap assumption, the “all non-stationary” requirement can be dropped, since the action in each state that leads to the optimal return is unique.

Remark 2.

Wang et al. [2022] also utilize the all-optimal concentrability assumption, but they consider the tabular setting and additionally require gap assumptions to achieve near-optimality guarantees.

We now provide a short proof of Lemma 2, showing by induction that $\hat{\pi}_{i}$, the non-stationary policy that adopts $\hat{\pi}$ in the $0$-th through $i$-th steps (inclusive) and then follows $\pi^{\star}_{e}$, is optimal.

Proof.

We first rewrite the telescoping equation in the performance difference lemma as

\begin{aligned}
(1-\gamma)(J_{\hat{\pi}_{i}}-J^{\star})&=\langle\mu_{\hat{\pi}_{i}},Q^{\star}(\cdot,\hat{\pi}_{i})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle &\quad(5)\\
&=\langle\mu_{\hat{\pi}_{i}}^{0:i},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle+\langle\mu_{\hat{\pi}_{i}}^{i+1:\infty},Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle &\quad(6)\\
&=\langle\mu_{\hat{\pi}_{i}}^{0:i},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle &\quad(7)
\end{aligned}

where $\mu^{i:j}_{\pi}$ denotes the $i$-th through $j$-th steps part of $\mu_{\pi}$ (inclusive at both ends). Thus, the optimality of $\hat{\pi}_{i}$ only depends on the $0$-th through $i$-th steps, and $\hat{\pi}_{i}$ is optimal if this part is on the support of $\mu_{c}$. We now show inductively that, for any natural number $i$, the $0$-th through $i$-th steps part is covered:

  • The step-$0$ part of $\mu_{\hat{\pi}}$ (i.e., $(1-\gamma)\mu_{0}$) is on the support of $\mu_{c}$ since there is some (non-stationary) optimal policy $\pi^{\star}$ covered by it,

    $\mu_{c}\gg\mu_{\pi^{\star}}\gg\mu_{0}.$

    Therefore, $\langle\mu_{\hat{\pi}_{0}}^{0:0},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle=0$. From Eq. 7, $\hat{\pi}_{0}$ is optimal.

  • We next show that if $\hat{\pi}_{i}$ is optimal (which means it is covered by $\mu_{c}$), then the $0$-th through $(i+1)$-th steps part of $\mu_{\hat{\pi}}$ is covered by $\mu_{c}$, which means that $\hat{\pi}_{i+1}$ is optimal. This comes from the fact that the $0$-th through $(i+1)$-th steps part of the state distribution induced by a policy only depends on its $0$-th through $i$-th decisions:

    $\mu_{c}\gg\mu_{\hat{\pi}_{i}}\gg\mu_{\hat{\pi}_{i}}^{0:i+1}=\mu_{\hat{\pi}}^{0:i+1}.$

    From Eq. 7, $\hat{\pi}_{i+1}$ is optimal.

Thus, for any $\epsilon>0$, there exists a natural number $i\geq\log_{\gamma}\frac{\epsilon}{V_{\max}}$ such that

$J^{\star}-J_{\hat{\pi}}\leq J^{\star}-J_{\hat{\pi}}^{0:i}\leq J^{\star}-(J_{\hat{\pi}_{i}}-\gamma^{i+1}V_{\textup{max}})\leq\gamma^{i+1}V_{\textup{max}}\leq\epsilon,$

where $J_{\pi}^{i:j}$ denotes the $i$-th through $j$-th steps part of the return. Therefore, $\hat{\pi}$ is optimal. ∎

Consequently, instead of the exploratory data assumption, all non-stationary optimal coverage is sufficient for policy optimization.

3.3 Dealing with statistical error

While Lemma 2 is adequate at the population level (i.e., with an infinite amount of data), covering all non-stationary optimal policies is not enough when considering the empirical setting (i.e., with finite samples) due to the introduced statistical error. This motivates us to adapt Lemma 2 with a more refined $\mu_{c}$.

Assumption 1 (All near-optimal concentrability).

We are given a covering distribution $d_{c}$ such that its state distribution part $\mu_{c}$ covers the distribution induced by any non-stationary $\varepsilon_{c}$ near-optimal policy $\tilde{\pi}$:

$\Big\lVert\frac{\mu_{\tilde{\pi}}}{\mu_{c}}\Big\rVert_{\infty}\leq C_{c},\quad\forall\tilde{\pi}\in\Pi_{\varepsilon_{c},\textup{non}}^{\star}.$ (8)

We call a policy $\pi$ $\varepsilon$ near-optimal if $J^{\star}-J_{\pi}\leq\varepsilon$ and denote by $\Pi_{\varepsilon,\textup{non}}^{\star}$ the class of all non-stationary $\varepsilon$ near-optimal policies. We also define $\frac{0}{0}=1$ to suppress the extreme cases. With this refined $\mu_{c}$, we can now derive the optimality of $\hat{\pi}$ even in the presence of statistical errors.

Lemma 3 (From advantage to optimality, with statistical errors).

If $\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star})\rangle\geq-\varepsilon$ and Assumption 1 holds with $\varepsilon_{c}\geq\frac{C_{c}\varepsilon}{1-\gamma}$, then $\hat{\pi}$ is $\frac{C_{c}\varepsilon}{1-\gamma}$ near-optimal.

We defer the proof of this lemma to Section C.1.

Remark 3 (The asymptotic property of $\varepsilon_{c}$).

One of the most important properties of all near-optimal concentrability is that $\varepsilon_{c}$ depends on the statistical error, which decreases as the amount of data increases.

4 Algorithm and analysis

After discussing the minimum requirement of estimating $Q^{\star}$, this section will demonstrate how to fulfill it and accomplish the policy learning task. Our algorithm, which is based on the optimal value estimator from Uehara et al. [2023], follows the pseudocode in Algorithm 1.

Input: dataset $\mathcal{D}$, value function class $\mathcal{Q}$, distribution density ratio class $\mathcal{W}$, policy ratio function class $\mathcal{B}$, and covering distribution $d_{c}$
1 Estimate the optimal value function $\hat{q}$ as
$\hat{q}=\operatorname*{argmin}_{q\in\mathcal{Q}}\max_{w\in\mathcal{W}}\hat{\mathcal{L}}(d_{c},q,w)$ (9)
where
$\hat{\mathcal{L}}(d,q,w)\coloneqq 0.5\,\mathbb{E}_{d}[q^{2}(s,a)]+\frac{1}{N_{\mathcal{D}}}\sum_{(s,a,r,s^{\prime})\in\mathcal{D}}\Big[w(s,a)\big[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big]\Big]$ (10)
2 Derive the approximated optimal policy ratio:
$\hat{\beta}=\operatorname*{argmax}_{\beta\in\mathcal{B}}\mathbb{E}_{\mu_{c}}[\hat{q}(s,\pi_{\beta})]$
Output: $\hat{\pi}=\pi_{\hat{\beta}}$
Algorithm 1 VOPR (Value-Based Offline RL with Policy Ratio)
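For concreteness, here is a minimal sketch of Algorithm 1 for finite, tabular function classes (our illustration; the enumeration over Q_class, W_class, and B_class and the dataset format of $(s,a,r,s^{\prime})$ tuples are assumptions, and in practice both optimization steps would be approximated with parametric classes and gradient methods).

def L_hat(d_c, q, w, data, gamma):
    # Eq. 10: 0.5 E_{d_c}[q^2] + (1/N) sum_D w(s,a) [gamma max_a' q(s',a') + r - q(s,a)]
    reg = 0.5 * float(np.sum(d_c * q ** 2))
    td = float(np.mean([w[s, a] * (gamma * q[sn].max() + r - q[s, a])
                        for (s, a, r, sn) in data]))
    return reg + td

def vopr(data, Q_class, W_class, B_class, d_c, gamma):
    # Step 1 (Eq. 9): q_hat = argmin_q max_w L_hat(d_c, q, w)
    q_hat = min(Q_class,
                key=lambda q: max(L_hat(d_c, q, w, data, gamma) for w in W_class))
    # Decompose d_c into its state distribution mu_c and policy pi_c
    mu_c = d_c.sum(axis=1)
    pi_c = d_c / np.maximum(mu_c[:, None], 1e-12)
    def value_under_mu_c(beta):
        # E_{mu_c}[ q_hat(s, pi_beta) ] with pi_beta proportional to pi_c * beta
        pi_beta = pi_c * beta
        pi_beta /= np.maximum(pi_beta.sum(axis=1, keepdims=True), 1e-12)
        return float(np.sum(mu_c * np.sum(pi_beta * q_hat, axis=1)))
    # Step 2: beta_hat = argmax_beta E_{mu_c}[ q_hat(s, pi_beta) ]
    beta_hat = max(B_class, key=value_under_mu_c)
    pi_hat = pi_c * beta_hat
    return pi_hat / np.maximum(pi_hat.sum(axis=1, keepdims=True), 1e-12)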

We organize the rest of this section as follows: we first discuss the trade-off provided by the additional covering distribution $d_{c}$ and how to obtain $d_{c}$ in practice in Section 4.1; we then provide the finite-sample analysis of Algorithm 1 and its proof sketch in Section 4.2; we finally conclude this section by comparing our algorithm with others in Section 4.3.

We defer the main proofs in this section to Appendix D.

4.1 Data assumptions and trade-off

As investigated in recent works [Huang and Jiang, 2022, Uehara et al., 2023], value function estimation under a given distribution requires a dataset that contains trajectories rolled out from it. Thus, our data assumption is as follows.

Assumption 2 (Partial concentrability from $d_{c}$).

The shift from $d^{\mathcal{D}}$ to the state-action distribution induced by $\pi_{e}^{\star}$ from $d_{c}$ is bounded:

$\Big\lVert\frac{d_{d_{c},\pi^{\star}_{e}}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq C_{\mathcal{D}}.$ (11)

It is clear that with Assumption 2, $d_{c}$ is also covered by $d^{\mathcal{D}}$.

Proposition 4.

If Assumption 2 holds, then by the definition of $d_{d_{c},\pi^{\star}_{e}}$,

$\Big\lVert\frac{d_{c}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq\Big\lVert\frac{d_{d_{c},\pi^{\star}_{e}}/(1-\gamma)}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq\frac{C_{\mathcal{D}}}{1-\gamma}.$
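For clarity, we spell out the first inequality (a short derivation we add here; it only uses Proposition 11 in Appendix B): since every term of the Neumann series is non-negative,

$d_{d_{c},\pi^{\star}_{e}}=(1-\gamma)\sum_{i=0}^{\infty}(\gamma P_{\pi^{\star}_{e}})^{i}d_{c}\;\geq\;(1-\gamma)\,d_{c}\quad\text{pointwise},\qquad\text{and hence}\quad\frac{d_{c}}{d^{\mathcal{D}}}\;\leq\;\frac{d_{d_{c},\pi^{\star}_{e}}/(1-\gamma)}{d^{\mathcal{D}}}.$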

We now clarify the order of the learning process: we are first given a dataset $\mathcal{D}$ with some good properties; then we try to find a $d_{c}$ on the support of the state-action distribution of $\mathcal{D}$ through some inductive bias (with necessary approximation); finally, we apply Algorithm 1 with $\mathcal{D}$ and $d_{c}$.

The choice of $d_{c}$ constitutes a trade-off between knowledge about optimal policies and the requirement on data coverage. On the one hand, the most casual choice of $d_{c}$ is $d^{\mathcal{D}}$ (as in Uehara et al. [2023]), which means we have no prior knowledge about optimal policies. Employing $d^{\mathcal{D}}$ as $d_{c}$ not only requires the dataset to cover unnecessary suboptimal trajectories, but also makes the data assumption non-monotonic (adding new data points to the dataset could break it). On the other hand, if we have perfect knowledge about optimal policies, Assumption 2 can be significantly relaxed. More concretely, if $d_{c}$ consists strictly of the state-action distributions of trajectories induced by near-optimal policies, our data assumption reduces to the per-step version of near-optimal concentrability.

Lemma 5.

If $d_{c}$ is a mixture of the state-action distributions induced by non-stationary $\varepsilon$ near-optimal policies $\Pi_{\varepsilon,\textup{non}}^{\star}$ under a fixed probability measure $\lambda$:

$d_{c}=\int_{\Pi_{\varepsilon,\textup{non}}^{\star}}d_{\tilde{\pi}}\,d\lambda(\tilde{\pi}),$ (12)

and $d^{\mathcal{D}}$ covers all admissible distributions of $\Pi_{\varepsilon,\textup{non}}^{\star}$:

$\forall\,\tilde{\pi}\in\Pi^{\star}_{\varepsilon,\textup{non}},\ i\in\mathbb{N},\quad\Big\lVert\frac{d_{\tilde{\pi},i}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq C,$

where $d_{\pi,i}$ denotes the normalized distribution of the $i$-th step part of $d_{\pi}$, then the distribution shift from $d^{\mathcal{D}}$ is bounded as

$\Big\lVert\frac{d_{d_{c},\pi^{\star}_{e}}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq C.$

While the above case is impractical in reality, it reveals the power of this inductive bias: the more auxiliary information we obtain about optimal paths, the weaker the coverage assumptions required of the dataset.

4.2 Finite-sample guarantee

We now give the finite-sample guarantee of Algorithm 1; before proceeding, we state the necessary function class assumptions. The first are the weak realizability assumptions:

Assumption 3 (Realizability of $\mathcal{W}$).

There exists a state-action distribution density ratio $w^{\star}\in\mathcal{W}$ such that $w^{\star}\circ d^{\mathcal{D}}=(I-\gamma P_{\pi^{\star}_{e}})^{-1}d_{c}Q^{\star}$.

Assumption 4 (Realizability of $\mathcal{B}$).

There exists a policy ratio $\beta^{\star}\in\mathcal{B}$ such that $\beta^{\star}\circ\pi_{c}=\pi^{\star}_{e}$ and, for all $s\in\mathcal{S}$, $\int_{\mathcal{A}}\beta(s,a)\pi_{c}(s,a)\,d\nu(a)=1$.

Assumption 5 (Realizability of $\mathcal{Q}$).

$\mathcal{Q}$ contains the optimal value function: $Q^{\star}\in\mathcal{Q}$.

Next, we gather all the boundedness assumptions.

Assumption 6 (Boundedness of $\mathcal{Q}$).

For any $q\in\mathcal{Q}$, we assume $q\in(\mathcal{S}\times\mathcal{A}\to[0,V_{\textup{max}}])$.

Assumption 7 (Boundedness of $\mathcal{B}$).

For any $\beta\in\mathcal{B}$, we assume $\beta\in(\mathcal{S}\times\mathcal{A}\to[0,U_{\mathcal{B}}])$.

Assumption 8 (Boundedness of $\mathcal{W}$).

For any $w\in\mathcal{W}$, we assume $w\in(\mathcal{S}\times\mathcal{A}\to[0,U_{\mathcal{W}}])$.

Remark 4 (Validity).

The invertibility of $I-\gamma P_{\pi^{\star}_{e}}$ is shown by Lemma 10 in Section B.1. While Assumptions 3 and 8 actually subsume Assumption 2, we make it explicit for clarity of explanation. Assumption 4 implicitly assumes that $\pi_{c}$ covers $\pi^{\star}_{e}$; this can easily be achieved by directly choosing $\pi_{b}$ as $\pi_{c}$.

Remark 5.

Although we include the normalization step in Assumption 4, this can also be achieved with some preprocessing steps.

Remark 6.

There is an overlap in the above assumptions: we can derive a policy ratio class $\mathcal{B}$ directly from $\mathcal{W}$ and $\mathcal{Q}$.

With these prerequisites in place, we can finally state our finite-sample guarantee.

Theorem 1 (Sample complexity of learning a near-optimal policy).

If Assumptions 1, 2, 3, 4, 5, 6, 7, and 8 hold with $\varepsilon_{c}\geq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}$, where

$\varepsilon_{\textup{stat}}=U_{\mathcal{W}}V_{\textup{max}}\sqrt{\frac{2\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{N_{\mathcal{D}}}},$

then with probability at least $1-\delta$, the output $\hat{\pi}$ of Algorithm 1 is near-optimal:

$J^{\star}-J_{\hat{\pi}}\leq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}.$
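As a worked consequence (our rearrangement of the stated bound, keeping the constants as given): to guarantee $J^{\star}-J_{\hat{\pi}}\leq\epsilon$ it suffices that $\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}\leq\epsilon$, which after substituting $\varepsilon_{\textup{stat}}$ and solving for $N_{\mathcal{D}}$ gives

$N_{\mathcal{D}}\;\geq\;\frac{512\,C_{c}^{4}U_{\mathcal{B}}^{4}U_{\mathcal{W}}^{2}V_{\textup{max}}^{2}\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{(1-\gamma)^{4}\epsilon^{4}},$

i.e., a sample complexity of order $C_{c}^{4}U_{\mathcal{B}}^{4}U_{\mathcal{W}}^{2}V_{\textup{max}}^{2}\log(\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)/((1-\gamma)^{4}\epsilon^{4})$.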

Proof sketch of Theorem 1

As we can obtain the near-optimality guarantee via Lemma 3, the remaining task is to approximate Eq. 4. This comes from the following two lemmas.

Lemma 6 ($L^{2}$ error of $\hat{q}$ under $d_{c}$, adapted from Theorem 2 in Uehara et al. [2023]).

If Assumptions 2, 3, 5, 6, and 8 hold, then with probability at least $1-\delta$, the estimate $\hat{q}$ from Algorithm 1 satisfies

$\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}.$
Lemma 7 (From $L^{1}$ distance to Eq. 4).

If Assumptions 4 and 7 hold,

$\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}.$

Combining them, we have that with probability at least $1-\delta$,

$\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}\leq 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 4U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}.$
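Spelling out the last step: applying Lemma 3 with $\varepsilon=4U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}$ (which is admissible since $\varepsilon_{c}\geq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}=\frac{C_{c}\varepsilon}{1-\gamma}$ by assumption) yields

$J^{\star}-J_{\hat{\pi}}\;\leq\;\frac{C_{c}\,\varepsilon}{1-\gamma}\;=\;\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma},$

which is the bound in Theorem 1.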

4.3 Comparison with related works

We now provide a brief comparison of our method with some related algorithms.

Algorithms with gap assumptions

Chen and Jiang [2022] and Uehara et al. [2023] assume that there are (soft) gaps in the optimal value function, which only a subset of MDPs satisfies, whereas our goal is to handle general problems. Moreover, while our algorithm is based on the optimal value estimator proposed by Uehara et al. [2023], we use the policy ratio to ensure a finite distribution shift, and our near-optimality guarantee does not require the soft margin assumption. Besides, Uehara et al. [2023] use $d^{\mathcal{D}}$ as $d_{c}$, assuming that the dataset covers the optimal trajectories starting from itself. This assumption is non-monotonic and hard to satisfy in reality. Instead, we propose using an additional covering distribution $d_{c}$ as an alternative, which can effectively utilize prior knowledge about the optimal trajectories and trade off the dataset requirement.

Algorithms with behavior regularization

Zhan et al. [2022] use behavior regularization to ensure that the learned policy is close to the dataset. Nevertheless, the regularization makes the optimality of the learned policy intractable.

Algorithms with pessimism in the face of uncertainty

These algorithms (e.g., Jiang and Huang [2020], Liu et al. [2020], Xie et al. [2021], Cheng et al. [2022], Zhu et al. [2023]) are often closely related to approximate dynamic programming (ADP). They “pessimistically” estimate the given policies and update (or choose) policies with the pessimistic value estimates. However, the evaluation step used in these algorithms always requires strong realizability of all candidate policies’ value functions, which our algorithm avoids.

Limitations of our algorithm

On the one hand, the additional covering distribution may be hard to access in some scenarios, leading back to using $d^{\mathcal{D}}$ as $d_{c}$. On the other hand, although mitigated as the dataset size increases, the assumption of covering all near-optimal policies is still stronger than classic single-optimal concentrability. In addition, the “non-stationary” coverage requirement is also somewhat restrictive.

5 Conclusion and further work

This paper presents VOPR, a new MIS-based algorithm for offline RL with function approximation. VOPR is inspired by the optimal value estimator proposed in Uehara et al. [2023], and it circumvents the soft margin assumption of the original paper with a near-optimal coverage assumption. While it still works when using the data distribution as the covering distribution, VOPR can trade off data assumptions with more refined choices. Compared with other algorithms considering partial coverage, VOPR does not make strong function class assumptions and works for general MDPs. Finally, despite these successes, a refined additional covering distribution may be difficult to obtain, and the near-optimal coverage assumption is still stronger than single optimal concentrability. We leave these issues for further investigation.

References

  • Antos et al. [2007] András Antos, Rémi Munos, and Csaba Szepesvári. Fitted q-iteration in continuous action-space mdps. In NIPS, 2007.
  • Baird [1995] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995.
  • Chen and Jiang [2019] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. ArXiv, abs/1905.00360, 2019.
  • Chen and Jiang [2022] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps. ArXiv, abs/2203.13935, 2022.
  • Cheng et al. [2022] Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. ArXiv, abs/2202.02446, 2022.
  • Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res., 6:503–556, 2005.
  • Farahmand et al. [2010] Amir Massoud Farahmand, Rémi Munos, and Csaba Szepesvari. Error propagation for approximate policy and value iteration. In NIPS, 2010.
  • Foster et al. [2021] Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Annual Conference Computational Learning Theory, 2021.
  • Gottesman et al. [2018] Omer Gottesman, Fredrik D. Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, A. Aldo Faisal, Leo Anthony Celi, David A. Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. ArXiv, abs/1805.12298, 2018.
  • Gottesman et al. [2019] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:16 – 18, 2019.
  • Huang and Jiang [2022] Audrey Huang and Nan Jiang. Beyond the return: Off-policy function estimation under user-specified error-measuring distributions. ArXiv, abs/2210.15543, 2022.
  • Jiang and Huang [2020] Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. arXiv: Learning, 2020.
  • Jin et al. [2020] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, 2020.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. ArXiv, abs/1806.10293, 2018.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, G. Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, 2019.
  • Laroche and Trichelair [2017] Romain Laroche and Paul Trichelair. Safe policy improvement with baseline bootstrapping. ArXiv, abs/1712.06924, 2017.
  • Lee et al. [2021] Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joëlle Pineau, and Kee-Eung Kim. Optidice: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning, 2021.
  • Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Neural Information Processing Systems, 2018.
  • Liu et al. [2020] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. ArXiv, abs/2007.08202, 2020.
  • Manne [1960] A.S. Manne. Linear programming and sequential decisions. In Management Science, volume 6, page 259–267, 1960.
  • Munos [2003] Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.
  • Munos [2005] Rémi Munos. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence, 2005.
  • Munos [2007] Rémi Munos. Performance bounds in lp-norm for approximate value iteration. SIAM J. Control. Optim., 46:541–561, 2007.
  • Munos and Szepesvari [2008] Rémi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. J. Mach. Learn. Res., 9:815–857, 2008.
  • Nachum et al. [2019] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. ArXiv, abs/1912.02074, 2019.
  • Ozdaglar et al. [2022] Asuman E. Ozdaglar, Sarath Pattathil, Jiawei Zhang, and K. Zhang. Revisiting the linear-programming framework for offline rl with general function approximation. ArXiv, abs/2212.13861, 2022.
  • Puterman [1994] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. In Wiley Series in Probability and Statistics, 1994.
  • Rashidinejad et al. [2021] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart J. Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. IEEE Transactions on Information Theory, 68:8156–8196, 2021.
  • Rashidinejad et al. [2022] Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart J. Russell, and Jiantao Jiao. Optimal conservative offline rl with general function approximation via augmented lagrangian. ArXiv, abs/2211.00716, 2022.
  • Scherrer and Lesner [2012] Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary infinite-horizon markov decision processes. In NIPS, 2012.
  • Sinha and Garg [2021] Samarth Sinha and Animesh Garg. S4rl: Surprisingly simple self-supervision for offline reinforcement learning. ArXiv, abs/2103.06326, 2021.
  • Song et al. [2022] Yuda Song, Yi Zhou, Ayush Sekhari, J. Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient. ArXiv, abs/2210.06718, 2022.
  • Sutton and Barto [2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
  • Tang et al. [2022] Shengpu Tang, Maggie Makar, Michael W. Sjoding, Finale Doshi-Velez, and Jenna Wiens. Leveraging factored action spaces for efficient offline reinforcement learning in healthcare. 2022.
  • Uehara and Sun [2021] Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage. In International Conference on Learning Representations, 2021.
  • Uehara et al. [2019] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, 2019.
  • Uehara et al. [2023] Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Refined value-based offline rl under realizability and partial coverage. ArXiv, abs/2302.02392, 2023.
  • Wang et al. [2021] Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, and Sham M. Kakade. Instabilities of offline rl with pre-trained neural representation. In International Conference on Machine Learning, 2021.
  • Wang et al. [2022] Xinqi Wang, Qiwen Cui, and Simon Shaolei Du. On gap-dependent bounds for offline reinforcement learning. ArXiv, abs/2206.00177, 2022.
  • Xie and Jiang [2020] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. ArXiv, abs/2003.03924, 2020.
  • Xie et al. [2021] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. In Neural Information Processing Systems, 2021.
  • Zhan et al. [2022] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D. Lee. Offline reinforcement learning with realizability and single-policy concentrability. ArXiv, abs/2202.04634, 2022.
  • Zhu et al. [2023] Hanlin Zhu, Paria Rashidinejad, and Jiantao Jiao. Importance weighted actor-critic for optimal conservative offline reinforcement learning. ArXiv, abs/2301.12714, 2023.

Appendix A Notations

Table 2: Notations
$\mathcal{S}$ | state space
$\mathcal{A}$ | action space
$\mathcal{Q}$ | state-action value function class
$\mathcal{W}$ | state-action distribution ratio function class
$\mathcal{B}$ | policy ratio function class
$\beta$ | members of $\mathcal{B}$
$V_{\pi}$ | state value function for policy $\pi$
$Q_{\pi}$ | state-action value function for policy $\pi$
$V^{\star}$ | optimal state value function
$Q^{\star}$ | optimal state-action value function
$\nu$ | uniform measure of $\mathcal{A}$, $\mathcal{S}$, or $\mathcal{S}\times\mathcal{A}$, depending on the context
$\mathcal{D}$ | dataset used in the algorithm
$d^{\mathcal{D}}$ | state-action distribution of the dataset
$\mu^{\mathcal{D}}$ | state distribution of the dataset
$\pi_{b}$ | behavior policy
$d_{c}$ | the additional covering distribution
$\mu_{c}$ | state distribution of the additional covering distribution
$\pi_{c}$ | policy of the additional covering distribution
$\langle a,b\rangle$ | inner product of $a$ and $b$, usually as $\int ab\,d\nu$
$f_{1}\circ f_{2}$ | $(f_{1}\circ f_{2})(s,a)=f_{1}(s,a)f_{2}(s,a)$, normalized if needed (e.g., for densities)
$\mu\times\pi$ | $(\mu\times\pi)(s,a)=\mu(s)\pi(a\mid s)$
$T^{\star}$ | Bellman optimality operator, $T^{\star}q(s,a)\coloneqq R(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\max q(s^{\prime},\cdot)]$
$\mu_{0}$ | initial state distribution
$\mu_{\pi}^{i:j}$ | the $i$-th to $j$-th steps part of $\mu_{\pi}$
$d_{1}\gg d_{2}$ | $d_{2}$ is absolutely continuous w.r.t. $d_{1}$
$d_{\pi,i}$ | normalized $i$-th step part of the state-action distribution induced by $\pi$
$d_{d,\pi}$ | state-action distribution induced by $\pi$ from $d$
$\pi_{i}$ | policy that takes $\pi$ in the $0$-th to $i$-th (inclusive) steps, and takes $\pi^{\star}_{e}$ afterwards
$\pi_{\beta}$ | $\pi_{\beta}(a\mid s)=\pi_{c}(a\mid s)\beta(s,a)/\int_{\mathcal{A}}\pi_{c}(a\mid s)\beta(s,a)\,d\nu(a)$
$\Pi_{\varepsilon,\textup{non}}^{\star}$ | the class of all non-stationary $\varepsilon$ near-optimal policies
$P_{\pi}$ | state-action transition kernel with policy $\pi$
$O^{\star}$ | conjugate operator of some operator $O$

While $\pi$, $\mu$, and $d$ are mainly used to denote the Radon–Nikodym derivatives of the underlying probability measures w.r.t. $\nu$, we sometimes also use them to represent the corresponding probability measures, with abuse of notation.

Appendix B Helper Lemmas

B.1 Properties of $P_{\pi}$

We first provide some properties of $P_{\pi}$ (for any policy $\pi$) as an operator on the $L^{\infty}$-space of $\mathcal{S}\times\mathcal{A}$; similar results also hold for transition operators with policies defined on $\mathcal{S}$. Note that the integrals of the absolute values of the functions considered in this subsection are always finite, which means that we can change the order of integration via Fubini's theorem. As we will consider conjugate (adjoint) operators, we define the inner product as $\langle q,d\rangle=\int_{\mathcal{S}\times\mathcal{A}}q(s,a)d(s,a)\,d\nu(s,a)$.

Lemma 8.

$P_{\pi}$ is linear.

Proof.

Recall the definition of $P_{\pi}$:

$P_{\pi}d(s^{\prime},a^{\prime})=\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a).$

For any $d_{1},d_{2}\in L^{\infty}(\mathcal{S}\times\mathcal{A})$ and scalars $\alpha_{1},\alpha_{2}$,

\begin{aligned}
P_{\pi}(\alpha_{1}d_{1}+\alpha_{2}d_{2})(s^{\prime},a^{\prime})&=\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)(\alpha_{1}d_{1}+\alpha_{2}d_{2})(s,a)\,d\nu(s,a)\\
&=\int_{\mathcal{S}\times\mathcal{A}}\alpha_{1}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{1}(s,a)\,d\nu(s,a)+\int_{\mathcal{S}\times\mathcal{A}}\alpha_{2}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{2}(s,a)\,d\nu(s,a)\\
&=\alpha_{1}\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{1}(s,a)\,d\nu(s,a)+\alpha_{2}\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{2}(s,a)\,d\nu(s,a)\\
&=\alpha_{1}P_{\pi}d_{1}(s^{\prime},a^{\prime})+\alpha_{2}P_{\pi}d_{2}(s^{\prime},a^{\prime}).
\end{aligned}

This completes the proof. ∎

Lemma 9.

The adjoint operator of $P_{\pi}$ is

$P_{\pi}^{\star}q(s,a)=\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)\,d\nu(s^{\prime},a^{\prime}).$
Remark 7.

Intuitively, we can see $P_{\pi}d(s^{\prime},a^{\prime})$ as one step forward of $d$: we start from $(s,a)\sim d$, transition to $s^{\prime}\sim P(\cdot\mid s,a)$, and take $a^{\prime}\sim\pi(\cdot\mid s^{\prime})$. Also, we can view $P_{\pi}^{\star}q(s,a)$ as one step backward of $q$: we compute the value of $(s,a)$ through the one-step transferred state-action distribution with the help of $q$.

Proof.

Consider the inner products $\langle q,P_{\pi}d\rangle$ and $\langle P_{\pi}^{\star}q,d\rangle$; we need to show that they are equal. By definition,

\begin{aligned}
\langle q,P_{\pi}d\rangle&=\int_{\mathcal{S}\times\mathcal{A}}\Bigg[q(s^{\prime},a^{\prime})\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)\Bigg]d\nu(s^{\prime},a^{\prime})\\
&=\int_{\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)\,d\nu(s^{\prime},a^{\prime})
\end{aligned}

and

\begin{aligned}
\langle P_{\pi}^{\star}q,d\rangle&=\int_{\mathcal{S}\times\mathcal{A}}d(s,a)\Big[\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)\,d\nu(s^{\prime},a^{\prime})\Big]d\nu(s,a)\\
&=\int_{\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}\times\mathcal{A}}d(s,a)q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)\,d\nu(s^{\prime},a^{\prime})\,d\nu(s,a)\\
&=\int_{\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)\,d\nu(s^{\prime},a^{\prime}).\qquad\text{(Fubini's theorem)}
\end{aligned}

This completes the proof. ∎

Lemma.

$\lVert P_{\pi}^{\star}\rVert_{\infty}=\lVert P_{\pi}\rVert_{\infty}\leq 1.$

Remark 8.

This upper bound should be intuitive, since $P_{\pi}$ can be seen as a probability transition kernel from $\mathcal{S}\times\mathcal{A}$ to itself.

Proof.

Fix any $s\in\mathcal{S}$, $a\in\mathcal{A}$, and define $p(s^{\prime},a^{\prime})=P(s^{\prime}\mid s,a)\pi(a^{\prime}\mid s^{\prime})$. By Fubini's theorem, we have that

\begin{aligned}
\lVert p\rVert_{1,\nu}&=\int_{\mathcal{S}\times\mathcal{A}}\lvert p\rvert\,d\nu=\int_{\mathcal{S}\times\mathcal{A}}p\,d\nu\\
&=\int_{\mathcal{S}}\int_{\mathcal{A}}P(s^{\prime}\mid s,a)\pi(a^{\prime}\mid s^{\prime})\,d\nu(a^{\prime})\,d\nu(s^{\prime})\\
&=\int_{\mathcal{S}}P(s^{\prime}\mid s,a)\Bigg[\int_{\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\,d\nu(a^{\prime})\Bigg]d\nu(s^{\prime})\\
&=\int_{\mathcal{S}}P(s^{\prime}\mid s,a)\,d\nu(s^{\prime})\\
&=1.
\end{aligned}

For any function q on \mathcal{S}\times\mathcal{A} such that \lVert q\rVert_{\infty,\nu}\leq 1, Hölder's inequality yields

pq1,νq,νp1,ν1.\displaystyle\lVert pq\rVert_{1,\nu}\leq\lVert q\rVert_{\infty,\nu}\lVert p\rVert_{1,\nu}\leq 1.

Thus, for any s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}, and function qq with q,ν1\lVert q\rVert_{\infty,\nu}\leq 1,

Pπq(s,a)=𝒮×𝒜q(s,a)π(as)P(ss,a)𝑑ν(s,a)=pq1,ν1.\displaystyle P_{\pi}^{\star}q(s,a)=\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d\nu(s^{\prime},a^{\prime})=\lVert pq\rVert_{1,\nu}\leq 1.

So, we have that

Pπ=Pπ=maxq1Pπq,νmaxq1maxs𝒮,a𝒜Pπq(s,a)1.\displaystyle\lVert P_{\pi}\rVert_{\infty}=\lVert P_{\pi}^{\star}\rVert_{\infty}=\max\limits_{\lVert q\rVert_{\infty}\leq 1}\lVert P_{\pi}^{\star}q\rVert_{\infty,\nu}\leq\max\limits_{\lVert q\rVert_{\infty}\leq 1}\max\limits_{s\in\mathcal{S},a\in\mathcal{A}}P_{\pi}^{\star}q(s,a)\leq 1.

This completes the proof. ∎
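The remark above can also be checked numerically: in the tabular representation used in the previous sketch (again an illustration of ours, with arbitrary sizes), every column of the matrix for P_{\pi} sums to one, which is exactly why \lVert P_{\pi}^{\star}q\rVert_{\infty}\leq\lVert q\rVert_{\infty} for any bounded q:

import numpy as np

rng = np.random.default_rng(3)
nS, nA = 5, 2
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Rows indexed by (s', a'), columns by (s, a): entry = pi(a'|s') * P(s'|s, a).
P_pi = np.einsum("ij,kli->ijkl", pi, P).reshape(nS * nA, nS * nA)

# Each column sums to one, so P_pi^T is row-stochastic ...
assert np.allclose(P_pi.sum(axis=0), 1.0)

# ... hence applying the adjoint cannot increase the sup norm.
q = rng.uniform(-1.0, 1.0, nS * nA)
assert np.abs(P_pi.T @ q).max() <= np.abs(q).max() + 1e-12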

Lemma 10.

IγPπI-\gamma P_{\pi} is invertible and

(IγPπ)1=i=0(γPπ)i.\displaystyle(I-\gamma P_{\pi})^{-1}=\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}.
Proof.

Since \lVert P_{\pi}\rVert_{\infty}\leq 1 and \gamma<1, the series \sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i} converges. Multiplying,

(IγPπ)[i=0(γPπ)i]=\displaystyle(I-\gamma P_{\pi})[\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}]= i=0(γPπ)ii=1(γPπ)i\displaystyle\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}-\sum\limits_{i=1}^{\infty}(\gamma P_{\pi})^{i}
=\displaystyle= (γPπ)0\displaystyle(\gamma P_{\pi})^{0}
=\displaystyle= I.\displaystyle I.

This completes the proof. ∎

Proposition 11.

By definition, dd,π=(1γ)i=0(γPπ)id=(1γ)(IγPπ)1dd_{d,\pi}=(1-\gamma)\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}d=(1-\gamma)(I-\gamma P_{\pi})^{-1}d.
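Lemma 10 and Proposition 11 admit a quick numerical sanity check as well; the sketch below (our own, with an arbitrary discount factor and a random tabular MDP) compares the resolvent form (1-\gamma)(I-\gamma P_{\pi})^{-1}d with a truncated Neumann series and confirms that the resulting occupancy is again a probability distribution:

import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
P_pi = np.einsum("ij,kli->ijkl", pi, P).reshape(nS * nA, nS * nA)

d = rng.random(nS * nA); d /= d.sum()   # initial state-action distribution

# Occupancy via the resolvent (I - gamma * P_pi)^{-1} ...
occ_inverse = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, d)

# ... and via the truncated Neumann series (1 - gamma) * sum_i (gamma * P_pi)^i d.
occ_series, term = np.zeros(nS * nA), d.copy()
for _ in range(2000):
    occ_series += (1 - gamma) * term
    term = gamma * (P_pi @ term)

assert np.allclose(occ_inverse, occ_series, atol=1e-8)
assert np.isclose(occ_inverse.sum(), 1.0)   # d_{d, pi} is again a distribution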

B.2 Other useful lemmas

Lemma (Performance difference lemma).

We can decompose the performance gap as

(1γ)(Jπ1Jπ2)=μπ1,Qπ2(,π1)Qπ2(,π2).\displaystyle(1-\gamma)(J_{\pi_{1}}-J_{\pi_{2}})=\langle\mu_{\pi_{1}},Q_{\pi_{2}}(\cdot,\pi_{1})-Q_{\pi_{2}}(\cdot,\pi_{2})\rangle.
Proof.

By definition,

\displaystyle\langle\mu_{\pi_{1}},Q_{\pi_{2}}(\cdot,\pi_{1})-Q_{\pi_{2}}(\cdot,\pi_{2})\rangle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})+\gamma\mathbb{E}_{a\sim\pi_{1}(\cdot\mid s),s^{\prime}\sim P(\cdot\mid s,a)}[Q_{\pi_{2}}(s^{\prime},\pi_{2})]-Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]+\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[\gamma\mathbb{E}_{a\sim\pi_{1}(\cdot\mid s),s^{\prime}\sim P(\cdot\mid s,a)}[Q_{\pi_{2}}(s^{\prime},\pi_{2})]\big]-\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]+\gamma\mathbb{E}_{s\sim\mu_{\pi_{1}},a\sim\pi_{1}(\cdot\mid s),s^{\prime}\sim P(\cdot\mid s,a)}[Q_{\pi_{2}}(s^{\prime},\pi_{2})]-\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]+\mathbb{E}_{s\sim\mu_{\pi_{1}}}[Q_{\pi_{2}}(s,\pi_{2})]-(1-\gamma)\mathbb{E}_{s\sim\mu_{0}}[Q_{\pi_{2}}(s,\pi_{2})]-\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]-(1-\gamma)\mathbb{E}_{s\sim\mu_{0}}[Q_{\pi_{2}}(s,\pi_{2})]
\displaystyle=(1-\gamma)(J_{\pi_{1}}-J_{\pi_{2}}).

The first equality comes from the Bellman equation, and the fourth equality comes from the definition of \mu_{\pi_{1}}. This completes the proof. ∎
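The performance difference lemma is easy to verify on a small tabular MDP. The sketch below (our own illustration; all quantities are randomly generated and the helper names are ours) computes Q_{\pi_{2}}, the normalized discounted state occupancy \mu_{\pi_{1}}, and both sides of the identity:

import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
mu0 = rng.random(nS); mu0 /= mu0.sum()

def random_policy(seed):
    g = np.random.default_rng(seed)
    pi = g.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def transition_matrix(pi):
    # rows (s', a'), columns (s, a): pi(a'|s') * P(s'|s, a)
    return np.einsum("ij,kli->ijkl", pi, P).reshape(nS * nA, nS * nA)

def q_values(pi):
    # Q_pi solves Q = R + gamma * P_pi^T Q (policy Bellman equation).
    A = np.eye(nS * nA) - gamma * transition_matrix(pi).T
    return np.linalg.solve(A, R.reshape(-1)).reshape(nS, nA)

def occupancy(pi):
    # normalized discounted state-action occupancy starting from mu0
    d0 = (mu0[:, None] * pi).reshape(-1)
    d = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * transition_matrix(pi), d0)
    return d.reshape(nS, nA)

pi1, pi2 = random_policy(10), random_policy(11)
Q2 = q_values(pi2)
mu_pi1 = occupancy(pi1).sum(axis=1)    # state marginal of d_{pi1}

advantage = mu_pi1 @ ((pi1 * Q2).sum(axis=1) - (pi2 * Q2).sum(axis=1))
J1 = mu0 @ (pi1 * q_values(pi1)).sum(axis=1)
J2 = mu0 @ (pi2 * Q2).sum(axis=1)
assert np.isclose((1 - gamma) * (J1 - J2), advantage)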

Appendix C Detailed proofs for Section 3

C.1 Proof of Lemma 3

Lemma (From advantage to optimality, restatement of Lemma 3).

If \langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star})\rangle\geq-\varepsilon, and Assumption 1 holds with \varepsilon_{c}\geq\frac{C_{c}\varepsilon}{1-\gamma}, then \hat{\pi} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal.

Proof.

We begin by using induction to prove that \hat{\pi}_{i} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal for every i\in\mathbb{N}:

  • We first show that π^0\hat{\pi}_{0} is Ccε1γ\frac{C_{c}\varepsilon}{1-\gamma} near-optimal. From Assumption 1, we can use any π~Πεc,non\tilde{\pi}\in\Pi^{\star}_{\varepsilon_{c},\textup{non}} to conclude that

    μ0μcμπ~/(1γ)μcCc1γ.\displaystyle\Big{\lVert}\frac{\mu_{0}}{\mu_{c}}\Big{\rVert}_{\infty}\leq\Big{\lVert}\frac{\mu_{\tilde{\pi}}/(1-\gamma)}{\mu_{c}}\Big{\rVert}_{\infty}\leq\frac{C_{c}}{1-\gamma}.

    Thus, we can show the near-optimality of \hat{\pi}_{0} via the advantage:

    μπ^0,Q(,π^0)Q(,πe)=\displaystyle\langle\mu_{\hat{\pi}_{0}},Q^{\star}(\cdot,\hat{\pi}_{0})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle= μπ^00:0,Q(,π^)Q(,πe)+μπ^01:,Q(,πe)Q(,πe)\displaystyle\langle\mu^{0:0}_{\hat{\pi}_{0}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle+\langle\mu^{1:\infty}_{\hat{\pi}_{0}},Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^00:0,Q(,π^)Q(,πe)\displaystyle\langle\mu^{0:0}_{\hat{\pi}^{\star}_{0}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= (1γ)μ0,Q(,π^)Q(,πe)\displaystyle(1-\gamma)\langle\mu_{0},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq Ccμc,Q(,π^)Q(,πe)\displaystyle C_{c}\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle (Q(,π^)Q(,πe)Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e}) is non-positive)
    \displaystyle\geq Ccε.\displaystyle-C_{c}\varepsilon.

    By performance difference lemma,

    (1γ)(Jπ^0J)=\displaystyle(1-\gamma)(J_{\hat{\pi}_{0}}-J^{\star})= μπ^0,Q(,π^0)Q(,πe)\displaystyle\langle\mu_{\hat{\pi}_{0}},Q^{\star}(\cdot,\hat{\pi}_{0})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq Ccε.\displaystyle-C_{c}\varepsilon.
  • Next, we show that if \hat{\pi}_{i} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal, then \hat{\pi}_{i+1} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal. Since \hat{\pi}_{i} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal, the distribution shift from \mu_{c} to \mu_{\hat{\pi}_{i}} is bounded, which means

    μπ^0:i+1μc=μπ^i0:i+1μcμπ^iμcCc.\displaystyle\Big{\lVert}\frac{\mu_{\hat{\pi}}^{0:i+1}}{\mu_{c}}\Big{\rVert}_{\infty}=\Big{\lVert}\frac{\mu_{\hat{\pi}_{i}}^{0:i+1}}{\mu_{c}}\Big{\rVert}_{\infty}\leq\Big{\lVert}\frac{\mu_{\hat{\pi}_{i}}}{\mu_{c}}\Big{\rVert}_{\infty}\leq C_{c}.

    Then, we have

    μπ^i+1,Q(,π^i+1)Q(,πe)\displaystyle\langle\mu_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi}_{i+1})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^i+10:i+1,Q(,π^)Q(,πe)+μπ^i+1i+2:,Q(,πe)Q(,πe)\displaystyle\langle\mu^{0:i+1}_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle+\langle\mu^{i+2:\infty}_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^i+10:i+1,Q(,π^)Q(,πe)\displaystyle\langle\mu^{0:i+1}_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^0:i+1,Q(,π^)Q(,πe)\displaystyle\langle\mu^{0:i+1}_{\hat{\pi}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq Ccμc,Q(,π^)Q(,πe)\displaystyle C_{c}\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle (Q(,π^)Q(,πe)Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e}) is non-positive)
    \displaystyle\geq Ccε.\displaystyle-C_{c}\varepsilon.

    By performance difference lemma,

    \displaystyle(1-\gamma)(J_{\hat{\pi}_{i+1}}-J^{\star})=\langle\mu_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi}_{i+1})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq-C_{c}\varepsilon.

    Therefore, π^i+1\hat{\pi}_{i+1} is Ccε1γ\frac{C_{c}\varepsilon}{1-\gamma} near-optimal.

Thus, for any \epsilon>0, there exists a natural number i\geq\log_{\gamma}\frac{\epsilon}{V_{\max}} such that

JJπ^JJπ^0:iJ(Jπ^iγi+1Vmax)Ccε1γ+γi+1VmaxCcε1γ+ϵ,\displaystyle J^{\star}-J_{\hat{\pi}}\leq J^{\star}-J_{\hat{\pi}}^{0:i}\leq J^{\star}-(J_{\hat{\pi}_{i}}-\gamma^{i+1}V_{\textup{max}})\leq\frac{C_{c}\varepsilon}{1-\gamma}+\gamma^{i+1}V_{\textup{max}}\leq\frac{C_{c}\varepsilon}{1-\gamma}+\epsilon,

where J_{\pi}^{i:j} denotes the part of the return contributed by steps i through j. Therefore, \hat{\pi} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal. ∎

Appendix D Detailed proofs for Section 4

D.1 Proof of Lemma 5

Lemma (Restatement of Lemma 5).

If dcd_{c} is a linear combination of the state-action distributions induced by ε\varepsilon near-optimal non-stationary policies Πε,non\Pi_{\varepsilon,\textup{non}}^{\star} under a fixed probability measure λ\lambda:

dc=Πε,nondπ~𝑑λ(π~).\displaystyle d_{c}=\int_{\Pi_{\varepsilon,\textup{non}}^{\star}}d_{\tilde{\pi}}d\lambda(\tilde{\pi}). (13)

and d^{\mathcal{D}} covers all admissible distributions of \Pi_{\varepsilon,\textup{non}}^{\star}:

π~Πε,non,i,dπ~,id𝒟C.\displaystyle\forall\ \tilde{\pi}\in\Pi^{\star}_{\varepsilon,\textup{non}},\ i\in\mathbb{N},\ \Big{\lVert}\frac{d_{\tilde{\pi},i}}{d^{\mathcal{D}}}\Big{\rVert}_{\infty}\leq C.

then the distribution shift from d^{\mathcal{D}} is bounded as

ddc,πed𝒟C.\displaystyle\Big{\lVert}\frac{d_{d_{c},\pi^{\star}_{e}}}{d^{\mathcal{D}}}\Big{\rVert}_{\infty}\leq C.
Proof.

Define the state-action distribution of policy π\pi from s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A} at step ii as

\displaystyle d_{s,a,\pi,i}(s^{\prime},a^{\prime})=P\big(s_{i}=s^{\prime},\,a_{i}=a^{\prime}\;\big|\;s_{0}=s,\ a_{0}=a,\ s_{j}\sim P(\cdot\mid s_{j-1},a_{j-1}),\ a_{j}\sim\pi(\cdot\mid s_{j})\ \text{for all}\ j\geq 1\big).

Also, define the global version of it as

ds,a,π(s,a)=(1γ)i=0ds,a,π,i(s,a).\displaystyle d_{s,a,\pi}(s^{\prime},a^{\prime})=(1-\gamma)\sum\limits_{i=0}^{\infty}d_{s,a,\pi,i}(s^{\prime},a^{\prime}).

We can rewrite ddc,πe(s,a)d_{d_{c},\pi^{\star}_{e}}(s,a) as

ddc,πe(s,a)=\displaystyle d_{d_{c},\pi^{\star}_{e}}(s,a)= 𝒮×𝒜ds1,a1,πe(s,a)dc(s1,a1)𝑑ν(s1,a1)\displaystyle\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{c}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= 𝒮×𝒜ds1,a1,πe(s,a)[Πdπ~(s1,a1)𝑑λ(π~)]𝑑ν(s1,a1)\displaystyle\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)\Big{[}\int_{\Pi}d_{\tilde{\pi}}(s_{1},a_{1})d\lambda(\tilde{\pi})\Big{]}d\nu(s_{1},a_{1})
=\displaystyle= Π[𝒮×𝒜ds1,a1,πe(s,a)dπ~(s1,a1)𝑑ν(s1,a1)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi}}(s_{1},a_{1})d\nu(s_{1},a_{1})\Big{]}d\lambda(\tilde{\pi}) (Fubini’s Theorem)
=\displaystyle= Π[𝒮×𝒜(1γ)i=0[γids1,a1,πe(s,a)dπ~,i(s1,a1)]dν(s1,a1)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}\int_{\mathcal{S}\times\mathcal{A}}(1-\gamma)\sum\limits_{i=0}^{\infty}\big{[}\gamma^{i}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})\big{]}d\nu(s_{1},a_{1})\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0[γi𝒮×𝒜ds1,a1,πe(s,a)dπ~,i(s1,a1)𝑑ν(s1,a1)]]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}\big{[}\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})\big{]}\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0dπ~ii:(s,a)]𝑑λ(π~).\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}d_{\tilde{\pi}_{i}}^{i:\infty}(s,a)\Big{]}d\lambda(\tilde{\pi}).

The last equation comes from that

γi𝒮×𝒜ds1,a1,πe(s,a)dπ~,i(s1,a1)𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= γi𝒮×𝒜ds1,a1,πe(s,a)[𝒮[𝒜ds2,a2,π~,i(s1,a1)π~(a2s2)𝑑ν(a2)]μ0(s2)𝑑ν(s2)]𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)\Big{[}\int_{\mathcal{S}}\Big{[}\int_{\mathcal{A}}d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})\tilde{\pi}(a_{2}\mid s_{2})d\nu(a_{2})\Big{]}\mu_{0}(s_{2})d\nu(s_{2})\Big{]}d\nu(s_{1},a_{1})
=\displaystyle= 𝒮[𝒜[γi𝒮×𝒜ds1,a1,πe(s,a)ds2,a2,π~,i(s1,a1)𝑑ν(s1,a1)]π~(a2s2)𝑑ν(a2)]μ0(s2)𝑑ν(s2),\displaystyle\int_{\mathcal{S}}\Big{[}\int_{\mathcal{A}}\Big{[}\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})\Big{]}\tilde{\pi}(a_{2}\mid s_{2})d\nu(a_{2})\Big{]}\mu_{0}(s_{2})d\nu(s_{2}), (Fubini’s Theorem)

since

γi𝒮×𝒜ds1,a1,πe(s,a)ds2,a2,π~,i(s1,a1)𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= γi𝒮×𝒜(1γ)k=0[γkds1,a1,πe,k(s,a)]ds2,a2,π~,i(s1,a1)dν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}(1-\gamma)\sum\limits_{k=0}^{\infty}\big{[}\gamma^{k}d_{s_{1},a_{1},\pi^{\star}_{e},k}(s,a)\big{]}d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= (1γ)k=0[γk+i𝒮×𝒜ds1,a1,πe,k(s,a)ds2,a2,π~,i(s1,a1)𝑑ν(s1,a1)]\displaystyle(1-\gamma)\sum\limits_{k=0}^{\infty}\Big{[}\gamma^{k+i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e},k}(s,a)d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})\Big{]}
=\displaystyle= (1γ)k=0[γk+ids2,a2,π~i,k+i(s,a)]\displaystyle(1-\gamma)\sum\limits_{k=0}^{\infty}\big{[}\gamma^{k+i}d_{s_{2},a_{2},\tilde{\pi}_{i},k+i}(s,a)\big{]}
=\displaystyle= (1γ)k=i[γkds2,a2,π~i,k(s,a)]\displaystyle(1-\gamma)\sum\limits_{k=i}^{\infty}\big{[}\gamma^{k}d_{s_{2},a_{2},\tilde{\pi}_{i},k}(s,a)\big{]}
=\displaystyle= ds2,a2,π~ii:(s,a),\displaystyle d_{s_{2},a_{2},\tilde{\pi}_{i}}^{i:\infty}(s,a),

we get

γi𝒮×𝒜ds1,a1,πe(s,a)dπ~,i(s1,a1)𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= 𝒮[𝒜[ds2,a2,π~ii:(s,a)]π~(a2s2)𝑑ν(a2)]μ0(s2)𝑑ν(s2)\displaystyle\int_{\mathcal{S}}\Big{[}\int_{\mathcal{A}}\Big{[}d_{s_{2},a_{2},\tilde{\pi}_{i}}^{i:\infty}(s,a)\Big{]}\tilde{\pi}(a_{2}\mid s_{2})d\nu(a_{2})\Big{]}\mu_{0}(s_{2})d\nu(s_{2})
=\displaystyle= dπ~ii:(s,a).\displaystyle d_{\tilde{\pi}_{i}}^{i:\infty}(s,a).

Finally, s𝒮,a𝒜\forall s\in\mathcal{S},a\in\mathcal{A},

ddc,πe(s,a)d𝒟(s,a)=\displaystyle\frac{d_{d_{c},\pi^{\star}_{e}}(s,a)}{d^{\mathcal{D}}(s,a)}= Π[(1γ)i=0dπ~ii:(s,a)d𝒟(s,a)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}\frac{d_{\tilde{\pi}_{i}}^{i:\infty}(s,a)}{d^{\mathcal{D}}(s,a)}\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0(1γ)j=iγjdπ~i,j(s,a)d𝒟(s,a)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}\frac{(1-\gamma)\sum_{j=i}^{\infty}\gamma^{j}d_{\tilde{\pi}_{i},j}(s,a)}{d^{\mathcal{D}}(s,a)}\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0(1γ)j=iγjdπ~i,j(s,a)d𝒟(s,a)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}(1-\gamma)\sum_{j=i}^{\infty}\gamma^{j}\frac{d_{\tilde{\pi}_{i},j}(s,a)}{d^{\mathcal{D}}(s,a)}\Big{]}d\lambda(\tilde{\pi})
\displaystyle\leq Π[C(1γ)2i=0j=iγj]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}C(1-\gamma)^{2}\sum\limits_{i=0}^{\infty}\sum\limits_{j=i}^{\infty}\gamma^{j}\Big{]}d\lambda(\tilde{\pi}) (π~Πε,non\tilde{\pi}\in\Pi_{\varepsilon,\textup{non}}^{\star} indicates π~iΠε,non\tilde{\pi}_{i}\in\Pi_{\varepsilon,\textup{non}}^{\star})
\displaystyle\leq Π[C(1γ)2i=0γi1γ]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}C(1-\gamma)^{2}\sum\limits_{i=0}^{\infty}\frac{\gamma^{i}}{1-\gamma}\Big{]}d\lambda(\tilde{\pi})
\displaystyle\leq ΠC𝑑λ(π~)\displaystyle\int_{\Pi}Cd\lambda(\tilde{\pi})
=\displaystyle= C.\displaystyle C.

This completes the proof. ∎

D.2 Proof of Lemma 6

Note that the lemmas and proofs in this subsection are mainly adapted from Uehara et al. [2023], and similar statements can be found in the original paper. However, since we use d_{c} in place of d^{\mathcal{D}}, we present them here for clarity and to make our paper self-contained. We refer interested readers to the original paper for further details.

We first define the expected version of Eq. 10 as

(d,q,w)\displaystyle\mathcal{L}(d,q,w)\coloneqq 0.5𝔼d[q2(s,a)]+𝔼(s,a)dw𝒟,r=R(s,a),sP(s,a)[γmaxq(s,)+rq(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}_{w},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼𝒟w[γmaxq(s,)+rq(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}_{w}}\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}

where d^{\mathcal{D}}_{w}=d^{\mathcal{D}}\circ w, and \mathbb{E}_{\mathcal{D}_{w}} denotes the expectation with respect to the reweighted data-collecting process.
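For concreteness, here is a minimal tabular sketch of how the empirical counterpart \hat{\mathcal{L}}(d,q,w) of this objective could be computed from a logged dataset; the array layout and the function name are assumptions of ours, and Eq. 10 in the main text remains the authoritative definition:

import numpy as np

def empirical_loss(q, w, d, dataset, gamma):
    # q, w: arrays of shape (nS, nA); d: a distribution over (s, a) of shape (nS, nA);
    # dataset: iterable of (s, a, r, s_next) tuples with integer state/action indices.
    first_term = 0.5 * np.sum(d * q ** 2)
    td_terms = [w[s, a] * (gamma * q[s_next].max() + r - q[s, a])
                for (s, a, r, s_next) in dataset]
    return first_term + np.mean(td_terms)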

Lemma 12 (Expectation).

The expected value of ^(d,q,w)\hat{\mathcal{L}}(d,q,w) w.r.t. the data collecting process is (d,q,w)\mathcal{L}(d,q,w):

𝔼𝒟[^(d,q,w)]=(d,q,w).\displaystyle\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]=\mathcal{L}(d,q,w).
Proof.

Since only the second term of \hat{\mathcal{L}} is random, we additionally define

\displaystyle\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\coloneqq\frac{1}{N_{\mathcal{D}}}\sum\limits_{(s,a,r,s^{\prime})\in\mathcal{D}}w(s,a)\big[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big].

We can rearrange the expectation as follows,

𝔼𝒟[^(d,q,w)]=\displaystyle\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]= 𝔼𝒟[0.5𝔼d[q2(s,a)]+^𝒲(q,w)]\displaystyle\mathbb{E}_{\mathcal{D}}\Big{[}0.5\mathbb{E}_{d}[q^{2}(s,a)]+\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{]} (14)
=\displaystyle= 𝔼𝒟[0.5𝔼d[q2(s,a)]]+𝔼𝒟[^𝒲(q,w)]\displaystyle\mathbb{E}_{\mathcal{D}}\Big{[}0.5\mathbb{E}_{d}[q^{2}(s,a)]\Big{]}+\mathbb{E}_{\mathcal{D}}\Big{[}\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{]} (15)
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼𝒟[^𝒲(q,w)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}}\Big{[}\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{]} (16)

Then, by the i.i.d. assumption on the samples and the linearity of expectation,

𝔼𝒟[^(d,q,w)]=\displaystyle\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]= 0.5𝔼d[q2(s,a)]+𝔼𝒟[1N𝒟(s,a,r,s)𝒟[w(s,a)[γmaxq(s,)+rq(s,a)]]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}}\Bigg{[}\frac{1}{N_{\mathcal{D}}}\sum\limits_{(s,a,r,s^{\prime})\in\mathcal{D}}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}\Bigg{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+1N𝒟(s,a,r,s)𝒟𝔼𝒟[w(s,a)[γmaxq(s,)+rq(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\frac{1}{N_{\mathcal{D}}}\sum\limits_{(s,a,r,s^{\prime})\in\mathcal{D}}\mathbb{E}_{\mathcal{D}}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼𝒟[w(s,a)[γmaxq(s,)+rq(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)d𝒟,r=R(s,a),sP(s,a)[w(s,a)[γmaxq(s,)+rq(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)d𝒟[w(s,a)[𝔼sP(s,a)[γmaxq(s,)]+R(s,a)q(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\Big{[}w(s,a)\big{[}\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\gamma\max q(s^{\prime},\cdot)]+R(s,a)-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)dw𝒟[𝔼sP(s,a)[γmaxq(s,)]+R(s,a)q(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}_{w}}\Big{[}\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\gamma\max q(s^{\prime},\cdot)]+R(s,a)-q(s,a)\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)dw𝒟,r=R(s,a),sP(s,a)[γmaxq(s,)+rq(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d_{w}^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}
=\displaystyle= (d,q,w).\displaystyle\mathcal{L}(d,q,w).

This completes the proof. ∎

Lemma 13 (Concentration).

For any fixed dd, with probability at least 1δ1-\delta, for any q𝒬q\in\mathcal{Q}, w𝒲w\in\mathcal{W},

|(d,q,w)^(d,q,w)|εstat.\displaystyle\Big{\lvert}\mathcal{L}(d,q,w)-\hat{\mathcal{L}}(d,q,w)\Big{\rvert}\leq\varepsilon_{\textup{stat}}.
Proof.

The statistical error only comes from ^𝒲\hat{\mathcal{L}}_{\mathcal{W}}, as

|(d,q,w)^(d,q,w)|=\displaystyle\Big{\lvert}\mathcal{L}(d,q,w)-\hat{\mathcal{L}}(d,q,w)\Big{\rvert}= |𝔼𝒟[^(d,q,w)]^(d,q,w)|\displaystyle\Big{\lvert}\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]-\hat{\mathcal{L}}(d,q,w)\Big{\rvert} (Lemma 12)
=\displaystyle= |𝔼𝒟[^𝒲(q,w)]^𝒲(q,w)|.\displaystyle\Big{\lvert}\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}_{\mathcal{W}}(q,w)]-\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{\rvert}. (Eq. 16)

Since each summand of \hat{\mathcal{L}}_{\mathcal{W}} is bounded:

q𝒬,w𝒲,a𝒜,s𝒮,|w(s,a)[γmaxq(s,)+rq(s,a)]|U𝒲Vmax,\displaystyle\forall q\in\mathcal{Q},w\in\mathcal{W},a\in\mathcal{A},s^{\prime}\in\mathcal{S},\quad\Big{\lvert}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{\rvert}\leq U_{\mathcal{W}}V_{\textup{max}},

we can apply Hoeffding’s inequality which yields that, for any q𝒬q\in\mathcal{Q}, w𝒲w\in\mathcal{W}, with probability at least 1δ/(|𝒬||𝒲|)1-\delta/(\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert),

|𝔼𝒟[^𝒲(q,w)]^𝒲(q,w)|U𝒲Vmax2log(2|𝒬||𝒲|/δ)N𝒟.\displaystyle\Big{\lvert}\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}_{\mathcal{W}}(q,w)]-\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{\rvert}\leq U_{\mathcal{W}}V_{\textup{max}}\sqrt{\frac{2\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{N_{\mathcal{D}}}}.

Finally, applying a union bound over \mathcal{Q}\times\mathcal{W} and rearranging terms, we get that, for any fixed d, with probability at least 1-\delta, for any q\in\mathcal{Q} and w\in\mathcal{W},

|(d,q,w)^(d,q,w)|U𝒲Vmax2log(2|𝒬||𝒲|/δ)N𝒟=εstat\displaystyle\Big{\lvert}\mathcal{L}(d,q,w)-\hat{\mathcal{L}}(d,q,w)\Big{\rvert}\leq U_{\mathcal{W}}V_{\textup{max}}\sqrt{\frac{2\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{N_{\mathcal{D}}}}=\varepsilon_{\textup{stat}}

This completes the proof. ∎
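To get a feel for this rate, one can plug illustrative values (our own choices, not values from the paper) into the expression for \varepsilon_{\textup{stat}}:

import numpy as np

U_W, V_max = 10.0, 1.0 / (1.0 - 0.9)            # example scales only
Q_size, W_size, delta, N_D = 1e6, 1e6, 0.05, 1_000_000
eps_stat = U_W * V_max * np.sqrt(2 * np.log(2 * Q_size * W_size / delta) / N_D)
print(eps_stat)   # shrinks at the usual 1/sqrt(N_D) rate, up to log|Q||W| factors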

Lemma 14.

If w is non-negative \nu-a.e. (e.g., w\in\mathcal{W}), then for any q\colon\mathcal{S}\times\mathcal{A}\to[0,V_{\textup{max}}],

(d,q,w)(d,Q,w)0.5d,q2(Q)2+(γPπeI)dw𝒟,qQ.\displaystyle\mathcal{L}(d,q,w)-\mathcal{L}(d,Q^{\star},w)\geq 0.5\langle d,q^{2}-(Q^{\star})^{2}\rangle+\langle(\gamma P_{\pi^{\star}_{e}}-I)d^{\mathcal{D}}_{w},q-Q^{\star}\rangle. (17)
Proof.

This result simply comes from the definition:

\displaystyle\mathcal{L}(d,q,w)-\mathcal{L}(d,Q^{\star},w)
\displaystyle=0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)]-\mathbb{E}_{\mathcal{D}_{w}}[\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)]
\displaystyle=0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)]-\mathbb{E}_{\mathcal{D}_{w}}[\gamma Q^{\star}(s^{\prime},\pi^{\star}_{e})+r-Q^{\star}(s,a)]
\displaystyle\geq 0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma q(s^{\prime},\pi^{\star}_{e})+r-q(s,a)]-\mathbb{E}_{\mathcal{D}_{w}}[\gamma Q^{\star}(s^{\prime},\pi^{\star}_{e})+r-Q^{\star}(s,a)]
\displaystyle=0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma(q-Q^{\star})(s^{\prime},\pi^{\star}_{e})-(q-Q^{\star})(s,a)]
\displaystyle=0.5\langle d,q^{2}-(Q^{\star})^{2}\rangle+\langle d^{\mathcal{D}}_{w},(\gamma P^{\star}_{\pi^{\star}_{e}}-I)(q-Q^{\star})\rangle (rewrite the expectations as inner products)
\displaystyle=0.5\langle d,q^{2}-(Q^{\star})^{2}\rangle+\langle(\gamma P_{\pi^{\star}_{e}}-I)d^{\mathcal{D}}_{w},q-Q^{\star}\rangle. (adjoint of P_{\pi^{\star}_{e}})

This completes the proof. ∎

Lemma 15.

If Assumption 5 holds, with probability at least 1δ1-\delta, for any w𝒲w\in\mathcal{W} and any state-action distribution dd, we have

(d,q^,w)(d,Q,w)2εstat.\displaystyle\mathcal{L}(d,\hat{q},w)-\mathcal{L}(d,Q^{\star},w)\leq 2\varepsilon_{\textup{stat}}. (18)
Proof.

We can decompose Eq. 18 as follows,

(d,q^,w)(d,Q,w)=\displaystyle\mathcal{L}(d,\hat{q},w)-\mathcal{L}(d,Q^{\star},w)= (d,q^,w)^(d,q^,w)(1)+^(d,q^,w)^(d,q^,w^)(2)\displaystyle\underbrace{\mathcal{L}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},w)}_{(1)}+\underbrace{\hat{\mathcal{L}}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},\hat{w})}_{(2)}
+^(d,q^,w^)^(d,Q,w^(Q))(3)+^(d,Q,w^(Q))(d,Q,w^(Q))(4)\displaystyle+\underbrace{\hat{\mathcal{L}}(d,\hat{q},\hat{w})-\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))}_{(3)}+\underbrace{\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))}_{(4)}
+(d,Q,w^(Q))(d,Q,w)(5)\displaystyle+\underbrace{\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},w)}_{(5)}

where w^(q)=argmaxw𝒲^(d,q,w)\hat{w}(q)=\operatorname*{argmax}_{w\in\mathcal{W}}\hat{\mathcal{L}}(d,q,w). For the terms above, we have that:

  • (2) and (3) are non-positive due to the optimization procedure (\hat{w}=\hat{w}(\hat{q}) maximizes \hat{\mathcal{L}}(d,\hat{q},\cdot) over \mathcal{W}, and \hat{q} minimizes the resulting objective over \mathcal{Q}).

  • (1) and (4) can be bounded by concentration (Lemma 13).

  • For (5), since the Bellman optimality equation holds,

    s𝒮,a𝒜,𝔼sP(s,a)[γmaxQ(s,)]+R(s,a)Q(s,a)=0.\displaystyle\forall s\in\mathcal{S},a\in\mathcal{A},\quad\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)\big{]}+R(s,a)-Q^{\star}(s,a)=0.

    We have that

    (5)=\displaystyle(5)= (d,Q,w^(Q))(d,Q,w)\displaystyle\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},w)
    \displaystyle=0.5\mathbb{E}_{d}[(Q^{\star})^{2}(s,a)]+\mathbb{E}_{\mathcal{D}_{\hat{w}(Q^{\star})}}\big[\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big]
    [0.5𝔼d[(Q)2(s,a)]+𝔼𝒟w[γmaxQ(s,)+rQ(s,a)]]\displaystyle-\Big{[}0.5\mathbb{E}_{d}[(Q^{\star})^{2}(s,a)]+\mathbb{E}_{\mathcal{D}_{w}}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big{]}\Big{]}
    =\displaystyle= 𝔼(s,a)dw^(Q)𝒟,r=R(s,a),sP(s,a)[γmaxQ(s,)+rQ(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim d_{\hat{w}(Q^{\star})}^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big{]}
    [𝔼(s,a)dw𝒟,r=R(s,a),sP(s,a)[γmaxQ(s,)+rQ(s,a)]]\displaystyle-\Big{[}\mathbb{E}_{(s,a)\sim d_{w}^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big{]}\Big{]}
    =\displaystyle= 𝔼(s,a)dw^(Q)𝒟[γ𝔼sP(,s,a)[maxQ(s,)]+R(s,a)Q(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim d_{\hat{w}(Q^{\star})}^{\mathcal{D}}}\Big{[}\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot,s,a)}[\max Q^{\star}(s^{\prime},\cdot)]+R(s,a)-Q^{\star}(s,a)\Big{]}
    𝔼(s,a)dw𝒟[γ𝔼sP(,s,a)[maxQ(s,)]+R(s,a)Q(s,a)]\displaystyle-\mathbb{E}_{(s,a)\sim d_{w}^{\mathcal{D}}}\Big{[}\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot,s,a)}[\max Q^{\star}(s^{\prime},\cdot)]+R(s,a)-Q^{\star}(s,a)\Big{]}
    =\displaystyle= 0.\displaystyle 0.

Thus, we conclude that with probability at least 1δ1-\delta,

\displaystyle\mathcal{L}(d,\hat{q},w)-\mathcal{L}(d,Q^{\star},w)\leq\underbrace{\mathcal{L}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},w)}_{(1)}+\underbrace{\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))}_{(4)}
\displaystyle\leq\big\lvert\mathcal{L}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},w)\big\rvert+\big\lvert\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))\big\rvert
\displaystyle\leq 2εstat.\displaystyle 2\varepsilon_{\textup{stat}}. (Lemma 13)

This completes the proof. ∎

With the lemmas above, we are now ready to prove Lemma 6.

Lemma (L^{2} error of \hat{q} under d_{c}, restatement of Lemma 6).

If Assumptions 2, 5, 3, 8 and 6 hold, with probability at least 1δ1-\delta, the estimated q^\hat{q} from Algorithm 1 satisfies

q^Qdc,22εstat.\displaystyle\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}.
Proof.

By Assumption 3, dw𝒟=(IγPπ)1dcQd^{\mathcal{D}}_{w^{\star}}=(I-\gamma P_{\pi^{\star}})^{-1}d_{c}Q^{\star}, and from Lemma 14 we have

(dc,q^,w)(dc,Q,w)\displaystyle\mathcal{L}(d_{c},\hat{q},w^{\star})-\mathcal{L}(d_{c},Q^{\star},w^{\star})\geq 0.5dc,q^2(Q)2(IγPπ)(IγPπ)1dcQ,(q^Q)\displaystyle 0.5\langle d_{c},\hat{q}^{2}-(Q^{\star})^{2}\rangle-\langle(I-\gamma P_{\pi^{\star}})(I-\gamma P_{\pi^{\star}})^{-1}d_{c}Q^{\star},(\hat{q}-Q^{\star})\rangle
=\displaystyle= 0.5dc,q^2(Q)2dcQ,(q^Q)\displaystyle 0.5\langle d_{c},\hat{q}^{2}-(Q^{\star})^{2}\rangle-\langle d_{c}Q^{\star},(\hat{q}-Q^{\star})\rangle
=\displaystyle= 0.5dc,q^2(Q)2dc,Q(q^Q)\displaystyle 0.5\langle d_{c},\hat{q}^{2}-(Q^{\star})^{2}\rangle-\langle d_{c},Q^{\star}(\hat{q}-Q^{\star})\rangle
=\displaystyle= 0.5dc,(q^Q)2\displaystyle 0.5\langle d_{c},(\hat{q}-Q^{\star})^{2}\rangle
=\displaystyle= 0.5q^Qdc,22.\displaystyle 0.5\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}^{2}.

Together with Lemma 15, with probability at least 1δ1-\delta,

0.5q^Qdc,22(dc,q^,w)(dc,Q,w)2εstat.\displaystyle 0.5\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}^{2}\leq\mathcal{L}(d_{c},\hat{q},w^{\star})-\mathcal{L}(d_{c},Q^{\star},w^{\star})\leq 2\varepsilon_{\textup{stat}}.

Rearranging, we get

q^Qdc,22εstat\displaystyle\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}

This completes the proof. ∎

D.3 Proof of Lemma 7

Lemma (Restatement of Lemma 7).

If Assumptions 4 and 7 hold,

Q(,πe)Q(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2Uq^Qdc,1.\displaystyle 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}.
Proof.

We can rearrange the above term as

Q(,πe)Q(,π^),μc=\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle= Q(,πe)q^(,πe),μc+q^(,πe)q^(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\pi^{\star}_{e}),\mu_{c}\rangle+\langle\hat{q}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\hat{\pi}),\mu_{c}\rangle
+q^(,π^)Q(,π^),μc\displaystyle+\langle\hat{q}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle
\displaystyle\leq Q(,πe)q^(,πe),μc+q^(,π^)Q(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\pi^{\star}_{e}),\mu_{c}\rangle+\langle\hat{q}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle (Assumption 4)
\displaystyle\leq Q(,πe)q^(,πe)μc,1+q^(,π^)Q(,π^)μc,1\displaystyle\lVert Q^{\star}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\pi^{\star}_{e})\rVert_{\mu_{c},1}+\lVert\hat{q}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\hat{\pi})\rVert_{\mu_{c},1}
=\displaystyle= Qq^μc×πe,1+q^Qμc×π^,1\displaystyle\lVert Q^{\star}-\hat{q}\rVert_{\mu_{c}\times\pi^{\star}_{e},1}+\lVert\hat{q}-Q^{\star}\rVert_{\mu_{c}\times\hat{\pi},1}
\displaystyle\leq 2UQq^dc,1\displaystyle 2U_{\mathcal{B}}\lVert Q^{\star}-\hat{q}\rVert_{d_{c},1}

The distribution shift comes from the fact that

μ×π1μ×π2=π1π2,\displaystyle\Big{\lVert}\frac{\mu\times\pi_{1}}{\mu\times\pi_{2}}\Big{\rVert}_{\infty}=\Big{\lVert}\frac{\pi_{1}}{\pi_{2}}\Big{\rVert}_{\infty},

and the shifts from \pi_{c} to \pi^{\star}_{e} and to \hat{\pi} are both bounded by U_{\mathcal{B}} due to Assumptions 4 and 7. This completes the proof. ∎

D.4 Proof of Theorem 1

Theorem (Finite sample guarantee of Algorithm 1, restatement of Theorem 1).

If Assumptions 1, 3, 4, 5, 2, 8, 6 and 7 hold with εc4CcUεstat1γ\varepsilon_{c}\geq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}, then with probability at least 1δ1-\delta, the output π^\hat{\pi} from Algorithm 1 is near-optimal:

JJπ^4CcUεstat1γ.\displaystyle J^{\star}-J_{\hat{\pi}}\leq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}.
Proof.

From Lemma 6, we have that with probability at least 1δ1-\delta,

q^Qdc,1q^Qdc,22εstat.\displaystyle\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}\leq\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}.

Then apply Lemma 7 to bound the weighted advantage,

Q(,πe)Q(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2Uq^Qdc,14Uεstat.\displaystyle 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}\leq 4U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}.

Finally, according to Lemma 3, \hat{\pi} is \frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma} near-optimal. This completes the proof. ∎
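As a rough back-of-the-envelope illustration (every constant below is an assumption of ours, not a value from the paper), Theorem 1 can be inverted to estimate how large N_{\mathcal{D}} must be to reach a target suboptimality \epsilon; the absolute number is pessimistic, but it makes the \epsilon^{-4}(1-\gamma)^{-4} scaling implied by the \sqrt{\varepsilon_{\textup{stat}}} dependence explicit:

import numpy as np

# Illustrative constants (assumptions, not values from the paper).
C_c, U_B, U_W, gamma = 2.0, 2.0, 10.0, 0.9
V_max = 1.0 / (1.0 - gamma)
Q_size, W_size, delta = 1e6, 1e6, 0.05
eps_target = 0.1

# Invert J* - J_pi_hat <= 4 C_c U_B sqrt(eps_stat) / (1 - gamma) <= eps_target ...
eps_stat_needed = (eps_target * (1.0 - gamma) / (4.0 * C_c * U_B)) ** 2
# ... and eps_stat = U_W V_max sqrt(2 log(2|Q||W|/delta) / N_D).
log_term = np.log(2 * Q_size * W_size / delta)
N_needed = 2.0 * (U_W * V_max) ** 2 * log_term / eps_stat_needed ** 2
print(f"N_D >= {N_needed:.3e}")   # scales as eps_target^{-4} * (1 - gamma)^{-4}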