
Offline Reinforcement Learning with Additional Covering Distributions

Chenjie Mao
School of Computer Science and Technology
Huazhong University of Science and Technology
Wuhan 430074, China
chenjiemao@hust.edu.cn
Abstract

We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation. Despite the efforts devoted, existing algorithms with theoretical finite-sample guarantees typically assume exploratory data coverage or strongly realizable function classes, which are hard to satisfy in reality. While recent works successfully relax these strong assumptions, they either require gap assumptions that hold only for a subset of MDPs or use behavior regularization that renders the optimality of the learned policy intractable. To address this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial-coverage dataset and weakly realizable function classes, given additional side information in the form of a covering distribution. Furthermore, we demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias in the learning process.

1 Introduction and related works

In offline reinforcement learning (offline RL, also known as batch RL), the learner tries to find good policies with a pre-collected dataset. This data-driven paradigm eliminates the heavy burden of environmental interaction required in online learning, which could be dangerous or costly (e.g., in robotics [Kalashnikov et al., 2018, Sinha and Garg, 2021] and healthcare [Gottesman et al., 2018, 2019, Tang et al., 2022]), making offline RL a promising approach in real-world applications.

In early theoretical studies of offline RL (e.g., Munos [2003, 2005, 2007], Ernst et al. [2005], Antos et al. [2007], Munos and Szepesvari [2008], Farahmand et al. [2010]), researchers analyzed the finite-sample behavior of algorithms under assumptions such as exploratory datasets and realizable or Bellman-complete function classes. However, despite the error propagation bounds and sample complexity guarantees achieved in these works, the strong coverage assumptions on datasets and the non-monotonic assumptions on function classes, both of which are hard to satisfy in reality, have driven the search for sample-efficient offline RL algorithms under only weak assumptions about the dataset and function classes [Chen and Jiang, 2019].

From the dataset perspective, partial coverage, which means that only some specific policies (or even none) are covered by the dataset [Rashidinejad et al., 2021, Xie et al., 2021, Uehara and Sun, 2021, Song et al., 2022], is studied. To address the problem of insufficient information, most algorithms use behavior regularization (e.g., Laroche and Trichelair [2017], Kumar et al. [2019], Zhan et al. [2022]) or pessimism in the face of uncertainty (e.g., Liu et al. [2020], Jin et al. [2020], Rashidinejad et al. [2021], Xie et al. [2021], Uehara and Sun [2021], Cheng et al. [2022], Zhu et al. [2023]) to constrain the learned policies to be close to the behavior policy. Most algorithms in this setting (except some that we will discuss later) require function assumptions with some flavor of completeness: Bellman-completeness or strict realizability with respect to another function class (which we refer to as strong realizability).

From the function classes perspective, while the primary concern is the Bellman-completeness assumption, which is criticized for its non-monotonicity, some recent works [Zhan et al., 2022, Chen and Jiang, 2022, Ozdaglar et al., 2022] have noticed that realizability with respect to another function class is also non-monotonic. These non-monotonic properties contradict the intuition from supervised learning that richer function classes perform better (or at least no worse). Typical examples of these assumptions are the “realizability of all candidate policies’ value functions” (e.g., Jiang and Huang [2020], Zhu et al. [2023]) and the “realizability of all candidate policies’ density ratios” (e.g., Xie and Jiang [2020]). These assumptions are as strong as Bellman-completeness, and we classify them as “strong realizability” (Zhan et al. [2022], Ozdaglar et al. [2022] refer to them as “completeness-type”) for clarity. Correspondingly, we classify the assumption that a function class realizes specific elements as “weak realizability” (Chen and Jiang [2022] refers to this as “realizability-type”). We argue that this taxonomy is also justified because Bellman-completeness can be converted into a realizability assumption between two function classes with the minimax algorithm [Chen and Jiang, 2019]. This conversion aligns the behavior of Bellman-completeness with strong realizability assumptions.

On the one hand, the Bellman-completeness assumption is routinely made in classical finite-sample analyses of offline RL (e.g., the analysis of FQI [Ernst et al., 2005, Antos et al., 2007]) to ensure closed updates of value functions [Sutton and Barto, 2018, Wang et al., 2021]. This assumption is notoriously hard to mitigate, and Foster et al. [2021] even provide an information-theoretic lower bound stating that without Bellman-completeness, sample-efficient offline RL is impossible even with an exploratory dataset and a function class containing all candidate policies’ value functions. Therefore, it is clear that additional assumptions are required to circumvent Bellman-completeness.

On the other hand, since marginalized importance sampling (MIS, see, e.g., Liu et al. [2018], Uehara et al. [2019], Jiang and Huang [2020], Huang and Jiang [2022]) has shown its ability to eliminate Bellman-completeness with only a partial coverage dataset by assuming the realizability of density ratios in off-policy evaluation (OPE), there are works that try to adapt it to policy optimization. These adaptations retain the elimination of Bellman-completeness, but most come with other drawbacks. Some works (e.g., Jiang and Huang [2020], Zhu et al. [2023]) use OPE as an intermediate evaluation step for policy optimization yet require the strong realizability assumption on the value function class. The others borrow the idea of discriminators from MIS. Lee et al. [2021], Zhan et al. [2022] take value functions as discriminators for the optimal density ratio, using MIS to approximate the linear programming approach to Markov decision processes [Manne, 1960, Puterman, 1994]. Nachum et al. [2019], Chen and Jiang [2022], Uehara et al. [2023] take distribution density ratios as discriminators for the optimal value function by replacing the Bellman equation in OPE with its optimality variant. While in most cases theoretical finite-sample guarantees with these discriminators require strongly realizable function classes (e.g., Rashidinejad et al. [2022]), Zhan et al. [2022], Chen and Jiang [2022], Uehara et al. [2023] avoid this with additional gap assumptions or an alternative criterion of optimality, namely performance degradation w.r.t. the regularized optimal policy. To the best of our knowledge, they are the only works that achieve theoretical sample-efficiency guarantees under only weak realizability and partial coverage assumptions. However, on the one hand, the gap (margin) assumption [Chen and Jiang, 2022, Uehara et al., 2023] means that only specific Markov decision processes (MDPs), those whose optimal value functions have gaps, can be solved. On the other hand, sub-optimality compared with a regularized optimal policy [Zhan et al., 2022] can be meaningless in some cases, and the actual performance of the learned policy remains intractable.

As summarized above, the following question arises:

Is sample-efficient offline RL possible with only partial coverage and weak realizability assumptions for general MDPs?

We answer this question in the affirmative and propose an algorithm that achieves finite-sample guarantees under weak assumptions with the help of an additional covering distribution. We assume that the covering distribution covers all non-stationary near-optimal policies, and that the dataset covers the trajectories induced by an optimal policy starting from it. The covering distribution is adaptive in the sense that both the “non-stationary” and “near-optimal” requirements above are relaxed as the gap of the optimal value function increases. The covering distribution also gives a trade-off against the data coverage assumption: the more accurate it is, the fewer redundant trajectories are required to be covered by the dataset. Furthermore, we can directly use the data distribution as the covering distribution, as done in Uehara et al. [2023], if the near-optimal variant of their data assumption is also satisfied.

For comparison, we summarize in Table 1 algorithms with partial coverage that need neither Bellman-completeness nor model realizability (which is even stronger [Chen and Jiang, 2019, Zhan et al., 2022]). Necessary conversions are made to obtain the sub-optimality bounds. We omit the definitions of additional notations for simplicity and refer the interested reader to the original papers for more details.

Table 1: Comparison of offline RL algorithms (conc. stands for concentrability)
Algorithm | Data assumptions | Function assumptions | Major drawbacks
Jiang and Huang [2020] | optimal conc. | $w^{\star}\in\mathcal{W}$, and $\forall\pi\in\Pi,\,Q_{\pi}\in\mathcal{C}(\mathcal{Q})$ | strong realizability
Zhan et al. [2022] | optimal conc. | $w^{\star}_{\alpha}\in\mathcal{W}$, and $v^{\star}_{\alpha}\in\mathcal{V}$ | compare with $\pi^{\star}_{\alpha}$
Chen and Jiang [2022] | optimal conc. | $w^{\star}\in\mathcal{W}$, and $Q^{\star}\in\mathcal{Q}$ | assume gap (margin)
Rashidinejad et al. [2022] | optimal conc. | $w^{\star}\in\mathcal{W}$, $V^{\star}\in\mathcal{V}$, $u^{\star}_{w}\in\mathcal{U}\ \forall w$, and $\zeta^{\star}_{w^{\star},u}\in\mathcal{Z}\ \forall u$ | strong realizability
Zhu et al. [2023] | optimal conc. | $w^{\star}\in\mathcal{W}$, and $\forall\pi\in\Pi,\,Q_{\pi}\in\mathcal{Q}$ | strong realizability
Uehara et al. [2023] | optimal conc. from $d^{\mathcal{D}}$ | $w^{\star}\in\mathcal{W}$, and $Q^{\star}\in\mathcal{Q}$ | assume gap (margin)
Ours (VOPR) | optimal conc. from $d_{c}$ | $w^{\star}\in\mathcal{W}$, $\beta^{\star}\in\mathcal{B}$, and $Q^{\star}\in\mathcal{Q}$ | assume a covering $d_{c}$

In conclusion, our contributions are as follows:

  • (Section 3) We identify the novel mechanism of non-stationary near-optimal concentrability in policy optimization under weak assumptions.

  • (Section 4) We demonstrate the trade-off brought by additional covering distributions for the coverage requirement of the dataset.

  • (Section 4) We propose the first algorithm that achieves finite-sample guarantees for general MDPs under only weak realizability and partial coverage assumptions.

2 Preliminaries

This section introduces base concepts and notations in offline RL with function approximation and MIS. See Table 2 in Appendix A for a more complete list of definitions of notations.

Markov Decision Processes (MDPs)

We consider infinite-horizon discounted MDPs defined as $(\mathcal{S},\mathcal{A},P,R,\gamma,\mu_{0})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P\colon\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition probability, $R\colon\mathcal{S}\times\mathcal{A}\to[0,R_{\max}]$ is the deterministic reward function, $\gamma\in(0,1)$ is the discount factor that unravels the problem of infinite horizons, and $\mu_{0}\in\Delta(\mathcal{S})$ is the initial state distribution. With a policy $\pi\colon\mathcal{S}\to\Delta(\mathcal{A})$, we say that it induces a random trajectory $\{s_{0},a_{0},r_{0},s_{1},a_{1},r_{1},\dots,s_{i},a_{i},r_{i},s_{i+1},\dots\}$ if $s_{0}\sim\mu_{0}$, $a_{i}\sim\pi(\cdot\mid s_{i})$, $r_{i}=R(s_{i},a_{i})$, and $s_{i+1}\sim P(\cdot\mid s_{i},a_{i})$. We define the expected return of a policy $\pi$ as $J_{\pi}=\mathbb{E}\big[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\mid\mu_{0},\pi\big]$. We also denote the value functions of $\pi$, the expected return starting from a specific state $s$ or state-action pair $(s,a)$, as $V_{\pi}(s)=\mathbb{E}\big[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\mid s,\pi\big]$ and $Q_{\pi}(s,a)=\mathbb{E}\big[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\mid(s,a),\pi\big]$. We denote the set of optimal policies that achieve the maximum return $J^{\star}$ from $\mu_{0}$ as $\Pi^{\star}$, and its members as $\pi^{\star}$. We say a policy is optimal almost everywhere if its state value function is maximized at every state and denote it as $\pi_{e}^{\star}$ ($\pi_{e}^{\star}$ is not always unique). We write the value functions of $\pi^{\star}_{e}$ as $V^{\star}$ and $Q^{\star}$. It is worth noting that $V^{\star}$ and $Q^{\star}$, the unique solutions of both the Bellman optimality equation and the primal part of the LP approach to MDPs [Puterman, 1994], are not the value functions of all optimal policies. For ease of discussion, we assume $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{S}\times\mathcal{A}$ are compact measurable spaces and, with abuse of notation, we use $\nu$ to denote the corresponding finite uniform measure on each space (e.g., the Lebesgue measure). We use $P_{\pi}$ to denote the state-action transition operator for a density $d$, defined as $P_{\pi}d(s^{\prime},a^{\prime})\coloneqq\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)$.
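To make these definitions concrete, here is a minimal tabular sketch (our illustration, not part of the paper's method; the transition matrix P, reward R, and discount gamma below are hypothetical placeholders) that computes $Q^{\star}$, $V^{\star}$, and $\langle\mu_{0},V^{\star}\rangle$ by value iteration.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    # Compute Q* for a tabular MDP.
    # P: (S, A, S) transition probabilities, R: (S, A) deterministic rewards.
    Q = np.zeros_like(R)
    while True:
        V = Q.max(axis=1)                 # V*(s) = max_a Q*(s, a)
        Q_new = R + gamma * P @ V         # Bellman optimality backup T* Q
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# hypothetical 2-state, 2-action MDP, used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])
gamma, mu0 = 0.9, np.array([1.0, 0.0])

Q_star = value_iteration(P, R, gamma)
V_star = Q_star.max(axis=1)
print(Q_star, V_star, mu0 @ V_star)       # the last value is the optimal return from mu0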

Offline policy learning with function approximation

In the standard theoretical setup of offline RL, we are given a dataset $\mathcal{D}$ consisting of $N$ $(s,a,r,s^{\prime})$ tuples, which is collected with state distribution $\mu^{\mathcal{D}}$ and behavior policy $\pi_{b}$ such that $s\sim\mu^{\mathcal{D}}$, $a\sim\pi_{b}(\cdot\mid s)$, $r=R(s,a)$, $s^{\prime}\sim P(\cdot\mid s,a)$. We use $d^{\mathcal{D}}(s,a)\coloneqq\mu^{\mathcal{D}}(s)\pi_{b}(a\mid s)$ to denote the composed state-action distribution of the dataset. When the state and action spaces become complex, function approximation is typically used. For this, we assume there are function classes at hand that satisfy certain assumptions and have limited complexity (measured by cardinality, metric entropy, and so forth). The function classes considered in this paper are the state-action value function class $\mathcal{Q}\subseteq(\mathcal{S}\times\mathcal{A}\to\mathbb{R})$, the state-action distribution ratio class $\mathcal{W}\subseteq(\mathcal{S}\times\mathcal{A}\to\mathbb{R})$, and the policy ratio class $\mathcal{B}\subseteq(\mathcal{S}\times\mathcal{A}\to\mathbb{R})$.
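Continuing the toy example above, the following short sketch (again our illustration; mu_D and pi_b are hypothetical) logs $N$ tuples $(s,a,r,s^{\prime})$ exactly as in this data-collection protocol and forms $d^{\mathcal{D}}=\mu^{\mathcal{D}}\times\pi_{b}$.

rng = np.random.default_rng(0)

def collect_dataset(P, R, mu_D, pi_b, N):
    # Log N (s, a, r, s') tuples with s ~ mu_D, a ~ pi_b(.|s), s' ~ P(.|s, a).
    data = []
    for _ in range(N):
        s = rng.choice(len(mu_D), p=mu_D)
        a = rng.choice(P.shape[1], p=pi_b[s])
        data.append((s, a, R[s, a], rng.choice(P.shape[2], p=P[s, a])))
    return data

mu_D = np.array([0.5, 0.5])                    # hypothetical data state distribution
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])      # hypothetical behavior policy
data = collect_dataset(P, R, mu_D, pi_b, N=5000)
d_D = mu_D[:, None] * pi_b                     # d^D(s, a) = mu^D(s) pi_b(a | s)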

MIS with density discriminators and $L^{2}$ error bound

One of the most popular ways to estimate the optimal value function is via the Bellman optimality equation:

$\forall s\in\mathcal{S},a\in\mathcal{A},\quad Q^{\star}(s,a)=T^{\star}Q^{\star}(s,a)$ (1)

where $T^{\star}q(s,a)\coloneqq R(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\max q(s^{\prime},\cdot)]$ denotes the Bellman optimality operator. However, when we try to utilize the constraints from Eq. 1 (e.g., through the $L^{1}$ error $\lVert q-T^{\star}q\rVert_{1,d^{\mathcal{D}}}$), the expectation in $T^{\star}$ would introduce the infamous double-sampling issue [Baird, 1995], making the estimation intractable. To overcome this, previous works with MIS take weight functions as discriminators and minimize a weighted sum of Eq. 1. In fact, even the $L^{1}$ error itself can be written as a weighted sum with a sign function (taking $1$ if $q\geq T^{\star}q$ and $-1$ otherwise [Ozdaglar et al., 2022]). Namely, we approximate $Q^{\star}$ through

$\hat{q}=\operatorname*{argmin}_{q\in\mathcal{Q}}\max_{w\in\mathcal{W}}\mathbb{E}_{d^{\mathcal{D}}}\big[w(s,a)\big(q(s,a)-T^{\star}q(s,a)\big)\big].$ (2)

Since the weight function class $\mathcal{W}$ is marginalized onto the state-action space (instead of trajectories), this approach is called marginalized importance sampling (MIS) [Liu et al., 2018]. While theoretical guarantees for MIS under weak realizability and partial coverage assumptions are typically made for scalar values (e.g., the return [Uehara et al., 2019, Jiang and Huang, 2020]), recently, Zhan et al. [2022], Huang and Jiang [2022], Uehara et al. [2023] have gone beyond this and derived $L^{2}$ error guarantees for the estimators by using strongly convex functions. Among them, the optimal value function estimator from Uehara et al. [2023] forms the basis of this work.
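As a concrete illustration of the minimax estimator in Eq. 2 (a sketch with hypothetical finite, tabular classes Q_class and W_class; not the exact procedure of any cited work), note that the single-sample weighted residual $w(s,a)\big(q(s,a)-r-\gamma\max q(s^{\prime},\cdot)\big)$ is an unbiased estimate of $w(s,a)\big(q(s,a)-T^{\star}q(s,a)\big)$, so the objective can be estimated from logged tuples without double sampling:

def weighted_residual(q, w, data, gamma):
    # Empirical estimate of E_{d^D}[ w(s,a) (q(s,a) - T* q(s,a)) ].
    return float(np.mean([w[s, a] * (q[s, a] - r - gamma * q[sn].max())
                          for (s, a, r, sn) in data]))

def minimax_q_estimate(Q_class, W_class, data, gamma):
    # Eq. 2 with finite classes: argmin over q of the worst-case weighted residual.
    return min(Q_class,
               key=lambda q: max(weighted_residual(q, w, data, gamma)
                                 for w in W_class))

In practice the classes would be parametric, and both the inner maximization and the outer minimization would be solved approximately (e.g., by alternating gradient steps).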

3 From $Q^{\star}$ to optimal policy, the minimum requirement

Uehara et al. [2023] show that accurately estimating the optimal value function $Q^{\star}$ under $d^{\mathcal{D}}$ is possible if $d^{\mathcal{D}}$ covers the optimal trajectories starting from itself. This "self-covering" assumption can be relaxed and generalized if we only require an accurate estimator under some state-action distribution $d_{c}$ such that $d_{c}\ll d^{\mathcal{D}}$ (we use $\mu_{c}$ and $\pi_{c}$ to denote the state distribution and policy decomposed from $d_{c}$). In fact, $d_{c}$ provides a trade-off for the coverage requirement of the dataset: the fewer state-action pairs on the support of $d_{c}$, the weaker the data coverage assumptions we need to make. Nevertheless, how much trade-off can $d_{c}$ provide while preserving the desired result?

In policy learning, our goal is to derive an optimal policy $\hat{\pi}$ from the estimated $Q^{\star}$ (denoted as $\hat{q}$). While there are methods (see Section 4.3 for a brief discussion) that induce policies from $\hat{q}$ by exploiting pessimism or data regularization, one of the most straightforward ways is to take the actions covered by $d_{c}$ that achieve the maximum of $\hat{q}$ in each state. This can be done with the help of the policy ratio class $\mathcal{B}$, via

$\hat{\beta}=\operatorname*{argmax}_{\beta\in\mathcal{B}}\langle\mu_{c},\hat{q}(\cdot,\pi_{\beta})\rangle\quad\text{and take}\quad\hat{\pi}=\pi_{\hat{\beta}},$ (3)

where $\pi_{\beta}(a\mid s)=\pi_{c}(a\mid s)\beta(s,a)$ (normalized if needed; see Table 2). With the optimal realizability of $\mathcal{B}$ and the concentrability of $\pi_{c}$, Eq. 3 is actually equivalent to

$\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle=0,$ (4)

which guides us to exploit the coverage provided by $\mu_{c}$. Recall that our goal is to use $d_{c}$ to trade off the coverage assumption on $d^{\mathcal{D}}$. Therefore, the remaining question, which forms the primary subject of this section, is

With which $\mu_{c}$ can we conclude that $\hat{\pi}$ is optimal from $\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle=0$,
and what is the minimum requirement on it?

Since $\mu_{c}$ and $d_{c}$ are to provide additional coverage, we also call them “covering distributions”.

The remainder of this section is organized as follows: we first show why single optimal concentrability of $\mu_{c}$ is not enough in Section 3.1; we then introduce the alternative “all-optimal concentrability” in Section 3.2 and its adapted version in Section 3.3 to accommodate statistical errors.

3.1 The dilemma of single optimal concentrability

Single optimal concentrability is standard [Liu et al., 2020, Xie et al., 2021, Cheng et al., 2022] when we try to mitigate exploratory data assumptions (e.g., all-policy concentrability). However, this framework suffers from a conundrum when only weak realizability assumptions are made: we can know that the learned policy performs well only if we are informed about the trajectories induced by it, rather than the ones induced by the covered policy.

More specifically, as the optimality of $\hat{\pi}$ can be quantified by the performance gap $J^{\star}-J_{\hat{\pi}}$, we can telescope it through the performance difference lemma.

Lemma 1 (The performance difference lemma).

We can decompose the performance gap as

$(1-\gamma)(J_{\pi_{1}}-J_{\pi_{2}})=\langle\mu_{\pi_{1}},Q_{\pi_{2}}(\cdot,\pi_{1})-Q_{\pi_{2}}(\cdot,\pi_{2})\rangle.$

Thus, with Eq. 4, if we want $J^{\star}-J_{\hat{\pi}}$ (i.e., $J_{\pi^{\star}_{e}}-J_{\hat{\pi}}$) to equal zero, it might be necessary to require $\mu_{c}$ to cover $\mu_{\hat{\pi}}$ ($\mu_{c}\gg\mu_{\hat{\pi}}$), since the right part of the inner product in Eq. 4 is always non-positive. However, as $\hat{\pi}$ is estimated, and is even random when approximated from a dataset, $\mu_{c}\gg\mu_{\hat{\pi}}$ is usually achieved through all-policy concentrability: $\mu_{c}\gg\mu_{\pi}$ for all $\pi$ in the hypothesis class. Single optimal concentrability fails to provide the desired result.

For instance, consider the counterexample in Figure 1, which is adapted from Zhan et al. [2022], Chen and Jiang [2022]. Suppose we finally obtain the following covering distribution and policy:

$\mu_{c}(s)=\begin{cases}\nicefrac{1}{2}&\text{if }s=1\\ \nicefrac{1}{2}&\text{if }s=2\end{cases}\qquad\text{and}\qquad\hat{\pi}(s)=\begin{cases}\text{L}&\text{if }s=1\\ \text{R}&\text{if }s=3\\ \text{Random}&\text{otherwise}.\end{cases}$

While $\mu_{c}$ achieves single optimal concentrability and $\hat{\pi}$ maximizes $Q^{\star}$ at each state on the support of $\mu_{c}$, $\hat{\pi}$ is not an optimal policy since it ends up with $0$ return.

Figure 1: The above MDP is deterministic, and we start from state $1$. We can take actions L (left) and R (right) in each state. In states $1$ and $3$, action L (R) transfers us to the state on the left (right), while taking actions in other states only causes a self-loop. We can only obtain non-zero rewards by taking actions in states $2$ and $4$, with values $1$ and $\frac{1}{\gamma}$ respectively. There are two trajectories that lead to the optimal return $\nicefrac{\gamma}{(1-\gamma)}$: $\{(1,\text{R}),2,\dots\}$ and $\{(1,\text{L}),(3,\text{L}),4,\dots\}$. We take $\gamma$ as the discount factor.
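To check the counterexample numerically, the following sketch (our illustration; it reuses the value_iteration helper from the sketch in Section 2, and the state indexing and $\gamma=0.9$ are our own choices) builds the MDP of Figure 1, verifies that $\hat{\pi}$ is $Q^{\star}$-greedy on the support of $\mu_{c}$ so that Eq. 4 holds, and yet collects return $0$.

gamma = 0.9
nS, nA = 4, 2                       # states 1..4 mapped to indices 0..3; actions L=0, R=1
P = np.zeros((nS, nA, nS))
P[0, 0, 2] = 1.0                    # state 1, L -> state 3
P[0, 1, 1] = 1.0                    # state 1, R -> state 2
P[2, 0, 3] = 1.0                    # state 3, L -> state 4
P[2, 1, 0] = 1.0                    # state 3, R -> state 1
P[1, :, 1] = 1.0                    # state 2 self-loops
P[3, :, 3] = 1.0                    # state 4 self-loops
R = np.zeros((nS, nA))
R[1, :] = 1.0                       # reward 1 for acting in state 2
R[3, :] = 1.0 / gamma               # reward 1/gamma for acting in state 4

Q_star = value_iteration(P, R, gamma)
print(np.isclose(Q_star[0, 0], Q_star[0, 1]))   # True: both L and R are Q*-greedy in state 1

pi_hat = np.array([0, 0, 1, 0])     # L in state 1, R in state 3, arbitrary elsewhere
s, J_hat = 0, 0.0
for t in range(500):                # pi_hat loops 1 -> 3 -> 1 -> ... and never reaches a reward
    a = pi_hat[s]
    J_hat += gamma ** t * R[s, a]
    s = int(P[s, a].argmax())
print(J_hat, gamma / (1 - gamma))   # 0.0 versus the optimal return gamma / (1 - gamma)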

How gap assumptions avoid this

While both Chen and Jiang [2022] and Uehara et al. [2023] consider single optimal concentrability and weak realizability assumptions (Uehara et al. [2023] also assume additional structure of the dataset), the gap (margin) assumptions ensure that only taking $\pi^{\star}$ as $\hat{\pi}$ can achieve Eq. 4. Moreover, Chen and Jiang [2022] show that with the gap assumption, we can even use a value-based algorithm to derive a near-optimal policy without accurately estimating $Q^{\star}$.

3.2 All-optimal concentrability

While single optimal concentrability suffers from the hardness revealed above, there is still an alternative to an exploratory covering $\mu_{c}$, which is shown in the following lemma:

Lemma 2 (From advantage to optimality).

If $\mu_{c}$ covers all distributions induced by non-stationary optimal policies (i.e., $\mu_{c}\gg\mu_{\pi^{\star}_{\textup{non}}}$ for any $\pi^{\star}_{\textup{non}}$) and Eq. 4 holds, then $\hat{\pi}$ is optimal and $J_{\hat{\pi}}=J^{\star}$.

Remark 1.

Non-stationary policies are frequently employed in the analysis of offline RL [Munos, 2003, 2005, Scherrer and Lesner, 2012, Chen and Jiang, 2019, Liu et al., 2020]. If we make the gap assumption, the “all non-stationary” requirement can be dropped, since the action in each state that leads to the optimal return is unique.

Remark 2.

Wang et al. [2022] also utilize the all-optimal concentrability assumption, but they consider the tabular setting and additionally require gap assumptions to achieve near-optimality guarantees.

We now provide a short proof of Lemma 2, showing by induction that $\hat{\pi}_{i}$, the non-stationary policy that adopts $\hat{\pi}$ in the $0$-th through $i$-th steps (inclusive) and then follows $\pi^{\star}_{e}$, is optimal.

Proof.

We first rewrite the telescoping equation in the performance difference lemma as

\begin{aligned}
(1-\gamma)(J_{\hat{\pi}_{i}}-J^{\star})&=\langle\mu_{\hat{\pi}_{i}},Q^{\star}(\cdot,\hat{\pi}_{i})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle &\quad(5)\\
&=\langle\mu_{\hat{\pi}_{i}}^{0:i},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle+\langle\mu_{\hat{\pi}_{i}}^{i+1:\infty},Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle &\quad(6)\\
&=\langle\mu_{\hat{\pi}_{i}}^{0:i},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle &\quad(7)
\end{aligned}

where $\mu^{i:j}_{\pi}$ denotes the $i$-th through $j$-th steps part of $\mu_{\pi}$ (inclusive at both ends). Thus, the optimality of $\hat{\pi}_{i}$ only depends on the $0$-th through $i$-th steps, and $\hat{\pi}_{i}$ is optimal if this part is on the support of $\mu_{c}$. We now show inductively that, for any natural number $i$, the $0$-th through $i$-th steps part is covered:

  • The step-$0$ part of $\mu_{\hat{\pi}}$ (i.e., $(1-\gamma)\mu_{0}$) is on the support of $\mu_{c}$ since there is some (non-stationary) optimal policy $\pi^{\star}$ covered by it,

    $\mu_{c}\gg\mu_{\pi^{\star}}\gg\mu_{0}.$

    Therefore, $\langle\mu_{\hat{\pi}_{0}}^{0:0},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle=0$. From Eq. 7, $\hat{\pi}_{0}$ is optimal.

  • We next show that if $\hat{\pi}_{i}$ is optimal (which means it is covered by $\mu_{c}$), then the $0$-th through $(i+1)$-th steps part of $\mu_{\hat{\pi}}$ is covered by $\mu_{c}$, which means that $\hat{\pi}_{i+1}$ is optimal. This comes from the fact that the $0$-th through $(i+1)$-th steps part of the state distribution induced by a policy only depends on its $0$-th through $i$-th decisions:

    $\mu_{c}\gg\mu_{\hat{\pi}_{i}}\gg\mu_{\hat{\pi}_{i}}^{0:i+1}=\mu_{\hat{\pi}}^{0:i+1}.$

    From Eq. 7, $\hat{\pi}_{i+1}$ is optimal.

Thus, for any $\epsilon>0$, there exists a natural number $i\geq\log_{\gamma}\frac{\epsilon}{V_{\max}}$ such that

$J^{\star}-J_{\hat{\pi}}\leq J^{\star}-J_{\hat{\pi}}^{0:i}\leq J^{\star}-(J_{\hat{\pi}_{i}}-\gamma^{i+1}V_{\textup{max}})\leq\gamma^{i+1}V_{\textup{max}}\leq\epsilon,$

where $J_{\pi}^{i:j}$ denotes the $i$-th through $j$-th steps part of the return. Therefore, $\hat{\pi}$ is optimal. ∎

Consequently, instead of the exploratory data assumption, all non-stationary optimal coverage is sufficient for policy optimization.

3.3 Dealing with statistical error

While Lemma 2 is adequate at the population level (i.e., with an infinite amount of data), covering all non-stationary optimal policies is not enough when considering the empirical setting (i.e., with finite samples) due to the introduced statistical error. This motivates us to adapt Lemma 2 with a more refined $\mu_{c}$.

Assumption 1 (All near-optimal concentrability).

We are given a covering distribution $d_{c}$ such that its state distribution part $\mu_{c}$ covers the distribution induced by any non-stationary $\varepsilon_{c}$ near-optimal policy $\tilde{\pi}$:

$\Big\lVert\frac{\mu_{\tilde{\pi}}}{\mu_{c}}\Big\rVert_{\infty}\leq C_{c},\quad\forall\tilde{\pi}\in\Pi_{\varepsilon_{c},\textup{non}}^{\star}.$ (8)

We call a policy $\pi$ $\varepsilon$ near-optimal if $J^{\star}-J_{\pi}\leq\varepsilon$ and denote by $\Pi_{\varepsilon,\textup{non}}^{\star}$ the class of all non-stationary $\varepsilon$ near-optimal policies. We also define $\frac{0}{0}=1$ to suppress the extreme cases. With this refined $\mu_{c}$, we can now derive the optimality of $\hat{\pi}$ even in the presence of statistical errors.

Lemma 3 (From advantage to optimality, with statistical errors).

If $\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star})\rangle\geq-\varepsilon$ and Assumption 1 holds with $\varepsilon_{c}\geq\frac{C_{c}\varepsilon}{1-\gamma}$, then $\hat{\pi}$ is $\frac{C_{c}\varepsilon}{1-\gamma}$ near-optimal.

We defer the proof of this lemma to Section C.1.

Remark 3 (The asymptotic property of $\varepsilon_{c}$).

One of the most important properties of all near-optimal concentrability is that $\varepsilon_{c}$ depends on the statistical error, which decreases as the amount of data increases.

4 Algorithm and analysis

After discussing the minimum requirement of estimating $Q^{\star}$, this section will demonstrate how to fulfill it and accomplish the policy learning task. Our algorithm, which is based on the optimal value estimator from Uehara et al. [2023], follows the pseudocode in Algorithm 1.

Input: dataset $\mathcal{D}$, value function class $\mathcal{Q}$, distribution density ratio class $\mathcal{W}$, policy ratio function class $\mathcal{B}$, and covering distribution $d_{c}$
1 Estimate the optimal value function $\hat{q}$ as
$\hat{q}=\operatorname*{argmin}_{q\in\mathcal{Q}}\max_{w\in\mathcal{W}}\hat{\mathcal{L}}(d_{c},q,w)$ (9)
where
$\hat{\mathcal{L}}(d,q,w)\coloneqq 0.5\,\mathbb{E}_{d}[q^{2}(s,a)]+\frac{1}{N_{\mathcal{D}}}\sum_{(s,a,r,s^{\prime})\in\mathcal{D}}\Big[w(s,a)\big[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big]\Big]$ (10)
2 Derive the approximated optimal policy ratio:
$\hat{\beta}=\operatorname*{argmax}_{\beta\in\mathcal{B}}\mathbb{E}_{\mu_{c}}[\hat{q}(s,\pi_{\beta})]$
Output: $\hat{\pi}=\pi_{\hat{\beta}}$
Algorithm 1 VOPR (Value-Based Offline RL with Policy Ratio)
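For concreteness, here is a minimal sketch of Algorithm 1 for finite, tabular function classes (our illustration; the enumeration over Q_class, W_class, and B_class and the dataset format of $(s,a,r,s^{\prime})$ tuples are assumptions, and in practice both optimization steps would be approximated with parametric classes and gradient methods).

def L_hat(d_c, q, w, data, gamma):
    # Eq. 10: 0.5 E_{d_c}[q^2] + (1/N) sum_D w(s,a) [gamma max_a' q(s',a') + r - q(s,a)]
    reg = 0.5 * float(np.sum(d_c * q ** 2))
    td = float(np.mean([w[s, a] * (gamma * q[sn].max() + r - q[s, a])
                        for (s, a, r, sn) in data]))
    return reg + td

def vopr(data, Q_class, W_class, B_class, d_c, gamma):
    # Step 1 (Eq. 9): q_hat = argmin_q max_w L_hat(d_c, q, w)
    q_hat = min(Q_class,
                key=lambda q: max(L_hat(d_c, q, w, data, gamma) for w in W_class))
    # Decompose d_c into its state distribution mu_c and policy pi_c
    mu_c = d_c.sum(axis=1)
    pi_c = d_c / np.maximum(mu_c[:, None], 1e-12)
    def value_under_mu_c(beta):
        # E_{mu_c}[ q_hat(s, pi_beta) ] with pi_beta proportional to pi_c * beta
        pi_beta = pi_c * beta
        pi_beta /= np.maximum(pi_beta.sum(axis=1, keepdims=True), 1e-12)
        return float(np.sum(mu_c * np.sum(pi_beta * q_hat, axis=1)))
    # Step 2: beta_hat = argmax_beta E_{mu_c}[ q_hat(s, pi_beta) ]
    beta_hat = max(B_class, key=value_under_mu_c)
    pi_hat = pi_c * beta_hat
    return pi_hat / np.maximum(pi_hat.sum(axis=1, keepdims=True), 1e-12)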

We organize the rest of this section as follows: we first discuss the trade-off provided by the additional covering distribution $d_{c}$ and how to obtain $d_{c}$ in practice in Section 4.1; we then provide the finite-sample analysis of Algorithm 1 and its proof sketch in Section 4.2; we finally conclude this section by comparing our algorithm with others in Section 4.3.

We defer the main proofs in this section to Appendix D.

4.1 Data assumptions and trade-off

As investigated in recent works [Huang and Jiang, 2022, Uehara et al., 2023], value function estimation under a given distribution requires a dataset that contains trajectories rolled out from it. Thus, our data assumption is as follows.

Assumption 2 (Partial concentrability from $d_{c}$).

The shift from $d^{\mathcal{D}}$ to the state-action distribution induced by $\pi_{e}^{\star}$ from $d_{c}$ is bounded:

$\Big\lVert\frac{d_{d_{c},\pi^{\star}_{e}}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq C_{\mathcal{D}}.$ (11)

It is clear that with Assumption 2, $d_{c}$ is also covered by $d^{\mathcal{D}}$.

Proposition 4.

If Assumption 2 holds, then by the definition of $d_{d_{c},\pi^{\star}_{e}}$,

$\Big\lVert\frac{d_{c}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq\Big\lVert\frac{d_{d_{c},\pi^{\star}_{e}}/(1-\gamma)}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq\frac{C_{\mathcal{D}}}{1-\gamma}.$
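For clarity, we spell out the first inequality (a short derivation we add here; it only uses Proposition 11 in Appendix B): since every term of the Neumann series is non-negative,

$d_{d_{c},\pi^{\star}_{e}}=(1-\gamma)\sum_{i=0}^{\infty}(\gamma P_{\pi^{\star}_{e}})^{i}d_{c}\;\geq\;(1-\gamma)\,d_{c}\quad\text{pointwise},\qquad\text{and hence}\quad\frac{d_{c}}{d^{\mathcal{D}}}\;\leq\;\frac{d_{d_{c},\pi^{\star}_{e}}/(1-\gamma)}{d^{\mathcal{D}}}.$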

We now clarify the order of the learning process: we are first given a dataset $\mathcal{D}$ with some good properties; then we try to find a $d_{c}$ on the support of the state-action distribution of $\mathcal{D}$ through some inductive bias (with necessary approximation); finally, we apply Algorithm 1 with $\mathcal{D}$ and $d_{c}$.

The choice of $d_{c}$ constitutes a trade-off between knowledge about optimal policies and the requirement on data coverage. On the one hand, the most casual choice of $d_{c}$ is $d^{\mathcal{D}}$ (as in Uehara et al. [2023]), which means we have no prior knowledge about optimal policies. Employing $d^{\mathcal{D}}$ as $d_{c}$ not only requires the dataset to cover unnecessary suboptimal trajectories, but also makes the data assumption non-monotonic (adding new data points to the dataset could break it). On the other hand, if we have perfect knowledge about optimal policies, Assumption 2 can be significantly relaxed. More concretely, if $d_{c}$ consists strictly of the state-action distributions of trajectories induced by near-optimal policies, our data assumption reduces to the per-step version of near-optimal concentrability.

Lemma 5.

If $d_{c}$ is a mixture of the state-action distributions induced by non-stationary $\varepsilon$ near-optimal policies $\Pi_{\varepsilon,\textup{non}}^{\star}$ under a fixed probability measure $\lambda$:

$d_{c}=\int_{\Pi_{\varepsilon,\textup{non}}^{\star}}d_{\tilde{\pi}}\,d\lambda(\tilde{\pi}),$ (12)

and $d^{\mathcal{D}}$ covers all admissible distributions of $\Pi_{\varepsilon,\textup{non}}^{\star}$:

$\forall\,\tilde{\pi}\in\Pi^{\star}_{\varepsilon,\textup{non}},\ i\in\mathbb{N},\quad\Big\lVert\frac{d_{\tilde{\pi},i}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq C,$

where $d_{\pi,i}$ denotes the normalized distribution of the $i$-th step part of $d_{\pi}$, then the distribution shift from $d^{\mathcal{D}}$ is bounded as

$\Big\lVert\frac{d_{d_{c},\pi^{\star}_{e}}}{d^{\mathcal{D}}}\Big\rVert_{\infty}\leq C.$

While the above case is impractical in reality, it reveals the power of this inductive bias: the more auxiliary information we obtain about optimal paths, the weaker the coverage assumptions required of the dataset.

4.2 Finite-sample guarantee

We now give the finite-sample guarantee of Algorithm 1; before proceeding, we state the necessary function class assumptions. The first are the weak realizability assumptions:

Assumption 3 (Realizability of $\mathcal{W}$).

There exists a state-action distribution density ratio $w^{\star}\in\mathcal{W}$ such that $w^{\star}\circ d^{\mathcal{D}}=(I-\gamma P_{\pi^{\star}_{e}})^{-1}d_{c}Q^{\star}$.

Assumption 4 (Realizability of $\mathcal{B}$).

There exists a policy ratio $\beta^{\star}\in\mathcal{B}$ such that $\beta^{\star}\circ\pi_{c}=\pi^{\star}_{e}$ and, for all $s\in\mathcal{S}$, $\int_{\mathcal{A}}\beta(s,a)\pi_{c}(s,a)\,d\nu(a)=1$.

Assumption 5 (Realizability of $\mathcal{Q}$).

$\mathcal{Q}$ contains the optimal value function: $Q^{\star}\in\mathcal{Q}$.

Next, we gather all the boundedness assumptions.

Assumption 6 (Boundedness of $\mathcal{Q}$).

For any $q\in\mathcal{Q}$, we assume $q\in(\mathcal{S}\times\mathcal{A}\to[0,V_{\textup{max}}])$.

Assumption 7 (Boundedness of $\mathcal{B}$).

For any $\beta\in\mathcal{B}$, we assume $\beta\in(\mathcal{S}\times\mathcal{A}\to[0,U_{\mathcal{B}}])$.

Assumption 8 (Boundedness of $\mathcal{W}$).

For any $w\in\mathcal{W}$, we assume $w\in(\mathcal{S}\times\mathcal{A}\to[0,U_{\mathcal{W}}])$.

Remark 4 (Validity).

The invertibility of $I-\gamma P_{\pi^{\star}_{e}}$ is shown by Lemma 10 in Section B.1. While Assumptions 3 and 8 actually subsume Assumption 2, we make it explicit for clarity of explanation. Assumption 4 implicitly assumes that $\pi_{c}$ covers $\pi^{\star}_{e}$; this can easily be achieved by directly choosing $\pi_{b}$ as $\pi_{c}$.

Remark 5.

Although we include the normalization step in Assumption 4, this can also be achieved with some preprocessing steps.

Remark 6.

There is an overlap in the above assumptions: we can derive a policy ratio class $\mathcal{B}$ directly from $\mathcal{W}$ and $\mathcal{Q}$.

With these prerequisites in place, we can finally state our finite-sample guarantee.

Theorem 1 (Sample complexity of learning a near-optimal policy).

If Assumptions 1, 2, 3, 4, 5, 6, 7, and 8 hold with $\varepsilon_{c}\geq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}$, where

$\varepsilon_{\textup{stat}}=U_{\mathcal{W}}V_{\textup{max}}\sqrt{\frac{2\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{N_{\mathcal{D}}}},$

then with probability at least $1-\delta$, the output $\hat{\pi}$ of Algorithm 1 is near-optimal:

$J^{\star}-J_{\hat{\pi}}\leq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}.$
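As a worked consequence (our rearrangement of the stated bound, keeping the constants as given): to guarantee $J^{\star}-J_{\hat{\pi}}\leq\epsilon$ it suffices that $\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}\leq\epsilon$, which after substituting $\varepsilon_{\textup{stat}}$ and solving for $N_{\mathcal{D}}$ gives

$N_{\mathcal{D}}\;\geq\;\frac{512\,C_{c}^{4}U_{\mathcal{B}}^{4}U_{\mathcal{W}}^{2}V_{\textup{max}}^{2}\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{(1-\gamma)^{4}\epsilon^{4}},$

i.e., a sample complexity of order $C_{c}^{4}U_{\mathcal{B}}^{4}U_{\mathcal{W}}^{2}V_{\textup{max}}^{2}\log(\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)/((1-\gamma)^{4}\epsilon^{4})$.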

Proof sketch of Theorem 1

As we can obtain the near-optimality guarantee via Lemma 3, the remaining task is to approximate Eq. 4. This comes from the following two lemmas.

Lemma 6 ($L^{2}$ error of $\hat{q}$ under $d_{c}$, adapted from Theorem 2 in Uehara et al. [2023]).

If Assumptions 2, 3, 5, 6, and 8 hold, then with probability at least $1-\delta$, the estimate $\hat{q}$ from Algorithm 1 satisfies

$\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}.$
Lemma 7 (From $L^{1}$ distance to Eq. 4).

If Assumptions 4 and 7 hold,

$\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}.$

Combining them, we have that with probability at least $1-\delta$,

$\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}\leq 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 4U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}.$
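Spelling out the last step: applying Lemma 3 with $\varepsilon=4U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}$ (which is admissible since $\varepsilon_{c}\geq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}=\frac{C_{c}\varepsilon}{1-\gamma}$ by assumption) yields

$J^{\star}-J_{\hat{\pi}}\;\leq\;\frac{C_{c}\,\varepsilon}{1-\gamma}\;=\;\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma},$

which is the bound in Theorem 1.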

4.3 Comparison with related works

We now provide a brief comparison of our method with some related algorithms.

Algorithms with gap assumptions

Chen and Jiang [2022] and Uehara et al. [2023] assume that there are (soft) gaps in the optimal value function, which only a subset of MDPs satisfies, whereas our goal is to handle general problems. Moreover, while our algorithm is based on the optimal value estimator proposed by Uehara et al. [2023], we use the policy ratio to ensure a finite distribution shift, and our near-optimality guarantee does not require the soft margin assumption. Besides, Uehara et al. [2023] use $d^{\mathcal{D}}$ as $d_{c}$, assuming that the dataset covers the optimal trajectories starting from itself. This assumption is non-monotonic and hard to satisfy in reality. Instead, we propose using an additional covering distribution $d_{c}$ as an alternative, which can effectively utilize prior knowledge about the optimal trajectories and trade off the dataset requirement.

Algorithms with behavior regularization

Zhan et al. [2022] use behavior regularization to ensure that the learned policy is close to the dataset. Nevertheless, the regularization makes the optimality of the learned policy intractable.

Algorithms with pessimism in the face of uncertainty

These algorithms (e.g., Jiang and Huang [2020], Liu et al. [2020], Xie et al. [2021], Cheng et al. [2022], Zhu et al. [2023]) are often closely related to approximate dynamic programming (ADP). They “pessimistically” estimate the given policies and update (or choose) policies with the pessimistic value estimates. However, the evaluation step used in these algorithms always requires strong realizability of all candidate policies’ value functions, which our algorithm avoids.

Limitations of our algorithm

On the one hand, the additional covering distribution may be hard to access in some scenarios, leading back to using $d^{\mathcal{D}}$ as $d_{c}$. On the other hand, although mitigated as the dataset size increases, the assumption of covering all near-optimal policies is still stronger than classic single-optimal concentrability. In addition, the “non-stationary” coverage requirement is also somewhat restrictive.

5 Conclusion and further work

This paper presents VOPR, a new MIS-based algorithm for offline RL with function approximation. VOPR is inspired by the optimal value estimator proposed in Uehara et al. [2023], and it circumvents the soft margin assumption of the original paper with a near-optimal coverage assumption. While it still works when using the data distribution as the covering distribution, VOPR can trade off data assumptions with more refined choices. Compared with other algorithms considering partial coverage, VOPR does not make strong function class assumptions and works for general MDPs. Finally, despite these successes, a refined additional covering distribution may be difficult to obtain, and the near-optimal coverage assumption is still stronger than single optimal concentrability. We leave these issues for further investigation.

References

  • Antos et al. [2007] András Antos, Rémi Munos, and Csaba Szepesvári. Fitted q-iteration in continuous action-space mdps. In NIPS, 2007.
  • Baird [1995] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995.
  • Chen and Jiang [2019] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. ArXiv, abs/1905.00360, 2019.
  • Chen and Jiang [2022] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps. ArXiv, abs/2203.13935, 2022.
  • Cheng et al. [2022] Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. ArXiv, abs/2202.02446, 2022.
  • Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res., 6:503–556, 2005.
  • Farahmand et al. [2010] Amir Massoud Farahmand, Rémi Munos, and Csaba Szepesvari. Error propagation for approximate policy and value iteration. In NIPS, 2010.
  • Foster et al. [2021] Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Annual Conference Computational Learning Theory, 2021.
  • Gottesman et al. [2018] Omer Gottesman, Fredrik D. Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, A. Aldo Faisal, Leo Anthony Celi, David A. Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. ArXiv, abs/1805.12298, 2018.
  • Gottesman et al. [2019] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:16 – 18, 2019.
  • Huang and Jiang [2022] Audrey Huang and Nan Jiang. Beyond the return: Off-policy function estimation under user-specified error-measuring distributions. ArXiv, abs/2210.15543, 2022.
  • Jiang and Huang [2020] Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. arXiv: Learning, 2020.
  • Jin et al. [2020] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, 2020.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. ArXiv, abs/1806.10293, 2018.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, G. Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, 2019.
  • Laroche and Trichelair [2017] Romain Laroche and Paul Trichelair. Safe policy improvement with baseline bootstrapping. ArXiv, abs/1712.06924, 2017.
  • Lee et al. [2021] Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joëlle Pineau, and Kee-Eung Kim. Optidice: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning, 2021.
  • Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Neural Information Processing Systems, 2018.
  • Liu et al. [2020] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. ArXiv, abs/2007.08202, 2020.
  • Manne [1960] A.S. Manne. Linear programming and sequential decisions. In Management Science, volume 6, page 259–267, 1960.
  • Munos [2003] Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.
  • Munos [2005] Rémi Munos. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence, 2005.
  • Munos [2007] Rémi Munos. Performance bounds in lp-norm for approximate value iteration. SIAM J. Control. Optim., 46:541–561, 2007.
  • Munos and Szepesvari [2008] Rémi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. J. Mach. Learn. Res., 9:815–857, 2008.
  • Nachum et al. [2019] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. ArXiv, abs/1912.02074, 2019.
  • Ozdaglar et al. [2022] Asuman E. Ozdaglar, Sarath Pattathil, Jiawei Zhang, and K. Zhang. Revisiting the linear-programming framework for offline rl with general function approximation. ArXiv, abs/2212.13861, 2022.
  • Puterman [1994] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. In Wiley Series in Probability and Statistics, 1994.
  • Rashidinejad et al. [2021] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart J. Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. IEEE Transactions on Information Theory, 68:8156–8196, 2021.
  • Rashidinejad et al. [2022] Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart J. Russell, and Jiantao Jiao. Optimal conservative offline rl with general function approximation via augmented lagrangian. ArXiv, abs/2211.00716, 2022.
  • Scherrer and Lesner [2012] Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary infinite-horizon markov decision processes. In NIPS, 2012.
  • Sinha and Garg [2021] Samarth Sinha and Animesh Garg. S4rl: Surprisingly simple self-supervision for offline reinforcement learning. ArXiv, abs/2103.06326, 2021.
  • Song et al. [2022] Yuda Song, Yi Zhou, Ayush Sekhari, J. Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient. ArXiv, abs/2210.06718, 2022.
  • Sutton and Barto [2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
  • Tang et al. [2022] Shengpu Tang, Maggie Makar, Michael W. Sjoding, Finale Doshi-Velez, and Jenna Wiens. Leveraging factored action spaces for efficient offline reinforcement learning in healthcare. 2022.
  • Uehara and Sun [2021] Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage. In International Conference on Learning Representations, 2021.
  • Uehara et al. [2019] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, 2019.
  • Uehara et al. [2023] Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Refined value-based offline rl under realizability and partial coverage. ArXiv, abs/2302.02392, 2023.
  • Wang et al. [2021] Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, and Sham M. Kakade. Instabilities of offline rl with pre-trained neural representation. In International Conference on Machine Learning, 2021.
  • Wang et al. [2022] Xinqi Wang, Qiwen Cui, and Simon Shaolei Du. On gap-dependent bounds for offline reinforcement learning. ArXiv, abs/2206.00177, 2022.
  • Xie and Jiang [2020] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. ArXiv, abs/2003.03924, 2020.
  • Xie et al. [2021] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. In Neural Information Processing Systems, 2021.
  • Zhan et al. [2022] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D. Lee. Offline reinforcement learning with realizability and single-policy concentrability. ArXiv, abs/2202.04634, 2022.
  • Zhu et al. [2023] Hanlin Zhu, Paria Rashidinejad, and Jiantao Jiao. Importance weighted actor-critic for optimal conservative offline reinforcement learning. ArXiv, abs/2301.12714, 2023.

Appendix A Notations

Table 2: Notations
$\mathcal{S}$ | state space
$\mathcal{A}$ | action space
$\mathcal{Q}$ | state-action value function class
$\mathcal{W}$ | state-action distribution ratio function class
$\mathcal{B}$ | policy ratio function class
$\beta$ | members of $\mathcal{B}$
$V_{\pi}$ | state value function for policy $\pi$
$Q_{\pi}$ | state-action value function for policy $\pi$
$V^{\star}$ | optimal state value function
$Q^{\star}$ | optimal state-action value function
$\nu$ | uniform measure of $\mathcal{A}$, $\mathcal{S}$, or $\mathcal{S}\times\mathcal{A}$, depending on the context
$\mathcal{D}$ | dataset used in the algorithm
$d^{\mathcal{D}}$ | state-action distribution of the dataset
$\mu^{\mathcal{D}}$ | state distribution of the dataset
$\pi_{b}$ | behavior policy
$d_{c}$ | the additional covering distribution
$\mu_{c}$ | state distribution of the additional covering distribution
$\pi_{c}$ | policy of the additional covering distribution
$\langle a,b\rangle$ | inner product of $a$ and $b$, usually as $\int ab\,d\nu$
$f_{1}\circ f_{2}$ | $(f_{1}\circ f_{2})(s,a)=f_{1}(s,a)f_{2}(s,a)$, normalized if needed (e.g., for densities)
$\mu\times\pi$ | $(\mu\times\pi)(s,a)=\mu(s)\pi(a\mid s)$
$T^{\star}$ | Bellman optimality operator, $T^{\star}q(s,a)\coloneqq R(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\max q(s^{\prime},\cdot)]$
$\mu_{0}$ | initial state distribution
$\mu_{\pi}^{i:j}$ | the $i$-th to $j$-th steps part of $\mu_{\pi}$
$d_{1}\gg d_{2}$ | $d_{2}$ is absolutely continuous w.r.t. $d_{1}$
$d_{\pi,i}$ | normalized $i$-th step part of the state-action distribution induced by $\pi$
$d_{d,\pi}$ | state-action distribution induced by $\pi$ from $d$
$\pi_{i}$ | policy that takes $\pi$ in the $0$-th to $i$-th (inclusive) steps, and takes $\pi^{\star}_{e}$ afterwards
$\pi_{\beta}$ | $\pi_{\beta}(a\mid s)=\pi_{c}(a\mid s)\beta(s,a)/\int_{\mathcal{A}}\pi_{c}(a\mid s)\beta(s,a)\,d\nu(a)$
$\Pi_{\varepsilon,\textup{non}}^{\star}$ | the class of all non-stationary $\varepsilon$ near-optimal policies
$P_{\pi}$ | state-action transition kernel with policy $\pi$
$O^{\star}$ | conjugate operator of some operator $O$

While $\pi$, $\mu$, and $d$ are mainly used to denote the Radon–Nikodym derivatives of the underlying probability measures w.r.t. $\nu$, we sometimes also use them to represent the corresponding probability measures, with abuse of notation.

Appendix B Helper Lemmas

B.1 Properties of $P_{\pi}$

We first provide some properties of $P_{\pi}$ (for any policy $\pi$) as an operator on the $L^{\infty}$-space of $\mathcal{S}\times\mathcal{A}$; similar results also hold for transition operators with policies defined on $\mathcal{S}$. Note that the integrals of the absolute values of the functions considered in this subsection are always finite, which means that we can change the order of integration via Fubini's theorem. As we will consider conjugate (adjoint) operators, we define the inner product as $\langle q,d\rangle=\int_{\mathcal{S}\times\mathcal{A}}q(s,a)d(s,a)\,d\nu(s,a)$.

Lemma 8.

$P_{\pi}$ is linear.

Proof.

Recall the definition of $P_{\pi}$:

$P_{\pi}d(s^{\prime},a^{\prime})=\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a).$

For any $d_{1},d_{2}\in L^{\infty}(\mathcal{S}\times\mathcal{A})$ and scalars $\alpha_{1},\alpha_{2}$,

\begin{aligned}
P_{\pi}(\alpha_{1}d_{1}+\alpha_{2}d_{2})(s^{\prime},a^{\prime})&=\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)(\alpha_{1}d_{1}+\alpha_{2}d_{2})(s,a)\,d\nu(s,a)\\
&=\int_{\mathcal{S}\times\mathcal{A}}\alpha_{1}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{1}(s,a)\,d\nu(s,a)+\int_{\mathcal{S}\times\mathcal{A}}\alpha_{2}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{2}(s,a)\,d\nu(s,a)\\
&=\alpha_{1}\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{1}(s,a)\,d\nu(s,a)+\alpha_{2}\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d_{2}(s,a)\,d\nu(s,a)\\
&=\alpha_{1}P_{\pi}d_{1}(s^{\prime},a^{\prime})+\alpha_{2}P_{\pi}d_{2}(s^{\prime},a^{\prime}).
\end{aligned}

This completes the proof. ∎

Lemma 9.

The adjoint operator of $P_{\pi}$ is

$P_{\pi}^{\star}q(s,a)=\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)\,d\nu(s^{\prime},a^{\prime}).$
Remark 7.

Intuitively, we can see $P_{\pi}d(s^{\prime},a^{\prime})$ as one step forward of $d$: we start from $(s,a)\sim d$, transition to $s^{\prime}\sim P(\cdot\mid s,a)$, and take $a^{\prime}\sim\pi(\cdot\mid s^{\prime})$. Also, we can view $P_{\pi}^{\star}q(s,a)$ as one step backward of $q$: we compute the value of $(s,a)$ through the one-step transferred state-action distribution with the help of $q$.

Proof.

Consider the inner products $\langle q,P_{\pi}d\rangle$ and $\langle P_{\pi}^{\star}q,d\rangle$; we need to show that they are equal. By definition,

\begin{aligned}
\langle q,P_{\pi}d\rangle&=\int_{\mathcal{S}\times\mathcal{A}}\Bigg[q(s^{\prime},a^{\prime})\int_{\mathcal{S}\times\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)\Bigg]d\nu(s^{\prime},a^{\prime})\\
&=\int_{\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)\,d\nu(s^{\prime},a^{\prime})
\end{aligned}

and

\begin{aligned}
\langle P_{\pi}^{\star}q,d\rangle&=\int_{\mathcal{S}\times\mathcal{A}}d(s,a)\Big[\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)\,d\nu(s^{\prime},a^{\prime})\Big]d\nu(s,a)\\
&=\int_{\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}\times\mathcal{A}}d(s,a)q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)\,d\nu(s^{\prime},a^{\prime})\,d\nu(s,a)\\
&=\int_{\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d(s,a)\,d\nu(s,a)\,d\nu(s^{\prime},a^{\prime}).\qquad\text{(Fubini's theorem)}
\end{aligned}

This completes the proof. ∎

Lemma.

$\lVert P_{\pi}^{\star}\rVert_{\infty}=\lVert P_{\pi}\rVert_{\infty}\leq 1.$

Remark 8.

This upper bound should be intuitive, since $P_{\pi}$ can be seen as a probability transition kernel from $\mathcal{S}\times\mathcal{A}$ to itself.

Proof.

Fix any $s\in\mathcal{S}$, $a\in\mathcal{A}$, and define $p(s^{\prime},a^{\prime})=P(s^{\prime}\mid s,a)\pi(a^{\prime}\mid s^{\prime})$. By Fubini's theorem, we have that

\begin{aligned}
\lVert p\rVert_{1,\nu}&=\int_{\mathcal{S}\times\mathcal{A}}\lvert p\rvert\,d\nu=\int_{\mathcal{S}\times\mathcal{A}}p\,d\nu\\
&=\int_{\mathcal{S}}\int_{\mathcal{A}}P(s^{\prime}\mid s,a)\pi(a^{\prime}\mid s^{\prime})\,d\nu(a^{\prime})\,d\nu(s^{\prime})\\
&=\int_{\mathcal{S}}P(s^{\prime}\mid s,a)\Bigg[\int_{\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\,d\nu(a^{\prime})\Bigg]d\nu(s^{\prime})\\
&=\int_{\mathcal{S}}P(s^{\prime}\mid s,a)\,d\nu(s^{\prime})\\
&=1.
\end{aligned}

For any function q on \mathcal{S}\times\mathcal{A} such that \lVert q\rVert_{\infty,\nu}\leq 1, Hölder's inequality yields

pq1,νq,νp1,ν1.\displaystyle\lVert pq\rVert_{1,\nu}\leq\lVert q\rVert_{\infty,\nu}\lVert p\rVert_{1,\nu}\leq 1.

Thus, for any s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}, and function qq with q,ν1\lVert q\rVert_{\infty,\nu}\leq 1,

Pπq(s,a)=𝒮×𝒜q(s,a)π(as)P(ss,a)𝑑ν(s,a)=pq1,ν1.\displaystyle P_{\pi}^{\star}q(s,a)=\int_{\mathcal{S}\times\mathcal{A}}q(s^{\prime},a^{\prime})\pi(a^{\prime}\mid s^{\prime})P(s^{\prime}\mid s,a)d\nu(s^{\prime},a^{\prime})=\lVert pq\rVert_{1,\nu}\leq 1.

So, we have that

Pπ=Pπ=maxq1Pπq,νmaxq1maxs𝒮,a𝒜Pπq(s,a)1.\displaystyle\lVert P_{\pi}\rVert_{\infty}=\lVert P_{\pi}^{\star}\rVert_{\infty}=\max\limits_{\lVert q\rVert_{\infty}\leq 1}\lVert P_{\pi}^{\star}q\rVert_{\infty,\nu}\leq\max\limits_{\lVert q\rVert_{\infty}\leq 1}\max\limits_{s\in\mathcal{S},a\in\mathcal{A}}P_{\pi}^{\star}q(s,a)\leq 1.

This completes the proof. ∎
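The remark above can also be checked numerically: in the tabular representation used in the previous sketch (again an illustration of ours, with arbitrary sizes), every column of the matrix for P_{\pi} sums to one, which is exactly why \lVert P_{\pi}^{\star}q\rVert_{\infty}\leq\lVert q\rVert_{\infty} for any bounded q:

import numpy as np

rng = np.random.default_rng(3)
nS, nA = 5, 2
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Rows indexed by (s', a'), columns by (s, a): entry = pi(a'|s') * P(s'|s, a).
P_pi = np.einsum("ij,kli->ijkl", pi, P).reshape(nS * nA, nS * nA)

# Each column sums to one, so P_pi^T is row-stochastic ...
assert np.allclose(P_pi.sum(axis=0), 1.0)

# ... hence applying the adjoint cannot increase the sup norm.
q = rng.uniform(-1.0, 1.0, nS * nA)
assert np.abs(P_pi.T @ q).max() <= np.abs(q).max() + 1e-12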

Lemma 10.

IγPπI-\gamma P_{\pi} is invertible and

(IγPπ)1=i=0(γPπ)i.\displaystyle(I-\gamma P_{\pi})^{-1}=\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}.
Proof.

Since \lVert P_{\pi}\rVert_{\infty}\leq 1 and \gamma<1, the series \sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i} converges. Multiplying,

(IγPπ)[i=0(γPπ)i]=\displaystyle(I-\gamma P_{\pi})[\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}]= i=0(γPπ)ii=1(γPπ)i\displaystyle\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}-\sum\limits_{i=1}^{\infty}(\gamma P_{\pi})^{i}
=\displaystyle= (γPπ)0\displaystyle(\gamma P_{\pi})^{0}
=\displaystyle= I.\displaystyle I.

This completes the proof. ∎

Proposition 11.

By definition, dd,π=(1γ)i=0(γPπ)id=(1γ)(IγPπ)1dd_{d,\pi}=(1-\gamma)\sum\limits_{i=0}^{\infty}(\gamma P_{\pi})^{i}d=(1-\gamma)(I-\gamma P_{\pi})^{-1}d.
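Lemma 10 and Proposition 11 admit a quick numerical sanity check as well; the sketch below (our own, with an arbitrary discount factor and a random tabular MDP) compares the resolvent form (1-\gamma)(I-\gamma P_{\pi})^{-1}d with a truncated Neumann series and confirms that the resulting occupancy is again a probability distribution:

import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
P_pi = np.einsum("ij,kli->ijkl", pi, P).reshape(nS * nA, nS * nA)

d = rng.random(nS * nA); d /= d.sum()   # initial state-action distribution

# Occupancy via the resolvent (I - gamma * P_pi)^{-1} ...
occ_inverse = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, d)

# ... and via the truncated Neumann series (1 - gamma) * sum_i (gamma * P_pi)^i d.
occ_series, term = np.zeros(nS * nA), d.copy()
for _ in range(2000):
    occ_series += (1 - gamma) * term
    term = gamma * (P_pi @ term)

assert np.allclose(occ_inverse, occ_series, atol=1e-8)
assert np.isclose(occ_inverse.sum(), 1.0)   # d_{d, pi} is again a distribution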

B.2 Other useful lemmas

Lemma (Performance difference lemma).

We can decompose the performance gap as

(1γ)(Jπ1Jπ2)=μπ1,Qπ2(,π1)Qπ2(,π2).\displaystyle(1-\gamma)(J_{\pi_{1}}-J_{\pi_{2}})=\langle\mu_{\pi_{1}},Q_{\pi_{2}}(\cdot,\pi_{1})-Q_{\pi_{2}}(\cdot,\pi_{2})\rangle.
Proof.

By definition,

\displaystyle\langle\mu_{\pi_{1}},Q_{\pi_{2}}(\cdot,\pi_{1})-Q_{\pi_{2}}(\cdot,\pi_{2})\rangle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})+\gamma\mathbb{E}_{a\sim\pi_{1}(\cdot\mid s),s^{\prime}\sim P(\cdot\mid s,a)}[Q_{\pi_{2}}(s^{\prime},\pi_{2})]-Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]+\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[\gamma\mathbb{E}_{a\sim\pi_{1}(\cdot\mid s),s^{\prime}\sim P(\cdot\mid s,a)}[Q_{\pi_{2}}(s^{\prime},\pi_{2})]\big]-\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]+\gamma\mathbb{E}_{s\sim\mu_{\pi_{1}},a\sim\pi_{1}(\cdot\mid s),s^{\prime}\sim P(\cdot\mid s,a)}[Q_{\pi_{2}}(s^{\prime},\pi_{2})]-\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]+\mathbb{E}_{s\sim\mu_{\pi_{1}}}[Q_{\pi_{2}}(s,\pi_{2})]-(1-\gamma)\mathbb{E}_{s\sim\mu_{0}}[Q_{\pi_{2}}(s,\pi_{2})]-\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[Q_{\pi_{2}}(\cdot,\pi_{2})\big]
\displaystyle=\mathbb{E}_{s\sim\mu_{\pi_{1}}}\big[R(\cdot,\pi_{1})\big]-(1-\gamma)\mathbb{E}_{s\sim\mu_{0}}[Q_{\pi_{2}}(s,\pi_{2})]
\displaystyle=(1-\gamma)(J_{\pi_{1}}-J_{\pi_{2}}).

The first equality comes from the Bellman equation, and the fourth equality comes from the definition of \mu_{\pi_{1}}. This completes the proof. ∎
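The performance difference lemma is easy to verify on a small tabular MDP. The sketch below (our own illustration; all quantities are randomly generated and the helper names are ours) computes Q_{\pi_{2}}, the normalized discounted state occupancy \mu_{\pi_{1}}, and both sides of the identity:

import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
mu0 = rng.random(nS); mu0 /= mu0.sum()

def random_policy(seed):
    g = np.random.default_rng(seed)
    pi = g.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def transition_matrix(pi):
    # rows (s', a'), columns (s, a): pi(a'|s') * P(s'|s, a)
    return np.einsum("ij,kli->ijkl", pi, P).reshape(nS * nA, nS * nA)

def q_values(pi):
    # Q_pi solves Q = R + gamma * P_pi^T Q (policy Bellman equation).
    A = np.eye(nS * nA) - gamma * transition_matrix(pi).T
    return np.linalg.solve(A, R.reshape(-1)).reshape(nS, nA)

def occupancy(pi):
    # normalized discounted state-action occupancy starting from mu0
    d0 = (mu0[:, None] * pi).reshape(-1)
    d = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * transition_matrix(pi), d0)
    return d.reshape(nS, nA)

pi1, pi2 = random_policy(10), random_policy(11)
Q2 = q_values(pi2)
mu_pi1 = occupancy(pi1).sum(axis=1)    # state marginal of d_{pi1}

advantage = mu_pi1 @ ((pi1 * Q2).sum(axis=1) - (pi2 * Q2).sum(axis=1))
J1 = mu0 @ (pi1 * q_values(pi1)).sum(axis=1)
J2 = mu0 @ (pi2 * Q2).sum(axis=1)
assert np.isclose((1 - gamma) * (J1 - J2), advantage)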

Appendix C Detailed proofs for Section 3

C.1 Proof of Lemma 3

Lemma (From advantage to optimality, restatement of Lemma 3).

If \langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star})\rangle\geq-\varepsilon, and Assumption 1 holds with \varepsilon_{c}\geq\frac{C_{c}\varepsilon}{1-\gamma}, then \hat{\pi} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal.

Proof.

We begin by using induction to prove that \hat{\pi}_{i} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal for every i\in\mathbb{N}:

  • We first show that π^0\hat{\pi}_{0} is Ccε1γ\frac{C_{c}\varepsilon}{1-\gamma} near-optimal. From Assumption 1, we can use any π~Πεc,non\tilde{\pi}\in\Pi^{\star}_{\varepsilon_{c},\textup{non}} to conclude that

    μ0μcμπ~/(1γ)μcCc1γ.\displaystyle\Big{\lVert}\frac{\mu_{0}}{\mu_{c}}\Big{\rVert}_{\infty}\leq\Big{\lVert}\frac{\mu_{\tilde{\pi}}/(1-\gamma)}{\mu_{c}}\Big{\rVert}_{\infty}\leq\frac{C_{c}}{1-\gamma}.

    Thus, we can show the near-optimality of \hat{\pi}_{0} via the advantage:

    μπ^0,Q(,π^0)Q(,πe)=\displaystyle\langle\mu_{\hat{\pi}_{0}},Q^{\star}(\cdot,\hat{\pi}_{0})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle= μπ^00:0,Q(,π^)Q(,πe)+μπ^01:,Q(,πe)Q(,πe)\displaystyle\langle\mu^{0:0}_{\hat{\pi}_{0}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle+\langle\mu^{1:\infty}_{\hat{\pi}_{0}},Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^00:0,Q(,π^)Q(,πe)\displaystyle\langle\mu^{0:0}_{\hat{\pi}^{\star}_{0}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= (1γ)μ0,Q(,π^)Q(,πe)\displaystyle(1-\gamma)\langle\mu_{0},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq Ccμc,Q(,π^)Q(,πe)\displaystyle C_{c}\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle (Q(,π^)Q(,πe)Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e}) is non-positive)
    \displaystyle\geq Ccε.\displaystyle-C_{c}\varepsilon.

    By performance difference lemma,

    (1γ)(Jπ^0J)=\displaystyle(1-\gamma)(J_{\hat{\pi}_{0}}-J^{\star})= μπ^0,Q(,π^0)Q(,πe)\displaystyle\langle\mu_{\hat{\pi}_{0}},Q^{\star}(\cdot,\hat{\pi}_{0})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq Ccε.\displaystyle-C_{c}\varepsilon.
  • Next, we show that if \hat{\pi}_{i} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal, then \hat{\pi}_{i+1} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal. Since \hat{\pi}_{i} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal, the distribution shift from \mu_{c} to \mu_{\hat{\pi}_{i}} is bounded, which means

    μπ^0:i+1μc=μπ^i0:i+1μcμπ^iμcCc.\displaystyle\Big{\lVert}\frac{\mu_{\hat{\pi}}^{0:i+1}}{\mu_{c}}\Big{\rVert}_{\infty}=\Big{\lVert}\frac{\mu_{\hat{\pi}_{i}}^{0:i+1}}{\mu_{c}}\Big{\rVert}_{\infty}\leq\Big{\lVert}\frac{\mu_{\hat{\pi}_{i}}}{\mu_{c}}\Big{\rVert}_{\infty}\leq C_{c}.

    Then, we have

    μπ^i+1,Q(,π^i+1)Q(,πe)\displaystyle\langle\mu_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi}_{i+1})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^i+10:i+1,Q(,π^)Q(,πe)+μπ^i+1i+2:,Q(,πe)Q(,πe)\displaystyle\langle\mu^{0:i+1}_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle+\langle\mu^{i+2:\infty}_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^i+10:i+1,Q(,π^)Q(,πe)\displaystyle\langle\mu^{0:i+1}_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    =\displaystyle= μπ^0:i+1,Q(,π^)Q(,πe)\displaystyle\langle\mu^{0:i+1}_{\hat{\pi}},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq Ccμc,Q(,π^)Q(,πe)\displaystyle C_{c}\langle\mu_{c},Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle (Q(,π^)Q(,πe)Q^{\star}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\pi^{\star}_{e}) is non-positive)
    \displaystyle\geq Ccε.\displaystyle-C_{c}\varepsilon.

    By performance difference lemma,

    \displaystyle(1-\gamma)(J_{\hat{\pi}_{i+1}}-J^{\star})=\langle\mu_{\hat{\pi}_{i+1}},Q^{\star}(\cdot,\hat{\pi}_{i+1})-Q^{\star}(\cdot,\pi^{\star}_{e})\rangle
    \displaystyle\geq-C_{c}\varepsilon.

    Therefore, π^i+1\hat{\pi}_{i+1} is Ccε1γ\frac{C_{c}\varepsilon}{1-\gamma} near-optimal.

Thus, for any \epsilon>0, there exists a natural number i\geq\log_{\gamma}\frac{\epsilon}{V_{\max}} such that

JJπ^JJπ^0:iJ(Jπ^iγi+1Vmax)Ccε1γ+γi+1VmaxCcε1γ+ϵ,\displaystyle J^{\star}-J_{\hat{\pi}}\leq J^{\star}-J_{\hat{\pi}}^{0:i}\leq J^{\star}-(J_{\hat{\pi}_{i}}-\gamma^{i+1}V_{\textup{max}})\leq\frac{C_{c}\varepsilon}{1-\gamma}+\gamma^{i+1}V_{\textup{max}}\leq\frac{C_{c}\varepsilon}{1-\gamma}+\epsilon,

where J_{\pi}^{i:j} denotes the part of the return contributed by steps i through j. Therefore, \hat{\pi} is \frac{C_{c}\varepsilon}{1-\gamma} near-optimal. ∎

Appendix D Detailed proofs for Section 4

D.1 Proof of Lemma 5

Lemma (Restatement of Lemma 5).

If dcd_{c} is a linear combination of the state-action distributions induced by ε\varepsilon near-optimal non-stationary policies Πε,non\Pi_{\varepsilon,\textup{non}}^{\star} under a fixed probability measure λ\lambda:

dc=Πε,nondπ~𝑑λ(π~).\displaystyle d_{c}=\int_{\Pi_{\varepsilon,\textup{non}}^{\star}}d_{\tilde{\pi}}d\lambda(\tilde{\pi}). (13)

and d^{\mathcal{D}} covers all admissible distributions of \Pi_{\varepsilon,\textup{non}}^{\star}:

π~Πε,non,i,dπ~,id𝒟C.\displaystyle\forall\ \tilde{\pi}\in\Pi^{\star}_{\varepsilon,\textup{non}},\ i\in\mathbb{N},\ \Big{\lVert}\frac{d_{\tilde{\pi},i}}{d^{\mathcal{D}}}\Big{\rVert}_{\infty}\leq C.

then the distribution shift from d^{\mathcal{D}} is bounded as

ddc,πed𝒟C.\displaystyle\Big{\lVert}\frac{d_{d_{c},\pi^{\star}_{e}}}{d^{\mathcal{D}}}\Big{\rVert}_{\infty}\leq C.
Proof.

Define the state-action distribution of policy π\pi from s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A} at step ii as

\displaystyle d_{s,a,\pi,i}(s^{\prime},a^{\prime})=P\big(s_{i}=s^{\prime},\,a_{i}=a^{\prime}\;\big|\;s_{0}=s,\ a_{0}=a,\ s_{j}\sim P(\cdot\mid s_{j-1},a_{j-1}),\ a_{j}\sim\pi(\cdot\mid s_{j})\ \text{for all}\ j\geq 1\big).

Also, define the global version of it as

ds,a,π(s,a)=(1γ)i=0ds,a,π,i(s,a).\displaystyle d_{s,a,\pi}(s^{\prime},a^{\prime})=(1-\gamma)\sum\limits_{i=0}^{\infty}d_{s,a,\pi,i}(s^{\prime},a^{\prime}).

We can rewrite ddc,πe(s,a)d_{d_{c},\pi^{\star}_{e}}(s,a) as

ddc,πe(s,a)=\displaystyle d_{d_{c},\pi^{\star}_{e}}(s,a)= 𝒮×𝒜ds1,a1,πe(s,a)dc(s1,a1)𝑑ν(s1,a1)\displaystyle\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{c}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= 𝒮×𝒜ds1,a1,πe(s,a)[Πdπ~(s1,a1)𝑑λ(π~)]𝑑ν(s1,a1)\displaystyle\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)\Big{[}\int_{\Pi}d_{\tilde{\pi}}(s_{1},a_{1})d\lambda(\tilde{\pi})\Big{]}d\nu(s_{1},a_{1})
=\displaystyle= Π[𝒮×𝒜ds1,a1,πe(s,a)dπ~(s1,a1)𝑑ν(s1,a1)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi}}(s_{1},a_{1})d\nu(s_{1},a_{1})\Big{]}d\lambda(\tilde{\pi}) (Fubini’s Theorem)
=\displaystyle= Π[𝒮×𝒜(1γ)i=0[γids1,a1,πe(s,a)dπ~,i(s1,a1)]dν(s1,a1)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}\int_{\mathcal{S}\times\mathcal{A}}(1-\gamma)\sum\limits_{i=0}^{\infty}\big{[}\gamma^{i}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})\big{]}d\nu(s_{1},a_{1})\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0[γi𝒮×𝒜ds1,a1,πe(s,a)dπ~,i(s1,a1)𝑑ν(s1,a1)]]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}\big{[}\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})\big{]}\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0dπ~ii:(s,a)]𝑑λ(π~).\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}d_{\tilde{\pi}_{i}}^{i:\infty}(s,a)\Big{]}d\lambda(\tilde{\pi}).

The last equation comes from that

γi𝒮×𝒜ds1,a1,πe(s,a)dπ~,i(s1,a1)𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= γi𝒮×𝒜ds1,a1,πe(s,a)[𝒮[𝒜ds2,a2,π~,i(s1,a1)π~(a2s2)𝑑ν(a2)]μ0(s2)𝑑ν(s2)]𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)\Big{[}\int_{\mathcal{S}}\Big{[}\int_{\mathcal{A}}d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})\tilde{\pi}(a_{2}\mid s_{2})d\nu(a_{2})\Big{]}\mu_{0}(s_{2})d\nu(s_{2})\Big{]}d\nu(s_{1},a_{1})
=\displaystyle= 𝒮[𝒜[γi𝒮×𝒜ds1,a1,πe(s,a)ds2,a2,π~,i(s1,a1)𝑑ν(s1,a1)]π~(a2s2)𝑑ν(a2)]μ0(s2)𝑑ν(s2),\displaystyle\int_{\mathcal{S}}\Big{[}\int_{\mathcal{A}}\Big{[}\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})\Big{]}\tilde{\pi}(a_{2}\mid s_{2})d\nu(a_{2})\Big{]}\mu_{0}(s_{2})d\nu(s_{2}), (Fubini’s Theorem)

since

γi𝒮×𝒜ds1,a1,πe(s,a)ds2,a2,π~,i(s1,a1)𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= γi𝒮×𝒜(1γ)k=0[γkds1,a1,πe,k(s,a)]ds2,a2,π~,i(s1,a1)dν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}(1-\gamma)\sum\limits_{k=0}^{\infty}\big{[}\gamma^{k}d_{s_{1},a_{1},\pi^{\star}_{e},k}(s,a)\big{]}d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= (1γ)k=0[γk+i𝒮×𝒜ds1,a1,πe,k(s,a)ds2,a2,π~,i(s1,a1)𝑑ν(s1,a1)]\displaystyle(1-\gamma)\sum\limits_{k=0}^{\infty}\Big{[}\gamma^{k+i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e},k}(s,a)d_{s_{2},a_{2},\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})\Big{]}
=\displaystyle= (1γ)k=0[γk+ids2,a2,π~i,k+i(s,a)]\displaystyle(1-\gamma)\sum\limits_{k=0}^{\infty}\big{[}\gamma^{k+i}d_{s_{2},a_{2},\tilde{\pi}_{i},k+i}(s,a)\big{]}
=\displaystyle= (1γ)k=i[γkds2,a2,π~i,k(s,a)]\displaystyle(1-\gamma)\sum\limits_{k=i}^{\infty}\big{[}\gamma^{k}d_{s_{2},a_{2},\tilde{\pi}_{i},k}(s,a)\big{]}
=\displaystyle= ds2,a2,π~ii:(s,a),\displaystyle d_{s_{2},a_{2},\tilde{\pi}_{i}}^{i:\infty}(s,a),

we get

γi𝒮×𝒜ds1,a1,πe(s,a)dπ~,i(s1,a1)𝑑ν(s1,a1)\displaystyle\gamma^{i}\int_{\mathcal{S}\times\mathcal{A}}d_{s_{1},a_{1},\pi^{\star}_{e}}(s,a)d_{\tilde{\pi},i}(s_{1},a_{1})d\nu(s_{1},a_{1})
=\displaystyle= 𝒮[𝒜[ds2,a2,π~ii:(s,a)]π~(a2s2)𝑑ν(a2)]μ0(s2)𝑑ν(s2)\displaystyle\int_{\mathcal{S}}\Big{[}\int_{\mathcal{A}}\Big{[}d_{s_{2},a_{2},\tilde{\pi}_{i}}^{i:\infty}(s,a)\Big{]}\tilde{\pi}(a_{2}\mid s_{2})d\nu(a_{2})\Big{]}\mu_{0}(s_{2})d\nu(s_{2})
=\displaystyle= dπ~ii:(s,a).\displaystyle d_{\tilde{\pi}_{i}}^{i:\infty}(s,a).

Finally, s𝒮,a𝒜\forall s\in\mathcal{S},a\in\mathcal{A},

ddc,πe(s,a)d𝒟(s,a)=\displaystyle\frac{d_{d_{c},\pi^{\star}_{e}}(s,a)}{d^{\mathcal{D}}(s,a)}= Π[(1γ)i=0dπ~ii:(s,a)d𝒟(s,a)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}\frac{d_{\tilde{\pi}_{i}}^{i:\infty}(s,a)}{d^{\mathcal{D}}(s,a)}\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0(1γ)j=iγjdπ~i,j(s,a)d𝒟(s,a)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}\frac{(1-\gamma)\sum_{j=i}^{\infty}\gamma^{j}d_{\tilde{\pi}_{i},j}(s,a)}{d^{\mathcal{D}}(s,a)}\Big{]}d\lambda(\tilde{\pi})
=\displaystyle= Π[(1γ)i=0(1γ)j=iγjdπ~i,j(s,a)d𝒟(s,a)]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}(1-\gamma)\sum\limits_{i=0}^{\infty}(1-\gamma)\sum_{j=i}^{\infty}\gamma^{j}\frac{d_{\tilde{\pi}_{i},j}(s,a)}{d^{\mathcal{D}}(s,a)}\Big{]}d\lambda(\tilde{\pi})
\displaystyle\leq Π[C(1γ)2i=0j=iγj]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}C(1-\gamma)^{2}\sum\limits_{i=0}^{\infty}\sum\limits_{j=i}^{\infty}\gamma^{j}\Big{]}d\lambda(\tilde{\pi}) (π~Πε,non\tilde{\pi}\in\Pi_{\varepsilon,\textup{non}}^{\star} indicates π~iΠε,non\tilde{\pi}_{i}\in\Pi_{\varepsilon,\textup{non}}^{\star})
\displaystyle\leq Π[C(1γ)2i=0γi1γ]𝑑λ(π~)\displaystyle\int_{\Pi}\Big{[}C(1-\gamma)^{2}\sum\limits_{i=0}^{\infty}\frac{\gamma^{i}}{1-\gamma}\Big{]}d\lambda(\tilde{\pi})
\displaystyle\leq ΠC𝑑λ(π~)\displaystyle\int_{\Pi}Cd\lambda(\tilde{\pi})
=\displaystyle= C.\displaystyle C.

This completes the proof. ∎

D.2 Proof of Lemma 6

Note that the lemmas and proofs in this subsection are mainly adapted from Uehara et al. [2023], and similar statements can be found in the original paper. However, since we use d_{c} in place of d^{\mathcal{D}}, we present them here for clarity and to make our paper self-contained. We refer interested readers to the original paper for further details.

We first define the expected version of Eq. 10 as

(d,q,w)\displaystyle\mathcal{L}(d,q,w)\coloneqq 0.5𝔼d[q2(s,a)]+𝔼(s,a)dw𝒟,r=R(s,a),sP(s,a)[γmaxq(s,)+rq(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}_{w},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼𝒟w[γmaxq(s,)+rq(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}_{w}}\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}

where d^{\mathcal{D}}_{w}=d^{\mathcal{D}}\circ w, and \mathbb{E}_{\mathcal{D}_{w}} denotes the expectation with respect to the reweighted data-collecting process.
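For concreteness, here is a minimal tabular sketch of how the empirical counterpart \hat{\mathcal{L}}(d,q,w) of this objective could be computed from a logged dataset; the array layout and the function name are assumptions of ours, and Eq. 10 in the main text remains the authoritative definition:

import numpy as np

def empirical_loss(q, w, d, dataset, gamma):
    # q, w: arrays of shape (nS, nA); d: a distribution over (s, a) of shape (nS, nA);
    # dataset: iterable of (s, a, r, s_next) tuples with integer state/action indices.
    first_term = 0.5 * np.sum(d * q ** 2)
    td_terms = [w[s, a] * (gamma * q[s_next].max() + r - q[s, a])
                for (s, a, r, s_next) in dataset]
    return first_term + np.mean(td_terms)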

Lemma 12 (Expectation).

The expected value of ^(d,q,w)\hat{\mathcal{L}}(d,q,w) w.r.t. the data collecting process is (d,q,w)\mathcal{L}(d,q,w):

𝔼𝒟[^(d,q,w)]=(d,q,w).\displaystyle\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]=\mathcal{L}(d,q,w).
Proof.

Since only the second term of \hat{\mathcal{L}} is random, we additionally define

\displaystyle\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\coloneqq\frac{1}{N_{\mathcal{D}}}\sum\limits_{(s,a,r,s^{\prime})\in\mathcal{D}}w(s,a)\big[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big].

We can rearrange the expectation as follows,

𝔼𝒟[^(d,q,w)]=\displaystyle\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]= 𝔼𝒟[0.5𝔼d[q2(s,a)]+^𝒲(q,w)]\displaystyle\mathbb{E}_{\mathcal{D}}\Big{[}0.5\mathbb{E}_{d}[q^{2}(s,a)]+\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{]} (14)
=\displaystyle= 𝔼𝒟[0.5𝔼d[q2(s,a)]]+𝔼𝒟[^𝒲(q,w)]\displaystyle\mathbb{E}_{\mathcal{D}}\Big{[}0.5\mathbb{E}_{d}[q^{2}(s,a)]\Big{]}+\mathbb{E}_{\mathcal{D}}\Big{[}\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{]} (15)
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼𝒟[^𝒲(q,w)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}}\Big{[}\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{]} (16)

Then, by the i.i.d. assumption on the samples and the linearity of expectation,

𝔼𝒟[^(d,q,w)]=\displaystyle\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]= 0.5𝔼d[q2(s,a)]+𝔼𝒟[1N𝒟(s,a,r,s)𝒟[w(s,a)[γmaxq(s,)+rq(s,a)]]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}}\Bigg{[}\frac{1}{N_{\mathcal{D}}}\sum\limits_{(s,a,r,s^{\prime})\in\mathcal{D}}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}\Bigg{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+1N𝒟(s,a,r,s)𝒟𝔼𝒟[w(s,a)[γmaxq(s,)+rq(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\frac{1}{N_{\mathcal{D}}}\sum\limits_{(s,a,r,s^{\prime})\in\mathcal{D}}\mathbb{E}_{\mathcal{D}}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼𝒟[w(s,a)[γmaxq(s,)+rq(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{\mathcal{D}}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)d𝒟,r=R(s,a),sP(s,a)[w(s,a)[γmaxq(s,)+rq(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\Big{[}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)d𝒟[w(s,a)[𝔼sP(s,a)[γmaxq(s,)]+R(s,a)q(s,a)]]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\Big{[}w(s,a)\big{[}\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\gamma\max q(s^{\prime},\cdot)]+R(s,a)-q(s,a)\big{]}\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)dw𝒟[𝔼sP(s,a)[γmaxq(s,)]+R(s,a)q(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}_{w}}\Big{[}\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\gamma\max q(s^{\prime},\cdot)]+R(s,a)-q(s,a)\Big{]}
=\displaystyle= 0.5𝔼d[q2(s,a)]+𝔼(s,a)dw𝒟,r=R(s,a),sP(s,a)[γmaxq(s,)+rq(s,a)]\displaystyle 0.5\mathbb{E}_{d}[q^{2}(s,a)]+\mathbb{E}_{(s,a)\sim d_{w}^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}
=\displaystyle= (d,q,w).\displaystyle\mathcal{L}(d,q,w).

This completes the proof. ∎

Lemma 13 (Concentration).

For any fixed dd, with probability at least 1δ1-\delta, for any q𝒬q\in\mathcal{Q}, w𝒲w\in\mathcal{W},

|(d,q,w)^(d,q,w)|εstat.\displaystyle\Big{\lvert}\mathcal{L}(d,q,w)-\hat{\mathcal{L}}(d,q,w)\Big{\rvert}\leq\varepsilon_{\textup{stat}}.
Proof.

The statistical error only comes from ^𝒲\hat{\mathcal{L}}_{\mathcal{W}}, as

|(d,q,w)^(d,q,w)|=\displaystyle\Big{\lvert}\mathcal{L}(d,q,w)-\hat{\mathcal{L}}(d,q,w)\Big{\rvert}= |𝔼𝒟[^(d,q,w)]^(d,q,w)|\displaystyle\Big{\lvert}\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}(d,q,w)]-\hat{\mathcal{L}}(d,q,w)\Big{\rvert} (Lemma 12)
=\displaystyle= |𝔼𝒟[^𝒲(q,w)]^𝒲(q,w)|.\displaystyle\Big{\lvert}\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}_{\mathcal{W}}(q,w)]-\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{\rvert}. (Eq. 16)

Since each summand of \hat{\mathcal{L}}_{\mathcal{W}} is bounded:

q𝒬,w𝒲,a𝒜,s𝒮,|w(s,a)[γmaxq(s,)+rq(s,a)]|U𝒲Vmax,\displaystyle\forall q\in\mathcal{Q},w\in\mathcal{W},a\in\mathcal{A},s^{\prime}\in\mathcal{S},\quad\Big{\lvert}w(s,a)\big{[}\gamma\max q(s^{\prime},\cdot)+r-q(s,a)\big{]}\Big{\rvert}\leq U_{\mathcal{W}}V_{\textup{max}},

we can apply Hoeffding’s inequality which yields that, for any q𝒬q\in\mathcal{Q}, w𝒲w\in\mathcal{W}, with probability at least 1δ/(|𝒬||𝒲|)1-\delta/(\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert),

|𝔼𝒟[^𝒲(q,w)]^𝒲(q,w)|U𝒲Vmax2log(2|𝒬||𝒲|/δ)N𝒟.\displaystyle\Big{\lvert}\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{L}}_{\mathcal{W}}(q,w)]-\hat{\mathcal{L}}_{\mathcal{W}}(q,w)\Big{\rvert}\leq U_{\mathcal{W}}V_{\textup{max}}\sqrt{\frac{2\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{N_{\mathcal{D}}}}.

Finally, applying a union bound over \mathcal{Q}\times\mathcal{W} and rearranging terms, we get that, for any fixed d, with probability at least 1-\delta, for any q\in\mathcal{Q} and w\in\mathcal{W},

|(d,q,w)^(d,q,w)|U𝒲Vmax2log(2|𝒬||𝒲|/δ)N𝒟=εstat\displaystyle\Big{\lvert}\mathcal{L}(d,q,w)-\hat{\mathcal{L}}(d,q,w)\Big{\rvert}\leq U_{\mathcal{W}}V_{\textup{max}}\sqrt{\frac{2\log(2\lvert\mathcal{Q}\rvert\lvert\mathcal{W}\rvert/\delta)}{N_{\mathcal{D}}}}=\varepsilon_{\textup{stat}}

This completes the proof. ∎
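To get a feel for this rate, one can plug illustrative values (our own choices, not values from the paper) into the expression for \varepsilon_{\textup{stat}}:

import numpy as np

U_W, V_max = 10.0, 1.0 / (1.0 - 0.9)            # example scales only
Q_size, W_size, delta, N_D = 1e6, 1e6, 0.05, 1_000_000
eps_stat = U_W * V_max * np.sqrt(2 * np.log(2 * Q_size * W_size / delta) / N_D)
print(eps_stat)   # shrinks at the usual 1/sqrt(N_D) rate, up to log|Q||W| factors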

Lemma 14.

If w is non-negative \nu-a.e. (e.g., w\in\mathcal{W}), then for any q\colon\mathcal{S}\times\mathcal{A}\to[0,V_{\textup{max}}],

(d,q,w)(d,Q,w)0.5d,q2(Q)2+(γPπeI)dw𝒟,qQ.\displaystyle\mathcal{L}(d,q,w)-\mathcal{L}(d,Q^{\star},w)\geq 0.5\langle d,q^{2}-(Q^{\star})^{2}\rangle+\langle(\gamma P_{\pi^{\star}_{e}}-I)d^{\mathcal{D}}_{w},q-Q^{\star}\rangle. (17)
Proof.

This result simply comes from the definition:

\displaystyle\mathcal{L}(d,q,w)-\mathcal{L}(d,Q^{\star},w)
\displaystyle=0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)]-\mathbb{E}_{\mathcal{D}_{w}}[\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)]
\displaystyle=0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma\max q(s^{\prime},\cdot)+r-q(s,a)]-\mathbb{E}_{\mathcal{D}_{w}}[\gamma Q^{\star}(s^{\prime},\pi^{\star}_{e})+r-Q^{\star}(s,a)]
\displaystyle\geq 0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma q(s^{\prime},\pi^{\star}_{e})+r-q(s,a)]-\mathbb{E}_{\mathcal{D}_{w}}[\gamma Q^{\star}(s^{\prime},\pi^{\star}_{e})+r-Q^{\star}(s,a)]
\displaystyle=0.5\mathbb{E}_{d}[q^{2}(s,a)-(Q^{\star})^{2}(s,a)]
\displaystyle\quad+\mathbb{E}_{\mathcal{D}_{w}}[\gamma(q-Q^{\star})(s^{\prime},\pi^{\star}_{e})-(q-Q^{\star})(s,a)]
\displaystyle=0.5\langle d,q^{2}-(Q^{\star})^{2}\rangle+\langle d^{\mathcal{D}}_{w},(\gamma P^{\star}_{\pi^{\star}_{e}}-I)(q-Q^{\star})\rangle (rewrite the expectations as inner products)
\displaystyle=0.5\langle d,q^{2}-(Q^{\star})^{2}\rangle+\langle(\gamma P_{\pi^{\star}_{e}}-I)d^{\mathcal{D}}_{w},q-Q^{\star}\rangle. (adjoint of P_{\pi^{\star}_{e}})

This completes the proof. ∎

Lemma 15.

If Assumption 5 holds, with probability at least 1δ1-\delta, for any w𝒲w\in\mathcal{W} and any state-action distribution dd, we have

(d,q^,w)(d,Q,w)2εstat.\displaystyle\mathcal{L}(d,\hat{q},w)-\mathcal{L}(d,Q^{\star},w)\leq 2\varepsilon_{\textup{stat}}. (18)
Proof.

We can decompose Eq. 18 as follows,

(d,q^,w)(d,Q,w)=\displaystyle\mathcal{L}(d,\hat{q},w)-\mathcal{L}(d,Q^{\star},w)= (d,q^,w)^(d,q^,w)(1)+^(d,q^,w)^(d,q^,w^)(2)\displaystyle\underbrace{\mathcal{L}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},w)}_{(1)}+\underbrace{\hat{\mathcal{L}}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},\hat{w})}_{(2)}
+^(d,q^,w^)^(d,Q,w^(Q))(3)+^(d,Q,w^(Q))(d,Q,w^(Q))(4)\displaystyle+\underbrace{\hat{\mathcal{L}}(d,\hat{q},\hat{w})-\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))}_{(3)}+\underbrace{\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))}_{(4)}
+(d,Q,w^(Q))(d,Q,w)(5)\displaystyle+\underbrace{\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},w)}_{(5)}

where w^(q)=argmaxw𝒲^(d,q,w)\hat{w}(q)=\operatorname*{argmax}_{w\in\mathcal{W}}\hat{\mathcal{L}}(d,q,w). For the terms above, we have that:

  • (2) and (3) are non-positive due to the optimization procedure (\hat{w}=\hat{w}(\hat{q}) maximizes \hat{\mathcal{L}}(d,\hat{q},\cdot) over \mathcal{W}, and \hat{q} minimizes the resulting objective over \mathcal{Q}).

  • (1) and (4) can be bounded by concentration (Lemma 13).

  • For (5), since the Bellman optimality equation holds,

    s𝒮,a𝒜,𝔼sP(s,a)[γmaxQ(s,)]+R(s,a)Q(s,a)=0.\displaystyle\forall s\in\mathcal{S},a\in\mathcal{A},\quad\mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)\big{]}+R(s,a)-Q^{\star}(s,a)=0.

    We have that

    (5)=\displaystyle(5)= (d,Q,w^(Q))(d,Q,w)\displaystyle\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},w)
    \displaystyle=0.5\mathbb{E}_{d}[(Q^{\star})^{2}(s,a)]+\mathbb{E}_{\mathcal{D}_{\hat{w}(Q^{\star})}}\big[\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big]
    [0.5𝔼d[(Q)2(s,a)]+𝔼𝒟w[γmaxQ(s,)+rQ(s,a)]]\displaystyle-\Big{[}0.5\mathbb{E}_{d}[(Q^{\star})^{2}(s,a)]+\mathbb{E}_{\mathcal{D}_{w}}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big{]}\Big{]}
    =\displaystyle= 𝔼(s,a)dw^(Q)𝒟,r=R(s,a),sP(s,a)[γmaxQ(s,)+rQ(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim d_{\hat{w}(Q^{\star})}^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big{]}
    [𝔼(s,a)dw𝒟,r=R(s,a),sP(s,a)[γmaxQ(s,)+rQ(s,a)]]\displaystyle-\Big{[}\mathbb{E}_{(s,a)\sim d_{w}^{\mathcal{D}},r=R(s,a),s^{\prime}\sim P(\cdot\mid s,a)}\big{[}\gamma\max Q^{\star}(s^{\prime},\cdot)+r-Q^{\star}(s,a)\big{]}\Big{]}
    =\displaystyle= 𝔼(s,a)dw^(Q)𝒟[γ𝔼sP(,s,a)[maxQ(s,)]+R(s,a)Q(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim d_{\hat{w}(Q^{\star})}^{\mathcal{D}}}\Big{[}\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot,s,a)}[\max Q^{\star}(s^{\prime},\cdot)]+R(s,a)-Q^{\star}(s,a)\Big{]}
    𝔼(s,a)dw𝒟[γ𝔼sP(,s,a)[maxQ(s,)]+R(s,a)Q(s,a)]\displaystyle-\mathbb{E}_{(s,a)\sim d_{w}^{\mathcal{D}}}\Big{[}\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot,s,a)}[\max Q^{\star}(s^{\prime},\cdot)]+R(s,a)-Q^{\star}(s,a)\Big{]}
    =\displaystyle= 0.\displaystyle 0.

Thus, we conclude that with probability at least 1δ1-\delta,

\displaystyle\mathcal{L}(d,\hat{q},w)-\mathcal{L}(d,Q^{\star},w)\leq\underbrace{\mathcal{L}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},w)}_{(1)}+\underbrace{\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))}_{(4)}
\displaystyle\leq\big\lvert\mathcal{L}(d,\hat{q},w)-\hat{\mathcal{L}}(d,\hat{q},w)\big\rvert+\big\lvert\hat{\mathcal{L}}(d,Q^{\star},\hat{w}(Q^{\star}))-\mathcal{L}(d,Q^{\star},\hat{w}(Q^{\star}))\big\rvert
\displaystyle\leq 2εstat.\displaystyle 2\varepsilon_{\textup{stat}}. (Lemma 13)

This completes the proof. ∎

With the lemmas above, we are now ready to prove Lemma 6.

Lemma (L^{2} error of \hat{q} under d_{c}, restatement of Lemma 6).

If Assumptions 2, 5, 3, 8 and 6 hold, with probability at least 1δ1-\delta, the estimated q^\hat{q} from Algorithm 1 satisfies

q^Qdc,22εstat.\displaystyle\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}.
Proof.

By Assumption 3, dw𝒟=(IγPπ)1dcQd^{\mathcal{D}}_{w^{\star}}=(I-\gamma P_{\pi^{\star}})^{-1}d_{c}Q^{\star}, and from Lemma 14 we have

(dc,q^,w)(dc,Q,w)\displaystyle\mathcal{L}(d_{c},\hat{q},w^{\star})-\mathcal{L}(d_{c},Q^{\star},w^{\star})\geq 0.5dc,q^2(Q)2(IγPπ)(IγPπ)1dcQ,(q^Q)\displaystyle 0.5\langle d_{c},\hat{q}^{2}-(Q^{\star})^{2}\rangle-\langle(I-\gamma P_{\pi^{\star}})(I-\gamma P_{\pi^{\star}})^{-1}d_{c}Q^{\star},(\hat{q}-Q^{\star})\rangle
=\displaystyle= 0.5dc,q^2(Q)2dcQ,(q^Q)\displaystyle 0.5\langle d_{c},\hat{q}^{2}-(Q^{\star})^{2}\rangle-\langle d_{c}Q^{\star},(\hat{q}-Q^{\star})\rangle
=\displaystyle= 0.5dc,q^2(Q)2dc,Q(q^Q)\displaystyle 0.5\langle d_{c},\hat{q}^{2}-(Q^{\star})^{2}\rangle-\langle d_{c},Q^{\star}(\hat{q}-Q^{\star})\rangle
=\displaystyle= 0.5dc,(q^Q)2\displaystyle 0.5\langle d_{c},(\hat{q}-Q^{\star})^{2}\rangle
=\displaystyle= 0.5q^Qdc,22.\displaystyle 0.5\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}^{2}.

Together with Lemma 15, with probability at least 1δ1-\delta,

0.5q^Qdc,22(dc,q^,w)(dc,Q,w)2εstat.\displaystyle 0.5\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}^{2}\leq\mathcal{L}(d_{c},\hat{q},w^{\star})-\mathcal{L}(d_{c},Q^{\star},w^{\star})\leq 2\varepsilon_{\textup{stat}}.

Rearranging, we get

q^Qdc,22εstat\displaystyle\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}

This completes the proof. ∎

D.3 Proof of Lemma 7

Lemma (Restatement of Lemma 7).

If Assumptions 4 and 7 hold,

Q(,πe)Q(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2Uq^Qdc,1.\displaystyle 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}.
Proof.

We can rearrange the above term as

Q(,πe)Q(,π^),μc=\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle= Q(,πe)q^(,πe),μc+q^(,πe)q^(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\pi^{\star}_{e}),\mu_{c}\rangle+\langle\hat{q}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\hat{\pi}),\mu_{c}\rangle
+q^(,π^)Q(,π^),μc\displaystyle+\langle\hat{q}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle
\displaystyle\leq Q(,πe)q^(,πe),μc+q^(,π^)Q(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\pi^{\star}_{e}),\mu_{c}\rangle+\langle\hat{q}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle (Assumption 4)
\displaystyle\leq Q(,πe)q^(,πe)μc,1+q^(,π^)Q(,π^)μc,1\displaystyle\lVert Q^{\star}(\cdot,\pi^{\star}_{e})-\hat{q}(\cdot,\pi^{\star}_{e})\rVert_{\mu_{c},1}+\lVert\hat{q}(\cdot,\hat{\pi})-Q^{\star}(\cdot,\hat{\pi})\rVert_{\mu_{c},1}
=\displaystyle= Qq^μc×πe,1+q^Qμc×π^,1\displaystyle\lVert Q^{\star}-\hat{q}\rVert_{\mu_{c}\times\pi^{\star}_{e},1}+\lVert\hat{q}-Q^{\star}\rVert_{\mu_{c}\times\hat{\pi},1}
\displaystyle\leq 2UQq^dc,1\displaystyle 2U_{\mathcal{B}}\lVert Q^{\star}-\hat{q}\rVert_{d_{c},1}

The distribution shift comes from the fact that

μ×π1μ×π2=π1π2,\displaystyle\Big{\lVert}\frac{\mu\times\pi_{1}}{\mu\times\pi_{2}}\Big{\rVert}_{\infty}=\Big{\lVert}\frac{\pi_{1}}{\pi_{2}}\Big{\rVert}_{\infty},

and the shifts from \pi_{c} to \pi^{\star}_{e} and to \hat{\pi} are both bounded by U_{\mathcal{B}} due to Assumptions 4 and 7. This completes the proof. ∎

D.4 Proof of Theorem 1

Theorem (Finite sample guarantee of Algorithm 1, restatement of Theorem 1).

If Assumptions 1, 3, 4, 5, 2, 8, 6 and 7 hold with εc4CcUεstat1γ\varepsilon_{c}\geq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}, then with probability at least 1δ1-\delta, the output π^\hat{\pi} from Algorithm 1 is near-optimal:

JJπ^4CcUεstat1γ.\displaystyle J^{\star}-J_{\hat{\pi}}\leq\frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma}.
Proof.

From Lemma 6, we have that with probability at least 1δ1-\delta,

q^Qdc,1q^Qdc,22εstat.\displaystyle\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}\leq\lVert\hat{q}-Q^{\star}\rVert_{d_{c},2}\leq 2\sqrt{\varepsilon_{\textup{stat}}}.

Then apply Lemma 7 to bound the weighted advantage,

Q(,πe)Q(,π^),μc\displaystyle\langle Q^{\star}(\cdot,\pi^{\star}_{e})-Q^{\star}(\cdot,\hat{\pi}),\mu_{c}\rangle\leq 2Uq^Qdc,14Uεstat.\displaystyle 2U_{\mathcal{B}}\lVert\hat{q}-Q^{\star}\rVert_{d_{c},1}\leq 4U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}.

Finally, according to Lemma 3, \hat{\pi} is \frac{4C_{c}U_{\mathcal{B}}\sqrt{\varepsilon_{\textup{stat}}}}{1-\gamma} near-optimal. This completes the proof. ∎
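As a rough back-of-the-envelope illustration (every constant below is an assumption of ours, not a value from the paper), Theorem 1 can be inverted to estimate how large N_{\mathcal{D}} must be to reach a target suboptimality \epsilon; the absolute number is pessimistic, but it makes the \epsilon^{-4}(1-\gamma)^{-4} scaling implied by the \sqrt{\varepsilon_{\textup{stat}}} dependence explicit:

import numpy as np

# Illustrative constants (assumptions, not values from the paper).
C_c, U_B, U_W, gamma = 2.0, 2.0, 10.0, 0.9
V_max = 1.0 / (1.0 - gamma)
Q_size, W_size, delta = 1e6, 1e6, 0.05
eps_target = 0.1

# Invert J* - J_pi_hat <= 4 C_c U_B sqrt(eps_stat) / (1 - gamma) <= eps_target ...
eps_stat_needed = (eps_target * (1.0 - gamma) / (4.0 * C_c * U_B)) ** 2
# ... and eps_stat = U_W V_max sqrt(2 log(2|Q||W|/delta) / N_D).
log_term = np.log(2 * Q_size * W_size / delta)
N_needed = 2.0 * (U_W * V_max) ** 2 * log_term / eps_stat_needed ** 2
print(f"N_D >= {N_needed:.3e}")   # scales as eps_target^{-4} * (1 - gamma)^{-4}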