
The Nature of Temporal Difference Errors in
Multi-step Distributional Reinforcement Learning

Yunhao Tang, DeepMind, robintyh@deepmind.com
Mark Rowland, DeepMind, markrowland@deepmind.com
Rémi Munos, DeepMind, munos@deepmind.com
Bernardo Ávila Pires, DeepMind, bavilapires@deepmind.com
Will Dabney, DeepMind, wdabney@deepmind.com
Marc G. Bellemare, Google Brain, bellemare@google.com
Abstract

We study the multi-step off-policy learning approach to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. This distinction from the value-based case has important implications for concepts such as backward-view algorithms. Our work provides the first theoretical guarantees on multi-step off-policy distributional RL algorithms, including results that apply to the few existing approaches to multi-step distributional RL. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent, QR-DQN-Retrace, that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how the unique challenges of multi-step distributional RL can be addressed in both theory and practice.

1 Introduction

The return $\sum_{t=0}^{\infty}\gamma^{t}R_{t}$ is a fundamental concept in reinforcement learning (RL). In general, the return is a random variable, whose distribution captures important information such as the stochasticity in future events. While the classic view of value-based RL typically focuses on the expected return (bertsekas1996neuro; sutton1998; szepesvari2010algorithms), learning the full return distribution is of both theoretical and practical importance (morimura2010nonparametric; morimura2012parametric; bellemare2017distributional; yang2019fully; bodnar2019quantile; wurman2022outracing; bdr2022).

To design efficient algorithms for learning return distributions, a natural idea is to construct distributional equivalents of existing multi-step off-policy value-based algorithms. In value-based RL, multi-step learning tends to propagate useful information more efficiently, and off-policy learning is ubiquitous in modern RL systems. Meanwhile, the return distribution shares inherent commonalities with the expected return, thanks to the close connection between the distributional Bellman equation (morimura2010nonparametric; morimura2012parametric; bellemare2017distributional; bdr2022) and the celebrated value-based Bellman equation (sutton1998). The Bellman equation is foundational to value-based RL algorithms, including many multi-step off-policy methods (precup2001off; harutyunyan2016q; munos2016safe; mahmood2017multi). Due to the apparent similarity between distributional and value-based Bellman equations, should we expect key value-based concepts and algorithms to seamlessly transfer to distributional learning?

Our study indicates that the answer is no. There are critical differences between distributional and value-based RL, which require a distinct treatment of multi-step learning. Indeed, thanks to the focus on expected returns, the value-based setup offers many unique conceptual and computational simplifications in algorithmic design. However, we find that such simplifications do not hold for distributional learning. Multi-step distributional RL requires a deeper look at the connections between fundamental concepts such as $n$-step returns, TD errors and importance weights for off-policy learning. To this end, we make the following conceptual, theoretical and algorithmic contributions:

Distributional TD error.

We demonstrate the emergence of a novel notion of path-dependent distributional TD error (Section 4). Intriguingly, as the name suggests, path-dependent distributional TD errors are path-dependent, i.e., distributional TD errors at time $t$ depend on the sequence of immediate rewards $(R_{s})_{s=0}^{t-1}$. This differs from value-based TD errors, which are path-independent. We will show that the path-dependency property is not an artifact, but rather a fundamental property of distributional learning. We show numerically that naively constructing certain path-independent distributional TD errors does not produce convergent algorithms. The path-dependency property also has conceptual and computational impacts on forward-view estimates and backward-view algorithms.

Theory of multi-step distributional RL.

We derive distributional Retrace, a novel and generic multi-step off-policy operator for distributional learning. We prove that distributional Retrace is contractive and has the target return distribution as its fixed point. Distributional Retrace interpolates between the one-step distributional Bellman operator (bellemare2017distributional) and Monte-Carlo (MC) estimation with importance weighting (chandak2021universal), trading off the strengths of the two extremes.

Approximate multi-step distributional RL.

Finally, we derive Quantile Regression-Retrace, a novel algorithm combining distributional Retrace with quantile representations of distributions (dabney2018distributional) (Section 5). One major technical challenge is to define the quantile regression (QR) loss against signed measures, which are unavoidable in sample-based settings. We bypass the issue of the ill-defined QR loss and derive unbiased stochastic estimates of the QR loss gradient. This leads to QR-DQN-Retrace, a deep RL agent with performance improvements over QR-DQN on Atari-57 games.

In Figure 1, we illustrate how the back-up target is computed for multi-step distributional RL. In summary, our findings demonstrate how the unique challenges presented by multi-step distributional RL can be addressed both theoretically and empirically. Our study also opens up several exciting research pathways in this domain.

2 Background

Consider a Markov decision process (MDP) represented as the tuple $(\mathcal{X},\mathcal{A},P_{R},P,\gamma)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ the action space, $P_{R}:\mathcal{X}\times\mathcal{A}\rightarrow\mathscr{P}(\mathscr{R})$ the reward kernel (with $\mathscr{R}$ a finite set of possible rewards), $P:\mathcal{X}\times\mathcal{A}\rightarrow\mathscr{P}(\mathcal{X})$ the transition kernel and $\gamma\in[0,1)$ the discount factor. In general, we use $\mathscr{P}(A)$ to denote the set of distributions over a set $A$. We assume the reward takes values in a finite set mainly for notational simplicity; it is straightforward to extend our results to the general case. Let $\pi:\mathcal{X}\rightarrow\mathscr{P}(\mathcal{A})$ be a fixed policy. We use $(X_{t},A_{t},R_{t})_{t=0}^{\infty}\sim\pi$ to denote a random trajectory sampled from $\pi$, such that $A_{t}\sim\pi(\cdot|X_{t})$, $R_{t}\sim P_{R}(\cdot|X_{t},A_{t})$, $X_{t+1}\sim P(\cdot|X_{t},A_{t})$. Define $G^{\pi}(x,a)\coloneqq\sum_{t=0}^{\infty}\gamma^{t}R_{t}$ as the random return obtained by following $\pi$ starting from $(x,a)$. The Q-function $Q^{\pi}(x,a)\coloneqq\mathbb{E}[G^{\pi}(x,a)]$ is the expected return under policy $\pi$. For convenience, we also adopt the vector notation $Q\in\mathbb{R}^{\mathcal{X}\times\mathcal{A}}$. Define the one-step value-based Bellman operator $T^{\pi}:\mathbb{R}^{\mathcal{X}\times\mathcal{A}}\rightarrow\mathbb{R}^{\mathcal{X}\times\mathcal{A}}$ such that $T^{\pi}Q(x,a)\coloneqq\mathbb{E}[R_{0}+\gamma Q(X_{1},A_{1}^{\pi})\,|\,X_{0}=x,A_{0}=a]$, where $Q(X_{t},A_{t}^{\pi})\coloneqq\sum_{a}\pi(a|X_{t})Q(X_{t},a)$. The Q-function $Q^{\pi}$ satisfies $Q^{\pi}=T^{\pi}Q^{\pi}$ and is the unique fixed point of $T^{\pi}$.
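
To make the operator notation concrete, the following is a minimal tabular sketch of a single application of $T^{\pi}$; the array layout and the names `Q`, `R`, `P`, `pi` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bellman_operator(Q, R, P, pi, gamma):
    """One application of T^pi: (T^pi Q)(x, a) = E[R_0 + gamma * Q(X_1, A_1^pi)].

    Q:  (X, A) current Q-function estimate
    R:  (X, A) expected immediate reward r(x, a)
    P:  (X, A, X) transition probabilities P(x' | x, a)
    pi: (X, A) target-policy probabilities pi(a | x)
    """
    V_next = np.sum(pi * Q, axis=1)                      # Q(x', A'^pi) for every x'
    return R + gamma * np.einsum("xay,y->xa", P, V_next)
```

Iterating `Q = bellman_operator(Q, R, P, pi, gamma)` contracts to $Q^{\pi}$ at rate $\gamma$.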

Figure 1: Illustration of a multi-step distributional RL target, constructed as a sum of the initial distribution (left) and weighted distributional TD errors $\widetilde{\Delta}^{\pi}_{0:0}, c_{1}\widetilde{\Delta}^{\pi}_{0:1},\ldots$ across multiple time steps (middle and right); see Section 3 for further details and notation. In general, distributional TD errors are signed measures, as reflected by the downwards probability mass; they are also scaled by trace coefficients (e.g., $c_{1}$) to correct for off-policy discrepancies between the target and behavior policies.

2.1 Distributional reinforcement learning

In general, the return $G^{\pi}(x,a)$ is a random variable, and we define its distribution as $\eta^{\pi}(x,a)\coloneqq\text{Law}_{\pi}\left(G^{\pi}(x,a)\right)$. The return distribution satisfies the distributional Bellman equation (morimura2010nonparametric; morimura2012parametric; bellemare2017distributional; rowland2018analysis; bdr2022),

\eta^{\pi}(x,a)=\mathbb{E}_{\pi}\left[\left(\textrm{b}_{R_{0},\gamma}\right)_{\#}\eta^{\pi}\left(X_{1},A_{1}^{\pi}\right)\;\middle|\;X_{0}=x,A_{0}=a\right]\,, \qquad (1)

where $(\textrm{b}_{r,\gamma})_{\#}:\mathscr{P}(\mathbb{R})\rightarrow\mathscr{P}(\mathbb{R})$ is the pushforward operation defined through the function $\textrm{b}_{r,\gamma}(z)=r+\gamma z$ (rowland2018analysis). For convenience, we adopt the notation $\eta^{\pi}(X_{t},A_{t}^{\pi})\coloneqq\sum_{a}\pi(a|X_{t})\eta^{\pi}(X_{t},a)$. Throughout the paper, we focus on the space of distributions with bounded support, $\mathscr{P}_{\infty}(\mathbb{R})$. For any distribution vector $\eta\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, we define the distributional Bellman operator $\mathcal{T}^{\pi}:\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}\rightarrow\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$ as follows (rowland2018analysis; bdr2022),

\mathcal{T}^{\pi}\eta(x,a)\coloneqq\mathbb{E}\left[(\textrm{b}_{R_{0},\gamma})_{\#}\eta(X_{1},A_{1}^{\pi})\;\middle|\;X_{0}=x,A_{0}=a\right]\,. \qquad (2)

Let $\eta^{\pi}$ be the collection of return distributions under $\pi$; the distributional Bellman equation can then be rewritten as $\eta^{\pi}=\mathcal{T}^{\pi}\eta^{\pi}$. The distributional Bellman operator $\mathcal{T}^{\pi}$ is $\gamma$-contractive under the supremum $p$-Wasserstein distance (dabney2018distributional; bdr2022), so that $\eta^{\pi}$ is the unique fixed point of $\mathcal{T}^{\pi}$. See Appendix B for details of the distance metrics.
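
As a concrete illustration of the pushforward notation and of Equation (2), here is a minimal sketch that applies $\mathcal{T}^{\pi}$ once to a tabular return-distribution estimate represented by atoms and probabilities; all array names and shapes are assumptions made for the example.

```python
import numpy as np

def pushforward(atoms, probs, r, gamma):
    """(b_{r, gamma})# eta: shift and scale the support, keep the masses."""
    return r + gamma * atoms, probs

def distributional_bellman(atoms, probs, x, a, R_support, P_R, P, pi, gamma):
    """Back-up target T^pi eta(x, a) as a mixture over (reward, next state,
    next action) outcomes.  atoms, probs: (X, A, N) support/mass arrays;
    P_R: (X, A, K) reward probabilities over the finite support R_support."""
    out_atoms, out_probs = [], []
    for k, r in enumerate(R_support):
        for y in range(P.shape[-1]):
            for b in range(pi.shape[-1]):
                w = P_R[x, a, k] * P[x, a, y] * pi[y, b]
                if w > 0:
                    za, zp = pushforward(atoms[y, b], probs[y, b], r, gamma)
                    out_atoms.append(za)
                    out_probs.append(w * zp)
    return np.concatenate(out_atoms), np.concatenate(out_probs)
```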

2.2 Multi-step off-policy value-based learning

We provide a brief background on the value-based multi-step off-policy setting as a reference for the distributional case discussed below. In off-policy learning, the data is generated under a behavior policy $\mu$, which potentially differs from the target policy $\pi$. The aim is to evaluate the target Q-function $Q^{\pi}$. As a standard assumption, we require $\text{supp}(\pi(\cdot|x))\subseteq\text{supp}(\mu(\cdot|x))$ for all $x\in\mathcal{X}$. Let $\rho_{t}\coloneqq\pi(A_{t}|X_{t})/\mu(A_{t}|X_{t})$ be the step-wise importance sampling (IS) ratio at time step $t$. Step-wise IS ratios are critical in correcting for the off-policy discrepancy between $\pi$ and $\mu$.

Let $c_{t}\in[0,\rho_{t}]$ be a time-dependent trace coefficient. We denote $c_{1:t}=c_{1}\cdots c_{t}$ and define $c_{1:0}=1$ by convention. Consider a generic form of the return-based off-policy operator $R^{\pi,\mu}$ as in (munos2016safe),

R^{\pi,\mu}Q(x,a)\coloneqq Q(x,a)+\mathbb{E}_{\mu}\left[\sum_{t=0}^{\infty}c_{1:t}\gamma^{t}\underbrace{\left(R_{t}+\gamma Q\left(X_{t+1},A_{t+1}^{\pi}\right)-Q(X_{t},A_{t})\right)}_{\delta_{t}^{\pi}=\text{value-based TD error}}\right]. \qquad (3)

In the above and below, we omit conditioning on $X_{0}=x,A_{0}=a$ for conciseness. The general form of $R^{\pi,\mu}$ encompasses many important special cases: in the on-policy case with $c_{t}=\lambda$, it recovers the Q-function variant of TD($\lambda$) (sutton1998; harutyunyan2016q); with $c_{t}=\lambda\min(\overline{c},\rho_{t})$, it recovers a specific form of Retrace (munos2016safe); with $c_{t}=\rho_{t}$, it recovers the importance sampling (IS) operator. The back-up target is computed as a mixture of TD errors $\delta_{t}^{\pi}$, each calculated from one-step transition data. We also define the discounted TD error $\widetilde{\delta}_{t}^{\pi}=\gamma^{t}\delta_{t}^{\pi}$, which can be interpreted as the difference between $n$-step returns at two consecutive time steps $t$ and $t+1$, as we discuss in Section 4. As we will detail, the properties of $\widetilde{\delta}_{t}^{\pi}$ mark a significant difference from the distributional RL setting.

By design, $R^{\pi,\mu}$ has $Q^{\pi}$ as its unique fixed point. Multi-step updates make use of rewards from multiple time steps, propagating learning signals more efficiently. This is reflected by the fact that $R^{\pi,\mu}$ is $\beta$-contractive with $\beta\in[0,\gamma]$ (munos2016safe) and often contracts to $Q^{\pi}$ faster than the one-step Bellman operator $T^{\pi}$. Our goal is to design distributional equivalents of multi-step off-policy operators, which can lead to concrete algorithms with sample-based learning.
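
For reference in the distributional case below, here is a minimal sketch of a sampled (truncated) value-based Retrace target from Equation (3) for a single off-policy trajectory; the trajectory container and the trace rule $c_t=\lambda\min(1,\rho_t)$ are illustrative assumptions.

```python
import numpy as np

def retrace_target(Q, traj, pi, mu, gamma, lam=1.0):
    """traj: list of transitions (x_t, a_t, r_t, x_{t+1}) starting at (x, a)."""
    x0, a0 = traj[0][0], traj[0][1]
    target = Q[x0, a0]
    trace = 1.0                                            # c_{1:0} = 1 by convention
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:
            trace *= lam * min(1.0, pi[x, a] / mu[x, a])   # c_{1:t} = c_1 ... c_t
        td = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]   # delta_t^pi
        target += trace * gamma ** t * td
    return target
```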

3 Multi-step off-policy distributional reinforcement learning

We now present the core theoretical results relating to multi-step distributional operators. In general, the aim is to evaluate the target distribution $\eta^{\pi}$ with access to off-policy data generated under $\mu$.

Below, we use $G_{t^{\prime}:t}=\sum_{s=t^{\prime}}^{t}\gamma^{s-t^{\prime}}R_{s}$ to denote the partial sum of discounted rewards between two time steps $t^{\prime}\leq t$. We define the generic form of the multi-step off-policy distributional operator $\mathcal{R}^{\pi,\mu}$ such that for any $\eta\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, its back-up target $\mathcal{R}^{\pi,\mu}\eta(x,a)$ is computed as

\eta(x,a)+\mathbb{E}_{\mu}\left[\sum_{t=0}^{\infty}c_{1:t}\cdot\left(\underbrace{\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)-\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(X_{t},A_{t})}_{\widetilde{\Delta}_{0:t}^{\pi}=\text{multi-step distributional TD error}}\right)\right]. \qquad (4)

To simplify the naming, we call $\mathcal{R}^{\pi,\mu}$ the distributional Retrace operator. Distributional Retrace only requires $c_{t}\in[0,\rho_{t}]$ and thus represents a large family of distributional operators. Throughout, we make heavy use of pushforward notation: instead of working directly with the random variable $G^{\pi}$, we find it much more convenient to express the various multi-step operations as pushforwards.

The back-up target $\mathcal{R}^{\pi,\mu}\eta(x,a)$ is written as a weighted sum of the path-dependent distributional TD errors $\widetilde{\Delta}_{0:t}^{\pi}$, which we discuss extensively in Section 4. Though the form of $\mathcal{R}^{\pi,\mu}$ bears certain similarities to the value-based operator in Equation (3), the critical differences lie in the subtle definition of the distributional TD errors $\widetilde{\Delta}_{0:t}^{\pi}$ and in where the traces $c_{1:t}$ are placed for off-policy corrections. We unpack the insights behind the design of the operator in Section 4.
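
To make Equation (4) concrete, here is a minimal sketch of a sampled back-up target for return distributions represented by $m$ equally weighted atoms `z[x, a, :]` (a quantile-style representation). The output is a signed measure stored as atoms with signed weights; the names and the trace rule are assumptions for the example, not the paper's implementation.

```python
import numpy as np

def distributional_retrace_target(z, traj, pi, mu, gamma, lam=1.0):
    """Sampled R^{pi,mu} eta(x, a) for one trajectory traj = [(x_t, a_t, r_t, x_{t+1}), ...].

    z: (X, A, m) atom locations of eta; returns (atoms, signed_weights)."""
    m = z.shape[-1]
    x0, a0 = traj[0][0], traj[0][1]
    atoms, weights = [z[x0, a0]], [np.full(m, 1.0 / m)]          # eta(x, a)
    G, trace = 0.0, 1.0                                          # G_{0:t-1}, c_{1:t}
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:
            trace *= lam * min(1.0, pi[x, a] / mu[x, a])
        # positive part: (b_{G_{0:t}, gamma^{t+1}})# eta(X_{t+1}, A^pi_{t+1})
        for b in range(z.shape[1]):
            atoms.append(G + gamma ** t * r + gamma ** (t + 1) * z[x_next, b])
            weights.append(trace * pi[x_next, b] / m * np.ones(m))
        # negative part: (b_{G_{0:t-1}, gamma^t})# eta(X_t, A_t)
        atoms.append(G + gamma ** t * z[x, a])
        weights.append(-trace / m * np.ones(m))
        G += gamma ** t * r                                      # update to G_{0:t}
    return np.concatenate(atoms), np.concatenate(weights)
```

The returned weights sum to one but are not all non-negative, which is exactly the signed-measure issue addressed in Section 5.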

Below, we first present theoretical properties of the distributional Retrace operator. We start with a key property that underlies many ensuing theoretical results. Given a fixed $n$-step reward sequence $r_{0:n-1}$ and a fixed state-action pair $(x,a)\in\mathcal{X}\times\mathcal{A}$, we call pushforward distributions of the form $\left(\textrm{b}_{\sum_{s=0}^{n-1}\gamma^{s}r_{s},\gamma^{n}}\right)_{\#}\eta(x,a)$ the $n$-step target distributions. Our result shows that the back-up target of Retrace is a convex combination of $n$-step target distributions with varying values of $n$.

Lemma 3.1.

(Convex combination) The Retrace back-up target is a convex combination of $n$-step target distributions. Formally, there exists an index set $I(x,a)$ such that $\mathcal{R}^{\pi,\mu}\eta(x,a)=\sum_{i\in I(x,a)}w_{i}\eta_{i}$, where $w_{i}\geq 0$, $\sum_{i\in I(x,a)}w_{i}=1$, and $(\eta_{i})_{i\in I(x,a)}$ are $n_{i}$-step target distributions.

Since $\mathcal{R}^{\pi,\mu}\eta\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, we can measure the contraction of $\mathcal{R}^{\pi,\mu}$ under probability metrics.

Proposition 3.2.

(Contraction) $\mathcal{R}^{\pi,\mu}$ is $\beta$-contractive under the supremum $p$-Wasserstein distance, where $\beta=\max_{x\in\mathcal{X},a\in\mathcal{A}}\sum_{t=1}^{\infty}\mathbb{E}_{\mu}\left[c_{1}\cdots c_{t-1}(1-c_{t})\right]\gamma^{t}\leq\gamma$.

The contraction rate of the distributional Retrace operator is determined by its effective horizon. At one extreme, when $c_{t}=0$, the effective horizon is $1$ and $\beta=\gamma$, in which case Retrace recovers the one-step operator. At the other extreme, when $c_{t}=\rho_{t}$, the effective horizon is infinite, which gives $\beta=0$. This latter case can be understood as correcting for all off-policy discrepancies with IS, which is very efficient in expectation but incurs high variance under sample-based approximations. Proposition 3.2 also implies that the distributional Retrace operator has a unique fixed point.

Proposition 3.3.

(Unique fixed point) $\mathcal{R}^{\pi,\mu}$ has $\eta^{\pi}$ as its unique fixed point in $\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$.

The above result suggests that starting with $\eta_{0}\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, the recursion $\eta_{k+1}=\mathcal{R}^{\pi,\mu}\eta_{k}$ produces iterates $(\eta_{k})_{k=0}^{\infty}\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$ which converge to $\eta^{\pi}$ in $\overline{W}_{p}$ at a rate of $\mathcal{O}(\beta^{k})$.

4 Understanding multi-step distributional reinforcement learning

Now, we pause and take a closer look at the construction of the distributional Retrace operator. We present a number of insights that distinguish distributional learning from value-based learning.

4.1 Path-dependent TD error

The value-based Retrace back-up target can be written as a mixture of value-based TD errors. To better parse the distributional Retrace operator and draw comparisons to the value-based setting, we seek to rewrite the distributional back-up target $\mathcal{R}^{\pi,\mu}\eta(x,a)$ as a weighted sum of some notion of distributional TD errors. To this end, we start with a natural analogue of the value-based TD error.

Definition 4.1.

(Distributional TD error) Given a transition $(X_{t},A_{t},R_{t},X_{t+1})$, define the associated distributional TD error as $\Delta^{\pi}(X_{t},A_{t},R_{t},X_{t+1})\coloneqq(\textrm{b}_{R_{t},\gamma})_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)-\eta(X_{t},A_{t})$.

When the context is clear, we also adopt the concise notation $\Delta_{t}^{\pi}=\Delta^{\pi}(X_{t},A_{t},R_{t},X_{t+1})$. By construction, distributional TD errors are signed measures with zero total mass (bdr2022). The distributional TD error is a natural counterpart to the value-based TD error, because both stem directly from the corresponding one-step Bellman operators. However, unlike in value-based RL, where TD errors alone suffice to specify the multi-step learning operator (Equation (3)), in distributional RL this is not enough. We introduce the path-dependent distributional TD error, which serves as the building block of distributional Retrace.

Definition 4.2.

(Path-dependent distributional TD error) Given a trajectory $(X_{s},A_{s},R_{s})_{s=0}^{\infty}$, define the path-dependent distributional TD error at time $t\geq 0$ as follows,

\widetilde{\Delta}_{0:t}^{\pi}\coloneqq\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\Delta_{t}^{\pi}. \qquad (5)

Path-dependent distributional TD errors are defined as pushforward measures of $\Delta_{t}^{\pi}$, where the pushforward operations depend on $G_{0:t-1}$. This equips $\widetilde{\Delta}_{0:t}^{\pi}$ with an intriguing property: path-dependency. Concretely, this means that the path-dependent distributional TD error depends on the sequence of rewards $(R_{s})_{s=0}^{t-1}$ leading up to step $t$. With the above definitions, we can finally rewrite the back-up target of distributional Retrace as a weighted sum of path-dependent distributional TD errors, $\mathcal{R}^{\pi,\mu}\eta(x,a)=\eta(x,a)+\mathbb{E}_{\mu}[\sum_{t=0}^{\infty}c_{1:t}\widetilde{\Delta}_{0:t}^{\pi}]$. We now illustrate the difference between value-based and distributional TD errors.

Comparison with value-based TD equivalents.

The value-based equivalent to the path-dependent distributional TD error is the discounted value-based TD error $\widetilde{\delta}_{t}^{\pi}=\gamma^{t}\delta_{t}^{\pi}$, which we briefly mentioned in Section 2. To see why, note that discounted value-based TD errors allow us to rewrite the value-based Retrace back-up target as $R^{\pi,\mu}Q(x,a)=Q(x,a)+\mathbb{E}_{\mu}[\sum_{t=0}^{\infty}c_{1:t}\widetilde{\delta}_{t}^{\pi}]$. For a direct comparison between the two settings, we rewrite both $\widetilde{\Delta}_{0:t}^{\pi}$ and $\widetilde{\delta}_{t}^{\pi}$ as the difference between two $n$-step predictions evaluated at time steps $t$ and $t+1$,

\widetilde{\Delta}_{0:t}^{\pi}=\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)-\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(X_{t},A_{t}), \qquad \text{(Distributional)}
\widetilde{\delta}_{t}^{\pi}=\left(G_{0:t}+\gamma^{t+1}Q\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\left(G_{0:t-1}+\gamma^{t}Q(X_{t},A_{t})\right). \qquad \text{(Value-based)}

The above rewriting attributes the path-dependency to the fact that the $n$-step distributional prediction $\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)$ is non-linear in $G_{0:t}$. Indeed, in the value-based setting, because $G_{0:t}=G_{0:t-1}+\gamma^{t}R_{t}$, the partial sum of rewards $G_{0:t-1}$ cancels out as a common term. This leaves the discounted TD error $\widetilde{\delta}_{t}^{\pi}$ path-independent: the computation of $\widetilde{\delta}_{t}^{\pi}$ does not depend on the past rewards $(R_{s})_{s=0}^{t-1}$. In contrast, in the distributional setting, the pushforward operations are non-linear in the partial sum of rewards $G_{0:t-1}$. As a result, $G_{0:t-1}$ does not cancel out in the definition of $\widetilde{\Delta}_{0:t}^{\pi}$, making the path-dependent TD error $\widetilde{\Delta}_{0:t}^{\pi}$ depend on the past rewards $(R_{s})_{s=0}^{t-1}$.
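
A tiny numerical illustration of this point, with made-up numbers: for the same transition at time $t=1$ but two different first rewards $R_0$, the discounted value-based TD error is unchanged, while the path-dependent distributional TD error (a signed measure, printed here as its positive and negative atoms) shifts with $G_{0:0}=R_0$.

```python
import numpy as np

gamma, t = 0.9, 1
R_t, Q_xt_at, Q_next_pi = 1.0, 3.0, 2.5            # transition quantities at time t
eta_xt_at = np.array([2.0, 4.0])                   # two equal-mass atoms of eta(X_t, A_t)
eta_next_pi = np.array([1.0, 5.0])                 # atoms of eta(X_{t+1}, A_{t+1}^pi)

for R_0 in (0.0, 10.0):                            # two different reward "paths"
    G_prev = R_0                                   # G_{0:t-1} with t = 1
    delta_tilde = gamma ** t * (R_t + gamma * Q_next_pi - Q_xt_at)          # value-based
    plus_atoms = (G_prev + gamma ** t * R_t) + gamma ** (t + 1) * eta_next_pi
    minus_atoms = G_prev + gamma ** t * eta_xt_at
    print(R_0, delta_tilde, plus_atoms, minus_atoms)
# delta_tilde is identical on both paths; the atoms of the signed measure shift with R_0.
```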

The path-dependency property is not an artifact of the distributional Retrace operator $\mathcal{R}^{\pi,\mu}$; instead, it is an indispensable element for convergent multi-step distributional learning in general. We show this by empirically verifying that multi-step learning operators based on alternative definitions of path-independent distributional TD errors are non-convergent even for simple problems.

Figure 2: Non-convergent example: comparing $L_{p}(\mathcal{R}^{k}\eta_{0},\eta^{\pi})$ across iterations for $10$ randomly initialized runs. Note that $(\overline{\mathcal{R}}^{\pi,\mu})^{k}\eta_{0}$ does not converge to $\eta^{\pi}$ while the others do.

Numerically non-convergent path-independent operators.

Consider the path-independent distributional TD error $\overline{\Delta}_{t}^{\pi}\coloneqq\left(\textrm{b}_{0,\gamma^{t}}\right)_{\#}\Delta_{t}^{\pi}$. We arrive at this definition by dropping the path-dependent term $G_{0:t-1}$ in the pushforward of $\widetilde{\Delta}_{0:t}^{\pi}$. Such a definition seems appealing because when $\eta=\eta^{\pi}$, the error is zero in expectation, $\mathbb{E}_{\mu}\left[\overline{\Delta}_{t}^{\pi}|X_{t},A_{t}\right]=0$. This suggests constructing a multi-step operator as a weighted sum of the alternative path-independent TD errors, $\overline{\mathcal{R}}^{\pi,\mu}\eta(x,a)\coloneqq\eta(x,a)+\mathbb{E}_{\mu}\left[\sum_{t=0}^{\infty}c_{1:t}\overline{\Delta}_{t}^{\pi}\right]$. By construction, $\overline{\mathcal{R}}^{\pi,\mu}$ has $\eta^{\pi}$ as one fixed point.

We provide a very simple counterexample on which $\overline{\mathcal{R}}^{\pi,\mu}$ is not contractive: consider an MDP with one state and one action. The state transitions back to itself with a deterministic reward $R_{t}=1$. When the discount factor is $\gamma=0.5$, $\eta^{\pi}$ is a Dirac distribution centered at $2$. We consider the simple case $c_{1}=\rho_{1}$ and $c_{t}=0$ for all $t\geq 2$. We use the $L_{p}$ distance to measure the convergence of the distribution iterates (bdr2022). Figure 2 shows that $(\overline{\mathcal{R}}^{\pi,\mu})^{k}\eta_{0}$ does not converge to $\eta^{\pi}$, while the one-step Bellman operator $\mathcal{T}^{\pi}$ and distributional Retrace $\mathcal{R}^{\pi,\mu}$ are convergent.
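
A minimal sketch of the dynamic-programming iteration behind this counterexample (grid resolution and iteration count are illustrative assumptions): signed measures are stored as atoms with signed masses, the path-independent operator simplifies here to $\overline{\mathcal{R}}^{\pi,\mu}\eta=(\textrm{b}_{1,0.5})_{\#}\eta+(\textrm{b}_{0.5,0.25})_{\#}\eta-(\textrm{b}_{0,0.5})_{\#}\eta$, and with these traces distributional Retrace reduces to the two-step target $(\textrm{b}_{1.5,0.25})_{\#}\eta$.

```python
import numpy as np

GAMMA, R = 0.5, 1.0
GRID = np.linspace(-1.0, 4.0, 2001)                 # evaluation grid for CDFs

def cdf(atoms, masses, grid):
    """CDF of a (possibly signed) discrete measure, evaluated on a grid."""
    order = np.argsort(atoms)
    cum = np.cumsum(masses[order])
    idx = np.searchsorted(atoms[order], grid, side="right")
    return np.where(idx > 0, cum[np.maximum(idx - 1, 0)], 0.0)

def l2_to_dirac(atoms, masses, target=2.0):
    """Approximate L_2 distance to the Dirac delta at `target`."""
    f, g = cdf(atoms, masses, GRID), (GRID >= target).astype(float)
    return np.sqrt(np.sum((f - g) ** 2) * (GRID[1] - GRID[0]))

def path_independent_step(atoms, masses):
    # bar-R eta = (b_{1,.5})# eta + (b_{.5,.25})# eta - (b_{0,.5})# eta
    new_atoms = np.concatenate([R + GAMMA * atoms,
                                GAMMA * R + GAMMA ** 2 * atoms,
                                GAMMA * atoms])
    return new_atoms, np.concatenate([masses, masses, -masses])

def retrace_step(atoms, masses):
    # R eta = (b_{1 + .5, .25})# eta, which stays a probability measure.
    return R + GAMMA * R + GAMMA ** 2 * atoms, masses

for name, step in [("path-independent", path_independent_step),
                   ("distributional Retrace", retrace_step)]:
    atoms, masses = np.array([0.0]), np.array([1.0])     # eta_0 = delta_0
    for _ in range(9):
        atoms, masses = step(atoms, masses)
    print(name, l2_to_dirac(atoms, masses))
```

The Retrace iterate contracts toward $\delta_{2}$, whereas the path-independent iterate does not, consistent with Figure 2.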

In Appendix C, we discuss yet another alternative to $\widetilde{\Delta}_{0:t}^{\pi}$ designed to be path-independent, $\gamma^{t}\Delta_{t}^{\pi}$. Though the resulting multi-step operator still has $\eta^{\pi}$ as one fixed point, we show numerically that it is not contractive on the same simple example. These results demonstrate that naively removing the path-dependency might lead to non-convergent multi-step operators.

4.2 Backward-view of distributional multi-step learning

To highlight the difference between distributional and value-based multi-step learning, we discuss the impact that path-dependent distributional TD errors have on backward-view distributional algorithms. Thus far, distributional back-up targets have been expressed in the forward view, i.e., the back-up target at time $t$ is calculated as a function of future transition tuples $(X_{s},A_{s},R_{s})_{s\geq t}$. Forward-view algorithms, unless truncated, wait until the episode finishes to carry out the update, which might be undesirable when the problem is non-episodic or has a very long horizon.

In the backward view, when encountering a distributional TD error $\Delta_{t}^{\pi}$, the algorithm carries out updates for all predictions at times $t^{\prime}\leq t$ (sutton1998). To this end, the algorithm needs to maintain additional partial return traces, i.e., the partial sums of rewards $G_{t^{\prime}:t}$, in order to calculate the path-dependent TD errors $\widetilde{\Delta}_{t^{\prime}:t}^{\pi}$. Unlike the value-based state-dependent eligibility traces (sutton1998; van2020expected), partial return traces are time-dependent. This implies that in an episode of $T$ steps, value-based backward-view algorithms require memory of size $\min(|\mathcal{X}||\mathcal{A}|,\mathcal{O}(T))$ while distributional algorithms require $\mathcal{O}(T)$.

In addition to the added memory complexity, the incremental updates of distributional algorithms are also much more complicated due to the path-dependent TD errors. We remark that the path-independent nature of value-based TD errors greatly simplifies value-based backward-view algorithms. For a more detailed discussion, see Appendix D.

4.3 Importance sampling for multi-step distributional RL

In our initial derivation, we arrived at $\mathcal{R}^{\pi,\mu}$ by applying importance sampling (IS) in a different way from the value-based setting. We now highlight the subtle differences and caveats.

For a fixed $n\geq 1$, consider the trace coefficient $c_{t}=\rho_{t}\mathbb{I}[t<n]$. The back-up target of the resulting Retrace operator reduces to $\mathbb{E}_{\mu}\left[\rho_{1:n-1}\cdot\left(\textrm{b}_{G_{0:n-1},\gamma^{n}}\right)_{\#}\eta\left(X_{n},A_{n}^{\pi}\right)\right]$. This can be seen as applying IS to the $n$-step prediction $\left(\textrm{b}_{G_{0:n-1},\gamma^{n}}\right)_{\#}\eta\left(X_{n},A_{n}^{\pi}\right)$. As a caveat, note that an appealing alternative is to apply IS to $G_{0:n-1}$ itself, producing the estimate $\left(\textrm{b}_{\rho_{1:n-1}G_{0:n-1},\gamma^{n}}\right)_{\#}\eta\left(X_{n},A_{n}^{\pi}\right)$. This latter estimate does not properly correct for the off-policy discrepancy between $\pi$ and $\mu$. To see why, note that applying the IS ratio to $G_{0:n-1}$, instead of to the probability of its occurrence, only works in value-based RL because the expected return is linear in $G_{0:t}$ (precup2001off). In general, for distributional RL one should importance weight the measures rather than the sums of rewards.

5 Approximate multi-step distributional reinforcement learning algorithm

We now discuss how the distributional Retrace operator combines with parametric distributions, using the construction of the novel Quantile Regression-Retrace algorithm as a practical example. We focus on the quantile representation because it yields the best empirical performance among large-scale distributional RL methods (dabney2018distributional; dabney2018implicit). Specifically, we present an application of quantile regression to signed measures, which is interesting in its own right. Below, we start with a brief background on quantile representations (dabney2018distributional), followed by details of the proposed algorithm.

Consider parametric distributions of the form $\frac{1}{m}\sum_{i=1}^{m}\delta_{z_{i}}$ for a fixed $m\geq 1$, where $(z_{i})_{i=1}^{m}\in\mathbb{R}$ are parameters indicating the support of the distribution. Let $\mathscr{P}_{\mathcal{Q}}(\mathbb{R})\coloneqq\{\frac{1}{m}\sum_{i=1}^{m}\delta_{z_{i}}\,|\,z_{i}\in\mathbb{R}\}$ denote this family of distributions. We define the projection $\Pi_{\mathcal{Q}}:\mathscr{P}_{\infty}(\mathbb{R})\rightarrow\mathscr{P}_{\mathcal{Q}}(\mathbb{R})$ as $\Pi_{\mathcal{Q}}\eta=\arg\min_{\nu\in\mathscr{P}_{\mathcal{Q}}(\mathbb{R})}W_{1}(\eta,\nu)$, which projects any distribution onto the space of representable distributions in the parametric class under the $W_{1}$ distance. With an abuse of notation, we also let $\Pi_{\mathcal{Q}}$ denote the component-wise projection when applied to vectors. See (dabney2018distributional; bdr2022) for more details.

Gradient-based learning via quantile regression.

We can use quantile regression (koenker1978regression; koenker2005quantile; koenker2017handbook) to calculate the projection $\Pi_{\mathcal{Q}}\eta$. Let $F_{\eta}(z),z\in\mathbb{R}$ denote the CDF of a given distribution $\eta$ and let $F_{\eta}^{-1}$ be its generalized inverse; we define the $\tau$-th quantile as $F_{\eta}^{-1}(\tau)$ for $\tau\in[0,1]$. The projection $\Pi_{\mathcal{Q}}$ is equivalent to computing $z_{i}=F_{\eta}^{-1}(\tau_{i})$ for $\tau_{i}=\frac{2i-1}{2m}$, $i=1,\ldots,m$ (dabney2018distributional). To learn the $\tau$-th quantile for any $\tau\in[0,1]$, it suffices to solve the quantile regression problem whose optimal solution is $F_{\eta}^{-1}(\tau)$: $\min_{\theta}L_{\theta}^{\tau}(\eta)\coloneqq\mathbb{E}_{Z\sim\eta}\left[f_{\tau}(Z-\theta)\right]$, where $f_{\tau}(u)=u(\tau-\mathbb{I}[u<0])$. In practice, we carry out the gradient update $\theta\leftarrow\theta-\alpha\nabla_{\theta}L_{\theta}^{\tau}(\eta)$ to find the optimal solution and learn the quantile $\theta\approx F_{\eta}^{-1}(\tau)$.
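
As a self-contained illustration (not the paper's code), the pinball loss and its gradient for a discrete distribution given by `atoms` and `probs` can be written as follows; the step size and the target distribution are made-up.

```python
import numpy as np

def qr_loss(theta, tau, atoms, probs):
    """L_theta^tau(eta) = E_{Z ~ eta}[f_tau(Z - theta)], with f_tau(u) = u (tau - 1{u < 0})."""
    u = atoms - theta
    return np.sum(probs * u * (tau - (u < 0)))

def qr_grad(theta, tau, atoms, probs):
    """d/d theta of the loss above: E[1{Z < theta} - tau]."""
    return np.sum(probs * ((atoms < theta) - tau))

# Gradient descent on the loss recovers the tau-quantile of eta.
atoms, probs = np.array([0.0, 1.0, 2.0, 3.0]), np.full(4, 0.25)
theta, tau, alpha = 0.0, 0.5, 0.1
for _ in range(2000):
    theta -= alpha * qr_grad(theta, tau, atoms, probs)
print(theta)   # settles at a median of eta (any point in [1, 2])
```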

5.1 Distributional Retrace with quantile representations

Given an input distribution vector $\eta$, we use the distributional Retrace operator to construct the back-up target $\mathcal{R}^{\pi,\mu}\eta$. Then, we use the quantile projection to map the back-up target onto the space of representable distributions, $\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}\eta$. Overall, we are interested in the recursive update: starting with any $\eta_{0}\in\mathscr{P}_{\mathcal{Q}}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, consider the sequence of distributions generated via $\eta_{k+1}=\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}\eta_{k}$. A direct application of Proposition 3.2 allows us to characterize the convergence of this sequence, following the approach of (bdr2022).

Theorem 5.1.

(Convergence of quantile distributions) The projected distributional Retrace operator $\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}$ is $\beta$-contractive under the $\overline{W}_{\infty}$ distance in $\mathscr{P}_{\mathcal{Q}}(\mathbb{R})$. As a result, the above $\eta_{k}$ converges to a limiting distribution $\eta_{\mathcal{R}}^{\pi}$ in $\overline{W}_{\infty}$, such that $\overline{W}_{\infty}(\eta_{k},\eta_{\mathcal{R}}^{\pi})\leq\beta^{k}\,\overline{W}_{\infty}(\eta_{0},\eta_{\mathcal{R}}^{\pi})$. Further, the quality of the fixed point is characterized by $\overline{W}_{\infty}(\eta_{\mathcal{R}}^{\pi},\eta^{\pi})\leq(1-\beta)^{-1}\overline{W}_{\infty}(\Pi_{\mathcal{Q}}\eta^{\pi},\eta^{\pi})$.

Thanks to the faster contraction rate $\beta\leq\gamma$, the advantage of the projected operator $\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}$ is two-fold: (1) the operator often contracts faster to its limiting distribution $\eta_{\mathcal{R}}^{\pi}$ than the one-step operator $\mathcal{T}^{\pi}$ contracts to its own limiting distribution $\eta_{\mathcal{T}^{\pi}}$ (dabney2018distributional); (2) the limiting distribution $\eta_{\mathcal{R}}^{\pi}$ also enjoys a better approximation bound to the target distribution. We verify these results in Section 7.

5.2 Quantile Regression-Retrace: distributional Retrace with quantile regression

Below, we use $z_{i}(x,a)$ to represent the $i$-th quantile of the distribution at $(x,a)$. Overall, we have a tabular quantile representation $\eta_{z}(x,a)=\frac{1}{m}\sum_{i=1}^{m}\delta_{z_{i}(x,a)}$ for all $(x,a)\in\mathcal{X}\times\mathcal{A}$, where we use the notation $\eta_{z}$ to stress the distribution's dependency on the parameters $z_{i}(x,a)$. For any given bootstrapping distribution vector $\eta\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, in order to approximate the projected back-up target $\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}\eta$ with the parameterized quantile distribution $\eta_{z}$, we solve the set of quantile regression problems for all $1\leq i\leq m$ and $(x,a)\in\mathcal{X}\times\mathcal{A}$,

\min_{z_{i}(x,a)}L_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right),\ \text{where}\ \tau_{i}=(2i-1)/2m\,.

For any fixed $(x,a,i)$, to solve the quantile regression problem, we apply gradient descent on $z_{i}(x,a)$. In practice, with one sampled trajectory $(X_{s},A_{s},R_{s})_{s=0}^{\infty}\sim\mu$, the aim is to construct an unbiased stochastic gradient estimate of the QR loss $L_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)$. Below, let $\textrm{b}_{t}=\textrm{b}_{G_{0:t-1},\gamma^{t}}$ for simplicity. We start with a stochastic estimate $\widehat{L}_{z_{i}(x,a)}^{\tau_{i}}(\mathcal{R}^{\pi,\mu}\eta(x,a))$ of the QR loss,

L_{z_{i}(x,a)}^{\tau_{i}}\left(\eta(x,a)\right)+\sum_{t=0}^{\infty}c_{1:t}\left(L_{z_{i}(x,a)}^{\tau_{i}}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-L_{z_{i}(x,a)}^{\tau_{i}}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right).

Since $\widehat{L}_{z_{i}(x,a)}^{\tau_{i}}(\mathcal{R}^{\pi,\mu}\eta(x,a))$ is differentiable with respect to $z_{i}(x,a)$, we use $\nabla_{z_{i}(x,a)}\widehat{L}_{z_{i}(x,a)}^{\tau_{i}}(\mathcal{R}^{\pi,\mu}\eta(x,a))$ as the stochastic gradient estimate. This gradient estimate is unbiased under mild conditions.

Lemma 5.2.

(Unbiased stochastic QR loss gradient estimate) Assume that the trajectory terminates within $H<\infty$ steps almost surely. Then $\mathbb{E}_{\mu}[\widehat{L}_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)]=L_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)$ and $\mathbb{E}_{\mu}[\nabla_{z_{i}(x,a)}\widehat{L}_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)]=\nabla_{z_{i}(x,a)}L_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)$.

The above stochastic estimate bypasses the challenge that the QR loss is only defined against distributions, whereas sampled back-up targets $\widehat{R}^{\pi,\mu}\eta(x,a)=\eta(x,a)+\sum_{t=0}^{\infty}c_{1:t}\widetilde{\Delta}_{0:t}^{\pi}$ are signed measures in general. In Quantile Regression-Retrace, we use $\eta_{z}$ itself as the bootstrapping distribution, so that the algorithm approximates the fixed-point iteration $\eta_{z}\leftarrow\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}\eta_{z}$. Concretely, we carry out the following sample-based update

z_{i}(x,a)\leftarrow z_{i}(x,a)-\alpha\nabla_{z_{i}(x,a)}\widehat{L}_{z_{i}(x,a)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta_{z}(x,a)\right),\ \text{for all}\ 1\leq i\leq m,\ (x,a)\in\mathcal{X}\times\mathcal{A}.
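
Putting the pieces together, here is a minimal sketch of the stochastic QR-loss gradient estimate of Lemma 5.2 for a single sampled trajectory, using the quantile table `z` itself as the bootstrapping distribution. Variable names and the trace rule are assumptions, and the gradient is taken only through the explicit regression parameter `theta` (the bootstrap atoms are treated as fixed, as is typical in practice).

```python
import numpy as np

def qr_grad(theta, tau, atoms, probs):
    """Gradient of E_{Z ~ eta}[f_tau(Z - theta)] with respect to theta."""
    return np.sum(probs * ((atoms < theta) - tau))

def qr_retrace_grad(theta, tau, z, traj, pi, mu, gamma, lam=1.0):
    """Stochastic gradient of L-hat_theta^tau(R^{pi,mu} eta_z(x, a)) for one
    trajectory traj = [(x_t, a_t, r_t, x_{t+1}), ...];  z has shape (X, A, m)."""
    m = z.shape[-1]
    x0, a0 = traj[0][0], traj[0][1]
    grad = qr_grad(theta, tau, z[x0, a0], np.full(m, 1.0 / m))
    G, trace = 0.0, 1.0
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:
            trace *= lam * min(1.0, pi[x, a] / mu[x, a])
        # QR-loss gradient against (b_{G_{0:t}, gamma^{t+1}})# eta(X_{t+1}, A^pi_{t+1}) ...
        plus_atoms = (G + gamma ** t * r) + gamma ** (t + 1) * z[x_next].ravel()
        plus_probs = np.repeat(pi[x_next], m) / m
        # ... minus the QR-loss gradient against (b_{G_{0:t-1}, gamma^t})# eta(X_t, A_t)
        minus_atoms = G + gamma ** t * z[x, a]
        grad += trace * (qr_grad(theta, tau, plus_atoms, plus_probs)
                         - qr_grad(theta, tau, minus_atoms, np.full(m, 1.0 / m)))
        G += gamma ** t * r
    return grad

# One sample-based update of quantile i at (x, a), with 0-based i so tau_i = (2i+1)/(2m):
# z[x, a, i] -= alpha * qr_retrace_grad(z[x, a, i], (2 * i + 1) / (2 * m), z, traj, pi, mu, gamma)
```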

5.3 Deep reinforcement learning: QR-DQN-Retrace

We introduce a deep RL implementation of Quantile Regression-Retrace: QR-DQN-Retrace, where the parametric representation is combined with function approximation (bellemare2017cramer; dabney2018distributional; dabney2018implicit). The base agent QR-DQN (dabney2018distributional) parameterizes the quantile locations $z_{i}(x,a;w)$ with the output of a neural network with weights $w$. Let $\eta(x,a;w)=\frac{1}{m}\sum_{i=1}^{m}\delta_{z_{i}(x,a;w)}$ denote the parameterized distribution. QR-DQN-Retrace updates its parameters by stochastic gradient descent on the estimated QR loss, averaged across all $m$ quantile levels: $w\leftarrow w-\alpha\frac{1}{m}\sum_{i=1}^{m}\nabla_{w}\widehat{L}_{z_{i}(x,a;w)}^{\tau_{i}}\left(\mathcal{R}^{\pi,\mu}\eta(x,a;w)\right)$. In practice, the update is further averaged over state-action pairs sampled from a replay buffer.

Figure 3: Tabular experiments illustrating properties of the distributional Retrace operator; we show average results across $10$ randomly sampled MDPs. (a) Contraction rate vs. off-policyness $\varepsilon$; (b) Contraction rate vs. trace coefficient $c_{t}=\min(\rho_{t},\overline{c})$; (c) Fixed point quality vs. trace coefficient $c_{t}$; (d) The uncorrected operator introduces bias to the fixed point while Retrace is unbiased.

6 Discussions

Categorical representations.

The categorical representation is another commonly used class of parameterized distributions in the prior literature (bellemare2017cramer; rowland2018analysis; rowland2019statistics; bdr2022). We obtain contraction guarantees for the categorical representation similar to Theorem 5.1. As with QR, this leads both to improved fixed-point approximations and to faster convergence. Further, this yields a deep RL algorithm, C51-Retrace. The actor-critic Reactor agent (gruslys2017reactor) uses C51-Retrace as its critic training algorithm, although without explicit consideration or analysis of the associated distributional operator. See Appendix E for details. We empirically evaluate the stand-alone improvements of C51-Retrace over C51 in Section 7.

Uncorrected methods.

Uncorrected methods do not correct for off-policyness and hence converge to a biased fixed point (hessel2018rainbow; kapturowski2018recurrent; kozuno2021revisiting). The Rainbow agent (hessel2018rainbow) combined uncorrected $n$-step learning with C51, effectively implementing a distributional operator whose fixed point differs from $\eta^{\pi}$.

On-policy distributional TD($\lambda$).

Nam et al. (nam2021gmac) propose SR($\lambda$), a distributional version of on-policy TD($\lambda$) (sutton1988learning). In operator form, this can be viewed as a special case of Equation (4) with $\mu=\pi$ and $c_{t}=\lambda$; (nam2021gmac) also introduce a sample-replacement technique for a more efficient implementation.

7 Experiments

We carry out a number of experiments to validate the theoretical insights and empirical improvements.

7.1 Illustration of distributional Retrace properties on tabular MDPs

We verify a few important properties of the distributional Retrace operator on tabular MDPs. The results corroborate the theoretical results from previous sections. Throughout, we use quantile representations with $m=100$ atoms; we obtain similar results for categorical representations. See Appendix F for details on the experimental setup. Let $\eta_{0}$ be the initial distribution; we carry out dynamic programming with $\mathcal{R}^{\pi,\mu}$ and denote by $\eta_{k}=(\mathcal{R}^{\pi,\mu})^{k}\eta_{0}$ the $k$-th distribution iterate.

Impact of off-policyness.

We control the level of off-policyness by setting the behavior policy $\mu$ to be the uniform policy and the target policy to $\pi=(1-\varepsilon)\mu+\varepsilon\pi_{d}$, where $\pi_{d}$ is a fixed deterministic policy. Moving from $\varepsilon=0$ to $\varepsilon=1$, we transition from on-policy to very off-policy. We use $L_{p}(\eta_{k},\eta_{\mathcal{R}}^{\pi})$ to measure the contraction rate to the fixed point. Figure 3(a) shows that as the behavior becomes more off-policy, the contraction slows down, degrading the efficiency of multi-step learning.

Impact of trace coefficient $c_{t}$.

Throughout, we set $c_{t}=\min(\rho_{t},\overline{c})$, with $\overline{c}$ controlling the effective trace length. With a fixed level of off-policyness $\varepsilon=0.5$, Figure 3(b) shows that increasing $\overline{c}$ speeds up the contraction to the fixed point, as predicted by Proposition 3.2.

Quality of fixed point.

We next examine how the quality of the fixed point is impacted by $\overline{c}$, by measuring $L_{p}(\eta_{k},\Pi_{\mathcal{Q}}\eta^{\pi})$ as a proxy for $L_{p}(\eta_{k},\eta^{\pi})$. As $k$ increases the error flattens, at which point we take the converged value to be $L_{p}(\eta_{\mathcal{R}}^{\pi},\Pi_{\mathcal{Q}}\eta^{\pi})$, which measures the fixed point quality. Figure 3(c) shows that as $\overline{c}$ increases, the fixed point quality improves, in line with Theorem 5.1. This phenomenon does not arise in tabular non-distributional reinforcement learning, although related phenomena do occur when using function approximation techniques.

Bias of uncorrected methods.

Finally, we illustrate a critical difference between Retrace and uncorrected $n$-step methods (hessel2018rainbow): the bias of the fixed point. Figure 3(d) shows that uncorrected $n$-step arrives at a fixed point in between $\eta^{\pi}$ and $\eta^{\mu}$, showing an obvious bias from $\eta^{\pi}$.

7.2 Deep reinforcement learning

We consider the control setting, where the target policy $\pi$ is the greedy policy with respect to the Q-function induced by the parameterized distribution. Because the training data is sampled from a replay buffer, the behavior policy $\mu$ is $\varepsilon$-greedy with respect to Q-functions induced by previous copies of the parameterized distribution. We evaluate the performance of deep RL agents on 57 Atari games (bellemare2013arcade). To ensure fair comparison across games, we compute the human normalized scores for each agent, and compare their mean and median scores across all 57 games during training.

Deep RL agents.

The multi-step agents adopt exactly the same hyperparameters as the baseline agents; the only difference is the back-up target. For completeness, we show the combination of Retrace with both C51 and QR-DQN. For QR-DQN, we use the Huber loss for quantile regression, a thresholded variant of the QR loss (dabney2018distributional). Throughout, we use $c_{t}=\lambda\min(\rho_{t},\overline{c})$ with $\overline{c}=1$, as in (munos2016safe). See Appendix F for details. In practice, sampled trajectories are truncated at length $n$, and we adapt Retrace to the $n$-step case; see Appendix A.

Results.

Figure 4 compares the one-step baseline, Retrace and uncorrected $n$-step (hessel2018rainbow). For C51, both multi-step methods clearly improve the median performance over the one-step baseline; Retrace slightly outperforms uncorrected $n$-step towards the end of learning. For QR-DQN, all multi-step algorithms achieve clear performance gains; Retrace significantly outperforms uncorrected $n$-step in mean performance, while obtaining similar results in median performance. Overall, distributional Retrace achieves a clear improvement over the one-step baselines. The uncorrected $n$-step method typically takes off faster than Retrace but may lead to slightly worse final performance.

Finally, note that in the value-based setting, uncorrected methods generally perform better than Retrace, potentially due to a favorable trade-off between contraction rate and fixed-point bias (rowland2019adaptive). Our results add to the evidence on the benefits of off-policy corrections in the control setting.

Figure 4: Deep RL experiments on Atari-57 games for (a) C51 and (b) QR-DQN. We compare the one-step baseline agent against the multi-step variants (Retrace and uncorrected $n$-step). For all multi-step variants, we use $n=3$. For each agent, we calculate the mean and median performance across all games, and we plot the mean $\pm$ standard error across $3$ seeds. In almost all settings, multi-step variants provide a clear advantage over the one-step baseline algorithm.

8 Conclusion

We have identified a number of fundamental conceptual differences between value-based and distributional RL in the multi-step setting. Central to these differences is the novel notion of path-dependent distributional TD error, which arises naturally from the multi-step distributional RL problem. Building on this understanding, we have developed distributional Retrace, the first principled multi-step off-policy distributional operator. We have also developed an approximate distributional RL algorithm, Quantile Regression-Retrace, which makes distributional Retrace highly competitive in both tabular and high-dimensional setups. This paper also opens up several avenues for future research, such as the interaction between multi-step distributional RL and signed measures, and the convergence theory of stochastic approximation for multi-step distributional RL.

References

  • [1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • [2] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • [3] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
  • [4] Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2010.
  • [5] Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.
  • [6] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
  • [7] Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. Advances in neural information processing systems, 2019.
  • [8] Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan. Quantile QT-OPT for risk-aware vision-based robotic grasping. In Proceedings of Robotics: Science and Systems, 2020.
  • [9] Peter R Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Subramanian, Thomas J Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Florian Fuchs, et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.
  • [10] Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press, 2022. http://www.distributional-rl.org.
  • [11] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In Proceedings of the International Conference on Machine Learning, 2001.
  • [12] Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, and Rémi Munos. Q(λ\lambda) with off-policy corrections. In Proceedings of the International Conference on Algorithmic Learning Theory, 2016.
  • [13] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
  • [14] Ashique Rupam Mahmood, Huizhen Yu, and Richard S Sutton. Multi-step off-policy learning without importance sampling ratios. arXiv, 2017.
  • [15] Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. Universal off-policy evaluation. Advances in Neural Information Processing Systems, 2021.
  • [16] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [17] Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.
  • [18] Hado van Hasselt, Sephora Madjiheurem, Matteo Hessel, David Silver, André Barreto, and Diana Borsa. Expected eligibility traces. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
  • [19] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
  • [20] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.
  • [21] Roger Koenker. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005.
  • [22] Roger Koenker, Victor Chernozhukov, Xuming He, and Limin Peng. Handbook of Quantile Regression. CRC Press, 2017.
  • [23] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
  • [24] Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2019.
  • [25] Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc G. Bellemare, and Rémi Munos. The Reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2018.
  • [26] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [27] Steven Kapturowski, Georg Ostrovski, John Quan, Rémi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2019.
  • [28] Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting Peng’s Q(λ\lambda) for modern reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2021.
  • [29] Daniel W. Nam, Younghoon Kim, and Chan Y. Park. GMAC: A distributional perspective on actor-critic framework. In Proceedings of the International Conference on Machine Learning, 2021.
  • [30] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • [31] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • [32] Mark Rowland, Will Dabney, and Rémi Munos. Adaptive trade-offs in off-policy learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2020.
  • [33] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
  • [34] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill New York, 1976.
  • [35] Charles R. Harris, K. Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
  • [36] John D. Hunter. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(03):90–95, 2007.
  • [37] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. Jax: composable transformations of Python+NumPy programs. 2018.
  • [38] Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones, Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Wojciech Stokowiec, and Fabio Viola. The DeepMind JAX ecosystem. 2020.
  • [39] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • [40] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320, 2021.

The Nature of Temporal Difference Errors in
Multi-step Distributional Reinforcement Learning:
Appendices

Appendix A Extension of distributional Retrace to nn-step truncated trajectories

The nn-step truncated version of distributional Retrace is defined as

nπ,μη(x,a)=η(x,a)+𝔼μ[t=0nc1:tΔ~0:tπ],\displaystyle\mathcal{R}_{n}^{\pi,\mu}\eta(x,a)=\eta(x,a)+\mathbb{E}_{\mu}\left[\sum_{t=0}^{n}c_{1:t}\widetilde{\Delta}_{0:t}^{\pi}\right],

which sums the path-dependent distributional TD errors up to time nn. Compared to the original definition of distributional Retrace, this nn-step operator is more practical to implement. It enjoys all the theoretical properties of the original distributional Retrace, with a slight difference in the contraction rate. Intuitively, the operator bootstraps after at most nn steps, which limits its effective horizon to at most nn. It is straightforward to show that the operator is βn\beta_{n}-contractive under W¯p\overline{W}_{p} with βn(β,γ]\beta_{n}\in(\beta,\gamma]. As nn\rightarrow\infty, βnβ\beta_{n}\rightarrow\beta.
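For concreteness, the following Python sketch illustrates the trace products c1:tc_{1:t} that weight the path-dependent TD errors, together with the nn-step truncation. It assumes the standard Retrace coefficients ct=λmin(1,ρt)c_{t}=\lambda\min(1,\rho_{t}); all names are illustrative and this is not the implementation used in our experiments.

```python
import numpy as np

# Illustrative only: compute the trace products c_{1:t} along a sampled trajectory,
# assuming the Retrace coefficients c_t = lambda * min(1, pi(a_t|x_t) / mu(a_t|x_t)).
def trace_products(pi_probs, mu_probs, lam):
    """Return c_{1:t} for t = 0, ..., T-1, with the convention c_{1:0} = 1."""
    c = lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
    prods = np.ones(len(c))
    prods[1:] = np.cumprod(c[1:])  # c_{1:t} = c_1 * ... * c_t
    return prods

# Hypothetical per-step probabilities of the actions actually taken.
pi_probs = [0.9, 0.5, 0.8, 0.7]
mu_probs = [0.25, 0.25, 0.25, 0.25]
c_full = trace_products(pi_probs, mu_probs, lam=0.9)

n = 2
c_truncated = c_full[: n + 1]  # the n-step operator keeps TD errors only up to time n
print(c_full, c_truncated)
```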

Appendix B Distance metrics

We provide a brief review of the distance metrics used in this work. We refer readers to [10] for a complete background.

B.1 Wasserstein distance

Let η1,η2𝒫()\eta_{1},\eta_{2}\in\mathscr{P}_{\infty}(\mathbb{R}) be two probability distributions, and let FηF_{\eta} denote the CDF of η\eta. The pp-Wasserstein distance can be computed as

Wp(η1,η2)([0,1]|Fη11(z)Fη21(z)|p𝑑z)1/p.\displaystyle W_{p}(\eta_{1},\eta_{2})\coloneqq\left(\int_{[0,1]}|F_{\eta_{1}}^{-1}(z)-F_{\eta_{2}}^{-1}(z)|^{p}dz\right)^{1/p}.

Note that the above definition is equivalent to the more traditional definition based on optimal transport; indeed, Fηi1(z),zUniform(0,1),i{1,2}F_{\eta_{i}}^{-1}(z),z\sim\text{Uniform}(0,1),i\in\{1,2\} can be understood as the optimal coupling between the two distributions. The above definition is a proper distance metric if p1p\geq 1.

For any distribution vector η1,η2𝒫()𝒳×𝒜\eta_{1},\eta_{2}\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}, we can define the supremum pp-Wasserstein distance as

W¯p(η1,η2)maxx,aWp(η1(x,a),η2(x,a)).\displaystyle\overline{W}_{p}(\eta_{1},\eta_{2})\coloneqq\max_{x,a}W_{p}(\eta_{1}(x,a),\eta_{2}(x,a)).
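As a small numerical illustration (not required for any of the results), note that for two quantile distributions with mm atoms and equal weights 1/m1/m, the inverse CDFs are piecewise constant, so the pp-Wasserstein distance above reduces to an average over sorted atom differences. The following sketch makes this concrete; the names are illustrative.

```python
import numpy as np

# p-Wasserstein distance between two m-atom quantile distributions with equal weights 1/m:
# the optimal coupling matches atoms in sorted order.
def wasserstein_p(atoms1, atoms2, p=1):
    z1, z2 = np.sort(atoms1), np.sort(atoms2)   # inverse CDFs evaluated at the m quantile levels
    return float(np.mean(np.abs(z1 - z2) ** p) ** (1.0 / p))

eta1 = np.array([0.0, 1.0, 2.0, 3.0])
eta2 = np.array([0.5, 1.5, 2.5, 3.5])
print(wasserstein_p(eta1, eta2, p=2))  # 0.5
```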

B.2 LpL_{p} distance

Let η1,η2𝒫()\eta_{1},\eta_{2}\in\mathscr{P}_{\infty}(\mathbb{R}) be two probability distributions, and let FηF_{\eta} denote the CDF of η\eta. The LpL_{p} distance is defined as

Lp(η1,η2)(|Fη1(z)Fη2(z)|p𝑑z)1/p.\displaystyle L_{p}(\eta_{1},\eta_{2})\coloneqq\left(\int_{\mathbb{R}}\left|F_{\eta_{1}}(z)-F_{\eta_{2}}(z)\right|^{p}dz\right)^{1/p}.

The above definition is a proper distance metric when p1p\geq 1.

For any distribution vector η1,η2𝒫()𝒳×𝒜\eta_{1},\eta_{2}\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} or signed measure vector η1,η2()𝒳×𝒜\eta_{1},\eta_{2}\in\mathcal{M}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}, we can define the supremum LpL_{p} distance as

L¯p(η1,η2)maxx,aLp(η1(x,a),η2(x,a)).\displaystyle\overline{L}_{p}(\eta_{1},\eta_{2})\coloneqq\max_{x,a}L_{p}(\eta_{1}(x,a),\eta_{2}(x,a)).
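Analogously, for two discrete distributions supported on a common finite grid, the CDF difference is piecewise constant and the integral above becomes a finite sum over grid intervals. The following sketch (illustrative only) computes the LpL_{p} distance in this special case.

```python
import numpy as np

# L_p distance between sum_i probs1[i] delta_{support[i]} and sum_i probs2[i] delta_{support[i]}.
def lp_distance(support, probs1, probs2, p=2):
    cdf_diff = np.cumsum(probs1 - probs2)[:-1]   # CDF difference on each interval [z_i, z_{i+1})
    widths = np.diff(support)                    # interval lengths
    return float(np.sum(np.abs(cdf_diff) ** p * widths) ** (1.0 / p))

support = np.array([0.0, 1.0, 2.0, 3.0])
p1 = np.array([0.25, 0.25, 0.25, 0.25])
p2 = np.array([0.0, 0.5, 0.5, 0.0])
print(lp_distance(support, p1, p2, p=2))  # approximately 0.354
```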

Appendix C Numerically non-convergent behavior of alternative multi-step operators

We consider another path-independent alternative to the path-dependent TD error, namely γtΔtπ\gamma^{t}\Delta_{t}^{\pi}. The primary motivation for such a path-independent TD error is that the discounted value-based TD error takes the form δ~tπ=γtδtπ\widetilde{\delta}_{t}^{\pi}=\gamma^{t}\delta_{t}^{\pi}. The resulting multi-step operator is

~π,μη(x,a)=η(x,a)+𝔼μ[t=0c1:tγtΔtπ].\displaystyle\widetilde{\mathcal{R}}^{\pi,\mu}\eta(x,a)=\eta(x,a)+\mathbb{E}_{\mu}\left[\sum_{t=0}^{\infty}c_{1:t}\gamma^{t}\Delta_{t}^{\pi}\right].

We use the same toy example as in the paper: a one-state, one-action MDP with a deterministic reward Rt=1R_{t}=1 and discount factor γ=0.5\gamma=0.5. The target distribution ηπ\eta^{\pi} is a Dirac distribution centered at 22. Let ηk=()kη0\eta_{k}=(\mathcal{R})^{k}\eta_{0} be the kk-th distribution iterate obtained by applying an operator {π,μ,~π,μ,~π,μ,𝒯π}\mathcal{R}\in\{\mathcal{R}^{\pi,\mu},\widetilde{\mathcal{R}}^{\pi,\mu},\widetilde{\mathcal{R}}^{\pi,\mu},\mathcal{T}^{\pi}\}. We show the LpL_{p} distance between the iterates and ηπ\eta^{\pi} in Figure 5. It is clear that the alternative multi-step operators do not converge to the correct fixed point.

(a) Full results for all operators
(b) Comparing two alternatives
Figure 5: Illustration of the non-convergent behavior of alternative multi-step operators: for both plots, we show the mean and per-run results across 10 different initial Dirac distributions η0\eta_{0}. (a) The full comparison between all operators: the two alternative operators do not converge, while the one-step Bellman operator and distributional Retrace both converge. (b) We zoom in on the difference between the two alternative operators.

Appendix D Backward-view algorithm for multi-step distributional RL

We now describe a backward-view algorithm for multi-step distributional RL with quantile representations. For simplicity, we consider the on-policy case π=μ\pi=\mu and ct=λc_{t}=\lambda. To implement π,μ\mathcal{R}^{\pi,\mu} in the backward-view, at each time step tt and a past time step ttt^{\prime}\leq t, the algorithm needs to maintain two novel traces distinct from the classic eligibility traces [2]: (1) partial return traces Gt:tG_{t^{\prime}:t}, which correspond to the partial sum of discounted rewards between two time steps ttt^{\prime}\leq t; (2) modified eligibility traces, defined as et,tλtte_{t^{\prime},t}\coloneqq\lambda^{t-t^{\prime}}, which measure the trace decay between two time steps ttt^{\prime}\leq t. At a new time step t+1t+1, the traces are updated recursively: Gt:t+1=Gt:t+γt+1tRt+1,et,t+1=λet,tG_{t^{\prime}:t+1}=G_{t^{\prime}:t}+\gamma^{t+1-t^{\prime}}R_{t+1},e_{t^{\prime},t+1}=\lambda e_{t^{\prime},t}, keeping track of the discount factors γt+1t\gamma^{t+1-t^{\prime}} along the way.

We assume the algorithm maintains a table of quantile distributions with mm atoms: η(x,a)=1mi=1mδzi(x,a),(x,a)𝒳×𝒜\eta(x,a)=\frac{1}{m}\sum_{i=1}^{m}\delta_{z_{i}(x,a)},\forall(x,a)\in\mathcal{X}\times\mathcal{A}. For any fixed (x,a)(x,a), define Tt(x,a){s|Xs=x,As=a,0st}T_{t}(x,a)\coloneqq\{s|X_{s}=x,A_{s}=a,0\leq s\leq t\} to be the set of time steps up to time tt at which (x,a)(x,a) is visited. Now, upon arriving at Xt+1X_{t+1}, we observe the TD error Δtπ\Delta_{t}^{\pi}. Recall that Lθτ(η)L_{\theta}^{\tau}(\eta) denotes the QR loss of parameter θ\theta at quantile level τ\tau against the distribution η\eta. To describe the update more conveniently, we define the QR loss against the path-dependent TD error

(bGs:t1,γts)#Δ~0:tπ=(bGs:t,γt+1s)#η(Xt+1,At+1π)(bGs:t1,γts)#η(Xt,At)\displaystyle\left(\textrm{b}_{G_{s:t-1},\gamma^{t-s}}\right)_{\#}\widetilde{\Delta}_{0:t}^{\pi}=\left(\textrm{b}_{G_{s:t},\gamma^{t+1-s}}\right)_{\#}\eta(X_{t+1},A_{t+1}^{\pi})-\left(\textrm{b}_{G_{s:t-1},\gamma^{t-s}}\right)_{\#}\eta(X_{t},A_{t})

as the difference of the QR losses against the individual distributions,

Lθτ((bGs:t1,γts)#Δ~0:tπ)Lθτ((bGs:t,γt+1s)#η(Xt+1,At+1π))Lθτ((bGs:t1,γts)#η(Xt,At)).\displaystyle L_{\theta}^{\tau}\left(\left(\textrm{b}_{G_{s:t-1},\gamma^{t-s}}\right)_{\#}\widetilde{\Delta}_{0:t}^{\pi}\right)\coloneqq L_{\theta}^{\tau}\left(\left(\textrm{b}_{G_{s:t},\gamma^{t+1-s}}\right)_{\#}\eta(X_{t+1},A_{t+1}^{\pi})\right)-L_{\theta}^{\tau}\left(\left(\textrm{b}_{G_{s:t-1},\gamma^{t-s}}\right)_{\#}\eta(X_{t},A_{t})\right).

Note that the QR loss can be computed using the transition data we have seen so far. We now perform a gradient update for all entries in the table (x,a)𝒳×𝒜(x,a)\in\mathcal{X}\times\mathcal{A} and 1im1\leq i\leq m (in practice, we only update entries that correspond to visited state-action pairs):

zi(x,a)zi(x,a)αsTt(x,a)es,tzi(x,a)Lθτi((bGs:t1,γts)#Δ~0:tπ),\displaystyle z_{i}(x,a)\leftarrow z_{i}(x,a)-\alpha\sum_{s\in T_{t}(x,a)}e_{s,t}\nabla_{z_{i}(x,a)}L_{\theta}^{\tau_{i}}\left(\left(\textrm{b}_{G_{s:t-1},\gamma^{t-s}}\right)_{\#}\widetilde{\Delta}_{0:t}^{\pi}\right),

where τi=2i12m\tau_{i}=\frac{2i-1}{2m}. For any fixed (x,a)(x,a), the above algorithm effectively aggregates updates from time steps sTt(x,a)s\in T_{t}(x,a) at which (x,a)(x,a) is visited.
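The following Python sketch summarizes the backward-view bookkeeping above in the tabular, on-policy case with ct=λc_{t}=\lambda. It is an illustration under simplifying assumptions, not the implementation used in our experiments: the partial return traces are maintained together with the discount factors γts\gamma^{t-s}, and the QR loss gradient against an mm-atom distribution is computed in closed form. All class and variable names are ours.

```python
import numpy as np

def qr_grad(theta, target_atoms, tau):
    # Gradient w.r.t. a single quantile atom `theta` of the QR loss against
    # the uniform mixture (1/m) * sum_j delta_{target_atoms[j]}.
    return -np.mean(tau - (np.asarray(target_atoms) < theta).astype(float))

class BackwardViewQR:
    """Tabular backward-view sketch: on-policy, c_t = lambda, quantile atoms per (x, a)."""

    def __init__(self, m, gamma, lam, alpha):
        self.m, self.gamma, self.lam, self.alpha = m, gamma, lam, alpha
        self.taus = (2 * np.arange(m) + 1) / (2 * m)   # tau_i = (2i - 1) / (2m)
        self.quantiles = {}   # (x, a) -> np.array of m atoms z_i(x, a)
        self.visits = []      # visits of the current episode, as (s, x_s, a_s)
        self.G = {}           # s -> partial return trace G_{s:t-1}
        self.disc = {}        # s -> gamma^{t-s}
        self.e = {}           # s -> modified eligibility trace lambda^{t-s}

    def atoms(self, x, a):
        return self.quantiles.setdefault((x, a), np.zeros(self.m))

    def observe(self, x_t, a_t, r_t, x_next, a_next):
        # Register the visit at the current time t with fresh traces.
        s_new = len(self.visits)
        self.visits.append((s_new, x_t, a_t))
        self.G[s_new], self.disc[s_new], self.e[s_new] = 0.0, 1.0, 1.0

        cur_atoms = self.atoms(x_t, a_t).copy()
        next_atoms = self.atoms(x_next, a_next).copy()
        for s, x, a in self.visits:
            # Pushforwards (b_{G_{s:t}, gamma^{t+1-s}})_# eta(X_{t+1}, A_{t+1})
            # and (b_{G_{s:t-1}, gamma^{t-s}})_# eta(X_t, A_t).
            G_cur = self.G[s] + self.disc[s] * r_t
            pushed_next = G_cur + self.gamma * self.disc[s] * next_atoms
            pushed_cur = self.G[s] + self.disc[s] * cur_atoms
            z = self.atoms(x, a)
            for i in range(self.m):
                g = qr_grad(z[i], pushed_next, self.taus[i]) - qr_grad(z[i], pushed_cur, self.taus[i])
                z[i] -= self.alpha * self.e[s] * g
            # Recursive trace updates, ready for the next time step t + 1.
            self.G[s], self.disc[s], self.e[s] = G_cur, self.gamma * self.disc[s], self.lam * self.e[s]

# Hypothetical usage on a short transition sequence (states and actions are integers).
agent = BackwardViewQR(m=5, gamma=0.9, lam=0.8, alpha=0.1)
agent.observe(x_t=0, a_t=0, r_t=1.0, x_next=1, a_next=0)
agent.observe(x_t=1, a_t=0, r_t=0.5, x_next=2, a_next=0)
```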

D.1 Simplifications for value-based RL

We now discuss how the path-independent value-based TD errors greatly simplify the value-based backward-view algorithm. Following the above notation, assume the algorithm maintains a table of Q-values Q(x,a)Q(x,a). We can then construct an incremental backward-view update for all (x,a)𝒳×𝒜(x,a)\in\mathcal{X}\times\mathcal{A} by replacing the path-dependent distributional TD error Δ~0:tπ\widetilde{\Delta}_{0:t}^{\pi} with the discounted TD error δ~tπ\widetilde{\delta}_{t}^{\pi}:

Q(x,a)Q(x,a)αsTt(x,a)es,tδ~tπ.\displaystyle Q(x,a)\leftarrow Q(x,a)-\alpha\sum_{s\in T_{t}(x,a)}e_{s,t}\widetilde{\delta}_{t}^{\pi}.

Since δtπ\delta_{t}^{\pi} does not depend on the past rewards and is state-action dependent, we can simplify the summation over sTt(x,a)s\in T_{t}(x,a) by defining the state-action-dependent eligibility traces [2] as a replacement for es,te_{s,t},

e~(x,a)γλe~(x,a)+𝕀[Xt=x,At=a].\displaystyle\widetilde{e}(x,a)\leftarrow\gamma\lambda\widetilde{e}(x,a)+\mathbb{I}[X_{t}=x,A_{t}=a].

As a result, the above update reduces to

Q(x,a)Q(x,a)αe~(x,a)δtπ,\displaystyle Q(x,a)\leftarrow Q(x,a)-\alpha\widetilde{e}(x,a)\delta_{t}^{\pi},

which recovers the classic backward-view update.
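For comparison, the classic value-based backward view requires only a single state-action-indexed trace table. The following compact sketch (with the standard sign convention δt=Rt+γQ(Xt+1,At+1)Q(Xt,At)\delta_{t}=R_{t}+\gamma Q(X_{t+1},A_{t+1})-Q(X_{t},A_{t}) and accumulating traces; names are illustrative) implements one step of this update.

```python
import numpy as np

# One step of tabular on-policy Q(lambda) with accumulating eligibility traces.
def q_lambda_step(Q, e, x_t, a_t, r_t, x_next, a_next, gamma, lam, alpha):
    td_error = r_t + gamma * Q[x_next, a_next] - Q[x_t, a_t]   # path-independent TD error
    e *= gamma * lam                                           # decay all traces
    e[x_t, a_t] += 1.0                                         # accumulate at the visited pair
    Q += alpha * e * td_error                                  # broadcast update to all entries
    return Q, e

num_states, num_actions = 3, 2
Q = np.zeros((num_states, num_actions))
e = np.zeros((num_states, num_actions))
Q, e = q_lambda_step(Q, e, 0, 1, 1.0, 2, 0, gamma=0.9, lam=0.8, alpha=0.1)
```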

D.2 Non-equivalence of forward-view and backward-view algorithms

In value-based RL, forward-view and backward-view algorithms are equivalent given that the trajectory does not visit the same state twice [2]. However, such an equivalence does not generally hold in distributional RL. Indeed, consider the following counterexample in the case of the quantile representation.

Consider a three-step MDP with deterministic transition x1x2x3x_{1}\rightarrow x_{2}\rightarrow x_{3}. There is no action and no reward on the transition. The state x3x_{3} is terminal with a deterministic terminal value r3=1r_{3}=1. We consider m=1m=1 atom and let the quantile parameters be θ1=0\theta_{1}=0 and θ2=1\theta_{2}=1 at states x1,x2x_{1},x_{2} respectively. In this case, the quantile representation learns the median of the target distribution with τ=0.5\tau=0.5.

Now, we consider the update at θ1\theta_{1} under both the forward-view and backward-view implementations of the two-step Bellman operator 𝒯2πη(x)=𝔼[(b0,γ2)#η(X2,π(X2))|X0=x]\mathcal{T}_{2}^{\pi}\eta(x)=\mathbb{E}\left[(\textrm{b}_{0,\gamma^{2}})_{\#}\eta(X_{2},\pi(X_{2}))|X_{0}=x\right], which can be obtained from distributional Retrace by setting ct=ρtc_{t}=\rho_{t}. The target distribution at x1x_{1} is a Dirac distribution centered at γ2\gamma^{2}.

Forward-view update.

Below, we use δx\delta_{x} to denote a Dirac distribution at xx. In the forward-view, the back-up distribution is

𝔼[(b0,γ2)#η(X2,π(X2))]=δγ2.\displaystyle\mathbb{E}\left[(\textrm{b}_{0,\gamma^{2}})_{\#}\eta(X_{2},\pi(X_{2}))\right]=\delta_{\gamma^{2}}.

The gradient update to θ1\theta_{1} is thus

θ1(fwd)=θ1αθ1Lθ10.5(δγ2)=θ1+α(0.5𝕀[γ2<θ1]).\displaystyle\theta_{1}^{\text{(fwd)}}=\theta_{1}-\alpha\nabla_{\theta_{1}}L_{\theta_{1}}^{0.5}\left(\delta_{\gamma^{2}}\right)=\theta_{1}+\alpha\left(0.5-\mathbb{I}\left[\gamma^{2}<\theta_{1}\right]\right).

Backward-view update.

To implement the backward-view update, we make explicit the two path-dependent distributional TD errors at the two consecutive time steps:

Δ~0π=δγδ0,Δ~1π=(b0,γ)#(δγδθ2)=δγ2δγ\displaystyle\widetilde{\Delta}_{0}^{\pi}=\delta_{\gamma}-\delta_{0},\ \ \widetilde{\Delta}_{1}^{\pi}=(\textrm{b}_{0,\gamma})_{\#}\left(\delta_{\gamma}-\delta_{\theta_{2}}\right)=\delta_{\gamma^{2}}-\delta_{\gamma}

The update consists of two steps:

θ1\displaystyle\theta_{1}^{\prime} =θ1αθ1Lθ10.5(δγ)=θ1+α(0.5𝕀[γ<θ1]),\displaystyle=\theta_{1}-\alpha\nabla_{\theta_{1}}L_{\theta_{1}}^{0.5}\left(\delta_{\gamma}\right)=\theta_{1}+\alpha\left(0.5-\mathbb{I}\left[\gamma<\theta_{1}\right]\right),
θ1(bwd)\displaystyle\theta_{1}^{\text{(bwd)}} =θ1α(θ1Lθ10.5(δγ2)θ1Lθ10.5(δγ))\displaystyle=\theta_{1}^{\prime}-\alpha\left(\nabla_{\theta_{1}^{\prime}}L_{\theta_{1}^{\prime}}^{0.5}\left(\delta_{\gamma^{2}}\right)-\nabla_{\theta_{1}^{\prime}}L_{\theta_{1}^{\prime}}^{0.5}\left(\delta_{\gamma}\right)\right)
=θ1+α(0.5𝕀[γ2<θ1])α(0.5𝕀[γ<θ1]).\displaystyle=\theta_{1}^{\prime}+\alpha\left(0.5-\mathbb{I}[\gamma^{2}<\theta_{1}^{\prime}]\right)-\alpha\left(0.5-\mathbb{I}[\gamma<\theta_{1}^{\prime}]\right).

Overall, we have

θ1(bwd)\displaystyle\theta_{1}^{\text{(bwd)}} =θ1+α(0.5𝕀[γ<θ1])+α(0.5𝕀[γ2<θ1])α(0.5𝕀[γ<θ1])\displaystyle=\theta_{1}+\alpha\left(0.5-\mathbb{I}\left[\gamma<\theta_{1}\right]\right)+\alpha\left(0.5-\mathbb{I}[\gamma^{2}<\theta_{1}^{\prime}]\right)-\alpha\left(0.5-\mathbb{I}[\gamma<\theta_{1}^{\prime}]\right)
=0.5αα𝕀[γ2<0.5α]+α𝕀[γ<0.5α].\displaystyle=0.5\alpha-\alpha\mathbb{I}[\gamma^{2}<0.5\alpha]+\alpha\mathbb{I}[\gamma<0.5\alpha].

Now let α(2γ2,2γ)\alpha\in(2\gamma^{2},2\gamma) so that 0.5α(γ2,γ)0.5\alpha\in(\gamma^{2},\gamma); then θ1(bwd)=0.5αα=0.5αθ1(fwd)\theta_{1}^{\text{(bwd)}}=0.5\alpha-\alpha=-0.5\alpha\neq\theta_{1}^{\text{(fwd)}}, so the forward-view and backward-view updates indeed differ.
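The counterexample can be checked numerically. The following short script (our own illustration; the step size α=0.6\alpha=0.6 satisfies 0.5α(γ2,γ)0.5\alpha\in(\gamma^{2},\gamma) for γ=0.5\gamma=0.5) reproduces the forward-view and backward-view updates above and confirms that they disagree.

```python
# Numerical check of the forward/backward counterexample, assuming gamma = 0.5
# and a step size alpha with 0.5 * alpha in (gamma^2, gamma) = (0.25, 0.5).
gamma, alpha = 0.5, 0.6
theta1, theta2 = 0.0, 1.0   # single quantile atoms at x1 and x2

def qr_grad(theta, z, tau=0.5):
    """Gradient of the QR loss of a single atom theta against a Dirac at z."""
    return -(tau - float(z < theta))

# Forward view: regress theta1 directly onto the two-step target delta_{gamma^2}.
theta1_fwd = theta1 - alpha * qr_grad(theta1, gamma ** 2)

# Backward view: first the one-step target delta_gamma, then the correction from
# the path-dependent TD error delta_{gamma^2} - delta_gamma.
theta1_prime = theta1 - alpha * qr_grad(theta1, gamma)
theta1_bwd = theta1_prime - alpha * (qr_grad(theta1_prime, gamma ** 2) - qr_grad(theta1_prime, gamma))

print(theta1_fwd, theta1_bwd)  # 0.3 vs -0.3: the two updates differ
```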

D.3 Discussion on memory complexity

The return traces Gt,tG_{t^{\prime},t} and modified eligibility traces et,te_{t^{\prime},t} are time-dependent, which is a direct implication of the fact that distributional TD errors are path-dependent. Indeed, to calculate the distributional TD error Δ~t:tπ\widetilde{\Delta}_{t^{\prime}:t}^{\pi}, it is necessary to keep track of Gt,tG_{t^{\prime},t} in the backward-view algorithm. This differs from the classic eligibility traces, which are state-action-dependent [2, 18]. We remark that the state-action-dependency of eligibility traces results from the fact that value-based TD errors δtπ\delta_{t}^{\pi} are path-independent. The time-dependency greatly increases the memory complexity of the algorithm: for an episode of length TT, the value-based backward-view algorithm requires memory of size min(|𝒳||𝒜|,T)\min(|\mathcal{X}||\mathcal{A}|,T) to store all eligibility traces, whereas the distributional backward-view algorithm requires memory of size 𝒪(T)\mathcal{O}(T).

Appendix E Distributional Retrace with categorical representations

We start by showing that the distributional Retrace operator is βLp\beta_{L_{p}}-contractive under the L¯p\overline{L}_{p} distance for p1p\geq 1. As a comparison, the one-step distributional Bellman operator 𝒯π\mathcal{T}^{\pi} is γ1/p\gamma^{1/p}-contractive under L¯p\overline{L}_{p} [17].

Lemma E.1.

(Contraction in L¯p\overline{L}_{p}) π,μ\mathcal{R}^{\pi,\mu} is βLp\beta_{L_{p}}-contractive under supremum LpL_{p} distance for p1p\geq 1, where βLp[0,γ]\beta_{L_{p}}\in[0,\gamma]. Specifically, we have βLp=maxx𝒳,a𝒜(t=1𝔼μ[c1ct1(1ct)]γt)1/p\beta_{L_{p}}=\max_{x\in\mathcal{X},a\in\mathcal{A}}\left(\sum_{t=1}^{\infty}\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}(1-c_{t})\right]\gamma^{t}\right)^{1/p}.

Proof.

The proof is similar to the proof of Proposition 3.2: the result follows by combining the convex combination property of distributional Retrace in Lemma 3.1 with the pp-convexity of LpL_{p} distance [10]. ∎

E.1 Categorical representation

In categorical representations [23], we consider, for a fixed m1m\geq 1, parametric distributions of the form i=1mpiδzi\sum_{i=1}^{m}p_{i}\delta_{z_{i}}, where (zi)i=1m(z_{i})_{i=1}^{m}\subset\mathbb{R} is a fixed set of atoms and (pi)i=1m(p_{i})_{i=1}^{m} is a categorical distribution such that i=1mpi=1\sum_{i=1}^{m}p_{i}=1 and pi0p_{i}\geq 0. Denote the class of such distributions as 𝒫𝒞(){i=1mpiδzi|i=1mpi=1,pi0}\mathscr{P}_{\mathcal{C}}(\mathbb{R})\coloneqq\{\sum_{i=1}^{m}p_{i}\delta_{z_{i}}|\sum_{i=1}^{m}p_{i}=1,p_{i}\geq 0\}. For simplicity, we assume that the target return distribution is supported on [RMIN/(1γ),RMAX/(1γ)][z1,zm][R_{\text{MIN}}/(1-\gamma),R_{\text{MAX}}/(1-\gamma)]\subset[z_{1},z_{m}].

We introduce the projection that maps a back-up distribution onto the categorical parametric class, Π𝒞:𝒫()𝒫𝒞()\Pi_{\mathcal{C}}:\mathscr{P}_{\infty}(\mathbb{R})\rightarrow\mathscr{P}_{\mathcal{C}}(\mathbb{R}), defined as Π𝒞ηargminν𝒫𝒞()L2(ν,η),η𝒫()\Pi_{\mathcal{C}}\eta\coloneqq\arg\min_{\nu\in\mathscr{P}_{\mathcal{C}}(\mathbb{R})}L_{2}\left(\nu,\eta\right),\forall\eta\in\mathscr{P}_{\infty}(\mathbb{R}). The projection can be calculated in closed form as described in [6, 17]; see the sketch below. For any distribution vector η𝒫()𝒳×𝒜\eta\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}, define Π𝒞η\Pi_{\mathcal{C}}\eta as the component-wise projection. Now, given the composed operator Π𝒞π,μ:𝒫()𝒳×𝒜𝒫𝒞()𝒳×𝒜\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}:\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}\rightarrow\mathscr{P}_{\mathcal{C}}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}, we characterize the convergence of the sequence ηk=(Π𝒞π,μ)kη0\eta_{k}=\left(\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\right)^{k}\eta_{0}.
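For completeness, the following sketch implements the closed-form categorical projection for a discrete back-up distribution jqjδyj\sum_{j}q_{j}\delta_{y_{j}}, in the style of C51; it is an illustration with our own naming, not the implementation used in the experiments.

```python
import numpy as np

# C51-style projection of sum_j probs[j] * delta_{atoms[j]} onto a fixed uniform support z.
def categorical_projection(z, atoms, probs):
    m, dz = len(z), z[1] - z[0]
    out = np.zeros(m)
    b = (np.clip(atoms, z[0], z[-1]) - z[0]) / dz            # fractional grid index of each atom
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    exact = (lo == hi)                                        # atoms landing exactly on the grid
    np.add.at(out, lo[exact], probs[exact])
    np.add.at(out, lo[~exact], probs[~exact] * (hi[~exact] - b[~exact]))
    np.add.at(out, hi[~exact], probs[~exact] * (b[~exact] - lo[~exact]))
    return out

z = np.linspace(-10.0, 10.0, 51)
atoms = 1.0 + 0.99 * z                                        # e.g. a pushforward of the support
probs = np.full(51, 1.0 / 51)
projected = categorical_projection(z, atoms, probs)
print(projected.sum())                                        # total mass is preserved (up to float error)
```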

Theorem E.2.

(Convergence of categorical distributions) The projected distributional Retrace operator Π𝒞π,μ\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu} is βL2\beta_{L_{2}}-contractive under L¯2\overline{L}_{2} distance in 𝒫𝒞()\mathscr{P}_{\mathcal{C}}(\mathbb{R}). As a result, the above ηk\eta_{k} converges to a limiting distribution ηπ\eta_{\mathcal{R}}^{\pi} in L¯2\overline{L}_{2}, such that L¯2(ηk,ηπ)(βL2)kL¯2(η0,ηπ)\overline{L}_{2}(\eta_{k},\eta_{\mathcal{R}}^{\pi})\leq(\beta_{L_{2}})^{k}\overline{L}_{2}(\eta_{0},\eta_{\mathcal{R}}^{\pi}). Further, the quality of the fixed point is characterized as L¯2(ηπ,ηπ)(1βL2)1L¯2(Π𝒞ηπ,ηπ)\overline{L}_{2}(\eta_{\mathcal{R}}^{\pi},\eta^{\pi})\leq(1-\beta_{L_{2}})^{-1}\overline{L}_{2}(\Pi_{\mathcal{C}}\eta^{\pi},\eta^{\pi}).

Proof.

The above theorem follows from Lemma E.1. Indeed, since Π𝒞\Pi_{\mathcal{C}} is a non-expansion in the supremum Cramér distance L¯2\overline{L}_{2} [17], the composed operator Π𝒞π,μ\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu} is βL2\beta_{L_{2}}-contractive in L¯2\overline{L}_{2}. Following the same argument as the proof of Theorem 5.1, we obtain the remaining desired results. ∎

The distributional Retrace operator also improves over the one-step distributional Bellman operator in two aspects: (1) the bound on the contraction rate βL2γ\beta_{L_{2}}\leq\sqrt{\gamma} is smaller, usually leading to faster contraction to the fixed point; (2) the bound on the quality of the fixed point is improved.

E.2 Cross-entropy update and C51-Retrace

Unlike in the quantile projection case, where calculating Π𝒬η\Pi_{\mathcal{Q}}\eta requires solving a quantile regression minimization problem, the categorical projection can be calculated in an analytic way [17, 10]. Assume the categorical distribution is parameterized as ηw(x,a)=i=1mpi(x,a;w)δzi\eta_{w}(x,a)=\sum_{i=1}^{m}p_{i}(x,a;w)\delta_{z_{i}}. After computing the back-up target distribution Π𝒞π,μη(x,a)\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a) for a given distribution vector η\eta, the algorithm carries out a gradient-based incremental update

wwαw𝔼[Π𝒞π,μη(x,a)|ηw(x,a)],\displaystyle w\leftarrow w-\alpha\nabla_{w}\mathbb{CE}\left[\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)|\eta_{w}(x,a)\right],

where 𝔼(p|q)ipilogqi\mathbb{CE}(p|q)\coloneqq-\sum_{i}p_{i}\log q_{i} denotes the cross-entropy between distributions pp and qq. For simplicity, we adopt the short-hand notation 𝔼(η|ηw)=𝔼w(η)\mathbb{CE}(\eta|\eta_{w})=\mathbb{CE}_{w}(\eta). Note also that in practice, η\eta can be a slowly updated copy of ηw\eta_{w} [33]. As such, the gradient-based update can be understood as approximating the iteration ηk+1=π,μηk\eta_{k+1}=\mathcal{R}^{\pi,\mu}\eta_{k}. We propose the following unbiased estimate of the cross-entropy, 𝔼^w[Π𝒞π,μη(x,a)]\widehat{\mathbb{CE}}_{w}\left[\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)\right], calculated as follows:

𝔼w(η(x,a))+t=0c1:t(𝔼w((bt+1)#η(Xt+1,At+1π))𝔼w((bt)#η(Xt,At))).\displaystyle\mathbb{CE}_{w}\left(\eta(x,a)\right)+\sum_{t=0}^{\infty}c_{1:t}\left(\mathbb{CE}_{w}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\mathbb{CE}_{w}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right).
Lemma E.3.

(Unbiased stochastic estimate for categorical update) Assume that the trajectory terminates within H<H<\infty steps almost surely, then we have 𝔼μ[𝔼^w(Π𝒞π,μη(x,a))]=𝔼w(Π𝒞π,μη(x,a))\mathbb{E}_{\mu}\left[\widehat{\mathbb{CE}}_{w}\left(\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right]=\mathbb{CE}_{w}\left(\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)\right). Without loss of generality, assume ww is a scalar parameter. If there exists a constant M>0M>0 such that |w𝔼w(η)|M,η𝒫()\left|\nabla_{w}\mathbb{CE}_{w}\left(\eta\right)\right|\leq M,\forall\eta\in\mathscr{P}_{\infty}(\mathbb{R}), then the gradient estimate is also unbiased 𝔼μ[w𝔼^w(Π𝒞π,μη(x,a))]=w𝔼w(Π𝒞π,μη(x,a))\mathbb{E}_{\mu}\left[\nabla_{w}\widehat{\mathbb{CE}}_{w}\left(\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right]=\nabla_{w}\mathbb{CE}_{w}\left(\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)\right).

Proof.

The cross-entropy 𝔼w(η)\mathbb{CE}_{w}(\eta) is defined for any distribution η\eta. For any signed measure ν=i=1mwiηi\nu=\sum_{i=1}^{m}w_{i}\eta_{i} with ηi𝒫()\eta_{i}\in\mathscr{P}_{\infty}(\mathbb{R}), we define the generalized cross-entropy as

𝔼w(ν)i=1mwi𝔼w(ηi),\displaystyle\mathbb{CE}_{w}\left(\nu\right)\coloneqq\sum_{i=1}^{m}w_{i}\mathbb{CE}_{w}\left(\eta_{i}\right),

Next, we note the cross-entropy is linear in the input distribution (or signed measure). In particular, for a set of NN (potentially infinite) coefficients and distributions (signed measures) (ai,ηi)(a_{i},\eta_{i}),

𝔼w(i=1Naiηi)=i=1Nai𝔼w(ηi).\displaystyle\mathbb{CE}_{w}\left(\sum_{i=1}^{N}a_{i}\eta_{i}\right)=\sum_{i=1}^{N}a_{i}\mathbb{CE}_{w}\left(\eta_{i}\right).

When (ai)(a_{i}) denotes a distribution, the above can be rewritten as 𝔼w(𝔼[ηi])=𝔼[𝔼w(ηi)]\mathbb{CE}_{w}\left(\mathbb{E}[\eta_{i}]\right)=\mathbb{E}[\mathbb{CE}_{w}(\eta_{i})]. Finally, combining everything together, we have that 𝔼μ[𝔼^w(Π𝒞π,μη(x,a))]\mathbb{E}_{\mu}\left[\widehat{\mathbb{CE}}_{w}\left(\Pi_{\mathcal{C}}\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right] evaluates to

=𝔼μ[𝔼w(η(x,a))+t=0c1:t(𝔼w((bt+1)#η(Xt+1,At+1π))𝔼w((bt)#η(Xt,At)))]\displaystyle=\mathbb{E}_{\mu}\left[\mathbb{CE}_{w}\left(\eta(x,a)\right)+\sum_{t=0}^{\infty}c_{1:t}\left(\mathbb{CE}_{w}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\mathbb{CE}_{w}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right)\right]
=(a)𝔼μ[𝔼w(^π,μη(x,a))]=(b)𝔼w(π,μη(x,a)).\displaystyle=_{(a)}\mathbb{E}_{\mu}\left[\mathbb{CE}_{w}\left(\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a)\right)\right]=_{(b)}\mathbb{CE}_{w}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right).

In the above, (a) follows from the definition of the generalized cross-entropy applied to the signed measure ^π,μη(x,a)\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a), and (b) follows from the linearity of the cross-entropy together with the fact that 𝔼μ[^π,μη(x,a)]=π,μη(x,a)\mathbb{E}_{\mu}[\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a)]=\mathcal{R}^{\pi,\mu}\eta(x,a).

Next, to show that the gradient estimate is also unbiased, the high-level idea is to apply the dominated convergence theorem (DCT) to justify the exchange of gradient and expectation [34]. This is similar to the quantile representation case (see the proof of Lemma 5.2). To this end, consider the absolute value of the gradient estimate |w𝔼^w(π,μη(x,a))|\left|\nabla_{w}\widehat{\mathbb{CE}}_{w}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right|, which serves as a dominating bound for the gradient estimate. In order to apply DCT, we need to show that the expectation of the absolute gradient is finite. Indeed, we have

𝔼μ[|w𝔼^w(π,μη(x,a))|]\displaystyle\mathbb{E}_{\mu}\left[\left|\nabla_{w}\widehat{\mathbb{CE}}_{w}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right|\right]
=𝔼μ[|w𝔼w(η(x,a))+t=0Hc1:t(w𝔼w((bt+1)#η(Xt+1,At+1π))w𝔼w((bt)#η(Xt,At)))|]\displaystyle=\mathbb{E}_{\mu}\left[\left|\nabla_{w}\mathbb{CE}_{w}\left(\eta(x,a)\right)+\sum_{t=0}^{H}c_{1:t}\left(\nabla_{w}\mathbb{CE}_{w}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\nabla_{w}\mathbb{CE}_{w}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right)\right|\right]
(a)𝔼μ[|w𝔼w(η(x,a))|+t=0Hc1:t|w𝔼w((bt+1)#η(Xt+1,At+1π))w𝔼w((bt)#η(Xt,At))|]\displaystyle\leq_{(a)}\mathbb{E}_{\mu}\left[\left|\nabla_{w}\mathbb{CE}_{w}\left(\eta(x,a)\right)\right|+\sum_{t=0}^{H}c_{1:t}\left|\nabla_{w}\mathbb{CE}_{w}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\nabla_{w}\mathbb{CE}_{w}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right|\right]
(b)𝔼μ[M+t=0Hρt2M]<,\displaystyle\leq_{(b)}\mathbb{E}_{\mu}\left[M+\sum_{t=0}^{H}\rho^{t}\cdot 2M\right]<\infty,

where (a) follows from the application of the triangle inequality; (b) follows from the assumed bound on the cross-entropy gradient against a fixed distribution, w𝔼w(ν)[M,M],ν𝒫()\nabla_{w}\mathbb{CE}_{w}\left(\nu\right)\in[-M,M],\forall\nu\in\mathscr{P}_{\infty}(\mathbb{R}).

Hence, with the application of DCT, we can exchange the gradient and expectation operators, which yields 𝔼μ[w𝔼^w(π,μη(x,a))]=w𝔼μ[𝔼^w(π,μη(x,a))]=w𝔼w(π,μη(x,a))\mathbb{E}_{\mu}\left[\nabla_{w}\widehat{\mathbb{CE}}_{w}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right]=\nabla_{w}\mathbb{E}_{\mu}\left[\widehat{\mathbb{CE}}_{w}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right]=\nabla_{w}\mathbb{CE}_{w}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right). ∎

We remark that the condition on the bounded gradient |w𝔼w(η)|M|\nabla_{w}\mathbb{CE}_{w}\left(\eta\right)|\leq M is not restrictive. When ηw\eta_{w} adopts a softmax parameterization and ww represents the logits, one can take M=1M=1.

Finally, the deep RL agent C51 parameterizes the categorical distribution pi(x,a;w)p_{i}(x,a;w) with a neural network with parameters ww, evaluated at each state-action pair (x,a)(x,a) [23]. When combined with the above algorithm, this produces C51-Retrace.

Appendix F Additional experiment details

In this section, we provide detailed information about experiment setups and additional results. All experiments are carried out in Python, using NumPy for numerical computations [35] and Matplotlib for visualization [36]. All deep RL experiments are carried out with Jax [37], specifically making use of the DeepMind Jax ecosystem [38].

F.1 Tabular

We provide additional details on the tabular RL experiments.

Setup.

We consider a tabular MDP with |𝒳|=3|\mathcal{X}|=3 states and |𝒜|=2|\mathcal{A}|=2 actions. The reward r(x,a)r(x,a) is deterministic; its value for each (x,a)(x,a) is drawn from a standard Gaussian distribution when the MDP is generated. The transition probability P(|x,a)P(\cdot|x,a) is sampled from a Dirichlet distribution with parameter (Γ,ΓΓ)(\Gamma,\Gamma...\Gamma) for Γ=0.5\Gamma=0.5. The discount factor is fixed as γ=0.9\gamma=0.9. The MDP has a starting state-action pair (x0,a0)(x_{0},a_{0}). The behavior policy μ\mu is a uniform policy. The target policy is generated as follows: we first sample a deterministic policy πd\pi_{d} and then compute π=(1ε)πd+εμ\pi=(1-\varepsilon)\pi_{d}+\varepsilon\mu, with the parameter ε\varepsilon controlling the level of off-policyness.

Quantile distribution and projection.

We use m=100m=100 atoms throughout the experiments. Assuming access to the MDP parameters (e.g., reward and transition probability), we can analytically compute the projection Π𝒬\Pi_{\mathcal{Q}} using a sorting algorithm. See [16, 10] for details.
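The following sketch (illustrative naming; it assumes the back-up distribution has already been reduced to a discrete distribution jqjδyj\sum_{j}q_{j}\delta_{y_{j}}) shows the sorting-based computation of Π𝒬\Pi_{\mathcal{Q}}: the ii-th output atom is the τi=(2i1)/(2m)\tau_{i}=(2i-1)/(2m) quantile of the input.

```python
import numpy as np

# Quantile projection of a discrete distribution sum_j probs[j] * delta_{atoms[j]}
# onto m equally weighted atoms, by reading off the tau_i = (2i - 1) / (2m) quantiles.
def quantile_projection(atoms, probs, m):
    order = np.argsort(atoms)
    sorted_atoms, cdf = atoms[order], np.cumsum(probs[order])
    taus = (2 * np.arange(m) + 1) / (2 * m)
    # F^{-1}(tau): first sorted atom whose cumulative probability reaches tau.
    idx = np.searchsorted(cdf, taus, side="left")
    return sorted_atoms[np.minimum(idx, len(atoms) - 1)]

atoms = np.array([0.0, 1.0, 2.0, 4.0])
probs = np.array([0.1, 0.4, 0.4, 0.1])
print(quantile_projection(atoms, probs, m=5))
```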

Evaluation metrics.

Let ηk=(π,μ)kη0\eta_{k}=(\mathcal{R}^{\pi,\mu})^{k}\eta_{0} be the kk-th iterate. We use a few different metrics in Figure 3. Given any particular distributional Retrace operator π,μ\mathcal{R}^{\pi,\mu}, there exists a fixed point to the composed operator Π𝒬π,μ\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}. Recall that we denote this distribution as ηπ\eta_{\mathcal{R}}^{\pi}. Fig 3(a)-(b) calculates the iterates’ distance from the fixed point, evaluated at (x0,a0)(x_{0},a_{0}).

Lp(ηk(x0,a0),ηπ(x0,a0)).\displaystyle L_{p}\left(\eta_{k}(x_{0},a_{0}),\eta_{\mathcal{R}}^{\pi}(x_{0},a_{0})\right).

Fig 3(c) calculates the distance from the projected target distribution Π𝒬ηπ\Pi_{\mathcal{Q}}\eta^{\pi}. Recall that Π𝒬ηπ\Pi_{\mathcal{Q}}\eta^{\pi} is in some sense the best possible approximation that the current quantile representation can obtain.

Lp(ηk(x0,a0),Π𝒬ηπ(x0,a0)).\displaystyle L_{p}\left(\eta_{k}(x_{0},a_{0}),\Pi_{\mathcal{Q}}\eta^{\pi}(x_{0},a_{0})\right).

F.2 Deep reinforcement learning

We provide additional details on the deep RL experiments.

Evaluation metrics.

For the ii-th of the 57 Atari games, we record the performance GiG_{i} of the agent at any given point in training. The normalized performance is computed as Zi=(GiUi)/(HiUi)Z_{i}=(G_{i}-U_{i})/(H_{i}-U_{i}), where HiH_{i} is the human performance and UiU_{i} is the performance of a random policy. The mean/median metric is then calculated as the mean or median over (Zi)i=157(Z_{i})_{i=1}^{57}.

The super human ratio is computed as the fraction of games with Zi1Z_{i}\geq 1, i.e., GiHiG_{i}\geq H_{i}, on which the agent obtains super human performance. Formally, it is computed as 157i=157𝕀[Zi1]\frac{1}{57}\sum_{i=1}^{57}\mathbb{I}[Z_{i}\geq 1].
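For concreteness, the following short sketch (with illustrative array names) computes the three aggregate metrics from per-game scores.

```python
import numpy as np

# Aggregate Atari metrics from per-game scores of the agent, a random policy, and humans.
def atari_metrics(raw, random_scores, human):
    z = (raw - random_scores) / (human - random_scores)   # human-normalized score per game
    return {
        "mean": float(np.mean(z)),
        "median": float(np.median(z)),
        "superhuman_ratio": float(np.mean(z >= 1.0)),
    }

rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 2000.0, size=57)        # hypothetical agent scores
random_scores = np.full(57, 100.0)             # hypothetical random-policy scores
human = np.full(57, 1000.0)                    # hypothetical human scores
print(atari_metrics(raw, random_scores, human))
```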

Shared properties of all baseline agents.

All baseline agents use the same torso architecture as DQN [33] and differ in the head outputs, which we specify below. All agents use an Adam optimizer [39] with a fixed learning rate; the optimization is carried out on mini-batches of size 32 sampled uniformly from the replay buffer. For exploration, the agent acts ε\varepsilon-greedy with respect to the induced Q-function, the details of which we specify below. The exploration parameter ε\varepsilon starts at εmax=1\varepsilon_{\text{max}}=1 and linearly decays to εmin=0.01\varepsilon_{\text{min}}=0.01 over training. At evaluation time, the agent adopts ε=0.001\varepsilon=0.001; the small exploration probability prevents the agent from getting stuck.

Details of baseline C51 agent.

The agent head outputs a matrix of size |𝒜|×m|\mathcal{A}|\times m, which represents the logits of (pi(x,a;θ))i=1m\left(p_{i}(x,a;\theta)\right)_{i=1}^{m}. The support (zi)i=1m(z_{i})_{i=1}^{m} is generated as a uniform grid over [VMAX,VMAX][-V_{\text{MAX}},V_{\text{MAX}}]. Although VMAXV_{\text{MAX}} should in theory be determined by RMAXR_{\text{MAX}}, in practice it has been found that setting VMAX=RMAX/(1γ)V_{\text{MAX}}=R_{\text{MAX}}/(1-\gamma) leads to highly sub-optimal performance. This is potentially because the random returns are usually far from the extreme value RMAX/(1γ)R_{\text{MAX}}/(1-\gamma), so it is better to set VMAXV_{\text{MAX}} to a smaller value. Here, we set VMAX=10V_{\text{MAX}}=10 and m=51m=51. For details of other hyperparameters, see [6]. The induced Q-function is computed as Qθ(x,a)=i=1mpi(x,a;θ)ziQ_{\theta}(x,a)=\sum_{i=1}^{m}p_{i}(x,a;\theta)z_{i}.

Details of baseline QR-DQN agent.

The agent head outputs a matrix of size |𝒜|×m|\mathcal{A}|\times m, which represents the quantile locations (zi(x,a;θ))i=1m\left(z_{i}(x,a;\theta)\right)_{i=1}^{m}. Here, we set m=201m=201. For details of other hyperparameters, see [16]. The induced Q-function is computed as Qθ(x,a)=1mi=1mzi(x,a;θ)Q_{\theta}(x,a)=\frac{1}{m}\sum_{i=1}^{m}z_{i}(x,a;\theta).

Details of multi-step agents.

Multi-step variants use exactly the same hyperparameters as the one-step baseline agent. The only difference is that the agent uses multi-step back-up targets.

The agent stores partial trajectories (Xt,At,Rt,xt)t=0n1μ(X_{t},A_{t},R_{t},x_{t})_{t=0}^{n-1}\sim\mu generated under the behavior policy. Here, the behavior policy μ\mu is the ε\varepsilon-greedy policy with respect to a potentially old Q-function (this is because the data at training time is sampled from the replay); the target policy π\pi is the greedy policy with respect to the current Q-function.

(a) C51
(b) QR-DQN
Figure 6: Deep RL experiments on Atari-57 games for (a) C51 and (b) QR-DQN. We compare the one-step baseline agent against the multi-step variants (Retrace and uncorrected nn-step). For all multi-step variants, we use n=3n=3. For each agent, we calculate the mean, median and super human ratio performance across all games, and we plot the mean±standard error\text{mean}\pm\text{standard\ error} across 3 seeds. In almost all settings, the multi-step variants provide a clear advantage over the one-step baseline algorithm.
(a) C51
(b) QR-DQN
Figure 7: Deep RL experiments on Atari-57 games for (a) C51 and (b) QR-DQN, with the same setup as in Figure 6. Here, we compute the interquartile mean (IQM) with 95% bootstrapped confidence intervals [40]. In a nutshell, IQM calculates the mean scores after removing extreme score values, making the performance statistics more robust. Even after excluding extreme scores, Retrace obtains favorable performance compared to the uncorrected nn-step and one-step algorithms.

Appendix G Proofs

To simplify the proof, we assume that the immediate random reward takes a finite number of values. It is straightforward to generalize results to the case where the reward takes an infinite number of values (e.g., the random reward has a continuous distribution).

Assumption G.1.

(Reward takes a finite number of values) For all state-action pair (x,a)(x,a), we assume the random reward R(x,a)R(x,a) takes a finite number of values. Let R~\widetilde{R} be the finite set of values that the reward {R(x,a),(x,a)𝒳×𝒜}\{R(x,a),(x,a)\in\mathcal{X}\times\mathcal{A}\} can take.

For any integer t1t\geq 1, let R~t\widetilde{R}^{t} denote the Cartesian product of tt copies of R~\widetilde{R}:

R~tR~×R~××R~tcopies of R~.\displaystyle\widetilde{R}^{t}\coloneqq\underbrace{\widetilde{R}\times\widetilde{R}\times...\times\widetilde{R}}_{t\ \text{copies\ of\ }\widetilde{R}}.

For any fixed tt, we let r0:t1r_{0:t-1} denote the sequence of realizable rewards from time 0 to time t1t-1. Since R~\widetilde{R} is a finite set, R~t\widetilde{R}^{t} is also a finite set.

See 3.1

Proof.

In general, ct=c(Ft,At)c_{t}=c(F_{t},A_{t}) where FtF_{t} is a filtration of (Xs,As)s=0t(X_{s},A_{s})_{s=0}^{t}. To start with, we assume ct=c(Xt,At)c_{t}=c(X_{t},A_{t}) is a Markovian trace coefficient [13]. We start with this simpler case because it greatly simplifies the notation; the argument extends to the general case with some care. We discuss the extension to the general case where ct=c(Ft,At)c_{t}=c(F_{t},A_{t}) towards the end of the proof.

For all t1t\geq 1, we define the coefficient

wy,b,r0:t1𝔼μ[c1ct1(π(b|Xt)c(Xt,b)μ(b|Xt))𝕀[Xt=y]Πs=0t1𝕀[Rs=rs]].\displaystyle w_{y,b,r_{0:t-1}}\coloneqq\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}\left(\pi(b|X_{t})-c(X_{t},b)\mu(b|X_{t})\right)\cdot\mathbb{I}[X_{t}=y]\Pi_{s=0}^{t-1}\mathbb{I}[R_{s}=r_{s}]\right].

Through careful algebra, we can rewrite the Retrace operator as follows

π,μη(x,a)=t=1y𝒳b𝒜r0:t1R~twy,b,r0:t1(bG0:t1,γt)#η(y,b).\displaystyle\mathcal{R}^{\pi,\mu}\eta(x,a)=\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,r_{0:t-1}}\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(y,b).

Note that each term of the form (bG0:t1,γt)#η(y,b)\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(y,b) corresponds to applying a pushforward operation (bG0:t1,γt)#\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#} on the distribution η(y,b)\eta(y,b), which means (bG0:t1,γt)#η(y,b)𝒫()\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(y,b)\in\mathscr{P}_{\infty}(\mathbb{R}). Now that we have expressed π,μη(x,a)\mathcal{R}^{\pi,\mu}\eta(x,a) as a linear combination of distributions, we proceed to show that the combination is in fact convex.

Under the assumption ct[0,ρt]c_{t}\in[0,\rho_{t}], we have π(b|y)c(y,b)μ(b|y)0\pi(b|y)-c(y,b)\mu(b|y)\geq 0 for all (y,b)𝒳×𝒜(y,b)\in\mathcal{X}\times\mathcal{A}. Therefore, all weights are non-negative. Next, we examine the sum of all coefficients wy,b,r0:t1=t=1y𝒳b𝒜r0:t1R~twy,b,r0:t1\sum w_{y,b,r_{0:t-1}}=\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,r_{0:t-1}}.

wy,b,r0:t1\displaystyle\sum w_{y,b,r_{0:t-1}} =(a)t=1y𝒳b𝒜𝔼μ[c1ct1(π(b|Xt)c(Xt,b)μ(b|Xt))𝕀[Xt=y]]\displaystyle=_{(a)}\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}\left(\pi(b|X_{t})-c(X_{t},b)\mu(b|X_{t})\right)\cdot\mathbb{I}[X_{t}=y]\right]
=(b)t=1𝔼μ[c1ct1(1ct)]=(c)1.\displaystyle=_{(b)}\sum_{t=1}^{\infty}\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}(1-c_{t})\right]=_{(c)}1.

In the above, (a) follows from the fact that rsR~𝔼[𝕀[Rs=rs]]=1\sum_{r_{s}\in\widetilde{R}}\mathbb{E}[\mathbb{I}[R_{s}=r_{s}]]=1; (b) follows from the fact that for all time steps t1t\geq 1, the following is true,

y𝒳b𝒜𝔼μ[c1ct1(π(b|Xt)c(Xt,b)μ(b|Xt))𝕀[Xt=y]]\displaystyle\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}\left(\pi(b|X_{t})-c(X_{t},b)\mu(b|X_{t})\right)\cdot\mathbb{I}[X_{t}=y]\right]
=b𝒜𝔼μ[c1ct1(π(b|Xt)c(Xt,b)μ(b|Xt))]\displaystyle=\sum_{b\in\mathcal{A}}\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}\left(\pi(b|X_{t})-c(X_{t},b)\mu(b|X_{t})\right)\right]
=𝔼μ[c1ct1(1b𝒜c(Xt,b)μ(b|Xt))]\displaystyle=\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}\left(1-\sum_{b\in\mathcal{A}}c(X_{t},b)\mu(b|X_{t})\right)\right]
=𝔼μ[c1ct1(1ct)].\displaystyle=\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}(1-c_{t})\right].

Finally, (c) is based on the observation that the summation telescopes: since 𝔼μ[c1ct1(1ct)]=𝔼μ[c1ct1]𝔼μ[c1ct]\mathbb{E}_{\mu}\left[c_{1}\cdots c_{t-1}(1-c_{t})\right]=\mathbb{E}_{\mu}\left[c_{1}\cdots c_{t-1}\right]-\mathbb{E}_{\mu}\left[c_{1}\cdots c_{t}\right], the partial sums collapse to 1𝔼μ[c1cT]1-\mathbb{E}_{\mu}\left[c_{1}\cdots c_{T}\right], which tends to 11. Now, take the index set to be the set of indices that parameterize wy,b,r0:t1w_{y,b,r_{0:t-1}}:

I(x,a)=t=1(y,b,r0:t1)y𝒳,b𝒜,r0:t1R~t.\displaystyle I(x,a)=\cup_{t=1}^{\infty}\left(y,b,r_{0:t-1}\right)_{y\in\mathcal{X},b\in\mathcal{A},r_{0:t-1}\in\widetilde{R}^{t}}.

We can write π,μη(x,a)=iI(x,a)wiηi\mathcal{R}^{\pi,\mu}\eta(x,a)=\sum_{i\in I(x,a)}w_{i}\eta_{i}. Note further that for any iI(x,a)i\in I(x,a), ηi=(bG0:t1,γt)#η(y,b)\eta_{i}=(\textrm{b}_{G_{0:t-1},\gamma^{t}})_{\#}\eta(y,b) is a fixed distribution. The above result suggests that π,μη(x,a)\mathcal{R}^{\pi,\mu}\eta(x,a) is a convex combination of fixed distributions.

Extension to the general case.

When ct=c(Ft,At)c_{t}=c(F_{t},A_{t}) is filtration-dependent, we let t\mathcal{F}_{t} be the space of filtration values up to time tt. For notational simplicity, we assume t\mathcal{F}_{t} contains a finite number of elements, so that below we can adopt summation notation instead of integrals. Define the combination coefficient

wy,b,ft,r0:t1𝔼μ[c1ct1(π(b|Xt)c(Ft,b)μ(b|Xt))𝕀[Xt=y]Πs=0t1𝕀[Rs=rs]].\displaystyle w_{y,b,f_{t},r_{0:t-1}}\coloneqq\mathbb{E}_{\mu}\left[c_{1}...c_{t-1}\left(\pi(b|X_{t})-c(F_{t},b)\mu(b|X_{t})\right)\cdot\mathbb{I}[X_{t}=y]\Pi_{s=0}^{t-1}\mathbb{I}[R_{s}=r_{s}]\right].

It is straightforward to verify the following

π,μη(x,a)=t=1y𝒳b𝒜fttr0:t1R~twy,b,ft,r0:t1(bG0:t1,γt)#η(y,b).\displaystyle\mathcal{R}^{\pi,\mu}\eta(x,a)=\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{f_{t}\in\mathcal{F}_{t}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,f_{t},r_{0:t-1}}\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(y,b).

In addition, the combination coefficients wy,b,ft,r0:t1w_{y,b,f_{t},r_{0:t-1}} sum to 11 and are all non-negative. ∎

See 3.2

Proof.

From the proof of Lemma 3.1, we have

π,μη(x,a)=t=1y𝒳b𝒜r0:t1R~twy,b,r0:t1(bG0:t1,γt)#η(y,b).\displaystyle\mathcal{R}^{\pi,\mu}\eta(x,a)=\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,r_{0:t-1}}\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta(y,b).

Now, for any η1,η2𝒫()𝒳×𝒜\eta_{1},\eta_{2}\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} and any fixed (x,a)(x,a), the distance Wp(π,μη1(x,a),π,μη2(x,a))W_{p}\left(\mathcal{R}^{\pi,\mu}\eta_{1}(x,a),\mathcal{R}^{\pi,\mu}\eta_{2}(x,a)\right) is upper bounded as follows:

(a)t=1y𝒳b𝒜r0:t1R~twy,b,r0:t1Wp((bs=0t1γsrs,γt)#η1(y,b),(bs=0t1γsrs,γt)#η2(y,b))\displaystyle\leq_{(a)}\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,r_{0:t-1}}W_{p}\left(\left(\textrm{b}_{\sum_{s=0}^{t-1}\gamma^{s}r_{s},\gamma^{t}}\right)_{\#}\eta_{1}(y,b),\left(\textrm{b}_{\sum_{s=0}^{t-1}\gamma^{s}r_{s},\gamma^{t}}\right)_{\#}\eta_{2}(y,b)\right)
(b)t=1y𝒳b𝒜r0:t1R~twy,b,r0:t1γtWp(η1(y,b),η2(y,b))\displaystyle\leq_{(b)}\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,r_{0:t-1}}\gamma^{t}W_{p}\left(\eta_{1}(y,b),\eta_{2}(y,b)\right)
(c)t=1y𝒳b𝒜r0:t1R~twy,b,r0:t1γtW¯p(η1,η2)\displaystyle\leq_{(c)}\sum_{t=1}^{\infty}\sum_{y\in\mathcal{X}}\sum_{b\in\mathcal{A}}\sum_{r_{0:t-1}\in\widetilde{R}^{t}}w_{y,b,r_{0:t-1}}\gamma^{t}\overline{W}_{p}\left(\eta_{1},\eta_{2}\right)

In the above, (a) follows by applying the convexity of the pp-Wasserstein distance [10]; (b) follows from the contraction property of the pushforward operation under WpW_{p} [10]; (c) follows from the definition of W¯p\overline{W}_{p}. By taking the maximum over (x,a)(x,a) on both sides of the inequality and using the definition of β\beta, we obtain

W¯p(π,μη1,π,μη2)βW¯p(η1,η2).\displaystyle\overline{W}_{p}(\mathcal{R}^{\pi,\mu}\eta_{1},\mathcal{R}^{\pi,\mu}\eta_{2})\leq\beta\overline{W}_{p}(\eta_{1},\eta_{2}).

This concludes the proof. ∎

Lemma G.2.

For any fixed (x,a)(x,a) and scalar cc\in\mathbb{R},

(bc,1)#ηπ(x,a)=𝔼π[(bc+R0,γ)#ηπ(X1,A1)|X0=x,A0=a].\displaystyle\left(\textrm{b}_{c,1}\right)_{\#}\eta^{\pi}(x,a)=\mathbb{E}_{\pi}\left[\left(\textrm{b}_{c+R_{0},\gamma}\right)_{\#}\eta^{\pi}(X_{1},A_{1})\;\middle|\;X_{0}=x,A_{0}=a\right]. (6)
Proof.

Let By{x|x<y}B_{y}\coloneqq\{x\in\mathbb{R}\,|\,x<y\} be a subset of \mathbb{R} indexed by yy\in\mathbb{R}. Since the collection of all such sets {By,y}\{B_{y},y\in\mathbb{R}\} generates the Borel sigma-field of \mathbb{R} [34], if we can show that two measures η1,η2\eta_{1},\eta_{2} satisfy

η1(By)=η2(By),y\displaystyle\eta_{1}\left(B_{y}\right)=\eta_{2}\left(B_{y}\right),\forall y

then, η1(B)=η2(B)\eta_{1}(B)=\eta_{2}(B) for all Borel sets in \mathbb{R}. Hence, in the following, we seek to show

((bc,1)#ηπ(x,a))By=(𝔼π[(bc+R0,γ)#ηπ(X1,A1)])By,y\displaystyle\left(\left(\textrm{b}_{c,1}\right)_{\#}\eta^{\pi}(x,a)\right)B_{y}=\left(\mathbb{E}_{\pi}\left[\left(\textrm{b}_{c+R_{0},\gamma}\right)_{\#}\eta^{\pi}(X_{1},A_{1})\right]\right)B_{y},\forall y\in\mathbb{R} (7)

Let Fπ(y;x,a)Pπ(Gπ(x,a)y)=ηπ(x,a)(By),yF^{\pi}(y;x,a)\coloneqq P^{\pi}(G^{\pi}(x,a)\leq y)=\eta^{\pi}(x,a)(B_{y}),y\in\mathbb{R} be the CDF of random variable Gπ(x,a)G^{\pi}(x,a). The distributional Bellman equation in Equation (1) implies

Fπ(y;x,a)=𝔼π[Fπ(yR0γ;X1,A1)],y.\displaystyle F^{\pi}(y;x,a)=\mathbb{E}_{\pi}\left[F^{\pi}\left(\frac{y-R_{0}}{\gamma};X_{1},A_{1}\right)\right],\forall y\in\mathbb{R}.

For any constant cc\in\mathbb{R}, let y=ycy=y^{\prime}-c and plug into the above equality,

Fπ(yc;x,a)=𝔼π[Fπ(ycR0γ;X1,A1)],y.\displaystyle F^{\pi}(y^{\prime}-c;x,a)=\mathbb{E}_{\pi}\left[F^{\pi}\left(\frac{y^{\prime}-c-R_{0}}{\gamma};X_{1},A_{1}\right)\right],\forall y^{\prime}\in\mathbb{R}.

Note the LHS is ((bc,1)#ηπ(x,a))(By)\left((\textrm{b}_{c,1})_{\#}\eta^{\pi}(x,a)\right)(B_{y^{\prime}}) while the RHS is (𝔼π[(bc+R0,γ)#ηπ(X1,A1)])(By)\left(\mathbb{E}_{\pi}\left[\left(\textrm{b}_{c+R_{0},\gamma}\right)_{\#}\eta^{\pi}(X_{1},A_{1})\right]\right)(B_{y^{\prime}}). This implies that Equation (7) holds and we conclude the proof. ∎

See 3.3

Proof.

To verify that ηπ\eta^{\pi} is a fixed point, it is equivalent to show

𝔼μ[t=0nc1:t((bG0:t,γt+1)#ηπ(Xt+1,At+1π)(bG0:t1,γt)#ηπ(Xt,At))]=𝟎.\displaystyle\mathbb{E}_{\mu}\left[\sum_{t=0}^{n}c_{1:t}\left(\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta^{\pi}(X_{t+1},A_{t+1}^{\pi})-\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t})\right)\right]=\mathbf{0}.

Here, the RHS term 𝟎\mathbf{0} denotes the zero measure, a measure such that for all Borel sets BB\subset\mathbb{R}, 𝟎(B)=0\mathbf{0}(B)=0. We now verify that each summand is a zero measure, i.e.,

𝔼μ[c1:t((bG0:t,γt+1)#ηπ(Xt+1,At+1π)(bG0:t1,γt)#ηπ(Xt,At))]=𝟎.\displaystyle\mathbb{E}_{\mu}\left[c_{1:t}\left(\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta^{\pi}(X_{t+1},A_{t+1}^{\pi})-\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t})\right)\right]=\mathbf{0}.

To see this, we follow the derivation below,

𝔼μ[c1:t((bG0:t,γt+1)#ηπ(Xt+1,At+1π)(bG0:t1,γt)#ηπ(Xt,At))]\displaystyle\mathbb{E}_{\mu}\left[c_{1:t}\left(\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta^{\pi}(X_{t+1},A_{t+1}^{\pi})-\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t})\right)\right]
=(a)𝔼[𝔼[c1:t[((bG0:t,γt+1)#ηπ(Xt+1,At+1π)](bG0:t1,γt)#ηπ(Xt,At))|(Xs,As,Rs1)s=1t]]\displaystyle=_{(a)}\mathbb{E}\left[\mathbb{E}\left[c_{1:t}\left[\left(\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta^{\pi}(X_{t+1},A_{t+1}^{\pi})\right]-\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t})\right)\;\middle|\;\left(X_{s},A_{s},R_{s-1}\right)_{s=1}^{t}\right]\right]
=(b)𝔼[c1:t𝔼[(bG0:t,γt+1)#ηπ(Xt+1,At+1π)|(Xs,As,Rs1)s=1t]c1:t(bG0:t1,γt)#ηπ(Xt,At)]\displaystyle=_{(b)}\mathbb{E}\left[c_{1:t}\mathbb{E}\left[\left(\textrm{b}_{G_{0:t},\gamma^{t+1}}\right)_{\#}\eta^{\pi}(X_{t+1},A_{t+1}^{\pi})\;\middle|\;\left(X_{s},A_{s},R_{s-1}\right)_{s=1}^{t}\right]-c_{1:t}\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t})\right]
=(c)𝔼[c1:t𝔼[(bG0:t1+γtRt,γt+1)#ηπ(Xt+1,At+1π)|(Xs,As,Rs1)s=1t]first termc1:t(bG0:t1,γt)#ηπ(Xt,At)].\displaystyle=_{(c)}\mathbb{E}\left[c_{1:t}\underbrace{\mathbb{E}\left[\left(\textrm{b}_{G_{0:t-1}+\gamma^{t}R_{t},\gamma^{t+1}}\right)_{\#}\eta^{\pi}(X_{t+1},A_{t+1}^{\pi})\;\middle|\;\left(X_{s},A_{s},R_{s-1}\right)_{s=1}^{t}\right]}_{\text{first term}}-c_{1:t}\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t})\right]. (8)

In the above, in (a) we condition on (Xs,As,Rs1)s=1t(X_{s},A_{s},R_{s-1})_{s=1}^{t} and the equality follows from the tower property of expectations; in (b), we use the fact that the trace product c1:tc_{1:t} and (bG0:t1,γt)#ηπ(Xt,At)\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t}) are deterministic functions of the conditioning variables (Xs,As,Rs1)s=1t(X_{s},A_{s},R_{s-1})_{s=1}^{t}; in (c), we split the partial return as G0:t=G0:t1+γtRtG_{0:t}=G_{0:t-1}+\gamma^{t}R_{t}. Now we examine the first term in Equation (8). By applying Lemma G.2, we have

first term=(bG0:t1,γt)#ηπ(Xt,At).\displaystyle\text{first\ term}=\left(\textrm{b}_{G_{0:t-1},\gamma^{t}}\right)_{\#}\eta^{\pi}(X_{t},A_{t}).

This implies Equation (8) evaluates to a zero measure. Hence ηπ\eta^{\pi} is a fixed point of the operator π,μ\mathcal{R}^{\pi,\mu}. Because π,μ\mathcal{R}^{\pi,\mu} is also contractive by Proposition 3.2, the fixed point is unique. ∎

See 5.1

Proof.

The quantile projection Π𝒬\Pi_{\mathcal{Q}} is a non-expansion under W¯\overline{W}_{\infty} [16]. Since π,μ\mathcal{R}^{\pi,\mu} is β\beta-contractive under W¯p\overline{W}_{p} for all p1p\geq 1, the composed operator Π𝒬π,μ\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu} is β\beta-contractive under W¯\overline{W}_{\infty}. Now, because (1) Π𝒬π,μη𝒫𝒬()𝒳×𝒜\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}\eta\in\mathscr{P}_{\mathcal{Q}}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} for any η𝒫()𝒳×𝒜\eta\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}; (2) the space 𝒫𝒬()𝒳×𝒜\mathscr{P}_{\mathcal{Q}}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} is closed [10]; (3) the operator is contractive, the iterates ηk=(Π𝒬π,μ)kη0\eta_{k}=\left(\Pi_{\mathcal{Q}}\mathcal{R}^{\pi,\mu}\right)^{k}\eta_{0} converge to a limiting distribution ηπ𝒫()𝒳×𝒜\eta_{\mathcal{R}}^{\pi}\in\mathscr{P}_{\infty}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}. Finally, by Proposition 5.28 in [10], we have W¯(ηπ,ηπ)(1β)1W¯(Π𝒬ηπ,ηπ)\overline{W}_{\infty}(\eta_{\mathcal{R}}^{\pi},\eta^{\pi})\leq(1-\beta)^{-1}\overline{W}_{\infty}(\Pi_{\mathcal{Q}}\eta^{\pi},\eta^{\pi}). ∎

See 5.2

Proof.

The QR loss Lθτ(η)L_{\theta}^{\tau}(\eta) is defined for any distribution η\eta and scalar parameter θ\theta. Let ν=i=1mwiηi\nu=\sum_{i=1}^{m}w_{i}\eta_{i} be the linear combination of distributions (ηi)i=1m(\eta_{i})_{i=1}^{m} where wiw_{i}s are potentially negative coefficients. In this case, ν\nu is a signed measure. We define the generalized QR loss for ν\nu as the linear combination of QR losses against ηi\eta_{i} weighted by wiw_{i},

Lθτ(ν)i=1mwiLθτ(ηi).\displaystyle L_{\theta}^{\tau}(\nu)\coloneqq\sum_{i=1}^{m}w_{i}L_{\theta}^{\tau}(\eta_{i}).

Next, we note that the QR loss is linear in the input distribution (or signed measure). This means given any (potentially infinite) set of NN distributions or signed measures νi\nu_{i} with coefficients aia_{i},

Lθτ(i=1Naiνi)=i=1NaiLθτ(νi).\displaystyle L_{\theta}^{\tau}\left(\sum_{i=1}^{N}a_{i}\nu_{i}\right)=\sum_{i=1}^{N}a_{i}L_{\theta}^{\tau}(\nu_{i}).

When (ai)i=1N(a_{i})_{i=1}^{N} denotes a distribution, the above is equivalently expressed as an exchange between the expectation and the QR loss, Lθτ(𝔼[νi])=𝔼[Lθτ(νi)]L_{\theta}^{\tau}(\mathbb{E}[\nu_{i}])=\mathbb{E}[L_{\theta}^{\tau}(\nu_{i})]. For notational convenience, we let θ=zi(x,a)\theta=z_{i}(x,a) and τ=τi\tau=\tau_{i}. Because the trajectory terminates within HH steps almost surely and c1:tρ1:tρHc_{1:t}\leq\rho_{1:t}\leq\rho^{H}, where ρmaxx𝒳,a𝒜π(a|x)μ(a|x)\rho\coloneqq\max_{x\in\mathcal{X},a\in\mathcal{A}}\frac{\pi(a|x)}{\mu(a|x)}, the estimate L^θτ(π,μη(x,a))\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right) is finite almost surely. Combining all of the above, we obtain

𝔼μ[L^θτ(π,μη(x,a))]\displaystyle\mathbb{E}_{\mu}\left[\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right] =𝔼μ[Lθτ(η(x,a))+t=0c1:t(Lθτ((bt+1)#η(Xt+1,At+1π))Lθτ((bt)#η(Xt,At)))]=\mathbb{E}_{\mu}\left[L_{\theta}^{\tau}\left(\eta(x,a)\right)+\sum_{t=0}^{\infty}c_{1:t}\left(L_{\theta}^{\tau}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-L_{\theta}^{\tau}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right)\right]
=(a)𝔼μ[Lθτ(^π,μη(x,a))]=(b)Lθτ(𝔼μ[^π,μη(x,a)])=Lθτ(π,μη(x,a)).\displaystyle=_{(a)}\mathbb{E}_{\mu}\left[L_{\theta}^{\tau}\left(\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a)\right)\right]=_{(b)}L_{\theta}^{\tau}\left(\mathbb{E}_{\mu}\left[\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a)\right]\right)=L_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right).

In the above, (a) follows from the definition of the generalized QR loss against the signed measure ^π,μη(x,a)\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a); (b) follows from the linearity of the QR loss and the fact that 𝔼μ[^π,μη(x,a)]=π,μη(x,a)\mathbb{E}_{\mu}[\widehat{\mathcal{R}}^{\pi,\mu}\eta(x,a)]=\mathcal{R}^{\pi,\mu}\eta(x,a).

Next, to show that the gradient estimate is also unbiased, the high-level idea is to apply the dominated convergence theorem (DCT) to justify the exchange of gradient and expectation [34]. Since the expected QR loss gradient θLθτ(π,μη(x,a))\nabla_{\theta}L_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right) exists, we deduce that the estimate θL^θτ(π,μη(x,a))\nabla_{\theta}\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right) exists almost surely. Consider the absolute value of the gradient estimate |θL^θτ(π,μη(x,a))|\left|\nabla_{\theta}\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right|, which serves as a dominating bound for the gradient estimate. In order to apply DCT, we need to show that the expectation of the absolute gradient is finite. Indeed, we have

𝔼μ[|θL^θτ(π,μη(x,a))|]\displaystyle\mathbb{E}_{\mu}\left[\left|\nabla_{\theta}\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right|\right]
=𝔼μ[|θLθτ(η(x,a))+t=0Hc1:t(θLθτ((bt+1)#η(Xt+1,At+1π))θLθτ((bt)#η(Xt,At)))|]\displaystyle=\mathbb{E}_{\mu}\left[\left|\nabla_{\theta}L_{\theta}^{\tau}\left(\eta(x,a)\right)+\sum_{t=0}^{H}c_{1:t}\left(\nabla_{\theta}L_{\theta}^{\tau}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\nabla_{\theta}L_{\theta}^{\tau}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right)\right|\right]
(a)𝔼μ[|θLθτ(η(x,a))|+t=0Hc1:t|θLθτ((bt+1)#η(Xt+1,At+1π))θLθτ((bt)#η(Xt,At))|]\displaystyle\leq_{(a)}\mathbb{E}_{\mu}\left[\left|\nabla_{\theta}L_{\theta}^{\tau}\left(\eta(x,a)\right)\right|+\sum_{t=0}^{H}c_{1:t}\left|\nabla_{\theta}L_{\theta}^{\tau}\left(\left(\textrm{b}_{t+1}\right)_{\#}\eta\left(X_{t+1},A_{t+1}^{\pi}\right)\right)-\nabla_{\theta}L_{\theta}^{\tau}\left(\left(\textrm{b}_{t}\right)_{\#}\eta(X_{t},A_{t})\right)\right|\right]
(b)𝔼μ[1+t=0Hρt2]<,\displaystyle\leq_{(b)}\mathbb{E}_{\mu}\left[1+\sum_{t=0}^{H}\rho^{t}\cdot 2\right]<\infty,

where (a) follows from the application of triangle inequality; (b) follows from the fact that the QR loss gradient against a fixed distribution is bounded θLθτ(ν)[1,1],ν𝒫()\nabla_{\theta}L_{\theta}^{\tau}\left(\nu\right)\in[-1,1],\forall\nu\in\mathscr{P}_{\infty}(\mathbb{R}) [16].

With the application of DCT, we can exchange the gradient and expectation operator, which yields 𝔼μ[θL^θτ(π,μη(x,a))]=θ𝔼μ[L^θτ(π,μη(x,a))]=θLθτ(π,μη(x,a))\mathbb{E}_{\mu}\left[\nabla_{\theta}\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right]=\nabla_{\theta}\mathbb{E}_{\mu}\left[\widehat{L}_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right)\right]=\nabla_{\theta}L_{\theta}^{\tau}\left(\mathcal{R}^{\pi,\mu}\eta(x,a)\right). ∎