
Variance Control for Distributional Reinforcement Learning

Qi Kuang    Zhoufan Zhu    Liwen Zhang    Fan Zhou
Abstract

Although distributional reinforcement learning (DRL) has been widely examined in the past few years, very few studies investigate the validity of the obtained Q-function estimator in the distributional setting. To fully understand how the approximation errors of the Q-function affect the whole training process, we do some error analysis and theoretically show how to reduce both the bias and the variance of the error terms. With this new understanding, we construct a new estimator Quantiled Expansion Mean (QEM) and introduce a new DRL algorithm (QEMRL) from the statistical perspective. We extensively evaluate our QEMRL algorithm on a variety of Atari and Mujoco benchmark tasks and demonstrate that QEMRL achieves significant improvement over baseline algorithms in terms of sample efficiency and convergence performance.


1 Introduction

Distributional Reinforcement Learning (DRL) algorithms have been shown to achieve state-of-the-art performance on RL benchmark tasks (Bellemare et al., 2017; Dabney et al., 2018b, a; Yang et al., 2019; Zhou et al., 2020, 2021). The core idea of DRL is to estimate the entire distribution of the future return instead of its expected value, i.e., the Q-function. This captures the intrinsic uncertainty of the whole process from three sources: (i) the stochasticity of rewards, (ii) the indeterminacy of the policy, and (iii) the inherent randomness of the transition dynamics. Existing DRL algorithms parameterize the return distribution in different ways, including categorical return atoms (Bellemare et al., 2017), expectiles (Rowland et al., 2019), particles (Nguyen-Tang et al., 2021), and quantiles (Dabney et al., 2018b, a). Among these works, quantile-based algorithms are widely used due to their simplicity, training efficiency, and flexibility in modeling the return distribution.

Although the existing quantile-based algorithms achieve remarkable empirical success, the approximated distribution still requires further understanding and investigation. One aspect is the crossing issue, namely, a violation of the monotonicity of the obtained quantile estimates. Zhou et al. (2020, 2021) solve this issue by enforcing the monotonicity of the estimated quantiles using well-designed neural networks. However, these methods may suffer from underestimation or overestimation issues; in other words, the estimated quantiles tend to be higher or lower than their true values. Considering this shortcoming, Luo et al. (2021) apply monotonic rational-quadratic splines to ensure monotonicity, but their algorithm is computationally expensive and hard to implement in large-scale tasks.

Another aspect concerns the tail behavior of the return distribution. It is widely acknowledged that the precision of tail estimation depends heavily on the frequency of tail observations (Koenker, 2005). Due to data sparsity, quantile estimation is often unstable at the tails. To alleviate this instability, Kuznetsov et al. (2020) propose to truncate the right tail of the approximated return distribution by discarding some of the topmost atoms. However, this approach lacks theoretical support and ignores the potentially useful information hidden in the tail.

The crossing issue and tail unrealization illustrate that there is a substantial gap between the quantile estimation and its true value. This finding reduces the reliability of the Q-function estimator obtained by quantile-based algorithms and inspires us to further minimize the difference between the estimated Q-function and its true value. In particular, the error associated with Q-function approximation can be decomposed into three parts:

$$
\begin{aligned}
\Delta &\equiv Q^{\pi}_{\theta}(x,a)-Q^{\pi}(x,a)=\mathbb{E}Z^{\pi}_{\theta}(x,a)-\mathbb{E}Z^{\pi}(x,a)\\
&=\underbrace{\mathbb{E}Z^{\pi}_{\theta}(x,a)-\mathbb{E}_{x^{\prime}\sim\mathcal{D}}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]}_{\text{Target Approximation Error}~\mathcal{E}_{1}}\\
&\quad+\underbrace{\mathbb{E}_{x^{\prime}\sim\mathcal{D}}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]-\mathbb{E}_{x^{\prime}\sim P}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]}_{\text{Bellman Operator Approximation Error}~\mathcal{E}_{2}}\\
&\quad+\underbrace{\mathbb{E}_{x^{\prime}\sim P}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]-\mathbb{E}Z^{\pi}(x,a)}_{\text{Parametrization Induced Error}~\mathcal{E}_{3}}, \qquad (1)
\end{aligned}
$$

where $Q^{\pi}(\cdot)$ is the true Q-function, $Q^{\pi}_{\theta}(\cdot)$ is the approximated Q-function, $Z^{\pi}$ is the random variable with the true return distribution, $Z_{\theta}^{\pi}$ is the random variable with the approximated quantile function parameterized by a set of quantiles $\theta$, $\mathcal{D}$ is the replay buffer, and $P$ is the transition kernel. These errors can be attributed to different kinds of approximations in DRL (Rowland et al., 2018), including (i) parameterization and its associated projection operators, (ii) stochastic approximation of the Bellman operator, and (iii) gradient updates through the quantile loss.

We elaborate on the properties of the three error terms in (1). $\mathcal{E}_{1}$ is derived from the target approximation in the quantile loss. $\mathcal{E}_{2}$ is caused by the stochastic approximation of the Bellman operator. $\mathcal{E}_{3}$ results from the parametrization of quantiles and the corresponding projection operator. Among the three, $\mathcal{E}_{3}$ can be theoretically eliminated if the representation size is large enough, whereas $\mathcal{E}_{1}+\mathcal{E}_{2}$ is inevitable in practice due to the batch-based optimization procedure. Therefore, controlling the variance $\mathrm{Var}(\mathcal{E}_{1}+\mathcal{E}_{2})$ can significantly speed up training convergence (see an illustrative example in Figure 1). Thus, one main goal of this work is to reduce the two inevitable errors $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$, and thereby improve existing DRL algorithms.

Figure 1: Error decay during training. (a) The parametrization-induced error $\mathcal{E}_{3}$ (grey areas) remains constant over time with a fixed representation size, while the approximation errors $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ (blue areas) decrease slowly with time steps. (b) By increasing the size of the representation (i.e., the number of quantiles), $\mathcal{E}_{3}$ can be theoretically eliminated. By applying the variance-reduced QEM estimator, $\mathcal{E}_{1}+\mathcal{E}_{2}$ can be decreased quickly, resulting in faster convergence.

The contributions of this work are summarized as follows,

  • We offer a rigorous investigation of the three error terms $\mathcal{E}_{1}$, $\mathcal{E}_{2}$, and $\mathcal{E}_{3}$ in DRL, and find that the approximation errors result from the heteroskedasticity of quantile estimates, especially tail estimates.

  • We borrow the idea from the Cornish-Fisher Expansion (Cornish & Fisher, 1938), and propose a statistically robust DRL algorithm, called QEMRL, to reduce the variance of the estimated Q-function.

  • We show that QEMRL achieves higher stability and a faster convergence rate from both theoretical and empirical perspectives.

2 Background

2.1 Reinforcement Learning

Consider a finite Markov Decision Process (MDP) $(\mathcal{X},\mathcal{A},P,\gamma,\mathcal{R})$, with a finite set of states $\mathcal{X}$, a finite set of actions $\mathcal{A}$, a transition kernel $P:\mathcal{X}\times\mathcal{A}\rightarrow\mathscr{P}(\mathcal{X})$, a discount factor $\gamma\in[0,1)$, and a bounded reward function $\mathcal{R}:\mathcal{X}\times\mathcal{A}\rightarrow\mathscr{P}([-R_{\max},R_{\max}])$. At each timestep, an agent observes a state $X_{t}\in\mathcal{X}$, takes an action $A_{t}\in\mathcal{A}$, transitions to the next state $X_{t+1}\sim P(\cdot\mid X_{t},A_{t})$, and receives a reward $R_{t}\sim\mathcal{R}(X_{t},A_{t})$. The state-action value function $Q^{\pi}:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}$ of a policy $\pi:\mathcal{X}\rightarrow\mathscr{P}(\mathcal{A})$ is the expected discounted sum of rewards obtained by starting from $x$, taking action $a$, and following policy $\pi$ thereafter. Here $\mathscr{P}(\mathcal{X})$ denotes the set of probability distributions on a space $\mathcal{X}$.

The classic Bellman equation (Bellman, 1966) relates the expected return at each state-action pair $(x,a)$ to the expected returns at possible next states by:

$$Q^{\pi}(x,a)=\mathbb{E}_{\pi}\left[R_{0}+\gamma Q^{\pi}\left(X_{1},A_{1}\right)\mid X_{0}=x,A_{0}=a\right]. \qquad (2)$$

In the control setting, Q-Learning (Watkins, 1989) obtains the optimal policy $\pi^{*}$ by finding the unique fixed point $Q^{*}=Q^{\pi^{*}}$ of the Bellman optimality equation:

$$Q^{*}(x,a)=\mathbb{E}\left[R_{0}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{*}\left(X_{1},a^{\prime}\right)\mid X_{0}=x,A_{0}=a\right].$$

2.2 Distributional Reinforcement Learning

Instead of directly estimating the expectation $Q^{\pi}(x,a)$, DRL estimates the distribution of the sum of discounted rewards $\eta_{\pi}(x,a)=\mathcal{D}(\sum_{t=0}^{\infty}\gamma^{t}R_{t}\mid X_{0}=x,A_{0}=a)$ to fully capture the intrinsic randomness, where $\mathcal{D}$ extracts the probability distribution of a random variable. In analogy with Equation 2, $\eta_{\pi}$ satisfies the distributional Bellman equation (Bellemare et al., 2017):

$$\eta_{\pi}(x,a)=\left(\mathcal{T}^{\pi}\eta_{\pi}\right)(x,a)=\mathbb{E}_{\pi}\left[(f_{\gamma,r})_{\#}\eta_{\pi}(X_{1},A_{1})\mid X_{0}=x,A_{0}=a\right],$$

where $f_{\gamma,r}:\mathbb{R}\to\mathbb{R}$ is defined by $f_{\gamma,r}(x)=r+\gamma x$, and $(f_{\gamma,r})_{\#}\eta$ is the pushforward measure of $\eta$ by $f_{\gamma,r}$. Note that $\eta_{\pi}$ is the fixed point of the distributional Bellman operator $\mathcal{T}^{\pi}:\mathscr{P}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}\to\mathscr{P}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, i.e., $\mathcal{T}^{\pi}\eta_{\pi}=\eta_{\pi}$.

In general, the return distribution is supported on a wide range of possible returns and its shape can be quite complex. Moreover, the transition dynamics are usually unknown in practice, so the full computation of the distributional Bellman operator is typically impossible or computationally infeasible. In the following subsections, we review two main categories of DRL algorithms relying on parametric approximations and projection operators.

2.2.1 Categorical distributional RL

Categorical distributional RL (CDRL, Bellemare et al., 2017) represents the return distribution $\eta$ in a categorical form $\eta(x,a)=\sum_{i=1}^{N}p_{i}(x,a)\delta_{z_{i}}$, where $\delta_{z}$ denotes the Dirac distribution at $z$, $z_{1}\leq z_{2}\leq\ldots\leq z_{N}$ are evenly spaced locations, and $\{p_{i}\}_{i=1}^{N}$ are the corresponding probabilities learned with the Bellman update,

$$\eta(x,a)\leftarrow\left(\Pi_{\mathcal{C}}\mathcal{T}^{\pi}\eta\right)(x,a),$$

where $\Pi_{\mathcal{C}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\{z_{1},z_{2},\ldots,z_{N}\})$ is a categorical projection operator which ensures that the return distribution is supported only on $\{z_{1},\ldots,z_{N}\}$. In practice, CDRL with $N=51$ has been shown to achieve significant improvement on Atari games.

2.2.2 Quantiled distributional RL

Quantiled distributional RL (QDRL, Dabney et al., 2018b) represents the return distribution with a mixture of Diracs $\eta(x,a)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_{i}(x,a)}$, where $\{\theta_{i}(x,a)\}_{i=1}^{N}$ are learnable parameters. The Bellman operator moves each atom location $\theta_{i}$ towards the $\tau_{i}$-th quantile of the target distribution $\eta^{\prime}(x,a):=\mathcal{T}^{\pi}\eta(x,a)$, where $\tau_{i}=\frac{2i-1}{2N}$. The corresponding Bellman update is:

$$\eta(x,a)\leftarrow\left(\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}\eta\right)(x,a),$$

where $\Pi_{\mathcal{W}_{1}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\mathbb{R})$ is a quantile projection operator defined by $\Pi_{\mathcal{W}_{1}}\mu=\frac{1}{N}\sum_{i=1}^{N}\delta_{F^{-1}_{\mu}(\tau_{i})}$, and $F_{\mu}$ is the cumulative distribution function (CDF) of $\mu$. $F_{\eta^{\prime}}^{-1}(\tau)$ can be characterized as the minimizer of the quantile regression loss, and the atom locations $\theta$ can be updated by minimizing the following loss function

$$\mathcal{L}_{QR}(\theta;\eta^{\prime},\tau)=\mathbb{E}_{Z\sim\eta^{\prime}}\left(\left[\tau\mathbf{1}_{Z>\theta}+\left(1-\tau\right)\mathbf{1}_{Z\leq\theta}\right]|Z-\theta|\right). \qquad (3)$$
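To make the update concrete, the following is a minimal NumPy sketch of a sample-based estimate of the quantile regression loss in Equation (3); the array names (`theta`, `target_samples`) and the normal target are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantile_regression_loss(theta, target_samples, taus):
    """Sample-based estimate of Eq. (3) for N atoms theta and their fractions taus.

    theta:          shape (N,)  current quantile estimates
    target_samples: shape (M,)  samples from the target distribution T^pi eta
    taus:           shape (N,)  quantile fractions, tau_i = (2i - 1) / (2N)
    """
    # Pairwise errors u_{ij} = z_j - theta_i
    u = target_samples[None, :] - theta[:, None]                   # (N, M)
    # Pinball (check) loss: tau * u for u > 0, (tau - 1) * u otherwise
    loss = np.where(u > 0, taus[:, None] * u, (taus[:, None] - 1.0) * u)
    return loss.mean()

# Example: evaluate the loss for N = 5 atoms against a standard normal target
N = 5
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
theta = np.zeros(N)
samples = np.random.default_rng(0).normal(size=1000)
print(quantile_regression_loss(theta, samples, taus))
```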

3 Error Analysis of Distributional RL

As mentioned in Section 1, the parametrization induced error $\mathcal{E}_{3}$ comes from the quantile representation and its projection operator, and can be eliminated as $N\rightarrow\infty$. However, as illustrated in Figure 1, the approximation errors $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ are unavoidable in practice, and a high variance $\mathrm{Var}(\mathcal{E}_{1}+\mathcal{E}_{2})$ may lead to unstable performance of DRL algorithms. Thus, in this section, we further study the three error terms $\mathcal{E}_{1}$, $\mathcal{E}_{2}$, and $\mathcal{E}_{3}$, and show why it is important to control them in practice.

3.1 Parametrization Induced Error

We first show the convergence of both the expectation and the variance of the distributional Bellman operator $\mathcal{T}^{\pi}$. Then, we take the parametric representation and its projection operator into consideration.

Proposition 3.1 (Sobel, 1982; Bellemare et al., 2017).

Suppose there are two value distributions $\nu_{1},\nu_{2}\in\mathscr{P}(\mathbb{R})$, and random variables $Z_{i}^{k+1}\sim\mathcal{T}^{\pi}\nu_{i}$, $Z_{i}^{k}\sim\nu_{i}$. Then, we have

$$\left\|\mathbb{E}Z_{1}^{k+1}-\mathbb{E}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma\left\|\mathbb{E}Z_{1}^{k}-\mathbb{E}Z_{2}^{k}\right\|_{\infty},\ \text{ and }$$
$$\left\|\mathrm{Var}Z_{1}^{k+1}-\mathrm{Var}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma^{2}\left\|\mathrm{Var}Z_{1}^{k}-\mathrm{Var}Z_{2}^{k}\right\|_{\infty}.$$

Based on the fact that $\mathcal{T}^{\pi}$ is a $\gamma$-contraction in the $\bar{d}_{p}$ metric (Bellemare et al., 2017), where $\bar{d}_{p}$ is the maximal form of the Wasserstein metric, Proposition 3.1 implies that $\mathcal{T}^{\pi}$ is a contraction for both the expectation and the variance: both converge exponentially to their true values by iteratively applying the distributional Bellman operator.

However, in practice, employing a parametric representation for the return distribution leaves a theory-practice gap, so that neither the expectation nor the variance converges to its true value. To better understand the bias in the Q-function approximation caused by the parametric representation, we introduce the concept of mean-preserving (this property has been thoroughly discussed in previous work; we adapt the definition from Section 5.11 of Bellemare et al. (2023)) to describe the relationship between the expectations of the original distribution and the projected distribution:

Definition 3.2 (Mean-preserving).

A representation $\mathscr{F}\subseteq\mathscr{P}(\mathbb{R})$ and its associated projection operator $\Pi_{\mathscr{F}}:\mathscr{P}(\mathbb{R})\to\mathscr{F}$, which maps the space of probability distributions to the desired representation, are mean-preserving if for any distribution $\nu\in\mathscr{P}(\mathbb{R})$, the expectation of $\Pi_{\mathscr{F}}\nu$ is the same as that of $\nu$.

For CDRL, a discussion of the mean-preserving property is given by Lyle et al. (2019) and Rowland et al. (2019). It can be shown that for any $\nu\in\mathscr{F}_{\mathcal{C}}$, where $\mathscr{F}_{\mathcal{C}}$ is an $N$-categorical representation, the projection $\Pi_{\mathcal{C}}$ preserves the distribution's expectation when its support is contained in the interval $[z_{1},z_{N}]$. However, practitioners usually employ a wide predefined interval for the return, which makes the projection operator typically overestimate the variance.

For QDRL, $\Pi_{\mathcal{W}_{1}}$ is not mean-preserving (Bellemare et al., 2023). Given any distribution $\nu\in\mathscr{F}_{\mathcal{W}_{1}}$, where $\mathscr{F}_{\mathcal{W}_{1}}$ is an $N$-quantile representation, there is in most cases no unique $N$-quantile distribution $\Pi_{\mathcal{W}_{1}}\nu$, as the projection operator $\Pi_{\mathcal{W}_{1}}$ is not a non-expansion in the 1-Wasserstein distance (see Appendix B for details). This means that the expectation, variance, and higher-order moments are not preserved. To make this concrete, a simple MDP example is used to illustrate the bias in the learned quantile estimates.

In Figure 2 (a), rewards $R_{1}$ and $R_{2}$ are randomly sampled from $\mathrm{Unif}(0,1)$ and $\mathrm{Unif}(1/N,1+1/N)$ at states $x_{1}$ and $x_{2}$ respectively, and no reward is received at $x_{0}$. Clearly, the true return distribution at state $x_{0}$ is the mixture $\frac{\gamma}{2}(R_{1}+R_{2})$, hence its $\frac{1}{2N}$-th quantile is $\frac{\gamma}{N}$. When using the QDRL algorithm with $N$ quantile estimates, the approximated return distributions are $\hat{\eta}(x_{1},a)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\frac{2i-1}{2N}}$ and $\hat{\eta}(x_{2},a)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\frac{2i+1}{2N}}$. In this case, the $\frac{1}{2N}$-th quantile of the approximated return distribution at state $x_{0}$ is $\frac{3\gamma}{2N}$, whereas the true value is $\frac{\gamma}{N}$. Moreover, for each $i=1,\ldots,N$, the $\frac{2i-1}{2N}$-th quantile estimate at state $x_{0}$ does not equal its true value.
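The non-mean-preserving behavior can also be checked numerically. The following snippet is our own small illustration (not the paper's MDP): it projects an Exponential(1) distribution onto $N$ quantile atoms at $\tau_{i}=\frac{2i-1}{2N}$ and compares the projected mean with the true mean.

```python
import numpy as np

# Check that Pi_W1 is not mean-preserving: the mean of the N-quantile projection
# of an Exponential(1) distribution differs from the true mean of 1.0.
for N in [5, 20, 100]:
    taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    atoms = -np.log(1.0 - taus)        # F^{-1}(tau) for Exponential(1)
    print(f"N = {N:4d}   projected mean = {atoms.mean():.4f}   true mean = 1.0000")
```

The heavy right tail is cut off at $\tau_{N}=\frac{2N-1}{2N}$, so the projected mean underestimates the true mean, with the gap shrinking as $N$ grows.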


Figure 2: (a) Example MDP with a single action, equal transition probabilities, an initial state $x_{0}$, and two terminal states $x_{1},x_{2}$ where rewards are drawn from uniform distributions. (b) 5-state MDP with two actions at the initial state $x_{0}$, deterministic transitions, and exponentially distributed rewards at the terminal states $x_{3},x_{4}$. (c) The true return distributions $\eta(x_{0},a_{1})$ and $\eta(x_{0},a_{2})$, and the expected returns estimated by QDRL and QEMRL.

The biased quantile estimates illustrated in Figure 2 (a) are caused by the use of the quantile representation and its projection operator $\Pi_{\mathcal{W}_{1}}$. This undesirable property in turn affects the QDRL update, as the combined operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ is in general not a non-expansion in $\bar{d}_{p}$ for $p\in[1,\infty)$ (Dabney et al., 2018b), which means that the learned quantile estimates may not converge to the true quantiles of the return distribution. (A recent study (Rowland et al., 2023a) proves that the QDRL update may have multiple fixed points, indicating that the quantiles may not converge to the truth. Despite this, Proposition 2 of Dabney et al. (2018b) shows that the projected Bellman operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ remains a contraction in $\bar{d}_{\infty}$, which implies that quantile convergence is guaranteed for all $p\in[1,\infty]$.) Moreover, since $\Pi_{\mathcal{W}_{1}}$ is not mean-preserving, iteratively applying the projected Bellman operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ during training inevitably introduces bias into the expectation of the return distribution, resulting in a deviation between the estimated Q-function and its true value. We now derive an upper bound to quantify this deviation, i.e., $\mathcal{E}_{3}$.

Theorem 3.3 (Parameterization induced error bound).

Let $\Pi_{\mathcal{W}_{1}}$ be the projection operator onto evenly spaced quantile fractions $\tau_{i}=\frac{2i-1}{2N}$ for $i=1,\ldots,N$, and let $\eta_{k}\in\mathscr{P}(\mathbb{R})$ be the return distribution at the $k$-th iteration. Let random variables $Z_{\theta}^{k}\sim\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}\eta_{k}$ and $Z^{k}\sim\mathcal{T}^{\pi}\eta_{k}$. Assume that the distribution of the immediate reward is supported on $[-R_{\max},R_{\max}]$. Then we have

$$\lim_{k\rightarrow\infty}\left\|\mathcal{E}^{k}_{3}\right\|_{\infty}=\lim_{k\rightarrow\infty}\left\|\mathbb{E}Z_{\theta}^{k}-\mathbb{E}Z^{k}\right\|_{\infty}\leq\frac{2R_{\max}}{N(1-\gamma)},$$

where $\mathcal{E}^{k}_{3}$ is the parametrization induced error at the $k$-th iteration.

Theorem 3.3 implies that the convergence of the expectation under the projected Bellman operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ cannot be guaranteed once the quantile representation and its projection operator are applied. (Note that this bound has a limitation: it only considers the one-step effect of applying the projection operator $\Pi_{\mathcal{W}_{1}}$, and is therefore independent of the iteration number $k$. Proposition 4.1 of Rowland et al. (2023b) provides a more compelling bound that accounts for the cumulative effect of iteratively applying $\Pi_{\mathcal{W}_{1}}$.) Since the bound tends to zero as $N\rightarrow\infty$, it is reasonable to use a relatively large representation size $N$ to reduce $\mathcal{E}_{3}$ in practice.

3.2 Approximation Error

The other two types of errors, $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$, which determine the variance of the Q-function estimate, are accumulated during the training process as unseen state-action pairs keep being encountered. The target approximation error $\mathcal{E}_{1}$ affects action selection, while the Bellman operator approximation error $\mathcal{E}_{2}$ leads to accumulated error in the Q-function estimate, which can be amplified by the temporal difference updates (Sutton, 1988). The accumulated, highly uncertain errors of the Q-function estimate can cause certain states to be incorrectly valued, leading to suboptimal policies and potentially divergent behavior.

As depicted in Figure 2 (b), we use a toy example to illustrate how QDRL fails to learn an optimal policy due to the high variance of the approximation error. This 5-state MDP is originally introduced in Figure 7 of Rowland et al. (2019). In this case, $\eta(x_{0},a_{1})$ and $\eta(x_{0},a_{2})$ follow exponential distributions with expectations 1.2 and 1, respectively. We consider a tabular setting, which represents the approximated return distribution at each state-action pair separately. Figure 2 (c) shows that in policy evaluation, QDRL inaccurately approximates the Q-function: it underestimates the expectation of $\eta(x_{0},a_{1})$ and overestimates the other. This is caused by the poor capture of tail events, which results in high uncertainty in the Q-function estimate. Due to this high variance, QDRL fails to learn the optimal policy and chooses the non-optimal action $a_{2}$ at the initial state $x_{0}$. In contrast, our proposed algorithm, QEMRL, employs a statistically robust estimator of the Q-function to reduce its variance, relieves the underestimation and overestimation issues, and ultimately allows for more efficient policy learning.

Different from previous QDRL studies that focus on exploiting the distribution information to further improve the model performance, this work highlights the importance of controlling the variance of the approximation error to obtain a more accurate estimate of the Q-function. More discussion about this is given in the following section.

4 Quantiled Expansion Mean

This section introduces a novel variance reduction technique to estimate the Q-function. In traditional statistics, estimators with lower variance are considered to be more efficient. In RL, variance reduction is also an effective technique for achieving fast convergence in both policy-based and value-based RL algorithms, especially for large-scale tasks (Greensmith et al., 2004; Anschel et al., 2017). Motivated by these findings, we introduce QEM as an estimator that is more robust and has a lower variance than that of QDRL under the heteroskedasticity assumption. Furthermore, we demonstrate the potential benefits of QEM for the distribution approximation in DRL.

4.1 Heteroskedasticity of quantiles

In the context of quantile-based DRL, the Q-function is the integral of the quantile function. To approximate it, QDRL employs a simple empirical mean (EM) estimator $\frac{1}{N}\sum_{i}\hat{q}(\tau_{i})$, and it is natural to assume that the estimated quantile satisfies

$$\hat{q}(\tau)=q(\tau)+\varepsilon(\tau), \qquad (4)$$

where $\varepsilon(\tau)$ is a zero-mean error. In this case, considering the crossing issue and the biased tail estimates, we assume that the variance of $\varepsilon(\tau)$ is non-constant and depends on $\tau$, which is usually called heteroskedasticity in statistics.

For a direct understanding, we conduct a simple simulation on a Chain MDP to illustrate how QDRL can fail to fit the quantile function. As shown in Figure 3(b), QDRL fits well in the peak area but struggles at the bottom and the tail. Moreover, the non-monotonicity of the quantile estimates is more severe in the poorly fitted areas than elsewhere. As the deviations of the quantile estimates from the truth are significantly larger in the low-probability region and the tail, the heteroskedasticity assumption is plausible in this case. This phenomenon arises because samples near the bottom and the tail are less likely to be drawn. In real-world situations, multimodal distributions are commonly encountered, and the heteroskedasticity problem may result in imprecise distribution approximations and consequently poor Q-function approximations. In the next part, we discuss how to enhance the stability of the Q-function estimate.

4.2 Cornish-Fisher Expansion

It is well known that a quantile can be expressed by the Cornish-Fisher Expansion (CFE, Cornish & Fisher, 1938):

$$q(\tau)=\mu+\sigma x^{\prime}_{\tau}, \qquad (5)$$
$$x^{\prime}_{\tau}=z_{\tau}+(z^{2}_{\tau}-1)\frac{s}{6}+(z^{3}_{\tau}-3z_{\tau})\frac{k}{24}+\cdots,$$

where $z_{\tau}$ is the $\tau$-th quantile of the standard normal distribution, $\mu$ is the mean, $\sigma$ is the standard deviation, $s$ and $k$ are the skewness and kurtosis of the distribution of interest, and the remaining terms in the ellipsis involve higher-order moments (see Appendix C for more details). The CFE theoretically determines a distribution from its known moments and is widely used in financial studies. Recently, Zhang & Zhu (2023) employed the CFE to estimate higher-order moments of financial time series data, which are not directly observable. Our method utilizes a truncated version of the CFE framework and employs a linear regression model to construct efficient estimators of the distribution moments based on known quantiles. We then apply this approach within the context of quantile-based DRL.
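As a quick illustration, the sketch below evaluates the truncated expansion in Equation (5), keeping only the skewness and kurtosis terms shown above; the moment values are made up for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def cfe_quantile(tau, mu, sigma, skew, kurt):
    """Truncated Cornish-Fisher quantile (Eq. 5), keeping the terms shown above."""
    z = norm.ppf(tau)                                   # z_tau of the standard normal
    x = z + (z**2 - 1) * skew / 6 + (z**3 - 3 * z) * kurt / 24
    return mu + sigma * x

# Illustrative moments: a mildly right-skewed return distribution
taus = (2 * np.arange(1, 11) - 1) / 20                  # tau_i = (2i - 1) / (2N), N = 10
print(cfe_quantile(taus, mu=1.0, sigma=0.5, skew=0.8, kurt=0.5))
```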

To be more specific, we plug the estimate $\hat{q}(\tau)$ of the $\tau$-th quantile into Equation 5 and expand it to the first order:

$$\hat{q}(\tau)=m_{1}+\omega_{1}(\tau)+\varepsilon(\tau), \qquad (6)$$

where $m_{1}$ is the mean (i.e., the first moment) of the return distribution, that is, the Q-function, and $\omega_{1}(\tau)$ is the remaining term associated with the higher-order ($>1$) moments. If $\omega_{1}(\tau)$ is negligible, $m_{1}$ can be estimated by averaging the $N$ quantile estimates, as in QDRL.

When the estimated quantile is expanded to the second order, we have the following representation:

$$\hat{q}(\tau)=m_{1}+z_{\tau}\sqrt{m_{2}}+\sqrt{m_{2}}\,\omega_{2}(\tau)+\varepsilon(\tau), \qquad (7)$$

where $\omega_{2}(\tau)$ is the remaining term associated with the higher-order ($>2$) moments. Assuming that $\omega_{2}(\tau)$ is negligible, we can derive a regression model by plugging in the $N$ quantile estimates, such that

$$\begin{pmatrix}\hat{q}(\tau_{1})\\ \hat{q}(\tau_{2})\\ \vdots\\ \hat{q}(\tau_{N})\end{pmatrix}=\begin{pmatrix}1&z_{\tau_{1}}\\ 1&z_{\tau_{2}}\\ \vdots&\vdots\\ 1&z_{\tau_{N}}\end{pmatrix}\begin{pmatrix}m_{1}\\ \sqrt{m_{2}}\end{pmatrix}+\begin{pmatrix}\varepsilon(\tau_{1})\\ \varepsilon(\tau_{2})\\ \vdots\\ \varepsilon(\tau_{N})\end{pmatrix}. \qquad (22)$$

The higher-order expansions can be conducted in the same manner. Note that the remaining term is omitted for constructing a regression model, and a more in-depth analysis of the remaining term is available in Section C.2.

For notational simplicity, we rewrite (22) in matrix form,

$$\boldsymbol{\hat{Q}}=\mathbf{X}_{2}\boldsymbol{M}_{2}+\mathcal{E}, \qquad (23)$$

where $\boldsymbol{\hat{Q}}\in\mathbb{R}^{N}$ is the vector of estimated quantiles, $\mathbf{X}_{2}\in\mathbb{R}^{N\times 2}$ and $\boldsymbol{M}_{2}\in\mathbb{R}^{2}$ are the design matrix and the moment vector, respectively, and $\mathcal{E}$ is the vector of error terms.
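The sketch below illustrates the regression view in (22)-(23) for the second-order expansion: stack the $N$ quantile estimates, build the design matrix $\mathbf{X}_{2}=[\mathbf{1},z_{\tau}]$, and recover the moments by least squares. The "estimated" quantiles are simulated here, as an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm

# Regression view of Eqs. (22)-(23): quantile estimates regressed on [1, z_tau].
N = 32
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)

rng = np.random.default_rng(0)
q_hat = 1.0 + 0.5 * z + rng.normal(scale=0.05, size=N)   # noisy quantiles of N(1, 0.25)

X2 = np.column_stack([np.ones(N), z])                    # design matrix of Eq. (22)
M2, *_ = np.linalg.lstsq(X2, q_hat, rcond=None)          # OLS estimate of (m_1, sqrt(m_2))
print("m1 (Q-value) estimate:", M2[0], "  sqrt(m2) estimate:", M2[1])
```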


Figure 3: (a) Chain MDP with six states, one action, $\gamma=0.99$, and a Gaussian-mixture reward distribution at the terminal state $x_{5}$. (b) True quantile function (top) and the QDRL quantile function at state $x_{0}$ after 10K training steps; scatter plot (bottom) of the approximated quantiles during training.

For the bivariate regression model (23), the traditional ordinary least squares (OLS) method can be used to estimate $\boldsymbol{M}_{2}=(m_{1},\sqrt{m_{2}})^{\prime}$ when the variances of the errors are invariant across quantile locations, which is known as the homoscedasticity assumption. The resulting estimator $\hat{m}_{1}$ is referred to as the Quantiled Expansion Mean (QEM) in this work. However, since the homoscedasticity assumption required by OLS is always violated in real cases, we consider using the weighted least squares (WLS) method instead. Under the normality assumption, the following result shows that the WLS estimator $\hat{m}_{1}$ has a lower variance than the direct empirical mean.

Lemma 4.1.

Consider the linear regression model $\boldsymbol{\hat{Q}}=\mathbf{X}_{2}\boldsymbol{M}_{2}+\mathcal{E}$, where $\mathcal{E}$ follows $\mathcal{N}(\mathbf{0},\sigma^{2}V)$ with $V=\mathrm{diag}(v_{1},v_{2},\cdots,v_{N})$, $v_{i}\geq 1$ for $i=1,\cdots,N$, and we set the noise variance $\sigma^{2}=1$ without loss of generality. The WLS estimator is

$$\widehat{\boldsymbol{M}}_{2}=(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2})^{-1}\mathbf{X}^{\top}_{2}V^{-1}\boldsymbol{\hat{Q}}, \qquad (24)$$

and the QEM estimator $\hat{m}_{1}$ is the first component of $\widehat{\boldsymbol{M}}_{2}$.

Remark: Note that it is impossible to determine the weight matrix $V$ for each state-action pair in practice. Hence, we focus on capturing the relatively high variance in the tails, specifically in the range $\tau\in(0,0.1]\cup[0.9,1)$. To achieve this, we use a constant $v_{i}$, set to a value greater than 1 in the tails and equal to 1 elsewhere. $v_{i}$ is treated as a hyperparameter to be tuned in practice (see Appendix E).
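A minimal sketch of Equation (24) with the constant tail-weighting scheme described in the Remark is given below; the weight value 2.0 and the simulated quantile estimates are illustrative assumptions, not tuned values from the paper.

```python
import numpy as np
from scipy.stats import norm

def qem_estimate(q_hat, taus, tail_weight=2.0):
    """WLS estimate of (m_1, sqrt(m_2)) as in Eq. (24).

    Following the Remark, V is diagonal with v_i = tail_weight for
    tau in (0, 0.1] U [0.9, 1) and v_i = 1 elsewhere; tail_weight is a
    tunable hyperparameter (2.0 here is only illustrative).
    """
    z = norm.ppf(taus)
    X2 = np.column_stack([np.ones_like(z), z])
    v = np.where((taus <= 0.1) | (taus >= 0.9), tail_weight, 1.0)
    W = np.diag(1.0 / v)                                  # V^{-1}
    M2 = np.linalg.solve(X2.T @ W @ X2, X2.T @ W @ q_hat)
    return M2[0], M2[1]                                   # (QEM m_1, sqrt(m_2))

# Usage on simulated quantile estimates of a N(1, 0.25) return distribution
N = 32
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
q_hat = 1.0 + 0.5 * norm.ppf(taus) + np.random.default_rng(0).normal(scale=0.05, size=N)
print(qem_estimate(q_hat, taus))
```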

With Lemma 4.1, the reduction of variance is guaranteed by the following Proposition 4.2. Throughout the training process, heteroskedasticity is inevitable, and thus the QEM estimator always exhibits a lower variance than the standard EM estimator $\hat{m}^{*}_{1}=\frac{1}{N}\sum_{i=1}^{N}\hat{q}(\tau_{i})$.

Proposition 4.2.

Suppose the noise $\varepsilon_{i}$ independently follows $\mathcal{N}(0,v_{i})$ with $v_{i}\geq 1$ for $i=1,\cdots,N$. Then,

(i) In the homoskedastic case where $v_{i}=1$ for $i=1,\dots,N$, the empirical mean estimator $\hat{m}_{1}^{*}$ has a lower variance, i.e., $\mathrm{Var}(\hat{m}_{1}^{*})<\mathrm{Var}(\hat{m}_{1})$;

(ii) In the heteroskedastic case where the $v_{i}$'s are not equal, the QEM estimator $\hat{m}_{1}$ achieves a lower variance, i.e., $\mathrm{Var}(\hat{m}_{1})<\mathrm{Var}(\hat{m}_{1}^{*})$, if and only if $\bar{v}^{2}-1-1\big/\big(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\big)>0$, where $\bar{v}=\frac{1}{N}\sum_{i}v_{i}$. This inequality holds when $z_{\tau_{i}}=-z_{\tau_{N-i}}$, which is guaranteed in QDRL.
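A quick Monte Carlo check of case (ii) is sketched below, using our own made-up noise variances (larger in the tails); under such heteroskedastic noise, the WLS-based QEM estimate of $m_{1}$ shows a lower empirical variance than the plain empirical mean of the quantile estimates.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo comparison of Var(EM) and Var(QEM) under heteroskedastic quantile noise.
N, trials = 32, 20000
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)
v = np.where((taus <= 0.1) | (taus >= 0.9), 4.0, 1.0)    # noise variances, larger in the tails

rng = np.random.default_rng(1)
true_q = 1.0 + 0.5 * z                                    # true quantiles of N(1, 0.25)
q_hat = true_q + rng.normal(size=(trials, N)) * np.sqrt(v)

em = q_hat.mean(axis=1)                                   # empirical mean estimator
X2 = np.column_stack([np.ones(N), z])
W = np.diag(1.0 / v)
H = np.linalg.solve(X2.T @ W @ X2, X2.T @ W)              # rows of the WLS estimator
qem = q_hat @ H[0]                                        # first component = m_1 estimate
print("Var(EM) =", em.var(), "  Var(QEM) =", qem.var())
```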

We also explore the potential benefits of the variance reduction technique QEM in improving approximation accuracy. A Q-function estimate with higher variance can lead to noisy policy gradients in policy-based algorithms (Fujimoto et al., 2018) and prevent the selection of optimal actions in value-based algorithms (Anschel et al., 2017). These issues can slow down the learning process and negatively impact algorithm performance. With the following theorem, we show that QEM can reduce the variance and thus improve approximation performance.

Theorem 4.3.

Let $\hat{\pi}$ be the learned policy and $\pi_{opt}$ the optimal policy, and define $\alpha=\max_{x^{\prime}}D_{TV}(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime}))$ and $n(x,a)=|\mathcal{D}|$. For all $\delta\in\mathbb{R}$, with probability at least $1-\delta$, for any $\eta(x,a)\in\mathscr{P}(\mathbb{R})$ and all $(x,a)\in\mathcal{D}$,

$$\left\|F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}\right\|_{\infty}\leq 2\alpha+\sqrt{\frac{1+4|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$$

Theorem 4.3 indicates that a tighter concentration bound can be obtained with a smaller $\alpha$. The decrease in $\alpha$ can be attributed to the benefits of QEM. Specifically, QEM helps to decrease the perturbations on the Q-function and reduce the variance of the policy gradients, which allows for faster convergence of the policy training and a more accurate distribution approximation. To conclude, QEM relieves error accumulation within the Q-function update, improves estimation accuracy, reduces the risk of underestimation and overestimation, and thus ultimately enhances the stability of the whole training process.

Algorithm 1 QEMRL update algorithm
1:  Require: Quantile estimates $\hat{q}_{i}(x,a)$ for each $(x,a)$
2:  Collect sample $(x,a,r,x^{\prime})$
3:   # Compute distributional Bellman target
4:  Compute $Q(x^{\prime},a)$ using Equation 24
5:  if policy evaluation then
6:     $a^{*}\sim\pi(\cdot|x^{\prime})$
7:  else if Q-Learning then
8:     $a^{*}\leftarrow\arg\max_{a}Q(x^{\prime},a)$
9:  end if
10:  Scale samples $\hat{q}^{*}_{i}(x^{\prime},a^{*})\leftarrow r+\gamma\hat{q}_{i}(x^{\prime},a^{*})$, $\forall i$.
11:   # Compute quantile loss
12:  Update the estimated quantiles $\hat{q}_{i}(x,a)$ by computing, for each $i=1,\ldots,N$, the gradients $\nabla_{\hat{q}_{i}(x,a)}\sum_{i=1}^{N}\mathcal{L}_{QR}(\hat{q}_{i}(x,a);\frac{1}{N}\sum_{j=1}^{N}\delta_{\hat{q}^{*}_{j}(x^{\prime},a^{*})},\tau_{i})$.
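The following is a minimal tabular sketch of Algorithm 1 for the Q-learning variant, assuming a table of quantile atoms per state-action pair; the learning rate, discount factor, and tail weight are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.stats import norm

def qem(q_atoms, taus, tail_weight=2.0):
    """Line 4: QEM estimate of the Q-value from the quantile atoms (Eq. 24)."""
    z = norm.ppf(taus)
    X2 = np.column_stack([np.ones_like(z), z])
    w = 1.0 / np.where((taus <= 0.1) | (taus >= 0.9), tail_weight, 1.0)
    m = np.linalg.solve((X2 * w[:, None]).T @ X2, (X2 * w[:, None]).T @ q_atoms)
    return m[0]

def qemrl_update(q, x, a, r, x_next, taus, gamma=0.99, lr=0.05):
    """q: array of shape (num_states, num_actions, N) of quantile estimates."""
    # Lines 4 and 8: greedy action under the QEM Q-value estimate
    a_star = int(np.argmax([qem(q[x_next, b], taus) for b in range(q.shape[1])]))
    # Line 10: scaled target samples r + gamma * q_i(x', a*)
    target = r + gamma * q[x_next, a_star]
    # Line 12: quantile-regression (pinball) gradient step for each atom
    u = target[None, :] - q[x, a][:, None]                 # (N, N) pairwise errors
    grad = -(taus[:, None] - (u < 0)).mean(axis=1)         # d/d theta_i of the QR loss
    q[x, a] -= lr * grad

# Toy usage: 3 states, 2 actions, N = 16 quantile atoms
N = 16
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
q_table = np.zeros((3, 2, N))
qemrl_update(q_table, x=0, a=1, r=1.0, x_next=2, taus=taus)
```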

5 Experimental Results

In this section, we conduct empirical studies to demonstrate the advantages of our QEMRL method. First, a simple tabular experiment is conducted to validate some of the theoretical results presented in Sections 3 and 4. Then we apply the proposed QEMRL update strategy in Algorithm 1 to both DQN-style and SAC-style DRL algorithms, which are evaluated on the Atari and MuJoCo environments. The detailed architectures of these methods and the hyperparameter selections can be found in Appendix D, and additional experimental results are included in Appendix E.

In this work, we implement QEM using a 4-th order expansion that includes the mean, variance, skewness, and kurtosis. The effects of a higher-order expansion on model estimation are discussed in Section C.1. Intuitively, including more terms in the expansion improves the estimation accuracy of the quantiles, but the overfitting risk and the computational cost also increase, so there is a trade-off between explainability and learning efficiency. We evaluate different expansion orders using the $R^{2}$ statistic, which measures the goodness of model fit. The simulation results (Figure 9) show that a 4-th order expansion appears to be the optimal choice, while higher-order ($>4$) expansions do not show a significant increase in $R^{2}$.

5.1 A Tabular Example

FrozenLake (Brockman et al., 2016) is a classic benchmark problem for Q-learning control with high stochasticity and sparse rewards, in which an agent controls the movement of a character in an $n\times n$ grid world. As shown in Figure 4 for a FrozenLake-$4\times 4$ task, "S" is the starting point, "H" is a hole that terminates the game, and "G" is the goal state with a reward of 1. The blue grids represent the frozen surface, where the agent can slide to adjacent grids with some underlying unknown probabilities when taking a certain movement direction. The reward received by the agent is always zero unless the goal state is reached.

We first approximate the return distribution under the optimal policy $\pi^{*}$, which can be obtained using the value iteration approach. Specifically, we start from the "S" state and perform 1K Monte-Carlo (MC) rollouts; an empirical distribution is obtained by summarizing all the recorded trajectories. With this approximation of the distribution, we can draw the curve of quantile estimates shown in Figure 4 (b). Both QEMRL and QDRL were run for 150K training steps, and the $\epsilon$-greedy exploration strategy is applied in the first 1K steps. For both methods, we set the total number of quantiles to $N=128$.

Figure 4: (a) The optimal direction of movement at each grid. (b) Quantile estimates by MC, QDRL, and QEMRL at the start state. (c) Approximation errors of Q-function estimate and distribution approximation error of QEMRL and QDRL (results are averaged over 10 random seeds).

Although both QEMRL and QDRL can eventually find the optimal movement at the start state, their approximations of the return distribution are quite different. Figure 4 (c) visualizes the approximation errors of the Q-function and the distribution for QEMRL and QDRL with respect to the number of training steps. The Q-function estimates of QEMRL converge correctly on average, whereas the estimates of QDRL do not converge exactly to the truth. A similar pattern is found for the distribution approximation error. Besides, the variance reduction achieved by QEM is reflected in the fact that the curves of QEMRL are more stable and decline faster. In Figure 4 (b), we show that the distribution at the start state estimated by QEMRL is eventually closer to the ground truth.

5.2 Evaluation on MuJoCo and Atari 2600

We conduct experiments on the MuJoCo benchmark to further verify the analysis in Section 4. Our implementation is based on the Distributional Soft Actor-Critic (DSAC, Ma et al., 2020) algorithm, which is a distributional version of SAC. Figure 5 demonstrates that both DSAC and QEM-DSAC significantly outperform the baseline SAC. Between the two, QEM-DSAC performs better than DSAC and its learning curves are more stable, which demonstrates that QEM-DSAC achieves higher sample efficiency.

Figure 5: Learning curves of SAC, DSAC, and QEM-DSAC across six MuJoCo games. Each curve is averaged over 5 random seeds and shaded by their confidence intervals.

We also compare QEM-DQN with the baseline method QR-DQN on the Atari 2600 platform. Figure 8 plots the results of these two algorithms in six Atari games. Particularly at the early training stage, QEM-DQN exhibits a significant gain in sample efficiency, resulting in faster convergence and better performance.

Extension to IQN. Some great efforts have been made by the community of DRL to more precisely parameterize the entire distribution with a limited number of quantile locations. One notable example is the introduction of Implicit Quantile Networks (IQN, Dabney et al., 2018a), which tries to recover the continuous map of the entire quantile curve by sampling a different set of quantile values from a uniform distribution Unif(0,1)\mathrm{Unif}(0,1) each time.

Our method can also be applied to IQN, as IQN likewise uses the EM approach to estimate the Q-function. Note that the design matrix $\mathbf{X}$ must be updated after re-sampling the quantile fractions at each training step. Moreover, the sufficient condition $z_{\tau_{i}}=-z_{\tau_{N-i}}$ that ensures the variance reduction does not hold in the IQN case, as the $\tau$'s are sampled from a uniform distribution. However, according to the simulation results in the appendix (Table 4), the variance reduction still holds in practice. In this setting, all baseline methods are modified to their IQN versions. As Figures 6 and 7 demonstrate, QEM achieves a performance gain in most scenarios and the convergence speed is slightly increased.

Figure 6: Learning curves of SAC, DSAC (IQN), and QEM-DSAC (IQN) across six MuJoCo games. Each curve is averaged over 5 random seeds and shaded by their confidence intervals.
Figure 7: Learning curves of IQN and IQEM-DQN across six Atari games. Each curve is averaged over 3 random seeds and shaded by their confidence intervals.
Figure 8: Learning curves (top and middle) of QR-DQN and QEM-DQN across six Atari games. Learning curves (bottom) of QR-DQN and QEM-DQN with exploration across three games.

5.3 Exploration

Since QEM also provides an estimate of the variance, we consider using it to develop an efficient exploration strategy. In a recent study, to more fully utilize the distribution information, Mavrin et al. (2019) propose a novel exploration strategy, Decaying Left Truncated Variance (DLTV), which uses the left truncated variance of the estimated distribution as a bonus term to encourage exploration of unknown states. The optimal action $a^{*}$ at state $x$ is selected according to $a^{*}=\arg\max_{a^{\prime}}\left(Q\left(x,a^{\prime}\right)+c_{t}\sqrt{\sigma^{2}_{+}}\right)$, where $c_{t}$ is a decay factor that suppresses the intrinsic uncertainty, and $\sigma^{2}_{+}$ denotes the variance estimate. Although DLTV is effective, the validity of the computed truncation lacks a theoretical guarantee. In this work, we follow the idea of DLTV and examine model performance using either the variance estimate obtained by QEM or the original DLTV estimate in several hard-exploration games. As Figure 8 shows, using QEM significantly improves exploration efficiency compared to QR-DQN+DLTV, since QEM enhances the accuracy of the quantile estimates and thus the accuracy of the distribution variance.
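A sketch of this DLTV-style action selection using the QEM regression to supply both the Q-value ($m_{1}$) and a variance estimate ($m_{2}$) is given below; the decay schedule $c_{t}$ and the simplified (non-truncated) variance bonus are assumptions for illustration, not the exact choices used in the paper.

```python
import numpy as np
from scipy.stats import norm

def qem_mean_var(q_atoms, taus, tail_weight=2.0):
    """QEM regression returning both m_1 (Q-value) and m_2 (variance) estimates."""
    z = norm.ppf(taus)
    X2 = np.column_stack([np.ones_like(z), z])
    w = 1.0 / np.where((taus <= 0.1) | (taus >= 0.9), tail_weight, 1.0)
    m1, sqrt_m2 = np.linalg.solve((X2 * w[:, None]).T @ X2, (X2 * w[:, None]).T @ q_atoms)
    return m1, sqrt_m2 ** 2

def select_action(q, x, taus, t, c=10.0):
    """q: (num_states, num_actions, N) quantile table; t: current timestep."""
    c_t = c * np.sqrt(np.log(t + 1) / (t + 1))            # decaying bonus coefficient
    scores = []
    for a in range(q.shape[1]):
        m1, var = qem_mean_var(q[x, a], taus)
        scores.append(m1 + c_t * np.sqrt(max(var, 0.0)))  # optimistic score per action
    return int(np.argmax(scores))
```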

6 Conclusion and Discussion

In this work, we systematically study the three error terms associated with the Q-function estimate and propose a novel DRL algorithm QEMRL, which can be applied to any quantile-based DRL algorithm regardless of whether the quantile locations are fixed or not. We found that a more robust estimate of the Q-function can improve the distribution approximation and speed up the algorithm convergence. We can also utilize the more precise estimate of the distribution variance to optimize the existing exploration strategy.

Finally, there are some open questions that we would like to discuss further here.

Improving the estimation of the weight matrix $V$. The challenge of estimating the weight matrix $V$ was recognized from the outset, since the exact value of $V$ is unlikely to be known in practice. In this work, we treat $V$ as a predefined value that can be tuned, taking into account the computational cost of estimating it across all state-action pairs and time steps. For future work, we believe a robust and easy-to-implement estimation of the weight matrix $V$ is necessary. Given that the variance of the quantile estimation errors varies with state-action pairs and algorithm iterations, we consider two approaches for future investigation. The first approach uses a decaying value of $v_{i}$ instead of a constant: the variance of poorly estimated quantiles tends to decrease gradually as the number of training samples increases, which motivates decreasing $v_{i}$ as training proceeds. The second approach assigns different values of $v_{i}$ to different state-action pairs. Ideas from the exploration literature, specifically count-based methods (Ostrovski et al., 2017), can be borrowed to measure the novelty of state-action pairs; familiar state-action pairs would be assigned a smaller $v_{i}$, while unfamiliar pairs would be assigned a larger $v_{i}$.

Statistical variance reduction. Our variance reduction method is based on a statistical modeling perspective, and its core insight is that performance can be improved through more careful use of the quantiles when constructing a Q-function estimator. While ensembling methods, commonly used in existing works (Osband et al., 2016; Anschel et al., 2017), can be directly applied to DRL to reduce the uncertainty of the Q-function estimator, they undoubtedly increase model complexity. In this work, we transform the Q-value estimation into a linear regression problem in which the Q-value is a coefficient of the regression model. In this way, we can leverage the weighted least squares (WLS) method to effectively capture the heteroskedasticity of the quantiles and obtain a more efficient and robust Q-function estimator.

Acknowledgements

We thank anonymous reviewers for valuable and constructive feedback on an early version of this manuscript. This work is supported by National Social Science Foundation of China (Grant No.22BTJ031 ) and Postgraduate Innovation Foundation of SUFE. Dr. Fan Zhou’s work is supported by National Natural Science Foundation of China (12001356), Shanghai Sailing Program (20YF1412300), “Chenguang Program” supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission, Open Research Projects of Zhejiang Lab (NO.2022RC0AB06), Shanghai Research Center for Data Science and Decision Technology, Innovative Research Team of Shanghai University of Finance and Economics.

References

  • Anschel et al. (2017) Anschel, O., Baram, N., and Shimkin, N. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185. PMLR, 2017.
  • Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458. PMLR, 2017.
  • Bellemare et al. (2023) Bellemare, M. G., Dabney, W., and Rowland, M. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  • Bellman (1966) Bellman, R. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cornish & Fisher (1938) Cornish, E. A. and Fisher, R. A. Moments and cumulants in the specification of distributions. Revue de l’Institut international de Statistique, pp. 307–320, 1938.
  • Dabney et al. (2018a) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pp. 1096–1105, 2018a.
  • Dabney et al. (2018b) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018b.
  • Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018.
  • Greensmith et al. (2004) Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.
  • Hsu et al. (2011) Hsu, D., Kakade, S. M., and Zhang, T. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363, 2011.
  • Koenker (2005) Koenker. Quantile regression. Cambridge University Press, 2005.
  • Kuznetsov et al. (2020) Kuznetsov, A., Shvechikov, P., Grishin, A., and Vetrov, D. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, pp. 5556–5566. PMLR, 2020.
  • Luo et al. (2021) Luo, Y., Liu, G., Duan, H., Schulte, O., and Poupart, P. Distributional reinforcement learning with monotonic splines. In International Conference on Learning Representations, 2021.
  • Lyle et al. (2019) Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  4504–4511, 2019.
  • Ma et al. (2020) Ma, X., Xia, L., Zhou, Z., Yang, J., and Zhao, Q. Dsac: distributional soft actor critic for risk-sensitive reinforcement learning. arXiv preprint arXiv:2004.14547, 2020.
  • Mavrin et al. (2019) Mavrin, B., Yao, H., Kong, L., Wu, K., and Yu, Y. Distributional reinforcement learning for efficient exploration. In International Conference on Machine Learning, pp. 4424–4434, 2019.
  • Nguyen-Tang et al. (2021) Nguyen-Tang, T., Gupta, S., and Venkatesh, S. Distributional reinforcement learning via moment matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  9144–9152, 2021.
  • Osband et al. (2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. Advances in Neural Information Processing Systems, 29, 2016.
  • Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., Oord, A., and Munos, R. Count-based exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730. PMLR, 2017.
  • Rowland et al. (2018) Rowland, M., Bellemare, M., Dabney, W., Munos, R., and Teh, Y. W. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  29–37. PMLR, 2018.
  • Rowland et al. (2019) Rowland, M., Dadashi, R., Kumar, S., Munos, R., Bellemare, M. G., and Dabney, W. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning, pp. 5528–5536. PMLR, 2019.
  • Rowland et al. (2023a) Rowland, M., Munos, R., Azar, M. G., Tang, Y., Ostrovski, G., Harutyunyan, A., Tuyls, K., Bellemare, M. G., and Dabney, W. An analysis of quantile temporal-difference learning. arXiv preprint arXiv:2301.04462, 2023a.
  • Rowland et al. (2023b) Rowland, M., Tang, Y., Lyle, C., Munos, R., Bellemare, M. G., and Dabney, W. The statistical benefits of quantile temporal-difference learning for value estimation. arXiv preprint arXiv:2305.18388, 2023b.
  • Sobel (1982) Sobel, M. J. The variance of discounted markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • Villani (2009) Villani, C. Optimal transport: old and new, volume 338. Springer, 2009.
  • Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, 1989.
  • Yang et al. (2019) Yang, D., Zhao, L., Lin, Z., Qin, T., Bian, J., and Liu, T.-Y. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems, pp. 6193–6202, 2019.
  • Zhang & Zhu (2023) Zhang, N. and Zhu, K. Quantiled conditional variance, skewness, and kurtosis by cornish-fisher expansion. arXiv preprint arXiv:2302.06799, 2023.
  • Zhou et al. (2020) Zhou, F., Wang, J., and Feng, X. Non-crossing quantile regression for distributional reinforcement learning. Advances in Neural Information Processing Systems, 33:15909–15919, 2020.
  • Zhou et al. (2021) Zhou, F., Zhu, Z., Kuang, Q., and Zhang, L. Non-decreasing quantile function network with efficient exploration for distributional reinforcement learning. International Joint Conference on Artificial Intelligence, pp.  3455–3461, 2021.

Appendix A Projection Operator

A.1 Categorical projection operator

The CDRL algorithm uses a categorical projection operator $\Pi_{\mathcal{C}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\{z_{1},\ldots,z_{N}\})$ to restrict approximated distributions to the parametric family $\mathscr{F}_{\mathcal{C}}:=\{\sum_{i=1}^{N}p_{i}\delta_{z_{i}}\mid\sum_{i=1}^{N}p_{i}=1,p_{i}\geq 0\}\subseteq\mathscr{P}(\mathbb{R})$, where $z_{1}<\cdots<z_{N}$ are evenly spaced, fixed supports. The operator $\Pi_{\mathcal{C}}$ is defined for a single Dirac delta as

$$\Pi_{\mathcal{C}}\left(\delta_{w}\right)=\begin{cases}\delta_{z_{1}}&w\leq z_{1}\\ \frac{w-z_{i+1}}{z_{i}-z_{i+1}}\delta_{z_{i}}+\frac{z_{i}-w}{z_{i}-z_{i+1}}\delta_{z_{i+1}}&z_{i}\leq w\leq z_{i+1}\\ \delta_{z_{N}}&w\geq z_{N}.\end{cases}$$

A.2 Quantile projection operator

The QDRL algorithm uses a quantile projection operator $\Pi_{\mathcal{W}_{1}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\mathbb{R})$ to restrict approximated distributions to the parametric family $\mathscr{F}_{\mathcal{W}_{1}}:=\{\frac{1}{N}\sum_{i=1}^{N}\delta_{z_{i}}\mid z_{1:N}\in\mathbb{R}^{N}\}\subseteq\mathscr{P}(\mathbb{R})$. The operator $\Pi_{\mathcal{W}_{1}}$ is defined as

$$\Pi_{\mathcal{W}_{1}}(\mu)=\frac{1}{N}\sum_{i=1}^{N}\delta_{F_{\mu}^{-1}\left(\tau_{i}\right)},$$

where $\tau_{i}=\frac{2i-1}{2N}$, and $F_{\mu}$ is the CDF of $\mu$. The midpoint $\frac{2i-1}{2N}$ of the interval $[\frac{i-1}{N},\frac{i}{N}]$ minimizes the 1-Wasserstein distance $W_{1}(\mu,\Pi_{\mathcal{W}_{1}}\mu)$ between the distribution $\mu$ and its projection $\Pi_{\mathcal{W}_{1}}\mu$ (an $N$-quantile distribution with evenly spaced $\tau_{i}$), as demonstrated in Lemma 2 of Dabney et al. (2018b).
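For concreteness, the sketch below applies the quantile projection $\Pi_{\mathcal{W}_{1}}$ to a distribution represented here by an empirical sample; the exponential example and sample size are our own illustrative choices.

```python
import numpy as np

# Project a distribution (given by an empirical sample) onto N atoms located at
# its tau_i = (2i - 1) / (2N) quantiles, i.e., an approximation of Pi_W1.
def quantile_projection(samples, N):
    taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    return np.quantile(samples, taus)      # empirical estimate of F^{-1}(tau_i)

mu_samples = np.random.default_rng(0).exponential(scale=1.0, size=100_000)
atoms = quantile_projection(mu_samples, N=10)
print(atoms, "projected mean:", atoms.mean())   # note the shift from the true mean 1.0
```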

Appendix B Proofs

In this section, we provide the proofs of the theorems discussed in the main manuscript.

B.1 Proof of Section 3

Proposition B.1 (Sobel, 1982; Bellemare et al., 2017).

Suppose there are value distributions $\nu_{1},\nu_{2}\in\mathscr{P}(\mathbb{R})$, and random variables $Z_{i}^{k+1}\sim\mathcal{T}^{\pi}\nu_{i}$, $Z_{i}^{k}\sim\nu_{i}$. Then, we have

$$\left\|\mathbb{E}Z_{1}^{k+1}-\mathbb{E}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma\left\|\mathbb{E}Z_{1}^{k}-\mathbb{E}Z_{2}^{k}\right\|_{\infty},\ \text{ and }$$
$$\left\|\mathrm{Var}Z_{1}^{k+1}-\mathrm{Var}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma^{2}\left\|\mathrm{Var}Z_{1}^{k}-\mathrm{Var}Z_{2}^{k}\right\|_{\infty}.$$
Proof.

This proof follows directly from Bellemare et al. (2017). The first statement can be proved using the commutation $\mathbb{E}\mathcal{T}^{\pi}=\mathcal{T}^{\pi}\mathbb{E}$. By the independence of $R$ and $P^{\pi}Z_{i}$, where $P^{\pi}$ is the transition operator, we have

$Z_{i}^{k+1}(x,a)\stackrel{D}{:=}R(x,a)+\gamma P^{\pi}Z_{i}^{k}(x,a),$
$\mathrm{Var}(Z_{i}^{k+1}(x,a))=\mathrm{Var}(R(x,a))+\gamma^{2}\mathrm{Var}\left(P^{\pi}Z_{i}^{k}(x,a)\right).$

Thus, we have

$\left\|\mathrm{Var}\,Z_{1}^{k+1}-\mathrm{Var}\,Z_{2}^{k+1}\right\|_{\infty}$
$=\sup_{x,a}\left|\mathrm{Var}\,Z_{1}^{k+1}(x,a)-\mathrm{Var}\,Z_{2}^{k+1}(x,a)\right|$
$=\sup_{x,a}\gamma^{2}\left|\mathrm{Var}\left(P^{\pi}Z_{1}^{k}(x,a)\right)-\mathrm{Var}\left(P^{\pi}Z_{2}^{k}(x,a)\right)\right|$
$=\sup_{x,a}\gamma^{2}\left|\mathbb{E}\left[\mathrm{Var}\left(Z_{1}^{k}(X^{\prime},A^{\prime})\right)-\mathrm{Var}\left(Z_{2}^{k}(X^{\prime},A^{\prime})\right)\right]\right|$
$\leq\sup_{x^{\prime},a^{\prime}}\gamma^{2}\left|\mathrm{Var}\left(Z_{1}^{k}(x^{\prime},a^{\prime})\right)-\mathrm{Var}\left(Z_{2}^{k}(x^{\prime},a^{\prime})\right)\right|$
$\leq\gamma^{2}\left\|\mathrm{Var}\,Z_{1}^{k}-\mathrm{Var}\,Z_{2}^{k}\right\|_{\infty}.$ ∎

Lemma B.2 (Lemma B.2 of Rowland et al. (2019)).

Let $\tau_{k}=\frac{2k-1}{2K}$ for $k=1,\ldots,K$. Consider the corresponding 1-Wasserstein projection operator $\Pi_{W_{1}}:\mathscr{P}(\mathbb{R})\rightarrow\mathscr{P}(\mathbb{R})$, defined by

$\Pi_{W_{1}}\mu_{i}=\frac{1}{K}\sum_{k=1}^{K}\delta_{F_{\mu_{i}}^{-1}\left(\tau_{k}\right)},$

for all $\mu_{i}\in\mathscr{P}(\mathbb{R})$, where $F_{\mu_{i}}^{-1}$ is the inverse CDF of $\mu_{i}$. Let random variables $X\sim\mu_{1}$ and $X^{2}\sim\mu_{2}$, and let $\eta_{1},\eta_{2}\in\mathscr{P}(\mathbb{R})$. Suppose the immediate reward distributions are supported on $[-R_{max},R_{max}]$. Then, we have:
(i) $W_{1}\left(\Pi_{W_{1}}\mu_{1},\mu_{1}\right)\leq\frac{2R_{max}}{K(1-\gamma)}$;
(ii) $W_{1}\left(\Pi_{W_{1}}\eta_{1},\Pi_{W_{1}}\eta_{2}\right)\leq W_{1}\left(\eta_{1},\eta_{2}\right)+\frac{4R_{max}}{K(1-\gamma)}$;
(iii) $W_{1}\left(\Pi_{W_{1}}\mu_{2},\mu_{2}\right)\leq\frac{R_{max}^{2}}{K(1-\gamma)}$.

Proof.

This proof follows directly from Lemma B.2 of Rowland et al. (2019). To prove (i), let $F_{\mu_{1}}^{-1}$ be the inverse CDF of $\mu_{1}$. We have

$W_{1}\left(\Pi_{W_{1}}\mu_{1},\mu_{1}\right)=\sum_{i=0}^{K-1}\int_{F_{\mu_{1}}^{-1}(\frac{i}{K})}^{F_{\mu_{1}}^{-1}(\frac{i+1}{K})}\left|x-F_{\mu_{1}}^{-1}\left(\tfrac{2i+1}{2K}\right)\right|\,\mu_{1}(dx)$
$\leq\frac{1}{K}\left(F_{\mu_{1}}^{-1}(1)-F_{\mu_{1}}^{-1}(0)\right)\quad(\text{each interval carries probability mass }1/K\text{, and the return distribution }\mu_{1}\text{ is bounded on }[-\tfrac{R_{max}}{1-\gamma},\tfrac{R_{max}}{1-\gamma}])$
$=\frac{2R_{max}}{K(1-\gamma)}.$

To prove (ii), we use the triangle inequality together with statement (i):

$W_{1}\left(\Pi_{W_{1}}\eta_{1},\Pi_{W_{1}}\eta_{2}\right)\leq W_{1}\left(\Pi_{W_{1}}\eta_{1},\eta_{1}\right)+W_{1}\left(\eta_{1},\eta_{2}\right)+W_{1}\left(\eta_{2},\Pi_{W_{1}}\eta_{2}\right)$
$\leq W_{1}\left(\eta_{1},\eta_{2}\right)+\frac{4R_{max}}{K(1-\gamma)}.$

Statement (ii) reflects the fact that the quantile projection operator $\Pi_{W_{1}}$ need not be a non-expansion under the 1-Wasserstein distance, a property that matters for the uniqueness of the fixed point and the convergence of the algorithm.

The proof of (iii) is similar to that of (i), using the fact that the distribution $\mu_{2}$ is bounded on $[0,\frac{R_{max}^{2}}{1-\gamma}]$ to obtain the following inequality:

$W_{1}\left(\Pi_{W_{1}}\mu_{2},\mu_{2}\right)\leq\frac{R_{max}^{2}}{K(1-\gamma)}.$ ∎

Theorem B.3 (Parameterization induced error bound).

Let $\Pi_{\mathcal{W}_{1}}$ be the projection operator onto evenly spaced quantile levels $\tau_{i}=\frac{2i-1}{2N}$, $i=1,\ldots,N$, and let $\eta_{k}\in\mathscr{P}(\mathbb{R})$ be the return distribution at the $k$-th iteration. Let random variables $Z_{\theta}^{k}\sim\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}\eta_{k}$ and $Z^{k}\sim\mathcal{T}^{\pi}\eta_{k}$. Assume that the distribution of the immediate reward is supported on $[-R_{max},R_{max}]$; then we have

$\lim_{k\rightarrow\infty}\left\|\mathcal{E}^{k}_{3}\right\|_{\infty}=\lim_{k\rightarrow\infty}\left\|\mathbb{E}Z_{\theta}^{k}-\mathbb{E}Z^{k}\right\|_{\infty}\leq\frac{2R_{max}}{N(1-\gamma)},$

where $\mathcal{E}^{k}_{3}$ is the parameterization-induced error at the $k$-th iteration.

Proof.

Using the dual representation of the Wasserstein distance (Villani, 2009) and Lemma B.2, for all $(x,a)$ we have

$\left|\mathbb{E}Z_{\theta}^{k}(x,a)-\mathbb{E}Z^{k}(x,a)\right|\leq W_{1}\left(\Pi_{W_{1}}\mathcal{T}^{\pi}\eta_{k}(x,a),\mathcal{T}^{\pi}\eta_{k}(x,a)\right)\leq\frac{2R_{max}}{N(1-\gamma)}.$

Taking the supremum over $(x,a)$ and the limit over iterations $k$ on the left-hand side, we obtain

$\lim_{k\rightarrow\infty}\left\|\mathcal{E}^{k}_{3}\right\|_{\infty}=\lim_{k\rightarrow\infty}\left\|\mathbb{E}Z_{\theta}^{k}-\mathbb{E}Z^{k}\right\|_{\infty}\leq\frac{2R_{max}}{N(1-\gamma)}.$

In a similar way, the second-order moment can be bounded by,

$\lim_{k\rightarrow\infty}\left\|\mathbb{E}[Z_{\theta}^{k}]^{2}-\mathbb{E}[Z^{k}]^{2}\right\|_{\infty}\leq\frac{R_{max}^{2}}{N(1-\gamma)}.$

This suggests that higher-order moments are not exactly preserved once the quantile representation is applied. ∎

B.2 Proofs for Section 4

Lemma B.4 (expectation by quantiles).

Let $Z\sim\nu$ be a random variable with CDF $F_{\nu}$ and quantile function $F_{\nu}^{-1}$. Then,

$\mathbb{E}[Z]=\int_{0}^{1}F_{\nu}^{-1}(\tau)\,d\tau.$
Proof.

Since any CDF is non-decreasing and right-continuous, we have for all $(\tau,z)\in(0,1)\times\mathbb{R}$:

$F_{\nu}^{-1}(\tau)\leq z\Longleftrightarrow\tau\leq F_{\nu}(z).$

Then, denoting by $U$ a random variable uniformly distributed over $[0,1]$,

$\mathbb{P}(F_{\nu}^{-1}(U)\leq z)=\mathbb{P}(U\leq F_{\nu}(z))=F_{\nu}(z),$

which shows that the random variable $F_{\nu}^{-1}(U)$ has the same distribution as $Z$. Hence,

$\mathbb{E}[Z]=\mathbb{E}\left[F_{\nu}^{-1}(U)\right]=\int_{0}^{1}F_{\nu}^{-1}(\tau)\,d\tau.$ ∎
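As a quick numerical illustration of this identity (our own example, using the Exp(1) distribution whose quantile function is available in closed form), averaging the quantile function over the midpoints $\tau_{i}=\frac{2i-1}{2N}$ recovers the mean:

```python
import numpy as np

# E[Z] = \int_0^1 F_nu^{-1}(tau) d tau, approximated by averaging the quantile
# function on the midpoints tau_i = (2i - 1) / (2N).
N = 10_000
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)

# Exp(1) has quantile function F^{-1}(tau) = -log(1 - tau) and true mean 1.
quantiles = -np.log(1.0 - taus)
print(quantiles.mean())   # ~1.0, up to the discretisation error of the grid
```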

Lemma B.5.

Consider the linear regression model $\boldsymbol{\hat{Q}}=\mathbf{X}_{2}\boldsymbol{M}_{2}+\mathcal{E}$, where $\mathcal{E}$ is distributed as $\mathcal{N}(\mathbf{0},\sigma^{2}V)$ with $V=\mathrm{diag}(v_{1},v_{2},\cdots,v_{N})$, $v_{i}\geq 1$ for $i=1,\cdots,N$, and we set the noise variance $\sigma^{2}=1$ without loss of generality. The WLS estimator is

$\widehat{\boldsymbol{M}}_{2}=(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2})^{-1}\mathbf{X}^{\top}_{2}V^{-1}\boldsymbol{\hat{Q}},$ (25)

and the distribution of the mean estimator takes the form

$\hat{m}_{1}\sim\mathcal{N}\left(m_{1},\;\frac{1}{\sum_{i}v_{i}}+\frac{\left(\frac{\sum_{i}v_{i}z_{\tau_{i}}}{\sum_{i}v_{i}}\right)^{2}}{\sum_{i}v_{i}z_{\tau_{i}}^{2}-\frac{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}{\sum_{i}v_{i}}}\right).$

When $V$ equals the identity matrix $I$,

$\hat{m}_{1}\sim\mathcal{N}\left(m_{1},\;\frac{1}{N}+\frac{\bar{z}^{2}}{\sum_{i}(z_{\tau_{i}}-\bar{z})^{2}}\right).$
Proof.

Premultiplying by $V^{-1/2}$, we get the transformed model

$V^{-1/2}\boldsymbol{\hat{Q}}=V^{-1/2}\mathbf{X}_{2}\boldsymbol{M}_{2}+V^{-1/2}\mathcal{E}.$

Now set $\boldsymbol{\hat{Q}}^{*}=V^{-1/2}\boldsymbol{\hat{Q}}$, $\mathbf{X}^{*}_{2}=V^{-1/2}\mathbf{X}_{2}$, and $\mathcal{E}^{*}=V^{-1/2}\mathcal{E}$, so that the transformed model can be written as $\boldsymbol{\hat{Q}}^{*}=\mathbf{X}^{*}_{2}\boldsymbol{M}_{2}+\mathcal{E}^{*}$. The transformed model is a Gauss-Markov model satisfying the OLS assumptions. Thus, the unique OLS solution is $\widehat{\boldsymbol{M}}_{2}=\left(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2}\right)^{-1}\mathbf{X}^{\top}_{2}V^{-1}\boldsymbol{\hat{Q}}$, and $\widehat{\boldsymbol{M}}_{2}\sim\mathcal{N}\left(\boldsymbol{M}_{2},\sigma^{2}(\mathbf{X}_{2}^{\top}V^{-1}\mathbf{X}_{2})^{-1}\right)$. By computing $(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2})^{-1}$, we derive $\hat{m}_{1}\sim\mathcal{N}\left(m_{1},\frac{1}{\sum_{i}v_{i}}+\frac{(\frac{\sum_{i}v_{i}z_{\tau_{i}}}{\sum_{i}v_{i}})^{2}}{\sum_{i}v_{i}z_{\tau_{i}}^{2}-\frac{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}{\sum_{i}v_{i}}}\right)$. ∎
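The sketch below checks the covariance $\sigma^{2}(\mathbf{X}_{2}^{\top}V^{-1}\mathbf{X}_{2})^{-1}$ of the WLS estimator by Monte Carlo; the quantile grid, the weights $v_{i}$, and the true coefficients are our own illustrative assumptions, not the QEMRL settings.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 32
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)                               # z_{tau_i}
X2 = np.column_stack([np.ones(N), z])            # design matrix X_2 = [1, z_tau]
v = np.where((taus < 0.1) | (taus > 0.9), 1.5, 1.0)   # heteroskedastic noise variances
V_inv = np.diag(1.0 / v)
M2_true = np.array([2.0, 0.5])                   # true coefficients M_2 = (m_1, m_2)

def wls(Q_hat):
    """WLS estimator (X' V^{-1} X)^{-1} X' V^{-1} Q_hat, as in Eq. (25)."""
    return np.linalg.solve(X2.T @ V_inv @ X2, X2.T @ V_inv @ Q_hat)

# Monte Carlo check that Cov(M_hat_2) = (X' V^{-1} X)^{-1} when sigma^2 = 1.
draws = np.array([wls(X2 @ M2_true + rng.normal(0.0, np.sqrt(v)))
                  for _ in range(20_000)])
print(np.cov(draws.T))
print(np.linalg.inv(X2.T @ V_inv @ X2))          # the two matrices agree closely
```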

Proposition B.6.

Suppose the noise $\varepsilon_{i}$ independently follows $\mathcal{N}(0,v_{i})$, where $v_{i}\geq 1$ for $i=1,\cdots,N$. Then,

(i) In the homoskedastic case where $v_{i}=1$ for $i=1,\dots,N$, the empirical mean estimator $\hat{m}_{1}^{*}$ has a lower variance, i.e. $\mathrm{Var}(\hat{m}_{1}^{*})<\mathrm{Var}(\hat{m}_{1})$;

(ii) In the heteroskedastic case where the $v_{i}$'s are not all equal, the QEM estimator $\hat{m}_{1}$ achieves a lower variance, i.e. $\mathrm{Var}(\hat{m}_{1})<\mathrm{Var}(\hat{m}_{1}^{*})$, if and only if $\bar{v}^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)>0$, where $\bar{v}=\frac{1}{N}\sum_{i}v_{i}$. This inequality holds when $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$, which is guaranteed in QDRL.

Proof.

The proof of (i) comes directly from the comparison of variances, i.e. $\mathrm{Var}(\hat{m}_{1}^{*})=\frac{1}{N}<\frac{1}{N}+\frac{\bar{z}^{2}}{\sum_{i}(z_{\tau_{i}}-\bar{z})^{2}}=\mathrm{Var}(\hat{m}_{1})$. Next, we prove that (ii) holds under the sufficient condition $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$. In QDRL, the quantile levels $\tau_{i}=\frac{2i-1}{2N}$ are equally spaced around $0.5$, so the condition $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$ indeed holds, where $z_{\tau_{i}}$ is the $\tau_{i}$-th quantile of the standard normal distribution. For $N=2$, we need to validate the inequality $\bar{v}^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)>0$. This can be transformed into a multivariate extreme value problem. Analyzing the function $f(v_{1},v_{2})=\frac{(v_{1}+v_{2})^{2}}{4}-1-\frac{1}{\frac{(v_{1}+v_{2})^{2}}{(v_{1}-v_{2})^{2}}-1}$ shows that its infimum over $v_{1},v_{2}>1$ is $0$, approached only in the limit $\lim_{(v_{1},v_{2})\to(1,1)}f(v_{1},v_{2})=0$. For $N=3$, the case is identical to $N=2$ since $z_{0.5}=0$. For $N=4$, $f(v_{1},v_{2},v_{3},v_{4})=\frac{(v_{1}+v_{2}+v_{3}+v_{4})^{2}}{N^{2}}-1-\frac{1}{\frac{(v_{1}+v_{2}+v_{3}+v_{4})(k^{2}v_{1}+v_{2}+v_{3}+k^{2}v_{4})}{(kv_{1}+v_{2}-v_{3}-kv_{4})^{2}}-1}$, and this expression can be factored as $f(v_{1},v_{2},v_{3},v_{4})=\frac{v_{1}+v_{2}+v_{3}+v_{4}}{N^{2}C}\left((v_{1}+v_{2}+v_{3}+v_{4})C-N^{2}(k^{2}v_{1}+v_{2}+v_{3}+k^{2}v_{4})\right)$, where $C=(k-1)^{2}v_{1}v_{2}+(k+1)^{2}v_{1}v_{3}+4k^{2}v_{1}v_{4}+4v_{2}v_{3}+(k+1)^{2}v_{2}v_{4}+(k-1)^{2}v_{3}v_{4}$ and $k=\frac{\Phi^{-1}(7/8)}{\Phi^{-1}(5/8)}>3$. By comparing the coefficients of the corresponding terms, we can verify that $f(v_{1},v_{2},v_{3},v_{4})>0$ when $v_{i}>1$. Finally, the remaining cases can be proven in the same manner. ∎

Theorem B.7.

Let $\hat{\pi}$ be the learned policy, denote the optimal policy by $\pi_{opt}$, and let $\alpha=\max_{x^{\prime}}D_{TV}\left(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime})\right)$ and $n(x,a)=|\mathcal{D}|$. For all $\delta\in(0,1)$, with probability at least $1-\delta$, for any $\eta(x,a)\in\mathscr{P}(\mathbb{R})$ and all $(x,a)\in\mathcal{D}$,

$\left\|F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}\right\|_{\infty}\leq 2\alpha+\sqrt{\frac{1+4|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$
Proof.

We give this proof in the tabular MDP setting. Directly from the definition of the distributional Bellman operator applied to the CDF, we have

$F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}(u)-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}(u)=\sum_{x^{\prime},a^{\prime}}\hat{P}(x^{\prime}\mid x,a)\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z\left(x^{\prime},a^{\prime}\right)+\hat{R}(x,a)}(u)-\sum_{x^{\prime},a^{\prime}}P(x^{\prime}\mid x,a)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u).$

For notational convenience, we use random variables instead of measures. Here $\hat{P}$ and $\hat{R}$ are the maximum likelihood estimates of the transition and reward functions, respectively. Adding and subtracting $\sum_{x^{\prime},a^{\prime}}\hat{P}(x^{\prime}\mid x,a)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)$, we have

$\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\left(\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right)$
$+\sum_{x^{\prime},a^{\prime}}\left(\hat{P}(x^{\prime}\mid x,a)-P(x^{\prime}\mid x,a)\right)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u).$

For the first term, note that

$\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\left(\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right)$
$=\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\Big(\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)$
$\qquad\qquad\qquad\qquad+\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)-\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\Big)$
$\leq\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\Big(\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\left|F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right|$
$\qquad\qquad\qquad\qquad+\left|F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right|\cdot\left|\hat{\pi}(a^{\prime}\mid x^{\prime})-\pi_{opt}(a^{\prime}\mid x^{\prime})\right|\Big)$
$\stackrel{(a)}{\leq}\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\left(\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}+2D_{TV}\left(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime})\right)\right)$
$\leq 2\alpha+\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}.$

Here (a) follows from the fact that $\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})-\pi_{opt}(a^{\prime}\mid x^{\prime})\right|=2D_{TV}\left(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime})\right)\leq 2\alpha$, and

$\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\left|F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right|$
$=\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\int\left|F_{\hat{R}(x,a)}(r)-F_{R(x,a)}(r)\right|\,dF_{\gamma Z\left(x^{\prime},a^{\prime}\right)}(u-r)$
$\leq\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\sup_{r}\left|F_{\hat{R}(x,a)}(r)-F_{R(x,a)}(r)\right|\int dF_{\gamma Z\left(x^{\prime},a^{\prime}\right)}(u-r)$
$=\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}.$

The second term can be bounded as follows:

$\sum_{x^{\prime},a^{\prime}}\left(\hat{P}(x^{\prime}\mid x,a)-P(x^{\prime}\mid x,a)\right)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)$
$\leq\sum_{x^{\prime}}\left(\hat{P}(x^{\prime}\mid x,a)-P(x^{\prime}\mid x,a)\right)\sum_{a^{\prime}}\pi_{opt}(a^{\prime}\mid x^{\prime})$
$\leq\left\|\hat{P}(\cdot\mid x,a)-P(\cdot\mid x,a)\right\|_{1}\cdot\left\|\sum_{a^{\prime}}\pi_{opt}(a^{\prime}\mid\cdot)\right\|_{\infty}$
$=\left\|\hat{P}(\cdot\mid x,a)-P(\cdot\mid x,a)\right\|_{1}.$

Next, we show that the two norms can be bounded. By the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, the following inequality holds with probability at least $1-\delta/2$, for all $(x,a)\in\mathcal{D}$:

$\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}\leq\sqrt{\frac{1}{2n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$

By Hoeffding's inequality and an $\ell_{1}$ concentration bound for the multinomial distribution (see https://nanjiang.cs.illinois.edu/files/cs598/note3.pdf), the following inequality holds with probability at least $1-\delta/2$:

$\max_{x,a}\left\|\hat{P}(\cdot\mid x,a)-P(\cdot\mid x,a)\right\|_{1}\leq\sqrt{\frac{2|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$

Consequently, combining the two inequalities via a union bound and using $\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)}$, the claim follows:

$\left\|F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}\right\|_{\infty}\leq 2\alpha+\sqrt{\frac{1+4|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$ ∎

Appendix C Cornish-Fisher Expansion

The Cornish-Fisher expansion (Cornish & Fisher, 1938) is an asymptotic expansion used to approximate the quantiles of a probability distribution from its cumulants. To be more explicit, let $X^{*}$ be a non-Gaussian random variable with mean $0$ and variance $1$. Then the Cornish-Fisher expansion can be represented as a polynomial expansion:

$F^{-1}_{X^{*}}(\tau)=\sum_{i=0}^{\infty}a_{i}\left(\Phi^{-1}(\tau)\right)^{i},$

where the parameters $a_{i}$ depend on the cumulants of $X^{*}$ and $\Phi$ is the standard normal distribution function. To use this expansion in practice, we need to truncate the series. According to Cornish & Fisher (1938), the highest power $i$ must be odd, and the fourth-order ($i=3$) approximation is commonly used in practice. The parameters of the fourth-order expansion are $a_{0}=-\frac{\kappa_{3}}{6}$, $a_{1}=1+5(\frac{\kappa_{3}}{6})^{2}-3\frac{\kappa_{4}}{24}$, $a_{2}=\frac{\kappa_{3}}{6}$, and $a_{3}=\frac{\kappa_{4}}{24}-2(\frac{\kappa_{3}}{6})^{2}$, where $\kappa_{i}$ denotes the $i$-th cumulant. Therefore, the fourth-order expansion is

$F^{-1}_{X^{*}}(\tau)=-\frac{\kappa_{3}}{6}+\left(1+5\left(\frac{\kappa_{3}}{6}\right)^{2}-3\frac{\kappa_{4}}{24}\right)\Phi^{-1}(\tau)+\frac{\kappa_{3}}{6}\left(\Phi^{-1}(\tau)\right)^{2}+\left(\frac{\kappa_{4}}{24}-2\left(\frac{\kappa_{3}}{6}\right)^{2}\right)\left(\Phi^{-1}(\tau)\right)^{3}+\cdots.$

Now define $X^{*}$ as the standardization of $X$, i.e. $X=\mu+\sigma X^{*}$, where $X$ has mean $\mu$ and variance $\sigma^{2}$. Then $F^{-1}_{X}(\tau)$ can be approximated by

$F^{-1}_{X}(\tau)=\mu+\sigma\left(-\frac{\kappa_{3}}{6\sigma^{3}}+\left(1+5\left(\frac{\kappa_{3}}{6\sigma^{3}}\right)^{2}-3\frac{\kappa_{4}}{24\sigma^{4}}\right)\Phi^{-1}(\tau)+\frac{\kappa_{3}}{6\sigma^{3}}\left(\Phi^{-1}(\tau)\right)^{2}+\left(\frac{\kappa_{4}}{24\sigma^{4}}-2\left(\frac{\kappa_{3}}{6\sigma^{3}}\right)^{2}\right)\left(\Phi^{-1}(\tau)\right)^{3}+\cdots\right).$

Denote the skewness by $s=\frac{\kappa_{3}}{\sigma^{3}}$, the kurtosis by $k=\frac{\kappa_{4}}{\sigma^{4}}$, and the standard normal quantile by $z_{\tau}=\Phi^{-1}(\tau)$. Then we can rewrite the above equation as

$F^{-1}_{X}(\tau)=\mu+\sigma\left(z_{\tau}+(z_{\tau}^{2}-1)\frac{s}{6}+(z_{\tau}^{3}-3z_{\tau})\frac{k}{24}+(-2z_{\tau}^{3}+5z_{\tau})\left(\frac{s}{6}\right)^{2}+\cdots\right).$ (26)
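A minimal sketch of Equation 26 is given below; it approximates the quantiles of a Gamma(10, 1) distribution (our own illustrative example) from its mean, standard deviation, skewness, and excess kurtosis, and compares them with the exact quantiles.

```python
import numpy as np
from scipy.stats import norm, gamma

def cornish_fisher_quantile(tau, mu, sigma, s, k):
    """Fourth-order Cornish-Fisher approximation of F_X^{-1}(tau), as in Eq. (26).

    s = kappa_3 / sigma^3 is the skewness, k = kappa_4 / sigma^4 the excess kurtosis.
    """
    z = norm.ppf(tau)
    return mu + sigma * (z + (z**2 - 1) * s / 6
                           + (z**3 - 3 * z) * k / 24
                           + (-2 * z**3 + 5 * z) * (s / 6) ** 2)

# Gamma(a, 1) with a = 10: mean a, variance a, skewness 2/sqrt(a), excess kurtosis 6/a.
a = 10.0
taus = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
print(cornish_fisher_quantile(taus, mu=a, sigma=np.sqrt(a), s=2 / np.sqrt(a), k=6 / a))
print(gamma(a).ppf(taus))   # the approximation closely tracks the exact quantiles
```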

C.1 Regression model selection

We use the R-squared ($R^{2}$) statistic to determine how many terms of Equation 26 should be included in the regression model. $R^{2}$, also known as the coefficient of determination, is a statistical measure of how well the independent variables explain the variance in the dependent variable; in other words, it measures how well the data fit the regression model.

Figure 9: Fitted quantile plot. (a) Normal, $\mathcal{N}(0,1)$. (b) Mixture Gaussian, $0.7\mathcal{N}(-2,1)+0.3\mathcal{N}(3,1)$. (c) Exponential, $\mathrm{Exp}(1)$ with density $e^{-x}$. (d) Gumbel, $G(0,1)$ with density $e^{-(x+e^{-x})}$.

Consider the linear regression model,

$\boldsymbol{\hat{Y}}=\mathbf{X}_{i}\boldsymbol{\beta}_{i}+\mathcal{E}.$

The dependent variable $\boldsymbol{\hat{Y}}$ is composed of the quantiles $\boldsymbol{Y}=(F^{-1}_{X}(\tau_{1}),\dots,F^{-1}_{X}(\tau_{N}))^{T}$ of the distribution of $X$ plus the noise vector $\mathcal{E}$ sampled from $\mathcal{N}(0,0.25)$. When the design matrix is $\mathbf{X}_{1}=(1,\cdots,1)^{\prime}$, the regression model reduces to a one-sample problem, and $\boldsymbol{\beta}_{1}$ can be directly estimated by $\frac{1}{N}\sum_{n=1}^{N}F^{-1}_{X}(\tau_{n})$. We then investigate the following four types of regression models:

Model 1:

$\mathbf{X}_{2}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{2}=\left(\mu,\sigma\right)^{T},$

Model 2:

$\mathbf{X}_{3}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\\ z_{\tau_{1}}^{2}-1&\cdots&z_{\tau_{N}}^{2}-1\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{3}=\left(\mu,\sigma,\sigma\tfrac{s}{6}\right)^{T},$

Model 3:

$\mathbf{X}_{4}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\\ z_{\tau_{1}}^{2}-1&\cdots&z_{\tau_{N}}^{2}-1\\ z_{\tau_{1}}^{3}-3z_{\tau_{1}}&\cdots&z_{\tau_{N}}^{3}-3z_{\tau_{N}}\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{4}=\left(\mu,\sigma,\sigma\tfrac{s}{6},\sigma\tfrac{k}{24}\right)^{T},$

Model 4:

$\mathbf{X}_{5}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\\ z_{\tau_{1}}^{2}-1&\cdots&z_{\tau_{N}}^{2}-1\\ z_{\tau_{1}}^{3}-3z_{\tau_{1}}&\cdots&z_{\tau_{N}}^{3}-3z_{\tau_{N}}\\ -2z_{\tau_{1}}^{3}+5z_{\tau_{1}}&\cdots&-2z_{\tau_{N}}^{3}+5z_{\tau_{N}}\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{5}=\left(\mu,\sigma,\sigma\tfrac{s}{6},\sigma\tfrac{k}{24},\sigma\left(\tfrac{s}{6}\right)^{2}\right)^{T}.$

Figure 9 shows the regression fitted values and the corresponding $R^{2}$ across several distributions of $X$. As the number of independent variables increases, more of the variance in the error can be explained. However, too many independent variables increase the risk of multicollinearity and overfitting. Based on these practical considerations, we choose Model 3 as our regression model due to its satisfactory level of explainability. In the subsequent section, we give a more in-depth interpretation of this regression model.
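For illustration, the sketch below builds the Model 3 design matrix, fits it by ordinary least squares to noisy quantiles of a Gumbel(0, 1) distribution (as in Figure 9(d)), and reports $R^{2}$; the sample size, noise level, and variable names are our own choices.

```python
import numpy as np
from scipy.stats import norm, gumbel_r

rng = np.random.default_rng(0)
N = 64
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)

# Noisy quantile "observations" of a Gumbel(0, 1) distribution, noise ~ N(0, 0.25).
Y_hat = gumbel_r().ppf(taus) + rng.normal(0.0, 0.5, size=N)

# Model 3 design matrix: intercept, z, z^2 - 1, z^3 - 3z.
X4 = np.column_stack([np.ones(N), z, z**2 - 1, z**3 - 3 * z])

beta, *_ = np.linalg.lstsq(X4, Y_hat, rcond=None)
fitted = X4 @ beta
r2 = 1 - np.sum((Y_hat - fitted) ** 2) / np.sum((Y_hat - Y_hat.mean()) ** 2)
print(beta)   # roughly (mu, sigma, sigma*s/6, sigma*k/24) of the Gumbel under the expansion
print(r2)     # coefficient of determination of the fit
```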

C.2 Interpretation of the remaining term ω(τ)\omega(\tau)

In this section, we explore the role of the remaining term $\omega(\tau)$ in the context of random-design regression. As discussed in Section 4, we present a decomposition of the estimate $\hat{q}(\tau)$ of the $\tau$-th quantile, which includes contributions from the mean, the noise error, and the misspecified error. Specifically, we express the estimate as follows:

$\hat{q}(\tau)=\mu+\omega_{1}(\tau)+\varepsilon(\tau),$

where $\mu$ can be estimated using the mean estimator $\frac{1}{N}\sum_{i}q(\tau_{i})$, which is commonly used in the QDRL and IQN settings. However, this simple model fails to capture important information contained in $\omega_{1}(\tau)$. To address this limitation, we employ the Cornish-Fisher expansion to expand the equation, resulting in the following expressions:

$\hat{q}(\tau)=\mu+z_{\tau}\sigma+\sigma\omega_{2}(\tau)+\varepsilon(\tau),$
$\hat{q}(\tau)=\mu+z_{\tau}\sigma+(z_{\tau}^{2}-1)\sigma\tfrac{s}{6}+\sigma\omega_{3}(\tau)+\varepsilon(\tau),$
$\cdots$

where $\mu$ can be estimated by the linear regression estimator given multiple quantile levels $\{\tau_{i}\}$, which can be sampled from a uniform distribution or predefined to be evenly spaced in $(0,1)$. In theory, higher-order expansions capture more of the misspecified information in $\omega(\tau)$, leading to a more accurate representation of the quantile. However, as discussed before, expansions are typically limited to the fourth order in practice to balance the trade-off between model complexity and estimation accuracy.

To gain a better understanding of the remaining term $\omega(\tau)$ and its impact on the regression estimator, consider the linear model

$\hat{q}(\tau)=\mathbf{x}_{\tau}^{\prime}\beta+\underbrace{\omega_{\tau}}_{\text{Misspecified error}}+\underbrace{\varepsilon}_{\text{Noise error}},$

where $\tau$ can generally be considered uniformly distributed, $\mathbf{x}_{\tau}=(1,z_{\tau},z_{\tau}^{2}-1,\ldots)^{\prime}\in\mathbb{R}^{d}$, and $\beta=(\mu,\sigma,\sigma\tfrac{s}{6},\ldots)^{\prime}\in\mathbb{R}^{d}$. In particular, define the random variables

$\varepsilon:=\hat{q}(\tau)-\mathbb{E}[\hat{q}(\tau)\mid\mathbf{x}_{\tau}]\quad\text{and}\quad\omega_{\tau}:=\mathbb{E}[\hat{q}(\tau)\mid\mathbf{x}_{\tau}]-\mathbf{x}_{\tau}^{\prime}\beta,$

where $\varepsilon$ corresponds to noise with zero mean and variance $\sigma^{2}_{\text{noise}}$, independent across different levels of $\tau$, and $\omega_{\tau}$ corresponds to the misspecification error of $\beta$. Under the following conditions, we can derive a bound for the regression estimator in the misspecified model.

Condition 1 (Subgaussian noise). There exists a finite constant $\sigma_{\text{noise}}\geq 0$ such that for all $\lambda\in\mathbb{R}$, almost surely:

$\mathbb{E}[\exp(\lambda\varepsilon)\mid\mathbf{x}_{\tau}]\leq\exp\left(\lambda^{2}\sigma_{\text{noise}}^{2}/2\right).$

Condition 2 (Bounded approximation error). There exists a finite constant $C_{\text{bias}}\geq 0$ such that, almost surely:

$\left\|\Sigma^{-1/2}\mathbf{x}_{\tau}\omega_{\tau}\right\|_{2}\leq C_{\text{bias}}\sqrt{d},$

where $\Sigma=\mathbb{E}[\mathbf{x}_{\tau}\mathbf{x}_{\tau}^{\prime}]$.

Condition 3 (Subgaussian projections). There exists a finite constant $\rho\geq 1$ such that:

$\mathbb{E}\left[\exp(\alpha^{\top}\Sigma^{-1/2}\mathbf{x}_{\tau})\right]\leq\exp\left(\rho\|\alpha\|^{2}_{2}/2\right),\quad\forall\alpha\in\mathbb{R}^{d}.$
Theorem C.1.

Suppose that Conditions 1, 2, and 3 hold. Then for any $\delta\in(0,1)$, with probability at least $1-3\delta$, the following holds:

$\left\|\hat{\beta}_{\mathrm{ols}}-\beta\right\|_{\Sigma}^{2}\leq\underbrace{K_{\rho,\delta,N}^{2}\left(\frac{4\,\mathbb{E}\left\|\Sigma^{-1/2}\mathbf{x}_{\tau}\omega_{\tau}\right\|^{2}_{2}\left(1+8\log(1/\delta)\right)}{N}+\frac{3C_{\text{bias}}^{2}\,d\log^{2}(1/\delta)}{N^{2}}\right)}_{\text{Misspecified error contribution}}$
$\qquad+\underbrace{K_{\rho,\delta,N}\cdot\frac{\sigma_{\text{noise}}^{2}\left(d+2\sqrt{d\log(1/\delta)}+2\log(1/\delta)\right)}{N}}_{\text{Noise error contribution}},$

where $K_{\rho,\delta,N}$ is a constant depending on $\rho$, $\delta$, and $N$.

Proof.

The proof of the above theorem can be easily adapted from Theorem 2 in Hsu et al. (2011). ∎

The first term on the right-hand side represents the error due to model misspecification, which occurs when the true model differs from the assumed model. Intuitively, incorporating more of the relevant information in $\omega(\tau)$ into the explanatory variables decreases the quantities $\mathbb{E}\left\|\Sigma^{-1/2}\mathbf{x}_{\tau}\omega_{\tau}\right\|^{2}_{2}$ and $C_{\text{bias}}$, so the accuracy of the estimator can potentially be improved by reducing the magnitude of the misspecification error. The second term represents the noise error contribution, which is unavoidable and can only be controlled by increasing the sample size $N$.

Appendix D Experimental Details

D.1 Tabular experiment

The parameter settings used for tabular control are presented in Table 1. In the QEMRL case, the weight matrix $V$ is set as shown in the table, based on domain knowledge indicating that the distribution has low-probability support around its median. The greedy parameter decreases exponentially every 100 steps, and the learning rate decreases piecewise every 50K steps.

Table 1: The (hyper-)parameters of QEMRL and QDRL used in the tabular control experiment.
Hyperparameter Value
Learning rate schedule {0.05,0.025,0.0125}
Discount factor 0.999
Quantile initialization $\mathrm{Unif}(-0.5,0.5)$
Number of quantiles 128
Number of training steps 150K
$\epsilon$-greedy schedule $0.9^{\lfloor t/100\rfloor}$
Number of MC rollouts 10000
Weight matrix $V$ (QEMRL only) $\mathrm{diag}\{1,1,\cdots,\underbrace{1.5,\cdots,1.5}_{\tau\in[0.45,0.55]},\cdots,1,1\}$
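As a small illustration (our own sketch under the assumption of 128 evenly spaced quantile levels, not the released code), the diagonal weight matrix $V$ of Table 1 could be constructed as follows:

```python
import numpy as np

# Diagonal weight matrix V for tabular QEMRL (Table 1): weight 1.5 for quantile
# levels around the median (tau in [0.45, 0.55]) and 1 elsewhere.
N = 128
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
v = np.where((taus >= 0.45) & (taus <= 0.55), 1.5, 1.0)
V = np.diag(v)
print(v[56:72])   # the entries around the median carry weight 1.5
```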

D.2 Atari experiment

We extend QEMRL to a DQN-like architecture with the same network architecture as QR-DQN, which we refer to as QEM-DQN (code is available at https://github.com/Kuangqi927/QEM). Our hyperparameter settings (Table 2) are aligned with Dabney et al. (2018b) for a fair comparison. Additionally, we extend QEMRL to the unfixed-quantile-fraction algorithm IQN, which embeds the quantile fraction $\tau$ into the quantile value network on top of QR-DQN. In Atari, it is infeasible to determine the low-probability supports for every state-action pair; therefore, we only consider the heteroskedasticity that occurs in the tails and treat $V$ as a tuning parameter to be set to an appropriate value. For the exploration experiments, we follow the settings of Mavrin et al. (2019) and set the decay factor $c_{t}=c\sqrt{\frac{\log t}{t}}$ with $c=50$.

Table 2: The hyperparameters of QEM-DQN and QR-DQN used in the Atari experiments.
Hyperparameter Value
Learning rate 0.00005
Discount factor 0.99
Optimizer Adam
Batch size 32
Number of quantiles 200
Number of quantiles (IQN) 32
Weight matrix $V$ (QEM-DQN only) $\mathrm{diag}\{\underbrace{1.5,\cdots,1.5}_{\tau\in[0.9,1)},\cdots,1,1,\cdots,\underbrace{1.5,\cdots,1.5}_{\tau\in(0,0.1]}\}$

D.3 MuJoCo experiment

We extend QEMRL to a SAC-like architecture with the same network architecture as DSAC, which we refer to as QEM-DSAC. Similarly, we extend QEMRL to an IQN version of DSAC. Hyperparameters and environment-specific parameters are listed in Table 3. In addition, SAC has a variant that adaptively fine-tunes $\alpha$ to reach a target entropy. While this adaptive mechanism performs well, we follow the fixed-$\alpha$ setting suggested in the original SAC paper to reduce irrelevant factors.

Table 3: The hyperparameters of QEM-DSAC and DSAC used in the MuJoCo experiments.
Hyperparameter Value
Policy network learning rate 0.0003
Quantile value network learning rate 0.0003
Discount factor 0.99
Optimizer Adam
Target smoothing 0.005
Batch size 256
Minimum steps before training 10000
Number of quantiles 32
Quantile fraction embedding size (IQN) 64
Weight matrix $V$ (QEM-DSAC only) $\mathrm{diag}\{\underbrace{1.2,\cdots,1.2}_{\tau\in[0.9,1)},\cdots,1,1,\cdots,\underbrace{1.2,\cdots,1.2}_{\tau\in(0,0.1]}\}$
Environment Temperature Parameter
Ant-v2 0.2
HalfCheetah-v2 0.2
Hopper-v2 0.2
Walker2d-v2 0.2
Swimmer-v2 0.2
Humanoid-v2 0.05

Appendix E Additional Experimental Results

E.1 Variance reduction for IQN

IQN does not satisfy the sufficient condition $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$ since $\tau$ is sampled from a uniform distribution rather than evenly spaced as in QDRL. To examine the impact of this on the inequality $(\frac{\sum_{i}v_{i}}{N})^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)>0$ in Proposition 4.2, we conduct simulation experiments. We use the function $f(v_{1},\cdots,v_{N})=(\frac{\sum_{i}v_{i}}{N})^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)$ to examine this inequality, where $v_{i}>1$ and the $\tau_{i}$ are sampled uniformly. In each trial, the $v_{i}$ are randomly sampled from $[1,M]$, and the process is repeated 100,000 times. The minimum values of $f(v_{1},\cdots,v_{N})$ are shown in Table 4 for varying values of $N$ and $M$; a sketch of this simulation follows the table. The results indicate that the minimum of $f$ is always greater than 0, which demonstrates that the inequality holds in practice.

Table 4: Minimum of $f$.
Minimum of $f$ $M$ $N$
0.614 2 32
4.778 5 32
43.143 20 32
0.932 2 128
7.707 5 128
76.489 20 128
1.082 2 500
9.357 5 500
96.473 20 500
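A minimal sketch of this simulation is given below (with fewer trials than the 100,000 used above, for brevity); the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def f(v, z):
    """f(v_1, ..., v_N) from the inequality above, for weights v and normal quantiles z."""
    sv, svz, svz2 = v.sum(), v @ z, v @ (z**2)
    return (sv / len(v)) ** 2 - 1 - 1 / (sv * svz2 / svz**2 - 1)

rng = np.random.default_rng(0)
N, M, trials = 32, 5, 10_000          # 100,000 trials are used in the experiment above
minimum = np.inf
for _ in range(trials):
    taus = rng.uniform(size=N)        # IQN-style uniformly sampled quantile levels
    z = norm.ppf(taus)
    v = rng.uniform(1.0, M, size=N)   # weights v_i drawn from [1, M]
    minimum = min(minimum, f(v, z))
print(minimum)                        # stays positive, consistent with Table 4
```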

E.2 Weight VV tuning experiments

Figure 10: Comparison of different weights $V$ in the QEM-DSAC and QEM-DQN experiments.

E.3 Additional Atari results

Figure 11: Comparison of QEM-DQN and QR-DQN across 9 Atari games.
Figure 12: Comparison of IQEM-DQN and IQN across 9 Atari games.
Figure 13: Comparison of QEM and DLTV across 3 hard-exploration Atari games.