
Variance Control for Distributional Reinforcement Learning

Qi Kuang    Zhoufan Zhu    Liwen Zhang    Fan Zhou
Abstract

Although distributional reinforcement learning (DRL) has been widely examined in the past few years, very few studies investigate the validity of the obtained Q-function estimator in the distributional setting. To fully understand how the approximation errors of the Q-function affect the whole training process, we do some error analysis and theoretically show how to reduce both the bias and the variance of the error terms. With this new understanding, we construct a new estimator Quantiled Expansion Mean (QEM) and introduce a new DRL algorithm (QEMRL) from the statistical perspective. We extensively evaluate our QEMRL algorithm on a variety of Atari and Mujoco benchmark tasks and demonstrate that QEMRL achieves significant improvement over baseline algorithms in terms of sample efficiency and convergence performance.


1 Introduction

Distributional Reinforcement Learning (DRL) algorithms have been shown to achieve state-of-the-art performance on RL benchmark tasks (Bellemare et al., 2017; Dabney et al., 2018b, a; Yang et al., 2019; Zhou et al., 2020, 2021). The core idea of DRL is to estimate the entire distribution of the future return instead of its expected value, i.e., the Q-function. This captures the intrinsic uncertainty of the whole process from three sources: (i) the stochasticity of rewards, (ii) the indeterminacy of the policy, and (iii) the inherent randomness of the transition dynamics. Existing DRL algorithms parameterize the return distribution in different ways, including categorical return atoms (Bellemare et al., 2017), expectiles (Rowland et al., 2019), particles (Nguyen-Tang et al., 2021), and quantiles (Dabney et al., 2018b, a). Among these works, quantile-based algorithms are widely used due to their simplicity, training efficiency, and flexibility in modeling the return distribution.

Although the existing quantile-based algorithms achieve remarkable empirical success, the approximated distribution still requires further understanding and investigation. One aspect is the crossing issue, namely, a violation of the monotonicity of the obtained quantile estimates. Zhou et al. (2020, 2021) solve this issue by enforcing the monotonicity of the estimated quantiles using well-designed neural networks. However, these methods may suffer from underestimation or overestimation issues; in other words, the estimated quantiles tend to be higher or lower than their true values. Considering this shortcoming, Luo et al. (2021) apply monotonic rational-quadratic splines to ensure monotonicity, but their algorithm is computationally expensive and hard to implement in large-scale tasks.

Another aspect concerns the tail behavior of the return distribution. It is widely acknowledged that the precision of tail estimation depends heavily on the frequency of tail observations (Koenker, 2005). Due to data sparsity, quantile estimation is often unstable at the tails. To alleviate this instability, Kuznetsov et al. (2020) propose to truncate the right tail of the approximated return distribution by discarding some of the topmost atoms. However, this approach lacks theoretical support and ignores the potentially useful information hidden in the tail.

The crossing issue and tail unrealization illustrate that there is a substantial gap between the quantile estimation and its true value. This finding reduces the reliability of the Q-function estimator obtained by quantile-based algorithms and inspires us to further minimize the difference between the estimated Q-function and its true value. In particular, the error associated with Q-function approximation can be decomposed into three parts:

$$
\begin{aligned}
\Delta &\equiv Q^{\pi}_{\theta}(x,a)-Q^{\pi}(x,a)=\mathbb{E}Z^{\pi}_{\theta}(x,a)-\mathbb{E}Z^{\pi}(x,a)\\
&=\underbrace{\mathbb{E}Z^{\pi}_{\theta}(x,a)-\mathbb{E}_{x^{\prime}\sim\mathcal{D}}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]}_{\text{Target Approximation Error}~\mathcal{E}_{1}}\\
&\quad+\underbrace{\mathbb{E}_{x^{\prime}\sim\mathcal{D}}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]-\mathbb{E}_{x^{\prime}\sim P}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]}_{\text{Bellman Operator Approximation Error}~\mathcal{E}_{2}}\\
&\quad+\underbrace{\mathbb{E}_{x^{\prime}\sim P}[R+\gamma Z^{\pi}_{\theta}(x^{\prime},a^{\prime})]-\mathbb{E}Z^{\pi}(x,a)}_{\text{Parametrization Induced Error}~\mathcal{E}_{3}}, \qquad (1)
\end{aligned}
$$

where $Q^{\pi}(\cdot)$ is the true Q-function, $Q^{\pi}_{\theta}(\cdot)$ is the approximated Q-function, $Z^{\pi}$ is the random variable with the true return distribution, $Z_{\theta}^{\pi}$ is the random variable with the approximated quantile function parameterized by a set of quantiles $\theta$, $\mathcal{D}$ is the replay buffer, and $P$ is the transition kernel. These errors can be attributed to different kinds of approximations in DRL (Rowland et al., 2018), including (i) parameterization and its associated projection operators, (ii) stochastic approximation of the Bellman operator, and (iii) gradient updates through the quantile loss.

We elaborate on the properties of the three error terms in (1). $\mathcal{E}_{1}$ is derived from the target approximation in the quantile loss. $\mathcal{E}_{2}$ is caused by the stochastic approximation of the Bellman operator. $\mathcal{E}_{3}$ results from the parametrization of quantiles and the corresponding projection operator. Among the three, $\mathcal{E}_{3}$ can be theoretically eliminated if the representation size is large enough, whereas $\mathcal{E}_{1}+\mathcal{E}_{2}$ is inevitable in practice due to the batch-based optimization procedure. Therefore, controlling the variance $\mathrm{Var}(\mathcal{E}_{1}+\mathcal{E}_{2})$ can significantly speed up training convergence (see an illustrative example in Figure 1). Thus, one main goal of this work is to reduce the two inevitable errors $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$, and thereby improve existing DRL algorithms.

Figure 1: Error decay during training. (a) The parametrization-induced error $\mathcal{E}_{3}$ (grey areas) remains constant over time with a fixed representation size, while the approximation errors $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ (blue areas) decrease slowly with time steps. (b) By increasing the size of the representation (i.e., the number of quantiles), $\mathcal{E}_{3}$ can be theoretically eliminated. By applying the variance-reduced QEM estimator, $\mathcal{E}_{1}+\mathcal{E}_{2}$ can be decreased quickly, resulting in faster convergence.

The contributions of this work are summarized as follows,

  • We offer a rigorous investigation of the three error terms $\mathcal{E}_{1}$, $\mathcal{E}_{2}$, and $\mathcal{E}_{3}$ in DRL, and find that the approximation errors result from the heteroskedasticity of quantile estimates, especially tail estimates.

  • We borrow the idea from the Cornish-Fisher Expansion (Cornish & Fisher, 1938), and propose a statistically robust DRL algorithm, called QEMRL, to reduce the variance of the estimated Q-function.

  • We show that QEMRL achieves higher stability and a faster convergence rate from both theoretical and empirical perspectives.

2 Background

2.1 Reinforcement Learning

Consider a finite Markov Decision Process (MDP) $(\mathcal{X},\mathcal{A},P,\gamma,\mathcal{R})$, with a finite set of states $\mathcal{X}$, a finite set of actions $\mathcal{A}$, a transition kernel $P:\mathcal{X}\times\mathcal{A}\rightarrow\mathscr{P}(\mathcal{X})$, a discount factor $\gamma\in[0,1)$, and a bounded reward function $\mathcal{R}:\mathcal{X}\times\mathcal{A}\rightarrow\mathscr{P}([-R_{\max},R_{\max}])$. At each timestep, an agent observes a state $X_{t}\in\mathcal{X}$, takes an action $A_{t}\in\mathcal{A}$, transitions to the next state $X_{t+1}\sim P(\cdot\mid X_{t},A_{t})$, and receives a reward $R_{t}\sim\mathcal{R}(X_{t},A_{t})$. The state-action value function $Q^{\pi}:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}$ of a policy $\pi:\mathcal{X}\rightarrow\mathscr{P}(\mathcal{A})$ is the expected discounted sum of rewards obtained by starting from $x$, taking action $a$, and following policy $\pi$ thereafter. Here $\mathscr{P}(\mathcal{X})$ denotes the set of probability distributions on a space $\mathcal{X}$.

The classic Bellman equation (Bellman, 1966) relates the expected return at each state-action pair $(x,a)$ to the expected returns at possible next states by:

$$Q^{\pi}(x,a)=\mathbb{E}_{\pi}\left[R_{0}+\gamma Q^{\pi}\left(X_{1},A_{1}\right)\mid X_{0}=x,A_{0}=a\right]. \qquad (2)$$

In the control setting, Q-Learning (Watkins, 1989) obtains the optimal policy $\pi^{*}$ by finding the unique fixed point $Q^{*}=Q^{\pi^{*}}$ of the Bellman optimality equation:

$$Q^{*}(x,a)=\mathbb{E}\left[R_{0}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{*}\left(X_{1},a^{\prime}\right)\mid X_{0}=x,A_{0}=a\right].$$

2.2 Distributional Reinforcement Learning

Instead of directly estimating the expectation $Q^{\pi}(x,a)$, DRL estimates the distribution of the sum of discounted rewards $\eta_{\pi}(x,a)=\mathcal{D}(\sum_{t=0}^{\infty}\gamma^{t}R_{t}\mid X_{0}=x,A_{0}=a)$ to fully capture the intrinsic randomness, where $\mathcal{D}$ extracts the probability distribution of a random variable. In analogy with Equation 2, $\eta_{\pi}$ satisfies the distributional Bellman equation (Bellemare et al., 2017):

$$\eta_{\pi}(x,a)=\left(\mathcal{T}^{\pi}\eta_{\pi}\right)(x,a)=\mathbb{E}_{\pi}\left[(f_{\gamma,r})_{\#}\eta_{\pi}(X_{1},A_{1})\mid X_{0}=x,A_{0}=a\right],$$

where $f_{\gamma,r}:\mathbb{R}\to\mathbb{R}$ is defined by $f_{\gamma,r}(x)=r+\gamma x$, and $(f_{\gamma,r})_{\#}\eta$ is the pushforward measure of $\eta$ by $f_{\gamma,r}$. Note that $\eta_{\pi}$ is the fixed point of the distributional Bellman operator $\mathcal{T}^{\pi}:\mathscr{P}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}\to\mathscr{P}(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$, i.e., $\mathcal{T}^{\pi}\eta_{\pi}=\eta_{\pi}$.

In general, the return distribution is supported on a wide range of possible returns and its shape can be quite complex. Moreover, the transition dynamics are usually unknown in practice, so the full computation of the distributional Bellman operator is typically impossible or computationally infeasible. In the following subsections, we review two main categories of DRL algorithms relying on parametric approximations and projection operators.

2.2.1 Categorical distributional RL

Categorical distributional RL (CDRL, Bellemare et al., 2017) represents the return distribution $\eta$ in a categorical form $\eta(x,a)=\sum_{i=1}^{N}p_{i}(x,a)\delta_{z_{i}}$, where $\delta_{z}$ denotes the Dirac distribution at $z$, $z_{1}\leq z_{2}\leq\ldots\leq z_{N}$ are evenly spaced locations, and $\{p_{i}\}_{i=1}^{N}$ are the corresponding probabilities learned with the Bellman update,

$$\eta(x,a)\leftarrow\left(\Pi_{\mathcal{C}}\mathcal{T}^{\pi}\eta\right)(x,a),$$

where $\Pi_{\mathcal{C}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\{z_{1},z_{2},\ldots,z_{N}\})$ is a categorical projection operator which ensures that the return distribution is supported only on $\{z_{1},\ldots,z_{N}\}$. In practice, CDRL with $N=51$ has been shown to achieve significant improvement on Atari games.

2.2.2 Quantiled distributional RL

Quantiled distributional RL (QDRL, Dabney et al., 2018b) represents the return distribution with a mixture of Diracs $\eta(x,a)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_{i}(x,a)}$, where $\{\theta_{i}(x,a)\}_{i=1}^{N}$ are learnable parameters. The Bellman operator moves each atom location $\theta_{i}$ towards the $\tau_{i}$-th quantile of the target distribution $\eta^{\prime}(x,a):=\mathcal{T}^{\pi}\eta(x,a)$, where $\tau_{i}=\frac{2i-1}{2N}$. The corresponding Bellman update is:

$$\eta(x,a)\leftarrow\left(\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}\eta\right)(x,a),$$

where $\Pi_{\mathcal{W}_{1}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\mathbb{R})$ is a quantile projection operator defined by $\Pi_{\mathcal{W}_{1}}\mu=\frac{1}{N}\sum_{i=1}^{N}\delta_{F^{-1}_{\mu}(\tau_{i})}$, and $F_{\mu}$ is the cumulative distribution function (CDF) of $\mu$. $F_{\eta^{\prime}}^{-1}(\tau)$ can be characterized as the minimizer of the quantile regression loss, and the atom locations $\theta$ can be updated by minimizing the following loss function

$$\mathcal{L}_{QR}(\theta;\eta^{\prime},\tau)=\mathbb{E}_{Z\sim\eta^{\prime}}\left(\left[\tau\mathbf{1}_{Z>\theta}+\left(1-\tau\right)\mathbf{1}_{Z\leq\theta}\right]|Z-\theta|\right). \qquad (3)$$
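To make the update concrete, the following is a minimal NumPy sketch of a sample-based estimate of the quantile regression loss in Equation (3); the array names (`theta`, `target_samples`) and the normal target are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantile_regression_loss(theta, target_samples, taus):
    """Sample-based estimate of Eq. (3) for N atoms theta and their fractions taus.

    theta:          shape (N,)  current quantile estimates
    target_samples: shape (M,)  samples from the target distribution T^pi eta
    taus:           shape (N,)  quantile fractions, tau_i = (2i - 1) / (2N)
    """
    # Pairwise errors u_{ij} = z_j - theta_i
    u = target_samples[None, :] - theta[:, None]                   # (N, M)
    # Pinball (check) loss: tau * u for u > 0, (tau - 1) * u otherwise
    loss = np.where(u > 0, taus[:, None] * u, (taus[:, None] - 1.0) * u)
    return loss.mean()

# Example: evaluate the loss for N = 5 atoms against a standard normal target
N = 5
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
theta = np.zeros(N)
samples = np.random.default_rng(0).normal(size=1000)
print(quantile_regression_loss(theta, samples, taus))
```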

3 Error Analysis of Distributional RL

As mentioned in Section 1, the parametrization induced error $\mathcal{E}_{3}$ comes from the quantile representation and its projection operator, and can be eliminated as $N\rightarrow\infty$. However, as illustrated in Figure 1, the approximation errors $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ are unavoidable in practice, and a high variance $\mathrm{Var}(\mathcal{E}_{1}+\mathcal{E}_{2})$ may lead to unstable performance of DRL algorithms. Thus, in this section, we further study the three error terms $\mathcal{E}_{1}$, $\mathcal{E}_{2}$, and $\mathcal{E}_{3}$, and show why it is important to control them in practice.

3.1 Parametrization Induced Error

We first show the convergence of both the expectation and the variance of the distributional Bellman operator $\mathcal{T}^{\pi}$. Then, we take the parametric representation and its projection operator into consideration.

Proposition 3.1 (Sobel, 1982; Bellemare et al., 2017).

Suppose there are two value distributions $\nu_{1},\nu_{2}\in\mathscr{P}(\mathbb{R})$, and random variables $Z_{i}^{k+1}\sim\mathcal{T}^{\pi}\nu_{i}$, $Z_{i}^{k}\sim\nu_{i}$. Then, we have

$$\left\|\mathbb{E}Z_{1}^{k+1}-\mathbb{E}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma\left\|\mathbb{E}Z_{1}^{k}-\mathbb{E}Z_{2}^{k}\right\|_{\infty},\ \text{ and }$$
$$\left\|\mathrm{Var}Z_{1}^{k+1}-\mathrm{Var}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma^{2}\left\|\mathrm{Var}Z_{1}^{k}-\mathrm{Var}Z_{2}^{k}\right\|_{\infty}.$$

Based on the fact that $\mathcal{T}^{\pi}$ is a $\gamma$-contraction in the $\bar{d}_{p}$ metric (Bellemare et al., 2017), where $\bar{d}_{p}$ is the maximal form of the Wasserstein metric, Proposition 3.1 implies that $\mathcal{T}^{\pi}$ is a contraction for both the expectation and the variance: both converge exponentially to their true values by iteratively applying the distributional Bellman operator.

However, in practice, employing a parametric representation for the return distribution leaves a theory-practice gap, so that neither the expectation nor the variance converges to its true value. To better understand the bias in the Q-function approximation caused by the parametric representation, we introduce the concept of mean-preserving (this property has been thoroughly discussed in previous work; we adapt the definition from Section 5.11 of Bellemare et al. (2023)) to describe the relationship between the expectations of the original distribution and the projected distribution:

Definition 3.2 (Mean-preserving).

A representation $\mathscr{F}\subseteq\mathscr{P}(\mathbb{R})$ and its associated projection operator $\Pi_{\mathscr{F}}:\mathscr{P}(\mathbb{R})\to\mathscr{F}$, which maps the space of probability distributions to the desired representation, are mean-preserving if for any distribution $\nu\in\mathscr{P}(\mathbb{R})$, the expectation of $\Pi_{\mathscr{F}}\nu$ is the same as that of $\nu$.

For CDRL, a discussion of the mean-preserving property is given by Lyle et al. (2019) and Rowland et al. (2019). It can be shown that for any $\nu\in\mathscr{F}_{\mathcal{C}}$, where $\mathscr{F}_{\mathcal{C}}$ is an $N$-categorical representation, the projection $\Pi_{\mathcal{C}}$ preserves the distribution's expectation when its support is contained in the interval $[z_{1},z_{N}]$. However, practitioners usually employ a wide predefined interval for the return, which makes the projection operator typically overestimate the variance.

For QDRL, $\Pi_{\mathcal{W}_{1}}$ is not mean-preserving (Bellemare et al., 2023). Given any distribution $\nu\in\mathscr{F}_{\mathcal{W}_{1}}$, where $\mathscr{F}_{\mathcal{W}_{1}}$ is an $N$-quantile representation, there is in most cases no unique $N$-quantile distribution $\Pi_{\mathcal{W}_{1}}\nu$, as the projection operator $\Pi_{\mathcal{W}_{1}}$ is not a non-expansion in the 1-Wasserstein distance (see Appendix B for details). This means that the expectation, variance, and higher-order moments are not preserved. To make this concrete, a simple MDP example is used to illustrate the bias in the learned quantile estimates.

In Figure 2 (a), rewards $R_{1}$ and $R_{2}$ are randomly sampled from $\mathrm{Unif}(0,1)$ and $\mathrm{Unif}(1/N,1+1/N)$ at states $x_{1}$ and $x_{2}$ respectively, and no reward is received at $x_{0}$. Clearly, the true return distribution at state $x_{0}$ is the mixture $\frac{\gamma}{2}(R_{1}+R_{2})$, hence its $\frac{1}{2N}$-th quantile is $\frac{\gamma}{N}$. When using the QDRL algorithm with $N$ quantile estimates, the approximated return distributions are $\hat{\eta}(x_{1},a)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\frac{2i-1}{2N}}$ and $\hat{\eta}(x_{2},a)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\frac{2i+1}{2N}}$. In this case, the $\frac{1}{2N}$-th quantile of the approximated return distribution at state $x_{0}$ is $\frac{3\gamma}{2N}$, whereas the true value is $\frac{\gamma}{N}$. Moreover, for each $i=1,\ldots,N$, the $\frac{2i-1}{2N}$-th quantile estimate at state $x_{0}$ does not equal its true value.
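The non-mean-preserving behavior can also be checked numerically. The following snippet is our own small illustration (not the paper's MDP): it projects an Exponential(1) distribution onto $N$ quantile atoms at $\tau_{i}=\frac{2i-1}{2N}$ and compares the projected mean with the true mean.

```python
import numpy as np

# Check that Pi_W1 is not mean-preserving: the mean of the N-quantile projection
# of an Exponential(1) distribution differs from the true mean of 1.0.
for N in [5, 20, 100]:
    taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    atoms = -np.log(1.0 - taus)        # F^{-1}(tau) for Exponential(1)
    print(f"N = {N:4d}   projected mean = {atoms.mean():.4f}   true mean = 1.0000")
```

The heavy right tail is cut off at $\tau_{N}=\frac{2N-1}{2N}$, so the projected mean underestimates the true mean, with the gap shrinking as $N$ grows.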


Figure 2: (a) Example MDP with a single action, equal transition probabilities, an initial state $x_{0}$, and two terminal states $x_{1},x_{2}$ where rewards are drawn from uniform distributions. (b) 5-state MDP with two actions at the initial state $x_{0}$, deterministic transitions, and exponentially distributed rewards at the terminal states $x_{3},x_{4}$. (c) The true return distributions $\eta(x_{0},a_{1})$ and $\eta(x_{0},a_{2})$, and the expected returns estimated by QDRL and QEMRL.

The biased quantile estimates illustrated in Figure 2 (a) are caused by the use of the quantile representation and its projection operator $\Pi_{\mathcal{W}_{1}}$. This undesirable property in turn affects the QDRL update, as the combined operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ is in general not a non-expansion in $\bar{d}_{p}$ for $p\in[1,\infty)$ (Dabney et al., 2018b), which means that the learned quantile estimates may not converge to the true quantiles of the return distribution. (A recent study (Rowland et al., 2023a) proves that the QDRL update may have multiple fixed points, indicating that the quantiles may not converge to the truth. Despite this, Proposition 2 of Dabney et al. (2018b) shows that the projected Bellman operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ remains a contraction in $\bar{d}_{\infty}$, which implies that quantile convergence is guaranteed for all $p\in[1,\infty]$.) Moreover, since $\Pi_{\mathcal{W}_{1}}$ is not mean-preserving, iteratively applying the projected Bellman operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ during training inevitably introduces bias into the expectation of the return distribution, resulting in a deviation between the estimated Q-function and its true value. We now derive an upper bound to quantify this deviation, i.e., $\mathcal{E}_{3}$.

Theorem 3.3 (Parameterization induced error bound).

Let $\Pi_{\mathcal{W}_{1}}$ be the projection operator onto evenly spaced quantile fractions $\tau_{i}=\frac{2i-1}{2N}$ for $i=1,\ldots,N$, and let $\eta_{k}\in\mathscr{P}(\mathbb{R})$ be the return distribution at the $k$-th iteration. Let random variables $Z_{\theta}^{k}\sim\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}\eta_{k}$ and $Z^{k}\sim\mathcal{T}^{\pi}\eta_{k}$. Assume that the distribution of the immediate reward is supported on $[-R_{\max},R_{\max}]$. Then we have

$$\lim_{k\rightarrow\infty}\left\|\mathcal{E}^{k}_{3}\right\|_{\infty}=\lim_{k\rightarrow\infty}\left\|\mathbb{E}Z_{\theta}^{k}-\mathbb{E}Z^{k}\right\|_{\infty}\leq\frac{2R_{\max}}{N(1-\gamma)},$$

where $\mathcal{E}^{k}_{3}$ is the parametrization induced error at the $k$-th iteration.

Theorem 3.3 implies that the convergence of the expectation under the projected Bellman operator $\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}$ cannot be guaranteed once the quantile representation and its projection operator are applied. (Note that this bound has a limitation: it only considers the one-step effect of applying the projection operator $\Pi_{\mathcal{W}_{1}}$, and is therefore independent of the iteration number $k$. Proposition 4.1 of Rowland et al. (2023b) provides a more compelling bound that accounts for the cumulative effect of iteratively applying $\Pi_{\mathcal{W}_{1}}$.) Since the bound tends to zero as $N\rightarrow\infty$, it is reasonable to use a relatively large representation size $N$ to reduce $\mathcal{E}_{3}$ in practice.

3.2 Approximation Error

The other two types of errors, $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$, which determine the variance of the Q-function estimate, are accumulated during the training process as unseen state-action pairs keep being encountered. The target approximation error $\mathcal{E}_{1}$ affects action selection, while the Bellman operator approximation error $\mathcal{E}_{2}$ leads to accumulated error in the Q-function estimate, which can be amplified by the temporal difference updates (Sutton, 1988). The accumulated, highly uncertain errors of the Q-function estimate can cause certain states to be incorrectly valued, leading to suboptimal policies and potentially divergent behavior.

As depicted in Figure 2 (b), we use a toy example to illustrate how QDRL fails to learn an optimal policy due to the high variance of the approximation error. This 5-state MDP is originally introduced in Figure 7 of Rowland et al. (2019). In this case, $\eta(x_{0},a_{1})$ and $\eta(x_{0},a_{2})$ follow exponential distributions with expectations 1.2 and 1, respectively. We consider a tabular setting, which represents the approximated return distribution at each state-action pair separately. Figure 2 (c) shows that in policy evaluation, QDRL inaccurately approximates the Q-function: it underestimates the expectation of $\eta(x_{0},a_{1})$ and overestimates the other. This is caused by the poor capture of tail events, which results in high uncertainty in the Q-function estimate. Due to this high variance, QDRL fails to learn the optimal policy and chooses the non-optimal action $a_{2}$ at the initial state $x_{0}$. In contrast, our proposed algorithm, QEMRL, employs a statistically robust estimator of the Q-function to reduce its variance, relieves the underestimation and overestimation issues, and ultimately allows for more efficient policy learning.

Different from previous QDRL studies that focus on exploiting the distribution information to further improve the model performance, this work highlights the importance of controlling the variance of the approximation error to obtain a more accurate estimate of the Q-function. More discussion about this is given in the following section.

4 Quantiled Expansion Mean

This section introduces a novel variance reduction technique to estimate the Q-function. In traditional statistics, estimators with lower variance are considered to be more efficient. In RL, variance reduction is also an effective technique for achieving fast convergence in both policy-based and value-based RL algorithms, especially for large-scale tasks (Greensmith et al., 2004; Anschel et al., 2017). Motivated by these findings, we introduce QEM as an estimator that is more robust and has a lower variance than that of QDRL under the heteroskedasticity assumption. Furthermore, we demonstrate the potential benefits of QEM for the distribution approximation in DRL.

4.1 Heteroskedasticity of quantiles

In the context of quantile-based DRL, the Q-function is the integral of the quantile function. To approximate it, QDRL employs a simple empirical mean (EM) estimator $\frac{1}{N}\sum_{i}\hat{q}(\tau_{i})$, and it is natural to assume that the estimated quantile satisfies

$$\hat{q}(\tau)=q(\tau)+\varepsilon(\tau), \qquad (4)$$

where $\varepsilon(\tau)$ is a zero-mean error. In this case, considering the crossing issue and the biased tail estimates, we assume that the variance of $\varepsilon(\tau)$ is non-constant and depends on $\tau$, which is usually called heteroskedasticity in statistics.

For a direct understanding, we conduct a simple simulation on a Chain MDP to illustrate how QDRL can fail to fit the quantile function. As shown in Figure 3(b), QDRL fits well in the peak area but struggles at the bottom and the tail. Moreover, the non-monotonicity of the quantile estimates is more severe in the poorly fitted areas than elsewhere. As the deviations of the quantile estimates from the truth are significantly larger in the low-probability region and the tail, the heteroskedasticity assumption is plausible in this case. This phenomenon arises because samples near the bottom and the tail are less likely to be drawn. In real-world situations, multimodal distributions are commonly encountered, and the heteroskedasticity problem may result in imprecise distribution approximations and consequently poor Q-function approximations. In the next part, we discuss how to enhance the stability of the Q-function estimate.

4.2 Cornish-Fisher Expansion

It is well known that a quantile can be expressed by the Cornish-Fisher Expansion (CFE, Cornish & Fisher, 1938):

$$q(\tau)=\mu+\sigma x^{\prime}_{\tau}, \qquad (5)$$
$$x^{\prime}_{\tau}=z_{\tau}+(z^{2}_{\tau}-1)\frac{s}{6}+(z^{3}_{\tau}-3z_{\tau})\frac{k}{24}+\cdots,$$

where $z_{\tau}$ is the $\tau$-th quantile of the standard normal distribution, $\mu$ is the mean, $\sigma$ is the standard deviation, $s$ and $k$ are the skewness and kurtosis of the distribution of interest, and the remaining terms in the ellipsis involve higher-order moments (see Appendix C for more details). The CFE theoretically determines a distribution from its known moments and is widely used in financial studies. Recently, Zhang & Zhu (2023) employed the CFE to estimate higher-order moments of financial time series data, which are not directly observable. Our method utilizes a truncated version of the CFE framework and employs a linear regression model to construct efficient estimators of the distribution moments based on known quantiles. We then apply this approach within the context of quantile-based DRL.
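As a quick illustration, the sketch below evaluates the truncated expansion in Equation (5), keeping only the skewness and kurtosis terms shown above; the moment values are made up for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def cfe_quantile(tau, mu, sigma, skew, kurt):
    """Truncated Cornish-Fisher quantile (Eq. 5), keeping the terms shown above."""
    z = norm.ppf(tau)                                   # z_tau of the standard normal
    x = z + (z**2 - 1) * skew / 6 + (z**3 - 3 * z) * kurt / 24
    return mu + sigma * x

# Illustrative moments: a mildly right-skewed return distribution
taus = (2 * np.arange(1, 11) - 1) / 20                  # tau_i = (2i - 1) / (2N), N = 10
print(cfe_quantile(taus, mu=1.0, sigma=0.5, skew=0.8, kurt=0.5))
```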

To be more specific, we plug the estimate $\hat{q}(\tau)$ of the $\tau$-th quantile into Equation 5 and expand it to the first order:

$$\hat{q}(\tau)=m_{1}+\omega_{1}(\tau)+\varepsilon(\tau), \qquad (6)$$

where $m_{1}$ is the mean (i.e., the first moment) of the return distribution, that is, the Q-function, and $\omega_{1}(\tau)$ is the remaining term associated with the higher-order ($>1$) moments. If $\omega_{1}(\tau)$ is negligible, $m_{1}$ can be estimated by averaging the $N$ quantile estimates, as in QDRL.

When the estimated quantile is expanded to the second order, we have the following representation:

$$\hat{q}(\tau)=m_{1}+z_{\tau}\sqrt{m_{2}}+\sqrt{m_{2}}\,\omega_{2}(\tau)+\varepsilon(\tau), \qquad (7)$$

where $\omega_{2}(\tau)$ is the remaining term associated with the higher-order ($>2$) moments. Assuming that $\omega_{2}(\tau)$ is negligible, we can derive a regression model by plugging in the $N$ quantile estimates, such that

$$\begin{pmatrix}\hat{q}(\tau_{1})\\ \hat{q}(\tau_{2})\\ \vdots\\ \hat{q}(\tau_{N})\end{pmatrix}=\begin{pmatrix}1&z_{\tau_{1}}\\ 1&z_{\tau_{2}}\\ \vdots&\vdots\\ 1&z_{\tau_{N}}\end{pmatrix}\begin{pmatrix}m_{1}\\ \sqrt{m_{2}}\end{pmatrix}+\begin{pmatrix}\varepsilon(\tau_{1})\\ \varepsilon(\tau_{2})\\ \vdots\\ \varepsilon(\tau_{N})\end{pmatrix}. \qquad (22)$$

The higher-order expansions can be conducted in the same manner. Note that the remaining term is omitted for constructing a regression model, and a more in-depth analysis of the remaining term is available in Section C.2.

For notational simplicity, we rewrite (22) in matrix form,

$$\boldsymbol{\hat{Q}}=\mathbf{X}_{2}\boldsymbol{M}_{2}+\mathcal{E}, \qquad (23)$$

where $\boldsymbol{\hat{Q}}\in\mathbb{R}^{N}$ is the vector of estimated quantiles, $\mathbf{X}_{2}\in\mathbb{R}^{N\times 2}$ and $\boldsymbol{M}_{2}\in\mathbb{R}^{2}$ are the design matrix and the moment vector, respectively, and $\mathcal{E}$ is the vector of error terms.
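The sketch below illustrates the regression view in (22)-(23) for the second-order expansion: stack the $N$ quantile estimates, build the design matrix $\mathbf{X}_{2}=[\mathbf{1},z_{\tau}]$, and recover the moments by least squares. The "estimated" quantiles are simulated here, as an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm

# Regression view of Eqs. (22)-(23): quantile estimates regressed on [1, z_tau].
N = 32
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)

rng = np.random.default_rng(0)
q_hat = 1.0 + 0.5 * z + rng.normal(scale=0.05, size=N)   # noisy quantiles of N(1, 0.25)

X2 = np.column_stack([np.ones(N), z])                    # design matrix of Eq. (22)
M2, *_ = np.linalg.lstsq(X2, q_hat, rcond=None)          # OLS estimate of (m_1, sqrt(m_2))
print("m1 (Q-value) estimate:", M2[0], "  sqrt(m2) estimate:", M2[1])
```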


Figure 3: (a) Chain MDP with six states, one action, $\gamma=0.99$, and a Gaussian-mixture reward distribution at the terminal state $x_{5}$. (b) True quantile function (top) and the QDRL quantile function at state $x_{0}$ after 10K training steps; scatter plot (bottom) of the approximated quantiles during training.

For the bivariate regression model (23), the traditional ordinary least squares (OLS) method can be used to estimate $\boldsymbol{M}_{2}=(m_{1},\sqrt{m_{2}})^{\prime}$ when the variances of the errors are invariant across quantile locations, which is known as the homoscedasticity assumption. The resulting estimator $\hat{m}_{1}$ is referred to as the Quantiled Expansion Mean (QEM) in this work. However, since the homoscedasticity assumption required by OLS is always violated in real cases, we consider using the weighted least squares (WLS) method instead. Under the normality assumption, the following result shows that the WLS estimator $\hat{m}_{1}$ has a lower variance than the direct empirical mean.

Lemma 4.1.

Consider the linear regression model $\boldsymbol{\hat{Q}}=\mathbf{X}_{2}\boldsymbol{M}_{2}+\mathcal{E}$, where $\mathcal{E}$ follows $\mathcal{N}(\mathbf{0},\sigma^{2}V)$ with $V=\mathrm{diag}(v_{1},v_{2},\cdots,v_{N})$, $v_{i}\geq 1$ for $i=1,\cdots,N$, and we set the noise variance $\sigma^{2}=1$ without loss of generality. The WLS estimator is

$$\widehat{\boldsymbol{M}}_{2}=(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2})^{-1}\mathbf{X}^{\top}_{2}V^{-1}\boldsymbol{\hat{Q}}, \qquad (24)$$

and the QEM estimator $\hat{m}_{1}$ is the first component of $\widehat{\boldsymbol{M}}_{2}$.

Remark: Note that it is impossible to determine the weight matrix $V$ for each state-action pair in practice. Hence, we focus on capturing the relatively high variance in the tails, specifically in the range $\tau\in(0,0.1]\cup[0.9,1)$. To achieve this, we use a constant $v_{i}$, set to a value greater than 1 in the tails and equal to 1 elsewhere. $v_{i}$ is treated as a hyperparameter to be tuned in practice (see Appendix E).
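A minimal sketch of Equation (24) with the constant tail-weighting scheme described in the Remark is given below; the weight value 2.0 and the simulated quantile estimates are illustrative assumptions, not tuned values from the paper.

```python
import numpy as np
from scipy.stats import norm

def qem_estimate(q_hat, taus, tail_weight=2.0):
    """WLS estimate of (m_1, sqrt(m_2)) as in Eq. (24).

    Following the Remark, V is diagonal with v_i = tail_weight for
    tau in (0, 0.1] U [0.9, 1) and v_i = 1 elsewhere; tail_weight is a
    tunable hyperparameter (2.0 here is only illustrative).
    """
    z = norm.ppf(taus)
    X2 = np.column_stack([np.ones_like(z), z])
    v = np.where((taus <= 0.1) | (taus >= 0.9), tail_weight, 1.0)
    W = np.diag(1.0 / v)                                  # V^{-1}
    M2 = np.linalg.solve(X2.T @ W @ X2, X2.T @ W @ q_hat)
    return M2[0], M2[1]                                   # (QEM m_1, sqrt(m_2))

# Usage on simulated quantile estimates of a N(1, 0.25) return distribution
N = 32
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
q_hat = 1.0 + 0.5 * norm.ppf(taus) + np.random.default_rng(0).normal(scale=0.05, size=N)
print(qem_estimate(q_hat, taus))
```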

With Lemma 4.1, the reduction of variance is guaranteed by the following Proposition 4.2. Throughout the training process, heteroskedasticity is inevitable, and thus the QEM estimator always exhibits a lower variance than the standard EM estimator $\hat{m}^{*}_{1}=\frac{1}{N}\sum_{i=1}^{N}\hat{q}(\tau_{i})$.

Proposition 4.2.

Suppose the noise $\varepsilon_{i}$ independently follows $\mathcal{N}(0,v_{i})$ with $v_{i}\geq 1$ for $i=1,\cdots,N$. Then,

(i) In the homoskedastic case where $v_{i}=1$ for $i=1,\dots,N$, the empirical mean estimator $\hat{m}_{1}^{*}$ has a lower variance, i.e., $\mathrm{Var}(\hat{m}_{1}^{*})<\mathrm{Var}(\hat{m}_{1})$;

(ii) In the heteroskedastic case where the $v_{i}$'s are not equal, the QEM estimator $\hat{m}_{1}$ achieves a lower variance, i.e., $\mathrm{Var}(\hat{m}_{1})<\mathrm{Var}(\hat{m}_{1}^{*})$, if and only if $\bar{v}^{2}-1-1\big/\big(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\big)>0$, where $\bar{v}=\frac{1}{N}\sum_{i}v_{i}$. This inequality holds when $z_{\tau_{i}}=-z_{\tau_{N-i}}$, which is guaranteed in QDRL.
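A quick Monte Carlo check of case (ii) is sketched below, using our own made-up noise variances (larger in the tails); under such heteroskedastic noise, the WLS-based QEM estimate of $m_{1}$ shows a lower empirical variance than the plain empirical mean of the quantile estimates.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo comparison of Var(EM) and Var(QEM) under heteroskedastic quantile noise.
N, trials = 32, 20000
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)
v = np.where((taus <= 0.1) | (taus >= 0.9), 4.0, 1.0)    # noise variances, larger in the tails

rng = np.random.default_rng(1)
true_q = 1.0 + 0.5 * z                                    # true quantiles of N(1, 0.25)
q_hat = true_q + rng.normal(size=(trials, N)) * np.sqrt(v)

em = q_hat.mean(axis=1)                                   # empirical mean estimator
X2 = np.column_stack([np.ones(N), z])
W = np.diag(1.0 / v)
H = np.linalg.solve(X2.T @ W @ X2, X2.T @ W)              # rows of the WLS estimator
qem = q_hat @ H[0]                                        # first component = m_1 estimate
print("Var(EM) =", em.var(), "  Var(QEM) =", qem.var())
```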

We also explore the potential benefits of the variance reduction technique QEM in improving approximation accuracy. A Q-function estimate with higher variance can lead to noisy policy gradients in policy-based algorithms (Fujimoto et al., 2018) and prevent the selection of optimal actions in value-based algorithms (Anschel et al., 2017). These issues can slow down the learning process and negatively impact algorithm performance. With the following theorem, we show that QEM can reduce the variance and thus improve approximation performance.

Theorem 4.3.

Let $\hat{\pi}$ be the learned policy and $\pi_{opt}$ the optimal policy, and define $\alpha=\max_{x^{\prime}}D_{TV}(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime}))$ and $n(x,a)=|\mathcal{D}|$. For all $\delta\in\mathbb{R}$, with probability at least $1-\delta$, for any $\eta(x,a)\in\mathscr{P}(\mathbb{R})$ and all $(x,a)\in\mathcal{D}$,

$$\left\|F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}\right\|_{\infty}\leq 2\alpha+\sqrt{\frac{1+4|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$$

Theorem 4.3 indicates that a tighter concentration bound can be obtained with a smaller $\alpha$. The decrease in $\alpha$ can be attributed to the benefits of QEM. Specifically, QEM helps to decrease the perturbations on the Q-function and reduce the variance of the policy gradients, which allows for faster convergence of the policy training and a more accurate distribution approximation. To conclude, QEM relieves error accumulation within the Q-function update, improves estimation accuracy, reduces the risk of underestimation and overestimation, and thus ultimately enhances the stability of the whole training process.

Algorithm 1 QEMRL update algorithm
1:  Require: Quantile estimates $\hat{q}_{i}(x,a)$ for each $(x,a)$
2:  Collect sample $(x,a,r,x^{\prime})$
3:   # Compute distributional Bellman target
4:  Compute $Q(x^{\prime},a)$ using Equation 24
5:  if policy evaluation then
6:     $a^{*}\sim\pi(\cdot|x^{\prime})$
7:  else if Q-Learning then
8:     $a^{*}\leftarrow\arg\max_{a}Q(x^{\prime},a)$
9:  end if
10:  Scale samples $\hat{q}^{*}_{i}(x^{\prime},a^{*})\leftarrow r+\gamma\hat{q}_{i}(x^{\prime},a^{*})$, $\forall i$.
11:   # Compute quantile loss
12:  Update the estimated quantiles $\hat{q}_{i}(x,a)$ by computing, for each $i=1,\ldots,N$, the gradients $\nabla_{\hat{q}_{i}(x,a)}\sum_{i=1}^{N}\mathcal{L}_{QR}(\hat{q}_{i}(x,a);\frac{1}{N}\sum_{j=1}^{N}\delta_{\hat{q}^{*}_{j}(x^{\prime},a^{*})},\tau_{i})$.
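The following is a minimal tabular sketch of Algorithm 1 for the Q-learning variant, assuming a table of quantile atoms per state-action pair; the learning rate, discount factor, and tail weight are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.stats import norm

def qem(q_atoms, taus, tail_weight=2.0):
    """Line 4: QEM estimate of the Q-value from the quantile atoms (Eq. 24)."""
    z = norm.ppf(taus)
    X2 = np.column_stack([np.ones_like(z), z])
    w = 1.0 / np.where((taus <= 0.1) | (taus >= 0.9), tail_weight, 1.0)
    m = np.linalg.solve((X2 * w[:, None]).T @ X2, (X2 * w[:, None]).T @ q_atoms)
    return m[0]

def qemrl_update(q, x, a, r, x_next, taus, gamma=0.99, lr=0.05):
    """q: array of shape (num_states, num_actions, N) of quantile estimates."""
    # Lines 4 and 8: greedy action under the QEM Q-value estimate
    a_star = int(np.argmax([qem(q[x_next, b], taus) for b in range(q.shape[1])]))
    # Line 10: scaled target samples r + gamma * q_i(x', a*)
    target = r + gamma * q[x_next, a_star]
    # Line 12: quantile-regression (pinball) gradient step for each atom
    u = target[None, :] - q[x, a][:, None]                 # (N, N) pairwise errors
    grad = -(taus[:, None] - (u < 0)).mean(axis=1)         # d/d theta_i of the QR loss
    q[x, a] -= lr * grad

# Toy usage: 3 states, 2 actions, N = 16 quantile atoms
N = 16
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
q_table = np.zeros((3, 2, N))
qemrl_update(q_table, x=0, a=1, r=1.0, x_next=2, taus=taus)
```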

5 Experimental Results

In this section, we conduct empirical studies to demonstrate the advantages of our QEMRL method. First, a simple tabular experiment is conducted to validate some of the theoretical results presented in Sections 3 and 4. Then we apply the proposed QEMRL update strategy in Algorithm 1 to both DQN-style and SAC-style DRL algorithms, which are evaluated on the Atari and MuJoCo environments. The detailed architectures of these methods and the hyperparameter selections can be found in Appendix D, and additional experimental results are included in Appendix E.

In this work, we implement QEM using a 4-th order expansion that includes the mean, variance, skewness, and kurtosis. The effects of a higher-order expansion on model estimation are discussed in Section C.1. Intuitively, including more terms in the expansion improves the estimation accuracy of the quantiles, but the overfitting risk and the computational cost also increase, so there is a trade-off between explainability and learning efficiency. We evaluate different expansion orders using the $R^{2}$ statistic, which measures the goodness of model fit. The simulation results (Figure 9) show that a 4-th order expansion appears to be the optimal choice, while higher-order ($>4$) expansions do not show a significant increase in $R^{2}$.

5.1 A Tabular Example

FrozenLake (Brockman et al., 2016) is a classic benchmark problem for Q-learning control with high stochasticity and sparse rewards, in which an agent controls the movement of a character in an $n\times n$ grid world. As shown in Figure 4 for a FrozenLake-$4\times 4$ task, "S" is the starting point, "H" is a hole that terminates the game, and "G" is the goal state with a reward of 1. The blue grids represent the frozen surface, where the agent can slide to adjacent grids with some underlying unknown probabilities when taking a certain movement direction. The reward received by the agent is always zero unless the goal state is reached.

We first approximate the return distribution under the optimal policy $\pi^{*}$, which can be obtained using the value iteration approach. Specifically, we start from the "S" state and perform 1K Monte-Carlo (MC) rollouts; an empirical distribution is obtained by summarizing all the recorded trajectories. With this approximation of the distribution, we can draw the curve of quantile estimates shown in Figure 4 (b). Both QEMRL and QDRL were run for 150K training steps, and the $\epsilon$-greedy exploration strategy is applied in the first 1K steps. For both methods, we set the total number of quantiles to $N=128$.

Figure 4: (a) The optimal direction of movement at each grid. (b) Quantile estimates by MC, QDRL, and QEMRL at the start state. (c) Approximation errors of Q-function estimate and distribution approximation error of QEMRL and QDRL (results are averaged over 10 random seeds).

Although both QEMRL and QDRL can eventually find the optimal movement at the start state, their approximations of the return distribution are quite different. Figure 4 (c) visualizes the approximation errors of the Q-function and the distribution for QEMRL and QDRL with respect to the number of training steps. The Q-function estimates of QEMRL converge correctly on average, whereas the estimates of QDRL do not converge exactly to the truth. A similar pattern is found for the distribution approximation error. Besides, the variance reduction achieved by QEM is reflected in the fact that the curves of QEMRL are more stable and decline faster. In Figure 4 (b), we show that the distribution at the start state estimated by QEMRL is eventually closer to the ground truth.

5.2 Evaluation on MuJoCo and Atari 2600

We conduct experiments on the MuJoCo benchmark to further verify the analysis in Section 4. Our implementation is based on the Distributional Soft Actor-Critic (DSAC, Ma et al., 2020) algorithm, which is a distributional version of SAC. Figure 5 demonstrates that both DSAC and QEM-DSAC significantly outperform the baseline SAC. Between the two, QEM-DSAC performs better than DSAC and its learning curves are more stable, which demonstrates that QEM-DSAC achieves higher sample efficiency.

Figure 5: Learning curves of SAC, DSAC, and QEM-DSAC across six MuJoCo games. Each curve is averaged over 5 random seeds and shaded by their confidence intervals.

We also compare QEM-DQN with the baseline method QR-DQN on the Atari 2600 platform. Figure 8 plots the results of these two algorithms in six Atari games. Particularly at the early training stage, QEM-DQN exhibits a significant gain in sample efficiency, resulting in faster convergence and better performance.

Extension to IQN. Some great efforts have been made by the community of DRL to more precisely parameterize the entire distribution with a limited number of quantile locations. One notable example is the introduction of Implicit Quantile Networks (IQN, Dabney et al., 2018a), which tries to recover the continuous map of the entire quantile curve by sampling a different set of quantile values from a uniform distribution Unif(0,1)\mathrm{Unif}(0,1) each time.

Our method can also be applied to IQN, as IQN likewise uses the EM approach to estimate the Q-function. Note that the design matrix $\mathbf{X}$ must be updated after re-sampling the quantile fractions at each training step. Moreover, the sufficient condition $z_{\tau_{i}}=-z_{\tau_{N-i}}$ that ensures the variance reduction does not hold in the IQN case, as the $\tau$'s are sampled from a uniform distribution. However, according to the simulation results in the appendix (Table 4), the variance reduction still holds in practice. In this setting, all baseline methods are modified to their IQN versions. As Figures 6 and 7 demonstrate, QEM achieves a performance gain in most scenarios and the convergence speed is slightly increased.

Figure 6: Learning curves of SAC, DSAC (IQN), and QEM-DSAC (IQN) across six MuJoCo games. Each curve is averaged over 5 random seeds and shaded by their confidence intervals.
Figure 7: Learning curves of IQN and IQEM-DQN across six Atari games. Each curve is averaged over 3 random seeds and shaded by their confidence intervals.
Figure 8: Learning curves (top and middle) of QR-DQN and QEM-DQN across six Atari games. Learning curves (bottom) of QR-DQN and QEM-DQN with exploration across three games.

5.3 Exploration

Since QEM also provides an estimate of the variance, we consider using it to develop an efficient exploration strategy. In a recent study, to more fully utilize the distribution information, Mavrin et al. (2019) propose a novel exploration strategy, Decaying Left Truncated Variance (DLTV), which uses the left truncated variance of the estimated distribution as a bonus term to encourage exploration of unknown states. The optimal action $a^{*}$ at state $x$ is selected according to $a^{*}=\arg\max_{a^{\prime}}\left(Q\left(x,a^{\prime}\right)+c_{t}\sqrt{\sigma^{2}_{+}}\right)$, where $c_{t}$ is a decay factor that suppresses the intrinsic uncertainty, and $\sigma^{2}_{+}$ denotes the variance estimate. Although DLTV is effective, the validity of the computed truncation lacks a theoretical guarantee. In this work, we follow the idea of DLTV and examine model performance using either the variance estimate obtained by QEM or the original DLTV estimate in several hard-exploration games. As Figure 8 shows, using QEM significantly improves exploration efficiency compared to QR-DQN+DLTV, since QEM enhances the accuracy of the quantile estimates and thus the accuracy of the distribution variance.
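A sketch of this DLTV-style action selection using the QEM regression to supply both the Q-value ($m_{1}$) and a variance estimate ($m_{2}$) is given below; the decay schedule $c_{t}$ and the simplified (non-truncated) variance bonus are assumptions for illustration, not the exact choices used in the paper.

```python
import numpy as np
from scipy.stats import norm

def qem_mean_var(q_atoms, taus, tail_weight=2.0):
    """QEM regression returning both m_1 (Q-value) and m_2 (variance) estimates."""
    z = norm.ppf(taus)
    X2 = np.column_stack([np.ones_like(z), z])
    w = 1.0 / np.where((taus <= 0.1) | (taus >= 0.9), tail_weight, 1.0)
    m1, sqrt_m2 = np.linalg.solve((X2 * w[:, None]).T @ X2, (X2 * w[:, None]).T @ q_atoms)
    return m1, sqrt_m2 ** 2

def select_action(q, x, taus, t, c=10.0):
    """q: (num_states, num_actions, N) quantile table; t: current timestep."""
    c_t = c * np.sqrt(np.log(t + 1) / (t + 1))            # decaying bonus coefficient
    scores = []
    for a in range(q.shape[1]):
        m1, var = qem_mean_var(q[x, a], taus)
        scores.append(m1 + c_t * np.sqrt(max(var, 0.0)))  # optimistic score per action
    return int(np.argmax(scores))
```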

6 Conclusion and Discussion

In this work, we systematically study the three error terms associated with the Q-function estimate and propose a novel DRL algorithm QEMRL, which can be applied to any quantile-based DRL algorithm regardless of whether the quantile locations are fixed or not. We found that a more robust estimate of the Q-function can improve the distribution approximation and speed up the algorithm convergence. We can also utilize the more precise estimate of the distribution variance to optimize the existing exploration strategy.

Finally, there are some open questions that we would like to discuss further here.

Improving the estimation of the weight matrix $V$. The challenge of estimating the weight matrix $V$ was recognized from the outset, since the exact value of $V$ is unlikely to be known in practice. In this work, we treat $V$ as a predefined value that can be tuned, taking into account the computational cost of estimating it across all state-action pairs and time steps. For future work, we believe a robust and easy-to-implement estimation of the weight matrix $V$ is necessary. Given that the variance of the quantile estimation errors varies with state-action pairs and algorithm iterations, we consider two approaches for future investigation. The first approach uses a decaying value of $v_{i}$ instead of a constant: the variance of poorly estimated quantiles tends to decrease gradually as the number of training samples increases, which motivates decreasing $v_{i}$ as training proceeds. The second approach assigns different values of $v_{i}$ to different state-action pairs. Ideas from the exploration literature, specifically count-based methods (Ostrovski et al., 2017), can be borrowed to measure the novelty of state-action pairs; familiar state-action pairs would be assigned a smaller $v_{i}$, while unfamiliar pairs would be assigned a larger $v_{i}$.

Statistical variance reduction. Our variance reduction method is based on a statistical modeling perspective, and its core insight is that performance can be improved through more careful use of the quantiles when constructing a Q-function estimator. While ensembling methods, commonly used in existing works (Osband et al., 2016; Anschel et al., 2017), can be directly applied to DRL to reduce the uncertainty of the Q-function estimator, they undoubtedly increase model complexity. In this work, we transform the Q-value estimation into a linear regression problem in which the Q-value is a coefficient of the regression model. In this way, we can leverage the weighted least squares (WLS) method to effectively capture the heteroskedasticity of the quantiles and obtain a more efficient and robust Q-function estimator.

Acknowledgements

We thank anonymous reviewers for valuable and constructive feedback on an early version of this manuscript. This work is supported by National Social Science Foundation of China (Grant No.22BTJ031 ) and Postgraduate Innovation Foundation of SUFE. Dr. Fan Zhou’s work is supported by National Natural Science Foundation of China (12001356), Shanghai Sailing Program (20YF1412300), “Chenguang Program” supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission, Open Research Projects of Zhejiang Lab (NO.2022RC0AB06), Shanghai Research Center for Data Science and Decision Technology, Innovative Research Team of Shanghai University of Finance and Economics.

References

  • Anschel et al. (2017) Anschel, O., Baram, N., and Shimkin, N. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185. PMLR, 2017.
  • Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458. PMLR, 2017.
  • Bellemare et al. (2023) Bellemare, M. G., Dabney, W., and Rowland, M. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  • Bellman (1966) Bellman, R. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cornish & Fisher (1938) Cornish, E. A. and Fisher, R. A. Moments and cumulants in the specification of distributions. Revue de l’Institut international de Statistique, pp. 307–320, 1938.
  • Dabney et al. (2018a) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pp. 1096–1105, 2018a.
  • Dabney et al. (2018b) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018b.
  • Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018.
  • Greensmith et al. (2004) Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.
  • Hsu et al. (2011) Hsu, D., Kakade, S. M., and Zhang, T. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363, 2011.
  • Koenker (2005) Koenker. Quantile regression. Cambridge University Press, 2005.
  • Kuznetsov et al. (2020) Kuznetsov, A., Shvechikov, P., Grishin, A., and Vetrov, D. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, pp. 5556–5566. PMLR, 2020.
  • Luo et al. (2021) Luo, Y., Liu, G., Duan, H., Schulte, O., and Poupart, P. Distributional reinforcement learning with monotonic splines. In International Conference on Learning Representations, 2021.
  • Lyle et al. (2019) Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  4504–4511, 2019.
  • Ma et al. (2020) Ma, X., Xia, L., Zhou, Z., Yang, J., and Zhao, Q. Dsac: distributional soft actor critic for risk-sensitive reinforcement learning. arXiv preprint arXiv:2004.14547, 2020.
  • Mavrin et al. (2019) Mavrin, B., Yao, H., Kong, L., Wu, K., and Yu, Y. Distributional reinforcement learning for efficient exploration. In International Conference on Machine Learning, pp. 4424–4434, 2019.
  • Nguyen-Tang et al. (2021) Nguyen-Tang, T., Gupta, S., and Venkatesh, S. Distributional reinforcement learning via moment matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  9144–9152, 2021.
  • Osband et al. (2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. Advances in Neural Information Processing Systems, 29, 2016.
  • Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., Oord, A., and Munos, R. Count-based exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730. PMLR, 2017.
  • Rowland et al. (2018) Rowland, M., Bellemare, M., Dabney, W., Munos, R., and Teh, Y. W. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  29–37. PMLR, 2018.
  • Rowland et al. (2019) Rowland, M., Dadashi, R., Kumar, S., Munos, R., Bellemare, M. G., and Dabney, W. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning, pp. 5528–5536. PMLR, 2019.
  • Rowland et al. (2023a) Rowland, M., Munos, R., Azar, M. G., Tang, Y., Ostrovski, G., Harutyunyan, A., Tuyls, K., Bellemare, M. G., and Dabney, W. An analysis of quantile temporal-difference learning. arXiv preprint arXiv:2301.04462, 2023a.
  • Rowland et al. (2023b) Rowland, M., Tang, Y., Lyle, C., Munos, R., Bellemare, M. G., and Dabney, W. The statistical benefits of quantile temporal-difference learning for value estimation. arXiv preprint arXiv:2305.18388, 2023b.
  • Sobel (1982) Sobel, M. J. The variance of discounted markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • Villani (2009) Villani, C. Optimal transport: old and new, volume 338. Springer, 2009.
  • Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, 1989.
  • Yang et al. (2019) Yang, D., Zhao, L., Lin, Z., Qin, T., Bian, J., and Liu, T.-Y. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems, pp. 6193–6202, 2019.
  • Zhang & Zhu (2023) Zhang, N. and Zhu, K. Quantiled conditional variance, skewness, and kurtosis by cornish-fisher expansion. arXiv preprint arXiv:2302.06799, 2023.
  • Zhou et al. (2020) Zhou, F., Wang, J., and Feng, X. Non-crossing quantile regression for distributional reinforcement learning. Advances in Neural Information Processing Systems, 33:15909–15919, 2020.
  • Zhou et al. (2021) Zhou, F., Zhu, Z., Kuang, Q., and Zhang, L. Non-decreasing quantile function network with efficient exploration for distributional reinforcement learning. International Joint Conference on Artificial Intelligence, pp.  3455–3461, 2021.

Appendix A Projection Operator

A.1 Categorical projection operator

The CDRL algorithm uses a categorical projection operator $\Pi_{\mathcal{C}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\{z_{1},\ldots,z_{N}\})$ to restrict approximated distributions to the parametric family $\mathscr{F}_{\mathcal{C}}:=\{\sum_{i=1}^{N}p_{i}\delta_{z_{i}}\mid\sum_{i=1}^{N}p_{i}=1,p_{i}\geq 0\}\subseteq\mathscr{P}(\mathbb{R})$, where $z_{1}<\cdots<z_{N}$ are evenly spaced, fixed supports. The operator $\Pi_{\mathcal{C}}$ is defined for a single Dirac delta as

$$\Pi_{\mathcal{C}}\left(\delta_{w}\right)=\begin{cases}\delta_{z_{1}}&w\leq z_{1}\\ \frac{w-z_{i+1}}{z_{i}-z_{i+1}}\delta_{z_{i}}+\frac{z_{i}-w}{z_{i}-z_{i+1}}\delta_{z_{i+1}}&z_{i}\leq w\leq z_{i+1}\\ \delta_{z_{N}}&w\geq z_{N}.\end{cases}$$

A.2 Quantile projection operator

The QDRL algorithm uses a quantile projection operator $\Pi_{\mathcal{W}_{1}}:\mathscr{P}(\mathbb{R})\to\mathscr{P}(\mathbb{R})$ to restrict approximated distributions to the parametric family $\mathscr{F}_{\mathcal{W}_{1}}:=\{\frac{1}{N}\sum_{i=1}^{N}\delta_{z_{i}}\mid z_{1:N}\in\mathbb{R}^{N}\}\subseteq\mathscr{P}(\mathbb{R})$. The operator $\Pi_{\mathcal{W}_{1}}$ is defined as

$$\Pi_{\mathcal{W}_{1}}(\mu)=\frac{1}{N}\sum_{i=1}^{N}\delta_{F_{\mu}^{-1}\left(\tau_{i}\right)},$$

where $\tau_{i}=\frac{2i-1}{2N}$, and $F_{\mu}$ is the CDF of $\mu$. The midpoint $\frac{2i-1}{2N}$ of the interval $[\frac{i-1}{N},\frac{i}{N}]$ minimizes the 1-Wasserstein distance $W_{1}(\mu,\Pi_{\mathcal{W}_{1}}\mu)$ between the distribution $\mu$ and its projection $\Pi_{\mathcal{W}_{1}}\mu$ (an $N$-quantile distribution with evenly spaced $\tau_{i}$), as demonstrated in Lemma 2 of Dabney et al. (2018b).
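For concreteness, the sketch below applies the quantile projection $\Pi_{\mathcal{W}_{1}}$ to a distribution represented here by an empirical sample; the exponential example and sample size are our own illustrative choices.

```python
import numpy as np

# Project a distribution (given by an empirical sample) onto N atoms located at
# its tau_i = (2i - 1) / (2N) quantiles, i.e., an approximation of Pi_W1.
def quantile_projection(samples, N):
    taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    return np.quantile(samples, taus)      # empirical estimate of F^{-1}(tau_i)

mu_samples = np.random.default_rng(0).exponential(scale=1.0, size=100_000)
atoms = quantile_projection(mu_samples, N=10)
print(atoms, "projected mean:", atoms.mean())   # note the shift from the true mean 1.0
```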

Appendix B Proofs

In this section, we provide the proofs of the theorems discussed in the main manuscript.

B.1 Proof of Section 3

Proposition B.1 (Sobel, 1982; Bellemare et al., 2017).

Suppose there are value distributions $\nu_{1},\nu_{2}\in\mathscr{P}(\mathbb{R})$, and random variables $Z_{i}^{k+1}\sim\mathcal{T}^{\pi}\nu_{i}$, $Z_{i}^{k}\sim\nu_{i}$. Then, we have

$$\left\|\mathbb{E}Z_{1}^{k+1}-\mathbb{E}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma\left\|\mathbb{E}Z_{1}^{k}-\mathbb{E}Z_{2}^{k}\right\|_{\infty},\ \text{ and }$$
$$\left\|\mathrm{Var}Z_{1}^{k+1}-\mathrm{Var}Z_{2}^{k+1}\right\|_{\infty}\leq\gamma^{2}\left\|\mathrm{Var}Z_{1}^{k}-\mathrm{Var}Z_{2}^{k}\right\|_{\infty}.$$
Proof.

This proof follows directly from Bellemare et al. (2017). The first statement can be proved using the commutation $\mathbb{E}\mathcal{T}^{\pi}=\mathcal{T}^{\pi}\mathbb{E}$. By the independence of $R$ and $P^{\pi}Z_{i}$, where $P^{\pi}$ is the transition operator, we have

$Z_{i}^{k+1}(x,a)\stackrel{D}{:=}R(x,a)+\gamma P^{\pi}Z_{i}^{k}(x,a),$
$\mathrm{Var}(Z_{i}^{k+1}(x,a))=\mathrm{Var}(R(x,a))+\gamma^{2}\mathrm{Var}\left(P^{\pi}Z_{i}^{k}(x,a)\right).$

Thus, we have

$\left\|\mathrm{Var}\,Z_{1}^{k+1}-\mathrm{Var}\,Z_{2}^{k+1}\right\|_{\infty}$
$=\sup_{x,a}\left|\mathrm{Var}\,Z_{1}^{k+1}(x,a)-\mathrm{Var}\,Z_{2}^{k+1}(x,a)\right|$
$=\sup_{x,a}\gamma^{2}\left|\mathrm{Var}\left(P^{\pi}Z_{1}^{k}(x,a)\right)-\mathrm{Var}\left(P^{\pi}Z_{2}^{k}(x,a)\right)\right|$
$=\sup_{x,a}\gamma^{2}\left|\mathbb{E}\left[\mathrm{Var}\left(Z_{1}^{k}(X^{\prime},A^{\prime})\right)-\mathrm{Var}\left(Z_{2}^{k}(X^{\prime},A^{\prime})\right)\right]\right|$
$\leq\sup_{x^{\prime},a^{\prime}}\gamma^{2}\left|\mathrm{Var}\left(Z_{1}^{k}(x^{\prime},a^{\prime})\right)-\mathrm{Var}\left(Z_{2}^{k}(x^{\prime},a^{\prime})\right)\right|$
$\leq\gamma^{2}\left\|\mathrm{Var}\,Z_{1}^{k}-\mathrm{Var}\,Z_{2}^{k}\right\|_{\infty}.$ ∎

Lemma B.2 (Lemma B.2 of Rowland et al. (2019)).

Let $\tau_{k}=\frac{2k-1}{2K}$ for $k=1,\ldots,K$. Consider the corresponding 1-Wasserstein projection operator $\Pi_{W_{1}}:\mathscr{P}(\mathbb{R})\rightarrow\mathscr{P}(\mathbb{R})$, defined by

$\Pi_{W_{1}}\mu_{i}=\frac{1}{K}\sum_{k=1}^{K}\delta_{F_{\mu_{i}}^{-1}\left(\tau_{k}\right)},$

for all $\mu_{i}\in\mathscr{P}(\mathbb{R})$, where $F_{\mu_{i}}^{-1}$ is the inverse CDF of $\mu_{i}$. Let random variables $X\sim\mu_{1}$ and $X^{2}\sim\mu_{2}$, and let $\eta_{1},\eta_{2}\in\mathscr{P}(\mathbb{R})$. Suppose the immediate reward distributions are supported on $[-R_{max},R_{max}]$. Then, we have:
(i) $W_{1}\left(\Pi_{W_{1}}\mu_{1},\mu_{1}\right)\leq\frac{2R_{max}}{K(1-\gamma)}$;
(ii) $W_{1}\left(\Pi_{W_{1}}\eta_{1},\Pi_{W_{1}}\eta_{2}\right)\leq W_{1}\left(\eta_{1},\eta_{2}\right)+\frac{4R_{max}}{K(1-\gamma)}$;
(iii) $W_{1}\left(\Pi_{W_{1}}\mu_{2},\mu_{2}\right)\leq\frac{R_{max}^{2}}{K(1-\gamma)}$.

Proof.

This proof follows directly from Lemma B.2 of Rowland et al. (2019). To prove (i), let $F_{\mu_{1}}^{-1}$ be the inverse CDF of $\mu_{1}$. We have

$W_{1}\left(\Pi_{W_{1}}\mu_{1},\mu_{1}\right)=\sum_{i=0}^{K-1}\int_{F_{\mu_{1}}^{-1}(\frac{i}{K})}^{F_{\mu_{1}}^{-1}(\frac{i+1}{K})}\left|x-F_{\mu_{1}}^{-1}\left(\tfrac{2i+1}{2K}\right)\right|\,\mu_{1}(dx)$
$\leq\frac{1}{K}\left(F_{\mu_{1}}^{-1}(1)-F_{\mu_{1}}^{-1}(0)\right)\quad(\text{each interval carries probability mass }1/K\text{, and the return distribution }\mu_{1}\text{ is bounded on }[-\tfrac{R_{max}}{1-\gamma},\tfrac{R_{max}}{1-\gamma}])$
$=\frac{2R_{max}}{K(1-\gamma)}.$

To prove (ii), we use the triangle inequality together with statement (i):

$W_{1}\left(\Pi_{W_{1}}\eta_{1},\Pi_{W_{1}}\eta_{2}\right)\leq W_{1}\left(\Pi_{W_{1}}\eta_{1},\eta_{1}\right)+W_{1}\left(\eta_{1},\eta_{2}\right)+W_{1}\left(\eta_{2},\Pi_{W_{1}}\eta_{2}\right)$
$\leq W_{1}\left(\eta_{1},\eta_{2}\right)+\frac{4R_{max}}{K(1-\gamma)}.$

Statement (ii) reflects the fact that the quantile projection operator $\Pi_{W_{1}}$ need not be a non-expansion under the 1-Wasserstein distance, a property that matters for the uniqueness of the fixed point and the convergence of the algorithm.

The proof of (iii) is similar to that of (i), using the fact that the distribution $\mu_{2}$ is bounded on $[0,\frac{R_{max}^{2}}{1-\gamma}]$ to obtain the following inequality:

$W_{1}\left(\Pi_{W_{1}}\mu_{2},\mu_{2}\right)\leq\frac{R_{max}^{2}}{K(1-\gamma)}.$ ∎

Theorem B.3 (Parameterization induced error bound).

Let $\Pi_{\mathcal{W}_{1}}$ be the projection operator onto evenly spaced quantile levels $\tau_{i}=\frac{2i-1}{2N}$, $i=1,\ldots,N$, and let $\eta_{k}\in\mathscr{P}(\mathbb{R})$ be the return distribution at the $k$-th iteration. Let random variables $Z_{\theta}^{k}\sim\Pi_{\mathcal{W}_{1}}\mathcal{T}^{\pi}\eta_{k}$ and $Z^{k}\sim\mathcal{T}^{\pi}\eta_{k}$. Assume that the distribution of the immediate reward is supported on $[-R_{max},R_{max}]$; then we have

$\lim_{k\rightarrow\infty}\left\|\mathcal{E}^{k}_{3}\right\|_{\infty}=\lim_{k\rightarrow\infty}\left\|\mathbb{E}Z_{\theta}^{k}-\mathbb{E}Z^{k}\right\|_{\infty}\leq\frac{2R_{max}}{N(1-\gamma)},$

where $\mathcal{E}^{k}_{3}$ is the parameterization-induced error at the $k$-th iteration.

Proof.

Using the dual representation of the Wasserstein distance (Villani, 2009) and Lemma B.2, for all $(x,a)$ we have

$\left|\mathbb{E}Z_{\theta}^{k}(x,a)-\mathbb{E}Z^{k}(x,a)\right|\leq W_{1}\left(\Pi_{W_{1}}\mathcal{T}^{\pi}\eta_{k}(x,a),\mathcal{T}^{\pi}\eta_{k}(x,a)\right)\leq\frac{2R_{max}}{N(1-\gamma)}.$

Taking the supremum over $(x,a)$ and the limit over iterations $k$ on the left-hand side, we obtain

$\lim_{k\rightarrow\infty}\left\|\mathcal{E}^{k}_{3}\right\|_{\infty}=\lim_{k\rightarrow\infty}\left\|\mathbb{E}Z_{\theta}^{k}-\mathbb{E}Z^{k}\right\|_{\infty}\leq\frac{2R_{max}}{N(1-\gamma)}.$

In a similar way, the second-order moment can be bounded by,

$\lim_{k\rightarrow\infty}\left\|\mathbb{E}[Z_{\theta}^{k}]^{2}-\mathbb{E}[Z^{k}]^{2}\right\|_{\infty}\leq\frac{R_{max}^{2}}{N(1-\gamma)}.$

This suggests that higher-order moments are not exactly preserved once the quantile representation is applied. ∎

B.2 Proofs for Section 4

Lemma B.4 (expectation by quantiles).

Let $Z\sim\nu$ be a random variable with CDF $F_{\nu}$ and quantile function $F_{\nu}^{-1}$. Then,

$\mathbb{E}[Z]=\int_{0}^{1}F_{\nu}^{-1}(\tau)\,d\tau.$
Proof.

Since any CDF is non-decreasing and right-continuous, we have for all $(\tau,z)\in(0,1)\times\mathbb{R}$:

$F_{\nu}^{-1}(\tau)\leq z\Longleftrightarrow\tau\leq F_{\nu}(z).$

Then, denoting by $U$ a random variable uniformly distributed over $[0,1]$,

$\mathbb{P}(F_{\nu}^{-1}(U)\leq z)=\mathbb{P}(U\leq F_{\nu}(z))=F_{\nu}(z),$

which shows that the random variable $F_{\nu}^{-1}(U)$ has the same distribution as $Z$. Hence,

$\mathbb{E}[Z]=\mathbb{E}\left[F_{\nu}^{-1}(U)\right]=\int_{0}^{1}F_{\nu}^{-1}(\tau)\,d\tau.$ ∎
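As a quick numerical illustration of this identity (our own example, using the Exp(1) distribution whose quantile function is available in closed form), averaging the quantile function over the midpoints $\tau_{i}=\frac{2i-1}{2N}$ recovers the mean:

```python
import numpy as np

# E[Z] = \int_0^1 F_nu^{-1}(tau) d tau, approximated by averaging the quantile
# function on the midpoints tau_i = (2i - 1) / (2N).
N = 10_000
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)

# Exp(1) has quantile function F^{-1}(tau) = -log(1 - tau) and true mean 1.
quantiles = -np.log(1.0 - taus)
print(quantiles.mean())   # ~1.0, up to the discretisation error of the grid
```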

Lemma B.5.

Consider the linear regression model $\boldsymbol{\hat{Q}}=\mathbf{X}_{2}\boldsymbol{M}_{2}+\mathcal{E}$, where $\mathcal{E}$ is distributed as $\mathcal{N}(\mathbf{0},\sigma^{2}V)$ with $V=\mathrm{diag}(v_{1},v_{2},\cdots,v_{N})$, $v_{i}\geq 1$ for $i=1,\cdots,N$, and we set the noise variance $\sigma^{2}=1$ without loss of generality. The WLS estimator is

$\widehat{\boldsymbol{M}}_{2}=(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2})^{-1}\mathbf{X}^{\top}_{2}V^{-1}\boldsymbol{\hat{Q}},$ (25)

and the distribution of the mean estimator takes the form

$\hat{m}_{1}\sim\mathcal{N}\left(m_{1},\;\frac{1}{\sum_{i}v_{i}}+\frac{\left(\frac{\sum_{i}v_{i}z_{\tau_{i}}}{\sum_{i}v_{i}}\right)^{2}}{\sum_{i}v_{i}z_{\tau_{i}}^{2}-\frac{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}{\sum_{i}v_{i}}}\right).$

When $V$ equals the identity matrix $I$,

$\hat{m}_{1}\sim\mathcal{N}\left(m_{1},\;\frac{1}{N}+\frac{\bar{z}^{2}}{\sum_{i}(z_{\tau_{i}}-\bar{z})^{2}}\right).$
Proof.

Premultiplying by $V^{-1/2}$, we get the transformed model

$V^{-1/2}\boldsymbol{\hat{Q}}=V^{-1/2}\mathbf{X}_{2}\boldsymbol{M}_{2}+V^{-1/2}\mathcal{E}.$

Now set $\boldsymbol{\hat{Q}}^{*}=V^{-1/2}\boldsymbol{\hat{Q}}$, $\mathbf{X}^{*}_{2}=V^{-1/2}\mathbf{X}_{2}$, and $\mathcal{E}^{*}=V^{-1/2}\mathcal{E}$, so that the transformed model can be written as $\boldsymbol{\hat{Q}}^{*}=\mathbf{X}^{*}_{2}\boldsymbol{M}_{2}+\mathcal{E}^{*}$. The transformed model is a Gauss-Markov model satisfying the OLS assumptions. Thus, the unique OLS solution is $\widehat{\boldsymbol{M}}_{2}=\left(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2}\right)^{-1}\mathbf{X}^{\top}_{2}V^{-1}\boldsymbol{\hat{Q}}$, and $\widehat{\boldsymbol{M}}_{2}\sim\mathcal{N}\left(\boldsymbol{M}_{2},\sigma^{2}(\mathbf{X}_{2}^{\top}V^{-1}\mathbf{X}_{2})^{-1}\right)$. By computing $(\mathbf{X}^{\top}_{2}V^{-1}\mathbf{X}_{2})^{-1}$, we derive $\hat{m}_{1}\sim\mathcal{N}\left(m_{1},\frac{1}{\sum_{i}v_{i}}+\frac{(\frac{\sum_{i}v_{i}z_{\tau_{i}}}{\sum_{i}v_{i}})^{2}}{\sum_{i}v_{i}z_{\tau_{i}}^{2}-\frac{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}{\sum_{i}v_{i}}}\right)$. ∎
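The sketch below checks the covariance $\sigma^{2}(\mathbf{X}_{2}^{\top}V^{-1}\mathbf{X}_{2})^{-1}$ of the WLS estimator by Monte Carlo; the quantile grid, the weights $v_{i}$, and the true coefficients are our own illustrative assumptions, not the QEMRL settings.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 32
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)                               # z_{tau_i}
X2 = np.column_stack([np.ones(N), z])            # design matrix X_2 = [1, z_tau]
v = np.where((taus < 0.1) | (taus > 0.9), 1.5, 1.0)   # heteroskedastic noise variances
V_inv = np.diag(1.0 / v)
M2_true = np.array([2.0, 0.5])                   # true coefficients M_2 = (m_1, m_2)

def wls(Q_hat):
    """WLS estimator (X' V^{-1} X)^{-1} X' V^{-1} Q_hat, as in Eq. (25)."""
    return np.linalg.solve(X2.T @ V_inv @ X2, X2.T @ V_inv @ Q_hat)

# Monte Carlo check that Cov(M_hat_2) = (X' V^{-1} X)^{-1} when sigma^2 = 1.
draws = np.array([wls(X2 @ M2_true + rng.normal(0.0, np.sqrt(v)))
                  for _ in range(20_000)])
print(np.cov(draws.T))
print(np.linalg.inv(X2.T @ V_inv @ X2))          # the two matrices agree closely
```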

Proposition B.6.

Suppose the noise $\varepsilon_{i}$ independently follows $\mathcal{N}(0,v_{i})$, where $v_{i}\geq 1$ for $i=1,\cdots,N$. Then,

(i) In the homoskedastic case where $v_{i}=1$ for $i=1,\dots,N$, the empirical mean estimator $\hat{m}_{1}^{*}$ has a lower variance, i.e. $\mathrm{Var}(\hat{m}_{1}^{*})<\mathrm{Var}(\hat{m}_{1})$;

(ii) In the heteroskedastic case where the $v_{i}$'s are not all equal, the QEM estimator $\hat{m}_{1}$ achieves a lower variance, i.e. $\mathrm{Var}(\hat{m}_{1})<\mathrm{Var}(\hat{m}_{1}^{*})$, if and only if $\bar{v}^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)>0$, where $\bar{v}=\frac{1}{N}\sum_{i}v_{i}$. This inequality holds when $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$, which is guaranteed in QDRL.

Proof.

The proof of (i) comes directly from the comparison of variances, i.e. $\mathrm{Var}(\hat{m}_{1}^{*})=\frac{1}{N}<\frac{1}{N}+\frac{\bar{z}^{2}}{\sum_{i}(z_{\tau_{i}}-\bar{z})^{2}}=\mathrm{Var}(\hat{m}_{1})$. Next, we prove that (ii) holds under the sufficient condition $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$. In QDRL, the quantile levels $\tau_{i}=\frac{2i-1}{2N}$ are equally spaced around $0.5$, so the condition $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$ indeed holds, where $z_{\tau_{i}}$ is the $\tau_{i}$-th quantile of the standard normal distribution. For $N=2$, we need to validate the inequality $\bar{v}^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)>0$. This can be transformed into a multivariate extreme value problem. Analyzing the function $f(v_{1},v_{2})=\frac{(v_{1}+v_{2})^{2}}{4}-1-\frac{1}{\frac{(v_{1}+v_{2})^{2}}{(v_{1}-v_{2})^{2}}-1}$ shows that its infimum over $v_{1},v_{2}>1$ is $0$, approached only in the limit $\lim_{(v_{1},v_{2})\to(1,1)}f(v_{1},v_{2})=0$. For $N=3$, the case is identical to $N=2$ since $z_{0.5}=0$. For $N=4$, $f(v_{1},v_{2},v_{3},v_{4})=\frac{(v_{1}+v_{2}+v_{3}+v_{4})^{2}}{N^{2}}-1-\frac{1}{\frac{(v_{1}+v_{2}+v_{3}+v_{4})(k^{2}v_{1}+v_{2}+v_{3}+k^{2}v_{4})}{(kv_{1}+v_{2}-v_{3}-kv_{4})^{2}}-1}$, and this expression can be factored as $f(v_{1},v_{2},v_{3},v_{4})=\frac{v_{1}+v_{2}+v_{3}+v_{4}}{N^{2}C}\left((v_{1}+v_{2}+v_{3}+v_{4})C-N^{2}(k^{2}v_{1}+v_{2}+v_{3}+k^{2}v_{4})\right)$, where $C=(k-1)^{2}v_{1}v_{2}+(k+1)^{2}v_{1}v_{3}+4k^{2}v_{1}v_{4}+4v_{2}v_{3}+(k+1)^{2}v_{2}v_{4}+(k-1)^{2}v_{3}v_{4}$ and $k=\frac{\Phi^{-1}(7/8)}{\Phi^{-1}(5/8)}>3$. By comparing the coefficients of the corresponding terms, we can verify that $f(v_{1},v_{2},v_{3},v_{4})>0$ when $v_{i}>1$. Finally, the remaining cases can be proven in the same manner. ∎

Theorem B.7.

Let $\hat{\pi}$ be the learned policy, denote the optimal policy by $\pi_{opt}$, and let $\alpha=\max_{x^{\prime}}D_{TV}\left(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime})\right)$ and $n(x,a)=|\mathcal{D}|$. For all $\delta\in(0,1)$, with probability at least $1-\delta$, for any $\eta(x,a)\in\mathscr{P}(\mathbb{R})$ and all $(x,a)\in\mathcal{D}$,

$\left\|F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}\right\|_{\infty}\leq 2\alpha+\sqrt{\frac{1+4|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$
Proof.

We give this proof in the tabular MDP setting. Directly from the definition of the distributional Bellman operator applied to the CDF, we have

$F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}(u)-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}(u)=\sum_{x^{\prime},a^{\prime}}\hat{P}(x^{\prime}\mid x,a)\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z\left(x^{\prime},a^{\prime}\right)+\hat{R}(x,a)}(u)-\sum_{x^{\prime},a^{\prime}}P(x^{\prime}\mid x,a)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u).$

For notational convenience, we use random variables instead of measures. Here $\hat{P}$ and $\hat{R}$ are the maximum likelihood estimates of the transition and reward functions, respectively. Adding and subtracting $\sum_{x^{\prime},a^{\prime}}\hat{P}(x^{\prime}\mid x,a)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)$, we have

$\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\left(\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right)$
$+\sum_{x^{\prime},a^{\prime}}\left(\hat{P}(x^{\prime}\mid x,a)-P(x^{\prime}\mid x,a)\right)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u).$

For the first term, note that

$\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\left(\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right)$
$=\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\Big(\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)$
$\qquad\qquad\qquad\qquad+\hat{\pi}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)-\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\Big)$
$\leq\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\sum_{a^{\prime}}\Big(\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\left|F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right|$
$\qquad\qquad\qquad\qquad+\left|F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right|\cdot\left|\hat{\pi}(a^{\prime}\mid x^{\prime})-\pi_{opt}(a^{\prime}\mid x^{\prime})\right|\Big)$
$\stackrel{(a)}{\leq}\sum_{x^{\prime}}\hat{P}(x^{\prime}\mid x,a)\left(\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}+2D_{TV}\left(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime})\right)\right)$
$\leq 2\alpha+\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}.$

Here (a) follows from the fact that $\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})-\pi_{opt}(a^{\prime}\mid x^{\prime})\right|=2D_{TV}\left(\hat{\pi}(\cdot\mid x^{\prime})\,\|\,\pi_{opt}(\cdot\mid x^{\prime})\right)\leq 2\alpha$, and

$\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\left|F_{\gamma Z(x^{\prime},a^{\prime})+\hat{R}(x,a)}(u)-F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)\right|$
$=\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\int\left|F_{\hat{R}(x,a)}(r)-F_{R(x,a)}(r)\right|\,dF_{\gamma Z\left(x^{\prime},a^{\prime}\right)}(u-r)$
$\leq\sum_{a^{\prime}}\left|\hat{\pi}(a^{\prime}\mid x^{\prime})\right|\cdot\sup_{r}\left|F_{\hat{R}(x,a)}(r)-F_{R(x,a)}(r)\right|\int dF_{\gamma Z\left(x^{\prime},a^{\prime}\right)}(u-r)$
$=\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}.$

The second term can be bounded as follows:

$\sum_{x^{\prime},a^{\prime}}\left(\hat{P}(x^{\prime}\mid x,a)-P(x^{\prime}\mid x,a)\right)\pi_{opt}(a^{\prime}\mid x^{\prime})F_{\gamma Z(x^{\prime},a^{\prime})+R(x,a)}(u)$
$\leq\sum_{x^{\prime}}\left(\hat{P}(x^{\prime}\mid x,a)-P(x^{\prime}\mid x,a)\right)\sum_{a^{\prime}}\pi_{opt}(a^{\prime}\mid x^{\prime})$
$\leq\left\|\hat{P}(\cdot\mid x,a)-P(\cdot\mid x,a)\right\|_{1}\cdot\left\|\sum_{a^{\prime}}\pi_{opt}(a^{\prime}\mid\cdot)\right\|_{\infty}$
$=\left\|\hat{P}(\cdot\mid x,a)-P(\cdot\mid x,a)\right\|_{1}.$

Next, we show that the two norms can be bounded. By the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, the following inequality holds with probability at least $1-\delta/2$, for all $(x,a)\in\mathcal{D}$:

$\left\|F_{\hat{R}(x,a)}(\cdot)-F_{R(x,a)}(\cdot)\right\|_{\infty}\leq\sqrt{\frac{1}{2n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$

By Hoeffding's inequality and an $\ell_{1}$ concentration bound for the multinomial distribution (see https://nanjiang.cs.illinois.edu/files/cs598/note3.pdf), the following inequality holds with probability at least $1-\delta/2$:

$\max_{x,a}\left\|\hat{P}(\cdot\mid x,a)-P(\cdot\mid x,a)\right\|_{1}\leq\sqrt{\frac{2|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$

Consequently, combining the two inequalities via a union bound and using $\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)}$, the claim follows:

$\left\|F_{\hat{\mathcal{T}}^{\hat{\pi}}\eta(x,a)}-F_{\mathcal{T}^{\pi_{opt}}\eta(x,a)}\right\|_{\infty}\leq 2\alpha+\sqrt{\frac{1+4|\mathcal{X}|}{n(x,a)}\log\frac{4|\mathcal{X}||\mathcal{A}|}{\delta}}.$ ∎

Appendix C Cornish-Fisher Expansion

The Cornish-Fisher expansion (Cornish & Fisher, 1938) is an asymptotic expansion used to approximate the quantiles of a probability distribution from its cumulants. To be more explicit, let $X^{*}$ be a non-Gaussian random variable with mean $0$ and variance $1$. Then the Cornish-Fisher expansion can be represented as a polynomial expansion:

$F^{-1}_{X^{*}}(\tau)=\sum_{i=0}^{\infty}a_{i}\left(\Phi^{-1}(\tau)\right)^{i},$

where the parameters $a_{i}$ depend on the cumulants of $X^{*}$ and $\Phi$ is the standard normal distribution function. To use this expansion in practice, we need to truncate the series. According to Cornish & Fisher (1938), the highest power $i$ must be odd, and the fourth-order ($i=3$) approximation is commonly used in practice. The parameters of the fourth-order expansion are $a_{0}=-\frac{\kappa_{3}}{6}$, $a_{1}=1+5(\frac{\kappa_{3}}{6})^{2}-3\frac{\kappa_{4}}{24}$, $a_{2}=\frac{\kappa_{3}}{6}$, and $a_{3}=\frac{\kappa_{4}}{24}-2(\frac{\kappa_{3}}{6})^{2}$, where $\kappa_{i}$ denotes the $i$-th cumulant. Therefore, the fourth-order expansion is

$F^{-1}_{X^{*}}(\tau)=-\frac{\kappa_{3}}{6}+\left(1+5\left(\frac{\kappa_{3}}{6}\right)^{2}-3\frac{\kappa_{4}}{24}\right)\Phi^{-1}(\tau)+\frac{\kappa_{3}}{6}\left(\Phi^{-1}(\tau)\right)^{2}+\left(\frac{\kappa_{4}}{24}-2\left(\frac{\kappa_{3}}{6}\right)^{2}\right)\left(\Phi^{-1}(\tau)\right)^{3}+\cdots.$

Now define $X^{*}$ as the standardization of $X$, i.e. $X=\mu+\sigma X^{*}$, where $X$ has mean $\mu$ and variance $\sigma^{2}$. Then $F^{-1}_{X}(\tau)$ can be approximated by

$F^{-1}_{X}(\tau)=\mu+\sigma\left(-\frac{\kappa_{3}}{6\sigma^{3}}+\left(1+5\left(\frac{\kappa_{3}}{6\sigma^{3}}\right)^{2}-3\frac{\kappa_{4}}{24\sigma^{4}}\right)\Phi^{-1}(\tau)+\frac{\kappa_{3}}{6\sigma^{3}}\left(\Phi^{-1}(\tau)\right)^{2}+\left(\frac{\kappa_{4}}{24\sigma^{4}}-2\left(\frac{\kappa_{3}}{6\sigma^{3}}\right)^{2}\right)\left(\Phi^{-1}(\tau)\right)^{3}+\cdots\right).$

Denote the skewness by $s=\frac{\kappa_{3}}{\sigma^{3}}$, the kurtosis by $k=\frac{\kappa_{4}}{\sigma^{4}}$, and the standard normal quantile by $z_{\tau}=\Phi^{-1}(\tau)$. Then we can rewrite the above equation as

$F^{-1}_{X}(\tau)=\mu+\sigma\left(z_{\tau}+(z_{\tau}^{2}-1)\frac{s}{6}+(z_{\tau}^{3}-3z_{\tau})\frac{k}{24}+(-2z_{\tau}^{3}+5z_{\tau})\left(\frac{s}{6}\right)^{2}+\cdots\right).$ (26)
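A minimal sketch of Equation 26 is given below; it approximates the quantiles of a Gamma(10, 1) distribution (our own illustrative example) from its mean, standard deviation, skewness, and excess kurtosis, and compares them with the exact quantiles.

```python
import numpy as np
from scipy.stats import norm, gamma

def cornish_fisher_quantile(tau, mu, sigma, s, k):
    """Fourth-order Cornish-Fisher approximation of F_X^{-1}(tau), as in Eq. (26).

    s = kappa_3 / sigma^3 is the skewness, k = kappa_4 / sigma^4 the excess kurtosis.
    """
    z = norm.ppf(tau)
    return mu + sigma * (z + (z**2 - 1) * s / 6
                           + (z**3 - 3 * z) * k / 24
                           + (-2 * z**3 + 5 * z) * (s / 6) ** 2)

# Gamma(a, 1) with a = 10: mean a, variance a, skewness 2/sqrt(a), excess kurtosis 6/a.
a = 10.0
taus = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
print(cornish_fisher_quantile(taus, mu=a, sigma=np.sqrt(a), s=2 / np.sqrt(a), k=6 / a))
print(gamma(a).ppf(taus))   # the approximation closely tracks the exact quantiles
```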

C.1 Regression model selection

We use the R-squared ($R^{2}$) statistic to determine how many terms of Equation 26 should be included in the regression model. $R^{2}$, also known as the coefficient of determination, is a statistical measure of how well the independent variables explain the variance in the dependent variable; in other words, it measures how well the data fit the regression model.

Figure 9: Fitted quantile plot. (a) Normal, $\mathcal{N}(0,1)$. (b) Mixture Gaussian, $0.7\mathcal{N}(-2,1)+0.3\mathcal{N}(3,1)$. (c) Exponential, $\mathrm{Exp}(1)$ with density $e^{-x}$. (d) Gumbel, $G(0,1)$ with density $e^{-(x+e^{-x})}$.

Consider the linear regression model,

$\boldsymbol{\hat{Y}}=\mathbf{X}_{i}\boldsymbol{\beta}_{i}+\mathcal{E}.$

The dependent variable $\boldsymbol{\hat{Y}}$ is composed of the quantiles $\boldsymbol{Y}=(F^{-1}_{X}(\tau_{1}),\dots,F^{-1}_{X}(\tau_{N}))^{T}$ of the distribution of $X$ plus the noise vector $\mathcal{E}$ sampled from $\mathcal{N}(0,0.25)$. When the design matrix is $\mathbf{X}_{1}=(1,\cdots,1)^{\prime}$, the regression model reduces to a one-sample problem, and $\boldsymbol{\beta}_{1}$ can be directly estimated by $\frac{1}{N}\sum_{n=1}^{N}F^{-1}_{X}(\tau_{n})$. We then investigate the following four types of regression models:

Model 1:

$\mathbf{X}_{2}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{2}=\left(\mu,\sigma\right)^{T},$

Model 2:

$\mathbf{X}_{3}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\\ z_{\tau_{1}}^{2}-1&\cdots&z_{\tau_{N}}^{2}-1\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{3}=\left(\mu,\sigma,\sigma\tfrac{s}{6}\right)^{T},$

Model 3:

$\mathbf{X}_{4}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\\ z_{\tau_{1}}^{2}-1&\cdots&z_{\tau_{N}}^{2}-1\\ z_{\tau_{1}}^{3}-3z_{\tau_{1}}&\cdots&z_{\tau_{N}}^{3}-3z_{\tau_{N}}\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{4}=\left(\mu,\sigma,\sigma\tfrac{s}{6},\sigma\tfrac{k}{24}\right)^{T},$

Model 4:

$\mathbf{X}_{5}=\begin{pmatrix}1&\cdots&1\\ z_{\tau_{1}}&\cdots&z_{\tau_{N}}\\ z_{\tau_{1}}^{2}-1&\cdots&z_{\tau_{N}}^{2}-1\\ z_{\tau_{1}}^{3}-3z_{\tau_{1}}&\cdots&z_{\tau_{N}}^{3}-3z_{\tau_{N}}\\ -2z_{\tau_{1}}^{3}+5z_{\tau_{1}}&\cdots&-2z_{\tau_{N}}^{3}+5z_{\tau_{N}}\end{pmatrix}^{T},\quad\boldsymbol{\beta}_{5}=\left(\mu,\sigma,\sigma\tfrac{s}{6},\sigma\tfrac{k}{24},\sigma\left(\tfrac{s}{6}\right)^{2}\right)^{T}.$

Figure 9 shows the regression fitted values and the corresponding $R^{2}$ across several distributions of $X$. As the number of independent variables increases, more of the variance in the error can be explained. However, too many independent variables increase the risk of multicollinearity and overfitting. Based on these practical considerations, we choose Model 3 as our regression model due to its satisfactory level of explainability. In the subsequent section, we give a more in-depth interpretation of this regression model.
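For illustration, the sketch below builds the Model 3 design matrix, fits it by ordinary least squares to noisy quantiles of a Gumbel(0, 1) distribution (as in Figure 9(d)), and reports $R^{2}$; the sample size, noise level, and variable names are our own choices.

```python
import numpy as np
from scipy.stats import norm, gumbel_r

rng = np.random.default_rng(0)
N = 64
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
z = norm.ppf(taus)

# Noisy quantile "observations" of a Gumbel(0, 1) distribution, noise ~ N(0, 0.25).
Y_hat = gumbel_r().ppf(taus) + rng.normal(0.0, 0.5, size=N)

# Model 3 design matrix: intercept, z, z^2 - 1, z^3 - 3z.
X4 = np.column_stack([np.ones(N), z, z**2 - 1, z**3 - 3 * z])

beta, *_ = np.linalg.lstsq(X4, Y_hat, rcond=None)
fitted = X4 @ beta
r2 = 1 - np.sum((Y_hat - fitted) ** 2) / np.sum((Y_hat - Y_hat.mean()) ** 2)
print(beta)   # roughly (mu, sigma, sigma*s/6, sigma*k/24) of the Gumbel under the expansion
print(r2)     # coefficient of determination of the fit
```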

C.2 Interpretation of the remaining term ω(τ)\omega(\tau)

In this section, we explore the role of the remaining term $\omega(\tau)$ in the context of random-design regression. As discussed in Section 4, we present a decomposition of the estimate $\hat{q}(\tau)$ of the $\tau$-th quantile, which includes contributions from the mean, the noise error, and the misspecified error. Specifically, we express the estimate as follows:

$\hat{q}(\tau)=\mu+\omega_{1}(\tau)+\varepsilon(\tau),$

where $\mu$ can be estimated using the mean estimator $\frac{1}{N}\sum_{i}q(\tau_{i})$, which is commonly used in the QDRL and IQN settings. However, this simple model fails to capture important information contained in $\omega_{1}(\tau)$. To address this limitation, we employ the Cornish-Fisher expansion to expand the equation, resulting in the following expressions:

$\hat{q}(\tau)=\mu+z_{\tau}\sigma+\sigma\omega_{2}(\tau)+\varepsilon(\tau),$
$\hat{q}(\tau)=\mu+z_{\tau}\sigma+(z_{\tau}^{2}-1)\sigma\tfrac{s}{6}+\sigma\omega_{3}(\tau)+\varepsilon(\tau),$
$\cdots$

where $\mu$ can be estimated by the linear regression estimator given multiple quantile levels $\{\tau_{i}\}$, which can be sampled from a uniform distribution or predefined to be evenly spaced in $(0,1)$. In theory, higher-order expansions capture more of the misspecified information in $\omega(\tau)$, leading to a more accurate representation of the quantile. However, as discussed before, expansions are typically limited to the fourth order in practice to balance the trade-off between model complexity and estimation accuracy.

To gain a better understanding of the remaining term $\omega(\tau)$ and its impact on the regression estimator, consider the linear model

$\hat{q}(\tau)=\mathbf{x}_{\tau}^{\prime}\beta+\underbrace{\omega_{\tau}}_{\text{Misspecified error}}+\underbrace{\varepsilon}_{\text{Noise error}},$

where $\tau$ can generally be considered uniformly distributed, $\mathbf{x}_{\tau}=(1,z_{\tau},z_{\tau}^{2}-1,\ldots)^{\prime}\in\mathbb{R}^{d}$, and $\beta=(\mu,\sigma,\sigma\tfrac{s}{6},\ldots)^{\prime}\in\mathbb{R}^{d}$. In particular, define the random variables

$\varepsilon:=\hat{q}(\tau)-\mathbb{E}[\hat{q}(\tau)\mid\mathbf{x}_{\tau}]\quad\text{and}\quad\omega_{\tau}:=\mathbb{E}[\hat{q}(\tau)\mid\mathbf{x}_{\tau}]-\mathbf{x}_{\tau}^{\prime}\beta,$

where $\varepsilon$ corresponds to noise with zero mean and variance $\sigma^{2}_{\text{noise}}$, independent across different levels of $\tau$, and $\omega_{\tau}$ corresponds to the misspecification error of $\beta$. Under the following conditions, we can derive a bound for the regression estimator in the misspecified model.

Condition 1 (Subgaussian noise). There exists a finite constant $\sigma_{\text{noise}}\geq 0$ such that for all $\lambda\in\mathbb{R}$, almost surely:

$\mathbb{E}[\exp(\lambda\varepsilon)\mid\mathbf{x}_{\tau}]\leq\exp\left(\lambda^{2}\sigma_{\text{noise}}^{2}/2\right).$

Condition 2 (Bounded approximation error). There exists a finite constant $C_{\text{bias}}\geq 0$ such that, almost surely:

$\left\|\Sigma^{-1/2}\mathbf{x}_{\tau}\omega_{\tau}\right\|_{2}\leq C_{\text{bias}}\sqrt{d},$

where $\Sigma=\mathbb{E}[\mathbf{x}_{\tau}\mathbf{x}_{\tau}^{\prime}]$.

Condition 3 (Subgaussian projections). There exists a finite constant $\rho\geq 1$ such that:

$\mathbb{E}\left[\exp(\alpha^{\top}\Sigma^{-1/2}\mathbf{x}_{\tau})\right]\leq\exp\left(\rho\|\alpha\|^{2}_{2}/2\right),\quad\forall\alpha\in\mathbb{R}^{d}.$
Theorem C.1.

Suppose that Conditions 1, 2, and 3 hold. Then for any $\delta\in(0,1)$, with probability at least $1-3\delta$, the following holds:

$\left\|\hat{\beta}_{\mathrm{ols}}-\beta\right\|_{\Sigma}^{2}\leq\underbrace{K_{\rho,\delta,N}^{2}\left(\frac{4\,\mathbb{E}\left\|\Sigma^{-1/2}\mathbf{x}_{\tau}\omega_{\tau}\right\|^{2}_{2}\left(1+8\log(1/\delta)\right)}{N}+\frac{3C_{\text{bias}}^{2}\,d\log^{2}(1/\delta)}{N^{2}}\right)}_{\text{Misspecified error contribution}}$
$\qquad+\underbrace{K_{\rho,\delta,N}\cdot\frac{\sigma_{\text{noise}}^{2}\left(d+2\sqrt{d\log(1/\delta)}+2\log(1/\delta)\right)}{N}}_{\text{Noise error contribution}},$

where $K_{\rho,\delta,N}$ is a constant depending on $\rho$, $\delta$, and $N$.

Proof.

The proof of the above theorem can be easily adapted from Theorem 2 in Hsu et al. (2011). ∎

The first term on the right-hand side represents the error due to model misspecification, which occurs when the true model differs from the assumed model. Intuitively, incorporating more of the relevant information in $\omega(\tau)$ into the explanatory variables decreases the quantities $\mathbb{E}\left\|\Sigma^{-1/2}\mathbf{x}_{\tau}\omega_{\tau}\right\|^{2}_{2}$ and $C_{\text{bias}}$, so the accuracy of the estimator can potentially be improved by reducing the magnitude of the misspecification error. The second term represents the noise error contribution, which is unavoidable and can only be controlled by increasing the sample size $N$.

Appendix D Experimental Details

D.1 Tabular experiment

The parameter settings used for tabular control are presented in Table 1. In the QEMRL case, the weight matrix $V$ is set as shown in the table, based on domain knowledge indicating that the distribution has low-probability support around its median. The greedy parameter decreases exponentially every 100 steps, and the learning rate decreases piecewise every 50K steps.

Table 1: The (hyper-)parameters of QEMRL and QDRL used in the tabular control experiment.
Hyperparameter Value
Learning rate schedule {0.05,0.025,0.0125}
Discount factor 0.999
Quantile initialization $\mathrm{Unif}(-0.5,0.5)$
Number of quantiles 128
Number of training steps 150K
$\epsilon$-greedy schedule $0.9^{\lfloor t/100\rfloor}$
Number of MC rollouts 10000
Weight matrix $V$ (QEMRL only) $\mathrm{diag}\{1,1,\cdots,\underbrace{1.5,\cdots,1.5}_{\tau\in[0.45,0.55]},\cdots,1,1\}$
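As a small illustration (our own sketch under the assumption of 128 evenly spaced quantile levels, not the released code), the diagonal weight matrix $V$ of Table 1 could be constructed as follows:

```python
import numpy as np

# Diagonal weight matrix V for tabular QEMRL (Table 1): weight 1.5 for quantile
# levels around the median (tau in [0.45, 0.55]) and 1 elsewhere.
N = 128
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
v = np.where((taus >= 0.45) & (taus <= 0.55), 1.5, 1.0)
V = np.diag(v)
print(v[56:72])   # the entries around the median carry weight 1.5
```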

D.2 Atari experiment

We extend QEMRL to a DQN-like architecture with the same network architecture as QR-DQN, which we refer to as QEM-DQN (code is available at https://github.com/Kuangqi927/QEM). Our hyperparameter settings (Table 2) are aligned with Dabney et al. (2018b) for a fair comparison. Additionally, we extend QEMRL to the unfixed-quantile-fraction algorithm IQN, which embeds the quantile fraction $\tau$ into the quantile value network on top of QR-DQN. In Atari, it is infeasible to determine the low-probability supports for every state-action pair; therefore, we only consider the heteroskedasticity that occurs in the tails and treat $V$ as a tuning parameter to be set to an appropriate value. For the exploration experiments, we follow the settings of Mavrin et al. (2019) and set the decay factor $c_{t}=c\sqrt{\frac{\log t}{t}}$ with $c=50$.

Table 2: The hyperparameters of QEM-DQN and QR-DQN used in the Atari experiments.
Hyperparameter Value
Learning rate 0.00005
Discount factor 0.99
Optimizer Adam
Batch size 32
Number of quantiles 200
Number of quantiles (IQN) 32
Weight matrix $V$ (QEM-DQN only) $\mathrm{diag}\{\underbrace{1.5,\cdots,1.5}_{\tau\in[0.9,1)},\cdots,1,1,\cdots,\underbrace{1.5,\cdots,1.5}_{\tau\in(0,0.1]}\}$

D.3 MuJoCo experiment

We extend QEMRL to a SAC-like architecture with the same network architecture as DSAC, which we refer to as QEM-DSAC. Similarly, we extend QEMRL to an IQN version of DSAC. Hyperparameters and environment-specific parameters are listed in Table 3. In addition, SAC has a variant that adaptively fine-tunes $\alpha$ to reach a target entropy. While this adaptive mechanism performs well, we follow the fixed-$\alpha$ setting suggested in the original SAC paper to reduce irrelevant factors.

Table 3: The hyperparameters of QEM-DSAC and DSAC used in the MuJoCo experiments.
Hyperparameter Value
Policy network learning rate 0.0003
Quantile value network learning rate 0.0003
Discount factor 0.99
Optimizer Adam
Target smoothing 0.005
Batch size 256
Minimum steps before training 10000
Number of quantiles 32
Quantile fraction embedding size (IQN) 64
Weight matrix $V$ (QEM-DSAC only) $\mathrm{diag}\{\underbrace{1.2,\cdots,1.2}_{\tau\in[0.9,1)},\cdots,1,1,\cdots,\underbrace{1.2,\cdots,1.2}_{\tau\in(0,0.1]}\}$
Environment Temperature Parameter
Ant-v2 0.2
HalfCheetah-v2 0.2
Hopper-v2 0.2
Walker2d-v2 0.2
Swimmer-v2 0.2
Humanoid-v2 0.05

Appendix E Additional Experimental Results

E.1 Variance reduction for IQN

IQN does not satisfy the sufficient condition $z_{\tau_{i}}=-z_{\tau_{N+1-i}}$ since $\tau$ is sampled from a uniform distribution rather than evenly spaced as in QDRL. To examine the impact of this on the inequality $(\frac{\sum_{i}v_{i}}{N})^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)>0$ in Proposition 4.2, we conduct simulation experiments. We use the function $f(v_{1},\cdots,v_{N})=(\frac{\sum_{i}v_{i}}{N})^{2}-1-1/\left(\frac{\sum_{i}v_{i}\sum_{i}v_{i}z_{\tau_{i}}^{2}}{(\sum_{i}v_{i}z_{\tau_{i}})^{2}}-1\right)$ to examine this inequality, where $v_{i}>1$ and the $\tau_{i}$ are sampled uniformly. In each trial, the $v_{i}$ are randomly sampled from $[1,M]$, and the process is repeated 100,000 times. The minimum values of $f(v_{1},\cdots,v_{N})$ are shown in Table 4 for varying values of $N$ and $M$; a sketch of this simulation follows the table. The results indicate that the minimum of $f$ is always greater than 0, which demonstrates that the inequality holds in practice.

Table 4: Minimum of $f$.
Minimum of $f$ $M$ $N$
0.614 2 32
4.778 5 32
43.143 20 32
0.932 2 128
7.707 5 128
76.489 20 128
1.082 2 500
9.357 5 500
96.473 20 500
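A minimal sketch of this simulation is given below (with fewer trials than the 100,000 used above, for brevity); the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def f(v, z):
    """f(v_1, ..., v_N) from the inequality above, for weights v and normal quantiles z."""
    sv, svz, svz2 = v.sum(), v @ z, v @ (z**2)
    return (sv / len(v)) ** 2 - 1 - 1 / (sv * svz2 / svz**2 - 1)

rng = np.random.default_rng(0)
N, M, trials = 32, 5, 10_000          # 100,000 trials are used in the experiment above
minimum = np.inf
for _ in range(trials):
    taus = rng.uniform(size=N)        # IQN-style uniformly sampled quantile levels
    z = norm.ppf(taus)
    v = rng.uniform(1.0, M, size=N)   # weights v_i drawn from [1, M]
    minimum = min(minimum, f(v, z))
print(minimum)                        # stays positive, consistent with Table 4
```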

E.2 Weight VV tuning experiments

Figure 10: Comparison of different weights $V$ in the QEM-DSAC and QEM-DQN experiments.

E.3 Additional Atari results

Figure 11: Comparison of QEM-DQN and QR-DQN across 9 Atari games.
Figure 12: Comparison of IQEM-DQN and IQN across 9 Atari games.
Figure 13: Comparison of QEM and DLTV across 3 hard-exploration Atari games.