
General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence

Lingwei Zhu
University of Alberta
lingwei4@ualberta.ca
Zheng Chen
Osaka University
chenz@sanken.osaka-u.ac.jp
Matthew Schlegel
University of Alberta
mkschleg@ualberta.ca
Martha White
University of Alberta
CIFAR Canada AI Chair, Amii
whitem@ualberta.ca
Abstract

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence, called the Tsallis KL divergence. The Tsallis KL defined by the $q$-logarithm is a strict generalization, as $q=1$ corresponds to the standard KL divergence; $q>1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q>1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI ($q=1$) across 35 Atari games.

1 Introduction

There is ample theoretical evidence that it is useful to incorporate KL regularization into policy optimization in reinforcement learning. The most basic approach is to regularize towards a uniform policy, resulting in entropy regularization. More effective, however, is to regularize towards the previous policy. By choosing KL regularization between consecutively updated policies, the optimal policy becomes a softmax over a uniform average of the full history of action value estimates (Vieillard et al., 2020a). This averaging smooths out noise, allowing for better theoretical results (Azar et al., 2012; Kozuno et al., 2019; Vieillard et al., 2020a; Kozuno et al., 2022).

Despite these theoretical benefits, there are some issues with using KL regularization in practice. It is well-known that the uniform average is susceptible to outliers; this issue is inherent to the KL divergence (Futami et al., 2018). In practice, heuristics such as assigning vanishing regularization coefficients to some estimates have been widely implemented to increase robustness and accelerate learning (Grau-Moya et al., 2019; Haarnoja et al., 2018; Kitamura et al., 2021). However, theoretical guarantees no longer hold for those heuristics (Vieillard et al., 2020a; Kozuno et al., 2022). A natural question is what alternatives to this KL divergence regularization we can consider that allow us to overcome some of these disadvantages, while maintaining the benefits associated with restricting aggressive policy changes and smoothing errors.

In this work, we explore one possible direction by generalizing to Tsallis KL divergences. Tsallis KL divergences were introduced for physics (Tsallis, 1988, 2009) using a simple idea: replacing the use of the logarithm with the deformed $q$-logarithm. The implications for policy optimization, however, are that we get quite a different form for the resulting policy. Tsallis entropy with $q=2$ has actually already been considered for policy optimization (Chow et al., 2018; Lee et al., 2018), by replacing Shannon entropy with Tsallis entropy to maintain stochasticity in the policy. The resulting policies are called sparsemax policies, because they concentrate the probability on higher-valued actions and truncate the probability to zero for lower-valued actions. Intuitively, this should have the benefit of maintaining stochasticity, but only amongst the most promising actions, unlike the Boltzmann policy which maintains nonzero probability on all actions. Unfortunately, using only Tsallis entropy did not provide significant benefits, and in fact often performed worse than existing methods. We find, however, that using a Tsallis KL divergence to the previous policy does provide notable gains.

We first show how to incorporate Tsallis KL regularization into the standard value iteration updates, and prove that we maintain convergence under this generalization from KL regularization to Tsallis KL regularization. We then characterize the types of policies learned under Tsallis KL, highlighting that there is now a more complex relationship to past action-values than a simple uniform average. We then show how to extend Munchausen Value Iteration (MVI) (Vieillard et al., 2020b) to use Tsallis KL regularization, which we call MVI($q$). We use this naming convention to highlight that this is a strict generalization of MVI: by setting $q=1$, we exactly recover MVI. We then compare MVI($q=2$) with MVI (namely the standard choice where $q=1$), and find that we obtain significant performance improvements in Atari.

Remark: There is a growing body of literature studying generalizations of the KL divergence in RL (Nachum et al., 2019; Zhang et al., 2020). Futami et al. (2018) discussed the inherent drawback of the KL divergence in generative modeling and proposed to use the $\beta$- and $\gamma$-divergences to allow for a weighted average of sample contributions. These divergences fall under the category known as the $f$-divergence (Sason and Verdú, 2016), commonly used in other machine learning domains including generative modeling (Nowozin et al., 2016; Wan et al., 2020; Yu et al., 2020) and imitation learning (Ghasemipour et al., 2019; Ke et al., 2019). In RL, Wang et al. (2018) discussed using the tail-adaptive $f$-divergence to enforce the mass-covering property. Belousov and Peters (2019) discussed the use of the $\alpha$-divergence. The Tsallis KL divergence, however, has not yet been studied in RL.

2 Problem Setting

We focus on discrete-time discounted Markov Decision Processes (MDPs) expressed by the tuple $(\mathcal{S},\mathcal{A},d,P,r,\gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and finite action space, respectively. Let $\Delta(\mathcal{X})$ denote the set of probability distributions over $\mathcal{X}$. $d\in\Delta(\mathcal{S})$ denotes the initial state distribution. $P:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ denotes the transition probability function, and $r(s,a)$ defines the reward associated with that transition. $\gamma\in(0,1)$ is the discount factor. A policy $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ is a mapping from the state space to distributions over actions. We define the action value function following policy $\pi$ and starting from $s_{0}\sim d(\cdot)$ with action $a_{0}$ taken as $Q_{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,|\,s_{0}=s,a_{0}=a\right]$. A standard approach to find the optimal value function $Q_{*}$ is value iteration. To define the formulas for value iteration, it will be convenient to write the action value function as a matrix $Q_{\pi}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$. For notational convenience, we define the inner product for any two functions $F_{1},F_{2}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$ over actions as $\left\langle F_{1},F_{2}\right\rangle\in\mathbb{R}^{|\mathcal{S}|}$.

We are interested in entropy-regularized MDPs where the recursion is augmented with a regularizer $\Omega(\pi)$:

$$\begin{cases}\pi_{k+1}=\operatorname*{arg\,max}_{\pi}\left(\left\langle\pi,Q_{k}\right\rangle-\tau\Omega(\pi)\right),\\ Q_{k+1}=r+\gamma P\left(\left\langle\pi_{k+1},Q_{k}\right\rangle-\tau\Omega(\pi_{k+1})\right)\end{cases} \qquad (1)$$

This modified recursion is guaranteed to converge if $\Omega$ is concave in $\pi$. For standard (Shannon) entropy regularization, we use $\Omega(\pi)=-\mathcal{H}\left(\pi\right)=\left\langle\pi,\ln\pi\right\rangle$. The resulting optimal policy has $\pi_{k+1}\propto\exp\left(\tau^{-1}Q_{k}\right)$, where $\propto$ indicates proportional to, up to a constant not depending on actions.

More generally, we can consider a broad class of regularizers known as $f$-divergences (Sason and Verdú, 2016): $\Omega(\pi)=D_{f}(\pi||\mu):=\left\langle\mu,f\left(\pi/\mu\right)\right\rangle$, where $f$ is a convex function. For example, the KL divergence $D_{KL}(\pi||\mu)=\left\langle\pi,\ln\pi-\ln\mu\right\rangle$ can be recovered by $f(t)=-\ln t$. In this work, when we say KL regularization, we mean the standard choice of setting $\mu=\pi_{k}$, the estimate from the previous update. Therefore, $D_{KL}$ serves as a penalty on aggressive policy changes. The optimal policy in this case takes the form $\pi_{k+1}\propto\pi_{k}\exp\left(\tau^{-1}Q_{k}\right)$. By induction, we can show this KL-regularized optimal policy $\pi_{k+1}$ is a softmax over a uniform average over the history of action value estimates (Vieillard et al., 2020a): $\pi_{k+1}\propto\pi_{k}\exp\left(\tau^{-1}Q_{k}\right)\propto\cdots\propto\exp\left(\tau^{-1}\sum_{j=1}^{k}Q_{j}\right)$. Using KL regularization has been shown to be theoretically superior to entropy regularization in terms of error tolerance (Azar et al., 2012; Vieillard et al., 2020a; Kozuno et al., 2022; Chan et al., 2022).
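This averaging property is easy to verify numerically. Below is a minimal sketch (ours, not the authors' code) for a single state with four actions: folding each new $Q_{k}$ into the previous policy via $\pi_{k+1}\propto\pi_{k}\exp(\tau^{-1}Q_{k})$ gives the same policy as a softmax over the sum of all past estimates.

```python
# Numerical check that pi_{k+1} ∝ pi_k exp(Q_k / tau) ∝ exp(sum_j Q_j / tau)
# for a single state; array names are illustrative.
import numpy as np

def softmax(x):
    z = x - x.max()              # stabilize the exponential
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
tau = 0.5
Qs = [rng.normal(size=4) for _ in range(5)]   # Q_1, ..., Q_k for one state

pi = np.ones(4) / 4                           # start from a uniform policy
for Q in Qs:                                  # recursive KL-regularized update
    pi = pi * np.exp(Q / tau)
    pi /= pi.sum()

pi_closed = softmax(sum(Qs) / tau)            # softmax over the summed estimates
assert np.allclose(pi, pi_closed)
```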

The definitions of $\mathcal{H}(\cdot)$ and $D_{KL}(\cdot||\cdot)$ rely on the standard logarithm, and both induce softmax policies as an exponential (inverse function) over (weighted) action-values (Hiriart-Urruty and Lemaréchal, 2004; Nachum and Dai, 2020). Convergence properties of the resulting regularized algorithms have been well studied (Kozuno et al., 2019; Geist et al., 2019; Vieillard et al., 2020a). In this paper, we investigate Tsallis entropy and the Tsallis KL divergence as the regularizer, which generalize Shannon entropy and the KL divergence respectively.

3 Generalizing to Tsallis Regularization

We can easily incorporate other regularizers into the value iteration recursion, and maintain convergence as long as those regularizers are strongly convex in $\pi$. We characterize the types of policies that arise from using the Tsallis regularizers, and prove the convergence of the resulting regularized recursion.

Figure 1: $\ln_{q}x$, $\exp_{q}x$ and the Tsallis entropy component $-\pi^{q}\ln_{q}\pi$ for $q=1$ to $5$. When $q=1$ they respectively recover their standard counterparts. $\pi$ is chosen to be Gaussian $\mathcal{N}(2,1)$. As $q$ gets larger, $\ln_{q}x$ (and hence Tsallis entropy) becomes more flat and $\exp_{q}x$ more steep.

3.1 Tsallis Entropy Regularization

Tsallis entropy was first proposed by Tsallis (1988) and is defined by the $q$-logarithm. The $q$-logarithm and its unique inverse function, the $q$-exponential, are defined as:

$$\ln_{q}x:=\frac{x^{1-q}-1}{1-q},\qquad \exp_{q}x:=\left[1+(1-q)x\right]^{\frac{1}{1-q}}_{+},\qquad \text{for }q\in\mathbb{R}\backslash\{1\} \qquad (2)$$

where $[\cdot]_{+}:=\max\{\cdot,0\}$. We define $\ln_{1}=\ln$ and $\exp_{1}=\exp$, as in the limit $q\rightarrow 1$ the formulas in Eq. (2) approach these functions. Tsallis entropy can be defined by $S_{q}(\pi):=p\left\langle-\pi^{q},\ln_{q}\pi\right\rangle,\,p\in\mathbb{R}$ (Suyari and Tsukada, 2005). We visualize the $q$-logarithm, $q$-exponential and Tsallis entropy for different $q$ in Figure 1. As $q$ gets larger, the $q$-logarithm (and hence Tsallis entropy) becomes more flat and the $q$-exponential more steep. (The $q$-logarithm defined here is consistent with the physics literature and differs from prior RL works (Lee et al., 2020), where a change of variable $q^{*}=2-q$ is made. We analyze both cases in Appendix A.) Note that $\exp_{q}$ is only invertible for $x>\frac{-1}{1-q}$.
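For concreteness, here is a minimal reference implementation of Eq. (2) in the physics convention used here; this is a sketch with our own helper names, not the paper's code.

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm: (x^{1-q} - 1) / (1 - q); reduces to ln(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (np.power(x, 1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential: [1 + (1-q) x]_+^{1/(1-q)}; reduces to exp(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    return np.power(np.maximum(1.0 + (1.0 - q) * x, 0.0), 1.0 / (1.0 - q))

# exp_q inverts ln_q on its domain, for any entropic index q.
x = np.linspace(0.1, 3.0, 5)
for q in [1.0, 2.0, 5.0]:
    assert np.allclose(exp_q(ln_q(x, q), q), x)
```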

Tsallis policies have a similar form to the softmax, but using the $q$-exponential instead. Let us provide some intuition for these policies. When $p=\frac{1}{2},q=2$, $S_{2}(\pi)=\frac{1}{2}\left\langle\pi,1-\pi\right\rangle$, and the optimization problem $\operatorname*{arg\,max}_{\pi\in\Delta(\mathcal{A})}\left\langle\pi,Q\right\rangle+S_{2}(\pi)=\operatorname*{arg\,min}_{\pi\in\Delta(\mathcal{A})}\left|\!\left|\pi-Q\right|\!\right|_{2}^{2}$ is known to be the Euclidean projection onto the probability simplex. Its solution $\left[Q-\psi\right]_{+}$ is called the sparsemax (Martins and Astudillo, 2016; Lee et al., 2018) and has sparse support (Duchi et al., 2008; Condat, 2016; Blondel et al., 2020). $\psi:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$ is the unique function satisfying $\left\langle\mathbf{1},\left[Q-\psi\right]_{+}\right\rangle=1$.
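The sparsemax can be computed exactly with the standard sort-based simplex projection (Duchi et al., 2008). The following sketch (our illustrative code, not the paper's implementation) returns $[Q-\psi]_{+}$:

```python
import numpy as np

def sparsemax(q_values):
    """Euclidean projection of a vector of action values onto the simplex."""
    z = np.sort(q_values)[::-1]              # sort values in descending order
    cumsum = np.cumsum(z)
    ks = np.arange(1, len(z) + 1)
    k = ks[1.0 + ks * z > cumsum][-1]        # size of the support
    psi = (cumsum[k - 1] - 1.0) / k          # threshold so probabilities sum to 1
    return np.maximum(q_values - psi, 0.0)

pi = sparsemax(np.array([2.0, 1.5, 0.3, -1.0]))
print(pi, pi.sum())   # [0.75, 0.25, 0, 0]: low-valued actions are truncated to zero
```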

As our first result, we unify the Tsallis entropy regularized policies for all $q\in\mathbb{R}_{+}$ with the $q$-exponential, and show that $q$ and $\tau$ are interchangeable for controlling the truncation.

Theorem 1.

Let $\Omega(\pi)=-S_{q}(\pi)$ in Eq. (1). Then the regularized optimal policies can be expressed as:

$$\pi(a|s)=\sqrt[1-q]{\left[\frac{Q(s,a)}{\tau}-\tilde{\psi}_{q}\left(\frac{Q(s,\cdot)}{\tau}\right)\right]_{+}(1-q)}=\exp_{q}\left(\frac{Q(s,a)}{\tau}-\psi_{q}\left(\frac{Q(s,\cdot)}{\tau}\right)\right) \qquad (3)$$

where $\psi_{q}=\tilde{\psi}_{q}+\frac{1}{1-q}$. Additionally, for an arbitrary $(q,\tau)$ pair with $q>1$, the same truncation effect (support) can be achieved using $(q=2,\frac{\tau}{q-1})$.

Proof.

See Appendix B for the full proof. ∎

Theorem 1 characterizes the role played by $q$: controlling the degree of truncation. We show the truncation effect for $q=2$ and $q=50$ in Figure 2, confirming that Tsallis policies tend to truncate more as $q$ gets larger. The theorem also highlights that we can set $q=2$ and still get more or less truncation using different $\tau$, helping to explain why in our experiments $q=2$ is a generally effective choice.

Figure 2: (Left) The Tsallis KL component $-\pi_{1}\ln_{q}\frac{\pi_{2}}{\pi_{1}}$ between two Gaussian policies $\pi_{1}=\mathcal{N}(2.75,1),\pi_{2}=\mathcal{N}(3.25,1)$ for $q=1$ to $5$. When $q=1$, TKL recovers KL. For $q>1$, TKL is more mode-covering than KL. (Mid) The sparsemax operator acting on a Boltzmann policy when $q=2$. (Right) The sparsemax when $q=50$. Truncation gets stronger as $q$ gets larger. The same effect can also be controlled by $\tau$.

Unfortunately, the threshold $\tilde{\psi}_{q}$ (and $\psi_{q}$) does not have a closed-form solution for $q\neq 1,2,\infty$. Note that $q=1$ corresponds to Shannon entropy and $q=\infty$ to no regularization. However, we can resort to Taylor expansion to obtain approximate sparsemax policies.

Theorem 2.

For $q\neq 1,\infty$, we can obtain an approximate threshold $\hat{\psi}_{q}\approx\psi_{q}$ using Taylor expansion, and therefore an approximate policy:

$$\hat{\pi}(a|s)\propto\exp_{q}\left(\frac{Q(s,a)}{\tau}-\hat{\psi}_{q}\left(\frac{Q(s,\cdot)}{\tau}\right)\right),\qquad \hat{\psi}_{q}\left(\frac{Q(s,\cdot)}{\tau}\right)\doteq\frac{\sum_{a\in K(s)}\frac{Q(s,a)}{\tau}-1}{|K(s)|}+1. \qquad (4)$$

$K(s)$ is the set of highest-valued actions, satisfying the relation $1+i\frac{Q(s,a_{(i)})}{\tau}>\sum_{j=1}^{i}\frac{Q(s,a_{(j)})}{\tau}$, where $a_{(j)}$ indicates the action with the $j$th largest action value. The sparsemax policy sets the probabilities of the lowest-valued actions to zero: $\pi(a_{(i)}|s)=0$ for $i=z+1,\dots,|\mathcal{A}|$, where $Q(s,a_{(z)})>\tau^{-1}\hat{\psi}_{q}\left(\frac{Q(s,\cdot)}{\tau}\right)>Q(s,a_{(z+1)})$. When $q=2$, $\hat{\psi}_{q}$ recovers $\psi_{q}$.

Proof.

See Appendix B for the full proof. ∎

Lee et al. (2020) also used $\exp_{q}$ to represent policies, but they consider the continuous-action setting and do not give any computable threshold. By contrast, Theorem 2 presents an easily computable $\hat{\psi}_{q}$ for all $q\notin\{1,\infty\}$.

3.2 Tsallis KL Regularization and Convergence Results

The Tsallis KL divergence is defined as $D^{q}_{KL}(\pi||\mu):=\left\langle\pi,-\ln_{q}\frac{\mu}{\pi}\right\rangle$ (Furuichi et al., 2004). It is a member of the $f$-divergence family and can be recovered by choosing $f(t)=-\ln_{q}t$. As a divergence penalty, it is required that $q>0$ since $f(t)$ should be convex. We further assume that $q>1$ to align with standard divergences, i.e. penalize large values of $\frac{\pi}{\mu}$, since for $0<q<1$ the regularization would penalize $\frac{\mu}{\pi}$ instead. In practice, we find that $0<q<1$ tends to perform poorly. In contrast to KL, Tsallis KL is more mass-covering; i.e. its value is proportional to the $q$-th power of the ratio $\frac{\pi}{\mu}$. When $q$ is big, large values of $\frac{\pi}{\mu}$ are strongly penalized (Wang et al., 2018). This behavior of the Tsallis KL divergence can also be found in other well-known divergences: the $\alpha$-divergence (Wang et al., 2018; Belousov and Peters, 2019) coincides with Tsallis KL when $\alpha=2$; Rényi's divergence also penalizes large policy ratios by raising them to the power $q$, but inside the logarithm, which is therefore an additive extension of KL (Li and Turner, 2016). In the limit of $q\rightarrow 1$, Tsallis entropy recovers Shannon entropy and the Tsallis KL divergence recovers the KL divergence. We plot the Tsallis KL divergence behavior in Figure 2.

Now let us turn to formalizing when value iteration under Tsallis regularization converges. The $q$-logarithm has the following properties. Convexity: $\ln_{q}\pi$ is convex in $\pi$ for $q\leq 0$ and concave for $q>0$; when $q=0$, both $\ln_{q}$ and $\exp_{q}$ become linear. Monotonicity: $\ln_{q}\pi$ is monotonically increasing with respect to $\pi$. These two properties can be verified by checking the first- and second-order derivatives. We prove in Appendix A the following similarities between Shannon entropy (resp. KL) and Tsallis entropy (resp. Tsallis KL). Bounded entropy: we have $0\leq\mathcal{H}\left(\pi\right)\leq\ln|\mathcal{A}|$ and, for all $q$, $0\leq S_{q}(\pi)\leq\ln_{q}|\mathcal{A}|$. Generalized KL property: for all $q$, $D^{q}_{KL}(\pi||\mu)\geq 0$; $D^{q}_{KL}(\pi||\mu)=0$ if and only if $\pi=\mu$ almost everywhere, and $D^{q}_{KL}(\pi||\mu)\rightarrow\infty$ whenever $\pi(a|s)>0$ and $\mu(a|s)=0$.

However, despite this similarity, a crucial difference is that $\ln_{q}$ is non-extensive, which means it is not additive (Tsallis, 1988). In fact, $\ln_{q}$ is only pseudo-additive:

$$\ln_{q}\pi\mu=\ln_{q}\pi+\ln_{q}\mu+(1-q)\ln_{q}\pi\ln_{q}\mu. \qquad (5)$$

Pseudo-additivity complicates obtaining convergence results for Eq. (1) with $q$-logarithm regularizers, since the techniques used for Shannon entropy and the KL divergence are generally not applicable to their $\ln_{q}$ counterparts. Moreover, deriving the optimal policy may be nontrivial. Convergence results have only been established for Tsallis entropy (Lee et al., 2018; Chow et al., 2018).

We know that Eq. (1) with $\Omega(\pi)=D^{q}_{KL}(\pi||\mu)$, for any $\mu$, converges for $q$ that make $D^{q}_{KL}(\pi||\mu)$ strictly convex (Geist et al., 2019). When $q=2$, it is strongly convex, and hence also strictly convex, guaranteeing convergence.

Theorem 3.

The regularized recursion Eq. (1) with $\Omega(\pi)=D^{q}_{KL}(\pi||\cdot)$ and $q=2$ converges to the unique regularized optimal policy.

Proof.

See Appendix C. It simply involves proving that this regularizer is strongly convex. ∎

3.3 TKL Regularized Policies Do More Than Averaging

We next show that the optimal regularized policy under Tsallis KL regularization does more than uniform averaging. It can be seen as performing a weighted average where the degree of weighting is controlled by $q$. Consider the recursion

$$\begin{cases}\pi_{k+1}=\operatorname*{arg\,max}_{\pi}\left\langle\pi,Q_{k}-D^{q}_{KL}(\pi||\pi_{k})\right\rangle,\\ Q_{k+1}=r+\gamma P\left\langle\pi_{k+1},Q_{k}-D^{q}_{KL}(\pi_{k+1}||\pi_{k})\right\rangle,\end{cases} \qquad (6)$$

where we dropped the regularization coefficient $\tau$ for convenience.

Theorem 4.

The greedy policy $\pi_{k+1}$ in Eq. (6) satisfies

$$\pi_{k+1}\propto\left(\exp_{q}Q_{1}\cdots\exp_{q}Q_{k}\right)=\left[\exp_{q}\left(\sum_{j=1}^{k}Q_{j}\right)^{q-1}+\sum_{j=2}^{k}(q-1)^{j}\sum_{i_{1}=1<\dots<i_{j}}^{k}Q_{i_{1}}\cdots Q_{i_{j}}\right]^{\frac{1}{q-1}}. \qquad (7)$$

When $q=1$, Eq. (6) reduces to the KL regularized recursion and hence Eq. (7) reduces to the KL-regularized policy. When $q=2$, Eq. (7) becomes:

$$\exp_{2}Q_{1}\cdots\exp_{2}Q_{k}=\exp_{2}\left(\sum_{j=1}^{k}Q_{j}\right)+\sum_{\begin{subarray}{c}j=2\\ i_{1}=1<\dots<i_{j}\end{subarray}}^{k}Q_{i_{1}}\cdots Q_{i_{j}}.$$

i.e., Tsallis KL regularized policies average over the history of value estimates as well as computing the interaction between them, $\sum_{j=2}^{k}\sum_{i_{1}<\dots<i_{j}}^{k}Q_{i_{1}}\cdots Q_{i_{j}}$.

Proof.

See Appendix D for the full proof. The proof comprises two parts: the first part shows $\pi_{k+1}\propto\exp_{q}Q_{1}\cdots\exp_{q}Q_{k}$, and the second part establishes the more-than-averaging property using the two-point equation (Yamano, 2002) and the $2-q$ duality (Naudts, 2002; Suyari and Tsukada, 2005) to conclude $\left(\exp_{q}x\cdot\exp_{q}y\right)^{q-1}=\exp_{q}\left(x+y\right)^{q-1}+(q-1)^{2}xy$. ∎

The form of this policy is harder to intuit, but we can try to understand each component. The first component actually corresponds to a weighted averaging, by the following property of $\exp_{q}$:

$$\exp_{q}\left(\sum_{i=1}^{k}Q_{i}\right)=\exp_{q}Q_{1}\,\exp_{q}\left(\frac{Q_{2}}{1+(1-q)Q_{1}}\right)\cdots\exp_{q}\left(\frac{Q_{k}}{1+(1-q)\sum_{i=1}^{k-1}Q_{i}}\right). \qquad (8)$$

Eq. (8) is a possible way to expand the summation: the left-hand side of the equation is what one might expect from conventional KL regularization, while the right-hand side shows a weighted scheme such that any estimate $Q_{j}$ is scaled down by one plus $1-q$ times the summation of the estimates before $Q_{j}$ (note that we can exchange $1$ and $q$, see Appendix A); see the numerical check below. Weighting down the numerator by the sum of components in the denominator has been analyzed before in the literature on weighted averaging with robust divergences, e.g., the $\gamma$-divergence (Futami et al., 2018, Table 1). Therefore, we conjecture this functional form helps weight down the magnitude of excessively large $Q_{k}$, which can also be controlled by choosing $q$. In fact, obtaining a weighted average has been an important topic in RL, where many proposed heuristics coincide with weighted averaging (Grau-Moya et al., 2019; Haarnoja et al., 2018; Kitamura et al., 2021).
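The following small check (ours, under the convention $\exp_{q}x=[1+(1-q)x]_{+}^{1/(1-q)}$ and with scalar estimates chosen to stay inside the domain of $\exp_{q}$) verifies the expansion in Eq. (8) for $q=2$:

```python
import numpy as np

def exp_q(x, q):
    return np.power(np.maximum(1.0 + (1.0 - q) * x, 0.0), 1.0 / (1.0 - q))

q = 2.0
Qs = [0.10, 0.25, 0.15, 0.05]          # illustrative scalar estimates Q_1, ..., Q_k

lhs = exp_q(sum(Qs), q)                # exp_q of the plain sum

rhs, running_sum = 1.0, 0.0
for Q in Qs:                           # each Q_j is scaled by 1 + (1-q) * (sum of earlier Q's)
    rhs *= exp_q(Q / (1.0 + (1.0 - q) * running_sum), q)
    running_sum += Q

assert np.isclose(lhs, rhs)            # both sides agree
```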

Now let us consider the second term with $q=2$, so the leading $(q-1)^{j}$ factor equals $1$ and drops out. The action-value cross-product term can be intuitively understood as further increasing the probability of any actions that have had consistently larger values across iterations. This observation agrees with the mode-covering property of Tsallis KL. However, there is no concrete evidence yet for how the average inside the $q$-exponential and the cross-product of action values work jointly to benefit the policy, and their benefits may depend on the task and environment, requiring further categorization and discussion. Empirically, we find that the nonlinearity of Tsallis KL policies brings superior performance over the uniformly averaging KL policies on the testbed considered.

4 A Practical Algorithm for Tsallis KL Regularization

In this section we provide a practical algorithm for implementing Tsallis regularization. We first explain why it is not straightforward to simply implement KL-regularized value iteration, and how Munchausen Value Iteration (MVI) overcomes this issue with a clever implicit regularization trick. We then extend this algorithm to $q>1$ using a similar approach, though now with some approximation due, once again, to the difficulties of pseudo-additivity.

4.1 Implicit Regularization With MVI

Even for the standard KL, it is difficult to implement KL-regularized value iteration with function approximation. The difficulty arises from the fact that we cannot exactly obtain $\pi_{k+1}\propto\pi_{k}\exp\left(Q_{k}\right)$: this policy might not be representable by our function approximator. For $q=1$, one would need to store all past $Q_{k}$, which is computationally infeasible.

An alternative direction has been to construct a different value function iteration scheme, which is equivalent to the original KL regularized value iteration (Azar et al., 2012; Kozuno et al., 2019). A recent method of this family is Munchausen VI (MVI) (Vieillard et al., 2020b). MVI implicitly enforces KL regularization using the recursion

$$\begin{cases}\pi_{k+1}=\operatorname*{arg\,max}_{\pi}\left\langle\pi,Q_{k}-\tau\ln\pi\right\rangle\\ Q_{k+1}=r+{\color{red}\alpha\tau\ln\pi_{k+1}}+\gamma P\left\langle\pi_{k+1},Q_{k}-{\color{blue}\tau\ln\pi_{k+1}}\right\rangle\end{cases} \qquad (9)$$

We see that Eq. (9) is Eq. (1) with $\Omega(\pi)=-\mathcal{H}\left(\pi\right)$ (blue) plus an additional red Munchausen term, with coefficient $\alpha$. Vieillard et al. (2020b) showed that implicit KL regularization is performed under the hood, even though we still have the tractable $\pi_{k+1}\propto\exp\left(\tau^{-1}Q_{k}\right)$:

$$\begin{aligned}
&Q_{k+1}=r+\alpha\tau\ln\pi_{k+1}+\gamma P\left\langle\pi_{k+1},Q_{k}-\tau\ln\pi_{k+1}\right\rangle \;\Leftrightarrow\; Q_{k+1}-\alpha\tau\ln\pi_{k+1}=\\
&\quad r+\gamma P\big(\left\langle\pi_{k+1},Q_{k}-\alpha\tau\ln\pi_{k}\right\rangle-\left\langle\pi_{k+1},\alpha\tau(\ln\pi_{k+1}-\ln\pi_{k})-(1-\alpha)\tau\ln\pi_{k+1}\right\rangle\big)\\
&\Leftrightarrow\; Q^{\prime}_{k+1}=r+\gamma P\big(\left\langle\pi_{k+1},Q^{\prime}_{k}\right\rangle-\alpha\tau D_{KL}(\pi_{k+1}||\pi_{k})+(1-\alpha)\tau\mathcal{H}\left(\pi_{k+1}\right)\big) \qquad (10)
\end{aligned}$$

where $Q^{\prime}_{k+1}:=Q_{k+1}-\alpha\tau\ln\pi_{k+1}$ is the generalized action value function.

The implementation of this idea uses the fact that $\alpha\tau\ln\pi_{k+1}=\alpha(Q_{k}-\mathcal{M}_{\tau}Q_{k})$, where $\mathcal{M}_{\tau}Q_{k}:=\frac{1}{Z_{k}}\left\langle\exp\left(\tau^{-1}Q_{k}\right),Q_{k}\right\rangle$, $Z_{k}=\left\langle\boldsymbol{1},\exp\left(\tau^{-1}Q_{k}\right)\right\rangle$, is the Boltzmann softmax operator. (Using $\mathcal{M}_{\tau}Q$ is equivalent to the log-sum-exp operator up to a constant shift (Azar et al., 2012).) In the original work, computing this advantage term was found to be more stable than directly using the log of the policy. In our extension, we use the same form.
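A sketch of this computation for a single state (illustrative names, not the authors' code):

```python
import numpy as np

def boltzmann_operator(q_values, tau):
    """M_tau Q = < softmax(Q / tau), Q >, a soft alternative to max_a Q(s, a)."""
    z = q_values / tau
    z = z - z.max()                    # numerical stabilization
    pi = np.exp(z) / np.exp(z).sum()
    return np.dot(pi, q_values)

def munchausen_term(q_values, tau, alpha):
    """alpha * (Q - M_tau Q), the advantage form used in place of alpha*tau*ln(pi)."""
    return alpha * (q_values - boltzmann_operator(q_values, tau))

Q = np.array([1.0, 0.5, -0.2, 0.1])
print(munchausen_term(Q, tau=0.03, alpha=0.9))   # one value per action
```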

4.2 MVI($q$) For General $q$

The MVI($q$) algorithm is a simple extension of MVI: it replaces the standard exponential in the definition of the advantage with the $q$-exponential. We can express this action gap as $Q_{k}-\mathcal{M}_{q,\tau}Q_{k}$, where $\mathcal{M}_{q,\tau}Q_{k}=\left\langle\exp_{q}\left(\frac{Q_{k}}{\tau}-\psi_{q}\left(\frac{Q_{k}}{\tau}\right)\right),Q_{k}\right\rangle$. When $q=1$, it recovers $Q_{k}-\mathcal{M}_{\tau}Q_{k}$, and we recover MVI. We summarize this MVI($q$) algorithm in Algorithm B in the Appendix. For $q=\infty$, $\mathcal{M}_{\infty,\tau}Q_{k}$ is $\max_{a}Q_{k}(s,a)$ (no regularization) and we recover advantage learning (Baird and Moore, 1999). Similar to the original MVI algorithm, MVI($q$) enjoys a tractable policy expression with $\pi_{k+1}\propto\exp_{q}\left(\tau^{-1}Q_{k}\right)$.
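For the $q=2$ case used in our experiments, $\mathcal{M}_{q,\tau}Q_{k}$ can be computed from the sparsemax policy of Section 3.1. The sketch below (illustrative names and structure, not the paper's implementation) shows the resulting Munchausen term $\alpha(Q_{k}-\mathcal{M}_{q,\tau}Q_{k})$ for a single state:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (q = 2 policy)."""
    zs = np.sort(z)[::-1]
    cumsum = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    k = ks[1.0 + ks * zs > cumsum][-1]
    psi = (cumsum[k - 1] - 1.0) / k
    return np.maximum(z - psi, 0.0)

def tsallis_munchausen_term(q_values, tau, alpha):
    """alpha * (Q - M_{q,tau} Q) with q = 2: expectation of Q under the sparsemax policy."""
    pi = sparsemax(q_values / tau)           # Tsallis policy for q = 2
    m_q_tau = np.dot(pi, q_values)           # weighted average of action values
    return alpha * (q_values - m_q_tau)

Q = np.array([1.0, 0.5, -0.2, 0.1])
print(tsallis_munchausen_term(Q, tau=0.03, alpha=0.9))
```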

Unlike MVI, however, MVI($q$) no longer exactly implements the implicit regularization shown in Eq. (10). Below, we go through a derivation similar to that for MVI, show why there is an approximation, and motivate why the above advantage term is a reasonable approximation. Beyond this reasoning, our primary motivation for extending MVI to $q>1$ in this way was to inherit the same simple form as MVI, and because empirically we found it to be effective.

Figure 3: MVI($q$) on CartPole-v1 for $q=2,3,4,5$, averaged over 50 seeds, with $\tau=0.03,\alpha=0.9$. (Left) The difference between the proposed action gap $Q_{k}-\mathcal{M}_{q,\tau}Q_{k}$ and the general Munchausen term $\ln_{q}\pi_{k+1}$ converges to a constant. (Right) The residual $R_{q}(\pi_{k+1},\pi_{k})$ becomes larger as $q$ increases. For $q=2$, it remains negligible throughout learning.

Let us similarly define a generalized action value function $Q_{k+1}^{\prime}=Q_{k+1}-\alpha\tau\ln_{q}\pi_{k+1}$. Using the relationship $\ln_{q}\pi_{k}=\ln_{q}\frac{\pi_{k}}{\pi_{k+1}}-\ln_{q}\frac{1}{\pi_{k+1}}-(1-q)\ln_{q}\pi_{k}\ln_{q}\frac{1}{\pi_{k+1}}$, we get

$$\begin{aligned}
&Q_{k+1}-\alpha\tau\ln_{q}\pi_{k+1}=r+\gamma P\left\langle\pi_{k+1},Q_{k}+\alpha\tau\ln_{q}\pi_{k}-\alpha\tau\ln_{q}\pi_{k}+\tau S_{q}\left(\pi_{k+1}\right)\right\rangle\\
&\Leftrightarrow\; Q_{k+1}^{\prime}=r+\gamma P\left\langle\pi_{k+1},Q_{k}^{\prime}+\tau S_{q}\left(\pi_{k+1}\right)\right\rangle+\gamma P\left\langle\pi_{k+1},\alpha\tau\left(\ln_{q}\frac{\pi_{k}}{\pi_{k+1}}-\ln_{q}\frac{1}{\pi_{k+1}}-(1-q)\ln_{q}\frac{1}{\pi_{k+1}}\ln_{q}\pi_{k}\right)\right\rangle \qquad (11)\\
&=r+\gamma P\left\langle\pi_{k+1},Q_{k}^{\prime}+(1-\alpha)\tau S_{q}(\pi_{k+1})\right\rangle-\gamma P\left\langle\pi_{k+1},\alpha\tau D^{q}_{KL}(\pi_{k+1}||\pi_{k})-\alpha\tau R_{q}(\pi_{k+1},\pi_{k})\right\rangle
\end{aligned}$$

where we leveraged the fact that $-\alpha\tau\left\langle\pi_{k+1},\ln_{q}\frac{1}{\pi_{k+1}}\right\rangle=-\alpha\tau S_{q}(\pi_{k+1})$ and defined the residual term $R_{q}(\pi_{k+1},\pi_{k}):=(1-q)\ln_{q}\frac{1}{\pi_{k+1}}\ln_{q}\pi_{k}$. When $q=2$, the residual term is expected to remain negligible, but it can become larger as $q$ increases. We visualize the trend of the residual $R_{q}(\pi_{k+1},\pi_{k})$ for $q=2,3,4,5$ on the CartPole-v1 environment (Brockman et al., 2016) in Figure 3. Learning consists of $2.5\times 10^{5}$ steps, evaluated every $2500$ steps (one iteration), averaged over 50 independent runs. The magnitude of the residual jumps from $q=4$ to $5$, while for $q=2$ it remains negligible throughout.
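For reference, the quantity tracked in Figure 3 can be computed as follows (our helper, assuming strictly positive policy vectors for a single state; names are illustrative):

```python
import numpy as np

def ln_q(x, q):
    return (np.power(x, 1.0 - q) - 1.0) / (1.0 - q)

def expected_residual(pi_next, pi_prev, q):
    """< pi_{k+1}, R_q(pi_{k+1}, pi_k) > with R_q = (1-q) ln_q(1/pi_{k+1}) ln_q(pi_k)."""
    r_q = (1.0 - q) * ln_q(1.0 / pi_next, q) * ln_q(pi_prev, q)
    return np.dot(pi_next, r_q)

pi_prev = np.array([0.4, 0.3, 0.2, 0.1])
pi_next = np.array([0.5, 0.3, 0.15, 0.05])
for q in [2.0, 3.0, 4.0, 5.0]:
    print(q, expected_residual(pi_next, pi_prev, q))   # magnitude grows with q
```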

A reasonable approximation, therefore, is to use $\ln_{q}\pi_{k+1}$ and omit this residual term. Even this approximation, however, has an issue. When the actions are in the support, $\ln_{q}$ is the unique inverse function of $\exp_{q}$ and $\ln_{q}\pi_{k+1}$ yields $\frac{Q_{k}}{\tau}-\psi_{q}\left(\frac{Q_{k}}{\tau}\right)$. However, for actions outside the support, we cannot take the inverse, because many inputs to $\exp_{q}$ can result in zero. We could still use $\frac{Q_{k}}{\tau}-\psi_{q}\left(\frac{Q_{k}}{\tau}\right)$ as a sensible choice, and it appropriately uses negative values for the Munchausen term for these zero-probability actions. Empirically, however, we found this to be less effective than using the action gap.

Though the action gap is yet another approximation, there are clear similarities between using $\frac{Q_{k}}{\tau}-\psi_{q}\left(\frac{Q_{k}}{\tau}\right)$ and the action gap $Q_{k}-\mathcal{M}_{q,\tau}Q_{k}$. The primary difference is in how the values are centered. We can see $\psi_{q}$ as using a uniform average of the values of the actions in the support, as characterized in Theorem 2. $\mathcal{M}_{q,\tau}Q_{k}$, on the other hand, is a weighted average of action-values.

We plot the difference between $Q_{k}-\mathcal{M}_{q,\tau}Q_{k}$ and $\ln_{q}\pi_{k+1}$ in Figure 3, again on CartPole. The difference stabilizes around $-0.5$ for most of learning, in other words primarily just shifting by a constant, but in early learning $\ln_{q}\pi_{k+1}$ is larger, across all $q$. This difference in magnitude might explain why using the action gap results in more stable learning, though more investigation is needed to truly understand the difference. For the purposes of this initial work, we pursue the use of the action gap, both as a natural extension of the current implementation of MVI and from our own experiments suggesting improved stability with this form.

Figure 4: Learning curves of MVI($q$) and M-VI on the selected Atari games, averaged over 3 independent runs, with the ribbon denoting the standard error. On some environments MVI($q$) significantly improves upon M-VI. Quantitative improvements over M-VI and Tsallis-VI are shown in Figure 5.
Figure 5: (Left) The percent improvement of MVI($q$) with $q=2$ over standard MVI (where $q=1$) on select Atari games. The improvement is computed as the difference between the MVI($q$) and MVI scores, normalized by the MVI scores. (Right) Improvement over Tsallis-VI on Atari environments, normalized by the Tsallis-VI scores.

5 Experiments

In this section we investigate the utility of MVI($q$) on the Atari 2600 benchmark (Bellemare et al., 2013), testing whether the benefits of $q>1$ hold in these more challenging environments. Specifically, we compare to standard MVI ($q=1$), which was already shown to have competitive performance on Atari (Vieillard et al., 2020b). We restrict our attention to $q=2$, which was generally effective in other settings and also allows us to contrast with previous work (Lee et al., 2020) that only used entropy regularization, without KL regularization. For MVI($q=2$), we take the exact same learning setup, hyperparameters and architecture, as MVI($q=1$) and simply modify the term added to the VI update, as in Algorithm 1.

For the Atari games we implemented MVI($q$), Tsallis-VI and M-VI based on Quantile Regression DQN (Dabney et al., 2018). We leverage the optimized Stable-Baselines3 architecture (Raffin et al., 2021) for best performance and average over 3 independent runs following Vieillard et al. (2020b), though we run $50$ million frames instead of 200 million. From Figure 4 it is visible that MVI($q$) is stable with no wild variance, suggesting 3 seeds might be sufficient. We perform grid searches for the algorithmic hyperparameters on two environments, Asterix and Seaquest; the latter is regarded as a hard exploration environment. For MVI($q$) we search $\alpha\in\{0.01,0.1,0.5,0.9,0.99\}$ and $\tau\in\{0.01,0.1,1.0,10,100\}$; for Tsallis-VI, $\tau\in\{0.01,0.1,1.0,10,100\}$. For MVI we use the hyperparameters reported in (Vieillard et al., 2020b). Hyperparameters are listed in Table 2 and full results are provided in Appendix E.

5.1 Comparing MVI($q$) with $q=1$ to $q=2$

We provide the overall performance of MVI versus MVI($q=2$) in Figure 5. Using $q=2$ provides a large improvement in about 5 games, about double the performance in the next 5 games, comparable performance in the next 7 games and then slightly worse performance in 3 games (PrivateEye, Chopper and Seaquest). Both PrivateEye and Seaquest are considered harder exploration games, which might explain this discrepancy. The Tsallis policy with $q=2$ reduces the support of actions, truncating some probabilities to zero. In general, with a higher $q$, the resulting policy is greedier, with $q=\infty$ corresponding to exactly the greedy policy. It is possible that for these harder exploration games, the higher stochasticity in the softmax policy from MVI where $q=1$ promoted more exploration. A natural next step is to consider incorporating more directed exploration approaches into MVI($q=2$), to benefit from the fact that lower-value actions are removed (avoiding taking poor actions) while exploring in a more directed way when needed.

We examine the learning curves for the games where MVI($q$) had the most significant improvement in Figure 4. Particularly notable is how much more quickly MVI($q$) learned with $q=2$, in addition to plateauing at a higher point. In Hero, MVI($q$) learned stably across the runs, whereas standard MVI with $q=1$ clearly has some failures.

These results are quite surprising. The algorithms are otherwise very similar, with the seemingly small change of using the Munchausen term $Q_{k}(s,a)-\mathcal{M}_{q=2,\tau}Q_{k}$ instead of $Q_{k}(s,a)-\mathcal{M}_{q=1,\tau}Q_{k}$ and using the $q$-logarithm and $q$-exponential for the entropy regularization and policy parameterization. Previous work using $q=2$ to get the sparsemax with entropy regularization generally harmed performance (Lee et al., 2018, 2020). It seems that to get the benefits of the generalization to $q>1$, the addition of the KL regularization might be key. We validate this in the next section.

5.2 The Importance of Including KL Regularization

In the policy evaluation step of Eq. (11), if we set $\alpha=0$ then we recover Tsallis-VI, which uses the regularization $\Omega(\pi)=-S_{q}(\pi)$ in Eq. (1). In other words, we recover the algorithm that incorporates entropy regularization using the $q$-logarithm and the resulting sparsemax policy. Unlike MVI, Tsallis-VI has not been comprehensively evaluated on Atari games, so we include results for the larger benchmark set comprising 35 Atari games. We plot the percentage improvement of MVI($q$) over Tsallis-VI in Figure 5.

The improvement from including the Munchausen term ($\alpha>0$) is stark. For more than half of the games, MVI($q$) resulted in more than 100% improvement. For the remaining games it was comparable. For 10 games, it provided more than 400% improvement. Looking more specifically at the games with notable improvement, it seems that exploration may again have played a role. MVI($q$) performs much better on Seaquest and PrivateEye. Both MVI($q$) and Tsallis-VI have policy parameterizations that truncate the action support, setting probabilities to zero for some actions. The KL regularization term, however, likely slows this truncation down. It is possible that Tsallis-VI is concentrating too quickly, resulting in insufficient exploration.

6 Conclusion and Discussion

We investigated the use of the more general $q$-logarithm for entropy regularization and KL regularization, instead of the standard logarithm ($q=1$), which gave rise to Tsallis entropy and Tsallis KL regularization. We extended several results previously shown for $q=1$: namely, we proved (a) that the Tsallis policy can be expressed with the $q$-exponential function; (b) that Tsallis KL-regularized policies are weighted averages of past action-values; (c) the convergence of value iteration for $q=2$; and (d) a relationship between adding a $q$-logarithm of the policy to the action-value update and implicit Tsallis KL and entropy regularization, generalizing the original Munchausen Value Iteration (MVI). We used these results to propose a generalization of MVI, which we call MVI($q$), because for $q=1$ we exactly recover MVI. We showed empirically that the generalization to $q>1$ can be beneficial, providing notable improvements on the Atari 2600 benchmark.

References

  • Azar et al. [2012] M. G. Azar, V. Gómez, and H. J. Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(1):3207–3245, 2012.
  • Baird and Moore [1999] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, page 968–974, 1999.
  • Bellemare et al. [2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47(1):253–279, 2013. ISSN 1076-9757.
  • Belousov and Peters [2019] B. Belousov and J. Peters. Entropic regularization of markov decision processes. Entropy, 21(7), 2019.
  • Blondel et al. [2020] M. Blondel, A. F. Martins, and V. Niculae. Learning with fenchel-young losses. Journal of Machine Learning Research, 21(35):1–69, 2020.
  • Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Chan et al. [2022] A. Chan, H. Silva, S. Lim, T. Kozuno, A. R. Mahmood, and M. White. Greedification operators for policy optimization: Investigating forward and reverse kl divergences. Journal of Machine Learning Research, 23(253):1–79, 2022.
  • Chen et al. [2018] G. Chen, Y. Peng, and M. Zhang. Effective exploration for deep reinforcement learning via bootstrapped q-ensembles under tsallis entropy regularization. arXiv:abs/1809.00403, 2018. URL http://arxiv.org/abs/1809.00403.
  • Chow et al. [2018] Y. Chow, O. Nachum, and M. Ghavamzadeh. Path consistency learning in Tsallis entropy regularized MDPs. In International Conference on Machine Learning, pages 979–988, 2018.
  • Condat [2016] L. Condat. Fast projection onto the simplex and the l1 ball. Mathematical Programming, 158:575–585, 2016.
  • Cover and Thomas [2006] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
  • Dabney et al. [2018] W. Dabney, M. Rowland, M. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, pages 2892–2899, 2018.
  • Duchi et al. [2008] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, page 272–279, 2008.
  • Furuichi et al. [2004] S. Furuichi, K. Yanagi, and K. Kuriyama. Fundamental properties of tsallis relative entropy. Journal of Mathematical Physics, 45(12):4868–4877, 2004.
  • Futami et al. [2018] F. Futami, I. Sato, and M. Sugiyama. Variational inference based on robust divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 813–822, 2018.
  • Geist et al. [2019] M. Geist, B. Scherrer, and O. Pietquin. A theory of regularized Markov decision processes. In 36th International Conference on Machine Learning, volume 97, pages 2160–2169, 2019.
  • Ghasemipour et al. [2019] S. K. S. Ghasemipour, R. S. Zemel, and S. S. Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1–19, 2019.
  • Grau-Moya et al. [2019] J. Grau-Moya, F. Leibfried, and P. Vrancx. Soft q-learning with mutual-information regularization. In International Conference on Learning Representations, pages 1–13, 2019.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1861–1870, 2018.
  • Hiriart-Urruty and Lemaréchal [2004] J. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Grundlehren Text Editions. Springer Berlin Heidelberg, 2004.
  • Ke et al. [2019] L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa. Imitation learning as $f$-divergence minimization, 2019. URL https://arxiv.org/abs/1905.12888.
  • Kitamura et al. [2021] T. Kitamura, L. Zhu, and T. Matsubara. Geometric value iteration: Dynamic error-aware kl regularization for reinforcement learning. In Proceedings of The 13th Asian Conference on Machine Learning, volume 157, pages 918–931, 2021.
  • Kozuno et al. [2019] T. Kozuno, E. Uchibe, and K. Doya. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pages 2995–3003, 2019.
  • Kozuno et al. [2022] T. Kozuno, W. Yang, N. Vieillard, T. Kitamura, Y. Tang, J. Mei, P. Ménard, M. G. Azar, M. Valko, R. Munos, O. Pietquin, M. Geist, and C. Szepesvári. Kl-entropy-regularized rl with a generative model is minimax optimal, 2022. URL https://arxiv.org/abs/2205.14211.
  • Lee et al. [2018] K. Lee, S. Choi, and S. Oh. Sparse markov decision processes with causal sparse tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3:1466–1473, 2018.
  • Lee et al. [2020] K. Lee, S. Kim, S. Lim, S. Choi, M. Hong, J. I. Kim, Y. Park, and S. Oh. Generalized tsallis entropy reinforcement learning and its application to soft mobile robots. In Robotics: Science and Systems XVI, pages 1–10, 2020.
  • Li and Turner [2016] Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • Martins and Astudillo [2016] A. F. T. Martins and R. F. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning, page 1614–1623, 2016.
  • Nachum and Dai [2020] O. Nachum and B. Dai. Reinforcement learning via fenchel-rockafellar duality. 2020. URL http://arxiv.org/abs/2001.01866.
  • Nachum et al. [2019] O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
  • Naudts [2002] J. Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A-statistical Mechanics and Its Applications, 316:323–334, 2002.
  • Nowozin et al. [2016] S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, volume 29, pages 1–9, 2016.
  • Prehl et al. [2012] J. Prehl, C. Essex, and K. H. Hoffmann. Tsallis relative entropy and anomalous diffusion. Entropy, 14(4):701–716, 2012.
  • Raffin et al. [2021] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
  • Sason and Verdú [2016] I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62:5973–6006, 2016.
  • Suyari and Tsukada [2005] H. Suyari and M. Tsukada. Law of error in tsallis statistics. IEEE Transactions on Information Theory, 51(2):753–757, 2005.
  • Suyari et al. [2020] H. Suyari, H. Matsuzoe, and A. M. Scarfone. Advantages of q-logarithm representation over q-exponential representation from the sense of scale and shift on nonlinear systems. The European Physical Journal Special Topics, 229(5):773–785, 2020.
  • Tsallis [1988] C. Tsallis. Possible generalization of boltzmann-gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.
  • Tsallis [2009] C. Tsallis. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World. Springer New York, 2009. ISBN 9780387853581.
  • Vieillard et al. [2020a] N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, and M. Geist. Leverage the average: an analysis of regularization in rl. In Advances in Neural Information Processing Systems 33, pages 1–12, 2020a.
  • Vieillard et al. [2020b] N. Vieillard, O. Pietquin, and M. Geist. Munchausen reinforcement learning. In Advances in Neural Information Processing Systems 33, pages 1–11. 2020b.
  • Wan et al. [2020] N. Wan, D. Li, and N. Hovakimyan. f-divergence variational inference. In Advances in Neural Information Processing Systems, volume 33, pages 17370–17379, 2020.
  • Wang et al. [2018] D. Wang, H. Liu, and Q. Liu. Variational inference with tail-adaptive f-divergence. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 5742–5752, 2018.
  • Yamano [2002] T. Yamano. Some properties of q-logarithm and q-exponential functions in tsallis statistics. Physica A: Statistical Mechanics and its Applications, 305(3):486–496, 2002.
  • Yu et al. [2020] L. Yu, Y. Song, J. Song, and S. Ermon. Training deep energy-based models with f-divergence minimization. In Proceedings of the 37th International Conference on Machine Learning, ICML’20, pages 1–11, 2020.
  • Zhang et al. [2020] R. Zhang, B. Dai, L. Li, and D. Schuurmans. Gendice: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.

Appendix A Basic facts of Tsallis KL divergence

We present some basic facts about the $q$-logarithm and the Tsallis KL divergence.

We begin by introducing the $2-q$ duality for Tsallis statistics. Recall that the $q$-logarithm and Tsallis entropy defined in the main paper are:

$$\ln_{q}x=\frac{x^{1-q}-1}{1-q},\qquad S_{q}(x)=-\left\langle x^{q},\ln_{q}x\right\rangle.$$

In the RL literature, another definition $q^{*}=2-q$ is more often used [Lee et al., 2020]. This is called the $2-q$ duality [Naudts, 2002, Suyari and Tsukada, 2005], which refers to the fact that Tsallis entropy can be equivalently defined as:

$$\ln_{q^{*}}x=\frac{x^{q^{*}-1}-1}{q^{*}-1},\qquad S_{q^{*}}(x)=-\left\langle x,\ln_{q^{*}}x\right\rangle,$$

By the duality we can show [Suyari and Tsukada, 2005, Eq.(12)]:

$$S_{q}(x):=-\left\langle x^{q},\frac{x^{1-q}-1}{1-q}\right\rangle=\frac{\left\langle\boldsymbol{1},x^{q}\right\rangle-1}{1-q}=\frac{\left\langle\boldsymbol{1},x^{q^{*}}\right\rangle-1}{1-q^{*}}=-\left\langle x,\frac{x^{q^{*}-1}-1}{q^{*}-1}\right\rangle=:S_{q^{*}}(x),$$

i.e., the duality between the logarithms $\ln_{q^{*}}x$ and $\ln_{q}x$ allows us to define Tsallis entropy with the alternative index $q^{*}$ while reaching the same functional form.

We now examine the Tsallis KL divergence (or Tsallis relative entropy) defined in another form: $D^{q}_{KL}(\pi||\mu)=\left\langle\pi,\ln_{q^{*}}\frac{\pi}{\mu}\right\rangle$ [Prehl et al., 2012]. In the main paper we used the definition $D^{q}_{KL}(\pi||\mu)=\left\langle\pi,-\ln_{q}\frac{\mu}{\pi}\right\rangle$ [Furuichi et al., 2004]. We show they are equivalent by the same logic:

$$\left\langle\pi,-\ln_{q}\frac{\mu}{\pi}\right\rangle=\left\langle\pi,-\frac{\left(\frac{\mu}{\pi}\right)^{1-q}-1}{1-q}\right\rangle=\left\langle\pi,\frac{\left(\frac{\pi}{\mu}\right)^{q-1}-1}{q-1}\right\rangle=\left\langle\pi,\ln_{q^{*}}\frac{\pi}{\mu}\right\rangle. \qquad (12)$$

The equivalence allows us to work with whichever of $\ln_{q}$ and $\ln_{q^{*}}$ makes the proof easier, in order to work out the following useful properties of the Tsallis KL divergence:

$-$ Nonnegativity, $D^{q}_{KL}(\pi||\mu)\geq 0$: since the function $-\ln_{q}\pi$ is convex, by Jensen's inequality

$$\left\langle\pi,-\ln_{q}\frac{\mu}{\pi}\right\rangle\geq-\ln_{q}\left\langle\pi,\frac{\mu}{\pi}\right\rangle=0,$$

$-$ Conditions for $D^{q}_{KL}(\pi||\mu)=0$: directly from the above, the equality in Jensen's inequality holds only when $\frac{\mu}{\pi}=1$ almost everywhere, i.e. $D^{q}_{KL}(\pi||\mu)=0$ implies $\mu=\pi$ almost everywhere.

$-$ Conditions for $D^{q}_{KL}(\pi||\mu)=\infty$: to better align with the standard KL divergence, we work with $\ln_{q^{*}}$ and, following [Cover and Thomas, 2006], define

$$0\ln_{q^{*}}\frac{0}{0}=0,\qquad 0\ln_{q^{*}}\frac{0}{\mu}=0,\qquad\pi\ln_{q^{*}}\frac{\pi}{0}=\infty.$$

We conclude that $D^{q}_{KL}(\pi||\mu)=\infty$ whenever $\pi>0$ and $\mu=0$.

$-$ Bounded entropy, $\forall q,\,0\leq S_{q}(\pi)\leq\ln_{q}|\mathcal{A}|$: let $\mu=\frac{1}{|\mathcal{A}|}$; by the nonnegativity of the Tsallis KL divergence:

$$\begin{aligned}D^{q}_{KL}(\pi||\mu)&=\left\langle\pi,-\ln_{q}\frac{1}{|\mathcal{A}|\cdot\pi}\right\rangle=\left\langle\pi,\frac{\left(|\mathcal{A}|\cdot\pi\right)^{q-1}-1}{q-1}\right\rangle\\ &=|\mathcal{A}|^{q-1}\left(\frac{\left\langle\boldsymbol{1},\pi^{q}\right\rangle-1}{q-1}-\frac{\frac{1}{|\mathcal{A}|^{q-1}}-1}{q-1}\right)\geq 0.\end{aligned}$$

Noticing that $\frac{\left\langle\boldsymbol{1},\pi^{q}\right\rangle-1}{q-1}=\left\langle\pi^{q},\frac{1-\pi^{1-q}}{1-q}\right\rangle=\left\langle\pi,\ln_{q}\pi\right\rangle=-S_{q}(\pi)$ and $\frac{\frac{1}{|\mathcal{A}|^{q-1}}-1}{q-1}=\ln_{q}|\mathcal{A}|$, we conclude

$$S_{q}(\pi)\leq\ln_{q}|\mathcal{A}|.$$

Appendix B Proof of Theorem 1 and 2

We structure this section as the following three parts:

  1. Tsallis entropy regularized policies have a general expression for all $q$. Moreover, $q$ and $\tau$ are interchangeable for controlling the truncation (Theorem 1).

  2. The policies can be expressed by the $q$-exponential (Theorem 1).

  3. We present a computable approximate threshold $\hat{\psi}_{q}$ (Theorem 2).

General expression for the Tsallis entropy regularized policy. The original definition of Tsallis entropy is $S_{q^{*}}(\pi(\cdot|s))=\frac{p}{q^{*}-1}\left(1-\sum_{a}\pi^{q^{*}}(a|s)\right)$, $q^{*}\in\mathbb{R},\,p\in\mathbb{R}_{+}$. Note that, similar to Appendix A, we can choose whichever of $q$ and $q^{*}$ is convenient, since the domain of the entropic index is $\mathbb{R}$. To obtain the Tsallis entropy-regularized policies we follow [Chen et al., 2018]. The derivation begins by assuming an actor-critic framework where the policy network is parametrized by $w$. It is well known that the parameters should be updated in the direction specified by the policy gradient theorem:

$$\Delta w\propto\mathbb{E}_{\pi}\left[Q_{\pi}\frac{\partial\ln\pi}{\partial w}+\tau\frac{\partial\mathcal{H}\left(\pi\right)}{\partial w}\right]-\sum_{s}\lambda(s)\frac{\partial\left\langle\boldsymbol{1},\pi\right\rangle}{\partial w}=:f(w), \qquad (13)$$

Recall that $\mathcal{H}\left(\pi\right)$ denotes the Shannon entropy and $\tau$ is its coefficient. $\lambda(s)$ are the Lagrange multipliers for the constraint $\left\langle\boldsymbol{1},\pi\right\rangle=1$. In the Tsallis entropy framework, we replace $\mathcal{H}\left(\pi\right)$ with $S_{q^{*}}(\pi)$. We can assume $p=\frac{1}{q^{*}}$ to ease the derivation, which is the case for the sparsemax.

We can now explicitly write the optimal condition for the policy network parameters:

\displaystyle\begin{split}&f(w)=0=\mathbb{E}_{\pi}\left[Q_{\pi}\frac{\partial\ln\pi}{\partial w}+\tau\frac{\partial S_{q^{*}}(\pi)}{\partial w}\right]-\sum_{s}\lambda(s)\frac{\partial\left\langle\boldsymbol{1},\pi\right\rangle}{\partial w}\\ &=\mathbb{E}_{\pi}\left[Q_{\pi}\frac{\partial\ln\pi}{\partial w}-\tau\frac{1}{q^{*}-1}\left\langle\boldsymbol{1},\pi^{q^{*}}\frac{\partial\ln\pi}{\partial{w}}\right\rangle-\tilde{\psi}_{q}(s)\frac{\partial\ln\pi}{\partial w}\right]\\ &=\mathbb{E}_{\pi}\left[\left(Q_{\pi}-\tau\frac{1}{q^{*}-1}{\pi^{q^{*}-1}}-\tilde{\psi}_{q}(s)\right)\frac{\partial\ln\pi}{\partial w}\right],\end{split} (14)

where we leveraged \frac{\partial S_{q^{*}}(\pi)}{\partial w}=-\frac{1}{q^{*}-1}\left\langle\boldsymbol{1},\pi^{q^{*}}\frac{\partial\ln\pi}{\partial{w}}\right\rangle in the second step and absorbed terms into the expectation in the last step. \tilde{\psi}_{q}(s) denotes the adjusted Lagrange multipliers, obtained by taking \lambda(s) inside the expectation and rescaling it according to the discounted stationary distribution.

It now suffices to verify that either \frac{\partial\ln\pi}{\partial w}=0 or

\displaystyle\begin{split}&Q_{\pi}(s,a)-\tau\frac{1}{q^{*}-1}{\pi^{q^{*}-1}(a|s)}-\tilde{\psi}_{q}(s)=0\\ \Leftrightarrow\quad&\pi^{*}(a|s)=\sqrt[q^{*}-1]{\left[\frac{Q_{\pi}(s,a)}{\tau}-\frac{\tilde{\psi}_{q}\left(s\right)}{\tau}\right]_{+}(q^{*}-1)},\\ \text{or }\quad&\pi^{*}(a|s)=\sqrt[1-q]{\left[\frac{Q_{\pi}(s,a)}{\tau}-\frac{\tilde{\psi}_{q}\left(s\right)}{\tau}\right]_{+}(1-q)},\end{split} (15)

where we changed the entropic index from q^{*} to q in the last line. Clearly, taking the root does not affect which actions are truncated. Consider the pair (q^{*}=50,\tau): the same truncation effect can be achieved by choosing (q^{*}=2,\frac{\tau}{50-1}). The same holds for q. Therefore, q and \tau are interchangeable for controlling the truncation, and we stick to the analytic choice q^{*}=2 (q=0).

Tsallis policies can be expressed by the q-exponential. Given Eq. (15), by adding and subtracting 1 we have:

\displaystyle\pi^{*}(a|s)=\sqrt[1-q]{\left[1+(1-q)\left(\frac{Q_{\pi}(s,a)}{\tau}-\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)-\frac{1}{1-q}\right)\right]_{+}}=\exp_{q}\left(\frac{Q_{\pi}(s,a)}{\tau}-\hat{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)\right),

where we defined \hat{\psi}_{q}=\tilde{\psi}_{q}+\frac{1}{1-q}. Note that this expression is general for all q, but whether \pi^{*} has a closed-form expression depends on the solvability of \tilde{\psi}_{q}.

Let us consider the extreme case q=\infty. Clearly \lim_{q\rightarrow\infty}\frac{1}{1-q}=0, so for any x>0 we have x^{\frac{1}{1-q}}\rightarrow 1. Since the probabilities must still sum to one, only a single action can remain in the support: it receives probability 1 and all others receive 0. This agrees with the fact that S_{q}(\pi)\rightarrow 0 as q\rightarrow\infty, so the regularized policy degenerates to the \operatorname*{arg\,max}.
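
As a quick numerical sanity check of the rewrite above, the following sketch (using the standard q-exponential \exp_{q}(x)=\left[1+(1-q)x\right]_{+}^{1/(1-q)} and arbitrary illustrative values for Q_{\pi}/\tau and \tilde{\psi}_{q}) confirms that the root form of Eq. (15) and the \exp_{q} form with \hat{\psi}_{q}=\tilde{\psi}_{q}+\frac{1}{1-q} coincide for q<1:

```python
import numpy as np

def exp_q(x, q):
    # Deformed q-exponential: [1 + (1-q) x]_+^{1/(1-q)}; recovers np.exp as q -> 1.
    return np.exp(x) if np.isclose(q, 1.0) else np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))

q = 0.5                                        # any q < 1; q = 0 corresponds to q* = 2
Q_over_tau = np.array([2.0, 1.0, 0.2, -0.5])   # arbitrary illustrative values
psi_tilde = 0.8
psi_hat = psi_tilde + 1 / (1 - q)

root_form = np.maximum((Q_over_tau - psi_tilde) * (1 - q), 0.0) ** (1 / (1 - q))
expq_form = exp_q(Q_over_tau - psi_hat, q)
assert np.allclose(root_form, expq_form)
```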

A computable Normalization Function. The constraint \sum_{a\in K(s)}\pi^{*}(a|s)=1 is exploited to obtain the threshold \psi for the sparsemax [Lee et al., 2018, Chow et al., 2018]. Unfortunately, this is only possible when the root vanishes, since otherwise the constraint yields a summation of radicals. Nonetheless, we can resort to a first-order Taylor expansion to derive an approximate policy. Following [Chen et al., 2018], let us expand Eq. (15) to first order as f(z)+f^{\prime}(z)(x-z), where z=1, x=\left[\frac{Q_{\pi}(s,a)}{\tau}-\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)\right]_{+}(1-q), f(x)=x^{\frac{1}{1-q}}, and f^{\prime}(x)=\frac{1}{1-q}x^{\frac{q}{1-q}}. The unnormalized approximate policy is then

\displaystyle\begin{split}\tilde{\pi}^{*}(a|s)&\approx f(z)+f^{\prime}(z)(x-z)\\ &=1+\frac{1}{1-q}\left(\left(\frac{Q_{\pi}(s,a)}{\tau}-\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)\right)(1-q)-1\right).\end{split} (16)

Therefore, it is clear that as q\rightarrow\infty, \tilde{\pi}^{*}(a|s)\rightarrow 1, which accords with the limiting case in which \pi^{*}(a|s) degenerates to the \operatorname*{arg\,max}. With Eq. (16), we can solve for the approximate normalization using the constraint \sum_{a\in K(s)}\pi^{*}(a|s)=1:

\displaystyle 1=\sum_{a\in K(s)}\left[1+\frac{1}{1-q}\left(\left(\frac{Q_{\pi}(s,a)}{\tau}-\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)\right)(1-q)-1\right)\right]=|K(s)|-\frac{1}{1-q}|K(s)|+\sum_{a\in K(s)}\left[\frac{Q_{\pi}(s,a)}{\tau}-\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)\right]
\displaystyle\Leftrightarrow\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)=\frac{\sum_{a\in K(s)}\frac{Q_{\pi}(s,a)}{\tau}-1}{|K(s)|}+1-\frac{1}{1-q}.

In order for an action to be in K(s), it has to satisfy \frac{Q_{\pi}(s,a)}{\tau}>\frac{\sum_{a^{\prime}\in K(s)}\frac{Q_{\pi}(s,a^{\prime})}{\tau}-1}{|K(s)|}+1-\frac{1}{1-q}. Therefore, the actions in K(s) satisfy the condition:

\displaystyle 1+i\frac{Q_{\pi}(s,a_{(i)})}{\tau}>\sum_{j=1}^{i}{\frac{Q_{\pi}(s,a_{(j)})}{\tau}}+i\left(1-\frac{1}{1-q}\right).

Therefore, the approximate threshold is \hat{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)=\tilde{\psi}_{q}\left(\frac{Q_{\pi}(s,\cdot)}{\tau}\right)+\frac{1}{1-q}=\frac{\sum_{a\in K(s)}\frac{Q_{\pi}(s,a)}{\tau}-1}{|K(s)|}+1. When q=0 (i.e., q^{*}=2), \tilde{\pi}^{*} recovers the exact sparsemax policy with its threshold \psi.
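
For concreteness, the policy-improvement step implied by this approximate threshold can be sketched as follows. This is an illustrative implementation under the 1-q convention of Eq. (15), not the authors' code; here q=0 corresponds to q^{*}=2 and recovers the exact sparsemax, while other q<1 yield the first-order approximation and require renormalization.

```python
import numpy as np

def exp_q(x, q):
    # Deformed q-exponential: [1 + (1-q) x]_+^{1/(1-q)}; recovers np.exp as q -> 1.
    return np.exp(x) if np.isclose(q, 1.0) else np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))

def approx_tsallis_policy(q_values, tau, q):
    """Approximate Tsallis policy for one state from Q(s, .), using the support
    condition and threshold psi_hat derived above (illustrative sketch)."""
    z = q_values / tau
    z_sorted = np.sort(z)[::-1]                 # Q(s, a_(1)) >= ... >= Q(s, a_(|A|))
    csum = np.cumsum(z_sorted)
    i = np.arange(1, z.size + 1)
    # Support condition: 1 + i z_(i) > sum_{j <= i} z_(j) + i (1 - 1/(1-q)).
    in_support = 1 + i * z_sorted > csum + i * (1 - 1 / (1 - q))
    k = i[in_support].max()                     # |K(s)|
    psi_hat = (csum[k - 1] - 1) / k + 1         # approximate threshold
    pi = exp_q(z - psi_hat, q)
    return pi / pi.sum()                        # renormalize (a no-op when q = 0)

print(approx_tsallis_policy(np.array([3.0, 1.5, 1.2, -0.4]), tau=1.0, q=0.0))
```

For q=0 the final renormalization changes nothing, since the sparsemax probabilities already sum to one; for other q it corrects the first-order approximation.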

  Input: number of iterations T, entropy coefficient \tau, TKL coefficient \alpha
  Initialize Q_{0},\pi_{0} arbitrarily
  Let \{|\mathcal{A}|\}=\{1,2,\dots,|\mathcal{A}|\}
  for k=1,2,\dots,T do
     # Policy Improvement
     for (s,a)\in(\mathcal{S},\mathcal{A}) do
        Sort Q_{k}(s,a_{(1)})>\dots>Q_{k}(s,a_{(|\mathcal{A}|)})
        Find K(s)=\max\left\{i\in\{|\mathcal{A}|\}\,\big{|}\,1+i\frac{Q_{k}(s,a_{(i)})}{\tau}>\sum_{j=1}^{i}\frac{Q_{k}(s,a_{(j)})}{\tau}+i\left(1-\frac{1}{1-q}\right)\right\}
        Compute \hat{\psi}_{q}\left(\frac{Q_{k}(s,\cdot)}{\tau}\right)=\frac{\sum_{a\in K(s)}\frac{Q_{k}(s,a)}{\tau}-1}{|K(s)|}+1
        # Normalize when q\neq 2
        \pi_{k+1}(a|s)\propto\exp_{q}\left(\frac{Q_{k}(s,a)}{\tau}-\hat{\psi}_{q}\left(\frac{Q_{k}(s,\cdot)}{\tau}\right)\right)
     end for
     # Policy Evaluation
     for (s,a,s^{\prime})\in(\mathcal{S},\mathcal{A}) do
        Q_{k+1}(s,a)=r(s,a)+\alpha\tau\left(Q_{k}(s,a)-\mathcal{M}_{q,\tau}{Q_{k}}(s)\right)+\gamma\sum_{b\in\mathcal{A}}\pi_{k+1}(b|s^{\prime})\left(Q_{k}(s^{\prime},b)-\tau\ln_{q}{\pi_{k+1}(b|s^{\prime})}\right)
     end for
  end for
Algorithm 1 MVI(q)
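
To make Algorithm 1 concrete, a minimal tabular sketch on a synthetic MDP is given below. The random MDP, the hyperparameter values, the identification \mathcal{M}_{q,\tau}Q_{k}(s)=\tau\hat{\psi}_{q}\!\left(Q_{k}(s,\cdot)/\tau\right) (so that Q_{k}-\mathcal{M}_{q,\tau}Q_{k}=\tau\ln_{q}\pi_{k+1} on the support), and the clipping of the Munchausen term are all illustrative assumptions rather than the authors' implementation; q below follows the 1-q convention of Eq. (15), so q=0 plays the role of q^{*}=2.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 5, 3
gamma, tau, alpha, q = 0.9, 0.1, 0.9, 0.0      # q = 0 corresponds to q* = 2 (sparsemax)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # synthetic transition kernel P[s, a, s']
r = rng.normal(size=(nS, nA))                  # synthetic reward table

def exp_q(x, q):
    return np.exp(x) if np.isclose(q, 1.0) else np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))

def ln_q(x, q):
    return np.log(x) if np.isclose(q, 1.0) else (x ** (1 - q) - 1) / (1 - q)

def improvement(Q):
    """q-exponential policy and threshold psi_hat for every state (policy improvement)."""
    pi, psi_hat = np.zeros((nS, nA)), np.zeros(nS)
    for s in range(nS):
        z = np.sort(Q[s] / tau)[::-1]
        i = np.arange(1, nA + 1)
        support = 1 + i * z > np.cumsum(z) + i * (1 - 1 / (1 - q))
        k = i[support].max()
        psi_hat[s] = (np.cumsum(z)[k - 1] - 1) / k + 1
        p = exp_q(Q[s] / tau - psi_hat[s], q)
        pi[s] = p / p.sum()                    # normalize when q != 0
    return pi, psi_hat

Q = np.zeros((nS, nA))
for _ in range(300):
    pi, psi_hat = improvement(Q)
    # Munchausen term alpha * tau * (Q - M_{q,tau} Q), clipped for truncated actions.
    munchausen = alpha * tau * np.maximum(Q - tau * psi_hat[:, None], -1.0)
    soft_v = np.sum(pi * (Q - tau * ln_q(np.clip(pi, 1e-8, 1.0), q)), axis=1)
    Q = r + munchausen + gamma * P @ soft_v    # policy evaluation
print(improvement(Q)[0])                       # the resulting (typically sparse) policy
```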

Appendix C Proof of convergence of \Omega(\pi)=D^{q}_{\!K\!L}\!\left(\pi\left|\!\right|\cdot\right) when q=2

Let us work with \ln_{q^{*}} from Appendix A and define \left|\!\left|\cdot\right|\!\right|_{p} as the l_{p}-norm. The convergence proof for \Omega(\pi)=D^{q}_{\!K\!L}\!\left(\pi\left|\!\right|\cdot\right) when q=2 follows from the fact that \Omega(\pi) is strongly convex in \pi:

\displaystyle\Omega(\pi)=D_{KL}^{q^{*}=2}\left({\pi}||{\cdot}\right)=\left\langle\pi,\ln_{2}\frac{\pi}{\cdot}\right\rangle=\left\langle\pi,\frac{\left(\frac{\pi}{\cdot}\right)^{2-1}-1}{2-1}\right\rangle\propto\left|\!\left|\frac{\pi}{\cdot}\right|\!\right|_{2}^{2}-1. (17)

Similarly, the negative Tsallis sparse entropy -S_{2}(\pi) is also strongly convex. The propositions of [Geist et al., 2019] can then be applied; we restate them in the following:

Lemma 1 ([Geist et al., 2019]).

Define regularized value functions as:

\displaystyle Q_{\pi,\Omega}=r+\gamma PV_{\pi,\Omega},\qquad V_{\pi,\Omega}=\left\langle\pi,Q_{\pi,\Omega}\right\rangle-\Omega(\pi).

If \Omega(\pi) is strongly convex, let \Omega^{*}(Q)=\max_{\pi}\left\langle\pi,Q\right\rangle-\Omega(\pi) denote the Legendre-Fenchel transform of \Omega(\pi); then

  • \nabla\Omega^{*} is Lipschitz and is the unique maximizing argument of \max_{\pi}\left\langle\pi,Q\right\rangle-\Omega(\pi).

  • T_{\pi,\Omega} is a \gamma-contraction in the supremum norm, i.e. \left|\!\left|T_{\pi,\Omega}V_{1}-T_{\pi,\Omega}V_{2}\right|\!\right|_{\infty}\leq\gamma\left|\!\left|V_{1}-V_{2}\right|\!\right|_{\infty}. Further, it has a unique fixed point V_{\pi,\Omega}.

  • The policy \pi_{*,\Omega}=\operatorname*{arg\,max}_{\pi}\left\langle\pi,Q_{*,\Omega}\right\rangle-\Omega(\pi) is the unique optimal regularized policy.

Note that in the main paper we dropped the subscript \Omega for both the regularized optimal policy and the action value function to lighten notation. It is now clear that Eq. (6) indeed converges for entropic indices that make D^{q}_{\!K\!L}\!\left(\pi\left|\!\right|\cdot\right) strongly convex; we mostly consider the case q=2.
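
To make the strong-convexity argument concrete, here is a small numerical check with an arbitrary fixed reference policy \mu standing in for the \cdot argument (an illustrative sketch only):

```python
import numpy as np

# For the q* = 2 Tsallis KL, Omega(pi) = <pi, pi/mu - 1> = sum_a pi_a^2 / mu_a - 1.
# Its Hessian is 2 diag(1/mu), whose eigenvalues are >= 2 because mu_a <= 1.
rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(6))                 # fixed reference policy (arbitrary)
hessian = 2.0 * np.diag(1.0 / mu)
print(np.linalg.eigvalsh(hessian).min())       # >= 2, hence strongly convex

def omega(pi):
    return np.sum(pi ** 2 / mu) - 1.0

# Midpoint check of strong convexity with modulus 2 along a random segment:
p1, p2 = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
lhs = omega(0.5 * (p1 + p2))
rhs = 0.5 * omega(p1) + 0.5 * omega(p2) - (2.0 / 8.0) * np.sum((p1 - p2) ** 2)
assert lhs <= rhs + 1e-12
```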

Appendix D Derivation of the Tsallis KL Policy

This section contains the proof for the Tsallis KL-regularized policy (7). Section D.1 shows that a Tsallis KL policy can also be expressed as a product of \exp_{q}\left(Q\right) terms, while Section D.2 shows its more-than-averaging property.

D.1 Tsallis KL Policies are Similar to KL

We extend the proof, and use the same notation, from [Lee et al., 2020, Appendix D] to derive the Tsallis KL regularized policy. Again let us work with \ln_{q^{*}} from Appendix A. Define the state visitation as \rho_{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\mathbbm{1}(s_{t}=s)\right] and the state-action visitation as \rho_{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\mathbbm{1}(s_{t}=s,a_{t}=a)\right]. The core of the proof resides in establishing the one-to-one correspondence between the policy and the induced state-action visitation \rho_{\pi}. For example, the Tsallis entropy is written as

S_{q^{*}}(\pi)=S_{q^{*}}(\rho_{\pi})=-\sum_{s,a}\rho_{\pi}(s,a)\ln_{q^{*}}{\frac{\rho_{\pi}(s,a)}{\sum_{a^{\prime}}\rho_{\pi}(s,a^{\prime})}}.

This unique correspondence allows us to change the optimization variable from \pi to \rho_{\pi}. Indeed, one can always recover the policy by \pi(a|s):=\frac{\rho_{\pi}(s,a)}{\sum_{a^{\prime}}\rho_{\pi}(s,a^{\prime})}.

Let us write the Tsallis KL divergence as D^{q^{*}}_{\!K\!L}\!\left(\pi\left|\!\right|\mu\right)=D^{q^{*}}_{\!K\!L}\!\left(\rho\left|\!\right|\nu\right)=\sum_{s,a}\rho(s,a)\ln_{q^{*}}\frac{\rho(s,a)\sum_{a^{\prime}}\nu(s,a^{\prime})}{\nu(s,a)\sum_{a^{\prime}}\rho(s,a^{\prime})} by replacing the policies \pi,\mu with their state-action visitations \rho,\nu. One can then convert the Tsallis MDP problem into the following problem:

\displaystyle\begin{split}&\max_{\rho}\sum_{s,a}\rho(s,a)\sum_{s^{\prime}}r(s,a)P(s^{\prime}|s,a)-D^{q^{*}}_{\!K\!L}\!\left(\rho\left|\!\right|\nu\right)\\ &\text{subject to }\forall s,a,\quad\rho(s,a)>0,\\ &\quad\sum_{a}\rho(s,a)=d(s)+\sum_{s^{\prime},a^{\prime}}P(s|s^{\prime},a^{\prime})\rho(s^{\prime},a^{\prime}),\end{split} (18)

where d(s) is the initial state distribution. The constraints in Eq. (18) are known as the Bellman flow constraints [Lee et al., 2020, Prop. 5], and the problem is concave in \rho since the first term is linear and the second term is concave in \rho. The primal and dual solutions therefore satisfy the KKT conditions, which here are both sufficient and necessary. Following [Lee et al., 2020, Appendix D.2], we define the Lagrangian as

\displaystyle\mathcal{L}:=\sum_{s,a}\rho(s,a)\sum_{s^{\prime}}r(s,a)P(s^{\prime}|s,a)-D^{q^{*}}_{\!K\!L}\!\left(\rho\left|\!\right|\nu\right)+\sum_{s,a}\lambda(s,a)\rho(s,a)+\sum_{s}\zeta(s)\left(d(s)+\sum_{s^{\prime},a^{\prime}}P(s|s^{\prime},a^{\prime})\rho(s^{\prime},a^{\prime})-\sum_{a}\rho(s,a)\right),

where \lambda(s,a) and \zeta(s) are the dual variables for the nonnegativity and Bellman flow constraints, respectively. The KKT conditions are:

\displaystyle\forall s,a,\quad\rho^{*}(s,a)\geq 0,
\displaystyle d(s)+\sum_{s^{\prime},a^{\prime}}P(s|s^{\prime},a^{\prime})\rho^{*}(s^{\prime},a^{\prime})-\sum_{a}\rho^{*}(s,a)=0,
\displaystyle\lambda^{*}(s,a)\leq 0,\quad\lambda^{*}(s,a)\rho^{*}(s,a)=0,
\displaystyle 0=\sum_{s^{\prime}}r(s,a)P(s^{\prime}|s,a)+\gamma\sum_{s^{\prime}}\zeta^{*}(s^{\prime})P(s^{\prime}|s,a)-\zeta^{*}(s)+\lambda^{*}(s,a)-\frac{\partial D^{q^{*}}_{\!K\!L}\!\left(\rho^{*}\left|\!\right|\nu\right)}{\partial\rho(s,a)},
\displaystyle\text{where }-\frac{\partial D^{q^{*}}_{\!K\!L}\!\left(\rho^{*}\left|\!\right|\nu\right)}{\partial\rho(s,a)}=-\ln_{q^{*}}{\frac{\rho^{*}(s,a)\sum_{a^{\prime}}\nu(s,a^{\prime})}{\nu(s,a)\sum_{a^{\prime}}\rho^{*}(s,a^{\prime})}}-\left(\frac{\rho^{*}(s,a)\sum_{a^{\prime}}\nu(s,a^{\prime})}{\nu(s,a)\sum_{a^{\prime}}\rho^{*}(s,a^{\prime})}\right)^{q^{*}-1}+\sum_{a}\left(\frac{\rho^{*}(s,a)}{\sum_{a^{\prime}}\rho^{*}(s,a^{\prime})}\right)^{q^{*}}\left(\frac{\sum_{a^{\prime}}\nu(s,a^{\prime})}{\nu(s,a)}\right)^{q^{*}-1}.

Following [Lee et al., 2020], the dual variable \zeta^{*}(s) can be shown to equal the optimal state value function V^{*}(s), and \lambda^{*}(s,a)=0 whenever \rho^{*}(s,a)>0.

By noticing that x^{q^{*}-1}=(q^{*}-1)\ln_{q^{*}}{x}+1, we can show that -\frac{\partial D^{q^{*}}_{\!K\!L}\!\left(\rho^{*}\left|\!\right|\nu\right)}{\partial\rho(s,a)}=-q^{*}\ln_{q^{*}}{\frac{\rho^{*}(s,a)\sum_{a^{\prime}}\nu(s,a^{\prime})}{\nu(s,a)\sum_{a^{\prime}}\rho^{*}(s,a^{\prime})}}-1+\sum_{a}\left(\frac{\rho^{*}(s,a)}{\sum_{a^{\prime}}\rho^{*}(s,a^{\prime})}\right)^{q^{*}}\left(\frac{\sum_{a^{\prime}}\nu(s,a^{\prime})}{\nu(s,a)}\right)^{q^{*}-1}. Substituting \zeta^{*}(s)=V^{*}(s), \pi^{*}(a|s)=\frac{\rho^{*}(s,a)}{\sum_{a^{\prime}}\rho^{*}(s,a^{\prime})}, and \mu(a|s)=\frac{\nu(s,a)}{\sum_{a^{\prime}}\nu(s,a^{\prime})} into the above KKT condition and leveraging the equality Q^{*}(s,a)=r(s,a)+\mathbb{E}_{s^{\prime}\sim P}[\gamma\zeta^{*}(s^{\prime})], we have:

\displaystyle Q^{*}(s,a)-V^{*}(s)-q^{*}\ln_{q^{*}}{\frac{\pi(a|s)}{\mu(a|s)}}-1+\sum_{a^{\prime}}\pi(a^{\prime}|s)\left(\frac{\pi(a^{\prime}|s)}{\mu(a^{\prime}|s)}\right)^{q^{*}-1}=0
\displaystyle\Leftrightarrow\pi^{*}(a|s)=\mu(a|s)\exp_{q^{*}}\left(\frac{Q^{*}(s,a)}{q^{*}}-\frac{V^{*}(s)+1-\sum_{a^{\prime}}\pi(a^{\prime}|s)\left(\frac{\pi(a^{\prime}|s)}{\mu(a^{\prime}|s)}\right)^{q^{*}-1}}{q^{*}}\right).

By comparing this to the maximum Tsallis entropy policy [Lee et al., 2020, Eq. (49)], we see that the only difference lies in the baseline term \mu(a|s)^{-(q^{*}-1)}, which is expected since we are exploiting Tsallis KL regularization. Let us define the normalization function as

\displaystyle\psi\left(\frac{Q^{*}(s,\cdot)}{q^{*}}\right)=\frac{V^{*}(s)+1-\sum_{a^{\prime}}\pi(a^{\prime}|s)\left(\frac{\pi(a^{\prime}|s)}{\mu(a^{\prime}|s)}\right)^{q^{*}-1}}{q^{*}},

then we can write the policy as

\displaystyle\pi^{*}(a|s)=\mu(a|s)\exp_{q^{*}}\left(\frac{Q^{*}(s,a)}{q^{*}}-\psi\left(\frac{Q^{*}(s,\cdot)}{q^{*}}\right)\right).

In a way similar to KL regularized policies, at the (k+1)-th update we take \pi^{*}=\pi_{k+1}, \mu=\pi_{k}, and Q^{*}=Q_{k}, and write \pi_{k+1}\propto\pi_{k}\exp_{q^{*}}\!Q_{k}, since the normalization function does not depend on actions. We ignore the scaling constant q^{*} and the regularization coefficient. One can now expand the Tsallis KL policy as:

\displaystyle\pi_{k+1}\propto\pi_{k}\exp_{q^{*}}{\left(Q_{k}\right)}\propto\pi_{k-1}\exp_{q^{*}}{\left(Q_{k-1}\right)}\exp_{q^{*}}{\left(Q_{k}\right)}\propto\cdots\propto\exp_{q^{*}}{\left(Q_{1}\right)}\cdots\exp_{q^{*}}{\left(Q_{k}\right)},

which proves the first part of Eq. (7).

D.2 Tsallis KL Policies Do More than Average

We now show the second part of Eq. (7), which states that Tsallis KL policies do more than average the action values. This follows from the lemma below:

Lemma 2 (Eq. (25) of [Yamano, 2002]).
\displaystyle\left(\exp_{q}{x_{1}}\dots\exp_{q}{x_{k}}\right)^{1-q}=\exp_{q}\left(\sum_{j=1}^{k}x_{j}\right)^{1-q}+\sum_{j=2}^{k}(1-q)^{j}\sum_{1\leq i_{1}<\dots<i_{j}\leq k}x_{i_{1}}\cdots x_{i_{j}}. (19)

However, the mismatch between the base q and the exponent 1-q is inconvenient. We exploit the duality q^{*}=2-q to show that this property holds for q^{*} as well:

\displaystyle\left(\exp_{q^{*}}{x}\cdot\exp_{q^{*}}{y}\right)^{q^{*}-1}=\left[1+(q^{*}-1)x\right]_{+}\cdot\left[1+(q^{*}-1)y\right]_{+}
\displaystyle=\left[1+(q^{*}-1)x+(q^{*}-1)y+(q^{*}-1)^{2}xy\right]_{+}
\displaystyle=\exp_{q^{*}}\!\left(x+y\right)^{q^{*}-1}+(q^{*}-1)^{2}xy.

Now that we have proved the two-point property for q^{*}, the same induction steps as in [Yamano, 2002, Eq. (25)] conclude the proof. The weighted-average part, Eq. (8), follows immediately from [Suyari et al., 2020, Eq. (18)].
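
The two-point identity is straightforward to verify numerically; the sketch below uses q^{*}=2 and small positive arguments so that no truncation occurs (illustrative values only):

```python
import numpy as np

def exp_qstar(x, qs):
    # q*-exponential: [1 + (q*-1) x]_+^{1/(q*-1)}.
    return np.maximum(1.0 + (qs - 1.0) * x, 0.0) ** (1.0 / (qs - 1.0))

qs, x, y = 2.0, 0.3, 0.7
lhs = (exp_qstar(x, qs) * exp_qstar(y, qs)) ** (qs - 1.0)
rhs = exp_qstar(x + y, qs) ** (qs - 1.0) + (qs - 1.0) ** 2 * x * y
assert np.isclose(lhs, rhs)

# Pseudo-additivity: the product of q*-exponentials differs from exp_q* of the sum.
print(exp_qstar(x, qs) * exp_qstar(y, qs), exp_qstar(x + y, qs))
```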

Appendix E Implementation Details

We list the hyperparameters for the Gym environments in Table 1. The epsilon threshold is fixed at 0.01 from the beginning of learning. FC n refers to a fully connected layer with n activation units.

The Q-network for the Atari games uses 3 convolutional layers. The epsilon-greedy threshold is initialized at 1.0 and gradually decays to 0.01 over the first 10% of learning. We ran the algorithms with the swept hyperparameters for the full 5\times 10^{7} steps on the two selected Atari environments to pick the best hyperparameters.
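
For reference, the epsilon-greedy schedule just described can be sketched as follows; the decay is assumed to be linear, since the text only specifies the endpoints and the 10% horizon:

```python
def epsilon_schedule(step, total_steps=int(5e7), eps_start=1.0, eps_end=0.01, decay_frac=0.10):
    """Linear epsilon-greedy schedule sketch: decay from eps_start to eps_end over the
    first decay_frac of training, then hold (the linear shape is an assumption)."""
    decay_steps = int(total_steps * decay_frac)
    if step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * step / decay_steps

print(epsilon_schedule(0), epsilon_schedule(int(2.5e6)), epsilon_schedule(int(1e7)))
```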

Figure 6 shows the performance of MVI(q) on Cartpole-v1 and Acrobot-v1, and Figure 7 shows the full learning curves of MVI(q) on the Atari games. Figures 8 and 9 show the full learning curves of Tsallis-VI.

Figure 6: (Left) MVI(q) compared with MVI in which the logarithm \ln\pi is simply replaced by \ln_{q}\pi on Cartpole-v1, when q=2. Results are averaged over 50 independent runs. The flat learning curve is due to the pseudo-additivity. (Right) MVI(q) on Acrobot-v1 with different choices of q, each independently fine-tuned. The black bars indicate 95% confidence intervals over 50 independent runs.
Table 1: Parameters used for Gym.
Network Parameter | Value | Algorithm Parameter | Value
T (total steps) | 5\times 10^{5} | \gamma (discount rate) | 0.99
C (interaction period) | 4 | \epsilon (epsilon greedy threshold) | 0.01
|B| (buffer size) | 5\times 10^{4} | \tau (Tsallis entropy coefficient) | 0.03
B_{t} (batch size) | 128 | \alpha (advantage coefficient) | 0.9
I (update period) | 100 (Cartpole) / 2500 (Acrobot)
Q-network architecture | FC512 - FC512
activation units | ReLU
optimizer | Adam
optimizer learning rate | 10^{-3}
Table 2: Parameters used for Atari games.
Network Parameter | Value | Algorithmic Parameter | Value
T (total steps) | 5\times 10^{7} | \gamma (discount rate) | 0.99
C (interaction period) | 4 | \tau_{\texttt{MVI($q$)}} (MVI(q) entropy coefficient) | 10
|B| (buffer size) | 1\times 10^{6} | \alpha_{\texttt{MVI($q$)}} (MVI(q) advantage coefficient) | 0.9
B_{t} (batch size) | 32 | \tau_{\texttt{Tsallis}} (Tsallis-VI entropy coefficient) | 10
I (update period) | 8000 | \alpha_{\texttt{M-VI}} (M-VI advantage coefficient) | 0.9
activation units | ReLU | \tau_{\texttt{M-VI}} (M-VI entropy coefficient) | 0.03
optimizer | Adam | \epsilon (epsilon greedy threshold) | 1.0 \rightarrow 0.01 (first 10% of steps)
optimizer learning rate | 10^{-4}
Q-network architecture | \text{Conv}^{4}_{8,8}32 - \text{Conv}^{2}_{4,4}64 - \text{Conv}^{1}_{3,3}64 - FC512 - FC
Figure 7: Learning curves of MVI(q) and M-VI on the selected Atari games.
Figure 8: Learning curves of MVI(q) and Tsallis-VI on the selected Atari games.
Figure 9: (cont'd) Learning curves of MVI(q) and Tsallis-VI on the selected Atari games.