
Policy Mirror Descent for Regularized Reinforcement Learning:
A Generalized Framework with Linear Convergence

Wenhao Zhan*
Department of Electrical and Computer Engineering, Princeton University
   Shicong Cen*
Department of Electrical and Computer Engineering, Carnegie Mellon University
   Baihe Huang
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
   Yuxin Chen
Department of Statistics and Data Science, Wharton School, University of Pennsylvania
   Jason D. Lee
Department of Electrical and Computer Engineering, Princeton University
   Yuejie Chi
Department of Electrical and Computer Engineering, Carnegie Mellon University
(May 2021;   Final: January 2023)
*The first two authors contributed equally.
Abstract

Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.

Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (Lan, 2022), our algorithm accommodates a general class of convex regularizers and promotes the use of a Bregman divergence tailored to the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over the entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.

Keywords: policy mirror descent, Bregman divergence, regularization, nonsmooth, policy optimization

1 Introduction

Policy optimization lies at the heart of recent successes of reinforcement learning (RL) (Mnih et al., 2015). In its basic form, the optimal policy of interest, or a suitably parameterized version thereof, is learned by attempting to maximize the value function of a Markov decision process (MDP). For the most part, the maximization step is carried out by means of first-order optimization algorithms amenable to large-scale applications, whose foundations were set forth in the early works of Williams, (1992) and Sutton et al., (2000). Widely adopted variants in modern practice include policy gradient (PG) methods (Sutton et al., 2000), natural policy gradient (NPG) methods (Kakade, 2002), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and soft actor-critic methods (Haarnoja et al., 2018), to name just a few. In comparison with model-based and value-based approaches, this family of policy-based algorithms offers a remarkably flexible framework that accommodates both continuous and discrete action spaces, and lends itself well to the incorporation of powerful function approximation schemes such as neural networks. In stark contrast to its practical success, however, theoretical understanding of policy optimization remains severely limited even for the tabular case, largely owing to the ubiquitous nonconvexity of the underlying objective function.

1.1 The role of regularization

In practice, there are often competing objectives and additional constraints that the agent has to deal with in conjunction with maximizing values, which motivate the studies of regularization techniques in RL. In what follows, we isolate a few representative examples.

  • Promoting exploration. In the face of large problem dimensions and complex dynamics, it is often desirable to maintain a suitable degree of randomness in the policy iterates, in order to encourage exploration and discourage premature convergence to sub-optimal policies. A popular strategy of this kind is to enforce entropy regularization (Williams and Peng,, 1991), which penalizes policies that are not sufficiently stochastic. Along similar lines, the Tsallis entropy regularization (Chow et al., 2018b, ; Lee et al.,, 2018) further promotes sparsity of the learned policy while encouraging exploration, ensuring that the resulting policy does not assign non-negligible probabilities to too many sub-optimal actions.

  • Safe RL. In a variety of application scenarios such as industrial robot arms and self-driving vehicles, the agents are required to operate safely, both for themselves and for their surroundings (Amodei et al., 2016; Moldovan and Abbeel, 2012); for example, certain actions might be strictly forbidden in some states. One way to incorporate such prescribed operational constraints is to add a regularizer (e.g., a properly chosen log barrier or indicator function tailored to the constraints) that explicitly accounts for them.

  • Cost-sensitive RL. In reality, different actions of an agent might incur drastically different costs even for the same state. This motivates the design of new objective functions that properly trade off the cumulative rewards against the accumulated cost, which often take the form of certain regularized value functions.

Viewed in this light, it is of immediate value to develop a unified framework towards understanding the capability and limitations of regularized policy optimization. While a recent line of works (Agarwal et al., 2020; Mei et al., 2020b; Cen et al., 2022b) has looked into specific types of regularization such as entropy regularization, existing convergence theory remains highly inadequate when it comes to a more general family of regularizers.

1.2 Main contributions

The current paper focuses on policy optimization for regularized RL in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, and reward function $r(\cdot,\cdot)$. The goal is to find an optimal policy that maximizes a regularized value function. Informally speaking, the regularized value function associated with a given policy $\pi$ takes the following form:

V_{\tau}^{\pi} = V^{\pi} - \tau\,\mathbb{E}\big[h_{s}\big(\pi(\cdot\,|\,s)\big)\big],

where $V^{\pi}$ denotes the original (unregularized) value function, $\tau>0$ is the regularization parameter, $h_{s}(\cdot)$ denotes a convex regularizer employed to regularize the policy in state $s$, and the expectation is taken over a certain marginal state distribution of the MDP (to be made precise in Section 2.1). It is noteworthy that this paper does not require the regularizer $h_{s}$ to be either strongly convex or smooth.

In order to maximize the regularized value function (9b), Lan, (2022) introduced a seminal algorithm called Policy Mirror Descent (PMD), which can be viewed as an adaptation of the mirror descent algorithm (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003) to the realm of policy optimization. In particular, PMD subsumes the natural policy gradient (NPG) method (Kakade, 2002) as a special case. To further generalize PMD (Lan, 2022), we propose an algorithm called Generalized Policy Mirror Descent (GPMD). In each iteration, the policy is updated for each state in parallel via a mirror-descent style update rule. In sharp contrast to Lan, (2022), which considered a generic Bregman divergence, our algorithm selects the Bregman divergence adaptively in accordance with the regularizer, which leads to complementary perspectives and insights. Several important features and theoretical appeals of GPMD are summarized as follows.

  • GPMD substantially broadens the range of (provably effective) algorithmic choices for regularized RL, and subsumes several well-known algorithms as special cases. For example, it reduces to regularized policy iteration (Geist et al.,, 2019) when the learning rate tends to infinity, and subsumes entropy-regularized NPG methods as special cases if we take the Bregman divergence to be the Kullback-Leibler (KL) divergence (Cen et al., 2022b, ).

  • Assuming exact policy evaluation and perfect policy update in each iteration, GPMD converges linearly—in a dimension-free fashion—over the entire range of the learning rate $\eta>0$. More precisely, it converges to an $\varepsilon$-optimal regularized Q-function in no more than an order of

    \frac{1+\eta\tau}{\eta\tau(1-\gamma)}\log\frac{1}{\varepsilon}

    iterations (up to some logarithmic factor). Encouragingly, this appealing feature is valid for a broad family of convex and possibly nonsmooth regularizers.

  • The intriguing convergence guarantees are robust in the face of inexact policy evaluation and imperfect policy updates, namely, the algorithm is guaranteed to converge linearly at the same rate until an error floor is hit. See Section 3.2 for details.

  • Numerical experiments are provided in Section 5 to demonstrate the practical applicability and appealing performance of the proposed GPMD algorithm.

Finally, we find it helpful to briefly compare the above findings with prior works. As soon as the learning rate satisfies $\eta\geq 1/\tau$, the iteration complexity of our algorithm is at most on the order of $\frac{1}{1-\gamma}\log\frac{1}{\varepsilon}$, thus matching that of regularized policy iteration (Geist et al., 2019). In comparison to Lan, (2022), our work sets forth a different framework for analyzing mirror-descent type algorithms for regularized policy optimization, generalizing and refining the approach in Cen et al., 2022b far beyond entropy regularization. When constant learning rates are employed, the linear convergence of PMD (Lan, 2022) critically requires the regularizer to be strongly convex, with only sublinear convergence theory established for merely convex regularizers. In contrast, we establish the linear convergence of GPMD under constant learning rates even in the absence of strong convexity. Furthermore, for the special case of entropy regularization, the stability analysis of GPMD also significantly improves over the prior art in Cen et al., 2022b, preventing the error floor from blowing up when the learning rate approaches zero, as well as incorporating the impact of optimization errors that were previously uncaptured. More detailed comparisons with Lan, (2022) and Cen et al., 2022b can be found in Section 3.

1.3 Related works

Before embarking on our algorithmic and theoretic developments, we briefly review a small sample of other related works.

Global convergence of policy gradient methods.

Recent years have witnessed a surge of activity towards understanding the global convergence properties of policy gradient methods and their variants for both continuous and discrete RL problems; examples include Fazel et al., (2018); Bhandari and Russo, (2019); Agarwal et al., (2020); Zhang et al., 2021b ; Wang et al., (2019); Mei et al., 2020a ; Bhandari and Russo, (2020); Khodadadian et al., (2021); Liu et al., (2020); Agazzi and Lu, (2020); Xu et al., (2019); Cen et al., 2022b ; Mei et al., (2021); Liu et al., (2019); Wang et al., (2021); Zhang et al., 2020a ; Zhang et al., 2021a ; Zhang et al., 2020b ; Shani et al., (2019), among others. Neu et al., (2017) provided the first interpretation of NPG methods as mirror descent (Nemirovsky and Yudin, 1983), thereby enabling the adaptation of techniques for analyzing mirror descent to the study of NPG-type algorithms such as TRPO (Shani et al., 2019; Tomar et al., 2020). It has been shown that the NPG method converges sub-linearly for unregularized MDPs with a fixed learning rate (Agarwal et al., 2020), and converges linearly if the learning rate is set adaptively (Khodadadian et al., 2021), via exact line search (Bhandari and Russo, 2020), or following a geometrically increasing schedule (Xiao, 2022). The global linear convergence of NPG holds more generally for an arbitrary fixed learning rate when entropy regularization is enforced (Cen et al., 2022b). Noteworthily, Li et al., (2023) established a lower bound indicating that softmax PG methods can take an exponential time—in the size of the state space—to converge, while the convergence rates of NPG-type methods are almost independent of the problem dimension. In addition, another line of recent works (Abbasi-Yadkori et al., 2019; Hao et al., 2021; Lazic et al., 2021) established regret bounds for approximate NPG methods—termed KL-regularized approximate policy iteration therein—for infinite-horizon undiscounted MDPs, which are beyond the scope of the current paper.

Regularization in RL.

Regularization has been introduced into the RL literature either through the lens of optimization (Dai et al., 2018; Agarwal et al., 2020), or through the lens of dynamic programming (Geist et al., 2019; Vieillard et al., 2020). Our work is clearly an instance of the former type. Several recent results in the literature merit particular attention: Agarwal et al., (2020) demonstrated sublinear convergence guarantees for PG methods in the presence of relative entropy regularization, Mei et al., 2020b established linear convergence of entropy-regularized PG methods, whereas Cen et al., 2022b derived an almost dimension-free linear convergence theory for NPG methods with entropy regularization. Most of the existing literature focused on entropy or KL-type regularization, and the study of general regularizers had been quite limited until the recent work of Lan, (2022). Regularized MDP problems are also closely related to the study of constrained MDPs, as both types of problems can be employed to model/promote constraint satisfaction in RL, as recently investigated in, e.g., Chow et al., 2018a ; Efroni et al., (2020); Ding et al., (2021); Yu et al., (2019); Xu et al., (2020). Note, however, that it is difficult to directly compare our algorithm with these methods, due to drastically different formulations and settings.

1.4 Notation

Let us introduce some notation that will be adopted throughout. For any set $\mathcal{A}$, we denote by $|\mathcal{A}|$ its cardinality, and let $\Delta(\mathcal{A})$ denote the probability simplex over $\mathcal{A}$. For any convex and differentiable function $h(\cdot)$, the Bregman divergence generated by $h(\cdot)$ is defined as

D_{h}(z,x) \coloneqq h(z) - h(x) - \big\langle \nabla h(x), z - x \big\rangle.    (1)

For any convex (but not necessarily differentiable) function $h(\cdot)$, we denote by $\partial h$ the subdifferential of $h$. Given two probability distributions $\pi_{1}$ and $\pi_{2}$ over $\mathcal{A}$, the KL divergence from $\pi_{2}$ to $\pi_{1}$ is defined as $\mathsf{KL}(\pi_{1}\,\|\,\pi_{2}) \coloneqq \sum_{a\in\mathcal{A}} \pi_{1}(a) \log\frac{\pi_{1}(a)}{\pi_{2}(a)}$. For any vectors $a=[a_{i}]_{1\leq i\leq n}$ and $b=[b_{i}]_{1\leq i\leq n}$, the notation $a\leq b$ (resp. $a\geq b$) means that $a_{i}\leq b_{i}$ (resp. $a_{i}\geq b_{i}$) for every $1\leq i\leq n$. We shall also use $1$ (resp. $0$) to denote the all-one (resp. all-zero) vector whenever it is clear from the context.
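To make the Bregman divergence (1) concrete, here is a minimal Python sketch (an illustration added for this exposition, not part of the original development; all helper names are ours): with the negative Shannon entropy as the generator $h$, the Bregman divergence between two distributions on the simplex recovers exactly the KL divergence defined above.

```python
import numpy as np

def neg_entropy(p):
    # h(p) = sum_i p_i log p_i (negative Shannon entropy)
    return np.sum(p * np.log(p))

def grad_neg_entropy(p):
    return np.log(p) + 1.0

def bregman(h, grad_h, z, x):
    # D_h(z, x) = h(z) - h(x) - <grad h(x), z - x>, cf. (1)
    return h(z) - h(x) - grad_h(x) @ (z - x)

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.4, 0.4, 0.2])
kl = np.sum(p1 * np.log(p1 / p2))            # KL(p1 || p2)
d = bregman(neg_entropy, grad_neg_entropy, p1, p2)
assert np.isclose(d, kl)                     # the two coincide on the simplex
```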

2 Model and algorithms

2.1 Problem settings

Markov decision process (MDP).

The focus of this paper is a discounted infinite-horizon Markov decision process, represented by $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$ (Bertsekas, 2017). Here, $\mathcal{S}\coloneqq\{1,\cdots,|\mathcal{S}|\}$ is the state space, $\mathcal{A}\coloneqq\{1,\cdots,|\mathcal{A}|\}$ is the action space, $\gamma\in[0,1)$ is the discount factor, $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the probability transition matrix (so that $P(\cdot\,|\,s,a)$ is the transition probability from state $s$ upon execution of action $a$), whereas $r:\mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function (so that $r(s,a)$ indicates the immediate reward received in state $s$ after action $a$ is executed). Here, we focus on finite-state and finite-action scenarios, meaning that both $|\mathcal{S}|$ and $|\mathcal{A}|$ are assumed to be finite. A policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ specifies a possibly randomized action selection rule, namely, $\pi(\cdot\,|\,s)$ represents the action selection probability in state $s$.

For any policy $\pi$, we define the associated value function $V^{\pi}:\mathcal{S}\to\mathbb{R}$ as follows:

\forall s\in\mathcal{S}: \qquad V^{\pi}(s) := \mathop{\mathbb{E}}\limits_{\substack{a_{t}\sim\pi(\cdot|s_{t}),\\ s_{t+1}\sim P(\cdot|s_{t},a_{t}),\ \forall t\geq 0}} \left[\sum_{t=0}^{\infty}\gamma^{t} r(s_{t},a_{t}) \ \Big|\ s_{0}=s\right],    (2)

which can be viewed as the utility function we wish to maximize. Here, the expectation is taken over the randomness of the MDP trajectory $\{(s_{t},a_{t})\}_{t\geq 0}$ induced by policy $\pi$. Similarly, when the initial action $a$ is fixed, we can define the action-value function (or Q-function) as follows:

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad Q^{\pi}(s,a) := \mathop{\mathbb{E}}\limits_{\substack{s_{t+1}\sim P(\cdot|s_{t},a_{t}),\\ a_{t+1}\sim\pi(\cdot|s_{t+1}),\ \forall t\geq 0}} \left[\sum_{t=0}^{\infty}\gamma^{t} r(s_{t},a_{t}) \ \Big|\ s_{0}=s, a_{0}=a\right].    (3)

As a well-known fact, the policy gradient of $V^{\pi}$ (w.r.t. the policy $\pi$) admits the following closed-form expression (Sutton et al., 2000):

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad \frac{\partial V^{\pi}(s_{0})}{\partial \pi(a\,|\,s)} = \frac{1}{1-\gamma}\, d_{s_{0}}^{\pi}(s)\, Q^{\pi}(s,a).    (4)

Here, $d_{s_{0}}^{\pi}\in\Delta(\mathcal{S})$ is the so-called discounted state visitation distribution defined as follows:

d_{s_{0}}^{\pi}(s) \coloneqq (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\, \mathbb{P}^{\pi}(s_{t}=s\,|\,s_{0}),    (5)

where $\mathbb{P}^{\pi}(s_{t}=s\,|\,s_{0})$ denotes the probability of $s_{t}=s$ when the MDP trajectory $\{s_{t}\}_{t\geq 0}$ is generated under policy $\pi$ given the initial state $s_{0}$.

Furthermore, the optimal value function and the optimal Q-function are defined and denoted by

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad V^{\star}(s) \coloneqq \max_{\pi} V^{\pi}(s), \qquad Q^{\star}(s,a) \coloneqq \max_{\pi} Q^{\pi}(s,a).    (6)

It is well known that there exists at least one optimal policy, denoted by $\pi^{\star}$, that simultaneously maximizes the value function and the Q-function for all state-action pairs (Agarwal et al., 2019).
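For a tabular MDP, the quantities (2)–(3) can be computed exactly by solving the linear Bellman equations; the following minimal sketch (ours, with a synthetic random MDP and illustrative helper names) makes this concrete and is implicitly what "exact policy evaluation" refers to later on.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact tabular policy evaluation.
    P: (S, A, S) transition probabilities; r: (S, A) rewards; pi: (S, A) policy."""
    S, A = r.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)          # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)                  # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi from the Bellman equation, cf. (2)
    Q = r + gamma * P @ V                          # Q^pi(s,a) = r(s,a) + gamma * E_{s'}[V^pi(s')], cf. (3)
    return V, Q

# a tiny synthetic MDP with 3 states and 2 actions
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))         # P[s, a, :] is a distribution over next states
r = rng.uniform(size=(3, 2))
pi = np.full((3, 2), 0.5)                          # uniform policy
V, Q = evaluate_policy(P, r, pi, gamma=0.9)
```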

Regularized MDP.

In practice, the agent is often asked to design policies that possess certain structural properties, in order to be cognizant of system constraints such as safety and operational constraints, as well as to encourage exploration during the optimization/learning stage. A natural strategy to achieve these goals is to resort to the following regularized value function w.r.t. a given policy $\pi$ (Neu et al., 2017; Mei et al., 2020b; Cen et al., 2022b; Lan, 2022):

\forall s\in\mathcal{S}: \qquad V^{\pi}_{\tau}(s) \coloneqq \mathop{\mathbb{E}}\limits_{\substack{a_{t}\sim\pi(\cdot|s_{t}),\\ s_{t+1}\sim P(\cdot|s_{t},a_{t}),\ \forall t\geq 0}} \left[\sum_{t=0}^{\infty}\gamma^{t}\Big\{r(s_{t},a_{t}) - \tau h_{s_{t}}\big(\pi(\cdot\,|\,s_{t})\big)\Big\} \ \Big|\ s_{0}=s\right]
= V^{\pi}(s) - \frac{\tau}{1-\gamma}\sum_{s^{\prime}\in\mathcal{S}} d_{s}^{\pi}(s^{\prime})\, h_{s^{\prime}}\big(\pi(\cdot\,|\,s^{\prime})\big),    (7)

where $h_{s}:\Delta_{\zeta}(\mathcal{A})\to\mathbb{R}$ stands for a convex and possibly nonsmooth regularizer for state $s$, $\tau>0$ denotes the regularization parameter, and $d_{s}^{\pi}(\cdot)$ is defined in (5). Here, for technical convenience, we assume throughout that $h_{s}(\cdot)$ ($s\in\mathcal{S}$) is well-defined over a “$\zeta$-neighborhood” of the probability simplex $\Delta(\mathcal{A})$ defined as follows:

\Delta_{\zeta}(\mathcal{A}) \coloneqq \left\{x=[x_{a}]_{a\in\mathcal{A}} \ \Big|\ x_{a}\geq 0 \text{ for all } a\in\mathcal{A};\ 1-\zeta\leq\sum_{a\in\mathcal{A}}x_{a}\leq 1+\zeta\right\},

where $\zeta>0$ can be an arbitrary constant. For instance, entropy regularization adopts the choice $h_{s}(p)=\sum_{i\in\mathcal{A}} p_{i}\log p_{i}$ for all $s\in\mathcal{S}$ and $p\in\Delta(\mathcal{A})$, which coincides with the negative Shannon entropy of a probability distribution. Similarly, KL regularization adopts the choice $h_{s}(p)=\mathsf{KL}(p\,\|\,p_{\mathsf{ref}})$, which penalizes distributions $p$ that deviate from the reference $p_{\mathsf{ref}}$. As another example, weighted $\ell_{1}$ regularization adopts the choice $h_{s}(p)=\sum_{i\in\mathcal{A}} w_{s,i} p_{i}$ for all $s\in\mathcal{S}$ and $p\in\Delta(\mathcal{A})$, where $w_{s,i}\geq 0$ is the cost of taking action $i$ in state $s$, so that the regularizer $h_{s}(\pi(\cdot\,|\,s))$ captures the expected cost of the policy $\pi$ in state $s$. Throughout this paper, we impose the following assumption.

Assumption 1.

Consider an arbitrarily small constant $\zeta>0$. For any $s\in\mathcal{S}$, suppose that $h_{s}(\cdot)$ is convex and

h_{s}(p) = \infty \qquad \text{for any } p\notin\Delta_{\zeta}(\mathcal{A}).    (8)
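For concreteness, the three regularizers mentioned above can be written down directly; the sketch below (ours, with illustrative function names) also enforces the extended-value convention (8) of Assumption 1 by returning $\infty$ outside $\Delta_{\zeta}(\mathcal{A})$.

```python
import numpy as np

def in_simplex_nbhd(p, zeta=1e-2):
    # membership in Delta_zeta(A): nonnegative entries, total mass within [1 - zeta, 1 + zeta]
    return np.all(p >= 0) and (1 - zeta) <= p.sum() <= (1 + zeta)

def neg_entropy(p, zeta=1e-2):
    # h_s(p) = sum_a p_a log p_a  (entropy regularization)
    if not in_simplex_nbhd(p, zeta):
        return np.inf
    return float(np.sum(p * np.log(np.maximum(p, 1e-300))))

def kl_to_ref(p, p_ref, zeta=1e-2):
    # h_s(p) = KL(p || p_ref)  (KL regularization toward a reference policy p_ref)
    if not in_simplex_nbhd(p, zeta):
        return np.inf
    return float(np.sum(p * (np.log(np.maximum(p, 1e-300)) - np.log(p_ref))))

def weighted_l1(p, w, zeta=1e-2):
    # h_s(p) = sum_a w_{s,a} p_a  (expected per-state action cost)
    if not in_simplex_nbhd(p, zeta):
        return np.inf
    return float(w @ p)
```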

Following the convention in prior literature (e.g., Mei et al., 2020b ), we also define the corresponding regularized Q-function as follows:

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad Q^{\pi}_{\tau}(s,a) := r(s,a) + \gamma \mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\big[V^{\pi}_{\tau}(s^{\prime})\big].    (9a)

As can be straightforwardly verified, one can also express $V^{\pi}_{\tau}$ in terms of $Q^{\pi}_{\tau}$ as

\forall s\in\mathcal{S}: \qquad V^{\pi}_{\tau}(s) := \mathop{\mathbb{E}}\limits_{a\sim\pi(\cdot|s)}\Big[Q^{\pi}_{\tau}(s,a) - \tau h_{s}\big(\pi(\cdot\,|\,s)\big)\Big].    (9b)

The optimal regularized value function VτV^{\star}_{\tau} and the corresponding optimal policy πτ\pi_{\tau}^{\star} are defined respectively as follows:

\forall s\in\mathcal{S}: \qquad V^{\star}_{\tau}(s) \coloneqq V^{\pi_{\tau}^{\star}}_{\tau}(s) = \max_{\pi} V^{\pi}_{\tau}(s), \qquad \pi_{\tau}^{\star} \coloneqq \arg\max_{\pi} V^{\pi}_{\tau}.    (10)

It is worth noting that Puterman, (2014) asserts the existence of an optimal policy $\pi_{\tau}^{\star}$ that achieves (10) simultaneously for all $s\in\mathcal{S}$. Correspondingly, we shall also define the resulting optimal regularized Q-function as

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad Q^{\star}_{\tau}(s,a) = Q^{\pi_{\tau}^{\star}}_{\tau}(s,a).    (11)
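The regularized value functions defined in (7) and (9) can likewise be computed in closed form for a tabular MDP: $V^{\pi}_{\tau}$ solves a linear system, and $Q^{\pi}_{\tau}$ follows from one application of the transition model. The sketch below (ours, using the negative entropy as a representative choice of $h_{s}$) is what exact regularized policy evaluation amounts to in the algorithms of Section 2.2.

```python
import numpy as np

def evaluate_regularized_policy(P, r, pi, gamma, tau, h):
    """Exact evaluation of V_tau^pi and Q_tau^pi for a tabular MDP.
    P: (S, A, S); r: (S, A); pi: (S, A); h maps a distribution over actions to a scalar."""
    S, A = r.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    h_pi = np.array([h(pi[s]) for s in range(S)])
    # V_tau^pi solves V = (r_pi - tau * h_pi) + gamma * P_pi V, which combines (9a) and (9b)
    V_tau = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi - tau * h_pi)
    # Q_tau^pi(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V_tau^pi(s')], cf. (9a)
    Q_tau = r + gamma * P @ V_tau
    return V_tau, Q_tau

neg_entropy = lambda p: float(np.sum(p * np.log(np.maximum(p, 1e-300))))
```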

2.2 Algorithm: generalized policy mirror descent

Motivated by PMD (Lan, 2022), we put forward a generalization of PMD that selects the Bregman divergence in accordance with the regularizer in use. A thorough comparison with Lan, (2022) will be provided after introducing our generalized PMD algorithm.

Review: mirror descent (MD) for the composite model.

To better elucidate our algorithmic idea, let us first briefly review the design of classical mirror descent—originally proposed by Nemirovsky and Yudin, (1983)—in the optimization literature. Consider the following composite model:

\text{minimize}_{x} \quad F(x) \coloneqq f(x) + h(x),

where the objective function consists of two components. The first component $f(\cdot)$ is assumed to be differentiable, while the second component $h(\cdot)$ can be more general and is commonly employed to model regularizers. To solve this composite problem, one variant of mirror descent adopts the following update rule (see also Beck, (2017); Duchi et al., (2010)):

x^{(k+1)} = \arg\min_{x}\left\{ f\big(x^{(k)}\big) + \big\langle \nabla f(x^{(k)}), x \big\rangle + h(x) + \frac{1}{\eta} D_{h}\big(x, x^{(k)}\big)\right\},    (12)

where $\eta>0$ is the learning rate (or step size), and $D_{h}(\cdot,\cdot)$ is the Bregman divergence defined in (1). Note that the first term within the curly brackets of (12) can be safely discarded, as it is a constant given $x^{(k)}$. In words, the above update rule approximates $f(x)$ via its first-order Taylor expansion $f\big(x^{(k)}\big) + \big\langle \nabla f(x^{(k)}), x \big\rangle$ at the point $x^{(k)}$, employs the Bregman divergence $D_{h}$ to monitor the difference between the new iterate and the current iterate $x^{(k)}$, and attempts to optimize this (properly monitored) approximation instead. While one can further generalize the Bregman divergence to $D_{\omega}$ for a different generator $\omega$, we shall restrict attention to the case $h=\omega$ in the current paper.
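As a concrete instance of (12) (spelled out here for intuition, under the assumption that the iterates stay on the probability simplex), take $h$ to be the negative Shannon entropy, so that $D_{h}$ is the KL divergence. Writing $g^{(k)}\coloneqq\nabla f(x^{(k)})$, the first-order optimality condition of (12) on the simplex yields the closed-form multiplicative update

x^{(k+1)}_{i} \;\propto\; \big(x^{(k)}_{i}\big)^{\frac{1}{1+\eta}} \exp\Big(-\frac{\eta}{1+\eta}\, g^{(k)}_{i}\Big), \qquad i=1,\ldots,n,

which interpolates between staying at $x^{(k)}$ (as $\eta\to 0$) and a pure softmax of $-g^{(k)}$ (as $\eta\to\infty$); the GPMD update introduced below inherits exactly this structure when $h_{s}$ is the negative Shannon entropy.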

The proposed algorithm.

We are now ready to present the algorithm we come up with, which is an extension of the PMD algorithm (Lan, 2022). For notational simplicity, we shall write

V^{(k)}_{\tau} \coloneqq V^{\pi^{(k)}}_{\tau}, \qquad Q^{(k)}_{\tau}(s,a) \coloneqq Q^{\pi^{(k)}}_{\tau}(s,a) \qquad\text{and}\qquad d^{(k)}_{s_{0}}(s) \coloneqq d^{\pi^{(k)}}_{s_{0}}(s)    (13)

throughout the paper, where $\pi^{(k)}$ denotes our policy estimate in the $k$-th iteration.

To begin with, suppose for simplicity that $h_{s}(\cdot)$ is differentiable everywhere. In the $k$-th iteration, a natural MD scheme that comes to mind for solving (7)—namely, $\text{maximize}_{\pi}\, V_{\tau}^{\pi}(s_{0})$ for a given initial state $s_{0}$—is the following update rule:

\pi^{(k+1)}(\cdot\,|\,s) = \arg\min_{p\in\Delta(\mathcal{A})}\left\{ -\Big\langle \nabla_{\pi(\cdot|s)} V_{\tau}^{\pi}(s_{0})\,\Big|_{\pi=\pi^{(k)}},\, p\Big\rangle + \frac{\tau}{1-\gamma} d_{s_{0}}^{(k)}(s)\, h_{s}(p) + \frac{1}{\eta^{\prime}} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s)\big)\right\}
= \arg\min_{p\in\Delta(\mathcal{A})}\left\{ \frac{1}{1-\gamma} d_{s_{0}}^{(k)}(s) \Big\{ -\big\langle Q_{\tau}^{(k)}(s,\cdot), p\big\rangle + \tau h_{s}(p)\Big\} + \frac{1}{\eta^{\prime}} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s)\big)\right\}
= \arg\min_{p\in\Delta(\mathcal{A})}\left\{ -\big\langle Q_{\tau}^{(k)}(s,\cdot), p\big\rangle + \tau h_{s}(p) + \frac{1}{\eta} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s)\big)\right\}    (14)

for every state $s\in\mathcal{S}$, which is a direct application of (12) to our setting. Here, we start with a learning rate $\eta^{\prime}$, and obtain the simplification in the last line by replacing $\eta^{\prime}$ with $\eta(1-\gamma)/d_{s_{0}}^{(k)}(s)$. Notably, the update strategy (14) is invariant to the initial state $s_{0}$, akin to natural policy gradient methods (Agarwal et al., 2020).

This update rule is well-defined when, say, $h_{s}$ is the negative entropy, since the algorithm guarantees $\pi^{(k)}>0$ at all times and hence $h_{s}$ is always differentiable at the $k$-th iterate (see Cen et al., 2022b ). In general, however, it is possible to encounter situations where the gradient of $h_{s}$ does not exist on the boundary (e.g., when $h_{s}$ represents a certain indicator function). To cope with such cases, we resort to a generalized version of the Bregman divergence (e.g., Kiwiel, (1997); Lan et al., (2011); Lan and Zhou, (2018)). To be specific, we replace the usual Bregman divergence $D_{h_{s}}(p,q)$ with the following metric

D_{h_{s}}(p,q;g_{s}) \coloneqq h_{s}(p) - h_{s}(q) - \langle g_{s}, p-q\rangle \geq 0,    (15)

where $g_{s}$ can be any vector falling within the subdifferential $\partial h_{s}(q)$. Here, the non-negativity in (15) follows directly from the definition of the subgradient of a convex function. The constraint on $g_{s}$ can be further relaxed by exploiting the requirement $p,q\in\Delta(\mathcal{A})$. In fact, for any vector $\xi_{s}=g_{s}-c_{s}1$ (with $c_{s}\in\mathbb{R}$ some constant and $1$ the all-one vector), one can readily see that

D_{h_{s}}(p,q;g_{s}) = h_{s}(p) - h_{s}(q) - \langle g_{s}, p-q\rangle = h_{s}(p) - h_{s}(q) - \langle \xi_{s}, p-q\rangle + c_{s}\langle 1, p-q\rangle
= h_{s}(p) - h_{s}(q) - \langle \xi_{s}, p-q\rangle = D_{h_{s}}(p,q;\xi_{s}),    (16)

where the last line is valid since $1^{\top}p = 1^{\top}q = 1$. As a result, everything boils down to identifying a vector $\xi_{s}$ that falls within $\partial h_{s}(q)$ up to a global shift.

Toward this end, we propose the following iterative rule for designing such a sequence of vectors as surrogates for the subgradients of $h_{s}$:

\xi^{(0)}(s,\cdot) \in \partial h_{s}\big(\pi^{(0)}(\cdot\,|\,s)\big);    (17a)
\xi^{(k+1)}(s,\cdot) = \frac{1}{1+\eta\tau}\xi^{(k)}(s,\cdot) + \frac{\eta}{1+\eta\tau} Q^{(k)}_{\tau}(s,\cdot), \qquad k\geq 0,    (17b)

where $\xi^{(k+1)}(s,\cdot)$ is a convex combination of the previous $\xi^{(k)}(s,\cdot)$ and $Q^{(k)}_{\tau}(s,\cdot)$, with more emphasis placed on $Q^{(k)}_{\tau}(s,\cdot)$ when the learning rate $\eta$ is large. As asserted by the following lemma, the vectors $\xi^{(k)}(s,\cdot)$ constructed above satisfy the desired property, i.e., they lie within the subdifferential of $h_{s}$ under suitable global shifts. It is worth mentioning that these global shifts $\{c_{s}^{(k)}\}$ only serve as an aid to better understand the construction, and are not required during the algorithm updates.

Lemma 1.

For all $k\geq 0$ and every $s\in\mathcal{S}$, there exists a quantity $c_{s}^{(k)}\in\mathbb{R}$ such that

\xi^{(k)}(s,\cdot) - c_{s}^{(k)} 1 \in \partial h_{s}\big(\pi^{(k)}(\cdot\,|\,s)\big).    (18)

In addition, for every $s\in\mathcal{S}$, there exists a quantity $c_{s}^{\star}\in\mathbb{R}$ such that

\tau^{-1} Q_{\tau}^{\star}(s,\cdot) - c_{s}^{\star} 1 \in \partial h_{s}\big(\pi_{\tau}^{\star}(\cdot\,|\,s)\big).    (19)
Proof.

See Appendix A.1. ∎

Thus far, we have presented all crucial ingredients of our algorithm. The whole procedure is summarized in Algorithm 1, and will be referred to as Generalized Policy Mirror Descent (GPMD) throughout the paper. Interestingly, several well-known algorithms can be recovered as special cases of GPMD:

  • When the Bregman divergence $D_{h_{s}}(\cdot,\cdot)$ is taken to be the KL divergence, GPMD reduces to the well-known NPG algorithm (Kakade, 2002) when $\tau=0$ (no regularization), and to the NPG algorithm with entropy regularization analyzed in Cen et al., 2022b when $h_{s}(\cdot)$ is taken to be the negative Shannon entropy.

  • When $\eta=\infty$ (no divergence), GPMD reduces to regularized policy iteration in Geist et al., (2019); in particular, GPMD reduces to the standard policy iteration algorithm if in addition $\tau=0$.

Input: initial policy iterate $\pi^{(0)}$, learning rate $\eta>0$.
Initialize $\xi^{(0)}$ so that $\xi^{(0)}(s,\cdot)\in\partial h_{s}\big(\pi^{(0)}(\cdot\,|\,s)\big)$ for all $s\in\mathcal{S}$.
for $k=0,1,\cdots$ do
      For every $s\in\mathcal{S}$, set

\pi^{(k+1)}(\cdot\,|\,s) = \arg\min_{p\in\Delta(\mathcal{A})}\left\{ -\big\langle Q_{\tau}^{(k)}(s,\cdot), p\big\rangle + \tau h_{s}(p) + \frac{1}{\eta} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s); \xi^{(k)}\big)\right\},    (20a)

      where

D_{h_{s}}\big(p,q;\xi\big) \coloneqq h_{s}(p) - h_{s}(q) - \big\langle \xi(s,\cdot), p-q\big\rangle.    (20b)

      For every $(s,a)\in\mathcal{S}\times\mathcal{A}$, compute

\xi^{(k+1)}(s,a) = \frac{1}{1+\eta\tau}\xi^{(k)}(s,a) + \frac{\eta}{1+\eta\tau} Q^{(k)}_{\tau}(s,a).    (20c)

end for
Algorithm 1: PMD with generalized Bregman divergence (GPMD)
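To make Algorithm 1 concrete, below is a minimal tabular sketch (our illustration, not the authors' reference implementation) for the special case where every $h_{s}$ is the negative Shannon entropy. In that case the subproblem (20a) admits the closed-form solution $\pi^{(k+1)}(\cdot\,|\,s)\propto\exp\big(\xi^{(k+1)}(s,\cdot)\big)$ with $\xi^{(k+1)}$ given by (20c), and exact policy evaluation amounts to solving the regularized Bellman equations.

```python
import numpy as np

def gpmd_entropy(P, r, gamma, tau, eta, num_iters=200):
    """GPMD (Algorithm 1) with h_s = negative Shannon entropy, tabular setting, exact evaluation.
    P: (S, A, S) transitions; r: (S, A) rewards."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)             # uniform initial policy pi^(0)
    xi = np.log(pi) + 1.0                     # xi^(0)(s, .) is a gradient of h_s at pi^(0), cf. (17a)
    for _ in range(num_iters):
        # exact regularized policy evaluation of Q_tau^(k), cf. (7), (9a), (9b)
        P_pi = np.einsum('sa,sap->sp', pi, P)
        r_pi = np.sum(pi * r, axis=1)
        h_pi = np.sum(pi * np.log(pi), axis=1)
        V_tau = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi - tau * h_pi)
        Q_tau = r + gamma * P @ V_tau
        # surrogate-subgradient update (20c); with entropy, (20a) reduces to a softmax of xi^(k+1)
        xi = (xi + eta * Q_tau) / (1.0 + eta * tau)
        logits = xi - xi.max(axis=1, keepdims=True)   # stabilize the softmax
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, Q_tau
```

As noted in the special cases above, this entropy instance coincides with entropy-regularized NPG; swapping in a different $h_{s}$ only changes the evaluation of $h_{s}(\pi(\cdot\,|\,s))$ and the solver used for the subproblem (20a).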
Comparison with PMD (Lan,, 2022).

Before continuing, let us take a moment to point out the key differences between our algorithm GPMD and the PMD algorithm proposed in Lan, (2022) in terms of algorithm design. Although the primary exposition of PMD in Lan, (2022) fixes the Bregman divergence to be the KL divergence, the algorithm also works in the presence of a generic Bregman divergence, whose relationship with the regularizer $h_{s}$ is, however, left unspecified. In contrast, GPMD adaptively sets this term to be the Bregman divergence generated by the regularizer $h_{s}$ in use, together with a carefully designed recursive update rule (cf. (17)) that computes surrogates for the subgradients of $h_{s}$ to facilitate implementation. Encouragingly, this specific choice leads to a tailored performance analysis of GPMD, which was not present in, and is instead complementary to, that of PMD (Lan, 2022). Indeed, our theory offers linear convergence guarantees for more general scenarios by adapting to the geometry of the regularizer $h_{s}$; details to follow momentarily.

3 Main results

This section presents our convergence guarantees for the GPMD method presented in Algorithm 1. We shall start with the idealized case assuming that the update rule can be precisely implemented, and then discuss how to generalize it to the scenario with imperfect policy evaluation.

3.1 Convergence of exact GPMD

To start with, let us pin down the convergence behavior of GPMD, assuming that the regularized Q-function $Q^{(k)}_{\tau}$ can be evaluated exactly and that the subproblem (20a) can be solved perfectly. Here and below, we shall refer to the algorithm in this case as exact GPMD. Encouragingly, exact GPMD provably achieves global linear convergence from an arbitrary initialization, as asserted by the following theorem.

Theorem 1 (Exact GPMD).

Suppose that Assumption 1 holds. Consider any learning rate $\eta>0$, and set $\alpha := \frac{1}{1+\eta\tau}$. Then the iterates of Algorithm 1 satisfy

\big\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\big\|_{\infty} \leq \gamma\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1},    (21a)
\big\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\big\|_{\infty} \leq (\gamma+2)\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1},    (21b)

for all $k\geq 0$, where $C_{1} \coloneqq \|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\|_{\infty} + 2\alpha\|Q^{\star}_{\tau} - \tau\xi^{(0)}\|_{\infty}$.

In addition, if $h_{s}$ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm for some $s\in\mathcal{S}$, then one further has

\big\|\pi_{\tau}^{\star}(s) - \pi_{\tau}^{(k+1)}(s)\big\|_{1} \leq \tau^{-1}\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1}, \qquad k\geq 0.    (22)

Our theorem confirms the fast global convergence of the GPMD algorithm, in terms of both the resulting regularized Q-value (if $h_{s}(\cdot)$ is convex) and the policy estimate (if $h_{s}(\cdot)$ is strongly convex). In summary, it takes GPMD no more than

\frac{1}{(1-\alpha)(1-\gamma)}\log\frac{C_{1}}{\varepsilon} = \frac{1+\eta\tau}{\eta\tau(1-\gamma)}\log\frac{C_{1}}{\varepsilon}    (23a)

iterations to converge to an $\varepsilon$-optimal regularized Q-function (in the $\ell_{\infty}$ sense), or

\frac{1}{(1-\alpha)(1-\gamma)}\log\frac{C_{1}}{\varepsilon\tau} = \frac{1+\eta\tau}{\eta\tau(1-\gamma)}\log\frac{C_{1}}{\varepsilon\tau}    (23b)

iterations to yield an $\varepsilon$-approximation (w.r.t. the $\ell_{1}$ norm error) of $\pi_{\tau}^{\star}$. The iteration complexity (23) is nearly dimension-free—namely, it depends at most logarithmically on the dimension of the state-action space—making it scalable to large-dimensional problems.

Comparison with Lan, (2022, Theorems 1-3).

To make clear our contributions, it is helpful to compare Theorem 1 with the theory for the state-of-the-art algorithm PMD in Lan, (2022).

  • Linear convergence for convex regularizers under constant learning rates. Suppose that constant learning rates are adopted for both GPMD and PMD. Our finding reveals that GPMD enjoys global linear convergence—in terms of both $\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}$ and $\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\|_{\infty}$—even when the regularizer $h_{s}(\cdot)$ is only convex but not strongly convex. In contrast, Lan, (2022, Theorem 2) provided only sublinear convergence guarantees (with an iteration complexity proportional to $1/\varepsilon$) for the case of convex regularizers, provided that constant learning rates are adopted. (In fact, Lan, (2022, Theorem 3) suggests using a vanishing strongly convex regularization, as well as a corresponding increasing sequence of learning rates, in order to enable linear convergence for non-strongly-convex regularizers.)

  • A full range of learning rates. Theorem 1 reveals linear convergence of GPMD for a full range of learning rates, namely, our result is applicable to any $\eta>0$. In comparison, linear convergence was established in Lan, (2022) only when the learning rates are sufficiently large and when $h_{s}$ is $1$-strongly convex w.r.t. the KL divergence. Consequently, the linear convergence results in Lan, (2022) do not extend to several widely used regularizers such as the negative Tsallis entropy and log-barrier functions (even after scaling), which are, in contrast, covered by our theory. It is worth noting that the case with small-to-medium learning rates is often more challenging to cope with in theory, given that its dynamics could differ drastically from that of regularized policy iteration.

  • Further comparison of rates under large learning rates. Lan, (2022, Theorem 1) achieves a contraction rate of $\gamma$ when the regularizer is strongly convex and the step size satisfies $\eta\geq\frac{1-\gamma}{\gamma\tau}$, while the contraction rate of GPMD is $1-\frac{\eta\tau}{1+\eta\tau}(1-\gamma)$ over the full range of step sizes, which is slower but approaches the contraction rate $\gamma$ of PMD as $\eta$ goes to infinity. Therefore, in the limit $\eta\to\infty$, both GPMD and PMD achieve the contraction rate $\gamma$. As soon as $\eta\geq 1/\tau$, their iteration complexities are of the same order.

Remark 1.

While our primary focus is to solve the regularized RL problem, one might be tempted to apply GPMD as a means to solve unregularized RL; for instance, one might run GPMD with the regularization parameter diminishing gradually in order to approach a policy with the desired accuracy. We leave the details to Appendix C.

3.2 Convergence of approximate GPMD

In reality, however, it is often the case that GPMD cannot be implemented in an exact manner, either because perfect policy evaluation is unavailable or because the subproblem (20a) cannot be solved exactly. To accommodate these practical considerations, this subsection generalizes our previous result by permitting inexact policy evaluation and non-zero optimization error in solving (20a). The following assumptions make precise this imperfect scenario.

Assumption 2 (Policy evaluation error).

Suppose for any $k\geq 0$, we have access to an estimate $\widehat{Q}^{(k)}_{\tau}$ obeying

\big\|\widehat{Q}^{(k)}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty} \leq \varepsilon_{\mathsf{eval}}.    (24)
Assumption 3 (Subproblem optimization error).

Consider any policy $\pi$ and any vector $\xi\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. Define

f_{s}(p;\pi,\xi) \coloneqq -\big\langle Q(s,\cdot), p\big\rangle + \tau h_{s}(p) + \frac{1}{\eta} D_{h_{s}}\big(p, \pi(\cdot\,|\,s); \xi(s,\cdot)\big),

where $D_{h_{s}}(p,q;\xi)$ is defined in (15). Suppose there exists an oracle $G_{s,\varepsilon_{\mathsf{opt}}}(Q,\pi,\xi)$, which is capable of returning $\pi^{\prime}(\cdot\,|\,s)$ such that

f_{s}\big(\pi^{\prime}(\cdot\,|\,s);\pi,\xi\big) \leq \min_{p\in\Delta(\mathcal{A})} f_{s}(p;\pi,\xi) + \varepsilon_{\mathsf{opt}}.    (25)

Note that the oracle in Assumption 3 can be implemented efficiently in practice via various first-order methods (Beck, 2017). Under Assumptions 2 and 3, we can modify Algorithm 1 by replacing $\{Q^{(k)}_{\tau}\}$ with the estimate $\{\widehat{Q}^{(k)}_{\tau}\}$, and invoking the oracle $G_{s,\varepsilon_{\mathsf{opt}}}(Q,\pi,\xi)$ to solve the subproblem (20a) approximately. The whole procedure, which we shall refer to as approximate GPMD, is summarized in Algorithm 2.

Input: initial policy $\pi^{(0)}$, learning rate $\eta>0$.
Initialize $\widehat{\xi}^{(0)}(s,\cdot)\in\partial h_{s}\big(\pi^{(0)}(\cdot\,|\,s)\big)$ for all $s\in\mathcal{S}$.
for $k=0,1,\cdots$ do
      For every $s\in\mathcal{S}$, invoke the oracle to obtain (cf. (25))

\pi^{(k+1)}(\cdot\,|\,s) = G_{s,\varepsilon_{\mathsf{opt}}}\big(\widehat{Q}^{(k)}_{\tau}, \pi^{(k)}, \widehat{\xi}^{(k)}\big).    (26)

      For every $(s,a)\in\mathcal{S}\times\mathcal{A}$, compute

\widehat{\xi}^{(k+1)}(s,a) = \frac{1}{1+\eta\tau}\widehat{\xi}^{(k)}(s,a) + \frac{\eta}{1+\eta\tau}\widehat{Q}^{(k)}_{\tau}(s,a).    (27)

end for
Algorithm 2: Approximate PMD with generalized Bregman divergence (approximate GPMD)
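As one way to realize the oracle of Assumption 3 (a sketch of ours, not a procedure prescribed by the paper), projected gradient descent on $f_{s}$ over the simplex can be used whenever $h_{s}$ is differentiable with a bounded gradient there (e.g., the weighted $\ell_{1}$ cost); for the entropy case, the closed-form update discussed after Algorithm 1 is preferable. All function names below are illustrative.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def oracle(Q_s, pi_s, xi_s, tau, eta, grad_h, step=0.05, num_steps=1000):
    """Approximately minimize f_s(p) = -<Q_s, p> + tau*h_s(p) + (1/eta)*D_{h_s}(p, pi_s; xi_s)
    over the simplex (cf. (25)) via projected gradient descent; grad_h is the gradient of h_s."""
    p = pi_s.copy()
    for _ in range(num_steps):
        grad = -Q_s + tau * grad_h(p) + (grad_h(p) - xi_s) / eta
        p = project_simplex(p - step * grad)
    return p
```

In the notation of (26), this would be called with $Q_{s}=\widehat{Q}^{(k)}_{\tau}(s,\cdot)$, $\pi_{s}=\pi^{(k)}(\cdot\,|\,s)$ and $\xi_{s}=\widehat{\xi}^{(k)}(s,\cdot)$; any off-the-shelf first-order solver reaching accuracy $\varepsilon_{\mathsf{opt}}$ serves equally well.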

The following theorem uncovers that approximate GPMD converges linearly—at the same rate as exact GPMD—before an error floor is hit.

Theorem 2 (Approximate GPMD).

Suppose that Assumptions 1, 2 and 3 hold. Consider any learning rate $\eta>0$. Then the iterates of Algorithm 2 satisfy

\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty} \leq \gamma\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{2}\right],    (28a)
\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\|_{\infty} \leq (\gamma+2)\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{2}\right] + (1-\alpha)\varepsilon_{\mathsf{opt}},    (28b)

where $\alpha\coloneqq\frac{1}{1+\eta\tau}$, $C_{1}$ is defined in Theorem 1, and

C_{2} \coloneqq \frac{1}{1-\gamma}\left[\left(2+\frac{2\gamma}{(1-\gamma)(1-\alpha)}\right)\varepsilon_{\mathsf{eval}} + \left(1+\frac{2\gamma}{(1-\gamma)(1-\alpha)}\right)\varepsilon_{\mathsf{opt}}\right].

In addition, if $h_{s}$ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm for any $s\in\mathcal{S}$, then we can further obtain

\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty} \leq \gamma\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{3}\right],    (29a)
\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\|_{\infty} \leq (\gamma+2)\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{3}\right] + (1-\alpha)\varepsilon_{\mathsf{opt}},    (29b)
\big\|\pi_{\tau}^{\star}(\cdot\,|\,s) - \pi^{(k+1)}(\cdot\,|\,s)\big\|_{1} \leq \tau^{-1}\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{3}\right] + \sqrt{\frac{2\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}},    (29c)

where

C_{3} \coloneqq \frac{1}{1-\gamma}\left[\left(2+\frac{\varepsilon_{\mathsf{eval}}\gamma}{\tau(1-\gamma)}\right)\varepsilon_{\mathsf{eval}} + \left(1+\frac{4\gamma}{(1-\gamma)(1-\alpha)}\right)\varepsilon_{\mathsf{opt}}\right].    (30)

In the special case where $\varepsilon_{\mathsf{opt}}=0$ and $\eta=\infty$, Algorithm 2 reduces to regularized policy iteration, and the convergence result simplifies to

\big\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty} \leq \gamma^{k}\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + \frac{2\gamma\varepsilon_{\mathsf{eval}}}{(1-\gamma)^{2}}.

In particular, when $h_{s}$ is taken to be the negative entropy, our result strengthens the prior result established in Cen et al., 2022b for the approximate entropy-regularized NPG method with $\varepsilon_{\mathsf{opt}}=0$ over a wide range of learning rates. Specifically, the error bound in Cen et al., 2022b reads $\gamma\cdot\frac{\varepsilon_{\mathsf{eval}}}{1-\gamma}\left(2+\frac{2\gamma}{\eta\tau}\right)$, where the second term in the bracket scales inversely with $\eta$ and therefore grows unboundedly as $\eta$ approaches $0$. In contrast, (29) and (30) yield a bound $\gamma\cdot\frac{\varepsilon_{\mathsf{eval}}}{1-\gamma}\left(2+\frac{\varepsilon_{\mathsf{eval}}\gamma}{\tau(1-\gamma)}\right)$, which is independent of the learning rate $\eta$ in use and thus prevents the error bound from blowing up as the learning rate approaches $0$. Indeed, our result improves over the prior art Cen et al., 2022b whenever $\eta\leq\frac{2(1-\gamma)}{\varepsilon_{\mathsf{eval}}}$.

Remark 2 (Sample complexities).

One might naturally ask how many samples are sufficient to learn an $\varepsilon$-optimal regularized Q-function by leveraging sample-based policy evaluation algorithms within GPMD. Notice that it is straightforward to consider an expected version of Assumption 2 as follows:

\begin{cases} \mathbb{E}\big[\big\|\widehat{Q}^{(k)}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty}\big] \leq \varepsilon_{\mathsf{eval}},\\ \mathbb{E}\big[\big\|\widehat{Q}^{(k)}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty}^{2}\big] \leq \varepsilon_{\mathsf{eval}}^{2},\end{cases}

where the expectation is with respect to the randomness in policy evaluation; the convergence results in Theorem 2 then apply to $\mathbb{E}\big[\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}\big]$ and $\mathbb{E}\big[\big\|\pi_{\tau}^{\star}(\cdot\,|\,s) - \pi_{\tau}^{(k+1)}(\cdot\,|\,s)\big\|_{1}\big]$ instead. This randomized version makes it immediately amenable to combination with, e.g., the rollout-based policy evaluators in Lan, (2022, Section 5.1) to obtain (possibly crude) bounds on the sample complexity. We omit these straightforward developments.

Roughly speaking, approximate GPMD is guaranteed to converge linearly until it hits an error floor that scales linearly in both the policy evaluation error $\varepsilon_{\mathsf{eval}}$ and the optimization error $\varepsilon_{\mathsf{opt}}$, thus confirming the stability of our algorithm vis-à-vis imperfect implementation. As before, our theory improves upon prior works by demonstrating linear convergence for a full range of learning rates, even in the absence of strong convexity and smoothness.

4 Analysis for exact GPMD (Theorem 1)

In this section, we present the analysis for our main result in Theorem 1, which follows a different framework from Lan, (2022). Here and throughout, we shall often employ the following shorthand notation when it is clear from the context:

\pi^{(k)}(s) \coloneqq \pi^{(k)}(\cdot\,|\,s) \in \Delta(\mathcal{A}), \qquad Q^{\pi}(s) \coloneqq Q^{\pi}(s,\cdot) \in \mathbb{R}^{|\mathcal{A}|}, \qquad \xi^{(k)}(s) \coloneqq \xi^{(k)}(s,\cdot) \in \mathbb{R}^{|\mathcal{A}|}, \qquad Q_{\tau}^{\pi}(s) \coloneqq Q_{\tau}^{\pi}(s,\cdot) \in \mathbb{R}^{|\mathcal{A}|},    (33)

in addition to those already defined in (13).

4.1 Preparation: basic facts

In this subsection, we single out a few basic results that underlie the proof of our main theorems.

Performance improvement.

To begin with, we demonstrate that GPMD enjoys monotonic improvements in the updates of both the value function and the Q-function, as stated in the following lemma. This lemma can be viewed as a generalization of the well-established policy improvement lemma in the analysis of NPG (Agarwal et al., 2020; Cen et al., 2022b) as well as PMD (Lan, 2022).

Lemma 2 (Pointwise monotonicity).

For any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $k\geq 0$, Algorithm 1 achieves

V^{(k+1)}_{\tau}(s) \geq V^{(k)}_{\tau}(s) \qquad\text{and}\qquad Q^{(k+1)}_{\tau}(s,a) \geq Q^{(k)}_{\tau}(s,a).    (34)
Proof.

See Appendix A.2. ∎

Interestingly, the above monotonicity holds simultaneously for all state-action pairs, and hence can be understood as a kind of pointwise monotonicity.

Generalized Bellman operator.

Another key ingredient of our proof lies in the use of a generalized Bellman operator $\mathcal{T}_{\tau,h}:\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}\rightarrow\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ associated with the regularizer $h=\{h_{s}\}_{s\in\mathcal{S}}$. Specifically, for any state-action pair $(s,a)$ and any vector $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we define

\mathcal{T}_{\tau,h}(Q)(s,a) = r(s,a) + \gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\Big\{\big\langle Q(s^{\prime}), p\big\rangle - \tau h_{s^{\prime}}(p)\Big\}\right].    (35)

It is worth noting that this definition is similar in spirit to the regularized Bellman operator proposed in Geist et al., (2019); the operator therein is defined w.r.t. $V_{\tau}$, while ours is defined w.r.t. $Q_{\tau}$.

The importance of this generalized Bellman operator is two-fold: it enjoys a desired contraction property, and its fixed point corresponds to the optimal regularized Q-function. These are generalizations of the properties for the classical Bellman operator, and are formally stated in the following lemma. The proof is deferred to Appendix A.3.

Lemma 3 (Properties of the generalized Bellman operator).

For any $\tau>0$, the operator $\mathcal{T}_{\tau,h}$ defined in (35) satisfies the following properties:

  • $\mathcal{T}_{\tau,h}$ is a contraction operator w.r.t. the $\ell_{\infty}$ norm, namely, for any $Q_{1},Q_{2}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, one has

    \big\|\mathcal{T}_{\tau,h}(Q_{1}) - \mathcal{T}_{\tau,h}(Q_{2})\big\|_{\infty} \leq \gamma\|Q_{1}-Q_{2}\|_{\infty}.    (36)

  • The optimal regularized Q-function $Q^{\star}_{\tau}$ is a fixed point of $\mathcal{T}_{\tau,h}$, that is,

    \mathcal{T}_{\tau,h}(Q^{\star}_{\tau}) = Q^{\star}_{\tau}.    (37)
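For intuition, the sketch below (ours, with illustrative names and a synthetic instance) instantiates the operator (35) for the negative-entropy regularizer, in which case the inner maximization has the closed form $\max_{p\in\Delta(\mathcal{A})}\{\langle Q(s'),p\rangle-\tau h_{s'}(p)\}=\tau\log\sum_{a}\exp\big(Q(s',a)/\tau\big)$, and numerically checks the $\gamma$-contraction property (36).

```python
import numpy as np
from scipy.special import logsumexp

def bellman_reg_entropy(Q, P, r, gamma, tau):
    # generalized Bellman operator (35) with h_s = negative entropy:
    # the soft value of state s' is tau * logsumexp(Q(s', .) / tau)
    V_soft = tau * logsumexp(Q / tau, axis=1)      # shape (S,)
    return r + gamma * P @ V_soft                  # shape (S, A)

# empirical check of the gamma-contraction (36) on a random instance
rng = np.random.default_rng(1)
S, A, gamma, tau = 4, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
Q1, Q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
lhs = np.max(np.abs(bellman_reg_entropy(Q1, P, r, gamma, tau)
                    - bellman_reg_entropy(Q2, P, r, gamma, tau)))
assert lhs <= gamma * np.max(np.abs(Q1 - Q2)) + 1e-12
```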

4.2 Proof of Theorem 1

Inspired by Cen et al., 2022b , our proof consists of (i) characterizing the dynamics of \ell_{\infty} errors and establishing a connection to a useful linear system with two variables, and (ii) analyzing the dynamics of this linear system directly. In what follows, we elaborate on each of these steps.

Step 1: error contraction and its connection to a linear system.

With the assistance of the above preparations, we are ready to elucidate how to characterize the convergence behavior of $\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}$. Recalling the update rule of $\xi^{(k+1)}$ (cf. (20c)), we can deduce that

Q^{\star}_{\tau} - \tau\xi^{(k+1)} = \alpha\big(Q^{\star}_{\tau} - \tau\xi^{(k)}\big) + (1-\alpha)\big(Q^{\star}_{\tau} - Q^{(k)}_{\tau}\big)

with $\alpha=\frac{1}{1+\eta\tau}$, thus indicating that

\big\|Q^{\star}_{\tau} - \tau\xi^{(k+1)}\big\|_{\infty} \leq \alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(k)}\big\|_{\infty} + (1-\alpha)\big\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty}.    (38)

Interestingly, there exists an intimate connection between $\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}$ and $\|Q^{\star}_{\tau} - \tau\xi^{(k+1)}\|_{\infty}$ that allows us to bound the former term by the latter. This is stated in the following lemma, with the proof postponed to Appendix A.4.

Lemma 4.

Set $\alpha=\frac{1}{1+\eta\tau}$. The iterates of Algorithm 1 satisfy

\big\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\big\|_{\infty} \leq \gamma\big\|Q^{\star}_{\tau} - \tau\xi^{(k+1)}\big\|_{\infty} + \gamma\alpha^{k+1}\big\|Q^{(0)}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}.    (39)

The above inequalities (38) and (39) can be succinctly described via a useful linear system with two variables $\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\|_{\infty}$ and $\|Q^{\star}_{\tau} - \tau\xi^{(k)}\|_{\infty}$, that is,

x_{k+1} \leq A x_{k} + \gamma\alpha^{k+1} y,    (40)

where

A \coloneqq \begin{bmatrix}\gamma(1-\alpha) & \gamma\alpha\\ 1-\alpha & \alpha\end{bmatrix}, \qquad x_{k} \coloneqq \begin{bmatrix}\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\|_{\infty}\\ \|Q^{\star}_{\tau} - \tau\xi^{(k)}\|_{\infty}\end{bmatrix} \qquad\text{and}\qquad y \coloneqq \begin{bmatrix}\|Q^{(0)}_{\tau} - \tau\xi^{(0)}\|_{\infty}\\ 0\end{bmatrix}.    (41)

This forms the basis for proving Theorem 1.

Step 2: analyzing the dynamics of the linear system (40).

Before proceeding, we note that a linear system similar to (40) has been analyzed in Cen et al., 2022b (Section 4.2.2). We intend to apply the following properties that have been derived therein:

x_{k+1} \leq A^{k+1}\left[x_{0} + \gamma(\alpha^{-1}A - I)^{-1} y\right],    (42a)
\gamma(\alpha^{-1}A - I)^{-1} y = \begin{bmatrix}0\\ \|Q^{(0)}_{\tau} - \tau\xi^{(0)}\|_{\infty}\end{bmatrix},    (42b)
A^{k+1} = \big((1-\alpha)\gamma + \alpha\big)^{k}\begin{bmatrix}\gamma\\ 1\end{bmatrix}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix}.    (42c)
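For completeness, (42c) is an immediate consequence of the rank-one structure of $A$: one can write

A = \begin{bmatrix}\gamma\\ 1\end{bmatrix}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix} \qquad\text{with}\qquad \begin{bmatrix}1-\alpha & \alpha\end{bmatrix}\begin{bmatrix}\gamma\\ 1\end{bmatrix} = (1-\alpha)\gamma + \alpha,

so that

A^{k+1} = \begin{bmatrix}\gamma\\ 1\end{bmatrix}\Big(\begin{bmatrix}1-\alpha & \alpha\end{bmatrix}\begin{bmatrix}\gamma\\ 1\end{bmatrix}\Big)^{k}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix} = \big((1-\alpha)\gamma+\alpha\big)^{k}\begin{bmatrix}\gamma\\ 1\end{bmatrix}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix}.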

Substituting (42c) and (42b) into (42a) and rearranging terms, we reach

x_{k+1} \leq \big((1-\alpha)\gamma + \alpha\big)^{k}\left((1-\alpha)\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + \alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty} + \alpha\big\|Q^{(0)}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right)\begin{bmatrix}\gamma\\ 1\end{bmatrix}
\leq \big((1-\alpha)\gamma + \alpha\big)^{k}\left(\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + 2\alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right)\begin{bmatrix}\gamma\\ 1\end{bmatrix},    (43)

which taken together with the definition of $x_{k+1}$ gives

\big\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\big\|_{\infty} \leq \gamma\big((1-\alpha)\gamma + \alpha\big)^{k}\left(\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + 2\alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right),    (44a)
\big\|Q_{\tau}^{\star} - \tau\xi^{(k+1)}\big\|_{\infty} \leq \big((1-\alpha)\gamma + \alpha\big)^{k}\left(\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + 2\alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right).    (44b)
Step 3: controlling πτ(s)π(k+1)(s)1\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1} and VτVτ(k+1)\big{\|}{V}^{\star}_{\tau}-{V}^{(k+1)}_{\tau}\big{\|}_{\infty}.

It remains to convert this result to an upper bound on πτ(s)π(k+1)(s)1\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1} and VτVτ(k+1)\big{\|}{V}^{\star}_{\tau}-{V}^{(k+1)}_{\tau}\big{\|}_{\infty}. By virtue of Lemma 1, there exist two vectors gτ(s)hs(πτ(s))g_{\tau}^{\star}(s)\in\partial h_{s}\big{(}\pi_{\tau}^{\star}(s)\big{)}, g(k+1)(s)hs(π(k+1)(s))g^{(k+1)}(s)\in\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)} and two scalars cs,cs(k+1)c_{s}^{\star},c_{s}^{(k+1)}\in\mathbb{R} that satisfy

{τ1Qτ(s)cs1=gτ(s)ξ(k+1)(s,)cs(k+1)1=g(k+1)(s).\begin{cases}\tau^{-1}Q_{\tau}^{\star}(s)-c_{s}^{\star}1&=g_{\tau}^{\star}(s)\\ \xi^{(k+1)}(s,\cdot)-c_{s}^{(k+1)}1&=g^{(k+1)}(s)\end{cases}.

It holds for all s𝒮s\in\mathcal{S} that

Vτ(s)Vτ(k+1)(s)\displaystyle V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s)
=Qτ(s),πτ(s)τhs(πτ(s))Qτ(k+1)(s),πτ(k+1)(s)+τhs(πτ(k+1)(s))\displaystyle=\big{\langle}Q_{\tau}^{\star}(s),\pi_{\tau}^{\star}(s)\big{\rangle}-\tau h_{s}(\pi_{\tau}^{\star}(s))-\big{\langle}Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\tau h_{s}(\pi_{\tau}^{(k+1)}(s))
=Qτ(s)Qτ(k+1)(s),πτ(k+1)(s)+[τ(hs(πτ(k+1)(s))hs(πτ(s)))Qτ(s),πτ(k+1)(s)πτ(s)]\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\Big{[}\tau(h_{s}(\pi_{\tau}^{(k+1)}(s))-h_{s}(\pi_{\tau}^{\star}(s)))-\big{\langle}Q_{\tau}^{\star}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}\Big{]}
\displaystyle\overset{\text{(i)}}{\leq}\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\big{\langle}\tau g^{(k+1)}(s)-Q_{\tau}^{\star}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}
\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\big{\langle}\tau\xi^{(k+1)}(s)-Q_{\tau}^{\star}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}
Qτ(s)Qτ(k+1)(s)+2Qτ(s)τξ(k+1)(s),\displaystyle\leq\big{\|}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s)\big{\|}_{\infty}+2\big{\|}Q_{\tau}^{\star}(s)-\tau\xi^{(k+1)}(s)\big{\|}_{\infty}, (45)

where (i) results from hs(πτ(k+1)(s))hs(πτ(s))g(k+1)(s),πτ(k+1)(s)πτ(s)h_{s}(\pi_{\tau}^{(k+1)}(s))-h_{s}(\pi_{\tau}^{\star}(s))\leq\big{\langle}g^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}. Plugging (44) into (45) completes the proof for (21b).

When hsh_{s} is 11-strongly convex w.r.t. the 1\ell_{1} norm, we can invoke the strong monotonicity property of a strongly convex function (Beck,, 2017, Theorem 5.24) to obtain

πτ(s)π(k+1)(s)12\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1}^{2} πτ(s)π(k+1)(s),gτ(s)g(k+1)(s)\displaystyle\leq\big{\langle}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s),g_{\tau}^{\star}(s)-g^{(k+1)}(s)\big{\rangle}
=πτ(s)π(k+1)(s),gτ(s)+cs1g(k+1)(s)cs(k+1)1\displaystyle=\big{\langle}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s),g_{\tau}^{\star}(s)+c_{s}^{\star}1-g^{(k+1)}(s)-c_{s}^{(k+1)}1\big{\rangle}
πτ(s)π(k+1)(s)1gτ(s)+cs1g(k+1)(s)cs(k+1)1\displaystyle\leq\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1}\big{\|}g_{\tau}^{\star}(s)+c_{s}^{\star}1-g^{(k+1)}(s)-c_{s}^{(k+1)}1\big{\|}_{\infty}
=τ1πτ(s)π(k+1)(s)1Qτ(s)τξ(k+1)(s),\displaystyle=\tau^{-1}\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1}\big{\|}Q_{\tau}^{\star}(s)-\tau\xi^{(k+1)}(s)\big{\|}_{\infty}, (46)

where the second line is valid since πτ(s),1=π(k+1)(s),1=1\langle\pi_{\tau}^{\star}(s),1\rangle=\langle\pi^{(k+1)}(s),1\rangle=1. This taken together with (44) gives rise to the advertised bound

πτ(s)π(k+1)(s)1\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1} τ1Qτ(s)τξ(k+1)(s)\displaystyle\leq\tau^{-1}\big{\|}Q_{\tau}^{\star}(s)-\tau\xi^{(k+1)}(s)\big{\|}_{\infty}
τ1((1α)γ+α)k(QτQτ(0)+2αQττξ(0)).\displaystyle\leq\tau^{-1}\big{(}(1-\alpha)\gamma+\alpha\big{)}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}\right).

5 Numerical experiments

In this section, we provide some simple numerical experiments to corroborate the effectiveness of the GPMD algorithm.

5.1 Tsallis entropy

While Shannon entropy is a popular choice of regularization, the discrepancy between the value function of the regularized MDP and the unregularized counterpart scales as O(τ1γlog|𝒜|)O(\frac{\tau}{1-\gamma}\log|\mathcal{A}|). In addition, the optimal policy under Shannon entropy regularization assigns positive mass to all actions and is hence non-sparse. To promote sparsity and obtain better control of the bias induced by regularization, Lee et al., (2018, 2019) proposed to employ the Tsallis entropy (Tsallis,, 1988) as an alternative. To be precise, for any vector pΔ(𝒜)p\in\Delta(\mathcal{A}), the associated Tsallis entropy is defined as

𝖳𝗌𝖺𝗅𝗅𝗂𝗌q(p)=1q1(1a𝒜(p(a))q)=1q1𝔼ap[1(p(a))q1],\mathsf{Tsallis}_{q}(p)=\frac{1}{q-1}\left(1-\sum_{a\in\mathcal{A}}\big{(}p(a)\big{)}^{q}\right)=\frac{1}{q-1}\mathbb{E}_{a\sim p}\left[1-\big{(}p(a)\big{)}^{q-1}\right],

where q>0q>0 is often referred to as the entropic index. As q1q\to 1, the Tsallis entropy reduces to the Shannon entropy.
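As a quick numerical illustration (a minimal sketch only; the helper functions below are ours and are not part of the experiments reported in this section), one can check that the Tsallis entropy with entropic index q approaches the Shannon entropy as q → 1:

import numpy as np

def tsallis_entropy(p, q):
    # Tsallis_q(p) = (1 - sum_a p(a)^q) / (q - 1), for q > 0 and q != 1.
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.3, 0.2])
print(tsallis_entropy(p, q=2.0))      # equals 1 - sum_a p(a)^2
print(tsallis_entropy(p, q=1.0001))   # close to the Shannon entropy
print(shannon_entropy(p))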

We now evaluate numerically the performance of PMD and GPMD when applied to a randomly generated MDP with |𝒮|=200|\mathcal{S}|=200 and |𝒜|=50|\mathcal{A}|=50. Here, the transition probability kernel and the reward function are generated as follows. For each state-action pair (s,a)(s,a), we randomly select 2020 states to form a set 𝒮s,a\mathcal{S}_{s,a}, and set P(s|s,a)=1/20P(s^{\prime}|s,a)=1/20 if s𝒮s,as^{\prime}\in\mathcal{S}_{s,a}, and 0 otherwise. The reward function is generated by r(s,a)Us,aUsr(s,a)\sim U_{s,a}\cdot U_{s}, where Us,aU_{s,a} and UsU_{s} are independent uniform random variables over [0,1][0,1]. We shall set the regularizer as hs(p)=𝖳𝗌𝖺𝗅𝗅𝗂𝗌2(p)h_{s}(p)=-\mathsf{Tsallis}_{2}(p) for all s𝒮s\in\mathcal{S} with a regularization parameter τ=0.001\tau=0.001. As can be seen from the numerical results displayed in Figure 1(a), GPMD enjoys a faster convergence rate compared to PMD.
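For concreteness, a random instance of this form can be generated along the following lines (a sketch under the stated generation scheme; the random seed and variable names are ours, and the exact instances used in the experiments may differ):

import numpy as np

rng = np.random.default_rng(0)
S, A, support_size = 200, 50, 20

# Transition kernel: each (s, a) places mass 1/20 on 20 next states drawn uniformly without replacement.
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        next_states = rng.choice(S, size=support_size, replace=False)
        P[s, a, next_states] = 1.0 / support_size

# Reward: product of two independent Uniform[0, 1] variables, one indexed by (s, a) and one by s.
U_sa = rng.uniform(size=(S, A))
U_s = rng.uniform(size=S)
r = U_sa * U_s[:, None]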

(a) Tsallis entropy regularization (b) Log-barrier regularization
Figure 1: QτQτ(t)\|Q_{\tau}^{\star}-Q_{\tau}^{(t)}\|_{\infty} versus the iteration count for both PMD and GPMD, for multiple choices of the learning rate η\eta. The left plot (a) is concerned with Tsallis entropy regularization, whereas the right plot (b) concerns log-barrier regularization used in our constrained RL example. The error curves are averaged over 5 independent runs.

5.2 Constrained RL

In reality, an agent with the sole aim of maximizing cumulative rewards might end up with unintended or even harmful behavior, due to, say, an improperly designed reward function or imperfect simulation of physical laws. It is therefore sometimes necessary to enforce proper constraints on the policy in order to prevent it from taking certain actions too frequently.

To simulate this problem, we first solve an MDP with |𝒮|=200|\mathcal{S}|=200 and |𝒜|=50|\mathcal{A}|=50, generated in the same way as in the previous subsection. We then pick 10 state-action pairs at random from the support of the optimal policy to form a set Ψ\Psi. We can ensure that πτ(a|s)<πmax=0.1\pi_{\tau}^{\star}(a\,|\,s)<\pi_{\rm max}=0.1 for all (s,a)Ψ(s,a)\in\Psi by adding the following log-barrier regularization with τ=0.001\tau=0.001:

hs(p)={,if (s,a)Ψ and p(a)πmax,log(πmaxp(a)),if (s,a)Ψ and p(a)<πmax,0,otherwise.h_{s}(p)=\begin{cases}\infty,&\text{if }(s,a)\in\Psi\text{ and }p(a)\geq\pi_{\rm max},\\ -\log\big{(}\pi_{\rm max}-p(a)\big{)},&\text{if }(s,a)\in\Psi\text{ and }p(a)<\pi_{\rm max},\\ 0,&\text{otherwise}.\end{cases}
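A natural reading of this regularizer, sketched below, accumulates the barrier terms over all constrained actions of a given state (the function name and the data structure constrained_actions, which maps each state s to the actions a with (s,a) ∈ Ψ, are ours and serve illustration purposes only):

import numpy as np

pi_max = 0.1

def h_log_barrier(s, p, constrained_actions):
    # Returns +infinity as soon as any constrained action exceeds pi_max;
    # otherwise accumulates -log(pi_max - p(a)) over the constrained actions of state s.
    val = 0.0
    for a in constrained_actions.get(s, ()):
        if p[a] >= pi_max:
            return np.inf
        val += -np.log(pi_max - p[a])
    return val

# Hypothetical usage: a uniform policy at state 0 with a single constrained action.
p = np.full(50, 1.0 / 50)
print(h_log_barrier(0, p, {0: [3]}))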

Numerical comparisons of PMD and GPMD when applied to this problem are plotted in Figure 1(b). It is observed that the PMD methods stall after reaching an error floor on the order of 10210^{-2}, while the GPMD methods converge to the optimal policy efficiently.

6 Discussion

The present paper has introduced a generalized framework of policy optimization tailored to regularized RL problems. We have proposed a generalized policy mirror descent (GPMD) algorithm that achieves dimension-free linear convergence, which covers an entire range of learning rates and accommodates convex and possibly nonsmooth regularizers. Numerical experiments have been conducted to demonstrate the utility of the proposed GPMD algorithm. Our approach opens up a couple of future directions that are worthy of further exploration. For example, the current work restricts its attention to convex regularizers and tabular MDPs; it is of paramount interest to develop policy optimization algorithms when the regularizers are nonconvex and when sophisticated policy parameterization—including function approximation—is adopted. Understanding the sample complexities of the proposed algorithm—when the policies are evaluated using samples collected over an online trajectory—is crucial in sample-constrained scenarios and is left for future investigation. Furthermore, it might be worthwhile to extend the proposed algorithm to accommodate multi-agent RL, with a representative example being regularized multi-agent Markov games (Cen et al.,, 2021; Zhao et al.,, 2022; Cen et al., 2022a, ; Cen et al., 2022c, ).

Acknowledgements

S. Cen and Y. Chi are supported in part by the grants ONR N00014-19-1-2404, NSF CCF-2106778, DMS-2134080, CCF-1901199, CCF-2007911, and CNS-2148212. S. Cen is also gratefully supported by Wei Shen and Xuehong Zhang Presidential Fellowship, and Nicholas Minnici Dean’s Graduate Fellowship in Electrical and Computer Engineering at Carnegie Mellon University. W. Zhan and Y. Chen are supported in part by the Google Research Scholar Award, the Alfred P. Sloan Research Fellowship, and the grants AFOSR FA9550-22-1-0198, ONR N00014-22-1-2354, NSF CCF-2221009, CCF-1907661, IIS-2218713, and IIS-2218773. W. Zhan and J. Lee are supported in part by the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, and an ONR Young Investigator Award.

Appendix A Proof of key lemmas

In this section, we collect the proofs of several key lemmas. Here and throughout, we use 𝔼π[]\mathbb{E}_{\pi}[\cdot] to denote the expectation over the randomness of the MDP induced by policy π\pi. We shall follow the notational convention in (33) throughout. In addition, to further simplify the exposition, we shall slightly abuse notation by letting

Dhs(π~,π;ξ)\displaystyle D_{h_{s}}(\widetilde{\pi},\pi;\xi) Dhs(π~(|s),π(|s);ξ(s,))\displaystyle\coloneqq D_{h_{s}}\big{(}\widetilde{\pi}(\cdot\,|\,s),\pi(\cdot\,|\,s);\xi(s,\cdot)\big{)} (47a)
Dhs(p,π;ξ)\displaystyle D_{h_{s}}(p,\pi;\xi) Dhs(p,π(|s);ξ(s,))\displaystyle\coloneqq D_{h_{s}}\big{(}p,\pi(\cdot\,|\,s);\xi(s,\cdot)\big{)} (47b)
Dhs(π,p;ξ)\displaystyle D_{h_{s}}(\pi,p;\xi) Dhs(π(|s),p;ξ(s,))\displaystyle\coloneqq D_{h_{s}}\big{(}\pi(\cdot\,|\,s),p;\xi(s,\cdot)\big{)} (47c)

for any policy π\pi and π~\widetilde{\pi} and any pΔ(𝒜)p\in\Delta(\mathcal{A}), whenever it is clear from the context.

A.1 Proof of Lemma 1

We start by relaxing the probability simplex constraint (i.e., pΔ(𝒜)p\in\Delta(\mathcal{A})) in (20a) with a simpler linear constraint a𝒜p(a)=1\sum_{a\in\mathcal{A}}p(a)=1 as follows

minimizep|𝒜|ηQτ(k)(s),p+ητhs(p)+Dhs(p,π(k);ξ(k))subject toa𝒜p(a)=1.\begin{array}[]{ll}\text{minimize}_{p\in\mathbb{R}^{|\mathcal{A}|}}&-\eta\big{\langle}Q^{(k)}_{\tau}(s),p\big{\rangle}+\eta\tau h_{s}(p)+D_{h_{s}}\big{(}p,\pi^{(k)};\xi^{(k)}\big{)}\\[4.30554pt] \text{subject to}&\sum_{a\in\mathcal{A}}p(a)=1.\end{array} (48)

To justify the validity of dropping the non-negativity constraints, we note that for any pp obeying p(a)<0p(a)<0 for some a𝒜a\in\mathcal{A}, our assumption on hsh_{s} (see Assumption 1) leads to hs(p)=h_{s}(p)=\infty, which cannot possibly be the optimal solution. This confirms the equivalence between (20a) and (48).

Observe that the Lagrangian w.r.t. (48) is given by

s(p,λs(k))\displaystyle\mathcal{L}_{s}\big{(}p,\lambda_{s}^{(k)}\big{)} =ηQτ(k)(s),p+ητhs(p)+hs(p)hs(π(k)(s))pπ(k)(s),ξ(k)(s)+λs(k)(a𝒜p(a)1),\displaystyle=-\eta\big{\langle}Q_{\tau}^{(k)}(s),p\big{\rangle}+\eta\tau h_{s}(p)+h_{s}(p)-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}p-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}+\lambda_{s}^{(k)}\left(\sum_{a\in\mathcal{A}}p(a)-1\right),

where λs(k)\lambda_{s}^{(k)}\in\mathbb{R} denotes the Lagrange multiplier associated with the constraint a𝒜p(a)=1\sum_{a\in\mathcal{A}}p(a)=1. Given that π(k+1)(s)\pi^{(k+1)}(s) is the solution to (20a) and hence (48), the optimality condition requires that

0ps(p,λs(k))|p=π(k+1)(s)=ηQτ(k)(s)+(1+ητ)hs(π(k+1)(s))ξ(k)(s)+λs(k)1.0\in\partial_{p}\mathcal{L}_{s}\big{(}p,\lambda_{s}^{(k)}\big{)}\,\Big{|}\,_{p=\pi^{(k+1)}(s)}=-\eta Q^{(k)}_{\tau}(s)+(1+\eta\tau)\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\xi^{(k)}(s)+\lambda^{(k)}_{s}1.

Rearranging terms and making use of the construction (17), we are left with

ξ(k+1)(s)λs(k)1+ητ1=11+ητ[ηQτ(k)(s)+ξ(k)(s)λs(k)1]hs(π(k+1)(s)),\xi^{(k+1)}(s)-\frac{\lambda^{(k)}_{s}}{1+\eta\tau}1=\frac{1}{1+\eta\tau}\left[\eta Q^{(k)}_{\tau}(s)+\xi^{(k)}(s)-\lambda^{(k)}_{s}1\right]\in\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)},

thus concluding the proof of the first claim (18).

We now turn to the second claim (19). In view of the property (37), we have

πτ(s)=argminpΔ(𝒜)Qτ(s),p+τhs(p).\pi_{\tau}^{\star}(s)=\arg\min_{p\in\Delta(\mathcal{A})}-\big{\langle}Q^{\star}_{\tau}(s),p\big{\rangle}+\tau h_{s}(p).

This optimization problem is equivalent to

minimizep|𝒜|Qτ(s),p+τhs(p),subject toa𝒜p(a)=1,\begin{array}[]{ll}\text{minimize}_{p\in\mathbb{R}^{|\mathcal{A}|}}&-\big{\langle}Q^{\star}_{\tau}(s),p\big{\rangle}+\tau h_{s}(p),\\[4.30554pt] \text{subject to}&\sum_{a\in\mathcal{A}}p(a)=1,\end{array} (49)

which can be verified by repeating a similar argument for (48). The Lagrangian associated with (49) is

s(p,λs)\displaystyle\mathcal{L}_{s}\big{(}p,\lambda_{s}^{\star}\big{)} =Qτ(s),p+τhs(p)+λs(a𝒜p(a)1),\displaystyle=-\big{\langle}Q_{\tau}^{\star}(s),p\big{\rangle}+\tau h_{s}(p)+\lambda_{s}^{\star}\left(\sum_{a\in\mathcal{A}}p(a)-1\right),

where λs\lambda_{s}^{\star}\in\mathbb{R} denotes the Lagrange multiplier. Therefore, the first-order optimality condition requires that

0ps(p,λs)|p=πτ(s)=Qτ(s)+τhs(πτ(s))+λs1,0\in\partial_{p}\mathcal{L}_{s}\big{(}p,\lambda_{s}^{\star}\big{)}\,\Big{|}\,_{p=\pi^{\star}_{\tau}(s)}=-Q^{\star}_{\tau}(s)+\tau\partial h_{s}\big{(}\pi^{\star}_{\tau}(s)\big{)}+\lambda^{\star}_{s}1,

which immediately finishes the proof.

A.2 Proof of Lemma 2

We start by introducing the performance difference lemma that has previously been derived in Lan, (2022, Lemma 2). For the sake of self-containedness, we include a proof of this lemma in Appendix A.2.1.

Lemma 5 (Performance difference).

For any two policies π\pi and π\pi^{\prime}, we have

Vτπ(s)Vτπ(s)=11γ𝔼sdsπ[Qτπ(s),π(s)π(s)τhs(π(s))+τhs(π(s))],V^{\pi^{\prime}}_{\tau}(s)-V^{\pi}_{\tau}(s)=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{\pi^{\prime}}_{s}}\Big{[}\big{\langle}Q^{\pi}_{\tau}(s^{\prime}),\pi^{\prime}(s^{\prime})-\pi(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{\prime}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi(s^{\prime})\big{)}\Big{]}, (50)

where dsπd^{\pi}_{s} has been defined in (5).

Armed with Lemma 5, one can readily rewrite the difference Vτ(k+1)(s)Vτ(k)(s)V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s) between two consecutive iterates as follows

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)
=11γ𝔼sds(k+1)[Qτ(k)(s),π(k+1)(s)π(k)(s)τhs(π(k+1)(s))+τhs(π(k)(s))].\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\Big{[}\big{\langle}Q^{(k)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}\Big{]}. (51)

It then comes down to studying the right-hand side of the relation (A.2), which can be accomplished via the following “three-point” lemma. The proof of this lemma can be found in Appendix A.2.2.

Lemma 6.

For any s𝒮s\in\mathcal{S} and any vector pΔ(𝒜)p\in\Delta(\mathcal{A}), we have

(1+ητ)Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))\displaystyle(1+\eta\tau)D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},{\pi^{(k)}};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}
=η[Qτ(k)(s),π(k+1)(s)p+τhs(p)τhs(π(k+1)(s))].\displaystyle\qquad=\eta\left[\big{\langle}Q^{(k)}_{\tau}(s),\pi^{(k+1)}(s)-p\big{\rangle}+\tau h_{s}(p)-\tau h_{s}\big{(}\pi^{(k+1)}(s)\big{)}\right].

Taking p=π(k)(s)p=\pi^{(k)}(s) in Lemma 6 and combining it with (A.2), we arrive at

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V_{\tau}^{(k+1)}(s)-V_{\tau}^{(k)}(s)
=1(1γ)η𝔼sds(k+1)[(1+ητ)Dhs(π(k),π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))]0\displaystyle=\frac{1}{(1-\gamma)\eta}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d_{s}^{(k+1)}}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\pi^{(k+1)};\xi^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\xi^{(k)}\big{)}\right]\geq 0

for any s𝒮s\in\mathcal{S}, thus establishing the advertised pointwise monotonicity w.r.t. the regularized value function.

When it comes to the regularized Q-function, it is readily seen from the definition (9a) that

Qτ(k+1)(s,a)\displaystyle{Q}^{(k+1)}_{\tau}(s,a) =r(s,a)+γ𝔼sP(|s,a)[Vτ(k+1)(s)]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\big{[}V^{(k+1)}_{\tau}(s^{\prime})\big{]}
r(s,a)+γ𝔼sP(|s,a)[Vτ(k)(s)]=Qτ(k)(s,a)\displaystyle\geq r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\big{[}V^{(k)}_{\tau}(s^{\prime})\big{]}={Q}^{(k)}_{\tau}(s,a)

for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, where the last line is valid since Vτ(k+1)Vτ(k)V_{\tau}^{(k+1)}\geq V_{\tau}^{(k)}. This concludes the proof.

A.2.1 Proof of Lemma 5

For any two policies π\pi^{\prime} and π\pi, it follows from the definition (7) of Vτπ(s)V^{\pi}_{\tau}(s) that

Vτπ(s)Vτπ(s)=𝔼π[t=0γt[r(st,at)τhst(π(st))]|s0=s]Vτπ(s)\displaystyle V^{\pi^{\prime}}_{\tau}(s)-V^{\pi}_{\tau}(s)=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}\Big{]}\,\Big{|}\,s_{0}=s\right]-V^{\pi}_{\tau}(s)
=𝔼π[t=0γt[r(st,at)τhst(π(st))+Vτπ(st)Vτπ(st)]|s0=s]Vτπ(s)\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+V^{\pi}_{\tau}(s_{t})-V^{\pi}_{\tau}(s_{t})\Big{]}\,\Big{|}\,s_{0}=s\right]-V^{\pi}_{\tau}(s)
=𝔼π[t=0γt[r(st,at)τhst(π(st))+γVτπ(st+1)Vτπ(st)]|s0=s]+𝔼π[Vτπ(s0)|s0=s]Vτπ(s)\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\gamma V^{\pi}_{\tau}(s_{t+1})-V^{\pi}_{\tau}(s_{t})\Big{]}\,\Big{|}\,s_{0}=s\right]+\mathbb{E}_{\pi^{\prime}}\left[V^{\pi}_{\tau}(s_{0})\,\Big{|}\,s_{0}=s\right]-V^{\pi}_{\tau}(s)
=𝔼π[t=0γt[r(st,at)τhst(π(st))+γVτπ(st+1)Vτπ(st)]|s0=s]\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\gamma V^{\pi}_{\tau}(s_{t+1})-V^{\pi}_{\tau}(s_{t})\Big{]}\,\Big{|}\,s_{0}=s\right]
=𝔼π[t=0γt[r(st,at)τhst(π(st))+γVτπ(st+1)Vτπ(st)τhst(π(st))+τhst(π(st))]|s0=s]\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}+\gamma V^{\pi}_{\tau}(s_{t+1})-V^{\pi}_{\tau}(s_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}\,\Big{|}\,s_{0}=s\right]
=𝔼π[t=0γt[Qτπ(st,at)τhst(π(st))Vτπ(st)τhst(π(st))+τhst(π(st))]|s0=s]\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}-V^{\pi}_{\tau}(s_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}\,\Big{|}\,s_{0}=s\right]
=11γ𝔼sdsπ[Qτπ(s),π(s)π(s)τhs(π(s))+τhs(π(s))],\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{\pi^{\prime}}_{s}}\Big{[}\big{\langle}Q^{\pi}_{\tau}(s^{\prime}),\pi^{\prime}(s^{\prime})-\pi(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{\prime}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi(s^{\prime})\big{)}\Big{]}, (52)

where the penultimate line comes from the definition (9a). To see why the last line of (52) is valid, we make note of the following identity

𝔼atπ(st)[Qτπ(st,at)τhst(π(st))Vτπ(st)]\displaystyle\mathop{\mathbb{E}}\limits_{a_{t}\sim\pi^{\prime}(s_{t})}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}-V^{\pi}_{\tau}(s_{t})\Big{]}
=𝔼atπ(st)[Qτπ(st,at)τhst(π(st))]𝔼atπ(st)[Qτπ(st,at)τhst(π(st))]\displaystyle\qquad=\mathop{\mathbb{E}}\limits_{a_{t}\sim\pi^{\prime}(s_{t})}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}-\mathop{\mathbb{E}}\limits_{a_{t}\sim\pi(s_{t})}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}
=Qτπ(st)τhst(π(st))1,π(st)π(st)\displaystyle\qquad=\Big{\langle}Q^{\pi}_{\tau}(s_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\cdot 1,\pi^{\prime}(s_{t})-\pi(s_{t})\Big{\rangle}
=Qτπ(st),π(st)π(st),\displaystyle\qquad=\Big{\langle}Q^{\pi}_{\tau}(s_{t}),\pi^{\prime}(s_{t})-\pi(s_{t})\Big{\rangle}, (53)

where the first identity results from the relation (9b), and the last relation holds since 1π(st)=1π(st)=11^{\top}\pi^{\prime}(s_{t})=1^{\top}\pi(s_{t})=1. The last line of (52) then follows immediately from the relation (53) and the definition (5) of dsπd_{s}^{\pi}.

A.2.2 Proof of Lemma 6

For any state s𝒮s\in\mathcal{S}, we make the observation that

Dhs(p,π(k);ξ(k))=hs(p)hs(π(k)(s))pπ(k)(s),ξ(k)(s)\displaystyle D_{h_{s}}\big{(}p,\pi^{(k)};\xi^{(k)}\big{)}=h_{s}(p)-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}p-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}
=hs(p)hs(π(k+1)(s))pπ(k+1)(s),ξ(k)(s)\displaystyle\qquad=h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k)}(s)\big{\rangle}
+hs(π(k+1)(s))hs(π(k)(s))π(k+1)(s)π(k)(s),ξ(k)(s)\displaystyle\qquad\qquad+h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}\pi^{(k+1)}(s)-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}
=hs(p)hs(π(k+1)(s))pπ(k+1)(s),ξ(k+1)(s)\displaystyle\qquad=h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k+1)}(s)\big{\rangle}
+hs(π(k+1)(s))hs(π(k)(s))π(k+1)(s)π(k)(s),ξ(k)(s)\displaystyle\qquad\qquad+h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}\pi^{(k+1)}(s)-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}
\displaystyle\qquad\qquad+\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k+1)}(s)-\xi^{(k)}(s)\big{\rangle}
=Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))+pπ(k+1)(s),ξ(k+1)(s)ξ(k)(s)\displaystyle\qquad=D_{h_{s}}\big{(}p,\pi^{(k+1)};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\xi^{(k)}\big{)}+\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k+1)}(s)-\xi^{(k)}(s)\big{\rangle}
=Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))+pπ(k+1)(s),ηQτ(k)(s)ητξ(k+1)(s),\displaystyle\qquad=D_{h_{s}}\big{(}p,\pi^{(k+1)};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\xi^{(k)}\big{)}+\big{\langle}p-\pi^{(k+1)}(s),\eta Q_{\tau}^{(k)}(s)-\eta\tau\xi^{(k+1)}(s)\big{\rangle},

where the first and the fourth steps invoke the definition (15) of the generalized Bregman divergence and the last line results from the update rule (20c). Rearranging terms, we are left with

ηQτ(k)(s),π(k+1)(s)p\displaystyle\eta\big{\langle}Q^{(k)}_{\tau}(s),\pi^{(k+1)}(s)-p\big{\rangle}
={Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))}\displaystyle\qquad\qquad=\left\{D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},{\pi^{(k)}};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}\right\}
+ητξ(k+1)(s),π(k+1)(s)p.\displaystyle\qquad\qquad\qquad+\eta\tau\big{\langle}\xi^{(k+1)}(s),\pi^{(k+1)}(s)-p\big{\rangle}.

Adding the term ητ{hs(p)hs(π(k+1)(s))}\eta\tau\left\{h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}\right\} to both sides of this identity leads to

η[Qτ(k)(s),π(k+1)(s)p+τhs(p)τhs(π(k+1)(s))]\displaystyle\eta\left[\big{\langle}Q^{(k)}_{\tau}(s),\pi^{(k+1)}(s)-p\big{\rangle}+\tau h_{s}(p)-\tau h_{s}\big{(}\pi^{(k+1)}(s)\big{)}\right]
={Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))}\displaystyle\qquad=\left\{D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},{\pi^{(k)}};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}\right\}
+ητ(hs(p)hs(π(k+1)(s))ξ(k+1)(s),pπ(k+1)(s))\displaystyle\qquad\quad\quad+\eta\tau\left(h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\big{\langle}\xi^{(k+1)}(s),p-\pi^{(k+1)}(s)\big{\rangle}\right)
=(1+ητ)Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))\displaystyle\qquad=(1+\eta\tau)D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},\pi^{(k)};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}

as claimed, where the last line makes use of the definition (15).

A.3 Proof of Lemma 3

In the sequel, we shall prove each claim in Lemma 3 separately.

Proof of the contraction property (36).

For any Q1,Q2|𝒮||𝒜|Q_{1},Q_{2}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}, the definition (35) of the generalized Bellman operator obeys

𝒯τ,h(Q1)𝒯τ,h(Q2)\displaystyle\mathcal{T}_{\tau,h}(Q_{1})-\mathcal{T}_{\tau,h}(Q_{2}) =γ𝔼sP(|s,a)[maxpΔ(𝒜){Q1(s),pτhs(p)}maxpΔ(𝒜){Q2(s),pτhs(p)}]\displaystyle=\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\Big{\{}\langle Q_{1}(s^{\prime}),p\rangle-\tau h_{s^{\prime}}(p)\Big{\}}-\max_{p\in\Delta(\mathcal{A})}\Big{\{}\langle Q_{2}(s^{\prime}),p\rangle-\tau h_{s^{\prime}}(p)\Big{\}}\right]
(a)γ𝔼sP(|s,a)[maxpΔ(𝒜)Q1(s)Q2(s),p]\displaystyle\overset{\mathrm{(a)}}{\leq}\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\big{\langle}Q_{1}(s^{\prime})-Q_{2}(s^{\prime}),p\big{\rangle}\right]
γ𝔼sP(|s,a)[maxp:p1=1Q1Q2p1]\displaystyle\leq\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p:\|p\|_{1}=1}\|Q_{1}-Q_{2}\|_{\infty}\|p\|_{1}\right]
=γQ1Q2,\displaystyle=\gamma\|Q_{1}-Q_{2}\|_{\infty},

where (a) arises from the elementary fact maxxf(x)maxxg(x)maxx(f(x)g(x))\max_{x}f(x)-\max_{x}g(x)\leq\max_{x}\big{(}f(x)-g(x)\big{)}.

Proof of the fixed point property (37).

Towards this, let us first define

\displaystyle\pi^{\dagger}(s)\coloneqq\arg\max_{p_{s}\in\Delta(\mathcal{A})}\mathop{\mathbb{E}}\limits_{a\sim p_{s}}\Big{[}{Q}^{\star}_{\tau}(s,a)-\tau h_{s}\big{(}p_{s}\big{)}\Big{]}. (54)

Then it can be easily verified that

Qτ(s,a)\displaystyle{Q}^{\star}_{\tau}(s,a) =r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτ(s1,a1)τhs1(π(s1))]]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\star}(s_{1})}\Big{[}{Q}^{\star}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\star}(s_{1})\big{)}\Big{]}\right]
r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτ(s1,a1)τhs1(π(s1))]],\displaystyle\leq r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\dagger}(s_{1})}\Big{[}{Q}^{\star}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\dagger}(s_{1})\big{)}\Big{]}\right], (55)

where the first identity results from (9), and the second line arises from the maximizing property of π\pi^{\dagger} (see (54)).

Note that the right-hand side of (55) involves the term Qτ(s1,a1){Q}^{\star}_{\tau}(s_{1},a_{1}), which can be further upper bounded via the same argument for (55). Successively repeating this upper bound argument (and the expansion) eventually allows one to obtain

Qτ(s,a)r(s,a)+γ𝔼π[t=1γt1{r(st,at)τhst(π(st))}|s0=s,a0=a]=Qτπ(s,a).{Q}^{\star}_{\tau}(s,a)\leq r(s,a)+\gamma\mathbb{E}_{\pi^{\dagger}}\left[\sum_{t=1}^{\infty}\gamma^{t-1}\Big{\{}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\dagger}(s_{t})\big{)}\Big{\}}\,\Big{|}\,s_{0}=s,a_{0}=a\right]={Q}^{\pi^{\dagger}}_{\tau}(s,a).

However, the fact that π\pi^{\star} is the optimal policy necessarily implies the following reverse inequality:

Qτ(s,a)Qτπ(s,a).{Q}^{\star}_{\tau}(s,a)\geq{Q}^{\pi^{\dagger}}_{\tau}(s,a).

Therefore, one must have

Qτ(s,a)=Qτπ(s,a).\displaystyle{Q}^{\star}_{\tau}(s,a)={Q}^{\pi^{\dagger}}_{\tau}(s,a). (56)

To finish up, it suffices to show that Qτπ=𝒯τ,h(Qτ){Q}^{\pi^{\dagger}}_{\tau}=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau}). To this end, it is observed that

Qτπ(s,a)\displaystyle{Q}^{\pi^{\dagger}}_{\tau}(s,a) =r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτπ(s1,a1)τhs1(π(s1))]]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\dagger}(s_{1})}\left[{Q}^{\pi^{\dagger}}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\dagger}(s_{1})\big{)}\right]\right]
=(b)r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτ(s1,a1)τhs1(π(s1))]]\displaystyle\overset{\mathrm{(b)}}{=}r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\dagger}(s_{1})}\Big{[}{Q}^{{\star}}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\dagger}(s_{1})\big{)}\Big{]}\right]
\displaystyle\overset{\mathrm{(c)}}{=}r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\Big{\{}\big{\langle}Q_{\tau}^{\star}(s_{1}),p\big{\rangle}-\tau h_{s_{1}}(p)\Big{\}}\right]
=𝒯τ,h(Qτ)(s,a),\displaystyle=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a),

where (b) utilizes the fact (56), (c) follows from the definition (54) of π\pi^{\dagger}, and the last identity is a consequence of the definition (35) of 𝒯τ,h\mathcal{T}_{\tau,h}. The above results taken collectively demonstrate that Qτ=𝒯τ,h(Qτ){Q}^{\star}_{\tau}=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau}) as claimed.

A.4 Proof of Lemma 4

Recall that Qτ(k+1)=Qτπ(k+1){Q}^{(k+1)}_{\tau}={Q}^{\pi^{(k+1)}}_{\tau}. In view of the relation (9), one obtains

Qτ(k+1)(s,a)\displaystyle Q_{\tau}^{(k+1)}(s,a) =r(s,a)+γ𝔼sP(|s,a)[Vτ(k+1)(s)]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[V_{\tau}^{(k+1)}(s^{\prime})\right]
=r(s,a)+γ𝔼sP(|s,a)[𝔼aπ(k+1)(s)[Qτ(k+1)(s,a)τhs(π(k+1)(s))]]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\left[Q_{\tau}^{(k+1)}(s^{\prime},a^{\prime})-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\right]\right]
\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\big{\langle}Q_{\tau}^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\right].

This combined with the fixed-point condition (37) allows us to derive

Qτ(s,a)Qτ(k+1)(s,a)=𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[Qτ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]}\displaystyle{Q}^{\star}_{\tau}(s,a)-{Q}^{(k+1)}_{\tau}(s,a)=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}{Q}^{(k+1)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\Big{]}\right\}
=𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[τξ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]}\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\tau\xi^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\Big{]}\right\}
γ𝔼sP(|s,a),aπ(k+1)(s)[Qτ(k+1)(s,a)τξ(k+1)(s,a)].\displaystyle\qquad\qquad\qquad-\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\xi^{(k+1)}(s^{\prime},a^{\prime})\Big{]}. (57)

In what follows, we control each term on the right-hand side of (A.4) separately.

Step 1: bounding the 1st term on the right-hand side of (A.4).

Lemma 1 tells us that

ξ(k+1)(s)cs(k+1)1hs(π(k+1)(s))\xi^{(k+1)}(s)-c_{s}^{(k+1)}1\in\partial h_{s}(\pi^{(k+1)}(s))

for some scalar cs(k+1)c_{s}^{(k+1)}\in\mathbb{R}. This important property allows one to derive

0\displaystyle 0 ξ(k+1)(s)+cs(k+1)1+hs(π(k+1)(s))=k+1,s(π(k+1)(s);cs(k+1))\displaystyle\in-\xi^{(k+1)}(s)+c_{s}^{(k+1)}1+\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)}=\partial{\mathcal{L}}_{k+1,s}\big{(}\pi^{(k+1)}(s);c_{s}^{(k+1)}\big{)} (58)

where

k+1,s(p;λ)ξ(k+1)(s),p+hs(p)fk+1,s(p)+λ 1p.\mathcal{L}_{k+1,s}(p;\lambda)\coloneqq\underset{\eqqcolon\,f_{k+1,s}(p)}{\underbrace{-\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}+h_{s}\big{(}p\big{)}}}+\lambda\,1^{\top}p.

Recognizing that the function fk+1,s()f_{k+1,s}(\cdot) is convex in pp, we can view k+1,s(p;λ)\mathcal{L}_{k+1,s}(p;\lambda) as the Lagrangian of the following constrained convex problem with Lagrangian multiplier λ\lambda\in\mathbb{R}:

minimizep:1p=1fk+1,s(p)=ξ(k+1)(s),p+hs(p).\displaystyle\mathop{\text{minimize}}\limits_{p:1^{\top}p=1}\quad f_{k+1,s}(p)=-\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}+h_{s}\big{(}p\big{)}. (59)

The condition (58) can then be interpreted as the optimality condition w.r.t. the program (59) and π(k+1)(s)\pi^{(k+1)}(s), meaning that

fk+1,s(π(k+1)(s))=minp:1p=1fk+1,s(p),\displaystyle f_{k+1,s}\big{(}\pi^{(k+1)}(s)\big{)}=\min_{p:1^{\top}p=1}f_{k+1,s}(p),

or equivalently,

ξ(k+1)(s),π(k+1)(s)hs(π(k+1)(s))\displaystyle\big{\langle}\xi^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}-h_{s}\big{(}\pi^{(k+1)}(s)\big{)} =maxp:1p=1ξ(k+1)(s),phs(p).\displaystyle=\max_{p:1^{\top}p=1}\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}-h_{s}(p). (60)

In addition, for any vector pp that does not obey p0p\geq 0, Assumption 1 implies that hs(p)=h_{s}(p)=\infty, and hence pp cannot possibly be the optimal solution to maxpΔ(𝒜)ξ(k+1)(s),phs(p)\max_{p\in\Delta(\mathcal{A})}\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}-h_{s}(p). This together with (60) essentially implies that

ξ(k+1)(s),π(k+1)(s)hs(π(k+1)(s))\displaystyle\big{\langle}\xi^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}-h_{s}\big{(}\pi^{(k+1)}(s)\big{)} =maxpΔ(𝒜)ξ(k+1)(s),phs(p).\displaystyle=\max_{p\in\Delta(\mathcal{A})}\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}-h_{s}(p). (61)

As a consequence, we arrive at

𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[τξ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]}\displaystyle\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\Big{[}\big{\langle}\tau\xi^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\Big{]}\right\}
=𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[maxpΔ(𝒜){τξ(k+1)(s),pτhs(p)}]}\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\Big{[}\max_{p\in\Delta(\mathcal{A})}\Big{\{}\big{\langle}\tau\xi^{(k+1)}(s^{\prime}),p\big{\rangle}-\tau h_{s^{\prime}}(p)\Big{\}}\Big{]}\right\}
=𝒯τ,h(Qτ)(s,a)𝒯τ,h(τξ(k+1))(s,a)\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\mathcal{T}_{\tau,h}(\tau\xi^{(k+1)})(s,a)
γQττξ(k+1),\displaystyle\qquad\leq\gamma\big{\|}{Q}_{\tau}^{\star}-\tau\xi^{(k+1)}\big{\|}_{\infty}, (62)

where the last step results from the contraction property (36) in Lemma 3.

Step 2: bounding the 2nd term on the right-hand side of (A.4).

Recall that α=11+ητ\alpha=\frac{1}{1+\eta\tau}. Invoking the monotonicity property in Lemma 2 and the update rule (20c), we obtain

Qτ(k+1)(s,a)τξ(k+1)(s,a)\displaystyle{Q}^{(k+1)}_{\tau}(s,a)-\tau\xi^{(k+1)}(s,a) =α{Qτ(k+1)(s,a)τξ(k)(s,a)}+(1α){Qτ(k+1)(s,a)Qτ(k)(s,a)}\displaystyle=\alpha\Big{\{}{Q}^{(k+1)}_{\tau}(s,a)-\tau\xi^{(k)}(s,a)\Big{\}}+(1-\alpha)\Big{\{}{Q}^{(k+1)}_{\tau}(s,a)-{Q}^{(k)}_{\tau}(s,a)\Big{\}}
α{Qτ(k)(s,a)τξ(k)(s,a)}.\displaystyle\geq\alpha\Big{\{}{Q}^{(k)}_{\tau}(s,a)-\tau\xi^{(k)}(s,a)\Big{\}}.

Repeating this lower bound argument then yields

Qτ(k+1)(s,a)τξ(k+1)(s,a)\displaystyle{Q}^{(k+1)}_{\tau}(s,a)-\tau\xi^{(k+1)}(s,a) αk+1{Qτ(0)(s,a)τξ(0)(s,a)}\displaystyle\geq\alpha^{k+1}\Big{\{}{Q}^{(0)}_{\tau}(s,a)-\tau\xi^{(0)}(s,a)\Big{\}}
αk+1Qτ(0)τξ(0),\displaystyle\geq-\alpha^{k+1}\big{\|}{Q}^{(0)}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty},

thus revealing that

-\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\xi^{(k+1)}(s^{\prime},a^{\prime})\Big{]}\leq\alpha^{k+1}\big{\|}{Q}^{(0)}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}. (63)
Step 3: putting all this together.

Substituting (62) and (63) into (A.4) gives

0Qτ(s,a)Qτ(k+1)(s,a)γQττξ(k+1)+αk+1Qτ(0)τξ(0)\displaystyle 0\leq{Q}^{\star}_{\tau}(s,a)-{Q}^{(k+1)}_{\tau}(s,a)\leq\gamma\big{\|}{Q}_{\tau}^{\star}-\tau\xi^{(k+1)}\big{\|}_{\infty}+\alpha^{k+1}\big{\|}{Q}^{(0)}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty} (64)

for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, thus concluding the proof.

Appendix B Analysis for approximate GPMD (Theorem 2)

The proof consists of three steps: (i) evaluating the performance difference between π(k)\pi^{(k)} and π(k+1)\pi^{(k+1)}, (ii) establishing a linear system to characterize the error dynamic, and (iii) analyzing this linear system to derive global convergence guarantees. We shall describe the details of each step in the sequel. As before, we adopt the notational convention (47) whenever it is clear from the context.

B.1 Step 1: bounding performance difference between consecutive iterates

When only approximate policy evaluation is available, we are no longer guaranteed to have pointwise monotonicity as in the case of Lemma 2. Fortunately, we are still able to establish an approximate version of Lemma 2, as stated below.

Lemma 7 (Performance improvement for approximate GPMD).

For all s𝒮s\in\mathcal{S} and all k0k\geq 0, we have

Vτ(k+1)(s)Vτ(k)(s)1+α1γε𝗈𝗉𝗍21γε𝖾𝗏𝖺𝗅.V^{(k+1)}_{\tau}(s)\geq V^{(k)}_{\tau}(s)-\frac{1+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{2}{1-\gamma}\varepsilon_{\mathsf{eval}}.

In addition, if hsh_{s} is 1-strongly convex w.r.t. the 1\ell_{1} norm for all s𝒮s\in\mathcal{S}, then one further has

Vτ(k+1)(s)Vτ(k)(s)3+α1γε𝗈𝗉𝗍η(2+ητ)(1γ)ε𝖾𝗏𝖺𝗅2.V^{(k+1)}_{\tau}(s)\geq V^{(k)}_{\tau}(s)-\frac{3+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{\eta}{(2+\eta\tau)(1-\gamma)}\varepsilon_{\mathsf{eval}}^{2}.

In words, while monotonicity is not guaranteed, this lemma precludes the possibility of Vτ(k+1)(s)V^{(k+1)}_{\tau}(s) being much smaller than Vτ(k)(s)V^{(k)}_{\tau}(s), as long as both ε𝖾𝗏𝖺𝗅\varepsilon_{\mathsf{eval}} and ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}} are reasonably small.

B.1.1 Proof of Lemma 7

The case when hsh_{s} is convex.

Let π~(k+1)\widetilde{\pi}^{(k+1)} be the exact solution of the following problem

π~(k+1)(s)=argminpΔ(𝒜){Q^τ(k)(s),p+τhs(p)+1ηDhs(p,π(k);ξ^(k))}.\widetilde{\pi}^{(k+1)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}\left\{-\big{\langle}\widehat{Q}^{(k)}_{\tau}(s),p\big{\rangle}+\tau h_{s}(p)+\frac{1}{\eta}D_{h_{s}}\big{(}p,\pi^{(k)};\widehat{\xi}^{(k)}\big{)}\right\}. (65)

With this auxiliary policy iterate π~(k+1)\widetilde{\pi}^{(k+1)} in mind, we start by decomposing Vτ(k+1)(s)Vτ(k)(s)V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s) into the following three parts:

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)
=11γ𝔼sds(k+1)[Qτ(k)(s),π(k+1)(s)π(k)(s)τhs(π(k+1)(s))+τhs(π(k)(s))]\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}Q^{(k)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}\right]
=11γ𝔼sds(k+1)[Q^τ(k)(s),π~(k+1)(s)π(k)(s)τhs(π~(k+1)(s))+τhs(π(k)(s))]\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}\widehat{Q}^{{(k)}}_{\tau}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}\right]
+11γ𝔼sds(k+1)[Q^τ(k)(s),π(k+1)(s)π~(k+1)(s)τhs(π(k+1)(s))+τhs(π~(k+1)(s))]\displaystyle\qquad\quad+\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}\widehat{Q}^{{(k)}}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}\right]
+11γ𝔼sds(k+1)[Qτ(k)(s)Q^τ(k)(s),π(k+1)(s)π(k)(s)],\displaystyle\qquad\quad+\frac{1}{1-\gamma}{\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}\right]}, (66)

where the first identity arises from the performance difference lemma (cf. Lemma 5). To continue, we seek to control each part of (66) separately.

  • Regarding the first term of (66), replacing ξ\xi (resp. QτQ_{\tau}) by ξ^\widehat{\xi} (resp. Q^τ\widehat{Q}_{\tau}) in Lemma 6 indicates that

    Q^τ(k)(s),π~(k+1)(s)π(k)(s)τhs(π~(k+1)(s))+τhs(π(k)(s))\displaystyle\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}
    =1η[(1+ητ)Dhs(π(k),π~(k+1)(s);ξ^(k+1))+Dhs(π~(k+1),π(k);ξ^(k))]\displaystyle\qquad=\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)}(s^{\prime});\widehat{\xi}^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}\right] (67)

    for all s𝒮s^{\prime}\in\mathcal{S}.

  • As for the second term of (66), the definition of the oracle Gs,ε𝗈𝗉𝗍G_{s,\varepsilon_{\mathsf{opt}}} (see Assumption 3) guarantees that

    Q^τ(k)(s),π(k+1)(s)+τhs(π(k+1)(s))+1ηDhs(π(k+1),π(k);ξ^(k))\displaystyle-\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}+\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}
    Q^τ(k)(s),π~(k+1)(s)+τhs(π~(k+1)(s))+1ηDhs(π~(k+1),π(k);ξ^(k))+ε𝗈𝗉𝗍\displaystyle\qquad\leq-\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}+\varepsilon_{\mathsf{opt}} (68)

    for any s𝒮s^{\prime}\in\mathcal{S}. Rearranging terms, we are left with

    Q^τ(k)(s),π(k+1)(s)π~(k+1)(s)τhs(π(k+1)(s))+τhs(π~(k+1)(s))\displaystyle\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}
    1ηDhs(π~(k+1),π(k);ξ^(k))+1ηDhs(π(k+1),π(k);ξ^(k))ε𝗈𝗉𝗍\displaystyle\qquad\geq-\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}+\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-\varepsilon_{\mathsf{opt}}
    =1η(Dhs(π~(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))\displaystyle\qquad=-\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)
    +1η(Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))ε𝗈𝗉𝗍.\displaystyle\qquad\quad+\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)-\varepsilon_{\mathsf{opt}}. (69)

In addition, we note that the term

Dhs(π~(k+1),π(k);ξ^(k))D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}

appears in both (67) and (69), which can be canceled out when summing these two equalities. Specifically, adding (67) and (69) gives

Q^τ(k)(s),π~(k+1)(s)π(k)(s)τhs(π~(k+1)(s))+τhs(π(k)(s))\displaystyle\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}
+Q^τ(k)(s),π(k+1)(s)π~(k+1)(s)τhs(π(k+1)(s))+τhs(π~(k+1)(s))\displaystyle\quad+\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}
1η[(1+ητ)Dhs(π(k),π~(k+1);ξ^(k+1))+Dhs(π(k+1),π~(k);ξ^(k))]\displaystyle\qquad\geq\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}{\pi}^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right]
+1η(Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))ε𝗈𝗉𝗍.\displaystyle\qquad\quad+\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)-\varepsilon_{\mathsf{opt}}.

Substituting this into (66) and invoking the elementary inequality |a,b|a1b|\langle a,b\rangle|\leq\|a\|_{1}\|b\|_{\infty} thus lead to

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)
11γ𝔼sds(k+1)[1η[(1+ητ)Dhs(π(k),π~(k+1);ξ^(k+1))+Dhs(π(k+1),π~(k);ξ^(k))]]\displaystyle\geq\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right]\right]
+11γ𝔼sds(k+1)[1η(Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))]\displaystyle\quad+\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)\right]
ε𝗈𝗉𝗍1γ11γ𝔼sds(k+1)[Qτ(k)(s)Q^τ(k)(s)π(k+1)(s)π(k)(s)1],\displaystyle\quad-\frac{\varepsilon_{\mathsf{opt}}}{1-\gamma}-\frac{1}{1-\gamma}{\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}\right]}, (70)

where the last line makes use of Assumption 2 and the fact π(k+1)(s)1=π(k)(s)1=1\|\pi^{(k+1)}(s)\|_{1}=\|\pi^{(k)}(s)\|_{1}=1.

Following the discussion in Lemma 1, we can see that ξ^(k)(s)cs(k)1hs(π~(k)(s))\widehat{\xi}^{(k)}(s)-c_{s}^{(k)}1\in\partial h_{s}(\widetilde{\pi}^{(k)}(s)) for some constant cs(k)c_{s}^{(k)} and all kk. This together with the convexity of hsh_{s} (see (15)) guarantees that

Dhs(π(k),π~(k+1);ξ^(k+1))0andDhs(π(k+1),π~(k);ξ^(k))0\displaystyle D_{h_{s}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}\big{)}\geq 0\qquad\text{and}\qquad D_{h_{s}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\geq 0 (71)

for any s𝒮s\in\mathcal{S}, thus implying that the first term of (70) is non-negative. It remains to control the second term in (70). Towards this, a little algebra gives

Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k))\displaystyle D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}
={hs(π(k)(s))hs(π~(k)(s))ξ^(k)(s),π(k)(s)π~(k)(s)}\displaystyle\qquad=-\left\{h_{s}(\pi^{(k)}(s))-h_{s}\big{(}\widetilde{\pi}^{(k)}(s)\big{)}-\big{\langle}\widehat{\xi}^{(k)}(s),\pi^{(k)}(s)-\widetilde{\pi}^{(k)}(s)\big{\rangle}\right\}
=hs(π(k)(s))+hs(π~(k)(s))+11+ητξ^(k1)(s)+η1+ητQ^τ(k1)(s),π(k)(s)π~(k)(s)\displaystyle\qquad=-h_{s}\big{(}\pi^{(k)}(s)\big{)}+h_{s}\big{(}\widetilde{\pi}^{(k)}(s)\big{)}+\left\langle\frac{1}{1+\eta\tau}\widehat{\xi}^{(k-1)}(s)+\frac{\eta}{1+\eta\tau}\widehat{Q}_{\tau}^{(k-1)}(s),\pi^{(k)}(s)-\widetilde{\pi}^{(k)}(s)\right\rangle
=η1+ητ{Q^τ(k1)(s),π~(k)(s)+τhs(π~(k)(s))+1ηDhs(π~(k),π(k1);ξ^(k1))\displaystyle\qquad=\frac{\eta}{1+\eta\tau}\left\{-\big{\langle}\widehat{Q}^{(k-1)}_{\tau}(s),\widetilde{\pi}^{(k)}(s)\big{\rangle}+\tau h_{s}\big{(}\widetilde{\pi}^{(k)}(s)\big{)}+\frac{1}{\eta}D_{h_{s}}\big{(}\widetilde{\pi}^{(k)},\pi^{(k-1)};\widehat{\xi}^{(k-1)}\big{)}\right.
[Q^τ(k1)(s),π(k)(s)+τhs(π(k)(s))+1ηDhs(π(k),π(k1);ξ^(k1))]}\displaystyle\qquad\quad\quad\quad\ \ \left.-\left[-\big{\langle}\widehat{Q}^{(k-1)}_{\tau}(s),\pi^{(k)}(s)\big{\rangle}+\tau h_{s}\big{(}\pi^{(k)}(s)\big{)}+\frac{1}{\eta}D_{h_{s}}\big{(}\pi^{(k)},\pi^{(k-1)};\widehat{\xi}^{(k-1)}\big{)}\right]\right\}
ηε𝗈𝗉𝗍1+ητ.\displaystyle\qquad\geq-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}. (72)

Here, the first and the third lines follow from the definition (15), the second identity comes from the construction (27), whereas the last step invokes the definition of the oracle (25). Substitution of (71) and (72) into (70) gives

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s) 1+α1γε𝗈𝗉𝗍11γ𝔼sds(k+1)[Qτ(k)(s)Q^τ(k)(s)π(k+1)(s)π(k)(s)1]\displaystyle\geq-\frac{1+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{1}{1-\gamma}{\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}\right]} (73)
1+α1γε𝗈𝗉𝗍21γε𝖾𝗏𝖺𝗅.\displaystyle\geq-\frac{1+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{2}{1-\gamma}\varepsilon_{\mathsf{eval}}.
The case when hsh_{s} is strongly convex.

When hsh_{s^{\prime}} is 11-strongly convex w.r.t. the 1\ell_{1} norm, the objective function of sub-problem (65) is 1+ητη\frac{1+\eta\tau}{\eta}-strongly convex w.r.t. the 1\ell_{1} norm. Taking this together with the ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}}-approximation guarantee in Assumption 3, we can demonstrate that

1+ητ2ηπ~(k+1)(s)π(k+1)(s)12ε𝗈𝗉𝗍for all k0 and s𝒮.\frac{1+\eta\tau}{2\eta}\big{\|}\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}\leq\varepsilon_{\mathsf{opt}}\qquad\text{for all }k\geq 0\text{ and }s^{\prime}\in\mathcal{S}. (74)

Additionally, the strong convexity assumption also implies that

\displaystyle D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k)}(s^{\prime});\xi^{(k)}(s^{\prime})\big{)}\geq\frac{1}{2}\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}
=12(π~(k)(s)π(k+1)(s)12+π~(k)(s)π(k)(s)12)12π~(k)(s)π(k)(s)12\displaystyle\qquad=\frac{1}{2}\left(\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}+\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}\right)-\frac{1}{2}\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}
14(π~(k)(s)π(k+1)(s)1+π~(k)(s)π(k)(s)1)212π~(k)(s)π(k)(s)12\displaystyle\qquad\geq\frac{1}{4}\left(\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}+\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}\right)^{2}-\frac{1}{2}\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}
14π(k)(s)π(k+1)(s)12ηε𝗈𝗉𝗍1+ητ,\displaystyle\qquad\geq\frac{1}{4}\big{\|}\pi^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau},

where the third line results from Young’s inequality, and the final step follows from (74). We can develop a similar lower bound on Dhs(π(k),π~(k+1);ξ(k+1))D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\xi^{(k+1)}\big{)} as well. Taken together, these lower bounds give

1η[(1+ητ)Dhs(π(k)(s),π~(k+1)(s);ξ(k+1)(s))+Dhs(π(k+1)(s),π~(k)(s);ξ(k)(s))]\displaystyle\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime});\xi^{(k+1)}(s^{\prime})\big{)}+D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k)}(s^{\prime});\xi^{(k)}(s^{\prime})\big{)}\right]
2+ητη(14π(k)(s)π(k+1)(s)12ηε𝗈𝗉𝗍1+ητ)\displaystyle\qquad\geq\frac{2+\eta\tau}{\eta}\left(\frac{1}{4}\big{\|}\pi^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}\right)
2+ητ4ηπ(k)(s)π(k+1)(s)122ε𝗈𝗉𝗍.\displaystyle\qquad\geq\frac{2+\eta\tau}{4\eta}\big{\|}\pi^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}-2\varepsilon_{\mathsf{opt}}.

In addition, it is easily seen that

Qτ(k)(s)Q^τ(k)(s)π(k+1)(s)π(k)(s)1\displaystyle-\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}
12(2η2+ητQτ(k)(s)Q^τ(k)(s)2+2+ητ2ηπ(k+1)(s)π(k)(s)12)\displaystyle\qquad\geq-\frac{1}{2}\left(\frac{2\eta}{2+\eta\tau}\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}^{2}+\frac{2+\eta\tau}{2\eta}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}\right)
η2+ητε𝖾𝗏𝖺𝗅22+ητ4ηπ(k+1)(s)π(k)(s)12.\displaystyle\qquad\geq-\frac{\eta}{2+\eta\tau}\varepsilon_{\mathsf{eval}}^{2}-\frac{2+\eta\tau}{4\eta}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}.

Combining the above two inequalities with (73), we arrive at the advertised bound

Vτ(k+1)(s)Vτ(k)(s)3+α1γε𝗈𝗉𝗍η(2+ητ)(1γ)ε𝖾𝗏𝖺𝗅2.V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)\geq-\frac{3+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{\eta}{(2+\eta\tau)(1-\gamma)}\varepsilon_{\mathsf{eval}}^{2}.

B.2 Step 2: connecting the algorithm dynamic with a linear system

Now we are ready to discuss how to control QτQτ(k)\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}. In short, we intend to establish the connection among several intertwined quantities, and identify a simple linear system that captures the algorithm dynamic.

Bounding Qττξ^(k+1)\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}.

From the definition of ξ^(k+1)\widehat{\xi}^{(k+1)} in (27), we have

Qττξ^(k+1)\displaystyle\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty} =α(Qττξ^(k))+(1α)(QτQτ(k))+(1α)(Qτ(k)Q^τ(k))\displaystyle=\big{\|}\alpha({Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)})+(1-\alpha)({Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau})+(1-\alpha)({Q}^{{(k)}}_{\tau}-\widehat{{Q}}^{{(k)}}_{\tau})\big{\|}_{\infty}
αQττξ^(k)+(1α)QτQτ(k)+(1α)Qτ(k)Q^τ(k)\displaystyle\leq\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)}\big{\|}_{\infty}+(1-\alpha)\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}+(1-\alpha)\big{\|}{Q}^{{(k)}}_{\tau}-\widehat{{Q}}^{{(k)}}_{\tau}\big{\|}_{\infty}
αQττξ^(k)+(1α)QτQτ(k)+(1α)ε𝖾𝗏𝖺𝗅,\displaystyle\leq\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)}\big{\|}_{\infty}+(1-\alpha)\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}+(1-\alpha)\varepsilon_{\mathsf{eval}}, (75)

where the last inequality is a consequence of Assumption 2.

Bounding mins,a(Qτ(k+1)(s,a)τξ^(k+1)(s,a))-\min_{s,a}\big{(}{Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\big{)}.

Applying the definition in (27) once again, we obtain

(Qτ(k+1)(s,a)τξ^(k+1)(s,a))\displaystyle-\big{(}{Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\big{)} =α(Qτ(k)(s,a)τξ^(k)(s,a))+(1α)(Q^τ(k)(s,a)Qτ(k)(s,a))\displaystyle=-\alpha\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}+(1-\alpha)\big{(}\widehat{{Q}}^{{(k)}}_{\tau}(s,a)-{Q}^{{(k)}}_{\tau}(s,a)\big{)}
+(Qτ(k)(s,a)Qτ(k+1)(s,a))\displaystyle\qquad+\big{(}{Q}^{{(k)}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a)\big{)}
α(Qτ(k)(s,a)τξ^(k)(s,a))+(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍,\displaystyle\leq-\alpha\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}+(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}, (76)

where

c1={2γ1γ,if hs is convex but not strongly convex,ηε𝖾𝗏𝖺𝗅γ(2+ητ)(1γ),if hs is 1-strongly convex w.r.t. the 1 norm,c2={(α+1)γ1γ,if hs is convex but not strongly convex,(α+3)γ1γ,if hs is 1-strongly convex w.r.t. the 1 norm.\begin{split}c_{1}=&\begin{cases}\frac{2\gamma}{1-\gamma},\qquad&\text{if }h_{s}\text{ is convex but not strongly convex},\\ \frac{\eta\varepsilon_{\mathsf{eval}}\gamma}{(2+\eta\tau)(1-\gamma)},\quad&\text{if }h_{s}\text{ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm},\end{cases}\\ c_{2}&=\begin{cases}\frac{(\alpha+1)\gamma}{1-\gamma},\quad&\quad\text{if }h_{s}\text{ is convex but not strongly convex},\\ \frac{(\alpha+3)\gamma}{1-\gamma},\quad&\quad\text{if }h_{s}\text{ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm}.\end{cases}\end{split} (77)

Here, the last step of (76) follows from Assumption 2 as well as the following relation:

Qτ(k)(s,a)Qτ(k+1)(s,a)=γ𝔼sP(|s,a)[Vτ(k)(s)Vτ(k+1)(s)]c1ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍,{Q}^{{(k)}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a)=\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\left[V^{{(k)}}_{\tau}(s^{\prime})-V^{{(k+1)}}_{\tau}(s^{\prime})\right]\leq c_{1}\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}},

where we have made use of Lemma 7. Taking the maximum over (s,a) on both sides of (76) yields

mins,a(Qτ(k+1)(s,a)τξ^(k+1)(s,a))αmins,a(Qτ(k)(s,a)τξ^(k)(s,a))+(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍.-\min_{s,a}\left({Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\right)\leq-\alpha\min_{s,a}\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}+(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}. (78)
Bounding \big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(k+1)}}_{\tau}\big{\|}_{\infty}.

To begin with, let us decompose {Q}^{{\star}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a) into several parts. Invoking the relation (37) in Lemma 3 as well as the property (9), we reach

Qτ(s,a)Qτ(k+1)(s,a)\displaystyle{Q}^{{\star}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a)
=𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[Qτ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]]\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}{Q}^{(k+1)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s)\big{)}\Big{]}\right]
=𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[τξ^(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]]\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\tau\widehat{\xi}^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s)\big{)}\Big{]}\right]
γ𝔼sP(|s,a),aπ(k+1)(s)[Qτ(k+1)(s,a)τξ^(k+1)(s,a)]\displaystyle\qquad\qquad-\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\widehat{\xi}^{(k+1)}(s^{\prime},a^{\prime})\Big{]}
={𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[τξ^(k+1)(s),π~(k+1)(s)τhs(π~(k+1)(s))]]}\displaystyle\qquad=\left\{\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\tau\widehat{\xi}^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\Big{]}\right]\right\}
τγ𝔼sP(|s,a)[ξ^(k+1)(s),π(k+1)(s)π~(k+1)(s)hs(π(k+1)(s))+hs(π~(k+1)(s))]\displaystyle\qquad\quad-\tau\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\widehat{\xi}^{(k+1)}(s^{\prime}),{\pi}^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-h_{s^{\prime}}\big{(}{\pi}^{(k+1)}(s)\big{)}+h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\Big{]}
γ𝔼sP(|s,a),aπ(k+1)(s)[Qτ(k+1)(s,a)τξ^(k+1)(s,a)].\displaystyle\qquad\quad-\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\widehat{\xi}^{(k+1)}(s^{\prime},a^{\prime})\Big{]}. (79)

In the sequel, we control the three terms in (79) separately.

  • To begin with, we repeat an argument similar to the one leading to (62) to show that

    𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[τξ^(k+1)(s),π~(k+1)(s)τhs(π~(k+1)(s))]]\displaystyle\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\Big{[}\big{\langle}\tau\widehat{\xi}^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\Big{]}\right]
    =𝒯τ,h(Qτ)(s,a)𝒯τ,h(ξ^(k+1))(s,a)\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\mathcal{T}_{\tau,h}(\widehat{\xi}^{(k+1)})(s,a)
    γQττξ^(k+1).\displaystyle\qquad\leq\gamma\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}.
  • The second term of (79) can be bounded by applying (72) with k replaced by k+1:

    ξ^(k+1)(s),π(k+1)(s)π~(k+1)(s)hs(π(k+1)(s))+hs(π~(k+1)(s))ηε𝗈𝗉𝗍1+ητ.\displaystyle\big{\langle}\widehat{\xi}^{(k+1)}(s^{\prime}),{\pi}^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-h_{s^{\prime}}\big{(}{\pi}^{(k+1)}(s)\big{)}+h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\geq-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}.
  • As for the third term of (79), taking the maximum over all (s,a)\in\mathcal{S}\times\mathcal{A} gives

    Qτ(k+1)(s,a)τξ^(k+1)(s,a)mins,a(Qτ(k+1)(s,a)τξ^(k+1)(s,a)).{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\widehat{\xi}^{(k+1)}(s^{\prime},a^{\prime})\leq-\min_{s,a}\left({Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\right).

Taken together, the above bounds and the decomposition (79) lead to

\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(k+1)}}_{\tau}\big{\|}_{\infty}\leq\gamma\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}-\gamma\min_{s,a}\left({Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\right)+\gamma(1-\alpha)\varepsilon_{\mathsf{opt}}. (80)
A linear system of interest.

Combining (75), (78), and (80), we arrive at the following linear system

z_{k+1}\leq Bz_{k}+b, (81)

where

B\coloneqq\begin{bmatrix}\gamma(1-\alpha)&\gamma\alpha&\gamma\alpha\\ 1-\alpha&\alpha&0\\ 0&0&\alpha\end{bmatrix},\qquad z_{k}\coloneqq\begin{bmatrix}\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}\\ \big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)}\big{\|}_{\infty}\\ -\min_{s,a}\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}\end{bmatrix},
b\coloneqq\begin{bmatrix}\gamma(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+\gamma(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\\ (1-\alpha)\varepsilon_{\mathsf{eval}}\\ (1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\end{bmatrix}. (82)

This linear system of three variables captures how the estimation error progresses as the iteration count increases.
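
To gain intuition for this recursion, one can simulate the worst-case propagation z_{k+1}=Bz_{k}+b directly. The following minimal Python sketch does so under illustrative, hypothetical choices of (\gamma,\eta,\tau) and of the error levels \varepsilon_{\mathsf{eval}} and \varepsilon_{\mathsf{opt}} (placeholders, not settings used in our experiments); it also reports the spectral radius of B, which governs the contraction rate.

    import numpy as np

    # Illustrative (hypothetical) parameters; not the paper's experimental settings.
    gamma, eta, tau = 0.9, 1.0, 0.1
    alpha = 1.0 / (1.0 + eta * tau)                  # alpha = 1 / (1 + eta * tau)
    eps_eval, eps_opt = 1e-3, 1e-3                   # evaluation / update errors
    c1 = 2 * gamma / (1 - gamma)                     # constants from (77), convex (non-strongly) case
    c2 = (alpha + 1) * gamma / (1 - gamma)

    # System matrix B and offset vector b from (82).
    B = np.array([[gamma * (1 - alpha), gamma * alpha, gamma * alpha],
                  [1 - alpha,           alpha,         0.0],
                  [0.0,                 0.0,           alpha]])
    b = np.array([gamma * (2 - 2 * alpha + c1) * eps_eval + gamma * (1 - alpha + c2) * eps_opt,
                  (1 - alpha) * eps_eval,
                  (1 - alpha + c1) * eps_eval + c2 * eps_opt])

    # Worst-case propagation of z_k starting from a crude initial bound.
    z = np.full(3, 1.0 / (1.0 - gamma))
    for _ in range(300):
        z = B @ z + b
    print("spectral radius of B:", np.max(np.abs(np.linalg.eigvals(B))))  # equals alpha + (1 - alpha) * gamma
    print("limiting error floor:", z)                                      # approximately (I - B)^{-1} b

The printed spectral radius equals \lambda_{1}=\alpha+(1-\alpha)\gamma<1, so the iterates contract geometrically toward the fixed point (I-B)^{-1}b, in line with the analysis carried out in the next step.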

B.3 Step 3: linear system analysis

In this step, we analyze the behavior of the linear system (81) derived above. Observe that the eigenvalues and the associated eigenvectors of the matrix B are given by

\lambda_{1}=\alpha+(1-\alpha)\gamma,\qquad\lambda_{2}=\alpha,\qquad\lambda_{3}=0, (83)
v_{1}=\begin{bmatrix}\gamma\\ 1\\ 0\end{bmatrix},\qquad v_{2}=\begin{bmatrix}0\\ -1\\ 1\end{bmatrix},\qquad v_{3}=\begin{bmatrix}\alpha\\ \alpha-1\\ 0\end{bmatrix}. (84)
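
As a quick sanity check (separate from the proof), one can verify symbolically that each pair (\lambda_{i},v_{i}) in (83) and (84) satisfies Bv_{i}=\lambda_{i}v_{i} for arbitrary \alpha and \gamma; the short sketch below uses sympy for this purpose.

    import sympy as sp

    a, g = sp.symbols('alpha gamma', positive=True)
    B = sp.Matrix([[g * (1 - a), g * a, g * a],
                   [1 - a,       a,     0],
                   [0,           0,     a]])
    eigenpairs = [(a + (1 - a) * g, sp.Matrix([g, 1, 0])),      # (lambda_1, v_1)
                  (a,               sp.Matrix([0, -1, 1])),     # (lambda_2, v_2)
                  (sp.Integer(0),   sp.Matrix([a, a - 1, 0]))]  # (lambda_3, v_3)
    for lam, v in eigenpairs:
        residual = B * v - lam * v
        assert all(sp.simplify(entry) == 0 for entry in residual)
    print("all three eigenpairs of B verified")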

Armed with these, we can decompose z_{0} in terms of the eigenvectors of B as follows:

z0\displaystyle z_{0} [QτQτ(0)Qττξ^(0)Qτ(0)τξ^(0)]\displaystyle\leq\begin{bmatrix}\|{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\|_{\infty}\\ \|{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\|_{\infty}\\ \|{Q}^{(0)}_{\tau}-\tau\widehat{\xi}^{(0)}\|_{\infty}\end{bmatrix}
=1α+(1α)γ[(1α)QτQτ(0)+αQττξ^(0)+αQτ(0)τξ^(0)]v1\displaystyle=\frac{1}{\alpha+(1-\alpha)\gamma}\left[(1-\alpha)\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\big{\|}_{\infty}+\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}+\alpha\big{\|}{Q}^{(0)}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right]v_{1}
+Qττξ^(0)v2+ezv3\displaystyle\qquad+\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}v_{2}+e_{z}v_{3}
1α+(1α)γ[QτQτ(0)+2αQττξ^(0)]v1+Qττξ^(0)v2+ezv3,\displaystyle\leq\frac{1}{\alpha+(1-\alpha)\gamma}\left[\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right]v_{1}+\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}v_{2}+e_{z}v_{3}, (85)

where e_{z}\in\mathbb{R} is some constant that does not affect our final result (recall that \lambda_{3}=0, so the v_{3} component of z_{0} is annihilated after a single application of B). Also, the vector b defined in (82) satisfies

b\displaystyle b [γ(22α+c1)ε𝖾𝗏𝖺𝗅+γ(1α+c2)ε𝗈𝗉𝗍(1α)ε𝖾𝗏𝖺𝗅+(1α)ε𝗈𝗉𝗍(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]\displaystyle\leq\begin{bmatrix}\gamma(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+\gamma(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\\ (1-\alpha)\varepsilon_{\mathsf{eval}}+(1-\alpha)\varepsilon_{\mathsf{opt}}\\ (1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\end{bmatrix}
=[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]v1+[(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]v2.\displaystyle=\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}v_{1}+\big{[}(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\big{]}v_{2}. (86)

Using the decompositions in (85) and (86) and applying the system relation (81) recursively, we can derive

zk+1\displaystyle z_{k+1} Bk+1z0+t=0kBktb\displaystyle\leq B^{k+1}z_{0}+\sum_{t=0}^{k}B^{k-t}b
Bk+1[1α+(1α)γ[QτQτ(0)+2αQττξ^(0)]v1+Qττξ^(0)v2+ezv3]\displaystyle\leq B^{k+1}\left[\frac{1}{\alpha+(1-\alpha)\gamma}\left[\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right]v_{1}+\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}v_{2}+e_{z}v_{3}\right]
+t=0kBkt[[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]v1+[(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]v2]\displaystyle\quad+\sum_{t=0}^{k}B^{k-t}\Big{[}\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}v_{1}+\big{[}(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\big{]}v_{2}\Big{]}
=[λ1k(QτQτ(0)+2αQττξ^(0))+1λ1k+11λ1[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]]v1\displaystyle=\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+\frac{1-\lambda_{1}^{k+1}}{1-\lambda_{1}}\Big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\Big{]}\right]v_{1}
+[λ2k+1Qττξ^(0)+1λ2k+11λ2[(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]]v2.\displaystyle\quad+\Big{[}\lambda_{2}^{k+1}\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}+\frac{1-\lambda_{2}^{k+1}}{1-\lambda_{2}}[(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}]\Big{]}v_{2}.

Recognizing that the first two entries of v_{2} are non-positive (and that the coefficient multiplying v_{2} is non-negative), we can discard the term involving v_{2} and obtain

[QτQτ(k+1)Qττξ^(k+1)]\displaystyle\begin{bmatrix}\|{Q}^{\star}_{\tau}-{Q}^{{(k+1)}}_{\tau}\|_{\infty}\\ \|Q_{\tau}^{\star}-\tau\widehat{\xi}^{(k+1)}\|_{\infty}\end{bmatrix}
\displaystyle\leq\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+\frac{1-\lambda_{1}^{k+1}}{1-\lambda_{1}}\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}\right]\begin{bmatrix}\gamma\\ 1\end{bmatrix}
[λ1k(QτQτ(0)+2αQττξ^(0))+11λ1[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]C][γ1].\displaystyle\leq\bigg{[}\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+\underbrace{\frac{1}{1-\lambda_{1}}\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}}_{\eqqcolon\,C}\bigg{]}\begin{bmatrix}\gamma\\ 1\end{bmatrix}.

Making use of the fact that 1-\lambda_{1}=1-\alpha-(1-\alpha)\gamma=(1-\alpha)(1-\gamma), we can conclude

C=11γ[(2+c11α)ε𝖾𝗏𝖺𝗅+(1+c21α)ε𝗈𝗉𝗍].C=\frac{1}{1-\gamma}\left[\left(2+\frac{c_{1}}{1-\alpha}\right)\varepsilon_{\mathsf{eval}}+\left(1+\frac{c_{2}}{1-\alpha}\right)\varepsilon_{\mathsf{opt}}\right].

The above bound essentially says that

QτQτ(k+1)γ[λ1k(QτQτ(0)+2αQττξ^(0))+C]\displaystyle\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k+1)}}_{\tau}\big{\|}_{\infty}\leq\gamma\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+C\right]

and

Qττξ^(k+1)λ1k(QτQτ(0)+2αQττξ^(0))+C.\displaystyle\big{\|}Q_{\tau}^{\star}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}\leq\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+C.

Turning to V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s), by an argument similar to (45), we have

Vτ(s)Vτ(k+1)(s)\displaystyle V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s)
=Qτ(s)Qτ(k+1)(s),π(k+1)(s)+[τ(hs(π(k+1)(s))hs(πτ(s)))Qτ(s),π(k+1)(s)πτ(s)]\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}+\Big{[}\tau(h_{s}(\pi^{(k+1)}(s))-h_{s}(\pi_{\tau}^{\star}(s)))-\big{\langle}Q_{\tau}^{\star}(s),\pi^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}\Big{]}
=Qτ(s)Qτ(k+1)(s),π(k+1)(s)+τDhs(π(k+1),πτ;gτ)\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}+\tau D_{h_{s}}(\pi^{(k+1)},\pi_{\tau}^{\star};g_{\tau}^{\star})
QτQτ(k+1)+τDhs(π(k+1),π~(k+1);ξ^(k+1))\displaystyle\leq\Big{\|}Q_{\tau}^{\star}-Q_{\tau}^{(k+1)}\Big{\|}_{\infty}+\tau D_{h_{s}}(\pi^{(k+1)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)})
+τDhs(π~(k+1),πτ;gτ)+τπ(k+1)(s)π~(k+1)(s),ξ^(k+1)(s)gτ(s),\displaystyle\qquad+\tau D_{h_{s}}(\widetilde{\pi}^{(k+1)},\pi_{\tau}^{\star};g_{\tau}^{\star})+\tau\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),\widehat{\xi}^{(k+1)}(s)-g_{\tau}^{\star}(s)\big{\rangle},

where the third step results from the standard three-point lemma. To control the second term, we rearrange terms in (68) and arrive at

ε𝗈𝗉𝗍\displaystyle\varepsilon_{\mathsf{opt}} Q^τ(k)(s),π(k+1)(s)+τhs(π(k+1)(s))+1ηDhs(π(k+1),π(k);ξ^(k))\displaystyle\geq-\big{\langle}\widehat{Q}_{\tau}^{(k)}(s),\pi^{(k+1)}(s)\big{\rangle}+\tau h_{s}\big{(}\pi^{(k+1)}(s)\big{)}+\frac{1}{\eta}D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}
+Q^τ(k)(s),π~(k+1)(s)τhs(π~(k+1)(s))1ηDhs(π~(k+1),π(k);ξ^(k))\displaystyle\qquad+\big{\langle}\widehat{Q}_{\tau}^{(k)}(s),\widetilde{\pi}^{(k+1)}(s)\big{\rangle}-\tau h_{s}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}-\frac{1}{\eta}D_{h_{s}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}
=Q^τ(k)(s),π~(k+1)(s)π(k+1)(s)+1+ητη(hs(π(k+1)(s))hs(π~(k+1)(s)))\displaystyle=\big{\langle}\widehat{Q}_{\tau}^{(k)}(s),\widetilde{\pi}^{(k+1)}(s)-\pi^{(k+1)}(s)\big{\rangle}+\frac{1+\eta\tau}{\eta}\big{(}h_{s}(\pi^{(k+1)}(s))-h_{s}(\widetilde{\pi}^{(k+1)}(s))\big{)}
+1ηξ^(k)(s),π~(k+1)(s)π(k+1)(s)\displaystyle\qquad+\frac{1}{\eta}\big{\langle}\widehat{\xi}^{(k)}(s),\widetilde{\pi}^{(k+1)}(s)-\pi^{(k+1)}(s)\big{\rangle}
=1+ητηDhs(π(k+1),π~(k+1);ξ^(k+1)).\displaystyle=\frac{1+\eta\tau}{\eta}D_{h_{s}}(\pi^{(k+1)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}).

For the remaining terms, recall that \widehat{\xi}^{(k+1)}-c_{s}^{(k+1)}1\in\partial h_{s}(\widetilde{\pi}^{(k+1)}(s)) for some constant c_{s}^{(k+1)}. So we have

τDhs(π~(k+1),πτ;gτ)+τπ(k+1)(s)π~(k+1)(s),ξ^(k+1)(s)gτ(s)\displaystyle\tau D_{h_{s}}(\widetilde{\pi}^{(k+1)},\pi_{\tau}^{\star};g_{\tau}^{\star})+\tau\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),\widehat{\xi}^{(k+1)}(s)-g_{\tau}^{\star}(s)\big{\rangle}
=τhs(π~(k+1)(s))τhs(πτ(s))π~(k+1)(s)πτ(s),Qτ(s)+τπ(k+1)(s)π~(k+1)(s),ξ^(k+1)(s)gτ(s)\displaystyle=\tau h_{s}(\widetilde{\pi}^{(k+1)}(s))-\tau h_{s}(\pi_{\tau}^{\star}(s))-\big{\langle}\widetilde{\pi}^{(k+1)}(s)-\pi_{\tau}^{\star}(s),Q_{\tau}^{\star}(s)\big{\rangle}+\tau\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),\widehat{\xi}^{(k+1)}(s)-g_{\tau}^{\star}(s)\big{\rangle}
πτ(s)π~(k+1)(s),Qτ(s)τξ^(k+1)π(k+1)(s)π~(k+1)(s),Qτ(s)τξ^(k+1)(s)\displaystyle\leq\big{\langle}\pi_{\tau}^{\star}(s)-\widetilde{\pi}^{(k+1)}(s),Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}\big{\rangle}-\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}(s)\big{\rangle}
=πτ(s)π(k+1)(s),Qτ(s)τξ^(k+1)\displaystyle=\big{\langle}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s),Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}\big{\rangle}
2Qτ(s)τξ^(k+1)(s).\displaystyle\leq 2\big{\|}Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}(s)\big{\|}_{\infty}.

Taken together, we conclude that

Vτ(s)Vτ(k+1)(s)\displaystyle V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s) QτQτ(k+1)+2Qτ(s)τξ^(k+1)(s)+ητ1+ητε𝗈𝗉𝗍\displaystyle\leq\Big{\|}Q_{\tau}^{\star}-Q_{\tau}^{(k+1)}\Big{\|}_{\infty}+2\big{\|}Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}(s)\big{\|}_{\infty}+\frac{\eta\tau}{1+\eta\tau}\varepsilon_{\mathsf{opt}}
(γ+2)[λ1k(QτQτ(0)+2αQττξ^(0))+C]+ητ1+ητε𝗈𝗉𝗍.\displaystyle\leq(\gamma+2)\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+C\right]+\frac{\eta\tau}{1+\eta\tau}\varepsilon_{\mathsf{opt}}.

Finally, plugging in the choices of c_{1} and c_{2} (cf. (77)), we have C\leq C_{2} when \{h_{s}\} is convex, and C\leq C_{3} when \{h_{s}\} is 1-strongly convex w.r.t. the \ell_{1} norm. In addition, for the latter case, we can follow an argument similar to the one for (46) to demonstrate that

πτ(s)π~τ(k)(s)1\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\widetilde{\pi}_{\tau}^{(k)}(s)\big{\|}_{1} τ1((1α)γ+α)k(QτQτ(0)+2αQττξ(0))+τ1C3,\displaystyle\leq\tau^{-1}\big{(}(1-\alpha)\gamma+\alpha\big{)}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}\right)+\tau^{-1}C_{3},

which taken together with (74) gives

πτ(s)πτ(k)(s)1\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\pi_{\tau}^{(k)}(s)\big{\|}_{1} πτ(s)π~τ(k)(s)1+πτ(k)(s)π~τ(k)(s)1\displaystyle\leq\big{\|}\pi_{\tau}^{\star}(s)-\widetilde{\pi}_{\tau}^{(k)}(s)\big{\|}_{1}+\big{\|}\pi_{\tau}^{(k)}(s)-\widetilde{\pi}_{\tau}^{(k)}(s)\big{\|}_{1}
τ1((1α)γ+α)k(QτQτ(0)+2αQττξ(0))\displaystyle\leq\tau^{-1}\big{(}(1-\alpha)\gamma+\alpha\big{)}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}\right)
+τ1C3+2ηε𝗈𝗉𝗍1+ητ.\displaystyle\quad+\tau^{-1}C_{3}+\sqrt{\frac{2\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}}.

This concludes the proof.

Appendix C Adaptive GPMD

In this section, we present adaptive GPMD, an adaptive variant of GPMD that computes optimal policies of the original MDP without the need of specifying the regularization parameter \tau in advance. In a nutshell, the proposed adaptive GPMD algorithm proceeds in stages. In the i-th stage, we execute GPMD (i.e., Algorithm 1) with regularization parameter \tau_{i} for T_{i}+1 iterations. In what follows, we shall denote by \pi_{i}^{(t)} and \xi_{i}^{(t)} the t-th iterates in the i-th stage. At the end of each stage, adaptive GPMD halves the regularization parameter \tau and, in the meantime, doubles \xi_{i}^{(T_{i}+1)} (i.e., the auxiliary vector corresponding to some subgradient up to global shift), using the result as the initial vector \xi_{i+1}^{(0)} for the next stage. To ensure that \xi_{i+1}^{(0)}(s) still lies within the set of subgradients \partial h_{s}\big{(}\pi_{i+1}^{(0)}(s)\big{)} up to global shift, we solve the sub-optimization problem (87) to obtain \pi_{i+1}^{(0)} as the initial policy iterate for the next stage. The whole procedure is summarized in Algorithm 3.

1 Input: learning rate \eta>0.
2 Initialize \tau_{0}=1,\ \xi_{0}^{(0)}(s,a)=0 for all s\in\mathcal{S},a\in\mathcal{A}. Choose \pi_{0}^{(0)} to be the minimizer of the following problems:
\pi_{0}^{(0)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}h_{s}(p),\qquad\forall s\in\mathcal{S}.
3 for stage i=0,1,\cdots, do
4       Call Algorithm 1 with regularization parameter \tau_{i}, learning rate \eta, and initialization \pi_{i}^{(0)} and \xi_{i}^{(0)}. Obtain \pi_{i}^{(T_{i}+1)} and \xi_{i}^{(T_{i}+1)} with T_{i}=\big{\lceil}\frac{1+\eta\tau_{i}}{(1-\gamma)\eta\tau_{i}}\log\frac{8}{1-\gamma}\big{\rceil}, where \lceil\cdot\rceil is the ceiling function.
5      Set \tau_{i+1}=\tau_{i}/2, \xi_{i+1}^{(0)}=2\xi_{i}^{(T_{i}+1)}, and choose \pi_{i+1}^{(0)} to be the minimizer of the following problems:
\displaystyle\pi_{i+1}^{(0)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}-\langle\xi_{i+1}^{(0)}(s),p\rangle+h_{s}(p),\qquad\forall s\in\mathcal{S}. (87)
Algorithm 3 Adaptive GPMD
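
To make the stage-based structure concrete, we include a schematic Python sketch of Algorithm 3 below. It is a sketch rather than our implementation: the callables run_gpmd (one call to Algorithm 1) and prox_h (the per-state minimization in (87)) are hypothetical placeholders to be supplied for the regularizer of interest, and \xi is assumed to be stored as a numpy-style array so that doubling it amounts to scalar multiplication.

    import math

    def adaptive_gpmd(run_gpmd, prox_h, pi0, xi0, eta, gamma, num_stages):
        """Stage-based wrapper mirroring Algorithm 3.

        run_gpmd(tau, eta, pi, xi, T) -> (pi, xi): runs Algorithm 1 for T + 1 iterations.
        prox_h(xi) -> pi: solves (87), i.e. argmin_p -<xi(s), p> + h_s(p) over the simplex,
                          for every state s.
        """
        tau, pi, xi = 1.0, pi0, xi0                  # tau_0 = 1 (line 2 of Algorithm 3)
        for _ in range(num_stages):
            # Stage length T_i as specified in line 4 of Algorithm 3.
            T = math.ceil((1 + eta * tau) / ((1 - gamma) * eta * tau)
                          * math.log(8.0 / (1.0 - gamma)))
            pi, xi = run_gpmd(tau, eta, pi, xi, T)   # obtain pi_i^{(T_i+1)}, xi_i^{(T_i+1)}
            tau, xi = tau / 2.0, 2.0 * xi            # halve tau, double xi (line 5)
            pi = prox_h(xi)                          # re-anchor the policy via (87)
        return pi

Here the inputs pi0 and xi0 correspond to line 2 of Algorithm 3, namely \xi_{0}^{(0)}=0 and \pi_{0}^{(0)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}h_{s}(p).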

To help characterize the discrepancy of the value functions due to regularization, we assume boundedness of the regularizers \{h_{s}\} as follows.

Assumption 4.

Suppose that there exists some quantity B>1 such that |h_{s}(p)|\leq B holds for all p\in\Delta(\mathcal{A}) and all s\in\mathcal{S}.

The following theorem demonstrates that Algorithm 3 is capable of finding an \varepsilon-optimal policy for the unregularized MDP within an order of \log\frac{1}{\varepsilon} stages. To simplify notation, we abbreviate Q_{\tau_{i}}^{\star}, Q_{\tau_{i}}^{\pi_{i}^{(T)}} and V_{\tau_{i}}^{\pi_{i}^{(T)}} as Q_{i}^{\star}, Q^{(T)}_{i} and V^{(T)}_{i}, respectively, as long as it is clear from the context.

Theorem 3.

Suppose that Assumptions 1 and 4 hold. For any learning rate \eta>0 and any stage i\geq 0, the iterates of Algorithm 3 satisfy

\big{\|}Q^{\star}-Q^{\pi_{i}^{(T_{i}+1)}}\big{\|}_{\infty}\leq\frac{3\tau_{i}B}{1-\gamma}=\frac{3B}{(1-\gamma)2^{i}}.

As a direct implication of Theorem 3, it suffices to run Algorithm 3 with S=\mathcal{O}(\log\frac{B}{(1-\gamma)\varepsilon}) stages, resulting in a total iteration complexity of at most

\displaystyle\sum_{i=0}^{S}T_{i}=\mathcal{O}\left(\left(\frac{1}{1-\gamma}\log\frac{B}{(1-\gamma)\varepsilon}+\frac{B}{(1-\gamma)^{2}\eta\varepsilon}\right)\log\frac{1}{1-\gamma}\right). (88)

In comparison, we recall from Theorem 1 that directly running GPMD with regularization parameter \tau={(1-\gamma)\varepsilon}/{B} leads to an iteration complexity of

\displaystyle\mathcal{O}\Big{(}\Big{(}\frac{1}{1-\gamma}+\frac{B}{(1-\gamma)^{2}\eta\varepsilon}\Big{)}\log\frac{B}{(1-\gamma)\varepsilon}\Big{)}. (89)

When focusing on the term \widetilde{\mathcal{O}}(\frac{B}{(1-\gamma)^{2}\eta\varepsilon}), (88) improves upon (89) by a factor of \frac{\log\frac{B}{(1-\gamma)\varepsilon}}{\log\frac{1}{1-\gamma}}.
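
As a rough numerical illustration of this comparison (with arbitrary example values of B, \gamma and \varepsilon, and ignoring absolute constants), the following snippet evaluates the two logarithmic factors:

    import math

    # Arbitrary example values; only the logarithmic factors in (88) and (89) are compared.
    B_bound, gamma, eps = 10.0, 0.99, 1e-4
    log_direct = math.log(B_bound / ((1 - gamma) * eps))     # log factor appearing in (89)
    log_adaptive = math.log(1.0 / (1 - gamma))               # log factor appearing in (88)
    print("improvement factor ~", log_direct / log_adaptive)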

Proof of Theorem 3.

To begin with, we make note of the fact that, for any \tau,\tau^{\prime}>0,

\big{\|}Q^{\star}_{\tau}-Q^{\star}_{\tau^{\prime}}\big{\|}_{\infty}=\Big{\|}\max_{\pi}Q^{\pi}_{\tau}-\max_{\pi}Q^{\pi}_{\tau^{\prime}}\Big{\|}_{\infty}\leq\max_{\pi}\big{\|}Q^{\pi}_{\tau}-Q^{\pi}_{\tau^{\prime}}\big{\|}_{\infty}\leq\frac{|\tau-\tau^{\prime}|B}{1-\gamma}. (90)

It then follows that

\displaystyle\big{\|}Q^{\star}-Q^{\pi_{i}^{(T_{i}+1)}}\big{\|}_{\infty} \displaystyle\leq\big{\|}Q^{\star}-Q_{i}^{\star}\big{\|}_{\infty}+\big{\|}Q^{\pi_{i}^{(T_{i}+1)}}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty}+\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty} (91)
\displaystyle\leq\frac{2\tau_{i}B}{1-\gamma}+\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty}.

Next, we demonstrate how to control \|Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\|_{\infty}. The definition of \pi_{i}^{(0)} implies the existence of some constant c_{s}^{(i,0)} such that

\displaystyle\xi_{i}^{(0)}(s,\cdot)-c_{s}^{(i,0)}1\in\partial h_{s}\big{(}\pi_{i}^{(0)}(s)\big{)}. (92)

By invoking the convergence results of GPMD (cf. (44)), we obtain: for all i\geq 0,

\displaystyle\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\gamma\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\left(\big{\|}Q^{\star}_{i}-{Q}_{i}^{(0)}\big{\|}_{\infty}+2\alpha_{i}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}\right), (93a)
\displaystyle\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\left(\big{\|}Q^{\star}_{i}-{Q}_{i}^{(0)}\big{\|}_{\infty}+2\alpha_{i}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}\right), (93b)

where \alpha_{i}=\frac{1}{1+\eta\tau_{i}}. To proceed, we follow arguments similar to those in (45) and show that

\displaystyle V^{\star}_{i}(s)-V^{(0)}_{i}(s) \displaystyle=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s\sim d^{\pi}_{s^{\prime}}}\Big{[}\left\langle\pi^{\star}_{\tau_{i}}(s)-\pi_{i}^{(0)}(s),Q^{\star}_{i}(s)\right\rangle+\tau_{i}\big{(}h_{s}(\pi_{i}^{(0)})-h_{s}(\pi^{\star}_{\tau_{i}})\big{)}\Big{]}
\displaystyle\leq\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s\sim d^{\pi}_{s^{\prime}}}\Big{[}\left\langle\pi^{\star}_{\tau_{i}}(s)-\pi_{i}^{(0)}(s),Q^{\star}_{i}(s)\right\rangle-\tau_{i}\Big{\langle}\pi^{\star}_{\tau_{i}}(s)-\pi_{i}^{(0)}(s),\xi_{i}^{(0)}\Big{\rangle}\Big{]}
\displaystyle\leq\frac{2}{1-\gamma}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty},

where the first step invokes the regularized performance difference lemma (Lemma 5). It then follows that

\displaystyle\big{\|}Q^{\star}_{i}-{Q}_{i}^{(0)}\big{\|}_{\infty}\leq\gamma\big{\|}V^{\star}_{i}-V^{(0)}_{i}\big{\|}_{\infty}\leq\frac{2\gamma}{1-\gamma}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}. (94)

Substitution of (94) into (93) gives

\displaystyle\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\frac{2\gamma}{1-\gamma}\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}, (95a)
\displaystyle\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\frac{2}{1-\gamma}\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}. (95b)

Next, we aim to prove by induction that \big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}\leq\frac{2\tau_{i}B}{1-\gamma}. Clearly, this claim holds trivially for the base case with i=0. Now, supposing that the claim holds for some i\geq 0, we would like to prove it for i+1 as well. Towards this end, observe that

\displaystyle\big{\|}Q^{\star}_{i+1}-\tau_{i+1}\xi_{i+1}^{(0)}\big{\|}_{\infty} \displaystyle\leq\big{\|}Q^{\star}_{i+1}-Q^{\star}_{i}\big{\|}_{\infty}+\big{\|}Q^{\star}_{i}-\tau_{i+1}\xi_{i+1}^{(0)}\big{\|}_{\infty}
\displaystyle\leq\frac{\tau_{i+1}B}{1-\gamma}+\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(T_{i}+1)}\big{\|}_{\infty}
\displaystyle\leq\frac{\tau_{i+1}B}{1-\gamma}+\frac{2}{1-\gamma}\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}
\displaystyle\leq\frac{\tau_{i+1}B}{1-\gamma}\left(1+\frac{8}{1-\gamma}\left(1-\frac{(1-\gamma)\eta\tau_{i}}{1+\eta\tau_{i}}\right)^{T_{i}}\right).

When T_{i}\geq\lceil\frac{1+\eta\tau_{i}}{\eta\tau_{i}(1-\gamma)}\log\frac{8}{1-\gamma}\rceil, we arrive at

\big{\|}Q^{\star}_{i+1}-\tau_{i+1}\xi_{i+1}^{(0)}\big{\|}_{\infty}\leq\frac{2\tau_{i+1}B}{1-\gamma},

which verifies the claim for i+1. Substitution back into (95) leads to

\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty}\leq\frac{2\gamma}{1-\gamma}\left(1-\frac{(1-\gamma)\eta\tau_{i}}{1+\eta\tau_{i}}\right)^{T_{i}}\frac{2\tau_{i}B}{1-\gamma}\leq\frac{\tau_{i}B}{1-\gamma}. (96)

Combining (96) with (91) concludes the proof. ∎

References

  • Abbasi-Yadkori et al., (2019) Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. (2019). Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702. PMLR.
  • Agarwal et al., (2019) Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. Technical report.
  • Agarwal et al., (2020) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2020). Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR.
  • Agazzi and Lu, (2020) Agazzi, A. and Lu, J. (2020). Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858.
  • Amodei et al., (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete problems in AI safety.
  • Beck, (2017) Beck, A. (2017). First-order methods in optimization. SIAM.
  • Beck and Teboulle, (2003) Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175.
  • Bertsekas, (2017) Bertsekas, D. P. (2017). Dynamic programming and optimal control (4th edition). Athena Scientific.
  • Bhandari and Russo, (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
  • Bhandari and Russo, (2020) Bhandari, J. and Russo, D. (2020). A note on the linear convergence of policy gradient methods. arXiv preprint arXiv:2007.11120.
  • (11) Cen, S., Chen, F., and Chi, Y. (2022a). Independent natural policy gradient methods for potential games: Finite-time global convergence with entropy regularization. arXiv preprint arXiv:2204.05466.
  • (12) Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. (2022b). Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578.
  • (13) Cen, S., Chi, Y., Du, S. S., and Xiao, L. (2022c). Faster last-iterate convergence of policy optimization in zero-sum markov games. arXiv preprint arXiv:2210.01050.
  • Cen et al., (2021) Cen, S., Wei, Y., and Chi, Y. (2021). Fast policy extragradient methods for competitive games with entropy regularization. In Advances in Neural Information Processing Systems, volume 34, pages 27952–27964.
  • (15) Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. (2018a). A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8103–8112.
  • (16) Chow, Y., Nachum, O., and Ghavamzadeh, M. (2018b). Path consistency learning in Tsallis entropy regularized MDPs. In International Conference on Machine Learning, pages 979–988. PMLR.
  • Dai et al., (2018) Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. (2018). SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1125–1134. PMLR.
  • Ding et al., (2021) Ding, D., Wei, X., Yang, Z., Wang, Z., and Jovanovic, M. (2021). Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304–3312. PMLR.
  • Duchi et al., (2010) Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. (2010). Composite objective mirror descent. In COLT, pages 14–26. Citeseer.
  • Efroni et al., (2020) Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.
  • Fazel et al., (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476.
  • Geist et al., (2019) Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169.
  • Haarnoja et al., (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR.
  • Hao et al., (2021) Hao, B., Lazic, N., Abbasi-Yadkori, Y., Joulani, P., and Szepesvári, C. (2021). Adaptive approximate policy iteration. In International Conference on Artificial Intelligence and Statistics, pages 523–531. PMLR.
  • Kakade, (2002) Kakade, S. M. (2002). A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems, pages 1531–1538.
  • Khodadadian et al., (2021) Khodadadian, S., Jhunjhunwala, P. R., Varma, S. M., and Maguluri, S. T. (2021). On the linear convergence of natural policy gradient algorithm. arXiv preprint arXiv:2105.01424.
  • Kiwiel, (1997) Kiwiel, K. C. (1997). Proximal minimization methods with generalized Bregman functions. SIAM journal on control and optimization, 35(4):1142–1168.
  • Lan, (2022) Lan, G. (2022). Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming.
  • Lan et al., (2011) Lan, G., Lu, Z., and Monteiro, R. D. (2011). Primal-dual first-order methods with {O(1/\varepsilon)} iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29.
  • Lan and Zhou, (2018) Lan, G. and Zhou, Y. (2018). An optimal randomized incremental gradient method. Mathematical programming, 171(1):167–215.
  • Lazic et al., (2021) Lazic, N., Yin, D., Abbasi-Yadkori, Y., and Szepesvari, C. (2021). Improved regret bound and experience replay in regularized policy iteration. In International Conference on Machine Learning, pages 6032–6042. PMLR.
  • Lee et al., (2018) Lee, K., Choi, S., and Oh, S. (2018). Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473.
  • Lee et al., (2019) Lee, K., Kim, S., Lim, S., Choi, S., and Oh, S. (2019). Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning. arXiv preprint arXiv:1902.00137.
  • Li et al., (2023) Li, G., Wei, Y., Chi, Y., and Chen, Y. (2023). Softmax policy gradient methods can take exponential time to converge. Mathematical Programming. To appear.
  • Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10565–10576.
  • Liu et al., (2020) Liu, Y., Zhang, K., Basar, T., and Yin, W. (2020). An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33.
  • Mei et al., (2021) Mei, J., Gao, Y., Dai, B., Szepesvari, C., and Schuurmans, D. (2021). Leveraging non-uniformity in first-order non-convex optimization. In International Conference on Machine Learning, pages 7555–7564. PMLR.
  • (38) Mei, J., Xiao, C., Dai, B., Li, L., Szepesvári, C., and Schuurmans, D. (2020a). Escaping the gravitational pull of softmax. Advances in Neural Information Processing Systems, 33.
  • (39) Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020b). On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR.
  • Mnih et al., (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • Moldovan and Abbeel, (2012) Moldovan, T. M. and Abbeel, P. (2012). Safe exploration in Markov decision processes.
  • Nemirovsky and Yudin, (1983) Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.
  • Neu et al., (2017) Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
  • Puterman, (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shani et al., (2019) Shani, L., Efroni, Y., and Mannor, S. (2019). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. arXiv preprint arXiv:1909.02769.
  • Sutton et al., (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
  • Tomar et al., (2020) Tomar, M., Shani, L., Efroni, Y., and Ghavamzadeh, M. (2020). Mirror descent policy optimization. arXiv preprint arXiv:2005.09814.
  • Tsallis, (1988) Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of statistical physics, 52(1):479–487.
  • Vieillard et al., (2020) Vieillard, N., Kozuno, T., Scherrer, B., Pietquin, O., Munos, R., and Geist, M. (2020). Leverage the average: an analysis of KL regularization in reinforcement learning. In NeurIPS-34th Conference on Neural Information Processing Systems.
  • Wang et al., (2019) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations.
  • Wang et al., (2021) Wang, W., Han, J., Yang, Z., and Wang, Z. (2021). Global convergence of policy gradient for linear-quadratic mean-field control/game in continuous time. In International Conference on Machine Learning, pages 10772–10782. PMLR.
  • Williams, (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
  • Williams and Peng, (1991) Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.
  • Xiao, (2022) Xiao, L. (2022). On the convergence rates of policy gradient methods. arXiv preprint arXiv:2201.07443.
  • Xu et al., (2019) Xu, P., Gao, F., and Gu, Q. (2019). Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations.
  • Xu et al., (2020) Xu, T., Liang, Y., and Lan, G. (2020). A primal approach to constrained policy optimization: Global optimality and finite-time analysis. arXiv preprint arXiv:2011.05869.
  • Yu et al., (2019) Yu, M., Yang, Z., Kolar, M., and Wang, Z. (2019). Convergent policy optimization for safe reinforcement learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 3127–3139.
  • (60) Zhang, J., Kim, J., O’Donoghue, B., and Boyd, S. (2020a). Sample efficient reinforcement learning with REINFORCE. arXiv preprint arXiv:2010.11364.
  • (61) Zhang, J., Koppel, A., Bedi, A. S., Szepesvári, C., and Wang, M. (2020b). Variational policy gradient method for reinforcement learning with general utilities. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 4572–4583.
  • (62) Zhang, J., Ni, C., Szepesvari, C., Wang, M., et al. (2021a). On the convergence and sample efficiency of variance-reduced policy gradient method. In Advances in Neural Information Processing Systems, volume 34, pages 2228–2240.
  • (63) Zhang, K., Hu, B., and Basar, T. (2021b). Policy optimization for \mathcal{H}_{2} linear control with \mathcal{H}_{\infty} robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081–4109.
  • Zhao et al., (2022) Zhao, Y., Tian, Y., Lee, J., and Du, S. (2022). Provably efficient policy optimization for two-player zero-sum markov games. In International Conference on Artificial Intelligence and Statistics, pages 2736–2761. PMLR.