
Policy Mirror Descent for Regularized Reinforcement Learning:
A Generalized Framework with Linear Convergence

Wenhao Zhan*
Department of Electrical and Computer Engineering, Princeton University
   Shicong Cen*
Department of Electrical and Computer Engineering, Carnegie Mellon University
   Baihe Huang
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
   Yuxin Chen
Department of Statistics and Data Science, Wharton School, University of Pennsylvania
   Jason D. Lee
Department of Electrical and Computer Engineering, Princeton University
   Yuejie Chi
Department of Electrical and Computer Engineering, Carnegie Mellon University
(May 2021;   Final: January 2023)
*The first two authors contributed equally.
Abstract

Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.

Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (Lan, 2022), our algorithm accommodates a general class of convex regularizers and promotes the use of a Bregman divergence tailored to the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over the entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.

Keywords: policy mirror descent, Bregman divergence, regularization, nonsmooth, policy optimization

1 Introduction

Policy optimization lies at the heart of recent successes of reinforcement learning (RL) (Mnih et al., 2015). In its basic form, the optimal policy of interest, or a suitably parameterized version thereof, is learned by attempting to maximize the value function of a Markov decision process (MDP). For the most part, the maximization step is carried out by means of first-order optimization algorithms amenable to large-scale applications, whose foundations were set forth in the early works of Williams, (1992) and Sutton et al., (2000). Widely adopted variants in modern practice include policy gradient (PG) methods (Sutton et al., 2000), natural policy gradient (NPG) methods (Kakade, 2002), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and soft actor-critic methods (Haarnoja et al., 2018), to name just a few. In comparison with model-based and value-based approaches, this family of policy-based algorithms offers a remarkably flexible framework that accommodates both continuous and discrete action spaces, and lends itself well to the incorporation of powerful function approximation schemes such as neural networks. In stark contrast to its practical success, however, theoretical understanding of policy optimization remains severely limited even for the tabular case, largely owing to the ubiquitous nonconvexity of the underlying objective function.

1.1 The role of regularization

In practice, there are often competing objectives and additional constraints that the agent has to deal with in conjunction with maximizing values, which motivate the studies of regularization techniques in RL. In what follows, we isolate a few representative examples.

  • Promoting exploration. In the face of large problem dimensions and complex dynamics, it is often desirable to maintain a suitable degree of randomness in the policy iterates, in order to encourage exploration and discourage premature convergence to sub-optimal policies. A popular strategy of this kind is to enforce entropy regularization (Williams and Peng,, 1991), which penalizes policies that are not sufficiently stochastic. Along similar lines, the Tsallis entropy regularization (Chow et al., 2018b, ; Lee et al.,, 2018) further promotes sparsity of the learned policy while encouraging exploration, ensuring that the resulting policy does not assign non-negligible probabilities to too many sub-optimal actions.

  • Safe RL. In a variety of application scenarios such as industrial robot arms and self-driving vehicles, the agents are required to operate safely, both for themselves and for their surroundings (Amodei et al., 2016; Moldovan and Abbeel, 2012); for example, certain actions might be strictly forbidden in some states. One way to incorporate such prescribed operational constraints is to add a regularizer (e.g., a properly chosen log barrier or indicator function tailored to the constraints) that explicitly accounts for them.

  • Cost-sensitive RL. In reality, different actions of an agent might incur drastically different costs even for the same state. This motivates the design of new objective functions that properly trade off the cumulative rewards against the accumulated cost, which often take the form of certain regularized value functions.

Viewed in this light, it is of immediate value to develop a unified framework towards understanding the capability and limitations of regularized policy optimization. While a recent line of works (Agarwal et al., 2020; Mei et al., 2020b; Cen et al., 2022b) has looked into specific types of regularization such as entropy regularization, existing convergence theory remains highly inadequate when it comes to a more general family of regularizers.

1.2 Main contributions

The current paper focuses on policy optimization for regularized RL in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, and reward function $r(\cdot,\cdot)$. The goal is to find an optimal policy that maximizes a regularized value function. Informally speaking, the regularized value function associated with a given policy $\pi$ takes the following form:

V_{\tau}^{\pi} = V^{\pi} - \tau\,\mathbb{E}\big[h_{s}\big(\pi(\cdot\,|\,s)\big)\big],

where $V^{\pi}$ denotes the original (unregularized) value function, $\tau>0$ is the regularization parameter, $h_{s}(\cdot)$ denotes a convex regularizer employed to regularize the policy in state $s$, and the expectation is taken over a certain marginal state distribution of the MDP (to be made precise in Section 2.1). It is noteworthy that this paper does not require the regularizer $h_{s}$ to be either strongly convex or smooth.

In order to maximize the regularized value function (9b), Lan, (2022) introduced a seminal algorithm called Policy Mirror Descent (PMD), which can be viewed as an adaptation of the mirror descent algorithm (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003) to the realm of policy optimization. In particular, PMD subsumes the natural policy gradient (NPG) method (Kakade, 2002) as a special case. To further generalize PMD (Lan, 2022), we propose an algorithm called Generalized Policy Mirror Descent (GPMD). In each iteration, the policy is updated for each state in parallel via a mirror-descent style update rule. In sharp contrast to Lan, (2022), which considered a generic Bregman divergence, our algorithm selects the Bregman divergence adaptively in accordance with the regularizer, which leads to complementary perspectives and insights. Several important features and theoretical appeals of GPMD are summarized as follows.

  • GPMD substantially broadens the range of (provably effective) algorithmic choices for regularized RL, and subsumes several well-known algorithms as special cases. For example, it reduces to regularized policy iteration (Geist et al.,, 2019) when the learning rate tends to infinity, and subsumes entropy-regularized NPG methods as special cases if we take the Bregman divergence to be the Kullback-Leibler (KL) divergence (Cen et al., 2022b, ).

  • Assuming exact policy evaluation and perfect policy update in each iteration, GPMD converges linearly—in a dimension-free fashion—over the entire range of the learning rate $\eta>0$. More precisely, it converges to an $\varepsilon$-optimal regularized Q-function in no more than an order of

    \frac{1+\eta\tau}{\eta\tau(1-\gamma)}\log\frac{1}{\varepsilon}

    iterations (up to some logarithmic factor). Encouragingly, this appealing feature is valid for a broad family of convex and possibly nonsmooth regularizers.

  • The intriguing convergence guarantees are robust in the face of inexact policy evaluation and imperfect policy updates, namely, the algorithm is guaranteed to converge linearly at the same rate until an error floor is hit. See Section 3.2 for details.

  • Numerical experiments are provided in Section 5 to demonstrate the practical applicability and appealing performance of the proposed GPMD algorithm.

Finally, we find it helpful to briefly compare the above findings with prior works. As soon as the learning rate satisfies $\eta\geq 1/\tau$, the iteration complexity of our algorithm is at most on the order of $\frac{1}{1-\gamma}\log\frac{1}{\varepsilon}$, thus matching that of regularized policy iteration (Geist et al., 2019). In comparison to Lan, (2022), our work sets forth a different framework for analyzing mirror-descent type algorithms for regularized policy optimization, generalizing and refining the approach in Cen et al., 2022b far beyond entropy regularization. When constant learning rates are employed, the linear convergence of PMD (Lan, 2022) critically requires the regularizer to be strongly convex, with only sublinear convergence theory established for merely convex regularizers. In contrast, we establish the linear convergence of GPMD under constant learning rates even in the absence of strong convexity. Furthermore, for the special case of entropy regularization, the stability analysis of GPMD also significantly improves over the prior art in Cen et al., 2022b, preventing the error floor from blowing up when the learning rate approaches zero, as well as incorporating the impact of optimization errors that were previously uncaptured. More detailed comparisons with Lan, (2022) and Cen et al., 2022b can be found in Section 3.

1.3 Related works

Before embarking on our algorithmic and theoretic developments, we briefly review a small sample of other related works.

Global convergence of policy gradient methods.

Recent years have witnessed a surge of activity towards understanding the global convergence properties of policy gradient methods and their variants for both continuous and discrete RL problems; examples include Fazel et al., (2018); Bhandari and Russo, (2019); Agarwal et al., (2020); Zhang et al., 2021b ; Wang et al., (2019); Mei et al., 2020a ; Bhandari and Russo, (2020); Khodadadian et al., (2021); Liu et al., (2020); Agazzi and Lu, (2020); Xu et al., (2019); Cen et al., 2022b ; Mei et al., (2021); Liu et al., (2019); Wang et al., (2021); Zhang et al., 2020a ; Zhang et al., 2021a ; Zhang et al., 2020b ; Shani et al., (2019), among others. Neu et al., (2017) provided the first interpretation of NPG methods as mirror descent (Nemirovsky and Yudin, 1983), thereby enabling the adaptation of techniques for analyzing mirror descent to the study of NPG-type algorithms such as TRPO (Shani et al., 2019; Tomar et al., 2020). It has been shown that the NPG method converges sub-linearly for unregularized MDPs with a fixed learning rate (Agarwal et al., 2020), and converges linearly if the learning rate is set adaptively (Khodadadian et al., 2021), via exact line search (Bhandari and Russo, 2020), or following a geometrically increasing schedule (Xiao, 2022). The global linear convergence of NPG holds more generally for an arbitrary fixed learning rate when entropy regularization is enforced (Cen et al., 2022b). Noteworthily, Li et al., (2023) established a lower bound indicating that softmax PG methods can take an exponential time—in the size of the state space—to converge, while the convergence rates of NPG-type methods are almost independent of the problem dimension. In addition, another line of recent works (Abbasi-Yadkori et al., 2019; Hao et al., 2021; Lazic et al., 2021) established regret bounds for approximate NPG methods—termed KL-regularized approximate policy iteration therein—for infinite-horizon undiscounted MDPs, which are beyond the scope of the current paper.

Regularization in RL.

Regularization has been introduced into the RL literature either through the lens of optimization (Dai et al., 2018; Agarwal et al., 2020), or through the lens of dynamic programming (Geist et al., 2019; Vieillard et al., 2020). Our work is clearly an instance of the former type. Several recent results in the literature merit particular attention: Agarwal et al., (2020) demonstrated sublinear convergence guarantees for PG methods in the presence of relative entropy regularization, Mei et al., 2020b established linear convergence of entropy-regularized PG methods, whereas Cen et al., 2022b derived an almost dimension-free linear convergence theory for NPG methods with entropy regularization. Most of the existing literature focused on entropy or KL-type regularization, and the study of general regularizers had been quite limited until the recent work of Lan, (2022). Regularized MDP problems are also closely related to the study of constrained MDPs, as both types of problems can be employed to model/promote constraint satisfaction in RL, as recently investigated in, e.g., Chow et al., 2018a ; Efroni et al., (2020); Ding et al., (2021); Yu et al., (2019); Xu et al., (2020). Note, however, that it is difficult to directly compare our algorithm with these methods, due to drastically different formulations and settings.

1.4 Notation

Let us introduce some notation that will be adopted throughout. For any set $\mathcal{A}$, we denote by $|\mathcal{A}|$ its cardinality, and let $\Delta(\mathcal{A})$ denote the probability simplex over $\mathcal{A}$. For any convex and differentiable function $h(\cdot)$, the Bregman divergence generated by $h(\cdot)$ is defined as

D_{h}(z,x) \coloneqq h(z) - h(x) - \big\langle \nabla h(x), z - x \big\rangle.    (1)

For any convex (but not necessarily differentiable) function $h(\cdot)$, we denote by $\partial h$ the subdifferential of $h$. Given two probability distributions $\pi_{1}$ and $\pi_{2}$ over $\mathcal{A}$, the KL divergence from $\pi_{2}$ to $\pi_{1}$ is defined as $\mathsf{KL}(\pi_{1}\,\|\,\pi_{2}) \coloneqq \sum_{a\in\mathcal{A}} \pi_{1}(a) \log\frac{\pi_{1}(a)}{\pi_{2}(a)}$. For any vectors $a=[a_{i}]_{1\leq i\leq n}$ and $b=[b_{i}]_{1\leq i\leq n}$, the notation $a\leq b$ (resp. $a\geq b$) means that $a_{i}\leq b_{i}$ (resp. $a_{i}\geq b_{i}$) for every $1\leq i\leq n$. We shall also use $1$ (resp. $0$) to denote the all-one (resp. all-zero) vector whenever it is clear from the context.
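To make the Bregman divergence (1) concrete, here is a minimal Python sketch (an illustration added for this exposition, not part of the original development; all helper names are ours): with the negative Shannon entropy as the generator $h$, the Bregman divergence between two distributions on the simplex recovers exactly the KL divergence defined above.

```python
import numpy as np

def neg_entropy(p):
    # h(p) = sum_i p_i log p_i (negative Shannon entropy)
    return np.sum(p * np.log(p))

def grad_neg_entropy(p):
    return np.log(p) + 1.0

def bregman(h, grad_h, z, x):
    # D_h(z, x) = h(z) - h(x) - <grad h(x), z - x>, cf. (1)
    return h(z) - h(x) - grad_h(x) @ (z - x)

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.4, 0.4, 0.2])
kl = np.sum(p1 * np.log(p1 / p2))            # KL(p1 || p2)
d = bregman(neg_entropy, grad_neg_entropy, p1, p2)
assert np.isclose(d, kl)                     # the two coincide on the simplex
```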

2 Model and algorithms

2.1 Problem settings

Markov decision process (MDP).

The focus of this paper is a discounted infinite-horizon Markov decision process, represented by $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$ (Bertsekas, 2017). Here, $\mathcal{S}\coloneqq\{1,\cdots,|\mathcal{S}|\}$ is the state space, $\mathcal{A}\coloneqq\{1,\cdots,|\mathcal{A}|\}$ is the action space, $\gamma\in[0,1)$ is the discount factor, $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the probability transition matrix (so that $P(\cdot\,|\,s,a)$ is the transition probability from state $s$ upon execution of action $a$), whereas $r:\mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function (so that $r(s,a)$ indicates the immediate reward received in state $s$ after action $a$ is executed). Here, we focus on finite-state and finite-action scenarios, meaning that both $|\mathcal{S}|$ and $|\mathcal{A}|$ are assumed to be finite. A policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ specifies a possibly randomized action selection rule, namely, $\pi(\cdot\,|\,s)$ represents the action selection probability in state $s$.

For any policy $\pi$, we define the associated value function $V^{\pi}:\mathcal{S}\to\mathbb{R}$ as follows:

\forall s\in\mathcal{S}: \qquad V^{\pi}(s) := \mathop{\mathbb{E}}\limits_{\substack{a_{t}\sim\pi(\cdot|s_{t}),\\ s_{t+1}\sim P(\cdot|s_{t},a_{t}),\ \forall t\geq 0}} \left[\sum_{t=0}^{\infty}\gamma^{t} r(s_{t},a_{t}) \ \Big|\ s_{0}=s\right],    (2)

which can be viewed as the utility function we wish to maximize. Here, the expectation is taken over the randomness of the MDP trajectory $\{(s_{t},a_{t})\}_{t\geq 0}$ induced by policy $\pi$. Similarly, when the initial action $a$ is fixed, we can define the action-value function (or Q-function) as follows:

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad Q^{\pi}(s,a) := \mathop{\mathbb{E}}\limits_{\substack{s_{t+1}\sim P(\cdot|s_{t},a_{t}),\\ a_{t+1}\sim\pi(\cdot|s_{t+1}),\ \forall t\geq 0}} \left[\sum_{t=0}^{\infty}\gamma^{t} r(s_{t},a_{t}) \ \Big|\ s_{0}=s, a_{0}=a\right].    (3)

As a well-known fact, the policy gradient of $V^{\pi}$ (w.r.t. the policy $\pi$) admits the following closed-form expression (Sutton et al., 2000):

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad \frac{\partial V^{\pi}(s_{0})}{\partial \pi(a\,|\,s)} = \frac{1}{1-\gamma}\, d_{s_{0}}^{\pi}(s)\, Q^{\pi}(s,a).    (4)

Here, $d_{s_{0}}^{\pi}\in\Delta(\mathcal{S})$ is the so-called discounted state visitation distribution defined as follows:

d_{s_{0}}^{\pi}(s) \coloneqq (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\, \mathbb{P}^{\pi}(s_{t}=s\,|\,s_{0}),    (5)

where $\mathbb{P}^{\pi}(s_{t}=s\,|\,s_{0})$ denotes the probability of $s_{t}=s$ when the MDP trajectory $\{s_{t}\}_{t\geq 0}$ is generated under policy $\pi$ given the initial state $s_{0}$.

Furthermore, the optimal value function and the optimal Q-function are defined and denoted by

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad V^{\star}(s) \coloneqq \max_{\pi} V^{\pi}(s), \qquad Q^{\star}(s,a) \coloneqq \max_{\pi} Q^{\pi}(s,a).    (6)

It is well known that there exists at least one optimal policy, denoted by $\pi^{\star}$, that simultaneously maximizes the value function and the Q-function for all state-action pairs (Agarwal et al., 2019).
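For a tabular MDP, the quantities (2)–(3) can be computed exactly by solving the linear Bellman equations; the following minimal sketch (ours, with a synthetic random MDP and illustrative helper names) makes this concrete and is implicitly what "exact policy evaluation" refers to later on.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact tabular policy evaluation.
    P: (S, A, S) transition probabilities; r: (S, A) rewards; pi: (S, A) policy."""
    S, A = r.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)          # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)                  # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi from the Bellman equation, cf. (2)
    Q = r + gamma * P @ V                          # Q^pi(s,a) = r(s,a) + gamma * E_{s'}[V^pi(s')], cf. (3)
    return V, Q

# a tiny synthetic MDP with 3 states and 2 actions
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))         # P[s, a, :] is a distribution over next states
r = rng.uniform(size=(3, 2))
pi = np.full((3, 2), 0.5)                          # uniform policy
V, Q = evaluate_policy(P, r, pi, gamma=0.9)
```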

Regularized MDP.

In practice, the agent is often asked to design policies that possess certain structural properties, in order to be cognizant of system constraints such as safety and operational constraints, as well as to encourage exploration during the optimization/learning stage. A natural strategy to achieve these goals is to resort to the following regularized value function w.r.t. a given policy $\pi$ (Neu et al., 2017; Mei et al., 2020b; Cen et al., 2022b; Lan, 2022):

\forall s\in\mathcal{S}: \qquad V^{\pi}_{\tau}(s) \coloneqq \mathop{\mathbb{E}}\limits_{\substack{a_{t}\sim\pi(\cdot|s_{t}),\\ s_{t+1}\sim P(\cdot|s_{t},a_{t}),\ \forall t\geq 0}} \left[\sum_{t=0}^{\infty}\gamma^{t}\Big\{r(s_{t},a_{t}) - \tau h_{s_{t}}\big(\pi(\cdot\,|\,s_{t})\big)\Big\} \ \Big|\ s_{0}=s\right]
= V^{\pi}(s) - \frac{\tau}{1-\gamma}\sum_{s^{\prime}\in\mathcal{S}} d_{s}^{\pi}(s^{\prime})\, h_{s^{\prime}}\big(\pi(\cdot\,|\,s^{\prime})\big),    (7)

where $h_{s}:\Delta_{\zeta}(\mathcal{A})\to\mathbb{R}$ stands for a convex and possibly nonsmooth regularizer for state $s$, $\tau>0$ denotes the regularization parameter, and $d_{s}^{\pi}(\cdot)$ is defined in (5). Here, for technical convenience, we assume throughout that $h_{s}(\cdot)$ ($s\in\mathcal{S}$) is well-defined over a “$\zeta$-neighborhood” of the probability simplex $\Delta(\mathcal{A})$ defined as follows:

\Delta_{\zeta}(\mathcal{A}) \coloneqq \left\{x=[x_{a}]_{a\in\mathcal{A}} \ \Big|\ x_{a}\geq 0 \text{ for all } a\in\mathcal{A};\ 1-\zeta\leq\sum_{a\in\mathcal{A}}x_{a}\leq 1+\zeta\right\},

where $\zeta>0$ can be an arbitrary constant. For instance, entropy regularization adopts the choice $h_{s}(p)=\sum_{i\in\mathcal{A}} p_{i}\log p_{i}$ for all $s\in\mathcal{S}$ and $p\in\Delta(\mathcal{A})$, which coincides with the negative Shannon entropy of a probability distribution. Similarly, KL regularization adopts the choice $h_{s}(p)=\mathsf{KL}(p\,\|\,p_{\mathsf{ref}})$, which penalizes distributions $p$ that deviate from the reference $p_{\mathsf{ref}}$. As another example, weighted $\ell_{1}$ regularization adopts the choice $h_{s}(p)=\sum_{i\in\mathcal{A}} w_{s,i} p_{i}$ for all $s\in\mathcal{S}$ and $p\in\Delta(\mathcal{A})$, where $w_{s,i}\geq 0$ is the cost of taking action $i$ in state $s$, so that the regularizer $h_{s}(\pi(\cdot\,|\,s))$ captures the expected cost of the policy $\pi$ in state $s$. Throughout this paper, we impose the following assumption.

Assumption 1.

Consider an arbitrarily small constant $\zeta>0$. For any $s\in\mathcal{S}$, suppose that $h_{s}(\cdot)$ is convex and

h_{s}(p) = \infty \qquad \text{for any } p\notin\Delta_{\zeta}(\mathcal{A}).    (8)
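For concreteness, the three regularizers mentioned above can be written down directly; the sketch below (ours, with illustrative function names) also enforces the extended-value convention (8) of Assumption 1 by returning $\infty$ outside $\Delta_{\zeta}(\mathcal{A})$.

```python
import numpy as np

def in_simplex_nbhd(p, zeta=1e-2):
    # membership in Delta_zeta(A): nonnegative entries, total mass within [1 - zeta, 1 + zeta]
    return np.all(p >= 0) and (1 - zeta) <= p.sum() <= (1 + zeta)

def neg_entropy(p, zeta=1e-2):
    # h_s(p) = sum_a p_a log p_a  (entropy regularization)
    if not in_simplex_nbhd(p, zeta):
        return np.inf
    return float(np.sum(p * np.log(np.maximum(p, 1e-300))))

def kl_to_ref(p, p_ref, zeta=1e-2):
    # h_s(p) = KL(p || p_ref)  (KL regularization toward a reference policy p_ref)
    if not in_simplex_nbhd(p, zeta):
        return np.inf
    return float(np.sum(p * (np.log(np.maximum(p, 1e-300)) - np.log(p_ref))))

def weighted_l1(p, w, zeta=1e-2):
    # h_s(p) = sum_a w_{s,a} p_a  (expected per-state action cost)
    if not in_simplex_nbhd(p, zeta):
        return np.inf
    return float(w @ p)
```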

Following the convention in prior literature (e.g., Mei et al., 2020b ), we also define the corresponding regularized Q-function as follows:

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad Q^{\pi}_{\tau}(s,a) := r(s,a) + \gamma \mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\big[V^{\pi}_{\tau}(s^{\prime})\big].    (9a)

As can be straightforwardly verified, one can also express $V^{\pi}_{\tau}$ in terms of $Q^{\pi}_{\tau}$ as

\forall s\in\mathcal{S}: \qquad V^{\pi}_{\tau}(s) := \mathop{\mathbb{E}}\limits_{a\sim\pi(\cdot|s)}\Big[Q^{\pi}_{\tau}(s,a) - \tau h_{s}\big(\pi(\cdot\,|\,s)\big)\Big].    (9b)

The optimal regularized value function VτV^{\star}_{\tau} and the corresponding optimal policy πτ\pi_{\tau}^{\star} are defined respectively as follows:

\forall s\in\mathcal{S}: \qquad V^{\star}_{\tau}(s) \coloneqq V^{\pi_{\tau}^{\star}}_{\tau}(s) = \max_{\pi} V^{\pi}_{\tau}(s), \qquad \pi_{\tau}^{\star} \coloneqq \arg\max_{\pi} V^{\pi}_{\tau}.    (10)

It is worth noting that Puterman, (2014) asserts the existence of an optimal policy $\pi_{\tau}^{\star}$ that achieves (10) simultaneously for all $s\in\mathcal{S}$. Correspondingly, we shall also define the resulting optimal regularized Q-function as

\forall (s,a)\in\mathcal{S}\times\mathcal{A}: \qquad Q^{\star}_{\tau}(s,a) = Q^{\pi_{\tau}^{\star}}_{\tau}(s,a).    (11)
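The regularized value functions defined in (7) and (9) can likewise be computed in closed form for a tabular MDP: $V^{\pi}_{\tau}$ solves a linear system, and $Q^{\pi}_{\tau}$ follows from one application of the transition model. The sketch below (ours, using the negative entropy as a representative choice of $h_{s}$) is what exact regularized policy evaluation amounts to in the algorithms of Section 2.2.

```python
import numpy as np

def evaluate_regularized_policy(P, r, pi, gamma, tau, h):
    """Exact evaluation of V_tau^pi and Q_tau^pi for a tabular MDP.
    P: (S, A, S); r: (S, A); pi: (S, A); h maps a distribution over actions to a scalar."""
    S, A = r.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    h_pi = np.array([h(pi[s]) for s in range(S)])
    # V_tau^pi solves V = (r_pi - tau * h_pi) + gamma * P_pi V, which combines (9a) and (9b)
    V_tau = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi - tau * h_pi)
    # Q_tau^pi(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V_tau^pi(s')], cf. (9a)
    Q_tau = r + gamma * P @ V_tau
    return V_tau, Q_tau

neg_entropy = lambda p: float(np.sum(p * np.log(np.maximum(p, 1e-300))))
```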

2.2 Algorithm: generalized policy mirror descent

Motivated by PMD (Lan, 2022), we put forward a generalization of PMD that selects the Bregman divergence in accordance with the regularizer in use. A thorough comparison with Lan, (2022) will be provided after introducing our generalized PMD algorithm.

Review: mirror descent (MD) for the composite model.

To better elucidate our algorithmic idea, let us first briefly review the design of classical mirror descent—originally proposed by Nemirovsky and Yudin, (1983)—in the optimization literature. Consider the following composite model:

\text{minimize}_{x} \quad F(x) \coloneqq f(x) + h(x),

where the objective function consists of two components. The first component $f(\cdot)$ is assumed to be differentiable, while the second component $h(\cdot)$ can be more general and is commonly employed to model regularizers. To solve this composite problem, one variant of mirror descent adopts the following update rule (see also Beck, (2017); Duchi et al., (2010)):

x^{(k+1)} = \arg\min_{x}\left\{ f\big(x^{(k)}\big) + \big\langle \nabla f(x^{(k)}), x \big\rangle + h(x) + \frac{1}{\eta} D_{h}\big(x, x^{(k)}\big)\right\},    (12)

where $\eta>0$ is the learning rate (or step size), and $D_{h}(\cdot,\cdot)$ is the Bregman divergence defined in (1). Note that the first term within the curly brackets of (12) can be safely discarded, as it is a constant given $x^{(k)}$. In words, the above update rule approximates $f(x)$ via its first-order Taylor expansion $f\big(x^{(k)}\big) + \big\langle \nabla f(x^{(k)}), x \big\rangle$ at the point $x^{(k)}$, employs the Bregman divergence $D_{h}$ to monitor the difference between the new iterate and the current iterate $x^{(k)}$, and attempts to optimize this (properly monitored) approximation instead. While one can further generalize the Bregman divergence to $D_{\omega}$ for a different generator $\omega$, we shall restrict attention to the case $h=\omega$ in the current paper.
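As a concrete instance of (12) (spelled out here for intuition, under the assumption that the iterates stay on the probability simplex), take $h$ to be the negative Shannon entropy, so that $D_{h}$ is the KL divergence. Writing $g^{(k)}\coloneqq\nabla f(x^{(k)})$, the first-order optimality condition of (12) on the simplex yields the closed-form multiplicative update

x^{(k+1)}_{i} \;\propto\; \big(x^{(k)}_{i}\big)^{\frac{1}{1+\eta}} \exp\Big(-\frac{\eta}{1+\eta}\, g^{(k)}_{i}\Big), \qquad i=1,\ldots,n,

which interpolates between staying at $x^{(k)}$ (as $\eta\to 0$) and a pure softmax of $-g^{(k)}$ (as $\eta\to\infty$); the GPMD update introduced below inherits exactly this structure when $h_{s}$ is the negative Shannon entropy.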

The proposed algorithm.

We are now ready to present the algorithm we come up with, which is an extension of the PMD algorithm (Lan, 2022). For notational simplicity, we shall write

V^{(k)}_{\tau} \coloneqq V^{\pi^{(k)}}_{\tau}, \qquad Q^{(k)}_{\tau}(s,a) \coloneqq Q^{\pi^{(k)}}_{\tau}(s,a) \qquad\text{and}\qquad d^{(k)}_{s_{0}}(s) \coloneqq d^{\pi^{(k)}}_{s_{0}}(s)    (13)

throughout the paper, where $\pi^{(k)}$ denotes our policy estimate in the $k$-th iteration.

To begin with, suppose for simplicity that $h_{s}(\cdot)$ is differentiable everywhere. In the $k$-th iteration, a natural MD scheme that comes to mind for solving (7)—namely, $\text{maximize}_{\pi}\, V_{\tau}^{\pi}(s_{0})$ for a given initial state $s_{0}$—is the following update rule:

\pi^{(k+1)}(\cdot\,|\,s) = \arg\min_{p\in\Delta(\mathcal{A})}\left\{ -\Big\langle \nabla_{\pi(\cdot|s)} V_{\tau}^{\pi}(s_{0})\,\Big|_{\pi=\pi^{(k)}},\, p\Big\rangle + \frac{\tau}{1-\gamma} d_{s_{0}}^{(k)}(s)\, h_{s}(p) + \frac{1}{\eta^{\prime}} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s)\big)\right\}
= \arg\min_{p\in\Delta(\mathcal{A})}\left\{ \frac{1}{1-\gamma} d_{s_{0}}^{(k)}(s) \Big\{ -\big\langle Q_{\tau}^{(k)}(s,\cdot), p\big\rangle + \tau h_{s}(p)\Big\} + \frac{1}{\eta^{\prime}} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s)\big)\right\}
= \arg\min_{p\in\Delta(\mathcal{A})}\left\{ -\big\langle Q_{\tau}^{(k)}(s,\cdot), p\big\rangle + \tau h_{s}(p) + \frac{1}{\eta} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s)\big)\right\}    (14)

for every state $s\in\mathcal{S}$, which is a direct application of (12) to our setting. Here, we start with a learning rate $\eta^{\prime}$, and obtain the simplification in the last line by replacing $\eta^{\prime}$ with $\eta(1-\gamma)/d_{s_{0}}^{(k)}(s)$. Notably, the update strategy (14) is invariant to the initial state $s_{0}$, akin to natural policy gradient methods (Agarwal et al., 2020).

This update rule is well-defined when, say, $h_{s}$ is the negative entropy, since the algorithm guarantees $\pi^{(k)}>0$ at all times and hence $h_{s}$ is always differentiable at the $k$-th iterate (see Cen et al., 2022b ). In general, however, it is possible to encounter situations where the gradient of $h_{s}$ does not exist on the boundary (e.g., when $h_{s}$ represents a certain indicator function). To cope with such cases, we resort to a generalized version of the Bregman divergence (e.g., Kiwiel, (1997); Lan et al., (2011); Lan and Zhou, (2018)). To be specific, we replace the usual Bregman divergence $D_{h_{s}}(p,q)$ with the following metric

D_{h_{s}}(p,q;g_{s}) \coloneqq h_{s}(p) - h_{s}(q) - \langle g_{s}, p-q\rangle \geq 0,    (15)

where $g_{s}$ can be any vector falling within the subdifferential $\partial h_{s}(q)$. Here, the non-negativity in (15) follows directly from the definition of the subgradient of a convex function. The constraint on $g_{s}$ can be further relaxed by exploiting the requirement $p,q\in\Delta(\mathcal{A})$. In fact, for any vector $\xi_{s}=g_{s}-c_{s}1$ (with $c_{s}\in\mathbb{R}$ some constant and $1$ the all-one vector), one can readily see that

D_{h_{s}}(p,q;g_{s}) = h_{s}(p) - h_{s}(q) - \langle g_{s}, p-q\rangle = h_{s}(p) - h_{s}(q) - \langle \xi_{s}, p-q\rangle + c_{s}\langle 1, p-q\rangle
= h_{s}(p) - h_{s}(q) - \langle \xi_{s}, p-q\rangle = D_{h_{s}}(p,q;\xi_{s}),    (16)

where the last line is valid since $1^{\top}p = 1^{\top}q = 1$. As a result, everything boils down to identifying a vector $\xi_{s}$ that falls within $\partial h_{s}(q)$ up to a global shift.

Toward this end, we propose the following iterative rule for designing such a sequence of vectors as surrogates for the subgradients of $h_{s}$:

\xi^{(0)}(s,\cdot) \in \partial h_{s}\big(\pi^{(0)}(\cdot\,|\,s)\big);    (17a)
\xi^{(k+1)}(s,\cdot) = \frac{1}{1+\eta\tau}\xi^{(k)}(s,\cdot) + \frac{\eta}{1+\eta\tau} Q^{(k)}_{\tau}(s,\cdot), \qquad k\geq 0,    (17b)

where $\xi^{(k+1)}(s,\cdot)$ is a convex combination of the previous $\xi^{(k)}(s,\cdot)$ and $Q^{(k)}_{\tau}(s,\cdot)$, with more emphasis placed on $Q^{(k)}_{\tau}(s,\cdot)$ when the learning rate $\eta$ is large. As asserted by the following lemma, the vectors $\xi^{(k)}(s,\cdot)$ constructed above satisfy the desired property, i.e., they lie within the subdifferential of $h_{s}$ under suitable global shifts. It is worth mentioning that these global shifts $\{c_{s}^{(k)}\}$ only serve as an aid to better understand the construction, and are not required during the algorithm updates.

Lemma 1.

For all $k\geq 0$ and every $s\in\mathcal{S}$, there exists a quantity $c_{s}^{(k)}\in\mathbb{R}$ such that

\xi^{(k)}(s,\cdot) - c_{s}^{(k)} 1 \in \partial h_{s}\big(\pi^{(k)}(\cdot\,|\,s)\big).    (18)

In addition, for every $s\in\mathcal{S}$, there exists a quantity $c_{s}^{\star}\in\mathbb{R}$ such that

\tau^{-1} Q_{\tau}^{\star}(s,\cdot) - c_{s}^{\star} 1 \in \partial h_{s}\big(\pi_{\tau}^{\star}(\cdot\,|\,s)\big).    (19)
Proof.

See Appendix A.1. ∎

Thus far, we have presented all crucial ingredients of our algorithm. The whole procedure is summarized in Algorithm 1, and will be referred to as Generalized Policy Mirror Descent (GPMD) throughout the paper. Interestingly, several well-known algorithms can be recovered as special cases of GPMD:

  • When the Bregman divergence $D_{h_{s}}(\cdot,\cdot)$ is taken to be the KL divergence, GPMD reduces to the well-known NPG algorithm (Kakade, 2002) when $\tau=0$ (no regularization), and to the NPG algorithm with entropy regularization analyzed in Cen et al., 2022b when $h_{s}(\cdot)$ is taken to be the negative Shannon entropy.

  • When $\eta=\infty$ (no divergence), GPMD reduces to regularized policy iteration in Geist et al., (2019); in particular, GPMD reduces to the standard policy iteration algorithm if in addition $\tau=0$.

Input: initial policy iterate $\pi^{(0)}$, learning rate $\eta>0$.
Initialize $\xi^{(0)}$ so that $\xi^{(0)}(s,\cdot)\in\partial h_{s}\big(\pi^{(0)}(\cdot\,|\,s)\big)$ for all $s\in\mathcal{S}$.
for $k=0,1,\cdots$ do
      For every $s\in\mathcal{S}$, set

\pi^{(k+1)}(\cdot\,|\,s) = \arg\min_{p\in\Delta(\mathcal{A})}\left\{ -\big\langle Q_{\tau}^{(k)}(s,\cdot), p\big\rangle + \tau h_{s}(p) + \frac{1}{\eta} D_{h_{s}}\big(p, \pi^{(k)}(\cdot\,|\,s); \xi^{(k)}\big)\right\},    (20a)

      where

D_{h_{s}}\big(p,q;\xi\big) \coloneqq h_{s}(p) - h_{s}(q) - \big\langle \xi(s,\cdot), p-q\big\rangle.    (20b)

      For every $(s,a)\in\mathcal{S}\times\mathcal{A}$, compute

\xi^{(k+1)}(s,a) = \frac{1}{1+\eta\tau}\xi^{(k)}(s,a) + \frac{\eta}{1+\eta\tau} Q^{(k)}_{\tau}(s,a).    (20c)

end for
Algorithm 1: PMD with generalized Bregman divergence (GPMD)
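To make Algorithm 1 concrete, below is a minimal tabular sketch (our illustration, not the authors' reference implementation) for the special case where every $h_{s}$ is the negative Shannon entropy. In that case the subproblem (20a) admits the closed-form solution $\pi^{(k+1)}(\cdot\,|\,s)\propto\exp\big(\xi^{(k+1)}(s,\cdot)\big)$ with $\xi^{(k+1)}$ given by (20c), and exact policy evaluation amounts to solving the regularized Bellman equations.

```python
import numpy as np

def gpmd_entropy(P, r, gamma, tau, eta, num_iters=200):
    """GPMD (Algorithm 1) with h_s = negative Shannon entropy, tabular setting, exact evaluation.
    P: (S, A, S) transitions; r: (S, A) rewards."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)             # uniform initial policy pi^(0)
    xi = np.log(pi) + 1.0                     # xi^(0)(s, .) is a gradient of h_s at pi^(0), cf. (17a)
    for _ in range(num_iters):
        # exact regularized policy evaluation of Q_tau^(k), cf. (7), (9a), (9b)
        P_pi = np.einsum('sa,sap->sp', pi, P)
        r_pi = np.sum(pi * r, axis=1)
        h_pi = np.sum(pi * np.log(pi), axis=1)
        V_tau = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi - tau * h_pi)
        Q_tau = r + gamma * P @ V_tau
        # surrogate-subgradient update (20c); with entropy, (20a) reduces to a softmax of xi^(k+1)
        xi = (xi + eta * Q_tau) / (1.0 + eta * tau)
        logits = xi - xi.max(axis=1, keepdims=True)   # stabilize the softmax
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, Q_tau
```

As noted in the special cases above, this entropy instance coincides with entropy-regularized NPG; swapping in a different $h_{s}$ only changes the evaluation of $h_{s}(\pi(\cdot\,|\,s))$ and the solver used for the subproblem (20a).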
Comparison with PMD (Lan,, 2022).

Before continuing, let us take a moment to point out the key differences between our algorithm GPMD and the PMD algorithm proposed in Lan, (2022) in terms of algorithm design. Although the primary exposition of PMD in Lan, (2022) fixes the Bregman divergence to be the KL divergence, the algorithm also works in the presence of a generic Bregman divergence, whose relationship with the regularizer $h_{s}$ is, however, left unspecified. In contrast, GPMD adaptively sets this term to be the Bregman divergence generated by the regularizer $h_{s}$ in use, together with a carefully designed recursive update rule (cf. (17)) that computes surrogates for the subgradients of $h_{s}$ to facilitate implementation. Encouragingly, this specific choice leads to a tailored performance analysis of GPMD, which was not present in, and is instead complementary to, that of PMD (Lan, 2022). Indeed, our theory offers linear convergence guarantees for more general scenarios by adapting to the geometry of the regularizer $h_{s}$; details to follow momentarily.

3 Main results

This section presents our convergence guarantees for the GPMD method presented in Algorithm 1. We shall start with the idealized case assuming that the update rule can be precisely implemented, and then discuss how to generalize it to the scenario with imperfect policy evaluation.

3.1 Convergence of exact GPMD

To start with, let us pin down the convergence behavior of GPMD, assuming that the regularized Q-function $Q^{(k)}_{\tau}$ can be evaluated exactly and that the subproblem (20a) can be solved perfectly. Here and below, we shall refer to the algorithm in this case as exact GPMD. Encouragingly, exact GPMD provably achieves global linear convergence from an arbitrary initialization, as asserted by the following theorem.

Theorem 1 (Exact GPMD).

Suppose that Assumption 1 holds. Consider any learning rate $\eta>0$, and set $\alpha := \frac{1}{1+\eta\tau}$. Then the iterates of Algorithm 1 satisfy

\big\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\big\|_{\infty} \leq \gamma\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1},    (21a)
\big\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\big\|_{\infty} \leq (\gamma+2)\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1},    (21b)

for all $k\geq 0$, where $C_{1} \coloneqq \|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\|_{\infty} + 2\alpha\|Q^{\star}_{\tau} - \tau\xi^{(0)}\|_{\infty}$.

In addition, if $h_{s}$ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm for some $s\in\mathcal{S}$, then one further has

\big\|\pi_{\tau}^{\star}(s) - \pi_{\tau}^{(k+1)}(s)\big\|_{1} \leq \tau^{-1}\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1}, \qquad k\geq 0.    (22)

Our theorem confirms the fast global convergence of the GPMD algorithm, in terms of both the resulting regularized Q-value (if $h_{s}(\cdot)$ is convex) and the policy estimate (if $h_{s}(\cdot)$ is strongly convex). In summary, it takes GPMD no more than

\frac{1}{(1-\alpha)(1-\gamma)}\log\frac{C_{1}}{\varepsilon} = \frac{1+\eta\tau}{\eta\tau(1-\gamma)}\log\frac{C_{1}}{\varepsilon}    (23a)

iterations to converge to an $\varepsilon$-optimal regularized Q-function (in the $\ell_{\infty}$ sense), or

\frac{1}{(1-\alpha)(1-\gamma)}\log\frac{C_{1}}{\varepsilon\tau} = \frac{1+\eta\tau}{\eta\tau(1-\gamma)}\log\frac{C_{1}}{\varepsilon\tau}    (23b)

iterations to yield an $\varepsilon$-approximation (w.r.t. the $\ell_{1}$ norm error) of $\pi_{\tau}^{\star}$. The iteration complexity (23) is nearly dimension-free—namely, it depends at most logarithmically on the dimension of the state-action space—making it scalable to large-dimensional problems.

Comparison with Lan, (2022, Theorems 1-3).

To make clear our contributions, it is helpful to compare Theorem 1 with the theory for the state-of-the-art algorithm PMD in Lan, (2022).

  • Linear convergence for convex regularizers under constant learning rates. Suppose that constant learning rates are adopted for both GPMD and PMD. Our finding reveals that GPMD enjoys global linear convergence—in terms of both $\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}$ and $\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\|_{\infty}$—even when the regularizer $h_{s}(\cdot)$ is only convex but not strongly convex. In contrast, Lan, (2022, Theorem 2) provided only sublinear convergence guarantees (with an iteration complexity proportional to $1/\varepsilon$) for the case of convex regularizers, provided that constant learning rates are adopted. (In fact, Lan, (2022, Theorem 3) suggests using a vanishing strongly convex regularization, as well as a corresponding increasing sequence of learning rates, in order to enable linear convergence for non-strongly-convex regularizers.)

  • A full range of learning rates. Theorem 1 reveals linear convergence of GPMD for a full range of learning rates, namely, our result is applicable to any $\eta>0$. In comparison, linear convergence was established in Lan, (2022) only when the learning rates are sufficiently large and when $h_{s}$ is $1$-strongly convex w.r.t. the KL divergence. Consequently, the linear convergence results in Lan, (2022) do not extend to several widely used regularizers such as the negative Tsallis entropy and log-barrier functions (even after scaling), which are, in contrast, covered by our theory. It is worth noting that the case with small-to-medium learning rates is often more challenging to cope with in theory, given that its dynamics could differ drastically from that of regularized policy iteration.

  • Further comparison of rates under large learning rates. Lan, (2022, Theorem 1) achieves a contraction rate of $\gamma$ when the regularizer is strongly convex and the step size satisfies $\eta\geq\frac{1-\gamma}{\gamma\tau}$, while the contraction rate of GPMD is $1-\frac{\eta\tau}{1+\eta\tau}(1-\gamma)$ over the full range of step sizes, which is slower but approaches the contraction rate $\gamma$ of PMD as $\eta$ goes to infinity. Therefore, in the limit $\eta\to\infty$, both GPMD and PMD achieve the contraction rate $\gamma$. As soon as $\eta\geq 1/\tau$, their iteration complexities are of the same order.

Remark 1.

While our primary focus is to solve the regularized RL problem, one might be tempted to apply GPMD as a means to solve unregularized RL; for instance, one might run GPMD with the regularization parameter diminishing gradually in order to approach a policy with the desired accuracy. We leave the details to Appendix C.

3.2 Convergence of approximate GPMD

In reality, however, it is often the case that GPMD cannot be implemented in an exact manner, either because perfect policy evaluation is unavailable or because the subproblem (20a) cannot be solved exactly. To accommodate these practical considerations, this subsection generalizes our previous result by permitting inexact policy evaluation and non-zero optimization error in solving (20a). The following assumptions make precise this imperfect scenario.

Assumption 2 (Policy evaluation error).

Suppose for any $k\geq 0$, we have access to an estimate $\widehat{Q}^{(k)}_{\tau}$ obeying

\big\|\widehat{Q}^{(k)}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty} \leq \varepsilon_{\mathsf{eval}}.    (24)
Assumption 3 (Subproblem optimization error).

Consider any policy $\pi$ and any vector $\xi\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. Define

f_{s}(p;\pi,\xi) \coloneqq -\big\langle Q(s,\cdot), p\big\rangle + \tau h_{s}(p) + \frac{1}{\eta} D_{h_{s}}\big(p, \pi(\cdot\,|\,s); \xi(s,\cdot)\big),

where $D_{h_{s}}(p,q;\xi)$ is defined in (15). Suppose there exists an oracle $G_{s,\varepsilon_{\mathsf{opt}}}(Q,\pi,\xi)$, which is capable of returning $\pi^{\prime}(\cdot\,|\,s)$ such that

f_{s}\big(\pi^{\prime}(\cdot\,|\,s);\pi,\xi\big) \leq \min_{p\in\Delta(\mathcal{A})} f_{s}(p;\pi,\xi) + \varepsilon_{\mathsf{opt}}.    (25)

Note that the oracle in Assumption 3 can be implemented efficiently in practice via various first-order methods (Beck, 2017). Under Assumptions 2 and 3, we can modify Algorithm 1 by replacing $\{Q^{(k)}_{\tau}\}$ with the estimate $\{\widehat{Q}^{(k)}_{\tau}\}$, and invoking the oracle $G_{s,\varepsilon_{\mathsf{opt}}}(Q,\pi,\xi)$ to solve the subproblem (20a) approximately. The whole procedure, which we shall refer to as approximate GPMD, is summarized in Algorithm 2.

Input: initial policy $\pi^{(0)}$, learning rate $\eta>0$.
Initialize $\widehat{\xi}^{(0)}(s,\cdot)\in\partial h_{s}\big(\pi^{(0)}(\cdot\,|\,s)\big)$ for all $s\in\mathcal{S}$.
for $k=0,1,\cdots$ do
      For every $s\in\mathcal{S}$, invoke the oracle to obtain (cf. (25))

\pi^{(k+1)}(\cdot\,|\,s) = G_{s,\varepsilon_{\mathsf{opt}}}\big(\widehat{Q}^{(k)}_{\tau}, \pi^{(k)}, \widehat{\xi}^{(k)}\big).    (26)

      For every $(s,a)\in\mathcal{S}\times\mathcal{A}$, compute

\widehat{\xi}^{(k+1)}(s,a) = \frac{1}{1+\eta\tau}\widehat{\xi}^{(k)}(s,a) + \frac{\eta}{1+\eta\tau}\widehat{Q}^{(k)}_{\tau}(s,a).    (27)

end for
Algorithm 2: Approximate PMD with generalized Bregman divergence (approximate GPMD)
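As one way to realize the oracle of Assumption 3 (a sketch of ours, not a procedure prescribed by the paper), projected gradient descent on $f_{s}$ over the simplex can be used whenever $h_{s}$ is differentiable with a bounded gradient there (e.g., the weighted $\ell_{1}$ cost); for the entropy case, the closed-form update discussed after Algorithm 1 is preferable. All function names below are illustrative.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def oracle(Q_s, pi_s, xi_s, tau, eta, grad_h, step=0.05, num_steps=1000):
    """Approximately minimize f_s(p) = -<Q_s, p> + tau*h_s(p) + (1/eta)*D_{h_s}(p, pi_s; xi_s)
    over the simplex (cf. (25)) via projected gradient descent; grad_h is the gradient of h_s."""
    p = pi_s.copy()
    for _ in range(num_steps):
        grad = -Q_s + tau * grad_h(p) + (grad_h(p) - xi_s) / eta
        p = project_simplex(p - step * grad)
    return p
```

In the notation of (26), this would be called with $Q_{s}=\widehat{Q}^{(k)}_{\tau}(s,\cdot)$, $\pi_{s}=\pi^{(k)}(\cdot\,|\,s)$ and $\xi_{s}=\widehat{\xi}^{(k)}(s,\cdot)$; any off-the-shelf first-order solver reaching accuracy $\varepsilon_{\mathsf{opt}}$ serves equally well.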

The following theorem uncovers that approximate GPMD converges linearly—at the same rate as exact GPMD—before an error floor is hit.

Theorem 2 (Approximate GPMD).

Suppose that Assumptions 1, 2 and 3 hold. Consider any learning rate $\eta>0$. Then the iterates of Algorithm 2 satisfy

\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty} \leq \gamma\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{2}\right],    (28a)
\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\|_{\infty} \leq (\gamma+2)\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{2}\right] + (1-\alpha)\varepsilon_{\mathsf{opt}},    (28b)

where $\alpha\coloneqq\frac{1}{1+\eta\tau}$, $C_{1}$ is defined in Theorem 1, and

C_{2} \coloneqq \frac{1}{1-\gamma}\left[\left(2+\frac{2\gamma}{(1-\gamma)(1-\alpha)}\right)\varepsilon_{\mathsf{eval}} + \left(1+\frac{2\gamma}{(1-\gamma)(1-\alpha)}\right)\varepsilon_{\mathsf{opt}}\right].

In addition, if $h_{s}$ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm for any $s\in\mathcal{S}$, then we can further obtain

\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty} \leq \gamma\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{3}\right],    (29a)
\|V^{\star}_{\tau} - V^{(k+1)}_{\tau}\|_{\infty} \leq (\gamma+2)\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{3}\right] + (1-\alpha)\varepsilon_{\mathsf{opt}},    (29b)
\big\|\pi_{\tau}^{\star}(\cdot\,|\,s) - \pi^{(k+1)}(\cdot\,|\,s)\big\|_{1} \leq \tau^{-1}\left[\big(1-(1-\alpha)(1-\gamma)\big)^{k} C_{1} + C_{3}\right] + \sqrt{\frac{2\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}},    (29c)

where

C_{3} \coloneqq \frac{1}{1-\gamma}\left[\left(2+\frac{\varepsilon_{\mathsf{eval}}\gamma}{\tau(1-\gamma)}\right)\varepsilon_{\mathsf{eval}} + \left(1+\frac{4\gamma}{(1-\gamma)(1-\alpha)}\right)\varepsilon_{\mathsf{opt}}\right].    (30)

In the special case where $\varepsilon_{\mathsf{opt}}=0$ and $\eta=\infty$, Algorithm 2 reduces to regularized policy iteration, and the convergence result simplifies to

\big\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty} \leq \gamma^{k}\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + \frac{2\gamma\varepsilon_{\mathsf{eval}}}{(1-\gamma)^{2}}.

In particular, when $h_{s}$ is taken to be the negative entropy, our result strengthens the prior result established in Cen et al., 2022b for the approximate entropy-regularized NPG method with $\varepsilon_{\mathsf{opt}}=0$ over a wide range of learning rates. Specifically, the error bound in Cen et al., 2022b reads $\gamma\cdot\frac{\varepsilon_{\mathsf{eval}}}{1-\gamma}\left(2+\frac{2\gamma}{\eta\tau}\right)$, where the second term in the bracket scales inversely with $\eta$ and therefore grows unboundedly as $\eta$ approaches $0$. In contrast, (29) and (30) yield a bound $\gamma\cdot\frac{\varepsilon_{\mathsf{eval}}}{1-\gamma}\left(2+\frac{\varepsilon_{\mathsf{eval}}\gamma}{\tau(1-\gamma)}\right)$, which is independent of the learning rate $\eta$ in use and thus prevents the error bound from blowing up as the learning rate approaches $0$. Indeed, our result improves over the prior art Cen et al., 2022b whenever $\eta\leq\frac{2(1-\gamma)}{\varepsilon_{\mathsf{eval}}}$.

Remark 2 (Sample complexities).

One might naturally ask how many samples are sufficient to learn an $\varepsilon$-optimal regularized Q-function by leveraging sample-based policy evaluation algorithms within GPMD. Notice that it is straightforward to consider an expected version of Assumption 2 as follows:

\begin{cases} \mathbb{E}\big[\big\|\widehat{Q}^{(k)}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty}\big] \leq \varepsilon_{\mathsf{eval}},\\ \mathbb{E}\big[\big\|\widehat{Q}^{(k)}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty}^{2}\big] \leq \varepsilon_{\mathsf{eval}}^{2},\end{cases}

where the expectation is with respect to the randomness in policy evaluation; the convergence results in Theorem 2 then apply to $\mathbb{E}\big[\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}\big]$ and $\mathbb{E}\big[\big\|\pi_{\tau}^{\star}(\cdot\,|\,s) - \pi_{\tau}^{(k+1)}(\cdot\,|\,s)\big\|_{1}\big]$ instead. This randomized version makes it immediately amenable to combination with, e.g., the rollout-based policy evaluators in Lan, (2022, Section 5.1) to obtain (possibly crude) bounds on the sample complexity. We omit these straightforward developments.

Roughly speaking, approximate GPMD is guaranteed to converge linearly until it hits an error floor that scales linearly in both the policy evaluation error $\varepsilon_{\mathsf{eval}}$ and the optimization error $\varepsilon_{\mathsf{opt}}$, thus confirming the stability of our algorithm vis-à-vis imperfect implementation. As before, our theory improves upon prior works by demonstrating linear convergence for a full range of learning rates, even in the absence of strong convexity and smoothness.

4 Analysis for exact GPMD (Theorem 1)

In this section, we present the analysis for our main result in Theorem 1, which follows a different framework from Lan, (2022). Here and throughout, we shall often employ the following shorthand notation when it is clear from the context:

\pi^{(k)}(s) \coloneqq \pi^{(k)}(\cdot\,|\,s) \in \Delta(\mathcal{A}), \qquad Q^{\pi}(s) \coloneqq Q^{\pi}(s,\cdot) \in \mathbb{R}^{|\mathcal{A}|}, \qquad \xi^{(k)}(s) \coloneqq \xi^{(k)}(s,\cdot) \in \mathbb{R}^{|\mathcal{A}|}, \qquad Q_{\tau}^{\pi}(s) \coloneqq Q_{\tau}^{\pi}(s,\cdot) \in \mathbb{R}^{|\mathcal{A}|},    (33)

in addition to those already defined in (13).

4.1 Preparation: basic facts

In this subsection, we single out a few basic results that underlie the proof of our main theorems.

Performance improvement.

To begin with, we demonstrate that GPMD enjoys monotonic improvements in the updates of both the value function and the Q-function, as stated in the following lemma. This lemma can be viewed as a generalization of the well-established policy improvement lemma in the analysis of NPG (Agarwal et al., 2020; Cen et al., 2022b) as well as PMD (Lan, 2022).

Lemma 2 (Pointwise monotonicity).

For any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $k\geq 0$, Algorithm 1 achieves

V^{(k+1)}_{\tau}(s) \geq V^{(k)}_{\tau}(s) \qquad\text{and}\qquad Q^{(k+1)}_{\tau}(s,a) \geq Q^{(k)}_{\tau}(s,a).    (34)
Proof.

See Appendix A.2. ∎

Interestingly, the above monotonicity holds simultaneously for all state-action pairs, and hence can be understood as a kind of pointwise monotonicity.

Generalized Bellman operator.

Another key ingredient of our proof lies in the use of a generalized Bellman operator $\mathcal{T}_{\tau,h}:\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}\rightarrow\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ associated with the regularizer $h=\{h_{s}\}_{s\in\mathcal{S}}$. Specifically, for any state-action pair $(s,a)$ and any vector $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we define

\mathcal{T}_{\tau,h}(Q)(s,a) = r(s,a) + \gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\Big\{\big\langle Q(s^{\prime}), p\big\rangle - \tau h_{s^{\prime}}(p)\Big\}\right].    (35)

It is worth noting that this definition is similar in spirit to the regularized Bellman operator proposed in Geist et al., (2019); the operator therein is defined w.r.t. $V_{\tau}$, while ours is defined w.r.t. $Q_{\tau}$.

The importance of this generalized Bellman operator is two-fold: it enjoys a desired contraction property, and its fixed point corresponds to the optimal regularized Q-function. These are generalizations of the properties for the classical Bellman operator, and are formally stated in the following lemma. The proof is deferred to Appendix A.3.

Lemma 3 (Properties of the generalized Bellman operator).

For any $\tau>0$, the operator $\mathcal{T}_{\tau,h}$ defined in (35) satisfies the following properties:

  • $\mathcal{T}_{\tau,h}$ is a contraction operator w.r.t. the $\ell_{\infty}$ norm, namely, for any $Q_{1},Q_{2}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, one has

    \big\|\mathcal{T}_{\tau,h}(Q_{1}) - \mathcal{T}_{\tau,h}(Q_{2})\big\|_{\infty} \leq \gamma\|Q_{1}-Q_{2}\|_{\infty}.    (36)

  • The optimal regularized Q-function $Q^{\star}_{\tau}$ is a fixed point of $\mathcal{T}_{\tau,h}$, that is,

    \mathcal{T}_{\tau,h}(Q^{\star}_{\tau}) = Q^{\star}_{\tau}.    (37)
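For intuition, the sketch below (ours, with illustrative names and a synthetic instance) instantiates the operator (35) for the negative-entropy regularizer, in which case the inner maximization has the closed form $\max_{p\in\Delta(\mathcal{A})}\{\langle Q(s'),p\rangle-\tau h_{s'}(p)\}=\tau\log\sum_{a}\exp\big(Q(s',a)/\tau\big)$, and numerically checks the $\gamma$-contraction property (36).

```python
import numpy as np
from scipy.special import logsumexp

def bellman_reg_entropy(Q, P, r, gamma, tau):
    # generalized Bellman operator (35) with h_s = negative entropy:
    # the soft value of state s' is tau * logsumexp(Q(s', .) / tau)
    V_soft = tau * logsumexp(Q / tau, axis=1)      # shape (S,)
    return r + gamma * P @ V_soft                  # shape (S, A)

# empirical check of the gamma-contraction (36) on a random instance
rng = np.random.default_rng(1)
S, A, gamma, tau = 4, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
Q1, Q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
lhs = np.max(np.abs(bellman_reg_entropy(Q1, P, r, gamma, tau)
                    - bellman_reg_entropy(Q2, P, r, gamma, tau)))
assert lhs <= gamma * np.max(np.abs(Q1 - Q2)) + 1e-12
```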

4.2 Proof of Theorem 1

Inspired by Cen et al., 2022b , our proof consists of (i) characterizing the dynamics of \ell_{\infty} errors and establishing a connection to a useful linear system with two variables, and (ii) analyzing the dynamics of this linear system directly. In what follows, we elaborate on each of these steps.

Step 1: error contraction and its connection to a linear system.

With the assistance of the above preparations, we are ready to elucidate how to characterize the convergence behavior of $\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}$. Recalling the update rule of $\xi^{(k+1)}$ (cf. (20c)), we can deduce that

Q^{\star}_{\tau} - \tau\xi^{(k+1)} = \alpha\big(Q^{\star}_{\tau} - \tau\xi^{(k)}\big) + (1-\alpha)\big(Q^{\star}_{\tau} - Q^{(k)}_{\tau}\big)

with $\alpha=\frac{1}{1+\eta\tau}$, thus indicating that

\big\|Q^{\star}_{\tau} - \tau\xi^{(k+1)}\big\|_{\infty} \leq \alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(k)}\big\|_{\infty} + (1-\alpha)\big\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\big\|_{\infty}.    (38)

Interestingly, there exists an intimate connection between $\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\|_{\infty}$ and $\|Q^{\star}_{\tau} - \tau\xi^{(k+1)}\|_{\infty}$ that allows us to bound the former term by the latter. This is stated in the following lemma, with the proof postponed to Appendix A.4.

Lemma 4.

Set $\alpha=\frac{1}{1+\eta\tau}$. The iterates of Algorithm 1 satisfy

\big\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\big\|_{\infty} \leq \gamma\big\|Q^{\star}_{\tau} - \tau\xi^{(k+1)}\big\|_{\infty} + \gamma\alpha^{k+1}\big\|Q^{(0)}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}.    (39)

The above inequalities (38) and (39) can be succinctly described via a useful linear system with two variables $\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\|_{\infty}$ and $\|Q^{\star}_{\tau} - \tau\xi^{(k)}\|_{\infty}$, that is,

x_{k+1} \leq A x_{k} + \gamma\alpha^{k+1} y,    (40)

where

A \coloneqq \begin{bmatrix}\gamma(1-\alpha) & \gamma\alpha\\ 1-\alpha & \alpha\end{bmatrix}, \qquad x_{k} \coloneqq \begin{bmatrix}\|Q^{\star}_{\tau} - Q^{(k)}_{\tau}\|_{\infty}\\ \|Q^{\star}_{\tau} - \tau\xi^{(k)}\|_{\infty}\end{bmatrix} \qquad\text{and}\qquad y \coloneqq \begin{bmatrix}\|Q^{(0)}_{\tau} - \tau\xi^{(0)}\|_{\infty}\\ 0\end{bmatrix}.    (41)

This forms the basis for proving Theorem 1.

Step 2: analyzing the dynamics of the linear system (40).

Before proceeding, we note that a linear system similar to (40) has been analyzed in Cen et al., 2022b (Section 4.2.2). We intend to apply the following properties that have been derived therein:

x_{k+1} \leq A^{k+1}\left[x_{0} + \gamma(\alpha^{-1}A - I)^{-1} y\right],    (42a)
\gamma(\alpha^{-1}A - I)^{-1} y = \begin{bmatrix}0\\ \|Q^{(0)}_{\tau} - \tau\xi^{(0)}\|_{\infty}\end{bmatrix},    (42b)
A^{k+1} = \big((1-\alpha)\gamma + \alpha\big)^{k}\begin{bmatrix}\gamma\\ 1\end{bmatrix}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix}.    (42c)
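For completeness, (42c) is an immediate consequence of the rank-one structure of $A$: one can write

A = \begin{bmatrix}\gamma\\ 1\end{bmatrix}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix} \qquad\text{with}\qquad \begin{bmatrix}1-\alpha & \alpha\end{bmatrix}\begin{bmatrix}\gamma\\ 1\end{bmatrix} = (1-\alpha)\gamma + \alpha,

so that

A^{k+1} = \begin{bmatrix}\gamma\\ 1\end{bmatrix}\Big(\begin{bmatrix}1-\alpha & \alpha\end{bmatrix}\begin{bmatrix}\gamma\\ 1\end{bmatrix}\Big)^{k}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix} = \big((1-\alpha)\gamma+\alpha\big)^{k}\begin{bmatrix}\gamma\\ 1\end{bmatrix}\begin{bmatrix}1-\alpha & \alpha\end{bmatrix}.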

Substituting (42c) and (42b) into (42a) and rearranging terms, we reach

x_{k+1} \leq \big((1-\alpha)\gamma + \alpha\big)^{k}\left((1-\alpha)\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + \alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty} + \alpha\big\|Q^{(0)}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right)\begin{bmatrix}\gamma\\ 1\end{bmatrix}
\leq \big((1-\alpha)\gamma + \alpha\big)^{k}\left(\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + 2\alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right)\begin{bmatrix}\gamma\\ 1\end{bmatrix},    (43)

which taken together with the definition of $x_{k+1}$ gives

\big\|Q^{\star}_{\tau} - Q^{(k+1)}_{\tau}\big\|_{\infty} \leq \gamma\big((1-\alpha)\gamma + \alpha\big)^{k}\left(\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + 2\alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right),    (44a)
\big\|Q_{\tau}^{\star} - \tau\xi^{(k+1)}\big\|_{\infty} \leq \big((1-\alpha)\gamma + \alpha\big)^{k}\left(\big\|Q^{\star}_{\tau} - Q^{(0)}_{\tau}\big\|_{\infty} + 2\alpha\big\|Q^{\star}_{\tau} - \tau\xi^{(0)}\big\|_{\infty}\right).    (44b)
Step 3: controlling πτ(s)π(k+1)(s)1\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1} and VτVτ(k+1)\big{\|}{V}^{\star}_{\tau}-{V}^{(k+1)}_{\tau}\big{\|}_{\infty}.

It remains to convert this result to an upper bound on πτ(s)π(k+1)(s)1\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1} and VτVτ(k+1)\big{\|}{V}^{\star}_{\tau}-{V}^{(k+1)}_{\tau}\big{\|}_{\infty}. By virtue of Lemma 1, there exist two vectors gτ(s)hs(πτ(s))g_{\tau}^{\star}(s)\in\partial h_{s}\big{(}\pi_{\tau}^{\star}(s)\big{)}, g(k+1)(s)hs(π(k+1)(s))g^{(k+1)}(s)\in\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)} and two scalars cs,cs(k+1)c_{s}^{\star},c_{s}^{(k+1)}\in\mathbb{R} that satisfy

{τ1Qτ(s)cs1=gτ(s)ξ(k+1)(s,)cs(k+1)1=g(k+1)(s).\begin{cases}\tau^{-1}Q_{\tau}^{\star}(s)-c_{s}^{\star}1&=g_{\tau}^{\star}(s)\\ \xi^{(k+1)}(s,\cdot)-c_{s}^{(k+1)}1&=g^{(k+1)}(s)\end{cases}.

It holds for all s𝒮s\in\mathcal{S} that

Vτ(s)Vτ(k+1)(s)\displaystyle V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s)
=Qτ(s),πτ(s)τhs(πτ(s))Qτ(k+1)(s),πτ(k+1)(s)+τhs(πτ(k+1)(s))\displaystyle=\big{\langle}Q_{\tau}^{\star}(s),\pi_{\tau}^{\star}(s)\big{\rangle}-\tau h_{s}(\pi_{\tau}^{\star}(s))-\big{\langle}Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\tau h_{s}(\pi_{\tau}^{(k+1)}(s))
=Qτ(s)Qτ(k+1)(s),πτ(k+1)(s)+[τ(hs(πτ(k+1)(s))hs(πτ(s)))Qτ(s),πτ(k+1)(s)πτ(s)]\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\Big{[}\tau(h_{s}(\pi_{\tau}^{(k+1)}(s))-h_{s}(\pi_{\tau}^{\star}(s)))-\big{\langle}Q_{\tau}^{\star}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}\Big{]}
\displaystyle\overset{\text{(i)}}{\leq}\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\big{\langle}\tau g^{(k+1)}(s)-Q_{\tau}^{\star}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}
\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)\big{\rangle}+\big{\langle}\tau\xi^{(k+1)}(s)-Q_{\tau}^{\star}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}
Qτ(s)Qτ(k+1)(s)+2Qτ(s)τξ(k+1)(s),\displaystyle\leq\big{\|}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s)\big{\|}_{\infty}+2\big{\|}Q_{\tau}^{\star}(s)-\tau\xi^{(k+1)}(s)\big{\|}_{\infty}, (45)

where (i) results from hs(πτ(k+1)(s))hs(πτ(s))g(k+1)(s),πτ(k+1)(s)πτ(s)h_{s}(\pi_{\tau}^{(k+1)}(s))-h_{s}(\pi_{\tau}^{\star}(s))\leq\big{\langle}g^{(k+1)}(s),\pi_{\tau}^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}. Plugging (44) into (45) completes the proof for (21b).

When hsh_{s} is 11-strongly convex w.r.t. the 1\ell_{1} norm, we can invoke the strong monotonicity property of a strongly convex function (Beck,, 2017, Theorem 5.24) to obtain

πτ(s)π(k+1)(s)12\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1}^{2} πτ(s)π(k+1)(s),gτ(s)g(k+1)(s)\displaystyle\leq\big{\langle}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s),g_{\tau}^{\star}(s)-g^{(k+1)}(s)\big{\rangle}
=πτ(s)π(k+1)(s),gτ(s)+cs1g(k+1)(s)cs(k+1)1\displaystyle=\big{\langle}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s),g_{\tau}^{\star}(s)+c_{s}^{\star}1-g^{(k+1)}(s)-c_{s}^{(k+1)}1\big{\rangle}
πτ(s)π(k+1)(s)1gτ(s)+cs1g(k+1)(s)cs(k+1)1\displaystyle\leq\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1}\big{\|}g_{\tau}^{\star}(s)+c_{s}^{\star}1-g^{(k+1)}(s)-c_{s}^{(k+1)}1\big{\|}_{\infty}
=τ1πτ(s)π(k+1)(s)1Qτ(s)τξ(k+1)(s),\displaystyle=\tau^{-1}\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1}\big{\|}Q_{\tau}^{\star}(s)-\tau\xi^{(k+1)}(s)\big{\|}_{\infty}, (46)

where the second line is valid since πτ(s),1=π(k+1)(s),1=1\langle\pi_{\tau}^{\star}(s),1\rangle=\langle\pi^{(k+1)}(s),1\rangle=1. This taken together with (44) gives rise to the advertised bound

πτ(s)π(k+1)(s)1\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s)\big{\|}_{1} τ1Qτ(s)τξ(k+1)(s)\displaystyle\leq\tau^{-1}\big{\|}Q_{\tau}^{\star}(s)-\tau\xi^{(k+1)}(s)\big{\|}_{\infty}
τ1((1α)γ+α)k(QτQτ(0)+2αQττξ(0)).\displaystyle\leq\tau^{-1}\big{(}(1-\alpha)\gamma+\alpha\big{)}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}\right).

5 Numerical experiments

In this section, we provide some simple numerical experiments to corroborate the effectiveness of the GPMD algorithm.

5.1 Tsallis entropy

While Shannon entropy is a popular choice of regularization, the discrepancy between the value function of the regularized MDP and the unregularized counterpart scales as O(τ1γlog|𝒜|)O(\frac{\tau}{1-\gamma}\log|\mathcal{A}|). In addition, the optimal policy under Shannon entropy regularization assigns positive mass to all actions and is hence non-sparse. To promote sparsity and obtain better control of the bias induced by regularization, Lee et al., (2018, 2019) proposed to employ the Tsallis entropy (Tsallis,, 1988) as an alternative. To be precise, for any vector pΔ(𝒜)p\in\Delta(\mathcal{A}), the associated Tsallis entropy is defined as

𝖳𝗌𝖺𝗅𝗅𝗂𝗌q(p)=1q1(1a𝒜(p(a))q)=1q1𝔼ap[1(p(a))q1],\mathsf{Tsallis}_{q}(p)=\frac{1}{q-1}\left(1-\sum_{a\in\mathcal{A}}\big{(}p(a)\big{)}^{q}\right)=\frac{1}{q-1}\mathbb{E}_{a\sim p}\left[1-\big{(}p(a)\big{)}^{q-1}\right],

where q>0q>0 is often referred to as the entropic index. As q1q\to 1, the Tsallis entropy reduces to the Shannon entropy.
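As a quick numerical illustration (a minimal sketch only; the helper functions below are ours and are not part of the experiments reported in this section), one can check that the Tsallis entropy with entropic index q approaches the Shannon entropy as q → 1:

import numpy as np

def tsallis_entropy(p, q):
    # Tsallis_q(p) = (1 - sum_a p(a)^q) / (q - 1), for q > 0 and q != 1.
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.3, 0.2])
print(tsallis_entropy(p, q=2.0))      # equals 1 - sum_a p(a)^2
print(tsallis_entropy(p, q=1.0001))   # close to the Shannon entropy
print(shannon_entropy(p))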

We now evaluate numerically the performance of PMD and GPMD when applied to a randomly generated MDP with |𝒮|=200|\mathcal{S}|=200 and |𝒜|=50|\mathcal{A}|=50. Here, the transition probability kernel and the reward function are generated as follows. For each state-action pair (s,a)(s,a), we randomly select 2020 states to form a set 𝒮s,a\mathcal{S}_{s,a}, and set P(s|s,a)=1/20P(s^{\prime}|s,a)=1/20 if s𝒮s,as^{\prime}\in\mathcal{S}_{s,a}, and 0 otherwise. The reward function is generated by r(s,a)Us,aUsr(s,a)\sim U_{s,a}\cdot U_{s}, where Us,aU_{s,a} and UsU_{s} are independent uniform random variables over [0,1][0,1]. We shall set the regularizer as hs(p)=𝖳𝗌𝖺𝗅𝗅𝗂𝗌2(p)h_{s}(p)=-\mathsf{Tsallis}_{2}(p) for all s𝒮s\in\mathcal{S} with a regularization parameter τ=0.001\tau=0.001. As can be seen from the numerical results displayed in Figure 1(a), GPMD enjoys a faster convergence rate compared to PMD.
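For concreteness, a random instance of this form can be generated along the following lines (a sketch under the stated generation scheme; the random seed and variable names are ours, and the exact instances used in the experiments may differ):

import numpy as np

rng = np.random.default_rng(0)
S, A, support_size = 200, 50, 20

# Transition kernel: each (s, a) places mass 1/20 on 20 next states drawn uniformly without replacement.
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        next_states = rng.choice(S, size=support_size, replace=False)
        P[s, a, next_states] = 1.0 / support_size

# Reward: product of two independent Uniform[0, 1] variables, one indexed by (s, a) and one by s.
U_sa = rng.uniform(size=(S, A))
U_s = rng.uniform(size=S)
r = U_sa * U_s[:, None]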

(a) Tsallis entropy regularization (b) Log-barrier regularization
Figure 1: QτQτ(t)\|Q_{\tau}^{\star}-Q_{\tau}^{(t)}\|_{\infty} versus the iteration count for both PMD and GPMD, for multiple choices of the learning rate η\eta. The left plot (a) is concerned with Tsallis entropy regularization, whereas the right plot (b) concerns log-barrier regularization used in our constrained RL example. The error curves are averaged over 5 independent runs.

5.2 Constrained RL

In reality, an agent with the sole aim of maximizing cumulative rewards might end up with unintended or even harmful behavior, due to, say, an improperly designed reward function or imperfect simulation of physical laws. It is therefore sometimes necessary to enforce proper constraints on the policy in order to prevent it from taking certain actions too frequently.

To simulate this problem, we first solve an MDP with |𝒮|=200|\mathcal{S}|=200 and |𝒜|=50|\mathcal{A}|=50, generated in the same way as in the previous subsection. We then pick 10 state-action pairs at random from the support of the optimal policy to form a set Ψ\Psi. We can ensure that πτ(a|s)<πmax=0.1\pi_{\tau}^{\star}(a\,|\,s)<\pi_{\rm max}=0.1 for all (s,a)Ψ(s,a)\in\Psi by adding the following log-barrier regularization with τ=0.001\tau=0.001:

hs(p)={,if (s,a)Ψ and p(a)πmax,log(πmaxp(a)),if (s,a)Ψ and p(a)<πmax,0,otherwise.h_{s}(p)=\begin{cases}\infty,&\text{if }(s,a)\in\Psi\text{ and }p(a)\geq\pi_{\rm max},\\ -\log\big{(}\pi_{\rm max}-p(a)\big{)},&\text{if }(s,a)\in\Psi\text{ and }p(a)<\pi_{\rm max},\\ 0,&\text{otherwise}.\end{cases}
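A natural reading of this regularizer, sketched below, accumulates the barrier terms over all constrained actions of a given state (the function name and the data structure constrained_actions, which maps each state s to the actions a with (s,a) ∈ Ψ, are ours and serve illustration purposes only):

import numpy as np

pi_max = 0.1

def h_log_barrier(s, p, constrained_actions):
    # Returns +infinity as soon as any constrained action exceeds pi_max;
    # otherwise accumulates -log(pi_max - p(a)) over the constrained actions of state s.
    val = 0.0
    for a in constrained_actions.get(s, ()):
        if p[a] >= pi_max:
            return np.inf
        val += -np.log(pi_max - p[a])
    return val

# Hypothetical usage: a uniform policy at state 0 with a single constrained action.
p = np.full(50, 1.0 / 50)
print(h_log_barrier(0, p, {0: [3]}))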

Numerical comparisons of PMD and GPMD when applied to this problem are plotted in Figure 1(b). It is observed that the PMD methods stall after reaching an error floor on the order of 10210^{-2}, while the GPMD methods converge to the optimal policy efficiently.

6 Discussion

The present paper has introduced a generalized framework of policy optimization tailored to regularized RL problems. We have proposed a generalized policy mirror descent (GPMD) algorithm that achieves dimension-free linear convergence, which covers an entire range of learning rates and accommodates convex and possibly nonsmooth regularizers. Numerical experiments have been conducted to demonstrate the utility of the proposed GPMD algorithm. Our approach opens up a couple of future directions that are worthy of further exploration. For example, the current work restricts its attention to convex regularizers and tabular MDPs; it is of paramount interest to develop policy optimization algorithms when the regularizers are nonconvex and when sophisticated policy parameterization—including function approximation—is adopted. Understanding the sample complexities of the proposed algorithm—when the policies are evaluated using samples collected over an online trajectory—is crucial in sample-constrained scenarios and is left for future investigation. Furthermore, it might be worthwhile to extend the proposed algorithm to accommodate multi-agent RL, with a representative example being regularized multi-agent Markov games (Cen et al.,, 2021; Zhao et al.,, 2022; Cen et al., 2022a, ; Cen et al., 2022c, ).

Acknowledgements

S. Cen and Y. Chi are supported in part by the grants ONR N00014-19-1-2404, NSF CCF-2106778, DMS-2134080, CCF-1901199, CCF-2007911, and CNS-2148212. S. Cen is also gratefully supported by Wei Shen and Xuehong Zhang Presidential Fellowship, and Nicholas Minnici Dean’s Graduate Fellowship in Electrical and Computer Engineering at Carnegie Mellon University. W. Zhan and Y. Chen are supported in part by the Google Research Scholar Award, the Alfred P. Sloan Research Fellowship, and the grants AFOSR FA9550-22-1-0198, ONR N00014-22-1-2354, NSF CCF-2221009, CCF-1907661, IIS-2218713, and IIS-2218773. W. Zhan and J. Lee are supported in part by the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, and an ONR Young Investigator Award.

Appendix A Proof of key lemmas

In this section, we collect the proofs of several key lemmas. Here and throughout, we use 𝔼π[]\mathbb{E}_{\pi}[\cdot] to denote the expectation over the randomness of the MDP induced by policy π\pi. We shall follow the notational convention in (33) throughout. In addition, to further simplify the exposition, we shall slightly abuse notation by letting

Dhs(π~,π;ξ)\displaystyle D_{h_{s}}(\widetilde{\pi},\pi;\xi) Dhs(π~(|s),π(|s);ξ(s,))\displaystyle\coloneqq D_{h_{s}}\big{(}\widetilde{\pi}(\cdot\,|\,s),\pi(\cdot\,|\,s);\xi(s,\cdot)\big{)} (47a)
Dhs(p,π;ξ)\displaystyle D_{h_{s}}(p,\pi;\xi) Dhs(p,π(|s);ξ(s,))\displaystyle\coloneqq D_{h_{s}}\big{(}p,\pi(\cdot\,|\,s);\xi(s,\cdot)\big{)} (47b)
Dhs(π,p;ξ)\displaystyle D_{h_{s}}(\pi,p;\xi) Dhs(π(|s),p;ξ(s,))\displaystyle\coloneqq D_{h_{s}}\big{(}\pi(\cdot\,|\,s),p;\xi(s,\cdot)\big{)} (47c)

for any policy π\pi and π~\widetilde{\pi} and any pΔ(𝒜)p\in\Delta(\mathcal{A}), whenever it is clear from the context.

A.1 Proof of Lemma 1

We start by relaxing the probability simplex constraint (i.e., pΔ(𝒜)p\in\Delta(\mathcal{A})) in (20a) with a simpler linear constraint a𝒜p(a)=1\sum_{a\in\mathcal{A}}p(a)=1 as follows

minimizep|𝒜|ηQτ(k)(s),p+ητhs(p)+Dhs(p,π(k);ξ(k))subject toa𝒜p(a)=1.\begin{array}[]{ll}\text{minimize}_{p\in\mathbb{R}^{|\mathcal{A}|}}&-\eta\big{\langle}Q^{(k)}_{\tau}(s),p\big{\rangle}+\eta\tau h_{s}(p)+D_{h_{s}}\big{(}p,\pi^{(k)};\xi^{(k)}\big{)}\\[4.30554pt] \text{subject to}&\sum_{a\in\mathcal{A}}p(a)=1.\end{array} (48)

To justify the validity of dropping the non-negativity constraints, we note that for any pp obeying p(a)<0p(a)<0 for some a𝒜a\in\mathcal{A}, our assumption on hsh_{s} (see Assumption 1) leads to hs(p)=h_{s}(p)=\infty, which cannot possibly be the optimal solution. This confirms the equivalence between (20a) and (48).

Observe that the Lagrangian w.r.t. (48) is given by

s(p,λs(k))\displaystyle\mathcal{L}_{s}\big{(}p,\lambda_{s}^{(k)}\big{)} =ηQτ(k)(s),p+ητhs(p)+hs(p)hs(π(k)(s))pπ(k)(s),ξ(k)(s)+λs(k)(a𝒜p(a)1),\displaystyle=-\eta\big{\langle}Q_{\tau}^{(k)}(s),p\big{\rangle}+\eta\tau h_{s}(p)+h_{s}(p)-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}p-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}+\lambda_{s}^{(k)}\left(\sum_{a\in\mathcal{A}}p(a)-1\right),

where λs(k)\lambda_{s}^{(k)}\in\mathbb{R} denotes the Lagrange multiplier associated with the constraint a𝒜p(a)=1\sum_{a\in\mathcal{A}}p(a)=1. Given that π(k+1)(s)\pi^{(k+1)}(s) is the solution to (20a) and hence (48), the optimality condition requires that

0ps(p,λs(k))|p=π(k+1)(s)=ηQτ(k)(s)+(1+ητ)hs(π(k+1)(s))ξ(k)(s)+λs(k)1.0\in\partial_{p}\mathcal{L}_{s}\big{(}p,\lambda_{s}^{(k)}\big{)}\,\Big{|}\,_{p=\pi^{(k+1)}(s)}=-\eta Q^{(k)}_{\tau}(s)+(1+\eta\tau)\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\xi^{(k)}(s)+\lambda^{(k)}_{s}1.

Rearranging terms and making use of the construction (17), we are left with

ξ(k+1)(s)λs(k)1+ητ1=11+ητ[ηQτ(k)(s)+ξ(k)(s)λs(k)1]hs(π(k+1)(s)),\xi^{(k+1)}(s)-\frac{\lambda^{(k)}_{s}}{1+\eta\tau}1=\frac{1}{1+\eta\tau}\left[\eta Q^{(k)}_{\tau}(s)+\xi^{(k)}(s)-\lambda^{(k)}_{s}1\right]\in\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)},

thus concluding the proof of the first claim (18).

We now turn to the second claim (19). In view of the property (37), we have

πτ(s)=argminpΔ(𝒜)Qτ(s),p+τhs(p).\pi_{\tau}^{\star}(s)=\arg\min_{p\in\Delta(\mathcal{A})}-\big{\langle}Q^{\star}_{\tau}(s),p\big{\rangle}+\tau h_{s}(p).

This optimization problem is equivalent to

minimizep|𝒜|Qτ(s),p+τhs(p),subject toa𝒜p(a)=1,\begin{array}[]{ll}\text{minimize}_{p\in\mathbb{R}^{|\mathcal{A}|}}&-\big{\langle}Q^{\star}_{\tau}(s),p\big{\rangle}+\tau h_{s}(p),\\[4.30554pt] \text{subject to}&\sum_{a\in\mathcal{A}}p(a)=1,\end{array} (49)

which can be verified by repeating a similar argument for (48). The Lagrangian associated with (49) is

s(p,λs)\displaystyle\mathcal{L}_{s}\big{(}p,\lambda_{s}^{\star}\big{)} =Qτ(s),p+τhs(p)+λs(a𝒜p(a)1),\displaystyle=-\big{\langle}Q_{\tau}^{\star}(s),p\big{\rangle}+\tau h_{s}(p)+\lambda_{s}^{\star}\left(\sum_{a\in\mathcal{A}}p(a)-1\right),

where λs\lambda_{s}^{\star}\in\mathbb{R} denotes the Lagrange multiplier. Therefore, the first-order optimality condition requires that

0ps(p,λs)|p=πτ(s)=Qτ(s)+τhs(πτ(s))+λs1,0\in\partial_{p}\mathcal{L}_{s}\big{(}p,\lambda_{s}^{\star}\big{)}\,\Big{|}\,_{p=\pi^{\star}_{\tau}(s)}=-Q^{\star}_{\tau}(s)+\tau\partial h_{s}\big{(}\pi^{\star}_{\tau}(s)\big{)}+\lambda^{\star}_{s}1,

which immediately finishes the proof.

A.2 Proof of Lemma 2

We start by introducing the performance difference lemma that has previously been derived in Lan, (2022, Lemma 2). For the sake of self-containedness, we include a proof of this lemma in Appendix A.2.1.

Lemma 5 (Performance difference).

For any two policies π\pi and π\pi^{\prime}, we have

Vτπ(s)Vτπ(s)=11γ𝔼sdsπ[Qτπ(s),π(s)π(s)τhs(π(s))+τhs(π(s))],V^{\pi^{\prime}}_{\tau}(s)-V^{\pi}_{\tau}(s)=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{\pi^{\prime}}_{s}}\Big{[}\big{\langle}Q^{\pi}_{\tau}(s^{\prime}),\pi^{\prime}(s^{\prime})-\pi(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{\prime}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi(s^{\prime})\big{)}\Big{]}, (50)

where dsπd^{\pi}_{s} has been defined in (5).

Armed with Lemma 5, one can readily rewrite the difference Vτ(k+1)(s)Vτ(k)(s)V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s) between two consecutive iterates as follows

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)
=11γ𝔼sds(k+1)[Qτ(k)(s),π(k+1)(s)π(k)(s)τhs(π(k+1)(s))+τhs(π(k)(s))].\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\Big{[}\big{\langle}Q^{(k)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}\Big{]}. (51)

It then comes down to studying the right-hand side of the relation (A.2), which can be accomplished via the following “three-point” lemma. The proof of this lemma can be found in Appendix A.2.2.

Lemma 6.

For any s𝒮s\in\mathcal{S} and any vector pΔ(𝒜)p\in\Delta(\mathcal{A}), we have

(1+ητ)Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))\displaystyle(1+\eta\tau)D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},{\pi^{(k)}};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}
=η[Qτ(k)(s),π(k+1)(s)p+τhs(p)τhs(π(k+1)(s))].\displaystyle\qquad=\eta\left[\big{\langle}Q^{(k)}_{\tau}(s),\pi^{(k+1)}(s)-p\big{\rangle}+\tau h_{s}(p)-\tau h_{s}\big{(}\pi^{(k+1)}(s)\big{)}\right].

Taking p=π(k)(s)p=\pi^{(k)}(s) in Lemma 6 and combining it with (A.2), we arrive at

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V_{\tau}^{(k+1)}(s)-V_{\tau}^{(k)}(s)
=1(1γ)η𝔼sds(k+1)[(1+ητ)Dhs(π(k),π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))]0\displaystyle=\frac{1}{(1-\gamma)\eta}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d_{s}^{(k+1)}}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\pi^{(k+1)};\xi^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\xi^{(k)}\big{)}\right]\geq 0

for any s𝒮s\in\mathcal{S}, thus establishing the advertised pointwise monotonicity w.r.t. the regularized value function.

When it comes to the regularized Q-function, it is readily seen from the definition (9a) that

Qτ(k+1)(s,a)\displaystyle{Q}^{(k+1)}_{\tau}(s,a) =r(s,a)+γ𝔼sP(|s,a)[Vτ(k+1)(s)]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\big{[}V^{(k+1)}_{\tau}(s^{\prime})\big{]}
r(s,a)+γ𝔼sP(|s,a)[Vτ(k)(s)]=Qτ(k)(s,a)\displaystyle\geq r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\big{[}V^{(k)}_{\tau}(s^{\prime})\big{]}={Q}^{(k)}_{\tau}(s,a)

for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, where the last line is valid since Vτ(k+1)Vτ(k)V_{\tau}^{(k+1)}\geq V_{\tau}^{(k)}. This concludes the proof.

A.2.1 Proof of Lemma 5

For any two policies π\pi^{\prime} and π\pi, it follows from the definition (7) of Vτπ(s)V^{\pi}_{\tau}(s) that

Vτπ(s)Vτπ(s)=𝔼π[t=0γt[r(st,at)τhst(π(st))]|s0=s]Vτπ(s)\displaystyle V^{\pi^{\prime}}_{\tau}(s)-V^{\pi}_{\tau}(s)=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}\Big{]}\,\Big{|}\,s_{0}=s\right]-V^{\pi}_{\tau}(s)
=𝔼π[t=0γt[r(st,at)τhst(π(st))+Vτπ(st)Vτπ(st)]|s0=s]Vτπ(s)\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+V^{\pi}_{\tau}(s_{t})-V^{\pi}_{\tau}(s_{t})\Big{]}\,\Big{|}\,s_{0}=s\right]-V^{\pi}_{\tau}(s)
=𝔼π[t=0γt[r(st,at)τhst(π(st))+γVτπ(st+1)Vτπ(st)]|s0=s]+𝔼π[Vτπ(s0)|s0=s]Vτπ(s)\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\gamma V^{\pi}_{\tau}(s_{t+1})-V^{\pi}_{\tau}(s_{t})\Big{]}\,\Big{|}\,s_{0}=s\right]+\mathbb{E}_{\pi^{\prime}}\left[V^{\pi}_{\tau}(s_{0})\,\Big{|}\,s_{0}=s\right]-V^{\pi}_{\tau}(s)
=𝔼π[t=0γt[r(st,at)τhst(π(st))+γVτπ(st+1)Vτπ(st)]|s0=s]\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\gamma V^{\pi}_{\tau}(s_{t+1})-V^{\pi}_{\tau}(s_{t})\Big{]}\,\Big{|}\,s_{0}=s\right]
=𝔼π[t=0γt[r(st,at)τhst(π(st))+γVτπ(st+1)Vτπ(st)τhst(π(st))+τhst(π(st))]|s0=s]\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}+\gamma V^{\pi}_{\tau}(s_{t+1})-V^{\pi}_{\tau}(s_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}\,\Big{|}\,s_{0}=s\right]
=𝔼π[t=0γt[Qτπ(st,at)τhst(π(st))Vτπ(st)τhst(π(st))+τhst(π(st))]|s0=s]\displaystyle\qquad=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}-V^{\pi}_{\tau}(s_{t})-\tau h_{s_{t}}\big{(}\pi^{\prime}(s_{t})\big{)}+\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}\,\Big{|}\,s_{0}=s\right]
=11γ𝔼sdsπ[Qτπ(s),π(s)π(s)τhs(π(s))+τhs(π(s))],\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{\pi^{\prime}}_{s}}\Big{[}\big{\langle}Q^{\pi}_{\tau}(s^{\prime}),\pi^{\prime}(s^{\prime})-\pi(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{\prime}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi(s^{\prime})\big{)}\Big{]}, (52)

where the penultimate line comes from the definition (9a). To see why the last line of (52) is valid, we make note of the following identity

𝔼atπ(st)[Qτπ(st,at)τhst(π(st))Vτπ(st)]\displaystyle\mathop{\mathbb{E}}\limits_{a_{t}\sim\pi^{\prime}(s_{t})}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}-V^{\pi}_{\tau}(s_{t})\Big{]}
=𝔼atπ(st)[Qτπ(st,at)τhst(π(st))]𝔼atπ(st)[Qτπ(st,at)τhst(π(st))]\displaystyle\qquad=\mathop{\mathbb{E}}\limits_{a_{t}\sim\pi^{\prime}(s_{t})}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}-\mathop{\mathbb{E}}\limits_{a_{t}\sim\pi(s_{t})}\Big{[}Q^{\pi}_{\tau}(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\Big{]}
=Qτπ(st)τhst(π(st))1,π(st)π(st)\displaystyle\qquad=\Big{\langle}Q^{\pi}_{\tau}(s_{t})-\tau h_{s_{t}}\big{(}\pi(s_{t})\big{)}\cdot 1,\pi^{\prime}(s_{t})-\pi(s_{t})\Big{\rangle}
=Qτπ(st),π(st)π(st),\displaystyle\qquad=\Big{\langle}Q^{\pi}_{\tau}(s_{t}),\pi^{\prime}(s_{t})-\pi(s_{t})\Big{\rangle}, (53)

where the first identity results from the relation (9b), and the last relation holds since 1π(st)=1π(st)=11^{\top}\pi^{\prime}(s_{t})=1^{\top}\pi(s_{t})=1. The last line of (52) then follows immediately from the relation (53) and the definition (5) of dsπd_{s}^{\pi}.

A.2.2 Proof of Lemma 6

For any state s𝒮s\in\mathcal{S}, we make the observation that

Dhs(p,π(k);ξ(k))=hs(p)hs(π(k)(s))pπ(k)(s),ξ(k)(s)\displaystyle D_{h_{s}}\big{(}p,\pi^{(k)};\xi^{(k)}\big{)}=h_{s}(p)-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}p-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}
=hs(p)hs(π(k+1)(s))pπ(k+1)(s),ξ(k)(s)\displaystyle\qquad=h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k)}(s)\big{\rangle}
+hs(π(k+1)(s))hs(π(k)(s))π(k+1)(s)π(k)(s),ξ(k)(s)\displaystyle\qquad\qquad+h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}\pi^{(k+1)}(s)-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}
=hs(p)hs(π(k+1)(s))pπ(k+1)(s),ξ(k+1)(s)\displaystyle\qquad=h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k+1)}(s)\big{\rangle}
+hs(π(k+1)(s))hs(π(k)(s))π(k+1)(s)π(k)(s),ξ(k)(s)\displaystyle\qquad\qquad+h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-h_{s}\big{(}\pi^{(k)}(s)\big{)}-\big{\langle}\pi^{(k+1)}(s)-\pi^{(k)}(s),\xi^{(k)}(s)\big{\rangle}
\displaystyle\qquad\qquad+\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k+1)}(s)-\xi^{(k)}(s)\big{\rangle}
=Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))+pπ(k+1)(s),ξ(k+1)(s)ξ(k)(s)\displaystyle\qquad=D_{h_{s}}\big{(}p,\pi^{(k+1)};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\xi^{(k)}\big{)}+\big{\langle}p-\pi^{(k+1)}(s),\xi^{(k+1)}(s)-\xi^{(k)}(s)\big{\rangle}
=Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))+pπ(k+1)(s),ηQτ(k)(s)ητξ(k+1)(s),\displaystyle\qquad=D_{h_{s}}\big{(}p,\pi^{(k+1)};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\xi^{(k)}\big{)}+\big{\langle}p-\pi^{(k+1)}(s),\eta Q_{\tau}^{(k)}(s)-\eta\tau\xi^{(k+1)}(s)\big{\rangle},

where the first and the fourth steps invoke the definition (15) of the generalized Bregman divergence and the last line results from the update rule (20c). Rearranging terms, we are left with

ηQτ(k)(s),π(k+1)(s)p\displaystyle\eta\big{\langle}Q^{(k)}_{\tau}(s),\pi^{(k+1)}(s)-p\big{\rangle}
={Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))}\displaystyle\qquad\qquad=\left\{D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},{\pi^{(k)}};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}\right\}
+ητξ(k+1)(s),π(k+1)(s)p.\displaystyle\qquad\qquad\qquad+\eta\tau\big{\langle}\xi^{(k+1)}(s),\pi^{(k+1)}(s)-p\big{\rangle}.

Adding the term ητ{hs(p)hs(π(k+1)(s))}\eta\tau\left\{h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}\right\} to both sides of this identity leads to

η[Qτ(k)(s),π(k+1)(s)p+τhs(p)τhs(π(k+1)(s))]\displaystyle\eta\left[\big{\langle}Q^{(k)}_{\tau}(s),\pi^{(k+1)}(s)-p\big{\rangle}+\tau h_{s}(p)-\tau h_{s}\big{(}\pi^{(k+1)}(s)\big{)}\right]
={Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))}\displaystyle\qquad=\left\{D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},{\pi^{(k)}};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}\right\}
+ητ(hs(p)hs(π(k+1)(s))ξ(k+1)(s),pπ(k+1)(s))\displaystyle\qquad\quad\quad+\eta\tau\left(h_{s}(p)-h_{s}\big{(}\pi^{(k+1)}(s)\big{)}-\big{\langle}\xi^{(k+1)}(s),p-\pi^{(k+1)}(s)\big{\rangle}\right)
=(1+ητ)Dhs(p,π(k+1);ξ(k+1))+Dhs(π(k+1),π(k);ξ(k))Dhs(p,π(k);ξ(k))\displaystyle\qquad=(1+\eta\tau)D_{h_{s}}\big{(}{p},{\pi^{(k+1)}};\xi^{(k+1)}\big{)}+D_{h_{s}}\big{(}{\pi^{(k+1)}},\pi^{(k)};\xi^{(k)}\big{)}-D_{h_{s}}\big{(}{p},{\pi^{(k)}};\xi^{(k)}\big{)}

as claimed, where the last line makes use of the definition (15).

A.3 Proof of Lemma 3

In the sequel, we shall prove each claim in Lemma 3 separately.

Proof of the contraction property (36).

For any Q1,Q2|𝒮||𝒜|Q_{1},Q_{2}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}, the definition (35) of the generalized Bellman operator obeys

𝒯τ,h(Q1)𝒯τ,h(Q2)\displaystyle\mathcal{T}_{\tau,h}(Q_{1})-\mathcal{T}_{\tau,h}(Q_{2}) =γ𝔼sP(|s,a)[maxpΔ(𝒜){Q1(s),pτhs(p)}maxpΔ(𝒜){Q2(s),pτhs(p)}]\displaystyle=\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\Big{\{}\langle Q_{1}(s^{\prime}),p\rangle-\tau h_{s^{\prime}}(p)\Big{\}}-\max_{p\in\Delta(\mathcal{A})}\Big{\{}\langle Q_{2}(s^{\prime}),p\rangle-\tau h_{s^{\prime}}(p)\Big{\}}\right]
(a)γ𝔼sP(|s,a)[maxpΔ(𝒜)Q1(s)Q2(s),p]\displaystyle\overset{\mathrm{(a)}}{\leq}\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\big{\langle}Q_{1}(s^{\prime})-Q_{2}(s^{\prime}),p\big{\rangle}\right]
γ𝔼sP(|s,a)[maxp:p1=1Q1Q2p1]\displaystyle\leq\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{p:\|p\|_{1}=1}\|Q_{1}-Q_{2}\|_{\infty}\|p\|_{1}\right]
=γQ1Q2,\displaystyle=\gamma\|Q_{1}-Q_{2}\|_{\infty},

where (a) arises from the elementary fact maxxf(x)maxxg(x)maxx(f(x)g(x))\max_{x}f(x)-\max_{x}g(x)\leq\max_{x}\big{(}f(x)-g(x)\big{)}.

Proof of the fixed point property (37).

Towards this, let us first define

\displaystyle\pi^{\dagger}(s)\coloneqq\arg\max_{p_{s}\in\Delta(\mathcal{A})}\mathop{\mathbb{E}}\limits_{a\sim p_{s}}\Big{[}{Q}^{\star}_{\tau}(s,a)-\tau h_{s}\big{(}p_{s}\big{)}\Big{]}. (54)

Then it can be easily verified that

Qτ(s,a)\displaystyle{Q}^{\star}_{\tau}(s,a) =r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτ(s1,a1)τhs1(π(s1))]]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\star}(s_{1})}\Big{[}{Q}^{\star}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\star}(s_{1})\big{)}\Big{]}\right]
r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτ(s1,a1)τhs1(π(s1))]],\displaystyle\leq r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\dagger}(s_{1})}\Big{[}{Q}^{\star}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\dagger}(s_{1})\big{)}\Big{]}\right], (55)

where the first identity results from (9), and the second line arises from the maximizing property of π\pi^{\dagger} (see (54)).

Note that the right-hand side of (55) involves the term Qτ(s1,a1){Q}^{\star}_{\tau}(s_{1},a_{1}), which can be further upper bounded via the same argument for (55). Successively repeating this upper bound argument (and the expansion) eventually allows one to obtain

Qτ(s,a)r(s,a)+γ𝔼π[t=1γt1{r(st,at)τhst(π(st))}|s0=s,a0=a]=Qτπ(s,a).{Q}^{\star}_{\tau}(s,a)\leq r(s,a)+\gamma\mathbb{E}_{\pi^{\dagger}}\left[\sum_{t=1}^{\infty}\gamma^{t-1}\Big{\{}r(s_{t},a_{t})-\tau h_{s_{t}}\big{(}\pi^{\dagger}(s_{t})\big{)}\Big{\}}\,\Big{|}\,s_{0}=s,a_{0}=a\right]={Q}^{\pi^{\dagger}}_{\tau}(s,a).

However, the fact that π\pi^{\star} is the optimal policy necessarily implies the following reverse inequality:

Qτ(s,a)Qτπ(s,a).{Q}^{\star}_{\tau}(s,a)\geq{Q}^{\pi^{\dagger}}_{\tau}(s,a).

Therefore, one must have

Qτ(s,a)=Qτπ(s,a).\displaystyle{Q}^{\star}_{\tau}(s,a)={Q}^{\pi^{\dagger}}_{\tau}(s,a). (56)

To finish up, it suffices to show that Qτπ=𝒯τ,h(Qτ){Q}^{\pi^{\dagger}}_{\tau}=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau}). To this end, it is observed that

Qτπ(s,a)\displaystyle{Q}^{\pi^{\dagger}}_{\tau}(s,a) =r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτπ(s1,a1)τhs1(π(s1))]]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\dagger}(s_{1})}\left[{Q}^{\pi^{\dagger}}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\dagger}(s_{1})\big{)}\right]\right]
=(b)r(s,a)+γ𝔼s1P(|s,a)[𝔼a1π(s1)[Qτ(s1,a1)τhs1(π(s1))]]\displaystyle\overset{\mathrm{(b)}}{=}r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a_{1}\sim\pi^{\dagger}(s_{1})}\Big{[}{Q}^{{\star}}_{\tau}(s_{1},a_{1})-\tau h_{s_{1}}\big{(}\pi^{\dagger}(s_{1})\big{)}\Big{]}\right]
\displaystyle\overset{\mathrm{(c)}}{=}r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s_{1}\sim{P}(\cdot|s,a)}\left[\max_{p\in\Delta(\mathcal{A})}\Big{\{}\big{\langle}Q_{\tau}^{\star}(s_{1}),p\big{\rangle}-\tau h_{s_{1}}(p)\Big{\}}\right]
=𝒯τ,h(Qτ)(s,a),\displaystyle=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a),

where (b) utilizes the fact (56), (c) follows from the definition (54) of π\pi^{\dagger}, and the last identity is a consequence of the definition (35) of 𝒯τ,h\mathcal{T}_{\tau,h}. The above results taken collectively demonstrate that Qτ=𝒯τ,h(Qτ){Q}^{\star}_{\tau}=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau}) as claimed.

A.4 Proof of Lemma 4

Recall that Qτ(k+1)=Qτπ(k+1){Q}^{(k+1)}_{\tau}={Q}^{\pi^{(k+1)}}_{\tau}. In view of the relation (9), one obtains

Qτ(k+1)(s,a)\displaystyle Q_{\tau}^{(k+1)}(s,a) =r(s,a)+γ𝔼sP(|s,a)[Vτ(k+1)(s)]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[V_{\tau}^{(k+1)}(s^{\prime})\right]
=r(s,a)+γ𝔼sP(|s,a)[𝔼aπ(k+1)(s)[Qτ(k+1)(s,a)τhs(π(k+1)(s))]]\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\mathop{\mathbb{E}}\limits_{a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\left[Q_{\tau}^{(k+1)}(s^{\prime},a^{\prime})-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\right]\right]
\displaystyle=r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\left[\big{\langle}Q_{\tau}^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\right].

This combined with the fixed-point condition (37) allows us to derive

Qτ(s,a)Qτ(k+1)(s,a)=𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[Qτ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]}\displaystyle{Q}^{\star}_{\tau}(s,a)-{Q}^{(k+1)}_{\tau}(s,a)=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}{Q}^{(k+1)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\Big{]}\right\}
=𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[τξ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]}\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\tau\xi^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\Big{]}\right\}
γ𝔼sP(|s,a),aπ(k+1)(s)[Qτ(k+1)(s,a)τξ(k+1)(s,a)].\displaystyle\qquad\qquad\qquad-\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\xi^{(k+1)}(s^{\prime},a^{\prime})\Big{]}. (57)

In what follows, we control each term on the right-hand side of (A.4) separately.

Step 1: bounding the 1st term on the right-hand side of (A.4).

Lemma 1 tells us that

ξ(k+1)(s)cs(k+1)1hs(π(k+1)(s))\xi^{(k+1)}(s)-c_{s}^{(k+1)}1\in\partial h_{s}(\pi^{(k+1)}(s))

for some scalar cs(k+1)c_{s}^{(k+1)}\in\mathbb{R}. This important property allows one to derive

0\displaystyle 0 ξ(k+1)(s)+cs(k+1)1+hs(π(k+1)(s))=k+1,s(π(k+1)(s);cs(k+1))\displaystyle\in-\xi^{(k+1)}(s)+c_{s}^{(k+1)}1+\partial h_{s}\big{(}\pi^{(k+1)}(s)\big{)}=\partial{\mathcal{L}}_{k+1,s}\big{(}\pi^{(k+1)}(s);c_{s}^{(k+1)}\big{)} (58)

where

k+1,s(p;λ)ξ(k+1)(s),p+hs(p)fk+1,s(p)+λ 1p.\mathcal{L}_{k+1,s}(p;\lambda)\coloneqq\underset{\eqqcolon\,f_{k+1,s}(p)}{\underbrace{-\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}+h_{s}\big{(}p\big{)}}}+\lambda\,1^{\top}p.

Recognizing that the function fk+1,s()f_{k+1,s}(\cdot) is convex in pp, we can view k+1,s(p;λ)\mathcal{L}_{k+1,s}(p;\lambda) as the Lagrangian of the following constrained convex problem with Lagrangian multiplier λ\lambda\in\mathbb{R}:

minimizep:1p=1fk+1,s(p)=ξ(k+1)(s),p+hs(p).\displaystyle\mathop{\text{minimize}}\limits_{p:1^{\top}p=1}\quad f_{k+1,s}(p)=-\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}+h_{s}\big{(}p\big{)}. (59)

The condition (58) can then be interpreted as the optimality condition w.r.t. the program (59) and π(k+1)(s)\pi^{(k+1)}(s), meaning that

fk+1,s(π(k+1)(s))=minp:1p=1fk+1,s(p),\displaystyle f_{k+1,s}\big{(}\pi^{(k+1)}(s)\big{)}=\min_{p:1^{\top}p=1}f_{k+1,s}(p),

or equivalently,

ξ(k+1)(s),π(k+1)(s)hs(π(k+1)(s))\displaystyle\big{\langle}\xi^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}-h_{s}\big{(}\pi^{(k+1)}(s)\big{)} =maxp:1p=1ξ(k+1)(s),phs(p).\displaystyle=\max_{p:1^{\top}p=1}\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}-h_{s}(p). (60)

In addition, for any vector pp that does not obey p0p\geq 0, Assumption 1 implies that hs(p)=h_{s}(p)=\infty, and hence pp cannot possibly be the optimal solution to maxpΔ(𝒜)ξ(k+1)(s),phs(p)\max_{p\in\Delta(\mathcal{A})}\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}-h_{s}(p). This together with (60) essentially implies that

ξ(k+1)(s),π(k+1)(s)hs(π(k+1)(s))\displaystyle\big{\langle}\xi^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}-h_{s}\big{(}\pi^{(k+1)}(s)\big{)} =maxpΔ(𝒜)ξ(k+1)(s),phs(p).\displaystyle=\max_{p\in\Delta(\mathcal{A})}\big{\langle}\xi^{(k+1)}(s),p\big{\rangle}-h_{s}(p). (61)

As a consequence, we arrive at

𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[τξ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]}\displaystyle\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\Big{[}\big{\langle}\tau\xi^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}\Big{]}\right\}
=𝒯τ,h(Qτ)(s,a){r(s,a)+γ𝔼sP(|s,a)[maxpΔ(𝒜){τξ(k+1)(s),pτhs(p)}]}\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left\{r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\Big{[}\max_{p\in\Delta(\mathcal{A})}\Big{\{}\big{\langle}\tau\xi^{(k+1)}(s^{\prime}),p\big{\rangle}-\tau h_{s^{\prime}}(p)\Big{\}}\Big{]}\right\}
=𝒯τ,h(Qτ)(s,a)𝒯τ,h(τξ(k+1))(s,a)\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\mathcal{T}_{\tau,h}(\tau\xi^{(k+1)})(s,a)
γQττξ(k+1),\displaystyle\qquad\leq\gamma\big{\|}{Q}_{\tau}^{\star}-\tau\xi^{(k+1)}\big{\|}_{\infty}, (62)

where the last step results from the contraction property (36) in Lemma 3.

Step 2: bounding the 2nd term on the right-hand side of (A.4).

Recall that α=11+ητ\alpha=\frac{1}{1+\eta\tau}. Invoking the monotonicity property in Lemma 2 and the update rule (20c), we obtain

Qτ(k+1)(s,a)τξ(k+1)(s,a)\displaystyle{Q}^{(k+1)}_{\tau}(s,a)-\tau\xi^{(k+1)}(s,a) =α{Qτ(k+1)(s,a)τξ(k)(s,a)}+(1α){Qτ(k+1)(s,a)Qτ(k)(s,a)}\displaystyle=\alpha\Big{\{}{Q}^{(k+1)}_{\tau}(s,a)-\tau\xi^{(k)}(s,a)\Big{\}}+(1-\alpha)\Big{\{}{Q}^{(k+1)}_{\tau}(s,a)-{Q}^{(k)}_{\tau}(s,a)\Big{\}}
α{Qτ(k)(s,a)τξ(k)(s,a)}.\displaystyle\geq\alpha\Big{\{}{Q}^{(k)}_{\tau}(s,a)-\tau\xi^{(k)}(s,a)\Big{\}}.

Repeating this lower bound argument then yields

Qτ(k+1)(s,a)τξ(k+1)(s,a)\displaystyle{Q}^{(k+1)}_{\tau}(s,a)-\tau\xi^{(k+1)}(s,a) αk+1{Qτ(0)(s,a)τξ(0)(s,a)}\displaystyle\geq\alpha^{k+1}\Big{\{}{Q}^{(0)}_{\tau}(s,a)-\tau\xi^{(0)}(s,a)\Big{\}}
αk+1Qτ(0)τξ(0),\displaystyle\geq-\alpha^{k+1}\big{\|}{Q}^{(0)}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty},

thus revealing that

-\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\xi^{(k+1)}(s^{\prime},a^{\prime})\Big{]}\leq\alpha^{k+1}\big{\|}{Q}^{(0)}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}. (63)
Step 3: putting all this together.

Substituting (62) and (63) into (A.4) gives

0Qτ(s,a)Qτ(k+1)(s,a)γQττξ(k+1)+αk+1Qτ(0)τξ(0)\displaystyle 0\leq{Q}^{\star}_{\tau}(s,a)-{Q}^{(k+1)}_{\tau}(s,a)\leq\gamma\big{\|}{Q}_{\tau}^{\star}-\tau\xi^{(k+1)}\big{\|}_{\infty}+\alpha^{k+1}\big{\|}{Q}^{(0)}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty} (64)

for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, thus concluding the proof.

Appendix B Analysis for approximate GPMD (Theorem 2)

The proof consists of three steps: (i) evaluating the performance difference between π(k)\pi^{(k)} and π(k+1)\pi^{(k+1)}, (ii) establishing a linear system to characterize the error dynamic, and (iii) analyzing this linear system to derive global convergence guarantees. We shall describe the details of each step in the sequel. As before, we adopt the notational convention (47) whenever it is clear from the context.

B.1 Step 1: bounding performance difference between consecutive iterates

When only approximate policy evaluation is available, we are no longer guaranteed to have pointwise monotonicity as in the case of Lemma 2. Fortunately, we are still able to establish an approximate version of Lemma 2, as stated below.

Lemma 7 (Performance improvement for approximate GPMD).

For all s𝒮s\in\mathcal{S} and all k0k\geq 0, we have

Vτ(k+1)(s)Vτ(k)(s)1+α1γε𝗈𝗉𝗍21γε𝖾𝗏𝖺𝗅.V^{(k+1)}_{\tau}(s)\geq V^{(k)}_{\tau}(s)-\frac{1+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{2}{1-\gamma}\varepsilon_{\mathsf{eval}}.

In addition, if hsh_{s} is 1-strongly convex w.r.t. the 1\ell_{1} norm for all s𝒮s\in\mathcal{S}, then one further has

Vτ(k+1)(s)Vτ(k)(s)3+α1γε𝗈𝗉𝗍η(2+ητ)(1γ)ε𝖾𝗏𝖺𝗅2.V^{(k+1)}_{\tau}(s)\geq V^{(k)}_{\tau}(s)-\frac{3+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{\eta}{(2+\eta\tau)(1-\gamma)}\varepsilon_{\mathsf{eval}}^{2}.

In words, while monotonicity is not guaranteed, this lemma precludes the possibility of Vτ(k+1)(s)V^{(k+1)}_{\tau}(s) being much smaller than Vτ(k)(s)V^{(k)}_{\tau}(s), as long as both ε𝖾𝗏𝖺𝗅\varepsilon_{\mathsf{eval}} and ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}} are reasonably small.

B.1.1 Proof of Lemma 7

The case when hsh_{s} is convex.

Let π~(k+1)\widetilde{\pi}^{(k+1)} be the exact solution of the following problem

π~(k+1)(s)=argminpΔ(𝒜){Q^τ(k)(s),p+τhs(p)+1ηDhs(p,π(k);ξ^(k))}.\widetilde{\pi}^{(k+1)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}\left\{-\big{\langle}\widehat{Q}^{(k)}_{\tau}(s),p\big{\rangle}+\tau h_{s}(p)+\frac{1}{\eta}D_{h_{s}}\big{(}p,\pi^{(k)};\widehat{\xi}^{(k)}\big{)}\right\}. (65)

With this auxiliary policy iterate π~(k+1)\widetilde{\pi}^{(k+1)} in mind, we start by decomposing Vτ(k+1)(s)Vτ(k)(s)V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s) into the following three parts:

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)
=11γ𝔼sds(k+1)[Qτ(k)(s),π(k+1)(s)π(k)(s)τhs(π(k+1)(s))+τhs(π(k)(s))]\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}Q^{(k)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}\right]
=11γ𝔼sds(k+1)[Q^τ(k)(s),π~(k+1)(s)π(k)(s)τhs(π~(k+1)(s))+τhs(π(k)(s))]\displaystyle\qquad=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}\widehat{Q}^{{(k)}}_{\tau}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}\right]
+11γ𝔼sds(k+1)[Q^τ(k)(s),π(k+1)(s)π~(k+1)(s)τhs(π(k+1)(s))+τhs(π~(k+1)(s))]\displaystyle\qquad\quad+\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}\widehat{Q}^{{(k)}}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}\right]
+11γ𝔼sds(k+1)[Qτ(k)(s)Q^τ(k)(s),π(k+1)(s)π(k)(s)],\displaystyle\qquad\quad+\frac{1}{1-\gamma}{\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\langle}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}\right]}, (66)

where the first identity arises from the performance difference lemma (cf. Lemma 5). To continue, we seek to control each part of (66) separately.

  • Regarding the first term of (66), replacing ξ\xi (resp. QτQ_{\tau}) by ξ^\widehat{\xi} (resp. Q^τ\widehat{Q}_{\tau}) in Lemma 6 indicates that

    Q^τ(k)(s),π~(k+1)(s)π(k)(s)τhs(π~(k+1)(s))+τhs(π(k)(s))\displaystyle\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}
    =1η[(1+ητ)Dhs(π(k),π~(k+1)(s);ξ^(k+1))+Dhs(π~(k+1),π(k);ξ^(k))]\displaystyle\qquad=\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)}(s^{\prime});\widehat{\xi}^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}\right] (67)

    for all s𝒮s^{\prime}\in\mathcal{S}.

  • As for the second term of (66), the definition of the oracle Gs,ε𝗈𝗉𝗍G_{s,\varepsilon_{\mathsf{opt}}} (see Assumption 3) guarantees that

    Q^τ(k)(s),π(k+1)(s)+τhs(π(k+1)(s))+1ηDhs(π(k+1),π(k);ξ^(k))\displaystyle-\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}+\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}
    Q^τ(k)(s),π~(k+1)(s)+τhs(π~(k+1)(s))+1ηDhs(π~(k+1),π(k);ξ^(k))+ε𝗈𝗉𝗍\displaystyle\qquad\leq-\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}+\varepsilon_{\mathsf{opt}} (68)

    for any s𝒮s^{\prime}\in\mathcal{S}. Rearranging terms, we are left with

    Q^τ(k)(s),π(k+1)(s)π~(k+1)(s)τhs(π(k+1)(s))+τhs(π~(k+1)(s))\displaystyle\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}
    1ηDhs(π~(k+1),π(k);ξ^(k))+1ηDhs(π(k+1),π(k);ξ^(k))ε𝗈𝗉𝗍\displaystyle\qquad\geq-\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}+\frac{1}{\eta}D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-\varepsilon_{\mathsf{opt}}
    =1η(Dhs(π~(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))\displaystyle\qquad=-\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)
    +1η(Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))ε𝗈𝗉𝗍.\displaystyle\qquad\quad+\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)-\varepsilon_{\mathsf{opt}}. (69)

In addition, we note that the term

Dhs(π~(k+1),π(k);ξ^(k))D_{h_{s^{\prime}}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}

appears in both (67) and (69), which can be canceled out when summing these two equalities. Specifically, adding (67) and (69) gives

Q^τ(k)(s),π~(k+1)(s)π(k)(s)τhs(π~(k+1)(s))+τhs(π(k)(s))\displaystyle\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\pi^{(k)}(s^{\prime})\big{)}
+Q^τ(k)(s),π(k+1)(s)π~(k+1)(s)τhs(π(k+1)(s))+τhs(π~(k+1)(s))\displaystyle\quad+\big{\langle}\widehat{Q}_{\tau}^{(k)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s^{\prime})\big{)}+\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s^{\prime})\big{)}
1η[(1+ητ)Dhs(π(k),π~(k+1);ξ^(k+1))+Dhs(π(k+1),π~(k);ξ^(k))]\displaystyle\qquad\geq\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}{\pi}^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right]
+1η(Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))ε𝗈𝗉𝗍.\displaystyle\qquad\quad+\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)-\varepsilon_{\mathsf{opt}}.

Substituting this into (66) and invoking the elementary inequality |a,b|a1b|\langle a,b\rangle|\leq\|a\|_{1}\|b\|_{\infty} thus lead to

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)
11γ𝔼sds(k+1)[1η[(1+ητ)Dhs(π(k),π~(k+1);ξ^(k+1))+Dhs(π(k+1),π~(k);ξ^(k))]]\displaystyle\geq\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}\big{)}+D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right]\right]
+11γ𝔼sds(k+1)[1η(Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k)))]\displaystyle\quad+\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\frac{1}{\eta}\left(D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\right)\right]
ε𝗈𝗉𝗍1γ11γ𝔼sds(k+1)[Qτ(k)(s)Q^τ(k)(s)π(k+1)(s)π(k)(s)1],\displaystyle\quad-\frac{\varepsilon_{\mathsf{opt}}}{1-\gamma}-\frac{1}{1-\gamma}{\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}\right]}, (70)

where the last line makes use of Assumption 2 and the fact π(k+1)(s)1=π(k)(s)1=1\|\pi^{(k+1)}(s)\|_{1}=\|\pi^{(k)}(s)\|_{1}=1.

Following the discussion in Lemma 1, we can see that ξ^(k)(s)cs(k)1hs(π~(k)(s))\widehat{\xi}^{(k)}(s)-c_{s}^{(k)}1\in\partial h_{s}(\widetilde{\pi}^{(k)}(s)) for some constant cs(k)c_{s}^{(k)} and all kk. This together with the convexity of hsh_{s} (see (15)) guarantees that

Dhs(π(k),π~(k+1);ξ^(k+1))0andDhs(π(k+1),π~(k);ξ^(k))0\displaystyle D_{h_{s}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}\big{)}\geq 0\qquad\text{and}\qquad D_{h_{s}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}\geq 0 (71)

for any s𝒮s\in\mathcal{S}, thus implying that the first term of (70) is non-negative. It remains to control the second term in (70). Towards this, a little algebra gives

Dhs(π(k+1),π(k);ξ^(k))Dhs(π(k+1),π~(k);ξ^(k))\displaystyle D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}-D_{h_{s}}\big{(}\pi^{(k+1)},\widetilde{\pi}^{(k)};\widehat{\xi}^{(k)}\big{)}
={hs(π(k)(s))hs(π~(k)(s))ξ^(k)(s),π(k)(s)π~(k)(s)}\displaystyle\qquad=-\left\{h_{s}(\pi^{(k)}(s))-h_{s}\big{(}\widetilde{\pi}^{(k)}(s)\big{)}-\big{\langle}\widehat{\xi}^{(k)}(s),\pi^{(k)}(s)-\widetilde{\pi}^{(k)}(s)\big{\rangle}\right\}
=hs(π(k)(s))+hs(π~(k)(s))+11+ητξ^(k1)(s)+η1+ητQ^τ(k1)(s),π(k)(s)π~(k)(s)\displaystyle\qquad=-h_{s}\big{(}\pi^{(k)}(s)\big{)}+h_{s}\big{(}\widetilde{\pi}^{(k)}(s)\big{)}+\left\langle\frac{1}{1+\eta\tau}\widehat{\xi}^{(k-1)}(s)+\frac{\eta}{1+\eta\tau}\widehat{Q}_{\tau}^{(k-1)}(s),\pi^{(k)}(s)-\widetilde{\pi}^{(k)}(s)\right\rangle
=η1+ητ{Q^τ(k1)(s),π~(k)(s)+τhs(π~(k)(s))+1ηDhs(π~(k),π(k1);ξ^(k1))\displaystyle\qquad=\frac{\eta}{1+\eta\tau}\left\{-\big{\langle}\widehat{Q}^{(k-1)}_{\tau}(s),\widetilde{\pi}^{(k)}(s)\big{\rangle}+\tau h_{s}\big{(}\widetilde{\pi}^{(k)}(s)\big{)}+\frac{1}{\eta}D_{h_{s}}\big{(}\widetilde{\pi}^{(k)},\pi^{(k-1)};\widehat{\xi}^{(k-1)}\big{)}\right.
[Q^τ(k1)(s),π(k)(s)+τhs(π(k)(s))+1ηDhs(π(k),π(k1);ξ^(k1))]}\displaystyle\qquad\quad\quad\quad\ \ \left.-\left[-\big{\langle}\widehat{Q}^{(k-1)}_{\tau}(s),\pi^{(k)}(s)\big{\rangle}+\tau h_{s}\big{(}\pi^{(k)}(s)\big{)}+\frac{1}{\eta}D_{h_{s}}\big{(}\pi^{(k)},\pi^{(k-1)};\widehat{\xi}^{(k-1)}\big{)}\right]\right\}
ηε𝗈𝗉𝗍1+ητ.\displaystyle\qquad\geq-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}. (72)

Here, the first and the third lines follow from the definition (15), the second identity comes from the construction (27), whereas the last step invokes the definition of the oracle (25). Substitution of (71) and (72) into (70) gives

Vτ(k+1)(s)Vτ(k)(s)\displaystyle V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s) 1+α1γε𝗈𝗉𝗍11γ𝔼sds(k+1)[Qτ(k)(s)Q^τ(k)(s)π(k+1)(s)π(k)(s)1]\displaystyle\geq-\frac{1+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{1}{1-\gamma}{\mathop{\mathbb{E}}\limits_{s^{\prime}\sim d^{(k+1)}_{s}}\left[\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}\right]} (73)
1+α1γε𝗈𝗉𝗍21γε𝖾𝗏𝖺𝗅.\displaystyle\geq-\frac{1+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{2}{1-\gamma}\varepsilon_{\mathsf{eval}}.
The case when hsh_{s} is strongly convex.

When hsh_{s^{\prime}} is 11-strongly convex w.r.t. the 1\ell_{1} norm, the objective function of sub-problem (65) is 1+ητη\frac{1+\eta\tau}{\eta}-strongly convex w.r.t. the 1\ell_{1} norm. Taking this together with the ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}}-approximation guarantee in Assumption 3, we can demonstrate that

1+ητ2ηπ~(k+1)(s)π(k+1)(s)12ε𝗈𝗉𝗍for all k0 and s𝒮.\frac{1+\eta\tau}{2\eta}\big{\|}\widetilde{\pi}^{(k+1)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}\leq\varepsilon_{\mathsf{opt}}\qquad\text{for all }k\geq 0\text{ and }s^{\prime}\in\mathcal{S}. (74)

Additionally, the strong convexity assumption also implies that

\displaystyle D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k)}(s^{\prime});\xi^{(k)}(s^{\prime})\big{)}\geq\frac{1}{2}\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}
=12(π~(k)(s)π(k+1)(s)12+π~(k)(s)π(k)(s)12)12π~(k)(s)π(k)(s)12\displaystyle\qquad=\frac{1}{2}\left(\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}+\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}\right)-\frac{1}{2}\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}
14(π~(k)(s)π(k+1)(s)1+π~(k)(s)π(k)(s)1)212π~(k)(s)π(k)(s)12\displaystyle\qquad\geq\frac{1}{4}\left(\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}+\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}\right)^{2}-\frac{1}{2}\big{\|}\widetilde{\pi}^{(k)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}
14π(k)(s)π(k+1)(s)12ηε𝗈𝗉𝗍1+ητ,\displaystyle\qquad\geq\frac{1}{4}\big{\|}\pi^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau},

where the third line results from Young’s inequality, and the final step follows from (74). We can develop a similar lower bound on Dhs(π(k),π~(k+1);ξ(k+1))D_{h_{s^{\prime}}}\big{(}\pi^{(k)},\widetilde{\pi}^{(k+1)};\xi^{(k+1)}\big{)} as well. Taken together, these lower bounds give

1η[(1+ητ)Dhs(π(k)(s),π~(k+1)(s);ξ(k+1)(s))+Dhs(π(k+1)(s),π~(k)(s);ξ(k)(s))]\displaystyle\frac{1}{\eta}\left[(1+\eta\tau)D_{h_{s^{\prime}}}\big{(}\pi^{(k)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime});\xi^{(k+1)}(s^{\prime})\big{)}+D_{h_{s^{\prime}}}\big{(}\pi^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k)}(s^{\prime});\xi^{(k)}(s^{\prime})\big{)}\right]
2+ητη(14π(k)(s)π(k+1)(s)12ηε𝗈𝗉𝗍1+ητ)\displaystyle\qquad\geq\frac{2+\eta\tau}{\eta}\left(\frac{1}{4}\big{\|}\pi^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}\right)
2+ητ4ηπ(k)(s)π(k+1)(s)122ε𝗈𝗉𝗍.\displaystyle\qquad\geq\frac{2+\eta\tau}{4\eta}\big{\|}\pi^{(k)}(s^{\prime})-\pi^{(k+1)}(s^{\prime})\big{\|}_{1}^{2}-2\varepsilon_{\mathsf{opt}}.

In addition, it is easily seen that

Qτ(k)(s)Q^τ(k)(s)π(k+1)(s)π(k)(s)1\displaystyle-\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}
12(2η2+ητQτ(k)(s)Q^τ(k)(s)2+2+ητ2ηπ(k+1)(s)π(k)(s)12)\displaystyle\qquad\geq-\frac{1}{2}\left(\frac{2\eta}{2+\eta\tau}\big{\|}Q^{(k)}_{\tau}(s^{\prime})-\widehat{Q}^{{(k)}}_{\tau}(s^{\prime})\big{\|}_{\infty}^{2}+\frac{2+\eta\tau}{2\eta}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}\right)
η2+ητε𝖾𝗏𝖺𝗅22+ητ4ηπ(k+1)(s)π(k)(s)12.\displaystyle\qquad\geq-\frac{\eta}{2+\eta\tau}\varepsilon_{\mathsf{eval}}^{2}-\frac{2+\eta\tau}{4\eta}\big{\|}\pi^{(k+1)}(s^{\prime})-\pi^{(k)}(s^{\prime})\big{\|}_{1}^{2}.

Combining the above two inequalities with (73), we arrive at the advertised bound

Vτ(k+1)(s)Vτ(k)(s)3+α1γε𝗈𝗉𝗍η(2+ητ)(1γ)ε𝖾𝗏𝖺𝗅2.V^{(k+1)}_{\tau}(s)-V^{(k)}_{\tau}(s)\geq-\frac{3+\alpha}{1-\gamma}\varepsilon_{\mathsf{opt}}-\frac{\eta}{(2+\eta\tau)(1-\gamma)}\varepsilon_{\mathsf{eval}}^{2}.

B.2 Step 2: connecting the algorithm dynamic with a linear system

Now we are ready to discuss how to control QτQτ(k)\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}. In short, we intend to establish the connection among several intertwined quantities, and identify a simple linear system that captures the algorithm dynamic.

Bounding Qττξ^(k+1)\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}.

From the definition of ξ^(k+1)\widehat{\xi}^{(k+1)} in (27), we have

Qττξ^(k+1)\displaystyle\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty} =α(Qττξ^(k))+(1α)(QτQτ(k))+(1α)(Qτ(k)Q^τ(k))\displaystyle=\big{\|}\alpha({Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)})+(1-\alpha)({Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau})+(1-\alpha)({Q}^{{(k)}}_{\tau}-\widehat{{Q}}^{{(k)}}_{\tau})\big{\|}_{\infty}
αQττξ^(k)+(1α)QτQτ(k)+(1α)Qτ(k)Q^τ(k)\displaystyle\leq\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)}\big{\|}_{\infty}+(1-\alpha)\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}+(1-\alpha)\big{\|}{Q}^{{(k)}}_{\tau}-\widehat{{Q}}^{{(k)}}_{\tau}\big{\|}_{\infty}
αQττξ^(k)+(1α)QτQτ(k)+(1α)ε𝖾𝗏𝖺𝗅,\displaystyle\leq\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)}\big{\|}_{\infty}+(1-\alpha)\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}+(1-\alpha)\varepsilon_{\mathsf{eval}}, (75)

where the last inequality is a consequence of Assumption 2.

Bounding mins,a(Qτ(k+1)(s,a)τξ^(k+1)(s,a))-\min_{s,a}\big{(}{Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\big{)}.

Applying the definition in (27) once again, we obtain

(Qτ(k+1)(s,a)τξ^(k+1)(s,a))\displaystyle-\big{(}{Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\big{)} =α(Qτ(k)(s,a)τξ^(k)(s,a))+(1α)(Q^τ(k)(s,a)Qτ(k)(s,a))\displaystyle=-\alpha\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}+(1-\alpha)\big{(}\widehat{{Q}}^{{(k)}}_{\tau}(s,a)-{Q}^{{(k)}}_{\tau}(s,a)\big{)}
+(Qτ(k)(s,a)Qτ(k+1)(s,a))\displaystyle\qquad+\big{(}{Q}^{{(k)}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a)\big{)}
α(Qτ(k)(s,a)τξ^(k)(s,a))+(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍,\displaystyle\leq-\alpha\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}+(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}, (76)

where

c1={2γ1γ,if hs is convex but not strongly convex,ηε𝖾𝗏𝖺𝗅γ(2+ητ)(1γ),if hs is 1-strongly convex w.r.t. the 1 norm,c2={(α+1)γ1γ,if hs is convex but not strongly convex,(α+3)γ1γ,if hs is 1-strongly convex w.r.t. the 1 norm.\begin{split}c_{1}=&\begin{cases}\frac{2\gamma}{1-\gamma},\qquad&\text{if }h_{s}\text{ is convex but not strongly convex},\\ \frac{\eta\varepsilon_{\mathsf{eval}}\gamma}{(2+\eta\tau)(1-\gamma)},\quad&\text{if }h_{s}\text{ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm},\end{cases}\\ c_{2}&=\begin{cases}\frac{(\alpha+1)\gamma}{1-\gamma},\quad&\quad\text{if }h_{s}\text{ is convex but not strongly convex},\\ \frac{(\alpha+3)\gamma}{1-\gamma},\quad&\quad\text{if }h_{s}\text{ is $1$-strongly convex w.r.t. the $\ell_{1}$ norm}.\end{cases}\end{split} (77)

Here, the last step of (76) follows from Assumption 2 as well as the following relation:

Qτ(k)(s,a)Qτ(k+1)(s,a)=γ𝔼sP(|s,a)[Vτ(k)(s)Vτ(k+1)(s)]c1ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍,{Q}^{{(k)}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a)=\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\left[V^{{(k)}}_{\tau}(s^{\prime})-V^{{(k+1)}}_{\tau}(s^{\prime})\right]\leq c_{1}\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}},

where we have made use of Lemma 7. Taking the maximum over (s,a) on both sides of (76) yields

mins,a(Qτ(k+1)(s,a)τξ^(k+1)(s,a))αmins,a(Qτ(k)(s,a)τξ^(k)(s,a))+(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍.-\min_{s,a}\left({Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\right)\leq-\alpha\min_{s,a}\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}+(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}. (78)
Bounding \big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(k+1)}}_{\tau}\big{\|}_{\infty}.

To begin with, let us decompose {Q}^{{\star}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a) into several parts. Invoking the relation (37) in Lemma 3 as well as the property (9), we reach

Qτ(s,a)Qτ(k+1)(s,a)\displaystyle{Q}^{{\star}}_{\tau}(s,a)-{Q}^{{(k+1)}}_{\tau}(s,a)
=𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[Qτ(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]]\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}{Q}^{(k+1)}_{\tau}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s)\big{)}\Big{]}\right]
=𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[τξ^(k+1)(s),π(k+1)(s)τhs(π(k+1)(s))]]\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\tau\widehat{\xi}^{(k+1)}(s^{\prime}),\pi^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\pi^{(k+1)}(s)\big{)}\Big{]}\right]
γ𝔼sP(|s,a),aπ(k+1)(s)[Qτ(k+1)(s,a)τξ^(k+1)(s,a)]\displaystyle\qquad\qquad-\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\widehat{\xi}^{(k+1)}(s^{\prime},a^{\prime})\Big{]}
={𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[τξ^(k+1)(s),π~(k+1)(s)τhs(π~(k+1)(s))]]}\displaystyle\qquad=\left\{\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\tau\widehat{\xi}^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\Big{]}\right]\right\}
τγ𝔼sP(|s,a)[ξ^(k+1)(s),π(k+1)(s)π~(k+1)(s)hs(π(k+1)(s))+hs(π~(k+1)(s))]\displaystyle\qquad\quad-\tau\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a)}\Big{[}\big{\langle}\widehat{\xi}^{(k+1)}(s^{\prime}),{\pi}^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-h_{s^{\prime}}\big{(}{\pi}^{(k+1)}(s)\big{)}+h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\Big{]}
γ𝔼sP(|s,a),aπ(k+1)(s)[Qτ(k+1)(s,a)τξ^(k+1)(s,a)].\displaystyle\qquad\quad-\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim{P}(\cdot|s,a),a^{\prime}\sim\pi^{(k+1)}(s^{\prime})}\Big{[}{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\widehat{\xi}^{(k+1)}(s^{\prime},a^{\prime})\Big{]}. (79)

In the sequel, we control the three terms in (79) separately.

  • To begin with, we repeat an argument similar to the one leading to (62) to show that

    𝒯τ,h(Qτ)(s,a)[r(s,a)+γ𝔼sP(|s,a)[τξ^(k+1)(s),π~(k+1)(s)τhs(π~(k+1)(s))]]\displaystyle\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\left[r(s,a)+\gamma\mathop{\mathbb{E}}\limits_{s^{\prime}\sim P(\cdot|s,a)}\Big{[}\big{\langle}\tau\widehat{\xi}^{(k+1)}(s^{\prime}),\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-\tau h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\Big{]}\right]
    =𝒯τ,h(Qτ)(s,a)𝒯τ,h(ξ^(k+1))(s,a)\displaystyle\qquad=\mathcal{T}_{\tau,h}({Q}^{\star}_{\tau})(s,a)-\mathcal{T}_{\tau,h}(\widehat{\xi}^{(k+1)})(s,a)
    γQττξ^(k+1).\displaystyle\qquad\leq\gamma\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}.
  • The second term of (79) can be bounded by applying (72) with k replaced by k+1:

    ξ^(k+1)(s),π(k+1)(s)π~(k+1)(s)hs(π(k+1)(s))+hs(π~(k+1)(s))ηε𝗈𝗉𝗍1+ητ.\displaystyle\big{\langle}\widehat{\xi}^{(k+1)}(s^{\prime}),{\pi}^{(k+1)}(s^{\prime})-\widetilde{\pi}^{(k+1)}(s^{\prime})\big{\rangle}-h_{s^{\prime}}\big{(}{\pi}^{(k+1)}(s)\big{)}+h_{s^{\prime}}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}\geq-\frac{\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}.
  • As for the third term of (79), taking the maximum over all (s,a)\in\mathcal{S}\times\mathcal{A} gives

    Qτ(k+1)(s,a)τξ^(k+1)(s,a)mins,a(Qτ(k+1)(s,a)τξ^(k+1)(s,a)).{Q}^{(k+1)}_{\tau}(s^{\prime},a^{\prime})-\tau\widehat{\xi}^{(k+1)}(s^{\prime},a^{\prime})\leq-\min_{s,a}\left({Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\right).

Taken together, the above bounds and the decomposition (79) lead to

\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(k+1)}}_{\tau}\big{\|}_{\infty}\leq\gamma\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}-\gamma\min_{s,a}\left({Q}^{{(k+1)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k+1)}(s,a)\right)+\gamma(1-\alpha)\varepsilon_{\mathsf{opt}}. (80)
A linear system of interest.

Combining (75), (78), and (80), we arrive at the following linear system

z_{k+1}\leq Bz_{k}+b, (81)

where

B\coloneqq\begin{bmatrix}\gamma(1-\alpha)&\gamma\alpha&\gamma\alpha\\ 1-\alpha&\alpha&0\\ 0&0&\alpha\end{bmatrix},\qquad z_{k}\coloneqq\begin{bmatrix}\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(k)}}_{\tau}\big{\|}_{\infty}\\ \big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(k)}\big{\|}_{\infty}\\ -\min_{s,a}\big{(}{Q}^{{(k)}}_{\tau}(s,a)-\tau\widehat{\xi}^{(k)}(s,a)\big{)}\end{bmatrix},
b\coloneqq\begin{bmatrix}\gamma(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+\gamma(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\\ (1-\alpha)\varepsilon_{\mathsf{eval}}\\ (1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\end{bmatrix}. (82)

This linear system of three variables captures how the estimation error progresses as the iteration count increases.
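
To gain intuition for this recursion, one can simulate the worst-case propagation z_{k+1}=Bz_{k}+b directly. The following minimal Python sketch does so under illustrative, hypothetical choices of (\gamma,\eta,\tau) and of the error levels \varepsilon_{\mathsf{eval}} and \varepsilon_{\mathsf{opt}} (placeholders, not settings used in our experiments); it also reports the spectral radius of B, which governs the contraction rate.

    import numpy as np

    # Illustrative (hypothetical) parameters; not the paper's experimental settings.
    gamma, eta, tau = 0.9, 1.0, 0.1
    alpha = 1.0 / (1.0 + eta * tau)                  # alpha = 1 / (1 + eta * tau)
    eps_eval, eps_opt = 1e-3, 1e-3                   # evaluation / update errors
    c1 = 2 * gamma / (1 - gamma)                     # constants from (77), convex (non-strongly) case
    c2 = (alpha + 1) * gamma / (1 - gamma)

    # System matrix B and offset vector b from (82).
    B = np.array([[gamma * (1 - alpha), gamma * alpha, gamma * alpha],
                  [1 - alpha,           alpha,         0.0],
                  [0.0,                 0.0,           alpha]])
    b = np.array([gamma * (2 - 2 * alpha + c1) * eps_eval + gamma * (1 - alpha + c2) * eps_opt,
                  (1 - alpha) * eps_eval,
                  (1 - alpha + c1) * eps_eval + c2 * eps_opt])

    # Worst-case propagation of z_k starting from a crude initial bound.
    z = np.full(3, 1.0 / (1.0 - gamma))
    for _ in range(300):
        z = B @ z + b
    print("spectral radius of B:", np.max(np.abs(np.linalg.eigvals(B))))  # equals alpha + (1 - alpha) * gamma
    print("limiting error floor:", z)                                      # approximately (I - B)^{-1} b

The printed spectral radius equals \lambda_{1}=\alpha+(1-\alpha)\gamma<1, so the iterates contract geometrically toward the fixed point (I-B)^{-1}b, in line with the analysis carried out in the next step.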

B.3 Step 3: linear system analysis

In this step, we analyze the behavior of the linear system (81) derived above. Observe that the eigenvalues and the associated eigenvectors of the matrix B are given by

\lambda_{1}=\alpha+(1-\alpha)\gamma,\qquad\lambda_{2}=\alpha,\qquad\lambda_{3}=0, (83)
v_{1}=\begin{bmatrix}\gamma\\ 1\\ 0\end{bmatrix},\qquad v_{2}=\begin{bmatrix}0\\ -1\\ 1\end{bmatrix},\qquad v_{3}=\begin{bmatrix}\alpha\\ \alpha-1\\ 0\end{bmatrix}. (84)
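
As a quick sanity check (separate from the proof), one can verify symbolically that each pair (\lambda_{i},v_{i}) in (83) and (84) satisfies Bv_{i}=\lambda_{i}v_{i} for arbitrary \alpha and \gamma; the short sketch below uses sympy for this purpose.

    import sympy as sp

    a, g = sp.symbols('alpha gamma', positive=True)
    B = sp.Matrix([[g * (1 - a), g * a, g * a],
                   [1 - a,       a,     0],
                   [0,           0,     a]])
    eigenpairs = [(a + (1 - a) * g, sp.Matrix([g, 1, 0])),      # (lambda_1, v_1)
                  (a,               sp.Matrix([0, -1, 1])),     # (lambda_2, v_2)
                  (sp.Integer(0),   sp.Matrix([a, a - 1, 0]))]  # (lambda_3, v_3)
    for lam, v in eigenpairs:
        residual = B * v - lam * v
        assert all(sp.simplify(entry) == 0 for entry in residual)
    print("all three eigenpairs of B verified")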

Armed with these, we can decompose z_{0} in terms of the eigenvectors of B as follows:

z0\displaystyle z_{0} [QτQτ(0)Qττξ^(0)Qτ(0)τξ^(0)]\displaystyle\leq\begin{bmatrix}\|{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\|_{\infty}\\ \|{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\|_{\infty}\\ \|{Q}^{(0)}_{\tau}-\tau\widehat{\xi}^{(0)}\|_{\infty}\end{bmatrix}
=1α+(1α)γ[(1α)QτQτ(0)+αQττξ^(0)+αQτ(0)τξ^(0)]v1\displaystyle=\frac{1}{\alpha+(1-\alpha)\gamma}\left[(1-\alpha)\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\big{\|}_{\infty}+\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}+\alpha\big{\|}{Q}^{(0)}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right]v_{1}
+Qττξ^(0)v2+ezv3\displaystyle\qquad+\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}v_{2}+e_{z}v_{3}
1α+(1α)γ[QτQτ(0)+2αQττξ^(0)]v1+Qττξ^(0)v2+ezv3,\displaystyle\leq\frac{1}{\alpha+(1-\alpha)\gamma}\left[\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right]v_{1}+\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}v_{2}+e_{z}v_{3}, (85)

where e_{z}\in\mathbb{R} is some constant that does not affect our final result (recall that \lambda_{3}=0, so the v_{3} component of z_{0} is annihilated after a single application of B). Also, the vector b defined in (82) satisfies

b\displaystyle b [γ(22α+c1)ε𝖾𝗏𝖺𝗅+γ(1α+c2)ε𝗈𝗉𝗍(1α)ε𝖾𝗏𝖺𝗅+(1α)ε𝗈𝗉𝗍(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]\displaystyle\leq\begin{bmatrix}\gamma(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+\gamma(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\\ (1-\alpha)\varepsilon_{\mathsf{eval}}+(1-\alpha)\varepsilon_{\mathsf{opt}}\\ (1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\end{bmatrix}
=[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]v1+[(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]v2.\displaystyle=\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}v_{1}+\big{[}(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\big{]}v_{2}. (86)

Using the decompositions in (85) and (86) and applying the system relation (81) recursively, we can derive

zk+1\displaystyle z_{k+1} Bk+1z0+t=0kBktb\displaystyle\leq B^{k+1}z_{0}+\sum_{t=0}^{k}B^{k-t}b
Bk+1[1α+(1α)γ[QτQτ(0)+2αQττξ^(0)]v1+Qττξ^(0)v2+ezv3]\displaystyle\leq B^{k+1}\left[\frac{1}{\alpha+(1-\alpha)\gamma}\left[\big{\|}{Q}^{{\star}}_{\tau}-{Q}^{{(0)}}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right]v_{1}+\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}v_{2}+e_{z}v_{3}\right]
+t=0kBkt[[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]v1+[(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]v2]\displaystyle\quad+\sum_{t=0}^{k}B^{k-t}\Big{[}\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}v_{1}+\big{[}(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}\big{]}v_{2}\Big{]}
=[λ1k(QτQτ(0)+2αQττξ^(0))+1λ1k+11λ1[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]]v1\displaystyle=\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+\frac{1-\lambda_{1}^{k+1}}{1-\lambda_{1}}\Big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\Big{]}\right]v_{1}
+[λ2k+1Qττξ^(0)+1λ2k+11λ2[(1α+c1)ε𝖾𝗏𝖺𝗅+c2ε𝗈𝗉𝗍]]v2.\displaystyle\quad+\Big{[}\lambda_{2}^{k+1}\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}+\frac{1-\lambda_{2}^{k+1}}{1-\lambda_{2}}[(1-\alpha+c_{1})\varepsilon_{\mathsf{eval}}+c_{2}\varepsilon_{\mathsf{opt}}]\Big{]}v_{2}.

Recognizing that the first two entries of v_{2} are non-positive (and that the coefficient multiplying v_{2} is non-negative), we can discard the term involving v_{2} and obtain

[QτQτ(k+1)Qττξ^(k+1)]\displaystyle\begin{bmatrix}\|{Q}^{\star}_{\tau}-{Q}^{{(k+1)}}_{\tau}\|_{\infty}\\ \|Q_{\tau}^{\star}-\tau\widehat{\xi}^{(k+1)}\|_{\infty}\end{bmatrix}
\displaystyle\leq\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+\frac{1-\lambda_{1}^{k+1}}{1-\lambda_{1}}\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}\right]\begin{bmatrix}\gamma\\ 1\end{bmatrix}
[λ1k(QτQτ(0)+2αQττξ^(0))+11λ1[(22α+c1)ε𝖾𝗏𝖺𝗅+(1α+c2)ε𝗈𝗉𝗍]C][γ1].\displaystyle\leq\bigg{[}\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+\underbrace{\frac{1}{1-\lambda_{1}}\big{[}(2-2\alpha+c_{1})\varepsilon_{\mathsf{eval}}+(1-\alpha+c_{2})\varepsilon_{\mathsf{opt}}\big{]}}_{\eqqcolon\,C}\bigg{]}\begin{bmatrix}\gamma\\ 1\end{bmatrix}.

Making use of the fact that 1-\lambda_{1}=1-\alpha-(1-\alpha)\gamma=(1-\alpha)(1-\gamma), we can conclude

C=11γ[(2+c11α)ε𝖾𝗏𝖺𝗅+(1+c21α)ε𝗈𝗉𝗍].C=\frac{1}{1-\gamma}\left[\left(2+\frac{c_{1}}{1-\alpha}\right)\varepsilon_{\mathsf{eval}}+\left(1+\frac{c_{2}}{1-\alpha}\right)\varepsilon_{\mathsf{opt}}\right].

The above bound essentially says that

QτQτ(k+1)γ[λ1k(QτQτ(0)+2αQττξ^(0))+C]\displaystyle\big{\|}{Q}^{\star}_{\tau}-{Q}^{{(k+1)}}_{\tau}\big{\|}_{\infty}\leq\gamma\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+C\right]

and

Qττξ^(k+1)λ1k(QτQτ(0)+2αQττξ^(0))+C.\displaystyle\big{\|}Q_{\tau}^{\star}-\tau\widehat{\xi}^{(k+1)}\big{\|}_{\infty}\leq\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+C.

Turning to V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s), by an argument similar to (45), we have

Vτ(s)Vτ(k+1)(s)\displaystyle V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s)
=Qτ(s)Qτ(k+1)(s),π(k+1)(s)+[τ(hs(π(k+1)(s))hs(πτ(s)))Qτ(s),π(k+1)(s)πτ(s)]\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}+\Big{[}\tau(h_{s}(\pi^{(k+1)}(s))-h_{s}(\pi_{\tau}^{\star}(s)))-\big{\langle}Q_{\tau}^{\star}(s),\pi^{(k+1)}(s)-\pi_{\tau}^{\star}(s)\big{\rangle}\Big{]}
=Qτ(s)Qτ(k+1)(s),π(k+1)(s)+τDhs(π(k+1),πτ;gτ)\displaystyle=\big{\langle}Q_{\tau}^{\star}(s)-Q_{\tau}^{(k+1)}(s),\pi^{(k+1)}(s)\big{\rangle}+\tau D_{h_{s}}(\pi^{(k+1)},\pi_{\tau}^{\star};g_{\tau}^{\star})
QτQτ(k+1)+τDhs(π(k+1),π~(k+1);ξ^(k+1))\displaystyle\leq\Big{\|}Q_{\tau}^{\star}-Q_{\tau}^{(k+1)}\Big{\|}_{\infty}+\tau D_{h_{s}}(\pi^{(k+1)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)})
+τDhs(π~(k+1),πτ;gτ)+τπ(k+1)(s)π~(k+1)(s),ξ^(k+1)(s)gτ(s),\displaystyle\qquad+\tau D_{h_{s}}(\widetilde{\pi}^{(k+1)},\pi_{\tau}^{\star};g_{\tau}^{\star})+\tau\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),\widehat{\xi}^{(k+1)}(s)-g_{\tau}^{\star}(s)\big{\rangle},

where the third step results from the standard three-point lemma. To control the second term, we rearrange terms in (68) and arrive at

ε𝗈𝗉𝗍\displaystyle\varepsilon_{\mathsf{opt}} Q^τ(k)(s),π(k+1)(s)+τhs(π(k+1)(s))+1ηDhs(π(k+1),π(k);ξ^(k))\displaystyle\geq-\big{\langle}\widehat{Q}_{\tau}^{(k)}(s),\pi^{(k+1)}(s)\big{\rangle}+\tau h_{s}\big{(}\pi^{(k+1)}(s)\big{)}+\frac{1}{\eta}D_{h_{s}}\big{(}\pi^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}
+Q^τ(k)(s),π~(k+1)(s)τhs(π~(k+1)(s))1ηDhs(π~(k+1),π(k);ξ^(k))\displaystyle\qquad+\big{\langle}\widehat{Q}_{\tau}^{(k)}(s),\widetilde{\pi}^{(k+1)}(s)\big{\rangle}-\tau h_{s}\big{(}\widetilde{\pi}^{(k+1)}(s)\big{)}-\frac{1}{\eta}D_{h_{s}}\big{(}\widetilde{\pi}^{(k+1)},\pi^{(k)};\widehat{\xi}^{(k)}\big{)}
=Q^τ(k)(s),π~(k+1)(s)π(k+1)(s)+1+ητη(hs(π(k+1)(s))hs(π~(k+1)(s)))\displaystyle=\big{\langle}\widehat{Q}_{\tau}^{(k)}(s),\widetilde{\pi}^{(k+1)}(s)-\pi^{(k+1)}(s)\big{\rangle}+\frac{1+\eta\tau}{\eta}\big{(}h_{s}(\pi^{(k+1)}(s))-h_{s}(\widetilde{\pi}^{(k+1)}(s))\big{)}
+1ηξ^(k)(s),π~(k+1)(s)π(k+1)(s)\displaystyle\qquad+\frac{1}{\eta}\big{\langle}\widehat{\xi}^{(k)}(s),\widetilde{\pi}^{(k+1)}(s)-\pi^{(k+1)}(s)\big{\rangle}
=1+ητηDhs(π(k+1),π~(k+1);ξ^(k+1)).\displaystyle=\frac{1+\eta\tau}{\eta}D_{h_{s}}(\pi^{(k+1)},\widetilde{\pi}^{(k+1)};\widehat{\xi}^{(k+1)}).

For the remaining terms, recall that \widehat{\xi}^{(k+1)}-c_{s}^{(k+1)}1\in\partial h_{s}(\widetilde{\pi}^{(k+1)}(s)) for some constant c_{s}^{(k+1)}. So we have

τDhs(π~(k+1),πτ;gτ)+τπ(k+1)(s)π~(k+1)(s),ξ^(k+1)(s)gτ(s)\displaystyle\tau D_{h_{s}}(\widetilde{\pi}^{(k+1)},\pi_{\tau}^{\star};g_{\tau}^{\star})+\tau\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),\widehat{\xi}^{(k+1)}(s)-g_{\tau}^{\star}(s)\big{\rangle}
=τhs(π~(k+1)(s))τhs(πτ(s))π~(k+1)(s)πτ(s),Qτ(s)+τπ(k+1)(s)π~(k+1)(s),ξ^(k+1)(s)gτ(s)\displaystyle=\tau h_{s}(\widetilde{\pi}^{(k+1)}(s))-\tau h_{s}(\pi_{\tau}^{\star}(s))-\big{\langle}\widetilde{\pi}^{(k+1)}(s)-\pi_{\tau}^{\star}(s),Q_{\tau}^{\star}(s)\big{\rangle}+\tau\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),\widehat{\xi}^{(k+1)}(s)-g_{\tau}^{\star}(s)\big{\rangle}
πτ(s)π~(k+1)(s),Qτ(s)τξ^(k+1)π(k+1)(s)π~(k+1)(s),Qτ(s)τξ^(k+1)(s)\displaystyle\leq\big{\langle}\pi_{\tau}^{\star}(s)-\widetilde{\pi}^{(k+1)}(s),Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}\big{\rangle}-\big{\langle}\pi^{(k+1)}(s)-\widetilde{\pi}^{(k+1)}(s),Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}(s)\big{\rangle}
=πτ(s)π(k+1)(s),Qτ(s)τξ^(k+1)\displaystyle=\big{\langle}\pi_{\tau}^{\star}(s)-\pi^{(k+1)}(s),Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}\big{\rangle}
2Qτ(s)τξ^(k+1)(s).\displaystyle\leq 2\big{\|}Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}(s)\big{\|}_{\infty}.

Taken together, we conclude that

Vτ(s)Vτ(k+1)(s)\displaystyle V_{\tau}^{\star}(s)-V_{\tau}^{(k+1)}(s) QτQτ(k+1)+2Qτ(s)τξ^(k+1)(s)+ητ1+ητε𝗈𝗉𝗍\displaystyle\leq\Big{\|}Q_{\tau}^{\star}-Q_{\tau}^{(k+1)}\Big{\|}_{\infty}+2\big{\|}Q_{\tau}^{\star}(s)-\tau\widehat{\xi}^{(k+1)}(s)\big{\|}_{\infty}+\frac{\eta\tau}{1+\eta\tau}\varepsilon_{\mathsf{opt}}
(γ+2)[λ1k(QτQτ(0)+2αQττξ^(0))+C]+ητ1+ητε𝗈𝗉𝗍.\displaystyle\leq(\gamma+2)\left[\lambda_{1}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\widehat{\xi}^{(0)}\big{\|}_{\infty}\right)+C\right]+\frac{\eta\tau}{1+\eta\tau}\varepsilon_{\mathsf{opt}}.

Finally, plugging in the choices of c_{1} and c_{2} (cf. (77)), we have C\leq C_{2} when \{h_{s}\} is convex, and C\leq C_{3} when \{h_{s}\} is 1-strongly convex w.r.t. the \ell_{1} norm. In addition, for the latter case, we can follow an argument similar to the one for (46) to demonstrate that

πτ(s)π~τ(k)(s)1\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\widetilde{\pi}_{\tau}^{(k)}(s)\big{\|}_{1} τ1((1α)γ+α)k(QτQτ(0)+2αQττξ(0))+τ1C3,\displaystyle\leq\tau^{-1}\big{(}(1-\alpha)\gamma+\alpha\big{)}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}\right)+\tau^{-1}C_{3},

which taken together with (74) gives

πτ(s)πτ(k)(s)1\displaystyle\big{\|}\pi_{\tau}^{\star}(s)-\pi_{\tau}^{(k)}(s)\big{\|}_{1} πτ(s)π~τ(k)(s)1+πτ(k)(s)π~τ(k)(s)1\displaystyle\leq\big{\|}\pi_{\tau}^{\star}(s)-\widetilde{\pi}_{\tau}^{(k)}(s)\big{\|}_{1}+\big{\|}\pi_{\tau}^{(k)}(s)-\widetilde{\pi}_{\tau}^{(k)}(s)\big{\|}_{1}
τ1((1α)γ+α)k(QτQτ(0)+2αQττξ(0))\displaystyle\leq\tau^{-1}\big{(}(1-\alpha)\gamma+\alpha\big{)}^{k}\left(\big{\|}{Q}^{\star}_{\tau}-{Q}^{(0)}_{\tau}\big{\|}_{\infty}+2\alpha\big{\|}{Q}^{\star}_{\tau}-\tau\xi^{(0)}\big{\|}_{\infty}\right)
+τ1C3+2ηε𝗈𝗉𝗍1+ητ.\displaystyle\quad+\tau^{-1}C_{3}+\sqrt{\frac{2\eta\varepsilon_{\mathsf{opt}}}{1+\eta\tau}}.

This concludes the proof.

Appendix C Adaptive GPMD

In this section, we present adaptive GPMD, an adaptive variant of GPMD that computes optimal policies of the original MDP without the need of specifying the regularization parameter \tau in advance. In a nutshell, the proposed adaptive GPMD algorithm proceeds in stages. In the i-th stage, we execute GPMD (i.e., Algorithm 1) with regularization parameter \tau_{i} for T_{i}+1 iterations. In what follows, we shall denote by \pi_{i}^{(t)} and \xi_{i}^{(t)} the t-th iterates in the i-th stage. At the end of each stage, adaptive GPMD halves the regularization parameter \tau and, in the meantime, doubles \xi_{i}^{(T_{i}+1)} (i.e., the auxiliary vector corresponding to some subgradient up to global shift), using the result as the initial vector \xi_{i+1}^{(0)} for the next stage. To ensure that \xi_{i+1}^{(0)}(s) still lies within the set of subgradients \partial h_{s}\big{(}\pi_{i+1}^{(0)}(s)\big{)} up to global shift, we solve the sub-optimization problem (87) to obtain \pi_{i+1}^{(0)} as the initial policy iterate for the next stage. The whole procedure is summarized in Algorithm 3.

1 Input: learning rate \eta>0.
2 Initialize \tau_{0}=1,\ \xi_{0}^{(0)}(s,a)=0 for all s\in\mathcal{S},a\in\mathcal{A}. Choose \pi_{0}^{(0)} to be the minimizer of the following problems:
\pi_{0}^{(0)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}h_{s}(p),\qquad\forall s\in\mathcal{S}.
3 for stage i=0,1,\cdots, do
4       Call Algorithm 1 with regularization parameter \tau_{i}, learning rate \eta, and initialization \pi_{i}^{(0)} and \xi_{i}^{(0)}. Obtain \pi_{i}^{(T_{i}+1)} and \xi_{i}^{(T_{i}+1)} with T_{i}=\big{\lceil}\frac{1+\eta\tau_{i}}{(1-\gamma)\eta\tau_{i}}\log\frac{8}{1-\gamma}\big{\rceil}, where \lceil\cdot\rceil is the ceiling function.
5      Set \tau_{i+1}=\tau_{i}/2, \xi_{i+1}^{(0)}=2\xi_{i}^{(T_{i}+1)}, and choose \pi_{i+1}^{(0)} to be the minimizer of the following problems:
\displaystyle\pi_{i+1}^{(0)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}-\langle\xi_{i+1}^{(0)}(s),p\rangle+h_{s}(p),\qquad\forall s\in\mathcal{S}. (87)
Algorithm 3 Adaptive GPMD
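
To make the stage-based structure concrete, we include a schematic Python sketch of Algorithm 3 below. It is a sketch rather than our implementation: the callables run_gpmd (one call to Algorithm 1) and prox_h (the per-state minimization in (87)) are hypothetical placeholders to be supplied for the regularizer of interest, and \xi is assumed to be stored as a numpy-style array so that doubling it amounts to scalar multiplication.

    import math

    def adaptive_gpmd(run_gpmd, prox_h, pi0, xi0, eta, gamma, num_stages):
        """Stage-based wrapper mirroring Algorithm 3.

        run_gpmd(tau, eta, pi, xi, T) -> (pi, xi): runs Algorithm 1 for T + 1 iterations.
        prox_h(xi) -> pi: solves (87), i.e. argmin_p -<xi(s), p> + h_s(p) over the simplex,
                          for every state s.
        """
        tau, pi, xi = 1.0, pi0, xi0                  # tau_0 = 1 (line 2 of Algorithm 3)
        for _ in range(num_stages):
            # Stage length T_i as specified in line 4 of Algorithm 3.
            T = math.ceil((1 + eta * tau) / ((1 - gamma) * eta * tau)
                          * math.log(8.0 / (1.0 - gamma)))
            pi, xi = run_gpmd(tau, eta, pi, xi, T)   # obtain pi_i^{(T_i+1)}, xi_i^{(T_i+1)}
            tau, xi = tau / 2.0, 2.0 * xi            # halve tau, double xi (line 5)
            pi = prox_h(xi)                          # re-anchor the policy via (87)
        return pi

Here the inputs pi0 and xi0 correspond to line 2 of Algorithm 3, namely \xi_{0}^{(0)}=0 and \pi_{0}^{(0)}(s)=\arg\min_{p\in\Delta(\mathcal{A})}h_{s}(p).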

To help characterize the discrepancy of the value functions due to regularization, we assume boundedness of the regularizers \{h_{s}\} as follows.

Assumption 4.

Suppose that there exists some quantity B>1 such that |h_{s}(p)|\leq B holds for all p\in\Delta(\mathcal{A}) and all s\in\mathcal{S}.

The following theorem demonstrates that Algorithm 3 is capable of finding an \varepsilon-optimal policy for the unregularized MDP within an order of \log\frac{1}{\varepsilon} stages. To simplify notation, we abbreviate Q_{\tau_{i}}^{\star}, Q_{\tau_{i}}^{\pi_{i}^{(T)}} and V_{\tau_{i}}^{\pi_{i}^{(T)}} as Q_{i}^{\star}, Q^{(T)}_{i} and V^{(T)}_{i}, respectively, as long as it is clear from the context.

Theorem 3.

Suppose that Assumptions 1 and 4 hold. For any learning rate \eta>0 and any stage i\geq 0, the iterates of Algorithm 3 satisfy

\big{\|}Q^{\star}-Q^{\pi_{i}^{(T_{i}+1)}}\big{\|}_{\infty}\leq\frac{3\tau_{i}B}{1-\gamma}=\frac{3B}{(1-\gamma)2^{i}}.

As a direct implication of Theorem 3, it suffices to run Algorithm 3 with S=\mathcal{O}(\log\frac{B}{(1-\gamma)\varepsilon}) stages, resulting in a total iteration complexity of at most

\displaystyle\sum_{i=0}^{S}T_{i}=\mathcal{O}\left(\left(\frac{1}{1-\gamma}\log\frac{B}{(1-\gamma)\varepsilon}+\frac{B}{(1-\gamma)^{2}\eta\varepsilon}\right)\log\frac{1}{1-\gamma}\right). (88)

In comparison, we recall from Theorem 1 that directly running GPMD with regularization parameter \tau={(1-\gamma)\varepsilon}/{B} leads to an iteration complexity of

\displaystyle\mathcal{O}\Big{(}\Big{(}\frac{1}{1-\gamma}+\frac{B}{(1-\gamma)^{2}\eta\varepsilon}\Big{)}\log\frac{B}{(1-\gamma)\varepsilon}\Big{)}. (89)

When focusing on the term \widetilde{\mathcal{O}}(\frac{B}{(1-\gamma)^{2}\eta\varepsilon}), (88) improves upon (89) by a factor of \frac{\log\frac{B}{(1-\gamma)\varepsilon}}{\log\frac{1}{1-\gamma}}.
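
As a rough numerical illustration of this comparison (with arbitrary example values of B, \gamma and \varepsilon, and ignoring absolute constants), the following snippet evaluates the two logarithmic factors:

    import math

    # Arbitrary example values; only the logarithmic factors in (88) and (89) are compared.
    B_bound, gamma, eps = 10.0, 0.99, 1e-4
    log_direct = math.log(B_bound / ((1 - gamma) * eps))     # log factor appearing in (89)
    log_adaptive = math.log(1.0 / (1 - gamma))               # log factor appearing in (88)
    print("improvement factor ~", log_direct / log_adaptive)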

Proof of Theorem 3.

To begin with, we make note of the fact that, for any \tau,\tau^{\prime}>0,

\big{\|}Q^{\star}_{\tau}-Q^{\star}_{\tau^{\prime}}\big{\|}_{\infty}=\Big{\|}\max_{\pi}Q^{\pi}_{\tau}-\max_{\pi}Q^{\pi}_{\tau^{\prime}}\Big{\|}_{\infty}\leq\max_{\pi}\big{\|}Q^{\pi}_{\tau}-Q^{\pi}_{\tau^{\prime}}\big{\|}_{\infty}\leq\frac{|\tau-\tau^{\prime}|B}{1-\gamma}. (90)

It then follows that

\displaystyle\big{\|}Q^{\star}-Q^{\pi_{i}^{(T_{i}+1)}}\big{\|}_{\infty} \displaystyle\leq\big{\|}Q^{\star}-Q_{i}^{\star}\big{\|}_{\infty}+\big{\|}Q^{\pi_{i}^{(T_{i}+1)}}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty}+\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty} (91)
\displaystyle\leq\frac{2\tau_{i}B}{1-\gamma}+\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty}.

Next, we demonstrate how to control \|Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\|_{\infty}. The definition of \pi_{i}^{(0)} implies the existence of some constant c_{s}^{(i,0)} such that

\displaystyle\xi_{i}^{(0)}(s,\cdot)-c_{s}^{(i,0)}1\in\partial h_{s}\big{(}\pi_{i}^{(0)}(s)\big{)}. (92)

By invoking the convergence results of GPMD (cf. (44)), we obtain: for all i\geq 0,

\displaystyle\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\gamma\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\left(\big{\|}Q^{\star}_{i}-{Q}_{i}^{(0)}\big{\|}_{\infty}+2\alpha_{i}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}\right), (93a)
\displaystyle\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\left(\big{\|}Q^{\star}_{i}-{Q}_{i}^{(0)}\big{\|}_{\infty}+2\alpha_{i}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}\right), (93b)

where \alpha_{i}=\frac{1}{1+\eta\tau_{i}}. To proceed, we follow arguments similar to those in (45) and show that

\displaystyle V^{\star}_{i}(s)-V^{(0)}_{i}(s) \displaystyle=\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s\sim d^{\pi}_{s^{\prime}}}\Big{[}\left\langle\pi^{\star}_{\tau_{i}}(s)-\pi_{i}^{(0)}(s),Q^{\star}_{i}(s)\right\rangle+\tau_{i}\big{(}h_{s}(\pi_{i}^{(0)})-h_{s}(\pi^{\star}_{\tau_{i}})\big{)}\Big{]}
\displaystyle\leq\frac{1}{1-\gamma}\mathop{\mathbb{E}}\limits_{s\sim d^{\pi}_{s^{\prime}}}\Big{[}\left\langle\pi^{\star}_{\tau_{i}}(s)-\pi_{i}^{(0)}(s),Q^{\star}_{i}(s)\right\rangle-\tau_{i}\Big{\langle}\pi^{\star}_{\tau_{i}}(s)-\pi_{i}^{(0)}(s),\xi_{i}^{(0)}\Big{\rangle}\Big{]}
\displaystyle\leq\frac{2}{1-\gamma}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty},

where the first step invokes the regularized performance difference lemma (Lemma 5). It then follows that

\displaystyle\big{\|}Q^{\star}_{i}-{Q}_{i}^{(0)}\big{\|}_{\infty}\leq\gamma\big{\|}V^{\star}_{i}-V^{(0)}_{i}\big{\|}_{\infty}\leq\frac{2\gamma}{1-\gamma}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}. (94)

Substitution of (94) into (93) gives

\displaystyle\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\frac{2\gamma}{1-\gamma}\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}, (95a)
\displaystyle\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(T_{i}+1)}\big{\|}_{\infty} \displaystyle\leq\frac{2}{1-\gamma}\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}. (95b)

Next, we aim to prove by induction that \big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}\leq\frac{2\tau_{i}B}{1-\gamma}. Clearly, this claim holds trivially for the base case with i=0. Now, supposing that the claim holds for some i\geq 0, we would like to prove it for i+1 as well. Towards this end, observe that

\displaystyle\big{\|}Q^{\star}_{i+1}-\tau_{i+1}\xi_{i+1}^{(0)}\big{\|}_{\infty} \displaystyle\leq\big{\|}Q^{\star}_{i+1}-Q^{\star}_{i}\big{\|}_{\infty}+\big{\|}Q^{\star}_{i}-\tau_{i+1}\xi_{i+1}^{(0)}\big{\|}_{\infty}
\displaystyle\leq\frac{\tau_{i+1}B}{1-\gamma}+\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(T_{i}+1)}\big{\|}_{\infty}
\displaystyle\leq\frac{\tau_{i+1}B}{1-\gamma}+\frac{2}{1-\gamma}\big{(}(1-\alpha_{i})\gamma+\alpha_{i}\big{)}^{T_{i}}\big{\|}Q^{\star}_{i}-\tau_{i}\xi_{i}^{(0)}\big{\|}_{\infty}
\displaystyle\leq\frac{\tau_{i+1}B}{1-\gamma}\left(1+\frac{8}{1-\gamma}\left(1-\frac{(1-\gamma)\eta\tau_{i}}{1+\eta\tau_{i}}\right)^{T_{i}}\right).

When T_{i}\geq\lceil\frac{1+\eta\tau_{i}}{\eta\tau_{i}(1-\gamma)}\log\frac{8}{1-\gamma}\rceil, we arrive at

\big{\|}Q^{\star}_{i+1}-\tau_{i+1}\xi_{i+1}^{(0)}\big{\|}_{\infty}\leq\frac{2\tau_{i+1}B}{1-\gamma},

which verifies the claim for i+1. Substitution back into (95) leads to

\big{\|}Q^{\star}_{i}-Q_{i}^{(T_{i}+1)}\big{\|}_{\infty}\leq\frac{2\gamma}{1-\gamma}\left(1-\frac{(1-\gamma)\eta\tau_{i}}{1+\eta\tau_{i}}\right)^{T_{i}}\frac{2\tau_{i}B}{1-\gamma}\leq\frac{\tau_{i}B}{1-\gamma}. (96)

Combining (96) with (91) concludes the proof. ∎

References

  • Abbasi-Yadkori et al., (2019) Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. (2019). Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702. PMLR.
  • Agarwal et al., (2019) Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. Technical report.
  • Agarwal et al., (2020) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2020). Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR.
  • Agazzi and Lu, (2020) Agazzi, A. and Lu, J. (2020). Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858.
  • Amodei et al., (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete problems in AI safety.
  • Beck, (2017) Beck, A. (2017). First-order methods in optimization. SIAM.
  • Beck and Teboulle, (2003) Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175.
  • Bertsekas, (2017) Bertsekas, D. P. (2017). Dynamic programming and optimal control (4th edition). Athena Scientific.
  • Bhandari and Russo, (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
  • Bhandari and Russo, (2020) Bhandari, J. and Russo, D. (2020). A note on the linear convergence of policy gradient methods. arXiv preprint arXiv:2007.11120.
  • (11) Cen, S., Chen, F., and Chi, Y. (2022a). Independent natural policy gradient methods for potential games: Finite-time global convergence with entropy regularization. arXiv preprint arXiv:2204.05466.
  • (12) Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. (2022b). Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578.
  • (13) Cen, S., Chi, Y., Du, S. S., and Xiao, L. (2022c). Faster last-iterate convergence of policy optimization in zero-sum markov games. arXiv preprint arXiv:2210.01050.
  • Cen et al., (2021) Cen, S., Wei, Y., and Chi, Y. (2021). Fast policy extragradient methods for competitive games with entropy regularization. In Advances in Neural Information Processing Systems, volume 34, pages 27952–27964.
  • (15) Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. (2018a). A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8103–8112.
  • (16) Chow, Y., Nachum, O., and Ghavamzadeh, M. (2018b). Path consistency learning in Tsallis entropy regularized MDPs. In International Conference on Machine Learning, pages 979–988. PMLR.
  • Dai et al., (2018) Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. (2018). SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1125–1134. PMLR.
  • Ding et al., (2021) Ding, D., Wei, X., Yang, Z., Wang, Z., and Jovanovic, M. (2021). Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304–3312. PMLR.
  • Duchi et al., (2010) Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. (2010). Composite objective mirror descent. In COLT, pages 14–26. Citeseer.
  • Efroni et al., (2020) Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.
  • Fazel et al., (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476.
  • Geist et al., (2019) Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169.
  • Haarnoja et al., (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR.
  • Hao et al., (2021) Hao, B., Lazic, N., Abbasi-Yadkori, Y., Joulani, P., and Szepesvári, C. (2021). Adaptive approximate policy iteration. In International Conference on Artificial Intelligence and Statistics, pages 523–531. PMLR.
  • Kakade, (2002) Kakade, S. M. (2002). A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems, pages 1531–1538.
  • Khodadadian et al., (2021) Khodadadian, S., Jhunjhunwala, P. R., Varma, S. M., and Maguluri, S. T. (2021). On the linear convergence of natural policy gradient algorithm. arXiv preprint arXiv:2105.01424.
  • Kiwiel, (1997) Kiwiel, K. C. (1997). Proximal minimization methods with generalized Bregman functions. SIAM journal on control and optimization, 35(4):1142–1168.
  • Lan, (2022) Lan, G. (2022). Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming.
  • Lan et al., (2011) Lan, G., Lu, Z., and Monteiro, R. D. (2011). Primal-dual first-order methods with {O(1/\varepsilon)} iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29.
  • Lan and Zhou, (2018) Lan, G. and Zhou, Y. (2018). An optimal randomized incremental gradient method. Mathematical programming, 171(1):167–215.
  • Lazic et al., (2021) Lazic, N., Yin, D., Abbasi-Yadkori, Y., and Szepesvari, C. (2021). Improved regret bound and experience replay in regularized policy iteration. In International Conference on Machine Learning, pages 6032–6042. PMLR.
  • Lee et al., (2018) Lee, K., Choi, S., and Oh, S. (2018). Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473.
  • Lee et al., (2019) Lee, K., Kim, S., Lim, S., Choi, S., and Oh, S. (2019). Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning. arXiv preprint arXiv:1902.00137.
  • Li et al., (2023) Li, G., Wei, Y., Chi, Y., and Chen, Y. (2023). Softmax policy gradient methods can take exponential time to converge. Mathematical Programming. To appear.
  • Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10565–10576.
  • Liu et al., (2020) Liu, Y., Zhang, K., Basar, T., and Yin, W. (2020). An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33.
  • Mei et al., (2021) Mei, J., Gao, Y., Dai, B., Szepesvari, C., and Schuurmans, D. (2021). Leveraging non-uniformity in first-order non-convex optimization. In International Conference on Machine Learning, pages 7555–7564. PMLR.
  • (38) Mei, J., Xiao, C., Dai, B., Li, L., Szepesvári, C., and Schuurmans, D. (2020a). Escaping the gravitational pull of softmax. Advances in Neural Information Processing Systems, 33.
  • (39) Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020b). On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR.
  • Mnih et al., (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • Moldovan and Abbeel, (2012) Moldovan, T. M. and Abbeel, P. (2012). Safe exploration in Markov decision processes.
  • Nemirovsky and Yudin, (1983) Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.
  • Neu et al., (2017) Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
  • Puterman, (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shani et al., (2019) Shani, L., Efroni, Y., and Mannor, S. (2019). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. arXiv preprint arXiv:1909.02769.
  • Sutton et al., (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
  • Tomar et al., (2020) Tomar, M., Shani, L., Efroni, Y., and Ghavamzadeh, M. (2020). Mirror descent policy optimization. arXiv preprint arXiv:2005.09814.
  • Tsallis, (1988) Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of statistical physics, 52(1):479–487.
  • Vieillard et al., (2020) Vieillard, N., Kozuno, T., Scherrer, B., Pietquin, O., Munos, R., and Geist, M. (2020). Leverage the average: an analysis of KL regularization in reinforcement learning. In NeurIPS-34th Conference on Neural Information Processing Systems.
  • Wang et al., (2019) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations.
  • Wang et al., (2021) Wang, W., Han, J., Yang, Z., and Wang, Z. (2021). Global convergence of policy gradient for linear-quadratic mean-field control/game in continuous time. In International Conference on Machine Learning, pages 10772–10782. PMLR.
  • Williams, (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
  • Williams and Peng, (1991) Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.
  • Xiao, (2022) Xiao, L. (2022). On the convergence rates of policy gradient methods. arXiv preprint arXiv:2201.07443.
  • Xu et al., (2019) Xu, P., Gao, F., and Gu, Q. (2019). Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations.
  • Xu et al., (2020) Xu, T., Liang, Y., and Lan, G. (2020). A primal approach to constrained policy optimization: Global optimality and finite-time analysis. arXiv preprint arXiv:2011.05869.
  • Yu et al., (2019) Yu, M., Yang, Z., Kolar, M., and Wang, Z. (2019). Convergent policy optimization for safe reinforcement learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 3127–3139.
  • (60) Zhang, J., Kim, J., O’Donoghue, B., and Boyd, S. (2020a). Sample efficient reinforcement learning with REINFORCE. arXiv preprint arXiv:2010.11364.
  • (61) Zhang, J., Koppel, A., Bedi, A. S., Szepesvári, C., and Wang, M. (2020b). Variational policy gradient method for reinforcement learning with general utilities. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 4572–4583.
  • (62) Zhang, J., Ni, C., Szepesvari, C., Wang, M., et al. (2021a). On the convergence and sample efficiency of variance-reduced policy gradient method. In Advances in Neural Information Processing Systems, volume 34, pages 2228–2240.
  • (63) Zhang, K., Hu, B., and Basar, T. (2021b). Policy optimization for \mathcal{H}_{2} linear control with \mathcal{H}_{\infty} robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081–4109.
  • Zhao et al., (2022) Zhao, Y., Tian, Y., Lee, J., and Du, S. (2022). Provably efficient policy optimization for two-player zero-sum markov games. In International Conference on Artificial Intelligence and Statistics, pages 2736–2761. PMLR.