
Finite-time analysis of entropy-regularized neural natural actor-critic algorithm

Semih Çaycı (CSL, University of Illinois at Urbana-Champaign, scayci@illinois.edu), Niao He (Department of Computer Science, ETH Zurich, niao.he@inf.ethz.ch), R. Srikant (ECE and CSL, University of Illinois at Urbana-Champaign, rsrikant@illinois.edu)
Abstract

Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large (potentially continuous) state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) in achieving provably good performance in terms of sample complexity, iteration complexity and overparameterization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies, and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDP, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of the uniform approximation power of the actor neural network for achieving global optimality under the distributional shift inherent in policy optimization.

1 Introduction

In reinforcement learning (RL), an agent aims to find an optimal policy that maximizes the expected total reward in a Markov decision process (MDP) by interacting with an unknown and dynamical environment [1, 2, 3]. Policy gradient methods [4, 5, 6], which employ first-order optimization methods to find the best policy within a parametric policy class, have demonstrated impressive success in numerous complicated RL problems. The success largely benefits from the versatility of policy gradient methods in accommodating a rich class of function approximation schemes [7, 8, 9, 10].

Natural policy gradient (NPG), natural actor-critic (NAC) and their variants, which use the Fisher information matrix as a pre-conditioner for the gradient updates [11, 12, 13, 14], are particularly popular because of their impressive empirical performance in practical applications. In practice, NPG/NAC methods are further combined with (a) neural network approximation for high representation power of both the actor and the critic, and (b) entropy regularization for stability and sufficient exploration, leading to remarkable performance in complicated control tasks that involve large state-action spaces [15, 9, 16].

Despite the empirical successes, a strong theoretical understanding of policy gradient methods, especially when boosted with function approximation and entropy regularization, appears to be in a nascent stage. Recently, there has been a plethora of theoretical attempts to understand the convergence properties of policy gradient methods and the role of entropy regularization [17, 18, 19, 20, 21]. These works predominantly study the tabular setting, where a parallelism between the well-known policy iteration and policy gradient methods can be exploited to establish the convergence results. But for the more intriguing function approximation regime, especially with neural network approximation, little theory is known. Two of the main challenges come from the highly nonconvex nature of the problem when using neural network approximation for both the actor and the critic, and the complex exploration dynamics.

In this paper, we provide the first non-asymptotic analysis of an entropy-regularized natural actor-critic (NAC) method in which we use two separate two-layer neural networks for the actor and the critic, and employ a learning scheme based on approximate natural policy gradient updates to achieve optimality. We show that the expressive power of these neural networks provides the ability to achieve optimality within a broad class of policies.

1.1 Main Contributions

We elaborate on some of our contributions below.

  • Sharp sample complexity, convergence rate and overparameterization bounds: We prove sharp convergence guarantees in terms of sample complexity, iteration complexity and network width. In particular, we prove that the NAC method with an adaptive step-size achieves a sharp $\tilde{O}(1/\epsilon)$ iteration complexity and $\tilde{O}(1/\epsilon^{5})$ sample complexity to achieve an $\epsilon$-gap with the optimal policy of the regularized MDP, under the mildest distribution mismatch conditions to the best of our knowledge. The required network widths for the actor and the critic are $\tilde{O}(1/\epsilon^{4})$ and $\tilde{O}(1/\epsilon^{2})$, respectively. Under the standard distribution mismatch assumption as in [22], our sample complexity bound for the unregularized MDP is $\tilde{O}(1/\epsilon^{6})$, which improves the existing bounds significantly.

  • Stable policy optimization: Existing works on policy gradient methods with neural network approximation assume that the policies perform sufficient exploration to avoid instability, i.e., convergence to near-deterministic and strictly suboptimal stationary policies. In this paper, we prove that policy optimization is stabilized by incorporating entropy regularization, gradient clipping and averaging. In particular, we show that the combination of these methods leads to a “persistence of excitation” condition, which ensures sufficient exploration to avoid near-deterministic and strictly suboptimal stationary policies. Consequently, we prove convergence to the globally optimal policy under the mildest concentrability coefficient assumption for on-policy NAC, to the best of our knowledge.

  • Understanding the dynamics of neural network approximation in policy optimization: Our analysis reveals that the uniform approximation power of the actor network in approximating Q-functions throughout the policy optimization steps is crucial to ensure global (near-)optimality; this need arises because reinforcement learning, unlike a static supervised learning problem, induces a distributional shift over time. To that end, we establish high-probability bounds for a two-layer feedforward actor neural network to uniformly approximate the Q-functions of the policy iterates during training.

1.2 Related Work

Policy gradient and actor-critic: Policy gradient methods use a gradient-based scheme to find the optimal policy [4, 5]. [12] proposed the natural gradient method, which uses the Fisher information matrix as a pre-conditioner to better fit the problem geometry. The actor-critic method, which learns approximations to both state-action value functions and policies for variance reduction, was introduced in [6].

Neural actor-critic methods: Recently, there has been a surge of interest in direct policy optimization methods for solving MDPs with large state spaces by exploiting the representation power of deep neural networks. In particular, deterministic policy gradient [23], trust region policy optimization (TRPO) [24], proximal policy optimization (PPO) [25] and soft actor-critic (SAC) [15, 26] have achieved impressive empirical success in solving complicated control tasks.

Role of regularization: Entropy regularization is an essential part of policy optimization algorithms (e.g., TRPO, PPO and SAC) to encourage exploration and achieve fast and stable convergence. It has been numerically observed that entropy regularization leads to a smoother optimization landscape, which in turn improves the convergence properties of policy optimization [16]. For tabular reinforcement learning, the impact of entropy regularization was studied in [17, 20, 21]. On the other hand, the function approximation regime leads to considerably different dynamics compared to the tabular setting, mainly because of the generalization over a large state space, complex exploration dynamics and distributional shift. As such, the role of regularization is very different in the function approximation regime, which we study in this paper.

Theoretical analysis of policy optimization methods: Despite the vast literature on the practical performance of PG/AC/NAC type algorithms, their theoretical understanding has remained elusive until recently. In the tabular setting, global convergence rates for PG methods were established in [17, 18, 27]. By incorporating entropy regularization, it was shown in [28, 19, 20, 29, 30] that the convergence rate can be improved significantly in the tabular setting. Finite-time performances of off-policy actor-critic methods in the tabular and linear function approximation regimes were investigated in [31, 32]. In our paper, we consider neural network approximation under entropy regularization with on-policy sampling.

On the other hand, when the controller employs a function approximator for the purpose of generalization to a large state-action space, the convergence properties of policy optimization methods change radically due to the more complicated optimization landscape and the distribution mismatch phenomenon in reinforcement learning [17]. Under strong assumptions on the exploratory behavior of policies throughout the learning iterations, global optimality of NPG with linear function approximation, up to a function approximation error, was established in [17]. For actor-critic and natural actor-critic methods with linear function approximation, finite-time analyses are given in [33, 34, 35]. For general actor schemes with a linear critic, convergence to stationary points was investigated in [36, 37, 38].

By incorporating entropy regularization, it was shown in [39] that, with linear function approximation, improved convergence rates can be established under much weaker conditions on the underlying controlled Markov chain. Our paper uses results from the drift analysis in [39], but addresses the complications due to the nonlinearity introduced by the ReLU activation functions, and establishes global convergence to the optimal policies. The neural network approximation eliminates the function approximation error, which is a constant in the linear function approximation setting, by employing a sufficiently wide actor neural network.

Neural network analysis: The empirical success of neural networks, which have more parameters than the number of training samples, has been theoretically explained in [40, 41, 42], where it was shown that overparameterized neural networks trained by using first-order optimization methods achieve good generalization properties. The need for massive overparameterization was addressed in [43, 44], and it was shown that considerably smaller network widths can suffice to achieve good training and generalization results in structured supervised learning problems. Our analysis in this work is mainly inspired by [43]. On the other hand, the reinforcement learning problem has significantly different and more challenging dynamics than the supervised learning setting, since actor-critic solves a dynamic optimization problem in which distributional shift occurs as the policies are updated. As such, the uniform approximation power of the actor network in approximating various functions throughout the policy optimization steps becomes critical, unlike in the supervised learning setting of [45]. Our analysis utilizes tools from [45, 43]: (i) we consider the max-norm geometry to achieve mild overparameterization, and (ii) we bound the distance between the neural tangent kernel (NTK) function class and the class of functions realizable by a finite-width neural network by extending the ReLU analysis in [43].

The most relevant work in the literature is [22], where the convergence of NAC with a two-layer neural network was studied without entropy regularization. It was shown that, under strong assumptions on the exploratory behavior of the policies throughout the trajectory, neural NPG achieves $\epsilon$-optimality with $O(1/\epsilon^{14})$ sample complexity and $O(1/\epsilon^{12})$ network width bounds. In this paper, we incorporate widely-used algorithmic techniques (entropy regularization, averaging and gradient clipping) into NAC with neural network approximation, and prove significantly improved sample complexity and overparameterization bounds under weaker assumptions on the concentrability coefficients. Additionally, our analysis reveals that the uniform approximation power of the actor neural network is critically important for establishing global optimality, where distributional shift plays a crucial role (see Section 5.4). In another relevant work, [46] considers a single-timescale actor-critic with neural network approximation, but the function approximation error was not investigated due to the realizability assumption, which requires that all policies throughout the policy optimization steps are realizable by the neural network. One of the main goals of our work is to study the benefits of employing neural networks in policy optimization, and we explicitly characterize the function class and the approximation error that stems from the use of finite-width neural networks.

1.3 Notation

For a sequence of numbers $\{x_{i}:i\in I\}$ where $I$ is an index set, $[x_{i}]_{i\in I}$ denotes the vector obtained by concatenating $x_{i},i\in I$. For a set $A$, $|A|$ denotes its cardinality. For two distributions $P,Q$ defined over the same probability space, the Kullback-Leibler divergence is $D_{KL}(P\|Q)=\mathbb{E}_{s\sim P}\big[\log\frac{P(s)}{Q(s)}\big]$. For a convex set $C\subset\mathbb{R}^{d}$ and $x\in\mathbb{R}^{d}$, $\mathcal{P}_{C}(x)=\arg\min_{y\in C}\|x-y\|_{2}$ denotes the projection of $x$ onto $C$. For $n\in\mathbb{Z}^{+}$, $[n]=\{1,2,\ldots,n\}$. For $d,m\in\mathbb{N}$, $R>0$ and $v\in\mathbb{R}^{m\times d}$, we denote

\mathcal{B}_{m,R}^{d}(v)=\Big\{y\in\mathbb{R}^{m\times d}:\sup_{i\in[m]}\|v_{i}-y_{i}\|_{2}\leq\frac{R}{\sqrt{m}}\Big\},

where $v_{i}$ denotes the $i^{\rm th}$ row of $v$.
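The projection onto $\mathcal{B}_{m,R}^{d}(v)$ used by both the actor and the critic acts row by row. Below is a minimal numpy sketch of this row-wise projection; the function name is ours and the snippet is purely illustrative.

```python
import numpy as np

def project_rowwise_ball(y, v, R):
    """Project y onto B_{m,R}^d(v): clip each row y_i to the ball of
    radius R / sqrt(m) centered at the corresponding row v_i."""
    m = y.shape[0]
    radius = R / np.sqrt(m)
    diff = y - v
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return v + diff * scale

# Example: m = 4 hidden units, d = 3 features
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
y = v + rng.normal(size=(4, 3))
y_proj = project_rowwise_ball(y, v, R=1.0)
assert np.all(np.linalg.norm(y_proj - v, axis=1) <= 1.0 / np.sqrt(4) + 1e-9)
```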

2 Background and Problem Setting

In this section, we introduce the problem setting, the natural actor-critic method, as well as the entropy regularization and neural network approximation schemes that we consider.

2.1 Markov Decision Processes

We consider a discounted Markov decision process $(\mathcal{S},\mathcal{A},P,r,\gamma)$ where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P$ is an (unknown) transition kernel, $r:\mathcal{S}\times\mathcal{A}\rightarrow[0,r_{max}]$ with $0<r_{max}<\infty$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. In this work, we consider a state space $\mathcal{S}$ and a finite action space $\mathcal{A}$ such that $\mathcal{S}\times\mathcal{A}\subset\mathbb{R}^{d}$. Also, we assume that, by appropriate representation of the state and action variables, the following bound holds:

sup(s,a)𝒮×𝒜(s,a)21,\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\|(s,a)\|_{2}\leq 1, (1)

throughout the paper.

Value function: A randomized policy $\pi$ assigns to each state $s\in\mathcal{S}$ a probability $\pi(a|s)$ of taking each action $a\in\mathcal{A}$. A policy $\pi$ induces a trajectory by specifying $a_{t}\sim\pi(\cdot|s_{t})$ and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$. For any $s_{0}\in\mathcal{S}$, the corresponding value function of a policy $\pi$ is as follows:

Vπ(s0)=𝔼[t=0γtr(st,at)|s0],V^{\pi}(s_{0})=\mathbb{E}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}\Big{]}, (2)

where atπ(|st)a_{t}\sim\pi(\cdot|s_{t}) and st+1P(|st,at)s_{t+1}\sim P(\cdot|s_{t},a_{t}).

Entropy regularization: In policy optimization, in order to encourage exploration and avoid near-deterministic suboptimal policies, entropy regularization is commonly used in practice [8, 15, 9, 16]. For a policy π\pi, let

Hπ(s0)=𝔼[t=0γt(π(|st))|s0],H^{\pi}(s_{0})=\mathbb{E}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}\mathcal{H}\big{(}\pi(\cdot|s_{t})\big{)}\Big{|}s_{0}\Big{]}, (3)

where (π(|s))=a𝒜π(a|s)log(π(a|s))\mathcal{H}(\pi(\cdot|s))=-\sum_{a\in\mathcal{A}}\pi(a|s)\log\big{(}\pi(a|s)\big{)} is the entropy functional. Then, for λ>0\lambda>0, the entropy-regularized value function is defined as follows:

Vλπ(s0)=Vπ(s0)+λHπ(s0).V_{\lambda}^{\pi}(s_{0})=V^{\pi}(s_{0})+\lambda H^{\pi}(s_{0}). (4)

Note that π𝟎(|)=1/|𝒜|\pi_{\mathbf{0}}(\cdot|\cdot)=1/|\mathcal{A}| maximizes the regularizer Hπ(s0)H^{\pi}(s_{0}) for any s0𝒮s_{0}\in\mathcal{S}. Hence, the additional λHπ(s0)\lambda H^{\pi}(s_{0}) term in (4) encourages exploration while introducing some bias controlled by λ>0\lambda>0.

Entropy-regularized objective: For a given initial state distribution μ\mu and for a given regularization parameter λ>0\lambda>0, the objective in this paper is to maximize the entropy-regularized value function:

maxπVλπ(μ):=𝔼s0μ[Vλπ(s0)].\max_{\pi}V_{\lambda}^{\pi}(\mu):=\mathbb{E}_{s_{0}\sim\mu}\left[V_{\lambda}^{\pi}(s_{0})\right]. (5)

We denote the optimal policy for the regularized MDP as π\pi^{*} throughout the paper.

Q-function and advantage function: The entropy-regularized (or soft) Q-function qλπ(s,a)q_{\lambda}^{\pi}(s,a) is defined as:

qλπ(s,a)=𝔼[k=0γk(r(sk,ak)λlogπ(ak|sk))|s0=s,a0=a].\displaystyle\begin{aligned} q_{\lambda}^{\pi}(s,a)&=\mathbb{E}\Big{[}\sum_{k=0}^{\infty}\gamma^{k}\big{(}r(s_{k},a_{k})-\lambda\log\pi(a_{k}|s_{k})\big{)}\Big{|}s_{0}=s,a_{0}=a\Big{]}.\end{aligned} (6)

Note that qλπq_{\lambda}^{\pi} is the fixed point of the Bellman equation q(s,a)=𝒯πq(s,a)q(s,a)=\mathcal{T}^{\pi}q(s,a) where the Bellman operator 𝒯π\mathcal{T}^{\pi} is defined as:

𝒯πq(s,a)=r(s,a)λlogπ(a|s)+γ𝔼sP(|s,a),aπ(|s)[q(s,a)].\mathcal{T}^{\pi}q(s,a)=r(s,a)-\lambda\log\pi(a|s)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a),a^{\prime}\sim\pi(\cdot|s^{\prime})}[q(s^{\prime},a^{\prime})].

As we will see, for NAC algorithms, the following function, called the soft Q-function under a policy π,\pi, turns out to be a useful quantity [20]:

Qλπ(s,a)=r(s,a)+γ𝔼sP(|s,a)[Vλπ(s)].Q_{\lambda}^{\pi}(s,a)=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[V_{\lambda}^{\pi}(s^{\prime})\right]. (7)

Note that the two Q-functions are related as follows:

qλπ(s,a)=Qλπ(s,a)λlogπ(a|s).q_{\lambda}^{\pi}(s,a)=Q_{\lambda}^{\pi}(s,a)-\lambda\log\pi(a|s).

The advantage function under a policy π\pi is defined as follows:

Aλπ(s,a)=qλπ(s,a)Vλπ(s).\displaystyle\begin{aligned} A_{\lambda}^{\pi}(s,a)&=q_{\lambda}^{\pi}(s,a)-V_{\lambda}^{\pi}(s).\end{aligned} (8)

Similarly, the soft advantage function is defined as follows:

Ξλπ(s,a)=Qλπ(s,a)a𝒜π(a|s)Qλπ(s,a).\Xi_{\lambda}^{\pi}(s,a)=Q_{\lambda}^{\pi}(s,a)-\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s)Q_{\lambda}^{\pi}(s,a^{\prime}). (9)

Lastly, we can bound the entropy-regularized value function as follows:

0Vλπ(μ)rmax+λlog|𝒜|1γ,0\leq V_{\lambda}^{\pi}(\mu)\leq\frac{r_{max}+\lambda\log|\mathcal{A}|}{1-\gamma}, (10)

for any λ>0,π\lambda>0,\pi since r[0,rmax]r\in[0,r_{max}] and (p)log|𝒜|\mathcal{H}(p)\leq\log|\mathcal{A}| for any distribution pp over 𝒜\mathcal{A} [20].
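For intuition, all of the regularized quantities above can be computed exactly on a small finite MDP by iterating the soft Bellman operator $\mathcal{T}^{\pi}$ from this section. The following sketch is our own toy construction (not from the paper) and computes $q_{\lambda}^{\pi}$, $Q_{\lambda}^{\pi}$ and $\Xi_{\lambda}^{\pi}$ for a fixed policy:

```python
import numpy as np

def soft_q_functions(P, r, pi, gamma, lam, iters=2000):
    """P: (S,A,S) transition kernel, r: (S,A) rewards, pi: (S,A) policy.
    Iterates the soft Bellman operator T^pi from Section 2.1 until (near-)convergence."""
    S, A, _ = P.shape
    q = np.zeros((S, A))
    log_pi = np.log(pi)
    for _ in range(iters):
        v = np.sum(pi * q, axis=1)                   # V_lambda^pi(s') = sum_a' pi(a'|s') q(s',a')
        q = r - lam * log_pi + gamma * P @ v         # q <- T^pi q
    Q = q + lam * log_pi                             # Q_lambda^pi = q_lambda^pi + lam * log pi
    Xi = Q - np.sum(pi * Q, axis=1, keepdims=True)   # soft advantage, eq. (9)
    return q, Q, Xi

# Tiny 2-state, 2-action example with a uniform policy
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])
pi = np.full((2, 2), 0.5)
q, Q, Xi = soft_q_functions(P, r, pi, gamma=0.9, lam=0.1)
```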

2.2 Natural Policy Gradient under Entropy Regularization

For a given randomized policy πθ\pi_{\theta} parameterized by θΘ\theta\in\Theta where Θ\Theta is a given parameter space, policy gradient methods maximize Vπθ(μ)V^{{\pi_{\theta}}}(\mu) by using the policy gradient θVπθ(μ)\nabla_{\theta}V^{\pi_{\theta}}(\mu). Natural policy gradient, as a quasi-Newton method, adjusts the gradient update to fit problem geometry by using the Fisher information matrix as a pre-conditioner [12, 20].

Let

Gπθ(μ)=𝔼sdμπθ,aπθ(|s)[θlogπθ(a|s)θlogπθ(a|s)],G^{\pi_{\theta}}(\mu)=\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}\nabla_{\theta}\log\pi_{\theta}(a|s)\nabla_{\theta}^{\top}\log\pi_{\theta}(a|s)\Big{]},

be the Fisher information matrix under policy πθ\pi_{\theta}, where

dμπ()=(1γ)k=0γk(sk|s0μ),d_{\mu}^{\pi}(\cdot)=(1-\gamma)\sum_{k=0}^{\infty}\gamma^{k}\mathbb{P}(s_{k}\in\cdot|s_{0}\sim\mu),

is the discounted state visitation distribution under a policy π\pi. Then, the update rule under NPG can be expressed as

θθ+η[Gπθ(μ)]1Vλπθ(μ),\theta\leftarrow\theta+\eta\cdot\big{[}G^{{\pi_{\theta}}}(\mu)\big{]}^{-1}\nabla V_{\lambda}^{\pi_{\theta}}(\mu), (11)

where η>0\eta>0 is the step-size. Equivalently, the NPG update can be written as follows:

θ+argmaxθd{θVλ(πθ)(θθ)12η(θθ)Gπθ(μ)(θθ)}.\theta^{+}\in\arg\max_{\theta\in\mathbb{R}^{d}}\Big{\{}\nabla^{\top}_{\theta}V_{\lambda}(\pi_{\theta^{-}})(\theta-\theta^{-})-\frac{1}{2\eta}(\theta-\theta^{-})^{\top}G^{\pi_{\theta^{-}}}(\mu)(\theta-\theta^{-})\Big{\}}. (12)

The above update scheme is closely related to gradient ascent and policy mirror ascent. Note that the gradient ascent for policy optimization performs the following update:

θ+argmaxθd{θVλ(πθ)(θθ)12ηθθ22}.\theta^{+}\in\arg\max_{\theta\in\mathbb{R}^{d}}\Big{\{}\nabla^{\top}_{\theta}V_{\lambda}(\pi_{\theta^{-}})(\theta-\theta^{-})-\frac{1}{2\eta}\|\theta-\theta^{-}\|_{2}^{2}\Big{\}}. (13)

The update in (13) leads to the policy gradient algorithm [4]. Compared to (13), the natural policy gradient uses a generalized Mahalanobis distance (i.e., weighted-2\ell_{2} distance) as the Bregman divergence instead of 2\ell_{2} distance, which yields significant improvements in policy optimization by avoiding the so-called vanishing gradient problem in the tabular case [20, 19, 17]. Consequently, state-of-the-art reinforcement learning algorithms such as trust region policy optimization (TRPO) [24], proximal policy optimization (PPO) [25] and soft actor-critic [15] are variants of the natural policy gradient method.

In the following, we provide the necessary tools to compute the policy gradient and the update rule in (11), based on [39].

Proposition 1 (Policy gradient).

For any θ\theta and λ>0\lambda>0, we have:

θVλπθ(μ)=11γ𝔼sdμπθ,aπθ(|s)[θlogπθ(a|s)qλπθ(s,a)].\nabla_{\theta}{V}_{\lambda}^{\pi_{\theta}}(\mu)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}\nabla_{\theta}\log\pi_{\theta}(a|s)q_{\lambda}^{\pi_{\theta}}(s,a)\Big{]}. (14)

Based on Proposition 1, the natural policy gradient update can be computed via the following lemma, which is an extension of results in [12, 17, 39].

Lemma 1.

Let

L(w,θ)=𝔼sdμπθ,aπθ(|s)[(θlogπθ(a|s)wqλπθ(s,a))2],L(w,\theta)=\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}\Big{(}\nabla_{\theta}^{\top}\log\pi_{\theta}(a|s)w-q_{\lambda}^{\pi_{\theta}}(s,a)\Big{)}^{2}\Big{]}, (15)

be the error for a given policy parameter θ\theta. Define

wλπθargminwL(w,θ).w_{\lambda}^{\pi_{\theta}}\in\arg\min_{w}L(w,\theta). (16)

Then, we have:

Gπθ(μ)wλπθ=(1γ)θVλπθ(μ),G^{\pi_{\theta}}(\mu)w_{\lambda}^{\pi_{\theta}}=(1-\gamma)\nabla_{\theta}V_{\lambda}^{\pi_{\theta}}(\mu), (17)

where GπθG^{\pi_{\theta}} is the Fisher information matrix.

The above results, which hold for general policy parameterization, will provide the basis for the entropy-regularized natural actor-critic (NAC) algorithm with neural network approximation that we introduce in the following section, with certain modifications for variance reduction and stability that we will describe; see Remark 2 later.
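Lemma 1 says that the natural gradient direction can be obtained by regressing the (soft) Q-function onto the score features $\nabla_{\theta}\log\pi_{\theta}$. A minimal sample-based sketch of this compatible function approximation step follows; the features and targets are toy placeholders and the function name is ours.

```python
import numpy as np

def npg_direction(score_feats, q_targets, ridge=1e-8):
    """Solve the compatible function approximation problem (15)-(16) from samples:
    w in argmin_w sum_n (phi_n^T w - q_n)^2, where phi_n = grad_theta log pi_theta(a_n|s_n).
    By Lemma 1, G w = (1 - gamma) grad V, so w is the NPG direction up to scaling."""
    Phi = np.asarray(score_feats)            # shape (n_samples, dim_theta)
    q = np.asarray(q_targets)                # shape (n_samples,)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ q)

# Toy usage with random stand-ins for the score features and q-targets
rng = np.random.default_rng(1)
Phi = rng.normal(size=(256, 10))
q = Phi @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
w = npg_direction(Phi, q)
# NPG step (11): theta <- theta + eta * w / (1 - gamma), up to the scaling in (17)
```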

3 Natural Actor-Critic with Neural Network Approximation

In this section, we will introduce the entropy-regularized natural actor critic algorithm, where both the actor and critic are represented by single-hidden-layer neural networks.

Throughout this paper, we make the following assumption, which is standard in policy optimization [17].

Assumption 1 (Sampling oracle).

For a given initial state distribution μ\mu and policy π\pi, we assume that the controller is able to obtain an independent sample from dμπd_{\mu}^{\pi} at any time.

The sampling process involves a resetting mechanism and a simulator, which are available in many important application scenarios, and sampling from a state visitation distribution dμπd_{\mu}^{\pi} can be performed as described in [47].

3.1 Actor Network and Natural Policy Gradient

For a network width m+m\in\mathbb{Z}^{+} and cic_{i}\in\mathbb{R}, θid\theta_{i}\in\mathbb{R}^{d} for i[m]i\in[m], the actor network is given by the single-hidden-layer neural network:

f(s,a;(c,θ))=1mi=1mciσ(θi,(s,a)),f(s,a;(c,\theta))=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}\sigma\big{(}\langle\theta_{i},(s,a)\rangle\big{)}, (18)

where $c=[c_{i}]_{i\in[m]}$, $\theta=[\theta_{i}]_{i\in[m]}$, and $\sigma(x)=\max\{0,x\}$ is the ReLU activation function. As is common practice [43, 44, 42], we fix the output layer $c$ after a random initialization, and only train the hidden-layer weights $\theta\in\Theta\subset\mathbb{R}^{m\times d}$. Given a (possibly random) parameter $\theta^{0}\in\mathbb{R}^{m\times d}$, a design parameter $R>0$, regularization parameter $\lambda>0$ and network width $m\in\mathbb{Z}^{+}$, the parameter space that we consider is as follows:

Θ={θm×d:maxi[m]θiθi02Rλm}.\Theta=\Big{\{}\theta\in\mathbb{R}^{m\times d}:\max_{i\in[m]}\|\theta_{i}-\theta^{0}_{i}\|_{2}\leq\frac{R}{\lambda\sqrt{m}}\Big{\}}. (19)

For this parameter space Θ\Theta, the policy class that we consider is Π={πθ:θΘ}\Pi=\{\pi_{\theta}:\theta\in\Theta\}, where the policy that corresponds to θΘ\theta\in\Theta is as follows:

\pi_{\theta}(a|s)=\frac{\exp(f(s,a;(c,\theta)))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(f(s,a^{\prime};(c,\theta)))}. (20)

We randomly initialize the actor neural network by using the symmetric initialization in Algorithm 1, (c,θ(0))𝚜𝚢𝚖_𝚒𝚗𝚒𝚝(m,d)(c,\theta(0))\sim{\tt sym\_init}(m,d), which was introduced in [48].

  Inputs: $m$: network width, $d$: feature dimension
  for $i=1,2,\ldots,m/2$ do
     $c_{i}=-c_{i+m/2}\sim\mathrm{Rademacher}$
     $\theta_{i}=\theta_{i+m/2}\sim\mathcal{N}(0,I_{d})$
  end for
  return  network weights $(c,\theta)$
Algorithm 1 ${\tt sym\_init}(m,d)$ - Symmetric Initialization

Later, we will employ a similar symmetric initialization scheme for the critic neural network.
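For concreteness, here is a minimal numpy sketch of the symmetric initialization (Algorithm 1), the actor network (18), and the induced softmax policy (20). The function names are ours and the snippet is illustrative rather than the authors' implementation.

```python
import numpy as np

def sym_init(m, d, rng):
    """Algorithm 1: paired signs and shared Gaussian weights, so f(.,.;(c,theta(0))) = 0."""
    assert m % 2 == 0
    c_half = rng.choice([-1.0, 1.0], size=m // 2)        # Rademacher
    theta_half = rng.normal(size=(m // 2, d))
    c = np.concatenate([c_half, -c_half])
    theta = np.concatenate([theta_half, theta_half], axis=0)
    return c, theta

def actor_f(x, c, theta):
    """Two-layer ReLU network (18); x is the (s,a) feature vector with ||x||_2 <= 1."""
    m = c.shape[0]
    return (c * np.maximum(theta @ x, 0.0)).sum() / np.sqrt(m)

def policy(s_feats, c, theta):
    """Softmax policy (20) over a finite action set; s_feats[a] is the feature of (s,a)."""
    logits = np.array([actor_f(x, c, theta) for x in s_feats])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
c, theta0 = sym_init(m=64, d=4, rng=rng)
s_feats = [np.array([0.5, 0.1, 0.2, 0.0]), np.array([0.1, 0.4, 0.0, 0.3])]  # |A| = 2
print(policy(s_feats, c, theta0))   # uniform at initialization: [0.5, 0.5]
```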

We denote the policy at iteration $t<T$ as $\pi_{t}=\pi_{\theta(t)}$ and the neural network output as $f_{t}(s,a)=f(s,a;(c,\theta(t)))$. Since $\Xi_{\lambda}^{\pi_{t}}$ and $d_{\mu}^{\pi_{t}}$ are not known exactly, we estimate

u_{t}^{\star}\in\arg\min_{u\in\mathcal{B}_{m,R}^{d}(0)}\mathbb{E}[(\nabla^{\top}\log\pi_{t}(a|s)u-\Xi_{\lambda}^{\pi_{t}}(s,a))^{2}], (21)

from samples by using the following actor-critic method:

  • Critic: Temporal difference learning algorithm (Algorithm 3 in Section 3.2), which employs a critic neural network, returns a set of neural network weights that yield a sample-based estimate for the soft advantage function {Ξ^λπt(s,a):(s,a)𝒮×𝒜}\{\widehat{\Xi}_{\lambda}^{\pi_{t}}(s,a):(s,a)\in\mathcal{S}\times\mathcal{A}\}.

  • Policy update: Given this, we approximate utu_{t}^{\star} by using stochastic gradient descent (SGD) with NN iterations and step-size αA>0\alpha_{A}>0. Starting with u0(t)=0u_{0}^{(t)}=0, an iteration of SGD is as follows

    un+1/2(t)\displaystyle u_{n+1/2}^{(t)} =un(t)αA(θlogπt(an|sn)un(t)Ξ^λπt(sn,an))θlogπt(an|sn),\displaystyle=u_{n}^{(t)}-\alpha_{A}\Big{(}\nabla_{\theta}^{\top}\log{\pi}_{t}(a_{n}|s_{n})u_{n}^{(t)}-\widehat{\Xi}_{\lambda}^{\pi_{t}}(s_{n},a_{n})\Big{)}\nabla_{\theta}\log{\pi}_{t}(a_{n}|s_{n}), (22)
    un+1(t)\displaystyle u_{n+1}^{(t)} =𝒫m,Rd(0)(un+1/2(t)),\displaystyle=\mathcal{P}_{\mathcal{B}_{m,R}^{d}(0)}\left(u_{n+1/2}^{(t)}\right), (23)

    where $s_{n}\sim d_{\mu}^{\pi_{t}}$ and $a_{n}\sim\pi_{t}(\cdot|s_{n})$ for $n=0,1,\ldots,N-1$, and $\widehat{\Xi}^{\pi_{t}}_{\lambda}$ is the output of the critic. Then, the final estimate is $u_{t}=\frac{1}{N}\sum_{n=1}^{N}u_{n}^{(t)}$ (a schematic implementation of this inner loop is given after this list). By using $u_{t}$, we perform the following update:

    θ(t+1)=θ(t)+ηtwt,\theta(t+1)=\theta(t)+\eta_{t}\cdot w_{t},

    where wt=utλ(θ(t)θ(0))w_{t}=u_{t}-\lambda\big{(}\theta(t)-\theta(0)\big{)}.
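As referenced above, here is a schematic numpy version of the projected SGD inner loop (22)-(23), the averaging step, and the parameter update of Algorithm 2. The sampler, the score function $\nabla_{\theta}\log\pi_{t}$ and the critic estimate $\widehat{\Xi}_{\lambda}^{\pi_{t}}$ are passed in as callables; this is our own sketch under those assumptions, not the authors' code.

```python
import numpy as np

def actor_inner_loop(grad_log_pi, xi_hat, sample_sa, dim, m, R, alpha_A, N):
    """Projected SGD (22)-(23) for u_t, followed by the averaging u_t = (1/N) sum_n u_n^{(t)}.
    grad_log_pi(s, a) -> flattened gradient of log pi_t(a|s), length dim = m*d;
    xi_hat(s, a)      -> critic estimate of the soft advantage;
    sample_sa()       -> one pair (s, a) with s ~ d_mu^{pi_t}, a ~ pi_t(.|s)."""
    u = np.zeros(dim)
    u_sum = np.zeros(dim)
    radius = R / np.sqrt(m)                          # row radius of the ball B_{m,R}^d(0)
    for _ in range(N):
        s, a = sample_sa()
        g = grad_log_pi(s, a)
        u = u - alpha_A * (g @ u - xi_hat(s, a)) * g           # SGD step (22)
        rows = u.reshape(m, -1)                                # projection (23): clip each row
        norms = np.linalg.norm(rows, axis=1, keepdims=True)
        rows *= np.minimum(1.0, radius / np.maximum(norms, 1e-12))
        u = rows.reshape(-1)
        u_sum += u
    return u_sum / N

def nac_step(theta, theta0, u_t, eta_t, lam):
    """Line 11 of Algorithm 2: theta(t+1) = theta(t) + eta_t*u_t - eta_t*lam*(theta(t)-theta(0))."""
    return theta + eta_t * u_t - eta_t * lam * (theta - theta0)

# Dummy usage with random placeholders standing in for the sampler, score and critic
rng = np.random.default_rng(0)
m, d = 8, 3
u_t = actor_inner_loop(
    grad_log_pi=lambda s, a: rng.normal(size=m * d) / np.sqrt(m * d),
    xi_hat=lambda s, a: rng.normal(),
    sample_sa=lambda: (None, None),
    dim=m * d, m=m, R=1.0, alpha_A=0.1, N=50)
```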

The natural actor-critic algorithm is summarized in Algorithm 2. Below, we summarize the modifications in the algorithm that we consider in this paper with respect to the NPG described in the previous section.

Remark 1 (Averaging and projection).

The update in each iteration of the NAC algorithm described in Algorithm 2 can be equivalently written as follows:

\theta(t+1)-\theta(0)=(1-\eta_{t}\lambda)\cdot(\theta(t)-\theta(0))+\eta_{t}u_{t}, (24)

where utu_{t} is an approximate solution to the optimization problem (21). As we will see, the projection of utu_{t} onto m,Rd(0)\mathcal{B}_{m,R}^{d}(0) (which can be considered as gradient clipping), in conjunction with the averaging in the policy update (24) enables us to control maxi[m]θi(t)θi(0)2\max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2} while taking (natural) gradient steps towards the optimal policy. Controlling maxi[m]θi(t)θi(0)2\max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2} is critical for two reasons: (i) to ensure sufficient exploration, and (ii) to establish the convergence bounds for the neural networks.

Alternatively, one may be tempted to project θ(t)\theta(t) onto a ball around θ(0)\theta(0) in the 2\ell_{2}-geometry to control maxiθi(t)θi(0)2\max_{i}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}. However, as the algorithm follows the natural policy gradient, which uses a different Bregman divergence than 2\|\cdot\|_{2}, projection of θ(t)\theta(t) with respect to the 2\ell_{2}-norm may not result in moving the policy in the direction of improvement. Similarly, since we parameterize the policies by using a lower-dimensional vector θm×d\theta\in\mathbb{R}^{m\times d} to avoid storing and computing |𝒮×𝒜||\mathcal{S}\times\mathcal{A}|-dimensional policies, Bregman projection in the probability simplex, which is commonly used in direct parameterization, is not a feasible option for policy optimization with function approximation.

As such, simultaneous use of averaging and projection of the update utu_{t} are critical to control the network weights and policy improvement.

Remark 2 (Baseline).

Note that the regression target in (21) uses the soft advantage function $\Xi_{\lambda}^{\pi}$ rather than the state-action value function $q_{\lambda}^{\pi}$. The soft advantage function uses $\sum_{a\in\mathcal{A}}\pi_{t}(a|s)Q_{\lambda}^{\pi_{t}}(s,a)$ as a baseline for variance reduction, which is a common practice in policy gradient methods [1].

In the following subsection, we describe the critic algorithm in detail.

1:  Initialize (c,θ(0))𝚜𝚢𝚖_𝚒𝚗𝚒𝚝(m,d)(c,\theta(0))\sim{\tt sym\_init}(m,d)
2:  for t=0,1,,T1t=0,1,\ldots,T-1 do
3:     Critic: Ξ^λπt=\widehat{\Xi}_{\lambda}^{\pi_{t}}=~{}MN-NTD(πt,R,m,T,αC)(\pi_{t},R,m^{\prime},T^{\prime},\alpha_{C})
4:     Initialize: u0(t)=0u^{(t)}_{0}=0
5:     for $n=0,1,\ldots,N-1$ do
6:        Sampling: sndμπt,anπt(|sn).s_{n}\sim d_{\mu}^{\pi_{t}},a_{n}\sim\pi_{t}(\cdot|s_{n}).
7:        un+1/2(t)=un(t)αA(θlogπt(an|sn)un(t)Ξ^λπt(sn,an))θlogπt(an|sn)u_{n+1/2}^{(t)}=u_{n}^{(t)}-\alpha_{A}\Big{(}\nabla_{\theta}^{\top}\log{\pi}_{t}(a_{n}|s_{n})u_{n}^{(t)}-\widehat{\Xi}_{\lambda}^{\pi_{t}}(s_{n},a_{n})\Big{)}\nabla_{\theta}\log{\pi}_{t}(a_{n}|s_{n})
8:        un+1(t)=𝒫m,Rd(0)(un+1/2(t))u_{n+1}^{(t)}=\mathcal{P}_{\mathcal{B}_{m,R}^{d}(0)}\left(u_{n+1/2}^{(t)}\right)
9:     end for
10:     ut=1Nn=1Nun(t)u_{t}=\frac{1}{N}\sum_{n=1}^{N}u_{n}^{(t)}
11:     θ(t+1)=θ(t)+ηtutηtλ[θ(t)θ(0)]\theta(t+1)=\theta(t)+\eta_{t}u_{t}-\eta_{t}\lambda[\theta(t)-\theta(0)]
12:  end for
Algorithm 2 Entropy-regularized Neural NAC

3.2 Critic Network and Temporal Difference Learning

We estimate $\Xi_{\lambda}^{\pi_{t}}$ by using the neural TD learning algorithm with max-norm regularization [49]. Note that $\Xi_{\lambda}^{\pi_{t}}$ can be directly obtained from $q_{\lambda}^{\pi_{\theta}}$ via $Q_{\lambda}^{\pi_{\theta}}(s,a)=q_{\lambda}^{\pi_{\theta}}(s,a)+\lambda\log{\pi_{\theta}}(a|s)$ and (9). Since $q_{\lambda}^{\pi_{\theta}}$ is the fixed point of the Bellman equation $q=\mathcal{T}^{\pi_{\theta}}q$ given in Section 2.1, it can be approximated by using temporal difference (TD) learning algorithms.

For the critic, we use a two-layer neural network of width mm^{\prime}, which is defined as follows:

q^(s,a;(b,W))=1mi=1mbiσ(Wi,(s,a)).\widehat{q}(s,a;(b,W))=\frac{1}{\sqrt{m^{\prime}}}\sum_{i=1}^{m^{\prime}}b_{i}\sigma\left(\langle W_{i},(s,a)\rangle\right). (25)

The critic network is initialized according to the symmetric initialization scheme in Algorithm 1. Let (b,W(0))(b,W(0)) denote the initialization.

We aim to solve the following problem:

W=argminW𝔼sdμπθ,aπθ(|s)[(q^(s,a;(b,W))𝒯πθq^(s,a;(b,W)))2].W^{\star}=\arg\min_{W}~{}~{}\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim{\pi_{\theta}}(\cdot|s)}\Big{[}\Big{(}\widehat{q}(s,a;(b,W))-\mathcal{T}^{\pi_{\theta}}\widehat{q}(s,a;(b,W))\Big{)}^{2}\Big{]}. (26)

where $\mathcal{T}^{\pi}$ is the Bellman operator defined in Section 2.1.

We will consider max-norm regularization in the updates of the critic, which was shown to be effective in supervised learning and reinforcement learning (see [49, 50]). For a given w0dw_{0}\in\mathbb{R}^{d} and R>0R>0, let

𝒢R(w0)={wd:ww02R/m}.\mathcal{G}_{R}(w_{0})=\{w\in\mathbb{R}^{d}:\|w-w_{0}\|_{2}\leq R/\sqrt{m^{\prime}}\}. (27)

Under max-norm regularization, each hidden unit’s weight vector is confined within the set 𝒢R(Wi(0))\mathcal{G}_{R}(W_{i}(0)) for a given projection radius RR.

For $k=0,1,\ldots,T^{\prime}-1$, we assume that $(s_{k},a_{k})$ is sampled from $d_{\mu}^{\pi_{\theta}}$, i.e., $s_{k}\sim d_{\mu}^{\pi_{\theta}},a_{k}\sim{\pi_{\theta}}(\cdot|s_{k})$. Upon obtaining $(s_{k},a_{k})$, the next state-action pair is obtained by following ${\pi_{\theta}}$: $s_{k}^{\prime}\sim P(\cdot|s_{k},a_{k})$, $a_{k}^{\prime}\sim{\pi_{\theta}}(\cdot|s_{k}^{\prime})$. One can replace the i.i.d. sampling here with Markovian sampling at the cost of a more complicated analysis, as in [51]. However, since experience replay is used in practice, the actual sampling procedure is neither purely Markovian nor i.i.d., and for simplicity of the analysis we choose to model it as i.i.d. sampling.

An iteration of MN-NTD is as follows:

W_{i}(k+1)=\mathcal{P}_{\mathcal{G}_{R}(W_{i}(0))}\left(W_{i}(k)+\alpha_{C}\big(r_{k}+\gamma\widehat{q}_{k}(s_{k}^{\prime},a_{k}^{\prime})-\widehat{q}_{k}(s_{k},a_{k})\big)\nabla_{W_{i}}\widehat{q}_{k}(s_{k},a_{k})\right),\quad\forall i\in[m^{\prime}],

where q^k(s,a)=q^(s,a;(b,W(k)))\widehat{q}_{k}(s,a)=\widehat{q}(s,a;(b,W(k))), rk=r(sk,ak)λlogπθ(ak|sk)r_{k}=r(s_{k},a_{k})-\lambda\log\pi_{\theta}(a_{k}|s_{k}) and 𝒫𝒞\mathcal{P}_{\mathcal{C}} is the projection operator onto a set 𝒞d\mathcal{C}\subset\mathbb{R}^{d}. The output of the critic, which approximates qλπθq_{\lambda}^{\pi_{\theta}}, is then obtained as:

q¯Tπθ(s,a)=q^(s,a;(b,1Tk<TW(k))),(s,a)𝒮×𝒜,\overline{q}_{T^{\prime}}^{\pi_{\theta}}(s,a)=\widehat{q}\Big{(}s,a;\big{(}b,\frac{1}{T^{\prime}}\sum_{k<T^{\prime}}W(k)\big{)}\Big{)},~{}~{}(s,a)\in\mathcal{S}\times\mathcal{A},

where TT^{\prime} is the number of iterations of MN-NTD. We obtain an approximation of the soft Q-function as

Q¯λπθ(s,a)=q¯Tπθ(s,a)λlogπθ(a|s).\overline{Q}_{\lambda}^{\pi_{\theta}}(s,a)=\overline{q}_{T^{\prime}}^{\pi_{\theta}}(s,a)-\lambda\log{\pi_{\theta}}(a|s).

The corresponding estimate for the soft advantage function is the following:

Ξ^λπθ(s,a)=Q¯λπθ(s,a)a𝒜πθ(a|s)Q¯λπθ(s,a).\widehat{\Xi}_{\lambda}^{\pi_{\theta}}(s,a)=\overline{Q}_{\lambda}^{\pi_{\theta}}(s,a)-\sum_{a^{\prime}\in\mathcal{A}}{\pi_{\theta}}(a^{\prime}|s)\overline{Q}_{\lambda}^{\pi_{\theta}}(s,a^{\prime}). (28)

The critic update for a given policy πθ,θΘ\pi_{\theta},\theta\in\Theta is summarized in Algorithm 3.

  Inputs: Policy πθ\pi_{\theta}, proj. radius RR, network width mm^{\prime}, sample size TT^{\prime}, step-size αC\alpha_{C}
  Initialization: (b,W(0))=𝚜𝚢𝚖_𝚒𝚗𝚒𝚝(m,d)(b,W(0))={\tt sym\_init}(m^{\prime},d)
  for k<T1k<T^{\prime}-1 do
     Observe (sk,ak)dμπθπθ(|sk)(s_{k},a_{k})\sim d_{\mu}^{\pi_{\theta}}\circ{\pi_{\theta}}(\cdot|s_{k}), skP(|sk,ak)s_{k}^{\prime}\sim P(\cdot|s_{k},a_{k}), akπθ(|sk)a_{k}^{\prime}\sim\pi_{\theta}(\cdot|s_{k}^{\prime})
     Observe reward: rk:=r(sk,ak)λlogπθ(ak|sk)r_{k}:=r(s_{k},a_{k})-\lambda\log{\pi_{\theta}}(a_{k}|s_{k})
     Compute semi-gradient: $g_{k}=\big(r_{k}+\gamma\hat{q}_{k}(s_{k}^{\prime},a_{k}^{\prime})-\hat{q}_{k}(s_{k},a_{k})\big)\nabla_{W}\hat{q}_{k}(s_{k},a_{k})$
     Take a semi-gradient step: W(k+1/2)=W(k)+αCgkW(k+1/2)=W(k)+\alpha_{C}g_{k}
     Max-norm regularization: Wi(k+1)=𝒫𝒢R(Wi(0)){Wi(k+1/2)},i[m]W_{i}(k+1)=\mathcal{P}_{\mathcal{G}_{R}(W_{i}(0))}\left\{W_{i}(k+1/2)\right\},\forall i\in[m^{\prime}]
  end for
  return  q¯Tπθ(s,a)=q^(s,a;(b,1Tk<TW(k)))\overline{q}_{T^{\prime}}^{\pi_{\theta}}(s,a)=\hat{q}\Big{(}s,a;\big{(}b,\frac{1}{T^{\prime}}\sum_{k<T^{\prime}}W(k)\big{)}\Big{)} for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}
Algorithm 3 MN-NTD - Max-Norm Regularized Neural TD Learning
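A compact numpy sketch of MN-NTD (Algorithm 3) is given below, with the sampling oracle and the entropy-regularized reward supplied as a callable. This is our illustrative code, assuming a generic feature map for $(s,a)$; it is not the authors' implementation.

```python
import numpy as np

def mn_ntd(sample_transition, m_prime, d, R, T_prime, alpha_C, gamma, rng):
    """Max-norm regularized neural TD (Algorithm 3), critic width m_prime.
    sample_transition() -> (x, r_reg, x_next), where x = feat(s, a), x_next = feat(s', a')
    and r_reg = r(s, a) - lambda * log pi(a|s). Returns (b, W_bar) defining q_bar via (25)."""
    # symmetric initialization (Algorithm 1)
    b_half = rng.choice([-1.0, 1.0], size=m_prime // 2)
    W_half = rng.normal(size=(m_prime // 2, d))
    b = np.concatenate([b_half, -b_half])
    W0 = np.concatenate([W_half, W_half], axis=0)
    W = W0.copy()
    W_sum = np.zeros_like(W)
    q = lambda x, W_: (b * np.maximum(W_ @ x, 0.0)).sum() / np.sqrt(m_prime)

    for _ in range(T_prime):
        W_sum += W                                   # average over W(0), ..., W(T'-1)
        x, r_reg, x_next = sample_transition()
        delta = r_reg + gamma * q(x_next, W) - q(x, W)                              # TD error
        grad = (b[:, None] * (W @ x > 0)[:, None] * x[None, :]) / np.sqrt(m_prime)  # dq/dW
        W_step = W + alpha_C * delta * grad                                         # semi-gradient step
        # max-norm projection: each row stays within radius R/sqrt(m') of its initialization
        diff = W_step - W0
        norms = np.linalg.norm(diff, axis=1, keepdims=True)
        W = W0 + diff * np.minimum(1.0, (R / np.sqrt(m_prime)) / np.maximum(norms, 1e-12))
    return b, W_sum / T_prime

# Dummy usage with a random-feature stand-in for the sampling oracle
rng = np.random.default_rng(0)
d = 4
fake = lambda: (rng.normal(size=d) / 4, rng.uniform(), rng.normal(size=d) / 4)
b, W_bar = mn_ntd(fake, m_prime=32, d=d, R=1.0, T_prime=200, alpha_C=0.05, gamma=0.9, rng=rng)
```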

4 Main Results: Sample Complexity and Overparameterization Bounds for Neural NAC

In this section, we analyze the convergence of the entropy-regularized neural NAC algorithm and provide sample complexity and overparameterization bounds for both the actor and the critic.

4.1 Regularization and Persistence of Excitation under Neural NAC

The following proposition shows that the persistence of excitation condition (see [52] for a discussion of its critical role in stochastic control problems) is satisfied under Algorithm 2, which ensures sufficient exploration and hence convergence to global optimality.

Proposition 2 (Persistence of excitation).

For any regularization parameter $\lambda>0$ and projection radius $R>0$, the entropy-regularized NAC satisfies the following:

maxi[m]θi(t)θi(0)2Rϰtλm,\max_{i\in[m]}~{}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}\leq\frac{R\varkappa_{t}}{\lambda\sqrt{m}}, (29)

where

ϰt={1,ηt=1λ(t+1),1(1ηλ)t,ηt=η(0,1λ),\varkappa_{t}=\begin{cases}1,&\eta_{t}=\frac{1}{\lambda(t+1)},\\ 1-(1-\eta\lambda)^{t},&\eta_{t}=\eta\in\left(0,\frac{1}{\lambda}\right),\end{cases} (30)

for all t0t\geq 0 almost surely. Consequently,

πmin:=inf(s,a)𝒮×𝒜πt(a|s)exp(2R/λ2ρ0(Rϰtλ,m,δ))|𝒜|>0,\pi_{min}:=\inf_{(s,a)\in\mathcal{S}\times\mathcal{A}}\pi_{t}(a|s)\geq\frac{\exp\left(-2R/\lambda-2\rho_{0}\left(\frac{R\varkappa_{t}}{\lambda},m,\delta\right)\right)}{|\mathcal{A}|}>0, (31)

simultaneously for all t0t\geq 0 with probability at least 1δ1-\delta over the random initialization of the actor network, where the function ρ0\rho_{0} is given by

ρ0(R0,m,δ)=16R0m(R0+log(1δ)+dlog(m)).\rho_{0}(R_{0},m,\delta)=\frac{16R_{0}}{\sqrt{m}}\Big{(}R_{0}+\sqrt{\log\Big{(}\frac{1}{\delta}\Big{)}}+\sqrt{d\log(m)}\Big{)}. (32)

Proposition 2 has two critical implications:

  (i) The inequality (31) implies that any action $a\in\mathcal{A}$ is taken with strictly positive probability at any given state $s\in\mathcal{S}$, i.e., all policies throughout the policy optimization steps satisfy the “persistence of excitation” condition with high probability over the random initialization. As we will see in the convergence analysis, this property implies sufficient exploration, which ensures that near-deterministic suboptimal policies are avoided. Sufficient exploration is achieved by entropy regularization, averaging, projection of $u_{t}$, and a large network width $m$ for the policy parameterization.

  (ii) The inequality (29) implies that we can control the deviation of the actor network weights via $R$, $\lambda$ and $m$. This property is key for the neural network analysis in the lazy-training regime.
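For concreteness, the lower bound in (31) can be evaluated numerically from (32). A small helper (ours) that does so:

```python
import numpy as np

def pi_min_lower_bound(R, lam, m, d, delta, A_size, kappa_t=1.0):
    """Evaluates the right-hand side of (31) using rho_0 from (32)."""
    R0 = R * kappa_t / lam
    rho0 = 16.0 * R0 / np.sqrt(m) * (R0 + np.sqrt(np.log(1.0 / delta)) + np.sqrt(d * np.log(m)))
    return np.exp(-2.0 * R / lam - 2.0 * rho0) / A_size

# e.g. R = 1, lambda = 0.5, m = 10^6 hidden units, d = 10 features, |A| = 4
print(pi_min_lower_bound(R=1.0, lam=0.5, m=1_000_000, d=10, delta=0.05, A_size=4))
```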

4.2 Transportation Mappings and Function Classes

We first present a brief discussion on kernel approximations of neural networks, which will be useful to state our convergence results. Consider the following space of mappings:

ν¯={v:dd:supwdv(w)2ν¯},\mathcal{H}_{{\bar{\nu}}}=\{v:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}:\sup_{w\in\mathbb{R}^{d}}\|v(w)\|_{2}\leq{\bar{\nu}}\}, (33)

and the function class:

ν¯={g()=𝔼w0𝒩(0,Id)[v(w0),𝟙{w0,>0}]:vν¯}.\mathcal{F}_{{\bar{\nu}}}=\Big{\{}g(\cdot)=\mathbb{E}_{w_{0}\sim\mathcal{N}(0,I_{d})}[\langle v(w_{0}),\cdot\rangle\mathbbm{1}\{\langle w_{0},\cdot\rangle>0\}]:v\in\mathcal{H}_{{\bar{\nu}}}\Big{\}}. (34)

Note that $\mathcal{F}_{{\bar{\nu}}}$ is a provably rich subset of the reproducing kernel Hilbert space (RKHS) induced by the neural tangent kernel, which can approximate continuous functions over a compact space [43, 40, 53]. For a given class of transportation maps $\mathcal{V}=\{v_{k}\in\mathcal{H}_{\bar{\nu}}:k\in[K]\}$ for $K\geq 1$, we also consider the following subspace of $\mathcal{F}_{{\bar{\nu}}}$:

K,ν¯,𝒱={g()=𝔼w0𝒩(0,Id)[k[K]αkvk(w0),𝟙{w0,>0}]:α11}.\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}}=\Big{\{}g(\cdot)=\mathbb{E}_{w_{0}\sim\mathcal{N}(0,I_{d})}[\big{\langle}\sum_{k\in[K]}\alpha_{k}v_{k}(w_{0}),\cdot\big{\rangle}\mathbbm{1}\{\langle w_{0},\cdot\rangle>0\}]:\|\alpha\|_{1}\leq 1\Big{\}}. (35)

Note that the above set depends on the choice of $\{v_{k}\}_{k\in[K]}$, but these maps can be arbitrary. The space of continuously differentiable functions $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ over a compact domain has a countable basis $\{\varphi_{k}:k=0,1,\ldots\}$ [54]. By [43, Theorem 4.3], one can find transportation mappings $v_{k}\in\mathcal{H}_{{\bar{\nu}}}$ such that $g_{k}(s,a)=\mathbb{E}[\langle v_{k}(w_{0}),(s,a)\rangle\mathbbm{1}\{\langle w_{0},(s,a)\rangle\geq 0\}]$ approximates $\varphi_{k}$ well. As such, $\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}}$ is able to approximate, as $K\rightarrow\infty$ with an appropriate $\mathcal{V}$, a function class that contains the continuously differentiable functions over a compact space.
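Functions in $\mathcal{F}_{{\bar{\nu}}}$ can be approximated by Monte Carlo sampling of $w_{0}\sim\mathcal{N}(0,I_{d})$, which is the infinite-width analogue of a two-layer ReLU network linearized around a random initialization. A short sketch (ours), assuming a given transportation map $v$ with $\sup_{w}\|v(w)\|_{2}\leq{\bar{\nu}}$:

```python
import numpy as np

def f_nu_bar(x, v, n_samples=20_000, seed=0):
    """Monte Carlo estimate of g(x) = E_{w0 ~ N(0,I_d)}[<v(w0), x> 1{<w0, x> > 0}] in (34)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    w0 = rng.normal(size=(n_samples, d))
    active = (w0 @ x > 0).astype(float)
    vals = (np.apply_along_axis(v, 1, w0) * x[None, :]).sum(axis=1)
    return np.mean(active * vals)

# Example transportation map with sup_w ||v(w)||_2 <= nu_bar = 1
v = lambda w: w / np.maximum(np.linalg.norm(w), 1.0)
x = np.array([0.6, 0.3, 0.1])
print(f_nu_bar(x, v))
```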

4.3 Convergence of the Critic

We make the following realizability assumption for the Q-function.

Assumption 2 (Realizability of the Q-function).

For any t0t\geq 0, we assume that qλπtν¯q_{\lambda}^{\pi_{t}}\in\mathcal{F}_{{\bar{\nu}}} for some ν¯>0{\bar{\nu}}>0.

Assumption 2 is a smoothness condition on the class of functions that can be approximated by the critic network, which is dense in the space of continuous functions over $\Omega_{d}$ (see Section 4.2). Here ${\bar{\nu}}$, which is an upper bound on the RKHS norm, is a measure of smoothness. One can also replace the above condition by a slightly stronger condition which states that $q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{{\bar{\nu}}}$ for all $\theta\in\Theta$. Note that the class of functions $\mathcal{F}_{\bar{\nu}}$ is deterministic and its approximation properties are well-known [43]. In [22], it was assumed that the state-action value functions lie in a random function class, which is obtained by shifting $\mathcal{F}_{\bar{\nu}}$ with a Gaussian process. By employing a symmetric initialization, we eliminate this Gaussian process noise, and therefore the realizable class of functions is deterministic and provably rich.

Theorem 1 (Convergence of the Critic, Theorem 2 in [49]).

Under Assumption 2, for any error probability δ(0,1)\delta\in(0,1), let

(m,δ)=4log(2m+1)+4log(T/δ),\ell(m^{\prime},\delta)=4\sqrt{\log(2m^{\prime}+1)}+4\sqrt{\log(T/\delta)},

and R>ν¯R>{\bar{\nu}}. Then, for any target error ε>0\varepsilon>0, number of iterations TT^{\prime}\in\mathbb{N}, network width

m>16(ν¯+(R+(m,δ))(ν¯+R))2(1γ)2ε2,m^{\prime}>\frac{16\Big{(}{\bar{\nu}}+\big{(}R+\ell(m^{\prime},\delta)\big{)}\big{(}{\bar{\nu}}+R\big{)}\Big{)}^{2}}{(1-\gamma)^{2}\varepsilon^{2}},

and step-size

αC=ε2(1γ)(1+2R)2,\alpha_{C}=\frac{\varepsilon^{2}(1-\gamma)}{(1+2R)^{2}},

the critic yields the following bound:

𝔼[𝔼sdμπt,aπt(|s)[(q¯Tπt(s,a)qλπt(s,a))2]𝟙A2](1+2R)ν¯ε(1γ)T+3ε,\mathbb{E}\Big{[}\sqrt{\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}},a\sim\pi_{t}(\cdot|s)}\Big{[}\big{(}\overline{q}_{T^{\prime}}^{\pi_{t}}(s,a)-q_{\lambda}^{\pi_{t}}(s,a)\big{)}^{2}\Big{]}}\mathbbm{1}_{A_{2}}\Big{]}\leq\frac{(1+2R){\bar{\nu}}}{\varepsilon(1-\gamma)\sqrt{T^{\prime}}}+3\varepsilon,

where A2A_{2} holds with probability at least 1δ1-\delta over the random initializations of the critic network.

Note that in order to achieve a target error less than $\varepsilon>0$, a network width of $m^{\prime}=\widetilde{O}\big(\frac{{\bar{\nu}}^{4}}{\varepsilon^{2}}\big)$ and an iteration complexity of $T^{\prime}=O\big(\frac{(1+2{\bar{\nu}})^{2}{\bar{\nu}}^{2}}{\varepsilon^{4}}\big)$ suffice. The analysis of the TD learning algorithm in [49] uses results from [45], which were given for classification (supervised learning) problems with logistic loss. On the other hand, TD learning requires a significantly more challenging analysis because of the bootstrapping in the updates (i.e., the use of a stochastic semi-gradient instead of a true gradient) and the quadratic loss function. Furthermore, for improved sample complexity and overparameterization bounds, max-norm regularization is employed instead of early stopping [49].
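The width condition in Theorem 1 is implicit in $m^{\prime}$ through $\ell(m^{\prime},\delta)$; a simple fixed-point iteration finds a sufficient width and the corresponding step-size. The helper below is ours and treats $T$ as a user-supplied quantity:

```python
import numpy as np

def critic_hyperparams(eps, nu_bar, R, gamma, delta, T, max_iter=100):
    """Returns (m_prime, alpha_C) satisfying the conditions of Theorem 1 (up to rounding)."""
    ell = lambda m: 4.0 * np.sqrt(np.log(2 * m + 1)) + 4.0 * np.sqrt(np.log(T / delta))
    bound = lambda m: 16.0 * (nu_bar + (R + ell(m)) * (nu_bar + R)) ** 2 / ((1 - gamma) ** 2 * eps ** 2)
    m = 1.0
    for _ in range(max_iter):                 # fixed-point iteration on the width condition
        m_new = bound(m) + 1.0
        if abs(m_new - m) < 1.0:
            break
        m = m_new
    alpha_C = eps ** 2 * (1 - gamma) / (1 + 2 * R) ** 2
    return int(np.ceil(m)), alpha_C

print(critic_hyperparams(eps=0.1, nu_bar=1.0, R=1.5, gamma=0.9, delta=0.05, T=1000))
```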

4.4 Global Optimality and Convergence of Neural NAC

In this section, we provide the main convergence result for the entropy-regularized NAC with neural network approximation.

Assumption 3 (Realizability).

For K1K\geq 1, we assume that for all θΘ\theta\in\Theta, QλπθK,ν¯,𝒱Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}}, where the function class K,ν¯,𝒱\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}} is defined in Section 4.2.

Note that K,ν¯,𝒱\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}} approximates a rich class of functions over a compact space well for large KK (see Section 4.2). Also, Assumption 3 implies that there is a structure among the soft Q-functions in the policy class Θ\Theta since each QλπθQ_{\lambda}^{\pi_{\theta}} can be written as a linear combination of KK functions that correspond to the transportation maps vkv_{k}. We consider this relatively restricted function class instead of ν¯\mathcal{F}_{{\bar{\nu}}} to obtain uniform approximation error bounds to handle the dynamic structure of the policy optimization over time steps. Notably, the actor network features are expected to fit QλπtQ_{\lambda}^{\pi_{t}} over all iterations, thus an inherent structure in {Qλπθ:θΘ}\{Q_{\lambda}^{\pi_{\theta}}:\theta\in\Theta\} appears to be necessary. For further discussion, see Section 5.4 (particularly Remark 7).

4.4.1 Performance Bounds under a Weak Distribution Mismatch Condition

First, we establish sample complexity and overparameterization bounds under a weak distribution mismatch condition, which is provided below. This condition is significantly weaker than those in the existing literature (e.g., [22, 55, 17]) since Proposition 2 shows that the policies achieve sufficient exploration (see Remark 3 for details).

Assumption 4 (Weak distribution mismatch condition).

There exists a constant C<C_{\infty}<\infty such that

supt0𝔼sdμπt[(dμπ(s)dμπt(s))2]C2.\sup_{t\geq 0}~{}\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}}}\Big{[}\Big{(}\frac{d_{\mu}^{\pi^{*}}(s)}{d_{\mu}^{\pi_{t}}(s)}\Big{)}^{2}\Big{]}\leq C_{\infty}^{2}.
Remark 3 (Weak distribution mismatch condition).

Note that a sufficient condition for Assumption 4 is an exploratory initial state distribution μ\mu, which covers the support of the state visitation distribution of dμπd_{\mu}^{\pi^{*}}:

sups𝗌𝗎𝗉𝗉(dμπ)dμπ(s)μ(s)<,\sup_{s\in\mathsf{supp}(d_{\mu}^{\pi^{*}})}\frac{d_{\mu}^{\pi^{*}}(s)}{\mu(s)}<\infty, (36)

since $\sqrt{\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}}}\big[\big(\frac{d_{\mu}^{\pi^{*}}(s)}{d_{\mu}^{\pi_{t}}(s)}\big)^{2}\big]}\leq\frac{1}{1-\gamma}\left\|\frac{d_{\mu}^{\pi^{*}}}{\mu}\right\|_{\infty}$. Hence, if the initial distribution has a sufficiently large support set, then Assumption 4 is satisfied without any assumptions on $\{\pi_{t}:t\geq 0\}$. Together with Proposition 2, this ensures stability of the policy optimization with minimal assumptions on $\mu$.
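On a small finite MDP, $d_{\mu}^{\pi}$ can be computed in closed form from its definition in Section 2.2, which allows one to evaluate the quantity in Assumption 4 directly. A sketch under our own toy setup:

```python
import numpy as np

def discounted_visitation(P, pi, mu, gamma):
    """d_mu^pi = (1 - gamma) * mu^T (I - gamma*P_pi)^{-1}, with P_pi(s,s') = sum_a pi(a|s) P(s'|s,a)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return (1 - gamma) * mu @ np.linalg.inv(np.eye(P.shape[0]) - gamma * P_pi)

def mismatch_C(P, pi_t, pi_star, mu, gamma):
    """sqrt(E_{s ~ d^{pi_t}}[(d^{pi*}(s) / d^{pi_t}(s))^2]), the quantity bounded in Assumption 4."""
    d_t = discounted_visitation(P, pi_t, mu, gamma)
    d_star = discounted_visitation(P, pi_star, mu, gamma)
    return np.sqrt(np.sum(d_t * (d_star / d_t) ** 2))

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
mu = np.array([0.5, 0.5])
pi_t = np.full((2, 2), 0.5)
pi_star = np.array([[0.9, 0.1], [0.2, 0.8]])
print(mismatch_C(P, pi_t, pi_star, mu, gamma=0.9))
```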

The following theorem is one of the main results in this paper, which establishes the convergence bounds of the NAC algorithm.

Theorem 2 (Performance bounds).

Under Assumptions 1-4, Algorithm 2 with R>ν¯R>{\bar{\nu}} and regularization coefficient λ>0\lambda>0 satisfies the following bounds:

  • (1)

    with step-size ηt=1λ(t+1),t0\eta_{t}=\frac{1}{\lambda(t+1)},~{}t\geq 0, we have

    (1γ)mint[T]𝔼[(Vλπ(μ)Vλπt(μ))𝟙A]2R2(1+logT)λT+2Rρ0+4ρ0Tλ+M(ρ1+ε+RqmaxN1/4),(1-\gamma)\min_{t\in[T]}~{}\mathbb{E}[(V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\frac{2R^{2}(1+\log T)}{\lambda T}+{2R\sqrt{\rho_{0}}}+4\rho_{0}T\lambda+M_{\infty}\Big{(}\rho_{1}+\varepsilon+\frac{Rq_{max}}{N^{1/4}}\Big{)},
  • (2)

    with step-size ηt=η(0,1/λ)\eta_{t}=\eta\in(0,1/\lambda), we have

    (1γ)mint[T]𝔼[(Vλπ(μ)Vλπt(μ))𝟙A]λeηλTlog|𝒜|1eηλT+2Rρ0+4ρ0Tλ+M(ρ1+ε+RqmaxN1/4)+2ηR2,(1-\gamma)\min_{t\in[T]}~{}\mathbb{E}[(V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\frac{\lambda e^{-\eta\lambda T}\log|\mathcal{A}|}{1-e^{-\eta\lambda T}}+2R\sqrt{\rho_{0}}+4\rho_{0}T\lambda+M_{\infty}\Big{(}\rho_{1}+\varepsilon+\frac{Rq_{max}}{N^{1/4}}\Big{)}+2\eta R^{2},

for any δ(0,1/3)\delta\in(0,1/3) where (A)13δ\mathbb{P}(A)\geq 1-3\delta over the random initialization of the actor and critic networks,

qmax=4(R+rmax1γ+λlog|𝒜|1γ),q_{max}=4\Big{(}R+\frac{r_{max}}{1-\gamma}+\frac{\lambda\log|\mathcal{A}|}{1-\gamma}\Big{)}, (37)

which is an upper bound on the gradient norm in (22),

M=C(1+πmin1),ρ0=16Rλm(Rλ+log(1δ)+dlog(m)),ρ1=16ν¯m((dlog(m))14+log(Kδ)),\displaystyle\begin{aligned} M_{\infty}&=C_{\infty}(1+\pi_{min}^{-1}),\\ \rho_{0}&=\frac{16R}{\lambda\sqrt{m}}\Big{(}\frac{R}{\lambda}+\sqrt{\log\Big{(}\frac{1}{\delta}\Big{)}}+\sqrt{d\log(m)}\Big{)},\\ \rho_{1}&=\frac{16{\bar{\nu}}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)},\end{aligned}

m=O~(ν¯4ε2)m^{\prime}=\widetilde{O}\Big{(}\frac{{\bar{\nu}}^{4}}{\varepsilon^{2}}\Big{)} and T=O((1+2ν¯)2ν¯2ε4)T^{\prime}=O\Big{(}\frac{(1+2{\bar{\nu}})^{2}{\bar{\nu}}^{2}}{\varepsilon^{4}}\Big{)} (as specified in Theorem 1).

In the following, we characterize the sample complexity, iteration complexity and overparameterization bounds based on Theorem 2.

Corollary 1 (Sample Complexity and Overparameterization Bounds).

For any ϵ>0\epsilon>0 and δ(0,1/3)\delta\in(0,1/3), Algorithm 2 with R>ν¯R>{\bar{\nu}} satisfies:

mint[T]𝔼[(Vλπ(μ)Vλπt(μ))𝟙A]ϵ,\min_{t\in[T]}~{}\mathbb{E}[(V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\epsilon,

where (A)13δ\mathbb{P}(A)\geq 1-3\delta over the random initialization of the actor-critic networks for the following parameters:

  • iteration complexity: T=O~(R2(1γ)λϵ)T=\tilde{O}\Big{(}\frac{R^{2}}{(1-\gamma)\lambda\epsilon}\Big{)},

  • actor network width: m=O~(R8(1γ)4λ4ϵ4+R6log(1/δ)λ2(1γ)4ϵ4+Mν¯2log(K/δ)ϵ2(1γ)2)m=\tilde{O}\Big{(}\frac{R^{8}}{(1-\gamma)^{4}\lambda^{4}\epsilon^{4}}+\frac{R^{6}\log(1/\delta)}{\lambda^{2}(1-\gamma)^{4}\epsilon^{4}}+\frac{M_{\infty}{\bar{\nu}}^{2}\log(K/\delta)}{\epsilon^{2}(1-\gamma)^{2}}\Big{)},

  • critic sample complexity: T=O(M2R4(1γ)2ϵ4)T^{\prime}=O\Big{(}\frac{M_{\infty}^{2}R^{4}}{(1-\gamma)^{2}\epsilon^{4}}\Big{)},

  • critic network width: m=O~(M2R4log(1/δ)(1γ)2ϵ2)m^{\prime}=\tilde{O}\Big{(}\frac{M_{\infty}^{2}R^{4}\log(1/\delta)}{(1-\gamma)^{2}\epsilon^{2}}\Big{)},

  • actor sample complexity: N=O(M4R4qmax4ϵ4(1γ)4)N=O\Big{(}\frac{M_{\infty}^{4}R^{4}q_{max}^{4}}{\epsilon^{4}(1-\gamma)^{4}}\Big{)}.

Hence, the overall sample complexity of the Neural NAC algorithm is O~(1ϵ5)\tilde{O}\Big{(}\frac{1}{\epsilon^{5}}\Big{)}.

Remark 4 (Bias-variance tradeoff in policy optimization).

By Proposition 2, the network parameters evolve such that

supt0maxi[m]θi(0)θi(t)2Rλm,\sup_{t\geq 0}\max_{i\in[m]}\|\theta_{i}(0)-\theta_{i}(t)\|_{2}\leq\frac{R}{\lambda\sqrt{m}},

and supt0sups,aπt(a|s)<1\sup_{t\geq 0}\sup_{s,a}\pi_{t}(a|s)<1. Hence, the NAC always performs a policy search within the class of randomized policies, which leads to fast and stable convergence under minimal regularity conditions. In particular, Assumption 4 is the mildest distributional mismatch condition in on-policy NPG/NAC settings to the best of our knowledge, and it suffices to establish convergence results in Theorem 2. On the other hand, entropy regularization introduces a bias term controlled by λ\lambda, hence the convergence is in the regularized MDP. Another way to see this is that deterministic policies, which require limtθ(t)2=\lim_{t}\|\theta(t)\|_{2}=\infty, may not be achieved for λ>0\lambda>0 since θ(t)\theta(t) is always contained within a compact set. Letting λ0\lambda\downarrow 0 eliminates the bias, but at the same time reduces the convergence speed and may lead to instability due to lack of exploration. Hence, there is a bias-variance tradeoff in policy optimization, controlled by λ>0\lambda>0.

Remark 5 (Different network widths for actor and critic).

Corollary 1 indicates that the actor network requires O~(1/ϵ4)\tilde{O}(1/\epsilon^{4}) neurons while the critic network requires O~(1/ϵ2)\tilde{O}(1/\epsilon^{2}) although both approximate (soft) state-action value functions. This difference is because the actor network is required to uniformly approximate all state-action value functions over the trajectory, while the critic network approximates (pointwise) a single state-action value function at each iteration.

Remark 6 (Fast initial convergence rate under constant step-sizes).

The second part of Theorem 2 indicates that the convergence rate is eΩ(T)e^{-\Omega(T)} under a constant step-size η(0,1/λ)\eta\in(0,1/\lambda), while there is an additional error term 2ηR22\eta R^{2}. This justifies the common practice of “halving the step-size” in optimization (see, e.g., [56]) for the specific case of natural actor-critic that we investigate: one achieves a fast convergence rate with a constant step-size until the optimization stalls, then the process is repeated after halving the step-size.
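The two step-size schedules of Theorem 2 and the halving heuristic of Remark 6 correspond to the simple rules below (a sketch; the stall test is a placeholder of our own):

```python
def eta_decaying(t, lam):
    """Schedule of Theorem 2(1): eta_t = 1 / (lambda * (t + 1))."""
    return 1.0 / (lam * (t + 1))

def eta_halving(eta, stalled):
    """Remark 6: keep a constant eta in (0, 1/lambda); halve it once progress stalls."""
    return eta / 2.0 if stalled else eta
```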

4.4.2 Performance Bounds under a Strong Distribution Mismatch Condition

In the following, we consider the standard distribution mismatch condition (e.g., in [55, 22]) and establish sample complexity and overparameterization bounds based on Theorem 2, for the unregularized MDP.

Assumption 4’ (Strong distribution mismatch condition).

There exists a constant C~<\tilde{C}_{\infty}<\infty such that

supt0𝔼(s,a)dμπtπt(|s)[(dμπ(s)π(a|s)dμπt(s)πt(a|s))2]C~2.\sup_{t\geq 0}~{}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi_{t}}\otimes\pi_{t}(\cdot|s)}\Big{[}\Big{(}\frac{d_{\mu}^{\pi^{*}}(s)\pi^{*}(a|s)}{d_{\mu}^{\pi_{t}}(s)\pi_{t}(a|s)}\Big{)}^{2}\Big{]}\leq\tilde{C}_{\infty}^{2}. (38)

Note that Assumption 4' implies Assumption 4, and it is a considerably stronger assumption since it requires the policies $\{\pi_{t}:t=0,1,\ldots,T-1\}$ to be sufficiently exploratory throughout policy optimization.

Corollary 2.

Under Assumptions 1-3 and 4', for any $\epsilon>0$ and $\delta\in(0,1/3)$, Algorithm 2 with $R>{\bar{\nu}}$ and $\lambda=O(1/\sqrt{T})$ satisfies:

mint[T]𝔼[(maxπVπ(μ)Vπt(μ))𝟙A]ϵ,\min_{t\in[T]}~{}\mathbb{E}[(\max_{\pi}V^{\pi}(\mu)-V^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\epsilon,

where (A)13δ\mathbb{P}(A)\geq 1-3\delta over the random initialization of the actor-critic networks for the following parameters:

  • iteration complexity: T=O~(R2(1γ)ϵ2)T=\tilde{O}\Big{(}\frac{R^{2}}{(1-\gamma)\epsilon^{2}}\Big{)},

  • actor network width: m=O~(R8(1γ)4ϵ8+R6log(1/δ)(1γ)4ϵ6+C~ν¯2log(K/δ)ϵ2(1γ)2)m=\tilde{O}\Big{(}\frac{R^{8}}{(1-\gamma)^{4}\epsilon^{8}}+\frac{R^{6}\log(1/\delta)}{(1-\gamma)^{4}\epsilon^{6}}+\frac{\tilde{C}_{\infty}{\bar{\nu}}^{2}\log(K/\delta)}{\epsilon^{2}(1-\gamma)^{2}}\Big{)},

  • critic sample complexity: T=O(M~2R4(1γ)2ϵ4)T^{\prime}=O\Big{(}\frac{\tilde{M}_{\infty}^{2}R^{4}}{(1-\gamma)^{2}\epsilon^{4}}\Big{)},

  • critic network width: m=O~(M~2R4log(1/δ)(1γ)2ϵ2)m^{\prime}=\tilde{O}\Big{(}\frac{\tilde{M}_{\infty}^{2}R^{4}\log(1/\delta)}{(1-\gamma)^{2}\epsilon^{2}}\Big{)},

  • actor sample complexity: N=O(M~4R4qmax4ϵ4(1γ)4)N=O\Big{(}\frac{\tilde{M}_{\infty}^{4}R^{4}q_{max}^{4}}{\epsilon^{4}(1-\gamma)^{4}}\Big{)},

where M~=C~(1+πmin1)\tilde{M}_{\infty}=\tilde{C}_{\infty}(1+\pi_{min}^{-1}).

Hence, the overall sample complexity of Neural NAC for finding an ϵ\epsilon-optimal policy of the unregularized MDP is O~(1ϵ6)\tilde{O}\Big{(}\frac{1}{\epsilon^{6}}\Big{)}.

4.5 Comparison With Prior Works

Among the existing works that theoretically investigate policy gradient methods, the most related one is [22], which considers Neural PG/NPG methods equipped with a two-layer neural network. We point out the key differences between our work and these prior works:

  • Prior works do not incorporate entropy regularization. As a result, they need a stronger concentrability coefficient assumption such as Assumption 4’ instead of the weaker Assumption 4 under which we are able to prove our main results.

  • In the proofs in the subsequent section, it will become clear that one needs to uniformly bound the function approximation error in the actor part of our algorithm in order to handle the dependencies between the parameter values across iterations and the NTRF features. We propose new techniques to handle this point, which was not addressed in the prior work.

  • While our algorithm is similar in spirit to the algorithms analyzed in the prior works, we also incorporate a number of important algorithmic ideas that are used in practice (e.g., entropy regularization, averaging, gradient clipping). As a result, we have to use different analysis techniques. As a consequence of these algorithmic and analytical techniques, we obtain considerably sharper sample complexity and overparameterization bounds (see Table 1). Interestingly, all of these algorithmic improvements to the original NAC algorithms seem to be important to obtain the sharper bounds.

  • We employ a symmetric initialization scheme proposed in [48] to ensure that f0(s,a)=0f_{0}(s,a)=0 for all s,as,a despite the random initialization. As a consequence of symmetric initialization, we eliminate the impact of f0f_{0} in the infinite width limit, which is effectively a noise term ϵ0\epsilon_{0} in the performance bounds [57].

Paper Algorithm Width of actor, critic Sample comp. Error Condition Objective
[22] Neural NPG O(1/ϵ12)O(1/\epsilon^{12}), O(1/ϵ12)O(1/\epsilon^{12}) O(1/ϵ14)O(1/\epsilon^{14}) ϵ+ϵ0\epsilon+\epsilon_{0} Strong Unregularized
Ours Neural NAC O~(1/ϵ4)\tilde{O}(1/\epsilon^{4}), O~(1/ϵ2)\tilde{O}(1/\epsilon^{2}) O~(1/ϵ5)\tilde{O}(1/\epsilon^{5}) ϵ\epsilon Weak Regularized
Ours Neural NAC O~(1/ϵ8)\tilde{O}(1/\epsilon^{8}), O~(1/ϵ2)\tilde{O}(1/\epsilon^{2}) O~(1/ϵ6)\tilde{O}(1/\epsilon^{6}) ϵ\epsilon Strong Unregularized
Table 1: The overparameterization and sample complexity bounds for variants of natural policy gradient with neural network approximation.

5 Finite-Time Analysis of Neural NAC

In this section, we provide the convergence analysis of the algorithm.

5.1 Analysis of Neural Network at Initialization

For δ(0,1)\delta\in(0,1) and any R0>0R_{0}>0, let

ρ0(R0,m,δ)=16R0m(R0+log(1/δ)+dlog(m)),\rho_{0}(R_{0},m,\delta)=\frac{16R_{0}}{\sqrt{m}}\Big{(}R_{0}+\sqrt{\log(1/\delta)}+\sqrt{d\log(m)}\Big{)}, (39)

and define

A0={supx:x21R0mi=1m𝟙{|θi(0)x|R0m}ρ0(R0,m,δ)}.A_{0}=\Big{\{}\sup_{x:\|x\|_{2}\leq 1}\frac{R_{0}}{m}\sum_{i=1}^{m}\mathbbm{1}\Big{\{}|\theta_{i}^{\top}(0)x|\leq\frac{R_{0}}{\sqrt{m}}\Big{\}}\leq\rho_{0}(R_{0},m,\delta)\Big{\}}. (40)

The following lemma bounds the deviation of the neural network from its linear approximation around the initialization, and it will be used throughout the convergence analysis.

Lemma 2.

Let θi(0)𝒩(0,Id)\theta_{i}(0)\sim\mathcal{N}(0,I_{d}) for all i[m]i\in[m], θm,R0d(θ(0))\theta\in\mathcal{B}_{m,R_{0}}^{d}(\theta(0)) and θm,R0d(0)\theta^{\prime}\in\mathcal{B}_{m,R_{0}}^{d}(0) for some R0>0R_{0}>0. Then,

supxd:x211mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(0)x|\displaystyle\sup_{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|} ρ0(R0,m,δ),\displaystyle\leq\rho_{0}(R_{0},m,\delta), (41)
supxd:x211mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θix|\displaystyle\sup_{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}x\Big{|} ρ0(R0,m,δ),\displaystyle\leq\rho_{0}(R_{0},m,\delta), (42)
supxd:x211mi=1m|(𝟙{θix0}𝟙{θi(0)x0})xθi|\displaystyle\sup_{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}x^{\top}\theta^{\prime}_{i}\Big{|} ρ0(R0,m,δ),\displaystyle\leq\rho_{0}(R_{0},m,\delta), (43)

under the event A0A_{0} defined in (40), which holds with probability at least 1δ1-\delta over the random initialization of the actor.

Proof.

Let Ωd={xd:x21}\Omega_{d}=\{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1\}. For xΩdx\in\Omega_{d}, let

S(x,θ)={i[m]:𝟙{θix0}𝟙{θi(0)x0}}.S(x,\theta)=\big{\{}i\in[m]:\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}\neq\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\big{\}}.

For any iS(x,θ)i\in S(x,\theta), the following is true:

|θi(0)x|\displaystyle|\theta^{\top}_{i}(0)x| |θi(0)xθix|θiθi(0)2,\displaystyle\leq|\theta^{\top}_{i}(0)x-\theta_{i}^{\top}x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}, (44)

where the first inequality is true since sign(θi(0)x)sign(θix)sign(\theta^{\top}_{i}(0)x)\neq sign(\theta^{\top}_{i}x) and the second inequality follows from Cauchy-Schwarz inequality and xΩdx\in\Omega_{d}. Therefore,

S(x,θ){i[m]:|θi(0)x|θiθi(0)2}.S(x,\theta)\subset\{i\in[m]:|\theta_{i}^{\top}(0)x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}\}.

Since

1mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(0)x|=1miS(x,θ)|θi(0)x|,\displaystyle\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|}=\frac{1}{\sqrt{m}}\sum_{i\in S(x,\theta)}|\theta_{i}^{\top}(0)x|,

we have:

1mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(0)x|1mi=1m𝟙{|θi(0)x|θiθi(0)2}θiθi(0)2.\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|}\leq\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\mathbbm{1}\{|\theta_{i}^{\top}(0)x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}\}\|\theta_{i}-\theta_{i}(0)\|_{2}.

Since maxi[m]θiθi(0)2R0m,\max_{i\in[m]}\|\theta_{i}-\theta_{i}(0)\|_{2}\leq\frac{R_{0}}{\sqrt{m}}, the above inequality leads to the following:

\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|}\leq\frac{R_{0}}{m}\sum_{i=1}^{m}\mathbbm{1}\Big{\{}|\theta_{i}^{\top}(0)x|\leq\frac{R_{0}}{\sqrt{m}}\Big{\}}.

Taking supremum over xΩdx\in\Omega_{d}, and using Lemma 4 in [58] on the RHS of the above inequality concludes the proof.

In order to prove (42), similar to (44), we have the following inequality:

|θix||θixθi(0)x|θiθi(0)2.|\theta_{i}^{\top}x|\leq|\theta_{i}^{\top}x-\theta_{i}^{\top}(0)x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}.

Using this, the proofs of (42) and (43) follow from exactly the same steps. ∎

Note that Lemma 2 is an extension of the concentration bounds in [45, 41, 58] for neural networks. Unlike those results, our concentration bound provides uniform convergence over Ωd={xd:x21}\Omega_{d}=\{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1\} rather than over finitely many points, and is therefore stronger than the concentration bounds used to analyze neural networks in the literature [45, 41]. We need these uniform concentration inequalities to address the challenges arising from the dynamics of policy optimization, e.g., distributional shift.
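As a rough sanity check on the scale of this deviation term, the snippet below (illustrative only) evaluates ρ0(R0, m, δ) from (39) for a few widths, showing its O~(1/√m) decay for a fixed radius R0; the input dimension d and the numerical values of R0 and δ are placeholders.

```python
import math

# Evaluate the deviation bound rho_0(R_0, m, delta) of Eq. (39) for a few widths m.
# d, R_0 and delta below are placeholder values; the point is the ~1/sqrt(m) decay.
def rho0(R0, m, delta, d=10):
    return (16.0 * R0 / math.sqrt(m)) * (
        R0 + math.sqrt(math.log(1.0 / delta)) + math.sqrt(d * math.log(m))
    )

for m in (10**3, 10**4, 10**5, 10**6):
    print(f"m = {m:>8d}   rho_0 = {rho0(R0=5.0, m=m, delta=0.01):.4f}")
```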

5.2 Impact of Entropy Regularization

First, we analyze the impact of entropy regularization, which will yield key results in the convergence analysis.

Proof of Proposition 2.

Recall from Line 11 in Algorithm 3 that the policy update is as follows:

θ(t+1)=θ(t)+ηtutηtλ(θ(t)θ(0)).\theta(t+1)=\theta(t)+\eta_{t}u_{t}-\eta_{t}\lambda(\theta(t)-\theta(0)).

Let θ¯(t)=θ(t)θ(0)\overline{\theta}(t)=\theta(t)-\theta(0) for all t0t\geq 0. Then, the update rule can be written as:

θ¯(t+1)=θ¯(t)(1ηtλ)+ηtut.\overline{\theta}(t+1)=\overline{\theta}(t)(1-\eta_{t}\lambda)+\eta_{t}u_{t}.

Since the step-size is ηtλ=1t+1\eta_{t}\lambda=\frac{1}{t+1}, we have:

θ¯(t+1)=1λ(t+1)k=0tuk,\overline{\theta}(t+1)=\frac{1}{\lambda(t+1)}\sum_{k=0}^{t}u_{k},

by induction. Hence, by triangle inequality:

θ¯i(t+1)2=θi(t+1)θi(0)21λ(t+1)k=0tui,k2,\|\overline{\theta}_{i}(t+1)\|_{2}=\|\theta_{i}(t+1)-\theta_{i}(0)\|_{2}\leq\frac{1}{\lambda(t+1)}\sum_{k=0}^{t}\|u_{i,k}\|_{2}, (45)

for any i[m]i\in[m]. Note that ukm,Rd(0)u_{k}\in\mathcal{B}_{m,R}^{d}(0) as a consequence of projection, therefore ui,kR/m\|u_{i,k}\|\leq R/\sqrt{m} for all i[m]i\in[m]. Hence, by (45), we conclude that

maxi[m]θi(t)θi(0)2Rλm,\max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}\leq\frac{R}{\lambda\sqrt{m}}, (46)

for any t0t\geq 0. Also, since wt=utλ(θ(t)θ(0))w_{t}=u_{t}-\lambda(\theta(t)-\theta(0)), we have:

supt0wt2ut2+λθ(t)θ(0)22R.\sup_{t\geq 0}\|w_{t}\|_{2}\leq\|u_{t}\|_{2}+\lambda\|\theta(t)-\theta(0)\|_{2}\leq 2R. (47)

Under a constant step-size η(0,1/λ)\eta\in(0,1/\lambda), we can expand the parameter movement for any t1t\geq 1 as follows:

\displaystyle\begin{aligned}
\overline{\theta}_{i}(t+1)&=\overline{\theta}_{i}(t)\cdot(1-\eta\lambda)+\eta\cdot u_{i,t},\\
&=\overline{\theta}_{i}(t-1)\cdot(1-\eta\lambda)^{2}+\eta(1-\eta\lambda)u_{i,t-1}+\eta u_{i,t},\\
&\;\;\vdots\\
&=\overline{\theta}_{i}(0)(1-\eta\lambda)^{t+1}+\eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}u_{i,t-k}=\eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}u_{i,t-k},
\end{aligned}

for any neuron i[m]i\in[m]. Then, we have:

θi(t+1)θi(0)2ηk=0t(1ηλ)kui,k2\displaystyle\|\theta_{i}(t+1)-\theta_{i}(0)\|_{2}\leq\eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}\|u_{i,k}\|_{2} Rλm(1(1ηλ)t+1),\displaystyle\leq\frac{R}{\lambda\sqrt{m}}(1-(1-\eta\lambda)^{t+1}),
Rλm,\displaystyle\leq\frac{R}{\lambda\sqrt{m}}, (48)

which follows from the triangle inequality, the bound \|u_{i,k}\|_{2}\leq R/\sqrt{m} due to the projection, the geometric sum \eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}=\frac{1}{\lambda}\big{(}1-(1-\eta\lambda)^{t+1}\big{)}, and the fact that 1-(1-\eta\lambda)^{t+1}\leq 1 for any t\geq 0.

In order to prove the lower bound for inft0,(s,a)𝒮×𝒜πt(a|s)\inf_{t\geq 0,(s,a)\in\mathcal{S}\times\mathcal{A}}\pi_{t}(a|s), first recall that πt(a|s)exp(ft(s,a))\pi_{t}(a|s)\propto\exp(f_{t}(s,a)). Hence, a uniform upper bound on |ft(s,a)||f_{t}(s,a)| over all t0t\geq 0 and (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} suffices to lower bound πt(a|s)\pi_{t}(a|s). By symmetric initialization, f0(s,a)=0f_{0}(s,a)=0 for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Hence,

f_{t}(s,a)=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}[\theta_{i}(t)-\theta_{i}(0)]^{\top}(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\\ +\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}\big{(}\mathbbm{1}\{\theta_{i}^{\top}(t)(s,a)\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\big{)}\theta_{i}^{\top}(t)(s,a). (49)

First, we bound the first summand on the RHS of (49) by using (46) and triangle inequality:

\sup_{s,a}\Big{|}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}[\theta_{i}(t)-\theta_{i}(0)]^{\top}(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\Big{|}\leq\frac{R}{\lambda}, (50)

since |c_{i}|=1, \|(s,a)\|_{2}\leq 1, and the indicator is at most one. For the second term in (49), first note that \max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}\leq\frac{R}{\lambda\sqrt{m}}, so we can use Lemma 2. By the triangle inequality and Lemma 2:

\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\big{(}\mathbbm{1}\{\theta_{i}^{\top}(t)(s,a)\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\big{)}\theta_{i}^{\top}(t)(s,a)\Big{|}\leq\rho_{0}\Big{(}\frac{R}{\lambda},m,\delta\Big{)},

with probability at least 1δ1-\delta over the random initialization of the actor network. Hence, with probability at least 1δ1-\delta,

sups,a|ft(s,a)|R/λ+ρ0(Rλ,m,δ),\sup_{s,a}|f_{t}(s,a)|\leq R/\lambda+\rho_{0}\Big{(}\frac{R}{\lambda},m,\delta\Big{)},

and \pi_{t}(a|s)\geq\frac{1}{|\mathcal{A}|}e^{-\frac{2R}{\lambda}-2\rho_{0}\big{(}\frac{R}{\lambda},m,\delta\big{)}}, which concludes the proof. ∎
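The parameter-drift bound (46) can be checked numerically. The sketch below is only an illustration, with arbitrary placeholder choices of m, d, R, λ and random update directions in place of the actual NAC directions u_t; it simulates the update θ(t+1) = θ(t) + η_t u_t − η_t λ(θ(t) − θ(0)) with η_t λ = 1/(t+1) and verifies that max_i ‖θ_i(t) − θ_i(0)‖_2 never exceeds R/(λ√m).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, R, lam, T = 64, 5, 1.0, 0.1, 500     # placeholder sizes and constants

theta0 = rng.normal(size=(m, d))           # theta_i(0) ~ N(0, I_d)
theta = theta0.copy()

for t in range(T):
    eta = 1.0 / (lam * (t + 1))            # eta_t * lambda = 1/(t+1)
    u = rng.normal(size=(m, d))            # stand-in for the NAC direction u_t
    # projection onto B_{m,R}^d(0): each row (neuron) has norm at most R/sqrt(m)
    row_norms = np.linalg.norm(u, axis=1, keepdims=True)
    u = u * np.minimum(1.0, (R / np.sqrt(m)) / row_norms)
    theta = theta + eta * u - eta * lam * (theta - theta0)

drift = np.linalg.norm(theta - theta0, axis=1).max()
print(f"max_i ||theta_i(T)-theta_i(0)||_2 = {drift:.4f} <= R/(lam*sqrt(m)) = {R/(lam*np.sqrt(m)):.4f}")
```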

5.3 Lyapunov Drift Analysis

First, we present a key lemma which will be used throughout the analysis.

Lemma 3 (Log-linear approximation error).

Let

π~t(a|s)=exp(θf0(s,a)θ(t))a𝒜exp(θf0(s,a)θ(t)),\widetilde{\pi}_{t}(a|s)=\frac{\exp(\nabla_{\theta}^{\top}f_{0}(s,a)\theta(t))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(\nabla_{\theta}^{\top}f_{0}(s,a^{\prime})\theta(t))},

be the log-linear approximation of the policy πt(a|s)\pi_{t}(a|s). Then, for any δ(0,1)\delta\in(0,1), we have:

supt0sups,a|logπ~t(a|s)πt(a|s)|3ρ0(Rλ,m,δ),\sup_{t\geq 0}~{}\sup_{s,a}\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq 3\rho_{0}\Big{(}\frac{R}{\lambda},m,\delta\Big{)}, (51)

over A0A_{0}.

Proof.

Note that f_{t}(s,a)=\nabla^{\top}f_{t}(s,a)\,\theta(t) for a ReLU neural network, by the positive homogeneity of the ReLU activation. By using this, we can write the log-linear approximation error as follows:

|logπ~t(a|s)πt(a|s)||(ft(s,a)f0(s,a))θ(t)|+|logaef0(s,a)θ(t)e(ft(s,a)f0(s,a))θ(t)aef0(s,a)θ(t)|.\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|+\Big{|}\log\frac{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}e^{(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)}}{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}}\Big{|}. (52)

By log-sum inequality (Theorem 2.7.1 in [59]), for any xa,ya>0x_{a},y_{a}>0,

logaxaayaaxaaxalogxaya.\log\frac{\sum_{a}x_{a}}{\sum_{a}y_{a}}\leq\sum_{a}\frac{x_{a}}{\sum_{a^{\prime}}x_{a^{\prime}}}\log\frac{x_{a}}{y_{a}}.

Setting xa=ef0(s,a)θ(t)x_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)} and ya=ef0(s,a)θ(t)e(ft(s,a)f0(s,a))θ(t)y_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)}e^{(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)}, we have:

logaef0(s,a)θ(t)aeft(s,a)θ(t)aπ~t(a|s)|(ft(s,a)f0(s,a))θ(t)|.\log\frac{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}}{\sum_{a^{\prime}}e^{\nabla^{\top}f_{t}(s,a^{\prime})\theta(t)}}\leq\sum_{a^{\prime}}\widetilde{\pi}_{t}(a^{\prime}|s)|(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)|. (53)

Setting ya=ef0(s,a)θ(t)y_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)} and xa=ef0(s,a)θ(t)e(ft(s,a)f0(s,a))θ(t)x_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)}e^{(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)}, we have:

logaeft(s,a)θ(t)aef0(s,a)θ(t)aπt(a|s)|(ft(s,a)f0(s,a))θ(t)|.\log\frac{\sum_{a^{\prime}}e^{\nabla^{\top}f_{t}(s,a^{\prime})\theta(t)}}{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}}\leq\sum_{a^{\prime}}\pi_{t}(a^{\prime}|s)|(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)|. (54)

Using (53) and (54) to bound the last term in (52), we obtain:

|logπ~t(a|s)πt(a|s)||(ft(s,a)f0(s,a))θ(t)|+a[πt(a|s)+π~t(a|s)]|(ft(s,a)f0(s,a))θ(t)|.\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|+\sum_{a^{\prime}}\big{[}\pi_{t}(a^{\prime}|s)+\widetilde{\pi}_{t}(a^{\prime}|s)\big{]}|(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)|. (55)

By Lemma 2, under the event A0A_{0}, |(ft(s,a)f0(s,a))θ(t)|ρ0(R/λ,m,δ)|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|\leq\rho_{0}(R/\lambda,m,\delta) for all t0,s𝒮,a𝒜t\geq 0,s\in\mathcal{S},a\in\mathcal{A}. Hence, under the event A0A_{0}, we have:

supt0sups,a|logπ~t(a|s)πt(a|s)|3ρ0(R/λ,m,δ),\sup_{t\geq 0}~{}\sup_{s,a}~{}\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq 3\rho_{0}(R/\lambda,m,\delta),

which concludes the proof. ∎

The following result is standard in the analysis of policy gradient methods [60, 39].

Lemma 4 (Lemma 5, [39]).

For any θ,θd\theta,\theta^{\prime}\in\mathbb{R}^{d} and any initial state distribution μ\mu, we have:

Vλπθ(μ)Vλπθ(μ)=11γ𝔼sdμπθ,aπθ(|s)[Aλπθ(s,a)+λlogπθ(a|s)πθ(a|s)],V_{\lambda}^{\pi_{\theta}}(\mu)-V_{\lambda}^{\pi_{\theta^{\prime}}}(\mu)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}A_{\lambda}^{\pi_{\theta^{\prime}}}(s,a)+\lambda\log\frac{\pi_{\theta^{\prime}}(a|s)}{\pi_{\theta}(a|s)}\Big{]}, (56)

where AλπθA_{\lambda}^{\pi_{\theta}} is the advantage function defined in (8).

Lemma 4 is an extension of the performance difference lemma in [60], and the proof can be found in [39]. In the following, we provide the main Lyapunov drift, which is central to the proof. This Lyapunov function is widely used in the analysis of natural gradient descent algorithms [14, 17, 22, 39].

Definition 1 (Potential function).

For any policy πΠ\pi\in\Pi, the potential function Ψ\Psi is defined as follows:

Ψ(π)=𝔼sdμπ[DKL(π(|s)π(|s))].\Psi(\pi)=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}D_{KL}\big{(}\pi^{*}(\cdot|s)\|\pi(\cdot|s)\big{)}\Big{]}. (57)
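For a finite state-action space, the potential function (57) can be evaluated directly; the snippet below is a minimal illustration with arbitrary toy distributions (the state distribution and the two policies are placeholders).

```python
import numpy as np

def potential(d_star, pi_star, pi):
    """Psi(pi) = E_{s ~ d_mu^{pi*}}[ KL(pi*(.|s) || pi(.|s)) ] for finite S and A.
    d_star: (S,) state distribution; pi_star, pi: (S, A) row-stochastic matrices."""
    kl_per_state = np.sum(pi_star * (np.log(pi_star) - np.log(pi)), axis=1)
    return float(np.dot(d_star, kl_per_state))

# toy example with 2 states and 3 actions (arbitrary numbers, illustration only)
d_star  = np.array([0.6, 0.4])
pi_star = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
pi_unif = np.full((2, 3), 1.0 / 3.0)
print(f"Psi(uniform policy) = {potential(d_star, pi_star, pi_unif):.4f}")
```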
Lemma 5 (Lyapunov drift).

For any t0t\geq 0, let Δt=Vλπ(μ)Vλπt(μ)\Delta_{t}=V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu). Then,

Ψ(πt+1)Ψ(πt)ηtλΨ(πt)ηt(1γ)Δt+2ηt2R2+ηt𝔼sdμπ,aπt(|s)[f0(s,a)utQλπt(s,a)]ηt𝔼sdμπ,aπ(|s)[f0(s,a)utQλπt(s,a)]+(ηtλ+6)ρ0(R/λ,m,δ)+2ηtRρ0(R/λ,m,δ),\displaystyle\begin{aligned} \Psi(\pi_{t+1})-\Psi(\pi_{t})\leq&-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}+2\eta_{t}^{2}R^{2}\\ &+\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi_{t}(\cdot|s)}\Big{[}\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)\Big{]}\\ &-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)\Big{]}\\ &+(\eta_{t}\lambda+6)\rho_{0}(R/\lambda,m,\delta)+2\eta_{t}R\sqrt{\rho_{0}(R/\lambda,m,\delta)},\end{aligned} (58)

in the event A0A_{0} which holds with probability at least 1δ1-\delta over the random initialization of the actor.

Proof.

First, note that the log-linear approximation of πθ\pi_{\theta} is smooth [17]:

logπ~θ(a|s)logπ~θ(a|s)2θθ2,\|\nabla\log\widetilde{\pi}_{\theta}(a|s)-\nabla\log\widetilde{\pi}_{\theta^{\prime}}(a|s)\|_{2}\leq\|\theta-\theta^{\prime}\|_{2}, (59)

for any s,as,a since f0(s,a)21\|\nabla f_{0}(s,a)\|_{2}\leq 1. Also,

Ψ(πt+1)Ψ(πt)=𝔼sdμπ,aπ(|s)[logπt(a|s)πt+1(a|s)].\Psi(\pi_{t+1})-\Psi(\pi_{t})=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\log\frac{\pi_{t}(a|s)}{\pi_{t+1}(a|s)}\Big{]}.

To exploit the smoothness of the log-linear approximation, we add and subtract the log-linear approximation terms and obtain:

Ψ(πt+1)Ψ(πt)\displaystyle\Psi(\pi_{t+1})-\Psi(\pi_{t}) =𝔼sdμπ,aπ(|s)[logπ~t(a|s)π~t+1(a|s)+logπt(a|s)π~t(a|s)+logπ~t+1(a|s)πt+1(a|s)].\displaystyle=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\log\frac{\widetilde{\pi}_{t}(a|s)}{\widetilde{\pi}_{t+1}(a|s)}+\log\frac{\pi_{t}(a|s)}{\widetilde{\pi}_{t}(a|s)}+\log\frac{\widetilde{\pi}_{t+1}(a|s)}{{\pi}_{t+1}(a|s)}\Big{]}.

By Lemma 3, the last two terms are each bounded by 3\rho_{0}(R/\lambda,m,\delta), which yields the 6\rho_{0}(R/\lambda,m,\delta) contribution in (58). Let

Dt=𝔼sdμπ,aπ(|s)[logπ~t(a|s)π~t+1(a|s)].D_{t}=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\log\frac{\widetilde{\pi}_{t}(a|s)}{\widetilde{\pi}_{t+1}(a|s)}\Big{]}.

Then, by the smoothness of the log-linear approximation, we have:

Dtηt𝔼sdμπ,aπ(|s)θlogπ~t(a|s)wt+ηt2wt222,D_{t}\leq-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\nabla_{\theta}^{\top}\log\widetilde{\pi}_{t}(a|s)w_{t}+\frac{\eta_{t}^{2}\|w_{t}\|_{2}^{2}}{2},

Recall Δt=Vλπ(μ)Vλπt(μ)\Delta_{t}=V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu). Using Lemma 4 and the definition of the advantage function, we obtain:

Dt\displaystyle D_{t} ηtλΨ(πt)ηt(1γ)Δtηt𝔼sdμπ(s),aπt(a|s)[Qλπt(s,a)λlogπt(a|s)]\displaystyle\leq-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}-\eta_{t}\mathbb{E}_{{s\sim d_{\mu}^{\pi^{*}}(s),a\sim\pi_{t}(a|s)}}[Q_{\lambda}^{\pi_{t}}(s,a)-\lambda\log\pi_{t}(a|s)]
ηt𝔼sdμπ,aπ(|s)[logπ~t(a|s)wtqλπt(s,a)]+ηt2wt222.\displaystyle-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}[\nabla^{\top}\log\widetilde{\pi}_{t}(a|s)w_{t}-q_{\lambda}^{\pi_{t}}(s,a)]+\frac{\eta_{t}^{2}\|w_{t}\|_{2}^{2}}{2}. (60)

Since logπ~t(a|s)=f0(s,a)𝔼aπ~t(|s)[f0(s,a)]\nabla\log\widetilde{\pi}_{t}(a|s)=\nabla f_{0}(s,a)-\mathbb{E}_{a^{\prime}\sim\widetilde{\pi}_{t}(\cdot|s)}[\nabla f_{0}(s,a^{\prime})], we obtain the following inequality:

Dt\displaystyle D_{t} ηtλΨ(πt)ηt(1γ)Δt\displaystyle\leq-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}
+ηt𝔼sdμπ(s),aπt(a|s)[f0(s,a)wtQλπt(s,a)+λft(s,a)]\displaystyle+\eta_{t}\mathbb{E}_{{s\sim d_{\mu}^{\pi^{*}}(s),a\sim\pi_{t}(a|s)}}[\nabla^{\top}f_{0}(s,a)w_{t}-Q_{\lambda}^{\pi_{t}}(s,a)+\lambda f_{t}(s,a)]
ηt𝔼sdμπ,aπ(|s)[f0(s,a)wtQλπt(s,a)+λft(s,a)]\displaystyle-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}[\nabla^{\top}f_{0}(s,a)w_{t}-Q_{\lambda}^{\pi_{t}}(s,a)+\lambda f_{t}(s,a)]
+ηt𝔼sdμπ[a𝒜(π~t(a|s)πt(a|s))f0(s,a)wt]+ηt2wt222.\displaystyle+\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}[\sum_{a\in\mathcal{A}}(\widetilde{\pi}_{t}(a|s)-\pi_{t}(a|s))\nabla^{\top}f_{0}(s,a)w_{t}]+\frac{\eta_{t}^{2}\|w_{t}\|_{2}^{2}}{2}.

By the definition of wt=utλ[θ(t)θ(0)]w_{t}=u_{t}-\lambda[\theta(t)-\theta(0)] and the fact that f0(s,a)=0f_{0}(s,a)=0 due to the symmetric initialization, we have:

f0(s,a)wt=f0(s,a)utλf0(s,a)θ(t).\nabla^{\top}f_{0}(s,a)w_{t}=\nabla^{\top}f_{0}(s,a)u_{t}-\lambda\nabla^{\top}f_{0}(s,a)\theta(t).

Substituting this identity into the above inequality, we have:

DtηtλΨ(πt)ηt(1γ)Δt+ηt𝔼sdμπ(s),aπt(a|s)[f0(s,a)utQλπt(s,a)]ηt𝔼sdμπ,aπ(|s)[f0(s,a)utQλπt(s,a)]+2ηtR𝔼sdμπ[a|π~t(a|s)πt(a|s)|]+ηtλ𝔼sdμπ[a𝒜|πt(a|s)π(a|s)|(ft(s,a)f0(s,a))θ(t)]+2ηt2R2,\displaystyle\begin{aligned} D_{t}&\leq-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}\\ &+\eta_{t}\mathbb{E}_{{s\sim d_{\mu}^{\pi^{*}}(s),a\sim\pi_{t}(a|s)}}[\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)]\\ &-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}[\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)]\\ &+2\eta_{t}R\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}\sum_{a}|\widetilde{\pi}_{t}(a|s)-\pi_{t}(a|s)|\Big{]}\\ &+\eta_{t}\lambda\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}\sum_{a\in\mathcal{A}}|\pi_{t}(a|s)-\pi^{*}(a|s)|\cdot(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)\Big{]}+2\eta_{t}^{2}R^{2},\end{aligned} (61)

where we used (47) to bound wt2\|w_{t}\|_{2}. Furthermore, note that

|(ft(s,a)f0(s,a))θ(t)|1mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(t)x|,|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|\leq\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(t)x\Big{|},

for any x=(s,a)dx=(s,a)^{\top}\in\mathbb{R}^{d}. Thus, by Lemma 2,

sups,a|(ft(s,a)f0(s,a))θ(t)|ρ0(R/λ,m,δ).\sup_{s,a}|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|\leq\rho_{0}(R/\lambda,m,\delta).

This bounds the penultimate term in (61). Finally, in order to bound the fifth term in (61), we use Pinsker’s inequality and then Lemma 3:

sups𝒮π~t(|s)πt(|s)1sups𝒮aπt(a|s)logπt(a|s)π~t(a|s)ρ0(R/λ,m,δ).\sup_{s\in\mathcal{S}}\|\widetilde{\pi}_{t}(\cdot|s)-\pi_{t}(\cdot|s)\|_{1}\leq\sup_{s\in\mathcal{S}}\sqrt{\sum_{a}\pi_{t}(a|s)\log\frac{\pi_{t}(a|s)}{\widetilde{\pi}_{t}(a|s)}}\leq\sqrt{\rho_{0}(R/\lambda,m,\delta)}.

Substituting these bounds into (61) yields (58), which is the desired result. ∎

5.4 Analysis of the Function Approximation Error: How Do Neural Networks Address Distributional Shift in Policy Optimization?

A distinctive feature of reinforcement learning, and of policy optimization in particular, is that the probability distribution of the underlying system changes over time as a function of the control policy. Consequently, the function approximator (i.e., the actor network in our case) needs to adapt to this distributional shift throughout the policy optimization steps. In this subsection, we analyze the function approximation error, which sheds light on how neural networks in the NTK regime address the distributional shift challenge.

Now we focus on the approximation error in Lemma 5:

ϵbiasπt=𝔼sdμ[a𝒜(πt(a|s)π(a|s))(f0(s,a)utQλπt(s,a))].\epsilon_{bias}^{\pi_{t}}=\mathbb{E}_{s\sim d_{\mu}^{*}}\Big{[}\sum_{a\in\mathcal{A}}\big{(}\pi_{t}(a|s)-\pi^{*}(a|s)\big{)}\big{(}\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)\big{)}\Big{]}. (62)

Note that ϵbiasπt\epsilon_{bias}^{\pi_{t}} can be equivalently expressed as follows:

ϵbiasπt=𝔼sdμ[a𝒜(πt(a|s)π(a|s))(logπt(s,a)utΞλπt(s,a))]+𝔼sdμπ[a𝒜(πt(a|s)π(a|s))([f0(s,a)ft(s,a)]ut)],\epsilon_{bias}^{\pi_{t}}=\mathbb{E}_{s\sim d_{\mu}^{*}}\Big{[}\sum_{a\in\mathcal{A}}\big{(}\pi_{t}(a|s)-\pi^{*}(a|s)\big{)}\big{(}\nabla^{\top}\log\pi_{t}(s,a)u_{t}-\Xi_{\lambda}^{\pi_{t}}(s,a)\big{)}\Big{]}\\ +\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}\sum_{a\in\mathcal{A}}\big{(}\pi_{t}(a|s)-\pi^{*}(a|s)\big{)}\Big{(}\big{[}\nabla f_{0}(s,a)-\nabla f_{t}(s,a)\big{]}^{\top}u_{t}\Big{)}\Big{]},

where Ξλπ\Xi_{\lambda}^{\pi} is the soft advantage function. The above identity provides intuition about the choice of sample-based gradient update utu_{t} in Algorithm 2, which we will investigate in detail later.

Let

L_{0}(u,\theta)=\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}[(\nabla^{\top}f_{0}(s,a)u-Q_{\lambda}^{\pi_{\theta}}(s,a))^{2}].

We now answer the following question: given perfect knowledge of the soft Q-function QλπθQ_{\lambda}^{\pi_{\theta}}, what is the minimum approximation error minuL0(u,θ(t))\min_{u}L_{0}(u,{\theta(t)})?

Proposition 3 (Approximation Error).

Under symmetric initialization of the actor network, we have the following results:

• Pointwise approximation error: For any θΘ\theta\in\Theta and Qλπθν¯Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{\bar{\nu}},

𝔼[minum×dL0(u,θ)]4ν¯2m,\mathbb{E}\left[\min_{u\in\mathbb{R}^{m\times d}}L_{0}(u,\theta)\right]\leq\frac{4{\bar{\nu}}^{2}}{m}, (63)

where the expectation is over the random initialization of the actor network.

• Uniform approximation error: Let

A1={sups,aθΘminu|f0(s,a)uQλπθ(s,a)|2ν¯m((dlog(m))14+log(Kδ))}.A_{1}=\Big{\{}\sup_{\begin{subarray}{c}s,a\\ \theta\in\Theta\end{subarray}}~{}\min_{u}|\nabla^{\top}f_{0}(s,a)u-Q_{\lambda}^{{\pi_{\theta}}}(s,a)|\leq\frac{2{\bar{\nu}}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}\Big{\}}.

Then, under Assumption 3, A1A_{1} holds with probability at least 1δ1-\delta over the random initialization of the actor network. Furthermore,

𝔼[𝟙A0A1supθminuL0(u,θ)]4ν¯2m((dlog(m))14+log(Kδ))2.\mathbb{E}\Big{[}\mathbbm{1}_{A_{0}\cap A_{1}}\sup_{\theta}\min_{u}L_{0}(u,\theta)\Big{]}\leq\frac{4{\bar{\nu}}^{2}}{m}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}^{2}. (64)
Proof.

For a given policy parameter θΘ\theta\in\Theta, let the transportation mapping of QλπθQ_{\lambda}^{\pi_{\theta}} be vθv_{\theta} and let

Yiθ(s,a)=vθ(θi(0))(s,a)𝟙{θi(0)(s,a)0},i[m].Y_{i}^{\theta}(s,a)=v_{\theta}^{\top}(\theta_{i}(0))(s,a)\cdot\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\},~{}i\in[m].

Note that 𝔼[Yiθ(s,a)]=Qλπθ(s,a)\mathbb{E}[Y_{i}^{\theta}(s,a)]=Q_{\lambda}^{\pi_{\theta}}(s,a) for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Also, let

uθ=[1mcivθ(θi(0))]i[m].u_{\theta}^{*}=\Big{[}\frac{1}{\sqrt{m}}c_{i}v_{\theta}(\theta_{i}(0))\Big{]}_{i\in[m]}.

Since uθm,ν¯d(0)u_{\theta}^{*}\in\mathcal{B}_{m,{\bar{\nu}}}^{d}(0) for all θΘ\theta\in\Theta, projected risk minimization within m,Rd(0)\mathcal{B}_{m,R}^{d}(0) for Rν¯R\geq{\bar{\nu}} suffices for optimality. We have

f0(s,a)uθ=1mi=1mYiθ(s,a),\nabla^{\top}f_{0}(s,a)u_{\theta}^{*}=\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a),

and

minuL0(u,θ)𝔼sdμπθ,aπθ(|s)[(1mi=1mYiθ(s,a)𝔼[Y1θ(s,a)])2].\min_{u}L_{0}(u,\theta)\leq\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim{\pi_{\theta}}(\cdot|s)}\Big{[}\Big{(}\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a)-\mathbb{E}[Y_{1}^{\theta}(s,a)]\Big{)}^{2}\Big{]}. (65)

1. Pointwise approximation error: First we consider a given fixed θΘ\theta\in\Theta. Taking the expectation in (65) and using Fubini’s theorem,

𝔼[minuL0(u,θ)]\displaystyle\mathbb{E}[\min_{u}L_{0}(u,\theta)] 𝔼s,a𝔼[(1mi=1mYiθ(s,a)𝔼[Y1θ(s,a)])2],\displaystyle\leq\mathbb{E}_{s,a}\mathbb{E}\Big{[}\Big{(}\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a)-\mathbb{E}[Y_{1}^{\theta}(s,a)]\Big{)}^{2}\Big{]},
\displaystyle\leq 2\mathbb{E}_{s,a}\Big{[}\frac{4}{m^{2}}\sum_{i=1}^{m/2}Var(Y_{i}^{\theta}(s,a))+\frac{4}{m^{2}}\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{m/2}Cov(Y_{i}^{\theta}(s,a),Y_{j}^{\theta}(s,a))\Big{]}, (66)
\displaystyle=4\mathbb{E}_{s,a}\Big{[}\frac{Var(Y_{1}^{\theta}(s,a))}{m}\Big{]}, (67)
\displaystyle\leq\frac{4}{m}\mathbb{E}_{s,a}\mathbb{E}[(Y_{1}^{\theta}(s,a))^{2}], (68)

where the inequality (66) is due to the symmetric initialization, and (67) holds because {Yiθ(s,a):i=1,2,,m/2}\{Y_{i}^{\theta}(s,a):i=1,2,\ldots,m/2\} are independent (since {θi(0):i[m/2]}\{\theta_{i}(0):i\in[m/2]\} are independent), so the covariance terms vanish. By the Cauchy-Schwarz inequality and the fact that vθν¯v_{\theta}\in\mathcal{H}_{{\bar{\nu}}}, we have:

|Yiθ(s,a)|vθ(θi(0))2ν¯.|Y_{i}^{\theta}(s,a)|\leq\|v_{\theta}(\theta_{i}(0))\|_{2}\leq{\bar{\nu}}.

Hence, using this in (68), we obtain:

\mathbb{E}\min_{u}L_{0}(u,\theta)\leq\frac{4{\bar{\nu}}^{2}}{m}. (69)

2. Uniform approximation error: For any θΘ\theta\in\Theta, since QλπθK,ν¯,𝒱Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}} there exists αθ=(α1θ,α2θ,,αKθ)K\alpha^{\theta}=(\alpha_{1}^{\theta},\alpha_{2}^{\theta},\ldots,\alpha_{K}^{\theta})\in\mathbb{R}^{K} such that αθ11\|\alpha^{\theta}\|_{1}\leq 1 and vθ=kαkθvkv_{\theta}=\sum_{k}\alpha_{k}^{\theta}v_{k}. We consider the following error:

Rm(Θ)=sup(s,a)𝒮×𝒜supθΘ|1mi=1mYiθ(s,a)𝔼[Y1θ(s,a)]|.R_{m}(\Theta)=\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}~{}\sup_{\theta\in\Theta}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a)-\mathbb{E}[Y_{1}^{\theta}(s,a)]\Big{|}. (70)

Then, we have the following identity from the definition of vθv_{\theta}:

Rm(Θ)=sups,asupθ|k=1Kαkθ(1mi=1mZik(s,a)𝔼[Z1k(s,a)])|,R_{m}(\Theta)=\sup_{s,a}~{}\sup_{\theta}~{}\Big{|}\sum_{k=1}^{K}\alpha_{k}^{\theta}\cdot\Big{(}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{)}\Big{|}, (71)

where Zik(s,a)=vk(θi(0))(s,a)𝟙{θi(0)(s,a)0}Z_{i}^{k}(s,a)=v_{k}^{\top}(\theta_{i}(0))(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}. Then, by the triangle inequality and Hölder’s inequality,

Rm(Θ)\displaystyle R_{m}(\Theta) sups,asupθmaxk[K]|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|αθ1,\displaystyle\leq\sup_{s,a}~{}\sup_{\theta}~{}\max_{k\in[K]}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}\cdot\|\alpha^{\theta}\|_{1},
maxk[K]sups,a|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|.\displaystyle\leq\max_{k\in[K]}~{}\sup_{s,a}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}. (72)

By using union bound and (72), for any z>0z>0, we have the following:

(Rm(Θ)>z)k=1K(sups,a|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|>z).\mathbb{P}(R_{m}(\Theta)>z)\leq\sum_{k=1}^{K}\mathbb{P}\Big{(}\sup_{s,a}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}>z\Big{)}. (73)

We utilize the following lemma to obtain a uniform bound for |1mi=1mZik(s,a)𝔼[Z1k(s,a)]||\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]| over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}.

Lemma 6.

For any k[K]k\in[K], for any δ(0,1)\delta\in(0,1), the following holds:

sups,a|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|4ν¯(dlogm)1/4m+4ν¯log(1/δ)m,\sup_{s,a}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}\leq\frac{4{\bar{\nu}}(d\log m)^{1/4}}{\sqrt{m}}+\frac{4{\bar{\nu}}\sqrt{\log(1/\delta)}}{\sqrt{m}}, (74)

with probability at least 1δ1-\delta.

Hence, using Lemma 6 and (73) with z=4ν¯(dlogm)1/4m+4ν¯log(K/δ)mz=\frac{4{\bar{\nu}}(d\log m)^{1/4}}{\sqrt{m}}+\frac{4{\bar{\nu}}\sqrt{\log(K/\delta)}}{\sqrt{m}}, we conclude that

Rm(Θ)4ν¯(dlogm)1/4m+4ν¯log(K/δ)m,R_{m}(\Theta)\leq\frac{4{\bar{\nu}}(d\log m)^{1/4}}{\sqrt{m}}+\frac{4{\bar{\nu}}\sqrt{\log(K/\delta)}}{\sqrt{m}},

with probability at least 1δ1-\delta. The expectation result follows from this inequality. ∎
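The 1/m scaling behind the pointwise bound can also be seen in a toy Monte Carlo experiment: averaging m/2 i.i.d. bounded terms, duplicated as under the symmetric initialization, gives a mean-squared deviation of order 1/m, as in (65)–(68). The snippet below is illustrative only and uses a synthetic bounded random variable in place of Y_1^θ(s,a).

```python
import numpy as np

rng = np.random.default_rng(2)
nu_bar = 1.0            # plays the role of the bound |Y_i| <= nu_bar from the proof
trials = 2000

for m in (100, 400, 1600, 6400):
    # m/2 i.i.d. bounded draws; the other m/2 are exact copies under symmetric initialization,
    # so averaging over m/2 draws equals averaging over all m terms.
    Y = rng.uniform(-nu_bar, nu_bar, size=(trials, m // 2))
    mse = np.mean(Y.mean(axis=1) ** 2)   # Monte Carlo estimate of E[(1/m sum_i Y_i - E[Y_1])^2]
    print(f"m = {m:>5d}   MC mean-squared error = {mse:.2e}   bound 4*nu^2/m = {4*nu_bar**2/m:.2e}")
```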

Now, we have the following result for the approximation error under πt\pi_{t}.

Corollary 3.

Under Assumption 3, we have:

𝔼[𝟙A0A1minuL0(u,θ(t))]16ν¯2m((dlog(m))14+log(Kδ))2,\mathbb{E}[\mathbbm{1}_{A_{0}\cap A_{1}}\min_{u}~{}L_{0}(u,{\theta(t)})]\leq\frac{16\bar{\nu}^{2}}{m}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}^{2},

where the event A1A_{1}, defined in Proposition 3, holds with probability at least 1δ1-\delta over the random initialization of the actor.

Remark 7 (Why do we need a uniform approximation error bound?).

Note that for a given fixed policy πθ,θΘ{\pi_{\theta}},\theta\in\Theta, Proposition 3 provides a sharp pointwise approximation error bound as long as Qλπθν¯Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{\bar{\nu}} with a corresponding transportation map vθv_{\theta}. In order for this result to hold, the terms vθ(θi(0))(s,a)𝟙{θi(0)(s,a)0}v_{\theta}^{\top}(\theta_{i}(0))(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\} are required to be i.i.d. across i[m/2]i\in[m/2], which is the main idea behind the random initialization schemes for the NTK analysis. On the other hand, in policy optimization, θ(t)\theta(t) depends on the initialization θ(0)\theta(0), therefore the terms vθ(t)(θi(0))(s,a)𝟙{θi(0)(s,a)0}v_{\theta(t)}^{\top}(\theta_{i}(0))(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\} are not independent across i – hence, Cov(Yiθ(s,a),Yjθ(s,a))0Cov(Y_{i}^{\theta}(s,a),Y_{j}^{\theta}(s,a))\neq 0 for iji\neq j in (66). Furthermore, the distribution of (s,a)(s,a) at time t>0t>0 also depends on π0\pi_{0}. Therefore, the pointwise approximation error bound cannot be used to provide an approximation guarantee for QλπtQ_{\lambda}^{\pi_{t}} under the entropy-regularized NAC. In the existing works, this important issue regarding the temporal correlation and its impact on the NTK analysis was not addressed. In this work, we utilize the uniform approximation error bound provided in Proposition 3 to address this issue.

In the absence of QλπθQ_{\lambda}^{\pi_{\theta}}, the critic yields a noisy estimate Q¯λπθ\overline{Q}_{\lambda}^{\pi_{\theta}}. Additionally, since dμπθd_{\mu}^{\pi_{\theta}} is not known a priori, samples {(sn,an)dμπθπθ(|s):n0}\{(s_{n},a_{n})\sim d_{\mu}^{\pi_{\theta}}\circ{\pi_{\theta}}(\cdot|s):n\geq 0\} are used to obtain the update utu_{t}. These two factors are the sources of error in the natural actor-critic method: minuL0(u,θ(t))L0(ut,θ(t))\min_{u}L_{0}(u,{\theta(t)})\leq L_{0}(u_{t},{\theta(t)}). In the following, we quantify this error and show that:

  1. Increasing the number of SGD iterations, NN,

  2. Increasing the representation power of the actor network in terms of the width mm,

  3. Low mean-squared Bellman error in the critic (by large m,Tm^{\prime},T^{\prime}),

lead to vanishing error.

First, we study the error introduced by using SGD for solving:

L^t(u,θ(t))=𝔼sdμπt,aπt(|s)[(θlogπt(s,a)uΞ^λπt(s,a))2].\widehat{L}_{t}(u,{\theta(t)})=\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}},a\sim\pi_{t}(\cdot|s)}\Big{[}\Big{(}\nabla_{\theta}^{\top}\log\pi_{t}(s,a)u-\widehat{\Xi}_{\lambda}^{\pi_{t}}(s,a)\Big{)}^{2}\Big{]}. (75)
Proposition 4 (Theorem 14.8 in [61]).

Algorithm 2 (Lines 3-9) with step-size αA=R/qmaxN\alpha_{A}=R/\sqrt{q_{max}N} yields the following result:

𝔼[L^t(ut,θ(t))]minuL^t(u,θ(t))RqmaxN,\mathbb{E}[\widehat{L}_{t}(u_{t},{\theta(t)})]-\min_{u}\widehat{L}_{t}(u,{\theta(t)})\leq\frac{Rq_{max}}{\sqrt{N}}, (76)

for any Rν¯R\geq{\bar{\nu}} where the expectation is over the random samples {(sn,an):n[N]}\{(s_{n},a_{n}):n\in[N]\}.
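Since Algorithm 2 itself is not reproduced in this section, the following is only a generic sketch of projected SGD with iterate averaging for a regression objective of the form (75), using the step-size α_A = R/√(q_max N) that appears in Proposition 4; the feature map, targets, and numerical parameters below are placeholders rather than the actual NAC quantities.

```python
import numpy as np

def projected_sgd_least_squares(features, targets, R, q_max):
    """Generic sketch: projected SGD with averaging for min_{||u|| <= R} E[(phi^T u - y)^2],
    with step-size alpha_A = R / sqrt(q_max * N) as in Proposition 4."""
    N, dim = features.shape
    alpha = R / np.sqrt(q_max * N)
    u, u_avg = np.zeros(dim), np.zeros(dim)
    for n in range(N):
        phi, y = features[n], targets[n]
        grad = 2.0 * (phi @ u - y) * phi        # stochastic gradient of the squared loss
        u = u - alpha * grad
        norm = np.linalg.norm(u)
        if norm > R:                            # Euclidean projection onto the R-ball
            u *= R / norm
        u_avg += u / N
    return u_avg

# toy usage with synthetic data (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8)) / np.sqrt(8)
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=2000)
u_hat = projected_sgd_least_squares(X, y, R=5.0, q_max=4.0)
print("mean squared error:", float(np.mean((X @ u_hat - y) ** 2)))
```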

The following proposition provides an error bound in terms of the statistical error for finding the optimum utu_{t} via SGD as well as TD-learning error in estimating the soft Q-function. Let

Lt(u,θ(t))=𝔼sdμπt,aπt(|s)[(θlogπt(s,a)uΞλπt(s,a))2].L_{t}(u,{\theta(t)})=\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}},a\sim\pi_{t}(\cdot|s)}\Big{[}\Big{(}\nabla_{\theta}^{\top}\log\pi_{t}(s,a)u-{\Xi}_{\lambda}^{\pi_{t}}(s,a)\Big{)}^{2}\Big{]}. (77)
Proposition 5.

Let A=A0A1A2A=A_{0}\cap A_{1}\cap A_{2}, hence (A)13δ\mathbb{P}(A)\geq 1-3\delta. We have the following inequality:

𝔼[𝟙ALt(ut,θ(t))]8minu{𝟙AL0(u,θ(t))+𝔼[𝟙A|[f0(s,a)ft(s,a)]u|2]}+2RqmaxN+6𝔼[𝟙A|Qλπt(s,a)Q¯λπt(s,a)|2],\mathbb{E}[\mathbbm{1}_{A}L_{t}(u_{t},{\theta(t)})]\leq 8\min_{u}\Big{\{}\mathbbm{1}_{A}L_{0}(u,{\theta(t)})+\mathbb{E}\big{[}\mathbbm{1}_{A}\big{|}[\nabla f_{0}(s,a)-\nabla f_{t}(s,a)]^{\top}u\big{|}^{2}\big{]}\Big{\}}+\frac{2Rq_{max}}{\sqrt{N}}\\ +6\mathbb{E}[\mathbbm{1}_{A}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}], (78)

where the expectation is over the samples for critic (TD learning) and actor (SGD) updates. Consequently, we have:

𝔼[Lt(ut,θ(t))]3𝔼[minum,Rd(0)L0(u,θ(t))]+2RqmaxN14+3ρ0(R/λ,m,δ)+3𝔼𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2],\mathbb{E}[\sqrt{L_{t}(u_{t},{\theta(t)})}]\leq 3\sqrt{\mathbb{E}[\min_{u\in\mathcal{B}_{m,R}^{d}(0)}L_{0}(u,{\theta(t)})]}+\frac{2\sqrt{Rq_{max}}}{N^{\frac{1}{4}}}+3\rho_{0}(R/\lambda,m,\delta)\\ +3\mathbb{E}\sqrt{\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}}, (79)

under the event AA.

Proof.

We extensively use the inequality (x+y)^{2}\leq 2x^{2}+2y^{2} for x,yx,y\in\mathbb{R}. First, note that

L^t(u,θ(t))2Lt(u,θ(t))+2𝔼s,adt[|Qλπt(s,a)Q¯λπt(s,a)|2],\widehat{L}_{t}(u,{\theta(t)})\leq 2L_{t}(u,{\theta(t)})+2\mathbb{E}_{s,a\sim d_{t}}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}, (80)

for any um,Rd(0)u\in\mathcal{B}_{m,R}^{d}(0). Hence, under A=A0A1A2A=A_{0}\cap A_{1}\cap A_{2}, we have:

𝔼[Lt(ut,θ(t))]\displaystyle\mathbb{E}[L_{t}(u_{t},\theta(t))] 2𝔼[L^t(ut,θ(t))]+2𝔼[|Qλπt(s,a)Q¯λπt(s,a)|2],\displaystyle\leq 2\mathbb{E}[\widehat{L}_{t}(u_{t},\theta(t))]+2\mathbb{E}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}, (81)
2minum,Rd(0)L^t(u,θ(t))+2𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2]+2RqmaxN1/2,\displaystyle\leq 2\min_{u\in\mathcal{B}_{m,R}^{d}(0)}\widehat{L}_{t}(u,\theta(t))+2\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}+\frac{2{Rq_{max}}}{N^{1/2}}, (82)
4minum,Rd(0)Lt(u,θ(t))+6𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2]+2RqmaxN1/2,\displaystyle\leq 4\min_{u\in\mathcal{B}_{m,R}^{d}(0)}L_{t}(u,\theta(t))+6\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}+\frac{2{Rq_{max}}}{N^{1/2}}, (83)

where the second line follows from Prop. 4 and the last line follows from (80). Consequently, we have:

𝔼[Lt(ut,θ(t))]8minum,Rd(0){𝔼[(f0(s,a)uQλπt(s,a))2]+𝔼[|(ft(s,a)f0(s,a))u|2]}+6𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2]+2RqmaxN1/2,\mathbb{E}[L_{t}(u_{t},\theta(t))]\leq 8\min_{u\in\mathcal{B}_{m,R}^{d}(0)}\Big{\{}\mathbb{E}[(\nabla^{\top}f_{0}(s,a)u-Q_{\lambda}^{\pi_{t}}(s,a))^{2}]+\mathbb{E}[|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}u|^{2}]\Big{\}}\\ +6\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}+\frac{2{Rq_{max}}}{N^{1/2}}, (84)

where we use (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2} and the following inequality:

𝔼[(logπt(a|s)uΞλπt(s,a))2]\displaystyle\mathbb{E}[(\nabla^{\top}\log\pi_{t}(a|s)u-\Xi_{\lambda}^{\pi_{t}}(s,a))^{2}] =𝔼sdμπtVar(ft(s,a)uQλπt(s,a)),\displaystyle=\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}}}Var(\nabla^{\top}f_{t}(s,a)u-Q_{\lambda}^{\pi_{t}}(s,a)),
𝔼[(ft(s,a)uQλπt(s,a))2].\displaystyle\leq\mathbb{E}[(\nabla^{\top}f_{t}(s,a)u-Q_{\lambda}^{\pi_{t}}(s,a))^{2}].

Using (84) and Theorem 2 in [49] together with the inequality x+y+zx+y+z\sqrt{x+y+z}\leq\sqrt{x}+\sqrt{y}+\sqrt{z} for x,y,z>0x,y,z>0, we obtain (79). ∎

Hence, we obtain the following bound on the approximation error ϵbiasπt\epsilon_{bias}^{\pi_{t}}.

Corollary 4 (Approximation Error).

Under Assumptions 1-4, we have the following bound on the approximation error:

𝔼[𝟙Aϵbiasπt]M[8ν¯m((dlog(m))14+log(Kδ))+2RqmaxN14+4ε],\mathbb{E}[\mathbbm{1}_{A}\cdot\epsilon_{bias}^{\pi_{t}}]\leq M_{\infty}\Big{[}\frac{8{\bar{\nu}}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}+\frac{2\sqrt{Rq_{max}}}{N^{\frac{1}{4}}}+4\varepsilon\Big{]},

where A=A0A1A2A=A_{0}\cap A_{1}\cap A_{2}, m=O~(ν¯4(1γ)2ε2)m^{\prime}=\widetilde{O}\Big{(}\frac{\bar{\nu}^{4}}{(1-\gamma)^{2}\varepsilon^{2}}\Big{)} and T=O((1+2ν¯)2ν¯2ε4)T^{\prime}=O\Big{(}\frac{(1+2\bar{\nu})^{2}\bar{\nu}^{2}}{\varepsilon^{4}}\Big{)} and (A)13δ\mathbb{P}(A)\geq 1-3\delta.

Proof.

In order to prove Corollary 4, we substitute the results of Corollary 3 and Theorem 1 into (79). ∎

The main message of Corollary 4 is as follows: in order to eliminate the bias introduced by (i) function approximation and (ii) sample-based estimation for the actor and critic, one should employ more representation power in both the actor and critic networks (via mm and mm^{\prime}), and also use more samples in the actor and critic updates (via NN and TT^{\prime}). Furthermore, Corollary 4 quantifies the required network widths and sample complexities to achieve a desired bias ϵ>0\epsilon>0.

In the following subsection, we finally prove Theorem 2 by using the Lyapunov drift result (Lemma 5) and the approximation error bound (Corollary 4).

5.5 Convergence of Entropy-Regularized Natural Actor-Critic

Proof of Theorem 2.

In the following, we prove the first part of Theorem 2; the second part follows from identical steps with a constant step-size η(0,1/λ)\eta\in(0,1/\lambda). First, note that Lemma 5 implies the following bound:

𝟙A[Ψ(πt+1)Ψ(πt)]ηtλΨ(πt)𝟙Aηt(1γ)Δt𝟙A+2ηt2R2+ηt𝟙Aϵbiasπt+(ηtλ+6)ρ0(R/λ,m,δ)+2ηtRρ0(R/λ,m,δ).\displaystyle\begin{aligned} \mathbbm{1}_{A}\Big{[}\Psi(\pi_{t+1})-\Psi(\pi_{t})\Big{]}&\leq-\eta_{t}\lambda\Psi(\pi_{t})\mathbbm{1}_{A}-\eta_{t}(1-\gamma)\Delta_{t}\mathbbm{1}_{A}+2\eta_{t}^{2}R^{2}+\eta_{t}\mathbbm{1}_{A}\epsilon_{bias}^{\pi_{t}}\\ &+(\eta_{t}\lambda+6)\rho_{0}(R/\lambda,m,\delta)+2\eta_{t}R\sqrt{\rho_{0}(R/\lambda,m,\delta)}.\end{aligned} (85)

By Corollary 4,

𝔼[𝟙Aϵbiasπt]M[ρ1+2RqmaxN1/4+4ε]=:ϵbias,\mathbb{E}[\mathbbm{1}_{A}\epsilon_{bias}^{\pi_{t}}]\leq M_{\infty}\Big{[}\rho_{1}+\frac{2\sqrt{Rq_{max}}}{N^{1/4}}+4\varepsilon\Big{]}=:\epsilon_{bias},

where

ρ1=16ν¯m((dlog(m))14+log(K/δ)).\rho_{1}=\frac{16\bar{\nu}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log(K/\delta)}\Big{)}.

Let Ψ¯t:=𝔼[Ψ(πt)𝟙A]\overline{\Psi}_{t}:=\mathbb{E}[\Psi(\pi_{t})\mathbbm{1}_{A}]. Then,

Ψ¯t+1Ψ¯t\displaystyle\overline{\Psi}_{t+1}-\overline{\Psi}_{t} ηtλΨ¯tηt(1γ)𝔼[𝟙AΔt]+2ηt2R2+ηtϵbias\displaystyle\leq-\eta_{t}\lambda\overline{\Psi}_{t}-\eta_{t}(1-\gamma)\mathbb{E}[\mathbbm{1}_{A}\Delta_{t}]+2\eta_{t}^{2}R^{2}+\eta_{t}\epsilon_{bias}
+7ρ0(R/λ,m,δ)+2ηtRρ0(R/λ,m,δ).\displaystyle+7\rho_{0}(R/\lambda,m,\delta)+2\eta_{t}R\sqrt{\rho_{0}(R/\lambda,m,\delta)}.

Since ηt=1λ(t+1)\eta_{t}=\frac{1}{\lambda(t+1)}, by induction,

\displaystyle\begin{aligned}
\overline{\Psi}_{t+1}&\leq(1-\eta_{t}\lambda)\overline{\Psi}_{t}-\eta_{t}(1-\gamma)\mathbb{E}[\mathbbm{1}_{A}\Delta_{t}]+2\eta_{t}^{2}R^{2}+\eta_{t}\Big{(}\epsilon_{bias}+2R\sqrt{\rho_{0}(R/\lambda,m,\delta)}\Big{)}+7\rho_{0}(R/\lambda,m,\delta),\\
&\leq-\frac{(1-\gamma)}{\lambda(t+1)}\sum_{k\leq t}\mathbb{E}[\mathbbm{1}_{A}\Delta_{k}]+\frac{1}{\lambda}\Big{(}\epsilon_{bias}+2R\sqrt{\rho_{0}(R/\lambda,m,\delta)}\Big{)}+\frac{2R^{2}\log(t+1)}{\lambda^{2}(t+1)}+4(t+1)\rho_{0}(R/\lambda,m,\delta).
\end{aligned}

Hence,

min0t<T𝔼[𝟙AΔt]11γ(ϵbias+2Rρ0(R/λ,m,δ))+2R2(1+logT)λ(1γ)T+4Tλ(1γ)ρ0(R/λ,m,δ),\displaystyle\begin{aligned} \min_{0\leq t<T}\mathbb{E}[\mathbbm{1}_{A}\Delta_{t}]\leq\frac{1}{1-\gamma}\Big{(}\epsilon_{bias}+2R\sqrt{\rho_{0}(R/\lambda,m,\delta)}\Big{)}&+\frac{2R^{2}(1+\log T)}{\lambda(1-\gamma)T}+\frac{4T\lambda}{(1-\gamma)}\rho_{0}(R/\lambda,m,\delta),\end{aligned} (86)

which concludes the proof. ∎
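The induction above boils down to a simple scalar recursion. The snippet below (illustrative only, with the approximation and linearization error terms set to zero and placeholder constants) iterates Ψ̄_{t+1} ≤ (1 − η_t λ)Ψ̄_t + 2η_t² R² with η_t λ = 1/(t+1) and compares the result to the 2R²(1 + log T)/(λ² T) term appearing in the proof.

```python
import math

# Scalar version of the drift recursion from the proof of Theorem 2, keeping only the
# 2*eta_t^2*R^2 term (approximation/linearization errors set to zero; constants are placeholders).
lam, R, T = 0.5, 1.0, 10_000
psi = 1.0                              # Psi-bar_0
for t in range(T):
    eta = 1.0 / (lam * (t + 1))        # eta_t * lambda = 1/(t+1)
    psi = (1.0 - eta * lam) * psi + 2.0 * eta**2 * R**2

bound = 2.0 * R**2 * (1.0 + math.log(T)) / (lam**2 * T)
print(f"Psi-bar_T = {psi:.3e}   vs   2R^2(1+log T)/(lam^2 T) = {bound:.3e}")
```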

6 Conclusion

In this paper, we established global convergence of the two-timescale entropy-regularized NAC algorithm with neural network approximation. We observed that entropy regularization leads to significantly improved sample complexity and overparameterization bounds under weaker conditions since it (i) encourages exploration and (ii) controls the movement of the neural network parameters. We characterized the bias due to function approximation and sample-based estimation, and showed that overparameterization and an increasing sample size eliminate this bias.

In practice, single-timescale natural policy gradient methods are predominantly used in conjunction with entropy regularization and off-policy sampling [15]. The analysis techniques that we develop in this paper can be used to analyze these algorithms.

In supervised learning, softmax parameterization is predominantly used for multiclass classification problems, where natural gradient descent is employed for a better adjustment to the problem geometry [50, 62, 63]. The techniques that we developed in this paper can be useful in establishing convergence results and understanding the role of entropy regularization as well.

Acknowledgements

S. Ç. would like to thank Siddhartha Satpathi for his help in the proof of Lemma 6. This work is supported in part by NSF TRIPODS grant CCF-1934986.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [2] C. Szepesvári, “Algorithms for reinforcement learning,” Synthesis lectures on artificial intelligence and machine learning, vol. 4, no. 1, pp. 1–103, 2010.
  • [3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming.   Athena Scientific, 1996.
  • [4] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
  • [5] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation.” in NIPs, vol. 99.   Citeseer, 1999, pp. 1057–1063.
  • [6] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems.   Citeseer, 2000, pp. 1008–1014.
  • [7] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2016, pp. 1928–1937.
  • [8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [9] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Trust-pcl: An off-policy trust region method for continuous control,” arXiv preprint arXiv:1707.01891, 2017.
  • [10] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International conference on machine learning.   PMLR, 2016, pp. 1329–1338.
  • [11] S.-I. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
  • [12] S. M. Kakade, “A natural policy gradient,” Advances in neural information processing systems, vol. 14, 2001.
  • [13] S. Bhatnagar, M. Ghavamzadeh, M. Lee, and R. S. Sutton, “Incremental natural actor-critic algorithms,” Advances in neural information processing systems, vol. 20, pp. 105–112, 2007.
  • [14] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1861–1870.
  • [16] Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” in International Conference on Machine Learning.   PMLR, 2019, pp. 151–160.
  • [17] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “Optimality and approximation with policy gradient methods in markov decision processes,” in Conference on Learning Theory.   PMLR, 2020, pp. 64–66.
  • [18] J. Bhandari and D. Russo, “Global optimality guarantees for policy gradient methods,” arXiv preprint arXiv:1906.01786, 2019.
  • [19] G. Lan, “Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes,” arXiv preprint arXiv:2102.00135, 2021.
  • [20] S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi, “Fast global convergence of natural policy gradient methods with entropy regularization,” arXiv preprint arXiv:2007.06558, 2020.
  • [21] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, “On the global convergence rates of softmax policy gradient methods,” in International Conference on Machine Learning.   PMLR, 2020, pp. 6820–6829.
  • [22] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global optimality and rates of convergence,” arXiv preprint arXiv:1909.01150, 2019.
  • [23] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning.   PMLR, 2014, pp. 387–395.
  • [24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning.   PMLR, 2015, pp. 1889–1897.
  • [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [26] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, “Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model,” Advances in Neural Information Processing Systems, vol. 33, pp. 741–752, 2020.
  • [27] S. Khodadadian, P. R. Jhunjhunwala, S. M. Varma, and S. T. Maguluri, “On the linear convergence of natural policy gradient algorithm,” arXiv preprint arXiv:2105.01424, 2021.
  • [28] L. Shani, Y. Efroni, and S. Mannor, “Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5668–5675.
  • [29] W. Zhan, S. Cen, B. Huang, Y. Chen, J. D. Lee, and Y. Chi, “Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence,” arXiv preprint arXiv:2105.11066, 2021.
  • [30] W. Yang, X. Li, G. Xie, and Z. Zhang, “Finding the near optimal policy via adaptive reduced regularization in mdps,” arXiv preprint arXiv:2011.00213, 2020.
  • [31] S. Khodadadian, Z. Chen, and S. T. Maguluri, “Finite-sample analysis of off-policy natural actor-critic algorithm,” in International Conference on Machine Learning.   PMLR, 2021, pp. 5420–5431.
  • [32] Z. Chen, S. Khodadadian, and S. T. Maguluri, “Finite-sample analysis of off-policy natural actor-critic with linear function approximation,” IEEE Control Systems Letters, 2022.
  • [33] ——, “Finite-sample analysis of off-policy natural actor-critic with linear function approximation,” arXiv preprint arXiv:2105.12540, 2021.
  • [34] T. Xu, Z. Wang, and Y. Liang, “Improving sample complexity bounds for actor-critic algorithms,” arXiv preprint arXiv:2004.12956, 2020.
  • [35] J. Zhang, C. Ni, C. Szepesvari, M. Wang et al., “On the convergence and sample efficiency of variance-reduced policy gradient method,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [36] H. Kumar, A. Koppel, and A. Ribeiro, “On the sample complexity of actor-critic method for reinforcement learning with function approximation,” arXiv preprint arXiv:1910.08412, 2019.
  • [37] Y. F. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite-time analysis of two time-scale actor-critic methods,” Advances in Neural Information Processing Systems, vol. 33, pp. 17617–17628, 2020.
  • [38] S. Qiu, Z. Yang, J. Ye, and Z. Wang, “On finite-time convergence of actor-critic algorithm,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 2, pp. 652–664, 2021.
  • [39] S. Cayci, N. He, and R. Srikant, “Linear convergence of entropy-regularized natural policy gradient with linear function approximation,” arXiv preprint arXiv:2106.04096, 2021.
  • [40] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” arXiv preprint arXiv:1806.07572, 2018.
  • [41] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” in International Conference on Learning Representations, 2018.
  • [42] S. Arora, S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” in International Conference on Machine Learning.   PMLR, 2019, pp. 322–332.
  • [43] Z. Ji, M. Telgarsky, and R. Xian, “Neural tangent kernels, transportation mappings, and universal approximation,” in International Conference on Learning Representations, 2019.
  • [44] S. Oymak and M. Soltanolkotabi, “Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 84–105, 2020.
  • [45] Z. Ji and M. Telgarsky, “Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks,” arXiv preprint arXiv:1909.12292, 2019.
  • [46] Z. Fu, Z. Yang, and Z. Wang, “Single-timescale actor-critic provably finds globally optimal policy,” in International Conference on Learning Representations, 2020.
  • [47] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
  • [48] Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” arXiv preprint arXiv:1910.01619, 2019.
  • [49] S. Cayci, S. Satpathi, N. He, and R. Srikant, “Sample complexity and overparameterization bounds for temporal difference learning with neural network approximation,” arXiv preprint arXiv:2103.01391, 2021.
  • [50] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [51] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in Conference on Learning Theory.   PMLR, 2019, pp. 2803–2830.
  • [52] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification, and adaptive control.   SIAM, 2015.
  • [53] L. Chizat, E. Oyallon, and F. Bach, “On lazy training in differentiable programming,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [54] E. Kreyszig, Introductory functional analysis with applications.   John Wiley & Sons, 1991, vol. 17.
  • [55] B. Liu, Q. Cai, Z. Yang, and Z. Wang, “Neural proximal/trust region policy optimization attains globally optimal policy,” arXiv preprint arXiv:1906.10306, 2019.
  • [56] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2016, pp. 795–811.
  • [57] M. Telgarsky, “Deep learning theory lecture notes,” https://mjt.cs.illinois.edu/dlt/, 2021, version: 2021-10-27 v0.0-e7150f2d (alpha).
  • [58] S. Satpathi, H. Gupta, S. Liang, and R. Srikant, “The role of regularization in overparameterized neural networks,” in 2020 59th IEEE Conference on Decision and Control (CDC).   IEEE, 2020, pp. 4683–4688.
  • [59] T. M. Cover and J. A. Thomas, “Elements of information theory (wiley series in telecommunications and signal processing),” 2006.
  • [60] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in In Proc. 19th International Conference on Machine Learning.   Citeseer, 2002.
  • [61] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.   Cambridge university press, 2014.
  • [62] R. Pascanu and Y. Bengio, “Revisiting natural gradient for deep networks,” arXiv preprint arXiv:1301.3584, 2013.
  • [63] G. Zhang, J. Martens, and R. Grosse, “Fast convergence of natural gradient descent for overparameterized neural networks,” arXiv preprint arXiv:1905.10961, 2019.
  • [64] V. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Measures of Complexity, vol. 16, no. 2, p. 11, 1971.

Appendix A Proof of Lemma 6

Consider gν¯g\in\mathcal{F}_{\bar{\nu}} with a corresponding transportation map vν¯v\in\mathcal{H}_{\bar{\nu}}. Using Cauchy-Schwarz inequality,

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|2\displaystyle\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}
𝔼supx:x211mi=1mv(θi(0))𝟙{θi(0)x0}𝔼v(θi(0))𝟙{θi(0)x0}2.\displaystyle\leq\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{\|}\frac{1}{m}\sum_{i=1}^{m}v(\theta_{i}(0))\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}-\mathbb{E}v(\theta_{i}(0))\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{\|}^{2}.

Define bi:=v(θi(0))𝟙{θi(0)x0}.b_{i}:=v(\theta_{i}(0))\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}. Define a class BB containing all possible values taken by b:={bi}i=1mb:=\{b_{i}\}_{i=1}^{m} over {x:x21}\{x:\|x\|_{2}\leq 1\} for a fixed θ(0).\theta(0). Further, using Cauchy-Schwarz inequality,

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|2\displaystyle\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}
𝔼supbB1m2ijm(bi𝔼bi)(bj𝔼bj)+4ν¯2m.\displaystyle\leq\mathbb{E}\sup_{b\in B}\frac{1}{m^{2}}\sum_{i\neq j}^{m}(b_{i}-\mathbb{E}b_{i})^{\top}(b_{j}-\mathbb{E}b_{j})+\frac{4{\bar{\nu}}^{2}}{m}.

Using the symmetrization argument with Rademacher random variables σij\sigma_{ij}’s,

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|24𝔼θ(0)𝔼rsupbB1m2ijmσijbibj+4ν¯2m.\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}\leq 4\mathbb{E}_{\theta(0)}\mathbb{E}_{r}\sup_{b\in B}\frac{1}{m^{2}}\sum_{i\neq j}^{m}\sigma_{ij}b_{i}^{\top}b_{j}+\frac{4{\bar{\nu}}^{2}}{m}.

Note that given θ(0),\theta(0), BB is a finite set. We apply Massart’s finite class lemma to obtain:

𝔼r[supb:bB1m2ijmσijbibj|θ(0)]ijmv(θi(0))2v(θj(0))22log|B|m2\displaystyle\mathbb{E}_{r}\Big{[}\sup_{b:b\in B}\frac{1}{m^{2}}\sum_{i\neq j}^{m}\sigma_{ij}b_{i}^{\top}b_{j}|\theta(0)\Big{]}\leq\sqrt{\sum_{i\neq j}^{m}\|v(\theta_{i}(0))\|^{2}\|v(\theta_{j}(0))\|^{2}}\frac{\sqrt{2\log|B|}}{m^{2}}

We calculate |B||B| using VC theory. Each element bib_{i} of bb partitions the space {x1}d\{\|x\|\leq 1\}\subset\mathbb{R}^{d} into two half-spaces, where one half takes the value v(θi(0))v(\theta_{i}(0)) and the other half takes the value 0.0. Hence, the number of possible values taken by bb over the space {x1}\{\|x\|\leq 1\} equals the number of components in the partition generated by the mm hyperplanes corresponding to {bi}i=1m\{b_{i}\}_{i=1}^{m}. The number of such components is bounded by md+1+1m^{d+1}+1 using the growth function defined in [64]. Hence |B|m2d|B|\leq m^{2d}, and the following holds:

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|212ν¯2dlogmm.\displaystyle\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}\leq\frac{12{\bar{\nu}}^{2}\sqrt{d\log m}}{m}.

The result follows from Jensen’s inequality.