
Finite-time analysis of entropy-regularized neural natural actor-critic algorithm

Semih Çaycı (CSL, University of Illinois at Urbana-Champaign, scayci@illinois.edu), Niao He (Department of Computer Science, ETH Zurich, niao.he@inf.ethz.ch), R. Srikant (ECE and CSL, University of Illinois at Urbana-Champaign, rsrikant@illinois.edu)
Abstract

Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large (potentially continuous) state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) in achieving provably good performance in terms of sample complexity, iteration complexity and overparameterization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies, and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDP, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of the uniform approximation power of the actor neural network for achieving global optimality under the distributional shift inherent in policy optimization.

1 Introduction

In reinforcement learning (RL), an agent aims to find an optimal policy that maximizes the expected total reward in a Markov decision process (MDP) by interacting with an unknown and dynamical environment [1, 2, 3]. Policy gradient methods [4, 5, 6], which employ first-order optimization methods to find the best policy within a parametric policy class, have demonstrated impressive success in numerous complicated RL problems. The success largely benefits from the versatility of policy gradient methods in accommodating a rich class of function approximation schemes [7, 8, 9, 10].

Natural policy gradient (NPG), natural actor-critic (NAC) and their variants, which use the Fisher information matrix as a pre-conditioner for the gradient updates [11, 12, 13, 14], are particularly popular because of their impressive empirical performance in practical applications. In practice, NPG/NAC methods are further combined with (a) neural network approximation for high representation power of both the actor and the critic, and (b) entropy regularization for stability and sufficient exploration, leading to remarkable performance in complicated control tasks that involve large state-action spaces [15, 9, 16].

Despite the empirical successes, a strong theoretical understanding of policy gradient methods, especially when boosted with function approximation and entropy regularization, appears to be in a nascent stage. Recently, there has been a plethora of theoretical attempts to understand the convergence properties of policy gradient methods and the role of entropy regularization [17, 18, 19, 20, 21]. These works predominantly study the tabular setting, where a parallelism between the well-known policy iteration and policy gradient methods can be exploited to establish the convergence results. But for the more intriguing function approximation regime, especially with neural network approximation, little theory is known. Two of the main challenges come from the highly nonconvex nature of the problem when using neural network approximation for both the actor and the critic, and the complex exploration dynamics.

In this paper, we provide the first non-asymptotic analysis of an entropy-regularized natural actor-critic (NAC) method in which we use two separate two-layer neural networks for the actor and the critic, and employ a learning scheme based on approximate natural policy gradient updates to achieve optimality. We show that the expressive power of these neural networks provides the ability to achieve optimality within a broad class of policies.

1.1 Main Contributions

We elaborate on some of our contributions below.

  • Sharp sample complexity, convergence rate and overparameterization bounds: We prove sharp convergence guarantees in terms of sample complexity, iteration complexity and network width. In particular, we prove that the NAC method with an adaptive step-size achieves a sharp $\tilde{O}(1/\epsilon)$ iteration complexity and $\tilde{O}(1/\epsilon^{5})$ sample complexity to achieve an $\epsilon$-gap with the optimal policy of the regularized MDP, under the mildest distribution mismatch conditions to the best of our knowledge. The required network widths for the actor and the critic are $\tilde{O}(1/\epsilon^{4})$ and $\tilde{O}(1/\epsilon^{2})$, respectively. Under the standard distribution mismatch assumption as in [22], our sample complexity bound for the unregularized MDP is $\tilde{O}(1/\epsilon^{6})$, which improves the existing bounds significantly.

  • Stable policy optimization: Existing works on policy gradient methods with neural network approximation assume that the policies perform sufficient exploration to avoid instability, i.e., convergence to near-deterministic and strictly suboptimal stationary policies. In this paper, we prove that policy optimization is stabilized by incorporating entropy regularization, gradient clipping and averaging. In particular, we show that the combination of these methods leads to a “persistence of excitation” condition, which ensures sufficient exploration to avoid near-deterministic and strictly suboptimal stationary policies. Consequently, we prove convergence to the globally optimal policy under the mildest concentrability coefficient assumption for on-policy NAC, to the best of our knowledge.

  • Understanding the dynamics of neural network approximation in policy optimization: Our analysis reveals that the uniform approximation power of the actor network in approximating Q-functions throughout the policy optimization steps is crucial to ensure global (near-)optimality; this need arises because reinforcement learning, unlike a static supervised learning problem, induces a distributional shift over time. To that end, we establish high-probability bounds for a two-layer feedforward actor neural network to uniformly approximate the Q-functions of the policy iterates during training.

1.2 Related Work

Policy gradient and actor-critic: Policy gradient methods use a gradient-based scheme to find the optimal policy [4, 5]. [12] proposed the natural gradient method, which uses the Fisher information matrix as a pre-conditioner to better fit the problem geometry. The actor-critic method, which learns approximations to both state-action value functions and policies for variance reduction, was introduced in [6].

Neural actor-critic methods: Recently, there has been a surge of interest in direct policy optimization methods for solving MDPs with large state spaces by exploiting the representation power of deep neural networks. In particular, deterministic policy gradient [23], trust region policy optimization (TRPO) [24], proximal policy optimization (PPO) [25] and soft actor-critic (SAC) [15, 26] have achieved impressive empirical success in solving complicated control tasks.

Role of regularization: Entropy regularization is an essential part of policy optimization algorithms (e.g., TRPO, PPO and SAC) to encourage exploration and achieve fast and stable convergence. It has been numerically observed that entropy regularization leads to a smoother optimization landscape, which in turn improves the convergence properties of policy optimization [16]. For tabular reinforcement learning, the impact of entropy regularization was studied in [17, 20, 21]. On the other hand, the function approximation regime leads to considerably different dynamics compared to the tabular setting, mainly because of the generalization over a large state space, complex exploration dynamics and distributional shift. As such, the role of regularization is very different in the function approximation regime, which we study in this paper.

Theoretical analysis of policy optimization methods: Despite the vast literature on the practical performance of PG/AC/NAC type algorithms, their theoretical understanding has remained elusive until recently. In the tabular setting, global convergence rates for PG methods were established in [17, 18, 27]. By incorporating entropy regularization, it was shown in [28, 19, 20, 29, 30] that the convergence rate can be improved significantly in the tabular setting. Finite-time performances of off-policy actor-critic methods in the tabular and linear function approximation regimes were investigated in [31, 32]. In our paper, we consider neural network approximation under entropy regularization with on-policy sampling.

On the other hand, when the controller employs a function approximator for the purpose of generalization to a large state-action space, the convergence properties of policy optimization methods change radically due to the more complicated optimization landscape and the distribution mismatch phenomenon in reinforcement learning [17]. Under strong assumptions on the exploratory behavior of policies throughout the learning iterations, global optimality of NPG with linear function approximation, up to a function approximation error, was established in [17]. For actor-critic and natural actor-critic methods with linear function approximation, finite-time analyses are given in [33, 34, 35]. For general actor schemes with a linear critic, convergence to stationary points was investigated in [36, 37, 38].

By incorporating entropy regularization, it was shown in [39] that, with linear function approximation, improved convergence rates can be established under much weaker conditions on the underlying controlled Markov chain. Our paper uses results from the drift analysis in [39], but addresses the complications due to the nonlinearity introduced by the ReLU activation functions, and establishes global convergence to the optimal policies. The neural network approximation eliminates the function approximation error, which is a constant in the linear function approximation setting, by employing a sufficiently wide actor neural network.

Neural network analysis: The empirical success of neural networks, which have more parameters than the number of training samples, has been theoretically explained in [40, 41, 42], where it was shown that overparameterized neural networks trained by using first-order optimization methods achieve good generalization properties. The need for massive overparameterization was addressed in [43, 44], and it was shown that considerably smaller network widths can suffice to achieve good training and generalization results in structured supervised learning problems. Our analysis in this work is mainly inspired by [43]. On the other hand, the reinforcement learning problem has significantly different and more challenging dynamics than the supervised learning setting, since actor-critic solves a dynamic optimization problem in which distributional shift occurs as the policies are updated. As such, the uniform approximation power of the actor network in approximating various functions throughout the policy optimization steps becomes critical, unlike in the supervised learning setting of [45]. Our analysis utilizes tools from [45, 43]: (i) we consider the max-norm geometry to achieve mild overparameterization, and (ii) we bound the distance between the neural tangent kernel (NTK) function class and the class of functions realizable by a finite-width neural network by extending the ReLU analysis in [43].

The most relevant work in the literature is [22], where the convergence of NAC with a two-layer neural network was studied without entropy regularization. It was shown that, under strong assumptions on the exploratory behavior of the policies throughout the trajectory, neural NPG achieves $\epsilon$-optimality with $O(1/\epsilon^{14})$ sample complexity and $O(1/\epsilon^{12})$ network width bounds. In this paper, we incorporate widely-used algorithmic techniques (entropy regularization, averaging and gradient clipping) into NAC with neural network approximation, and prove significantly improved sample complexity and overparameterization bounds under weaker assumptions on the concentrability coefficients. Additionally, our analysis reveals that the uniform approximation power of the actor neural network is critically important for establishing global optimality, where distributional shift plays a crucial role (see Section 5.4). In another relevant work, [46] considers a single-timescale actor-critic with neural network approximation, but the function approximation error was not investigated due to the realizability assumption, which requires that all policies throughout the policy optimization steps are realizable by the neural network. One of the main goals of our work is to study the benefits of employing neural networks in policy optimization, and we explicitly characterize the function class and the approximation error that stems from the use of finite-width neural networks.

1.3 Notation

For a sequence of numbers $\{x_{i}:i\in I\}$ where $I$ is an index set, $[x_{i}]_{i\in I}$ denotes the vector obtained by concatenating $x_{i},i\in I$. For a set $A$, $|A|$ denotes its cardinality. For two distributions $P,Q$ defined over the same probability space, the Kullback-Leibler divergence is $D_{KL}(P\|Q)=\mathbb{E}_{s\sim P}\big[\log\frac{P(s)}{Q(s)}\big]$. For a convex set $C\subset\mathbb{R}^{d}$ and $x\in\mathbb{R}^{d}$, $\mathcal{P}_{C}(x)=\arg\min_{y\in C}\|x-y\|_{2}$ denotes the projection of $x$ onto $C$. For $n\in\mathbb{Z}^{+}$, $[n]=\{1,2,\ldots,n\}$. For $d,m\in\mathbb{N}$, $R>0$ and $v\in\mathbb{R}^{m\times d}$, we denote

\mathcal{B}_{m,R}^{d}(v)=\Big\{y\in\mathbb{R}^{m\times d}:\sup_{i\in[m]}\|v_{i}-y_{i}\|_{2}\leq\frac{R}{\sqrt{m}}\Big\},

where $v_{i}$ denotes the $i^{\rm th}$ row of $v$.
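The projection onto $\mathcal{B}_{m,R}^{d}(v)$ used by both the actor and the critic acts row by row. Below is a minimal numpy sketch of this row-wise projection; the function name is ours and the snippet is purely illustrative.

```python
import numpy as np

def project_rowwise_ball(y, v, R):
    """Project y onto B_{m,R}^d(v): clip each row y_i to the ball of
    radius R / sqrt(m) centered at the corresponding row v_i."""
    m = y.shape[0]
    radius = R / np.sqrt(m)
    diff = y - v
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return v + diff * scale

# Example: m = 4 hidden units, d = 3 features
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
y = v + rng.normal(size=(4, 3))
y_proj = project_rowwise_ball(y, v, R=1.0)
assert np.all(np.linalg.norm(y_proj - v, axis=1) <= 1.0 / np.sqrt(4) + 1e-9)
```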

2 Background and Problem Setting

In this section, we introduce the problem setting, the natural actor-critic method, as well as the entropy regularization and neural network approximation schemes that we consider.

2.1 Markov Decision Processes

We consider a discounted Markov decision process $(\mathcal{S},\mathcal{A},P,r,\gamma)$ where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P$ is an (unknown) transition kernel, $r:\mathcal{S}\times\mathcal{A}\rightarrow[0,r_{max}]$ with $0<r_{max}<\infty$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. In this work, we consider a state space $\mathcal{S}$ and a finite action space $\mathcal{A}$ such that $\mathcal{S}\times\mathcal{A}\subset\mathbb{R}^{d}$. Also, we assume that, by appropriate representation of the state and action variables, the following bound holds:

sup(s,a)𝒮×𝒜(s,a)21,\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\|(s,a)\|_{2}\leq 1, (1)

throughout the paper.

Value function: A randomized policy $\pi$ assigns to each state $s\in\mathcal{S}$ a probability $\pi(a|s)$ of taking each action $a\in\mathcal{A}$. A policy $\pi$ induces a trajectory by specifying $a_{t}\sim\pi(\cdot|s_{t})$ and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$. For any $s_{0}\in\mathcal{S}$, the corresponding value function of a policy $\pi$ is as follows:

Vπ(s0)=𝔼[t=0γtr(st,at)|s0],V^{\pi}(s_{0})=\mathbb{E}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}\Big{]}, (2)

where atπ(|st)a_{t}\sim\pi(\cdot|s_{t}) and st+1P(|st,at)s_{t+1}\sim P(\cdot|s_{t},a_{t}).

Entropy regularization: In policy optimization, in order to encourage exploration and avoid near-deterministic suboptimal policies, entropy regularization is commonly used in practice [8, 15, 9, 16]. For a policy π\pi, let

Hπ(s0)=𝔼[t=0γt(π(|st))|s0],H^{\pi}(s_{0})=\mathbb{E}\Big{[}\sum_{t=0}^{\infty}\gamma^{t}\mathcal{H}\big{(}\pi(\cdot|s_{t})\big{)}\Big{|}s_{0}\Big{]}, (3)

where (π(|s))=a𝒜π(a|s)log(π(a|s))\mathcal{H}(\pi(\cdot|s))=-\sum_{a\in\mathcal{A}}\pi(a|s)\log\big{(}\pi(a|s)\big{)} is the entropy functional. Then, for λ>0\lambda>0, the entropy-regularized value function is defined as follows:

Vλπ(s0)=Vπ(s0)+λHπ(s0).V_{\lambda}^{\pi}(s_{0})=V^{\pi}(s_{0})+\lambda H^{\pi}(s_{0}). (4)

Note that π𝟎(|)=1/|𝒜|\pi_{\mathbf{0}}(\cdot|\cdot)=1/|\mathcal{A}| maximizes the regularizer Hπ(s0)H^{\pi}(s_{0}) for any s0𝒮s_{0}\in\mathcal{S}. Hence, the additional λHπ(s0)\lambda H^{\pi}(s_{0}) term in (4) encourages exploration while introducing some bias controlled by λ>0\lambda>0.

Entropy-regularized objective: For a given initial state distribution μ\mu and for a given regularization parameter λ>0\lambda>0, the objective in this paper is to maximize the entropy-regularized value function:

maxπVλπ(μ):=𝔼s0μ[Vλπ(s0)].\max_{\pi}V_{\lambda}^{\pi}(\mu):=\mathbb{E}_{s_{0}\sim\mu}\left[V_{\lambda}^{\pi}(s_{0})\right]. (5)

We denote the optimal policy for the regularized MDP as π\pi^{*} throughout the paper.

Q-function and advantage function: The entropy-regularized (or soft) Q-function qλπ(s,a)q_{\lambda}^{\pi}(s,a) is defined as:

qλπ(s,a)=𝔼[k=0γk(r(sk,ak)λlogπ(ak|sk))|s0=s,a0=a].\displaystyle\begin{aligned} q_{\lambda}^{\pi}(s,a)&=\mathbb{E}\Big{[}\sum_{k=0}^{\infty}\gamma^{k}\big{(}r(s_{k},a_{k})-\lambda\log\pi(a_{k}|s_{k})\big{)}\Big{|}s_{0}=s,a_{0}=a\Big{]}.\end{aligned} (6)

Note that qλπq_{\lambda}^{\pi} is the fixed point of the Bellman equation q(s,a)=𝒯πq(s,a)q(s,a)=\mathcal{T}^{\pi}q(s,a) where the Bellman operator 𝒯π\mathcal{T}^{\pi} is defined as:

𝒯πq(s,a)=r(s,a)λlogπ(a|s)+γ𝔼sP(|s,a),aπ(|s)[q(s,a)].\mathcal{T}^{\pi}q(s,a)=r(s,a)-\lambda\log\pi(a|s)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a),a^{\prime}\sim\pi(\cdot|s^{\prime})}[q(s^{\prime},a^{\prime})].

As we will see, for NAC algorithms, the following function, called the soft Q-function under a policy π,\pi, turns out to be a useful quantity [20]:

Qλπ(s,a)=r(s,a)+γ𝔼sP(|s,a)[Vλπ(s)].Q_{\lambda}^{\pi}(s,a)=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[V_{\lambda}^{\pi}(s^{\prime})\right]. (7)

Note that the two Q-functions are related as follows:

qλπ(s,a)=Qλπ(s,a)λlogπ(a|s).q_{\lambda}^{\pi}(s,a)=Q_{\lambda}^{\pi}(s,a)-\lambda\log\pi(a|s).

The advantage function under a policy π\pi is defined as follows:

Aλπ(s,a)=qλπ(s,a)Vλπ(s).\displaystyle\begin{aligned} A_{\lambda}^{\pi}(s,a)&=q_{\lambda}^{\pi}(s,a)-V_{\lambda}^{\pi}(s).\end{aligned} (8)

Similarly, the soft advantage function is defined as follows:

Ξλπ(s,a)=Qλπ(s,a)a𝒜π(a|s)Qλπ(s,a).\Xi_{\lambda}^{\pi}(s,a)=Q_{\lambda}^{\pi}(s,a)-\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s)Q_{\lambda}^{\pi}(s,a^{\prime}). (9)

Lastly, we can bound the entropy-regularized value function as follows:

0Vλπ(μ)rmax+λlog|𝒜|1γ,0\leq V_{\lambda}^{\pi}(\mu)\leq\frac{r_{max}+\lambda\log|\mathcal{A}|}{1-\gamma}, (10)

for any λ>0,π\lambda>0,\pi since r[0,rmax]r\in[0,r_{max}] and (p)log|𝒜|\mathcal{H}(p)\leq\log|\mathcal{A}| for any distribution pp over 𝒜\mathcal{A} [20].
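For intuition, all of the regularized quantities above can be computed exactly on a small finite MDP by iterating the soft Bellman operator $\mathcal{T}^{\pi}$ from this section. The following sketch is our own toy construction (not from the paper) and computes $q_{\lambda}^{\pi}$, $Q_{\lambda}^{\pi}$ and $\Xi_{\lambda}^{\pi}$ for a fixed policy:

```python
import numpy as np

def soft_q_functions(P, r, pi, gamma, lam, iters=2000):
    """P: (S,A,S) transition kernel, r: (S,A) rewards, pi: (S,A) policy.
    Iterates the soft Bellman operator T^pi from Section 2.1 until (near-)convergence."""
    S, A, _ = P.shape
    q = np.zeros((S, A))
    log_pi = np.log(pi)
    for _ in range(iters):
        v = np.sum(pi * q, axis=1)                   # V_lambda^pi(s') = sum_a' pi(a'|s') q(s',a')
        q = r - lam * log_pi + gamma * P @ v         # q <- T^pi q
    Q = q + lam * log_pi                             # Q_lambda^pi = q_lambda^pi + lam * log pi
    Xi = Q - np.sum(pi * Q, axis=1, keepdims=True)   # soft advantage, eq. (9)
    return q, Q, Xi

# Tiny 2-state, 2-action example with a uniform policy
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])
pi = np.full((2, 2), 0.5)
q, Q, Xi = soft_q_functions(P, r, pi, gamma=0.9, lam=0.1)
```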

2.2 Natural Policy Gradient under Entropy Regularization

For a given randomized policy πθ\pi_{\theta} parameterized by θΘ\theta\in\Theta where Θ\Theta is a given parameter space, policy gradient methods maximize Vπθ(μ)V^{{\pi_{\theta}}}(\mu) by using the policy gradient θVπθ(μ)\nabla_{\theta}V^{\pi_{\theta}}(\mu). Natural policy gradient, as a quasi-Newton method, adjusts the gradient update to fit problem geometry by using the Fisher information matrix as a pre-conditioner [12, 20].

Let

Gπθ(μ)=𝔼sdμπθ,aπθ(|s)[θlogπθ(a|s)θlogπθ(a|s)],G^{\pi_{\theta}}(\mu)=\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}\nabla_{\theta}\log\pi_{\theta}(a|s)\nabla_{\theta}^{\top}\log\pi_{\theta}(a|s)\Big{]},

be the Fisher information matrix under policy πθ\pi_{\theta}, where

dμπ()=(1γ)k=0γk(sk|s0μ),d_{\mu}^{\pi}(\cdot)=(1-\gamma)\sum_{k=0}^{\infty}\gamma^{k}\mathbb{P}(s_{k}\in\cdot|s_{0}\sim\mu),

is the discounted state visitation distribution under a policy π\pi. Then, the update rule under NPG can be expressed as

θθ+η[Gπθ(μ)]1Vλπθ(μ),\theta\leftarrow\theta+\eta\cdot\big{[}G^{{\pi_{\theta}}}(\mu)\big{]}^{-1}\nabla V_{\lambda}^{\pi_{\theta}}(\mu), (11)

where η>0\eta>0 is the step-size. Equivalently, the NPG update can be written as follows:

θ+argmaxθd{θVλ(πθ)(θθ)12η(θθ)Gπθ(μ)(θθ)}.\theta^{+}\in\arg\max_{\theta\in\mathbb{R}^{d}}\Big{\{}\nabla^{\top}_{\theta}V_{\lambda}(\pi_{\theta^{-}})(\theta-\theta^{-})-\frac{1}{2\eta}(\theta-\theta^{-})^{\top}G^{\pi_{\theta^{-}}}(\mu)(\theta-\theta^{-})\Big{\}}. (12)

The above update scheme is closely related to gradient ascent and policy mirror ascent. Note that the gradient ascent for policy optimization performs the following update:

θ+argmaxθd{θVλ(πθ)(θθ)12ηθθ22}.\theta^{+}\in\arg\max_{\theta\in\mathbb{R}^{d}}\Big{\{}\nabla^{\top}_{\theta}V_{\lambda}(\pi_{\theta^{-}})(\theta-\theta^{-})-\frac{1}{2\eta}\|\theta-\theta^{-}\|_{2}^{2}\Big{\}}. (13)

The update in (13) leads to the policy gradient algorithm [4]. Compared to (13), the natural policy gradient uses a generalized Mahalanobis distance (i.e., weighted-2\ell_{2} distance) as the Bregman divergence instead of 2\ell_{2} distance, which yields significant improvements in policy optimization by avoiding the so-called vanishing gradient problem in the tabular case [20, 19, 17]. Consequently, state-of-the-art reinforcement learning algorithms such as trust region policy optimization (TRPO) [24], proximal policy optimization (PPO) [25] and soft actor-critic [15] are variants of the natural policy gradient method.

In the following, we provide the necessary tools to compute the policy gradient and the update rule in (11), based on [39].

Proposition 1 (Policy gradient).

For any θ\theta and λ>0\lambda>0, we have:

θVλπθ(μ)=11γ𝔼sdμπθ,aπθ(|s)[θlogπθ(a|s)qλπθ(s,a)].\nabla_{\theta}{V}_{\lambda}^{\pi_{\theta}}(\mu)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}\nabla_{\theta}\log\pi_{\theta}(a|s)q_{\lambda}^{\pi_{\theta}}(s,a)\Big{]}. (14)

Based on Proposition 1, the natural policy gradient update can be computed via the following lemma, which is an extension of results in [12, 17, 39].

Lemma 1.

Let

L(w,θ)=𝔼sdμπθ,aπθ(|s)[(θlogπθ(a|s)wqλπθ(s,a))2],L(w,\theta)=\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}\Big{(}\nabla_{\theta}^{\top}\log\pi_{\theta}(a|s)w-q_{\lambda}^{\pi_{\theta}}(s,a)\Big{)}^{2}\Big{]}, (15)

be the error for a given policy parameter θ\theta. Define

wλπθargminwL(w,θ).w_{\lambda}^{\pi_{\theta}}\in\arg\min_{w}L(w,\theta). (16)

Then, we have:

Gπθ(μ)wλπθ=(1γ)θVλπθ(μ),G^{\pi_{\theta}}(\mu)w_{\lambda}^{\pi_{\theta}}=(1-\gamma)\nabla_{\theta}V_{\lambda}^{\pi_{\theta}}(\mu), (17)

where GπθG^{\pi_{\theta}} is the Fisher information matrix.

The above results, which hold for general policy parameterization, will provide the basis for the entropy-regularized natural actor-critic (NAC) algorithm with neural network approximation that we introduce in the following section, with certain modifications for variance reduction and stability that we will describe; see Remark 2 later.
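Lemma 1 says that the natural gradient direction can be obtained by regressing the (soft) Q-function onto the score features $\nabla_{\theta}\log\pi_{\theta}$. A minimal sample-based sketch of this compatible function approximation step follows; the features and targets are toy placeholders and the function name is ours.

```python
import numpy as np

def npg_direction(score_feats, q_targets, ridge=1e-8):
    """Solve the compatible function approximation problem (15)-(16) from samples:
    w in argmin_w sum_n (phi_n^T w - q_n)^2, where phi_n = grad_theta log pi_theta(a_n|s_n).
    By Lemma 1, G w = (1 - gamma) grad V, so w is the NPG direction up to scaling."""
    Phi = np.asarray(score_feats)            # shape (n_samples, dim_theta)
    q = np.asarray(q_targets)                # shape (n_samples,)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ q)

# Toy usage with random stand-ins for the score features and q-targets
rng = np.random.default_rng(1)
Phi = rng.normal(size=(256, 10))
q = Phi @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
w = npg_direction(Phi, q)
# NPG step (11): theta <- theta + eta * w / (1 - gamma), up to the scaling in (17)
```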

3 Natural Actor-Critic with Neural Network Approximation

In this section, we will introduce the entropy-regularized natural actor critic algorithm, where both the actor and critic are represented by single-hidden-layer neural networks.

Throughout this paper, we make the following assumption, which is standard in policy optimization [17].

Assumption 1 (Sampling oracle).

For a given initial state distribution μ\mu and policy π\pi, we assume that the controller is able to obtain an independent sample from dμπd_{\mu}^{\pi} at any time.

The sampling process involves a resetting mechanism and a simulator, which are available in many important application scenarios, and sampling from a state visitation distribution dμπd_{\mu}^{\pi} can be performed as described in [47].

3.1 Actor Network and Natural Policy Gradient

For a network width m+m\in\mathbb{Z}^{+} and cic_{i}\in\mathbb{R}, θid\theta_{i}\in\mathbb{R}^{d} for i[m]i\in[m], the actor network is given by the single-hidden-layer neural network:

f(s,a;(c,θ))=1mi=1mciσ(θi,(s,a)),f(s,a;(c,\theta))=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}\sigma\big{(}\langle\theta_{i},(s,a)\rangle\big{)}, (18)

where $c=[c_{i}]_{i\in[m]}$, $\theta=[\theta_{i}]_{i\in[m]}$, and $\sigma(x)=\max\{0,x\}$ is the ReLU activation function. As is common practice [43, 44, 42], we fix the output layer $c$ after a random initialization, and only train the hidden-layer weights $\theta\in\Theta\subset\mathbb{R}^{m\times d}$. Given a (possibly random) parameter $\theta^{0}\in\mathbb{R}^{m\times d}$, a design parameter $R>0$, regularization parameter $\lambda>0$ and network width $m\in\mathbb{Z}^{+}$, the parameter space that we consider is as follows:

Θ={θm×d:maxi[m]θiθi02Rλm}.\Theta=\Big{\{}\theta\in\mathbb{R}^{m\times d}:\max_{i\in[m]}\|\theta_{i}-\theta^{0}_{i}\|_{2}\leq\frac{R}{\lambda\sqrt{m}}\Big{\}}. (19)

For this parameter space Θ\Theta, the policy class that we consider is Π={πθ:θΘ}\Pi=\{\pi_{\theta}:\theta\in\Theta\}, where the policy that corresponds to θΘ\theta\in\Theta is as follows:

\pi_{\theta}(a|s)=\frac{\exp(f(s,a;(c,\theta)))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(f(s,a^{\prime};(c,\theta)))}. (20)

We randomly initialize the actor neural network by using the symmetric initialization in Algorithm 1, (c,θ(0))𝚜𝚢𝚖_𝚒𝚗𝚒𝚝(m,d)(c,\theta(0))\sim{\tt sym\_init}(m,d), which was introduced in [48].

  Inputs: $m$: network width, $d$: feature dimension
  for $i=1,2,\ldots,m/2$ do
     $c_{i}=-c_{i+m/2}\sim\mathrm{Rademacher}$
     $\theta_{i}=\theta_{i+m/2}\sim\mathcal{N}(0,I_{d})$
  end for
  return  network weights $(c,\theta)$
Algorithm 1 ${\tt sym\_init}(m,d)$ - Symmetric Initialization

Later, we will employ a similar symmetric initialization scheme for the critic neural network.
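For concreteness, here is a minimal numpy sketch of the symmetric initialization (Algorithm 1), the actor network (18), and the induced softmax policy (20). The function names are ours and the snippet is illustrative rather than the authors' implementation.

```python
import numpy as np

def sym_init(m, d, rng):
    """Algorithm 1: paired signs and shared Gaussian weights, so f(.,.;(c,theta(0))) = 0."""
    assert m % 2 == 0
    c_half = rng.choice([-1.0, 1.0], size=m // 2)        # Rademacher
    theta_half = rng.normal(size=(m // 2, d))
    c = np.concatenate([c_half, -c_half])
    theta = np.concatenate([theta_half, theta_half], axis=0)
    return c, theta

def actor_f(x, c, theta):
    """Two-layer ReLU network (18); x is the (s,a) feature vector with ||x||_2 <= 1."""
    m = c.shape[0]
    return (c * np.maximum(theta @ x, 0.0)).sum() / np.sqrt(m)

def policy(s_feats, c, theta):
    """Softmax policy (20) over a finite action set; s_feats[a] is the feature of (s,a)."""
    logits = np.array([actor_f(x, c, theta) for x in s_feats])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
c, theta0 = sym_init(m=64, d=4, rng=rng)
s_feats = [np.array([0.5, 0.1, 0.2, 0.0]), np.array([0.1, 0.4, 0.0, 0.3])]  # |A| = 2
print(policy(s_feats, c, theta0))   # uniform at initialization: [0.5, 0.5]
```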

We denote the policy at iteration $t<T$ as $\pi_{t}=\pi_{\theta(t)}$ and the neural network output as $f_{t}(s,a)=f(s,a;(c,\theta(t)))$. Since $\Xi_{\lambda}^{\pi_{t}}$ and $d_{\mu}^{\pi_{t}}$ are not known exactly, we estimate

u_{t}^{\star}\in\arg\min_{u\in\mathcal{B}_{m,R}^{d}(0)}\mathbb{E}[(\nabla^{\top}\log\pi_{t}(a|s)u-\Xi_{\lambda}^{\pi_{t}}(s,a))^{2}], (21)

from samples by using the following actor-critic method:

  • Critic: Temporal difference learning algorithm (Algorithm 3 in Section 3.2), which employs a critic neural network, returns a set of neural network weights that yield a sample-based estimate for the soft advantage function {Ξ^λπt(s,a):(s,a)𝒮×𝒜}\{\widehat{\Xi}_{\lambda}^{\pi_{t}}(s,a):(s,a)\in\mathcal{S}\times\mathcal{A}\}.

  • Policy update: Given this, we approximate utu_{t}^{\star} by using stochastic gradient descent (SGD) with NN iterations and step-size αA>0\alpha_{A}>0. Starting with u0(t)=0u_{0}^{(t)}=0, an iteration of SGD is as follows

    un+1/2(t)\displaystyle u_{n+1/2}^{(t)} =un(t)αA(θlogπt(an|sn)un(t)Ξ^λπt(sn,an))θlogπt(an|sn),\displaystyle=u_{n}^{(t)}-\alpha_{A}\Big{(}\nabla_{\theta}^{\top}\log{\pi}_{t}(a_{n}|s_{n})u_{n}^{(t)}-\widehat{\Xi}_{\lambda}^{\pi_{t}}(s_{n},a_{n})\Big{)}\nabla_{\theta}\log{\pi}_{t}(a_{n}|s_{n}), (22)
    un+1(t)\displaystyle u_{n+1}^{(t)} =𝒫m,Rd(0)(un+1/2(t)),\displaystyle=\mathcal{P}_{\mathcal{B}_{m,R}^{d}(0)}\left(u_{n+1/2}^{(t)}\right), (23)

    where $s_{n}\sim d_{\mu}^{\pi_{t}}$ and $a_{n}\sim\pi_{t}(\cdot|s_{n})$ for $n=0,1,\ldots,N-1$, and $\widehat{\Xi}^{\pi_{t}}_{\lambda}$ is the output of the critic. Then, the final estimate is $u_{t}=\frac{1}{N}\sum_{n=1}^{N}u_{n}^{(t)}$ (a schematic implementation of this inner loop is given after this list). By using $u_{t}$, we perform the following update:

    θ(t+1)=θ(t)+ηtwt,\theta(t+1)=\theta(t)+\eta_{t}\cdot w_{t},

    where wt=utλ(θ(t)θ(0))w_{t}=u_{t}-\lambda\big{(}\theta(t)-\theta(0)\big{)}.
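As referenced above, here is a schematic numpy version of the projected SGD inner loop (22)-(23), the averaging step, and the parameter update of Algorithm 2. The sampler, the score function $\nabla_{\theta}\log\pi_{t}$ and the critic estimate $\widehat{\Xi}_{\lambda}^{\pi_{t}}$ are passed in as callables; this is our own sketch under those assumptions, not the authors' code.

```python
import numpy as np

def actor_inner_loop(grad_log_pi, xi_hat, sample_sa, dim, m, R, alpha_A, N):
    """Projected SGD (22)-(23) for u_t, followed by the averaging u_t = (1/N) sum_n u_n^{(t)}.
    grad_log_pi(s, a) -> flattened gradient of log pi_t(a|s), length dim = m*d;
    xi_hat(s, a)      -> critic estimate of the soft advantage;
    sample_sa()       -> one pair (s, a) with s ~ d_mu^{pi_t}, a ~ pi_t(.|s)."""
    u = np.zeros(dim)
    u_sum = np.zeros(dim)
    radius = R / np.sqrt(m)                          # row radius of the ball B_{m,R}^d(0)
    for _ in range(N):
        s, a = sample_sa()
        g = grad_log_pi(s, a)
        u = u - alpha_A * (g @ u - xi_hat(s, a)) * g           # SGD step (22)
        rows = u.reshape(m, -1)                                # projection (23): clip each row
        norms = np.linalg.norm(rows, axis=1, keepdims=True)
        rows *= np.minimum(1.0, radius / np.maximum(norms, 1e-12))
        u = rows.reshape(-1)
        u_sum += u
    return u_sum / N

def nac_step(theta, theta0, u_t, eta_t, lam):
    """Line 11 of Algorithm 2: theta(t+1) = theta(t) + eta_t*u_t - eta_t*lam*(theta(t)-theta(0))."""
    return theta + eta_t * u_t - eta_t * lam * (theta - theta0)

# Dummy usage with random placeholders standing in for the sampler, score and critic
rng = np.random.default_rng(0)
m, d = 8, 3
u_t = actor_inner_loop(
    grad_log_pi=lambda s, a: rng.normal(size=m * d) / np.sqrt(m * d),
    xi_hat=lambda s, a: rng.normal(),
    sample_sa=lambda: (None, None),
    dim=m * d, m=m, R=1.0, alpha_A=0.1, N=50)
```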

The natural actor-critic algorithm is summarized in Algorithm 2. Below, we summarize the modifications in the algorithm that we consider in this paper with respect to the NPG described in the previous section.

Remark 1 (Averaging and projection).

The update in each iteration of the NAC algorithm described in Algorithm 2 can be equivalently written as follows:

\theta(t+1)-\theta(0)=(1-\eta_{t}\lambda)\cdot(\theta(t)-\theta(0))+\eta_{t}u_{t}, (24)

where utu_{t} is an approximate solution to the optimization problem (21). As we will see, the projection of utu_{t} onto m,Rd(0)\mathcal{B}_{m,R}^{d}(0) (which can be considered as gradient clipping), in conjunction with the averaging in the policy update (24) enables us to control maxi[m]θi(t)θi(0)2\max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2} while taking (natural) gradient steps towards the optimal policy. Controlling maxi[m]θi(t)θi(0)2\max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2} is critical for two reasons: (i) to ensure sufficient exploration, and (ii) to establish the convergence bounds for the neural networks.

Alternatively, one may be tempted to project θ(t)\theta(t) onto a ball around θ(0)\theta(0) in the 2\ell_{2}-geometry to control maxiθi(t)θi(0)2\max_{i}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}. However, as the algorithm follows the natural policy gradient, which uses a different Bregman divergence than 2\|\cdot\|_{2}, projection of θ(t)\theta(t) with respect to the 2\ell_{2}-norm may not result in moving the policy in the direction of improvement. Similarly, since we parameterize the policies by using a lower-dimensional vector θm×d\theta\in\mathbb{R}^{m\times d} to avoid storing and computing |𝒮×𝒜||\mathcal{S}\times\mathcal{A}|-dimensional policies, Bregman projection in the probability simplex, which is commonly used in direct parameterization, is not a feasible option for policy optimization with function approximation.

As such, simultaneous use of averaging and projection of the update utu_{t} are critical to control the network weights and policy improvement.

Remark 2 (Baseline).

Note that the regression target in (21) uses the soft advantage function $\Xi_{\lambda}^{\pi}$ rather than the state-action value function $q_{\lambda}^{\pi}$. The soft advantage function uses $\sum_{a\in\mathcal{A}}\pi_{t}(a|s)Q_{\lambda}^{\pi_{t}}(s,a)$ as a baseline for variance reduction, which is a common practice in policy gradient methods [1].

In the following subsection, we describe the critic algorithm in detail.

1:  Initialize (c,θ(0))𝚜𝚢𝚖_𝚒𝚗𝚒𝚝(m,d)(c,\theta(0))\sim{\tt sym\_init}(m,d)
2:  for t=0,1,,T1t=0,1,\ldots,T-1 do
3:     Critic: Ξ^λπt=\widehat{\Xi}_{\lambda}^{\pi_{t}}=~{}MN-NTD(πt,R,m,T,αC)(\pi_{t},R,m^{\prime},T^{\prime},\alpha_{C})
4:     Initialize: u0(t)=0u^{(t)}_{0}=0
5:     for $n=0,1,\ldots,N-1$ do
6:        Sampling: sndμπt,anπt(|sn).s_{n}\sim d_{\mu}^{\pi_{t}},a_{n}\sim\pi_{t}(\cdot|s_{n}).
7:        un+1/2(t)=un(t)αA(θlogπt(an|sn)un(t)Ξ^λπt(sn,an))θlogπt(an|sn)u_{n+1/2}^{(t)}=u_{n}^{(t)}-\alpha_{A}\Big{(}\nabla_{\theta}^{\top}\log{\pi}_{t}(a_{n}|s_{n})u_{n}^{(t)}-\widehat{\Xi}_{\lambda}^{\pi_{t}}(s_{n},a_{n})\Big{)}\nabla_{\theta}\log{\pi}_{t}(a_{n}|s_{n})
8:        un+1(t)=𝒫m,Rd(0)(un+1/2(t))u_{n+1}^{(t)}=\mathcal{P}_{\mathcal{B}_{m,R}^{d}(0)}\left(u_{n+1/2}^{(t)}\right)
9:     end for
10:     ut=1Nn=1Nun(t)u_{t}=\frac{1}{N}\sum_{n=1}^{N}u_{n}^{(t)}
11:     θ(t+1)=θ(t)+ηtutηtλ[θ(t)θ(0)]\theta(t+1)=\theta(t)+\eta_{t}u_{t}-\eta_{t}\lambda[\theta(t)-\theta(0)]
12:  end for
Algorithm 2 Entropy-regularized Neural NAC

3.2 Critic Network and Temporal Difference Learning

We estimate $\Xi_{\lambda}^{\pi_{t}}$ by using the neural TD learning algorithm with max-norm regularization [49]. Note that $\Xi_{\lambda}^{\pi_{t}}$ can be directly obtained from $q_{\lambda}^{\pi_{\theta}}$ via $Q_{\lambda}^{\pi_{\theta}}(s,a)=q_{\lambda}^{\pi_{\theta}}(s,a)+\lambda\log{\pi_{\theta}}(a|s)$ and (9). Since $q_{\lambda}^{\pi_{\theta}}$ is the fixed point of the Bellman equation $q=\mathcal{T}^{\pi_{\theta}}q$ given in Section 2.1, it can be approximated by using temporal difference (TD) learning algorithms.

For the critic, we use a two-layer neural network of width mm^{\prime}, which is defined as follows:

q^(s,a;(b,W))=1mi=1mbiσ(Wi,(s,a)).\widehat{q}(s,a;(b,W))=\frac{1}{\sqrt{m^{\prime}}}\sum_{i=1}^{m^{\prime}}b_{i}\sigma\left(\langle W_{i},(s,a)\rangle\right). (25)

The critic network is initialized according to the symmetric initialization scheme in Algorithm 1. Let (b,W(0))(b,W(0)) denote the initialization.

We aim to solve the following problem:

W=argminW𝔼sdμπθ,aπθ(|s)[(q^(s,a;(b,W))𝒯πθq^(s,a;(b,W)))2].W^{\star}=\arg\min_{W}~{}~{}\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim{\pi_{\theta}}(\cdot|s)}\Big{[}\Big{(}\widehat{q}(s,a;(b,W))-\mathcal{T}^{\pi_{\theta}}\widehat{q}(s,a;(b,W))\Big{)}^{2}\Big{]}. (26)

where $\mathcal{T}^{\pi}$ is the Bellman operator defined in Section 2.1.

We will consider max-norm regularization in the updates of the critic, which was shown to be effective in supervised learning and reinforcement learning (see [49, 50]). For a given w0dw_{0}\in\mathbb{R}^{d} and R>0R>0, let

𝒢R(w0)={wd:ww02R/m}.\mathcal{G}_{R}(w_{0})=\{w\in\mathbb{R}^{d}:\|w-w_{0}\|_{2}\leq R/\sqrt{m^{\prime}}\}. (27)

Under max-norm regularization, each hidden unit’s weight vector is confined within the set 𝒢R(Wi(0))\mathcal{G}_{R}(W_{i}(0)) for a given projection radius RR.

For $k=0,1,\ldots,T^{\prime}-1$, we assume that $(s_{k},a_{k})$ is sampled from $d_{\mu}^{\pi_{\theta}}$, i.e., $s_{k}\sim d_{\mu}^{\pi_{\theta}},a_{k}\sim{\pi_{\theta}}(\cdot|s_{k})$. Upon obtaining $(s_{k},a_{k})$, the next state-action pair is obtained by following ${\pi_{\theta}}$: $s_{k}^{\prime}\sim P(\cdot|s_{k},a_{k})$, $a_{k}^{\prime}\sim{\pi_{\theta}}(\cdot|s_{k}^{\prime})$. One can replace the i.i.d. sampling here with Markovian sampling at the cost of a more complicated analysis, as in [51]. However, since experience replay is used in practice, the actual sampling procedure is neither purely Markovian nor i.i.d., and for simplicity of the analysis we choose to model it as i.i.d. sampling.

An iteration of MN-NTD is as follows:

W_{i}(k+1)=\mathcal{P}_{\mathcal{G}_{R}(W_{i}(0))}\left(W_{i}(k)+\alpha_{C}\big(r_{k}+\gamma\widehat{q}_{k}(s_{k}^{\prime},a_{k}^{\prime})-\widehat{q}_{k}(s_{k},a_{k})\big)\nabla_{W_{i}}\widehat{q}_{k}(s_{k},a_{k})\right),\quad\forall i\in[m^{\prime}],

where q^k(s,a)=q^(s,a;(b,W(k)))\widehat{q}_{k}(s,a)=\widehat{q}(s,a;(b,W(k))), rk=r(sk,ak)λlogπθ(ak|sk)r_{k}=r(s_{k},a_{k})-\lambda\log\pi_{\theta}(a_{k}|s_{k}) and 𝒫𝒞\mathcal{P}_{\mathcal{C}} is the projection operator onto a set 𝒞d\mathcal{C}\subset\mathbb{R}^{d}. The output of the critic, which approximates qλπθq_{\lambda}^{\pi_{\theta}}, is then obtained as:

q¯Tπθ(s,a)=q^(s,a;(b,1Tk<TW(k))),(s,a)𝒮×𝒜,\overline{q}_{T^{\prime}}^{\pi_{\theta}}(s,a)=\widehat{q}\Big{(}s,a;\big{(}b,\frac{1}{T^{\prime}}\sum_{k<T^{\prime}}W(k)\big{)}\Big{)},~{}~{}(s,a)\in\mathcal{S}\times\mathcal{A},

where TT^{\prime} is the number of iterations of MN-NTD. We obtain an approximation of the soft Q-function as

Q¯λπθ(s,a)=q¯Tπθ(s,a)λlogπθ(a|s).\overline{Q}_{\lambda}^{\pi_{\theta}}(s,a)=\overline{q}_{T^{\prime}}^{\pi_{\theta}}(s,a)-\lambda\log{\pi_{\theta}}(a|s).

The corresponding estimate for the soft advantage function is the following:

Ξ^λπθ(s,a)=Q¯λπθ(s,a)a𝒜πθ(a|s)Q¯λπθ(s,a).\widehat{\Xi}_{\lambda}^{\pi_{\theta}}(s,a)=\overline{Q}_{\lambda}^{\pi_{\theta}}(s,a)-\sum_{a^{\prime}\in\mathcal{A}}{\pi_{\theta}}(a^{\prime}|s)\overline{Q}_{\lambda}^{\pi_{\theta}}(s,a^{\prime}). (28)

The critic update for a given policy πθ,θΘ\pi_{\theta},\theta\in\Theta is summarized in Algorithm 3.

  Inputs: Policy πθ\pi_{\theta}, proj. radius RR, network width mm^{\prime}, sample size TT^{\prime}, step-size αC\alpha_{C}
  Initialization: (b,W(0))=𝚜𝚢𝚖_𝚒𝚗𝚒𝚝(m,d)(b,W(0))={\tt sym\_init}(m^{\prime},d)
  for k<T1k<T^{\prime}-1 do
     Observe (sk,ak)dμπθπθ(|sk)(s_{k},a_{k})\sim d_{\mu}^{\pi_{\theta}}\circ{\pi_{\theta}}(\cdot|s_{k}), skP(|sk,ak)s_{k}^{\prime}\sim P(\cdot|s_{k},a_{k}), akπθ(|sk)a_{k}^{\prime}\sim\pi_{\theta}(\cdot|s_{k}^{\prime})
     Observe reward: rk:=r(sk,ak)λlogπθ(ak|sk)r_{k}:=r(s_{k},a_{k})-\lambda\log{\pi_{\theta}}(a_{k}|s_{k})
     Compute semi-gradient: $g_{k}=\big(r_{k}+\gamma\hat{q}_{k}(s_{k}^{\prime},a_{k}^{\prime})-\hat{q}_{k}(s_{k},a_{k})\big)\nabla_{W}\hat{q}_{k}(s_{k},a_{k})$
     Take a semi-gradient step: W(k+1/2)=W(k)+αCgkW(k+1/2)=W(k)+\alpha_{C}g_{k}
     Max-norm regularization: Wi(k+1)=𝒫𝒢R(Wi(0)){Wi(k+1/2)},i[m]W_{i}(k+1)=\mathcal{P}_{\mathcal{G}_{R}(W_{i}(0))}\left\{W_{i}(k+1/2)\right\},\forall i\in[m^{\prime}]
  end for
  return  q¯Tπθ(s,a)=q^(s,a;(b,1Tk<TW(k)))\overline{q}_{T^{\prime}}^{\pi_{\theta}}(s,a)=\hat{q}\Big{(}s,a;\big{(}b,\frac{1}{T^{\prime}}\sum_{k<T^{\prime}}W(k)\big{)}\Big{)} for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}
Algorithm 3 MN-NTD - Max-Norm Regularized Neural TD Learning
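A compact numpy sketch of MN-NTD (Algorithm 3) is given below, with the sampling oracle and the entropy-regularized reward supplied as a callable. This is our illustrative code, assuming a generic feature map for $(s,a)$; it is not the authors' implementation.

```python
import numpy as np

def mn_ntd(sample_transition, m_prime, d, R, T_prime, alpha_C, gamma, rng):
    """Max-norm regularized neural TD (Algorithm 3), critic width m_prime.
    sample_transition() -> (x, r_reg, x_next), where x = feat(s, a), x_next = feat(s', a')
    and r_reg = r(s, a) - lambda * log pi(a|s). Returns (b, W_bar) defining q_bar via (25)."""
    # symmetric initialization (Algorithm 1)
    b_half = rng.choice([-1.0, 1.0], size=m_prime // 2)
    W_half = rng.normal(size=(m_prime // 2, d))
    b = np.concatenate([b_half, -b_half])
    W0 = np.concatenate([W_half, W_half], axis=0)
    W = W0.copy()
    W_sum = np.zeros_like(W)
    q = lambda x, W_: (b * np.maximum(W_ @ x, 0.0)).sum() / np.sqrt(m_prime)

    for _ in range(T_prime):
        W_sum += W                                   # average over W(0), ..., W(T'-1)
        x, r_reg, x_next = sample_transition()
        delta = r_reg + gamma * q(x_next, W) - q(x, W)                              # TD error
        grad = (b[:, None] * (W @ x > 0)[:, None] * x[None, :]) / np.sqrt(m_prime)  # dq/dW
        W_step = W + alpha_C * delta * grad                                         # semi-gradient step
        # max-norm projection: each row stays within radius R/sqrt(m') of its initialization
        diff = W_step - W0
        norms = np.linalg.norm(diff, axis=1, keepdims=True)
        W = W0 + diff * np.minimum(1.0, (R / np.sqrt(m_prime)) / np.maximum(norms, 1e-12))
    return b, W_sum / T_prime

# Dummy usage with a random-feature stand-in for the sampling oracle
rng = np.random.default_rng(0)
d = 4
fake = lambda: (rng.normal(size=d) / 4, rng.uniform(), rng.normal(size=d) / 4)
b, W_bar = mn_ntd(fake, m_prime=32, d=d, R=1.0, T_prime=200, alpha_C=0.05, gamma=0.9, rng=rng)
```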

4 Main Results: Sample Complexity and Overparameterization Bounds for Neural NAC

In this section, we analyze the convergence of the entropy-regularized neural NAC algorithm and provide sample complexity and overparameterization bounds for both the actor and the critic.

4.1 Regularization and Persistence of Excitation under Neural NAC

The following proposition shows that the persistence of excitation condition (see [52] for a discussion of its critical role in stochastic control problems) is satisfied under Algorithm 2, which ensures sufficient exploration and hence convergence to global optimality.

Proposition 2 (Persistence of excitation).

For any regularization parameter $\lambda>0$ and projection radius $R>0$, the entropy-regularized NAC satisfies the following:

maxi[m]θi(t)θi(0)2Rϰtλm,\max_{i\in[m]}~{}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}\leq\frac{R\varkappa_{t}}{\lambda\sqrt{m}}, (29)

where

ϰt={1,ηt=1λ(t+1),1(1ηλ)t,ηt=η(0,1λ),\varkappa_{t}=\begin{cases}1,&\eta_{t}=\frac{1}{\lambda(t+1)},\\ 1-(1-\eta\lambda)^{t},&\eta_{t}=\eta\in\left(0,\frac{1}{\lambda}\right),\end{cases} (30)

for all t0t\geq 0 almost surely. Consequently,

πmin:=inf(s,a)𝒮×𝒜πt(a|s)exp(2R/λ2ρ0(Rϰtλ,m,δ))|𝒜|>0,\pi_{min}:=\inf_{(s,a)\in\mathcal{S}\times\mathcal{A}}\pi_{t}(a|s)\geq\frac{\exp\left(-2R/\lambda-2\rho_{0}\left(\frac{R\varkappa_{t}}{\lambda},m,\delta\right)\right)}{|\mathcal{A}|}>0, (31)

simultaneously for all t0t\geq 0 with probability at least 1δ1-\delta over the random initialization of the actor network, where the function ρ0\rho_{0} is given by

ρ0(R0,m,δ)=16R0m(R0+log(1δ)+dlog(m)).\rho_{0}(R_{0},m,\delta)=\frac{16R_{0}}{\sqrt{m}}\Big{(}R_{0}+\sqrt{\log\Big{(}\frac{1}{\delta}\Big{)}}+\sqrt{d\log(m)}\Big{)}. (32)

Proposition 2 has two critical implications:

  (i) The inequality (31) implies that any action $a\in\mathcal{A}$ is taken with strictly positive probability at any given state $s\in\mathcal{S}$, i.e., all policies throughout the policy optimization steps satisfy the “persistence of excitation” condition with high probability over the random initialization. As we will see in the convergence analysis, this property implies sufficient exploration, which ensures that near-deterministic suboptimal policies are avoided. Sufficient exploration is achieved by entropy regularization, averaging, projection of $u_{t}$, and a large network width $m$ for the policy parameterization.

  (ii) The inequality (29) implies that we can control the deviation of the actor network weights via $R$, $\lambda$ and $m$. This property is key for the neural network analysis in the lazy-training regime.
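For concreteness, the lower bound in (31) can be evaluated numerically from (32). A small helper (ours) that does so:

```python
import numpy as np

def pi_min_lower_bound(R, lam, m, d, delta, A_size, kappa_t=1.0):
    """Evaluates the right-hand side of (31) using rho_0 from (32)."""
    R0 = R * kappa_t / lam
    rho0 = 16.0 * R0 / np.sqrt(m) * (R0 + np.sqrt(np.log(1.0 / delta)) + np.sqrt(d * np.log(m)))
    return np.exp(-2.0 * R / lam - 2.0 * rho0) / A_size

# e.g. R = 1, lambda = 0.5, m = 10^6 hidden units, d = 10 features, |A| = 4
print(pi_min_lower_bound(R=1.0, lam=0.5, m=1_000_000, d=10, delta=0.05, A_size=4))
```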

4.2 Transportation Mappings and Function Classes

We first present a brief discussion on kernel approximations of neural networks, which will be useful to state our convergence results. Consider the following space of mappings:

ν¯={v:dd:supwdv(w)2ν¯},\mathcal{H}_{{\bar{\nu}}}=\{v:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}:\sup_{w\in\mathbb{R}^{d}}\|v(w)\|_{2}\leq{\bar{\nu}}\}, (33)

and the function class:

ν¯={g()=𝔼w0𝒩(0,Id)[v(w0),𝟙{w0,>0}]:vν¯}.\mathcal{F}_{{\bar{\nu}}}=\Big{\{}g(\cdot)=\mathbb{E}_{w_{0}\sim\mathcal{N}(0,I_{d})}[\langle v(w_{0}),\cdot\rangle\mathbbm{1}\{\langle w_{0},\cdot\rangle>0\}]:v\in\mathcal{H}_{{\bar{\nu}}}\Big{\}}. (34)

Note that $\mathcal{F}_{{\bar{\nu}}}$ is a provably rich subset of the reproducing kernel Hilbert space (RKHS) induced by the neural tangent kernel, which can approximate continuous functions over a compact space [43, 40, 53]. For a given class of transportation maps $\mathcal{V}=\{v_{k}\in\mathcal{H}_{\bar{\nu}}:k\in[K]\}$ for $K\geq 1$, we also consider the following subspace of $\mathcal{F}_{{\bar{\nu}}}$:

K,ν¯,𝒱={g()=𝔼w0𝒩(0,Id)[k[K]αkvk(w0),𝟙{w0,>0}]:α11}.\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}}=\Big{\{}g(\cdot)=\mathbb{E}_{w_{0}\sim\mathcal{N}(0,I_{d})}[\big{\langle}\sum_{k\in[K]}\alpha_{k}v_{k}(w_{0}),\cdot\big{\rangle}\mathbbm{1}\{\langle w_{0},\cdot\rangle>0\}]:\|\alpha\|_{1}\leq 1\Big{\}}. (35)

Note that the above set depends on the choice of $\{v_{k}\}_{k\in[K]}$, but these maps can be arbitrary. The space of continuously differentiable functions $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ over a compact domain has a countable basis $\{\varphi_{k}:k=0,1,\ldots\}$ [54]. By [43, Theorem 4.3], one can find transportation mappings $v_{k}\in\mathcal{H}_{{\bar{\nu}}}$ such that $g_{k}(s,a)=\mathbb{E}[\langle v_{k}(w_{0}),(s,a)\rangle\mathbbm{1}\{\langle w_{0},(s,a)\rangle\geq 0\}]$ approximates $\varphi_{k}$ well. As such, $\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}}$ is able to approximate, as $K\rightarrow\infty$ with an appropriate $\mathcal{V}$, a function class that contains the continuously differentiable functions over a compact space.
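Functions in $\mathcal{F}_{{\bar{\nu}}}$ can be approximated by Monte Carlo sampling of $w_{0}\sim\mathcal{N}(0,I_{d})$, which is the infinite-width analogue of a two-layer ReLU network linearized around a random initialization. A short sketch (ours), assuming a given transportation map $v$ with $\sup_{w}\|v(w)\|_{2}\leq{\bar{\nu}}$:

```python
import numpy as np

def f_nu_bar(x, v, n_samples=20_000, seed=0):
    """Monte Carlo estimate of g(x) = E_{w0 ~ N(0,I_d)}[<v(w0), x> 1{<w0, x> > 0}] in (34)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    w0 = rng.normal(size=(n_samples, d))
    active = (w0 @ x > 0).astype(float)
    vals = (np.apply_along_axis(v, 1, w0) * x[None, :]).sum(axis=1)
    return np.mean(active * vals)

# Example transportation map with sup_w ||v(w)||_2 <= nu_bar = 1
v = lambda w: w / np.maximum(np.linalg.norm(w), 1.0)
x = np.array([0.6, 0.3, 0.1])
print(f_nu_bar(x, v))
```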

4.3 Convergence of the Critic

We make the following realizability assumption for the Q-function.

Assumption 2 (Realizability of the Q-function).

For any t0t\geq 0, we assume that qλπtν¯q_{\lambda}^{\pi_{t}}\in\mathcal{F}_{{\bar{\nu}}} for some ν¯>0{\bar{\nu}}>0.

Assumption 2 is a smoothness condition on the class of functions that can be approximated by the critic network, which is dense in the space of continuous functions over $\Omega_{d}$ (see Section 4.2). Here ${\bar{\nu}}$, which is an upper bound on the RKHS norm, is a measure of smoothness. One can also replace the above condition by a slightly stronger condition which states that $q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{{\bar{\nu}}}$ for all $\theta\in\Theta$. Note that the class of functions $\mathcal{F}_{\bar{\nu}}$ is deterministic and its approximation properties are well-known [43]. In [22], it was assumed that the state-action value functions lie in a random function class, which is obtained by shifting $\mathcal{F}_{\bar{\nu}}$ with a Gaussian process. By employing a symmetric initialization, we eliminate this Gaussian process noise, and therefore the realizable class of functions is deterministic and provably rich.

Theorem 1 (Convergence of the Critic, Theorem 2 in [49]).

Under Assumption 2, for any error probability δ(0,1)\delta\in(0,1), let

(m,δ)=4log(2m+1)+4log(T/δ),\ell(m^{\prime},\delta)=4\sqrt{\log(2m^{\prime}+1)}+4\sqrt{\log(T/\delta)},

and R>ν¯R>{\bar{\nu}}. Then, for any target error ε>0\varepsilon>0, number of iterations TT^{\prime}\in\mathbb{N}, network width

m>16(ν¯+(R+(m,δ))(ν¯+R))2(1γ)2ε2,m^{\prime}>\frac{16\Big{(}{\bar{\nu}}+\big{(}R+\ell(m^{\prime},\delta)\big{)}\big{(}{\bar{\nu}}+R\big{)}\Big{)}^{2}}{(1-\gamma)^{2}\varepsilon^{2}},

and step-size

αC=ε2(1γ)(1+2R)2,\alpha_{C}=\frac{\varepsilon^{2}(1-\gamma)}{(1+2R)^{2}},

the critic yields the following bound:

𝔼[𝔼sdμπt,aπt(|s)[(q¯Tπt(s,a)qλπt(s,a))2]𝟙A2](1+2R)ν¯ε(1γ)T+3ε,\mathbb{E}\Big{[}\sqrt{\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}},a\sim\pi_{t}(\cdot|s)}\Big{[}\big{(}\overline{q}_{T^{\prime}}^{\pi_{t}}(s,a)-q_{\lambda}^{\pi_{t}}(s,a)\big{)}^{2}\Big{]}}\mathbbm{1}_{A_{2}}\Big{]}\leq\frac{(1+2R){\bar{\nu}}}{\varepsilon(1-\gamma)\sqrt{T^{\prime}}}+3\varepsilon,

where A2A_{2} holds with probability at least 1δ1-\delta over the random initializations of the critic network.

Note that in order to achieve a target error less than $\varepsilon>0$, a network width of $m^{\prime}=\widetilde{O}\big(\frac{{\bar{\nu}}^{4}}{\varepsilon^{2}}\big)$ and an iteration complexity of $T^{\prime}=O\big(\frac{(1+2{\bar{\nu}})^{2}{\bar{\nu}}^{2}}{\varepsilon^{4}}\big)$ suffice. The analysis of the TD learning algorithm in [49] uses results from [45], which were given for classification (supervised learning) problems with logistic loss. On the other hand, TD learning requires a significantly more challenging analysis because of the bootstrapping in the updates (i.e., the use of a stochastic semi-gradient instead of a true gradient) and the quadratic loss function. Furthermore, for improved sample complexity and overparameterization bounds, max-norm regularization is employed instead of early stopping [49].
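The width condition in Theorem 1 is implicit in $m^{\prime}$ through $\ell(m^{\prime},\delta)$; a simple fixed-point iteration finds a sufficient width and the corresponding step-size. The helper below is ours and treats $T$ as a user-supplied quantity:

```python
import numpy as np

def critic_hyperparams(eps, nu_bar, R, gamma, delta, T, max_iter=100):
    """Returns (m_prime, alpha_C) satisfying the conditions of Theorem 1 (up to rounding)."""
    ell = lambda m: 4.0 * np.sqrt(np.log(2 * m + 1)) + 4.0 * np.sqrt(np.log(T / delta))
    bound = lambda m: 16.0 * (nu_bar + (R + ell(m)) * (nu_bar + R)) ** 2 / ((1 - gamma) ** 2 * eps ** 2)
    m = 1.0
    for _ in range(max_iter):                 # fixed-point iteration on the width condition
        m_new = bound(m) + 1.0
        if abs(m_new - m) < 1.0:
            break
        m = m_new
    alpha_C = eps ** 2 * (1 - gamma) / (1 + 2 * R) ** 2
    return int(np.ceil(m)), alpha_C

print(critic_hyperparams(eps=0.1, nu_bar=1.0, R=1.5, gamma=0.9, delta=0.05, T=1000))
```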

4.4 Global Optimality and Convergence of Neural NAC

In this section, we provide the main convergence result for the entropy-regularized NAC with neural network approximation.

Assumption 3 (Realizability).

For K1K\geq 1, we assume that for all θΘ\theta\in\Theta, QλπθK,ν¯,𝒱Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}}, where the function class K,ν¯,𝒱\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}} is defined in Section 4.2.

Note that K,ν¯,𝒱\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}} approximates a rich class of functions over a compact space well for large KK (see Section 4.2). Also, Assumption 3 implies that there is a structure among the soft Q-functions in the policy class Θ\Theta since each QλπθQ_{\lambda}^{\pi_{\theta}} can be written as a linear combination of KK functions that correspond to the transportation maps vkv_{k}. We consider this relatively restricted function class instead of ν¯\mathcal{F}_{{\bar{\nu}}} to obtain uniform approximation error bounds to handle the dynamic structure of the policy optimization over time steps. Notably, the actor network features are expected to fit QλπtQ_{\lambda}^{\pi_{t}} over all iterations, thus an inherent structure in {Qλπθ:θΘ}\{Q_{\lambda}^{\pi_{\theta}}:\theta\in\Theta\} appears to be necessary. For further discussion, see Section 5.4 (particularly Remark 7).

4.4.1 Performance Bounds under a Weak Distribution Mismatch Condition

First, we establish sample complexity and overparameterization bounds under a weak distribution mismatch condition, which is provided below. This condition is significantly weaker than those in the existing literature (e.g., [22, 55, 17]) since Proposition 2 shows that the policies achieve sufficient exploration (see Remark 3 for details).

Assumption 4 (Weak distribution mismatch condition).

There exists a constant C<C_{\infty}<\infty such that

supt0𝔼sdμπt[(dμπ(s)dμπt(s))2]C2.\sup_{t\geq 0}~{}\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}}}\Big{[}\Big{(}\frac{d_{\mu}^{\pi^{*}}(s)}{d_{\mu}^{\pi_{t}}(s)}\Big{)}^{2}\Big{]}\leq C_{\infty}^{2}.
Remark 3 (Weak distribution mismatch condition).

Note that a sufficient condition for Assumption 4 is an exploratory initial state distribution μ\mu, which covers the support of the state visitation distribution of dμπd_{\mu}^{\pi^{*}}:

sups𝗌𝗎𝗉𝗉(dμπ)dμπ(s)μ(s)<,\sup_{s\in\mathsf{supp}(d_{\mu}^{\pi^{*}})}\frac{d_{\mu}^{\pi^{*}}(s)}{\mu(s)}<\infty, (36)

since $\sqrt{\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}}}\big[\big(\frac{d_{\mu}^{\pi^{*}}(s)}{d_{\mu}^{\pi_{t}}(s)}\big)^{2}\big]}\leq\frac{1}{1-\gamma}\left\|\frac{d_{\mu}^{\pi^{*}}}{\mu}\right\|_{\infty}$. Hence, if the initial distribution has a sufficiently large support set, then Assumption 4 is satisfied without any assumptions on $\{\pi_{t}:t\geq 0\}$. Together with Proposition 2, this ensures stability of the policy optimization with minimal assumptions on $\mu$.
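On a small finite MDP, $d_{\mu}^{\pi}$ can be computed in closed form from its definition in Section 2.2, which allows one to evaluate the quantity in Assumption 4 directly. A sketch under our own toy setup:

```python
import numpy as np

def discounted_visitation(P, pi, mu, gamma):
    """d_mu^pi = (1 - gamma) * mu^T (I - gamma*P_pi)^{-1}, with P_pi(s,s') = sum_a pi(a|s) P(s'|s,a)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return (1 - gamma) * mu @ np.linalg.inv(np.eye(P.shape[0]) - gamma * P_pi)

def mismatch_C(P, pi_t, pi_star, mu, gamma):
    """sqrt(E_{s ~ d^{pi_t}}[(d^{pi*}(s) / d^{pi_t}(s))^2]), the quantity bounded in Assumption 4."""
    d_t = discounted_visitation(P, pi_t, mu, gamma)
    d_star = discounted_visitation(P, pi_star, mu, gamma)
    return np.sqrt(np.sum(d_t * (d_star / d_t) ** 2))

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
mu = np.array([0.5, 0.5])
pi_t = np.full((2, 2), 0.5)
pi_star = np.array([[0.9, 0.1], [0.2, 0.8]])
print(mismatch_C(P, pi_t, pi_star, mu, gamma=0.9))
```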

The following theorem is one of the main results in this paper, which establishes the convergence bounds of the NAC algorithm.

Theorem 2 (Performance bounds).

Under Assumptions 1-4, Algorithm 2 with R>ν¯R>{\bar{\nu}} and regularization coefficient λ>0\lambda>0 satisfies the following bounds:

  • (1)

    with step-size ηt=1λ(t+1),t0\eta_{t}=\frac{1}{\lambda(t+1)},~{}t\geq 0, we have

    (1γ)mint[T]𝔼[(Vλπ(μ)Vλπt(μ))𝟙A]2R2(1+logT)λT+2Rρ0+4ρ0Tλ+M(ρ1+ε+RqmaxN1/4),(1-\gamma)\min_{t\in[T]}~{}\mathbb{E}[(V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\frac{2R^{2}(1+\log T)}{\lambda T}+{2R\sqrt{\rho_{0}}}+4\rho_{0}T\lambda+M_{\infty}\Big{(}\rho_{1}+\varepsilon+\frac{Rq_{max}}{N^{1/4}}\Big{)},
  • (2)

    with step-size ηt=η(0,1/λ)\eta_{t}=\eta\in(0,1/\lambda), we have

    (1γ)mint[T]𝔼[(Vλπ(μ)Vλπt(μ))𝟙A]λeηλTlog|𝒜|1eηλT+2Rρ0+4ρ0Tλ+M(ρ1+ε+RqmaxN1/4)+2ηR2,(1-\gamma)\min_{t\in[T]}~{}\mathbb{E}[(V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\frac{\lambda e^{-\eta\lambda T}\log|\mathcal{A}|}{1-e^{-\eta\lambda T}}+2R\sqrt{\rho_{0}}+4\rho_{0}T\lambda+M_{\infty}\Big{(}\rho_{1}+\varepsilon+\frac{Rq_{max}}{N^{1/4}}\Big{)}+2\eta R^{2},

for any δ(0,1/3)\delta\in(0,1/3) where (A)13δ\mathbb{P}(A)\geq 1-3\delta over the random initialization of the actor and critic networks,

qmax=4(R+rmax1γ+λlog|𝒜|1γ),q_{max}=4\Big{(}R+\frac{r_{max}}{1-\gamma}+\frac{\lambda\log|\mathcal{A}|}{1-\gamma}\Big{)}, (37)

which is an upper bound on the gradient norm in (22),

M=C(1+πmin1),ρ0=16Rλm(Rλ+log(1δ)+dlog(m)),ρ1=16ν¯m((dlog(m))14+log(Kδ)),\displaystyle\begin{aligned} M_{\infty}&=C_{\infty}(1+\pi_{min}^{-1}),\\ \rho_{0}&=\frac{16R}{\lambda\sqrt{m}}\Big{(}\frac{R}{\lambda}+\sqrt{\log\Big{(}\frac{1}{\delta}\Big{)}}+\sqrt{d\log(m)}\Big{)},\\ \rho_{1}&=\frac{16{\bar{\nu}}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)},\end{aligned}

m=O~(ν¯4ε2)m^{\prime}=\widetilde{O}\Big{(}\frac{{\bar{\nu}}^{4}}{\varepsilon^{2}}\Big{)} and T=O((1+2ν¯)2ν¯2ε4)T^{\prime}=O\Big{(}\frac{(1+2{\bar{\nu}})^{2}{\bar{\nu}}^{2}}{\varepsilon^{4}}\Big{)} (as specified in Theorem 1).

In the following, we characterize the sample complexity, iteration complexity and overparameterization bounds based on Theorem 2.

Corollary 1 (Sample Complexity and Overparameterization Bounds).

For any ϵ>0\epsilon>0 and δ(0,1/3)\delta\in(0,1/3), Algorithm 2 with R>ν¯R>{\bar{\nu}} satisfies:

mint[T]𝔼[(Vλπ(μ)Vλπt(μ))𝟙A]ϵ,\min_{t\in[T]}~{}\mathbb{E}[(V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\epsilon,

where (A)13δ\mathbb{P}(A)\geq 1-3\delta over the random initialization of the actor-critic networks for the following parameters:

  • iteration complexity: T=O~(R2(1γ)λϵ)T=\tilde{O}\Big{(}\frac{R^{2}}{(1-\gamma)\lambda\epsilon}\Big{)},

  • actor network width: m=O~(R8(1γ)4λ4ϵ4+R6log(1/δ)λ2(1γ)4ϵ4+Mν¯2log(K/δ)ϵ2(1γ)2)m=\tilde{O}\Big{(}\frac{R^{8}}{(1-\gamma)^{4}\lambda^{4}\epsilon^{4}}+\frac{R^{6}\log(1/\delta)}{\lambda^{2}(1-\gamma)^{4}\epsilon^{4}}+\frac{M_{\infty}{\bar{\nu}}^{2}\log(K/\delta)}{\epsilon^{2}(1-\gamma)^{2}}\Big{)},

  • critic sample complexity: T=O(M2R4(1γ)2ϵ4)T^{\prime}=O\Big{(}\frac{M_{\infty}^{2}R^{4}}{(1-\gamma)^{2}\epsilon^{4}}\Big{)},

  • critic network width: m=O~(M2R4log(1/δ)(1γ)2ϵ2)m^{\prime}=\tilde{O}\Big{(}\frac{M_{\infty}^{2}R^{4}\log(1/\delta)}{(1-\gamma)^{2}\epsilon^{2}}\Big{)},

  • actor sample complexity: N=O(M4R4qmax4ϵ4(1γ)4)N=O\Big{(}\frac{M_{\infty}^{4}R^{4}q_{max}^{4}}{\epsilon^{4}(1-\gamma)^{4}}\Big{)}.

Hence, the overall sample complexity of the Neural NAC algorithm is O~(1ϵ5)\tilde{O}\Big{(}\frac{1}{\epsilon^{5}}\Big{)}.

Remark 4 (Bias-variance tradeoff in policy optimization).

By Proposition 2, the network parameters evolve such that

supt0maxi[m]θi(0)θi(t)2Rλm,\sup_{t\geq 0}\max_{i\in[m]}\|\theta_{i}(0)-\theta_{i}(t)\|_{2}\leq\frac{R}{\lambda\sqrt{m}},

and supt0sups,aπt(a|s)<1\sup_{t\geq 0}\sup_{s,a}\pi_{t}(a|s)<1. Hence, the NAC always performs a policy search within the class of randomized policies, which leads to fast and stable convergence under minimal regularity conditions. In particular, Assumption 4 is the mildest distributional mismatch condition in on-policy NPG/NAC settings to the best of our knowledge, and it suffices to establish convergence results in Theorem 2. On the other hand, entropy regularization introduces a bias term controlled by λ\lambda, hence the convergence is in the regularized MDP. Another way to see this is that deterministic policies, which require limtθ(t)2=\lim_{t}\|\theta(t)\|_{2}=\infty, may not be achieved for λ>0\lambda>0 since θ(t)\theta(t) is always contained within a compact set. Letting λ0\lambda\downarrow 0 eliminates the bias, but at the same time reduces the convergence speed and may lead to instability due to lack of exploration. Hence, there is a bias-variance tradeoff in policy optimization, controlled by λ>0\lambda>0.

Remark 5 (Different network widths for actor and critic).

Corollary 1 indicates that the actor network requires O~(1/ϵ4)\tilde{O}(1/\epsilon^{4}) neurons while the critic network requires O~(1/ϵ2)\tilde{O}(1/\epsilon^{2}) although both approximate (soft) state-action value functions. This difference is because the actor network is required to uniformly approximate all state-action value functions over the trajectory, while the critic network approximates (pointwise) a single state-action value function at each iteration.

Remark 6 (Fast initial convergence rate under constant step-sizes).

The second part of Theorem 2 indicates that the convergence rate is eΩ(T)e^{-\Omega(T)} under a constant step-size η(0,1/λ)\eta\in(0,1/\lambda), while there is an additional error term 2ηR22\eta R^{2}. This justifies the common practice of “halving the step-size” in optimization (see, e.g., [56]) for the specific case of natural actor-critic that we investigate: one achieves a fast convergence rate with a constant step-size until the optimization stalls, then the process is repeated after halving the step-size.
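The two step-size schedules of Theorem 2 and the halving heuristic of Remark 6 correspond to the simple rules below (a sketch; the stall test is a placeholder of our own):

```python
def eta_decaying(t, lam):
    """Schedule of Theorem 2(1): eta_t = 1 / (lambda * (t + 1))."""
    return 1.0 / (lam * (t + 1))

def eta_halving(eta, stalled):
    """Remark 6: keep a constant eta in (0, 1/lambda); halve it once progress stalls."""
    return eta / 2.0 if stalled else eta
```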

4.4.2 Performance Bounds under a Strong Distribution Mismatch Condition

In the following, we consider the standard distribution mismatch condition (e.g., in [55, 22]) and establish sample complexity and overparameterization bounds based on Theorem 2, for the unregularized MDP.

Assumption 4’ (Strong distribution mismatch condition).

There exists a constant C~<\tilde{C}_{\infty}<\infty such that

supt0𝔼(s,a)dμπtπt(|s)[(dμπ(s)π(a|s)dμπt(s)πt(a|s))2]C~2.\sup_{t\geq 0}~{}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi_{t}}\otimes\pi_{t}(\cdot|s)}\Big{[}\Big{(}\frac{d_{\mu}^{\pi^{*}}(s)\pi^{*}(a|s)}{d_{\mu}^{\pi_{t}}(s)\pi_{t}(a|s)}\Big{)}^{2}\Big{]}\leq\tilde{C}_{\infty}^{2}. (38)

Note that Assumption 4' implies Assumption 4, and it is a considerably stronger assumption since it requires the policies $\{\pi_{t}:t=0,1,\ldots,T-1\}$ to be sufficiently exploratory throughout policy optimization.

Corollary 2.

Under Assumptions 1-3 and 4', for any $\epsilon>0$ and $\delta\in(0,1/3)$, Algorithm 2 with $R>{\bar{\nu}}$ and $\lambda=O(1/\sqrt{T})$ satisfies:

mint[T]𝔼[(maxπVπ(μ)Vπt(μ))𝟙A]ϵ,\min_{t\in[T]}~{}\mathbb{E}[(\max_{\pi}V^{\pi}(\mu)-V^{\pi_{t}}(\mu))\mathbbm{1}_{A}]\leq\epsilon,

where (A)13δ\mathbb{P}(A)\geq 1-3\delta over the random initialization of the actor-critic networks for the following parameters:

  • iteration complexity: T=O~(R2(1γ)ϵ2)T=\tilde{O}\Big{(}\frac{R^{2}}{(1-\gamma)\epsilon^{2}}\Big{)},

  • actor network width: m=O~(R8(1γ)4ϵ8+R6log(1/δ)(1γ)4ϵ6+C~ν¯2log(K/δ)ϵ2(1γ)2)m=\tilde{O}\Big{(}\frac{R^{8}}{(1-\gamma)^{4}\epsilon^{8}}+\frac{R^{6}\log(1/\delta)}{(1-\gamma)^{4}\epsilon^{6}}+\frac{\tilde{C}_{\infty}{\bar{\nu}}^{2}\log(K/\delta)}{\epsilon^{2}(1-\gamma)^{2}}\Big{)},

  • critic sample complexity: T=O(M~2R4(1γ)2ϵ4)T^{\prime}=O\Big{(}\frac{\tilde{M}_{\infty}^{2}R^{4}}{(1-\gamma)^{2}\epsilon^{4}}\Big{)},

  • critic network width: m=O~(M~2R4log(1/δ)(1γ)2ϵ2)m^{\prime}=\tilde{O}\Big{(}\frac{\tilde{M}_{\infty}^{2}R^{4}\log(1/\delta)}{(1-\gamma)^{2}\epsilon^{2}}\Big{)},

  • actor sample complexity: N=O(M~4R4qmax4ϵ4(1γ)4)N=O\Big{(}\frac{\tilde{M}_{\infty}^{4}R^{4}q_{max}^{4}}{\epsilon^{4}(1-\gamma)^{4}}\Big{)},

where M~=C~(1+πmin1)\tilde{M}_{\infty}=\tilde{C}_{\infty}(1+\pi_{min}^{-1}).

Hence, the overall sample complexity of Neural NAC for finding an ϵ\epsilon-optimal policy of the unregularized MDP is O~(1ϵ6)\tilde{O}\Big{(}\frac{1}{\epsilon^{6}}\Big{)}.

4.5 Comparison With Prior Works

Among the existing works that theoretically investigate policy gradient methods, the most related one is [22], which considers Neural PG/NPG methods equipped with a two-layer neural network. We point out the key differences between our work and these prior works:

  • Prior works do not incorporate entropy regularization. As a result, they need a stronger concentrability coefficient assumption such as Assumption 4’ instead of the weaker Assumption 4 under which we are able to prove our main results.

  • In the proofs in the subsequent section, it will become clear that one needs to uniformly bound the function approximation error in the actor part of our algorithm in order to handle the dependencies between the parameter values across iterations and the NTRF features. We propose new techniques to handle this point, which was not addressed in the prior work.

  • While our algorithm is similar in spirit to the algorithms analyzed in the prior works, we also incorporate a number of important algorithmic ideas that are used in practice (e.g., entropy regularization, averaging, gradient clipping). As a result, we have to use different analysis techniques. As a consequence of these algorithmic and analytical techniques, we obtain considerably sharper sample complexity and overparameterization bounds (see Table 1). Interestingly, all of these algorithmic improvements to the original NAC algorithms seem to be important to obtain the sharper bounds.

  • We employ a symmetric initialization scheme proposed in [48] to ensure that f0(s,a)=0f_{0}(s,a)=0 for all s,as,a despite the random initialization. As a consequence of symmetric initialization, we eliminate the impact of f0f_{0} in the infinite width limit, which is effectively a noise term ϵ0\epsilon_{0} in the performance bounds [57].

Paper Algorithm Width of actor, critic Sample comp. Error Condition Objective
[22] Neural NPG O(1/ϵ12)O(1/\epsilon^{12}), O(1/ϵ12)O(1/\epsilon^{12}) O(1/ϵ14)O(1/\epsilon^{14}) ϵ+ϵ0\epsilon+\epsilon_{0} Strong Unregularized
Ours Neural NAC O~(1/ϵ4)\tilde{O}(1/\epsilon^{4}), O~(1/ϵ2)\tilde{O}(1/\epsilon^{2}) O~(1/ϵ5)\tilde{O}(1/\epsilon^{5}) ϵ\epsilon Weak Regularized
Ours Neural NAC O~(1/ϵ8)\tilde{O}(1/\epsilon^{8}), O~(1/ϵ2)\tilde{O}(1/\epsilon^{2}) O~(1/ϵ6)\tilde{O}(1/\epsilon^{6}) ϵ\epsilon Strong Unregularized
Table 1: The overparameterization and sample complexity bounds for variants of natural policy gradient with neural network approximation.

5 Finite-Time Analysis of Neural NAC

In this section, we provide the convergence analysis of the algorithm.

5.1 Analysis of Neural Network at Initialization

For δ(0,1)\delta\in(0,1) and any R0>0R_{0}>0, let

ρ0(R0,m,δ)=16R0m(R0+log(1/δ)+dlog(m)),\rho_{0}(R_{0},m,\delta)=\frac{16R_{0}}{\sqrt{m}}\Big{(}R_{0}+\sqrt{\log(1/\delta)}+\sqrt{d\log(m)}\Big{)}, (39)

and define

A0={supx:x21R0mi=1m𝟙{|θi(0)x|R0m}ρ0(R0,m,δ)}.A_{0}=\Big{\{}\sup_{x:\|x\|_{2}\leq 1}\frac{R_{0}}{m}\sum_{i=1}^{m}\mathbbm{1}\Big{\{}|\theta_{i}^{\top}(0)x|\leq\frac{R_{0}}{\sqrt{m}}\Big{\}}\leq\rho_{0}(R_{0},m,\delta)\Big{\}}. (40)

The following lemma bounds the deviation of the neural network from its linear approximation around the initialization, and it will be used throughout the convergence analysis.

Lemma 2.

Let θi(0)𝒩(0,Id)\theta_{i}(0)\sim\mathcal{N}(0,I_{d}) for all i[m]i\in[m], θm,R0d(θ(0))\theta\in\mathcal{B}_{m,R_{0}}^{d}(\theta(0)) and θm,R0d(0)\theta^{\prime}\in\mathcal{B}_{m,R_{0}}^{d}(0) for some R0>0R_{0}>0. Then,

supxd:x211mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(0)x|\displaystyle\sup_{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|} ρ0(R0,m,δ),\displaystyle\leq\rho_{0}(R_{0},m,\delta), (41)
supxd:x211mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θix|\displaystyle\sup_{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}x\Big{|} ρ0(R0,m,δ),\displaystyle\leq\rho_{0}(R_{0},m,\delta), (42)
supxd:x211mi=1m|(𝟙{θix0}𝟙{θi(0)x0})xθi|\displaystyle\sup_{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}x^{\top}\theta^{\prime}_{i}\Big{|} ρ0(R0,m,δ),\displaystyle\leq\rho_{0}(R_{0},m,\delta), (43)

under the event A0A_{0} defined in (40), which holds with probability at least 1δ1-\delta over the random initialization of the actor.

Proof.

Let Ωd={xd:x21}\Omega_{d}=\{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1\}. For xΩdx\in\Omega_{d}, let

S(x,θ)={i[m]:𝟙{θix0}𝟙{θi(0)x0}}.S(x,\theta)=\big{\{}i\in[m]:\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}\neq\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\big{\}}.

For any iS(x,θ)i\in S(x,\theta), the following is true:

|θi(0)x|\displaystyle|\theta^{\top}_{i}(0)x| |θi(0)xθix|θiθi(0)2,\displaystyle\leq|\theta^{\top}_{i}(0)x-\theta_{i}^{\top}x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}, (44)

where the first inequality is true since sign(θi(0)x)sign(θix)sign(\theta^{\top}_{i}(0)x)\neq sign(\theta^{\top}_{i}x) and the second inequality follows from Cauchy-Schwarz inequality and xΩdx\in\Omega_{d}. Therefore,

S(x,θ){i[m]:|θi(0)x|θiθi(0)2}.S(x,\theta)\subset\{i\in[m]:|\theta_{i}^{\top}(0)x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}\}.

Since

1mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(0)x|=1miS(x,θ)|θi(0)x|,\displaystyle\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|}=\frac{1}{\sqrt{m}}\sum_{i\in S(x,\theta)}|\theta_{i}^{\top}(0)x|,

we have:

1mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(0)x|1mi=1m𝟙{|θi(0)x|θiθi(0)2}θiθi(0)2.\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|}\leq\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\mathbbm{1}\{|\theta_{i}^{\top}(0)x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}\}\|\theta_{i}-\theta_{i}(0)\|_{2}.

Since maxi[m]θiθi(0)2R0m,\max_{i\in[m]}\|\theta_{i}-\theta_{i}(0)\|_{2}\leq\frac{R_{0}}{\sqrt{m}}, the above inequality leads to the following:

\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(0)x\Big{|}\leq\frac{R_{0}}{m}\sum_{i=1}^{m}\mathbbm{1}\Big{\{}|\theta_{i}^{\top}(0)x|\leq\frac{R_{0}}{\sqrt{m}}\Big{\}}.

Taking supremum over xΩdx\in\Omega_{d}, and using Lemma 4 in [58] on the RHS of the above inequality concludes the proof.

In order to prove (42), similar to (44), we have the following inequality:

|θix||θixθi(0)x|θiθi(0)2.|\theta_{i}^{\top}x|\leq|\theta_{i}^{\top}x-\theta_{i}^{\top}(0)x|\leq\|\theta_{i}-\theta_{i}(0)\|_{2}.

Using this, the proofs of (42) and (43) follow from exactly the same steps. ∎

Note that Lemma 2 is an extension of the concentration bounds in [45, 41, 58] for neural networks. Unlike those results, our concentration bound provides uniform convergence over Ωd={xd:x21}\Omega_{d}=\{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1\} rather than over finitely many points, and is therefore stronger than the concentration bounds used to analyze neural networks in the literature [45, 41]. We need these uniform concentration inequalities to address the challenges arising from the dynamics of policy optimization, e.g., distributional shift.
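As a rough sanity check on the scale of this deviation term, the snippet below (illustrative only) evaluates ρ0(R0, m, δ) from (39) for a few widths, showing its O~(1/√m) decay for a fixed radius R0; the input dimension d and the numerical values of R0 and δ are placeholders.

```python
import math

# Evaluate the deviation bound rho_0(R_0, m, delta) of Eq. (39) for a few widths m.
# d, R_0 and delta below are placeholder values; the point is the ~1/sqrt(m) decay.
def rho0(R0, m, delta, d=10):
    return (16.0 * R0 / math.sqrt(m)) * (
        R0 + math.sqrt(math.log(1.0 / delta)) + math.sqrt(d * math.log(m))
    )

for m in (10**3, 10**4, 10**5, 10**6):
    print(f"m = {m:>8d}   rho_0 = {rho0(R0=5.0, m=m, delta=0.01):.4f}")
```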

5.2 Impact of Entropy Regularization

First, we analyze the impact of entropy regularization, which will yield key results in the convergence analysis.

Proof of Proposition 2.

Recall from Line 11 in Algorithm 3 that the policy update is as follows:

θ(t+1)=θ(t)+ηtutηtλ(θ(t)θ(0)).\theta(t+1)=\theta(t)+\eta_{t}u_{t}-\eta_{t}\lambda(\theta(t)-\theta(0)).

Let θ¯(t)=θ(t)θ(0)\overline{\theta}(t)=\theta(t)-\theta(0) for all t0t\geq 0. Then, the update rule can be written as:

θ¯(t+1)=θ¯(t)(1ηtλ)+ηtut.\overline{\theta}(t+1)=\overline{\theta}(t)(1-\eta_{t}\lambda)+\eta_{t}u_{t}.

Since the step-size is ηtλ=1t+1\eta_{t}\lambda=\frac{1}{t+1}, we have:

θ¯(t+1)=1λ(t+1)k=0tuk,\overline{\theta}(t+1)=\frac{1}{\lambda(t+1)}\sum_{k=0}^{t}u_{k},

by induction. Hence, by triangle inequality:

θ¯i(t+1)2=θi(t+1)θi(0)21λ(t+1)k=0tui,k2,\|\overline{\theta}_{i}(t+1)\|_{2}=\|\theta_{i}(t+1)-\theta_{i}(0)\|_{2}\leq\frac{1}{\lambda(t+1)}\sum_{k=0}^{t}\|u_{i,k}\|_{2}, (45)

for any i[m]i\in[m]. Note that ukm,Rd(0)u_{k}\in\mathcal{B}_{m,R}^{d}(0) as a consequence of projection, therefore ui,kR/m\|u_{i,k}\|\leq R/\sqrt{m} for all i[m]i\in[m]. Hence, by (45), we conclude that

maxi[m]θi(t)θi(0)2Rλm,\max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}\leq\frac{R}{\lambda\sqrt{m}}, (46)

for any t0t\geq 0. Also, since wt=utλ(θ(t)θ(0))w_{t}=u_{t}-\lambda(\theta(t)-\theta(0)), we have:

supt0wt2ut2+λθ(t)θ(0)22R.\sup_{t\geq 0}\|w_{t}\|_{2}\leq\|u_{t}\|_{2}+\lambda\|\theta(t)-\theta(0)\|_{2}\leq 2R. (47)

Under a constant step-size η(0,1/λ)\eta\in(0,1/\lambda), we can expand the parameter movement for any t1t\geq 1 as follows:

\displaystyle\begin{aligned}
\overline{\theta}_{i}(t+1)&=\overline{\theta}_{i}(t)\cdot(1-\eta\lambda)+\eta\cdot u_{i,t},\\
&=\overline{\theta}_{i}(t-1)\cdot(1-\eta\lambda)^{2}+\eta(1-\eta\lambda)u_{i,t-1}+\eta u_{i,t},\\
&\;\;\vdots\\
&=\overline{\theta}_{i}(0)(1-\eta\lambda)^{t+1}+\eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}u_{i,t-k}=\eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}u_{i,t-k},
\end{aligned}

for any neuron i[m]i\in[m]. Then, we have:

θi(t+1)θi(0)2ηk=0t(1ηλ)kui,k2\displaystyle\|\theta_{i}(t+1)-\theta_{i}(0)\|_{2}\leq\eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}\|u_{i,k}\|_{2} Rλm(1(1ηλ)t+1),\displaystyle\leq\frac{R}{\lambda\sqrt{m}}(1-(1-\eta\lambda)^{t+1}),
Rλm,\displaystyle\leq\frac{R}{\lambda\sqrt{m}}, (48)

which follows from the triangle inequality, the bound \|u_{i,k}\|_{2}\leq R/\sqrt{m} due to the projection, the geometric sum \eta\sum_{k=0}^{t}(1-\eta\lambda)^{k}=\frac{1}{\lambda}\big{(}1-(1-\eta\lambda)^{t+1}\big{)}, and the fact that 1-(1-\eta\lambda)^{t+1}\leq 1 for any t\geq 0.

In order to prove the lower bound for inft0,(s,a)𝒮×𝒜πt(a|s)\inf_{t\geq 0,(s,a)\in\mathcal{S}\times\mathcal{A}}\pi_{t}(a|s), first recall that πt(a|s)exp(ft(s,a))\pi_{t}(a|s)\propto\exp(f_{t}(s,a)). Hence, a uniform upper bound on |ft(s,a)||f_{t}(s,a)| over all t0t\geq 0 and (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} suffices to lower bound πt(a|s)\pi_{t}(a|s). By symmetric initialization, f0(s,a)=0f_{0}(s,a)=0 for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Hence,

f_{t}(s,a)=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}[\theta_{i}(t)-\theta_{i}(0)]^{\top}(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\\ +\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}\big{(}\mathbbm{1}\{\theta_{i}^{\top}(t)(s,a)\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\big{)}\theta_{i}^{\top}(t)(s,a). (49)

First, we bound the first summand on the RHS of (49) by using (46) and triangle inequality:

\sup_{s,a}\Big{|}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}c_{i}[\theta_{i}(t)-\theta_{i}(0)]^{\top}(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\Big{|}\leq\frac{R}{\lambda}, (50)

since |c_{i}|=1, \|(s,a)\|_{2}\leq 1, and the indicator is at most one. For the second term in (49), first note that \max_{i\in[m]}\|\theta_{i}(t)-\theta_{i}(0)\|_{2}\leq\frac{R}{\lambda\sqrt{m}}, so we can use Lemma 2. By the triangle inequality and Lemma 2:

\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\big{(}\mathbbm{1}\{\theta_{i}^{\top}(t)(s,a)\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}\big{)}\theta_{i}^{\top}(t)(s,a)\Big{|}\leq\rho_{0}\Big{(}\frac{R}{\lambda},m,\delta\Big{)},

with probability at least 1δ1-\delta over the random initialization of the actor network. Hence, with probability at least 1δ1-\delta,

sups,a|ft(s,a)|R/λ+ρ0(Rλ,m,δ),\sup_{s,a}|f_{t}(s,a)|\leq R/\lambda+\rho_{0}\Big{(}\frac{R}{\lambda},m,\delta\Big{)},

and \pi_{t}(a|s)\geq\frac{1}{|\mathcal{A}|}e^{-\frac{2R}{\lambda}-2\rho_{0}\big{(}\frac{R}{\lambda},m,\delta\big{)}}, which concludes the proof. ∎
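The parameter-drift bound (46) can be checked numerically. The sketch below is only an illustration, with arbitrary placeholder choices of m, d, R, λ and random update directions in place of the actual NAC directions u_t; it simulates the update θ(t+1) = θ(t) + η_t u_t − η_t λ(θ(t) − θ(0)) with η_t λ = 1/(t+1) and verifies that max_i ‖θ_i(t) − θ_i(0)‖_2 never exceeds R/(λ√m).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, R, lam, T = 64, 5, 1.0, 0.1, 500     # placeholder sizes and constants

theta0 = rng.normal(size=(m, d))           # theta_i(0) ~ N(0, I_d)
theta = theta0.copy()

for t in range(T):
    eta = 1.0 / (lam * (t + 1))            # eta_t * lambda = 1/(t+1)
    u = rng.normal(size=(m, d))            # stand-in for the NAC direction u_t
    # projection onto B_{m,R}^d(0): each row (neuron) has norm at most R/sqrt(m)
    row_norms = np.linalg.norm(u, axis=1, keepdims=True)
    u = u * np.minimum(1.0, (R / np.sqrt(m)) / row_norms)
    theta = theta + eta * u - eta * lam * (theta - theta0)

drift = np.linalg.norm(theta - theta0, axis=1).max()
print(f"max_i ||theta_i(T)-theta_i(0)||_2 = {drift:.4f} <= R/(lam*sqrt(m)) = {R/(lam*np.sqrt(m)):.4f}")
```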

5.3 Lyapunov Drift Analysis

First, we present a key lemma which will be used throughout the analysis.

Lemma 3 (Log-linear approximation error).

Let

π~t(a|s)=exp(θf0(s,a)θ(t))a𝒜exp(θf0(s,a)θ(t)),\widetilde{\pi}_{t}(a|s)=\frac{\exp(\nabla_{\theta}^{\top}f_{0}(s,a)\theta(t))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(\nabla_{\theta}^{\top}f_{0}(s,a^{\prime})\theta(t))},

be the log-linear approximation of the policy πt(a|s)\pi_{t}(a|s). Then, for any δ(0,1)\delta\in(0,1), we have:

supt0sups,a|logπ~t(a|s)πt(a|s)|3ρ0(Rλ,m,δ),\sup_{t\geq 0}~{}\sup_{s,a}\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq 3\rho_{0}\Big{(}\frac{R}{\lambda},m,\delta\Big{)}, (51)

over A0A_{0}.

Proof.

Note that f_{t}(s,a)=\nabla^{\top}f_{t}(s,a)\,\theta(t) for a ReLU neural network, by the positive homogeneity of the ReLU activation. By using this, we can write the log-linear approximation error as follows:

|logπ~t(a|s)πt(a|s)||(ft(s,a)f0(s,a))θ(t)|+|logaef0(s,a)θ(t)e(ft(s,a)f0(s,a))θ(t)aef0(s,a)θ(t)|.\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|+\Big{|}\log\frac{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}e^{(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)}}{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}}\Big{|}. (52)

By log-sum inequality (Theorem 2.7.1 in [59]), for any xa,ya>0x_{a},y_{a}>0,

logaxaayaaxaaxalogxaya.\log\frac{\sum_{a}x_{a}}{\sum_{a}y_{a}}\leq\sum_{a}\frac{x_{a}}{\sum_{a^{\prime}}x_{a^{\prime}}}\log\frac{x_{a}}{y_{a}}.

Setting xa=ef0(s,a)θ(t)x_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)} and ya=ef0(s,a)θ(t)e(ft(s,a)f0(s,a))θ(t)y_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)}e^{(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)}, we have:

logaef0(s,a)θ(t)aeft(s,a)θ(t)aπ~t(a|s)|(ft(s,a)f0(s,a))θ(t)|.\log\frac{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}}{\sum_{a^{\prime}}e^{\nabla^{\top}f_{t}(s,a^{\prime})\theta(t)}}\leq\sum_{a^{\prime}}\widetilde{\pi}_{t}(a^{\prime}|s)|(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)|. (53)

Setting ya=ef0(s,a)θ(t)y_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)} and xa=ef0(s,a)θ(t)e(ft(s,a)f0(s,a))θ(t)x_{a}=e^{\nabla^{\top}f_{0}(s,a)\theta(t)}e^{(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)}, we have:

logaeft(s,a)θ(t)aef0(s,a)θ(t)aπt(a|s)|(ft(s,a)f0(s,a))θ(t)|.\log\frac{\sum_{a^{\prime}}e^{\nabla^{\top}f_{t}(s,a^{\prime})\theta(t)}}{\sum_{a^{\prime}}e^{\nabla^{\top}f_{0}(s,a^{\prime})\theta(t)}}\leq\sum_{a^{\prime}}\pi_{t}(a^{\prime}|s)|(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)|. (54)

Using (53) and (54) to bound the last term in (52), we obtain:

|logπ~t(a|s)πt(a|s)||(ft(s,a)f0(s,a))θ(t)|+a[πt(a|s)+π~t(a|s)]|(ft(s,a)f0(s,a))θ(t)|.\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|+\sum_{a^{\prime}}\big{[}\pi_{t}(a^{\prime}|s)+\widetilde{\pi}_{t}(a^{\prime}|s)\big{]}|(\nabla f_{t}(s,a^{\prime})-\nabla f_{0}(s,a^{\prime}))^{\top}\theta(t)|. (55)

By Lemma 2, under the event A0A_{0}, |(ft(s,a)f0(s,a))θ(t)|ρ0(R/λ,m,δ)|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|\leq\rho_{0}(R/\lambda,m,\delta) for all t0,s𝒮,a𝒜t\geq 0,s\in\mathcal{S},a\in\mathcal{A}. Hence, under the event A0A_{0}, we have:

supt0sups,a|logπ~t(a|s)πt(a|s)|3ρ0(R/λ,m,δ),\sup_{t\geq 0}~{}\sup_{s,a}~{}\Big{|}\log\frac{\widetilde{\pi}_{t}(a|s)}{\pi_{t}(a|s)}\Big{|}\leq 3\rho_{0}(R/\lambda,m,\delta),

which concludes the proof. ∎

The following result is standard in the analysis of policy gradient methods [60, 39].

Lemma 4 (Lemma 5, [39]).

For any θ,θd\theta,\theta^{\prime}\in\mathbb{R}^{d} and any initial state distribution μ\mu, we have:

Vλπθ(μ)Vλπθ(μ)=11γ𝔼sdμπθ,aπθ(|s)[Aλπθ(s,a)+λlogπθ(a|s)πθ(a|s)],V_{\lambda}^{\pi_{\theta}}(\mu)-V_{\lambda}^{\pi_{\theta^{\prime}}}(\mu)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\Big{[}A_{\lambda}^{\pi_{\theta^{\prime}}}(s,a)+\lambda\log\frac{\pi_{\theta^{\prime}}(a|s)}{\pi_{\theta}(a|s)}\Big{]}, (56)

where AλπθA_{\lambda}^{\pi_{\theta}} is the advantage function defined in (8).

Lemma 4 is an extension of the performance difference lemma in [60], and the proof can be found in [39]. In the following, we provide the main Lyapunov drift, which is central to the proof. This Lyapunov function is widely used in the analysis of natural gradient descent algorithms [14, 17, 22, 39].

Definition 1 (Potential function).

For any policy πΠ\pi\in\Pi, the potential function Ψ\Psi is defined as follows:

Ψ(π)=𝔼sdμπ[DKL(π(|s)π(|s))].\Psi(\pi)=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}D_{KL}\big{(}\pi^{*}(\cdot|s)\|\pi(\cdot|s)\big{)}\Big{]}. (57)
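For a finite state-action space, the potential function (57) can be evaluated directly; the snippet below is a minimal illustration with arbitrary toy distributions (the state distribution and the two policies are placeholders).

```python
import numpy as np

def potential(d_star, pi_star, pi):
    """Psi(pi) = E_{s ~ d_mu^{pi*}}[ KL(pi*(.|s) || pi(.|s)) ] for finite S and A.
    d_star: (S,) state distribution; pi_star, pi: (S, A) row-stochastic matrices."""
    kl_per_state = np.sum(pi_star * (np.log(pi_star) - np.log(pi)), axis=1)
    return float(np.dot(d_star, kl_per_state))

# toy example with 2 states and 3 actions (arbitrary numbers, illustration only)
d_star  = np.array([0.6, 0.4])
pi_star = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
pi_unif = np.full((2, 3), 1.0 / 3.0)
print(f"Psi(uniform policy) = {potential(d_star, pi_star, pi_unif):.4f}")
```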
Lemma 5 (Lyapunov drift).

For any t0t\geq 0, let Δt=Vλπ(μ)Vλπt(μ)\Delta_{t}=V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu). Then,

Ψ(πt+1)Ψ(πt)ηtλΨ(πt)ηt(1γ)Δt+2ηt2R2+ηt𝔼sdμπ,aπt(|s)[f0(s,a)utQλπt(s,a)]ηt𝔼sdμπ,aπ(|s)[f0(s,a)utQλπt(s,a)]+(ηtλ+6)ρ0(R/λ,m,δ)+2ηtRρ0(R/λ,m,δ),\displaystyle\begin{aligned} \Psi(\pi_{t+1})-\Psi(\pi_{t})\leq&-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}+2\eta_{t}^{2}R^{2}\\ &+\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi_{t}(\cdot|s)}\Big{[}\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)\Big{]}\\ &-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)\Big{]}\\ &+(\eta_{t}\lambda+6)\rho_{0}(R/\lambda,m,\delta)+2\eta_{t}R\sqrt{\rho_{0}(R/\lambda,m,\delta)},\end{aligned} (58)

in the event A0A_{0} which holds with probability at least 1δ1-\delta over the random initialization of the actor.

Proof.

First, note that the log-linear approximation of πθ\pi_{\theta} is smooth [17]:

logπ~θ(a|s)logπ~θ(a|s)2θθ2,\|\nabla\log\widetilde{\pi}_{\theta}(a|s)-\nabla\log\widetilde{\pi}_{\theta^{\prime}}(a|s)\|_{2}\leq\|\theta-\theta^{\prime}\|_{2}, (59)

for any s,as,a since f0(s,a)21\|\nabla f_{0}(s,a)\|_{2}\leq 1. Also,

Ψ(πt+1)Ψ(πt)=𝔼sdμπ,aπ(|s)[logπt(a|s)πt+1(a|s)].\Psi(\pi_{t+1})-\Psi(\pi_{t})=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\log\frac{\pi_{t}(a|s)}{\pi_{t+1}(a|s)}\Big{]}.

To exploit the smoothness of the log-linear approximation, we add and subtract the log-linear approximation terms and obtain:

Ψ(πt+1)Ψ(πt)\displaystyle\Psi(\pi_{t+1})-\Psi(\pi_{t}) =𝔼sdμπ,aπ(|s)[logπ~t(a|s)π~t+1(a|s)+logπt(a|s)π~t(a|s)+logπ~t+1(a|s)πt+1(a|s)].\displaystyle=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\log\frac{\widetilde{\pi}_{t}(a|s)}{\widetilde{\pi}_{t+1}(a|s)}+\log\frac{\pi_{t}(a|s)}{\widetilde{\pi}_{t}(a|s)}+\log\frac{\widetilde{\pi}_{t+1}(a|s)}{{\pi}_{t+1}(a|s)}\Big{]}.

By Lemma 3, the last two terms are each bounded by 3\rho_{0}(R/\lambda,m,\delta), which yields the 6\rho_{0}(R/\lambda,m,\delta) contribution in (58). Let

Dt=𝔼sdμπ,aπ(|s)[logπ~t(a|s)π~t+1(a|s)].D_{t}=\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\Big{[}\log\frac{\widetilde{\pi}_{t}(a|s)}{\widetilde{\pi}_{t+1}(a|s)}\Big{]}.

Then, by the smoothness of the log-linear approximation, we have:

Dtηt𝔼sdμπ,aπ(|s)θlogπ~t(a|s)wt+ηt2wt222,D_{t}\leq-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}\nabla_{\theta}^{\top}\log\widetilde{\pi}_{t}(a|s)w_{t}+\frac{\eta_{t}^{2}\|w_{t}\|_{2}^{2}}{2},

Recall Δt=Vλπ(μ)Vλπt(μ)\Delta_{t}=V_{\lambda}^{\pi^{*}}(\mu)-V_{\lambda}^{\pi_{t}}(\mu). Using Lemma 4 and the definition of the advantage function, we obtain:

Dt\displaystyle D_{t} ηtλΨ(πt)ηt(1γ)Δtηt𝔼sdμπ(s),aπt(a|s)[Qλπt(s,a)λlogπt(a|s)]\displaystyle\leq-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}-\eta_{t}\mathbb{E}_{{s\sim d_{\mu}^{\pi^{*}}(s),a\sim\pi_{t}(a|s)}}[Q_{\lambda}^{\pi_{t}}(s,a)-\lambda\log\pi_{t}(a|s)]
ηt𝔼sdμπ,aπ(|s)[logπ~t(a|s)wtqλπt(s,a)]+ηt2wt222.\displaystyle-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}[\nabla^{\top}\log\widetilde{\pi}_{t}(a|s)w_{t}-q_{\lambda}^{\pi_{t}}(s,a)]+\frac{\eta_{t}^{2}\|w_{t}\|_{2}^{2}}{2}. (60)

Since logπ~t(a|s)=f0(s,a)𝔼aπ~t(|s)[f0(s,a)]\nabla\log\widetilde{\pi}_{t}(a|s)=\nabla f_{0}(s,a)-\mathbb{E}_{a^{\prime}\sim\widetilde{\pi}_{t}(\cdot|s)}[\nabla f_{0}(s,a^{\prime})], we obtain the following inequality:

Dt\displaystyle D_{t} ηtλΨ(πt)ηt(1γ)Δt\displaystyle\leq-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}
+ηt𝔼sdμπ(s),aπt(a|s)[f0(s,a)wtQλπt(s,a)+λft(s,a)]\displaystyle+\eta_{t}\mathbb{E}_{{s\sim d_{\mu}^{\pi^{*}}(s),a\sim\pi_{t}(a|s)}}[\nabla^{\top}f_{0}(s,a)w_{t}-Q_{\lambda}^{\pi_{t}}(s,a)+\lambda f_{t}(s,a)]
ηt𝔼sdμπ,aπ(|s)[f0(s,a)wtQλπt(s,a)+λft(s,a)]\displaystyle-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}[\nabla^{\top}f_{0}(s,a)w_{t}-Q_{\lambda}^{\pi_{t}}(s,a)+\lambda f_{t}(s,a)]
+ηt𝔼sdμπ[a𝒜(π~t(a|s)πt(a|s))f0(s,a)wt]+ηt2wt222.\displaystyle+\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}[\sum_{a\in\mathcal{A}}(\widetilde{\pi}_{t}(a|s)-\pi_{t}(a|s))\nabla^{\top}f_{0}(s,a)w_{t}]+\frac{\eta_{t}^{2}\|w_{t}\|_{2}^{2}}{2}.

By the definition of wt=utλ[θ(t)θ(0)]w_{t}=u_{t}-\lambda[\theta(t)-\theta(0)] and the fact that f0(s,a)=0f_{0}(s,a)=0 due to the symmetric initialization, we have:

f0(s,a)wt=f0(s,a)utλf0(s,a)θ(t).\nabla^{\top}f_{0}(s,a)w_{t}=\nabla^{\top}f_{0}(s,a)u_{t}-\lambda\nabla^{\top}f_{0}(s,a)\theta(t).

Substituting this identity into the above inequality, we have:

DtηtλΨ(πt)ηt(1γ)Δt+ηt𝔼sdμπ(s),aπt(a|s)[f0(s,a)utQλπt(s,a)]ηt𝔼sdμπ,aπ(|s)[f0(s,a)utQλπt(s,a)]+2ηtR𝔼sdμπ[a|π~t(a|s)πt(a|s)|]+ηtλ𝔼sdμπ[a𝒜|πt(a|s)π(a|s)|(ft(s,a)f0(s,a))θ(t)]+2ηt2R2,\displaystyle\begin{aligned} D_{t}&\leq-\eta_{t}\lambda\Psi(\pi_{t})-\eta_{t}(1-\gamma)\Delta_{t}\\ &+\eta_{t}\mathbb{E}_{{s\sim d_{\mu}^{\pi^{*}}(s),a\sim\pi_{t}(a|s)}}[\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)]\\ &-\eta_{t}\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}},a\sim\pi^{*}(\cdot|s)}[\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)]\\ &+2\eta_{t}R\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}\sum_{a}|\widetilde{\pi}_{t}(a|s)-\pi_{t}(a|s)|\Big{]}\\ &+\eta_{t}\lambda\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}\sum_{a\in\mathcal{A}}|\pi_{t}(a|s)-\pi^{*}(a|s)|\cdot(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)\Big{]}+2\eta_{t}^{2}R^{2},\end{aligned} (61)

where we used (47) to bound wt2\|w_{t}\|_{2}. Furthermore, note that

|(ft(s,a)f0(s,a))θ(t)|1mi=1m|(𝟙{θix0}𝟙{θi(0)x0})θi(t)x|,|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|\leq\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\Big{|}\Big{(}\mathbbm{1}\{\theta_{i}^{\top}x\geq 0\}-\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{)}\theta_{i}^{\top}(t)x\Big{|},

for any x=(s,a)dx=(s,a)^{\top}\in\mathbb{R}^{d}. Thus, by Lemma 2,

sups,a|(ft(s,a)f0(s,a))θ(t)|ρ0(R/λ,m,δ).\sup_{s,a}|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}\theta(t)|\leq\rho_{0}(R/\lambda,m,\delta).

This bounds the penultimate term in (61). Finally, in order to bound the fifth term in (61), we use Pinsker’s inequality and then Lemma 3:

sups𝒮π~t(|s)πt(|s)1sups𝒮aπt(a|s)logπt(a|s)π~t(a|s)ρ0(R/λ,m,δ).\sup_{s\in\mathcal{S}}\|\widetilde{\pi}_{t}(\cdot|s)-\pi_{t}(\cdot|s)\|_{1}\leq\sup_{s\in\mathcal{S}}\sqrt{\sum_{a}\pi_{t}(a|s)\log\frac{\pi_{t}(a|s)}{\widetilde{\pi}_{t}(a|s)}}\leq\sqrt{\rho_{0}(R/\lambda,m,\delta)}.

Substituting these bounds into (61) yields (58), which is the desired result. ∎

5.4 Analysis of the Function Approximation Error: How Do Neural Networks Address Distributional Shift in Policy Optimization?

A distinctive feature of reinforcement learning, and of policy optimization in particular, is that the probability distribution of the underlying system changes over time as a function of the control policy. Consequently, the function approximator (i.e., the actor network in our case) needs to adapt to this distributional shift throughout the policy optimization steps. In this subsection, we analyze the function approximation error, which sheds light on how neural networks in the NTK regime address the distributional shift challenge.

Now we focus on the approximation error in Lemma 5:

ϵbiasπt=𝔼sdμ[a𝒜(πt(a|s)π(a|s))(f0(s,a)utQλπt(s,a))].\epsilon_{bias}^{\pi_{t}}=\mathbb{E}_{s\sim d_{\mu}^{*}}\Big{[}\sum_{a\in\mathcal{A}}\big{(}\pi_{t}(a|s)-\pi^{*}(a|s)\big{)}\big{(}\nabla^{\top}f_{0}(s,a)u_{t}-Q_{\lambda}^{\pi_{t}}(s,a)\big{)}\Big{]}. (62)

Note that ϵbiasπt\epsilon_{bias}^{\pi_{t}} can be equivalently expressed as follows:

ϵbiasπt=𝔼sdμ[a𝒜(πt(a|s)π(a|s))(logπt(s,a)utΞλπt(s,a))]+𝔼sdμπ[a𝒜(πt(a|s)π(a|s))([f0(s,a)ft(s,a)]ut)],\epsilon_{bias}^{\pi_{t}}=\mathbb{E}_{s\sim d_{\mu}^{*}}\Big{[}\sum_{a\in\mathcal{A}}\big{(}\pi_{t}(a|s)-\pi^{*}(a|s)\big{)}\big{(}\nabla^{\top}\log\pi_{t}(s,a)u_{t}-\Xi_{\lambda}^{\pi_{t}}(s,a)\big{)}\Big{]}\\ +\mathbb{E}_{s\sim d_{\mu}^{\pi^{*}}}\Big{[}\sum_{a\in\mathcal{A}}\big{(}\pi_{t}(a|s)-\pi^{*}(a|s)\big{)}\Big{(}\big{[}\nabla f_{0}(s,a)-\nabla f_{t}(s,a)\big{]}^{\top}u_{t}\Big{)}\Big{]},

where Ξλπ\Xi_{\lambda}^{\pi} is the soft advantage function. The above identity provides intuition about the choice of sample-based gradient update utu_{t} in Algorithm 2, which we will investigate in detail later.

Let

L_{0}(u,\theta)=\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}[(\nabla^{\top}f_{0}(s,a)u-Q_{\lambda}^{\pi_{\theta}}(s,a))^{2}].

We now answer the following question: given perfect knowledge of the soft Q-function QλπθQ_{\lambda}^{\pi_{\theta}}, what is the minimum approximation error minuL0(u,θ(t))\min_{u}L_{0}(u,{\theta(t)})?

Proposition 3 (Approximation Error).

Under symmetric initialization of the actor network, we have the following results:

• Pointwise approximation error: For any θΘ\theta\in\Theta and Qλπθν¯Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{\bar{\nu}},

𝔼[minum×dL0(u,θ)]4ν¯2m,\mathbb{E}\left[\min_{u\in\mathbb{R}^{m\times d}}L_{0}(u,\theta)\right]\leq\frac{4{\bar{\nu}}^{2}}{m}, (63)

where the expectation is over the random initialization of the actor network.

• Uniform approximation error: Let

A1={sups,aθΘminu|f0(s,a)uQλπθ(s,a)|2ν¯m((dlog(m))14+log(Kδ))}.A_{1}=\Big{\{}\sup_{\begin{subarray}{c}s,a\\ \theta\in\Theta\end{subarray}}~{}\min_{u}|\nabla^{\top}f_{0}(s,a)u-Q_{\lambda}^{{\pi_{\theta}}}(s,a)|\leq\frac{2{\bar{\nu}}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}\Big{\}}.

Then, under Assumption 3, A1A_{1} holds with probability at least 1δ1-\delta over the random initialization of the actor network. Furthermore,

𝔼[𝟙A0A1supθminuL0(u,θ)]4ν¯2m((dlog(m))14+log(Kδ))2.\mathbb{E}\Big{[}\mathbbm{1}_{A_{0}\cap A_{1}}\sup_{\theta}\min_{u}L_{0}(u,\theta)\Big{]}\leq\frac{4{\bar{\nu}}^{2}}{m}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}^{2}. (64)
Proof.

For a given policy parameter θΘ\theta\in\Theta, let the transportation mapping of QλπθQ_{\lambda}^{\pi_{\theta}} be vθv_{\theta} and let

Yiθ(s,a)=vθ(θi(0))(s,a)𝟙{θi(0)(s,a)0},i[m].Y_{i}^{\theta}(s,a)=v_{\theta}^{\top}(\theta_{i}(0))(s,a)\cdot\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\},~{}i\in[m].

Note that 𝔼[Yiθ(s,a)]=Qλπθ(s,a)\mathbb{E}[Y_{i}^{\theta}(s,a)]=Q_{\lambda}^{\pi_{\theta}}(s,a) for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Also, let

uθ=[1mcivθ(θi(0))]i[m].u_{\theta}^{*}=\Big{[}\frac{1}{\sqrt{m}}c_{i}v_{\theta}(\theta_{i}(0))\Big{]}_{i\in[m]}.

Since uθm,ν¯d(0)u_{\theta}^{*}\in\mathcal{B}_{m,{\bar{\nu}}}^{d}(0) for all θΘ\theta\in\Theta, projected risk minimization within m,Rd(0)\mathcal{B}_{m,R}^{d}(0) for Rν¯R\geq{\bar{\nu}} suffices for optimality. We have

f0(s,a)uθ=1mi=1mYiθ(s,a),\nabla^{\top}f_{0}(s,a)u_{\theta}^{*}=\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a),

and

minuL0(u,θ)𝔼sdμπθ,aπθ(|s)[(1mi=1mYiθ(s,a)𝔼[Y1θ(s,a)])2].\min_{u}L_{0}(u,\theta)\leq\mathbb{E}_{s\sim d_{\mu}^{\pi_{\theta}},a\sim{\pi_{\theta}}(\cdot|s)}\Big{[}\Big{(}\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a)-\mathbb{E}[Y_{1}^{\theta}(s,a)]\Big{)}^{2}\Big{]}. (65)

1. Pointwise approximation error: First we consider a given fixed θΘ\theta\in\Theta. Taking the expectation in (65) and using Fubini’s theorem,

𝔼[minuL0(u,θ)]\displaystyle\mathbb{E}[\min_{u}L_{0}(u,\theta)] 𝔼s,a𝔼[(1mi=1mYiθ(s,a)𝔼[Y1θ(s,a)])2],\displaystyle\leq\mathbb{E}_{s,a}\mathbb{E}\Big{[}\Big{(}\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a)-\mathbb{E}[Y_{1}^{\theta}(s,a)]\Big{)}^{2}\Big{]},
\displaystyle\leq 2\mathbb{E}_{s,a}\Big{[}\frac{4}{m^{2}}\sum_{i=1}^{m/2}Var(Y_{i}^{\theta}(s,a))+\frac{4}{m^{2}}\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{m/2}Cov(Y_{i}^{\theta}(s,a),Y_{j}^{\theta}(s,a))\Big{]}, (66)
\displaystyle=4\mathbb{E}_{s,a}\Big{[}\frac{Var(Y_{1}^{\theta}(s,a))}{m}\Big{]}, (67)
\displaystyle\leq\frac{4}{m}\mathbb{E}_{s,a}\mathbb{E}[(Y_{1}^{\theta}(s,a))^{2}], (68)

where the inequality (66) is due to the symmetric initialization, and (67) holds because {Yiθ(s,a):i=1,2,,m/2}\{Y_{i}^{\theta}(s,a):i=1,2,\ldots,m/2\} are independent (since {θi(0):i[m/2]}\{\theta_{i}(0):i\in[m/2]\} are independent), so the covariance terms vanish. By the Cauchy-Schwarz inequality and the fact that vθν¯v_{\theta}\in\mathcal{H}_{{\bar{\nu}}}, we have:

|Yiθ(s,a)|vθ(θi(0))2ν¯.|Y_{i}^{\theta}(s,a)|\leq\|v_{\theta}(\theta_{i}(0))\|_{2}\leq{\bar{\nu}}.

Hence, using this in (68), we obtain:

\mathbb{E}\min_{u}L_{0}(u,\theta)\leq\frac{4{\bar{\nu}}^{2}}{m}. (69)

2. Uniform approximation error: For any θΘ\theta\in\Theta, since QλπθK,ν¯,𝒱Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{K,{\bar{\nu}},\mathcal{V}} there exists αθ=(α1θ,α2θ,,αKθ)K\alpha^{\theta}=(\alpha_{1}^{\theta},\alpha_{2}^{\theta},\ldots,\alpha_{K}^{\theta})\in\mathbb{R}^{K} such that αθ11\|\alpha^{\theta}\|_{1}\leq 1 and vθ=kαkθvkv_{\theta}=\sum_{k}\alpha_{k}^{\theta}v_{k}. We consider the following error:

Rm(Θ)=sup(s,a)𝒮×𝒜supθΘ|1mi=1mYiθ(s,a)𝔼[Y1θ(s,a)]|.R_{m}(\Theta)=\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}~{}\sup_{\theta\in\Theta}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Y_{i}^{\theta}(s,a)-\mathbb{E}[Y_{1}^{\theta}(s,a)]\Big{|}. (70)

Then, we have the following identity from the definition of vθv_{\theta}:

Rm(Θ)=sups,asupθ|k=1Kαkθ(1mi=1mZik(s,a)𝔼[Z1k(s,a)])|,R_{m}(\Theta)=\sup_{s,a}~{}\sup_{\theta}~{}\Big{|}\sum_{k=1}^{K}\alpha_{k}^{\theta}\cdot\Big{(}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{)}\Big{|}, (71)

where Zik(s,a)=vk(θi(0))(s,a)𝟙{θi(0)(s,a)0}Z_{i}^{k}(s,a)=v_{k}^{\top}(\theta_{i}(0))(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\}. Then, by the triangle inequality and Hölder’s inequality,

Rm(Θ)\displaystyle R_{m}(\Theta) sups,asupθmaxk[K]|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|αθ1,\displaystyle\leq\sup_{s,a}~{}\sup_{\theta}~{}\max_{k\in[K]}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}\cdot\|\alpha^{\theta}\|_{1},
maxk[K]sups,a|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|.\displaystyle\leq\max_{k\in[K]}~{}\sup_{s,a}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}. (72)

By using union bound and (72), for any z>0z>0, we have the following:

(Rm(Θ)>z)k=1K(sups,a|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|>z).\mathbb{P}(R_{m}(\Theta)>z)\leq\sum_{k=1}^{K}\mathbb{P}\Big{(}\sup_{s,a}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}>z\Big{)}. (73)

We utilize the following lemma to obtain a uniform bound for |1mi=1mZik(s,a)𝔼[Z1k(s,a)]||\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]| over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}.

Lemma 6.

For any k[K]k\in[K], for any δ(0,1)\delta\in(0,1), the following holds:

sups,a|1mi=1mZik(s,a)𝔼[Z1k(s,a)]|4ν¯(dlogm)1/4m+4ν¯log(1/δ)m,\sup_{s,a}~{}\Big{|}\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{k}(s,a)-\mathbb{E}[Z_{1}^{k}(s,a)]\Big{|}\leq\frac{4{\bar{\nu}}(d\log m)^{1/4}}{\sqrt{m}}+\frac{4{\bar{\nu}}\sqrt{\log(1/\delta)}}{\sqrt{m}}, (74)

with probability at least 1δ1-\delta.

Hence, using Lemma 6 and (73) with z=4ν¯(dlogm)1/4m+4ν¯log(K/δ)mz=\frac{4{\bar{\nu}}(d\log m)^{1/4}}{\sqrt{m}}+\frac{4{\bar{\nu}}\sqrt{\log(K/\delta)}}{\sqrt{m}}, we conclude that

Rm(Θ)4ν¯(dlogm)1/4m+4ν¯log(K/δ)m,R_{m}(\Theta)\leq\frac{4{\bar{\nu}}(d\log m)^{1/4}}{\sqrt{m}}+\frac{4{\bar{\nu}}\sqrt{\log(K/\delta)}}{\sqrt{m}},

with probability at least 1δ1-\delta. The expectation result follows from this inequality. ∎
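The 1/m scaling behind the pointwise bound can also be seen in a toy Monte Carlo experiment: averaging m/2 i.i.d. bounded terms, duplicated as under the symmetric initialization, gives a mean-squared deviation of order 1/m, as in (65)–(68). The snippet below is illustrative only and uses a synthetic bounded random variable in place of Y_1^θ(s,a).

```python
import numpy as np

rng = np.random.default_rng(2)
nu_bar = 1.0            # plays the role of the bound |Y_i| <= nu_bar from the proof
trials = 2000

for m in (100, 400, 1600, 6400):
    # m/2 i.i.d. bounded draws; the other m/2 are exact copies under symmetric initialization,
    # so averaging over m/2 draws equals averaging over all m terms.
    Y = rng.uniform(-nu_bar, nu_bar, size=(trials, m // 2))
    mse = np.mean(Y.mean(axis=1) ** 2)   # Monte Carlo estimate of E[(1/m sum_i Y_i - E[Y_1])^2]
    print(f"m = {m:>5d}   MC mean-squared error = {mse:.2e}   bound 4*nu^2/m = {4*nu_bar**2/m:.2e}")
```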

Now, we have the following result for the approximation error under πt\pi_{t}.

Corollary 3.

Under Assumption 3, we have:

𝔼[𝟙A0A1minuL0(u,θ(t))]16ν¯2m((dlog(m))14+log(Kδ))2,\mathbb{E}[\mathbbm{1}_{A_{0}\cap A_{1}}\min_{u}~{}L_{0}(u,{\theta(t)})]\leq\frac{16\bar{\nu}^{2}}{m}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}^{2},

where the event A1A_{1}, defined in Proposition 3, holds with probability at least 1δ1-\delta over the random initialization of the actor.

Remark 7 (Why do we need a uniform approximation error bound?).

Note that for a given fixed policy πθ,θΘ{\pi_{\theta}},\theta\in\Theta, Proposition 3 provides a sharp pointwise approximation error bound as long as Qλπθν¯Q_{\lambda}^{\pi_{\theta}}\in\mathcal{F}_{\bar{\nu}} with a corresponding transportation map vθv_{\theta}. In order for this result to hold, the terms vθ(θi(0))(s,a)𝟙{θi(0)(s,a)0}v_{\theta}^{\top}(\theta_{i}(0))(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\} are required to be i.i.d. across i[m/2]i\in[m/2], which is the main idea behind the random initialization schemes for the NTK analysis. On the other hand, in policy optimization, θ(t)\theta(t) depends on the initialization θ(0)\theta(0), therefore the terms vθ(t)(θi(0))(s,a)𝟙{θi(0)(s,a)0}v_{\theta(t)}^{\top}(\theta_{i}(0))(s,a)\mathbbm{1}\{\theta_{i}^{\top}(0)(s,a)\geq 0\} are not independent across i – hence, Cov(Yiθ(s,a),Yjθ(s,a))0Cov(Y_{i}^{\theta}(s,a),Y_{j}^{\theta}(s,a))\neq 0 for iji\neq j in (66). Furthermore, the distribution of (s,a)(s,a) at time t>0t>0 also depends on π0\pi_{0}. Therefore, the pointwise approximation error bound cannot be used to provide an approximation guarantee for QλπtQ_{\lambda}^{\pi_{t}} under the entropy-regularized NAC. In the existing works, this important issue regarding the temporal correlation and its impact on the NTK analysis was not addressed. In this work, we utilize the uniform approximation error bound provided in Proposition 3 to address this issue.

In the absence of QλπθQ_{\lambda}^{\pi_{\theta}}, the critic yields a noisy estimate Q¯λπθ\overline{Q}_{\lambda}^{\pi_{\theta}}. Additionally, since dμπθd_{\mu}^{\pi_{\theta}} is not known a priori, samples {(sn,an)dμπθπθ(|s):n0}\{(s_{n},a_{n})\sim d_{\mu}^{\pi_{\theta}}\circ{\pi_{\theta}}(\cdot|s):n\geq 0\} are used to obtain the update utu_{t}. These two factors are the sources of error in the natural actor-critic method: minuL0(u,θ(t))L0(ut,θ(t))\min_{u}L_{0}(u,{\theta(t)})\leq L_{0}(u_{t},{\theta(t)}). In the following, we quantify this error and show that:

  1. Increasing the number of SGD iterations, NN,

  2. Increasing the representation power of the actor network in terms of the width mm,

  3. Low mean-squared Bellman error in the critic (by large m,Tm^{\prime},T^{\prime}),

lead to vanishing error.

First, we study the error introduced by using SGD for solving:

L^t(u,θ(t))=𝔼sdμπt,aπt(|s)[(θlogπt(s,a)uΞ^λπt(s,a))2].\widehat{L}_{t}(u,{\theta(t)})=\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}},a\sim\pi_{t}(\cdot|s)}\Big{[}\Big{(}\nabla_{\theta}^{\top}\log\pi_{t}(s,a)u-\widehat{\Xi}_{\lambda}^{\pi_{t}}(s,a)\Big{)}^{2}\Big{]}. (75)
Proposition 4 (Theorem 14.8 in [61]).

Algorithm 2 (Lines 3-9) with step-size αA=R/qmaxN\alpha_{A}=R/\sqrt{q_{max}N} yields the following result:

𝔼[L^t(ut,θ(t))]minuL^t(u,θ(t))RqmaxN,\mathbb{E}[\widehat{L}_{t}(u_{t},{\theta(t)})]-\min_{u}\widehat{L}_{t}(u,{\theta(t)})\leq\frac{Rq_{max}}{\sqrt{N}}, (76)

for any Rν¯R\geq{\bar{\nu}} where the expectation is over the random samples {(sn,an):n[N]}\{(s_{n},a_{n}):n\in[N]\}.
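Since Algorithm 2 itself is not reproduced in this section, the following is only a generic sketch of projected SGD with iterate averaging for a regression objective of the form (75), using the step-size α_A = R/√(q_max N) that appears in Proposition 4; the feature map, targets, and numerical parameters below are placeholders rather than the actual NAC quantities.

```python
import numpy as np

def projected_sgd_least_squares(features, targets, R, q_max):
    """Generic sketch: projected SGD with averaging for min_{||u|| <= R} E[(phi^T u - y)^2],
    with step-size alpha_A = R / sqrt(q_max * N) as in Proposition 4."""
    N, dim = features.shape
    alpha = R / np.sqrt(q_max * N)
    u, u_avg = np.zeros(dim), np.zeros(dim)
    for n in range(N):
        phi, y = features[n], targets[n]
        grad = 2.0 * (phi @ u - y) * phi        # stochastic gradient of the squared loss
        u = u - alpha * grad
        norm = np.linalg.norm(u)
        if norm > R:                            # Euclidean projection onto the R-ball
            u *= R / norm
        u_avg += u / N
    return u_avg

# toy usage with synthetic data (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8)) / np.sqrt(8)
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=2000)
u_hat = projected_sgd_least_squares(X, y, R=5.0, q_max=4.0)
print("mean squared error:", float(np.mean((X @ u_hat - y) ** 2)))
```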

The following proposition provides an error bound in terms of the statistical error for finding the optimum utu_{t} via SGD as well as TD-learning error in estimating the soft Q-function. Let

Lt(u,θ(t))=𝔼sdμπt,aπt(|s)[(θlogπt(s,a)uΞλπt(s,a))2].L_{t}(u,{\theta(t)})=\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}},a\sim\pi_{t}(\cdot|s)}\Big{[}\Big{(}\nabla_{\theta}^{\top}\log\pi_{t}(s,a)u-{\Xi}_{\lambda}^{\pi_{t}}(s,a)\Big{)}^{2}\Big{]}. (77)
Proposition 5.

Let A=A0A1A2A=A_{0}\cap A_{1}\cap A_{2}, hence (A)13δ\mathbb{P}(A)\geq 1-3\delta. We have the following inequality:

𝔼[𝟙ALt(ut,θ(t))]8minu{𝟙AL0(u,θ(t))+𝔼[𝟙A|[f0(s,a)ft(s,a)]u|2]}+2RqmaxN+6𝔼[𝟙A|Qλπt(s,a)Q¯λπt(s,a)|2],\mathbb{E}[\mathbbm{1}_{A}L_{t}(u_{t},{\theta(t)})]\leq 8\min_{u}\Big{\{}\mathbbm{1}_{A}L_{0}(u,{\theta(t)})+\mathbb{E}\big{[}\mathbbm{1}_{A}\big{|}[\nabla f_{0}(s,a)-\nabla f_{t}(s,a)]^{\top}u\big{|}^{2}\big{]}\Big{\}}+\frac{2Rq_{max}}{\sqrt{N}}\\ +6\mathbb{E}[\mathbbm{1}_{A}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}], (78)

where the expectation is over the samples for critic (TD learning) and actor (SGD) updates. Consequently, we have:

𝔼[Lt(ut,θ(t))]3𝔼[minum,Rd(0)L0(u,θ(t))]+2RqmaxN14+3ρ0(R/λ,m,δ)+3𝔼𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2],\mathbb{E}[\sqrt{L_{t}(u_{t},{\theta(t)})}]\leq 3\sqrt{\mathbb{E}[\min_{u\in\mathcal{B}_{m,R}^{d}(0)}L_{0}(u,{\theta(t)})]}+\frac{2\sqrt{Rq_{max}}}{N^{\frac{1}{4}}}+3\rho_{0}(R/\lambda,m,\delta)\\ +3\mathbb{E}\sqrt{\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}}, (79)

under the event AA.

Proof.

We extensively use the inequality (x+y)^{2}\leq 2x^{2}+2y^{2} for x,yx,y\in\mathbb{R}. First, note that

L^t(u,θ(t))2Lt(u,θ(t))+2𝔼s,adt[|Qλπt(s,a)Q¯λπt(s,a)|2],\widehat{L}_{t}(u,{\theta(t)})\leq 2L_{t}(u,{\theta(t)})+2\mathbb{E}_{s,a\sim d_{t}}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}, (80)

for any um,Rd(0)u\in\mathcal{B}_{m,R}^{d}(0). Hence, under A=A0A1A2A=A_{0}\cap A_{1}\cap A_{2}, we have:

𝔼[Lt(ut,θ(t))]\displaystyle\mathbb{E}[L_{t}(u_{t},\theta(t))] 2𝔼[L^t(ut,θ(t))]+2𝔼[|Qλπt(s,a)Q¯λπt(s,a)|2],\displaystyle\leq 2\mathbb{E}[\widehat{L}_{t}(u_{t},\theta(t))]+2\mathbb{E}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}, (81)
2minum,Rd(0)L^t(u,θ(t))+2𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2]+2RqmaxN1/2,\displaystyle\leq 2\min_{u\in\mathcal{B}_{m,R}^{d}(0)}\widehat{L}_{t}(u,\theta(t))+2\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}+\frac{2{Rq_{max}}}{N^{1/2}}, (82)
4minum,Rd(0)Lt(u,θ(t))+6𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2]+2RqmaxN1/2,\displaystyle\leq 4\min_{u\in\mathcal{B}_{m,R}^{d}(0)}L_{t}(u,\theta(t))+6\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}+\frac{2{Rq_{max}}}{N^{1/2}}, (83)

where the second line follows from Prop. 4 and the last line follows from (80). Consequently, we have:

𝔼[Lt(ut,θ(t))]8minum,Rd(0){𝔼[(f0(s,a)uQλπt(s,a))2]+𝔼[|(ft(s,a)f0(s,a))u|2]}+6𝔼s,a[|Qλπt(s,a)Q¯λπt(s,a)|2]+2RqmaxN1/2,\mathbb{E}[L_{t}(u_{t},\theta(t))]\leq 8\min_{u\in\mathcal{B}_{m,R}^{d}(0)}\Big{\{}\mathbb{E}[(\nabla^{\top}f_{0}(s,a)u-Q_{\lambda}^{\pi_{t}}(s,a))^{2}]+\mathbb{E}[|(\nabla f_{t}(s,a)-\nabla f_{0}(s,a))^{\top}u|^{2}]\Big{\}}\\ +6\mathbb{E}_{s,a}\big{[}|Q_{\lambda}^{\pi_{t}}(s,a)-\overline{Q}_{\lambda}^{\pi_{t}}(s,a)|^{2}\big{]}+\frac{2{Rq_{max}}}{N^{1/2}}, (84)

where we use (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2} and the following inequality:

𝔼[(logπt(a|s)uΞλπt(s,a))2]\displaystyle\mathbb{E}[(\nabla^{\top}\log\pi_{t}(a|s)u-\Xi_{\lambda}^{\pi_{t}}(s,a))^{2}] =𝔼sdμπtVar(ft(s,a)uQλπt(s,a)),\displaystyle=\mathbb{E}_{s\sim d_{\mu}^{\pi_{t}}}Var(\nabla^{\top}f_{t}(s,a)u-Q_{\lambda}^{\pi_{t}}(s,a)),
𝔼[(ft(s,a)uQλπt(s,a))2].\displaystyle\leq\mathbb{E}[(\nabla^{\top}f_{t}(s,a)u-Q_{\lambda}^{\pi_{t}}(s,a))^{2}].

Using (84) and Theorem 2 in [49] together with the inequality x+y+zx+y+z\sqrt{x+y+z}\leq\sqrt{x}+\sqrt{y}+\sqrt{z} for x,y,z>0x,y,z>0, we obtain (79). ∎

Hence, we obtain the following bound on the approximation error ϵbiasπt\epsilon_{bias}^{\pi_{t}}.

Corollary 4 (Approximation Error).

Under Assumptions 1-4, we have the following bound on the approximation error:

𝔼[𝟙Aϵbiasπt]M[8ν¯m((dlog(m))14+log(Kδ))+2RqmaxN14+4ε],\mathbb{E}[\mathbbm{1}_{A}\cdot\epsilon_{bias}^{\pi_{t}}]\leq M_{\infty}\Big{[}\frac{8{\bar{\nu}}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log\Big{(}\frac{K}{\delta}\Big{)}}\Big{)}+\frac{2\sqrt{Rq_{max}}}{N^{\frac{1}{4}}}+4\varepsilon\Big{]},

where A=A0A1A2A=A_{0}\cap A_{1}\cap A_{2}, m=O~(ν¯4(1γ)2ε2)m^{\prime}=\widetilde{O}\Big{(}\frac{\bar{\nu}^{4}}{(1-\gamma)^{2}\varepsilon^{2}}\Big{)} and T=O((1+2ν¯)2ν¯2ε4)T^{\prime}=O\Big{(}\frac{(1+2\bar{\nu})^{2}\bar{\nu}^{2}}{\varepsilon^{4}}\Big{)} and (A)13δ\mathbb{P}(A)\geq 1-3\delta.

Proof.

In order to prove Corollary 4, we substitute the results of Corollary 3 and Theorem 1 into (79). ∎

The main message of Corollary 4 is as follows: in order to eliminate the bias introduced by (i) function approximation and (ii) sample-based estimation for the actor and critic, one should employ more representation power in both the actor and critic networks (via mm and mm^{\prime}), and also use more samples in the actor and critic updates (via NN and TT^{\prime}). Furthermore, Corollary 4 quantifies the required network widths and sample complexities to achieve a desired bias ϵ>0\epsilon>0.

In the following subsection, we finally prove Theorem 2 by using the Lyapunov drift result (Lemma 5) and the approximation error bound (Corollary 4).

5.5 Convergence of Entropy-Regularized Natural Actor-Critic

Proof of Theorem 2.

In the following, we prove the first part of Theorem 2; the second part follows from identical steps with a constant step-size η(0,1/λ)\eta\in(0,1/\lambda). First, note that Lemma 5 implies the following bound:

𝟙A[Ψ(πt+1)Ψ(πt)]ηtλΨ(πt)𝟙Aηt(1γ)Δt𝟙A+2ηt2R2+ηt𝟙Aϵbiasπt+(ηtλ+6)ρ0(R/λ,m,δ)+2ηtRρ0(R/λ,m,δ).\displaystyle\begin{aligned} \mathbbm{1}_{A}\Big{[}\Psi(\pi_{t+1})-\Psi(\pi_{t})\Big{]}&\leq-\eta_{t}\lambda\Psi(\pi_{t})\mathbbm{1}_{A}-\eta_{t}(1-\gamma)\Delta_{t}\mathbbm{1}_{A}+2\eta_{t}^{2}R^{2}+\eta_{t}\mathbbm{1}_{A}\epsilon_{bias}^{\pi_{t}}\\ &+(\eta_{t}\lambda+6)\rho_{0}(R/\lambda,m,\delta)+2\eta_{t}R\sqrt{\rho_{0}(R/\lambda,m,\delta)}.\end{aligned} (85)

By Corollary 4,

𝔼[𝟙Aϵbiasπt]M[ρ1+2RqmaxN1/4+4ε]=:ϵbias,\mathbb{E}[\mathbbm{1}_{A}\epsilon_{bias}^{\pi_{t}}]\leq M_{\infty}\Big{[}\rho_{1}+\frac{2\sqrt{Rq_{max}}}{N^{1/4}}+4\varepsilon\Big{]}=:\epsilon_{bias},

where

ρ1=16ν¯m((dlog(m))14+log(K/δ)).\rho_{1}=\frac{16\bar{\nu}}{\sqrt{m}}\Big{(}(d\log(m))^{\frac{1}{4}}+\sqrt{\log(K/\delta)}\Big{)}.

Let Ψ¯t:=𝔼[Ψ(πt)𝟙A]\overline{\Psi}_{t}:=\mathbb{E}[\Psi(\pi_{t})\mathbbm{1}_{A}]. Then,

Ψ¯t+1Ψ¯t\displaystyle\overline{\Psi}_{t+1}-\overline{\Psi}_{t} ηtλΨ¯tηt(1γ)𝔼[𝟙AΔt]+2ηt2R2+ηtϵbias\displaystyle\leq-\eta_{t}\lambda\overline{\Psi}_{t}-\eta_{t}(1-\gamma)\mathbb{E}[\mathbbm{1}_{A}\Delta_{t}]+2\eta_{t}^{2}R^{2}+\eta_{t}\epsilon_{bias}
+7ρ0(R/λ,m,δ)+2ηtRρ0(R/λ,m,δ).\displaystyle+7\rho_{0}(R/\lambda,m,\delta)+2\eta_{t}R\sqrt{\rho_{0}(R/\lambda,m,\delta)}.

Since ηt=1λ(t+1)\eta_{t}=\frac{1}{\lambda(t+1)}, by induction,

\displaystyle\begin{aligned}
\overline{\Psi}_{t+1}&\leq(1-\eta_{t}\lambda)\overline{\Psi}_{t}-\eta_{t}(1-\gamma)\mathbb{E}[\mathbbm{1}_{A}\Delta_{t}]+2\eta_{t}^{2}R^{2}+\eta_{t}\Big{(}\epsilon_{bias}+2R\sqrt{\rho_{0}(R/\lambda,m,\delta)}\Big{)}+7\rho_{0}(R/\lambda,m,\delta),\\
&\leq-\frac{(1-\gamma)}{\lambda(t+1)}\sum_{k\leq t}\mathbb{E}[\mathbbm{1}_{A}\Delta_{k}]+\frac{1}{\lambda}\Big{(}\epsilon_{bias}+2R\sqrt{\rho_{0}(R/\lambda,m,\delta)}\Big{)}+\frac{2R^{2}\log(t+1)}{\lambda^{2}(t+1)}+4(t+1)\rho_{0}(R/\lambda,m,\delta).
\end{aligned}

Hence,

min0t<T𝔼[𝟙AΔt]11γ(ϵbias+2Rρ0(R/λ,m,δ))+2R2(1+logT)λ(1γ)T+4Tλ(1γ)ρ0(R/λ,m,δ),\displaystyle\begin{aligned} \min_{0\leq t<T}\mathbb{E}[\mathbbm{1}_{A}\Delta_{t}]\leq\frac{1}{1-\gamma}\Big{(}\epsilon_{bias}+2R\sqrt{\rho_{0}(R/\lambda,m,\delta)}\Big{)}&+\frac{2R^{2}(1+\log T)}{\lambda(1-\gamma)T}+\frac{4T\lambda}{(1-\gamma)}\rho_{0}(R/\lambda,m,\delta),\end{aligned} (86)

which concludes the proof. ∎
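The induction above boils down to a simple scalar recursion. The snippet below (illustrative only, with the approximation and linearization error terms set to zero and placeholder constants) iterates Ψ̄_{t+1} ≤ (1 − η_t λ)Ψ̄_t + 2η_t² R² with η_t λ = 1/(t+1) and compares the result to the 2R²(1 + log T)/(λ² T) term appearing in the proof.

```python
import math

# Scalar version of the drift recursion from the proof of Theorem 2, keeping only the
# 2*eta_t^2*R^2 term (approximation/linearization errors set to zero; constants are placeholders).
lam, R, T = 0.5, 1.0, 10_000
psi = 1.0                              # Psi-bar_0
for t in range(T):
    eta = 1.0 / (lam * (t + 1))        # eta_t * lambda = 1/(t+1)
    psi = (1.0 - eta * lam) * psi + 2.0 * eta**2 * R**2

bound = 2.0 * R**2 * (1.0 + math.log(T)) / (lam**2 * T)
print(f"Psi-bar_T = {psi:.3e}   vs   2R^2(1+log T)/(lam^2 T) = {bound:.3e}")
```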

6 Conclusion

In this paper, we established global convergence of the two-timescale entropy-regularized NAC algorithm with neural network approximation. We observed that entropy regularization leads to significantly improved sample complexity and overparameterization bounds under weaker conditions since it (i) encourages exploration and (ii) controls the movement of the neural network parameters. We characterized the bias due to function approximation and sample-based estimation, and showed that overparameterization and an increasing sample size eliminate this bias.

In practice, single-timescale natural policy gradient methods are predominantly used in conjunction with entropy regularization and off-policy sampling [15]. The analysis techniques that we develop in this paper can be used to analyze these algorithms.

In supervised learning, softmax parameterization is predominantly used for multiclass classification problems, where natural gradient descent is employed for a better adjustment to the problem geometry [50, 62, 63]. The techniques that we developed in this paper can be useful in establishing convergence results and understanding the role of entropy regularization as well.

Acknowledgements

S. Ç. would like to thank Siddhartha Satpathi for his help in the proof of Lemma 6. This work is supported in part by NSF TRIPODS grant CCF-1934986.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [2] C. Szepesvári, “Algorithms for reinforcement learning,” Synthesis lectures on artificial intelligence and machine learning, vol. 4, no. 1, pp. 1–103, 2010.
  • [3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming.   Athena Scientific, 1996.
  • [4] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
  • [5] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation.” in NIPs, vol. 99.   Citeseer, 1999, pp. 1057–1063.
  • [6] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems.   Citeseer, 2000, pp. 1008–1014.
  • [7] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2016, pp. 1928–1937.
  • [8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [9] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Trust-pcl: An off-policy trust region method for continuous control,” arXiv preprint arXiv:1707.01891, 2017.
  • [10] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International conference on machine learning.   PMLR, 2016, pp. 1329–1338.
  • [11] S.-I. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
  • [12] S. M. Kakade, “A natural policy gradient,” Advances in neural information processing systems, vol. 14, 2001.
  • [13] S. Bhatnagar, M. Ghavamzadeh, M. Lee, and R. S. Sutton, “Incremental natural actor-critic algorithms,” Advances in neural information processing systems, vol. 20, pp. 105–112, 2007.
  • [14] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1861–1870.
  • [16] Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” in International Conference on Machine Learning.   PMLR, 2019, pp. 151–160.
  • [17] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “Optimality and approximation with policy gradient methods in markov decision processes,” in Conference on Learning Theory.   PMLR, 2020, pp. 64–66.
  • [18] J. Bhandari and D. Russo, “Global optimality guarantees for policy gradient methods,” arXiv preprint arXiv:1906.01786, 2019.
  • [19] G. Lan, “Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes,” arXiv preprint arXiv:2102.00135, 2021.
  • [20] S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi, “Fast global convergence of natural policy gradient methods with entropy regularization,” arXiv preprint arXiv:2007.06558, 2020.
  • [21] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, “On the global convergence rates of softmax policy gradient methods,” in International Conference on Machine Learning.   PMLR, 2020, pp. 6820–6829.
  • [22] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global optimality and rates of convergence,” arXiv preprint arXiv:1909.01150, 2019.
  • [23] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning.   PMLR, 2014, pp. 387–395.
  • [24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning.   PMLR, 2015, pp. 1889–1897.
  • [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [26] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, “Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model,” Advances in Neural Information Processing Systems, vol. 33, pp. 741–752, 2020.
  • [27] S. Khodadadian, P. R. Jhunjhunwala, S. M. Varma, and S. T. Maguluri, “On the linear convergence of natural policy gradient algorithm,” arXiv preprint arXiv:2105.01424, 2021.
  • [28] L. Shani, Y. Efroni, and S. Mannor, “Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5668–5675.
  • [29] W. Zhan, S. Cen, B. Huang, Y. Chen, J. D. Lee, and Y. Chi, “Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence,” arXiv preprint arXiv:2105.11066, 2021.
  • [30] W. Yang, X. Li, G. Xie, and Z. Zhang, “Finding the near optimal policy via adaptive reduced regularization in mdps,” arXiv preprint arXiv:2011.00213, 2020.
  • [31] S. Khodadadian, Z. Chen, and S. T. Maguluri, “Finite-sample analysis of off-policy natural actor-critic algorithm,” in International Conference on Machine Learning.   PMLR, 2021, pp. 5420–5431.
  • [32] Z. Chen, S. Khodadadian, and S. T. Maguluri, “Finite-sample analysis of off-policy natural actor-critic with linear function approximation,” IEEE Control Systems Letters, 2022.
  • [33] ——, “Finite-sample analysis of off-policy natural actor-critic with linear function approximation,” arXiv preprint arXiv:2105.12540, 2021.
  • [34] T. Xu, Z. Wang, and Y. Liang, “Improving sample complexity bounds for actor-critic algorithms,” arXiv preprint arXiv:2004.12956, 2020.
  • [35] J. Zhang, C. Ni, C. Szepesvari, M. Wang et al., “On the convergence and sample efficiency of variance-reduced policy gradient method,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [36] H. Kumar, A. Koppel, and A. Ribeiro, “On the sample complexity of actor-critic method for reinforcement learning with function approximation,” arXiv preprint arXiv:1910.08412, 2019.
  • [37] Y. F. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite-time analysis of two time-scale actor-critic methods,” Advances in Neural Information Processing Systems, vol. 33, pp. 17617–17628, 2020.
  • [38] S. Qiu, Z. Yang, J. Ye, and Z. Wang, “On finite-time convergence of actor-critic algorithm,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 2, pp. 652–664, 2021.
  • [39] S. Cayci, N. He, and R. Srikant, “Linear convergence of entropy-regularized natural policy gradient with linear function approximation,” arXiv preprint arXiv:2106.04096, 2021.
  • [40] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” arXiv preprint arXiv:1806.07572, 2018.
  • [41] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” in International Conference on Learning Representations, 2018.
  • [42] S. Arora, S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” in International Conference on Machine Learning.   PMLR, 2019, pp. 322–332.
  • [43] Z. Ji, M. Telgarsky, and R. Xian, “Neural tangent kernels, transportation mappings, and universal approximation,” in International Conference on Learning Representations, 2019.
  • [44] S. Oymak and M. Soltanolkotabi, “Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 84–105, 2020.
  • [45] Z. Ji and M. Telgarsky, “Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks,” arXiv preprint arXiv:1909.12292, 2019.
  • [46] Z. Fu, Z. Yang, and Z. Wang, “Single-timescale actor-critic provably finds globally optimal policy,” in International Conference on Learning Representations, 2020.
  • [47] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
  • [48] Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” arXiv preprint arXiv:1910.01619, 2019.
  • [49] S. Cayci, S. Satpathi, N. He, and R. Srikant, “Sample complexity and overparameterization bounds for temporal difference learning with neural network approximation,” arXiv preprint arXiv:2103.01391, 2021.
  • [50] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [51] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in Conference on Learning Theory.   PMLR, 2019, pp. 2803–2830.
  • [52] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification, and adaptive control.   SIAM, 2015.
  • [53] L. Chizat, E. Oyallon, and F. Bach, “On lazy training in differentiable programming,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [54] E. Kreyszig, Introductory functional analysis with applications.   John Wiley & Sons, 1991, vol. 17.
  • [55] B. Liu, Q. Cai, Z. Yang, and Z. Wang, “Neural proximal/trust region policy optimization attains globally optimal policy,” arXiv preprint arXiv:1906.10306, 2019.
  • [56] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2016, pp. 795–811.
  • [57] M. Telgarsky, “Deep learning theory lecture notes,” https://mjt.cs.illinois.edu/dlt/, 2021, version: 2021-10-27 v0.0-e7150f2d (alpha).
  • [58] S. Satpathi, H. Gupta, S. Liang, and R. Srikant, “The role of regularization in overparameterized neural networks,” in 2020 59th IEEE Conference on Decision and Control (CDC).   IEEE, 2020, pp. 4683–4688.
  • [59] T. M. Cover and J. A. Thomas, “Elements of information theory (wiley series in telecommunications and signal processing),” 2006.
  • [60] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in In Proc. 19th International Conference on Machine Learning.   Citeseer, 2002.
  • [61] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.   Cambridge university press, 2014.
  • [62] R. Pascanu and Y. Bengio, “Revisiting natural gradient for deep networks,” arXiv preprint arXiv:1301.3584, 2013.
  • [63] G. Zhang, J. Martens, and R. Grosse, “Fast convergence of natural gradient descent for overparameterized neural networks,” arXiv preprint arXiv:1905.10961, 2019.
  • [64] V. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Measures of Complexity, vol. 16, no. 2, p. 11, 1971.

Appendix A Proof of Lemma 6

Consider gν¯g\in\mathcal{F}_{\bar{\nu}} with a corresponding transportation map vν¯v\in\mathcal{H}_{\bar{\nu}}. Using Cauchy-Schwarz inequality,

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|2\displaystyle\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}
𝔼supx:x211mi=1mv(θi(0))𝟙{θi(0)x0}𝔼v(θi(0))𝟙{θi(0)x0}2.\displaystyle\leq\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{\|}\frac{1}{m}\sum_{i=1}^{m}v(\theta_{i}(0))\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}-\mathbb{E}v(\theta_{i}(0))\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{\|}^{2}.

Define bi:=v(θi(0))𝟙{θi(0)x0}.b_{i}:=v(\theta_{i}(0))\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}. Define a class BB containing all possible values taken by b:={bi}i=1mb:=\{b_{i}\}_{i=1}^{m} over {x:x21}\{x:\|x\|_{2}\leq 1\} for a fixed θ(0).\theta(0). Further, using Cauchy-Schwarz inequality,

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|2\displaystyle\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}
𝔼supbB1m2ijm(bi𝔼bi)(bj𝔼bj)+4ν¯2m.\displaystyle\leq\mathbb{E}\sup_{b\in B}\frac{1}{m^{2}}\sum_{i\neq j}^{m}(b_{i}-\mathbb{E}b_{i})^{\top}(b_{j}-\mathbb{E}b_{j})+\frac{4{\bar{\nu}}^{2}}{m}.

Using the symmetrization argument with Rademacher random variables σij\sigma_{ij}’s,

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|24𝔼θ(0)𝔼rsupbB1m2ijmσijbibj+4ν¯2m.\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}\leq 4\mathbb{E}_{\theta(0)}\mathbb{E}_{r}\sup_{b\in B}\frac{1}{m^{2}}\sum_{i\neq j}^{m}\sigma_{ij}b_{i}^{\top}b_{j}+\frac{4{\bar{\nu}}^{2}}{m}.

Note that given θ(0),\theta(0), BB is a finite set. We apply Massart’s finite class lemma to obtain:

𝔼r[supb:bB1m2ijmσijbibj|θ(0)]ijmv(θi(0))2v(θj(0))22log|B|m2\displaystyle\mathbb{E}_{r}\Big{[}\sup_{b:b\in B}\frac{1}{m^{2}}\sum_{i\neq j}^{m}\sigma_{ij}b_{i}^{\top}b_{j}|\theta(0)\Big{]}\leq\sqrt{\sum_{i\neq j}^{m}\|v(\theta_{i}(0))\|^{2}\|v(\theta_{j}(0))\|^{2}}\frac{\sqrt{2\log|B|}}{m^{2}}

We calculate |B||B| using VC theory. Each element bib_{i} of bb partitions the space {x1}d\{\|x\|\leq 1\}\subset\mathbb{R}^{d} into two half-spaces, where one half takes the value v(θi(0))v(\theta_{i}(0)) and the other half takes the value 0.0. Hence, the number of possible values taken by bb over the space {x1}\{\|x\|\leq 1\} equals the number of components in the partition generated by the mm hyperplanes corresponding to {bi}i=1m\{b_{i}\}_{i=1}^{m}. The number of such components is bounded by md+1+1m^{d+1}+1 using the growth function defined in [64]. Hence |B|m2d|B|\leq m^{2d}, and the following holds:

𝔼supx:x21|g(x)1mi=1mv(θi(0))x𝟙{θi(0)x0}|212ν¯2dlogmm.\displaystyle\mathbb{E}\sup_{x:\|x\|_{2}\leq 1}\Big{|}g(x)-\frac{1}{m}\sum_{i=1}^{m}v^{\top}(\theta_{i}(0))x\mathbbm{1}\{\theta_{i}^{\top}(0)x\geq 0\}\Big{|}^{2}\leq\frac{12{\bar{\nu}}^{2}\sqrt{d\log m}}{m}.

The result follows from Jensen’s inequality.