Finite-time analysis of entropy-regularized neural natural actor-critic algorithm
Abstract
Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large (potentially continuous) state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) in achieving provably good performance in terms of sample complexity, iteration complexity and overparameterization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies, and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of the uniform approximation power of the actor neural network for achieving global optimality in policy optimization under distributional shift.
1 Introduction
In reinforcement learning (RL), an agent aims to find an optimal policy that maximizes the expected total reward in a Markov decision process (MDP) by interacting with an unknown and dynamical environment [1, 2, 3]. Policy gradient methods [4, 5, 6], which employ first-order optimization methods to find the best policy within a parametric policy class, have demonstrated impressive success in numerous complicated RL problems. The success largely benefits from the versatility of policy gradient methods in accommodating a rich class of function approximation schemes [7, 8, 9, 10].
Natural policy gradient (NPG), natural actor-critic (NAC) and their variants, which use Fisher information matrix as a pre-conditioner for the gradient updates [11, 12, 13, 14], are particularly popular because of their impressive empirical performance in practical applications. In practice, NPG/NAC methods are further combined with (a) neural network approximation for high representation power of both the actor and the critic, and (b) entropy regularization for stability and sufficient exploration, leading to remarkable performance in complicated control tasks that involve large state-action spaces [15, 9, 16].
Despite the empirical successes, a strong theoretical understanding of policy gradient methods, especially when boosted with function approximation and entropy regularization, appears to be in a nascent stage. Recently, there has been a plethora of theoretical attempts to understand the convergence properties of policy gradient methods and the role of entropy regularization [17, 18, 19, 20, 21]. These works predominantly study the tabular setting, where a parallelism between the well-known policy iteration and policy gradient methods can be exploited to establish the convergence results. But for the more intriguing function approximation regime, especially with neural network approximation, little theory is known. Two of the main challenges come from the highly nonconvex nature of the problem when using neural network approximation for both the actor and the critic, and the complex exploration dynamics.
In this paper, we provide the first non-asymptotic analysis of an entropy-regularized natural actor-critic (NAC) method in which we use two separate two-layer neural networks for the actor and the critic, and employ a learning scheme based on approximate natural policy gradient updates to achieve optimality. We show that the expressive power of these neural networks provides the ability to achieve optimality within a broad class of policies.
1.1 Main Contributions
We elaborate on some of our contributions below.
• Sharp sample complexity, convergence rate and overparameterization bounds: We prove sharp convergence guarantees in terms of sample complexity, iteration complexity and network width. In particular, we prove that the NAC method with an adaptive step-size achieves sharp iteration and sample complexity bounds for approaching the optimal policy of the regularized MDP under the mildest distribution mismatch conditions to the best of our knowledge, and we explicitly characterize the required network widths for both the actor and the critic. Under the standard distribution mismatch assumption as in [22], our sample complexity bound for the unregularized MDP improves significantly over existing bounds.
• Stable policy optimization: Existing works on policy gradient methods with neural network approximation assume that the policies perform sufficient exploration to avoid instability, i.e., convergence to near-deterministic and strictly suboptimal stationary policies. In this paper, we prove that policy optimization is stabilized by incorporating entropy regularization, gradient clipping and averaging. In particular, we show that the combination of these methods leads to a "persistence of excitation" condition, which ensures sufficient exploration and thereby avoids near-deterministic and strictly suboptimal stationary policies. Consequently, we prove convergence to the globally optimal policy under the mildest concentrability coefficient assumption for on-policy NAC, to the best of our knowledge.
• Understanding the dynamics of neural network approximation in policy optimization: Our analysis reveals that the uniform approximation power of the actor network in approximating Q-functions throughout the policy optimization steps is crucial for ensuring global (near-)optimality; this reflects a specific feature of reinforcement learning, which, in contrast to a static supervised learning problem, induces a distributional shift over time. To that end, we establish high-probability bounds for a two-layer feedforward actor neural network to uniformly approximate the Q-functions of the policy iterates during training.
1.2 Related Work
Policy gradient and actor-critic: Policy gradient methods use a gradient-based scheme to find the optimal policy [4, 5]. [12] proposed the natural gradient method, which uses the Fisher information matrix as a pre-conditioner to better fit the problem geometry. The actor-critic method, which learns approximations to both state-action value functions and policies for variance reduction, was introduced in [6].
Neural actor-critic methods: Recently, there has been a surge of interest in direct policy optimization methods for solving MDPs with large state spaces by exploiting the representation power of deep neural networks. In particular, deterministic policy gradient [23], trust region policy optimization (TRPO) [24], proximal policy optimization (PPO) [25] and soft actor-critic (SAC) [15, 26] have achieved impressive empirical success in solving complicated control tasks.
Role of regularization: Entropy regularization is an essential part of policy optimization algorithms (e.g., TRPO, PPO and SAC) to encourage exploration and achieve fast and stable convergence. It has been numerically observed that entropy regularization leads to a smoother optimization landscape, which in turn improves the convergence properties of policy optimization [16]. For tabular reinforcement learning, the impact of entropy regularization was studied in [17, 20, 21]. On the other hand, the function approximation regime leads to considerably different dynamics compared to the tabular setting, mainly because of the generalization over a large state space, complex exploration dynamics and distributional shift. As such, the role of regularization is very different in the function approximation regime, which we study in this paper.
Theoretical analysis of policy optimization methods: Despite the vast literature on the practical performance of PG/AC/NAC type algorithms, their theoretical understanding has remained elusive until recently. In the tabular setting, global convergence rates for PG methods were established in [17, 18, 27]. By incorporating entropy regularization, it was shown in [28, 19, 20, 29, 30] that the convergence rate can be improved significantly in the tabular setting. Finite-time performances of off-policy actor-critic methods in the tabular and linear function approximation regimes were investigated in [31, 32]. In our paper, we consider neural network approximation under entropy regularization with on-policy sampling.
On the other hand, when the controller employs a function approximator for the purpose of generalization to a large state-action space, the convergence properties of policy optimization methods change radically due to the more complicated optimization landscape and the distribution mismatch phenomenon in reinforcement learning [17]. Under strong assumptions on the exploratory behavior of policies throughout the learning iterations, global optimality of NPG with linear function approximation up to a function approximation error was established in [17]. For actor-critic and natural actor-critic methods with linear function approximation, there are finite-time analyses in [33, 34, 35]. For general actor schemes with a linear critic, convergence to stationary points was investigated in [36, 37, 38].
By incorporating entropy regularization, it was shown in [39] that improved convergence rates can be established under much weaker conditions on the underlying controlled Markov chain with linear function approximation. Our paper uses results from the drift analysis in [39], but addresses the complications due to the nonlinearity introduced by ReLU activation functions, and establishes global convergence to the optimal policies. The neural network approximation eliminates the function approximation error, which is a constant in the linear function approximation setting, by employing a sufficiently wide actor neural network.
Neural network analysis: The empirical success of neural networks, which have more parameters than the data points, has been theoretically explained in [40, 41, 42], where it was shown that overparameterized neural networks trained by using first-order optimization methods achieve good generalization properties. The need for massive overparameterization was addressed in [43, 44], where it was shown that considerably smaller network widths can suffice to achieve good training and generalization results in structured supervised learning problems. Our analysis in this work is mainly inspired by [43]. On the other hand, the reinforcement learning problem has significantly different and more challenging dynamics than the supervised learning setting, as actor-critic poses a dynamic optimization problem in which distributional shift occurs as the policies are updated. As such, the uniform approximation power of the actor network in approximating various functions throughout the policy optimization steps becomes critical, different from the supervised learning setting in [45]. Our analysis utilizes tools from [45, 43]: (i) we consider the max-norm geometry to achieve mild overparameterization, and (ii) we bound the distance between the neural tangent kernel (NTK) function class and the class of functions realizable by a finite-width neural network by extending the ReLU analysis in [43].
The most relevant work in the literature is [22], where the convergence of NAC with a two-layer neural network was studied without entropy regularization. It was shown that, under strong assumptions on the exploratory behavior of policies throughout the trajectory, neural NPG achieves near-optimality with corresponding sample complexity and network width bounds. In this paper, we incorporate widely-used algorithmic techniques (entropy regularization, averaging and gradient clipping) into NAC with neural network approximation, and prove significantly improved sample complexity and overparameterization bounds under weaker assumptions on the concentrability coefficients. Additionally, our analysis reveals that the uniform approximation power of the actor neural network is critically important to establish global optimality, where distributional shift plays a crucial role (see Section 5.4). In another relevant work, [46] considers a single-timescale actor-critic with neural network approximation, but the function approximation error was not investigated due to the realizability assumption, which assumes that all policies throughout the policy optimization steps are realizable by the neural network. One of the main goals of our work is to study the benefits of employing neural networks in policy optimization, and we explicitly characterize the function class and approximation error that stem from the use of finite-width neural networks.
1.3 Notation
For a sequence of numbers where is an index set, denotes the vector obtained by concatenation of . For a set , denotes its cardinality. For two distributions defined over the same probability space, Kullback-Leibler divergence is denoted as follows: For a convex set and , denotes the projection of onto : . For , . For and we denote
where denotes the row of
2 Background and Problem Setting
In this section, we introduce the problem setting, the natural actor-critic method, and the entropy regularization and neural network approximation schemes that we consider.
2.1 Markov Decision Processes
We consider a discounted Markov decision process where and are the state and action spaces, is an (unknown) transition kernel, is the reward function, and is the discount factor. In this work, we consider a state space and a finite action space. Also, we assume that, by appropriate representation of the state and action variables, the following bound holds throughout the paper:
(1)
Value function: A randomized policy assigns a probability to each action at a given state. A policy, together with the transition kernel and the initial state, induces a trajectory of state-action pairs. For any initial state, the corresponding value function of a policy is as follows:
(2)
where the expectation is taken over the trajectory induced by the policy.
Entropy regularization: In policy optimization, in order to encourage exploration and avoid near-deterministic suboptimal policies, entropy regularization is commonly used in practice [8, 15, 9, 16]. For a policy , let
(3)
where is the entropy functional. Then, for , the entropy-regularized value function is defined as follows:
(4)
Note that the uniform distribution over actions maximizes the entropy regularizer at every state. Hence, the additional term in (4) encourages exploration while introducing some bias controlled by the regularization parameter.
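For reference, the entropy functional and the entropy-regularized value function take the following standard form; the display below follows the convention of the entropy-regularized RL literature (e.g., [20]) and is meant only as an illustration of (3)-(4).

```latex
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\sum_{a \in \mathcal{A}} \pi(a \mid s) \log \pi(a \mid s),
\qquad
V_{\lambda}^{\pi}(s) \;=\; \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r(s_t, a_t) + \lambda \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \,\middle|\, s_0 = s \right].
```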
Entropy-regularized objective: For a given initial state distribution and for a given regularization parameter , the objective in this paper is to maximize the entropy-regularized value function:
(5)
We denote the optimal policy for the regularized MDP as throughout the paper.
Q-function and advantage function: The entropy-regularized (or soft) Q-function is defined as:
(6)
Note that the soft Q-function is the fixed point of the Bellman equation associated with the corresponding soft Bellman operator. As we will see, for NAC algorithms, the following variant of the soft Q-function under a policy turns out to be a useful quantity [20]:
(7)
Note that the two Q-functions are related as follows:
The advantage function under a policy is defined as follows:
(8)
Similarly, the soft advantage function is defined as follows:
(9)
Lastly, we can bound the entropy-regularized value function as follows:
(10)
for any since and for any distribution over [20].
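For concreteness, one standard form of the soft policy-evaluation (Bellman) operator whose fixed point is the entropy-regularized Q-function is shown below; the exact placement of the entropy bonus in (6)-(7) may differ from this convention.

```latex
(\mathcal{T}^{\pi}_{\lambda} Q)(s,a) \;=\; r(s,a)
  \;+\; \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ \sum_{a' \in \mathcal{A}}
      \pi(a' \mid s') \Big( Q(s', a') \;-\; \lambda \log \pi(a' \mid s') \Big) \right].
```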
2.2 Natural Policy Gradient under Entropy Regularization
For a given randomized policy parameterized by where is a given parameter space, policy gradient methods maximize by using the policy gradient . Natural policy gradient, as a quasi-Newton method, adjusts the gradient update to fit problem geometry by using the Fisher information matrix as a pre-conditioner [12, 20].
Let
be the Fisher information matrix under policy , where
is the discounted state visitation distribution under a policy . Then, the update rule under NPG can be expressed as
(11)
where is the step-size. Equivalently, the NPG update can be written as follows:
(12)
The above update scheme is closely related to gradient ascent and policy mirror ascent. Note that the gradient ascent for policy optimization performs the following update:
(13)
The update in (13) leads to the policy gradient algorithm [4]. Compared to (13), the natural policy gradient uses a generalized Mahalanobis distance (i.e., a weighted Euclidean distance) as the Bregman divergence instead of the Euclidean distance, which yields significant improvements in policy optimization by avoiding the so-called vanishing gradient problem in the tabular case [20, 19, 17]. Consequently, state-of-the-art reinforcement learning algorithms such as trust region policy optimization (TRPO) [24], proximal policy optimization (PPO) [25] and soft actor-critic [15] are variants of the natural policy gradient method.
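As a point of reference, in the tabular softmax case the entropy-regularized NPG update admits a closed-form multiplicative-weights expression [20]. The sketch below illustrates that tabular form (function and variable names are illustrative, not taken from this paper); the remainder of the paper replaces the exact Q-values with neural estimates.

```python
import numpy as np

def soft_npg_step(pi, q_soft, eta, lam):
    """One tabular entropy-regularized NPG update (multiplicative-weights form, cf. [20]).

    pi     : (S, A) array, current policy; rows sum to 1
    q_soft : (S, A) array, soft Q-values of the current policy
    eta    : step-size; lam : entropy-regularization coefficient
    """
    # pi_{t+1}(a|s) is proportional to pi_t(a|s)^{1 - eta*lam} * exp(eta * Q(s, a)).
    logits = (1.0 - eta * lam) * np.log(pi + 1e-12) + eta * q_soft
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

With step-size eta = 1/lam, the dependence on the previous policy vanishes and the update reduces to a softmax of the soft Q-values, i.e., soft policy iteration.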
In the following, we provide necessary tools to compute the policy gradient and the update rule in (11) based on [39].
Proposition 1 (Policy gradient).
For any and , we have:
(14)
Based on Proposition 1, the natural policy gradient update can be computed by the following lemma, which is an extension of results in [12, 17, 39].
Lemma 1.
Let
(15)
be the error for a given policy parameter . Define
(16)
Then, we have:
(17)
where is the Fisher information matrix.
The above results for general policy parameterization will provide basis for the entropy-regularized natural actor-critic (NAC) with neural network approximation that we will introduce in the following section, with certain modifications for variance reduction and stability that we will describe; see Remark 2 later.
3 Natural Actor-Critic with Neural Network Approximation
In this section, we will introduce the entropy-regularized natural actor critic algorithm, where both the actor and critic are represented by single-hidden-layer neural networks.
Throughout this paper, we make the following assumption, which is standard in policy optimization [17].
Assumption 1 (Sampling oracle).
For a given initial state distribution and policy , we assume that the controller is able to obtain an independent sample from at any time.
The sampling process involves a resetting mechanism and a simulator, which are available in many important application scenarios, and sampling from a state visitation distribution can be performed as described in [47].
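One common way to realize this sampling oracle with a resetting simulator is to terminate a rollout with probability 1 - gamma at each step, so that the stopping state-action pair follows the discounted visitation measure. The routine below is a generic sketch of this idea; the simulator interface (reset/step callables) is an assumption for illustration and is not taken from [47].

```python
import random

def sample_visitation(reset, step, policy, gamma):
    """Draw one (state, action) pair approximately from the discounted
    state-action visitation distribution of `policy`.

    reset  : callable() -> initial state drawn from the start distribution
    step   : callable(state, action) -> next state (simulator transition)
    policy : callable(state) -> sampled action
    gamma  : discount factor in [0, 1)
    """
    state = reset()
    action = policy(state)
    # Continue with probability gamma; stop with probability 1 - gamma.
    while random.random() < gamma:
        state = step(state, action)
        action = policy(state)
    return state, action
```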
3.1 Actor Network and Natural Policy Gradient
For a network width and , for , the actor network is given by the single-hidden-layer neural network:
(18)
where the activation is the ReLU function. As is common practice [43, 44, 42], we fix the output layer after a random initialization, and only train the weights of the hidden layer. Given a (possibly random) parameter, a design parameter, the regularization parameter and the network width, the parameter space that we consider is as follows:
(19)
For this parameter space , the policy class that we consider is , where the policy that corresponds to is as follows:
(20)
We randomly initialize the actor neural network by using the symmetric initialization in Algorithm 1, which was introduced in [48].
Later, we will employ a similar symmetric initialization scheme for the critic neural network.
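A minimal sketch of this construction is given below: a two-layer ReLU network whose hidden weights are initialized in mirrored pairs with opposite (fixed) output weights, so that the network output is identically zero at initialization, together with the induced softmax policy. The 1/sqrt(m) scaling and the helper names follow standard NTK conventions rather than the paper's exact notation.

```python
import numpy as np

def symmetric_init(m, d, rng):
    """Symmetric initialization [48]: mirrored hidden weights with opposite
    output weights, so f(x; W0) = 0 for every input x."""
    assert m % 2 == 0
    w_half = rng.standard_normal((m // 2, d))
    c_half = rng.choice([-1.0, 1.0], size=m // 2)
    W0 = np.vstack([w_half, w_half])           # mirrored hidden-layer weights
    c = np.concatenate([c_half, -c_half])      # fixed output layer, opposite signs
    return W0, c

def actor_net(x, W, c):
    """Two-layer ReLU network f(x; W) = (1/sqrt(m)) * sum_i c_i * relu(w_i^T x)."""
    return (c * np.maximum(W @ x, 0.0)).sum() / np.sqrt(W.shape[0])

def policy_probs(sa_features, W, c):
    """Softmax policy over actions induced by the actor network outputs,
    where sa_features[a] is the feature vector of (s, a)."""
    logits = np.array([actor_net(x, W, c) for x in sa_features])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()
```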
We denote the policy at iteration as and neural network output as . In the absence of and , we estimate
(21)
by using samples via the following actor-critic method:
• Critic update: Given the current policy, the critic computes an estimate of its soft Q-function by the max-norm regularized neural TD learning algorithm described in Section 3.2 (Algorithm 3).
• Policy update: Given the critic's estimate, we approximate the quantity in (21) by using stochastic gradient descent (SGD) with a prescribed number of iterations and step-size. Starting from a given initialization, an iteration of SGD is as follows:
(22)
(23)
where the regression targets are given by the output of the critic. The final estimate is then the average of the SGD iterates, and we use it to perform the policy update with the step-size specified in Algorithm 2 (a schematic sketch of this step is given below).
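The following is only a schematic sketch of this actor step: projected SGD on a squared regression error between the actor's gradient features and the critic targets, with averaged iterates, followed by an averaged, clipped parameter update. The exact loss, the projection radius, and the mixing form of the update are illustrative assumptions rather than the pseudocode of Algorithm 2.

```python
import numpy as np

def clip_to_ball(u, radius):
    """Project (clip) a vector onto the l2-ball of the given radius."""
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

def npg_direction_sgd(feats, targets, n_iters, step, radius, rng):
    """Estimate the natural-gradient direction by projected SGD on the squared
    error between linear predictions (actor gradient features) and critic
    targets, returning the average of the SGD iterates.

    feats   : (N, p) array of actor gradient features at sampled (s, a) pairs
    targets : (N,) array of critic estimates (e.g., soft advantage values)
    """
    u = np.zeros(feats.shape[1])
    avg = np.zeros_like(u)
    for t in range(n_iters):
        i = rng.integers(len(targets))
        grad = 2.0 * (feats[i] @ u - targets[i]) * feats[i]   # stochastic gradient
        u = clip_to_ball(u - step * grad, radius)
        avg += (u - avg) / (t + 1)                            # running average
    return avg

def actor_update(w, u, eta, lam, radius):
    """Averaged policy update: mix the clipped direction into the (flattened)
    actor parameter vector w; eta is the step-size, lam the regularization."""
    return (1.0 - eta * lam) * w + eta * clip_to_ball(u, radius)
```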
The natural actor-critic algorithm is summarized in Algorithm 2. Below, we summarize the modifications in the algorithm that we consider in this paper with respect to the NPG described in the previous section.
Remark 1 (Averaging and projection).
The update in each iteration of the NAC algorithm described in Algorithm 2 can be equivalently written as follows:
(24)
where the update direction is an approximate solution to the optimization problem (21). As we will see, the projection of this direction onto a bounded set (which can be considered as gradient clipping), in conjunction with the averaging in the policy update (24), enables us to control the magnitude of the actor parameters while taking (natural) gradient steps towards the optimal policy. Controlling this magnitude is critical for two reasons: (i) to ensure sufficient exploration, and (ii) to establish the convergence bounds for the neural networks.
Alternatively, one may be tempted to project the actor parameters onto a ball around the initialization in the Euclidean geometry to control their magnitude. However, as the algorithm follows the natural policy gradient, which uses a Bregman divergence different from the squared Euclidean distance, projecting the parameters with respect to the Euclidean norm may not move the policy in a direction of improvement. Similarly, since we parameterize the policies by a lower-dimensional vector to avoid storing and computing tabular policies, the Bregman projection onto the probability simplex, which is commonly used under direct parameterization, is not a feasible option for policy optimization with function approximation.
As such, simultaneous use of averaging and projection of the update are critical to control the network weights and policy improvement.
Remark 2 (Baseline). In the sample-based policy updates, we use the (soft) value function as a baseline, so that the critic's estimate of the soft advantage function in (28), rather than the soft Q-function itself, serves as the regression target; this reduces the variance of the update without changing the natural gradient direction.
In the following subsection, we describe the critic algorithm in detail.
3.2 Critic Network and Temporal Difference Learning
We estimate the soft Q-function by using the neural TD learning algorithm with max-norm regularization [49]. Note that the soft advantage function can be directly obtained from the soft Q-function via (9). Since the soft Q-function is the fixed point of the Bellman equation (6), it can be approximated by using temporal difference (TD) learning algorithms.
For the critic, we use a two-layer neural network of width , which is defined as follows:
(25)
The critic network is initialized according to the symmetric initialization scheme in Algorithm 1. Let denote the initialization.
We will consider max-norm regularization in the updates of the critic, which was shown to be effective in supervised learning and reinforcement learning (see [49, 50]). For a given and , let
(27)
Under max-norm regularization, each hidden unit’s weight vector is confined within the set for a given projection radius .
For , we assume that is sampled from , i.e., . Upon obtaining , the next state-action pair is obtained by following : , . One can replace the i.i.d. sampling here with Markovian sampling at the cost of a more complicated analysis as in [51]. However, since experience replay is used in practice, the actual sampling procedure is neither purely Markovian nor i.i.d., and here, for simplicity of the analysis, we choose to model it as i.i.d. sampling.
An iteration of MN-NTD is as follows:
where , and is the projection operator onto a set . The output of the critic, which approximates , is then obtained as:
where is the number of iterations of MN-NTD. We obtain an approximation of the soft Q-function as
The corresponding estimate for the soft advantage function is the following:
(28)
The critic update for a given policy is summarized in Algorithm 3.
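A compact sketch of the critic step is given below: neural TD learning for the soft Q-function with a per-neuron max-norm projection around the initialization, returning an averaged iterate. The sampler interface, the placement of the entropy bonus in the TD target, and the averaging of the iterates are illustrative assumptions; the precise definitions are those of Algorithm 3 and [49].

```python
import numpy as np

def critic_net(x, W, c):
    """Two-layer ReLU critic Q(x; W) = (1/sqrt(m)) * sum_i c_i * relu(w_i^T x)."""
    return (c * np.maximum(W @ x, 0.0)).sum() / np.sqrt(W.shape[0])

def mn_ntd(sampler, W0, c, gamma, alpha, radius, n_iters):
    """Max-norm regularized neural TD (MN-NTD) sketch for the soft Q-function.

    sampler : yields (x, r_soft, x_next), where x and x_next are features of the
              current and next on-policy state-action pairs, and r_soft is the
              reward including the entropy bonus of the regularized MDP.
    radius  : per-neuron max-norm projection radius around the initialization W0.
    """
    m = W0.shape[0]
    W = W0.copy()
    W_avg = np.zeros_like(W)
    for t in range(n_iters):
        x, r_soft, x_next = sampler()
        # Semi-gradient TD(0) update for the entropy-regularized Q-function.
        td_err = r_soft + gamma * critic_net(x_next, W, c) - critic_net(x, W, c)
        grad_W = (c[:, None] * (W @ x > 0.0)[:, None] * x[None, :]) / np.sqrt(m)
        W = W + alpha * td_err * grad_W
        # Max-norm regularization: project each hidden unit's weight vector
        # onto a ball of the given radius centered at its initialization.
        diff = W - W0
        norms = np.linalg.norm(diff, axis=1, keepdims=True)
        W = W0 + diff * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
        W_avg += (W - W_avg) / (t + 1)
    return W_avg
```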
4 Main Results: Sample Complexity and Overparameterization Bounds for Neural NAC
In this section, we analyze the convergence of the entropy-regularized neural NAC algorithm and provide sample complexity and overparameterization bounds for both the actor and the critic.
4.1 Regularization and Persistence of Excitation under Neural NAC
The following proposition shows that the persistence of excitation condition (see [52] for a discussion of its critical role in stochastic control problems) is satisfied under Algorithm 2, which ensures sufficient exploration and hence convergence to global optimality.
Proposition 2 (Persistence of excitation).
For any regularization parameter , projection radius , the entropy-regularized NAC satisfies the following:
(29)
where
(30)
for all almost surely. Consequently,
(31)
simultaneously for all with probability at least over the random initialization of the actor network, where the function is given by
(32)
Proposition 2 has two critical implications:
(i) The inequality in (29) implies that any action is explored with strictly positive probability at any given state, which implies that all policies throughout the policy optimization steps satisfy the "persistence of excitation" condition with high probability over the random initialization. As we will see in the convergence analysis, this property implies sufficient exploration, which ensures that near-deterministic suboptimal policies are avoided. Sufficient exploration is achieved by entropy regularization, averaging, projection of the update direction, and a large network width for the policy parameterization.
(ii) The inequality (29) implies that we can control the deviation of the actor network weights from their initialization in terms of the algorithm parameters. This property is key for the neural network analysis in the lazy-training regime.
4.2 Transportation Mappings and Function Classes
We first present a brief discussion on kernel approximations of neural networks, which will be useful to state our convergence results. Consider the following space of mappings:
(33)
and the function class:
(34)
Note that is a provably rich subset of the reproducing kernel Hilbert space (RKHS) induced by the neural tangent kernel, which can approximate continuous functions over a compact space [43, 40, 53]. For a given class of transportation maps for , we also consider the following subspace of :
(35)
Note that the above set depends on the choice of the transportation maps, but these maps can be arbitrary. The space of continuously differentiable functions over a compact domain has a countable basis [54]. By [43, Theorem 4.3], one can find transportation mappings such that the resulting class approximates this space well. As such, with appropriate transportation maps, the above subspace is able to approximate a function class which contains the continuously differentiable functions over a compact space.
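To make the definitions above concrete, the snippet below is a small numerical illustration of how a function in the NTK-induced class, defined through an (arbitrary, bounded) transportation map, is approximated by a finite-width random-feature average; the particular choice of map and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 50_000

def v(w):
    """An arbitrary bounded transportation map R^d -> R^d (for illustration only)."""
    return np.tanh(w)

# Target function in the NTK class: f(x) = E_w[ v(w)^T x * 1{w^T x > 0} ], w ~ N(0, I_d).
# Finite-width Monte Carlo approximation with m random features:
W = rng.standard_normal((m, d))

def f_hat(x):
    active = W @ x > 0.0              # ReLU derivative sigma'(w^T x)
    return float(np.mean((v(W) @ x) * active))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(f_hat(x))                       # concentrates around the population value as m grows
```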
4.3 Convergence of the Critic
We make the following realizability assumption for the Q-function.
Assumption 2 (Realizability of the Q-function).
For any , we assume that for some .
Assumption 2 is a smoothness condition on the class of realizable functions that can be approximated by the critic network, which is dense in the space of continuous functions over a compact domain (see Section 4.2); the upper bound on the RKHS norm serves as the measure of smoothness. One can also replace the above condition by a slightly stronger condition. Note that the class of functions is deterministic and its approximation properties are well-known [43]. In [22], it was assumed that the state-action value functions lie in a random function class, which is obtained by shifting with a Gaussian process. By employing a symmetric initialization, we eliminate this Gaussian process noise, and therefore the realizable class of functions is deterministic and provably rich.
Theorem 1 (Convergence of the Critic, Theorem 2 in [49]).
Under Assumption 2, for any error probability , let
and . Then, for any target error , number of iterations , network width
and step-size
the critic yields the following bound:
where holds with probability at least over the random initializations of the critic network.
Note that in order to achieve a target error below a prescribed tolerance, the network width and iteration complexity stated above suffice. The analysis of the TD learning algorithm in [49] uses results from [45], which were given for classification (supervised learning) problems with logistic loss. On the other hand, TD learning requires a significantly more challenging analysis because of bootstrapping in the updates (i.e., using a stochastic semi-gradient instead of a true gradient) and the quadratic loss function. Furthermore, for improved sample complexity and overparameterization bounds, max-norm regularization is employed instead of early stopping [49].
4.4 Global Optimality and Convergence of Neural NAC
In this section, we provide the main convergence result for the entropy-regularized NAC with neural network approximation.
Assumption 3 (Realizability).
For , we assume that for all , , where the function class is defined in Section 4.2.
Note that approximates a rich class of functions over a compact space well for large (see Section 4.2). Also, Assumption 3 implies that there is a structure among the soft Q-functions in the policy class since each can be written as a linear combination of functions that correspond to the transportation maps . We consider this relatively restricted function class instead of to obtain uniform approximation error bounds to handle the dynamic structure of the policy optimization over time steps. Notably, the actor network features are expected to fit over all iterations, thus an inherent structure in appears to be necessary. For further discussion, see Section 5.4 (particularly Remark 7).
4.4.1 Performance Bounds under a Weak Distribution Mismatch Condition
First, we establish sample complexity and overparameterization bounds under a weak distribution mismatch condition, which is provided below. This condition is significantly weaker than those in the existing literature (e.g., [22, 55, 17]) since we proved in Proposition 2 that the policies achieve sufficient exploration (see Remark 3 for details).
Assumption 4 (Weak distribution mismatch condition).
There exists a constant such that
Remark 3 (Weak distribution mismatch condition).
Note that a sufficient condition for Assumption 4 is an exploratory initial state distribution , which covers the support of the state visitation distribution of :
(36)
since the discounted state visitation distribution under any policy dominates a constant multiple of the initial state distribution. Hence, if the initial distribution has a sufficiently large support set, then Assumption 4 is satisfied without any further assumptions. Together with Proposition 2, it ensures stability of the policy optimization with minimal assumptions.
The following theorem is one of the main results in this paper, which establishes the convergence bounds of the NAC algorithm.
Theorem 2 (Performance bounds).
Under Assumptions 1-4, Algorithm 2 with and regularization coefficient satisfies the following bounds:
(1) with step-size , we have
(2) with step-size , we have
for any where over the random initialization of the actor and critic networks,
(37)
which is an upper bound on the gradient norm in (22),
and (as specified in Theorem 1).
In the following, we characterize the sample complexity, iteration complexity and overparameterization bounds based on Theorem 2.
Corollary 1 (Sample Complexity and Overparameterization Bounds).
For any and , Algorithm 2 with satisfies:
where over the random initialization of the actor-critic networks for the following parameters:
• iteration complexity: ,
• actor network width: ,
• critic sample complexity: ,
• critic network width: ,
• actor sample complexity: .
Hence, the overall sample complexity of the Neural NAC algorithm is .
Remark 4 (Bias-variance tradeoff in policy optimization).
By Proposition 2, the network parameters evolve within a bounded region around the initialization. Hence, the NAC always performs a policy search within a class of randomized policies, which leads to fast and stable convergence under minimal regularity conditions. In particular, Assumption 4 is the mildest distributional mismatch condition in on-policy NPG/NAC settings to the best of our knowledge, and it suffices to establish the convergence results in Theorem 2. On the other hand, entropy regularization introduces a bias term controlled by the regularization parameter; hence the convergence is to the optimal policy of the regularized MDP. Another way to see this is that deterministic policies, which require unbounded parameters, may not be achieved since the parameters are always contained within a compact set. Letting the regularization parameter vanish eliminates the bias, but at the same time reduces the convergence speed and may lead to instability due to lack of exploration. Hence, there is a bias-variance tradeoff in policy optimization, controlled by the regularization parameter.
Remark 5 (Different network widths for actor and critic).
Corollary 1 indicates that the actor network requires neurons while the critic network requires although both approximate (soft) state-action value functions. This difference is because the actor network is required to uniformly approximate all state-action value functions over the trajectory, while the critic network approximates (pointwise) a single state-action value function at each iteration.
Remark 6 (Fast initial convergence rate under constant step-sizes).
The second part of Theorem 2 indicates that the convergence rate is under a constant step-size , while there is an additional error term . This justifies the common practice of “halving the step-size” in optimization (see, e.g., [56]) for the specific case of natural actor-critic that we investigate: one achieves a fast convergence rate with a constant step-size until the optimization stalls, then the process is repeated after halving the step-size.
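The "halve the step-size when progress stalls" schedule referenced above can be sketched generically as follows; this is only an illustration of the heuristic, not part of Algorithm 2, and the stall test and constants are placeholders.

```python
def run_with_halving(nac_round, objective, eta0, n_rounds, tol=1e-3):
    """Run rounds of NAC updates, halving the constant step-size whenever the
    measured progress per round falls below a tolerance.

    nac_round : callable(eta) -> performs a batch of NAC updates with step-size eta
    objective : callable() -> current (estimated) regularized objective value
    """
    eta, prev = eta0, objective()
    for _ in range(n_rounds):
        nac_round(eta)
        cur = objective()
        if cur - prev < tol:   # fast initial progress has stalled
            eta *= 0.5         # halve the step-size and continue
        prev = cur
    return eta
```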
4.4.2 Performance Bounds under a Strong Distribution Mismatch Condition
In the following, we consider the standard distribution mismatch condition (e.g., in [55, 22]) and establish sample complexity and overparameterization bounds based on Theorem 2, for the unregularized MDP.
Assumption 4’ (Strong distribution mismatch condition).
There exists a constant such that
(38)
Note that Assumption 4' implies Assumption 4, and it is a considerably stronger assumption that requires the policies to be sufficiently exploratory throughout policy optimization.
Corollary 2.
Under Assumptions 1-3 and 4', for any and , Algorithm 2 with and satisfies:
where over the random initialization of the actor-critic networks for the following parameters:
• iteration complexity: ,
• actor network width: ,
• critic sample complexity: ,
• critic network width: ,
• actor sample complexity: ,
where .
Hence, the overall sample complexity of Neural NAC for finding an -optimal policy of the unregularized MDP is .
4.5 Comparison With Prior Works
Among the existing works that theoretically investigate policy gradient methods, the most closely related one is [22], which considers neural PG/NPG methods equipped with a two-layer neural network. We point out the key differences between our work and these prior works:
• In the proofs in the subsequent section, it will become clear that one needs to uniformly bound the function approximation error in the actor part of our algorithm to address the dependencies between the parameter values across iterations and the NTRF features. We propose new techniques to address this point, which was not addressed in the prior work.
• While our algorithm is similar in spirit to the algorithms analyzed in the prior works, we also incorporate a number of important algorithmic ideas that are used in practice (e.g., entropy regularization, averaging, gradient clipping). As a result, we have to use different analysis techniques. As a consequence of these algorithmic and analytical techniques, we obtain considerably sharper sample complexity and overparameterization bounds (see Table 1). Interestingly, all of these algorithmic improvements to the original NAC algorithms seem to be important to obtain the sharper bounds.
Paper | Algorithm | Width of actor, critic | Sample comp. | Error | Condition | Objective
--- | --- | --- | --- | --- | --- | ---
[22] | Neural NPG | , | | | Strong | Unregularized
Ours | Neural NAC | , | | | Weak | Regularized
Ours | Neural NAC | , | | | Strong | Unregularized
5 Finite-Time Analysis of Neural NAC
In this section, we provide the convergence analysis of the algorithm.
5.1 Analysis of Neural Network at Initialization
For and any , let
(39)
and define
(40)
The following lemma bounds the deviation of the neural network from its linear approximation around the initialization, and it will be used throughout the convergence analysis.
Lemma 2.
Let for all , and for some . Then,
(41)
(42)
(43)
under the event defined in (40), which holds with probability at least over the random initialization of the actor.
Proof.
Let . For , let
For any , the following is true:
(44)
where the first inequality is true since and the second inequality follows from Cauchy-Schwarz inequality and . Therefore,
Since
we have:
Since the above inequality leads to the following:
Taking supremum over , and using Lemma 4 in [58] on the RHS of the above inequality concludes the proof.
Note that Lemma 2 is an extension of the concentration bounds in [45, 41, 58] for neural networks. On the other hand, our concentration result provides uniform convergence over the whole domain rather than finitely many points, thus it is a stronger concentration bound compared to the ones in the literature that are used to analyze neural networks [45, 41]. We need these uniform concentration inequalities to address the challenges due to the dynamics of policy optimization, e.g., distributional shift.
5.2 Impact of Entropy Regularization
First, we analyze the impact of entropy regularization, which will yield key results in the convergence analysis.
Proof of Proposition 2.
Recall from Algorithm 2 that the policy update is as follows:
Let for all . Then, the update rule can be written as:
Since the step-size is , we have:
by induction. Hence, by triangle inequality:
(45)
for any . Note that as a consequence of projection, therefore for all . Hence, by (45), we conclude that
(46)
for any . Also, since , we have:
(47)
Under a constant step-size , we can expand the parameter movement for any as follows:
for any neuron . Then, we have:
(48)
which follows from triangle inequality, due to the projection, and the fact that for any .
In order to prove the lower bound for , first recall that . Hence, a uniform upper bound on over all and suffices to lower bound . By symmetric initialization, for all . Hence,
(49)
First, we bound the first summand on the RHS of (49) by using (46) and triangle inequality:
(50)
since . For the last term in (49), first note that , so we can use Lemma 2. By using triangle inequality and Lemma 2:
with probability at least over the random initialization of the actor network. Hence, with probability at least ,
and .
∎
5.3 Lyapunov Drift Analysis
First, we present a key lemma which will be used throughout the analysis.
Lemma 3 (Log-linear approximation error).
Let
be log-linear approximation of the policy . Then, for any , we have:
(51)
over .
Proof.
Note that for a ReLU neural network. By using this, we can write the log-linear approximation error as follows:
(52)
Lemma 4 (Lemma 5, [39]).
Lemma 4 is an extension of the performance difference lemma in [60], and the proof can be found in [39]. In the following, we provide the main Lyapunov drift, which is central to the proof. This Lyapunov function is widely used in the analysis of natural gradient descent algorithms [14, 17, 22, 39].
Definition 1 (Potential function).
For any policy , the potential function is defined as follows:
(57)
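For orientation, a typical potential function of this kind in the NPG literature [17, 39] is the KL divergence to the optimal policy averaged over a fixed comparison distribution; the display below is meant only to indicate the type of quantity in (57), with the comparison distribution left unspecified.

```latex
\Phi(\pi) \;=\; \mathbb{E}_{s \sim \nu}\Big[ D_{\mathrm{KL}}\big( \pi^{*}(\cdot \mid s) \,\big\|\, \pi(\cdot \mid s) \big) \Big].
```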
Lemma 5 (Lyapunov drift).
For any , let . Then,
(58)
in the event which holds with probability at least over the random initialization of the actor.
Proof.
First, note that the log-linear approximation of is smooth [17]:
(59)
for any since . Also,
To use the smoothness of log-linear approximation, we use a telescoping sum and obtain:
By Lemma 3, the last two terms are bounded by . Let
Then, by the smoothness of the log-linear approximation, we have:
Recall . Using Lemma 4 and the definition of the advantage function, we obtain:
(60)
Since we have , we have the following inequality:
By the definition of and the fact that due to the symmetric initialization, we have:
Substituting this identity to the above inequality, we have:
(61)
where we used (47) to bound . Furthermore, note that
for any . Thus, by Lemma 2,
This bounds the penultimate term in (61). Finally, in order to bound the fifth term in (61), we use Pinsker’s inequality and then Lemma 3:
Substituting these into (61) and then into (58), the desired result follows. ∎
5.4 Analysis of the Function Approximation Error: How Do Neural Networks Address Distributional Shift in Policy Optimization?
As a specific feature of reinforcement learning, policy optimization in particular, the probability distribution of the underlying system changes over time as a function of the control policy. Consequently, the function approximator (i.e., the actor network in our case) needs to adapt to this distributional shift throughout the policy optimization steps. In this subsection, we analyze the function approximation error, which sheds light on how neural networks in the NTK regime address the distributional shift challenge.
Now we focus on the approximation error in Lemma 5:
(62)
Note that the above quantity can be equivalently expressed as follows:
where is the soft advantage function. The above identity provides intuition about the choice of sample-based gradient update in Algorithm 2, which we will investigate in detail later.
Let
In the following, we answer this question: given perfect knowledge of the soft Q-function, what is the minimum achievable approximation error?
Proposition 3 (Approximation Error).
Under symmetric initialization of the actor network, we have the following results:
• Pointwise approximation error: For any and ,
(63)
where the expectation is over the random initialization of the actor network.
• Uniform approximation error: Let
Then, under Assumption 3, holds with probability at least over the random initialization of the actor network. Furthermore,
(64)
Proof.
For a given policy parameter , let the transportation mapping of be and let
Note that for any . Also, let
Since for all , projected risk minimization within for suffices for optimality. We have
and
(65)
1. Pointwise approximation error: First we consider a given fixed . Taking the expectation in (65) and using Fubini’s theorem,
(66)
(67)
(68)
where the identity (66) is due to the symmetric initialization, (67) holds because is independent (since is independent). By Cauchy-Schwarz inequality and the fact that , we have:
Hence, using this in (68), we obtain:
(69)
2. Uniform approximation error: For any , since there exists such that and . We consider the following error:
(70)
Then, we have the following identity from the definition of :
(71)
where . Then, by triangle inequality,
(72)
By using union bound and (72), for any , we have the following:
(73)
We utilize the following lemma to obtain a uniform bound over all policy parameters.
Lemma 6.
For any , for any , the following holds:
(74)
with probability at least .
Now, we have the following result for the approximation error under .
Corollary 3.
Remark 7 (Why do we need a uniform approximation error bound?).
Note that for a given fixed policy, Proposition 3 provides a sharp pointwise approximation error bound as long as the target function lies in the function class with a corresponding transportation map. In order for this result to hold, the initialization is required to be i.i.d., which is the main idea behind the random initialization schemes for the NTK analysis. On the other hand, in policy optimization, the policy parameter depends on the initialization, and therefore the corresponding quantities are not independent – hence the pointwise argument in (66) does not directly apply. Furthermore, the distribution of the iterates at any given time also depends on the initialization. Therefore, the pointwise approximation error cannot be used to provide an approximation bound under the entropy-regularized NAC. In the existing works, this important issue regarding the temporal correlation and its impact on the NTK analysis was not addressed. In this work, we utilize the uniform approximation error bound provided in Proposition 3 to address this issue.
In the absence of , the critic yields a noisy estimate . Additionally, since is not known a priori, samples are used to obtain the update . These two factors are the sources of error in the natural actor-critic method: . In the following, we quantify this error and show that:
1. increasing the number of SGD iterations,
2. increasing the representation power of the actor network in terms of its width, and
3. achieving a low mean-squared Bellman error in the critic (by using a sufficiently large number of critic samples),
lead to vanishing error.
First, we study the error introduced by using SGD for solving:
(75)
Proposition 4 (Theorem 14.8 in [61]).
The following proposition provides an error bound in terms of the statistical error for finding the optimum via SGD as well as TD-learning error in estimating the soft Q-function. Let
(77)
Proposition 5.
Let , hence . We have the following inequality:
(78)
where the expectation is over the samples for critic (TD learning) and actor (SGD) updates. Consequently, we have:
(79)
under the event .
Proof.
We extensively use the inequality for . First, note that
(80)
for any . Hence, under , we have:
(81)
(82)
(83)
where the second line follows from Prop. 4 and the last line follows from (80). Consequently, we have:
(84)
where we use and the following inequality:
Using (84) and Theorem 2 in [49] together with the inequality for , we obtain (79). ∎
Hence, we obtain the following bound on the approximation error .
Corollary 4 (Approximation Error).
Proof.
The main message of Corollary 4 is as follows: in order to eliminate the bias introduced by using (i) function approximation, (ii) sample-based estimation for actor and critic, one should employ more representation power in both actor and critic networks (via and ), and also use more samples in actor and critic updates (via and ). Furthermore, Corollary 4 quantifies the required network widths and sample complexities to achieve a desired bias .
5.5 Convergence of Entropy-Regularized Natural Actor-Critic
Proof of Theorem 2.
6 Conclusion
In this paper, we established global convergence of the two-timescale entropy-regularized NAC algorithm with neural network approximation. We observed that entropy regularization leads to significantly improved sample complexity and overparameterization bounds under weaker conditions since it (i) encourages exploration and (ii) controls the movement of the neural network parameters. We characterized the bias due to function approximation and sample-based estimation, and showed that overparameterization and increasing sample sizes eliminate this bias.
In practice, single-timescale natural policy gradient methods are predominantly used in conjunction with entropy regularization and off-policy sampling [15]. The analysis techniques that we develop in this paper can be used to analyze these algorithms.
In supervised learning, softmax parameterization is predominantly used for multiclass classification problems, where natural gradient descent is employed for a better adjustment to the problem geometry [50, 62, 63]. The techniques that we developed in this paper can be useful in establishing convergence results and understanding the role of entropy regularization as well.
Acknowledgements
S. Ç. would like to thank Siddhartha Satpathi for his help in the proof of Lemma 6. This work is supported in part by NSF TRIPODS grant CCF-1934986.
References
- [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
- [2] C. Szepesvári, “Algorithms for reinforcement learning,” Synthesis lectures on artificial intelligence and machine learning, vol. 4, no. 1, pp. 1–103, 2010.
- [3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific, 1996.
- [4] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
- [5] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation.” in NIPs, vol. 99. Citeseer, 1999, pp. 1057–1063.
- [6] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems. Citeseer, 2000, pp. 1008–1014.
- [7] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1928–1937.
- [8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016.
- [9] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Trust-pcl: An off-policy trust region method for continuous control,” arXiv preprint arXiv:1707.01891, 2017.
- [10] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International conference on machine learning. PMLR, 2016, pp. 1329–1338.
- [11] S.-I. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
- [12] S. M. Kakade, “A natural policy gradient,” Advances in neural information processing systems, vol. 14, 2001.
- [13] S. Bhatnagar, M. Ghavamzadeh, M. Lee, and R. S. Sutton, “Incremental natural actor-critic algorithms,” Advances in neural information processing systems, vol. 20, pp. 105–112, 2007.
- [14] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
- [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
- [16] Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” in International Conference on Machine Learning. PMLR, 2019, pp. 151–160.
- [17] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “Optimality and approximation with policy gradient methods in markov decision processes,” in Conference on Learning Theory. PMLR, 2020, pp. 64–66.
- [18] J. Bhandari and D. Russo, “Global optimality guarantees for policy gradient methods,” arXiv preprint arXiv:1906.01786, 2019.
- [19] G. Lan, “Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes,” arXiv preprint arXiv:2102.00135, 2021.
- [20] S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi, “Fast global convergence of natural policy gradient methods with entropy regularization,” arXiv preprint arXiv:2007.06558, 2020.
- [21] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, “On the global convergence rates of softmax policy gradient methods,” in International Conference on Machine Learning. PMLR, 2020, pp. 6820–6829.
- [22] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global optimality and rates of convergence,” arXiv preprint arXiv:1909.01150, 2019.
- [23] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning. PMLR, 2014, pp. 387–395.
- [24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897.
- [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- [26] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, “Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model,” Advances in Neural Information Processing Systems, vol. 33, pp. 741–752, 2020.
- [27] S. Khodadadian, P. R. Jhunjhunwala, S. M. Varma, and S. T. Maguluri, “On the linear convergence of natural policy gradient algorithm,” arXiv preprint arXiv:2105.01424, 2021.
- [28] L. Shani, Y. Efroni, and S. Mannor, “Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5668–5675.
- [29] W. Zhan, S. Cen, B. Huang, Y. Chen, J. D. Lee, and Y. Chi, “Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence,” arXiv preprint arXiv:2105.11066, 2021.
- [30] W. Yang, X. Li, G. Xie, and Z. Zhang, “Finding the near optimal policy via adaptive reduced regularization in mdps,” arXiv preprint arXiv:2011.00213, 2020.
- [31] S. Khodadadian, Z. Chen, and S. T. Maguluri, “Finite-sample analysis of off-policy natural actor-critic algorithm,” in International Conference on Machine Learning. PMLR, 2021, pp. 5420–5431.
- [32] Z. Chen, S. Khodadadian, and S. T. Maguluri, “Finite-sample analysis of off-policy natural actor-critic with linear function approximation,” IEEE Control Systems Letters, 2022.
- [33] ——, “Finite-sample analysis of off-policy natural actor-critic with linear function approximation,” arXiv preprint arXiv:2105.12540, 2021.
- [34] T. Xu, Z. Wang, and Y. Liang, “Improving sample complexity bounds for actor-critic algorithms,” arXiv preprint arXiv:2004.12956, 2020.
- [35] J. Zhang, C. Ni, C. Szepesvari, M. Wang et al., “On the convergence and sample efficiency of variance-reduced policy gradient method,” Advances in Neural Information Processing Systems, vol. 34, 2021.
- [36] H. Kumar, A. Koppel, and A. Ribeiro, “On the sample complexity of actor-critic method for reinforcement learning with function approximation,” arXiv preprint arXiv:1910.08412, 2019.
- [37] Y. F. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite-time analysis of two time-scale actor-critic methods,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 617–17 628, 2020.
- [38] S. Qiu, Z. Yang, J. Ye, and Z. Wang, “On finite-time convergence of actor-critic algorithm,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 2, pp. 652–664, 2021.
- [39] S. Cayci, N. He, and R. Srikant, “Linear convergence of entropy-regularized natural policy gradient with linear function approximation,” arXiv preprint arXiv:2106.04096, 2021.
- [40] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” arXiv preprint arXiv:1806.07572, 2018.
- [41] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” in International Conference on Learning Representations, 2018.
- [42] S. Arora, S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 322–332.
- [43] Z. Ji, M. Telgarsky, and R. Xian, “Neural tangent kernels, transportation mappings, and universal approximation,” in International Conference on Learning Representations, 2019.
- [44] S. Oymak and M. Soltanolkotabi, “Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 84–105, 2020.
- [45] Z. Ji and M. Telgarsky, “Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks,” arXiv preprint arXiv:1909.12292, 2019.
- [46] Z. Fu, Z. Yang, and Z. Wang, “Single-timescale actor-critic provably finds globally optimal policy,” in International Conference on Learning Representations, 2020.
- [47] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
- [48] Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” arXiv preprint arXiv:1910.01619, 2019.
- [49] S. Cayci, S. Satpathi, N. He, and R. Srikant, “Sample complexity and overparameterization bounds for temporal difference learning with neural network approximation,” arXiv preprint arXiv:2103.01391, 2021.
- [50] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
- [51] R. Srikant and L. Ying, "Finite-time error bounds for linear stochastic approximation and TD learning," in Conference on Learning Theory. PMLR, 2019, pp. 2803–2830.
- [52] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification, and adaptive control. SIAM, 2015.
- [53] L. Chizat, E. Oyallon, and F. Bach, “On lazy training in differentiable programming,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [54] E. Kreyszig, Introductory functional analysis with applications. John Wiley & Sons, 1991, vol. 17.
- [55] B. Liu, Q. Cai, Z. Yang, and Z. Wang, “Neural proximal/trust region policy optimization attains globally optimal policy,” arXiv preprint arXiv:1906.10306, 2019.
- [56] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.
- [57] M. Telgarsky, “Deep learning theory lecture notes,” https://mjt.cs.illinois.edu/dlt/, 2021, version: 2021-10-27 v0.0-e7150f2d (alpha).
- [58] S. Satpathi, H. Gupta, S. Liang, and R. Srikant, “The role of regularization in overparameterized neural networks,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 4683–4688.
- [59] T. M. Cover and J. A. Thomas, “Elements of information theory (wiley series in telecommunications and signal processing),” 2006.
- [60] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in In Proc. 19th International Conference on Machine Learning. Citeseer, 2002.
- [61] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- [62] R. Pascanu and Y. Bengio, “Revisiting natural gradient for deep networks,” arXiv preprint arXiv:1301.3584, 2013.
- [63] G. Zhang, J. Martens, and R. Grosse, “Fast convergence of natural gradient descent for overparameterized neural networks,” arXiv preprint arXiv:1905.10961, 2019.
- [64] V. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Measures of Complexity, vol. 16, no. 2, p. 11, 1971.
Appendix A Proof of Lemma 6
Consider with a corresponding transportation map . Using Cauchy-Schwarz inequality,
Define Define a class containing all possible values taken by over for a fixed Further, using Cauchy-Schwarz inequality,
Using the symmetrization argument with Rademacher random variables ’s,
Note that given is a finite set. We apply Massart’s Finite Class lemma to have,
We calculate this quantity using VC theory. Each element partitions the space into two half-planes, where one half takes one value and the other half takes the other. Hence, the number of all possible values taken over the space is equal to the number of components in the partition made by the half-planes. The number of such components is bounded by using the growth function defined in [64]. Hence, the following holds:
The result follows from Jensen’s inequality.