On the Analysis of Model-free Methods for the Linear Quadratic Regulator
Z. Jin, J. M. Schmitt, and Z. Wen
Abstract
Many reinforcement learning methods achieve great success in practice but lack theoretical foundations. In this paper, we study the convergence of model-free methods for the Linear Quadratic Regulator (LQR) problem. Global linear convergence properties and sample complexities are established for several popular algorithms, namely the policy gradient algorithm, TD-learning and the actor-critic (AC) algorithm. Our results show that the actor-critic algorithm can reduce the sample complexity compared with the policy gradient algorithm. Although our analysis is still preliminary, it explains the benefit of the AC algorithm in a certain sense.
keywords:
Linear Quadratic Regulator, TD-learning, policy gradient algorithm, actor-critic algorithm, convergence
MSC codes: 49N10, 68T05, 90C40, 93E35
1 Introduction
Reinforcement learning (RL) involves training an agent that takes a sequence of actions in order to minimize its cumulative cost (or maximize its cumulative reward); see [19] for a general introduction to RL. Model-free methods, which do not estimate or directly use the transition kernel, enjoy wide popularity in RL. They have achieved great success in many fields, such as robotics [10], biology [15] and competitive gaming [17]. In order to improve the performance of these algorithms, a theoretical understanding of questions regarding global convergence and sample complexity becomes more and more crucial. However, since RL problems are non-convex, even proving convergence of model-free methods is hard.
The drawback of non-convexity also appears in the Linear Quadratic Regulator (LQR) problem, which is an elementary and well-understood problem in system control. Therefore, we consider LQRs as a first step. In practice, one often estimates the transition matrices directly (called system identification) and then designs a linear policy for the LQR. Since model-free methods are becoming more and more popular, there is a large body of literature analyzing model-free methods for LQRs, see for example [23, 9, 7, 16]. The authors of [23] analyze the sample complexity of the least-squares temporal difference (LSTD) method for one fixed policy in the LQR setting. There are also contributions [9, 16] in which the properties of the cumulative cost with respect to the policy are analyzed and global convergence of policy iterates generated by a zero-order optimization method is shown. In the analysis of LQRs, we can without loss of generality (w.l.o.g.) restrict ourselves to linear policies, since it can be shown that optimal policies are linear. Although the cumulative cost in the LQR problem is non-convex with respect to the policy, any locally optimal policy is globally optimal. In this paper, we analyze some basic model-free methods in the LQR setting and derive their sample complexity, i.e., the number of samples required to guarantee convergence up to a specified tolerance.
1.1 Related work
TD-learning [18] and Q-learning [24] are basic and popular value-based methods. There is a line of work examining the convergence of TD-learning and Q-learning with linear value function approximations for Markov decision processes (MDPs), see, e.g., [22, 4, 2, 27]. Convergence with probability one is proved in [22, 4]. Moreover, a non-asymptotic analysis of TD-learning and Q-learning is provided in [2] and [27], respectively. The authors of [3] extend the asymptotic analysis of TD-learning to the case of nonlinear value function approximations. In addition, there is a large number of contributions analyzing least-squares TD [5, 13] and gradient TD [20, 14], which also require linear value function approximations.
Among policy-based methods, the policy gradient method [25, 21] and the actor-critic method [21, 11] achieve empirical success. Sutton et al. show in [21] an asymptotic convergence result for the actor-critic method applied to MDPs under a compatibility condition on the value function approximations. In [6, 26], a non-asymptotic analysis of actor-critic methods is provided under some parametrization assumptions on the MDP requiring that the state space is finite. In [9, 16], the policy gradient method with an evolution strategy is used in the LQR setting and a global convergence result is derived. The sample complexity of the LSTD method for LQRs with a fixed policy is derived in [23]. There is also a line of work considering the design and analysis of model-based methods for LQRs, see for example [1, 8, 7].
1.2 Contribution
In this paper, our goal is to analyze general RL methods in the LQR setting. Unlike finite MDPs, the state space and the action space of the LQR are both continuous and therefore infinite, which makes LQR problems difficult to handle. On the other hand, the LQR represents a simple but also typical continuous problem in RL. First, we use the TD-learning approach with linear value function approximations for the policy evaluation of LQRs, which is a quite popular method for general RL problems. Compared to [23], we focus on the sample complexity of TD-learning instead of the LSTD approach, in which linear systems of equations need to be solved. In addition, we also analyze the global convergence of the policy iterates generated by TD-learning, which is inspired by the work [9].
Instead of using evolution strategies as in [9, 16], we prove global linear convergence of the policy gradient method and the actor-critic algorithm in the LQR setting; these methods are much more widely used in practice. Another difference is that we consider the more involved setting with process noise instead of the setting with only a random initial state. We also show that the policy gradient is equivalent to the gradient of the cumulative cost with respect to the policy parameters. Moreover, our work combines the analysis of the value function with the analysis of the policy.
Estimates of the complexities of the policy iterates generated by TD-learning, the policy gradient method and the actor-critic method are given in Table 1. These complexities are based on the results of this paper, in particular on Theorem 4.1, Theorem 5.4 and Theorem 6.4, and are stated in terms of the discount factor and the error tolerance with respect to the globally optimal value.
Table 1: Complexity estimates (number of TD steps and of iterations) for policy iteration with TD-learning, the policy gradient method and the actor-critic algorithm.
1.3 Organization
This paper is organized as follows. Section 2 starts with a description of the general LQR problem. In order to solve LQRs in the RL framework, we introduce the policy function and convert the original problem into a policy optimization problem. Section 3 discusses the TD-learning approach with linear value function approximations and the convergence of this method. In Section 4, the process of policy iteration is combined with TD-learning and the convergence of this method is analyzed. Sections 5 and 6 present convergence results for the policy gradient method and the actor-critic algorithm. Finally, Section 7 gives a brief summary of the main results of this paper.
2 Preliminaries
Linear time invariant (LTI) systems have the following form:
where , and the sequence describes unbiased, independent and identically distributed noise. We call state and control. The control is measured by the cost function
with positive definite matrices and . Thus, the cumulative cost with discount factor is given by . The LQR problem consists in finding the control which minimizes the expectation of the cumulative cost leading to the following optimization problem which we will refer to as the LQR problem
(2.1)
s.t.
Optimal control theory shows that the optimal control input can be written as a linear function with respect to the state, i.e., , where . The optimal control gain is given by
(2.2) |
where is the solution of the algebraic Riccati equation (ARE)
Hence, the optimal policy depends only on the system matrices and the cost matrices; the formula above for the optimal policy is shown in [9]. We will explain the optimality of this policy in Section 4.
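As an illustration of how the optimal gain can be computed once the system is known, the following sketch iterates the discounted algebraic Riccati equation to a fixed point. The symbols A, B, Q, R and gamma stand for the system matrices, the cost matrices and the discount factor; this naming and the convention that the control is the gain times the state are assumptions of the sketch rather than notation fixed in the text.

```python
import numpy as np

def discounted_lqr_gain(A, B, Q, R, gamma, iters=500, tol=1e-10):
    """Solve the discounted algebraic Riccati equation by fixed-point iteration
    and return the optimal linear gain K_star (so that u = K_star @ x)."""
    P = Q.copy()
    for _ in range(iters):
        G = R + gamma * B.T @ P @ B                      # R + gamma * B' P B
        P_next = (Q + gamma * A.T @ P @ A
                  - gamma**2 * A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A))
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    K_star = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    return K_star, P

if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q, R, gamma = np.eye(2), np.eye(1), 0.99
    K_star, P = discounted_lqr_gain(A, B, Q, R, gamma)
    print("optimal gain:\n", K_star)
    print("closed-loop spectral radius:",
          max(abs(np.linalg.eigvals(A + B @ K_star))))
```

This model-based computation serves only as a reference point; the model-free methods analyzed below do not assume knowledge of the system matrices.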
2.1 Stochastic Policies
In practice, the system is not known exactly. Therefore, it is popular to use model-free methods to solve the LQR problem (2.1), and we want to investigate the properties of such model-free methods for LQRs. We observe that the controls form sequences belonging to an infinite-dimensional vector space, so the analysis of the LQR problem (2.1) has to be carried out with care. Since problem (2.1) can be viewed as a Markov decision process, one often uses policy functions, i.e., maps from the state space to the control space, to represent the control. In this paper, the policies are constrained to a specific set of policy functions to simplify the problem.
In general, we can employ Gaussian policies
where is the parameter of the policy and is a fixed constant. There are several possibilities to choose function spaces for , such as linear function spaces and neural networks. As explained above, optimal controls of (2.1) depend linearly on the state. Therefore, since the usage of nonlinear policy functions considerably complicates the analysis, we focus on linear policy functions, which can be represented by matrices as follows
This yields an equivalent form of the control
In addition, the probability density function of has the explicit form
The advantage of stochastic policies is that the policy gradient method can be applied to the problem, which we will discuss in a later section. In addition, stochastic policies are beneficial for exploration, which is quite important in reinforcement learning.
Under the policy , the dynamic system can be written as
Let , , and . Using these abbreviations, the LQR problem (2.1) can be reformulated as
(2.3)
s.t.
where we have used the linearity of the trace operator, more precisely
We will use this trick in several places in the rest of this paper.
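To make the reformulated problem concrete, the following minimal sketch simulates the closed-loop system under a stochastic linear policy and returns a truncated discounted cost. It assumes dynamics of the form x' = A x + B u + noise together with the quadratic stage cost; the parameter names and the noise scale are illustrative assumptions.

```python
import numpy as np

def rollout_cost(A, B, Q, R, K, gamma, sigma=0.1, horizon=500, seed=0):
    """Simulate one trajectory under the Gaussian policy u = K x + sigma * zeta
    and return the truncated discounted cumulative cost."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    x = rng.normal(size=n)                            # random initial state (assumption)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        u = K @ x + sigma * rng.normal(size=k)        # stochastic linear policy
        total += discount * (x @ Q @ x + u @ R @ u)   # stage cost x'Qx + u'Ru
        x = A @ x + B @ u + 0.1 * rng.normal(size=n)  # noisy linear dynamics (illustrative noise scale)
        discount *= gamma
    return total
```

Averaging such rollouts over many seeds gives a Monte Carlo estimate of the expected cumulative cost of the policy.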
Let the domain of feasible policies be given by Dom, where denotes the spectral radius of the matrix . In the following, we assume that Dom is non-empty. Policies lying in the feasible domain are called stable. The set Dom is non-convex, see [9], and for every policy in Dom the corresponding cumulative cost is finite. By Proposition 3.2 in [23], for any stable and any , there exists a such that for any , the following inequality holds
(2.4) |
For any compact subset Dom, we can find uniform constants and such that (2.4) holds for all . Unfortunately, since , may also be finite for unstable . More precisely, one can show that is finite if and only if
(2.5) |
We observe that the constraints in (2.3) yield for
(2.6) |
Provided that satisfies (2.5), we can insert (2.6) into . Exploiting that for , we obtain the following analytic form of the objective function
(2.7) |
Let . Then can be further simplified to
(2.8) |
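For a deterministic linear policy, the analytic cost can also be evaluated numerically. The sketch below assumes that the simplified objective takes the standard form in which a value matrix solves a discounted Lyapunov-type equation, and that the initial state and the process noise have covariances Sigma0 and W; these names and the precise form are assumptions of the sketch.

```python
import numpy as np

def policy_value_matrix(A, B, Q, R, K, gamma, iters=5000, tol=1e-12):
    """Fixed-point iteration for the discounted Lyapunov-type equation
    P_K = Q + K' R K + gamma * (A + B K)' P_K (A + B K).
    Converges only if the policy is stable in the discounted sense."""
    M = A + B @ K
    P = np.zeros_like(Q)
    for _ in range(iters):
        P_next = Q + K.T @ R @ K + gamma * M.T @ P @ M
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P

def policy_cost(A, B, Q, R, K, gamma, Sigma0, W):
    """Expected discounted cost of the policy u = K x:
    trace(P_K Sigma0) plus the accumulated contribution of the process noise."""
    P = policy_value_matrix(A, B, Q, R, K, gamma)
    return np.trace(P @ Sigma0) + gamma / (1.0 - gamma) * np.trace(P @ W)
```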
In the next result, we observe that for which are sufficiently close to , we can restrict our search for the optimal policy to the set of stable policies.
Lemma 2.1.
Suppose that is a stable policy. Then for any there exists a discount factor and a positive constant such that
(2.9) |
is valid for any policy with , where denotes an optimal policy. In addition, the set
(2.10) |
is a compact subset of . Finally, there exist constants and such that
(2.11) |
holds for all .
Proof 2.2.
Let be an arbitrary policy with , where will be determined later. We can w.l.o.g. assume that satisfies (2.5). Otherwise, would be infinite and (2.9) holds trivially. First, we observe that (2.9) follows by
(2.12) |
Next, we introduce the abbreviations and . Since , and are positive definite, using (2.4) and the fact that holds for any matrix , we can deduce
where denotes the constant of the policy in (2.4) for . By , we further get
Let and .
Using these two inequalities, we observe that (2.12) and therefore also (2.9) follow by
(2.13) |
We can w.l.o.g. assume that , otherwise and can be rescaled such that this holds. Hence, isolating in (2.13), we conclude that
(2.14) |
Next, we observe that
(2.15)
is valid for constants satisfying
(2.16) |
We conclude that for satisfying (2.16) and for with
(2.17) |
the inequalities (2.13) and therefore (2.9) hold. Next, we observe that
which is obviously a compact subset of Dom. The last statement of the lemma follows directly from Proposition 3.2 in [23].
For instance, if a stable policy is given, then by Lemma 2.1 we can choose a close to such that the level set of is compact and contains only stable policies. Hence, this lemma plays an important role in the following discussion.
For a policy satisfying (2.5), the objective function of (2.7) is differentiable in a sufficiently small neighborhood of . Hence, we can compute the gradient of (2.7) which is given by
(2.18) |
where and the sequence is generated by policy . This gradient form is obtained by Lemma 1 in [9] and Lemma 4 in [16].
From the representation of in (2.8), we can establish the following relations between , and , see also Lemma 13 in [9].
Lemma 2.3.
Let be a policy such that is finite. Then the representation of in (2.8) yields the following two inequalities:
(2.19) |
and
(2.20) |
Proof 2.4.
Since and , we obtain
The second inequality is derived by
Next, we will gather some useful properties of (2.3), which will serve as tools for the convergence analysis of the methods introduced in the following sections. Moreover, we will also have a look at the difficulties showing up in the theoretical analysis of (2.3). In general, the cost function in problem (2.3) as well as the set of policies satisfying (2.5) are non-convex. In order to cope with this problem of non-convexity, we use the so-called PL condition named after Polyak and Lojasiewicz, which is a relaxation of the notion of strong convexity. The PL condition is satisfied if there exists a universal constant such that for any policy with finite , we have
(2.21) |
where denotes a global optimum for (2.3). From Lemma 3 in [9], Lemma 4 in [16], and , we get that (2.21) is satisfied for (2.3) with
(2.22) |
We note that holds due to the optimality of .
By the PL condition (2.21), we know that stationary points of (2.3), i.e. , are global minima. Since , a policy is a stationary point if and only if
(2.23) |
We can verify that in (2.2) is the global minimum of the function . The optimization problem (2.3) may have more than one global minimum and, unfortunately, it is hard to derive the analytic form of all optimal policies in terms of , , , and .
To simplify the analysis, we assume that is Gaussian, that is . This assumption also guarantees that .
3 Policy evaluation
Given a fixed stable policy , it is desirable to know the expectation of the corresponding cumulative cost for an initial state , which in some sense evaluates “how good” it is to be in the state under the policy . This expectation is given by the so-called value function, which is defined below. Moreover, the expectation of the cumulative cost for taking an action in some initial state under the policy is given by the state-action value function, which is also defined below, see also [19].
The task of policy evaluation is to obtain good approximations of the value function of a fixed stable policy . In value-based methods, policy evaluation is an important and elementary step. In addition, policy evaluation plays the role of the critic in the actor-critic algorithm. The TD-learning method [18] is a prevalent method for policy evaluation. In this section, we discuss the usage of the TD-learning method in the LQR setting.
First, we compute the value function and state-action value function for a stable policy . The value function, which gives the state value for the policy has the following explicit form:
(3.24) |
and the state-action value of the policy is given by
(3.25)
where is the trajectory distribution of the system with policy .
As we will see below, the value function of can be written as a function which is linear with respect to the feature function
for , where denotes the vectorization operator stacking the columns of some matrix A on top of one another, i.e.,
Therefore, it is natural to propose a class of linear approximation functions with feature by
where is the parameter we seek to estimate.
To this end, we observe that , where and . For each stable policy , our aim is to find a parameter such that approximates well in expectation. This is carried out by minimizing a suitable loss function, for which we have to find a “good” distribution. A reasonable choice for this distribution is the stationary distribution where
for any stable . It is easy to verify that if , then as well. This explains why we call it the stationary distribution. We also use the distribution to represent in short.
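Since the closed-loop dynamics are linear with Gaussian noise, the stationary distribution is a zero-mean Gaussian whose covariance solves a Lyapunov equation. The sketch below computes this covariance by fixed-point iteration; it assumes the noise model x' = (A + B K) x + sigma * B * zeta + eps with eps ~ N(0, W), which is one natural reading of the setting above rather than the paper's exact notation.

```python
import numpy as np

def stationary_covariance(A, B, K, W, sigma=0.1, iters=5000, tol=1e-12):
    """Covariance of the stationary distribution of the closed-loop chain
    x' = (A + B K) x + sigma * B * zeta + eps,  zeta ~ N(0, I),  eps ~ N(0, W):
    the fixed point of Sigma = M Sigma M' + sigma^2 B B' + W with M = A + B K."""
    M = A + B @ K
    noise_cov = sigma**2 * B @ B.T + W
    Sigma = np.zeros_like(W)
    for _ in range(iters):
        Sigma_next = M @ Sigma @ M.T + noise_cov
        if np.max(np.abs(Sigma_next - Sigma)) < tol:
            return Sigma_next
        Sigma = Sigma_next
    return Sigma
```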
Remark 3.1.
For the sake of simplicity, we introduce the abbreviation
since . When there is only the variable in the expectation, we write instead of to represent this expectation. Otherwise, we use the notation for convenience.
Then we define the loss function
such that the gradient of the loss function is given by
However, in practice, since the real value function is unknown, one often uses the biased estimate of to replace , where is the subsequent state of . For convenience, let , which is called the TD error.
We further note that . Then we obtain the semi-gradient
(3.26) |
and it is quite straightforward to get the stochastic semi-gradient
(3.27) |
Starting with some initial parameter and using the gradient descent method with or as the update, the TD-learning method with linear function approximation is described in Algorithm 1.
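A minimal sketch of Algorithm 1 with the stochastic semi-gradient update is given below. It uses the quadratic-plus-constant features described above, so the learned parameter contains an estimate of the value matrix of the fixed policy; the simulated system, the noise scale and the step size are illustrative assumptions.

```python
import numpy as np

def features(x):
    """Quadratic-plus-constant features: (vec(x x'), 1)."""
    return np.concatenate([np.outer(x, x).ravel(), [1.0]])

def td0_policy_evaluation(A, B, Q, R, K, gamma, sigma=0.1,
                          steps=50_000, alpha=1e-3, seed=0):
    """TD(0) with linear function approximation V_theta(x) = theta' phi(x)
    for a fixed stable Gaussian policy u = K x + sigma * zeta."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    theta = np.zeros(n * n + 1)
    x = rng.normal(size=n)
    for _ in range(steps):
        u = K @ x + sigma * rng.normal(size=k)
        cost = x @ Q @ x + u @ R @ u
        x_next = A @ x + B @ u + 0.1 * rng.normal(size=n)
        # TD error: cost + gamma * V(x') - V(x)
        delta = cost + gamma * theta @ features(x_next) - theta @ features(x)
        theta += alpha * delta * features(x)        # stochastic semi-gradient step
        x = x_next
    P_hat = theta[:-1].reshape(n, n)                # estimated value matrix
    return 0.5 * (P_hat + P_hat.T), theta[-1]       # symmetrize; constant offset
```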
The following lemma shows that is a minimum of .
Lemma 3.2.
For any fixed stable policy , we have
(3.28) |
Proof 3.3.
For the first element in the semi-gradient, we get
by the definition of . For the other elements we will consider them in matrix form
where we have used that is independent of . By the definition of , we have . For the constant term, it holds . Thus, we obtain .
Assumption 3.4.
Remark 3.5.
In some related works, projected gradient descent is used to guarantee the boundedness of , although the projection is usually not used in practice.
Lemma 3.6.
There exists a positive number , such that for all and
(3.29) |
holds. In addition, is continuous with respect to .
Proof 3.7.
Set , where is symmetric.
Thus, we can define
Since norms of matrices are equivalent and for , we obtain from (3.30) that is positive and continuous with respect to . Besides, . Using this inequality in connection with (3.30) we obtain
where the third inequality follows by the Cauchy-Schwarz inequality.
Thus, we conclude the existence of such a positive number which is continuous with respect to .
Under Assumption 3.4, the constant has a uniform lower bound on a compact subset of . Then we can use basic tools from the analysis of the gradient descent method to prove the convergence of Algorithm 1.
Theorem 3.8.
Suppose that Assumption 3.4 holds and that Algorithm 1 is run with the semi-gradient update using to generate . Then the following inequality holds for all :
(3.31) |
Proof 3.9.
By the definition of and the fact that , we know
(3.32)
Thus, is an th order polynomial with respect to , and . Then we can obtain the bound .
We next show that Algorithm 1 also converges if the stochastic semi-gradients are used.
Theorem 3.10.
Suppose that Assumption 3.4 holds and let Algorithm 1 be run with the stochastic semi-gradient update using to generate . Then, for all it holds
(3.33) |
Proof 3.11.
The key step to prove (3.31) is based on an important gradient descent inequality. In order to establish this inequality, we have to estimate two terms at first. For the descent term , we get the following lower bound:
(3.34)
where the last inequality follows from the Cauchy-Schwarz inequality. Then we split the norm of into two parts:
(3.35)
Then we get the upper bound of and by the Cauchy-Schwarz inequality:
(3.36)
In particular, the last equality holds by Proposition A.1. in [23]. Analogously, has the same upper bound such that
(3.37) |
where we used .
Furthermore, has a uniform upper bound with respect to on a compact subset of , since roughly. Then we can derive convergence of under the norm by Lemma 3.6.
4 Policy Iteration
Policy iteration (PI) [12] is a fundamental value-based method for Markov decision process (MDP) problems with finitely many actions. It can also be applied to LQRs, since the state-action value function is quadratic, and the action minimizing the value function can easily be found by solving linear equations.
Given a policy and the corresponding state-action function of the form (3.25), the policy can be improved by selecting the action for a fixed state such that the value of is minimal:
Thus, we obtain an improved policy with
(4.40) |
Observing the gradient (2.18), there is another form of (4.40):
(4.41) |
This form corresponds to the Gauss-Newton method in [9] with stepsize . Hence, by the discussion in [9], we obtain the convergence of policy iteration with the exact state-action value function and, in addition, the fact that value-based methods are faster than the policy gradient method. Moreover, from Lemma 8 in [9], we obtain that . This implies that if is stable, then is also stable provided that is sufficiently close to , i.e., if satisfies (2.17) with , see Lemma 2.1.
However, we do not know the state-action value function exactly, since we do not know the system exactly. By the results of the preceding discussion of policy evaluation, we can obtain an approximation of . This approximation is used in a policy iteration scheme, whose convergence we want to analyze. The approximation of the state-action value function has the following form:
For any stable policy , we denote the parameters of by . If , we call an approximation of the state-action value function. The policy can also be improved by using the following approximation for the state action value:
(4.42) |
Thus, we obtain the policy iteration algorithm.
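The improvement step in Algorithm 2 only needs the quadratic part of the (approximate) state-action value function. The sketch below assumes that the estimate has the standard block-quadratic form in the stacked vector (x, u); the helper exact_q_matrix, which gives the exact blocks in the model-based case, is included only as a sanity check, and its formulas are assumptions of the sketch.

```python
import numpy as np

def improved_policy_from_q(H_hat, n, k):
    """Greedy improvement for a quadratic Q-function estimate
    Q_hat(x, u) = [x; u]' H_hat [x; u] + const:
    the minimizer over u for fixed x is u = K_new x with K_new = -H_uu^{-1} H_ux."""
    H_uu = H_hat[n:, n:]
    H_ux = H_hat[n:, :n]
    return -np.linalg.solve(H_uu, H_ux)

def exact_q_matrix(A, B, Q, R, P_K, gamma):
    """Exact quadratic part of the state-action value under the standard
    discounted LQR model, useful for checking an estimated H_hat."""
    H_xx = Q + gamma * A.T @ P_K @ A
    H_xu = gamma * A.T @ P_K @ B
    H_uu = R + gamma * B.T @ P_K @ B
    return np.block([[H_xx, H_xu], [H_xu.T, H_uu]])
```

With the exact blocks, the improvement coincides with the exact policy iteration update (4.40).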
As we will prove below, the policy iteration with approximate state-action value functions also converges in the LQR-setting if the error between the approximation value and the true value is small enough. For any stable initial policy and any , we assume that the error tolerance between the approximation value and the true value satisfies
(4.43) |
where only depends on , , , , and .
Theorem 4.1.
For any stable initial policy and any , suppose that an approximation of the state action value function is known, where satisfies (4.43). If is sufficiently close to , then for , Algorithm 2 yields stable policies which are elements of Dom and satisfy
(4.44) |
Proof 4.2.
We first introduce some notation which is used in this proof. Let be an approximation of the state action value function and denote by the true value, i.e., . Then the improved policy is given by . Next, we obtain from Lemma 6 in [9] that
(4.45)
where . Furthermore, we note that and holds by (4.40). Let and . Then we compute the difference between and the policy generated by the true policy iteration:
(4.46)
with . Next, we note that (4.41) implies
(4.47) |
which together with (4.46) yields
(4.48) |
Since , it holds , and has the upper bound
(4.49) |
where the second inequality follows by (2.19) with
Using (4.45) and (4.48), the difference of the cumulative costs between the original policy and the improved policy can be represented as:
(4.50) |
For the first term in (4.50), we have
(4.51)
where the second inequality is derived by the fact and Lemma 11 in the supplementary material of [9]. More precisely, one can check that this lemma is also valid for the setting in this paper. For the second term in (4.50), using (2.20) and (4.49) we get the upper bound
(4.52)
By (4.51) and (4.52), it is straightforward to obtain the following inequality:
(4.53) |
where and .
We can start with and set . Then let be small enough such that . Thus, the bound of is . Next, we show that
(4.54) |
which implies that the inequality (4.53) holds with and for all iterates.
We use induction to prove this uniform bound (4.54). When , this inequality (4.54) holds obviously. Then we assume that (4.54) holds with . Using in connection with (4.53), we observe that if , then
by the inequality (4.53). Otherwise it holds . Thus, the bound (4.54) holds for all .
Hence, we have the following inequality:
(4.55) |
Furthermore, we also require that , which is equivalent to . Thus, the upper bound of should be , where
Then we can verify that such that (4.44) is proved. Finally, we assume w.l.o.g. that . We can guarantee that . Inserting this in (4.55), we obtain
Hence, Dom Dom holds by Lemma 2.1 for all iterates, if is sufficiently close to .
Using the definition of , we observe that the approximation parameter has an upper bound . Thus, for each TD-learning run with a fixed , samples are needed by Lemma 3.6 and Theorem 3.8. However, if we use stochastic semi-gradient descent, the sample complexity for each policy evaluation of this algorithm becomes by Theorem 3.10.
5 The Policy Gradient Method
In RL, the policy gradient method [25, 21] is widely used. In this section, we apply the policy gradient method to the problem (2.3) and analyze the convergence of this method.
In order to compute the policy gradient, we have to know the score function of the policy , which is given by
(5.56) |
By the policy gradient theorem in [21], we obtain the policy gradient
(5.57)
The policy gradient (5.57) is equivalent to , which is shown in the proof of Lemma 5.1. From the representation of the gradient in (5.57), it is straightforward to design an estimator. After collecting triples generated by the policy in problem , we can compute the sample gradient:
(5.58) |
We apply the stochastic gradient descent method to the problem (2.3), which is summarized in Algorithm 3.
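A sketch of the sample gradient for the Gaussian linear policy is given below: it combines the score function (u - K x) x' / sigma^2 with the discounted cost-to-go of each visited state-action pair and averages over trajectories. The rollout model, the finite horizon and all parameter names are illustrative assumptions.

```python
import numpy as np

def sample_policy_gradient(A, B, Q, R, K, gamma, sigma=0.1,
                           num_traj=20, horizon=200, seed=0):
    """Monte Carlo policy gradient estimate for the policy u = K x + sigma * zeta:
    average over trajectories of sum_t gamma^t * score_t * (cost-to-go from t)."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    grad = np.zeros_like(K)
    for _ in range(num_traj):
        xs, us, cs = [], [], []
        x = rng.normal(size=n)
        for _ in range(horizon):
            u = K @ x + sigma * rng.normal(size=k)
            xs.append(x)
            us.append(u)
            cs.append(x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u + 0.1 * rng.normal(size=n)
        # discounted cost-to-go G_t = sum_{s >= t} gamma^(s - t) * c_s
        Gs, G = [0.0] * horizon, 0.0
        for t in reversed(range(horizon)):
            G = cs[t] + gamma * G
            Gs[t] = G
        for t in range(horizon):
            score = np.outer(us[t] - K @ xs[t], xs[t]) / sigma**2   # grad_K log pi
            grad += gamma**t * Gs[t] * score
    return grad / num_traj
```

A stochastic gradient descent step then updates the policy in the direction of the negative sample gradient, with a step size chosen as in Theorem 5.4.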
Since the estimator is biased, we have to control the bias by increasing the length . The next lemma shows how the parameter influences the bias and the variance of the estimator .
Lemma 5.1.
Let be a stable policy. Then it holds
(5.59)
(5.60)
where and the constants and depend on , , , , , , , , , and .
Proof 5.2.
Define the event fields generated by and the operator . We observe that and are independent of . By the definition of the value function in (3.24), it is straightforward to obtain the conditional expectation of the cumulative cost:
First, we claim that is equivalent to . To this end, we verify the following identity:
(5.61)
The third equality is valid since and
(5.62) |
which holds due to the symmetry of . In the last equality of (5.61), we have used similar arguments in connection with the fact . Taking the sum , applying to both sides of (5.61) and multiplying by yields , see (2.18).
Since is the estimator of , we focus on the bound of the bias and split the bias into two parts
(5.63)
where we have used that . In the following, we will further simplify and estimate the above two terms, respectively. To this end, we show for each that
(5.64) |
For , we observe that
(5.65)
and for it holds
(5.66)
Since is stable, (2.4) yields
(5.67) |
for some constant . Using this, (2.19) and (5.64), we can estimate the second term in (5.63) as follows
(5.68)
By using (5.61), (5.67) and (2.19), the first term in (5.63) can be estimated as:
(5.69)
Combining (5.68) and (5.69) yields (5.59) with
Finally, we derive a bound for the variance of .
(5.70)
Hence, we only need to estimate each term of (5.70):
(5.71) |
Let . Then and is independent from . We can estimate the bound of :
(5.72)
We observe that (5.72) is bounded yielding that (5.71) is bounded by a constant . Inserting this in (5.70), we get the estimation
(5.73) |
such that (5.60) holds with .
Assumption 5.3.
There is a positive number such that for any stable .
At the end of this section, we present a theorem that guarantees the convergence of Algorithm 3. In order to prove this convergence, we need to assume that the error tolerance and the confidence are small enough such that the following inequality holds:
(5.74) |
Theorem 5.4.
Let Assumption 5.3 hold, let be stable and suppose that is sufficiently close to and is small enough such that (2.17) is satisfied with and . For any error tolerance and confidence satisfying (5.74), suppose that the sample size is large enough such that
with and the step size is chosen to satisfy . After iterations with , Algorithm 3 yields an iterate such that
holds with probability greater than .
Proof 5.5.
For any , define the error and the stopping time
We first note that Lemma 2.1 yields Dom for all since and is chosen such that (2.16) and (2.17) hold with and . Hence, and are both uniformly bounded on Dom.
Next, for simplicity, we define as the expectation operator conditioned on the sigma field which contains all the randomness of the first iterates. Since the gradient is locally Lipschitz, which is shown in [16], there are a uniform Lipschitz constant and a uniform radius such that
(5.75) |
for any and with . We choose sufficiently small such that .
Let L be sufficiently large such that . Using this and in particular Lemma 5.1, we obtain
(5.76)
where . We note that in the last estimation of (5.76) the inequality
is used. We assume that . By the PL condition (2.21), we have
(5.77)
Applying this inequality successively, we obtain a result similar to the one in [16]:
where we have used that . By (5.60), we observe that taking implies . We note that this condition on as well as and are satisfied for . Setting , we observe that for
with a sufficiently large constant the inequality holds.
By using the same techniques as in the proof of Proposition 1 in [16], we observe that . By Chebyshev’s inequality, we have
This completes the proof.
6 The Actor-Critic Algorithm
In the policy gradient algorithm, we update the policy purely through sampled returns. Hence, the policy gradient estimator has high variance, which slows down convergence. A popular method to reduce the variance is the Actor-Critic algorithm, which replaces the Monte Carlo estimate by a bootstrapped estimate. The policy gradient in the Actor-Critic algorithm has the following form:
(6.78) |
where . We investigate the bias and the variance of the estimators.
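The following sketch of the actor-critic gradient estimate keeps the score-function form of the Monte Carlo estimator but replaces the cost-to-go by the bootstrapped TD error computed from a critic of the form V_hat(x) = x' P_hat x + b_hat, for instance the output of the TD-learning sketch in Section 3. The rollout model and the exact weighting are assumptions of the sketch.

```python
import numpy as np

def actor_critic_gradient(A, B, Q, R, K, gamma, P_hat, b_hat,
                          sigma=0.1, num_traj=20, horizon=200, seed=0):
    """Actor-critic gradient estimate: the Monte Carlo cost-to-go is replaced by
    the TD error delta_t = c_t + gamma * V_hat(x_{t+1}) - V_hat(x_t),
    where the critic V_hat(x) = x' P_hat x + b_hat comes from policy evaluation."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    V_hat = lambda x: x @ P_hat @ x + b_hat
    grad = np.zeros_like(K)
    for _ in range(num_traj):
        x = rng.normal(size=n)
        for t in range(horizon):
            u = K @ x + sigma * rng.normal(size=k)
            cost = x @ Q @ x + u @ R @ u
            x_next = A @ x + B @ u + 0.1 * rng.normal(size=n)
            delta = cost + gamma * V_hat(x_next) - V_hat(x)        # TD error
            grad += gamma**t * delta * np.outer(u - K @ x, x) / sigma**2
            x = x_next
    return grad / num_traj
```

Because the TD error typically has a much smaller magnitude than the full cost-to-go, the resulting estimator tends to have lower variance, which is the effect quantified in Lemma 6.1.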
Lemma 6.1.
Proof 6.2.
Analogously to the proof of Lemma 5.1, we can split the bias of into two parts:
(6.81)
We have discussed the first term in the proof of Lemma 5.1 and its upper bound is . Let and . We observe that the second term of (6.81) has the following equivalent form:
(6.82)
Since and is an -approximation of , the right-hand side of (6.82) is bounded by , which yields (6.79).
Since by (2.19) and , the following inequality holds:
(6.83)
Hence, we obtain the inequality (6.80):
(6.84)
where the third inequality holds by the same technique as in (3.32) and
In the estimates above, techniques similar to those in (5.70) are used.
Now we can obtain a convergence result for the AC algorithm which is similar to the one in Theorem 5.4. In order to prove this result, we need the following assumption.
Assumption 6.3.
There is a positive number such that for any stable .
Theorem 6.4.
Suppose that Assumption 6.3 is satisfied and let be a stable policy. Moreover, suppose that is sufficiently close to and small enough such that (2.17) holds with . For any error tolerance and confidence satisfying (5.74), suppose that the sample size is large enough and the error of the approximate value function is small enough such that
The step size is chosen such that holds. After iterations with , the iterate satisfies
with probability greater than .
Proof 6.5.
The proof is similar to the proof of Theorem 5.4.
Finally, we analyze the sample complexity of the policy gradient method and the AC algorithm. We assume that the discount factor with is close to such that . By the definition of and in particular by its representation in (2.8), we obtain . For the AC algorithm, we have to require that . Then the TD-learning algorithm needs steps by Theorem 3.8. However, the sample size in the AC algorithm is only . We can sample trajectories in parallel; then the variance of the gradient becomes . Using arguments similar to those in the proof of Theorem 5.4, one can show that the number of iterations of the AC algorithm equals . From the statements above, we conclude the complexities given in Table 1.
7 Conclusion
Reinforcement learning has achieved success in many fields but lacks theoretical understanding in the continuous setting. In this paper, we apply well-known RL algorithms to a basic model, the LQR. First, we show the convergence of policy iteration with TD-learning, which is hard to prove in the general case. Then we obtain the linear convergence of the policy gradient method and the AC algorithm. Finally, we compare the sample complexities of these algorithms.
The results of this paper are proved for the LQR setting, which allows us to restrict ourselves to linear policies. For extensions to more general problems, this restriction to a linear framework is no longer possible. Consequently, the policy function may depend nonlinearly on its parameters, which makes, in particular, the convergence analysis of optimization methods such as the policy gradient method much more involved. A further difficulty is that the PL condition, which guarantees that stationary policies are globally optimal, may no longer hold. However, since the LQR can be used as an approximation of more general nonlinear problems, the techniques developed in this paper can serve as an important tool for the treatment of more general problems.
Acknowledgments.
This research is supported in part by NSFC grant 11831002 and the Beijing Academy of Artificial Intelligence.
References
- [1] Y. Abbasi-Yadkori and C. Szepesvári, Regret bounds for the adaptive control of linear quadratic systems, in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
- [2] J. Bhandari, D. Russo, and R. Singal, A finite time analysis of temporal difference learning with linear function approximation, in Conference On Learning Theory, 2018, pp. 1691–1692.
- [3] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári, Convergent temporal-difference learning with arbitrary smooth function approximation, in Advances in Neural Information Processing Systems, 2009, pp. 1204–1212.
- [4] V. S. Borkar and S. P. Meyn, The ode method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), pp. 447–469.
- [5] S. J. Bradtke and A. G. Barto, Linear least-squares algorithms for temporal difference learning, Machine learning, 22 (1996), pp. 33–57.
- [6] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, Sbeed: Convergent reinforcement learning with nonlinear function approximation, in International Conference on Machine Learning, 2018, pp. 1125–1134.
- [7] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, Regret bounds for robust adaptive control of the linear quadratic regulator, in Advances in Neural Information Processing Systems, 2018, pp. 4188–4197.
- [8] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, On the sample complexity of the linear quadratic regulator, Foundations of Computational Mathematics, (2019), pp. 1–47.
- [9] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator, in International Conference on Machine Learning, 2018, pp. 1467–1476.
- [10] J. Kober, J. A. Bagnell, and J. Peters, Reinforcement learning in robotics: A survey, The International Journal of Robotics Research, 32 (2013), pp. 1238–1274.
- [11] V. R. Konda and J. N. Tsitsiklis, Actor-critic algorithms, in Advances in neural information processing systems, 2000, pp. 1008–1014.
- [12] M. G. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of machine learning research, 4 (2003), pp. 1107–1149.
- [13] A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of LSTD, in International Conference on Machine Learning, 2010.
- [14] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik, Finite-sample analysis of proximal gradient TD algorithms, in UAI, 2015, pp. 504–513.
- [15] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, Applications of deep learning and reinforcement learning to biological data, IEEE transactions on neural networks and learning systems, 29 (2018), pp. 2063–2079.
- [16] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright, Derivative-free methods for policy optimization: Guarantees for linear quadratic systems, in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 2916–2925.
- [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature, 518 (2015), pp. 529–533.
- [18] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine learning, 3 (1988), pp. 9–44.
- [19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
- [20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, Fast gradient-descent methods for temporal-difference learning with linear function approximation, in Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 993–1000.
- [21] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in Advances in Neural Information Processing Systems, 12 (1999), pp. 1057–1063.
- [22] J. N. Tsitsiklis and B. Van Roy, Analysis of temporal-difference learning with function approximation, in Advances in Neural Information Processing Systems, 1997, pp. 1075–1081.
- [23] S. Tu and B. Recht, Least-squares temporal difference learning for the linear quadratic regulator, in International Conference on Machine Learning, 2018, pp. 5005–5014.
- [24] C. J. Watkins and P. Dayan, Q-learning, Machine learning, 8 (1992), pp. 279–292.
- [25] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning, 8 (1992), pp. 229–256.
- [26] K. Zhang, A. Koppel, H. Zhu, and T. Başar, Global convergence of policy gradient methods to (almost) locally optimal policies, arXiv preprint arXiv:1906.08383, (2019).
- [27] S. Zou, T. Xu, and Y. Liang, Finite-sample analysis for sarsa with linear function approximation, in Advances in Neural Information Processing Systems, 2019, pp. 8668–8678.