
On the Analysis of Model-free Methods for the Linear Quadratic Regulator

Zeyu Jin, School of Mathematical Sciences, Peking University, China (1801210096@pku.edu.cn)    Johann Michael Schmitt, Beijing International Center for Mathematical Research, Peking University, China (MichelSchmitt2@web.de)    Zaiwen Wen, Beijing International Center for Mathematical Research, Peking University, China (wenzw@pku.edu.cn). Research supported in part by NSFC grant 11831002 and the Beijing Academy of Artificial Intelligence.

Abstract

Many reinforcement learning methods achieve great success in practice but lack a theoretical foundation. In this paper, we study the convergence of model-free methods for the Linear Quadratic Regulator (LQR) problem. Global linear convergence properties and sample complexities are established for several popular algorithms, namely the policy gradient algorithm, TD-learning and the actor-critic (AC) algorithm. Our results show that the actor-critic algorithm can reduce the sample complexity compared with the policy gradient algorithm. Although our analysis is still preliminary, it explains the benefit of the AC algorithm in a certain sense.

keywords:
Linear Quadratic Regulator, TD-learning, policy gradient algorithm, actor-critic algorithm, convergence
{AMS}

49N10, 68T05, 90C40, 93E35

1 Introduction

Reinforcement learning (RL) involves training an agent to take a sequence of actions that minimizes its cumulative cost (or maximizes its cumulative reward); see [19] for a general introduction to RL. Model-free methods, which do not estimate or use the transition kernel directly, enjoy wide popularity in RL. They have achieved great success in many fields, such as robotics [10], biology [15] and competitive gaming [17]. In order to improve the performance of these algorithms, a theoretical understanding of questions regarding global convergence and sample complexity becomes more and more crucial. However, since RL problems are non-convex, even proving convergence of model-free methods is hard.

Non-convexity also appears in the Linear Quadratic Regulator (LQR) problem, which is an elementary and well-understood problem in system control. Therefore, we consider LQRs as a first step. In practice, one often estimates the transition matrices directly (called system identification) and then designs a linear policy for the LQR. Since model-free methods have become more and more popular, there is a large body of literature analyzing model-free methods for LQRs, see for example [23, 9, 7, 16]. The authors of [23] analyze the sample complexity of the least-squares temporal difference (LSTD) method for one fixed policy in the LQR setting. There are also contributions [9, 16] in which the properties of the cumulative cost with respect to the policy are analyzed and global convergence of the policy iterates generated by a zero-order optimization method is shown. In the analysis of LQRs, we can without loss of generality (w.l.o.g.) restrict ourselves to linear policies, since it can be shown that optimal policies are linear. Although the cumulative cost in the LQR problem is non-convex with respect to the policy, any locally optimal policy is globally optimal. In this paper, we analyze some basic model-free methods in the LQR setting and derive their sample complexity, i.e., the number of samples required to guarantee convergence up to some specified tolerance.

1.1 Related work

TD-learning [18] and Q-learning [24] are basic and popular value-based methods. There is a line of work examining the convergence of TD-learning and Q-learning with linear value function approximations for Markov decision processes (MDPs), see, e.g., [22, 4, 2, 27]. Convergence with probability 1 is proved in [22, 4]. Moreover, a non-asymptotic analysis of TD-learning and Q-learning is provided in [2] and [27], respectively. The authors in [3] extend the asymptotic analysis of TD-learning to the case of nonlinear value function approximations. In addition, there is a large number of contributions analyzing least-squares TD [5, 13] and gradient TD [20, 14], which also require linear value function approximations.

Among policy-based methods, the policy gradient method [25, 21] and the actor-critic method [21, 11] achieve empirical success. Sutton et al. show in [21] an asymptotic convergence result for the actor-critic method applied to MDPs under a compatibility condition on the value function approximations. In [6, 26], a non-asymptotic analysis of actor-critic methods is provided under some parametrization assumptions on the MDP requiring that the state space is finite. In [9, 16], the policy gradient method with an evolution strategy is used in the LQR setting and a global convergence result is derived. The sample complexity of the LSTD method for LQRs is shown to be \Omega(n^{2}/\epsilon^{2}) for any given policy in [23]. There is also a line of work considering the design and analysis of model-based methods for LQRs, see for example [1, 8, 7].

1.2 Contribution

In this paper, our motivation is to analyze general RL methods in the LQR setting. Unlike in finite MDPs, the state space and the action space of the LQR are both continuous and thus infinite, which makes LQR problems difficult to handle. On the other hand, the LQR represents a simple but also typical continuous problem in RL. First, we use the TD-learning approach with linear value function approximations for the policy evaluation of LQRs, which is a quite popular method for general RL problems. Compared to [23], we focus on the sample complexity of TD-learning instead of the LSTD approach, in which systems of linear equations need to be solved. In addition, we also analyze the global convergence of the policy iterates generated by TD-learning, which is inspired by the work of [9].

Instead of using evolution strategies as in [9, 16], we prove global linear convergence of the policy gradient method and the actor-critic algorithm in the LQR setting, which are much more widely used methods in practice. Another difference is that we focus on the more complex case of noisy dynamics instead of the case with only a random initial state. We also show that the policy gradient is equivalent to the gradient of the cumulative cost with respect to the policy parameters. Moreover, our work combines the analysis of the value function with the analysis of the policy.

Estimates of the complexities of the policy iterations generated by TD-learning, the policy gradient method and the actor-critic method are given in Table 1. These complexities are based on the results of this paper, in particular on Theorem 4.1, Theorem 5.4 and Theorem 6.4. In Table 1, \gamma is the discount factor and \epsilon is the error tolerance with respect to the globally optimal value.

Algorithm | TD steps | Length of trajectory for policy gradient | Iterations
Policy Iteration with TD | O\left(\frac{1}{\epsilon(1-\gamma)^{5}}\right) | / | O(1)
Policy Gradient | / | O\left(\frac{1}{(1-\gamma)^{2}}\right) | O\left(\frac{1}{\delta\epsilon(1-\gamma)^{7}}\right)
Actor-Critic Algorithm | O\left(\frac{1}{\delta\epsilon(1-\gamma)^{4}}\right) | O\left(\frac{1}{(1-\gamma)^{4}}\right) | O\left(\frac{1}{\delta\epsilon(1-\gamma)}\right)
Table 1: A summary of complexity analysis

1.3 Organization

This paper is organized as follows. Section 2 starts with a description of the general LQR problem. In order to solve LQRs in the RL framework, we introduce the policy function and convert the original problem into a policy optimization problem. Section 3 discusses the TD-learning approach with linear value function approximations and the convergence of this method. In Section 4, the process of policy iteration is combined with TD-learning and the convergence of this method is analyzed. Sections 5 and 6 present convergence results for the policy gradient method and the actor-critic algorithm. In Section 7, we finally give a brief summary of the main results of this paper.

2 Preliminary

Linear time invariant (LTI) systems have the following form:

x_{t+1}=Ax_{t}+Bu_{t}+\omega_{t},

where A\in\mathbb{R}^{n\times n}, B\in\mathbb{R}^{n\times d} and the sequence \omega_{t} describes unbiased, independent and identically distributed noise. We call x_{t}\in\mathbb{R}^{n} the state and u_{t}\in\mathbb{R}^{d} the control. The performance of the control is measured by the stage cost

c_{t}=x_{t}^{\top}Sx_{t}+u_{t}^{\top}Ru_{t}

with positive definite matrices S\in\mathbb{R}^{n\times n} and R\in\mathbb{R}^{d\times d}. Thus, the cumulative cost with discount factor \gamma\in(0,1) is given by \sum_{t=0}^{\infty}\gamma^{t}(x_{t}^{\top}Sx_{t}+u_{t}^{\top}Ru_{t}). The LQR problem consists in finding the control which minimizes the expectation of the cumulative cost, leading to the following optimization problem, which we will refer to as the LQR problem

(2.1) \min_{\{u_{t}\}_{t=0}^{\infty}} \quad \sum_{t=0}^{\infty}\mathbb{E}[\gamma^{t}(x_{t}^{\top}Sx_{t}+u_{t}^{\top}Ru_{t})]
\text{s.t.} \quad x_{0}=0,\quad x_{t+1}=Ax_{t}+Bu_{t}+\omega_{t}.

Optimal control theory shows that the optimal control input can be written as a linear function of the state, i.e., u_{t}=K^{*}x_{t}, where K^{*}\in\mathbb{R}^{d\times n}. The optimal control gain is given by

(2.2) K^{*}=-\gamma(R+\gamma B^{\top}P_{\gamma}B)^{-1}B^{\top}P_{\gamma}A,

where P_{\gamma} is the solution of the algebraic Riccati equation (ARE)

P_{\gamma}=\gamma A^{\top}P_{\gamma}A+S-\gamma^{2}A^{\top}P_{\gamma}B(\gamma B^{\top}P_{\gamma}B+R)^{-1}B^{\top}P_{\gamma}A.

Hence, the optimal policy K^{*} only depends on A, B, S, R and \gamma; the formula above is derived in [9]. We will explain the optimality of this policy in Section 4.
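
For illustration, the following sketch computes K^{*} from (2.2) by iterating the Riccati recursion above until P_{\gamma} stabilizes. It is written in Python with NumPy; the fixed-point scheme, tolerance and function name are our own illustrative choices and not part of the analysis (the iteration is assumed to converge, which holds for instance when (\sqrt{\gamma}A,\sqrt{\gamma}B) is stabilizable).

```python
import numpy as np

def discounted_lqr_gain(A, B, S, R, gamma, tol=1e-10, max_iter=10_000):
    """Compute K* from (2.2) by iterating the discounted Riccati recursion (ARE);
    illustrative sketch, assumed to converge for stabilizable (sqrt(gamma)A, sqrt(gamma)B)."""
    P = np.copy(S)
    for _ in range(max_iter):
        M = np.linalg.solve(gamma * B.T @ P @ B + R, B.T @ P @ A)   # (gamma B^T P B + R)^{-1} B^T P A
        P_new = gamma * A.T @ P @ A + S - gamma**2 * A.T @ P @ B @ M
        if np.linalg.norm(P_new - P) < tol:
            P = P_new
            break
        P = P_new
    # Optimal gain (2.2): K* = -gamma (R + gamma B^T P B)^{-1} B^T P A
    K_star = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    return K_star, P
```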

2.1 Stochastic Policies

In practice, the system is not known exactly. Therefore, it is popular to use model-free methods to solve the LQR problem (2.1), and we want to investigate the properties of such model-free methods for LQRs. We observe that the controls u_{t} form \mathbb{R}^{d}-valued sequences \{u_{t}\}_{t=0}^{\infty} belonging to an infinite-dimensional vector space. Thus, the analysis of the LQR problem (2.1) has to be carried out with care. Since the problem (2.1) can be viewed as a Markov decision process, one often uses policy functions, which are maps from the state space to the control space, to represent the control. In this paper, the policies are constrained to a specific set of policy functions to simplify the problem.

In general, we can employ Gaussian policies

u_{t}\sim\pi_{\theta}(\cdot|x_{t})=N(f_{\theta}(x_{t}),\sigma^{2}I_{d}),

where \theta is the parameter of the policy and \sigma>0 is a fixed constant. There are several possibilities to choose function spaces for f_{\theta}, such as linear function spaces and neural networks. As explained above, optimal controls of (2.1) depend linearly on the state. Therefore, since the usage of nonlinear policy functions considerably complicates the analysis, we focus on linear policy functions, which can be represented by matrices K\in\mathbb{R}^{d\times n} as follows

u_{t}\sim\pi_{K}(\cdot|x_{t})=N(Kx_{t},\sigma^{2}I_{d}).

This yields an equivalent form of the control

u_{t}=Kx_{t}+\eta_{t},\quad\eta_{t}\sim N(0,\sigma^{2}I_{d}).

In addition, the probability density function of \pi_{K} has the explicit form

\pi_{K}(u|x)=\frac{1}{(2\pi)^{d/2}\sigma^{d}}\exp\left(-\frac{\|u-Kx\|_{2}^{2}}{2\sigma^{2}}\right).

The advantage of stochastic policies is that the policy gradient method can be applied to the problem, which we will discuss in a later section. In addition, stochastic policies are beneficial for exploration, which is quite important in reinforcement learning.
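
As a small illustration of the stochastic policy, the following sketch simulates one transition under \pi_{K}: it draws u=Kx+\eta with \eta\sim N(0,\sigma^{2}I_{d}) and then propagates the state with Gaussian noise \omega\sim N(0,D_{\omega}). The helper name and the use of NumPy are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_step(A, B, K, x, sigma, D_omega):
    """One transition under the stochastic policy pi_K (illustrative sketch):
    u = K x + eta with eta ~ N(0, sigma^2 I_d), then x' = A x + B u + omega."""
    d, n = K.shape[0], A.shape[0]
    eta = sigma * rng.standard_normal(d)
    u = K @ x + eta
    omega = rng.multivariate_normal(np.zeros(n), D_omega)
    x_next = A @ x + B @ u + omega
    return u, x_next
```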

Under the policy \pi_{K}, the dynamic system can be written as

x_{t+1}=Ax_{t}+Bu_{t}+\omega_{t}=(A+BK)x_{t}+B\eta_{t}+\omega_{t}.

Let \widetilde{\omega}_{t}:=B\eta_{t}+\omega_{t}, D_{\widetilde{\omega}}=\mathbb{E}[\widetilde{\omega}_{t}\widetilde{\omega}_{t}^{\top}], D_{\omega}=\mathbb{E}[\omega_{t}\omega_{t}^{\top}] and \mathbb{E}[\eta_{t}\eta_{t}^{\top}]=\sigma^{2}I_{d}. Using these abbreviations, the LQR problem (2.1) can be reformulated as

(2.3) \min_{K} \;\; J(K)=\sum_{t=0}^{\infty}\mathbb{E}\gamma^{t}[x_{t}^{\top}(S+K^{\top}RK)x_{t}]+\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}
\text{s.t.} \;\; x_{0}=0,\;\; x_{t+1}=(A+BK)x_{t}+\widetilde{\omega}_{t},

where we have used the linearity and the cyclic property of the trace operator, more precisely

\sum_{t=0}^{\infty}\mathbb{E}[\gamma^{t}\eta_{t}^{\top}R\eta_{t}]=\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}\operatorname{Tr}[\eta_{t}^{\top}R\eta_{t}]=\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}\operatorname{Tr}[\eta_{t}\eta_{t}^{\top}R]=\sum_{t=0}^{\infty}\gamma^{t}\operatorname{Tr}[\mathbb{E}[\eta_{t}\eta_{t}^{\top}]R]=\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}.

We will use this trick in several places in the remainder of this paper.

Let the domain of feasible policies be given by \mathbf{Dom}:=\{K\;|\;\rho(A+BK)<1\}, where \rho(X) denotes the spectral radius of the matrix X. In the following, we assume that \mathbf{Dom} is non-empty. Policies lying in the feasible domain are called stable. The set \mathbf{Dom} is non-convex, see [9], and for all K\in\mathbf{Dom} the corresponding value J(K) is finite. By Proposition 3.2 in [23], for any stable K and any \rho\in(\rho(A+BK),1), there exists a \Gamma_{K}>0 such that for any k\geq 1 the following inequality holds

(2.4) \|(A+BK)^{k}\|_{2}\leq\Gamma_{K}\rho^{k}.

For any compact subset \mathcal{K}\subset\mathbf{Dom}, we can find uniform constants \Gamma and \rho such that (2.4) holds for all K\in\mathcal{K}. Unfortunately, since 0<\gamma<1, J(K) may also be finite for unstable K. More precisely, one can show that J(K) is finite if and only if

(2.5) \rho(A+BK)<\frac{1}{\sqrt{\gamma}}.

We observe that the constraints x_{t+1}=(A+BK)x_{t}+\widetilde{\omega}_{t} in (2.3) yield for t\geq 1

(2.6) x_{t}=\sum\limits_{i=0}^{t-1}(A+BK)^{t-i-1}\widetilde{\omega}_{i}.

Provided that K satisfies (2.5), we can insert (2.6) into J(K). Exploiting that \mathbb{E}[\widetilde{\omega}_{i}\widetilde{\omega}_{j}^{\top}]=0 for j\neq i, we obtain the following analytic form of the objective function

(2.7) J(K)=\frac{\gamma}{1-\gamma}\sum_{i=0}^{\infty}\operatorname{Tr}[D_{\widetilde{\omega}}\gamma^{i}[(A+BK)^{i}]^{\top}(S+K^{\top}RK)(A+BK)^{i}]+\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}.

Let P_{K,\gamma}:=\sum_{i=0}^{\infty}\gamma^{i}[(A+BK)^{i}]^{\top}(S+K^{\top}RK)(A+BK)^{i}. Then J(K) can be further simplified to

(2.8) J(K)=\frac{\gamma}{1-\gamma}\operatorname{Tr}[D_{\widetilde{\omega}}P_{K,\gamma}]+\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}.
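
The representation (2.8) suggests a simple way to evaluate J(K) numerically: compute P_{K,\gamma} from the fixed-point relation P_{K,\gamma}=S+K^{\top}RK+\gamma(A+BK)^{\top}P_{K,\gamma}(A+BK) (used later in the proof of Lemma 3.2) and insert it into (2.8). The sketch below does exactly this; the iterative solver, the tolerance and the use of D_{\widetilde{\omega}}=\sigma^{2}BB^{\top}+D_{\omega} (valid for independent \eta_{t} and \omega_{t}) are our own illustrative choices, and the iteration assumes \sqrt{\gamma}\,\rho(A+BK)<1, cf. (2.5).

```python
import numpy as np

def cost_of_policy(A, B, S, R, K, gamma, sigma, D_omega, tol=1e-10, max_iter=10_000):
    """Evaluate (2.8): J(K) = gamma/(1-gamma) Tr[D_wtilde P_{K,gamma}] + sigma^2 Tr[R]/(1-gamma).
    P_{K,gamma} is the fixed point of P = S + K^T R K + gamma (A+BK)^T P (A+BK);
    illustrative sketch, assumes sqrt(gamma)*rho(A+BK) < 1."""
    Lc = A + B @ K                                   # closed-loop matrix A + BK
    D_wtilde = sigma**2 * (B @ B.T) + D_omega        # covariance of wtilde = B eta + omega
    P = S + K.T @ R @ K
    for _ in range(max_iter):
        P_new = S + K.T @ R @ K + gamma * Lc.T @ P @ Lc
        if np.linalg.norm(P_new - P) < tol:
            P = P_new
            break
        P = P_new
    J = gamma / (1 - gamma) * np.trace(D_wtilde @ P) + sigma**2 * np.trace(R) / (1 - gamma)
    return J, P
```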

In the next result, we observe that for \gamma sufficiently close to 1, we can restrict our search for the optimal policy to the set of stable policies.

Lemma 2.1.

Suppose that K is a stable policy. Then for any \nu\geq 1 there exist a discount factor 0<\gamma<1 and a positive constant \beta=\beta(K,\nu)<1 such that

(2.9) J(\bar{K})-J(K^{*})>\nu(J(K)-J(K^{*}))

is valid for any policy \bar{K} with \rho(A+B\bar{K})>1-\beta, where K^{*} denotes an optimal policy. In addition, the set

(2.10) \text{\bf{Dom}}_{K,\nu}=\left\{\tilde{K}:J(\tilde{K})-J(K^{*})\leq\nu(J(K)-J(K^{*}))\right\}

is a compact subset of \mathbf{Dom}. Finally, there exist constants \rho\in(1-\beta,1) and \Gamma_{K,\nu} such that

(2.11) \|(A+B\tilde{K})^{k}\|_{2}\leq\Gamma_{K,\nu}\rho^{k}

holds for all \tilde{K}\in\text{\bf{Dom}}_{K,\nu}.

Proof 2.2.

Let \bar{K} be an arbitrary policy with \rho(A+B\bar{K})\geq 1-\beta, where \beta will be determined later. We can w.l.o.g. assume that \bar{K} satisfies (2.5). Otherwise, J(\bar{K}) would be infinite and (2.9) would hold trivially. First, we observe that (2.9) follows from

(2.12) \operatorname{Tr}[D_{\widetilde{\omega}}P_{\bar{K},\gamma}]>\nu\operatorname{Tr}[D_{\widetilde{\omega}}P_{K,\gamma}].

Next, we introduce the abbreviations \rho_{K}=\rho(A+BK) and \rho_{\bar{K}}=\rho(A+B\bar{K}). Since D_{\widetilde{\omega}}, S and R are positive definite, using (2.4) and the fact that \rho(Z)\leq\|Z\|_{2}\leq\|Z\|_{F}\leq\sqrt{n}\|Z\|_{2} holds for any matrix Z\in\mathbb{R}^{n\times n}, we can deduce

Tr[Dω~PK,γ]\displaystyle\operatorname{Tr}[D_{\widetilde{\omega}}P_{K,\gamma}] σmax(Dw~)Tr[PK,γ]\displaystyle\leq\sigma_{\max}(D_{\widetilde{w}})\operatorname{Tr}[P_{K,\gamma}]
σmax(Dw~)σmax(S+KRK)t=0γt(A+BK)tF2\displaystyle\leq\sigma_{\max}(D_{\widetilde{w}})\sigma_{\max}(S+K^{\top}RK)\sum_{t=0}^{\infty}\gamma^{t}\|(A+BK)^{t}\|_{F}^{2}
σmax(Dw~)σmax(S+KRK)nΓK21γ(ρK+12)2\displaystyle\leq\frac{\sigma_{\max}(D_{\widetilde{w}})\sigma_{\max}(S+K^{\top}RK)n\Gamma^{2}_{K}}{1-\gamma\left(\frac{\rho_{K}+1}{2}\right)^{2}}

where ΓK\Gamma_{K} denotes the constant of the policy KK in (2.4) for ρ=ρK+12\rho=\frac{\rho_{K}+1}{2}. By (A+BK¯)tFρ((A+BK¯)t)=ρK¯t\|(A+B\bar{K})^{t}\|_{F}\geq\rho((A+B\bar{K})^{t})=\rho^{t}_{\bar{K}}, we further get

Tr[Dω~PK¯,γ]σmin(Dw~)σmin(S)1γρK¯2.\displaystyle\operatorname{Tr}[D_{\widetilde{\omega}}P_{\bar{K},\gamma}]\geq\frac{\sigma_{\min}(D_{\widetilde{w}})\sigma_{\min}(S)}{1-\gamma\rho_{\bar{K}}^{2}}.

Let cu=σmax(Dw~)σmax(S+KRK)nΓK2c_{u}=\sigma_{\max}(D_{\widetilde{w}})\sigma_{\max}(S+K^{\top}RK)n\Gamma^{2}_{K} and cl=σmin(Dw~)σmin(S)c_{l}=\sigma_{\min}(D_{\widetilde{w}})\sigma_{\min}(S).

Using these two inequalities, we observe that (2.12) and therefore also (2.9) follow by

(2.13) cuν1γ(ρK+12)2<cl1γρK¯2.\frac{c_{u}\nu}{1-\gamma\left(\frac{\rho_{K}+1}{2}\right)^{2}}<\frac{c_{l}}{1-\gamma\rho_{\bar{K}}^{2}}.

We can w.l.o.g. assume that cuν>clc_{u}\nu>c_{l}, otherwise cuc_{u} and clc_{l} can be rescaled such that this holds. Hence, isolating γ\gamma in (2.13), we conclude that

(2.14) 0<cuνclcuνρK¯2cl(ρK+12)2<γ.0<\frac{c_{u}\nu-c_{l}}{c_{u}\nu\rho^{2}_{\bar{K}}-c_{l}\left(\frac{\rho_{K}+1}{2}\right)^{2}}<\gamma.

Next, we observe that

(2.15) cuνclcuνρK¯2cl(ρK+12)2\displaystyle\frac{c_{u}\nu-c_{l}}{c_{u}\nu\rho^{2}_{\bar{K}}-c_{l}\left(\frac{\rho_{K}+1}{2}\right)^{2}} <cuνclcuν(1β)2cl+cl[1(ρK+12)2]\displaystyle<\frac{c_{u}\nu-c_{l}}{c_{u}\nu(1-\beta)^{2}-c_{l}+c_{l}\left[1-\left(\frac{\rho_{K}+1}{2}\right)^{2}\right]}
<cuνclcuν2cuνβcl+cl[1(ρK+12)2]<1\displaystyle<\frac{c_{u}\nu-c_{l}}{c_{u}\nu-2c_{u}\nu\beta-c_{l}+c_{l}\left[1-\left(\frac{\rho_{K}+1}{2}\right)^{2}\right]}<1

is valid for constants β>0\beta>0 satisfying

(2.16) βcl4νcu[1(ρK+12)2].\beta\leq\frac{c_{l}}{4\nu c_{u}}\left[1-\left(\frac{\rho_{K}+1}{2}\right)^{2}\right].

We conclude that for β>0\beta>0 satisfying (2.16) and for γ\gamma with

(2.17) 0<cuνclcuνcl+cl2[1(ρK+12)2]<γ<1,0<\frac{c_{u}\nu-c_{l}}{c_{u}\nu-c_{l}+\frac{c_{l}}{2}\left[1-\left(\frac{\rho_{K}+1}{2}\right)^{2}\right]}<\gamma<1,

the inequalities (2.13) and therefore (2.9) hold. Next, we observe that

DomK,ν{K~:ρ(A+BK~)1β},\displaystyle\text{\bf{Dom}}_{K,\nu}\subset\left\{\tilde{K}:\rho(A+B\tilde{K})\leq 1-\beta\right\},

which is obviously a compact subset of Dom. The last statement of the theorem follows directly by Proposition 3.2 in [23].

For instance, if a stable policy K is given, then we can choose \gamma close to 1 such that, by Lemma 2.1, the corresponding level set of J is compact and contains only stable policies. Hence this lemma is quite important in the following discussion.

For a policy K satisfying (2.5), the objective function (2.7) is differentiable in a sufficiently small neighborhood of K. Hence, we can compute the gradient of (2.7), which is given by

(2.18) \nabla J(K)=2\left((R+\gamma B^{\top}P_{K,\gamma}B)K+\gamma B^{\top}P_{K,\gamma}A\right)\Sigma_{K,\gamma},

where \Sigma_{K,\gamma}=\sum_{t=0}^{\infty}\mathbb{E}[\gamma^{t}x_{t}x_{t}^{\top}] and the sequence \{x_{t}\}_{t=0}^{\infty} is generated by the policy u_{t}\sim\pi_{K}(\cdot|x_{t}). This form of the gradient is obtained in Lemma 1 of [9] and Lemma 4 of [16].
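
For completeness, the gradient (2.18) can also be evaluated exactly once P_{K,\gamma} and \Sigma_{K,\gamma} are available. One can check from (2.6) (with x_{0}=0) that \Sigma_{K,\gamma}=\frac{\gamma}{1-\gamma}\widetilde{\Sigma}, where \widetilde{\Sigma} solves \widetilde{\Sigma}=D_{\widetilde{\omega}}+\gamma(A+BK)\widetilde{\Sigma}(A+BK)^{\top}. The sketch below is an illustrative model-based implementation of this observation (the fixed-point solvers, tolerances and helper names are our own choices); it is not the model-free estimator analyzed later.

```python
import numpy as np

def policy_gradient_exact(A, B, S, R, K, gamma, sigma, D_omega, tol=1e-10, max_iter=10_000):
    """Evaluate (2.18): grad J(K) = 2((R + gamma B^T P B)K + gamma B^T P A) Sigma_{K,gamma}.
    P_{K,gamma} and Sigma_{K,gamma} are computed from discounted Lyapunov-type fixed points;
    illustrative sketch only."""
    Lc = A + B @ K                                   # closed-loop matrix A + BK
    D_wtilde = sigma**2 * (B @ B.T) + D_omega        # covariance of wtilde = B eta + omega

    def fixed_point(update, X):
        for _ in range(max_iter):
            X_new = update(X)
            if np.linalg.norm(X_new - X) < tol:
                return X_new
            X = X_new
        return X

    # P_{K,gamma}: fixed point of P = S + K^T R K + gamma Lc^T P Lc
    P = fixed_point(lambda P: S + K.T @ R @ K + gamma * Lc.T @ P @ Lc, S + K.T @ R @ K)
    # Sigma_{K,gamma} = gamma/(1-gamma) * Sig with Sig = D_wtilde + gamma Lc Sig Lc^T
    Sig = fixed_point(lambda X: D_wtilde + gamma * Lc @ X @ Lc.T, D_wtilde)
    Sigma_K = gamma / (1 - gamma) * Sig

    E_K = (R + gamma * B.T @ P @ B) @ K + gamma * B.T @ P @ A
    return 2 * E_K @ Sigma_K
```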

From the representation of J(K) in (2.8), we can establish the following relations between J(K), P_{K,\gamma} and \Sigma_{K,\gamma}, see also Lemma 13 in [9].

Lemma 2.3.

Let K be a policy such that J(K) is finite. Then the representation of J(K) in (2.8) yields the following two inequalities:

(2.19) J(K)\geq\frac{\gamma}{1-\gamma}\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})\|P_{K,\gamma}\|_{F}

and

(2.20) J(K)\geq\sigma_{\mathrm{min}}(S)\|\Sigma_{K,\gamma}\|_{F}.

Proof 2.4.

Since J(K)\geq\frac{\gamma}{1-\gamma}\operatorname{Tr}[D_{\widetilde{\omega}}P_{K,\gamma}] and P_{K,\gamma}\succ 0, we obtain

J(K)\geq\frac{\gamma}{1-\gamma}\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})\operatorname{Tr}[P_{K,\gamma}]\geq\frac{\gamma}{1-\gamma}\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})\|P_{K,\gamma}\|_{F}.

The second inequality is derived by

J(K)\geq\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}[x_{t}^{\top}Sx_{t}]\geq\sigma_{\min}(S)\operatorname{Tr}[\Sigma_{K,\gamma}]\geq\sigma_{\min}(S)\|\Sigma_{K,\gamma}\|_{F}.

Next, we gather some useful properties of (2.3), which will serve as tools for the convergence analysis of the methods introduced in the following sections. Moreover, we will also have a look at the difficulties showing up in the theoretical analysis of (2.3). In general, the cost function J(K) in problem (2.3) as well as the set of policies satisfying (2.5) are non-convex. In order to cope with this non-convexity, we use the so-called PL condition named after Polyak and Lojasiewicz, which is a relaxation of the notion of strong convexity. The PL condition is satisfied if there exists a universal constant \mu>0 such that for any policy K with finite J(K), we have

(2.21) \|\nabla J(K)\|_{F}^{2}\geq\mu[J(K)-J(K^{*})],

where K^{*} denotes a global optimum of (2.3). From Lemma 3 in [9], Lemma 4 in [16], R\succ 0 and \Sigma_{K,\gamma}\succeq\gamma D_{\omega}\succ 0, we get that (2.21) is satisfied for (2.3) with

(2.22) \mu=\frac{\gamma\sigma_{\mathrm{min}}(\Sigma_{K,\gamma})^{2}\sigma_{\mathrm{min}}(R)}{(1-\gamma)\|\Sigma_{K^{*},\gamma}\|_{2}}>0.

We note that \|\Sigma_{K^{*},\gamma}\|_{F}<\infty holds due to the optimality of K^{*}.

By the PL condition (2.21), we know that stationary points of (2.3), i.e. points with \nabla J(K)=0, are global minima. Since \Sigma_{K,\gamma}\succeq\gamma D_{\widetilde{\omega}}\succ 0, a policy K is a stationary point if and only if

(2.23) (R+\gamma B^{\top}P_{K,\gamma}B)K+\gamma B^{\top}P_{K,\gamma}A=0.

We can verify that K^{*} in (2.2) is a global minimum of the function J. The optimization problem (2.3) may have more than one global minimum and, unfortunately, it is hard to derive the analytic form of all optimal policies in terms of A, B, S, R and \gamma.

To simplify the analysis, we assume that \omega_{t} is Gaussian, that is, \omega_{t}\sim N(0,D_{\omega}). This assumption also guarantees that \widetilde{\omega}_{t}\sim N(0,D_{\widetilde{\omega}}).

3 Policy evaluation

Given a fixed stable policy \pi_{K}, it is desirable to know the expectation of the corresponding cumulative cost for an initial state x_{0}=x, which in some sense evaluates “how good” it is to be in the state x under the policy \pi_{K}. This expectation is given by the so-called value function V_{K}(x), which is defined below. Moreover, the expectation of the cumulative cost for taking an action u_{0}=u in some initial state x_{0}=x under the policy \pi_{K} is given by the state-action value Q_{K}(x,u), which is also defined below, see also [19].

The task of the policy evaluation is to get good approximations of the value function of a fixed stable policy πK\pi_{K}. In value-based methods, the policy evaluation is a very important and elementary step. In addition, the policy evaluation plays the role of the critic in the actor critic algorithm. The TD-learning method [18] is a prevalent method for the policy evaluation. In this section, we discuss the usage of the TD-learning method in the LQR-setting.

First, we compute the value function V_{K} and the state-action value function Q_{K} for a stable policy \pi_{K}. The value function, which gives the state value for the policy \pi_{K}, has the following explicit form:

(3.24) V_{K}(x):=\mathbb{E}_{p_{K}}\left[\sum_{t=0}^{\infty}\gamma^{t}c_{t}\Big{|}x_{0}=x\right]=x^{\top}P_{K,\gamma}x+\frac{\gamma}{1-\gamma}\operatorname{Tr}[P_{K,\gamma}D_{\widetilde{\omega}}]+\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma},

and the state-action value of the policy \pi_{K} is given by

(3.25) Q_{K}(x,u):=\mathbb{E}_{p_{K}}\left[\sum_{t=0}^{\infty}\gamma^{t}c_{t}\Big{|}x_{0}=x,u_{0}=u\right]
=\frac{\gamma}{1-\gamma}\operatorname{Tr}[\gamma P_{K,\gamma}D_{\widetilde{\omega}}+\sigma^{2}R]+\gamma\operatorname{Tr}[P_{K,\gamma}D_{\omega}]
+[x^{\top}\;u^{\top}]\left[\begin{array}{cc}S+\gamma A^{\top}P_{K,\gamma}A&\gamma A^{\top}P_{K,\gamma}B\\ \gamma B^{\top}P_{K,\gamma}A&R+\gamma B^{\top}P_{K,\gamma}B\end{array}\right]\left[\begin{array}{c}x\\ u\end{array}\right],

where p_{K} denotes the trajectory distribution of the system under the policy \pi_{K}.

As we will see below, the value function of \pi_{K} can be written as a function which is linear with respect to the feature function

\phi(x)=\left[\begin{array}{c}1\\ \mathrm{vec}(xx^{\top})\end{array}\right]

for x\in\mathbb{R}^{n}, where \mathrm{vec} denotes the vectorization operator stacking the columns of a matrix A on top of one another, i.e.,

\mathrm{vec}(A)=[a_{1,1},\cdots,a_{m,1},a_{1,2},\cdots,a_{m,2},\cdots,a_{1,n},\cdots,a_{m,n}]^{\top}.

Therefore, it is natural to propose a class of linear approximation functions with feature \phi by

\widetilde{V}(x;\theta)=\phi(x)^{\top}\theta,

where \theta=\left[\begin{array}{c}\theta_{0}\\ \theta_{1}\end{array}\right] is the parameter we seek to estimate.

To this end, we observe that V_{K}(x)=\widetilde{V}(x;\theta^{\star}), where \theta_{1}^{\star}=\mathrm{vec}(P_{K,\gamma}) and \theta_{0}^{\star}=\frac{\gamma}{1-\gamma}\operatorname{Tr}[P_{K,\gamma}D_{\widetilde{\omega}}]+\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}. For each stable policy K, our aim is to find a parameter \theta such that \widetilde{V}(x;\theta) approximates V_{K}(x) well in expectation. This is carried out by minimizing a suitable loss function, for which we have to find a “good” distribution. A reasonable choice for this distribution is the stationary distribution N(0,D_{K}), where

D_{K}=\sum_{i=0}^{\infty}[(A+BK)^{i}]^{\top}D_{\widetilde{\omega}}(A+BK)^{i}

for any stable K. It is easy to verify that if x\sim N(0,D_{K}), then x^{\prime}=(A+BK)x+\widetilde{\omega}\sim N(0,D_{K}) as well, which explains why we call it the stationary distribution. We also write \mu_{K} for the distribution N(0,D_{K}) in short.
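
A sketch for computing D_{K} (and hence for sampling from \mu_{K}) is given below. We obtain it as the covariance that is left invariant by the closed-loop dynamics x^{\prime}=(A+BK)x+\widetilde{\omega}, i.e. as the fixed point of D=D_{\widetilde{\omega}}+(A+BK)D(A+BK)^{\top}; the iterative solver, tolerance and helper name are our own illustrative choices and assume \rho(A+BK)<1.

```python
import numpy as np

def stationary_covariance(A, B, K, sigma, D_omega, tol=1e-12, max_iter=100_000):
    """Covariance D_K of the stationary distribution mu_K = N(0, D_K): the fixed point of
    D = D_wtilde + (A+BK) D (A+BK)^T, i.e. the covariance invariant under the closed-loop
    dynamics x' = (A+BK)x + wtilde.  Illustrative sketch; assumes rho(A+BK) < 1."""
    Lc = A + B @ K
    D_wtilde = sigma**2 * (B @ B.T) + D_omega
    D = np.copy(D_wtilde)
    for _ in range(max_iter):
        D_new = D_wtilde + Lc @ D @ Lc.T
        if np.linalg.norm(D_new - D) < tol:
            return D_new
        D = D_new
    return D

# Sampling from mu_K, e.g. for the TD updates sketched below:
#   x = np.random.default_rng().multivariate_normal(np.zeros(A.shape[0]), D_K)
```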

Remark 3.1.

For the sake of simplicity, we introduce the abbreviation

\mathbb{E}_{x,x^{\prime}}[\cdot]:=\mathbb{E}_{x\sim\mu_{K},\eta\sim N(0,\sigma^{2}I_{d}),\omega\sim N(0,D_{\omega})}[\cdot],

since x^{\prime}=(A+BK)x+B\eta+\omega. When only the variable x appears in the expectation, we write \mathbb{E}_{\mu_{K}}[\cdot] instead of \mathbb{E}_{x,x^{\prime}}[\cdot]. Otherwise, we use the notation \mathbb{E}_{x,x^{\prime}}[\cdot] for convenience.

Then we define the loss function

L(\theta)=\frac{1}{2}\mathbb{E}_{x\sim\mu_{K}}\left[(V_{K}(x)-\widetilde{V}(x;\theta))^{2}\right]

such that the gradient of the loss function is given by

\nabla_{\theta}L(\theta)=\mathbb{E}_{x\sim\mu_{K}}\left[\phi(x)(\widetilde{V}(x;\theta)-V_{K}(x))\right].

However, in practice, since the true value function V_{K} is unknown, one often replaces V_{K}(x) by the biased estimate c(x,u)+\gamma\widetilde{V}(x^{\prime};\theta), where x^{\prime}=(A+BK)x+\widetilde{\omega} is the subsequent state of x. For convenience, let \tilde{\delta}(x,u,x^{\prime};\theta)=\widetilde{V}(x;\theta)-c(x,u)-\gamma\widetilde{V}(x^{\prime};\theta), which is called the TD error.

We further note that u=Kx+\eta. Then we obtain the semi-gradient

(3.26) \bar{h}(\theta)=\left[\begin{array}{c}\bar{h}_{0}(\theta)\\ \bar{h}_{1}(\theta)\end{array}\right]=\mathbb{E}_{x,x^{\prime}}\left[\phi(x)\tilde{\delta}(x,u,x^{\prime};\theta)\right],

and it is quite straightforward to get the stochastic semi-gradient

(3.27) h(\theta)=\left[\begin{array}{c}h_{0}(\theta)\\ h_{1}(\theta)\end{array}\right]=\phi(x)\tilde{\delta}(x,u,x^{\prime};\theta).

Starting with some initial parameter \theta^{(0)} and using the gradient descent method with \bar{h}(\theta) or h(\theta) as update direction, the TD-learning method with linear approximation functions is described by Algorithm 1.

0:  Stable policy K, initial parameter \theta^{(0)}, step size \alpha, the number of steps N.
1:  for s=1,2,\cdots,N-1 do
2:     if semi-gradient then
3:        Compute the semi-gradient \bar{h}(\theta^{(s-1)}) by (3.26).
4:        Update the parameter \theta^{(s)}=\theta^{(s-1)}-\alpha\bar{h}(\theta^{(s-1)}).
5:     else if stochastic semi-gradient then
6:        Sample x\sim\mu_{K}, \widetilde{\omega}\sim N(0,D_{\widetilde{\omega}}) and set x^{\prime}=(A+BK)x+\widetilde{\omega}.
7:        Compute the stochastic semi-gradient h(\theta^{(s-1)}) by (3.27).
8:        Update the parameter \theta^{(s)}=\theta^{(s-1)}-\alpha h(\theta^{(s-1)}).
9:     end if
10:  end for
11:  Averaging: \tilde{\theta}=\frac{1}{N}\sum_{s=0}^{N-1}\theta^{(s)}.
Algorithm 1 TD learning
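
The following sketch implements the stochastic semi-gradient branch of Algorithm 1 with the feature map \phi(x)=[1;\mathrm{vec}(xx^{\top})]. It is illustrative only: the sampling of x\sim\mu_{K} via a precomputed covariance D_{K} (e.g. from the stationary_covariance sketch above), the column-major vec and the function signature are our own choices.

```python
import numpy as np

def td_learning(A, B, S, R, K, gamma, sigma, D_omega, D_K, alpha, N, seed=0):
    """Stochastic semi-gradient variant of Algorithm 1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape[0], B.shape[1]

    def phi(x):
        # Feature map phi(x) = [1; vec(x x^T)] with column-stacking vec as in the text.
        return np.concatenate(([1.0], np.outer(x, x).flatten(order="F")))

    theta = np.zeros(1 + n * n)
    thetas = []
    for _ in range(N):
        thetas.append(theta)                                    # store theta^(s)
        x = rng.multivariate_normal(np.zeros(n), D_K)           # x ~ mu_K
        eta = sigma * rng.standard_normal(d)
        u = K @ x + eta                                         # u ~ pi_K(.|x)
        omega = rng.multivariate_normal(np.zeros(n), D_omega)
        x_next = A @ x + B @ u + omega
        c = x @ S @ x + u @ R @ u                               # stage cost c(x, u)
        td_error = phi(x) @ theta - c - gamma * phi(x_next) @ theta   # delta_tilde in (3.27)
        theta = theta - alpha * phi(x) * td_error               # theta <- theta - alpha h(theta)
    return np.mean(thetas, axis=0)                              # averaged iterate theta_tilde
```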

The following lemma shows that \theta^{\star} is a zero of the semi-gradient \bar{h}.

Lemma 3.2.

For any fixed stable policy K, we have

(3.28) \bar{h}(\theta^{\star})=0.

Proof 3.3.

For the first element of the semi-gradient, we get

\bar{h}_{0}(\theta^{\star})=\mathbb{E}_{x,x^{\prime}}[\tilde{\delta}(x,u,x^{\prime};\theta^{\star})]=\mathbb{E}_{x,x^{\prime}}[V_{K}(x)-c(x,u)-\gamma V_{K}(x^{\prime})]=0

by the definition of V_{K}. The remaining elements we consider in matrix form:

\mathbb{E}_{x,x^{\prime}}\left\{xx^{\top}[V_{K}(x)-c-\gamma V_{K}(x^{\prime})]\right\}
=\mathbb{E}_{\mu_{K}}\left\{xx^{\top}[x^{\top}P_{K,\gamma}x+\theta_{0}^{\star}-x^{\top}(S+K^{\top}RK)x-\sigma^{2}\operatorname{Tr}[R]-\gamma x^{\top}(A+BK)^{\top}P_{K,\gamma}(A+BK)x-\gamma\operatorname{Tr}[P_{K,\gamma}D_{\widetilde{\omega}}]-\gamma\theta_{0}^{\star}]\right\},

where we have used that \eta is independent of x. By the definition of P_{K,\gamma}, we have P_{K,\gamma}=S+K^{\top}RK+\gamma(A+BK)^{\top}P_{K,\gamma}(A+BK). For the constant term, it holds (1-\gamma)\theta_{0}^{\star}=\sigma^{2}\operatorname{Tr}[R]+\gamma\operatorname{Tr}[P_{K,\gamma}D_{\widetilde{\omega}}]. Thus, we obtain \bar{h}(\theta^{\star})=0.

Assumption 3.4.

Assume that there exists M_{\theta}>0 such that the norms of \theta^{\star} and of the iterates \theta^{(s)} generated by Algorithm 1 are bounded by M_{\theta}:

\|\theta^{\star}\|_{2}\leq M_{\theta},\quad\|\theta^{(s)}\|_{2}\leq M_{\theta}.

Remark 3.5.

In some related works, projected gradient descent is used to guarantee the boundedness of \theta^{(s)}; however, the projection step is rarely used in practice.

Lemma 3.6.

There exists a positive number \kappa such that for all \theta the inequality

(3.29) \kappa\|\theta^{\star}-\theta\|_{2}^{2}\leq\mathbb{E}_{\mu_{K}}\left[|\widetilde{V}(x;\theta^{\star})-\widetilde{V}(x;\theta)|^{2}\right]

holds. In addition, \kappa is continuous with respect to K.

Proof 3.7.

Set θθ=[θ0θ0vec(Θ)]\theta^{*}-\theta=\left[\begin{array}[]{c}\theta_{0}^{*}-\theta_{0}\\ \mathrm{v}ec(\Theta)\end{array}\right], where Θ\Theta is symmetric.

By Law of total variance,

(3.30) Var(xΘx)=𝔼μK[xΘxxΘx][𝔼μK(xΘx)]2=2Tr[DKΘDKΘ]=2DK1/2ΘDK1/2F2.\displaystyle\begin{aligned} &\mathrm{Var}(x^{\top}\Theta x)\\ =&\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta xx^{\top}\Theta x\right]-\left[\mathbb{E}_{\mu_{K}}(x^{\top}\Theta x)\right]^{2}=2\operatorname{Tr}[D_{K}\Theta D_{K}\Theta]=2\|D_{K}^{1/2}\Theta D_{K}^{1/2}\|^{2}_{F}.\end{aligned}

The second equality is derived by Proposition A.1. in [23].

Thus, we can define

λK=infΘF=1DK1/2ΘDK1/2F.\lambda_{K}=\inf_{\|\Theta\|_{F}=1}\|D_{K}^{1/2}\Theta D_{K}^{1/2}\|_{F}.

Since norms of matrices are equivalent and Var(xΘx)>0\mathrm{Var}(x^{\top}\Theta x)>0 for Θ0\Theta\neq 0, we obtain from (3.30) that λK\lambda_{K} is positive and continuous with respect to KK. Besides, |𝔼μK[xΘx]|=|Tr(DKΘ)|DKFΘF\left|\mathbb{E}_{\mu_{K}}[x^{\top}\Theta x]\right|=|\operatorname{Tr}(D_{K}\Theta)|\leq\|D_{K}\|_{F}\|\Theta\|_{F}. Using this inequality in connection with (3.30) we obtain

𝔼μK[|V~(x,θ)V~(x;θ)|2]\displaystyle\mathbb{E}_{\mu_{K}}\left[|\widetilde{V}(x,\theta^{*})-\widetilde{V}(x;\theta)|^{2}\right]
=\displaystyle= 𝔼μK[(xΘx+θ0θ0)2]\displaystyle\mathbb{E}_{\mu_{K}}\left[(x^{\top}\Theta x+\theta_{0}^{*}-\theta_{0})^{2}\right]
=\displaystyle= Var(xΘx)+(θ0θ0+𝔼μK[xΘx])2\displaystyle\mathrm{Var}(x^{\top}\Theta x)+\left(\theta_{0}^{*}-\theta_{0}+\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta x\right]\right)^{2}
\displaystyle\geq 2λK2ΘF2+(θ0θ0𝔼μK[xΘx])2\displaystyle 2\lambda_{K}^{2}\|\Theta\|_{F}^{2}+\left(\theta_{0}-\theta_{0}^{*}-\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta x\right]\right)^{2}
\displaystyle\geq λK2ΘF2+λK2DKF2(𝔼μK[xΘx])2+(θ0θ0𝔼μK[xΘx])2\displaystyle{\lambda^{2}_{K}}\|\Theta\|_{F}^{2}+\frac{\lambda^{2}_{K}}{\|D_{K}\|_{F}^{2}}\left(\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta x\right]\right)^{2}+\left(\theta_{0}-\theta_{0}^{*}-\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta x\right]\right)^{2}
\displaystyle\geq λK2ΘF2+(11+DKF2λK2)(𝔼μK[xΘx]+θ0θ0+𝔼μK[xΘx])2\displaystyle\lambda_{K}^{2}\|\Theta\|_{F}^{2}+\left(\frac{1}{1+\frac{\|D_{K}\|^{2}_{F}}{\lambda^{2}_{K}}}\right)\left(-\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta x\right]+\theta_{0}-\theta_{0}^{*}+\mathbb{E}_{\mu_{K}}\left[x^{\top}\Theta x\right]\right)^{2}
\displaystyle\geq λK2ΘF2+(λK2λK2+DKF2)(θ0θ0)2.\displaystyle{\lambda_{K}^{2}}\|\Theta\|_{F}^{2}+\left(\frac{\lambda_{K}^{2}}{\lambda_{K}^{2}+\|D_{K}\|_{F}^{2}}\right)(\theta_{0}^{*}-\theta_{0})^{2}.

where the third inequality follows by the Cauchy-Schwarz inequality.

Thus, we conclude the existence of such a positive number κ\kappa which is continuous with respect to KK.

Under Assumption 3.4, the constant \kappa has a uniform positive lower bound on any compact subset of \mathbf{Dom}. Then we can use basic tools from the analysis of the gradient descent method to prove the convergence of Algorithm 1.

Theorem 3.8.

Suppose that Assumption 3.4 holds and that Algorithm 1 is run with the semi-gradient update using \alpha=\frac{1-\gamma}{2(1+(n+2)\|D_{K}\|_{F}^{2})} to generate \widetilde{V}(x;\tilde{\theta}). Then, the following inequality holds:

(3.31) \mathbb{E}_{\mu_{K}}[(\widetilde{V}(x;\tilde{\theta})-\widetilde{V}(x;\theta^{\star}))^{2}]\leq\frac{8(1+(n+2)\|D_{K}\|_{F}^{2})M_{\theta}^{2}}{(1-\gamma)^{2}N}.

Proof 3.9.

By the definition of h(\theta) and the fact that \|\phi(x)\|_{2}^{2}=\|x\|_{2}^{4}+1, we know

(3.32) \|h(\theta)\|^{2}_{2}=(c+\gamma\phi(x^{\prime})^{\top}\theta-\phi(x)^{\top}\theta)^{2}\|\phi(x)\|_{2}^{2}
\leq 2c^{2}\|\phi(x)\|_{2}^{2}+2M_{\theta}^{2}\|\gamma\phi(x^{\prime})-\phi(x)\|^{2}_{2}\|\phi(x)\|_{2}^{2}
\leq 2(x^{\top}(S+K^{\top}RK)x+2\eta^{\top}RKx+\eta^{\top}R\eta)^{2}(\|x\|_{2}^{4}+1)
+2M_{\theta}^{2}\left[(1-\gamma)^{2}+2\|x^{\prime}\|_{2}^{4}+2\|x\|_{2}^{4}\right](\|x\|_{2}^{4}+1).

Thus, \|h(\theta)\|^{2}_{2} is a polynomial of order 8 in x, \omega and \eta. Hence, we obtain the bound \sigma_{h}^{2}=O\left((M^{2}_{\theta}+\|K\|_{2}^{4})\mathbb{E}_{\mu_{K}}\|x\|_{2}^{8}+\sigma^{4}\,\mathbb{E}_{\mu_{K}}\|x\|^{4}_{2}\right).

We next show that Algorithm 1 also converges if the stochastic semi-gradients are used.

Theorem 3.10.

Suppose that Assumption 3.4 holds and let Algorithm 1 be run with the stochastic semi-gradient update using \alpha=\min\left\{\frac{1-\gamma}{4(1+(n+2)\|D_{K}\|_{F}^{2})},1/\sqrt{N}\right\} to generate \widetilde{V}(x;\tilde{\theta}). Then it holds

(3.33) \mathbb{E}_{\mu_{K}}[(\widetilde{V}(x;\widetilde{\theta})-\widetilde{V}(x;\theta^{\star}))^{2}]\leq\frac{M_{\theta}^{2}+2\sigma_{h}^{2}}{2(1-\gamma)\sqrt{N}-4(1+(n+2)\|D_{K}\|_{F}^{2})}.

Proof 3.11.

The key step to prove (3.31) is based on an important gradient descent inequality. In order to establish this inequality, we have to estimate two terms at first. For the descent term (h¯(θ(s))h¯)(θ(s)θ)(\bar{h}(\theta^{(s)})-\bar{h}^{\star})^{\top}(\theta^{(s)}-\theta^{\star}), we get the following lower bound:

(3.34) (h¯(θ(s))h¯)(θ(s)θ)\displaystyle(\bar{h}(\theta^{(s)})-\bar{h}^{\star})^{\top}(\theta^{(s)}-\theta^{\star})
=\displaystyle= 𝔼x,x{[δ~(x,u,x;θ(s))δ~(x,u,x;θ)]ϕ(x)(θ(s)θ)}\displaystyle\mathbb{E}_{x,x^{\prime}}\{[\tilde{\delta}(x,u,x^{\prime};\theta^{(s)})-\tilde{\delta}(x,u,x^{\prime};\theta^{\star})]\phi(x)^{\top}(\theta^{(s)}-\theta^{\star})\}
=\displaystyle= 𝔼μK[V~(x;θ(s))V~(x;θ)]2γ𝔼x,x(V~(x;θ(s))V~(x;θ))(V~(x;θ(s))V~(x;θ))\displaystyle\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2}-\gamma\mathbb{E}_{x,x^{\prime}}(\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star}))(\widetilde{V}(x^{\prime};\theta^{(s)})-\widetilde{V}(x^{\prime};\theta^{\star}))
\displaystyle\geq (1γ)𝔼μK[V~(x;θ(s))V~(x;θ)]2,\displaystyle(1-\gamma)\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2},

where last inequality follows by Cauchy-Schwarz inequality. Then we split the norm of h¯(θ(s))h¯\bar{h}(\theta^{(s)})-\bar{h}^{\star} into two parts:

(3.35) h¯(θ(s))h¯22=\displaystyle\|\bar{h}(\theta^{(s)})-\bar{h}^{\star}\|^{2}_{2}= 𝔼μK{ϕ(x)[V~(x;θ(s))V~(x;θ)+γV~(x;θ)γV~(x;θ(s))]}22\displaystyle\Big{\|}\mathbb{E}_{\mu_{K}}\big{\{}\phi(x)[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})+\gamma\widetilde{V}(x^{\prime};\theta^{\star})-\gamma\widetilde{V}(x^{\prime};\theta^{(s)})]\big{\}}\Big{\|}_{2}^{2}
\displaystyle\leq 𝔼x,x{ϕ(x)[V~(x;θ(s))V~(x;θ)]}22\displaystyle\Big{\|}\mathbb{E}_{x,x^{\prime}}\big{\{}\phi(x)[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]\big{\}}\Big{\|}_{2}^{2}
+𝔼x,x{γϕ(x)[V~(x;θ(s))V~(x;θ)]}22=I1+I2.\displaystyle+\Big{\|}\mathbb{E}_{x,x^{\prime}}\big{\{}\gamma\phi(x)[\widetilde{V}(x^{\prime};\theta^{(s)})-\widetilde{V}(x^{\prime};\theta^{\star})]\big{\}}\Big{\|}_{2}^{2}=I_{1}+I_{2}.

Then we get the upper bound of I1I_{1} and I2I_{2} by the Cauchy-Schwarz inequality:

(3.36) I1\displaystyle I_{1} =j|𝔼μK{ϕj(x)[V~(x;θ(s))V~(x;θ)]}|2\displaystyle=\sum_{j}\Big{|}\mathbb{E}_{\mu_{K}}\big{\{}\phi_{j}(x)[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]\big{\}}\Big{|}^{2}
j𝔼μK[ϕj(x)]2𝔼μK[V~(x;θ(s))V~(x;θ)]2\displaystyle\leq\sum_{j}\mathbb{E}_{\mu_{K}}[\phi_{j}(x)]^{2}\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2}
=(1+𝔼μKxxF2)𝔼μK[V~(x;θ(s))V~(x;θ)]2\displaystyle=(1+\mathbb{E}_{\mu_{K}}\|xx^{\top}\|_{F}^{2})\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2}
=[1+2DKF2+(Tr[DK])2]𝔼μK[V~(x;θ(s))V~(x;θ)]2\displaystyle=[1+2\|D_{K}\|_{F}^{2}+(\operatorname{Tr}[D_{K}])^{2}]\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2}

In particular, the last equality holds by Proposition A.1. in [23]. Analogously, I2I_{2} has the same upper bound such that

(3.37) h¯(θ(s))h¯222[1+(n+2)DKF2][V~(x;θ(s))V~(x;θ)]2,\|\bar{h}(\theta^{(s)})-\bar{h}^{\star}\|^{2}_{2}\leq 2[1+(n+2)\|D_{K}\|_{F}^{2}][\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2},

where we used (Tr[DK])2nDKF2(\operatorname{Tr}[D_{K}])^{2}\leq n\|D_{K}\|_{F}^{2}.

By (3.34), (3.37) and Lemma 3.2, we obtain

(3.38) θ(s+1)θ22=θ(s)αh¯(θ(s))θ22\displaystyle\|\theta^{(s+1)}-\theta^{\star}\|_{2}^{2}=\|\theta^{(s)}-\alpha\bar{h}(\theta^{(s)})-\theta^{\star}\|_{2}^{2}
=\displaystyle= θ(s)θ222α(h¯(θ(s))h¯)(θ(s)θ)+α2h¯(θ(s))h¯22\displaystyle\|\theta^{(s)}-\theta^{\star}\|_{2}^{2}-2\alpha(\bar{h}(\theta^{(s)})-\bar{h}^{\star})^{\top}(\theta^{(s)}-\theta^{\star})+\alpha^{2}\|\bar{h}(\theta^{(s)})-\bar{h}^{\star}\|^{2}_{2}
\displaystyle\leq θ(s)θ22(2α(1γ)2(1+(n+2)DKF2)α2)𝔼μK[V~(x;θ(s))V~(x;θ)]2.\displaystyle\|\theta^{(s)}-\theta^{\star}\|_{2}^{2}-(2\alpha(1-\gamma)-2(1+(n+2)\|D_{K}\|_{F}^{2})\alpha^{2})\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2}.

The value of the right hand of (3.38) is minimal for α=1γ2(1+(n+2)DKF2)\alpha=\frac{1-\gamma}{2(1+(n+2)\|D_{K}\|_{F}^{2})}. Rearranging (3.38) and summing the inequalities from t=0t=0 to N1N-1 yields:

(3.39) 𝔼μK[V~(x;θ~)V~(x;θ)]2\displaystyle\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\tilde{\theta})-\widetilde{V}(x;\theta^{\star})]^{2}
\displaystyle\leq 1Ns=0N1𝔼μK[V~(x;θ(s))V~(x;θ)]2\displaystyle\frac{1}{N}\sum_{s=0}^{N-1}\mathbb{E}_{\mu_{K}}[\widetilde{V}(x;\theta^{(s)})-\widetilde{V}(x;\theta^{\star})]^{2}
\displaystyle\leq 1Nt=0N12(1+(n+2)DKF2)(1γ)2(θ(s)θ22θ(s+1)θ22)\displaystyle\frac{1}{N}\sum_{t=0}^{N-1}\frac{2(1+(n+2)\|D_{K}\|_{F}^{2})}{(1-\gamma)^{2}}(\|\theta^{(s)}-\theta^{\star}\|^{2}_{2}-\|\theta^{(s+1)}-\theta^{\star}\|^{2}_{2})
\displaystyle\leq 8(1+(n+2)DKF2)Mθ2(1γ)2N.\displaystyle\frac{8(1+(n+2)\|D_{K}\|_{F}^{2})M_{\theta}^{2}}{(1-\gamma)^{2}N}.

This completes the proof.

Furthermore, \mathbb{E}_{\mu_{K}}\|x\|_{2}^{2k} has a uniform upper bound with respect to K on a compact subset of \mathbf{Dom}, since roughly \mathbb{E}_{\mu_{K}}\|x\|_{2}^{2k}=O\left(\frac{\Gamma_{K}^{2k}}{1-\rho^{2k}}\mathbb{E}\|\widetilde{\omega}\|^{2k}_{2}\right). Then we can derive convergence of \theta^{(s)} in the 2-norm by Lemma 3.6.

4 Policy Iteration

Policy iteration (PI) [12] is a fundamental value-based method for Markov decision process (MDP) problems with finitely many actions. This method can also be applied to LQRs, since the state-action value function Q is quadratic and the action minimizing Q can easily be found by solving linear equations.

Given a policy \pi_{K} and the corresponding state-action value function Q_{K} of the form (3.25), the policy \pi_{K} can be improved by selecting, for a fixed state x, the action u^{*} minimizing Q_{K}:

u^{*}=\mathrm{arg}\min_{u}Q_{K}(x,u)
=\mathrm{arg}\min_{u}[x^{\top}\;u^{\top}]\left[\begin{array}{cc}S+\gamma A^{\top}P_{K,\gamma}A&\gamma A^{\top}P_{K,\gamma}B\\ \gamma B^{\top}P_{K,\gamma}A&R+\gamma B^{\top}P_{K,\gamma}B\end{array}\right]\left[\begin{array}{c}x\\ u\end{array}\right]
=-(R+\gamma B^{\top}P_{K,\gamma}B)^{-1}(\gamma B^{\top}P_{K,\gamma}A)x.

Thus, we obtain an improved policy \pi_{K^{\prime}} with

(4.40) K^{\prime}=-(R+\gamma B^{\top}P_{K,\gamma}B)^{-1}(\gamma B^{\top}P_{K,\gamma}A).

Observing the gradient (2.18), there is another form of (4.40):

(4.41) K^{\prime}=K-\frac{1}{2}(R+\gamma B^{\top}P_{K,\gamma}B)^{-1}\nabla J(K)\Sigma_{K,\gamma}^{-1}.

This form corresponds to the Gauss-Newton method in [9] with stepsize \frac{1}{2}. Hence, by the discussion in [9], we obtain convergence of the policy iteration with the exact state-action value function Q_{K}, and in addition the fact that value-based methods are faster than the policy gradient method. Moreover, from Lemma 8 in [9], we obtain that J(K^{\prime})<J(K). This implies that if K is stable, then K^{\prime} is also stable provided that \gamma is sufficiently close to 1, i.e., if \gamma satisfies (2.17) with \nu=1, see Lemma 2.1.

However, we do not know the state-action value function Q exactly, since we do not know the system exactly. Due to the results of the foregoing discussion on policy evaluation, we can obtain an approximation of Q. This approximation is used in a policy iteration scheme, whose convergence we want to analyze. The approximation of the state-action value function has the following form:

\widetilde{Q}(x,u;\Theta)=[x^{\top}\;u^{\top}]\left[\begin{array}{cc}\Theta_{11}&\Theta_{12}\\ \Theta_{21}&\Theta_{22}\end{array}\right]\left[\begin{array}{c}x\\ u\end{array}\right]+\Theta_{0}.

For any stable policy \pi_{K}, we denote the parameters of Q_{K} by \Theta^{*}_{K}. If \|\Theta-\Theta^{*}_{K}\|_{F}\leq\epsilon_{0}, we call \widetilde{Q}(x,u;\Theta) an \epsilon_{0}-approximation of the state-action value function. The policy can also be improved by using the approximate state-action value function:

(4.42) K^{\prime}=-\Theta^{-1}_{22}\Theta_{21}.

Thus, we obtain the policy iteration algorithm.

0:  Stable policy K^{(0)}, the number of steps T.
1:  for s=1,2,\cdots,T do
2:     Evaluate an approximate state-action value function \widetilde{Q}(\cdot;\Theta^{(s-1)}) of the policy K^{(s-1)} by TD-learning with the stochastic semi-gradient.
3:     Policy improvement:
K^{(s)}=-(\Theta_{22}^{(s-1)})^{-1}\Theta_{21}^{(s-1)}.
4:  end for
Algorithm 2 The policy iteration method
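
The policy improvement step of Algorithm 2 only requires reading off the blocks \Theta_{21} and \Theta_{22} of the estimated parameter and solving a small linear system, as in (4.42). A minimal sketch is given below; the block layout of \Theta as an (n+d)\times(n+d) matrix follows (3.25), while the helper name is our own choice.

```python
import numpy as np

def improve_policy(Theta, n):
    """Policy improvement step of Algorithm 2: read off the blocks of the estimated
    (n+d)x(n+d) quadratic part Theta of Q~ (cf. (3.25)) and form K' as in (4.42)."""
    Theta_21 = Theta[n:, :n]                         # estimate of gamma B^T P A      (d x n)
    Theta_22 = Theta[n:, n:]                         # estimate of R + gamma B^T P B  (d x d)
    return -np.linalg.solve(Theta_22, Theta_21)      # K' = -Theta_22^{-1} Theta_21
```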

As we will prove below, policy iteration with approximate state-action value functions also converges in the LQR setting if the error between the approximate value and the true value is small enough. For any stable initial policy K^{(0)} and any \epsilon>0, we assume that the error tolerance \epsilon_{0} between the approximation \Theta and the true value \Theta^{\star} satisfies

(4.43) \epsilon_{0}\leq\min\left\{\frac{1}{2\|R^{-1}\|_{2}},\frac{C_{1}\sqrt{\epsilon}}{\sqrt{\|\Sigma_{K^{*},\gamma}\|_{2}}J(K^{*})}\right\},

where C_{1} only depends on \sigma_{\mathrm{min}}(D_{\widetilde{\omega}}), \sigma_{\mathrm{min}}(S), \|R^{-1}\|_{2}, \|A\|_{2}, \|B\|_{2} and (1-\gamma)J(K^{(0)}).

Theorem 4.1.

For any stable initial policy K^{(0)} and any \epsilon>0, suppose that an \epsilon_{0}-approximation of the state-action value function is known, where \epsilon_{0} satisfies (4.43). If \gamma is sufficiently close to 1, then for T\geq\left[\frac{2\|\Sigma_{K^{*},\gamma}\|_{2}}{\gamma^{2}\sigma_{\min}(D_{\widetilde{\omega}})}-1\right]\log\frac{J(K^{(0)})-J(K^{*})}{\epsilon}, Algorithm 2 yields stable policies K^{(1)},\ldots,K^{(T)} which are elements of \text{\bf{Dom}}_{K^{(0)},2} and satisfy

(4.44) J(K^{(T)})-J(K^{*})\leq\epsilon.

Proof 4.2.

We first introduce some notation which is used in this proof. Let Q~(,Θ)\widetilde{Q}(\cdot,\Theta) be an ϵ0\epsilon_{0}-approximation of the state action value function QK()Q_{K}(\cdot) and denote by Θ\Theta^{\star} the true value, i.e., QK()=Q~(,Θ)Q_{K}(\cdot)=\widetilde{Q}(\cdot,\Theta^{\star}). Then the improved policy is given by K=Θ221Θ21K^{\prime}=-\Theta_{22}^{-1}\Theta_{21}. Next, we obtain from Lemma 6 in [9] that

(4.45) J(K)J(K)\displaystyle J(K^{\prime})-J(K)
=\displaystyle= Tr[2ΣK,γ(KK)EK+ΣK,γ(KK)(R+γBPK,γB)(KK)],\displaystyle\operatorname{Tr}[2\Sigma_{K^{\prime},\gamma}(K^{\prime}-K)^{\top}E_{K}+\Sigma_{K^{\prime},\gamma}(K^{\prime}-K)^{\top}(R+\gamma B^{\top}P_{K,\gamma}B)(K^{\prime}-K)],

where EK=(R+γBPK,γB)K+γBPK,γAE_{K}=(R+\gamma B^{\top}P_{K,\gamma}B)K+\gamma B^{\top}P_{K,\gamma}A. Furthermore, we note that R+γBPK,γB=Θ22R+\gamma B^{\top}P_{K,\gamma}B=\Theta^{*}_{22} and γBPK,γA=Θ21\gamma B^{\top}P_{K,\gamma}A=\Theta^{*}_{21} holds by (4.40). Let Δ1:=Θ21Θ21\Delta_{1}:=\Theta_{21}-\Theta_{21}^{*} and Δ2:=Θ22Θ22\Delta_{2}:=\Theta_{22}-\Theta_{22}^{*}. Then we compute the difference between KK^{\prime} and the policy (Θ22)1Θ21-(\Theta^{*}_{22})^{-1}\Theta^{*}_{21} generated by the true policy iteration:

(4.46) K+(Θ22)1Θ21=\displaystyle K^{\prime}+(\Theta^{*}_{22})^{-1}\Theta^{*}_{21}= (Θ22)1(I+Δ2(Θ22)1)1(Θ21+Δ1)+(Θ22)1Θ21\displaystyle-(\Theta^{*}_{22})^{-1}(I+\Delta_{2}(\Theta^{*}_{22})^{-1})^{-1}(\Theta^{*}_{21}+\Delta_{1})+(\Theta^{*}_{22})^{-1}\Theta^{*}_{21}
=\displaystyle= (Θ22)1(I+Δ2(Θ22)1)1(Δ2(Θ22)1Θ21Δ1)\displaystyle(\Theta^{*}_{22})^{-1}(I+\Delta_{2}(\Theta^{*}_{22})^{-1})^{-1}(\Delta_{2}(\Theta^{*}_{22})^{-1}\Theta^{*}_{21}-\Delta_{1})
=\displaystyle= (Θ22)1Δ3\displaystyle(\Theta^{*}_{22})^{-1}\Delta_{3}

with Δ3=(I+Δ2(Θ22)1)1(Δ2(Θ22)1Θ21Δ1)\Delta_{3}=(I+\Delta_{2}(\Theta^{*}_{22})^{-1})^{-1}(\Delta_{2}(\Theta^{*}_{22})^{-1}\Theta^{*}_{21}-\Delta_{1}). Next, we note that (4.41) implies

(4.47) K+(Θ22)1Θ21=(Θ22)1EKK+(\Theta^{*}_{22})^{-1}\Theta^{*}_{21}=(\Theta_{22}^{\star})^{-1}E_{K}

which together with (4.46) yields

(4.48) KK=(Θ22)1(Δ3EK)K^{\prime}-K=(\Theta^{*}_{22})^{-1}(\Delta_{3}-E_{K})

Since ΘΘKFϵ0\|\Theta-\Theta_{K}^{*}\|_{F}\leq\epsilon_{0}, it holds Δ1Fϵ0\|\Delta_{1}\|_{F}\leq\epsilon_{0}, Δ2Fϵ0\|\Delta_{2}\|_{F}\leq\epsilon_{0} and Δ3F\|\Delta_{3}\|_{F} has the upper bound

(4.49) Δ3Fϵ01ϵ0R12(γR12B2A2PK,γ2+1)C0ϵ0(1γ)J(K)+2ϵ0,\|\Delta_{3}\|_{F}\leq\frac{\epsilon_{0}}{1-\epsilon_{0}\|R^{-1}\|_{2}}(\gamma\|R^{-1}\|_{2}\|B\|_{2}\|A\|_{2}\|P_{K,\gamma}\|_{2}+1)\leq C_{0}\epsilon_{0}(1-\gamma)J(K)+2\epsilon_{0},

where the second inequality follows by (2.19) with

C0=2R12B2A2σmin(Dω~).\displaystyle C_{0}=\frac{2\|R^{-1}\|_{2}\|B\|_{2}\|A\|_{2}}{\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})}.

Using (4.45) and (4.48), the differences of the cumulative cost between the original policy KK and the improved policy KK^{\prime} can be represented as:

(4.50) J(K)J(K)=Tr[ΣK,γEK(Θ22)1EK+ΣK,γΔ3(Θ22)1Δ3].\displaystyle J(K^{\prime})-J(K)=\operatorname{Tr}[-\Sigma_{K^{\prime},\gamma}E_{K}^{\top}(\Theta_{22}^{*})^{-1}E_{K}+\Sigma_{K^{\prime},\gamma}\Delta_{3}^{\top}(\Theta^{*}_{22})^{-1}\Delta_{3}].

For the first term in (4.50), we have

(4.51) Tr[ΣK,γEK(Θ22)1EK]\displaystyle\operatorname{Tr}[\Sigma_{K^{\prime},\gamma}E_{K}^{\top}(\Theta_{22}^{*})^{-1}E_{K}]\geq σmin(ΣK,γ)Tr[EK(Θ22)1EK]\displaystyle\sigma_{\mathrm{min}}(\Sigma_{K^{\prime},\gamma})\operatorname{Tr}[E_{K}^{\top}(\Theta_{22}^{*})^{-1}E_{K}]
\displaystyle\geq γ2σmin(Dω~)ΣK,γ2(J(K)J(K)),\displaystyle\frac{\gamma^{2}\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})}{\|\Sigma_{K^{*},\gamma}\|_{2}}(J(K)-J(K^{*})),

where the second inequality is derived by the fact ΣK,γγ2Dω~\Sigma_{K^{\prime},\gamma}\succeq\gamma^{2}D_{\widetilde{\omega}} and Lemma 11 in the supplementary material of [9]. More precisely, one can check that this lemma is also valid for the setting in this paper. For the second term in (4.50), using (2.20) and (4.49) we get the upper bound

(4.52) Tr[ΣK,γΔ3(Θ22)1Δ3]\displaystyle\operatorname{Tr}[\Sigma_{K^{\prime},\gamma}\Delta_{3}^{\top}(\Theta^{*}_{22})^{-1}\Delta_{3}]
\displaystyle\leq ΣK,γFTr[Δ3(Θ22)1Δ3]ΣK,γFR12Tr[Δ3Δ3]\displaystyle\|\Sigma_{K^{\prime},\gamma}\|_{F}\operatorname{Tr}[\Delta_{3}^{\top}(\Theta^{*}_{22})^{-1}\Delta_{3}]\leq\|\Sigma_{K^{\prime},\gamma}\|_{F}\|R^{-1}\|_{2}\operatorname{Tr}[\Delta_{3}^{\top}\Delta_{3}]
\displaystyle\leq R12[C0(1γ)J(K)+2]2ϵ02σmin(S)J(K)\displaystyle\|R^{-1}\|_{2}\frac{[C_{0}(1-\gamma)J(K)+2]^{2}\epsilon_{0}^{2}}{\sigma_{\mathrm{min}}(S)}J(K^{\prime})

By (4.51) and (4.52), it is direct to obtain the following inequality:

(4.53) J(K)J(K)1α1β(J(K)J(K))+β1βJ(K),J(K^{\prime})-J(K^{*})\leq\frac{1-\alpha}{1-\beta}(J(K)-J(K^{*}))+\frac{\beta}{1-\beta}J(K^{*}),

where α=γ2σmin(Dω~)ΣK,γ2<1\alpha=\frac{\gamma^{2}\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})}{\|\Sigma_{K^{*},\gamma}\|_{2}}<1 and β=R12[C0(1γ)J(K)+2]2ϵ02σmin(S)\beta=\|R^{-1}\|_{2}\frac{[C_{0}(1-\gamma)J(K)+2]^{2}\epsilon_{0}^{2}}{\sigma_{\mathrm{min}}(S)}.

We can start with K(0)K^{(0)} and set β=R12[2C0(1γ)J(K(0))+2]2ϵ02σmin(S)\beta=\|R^{-1}\|_{2}\frac{[2C_{0}(1-\gamma)J(K^{(0)})+2]^{2}\epsilon_{0}^{2}}{\sigma_{\mathrm{min}}(S)}. Then let ϵ0\epsilon_{0} be small enough such that βα/2\beta\leq\alpha/2. Thus, the bound of ϵ0\epsilon_{0} is O(12C0(1γ)J(K(0))+2)O(\frac{1}{2C_{0}(1-\gamma)J(K^{(0)})+2}). Next, we show that

(4.54) J(K(t))2J(K(0)),J(K^{(t)})\leq 2J(K^{(0)}),

which implies that the inequality (4.53) holds with K=K(t)K=K^{(t)} and K=K(t+1)K^{\prime}=K^{(t+1)} for all iterates.

We use induction to prove this uniform bound (4.54). When t=0t=0, this inequality (4.54) holds obviously. Then we assume that (4.54) holds with K(t)K^{(t)}. Using βα/2\beta\leq\alpha/2 in connection with (4.53), we observe that if J(K(t))2J(K)J(K^{(t)})\geq 2J(K^{*}), then

J(K(t+1))1α1βJ(K(t))+(11αβ1β)J(K)1α/21βJ(Kt)2J(K(0))J(K^{(t+1)})\leq\frac{1-\alpha}{1-\beta}J(K^{(t)})+(1-\frac{1-\alpha-\beta}{1-\beta})J(K^{*})\leq\frac{1-\alpha/2}{1-\beta}J(K^{t})\leq 2J(K^{(0)})

by the inequality (4.53). Otherwise it holds J(K(t+1))2α1βJ(K)<2J(K(0))J(K^{(t+1)})\leq\frac{2-\alpha}{1-\beta}J(K^{*})<2J(K^{(0)}). Thus, the bound (4.54) holds for all tt.

Hence, we have following inequality:

(4.55) J(K(t))J(K)(1α1β)t(J(K(0))J(K))+βαβJ(K).J(K^{(t)})-J(K^{*})\leq\left(\frac{1-\alpha}{1-\beta}\right)^{t}(J(K^{(0)})-J(K^{*}))+\frac{\beta}{\alpha-\beta}J(K^{*}).

Furthermore, we also require that βαβJ(K)2βαJ(K)ϵ2\frac{\beta}{\alpha-\beta}J(K^{*})\leq\frac{2\beta}{\alpha}J(K^{*})\leq\frac{\epsilon}{2}, which is equivalent to βαϵ4J(K)\beta\leq\frac{\alpha\epsilon}{4J(K^{*})}. Thus, the upper bound of ϵ0\epsilon_{0} should be C1ϵΣK,γ2J(K)C_{1}\frac{\sqrt{\epsilon}}{\sqrt{\|\Sigma_{K^{*},\gamma}\|_{2}}J(K^{*})}, where

C1=O(1[C0(1γ)J(K(0))+2]).C_{1}=O\left(\frac{1}{[C_{0}(1-\gamma)J(K^{(0)})+2]}\right).

Then we can verify that (1α1β)T(J(K(0))J(K))ϵ2\left(\frac{1-\alpha}{1-\beta}\right)^{T}(J(K^{(0)})-J(K^{*}))\leq\frac{\epsilon}{2} such that (4.44) is proved. Finally, we assume w.l.o.g. that J(K(0))J(K)>ϵJ(K^{(0)})-J(K^{*})>\epsilon. We can guarantee that βαβϵ2J(K)J(K(0))J(K)2J(K)\frac{\beta}{\alpha-\beta}\leq\frac{\epsilon}{2J(K^{*})}\leq\frac{J(K^{(0)})-J(K^{*})}{2J(K^{*})}. Inserting this in (4.55), we obtain

J(K(t))J(K)\displaystyle J(K^{(t)})-J(K^{*}) [(1α1β)t+12](J(K(0))J(K))\displaystyle\leq\left[\left(\frac{1-\alpha}{1-\beta}\right)^{t}+\frac{1}{2}\right](J(K^{(0)})-J(K^{*}))
2(J(K(0))J(K))\displaystyle\leq 2(J(K^{(0)})-J(K^{*})).

Hence, K(t)DomK(0),2DomK^{(t)}\in\mathrm{Dom}_{K^{(0)},2}\subset\mathrm{Dom} holds by Lemma 2.1 for all iterates, if γ\gamma is sufficiently close to 11.

Using the definition of α\alpha, we observe that the approximation parameter ϵ0\epsilon_{0} has an upper bound O((1γ)32ϵ12)O\left({(1-\gamma)^{\frac{3}{2}}\epsilon^{\frac{1}{2}}}\right). Thus, each TD-learning step for a fixed K(t)K^{(t)} requires O(1(1γ)5ϵ)O\left(\frac{1}{(1-\gamma)^{5}\epsilon}\right) samples by Lemma 3.6 and Theorem 3.8. However, if we use stochastic semi-gradient descent, the sample complexity for each policy evaluation becomes O(1(1γ)8ϵ2)O\left(\frac{1}{(1-\gamma)^{8}\epsilon^{2}}\right) by Theorem 3.10.

5 The Policy Gradient Method

In RL, the policy gradient method [25, 21] is widely used. In this section, we apply the policy gradient method to the problem (2.3) and analyze the convergence of this method.

In order to compute the policy gradient, we have to know the score function KlogπK\nabla_{K}\log\pi_{K} of the policy πK\pi_{K}, which is given by

(5.56) KlogπK(u|x)=1σ2(uKx)x.\displaystyle\quad\nabla_{K}\log\pi_{K}(u|x)=\frac{1}{\sigma^{2}}(u-Kx)x^{\top}.
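
For completeness, (5.56) follows directly from the Gaussian form of the policy: assuming, consistently with the sampling step u_{t}=Kx_{t}+\eta_{t} used in Algorithm 3 below, that \pi_{K}(\cdot|x) is the density of \mathcal{N}(Kx,\sigma^{2}I_{d}), we have

\log\pi_{K}(u|x)=-\frac{\|u-Kx\|_{2}^{2}}{2\sigma^{2}}+\mathrm{const},\qquad\nabla_{K}\log\pi_{K}(u|x)=\frac{1}{\sigma^{2}}(u-Kx)x^{\top}.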

By the policy gradient theorem in [21], we obtain the policy gradient

(5.57) G(K)\displaystyle G(K) =1σ2𝔼pK[t=0(utKxt)xtk=tγkck]\displaystyle=\frac{1}{\sigma^{2}}\mathbb{E}_{p_{K}}\left[\sum_{t=0}^{\infty}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{\infty}\gamma^{k}c_{k}\right]
=1σ2𝔼pK[t=0γt(utKxt)xtQK(xt,ut)].\displaystyle=\frac{1}{\sigma^{2}}\mathbb{E}_{p_{K}}\left[\sum_{t=0}^{\infty}\gamma^{t}(u_{t}-Kx_{t})x_{t}^{\top}Q_{K}(x_{t},u_{t})\right].

The policy gradient (5.57) is equivalent to J(K)\nabla J(K), as shown in the proof of Lemma 5.1. For the representation of the gradient in (5.57), it is straightforward to design an estimator. After obtaining the triples {xt,ut,ct}t=0L\{x_{t},u_{t},c_{t}\}_{t=0}^{L} generated by the policy πK\pi_{K} in problem (2.3), we can compute the sample gradient:

(5.58) G^(L)(K)=1σ2t=0L[(utKxt)xtk=tLγkck].\hat{G}^{(L)}(K)=\frac{1}{\sigma^{2}}\sum_{t=0}^{L}\left[(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{L}\gamma^{k}c_{k}\right].

We apply the stochastic gradient descent method to the problem (2.3), which is summarized in Algorithm 3.

0:  Initial policy K(0)K^{(0)}, roll-out length LL, step size α\alpha.
1:  for s=0,1,,T1s=0,1,\cdots,T-1 do
2:     Simulate ut=K(s)xt+ηtu_{t}=K^{(s)}x_{t}+\eta_{t} for LL steps starting from x0=0x_{0}=0 and obtain D(s)={xt(s),ut(s),ct(s)}t=0LD^{(s)}=\{x^{(s)}_{t},u^{(s)}_{t},c^{(s)}_{t}\}_{t=0}^{L}.
3:     Compute G^(L)(K(s))\hat{G}^{(L)}(K^{(s)}) by (5.58).
4:     Update policy K(s+1)=K(s)αG^(L)(K(s))K^{(s+1)}=K^{(s)}-\alpha\hat{G}^{(L)}(K^{(s)}).
5:  end for
Algorithm 3 The Policy Gradient Method
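
As an illustration, a minimal numerical sketch of Algorithm 3 could look as follows. The function names rollout, sample_gradient and policy_gradient, the process-noise covariance noise_cov and all default numerical values are assumptions made for this sketch and are not part of the analysis; sample_gradient implements the estimator (5.58) and policy_gradient performs the gradient step of line 4.

import numpy as np

def rollout(A, B, S, R, K, sigma, L, rng, noise_cov=None):
    """Simulate u_t = K x_t + eta_t for t = 0,...,L starting from x_0 = 0 and
    return the states, actions and stage costs c_t = x_t^T S x_t + u_t^T R u_t."""
    n, d = A.shape[0], B.shape[1]
    W = np.eye(n) if noise_cov is None else noise_cov   # assumed process-noise covariance
    xs, us, cs = [], [], []
    x = np.zeros(n)
    for _ in range(L + 1):
        eta = sigma * rng.standard_normal(d)             # exploration noise eta_t ~ N(0, sigma^2 I)
        u = K @ x + eta
        xs.append(x)
        us.append(u)
        cs.append(x @ S @ x + u @ R @ u)
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), W)
    return np.array(xs), np.array(us), np.array(cs)

def sample_gradient(xs, us, cs, K, sigma, gamma):
    """Biased estimate (5.58): (1/sigma^2) * sum_t (u_t - K x_t) x_t^T * sum_{k>=t} gamma^k c_k."""
    L = len(cs) - 1
    disc = gamma ** np.arange(L + 1) * cs                # gamma^k c_k
    tails = np.cumsum(disc[::-1])[::-1]                  # tails[t] = sum_{k=t}^{L} gamma^k c_k
    G = np.zeros_like(K, dtype=float)
    for t in range(L + 1):
        G += tails[t] * np.outer(us[t] - K @ xs[t], xs[t])
    return G / sigma ** 2

def policy_gradient(A, B, S, R, K0, sigma=0.1, gamma=0.95, L=200, alpha=1e-5, T=100, seed=0):
    """Sketch of Algorithm 3: plain stochastic gradient descent on J(K)."""
    rng = np.random.default_rng(seed)
    K = np.array(K0, dtype=float)
    for _ in range(T):
        xs, us, cs = rollout(A, B, S, R, K, sigma, L, rng)
        K = K - alpha * sample_gradient(xs, us, cs, K, sigma, gamma)
    return K

In accordance with the assumptions of Theorem 5.4 below, the initial policy K0 has to be stabilizing; otherwise the simulated costs blow up and the estimate (5.58) becomes useless.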

Since the estimator G^(L)(K)\hat{G}^{(L)}(K) is biased, we have to control the bias by increasing the length LL. The next lemma shows how the parameter LL influences the bias and the variance of the estimator G^(L)(K)\hat{G}^{(L)}(K).

Lemma 5.1.

Let KK be a stable policy. Then it holds

(5.59) J(K)𝔼[G^(L)(K)]2\displaystyle\left\|\nabla J(K)-\mathbb{E}[\hat{G}^{(L)}(K)]\right\|_{2} \displaystyle\leq C2ΓK21ρ2γL+1(ΓK21ρ2+11γ),\displaystyle C_{2}\frac{\Gamma_{K}^{2}}{1-\rho^{2}}\gamma^{L+1}(\frac{\Gamma_{K}^{2}}{1-\rho^{2}}+\frac{1}{1-\gamma}),
(5.60) 𝔼G^(L)(K)F2\displaystyle\mathbb{E}\left\|\hat{G}^{(L)}(K)\right\|^{2}_{F} \displaystyle\leq C3L3(1γ2),\displaystyle C_{3}\frac{L^{3}}{(1-\gamma^{2})},

where ρ(ρ(A+BK),1)\rho\in(\rho(A+BK),1) and the constants C2C_{2} and C3C_{3} depend on A2\|A\|_{2}, B2\|B\|_{2}, S2\|S\|_{2}, R2\|R\|_{2}, K2\|K\|_{2}, ΓK\Gamma_{K}, 11ρ2\frac{1}{1-\rho^{2}}, ω~\widetilde{\omega}, nn, dd and (1γ)J(K)(1-\gamma)J(K).

Proof 5.2.

Define the σ\sigma-fields t\mathcal{F}_{t} generated by (x0,x1,,xt)(x_{0},x_{1},\cdots,x_{t}) and the operator 𝒯(X)=(A+BK)X(A+BK)\mathcal{T}(X)=(A+BK)^{\top}X(A+BK). We observe that ηt\eta_{t} and ωt\omega_{t} are independent of t\mathcal{F}_{t}. By the definition of the value function in (3.24), it is straightforward to obtain the conditional expectation of the cumulative cost:

𝔼[k=tγkck|t]=γtVK(xt)=γtxtPK,γxt+γt+11γTr[PK,γDω~]+γtσ2Tr[R]1γ.\mathbb{E}\left[\sum_{k=t}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t}\right]=\gamma^{t}V_{K}(x_{t})=\gamma^{t}x_{t}^{\top}P_{K,\gamma}x_{t}+\frac{\gamma^{t+1}}{1-\gamma}\operatorname{Tr}[P_{K,\gamma}D_{\widetilde{\omega}}]+\frac{\gamma^{t}\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}.

First, we claim that G(K)G(K) is equivalent to J(K)\nabla J(K). To this end, we verify the following identity:

(5.61) 𝔼[(utKxt)xtk=tγkck|t]\displaystyle\mathbb{E}\left[(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= 𝔼[ηtxtk=t+1γkck|t]+𝔼[ηtxtγtct|t]\displaystyle\mathbb{E}\left[\eta_{t}x_{t}^{\top}\sum_{k=t+1}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t}\right]+\mathbb{E}\left[\eta_{t}x_{t}^{\top}\gamma^{t}c_{t}\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= 𝔼[𝔼[ηtxtk=t+1γkck|t+1]|t]\displaystyle\mathbb{E}\left[\mathbb{E}[\eta_{t}x_{t}^{\top}\sum_{k=t+1}^{\infty}\gamma^{k}c_{k}|\mathcal{F}_{t+1}]\Big{|}\mathcal{F}_{t}\right]
+γt𝔼[ηt(xtSxt+(Kxt+ηt)R(Kxt+ηt))|t]xt\displaystyle+\gamma^{t}\mathbb{E}\left[\eta_{t}(x_{t}^{\top}Sx_{t}+(Kx_{t}+\eta_{t})^{\top}R(Kx_{t}+\eta_{t}))\Big{|}\mathcal{F}_{t}\right]x_{t}^{\top}
=\displaystyle= 𝔼[ηtγt+1VK(xt+1)|t]xt+2σ2γtRKxtxt\mathbb{E}\left[\eta_{t}\gamma^{t+1}V_{K}(x_{t+1})\Big{|}\mathcal{F}_{t}\right]x_{t}^{\top}+2\sigma^{2}\gamma^{t}RKx_{t}x^{\top}_{t}
=\displaystyle= γt+1𝔼[ηtxt+1PK,γxt+1|t]xt+2σ2γtRKxtxt\gamma^{t+1}\mathbb{E}\left[\eta_{t}x_{t+1}^{\top}P_{K,\gamma}x_{t+1}\Big{|}\mathcal{F}_{t}\right]x_{t}^{\top}+2\sigma^{2}\gamma^{t}RKx_{t}x^{\top}_{t}
=\displaystyle= 2σ2γt[γBPK,γ(A+BK)+RK]xtxt.\displaystyle 2\sigma^{2}\gamma^{t}[\gamma B^{\top}P_{K,\gamma}(A+BK)+RK]x_{t}x_{t}^{\top}.

The third equality is valid since 𝔼[ηt]=0\mathbb{E}[\eta_{t}]=0 and

(5.62) 𝔼[ηtηtBPK,γBηt|t]=0,\mathbb{E}\left[\eta_{t}\eta_{t}^{\top}B^{\top}P_{K,\gamma}B\eta_{t}\Big{|}\mathcal{F}_{t}\right]=0,

which holds due to the symmetry of ηt\eta_{t}. In the last equality of (5.61), we have used similar arguments in connection with the fact xt+1=(A+BK)xt+ωt+Bηtx_{t+1}=(A+BK)x_{t}+\omega_{t}+B\eta_{t}. Taking the sum t=0()\sum\limits_{t=0}^{\infty}(\cdot), applying 𝔼ρK[]\mathbb{E}_{\rho_{K}}[\cdot] to both sides of (5.61) and multiplying by 1σ2\frac{1}{\sigma^{2}} yields G(K)=J(K)G(K)=\nabla J(K), see (2.18).

Since G^(L)(K)\hat{G}^{(L)}(K) is the estimator of J(K)\nabla J(K), we focus on the bound of the bias and split the bias into two parts

(5.63) J(K)𝔼[G^(L)(K)]\displaystyle\nabla J(K)-\mathbb{E}[\hat{G}^{(L)}(K)]
=\displaystyle= 1σ2𝔼[t=L+1(utKxt)xtk=tγkck]+1σ2𝔼[t=0L(utKxt)xtk=L+1γkck],\displaystyle\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=L+1}^{\infty}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{\infty}\gamma^{k}c_{k}\right]+\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=0}^{L}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=L+1}^{\infty}\gamma^{k}c_{k}\right],

where we have used that G(K)=J(K)G(K)=\nabla J(K). In the following, we will further simplify and estimate the above two terms, respectively. To this end, we show for each tLt\leq L that

(5.64) 𝔼[(utKxt)xtk=L+1γkck|t]=2σ2γL+1B𝒯Lt(PK,γ)(A+BK)xtxt.\mathbb{E}\left[(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=L+1}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t}\right]\\ =2\sigma^{2}\gamma^{L+1}B^{\top}\mathcal{T}^{L-t}(P_{K,\gamma})(A+BK)x_{t}x_{t}^{\top}.

For t<Lt<L, we observe that

(5.65) 𝔼[(utKxt)xtk=L+1γkck|t]\displaystyle\mathbb{E}\left[(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=L+1}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= 𝔼[𝔼[ηtxtk=L+1γkck|L+1]|t]\displaystyle\mathbb{E}\left[\mathbb{E}\left[\eta_{t}x_{t}^{\top}\sum_{k=L+1}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{L+1}\right]\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= γL+1𝔼[ηtxtVK(xL+1)|t]\displaystyle\gamma^{L+1}\mathbb{E}\left[\eta_{t}x_{t}^{\top}V_{K}(x_{L+1})\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= γL+1𝔼[ηt[(A+BK)xL+ω~L]PK,γ[(A+BK)xL+ω~L]|t]xt\displaystyle\gamma^{L+1}\mathbb{E}\left[\eta_{t}[(A+BK)x_{L}+\widetilde{\omega}_{L}]^{\top}P_{K,\gamma}[(A+BK)x_{L}+\widetilde{\omega}_{L}]\Big{|}\mathcal{F}_{t}\right]x_{t}^{\top}
=\displaystyle= γL+1𝔼[ηtxL𝒯(PK,γ)xL|t]xt\gamma^{L+1}\mathbb{E}\left[\eta_{t}x_{L}^{\top}\mathcal{T}(P_{K,\gamma})x_{L}\Big{|}\mathcal{F}_{t}\right]x_{t}^{\top}\quad(\text{since }\widetilde{\omega}_{L}\text{ is independent of }\eta_{t})
\displaystyle\cdots
=\displaystyle= γL+1𝔼{ηtxt+1𝒯Lt(PK,γ)xt+1|t}xt\displaystyle\gamma^{L+1}\mathbb{E}\left\{\eta_{t}x_{t+1}^{\top}\mathcal{T}^{L-t}(P_{K,\gamma})x_{t+1}\Big{|}\mathcal{F}_{t}\right\}x_{t}^{\top}
=\displaystyle= 2σ2γL+1B𝒯Lt(PK,γ)(A+BK)xtxt,\displaystyle 2\sigma^{2}\gamma^{L+1}B^{\top}\mathcal{T}^{L-t}(P_{K,\gamma})(A+BK)x_{t}x_{t}^{\top},

and for t=Lt=L it holds

(5.66) 𝔼[(utKxt)xtk=t+1γkck|t]\displaystyle\mathbb{E}\left[(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t+1}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= 𝔼[𝔼[ηtxtk=t+1γkck|t+1]|t]\displaystyle\mathbb{E}\left[\mathbb{E}\left[\eta_{t}x_{t}^{\top}\sum_{k=t+1}^{\infty}\gamma^{k}c_{k}\Big{|}\mathcal{F}_{t+1}\right]\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= γt+1𝔼[ηtxtVK(xt+1)|t]\displaystyle\gamma^{t+1}\mathbb{E}\left[\eta_{t}x_{t}^{\top}V_{K}(x_{t+1})\Big{|}\mathcal{F}_{t}\right]
=\displaystyle= γt+1𝔼[ηt[(A+BK)xt+ω~t]PK,γ[(A+BK)xt+ω~t]|t]xt\displaystyle\gamma^{t+1}\mathbb{E}\left[\eta_{t}[(A+BK)x_{t}+\widetilde{\omega}_{t}]^{\top}P_{K,\gamma}[(A+BK)x_{t}+\widetilde{\omega}_{t}]\Big{|}\mathcal{F}_{t}\right]x_{t}^{\top}
=\displaystyle= 2σ2γt+1BPK,γ(A+BK)xtxt.\displaystyle 2\sigma^{2}\gamma^{t+1}B^{\top}P_{K,\gamma}(A+BK)x_{t}x_{t}^{\top}.

In (5.65) and (5.66), similar arguments as in (5.61) are used, in particular 𝔼[ηt]=0\mathbb{E}[\eta_{t}]=0. Hence, (5.64) holds.

Since KK is stable, (2.4) yields

(5.67) 𝔼[xtxt]2ΓK2k=0tρ2kDω~2\left\|\mathbb{E}[x_{t}x_{t}^{\top}]\right\|_{2}\leq{\Gamma_{K}^{2}}\sum_{k=0}^{t}\rho^{2k}\|D_{\widetilde{\omega}}\|_{2}

for some constant 0<ρ<10<\rho<1. Using this, (2.19) and (5.64), we can estimate the second term in (5.63) as follows

(5.68) 1σ2𝔼[t=0L(utKxt)xtk=L+1γkck]2\displaystyle\left\|\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=0}^{L}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=L+1}^{\infty}\gamma^{k}c_{k}\right]\right\|_{2}
\displaystyle\leq 2γL+1B2t=0LPK,γ2ΓK3ρ2L2t+1𝔼[xtxt]2{2\gamma^{L+1}}\|B\|_{2}\sum_{t=0}^{L}\|P_{K,\gamma}\|_{2}\Gamma_{K}^{3}\rho^{2L-2t+1}\left\|\mathbb{E}[x_{t}x_{t}^{\top}]\right\|_{2}
\displaystyle\leq 2γL+1B2(1γ)γσmin(Dω~)J(K)ΓK5(1ρ2)2Dω~2.\displaystyle 2\gamma^{L+1}\|B\|_{2}\frac{(1-\gamma)}{\gamma\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})}J(K)\frac{\Gamma_{K}^{5}}{(1-\rho^{2})^{2}}\|D_{\widetilde{\omega}}\|_{2}.

By using (5.61), (5.67) and (2.19), the first term in (5.63) can be estimated as:

(5.69) 1σ2𝔼[t=L+1(utKxt)xtk=tγkck]2\displaystyle\left\|\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=L+1}^{\infty}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{\infty}\gamma^{k}c_{k}\right]\right\|_{2}
\displaystyle\leq 2(R2K2+ΓKB2PK,γ2)t=L+1ΓK21ρ2γtDω~2\displaystyle 2\Big{(}\|R\|_{2}\|K\|_{2}+\Gamma_{K}\|B\|_{2}\|P_{K,\gamma}\|_{2}\Big{)}\sum_{t=L+1}^{\infty}\frac{\Gamma_{K}^{2}}{1-\rho^{2}}\gamma^{t}\|D_{\widetilde{\omega}}\|_{2}
\displaystyle\leq 2(R2K2+ΓKB2(1γ)γσmin(Dω~)J(K))Dω~2γL+1ΓK2(1ρ2)(1γ).\displaystyle 2\Big{(}\|R\|_{2}\|K\|_{2}+\Gamma_{K}\|B\|_{2}\frac{(1-\gamma)}{\gamma\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})}J(K)\Big{)}\|D_{\widetilde{\omega}}\|_{2}\frac{\gamma^{L+1}\Gamma_{K}^{2}}{(1-\rho^{2})(1-\gamma)}.

Combining (5.68) and (5.69) yields (5.59) with

C2=4Dω~2max{ΓKB2(1γ)J(K)γσmin(Dω~),R2K2}.C_{2}=4\|D_{\widetilde{\omega}}\|_{2}\cdot\max\left\{\Gamma_{K}\|B\|_{2}\frac{(1-\gamma)J(K)}{\gamma\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})},\|R\|_{2}\|K\|_{2}\right\}.

Finally, we derive a bound for the variance of G^(L)\hat{G}^{(L)}.

(5.70) 𝔼G^(L)(K)F2=\displaystyle\mathbb{E}\left\|\hat{G}^{(L)}(K)\right\|^{2}_{F}= 1σ4𝔼t=0L(utKxt)xtk=tLγkckF2\displaystyle\frac{1}{\sigma^{4}}\mathbb{E}\left\|\sum_{t=0}^{L}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{L}\gamma^{k}c_{k}\right\|_{F}^{2}
\displaystyle\leq Lσ4t=1L𝔼[xt22ηt22(k=tLγkck)2]\displaystyle\frac{L}{\sigma^{4}}\sum_{t=1}^{L}\mathbb{E}\left[\|x_{t}\|^{2}_{2}\|\eta_{t}\|_{2}^{2}\Big{(}\sum_{k=t}^{L}\gamma^{k}c_{k}\Big{)}^{2}\right]
\displaystyle\leq Lσ4t=1L(Lt+1)𝔼[xt22ηt22k=tLγ2kck2]\displaystyle\frac{L}{\sigma^{4}}\sum_{t=1}^{L}(L-t+1)\mathbb{E}\left[\|x_{t}\|^{2}_{2}\|\eta_{t}\|_{2}^{2}\sum_{k=t}^{L}\gamma^{2k}c^{2}_{k}\right]
\displaystyle\leq L2σ4t=1Lγ2tk=tL𝔼[xt22ηt22ck2].\displaystyle\frac{L^{2}}{\sigma^{4}}\sum_{t=1}^{L}\gamma^{2t}\sum_{k=t}^{L}\mathbb{E}\left[\|x_{t}\|^{2}_{2}\|\eta_{t}\|_{2}^{2}c^{2}_{k}\right].

Hence, we only need to estimate each term of (5.70):

(5.71) 𝔼[xt22ηt22ck2]O(K24+S22)𝔼[xk24xt22ηt22]+O(R22)Tr[DK]σ6.\mathbb{E}\left[\|x_{t}\|^{2}_{2}\|\eta_{t}\|_{2}^{2}c^{2}_{k}\right]\leq O\left(\|K\|_{2}^{4}+\|S\|_{2}^{2}\right)\mathbb{E}\Big{[}\|x_{k}\|^{4}_{2}\|x_{t}\|_{2}^{2}\|\eta_{t}\|_{2}^{2}\Big{]}+O\left(\|R\|_{2}^{2}\right)\operatorname{Tr}[D_{K}]\sigma^{6}.

Let Σ=i=tk1(A+BK)k1iω~i\Sigma=\sum_{i=t}^{k-1}(A+BK)^{k-1-i}\widetilde{\omega}_{i}. Then xk=(A+BK)ktxt+Σx_{k}=(A+BK)^{k-t}x_{t}+\Sigma and Σ\Sigma is independent of xtx_{t}. We can bound 𝔼[xk24xt22ηt22]\mathbb{E}\Big{[}\|x_{k}\|^{4}_{2}\|x_{t}\|_{2}^{2}\|\eta_{t}\|_{2}^{2}\Big{]} as follows:

(5.72) 𝔼[xk24xt22ηt22]=\displaystyle\mathbb{E}\Big{[}\|x_{k}\|^{4}_{2}\|x_{t}\|_{2}^{2}\|\eta_{t}\|_{2}^{2}\Big{]}= 𝔼[(A+BK)ktxt+Σ24xt22ηt22]\displaystyle\mathbb{E}\Big{[}\big{\|}(A+BK)^{k-t}x_{t}+\Sigma\big{\|}^{4}_{2}\|x_{t}\|_{2}^{2}\|\eta_{t}\|_{2}^{2}\Big{]}
\displaystyle\leq 𝔼{[2(A+BK)ktxt22+2Σ22]2xt22ηt22}\displaystyle\mathbb{E}\Big{\{}\big{[}2\|(A+BK)^{k-t}x_{t}\|_{2}^{2}+2\|\Sigma\|_{2}^{2}\big{]}^{2}\|x_{t}\|_{2}^{2}\|\eta_{t}\|_{2}^{2}\Big{\}}
\displaystyle\leq 8𝔼{[(A+BK)ktxt24+Σ24]xt22ηt22}\displaystyle 8\mathbb{E}\Big{\{}\big{[}\|(A+BK)^{k-t}x_{t}\|_{2}^{4}+\|\Sigma\|_{2}^{4}\big{]}\|x_{t}\|_{2}^{2}\|\eta_{t}\|^{2}_{2}\Big{\}}
\displaystyle\leq 8dσ2𝔼[(A+BK)ktxt24xt22]+8𝔼xt22𝔼(Σ24ηt22)\displaystyle 8d\sigma^{2}\mathbb{E}\big{[}\|(A+BK)^{k-t}x_{t}\|_{2}^{4}\|x_{t}\|_{2}^{2}\big{]}+8\mathbb{E}\|x_{t}\|^{2}_{2}\mathbb{E}\big{(}\|\Sigma\|^{4}_{2}\|\eta_{t}\|_{2}^{2}\big{)}
\displaystyle\leq O([𝔼μKx6+Tr[DK]𝔼μKx4]σ2+Tr[DK]σ6).\displaystyle O\Big{(}\big{[}\mathbb{E}_{\mu_{K}}\|x\|^{6}+\operatorname{Tr}[D_{K}]\mathbb{E}_{\mu_{K}}\|x\|^{4}\big{]}\sigma^{2}+\operatorname{Tr}[D_{K}]\sigma^{6}\Big{)}.

We observe that the right-hand side of (5.72) is bounded, which implies that (5.71) is bounded by a constant C>0C>0. Inserting this in (5.70), we get the estimate

(5.73) 𝔼G^(L)(K)F2CL3σ4(1γ2)\displaystyle\mathbb{E}\left\|\hat{G}^{(L)}(K)\right\|^{2}_{F}\leq\frac{CL^{3}}{\sigma^{4}(1-\gamma^{2})}

such that (5.60) holds with C3=Cσ4C_{3}=\frac{C}{\sigma^{4}}.

Assumption 5.3.

There is a positive number MGM_{G} such that G^(L)(K)FMG\big{\|}\hat{G}^{(L)}(K)\big{\|}_{F}\leq M_{G} for any stable KK.

At the end of this section, we present a theorem that guarantees the convergence of Algorithm 3. In order to prove this convergence, we need to assume that the error tolerance ϵ\epsilon and the confidence δ\delta are small enough such that the following inequality holds:

(5.74) δϵ<min(1,10(J(K(0))J(K))3log[120(J(K(0))J(K))]).\delta\epsilon<\min\left(1,\frac{10(J(K^{(0)})-J(K^{*}))}{3\log[120(J(K^{(0)})-J(K^{*}))]}\right).
Theorem 5.4.

Let Assumption 5.3 hold, let K(0)K^{(0)} be stable and suppose that γ\gamma is sufficiently close to 11 and 0<β<10<\beta<1 is small enough such that (2.17) is satisfied with ν=10δ\nu=\frac{10}{\delta} and K=K(0)K=K^{(0)}. For any error tolerance ϵ\epsilon and confidence δ\delta satisfying (5.74), suppose that the sample size LL is large enough such that

γL+1(Γ21ρ2+11γ)O(δϵ/n)\gamma^{L+1}\left(\frac{\Gamma^{2}}{1-\rho^{2}}+\frac{1}{1-\gamma}\right)\leq O(\sqrt{\delta\epsilon/n})

with Γ=ΓK(0),ν\Gamma=\Gamma_{K^{(0)},\nu} and the step size α\alpha is chosen to satisfy αO(δϵ(1γ2)σ2L3d)\alpha\leq O\left(\frac{\delta\epsilon(1-\gamma^{2})\sigma^{2}}{L^{3}d}\right). After TT iterations with T=O(1αlog[J(K(0))J(K)δϵ])T=O(\frac{1}{\alpha}\log[\frac{J(K^{(0)})-J(K^{*})}{\delta\epsilon}]), Algorithm 3 yields an iterate K(T)K^{(T)} such that

J(K(T))J(K)ϵJ(K^{(T)})-J(K^{*})\leq\epsilon

holds with probability greater than 1δ1-\delta.

Proof 5.5.

For any t=0,1,t=0,1,\cdots, define the error Δt=J(K(t))J(K)\Delta_{t}=J(K^{(t)})-J(K^{*}) and the stopping time

τ:=min{t|Δt>10Δ0/δ}.\tau:=\min\left\{t|\Delta_{t}>10\Delta_{0}/\delta\right\}.

We first note that Lemma 2.1 yields K(t)DomK(0),νK^{(t)}\in\mathrm{Dom}_{K^{(0)},\nu} for all t<τt<\tau since γ\gamma and β\beta are chosen such that (2.16) and (2.17) hold with ν=10δ\nu=\frac{10}{\delta} and K=K(0)K=K^{(0)}. Hence C2C_{2} and C3C_{3} are both uniformly bounded on DomK(0),ν\mathrm{Dom}_{K^{(0)},\nu}.

Next, for simplicity, we define 𝔼t:=𝔼[|𝒜t]\mathbb{E}^{t}:=\mathbb{E}\left[\cdot|\mathcal{A}_{t}\right] as the expectation operator conditioned on the σ\sigma-field 𝒜t\mathcal{A}_{t} which contains all the randomness of the first tt iterates. Since the gradient J\nabla J is locally Lipschitz, as shown in [16], there exist a uniform Lipschitz constant ϕ0\phi_{0} and a uniform radius ρ0\rho_{0} such that

(5.75) J(K)J(K)J(K),KK+ϕ02KKF2J(K^{\prime})-J(K)\leq\left\langle\nabla J(K),K^{\prime}-K\right\rangle+\frac{\phi_{0}}{2}\left\|K^{\prime}-K\right\|_{F}^{2}

for any KK and KK^{\prime} with KKFρ0\|K-K^{\prime}\|_{F}\leq\rho_{0}. We choose α\alpha sufficiently small such that αMGρ0\alpha M_{G}\leq\rho_{0}.

Let L be sufficiently large such that C2Γ21ρ2nγL+1(Γ21ρ2+11γ)δϵμ/180C_{2}\frac{\Gamma^{2}}{1-\rho^{2}}\sqrt{n}\gamma^{L+1}(\frac{\Gamma^{2}}{1-\rho^{2}}+\frac{1}{1-\gamma})\leq\sqrt{\delta\epsilon\mu/180}. Using this and in particular Lemma 5.1, we obtain

(5.76) 𝔼t[J(K(t+1))J(K(t))]\displaystyle\mathbb{E}^{t}\left[J\left(K^{(t+1)}\right)-J\left(K^{(t)}\right)\right]
\displaystyle\leq 𝔼t[J(K(t)),K(t+1)K(t)+ϕ02K(t+1)K(t)F2]\displaystyle\mathbb{E}^{t}\left[\left\langle\nabla J\left(K^{(t)}\right),K^{(t+1)}-K^{(t)}\right\rangle+\frac{\phi_{0}}{2}\left\|K^{(t+1)}-K^{(t)}\right\|_{F}^{2}\right]
=\displaystyle= αJ(K(t)),J(K(t))+ϕ0α22𝔼t[G^(L)(K(t))F2]\displaystyle-\alpha\left\langle\nabla J\left(K^{(t)}\right),\nabla J\left(K^{(t)}\right)\right\rangle+\frac{\phi_{0}\alpha^{2}}{2}\mathbb{E}^{t}\left[\left\|\hat{G}^{(L)}\left(K^{(t)}\right)\right\|_{F}^{2}\right]
+αJ(K(t)),J(K(t))𝔼t[G^(L)(K(t))]\displaystyle+\alpha\left\langle\nabla J\left(K^{(t)}\right),\nabla J\left(K^{(t)}\right)-\mathbb{E}^{t}[\hat{G}^{(L)}(K^{(t)})]\right\rangle
\displaystyle\leq αJ(K(t))F2+ϕ0α22𝔼t[G^(L)(K(t))F2]\displaystyle-\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}^{2}+\frac{\phi_{0}\alpha^{2}}{2}\mathbb{E}^{t}\left[\left\|\hat{G}^{(L)}\left(K^{(t)}\right)\right\|_{F}^{2}\right]
+αJ(K(t))FJ(K(t))𝔼t[G^(L)(K(t))]F\displaystyle+\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}\Big{\|}\nabla J\left(K^{(t)}\right)-\mathbb{E}^{t}[\hat{G}^{(L)}(K^{(t)})]\Big{\|}_{F}
\displaystyle\leq αJ(K(t))F2+ϕ0α22[Var(G^(L)(K(t)))+𝔼t[G^(L)(K(t))]F2]\displaystyle-\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}^{2}+\frac{\phi_{0}\alpha^{2}}{2}\left[\mathrm{Var}\left(\hat{G}^{(L)}(K^{(t)})\right)+\left\|\mathbb{E}^{t}[\hat{G}^{(L)}(K^{(t)})]\right\|_{F}^{2}\right]
+αJ(K(t))Fδϵμ/180\displaystyle+\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}\sqrt{\delta\epsilon\mu/180}
\displaystyle\leq 34αJ(K(t))F2+αδϵμ180+ϕ0α22G2+ϕ0α2(J(K(t))F2+δϵμ180),\displaystyle-\frac{3}{4}\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}^{2}+\frac{\alpha\delta\epsilon\mu}{180}+\frac{\phi_{0}\alpha^{2}}{2}G_{2}+\phi_{0}\alpha^{2}\left(\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}^{2}+\frac{\delta\epsilon\mu}{180}\right),

where G2=Var(G^(L)(K(t)))=𝔼t[G^(L)(K(t))F2]𝔼t[G^(L)(K(t))]F2G_{2}=\mathrm{Var}\left(\hat{G}^{(L)}(K^{(t)})\right)=\mathbb{E}^{t}\left[\left\|\hat{G}^{(L)}\left(K^{(t)}\right)\right\|_{F}^{2}\right]-\left\|\mathbb{E}^{t}[\hat{G}^{(L)}(K^{(t)})]\right\|_{F}^{2}. We note that in the last estimation of (5.76) the inequality

αJ(K(t))Fδϵμ/18014αJ(K(t))F2+αδϵμ180\displaystyle\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}\sqrt{\delta\epsilon\mu/180}\leq\frac{1}{4}\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}^{2}+\frac{\alpha\delta\epsilon\mu}{180}

is used. We assume that α12ϕ0\alpha\leq\frac{1}{2\phi_{0}}. By the PL condition (2.21), we have

(5.77) 𝔼t[Δt+1Δt]\displaystyle\mathbb{E}^{t}\left[\Delta_{t+1}-\Delta_{t}\right]\leq 14αJ(K(t))F2+ϕ0α22G2+αδϵμ120\displaystyle-\frac{1}{4}\alpha\Big{\|}\nabla J\left(K^{(t)}\right)\Big{\|}_{F}^{2}+\frac{\phi_{0}\alpha^{2}}{2}G_{2}+\frac{\alpha\delta\epsilon\mu}{120}
\displaystyle\leq 14αμΔt+ϕ0α22G2+αδϵμ120.\displaystyle-\frac{1}{4}\alpha\mu\Delta_{t}+\frac{\phi_{0}\alpha^{2}}{2}G_{2}+\frac{\alpha\delta\epsilon\mu}{120}.

Applying this inequality successively, we obtain a result similar to the one in [16]:

𝔼[Δt+11τ>t+1]\displaystyle\mathbb{E}\left[\Delta_{t+1}1_{\tau>t+1}\right]\leq (1αμ4)t+1Δ0+(ϕ0α22G2+αμϵδ120)i=0t(1αμ4)i\displaystyle\left(1-\frac{\alpha\mu}{4}\right)^{t+1}\Delta_{0}+(\frac{\phi_{0}\alpha^{2}}{2}G_{2}+\frac{\alpha\mu\epsilon\delta}{120})\sum_{i=0}^{t}\left(1-\frac{\alpha\mu}{4}\right)^{i}
\displaystyle\leq (1αμ4)t+1Δ0+2μαϕ0G2+ϵδ30,\displaystyle\left(1-\frac{\alpha\mu}{4}\right)^{t+1}\Delta_{0}+\frac{2}{\mu}\alpha\phi_{0}G_{2}+\frac{\epsilon\delta}{30},

where we have used that 𝔼t[Δt]=Δt\mathbb{E}^{t}\left[\Delta_{t}\right]=\Delta_{t}. By (5.60), we observe that taking αμϵδ240ϕ0C3L3(1γ2)\alpha\leq\frac{\mu\epsilon\delta}{240\phi_{0}C_{3}\frac{L^{3}}{(1-\gamma^{2})}} implies 2μαϕ0G2ϵδ120\frac{2}{\mu}\alpha\phi_{0}G_{2}\leq\frac{\epsilon\delta}{120}. We note that this condition on α\alpha as well as αMGρ0\alpha M_{G}\leq\rho_{0} and α12ϕ0\alpha\leq\frac{1}{2\phi_{0}} are satisfied for αO(δϵ(1γ2)σ2L3d)\alpha\leq O\left(\frac{\delta\epsilon(1-\gamma^{2})\sigma^{2}}{L^{3}d}\right). Setting t=T1t=T-1, we observe that for

TC1αlog[120[J(K(0))J(K)]δϵ]T\geq C\cdot\frac{1}{\alpha}\log\left[\frac{120[J(K^{(0)})-J(K^{*})]}{\delta\epsilon}\right]

with a sufficiently large constant CC the inequality 𝔼[ΔT1τ>T]ϵδ5\mathbb{E}\left[\Delta_{T}1_{\tau>T}\right]\leq\frac{\epsilon\delta}{5} holds.

By using the same techniques as in the proof of Proposition 1 in [16], we observe that {1τT}4δ/5\mathbb{P}\left\{1_{\tau\leq T}\right\}\leq 4\delta/5. By Chebyshev’s inequality, we have

{ΔTϵ}\displaystyle\mathbb{P}\left\{\Delta_{T}\geq\epsilon\right\}\leq {ΔT1τ>Tϵ}+{1τT}\displaystyle\mathbb{P}\left\{\Delta_{T}1_{\tau>T}\geq\epsilon\right\}+\mathbb{P}\left\{1_{\tau\leq T}\right\}
\displaystyle\leq 1ϵ𝔼[ΔT1τ>T]+{1τT}\displaystyle\frac{1}{\epsilon}\mathbb{E}\left[\Delta_{T}1_{\tau>T}\right]+\mathbb{P}\left\{1_{\tau\leq T}\right\}
\displaystyle\leq δ5+4δ5\displaystyle\frac{\delta}{5}+\frac{4\delta}{5}
=\displaystyle= δ.\displaystyle\delta.

This completes the proof.
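
To make the scalings of Theorem 5.4 concrete, the following sketch computes an admissible roll-out length L, step size alpha and iteration number T from given problem quantities. The function name pg_parameters and the constant c, which absorbs all problem-dependent factors hidden in the O(\cdot) notation, are assumptions of this sketch and not part of the analysis above.

import math

def pg_parameters(gamma, eps, delta, n, d, sigma, Gamma, rho, delta0, c=1.0):
    """Illustrative parameter choice following the scalings of Theorem 5.4;
    delta0 stands for J(K^(0)) - J(K^*) and c is an assumed absolute constant."""
    # roll-out length: smallest L with gamma^(L+1)*(Gamma^2/(1-rho^2) + 1/(1-gamma)) <= c*sqrt(delta*eps/n)
    target = c * math.sqrt(delta * eps / n)
    growth = Gamma ** 2 / (1 - rho ** 2) + 1 / (1 - gamma)
    L = max(1, math.ceil(math.log(target / growth) / math.log(gamma)) - 1)
    # step size: alpha <= O(delta*eps*(1-gamma^2)*sigma^2 / (L^3 d))
    alpha = c * delta * eps * (1 - gamma ** 2) * sigma ** 2 / (L ** 3 * d)
    # iteration number: T = O((1/alpha) * log(delta0 / (delta*eps)))
    T = math.ceil(c / alpha * math.log(max(delta0 / (delta * eps), math.e)))
    return L, alpha, T

# example with made-up problem sizes; the returned values illustrate how
# conservative the worst-case bounds are
print(pg_parameters(gamma=0.99, eps=0.1, delta=0.05, n=4, d=2,
                    sigma=0.1, Gamma=2.0, rho=0.9, delta0=50.0))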

6 The Actor-Critic Algorithm

In the policy gradient algorithm, the policy is updated using purely Monte Carlo estimates of the gradient. Hence, the gradient estimate has high variance, which slows down the convergence. A popular method to reduce the variance is the Actor-Critic algorithm, which replaces the Monte Carlo estimate of the cumulative cost by a bootstrapped estimate. The gradient estimate in the Actor-Critic algorithm has the following form:

(6.78) G^AC(L)(K)=1σ2t=0L[(utKxt)xtγtδ~t],\hat{G}_{AC}^{(L)}(K)=\frac{1}{\sigma^{2}}\sum_{t=0}^{L}\left[(u_{t}-Kx_{t})x_{t}^{\top}\gamma^{t}\tilde{\delta}_{t}\right],

where δ~t:=ct+γV~(xt+1)V~(xt)\tilde{\delta}_{t}:=c_{t}+\gamma\widetilde{V}(x_{t+1})-\widetilde{V}(x_{t}) denotes the temporal-difference error and V~\widetilde{V} is the approximation of the state value function provided by the critic. We investigate the bias and the variance of this estimator.
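
A corresponding numerical sketch of the estimator (6.78) is given below; the function names and the convention that the state array xs contains the additional state x_{L+1} are assumptions of the sketch, and Theta, theta0 denote the critic parameters of the fitted quadratic value function \widetilde{V}(x)=x^{\top}\Theta x+\theta_{0} obtained from the policy evaluation step (e.g., by TD-learning).

import numpy as np

def v_tilde(x, Theta, theta0):
    """Fitted quadratic value function V_tilde(x) = x^T Theta x + theta_0."""
    return x @ Theta @ x + theta0

def ac_gradient(xs, us, cs, K, sigma, gamma, Theta, theta0):
    """Actor-critic gradient estimate (6.78); xs is assumed to contain the L+2
    states x_0,...,x_{L+1}, while us and cs contain the L+1 actions and stage costs."""
    L = len(cs) - 1
    G = np.zeros_like(K, dtype=float)
    for t in range(L + 1):
        # TD error: delta_t = c_t + gamma * V_tilde(x_{t+1}) - V_tilde(x_t)
        delta = cs[t] + gamma * v_tilde(xs[t + 1], Theta, theta0) - v_tilde(xs[t], Theta, theta0)
        G += gamma ** t * delta * np.outer(us[t] - K @ xs[t], xs[t])
    return G / sigma ** 2

Compared with the Monte Carlo estimator (5.58), the discounted cost tail is replaced by the bootstrapped one-step TD error, which is the source of the variance reduction discussed above.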

Lemma 6.1.

Suppose that the policy KK is stable and V~\widetilde{V} is an ϵ0\epsilon_{0}-approximation of the state value function. Then, for the estimator G^AC(L)(K)\hat{G}_{AC}^{(L)}(K) defined in (6.78), the following inequalities hold

(6.79) J(K)𝔼[G^AC(L)(K)]2\displaystyle\left\|\nabla J(K)-\mathbb{E}[\hat{G}_{AC}^{(L)}(K)]\right\|_{2} \displaystyle\leq C2γL+1ΓK2(1ρ2)(1γ)+C4ΓKJ(K)ϵ0,\displaystyle C_{2}\gamma^{L+1}\frac{\Gamma_{K}^{2}}{(1-\rho^{2})(1-\gamma)}+C_{4}\Gamma_{K}J(K)\epsilon_{0},
(6.80) 𝔼G^AC(L)(K)F2\displaystyle\mathbb{E}\left\|\hat{G}_{AC}^{(L)}(K)\right\|^{2}_{F} \displaystyle\leq [C5σ2+O(Tr[DK]σ2)]L1γ2,\displaystyle\left[\frac{C_{5}}{\sigma^{2}}+O(\operatorname{Tr}[D_{K}]\sigma^{2})\right]\frac{L}{1-\gamma^{2}},

where C2C_{2} is given as in Lemma 5.1 and C4C_{4}, C5C_{5} depend on A2\|A\|_{2}, B2\|B\|_{2}, S2\|S\|_{2}, R2\|R\|_{2}, K22\|K\|^{2}_{2}, ΓK\Gamma_{K}, 11ρ2\frac{1}{1-\rho^{2}}, ω~\widetilde{\omega} and (1γ)J(K)(1-\gamma)J(K) with ρ(ρ(A+BK),1)\rho\in(\rho(A+BK),1).

Proof 6.2.

Analogously to the proof of Lemma 5.1, we can split the bias of G^AC(L)\hat{G}_{AC}^{(L)} into two parts:

(6.81) J(K)𝔼[G^AC(L)(K)]=\displaystyle\nabla J(K)-\mathbb{E}[\hat{G}_{AC}^{(L)}(K)]= 1σ2𝔼[t=L+1(utKxt)xtk=tγkck]\displaystyle\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=L+1}^{\infty}(u_{t}-Kx_{t})x_{t}^{\top}\sum_{k=t}^{\infty}\gamma^{k}c_{k}\right]
+1σ2𝔼[t=0Lγt+1(utKxt)xt(VK(xt+1)V~(xt+1))].\displaystyle+\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=0}^{L}\gamma^{t+1}(u_{t}-Kx_{t})x_{t}^{\top}(V_{K}(x_{t+1})-\widetilde{V}(x_{t+1}))\right].

We have discussed the first term in the proof of Lemma 5.1 and its upper bound is C2ΓK2(1ρ2)(1γ)γL+1C_{2}\frac{\Gamma_{K}^{2}}{(1-\rho^{2})(1-\gamma)}\gamma^{L+1}. Let VK(x)=ϕ(x)θ=xΘKx+θ0V_{K}(x)=\phi(x)^{\top}\theta^{*}=x^{\top}\Theta_{K}^{*}x+\theta^{*}_{0} and V~(x)=ϕ(x)θ=xΘx+θ0\widetilde{V}(x)=\phi(x)^{\top}\theta=x^{\top}\Theta x+\theta_{0}. We observe that the second term of (6.81) has the following equivalent form:

(6.82) 1σ2𝔼[t=0Lγt+1(utKxt)xt(VK(xt+1)V~(xt+1))]\displaystyle\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=0}^{L}\gamma^{t+1}(u_{t}-Kx_{t})x_{t}^{\top}(V_{K}(x_{t+1})-\widetilde{V}(x_{t+1}))\right]
=\displaystyle= 1σ2𝔼[t=0Lγt+1ηt(Tr[(ΘKΘ)xt+1xt+1])xt]\displaystyle\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=0}^{L}\gamma^{t+1}\eta_{t}(\operatorname{Tr}[(\Theta^{*}_{K}-\Theta)x_{t+1}x_{t+1}^{\top}])x_{t}^{\top}\right]
=\displaystyle= 1σ2𝔼[t=0Lγt+1ηt[2ηtB(ΘKΘ)(A+BK)xt]xt]\frac{1}{\sigma^{2}}\mathbb{E}\left[\sum_{t=0}^{L}\gamma^{t+1}\eta_{t}[2\eta_{t}^{\top}B^{\top}(\Theta^{*}_{K}-\Theta)(A+BK)x_{t}]x_{t}^{\top}\right]
=\displaystyle= 2B(ΘKΘ)(A+BK)𝔼[t=0Lγt+1xtxt].\displaystyle 2B^{\top}(\Theta^{*}_{K}-\Theta)(A+BK)\mathbb{E}\left[\sum_{t=0}^{L}\gamma^{t+1}x_{t}x_{t}^{\top}\right].

Since 𝔼[t=0Lγt+1xtxt]2ΣK,γ2J(K)σmin(S)\left\|\mathbb{E}\left[\sum_{t=0}^{L}\gamma^{t+1}x_{t}x_{t}^{\top}\right]\right\|_{2}\leq\|\Sigma_{K,\gamma}\|_{2}\leq\frac{J(K)}{\sigma_{\mathrm{min}}(S)} and V~\widetilde{V} is an ϵ0\epsilon_{0}-approximation of VKV_{K}, the right-hand side of (6.82) is bounded by 2B2ΓKσmin(S)J(K)ϵ0\frac{2\|B\|_{2}\Gamma_{K}}{\sigma_{\mathrm{min}}(S)}J(K)\epsilon_{0}, which yields (6.79).

Since θ12θ12+ϵ0(1γ)J(K)γσmin(Dω~)+ϵ0\|\theta_{1}\|_{2}\leq\|\theta_{1}^{\star}\|_{2}+\epsilon_{0}\leq\frac{(1-\gamma)J(K)}{\gamma\sigma_{\mathrm{min}}(D_{\widetilde{\omega}})}+\epsilon_{0} by (2.19) and |θ0||θ0|+ϵ0=γ1γTr[PK,γDω~]+σ2Tr[R]1γ+ϵ0=J(K)+ϵ0|\theta_{0}|\leq|\theta_{0}^{*}|+\epsilon_{0}=\frac{\gamma}{1-\gamma}\operatorname{Tr}[P_{K,\gamma}D_{\widetilde{\omega}}]+\frac{\sigma^{2}\operatorname{Tr}[R]}{1-\gamma}+\epsilon_{0}=J(K)+\epsilon_{0}, the following inequality holds:

(6.83) δ~t2\displaystyle\tilde{\delta}_{t}^{2} 3[ct2+(1γ)2|θ0|2+xtxtγxt+1xt+1F2θ122]\displaystyle\leq 3\left[c_{t}^{2}+(1-\gamma)^{2}|\theta_{0}|^{2}+\|x_{t}x_{t}^{\top}-\gamma x_{t+1}x_{t+1}^{\top}\|^{2}_{F}\|\theta_{1}\|_{2}^{2}\right]
3ct2+O((1γ)2J2(K))(1+xt24+xt+124).\displaystyle\leq 3c_{t}^{2}+O((1-\gamma)^{2}J^{2}(K))(1+\|x_{t}\|_{2}^{4}+\|x_{t+1}\|^{4}_{2}).

Hence, we obtain the inequality (6.80):

(6.84) 𝔼G^AC(L)(K)F2\displaystyle\mathbb{E}\left\|\hat{G}_{AC}^{(L)}(K)\right\|^{2}_{F} Lσ4t=1Lγ2t𝔼(xt22ηt22|δ~t|2)\displaystyle\leq\frac{L}{\sigma^{4}}\sum_{t=1}^{L}\gamma^{2t}\mathbb{E}\left(\|x_{t}\|_{2}^{2}\|\eta_{t}\|_{2}^{2}|\tilde{\delta}_{t}|^{2}\right)
Lσ4t=1Lγ2t[C5σ2+O(Tr[DK])σ6]\displaystyle\leq\frac{L}{\sigma^{4}}\sum_{t=1}^{L}\gamma^{2t}\left[C_{5}\sigma^{2}+O(\operatorname{Tr}[D_{K}])\sigma^{6}\right]
[C5σ2+O(Tr[DK]σ2)]L1γ2,\displaystyle\leq\left[\frac{C_{5}}{\sigma^{2}}+O(\operatorname{Tr}[D_{K}]\sigma^{2})\right]\frac{L}{1-\gamma^{2}},

where the third inequality holds by the same technique as in (3.32) and

C5=O([(1γ)2J2(K)+K24]𝔼μKx6).C_{5}=O\left([(1-\gamma)^{2}J^{2}(K)+\|K\|^{4}_{2}]\mathbb{E}_{\mu_{K}}\|x\|^{6}\right).

In the estimates above, techniques similar to those in (5.70) have been used.

Now we can obtain a convergence result for the AC algorithm which is similar to the one in Theorem 5.4. In order to prove this result, we need the following assumption.

Assumption 6.3.

There is a positive number MGACM^{AC}_{G} such that G^AC(L)(K)FMGAC\big{\|}\hat{G}_{AC}^{(L)}(K)\big{\|}_{F}\leq M^{AC}_{G} for any stable KK.

Theorem 6.4.

Suppose that Assumption 6.3 is satisfied and let K(0)K^{(0)} be a stable policy. Moreover, suppose that γ\gamma is sufficiently close to 11 and 0<β<10<\beta<1 is small enough such that (2.17) holds with ν=10δ\nu=\frac{10}{\delta}. For any error tolerance ϵ\epsilon and confidence δ\delta satisfying (5.74), suppose that the sample size LL is large enough and the approximation error ϵ0\epsilon_{0} of the value function is small enough such that

γL+111γO(δϵ/n)andJ(K(0))ϵ0O(δϵ/n).\gamma^{L+1}\frac{1}{1-\gamma}\leq O(\sqrt{\delta\epsilon/n})\quad\text{and}\quad J(K^{(0)})\epsilon_{0}\leq O(\sqrt{\delta\epsilon/n}).

The step size α\alpha is chosen such that αO(δϵ(1γ2)Ld)\alpha\leq O\left(\frac{\delta\epsilon(1-\gamma^{2})}{Ld}\right) holds. After T=O(1αlog[J(K(0))J(K)δϵ])T=O(\frac{1}{\alpha}\log[\frac{J(K^{(0)})-J(K^{*})}{\delta\epsilon}]) iterations, the iterate K(T)K^{(T)} satisfies

J(K(T))J(K)ϵJ(K^{(T)})-J(K^{*})\leq\epsilon

with probability greater than 1δ1-\delta.

Proof 6.5.

The proof is similar to the proof of Theorem 5.4.

Finally, we analyze the sample complexity of the policy gradient method and the AC algorithm. We assume that the discount factor γ\gamma with 0<γ<10<\gamma<1 is close to 11 such that log(γ)γ1\log(\gamma)\approx\gamma-1. By the definition of J(K)J(K) and in particular by its representation in (2.8), we obtain J(K(0))=O(11γ)J(K^{(0)})=O(\frac{1}{1-\gamma}). For the AC algorithm, we have to require that J(K(0))ϵ0=O(δϵ/n)J(K^{(0)})\epsilon_{0}=O(\sqrt{\delta\epsilon/n}). Then the TD-learning algorithm needs O(1δϵ(1γ)4)O\left(\frac{1}{\delta\epsilon(1-\gamma)^{4}}\right) steps by Theorem 3.8. However, the sample size LL in the AC algorithm is only O(logn(1γ)δϵlog(γ))O((1γ)2)O\left(\frac{\log\frac{n}{(1-\gamma)\delta\epsilon}}{\log(\gamma)}\right)\approx O((1-\gamma)^{-2}). We can sample 1(1γ)2\frac{1}{(1-\gamma)^{2}} trajectories in parallel. Then the variance of the gradient G^(L)(K)\hat{G}^{(L)}(K) becomes (1γ)L=O(11γ)(1-\gamma)L=O(\frac{1}{1-\gamma}). Using arguments similar to those in the proof of Theorem 5.4, one can show that the number of iterations TT of the AC algorithm is O(1δϵ(1γ))O(\frac{1}{\delta\epsilon(1-\gamma)}). From the statements shown above, we conclude the complexities given in Table 1.

7 Conclusion

Reinforcement learning has achieved success in many fields but lacks theoretical understanding in the continuous case. In this paper, we apply well-known RL algorithms to a basic model, the LQR. First, we show the convergence of policy iteration with TD-learning, which is hard to prove in the general case. Then we obtain the linear convergence of the policy gradient method and of the AC algorithm. Finally, we compare the sample complexities of these algorithms.

The results of this paper are proved for the LQR setting, which allows us to restrict ourselves to linear policies. For extensions to more general problems, this restriction to a linear framework is no longer possible. Consequently, the policy function may depend nonlinearly on its parameters, which makes in particular the convergence analysis of optimization methods such as the policy gradient method much more involved. A further difficulty is that the PL condition, which guarantees that stationary policies are globally optimal, may not hold anymore. However, since the LQR can be used as an approximation for more general nonlinear problems, the techniques developed in this paper can serve as an important tool for the treatment of such problems.

Acknowledgments.

The research is supported in part by the NSFC grant 11831002 and Beijing Academy of Artificial Intelligence.

References

  • [1] Y. Abbasi-Yadkori and C. Szepesvári, Regret bounds for the adaptive control of linear quadratic systems, in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [2] J. Bhandari, D. Russo, and R. Singal, A finite time analysis of temporal difference learning with linear function approximation, in Conference On Learning Theory, 2018, pp. 1691–1692.
  • [3] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári, Convergent temporal-difference learning with arbitrary smooth function approximation, in Advances in Neural Information Processing Systems, 2009, pp. 1204–1212.
  • [4] V. S. Borkar and S. P. Meyn, The ode method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), pp. 447–469.
  • [5] S. J. Bradtke and A. G. Barto, Linear least-squares algorithms for temporal difference learning, Machine learning, 22 (1996), pp. 33–57.
  • [6] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, Sbeed: Convergent reinforcement learning with nonlinear function approximation, in International Conference on Machine Learning, 2018, pp. 1125–1134.
  • [7] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, Regret bounds for robust adaptive control of the linear quadratic regulator, in Advances in Neural Information Processing Systems, 2018, pp. 4188–4197.
  • [8] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, On the sample complexity of the linear quadratic regulator, Foundations of Computational Mathematics, (2019), pp. 1–47.
  • [9] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator, in International Conference on Machine Learning, 2018, pp. 1467–1476.
  • [10] J. Kober, J. A. Bagnell, and J. Peters, Reinforcement learning in robotics: A survey, The International Journal of Robotics Research, 32 (2013), pp. 1238–1274.
  • [11] V. R. Konda and J. N. Tsitsiklis, Actor-critic algorithms, in Advances in neural information processing systems, 2000, pp. 1008–1014.
  • [12] M. G. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of machine learning research, 4 (2003), pp. 1107–1149.
  • [13] A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of LSTD, 2010.
  • [14] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik, Finite-sample analysis of proximal gradient TD algorithms, in UAI, 2015, pp. 504–513.
  • [15] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, Applications of deep learning and reinforcement learning to biological data, IEEE transactions on neural networks and learning systems, 29 (2018), pp. 2063–2079.
  • [16] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright, Derivative-free methods for policy optimization: Guarantees for linear quadratic systems, in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 2916–2925.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature, 518 (2015), p. 529.
  • [18] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine learning, 3 (1988), pp. 9–44.
  • [19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
  • [20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, Fast gradient-descent methods for temporal-difference learning with linear function approximation, in Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 993–1000.
  • [21] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in Advances in Neural Information Processing Systems, 12 (1999), pp. 1057–1063.
  • [22] J. N. Tsitsiklis and B. Van Roy, Analysis of temporal-difference learning with function approximation, in Advances in neural information processing systems, 1997, pp. 1075–1081.
  • [23] S. Tu and B. Recht, Least-squares temporal difference learning for the linear quadratic regulator, in International Conference on Machine Learning, 2018, pp. 5005–5014.
  • [24] C. J. Watkins and P. Dayan, Q-learning, Machine learning, 8 (1992), pp. 279–292.
  • [25] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning, 8 (1992), pp. 229–256.
  • [26] K. Zhang, A. Koppel, H. Zhu, and T. Başar, Global convergence of policy gradient methods to (almost) locally optimal policies, arXiv preprint arXiv:1906.08383, (2019).
  • [27] S. Zou, T. Xu, and Y. Liang, Finite-sample analysis for sarsa with linear function approximation, in Advances in Neural Information Processing Systems, 2019, pp. 8668–8678.