On the Analysis of Model-free Methods for the Linear Quadratic Regulator
Z. Jin, J. M. Schmitt, and Z. Wen
Abstract
Many reinforcement learning methods achieve great success in practice but lack theoretical foundations. In this paper, we study the convergence of model-free methods for the Linear Quadratic Regulator (LQR) problem. Global linear convergence properties and sample complexities are established for several popular algorithms, namely the policy gradient algorithm, TD-learning and the actor-critic (AC) algorithm. Our results show that the actor-critic algorithm can reduce the sample complexity compared with the policy gradient algorithm. Although our analysis is still preliminary, it explains the benefit of the AC algorithm in a certain sense.
keywords:
Linear Quadratic Regulator, TD-learning, policy gradient algorithm, actor-critic algorithm, convergence
MSC codes: 49N10, 68T05, 90C40, 93E35
1 Introduction
Reinforcement learning (RL) involves training an agent that takes a sequence of actions in order to minimize its cumulative cost (or maximize its cumulative reward); see [19] for a general introduction to RL. Model-free methods, which do not estimate or directly use the transition kernel, enjoy wide popularity in RL. They have achieved great success in many fields, such as robotics [10], biology [15] and competitive gaming [17]. In order to improve the performance of these algorithms, a theoretical understanding of questions regarding global convergence and sample complexity becomes more and more crucial. However, since RL problems are non-convex, even proving convergence of model-free methods is hard.
The drawback of non-convexity also appears in the Linear Quadratic Regulator (LQR) problem, which is an elementary and well-understood problem in system control. Therefore, we consider LQRs as a first step. In practice, one often estimates the transition matrices directly (called system identification) and then designs a linear policy for the LQR. Since model-free methods are becoming more and more popular, there is a large body of literature analyzing model-free methods for LQRs, see for example [23, 9, 7, 16]. The authors of [23] analyze the sample complexity of the least-squares temporal difference (LSTD) method for one fixed policy in the LQR setting. There are also contributions [9, 16] in which the properties of the cumulative cost with respect to the policy are analyzed and global convergence of policy iterates generated by a zero-order optimization method is shown. In the analysis of LQRs, we can without loss of generality (w.l.o.g.) restrict ourselves to linear policies, since it can be shown that optimal policies are linear. Although the cumulative cost in the LQR problem is non-convex with respect to the policy, any locally optimal policy is globally optimal. In this paper, we analyze some basic model-free methods in the LQR setting and derive their sample complexity, i.e., the number of samples required to guarantee convergence up to a specified tolerance.
1.1 Related work
TD-learning [18] and Q-learning [24] are basic and popular value-based methods. There is a line of work examining the convergence of TD-learning and Q-learning with linear value function approximations for Markov decision processes (MDPs), see, e.g., [22, 4, 2, 27]. Convergence with probability one is proved in [22, 4]. Moreover, a non-asymptotic analysis of TD-learning and Q-learning is provided in [2] and [27], respectively. The authors of [3] extend the asymptotic analysis of TD-learning to the case of nonlinear value function approximations. In addition, there is a large number of contributions analyzing least-squares TD [5, 13] and gradient TD [20, 14], which also require linear value function approximations.
Among policy-based methods, the policy gradient method [25, 21] and the actor-critic method [21, 11] achieve empirical success. Sutton et al. show in [21] an asymptotic convergence result for the actor-critic method applied to MDPs under a compatibility condition on the value function approximations. In [6, 26], a non-asymptotic analysis of actor-critic methods is provided under some parametrization assumptions on the MDP requiring that the state space is finite. In [9, 16], the policy gradient method with an evolution strategy is used in the LQR setting and a global convergence result is derived. The sample complexity of the LSTD method for LQRs with a fixed policy is derived in [23]. There is also a line of work considering the design and analysis of model-based methods for LQRs, see for example [1, 8, 7].
1.2 Contribution
In this paper, our goal is to analyze general RL methods in the LQR setting. Unlike finite MDPs, the state space and the action space of the LQR are both continuous and therefore infinite, which makes LQR problems difficult to handle. On the other hand, the LQR represents a simple but also typical continuous problem in RL. First, we use the TD-learning approach with linear value function approximations for the policy evaluation of LQRs, which is a quite popular method for general RL problems. Compared to [23], we focus on the sample complexity of TD-learning instead of the LSTD approach, in which linear systems of equations need to be solved. In addition, we also analyze the global convergence of the policy iterates generated by TD-learning, which is inspired by the work [9].
Instead of using evolution strategies as in [9, 16], we prove global linear convergence of the policy gradient method and the actor-critic algorithm in the LQR setting; these methods are much more widely used in practice. Another difference is that we consider the more involved setting with process noise instead of the setting with only a random initial state. We also show that the policy gradient is equivalent to the gradient of the cumulative cost with respect to the policy parameters. Moreover, our work combines the analysis of the value function with the analysis of the policy.
Estimates of the complexities of the policy iterates generated by TD-learning, the policy gradient method and the actor-critic method are given in Table 1. These complexities are based on the results of this paper, in particular on Theorem 4.1, Theorem 5.4 and Theorem 6.4, and are stated in terms of the discount factor and the error tolerance with respect to the globally optimal value.
Table 1: Complexity estimates (number of TD steps and of iterations) for policy iteration with TD-learning, the policy gradient method and the actor-critic algorithm.
1.3 Organization
This paper is organized as follows. Section 2 starts with a description of the general LQR problem. In order to solve LQRs in the RL framework, we introduce the policy function and convert the original problem into a policy optimization problem. Section 3 discusses the TD-learning approach with linear value function approximations and the convergence of this method. In Section 4, the process of policy iteration is combined with TD-learning and the convergence of this method is analyzed. Sections 5 and 6 present convergence results for the policy gradient method and the actor-critic algorithm. Finally, Section 7 gives a brief summary of the main results of this paper.
2 Preliminaries
Linear time invariant (LTI) systems have the following form:
where , and the sequence describes unbiased, independent and identically distributed noise. We call state and control. The control is measured by the cost function
with positive definite matrices and . Thus, the cumulative cost with discount factor is given by . The LQR problem consists in finding the control which minimizes the expectation of the cumulative cost leading to the following optimization problem which we will refer to as the LQR problem
(2.1)
s.t.
Optimal control theory shows that the optimal control input can be written as a linear function with respect to the state, i.e., , where . The optimal control gain is given by
(2.2) |
where is the solution of the algebraic Riccati equation (ARE)
Hence, the optimal policy depends only on the system matrices and the cost matrices; the formula above for the optimal policy is shown in [9]. We will explain the optimality of this policy in Section 4.
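As an illustration of how the optimal gain can be computed once the system is known, the following sketch iterates the discounted algebraic Riccati equation to a fixed point. The symbols A, B, Q, R and gamma stand for the system matrices, the cost matrices and the discount factor; this naming and the convention that the control is the gain times the state are assumptions of the sketch rather than notation fixed in the text.

```python
import numpy as np

def discounted_lqr_gain(A, B, Q, R, gamma, iters=500, tol=1e-10):
    """Solve the discounted algebraic Riccati equation by fixed-point iteration
    and return the optimal linear gain K_star (so that u = K_star @ x)."""
    P = Q.copy()
    for _ in range(iters):
        G = R + gamma * B.T @ P @ B                      # R + gamma * B' P B
        P_next = (Q + gamma * A.T @ P @ A
                  - gamma**2 * A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A))
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    K_star = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    return K_star, P

if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q, R, gamma = np.eye(2), np.eye(1), 0.99
    K_star, P = discounted_lqr_gain(A, B, Q, R, gamma)
    print("optimal gain:\n", K_star)
    print("closed-loop spectral radius:",
          max(abs(np.linalg.eigvals(A + B @ K_star))))
```

This model-based computation serves only as a reference point; the model-free methods analyzed below do not assume knowledge of the system matrices.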
2.1 Stochastic Policies
In practice, the system is not known exactly. Therefore, it is popular to use model-free methods to solve the LQR problem (2.1), and we want to investigate the properties of such model-free methods for LQRs. We observe that the controls form sequences belonging to an infinite-dimensional vector space, so the analysis of the LQR problem (2.1) has to be carried out with care. Since problem (2.1) can be viewed as a Markov decision process, one often uses policy functions, i.e., maps from the state space to the control space, to represent the control. In this paper, the policies are constrained to a specific set of policy functions to simplify the problem.
In general, we can employ Gaussian policies
where is the parameter of the policy and is a fixed constant. There are several possibilities to choose function spaces for , such as linear function spaces and neural networks. As explained above, optimal controls of (2.1) depend linearly on the state. Therefore, since the usage of nonlinear policy functions considerably complicates the analysis, we focus on linear policy functions, which can be represented by matrices as follows
This yields an equivalent form of the control
In addition, the probability density function of has the explicit form
The advantage of stochastic policies is that the policy gradient method can be applied to the problem, which we will discuss in a later section. In addition, stochastic policies are beneficial for exploration, which is quite important in reinforcement learning.
Under the policy , the dynamic system can be written as
Let , , and . Using these abbreviations, the LQR problem (2.1) can be reformulated as
(2.3)
s.t.
where we have used the linearity of the trace operator, more precisely
We will use this trick in several places in the rest of this paper.
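To make the reformulated problem concrete, the following minimal sketch simulates the closed-loop system under a stochastic linear policy and returns a truncated discounted cost. It assumes dynamics of the form x' = A x + B u + noise together with the quadratic stage cost; the parameter names and the noise scale are illustrative assumptions.

```python
import numpy as np

def rollout_cost(A, B, Q, R, K, gamma, sigma=0.1, horizon=500, seed=0):
    """Simulate one trajectory under the Gaussian policy u = K x + sigma * zeta
    and return the truncated discounted cumulative cost."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    x = rng.normal(size=n)                            # random initial state (assumption)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        u = K @ x + sigma * rng.normal(size=k)        # stochastic linear policy
        total += discount * (x @ Q @ x + u @ R @ u)   # stage cost x'Qx + u'Ru
        x = A @ x + B @ u + 0.1 * rng.normal(size=n)  # noisy linear dynamics (illustrative noise scale)
        discount *= gamma
    return total
```

Averaging such rollouts over many seeds gives a Monte Carlo estimate of the expected cumulative cost of the policy.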
Let the domain of feasible policies be given by Dom, where denotes the spectral radius of the matrix . In the following, we assume that Dom is non-empty. Policies lying in the feasible domain are called stable. The set Dom is non-convex, see [9], and for every policy in Dom the corresponding cumulative cost is finite. By Proposition 3.2 in [23], for any stable and any , there exists a such that for any , the following inequality holds
(2.4) |
For any compact subset Dom, we can find uniform constants and such that (2.4) holds for all . Unfortunately, since , may also be finite for unstable . More precisely, one can show that is finite if and only if
(2.5) |
We observe that the constraints in (2.3) yield for
(2.6) |
Provided that satisfies (2.5), we can insert (2.6) into . Exploiting that for , we obtain the following analytic form of the objective function
(2.7) |
Let . Then can be further simplified to
(2.8) |
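For a deterministic linear policy, the analytic cost can also be evaluated numerically. The sketch below assumes that the simplified objective takes the standard form in which a value matrix solves a discounted Lyapunov-type equation, and that the initial state and the process noise have covariances Sigma0 and W; these names and the precise form are assumptions of the sketch.

```python
import numpy as np

def policy_value_matrix(A, B, Q, R, K, gamma, iters=5000, tol=1e-12):
    """Fixed-point iteration for the discounted Lyapunov-type equation
    P_K = Q + K' R K + gamma * (A + B K)' P_K (A + B K).
    Converges only if the policy is stable in the discounted sense."""
    M = A + B @ K
    P = np.zeros_like(Q)
    for _ in range(iters):
        P_next = Q + K.T @ R @ K + gamma * M.T @ P @ M
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P

def policy_cost(A, B, Q, R, K, gamma, Sigma0, W):
    """Expected discounted cost of the policy u = K x:
    trace(P_K Sigma0) plus the accumulated contribution of the process noise."""
    P = policy_value_matrix(A, B, Q, R, K, gamma)
    return np.trace(P @ Sigma0) + gamma / (1.0 - gamma) * np.trace(P @ W)
```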
In the next result, we observe that for which are sufficiently close to , we can restrict our search for the optimal policy to the set of stable policies.
Lemma 2.1.
Suppose that is a stable policy. Then for any there exists a discount factor and a positive constant such that
(2.9) |
is valid for any policy with , where denotes an optimal policy. In addition, the set
(2.10) |
is a compact subset of . Finally, there exist constants and such that
(2.11) |
holds for all .
Proof 2.2.
Let be an arbitrary policy with , where will be determined later. We can w.l.o.g. assume that satisfies (2.5). Otherwise, would be infinite and (2.9) holds trivially. First, we observe that (2.9) follows by
(2.12) |
Next, we introduce the abbreviations and . Since , and are positive definite, using (2.4) and the fact that holds for any matrix , we can deduce
where denotes the constant of the policy in (2.4) for . By , we further get
Let and .
Using these two inequalities, we observe that (2.12) and therefore also (2.9) follow by
(2.13) |
We can w.l.o.g. assume that , otherwise and can be rescaled such that this holds. Hence, isolating in (2.13), we conclude that
(2.14) |
Next, we observe that
(2.15)
is valid for constants satisfying
(2.16) |
We conclude that for satisfying (2.16) and for with
(2.17) |
the inequalities (2.13) and therefore (2.9) hold. Next, we observe that
which is obviously a compact subset of Dom. The last statement of the lemma follows directly from Proposition 3.2 in [23].
For instance, if a stable policy is given, then by Lemma 2.1 we can choose a close to such that the level set of is compact and contains only stable policies. Hence, this lemma plays an important role in the following discussion.
For a policy satisfying (2.5), the objective function of (2.7) is differentiable in a sufficiently small neighborhood of . Hence, we can compute the gradient of (2.7) which is given by
(2.18) |
where and the sequence is generated by policy . This gradient form is obtained by Lemma 1 in [9] and Lemma 4 in [16].
From the representation of in (2.8), we can establish the following relations between , and , see also Lemma 13 in [9].
Lemma 2.3.
Let be a policy such that is finite. Then the representation of in (2.8) yields the following two inequalities:
(2.19) |
and
(2.20) |
Proof 2.4.
Since and , we obtain
The second inequality is derived by
Next, we will gather some useful properties of (2.3), which will serve as tools for the convergence analysis of the methods introduced in the following sections. Moreover, we will also have a look at the difficulties showing up in the theoretical analysis of (2.3). In general, the cost function in problem (2.3) as well as the set of policies satisfying (2.5) are non-convex. In order to cope with this problem of non-convexity, we use the so-called PL condition named after Polyak and Lojasiewicz, which is a relaxation of the notion of strong convexity. The PL condition is satisfied if there exists a universal constant such that for any policy with finite , we have
(2.21) |
where denotes a global optimum for (2.3). From Lemma 3 in [9], Lemma 4 in [16], and , we get that (2.21) is satisfied for (2.3) with
(2.22) |
We note that holds due to the optimality of .
By the PL condition (2.21), we know that stationary points of (2.3), i.e. , are global minima. Since , a policy is a stationary point if and only if
(2.23) |
We can verify that in (2.2) is the global minimum of the function . The optimization problem (2.3) may have more than one global minimum and, unfortunately, it is hard to derive the analytic form of all optimal policies in terms of , , , and .
To simplify the analysis, we assume that is Gaussian, that is . This assumption also guarantees that .
3 Policy evaluation
Given a fixed stable policy , it is desirable to know the expectation of the corresponding cumulative cost for an initial state , which in some sense evaluates “how good” it is to be in the state under the policy . This expectation is given by the so-called value function, which is defined below. Moreover, the expectation of the cumulative cost for taking an action in some initial state under the policy is given by the state-action value function, which is also defined below, see also [19].
The task of policy evaluation is to obtain good approximations of the value function of a fixed stable policy . In value-based methods, policy evaluation is an important and elementary step. In addition, policy evaluation plays the role of the critic in the actor-critic algorithm. The TD-learning method [18] is a prevalent method for policy evaluation. In this section, we discuss the usage of the TD-learning method in the LQR setting.
First, we compute the value function and state-action value function for a stable policy . The value function, which gives the state value for the policy has the following explicit form:
(3.24) |
and the state-action value of the policy is given by
(3.25)
where is the trajectory distribution of the system with policy .
As we will see below, the value function of can be written as a function which is linear with respect to the feature function
for , where denotes the vectorization operator stacking the columns of some matrix A on top of one another, i.e.,
Therefore, it is natural to propose a class of linear approximation functions with feature by
where is the parameter we seek to estimate.
To this end, we observe that , where and . For each stable policy , our aim is to find a parameter such that approximates well in expectation. This is carried out by minimizing a suitable loss function, for which we have to find a “good” distribution. A reasonable choice for this distribution is the stationary distribution where
for any stable . It is easy to verify that if , then as well. This explains why we call it the stationary distribution. We also use the distribution to represent in short.
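Since the closed-loop dynamics are linear with Gaussian noise, the stationary distribution is a zero-mean Gaussian whose covariance solves a Lyapunov equation. The sketch below computes this covariance by fixed-point iteration; it assumes the noise model x' = (A + B K) x + sigma * B * zeta + eps with eps ~ N(0, W), which is one natural reading of the setting above rather than the paper's exact notation.

```python
import numpy as np

def stationary_covariance(A, B, K, W, sigma=0.1, iters=5000, tol=1e-12):
    """Covariance of the stationary distribution of the closed-loop chain
    x' = (A + B K) x + sigma * B * zeta + eps,  zeta ~ N(0, I),  eps ~ N(0, W):
    the fixed point of Sigma = M Sigma M' + sigma^2 B B' + W with M = A + B K."""
    M = A + B @ K
    noise_cov = sigma**2 * B @ B.T + W
    Sigma = np.zeros_like(W)
    for _ in range(iters):
        Sigma_next = M @ Sigma @ M.T + noise_cov
        if np.max(np.abs(Sigma_next - Sigma)) < tol:
            return Sigma_next
        Sigma = Sigma_next
    return Sigma
```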
Remark 3.1.
For the sake of simplicity, we introduce the abbreviation
since . When there is only the variable in the expectation, we write instead of to represent this expectation. Otherwise, we use the notation for convenience.
Then we define the loss function
such that the gradient of the loss function is given by
However, in practice, since the real value function is unknown, one often uses the biased estimate of to replace , where is the subsequent state of . For convenience, let , which is called the TD error.
We further note that . Then we obtain the semi-gradient
(3.26) |
and it is quite straightforward to get the stochastic semi-gradient
(3.27) |
Starting with some initial parameter and using the gradient descent method with or as the update, the TD-learning method with linear function approximation is described in Algorithm 1.
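A minimal sketch of Algorithm 1 with the stochastic semi-gradient update is given below. It uses the quadratic-plus-constant features described above, so the learned parameter contains an estimate of the value matrix of the fixed policy; the simulated system, the noise scale and the step size are illustrative assumptions.

```python
import numpy as np

def features(x):
    """Quadratic-plus-constant features: (vec(x x'), 1)."""
    return np.concatenate([np.outer(x, x).ravel(), [1.0]])

def td0_policy_evaluation(A, B, Q, R, K, gamma, sigma=0.1,
                          steps=50_000, alpha=1e-3, seed=0):
    """TD(0) with linear function approximation V_theta(x) = theta' phi(x)
    for a fixed stable Gaussian policy u = K x + sigma * zeta."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    theta = np.zeros(n * n + 1)
    x = rng.normal(size=n)
    for _ in range(steps):
        u = K @ x + sigma * rng.normal(size=k)
        cost = x @ Q @ x + u @ R @ u
        x_next = A @ x + B @ u + 0.1 * rng.normal(size=n)
        # TD error: cost + gamma * V(x') - V(x)
        delta = cost + gamma * theta @ features(x_next) - theta @ features(x)
        theta += alpha * delta * features(x)        # stochastic semi-gradient step
        x = x_next
    P_hat = theta[:-1].reshape(n, n)                # estimated value matrix
    return 0.5 * (P_hat + P_hat.T), theta[-1]       # symmetrize; constant offset
```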
The following lemma shows that is a minimum of .
Lemma 3.2.
For any fixed stable policy , we have
(3.28) |
Proof 3.3.
For the first element in the semi-gradient, we get
by the definition of . For the other elements we will consider them in matrix form
where we have used that is independent of . By the definition of , we have . For the constant term, it holds . Thus, we obtain .
Assumption 3.4.
Remark 3.5.
In some related works, projected gradient descent is used to guarantee the boundedness of , although the projection is usually not used in practice.
Lemma 3.6.
There exists a positive number , such that for all and
(3.29) |
holds. In addition, is continuous with respect to .
Proof 3.7.
Set , where is symmetric.
Thus, we can define
Since norms of matrices are equivalent and for , we obtain from (3.30) that is positive and continuous with respect to . Besides, . Using this inequality in connection with (3.30) we obtain
where the third inequality follows by the Cauchy-Schwarz inequality.
Thus, we conclude the existence of such a positive number which is continuous with respect to .
Under Assumption 3.4, the constant has a uniform lower bound on a compact subset of . Then we can use basic tools from the analysis of the gradient descent method to prove the convergence of Algorithm 1.
Theorem 3.8.
Suppose that Assumption 3.4 holds and that Algorithm 1 is run with the semi-gradient update using to generate . Then the following inequality holds for all :
(3.31) |
Proof 3.9.
By the definition of and the fact that , we know
(3.32)
Thus, is an th order polynomial with respect to , and . Then we can obtain the bound .
We next show that Algorithm 1 also converges if the stochastic semi-gradients are used.
Theorem 3.10.
Suppose that Assumption 3.4 holds and let Algorithm 1 be run with the stochastic semi-gradient update using to generate . Then, for all it holds
(3.33) |
Proof 3.11.
The key step to prove (3.31) is based on an important gradient descent inequality. In order to establish this inequality, we have to estimate two terms at first. For the descent term , we get the following lower bound:
(3.34)
where the last inequality follows from the Cauchy-Schwarz inequality. Then we split the norm of into two parts:
(3.35)
Then we get the upper bound of and by the Cauchy-Schwarz inequality:
(3.36)
In particular, the last equality holds by Proposition A.1. in [23]. Analogously, has the same upper bound such that
(3.37) |
where we used .
Furthermore, has a uniform upper bound with respect to on a compact subset of , since roughly. Then we can derive convergence of under the norm by Lemma 3.6.
4 Policy Iteration
Policy iteration (PI) [12] is a fundamental value-based method for Markov decision process (MDP) problems with finitely many actions. It can also be applied to LQRs, since the state-action value function is quadratic, and the action minimizing the value function can easily be found by solving linear equations.
Given a policy and the corresponding state-action function of the form (3.25), the policy can be improved by selecting the action for a fixed state such that the value of is minimal:
Thus, we obtain an improved policy with
(4.40) |
Observing the gradient (2.18), there is another form of (4.40):
(4.41) |
This form corresponds to the Gauss-Newton method in [9] with stepsize . Hence, by the discussion in [9], we obtain the convergence of policy iteration with the exact state-action value function and, in addition, the fact that value-based methods are faster than the policy gradient method. Moreover, from Lemma 8 in [9], we obtain that . This implies that if is stable, then is also stable provided that is sufficiently close to , i.e., if satisfies (2.17) with , see Lemma 2.1.
However, we do not know the state-action value function exactly, since we do not know the system exactly. By the results of the preceding discussion of policy evaluation, we can obtain an approximation of . This approximation is used in a policy iteration scheme, whose convergence we want to analyze. The approximation of the state-action value function has the following form:
For any stable policy , we denote the parameters of by . If , we call an approximation of the state-action value function. The policy can also be improved by using the following approximation for the state action value:
(4.42) |
Thus, we obtain the policy iteration algorithm.
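The improvement step in Algorithm 2 only needs the quadratic part of the (approximate) state-action value function. The sketch below assumes that the estimate has the standard block-quadratic form in the stacked vector (x, u); the helper exact_q_matrix, which gives the exact blocks in the model-based case, is included only as a sanity check, and its formulas are assumptions of the sketch.

```python
import numpy as np

def improved_policy_from_q(H_hat, n, k):
    """Greedy improvement for a quadratic Q-function estimate
    Q_hat(x, u) = [x; u]' H_hat [x; u] + const:
    the minimizer over u for fixed x is u = K_new x with K_new = -H_uu^{-1} H_ux."""
    H_uu = H_hat[n:, n:]
    H_ux = H_hat[n:, :n]
    return -np.linalg.solve(H_uu, H_ux)

def exact_q_matrix(A, B, Q, R, P_K, gamma):
    """Exact quadratic part of the state-action value under the standard
    discounted LQR model, useful for checking an estimated H_hat."""
    H_xx = Q + gamma * A.T @ P_K @ A
    H_xu = gamma * A.T @ P_K @ B
    H_uu = R + gamma * B.T @ P_K @ B
    return np.block([[H_xx, H_xu], [H_xu.T, H_uu]])
```

With the exact blocks, the improvement coincides with the exact policy iteration update (4.40).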
As we will prove below, the policy iteration with approximate state-action value functions also converges in the LQR-setting if the error between the approximation value and the true value is small enough. For any stable initial policy and any , we assume that the error tolerance between the approximation value and the true value satisfies
(4.43) |
where only depends on , , , , and .
Theorem 4.1.
For any stable initial policy and any , suppose that an approximation of the state action value function is known, where satisfies (4.43). If is sufficiently close to , then for , Algorithm 2 yields stable policies which are elements of Dom and satisfy
(4.44) |
Proof 4.2.
We first introduce some notation which is used in this proof. Let be an approximation of the state action value function and denote by the true value, i.e., . Then the improved policy is given by . Next, we obtain from Lemma 6 in [9] that
(4.45)
where . Furthermore, we note that and holds by (4.40). Let and . Then we compute the difference between and the policy generated by the true policy iteration:
(4.46)
with . Next, we note that (4.41) implies
(4.47) |
which together with (4.46) yields
(4.48) |
Since , it holds , and has the upper bound
(4.49) |
where the second inequality follows by (2.19) with
Using (4.45) and (4.48), the difference of the cumulative costs between the original policy and the improved policy can be represented as:
(4.50) |
For the first term in (4.50), we have
(4.51)
where the second inequality is derived by the fact and Lemma 11 in the supplementary material of [9]. More precisely, one can check that this lemma is also valid for the setting in this paper. For the second term in (4.50), using (2.20) and (4.49) we get the upper bound
(4.52)
By (4.51) and (4.52), it is straightforward to obtain the following inequality:
(4.53) |
where and .
We can start with and set . Then let be small enough such that . Thus, the bound of is . Next, we show that
(4.54) |
which implies that the inequality (4.53) holds with and for all iterates.
We use induction to prove this uniform bound (4.54). When , this inequality (4.54) holds obviously. Then we assume that (4.54) holds with . Using in connection with (4.53), we observe that if , then
by the inequality (4.53). Otherwise it holds . Thus, the bound (4.54) holds for all .
Hence, we have the following inequality:
(4.55) |
Furthermore, we also require that , which is equivalent to . Thus, the upper bound of should be , where
Then we can verify that such that (4.44) is proved. Finally, we assume w.l.o.g. that . We can guarantee that . Inserting this in (4.55), we obtain
Hence, Dom Dom holds by Lemma 2.1 for all iterates, if is sufficiently close to .
Using the definition of , we observe that the approximation parameter has an upper bound . Thus, for each TD-learning run with a fixed , samples are needed by Lemma 3.6 and Theorem 3.8. However, if we use stochastic semi-gradient descent, the sample complexity for each policy evaluation of this algorithm becomes by Theorem 3.10.
5 The Policy Gradient Method
In RL, the policy gradient method [25, 21] is widely used. In this section, we apply the policy gradient method to the problem (2.3) and analyze the convergence of this method.
In order to compute the policy gradient, we have to know the score function of the policy , which is given by
(5.56) |
By the policy gradient theorem in [21], we obtain the policy gradient
(5.57)
The policy gradient (5.57) is equivalent to , which is shown in the proof of Lemma 5.1. From the representation of the gradient in (5.57), it is straightforward to design an estimator. After collecting triples generated by the policy in problem , we can compute the sample gradient:
(5.58) |
We apply the stochastic gradient descent method to the problem (2.3), which is summarized in Algorithm 3.
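A sketch of the sample gradient for the Gaussian linear policy is given below: it combines the score function (u - K x) x' / sigma^2 with the discounted cost-to-go of each visited state-action pair and averages over trajectories. The rollout model, the finite horizon and all parameter names are illustrative assumptions.

```python
import numpy as np

def sample_policy_gradient(A, B, Q, R, K, gamma, sigma=0.1,
                           num_traj=20, horizon=200, seed=0):
    """Monte Carlo policy gradient estimate for the policy u = K x + sigma * zeta:
    average over trajectories of sum_t gamma^t * score_t * (cost-to-go from t)."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    grad = np.zeros_like(K)
    for _ in range(num_traj):
        xs, us, cs = [], [], []
        x = rng.normal(size=n)
        for _ in range(horizon):
            u = K @ x + sigma * rng.normal(size=k)
            xs.append(x)
            us.append(u)
            cs.append(x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u + 0.1 * rng.normal(size=n)
        # discounted cost-to-go G_t = sum_{s >= t} gamma^(s - t) * c_s
        Gs, G = [0.0] * horizon, 0.0
        for t in reversed(range(horizon)):
            G = cs[t] + gamma * G
            Gs[t] = G
        for t in range(horizon):
            score = np.outer(us[t] - K @ xs[t], xs[t]) / sigma**2   # grad_K log pi
            grad += gamma**t * Gs[t] * score
    return grad / num_traj
```

A stochastic gradient descent step then updates the policy in the direction of the negative sample gradient, with a step size chosen as in Theorem 5.4.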
Since the estimator is biased, we have to control the bias by increasing the length . The next lemma shows how the parameter influences the bias and the variance of the estimator .
Lemma 5.1.
Let be a stable policy. Then it holds
(5.59)
(5.60)
where and the constants and depend on , , , , , , , , , and .
Proof 5.2.
Define the event fields generated by and the operator . We observe that and are independent of . By the definition of the value function in (3.24), it is straightforward to obtain the conditional expectation of the cumulative cost:
First, we claim that is equivalent to . To this end, we verify the following identity:
(5.61)
The third equality is valid since and
(5.62) |
which holds due to the symmetry of . In the last equality of (5.61), we have used similar arguments in connection with the fact . Taking the sum , applying to both sides of (5.61) and multiplying by yields , see (2.18).
Since is the estimator of , we focus on the bound of the bias and split the bias into two parts
(5.63)
where we have used that . In the following, we will further simplify and estimate the above two terms, respectively. To this end, we show for each that
(5.64) |
For , we observe that
(5.65)
and for it holds
(5.66)
Since is stable, (2.4) yields
(5.67) |
for some constant . Using this, (2.19) and (5.64), we can estimate the second term in (5.63) as follows
(5.68)
By using (5.61), (5.67) and (2.19), the first term in (5.63) can be estimated as:
(5.69)
Combining (5.68) and (5.69) yields (5.59) with
Finally, we derive a bound for the variance of .
(5.70)
Hence, we only need to estimate each term of (5.70):
(5.71) |
Let . Then and is independent from . We can estimate the bound of :
(5.72)
We observe that (5.72) is bounded yielding that (5.71) is bounded by a constant . Inserting this in (5.70), we get the estimation
(5.73) |
such that (5.60) holds with .
Assumption 5.3.
There is a positive number such that for any stable .
At the end of this section, we present a theorem that guarantees the convergence of Algorithm 3. In order to prove this convergence, we need to assume that the error tolerance and the confidence are small enough such that the following inequality holds:
(5.74) |
Theorem 5.4.
Let Assumption 5.3 hold, let be stable and suppose that is sufficiently close to and is small enough such that (2.17) is satisfied with and . For any error tolerance and confidence satisfying (5.74), suppose that the sample size is large enough such that
with and the step size is chosen to satisfy . After iterations with , Algorithm 3 yields an iterate such that
holds with probability greater than .
Proof 5.5.
For any , define the error and the stopping time
We first note that Lemma 2.1 yields Dom for all since and is chosen such that (2.16) and (2.17) hold with and . Hence, and are both uniformly bounded on Dom.
Next, for simplicity, we define as the expectation operator conditioned on the sigma field which contains all the randomness of the first iterates. Since the gradient is locally Lipschitz, which is shown in [16], there are a uniform Lipschitz constant and a uniform radius such that
(5.75) |
for any and with . We choose sufficiently small such that .
Let L be sufficiently large such that . Using this and in particular Lemma 5.1, we obtain
(5.76)
where . We note that in the last estimation of (5.76) the inequality
is used. We assume that . By the PL condition (2.21), we have
(5.77)
Applying this inequality successively, we obtain a result similar to the one in [16]:
where we have used that . By (5.60), we observe that taking implies . We note that this condition on as well as and are satisfied for . Setting , we observe that for
with a sufficiently large constant the inequality holds.
By using the same techniques as in the proof of Proposition 1 in [16], we observe that . By Chebyshev’s inequality, we have
This completes the proof.
6 The Actor-Critic Algorithm
In the policy gradient algorithm, we update the policy purely through sampled returns. Hence, the policy gradient estimator has high variance, which slows down convergence. A popular method to reduce the variance is the Actor-Critic algorithm, which replaces the Monte Carlo estimate by a bootstrapped estimate. The policy gradient in the Actor-Critic algorithm has the following form:
(6.78) |
where . We investigate the bias and the variance of the estimators.
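The following sketch of the actor-critic gradient estimate keeps the score-function form of the Monte Carlo estimator but replaces the cost-to-go by the bootstrapped TD error computed from a critic of the form V_hat(x) = x' P_hat x + b_hat, for instance the output of the TD-learning sketch in Section 3. The rollout model and the exact weighting are assumptions of the sketch.

```python
import numpy as np

def actor_critic_gradient(A, B, Q, R, K, gamma, P_hat, b_hat,
                          sigma=0.1, num_traj=20, horizon=200, seed=0):
    """Actor-critic gradient estimate: the Monte Carlo cost-to-go is replaced by
    the TD error delta_t = c_t + gamma * V_hat(x_{t+1}) - V_hat(x_t),
    where the critic V_hat(x) = x' P_hat x + b_hat comes from policy evaluation."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    V_hat = lambda x: x @ P_hat @ x + b_hat
    grad = np.zeros_like(K)
    for _ in range(num_traj):
        x = rng.normal(size=n)
        for t in range(horizon):
            u = K @ x + sigma * rng.normal(size=k)
            cost = x @ Q @ x + u @ R @ u
            x_next = A @ x + B @ u + 0.1 * rng.normal(size=n)
            delta = cost + gamma * V_hat(x_next) - V_hat(x)        # TD error
            grad += gamma**t * delta * np.outer(u - K @ x, x) / sigma**2
            x = x_next
    return grad / num_traj
```

Because the TD error typically has a much smaller magnitude than the full cost-to-go, the resulting estimator tends to have lower variance, which is the effect quantified in Lemma 6.1.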
Lemma 6.1.
Proof 6.2.
Analogously to the proof of Lemma 5.1, we can split the bias of into two parts:
(6.81)
We have discussed the first term in the proof of Lemma 5.1 and its upper bound is . Let and . We observe that the second term of (6.81) has the following equivalent form:
(6.82)
Since and is an -approximation of , the right-hand side of (6.82) is bounded by , which yields (6.79).
Since by (2.19) and , the following inequality holds:
(6.83)
Hence, we obtain the inequality (6.80):
(6.84)
where the third inequality holds by the same technique as in (3.32) and
In the estimates above, techniques similar to those in (5.70) are used.
Now we can obtain a convergence result for the AC algorithm which is similar to the one in Theorem 5.4. In order to prove this result, we need the following assumption.
Assumption 6.3.
There is a positive number such that for any stable .
Theorem 6.4.
Suppose that Assumption 6.3 is satisfied and let be a stable policy. Moreover, suppose that is sufficiently close to and small enough such that (2.17) holds with . For any error tolerance and confidence satisfying (5.74), suppose that the sample size is large enough and the error of the approximate value function is small enough such that
The step size is chosen such that holds. After iterations with , the iterate satisfies
with probability greater than .
Proof 6.5.
The proof is similar to the proof of Theorem 5.4.
Finally, we analyze the sample complexity of the policy gradient method and the AC algorithm. We assume that the discount factor with is close to such that . By the definition of and in particular by its representation in (2.8), we obtain . For the AC algorithm, we have to require that . Then the TD-learning algorithm needs steps by Theorem 3.8. However, the sample size in the AC algorithm is only . We can sample trajectories in parallel; then the variance of the gradient becomes . Using arguments similar to those in the proof of Theorem 5.4, one can show that the number of iterations of the AC algorithm equals . From the statements above, we conclude the complexities given in Table 1.
7 Conclusion
Reinforcement learning has achieved success in many fields but lacks theoretical understanding in the continuous setting. In this paper, we apply well-known RL algorithms to a basic model, the LQR. First, we show the convergence of policy iteration with TD-learning, which is hard to prove in the general case. Then we obtain the linear convergence of the policy gradient method and the AC algorithm. Finally, we compare the sample complexities of these algorithms.
The results of this paper are proved for the LQR setting, which allows us to restrict ourselves to linear policies. For extensions to more general problems, this restriction to a linear framework is no longer possible. Consequently, the policy function may depend nonlinearly on its parameters, which makes, in particular, the convergence analysis of optimization methods such as the policy gradient method much more involved. A further difficulty is that the PL condition, which guarantees that stationary policies are globally optimal, may no longer hold. However, since the LQR can be used as an approximation of more general nonlinear problems, the techniques developed in this paper can serve as an important tool for the treatment of more general problems.
Acknowledgments.
This research is supported in part by NSFC grant 11831002 and the Beijing Academy of Artificial Intelligence.
References
- [1] Y. Abbasi-Yadkori and C. Szepesvári, Regret bounds for the adaptive control of linear quadratic systems, in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
- [2] J. Bhandari, D. Russo, and R. Singal, A finite time analysis of temporal difference learning with linear function approximation, in Conference On Learning Theory, 2018, pp. 1691–1692.
- [3] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári, Convergent temporal-difference learning with arbitrary smooth function approximation, in Advances in Neural Information Processing Systems, 2009, pp. 1204–1212.
- [4] V. S. Borkar and S. P. Meyn, The ode method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), pp. 447–469.
- [5] S. J. Bradtke and A. G. Barto, Linear least-squares algorithms for temporal difference learning, Machine learning, 22 (1996), pp. 33–57.
- [6] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, Sbeed: Convergent reinforcement learning with nonlinear function approximation, in International Conference on Machine Learning, 2018, pp. 1125–1134.
- [7] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, Regret bounds for robust adaptive control of the linear quadratic regulator, in Advances in Neural Information Processing Systems, 2018, pp. 4188–4197.
- [8] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, On the sample complexity of the linear quadratic regulator, Foundations of Computational Mathematics, (2019), pp. 1–47.
- [9] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator, in International Conference on Machine Learning, 2018, pp. 1467–1476.
- [10] J. Kober, J. A. Bagnell, and J. Peters, Reinforcement learning in robotics: A survey, The International Journal of Robotics Research, 32 (2013), pp. 1238–1274.
- [11] V. R. Konda and J. N. Tsitsiklis, Actor-critic algorithms, in Advances in neural information processing systems, 2000, pp. 1008–1014.
- [12] M. G. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of machine learning research, 4 (2003), pp. 1107–1149.
- [13] A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of LSTD, in International Conference on Machine Learning, 2010.
- [14] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik, Finite-sample analysis of proximal gradient TD algorithms, in UAI, 2015, pp. 504–513.
- [15] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, Applications of deep learning and reinforcement learning to biological data, IEEE transactions on neural networks and learning systems, 29 (2018), pp. 2063–2079.
- [16] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright, Derivative-free methods for policy optimization: Guarantees for linear quadratic systems, in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 2916–2925.
- [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature, 518 (2015), pp. 529–533.
- [18] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine learning, 3 (1988), pp. 9–44.
- [19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
- [20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, Fast gradient-descent methods for temporal-difference learning with linear function approximation, in Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 993–1000.
- [21] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in Advances in Neural Information Processing Systems, 12 (1999), pp. 1057–1063.
- [22] J. N. Tsitsiklis and B. Van Roy, Analysis of temporal-difference learning with function approximation, in Advances in Neural Information Processing Systems, 1997, pp. 1075–1081.
- [23] S. Tu and B. Recht, Least-squares temporal difference learning for the linear quadratic regulator, in International Conference on Machine Learning, 2018, pp. 5005–5014.
- [24] C. J. Watkins and P. Dayan, Q-learning, Machine learning, 8 (1992), pp. 279–292.
- [25] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning, 8 (1992), pp. 229–256.
- [26] K. Zhang, A. Koppel, H. Zhu, and T. Başar, Global convergence of policy gradient methods to (almost) locally optimal policies, arXiv preprint arXiv:1906.08383, (2019).
- [27] S. Zou, T. Xu, and Y. Liang, Finite-sample analysis for sarsa with linear function approximation, in Advances in Neural Information Processing Systems, 2019, pp. 8668–8678.