
Sample Complexity of Linear Quadratic Regulator
Without Initial Stability

Amirreza Neshaei Moghaddam Department of Electrical and Computer Engineering
University of California, Los Angeles
amirnesha@ucla.edu
Alex Olshevsky Department of Electrical and Computer Engineering
Boston University
alexols@bu.edu
 and  Bahman Gharesifard Department of Electrical and Computer Engineering
University of California, Los Angeles
gharesifard@ucla.edu
Abstract.

Inspired by REINFORCE, we introduce a novel receding-horizon algorithm for the Linear Quadratic Regulator (LQR) problem with unknown parameters. Unlike prior methods, our algorithm avoids reliance on two-point gradient estimates while maintaining the same order of sample complexity. Furthermore, it eliminates the restrictive requirement of starting with a stable initial policy, broadening its applicability. Beyond these improvements, we introduce a refined analysis of error propagation through the contraction of the Riemannian distance over the Riccati operator. This refinement leads to a better sample complexity and ensures improved convergence guarantees. Numerical simulations validate the theoretical results, demonstrating the method’s practical feasibility and performance in realistic scenarios.

1. Introduction

The Linear Quadratic Regulator (LQR) problem, a cornerstone of optimal control theory, offers an analytically tractable framework for controlling linear systems with quadratic costs. Traditional methods rely on complete knowledge of system dynamics, solving the Algebraic Riccati Equation [2] to determine optimal control policies. However, real-world scenarios often involve incomplete or inaccurate models. Classical methods in control theory, such as identification theory [6] and adaptive control [1], were specifically designed to provide guarantees for decision-making in scenarios with unknown parameters. However, the problem of effectively approximating the optimal policy using these methods remains underexplored in the traditional literature. Recent efforts have sought to bridge this gap by analyzing the sample complexity of learning-based approaches to LQR [4], providing bounds on control performance relative to the amount of data available.

In contrast, the model-free approach, rooted in reinforcement learning (RL), bypasses the need for explicit dynamics identification, instead focusing on direct policy optimization through cost evaluations. Recent advances leverage stochastic zero-order optimization techniques, including policy gradient methods, to achieve provable convergence to near-optimal solutions despite the inherent non-convexity of the LQR cost landscape. Foundational works, such as [5], established the feasibility of such methods despite the non-convexity of the problem, demonstrating convergence under random initialization. Subsequent efforts, including [11] and [13], have refined these techniques, achieving improved sample complexity bounds. Notably, all of these works assume that the initial policy is stabilizing.

A key limitation of these methods, including [11, 13], is the reliance on two-point gradient estimation, which requires evaluating costs for two different policies while maintaining identical initial states. In practice, this assumption is often infeasible, as the initial state is typically chosen randomly and cannot be controlled externally. Our earlier work [12] addressed this challenge, establishing the best-known result among methods that assume initial stability without having to rely on two-point estimates. Instead, we proposed a one-point gradient estimation method, inspired by REINFORCE [19, 17], that achieves the same convergence rate as the two-point method [11] using only a single cost evaluation at each step. This approach enhances both the practical applicability and theoretical robustness of model-free methods, setting a new benchmark under the initial stability assumption.

The requirement for an initial stabilizing policy significantly limits the utility of these methods in practice. Finding such a policy can be challenging or infeasible and often relies on identification techniques, which model-free methods are designed to avoid. Without getting technical at this point, it is worth pointing out that this initial stability assumption plays a major role in the construction of the mentioned model-free methods, and cannot be removed easily. For instance, this assumption ensures favorable optimization properties, like coercivity and gradient domination, that simplify convergence analysis. In this sense, removing this assumption while maintaining stability and convergence guarantees is essential to generalize policy gradient methods, a challenge that has remained an active research topic [20, 14, 9, 22].

As also pointed out in [20], the \gamma-discounted LQR problems studied in [14, 9, 22] are equivalent to the standard non-discounted LQR with system matrices scaled by \sqrt{\gamma}. In [14, 9, 22], this scaling results in an enlarged set of stabilizing policies when \gamma is sufficiently small, enabling policy gradient algorithms to start from an arbitrary policy. However, as noted in [20], this comes at the cost of solving multiple LQR instances rather than a single one, increasing computational overhead. Furthermore, the optimization landscape in the discounted setting remains fundamentally the same as in the undiscounted case, as described in [5, 11]. Consequently, the same difficulties mentioned in [8, 18] persist when extending these methods to output-feedback settings, where additional estimation errors complicate policy search. In contrast, receding-horizon approaches [20] provide a more direct and extensible framework for tackling such challenges [21].

This paper builds on the receding-horizon policy gradient framework introduced in [20], a significant step towards eliminating the need for a stabilizing initial policy by recursively updating finite-horizon costs. While the approach proposed in this work marks an important step forward in model-free LQR, we address the reliance on the two-point gradient estimation, a known limitation discussed earlier. Building on the gradient estimation approach from our earlier work [12], we adapt the core idea to accommodate the new setup that eliminates the initial stability assumption. Specifically, our modified method retains the same convergence rate while overcoming the restrictive assumptions of two-point estimation. Beyond these modifications, we introduce a refined convergence analysis, via an argument based on a Riemannian distance function [3], which significantly improves the propagation of errors. This ensures that the accumulated error remains linear in the horizon length, in contrast to the exponential growth in [20]. As a result, we achieve a uniform sample complexity bound of \widetilde{\mathcal{O}}(\varepsilon^{-2}), independent of problem-specific constants, thereby offering a more scalable and robust policy search framework.

1.1. Algorithm and Paper Structure Overview

The paper is structured as follows. Section II presents the necessary preliminaries and establishes the notation used throughout the paper. Section III introduces our proposed algorithm, which operates through a hierarchical double-loop structure: an outer loop that constructs a surrogate cost function in a receding-horizon manner, and an inner loop that applies a policy gradient method to estimate its optimal policy. Section IV delves deeper into the policy gradient method employed in the inner loop, providing rigorous convergence results and theoretical guarantees for this critical component of the algorithm. Section V includes the sample complexity bounds and comparisons with the results in the literature. Finally, we provide simulation studies verifying our findings in Section VI.

To be more specific, the core idea of the algorithm leverages the observation that, for any error tolerance \varepsilon, there exists a sufficiently large finite horizon N for which the sequence of policies minimizing recursively updated finite-horizon costs approximates the optimal policy for the infinite-horizon cost within an \varepsilon-neighborhood. This insight motivates the algorithm’s design: a recursive outer loop that iteratively refines the surrogate cost function over a sequence of finite horizons, and an inner loop that employs policy gradient methods to approximate the optimal policy for each of these costs. Specifically, in the outer loop, the algorithm updates the surrogate cost and the associated policy at each horizon step h, starting from the terminal horizon h=N-1 and moving backward to h=0. At each step h, the inner loop applies a policy gradient method to compute an approximately optimal policy for the finite-horizon cost over the interval [h,N]. This step generates a surrogate policy \widetilde{K}_{h}, which is then incorporated into the cost function of the subsequent step in the outer loop.

The main difficulty in analyzing the proposed algorithm stems from the fact that the approximation errors from the policy gradient method in the inner loop propagate across all steps of the outer loop. To ensure overall convergence, the algorithm imposes a requirement on the accuracy of the policy gradient method in the inner loop. Each policy obtained must be sufficiently close to the optimal policy for the corresponding finite-horizon cost. This guarantees that the final policy at the last step of the outer loop converges to the true optimal policy for the infinite-horizon cost.

2. Preliminaries

In this section, we gather the required notation, closely following that of [20], on which our work builds. Consider the discrete-time linear system

(1) x_{t+1}=Ax_{t}+Bu_{t},

where x_{t}\in{\mathbb{R}}^{n} is the system state at time t, u_{t}\in{\mathbb{R}}^{m} is the control input at time t\geq 0, and A\in{\mathbb{R}}^{n\times n} and B\in{\mathbb{R}}^{n\times m} are the system matrices, which are unknown to the control designer. Crucially, the initial state x_{0} is sampled randomly from a distribution {\mathcal{D}} and satisfies

(2) {\mathbb{E}}[x_{0}]=0,\quad{\mathbb{E}}[x_{0}x_{0}^{\top}]=\Sigma_{0},\ \text{and}\ \|x_{0}\|^{2}\leq C_{m}\ \text{a.s.}
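One concrete distribution satisfying (2), used here purely for illustration, draws x_{0}=\Sigma_{0}^{1/2}z with z uniform on the sphere of radius \sqrt{n}, so that {\mathbb{E}}[x_{0}x_{0}^{\top}]=\Sigma_{0} and \|x_{0}\|^{2}\leq n\|\Sigma_{0}\| almost surely (i.e., C_{m}=n\|\Sigma_{0}\| works). A minimal Python sketch of such a sampler (hypothetical helper, not part of the algorithm) is:

```python
import numpy as np
from scipy.linalg import sqrtm

def make_sampler(Sigma0, rng):
    """Sampler for (2): zero mean, covariance Sigma0, almost-surely bounded norm."""
    n = Sigma0.shape[0]
    S = np.real(sqrtm(Sigma0))
    def sample_x0():
        z = rng.standard_normal(n)
        z *= np.sqrt(n) / np.linalg.norm(z)   # uniform on the sphere of radius sqrt(n)
        return S @ z                          # E[x0 x0^T] = Sigma0, ||x0||^2 <= n ||Sigma0||
    return sample_x0

# Usage: sample_x0 = make_sampler(np.eye(2), np.random.default_rng(0))
```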

The objective in the LQR problem is to find the optimal controller that minimizes the following cost

J_{\infty}={\mathbb{E}}_{x_{0}\sim{\mathcal{D}}}\left[\sum_{t=0}^{\infty}x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t}\right],

where Q\in{\mathbb{R}}^{n\times n} and R\in{\mathbb{R}}^{m\times m} are the symmetric positive definite matrices that parameterize the cost. We require the pair (A,B) to be stabilizable, and since Q\succ 0, the pair (A,Q^{1/2}) is observable. As a result, the unique optimal controller is a linear state feedback u^{\ast}_{t}=-K^{*}x_{t}, where K^{*} is given by

(3) K^{*}=(R+B^{\top}P^{*}B)^{-1}B^{\top}P^{*}A,

and P^{*} denotes the unique positive definite solution to the discrete-time algebraic Riccati equation (ARE) [2]:

(4) P=A^{\top}PA-A^{\top}PB(R+B^{\top}PB)^{-1}B^{\top}PA+Q.
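When A and B are known, (3)-(4) can of course be solved directly; the following minimal Python sketch (illustrative values only, using SciPy's solve_discrete_are) computes P^{*} and K^{*}, and can serve as a reference when measuring the optimality gap \|\widetilde{K}_{0}-K^{*}\| of a model-free method.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Example system (illustrative values, not from the paper).
A = np.array([[1.0, 0.5], [0.0, 1.1]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# P* : unique positive definite solution of the ARE (4).
P_star = solve_discrete_are(A, B, Q, R)

# K* : optimal feedback gain from (3), so that u_t = -K* x_t.
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)
```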

2.1. Notations

We use \|X\|, \|X\|_{F}, \sigma_{\min}(X), and \kappa_{X} to denote the 2-norm, Frobenius norm, minimum singular value, and condition number of a matrix X, respectively. We also use \rho(X) to denote the spectral radius of a square matrix X. Moreover, for a positive definite matrix W of appropriate dimensions, we define the W-induced norm of a matrix X as

\|X\|_{W}^{2}:=\sup_{z\neq 0}\frac{z^{\top}X^{\top}WXz}{z^{\top}Wz}.

Following the notation in [20], we denote the P^{*}-induced norm by \|X\|_{*}. Furthermore, we denote the Riemannian distance [3] between two positive definite matrices U,V\in{\mathbb{R}}^{n\times n} by

\delta(U,V)=\left(\sum_{i=1}^{n}\log^{2}\lambda_{i}(UV^{-1})\right)^{1/2}.
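For later reference, \delta(U,V) is straightforward to evaluate numerically; a minimal sketch (assuming NumPy/SciPy) is:

```python
import numpy as np
from scipy.linalg import eigh

def riemannian_distance(U, V):
    """delta(U, V) = sqrt(sum_i log^2 lambda_i(U V^{-1})) for positive definite U, V."""
    # For positive definite U and V, the eigenvalues of U V^{-1} coincide with the
    # generalized eigenvalues of the pair (U, V), which eigh returns as real positives.
    lam = eigh(U, V, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```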

We now introduce some important notation which will be used in describing the algorithm and the proof of the main result. Let N be the horizon length and h the initial time step. The true finite-horizon cost J_{h}(K_{h}) of a policy K_{h} is defined as

(5) J_{h}(K_{h}):={\mathbb{E}}_{x_{h}\sim\mathcal{D}}\Bigg{[}\sum_{t=h+1}^{N-1}x_{t}^{\top}\left(Q+(K^{*}_{t})^{\top}RK^{*}_{t}\right)x_{t}+x_{h}^{\top}\left(Q+K_{h}^{\top}RK_{h}\right)x_{h}+x_{N}^{\top}Q_{N}x_{N}\Bigg{]},

where:

  • x_{h}\sim\mathcal{D} denotes that the initial state x_{h} is drawn from the distribution \mathcal{D},

  • Q_{N} is the terminal cost matrix, which can be chosen arbitrarily (e.g., Q_{N}=0),

  • K_{h} is the feedback gain applied at step h,

  • K^{*}_{t} is the feedback gain at step t, to be formally defined via the Riccati difference equation in (21).

Finally, for all t\in\{h+1,\dotsc,N-1\}, the state evolves according to:

x_{t+1}=(A-BK^{*}_{t})x_{t},

with

x_{h+1}=(A-BK_{h})x_{h}.

We also define the surrogate cost

(6) \widetilde{J}_{h}(K_{h}):={\mathbb{E}}_{x_{h}\sim\mathcal{D}}\Bigg{[}\sum_{t=h+1}^{N-1}x_{t}^{\top}\left(Q+\widetilde{K}_{t}^{\top}R\widetilde{K}_{t}\right)x_{t}+x_{h}^{\top}\left(Q+K_{h}^{\top}RK_{h}\right)x_{h}+x_{N}^{\top}Q_{N}x_{N}\Bigg{]},

where \widetilde{K}_{t} is the feedback gain derived at step t of the outer loop of the algorithm, and for all t\in\{h+1,\dotsc,N-1\}, the state evolves as:

x_{t+1}=(A-B\widetilde{K}_{t})x_{t},

with

x_{h+1}=(A-BK_{h})x_{h}.

The key difference between \widetilde{J}_{h}(K_{h}) and J_{h}(K_{h}) lies in the use of \widetilde{K}_{t} versus K^{*}_{t} for t\in\{h+1,\dotsc,N-1\}. This distinction implies that \widetilde{J}_{h}(K_{h}) incorporates all errors from earlier steps, precisely those at \{N-1,\dotsc,h+1\}, as the procedure is recursive.

We now define several functions that facilitate the characterization of our gradient estimate, which uses ideas from our earlier work [12]. To start, we let

(7) \widetilde{J}_{h}(K_{h};x_{h}) :=\sum_{t=h+1}^{N-1}x_{t}^{\top}\left(Q+\widetilde{K}_{t}^{\top}R\widetilde{K}_{t}\right)x_{t}+x_{h}^{\top}(Q+K_{h}^{\top}RK_{h})x_{h}+x_{N}^{\top}Q_{N}x_{N}
(8) =x_{h}^{\top}(Q+K_{h}^{\top}RK_{h})x_{h}+x_{h}^{\top}(A-BK_{h})^{\top}\widetilde{P}_{h+1}(A-BK_{h})x_{h},

so that

\widetilde{J}_{h}(K_{h})={\mathbb{E}}_{x_{h}\sim\mathcal{D}}\left[\widetilde{J}_{h}(K_{h};x_{h})\right].

Using (8), we can compute the gradient of \widetilde{J}_{h}(K_{h};x_{h}) with respect to K_{h} as follows:

(9) \nabla\widetilde{J}_{h}(K_{h};x_{h}) =\nabla\Bigg{(}x_{h}^{\top}K_{h}^{\top}RK_{h}x_{h}+x_{h}^{\top}K_{h}^{\top}B^{\top}\widetilde{P}_{h+1}BK_{h}x_{h}-2x_{h}^{\top}A^{\top}\widetilde{P}_{h+1}BK_{h}x_{h}\Bigg{)}
(10) =2RK_{h}x_{h}x_{h}^{\top}+2B^{\top}\widetilde{P}_{h+1}BK_{h}x_{h}x_{h}^{\top}-2B^{\top}\widetilde{P}_{h+1}Ax_{h}x_{h}^{\top}
(11) =2\left((R+B^{\top}\widetilde{P}_{h+1}B)K_{h}-B^{\top}\widetilde{P}_{h+1}A\right)x_{h}x_{h}^{\top},

and thus,

(12) \nabla\widetilde{J}_{h}(K_{h}) ={\mathbb{E}}_{x_{h}\sim\mathcal{D}}\left[\nabla\widetilde{J}_{h}(K_{h};x_{h})\right]
(13) =2\left((R+B^{\top}\widetilde{P}_{h+1}B)K_{h}-B^{\top}\widetilde{P}_{h+1}A\right){\mathbb{E}}_{x_{h}\sim\mathcal{D}}\left[x_{h}x_{h}^{\top}\right]
(14) =2\left((R+B^{\top}\widetilde{P}_{h+1}B)K_{h}-B^{\top}\widetilde{P}_{h+1}A\right)\Sigma_{0}.
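The exact gradient (14) is not available to the algorithm, since it requires A, B, and \widetilde{P}_{h+1}; however, it is convenient as a ground-truth reference when testing gradient estimates on synthetic examples. A minimal sketch (hypothetical helper, assuming the model is known) is:

```python
import numpy as np

def exact_surrogate_grad(K, A, B, R, P_next, Sigma0):
    """Exact gradient (14) of the surrogate cost J_tilde_h at K.

    Used only as a reference for validating gradient estimates when A, B,
    and P_tilde_{h+1} (P_next) are available.
    """
    return 2.0 * ((R + B.T @ P_next @ B) @ K - B.T @ P_next @ A) @ Sigma0
```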

Moreover, we define

Q_{h}(x_{h},u_{h}) :=x_{h}^{\top}Qx_{h}+u_{h}^{\top}Ru_{h}+\sum_{t=h+1}^{N-1}x_{t}^{\top}\left(Q+\widetilde{K}_{t}^{\top}R\widetilde{K}_{t}\right)x_{t}+x_{N}^{\top}Q_{N}x_{N}
=x_{h}^{\top}Qx_{h}+u_{h}^{\top}Ru_{h}+\widetilde{J}_{h+1}(\widetilde{K}_{h+1};Ax_{h}+Bu_{h})
(15) =x_{h}^{\top}Qx_{h}+u_{h}^{\top}Ru_{h}+(Ax_{h}+Bu_{h})^{\top}\widetilde{P}_{h+1}(Ax_{h}+Bu_{h}),

so that

\widetilde{J}_{h}(K_{h};x_{h})=Q_{h}(x_{h},-K_{h}x_{h}),

and

\widetilde{J}_{h}(K_{h})={\mathbb{E}}_{x_{h}\sim\mathcal{D}}\left[Q_{h}(x_{h},-K_{h}x_{h})\right].

Having established the cost functions, we now introduce the notation used to describe the policies:

(16) K^{*}_{h} :=\operatorname{argmin}_{K_{h}}J_{h}(K_{h}),
(17) \widetilde{K}^{*}_{h} :=\operatorname{argmin}_{K_{h}}\widetilde{J}_{h}(K_{h}),

where K^{*}_{h} denotes the optimal policy for the true cost J_{h}(K_{h}), and \widetilde{K}^{*}_{h} denotes the optimal policy for the surrogate cost \widetilde{J}_{h}(K_{h}). Additionally, \widetilde{K}_{h} represents an estimate of \widetilde{K}^{*}_{h}. It is obtained using a policy gradient method in the inner loop of the algorithm, which is applied at each step h of the outer loop to minimize the surrogate cost \widetilde{J}_{h}(K_{h}).

We now move on to the recursive equations. First, we have

(18) \widetilde{P}_{h} =(A-B\widetilde{K}_{h})^{\top}\widetilde{P}_{h+1}(A-B\widetilde{K}_{h})+\widetilde{K}_{h}^{\top}R\widetilde{K}_{h}+Q,

where \widetilde{P}_{N}=Q_{N}. In addition,

(19) \widetilde{P}^{*}_{h} =(A-B\widetilde{K}^{*}_{h})^{\top}\widetilde{P}_{h+1}(A-B\widetilde{K}^{*}_{h})+(\widetilde{K}^{*}_{h})^{\top}R\widetilde{K}^{*}_{h}+Q,

where \widetilde{K}^{*}_{h} from (17) can also be computed from

\widetilde{K}^{*}_{h}=(R+B^{\top}\widetilde{P}_{h+1}B)^{-1}B^{\top}\widetilde{P}_{h+1}A.

Finally, we have the Riccati difference equation (RDE):

(20) P^{*}_{h} =(A-BK^{*}_{h})^{\top}P^{*}_{h+1}(A-BK^{*}_{h})+(K^{*}_{h})^{\top}RK^{*}_{h}+Q,

where P^{*}_{N}=Q_{N} and K^{*}_{h} from (16) can also be computed from

(21) K^{*}_{h}=(R+B^{\top}P^{*}_{h+1}B)^{-1}B^{\top}P^{*}_{h+1}A.

As a result, it readily follows that

(22) {\mathbb{E}}_{x_{h}\sim{\mathcal{D}}}\left[x_{h}^{\top}\widetilde{P}_{h}x_{h}\right] =\widetilde{J}_{h}(\widetilde{K}_{h}),
(23) {\mathbb{E}}_{x_{h}\sim{\mathcal{D}}}\left[x_{h}^{\top}\widetilde{P}^{*}_{h}x_{h}\right] =\widetilde{J}_{h}(\widetilde{K}^{*}_{h}),\quad\text{and}
(24) {\mathbb{E}}_{x_{h}\sim{\mathcal{D}}}\left[x_{h}^{\top}P^{*}_{h}x_{h}\right] =J_{h}(K^{*}_{h}).

We also define the Riccati operator

(25) \mathcal{R}(P):=Q+A^{\top}(P-PB(R+B^{\top}PB)^{-1}B^{\top}P)A,

so that \widetilde{P}^{*}_{h} and P^{*}_{h} can also be written as

(26) \widetilde{P}^{*}_{h} =\mathcal{R}(\widetilde{P}_{h+1}),
(27) P^{*}_{h} =\mathcal{R}(P^{*}_{h+1}),

after replacing \widetilde{K}^{*}_{h} and K^{*}_{h} in (19) and (20), respectively.
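To make the recursions concrete, the following sketch (assuming known A and B, for illustration only) implements the backward RDE (20)-(21); the surrogate recursion (18) is identical except that the estimated gains \widetilde{K}_{h} are substituted for K^{*}_{h}.

```python
import numpy as np

def riccati_recursion(A, B, Q, R, Q_N, N):
    """Backward Riccati difference equation (20)-(21).

    Returns the gains K*_h, h = 0, ..., N-1, and the terminal-to-initial
    cost-to-go matrix P*_0 (starting from P*_N = Q_N).
    """
    P = Q_N.copy()
    gains = [None] * N
    for h in range(N - 1, -1, -1):
        # K*_h = (R + B' P*_{h+1} B)^{-1} B' P*_{h+1} A, as in (21).
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        gains[h] = K
        # P*_h = R(P*_{h+1}), i.e. the update (20) with K*_h plugged in.
        Acl = A - B @ K
        P = Acl.T @ P @ Acl + K.T @ R @ K + Q
    return gains, P
```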

We now introduce the following mild assumption, which will be useful in establishing a key result.

Assumption 2.1.

A in (1) is non-singular.

Under this assumption, the following result from [16] holds:

Lemma 2.1.

Consider the operator \mathcal{R} defined in (25). If Assumption 2.1 holds, then for any symmetric positive definite matrices X,Y\in\mathbb{R}^{n\times n}, we have

\delta(\mathcal{R}(X),\mathcal{R}(Y))\leq\delta(X,Y).
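As an informal sanity check of Lemma 2.1, one can evaluate both sides of the inequality on random positive definite matrices. The sketch below assumes the riemannian_distance helper given earlier, and a generic (hence non-singular, per Assumption 2.1) matrix A.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2
A = rng.standard_normal((n, n)) + 2 * np.eye(n)   # generically non-singular
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)

def riccati_operator(P):
    """The operator R(P) from (25)."""
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P)
    return Q + A.T @ (P - P @ B @ G) @ A

def random_spd(k):
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

X, Y = random_spd(n), random_spd(n)
# Lemma 2.1 predicts the left value is no larger than the right one.
print(riemannian_distance(riccati_operator(X), riccati_operator(Y)),
      "<=", riemannian_distance(X, Y))
```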

Having introduced all the necessary definitions, we now turn our attention to the outer loop.

3. The Outer Loop (Receding-Horizon Policy Gradient)

It has been demonstrated that the solution to the RDE (20) converges monotonically, at an exponential rate, to the stabilizing solution of the ARE (4) [7]. As a result, \{K^{*}_{t}\}_{t\in\{N-1,\dotsc,1,0\}} in (21) also converges monotonically to K^{*} as N increases. In particular, we recall the following result from [20, Theorem 1].

Theorem 3.1.

Let A^{*}_{K}:=A-BK^{*}, and define

(28) N_{0}=\frac{1}{2}\frac{\log{\left(\frac{\|Q_{N}-P^{*}\|_{*}\kappa_{P^{*}}\|A^{*}_{K}\|\|B\|}{\varepsilon\lambda_{\min}(R)}\right)}}{\log{\left(\frac{1}{\|A^{*}_{K}\|_{*}}\right)}},

where Q_{N}\succeq P^{*}. Then it holds that \|A^{*}_{K}\|_{*}<1 and, for all N\geq N_{0}, the control policy K^{*}_{0} computed by (21) is stabilizing and satisfies \|K^{*}_{0}-K^{*}\|\leq\varepsilon for any \varepsilon>0.

The proof of Theorem 3.1 is provided in Appendix A for completeness (and to account for some minor change in notation). We also note that this theorem relies on a minor inherent assumption that Q_{N} satisfies Q_{N}\succeq P^{*}. A full discussion of this assumption is provided in Remark A.1 in Appendix A.

With this result in place, we provide our proposed algorithm (see Algorithm 1). Note that in this section, we focus on the outer loop of Algorithm 1, analyzing the requirements it imposes on the convergence of the policy gradient method employed in the inner loop. The details of the policy gradient method will be discussed in the next section.

Algorithm 1 Receding-Horizon Policy Gradient
0:  Horizon N, max iterations \{T_{h}\}, stepsizes \{\alpha_{h,t}\}, variance parameter \sigma^{2}
1:  for h=N-1 to 0 do
2:     Initialize K_{h,0} arbitrarily (e.g., the convergent policy K_{h+1,T_{h+1}} from the previous iteration, or 0).
3:     for t=0 to T_{h}-1 do
4:        Sample x_{h}\sim\mathcal{D} and \eta_{h,t}\sim\mathcal{N}(0,I_{m}) and simulate a trajectory with u_{h,t}=-K_{h,t}x_{h}+\sigma\eta_{h,t}.
5:        Compute Q_{h}(x_{h},u_{h,t}) for said trajectory.
6:        Compute the gradient estimate
\widehat{\nabla}\widetilde{J}_{h}(K_{h,t})=-\frac{1}{\sigma}Q_{h}(x_{h},u_{h,t})\eta_{h,t}x_{h}^{\top}.
7:        Update K_{h,t+1}=K_{h,t}-\alpha_{h,t}\cdot\widehat{\nabla}\widetilde{J}_{h}(K_{h,t}).
8:     end for
9:     \widetilde{K}_{h}\leftarrow K_{h,T_{h}}.
10:     Incorporate \widetilde{K}_{h} into the surrogate cost function for the next step, i.e., \widetilde{J}_{h-1}(\cdot).
11:  end for
12:  return  K_{0,T_{0}}.
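For concreteness, a minimal Python sketch of Algorithm 1 is given below. The names simulate_Q and sample_x0 are hypothetical placeholders for the zeroth-order cost oracle Q_{h} and the initial-state distribution \mathcal{D}; only one cost evaluation per inner iteration is used, in line with the one-point estimate.

```python
import numpy as np

def rhpg(simulate_Q, sample_x0, n, m, N, T, alpha, sigma):
    """Minimal sketch of Algorithm 1 (receding-horizon policy gradient).

    simulate_Q(h, x, u, gains) -- returns Q_h(x, u) by rolling the system
        forward from step h with the already-computed gains {K_tilde_t, t > h}.
    T[h], alpha[h][t]          -- inner-loop iteration counts and step sizes.
    sample_x0()                -- draws an initial state from D.
    """
    gains = [np.zeros((m, n)) for _ in range(N)]        # placeholders for K_tilde_h
    for h in range(N - 1, -1, -1):                      # outer loop (backward in h)
        K = gains[h + 1].copy() if h + 1 < N else np.zeros((m, n))
        for t in range(T[h]):                           # inner loop (policy gradient)
            x = sample_x0()
            eta = np.random.standard_normal(m)
            u = -K @ x + sigma * eta                    # exploratory input
            q_val = simulate_Q(h, x, u, gains)          # single cost evaluation
            grad_hat = -(q_val / sigma) * np.outer(eta, x)   # one-point estimate
            K = K - alpha[h][t] * grad_hat
        gains[h] = K                                    # K_tilde_h, reused by later steps
    return gains[0]                                     # approximates K*
```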

Before we move on to the next result, we define the following constants:

a:=\frac{\sigma_{\min}(Q)}{2}
\varphi:=\max_{t\in\{0,1,\dotsc,N-1\}}\|A-BK^{*}_{t}\|
P_{\max}:=\max_{t\in\{0,1,\dotsc,N-1\}}\{P^{*}_{t}\}
C_{1}:=\frac{\varphi\|B\|}{\lambda_{\min}(R)}
C_{2}:=2\varphi\|A\|\left(1+\frac{\|P_{\max}+aI\|\|B\|^{2}}{\lambda_{\min}(R)}\right)
C_{3}:=2\|R+B^{\top}(P_{\max}+aI)B\|.

Additionally, given a scalar \varepsilon>0, we define:

(29) \varsigma_{h,\varepsilon}:=\begin{cases}\min\left\{\sqrt{\frac{a}{C_{3}N}},\sqrt{\frac{a^{2}}{2eC_{3}N\|P_{\max}\|}},\sqrt{\frac{a\varepsilon}{8eC_{3}NC_{1}\|P_{\max}\|}},\sqrt{\frac{\varepsilon}{4C_{1}C_{3}}}\right\},&h\geq 1,\\\frac{\varepsilon}{4},&h=0.\end{cases}
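The tolerance (29) is a direct formula in the problem constants; a small helper transcribing it (illustrative only, since a, C_{1}, C_{3}, and \|P_{\max}\| are not known to the learner and would have to be bounded or estimated) is:

```python
import math

def inner_loop_tolerance(h, eps, a, C1, C3, N, P_max_norm):
    """Accuracy requirement (29) for the inner loop at outer step h (sketch)."""
    if h == 0:
        return eps / 4.0
    return min(
        math.sqrt(a / (C3 * N)),
        math.sqrt(a ** 2 / (2 * math.e * C3 * N * P_max_norm)),
        math.sqrt(a * eps / (8 * math.e * C3 * N * C1 * P_max_norm)),
        math.sqrt(eps / (4 * C1 * C3)),
    )
```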

We now present a key result, Theorem 3.2, on the accumulation of errors, which constitutes an improvement over [20, Theorem 2] (a corrected version of which is stated as Theorem 3.3 below); as the proof of Theorem 3.2 demonstrates, this improvement relies on a fundamentally different analysis.

Theorem 3.2.

(Main result: outer loop): Select

(30) N=\frac{1}{2}\cdot\frac{\log\left(\frac{2\|Q_{N}-P^{*}\|_{*}\cdot\kappa_{P^{*}}\cdot\|A_{K}^{*}\|\cdot\|B\|}{\epsilon\cdot\lambda_{\min}(R)}\right)}{\log\left(\frac{1}{\|A_{K}^{*}\|_{*}}\right)}+1,

where Q_{N}\succeq P^{*}, and suppose that Assumption 2.1 holds. Now assume that, for some \epsilon>0, there exists a sequence of policies \{\widetilde{K}_{h}\}_{h\in\{N-1,\dotsc,0\}} such that for all h\in\{N-1,\dotsc,0\},

\|\widetilde{K}_{h}-\widetilde{K}^{*}_{h}\|\leq\varsigma_{h,\varepsilon},

where \widetilde{K}^{*}_{h} is the optimal policy for the Linear Quadratic Regulator (LQR) problem from step h to N, incorporating errors from previous iterations of Algorithm 1. Then, the proposed algorithm outputs a control policy \widetilde{K}_{0} that satisfies \|\widetilde{K}_{0}-K^{*}\|\leq\varepsilon. Furthermore, if \varepsilon is sufficiently small such that

\varepsilon<\frac{1-\|A-BK^{*}\|_{*}}{\|B\|},

then \widetilde{K}_{0} is stabilizing.

The proof of Theorem 3.2 is presented in Appendix B. A key component of our analysis is the contraction of the Riemannian distance over the Riccati operator, as established in Lemma 2.1. This allows us to demonstrate that the accumulated error remains linear in N, in contrast to the exponential growth in [20, Theorem 2].

Given this discrepancy, we revisit [20, Theorem 2] and present a revised version which accounts for some necessary, and non-trivial, modifications to make the statement accurate. For the latter reason, and the fact that this result does not rely on Assumption 2.1, we provide a complete proof in Appendix C.

Theorem 3.3.

(Prior result: outer loop): Choose

(31) N=\frac{1}{2}\cdot\frac{\log\left(\frac{2\|Q_{N}-P^{*}\|_{*}\cdot\kappa_{P^{*}}\cdot\|A_{K}^{*}\|\cdot\|B\|}{\epsilon\cdot\lambda_{\min}(R)}\right)}{\log\left(\frac{1}{\|A_{K}^{*}\|_{*}}\right)}+1,

where Q_{N}\succeq P^{*}. Now assume that, for some \epsilon>0, there exists a sequence of policies \{\widetilde{K}_{h}\}_{h\in\{N-1,\dotsc,0\}} such that

(32) \|\widetilde{K}_{h}-\widetilde{K}^{*}_{h}\|\leq\begin{cases}\min\left\{\sqrt{\frac{a}{C_{3}}},\sqrt{\frac{a}{C_{2}^{h-2}C_{3}}},\frac{1}{2}\sqrt{\frac{\epsilon}{C_{1}C_{2}^{h-2}C_{3}}}\right\},&h\geq 2,\\\min\left\{\sqrt{\frac{a}{C_{3}}},\frac{1}{2}\sqrt{\frac{\epsilon}{C_{1}C_{3}}}\right\},&h=1,\\\frac{\epsilon}{4},&h=0,\end{cases}

where \widetilde{K}^{*}_{h} is the optimal policy for the Linear Quadratic Regulator (LQR) problem from step h to N, incorporating errors from previous iterations of Algorithm 1. Then, the RHPG algorithm outputs a control policy \widetilde{K}_{0} that satisfies \|\widetilde{K}_{0}-K^{*}\|\leq\varepsilon. Furthermore, if \varepsilon is sufficiently small such that

\varepsilon<\frac{1-\|A-BK^{*}\|_{*}}{\|B\|},

then \widetilde{K}_{0} is stabilizing.

As previously mentioned, Theorem 3.2 significantly improves the error accumulation, resulting in much less restrictive requirements than Theorem 3.3. The limitations of Theorem 3.3 stem from the exponent of the constant C_{2} in (32), which is discussed in detail in Appendix C. It is worth reiterating that this improvement comes only at the cost of Assumption 2.1, a rather mild structural requirement.

4. The Inner Loop and Policy Gradient

In this section, we focus on the inner loop of Algorithm 1, on which we will implement our proposed policy gradient method.

We seek a way to estimate the gradient of the surrogate cost \widetilde{J}_{h} with respect to K_{h}. To this end, we propose:

(33) \widehat{\nabla}\widetilde{J}_{h}(K):=-\frac{1}{\sigma^{2}}Q_{h}(x_{h},u_{h})(u_{h}+Kx_{h})x_{h}^{\top},

where x_{h} is sampled from \mathcal{D}, and then u_{h} is chosen randomly from the Gaussian distribution \mathcal{N}(-Kx_{h},\sigma^{2}I_{m}) for some \sigma>0. Moreover, we rewrite u_{h}\sim\mathcal{N}(-Kx_{h},\sigma^{2}I_{m}) as

(34) u_{h}=-Kx_{h}+\sigma\eta_{h},

where \eta_{h}\sim\mathcal{N}(0,I_{m}). Substituting (34) in (33) yields

(35) \widehat{\nabla}\widetilde{J}_{h}(K)=-\frac{1}{\sigma}Q_{h}(x_{h},-Kx_{h}+\sigma\eta_{h})\eta_{h}x_{h}^{\top}.

This expression corresponds to the gradient estimate utilized in Algorithm 1, as described in its formulation.
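A minimal sketch of the estimator (35), with the oracle Q_{h} passed in as a callable (hypothetical interface), is:

```python
import numpy as np

def one_point_grad(Q_h, K, x_h, sigma, rng):
    """One-point gradient estimate (35) of the surrogate cost at K.

    Q_h(x, u) is a single scalar cost evaluation along the simulated
    trajectory; no second rollout with the same initial state is needed.
    """
    m, _ = K.shape
    eta = rng.standard_normal(m)
    u_h = -K @ x_h + sigma * eta
    return -(Q_h(x_h, u_h) / sigma) * np.outer(eta, x_h)
```

Averaging many such draws should recover (14), in agreement with Proposition 4.1 below.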

Proposition 4.1.

Suppose x_{h} is sampled from \mathcal{D} and u_{h} chosen from \mathcal{N}(-Kx_{h},\sigma^{2}I_{m}) as before. Then for any given choice of K, we have that

(36) {\mathbb{E}}[\widehat{\nabla}\widetilde{J}_{h}(K)]=\nabla\widetilde{J}_{h}(K).
Proof.

Following (35),

{\mathbb{E}}_{x_{h}}[\widehat{\nabla}\widetilde{J}_{h}(K)] ={\mathbb{E}}_{x_{h}}\left[{\mathbb{E}}_{\eta_{h}}\left[\widehat{\nabla}\widetilde{J}_{h}(K)\big{|}x_{h}\right]\right]
\overset{\mathrm{(i)}}{=}{\mathbb{E}}_{x_{h}}\left[-\frac{1}{\sigma^{2}}{\mathbb{E}}_{\eta_{h}}\left[Q_{h}(x_{h},-Kx_{h}+\sigma\eta_{h})(\sigma\eta_{h})\big{|}x_{h}\right]x_{h}^{\top}\right]
(37) \overset{\mathrm{(ii)}}{=}{\mathbb{E}}_{x_{h}}\left[{\mathbb{E}}_{\eta_{h}}\left[-\nabla_{u}Q_{h}(x_{h},u)\bigg{|}_{u=-Kx_{h}+\sigma\eta_{h}}\big{|}x_{h}\right]x_{h}^{\top}\right],

where (i) follows from x_{h}^{\top} being determined when given x_{h}, and (ii) from Stein’s lemma [15]. Using (15), we compute

\nabla_{u}Q_{h}(x_{h},u) =\nabla_{u}\Bigg{(}x_{h}^{\top}Qx_{h}+u^{\top}Ru+(Ax_{h}+Bu)^{\top}\widetilde{P}_{h+1}(Ax_{h}+Bu)\Bigg{)}
=2Ru+2B^{\top}\widetilde{P}_{h+1}Bu+2B^{\top}\widetilde{P}_{h+1}Ax_{h},

which evaluated at u=-Kx_{h}+\sigma\eta_{h} yields

\nabla_{u}Q_{h}(x_{h},u)\bigg{|}_{u=-Kx_{h}+\sigma\eta_{h}}=2\left((R+B^{\top}\widetilde{P}_{h+1}B)(-Kx_{h}+\sigma\eta_{h})+B^{\top}\widetilde{P}_{h+1}Ax_{h}\right).

Substituting in (37), we obtain

{\mathbb{E}}[\widehat{\nabla}\widetilde{J}_{h}(K)] ={\mathbb{E}}_{x_{h}\sim\mathcal{D}}\left[2\left((R+B^{\top}\widetilde{P}_{h+1}B)K-B^{\top}\widetilde{P}_{h+1}A\right)x_{h}x_{h}^{\top}\right]
=2\left((R+B^{\top}\widetilde{P}_{h+1}B)K-B^{\top}\widetilde{P}_{h+1}A\right){\mathbb{E}}_{x_{h}\sim\mathcal{D}}\left[x_{h}x_{h}^{\top}\right]
=2\left((R+B^{\top}\widetilde{P}_{h+1}B)K-B^{\top}\widetilde{P}_{h+1}A\right)\Sigma_{0}
\overset{\mathrm{(i)}}{=}\nabla\widetilde{J}_{h}(K),

where (i) follows from (14). ∎

Similar to [20], we define the following sets regarding the inner loop of the algorithm for each h\in\{0,1,\dotsc,N-1\}:

(38) {\mathcal{G}}_{h}:=\{K_{h}\,|\,\widetilde{J}_{h}(K_{h})-\widetilde{J}_{h}(\widetilde{K}^{*}_{h})\leq 10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})\},

for some arbitrary \zeta\in(0,1). We also define the following constant:

\widetilde{C}_{h}:=\frac{10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})+\widetilde{J}_{h}(\widetilde{K}^{*}_{h})}{\sigma_{\text{min}}(\Sigma_{0})\sigma_{\text{min}}(R)}.

We now provide some bounds in the following lemma.

Lemma 4.1.

Suppose \zeta\in(0,\frac{1}{e}], and

\|\widetilde{P}_{h+1}-P^{*}_{h+1}\|\leq a.

Then for any K\in{\mathcal{G}}_{h}, we have that

(39) \|\widehat{\nabla}\widetilde{J}_{h}(K)\|_{F}\leq\xi_{h,3}\left(\log\frac{1}{\zeta}\right)^{3/2}

with probability at least 1-\zeta, where \xi_{h,1},\xi_{h,2},\xi_{h,3}\in{\mathbb{R}} are given by

(40) \xi_{h,1} :=\Big{(}\|Q\|+2\|R\|\widetilde{C}_{h}+2(\|P_{\max}\|+a)(\|A\|^{2}+2\|B\|^{2}\widetilde{C}_{h})\Big{)}C_{m}^{3/2},
(41) \xi_{h,2} :=2\left(\|R\|+2(\|P_{\max}\|+a)\|B\|^{2}\right)C_{m}^{1/2},
(42) \xi_{h,3} :=\frac{1}{\sigma}\left(\xi_{h,1}5^{1/2}m^{1/2}\right)+\sigma\left(\xi_{h,2}5^{3/2}m^{3/2}\right).

Moreover,

(43) {\mathbb{E}}\left[\|\widehat{\nabla}\widetilde{J}_{h}(K)\|_{F}^{2}\right]\leq\xi_{h,4},

where

(44) \xi_{h,4} :=\frac{1}{\sigma^{2}}\xi_{h,1}^{2}m+2\xi_{h,1}\xi_{h,2}m(m+2)+\sigma^{2}\xi_{h,2}^{2}m(m+2)(m+4).
Proof.

Using the formulation of \widehat{\nabla}\widetilde{J}_{h}(K) derived in (35), we have

(45) \|\widehat{\nabla}\widetilde{J}_{h}(K)\|_{F} =\left\|\frac{1}{\sigma}Q_{h}(x_{h},-Kx_{h}+\sigma\eta_{h})\eta_{h}x_{h}^{\top}\right\|_{F}
(46) \leq\frac{1}{\sigma}Q_{h}(x_{h},-Kx_{h}+\sigma\eta_{h})\|\eta_{h}\|\|x_{h}\|.

Before we continue, we provide the following bound:

Sublemma 4.1.

Suppose K\in{\mathcal{G}}_{h}. Then it holds that

(47) \|K\|^{2}_{F}\leq\widetilde{C}_{h}.

Proof of Sublemma 4.1. Using (6), we have

(48) \widetilde{J}_{h}(K) \geq{\mathbb{E}}_{x_{h}\sim{\mathcal{D}}}\left[x_{h}^{\top}(Q+K^{\top}RK)x_{h}\right]
(49) ={\mathbb{E}}_{x_{h}\sim{\mathcal{D}}}\left[\operatorname{tr}\left((Q+K^{\top}RK)x_{h}x_{h}^{\top}\right)\right]
(50) =\operatorname{tr}\left((Q+K^{\top}RK)\Sigma_{0}\right)
(51) \geq\sigma_{\text{min}}(\Sigma_{0})\operatorname{tr}(Q+K^{\top}RK)
(52) \geq\sigma_{\text{min}}(\Sigma_{0})\operatorname{tr}(RKK^{\top})
(53) \geq\sigma_{\text{min}}(\Sigma_{0})\sigma_{\text{min}}(R)\|K\|^{2}_{F}.

Rearranging (53) yields

\|K\|^{2}_{F} \leq\frac{\widetilde{J}_{h}(K)}{\sigma_{\text{min}}(\Sigma_{0})\sigma_{\text{min}}(R)}
\overset{\mathrm{(i)}}{\leq}\frac{10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})+\widetilde{J}_{h}(\widetilde{K}^{*}_{h})}{\sigma_{\text{min}}(\Sigma_{0})\sigma_{\text{min}}(R)}
=\widetilde{C}_{h},

where (i) follows from the definition of the set {\mathcal{G}}_{h} in (38). This concludes the proof of Sublemma 4.1. \diamond

We now continue with the proof of Lemma 4.1. Note that

Q_{h}(x_{h},-Kx_{h}+\sigma\eta_{h})
= x_{h}^{\top}Qx_{h}+(-Kx_{h}+\sigma\eta_{h})^{\top}R(-Kx_{h}+\sigma\eta_{h})+(Ax_{h}+B(-Kx_{h}+\sigma\eta_{h}))^{\top}\widetilde{P}_{h+1}(Ax_{h}+B(-Kx_{h}+\sigma\eta_{h}))
\leq \|Q\|C_{m}+\|R\|\|-Kx_{h}+\sigma\eta_{h}\|^{2}+\|\widetilde{P}_{h+1}\|\|Ax_{h}+B(-Kx_{h}+\sigma\eta_{h})\|^{2}.

As a result,

(54) Q_{h}(x_{h},-Kx_{h}+\sigma\eta_{h})
(55) \leq \|Q\|C_{m}+2\|R\|(\widetilde{C}_{h}C_{m}+\sigma^{2}\|\eta_{h}\|^{2})+2(\|P_{\max}\|+a)\|A\|^{2}C_{m}+4(\|P_{\max}\|+a)\|B\|^{2}(\widetilde{C}_{h}C_{m}+\sigma^{2}\|\eta_{h}\|^{2})
(56) = C_{m}(\|Q\|+2\|R\|\widetilde{C}_{h})+2C_{m}(\|P_{\max}\|+a)(\|A\|^{2}+2\|B\|^{2}\widetilde{C}_{h})+2\left(\|R\|+2(\|P_{\max}\|+a)\|B\|^{2}\right)\sigma^{2}\|\eta_{h}\|^{2},

where the inequality follows from Sublemma 4.1 along with the fact that by the assumption,

\|\widetilde{P}_{h+1}\| =\|P^{*}_{h+1}+(\widetilde{P}_{h+1}-P^{*}_{h+1})\|
\leq\|P^{*}_{h+1}\|+\|\widetilde{P}_{h+1}-P^{*}_{h+1}\|
\leq\|P_{\max}\|+a.

Combining (46) with (56) and (2), we obtain

(57) \|\widehat{\nabla}\widetilde{J}_{h}(K)\|_{F}
(58) \leq \frac{1}{\sigma}\Bigg{(}\|Q\|+2\|R\|\widetilde{C}_{h}+2(\|P_{\max}\|+a)(\|A\|^{2}+2\|B\|^{2}\widetilde{C}_{h})\Bigg{)}C_{m}^{3/2}\|\eta_{h}\|
(59) +2\left(\|R\|+2(\|P_{\max}\|+a)\|B\|^{2}\right)\sigma C_{m}^{1/2}\|\eta_{h}\|^{3}
(60) = \frac{1}{\sigma}\xi_{h,1}\|\eta_{h}\|+\sigma\xi_{h,2}\|\eta_{h}\|^{3}.

Furthermore, since \eta_{h}\sim\mathcal{N}(0,I_{m}) for any h, \|\eta_{h}\|^{2} is distributed according to the chi-squared distribution with m degrees of freedom (\|\eta_{h}\|^{2}\sim\chi^{2}(m) for any h). Therefore, the standard Laurent-Massart bounds [10] imply that for arbitrary y>0, we have

(61) {\mathbb{P}}\{\|\eta_{h}\|^{2}\geq m+2\sqrt{my}+2y\}\leq e^{-y}.

Now if we take y=m\log\frac{1}{\zeta}, since \zeta\in(0,1/e] by our assumption, it holds that y=m\log\frac{1}{\zeta}\geq m. Thus

{\mathbb{P}}\{\|\eta_{h}\|^{2}\geq 5y\} \leq{\mathbb{P}}\{\|\eta_{h}\|^{2}\geq m+2\sqrt{my}+2y\}
\leq e^{-y},

which after substituting y with its value m\log\frac{1}{\zeta} gives

{\mathbb{P}}\{\|\eta_{h}\|^{2}\geq 5m\log\frac{1}{\zeta}\}\leq e^{-m\log\frac{1}{\zeta}}=\zeta^{m}\leq\zeta.

As a result, we have \|\eta_{h}\|\leq 5^{1/2}m^{1/2}(\log\frac{1}{\zeta})^{1/2} and consequently

\|\eta_{h}\|^{3}\leq 5^{3/2}m^{3/2}(\log\frac{1}{\zeta})^{3/2}

with probability at least 1-\zeta, which after applying to (60) yields

\|\widehat{\nabla}\widetilde{J}_{h}(K)\|_{F} \leq\frac{1}{\sigma}\xi_{h,1}5^{1/2}m^{1/2}\left(\log\frac{1}{\zeta}\right)^{1/2}+\sigma\xi_{h,2}5^{3/2}m^{3/2}\left(\log\frac{1}{\zeta}\right)^{3/2}
\leq\left(\frac{1}{\sigma}\xi_{h,1}5^{1/2}m^{1/2}+\sigma\xi_{h,2}5^{3/2}m^{3/2}\right)\left(\log\frac{1}{\zeta}\right)^{3/2}
=\xi_{h,3}\left(\log\frac{1}{\zeta}\right)^{3/2},

proving the first claim. As for the second claim, note that using (60), we have

(62) \|\widehat{\nabla}\widetilde{J}_{h}(K)\|^{2}_{F}\leq\frac{1}{\sigma^{2}}\xi_{h,1}^{2}\|\eta_{h}\|^{2}+2\xi_{h,1}\xi_{h,2}\|\eta_{h}\|^{4}+\sigma^{2}\xi_{h,2}^{2}\|\eta_{h}\|^{6}.

Now since \|\eta_{h}\|\sim\chi(m), whose moments are known, taking an expectation on (62) results in

{\mathbb{E}}\left[\|\widehat{\nabla}\widetilde{J}_{h}(K)\|^{2}_{F}\right] \leq\frac{1}{\sigma^{2}}\xi_{h,1}^{2}{\mathbb{E}}[\|\eta_{h}\|^{2}]+2\xi_{h,1}\xi_{h,2}{\mathbb{E}}[\|\eta_{h}\|^{4}]+\sigma^{2}\xi_{h,2}^{2}{\mathbb{E}}[\|\eta_{h}\|^{6}]
=\frac{1}{\sigma^{2}}\xi_{h,1}^{2}m+2\xi_{h,1}\xi_{h,2}m(m+2)+\sigma^{2}\xi_{h,2}^{2}m(m+2)(m+4)
=\xi_{h,4},

concluding the proof. ∎

We next provide some useful properties of the cost function \widetilde{J}_{h}(K) in the following lemma.

Lemma 4.2.

For all h\in\{0,1,\dotsc,N-1\}, the function \widetilde{J}_{h} is \frac{\mu}{2}-strongly convex, where

\mu:=4\sigma_{\min}(\Sigma_{0})\sigma_{\min}(R),

and in particular, for all K\in{\mathbb{R}}^{m\times n},

(63) \|\nabla\widetilde{J}_{h}(K)\|_{F}^{2}\geq\mu(\widetilde{J}_{h}(K)-\widetilde{J}_{h}(\widetilde{K}^{*}_{h})),

where \widetilde{K}^{*}_{h} is the global minimizer of \widetilde{J}_{h}. Moreover, assuming that \|\widetilde{P}_{h+1}-P^{*}_{h+1}\|\leq a, we have that for all K_{1},K_{2}\in{\mathbb{R}}^{m\times n},

(64) \|\nabla\widetilde{J}_{h}(K_{2})-\nabla\widetilde{J}_{h}(K_{1})\|_{F}\leq L\|K_{2}-K_{1}\|_{F},

where

L:=C_{3}\|\Sigma_{0}\|.
Proof.

We first prove the strong convexity as follows:

\left\langle\nabla\widetilde{J}_{h}(K_{2})-\nabla\widetilde{J}_{h}(K_{1}),K_{2}-K_{1}\right\rangle =2\operatorname{tr}\left(\Sigma_{0}(K_{2}-K_{1})^{\top}(R+B^{\top}\widetilde{P}_{h+1}B)(K_{2}-K_{1})\right)
\geq 2\sigma_{\min}(\Sigma_{0})\sigma_{\min}(R)\operatorname{tr}\left((K_{2}-K_{1})^{\top}(K_{2}-K_{1})\right)
=\frac{\mu}{2}\|K_{2}-K_{1}\|_{F}^{2}.

Note that inequality (63) is an immediate consequence of the PL inequality. We now move on to the L-smoothness property:

\|\nabla\widetilde{J}_{h}(K_{2})-\nabla\widetilde{J}_{h}(K_{1})\|_{F} =\|2(R+B^{\top}\widetilde{P}_{h+1}B)(K_{2}-K_{1})\Sigma_{0}\|_{F}
\leq\|\Sigma_{0}\|(2\|R+B^{\top}\widetilde{P}_{h+1}B\|)\|K_{2}-K_{1}\|_{F}
\leq\|\Sigma_{0}\|(2\|R+B^{\top}(P_{\max}+aI)B\|)\|K_{2}-K_{1}\|_{F}
=\|\Sigma_{0}\|C_{3}\|K_{2}-K_{1}\|_{F}
=L\|K_{2}-K_{1}\|_{F},

concluding the proof. ∎

Before introducing the next result, let us denote the optimality gap of iterate t by

(65) \Delta_{t}=\widetilde{J}_{h}(K_{h,t})-\widetilde{J}_{h}(\widetilde{K}^{*}_{h}).

Moreover, let {\mathcal{F}}_{t} denote the \sigma-algebra containing the randomness up to iteration t of the inner loop of the algorithm for each h\in\{0,1,\dotsc,N-1\} (including K_{h,t} but not \widehat{\nabla}\widetilde{J}_{h}(K_{h,t})). We then define

(66) \tau:=\min\left\{t\ |\ \Delta_{t}>10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})\right\},

which is a stopping time with respect to {\mathcal{F}}_{t}. Note that we slightly abuse notation here, as \Delta_{t}, {\mathcal{F}}_{t}, and \tau may differ for each h\in\{0,1,\dotsc,N-1\}; since the steps h of the outer loop do not impact one another, we use a single notation for simplicity.

Lemma 4.3.

Suppose \|\widetilde{P}_{h+1}-P^{*}_{h+1}\|\leq a, and the update rule follows

(67) K_{h,t+1}=K_{h,t}-\alpha_{h,t}\widehat{\nabla}\widetilde{J}_{h}(K_{h,t}),

where \alpha_{h,t}>0 is the step-size. Then for any t\in\{0,1,2,\dotsc\}, we have

(68) {\mathbb{E}}[\Delta_{t+1}|{\mathcal{F}}_{t}]1_{\tau>t}\leq\left(\left(1-\mu\alpha_{h,t}\right)\Delta_{t}+\frac{L\alpha_{h,t}^{2}}{2}\xi_{h,4}\right)1_{\tau>t},

where \Delta_{t} is defined in (65).

Proof.

First, note that by L-smoothness, we have

\Delta_{t+1}-\Delta_{t} =\widetilde{J}_{h}(K_{h,t+1})-\widetilde{J}_{h}(K_{h,t})
\leq\langle\nabla\widetilde{J}_{h}(K_{h,t}),K_{h,t+1}-K_{h,t}\rangle+\frac{L}{2}\|K_{h,t+1}-K_{h,t}\|_{F}^{2}
=-\alpha_{h,t}\langle\nabla\widetilde{J}_{h}(K_{h,t}),\widehat{\nabla}\widetilde{J}_{h}(K_{h,t})\rangle+\frac{L\alpha_{h,t}^{2}}{2}\|\widehat{\nabla}\widetilde{J}_{h}(K_{h,t})\|_{F}^{2},

which after multiplying by 1_{\tau>t} (which is determined by {\mathcal{F}}_{t}) and taking an expectation conditioned on {\mathcal{F}}_{t} gives

(69) {\mathbb{E}}[\Delta_{t+1}-\Delta_{t}|{\mathcal{F}}_{t}]1_{\tau>t} \leq-\alpha_{h,t}\langle\nabla\widetilde{J}_{h}(K_{h,t}),{\mathbb{E}}[\widehat{\nabla}\widetilde{J}_{h}(K_{h,t})|{\mathcal{F}}_{t}]\rangle 1_{\tau>t}+\frac{L\alpha_{h,t}^{2}}{2}{\mathbb{E}}[\|\widehat{\nabla}\widetilde{J}_{h}(K_{h,t})\|_{F}^{2}|{\mathcal{F}}_{t}]1_{\tau>t}
(70) \overset{\mathrm{(i)}}{\leq}-\alpha_{h,t}\|\nabla\widetilde{J}_{h}(K_{h,t})\|_{F}^{2}1_{\tau>t}+\frac{L\alpha_{h,t}^{2}}{2}\xi_{h,4}1_{\tau>t}
(71) \overset{\mathrm{(ii)}}{\leq}-\alpha_{h,t}\mu\Delta_{t}1_{\tau>t}+\frac{L\alpha_{h,t}^{2}}{2}\xi_{h,4}1_{\tau>t},

where (i) follows from Proposition 4.1 and Lemma 4.1, along with the fact that the event \{\tau>t\} implies K_{h,t}\in{\mathcal{G}}_{h}, and (ii) is due to Lemma 4.2.

Now after some rearranging of (71) and noting that \Delta_{t} is also determined by {\mathcal{F}}_{t}, we conclude that

(72) {\mathbb{E}}[\Delta_{t+1}|{\mathcal{F}}_{t}]1_{\tau>t}\leq\left(\left(1-\mu\alpha_{h,t}\right)\Delta_{t}+\frac{L\alpha_{h,t}^{2}}{2}\xi_{h,4}\right)1_{\tau>t},

finishing the proof. ∎

We are now in a position to state a precise version of our main result for the inner loop.

Theorem 4.1.

(Main result: inner loop): Suppose \|\widetilde{P}_{h+1}-P^{*}_{h+1}\|\leq a. For any h\in\{0,1,\dotsc,N-1\}, if the step-size is chosen as

(73) \alpha_{h,t}=\frac{2}{\mu}\frac{1}{t+\theta_{h}}\quad\text{for}\quad\theta_{h}=\max\left\{2,\frac{2L\xi_{h,4}}{\mu^{2}\widetilde{J}_{h}(K_{h,0})}\right\},

then for a given error tolerance \varsigma, the iterate K_{h,T_{h}} of the update rule (67) after

T_{h}=\frac{40}{7\mu\varsigma^{2}\zeta}\theta_{h}\widetilde{J}_{h}(K_{h,0})

steps satisfies

\|K_{h,T_{h}}-\widetilde{K}^{*}_{h}\|_{F}\leq\varsigma,

with a probability of at least 1-\zeta.
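As an illustration of these parameter choices, the following helper (hypothetical, assuming the constants \mu, L, and \xi_{h,4} are available or estimated) computes the step-size schedule (73) and the iteration count T_{h}:

```python
import math

def inner_loop_schedule(mu, L, xi_h4, J_h0, zeta, varsigma):
    """Step sizes and iteration count from Theorem 4.1 (illustrative helper).

    mu, L    -- strong-convexity / smoothness constants of J_tilde_h,
    xi_h4    -- second-moment bound (44) on the gradient estimate,
    J_h0     -- initial surrogate cost J_tilde_h(K_{h,0}),
    zeta     -- failure-probability budget for this inner loop,
    varsigma -- target accuracy for ||K_{h,T_h} - K*_h||_F.
    """
    theta_h = max(2.0, 2.0 * L * xi_h4 / (mu ** 2 * J_h0))
    T_h = math.ceil(40.0 * theta_h * J_h0 / (7.0 * mu * varsigma ** 2 * zeta))
    alphas = [2.0 / (mu * (t + theta_h)) for t in range(T_h)]
    return alphas, T_h
```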

The proof of this result relies heavily on Proposition 4.2, which we establish next.

Proposition 4.2.

Under the parameter settings of Theorem 4.1, we have that

{\mathbb{E}}[\Delta_{T_{h}}1_{\tau>T_{h}}]\leq\frac{7}{40}\mu\varsigma^{2}\zeta.

Moreover, the event \{\tau\leq T_{h}\} happens with probability at most \frac{3}{10}\zeta.

Proof.

We dedicate the following sublemma to prove the first claim.

Sublemma 4.2.

Under the parameter setup of Theorem 4.1, we have that

{\mathbb{E}}[\Delta_{t}1_{\tau>t}]\leq\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{t+\theta_{h}},

for all t\in[T_{h}].

Proof of Sublemma 4.2. We prove this result by induction on tt as follows:

Base case (t=0):

\Delta_{0}1_{\tau>0}\leq\Delta_{0}\leq\widetilde{J}_{h}(K_{h,0})=\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{0+\theta_{h}},

which after taking expectation proves the claim for t=0.

Inductive step: Let k\in[T_{h}-1] be fixed and assume that

(74) {\mathbb{E}}[\Delta_{k}1_{\tau>k}]\leq\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{k+\theta_{h}}

holds (the inductive hypothesis). Observe that

(75) {\mathbb{E}}[\Delta_{k+1}1_{\tau>k+1}] \overset{\mathrm{(i)}}{\leq}{\mathbb{E}}[\Delta_{k+1}1_{\tau>k}]
(76) ={\mathbb{E}}[{\mathbb{E}}[\Delta_{k+1}1_{\tau>k}|{\mathcal{F}}_{k}]]
(77) \overset{\mathrm{(ii)}}{=}{\mathbb{E}}[{\mathbb{E}}[\Delta_{k+1}|{\mathcal{F}}_{k}]1_{\tau>k}],

where (i) comes from 1_{\tau>k+1}\leq 1_{\tau>k} and (ii) from the fact that 1_{\tau>k} is determined by {\mathcal{F}}_{k}. By Lemma 4.3, we have that

(78) {\mathbb{E}}[\Delta_{k+1}|{\mathcal{F}}_{k}]1_{\tau>k} \leq\left(\left(1-\mu\alpha_{k}\right)\Delta_{k}+\frac{L\alpha_{k}^{2}}{2}\xi_{h,4}\right)1_{\tau>k}
(79) \overset{\mathrm{(i)}}{\leq}\left(1-\frac{2}{k+\theta_{h}}\right)\Delta_{k}1_{\tau>k}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{k+\theta_{h}}\right)^{2},

where (i) comes from replacing \alpha_{k} with its value in Theorem 4.1, along with the fact that 1_{\tau>k}\leq 1. Now taking an expectation on (79) and combining it with (77) yields

(80) {\mathbb{E}}[\Delta_{k+1}1_{\tau>k+1}] \leq\left(1-\frac{2}{k+\theta_{h}}\right){\mathbb{E}}[\Delta_{k}1_{\tau>k}]+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{k+\theta_{h}}\right)^{2}
(81) \overset{\mathrm{(i)}}{\leq}\left(1-\frac{2}{k+\theta_{h}}\right)\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{k+\theta_{h}}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{k+\theta_{h}}\right)^{2}
(82) =\left(1-\frac{1}{k+\theta_{h}}\right)\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{k+\theta_{h}}-\frac{1}{(k+\theta_{h})^{2}}\left(\theta_{h}\widetilde{J}_{h}(K_{h,0})-\frac{2L\xi_{h,4}}{\mu^{2}}\right)
(83) \overset{\mathrm{(ii)}}{\leq}\frac{k+\theta_{h}-1}{(k+\theta_{h})^{2}}\theta_{h}\widetilde{J}_{h}(K_{h,0})
(84) \leq\frac{1}{k+\theta_{h}+1}\theta_{h}\widetilde{J}_{h}(K_{h,0}),

where (i) comes from the induction hypothesis (74), and (ii) from

\theta_{h}\widetilde{J}_{h}(K_{h,0})-\frac{2L\xi_{h,4}}{\mu^{2}}\geq 0,

which is due to the choice of \theta_{h} in Theorem 4.1. This proves the claim for k+1, completing the inductive step. \diamond

Now utilizing Sublemma 4.2 along with the choice of T_{h} in Theorem 4.1, we have

{\mathbb{E}}[\Delta_{T_{h}}1_{\tau>T_{h}}]\leq\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{T_{h}+\theta_{h}}\leq\frac{\theta_{h}\widetilde{J}_{h}(K_{h,0})}{T_{h}}\leq\frac{7\mu\varsigma^{2}\zeta}{40},

concluding the proof of the first claim of Proposition 4.2. Moving on to the second claim, we start by introducing the stopped process

(85) Y_{t}:=\Delta_{t\wedge\tau}+\frac{4L\xi_{h,4}}{\mu^{2}}\frac{1}{t+\theta_{h}}.

We now show this process is a supermartingale. First, observe that

(86) {\mathbb{E}}[Y_{t+1}|{\mathcal{F}}_{t}] ={\mathbb{E}}[\Delta_{(t+1)\wedge\tau}|{\mathcal{F}}_{t}]+\frac{4L\xi_{h,4}}{\mu^{2}}\frac{1}{t+\theta_{h}+1}
(87) ={\mathbb{E}}[\Delta_{(t+1)\wedge\tau}(1_{\tau\leq t}+1_{\tau>t})|{\mathcal{F}}_{t}]+\frac{4L\xi_{h,4}}{\mu^{2}}\frac{1}{t+\theta_{h}+1}
(88) ={\mathbb{E}}[\Delta_{(t+1)\wedge\tau}1_{\tau\leq t}|{\mathcal{F}}_{t}]+{\mathbb{E}}[\Delta_{(t+1)\wedge\tau}|{\mathcal{F}}_{t}]1_{\tau>t}+\frac{4L\xi_{h,4}}{\mu^{2}}\frac{1}{t+\theta_{h}+1}.

Now note that for the first term of the right-hand side of (88), it holds that

(89) {\mathbb{E}}[\Delta_{(t+1)\wedge\tau}1_{\tau\leq t}|{\mathcal{F}}_{t}]\overset{\mathrm{(i)}}{=}{\mathbb{E}}[\Delta_{t\wedge\tau}1_{\tau\leq t}|{\mathcal{F}}_{t}]=\Delta_{t\wedge\tau}1_{\tau\leq t},

where (i) follows from the fact that under the event \{\tau\leq t\}, we have \Delta_{(t+1)\wedge\tau}=\Delta_{t\wedge\tau}. Moreover, for the second term of the right-hand side of (88), we have that

(90) {\mathbb{E}}[\Delta_{(t+1)\wedge\tau}|{\mathcal{F}}_{t}]1_{\tau>t} \overset{\mathrm{(i)}}{\leq}\left(1-\frac{2}{t+\theta_{h}}\right)\Delta_{t}1_{\tau>t}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{t+\theta_{h}}\right)^{2}1_{\tau>t}
(91) \leq\Delta_{t}1_{\tau>t}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{t+\theta_{h}}\right)^{2},

where (i) follows from Lemma 4.3. Combining (89) and (91) with (88), we get

(92) {\mathbb{E}}[Y_{t+1}|{\mathcal{F}}_{t}] \leq\Delta_{t\wedge\tau}1_{\tau\leq t}+\Delta_{t}1_{\tau>t}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{t+\theta_{h}}\right)^{2}+\frac{4L\xi_{h,4}}{\mu^{2}}\frac{1}{t+\theta_{h}+1}
(93) =\Delta_{t\wedge\tau}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{1}{(t+\theta_{h})^{2}}+\frac{2}{t+\theta_{h}+1}\right)
(94) \overset{\mathrm{(i)}}{\leq}\Delta_{t\wedge\tau}+\frac{2L\xi_{h,4}}{\mu^{2}}\left(\frac{2}{t+\theta_{h}}\right)
(95) =Y_{t},

where (i) follows from \theta_{h}\geq 2 under the parameter choice of Theorem 4.1. This finishes the proof that Y_{t} is a supermartingale. Now note that

{\mathbb{P}}\{\tau\leq T_{h}\} ={\mathbb{P}}\left\{\max_{t\in[T_{h}]}\Delta_{t}>10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})\right\}
\leq{\mathbb{P}}\left\{\max_{t\in[T_{h}]}\Delta_{t\wedge\tau}>10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})\right\}
\overset{\mathrm{(i)}}{\leq}{\mathbb{P}}\left\{\max_{t\in[T_{h}]}Y_{t}\geq 10\zeta^{-1}\widetilde{J}_{h}(K_{h,0})\right\},

where (i) follows from the fact that Y_{t}\geq\Delta_{t\wedge\tau}. Using Doob/Ville’s inequality for supermartingales, we have that

{\mathbb{P}}\{\tau\leq T_{h}\} \leq\frac{\zeta{\mathbb{E}}[Y_{0}]}{10\widetilde{J}_{h}(K_{h,0})}=\frac{\zeta\left(\Delta_{0}+\frac{4L\xi_{h,4}}{\mu^{2}}\frac{1}{\theta_{h}}\right)}{10\widetilde{J}_{h}(K_{h,0})}.

Using the choice of \theta_{h} in Theorem 4.1, we have that

(96) {\mathbb{P}}\{\tau\leq T_{h}\} \leq\frac{\zeta\left(\widetilde{J}_{h}(K_{h,0})+2\widetilde{J}_{h}(K_{h,0})\right)}{10\widetilde{J}_{h}(K_{h,0})}
(97) =\frac{3}{10}\zeta.

This verifies the second claim of Proposition 4.2, concluding the proof. ∎

With this in mind, the proof of Theorem 4.1 is straightforward:

Proof of Theorem 4.1: We now employ Proposition 4.2 to validate the claims of Theorem 4.1. Note that

{\mathbb{P}}\left\{\Delta_{T_{h}}\geq\frac{\mu}{4}\varsigma^{2}\right\} \leq{\mathbb{P}}\left\{\Delta_{T_{h}}1_{\tau>T_{h}}\geq\frac{\mu}{4}\varsigma^{2}\right\}+{\mathbb{P}}\left\{1_{\tau\leq T_{h}}=1\right\}
\overset{\mathrm{(i)}}{\leq}\frac{4}{\mu\varsigma^{2}}{\mathbb{E}}[\Delta_{T_{h}}1_{\tau>T_{h}}]+{\mathbb{P}}\left\{\tau\leq T_{h}\right\}
\overset{\mathrm{(ii)}}{\leq}\frac{7}{10}\zeta+\frac{3}{10}\zeta
=\zeta,

where (i) follows from applying Markov’s inequality to the first claim of Proposition 4.2, and (ii) comes directly from the second claim of Proposition 4.2. Finally, we utilize the \frac{\mu}{2}-strong convexity of \widetilde{J}_{h}, along with \nabla\widetilde{J}_{h}(\widetilde{K}^{*}_{h})=0, to write

J~h(Kh,Th)J~h(K~h)\displaystyle\widetilde{J}_{h}(K_{h,T_{h}})-\widetilde{J}_{h}(\widetilde{K}^{*}_{h}) J~h(K~h)(Kh,ThK~h)+μ4Kh,ThK~hF2\displaystyle\geq\nabla\widetilde{J}_{h}(\widetilde{K}^{*}_{h})^{\top}(K_{h,T_{h}}-\widetilde{K}^{*}_{h})+\frac{\mu}{4}\|K_{h,T_{h}}-\widetilde{K}^{*}_{h}\|_{F}^{2}
=μ4Kh,ThK~hF2,\displaystyle=\frac{\mu}{4}\|K_{h,T_{h}}-\widetilde{K}^{*}_{h}\|_{F}^{2},

and hence,

Kh,ThK~hF24μ(J~h(Kh,Th)J~h(K~h))ς2,\displaystyle\|K_{h,T_{h}}-\widetilde{K}^{*}_{h}\|_{F}^{2}\leq\frac{4}{\mu}\left(\widetilde{J}_{h}(K_{h,T_{h}})-\widetilde{J}_{h}(\widetilde{K}^{*}_{h})\right)\leq\varsigma^{2},

with probability at least 1-\zeta, finishing the proof. \diamond
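To make the role of the step-size rule concrete, the following toy sketch runs stochastic gradient descent on a generic \mu-strongly convex quadratic with the schedule \alpha_{t}=\frac{2}{\mu}\frac{1}{t+\theta} used above. The quadratic objective, the noise model, and all constants are illustrative assumptions and not the LQR inner loop itself; the printed optimality gaps simply decay roughly like 1/T, mirroring the rate behind the supermartingale argument.

```python
import numpy as np

# Toy illustration (not the LQR inner loop): SGD on a mu-strongly convex
# quadratic with the step size alpha_t = 2/(mu*(t+theta)) of Theorem 4.1.
# The objective, noise level, and theta are arbitrary assumptions; the point
# is only that the expected optimality gap decays roughly like 1/T.
rng = np.random.default_rng(0)
mu, theta, noise_std, x_star = 2.0, 2.0, 5.0, 3.0

def noisy_grad(x):
    # Unbiased gradient of f(x) = (mu/2)(x - x_star)^2 plus bounded-variance noise.
    return mu * (x - x_star) + noise_std * rng.normal()

for T in [10**2, 10**3, 10**4, 10**5]:
    gaps = []
    for _ in range(20):                              # average over independent runs
        x = 0.0
        for t in range(T):
            x -= 2.0 / (mu * (t + theta)) * noisy_grad(x)
        gaps.append(0.5 * mu * (x - x_star) ** 2)    # optimality gap after T steps
    print(T, float(np.mean(gaps)))                   # scales roughly as 1/T
```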

5. Sample complexity

We now utilize our results on the inner and outer loops to provide sample complexity bounds. To wit, combining Theorems 4.1 and 3.2, along with applying the union bound to the probabilities of failure at each step, we obtain the following result.

Corollary 5.1.

Suppose Assumption 2.1 holds, and choose

N=\frac{1}{2}\cdot\frac{\log\big(\frac{2\|Q_{N}-P^{*}\|_{*}\cdot\kappa_{P^{*}}\cdot\|A_{K}^{*}\|\cdot\|B\|}{\epsilon\cdot\lambda_{\min}(R)}\big)}{\log\big(\frac{1}{\|A_{K}^{*}\|_{*}}\big)}+1,

where Q_{N}\succeq P^{*}. Moreover, for each h\in\{0,1,\dotsc,N-1\}, let \varsigma_{h,\varepsilon} be as defined in (29). Then Algorithm 1 with the parameters suggested in Theorem 4.1, i.e.,

\alpha_{h,t}=\frac{2}{\mu}\cdot\frac{1}{t+\theta_{h}}\quad\text{for}\quad\theta_{h}=\max\left\{2,\frac{2L\xi_{h,4}}{\mu^{2}\widetilde{J}_{h}(K_{h,0})}\right\},

and

T_{h}=\frac{40}{7\mu\varsigma_{h,\varepsilon}^{2}\zeta}\,\theta_{h}\,\widetilde{J}_{h}(K_{h,0}),

outputs a control policy \widetilde{K}_{0} that satisfies \|\widetilde{K}_{0}-K^{*}\|\leq\epsilon with probability at least 1-N\zeta. Furthermore, if \epsilon is sufficiently small such that

\epsilon<\frac{1-\|A-BK^{*}\|_{*}}{\|B\|},

then \widetilde{K}_{0} is stabilizing.

The results in Corollary 5.1 provide a rigorous theoretical foundation for Algorithm 1, ensuring it computes a control policy \widetilde{K}_{0} satisfying \|\widetilde{K}_{0}-K^{*}\|\leq\epsilon with high probability. The following corollary formalizes the sample complexity bound of our approach.

Corollary 5.2.

(Main result: complexity bound): Under Assumption 2.1, Algorithm 1 achieves a sample complexity bound of at most

\sum_{h=0}^{N-1}T_{h}=\widetilde{\mathcal{O}}\left(\epsilon^{-2}\right).

It is worth comparing this result with the one in [20], taking into account the necessary adjustments à la Theorem 3.3, where error accumulation results in a worse sample complexity bound.

Corollary 5.3.

(Prior result: complexity bound): Algorithm 1 in [20] achieves a sample complexity bound of at most

\sum_{h=0}^{N-1}T^{\prime}_{h}=\widetilde{\mathcal{O}}\left(\max\left\{\epsilon^{-2},\,\epsilon^{-\left(1+\frac{\log(C_{2})}{2\log(1/\|A^{*}_{K}\|_{*})}\right)}\right\}\right),

where T^{\prime}_{h} denotes the counterpart of T_{h} in [20].

This comparison highlights the advantage of our method, which achieves a uniform sample complexity bound of \widetilde{\mathcal{O}}(\epsilon^{-2}), independent of problem-specific constants. In contrast, the bound in [20] deteriorates as C_{2} increases, since their second term scales as

\widetilde{\mathcal{O}}\left(\epsilon^{-\left(1+\frac{\log(C_{2})}{2\log(1/\|A^{*}_{K}\|_{*})}\right)}\right).

This can be arbitrarily worse than \widetilde{\mathcal{O}}(\epsilon^{-2}), leading to much higher sample complexity in some cases.

Finally, to validate these theoretical guarantees and assess the algorithm’s empirical performance, we conduct simulation studies on a standard example from [20]. The setup and results are presented in the following section.

6. Simulation Studies

Figure 1. Simulation results showing sample complexity and policy optimality gap: (a) average number of calls to the zeroth-order oracle; (b) policy optimality gap \|\widetilde{K}_{0}-K^{*}\|.

For comparison, we demonstrate our results on the example provided in [20], where A=5, B=0.33, Q=R=1, and the optimal policy is K^{*}=14.5482 with P^{*}=221.4271. In this example, we select Q_{N}=300\succeq P^{*}, in alignment with a minor inherent assumption discussed later in Remark A.1 (Appendix A). Additionally, we initialize our policy at each step h of the outer loop of Algorithm 1 as K_{h,0}=0. This choice contrasts with [5, 11], which require stable policies for initialization, as the stabilizing policies for this example lie in the set

\mathcal{K}=\{K\mid 12.12<K<18.18\}.

We set N=\lceil\frac{1}{2}\log(\frac{1}{\epsilon})\rceil, consistent with (31), and in each inner loop we apply the policy gradient (PG) update outlined in Algorithm 1 with the time-varying step-size suggested in (73). The algorithm is run for twelve different values of \epsilon, namely \epsilon\in\{10^{-6},10^{-5.5},10^{-5},\dotsc,10^{-0.5}\}, with the results shown in Figure 1. To account for the inherent randomness of the algorithm, we perform one hundred independent runs for each value of \epsilon and report the average sample complexity and the average policy optimality gap \|\widetilde{K}_{0}-K^{*}\|. As seen in Figure 1, the sample complexity exhibits a slope consistent with \mathcal{O}(\epsilon^{-0.5}), visibly outperforming the method in [20], which exhibits a much steeper slope of approximately \mathcal{O}(\epsilon^{-1.5}).
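For readers who wish to reproduce a rough version of this experiment, the sketch below implements a simplified receding-horizon, one-point zeroth-order policy gradient on the scalar example above. It is not the exact Algorithm 1: the smoothing radius r, the small batch averaging, the crude curvature proxy used for \mu, the horizon N=6, and the clipping safeguard are all assumptions made only to keep the illustration short and numerically stable. Each stage starts from K_{h,0}=0, which is not stabilizing, and the learned stage-0 gain should land close to K^{*}.

```python
import numpy as np

# Simplified sketch of the experiment above (one scalar system). This is NOT the
# exact Algorithm 1: the smoothing radius r, the small batch averaging, the crude
# curvature proxy used for mu, the horizon N = 6, and the clipping safeguard are
# all assumptions made only to keep the illustration short and numerically stable.
A, B, Q, R, Q_N = 5.0, 0.33, 1.0, 1.0, 300.0
K_star = 14.5482
rng = np.random.default_rng(0)

def rollout_cost(K_head, tail_gains):
    """One-point cost evaluation: play K_head now, then the previously learned
    tail gains, from a single random initial state, plus the terminal cost."""
    x = rng.normal()
    cost = 0.0
    for K in [K_head] + tail_gains:
        u = -K * x
        cost += Q * x ** 2 + R * u ** 2
        x = A * x + B * u
    return cost + Q_N * x ** 2

def inner_loop(tail_gains, T=3000, r=5.0, batch=8, theta=200.0):
    """One-point zeroth-order policy gradient with the decaying step size
    2/(mu*(t+theta)); mu is a rough curvature guess, not the paper's constant."""
    mu = 2.0 * (R + B * B * Q_N)
    K = 0.0                                   # deliberately not a stabilizing gain
    for t in range(T):
        g = 0.0
        for _ in range(batch):                # tiny batch to tame the variance
            v = rng.choice([-1.0, 1.0])
            g += rollout_cost(K + r * v, tail_gains) * v / r
        K -= 2.0 / (mu * (t + theta)) * (g / batch)
        K = float(np.clip(K, -100.0, 100.0))  # safeguard, not part of the analysis
    return K

tail = []                                     # receding-horizon outer loop
for h in range(6):
    tail = [inner_loop(tail)] + tail
print(f"learned K_0 = {tail[0]:.3f},  optimal K* = {K_star}")
```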

7. Conclusion

In this paper, we introduced a novel approach to solving the model-free LQR problem, inspired by policy gradient methods, particularly REINFORCE. Our algorithm eliminates the restrictive requirement of starting with a stable initial policy, making it applicable in scenarios where obtaining such a policy is challenging. Furthermore, it removes the reliance on two-point gradient estimation, enhancing practical applicability while maintaining similar rates.

Beyond these improvements, we introduced a refined outer-loop analysis that significantly mitigates error accumulation, leveraging the contraction of the Riemannian distance over the Riccati operator. This ensures that the accumulated error remains linear in the horizon length, leading to a sample complexity bound of \widetilde{\mathcal{O}}(\epsilon^{-2}), independent of problem-specific constants, making the method more broadly applicable.

We provide a rigorous theoretical analysis, establishing that the algorithm achieves convergence to the optimal policy with competitive sample complexity bounds. Importantly, our numerical simulations reveal performance that surpasses these theoretical guarantees, with the algorithm consistently outperforming prior methods that rely on two-point gradient estimates. This superior performance, combined with a more practical framework, highlights the potential of the proposed method for solving control problems in a model-free setting. Future directions include extensions to nonlinear and partially observed systems, as well as robustness enhancements.

References

  • [1] A. M. Annaswamy and A. L. Fradkov. A historical perspective of adaptive control and learning. Annual Reviews in Control, 52:18–41, 2021.
  • [2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
  • [3] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
  • [4] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633–679, 2020.
  • [5] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
  • [6] M. Gevers. A personal view of the development of system identification: A 30-year journey through an exciting field. IEEE Control Systems Magazine, 26(6):93–105, 2006.
  • [7] B. Hassibi, A. H. Sayed, and T. Kailath. Indefinite-Quadratic Estimation and Control. Society for Industrial and Applied Mathematics, 1999.
  • [8] B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, and T. Başar. Toward a theoretical foundation of policy optimization for learning control policies. Annual Review of Control, Robotics, and Autonomous Systems, 6:123–158, 2023.
  • [9] A. Lamperski. Computing stabilizing linear controllers via policy iteration. In Proceedings of the 59th IEEE Conference on Decision and Control (CDC), pages 1902–1907. IEEE, 2020.
  • [10] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
  • [11] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. Journal of Machine Learning Research, 21(21):1–51, 2020.
  • [12] A. Neshaei Moghaddam, A. Olshevsky, and B. Gharesifard. Sample complexity of the linear quadratic regulator: A reinforcement learning lens, 2024.
  • [13] H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanović. Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem. IEEE Transactions on Automatic Control, 67(5):2435–2450, 2022.
  • [14] J. C. Perdomo, J. Umenberger, and M. Simchowitz. Stabilizing dynamical systems via policy gradient methods. In Advances in Neural Information Processing Systems, 2021.
  • [15] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151, 1981.
  • [16] J. Sun and M. Cantoni. On Riccati contraction in time-varying linear-quadratic control, 2023.
  • [17] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 1057–1063, 1999.
  • [18] Y. Tang, Y. Zheng, and N. Li. Analysis of the optimization landscape of linear quadratic gaussian (LQG) control. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pages 599–610, 2021.
  • [19] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • [20] X. Zhang and T. Başar. Revisiting LQR control from the perspective of receding-horizon policy gradient. IEEE Control Systems Letters, 7:1664–1669, 2023.
  • [21] X. Zhang, S. Mowlavi, M. Benosman, and T. Başar. Global convergence of receding-horizon policy search in learning estimator designs, 2023.
  • [22] F. Zhao, X. Fu, and K. You. Learning stabilizing controllers of linear systems via discount policy gradient. arXiv preprint arXiv:2112.09294, 2021.

Appendix A Proof of Theorem 3.1

We let

\overline{P}_{t}:=P^{*}_{t}-P^{*},\quad\overline{R}:=R+B^{\top}P^{*}B,
\overline{A}:=A-B\overline{R}^{-1}B^{\top}P^{*}A,

and we have

\overline{P}_{t} =\overline{A}^{\top}\overline{P}_{t+1}\overline{A}-\overline{A}^{\top}\overline{P}_{t+1}B(\overline{R}+B^{\top}\overline{P}_{t+1}B)^{-1}B^{\top}\overline{P}_{t+1}\overline{A}
=\overline{A}^{\top}\overline{P}_{t+1}^{1/2}\big[I+\overline{P}^{1/2}_{t+1}B\overline{R}^{-1}B^{\top}\overline{P}_{t+1}^{1/2}\big]^{-1}\overline{P}_{t+1}^{1/2}\overline{A}
\leq\big[1+\lambda_{\min}(\overline{P}^{1/2}_{t+1}B\overline{R}^{-1}B^{\top}\overline{P}_{t+1}^{1/2})\big]^{-1}\overline{A}^{\top}\overline{P}_{t+1}\overline{A}
(98) =:\mu_{t}\overline{A}^{\top}\overline{P}_{t+1}\overline{A},

where \overline{P}_{t+1}^{1/2} denotes the unique positive semi-definite (psd) square root of the psd matrix \overline{P}_{t+1}, 0<\mu_{t}\leq 1 for all t, and \overline{A} satisfies \rho(\overline{A})<1. We now use \|\cdot\|_{*} to represent the P^{*}-induced matrix norm and invoke Theorem 14.4.1 of [7], where our \overline{P}_{t}, \overline{A}^{\top} and P^{*} correspond to P_{i}-P^{*}, F_{p} and W in [7], respectively. By Theorem 14.4.1 of [7] and (98), we obtain \|\overline{A}\|_{*}<1 and, given that \mu_{t}\leq 1,

\|\overline{P}_{t}\|_{*}\leq\|\overline{A}\|^{2}_{*}\cdot\|\overline{P}_{t+1}\|_{*}.

Therefore, the convergence is exponential, such that \|\overline{P}_{t}\|_{*}\leq\|\overline{A}\|_{*}^{2(N-t)}\cdot\|\overline{P}_{N}\|_{*}. As a result, the convergence of \overline{P}_{t} to 0 in spectral norm can be characterized as

\|\overline{P}_{t}\|\leq\kappa_{P^{*}}\cdot\|\overline{P}_{t}\|_{*}\leq\kappa_{P^{*}}\cdot\|\overline{A}\|_{*}^{2(N-t)}\cdot\|\overline{P}_{N}\|_{*},

where we have used \kappa_{X} to denote the condition number of X. That is, to ensure \|\overline{P}_{1}\|\leq\epsilon, it suffices to require

(99) N\geq\frac{1}{2}\cdot\frac{\log\big(\frac{\|\overline{P}_{N}\|_{*}\cdot\kappa_{P^{*}}}{\epsilon}\big)}{\log\big(\frac{1}{\|\overline{A}\|_{*}}\big)}+1.

Lastly, we show that the (monotonic) convergence of K^{*}_{t} to K^{*} follows from the convergence of P^{*}_{t} to P^{*}. This can be verified through:

K^{*}_{t}-K^{*} =(R+B^{\top}P^{*}_{t+1}B)^{-1}B^{\top}P^{*}_{t+1}A-(R+B^{\top}P^{*}B)^{-1}B^{\top}P^{*}A
=\big[(R+B^{\top}P^{*}_{t+1}B)^{-1}-(R+B^{\top}P^{*}B)^{-1}\big]B^{\top}P^{*}A+(R+B^{\top}P^{*}_{t+1}B)^{-1}B^{\top}(P^{*}_{t+1}-P^{*})A
=(R+B^{\top}P^{*}_{t+1}B)^{-1}B^{\top}(P^{*}-P^{*}_{t+1})BK^{*}-(R+B^{\top}P^{*}_{t+1}B)^{-1}B^{\top}(P^{*}-P^{*}_{t+1})A
(100) =(R+B^{\top}P^{*}_{t+1}B)^{-1}B^{\top}(P^{*}-P^{*}_{t+1})(BK^{*}-A).

Hence, we have \|K^{*}_{t}-K^{*}\|\leq\frac{\|\overline{A}\|\cdot\|B\|}{\lambda_{\min}(R)}\cdot\|P^{*}_{t+1}-P^{*}\| and

\|K^{*}_{0}-K^{*}\|\leq\frac{\|\overline{A}\|\cdot\|B\|}{\lambda_{\min}(R)}\cdot\|\overline{P}_{1}\|.

Substituting \epsilon in (99) with \frac{\epsilon\cdot\lambda_{\min}(R)}{\|\overline{A}\|\cdot\|B\|} completes the proof.
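The geometric decay of \overline{P}_{t} is easy to check numerically. The sketch below runs the backward Riccati recursion on the scalar example of Section 6 (A=5, B=0.33, Q=R=1, Q_{N}=300) and prints the per-step contraction of P_{t}-P^{*}, which stays below \|\overline{A}\|^{2}; in the scalar case all the norms above reduce to absolute values, so this is only a sanity check, not a verification of the matrix-valued statement.

```python
import numpy as np

# Sanity check of the geometric decay of P_t - P* on the scalar example of
# Section 6: A = 5, B = 0.33, Q = R = 1, terminal weight Q_N = 300 >= P*.
A, B, Q, R, Q_N = 5.0, 0.33, 1.0, 1.0, 300.0

def riccati(P):
    """One backward step of the Riccati difference equation (scalar case)."""
    return Q + A * P * A - (A * P * B) ** 2 / (R + B * P * B)

# Infinite-horizon P* and K* as the fixed point of the recursion.
P_star = Q_N
for _ in range(100):
    P_star = riccati(P_star)
K_star = B * P_star * A / (R + B * P_star * B)
A_bar = A - B * K_star                      # the closed-loop matrix of Appendix A

# Backward recursion from Q_N: the gap P_t - P* contracts by at most |A_bar|^2.
P, gaps = Q_N, []
for _ in range(8):
    gaps.append(P - P_star)
    P = riccati(P)
ratios = [gaps[i + 1] / gaps[i] for i in range(len(gaps) - 1)]
print("P* = %.4f, K* = %.4f, |A_bar|^2 = %.4f" % (P_star, K_star, A_bar ** 2))
print("per-step contraction of P_t - P*:", np.round(ratios, 4))
```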

Remark A.1.

Note that since Theorem 3.1 requires \overline{P}_{t}=P^{*}_{t}-P^{*} to be positive semi-definite for each t, it is implicitly assumed that we have access to a terminal weight P^{*}_{N}=Q_{N} satisfying Q_{N}\succeq P^{*}, so that, due to the monotonic convergence of (20), it holds that

P^{*}_{N}\succeq P^{*}_{N-1}\succeq\cdots\succeq P^{*}_{0}\succeq P^{*},

satisfying said requirement.

Appendix B Proof of Theorem 3.2

We start the proof by providing some preliminary results.

Lemma B.1.

Let U and V be two positive definite matrices. It holds that

(101) \|U-V\|\leq\|V\|\,e^{\delta(U,V)}\,\delta(U,V).

Furthermore, if

(102) \|V^{-1}\|\,\|U-V\|<1,

then we have

(103) \delta(U,V)\leq\frac{\|V^{-1}\|\,\|U-V\|_{F}}{1-\|V^{-1}\|\,\|U-V\|}.
Proof.

First, since U and V are positive definite, we have that V^{-1/2}UV^{-1/2} is positive definite, and therefore has a logarithm; so we let

Z:=\log(V^{-1/2}UV^{-1/2}),

and hence, we can write

U=V^{1/2}\exp(Z)V^{1/2}.

The eigenvalues of Z are precisely the logarithms of the eigenvalues of UV^{-1}, due to UV^{-1} and V^{-1/2}UV^{-1/2} being similar. As a result,

\delta(U,V)=\|Z\|_{F}.

We now write

UV=V1/2exp(Z)V1/2V=V1/2(exp(Z)I)V1/2,U-V=V^{1/2}\exp(Z)V^{1/2}-V=V^{1/2}(\exp(Z)-I)V^{1/2},

and thus,

(104) UVVexp(Z)I.\|U-V\|\leq\|V\|\,\|\exp(Z)-I\|.

Since e^{x}-1\leq xe^{x} whenever x\geq 0, by considering the power-series expansion of e^{Z} we also have, for any matrix Z,

\|\exp(Z)-I\|\leq e^{\|Z\|}-1\leq e^{\|Z\|}\,\|Z\|.

Since the spectral norm is always bounded by the Frobenius norm, we have:

exp(Z)IeZFZF.\|\exp(Z)-I\|\leq e^{\|Z\|_{F}}\,\|Z\|_{F}.

Finally, recalling that \|Z\|_{F}=\delta(U,V), this becomes:

exp(Z)Ieδ(U,V)δ(U,V),\|\exp(Z)-I\|\leq e^{\delta(U,V)}\,\delta(U,V),

which after substituting into (104) yields:

UVVeδ(U,V)δ(U,V),\|U-V\|\leq\|V\|\,e^{\delta(U,V)}\,\delta(U,V),

concluding the proof of the first claim. We now move on to the second claim. As before, we write

δ(U,V)=log(V1/2UV1/2)F.\delta(U,V)=\|\log(V^{-1/2}UV^{-1/2})\|_{F}.

We now define

X:=V1/2(UV)V1/2,X:=V^{-1/2}(U-V)V^{-1/2},

so that

V1/2UV1/2=I+X.V^{-1/2}UV^{-1/2}=I+X.

Moreover, following (102),

X=V1/2(UV)V1/2V1UV<1,\|X\|=\|V^{-1/2}(U-V)V^{-1/2}\|\leq\|V^{-1}\|\|U-V\|<1,

and hence, one can use the series expansion of the logarithm

log(I+X)=X12X2+,\log(I+X)=X-\frac{1}{2}X^{2}+\cdots,

to show

(105) log(I+X)F\displaystyle\|\log(I+X)\|_{F} =k=1(1)k+1kXkF\displaystyle=\left\|\sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{k}X^{k}\right\|_{F}
(106) k=1XkF\displaystyle\leq\sum_{k=1}^{\infty}\|X^{k}\|_{F}
(107) k=1XFXk1\displaystyle\leq\sum_{k=1}^{\infty}\|X\|_{F}\|X^{k-1}\|
(108) XFk=0Xk\displaystyle\leq\|X\|_{F}\sum_{k=0}^{\infty}\|X\|^{k}
(109) =XF1X.\displaystyle=\frac{\|X\|_{F}}{1-\|X\|}.

As a result, we have

δ(U,V)\displaystyle\delta(U,V) =log(I+X)F\displaystyle=\|\log(I+X)\|_{F}
XF1X\displaystyle\leq\frac{\|X\|_{F}}{1-\|X\|}
=V1/2(UV)V1/2F1V1/2(UV)V1/2\displaystyle=\frac{\|V^{-1/2}(U-V)V^{-1/2}\|_{F}}{1-\|V^{-1/2}(U-V)V^{-1/2}\|}
V1(UV)F1V1(UV),\displaystyle\leq\frac{\|V^{-1}\|\|(U-V)\|_{F}}{1-\|V^{-1}\|\|(U-V)\|},

finishing the proof. ∎
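The two bounds of Lemma B.1 can be spot-checked numerically. The snippet below evaluates \delta(U,V) through the generalized eigenvalues of the pencil (U,V) (which coincide with the eigenvalues of V^{-1/2}UV^{-1/2}) on randomly generated positive definite matrices; the matrix size and perturbation scale are arbitrary choices made only for illustration.

```python
import numpy as np
from scipy.linalg import eigh

# Numerical spot-check of the two bounds in Lemma B.1 on randomly generated
# symmetric positive definite matrices (purely illustrative, not from the paper).
rng = np.random.default_rng(1)

def random_spd(n):
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)               # comfortably positive definite

def riem_delta(U, V):
    """delta(U, V) = ||log(V^{-1/2} U V^{-1/2})||_F, computed via the
    generalized eigenvalues of the pencil (U, V)."""
    lam = eigh(U, V, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

n = 4
V = random_spd(n)
U = V + 0.05 * random_spd(n)                     # small perturbation so (102) holds
d = riem_delta(U, V)

# Bound (101): ||U - V|| <= ||V|| e^{delta} delta.
print(np.linalg.norm(U - V, 2) <= np.linalg.norm(V, 2) * np.exp(d) * d)

# Bound (103): delta <= ||V^{-1}|| ||U - V||_F / (1 - ||V^{-1}|| ||U - V||).
c = np.linalg.norm(np.linalg.inv(V), 2) * np.linalg.norm(U - V, 2)
print(c < 1 and d <= np.linalg.norm(np.linalg.inv(V), 2)
      * np.linalg.norm(U - V, 'fro') / (1 - c))
```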

Building on Lemma B.1, we proceed to state the following result regarding the LQR setting.

Lemma B.2.

Let t\in\{1,2,\dotsc,N-1\}, select Q_{N}\succeq Q, and suppose Assumption 2.1 holds. Additionally, assume that for all t^{\prime}\in\{t+1,t+2,\dotsc,N\}, we have

(110) \|P^{*}_{t^{\prime}}-\widetilde{P}_{t^{\prime}}\|\leq a,\quad\text{and}
(111) \delta(\widetilde{P}^{*}_{t^{\prime}},\widetilde{P}_{t^{\prime}})\leq\varepsilon,

where \varepsilon satisfies

(112) \varepsilon\leq\frac{1}{N}\min\left\{\frac{a}{2e\|P_{\max}\|},1\right\}.

If

(113) \|\widetilde{K}_{t}-\widetilde{K}^{*}_{t}\|_{F}\leq\sqrt{\frac{a}{C_{3}}\varepsilon},

then the following bounds hold:

(114) \|P^{*}_{t}-\widetilde{P}_{t}\|\leq a,\quad\text{and}
(115) \delta(\widetilde{P}^{*}_{t},\widetilde{P}_{t})\leq\varepsilon.
Proof.

Before we move on to the proof, we establish some preliminary results. First, note that since

PN=P~N=QNQ0,P^{*}_{N}=\widetilde{P}_{N}=Q_{N}\succeq Q\succ 0,

due to the monotonic convergence of (20) to PQP^{*}\succeq Q (see [7]), we have that PtQP^{*}_{t}\succeq Q for all t{1,2,,N}t\in\{1,2,\dotsc,N\}. Therefore, it holds that

(116) σmin(Pt)σmin(Q)=2a>0.\sigma_{\min}(P^{*}_{t})\geq\sigma_{\min}(Q)=2a>0.

Moreover, due to (110), we have

(117) P~tPtaIaI0\widetilde{P}_{t^{\prime}}\succeq P^{*}_{t^{\prime}}-aI\succeq aI\succ 0

for all t{t+1,t+2,,N}t^{\prime}\in\{t+1,t+2,\dotsc,N\}. Now since (116), (117), and Assumption 2.1 all hold, we can apply Lemma 2.1 to show that for all t{t+1,t+2,,N}t^{\prime}\in\{t+1,t+2,\dotsc,N\},

(118) δ(Pt1,P~t1)\displaystyle\delta(P_{t^{\prime}-1}^{*},\widetilde{P}_{t^{\prime}-1}^{*}) =(i)δ((Pt),(P~t))\displaystyle\overset{\mathrm{(i)}}{=}\delta(\mathcal{R}(P^{*}_{t^{\prime}}),\mathcal{R}(\widetilde{P}_{t^{\prime}}))
(119) δ(Pt,P~t),\displaystyle\leq\delta(P^{*}_{t^{\prime}},\widetilde{P}_{t^{\prime}}),

where (i) follows from (26) and (27). Following (119), we can now write

(120) δ(Pt,P~t)\displaystyle\delta(P_{t}^{*},\widetilde{P}_{t}^{*}) δ(Pt+1,P~t+1)\displaystyle\leq\delta(P_{t+1}^{*},\widetilde{P}_{t+1})
(121) (i)δ(Pt+1,P~t+1)+δ(P~t+1,P~t+1)\displaystyle\overset{\mathrm{(i)}}{\leq}\delta(P_{t+1}^{*},\widetilde{P}^{*}_{t+1})+\delta(\widetilde{P}_{t+1}^{*},\widetilde{P}_{t+1})
(122) δ(Pt+2,P~t+2)+δ(P~t+1,P~t+1)\displaystyle\leq\delta(P_{t+2}^{*},\widetilde{P}_{t+2})+\delta(\widetilde{P}_{t+1}^{*},\widetilde{P}_{t+1})
(123) δ(Pt+2,P~t+2)+δ(P~t+2,P~t+2)+δ(P~t+1,P~t+1)\displaystyle\leq\delta(P_{t+2}^{*},\widetilde{P}^{*}_{t+2})+\delta(\widetilde{P}_{t+2}^{*},\widetilde{P}_{t+2})+\delta(\widetilde{P}_{t+1}^{*},\widetilde{P}_{t+1})
(124) \displaystyle\leq\cdots
(125) δ(PN,P~N)+k=1Nt1δ(P~t+k,P~t+k)\displaystyle\leq\delta(P_{N}^{*},\widetilde{P}_{N})+\sum_{k=1}^{N-t-1}\delta(\widetilde{P}_{t+k}^{*},\widetilde{P}_{t+k})
(126) =(ii)k=1Nt1δ(P~t+k,P~t+k)\displaystyle\overset{\mathrm{(ii)}}{=}\sum_{k=1}^{N-t-1}\delta(\widetilde{P}_{t+k}^{*},\widetilde{P}_{t+k})
(127) (iii)εN,\displaystyle\overset{\mathrm{(iii)}}{\leq}\varepsilon N,

where (i) is due to the triangle inequality of the Riemannian distance [3], (ii) follows from PN=P~N=QNP_{N}^{*}=\widetilde{P}_{N}=Q_{N}, and (iii) from (111). We now start the proof of (114) by writing

(128) PtP~tPtP~t+P~tP~t,\displaystyle\|P^{*}_{t}-\widetilde{P}_{t}\|\leq\|P^{*}_{t}-\widetilde{P}^{*}_{t}\|+\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\|,

and bounding each of the two terms on the right-hand side of (128). For the first term, we have

(129) PtP~t\displaystyle\|P^{*}_{t}-\widetilde{P}^{*}_{t}\| (i)Pmaxeδ(Pt,P~t)δ(Pt,P~t)\displaystyle\overset{\mathrm{(i)}}{\leq}\|P_{\max}\|e^{\delta(P^{*}_{t},\widetilde{P}^{*}_{t})}\delta(P^{*}_{t},\widetilde{P}^{*}_{t})
(130) (ii)PmaxeεNεN\displaystyle\overset{\mathrm{(ii)}}{\leq}\|P_{\max}\|e^{\varepsilon N}\varepsilon N
(131) (iii)a2,\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{a}{2},

where (i) follows from (101), (ii) from (127), and (iii) from the condition on ε\varepsilon in (112). As for the second term on the right-hand side of (128), we can write

P~tP~t\displaystyle\widetilde{P}^{*}_{t}-\widetilde{P}_{t} =(ABK~t)P~t+1(ABK~t)+(K~t)RK~t(ABK~t)P~t+1(ABK~t)(K~t)RK~t\displaystyle=(A-B\widetilde{K}^{*}_{t})^{\top}\widetilde{P}_{t+1}(A-B\widetilde{K}^{*}_{t})+(\widetilde{K}^{*}_{t})^{\top}R\widetilde{K}^{*}_{t}-(A-B\widetilde{K}_{t})^{\top}\widetilde{P}_{t+1}(A-B\widetilde{K}_{t})-(\widetilde{K}_{t})^{\top}R\widetilde{K}_{t}
=(K~t)BP~t+1AAP~t+1BK~t+(K~t)(R+BP~t+1B)K~t\displaystyle=-(\widetilde{K}^{*}_{t})^{\top}B^{\top}\widetilde{P}_{t+1}A-A^{\top}\widetilde{P}_{t+1}B\widetilde{K}_{t}^{*}+(\widetilde{K}^{*}_{t})^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\widetilde{K}^{*}_{t}
+K~tBP~t+1A+AP~t+1BK~tK~t(R+BP~t+1B)K~t\displaystyle\hskip 10.00002pt+\widetilde{K}_{t}^{\top}B^{\top}\widetilde{P}_{t+1}A+A^{\top}\widetilde{P}_{t+1}B\widetilde{K}_{t}-\widetilde{K}_{t}^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\widetilde{K}_{t}
=(i)[(R+BP~t+1B)1BP~t+1AK~t](R+BP~t+1B)[(R+BP~t+1B)1BP~t+1AK~t]\displaystyle\overset{\mathrm{(i)}}{=}\big{[}(R+B^{\top}\widetilde{P}_{t+1}B)^{-1}B^{\top}\widetilde{P}_{t+1}A-\widetilde{K}^{*}_{t}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\big{[}(R+B^{\top}\widetilde{P}_{t+1}B)^{-1}B^{\top}\widetilde{P}_{t+1}A-\widetilde{K}^{*}_{t}\big{]}
[(R+BP~t+1B)1BP~t+1AK~t](R+BP~t+1B)[(R+BP~t+1B)1BP~t+1AK~t]\displaystyle\hskip 10.00002pt-\big{[}(R+B^{\top}\widetilde{P}_{t+1}B)^{-1}B^{\top}\widetilde{P}_{t+1}A-\widetilde{K}_{t}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\big{[}(R+B^{\top}\widetilde{P}_{t+1}B)^{-1}B^{\top}\widetilde{P}_{t+1}A-\widetilde{K}_{t}\big{]}
=[K~tK~t](R+BP~t+1B)[K~tK~t][K~tK~t](R+BP~t+1B)[K~tK~t]\displaystyle=\big{[}\widetilde{K}^{*}_{t}-\widetilde{K}^{*}_{t}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\big{[}\widetilde{K}^{*}_{t}-\widetilde{K}^{*}_{t}\big{]}-\big{[}\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\big{[}\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\big{]}
(132) =[K~tK~t](R+BP~t+1B)[K~tK~t],\displaystyle=-\big{[}\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{t+1}B)\big{[}\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\big{]},

where (i) follows from completing the square, using \widetilde{K}^{*}_{t}=(R+B^{\top}\widetilde{P}_{t+1}B)^{-1}B^{\top}\widetilde{P}_{t+1}A, which makes the first quadratic term vanish. Combining (132) and (110), we have

(133) \|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\| \leq\|R+B^{\top}(P_{\max}+aI)B\|\,\|\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\|^{2}
(134) C32K~tK~t2\displaystyle\leq\frac{C_{3}}{2}\|\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\|^{2}
(135) (C32)(aC3ε)\displaystyle\leq\left(\frac{C_{3}}{2}\right)\left(\frac{a}{C_{3}}\varepsilon\right)
(136) a2.\displaystyle\leq\frac{a}{2}.

Finally, substituting (131) and (136) in (128), we have

PtP~ta2+a2=a,\|P^{*}_{t}-\widetilde{P}_{t}\|\leq\frac{a}{2}+\frac{a}{2}=a,

finishing the proof of (114). Having established this, we proceed to prove (115). Note that similar to (133), we can write

(137) P~tP~tF\displaystyle\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\|_{F} R+B(Pmax+aI)BK~tK~tF2\displaystyle\leq\|R+B^{\top}(P_{\max}+aI)B\|\|\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\|_{F}^{2}
(138) C32K~tK~tF2.\displaystyle\leq\frac{C_{3}}{2}\|\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\|_{F}^{2}.

Moreover, due to (114), we have that P~tPtaI\widetilde{P}_{t}\succeq P^{*}_{t}-aI, and hence,

(139) σmin(P~t)σmin(Pt)a(i)a,\displaystyle\sigma_{\min}(\widetilde{P}_{t})\geq\sigma_{\min}(P^{*}_{t})-a\overset{\mathrm{(i)}}{\geq}a,

where (i) follows from (116). Combining (139) and (136), we have

(140) P~t1P~tP~t\displaystyle\|\widetilde{P}_{t}^{-1}\|\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\| =P~tP~tσmin(P~t)\displaystyle=\frac{\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\|}{\sigma_{\min}(\widetilde{P}_{t})}
(141) a/2a\displaystyle\leq\frac{a/2}{a}
(142) =12.\displaystyle=\frac{1}{2}.

Thus, the condition (102) of Lemma B.1 is met, and we can utilize (103) to write

\delta(\widetilde{P}^{*}_{t},\widetilde{P}_{t}) \leq\frac{\|\widetilde{P}_{t}^{-1}\|\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\|_{F}}{1-\|\widetilde{P}_{t}^{-1}\|\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\|}
\overset{\mathrm{(i)}}{\leq}\frac{(1/a)\|\widetilde{P}^{*}_{t}-\widetilde{P}_{t}\|_{F}}{(1/2)}
\overset{\mathrm{(ii)}}{\leq}\frac{C_{3}}{a}\|\widetilde{K}^{*}_{t}-\widetilde{K}_{t}\|_{F}^{2}
\overset{\mathrm{(iii)}}{\leq}\varepsilon,

where (i) follows from (139) and (142), (ii) from (138), and (iii) from condition (113). This verifies (115), concluding the proof. ∎
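A useful building block in the proof above is the completion-of-squares identity (132), which says that replacing the minimizing gain by any other gain increases the one-step Riccati value by a quadratic in the gain error. The short check below verifies this identity on random matrices; the dimensions and the perturbation of the gain are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of the completion-of-squares identity (132): one backward
# Riccati step under a fixed gain K, minus the step under the minimizing gain,
# equals (K - K_opt)^T (R + B^T P B) (K - K_opt). All matrices below are random
# and purely illustrative.
rng = np.random.default_rng(3)
n, m = 3, 2
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
Q, R = np.eye(n), np.eye(m)
M = rng.normal(size=(n, n))
P_next = M @ M.T + np.eye(n)                     # a positive definite "P_{t+1}"

def riccati_with_gain(P, K):
    Acl = A - B @ K
    return Q + K.T @ R @ K + Acl.T @ P @ Acl

S = R + B.T @ P_next @ B
K_opt = np.linalg.solve(S, B.T @ P_next @ A)     # minimizing gain for this P
K = K_opt + 0.1 * rng.normal(size=(m, n))        # perturbed gain

lhs = riccati_with_gain(P_next, K) - riccati_with_gain(P_next, K_opt)
rhs = (K - K_opt).T @ S @ (K - K_opt)
print(np.allclose(lhs, rhs))                     # True
```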

Having established Lemma B.2, we can finally present the proof of Theorem 3.2.

Proof of Theorem 3.2: First, according to Theorem 3.1, our choice of NN in (31) ensures that K0K^{*}_{0} is stabilizing and K0Kϵ/2\|K^{*}_{0}-K^{*}\|\leq\epsilon/2. Then, it remains to show that the output K~0\widetilde{K}_{0} satisfies K~0K0ϵ/2\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\epsilon/2.

Now observe that

K~0K0K~0K0+K~0K~0,\displaystyle\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|+\|\widetilde{K}_{0}-\widetilde{K}_{0}^{*}\|,

where substituting KtK^{*}_{t} and KK^{*} in (100), respectively, with K~0\widetilde{K}^{*}_{0} and K0K^{*}_{0} leads to

K~0K0=(R+BP~1B)1B(P1P~1)(BK0A).\displaystyle\widetilde{K}^{*}_{0}-K^{*}_{0}=(R+B^{\top}\widetilde{P}_{1}B)^{-1}B^{\top}(P^{*}_{1}-\widetilde{P}_{1})(BK^{*}_{0}-A).

Hence, the error size K~0K0\|\widetilde{K}^{*}_{0}-K^{*}_{0}\| could be bounded by

(143) K~0K0ABK0Bλmin(R)P1P~1.\displaystyle\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|\leq\frac{\|A-BK^{*}_{0}\|\cdot\|B\|}{\lambda_{\min}(R)}\cdot\|P^{*}_{1}-\widetilde{P}_{1}\|.

Next, since we have K~0K~0ς0,ε=ϵ/4\|\widetilde{K}_{0}-\widetilde{K}_{0}^{*}\|\leq\varsigma_{0,\varepsilon}=\epsilon/4, it suffices to show K~0K0ϵ/4\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|\leq\epsilon/4 to fulfill K~0K0ϵ/2\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\epsilon/2. Then, by (143), in order to satisfy K~0K0ϵ/4\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|\leq\epsilon/4, it remains to show

(144) P1P~1ϵ4C1.\displaystyle\|P^{*}_{1}-\widetilde{P}_{1}\|\leq\frac{\epsilon}{4C_{1}}.

In order to show this, we first let

(145) \varepsilon=\frac{1}{N}\min\left\{\frac{\epsilon}{8eC_{1}\|P_{\max}\|},\frac{a}{2e\|P_{\max}\|},1\right\},

which clearly satisfies (112). Now we want to show, by strong induction, that

PtP~ta,and\displaystyle\|P^{*}_{t}-\widetilde{P}_{t}\|\leq a,\quad\text{and}
δ(P~t,P~t)ε,\displaystyle\delta(\widetilde{P}^{*}_{t},\widetilde{P}_{t})\leq\varepsilon,

for all t{N,N1,,1}t\in\{N,N-1,\dotsc,1\}. For the base case, we have

PN=P~N=P~N=QN,P^{*}_{N}=\widetilde{P}^{*}_{N}=\widetilde{P}_{N}=Q_{N},

and hence, it immediately follows that

PNP~N=0a,and\displaystyle\|P^{*}_{N}-\widetilde{P}_{N}\|=0\leq a,\quad\text{and}
δ(P~N,P~N)=0ε.\displaystyle\delta(\widetilde{P}^{*}_{N},\widetilde{P}_{N})=0\leq\varepsilon.

Now since it holds in the statement of Theorem 3.2 that

K~hK~hςh,εaC3ε,\|\widetilde{K}_{h}-\widetilde{K}^{*}_{h}\|\leq\varsigma_{h,\varepsilon}\leq\sqrt{\frac{a}{C_{3}}\varepsilon},

which satisfies (113), the inductive step follows directly from Lemma B.2. We have now succesfully established that

(146) PtP~ta,and\displaystyle\|P^{*}_{t}-\widetilde{P}_{t}\|\leq a,\quad\text{and}
(147) δ(P~t,P~t)ε,\displaystyle\delta(\widetilde{P}^{*}_{t},\widetilde{P}_{t})\leq\varepsilon,

for all t{N,N1,,1}t\in\{N,N-1,\dotsc,1\}. As a result, we have

(148) δ(P1,P~1)\displaystyle\delta(P_{1}^{*},\widetilde{P}_{1}^{*}) δ(P2,P~2)\displaystyle\leq\delta(P_{2}^{*},\widetilde{P}_{2})
(149) δ(P2,P~2)+δ(P~2,P~2)\displaystyle\leq\delta(P_{2}^{*},\widetilde{P}^{*}_{2})+\delta(\widetilde{P}_{2}^{*},\widetilde{P}_{2})
(150) δ(P3,P~3)+δ(P~2,P~2)\displaystyle\leq\delta(P_{3}^{*},\widetilde{P}_{3})+\delta(\widetilde{P}_{2}^{*},\widetilde{P}_{2})
(151) δ(P3,P~3)+δ(P~3,P~3)+δ(P~2,P~2)\displaystyle\leq\delta(P_{3}^{*},\widetilde{P}^{*}_{3})+\delta(\widetilde{P}_{3}^{*},\widetilde{P}_{3})+\delta(\widetilde{P}_{2}^{*},\widetilde{P}_{2})
(152) \displaystyle\leq\cdots
(153) δ(PN,P~N)+k=2N1δ(P~k,P~k)\displaystyle\leq\delta(P_{N}^{*},\widetilde{P}_{N})+\sum_{k=2}^{N-1}\delta(\widetilde{P}_{k}^{*},\widetilde{P}_{k})
(154) =\sum_{k=2}^{N-1}\delta(\widetilde{P}_{k}^{*},\widetilde{P}_{k})
(155) εN.\displaystyle\leq\varepsilon N.

We now show (144) by writing

(156) P1P~1P1P~1+P~1P~1,\displaystyle\|P^{*}_{1}-\widetilde{P}_{1}\|\leq\|P^{*}_{1}-\widetilde{P}^{*}_{1}\|+\|\widetilde{P}^{*}_{1}-\widetilde{P}_{1}\|,

and providing a bound for both terms of the right-hand side of (156). For the first term, we have

(157) P1P~1\displaystyle\|P^{*}_{1}-\widetilde{P}^{*}_{1}\| (i)Pmaxeδ(P1,P~1)δ(P1,P~1)\displaystyle\overset{\mathrm{(i)}}{\leq}\|P_{\max}\|e^{\delta(P^{*}_{1},\widetilde{P}^{*}_{1})}\delta(P^{*}_{1},\widetilde{P}^{*}_{1})
(158) (ii)PmaxeεNεN\displaystyle\overset{\mathrm{(ii)}}{\leq}\|P_{\max}\|e^{\varepsilon N}\varepsilon N
(159) \overset{\mathrm{(iii)}}{\leq}\frac{\epsilon}{8C_{1}},

where (i) follows from Lemma B.1, (ii) from (155), and (iii) from (145). As for the second term on the right-hand side of (156), we utilize (132) to write

(160) P~1P~1\displaystyle\|\widetilde{P}^{*}_{1}-\widetilde{P}_{1}\| R+BP~2BK~1K~12\displaystyle\leq\|R+B^{\top}\widetilde{P}_{2}B\|\|\widetilde{K}_{1}-\widetilde{K}_{1}^{*}\|^{2}
(161) (i)R+B(Pmax+aI)B(ς1,ε)2\displaystyle\overset{\mathrm{(i)}}{\leq}\|R+B^{\top}(P_{\max}+aI)B\|(\varsigma_{1,\varepsilon})^{2}
(162) \overset{\mathrm{(ii)}}{\leq}\frac{C_{3}}{2}\cdot\frac{\epsilon}{4C_{1}C_{3}}
(163) =\frac{\epsilon}{8C_{1}},

where (i) follows from (146), and (ii) is due to the definition of ς1,ε\varsigma_{1,\varepsilon} in (29). Finally, substituting (159) and (163) in (156), we have

\|P^{*}_{1}-\widetilde{P}_{1}\|\leq\frac{\epsilon}{8C_{1}}+\frac{\epsilon}{8C_{1}}=\frac{\epsilon}{4C_{1}},

thereby establishing (144) and concluding the proof of Theorem 3.2. \diamond

Appendix C Proof of Theorem 3.3

First, according to Theorem 3.1, we select

(164) N=\frac{1}{2}\cdot\frac{\log\big(\frac{2\|Q_{N}-P^{*}\|_{*}\cdot\kappa_{P^{*}}\cdot\|A_{K}^{*}\|\cdot\|B\|}{\epsilon\cdot\lambda_{\min}(R)}\big)}{\log\big(\frac{1}{\|A_{K}^{*}\|_{*}}\big)}+1,

where AK:=ABKA_{K}^{*}:=A-BK^{*}. This ensures that K0K^{*}_{0} is stabilizing and K0Kϵ/2\|K^{*}_{0}-K^{*}\|\leq\epsilon/2. Then, it remains to show that the output K~0\widetilde{K}_{0} satisfies K~0K0ϵ/2\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\epsilon/2.

Recall that the RDE (20) is a backward iteration starting with PN=QN0P^{*}_{N}=Q_{N}\geq 0, and can also be represented as:

(165) Pt\displaystyle P^{*}_{t} =(ABKt)Pt+1(ABKt)+(Kt)RKt+Q\displaystyle=(A-BK^{*}_{t})^{\top}P^{*}_{t+1}(A-BK^{*}_{t})+(K^{*}_{t})^{\top}RK^{*}_{t}+Q
(166) =APt+1(ABKt)+Q+(Kt)(R+BPt+1B)Kt(Kt)(BPt+1A)\displaystyle=A^{\top}P^{*}_{t+1}\big{(}A-BK^{*}_{t}\big{)}+Q+(K^{*}_{t})^{\top}(R+B^{\top}P^{*}_{t+1}B)K^{*}_{t}-(K^{*}_{t})^{\top}(B^{\top}P^{*}_{t+1}A)
(167) =(i)APt+1(ABKt)+Q,\displaystyle\overset{\mathrm{(i)}}{=}A^{\top}P^{*}_{t+1}\big{(}A-BK^{*}_{t}\big{)}+Q,

where (i) comes from Kt=(R+BPt+1B)1(BPt+1A)K^{*}_{t}=(R+B^{\top}P^{*}_{t+1}B)^{-1}(B^{\top}P^{*}_{t+1}A). Moreover, for clarity of proof, we denote the policy optimization error at time tt by:

et:=K~tK~t.e_{t}:=\widetilde{K}_{t}-\widetilde{K}_{t}^{*}.

We argue that K~0K0ϵ/2\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\epsilon/2 can be achieved by carefully controlling ete_{t} for all tt. At t=0t=0, it holds that

K~0K0K~0K0+e0,\displaystyle\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|+\|e_{0}\|,

where substituting KtK^{*}_{t} and KK^{*} in (100), respectively, with K~0\widetilde{K}^{*}_{0} and K0K^{*}_{0} leads to

K~0K0=(R+BP~1B)1B(P1P~1)(BK0A).\displaystyle\widetilde{K}^{*}_{0}-K^{*}_{0}=(R+B^{\top}\widetilde{P}_{1}B)^{-1}B^{\top}(P^{*}_{1}-\widetilde{P}_{1})(BK^{*}_{0}-A).

Hence, the error size K~0K0\|\widetilde{K}^{*}_{0}-K^{*}_{0}\| could be bounded by

(168) K~0K0ABK0Bλmin(R)P1P~1.\displaystyle\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|\leq\frac{\|A-BK^{*}_{0}\|\cdot\|B\|}{\lambda_{\min}(R)}\cdot\|P^{*}_{1}-\widetilde{P}_{1}\|.

Next, we require e0ϵ/4\|e_{0}\|\leq\epsilon/4 and K~0K0ϵ/4\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|\leq\epsilon/4 to fulfill K~0K0ϵ/2\|\widetilde{K}_{0}-K^{*}_{0}\|\leq\epsilon/2. We additionally require P1P~1a\|P^{*}_{1}-\widetilde{P}_{1}\|\leq a to upper-bound the positive definite solutions of (18). Then, by (168), in order to fulfill K~0K0ϵ/4\|\widetilde{K}^{*}_{0}-K^{*}_{0}\|\leq\epsilon/4, it suffices to require

(169) P1P~1min{a,ϵ4C1}.\displaystyle\|P^{*}_{1}-\widetilde{P}_{1}\|\leq\min\bigg{\{}a,\frac{\epsilon}{4C_{1}}\bigg{\}}.

Subsequently, we have

(170) P1P~1=(P1P~1)+(P~1P~1).\displaystyle P^{*}_{1}-\widetilde{P}_{1}=(P^{*}_{1}-\widetilde{P}^{*}_{1})+(\widetilde{P}^{*}_{1}-\widetilde{P}_{1}).

The first difference term on the RHS of (170) is

P1P~1\displaystyle P^{*}_{1}-\widetilde{P}^{*}_{1} =AP2(ABK1)AP~2(ABK~1)\displaystyle=A^{\top}P^{*}_{2}\big{(}A-BK^{*}_{1}\big{)}-A^{\top}\widetilde{P}_{2}\big{(}A-B\widetilde{K}^{*}_{1}\big{)}
(171) =A^{\top}(P_{2}^{*}-\widetilde{P}_{2})(A-BK^{*}_{1})+A^{\top}\widetilde{P}_{2}B(\widetilde{K}_{1}^{*}-K^{*}_{1})
(172) =A^{\top}(P_{2}^{*}-\widetilde{P}_{2})(A-BK^{*}_{1})-A^{\top}\widetilde{P}_{2}B(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}(P^{*}_{2}-\widetilde{P}_{2})(A-BK^{*}_{1})
(173) =A^{\top}[I-\widetilde{P}_{2}B(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}](P_{2}^{*}-\widetilde{P}_{2})(A-BK^{*}_{1}).

Moreover, the second term on the RHS of (170) is

P~1P~1\displaystyle\widetilde{P}^{*}_{1}-\widetilde{P}_{1} =(ABK~1)P~2(ABK~1)+(K~1)RK~1(ABK~1)P~2(ABK~1)(K~1)RK~1\displaystyle=(A-B\widetilde{K}^{*}_{1})^{\top}\widetilde{P}_{2}(A-B\widetilde{K}^{*}_{1})+(\widetilde{K}^{*}_{1})^{\top}R\widetilde{K}^{*}_{1}-(A-B\widetilde{K}_{1})^{\top}\widetilde{P}_{2}(A-B\widetilde{K}_{1})-(\widetilde{K}_{1})^{\top}R\widetilde{K}_{1}
=(K~1)BP~2AAP~2BK~1+(K~1)(R+BP~2B)K~1\displaystyle=-(\widetilde{K}^{*}_{1})^{\top}B^{\top}\widetilde{P}_{2}A-A^{\top}\widetilde{P}_{2}B\widetilde{K}_{1}^{*}+(\widetilde{K}^{*}_{1})^{\top}(R+B^{\top}\widetilde{P}_{2}B)\widetilde{K}^{*}_{1}
+K~1BP~2A+AP~2BK~1K~1(R+BP~2B)K~1\displaystyle\hskip 10.00002pt+\widetilde{K}_{1}^{\top}B^{\top}\widetilde{P}_{2}A+A^{\top}\widetilde{P}_{2}B\widetilde{K}_{1}-\widetilde{K}_{1}^{\top}(R+B^{\top}\widetilde{P}_{2}B)\widetilde{K}_{1}
=[(R+BP~2B)1BP~2AK~1](R+BP~2B)[(R+BP~2B)1BP~2AK~1]\displaystyle=\big{[}(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}\widetilde{P}_{2}A-\widetilde{K}^{*}_{1}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{2}B)\big{[}(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}\widetilde{P}_{2}A-\widetilde{K}^{*}_{1}\big{]}
(174) [(R+BP~2B)1BP~2AK~1](R+BP~2B)[(R+BP~2B)1BP~2AK~1]\displaystyle\hskip 10.00002pt-\big{[}(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}\widetilde{P}_{2}A-\widetilde{K}_{1}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{2}B)\big{[}(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}\widetilde{P}_{2}A-\widetilde{K}_{1}\big{]}
=[K~1K~1](R+BP~2B)[K~1K~1][K~1K~1](R+BP~2B)[K~1K~1]\displaystyle=\big{[}\widetilde{K}^{*}_{1}-\widetilde{K}^{*}_{1}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{2}B)\big{[}\widetilde{K}^{*}_{1}-\widetilde{K}^{*}_{1}\big{]}-\big{[}\widetilde{K}^{*}_{1}-\widetilde{K}_{1}\big{]}^{\top}(R+B^{\top}\widetilde{P}_{2}B)\big{[}\widetilde{K}^{*}_{1}-\widetilde{K}_{1}\big{]}
(175) =e1(R+BP~2B)e1,\displaystyle=-e_{1}^{\top}(R+B^{\top}\widetilde{P}_{2}B)e_{1},

where (174) follows from completion of squares. Thus, combining (170), (171), and (175) yields

P1P~1\displaystyle\hskip 13.00005pt\|P^{*}_{1}-\widetilde{P}_{1}\| P2P~2φAIP~2B(R+BP~2B)1B+e12R+BP~2B\displaystyle\leq\|P^{*}_{2}-\widetilde{P}_{2}\|\cdot\varphi\|A\|\|I-\widetilde{P}_{2}B(R+B^{\top}\widetilde{P}_{2}B)^{-1}B^{\top}\|+\|e_{1}\|^{2}\|R+B^{\top}\widetilde{P}_{2}B\|
(176) φA(1+Pmax+aIB2λmin(R))P2P~2+e12R+BP~2B.\displaystyle\leq\varphi\|A\|\left(1+\frac{\|P_{\max}+aI\|\|B\|^{2}}{\lambda_{\min}(R)}\right)\cdot\|P^{*}_{2}-\widetilde{P}_{2}\|+\|e_{1}\|^{2}\|R+B^{\top}\widetilde{P}_{2}B\|.

Note that the difference between (176) and its counterpart in [20] arises because the argument for

(I+P~2BR1B)11\|(I+\widetilde{P}_{2}BR^{-1}B^{\top})^{-1}\|\leq 1

does not hold, since \widetilde{P}_{2}BR^{-1}B^{\top} is not necessarily positive semi-definite (a product of two symmetric psd matrices is not necessarily psd unless the product is a normal matrix); a small numerical instance of this phenomenon is sketched below.
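As a quick illustration of why that argument fails, the following snippet exhibits positive semi-definite P and BR^{-1}B^{\top} (values chosen purely for illustration, not taken from the paper) for which the spectral norm of (I+PBR^{-1}B^{\top})^{-1} exceeds one, even though its eigenvalues are all at most one.

```python
import numpy as np

# Illustrative 2x2 instance (assumed values): P and B R^{-1} B^T are both
# positive semi-definite, yet the spectral norm of (I + P B R^{-1} B^T)^{-1}
# exceeds 1, because the product P B R^{-1} B^T is not symmetric. Its
# eigenvalues are still nonnegative, so the eigenvalues of the inverse are at
# most 1 -- the norm is the issue, not the spectrum.
P = np.diag([10.0, 0.1])
B = np.array([[1.0], [1.0]])
R = np.array([[1.0]])
M = np.linalg.inv(np.eye(2) + P @ B @ np.linalg.inv(R) @ B.T)
print(np.linalg.norm(M, 2))                  # ~1.34 > 1
print(np.abs(np.linalg.eigvals(M)).max())    # <= 1
```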

Now, we require

(177) \|P^{*}_{2}-\widetilde{P}_{2}\| \leq\min\bigg\{a,\frac{a}{C_{2}},\frac{\epsilon}{4C_{1}C_{2}}\bigg\},
(178) \|e_{1}\| \leq\min\bigg\{\sqrt{\frac{a}{C_{3}}},\frac{1}{2}\sqrt{\frac{\epsilon}{C_{1}C_{3}}}\bigg\}.

Then, conditions (177) and (178) are sufficient for (169) (and thus for \|\widetilde{K}_{0}-K^{*}_{0}\|\leq\epsilon/2) to hold. Subsequently, we can propagate the required accuracies in (177) and (178) forward in time. Specifically, we iteratively apply the arguments in (176) (i.e., by plugging quantities with subscript t into the LHS of (176) and plugging quantities with subscript t+1 into the RHS of (176)) to obtain the result that if, for all t\in\{2,\cdots,N-1\}, we require

(179) PtP~tmin{a,aC2t1,ϵ4C1C2t1}\displaystyle\|P^{*}_{t}-\widetilde{P}_{t}\|\leq\min\bigg{\{}a,\frac{a}{C_{2}^{t-1}},\frac{\epsilon}{4C_{1}C_{2}^{t-1}}\bigg{\}}
etmin{aC3,aC2t2C3,12ϵC1C2t2C3},\displaystyle\|e_{t}\|\leq\min\bigg{\{}\sqrt{\frac{a}{C_{3}}},\sqrt{\frac{a}{C_{2}^{t-2}C_{3}}},\frac{1}{2}\sqrt{\frac{\epsilon}{C_{1}C_{2}^{t-2}C_{3}}}\bigg{\}},

then (177) holds true and therefore (169) is satisfied.

We now compute the required accuracy for eN1e_{N-1}. Note that PN1=P~N1P^{*}_{N-1}=\widetilde{P}^{*}_{N-1} since no prior computational errors happened at t=Nt=N. By (176), the distance between PN1P^{*}_{N-1} and P~N1\widetilde{P}_{N-1} can be bounded as

PN1P~N1=P~N1P~N1eN12C3.\displaystyle\|P^{*}_{N-1}-\widetilde{P}_{N-1}\|=\|\widetilde{P}^{*}_{N-1}-\widetilde{P}_{N-1}\|\leq\|e_{N-1}\|^{2}\cdot C_{3}.

To fulfill the requirement (179) for t=N1t=N-1, which is

PN1P~N1min{a,aC2N2,ϵ4C1C2N2},\displaystyle\|P^{*}_{N-1}-\widetilde{P}_{N-1}\|\leq\min\bigg{\{}a,\frac{a}{C_{2}^{N-2}},\frac{\epsilon}{4C_{1}C_{2}^{N-2}}\bigg{\}},

it suffices to let

(180) eN1min{aC3,aC2N2C3,12ϵC1C2N2C3}.\displaystyle\|e_{N-1}\|\leq\min\bigg{\{}\sqrt{\frac{a}{C_{3}}},\sqrt{\frac{a}{C_{2}^{N-2}C_{3}}},\frac{1}{2}\sqrt{\frac{\epsilon}{C_{1}C_{2}^{N-2}C_{3}}}\bigg{\}}.

Finally, we analyze the worst-case complexity of RHPG by computing the required size of \|e_{t}\| in the most stringent case. When C_{2}\leq 1, the most stringent dependence of \|e_{t}\| on \epsilon happens at t=0, which is of the order \mathcal{O}(\epsilon), and the dependences on system parameters are \mathcal{O}(1). We then analyze the case where C_{2}>1, where the requirement on \|e_{0}\| is still \mathcal{O}(\epsilon). Note that in this case, the requirement on \|e_{N-1}\| is stricter than that on any other \|e_{t}\| for t\in\{1,\cdots,N-1\}, and by (180):

(181) eN1𝒪(ϵC1C2N2C3).\displaystyle\|e_{N-1}\|\sim\mathcal{O}\Big{(}\sqrt{\frac{\epsilon}{C_{1}C_{2}^{N-2}C_{3}}}\Big{)}.

Since we require NN to satisfy (164), the dependence of eN1\|e_{N-1}\| on ϵ\epsilon in (181) becomes

\|e_{N-1}\|\sim\mathcal{O}\left(\epsilon^{\frac{1}{2}+\frac{\log(C_{2})}{4\log(1/\|A^{*}_{K}\|_{*})}}\right)

with additional polynomial dependences on system parameters since

C_{2}^{N-2} \approx C_{2}^{\frac{\log(1/\epsilon)}{2\log(1/\|A^{*}_{K}\|_{*})}}
=(1/\epsilon)^{\frac{\log(C_{2})}{2\log(1/\|A^{*}_{K}\|_{*})}}.

As a result, it suffices to require the error bound for all t to be

\|e_{t}\|\sim\mathcal{O}\left(\min\left\{\epsilon,\;\epsilon^{\frac{1}{2}+\frac{\log(C_{2})}{4\log(1/\|A^{*}_{K}\|_{*})}}\right\}\right).

The difference between our requirement for the C_{2}>1 case and its counterpart in [20] is due to a calculation error in [20], which incorrectly neglects the impact of the exponent in C_{2}^{N-2}. Lastly, for \widetilde{K}_{0} to be stabilizing, it suffices to select a sufficiently small \epsilon such that the \epsilon-ball centered at the infinite-horizon LQR policy K^{*} lies entirely in the set of stabilizing policies. A crude bound that satisfies this requirement is

\epsilon<\frac{1-\|A-BK^{*}\|_{*}}{\|B\|}\;\Longrightarrow\;\|A-B\widetilde{K}_{0}\|_{*}<1.

This completes the proof.