An Efficient Off-Policy Reinforcement Learning Algorithm for the Continuous-Time LQR Problem

Victor G. Lopez and Matthias A. Müller. This work received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 948679). V. G. Lopez and M. A. Müller are with the Leibniz University Hannover, Institute of Automatic Control, 30167 Hannover, Germany. {lopez,mueller}@irt.uni-hannover.de
Abstract

In this paper, an off-policy reinforcement learning algorithm is designed to solve the continuous-time LQR problem using only input-state data measured from the system. Different from other algorithms in the literature, we propose the use of a specific persistently exciting input as the exploration signal during the data collection step. We then show that, using this persistently excited data, the solution of the matrix equation in our algorithm is guaranteed to exist and to be unique at every iteration. Convergence of the algorithm to the optimal control input is also proven. Moreover, we formulate the policy evaluation step as the solution of a Sylvester-transpose equation, which increases the efficiency of its solution. Finally, a method to determine a stabilizing policy to initialize the algorithm using only measured data is proposed.

I INTRODUCTION

Reinforcement learning (RL) is a set of iterative algorithms that allow a system to learn its optimal behavior as it interacts with its environment [1, 2]. In the context of linear optimal control, RL has been used in the last few decades to solve the linear quadratic regulator (LQR) problem in continuous time [3, 4, 5, 6, 7, 8] and in discrete time [9, 10, 11, 12]. For applications of RL procedures to nonlinear systems and other extensions, the reader is referred to the surveys [13, 14, 15] and the references therein.

In the continuous-time linear time-invariant (CT-LTI) case, several RL algorithms with attractive properties have been designed. Although the first proposed algorithms required at least partial knowledge of the system model (e.g., [3]), completely data-based methods are now well known [4, 5, 6, 7]. These data-based algorithms replace the need for model knowledge by measuring persistently excited data directly from the system. Most of these data-based methods are on-policy algorithms, meaning that they require the application (or simulation) of an exciting input to the system at every iteration, such that a new set of data can be collected. In contrast, the authors in [8] proposed a data-based off-policy RL algorithm. This method has the advantage of requiring data to be collected from the system only once; every iteration of the algorithm is then performed using the same batch of measurements.

The method in [8], as well as most on-policy methods, is formulated as the problem of determining the values of certain unknown matrices from a set of equations derived from the Bellman equation. Taking advantage of the properties of the Kronecker product, this problem is then expressed as a set of linear equations that can be easily solved. However, the Kronecker product formulation generates matrices of large size, and this procedure presents a high computational burden that increases rapidly with the system dimension.

Another important issue in the existing learning-based control literature is the selection of a proper persistently exciting (PE) input. In most of the above literature, heuristic approaches for persistence of excitation are employed, often designing exciting inputs by adding sinusoidal, exponential and/or random signals [14]. A different approach for persistence of excitation was studied in [16], where conditions for the design of a discrete-time PE input are formally established. It is shown in [16] that their definition of persistence of excitation provides data measurements that are so rich in information that every possible trajectory of a controllable discrete-time linear system can be expressed in terms of such data. This result is now known as Willems’ lemma, and has been successfully used in recent years in data-based analysis, estimation and control of discrete-time systems (see, e.g., the survey [17] and the references therein). In [6], it was proposed to use a PE signal as defined in [16] to excite a continuous-time system during a Q-learning procedure, which guarantees solvability of their policy evaluation step. However, the method in [6] is an on-policy algorithm and the authors require persistence of excitation of a signal composed of both the input and the state of the system. This contrasts with our objective of considering a PE signal in terms of the input only. Moreover, in [6] a high order of persistence of excitation is needed.

The contributions of this paper are as follows. We propose a novel data-based off-policy RL algorithm to solve the LQR problem for continuous-time systems. As in [8], we perform the policy evaluation and policy improvement steps simultaneously. Different from the existing algorithms, we formulate a Sylvester-transpose equation that can be efficiently solved using known methods [18, 19, 20]. This avoids the use of the Kronecker product and the ensuing large matrices in our computations. Moreover, we use the results in [21], where a continuous-time version of Willems’ lemma was proposed. This allows us to design a PE input that guarantees the solvability of the Sylvester-transpose equation in a data-based fashion. In our formulation, persistence of excitation depends only on the input of the system, and we require the use of a PE input of lower order compared to [6]. Finally, we propose a method to determine the required initial stabilizing policy for the proposed algorithm using only measured data. Different from [7], this method does not require the solution of linear matrix inequalities (LMIs).

In the following, Section II introduces the preliminary results that are used throughout the paper. The development of the proposed efficient RL algorithm and its theoretical analysis are shown in Section III. Section IV analyzes the computational efficiency of the proposed algorithm and presents a procedure to compute the initial stabilizing gain. In Section V, we illustrate the theoretical results with numerical examples, and Section VI concludes the paper.

II PRELIMINARIES

In this section, we present existing results from the literature that are relevant for the remainder of this paper.

II-A Matrix definitions for continuous-time data

Consider the integer $N \in \mathbb{N}$ and the positive scalar $T \in \mathbb{R}_{+}$. Let $\xi : [0, NT] \rightarrow \mathbb{R}^{\sigma}$, with $[0, NT] \subset \mathbb{R}$, denote a continuous-time signal of length $NT$. Using the trajectory $\xi$, we define the following matrix

$$\mathcal{H}_{T}(\xi(t)) := \begin{bmatrix} \xi(t) & \xi(t+T) & \cdots & \xi(t+(N-1)T) \end{bmatrix} \qquad (1)$$

for $0 \leq t \leq T$. Notice that (1) is a time-varying matrix defined on the interval $t \in [0, T]$.

Now, consider the following CT-LTI system

$$\dot{x}(t) = Ax(t) + Bu(t), \qquad (2)$$

where $x \in \mathbb{R}^{n}$ and $u \in \mathbb{R}^{m}$ are the state and input vectors of the system, respectively. The pair $(A, B)$ is assumed to be controllable throughout the paper.

Suppose that the input signal $u : [0, NT] \rightarrow \mathbb{R}^{m}$ is applied to (2), and the resulting state trajectory $x : [0, NT] \rightarrow \mathbb{R}^{n}$ is collected. From (2) and the definition in (1), we can write

$$\mathcal{H}_{T}(\dot{x}(t)) = A\mathcal{H}_{T}(x(t)) + B\mathcal{H}_{T}(u(t)).$$

Since it is unusual to have the state derivative $\dot{x}$ available as a measurement, integrate the expression above to obtain

$$\mathcal{H}_{T}(x(T)) - \mathcal{H}_{T}(x(0)) = A\int_{0}^{T}\mathcal{H}_{T}(x(\tau))\,d\tau + B\int_{0}^{T}\mathcal{H}_{T}(u(\tau))\,d\tau.$$

For convenience of notation, define the matrices

$$\tilde{X} = \mathcal{H}_{T}(x(T)) - \mathcal{H}_{T}(x(0)), \qquad (3)$$
$$X = \int_{0}^{T}\mathcal{H}_{T}(x(\tau))\,d\tau, \quad U = \int_{0}^{T}\mathcal{H}_{T}(u(\tau))\,d\tau.$$

Notice that the matrix $X$ (and similarly $U$) only requires the computation of integrals of the form $\int_{0}^{T}x(\tau + jT)\,d\tau$, $j = 0, \ldots, N-1$. This is simpler than the integrals computed in the existing RL literature [8, 6, 7].

By definition, the following expression holds

$$\tilde{X} = AX + BU. \qquad (4)$$
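Since only sampled data is available in practice, the integrals in (3) must be approximated numerically. The following minimal sketch illustrates one way to build $\tilde{X}$, $X$ and $U$ from a sampled trajectory (Python for illustration; the paper's experiments use Matlab). It assumes a uniform sampling step dt that divides $T$, with x_data and u_data holding the state and input samples on $[0, NT]$; all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def trapezoid(y, dt):
    """Trapezoidal rule along the first axis of y."""
    return dt * (0.5 * y[0] + y[1:-1].sum(axis=0) + 0.5 * y[-1])

def data_matrices(x_data, u_data, N, T, dt):
    """Approximate X_tilde, X, U as defined in (3) from sampled data.
    x_data: (N*s+1, n), u_data: (N*s+1, m), with s = T/dt samples per interval."""
    s = int(round(T / dt))                  # samples per interval of length T
    n, m = x_data.shape[1], u_data.shape[1]
    X_tilde = np.zeros((n, N))
    X = np.zeros((n, N))
    U = np.zeros((m, N))
    for j in range(N):
        X_tilde[:, j] = x_data[(j + 1) * s] - x_data[j * s]      # x((j+1)T) - x(jT)
        X[:, j] = trapezoid(x_data[j * s:(j + 1) * s + 1], dt)   # ∫ x(τ+jT) dτ
        U[:, j] = trapezoid(u_data[j * s:(j + 1) * s + 1], dt)   # ∫ u(τ+jT) dτ
    return X_tilde, X, U
```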

II-B Persistence of excitation for discrete-time systems

Define the integer constants $L, N \in \mathbb{N}$. The Hankel matrix of depth $L$ of a discrete-time sequence $\{\mu_{k}\}_{k=0}^{N-1} = \{\mu_{0}, \mu_{1}, \ldots, \mu_{N-1}\}$, $\mu_{k} \in \mathbb{R}^{m}$, is defined as

$$H_{L}(\mu) := \begin{bmatrix} \mu_{0} & \mu_{1} & \cdots & \mu_{N-L} \\ \mu_{1} & \mu_{2} & \cdots & \mu_{N-L+1} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{L-1} & \mu_{L} & \cdots & \mu_{N-1} \end{bmatrix}.$$

In [16], the following definition of a PE input for discrete-time systems is made.

Definition 1

The discrete sequence $\{\mu_{k}\}_{k=0}^{N-1}$, $\mu_{k} \in \mathbb{R}^{m}$, is said to be persistently exciting of order $L$ if its Hankel matrix of depth $L$ has full row rank, i.e.,

$$\text{rank}(H_{L}(\mu)) = mL. \qquad (5)$$

It is important to highlight that Definition 1 provides a condition that enables a straightforward design of a PE input and is easy to verify for any discrete sequence.

Remark 1

A necessary condition for (5) to hold is that $N \geq (m+1)L - 1$. This provides a minimum length for a PE input sequence.
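As an illustration, the rank condition (5) can be checked numerically. A minimal sketch follows (the helper names are ours, not from [16]):

```python
import numpy as np

def hankel_matrix(mu, L):
    """Depth-L Hankel matrix H_L(mu) of a sequence mu of shape (N, m)."""
    N, m = mu.shape
    cols = N - L + 1
    H = np.zeros((m * L, cols))
    for i in range(L):
        for j in range(cols):
            H[i * m:(i + 1) * m, j] = mu[i + j]
    return H

def is_pe_of_order(mu, L):
    """Check the rank condition (5) of Definition 1."""
    return np.linalg.matrix_rank(hankel_matrix(mu, L)) == mu.shape[1] * L

# A random sequence of the minimum length N = (m+1)L - 1 from Remark 1
# is PE of order L with probability one.
rng = np.random.default_rng(0)
m, L = 1, 4
mu = rng.standard_normal(((m + 1) * L - 1, m))
print(is_pe_of_order(mu, L))   # True (almost surely)
```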

II-C Persistence of excitation for continuous-time systems

It is shown in [21] that a piecewise constant input designed by exploiting Definition 1 is persistently exciting for the continuous-time system (2). This class of inputs is formally described in the following definition.

Definition 2 (Piecewise constant PE input)

Consider a time interval $T > 0$ such that

$$T \neq \frac{2\pi k}{|\mathcal{I}_{m}(\lambda_{i} - \lambda_{j})|}, \qquad \forall k \in \mathbb{Z}, \qquad (6)$$

where $\lambda_{i}$ and $\lambda_{j}$ are any two eigenvalues of matrix $A$ in (2), and $\mathcal{I}_{m}(\cdot)$ is the imaginary part of a complex number. A piecewise constant persistently exciting (PCPE) input of order $L$ for continuous-time systems is defined as $u(t + iT) = \mu_{i}$ for all $0 \leq t < T$, $i = 0, \ldots, N-1$, where $\{\mu_{i}\}_{i=0}^{N-1}$ is a sequence of constant vectors $\mu_{i} \in \mathbb{R}^{m}$ that is persistently exciting of order $L$ in the sense of Definition 1.

Remark 2

Notice that the condition (6) is not restrictive, even with no knowledge of the system model (2). This is because the values of $T$ that make this condition fail form a set of measure zero and are unlikely to be encountered in practice.
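A PCPE input is thus straightforward to implement: generate a PE sequence as in Definition 1 and hold each vector constant over one interval of length $T$. A minimal sketch, reusing the sequence mu from the sketch above:

```python
def pcpe_input(mu, T):
    """Piecewise constant PE input: u(t + iT) = mu_i for 0 <= t < T."""
    def u(t):
        i = min(int(t // T), len(mu) - 1)   # index of the current interval
        return mu[i]
    return u

# Example: with T = 0.2, u(0.35) returns mu_1.
u = pcpe_input(mu, T=0.2)
```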

When a PCPE input is applied to system (2), the obtained input-state data set satisfies an important rank condition, as shown below.

Lemma 1 ([21])

Consider system (2), let the pair $(A, B)$ be controllable, and let $u$ be a PCPE input of order $n+1$ as defined in Definition 2. Then, the rank condition

$$\text{rank}\left(\begin{bmatrix} \mathcal{H}_{T}(x(t)) \\ \mathcal{H}_{T}(u(t)) \end{bmatrix}\right) = \text{rank}\left(\begin{bmatrix} \mathcal{H}_{T}(x(t)) \\ H_{1}(\mu) \end{bmatrix}\right) = m + n \qquad (7)$$

holds for all $0 \leq t \leq T$.

Remark 3

In [21], the result in Lemma 1 was presented considering persistence of excitation of any order $L$. For simplicity of notation, we presented Lemma 1 directly for PE inputs of order $n+1$. This is the only order of persistence of excitation used throughout the paper.

II-D The LQR problem and Kleinman’s algorithm

For a CT-LTI system (2), the infinite-horizon LQR problem concerns determining the control input $u$ that minimizes a cost function of the form

$$J(x(0), u) := \int_{0}^{\infty}\left(x^{\top}(t)Qx(t) + u^{\top}(t)Ru(t)\right)dt, \qquad (8)$$

where $Q \succeq 0$ and $R \succ 0$. Throughout the paper, we assume that the pair $(A, Q^{1/2})$ is observable. This, together with the assumed controllability of $(A, B)$, implies that the optimal control input is given by $u^{*}(x) = -K^{*}x$, where

$$K^{*} = R^{-1}B^{\top}P^{*}$$

and the matrix $P^{*} \succ 0$ solves the algebraic Riccati equation

$$Q + P^{*}A + A^{\top}P^{*} - P^{*}BR^{-1}B^{\top}P^{*} = 0.$$

In [22], Kleinman proposed a model-based iterative algorithm to solve the LQR problem. This algorithm starts by selecting an initial stabilizing matrix $K_{0}$, i.e., a matrix such that $A - BK_{0}$ is Hurwitz stable. At every iteration $i$, the Lyapunov equation

$$P_{i}(A - BK_{i}) + (A - BK_{i})^{\top}P_{i} + Q + K_{i}^{\top}RK_{i} = 0 \qquad (9)$$

is solved for $P_{i}$. Then, a new feedback matrix is defined as

$$K_{i+1} = R^{-1}B^{\top}P_{i}. \qquad (10)$$

The algorithm iterates the equations (9) and (10) until convergence. Its main drawback is that it is a model-based method; otherwise, Kleinman's algorithm possesses highly attractive features. Namely, at each iteration the matrix $K_{i+1}$ is stabilizing, the algorithm converges such that

$$\lim_{i\rightarrow\infty}K_{i+1} = K^{*},$$

and convergence occurs at a quadratic rate [22].
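For reference, a minimal model-based sketch of the iteration (9)-(10) using SciPy's Lyapunov solver is given below; the system matrices here are ours, chosen only so that $K_{0} = 0$ is stabilizing, and the result can be checked against the Riccati solution.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def kleinman(A, B, Q, R, K0, iters=10):
    K = K0
    for _ in range(iters):
        Acl = A - B @ K
        # Solve the Lyapunov equation (9): P Acl + Acl^T P + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        K = np.linalg.solve(R, B.T @ P)     # policy improvement (10)
    return K, P

A = np.array([[0.0, 1.0], [-2.0, -3.0]])    # Hurwitz, so K0 = 0 is stabilizing
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), 2.0 * np.eye(1)
K, P = kleinman(A, B, Q, R, np.zeros((1, 2)))
print(np.allclose(P, solve_continuous_are(A, B, Q, R)))   # True
```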

The following section presents the main developments of this paper.

III AN EFFICIENT DATA-BASED ALGORITHM FOR THE CT LQR PROBLEM

In this section, we present an efficient data-based off-policy RL algorithm to determine the optimal controller that minimizes (8). We show that the proposed procedure is equivalent to Kleinman's algorithm (9)-(10), and therefore preserves all of its theoretical properties. For clarity of exposition, we first introduce a model-based algorithm that is then used as the basis of our data-based method.

III-A A model-based algorithm

Combining (9) and (10), we readily obtain the following expressions

$$P_{i}A - K_{i+1}^{\top}RK_{i} + A^{\top}P_{i} - K_{i}^{\top}RK_{i+1} + Q + K_{i}^{\top}RK_{i} = 0$$

and $B^{\top}P_{i} - RK_{i+1} = 0$. Therefore, the matrix equation

$$\begin{bmatrix} A & B \\ -RK_{i} & -R \end{bmatrix}^{\top} \begin{bmatrix} P_{i} \\ K_{i+1} \end{bmatrix} \begin{bmatrix} I_{n} & 0 \end{bmatrix} + \begin{bmatrix} I_{n} \\ 0 \end{bmatrix} \begin{bmatrix} P_{i} \\ K_{i+1} \end{bmatrix}^{\top} \begin{bmatrix} A & B \\ -RK_{i} & -R \end{bmatrix} + \begin{bmatrix} Q + K_{i}^{\top}RK_{i} & 0 \\ 0 & 0 \end{bmatrix} = 0 \qquad (11)$$

holds, where $I_{n}$ is an $n \times n$ identity matrix and $0$ represents a matrix of zeros with appropriate dimensions.

Denoting the fixed matrices as

$$\Phi_{i} := \begin{bmatrix} A & B \\ -RK_{i} & -R \end{bmatrix}, \quad E := \begin{bmatrix} I_{n} & 0 \end{bmatrix}, \quad \bar{Q}_{i} := \begin{bmatrix} Q + K_{i}^{\top}RK_{i} & 0 \\ 0 & 0 \end{bmatrix} \qquad (18)$$

and the unknown matrix as

$$\Theta_{i+1} := \begin{bmatrix} P_{i} \\ K_{i+1} \end{bmatrix}, \qquad (19)$$

we can write (11) in the compact form

$$\Phi_{i}^{\top}\Theta_{i+1}E + E^{\top}\Theta_{i+1}^{\top}\Phi_{i} + \bar{Q}_{i} = 0. \qquad (20)$$

The matrix $\Theta_{i+1} \in \mathbb{R}^{(n+m)\times n}$ consists of the unknown matrices in Kleinman's algorithm, $P_{i}$ and $K_{i+1}$. Our interest is to design a method in which solving a matrix equation as in (20) for $\Theta_{i+1}$ corresponds to solving both (9) and (10) simultaneously. However, note that (20), as formulated, in general does not have a unique solution $\Theta_{i+1}$. To address this issue, first express the unknown submatrices of $\Theta_{i+1}$ as

$$\Theta_{i+1} = \begin{bmatrix} \Theta_{i+1}^{1} \\ \Theta_{i+1}^{2} \end{bmatrix}, \qquad (21)$$

with $\Theta_{i+1}^{1} \in \mathbb{R}^{n\times n}$ and $\Theta_{i+1}^{2} \in \mathbb{R}^{m\times n}$. In the following lemma, we show that there exists only one matrix $\Theta_{i+1}$ that solves (20) such that the submatrix $\Theta_{i+1}^{1}$ is symmetric.

Lemma 2

Consider the equation (20) with the matrices $\Phi_{i}$, $E$ and $\bar{Q}_{i}$ defined as in (18). Moreover, let the matrix $K_{i}$ be stabilizing. Then, there exists a unique solution (21) to this equation for which $\Theta_{i+1}^{1} = (\Theta_{i+1}^{1})^{\top}$.

Proof:

Considering the partition in (21), notice that (20) holds for any matrix $\Theta_{i+1}$ such that

$$A^{\top}\Theta_{i+1}^{1} - K_{i}^{\top}R\Theta_{i+1}^{2} + (\Theta_{i+1}^{1})^{\top}A - (\Theta_{i+1}^{2})^{\top}RK_{i} + Q + K_{i}^{\top}RK_{i} = 0,$$

and

$$B^{\top}\Theta_{i+1}^{1} - R\Theta_{i+1}^{2} = 0.$$

From the second equation it is clear that $\Theta_{i+1}^{2} = R^{-1}B^{\top}\Theta_{i+1}^{1}$. Substituting this and the fact that $\Theta_{i+1}^{1} = (\Theta_{i+1}^{1})^{\top}$ in the first equation, we get

$$(A - BK_{i})^{\top}\Theta_{i+1}^{1} + \Theta_{i+1}^{1}(A - BK_{i}) + Q + K_{i}^{\top}RK_{i} = 0. \qquad (22)$$

Since $K_{i}$ is stabilizing, we use Lyapunov arguments to conclude that $\Theta_{i+1}^{1}$ (and therefore also $\Theta_{i+1}^{2}$) is unique. ∎

Lemma 2 implies that constraining the solution of (20) to include a symmetric submatrix $\Theta_{i+1}^{1}$ leads to the desired solution (19). The following lemma shows that we achieve this by properly modifying $\Phi_{i}$ in (18).

Lemma 3

Consider the matrix equation

$$(\Phi_{i}^{-})^{\top}\Theta_{i+1}E + E^{\top}\Theta_{i+1}^{\top}\Phi_{i}^{+} + \bar{Q}_{i} = 0, \qquad (23)$$

where

$$\Phi_{i}^{+} := \begin{bmatrix} A + I & B \\ -RK_{i} & -R \end{bmatrix}, \quad \Phi_{i}^{-} := \begin{bmatrix} A - I & B \\ -RK_{i} & -R \end{bmatrix}, \qquad (24)$$

and the matrices $E$ and $\bar{Q}_{i}$ are defined as in (18). Moreover, let the matrix $K_{i}$ be stabilizing. Then, the solution (21) of (23) is unique, and $\Theta_{i+1}^{1} = (\Theta_{i+1}^{1})^{\top}$. Furthermore, the solution of (23) is also a solution of (20).

Proof:

First, define the matrix

$$S = \begin{bmatrix} \Theta_{i+1}^{1} - (\Theta_{i+1}^{1})^{\top} & 0 \\ 0 & 0 \end{bmatrix}.$$

Using this definition, it is straightforward to express (23) in terms of the matrix Φi\Phi_{i} in (18) as

$$\Phi_{i}^{\top}\Theta_{i+1}E + E^{\top}\Theta_{i+1}^{\top}\Phi_{i} + \bar{Q}_{i} = S.$$

Notice that the left-hand side of this expression is symmetric, and therefore so must be $S$. Now, $S$ is symmetric if and only if $\Theta_{i+1}^{1} = (\Theta_{i+1}^{1})^{\top}$, that is, $S = 0$. This implies both that the solution of (23) also solves (20) and, by Lemma 2, that this solution is unique. ∎

Remark 4

Equation (23) is a case of the generalized Sylvester-transpose equation, and algorithms to solve it efficiently are well known [18, 19, 20].
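For small dimensions, (23) can also be solved by plain vectorization, which is useful for checking implementations. The sketch below takes this route; note that it forms exactly the kind of large Kronecker matrices that the dedicated solvers in [18, 19, 20] (and this paper's efficiency claims) avoid, so it is for illustration only. It is written for the general form $M^{\top}\Theta D + D^{\top}\Theta^{\top}N + \bar{Q} = 0$, which covers (23) with $M = \Phi_{i}^{-}$, $N = \Phi_{i}^{+}$, $D = E$, as well as the data-based equation (31) later on.

```python
import numpy as np

def commutation(p, n):
    """K with vec(M^T) = K @ vec(M) for M of shape (p, n), column-major vec."""
    K = np.zeros((p * n, p * n))
    for i in range(p):
        for j in range(n):
            K[j + n * i, i + p * j] = 1.0
    return K

def solve_sylvester_transpose(M, N, D, Q_bar):
    """Solve M^T Theta D + D^T Theta^T N + Q_bar = 0 for Theta (p x n).
    M, N have shape (p, c); D has shape (n, c); Q_bar has shape (c, c)."""
    p = M.shape[0]
    n = D.shape[0]
    A1 = np.kron(D.T, M.T)                       # vec(M^T Theta D)
    A2 = np.kron(N.T, D.T) @ commutation(p, n)   # vec(D^T Theta^T N)
    theta, *_ = np.linalg.lstsq(A1 + A2, -Q_bar.flatten('F'), rcond=None)
    return theta.reshape((p, n), order='F')
```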

Using this result, we formulate Algorithm 1 below. As in any policy iteration procedure, Algorithm 1 is initialized with a stabilizing matrix $K_{0}$. Using this matrix (as well as model knowledge), (23) is solved for $\Theta_{i+1}$. Then, partitioning $\Theta_{i+1}$ as in (21), a new feedback matrix is obtained as $K_{i+1} = \Theta_{i+1}^{2}$.

 Algorithm 1: Model-based RL algorithm  

1: procedure
2:    Let $i = 0$ and initialize a stabilizing feedback matrix $K_{0}$.
3:    Using the definitions in (18) and (24), solve for $\Theta_{i+1}$ from the equation
$$(\Phi_{i}^{-})^{\top}\Theta_{i+1}E + E^{\top}\Theta_{i+1}^{\top}\Phi_{i}^{+} + \bar{Q}_{i} = 0.$$
4:    Partitioning $\Theta_{i+1}$ as in (21), define
$$K_{i+1} = \Theta_{i+1}^{2}.$$
5:    If $\|K_{i+1} - K_{i}\| > \varepsilon$ for some $\varepsilon > 0$, let $i = i + 1$ and go to Step 3. Otherwise, stop.
6: end procedure

Using the results obtained so far, we conclude that Algorithm 1 is equivalent to Kleinman’s algorithm in the sense that, starting from the same initial matrix K0K_{0}, they provide the same updated policies Ki+1K_{i+1} at every iteration. This implies that Algorithm 1 preserves all the properties of Kleinman’s algorithm. In the following, we use this result to design a data-based algorithm.

III-B The data-based algorithm

To avoid the need for model knowledge in Algorithm 1, we collect persistently excited data from the system (2) as described in Section II-C. Using this data, we define the constant matrices $X$, $U$ and $\tilde{X}$ as in (3).

Lemma 1 showed that the collected data set satisfies the rank condition (7). In the following lemma, we extend this result to the matrices $X$ and $U$.

Lemma 4

Consider system (2), let the pair $(A, B)$ be controllable, and let $u$ be a PCPE input of order $n+1$ as defined in Definition 2. Using the resulting input-state data, define the matrices $X$ and $U$ as in (3). Then,

$$\text{rank}\left(\begin{bmatrix} X \\ U \end{bmatrix}\right) = n + m. \qquad (25)$$
Proof:

Notice that, since the applied input is piecewise constant, an expression for the resulting state of (2) is

$$x(t + iT) = e^{At}x(iT) + \int_{0}^{t}e^{A\tau}\,d\tau\, B\mu_{i},$$

for $i = 0, \ldots, N-1$ and $0 \leq t \leq T$. Thus, we can write

$$\begin{bmatrix} X \\ U \end{bmatrix} = \int_{0}^{T}\begin{bmatrix} \mathcal{H}_{T}(x(\tau)) \\ H_{1}(\mu) \end{bmatrix}d\tau = \underbrace{\int_{0}^{T}\begin{bmatrix} e^{A\tau} & \int_{0}^{\tau}e^{As}\,ds\,B \\ 0 & I \end{bmatrix}d\tau}_{W}\begin{bmatrix} \mathcal{H}_{T}(x(0)) \\ H_{1}(\mu) \end{bmatrix}.$$

Notice that $W$ is nonsingular since the condition (6) holds (the fact that $\int_{0}^{T}e^{A\tau}\,d\tau$ is nonsingular follows from the fact that $T$ corresponds to a non-pathological sampling time [23]). Moreover, by Lemma 1 the second matrix on the right-hand side has full row rank, completing the proof. ∎

Define $Z = [X^{\top} \quad U^{\top}]^{\top}$. Since $Z$ has full row rank by Lemma 4, we can select $n+m$ linearly independent columns from it. Let $z_{k}$ represent the $k$th column of $Z$, and let $\eta = \{k_{1}, \ldots, k_{n+m}\}$ be a set of indices such that

$$Z_{\eta} := \begin{bmatrix} z_{k_{1}} & \cdots & z_{k_{n+m}} \end{bmatrix} \qquad (26)$$

is a nonsingular matrix. Then, $\Theta_{i+1}$ is a solution of (23) if and only if it is a solution of

$$Z_{\eta}^{\top}(\Phi_{i}^{-})^{\top}\Theta_{i+1}EZ_{\eta} + Z_{\eta}^{\top}E^{\top}\Theta_{i+1}^{\top}\Phi_{i}^{+}Z_{\eta} + Z_{\eta}^{\top}\bar{Q}_{i}Z_{\eta} = 0. \qquad (27)$$

From the definitions in (18) and (24), and using the expression (4), we have the following

$$\Phi_{i}^{+}Z_{\eta} = \begin{bmatrix} AX_{\eta} + X_{\eta} + BU_{\eta} \\ -RK_{i}X_{\eta} - RU_{\eta} \end{bmatrix} = \begin{bmatrix} \tilde{X}_{\eta} + X_{\eta} \\ -RK_{i}X_{\eta} - RU_{\eta} \end{bmatrix},$$
$$\Phi_{i}^{-}Z_{\eta} = \begin{bmatrix} AX_{\eta} - X_{\eta} + BU_{\eta} \\ -RK_{i}X_{\eta} - RU_{\eta} \end{bmatrix} = \begin{bmatrix} \tilde{X}_{\eta} - X_{\eta} \\ -RK_{i}X_{\eta} - RU_{\eta} \end{bmatrix},$$

$Z_{\eta}^{\top}\bar{Q}_{i}Z_{\eta} = X_{\eta}^{\top}(Q + K_{i}^{\top}RK_{i})X_{\eta}$ and $EZ_{\eta} = X_{\eta}$, where the subindex $\eta$ represents a matrix constructed using the columns specified by the set $\eta$ from the corresponding original matrix. Substituting in (27), we obtain

$$(Y_{i}^{-})^{\top}\Theta_{i+1}X_{\eta} + X_{\eta}^{\top}\Theta_{i+1}^{\top}Y_{i}^{+} + X_{\eta}^{\top}(Q + K_{i}^{\top}RK_{i})X_{\eta} = 0, \qquad (28)$$

where

$$Y_{i}^{-} := \begin{bmatrix} \tilde{X}_{\eta} - X_{\eta} \\ -RK_{i}X_{\eta} - RU_{\eta} \end{bmatrix}, \quad Y_{i}^{+} := \begin{bmatrix} \tilde{X}_{\eta} + X_{\eta} \\ -RK_{i}X_{\eta} - RU_{\eta} \end{bmatrix}. \qquad (29)$$

Now, (28) is a data-based equation that does not require any knowledge about the system model. Algorithm 2 uses this expression to solve the LQR problem. For convenience, for Algorithm 2 we define

$$Q_{i} := X_{\eta}^{\top}(Q + K_{i}^{\top}RK_{i})X_{\eta}. \qquad (30)$$

 Algorithm 2: Data-based RL algorithm  

1: procedure
2:    Select $N \geq (n+1)m + n$ and $T > 0$, apply a PCPE input of order $n+1$ to (2) and collect an $NT$-long input-state trajectory.
3:    Compute the matrices $X$, $U$, and $\tilde{X}$ as in (3).
4:    Select a set of indices $\eta = \{k_{1}, \ldots, k_{n+m}\}$ such that $[X_{\eta}^{\top} \quad U_{\eta}^{\top}]^{\top}$ is nonsingular.
5:    Let $i = 0$ and initialize a stabilizing feedback matrix $K_{0}$.
6:    Define the matrices $Y_{i}^{+}$, $Y_{i}^{-}$ and $Q_{i}$ as in (29)-(30), and solve for $\Theta_{i+1}$ from the equation
$$(Y_{i}^{-})^{\top}\Theta_{i+1}X_{\eta} + X_{\eta}^{\top}\Theta_{i+1}^{\top}Y_{i}^{+} + Q_{i} = 0. \qquad (31)$$
7:    Partitioning $\Theta_{i+1}$ as in (21), define
$$K_{i+1} = \Theta_{i+1}^{2}.$$
8:    If $\|K_{i+1} - K_{i}\| > \varepsilon$ for some $\varepsilon > 0$, let $i = i + 1$ and go to Step 6. Otherwise, stop.
9: end procedure

The following theorem states the main properties of this algorithm.

Theorem 1

Consider the CT-LTI system (2), and the partitioning (21) of $\Theta_{i+1}$. Every iteration of Algorithm 2 has the following properties: ($i$) the solution $\Theta_{i+1}$ of (31) exists and is unique; ($ii$) the gain $K_{i+1}$ is stabilizing; and ($iii$) $\Theta_{i}^{1} \succeq \Theta_{i+1}^{1} \succeq P^{*}$. Moreover,

$$\lim_{i\rightarrow\infty}K_{i} = K^{*}$$

and the rate of convergence of the algorithm is quadratic.

Proof:

The proof is obtained by showing that Algorithm 2 is equivalent to Kleinman's algorithm at every iteration. First, notice that by Lemma 4, the matrix $[X^{\top} \quad U^{\top}]^{\top}$ has full row rank and, therefore, a nonsingular matrix $[X_{\eta}^{\top} \quad U_{\eta}^{\top}]^{\top}$ can always be constructed. This means that (31) is equivalent to (23). Now, noting that $K_{0}$ is stabilizing, use an induction argument to assume that $K_{i}$ is stabilizing. Lemma 3 shows the existence and uniqueness of $\Theta_{i+1}$ from (23). Moreover, the expression (22) in the proof of Lemma 2 shows that $\Theta_{i+1}^{1} = P_{i}$, where $P_{i}$ is the solution of the Lyapunov equation (9). Also in the proof of Lemma 2 it was shown that $\Theta_{i+1}^{2} = R^{-1}B^{\top}\Theta_{i+1}^{1}$, which now corresponds to Kleinman's updated gain (10). Therefore, Algorithm 2 is equivalent to Kleinman's algorithm and shares all of its properties [22]. ∎

Algorithm 2 is a purely data-based, off-policy method to solve the continuous-time LQR problem. Using Definition 2, we are able to guarantee the existence of a solution $\Theta_{i+1}$ of (31) at every iteration for data trajectories of fixed length. This contrasts with the methods in the literature that must keep collecting data until a matrix gets full rank, such as, e.g., [7, 8]. Moreover, we avoid the use of the Kronecker product and its resulting large matrices in Algorithm 2. As stated in Remark 4, methods to efficiently solve a Sylvester-transpose equation as in (31) are well known.
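A minimal end-to-end sketch of the iteration in Algorithm 2 is given below, reusing data_matrices and solve_sylvester_transpose from the earlier sketches; for simplicity it works with the full data matrices instead of $Z_{\eta}$ (see Remark 5 below), and all names are illustrative.

```python
import numpy as np

def algorithm2(X_tilde, X, U, Q, R, K0, iters=10):
    n = X.shape[0]
    K = K0
    for _ in range(iters):
        lower = -R @ K @ X - R @ U                 # common lower block in (29)
        Y_minus = np.vstack([X_tilde - X, lower])
        Y_plus = np.vstack([X_tilde + X, lower])
        Q_i = X.T @ (Q + K.T @ R @ K) @ X          # as in (30)
        Theta = solve_sylvester_transpose(Y_minus, Y_plus, X, Q_i)   # (31)
        K = Theta[n:, :]                           # K_{i+1} = Theta^2
    return K, Theta[:n, :]                         # gain and estimate of P*
```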

Remark 5

Step 4 of Algorithm 2 instructs to select $n+m$ linearly independent columns of $[X^{\top} \quad U^{\top}]^{\top}$. This step is performed for the benefit of efficiency, as it decreases the size of the matrices in (31). However, since $[X^{\top} \quad U^{\top}]^{\top}$ has full row rank, skipping this step in Algorithm 2 and using the complete data matrices instead does not affect the result at each iteration.
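If the selection step is performed, one simple way to choose $\eta$ from data is QR factorization with column pivoting, sketched below (an alternative to the Gaussian elimination mentioned in Section IV-A). Since the pivoting greedily picks well-conditioned columns, it also helps with the conditioning issue of $Z_{\eta}$ discussed in Section V.

```python
import numpy as np
from scipy.linalg import qr

def select_columns(X, U):
    """Pick n+m linearly independent columns of Z = [X; U] (Step 4)."""
    Z = np.vstack([X, U])
    _, _, perm = qr(Z, pivoting=True)      # greedy, well-conditioned pivots
    eta = np.sort(perm[:Z.shape[0]])       # indices k_1 < ... < k_{n+m}
    return X[:, eta], U[:, eta], eta
```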

IV PRACTICAL CONSIDERATIONS

IV-A Efficiency analysis of Algorithm 2

In this subsection, we analyze the theoretical computational complexity of Algorithm 2. Moreover, we compare this complexity with that of the algorithm proposed in [8]. This is because [8] is also an off-policy data-based method that shares many of the characteristics of Algorithm 2.

The most expensive steps in Algorithm 2 are obtaining the solution of (31) and selecting $n+m$ linearly independent vectors from $[X^{\top} \quad U^{\top}]^{\top}$. Methods to solve the Sylvester-transpose equation (31) with a complexity of $\mathcal{O}((n+m)^{3})$ are known [19]. The selection of linearly independent vectors can be performed using a simple procedure like Gaussian elimination to transform the matrix of interest into row echelon form. This method has a complexity of $\mathcal{O}((n+m)^{2}N)$ operations [24]. This step, however, only needs to be performed once in Algorithm 2 (in Step 4). Thus, we conclude that Algorithm 2 requires $\mathcal{O}((n+m)^{2}N)$ floating point operations once, and then $\mathcal{O}((n+m)^{3})$ floating point operations per iteration.

The algorithm in [8] was also shown to be equivalent to Kleinman's algorithm at every iteration. However, their method uses a Kronecker product formulation that yields matrices of large dimensions. Let $N_{\otimes}$ be the amount of data samples used in [8]. Then, the most expensive step at each iteration of their algorithm is the product of a matrix with dimensions $(\frac{1}{2}n(n+1) + mn) \times N_{\otimes}$ times its transpose. This product, and hence each iteration of the algorithm, requires $\mathcal{O}((\frac{1}{2}n(n+1) + mn)^{2}N_{\otimes})$ floating point operations [25]. Clearly, as the dimension of the system increases, the difference in performance of both algorithms becomes more significant. Moreover, we notice from [8] that the amount of collected data must satisfy $N_{\otimes} \geq \frac{1}{2}n(n+1) + mn$ for the algorithm to yield a unique solution at every iteration. Compare this with the bound $N \geq (n+1)m + n$ in Algorithm 2. In Section V, we test this theoretical comparison using numerical examples.

IV-B An initial stabilizing policy

In [26, Remark 2], a procedure to design a stabilizing controller for continuous-time systems using only measured data was described. This method is based on the solution of a linear matrix inequality (LMI). The authors in [7] proposed to use a similar LMI-based procedure to determine the initial stabilizing gain for a Q-learning algorithm. Since one of the goals in this paper is computational efficiency, we would like to avoid the computationally expensive step of solving an LMI. In this subsection, we present an alternative method to determine the initial stabilizing matrix K0K_{0} for Algorithm 2. The following development follows closely a procedure proposed in [27, Section IV] for discrete-time systems.

Let $F$ be the Moore-Penrose pseudoinverse of the matrix $X$ in (3). Since $X$ has full row rank (see Lemma 4), $F$ is a right inverse of $X$. Furthermore, let $G$ be a basis for the null space of $X$, such that $X(F - G\bar{K}) = I$ for any matrix $\bar{K}$ of appropriate dimensions. Using the matrices $F$, $G$ and $U$ from (3), we propose to compute the initial stabilizing gain $K_{0}$ for Algorithm 2 as

$$K_{0} = -U(F - G\bar{K}), \qquad (32)$$

where $\bar{K}$ is a matrix to be determined.

From (4) and (32), notice that

$$\tilde{X}(F - G\bar{K}) = \begin{bmatrix} A & B \end{bmatrix}\begin{bmatrix} X \\ U \end{bmatrix}(F - G\bar{K}) = \begin{bmatrix} A & B \end{bmatrix}\begin{bmatrix} I \\ -K_{0} \end{bmatrix} = A - BK_{0}.$$

Therefore, by designing the poles of the matrix $\tilde{X}(F - G\bar{K})$, we also set the poles of $A - BK_{0}$. Since $(A, B)$ is controllable and hence the poles of $A - BK_{0}$ can be assigned arbitrarily, also the poles of $\tilde{X}(F - G\bar{K})$ can be placed arbitrarily by a suitable choice of $\bar{K}$. Moreover, since $\tilde{X}$, $F$ and $G$ are matrices obtained from data, we can operate with them without any need of model knowledge. This procedure is summarized in the following theorem. The proof of this theorem is straightforward considering the procedure described in this subsection and is hence omitted.

Theorem 2

Let the matrices $\tilde{X}$, $X$ and $U$ be defined as in (3) using data collected from (2) during the application of a PCPE input of order $n+1$. Define $F$ as the Moore-Penrose pseudoinverse of $X$ and $G$ as a basis for the null space of $X$. Moreover, define the virtual system matrices $\bar{A} = \tilde{X}F$ and $\bar{B} = \tilde{X}G$. Using pole-placement methods, determine a matrix $\bar{K}$ such that $\bar{A} - \bar{B}\bar{K}$ is Hurwitz. Then, the matrix $K_{0}$ defined by (32) is stabilizing for system (2).

Remark 6

Notice that the matrices $\bar{A} = \tilde{X}F$ and $\bar{B} = \tilde{X}G$ in Theorem 2 do not correspond to the actual system matrices $A$ and $B$. In fact, $B$ and $\bar{B}$ in general do not have the same dimensions. No model identification is performed in the proposed procedure.
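A minimal sketch of the construction in Theorem 2 is given below. One assumption of this sketch: instead of an explicit pole-placement routine, the virtual closed loop is stabilized with an LQR gain for the pair $(\bar{A}, \bar{B})$, which is just one convenient way to make $\bar{A} - \bar{B}\bar{K}$ Hurwitz; any stabilizing $\bar{K}$ serves.

```python
import numpy as np
from scipy.linalg import null_space, solve_continuous_are

def initial_gain(X_tilde, X, U):
    F = np.linalg.pinv(X)            # right inverse of X (full row rank)
    G = null_space(X)                # basis of the null space of X, so XG = 0
    A_bar = X_tilde @ F              # virtual system matrices of Theorem 2
    B_bar = X_tilde @ G
    n, q = A_bar.shape[0], B_bar.shape[1]
    # One way to obtain a stabilizing K_bar: LQR on the virtual pair
    P = solve_continuous_are(A_bar, B_bar, np.eye(n), np.eye(q))
    K_bar = B_bar.T @ P              # A_bar - B_bar @ K_bar is Hurwitz
    return -U @ (F - G @ K_bar)      # K0 as in (32)
```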

V NUMERICAL EXPERIMENTS

In this section, we compare in simulation the efficiency of the proposed Algorithm 2 with that of the algorithm presented in [8]. As described above, these algorithms have the same characteristics: they are data-based off-policy methods that are equivalent to Kleinman’s algorithm at every iteration.

To compare the efficiency of both algorithms, several simulations are performed for different, randomly generated linear systems (2). In particular, 100 different linear systems are generated using the command rss in Matlab, and both algorithms are applied to each of them. The system dimensions considered for each set of 100 experiments are $n = 2, 3, 5$ and $7$. In every case, we consider single-input systems ($m = 1$), and we define the cost function (8) with $Q = I$ and $R = 2$.

Each implementation of Algorithm 2 had the following characteristics. A PCPE input as in Definition 2 was used to collect data from the system. A sample of data was collected every $10^{-4}$ time units. We considered a time interval of $T = 0.2$, and we collected data for a total of $NT$ time units, with $N = (n+1)m + n$. The method described in [18] was used to solve the Sylvester-transpose equation (31) at every iteration.

For the implementation of the Kronecker product-based method in [8], we followed the same simulation characteristics described in the simulation section of that paper. The only exception is in the amount of data collected, which was reduced for small system dimensions in order to make a fairer comparison.

Finally, notice that the command rss in Matlab yields stable systems. Thus, an initial stabilizing matrix of $K_{0} = 0$ was used for all experiments and both algorithms. The simulations were performed using Matlab R2020b on an Intel i7-10875H (2.30 GHz) with 16 GB of memory.

The results of our simulations are displayed in Table I. In this table, we refer to Algorithm 2, which is based on the solution of a Sylvester-transpose equation, as ‘SYL’. The algorithm in [8], which is based on the use of the Kronecker product, is denoted as ‘KRO’. To compare the computational efficiency of the methods, we present the average time it takes each algorithm to complete 10 iterations. Due to their quadratic rate of convergence, 10 iterations yield a very accurate approximation of the optimal control gain for both algorithms. The table confirms our theoretical analysis regarding the improved performance of Algorithm 2.

Dimension $n$ | SYL (sec) | KRO (sec)
2 | 0.0099 | 0.0953
3 | 0.0144 | 0.1719
5 | 0.0287 | 0.4317
7 | 0.1046 | 1.8021

TABLE I: Run-time comparison between Algorithm 2 (SYL) and the algorithm in [8] (KRO). Times are the average over 100 random systems to complete 10 iterations.

During the execution of these experiments, we noted some issues in the performance of both methods when applied to systems of large dimensions. First, Algorithm 2 requires the application of a solver from the literature to solve (31). We found that, if the data matrix ZηZ_{\eta} in Algorithm 2 has a large condition number, the solvers considered often failed to provide the correct result. To address this problem, methods to construct a matrix ZηZ_{\eta} with low condition number from a larger matrix ZZ could be considered. Regarding the algorithm in [8], determining a proper input in order to satisfy the required persistence of excitation condition for the collected data (compare the discussion in the Introduction) becomes ever more difficult as the dimension of the system increases. In this case, it is uncertain how to solve this issue.

VI CONCLUSIONS

In this paper, a computationally efficient algorithm was proposed to solve the continuous-time LQR problem. The proposed algorithm is equivalent to Kleinman's method, does not require any knowledge of the system model, and requires collecting data from the system only once. We presented a persistently exciting input that guarantees that the matrix equation (31) in our algorithm has a unique solution at every iteration. Finally, we showed a method to determine an initial stabilizing feedback matrix using only measured data and without solving LMIs. Simulation results show that our algorithm significantly outperforms an algorithm with similar properties from the literature.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   Cambridge, MA: MIT Press, 2018.
  • [2] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed.   Belmont, MA: Athena Scientific, 2005, vol. I.
  • [3] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477–484, 2009.
  • [4] K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
  • [5] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanovic, “Random search for learning the linear quadratic regulator,” in 2020 American Control Conference (ACC), 2020, pp. 4798–4803.
  • [6] J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, no. 11, pp. 2850–2859, 2012.
  • [7] C. Possieri and M. Sassano, “Q-learning for continuous-time linear systems: A data-driven implementation of the Kleinman algorithm,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 10, pp. 6487–6497, 2022.
  • [8] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, pp. 2699–2704, 2012.
  • [9] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proceedings of 1994 American Control Conference, vol. 3, 1994, pp. 3475–3479.
  • [10] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 1–10.
  • [11] V. G. Lopez, M. Alsalti, and M. A. Müller, “Efficient off-policy Q-learning for data-based discrete-time LQR problems,” IEEE Transactions on Automatic Control, pp. 1–12, 2023.
  • [12] Y. Yang, B. Kiumarsi, H. Modares, and C. Xu, “Model-free $\lambda$-policy iteration for discrete-time linear quadratic regulation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 2, pp. 635–649, 2023.
  • [13] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042–2062, 2018.
  • [14] Z.-P. Jiang, T. Bian, and W. Gao, “Learning-based control: A tutorial and some recent results,” Foundations and Trends in Systems and Control, vol. 8, no. 3, pp. 176–284, 2020. [Online]. Available: http://dx.doi.org/10.1561/2600000023
  • [15] D. Liu, S. Xue, B. Zhao, B. Luo, and Q. Wei, “Adaptive dynamic programming for control: A survey and recent advances,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 1, pp. 142–160, 2021.
  • [16] J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, “A note on persistency of excitation,” Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005.
  • [17] I. Markovsky and F. Dörfler, “Behavioral systems theory in data-driven analysis, signal processing, and control,” Annual Reviews in Control, vol. 52, pp. 42–64, 2021.
  • [18] M. Hajarian, “Extending the CGLS algorithm for least squares solutions of the generalized Sylvester-transpose matrix equations,” Journal of the Franklin Institute, vol. 353, no. 5, pp. 1168–1185, 2016, special Issue on Matrix Equations with Application to Control Theory.
  • [19] K. Tansri, S. Choomklang, and P. Chansangiam, “Conjugate gradient algorithm for consistent generalized Sylvester-transpose matrix equations,” AIMS Mathematics, vol. 7, no. 4, pp. 5386–5407, 2022.
  • [20] C. Song, G. Chen, and L. Zhao, “Iterative solutions to coupled Sylvester-transpose matrix equations,” Applied Mathematical Modelling, vol. 35, no. 10, pp. 4675–4683, 2011.
  • [21] V. G. Lopez and M. A. Müller, “On a continuous-time version of Willems’ lemma,” in 2022 IEEE 61st Conference on Decision and Control (CDC), 2022, pp. 2759–2764.
  • [22] D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Transactions on Automatic Control, vol. 13, no. 1, pp. 114–115, 1968.
  • [23] C.-T. Chen, Linear System Theory and Design, 3rd ed.   New York, USA: Oxford University Press, 1999.
  • [24] G. Strang, Linear Algebra and its Applications, 4th ed.   Belmont, USA: Thomson Brooks/Cole, 2006.
  • [25] R. Hunger, “Floating point operations in matrix-vector calculus,” Tech. Rep., 2007. [Online]. Available: https://mediatum.ub.tum.de/doc/625604/625604
  • [26] C. De Persis and P. Tesi, “Formulas for data-driven control: Stabilization, optimality, and robustness,” IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 909–924, 2020.
  • [27] H. J. van Waarde, J. Eising, H. L. Trentelman, and M. K. Camlibel, “Data informativity: A new perspective on data-driven analysis and control,” IEEE Transactions on Automatic Control, vol. 65, no. 11, pp. 4753–4768, 2020.