
Toward Reliable Designs of Data-Driven Reinforcement Learning Tracking Control for Euler-Lagrange Systems

Zhikai Yao, Jennie Si, Ruofan Wu, and Jianyong Yao

This work is supported in part by National Science Foundation #1563921 and #1808752. (Corresponding authors: Jennie Si and Jianyong Yao.) Z. Yao is with the School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu Province, 210094 CN, and also with the Department of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ, 85281, USA (e-mail: zacyao.cn@gmail.com). J. Si and R. Wu are with the Department of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ, 85281, USA (e-mail: si@asu.edu; ruofanwu@asu.edu). J. Yao is with the School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu Province, 210094 CN (e-mail: jerryyao.buaa@gmail.com).
Abstract

This paper addresses reinforcement learning based, direct signal tracking control with the objective of developing mathematically suitable and practically useful design approaches. Specifically, we aim to provide reliable, easy-to-implement designs that lead to reproducible neural network-based solutions. Our proposed design takes advantage of two control design frameworks: a reinforcement learning based, data-driven approach to provide the needed adaptation and (sub)optimality, and a backstepping based approach to provide a closed-loop system stability framework. We develop this work based on the established direct heuristic dynamic programming (dHDP) learning paradigm to perform online learning and adaptation, and on a backstepping design for a class of important nonlinear dynamics described as Euler-Lagrange systems. We provide a theoretical guarantee for the stability of the overall dynamic system, the weight convergence of the approximating nonlinear neural networks, and the Bellman (sub)optimality of the resulting control policy. We use simulations to demonstrate significantly improved design performance of the proposed approach over the original dHDP.

Index Terms:
Reinforcement learning, tracking control, direct heuristic dynamic programming (dHDP), backstepping.

I Introduction

We consider the problem of data-driven optimal tracking control for Euler-Lagrange systems, which represent a wide range of application problems. The evolution of such systems can be described by solutions to the Euler-Lagrange equation, which in classical mechanics is equivalent to Newton’s laws of motion. Such systems also have the advantage of being expressed in generalized coordinates, which simplifies obtaining the equations of motion. Euler-Lagrange mechanisms can be found in many familiar systems, such as marine navigation equipment, automatic machine tools, satellite-tracking antennas, remote control airplanes, automatic navigation systems on boats and planes, autofocus cameras, computer hard disk drives, and more. In the modern computer and control era, such mechanisms still underlie robotic manipulators, wearable robots, ground vehicles, and many more applications in mechanical, electrical, and electromechanical systems.

Tracking control has been studied extensively in a control-theoretic context where the tracking control designs are based on well-defined mathematical models of the nonlinear dynamics. Well-established approaches include backstepping control [1], observer-based control [2], and nonlinear adaptive/robust control [3, 4]. These important results have provided the foundation for nonlinear tracking control; yet their applicability may be limited, especially when the nonlinear dynamics are difficult or impossible to model precisely. As such, data-driven nonlinear tracking control designs are sought after. Common and natural approaches use machine learning or neural network techniques, as they can learn directly from data by means of the universal approximation property.

Most of the existing data-driven control results focus on stabilization of nonlinear dynamic systems [5]. As is well known, real engineering applications require consideration of optimal control performance, not just stability, to account for factors such as energy consumption, tracking error rate, and more. Therefore, optimal tracking control solutions for nonlinear systems are sought after. A classical formulation of the problem is to obtain nonlinear optimal tracking control solutions by solving the Hamilton-Jacobi-Bellman (HJB) equation. But solving the HJB equation poses great challenges for general nonlinear systems. One such challenge is the lack of a closed-form analytical solution even when a mathematical description of the nonlinear dynamics is available. Additionally, traditional approaches to solving the HJB equation are backward in time and therefore can only be carried out offline. Data-driven nonlinear optimal tracking control designs promise to address these challenges, yet they face new obstacles.

Currently, only a handful of results concern the theory, algorithm design, and implementation of data-driven optimal tracking control. Central to these results [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] is the use of reinforcement learning to establish an approximate solution to the HJB equation. Yet, few results have demonstrated that these methods are not only mathematically suitable but also practically useful for addressing real engineering problems, in the sense of providing reliable, easy-to-implement designs that lead to reproducible neural network-based solutions. In the design of reinforcement learning based optimal tracking for a coal gasification problem, the authors of [12] first applied offline neural network identification to establish the necessary mathematical descriptions of the nonlinear system dynamics and the desired tracking trajectory before carrying out the control design. As such, it is questionable whether approaches based on similar ideas can be useful for other applications, as obtaining reproducible models is the first barrier to overcome. This is not a trivial problem: it requires great expertise, and the subject is still under investigation because current neural network modeling results usually introduce large variances that depend on the designer and on the hyperparameters used in learning the models. The authors of [24] proposed a tracking control solution based on dHDP [25] by making use of an additional neural network to provide an internal goal signal. This is theoretically suitable, but it complicates the problem as discussed previously. The resulting approximation errors come on top of the approximation variances already introduced by dHDP tracking solutions, as pointed out in [26]. As such, the reproducibility of the approach has yet to be demonstrated systematically.

Most existing reinforcement learning based tracking control designs rely on a reference model from which a (continuously differentiable) desired tracking signal x_{\rm{d}} can be obtained [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. While this may be feasible and useful for certain applications such as flight validation [27], the issue of constructing an appropriate reference model for general nonlinear system control has not been thoroughly addressed. In fact, few published results are available, either from a general theoretical perspective or from the perspective of specific applications. Even for a specific application, it is not considered an easy task [28]. It is therefore fair to say that choosing an appropriate reference model is quite challenging and may have been taken for granted. This could be the reason that most tracking-related work has focused on tracking control algorithm design or on improving convergence properties of tracking algorithms. As one can imagine, the problem can be exacerbated for large-scale, complex dynamic systems. Worse still, for some applications it is nearly impossible to capture the nonlinear dynamics accurately with a mathematical model. It is also worth mentioning that some reported results [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] require generating a corresponding reference control, which can also be challenging and makes some of these approaches less applicable. Another assumption commonly implied in existing learning-based tracking control is that the nonlinear dynamics are partially known (specifically, that the input dynamics are known). This simplifies the problem but, in the meantime, limits applicability.

To use a reference trajectory directly in place of a model reference structure, the backstepping idea can be employed, as it allows for the construction of both feedback control laws and associated Lyapunov functions in a systematic way [1]. The backstepping idea in this context was examined in [29, 30] with critic-only reinforcement learning control. However, those results are limited to partially known system dynamics. Alternatively, the dHDP construct also allows direct use of a reference trajectory in the design of tracking control. The idea was demonstrated via simulations in [10, 24]. The results are promising, yet they lack theoretical support in terms of (sub)optimality and stability analysis. Even though the tracking control results in [10] were also obtained with dHDP, our current approach is fundamentally different: we propose a new control strategy and use an informative tracking error based stage cost instead of a binary cost. Specifically, the current study is motivated by the need to circumvent some long-standing issues in reinforcement learning, including Q-learning and dHDP, namely to reduce the variances in the resulting action and/or critic networks after training and thus to improve the reproducibility and consistency of the trained actor/critic network outcomes, even when they are trained by different designers.

In this paper, we aim to develop a new nonlinear tracking control design framework with the goal of making the design approach feasible for applications. In a previous study [26], we showed that well-initialized actor-critic neural networks in dHDP can significantly improve the quality of the optimal value function approximation and the optimal control policy. From the same study we realized the importance of finding an initially good estimate of a (locally) optimal solution to the Bellman equation. In this study, we create a backstepping control strategy to provide a feedback system stability framework and a dHDP online adaptation control strategy to provide feed-forward compensation in order to obtain a near-optimal tracking control solution. In the feedback control performance objective, we take the feed-forward control input into account. As a result of this solution approach, we can provide an overall system stability guarantee and also avoid the use of a reference model for the desired tracking trajectory. As such, we avoid the challenge of creating a reference model and also remove one source of approximation error. Because the feed-forward control is learned by reinforcement learning, we address unknown nonlinear dynamics by learning from data; that is, our design is data-driven, not model-dependent.

The contributions of this work are as follows.

  • 1)

    We introduce a new reinforcement learning control design framework within a backstepping feedback construct. This allows us to avoid the need for a fully identified system model in backstepping-based control as well as the dependence on a reference model for the desired tracking trajectory in reinforcement learning based control. Additionally, the backstepping feedback component provides a guideline to narrow down the representation domain to be explored by the neural networks in dHDP and to increase the chance of reaching a good (sub)optimal solution.

  • 2)

    We provide a theoretical guarantee for the stability of the overall dynamic system, the weight convergence of the approximating nonlinear neural networks, and the Bellman (sub)optimality of the resulting control policy.

  • 3)

    We provide simulations to not only demonstrate how the proposed design method works but also to show how the proposed algorithm can significantly improve reproducibility of the results under dHDP integrated with backstepping feedback stabilizing control.

The rest of the paper is organized as follows. Section II provides the problem formulation. Section III presents the backstepping design. Section IV develops the reinforcement learning control. Section V provides theoretical analyses of the proposed algorithm. Simulation and comparison results are presented in Section VI and the concluding remarks are given in Section VII.

II Problem Formulation

We consider a class of nonlinear dynamics described as Euler-Lagrange systems that govern the motion of rigid structures:

M(q(t))q¨(t)+Vm(q(t),q˙(t))q˙(t)+G(q(t))+F(q˙(t))+τd(t)=τ(t).\displaystyle\begin{aligned} M(q(t))\ddot{q}(t)+&V_{m}(q(t),\dot{q}(t))\dot{q}(t)\\ +&G(q(t))+F(\dot{q}(t))+\tau_{d}(t)=\tau(t).\end{aligned} (1)

In Eq. (1), M(q(t))M(q(t)), Vm(q(t),q˙(t))V_{m}(q(t),\dot{q}(t)), G(q(t))G(q(t)), F(q˙(t))F(\dot{q}(t)) and τd(t)\tau_{d}(t) are unknown, q(t),q˙(t),q¨(t)nq(t),\dot{q}(t),\ddot{q}(t)\in\mathbb{R}^{n} denote the rigid link position, velocity, and acceleration vectors, respectively; M(q(t))n×nM(q(t))\in\mathbb{R}^{n\times n} denotes the inertia matrix; Vm(q(t),q˙(t))n×nV_{m}(q(t),\dot{q}(t))\in\mathbb{R}^{n\times n} the centripetal-coriolis matrix; G(q(t))RnG(q(t))\in\mathrm{R}^{n} the gravity vector, F(q˙(t))nF(\dot{q}(t))\in\mathbb{R}^{n} friction; τd(t)n\tau_{d}(t)\in\mathbb{R}^{n} a disturbance; and τ(t)n\tau(t)\in\mathbb{R}^{n} represents the torque control input. The subsequent development is based on the assumption that q(t)q(t) and q˙(t)\dot{q}(t) are measurable.

To carry out the development of tracking control for the dynamics in Eq. (1), we use the following general discrete-time state space representation, where x_{1}(k)\in\mathbb{R}^{n} and x_{2}(k)\in\mathbb{R}^{n} denote the link position and velocity, respectively. Applying the Euler discretization as in [31], the system described by Eq. (1) can be rewritten as

x1(k+1)=hx2(k)+x1(k)M+(k)x2(k+1)=u(k)g(k)τd(k),\displaystyle\begin{aligned} x_{1}(k+1)&=hx_{2}(k)+x_{1}(k)\\ M^{+}(k)x_{2}(k+1)&=u(k)-g(k)-\tau_{d}(k),\end{aligned} (2)

where h=tk+1tkh=t_{k+1}-t_{k}, u(k)=τ(tk)u(k)=\tau(t_{k}), g(k)=Vm(k)x2(k)+G(k)+F(k)M(k)x2(k)g(k)=V_{m}(k)x_{2}(k)+G(k)+F(k)-M^{-}(k)x_{2}(k), M+(k)M^{+}(k) denotes M(tk+1)/hM(t_{k+1})/h and M(k)M^{-}(k) denotes M(tk)/hM(t_{k})/h. The control objective is for the output x1(k)x_{1}(k) to track a desired time-varying trajectory x1d(k)x_{1\rm{d}}(k) as closely as possible.
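To make the discrete-time model in Eq. (2) concrete, the following is a minimal simulation sketch (in Python, an assumption of this illustration; the paper does not prescribe an implementation). The callables M, Vm, G, F and the disturbance are available only to the simulator that generates data, not to the controller, which is consistent with the data-driven setting.

```python
import numpy as np

def el_plant_step(x1, x2, u, h, M, Vm, G, F, tau_d):
    """One step of the Euler-discretized Euler-Lagrange dynamics, Eq. (2).

    x1, x2 : link position and velocity vectors at step k
    u      : torque input tau(t_k)
    M, Vm, G, F : callables returning the plant terms of Eq. (1); they are
                  used only to generate data and are unknown to the controller.
    """
    x1_next = x1 + h * x2
    # g(k) = Vm(k) x2(k) + G(k) + F(k) - (M(t_k)/h) x2(k)
    g = Vm(x1, x2) @ x2 + G(x1) + F(x2) - (M(x1) / h) @ x2
    # M(t_{k+1})/h x2(k+1) = u(k) - g(k) - tau_d(k)
    x2_next = np.linalg.solve(M(x1_next) / h, u - g - tau_d)
    return x1_next, x2_next
```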

Assumption 1. The inertia matrix M(q)M(q) is symmetric, positive definite, and the following inequality holds:

Mmin(M(k))T(M(k)),\displaystyle M_{min}\leq\|(M^{-}(k))^{T}(M^{-}(k))\|, (3)

where MminM_{min} is a given positive constant, and \|\cdot\| denotes the Euclidean norm.

Assumption 2. The nonlinear disturbance term τd(k)\tau_{d}(k) is bounded, i.e., τd(k)\tau_{d}(k)\in\mathcal{L}_{\infty}.

In the subsequent development, the desired trajectory x_{1\rm{d}}(k) is not defined by any reference model; rather, x_{1\rm{d}}(k) is simply and directly provided. Our proposed closed-loop, data-driven nonlinear tracking control solution is shown in Fig. 1. There are two major components in the design: the backstepping design provides a feedback control structure, and the feed-forward signal required by the backstepping framework is then provided by a dHDP online learning scheme. Even though the dHDP alone can theoretically be used for tracking control while preserving some qualitative properties such as those in [10], our current approach aims at improving the reliability of the design and the reproducibility of the results while maintaining theoretical suitability. Next, we provide a comprehensive introduction to the two constituent control design blocks.

III Backstepping To Provide Baseline Tracking Control

The control goal is to find u(k) (Fig. 1) so that the output x_{1}(k) of the system in Eq. (2) tracks the desired time-varying trajectory x_{1\mathrm{d}}(k). The discrete-time backstepping scheme for the tracking problem of Eq. (2) with unknown, nonlinear system dynamics is derived step by step as follows. Referring to Fig. 1, the two blocks play complementary roles in constructing the final control signal u(k). The backstepping control block is designed as follows.

Step 1. To develop the backstepping design, a virtual control function is synthesized. Let e1(k)ne_{1}(k)\in\mathbb{R}^{n} be the deviation of x1(k)x_{1}(k) from the target x1d(k)nx_{1\mathrm{d}}(k)\in\mathbb{R}^{n}, i.e.,

e1(k)=x1(k)x1d(k).\displaystyle e_{1}(k)=x_{1}(k)-x_{1\mathrm{d}}(k). (4)

From (2) and (4), we have

e1(k+1)=hx2(k)+x1(k)x1d(k+1).\displaystyle e_{1}(k+1)=hx_{2}(k)+x_{1}(k)-x_{1\mathrm{d}}(k+1). (5)

Then, we view x2(k)x_{2}(k) as a virtual control in Eq. (5) and introduce the error variable

e2(k)=x2(k)α(k),\displaystyle e_{2}(k)=x_{2}(k)-\alpha(k), (6)

where α(k)n\alpha(k)\in\mathbb{R}^{n} is a stabilizing function for x2(k)x_{2}(k) to be chosen as

α(k)=1h(c1e1(k)x1(k)+x1d(k+1)),\displaystyle\begin{aligned} \alpha(k)=\frac{1}{h}(c_{1}e_{1}(k)-x_{1}(k)+x_{1\mathrm{d}}(k+1)),\end{aligned} (7)

where c1c_{1} is a design constant. Then Eq. (5) becomes

e1(k+1)=c1e1(k)+he2(k).\displaystyle\begin{aligned} e_{1}(k+1)=c_{1}e_{1}(k)+he_{2}(k).\end{aligned} (8)

Step 2. The final control law is synthesized to drive e_{2}(k) towards zero or a small value. The dynamics of the error variable e_{2}(k), which serves as the forward signal, are written as

M+(k)e2(k+1)=M+(k)x2(k+1)M+(k)α(k+1)=u(k)g(k)M+(k)α(k+1)τd(k),\displaystyle\begin{aligned} M^{+}(k)e_{2}(k+1)&=M^{+}(k)x_{2}(k+1)-M^{+}(k)\alpha(k+1)\\ &=u(k)-g(k)-M^{+}(k)\alpha(k+1)-\tau_{d}(k),\end{aligned} (9)

where α(k+1)\alpha(k+1) is written as:

α(k+1)=1h(c1(c1e1(k)+he2(k))hx2(k)x1(k)+x1d(k+2)).\displaystyle\begin{aligned} \alpha(k+1)&=\frac{1}{h}(c_{1}(c_{1}e_{1}(k)+he_{2}(k))\\ &\quad-hx_{2}(k)-x_{1}(k)+x_{1\mathrm{d}}(k+2)).\end{aligned} (10)

The control input u(k)u(k) is selected as

u(k)=f^(k)+c2e2(k)\displaystyle u(k)=\hat{f}(k)+c_{2}e_{2}(k) (11)

where c2c_{2} is a design constant, f^(k)\hat{f}(k) is an estimate of the combined unknown dynamics f(k)f(k) by neural networks, and f(k)f(k) is written as

f(k)=g(k)+M+(k)α(k+1).\displaystyle f(k)=g(k)+M^{+}(k)\alpha(k+1). (12)

Substituting Eq. (11) and Eq. (12) into Eq. (9) yields

M+(k)e2(k+1)=c2e2(k)+f~(k)τd(k),\displaystyle M^{+}(k)e_{2}(k+1)=c_{2}e_{2}(k)+\tilde{f}(k)-\tau_{d}(k), (13)

where

f~(k)=f^(k)f(k).\displaystyle\tilde{f}(k)=\hat{f}(k)-f(k). (14)
Figure 1: Schematic diagram of the proposed tracking control framework.
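As an illustration only, the backstepping computations of Steps 1 and 2 can be organized as in the Python sketch below; the scalar/array handling and the availability of the desired trajectory one step ahead, x_{1d}(k+1), are assumptions consistent with Eq. (7).

```python
def backstepping_control(x1, x2, x1d, k, h, c1, c2, f_hat):
    """Backstepping tracking control of Section III (sketch).

    x1d   : desired-position sequence, with x1d[k+1] available one step ahead
    f_hat : current actor-network estimate of the lumped dynamics f(k)
    """
    e1 = x1 - x1d[k]                            # Eq. (4)
    alpha = (c1 * e1 - x1 + x1d[k + 1]) / h     # Eq. (7), virtual control
    e2 = x2 - alpha                             # Eq. (6)
    u = f_hat + c2 * e2                         # Eq. (11), final control law
    return u, e1, e2
```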

IV Reinforcement Learning To Provide Data-driven Feed-forward Input Control

The dHDP block provides the control input component that compensates for the dynamics left unaddressed due to the lack of a mathematical description of the system in Eq. (1). Here we use dHDP as a basic structural framework to approximate the cost-to-go function and the optimal control policy at the same time [25, 32]. The dHDP has been shown to be a feasible tool for solving complex and realistic problems, including stabilization, tracking, and reconfiguration control of Apache helicopters [33, 34, 35], damping low frequency oscillations in large power systems [36], and wearable robots with a human in the loop [37, 38, 39]. We therefore consider that the dHDP can provide the necessary online learning capability within the backstepping design and that, together, they provide a new learning control method that improves the reproducibility of results when applied to meaningful real applications.

IV-A Basic Formulation

The dHDP based reinforcement learning feed-forward control f^(k)\hat{f}(k) is as shown in Fig. 1. The stage cost r(k)r(k) is defined as

r(k)=e1T(k)Qe1(k)+uT(k)Ru(k),\displaystyle r(k)=e_{1}^{T}(k)Qe_{1}(k)+u^{T}(k)Ru(k), (15)

where Qn×nQ\in\mathbb{R}^{n\times n}, Rn×nR\in\mathbb{R}^{n\times n} are positive semi-definite matrices. Then, the cost-to-go J(k)J(k) is written as

J(k)=j=1γjr(k+j),\displaystyle J(k)=\sum_{j=1}^{\infty}\gamma^{j}r(k+j), (16)

where 0<γ<10<\gamma<1 is a discount factor for the infinite-horizon tracking problem. We require r(k)r(k) to be a semi-definite function of the output error e1(k)e_{1}(k) and control u(k)u(k), so the cost function is well-defined. Based on Eq. (16), we formulate the following Bellman equation:

J(k1)=γJ(k)+r(k).\displaystyle J(k-1)=\gamma J(k)+r(k). (17)

IV-B Actor-Critic Networks

The dHDP design follows that in [25] with an actor neural network and a critic neural network. Hyperbolic tangent is used as the transfer function in the actor-critic networks to approximate the control policy and the cost-to-go function.

IV-B1 Critic Neural Network

The critic neural network (CNN) consists of three layers of neurons, namely the input layer, the hidden layer, and the output layer. The input and output of the CNN are

xc(k)=[xa(k),u(k)]T,\displaystyle x_{c}(k)=[x_{a}(k),u(k)]^{T}, (18)
J^(k)=w^c2(k)ϕ(w^c1(k)xc(k))=w^c2ϕc(k),\displaystyle\hat{J}(k)=\hat{w}_{c2}(k)*\phi(\hat{w}_{c1}(k)*x_{c}(k))=\hat{w}_{c2}\phi_{c}(k), (19)

where

xa(k)=[x1(k),x2(k),e1(k),e2(k),x1d(k+2)]T,\displaystyle x_{a}(k)=[x_{1}(k),x_{2}(k),e_{1}(k),e_{2}(k),x_{\operatorname{1d}}(k+2)]^{T}, (20)

In the above, \hat{w}_{c1} and \hat{w}_{c2} are the estimated weight matrices between the input and hidden layers and between the hidden and output layers, respectively. \phi(\cdot) is the hyperbolic tangent activation function,

ϕ(v)=1exp(v)1+exp(v).\displaystyle\phi(v)=\frac{1-\exp(-v)}{1+\exp(-v)}. (21)

From Eq. (17), the prediction error ec(k)e_{c}(k) is formulated as

ec(k)=γJ^(k)[J^(k1)r(k)].\displaystyle e_{c}(k)=\gamma\hat{J}(k)-\left[\hat{J}(k-1)-r(k)\right]. (22)

The weights of CNN are updated to minimize the following approximation cost

Ec(k)=12ecT(k)ec(k).\displaystyle E_{c}(k)={1\over 2}e_{c}^{T}(k)e_{c}(k). (23)

Gradient descent is used to adjust the critic weights. For the input-to-hidden layer,

Δw^c1(k)=lc[Ec(k)w^c1(k)]Ec(k)w^c1(k)=Ec(k)J^(k)J^(k)ϕc(k)ϕc(k)vc(k)vc(k)w^c1(k)=γec(k)w^c2(k)[12(1ϕc2(k))]xc(k),\displaystyle\begin{aligned} &\Delta\hat{w}_{c1}(k)=l_{c}\left[-\frac{\partial E_{c}(k)}{\partial\hat{w}_{c1}(k)}\right]\\ &\frac{\partial E_{c}(k)}{\partial\hat{w}_{c1}(k)}=\frac{\partial E_{c}(k)}{\partial\hat{J}(k)}\frac{\partial\hat{J}(k)}{\partial\phi_{c}(k)}\frac{\partial\phi_{c}(k)}{\partial v_{c}(k)}\frac{\partial v_{c}(k)}{\partial\hat{w}_{c1}(k)}\\ &\quad\quad\quad\quad=\gamma e_{c}(k)\hat{w}_{c2}(k)\left[\frac{1}{2}\left(1-\phi_{c}^{2}(k)\right)\right]x_{c}(k),\end{aligned} (24)

Similarly for the hidden-to-output layer,

Δw^c2(k)=lc[Ec(k)w^c2(k)]Ec(k)w^c2(k)=Ec(k)J^(k)J^(k)w^c2(k)=γec(k)ϕc(k).\displaystyle\begin{aligned} &\Delta\hat{w}_{c2}(k)=l_{c}\left[-\frac{\partial E_{c}(k)}{\partial\hat{w}_{c2}(k)}\right]\\ &\frac{\partial E_{c}(k)}{\partial\hat{w}_{c2}(k)}=\frac{\partial E_{c}(k)}{\partial\hat{J}(k)}\frac{\partial\hat{J}(k)}{\partial\hat{w}_{c2}(k)}=\gamma e_{c}(k)\phi_{c}(k).\end{aligned} (25)

In the above, lc>0l_{c}>0 is the learning rate.
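For concreteness, a minimal NumPy sketch of the critic forward pass (Eqs. (18)-(21)) and its gradient-descent update (Eqs. (22)-(25)) is given below; the layer size, the uniform weight initialization, and the bias-free structure are illustrative assumptions not specified in this form by the paper.

```python
import numpy as np

def tanh_act(v):                                  # activation of Eq. (21)
    return (1.0 - np.exp(-v)) / (1.0 + np.exp(-v))

class Critic:
    """Three-layer critic network of Section IV-B1 (sketch)."""
    def __init__(self, n_in, n_hidden, lc, gamma, seed=0):
        rng = np.random.default_rng(seed)
        self.wc1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))   # input-to-hidden
        self.wc2 = rng.uniform(-0.5, 0.5, n_hidden)           # hidden-to-output
        self.lc, self.gamma = lc, gamma

    def forward(self, xc):
        self.xc = xc
        self.phi_c = tanh_act(self.wc1 @ xc)      # hidden-layer output
        return self.wc2 @ self.phi_c              # J_hat(k), Eq. (19)

    def update(self, J_hat, J_hat_prev, r):
        e_c = self.gamma * J_hat - (J_hat_prev - r)            # Eq. (22)
        grad_wc2 = self.gamma * e_c * self.phi_c               # Eq. (25)
        grad_wc1 = np.outer(self.gamma * e_c * self.wc2 *
                            0.5 * (1.0 - self.phi_c ** 2), self.xc)  # Eq. (24)
        self.wc2 -= self.lc * grad_wc2
        self.wc1 -= self.lc * grad_wc1
        return e_c
```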

IV-B2 Action Neural Network

In this algorithm, the action neural network (ANN) is used to approximate the unknown dynamics f(k) in Eq. (12). The input to the ANN is x_{a}(k), and its output is given as follows:

f^(k)=w^a2(k)ϕ(w^a1(k)xa(k))=w^a2ϕa(k),\displaystyle\hat{f}(k)=\hat{w}_{a2}(k)*\phi(\hat{w}_{a1}(k)*x_{a}(k))=\hat{w}_{a2}\phi_{a}(k), (26)

where w^a1\hat{w}_{a1} and w^a2\hat{w}_{a2} are the estimated weight matrices.

The ANN weights are adjusted to minimize the following cost,

Ea(k)=12eaT(k)ea(k),\displaystyle E_{a}(k)={1\over 2}e_{a}^{T}(k)e_{a}(k), (27)

where

\displaystyle e_{a}(k)=\hat{J}(k)+\|\tilde{f}(k)\|-U_{c}, (28)

In the above, U_{c} is the ultimate performance objective in the tracking control design paradigm, which is set to U_{c}=0 under the current problem formulation, and \tilde{f}(k)=\hat{f}(k)-f(k) is the estimation error defined in Eq. (14). The desired tracking performance is achieved as \|\tilde{f}(k)\| approaches 0.

Remark 1. From Eq. (13), we have that

f~(k)=M+(k)e2(k+1)c2e2(k)+τd(k).\displaystyle\tilde{f}(k)=M^{+}(k)e_{2}(k+1)-c_{2}e_{2}(k)+\tau_{d}(k). (29)

To compute e_{a}(k) in Eq. (28), we use an initial estimate \hat{M}^{+} in place of M^{+}(k) in Eq. (29). In this error estimation process, the disturbance \tau_{d}(k) is taken to be zero, since the feed-forward controller aims at learning the unknown system dynamics.

The weight update rule is again based on gradient descent. For the input-to-hidden layer,

Δw^a1(k)=la[Ea(k)w^a1(k)]Ea(k)w^a1(k)=Ea(k)J^(k)J^(k)u(k)u(k)f^(k)f(k)ϕa(k)ϕa(k)va(k)va(k)w^a1(k)=ea(k)[w^c2(k)12(1ϕc2(k))w^cu(k)]×w^a2(k)12(1ϕa2(k))xa(k),\displaystyle\begin{aligned} &\Delta\hat{w}_{a1}(k)=l_{a}\left[-\frac{\partial E_{a}(k)}{\partial\hat{w}_{a1}(k)}\right]\\ &\frac{\partial E_{a}(k)}{\partial\hat{w}_{a1}(k)}=\frac{\partial E_{a}(k)}{\partial\hat{J}(k)}\frac{\partial\hat{J}(k)}{\partial u(k)}\frac{\partial u(k)}{\partial\hat{f}(k)}\frac{\partial f(k)}{\partial\phi_{a}(k)}\frac{\partial\phi_{a}(k)}{\partial v_{a}(k)}\frac{\partial v_{a}(k)}{\partial\hat{w}_{a1}(k)}\\ &\quad\quad\quad~{}~{}=e_{a}(k)\left[\hat{w}_{c2}(k)\frac{1}{2}\left(1-\phi_{c}^{2}(k)\right)\hat{w}_{cu}(k)\right]\\ &\quad\quad\quad\quad~{}~{}\times\hat{w}_{a2}(k)\frac{1}{2}\left(1-\phi_{a}^{2}(k)\right)x_{a}(k),\end{aligned} (30)

and for the hidden-to-output layer,

Δw^a2(k)=la[Ea(k)w^a2(k)]Ea(k)w^a2(k)=Ea(k)J^(k)J^(k)u(k)u(k)f^(k)f^(k)w^a2(k)=ea(k)[w^c2(k)12(1ϕc2(k))wcu(k)]ϕa(k),\displaystyle\begin{aligned} \Delta\hat{w}_{a2}(k)&=l_{a}\left[-\frac{\partial E_{a}(k)}{\partial\hat{w}_{a2}(k)}\right]\\ \frac{\partial E_{a}(k)}{\partial\hat{w}_{a2}(k)}&=\frac{\partial E_{a}(k)}{\partial\hat{J}(k)}\frac{\partial\hat{J}(k)}{\partial u(k)}\frac{\partial u(k)}{\partial\hat{f}(k)}\frac{\partial\hat{f}(k)}{\partial\hat{w}_{a2}(k)}\\ &=e_{a}(k)\left[\hat{w}_{c2}(k)\frac{1}{2}\left(1-\phi_{c}^{2}(k)\right)w_{cu}(k)\right]\phi_{a}(k),\end{aligned} (31)

where w_{cu}(k) denotes the weights associated with the input element coming from the ANN, i.e., the part of w_{c1} that connects with u(k), and l_{a}>0 is the learning rate.
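A matching sketch of the action network (Eq. (26)) and of its update through the critic (Eqs. (30)-(31)) follows. It assumes the scalar-control case of Example 1 and assumes that u(k) enters the critic as the last element of x_{c}(k), so that w_{cu} corresponds to the last column of w_{c1}; these indexing choices are illustrative, not prescribed by the paper. The actor error e_{a}(k) is formed as in Eq. (28) and Remark 1.

```python
import numpy as np
# tanh_act and the Critic class are as defined in the critic sketch above.

class Actor:
    """Action network of Section IV-B2 (sketch for scalar u(k))."""
    def __init__(self, n_in, n_hidden, la, seed=1):
        rng = np.random.default_rng(seed)
        self.wa1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))
        self.wa2 = rng.uniform(-0.5, 0.5, n_hidden)
        self.la = la

    def forward(self, xa):
        self.xa = xa
        self.phi_a = tanh_act(self.wa1 @ xa)
        return self.wa2 @ self.phi_a                 # f_hat(k), Eq. (26)

    def update(self, critic, e_a):
        # dJ_hat/du back-propagated through the critic; wcu is the part of
        # wc1 that connects with u(k), assumed here to be its last column.
        wcu = critic.wc1[:, -1]
        dJ_du = critic.wc2 @ (0.5 * (1.0 - critic.phi_c ** 2) * wcu)
        grad_wa2 = e_a * dJ_du * self.phi_a                           # Eq. (31)
        grad_wa1 = np.outer(e_a * dJ_du * self.wa2 *
                            0.5 * (1.0 - self.phi_a ** 2), self.xa)   # Eq. (30)
        self.wa2 -= self.la * grad_wa2
        self.wa1 -= self.la * grad_wa1
```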

Algorithm 1 summarizes the implementation procedure of the dHDP-based tracking control.

Algorithm 1. Direct signal tracking control based on dHDP
Specify desired trajectory x1dx_{1\rm{d}};
Initialization: wa(0)w_{a}(0), wc(0)w_{c}(0), x1(0)x_{1}(0), x2(0)x_{2}(0), M^+\hat{M}^{+};
Set hyperparameters : lal_{a}, lcl_{c}, c1c_{1}, c2c_{2};
Calculate virtual control α(0)\alpha(0) according to Eq. (7);
Calculate e2(0)e_{2}(0) according to Eq. (6);
Calculate r(0)r(0) according to Eq. (15);
    Backstepping design:
      Calculate f^(k)\hat{f}(k) according to Eq. (26);
      Calculate u(k)u(k) according to Eq. (11);
      Take control input u(k)u(k) into Eq. (2);
      Produce x(k+1)x(k+1), r(k+1)r(k+1) according to Eq. (2), (15);
    dHDP design:
      Obtain α(k+1)\alpha(k+1) by Eq. (7);
      Calculate e2(k+1)e_{2}(k+1) according to Eq. (6);
      Calculate J^(k)\hat{J}(k) according to Eq. (19);
      Calculate e_{c}(k) according to Eq. (22);
        wc(k+1)=wc(k)+Δwc(k)w_{c}(k+1)=w_{c}(k)+\Delta w_{c}(k);
      Calculate ea(k)e_{a}(k) (Remark 1) ;
        wa(k+1)=wa(k)+Δwa(k)w_{a}(k+1)=w_{a}(k)+\Delta w_{a}(k);
    Iterate the backstepping and dHDP design steps over time steps k until convergence.
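The steps of Algorithm 1 can be wired together as in the sketch below, reusing the plant, backstepping, Critic, and Actor sketches from the earlier sections; scalar states as in Example 1 are assumed, so Q and R reduce to scalars, and the function and variable names are illustrative.

```python
import numpy as np

def algorithm1_step(k, x1, x2, x1d, actor, critic, plant_step,
                    h, c1, c2, M_hat_plus, Q, R, J_prev):
    """One time step of Algorithm 1 (illustrative sketch, scalar case)."""
    e1 = x1 - x1d[k]                                       # Eq. (4)
    alpha = (c1 * e1 - x1 + x1d[k + 1]) / h                # Eq. (7)
    e2 = x2 - alpha                                        # Eq. (6)
    xa = np.array([x1, x2, e1, e2, x1d[k + 2]])            # Eq. (20)
    f_hat = actor.forward(xa)                              # Eq. (26)
    u = f_hat + c2 * e2                                    # Eq. (11)
    x1n, x2n = plant_step(x1, x2, u)                       # simulated Eq. (2)
    r = Q * e1 ** 2 + R * u ** 2                           # Eq. (15)
    J_hat = critic.forward(np.append(xa, u))               # Eqs. (18)-(19)
    critic.update(J_hat, J_prev, r)                        # Eqs. (22)-(25)
    # actor error per Eq. (28) and Remark 1 (tau_d taken as zero)
    alpha_n = (c1 * (x1n - x1d[k + 1]) - x1n + x1d[k + 2]) / h
    e2n = x2n - alpha_n
    f_tilde = M_hat_plus * e2n - c2 * e2                   # Eq. (29)
    e_a = J_hat + abs(f_tilde)                             # Eq. (28), Uc = 0
    actor.update(critic, e_a)                              # Eqs. (30)-(31)
    return x1n, x2n, J_hat
```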

V Lyapunov Stability Analysis

In this section, we provide a theoretical analysis for the stability of the overall dynamic system, weight convergence of the actor and critic neural networks, and the Bellman (sub)optimality of the control policy.

V-A Preliminaries

Let wcw_{c}^{*}, waw_{a}^{*} denote the optimal weights, that is,

wa=argminw^aJ^(k)+f¯(k)Uc,wc=argminw^cγJ^(k)+r(k)J^(k1).\displaystyle\begin{aligned} w_{a}^{*}=&\arg\min_{\hat{w}_{a}}\|\hat{J}(k)+\bar{f}(k)-U_{c}\|,\\ w_{c}^{*}=&\arg\min_{\hat{w}_{c}}\|\gamma\hat{J}(k)+r(k)-\hat{J}(k-1)\|.\end{aligned} (32)

Then, the optimal cost-to-go J(k)J^{*}(k) and unknown dynamics f(k)f(k) can be expressed as

J(k)=wc2ϕc(k)+ϵc(k),f(k)=wa2ϕa(k)+ϵa(k)\displaystyle J^{*}(k)=w_{c2}^{*}\phi_{c}(k)+\epsilon_{c}(k),\quad f(k)=w_{a2}^{*}\phi_{a}(k)+\epsilon_{a}(k) (33)

where \epsilon_{c}(k) and \epsilon_{a}(k) are the reconstruction errors of the critic and action neural networks, respectively.

Assumption 3. The optimal weights for the actor-critic networks exist and they are bounded by two positive constants wamw_{am} and wcmw_{cm}, respectively,

wawam,wcwcm.\displaystyle\left\|w_{a}^{*}\right\|\leq w_{am},\quad\left\|w_{c}^{*}\right\|\leq w_{cm}. (34)

Then, the weight estimation errors of the actor and critic neural networks are described respectively as

w~a(k):=w^a(k)wa,w~c(k):=w^c(k)wc.\displaystyle\tilde{w}_{a}(k):=\hat{w}_{a}(k)-w^{*}_{a},\quad\tilde{w}_{c}(k):=\hat{w}_{c}(k)-w^{*}_{c}. (35)

Remark 2. A weight parameter convergence result was obtained for the dHDP in [40] under the condition that the weights between the input and hidden layers remain unchanged during learning. The result was later extended to allow all the weights in the actor and critic networks to adapt during learning [41]. Another study [10] addressed tracking control using dHDP for a Brunovsky canonical system. Such a system may be mathematically interesting but is practically limiting. Additionally, the design in [10] requires reference models. In this study, we build on [40] and [41] to prove our new results on weight convergence for tracking control. Note that [40] and [41] address regulation control, not tracking control, and that both works lack a system stability result.

Lemma 1. Under Assumption 3, consider the weight vector of the hidden-to-output layer in CNN. Let

L1(k)=1lctr((w~c2(k))Tw~c2(k)).\displaystyle L_{1}(k)=\frac{1}{l_{c}}tr\left((\tilde{w}_{c2}(k))^{T}\tilde{w}_{c2}(k)\right). (36)

Then its first difference is given by

ΔL1(k)=γ2ζc(k)2(1γ2lcϕc(k)2)×γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2+γwc2ϕc(k)+r(k)w^c2(k1)ϕc(k1)2,\displaystyle\begin{aligned} \Delta L_{1}(k)&=-\gamma^{2}\left\|\zeta_{c}(k)\right\|^{2}-\left(1-\gamma^{2}l_{c}\|\phi_{c}(k)\right\|^{2})\\ &\times\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\\ &+\|\gamma w_{c2}^{*}\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2},\end{aligned} (37)

where ζc(k)=w~c2(k)ϕc(k)\zeta_{c}(k)=\tilde{w}_{c2}(k)\phi_{c}(k) is an approximation error of the critic output.

Proof of Lemma 1. The first difference of L1(k)L_{1}(k) can be written as

ΔL1(k)=1lctr[(w~c2(k+1))Tw~c2(k+1)(w~c2(k))Tw~c2(k)].\displaystyle\begin{aligned} \Delta L_{1}(k)&=\frac{1}{l_{c}}tr\left[(\tilde{w}_{c2}(k+1))^{T}\tilde{w}_{c2}(k+1)\right.\\ &\left.-(\tilde{w}_{c2}(k))^{T}\tilde{w}_{c2}(k)\right].\end{aligned} (38)

With the updating rule in Eq. (25), w~c2(k+1)\tilde{w}_{c2}(k+1) can be rewritten as

w~c2(k+1)=w^c2(k+1)wc2=w~c2(k)γlcϕc(k)[γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)]T.\displaystyle\begin{aligned} \tilde{w}_{c2}(k+1)=&\hat{w}_{c2}(k+1)-w_{c2}^{*}\\ =&\tilde{w}_{c2}(k)-\gamma l_{c}\phi_{c}(k)\left[\gamma\hat{w}_{c2}(k)\phi_{c}(k)\right.\\ &\left.+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right]^{T}.\end{aligned} (39)

Then, the first term in the brackets of Eq. (38) can be given as

tr[(w~c2(k+1))Tw~c2(k+1)]=(w~c2(k))Tw~c2(k)2γlcw~c2(k)ϕc(k)×[γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)]T+γ2lc2ϕc(k)2γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2.\displaystyle\begin{aligned} &tr\left[(\tilde{w}_{c2}(k+1))^{T}\tilde{w}_{c2}(k+1)\right]\\ =&(\tilde{w}_{c2}(k))^{T}\tilde{w}_{c2}(k)-2\gamma l_{c}\tilde{w}_{c2}(k)\phi_{c}(k)\\ &\times\left[\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right]^{T}\\ &+\gamma^{2}l_{c}^{2}\left\|\phi_{c}(k)\right\|^{2}\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)\\ &-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}.\end{aligned} (40)

As w~c2(k)ϕc(k)\tilde{w}_{c2}(k)\phi_{c}(k) is a scalar, by using Eq. (35), we can rewrite the middle term in the above formula as follows:

2γlcw~c2(k)ϕc(k)[γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)]=lc(γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)γw~c2(k)ϕc(k)2γw~c2(k)ϕc(k)2γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2)=lc(γwc2ϕc(k)+r(k)w^c2(k1)ϕc(k1)2γ2ζc(k)2γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2).\displaystyle\begin{aligned} -&2\gamma l_{c}\tilde{w}_{c2}(k)\phi_{c}(k)\left[\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)\right.\\ &\quad\left.-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right]\\ =&l_{c}\left(\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right.\\ &-\gamma\tilde{w}_{c2}(k)\phi_{c}(k)\left\|{}^{2}-\right\|\gamma\tilde{w}_{c2}(k)\phi_{c}(k)\|^{2}\\ &-\left.\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\right)\\ =&l_{c}\left(\|\gamma w_{c2}^{*}\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\right.\\ &-\gamma^{2}\left\|\zeta_{c}(k)\right\|^{2}-\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)\\ &\left.-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\right).\end{aligned} (41)

Substituting Eq. (40) and Eq. (41) into Eq. (38), we obtain Lemma 1. \hfill\blacksquare

Lemma 2. Under Assumption 3, consider the weight vector of the hidden-to-output layer in ANN. Let

L2(k)=1laβ1tr[(w~a2(k))Tw~a2(k)].\displaystyle L_{2}(k)=\frac{1}{l_{a}\beta_{1}}tr\left[\left(\tilde{w}_{a2}(k)\right)^{T}\tilde{w}_{a2}(k)\right]. (42)

Then its first difference is bounded by

ΔL2(k)1β1((1laϕa(k)2w^c2(k)C(k)2)×w^c2(k)ϕc(k)+d2+8ζc(k)2+8wc2ϕc(k)2+8ζa(k)2+4d2+w^c2(k)C(k)ζa(k)2),\displaystyle\begin{aligned} \Delta L_{2}(k)\leq&\frac{1}{\beta_{1}}\left(-(1-l_{a}\|\phi_{a}(k)\|^{2}\|\hat{w}_{c2}(k)C(k)\|^{2})\right.\\ &\left.\times\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}+8\|\zeta_{c}(k)\|^{2}\right.\\ &\left.+8\|w_{c2}^{*}\phi_{c}(k)\|^{2}+8\|\zeta_{a}(k)\|^{2}+4d^{2}\right.\\ &\left.+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\right),\end{aligned} (43)

where ζa(k)=w~a2(k)ϕa(k)\zeta_{a}(k)=\tilde{w}_{a2}(k)\phi_{a}(k) is an approximation error of the action network output; C(k)=12(1ϕc2(k))wcu(k)C(k)=\frac{1}{2}\left(1-\phi_{c}^{2}(k)\right)w_{cu}(k); β1>0\beta_{1}>0 is a weighting factor; and the lumped disturbance d=(M^+M+)e2(k+1)+ζa(k)τd(k)ϵa(k)d=\|(\hat{M}^{+}-M^{+})e_{2}(k+1)+\zeta_{a}(k)-\tau_{d}(k)-\epsilon_{a}(k)\|.

Proof of Lemma 2. The first difference of L2(k)L_{2}(k) can be written as

ΔL2(k)=1laβ1tr[(w~a2(k+1))Tw~a2(k+1)(w~a2(k))Tw~a2(k)].\displaystyle\begin{aligned} \Delta L_{2}(k)&=\frac{1}{l_{a}\beta_{1}}tr\left[(\tilde{w}_{a2}(k+1))^{T}\tilde{w}_{a2}(k+1)\right.\\ &\left.-(\tilde{w}_{a2}(k))^{T}\tilde{w}_{a2}(k)\right].\end{aligned} (44)

With the updating rule in Eq. (31), w~a2(k+1)\tilde{w}_{a2}(k+1) can be rewritten as

w~a2(k+1)=w^a2(k+1)wa2=w^a2(k)laϕa(k)w^c2(k)C(k)×[w^c2(k)ϕc(k)+d]Twa2=w~a2(k)laϕa(k)w^c2(k)C(k)×[w^c2(k)ϕc(k)+d]T.\displaystyle\begin{aligned} \tilde{w}_{a2}(k+1)=&\hat{w}_{a2}(k+1)-w_{a2}^{*}\\ =&\hat{w}_{a2}(k)-l_{a}\phi_{a}(k)\hat{w}_{c2}(k)C(k)\\ &\times[\hat{w}_{c2}(k)\phi_{c}(k)+d]^{T}-w_{a2}^{*}\\ =&\tilde{w}_{a2}(k)-l_{a}\phi_{a}(k)\hat{w}_{c2}(k)C(k)\\ &\times[\hat{w}_{c2}(k)\phi_{c}(k)+d]^{T}.\end{aligned} (45)

Based on this expression, it is easy to obtain that

tr[(w~a2(k+1))Tw~a2(k+1)]=(w~a2(k))Tw~a2(k)+la2ϕa(k)2w^c2(k)C(k)2×w^c2(k)ϕc(k)+d22law^c2(k)C(k)[w^c2(k)ϕc(k)+d]Tζa(k).\displaystyle\begin{aligned} &tr\left[\left(\tilde{w}_{a2}(k+1)\right)^{T}\tilde{w}_{a2}(k+1)\right]\\ =&(\tilde{w}_{a2}(k))^{T}\tilde{w}_{a2}(k)+l_{a}^{2}\|\phi_{a}(k)\|^{2}\|\hat{w}_{c2}(k)C(k)\|^{2}\\ &\times\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ &-2l_{a}\hat{w}_{c2}(k)C(k)[\hat{w}_{c2}(k)\phi_{c}(k)+d]^{T}\zeta_{a}(k).\end{aligned} (46)

Substituting Eq. (46) into Eq. (44), we have

ΔL2(k)=1β1(laϕa(k)2w^c2(k)C(k)2×w^c2(k)ϕc(k)+d22w^c2(k)C(k)[w^c2(k)ϕc(k)+d]Tζa(k)).\displaystyle\begin{aligned} \Delta L_{2}(k)=&\frac{1}{\beta_{1}}(l_{a}\|\phi_{a}(k)\|^{2}\|\hat{w}_{c2}(k)C(k)\|^{2}\\ &\times\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ &-2\hat{w}_{c2}(k)C(k)[\hat{w}_{c2}(k)\phi_{c}(k)+d]^{T}\zeta_{a}(k)).\end{aligned} (47)

The last term in Eq. (47) can be bounded as

2w^c2(k)C(k)[w^c2(k)ϕc(k)+d]Tζa(k)=w^c2(k)ϕc(k)+dw^c2(k)C(k)ζa(k)2+w^c2(k)C(k)ζa(k)2w^c2(k)ϕc(k)+d22w^c2(k)ϕc(k)+d2+w^c2(k)C(k)ζa(k)2w^c2(k)ϕc(k)+d24w^c2(k)ϕc(k)2+4d2+w^c2(k)C(k)ζa(k)2w^c2(k)ϕc(k)+d(k)24(w~c2(k)+wc2)ϕc(k)2+4d2+w^c2(k)C(k)ζa(k)2w^c2(k)ϕc(k)+d28ζc(k)2+8wc2ϕc(k)2+4d2+w^c2(k)C(k)ζa(k)2w^c2(k)ϕc(k)+d2.\displaystyle\begin{aligned} &-2\hat{w}_{c2}(k)C(k)[\hat{w}_{c2}(k)\phi_{c}(k)+d]^{T}\zeta_{a}(k)\\ =&\|\hat{w}_{c2}(k)\phi_{c}(k)+d-\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\\ &+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}-\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ \leq&2\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\\ &-\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ \leq&4\|\hat{w}_{c2}(k)\phi_{c}(k)\|^{2}+4d^{2}+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\\ &-\|\hat{w}_{c2}(k)\phi_{c}(k)+d(k)\|^{2}\\ \leq&4\|(\tilde{w}_{c2}(k)+w_{c2}^{*})\phi_{c}(k)\|^{2}+4d^{2}+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\\ &-\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ \leq&8\|\zeta_{c}(k)\|^{2}+8\|w_{c2}^{*}\phi_{c}(k)\|^{2}+4d^{2}+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\\ &-\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}.\end{aligned} (48)

We have thus obtained Lemma 2. \hfill\blacksquare

Lemma 3. Under Assumption 3, consider the weight vector of the input-to-hidden layer in CNN. Let

L3(k)=1lcβ2tr[(w~c1(k))Tw~c1(k)].\displaystyle L_{3}(k)=\frac{1}{l_{c}\beta_{2}}tr\left[(\tilde{w}_{c1}(k))^{T}\tilde{w}_{c1}(k)\right]. (49)

Then its first difference is bounded by

ΔL3(k)1β2(γ2lcγw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2A(k)2xc(k)2+γw~c1(k)xc(k)AT(k)2+γγw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2),\displaystyle\begin{aligned} \Delta L_{3}(k)\leq&\frac{1}{\beta_{2}}\left(\gamma^{2}l_{c}\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)\right.\\ &-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\|A(k)\|^{2}\|x_{c}(k)\|^{2}\\ &+\gamma\|\tilde{w}_{c1}(k)x_{c}(k)A^{T}(k)\|^{2}+\gamma\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)\\ &\left.+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\right),\end{aligned} (50)

where β2>0\beta_{2}>0 is a weighting factor and A(k)A(k) is a vector, with A(k)=A(k)= 12(1ϕc2(k))w^c2(k)\frac{1}{2}\left(1-\phi_{c}^{2}(k)\right)\hat{w}_{c2}(k).

Proof of Lemma 3. The first difference of L3(k)L_{3}(k) can be written as

ΔL3(k)=1lcβ2tr[(w~c1(k+1))Tw~c1(k+1)(w~c1(k))Tw~c1(k)].\displaystyle\begin{aligned} \Delta L_{3}(k)&=\frac{1}{l_{c}\beta_{2}}tr\left[(\tilde{w}_{c1}(k+1))^{T}\tilde{w}_{c1}(k+1)\right.\\ &\left.-(\tilde{w}_{c1}(k))^{T}\tilde{w}_{c1}(k)\right].\end{aligned} (51)

With the updating rules in Eq. (24), w^c1(k+1)\hat{w}_{c1}(k+1) can be written as

w^c1(k+1)=w^c1(k)γlc(γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1))TB(k),\displaystyle\begin{aligned} \hat{w}_{c1}(k+1)=&\hat{w}_{c1}(k)-\gamma l_{c}\left(\gamma\hat{w}_{c2}(k)\phi_{c}(k)\right.\\ &\left.+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right)^{T}B(k),\end{aligned} (52)

where B(k)=12(1ϕc2(k))w^c2(k)xc(k)B(k)=\frac{1}{2}\left(1-\phi_{c}^{2}(k)\right)\hat{w}_{c2}(k)x_{c}(k). Following the same approach as earlier, we can express w~c1(k+1)\tilde{w}_{c1}(k+1) by

w~c1(k+1)=w^c1(k+1)wc1=w~c1(k)γlc(γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1))TB(k).\displaystyle\begin{aligned} \tilde{w}_{c1}(k+1)=&\hat{w}_{c1}(k+1)-w_{c1}^{*}\\ =&\tilde{w}_{c1}(k)-\gamma l_{c}\left(\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)\right.\\ &\left.-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right)^{T}B(k).\end{aligned} (53)

To facilitate the development, the following notation is introduced:

BT(k)B(k)=xcT(k)AT(k)A(k)xc(k)=A(k)2xc(k)2.\displaystyle B^{T}(k)B(k)=x_{c}^{T}(k)A^{T}(k)A(k)x_{c}(k)=\|A(k)\|^{2}\|x_{c}(k)\|^{2}. (54)

Then, we obtain

tr[(w~c1(k+1))Tw~c1(k+1)]=(w~c1(k))Tw~c1(k)+γ2lc2γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2BT(k)B(k)2γlc(γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1))BT(k)w~c1(k).\displaystyle\begin{aligned} &tr\left[(\tilde{w}_{c1}(k+1))^{T}\tilde{w}_{c1}(k+1)\right]\\ =&(\tilde{w}_{c1}(k))^{T}\tilde{w}_{c1}(k)+\gamma^{2}l_{c}^{2}\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)\\ &+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}B^{T}(k)B(k)\\ &-2\gamma l_{c}\left(\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)\right.\\ &\left.-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right)B^{T}(k)\tilde{w}_{c1}(k).\end{aligned} (55)

By introducing the property of trace function,

tr(xc(k)AT(k)w~c1(k))=tr(w~c1(k)xc(k)AT(k)),\displaystyle tr\left(x_{c}(k)A^{T}(k)\tilde{w}_{c1}(k)\right)=tr\left(\tilde{w}_{c1}(k)x_{c}(k)A^{T}(k)\right), (56)

the last term in (55) can be expressed as

2γlc(γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1))×xc(k)AT(k)w~c1(k)=γlc(γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)w~c1(k)xc(k)AT(k)2w~c1(k)xc(k)AT(k)2γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2).\displaystyle\begin{aligned} &-2\gamma l_{c}\left(\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right)\\ &\times x_{c}(k)A^{T}(k)\tilde{w}_{c1}(k)\\ =&\gamma l_{c}\left(\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\right.\\ &-\tilde{w}_{c1}(k)x_{c}(k)A^{T}(k)\left\|{}^{2}-\right\|\tilde{w}_{c1}(k)x_{c}(k)A^{T}(k)\|^{2}\\ &\left.-\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\right).\end{aligned} (57)

Therefore, substituting Eq. (55) and Eq. (57) into Eq. (51), we have Lemma 3. \hfill\blacksquare

Lemma 4. Under Assumption 3, consider the weight vector of the input-to-hidden layer in ANN. Let

L4(k)=1laβ3tr[(w~a1(k))Tw~a1(k)],\displaystyle L_{4}(k)=\frac{1}{l_{a}\beta_{3}}tr\left[(\tilde{w}_{a1}(k))^{T}\tilde{w}_{a1}(k)\right], (58)

Then its first difference is bounded by

ΔL4(k)1β3(law^c2(k)ϕc(k)+d(k)2xa(k)2×w^c2(k)C(k)DT(k)2+w^c2(k)ϕc(k)+d2+w~a1(k)xa(k)2w^c2(k)C(k)DT(k)2),\displaystyle\begin{aligned} \Delta L_{4}(k)\leq&\frac{1}{\beta_{3}}(l_{a}\|\hat{w}_{c2}(k)\phi_{c}(k)+d(k)\|^{2}\|x_{a}(k)\|^{2}\\ &\times\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}+\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ &+\|\tilde{w}_{a1}(k)x_{a}(k)\|^{2}\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}),\end{aligned} (59)

where D(k)=12(1ϕa2(k))w^a2(k)D(k)=\frac{1}{2}\left(1-\phi_{a}^{2}(k)\right)\hat{w}_{a2}(k), and β3>0\beta_{3}>0 is a weighting factor.

Proof of Lemma 4. The first difference of L4(k)L_{4}(k) can be written as

ΔL4(k)=1laβ3tr[(w~a1(k+1))Tw~a1(k+1)(w~a1(k))Tw~a1(k)].\displaystyle\begin{aligned} \Delta L_{4}(k)&=\frac{1}{l_{a}\beta_{3}}tr\left[(\tilde{w}_{a1}(k+1))^{T}\tilde{w}_{a1}(k+1)\right.\\ &\quad\left.-(\tilde{w}_{a1}(k))^{T}\tilde{w}_{a1}(k)\right].\end{aligned} (60)

With the updating rule in Eq. (30), w~a1(k+1)\tilde{w}_{a1}(k+1) can be rewritten as

w~a1(k+1)=w^a1(k+1)wa1=w~a1(k)la[w^c2(k)ϕc(k)+d]×D(k)CT(k)(w^c2(k))TxaT(k).\displaystyle\begin{aligned} \tilde{w}_{a1}(k+1)=&\hat{w}_{a1}(k+1)-w_{a1}^{*}\\ =&\tilde{w}_{a1}(k)-l_{a}[\hat{w}_{c2}(k)\phi_{c}(k)+d]\\ &\times D(k)C^{T}(k)(\hat{w}_{c2}(k))^{T}x_{a}^{T}(k).\end{aligned} (61)

Let us consider

tr[(w~a1(k+1))Tw~a1(k+1)]=(w~a1(k))Tw~a1(k)+la2w^c2(k)ϕc(k)+d2×w^c2(k)C(k)DT(k)2xa(k)22law^c2(k)C(k)DT(k)(w^c2(k)ϕc(k)+d)Tw~a1(k)xa(k).\displaystyle\begin{aligned} &tr\left[(\tilde{w}_{a1}(k+1))^{T}\tilde{w}_{a1}(k+1)\right]\\ &=(\tilde{w}_{a1}(k))^{T}\tilde{w}_{a1}(k)+l_{a}^{2}\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}\\ &\quad\times\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}\|x_{a}(k)\|^{2}\\ &\quad-2l_{a}\hat{w}_{c2}(k)C(k)D^{T}(k)(\hat{w}_{c2}(k)\phi_{c}(k)+d)^{T}\tilde{w}_{a1}(k)x_{a}(k).\\ \end{aligned} (62)

Then, by using the property of trace function tr(XTY+YTX)=tr(XTY)+tr([XTY]T)=2tr(XTY)tr(X^{T}Y+Y^{T}X)=tr(X^{T}Y)+tr([X^{T}Y]^{T})=2tr(X^{T}Y) and tr(XY)=tr(YX)tr(XY)=tr(YX), the last term in Eq. (62) is bounded by

2law^c2(k)C(k)DT(k)(w^c2(k)ϕc(k)+d)Tw~a1(k)xa(k)la(w^c2(k)ϕc(k)+d2+w~a1(k)xa(k)2×w^c2(k)C(k)DT(k)2).\displaystyle\begin{aligned} &-2l_{a}\hat{w}_{c2}(k)C(k)D^{T}(k)(\hat{w}_{c2}(k)\phi_{c}(k)+d)^{T}\tilde{w}_{a1}(k)x_{a}(k)\\ \leq&l_{a}\left(\|\hat{w}_{c2}(k)\phi_{c}(k)+d\|^{2}+\|\tilde{w}_{a1}(k)x_{a}(k)\|^{2}\right.\\ &\left.\times\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}\right).\end{aligned} (63)

We have the statement of Lemma 4 by substituting Eq. (62) and Eq. (63) into Eq. (60). \hfill\blacksquare

V-B Stability, convergence, and (sub)optimality results

With Lemmas 1-4 in place, we are now in a position to provide results on closed-loop stability of the system, convergences of the neural networks in dHDP, and the Bellman (sub)optimality of the resulted control policy.

Definition 1. (Uniform Ultimate Boundedness of a discrete-time dynamical system [42, 43]) A dynamical system is said to be uniformly ultimately bounded with ultimate bound b>0 if, for any a>0 and k_{0}>0, there exists a positive number N=N(a,b), independent of k_{0}, such that \|\tilde{\xi}(k)\|\leq b for all k\geq N+k_{0} whenever \|\tilde{\xi}(k_{0})\|\leq a.

In the following, ξ~(k)\tilde{\xi}(k) can represent tracking errors e1(k)e_{1}(k), e2(k)e_{2}(k) or weight approximation errors w~a(k)\tilde{w}_{a}(k), w~c(k)\tilde{w}_{c}(k), which are related to the stable system or the weight convergence of the actor and critic neural networks, respectively.

Theorem 1. (Stability Result) Let Assumption 2 and Assumption 3 hold, and let the tracking errors e_{1}(k) and e_{2}(k) be defined as in Eq. (4) and Eq. (6), respectively. Then, for bounded initial errors, the tracking errors of the considered system are uniformly ultimately bounded if c_{1} in Eq. (8) and c_{2} in Eq. (11) satisfy the following condition,

c1<22,c2<122Mmin4h2\displaystyle\|c_{1}\|<\frac{\sqrt{2}}{2},\quad\|c_{2}\|<\frac{1}{2}\sqrt{2M_{min}-4h^{2}} (64)

where MminM_{min} is given in Eq. (3).

Proof of Theorem 1. We introduce a candidate Lyapunov function:

Ls(k)=e1T(k)e1(k)+(M(k)e2(k))T(M(k)e2(k)).\displaystyle L_{s}(k)=e_{1}^{T}(k)e_{1}(k)+(M^{-}(k)e_{2}(k))^{T}(M^{-}(k)e_{2}(k)). (65)

The first difference of Ls(k)L_{s}(k) is given as

ΔLs(k)=e1T(k+1)e1(k+1)+(M(k+1)e2(k+1))T(M(k+1)e2(k+1))e1T(k)e1(k)(M(k)e2(k))T(M(k)e2(k)),\displaystyle\begin{aligned} \Delta L_{s}(k)=&e_{1}^{T}(k+1)e_{1}(k+1)\\ &+(M^{-}(k+1)e_{2}(k+1))^{T}(M^{-}(k+1)e_{2}(k+1))\\ &-e_{1}^{T}(k)e_{1}(k)-(M^{-}(k)e_{2}(k))^{T}(M^{-}(k)e_{2}(k)),\end{aligned} (66)

by substituting e1(k+1)e_{1}(k+1) and M(k+1)e2(k+1)M^{-}(k+1)e_{2}(k+1) from Eq. (8) and (13), we have

ΔLs(k)=(2c121)e1T(k)e1(k)+(2c22(M(k))T(M(k))+2h2)e2T(k)e2(k)+8ζa(k)2+8ϵa(k)2+4τd(k)2.\displaystyle\begin{aligned} \Delta L_{s}(k)=&(2\|c_{1}\|^{2}-1)e_{1}^{T}(k)e_{1}(k)\\ &+(2\|c_{2}\|^{2}-\|(M^{-}(k))^{T}(M^{-}(k))\|+2h^{2})e_{2}^{T}(k)e_{2}(k)\\ &+8\|\zeta_{a}(k)\|^{2}+8\|\epsilon_{a}(k)\|^{2}+4\|\tau_{d}(k)\|^{2}.\end{aligned} (67)

Based on Eq. (3), and also let H1=8ζa(k)2+8ϵa(k)2+4τd(k)2H_{1}=8\|\zeta_{a}(k)\|^{2}+8\|\epsilon_{a}(k)\|^{2}+4\|\tau_{d}(k)\|^{2}, we obtain

ΔLs(k)<(12c12)e12(k)[Mmin2(c22+h2)]e22(k)+H1.\displaystyle\begin{aligned} \Delta L_{s}(k)<&-(1-2\|c_{1}\|^{2})e_{1}^{2}(k)\\ &-[M_{min}-2(\|c_{2}\|^{2}+h^{2})]e_{2}^{2}(k)+H_{1}.\end{aligned} (68)

and H1H_{1} can be bounded by

H18ζam2+8ϵam2+4τdm2=H1m,\displaystyle H_{1}\leq 8\|\zeta_{am}\|^{2}+8\|\epsilon_{am}\|^{2}+4\|\tau_{dm}\|^{2}=H_{1m}, (69)

where \zeta_{am}, \epsilon_{am} and \tau_{dm} are the upper bounds of \zeta_{a}(k) in Eq. (43), \epsilon_{a}(k) in Eq. (33), and \tau_{d}(k) in Eq. (2), respectively.

Therefore, for c1c_{1}, c2c_{2} chosen from Eq. (64), and

e1(k)>H1m12c12ore2(k)>H1mMmin2(c22+h2),\displaystyle\begin{aligned} &\|e_{1}(k)\|>\sqrt{\frac{H_{1m}}{1-2\|c_{1}\|^{2}}}\\ ~{}\operatorname{or}~{}&\|e_{2}(k)\|>\sqrt{\frac{H_{1m}}{M_{min}-2(\|c_{2}\|^{2}+h^{2})}},\end{aligned} (70)

the first difference ΔLs<0\Delta L_{s}<0.

According to Definition 1 and the bounded initial states and weights, this demonstrates that the errors e_{1}(k) and e_{2}(k) are uniformly ultimately bounded from time step k to k+1, and the boundedness of the control input is also ensured according to Eq. (11). \hfill\blacksquare

Theorem 2. (Weight Convergence) Under Assumption 3 and the stage cost as given in Eq. (15) and based on Theorem 1, let the weights of the actor and critic neural networks be updated according to Eq. (24), (25), (30) and (31), respectively. Then w~c\tilde{w}_{c} and w~a\tilde{w}_{a} are uniformly ultimately bounded provided that the following conditions are met:

lc<minkβ2γγ2β2(ϕc(k)2+1β2A(k)2xc(k)2)la<mink(β3β1)(β3(w^c2(k))TC(k)2ϕa(k)2+β1w^c2(k)C(k)DT(k)2xa(k)2)1.\displaystyle\begin{aligned} &l_{c}<\min_{k}\frac{\beta_{2}-\gamma}{\gamma^{2}\beta_{2}\left(\left\|\phi_{c}(k)\right\|^{2}+\frac{1}{\beta_{2}}\|A(k)\|^{2}\|x_{c}(k)\|^{2}\right)}\\ &l_{a}<\min_{k}\left(\beta_{3}-\beta_{1}\right)\left(\beta_{3}\|\left(\hat{w}_{c2}(k)\right)^{T}C(k)\|^{2}\|\phi_{a}(k)\|^{2}\right.\\ &\quad\quad\left.+\beta_{1}\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}\|x_{a}(k)\|^{2}\right)^{-1}.\end{aligned} (71)

Remark 3. With γ\gamma, β1\beta_{1}, β2\beta_{2}, β3\beta_{3} provided in Eq. (16), (42), (49) and (58), respectively, we can choose lcl_{c}, lal_{a} to satisfy (71) by setting β2>γ>0\beta_{2}>\gamma>0 and β3>β1>0\beta_{3}>\beta_{1}>0.

Proof of Theorem 2. We introduce a candidate of Lyapunov function:

Lw(k)=L1(k)+L2(k)+L3(k)+L4(k),\displaystyle L_{w}(k)=L_{1}(k)+L_{2}(k)+L_{3}(k)+L_{4}(k), (72)

where L1(k)L_{1}(k), L2(k)L_{2}(k), L3(k)L_{3}(k) and L4(k)L_{4}(k) are shown in Eq. (36), (42), (49) and (58). Then the first difference of Lw(k)L_{w}(k) is given as

ΔLw(k)(γ28β1)ζc(k)2(1γ2lcϕc(k)2γ2lcβ2A(k)2xc(k)2γβ2)γw^c2(k)ϕc(k)+r(k)w^c2(k1)ϕc(k1)2w^c2(k)ϕc(k)+d2(1β1laβ1w^c2(k)C(k)2ϕa(k)2laβ3w^c2(k)C(k)DT(k)2xa(k)21β3)+H2,\displaystyle\begin{aligned} \Delta L_{w}(k)\leq&-(\gamma^{2}-\frac{8}{\beta_{1}})\left\|\zeta_{c}(k)\right\|^{2}-(1-\gamma^{2}l_{c}\|\phi_{c}(k)\|^{2}\\ &-\frac{\gamma^{2}l_{c}}{\beta_{2}}\|A(k)\|^{2}\|x_{c}(k)\|^{2}-\frac{\gamma}{\beta_{2}})\|\gamma\hat{w}_{c2}(k)\phi_{c}(k)\\ &+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}-\left\|\hat{w}_{c2}(k)\phi_{c}(k)\right.\\ &\left.+d\right\|^{2}(\frac{1}{\beta_{1}}-\frac{l_{a}}{\beta_{1}}\|\hat{w}_{c2}(k)C(k)\|^{2}\|\phi_{a}(k)\|^{2}\\ &-\frac{l_{a}}{\beta_{3}}\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}\|x_{a}(k)\|^{2}-\frac{1}{\beta_{3}})\\ &+H_{2},\end{aligned} (73)

where H2H_{2} is defined as

H2=γwc2ϕc(k)+r(k)w^c2(k1)ϕc(k1)2+1β1{8wc2ϕc(k)2+4d2+w^c2(k)C(k)ζa(k)2}+γβ2w~c1(k)xc(k)AT(k)2+1β3w^c2(k)C(k)DT(k)2w~a1(k)xa(k)2.\displaystyle\begin{aligned} H_{2}=&\|\gamma w_{c2}^{*}\phi_{c}(k)+r(k)-\hat{w}_{c2}(k-1)\phi_{c}(k-1)\|^{2}\\ +&\frac{1}{\beta_{1}}\left\{8\|w_{c2}^{*}\phi_{c}(k)\|^{2}+4d^{2}+\|\hat{w}_{c2}(k)C(k)\zeta_{a}(k)\|^{2}\right\}\\ +&\frac{\gamma}{\beta_{2}}\|\tilde{w}_{c1}(k)x_{c}(k)A^{T}(k)\|^{2}\\ +&\frac{1}{\beta_{3}}\|\hat{w}_{c2}(k)C(k)D^{T}(k)\|^{2}\|\tilde{w}_{a1}(k)x_{a}(k)\|^{2}.\end{aligned} (74)

For l_{c} and l_{a} satisfying Eq. (71), and by selecting \beta_{2}>\gamma and \beta_{3}>\beta_{1}, we obtain

ΔLw(k)(γ28β1)ζc(k)2+H2.\displaystyle\begin{aligned} \Delta L_{w}(k)\leq&-(\gamma^{2}-\frac{8}{\beta_{1}})\left\|\zeta_{c}(k)\right\|^{2}+H_{2}.\end{aligned} (75)

Applying the Cauchy–Schwarz inequality, we have

H2(8β1+4γ2+2)(w2cmϕcm)2+4rm2+1β1{8ζam2+8dm2+(w2cmCmζam)2}+γβ2(w1cmxcmAm)2+1β3(w2cmCmDmw1cmxam)2=H2m,\displaystyle\begin{aligned} H_{2}\leq&(\frac{8}{\beta_{1}}+4\gamma^{2}+2)\left(w_{2cm}\phi_{cm}\right)^{2}+4r_{m}^{2}\\ &+\frac{1}{\beta_{1}}\left\{8\zeta_{am}^{2}+8d_{m}^{2}+\left(w_{2cm}C_{m}\zeta_{am}\right)^{2}\right\}\\ &+\frac{\gamma}{\beta_{2}}\left(w_{1cm}x_{cm}A_{m}\right)^{2}+\frac{1}{\beta_{3}}\left(w_{2cm}C_{m}D_{m}w_{1cm}x_{am}\right)^{2}\\ =&H_{2m},\end{aligned} (76)

where ϕcm\phi_{cm}, rmr_{m}, w1cmw_{1cm}, w2cmw_{2cm}, AmA_{m}, CmC_{m}, DmD_{m}, xamx_{am}, and xcmx_{cm} are the upper bounds of ϕc(k)\phi_{c}(k), r(k)r(k), wc1(k)w_{c1}(k), wc2(k)w_{c2}(k), A(k)A(k), C(k)C(k), D(k)D(k), xa(k)x_{a}(k), and xc(k)x_{c}(k), respectively.

Therefore, if \gamma^{2}-\frac{8}{\beta_{1}}>0, that is, \beta_{1}>\frac{8}{\gamma^{2}} with \gamma\in(0,1), then for l_{a}, l_{c} satisfying the constraints in Eq. (71), and

ζc(k)>H2mγ28β1,\displaystyle\|\zeta_{c}(k)\|>\sqrt{\frac{H_{2m}}{\gamma^{2}-\frac{8}{\beta_{1}}}}, (77)

the first difference ΔLw(k)<0\Delta L_{w}(k)<0 holds.

From Definition 1, this result means that the estimation errors w~c\tilde{w}_{c} and w~a\tilde{w}_{a} are uniformly ultimately bounded from the time step kk to k+1k+1, respectively. \hfill\blacksquare

Remark 4. Given Assumption 3, and given that the initial states and weights are bounded, the initial stage cost and the initial output of the actor network are bounded. As the feed-forward input f(k) in Eq. (33) is realized via the actor network with bounded optimal weights, the initial approximation error of the actor network is bounded. From Definition 1, \Delta L_{s}<0 in Eq. (68) and \Delta L_{w}<0 in Eq. (75), the tracking errors e_{1}(k), e_{2}(k) and the estimation errors \tilde{w}_{a}(k) and \tilde{w}_{c}(k) are bounded from step k to the next step k+1, and the control law u(k) is bounded from step k to the next step k+1 as well. Then the resulting stage cost r(k+1) is bounded. By mathematical induction, the tracking errors e_{1}(k), e_{2}(k) and the estimation errors \tilde{w}_{a}(k), \tilde{w}_{c}(k) are uniformly ultimately bounded.

Remark 5. The results of Theorem 1 and Theorem 2 hold under less restrictive conditions than those in [40, 41], which require a bounded stage cost. We only require bounded initial system states and bounded initial actor-critic network weights.

Theorem 3. ((Sub)optimality Result) Under the conditions of Theorem 2, the Bellman optimality is achieved within a finite approximation error. Meanwhile, the error between the obtained control law u(k) and the optimal control u^{*}(k) is uniformly ultimately bounded.

Proof of Theorem 3. From the approximate cost-to-go in Eq. (19) and the optimal cost-to-go expressed in Eq. (33), we have

J^(k))J(k)=w^c2(k)ϕc(k)wc2ϕc(k)ϵc(k)w~c2(k)ϕcm+ϵcm.\displaystyle\begin{aligned} &\|\hat{J}(k))-J^{*}(k)\|\\ =&\|\hat{w}_{c2}(k)\phi_{c}(k)-w_{c2}^{*}\phi_{c}(k)-\epsilon_{c}(k)\|\leqslant\|\tilde{w}_{c2}(k)\|\phi_{cm}+\epsilon_{cm}.\end{aligned} (78)

Similarly, from (11), (26) and (33), we have

u(k)u(k)w~a2(k)ϕam+ϵam,\displaystyle\|u(k)-u^{*}(k)\|\leqslant\|\tilde{w}_{a2}(k)\|\phi_{am}+\epsilon_{am}, (79)

where \phi_{cm}, \phi_{am}, \epsilon_{cm} and \epsilon_{am} are the upper bounds of \phi_{c}(k), \phi_{a}(k), \epsilon_{c}(k) and \epsilon_{a}(k), respectively. These bounds follow directly because \|\tilde{w}_{c2}\| and \|\tilde{w}_{a2}\| are both uniformly ultimately bounded as the time step k increases, as shown in Theorem 2. This demonstrates that the Bellman optimality is achieved within finite approximation errors. \hfill\blacksquare

VI Simulation Study

We use two examples to demonstrate how the proposed algorithm works and how it improves reproducibility of results over the original dHDP for data-driven tracking control.

Example 1. We consider a single-link robot manipulator with the following motion equation:

Mq¨(t)+G(q(t))+τd(t)=τ(t),\displaystyle M\ddot{q}(t)+G(q(t))+\tau_{d}(t)=\tau(t), (80)

where G(q(t))=\frac{1}{2}\times 9.8\times m\times l\times\sin(q(t)); m and l are the mass and the half length of the manipulator, respectively. The values of M, m, l, \tau_{d} and the initial state differ among the simulation cases below (refer to Table I). Note that the model in Eq. (80) only provides a simulated environment in place of a real physical environment; that is, the proposed approach remains data-driven.

The feedback gain parameters c_{1} and c_{2} in Eq. (64) are chosen as c_{1}=0.7 and c_{2}=-5. The CNN and ANN in Fig. 1 each have six hidden nodes. The discount factor \gamma in Eq. (16) is chosen as 0.95. The continuous-time system dynamics in Eq. (80) are discretized by the Runge-Kutta method with h=0.02\,s.
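A possible simulator for Example 1 is sketched below; the Runge-Kutta stepping and the case-1 parameter values of Table I (where “C” denotes a constant M of 5) are used, while the function names and structure are assumptions of this illustration.

```python
import numpy as np

m, l, M_const = 1.0, 1.0, 5.0                  # case 1 of Table I
G = lambda q: 0.5 * 9.8 * m * l * np.sin(q)    # gravity term of Eq. (80)

def rk4_step(q, qdot, tau, h=0.02, tau_d=0.0):
    """One Runge-Kutta step of the single-link dynamics, Eq. (80),
    with the torque tau held constant over the sampling interval."""
    acc = lambda q_: (tau - G(q_) - tau_d) / M_const
    k1v, k1a = qdot, acc(q)
    k2v, k2a = qdot + 0.5 * h * k1a, acc(q + 0.5 * h * k1v)
    k3v, k3a = qdot + 0.5 * h * k2a, acc(q + 0.5 * h * k2v)
    k4v, k4a = qdot + h * k3a, acc(q + h * k3v)
    q_next = q + h / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
    qdot_next = qdot + h / 6.0 * (k1a + 2 * k2a + 2 * k3a + k4a)
    return q_next, qdot_next
```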

The per-sample mean square error (MSE) defined below is used in Algorithm 1 to terminate the ANN and CNN weight update procedure, and it is also used for scheduling the learning rates.

MSE=1n+nΣk=nn+e1(k)2,\displaystyle\mathrm{MSE}=\frac{1}{n^{+}-n^{-}}\Sigma_{k=n^{-}}^{n^{+}}\left\|e_{1}(k)\right\|^{2}, (81)

where (n+n)(n^{+}-n^{-}) is the number of data samples between time stamps nn^{-} and n+n^{+}.
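For the scalar tracking error of Example 1, the windowed MSE of Eq. (81) and the success criterion of Section VI-A can be computed with a short helper such as the sketch below (the indexing convention is an assumption).

```python
import numpy as np

def windowed_mse(e1_hist, n_minus, n_plus):
    """Per-sample tracking MSE of Eq. (81) over samples n_minus .. n_plus."""
    window = np.asarray(e1_hist[n_minus:n_plus])
    return float(np.mean(np.abs(window) ** 2))

# example usage for a 6000-sample trial:
# mse_first  = windowed_mse(e1_hist, 0, 3000)     # MSE_{3000^-}
# mse_second = windowed_mse(e1_hist, 3000, 6000)  # MSE_{3000^+}
# success    = mse_second < mse_first             # criterion of Section VI-A
```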

In the following, we demonstrate the effectiveness of the proposed algorithm by first comparing it with dHDP tracking control without backstepping and then with feedback stabilizing control [1], where the feedback stabilizing control law is u(k)=c_{2}e_{2}(k). In all the simulation studies below, a trial consists of 6000 consecutive samples.

VI-A Tracking by dHDP with and without backstepping

In this comparison study, we set x_{1}(0)=-0.1, x_{2}(0)=0.1, m=1, l=1, \hat{M}^{+}=5/h, and \tau_{d}=0. A total of 50 trials were conducted to obtain the results reported here. We define a trial as a success if the MSE of the last 3000 samples (denoted MSE_{3000^{+}}) is less than the MSE of the first 3000 samples (denoted MSE_{3000^{-}}). The dHDP alone reached a 14% success rate under these settings, and the MSE_{3000^{+}} of its successful cases is 1.454×10^{-1}. In comparison, the success rate of the proposed algorithm is 100% and its MSE_{3000^{+}} is 2.135×10^{-5}. Typical tracking trials are shown in Fig. 2 and the weights of the actor-critic networks are shown in Fig. 3. The performance improvement is apparent.

TABLE I: Comprehensive performance evaluation
case | initial state | M | (m, l) | τ_d | success rate | # resets
1 | (-0.1, 0.1) | C | (1, 1) | N/A | 96% | 2
2 | (0.1, -0.1) | C | (1, 1) | N/A | 82% | 16
3 | (0.2, -0.2) | C | (1, 1) | N/A | 80% | 24
4 | (-0.2, 0.2) | C | (1, 1) | N/A | 94% | 5
5 | (-0.1, 0.1) | C | (1, 1) | Pulse | 76% | 28
6 | (-0.1, 0.1) | C | (1, 1) | Gaus | 84% | 12
7 | (-0.1, 0.1) | C | (1, 2) | N/A | 60% | 26
8 | (-0.1, 0.1) | C | (2, 2) | N/A | 50% | 35
9 | (-0.1, 0.1) | V | (1, 1) | N/A | 90% | 8
Figure 2: Tracking performance of dHDP without backstepping (a) and with backstepping (b).
Figure 3: Weights of the action network with backstepping (a) and of the critic network with backstepping (b).

VI-B Reproducibility of the proposed scheme

While the previous evaluation illustrated the effectiveness of the proposed tracking control design, we now perform a comprehensive evaluation. Nine different scenarios (Table I) are used to quantitatively evaluate the reproducibility of results obtained with the proposed algorithm. Here, a trial is counted as successful if its MSE_{3000^{+}} is less than that of the feedback stabilizing control method. In Table I, “C” denotes a constant inertia of 5 and “V” a constant 5 plus zero-mean Gaussian noise with std = 0.50; a “Pulse” disturbance is \tau_{d}(t)=2 appearing only at t=40, and “Gaus” denotes zero-mean Gaussian noise with std = 8.25. The success rate over 50 trials is shown in Table I, which verifies the reproducibility of the proposed algorithm.
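The following sketch shows one way the disturbance and inertia settings of Table I could be generated in simulation; the function names and the exact placement of the pulse sample are our assumptions, chosen only to be consistent with the descriptions above.

```python
import numpy as np

h = 0.02                                # sampling period of Example 1 (seconds)
rng = np.random.default_rng()

def disturbance(case, k):
    """Disturbance tau_d(k) for the scenarios of Table I (illustrative)."""
    if case == "Pulse":                 # tau_d = 2 appearing only at t = 40 s
        return 2.0 if k == int(40 / h) else 0.0
    if case == "Gaus":                  # zero-mean Gaussian noise with std = 8.25
        return rng.normal(0.0, 8.25)
    return 0.0                          # "N/A": no disturbance

def inertia(case):
    """Inertia M for the 'C' (constant 5) and 'V' (5 plus Gaussian, std = 0.50) settings."""
    return 5.0 if case == "C" else 5.0 + rng.normal(0.0, 0.50)
```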

Figure 4: Comparison of average MSE3000+{}_{3000^{+}}.

VI-C Reproducibility with reset after failure

To test whether the proposed method can reach 100% success, we use a “reset” mechanism. Previously, a trial was defined as a simulation of 6000 time samples starting from a given initial state with randomly initialized ANN and CNN weights. We now define an episode as a sequence of such trials that continues until a success is reached; each trial in an episode contains 6000 time samples. An episode starts from a given initial state and randomly initialized ANN and CNN weights, but if a trial ends in failure, only the ANN and CNN weights are reset to random values for the next trial. The “# resets” column in Table I is the number of resets that occurred over 50 episodes. Note that the “success rate” column reports results from Subsection VI-B above. The average MSE_{3000^{+}} of the last successful trial over the 50 episodes is shown in Fig. 4. For comparison, we also show the MSE_{3000^{+}} obtained with the feedback stabilizing control. The proposed algorithm clearly outperforms the feedback stabilizing control.
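A minimal sketch of this episode/reset mechanism is given below; run_trial and init_weights are hypothetical callables standing in for one 6000-sample trial and the random ANN/CNN weight initialization, and max_trials is an assumed safeguard, not a parameter of the proposed algorithm.

```python
def run_episode(x0, run_trial, init_weights, max_trials=20):
    """One episode: repeat 6000-sample trials from the same initial state x0,
    re-drawing only the ANN/CNN weights after each failed trial.
    run_trial(x0, weights) -> (success, weights); init_weights() -> random weights."""
    weights = init_weights()                        # random actor/critic weights
    resets = 0
    for _ in range(max_trials):
        success, weights = run_trial(x0, weights)   # one 6000-sample trial
        if success:                                 # success: MSE_{3000+} beats the stabilizing controller
            return True, resets
        weights = init_weights()                    # failure: reset only the network weights
        resets += 1
    return False, resets
```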

Example 2. We now consider a two-link robot manipulator with the following motion equation

\displaystyle M(q(t))\ddot{q}(t)+V_{m}(q(t),\dot{q}(t))\dot{q}(t)+F_{d}\dot{q}(t)+F_{s}(\dot{q}(t))+\tau_{d}(t)=\tau(t), (82)

where q(t)=[q_{1}(t)\quad q_{2}(t)]^{T}, the inertia matrix is given by

M(q(t))\triangleq\left[\begin{array}{cc}p_{1}+2p_{3}s_{1}(t)&p_{2}+p_{3}s_{1}(t)\\ p_{2}+p_{3}s_{1}(t)&p_{2}\end{array}\right],

the centripetal-coriolis matrix is given by

V_{m}(q(t),\dot{q}(t))\triangleq\left[\begin{array}{cc}-p_{3}s_{2}(t)\dot{q}_{2}(t)&-p_{3}s_{2}(t)\left(\dot{q}_{1}(t)+\dot{q}_{2}(t)\right)\\ p_{3}s_{2}(t)\dot{q}_{1}(t)&0\end{array}\right],

where s_{1}=\cos(q_{2}), s_{2}=\sin(q_{2}); p_{1}=3.473 kg m^{2}, p_{2}=0.196 kg m^{2}, and p_{3}=0.242 kg m^{2}; F_{d}=\mathrm{diag}[5.3, 6.1] N m s and F_{s}(\dot{q})=[8.45\tanh(\dot{q}_{1}),~2.35\tanh(\dot{q}_{2})]^{T} N m are the models for the dynamic and the static friction, respectively.

The feedback gain parameters c_{1} and c_{2} in Eq. (64) are chosen as c_{1}=0.7 and c_{2}=-1. The CNN and ANN in Fig. 1 each have eight hidden nodes. The discount factor \gamma in Eq. (16) is chosen as 0.95, the initial states are q(0)=[1.8\quad 1.5]^{T} and \dot{q}(0)=[0\quad 0]^{T}, and the learning rates l_{a} and l_{c} are both 0.01. We use a sampling period of h=0.01 s, and M^{+}(k) is initialized as [p_{1}~p_{2};p_{2}~p_{2}]/h.
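As a reference for reproducing the simulated environment, the sketch below evaluates the two-link dynamics of Eq. (82) with the parameter values listed above; the function name, the state convention, and the explicit solve for the joint accelerations are our choices for illustration, not part of the proposed algorithm.

```python
import numpy as np

# Parameter values from the text (kg m^2 for p1-p3, N m s for Fd).
p1, p2, p3 = 3.473, 0.196, 0.242
Fd = np.diag([5.3, 6.1])

def two_link_dynamics(q, q_dot, tau, tau_d=np.zeros(2)):
    """Joint accelerations from Eq. (82):
    M(q) q_ddot = tau - Vm(q, q_dot) q_dot - Fd q_dot - Fs(q_dot) - tau_d."""
    s1, s2 = np.cos(q[1]), np.sin(q[1])
    M = np.array([[p1 + 2.0 * p3 * s1, p2 + p3 * s1],
                  [p2 + p3 * s1,       p2          ]])
    Vm = np.array([[-p3 * s2 * q_dot[1], -p3 * s2 * (q_dot[0] + q_dot[1])],
                   [ p3 * s2 * q_dot[0],  0.0                            ]])
    Fs = np.array([8.45 * np.tanh(q_dot[0]), 2.35 * np.tanh(q_dot[1])])  # static friction (N m)
    return np.linalg.solve(M, tau - Vm @ q_dot - Fd @ q_dot - Fs - tau_d)

# Illustrative call with the initial conditions of Example 2 and zero input torque:
q_ddot = two_link_dynamics(np.array([1.8, 1.5]), np.zeros(2), tau=np.zeros(2))
```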

A comparison of tracking performance shows that the MSE_{3000^{+}} of the stabilizing control is 2.69×10^{-2} for q_{1} and 1.54×10^{-2} for q_{2}, while the MSE_{3000^{+}} of the dHDP-based design is 0.19×10^{-2} for q_{1} and 0.21×10^{-2} for q_{2}, respectively. This example again verifies the effectiveness of the proposed tracking control design.

VII Conclusion

This study aims to develop a mathematically suitable and practically useful, data-driven tracking control solution. Toward this goal, we introduce a new dHDP-based tracking control algorithm that takes advantage of the closed-loop system stability framework of the backstepping design for Euler-Lagrange systems. This design approach also removes the dependence on a reference model for the desired tracking trajectory. Based on the proposed algorithm, we have shown stability of the overall dynamic system, weight convergence of the actor-critic neural networks, and (sub)optimality of the Bellman solution. Our simulations show improved reproducibility and reduced tracking error for the proposed design. As dHDP has been shown feasible for solving complex engineering application problems, we expect this algorithm to also hold potential for tracking control of nonlinear dynamic systems.

References

  • [1] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos, Nonlinear and adaptive control design.   John Wiley & Sons, Inc., 1995.
  • [2] H. K. Khalil and J. W. Grizzle, Nonlinear systems.   Prentice Hall, Upper Saddle River, NJ, 2002, vol. 3.
  • [3] H. Nijmeijer and A. Van der Schaft, Nonlinear dynamical control systems.   Springer, 1990, vol. 175.
  • [4] A. Isidori, Nonlinear control systems.   Springer Science & Business Media, 2013.
  • [5] K. Sun, J. Qiu, H. R. Karimi, and H. Gao, “A novel finite-time control for nonstrict feedback saturated nonlinear systems with tracking error constraint,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
  • [6] B. Fan, Q. Yang, X. Tang, and Y. Sun, “Robust ADP design for continuous-time nonlinear systems with output constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2127–2138, 2018.
  • [7] H. Fu, X. Chen, W. Wang, and M. Wu, “Mrac for unknown discrete-time nonlinear systems based on supervised neural dynamic programming,” Neurocomputing, vol. 384, pp. 130–141, 2020.
  • [8] M.-B. Radac and R.-E. Precup, “Data-driven model-free tracking reinforcement learning control with VRFT-based adaptive actor-critic,” Applied Sciences, vol. 9, no. 9, p. 1807, 2019.
  • [9] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy hdp iteration algorithm,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 937–942, 2008.
  • [10] L. Yang, J. Si, K. S. Tsakalis, and A. A. Rodriguez, “Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 6, pp. 1617–1622, 2009.
  • [11] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.
  • [12] Q. Wei and D. Liu, “Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification,” IEEE Transactions on Automation Science and Engineering, vol. 11, no. 4, pp. 1020–1036, 2014.
  • [13] H. Modares and F. L. Lewis, “Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning,” Automatica, vol. 50, no. 7, pp. 1780–1792, 2014.
  • [14] B. Kiumarsi and F. L. Lewis, “Actor–critic-based optimal tracking for partially unknown nonlinear discrete-time systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 140–151, 2015.
  • [15] R. Kamalapurkar, H. Dinh, S. Bhasin, and W. E. Dixon, “Approximate optimal trajectory tracking for continuous-time nonlinear systems,” Automatica, vol. 51, pp. 40–48, 2015.
  • [16] H. Modares, F. L. Lewis, and Z.-P. Jiang, “H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2550–2562, 2015.
  • [17] B. Luo, D. Liu, T. Huang, and D. Wang, “Model-free optimal tracking control via critic-only Q-learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 2134–2144, 2016.
  • [18] C. Mu, Z. Ni, C. Sun, and H. He, “Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems,” IEEE Transactions on Cybernetics, vol. 47, no. 6, pp. 1460–1470, 2017.
  • [19] W. Gao and Z.-P. Jiang, “Learning-based adaptive optimal tracking control of strict-feedback nonlinear systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2614–2624, 2018.
  • [20] D. Wang, D. Liu, Y. Zhang, and H. Li, “Neural network robust tracking control with adaptive critic framework for uncertain nonlinear systems,” Neural Networks, vol. 97, pp. 11–18, 2018.
  • [21] B. Zhao and D. Liu, “Event-triggered decentralized tracking control of modular reconfigurable robots through adaptive dynamic programming,” IEEE Transactions on Industrial Electronics, vol. 67, no. 4, pp. 3054–3064, 2019.
  • [22] C. Mu and Y. Zhang, “Learning-based robust tracking control of quadrotor with time-varying and coupling uncertainties,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 1, pp. 259–273, 2019.
  • [23] H. Dong, X. Zhao, and B. Luo, “Optimal tracking control for uncertain nonlinear systems with prescribed performance via critic-only ADP,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020.
  • [24] Z. Ni, H. He, and J. Wen, “Adaptive learning in tracking control based on the dual critic network design,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 913–928, 2013.
  • [25] J. Si and Y.-T. Wang, “Online learning control by association and reinforcement,” IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 264–276, 2001.
  • [26] L. Yang, J. Si, K. S. Tsakalis, and A. A. Rodriguez, “Performance evaluation of direct heuristic dynamic programming using control-theoretic measures,” Journal of Intelligent and Robotic Systems, vol. 55, no. 2-3, pp. 177–201, 2009.
  • [27] N. T. Nguyen, Model Reference Adaptive Control: A Primer.   Springer, 2018.
  • [28] G. R. G. da Silva, A. S. Bazanella, and L. Campestrini, “On the choice of an appropriate reference model for control of multivariable plants,” IEEE Transactions on Control Systems Technology, vol. 27, no. 5, pp. 1937–1949, 2019.
  • [29] H. Zargarzadeh, T. Dierks, and S. Jagannathan, “Optimal control of nonlinear continuous-time systems in strict-feedback form,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2535–2549, 2015.
  • [30] Z. Wang, X. Liu, K. Liu, S. Li, and H. Wang, “Backstepping-based lyapunov function construction using approximate dynamic programming and sum of square techniques,” IEEE Transactions on Cybernetics, vol. 47, no. 10, pp. 3393–3403, 2016.
  • [31] F. A. Miranda-Villatoro, B. Brogliato, and F. Castanos, “Multivalued robust tracking control of lagrange systems: Continuous and discrete-time algorithms,” IEEE Transactions on Automatic Control, vol. 62, no. 9, pp. 4436–4450, 2017.
  • [32] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of learning and approximate dynamic programming.   John Wiley & Sons, 2004, vol. 2.
  • [33] R. Enns and J. Si, “Apache helicopter stabilization using neural dynamic programming,” Journal of Guidance, Control, and Dynamics, vol. 25, no. 1, pp. 19–25, 2002.
  • [34] ——, “Helicopter flight-control reconfiguration for main rotor actuator failures,” Journal of Guidance, Control, and Dynamics, vol. 26, no. 4, pp. 572–584, 2003.
  • [35] ——, “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Transactions on Neural Networks, vol. 14, no. 4, pp. 929–939, 2003.
  • [36] C. Lu, J. Si, and X. Xie, “Direct heuristic dynamic programming for damping oscillations in a large power system,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 1008–1013, 2008.
  • [37] Y. Wen, J. Si, X. Gao, S. Huang, and H. Huang, “A new powered lower limb prosthesis control framework based on adaptive dynamic programming.” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 9, pp. 2215–2220, 2017.
  • [38] Y. Wen, J. Si, A. Brandt, X. Gao, and H. Huang, “Online reinforcement learning control for the personalization of a robotic knee prosthesis,” IEEE Transactions on Cybernetics, 2019.
  • [39] Y. Zhang, S. Li, K. J. Nolan, and D. Zanotto, “Adaptive assist-as-needed control based on actor-critic reinforcement learning.” in The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 4066–4071.
  • [40] F. Liu, J. Sun, J. Si, W. Guo, and S. Mei, “A boundedness result for the direct heuristic dynamic programming,” Neural Networks, vol. 32, pp. 229–235, 2012.
  • [41] Y. Sokolov, R. Kozma, L. D. Werbos, and P. J. Werbos, “Complete stability analysis of a heuristic approximate dynamic programming control design,” Automatica, vol. 59, pp. 9–18, 2015.
  • [42] A. N. Michel, L. Hou, and D. Liu, Stability of dynamical systems.   Springer, 2008.
  • [43] J. Sarangapani, Neural network control of nonlinear discrete-time systems.   CRC press, 2018.