
A Deep Reinforcement Learning-based Sliding Mode Control Design for Partially-known Nonlinear Systems*

Sahand Mosharafian, Shirin Afzali, Yajie Bao, Javad Mohammadpour Velni *This research was financially supported by the United States National Science Foundation under award #1912757. S. Mosharafian, S. Afzali, Y. Bao, and J. Mohammadpour Velni are with the School of Electrical & Computer Engineering, University of Georgia, Athens, GA 30602, USA; sahandmosharafian@uga.edu, shirin.afzali@uga.edu, yajie.bao@uga.edu, javadm@uga.edu.
Abstract

The presence of model uncertainties creates challenges for model-based control design, and the complexity of the design is further exacerbated when coping with nonlinear systems. This paper presents a sliding mode control (SMC) design approach for nonlinear systems with partially known dynamics by blending data-driven and model-based approaches. First, an SMC is designed for the available (nominal) model of the nonlinear system. The closed-loop state trajectory of the available model is used to build the desired trajectory for the partially known nonlinear system states. Next, a deep policy gradient method is used to cope with the unknown parts of the system dynamics and adjust the sliding mode control output to achieve the desired state trajectory. The performance (and viability) of the proposed design approach is finally examined through numerical examples.

I Introduction

Controller design for nonlinear dynamical systems has been an area of research interest for decades. Various methods for controller design for nonlinear systems have been proposed, including feedback linearization [1], backstepping control [2], and sliding mode control (SMC) [3]. Generally, there are plant-model mismatches that arise from parameter uncertainty [4], measurement noise, and external disturbances. SMC is a control design technique that offers robustness to these uncertainties in nonlinear systems with stability guarantees [5, 3]. However, SMC requires bounds on the uncertainties and adds a discontinuity to the system through the sign function, which results in chattering and deteriorates the performance of the SMC. Furthermore, uncertain knowledge of the system equations leads to a conservative SMC design. Data-driven approaches to control, such as model-free reinforcement learning (RL), require no information about the system model and can learn control laws from data through interactions with the system [6]. However, RL cannot provide stability guarantees and suffers from high sample complexity. In this paper, a reinforcement learning-based SMC design approach is proposed to cope with uncertainties without known bounds by combining the advantages of both RL and SMC.

RL consists of an agent that interacts with the environment and improves its control actions to maximize the discounted future rewards received from the environment in response to those actions [7]. The distinguishing feature of RL is “learning by interaction with the environment,” independent of the complexity of the system, which enables RL to be used for complicated control tasks.

There have been recent advancements in the field of artificial intelligence by fusing RL and deep learning techniques. Deep reinforcement learning (DRL) algorithms result from employing deep neural networks to approximate components of reinforcement learning (value function, policy, and model) [8]. The deep Q network (DQN) is a combination of deep neural networks and an RL algorithm called Q-learning, which has contributed to significant progress in games, robotics, and other fields [9]. However, DQN is only capable of solving discrete problems with low-dimensional action spaces. Continuous policy gradient methods were therefore proposed to cope with continuous action spaces. In particular, deterministic policy gradient methods [10] are useful for controller design applications.

A deterministic policy gradient algorithm based on deep learning and actor-critic methods is presented in [11]. This method, called deep deterministic policy gradient (DDPG), can handle continuous and high-dimensional action spaces and is used in this paper to design a sliding mode controller. DDPG is an actor-critic, model-free, off-policy algorithm, in which the critic learns the Q-function using off-policy data, and the actor learns the policy using the sampled policy gradient [11].

A number of previous studies have employed RL for designing SMC. The authors in [12] estimated the uncertainty and disturbance terms by a neural network (NN) approximator and a disturbance observer, respectively, for an SMC integrated with RL. Moreover, [13] proposed an optimal guaranteed cost SMC integrated with an approximate dynamic programming (ADP) algorithm based on a single critic NN for constrained-input nonlinear systems with disturbances. Different from the existing works, in our work, we assume the system is partially known, and the goal is to achieve a desired performance for the original system using knowledge of a simplified model of the system. In particular, we present an RL-based SMC design approach that preserves the structure of the SMC law by combining the SMC designed for the nominal model with RL to cope with uncertainties. Instead of using fixed bounds for SMC, the proposed approach can cope with time-varying (and even state- and input-dependent) uncertainties by virtue of the model-free, off-policy policy gradient RL algorithm.

The novelty of our work reported in this paper lies in fusing model-based and data-driven approaches for the design of an SMC for a class of nonlinear systems. The model-based part of the controller is obtained from the available knowledge about the nonlinear system dynamics. The data-driven part of the controller is then calculated using the DDPG algorithm to cope with the discrepancy between the original system and the available model. Moreover, no information about the unknown parts of the system dynamics is needed, and the desired performance is reached. Furthermore, the control input, as well as the system states, is penalized when defining the reward function for the RL agent in order to limit chattering. It is noted that, since the plant-model mismatch is used by DDPG to update the SMC output, the proposed design approach interacts with the actual system online and hence leads to less conservative results compared to traditional robust SMC design methods in the literature.

The remainder of this paper is organized as follows. Preliminaries and the problem statement are provided in Section II. Section III describes the SMC design process. Section IV discusses the DDPG algorithm for SMC design purposes. Simulation results are presented in Section V to validate the performance of the proposed design approach, and concluding remarks are provided in Section VI.

II Problem Statement and Preliminaries

This section first presents the model of the system under study and then provides a brief description of the policy gradient method in reinforcement learning.

II-A System Model

Consider a class of nonlinear systems with $n$ measurable states described in the normal form as

\begin{gathered}
\dot{x}_{i}(t)=x_{i+1}(t),\quad i=1,2,\ldots,n-1,\\
\dot{x}_{n}(t)=f(t,\mathbf{x}(t))+\Delta f(t,\mathbf{x}(t))+\left[g(t,\mathbf{x}(t))+\Delta g(t,\mathbf{x}(t))\right]u(t),
\end{gathered} (1)

where $\mathbf{x}(t)\in\mathbf{R}^{n}$ is the vector of all system states, $u(t)\in\mathbf{R}$ is the control input, $f(t,\mathbf{x}(t))\in\mathbf{R}$, and $g(t,\mathbf{x}(t))\in\mathbf{R}$. Assume that $\Delta f(t,\mathbf{x}(t))$ and $\Delta g(t,\mathbf{x}(t))$ are unknown. A simplified model of the original system in (1) can be represented as follows

\begin{gathered}
\dot{\hat{x}}_{i}(t)=\hat{x}_{i+1}(t),\quad i=1,2,\ldots,n-1,\\
\dot{\hat{x}}_{n}(t)=f(t,\hat{\mathbf{x}}(t))+g(t,\hat{\mathbf{x}}(t))\,\hat{u}(t),
\end{gathered} (2)

where $\hat{\mathbf{x}}(t)$ is the vector of all simplified system states. The goal is to design an RL-based sliding mode controller (SMC) with partial knowledge of the system dynamics (here, the partial knowledge is the simplified system model). It is noted that the simplified system can even be a linear approximation of the original system.

Remark 1

The original system model can be described in the strict feedback form

\begin{gathered}
\dot{x}_{1}=f_{1}(t,x_{1}(t))+g_{1}(t,x_{1}(t))\,x_{2},\\
\dot{x}_{2}=f_{2}(t,x_{1}(t),x_{2}(t))+g_{2}(t,x_{1}(t),x_{2}(t))\,x_{3},\\
\vdots\\
\dot{x}_{n-1}=f_{n-1}(t,x_{1}(t),\ldots,x_{n-1}(t))+g_{n-1}(t,x_{1}(t),\ldots,x_{n-1}(t))\,x_{n},\\
\dot{x}_{n}=f_{n}(t,\mathbf{x}(t))+\Delta f(t,\mathbf{x}(t))+\left[g_{n}(t,\mathbf{x}(t))+\Delta g(t,\mathbf{x}(t))\right]u(t),
\end{gathered} (3)

where the uncertainties are assumed to exist only in the expression of $\dot{x}_{n}$. The simplified model of the system described by (3) will also be in the strict feedback form. However, the strict feedback form can be transformed into the normal form using the state transformation

z_{1}=x_{1};\quad z_{2}=\dot{z}_{1};\quad\cdots;\quad z_{n}=\dot{z}_{n-1}, (4)

where $z_{n}$ requires no information about $\dot{x}_{n}$, and the simplified model can be transformed similarly. Therefore, the RL-based SMC design approach proposed in this paper can be extended to treat systems in the form of (3).

II-B Policy Gradient in Reinforcement Learning

A reinforcement learning (RL) agent aims at learning a policy that maximizes the discounted future rewards (expected return). The return at time step $t$ is the total discounted reward from $t$, $G_{t}=\sum_{k=t}^{N}\gamma^{k-t}r_{k}$, where $r_{k}=r(s_{k},a_{k})$ is the reward received by taking action $a_{k}$ in state $s_{k}$, and $0<\gamma\leq 1$ is the discount rate. For non-episodic tasks, $N=\infty$. The value function evaluates the expected return starting from state $s$ under policy $\pi$ and is represented as $V^{\pi}(s)=\mathbf{E}_{\pi}[G_{t}|S_{t}=s]$. The expected return starting from state $s$, taking action $a$, and following policy $\pi$ thereafter is the Q-value, $Q^{\pi}(s,a)=\mathbf{E}_{\pi}[G_{t}|S_{t}=s,A_{t}=a]$. The RL agent aims at maximizing its expected return from the initial state; thus, the agent's goal is to maximize $J(\pi)=V^{\pi}(s_{0})=\mathbf{E}_{\pi}[G_{0}]$.
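As a concrete illustration of the return defined above, the following Python snippet (a minimal sketch, not part of the paper) computes $G_{t}$ for a finite episode:

def discounted_return(rewards, gamma=0.99, t=0):
    """Compute G_t = sum_{k=t}^{N} gamma^(k-t) * r_k for a finite episode."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

# Example: rewards r_0, r_1, r_2 of a three-step episode
print(discounted_return([1.0, 0.5, -0.2], gamma=0.9))  # 1.0 + 0.9*0.5 + 0.81*(-0.2) = 1.288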

In policy gradient algorithms, which are suitable for RL problems with continuous action spaces [10], the policy is parametrized by a set of parameters $\theta$, which can be the weights of a neural network ($\pi(s,\theta)=\pi_{\theta}(s)$). In this case, the objective function for the RL agent becomes $J(\pi_{\theta})=\mathbf{E}_{\pi_{\theta}}[G_{0}]$. The goal is to update the policy parameters $\theta$ to maximize $J$; hence, the parameters $\theta$ are updated in the direction of $\nabla_{\theta}J$. In [7], it is shown that for stochastic policies

\begin{gathered}
\nabla_{\theta}J(\pi_{\theta})=\int_{S}\rho(s)\int_{A}\nabla_{\theta}\pi_{\theta}(a|s)\,Q^{\pi}(s,a)\,\mathrm{d}a\,\mathrm{d}s\\
=\mathbf{E}_{s\sim\rho^{\pi},\,a\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi}(s,a)\right],
\end{gathered} (5)

where $\rho(s)$ is the state distribution following policy $\pi_{\theta}$.

Actor-critic algorithms, which use the policy gradient theorem, consist of an actor that adjusts the policy parameters $\theta$ and a critic that estimates $Q^{\pi}(s,a)$ by $Q^{\phi}(s,a)$ with parameters $\phi$ [14]. The critic adjusts the parameters $\phi$ to minimize the following mean squared error (MSE)

L(\phi)=\mathbf{E}_{s\sim\rho^{\pi},\,a\sim\pi_{\theta}}\left[\left(Q^{\phi}(s,a)-Q^{\pi}(s,a)\right)^{2}\right]. (6)

For designing controllers using the policy gradient in this paper, a continuous deterministic policy is used, and the policy gradient must be adapted to improve the deterministic policy. According to [10], a common approach in policy improvement methods is to find a greedy policy such that

\mu^{k+1}(s)=\arg\max_{a}Q^{\mu^{k}}(s,a).

The notation $\mu(s)$ is used to denote the deterministic policy. Since greedy policy improvement is computationally expensive for continuous action spaces, an alternative for improving the parametrized policy is to move it in the direction of $\nabla_{\theta}Q^{\mu^{k}}(s,\mu_{\theta}(s))$. Hence, the update rule for improving the policy is [10]

\theta^{k+1}=\theta^{k}+\alpha_{a}\,\mathbf{E}_{s\sim\rho^{\mu^{k}}}\left[\nabla_{\theta}Q^{\mu^{k}}(s,\mu_{\theta}(s))\right], (7)

where $\alpha_{a}$ is the learning rate. It is shown in [10] that

\nabla_{\theta}J(\mu_{\theta})=\mathbf{E}_{s\sim\rho^{\mu^{k}}}\left[\nabla_{\theta}Q^{\mu^{k}}(s,\mu_{\theta}(s))\right], (8)

which implies that the update rule (7) moves the policy parameters $\theta$ in the direction that maximizes $J(\mu_{\theta})$. This update rule is used later in the paper to find a suitable control action for the original system (1).
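A minimal TensorFlow/Keras sketch of the update step (7) is given below; it assumes `actor` and `critic` are Keras models (the critic taking the state and action as two inputs) and `optimizer` is any `tf.keras` optimizer. These names are illustrative assumptions, not from the paper's code.

import tensorflow as tf

def actor_update(actor, critic, states, optimizer):
    """One gradient-ascent step on J(mu_theta) per (7)-(8): move theta along the
    sample average of grad_theta Q(s, mu_theta(s)) over a batch of states."""
    with tf.GradientTape() as tape:
        q_values = critic([states, actor(states, training=True)], training=True)
        loss = -tf.reduce_mean(q_values)   # ascend E[Q] by descending -E[Q]
    grads = tape.gradient(loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))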

III Design of an SMC for the Original System

Since $\Delta f(t,\mathbf{x}(t))$ and $\Delta g(t,\mathbf{x}(t))$, as well as their bounds, are unknown, designing a controller for the original system is not straightforward. First, an SMC is designed for the simplified system to exploit the existing knowledge. Then, RL is used to cope with the uncertainties in the original system while preserving the structure of the sliding mode controller.

To design the SMC for the simplified system, a stable sliding surface is defined as

\hat{\sigma}(\hat{\mathbf{x}})=\sum_{i=1}^{n}a_{i}\,\hat{x}_{i};\quad a_{i}>0,\;\; i=1,2,\ldots,n, (9)

and the controller is chosen as

\hat{u}(t,\hat{\mathbf{x}})=\hat{u}_{c}(t,\hat{\mathbf{x}})+\hat{u}_{eq}(t,\hat{\mathbf{x}}), (10)

where

\begin{gathered}
\hat{u}_{eq}(t,\hat{\mathbf{x}})=\frac{-1}{a_{n}\,g(t,\hat{\mathbf{x}})}\left[\sum_{i=1}^{n-1}a_{i}\,\hat{x}_{i+1}+a_{n}\,f(t,\hat{\mathbf{x}})\right],\\
\hat{u}_{c}(t,\hat{\mathbf{x}})=\frac{-\hat{\mu}}{a_{n}\,g(t,\hat{\mathbf{x}})}\,\mathrm{sign}(\hat{\sigma}).
\end{gathered} (11)
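For concreteness, the nominal control law (9)-(11) can be written in a few lines of NumPy; the sketch below assumes $f$ and $g$ are supplied as Python callables and is not taken from the paper's code.

import numpy as np

def smc_nominal(t, x_hat, a, mu_hat, f, g):
    """Nominal SMC (10)-(11) for the simplified model (2).
    a: sliding-surface coefficients a_1..a_n (all > 0); f, g: known model functions."""
    a = np.asarray(a, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    sigma_hat = a @ x_hat                                      # sliding surface (9)
    u_eq = -(a[:-1] @ x_hat[1:] + a[-1] * f(t, x_hat)) / (a[-1] * g(t, x_hat))
    u_c = -mu_hat * np.sign(sigma_hat) / (a[-1] * g(t, x_hat))
    return u_eq + u_c                                          # u_hat = u_c + u_eq (10)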

The error is defined as

e_{i}(t)=x_{i}(t)-\hat{x}_{i}(t),\quad i=1,2,\ldots,n. (12)

Therefore, the error system is a nonlinear system in the normal form as

\begin{gathered}
\dot{e}_{i}(t)=e_{i+1}(t),\quad i=1,2,\ldots,n-1,\\
\dot{e}_{n}(t)=f(t,\mathbf{x})+\Delta f(t,\mathbf{x})-f(t,\hat{\mathbf{x}})+\left[g(t,\mathbf{x})+\Delta g(t,\mathbf{x})\right]u(t,\mathbf{x})-g(t,\hat{\mathbf{x}})\,\hat{u}(t,\hat{\mathbf{x}}).
\end{gathered} (13)

Now, a new stable sliding surface is defined for the error system as $\sigma=\sum_{i=1}^{n}a_{i}\,e_{i}$. The first-order derivative of $\sigma$ is

\dot{\sigma}=\sum_{i=1}^{n-1}a_{i}\,e_{i+1}+a_{n}\Big[f(t,\mathbf{x})+\Delta f(t,\mathbf{x})-f(t,\hat{\mathbf{x}})+\left[g(t,\mathbf{x})+\Delta g(t,\mathbf{x})\right]u(t,\mathbf{x})-g(t,\hat{\mathbf{x}})\,\hat{u}(t,\hat{\mathbf{x}})\Big]. (14)

With the control law (10) for the simplified model, we consider the controller for the original system in the form of

u(t,\mathbf{x})=\hat{u}(t,\mathbf{x})+u_{1}(t), (15)

where $u_{1}$ compensates for the plant-model mismatch and will be learned by RL. Substituting $\hat{u}$ and $u$ in (14) with (10) and (15), $\dot{\sigma}$ becomes

\begin{gathered}
\dot{\sigma}=\hat{\mu}\left[\mathrm{sign}(\hat{\sigma}(\hat{\mathbf{x}}))-\mathrm{sign}(\hat{\sigma}(\mathbf{x}))\right]-\frac{\Delta g(t,\mathbf{x})}{g(t,\mathbf{x})}\left[a_{n}\,f(t,\mathbf{x})+\sum_{i=1}^{n-1}a_{i}\,x_{i+1}+\hat{\mu}\,\mathrm{sign}(\hat{\sigma}(\mathbf{x}))\right]\\
+a_{n}\,\Delta f(t,\mathbf{x})+a_{n}\left[g(t,\mathbf{x})+\Delta g(t,\mathbf{x})\right]u_{1}(t).
\end{gathered} (16)

To design an SMC for the error system, $u_{1}(t)$ is chosen as

u_{1}(t)=-r(t)-\mu(t)\,\mathrm{sign}(\sigma), (17)

where $r(t)$ and $\mu(t)$ need to be designed. Ideally, considering a Lyapunov function candidate $V=\frac{1}{2}\sigma^{\top}\sigma$, one would choose

\begin{split}
r(t)&=\frac{1}{a_{n}\left[g(t,\mathbf{x})+\Delta g(t,\mathbf{x})\right]}\Bigg\{\frac{-\Delta g(t,\mathbf{x})}{g(t,\mathbf{x})}\bigg[a_{n}\,f(t,\mathbf{x})+\sum_{i=1}^{n-1}a_{i}\,x_{i+1}+\hat{\mu}\,\mathrm{sign}(\hat{\sigma}(\mathbf{x}))\bigg]\\
&\quad+a_{n}\,\Delta f(t,\mathbf{x})+\hat{\mu}\left[\mathrm{sign}(\hat{\sigma}(\hat{\mathbf{x}}))-\mathrm{sign}(\hat{\sigma}(\mathbf{x}))\right]\Bigg\},\\
\mu(t)&=\frac{1}{a_{n}\left[g(t,\mathbf{x})+\Delta g(t,\mathbf{x})\right]},
\end{split} (18)

such that $\dot{V}=\sigma^{\top}\dot{\sigma}=-\sigma^{\top}\mathrm{sign}(\sigma)\leq 0$, which guarantees that $\sigma\rightarrow 0$ in finite time. Moreover, (18) reveals the deficiency of the nominal controller (10).

If the original system has no unknown parts ($\Delta f\rightarrow 0$ and $\Delta g\rightarrow 0$) and $u_{1}(t)=0$, then $\mathbf{x}\rightarrow\hat{\mathbf{x}}$ and nothing is left to be designed. However, the original system model is not completely available. The desirable case is when the simplified system is close to the original system, but that might not always be the case. When $\Delta f$ and $\Delta g$ are large, $u_{1}(t)$ is significant, and $\hat{u}(t)$ alone will not achieve the control objective with stability guarantees. To learn $u_{1}(t)$ in the form of (17) without knowledge of $\Delta f$ and $\Delta g$, DDPG is used, as elaborated in the next section.

Remark 2

For a tracking controller design problem using sliding mode for the original system represented in (1), where the desired output for the first state is $y(t)=x_{1}^{ref}(t)$, a new system is first defined based on the simplified system (2) and the desired output by setting $\hat{e}_{1}(t)=\hat{x}_{1}(t)-y(t)$. The resulting dynamics are

\begin{gathered}
\dot{\hat{e}}_{1}=\hat{e}_{2}=\hat{x}_{2}-\dot{y},\\
\dot{\hat{e}}_{2}=\hat{e}_{3}=\hat{x}_{3}-\ddot{y},\\
\vdots\\
\dot{\hat{e}}_{n-1}=\hat{e}_{n}=\hat{x}_{n}-y^{(n-1)},\\
\dot{\hat{e}}_{n}=-y^{(n)}+f(t,\mathbf{x}(t))+\Delta f(t,\mathbf{x}(t))+\left[g(t,\mathbf{x}(t))+\Delta g(t,\mathbf{x}(t))\right]u(t).
\end{gathered} (19)

Then, the procedure described above for designing the SMC for the original system can be employed by replacing $\hat{x}_{i}$ with $\hat{e}_{i}$ in equations (9)-(16).

IV Deep Reinforcement Learning Controller Design for the Error System

The deep deterministic policy gradient (DDPG) algorithm was introduced in [11]. In this method, the actor learns a deterministic policy while the critic learns the Q-value function. Since the Q-value update may cause divergence [11], copies of the actor and critic networks are maintained as target networks. DDPG uses soft target updates instead of directly copying weights from the learned networks; hence, the target network weights are updated slowly toward the learned weights. Although soft target updates may slow down the learning process, the stability improvement outweighs the lower learning speed. A major challenge in deterministic policy gradient methods is exploration; adding noise to the deterministic policy can improve exploration and avoid sub-optimal solutions [11]. The added noise can be an Ornstein-Uhlenbeck process [15]. Based on [11], in the DDPG algorithm, the critic updates the Q-network weights to minimize the following loss

L=\mathbf{E}_{s\sim\rho^{\mu}}\Big\{\big[Q^{\phi}(s,\mu_{\theta}(s))-\big(r(s,\mu_{\theta}(s))+\gamma\,Q^{\phi_{t}}(s^{\prime},\mu_{\theta_{t}}(s^{\prime}))\big)\big]^{2}\Big\}, (20)

where $\phi$ represents the Q-network parameters, while $\phi_{t}$ and $\theta_{t}$ are the target Q-network and target actor network parameters, respectively. The sample-based loss can be calculated as

L_{B_{1}}=\frac{1}{|B_{1}|}\sum_{(s,\mu_{\theta}(s),r,s^{\prime})\in B_{1}}\big[Q^{\phi}(s,\mu_{\theta}(s))-\big(r(s,\mu_{\theta}(s))+\gamma\,Q^{\phi_{t}}(s^{\prime},\mu_{\theta_{t}}(s^{\prime}))\big)\big]^{2}, (21)

where $B_{1}$ is a mini-batch of the sampled data and $|B_{1}|$ is the number of samples in the mini-batch.
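In code, (21) amounts to a mean-squared error against a bootstrapped target computed with the target networks. A minimal TensorFlow/Keras sketch is given below; `critic`, `target_critic`, `target_actor`, and the batch layout are illustrative assumptions.

import tensorflow as tf

def critic_loss(critic, target_critic, target_actor, batch, gamma):
    """Sample-based critic loss (21) for a mini-batch B1 of (s, a, r, s') tuples."""
    s, a, r, s_next = batch                                # tensors of shape (|B1|, .)
    a_next = target_actor(s_next)                          # mu_{theta_t}(s')
    y = r + gamma * target_critic([s_next, a_next])        # bootstrapped target
    q = critic([s, a])                                     # Q^phi(s, a)
    return tf.reduce_mean(tf.square(q - tf.stop_gradient(y)))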

Figure 1: Actor network diagram: the custom output layer is designed to create $u_{1}$. Trainable weights of the network (which form the policy parameters $\theta$) are $W^{i}$ and $b^{i}$; $f^{i}$ is the activation function.
Figure 2: Closed-loop configuration: the DDPG network uses the error signal to find the $u_{1}$ that maximizes the reward (i.e., minimizes the weighted sum of squares of the error system states and the input).

For designing the SMC, the structure considered for the actor network is shown in Fig. 1. The output layer is customized to achieve the desired form of the control signal given in (17). Based on Fig. 1, the activation functions of the last layer (before the custom layer) generate two outputs: the one that generates $r$ is linear, while for generating $\mu$, the hyperbolic tangent (tanh) activation function is used. The rectified linear activation function is used for the remaining layers. It is assumed that no information is available about the sign of $g+\Delta g$, so $\mu$ might be positive or negative; since the output of the tanh function lies between $-1$ and $1$, $\mu$ is bounded between $-1$ and $1$. Bounding $\mu$ limits the chattering of the control signal (in case larger bounds on $\mu$ are needed, $\mu$ can be multiplied by a fixed gain). If the sign of $g+\Delta g$ does not change in a vicinity of the origin, the tanh activation function can be replaced by the sigmoid function.
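One way to realize this custom output layer in Keras is sketched below; the two hidden layers, the function name, and the `Lambda` construction are illustrative assumptions (the actual layer sizes used in the paper are listed in Section V).

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_actor(n_states, a_coeffs):
    """Actor with the custom output layer of Fig. 1:
    u1 = -r - mu * sign(sigma), with sigma = sum_i a_i * e_i computed from the input e."""
    a = tf.constant(a_coeffs, dtype=tf.float32)
    e = layers.Input(shape=(n_states,), name="error_states")
    h = layers.Dense(512, activation="relu")(e)
    h = layers.Dense(512, activation="relu")(h)
    r_out = layers.Dense(1, activation="linear", name="r")(h)   # r(t), unbounded
    mu_out = layers.Dense(1, activation="tanh", name="mu")(h)   # mu(t), bounded in (-1, 1)
    u1 = layers.Lambda(
        lambda z: -z[0] - z[1] * tf.sign(tf.reduce_sum(a * z[2], axis=-1, keepdims=True))
    )([r_out, mu_out, e])                                       # Eq. (17)
    return Model(inputs=e, outputs=u1, name="actor")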

The structure of the closed-loop system controlled by the DDPG-based controller is shown in Fig. 2. The DDPG network uses $\mathbf{e}$ as input and generates $u_{1}$ as its output. For DDPG to learn the optimal control signal, a performance index needs to be defined. By sampling the original and simplified system states every $t_{s}$ seconds, the objective of the DDPG algorithm is to maximize the following performance index:

\max_{\theta}\;\sum_{k=0}^{N-1}\left(-\sum_{i=1}^{n}q_{i}\,[e_{i}(k)]^{2}-q_{u}\,[u_{1}(k)]^{2}\right), (22)

where $u_{1}=\mu_{\theta}(s)$, $q_{i}>0$ is a weight indicating the importance of the error system state $e_{i}$ in the optimization problem, $q_{u}\geq 0$ penalizes $u_{1}$ (i.e., large control efforts and hence high-amplitude chattering), and $N$ denotes the episode length. Based on the defined objective function, the reward at step $k$ of each episode is simply taken as $r(\mathbf{e},\mu_{\theta}(\mathbf{e}))=-\sum_{i=1}^{n}q_{i}\,[e_{i}(k)]^{2}-q_{u}\,[u_{1}(k)]^{2}$. It is noted that when $t_{s}\rightarrow 0$, the summation over $k$ in (22) turns into an integral, and the optimal solution results from solving the HJB equations [16].
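For concreteness, the per-step reward corresponding to (22) can be computed as follows (a small sketch; the default weights are illustrative).

def step_reward(e, u1, q=(1.0, 1.0), q_u=0.0):
    """Per-step reward: -(sum_i q_i * e_i^2) - q_u * u1^2, per the objective (22)."""
    return -sum(qi * ei ** 2 for qi, ei in zip(q, e)) - q_u * u1 ** 2

# Example with two error states and no input penalty (as in Section V):
print(step_reward([0.3, -0.1], u1=0.5))   # -(0.09 + 0.01) = -0.1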

The SMC design procedure for the nonlinear error system (13) is summarized as follows.

1:  procedure: SMC design for partially-known nonlinear systems
2:  Input: initial policy network parameters $\theta$, Q-network parameters $\phi$, empty replay buffer $B$, episode length $N$, number of training episodes $N_{ep}$, lower bound on the return $\underline{G}$, initial system state $s_{0}$, and learning rates $\alpha_{c}$, $\alpha_{a}$, and $\tau$.
3:  while $episode<N_{ep}$ do
4:     reset the system ($s\leftarrow s_{0}$)
5:     $step\leftarrow 0$
6:     while the episode is not terminated do
7:        select action $u_{1}$ based on the current state $s$ (with added exploration noise)
8:        apply $u=\hat{u}(\mathbf{x})+u_{1}$ to the original system
9:        observe the next state $s^{\prime}$ and reward $r$, and store $(s,u_{1},r,s^{\prime})$ in the replay buffer $B$
10:       sample a mini-batch $B_{1}$ from $B$
11:       update the Q-network parameters using $\phi\leftarrow\phi-\alpha_{c}\nabla_{\phi}L_{B_{1}}$
12:       update the policy network using $\theta\leftarrow\theta+\alpha_{a}\,\nabla_{\theta}\frac{1}{|B_{1}|}\sum_{s\in B_{1}}Q^{\phi}(s,\mu_{\theta}(s))$
13:       update the target networks using $\theta_{t}\leftarrow\tau\,\theta+(1-\tau)\,\theta_{t}$ and $\phi_{t}\leftarrow\tau\,\phi+(1-\tau)\,\phi_{t}$
14:       $step\leftarrow step+1$
15:       if $step\geq N$ or $return<\underline{G}$ then
16:          the episode is terminated
17:       end if
18:    end while
19:    $episode\leftarrow episode+1$
20: end while
21: end procedure
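The procedure above could be realized, for example, with the TensorFlow/Keras sketch below. The environment interface (`env.reset`, `env.step`), the Adam optimizers, and the Gaussian exploration noise (used here in place of the Ornstein-Uhlenbeck process for brevity) are illustrative assumptions rather than the paper's exact implementation.

import numpy as np
import tensorflow as tf

def soft_update(target_net, net, tau):
    """Polyak averaging (step 13 of the procedure): w_t <- tau*w + (1 - tau)*w_t."""
    for wt, w in zip(target_net.trainable_variables, net.trainable_variables):
        wt.assign(tau * w + (1.0 - tau) * wt)

def train(env, actor, critic, actor_t, critic_t, n_episodes=200, episode_len=70,
          gamma=1.0, tau=5e-3, alpha_a=1e-4, alpha_c=5e-3, batch_size=70,
          return_bound=-20.0, noise_std=0.1):
    """Sketch of the procedure above; `env` is assumed to expose reset() -> s and
    step(u1) -> (s_next, reward), applying u = u_hat(x) + u1 to the plant internally."""
    actor_opt = tf.keras.optimizers.Adam(alpha_a)
    critic_opt = tf.keras.optimizers.Adam(alpha_c)
    buffer = []
    for _ in range(n_episodes):
        s, ep_return = np.asarray(env.reset(), np.float32), 0.0
        for _ in range(episode_len):
            # exploration noise (Gaussian here; the paper uses an OU process)
            u1 = actor(s[None, :]).numpy()[0] + np.random.normal(0.0, noise_std, 1)
            s_next, r = env.step(u1)
            buffer.append((s, u1, r, s_next))
            ep_return += r
            idx = np.random.randint(len(buffer), size=batch_size)
            S, A, R, S2 = (np.array(z, dtype=np.float32)
                           for z in zip(*[buffer[i] for i in idx]))
            R = R.reshape(-1, 1)
            # critic update: minimize the sample-based loss (21)
            with tf.GradientTape() as tape:
                y = R + gamma * critic_t([S2, actor_t(S2)])
                loss_c = tf.reduce_mean(tf.square(critic([S, A]) - tf.stop_gradient(y)))
            critic_opt.apply_gradients(
                zip(tape.gradient(loss_c, critic.trainable_variables), critic.trainable_variables))
            # actor update: ascend grad_theta Q(s, mu_theta(s)) as in (7)
            with tf.GradientTape() as tape:
                loss_a = -tf.reduce_mean(critic([S, actor(S)]))
            actor_opt.apply_gradients(
                zip(tape.gradient(loss_a, actor.trainable_variables), actor.trainable_variables))
            soft_update(actor_t, actor, tau)   # soft target updates (step 13)
            soft_update(critic_t, critic, tau)
            s = np.asarray(s_next, np.float32)
            if ep_return < return_bound:       # early termination (step 15)
                break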
Remark 3

It is noted that the data-driven component of the proposed SMC does not need information about bounds on the uncertain parts of the system model. Instead, it learns to exploit the discrepancy between the simplified model and the original system through interaction with the closed-loop system. Therefore, the proposed SMC is less conservative than traditional robust SMC design approaches in the literature that rely only on bounds on the model uncertainties.

V Simulation Results and Discussion

To evaluate the performance of the proposed RL-based sliding mode controller design approach, a nonlinear mass-spring-damper system is used.

V-1 Case description

The state-space representation of the nonlinear mass-spring-damper shown in Fig. 3 (the original system) is as follows

\begin{gathered}
\dot{x}_{1}=x_{2},\\
\dot{x}_{2}=\frac{1}{m}\left[-c\,x_{2}|x_{2}|-k\,x_{1}-b\,x_{1}^{3}+u\right],
\end{gathered} (23)

where mm is the mass, cc is the damping coefficient for the nonlinear damper, and kk and bb represent the nonlinear spring parameters. The available model of the system is, however, a linear system (i.e., the simplified system) derived based on the available knowledge of the physical system with the following differential equation:

\begin{gathered}
\dot{\hat{x}}_{1}=\hat{x}_{2},\\
\dot{\hat{x}}_{2}=\frac{1}{\hat{m}}\left[-\hat{c}\,\hat{x}_{2}-\hat{k}\,\hat{x}_{1}+\hat{u}\right].
\end{gathered} (24)
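The two models (23)-(24) can be simulated, for instance, with forward-Euler integration as sketched below; the integration scheme and step size are assumptions (the paper does not state its solver), while the parameter values are those listed in Table I below.

import numpy as np

def original_dot(x, u, m=0.8, c=2.2, k=5.5, b=1.5):
    """Right-hand side of the nonlinear mass-spring-damper (23); parameters from Table I."""
    x1, x2 = x
    return np.array([x2, (-c * x2 * abs(x2) - k * x1 - b * x1 ** 3 + u) / m])

def simplified_dot(x, u, m_hat=1.0, c_hat=2.0, k_hat=5.0):
    """Right-hand side of the simplified linear model (24); parameters from Table I."""
    x1, x2 = x
    return np.array([x2, (-c_hat * x2 - k_hat * x1 + u) / m_hat])

def euler_step(rhs, x, u, dt=0.01):
    """One forward-Euler integration step (scheme and step size chosen for illustration)."""
    return x + dt * rhs(x, u)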

The parameter values in equations (23) and (24) are given in Table I. Our goal is to solve a tracking control problem using the proposed RL-based sliding mode control design method. According to Remark 2 and the design process explained in the previous sections, the control law for tracking $x_{1}^{ref}(t)=y(t)=\sin(t)-1$ with the simplified system becomes

\hat{u}(t,\hat{x}_{1},\hat{x}_{2})=\hat{m}\left[\hat{x}_{2}+\cos(t)-\sin(t)+\frac{\hat{k}}{\hat{m}}\hat{x}_{1}+\frac{\hat{c}}{\hat{m}}\hat{x}_{2}\right]-\hat{m}\,\mathrm{sign}(\hat{\sigma}), (25)

when the following sliding surface is used:

\hat{\sigma}=\hat{e}_{1}+\hat{e}_{2}=\hat{x}_{1}+\hat{x}_{2}+1-\sin(t)-\cos(t).

Then, the controller for the original system turns into

u=\hat{u}(t,x_{1},x_{2})-r(t)-\mu(t)\,\mathrm{sign}(\sigma),

where $\sigma=e_{1}+e_{2}$, and DDPG is employed to learn $r(t)$ and $\mu(t)$.
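A small sketch of how the tracking control law above could be evaluated in code is given next; the function names are illustrative, $u_{1}$ is assumed to be supplied by the DDPG actor, and the nominal parameter values are those of Table I.

import numpy as np

def u_hat_tracking(t, x1, x2, m_hat=1.0, c_hat=2.0, k_hat=5.0):
    """Nominal tracking SMC (25), evaluated here on the original system states."""
    sigma_hat = x1 + x2 + 1.0 - np.sin(t) - np.cos(t)   # sliding surface above
    return (m_hat * (x2 + np.cos(t) - np.sin(t) + (k_hat / m_hat) * x1
                     + (c_hat / m_hat) * x2)
            - m_hat * np.sign(sigma_hat))

def total_control(t, x1, x2, u1):
    """Control applied to the plant, u = u_hat + u1 (cf. (15)); u1 comes from the DDPG actor."""
    return u_hat_tracking(t, x1, x2) + u1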

Figure 3: A mass-spring-damper system.
TABLE I: Mass-spring-damper system parameters
parameter    value            parameter    value
$m$          $0.8\,kg$        $\hat{m}$    $1\,kg$
$c$          $2.2\,Ns/m$      $\hat{c}$    $2\,Ns/m$
$k$          $5.5\,N/m$       $\hat{k}$    $5\,N/m$
$b$          $1.5\,N/m^{3}$

V-2 Implementation of DDPG

For implementing DDPG, the Keras package [17] is used. Two networks with the same structure are used as the actor and its target. These networks consist of six layers. The output layer is customized to build the desired form of the control signal $u_{1}$ given in (17) (shown in Fig. 1). Each of the first three layers includes 512 units with the rectified linear activation function, while the fourth layer includes 64 units with a linear activation function. The fifth layer includes two units, and the last (output) layer is customized as shown in Fig. 1. The inputs to the networks are the error system states. The critic network and its target network have identical structures and are divided into two parts: the first part, with the error system states as inputs, consists of three 512-unit hidden layers; the second part also includes three 512-unit hidden layers, but its input is the output of the actor network. The last layers of these two parts are concatenated, and the concatenated output is connected to two 512-unit hidden layers. Finally, the output layer produces a single output. An Ornstein-Uhlenbeck process with standard deviation $\sigma=0.1$ is added to the output of the actor network during learning for exploration.
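A condensed Keras sketch of this two-branch critic is shown below; it uses fewer layers than described above, purely for brevity, and otherwise follows the same structure.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_critic(n_states, n_hidden=512):
    """Two-branch critic (condensed sketch): one branch processes the error states,
    the other the actor output u1; the branches are concatenated and mapped to a
    single Q-value."""
    e_in = layers.Input(shape=(n_states,), name="error_states")
    u_in = layers.Input(shape=(1,), name="u1")
    hs = layers.Dense(n_hidden, activation="relu")(e_in)
    hs = layers.Dense(n_hidden, activation="relu")(hs)
    ha = layers.Dense(n_hidden, activation="relu")(u_in)
    h = layers.Concatenate()([hs, ha])
    h = layers.Dense(n_hidden, activation="relu")(h)
    q = layers.Dense(1, name="q_value")(h)
    return Model(inputs=[e_in, u_in], outputs=q, name="critic")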

V-3 Experimental setting

The hyperparameters used in the simulation are listed in Table II. The reward for each step of an episode is $r(\mathbf{e},\mu_{\theta}(\mathbf{e}))=-e_{1}^{2}-e_{2}^{2}$. To weight the reward at each step equally, $\gamma=1$ is used in the simulation studies. This is a reasonable choice since the controller design procedure is treated as an episodic task (for non-episodic tasks, $\gamma<1$ should be chosen to avoid an unbounded return). The goal of using the DDPG network is for the states of the error system to reach zero within the desired time horizon (here, $T=7\,s$). With $t_{s}=0.1\,s$, the episode length $N$ is $70$. Besides, if the accumulated reward (return) drops below $\underline{G}=-20$, the corresponding episode is terminated during the learning phase. Each episode begins from the initial state $[0,0]^{T}$.

TABLE II: Simulation hyperparameters
parameter        value              parameter     value
$\alpha_{a}$     $10^{-4}$          $\alpha_{c}$  $5\times 10^{-3}$
$\gamma$         $1$                $\tau$        $5\times 10^{-3}$
$N$              $70$               $|B_{1}|$     $70$
$\underline{G}$  $-20$

V-4 Results and discussion

The performance of the proposed controller after convergence is shown in Fig. 4. The first subplot shows the original system states, the tracking reference $x_{1}^{ref}$, the control law calculated using the available simplified model (the model-based controller $\hat{u}(x_{1},x_{2})$), and the output of the DDPG network ($u_{1}$). The second subplot shows the performance of the simplified system under the SMC $\hat{u}(\hat{x}_{1},\hat{x}_{2})$. The last subplot depicts the error system states for two cases: 1) when the DDPG network is employed to compensate for the unknown parts of the original system dynamics; and 2) when only the control law calculated from the simplified system is used. The simulation results demonstrate the efficacy of the proposed controller design in stabilizing the error system dynamics. It is noted that, in this example, the goal is to track a specific reference ($x_{1}^{ref}$), and the controller successfully tracks the reference (in other words, $e_{1}$ converges to zero). The results reveal the capability of the proposed method to control a partially-known system, in which not only are the dynamics not completely available, but the available knowledge is also inaccurate (here, the constants $\hat{c},\hat{k},\hat{m}$ do not match their true values). Fig. 5 shows the return ($G_{0}$) at each episode during the learning process; as observed, after about 175 episodes, a proper action is found.

Figure 4: Closed-loop system performance, where the proposed SMC is used to control the original system in (23). The first subplot shows the original system states and the system input ($u=\hat{u}_{eq}+u_{1}$). The second subplot shows the simplified system states, while the last subplot depicts the error system states. The goal of the proposed SMC is for the states of the error system to converge to zero quickly.
Figure 5: Return ($G_{0}$) vs. episodes: after about 175 episodes, DDPG learns a suitable action for the system.

To demonstrate the generalization capability of the proposed controller, the learned controller is evaluated when the initial condition of the system changes. The controller is trained with $[0\;\;0]^{T}$ as the initial state, while its performance is evaluated with the initial condition $[2\;\,-1]^{T}$. From the results shown in Fig. 6, it is observed that the proposed control design approach achieves successful tracking of the reference, although the initial state used for evaluation differs from the one used for learning the controller.

Figure 6: Evaluation of the generalization capability of the proposed controller. The controller learns a suitable control law starting from the state $[0\;\;0]^{T}$; however, it shows strong performance even with a non-zero initial condition. In this simulation, the initial condition is $[2\;\,-1]^{T}$.

VI Conclusion

In this paper, model-based and data-driven control design approaches were fused to build a sliding mode controller for a class of partially-known nonlinear systems. A deterministic policy gradient approach (known as DDPG) was employed to cope online with the mismatch between the available model of the system and the actual system dynamics. A procedure for designing such a controller was proposed, and the performance of the design approach was evaluated through simulation studies.

References

  • [1] J.-J. E. Slotine and J. Karl Hedrick, “Robust input-output feedback linearization,” International Journal of Control, vol. 57, no. 5, pp. 1133–1139, 1993.
  • [2] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos, Nonlinear and adaptive control design. John Wiley & Sons, Inc., 1995.
  • [3] Y. Shtessel, C. Edwards, L. Fridman, and A. Levant, Sliding mode control and observation, vol. 10. Springer, 2014.
  • [4] Y. Bao, J. M. Velni, and M. Shahbakhti, “Epistemic uncertainty quantification in state-space LPV model identification using Bayesian neural networks,” IEEE Control Systems Letters, vol. 5, no. 2, pp. 719–724, 2021.
  • [5] C. Edwards and S. Spurgeon, Sliding mode control: theory and applications. CRC Press, 1998.
  • [6] Y. Bao and J. M. Velni, “Model-free control design using policy gradient reinforcement learning in LPV framework,” in 2021 European Control Conference (ECC), pp. 150–155, IEEE, 2021.
  • [7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
  • [8] Y. Ma, W. Zhu, M. G. Benton, and J. Romagnoli, “Continuous control of a polymerization system with deep reinforcement learning,” Journal of Process Control, vol. 75, pp. 40–47, 2019.
  • [9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [10] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning, pp. 387–395, PMLR, 2014.
  • [11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [12] Q.-Y. Fan and G.-H. Yang, “Adaptive actor–critic design-based integral sliding-mode control for partially unknown nonlinear systems with input disturbances,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 1, pp. 165–177, 2015.
  • [13] H. Zhang, Q. Qu, G. Xiao, and Y. Cui, “Optimal guaranteed cost sliding mode control for constrained-input nonlinear systems with matched and unmatched disturbances,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2112–2126, 2018.
  • [14] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
  • [15] G. E. Uhlenbeck and L. S. Ornstein, “On the theory of the Brownian motion,” Physical Review, vol. 36, no. 5, p. 823, 1930.
  • [16] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, 1995.
  • [17] F. Chollet et al., “Keras,” https://keras.io, 2015.