

This work was partially supported by JST CREST Grant Number JPMJCR2012, Japan and JSPS KAKENHI Grant Number JP21J10780, Japan.


Corresponding author: Junya Ikemoto (e-mail: ikemoto@hopf.sys.es.osaka-u.ac.jp).

Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation

JUNYA IKEMOTO and TOSHIMITSU USHIO, Graduate School of Engineering Science, Osaka University, Toyonaka 560-8531, Japan (e-mail: ikemoto@hopf.sys.es.osaka-u.ac.jp; ushio@sys.es.osaka-u.ac.jp)
Abstract

Deep reinforcement learning (DRL) has attracted much attention as an approach to solving optimal control problems without mathematical models of systems. In general, however, constraints may be imposed on such problems. In this study, we consider optimal control problems with constraints that require the completion of temporal control tasks. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify properties of continuous signals within bounded time intervals. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which we call a $\tau$-CMDP. We formulate the STL-constrained optimal control problem as a $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.

Index Terms:
Constrained Reinforcement Learning, Deep Reinforcement Learning, Lagrangian Relaxation, Signal Temporal Logic.

I Introduction

Reinforcement learning (RL) is a machine learning method for sequential decision-making problems [1]. In RL, a learner, called an agent, interacts with an environment and automatically learns a desired policy. Recently, RL with deep neural networks (DNNs) [2], called deep RL (DRL), has attracted much attention for solving complicated decision-making problems such as playing video games [3]. DRL has been studied in various fields, and many practical applications of DRL have been proposed [4, 5, 6]. On the other hand, when we apply RL or DRL to a real-world problem, we must specify the state space of the environment beforehand. The states of the environment need to include sufficient information to determine a desired action at each time step. Additionally, we must design a reward function for the task. If the reward function does not evaluate behaviors precisely, the learned policy may not be appropriate for the task.

Recently, controller design methods for temporal control tasks such as periodic, sequential, or reactive tasks have been studied in the control systems community [7]. In these studies, linear temporal logic (LTL) has often been used. LTL is a temporal logic developed as a formal method in the computer science community [8], and it can express a temporal control task as a logical formula.

LTL has also been applied to RL for temporal control tasks [9]. By using RL, we can obtain a policy that completes a temporal control task described by an LTL formula without a mathematical model of the system. The given LTL formula is transformed into an $\omega$-automaton, that is, a finite-state machine that accepts all traces satisfying the LTL formula. The transformed automaton can express states that include sufficient information to complete the temporal control task. We regard the pair of a system's state and an automaton's state as the environment's state for RL. The reward function for the temporal control task is designed based on the acceptance condition of the transformed automaton. Additionally, DRL algorithms for satisfying LTL formulae have been proposed in order to solve problems with continuous state-action spaces [10, 11].

In real-world problems, it is often necessary to describe temporal control tasks with time bounds. Unfortunately, LTL cannot express such time bounds. In these cases, metric interval temporal logic (MITL) and signal temporal logic (STL) are useful [12]. MITL is an extension of LTL that has time-constrained temporal operators. Furthermore, STL is an extension of MITL. Whereas LTL and MITL have predicates over Boolean signals, STL has inequality-formed predicates over real-valued signals, which is useful for specifying a dynamical system's trajectories within bounded time intervals. Additionally, STL has a quantitative semantics called robustness that evaluates how well a system's trajectory satisfies a given STL formula [13]. In the control systems community, controller design methods for tasks described by STL formulae have been proposed [14, 15], where the control problems are formulated as constrained optimization problems using models of systems. Model-free RL-based controller design methods have also been proposed [16, 17, 18, 19]. In [16], Aksaray et al. proposed a Q-learning algorithm for satisfying a given STL formula. The satisfaction of the given STL formula is evaluated on a finite trajectory of the system. Thus, as the environment's state for a temporal control task, we use the extended state consisting of the current system's state and the previous system's states instead of using an automaton as in [9]. Additionally, we design a reward function using the robustness of the given formula. In [17], Venkataraman et al. proposed a tractable learning method using a flag state instead of the previous system's state sequence to reduce the dimensionality of the environment's state space. However, these methods cannot be directly applied to problems with continuous state-action spaces because they are based on the classical tabular Q-learning algorithm. For problems with continuous spaces, in [18], Balakrishnan et al. introduced a partial signal and applied a DRL algorithm to design a controller that partially satisfies a given STL specification, and, in [19], we proposed a DRL-based design of a network controller that completes an STL control task under network delays.

On the other hand, for some control problems, we aim to design a policy that optimizes a given control performance index under a constraint described by an STL formula. For example, in practical applications, we may want to operate a system so that it satisfies a given STL formula with minimum fuel cost. In this study, we aim to obtain the optimal policy for a given control performance index among the policies satisfying a given STL formula, without a mathematical model of the system.

I-A Contribution:

The main contribution is to propose a DRL algorithm to obtain an optimal policy for a given control performance index such as fuel costs under a constraint described by an STL formula. Our proposed algorithm has the following three advantages.

  1. We directly solve control problems with continuous state-action spaces. We apply DRL algorithms designed for problems with continuous spaces, such as deep deterministic policy gradient (DDPG) [20] and soft actor-critic (SAC) [21].

  2. We obtain a policy that not only satisfies a given STL formula but also is optimal with respect to a given control performance index. We consider the optimal control problem constrained by a given STL formula and formulate the problem as a constrained Markov decision process (CMDP) [22]. In the CMDP problem, we introduce two reward functions: one for the given control performance index and the other for the given STL constraint. To solve the CMDP problem, we apply a constrained DRL (CDRL) algorithm with the Lagrangian relaxation [23]. In this algorithm, we relax the CMDP problem into an unconstrained problem using a Lagrange multiplier so that standard DRL algorithms for problems with continuous spaces can be utilized.

  3. We introduce a two-phase learning algorithm in order to make it easier to learn a policy satisfying the given STL formula. In a CMDP problem, it is important to satisfy the given constraint. The agent needs many experiences satisfying the given STL formula in order to learn how to satisfy it. However, it is difficult to collect such experiences while considering both the control performance index and the STL constraint in the early learning stage, since the agent may prioritize optimizing its policy with respect to the control performance index. Thus, in the first phase, the agent learns its policy without the control performance index in order to easily obtain experiences satisfying the STL constraint, which is called pre-training. After obtaining many experiences satisfying the STL formula, in the second phase, the agent learns the optimal policy for the control performance index under the STL constraint, which is called fine-tuning.

Through simulations, we demonstrate the learning performance of the proposed algorithm.

I-B Related works:

I-B1 Classical RL for satisfying STL formulae

Aksaray et al. proposed a method to design policies satisfying STL formulae based on the Q-learning algorithm [16]. However, in this method, the dimensionality of the environment's state space tends to be large. Thus, Venkataraman et al. proposed a tractable learning method to reduce the dimensionality [17]. Furthermore, Kalagarla et al. proposed an STL-constrained RL algorithm using a CMDP formulation and an online learning method [24]. However, since these are tabular approaches, we cannot directly apply them to problems with continuous spaces.

I-B2 DRL for satisfying STL formulae

DRL algorithms for satisfying STL formulae have been proposed [18, 19]. However, these studies focused on satisfying a given STL formula as the main objective. On the other hand, in this study, we regard the given STL formula as a constraint of a control problem and tackle the STL-constrained optimal control problem using a CDRL algorithm with the Lagrangian relaxation.

I-B3 Learning with demonstrations for satisfying STL formulae

Learning methods with demonstrations have been proposed [25, 26]. These methods design a reward function from demonstrations, that is, they are imitation learning methods. On the other hand, in this study, we do not use demonstrations to design a reward function for satisfying STL formulae. Instead, we design the reward function for satisfying STL formulae using the robustness and the log-sum-exp approximation [16].

I-C Structure:

The remainder of this paper is organized as follows. In Section II, we briefly review STL and the Q-learning algorithm for learning a policy satisfying STL formulae. In Section III, we formulate an optimal control problem under a constraint described by an STL formula as a CMDP problem. In Section IV, we propose a CDRL algorithm with the Lagrangian relaxation to solve the CMDP problem, where we relax the CMDP problem into an unconstrained problem using a Lagrange multiplier so that DRL algorithms for unconstrained problems with continuous spaces can be utilized. In Section V, we demonstrate the usefulness of the proposed method by numerical simulations. In Section VI, we conclude the paper and discuss future work.

I-D Notation:

$\mathbb{N}_{\geq 0}$ is the set of nonnegative integers. $\mathbb{R}$ is the set of real numbers. $\mathbb{R}_{\geq 0}$ is the set of nonnegative real numbers. $\mathbb{R}^{n}$ is the $n$-dimensional Euclidean space. For a set $A\subset\mathbb{R}$, $\max A$ and $\min A$ are the maximum and minimum values in $A$ if they exist, respectively.

II Preliminaries

II-A Signal temporal logic

We consider the following discrete-time stochastic dynamical system.

$$x_{k+1}=f(x_{k},a_{k})+\Delta_{w}w_{k}, \tag{1}$$

where $x_{k}\in\mathcal{X}$, $a_{k}\in\mathcal{A}$, and $w_{k}\in\mathcal{W}$ are the system's state, the agent's control action, and the system noise at time $k\in\{0,1,...\}$, respectively. $\mathcal{X}=\mathbb{R}^{n_{x}}$, $\mathcal{A}\subseteq\mathbb{R}^{n_{a}}$, and $\mathcal{W}=\mathbb{R}^{n_{x}}$ are the system's state space, the control action space, and the system noise space, respectively. The system noise $w_{k}$ is an independent and identically distributed random variable with a probability density $p_{w}:\mathcal{W}\to\mathbb{R}_{\geq 0}$. $\Delta_{w}$ is a nonsingular matrix that weights the system noise. $f:\mathcal{X}\times\mathcal{A}\to\mathcal{X}$ is a function that describes the system dynamics. Then, we have the transition probability density $p_{f}(x^{\prime}|x,a):=|\Delta_{w}^{-1}|p_{w}(\Delta_{w}^{-1}(x^{\prime}-f(x,a)))$. The initial state $x_{0}\in\mathcal{X}$ is sampled from a probability density $p_{0}:\mathcal{X}\to\mathbb{R}_{\geq 0}$. For a finite system trajectory of length $K+1$, $x_{k_{1}:k_{2}}$ denotes the partial trajectory over the time interval $[k_{1},k_{2}]$, where $0\leq k_{1}\leq k_{2}\leq K$.

STL is a specification formalism that allows us to express real-time properties of real-valued trajectories of systems [12]. We consider the following syntax of STL.

$$\begin{aligned}
\Phi &::= G_{[0,K_{e}]}\phi \mid F_{[0,K_{e}]}\phi,\\
\phi &::= G_{[k_{s},k_{e}]}\varphi \mid F_{[k_{s},k_{e}]}\varphi \mid \phi\land\phi \mid \phi\lor\phi,\\
\varphi &::= \psi \mid \lnot\varphi \mid \varphi\land\varphi \mid \varphi\lor\varphi,
\end{aligned}$$

where $K_{e}$, $k_{s}$, and $k_{e}\in\mathbb{N}_{\geq 0}$ are nonnegative constants for the time bounds; $\Phi,\phi,\varphi$, and $\psi$ are STL formulae; $\psi$ is a predicate of the form $h(x)\leq d$, where $h:\mathcal{X}\to\mathbb{R}$ is a function of the system's state and $d\in\mathbb{R}$ is a constant. The Boolean operators $\lnot$, $\land$, and $\lor$ are negation, conjunction, and disjunction, respectively. The temporal operators $G_{\mathcal{T}}$ and $F_{\mathcal{T}}$ refer to Globally (always) and Finally (eventually), respectively, where $\mathcal{T}$ denotes the time bound of the temporal operator. Formulae of the form $\phi_{i}=G_{[k_{s}^{i},k_{e}^{i}]}\varphi_{i}$ or $\phi_{i}=F_{[k_{s}^{i},k_{e}^{i}]}\varphi_{i}$, $i\in\{1,2,...,M\}$, are called STL sub-formulae, and $\phi$ comprises multiple STL sub-formulae $\{\phi_{i}\}_{i=1}^{M}$.

The Boolean semantics of STL is recursively defined as follows:

$$\begin{aligned}
x_{k:K}\models\psi &\Leftrightarrow h(x_{k})\leq d,\\
x_{k:K}\models\lnot\psi &\Leftrightarrow \lnot(x_{k:K}\models\psi),\\
x_{k:K}\models\phi_{1}\land\phi_{2} &\Leftrightarrow x_{k:K}\models\phi_{1}\ \land\ x_{k:K}\models\phi_{2},\\
x_{k:K}\models\phi_{1}\lor\phi_{2} &\Leftrightarrow x_{k:K}\models\phi_{1}\ \lor\ x_{k:K}\models\phi_{2},\\
x_{k:K}\models G_{[k_{s},k_{e}]}\phi &\Leftrightarrow x_{k^{\prime}:K}\models\phi,\ \forall k^{\prime}\in[k+k_{s},k+k_{e}],\\
x_{k:K}\models F_{[k_{s},k_{e}]}\phi &\Leftrightarrow \exists k^{\prime}\in[k+k_{s},k+k_{e}]\ \text{s.t. } x_{k^{\prime}:K}\models\phi.
\end{aligned}$$

The quantitative semantics of STL, which is called robustness, is recursively defined as follows:

$$\begin{aligned}
\rho(x_{k:K},\psi) &= d-h(x_{k}),\\
\rho(x_{k:K},\lnot\psi) &= -\rho(x_{k:K},\psi),\\
\rho(x_{k:K},\phi_{1}\land\phi_{2}) &= \min\{\rho(x_{k:K},\phi_{1}),\rho(x_{k:K},\phi_{2})\},\\
\rho(x_{k:K},\phi_{1}\lor\phi_{2}) &= \max\{\rho(x_{k:K},\phi_{1}),\rho(x_{k:K},\phi_{2})\},\\
\rho(x_{k:K},G_{[k_{s},k_{e}]}\phi) &= \min_{k^{\prime}\in[k+k_{s},k+k_{e}]}\rho(x_{k^{\prime}:K},\phi),\\
\rho(x_{k:K},F_{[k_{s},k_{e}]}\phi) &= \max_{k^{\prime}\in[k+k_{s},k+k_{e}]}\rho(x_{k^{\prime}:K},\phi),
\end{aligned}$$

which quantifies how well the trajectory satisfies the given STL formulae [13].
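For concreteness, the robustness of simple bounded formulae can be computed directly from a finite trajectory by recursing on the definitions above. The following Python sketch is our own illustration (not code from the cited works); it evaluates the robustness of a predicate $h(x)\leq d$ and of the bounded operators $G$ and $F$ over a sampled trajectory.

import numpy as np

# Minimal robustness evaluator for predicates and bounded G/F operators.
def rho_pred(x, h, d):
    """Robustness of the predicate h(x) <= d at a single state x."""
    return d - h(x)

def rho_G(traj, k, ks, ke, h, d):
    """Robustness of G_[ks,ke] (h(x) <= d) evaluated at time k."""
    return min(rho_pred(traj[kp], h, d) for kp in range(k + ks, k + ke + 1))

def rho_F(traj, k, ks, ke, h, d):
    """Robustness of F_[ks,ke] (h(x) <= d) evaluated at time k."""
    return max(rho_pred(traj[kp], h, d) for kp in range(k + ks, k + ke + 1))

# Example: trajectory of a 1-D system and the predicate x <= 2.0.
traj = np.array([0.5, 1.0, 1.8, 2.5, 1.2])
h = lambda x: x          # h(x) = x
d = 2.0
print(rho_G(traj, 0, 0, 4, h, d))  # negative: the predicate is violated at k = 3
print(rho_F(traj, 0, 0, 4, h, d))  # positive: the predicate holds at some time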

The horizon length of an STL formula is recursively defined as follows:

$$\begin{aligned}
\text{hrz}(\psi) &= 0,\\
\text{hrz}(\phi) &= k_{e},\quad\text{for }\phi=G_{[k_{s},k_{e}]}\varphi\ \text{or }\phi=F_{[k_{s},k_{e}]}\varphi,\\
\text{hrz}(\lnot\phi) &= \text{hrz}(\phi),\\
\text{hrz}(\phi_{1}\land\phi_{2}) &= \max\{\text{hrz}(\phi_{1}),\text{hrz}(\phi_{2})\},\\
\text{hrz}(\phi_{1}\lor\phi_{2}) &= \max\{\text{hrz}(\phi_{1}),\text{hrz}(\phi_{2})\},\\
\text{hrz}(G_{[k_{s},k_{e}]}\phi) &= k_{e}+\text{hrz}(\phi),\\
\text{hrz}(F_{[k_{s},k_{e}]}\phi) &= k_{e}+\text{hrz}(\phi).
\end{aligned}$$

$\text{hrz}(\phi)$ is the length of the state sequence required to verify the satisfaction of the STL formula $\phi$.

II-B Q-learning for Satisfying STL Formulae

In this section, we review the Q-learning algorithm to learn a policy satisfying a given STL formula [16]. Although we often regard the current state of the dynamical system (1) as the environment’s state for RL, the current system’s state is not enough to determine an action for satisfying a given STL formula. Thus, Aksaray et al. defined the following extended state using previous system’s states.

$$z_{k}=[x_{k-\tau+1}^{\top}\ x_{k-\tau+2}^{\top}\ ...\ x_{k}^{\top}]^{\top}\in\mathcal{Z},$$

where $\tau=\text{hrz}(\phi)+1$ for the given STL formula $\Phi=G_{[0,K_{e}]}\phi$ (or $\Phi=F_{[0,K_{e}]}\phi$) and $\mathcal{Z}$ is an extended state space. We show a simple example in Fig. 1. We operate a one-dimensional dynamical system to satisfy the STL formula

$$\Phi=G_{[0,10]}(F_{[0,3]}(-2.5\leq x\leq 0)\land F_{[0,3]}(0\leq x\leq 2.5)).$$

At any time in the interval $[0,10]$, the system should enter both the blue region and the green region within 3 time steps, with no constraint on the order of the visits. Suppose the current system's state is $x_{k}=1.5$. Note that the desired action for the STL formula differs depending on the past state sequence. For example, if $x_{k-3:k}=-0.5,0.5,1.0,1.5$, we should steer the system into the blue region right away. On the other hand, if $x_{k-3:k}=-1.5,-2.5,-0.5,1.5$, we do not need to move it. Thus, we regard not only the current system's state but also the previous system's states as the environment's state for RL.

Figure 1: Illustration of a simple example of temporal control tasks described by STL formulae.

Additionally, Aksaray et al. designed the reward function $R_{STL}:\mathcal{Z}\to\mathbb{R}$ using the robustness and the log-sum-exp approximation. The robustness of a trajectory $x_{0:K}$ with respect to the given STL formula $\Phi$ is as follows:

$$\rho(x_{0:K},\Phi)=\begin{cases}\min\{\rho(x_{0:\tau-1},\phi),\ ...,\ \rho(x_{K-\tau+1:K},\phi)\}=\min\{\rho(z_{\tau-1},\phi),\ ...,\ \rho(z_{K},\phi)\}&\text{for }\Phi=G_{[0,K_{e}]}\phi,\\ \max\{\rho(x_{0:\tau-1},\phi),\ ...,\ \rho(x_{K-\tau+1:K},\phi)\}=\max\{\rho(z_{\tau-1},\phi),\ ...,\ \rho(z_{K},\phi)\}&\text{for }\Phi=F_{[0,K_{e}]}\phi.\end{cases} \tag{2}$$

We consider the following problem.

$$\max_{\pi}\mathrm{Pr}\left[x_{0:K}^{\pi}\models\Phi\right]=\max_{\pi}E\left[\bm{1}(\rho(x_{0:K}^{\pi},\Phi))\right], \tag{3}$$

where $x_{0:K}^{\pi}$ is the system's trajectory controlled by the policy $\pi$ and the function $\bm{1}:\mathbb{R}\to\{0,1\}$ is an indicator defined by

$$\bm{1}(y)=\begin{cases}1&\text{if }y\geq 0,\\ 0&\text{if }y<0.\end{cases} \tag{4}$$

Since $\bm{1}(\min\{y_{1},...,y_{n}\})=\min\{\bm{1}(y_{1}),...,\bm{1}(y_{n})\}$ and $\bm{1}(\max\{y_{1},...,y_{n}\})=\max\{\bm{1}(y_{1}),...,\bm{1}(y_{n})\}$,

$$\max_{\pi}E\left[\bm{1}(\rho(x_{0:K}^{\pi},\Phi))\right]=\begin{cases}\max_{\pi}E\left[\bm{1}(\min_{\tau-1\leq k\leq K}\rho(z_{k},\phi))\right]=\max_{\pi}E\left[\min_{\tau-1\leq k\leq K}\bm{1}(\rho(z_{k},\phi))\right]&\text{for }\Phi=G_{[0,K_{e}]}\phi,\\ \max_{\pi}E\left[\bm{1}(\max_{\tau-1\leq k\leq K}\rho(z_{k},\phi))\right]=\max_{\pi}E\left[\max_{\tau-1\leq k\leq K}\bm{1}(\rho(z_{k},\phi))\right]&\text{for }\Phi=F_{[0,K_{e}]}\phi.\end{cases} \tag{5}$$

Then, we use the following log-sum-exp approximation.

$$\min\{y_{1},\ ...,\ y_{n}\} \simeq -\frac{1}{\beta}\log\sum_{i=1}^{n}\exp(-\beta y_{i}), \tag{6}$$
$$\max\{y_{1},\ ...,\ y_{n}\} \simeq \frac{1}{\beta}\log\sum_{i=1}^{n}\exp(\beta y_{i}), \tag{7}$$

where $\beta>0$ is an approximation parameter. We can approximate $\min\{\cdots\}$ or $\max\{\cdots\}$ with arbitrary accuracy by selecting a sufficiently large $\beta$. Then, (5) can be approximated as follows:

$$\max_{\pi}E[\bm{1}(\rho(x_{0:K}^{\pi},\Phi))]\simeq\begin{cases}\max_{\pi}E\left[-\frac{1}{\beta}\log\sum_{k=\tau-1}^{K}\exp(-\beta\bm{1}(\rho(z_{k},\phi)))\right]&\text{for }\Phi=G_{[0,K_{e}]}\phi,\\ \max_{\pi}E\left[\frac{1}{\beta}\log\sum_{k=\tau-1}^{K}\exp(\beta\bm{1}(\rho(z_{k},\phi)))\right]&\text{for }\Phi=F_{[0,K_{e}]}\phi.\end{cases}$$

Since the $\log$ function is strictly monotonic and $\beta>0$ is a constant, we have

$$\begin{cases}\max_{\pi}E\left[-\frac{1}{\beta}\log\sum_{k=\tau-1}^{K}\exp(-\beta\bm{1}(\rho(z_{k},\phi)))\right]&\text{for }\Phi=G_{[0,K_{e}]}\phi,\\ \max_{\pi}E\left[\frac{1}{\beta}\log\sum_{k=\tau-1}^{K}\exp(\beta\bm{1}(\rho(z_{k},\phi)))\right]&\text{for }\Phi=F_{[0,K_{e}]}\phi,\end{cases}$$
$$\Leftrightarrow\begin{cases}\max_{\pi}E\left[\sum_{k=\tau-1}^{K}-\exp(-\beta\bm{1}(\rho(z_{k},\phi)))\right]&\text{for }\Phi=G_{[0,K_{e}]}\phi,\\ \max_{\pi}E\left[\sum_{k=\tau-1}^{K}\exp(\beta\bm{1}(\rho(z_{k},\phi)))\right]&\text{for }\Phi=F_{[0,K_{e}]}\phi.\end{cases}$$

Thus, we use the following reward function $R_{STL}:\mathcal{Z}\to\mathbb{R}$ to satisfy the given STL formula $\Phi$:

$$R_{STL}(z)=\begin{cases}-\exp(-\beta\bm{1}(\rho(z,\phi)))&\text{if }\Phi=G_{[0,K_{e}]}\phi,\\ \exp(\beta\bm{1}(\rho(z,\phi)))&\text{if }\Phi=F_{[0,K_{e}]}\phi.\end{cases} \tag{8}$$
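As an illustration, $R_{STL}$ in (8) can be implemented as a function of the extended state once the robustness of the sub-formula $\phi$ is available. The sketch below is our own; the helper rho_phi, which returns $\rho(z,\phi)$, is a hypothetical placeholder.

import numpy as np

BETA = 10.0  # log-sum-exp approximation parameter

def indicator(y):
    """Indicator function (4): 1 if y >= 0, else 0."""
    return 1.0 if y >= 0.0 else 0.0

def stl_reward(z, rho_phi, outer="G", beta=BETA):
    """STL reward R_STL(z) as in (8).

    rho_phi(z) is assumed to return the robustness of the sub-formula phi
    evaluated on the extended state z (a hypothetical helper).
    """
    ind = indicator(rho_phi(z))
    if outer == "G":       # Phi = G_[0,Ke] phi
        return -np.exp(-beta * ind)
    else:                  # Phi = F_[0,Ke] phi
        return np.exp(beta * ind)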

To design a controller satisfying an STL formula using the Q-learning algorithm, Aksaray et al. proposed the $\tau$-MDP as follows:

Definition 1 ($\tau$-MDP): We consider an STL formula $\Phi=G_{[0,K_{e}]}\phi$ (or $\Phi=F_{[0,K_{e}]}\phi$), where $\text{hrz}(\Phi)=K$ and $\phi$ comprises multiple STL sub-formulae $\phi_{i},\ i\in\{1,2,...,M\}$. Subsequently, we set $\tau=\text{hrz}(\phi)+1$, that is, $K=K_{e}+\tau-1$. A $\tau$-MDP is defined by a tuple $\mathcal{M}_{\tau}=\left<\mathcal{Z},\mathcal{A},p_{0}^{z},p^{z},R_{STL}\right>$, where

  • $\mathcal{Z}\subseteq\mathcal{X}^{\tau}$ is an extended state space that is the environment's state space for RL. The extended state $z\in\mathcal{Z}$ is a vector of multiple system's states $z=[z[0]^{\top}\ z[1]^{\top}\ ...\ z[\tau-1]^{\top}]^{\top}$, $z[i]\in\mathcal{X},\ \forall i\in\{0,1,...,\tau-1\}$.

  • $\mathcal{A}$ is the agent's control action space.

  • $p_{0}^{z}$ is a probability density for the initial extended state $z_{0}$ with $z_{0}[i]=x_{0},\ \forall i\in\{0,1,...,\tau-1\}$, where $x_{0}$ is generated from $p_{0}$.

  • $p^{z}$ is a transition probability density for the extended state. When the system's state is updated by $x^{\prime}\sim p_{f}(\cdot|x,a)$, the extended state is updated by $z^{\prime}\sim p^{z}(\cdot|z,a)$ as follows:

    $$\begin{aligned}z^{\prime}[i] &= z[i+1],\ \forall i\in\{0,1,...,\tau-2\},\\ z^{\prime}[\tau-1] &\sim p_{f}(\cdot|z[\tau-1],a).\end{aligned}$$

    Fig. 2 shows an example of the transition. We consider the sequence of $\tau$ system's states $x_{k-\tau+1},x_{k-\tau+2},...,x_{k}$ as the extended state at time $k$. In the transition, the head system's state $x_{k-\tau+1}$ is removed from the sequence and the other system's states $x_{k-\tau+2},...,x_{k}$ are shifted to the left. After that, the next system's state $x_{k+1}$, sampled from $p_{f}(\cdot|x_{k},a_{k})$, is appended to the tail of the sequence (a minimal code sketch of this shift-and-append update is given after Fig. 2). The next extended state $z_{k+1}$ depends on the current extended state $z_{k}$ and the agent's action $a_{k}$.

  • $R_{STL}:\mathcal{Z}\to\mathbb{R}$ is the STL-reward function defined by (8).

Figure 2: Illustration of an extended state transition. We consider the case $\tau=3$. The next extended state $z_{k+1}$ depends on the current extended state $z_{k}$ and the agent's action $a_{k}$.
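The shift-and-append update of the extended state illustrated in Fig. 2 can be written compactly. The following sketch is our own illustration; next_state(x, a), which samples $x^{\prime}\sim p_{f}(\cdot|x,a)$, is a hypothetical helper provided by the environment.

import numpy as np

def update_extended_state(z, a, next_state):
    """Shift-and-append update of the extended state z = [x_{k-tau+1}, ..., x_k].

    z is an array of shape (tau, n_x); next_state(x, a) samples
    x' ~ p_f(.|x, a) and is assumed to be provided by the environment.
    """
    x_next = next_state(z[-1], a)          # sample the next system's state
    z_next = np.roll(z, -1, axis=0)        # shift the older states to the left
    z_next[-1] = x_next                    # append the new state at the tail
    return z_next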

III Problem formulation

We consider the following optimal policy design problem constrained by a given STL formula $\Phi$, where the system model (1) is unknown:

$$\begin{split}&\text{maximize}_{\pi}\ \ E_{p_{\pi}}\left[\sum_{k=0}^{K}\gamma^{k}R(x_{k},a_{k})\right],\\ &\text{subject to}\ \ \ x_{0:K}\models\Phi,\end{split} \tag{9}$$

where $\gamma\in[0,1)$ is a discount factor and $R:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ is a reward function for a given control performance index. $E_{p_{\pi}}[\cdot]$ is the expectation with respect to the distributions $p_{0}$, $p_{f}$, and $\pi$. We introduce the following $\tau$-CMDP, which is an extension of the $\tau$-MDP [16], to deal with the problem (9).

Definition 2 ($\tau$-CMDP): We consider an STL formula $\Phi=G_{[0,K_{e}]}\phi$ (or $\Phi=F_{[0,K_{e}]}\phi$) as a constraint, where $\text{hrz}(\Phi)=K$ and $\phi$ comprises multiple STL sub-formulae $\phi_{i},\ i\in\{1,2,...,M\}$. Subsequently, we set $\tau=\text{hrz}(\phi)+1$, that is, $K=K_{e}+\tau-1$. A $\tau$-CMDP is defined by a tuple $\mathcal{CM}_{\tau}=\left<\mathcal{Z},\mathcal{A},p_{0}^{z},p^{z},R_{STL},R_{z}\right>$, where

  • $\mathcal{Z}\subseteq\mathcal{X}^{\tau}$ is an extended state space that is the environment's state space for RL. The extended state $z\in\mathcal{Z}$ is a vector of multiple system's states $z=[z[0]^{\top}\ z[1]^{\top}\ ...\ z[\tau-1]^{\top}]^{\top}$, $z[i]\in\mathcal{X},\ \forall i\in\{0,1,...,\tau-1\}$.

  • $\mathcal{A}$ is the agent's control action space.

  • $p_{0}^{z}$ is a probability density for the initial extended state $z_{0}$ with $z_{0}[i]=x_{0},\ \forall i\in\{0,1,...,\tau-1\}$, where $x_{0}$ is generated from $p_{0}$.

  • $p^{z}$ is a transition probability density for the extended state. When the system's state is updated by $x^{\prime}\sim p_{f}(\cdot|x,a)$, the extended state is updated by $z^{\prime}\sim p^{z}(\cdot|z,a)$ as follows:

    $$\begin{aligned}z^{\prime}[i] &= z[i+1],\ \forall i\in\{0,1,...,\tau-2\},\\ z^{\prime}[\tau-1] &\sim p_{f}(\cdot|z[\tau-1],a).\end{aligned}$$

  • $R_{STL}:\mathcal{Z}\to\mathbb{R}$ is the STL-reward function defined by (8) for satisfying the given STL formula $\Phi$.

  • $R_{z}:\mathcal{Z}\times\mathcal{A}\to\mathbb{R}$ is a reward function defined as follows:

    $$R_{z}(z,a)=R(z[\tau-1],a),$$

    where $R:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ is a reward function for a given control performance index.

We design an optimal policy with respect to $R_{z}$ that satisfies the STL formula, using a model-free CDRL algorithm [23]. Then, we define the following functions:

$$\begin{aligned}J(\pi) &= E_{p_{\pi}}\left[\sum_{k=0}^{K}\gamma^{k}R_{z}(z_{k},a_{k})\right],\\ J_{STL}(\pi) &= E_{p_{\pi}}\left[\sum_{k=0}^{K}\gamma^{k}R_{STL}(z_{k})\right],\end{aligned}$$

where $\gamma\in[0,1)$ is a discount factor close to $1$ and $E_{p_{\pi}}[\cdot]$ is the expectation with respect to the distributions $p_{0}$, $p_{f}$, and $\pi$. We reformulate the problem (9) as follows:

$$\pi^{*}\in\arg\max_{\pi}\ J(\pi), \tag{10}$$
$$\text{subject to}\ \ J_{STL}(\pi)\geq l_{STL}, \tag{11}$$

where $l_{STL}\in\mathbb{R}$ is a lower threshold. In this study, $l_{STL}$ is a hyper-parameter for adjusting the satisfiability of the given STL formula: the larger $l_{STL}$ is, the more conservatively the agent learns a policy to satisfy the STL formula. We call the constrained problem with (10) and (11) a $\tau$-CMDP problem. In the next section, we propose a CDRL algorithm with the Lagrangian relaxation to solve the $\tau$-CMDP problem.
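Since $p_{f}$ is unknown, $J(\pi)$ and $J_{STL}(\pi)$ can only be estimated from sampled episodes. The following sketch is our own illustration of a Monte Carlo estimate that can be used to check the constraint (11); the env and policy objects are hypothetical placeholders.

import numpy as np

def estimate_returns(env, policy, gamma, num_episodes=100):
    """Monte Carlo estimates of J(pi) and J_STL(pi).

    env.reset() is assumed to return an initial pre-processed state and
    env.step(a) to return (next_state, r, s, done), where r = R_z and
    s = R_STL; both objects are hypothetical placeholders.
    """
    J, J_stl = [], []
    for _ in range(num_episodes):
        z, done, k = env.reset(), False, 0
        ret_r, ret_s = 0.0, 0.0
        while not done:
            a = policy(z)
            z, r, s, done = env.step(a)
            ret_r += (gamma ** k) * r
            ret_s += (gamma ** k) * s
            k += 1
        J.append(ret_r)
        J_stl.append(ret_s)
    return np.mean(J), np.mean(J_stl)

# The constraint (11) is (approximately) satisfied if
# estimate_returns(env, policy, gamma)[1] >= l_STL.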

IV Deep reinforcement learning under a signal temporal logic constraint

We propose a CDRL algorithm with the Lagrangian relaxation to obtain an optimal policy for the $\tau$-CMDP problem. Our proposed algorithm is based on the DDPG algorithm [20] or the SAC algorithm [21], which are DRL algorithms derived from the Q-learning algorithm for problems with continuous state-action spaces. In both algorithms, we parameterize the agent's policy $\pi$ using a DNN, which is called an actor DNN. The agent updates the parameter vector of the actor DNN based on $J(\pi)$. However, in this problem, the agent cannot directly use $J(\pi)$ since the mathematical model of the system $p_{f}$ is unknown. Thus, we approximate $J(\pi)$ using another DNN, which is called a critic DNN. Additionally, we use the following two techniques proposed in [3]:

  • Experience replay,

  • Target network.

With experience replay, the agent does not update the parameter vectors of the DNNs immediately when obtaining an experience. Instead, the agent stores the obtained experience in the replay buffer $\mathcal{D}$, then randomly selects some experiences from $\mathcal{D}$ and updates the parameter vectors of the DNNs using the selected experiences. Experience replay reduces the correlation among experience data. With the target network technique, we prepare separate DNNs for the critic DNN and the actor DNN, which are called a target critic DNN and a target actor DNN, respectively, and which output the target values for updates of the critic DNN. The parameter vectors of the target DNNs are updated by slowly tracking the parameter vectors of the actor DNN and the critic DNN. If we do not use the target network technique, we need to compute the target value using the current critic DNN, which is called bootstrapping. If we update the critic DNN substantially, the target value computed by the updated critic DNN may change largely, which leads to oscillations of the learning performance. It is known that the target network technique improves the learning stability.

Remark: The standard DRL algorithm based on Q-learning is the DQN algorithm [3]. However, the DQN algorithm cannot handle continuous action spaces due to its DNN architecture.

On the other hand, we cannot directly apply the DDPG algorithm or the SAC algorithm to the $\tau$-CMDP problem since these are algorithms for unconstrained problems. Thus, we consider the following Lagrangian relaxation [27]:

$$\min_{\kappa\geq 0}\max_{\pi}\mathcal{L}(\pi,\kappa), \tag{12}$$

where $\mathcal{L}(\pi,\kappa)$ is the Lagrangian function given by

$$\mathcal{L}(\pi,\kappa)=J(\pi)+\kappa(J_{STL}(\pi)-l_{STL}), \tag{13}$$

and $\kappa\geq 0$ is a Lagrange multiplier. In this way, we relax the constrained problem into an unconstrained problem.
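The min-max problem (12) is typically solved by alternating primal updates of $\pi$ with dual (sub)gradient steps on $\kappa$, projecting $\kappa$ back onto $[0,\infty)$ after each step. A minimal sketch of the dual step, assuming an estimate of $J_{STL}(\pi)$ is available:

def update_multiplier(kappa, J_stl_estimate, l_stl, lr=1e-3):
    """One projected (sub)gradient step on the dual variable kappa.

    Decreasing kappa * (J_STL - l_STL) with respect to kappa means:
    if the constraint is violated (J_STL < l_STL), kappa grows;
    if it is satisfied, kappa shrinks, but never below zero.
    """
    kappa = kappa - lr * (J_stl_estimate - l_stl)
    return max(kappa, 0.0)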

Figure 3: Illustration of an actor DNN for the DDPG-Lagrangian algorithm. In practice, we input the pre-processed state $\hat{z}$ described in Section IV.D to the DNN instead of the extended state $z$.

Figure 4: Illustration of the two critic DNNs (the reward critic DNN and the STL-reward critic DNN). In the DDPG-Lagrangian algorithm, the reward critic DNN and the STL-reward critic DNN estimate the terms $J(\mu_{\theta_{\mu}})$ and $J_{STL}(\mu_{\theta_{\mu}})$ in (13), respectively. In the SAC-Lagrangian algorithm, they estimate the terms $J_{ent}(\pi_{\theta_{\pi}})$ and $J_{STL}(\pi_{\theta_{\pi}})$ in (20), respectively. In practice, we input the pre-processed state $\hat{z}$ described in Section IV.D to the DNN instead of the extended state $z$.

IV-A DDPG-Lagrangian

We parameterize a deterministic policy using a DNN as shown in Fig. 3, which is the actor DNN. Its parameter vector is denoted by $\theta_{\mu}$. In the DDPG-Lagrangian algorithm, the parameter vector $\theta_{\mu}$ is updated by maximizing (13). However, $J(\mu_{\theta_{\mu}})$ and $J_{STL}(\mu_{\theta_{\mu}})$ are unknown. Thus, as shown in Fig. 4, $J(\mu_{\theta_{\mu}})$ and $J_{STL}(\mu_{\theta_{\mu}})$ are approximated by two separate critic DNNs, which are called a reward critic DNN and an STL-reward critic DNN, respectively. The parameter vectors of the reward critic DNN and the STL-reward critic DNN are denoted by $\theta_{r}$ and $\theta_{s}$, respectively. $\theta_{r}$ and $\theta_{s}$ are updated by decreasing the following critic loss functions:

$$J_{rc}(\theta_{r})=E_{(z,a,z^{\prime})\sim\mathcal{D}}\left[\left(Q_{\theta_{r}}(z,a)-t_{r}\right)^{2}\right], \tag{14}$$
$$J_{sc}(\theta_{s})=E_{(z,a,z^{\prime})\sim\mathcal{D}}\left[\left(Q_{\theta_{s}}(z,a)-t_{s}\right)^{2}\right], \tag{15}$$

where $Q_{\theta_{r}}(\cdot,\cdot)$ and $Q_{\theta_{s}}(\cdot,\cdot)$ are the outputs of the reward critic DNN and the STL-reward critic DNN, respectively. The target values $t_{r}$ and $t_{s}$ are given by

$$\begin{aligned}t_{r} &= R_{z}(z,a)+\gamma Q_{\theta_{r}^{-}}(z^{\prime},\mu_{\theta_{\mu}^{-}}(z^{\prime})),\\ t_{s} &= R_{STL}(z)+\gamma Q_{\theta_{s}^{-}}(z^{\prime},\mu_{\theta_{\mu}^{-}}(z^{\prime})).\end{aligned}$$

$Q_{\theta_{r}^{-}}(\cdot,\cdot)$ and $Q_{\theta_{s}^{-}}(\cdot,\cdot)$ are the outputs of the target reward critic DNN and the target STL-reward critic DNN, respectively, and $\mu_{\theta_{\mu}^{-}}(\cdot)$ is the output of the target actor DNN. $\theta_{r}^{-}$, $\theta_{s}^{-}$, and $\theta_{\mu}^{-}$ are the parameter vectors of the target reward critic DNN, the target STL-reward critic DNN, and the target actor DNN, respectively. They are slowly updated by the following soft update:

$$\begin{aligned}\theta_{r}^{-} &\leftarrow \xi\theta_{r}+(1-\xi)\theta_{r}^{-},\\ \theta_{s}^{-} &\leftarrow \xi\theta_{s}+(1-\xi)\theta_{s}^{-},\\ \theta_{\mu}^{-} &\leftarrow \xi\theta_{\mu}+(1-\xi)\theta_{\mu}^{-},\end{aligned} \tag{16}$$

where $\xi>0$ is a sufficiently small positive constant. The agent stores experiences in the replay buffer $\mathcal{D}$ and randomly selects some experiences from $\mathcal{D}$ for updates of $\theta_{r}$ and $\theta_{s}$. $E_{(z,a,z^{\prime})\sim\mathcal{D}}[\cdot]$ is the expected value under the random sampling of experiences from $\mathcal{D}$. In the standard DDPG algorithm [20], the parameter vector of the actor DNN is updated by decreasing

$$J_{a}(\theta_{\mu})=E_{z\sim\mathcal{D}}[-Q_{\theta_{r}}(z,\mu_{\theta_{\mu}}(z))],$$

where $E_{z\sim\mathcal{D}}[\cdot]$ is the expected value with respect to $z$ sampled randomly from $\mathcal{D}$. However, in the DDPG-Lagrangian algorithm, we consider (13) as the objective instead of $J(\mu_{\theta_{\mu}})$. Thus, the parameter vector of the actor DNN $\theta_{\mu}$ is updated by decreasing the following actor loss function:

$$J_{a}(\theta_{\mu})=E_{z\sim\mathcal{D}}[-(Q_{\theta_{r}}(z,\mu_{\theta_{\mu}}(z))+\kappa Q_{\theta_{s}}(z,\mu_{\theta_{\mu}}(z)))]. \tag{17}$$

The Lagrange multiplier $\kappa$ is updated by decreasing the following loss function:

$$J_{L}(\kappa)=E_{z_{0}\sim p_{0}^{z}}\left[\kappa(Q_{\theta_{s}}(z_{0},\mu_{\theta_{\mu}}(z_{0}))-l_{STL})\right], \tag{18}$$

where $E_{z_{0}\sim p_{0}^{z}}[\cdot]$ is the expected value with respect to $p_{0}^{z}$.

Remark: $\kappa$ is a nonnegative parameter adjusting the relative importance of the STL-reward critic DNN against the reward critic DNN in updating the actor DNN. Intuitively, if the agent's policy does not satisfy (11), then we increase $\kappa$, which increases the relative importance of the STL-reward critic DNN. On the other hand, if the agent's policy satisfies (11), then we decrease $\kappa$, which decreases the relative importance of the STL-reward critic DNN.
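For concreteness, one learning step of the DDPG-Lagrangian updates (14), (15), (17), and (18) might look as follows in PyTorch. This is our own sketch, not the authors' implementation; the network objects, optimizers, and the sampled mini-batch (including a batch of initial states z0) are assumed to be given.

import torch

def ddpg_lagrangian_step(batch, actor, actor_targ, critic_r, critic_r_targ,
                         critic_s, critic_s_targ, kappa, l_stl, gamma,
                         opt_actor, opt_r, opt_s, opt_kappa):
    """One DDPG-Lagrangian update; all networks and optimizers are assumed given."""
    z, a, z_next, r, s, z0 = batch  # mini-batch tensors; z0 is a batch of initial states

    # Critic targets (14)-(15), bootstrapped with the target networks.
    with torch.no_grad():
        a_next = actor_targ(z_next)
        t_r = r + gamma * critic_r_targ(z_next, a_next)
        t_s = s + gamma * critic_s_targ(z_next, a_next)

    # Reward critic and STL-reward critic losses.
    loss_r = ((critic_r(z, a) - t_r) ** 2).mean()
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()
    loss_s = ((critic_s(z, a) - t_s) ** 2).mean()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # Actor loss (17): maximize Q_r + kappa * Q_s.
    a_pi = actor(z)
    loss_a = -(critic_r(z, a_pi) + kappa.detach() * critic_s(z, a_pi)).mean()
    opt_actor.zero_grad(); loss_a.backward(); opt_actor.step()

    # Lagrange multiplier loss (18), evaluated at initial states.
    loss_k = (kappa * (critic_s(z0, actor(z0)).detach() - l_stl)).mean()
    opt_kappa.zero_grad(); loss_k.backward(); opt_kappa.step()
    with torch.no_grad():
        kappa.clamp_(min=0.0)  # keep the multiplier nonnegative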

Figure 5: Illustration of an actor DNN with a reparameterization trick. The DNN outputs the mean $\mu_{\theta_{\pi}}(\hat{z})$ and the standard deviation $\sigma_{\theta_{\pi}}(\hat{z})$ for an input $\hat{z}$. We use the reparameterization trick to sample an action, where $\epsilon$ is sampled from a standard normal distribution $\mathcal{N}(0,1)$.

IV-B SAC-Lagrangian

SAC is a maximum entropy DRL algorithm that obtains a policy maximizing both the expected sum of rewards and the expected entropy of the policy. It is known that a maximum entropy algorithm improves exploration by acquiring diverse behaviors and is robust to estimation errors [21]. In the SAC algorithm, we design a stochastic policy $\pi$. We use the following objective with an entropy term instead of $J(\pi)$:

$$J_{ent}(\pi)=E_{p_{\pi}}\left[\sum_{k=0}^{K}\gamma^{k}(R_{z}(z_{k},a_{k})+\alpha\mathcal{H}(\pi(\cdot|z_{k})))\right]=J(\pi)+E_{p_{\pi}}\left[\sum_{k=0}^{K}\gamma^{k}\alpha\mathcal{H}(\pi(\cdot|z_{k}))\right], \tag{19}$$

where $\mathcal{H}(\pi(\cdot|z_{k}))=E_{a\sim\pi}[-\log\pi(a|z_{k})]$ is the entropy of the stochastic policy $\pi$ and $\alpha\geq 0$ is an entropy temperature. The entropy temperature determines the relative importance of the entropy term against the sum of rewards.

We use the Lagrangian relaxation for the SAC algorithm as in [28, 29]. Then, the Lagrangian function with the entropy term is given by

$$\mathcal{L}(\pi,\kappa)=J_{ent}(\pi)+\kappa(J_{STL}(\pi)-l_{STL}). \tag{20}$$

We model the stochastic policy $\pi_{\theta_{\pi}}$ as a Gaussian whose mean and standard deviation are outputted by a DNN with a reparameterization trick [30] as shown in Fig. 5, which is the actor DNN. The parameter vector is denoted by $\theta_{\pi}$. Additionally, we need to estimate $J_{ent}(\pi_{\theta_{\pi}})$ and $J_{STL}(\pi_{\theta_{\pi}})$ to update the parameter vector $\theta_{\pi}$, as in the DDPG-Lagrangian algorithm. Thus, $J_{ent}(\pi_{\theta_{\pi}})$ and $J_{STL}(\pi_{\theta_{\pi}})$ are also approximated by two separate critic DNNs as shown in Fig. 4. Note that, in the SAC-Lagrangian algorithm, the reward critic DNN estimates not only $J(\pi_{\theta_{\pi}})$ but also the entropy term. The parameter vectors are also updated using experience replay and the target network technique. $\theta_{r}$ and $\theta_{s}$ are updated by decreasing the following critic loss functions:

$$J_{rc}(\theta_{r})=E_{(z,a,z^{\prime})\sim\mathcal{D}}\left[\left(Q_{\theta_{r}}(z,a)-\left(r+\gamma V_{\theta_{r}^{-}}(z^{\prime})\right)\right)^{2}\right], \tag{21}$$
$$J_{sc}(\theta_{s})=E_{(z,a,z^{\prime})\sim\mathcal{D}}\left[\left(Q_{\theta_{s}}(z,a)-\left(s+\gamma V_{\theta_{s}^{-}}(z^{\prime})\right)\right)^{2}\right], \tag{22}$$

where $r=R_{z}(z,a)$, $s=R_{STL}(z)$, and $Q_{\theta_{r}}(\cdot,\cdot)$ and $Q_{\theta_{s}}(\cdot,\cdot)$ are the outputs of the reward critic DNN and the STL-reward critic DNN, respectively. The target values are computed by

$$\begin{aligned}V_{\theta_{r}^{-}}(z^{\prime}) &= E_{a^{\prime}\sim\pi_{\theta_{\pi}}}\left[Q_{\theta_{r}^{-}}(z^{\prime},a^{\prime})-\alpha\log\pi_{\theta_{\pi}}(a^{\prime}|z^{\prime})\right],\\ V_{\theta_{s}^{-}}(z^{\prime}) &= E_{a^{\prime}\sim\pi_{\theta_{\pi}}}\left[Q_{\theta_{s}^{-}}(z^{\prime},a^{\prime})\right],\end{aligned}$$

where $Q_{\theta_{r}^{-}}(\cdot,\cdot)$ and $Q_{\theta_{s}^{-}}(\cdot,\cdot)$ are the outputs of the target reward critic DNN and the target STL-reward critic DNN, respectively, and $E_{a^{\prime}\sim\pi_{\theta_{\pi}}}[\cdot]$ is the expected value with respect to $\pi_{\theta_{\pi}}$. Their parameter vectors $\theta_{r}^{-}$ and $\theta_{s}^{-}$ are slowly updated as in (16). In the standard SAC algorithm, the parameter vector of the actor DNN $\theta_{\pi}$ is updated by decreasing

$$J_{a}(\theta_{\pi})=E_{z\sim\mathcal{D},a\sim\pi_{\theta_{\pi}}}[\alpha\log(\pi_{\theta_{\pi}}(a|z))-Q_{\theta_{r}}(z,a)],$$

where $E_{z\sim\mathcal{D},a\sim\pi_{\theta_{\pi}}}[\cdot]$ is the expected value with respect to the experiences $z$ sampled from $\mathcal{D}$ and the stochastic policy $\pi_{\theta_{\pi}}$. However, in the SAC-Lagrangian algorithm, we consider (20) as the objective instead of (19). Thus, the parameter vector of the actor DNN $\theta_{\pi}$ is updated by decreasing the following actor loss function:

$$J_{a}(\theta_{\pi})=E_{z\sim\mathcal{D},a\sim\pi_{\theta_{\pi}}}[\alpha\log(\pi_{\theta_{\pi}}(a|z))-(Q_{\theta_{r}}(z,a)+\kappa Q_{\theta_{s}}(z,a))]. \tag{23}$$

The Lagrange multiplier $\kappa$ is updated by decreasing the following loss function:

$$J_{L}(\kappa)=E_{z_{0}\sim p_{0}^{z},a\sim\pi_{\theta_{\pi}}}\left[\kappa(Q_{\theta_{s}}(z_{0},a)-l_{STL})\right], \tag{24}$$

where $E_{z_{0}\sim p_{0}^{z},a\sim\pi_{\theta_{\pi}}}[\cdot]$ is the expected value with respect to $p_{0}^{z}$ and $\pi_{\theta_{\pi}}$. The entropy temperature $\alpha$ is updated by decreasing the following loss function:

$$J_{temp}(\alpha)=E_{z\sim\mathcal{D},a\sim\pi_{\theta_{\pi}}}\left[\alpha(-\log(\pi_{\theta_{\pi}}(a|z))-\mathcal{H}_{0})\right], \tag{25}$$

where $\mathcal{H}_{0}$ is a lower bound, which is a hyper-parameter. In [21], $\mathcal{H}_{0}$ is selected based on the dimensionality of the action space. Additionally, in the SAC algorithm, to mitigate the positive bias in updates of $\theta_{\pi}$, the double Q-learning technique [31] is adopted, where we prepare two critic DNNs and two target critic DNNs. We also adopt this technique in the SAC-Lagrangian algorithm.
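Similarly, the SAC-Lagrangian actor update (23), the multiplier update (24), and the temperature update (25) can be sketched as follows. This is our own illustration; the actor is assumed to expose a sample(z) method returning a reparameterized action and its log-probability, and log_alpha and kappa are assumed to be scalar tensors with gradients enabled.

import torch

def sac_lagrangian_actor_step(z, z0, actor, critic_r, critic_s,
                              log_alpha, kappa, l_stl, target_entropy,
                              opt_actor, opt_alpha, opt_kappa):
    """Actor, entropy-temperature, and multiplier updates for SAC-Lagrangian."""
    alpha = log_alpha.exp()

    # Actor loss (23): maximize Q_r + kappa * Q_s plus the entropy bonus.
    a, logp = actor.sample(z)                 # reparameterized action and log pi(a|z)
    q = critic_r(z, a) + kappa.detach() * critic_s(z, a)
    loss_a = (alpha.detach() * logp - q).mean()
    opt_actor.zero_grad(); loss_a.backward(); opt_actor.step()

    # Temperature loss (25): alpha * (-log pi - H_0), driving the entropy toward H_0.
    loss_alpha = -(log_alpha.exp() * (logp.detach() + target_entropy)).mean()
    opt_alpha.zero_grad(); loss_alpha.backward(); opt_alpha.step()

    # Multiplier loss (24), evaluated at initial states z0.
    a0, _ = actor.sample(z0)
    loss_k = (kappa * (critic_s(z0, a0).detach() - l_stl)).mean()
    opt_kappa.zero_grad(); loss_k.backward(); opt_kappa.step()
    with torch.no_grad():
        kappa.clamp_(min=0.0)  # keep the multiplier nonnegative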

IV-C Pre-training and fine-tuning

In this study, it is important to satisfy the given STL constraint. In order to learn a policy satisfying a given STL formula, the agent needs many experiences satisfying the formula. However, it is difficult to collect such experiences while considering both the control performance index and the STL constraint in the early learning stage, since the agent may prioritize optimizing its policy with respect to the control performance index. Thus, we propose a two-phase learning algorithm. In the first phase, which is called pre-training, the agent focuses on learning a policy satisfying the given STL formula $\Phi$ so that experiences receiving high STL-rewards are stored in the replay buffer $\mathcal{D}$, that is, the agent learns its policy considering only STL-rewards.

Pre-training for DDPG-Lagrangian

The parameter vector of the actor DNN $\theta_{\mu}$ is updated by decreasing

$$J_{a}(\theta_{\mu})=E_{z\sim\mathcal{D}}\left[-Q_{\theta_{s}}(z,\mu_{\theta_{\mu}}(z))\right] \tag{26}$$

instead of (17). On the other hand, $\theta_{s}$ is updated by (15).

Pre-training for SAC-Lagrangian

The parameter vector of the actor DNN $\theta_{\pi}$ is updated by decreasing

$$J_{a}(\theta_{\pi})=E_{z\sim\mathcal{D},a\sim\pi_{\theta_{\pi}}}\left[\alpha\log(\pi_{\theta_{\pi}}(a|z))-Q_{\theta_{s}}(z,a)\right] \tag{27}$$

instead of (23). On the other hand, $\theta_{s}$ is updated by (22), where $V_{\theta_{s}^{-}}$ is computed by

$$V_{\theta_{s}^{-}}(z^{\prime})=E_{a^{\prime}\sim\pi_{\theta_{\pi}}}[Q_{\theta_{s}^{-}}(z^{\prime},a^{\prime})-\alpha\log(\pi_{\theta_{\pi}}(a^{\prime}|z^{\prime}))].$$

In the second phase, which is called fine-tuning, the agent learns the optimal policy constrained by the given STL formula. In the DDPG-Lagrangian algorithm, the actor DNN parameter $\theta_{\mu}$ is updated by (17). In the SAC-Lagrangian algorithm, the actor DNN parameter $\theta_{\pi}$ is updated by (23).

Remark: The two-phase learning may become temporarily unstable because the objective function changes discontinuously between the phases. In such a case, we may start the second phase by changing the objective function from that used in the first phase smoothly and slowly.

IV-D Pre-process

If $\tau$ is large, it is difficult for the agent to learn its policy due to the large dimensionality of the extended state space. In this case, pre-processing is useful for reducing the dimensionality, which is related to [17]. In the previous study, a flag state for each sub-formula is defined as a discrete state, and the flag discrete state space is combined with the system's discrete state space. On the other hand, in this study, it is assumed that the system's state space is continuous. If we used discrete flag states, the pre-processed state space would be a hybrid state space with both discrete and continuous values. Thus, we regard the flag state as a continuous value and input it to the DNNs as shown in Fig. 6.

Figure 6: Example of constructing a pre-processed state. We consider a 1-dimensional system and the two STL sub-formulae $\phi_{1}=F_{[2,7]}(x\geq 0.0)$ and $\phi_{2}=F_{[0,7]}(x\geq 0.2)$. For each sub-formula, we compute the flag value using the extended state $z_{k}$, which is regarded as a continuous value in $[-0.5,0.5]$. After that, we construct the pre-processed state using $z_{k}[\tau-1](=x_{k})$, $\hat{f}_{k}^{1}$, and $\hat{f}_{k}^{2}$ and input it to the DNNs.

We introduce a flag value $f^{i}$ for each STL sub-formula $\phi_{i}$, where it is assumed that $k_{e}^{i}=\tau-1,\ \forall i\in\{1,2,...,M\}$.

Definition 3 (Pre-process): For an extended state $z$, the flag value $f^{i}$ of an STL sub-formula $\phi_{i}$ is defined as follows:

(i) For $\phi_{i}=G_{[k_{s}^{i},\tau-1]}\varphi_{i}$,

$$f^{i}=\max\left\{\frac{\tau-l}{\tau-k_{s}^{i}}\ \middle|\ l\in\{k_{s}^{i},...,\tau-1\}\land(\forall l^{\prime}\in\{l,...,\tau-1\},\ z[l^{\prime}]\models\varphi_{i})\right\}. \tag{28}$$

(ii) For $\phi_{i}=F_{[k_{s}^{i},\tau-1]}\varphi_{i}$,

$$f^{i}=\max\left\{\frac{l-k_{s}^{i}+1}{\tau-k_{s}^{i}}\ \middle|\ l\in\{k_{s}^{i},...,\tau-1\}\land z[l]\models\varphi_{i}\right\}. \tag{29}$$

Note that $\max\emptyset=-\infty$, so the flag value is a normalized time lying in $(0,1]\cup\{-\infty\}$. Intuitively, for $\phi_{i}=G_{[k_{s}^{i},\tau-1]}\varphi_{i}$, the flag value indicates the duration over which $\varphi_{i}$ has been satisfied continuously, whereas, for $\phi_{i}=F_{[k_{s}^{i},\tau-1]}\varphi_{i}$, the flag value indicates the most recent instant at which $\varphi_{i}$ was satisfied. The flag values $f^{i},\ i\in\{1,2,...,M\}$, calculated by (28) or (29), are transformed into $\hat{f}^{i}$ as follows:

$$\hat{f}^{i}=\begin{cases}f^{i}-\frac{1}{2}&\text{if }f^{i}\neq-\infty,\\ -\frac{1}{2}&\text{otherwise}.\end{cases} \tag{30}$$

The transformed flag values $\hat{f}^{i}$ are used as inputs to the DNNs to prevent positive biases of the flag values and to avoid inputting $-\infty$ to the DNNs. We compute the flag value for each STL sub-formula and construct a flag state $\hat{f}=[\hat{f}^{1}\ \hat{f}^{2}\ ...\ \hat{f}^{M}]^{\top}$, which is called pre-processing. We use the pre-processed state $\hat{z}=[z[\tau-1]^{\top}\ \hat{f}^{\top}]^{\top}$ as an input to the DNNs instead of the extended state $z$.
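A possible implementation of Definition 3 and the transformation (30) for a single sub-formula is sketched below; it is our own illustration, and sat(x), which checks whether a system's state satisfies $\varphi_{i}$, is a hypothetical helper.

import numpy as np

def flag_value(z, ks, sat, kind="F"):
    """Flag value f^i of Definition 3 for sub-formula phi_i with k_e^i = tau - 1.

    z is the extended state (array of shape (tau, n_x)), ks is k_s^i, and
    sat(x) returns True iff x satisfies varphi_i (hypothetical helper).
    """
    tau = len(z)
    candidates = []
    for l in range(ks, tau):
        if kind == "G":
            # (28): varphi_i must hold from l up to tau - 1.
            if all(sat(z[lp]) for lp in range(l, tau)):
                candidates.append((tau - l) / (tau - ks))
        else:
            # (29): varphi_i holds at l.
            if sat(z[l]):
                candidates.append((l - ks + 1) / (tau - ks))
    return max(candidates) if candidates else -np.inf

def transform_flag(f):
    """Transformation (30): shift f to [-0.5, 0.5] and map -inf to -0.5."""
    return f - 0.5 if np.isfinite(f) else -0.5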

Remark: It is important to ensure the Markov property of the pre-processed state so that the agent can learn its policy. If $k_{e}^{i}=\tau-1,\ \forall i\in\{1,2,...,M\}$, then the pre-processed state $\hat{z}$ satisfies the Markov property. We consider the current pre-processed state $\hat{z}=[z[\tau-1]^{\top}\ \hat{f}^{\top}]^{\top}$ and the next pre-processed state $\hat{z}^{\prime}=[z^{\prime}[\tau-1]^{\top}\ \hat{f}^{\prime\top}]^{\top}$. $z^{\prime}[\tau-1]$ is generated by $p_{f}(\cdot|z[\tau-1],a)$, where $a$ is the current action. Therefore, $z^{\prime}[\tau-1]$ depends only on $z[\tau-1]$ and the current action $a$. Each transformed flag value $\hat{f}^{i},\ i\in\{1,2,...,M\}$, is updated as follows:

  1. For $\phi_{i}=G_{[k_{s}^{i},\tau-1]}\varphi_{i}$,

    $$\hat{f}^{i^{\prime}}=\begin{cases}\min\left\{\hat{f}^{i}+\frac{1}{\tau-k_{s}^{i}},\frac{1}{2}\right\},&x^{\prime}\models\varphi_{i},\\ -\frac{1}{2},&x^{\prime}\not\models\varphi_{i},\end{cases} \tag{31}$$

  2. For $\phi_{i}=F_{[k_{s}^{i},\tau-1]}\varphi_{i}$,

    $$\hat{f}^{i^{\prime}}=\begin{cases}\frac{1}{2},&x^{\prime}\models\varphi_{i},\\ \max\left\{\hat{f}^{i}-\frac{1}{\tau-k_{s}^{i}},-\frac{1}{2}\right\},&x^{\prime}\not\models\varphi_{i},\end{cases} \tag{32}$$

where $x^{\prime}\sim p_{f}(\cdot|z[\tau-1],a)$. The transformed flag values are updated using the next system's state. Therefore, the next transformed flag value $\hat{f}^{i^{\prime}},\ i\in\{1,2,...,M\}$, depends on $\hat{f}^{i}$, $z[\tau-1]$, and the current action $a$. Thus, the Markov property of the pre-processed state holds.

Figure 7: Example of a sub-formula $\phi_{i}$ with $k_{e}^{\max}\geq k_{e}^{i}+1$. We consider a 1-dimensional system and $k_{e}^{\max}=7$ ($\tau=8$). For the sub-formula $\phi_{i}=F_{[2,4]}(x\geq 0.0)$, $z_{k+1}[7](=x_{k+1})$ depends on $z_{k}[7](=x_{k})$ and $a_{k}$. However, $\hat{f}_{k+1}^{i}$ depends on $\hat{f}_{k}^{i}$ and $z_{k}[5]$. If the pre-processed state is given by $[z_{k}[7]\ \hat{f}_{k}]^{\top}$, the agent with DNNs observes the environment only partially. Then, the agent also needs $z_{k}[5]$ and $z_{k}[6]$ as parts of the pre-processed state.

Figure 8: Example of a sub-formula $\phi_{j}$ with $k_{e}^{\max}=k_{e}^{j}+1$. We consider a 1-dimensional system and $k_{e}^{\max}=7$ ($\tau=8$). For $\phi_{j}=F_{[2,6]}(x\geq 0.0)$, that is, $k_{e}^{\max}-k_{e}^{j}=1$, the transformed flag value can be updated using $[z_{k}[7]\ \hat{f}_{k}]^{\top}$ only.

On the other hand, in the case where $k_{e}^{\max}\geq k_{e}^{\min}+1$, we must include $z[\tau-k_{e}^{\max}+k_{e}^{\min}],...,z[\tau-1]$ in the pre-processed state $\hat{z}$ in order to ensure the Markov property, where $k_{e}^{\max}=\max_{i\in\{1,2,...,M\}}k_{e}^{i}$ and $k_{e}^{\min}=\min_{i\in\{1,2,...,M\}}k_{e}^{i}$. For example, as shown in Fig. 7, there may be transformed flag values that are updated using information other than $[z[\tau-1]^{\top}\ \hat{f}]^{\top}$ and the current action. Note that, in the case $k_{e}^{\max}=k_{e}^{j}+1$ shown in Fig. 8, the transformed flag value $\hat{f}^{j}$ is updated using $[z[\tau-1]^{\top}\ \hat{f}]^{\top}$ only, that is, the agent with DNNs can learn its policy using $[z[\tau-1]^{\top}\ \hat{f}]^{\top}$ when $k_{e}^{\max}=k_{e}^{\min}+1$. As the difference $k_{e}^{\max}-k_{e}^{\min}$ increases, we need to include more past system's states in the pre-processed state.

For simplicity, in this study, we focus on the case where $k_{e}^{i}=\tau-1,\ \forall i\in\{1,2,...,M\}$. Then, the pre-processing is most effective in terms of reducing the dimensionality of the extended state space.

IV-E Algorithm

Our proposed algorithm to design an optimal policy under the given STL constraint is presented in Algorithm 1. In line 1, we select a DRL algorithm such as the DDPG algorithm or the SAC algorithm. In lines 2 to 4, we initialize the parameter vectors of the DNNs, the entropy temperature (if the algorithm is the SAC-Lagrangian algorithm), and the Lagrange multiplier. In line 5, we initialize a replay buffer $\mathcal{D}$. In line 6, we set the number of pre-training iterations $K_{pre}$. In line 7, we initialize a counter for updates. In line 9, the agent receives an initial state $x_{0}\sim p_{0}$. In lines 10 and 11, the agent sets the initial extended state $z_{0}=[x_{0}^{\top}\ ...\ x_{0}^{\top}]^{\top}$ and computes the pre-processed state $\hat{z}_{0}$. One learning step is performed between lines 13 and 25. In line 13, the agent determines an action $a_{k}$ based on the pre-processed state $\hat{z}_{k}$ for exploration. In line 14, the state of the system changes according to the determined action $a_{k}$, and the agent receives the next state $x_{k+1}$, the reward $r_{k}$, and the STL-reward $s_{k}$. In lines 15 and 16, the agent sets the next extended state $z_{k+1}$ using $x_{k+1}$ and $z_{k}$ and computes the next pre-processed state $\hat{z}_{k+1}$. In line 17, the agent stores the experience $(\hat{z}_{k},a_{k},\hat{z}_{k+1},r_{k},s_{k})$ in the replay buffer $\mathcal{D}$. In line 18, the agent randomly samples $I$ experiences $\{(\hat{z}^{(i)},a^{(i)},\hat{z}^{\prime(i)},r^{(i)},s^{(i)})\}_{i=1}^{I}$ from the replay buffer $\mathcal{D}$. If the learning counter satisfies $c<K_{pre}$, the agent pre-trains the parameter vectors by Algorithm 3: the parameter vectors of the reward critic DNN $\theta_{r}$ and the STL-reward critic DNN $\theta_{s}$ are updated by (14) and (15) (or (21) and (22)), respectively, and the parameter vector of the actor DNN $\theta_{\mu}$ (or $\theta_{\pi}$) is updated by (26) (or (27)). In the SAC-based algorithm, the entropy temperature $\alpha$ is updated by (25). On the other hand, if the learning counter satisfies $c\geq K_{pre}$, the agent fine-tunes the parameter vectors by Algorithm 4: the parameter vector of the actor DNN $\theta_{\mu}$ (or $\theta_{\pi}$) is updated by (17) (or (23)) and the other parameter vectors are updated in the same way as in the case $c<K_{pre}$. The Lagrange multiplier is updated by (18) (or (24)). In line 24, the agent updates the parameter vectors of the target DNNs by (16). In line 25, the learning counter is incremented. The agent repeats the process between lines 13 and 25 in each learning episode.

Algorithm 1 Two-phase DRL-Lagrangian to design an optimal policy under an STL constraint.
1:  Select a DRL algorithm such as DDPG or SAC.
2:  Initialize parameter vectors of main DNNs.
3:  Initialize parameter vectors of target DNNs.
4:  Initialize an entropy temperature and a Lagrange multiplier α,κ\alpha,\ \kappa.
5:  Initialize a replay buffer 𝒟\mathcal{D}.
6:  Set the number of pre-training steps KpreK_{pre}.
7:  Initialize learning counter c0c\leftarrow 0.
8:  for Episode=1,,MAX EPISODE\text{Episode}=1,...,\text{MAX EPISODE} do
9:     Receive an initial state x0p0x_{0}\sim p_{0}.
10:     Set the initial extended state z0z_{0} using x0x_{0}.
11:     Compute the pre-processed state z^0\hat{z}_{0} by Algorithm 2.
12:     for Discrete-time step k=0,,Kk=0,...,K do
13:        Determine an action aka_{k} based on the state z^k\hat{z}_{k}.
14:  Execute aka_{k} and receive the next state xk+1x_{k+1}, the reward rkr_{k}, and the STL-reward sks_{k}.
15:        Set the next extended state zk+1z_{k+1} using xk+1x_{k+1} and zkz_{k}.
16:        Compute the next pre-processed state z^k+1\hat{z}_{k+1} by Algorithm 2.
17:        Store the experience (z^k,ak,z^k+1,rk,sk)(\hat{z}_{k},a_{k},\hat{z}_{k+1},r_{k},s_{k}) in the replay buffer 𝒟\mathcal{D}.
18:  Sample II experiences {(z^(i),a(i),z^(i),r(i),s(i))}i=1,,I\{(\hat{z}^{(i)},a^{(i)},\hat{z}^{\prime(i)},r^{(i)},s^{(i)})\}_{i=1,...,I} from 𝒟\mathcal{D} randomly.
19:        if c<Kprec<K_{pre} then
20:           Pre-training by Algorithm 3.
21:        else
22:           Fine-tuning by Algorithm 4.
23:        end if
24:        Update the target DNNs by (16).
25:        cc+1c\leftarrow c+1.
26:     end for
27:  end for
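To make the control flow of Algorithm 1 concrete, the following Python skeleton mirrors its structure. It is only a sketch under assumed interfaces: the environment object, the agent object, and the routines preprocess, pretrain_step, and finetune_step (standing in for Algorithms 2, 3, and 4) are hypothetical placeholders, not the implementation used in the experiments of Section V.

import random
from collections import deque

def two_phase_training(env, agent, subformulae, preprocess,
                       pretrain_step, finetune_step,
                       max_episode=1000, K=1000, K_pre=300_000, I=64):
    # Skeleton of Algorithm 1 (two-phase DRL-Lagrangian).
    replay = deque(maxlen=100_000)              # replay buffer D (line 5)
    c = 0                                       # learning counter (line 7)
    for episode in range(max_episode):
        x = env.reset()                         # x_0 ~ p_0 (line 9)
        z = agent.initial_extended_state(x)     # z_0 = [x_0, ..., x_0] (line 10)
        z_hat = preprocess(z, subformulae)      # Algorithm 2 (line 11)
        for k in range(K + 1):
            a = agent.explore(z_hat)                          # line 13
            x_next, r, s = env.step(a)                        # line 14
            z_next = agent.shift_extended_state(z, x_next)    # line 15
            z_hat_next = preprocess(z_next, subformulae)      # line 16
            replay.append((z_hat, a, z_hat_next, r, s))       # line 17
            if len(replay) >= I:
                batch = random.sample(replay, I)              # line 18
                if c < K_pre:
                    pretrain_step(batch, agent)               # Algorithm 3
                else:
                    finetune_step(batch, agent)               # Algorithm 4
                agent.soft_update_targets()                   # line 24, cf. (16)
            c += 1                                            # line 25
            z, z_hat = z_next, z_hat_next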

Algorithm 2 Pre-processing of the extended state
1:  Input: The extended state zz and the STL sub-formulae {ϕi}i=1M\{\phi_{i}\}_{i=1}^{M}.
2:  for i=1,,Mi=1,...,M do
3:     if ϕi=G[ksi,τ1]φi\phi_{i}=G_{[k_{s}^{i},\tau-1]}\varphi_{i} then
4:        Compute the flag value fif^{i} by (28).
5:     end if
6:     if ϕi=F[ksi,τ1]φi\phi_{i}=F_{[k_{s}^{i},\tau-1]}\varphi_{i} then
7:        Compute the flag value fif^{i} by (29).
8:     end if
9:  end for
10:  Set the flag state f^=[f^1f^2f^M]\hat{f}=[\hat{f}^{1}\ \hat{f}^{2}\ ...\ \hat{f}^{M}]^{\top}.
11:  Output: The pre-processed state z^=[z[τ1]f^]\hat{z}=[z[\tau-1]^{\top}\ \hat{f}^{\top}]^{\top}.
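As a rough illustration of Algorithm 2, the sketch below computes one plausible form of the flag state. The exact flag values are defined by (28) and (29), which are not reproduced here; the min/max aggregation over the elapsed window and the callable predicate representation are therefore assumptions made only for illustration.

import numpy as np

def preprocess(z, subformulae):
    # z           : array of shape (tau, n) holding the past tau system states.
    # subformulae : list of tuples (kind, k_s, varphi), where kind is "G" or "F",
    #               k_s is the start index of the window, and varphi is a callable
    #               returning the robustness margin of the atomic predicate.
    # The aggregation below (min for G, max for F) is an assumed stand-in for
    # the flag values defined by (28) and (29).
    tau = z.shape[0]
    flags = []
    for kind, k_s, varphi in subformulae:
        margins = [varphi(z[k]) for k in range(k_s, tau)]
        flags.append(min(margins) if kind == "G" else max(margins))
    f_hat = np.array(flags)
    # Pre-processed state: the current system state z[tau-1] and the flag state.
    return np.concatenate([z[tau - 1], f_hat])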
Algorithm 3 Pre-training
1:  Input: The experiences {(z^(i),a(i),z^(i),r(i),s(i))}i=1,2,,I\{(\hat{z}^{(i)},a^{(i)},\hat{z}^{\prime(i)},r^{(i)},s^{(i)})\}_{i=1,2,...,I} and parameters θπ,θr,θs,α\theta_{\pi},\ \theta_{r},\ \theta_{s},\ \alpha.
2:  The parameter vector θr\theta_{r} is updated by (14) or (21).
3:  The parameter vector θs\theta_{s} is updated by (15) or (22).
4:  The parameter vector θπ\theta_{\pi} is updated by (26) or (27).
5:  if SAC-based algorithm then
6:     The entropy temperature α\alpha is updated by (25).
7:  end if
8:  Output: θπ,θr,θs,α\theta_{\pi},\ \theta_{r},\ \theta_{s},\ \alpha
Algorithm 4 Fine-tuning
1:  Input: The experiences {(z^(i),a(i),z^(i),r(i),s(i))}i=1,2,,I\{(\hat{z}^{(i)},a^{(i)},\hat{z}^{\prime(i)},r^{(i)},s^{(i)})\}_{i=1,2,...,I} and parameters θπ,θr,θs,α,κ\theta_{\pi},\ \theta_{r},\ \theta_{s},\ \alpha,\ \kappa.
2:  The parameter vector θr\theta_{r} is updated by (14) or (21).
3:  The parameter vector θs\theta_{s} is updated by (15) or (22).
4:  The parameter vector θπ\theta_{\pi} is updated by (17) or (23).
5:  if SAC-based algorithm then
6:     The entropy temperature α\alpha is updated by (25).
7:  end if
8:  The Lagrange multiplier κ\kappa is updated by (18) or (24).
9:  Output: θπ,θr,θs,α,κ\theta_{\pi},\ \theta_{r},\ \theta_{s},\ \alpha,\ \kappa
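The difference between Algorithms 3 and 4 lies in the actor loss and the dual update of the Lagrange multiplier. The PyTorch sketch below shows a generic Lagrangian-relaxation update of this kind; the exact losses are (17)/(23) and the multiplier update is (18)/(24), so the forms used here (the reward value plus the kappa-weighted STL value for the actor, and projected gradient ascent on the constraint gap for the multiplier) are stand-ins rather than the exact equations.

import torch

def finetune_actor_and_multiplier(actor, critic_r, critic_s, log_kappa,
                                  actor_opt, kappa_opt, z_hat_batch,
                                  l_stl, alpha=0.0):
    kappa = log_kappa.exp()                    # keep the multiplier nonnegative
    a, log_pi = actor.sample(z_hat_batch)      # reparameterized action and log-prob
    q_r = critic_r(z_hat_batch, a)             # reward critic Q_{theta_r}
    q_s = critic_s(z_hat_batch, a)             # STL-reward critic Q_{theta_s}

    # Relaxed actor objective: maximize Q_r + kappa * Q_s (with an optional
    # entropy bonus for the SAC-based variant), i.e., minimize its negation.
    actor_loss = (alpha * log_pi - q_r - kappa.detach() * q_s).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Dual update: increase kappa while the estimated STL value falls short of
    # the threshold l_STL, and decrease it otherwise.
    kappa_loss = -(log_kappa.exp() * (l_stl - q_s.detach().mean()))
    kappa_opt.zero_grad()
    kappa_loss.backward()
    kappa_opt.step()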

V Example

Figure 9: Control of a two-wheeled mobile robot under an STL constraint. The working area is 0.5x(0)4.5, 0.5x(1)4.50.5\leq x^{(0)}\leq 4.5,\ 0.5\leq x^{(1)}\leq 4.5 colored gray. The initial state of the system is sampled randomly in 0.5x(0)2.5, 0.5x(1)2.5,π/2x(2)π/20.5\leq x^{(0)}\leq 2.5,\ 0.5\leq x^{(1)}\leq 2.5,\ -\pi/2\leq x^{(2)}\leq\pi/2 colored red. The region 1 labeled by φ1\varphi_{1} is 3.5x(0)4.5, 3.5x(1)4.53.5\leq x^{(0)}\leq 4.5,\ 3.5\leq x^{(1)}\leq 4.5 and the region 2 labeled by φ2\varphi_{2} is 3.5x(0)4.5, 1.5x(1)2.53.5\leq x^{(0)}\leq 4.5,\ 1.5\leq x^{(1)}\leq 2.5. These regions are colored blue.

We consider STL-constrained optimal control problems for a two-wheeled mobile robot shown in Fig. 9, where its working area Ω\Omega is {(x(0),x(1))| 0.5x(0)4.5, 0.5x(1)4.5}\{(x^{(0)},x^{(1)})|\ 0.5\leq x^{(0)}\leq 4.5,\ 0.5\leq x^{(1)}\leq 4.5\}. Let x(2)x^{(2)} be the steering angle with x(2)[π,π]x^{(2)}\in[-\pi,\pi]. A discrete-time model of the robot is described by

\begin{bmatrix}x_{k+1}^{(0)}\\ x_{k+1}^{(1)}\\ x_{k+1}^{(2)}\end{bmatrix}=\begin{bmatrix}x_{k}^{(0)}+\Delta a_{k}^{(0)}\cos(x_{k}^{(2)})\\ x_{k}^{(1)}+\Delta a_{k}^{(0)}\sin(x_{k}^{(2)})\\ x_{k}^{(2)}+\Delta a_{k}^{(1)}\end{bmatrix}+\Delta_{w}\begin{bmatrix}w_{k}^{(0)}\\ w_{k}^{(1)}\\ w_{k}^{(2)}\end{bmatrix}, (33)

where xk=[xk(0)xk(1)xk(2)]3x_{k}=[x_{k}^{(0)}\ x_{k}^{(1)}\ x_{k}^{(2)}]^{\top}\in\mathbb{R}^{3}, ak=[ak(0)ak(1)][1,1]2a_{k}=[a_{k}^{(0)}\ a_{k}^{(1)}]^{\top}\in[-1,1]^{2}, and wk=[wk(0)wk(1)wk(2)]3w_{k}=[w_{k}^{(0)}\ w_{k}^{(1)}\ w_{k}^{(2)}]^{\top}\in\mathbb{R}^{3}. wk(i),i{0,1,2}w_{k}^{(i)},\ i\in\{0,1,2\} is sampled from a standard normal distribution 𝒩(0,1)\mathcal{N}(0,1). We assume that Δ=0.1\Delta=0.1 and Δw=0.01I\Delta_{w}=0.01I, where II is the unit matrix. The initial state of the system is sampled randomly in 0.5x(0)2.5, 0.5x(1)2.5,π/2x(2)π/20.5\leq x^{(0)}\leq 2.5,\ 0.5\leq x^{(1)}\leq 2.5,\ -\pi/2\leq x^{(2)}\leq\pi/2. The region 1 is {(x(0),x(1))| 3.5x(0)4.5, 3.5x(1)4.5}\{(x^{(0)},x^{(1)})|\ 3.5\leq x^{(0)}\leq 4.5,\ 3.5\leq x^{(1)}\leq 4.5\} and the region 2 is {(x(0),x(1))| 3.5x(0)4.5, 1.5x(1)2.5}\{(x^{(0)},x^{(1)})|\ 3.5\leq x^{(0)}\leq 4.5,\ 1.5\leq x^{(1)}\leq 2.5\}. We consider the following two constraints.
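The model (33) with the parameters above translates directly into a simulation step. The helper below is a straightforward transcription; the function and variable names are ours.

import numpy as np

DELTA = 0.1       # Delta in (33)
DELTA_W = 0.01    # Delta_w = 0.01 I

def step_dynamics(x, a, rng):
    # x : state [x^(0), x^(1), x^(2)] of the two-wheeled mobile robot.
    # a : action [a^(0), a^(1)] in [-1, 1]^2.
    w = rng.standard_normal(3)   # w_k, each component drawn from N(0, 1)
    x_next = np.array([
        x[0] + DELTA * a[0] * np.cos(x[2]),
        x[1] + DELTA * a[0] * np.sin(x[2]),
        x[2] + DELTA * a[1],
    ]) + DELTA_W * w
    return x_next

# Example: one step from the center of the initial region.
rng = np.random.default_rng(0)
x = np.array([1.5, 1.5, 0.0])
x = step_dynamics(x, np.array([1.0, 0.0]), rng)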

Constraint 1 (Recurrence): At every time in the interval [0,900], the robot visits both region 1 and region 2 within 99 time steps; there is no constraint on the order of the visits.

Constraint 2 (Stabilization): The robot visits region 1 or region 2 within the time interval [0,450] and then stays there for 49 time steps.

These constraints are described by the following STL formulae.

Formula 1:

\Phi_{1}=G_{[0,900]}(F_{[0,99]}\varphi_{1}\land F_{[0,99]}\varphi_{2}), (34)

Formula 2:

\Phi_{2}=F_{[0,450]}(G_{[0,49]}\varphi_{1}\lor G_{[0,49]}\varphi_{2}), (35)

where

\varphi_{1}=((3.5\leq x^{(0)}\leq 4.5)\land(3.5\leq x^{(1)}\leq 4.5)),
\varphi_{2}=((3.5\leq x^{(0)}\leq 4.5)\land(1.5\leq x^{(1)}\leq 2.5)).
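Under the standard quantitative semantics of STL, the robustness of a conjunction of inequalities such as \varphi_{1} and \varphi_{2} is the minimum of the individual margins. The helpers below compute these margins (positive inside the corresponding region, negative outside) and can serve as the atomic predicates in the pre-processing sketch given after Algorithm 2.

def rho_varphi1(x):
    # Robustness margin of varphi_1 (region 1) at state x.
    return min(x[0] - 3.5, 4.5 - x[0], x[1] - 3.5, 4.5 - x[1])

def rho_varphi2(x):
    # Robustness margin of varphi_2 (region 2) at state x.
    return min(x[0] - 3.5, 4.5 - x[0], x[1] - 1.5, 2.5 - x[1])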

We consider the following reward function

R_{z}(z,a)=R_{x}(z[\tau-1])+R_{a}(a), (36)

where

R_{x}(x)=\min\{x^{(0)}-0.5,\ 4.5-x^{(0)},\ x^{(1)}-0.5,\ 4.5-x^{(1)},\ 0.0\}, (37)
R_{a}(a)=-||a||_{2}^{2}. (38)

(37) is the term that keeps the robot within the working area: the further the agent moves outside the working area, the larger the negative reward it receives. (38) is the term that penalizes the control effort (fuel cost).
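The reward (36) with the terms (37) and (38) translates directly into code; the helper below takes the current system state (the last component of the extended state) and the action.

import numpy as np

def reward(x, a):
    # R_z(z, a) = R_x(z[tau-1]) + R_a(a), cf. (36)-(38).
    r_x = min(x[0] - 0.5, 4.5 - x[0],      # (37): penalty for leaving the
              x[1] - 0.5, 4.5 - x[1],      # working area
              0.0)
    r_a = -float(np.dot(a, a))             # (38): fuel cost -||a||_2^2
    return r_x + r_a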

V-A Evaluation

We apply the SAC-Lagrangian algorithm to design a policy constrained by an STL formula. In all simulations, the DNNs have two fully connected hidden layers with 256 units each. The activation functions for the hidden layers and the outputs of the actor DNN are the rectified linear unit function and the hyperbolic tangent function, respectively. We normalize x(0)x^{(0)} and x(1)x^{(1)} as x(0)2.5x^{(0)}-2.5 and x(1)2.5x^{(1)}-2.5, respectively. The size of the replay buffer 𝒟\mathcal{D} is 1.0×1051.0\times 10^{5}, and the size of the mini-batch is I=64I=64. We use Adam [32] as the optimizer for all main DNNs, the entropy temperature, and the Lagrange multiplier. The learning rate of the optimizer for the Lagrange multiplier is 1.0×1051.0\times 10^{-5} and the learning rates of the other optimizers are 3.0×1043.0\times 10^{-4}. The soft update rate of the target network is ξ=0.01\xi=0.01. The discount factor is γ=0.99\gamma=0.99. The target for updating the entropy temperature 0\mathcal{H}_{0} is 2.0-2.0. The STL-reward parameter is β=100\beta=100. The agent learns its control policy for 6.0×1056.0\times 10^{5} steps. The initial values of both the entropy temperature and the Lagrange multiplier are 1.0. For performance evaluation, we introduce the following three indices:

  • a reward learning curve shows the mean of the sum of rewards k=0KγkRz(zk,ak)\sum_{k=0}^{K}\gamma^{k}R_{z}(z_{k},a_{k}) over 100 trajectories,

  • an STL-reward learning curve shows the mean of the sum of STL-rewards k=0KγkRSTL(zk)\sum_{k=0}^{K}\gamma^{k}R_{STL}(z_{k}) over 100 trajectories, and

  • a success rate shows the number of trajectories, out of the 100 generated, that satisfy the given STL constraint.

We prepare 100 initial states sampled from p0p_{0} and generate 100 trajectories using the learned policy for each evaluation. We show the results for Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2); no pre-training is performed in Case 1. All simulations were run on a computer with an AMD Ryzen 9 3950X 16-core processor, an NVIDIA GeForce RTX 2070 Super GPU, and 32 GB of memory, and were implemented in Python.
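For reference, the settings above correspond to a configuration along the following lines in PyTorch. The class and variable names are ours, and the actor head (mean and log-standard-deviation with tanh-squashed actions) is the usual SAC parameterization, assumed here rather than taken from the implementation.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Two fully connected hidden layers with 256 ReLU units each.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 5, 2                    # pre-processed state and action dims
actor = mlp(state_dim, 2 * action_dim)          # mean and log-std; actions squashed by tanh
critic_r = mlp(state_dim + action_dim, 1)       # reward critic
critic_s = mlp(state_dim + action_dim, 1)       # STL-reward critic
log_alpha = torch.zeros(1, requires_grad=True)  # entropy temperature, initial value 1.0
log_kappa = torch.zeros(1, requires_grad=True)  # Lagrange multiplier, initial value 1.0

optimizers = {
    "actor": torch.optim.Adam(actor.parameters(), lr=3.0e-4),
    "critic_r": torch.optim.Adam(critic_r.parameters(), lr=3.0e-4),
    "critic_s": torch.optim.Adam(critic_s.parameters(), lr=3.0e-4),
    "alpha": torch.optim.Adam([log_alpha], lr=3.0e-4),
    "kappa": torch.optim.Adam([log_kappa], lr=1.0e-5),
}
GAMMA, XI, BATCH_SIZE, BUFFER_SIZE = 0.99, 0.01, 64, 100_000
TARGET_ENTROPY, BETA = -2.0, 100.0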

V-A1 Formula 1

We consider the case where the constraint is given by (34). In this simulation, we set K=1000K=1000 and lSTL=40l_{STL}=-40. The extended state zz consists of the past τ=100\tau=100 system states. The reward learning curves and the STL-reward learning curves are shown in Figs. 10 and 11, respectively. In Case 1, it takes many steps to learn a policy such that the sum of STL-rewards is near the threshold lSTL=40l_{STL}=-40. The reward learning curve decreases gradually while the STL-reward curve increases. This is caused by the lack of experiences satisfying the STL formula Φ\Phi. If the agent cannot satisfy the STL constraint during its explorations, the Lagrange multiplier κ\kappa becomes large, as shown in Fig. 12. Then, the STL term κQθs-\kappa Q_{\theta_{s}} of the actor loss J(πθ)J(\pi_{\theta}) dominates the other terms. As a result, the agent updates the parameter vector θπ\theta_{\pi} considering only the STL-rewards. On the other hand, in Case 2, the agent obtains enough experiences satisfying the STL formula during the 300000 pre-training steps. The agent learns a policy such that the sum of the STL-rewards is near the threshold relatively quickly and fine-tunes the policy under the STL constraint after pre-training. The results in both cases indicate that our proposed method is useful for learning an optimal policy under the STL constraint. Additionally, as the sum of STL-rewards obtained by the learned policy increases, the success rate for the given STL formula also increases, as shown in Fig. 13.

Figure 10: Reward learning curves for the formula Φ1\Phi_{1}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The gray line shows 300000 steps.
Figure 11: STL-reward learning curves for the formula Φ1\Phi_{1}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The dashed line shows the threshold lSTL=40l_{STL}=-40. The gray line shows 300000 steps.
Figure 12: Curves of the Lagrange multiplier κ\kappa for the formula Φ1\Phi_{1}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The gray line shows 300000 steps.
Figure 13: Success rates for the formula Φ1\Phi_{1}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The gray line shows 300000 steps.

V-A2 Formula 2

We consider the case where the constraint is given by (35). In this simulation, we set K=500K=500 and lSTL=35l_{STL}=35. The extended state zz consists of the past τ=50\tau=50 system states. We use the reward function RSTL(z)=exp(β𝟏(ρ(z,ϕ)))/exp(β)R_{STL}(z)=\exp(\beta\bm{1}(\rho(z,\phi)))/\exp(\beta) instead of the STL-reward function defined previously to prevent the sum of STL-rewards from diverging to infinity. The reward learning curves and the STL-reward learning curves are shown in Figs. 14 and 15, respectively. In Case 1, although the reward learning curve remains above -20, the STL-reward learning curve remains far below the threshold lSTL=35l_{STL}=35. On the other hand, in Case 2, the agent learns a policy such that the sum of STL-rewards is near the threshold lSTL=35l_{STL}=35 and fine-tunes the policy under the STL constraint after pre-training. Our proposed method is thus useful not only for the formula Φ1\Phi_{1} but also for the formula Φ2\Phi_{2}. Additionally, as the sum of STL-rewards obtained by the learned policy increases, the success rate for the given STL formula also increases, as shown in Fig. 16.
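This bounded STL-reward evaluates to 1 when the robustness is nonnegative and to \exp(-\beta), which is practically 0, otherwise, assuming \bm{1}(\cdot) denotes the indicator of nonnegative robustness; a minimal sketch under that reading is shown below.

import math

BETA = 100.0

def stl_reward_bounded(rho):
    # exp(beta * 1(rho)) / exp(beta) = exp(beta * (1(rho) - 1)),
    # where 1(rho) = 1 if rho >= 0 and 0 otherwise (our reading of the indicator).
    indicator = 1.0 if rho >= 0.0 else 0.0
    return math.exp(BETA * (indicator - 1.0))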

Figure 14: Reward learning curves for the formula Φ2\Phi_{2}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The gray line shows 300000 steps.
Figure 15: STL-reward learning curves for the formula Φ2\Phi_{2}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The dashed line shows the threshold lSTL=35l_{STL}=35. The gray line shows 300000 steps.
Figure 16: Success rates for the formula Φ2\Phi_{2}. The red and blue curves show the results of Kpre=0K_{pre}=0 (Case 1) and Kpre=300000K_{pre}=300000 (Case 2), respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The gray line shows 300000 steps.

V-B Ablation studies for pre-processing

In this section, we present ablation studies for the pre-processing introduced in Section IV.D. We conduct the experiment for Φ1\Phi_{1} using the SAC-Lagrangian algorithm. Without pre-processing, the dimensionality of the input to the DNNs is 300, whereas with pre-processing it is 5. The STL-reward learning curves for both cases are shown in Fig. 17. The agent without pre-processing cannot improve the performance of its policy with respect to the STL-rewards. These results indicate that the pre-processing is useful for problems constrained by an STL formula with a large τ\tau.

Figure 17: STL-reward learning curves for the case without pre-processing (red) and the case with pre-processing (blue). We consider the formula Φ1\Phi_{1}. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The dashed line shows the threshold lSTL=40l_{STL}=-40. The gray line shows 300000 steps.

V-C Comparison with another DRL algorithm

In this section, we compare the SAC-based algorithm with two other algorithms: DDPG [20] and TD3 [31]. TD3 is an extension of the DDPG algorithm that uses the clipped double Q-learning technique to mitigate the positive bias in the critic estimation. For the DDPG-Lagrangian algorithm and the TD3-Lagrangian algorithm, we need to specify a stochastic process that generates exploration noise. We use the following Ornstein-Uhlenbeck process:

\omega_{k+1}=\omega_{k}-p_{1}(\omega_{k}-p_{2})+p_{3}\varepsilon,

where ε\varepsilon is a noise sampled from a standard normal distribution 𝒩(0,1)\mathcal{N}(0,1). We set the parameters (p1,p2,p3)=(0.15,0,0.3)(p_{1},p_{2},p_{3})=(0.15,0,0.3). For the TD3-Lagrangian algorithm, the target policy smoothing and the delayed policy updates are the same as in the original paper [31]. The target policy smoothing is implemented by adding noise sampled from the normal distribution 𝒩(0,0.2)\mathcal{N}(0,0.2), clipped to (0.5,0.5)(-0.5,0.5), to the actions chosen by the target actor DNN, and for the delayed policy updates the agent updates the actor DNN and the target DNNs every 2 learning steps. The other experimental settings, such as hyperparameters, optimizers, and DNN architectures, are the same as those of the SAC-Lagrangian algorithm.
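A minimal implementation of this noise process with the parameters above could look as follows; the class name and interface are ours.

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck process: omega_{k+1} = omega_k - p1 (omega_k - p2) + p3 eps,
    # with eps drawn from the standard normal distribution N(0, 1).
    def __init__(self, dim, p1=0.15, p2=0.0, p3=0.3, seed=None):
        self.dim, self.p1, self.p2, self.p3 = dim, p1, p2, p3
        self.rng = np.random.default_rng(seed)
        self.omega = np.zeros(dim)

    def reset(self):
        self.omega = np.zeros(self.dim)

    def sample(self):
        eps = self.rng.standard_normal(self.dim)
        self.omega = self.omega - self.p1 * (self.omega - self.p2) + self.p3 * eps
        return self.omega

# Exploration: add the sampled noise to the deterministic action of DDPG/TD3.
noise = OUNoise(dim=2)
perturbation = noise.sample()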

We conduct experiments for Φ1\Phi_{1}. The reward learning curves and the STL-reward learning curves are shown in Figs. 18 and 19, respectively. Although all algorithms improve the policy with respect to the rewards after fine-tuning, the DDPG-Lagrangian algorithm cannot improve the policy with respect to the STL-rewards: its STL-reward curve remains far below the threshold. On the other hand, the TD3-Lagrangian algorithm and the SAC-Lagrangian algorithm learn policies whose STL-rewards exceed the threshold. These results show the importance of the clipped double Q-learning technique, which mitigates the positive bias in the critic estimations, in the fine-tuning phase; the technique is used in both the TD3-Lagrangian algorithm and the SAC-Lagrangian algorithm. Finally, Fig. 20 shows the result when the double Q-learning technique is removed from the SAC-Lagrangian algorithm. Although the agent learns a policy such that the STL-rewards are near the threshold in the pre-training phase, the performance of the policy with respect to the STL-rewards is degraded in the fine-tuning phase.
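For completeness, the clipped double Q-learning target used by TD3 and SAC takes the minimum of two target critics. The sketch below shows this target computation generically for one critic (the STL-reward critic is handled analogously); the names are ours, not those of the paper's equations.

import torch

def clipped_double_q_target(r, z_hat_next, target_actor, target_q1, target_q2,
                            gamma=0.99):
    # Take the minimum of the two target critics to mitigate the positive
    # estimation bias of a single critic.
    with torch.no_grad():
        a_next = target_actor(z_hat_next)
        q_next = torch.min(target_q1(z_hat_next, a_next),
                           target_q2(z_hat_next, a_next))
        return r + gamma * q_next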

Figure 18: Reward learning curves for the formula Φ1\Phi_{1}. The red, blue, and green curves show the results of the DDPG-Lagrangian algorithm, the TD3-Lagrangian algorithm, and the SAC-Lagrangian algorithm, respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively.
Figure 19: STL-reward learning curves for the formula Φ1\Phi_{1}. The red, blue, and green curves show the results of the DDPG-Lagrangian algorithm, the TD3-Lagrangian algorithm, and the SAC-Lagrangian algorithm, respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively.
Figure 20: STL-reward learning curves for the formula Φ1\Phi_{1}. The purple and blue curves show the results of the SAC-Lagrangian algorithm without and with the double Q-learning technique, respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively.

VI Conclusion

We considered a model-free optimal control problem constrained by a given STL formula. We modeled the problem as a τ\tau-CMDP, which is an extension of a τ\tau-MDP. To solve the τ\tau-CMDP problem with continuous state-action spaces, we proposed a CDRL algorithm based on the Lagrangian relaxation, in which the constrained problem is relaxed into an unconstrained one so that a standard DRL algorithm for unconstrained problems can be used. Additionally, we proposed a practical two-phase learning algorithm to make it easier to obtain experiences satisfying the given STL formula. Through numerical simulations, we demonstrated the performance of the proposed algorithm. First, we showed that the agent with our proposed two-phase algorithm can learn its policy for the τ\tau-CMDP problem. Next, we conducted ablation studies for the pre-processing that reduces the dimensionality of the extended state and showed its usefulness. Finally, we compared three CDRL algorithms and showed the usefulness of the double Q-learning technique in the fine-tuning phase.

On the other hand, the STL syntax considered in this study is restrictive compared with the general STL syntax. Relaxing this restriction is future work. Furthermore, our proposed method may not be directly applicable to high-dimensional decision making problems because it is difficult to obtain experiences satisfying a given STL formula in such problems. Addressing this issue is also an interesting direction for future work.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
  • [2] H. Dong, Z. Ding, and S. Zhang, Eds., Deep Reinforcement Learning: Fundamentals, Research and Applications, Singapore: Springer, 2020.
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.
  • [4] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in Proc. 2017 IEEE Int. Conf. on Robotics and Automation (ICRA), May 2017, pp. 3389–3396.
  • [5] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, May 2019.
  • [6] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3826–3839, Sept. 2020.
  • [7] C. Belta, B. Yordanov, and E. A. Gol, Formal Methods for Discrete-Time Dynamical Systems, Cham, Switzerland: Springer, 2017.
  • [8] C. Baier and J.-P. Katoen, Principles of Model Checking, Cambridge, MA, USA: MIT Press, 2008.
  • [9] M. Hasanbeig, A. Abate, and D. Kroening, “Logically-Constrained Reinforcement Learning,” 2018, arXiv:1801.08099.
  • [10] L. Z. Yuan, M. Hasanbeig, A. Abate, and D. Kroening, “Modular deep reinforcement learning with temporal logic specifications,” 2019, arXiv:1909.11591.
  • [11] M. Cai, M. Hasanbeig, S. Xiao, A. Abate, and Z. Kan, “Modular deep reinforcement learning for continuous motion planning with temporal logic,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7973-7980, Aug. 2021.
  • [12] O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” Formal Techniques, Modeling and Analysis of Timed and Fault-Tolerant Systems, pp. 152–166, Jan. 2004.
  • [13] G. E. Fainekos and G. J. Pappas, “Robustness of temporal logic specifications for continuous-time signals,” Theoretical Computer Science, vol. 410, no. 42, pp. 4262–4291, Sept. 2009.
  • [14] V. Raman, A. Donzé, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, and S. A. Seshia, “Model predictive control with signal temporal logic specifications,” in Proc. IEEE 53rd Conf. on Decision and Control (CDC), Dec. 2014, pp. 81–87.
  • [15] L. Lindemann and D. V. Dimarogonas, “Control barrier functions for signal temporal logic tasks,” IEEE Control Systems Letters, vol. 3, no. 1, pp. 96–101, Jan. 2019.
  • [16] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-learning for robust satisfaction of signal temporal logic specifications,” in Proc. IEEE 55th Conf. on Decision and Control (CDC), Dec. 2016, pp. 6565–6570.
  • [17] H. Venkataraman, D. Aksaray, and P. Seiler, “Tractable reinforcement learning of signal temporal logic objectives,” 2020, arXiv:2001.09467.
  • [18] A. Balakrishnan and J. V. Deshmukh, “Structured reward shaping using signal temporal logic specifications,” in Proc. 2019 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Nov. 2019, pp. 3481–3486.
  • [19] J. Ikemoto and T. Ushio, “Deep reinforcement learning based networked control with network delays for signal temporal logic specifications,” in Proc. IEEE 27th Int. Conf. on Emerging Technologies and Factory Automation (ETFA), Sept. 2021, doi: 10.1109/ETFA52439.2022.9921505.
  • [20] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2015, arXiv:1509.02971.
  • [21] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, “Soft actor-critic algorithms and applications,” 2018, arXiv:1812.05905.
  • [22] E. Altman, Constrained Markov Decision Processes, New York, USA: Routledge, 1999.
  • [23] Y. Liu, A. Halev, and X. Liu, “Policy learning with constraints in model-free reinforcement learning: A survey,” in Proc. Int. Joint Conf. on Artificial Intelligence Organization (IJCAI), Aug. 2021, pp. 4508–4515.
  • [24] K. C. Kalagarla, R. Jain, and P. Nuzzo, “Model-free reinforcement learning for optimal control of Markov decision processes under signal temporal logic specifications,” in Proc. IEEE 60th Conf. on Decision and Control (CDC), Dec. 2021, pp. 2252–2257.
  • [25] A.G. Puranic, J.V. Deshmukh, and S. Nikolaidis, “Learning from demonstrations using signal temporal logic,” 2021, arXiv:2102.07730.
  • [26] A.G. Puranic, J.V. Deshmukh, and S. Nikolaidis, “Learning from demonstrations using signal temporal logic in stochastic and continuous domains,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6250-6257, Oct. 2021.
  • [27] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, New York, NY, USA: Academic Press, 2014.
  • [28] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, “Learning to walk in the real world with minimal human effort,” 2020, arXiv:2002.08550.
  • [29] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy deep reinforcement learning algorithm for Volt-VAR control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008–3018, Jul. 2020.
  • [30] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” 2013, arXiv:1312.6114.
  • [31] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. Int. Conf. on machine learning (ICML), Jul. 2018, pp. 1587–1596.
  • [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.