
Safe Learning for Uncertainty-Aware Planning via Interval MDP Abstraction

Jesse Jiang,  Ye Zhao,  and Samuel Coogan This work was supported in part by the National Science Foundation under grant #1924978.Jesse Jiang and Samuel Coogan are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: jjiang@gatech.edu, sam.coogan@gatech.edu). S. Coogan is also with the School of Civil and Environmental Engineering. Ye Zhao is with the School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: ye.zhao@me.gatech.edu).
Abstract

We study the problem of refining satisfiability bounds for partially-known stochastic systems against planning specifications defined using syntactically co-safe Linear Temporal Logic (scLTL). We propose an abstraction-based approach that iteratively generates high-confidence Interval Markov Decision Process (IMDP) abstractions of the system from high-confidence bounds on the unknown component of the dynamics obtained via Gaussian process regression. In particular, we develop a synthesis strategy to sample the unknown dynamics by finding paths which avoid specification-violating states using a product IMDP. We further provide a heuristic to choose among various candidate paths to maximize the information gain. Finally, we propose an iterative algorithm to synthesize a satisfying control policy for the product IMDP system. We demonstrate our work with a case study on mobile robot navigation.

Index Terms:
Automata, Hybrid systems, Markov processes, Gaussian process learning

I Introduction

Abstraction-based approaches for verification and synthesis of dynamical systems offer computationally tractable methods for accommodating complex specifications [1]. In particular, Interval Markov Decision Processes (IMDP) [2], which allow for an interval of transition probabilities, provide a rich abstraction model for stochastic systems. As compared to stochastic control [3], abstraction-based methods allow for more complex specifications to be considered and have been widely used for hybrid stochastic systems [4].

The transition probability intervals in IMDP abstractions have typically modeled the uncertainty that arises from abstracting the dynamics of continuous states into discrete regions [5]. However, partially-known stochastic systems, which show promise for modeling a wide range of real-world systems, add unknown dynamics that contribute further uncertainty. Previous works model this uncertainty by assuming that some prior data on the dynamics are available [6].

The paper [7] is the first to address the problem of modeling unknown dynamics in stochastic hybrid systems via the use of IMDP abstraction in combination with Gaussian process (GP) regression [8]. GP regression can approximate unknown functions with arbitrary accuracy and also provides bounds on the approximation uncertainty [9].

The main contribution of this work is to develop a method for sampling the unknown dynamics of a stochastic system online in order to reduce abstraction error and increase the probability of satisfying a syntactically co-safe linear temporal logic (scLTL) specification [2].

Our goal is to find a control policy which guarantees the satisfaction of a scLTL specification with sufficient probability. However, the system is subject to stochastic noise which creates unavoidable perturbations. The system also has unknown dynamics which we estimate with Gaussian processes. This creates an estimation error which increases uncertainty in state transitions and which we aim to reduce by sampling the unknown dynamics. Thus, this paper focuses on the problem of safe learning to allow online exploration rather than a static planning problem using previously collected data samples as in [7].

Our approach is as follows. First, we estimate the unknown dynamics of the system using Gaussian processes and construct a high-confidence IMDP abstraction. We then develop an algorithm for finding nonviolating cycles in the product IMDP formed from the system abstraction and a finite automaton of the scLTL specification; these cycles allow the dynamics of the system to be sampled without violating the specification. We develop a heuristic for evaluating candidate cycles in order to maximize the uncertainty reduction gained from sampling. Finally, we propose an iterative method to sample the state-space, thereby decreasing the uncertainty of a GP estimation of the unknown dynamics until a satisfying control policy for the system can be synthesized or a terminating condition such as a maximum number of iterations has been reached. We utilize sparse GPs [10] to improve computational efficiency. We demonstrate our method on a case study of robotic motion planning.

II Problem Setup

Consider a discrete-time, partially-known system

x[k+1]=f(x[k])+u[k]+g(x[k])+\nu[k] (1)

where x\in X\subseteq\mathbb{R}^{n} is the system state, u\in\mathbb{R}^{n} is the control action, f(x) is the known dynamics, g(x) is the unknown dynamics to be learned via GP regression, \nu is stochastic noise, and time is indexed with brackets. This system has applications in, e.g., biology [11], communications [12], and robotics [13].

Assumption 1

1) Each dimension \nu_{i}[k], i=1,\ldots,n, of \nu is an independent, zero-mean random variable with stationary, symmetric, and unimodal distribution \rho_{\nu_{i}} and is \sigma_{\nu_{i}}-sub-Gaussian, i.e., the distribution tail decays at least as fast as that of a Gaussian random variable with variance \sigma_{\nu_{i}}^{2}.

2) Given a data set D=\{(z^{j},y^{j})\}_{j=1}^{m} where y^{j} is an observation of g(z^{j}) perturbed by \sigma_{\nu_{i}}-sub-Gaussian noise, it is possible to construct an estimate \hat{g}^{D}(x) of g and bound the estimation error between g(x) and \hat{g}^{D}(x) by some high-confidence bound \gamma^{D}(x). Thus,

g_{-}^{D}(x)=\hat{g}^{D}(x)-\gamma^{D}(x),\quad g_{+}^{D}(x)=\hat{g}^{D}(x)+\gamma^{D}(x) (2)

are high-confidence bounds on g, i.e., g^{D}_{-}(x)\leq g(x)\leq g^{D}_{+}(x) with high confidence. For simplicity, we drop the superscript D when the dataset is clear.

Assumption 2

The state-space X is bounded and is partitioned into hyper-rectangular regions \{X_{q}\}_{q\in Q} defined as

X_{q}=\{x\mid a_{q}\leq x\leq b_{q}\}\subset X, (3)

where the inequality is taken elementwise for lower and upper bounds a_{q},b_{q}\in\mathbb{R}^{n} and Q is a finite index set of the regions. Each region has a center c_{q}=(a_{q}+b_{q})/2. Additionally, the system possesses a labeling function L which maps hyper-rectangular regions to observations O.

Define feedback controllers K_{q}(\cdot\ ;\hat{g}):X\xrightarrow{}X as

K_{q}(x;\hat{g})=c_{q}-f(x)-\hat{g}(x). (4)

The choice u[k]=K_{q^{\prime}}(x[k];\hat{g}) thus produces a control action which compensates for the known and estimated dynamics to reach the center of region X_{q^{\prime}}, although the actual state update will be perturbed as shown in Figure 1.
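As a concrete illustration, the following minimal Python sketch implements the feedback controller (4) and one closed-loop step of system (1). The dynamics f, g, the initial GP estimate, and the noise level are placeholder choices for illustration only; they are not prescribed by the method.

import numpy as np

def K_q(x, c_q, f, g_hat):
    # Feedback controller (4): steer toward the center of the target region
    # by compensating for the known dynamics f and the estimated dynamics g_hat.
    return c_q - f(x) - g_hat(x)

def step(x, c_q, f, g, g_hat, sigma_nu=0.1):
    # One step of system (1): the true unknown dynamics g and the noise nu
    # still perturb the state away from the targeted center c_q.
    u = K_q(x, c_q, f, g_hat)
    nu = np.random.normal(0.0, sigma_nu, size=x.shape)
    return f(x) + u + g(x) + nu

# Example with placeholder dynamics: f(x) = x and a small unknown drift.
f = lambda x: x
g = lambda x: 0.1 * np.sin(x)          # true unknown dynamics (unavailable in practice)
g_hat = lambda x: np.zeros_like(x)     # initial GP estimate before any data
x_next = step(np.array([0.3, 0.7]), c_q=np.array([0.5, 0.5]), f=f, g=g, g_hat=g_hat)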

Our ultimate goal is to apply a sequence of feedback controllers so that the resulting sequence of observations satisfies a control objective specified as a syntactically co-safe LTL (scLTL) formula over the observations O.

Definition 1 (Syntactically co-safe LTL [2, Def. 2.3])

A syntactically co-safe linear temporal logic (scLTL) formula \phi over a set of observations O is recursively defined as

\phi=\top\ |\ o\ |\ \lnot{o}\ |\ \phi_{1}\land\phi_{2}\ |\ \phi_{1}\lor\phi_{2}\ |\ \bigcirc\phi\ |\ \phi_{1}\,\mathcal{U}\,\phi_{2}\ |\ \Diamond\phi

where o\in O is an observation and \phi, \phi_{1}, and \phi_{2} are scLTL formulas. We define the next operator \bigcirc as meaning that \phi will be satisfied in the next state transition, the until operator \mathcal{U} as meaning that the system satisfies \phi_{1} until it satisfies \phi_{2}, and the eventually operator \Diamond as \top\,\mathcal{U}\,\phi.

ScLTL formulas are characterized by the property that they are satisfied in finite time. It is well-known that scLTL satisfaction can be checked using a finite state automaton:

Definition 2 (Finite State Automaton [2, Def. 2.4])

A finite state automaton (FSA) is a tuple \mathcal{A}=(S,s_{0},O,\delta,F), where

  • S is a finite set of states,

  • s_{0}\in S is the initial state,

  • O is the input alphabet, which corresponds to observations from the scLTL specification,

  • \delta:S\times O\xrightarrow{}S is a transition function, and

  • F\subseteq S is the set of accepting (final) states.

A sequence of inputs (a word) from O is said to be accepted by an FSA if it ends in an accepting state. An scLTL formula can always be translated into an FSA that accepts all and only those words satisfying the formula. We use scLTL specifications in this paper because they are well-suited to robotic motion planning tasks which are satisfied in finite time. Additionally, the simpler structure of an FSA as opposed to the Büchi and Rabin automata of general LTL enables the methods we propose below.
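For instance, the specification \lnot\texttt{Haz}\,\mathcal{U}\,\texttt{Goal} used later in the case study (Section V) translates into a three-state FSA. The sketch below is one illustrative hand encoding of that automaton as a Python transition dictionary, not the output of a translation tool; the state names are arbitrary.

# FSA for phi = (not Haz) U Goal over observations O = {"Goal", "Haz", "none"}.
# s0: specification not yet satisfied, acc: accepting, fail: absorbing failure state.
s_init = "s0"
accepting = {"acc"}
delta = {
    ("s0", "Goal"): "acc", ("s0", "Haz"): "fail", ("s0", "none"): "s0",
    ("acc", "Goal"): "acc", ("acc", "Haz"): "acc", ("acc", "none"): "acc",
    ("fail", "Goal"): "fail", ("fail", "Haz"): "fail", ("fail", "none"): "fail",
}

def accepts(word):
    # A finite word is accepted if it ends in an accepting state.
    s = s_init
    for o in word:
        s = delta[(s, o)]
    return s in accepting

print(accepts(["none", "none", "Goal"]))   # True
print(accepts(["none", "Haz", "Goal"]))    # False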

Figure 1: Feedback controller and calculation of transition probabilities. The controller targets the center of state X_{2}. The uncertainty in \hat{g}(x) creates a nondeterministic region of transition (brown rectangle). The maximum probability of transitioning to state X_{3} is found by centering the stochastic noise at the point x_{\max} closest to state X_{3} (green point) and calculating the probability that the noise reaches state X_{3}. The minimum probability of transitioning to state X_{3} under this controller is given likewise by centering the stochastic noise at the point x_{\min} furthest from X_{3} (red point).
Definition 3 (Interval Markov Decision Process)

An Interval Markov Decision Process (IMDP) is a tuple \mathcal{I}=(Q,A,\check{T},\hat{T},Q_{0},O,L) where:

  • Q is a finite set of states,

  • A is a finite set of actions,

  • \check{T},\hat{T}:Q\times A\times Q\xrightarrow{}[0,1] are lower and upper bounds, respectively, on the transition probability from state q\in Q to state q^{\prime}\in Q under action \alpha\in A,

  • Q_{0}\subseteq Q is a set of initial states,

  • O is a finite set of atomic propositions or observations,

  • L:Q\xrightarrow{}O is a labeling function.

The set of actions A corresponds to the set of all valid feedback controllers for the system. We do not assume that all actions are available at each state. Therefore, each state has a subset A(q)\subseteq A of available actions.

Definition 4 (High-Confidence IMDP Abstraction)

Consider stochastic system (1), partitions (3), and the family of feedback controllers (4) where \hat{g}(x) is an estimate of g(x). Further, suppose that g_{-}(x) and g_{+}(x) are high-confidence bounds on g(x) which satisfy (2). Then, an IMDP \mathcal{I}=(Q,A,\check{T},\hat{T},Q_{0},O,L) is a high-confidence IMDP abstraction of (1) if:

  • The set of states Q for the abstraction is the index set of partitions, i.e., partition X_{q} is abstracted as state q, and the set of observations O and labeling function L for the abstraction are the same as for the system,

  • For all q\in Q, the set of actions A(q) is the set of one-step reachable regions at q under its feedback controllers,

  • For all q\in Q and all \alpha_{q^{*}}\in A(q):

\check{T}(q,\alpha_{q^{*}},q^{\prime})\leq\min_{x\in X_{q}}\ \min_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(f(x)+w+K_{q^{*}}(x;\hat{g})+\nu\in X_{q^{\prime}}), (5)
\hat{T}(q,\alpha_{q^{*}},q^{\prime})\geq\max_{x\in X_{q}}\ \max_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(f(x)+w+K_{q^{*}}(x;\hat{g})+\nu\in X_{q^{\prime}}), (6)

where \mathbb{P}_{\nu} denotes probability with respect to \nu.

Verification and synthesis problems for IMDP systems evaluated against scLTL specifications are often solved using graph theoretic methods on a product IMDP:

Definition 5 (PIMDP)

Let \mathcal{I}=(Q,A,\check{T},\hat{T},Q_{0},O,L) be an IMDP and \mathcal{A}=(S,s_{0},O,\delta,F) be an FSA. The product IMDP (PIMDP) is defined as a tuple \mathcal{P}=\mathcal{I}\otimes\mathcal{A}=(Q\times S,A,\check{T}^{\prime},\hat{T}^{\prime},Q\times s_{0},F^{\prime}), where

  • \check{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime})):=\check{T}(q,\alpha,q^{\prime}) if s^{\prime}\in\delta(s,L(q)) and 0 otherwise,

  • \hat{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime})):=\hat{T}(q,\alpha,q^{\prime}) if s^{\prime}\in\delta(s,L(q)) and 0 otherwise,

  • (q_{0},\delta(s_{0},L(q_{0})))\in(Q\times S) is a set of initial states of \mathcal{I}\otimes\mathcal{A}, and

  • F^{\prime}=Q\times F is the set of accepting (final) states.
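A minimal sketch of the product construction in Definition 5 is given below. It assumes the IMDP transition bounds are stored as dictionaries keyed by (q, action, q'), with absent keys read as probability 0, and uses the FSA encoding from the earlier sketch; all names and data structures are illustrative assumptions.

def build_pimdp(Q, Q0, T_low, T_up, L, S, delta, s0, F):
    # PIMDP transition bounds per Definition 5: a transition
    # ((q,s) --a--> (q',s')) inherits the IMDP interval when s' = delta(s, L(q)),
    # and has probability 0 otherwise (encoded by leaving the key absent).
    Tp_low, Tp_up = {}, {}
    for (q, a, q_next), p_low in T_low.items():
        for s in S:
            s_next = delta[(s, L[q])]
            Tp_low[((q, s), a, (q_next, s_next))] = p_low
            Tp_up[((q, s), a, (q_next, s_next))] = T_up[(q, a, q_next)]
    # Initial states (q0, delta(s0, L(q0))) and accepting states Q x F.
    initial = {(q0, delta[(s0, L[q0])]) for q0 in Q0}
    accepting = {(q, s) for q in Q for s in F}
    return Tp_low, Tp_up, initial, accepting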

We can now formulate our proposed problem:

Problem 1

Design an iterative algorithm to sample and learn the unknown dynamics of system (1) without violating the scLTL specification \phi, and synthesize a control policy which satisfies \phi with some desired threshold probability or prove that no such control policy exists.

To solve this problem, we construct a high-confidence IMDP abstraction of the system (1) using a GP estimation of the unknown dynamics. We then formulate a method to sample the state-space without violating the specification, updating the GP estimation until a satisfying control policy can be synthesized.

III Abstraction of System as IMDP

In this section, we detail our approach to abstracting a system of the form (1) into a high-confidence IMDP.

We first need to determine an approximation of g(x), the unknown dynamics. At each time step of system (1), we know x[k+1], f(x[k]), and u[k]. Therefore, we can define

y[k]=x[k+1]-f(x[k])-u[k]=g(x[k])+\nu[k].

Then, we construct a Gaussian process estimation \hat{g}(x) for g(x) by considering a dataset of samples (x[k],y[k]).

Definition 6 (Gaussian Process Regression)

Gaussian Process (GP) regression models a function g_{i}:\mathbb{R}^{n}\to\mathbb{R} as a distribution with covariance \kappa:\mathbb{R}^{n}\times\mathbb{R}^{n}\xrightarrow{}\mathbb{R}_{>0}. Assume a dataset of m samples D=\{(z^{j},y_{i}^{j})\}_{j\in\{1,\ldots,m\}}, where z^{j}\in\mathbb{R}^{n} is the input and y^{j}_{i} is an observation of g_{i}(z^{j}) under Gaussian noise with variance \sigma_{\nu_{i}}^{2}. Let K\in\mathbb{R}^{m\times m} be a matrix defined elementwise by K_{j\ell}=\kappa(z^{j},z^{\ell}) and, for z\in\mathbb{R}^{n}, let k(z)=[\kappa(z,z^{1})\ \kappa(z,z^{2})\ \ldots\ \kappa(z,z^{m})]^{T}\in\mathbb{R}^{m}. Then, the predictive distribution of g_{i} at a test point z is the conditional distribution of g_{i} given D, which is Gaussian with mean \mu_{g_{i},D} and variance \sigma_{g_{i},D}^{2} given by

\mu_{g_{i},D}(z)=k(z)^{T}(K+\sigma_{\nu_{i}}^{2}I_{m})^{-1}Y, (7)
\sigma_{g_{i},D}^{2}(z)=\kappa(z,z)-k(z)^{T}(K+\sigma_{\nu_{i}}^{2}I_{m})^{-1}k(z), (8)

where I_{m} is the identity and Y=[y^{1}_{i}\ y^{2}_{i}\ \ldots\ y^{m}_{i}]^{T}.
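The predictive equations (7)–(8) can be evaluated directly; the numpy sketch below does so with a squared exponential kernel whose hyperparameters are borrowed from the case study of Section V purely for concreteness.

import numpy as np

def kernel(z1, z2, sigma_g=0.45, l=1.75):
    # Squared exponential covariance, cf. (23); z1 is (m1 x n), z2 is (m2 x n).
    d2 = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return sigma_g ** 2 * np.exp(-d2 / (2 * l ** 2))

def gp_posterior(Z, Y, z_test, sigma_nu=0.1):
    # Predictive mean (7) and variance (8) at test points z_test,
    # given inputs Z (m x n) and scalar observations Y (m,).
    K = kernel(Z, Z)
    k_star = kernel(Z, z_test)                                  # m x n_test
    Kinv = np.linalg.inv(K + sigma_nu ** 2 * np.eye(len(Z)))
    mean = k_star.T @ Kinv @ Y
    var = np.diag(kernel(z_test, z_test)) - np.sum(k_star * (Kinv @ k_star), axis=0)
    return mean, var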

In practice, GP regression has a complexity of O(m^{3}). To mitigate this, we use sparse Gaussian process regression [10]:

Definition 7 (Sparse Gaussian Process Regression)

A sparse Gaussian process uses a set D_{\eta}=\{(z^{j},y_{i}^{j})\}_{j\in\{1,\ldots,\eta\}} to approximate a GP of a larger dataset D. Given inducing points \{z^{j}\}_{j\in\{1,\ldots,\eta\}} with Y_{\eta}=[y^{1}_{i}\ y^{2}_{i}\ \ldots\ y^{\eta}_{i}]^{T} and covariance matrix A_{\eta}, the predictive distribution of the unknown function g_{i} has mean \mu_{g_{i},D_{\eta}} and variance \sigma_{g_{i},D_{\eta}}^{2}

\mu_{g_{i},D_{\eta}}(z)=k_{\eta}(z)^{T}(K_{\eta}+\sigma_{\nu_{i}}^{2}I_{\eta})^{-1}Y_{\eta},
\sigma_{g_{i},D_{\eta}}^{2}(z)=\kappa(z,z)-k_{\eta}(z)^{T}K_{\eta}^{-1}(K_{\eta}-A_{\eta})K_{\eta}^{-1}k_{\eta}(z),

where K_{\eta}\in\mathbb{R}^{\eta\times\eta} is a matrix defined elementwise by K_{\eta,j\ell}=\kappa(z^{j},z^{\ell}) over the inducing inputs, and for z\in\mathbb{R}^{n}, k_{\eta}(z)=[\kappa(z,z^{1})\ \kappa(z,z^{2})\ \ldots\ \kappa(z,z^{\eta})]^{T}\in\mathbb{R}^{\eta}. The parameters \{z^{j}\}_{j\in\{1,\ldots,\eta\}}, \{y_{i}^{j}\}_{j\in\{1,\ldots,\eta\}}, and A_{\eta} are optimized to minimize the Kullback-Leibler divergence (evaluated at the inducing points) between \mathcal{N}(\mu_{g_{i},D_{\eta}},\sigma_{g_{i},D_{\eta}}^{2}), the posterior of g_{i} under the sparse GP, and p(g_{i}|Y), the posterior of g_{i} under a GP with the full dataset D. We refer the reader to [10] for a detailed treatment of sparse Gaussian process theory. The computational complexity of sparse GP regression is O(m\eta^{2}), so by fixing \eta the algorithm is linear in m. We note that sparse GP regression introduces error into the estimation; however, in practice this error does not affect the validity of our methods.
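In practice one would typically rely on an off-the-shelf sparse GP implementation rather than coding the variational optimization by hand. The sketch below shows how this could look with GPflow's SGPR model; the library choice, the toy data, and the hyperparameter values are assumptions for illustration, not part of the method.

import numpy as np
import gpflow

# Toy data standing in for the residual observations y[k] = g(x[k]) + nu[k].
X = np.random.uniform(0.0, 5.0, size=(1000, 2))
Y = 0.1 * np.sin(X[:, :1]) + 0.1 * np.random.randn(1000, 1)

Z = X[np.random.choice(len(X), 250, replace=False)]   # eta = 250 inducing points
model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(variance=0.45**2, lengthscales=1.75),
    inducing_variable=Z,
)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
mean, var = model.predict_f(np.array([[2.5, 2.5]]))   # posterior at a test point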

Given some dataset D, we construct an estimate of the unknown dynamics independently in each coordinate and determine high-confidence bounds on the estimation error

\hat{g}_{i}^{D}(x):=\mu_{g_{i},D}(x),
\gamma_{i}(x):=\beta\sigma_{g_{i},D}(x)\geq|g_{i}(x)-\hat{g}_{i}^{D}(x)|

for each i=1,\ldots,n. We also determine high-confidence lower and upper bounds on g(x) as

g_{-}(x)=\hat{g}^{D}(x)-\beta\sigma_{g,D}(x),\quad g_{+}(x)=\hat{g}^{D}(x)+\beta\sigma_{g,D}(x).

The parameter \beta is calculated as

\beta=\frac{\sigma_{\nu}}{\sqrt{1+(2/m)}}\Big(B_{i}+\sigma_{\nu}\sqrt{2(\gamma_{k}^{m}+1+\log\tfrac{1}{\delta})}\Big) (9)

for \sigma_{\nu}-sub-Gaussian noise, m the number of GP samples, high-confidence parameter \delta, information gain constant \gamma_{k}^{m}, and RKHS constant B_{i}, as detailed in Lemma 1 of [7]. Note that the same quantity \beta\sigma_{g,D} is used to determine high-confidence bounds on both the estimation error and on g(x) itself.
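A small sketch of evaluating (9) is given below. The values supplied for the RKHS bound B_i and the information gain constant are placeholders; in practice they come from Lemma 1 of [7].

import numpy as np

def beta(sigma_nu, m, B_i, gamma_km, delta=0.05):
    # High-confidence scaling parameter, cf. (9).
    return (sigma_nu / np.sqrt(1.0 + 2.0 / m)) * (
        B_i + sigma_nu * np.sqrt(2.0 * (gamma_km + 1.0 + np.log(1.0 / delta)))
    )

# With this beta, g_-(x) = g_hat(x) - beta*sigma_gD(x) and
# g_+(x) = g_hat(x) + beta*sigma_gD(x) give high-confidence bounds on g(x),
# with the confidence level governed by delta.
b = beta(sigma_nu=0.1, m=250, B_i=1.0, gamma_km=5.0)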
For each region q in the state-space, we select a high-confidence error bound for the unknown dynamics as

\gamma_{i}(q)=\max_{x\in X_{q}}\gamma_{i}(x).

In practice, we compute this bound by sampling \gamma_{i}(x) throughout the state-space, introducing a trade-off between approximation error and computational complexity. We now construct transition probability intervals assuming that the high-confidence bounds on the unknown dynamics always hold:

Theorem 1 (Construction of Transition Probabilities)

Consider q,q^{\prime}\in Q and action \alpha_{q^{*}}\in A(q). Lower bound \check{T} and upper bound \hat{T} transition probabilities from q to q^{\prime} under \alpha_{q^{*}} are given by

\check{T}(q,\alpha_{q^{*}},q^{\prime})=\prod_{i=1}^{n}\int_{a^{\prime}_{i}}^{b^{\prime}_{i}}\rho_{\nu_{i}}(z-x_{\min,i}(q,\alpha_{q^{*}},q^{\prime}))\,dz, (10)
\hat{T}(q,\alpha_{q^{*}},q^{\prime})=\prod_{i=1}^{n}\int_{a^{\prime}_{i}}^{b^{\prime}_{i}}\rho_{\nu_{i}}(z-x_{\max,i}(q,\alpha_{q^{*}},q^{\prime}))\,dz, (11)

where x_{\min,i} is the i-th coordinate of x_{\min} and similarly for x_{\max,i}, we recall \rho_{\nu_{i}} is the probability density function of the stochastic noise \nu_{i}, and a^{\prime} and b^{\prime} are the lower and upper boundary points for region q^{\prime}. We define x_{\min} and x_{\max} as

x_{\min}(q,\alpha_{q^{*}},q^{\prime})=\underset{x\in X}{\mathrm{argmax}}\ \|x-c_{q^{\prime}}\|_{1}\quad\text{s.t.}\ c_{q^{*}}-\gamma(q)\leq x\leq c_{q^{*}}+\gamma(q), (12)
x_{\max}(q,\alpha_{q^{*}},q^{\prime})=\underset{x\in X}{\mathrm{argmin}}\ \|x-c_{q^{\prime}}\|_{1}\quad\text{s.t.}\ c_{q^{*}}-\gamma(q)\leq x\leq c_{q^{*}}+\gamma(q), (13)

where \|\cdot\|_{1} is the 1-norm and \gamma(q) is a high-confidence error bound on the unknown dynamics satisfying Assumption 1.
Then, the transition probability bounds (10)–(11) satisfy the constraints for high-confidence IMDP abstractions in (5)–(6).
Proof:  The righthand side of the bound in equation (5) can be rewritten as

\min_{x\in X_{q}}\min_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(f(x)+w+K_{q^{*}}(x;\hat{g})+\nu\in X_{q^{\prime}}) (14)
=\min_{x\in X_{q}}\min_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(c_{q^{*}}+w-\hat{g}(x)+\nu\in X_{q^{\prime}}) (15)
=\min_{x\in X_{q}}\min_{-\gamma(x)\leq\omega\leq\gamma(x)}\mathbb{P}_{\nu}(c_{q^{*}}+\omega+\nu\in X_{q^{\prime}}) (16)
=\min_{x\in X_{q}}\min_{-\gamma(x)\leq\omega\leq\gamma(x)}\prod_{i=1}^{n}\mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) (17)
=\min_{x\in X_{q}}\prod_{i=1}^{n}\min_{-\gamma_{i}(x)\leq\omega_{i}\leq\gamma_{i}(x)}\mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) (18)
\geq\prod_{i=1}^{n}\min_{-\gamma_{i}(q)\leq\omega_{i}\leq\gamma_{i}(q)}\mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) (19)

where (14) is the righthand side of (5); (15) follows after expanding the feedback controller expression K_{q^{*}}(x;\hat{g}) using (4) and simplifying; (16) follows by the assumption of the high-confidence error bound \gamma(x) and the definition of g_{-}(x) and g_{+}(x) from Assumption 1, taking \omega=w-\hat{g}(x); (17) follows by the assumption that each \nu_{i} is independent, where \mathbb{P}_{\nu_{i}} denotes probability with respect to \nu_{i} and we recall that a_{q^{\prime}} and b_{q^{\prime}} are the lower and upper corners of the region X_{q^{\prime}}, with a_{q^{\prime},i} the i-th coordinate of a_{q^{\prime}} and similarly for c_{q^{*},i} and b_{q^{\prime},i}; (18) follows from the fact that the hyper-rectangular constraint -\gamma(x)\leq\omega\leq\gamma(x) is equivalent to the independent constraints -\gamma_{i}(x)\leq\omega_{i}\leq\gamma_{i}(x) along each coordinate; and (19) follows from the definition \gamma_{i}(q)=\max_{x\in X_{q}}\gamma_{i}(x).

Now, because the probability distribution of each random variable \nu_{i} is assumed unimodal and symmetric, \mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) is minimized when the distance between (c_{q^{*},i}+\omega_{i}) and the midpoint of [a_{q^{\prime},i},b_{q^{\prime},i}] is maximized, i.e., when |c_{q^{*},i}+\omega_{i}-c_{q^{\prime},i}| is maximized, subject to the constraint -\gamma_{i}(q)\leq\omega_{i}\leq\gamma_{i}(q). Substituting x=c_{q^{*}}+\omega, and observing that \|x-c_{q^{\prime}}\|_{1}=\sum_{i=1}^{n}|x_{i}-c_{q^{\prime},i}|, this is exactly the maximizing point specified by x_{\min}(q,\alpha_{q^{*}},q^{\prime}) in (12). Thus, the expression in (19) is equivalent to

\prod_{i=1}^{n}\mathbb{P}_{\nu_{i}}(x_{\min,i}(q,\alpha_{q^{*}},q^{\prime})+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]), (20)

which in turn is equivalent to the righthand side of (10), establishing the bound in (5). An analogous argument establishes that (11) satisfies (6). ∎

We construct a high-confidence IMDP abstraction of the system using the hyper-rectangular partition regions as states, high-confidence bounds on the unknown dynamics obtained via GP regression, and transition probability intervals calculated using Theorem 1, solving the first part of Problem 1.
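To make Theorem 1 concrete, the sketch below computes one interval (10)–(11) for a two-dimensional system under the simplifying assumption of Gaussian noise, so that the per-coordinate integrals are differences of Gaussian CDFs (the case study actually uses truncated Gaussians; the truncation is ignored here for brevity). The optimizers (12)–(13) are found coordinate-wise over the box c_{q*} ± γ(q). All numerical values are illustrative.

import numpy as np
from scipy.stats import norm

def transition_interval(c_qstar, gamma_q, a_qprime, b_qprime, sigma_nu):
    # Lower/upper transition probability bounds (10)-(11) into the region
    # [a_qprime, b_qprime], when the controller targets c_qstar and the
    # GP error bound over the source region is gamma_q (all length-n arrays).
    c_dest = (a_qprime + b_qprime) / 2.0
    T_low, T_up = 1.0, 1.0
    for i in range(len(c_qstar)):
        lo, hi = c_qstar[i] - gamma_q[i], c_qstar[i] + gamma_q[i]
        # (12)-(13), per coordinate: farthest / closest point to the destination center.
        x_min_i = lo if abs(lo - c_dest[i]) >= abs(hi - c_dest[i]) else hi
        x_max_i = np.clip(c_dest[i], lo, hi)
        T_low *= norm.cdf(b_qprime[i], x_min_i, sigma_nu) - norm.cdf(a_qprime[i], x_min_i, sigma_nu)
        T_up *= norm.cdf(b_qprime[i], x_max_i, sigma_nu) - norm.cdf(a_qprime[i], x_max_i, sigma_nu)
    return T_low, T_up

# One transition on a unit grid resembling the case-study geometry.
T_low, T_up = transition_interval(
    c_qstar=np.array([1.5, 0.5]), gamma_q=np.array([0.2, 0.2]),
    a_qprime=np.array([1.0, 0.0]), b_qprime=np.array([2.0, 1.0]), sigma_nu=0.1)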

IV Safe Sampling of PIMDP

IV-A Probability of Satisfaction Calculation

Given a high-confidence IMDP abstraction of the system and an FSA of a desired scLTL specification, we construct a PIMDP using Definition 5. We first introduce the concept of control policies and adversaries:

Definition 8 (Control Policy)

A control policy \pi\in\Pi of a PIMDP is a mapping (Q\times S)^{+}\xrightarrow{}A, where (Q\times S)^{+} is the set of finite sequences of states of the PIMDP.

Definition 9 (PIMDP Adversary)

Given a PIMDP state (q,s) and action \alpha, an adversary \xi\in\Xi is an assignment of transition probabilities T_{\xi}^{\prime} to all states (q^{\prime},s^{\prime}) such that

\check{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime}))\leq T_{\xi}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime}))\leq\hat{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime})).

In particular, we use a minimizing adversary, which realizes transition probabilities such that the probability of satisfying the specification is minimal, and a maximizing adversary, which maximizes the probability of satisfaction.

To find safe sampling cycles in the PIMDP, we calculate

\check{P}_{\max}((q,s)\models\phi)=\max_{\pi\in\Pi}\min_{\xi\in\Xi}P(w\models\phi\ |\ \pi,\xi,w[0]=(q,s)),

which is the probability that a random path w starting at PIMDP state (q,s) satisfies the scLTL specification \phi under a maximizing control policy \pi and minimizing adversary \xi.

Additionally, we use the best-case probability of satisfaction under a maximizing control policy and adversary:

\hat{P}_{\max}((q,s)\models\phi)=\max_{\pi\in\Pi}\max_{\xi\in\Xi}P(w\models\phi\ |\ \pi,\xi,w[0]=(q,s)).

To calculate these probabilities, we use the value iteration method proposed in Section V of [14].
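The sketch below illustrates one Bellman backup of such an interval value iteration for the lower bound \check{P}_{\max}: for each state–action pair, the minimizing adversary pushes as much probability mass as the intervals allow onto low-value successors, and the policy then maximizes over actions. This is a simplified illustration of the standard greedy assignment used for interval MDPs (cf. [14]); it omits convergence checks, failure-state handling, and policy extraction, and the data structures are assumed.

def worst_case_expectation(succ, V):
    # succ: list of (state, p_low, p_up); V: dict of current values.
    # Assign probabilities within [p_low, p_up] (assumed to admit a distribution,
    # i.e. sum of lower bounds <= 1 <= sum of upper bounds) so that the expected
    # value is minimized: give the leftover mass to low-value successors first.
    order = sorted(succ, key=lambda t: V[t[0]])          # ascending value
    probs = {s: p_low for s, p_low, _ in succ}
    remaining = 1.0 - sum(probs.values())
    for s, p_low, p_up in order:
        add = min(p_up - p_low, remaining)
        probs[s] += add
        remaining -= add
    return sum(probs[s] * V[s] for s, _, _ in succ)

def bellman_backup(transitions, V, accepting):
    # One synchronous backup over all PIMDP states.
    # transitions: {state: {action: [(succ_state, p_low, p_up), ...]}}.
    V_new = {}
    for state, actions in transitions.items():
        if state in accepting:
            V_new[state] = 1.0
        else:
            V_new[state] = max(worst_case_expectation(succ, V) for succ in actions.values())
    return V_new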

IV-B Nonviolating Sub-Graph Generation

We note that scLTL specifications may generate FSA states which are absorbing and non-accepting, i.e., it is impossible to satisfy the specification once one of these states is reached. Such states may also exist in PIMDP constructions even without appearing in the corresponding FSA. We define these states as those which have zero probability of satisfying the scLTL specification under any control policy and adversary:

\text{Failure States}=\{(q,s)\in Q\times S\ |\ \hat{P}_{\max}((q,s)\models\phi)=0\}.

We can then define a notion of specification nonviolation:

Definition 10 (Nonviolating PIMDP)

A PIMDP \mathcal{P} is nonviolating with respect to a scLTL specification \phi if there are no failure states in \mathcal{P}.

Our algorithm for calculating a nonviolating PIMDP is as follows. We first initialize a set of failure states. Then, we loop through all non-failure states and prune actions which have nonzero upper-bound transition probability to failure states. We check if this pruning has left any states with no available actions, designating these also as failure states to prune. The process continues until no new failure states are found. Our nonviolating sub-graph is the set of all unpruned states with their remaining actions.

IV-C Candidate Cycle Selection

Now that we have a nonviolating sub-graph of our PIMDP, we want to select a path which we can take in order to sample the state-space indefinitely while maximizing the information gain of our Gaussian process. To do this, we first recall the concept of maximal end components [15]:

Definition 11 (End Component [15])

An end component of a finite PIMDP \mathcal{P} is a pair (\mathcal{T},Act) with \mathcal{T}\subseteq(Q\times S) and Act:\mathcal{T}\rightarrow A such that

  • \emptyset\neq Act(q,s)\subseteq A(q) for all states (q,s)\in\mathcal{T},

  • (q,s)\in\mathcal{T} and \alpha\in Act(q,s) implies \{(q^{\prime},s^{\prime})\in Q\times S\ |\ \hat{T}(q,\alpha,q^{\prime})>0,\ s^{\prime}\in\delta(s,L(q))\}\subseteq\mathcal{T},

  • The digraph G_{(\mathcal{T},Act)} induced by (\mathcal{T},Act) is strongly connected.

Definition 12 (Maximal End Component (MEC) [15])

An end component (\mathcal{T},Act) of a finite PIMDP \mathcal{P} is maximal if there is no end component (\mathcal{T}^{*},Act^{*}) such that (\mathcal{T},Act)\neq(\mathcal{T}^{*},Act^{*}), \mathcal{T}\subseteq\mathcal{T}^{*}, and Act(q,s)\subseteq Act^{*}(q,s) for all (q,s)\in\mathcal{T}.

PIMDP abstractions have the property that any infinite path will eventually stay in a single MEC. We propose the following heuristic in order to select a MEC to cycle within. First, we calculate \check{P}_{\max} from our initial state to each candidate MEC. We reject any MEC which we cannot reach with probability 1, or, in case no MECs can be reached with probability 1, we immediately select the MEC with the highest reachability probability. If multiple candidate MECs remain, we then calculate the Gaussian process covariance \kappa(c_{q},c_{q^{*}}) between the centers of the IMDP states q in each remaining candidate MEC and the accepting IMDP state q^{*}. We sum the covariances for all states in each MEC and select the MEC with the highest total covariance score, which corresponds to maximum information gain [16], defined as the reduction of GP uncertainty at the accepting state. We generate a control policy by selecting the actions at each state which give the maximum probability of reaching the MEC. Once in the MEC, we use a controller which cycles through the available actions.
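The covariance-based scoring of candidate MECs can be sketched as follows. The dictionary structure of `mecs` (each entry holding its IMDP states and its reachability probability) and the kernel hyperparameters are assumptions made for illustration.

import numpy as np

def se_kernel(z1, z2, sigma_g=0.45, l=1.75):
    # Squared exponential covariance between two region centers, cf. (23).
    return sigma_g**2 * np.exp(-np.sum((z1 - z2)**2) / (2 * l**2))

def select_mec(mecs, centers, c_accept):
    # Heuristic of Section IV-C: keep MECs reachable with probability 1 and,
    # among them, pick the one whose states have the largest total covariance
    # with the accepting region's center (a proxy for information gain).
    candidates = [m for m in mecs if m["reach_prob"] >= 1.0]
    if not candidates:
        # No MEC reachable with probability 1: fall back to the most reachable one.
        return max(mecs, key=lambda m: m["reach_prob"])
    score = lambda m: sum(se_kernel(centers[q], c_accept) for q in m["states"])
    return max(candidates, key=score)

# Example: of two equally reachable MECs, the one closer to the accepting region wins.
centers = {0: np.array([0.5, 0.5]), 1: np.array([3.5, 3.5])}
mecs = [{"states": [0], "reach_prob": 1.0}, {"states": [1], "reach_prob": 1.0}]
best = select_mec(mecs, centers, c_accept=np.array([4.5, 4.5]))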

By applying the algorithms detailed above to calculate a non-violating PIMDP and MEC, we generate a control policy which samples the state-space indefinitely without violating the specification, solving the second part of Problem 1.

IV-D Iterative Sampling Algorithm

We now detail our complete method to solve Problem 1. Given a scLTL specification \phi which we want to satisfy with probability P_{\text{sat}}, we construct a PIMDP using a high-confidence IMDP abstraction of the system in Eq. (1) and an FSA which models \phi. Then, we calculate reachability probabilities under a minimizing adversary, \check{P}_{\max}, from the initial states in the PIMDP to the accepting states. If \check{P}_{\max}\geq P_{\text{sat}}, then the control policy selects the actions which produce \check{P}_{\max} at each state and the problem is solved. Otherwise, we calculate a control policy to sample the state-space without violating the specification \phi using the methods in the previous sections. We follow the calculated control policy for a predetermined number of steps and sample the unknown dynamics at each step. We batch update the GP with the data collected, reconstruct transition probability intervals for each state, and recalculate the reachability probabilities \check{P}_{\max} for our initial states. If \check{P}_{\max}\geq P_{\text{sat}}, a satisfying control policy is found; otherwise, we repeat the process above. Our iterative algorithm ends when \check{P}_{\max}\geq P_{\text{sat}}; when the GP approximation has low enough uncertainty to conclude that a successful control policy cannot be synthesized, i.e., when the reachability probability \hat{P}_{\max} under a maximizing adversary is less than the desired P_{\text{sat}}; or when a maximum number of iterations has been reached.

V Case Study

Suppose we have a mobile robot in a 2D state-space with position x\in X:=[0,5]^{2}\subset\mathbb{R}^{2}. The state-space is partitioned into a set of 25 hyper-rectangular regions corresponding to IMDP states. The dynamics of the robot are

x[k+1]=x[k]+u[k]+g(x[k])+\nu[k] (21)

where g(x) models the unknown effect of the slope of the terrain. The control action u is generated by the family of controllers in Section II, where the available target regions for each region are those immediately to its left, right, above, and below.

Within the state-space, we have one goal region with the atomic proposition Goal and a set of hazard regions labeled with Haz. These yield the scLTL specification

\phi_{1}=\lnot\texttt{Haz}\ \mathcal{U}\ \texttt{Goal}. (22)

An illustration of the state-space is shown in Figure 2. We choose a low-dimensional case study in order to illustrate our methodology. Future work will refine our algorithms on applications with higher-dimensional state-spaces.

Figure 2: State-space of the case study. The initial region is labeled with "Init", the (green) target region is labeled with "Goal", and the (red) hazard regions are labeled with "Haz". States that eventually enter the safe cycle are blue, and the number in the region indicates the iteration of the algorithm at which the state enters the safe cycle. States which are not numbered do not enter the safe cycle. The yellow trace is an example of a sampling run.

The true g(x) is sampled from two randomly generated Gaussian processes (one for each dimension) with bounded support [-0.4,0.4] and squared exponential kernel \kappa,

\kappa(x,x^{\prime})=\sigma_{g}^{2}e^{-\frac{(x-x^{\prime})^{2}}{2l^{2}}}. (23)

We choose hyperparameters \sigma_{g}=0.45 and l=1.75.

We estimate the unknown dynamics with two sparse Gaussian processes with the same kernel as the true dynamics. We sample the GPs at 100 points in each region to determine error bounds. We set the number of inducing points \eta=250 and choose the high-confidence-bound parameter \beta=2. Each iteration of the algorithm takes 250 steps, so the total number of data samples m is the number of iterations times 250. The stochastic noise \nu is independently drawn from two truncated Gaussian distributions, one for each dimension, both with \sigma_{\nu}=0.1 and bounded support [-0.2,0.2].
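For concreteness, the case-study setup can be approximated as below: the squared exponential kernel (23) with the stated hyperparameters, a surrogate "true" unknown dynamics drawn from the GP prior on a fixed grid and clipped to the bounded support, and truncated Gaussian noise. The grid resolution, the random seed, and the clipping used to enforce the bounded support are simplifying assumptions for illustration.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
sigma_g, l = 0.45, 1.75

def se_kernel(Z1, Z2):
    # Squared exponential kernel (23) evaluated between two sets of points.
    d2 = np.sum((Z1[:, None, :] - Z2[None, :, :])**2, axis=-1)
    return sigma_g**2 * np.exp(-d2 / (2 * l**2))

# Sample a surrogate unknown dynamics g on a grid from the GP prior, one sample
# per dimension, clipped to the bounded support [-0.4, 0.4] of the case study.
grid = np.stack(np.meshgrid(np.linspace(0, 5, 26), np.linspace(0, 5, 26)), -1).reshape(-1, 2)
K = se_kernel(grid, grid) + 1e-8 * np.eye(len(grid))
g_true = np.clip(rng.multivariate_normal(np.zeros(len(grid)), K, size=2).T, -0.4, 0.4)

# Truncated Gaussian process noise with sigma_nu = 0.1 and support [-0.2, 0.2].
sigma_nu = 0.1
nu = truncnorm.rvs(-0.2 / sigma_nu, 0.2 / sigma_nu, scale=sigma_nu, size=2, random_state=0)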

We next apply the iterative algorithm described in Section IV-D, setting the desired probability of satisfying the specification to 1. Our algorithm successfully finds a satisfying feedback control strategy in an average of 15 iterations (calculated over 10 runs). The algorithm is implemented in Python on a 2.5 GHz Intel Core i9 machine with 16 GB of RAM and an Nvidia RTX 3060 GPU, and requires on average 1 minute 14 seconds to complete.

Figure 2 depicts the expansion of the safe cycle used to sample the state-space. Initially, only the left two columns of states are safe and reachable. As the algorithm progresses, more states and actions are added to the safe cycle, moving the system closer to the goal until the unknown dynamics can be estimated with enough certainty to achieve a probability of satisfying the specification of 1.

The left plot in Figure 3 depicts the total transition probability uncertainty for the system after each iteration,

T_{{\rm unc},{\rm total}}=\sum_{q\in Q}\sum_{\alpha\in A(q)}\sum_{q^{\prime}\in Q}\big(\hat{T}(q,\alpha,q^{\prime})-\check{T}(q,\alpha,q^{\prime})\big). (24)
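This quantity is straightforward to track across iterations; a minimal sketch, assuming the interval bounds are stored as parallel dictionaries keyed by (q, \alpha, q'):

def total_uncertainty(T_low, T_up):
    # Total transition probability uncertainty (24): the sum, over all
    # state-action-successor triples, of the interval widths.
    return sum(T_up[key] - T_low[key] for key in T_up)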

The right plot in Figure 3 shows the probability of satisfying the specification after each iteration.

VI Conclusion

In this work, we developed a method to safely learn unknown dynamics for a system motivated by the robotic motion-planning problem. Our approach uses an IMDP abstraction of the system and a finite state automaton of scLTL specifications. We designed an algorithm for finding nonviolating paths within a product IMDP construction which can be used to sample the state-space and construct a Gaussian process approximation of unknown dynamics. We then detailed an algorithm to iteratively sample the state-space to improve the probability of satisfying a desired specification and demonstrated its use with a case study of robot navigation. Our approach can be used with any system for which a high-confidence IMDP abstraction can be constructed as well as any objective which can be written as a scLTL specification. Future work will apply these methods to models of bipedal walking robots utilizing region-based motion planning [13].

Figure 3: The left plot shows the total uncertainty in transition probability intervals after each iteration of the algorithm, and the right plot shows the probability of satisfying the specification after each iteration. Results are plotted over 10 runs of the algorithm. The uncertainty decreases as more data samples are collected, and likewise the probability of satisfaction increases once the safe cycle has expanded close enough to the goal.

References

  • [1] P. Tabuada, Verification and Control of Hybrid Systems: A Symbolic Approach, 1st ed.   Springer Publishing Company, Incorporated, 2009.
  • [2] C. Belta, B. Yordanov, and E. Göl, Formal Methods for Discrete-Time Dynamical Systems, ser. Studies in Systems, Decision and Control.   Springer International Publishing, 2017.
  • [3] C. Mark and S. Liu, “Stochastic MPC with distributionally robust chance constraints,” IFAC, vol. 53, no. 2, pp. 7136–7141, 2020.
  • [4] A. Lavaei, S. Soudjani, A. Abate, and M. Zamani, “Automated verification and synthesis of stochastic hybrid systems: A survey,” 2021.
  • [5] C. Baier, B. Haverkort, H. Hermanns, and J.-P. Katoen, “Model-checking algorithms for continuous-time Markov chains,” IEEE Transactions on Software Engineering, vol. 29, no. 6, pp. 524–541, Jun. 2003.
  • [6] M. Ahmadi, A. Israel, and U. Topcu, “Safety assessment based on physically-viable data-driven models,” in 56th IEEE CDC, Dec. 2017, pp. 6409–6414.
  • [7] J. Jackson, L. Laurenti, E. Frew, and M. Lahijanian, “Strategy synthesis for partially-known switched stochastic systems,” in Proceedings of HSCC ’21, pp. 1–11.
  • [8] C. K. I. Williams and C. E. Rasmussen, “Gaussian processes for regression,” in Advances in neural information processing systems 8.   MIT press, 1996, pp. 514–520.
  • [9] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with Gaussian processes,” in 53rd IEEE CDC, Dec. 2014, pp. 1424–1431.
  • [10] F. Leibfried, V. Dutordoir, S. John, and N. Durrande, “A tutorial on sparse gaussian processes and variational inference,” 2021, arXiv: 2012.13962.
  • [11] A. A. Julius, A. Halasz, M. S. Sakar, H. Rubin, V. Kumar, and G. J. Pappas, “Stochastic modeling and control of biological systems: The lactose regulation system of Escherichia Coli,” IEEE Transactions on Automatic Control, vol. 53, no. Special Issue, pp. 51–65, 2008.
  • [12] E. Altman, T. Başar, and R. Srikant, “Congestion control as a stochastic control problem with action delays,” Automatica, vol. 35, no. 12, pp. 1937–1950, 1999.
  • [13] A. Shamsah, J. Warnke, Z. Gu, and Y. Zhao, “Integrated Task and Motion Planning for Safe Legged Navigation in Partially Observable Environments,” 2021, arXiv: 2110.12097.
  • [14] M. Lahijanian, S. B. Andersson, and C. Belta, “Formal Verification and Synthesis for Discrete-Time Stochastic Systems,” IEEE Transactions on Automatic Control, vol. 60, no. 8, pp. 2031–2045, Aug. 2015.
  • [15] C. Baier and J.-P. Katoen, Principles of Model Checking.   MIT Press, 2008.
  • [16] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-theoretic regret bounds for gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, p. 3250–3265, May 2012.

[Algorithm Pseudocodes] Algorithm 1 calculates a sub-graph of a PIMDP which is nonviolating with respect to a scLTL specification. It takes as input a PIMDP construction along with upper bounds \hat{P}_{\max} on the maximum probability of satisfying the specification for each state. In lines 1–2, failure states which have \hat{P}_{\max}=0 are identified. Next, in lines 4–10, non-failure states are looped through and any of their actions which have a nonzero upper bound probability of reaching the set of failure states are removed. In lines 12–17, states which have no actions remaining are added to the set of failure states and the algorithm returns to line 3. The loop from lines 3–18 repeats until no new failure states are identified. The algorithm returns the remaining non-failure states and actions as the nonviolating sub-graph.

Input: PIMDP \mathcal{P}, \hat{P}_{\max} for each state in \mathcal{P}
Output: PIMDP \mathcal{P}^{\prime} which is a nonviolating subset of \mathcal{P}
1  Initialize R=\{(q,s)\in\mathcal{P}\ |\ \hat{P}_{\max}((q,s)\models\phi)=0\};
2  Initialize U=R;
3  while R\neq\emptyset do
4    for (q,s)\in\mathcal{P}\setminus U do
5      for \alpha\in A do
6        if \hat{T}((q,s),\alpha,U)\neq 0 then
7          Remove \alpha from available actions at (q,s);
8        end if
9      end for
10   end for
11   R=\emptyset;
12   for (q,s)\in\mathcal{P}\setminus U do
13     if A((q,s))=\emptyset then
14       R=R\cup(q,s);
15       U=U\cup(q,s);
16     end if
17   end for
18 end while
return \mathcal{P}^{\prime}=\mathcal{P}\setminus U
Algorithm 1 Nonviolating Sub-Graph Generation

Algorithm 2 selects a maximal end component (MEC) of a PIMDP to sample, along with a corresponding control policy, maximizing the information gain with respect to learning the unknown dynamics. It takes as input a nonviolating sub-graph of a PIMDP. In line 1, the maximal end components of the sub-graph are identified. In lines 3–8, a lower bound \check{P}_{\max} on the maximum probability of reaching each MEC from the initial state of the original PIMDP is calculated. Those MECs which have \check{P}_{\max}=1 are added to the list of candidate MECs. In lines 9–11, if there are no candidate MECs found, then the algorithm selects the MEC with the highest \check{P}_{\max} as the MEC to cycle in. If there are candidate MECs, then in lines 12–17 each candidate MEC is assigned a score equal to the sum of the covariances between each state in the MEC and the accepting state q^{*} of the PIMDP. The MEC with the maximum covariance score is selected as the MEC to cycle in. In line 18, a control policy for the selected MEC is calculated which selects the actions at each state outside the MEC which have the maximum probability of reaching the MEC. For states within the MEC, the control policy cycles through the actions at each state which are available in the MEC. The algorithm returns the selected MEC along with its corresponding control policy.

Input: Nonviolating Sub-PIMDP \mathcal{P}^{\prime}
Output: Selected MEC (\mathcal{T}^{*},Act^{*}) and control policy \pi^{*}
1  Initialize M as the MECs of \mathcal{P}^{\prime};
2  Initialize C=\emptyset as the set of candidate MECs;
3  for (\mathcal{T}^{\dagger},Act^{\dagger})\in M do
4    Calculate reachability probability \check{P}_{\max} from initial state r_{0} of PIMDP \mathcal{P}^{\prime} to \mathcal{T}^{\dagger};
5    if \check{P}_{\max}=1 then
6      C=C\cup(\mathcal{T}^{\dagger},Act^{\dagger});
7    end if
8  end for
9  if C=\emptyset then
10   (\mathcal{T}^{*},Act^{*})=\underset{(\mathcal{T}^{\dagger},Act^{\dagger})\in M}{\mathrm{argmax}}\ \check{P}_{\max}(r_{0}\models\Diamond\mathcal{T}^{\dagger});
11 end if
12 else
13   for (\mathcal{T}^{\dagger},Act^{\dagger})\in C do
14     H = Sum of \kappa(c_{q},c_{q^{*}}) for all IMDP states q\in\mathcal{T}^{\dagger} w.r.t. accepting state q^{*};
15   end for
16   Find (\mathcal{T}^{*},Act^{*})\in C with maximum score H;
17 end if
18 Control policy \pi^{*} selects available actions in \mathcal{P}^{\prime} with maximum probability of reaching \mathcal{T}^{*} for states not in \mathcal{T}^{*} and cycles through actions Act^{*} for all states in \mathcal{T}^{*};
return Selected MEC (\mathcal{T}^{*},Act^{*}), control policy \pi^{*}
Algorithm 2 Nonviolating Cycle Selection

Algorithm 3 performs an iterative procedure to safely learn the unknown dynamics of the system (1) until a given scLTL specification can be satisfied with sufficient probability. It takes as input the system dynamics, a scLTL specification, and a desired probability of satisfaction P_{\rm sat}. In line 1, an IMDP abstraction of the system and an FSA of the specification are constructed. Then, the IMDP and FSA are combined into a PIMDP construction. Finally, a GP estimation \hat{g}(x) of the unknown dynamics is initialized with its hyperparameters. In lines 2–3, lower and upper bounds \check{P}_{\max} and \hat{P}_{\max} are calculated for the initial state in the PIMDP. If the lower bound probability \check{P}_{\max} is less than the desired P_{\rm sat}, the loop in lines 4–18 is entered. In lines 5–7, if the upper bound probability \hat{P}_{\max} is less than P_{\rm sat}, then the specification cannot be satisfied with sufficient probability regardless of how well the unknown dynamics are learned. Thus, the algorithm returns that no satisfying control policy exists. Otherwise, in lines 8–9, a nonviolating sub-graph of the PIMDP is calculated using Algorithm 1, and a MEC to cycle in, along with its corresponding control policy, is calculated from this sub-graph using Algorithm 2. In lines 10–13, this control policy is used to take a predefined number of steps to sample the unknown dynamics. In lines 14–15, the GP estimation \hat{g}(x) is updated using these samples, and transition probability intervals are recalculated for each state in the PIMDP. In lines 16–17, \check{P}_{\max} and \hat{P}_{\max} are recalculated for the initial state. If \check{P}_{\max}\geq P_{\rm sat}, the loop terminates and the algorithm returns a control policy calculated in lines 19–21 which selects the actions at each state which have the maximum probability of satisfying the specification. If \check{P}_{\max}<P_{\rm sat}, the loop repeats from line 4. If the maximum number of iterations of the loop is reached, the algorithm terminates without determining a satisfying control policy or the nonexistence thereof.

Input: System dynamics in (1), scLTL specification \phi, P_{\text{sat}}
Output: Satisfying control policy \pi^{\dagger} or proof of nonexistence
1  Construct IMDP \mathcal{I} from system (1), FSA \mathcal{A} from \phi, PIMDP \mathcal{P} from \mathcal{I} and \mathcal{A}, initial GP regression \hat{g}(x);
2  Calculate \check{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
3  Calculate \hat{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
4  while (\check{P}_{\max}<P_{\rm sat}) and (Count < MaxIterations) do
5    if \hat{P}_{\max}<P_{\rm sat} then
6      return No satisfying control policy exists;
7    end if
8    Find nonviolating sub-PIMDP \mathcal{P}^{\prime} using Algorithm 1;
9    Find MEC to cycle in with corresponding control policy \pi^{*} using Algorithm 2;
10   for NumInnerIterations do
11     Take action \pi^{*}(q) at current state q;
12     Sample y[k]=x[k+1]-f(x[k])-u[k];
13   end for
14   Construct GP \hat{g}(x) using collected samples y[k];
15   Recalculate transition probability intervals for each state in \mathcal{P};
16   Recalculate \check{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
17   Recalculate \hat{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
18 end while
19 if \check{P}_{\max}\geq P_{\text{sat}} then
20   return Control policy \pi^{\dagger}=\underset{\alpha\in A(q)}{\mathrm{argmax}}\ \check{P}_{\max}((q,s)\models\phi)\ \forall(q,s)\in\mathcal{P};
21 end if
Algorithm 3 Iterative Synthesis Algorithm