
Safe Learning for Uncertainty-Aware Planning via Interval MDP Abstraction

Jesse Jiang,  Ye Zhao,  and Samuel Coogan This work was supported in part by the National Science Foundation under grant #1924978.Jesse Jiang and Samuel Coogan are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: jjiang@gatech.edu, sam.coogan@gatech.edu). S. Coogan is also with the School of Civil and Environmental Engineering. Ye Zhao is with the School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: ye.zhao@me.gatech.edu).
Abstract

We study the problem of refining satisfiability bounds for partially-known stochastic systems against planning specifications defined using syntactically co-safe Linear Temporal Logic (scLTL). We propose an abstraction-based approach that iteratively generates high-confidence Interval Markov Decision Process (IMDP) abstractions of the system from high-confidence bounds on the unknown component of the dynamics obtained via Gaussian process regression. In particular, we develop a synthesis strategy to sample the unknown dynamics by finding paths which avoid specification-violating states using a product IMDP. We further provide a heuristic to choose among various candidate paths to maximize the information gain. Finally, we propose an iterative algorithm to synthesize a satisfying control policy for the product IMDP system. We demonstrate our work with a case study on mobile robot navigation.

Index Terms:
Automata, Hybrid systems, Markov processes, Gaussian process learning

I Introduction

Abstraction-based approaches for verification and synthesis of dynamical systems offer computationally tractable methods for accommodating complex specifications [1]. In particular, Interval Markov Decision Processes (IMDP) [2], which allow for an interval of transition probabilities, provide a rich abstraction model for stochastic systems. As compared to stochastic control [3], abstraction-based methods allow for more complex specifications to be considered and have been widely used for hybrid stochastic systems [4].

The transition probability intervals in IMDP abstractions have typically modeled the uncertainty that arises from abstracting the dynamics of continuous states into discrete regions [5]. However, partially-known stochastic systems, which show promise for modeling a wide range of real-world systems, add unknown dynamics that contribute further uncertainty. Previous works model this uncertainty by assuming that some prior data on the dynamics are available [6].

The paper [7] is the first to address the problem of modeling unknown dynamics in stochastic hybrid systems via the use of IMDP abstraction in combination with Gaussian process (GP) regression [8]. GP regression can approximate unknown functions with arbitrary accuracy and also provides bounds on the approximation uncertainty [9].

The main contribution of this work is to develop a method for sampling the unknown dynamics of a stochastic system online in order to reduce abstraction error and increase the probability of satisfying a syntactically co-safe linear temporal logic (scLTL) specification [2].

Our goal is to find a control policy which guarantees the satisfaction of a scLTL specification with sufficient probability. However, the system is subject to stochastic noise which creates unavoidable perturbations. The system also has unknown dynamics which we estimate with Gaussian processes. This creates an estimation error which increases uncertainty in state transitions and which we aim to reduce by sampling the unknown dynamics. Thus, this paper focuses on the problem of safe learning to allow online exploration rather than a static planning problem using previously collected data samples as in [7].

Our approach is as follows. First, we estimate the unknown dynamics of the system using Gaussian processes and construct a high-confidence IMDP abstraction. We then develop an algorithm for finding nonviolating cycles in the product IMDP formed from the system abstraction and a finite automaton of the scLTL specification; these cycles allow the dynamics of the system to be sampled without violating the specification. We develop a heuristic for evaluating candidate cycles in order to maximize the uncertainty reduction gained from sampling. Finally, we propose an iterative method to sample the state-space, thereby decreasing the uncertainty of a GP estimation of the unknown dynamics until a satisfying control policy for the system can be synthesized or a terminating condition such as a maximum number of iterations has been reached. We utilize sparse GPs [10] to improve computational efficiency. We demonstrate our method on a case study of robotic motion planning.

II Problem Setup

Consider a discrete-time, partially-known system

x[k+1]=f(x[k])+u[k]+g(x[k])+\nu[k] (1)

where x\in X\subseteq\mathbb{R}^{n} is the system state, u\in\mathbb{R}^{n} is the control action, f(x) is the known dynamics, g(x) is the unknown dynamics to be learned via GP regression, \nu is stochastic noise, and time is indexed with brackets. This system has applications in, e.g., biology [11], communications [12], and robotics [13].

Assumption 1

1) Each dimension \nu_{i}[k], i=1,\ldots,n, of \nu is an independent, zero-mean random variable with stationary, symmetric, and unimodal distribution \rho_{\nu_{i}} and is \sigma_{\nu_{i}}-sub-Gaussian, i.e., the distribution tail decays at least as fast as that of a Gaussian random variable with variance \sigma_{\nu_{i}}^{2}.

2) Given a data set D=\{(z^{j},y^{j})\}_{j=1}^{m} where y^{j} is an observation of g(z^{j}) perturbed by \sigma_{\nu_{i}}-sub-Gaussian noise, it is possible to construct an estimate \hat{g}^{D}(x) of g and bound the estimation error between g(x) and \hat{g}^{D}(x) by some high-confidence bound \gamma^{D}(x). Thus,

g_{-}^{D}(x)=\hat{g}^{D}(x)-\gamma^{D}(x),\quad g_{+}^{D}(x)=\hat{g}^{D}(x)+\gamma^{D}(x) (2)

are high-confidence bounds on g, i.e., g^{D}_{-}(x)\leq g(x)\leq g^{D}_{+}(x) with high confidence. For simplicity, we drop the superscript D when the dataset is clear.

Assumption 2

The state-space X is bounded and is partitioned into hyper-rectangular regions \{X_{q}\}_{q\in Q} defined as

X_{q}=\{x\mid a_{q}\leq x\leq b_{q}\}\subset X, (3)

where the inequality is taken elementwise for lower and upper bounds a_{q},b_{q}\in\mathbb{R}^{n} and Q is a finite index set of the regions. Each region has a center c_{q}=(a_{q}+b_{q})/2. Additionally, the system possesses a labeling function L which maps hyper-rectangular regions to observations O.

Define feedback controllers K_{q}(\cdot\ ;\hat{g}):X\xrightarrow{}X as

K_{q}(x;\hat{g})=c_{q}-f(x)-\hat{g}(x). (4)

The choice u[k]=K_{q^{\prime}}(x[k];\hat{g}) thus produces a control action which compensates for the known and estimated dynamics to reach the center of region X_{q^{\prime}}, although the actual state update will be perturbed as shown in Figure 1.
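As a concrete illustration, the following minimal Python sketch implements the feedback controller (4) and one closed-loop step of system (1). The dynamics f, g, the initial GP estimate, and the noise level are placeholder choices for illustration only; they are not prescribed by the method.

import numpy as np

def K_q(x, c_q, f, g_hat):
    # Feedback controller (4): steer toward the center of the target region
    # by compensating for the known dynamics f and the estimated dynamics g_hat.
    return c_q - f(x) - g_hat(x)

def step(x, c_q, f, g, g_hat, sigma_nu=0.1):
    # One step of system (1): the true unknown dynamics g and the noise nu
    # still perturb the state away from the targeted center c_q.
    u = K_q(x, c_q, f, g_hat)
    nu = np.random.normal(0.0, sigma_nu, size=x.shape)
    return f(x) + u + g(x) + nu

# Example with placeholder dynamics: f(x) = x and a small unknown drift.
f = lambda x: x
g = lambda x: 0.1 * np.sin(x)          # true unknown dynamics (unavailable in practice)
g_hat = lambda x: np.zeros_like(x)     # initial GP estimate before any data
x_next = step(np.array([0.3, 0.7]), c_q=np.array([0.5, 0.5]), f=f, g=g, g_hat=g_hat)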

Our ultimate goal is to apply a sequence of feedback controllers so that the resulting sequence of observations satisfies a control objective specified as a syntactically co-safe LTL (scLTL) formula over the observations O.

Definition 1 (Syntactically co-safe LTL [2, Def. 2.3])

A syntactically co-safe linear temporal logic (scLTL) formula \phi over a set of observations O is recursively defined as

\phi=\top\ |\ o\ |\ \lnot{o}\ |\ \phi_{1}\land\phi_{2}\ |\ \phi_{1}\lor\phi_{2}\ |\ \bigcirc\phi\ |\ \phi_{1}\,\mathcal{U}\,\phi_{2}\ |\ \Diamond\phi

where o\in O is an observation and \phi, \phi_{1}, and \phi_{2} are scLTL formulas. We define the next operator \bigcirc as meaning that \phi will be satisfied in the next state transition, the until operator \mathcal{U} as meaning that the system satisfies \phi_{1} until it satisfies \phi_{2}, and the eventually operator \Diamond as \top\,\mathcal{U}\,\phi.

ScLTL formulas are characterized by the property that they are satisfied in finite time. It is well-known that scLTL satisfaction can be checked using a finite state automaton:

Definition 2 (Finite State Automaton [2, Def. 2.4])

A finite state automaton (FSA) is a tuple \mathcal{A}=(S,s_{0},O,\delta,F), where

  • S is a finite set of states,

  • s_{0}\in S is the initial state,

  • O is the input alphabet, which corresponds to observations from the scLTL specification,

  • \delta:S\times O\xrightarrow{}S is a transition function, and

  • F\subseteq S is the set of accepting (final) states.

A sequence of inputs (a word) from O is said to be accepted by an FSA if it ends in an accepting state. An scLTL formula can always be translated into an FSA that accepts all and only those words satisfying the formula. We use scLTL specifications in this paper because they are well-suited to robotic motion planning tasks which are satisfied in finite time. Additionally, the simpler structure of an FSA as opposed to the Büchi and Rabin automata of general LTL enables the methods we propose below.
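For instance, the specification \lnot\texttt{Haz}\,\mathcal{U}\,\texttt{Goal} used later in the case study (Section V) translates into a three-state FSA. The sketch below is one illustrative hand encoding of that automaton as a Python transition dictionary, not the output of a translation tool; the state names are arbitrary.

# FSA for phi = (not Haz) U Goal over observations O = {"Goal", "Haz", "none"}.
# s0: specification not yet satisfied, acc: accepting, fail: absorbing failure state.
s_init = "s0"
accepting = {"acc"}
delta = {
    ("s0", "Goal"): "acc", ("s0", "Haz"): "fail", ("s0", "none"): "s0",
    ("acc", "Goal"): "acc", ("acc", "Haz"): "acc", ("acc", "none"): "acc",
    ("fail", "Goal"): "fail", ("fail", "Haz"): "fail", ("fail", "none"): "fail",
}

def accepts(word):
    # A finite word is accepted if it ends in an accepting state.
    s = s_init
    for o in word:
        s = delta[(s, o)]
    return s in accepting

print(accepts(["none", "none", "Goal"]))   # True
print(accepts(["none", "Haz", "Goal"]))    # False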

Figure 1: Feedback controller and calculation of transition probabilities. The controller targets the center of state X_{2}. The uncertainty in \hat{g}(x) creates a nondeterministic region of transition (brown rectangle). The maximum probability of transitioning to state X_{3} is found by centering the stochastic noise at the point x_{\max} closest to state X_{3} (green point) and calculating the probability that the noise reaches state X_{3}. The minimum probability of transitioning to state X_{3} under this controller is given likewise by centering the stochastic noise at the point x_{\min} furthest from X_{3} (red point).
Definition 3 (Interval Markov Decision Process)

An Interval Markov Decision Process (IMDP) is a tuple \mathcal{I}=(Q,A,\check{T},\hat{T},Q_{0},O,L) where:

  • Q is a finite set of states,

  • A is a finite set of actions,

  • \check{T},\hat{T}:Q\times A\times Q\xrightarrow{}[0,1] are lower and upper bounds, respectively, on the transition probability from state q\in Q to state q^{\prime}\in Q under action \alpha\in A,

  • Q_{0}\subseteq Q is a set of initial states,

  • O is a finite set of atomic propositions or observations,

  • L:Q\xrightarrow{}O is a labeling function.

The set of actions A corresponds to the set of all valid feedback controllers for the system. We do not assume that all actions are available at each state. Therefore, each state has a subset A(q)\subseteq A of available actions.

Definition 4 (High-Confidence IMDP Abstraction)

Consider stochastic system (1), partitions (3), and the family of feedback controllers (4) where \hat{g}(x) is an estimate of g(x). Further, suppose that g_{-}(x) and g_{+}(x) are high-confidence bounds on g(x) which satisfy (2). Then, an IMDP \mathcal{I}=(Q,A,\check{T},\hat{T},Q_{0},O,L) is a high-confidence IMDP abstraction of (1) if:

  • The set of states Q for the abstraction is the index set of partitions, i.e., partition X_{q} is abstracted as state q, and the set of observations O and labeling function L for the abstraction are the same as for the system,

  • For all q\in Q, the set of actions A(q) is the set of one-step reachable regions at q under its feedback controllers,

  • For all q\in Q and all \alpha_{q^{*}}\in A(q):

\check{T}(q,\alpha_{q^{*}},q^{\prime})\leq\min_{x\in X_{q}}\ \min_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(f(x)+w+K_{q^{*}}(x;\hat{g})+\nu\in X_{q^{\prime}}), (5)
\hat{T}(q,\alpha_{q^{*}},q^{\prime})\geq\max_{x\in X_{q}}\ \max_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(f(x)+w+K_{q^{*}}(x;\hat{g})+\nu\in X_{q^{\prime}}), (6)

where \mathbb{P}_{\nu} denotes probability with respect to \nu.

Verification and synthesis problems for IMDP systems evaluated against scLTL specifications are often solved using graph theoretic methods on a product IMDP:

Definition 5 (PIMDP)

Let \mathcal{I}=(Q,A,\check{T},\hat{T},Q_{0},O,L) be an IMDP and \mathcal{A}=(S,s_{0},O,\delta,F) be an FSA. The product IMDP (PIMDP) is defined as a tuple \mathcal{P}=\mathcal{I}\otimes\mathcal{A}=(Q\times S,A,\check{T}^{\prime},\hat{T}^{\prime},Q\times s_{0},F^{\prime}), where

  • \check{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime})):=\check{T}(q,\alpha,q^{\prime}) if s^{\prime}\in\delta(s,L(q)) and 0 otherwise,

  • \hat{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime})):=\hat{T}(q,\alpha,q^{\prime}) if s^{\prime}\in\delta(s,L(q)) and 0 otherwise,

  • (q_{0},\delta(s_{0},L(q_{0})))\in(Q\times S) is a set of initial states of \mathcal{I}\otimes\mathcal{A}, and

  • F^{\prime}=Q\times F is the set of accepting (final) states.
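A minimal sketch of the product construction in Definition 5 is given below. It assumes the IMDP transition bounds are stored as dictionaries keyed by (q, action, q'), with absent keys read as probability 0, and uses the FSA encoding from the earlier sketch; all names and data structures are illustrative assumptions.

def build_pimdp(Q, Q0, T_low, T_up, L, S, delta, s0, F):
    # PIMDP transition bounds per Definition 5: a transition
    # ((q,s) --a--> (q',s')) inherits the IMDP interval when s' = delta(s, L(q)),
    # and has probability 0 otherwise (encoded by leaving the key absent).
    Tp_low, Tp_up = {}, {}
    for (q, a, q_next), p_low in T_low.items():
        for s in S:
            s_next = delta[(s, L[q])]
            Tp_low[((q, s), a, (q_next, s_next))] = p_low
            Tp_up[((q, s), a, (q_next, s_next))] = T_up[(q, a, q_next)]
    # Initial states (q0, delta(s0, L(q0))) and accepting states Q x F.
    initial = {(q0, delta[(s0, L[q0])]) for q0 in Q0}
    accepting = {(q, s) for q in Q for s in F}
    return Tp_low, Tp_up, initial, accepting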

We can now formulate our proposed problem:

Problem 1

Design an iterative algorithm to sample and learn the unknown dynamics of system (1) without violating the scLTL specification \phi, and synthesize a control policy which satisfies \phi with some desired threshold probability or prove that no such control policy exists.

To solve this problem, we construct a high-confidence IMDP abstraction of the system (1) using a GP estimation of the unknown dynamics. We then formulate a method to sample the state-space without violating the specification, updating the GP estimation until a satisfying control policy can be synthesized.

III Abstraction of System as IMDP

In this section, we detail our approach to abstracting a system of the form (1) into a high-confidence IMDP.

We first need to determine an approximation of g(x), the unknown dynamics. At each time step of system (1), we know x[k+1], f(x[k]), and u[k]. Therefore, we can define

y[k]=x[k+1]-f(x[k])-u[k]=g(x[k])+\nu[k].

Then, we construct a Gaussian process estimation \hat{g}(x) for g(x) by considering a dataset of samples (x[k],y[k]).

Definition 6 (Gaussian Process Regression)

Gaussian Process (GP) regression models a function g_{i}:\mathbb{R}^{n}\to\mathbb{R} as a distribution with covariance \kappa:\mathbb{R}^{n}\times\mathbb{R}^{n}\xrightarrow{}\mathbb{R}_{>0}. Assume a dataset of m samples D=\{(z^{j},y_{i}^{j})\}_{j\in\{1,\ldots,m\}}, where z^{j}\in\mathbb{R}^{n} is the input and y^{j}_{i} is an observation of g_{i}(z^{j}) under Gaussian noise with variance \sigma_{\nu_{i}}^{2}. Let K\in\mathbb{R}^{m\times m} be a matrix defined elementwise by K_{j\ell}=\kappa(z^{j},z^{\ell}) and, for z\in\mathbb{R}^{n}, let k(z)=[\kappa(z,z^{1})\ \kappa(z,z^{2})\ \ldots\ \kappa(z,z^{m})]^{T}\in\mathbb{R}^{m}. Then, the predictive distribution of g_{i} at a test point z is the conditional distribution of g_{i} given D, which is Gaussian with mean \mu_{g_{i},D} and variance \sigma_{g_{i},D}^{2} given by

\mu_{g_{i},D}(z)=k(z)^{T}(K+\sigma_{\nu_{i}}^{2}I_{m})^{-1}Y, (7)
\sigma_{g_{i},D}^{2}(z)=\kappa(z,z)-k(z)^{T}(K+\sigma_{\nu_{i}}^{2}I_{m})^{-1}k(z), (8)

where I_{m} is the identity and Y=[y^{1}_{i}\ y^{2}_{i}\ \ldots\ y^{m}_{i}]^{T}.
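The predictive equations (7)–(8) can be evaluated directly; the numpy sketch below does so with a squared exponential kernel whose hyperparameters are borrowed from the case study of Section V purely for concreteness.

import numpy as np

def kernel(z1, z2, sigma_g=0.45, l=1.75):
    # Squared exponential covariance, cf. (23); z1 is (m1 x n), z2 is (m2 x n).
    d2 = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return sigma_g ** 2 * np.exp(-d2 / (2 * l ** 2))

def gp_posterior(Z, Y, z_test, sigma_nu=0.1):
    # Predictive mean (7) and variance (8) at test points z_test,
    # given inputs Z (m x n) and scalar observations Y (m,).
    K = kernel(Z, Z)
    k_star = kernel(Z, z_test)                                  # m x n_test
    Kinv = np.linalg.inv(K + sigma_nu ** 2 * np.eye(len(Z)))
    mean = k_star.T @ Kinv @ Y
    var = np.diag(kernel(z_test, z_test)) - np.sum(k_star * (Kinv @ k_star), axis=0)
    return mean, var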

In practice, GP regression has a complexity of O(m^{3}). To mitigate this, we use sparse Gaussian process regression [10]:

Definition 7 (Sparse Gaussian Process Regression)

A sparse Gaussian process uses a set D_{\eta}=\{(z^{j},y_{i}^{j})\}_{j\in\{1,\ldots,\eta\}} to approximate a GP of a larger dataset D. Given inducing points \{z^{j}\}_{j\in\{1,\ldots,\eta\}} with Y_{\eta}=[y^{1}_{i}\ y^{2}_{i}\ \ldots\ y^{\eta}_{i}]^{T} and covariance matrix A_{\eta}, the predictive distribution of the unknown function g_{i} has mean \mu_{g_{i},D_{\eta}} and variance \sigma_{g_{i},D_{\eta}}^{2}

\mu_{g_{i},D_{\eta}}(z)=k_{\eta}(z)^{T}(K_{\eta}+\sigma_{\nu_{i}}^{2}I_{\eta})^{-1}Y_{\eta},
\sigma_{g_{i},D_{\eta}}^{2}(z)=\kappa(z,z)-k_{\eta}(z)^{T}K_{\eta}^{-1}(K_{\eta}-A_{\eta})K_{\eta}^{-1}k_{\eta}(z),

where K_{\eta}\in\mathbb{R}^{\eta\times\eta} is a matrix defined elementwise by K_{\eta,j\ell}=\kappa(z^{j},z^{\ell}) over the inducing inputs, and for z\in\mathbb{R}^{n}, k_{\eta}(z)=[\kappa(z,z^{1})\ \kappa(z,z^{2})\ \ldots\ \kappa(z,z^{\eta})]^{T}\in\mathbb{R}^{\eta}. The parameters \{z^{j}\}_{j\in\{1,\ldots,\eta\}}, \{y_{i}^{j}\}_{j\in\{1,\ldots,\eta\}}, and A_{\eta} are optimized to minimize the Kullback-Leibler divergence (evaluated at the inducing points) between \mathcal{N}(\mu_{g_{i},D_{\eta}},\sigma_{g_{i},D_{\eta}}^{2}), the posterior of g_{i} under the sparse GP, and p(g_{i}|Y), the posterior of g_{i} under a GP with the full dataset D. We refer the reader to [10] for a detailed treatment of sparse Gaussian process theory. The computational complexity of sparse GP regression is O(m\eta^{2}), so by fixing \eta the algorithm is linear in m. We note that sparse GP regression introduces error into the estimation; however, in practice this error does not affect the validity of our methods.
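In practice one would typically rely on an off-the-shelf sparse GP implementation rather than coding the variational optimization by hand. The sketch below shows how this could look with GPflow's SGPR model; the library choice, the toy data, and the hyperparameter values are assumptions for illustration, not part of the method.

import numpy as np
import gpflow

# Toy data standing in for the residual observations y[k] = g(x[k]) + nu[k].
X = np.random.uniform(0.0, 5.0, size=(1000, 2))
Y = 0.1 * np.sin(X[:, :1]) + 0.1 * np.random.randn(1000, 1)

Z = X[np.random.choice(len(X), 250, replace=False)]   # eta = 250 inducing points
model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(variance=0.45**2, lengthscales=1.75),
    inducing_variable=Z,
)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
mean, var = model.predict_f(np.array([[2.5, 2.5]]))   # posterior at a test point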

Given some dataset D, we construct an estimate of the unknown dynamics independently in each coordinate and determine high-confidence bounds on the estimation error

\hat{g}_{i}^{D}(x):=\mu_{g_{i},D}(x),
\gamma_{i}(x):=\beta\sigma_{g_{i},D}(x)\geq|g_{i}(x)-\hat{g}_{i}^{D}(x)|

for each i=1,\ldots,n. We also determine high-confidence lower and upper bounds on g(x) as

g_{-}(x)=\hat{g}^{D}(x)-\beta\sigma_{g,D}(x),\quad g_{+}(x)=\hat{g}^{D}(x)+\beta\sigma_{g,D}(x).

The parameter \beta is calculated as

\beta=\frac{\sigma_{\nu}}{\sqrt{1+(2/m)}}\Big(B_{i}+\sigma_{\nu}\sqrt{2(\gamma_{k}^{m}+1+\log\tfrac{1}{\delta})}\Big) (9)

for \sigma_{\nu}-sub-Gaussian noise, m the number of GP samples, high-confidence parameter \delta, information gain constant \gamma_{k}^{m}, and RKHS constant B_{i}, as detailed in Lemma 1 of [7]. Note that the same quantity \beta\sigma_{g,D} is used to determine high-confidence bounds on both the estimation error and on g(x) itself.
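A small sketch of evaluating (9) is given below. The values supplied for the RKHS bound B_i and the information gain constant are placeholders; in practice they come from Lemma 1 of [7].

import numpy as np

def beta(sigma_nu, m, B_i, gamma_km, delta=0.05):
    # High-confidence scaling parameter, cf. (9).
    return (sigma_nu / np.sqrt(1.0 + 2.0 / m)) * (
        B_i + sigma_nu * np.sqrt(2.0 * (gamma_km + 1.0 + np.log(1.0 / delta)))
    )

# With this beta, g_-(x) = g_hat(x) - beta*sigma_gD(x) and
# g_+(x) = g_hat(x) + beta*sigma_gD(x) give high-confidence bounds on g(x),
# with the confidence level governed by delta.
b = beta(sigma_nu=0.1, m=250, B_i=1.0, gamma_km=5.0)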
For each region q in the state-space, we select a high-confidence error bound for the unknown dynamics as

\gamma_{i}(q)=\max_{x\in X_{q}}\gamma_{i}(x).

In practice, we compute this bound by sampling \gamma_{i}(x) throughout the state-space, introducing a trade-off between approximation error and computational complexity. We now construct transition probability intervals assuming that the high-confidence bounds on the unknown dynamics always hold:

Theorem 1 (Construction of Transition Probabilities)

Consider q,q^{\prime}\in Q and action \alpha_{q^{*}}\in A(q). Lower bound \check{T} and upper bound \hat{T} transition probabilities from q to q^{\prime} under \alpha_{q^{*}} are given by

\check{T}(q,\alpha_{q^{*}},q^{\prime})=\prod_{i=1}^{n}\int_{a^{\prime}_{i}}^{b^{\prime}_{i}}\rho_{\nu_{i}}(z-x_{\min,i}(q,\alpha_{q^{*}},q^{\prime}))\,dz, (10)
\hat{T}(q,\alpha_{q^{*}},q^{\prime})=\prod_{i=1}^{n}\int_{a^{\prime}_{i}}^{b^{\prime}_{i}}\rho_{\nu_{i}}(z-x_{\max,i}(q,\alpha_{q^{*}},q^{\prime}))\,dz, (11)

where x_{\min,i} is the i-th coordinate of x_{\min} and similarly for x_{\max,i}, we recall \rho_{\nu_{i}} is the probability density function of the stochastic noise \nu_{i}, and a^{\prime} and b^{\prime} are the lower and upper boundary points for region q^{\prime}. We define x_{\min} and x_{\max} as

x_{\min}(q,\alpha_{q^{*}},q^{\prime})=\underset{x\in X}{\mathrm{argmax}}\ \|x-c_{q^{\prime}}\|_{1}\quad\text{s.t.}\ c_{q^{*}}-\gamma(q)\leq x\leq c_{q^{*}}+\gamma(q), (12)
x_{\max}(q,\alpha_{q^{*}},q^{\prime})=\underset{x\in X}{\mathrm{argmin}}\ \|x-c_{q^{\prime}}\|_{1}\quad\text{s.t.}\ c_{q^{*}}-\gamma(q)\leq x\leq c_{q^{*}}+\gamma(q), (13)

where \|\cdot\|_{1} is the 1-norm and \gamma(q) is a high-confidence error bound on the unknown dynamics satisfying Assumption 1.
Then, the transition probability bounds (10)–(11) satisfy the constraints for high-confidence IMDP abstractions in (5)–(6).
Proof:  The righthand side of the bound in equation (5) can be rewritten as

\min_{x\in X_{q}}\min_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(f(x)+w+K_{q^{*}}(x;\hat{g})+\nu\in X_{q^{\prime}}) (14)
=\min_{x\in X_{q}}\min_{g_{-}(x)\leq w\leq g_{+}(x)}\mathbb{P}_{\nu}(c_{q^{*}}+w-\hat{g}(x)+\nu\in X_{q^{\prime}}) (15)
=\min_{x\in X_{q}}\min_{-\gamma(x)\leq\omega\leq\gamma(x)}\mathbb{P}_{\nu}(c_{q^{*}}+\omega+\nu\in X_{q^{\prime}}) (16)
=\min_{x\in X_{q}}\min_{-\gamma(x)\leq\omega\leq\gamma(x)}\prod_{i=1}^{n}\mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) (17)
=\min_{x\in X_{q}}\prod_{i=1}^{n}\min_{-\gamma_{i}(x)\leq\omega_{i}\leq\gamma_{i}(x)}\mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) (18)
\geq\prod_{i=1}^{n}\min_{-\gamma_{i}(q)\leq\omega_{i}\leq\gamma_{i}(q)}\mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) (19)

where (14) is the righthand side of (5); (15) follows after expanding the feedback controller expression K_{q^{*}}(x;\hat{g}) using (4) and simplifying; (16) follows by the assumption of the high-confidence error bound \gamma(x) and the definition of g_{-}(x) and g_{+}(x) from Assumption 1, taking \omega=w-\hat{g}(x); (17) follows by the assumption that each \nu_{i} is independent, where \mathbb{P}_{\nu_{i}} denotes probability with respect to \nu_{i} and we recall that a_{q^{\prime}} and b_{q^{\prime}} are the lower and upper corners of the region X_{q^{\prime}}, with a_{q^{\prime},i} the i-th coordinate of a_{q^{\prime}} and similarly for c_{q^{*},i} and b_{q^{\prime},i}; (18) follows from the fact that the hyper-rectangular constraint -\gamma(x)\leq\omega\leq\gamma(x) is equivalent to the independent constraints -\gamma_{i}(x)\leq\omega_{i}\leq\gamma_{i}(x) along each coordinate; and (19) follows from the definition \gamma_{i}(q)=\max_{x\in X_{q}}\gamma_{i}(x).

Now, because the probability distribution of each random variable \nu_{i} is assumed unimodal and symmetric, \mathbb{P}_{\nu_{i}}(c_{q^{*},i}+\omega_{i}+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]) is minimized when the distance between (c_{q^{*},i}+\omega_{i}) and the midpoint of [a_{q^{\prime},i},b_{q^{\prime},i}] is maximized, i.e., when |c_{q^{*},i}+\omega_{i}-c_{q^{\prime},i}| is maximized, subject to the constraint -\gamma_{i}(q)\leq\omega_{i}\leq\gamma_{i}(q). Substituting x=c_{q^{*}}+\omega, and observing that \|x-c_{q^{\prime}}\|_{1}=\sum_{i=1}^{n}|x_{i}-c_{q^{\prime},i}|, this is exactly the maximizing point specified by x_{\min}(q,\alpha_{q^{*}},q^{\prime}) in (12). Thus, the expression in (19) is equivalent to

\prod_{i=1}^{n}\mathbb{P}_{\nu_{i}}(x_{\min,i}(q,\alpha_{q^{*}},q^{\prime})+\nu_{i}\in[a_{q^{\prime},i},b_{q^{\prime},i}]), (20)

which in turn is equivalent to the righthand side of (10), establishing the bound in (5). An analogous argument establishes that (11) satisfies (6). ∎

We construct a high-confidence IMDP abstraction of the system using the hyper-rectangular partition regions as states, high-confidence bounds on the unknown dynamics obtained via GP regression, and transition probability intervals calculated using Theorem 1, solving the first part of Problem 1.
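To make Theorem 1 concrete, the sketch below computes one interval (10)–(11) for a two-dimensional system under the simplifying assumption of Gaussian noise, so that the per-coordinate integrals are differences of Gaussian CDFs (the case study actually uses truncated Gaussians; the truncation is ignored here for brevity). The optimizers (12)–(13) are found coordinate-wise over the box c_{q*} ± γ(q). All numerical values are illustrative.

import numpy as np
from scipy.stats import norm

def transition_interval(c_qstar, gamma_q, a_qprime, b_qprime, sigma_nu):
    # Lower/upper transition probability bounds (10)-(11) into the region
    # [a_qprime, b_qprime], when the controller targets c_qstar and the
    # GP error bound over the source region is gamma_q (all length-n arrays).
    c_dest = (a_qprime + b_qprime) / 2.0
    T_low, T_up = 1.0, 1.0
    for i in range(len(c_qstar)):
        lo, hi = c_qstar[i] - gamma_q[i], c_qstar[i] + gamma_q[i]
        # (12)-(13), per coordinate: farthest / closest point to the destination center.
        x_min_i = lo if abs(lo - c_dest[i]) >= abs(hi - c_dest[i]) else hi
        x_max_i = np.clip(c_dest[i], lo, hi)
        T_low *= norm.cdf(b_qprime[i], x_min_i, sigma_nu) - norm.cdf(a_qprime[i], x_min_i, sigma_nu)
        T_up *= norm.cdf(b_qprime[i], x_max_i, sigma_nu) - norm.cdf(a_qprime[i], x_max_i, sigma_nu)
    return T_low, T_up

# One transition on a unit grid resembling the case-study geometry.
T_low, T_up = transition_interval(
    c_qstar=np.array([1.5, 0.5]), gamma_q=np.array([0.2, 0.2]),
    a_qprime=np.array([1.0, 0.0]), b_qprime=np.array([2.0, 1.0]), sigma_nu=0.1)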

IV Safe Sampling of PIMDP

IV-A Probability of Satisfaction Calculation

Given a high-confidence IMDP abstraction of the system and an FSA of a desired scLTL specification, we construct a PIMDP using Definition 5. We first introduce the concept of control policies and adversaries:

Definition 8 (Control Policy)

A control policy \pi\in\Pi of a PIMDP is a mapping (Q\times S)^{+}\xrightarrow{}A, where (Q\times S)^{+} is the set of finite sequences of states of the PIMDP.

Definition 9 (PIMDP Adversary)

Given a PIMDP state (q,s) and action \alpha, an adversary \xi\in\Xi is an assignment of transition probabilities T_{\xi}^{\prime} to all states (q^{\prime},s^{\prime}) such that

\check{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime}))\leq T_{\xi}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime}))\leq\hat{T}^{\prime}((q,s),\alpha,(q^{\prime},s^{\prime})).

In particular, we use a minimizing adversary, which realizes transition probabilities such that the probability of satisfying the specification is minimal, and a maximizing adversary, which maximizes the probability of satisfaction.

To find safe sampling cycles in the PIMDP, we calculate

\check{P}_{\max}((q,s)\models\phi)=\max_{\pi\in\Pi}\min_{\xi\in\Xi}P(w\models\phi\ |\ \pi,\xi,w[0]=(q,s)),

which is the probability that a random path w starting at PIMDP state (q,s) satisfies the scLTL specification \phi under a maximizing control policy \pi and minimizing adversary \xi.

Additionally, we use the best-case probability of satisfaction under a maximizing control policy and adversary:

\hat{P}_{\max}((q,s)\models\phi)=\max_{\pi\in\Pi}\max_{\xi\in\Xi}P(w\models\phi\ |\ \pi,\xi,w[0]=(q,s)).

To calculate these probabilities, we use the value iteration method proposed in Section V of [14].
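The sketch below illustrates one Bellman backup of such an interval value iteration for the lower bound \check{P}_{\max}: for each state–action pair, the minimizing adversary pushes as much probability mass as the intervals allow onto low-value successors, and the policy then maximizes over actions. This is a simplified illustration of the standard greedy assignment used for interval MDPs (cf. [14]); it omits convergence checks, failure-state handling, and policy extraction, and the data structures are assumed.

def worst_case_expectation(succ, V):
    # succ: list of (state, p_low, p_up); V: dict of current values.
    # Assign probabilities within [p_low, p_up] (assumed to admit a distribution,
    # i.e. sum of lower bounds <= 1 <= sum of upper bounds) so that the expected
    # value is minimized: give the leftover mass to low-value successors first.
    order = sorted(succ, key=lambda t: V[t[0]])          # ascending value
    probs = {s: p_low for s, p_low, _ in succ}
    remaining = 1.0 - sum(probs.values())
    for s, p_low, p_up in order:
        add = min(p_up - p_low, remaining)
        probs[s] += add
        remaining -= add
    return sum(probs[s] * V[s] for s, _, _ in succ)

def bellman_backup(transitions, V, accepting):
    # One synchronous backup over all PIMDP states.
    # transitions: {state: {action: [(succ_state, p_low, p_up), ...]}}.
    V_new = {}
    for state, actions in transitions.items():
        if state in accepting:
            V_new[state] = 1.0
        else:
            V_new[state] = max(worst_case_expectation(succ, V) for succ in actions.values())
    return V_new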

IV-B Nonviolating Sub-Graph Generation

We note that scLTL specifications may generate FSA states which are absorbing and non-accepting, i.e., it is impossible to satisfy the specification once one of these states is reached. Such states may also exist in PIMDP constructions even without appearing in the corresponding FSA. We define these states as those which have zero probability of satisfying the scLTL specification under any control policy and adversary:

\text{Failure States}=\{(q,s)\in Q\times S\ |\ \hat{P}_{\max}((q,s)\models\phi)=0\}.

We can then define a notion of specification nonviolation:

Definition 10 (Nonviolating PIMDP)

A PIMDP \mathcal{P} is nonviolating with respect to a scLTL specification \phi if there are no failure states in \mathcal{P}.

Our algorithm for calculating a nonviolating PIMDP is as follows. We first initialize a set of failure states. Then, we loop through all non-failure states and prune actions which have nonzero upper-bound transition probability to failure states. We check if this pruning has left any states with no available actions, designating these also as failure states to prune. The process continues until no new failure states are found. Our nonviolating sub-graph is the set of all unpruned states with their remaining actions.

IV-C Candidate Cycle Selection

Now that we have a nonviolating sub-graph of our PIMDP, we want to select a path which we can take in order to sample the state-space indefinitely while maximizing the information gain of our Gaussian process. To do this, we first recall the concept of maximal end components [15]:

Definition 11 (End Component [15])

An end component of a finite PIMDP \mathcal{P} is a pair (\mathcal{T},Act) with \mathcal{T}\subseteq(Q\times S) and Act:\mathcal{T}\rightarrow A such that

  • \emptyset\neq Act(q,s)\subseteq A(q) for all states (q,s)\in\mathcal{T},

  • (q,s)\in\mathcal{T} and \alpha\in Act(q,s) implies \{(q^{\prime},s^{\prime})\in Q\times S\ |\ \hat{T}(q,\alpha,q^{\prime})>0,\ s^{\prime}\in\delta(s,L(q))\}\subseteq\mathcal{T},

  • The digraph G_{(\mathcal{T},Act)} induced by (\mathcal{T},Act) is strongly connected.

Definition 12 (Maximal End Component (MEC) [15])

An end component (\mathcal{T},Act) of a finite PIMDP \mathcal{P} is maximal if there is no end component (\mathcal{T}^{*},Act^{*}) such that (\mathcal{T},Act)\neq(\mathcal{T}^{*},Act^{*}), \mathcal{T}\subseteq\mathcal{T}^{*}, and Act(q,s)\subseteq Act^{*}(q,s) for all (q,s)\in\mathcal{T}.

PIMDP abstractions have the property that any infinite path will eventually stay in a single MEC. We propose the following heuristic in order to select a MEC to cycle within. First, we calculate \check{P}_{\max} from our initial state to each candidate MEC. We reject any MEC which we cannot reach with probability 1, or, in case no MECs can be reached with probability 1, we immediately select the MEC with the highest reachability probability. If multiple candidate MECs remain, we then calculate the Gaussian process covariance \kappa(c_{q},c_{q^{*}}) between the centers of the IMDP states q in each remaining candidate MEC and the accepting IMDP state q^{*}. We sum the covariances for all states in each MEC and select the MEC with the highest total covariance score, which corresponds to maximum information gain [16], defined as the reduction of GP uncertainty at the accepting state. We generate a control policy by selecting the actions at each state which give the maximum probability of reaching the MEC. Once in the MEC, we use a controller which cycles through the available actions.
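The covariance-based scoring of candidate MECs can be sketched as follows. The dictionary structure of `mecs` (each entry holding its IMDP states and its reachability probability) and the kernel hyperparameters are assumptions made for illustration.

import numpy as np

def se_kernel(z1, z2, sigma_g=0.45, l=1.75):
    # Squared exponential covariance between two region centers, cf. (23).
    return sigma_g**2 * np.exp(-np.sum((z1 - z2)**2) / (2 * l**2))

def select_mec(mecs, centers, c_accept):
    # Heuristic of Section IV-C: keep MECs reachable with probability 1 and,
    # among them, pick the one whose states have the largest total covariance
    # with the accepting region's center (a proxy for information gain).
    candidates = [m for m in mecs if m["reach_prob"] >= 1.0]
    if not candidates:
        # No MEC reachable with probability 1: fall back to the most reachable one.
        return max(mecs, key=lambda m: m["reach_prob"])
    score = lambda m: sum(se_kernel(centers[q], c_accept) for q in m["states"])
    return max(candidates, key=score)

# Example: of two equally reachable MECs, the one closer to the accepting region wins.
centers = {0: np.array([0.5, 0.5]), 1: np.array([3.5, 3.5])}
mecs = [{"states": [0], "reach_prob": 1.0}, {"states": [1], "reach_prob": 1.0}]
best = select_mec(mecs, centers, c_accept=np.array([4.5, 4.5]))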

By applying the algorithms detailed above to calculate a non-violating PIMDP and MEC, we generate a control policy which samples the state-space indefinitely without violating the specification, solving the second part of Problem 1.

IV-D Iterative Sampling Algorithm

We now detail our complete method to solve Problem 1. Given a scLTL specification \phi which we want to satisfy with probability P_{\text{sat}}, we construct a PIMDP using a high-confidence IMDP abstraction of the system in Eq. (1) and an FSA which models \phi. Then, we calculate reachability probabilities under a minimizing adversary, \check{P}_{\max}, from the initial states in the PIMDP to the accepting states. If \check{P}_{\max}\geq P_{\text{sat}}, then the control policy selects the actions which produce \check{P}_{\max} at each state and the problem is solved. Otherwise, we calculate a control policy to sample the state-space without violating the specification \phi using the methods in the previous sections. We follow the calculated control policy for a predetermined number of steps and sample the unknown dynamics at each step. We batch update the GP with the data collected, reconstruct transition probability intervals for each state, and recalculate the reachability probabilities \check{P}_{\max} for our initial states. If \check{P}_{\max}\geq P_{\text{sat}}, a satisfying control policy is found; otherwise, we repeat the process above. Our iterative algorithm ends when \check{P}_{\max}\geq P_{\text{sat}}; when the GP approximation has low enough uncertainty to conclude that a successful control policy cannot be synthesized, i.e., when the reachability probability \hat{P}_{\max} under a maximizing adversary is less than the desired P_{\text{sat}}; or when a maximum number of iterations has been reached.

V Case Study

Suppose we have a mobile robot in a 2D state-space with position x\in X:=[0,5]^{2}\subset\mathbb{R}^{2}. The state-space is partitioned into a set of 25 hyper-rectangular regions corresponding to IMDP states. The dynamics of the robot are

x[k+1]=x[k]+u[k]+g(x[k])+\nu[k] (21)

where g(x) models the unknown effect of the slope of the terrain. The control action u is generated by the family of controllers in Section II, where the available target regions for each region are those immediately to its left, right, above, and below.

Within the state-space, we have one goal region with the atomic proposition Goal and a set of hazard regions labeled with Haz. These yield the scLTL specification

\phi_{1}=\lnot\texttt{Haz}\ \mathcal{U}\ \texttt{Goal}. (22)

An illustration of the state-space is shown in Figure 2. We choose a low-dimensional case study in order to illustrate our methodology. Future work will refine our algorithms on applications with higher-dimensional state-spaces.

Figure 2: State-space of the case study. The initial region is labeled with "Init", the (green) target region is labeled with "Goal", and the (red) hazard regions are labeled with "Haz". States that eventually enter the safe cycle are blue, and the number in the region indicates the iteration of the algorithm at which the state enters the safe cycle. States which are not numbered do not enter the safe cycle. The yellow trace is an example of a sampling run.

The true g(x) is sampled from two randomly generated Gaussian processes (one for each dimension) with bounded support [-0.4,0.4] and squared exponential kernel \kappa,

\kappa(x,x^{\prime})=\sigma_{g}^{2}e^{-\frac{(x-x^{\prime})^{2}}{2l^{2}}}. (23)

We choose hyperparameters \sigma_{g}=0.45 and l=1.75.

We estimate the unknown dynamics with two sparse Gaussian processes with the same kernel as the true dynamics. We sample the GPs at 100 points in each region to determine error bounds. We set the number of inducing points \eta=250 and choose the high-confidence-bound parameter \beta=2. Each iteration of the algorithm takes 250 steps, so the total number of data samples m is the number of iterations times 250. The stochastic noise \nu is independently drawn from two truncated Gaussian distributions, one for each dimension, both with \sigma_{\nu}=0.1 and bounded support [-0.2,0.2].
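For concreteness, the case-study setup can be approximated as below: the squared exponential kernel (23) with the stated hyperparameters, a surrogate "true" unknown dynamics drawn from the GP prior on a fixed grid and clipped to the bounded support, and truncated Gaussian noise. The grid resolution, the random seed, and the clipping used to enforce the bounded support are simplifying assumptions for illustration.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
sigma_g, l = 0.45, 1.75

def se_kernel(Z1, Z2):
    # Squared exponential kernel (23) evaluated between two sets of points.
    d2 = np.sum((Z1[:, None, :] - Z2[None, :, :])**2, axis=-1)
    return sigma_g**2 * np.exp(-d2 / (2 * l**2))

# Sample a surrogate unknown dynamics g on a grid from the GP prior, one sample
# per dimension, clipped to the bounded support [-0.4, 0.4] of the case study.
grid = np.stack(np.meshgrid(np.linspace(0, 5, 26), np.linspace(0, 5, 26)), -1).reshape(-1, 2)
K = se_kernel(grid, grid) + 1e-8 * np.eye(len(grid))
g_true = np.clip(rng.multivariate_normal(np.zeros(len(grid)), K, size=2).T, -0.4, 0.4)

# Truncated Gaussian process noise with sigma_nu = 0.1 and support [-0.2, 0.2].
sigma_nu = 0.1
nu = truncnorm.rvs(-0.2 / sigma_nu, 0.2 / sigma_nu, scale=sigma_nu, size=2, random_state=0)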

We next apply the iterative algorithm described in Section IV-D, setting the desired probability of satisfying the specification to 1. Our algorithm successfully finds a satisfying feedback control strategy in an average of 15 iterations (calculated over 10 runs). The algorithm is implemented in Python on a 2.5 GHz Intel Core i9 machine with 16 GB of RAM and an Nvidia RTX 3060 GPU, and requires on average 1 minute 14 seconds to complete.

Figure 2 depicts the expansion of the safe cycle used to sample the state-space. Initially, only the left two columns of states are safe and reachable. As the algorithm progresses, more states and actions are added to the safe cycle, moving the system closer to the goal until the unknown dynamics can be estimated with enough certainty to achieve a probability of satisfying the specification of 1.

The left plot in Figure 3 depicts the total transition probability uncertainty for the system after each iteration,

T_{{\rm unc},{\rm total}}=\sum_{q\in Q}\sum_{\alpha\in A(q)}\sum_{q^{\prime}\in Q}\big(\hat{T}(q,\alpha,q^{\prime})-\check{T}(q,\alpha,q^{\prime})\big). (24)
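This quantity is straightforward to track across iterations; a minimal sketch, assuming the interval bounds are stored as parallel dictionaries keyed by (q, \alpha, q'):

def total_uncertainty(T_low, T_up):
    # Total transition probability uncertainty (24): the sum, over all
    # state-action-successor triples, of the interval widths.
    return sum(T_up[key] - T_low[key] for key in T_up)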

The right plot in Figure 3 shows the probability of satisfying the specification after each iteration.

VI Conclusion

In this work, we developed a method to safely learn unknown dynamics for a system motivated by the robotic motion-planning problem. Our approach uses an IMDP abstraction of the system and a finite state automaton of scLTL specifications. We designed an algorithm for finding nonviolating paths within a product IMDP construction which can be used to sample the state-space and construct a Gaussian process approximation of unknown dynamics. We then detailed an algorithm to iteratively sample the state-space to improve the probability of satisfying a desired specification and demonstrated its use with a case study of robot navigation. Our approach can be used with any system for which a high-confidence IMDP abstraction can be constructed as well as any objective which can be written as a scLTL specification. Future work will apply these methods to models of bipedal walking robots utilizing region-based motion planning [13].

Figure 3: The left plot shows the total uncertainty in transition probability intervals after each iteration of the algorithm, and the right plot shows the probability of satisfying the specification after each iteration. Results are plotted over 10 runs of the algorithm. The uncertainty decreases as more data samples are collected, and likewise the probability of satisfaction increases once the safe cycle has expanded close enough to the goal.

References

  • [1] P. Tabuada, Verification and Control of Hybrid Systems: A Symbolic Approach, 1st ed.   Springer Publishing Company, Incorporated, 2009.
  • [2] C. Belta, B. Yordanov, and E. Göl, Formal Methods for Discrete-Time Dynamical Systems, ser. Studies in Systems, Decision and Control.   Springer International Publishing, 2017.
  • [3] C. Mark and S. Liu, “Stochastic MPC with distributionally robust chance constraints,” IFAC, vol. 53, no. 2, pp. 7136–7141, 2020.
  • [4] A. Lavaei, S. Soudjani, A. Abate, and M. Zamani, “Automated verification and synthesis of stochastic hybrid systems: A survey,” 2021.
  • [5] C. Baier, B. Haverkort, H. Hermanns, and J.-P. Katoen, “Model-checking algorithms for continuous-time Markov chains,” IEEE Transactions on Software Engineering, vol. 29, no. 6, pp. 524–541, Jun. 2003.
  • [6] M. Ahmadi, A. Israel, and U. Topcu, “Safety assessment based on physically-viable data-driven models,” in 56th IEEE CDC, Dec. 2017, pp. 6409–6414.
  • [7] J. Jackson, L. Laurenti, E. Frew, and M. Lahijanian, “Strategy synthesis for partially-known switched stochastic systems,” in Proceedings of HSCC ’21, pp. 1–11.
  • [8] C. K. I. Williams and C. E. Rasmussen, “Gaussian processes for regression,” in Advances in neural information processing systems 8.   MIT press, 1996, pp. 514–520.
  • [9] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with Gaussian processes,” in 53rd IEEE CDC, Dec. 2014, pp. 1424–1431.
  • [10] F. Leibfried, V. Dutordoir, S. John, and N. Durrande, “A tutorial on sparse gaussian processes and variational inference,” 2021, arXiv: 2012.13962.
  • [11] A. A. Julius, A. Halasz, M. S. Sakar, H. Rubin, V. Kumar, and G. J. Pappas, “Stochastic modeling and control of biological systems: The lactose regulation system of Escherichia Coli,” IEEE Transactions on Automatic Control, vol. 53, no. Special Issue, pp. 51–65, 2008.
  • [12] E. Altman, T. Başar, and R. Srikant, “Congestion control as a stochastic control problem with action delays,” Automatica, vol. 35, no. 12, pp. 1937–1950, 1999.
  • [13] A. Shamsah, J. Warnke, Z. Gu, and Y. Zhao, “Integrated Task and Motion Planning for Safe Legged Navigation in Partially Observable Environments,” 2021, arXiv: 2110.12097.
  • [14] M. Lahijanian, S. B. Andersson, and C. Belta, “Formal Verification and Synthesis for Discrete-Time Stochastic Systems,” IEEE Transactions on Automatic Control, vol. 60, no. 8, pp. 2031–2045, Aug. 2015.
  • [15] C. Baier and J.-P. Katoen, Principles of Model Checking.   MIT Press, 2008.
  • [16] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-theoretic regret bounds for gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, p. 3250–3265, May 2012.

[Algorithm Pseudocodes] Algorithm 1 calculates a sub-graph of a PIMDP which is nonviolating with respect to a scLTL specification. It takes as input a PIMDP construction along with upper bounds \hat{P}_{\max} on the maximum probability of satisfying the specification for each state. In lines 1–2, failure states which have \hat{P}_{\max}=0 are identified. Next, in lines 4–10, non-failure states are looped through and any of their actions which have a nonzero upper bound probability of reaching the set of failure states are removed. In lines 12–17, states which have no actions remaining are added to the set of failure states and the algorithm returns to line 3. The loop from lines 3–18 repeats until no new failure states are identified. The algorithm returns the remaining non-failure states and actions as the nonviolating sub-graph.

Input: PIMDP \mathcal{P}, \hat{P}_{\max} for each state in \mathcal{P}
Output: PIMDP \mathcal{P}^{\prime} which is a nonviolating subset of \mathcal{P}
1  Initialize R=\{(q,s)\in\mathcal{P}\ |\ \hat{P}_{\max}((q,s)\models\phi)=0\};
2  Initialize U=R;
3  while R\neq\emptyset do
4    for (q,s)\in\mathcal{P}\setminus U do
5      for \alpha\in A do
6        if \hat{T}((q,s),\alpha,U)\neq 0 then
7          Remove \alpha from available actions at (q,s);
8        end if
9      end for
10   end for
11   R=\emptyset;
12   for (q,s)\in\mathcal{P}\setminus U do
13     if A((q,s))=\emptyset then
14       R=R\cup(q,s);
15       U=U\cup(q,s);
16     end if
17   end for
18 end while
return \mathcal{P}^{\prime}=\mathcal{P}\setminus U
Algorithm 1 Nonviolating Sub-Graph Generation

Algorithm 2 selects a maximal end component (MEC) of a PIMDP to sample, along with a corresponding control policy, maximizing the information gain with respect to learning the unknown dynamics. It takes as input a nonviolating sub-graph of a PIMDP. In line 1, the maximal end components of the sub-graph are identified. In lines 3–8, a lower bound \check{P}_{\max} on the maximum probability of reaching each MEC from the initial state of the original PIMDP is calculated. Those MECs which have \check{P}_{\max}=1 are added to the list of candidate MECs. In lines 9–11, if there are no candidate MECs found, then the algorithm selects the MEC with the highest \check{P}_{\max} as the MEC to cycle in. If there are candidate MECs, then in lines 12–17 each candidate MEC is assigned a score equal to the sum of the covariances between each state in the MEC and the accepting state q^{*} of the PIMDP. The MEC with the maximum covariance score is selected as the MEC to cycle in. In line 18, a control policy for the selected MEC is calculated which selects the actions at each state outside the MEC which have the maximum probability of reaching the MEC. For states within the MEC, the control policy cycles through the actions at each state which are available in the MEC. The algorithm returns the selected MEC along with its corresponding control policy.

Input: Nonviolating Sub-PIMDP \mathcal{P}^{\prime}
Output: Selected MEC (\mathcal{T}^{*},Act^{*}) and control policy \pi^{*}
1  Initialize M as the MECs of \mathcal{P}^{\prime};
2  Initialize C=\emptyset as the set of candidate MECs;
3  for (\mathcal{T}^{\dagger},Act^{\dagger})\in M do
4    Calculate reachability probability \check{P}_{\max} from initial state r_{0} of PIMDP \mathcal{P}^{\prime} to \mathcal{T}^{\dagger};
5    if \check{P}_{\max}=1 then
6      C=C\cup(\mathcal{T}^{\dagger},Act^{\dagger});
7    end if
8  end for
9  if C=\emptyset then
10   (\mathcal{T}^{*},Act^{*})=\underset{(\mathcal{T}^{\dagger},Act^{\dagger})\in M}{\mathrm{argmax}}\ \check{P}_{\max}(r_{0}\models\Diamond\mathcal{T}^{\dagger});
11 end if
12 else
13   for (\mathcal{T}^{\dagger},Act^{\dagger})\in C do
14     H = Sum of \kappa(c_{q},c_{q^{*}}) for all IMDP states q\in\mathcal{T}^{\dagger} w.r.t. accepting state q^{*};
15   end for
16   Find (\mathcal{T}^{*},Act^{*})\in C with maximum score H;
17 end if
18 Control policy \pi^{*} selects available actions in \mathcal{P}^{\prime} with maximum probability of reaching \mathcal{T}^{*} for states not in \mathcal{T}^{*} and cycles through actions Act^{*} for all states in \mathcal{T}^{*};
return Selected MEC (\mathcal{T}^{*},Act^{*}), control policy \pi^{*}
Algorithm 2 Nonviolating Cycle Selection

Algorithm 3 performs an iterative procedure to safely learn the unknown dynamics of the system (1) until a given scLTL specification can be satisfied with sufficient probability. It takes as input the system dynamics, a scLTL specification, and a desired probability of satisfaction P_{\rm sat}. In line 1, an IMDP abstraction of the system and an FSA of the specification are constructed. Then, the IMDP and FSA are combined into a PIMDP construction. Finally, a GP estimation \hat{g}(x) of the unknown dynamics is initialized with its hyperparameters. In lines 2–3, lower and upper bounds \check{P}_{\max} and \hat{P}_{\max} are calculated for the initial state in the PIMDP. If the lower bound probability \check{P}_{\max} is less than the desired P_{\rm sat}, the loop in lines 4–18 is entered. In lines 5–7, if the upper bound probability \hat{P}_{\max} is less than P_{\rm sat}, then the specification cannot be satisfied with sufficient probability regardless of how well the unknown dynamics are learned. Thus, the algorithm returns that no satisfying control policy exists. Otherwise, in lines 8–9, a nonviolating sub-graph of the PIMDP is calculated using Algorithm 1, and a MEC to cycle in, along with its corresponding control policy, is calculated from this sub-graph using Algorithm 2. In lines 10–13, this control policy is used to take a predefined number of steps to sample the unknown dynamics. In lines 14–15, the GP estimation \hat{g}(x) is updated using these samples, and transition probability intervals are recalculated for each state in the PIMDP. In lines 16–17, \check{P}_{\max} and \hat{P}_{\max} are recalculated for the initial state. If \check{P}_{\max}\geq P_{\rm sat}, the loop terminates and the algorithm returns a control policy calculated in lines 19–21 which selects the actions at each state which have the maximum probability of satisfying the specification. If \check{P}_{\max}<P_{\rm sat}, the loop repeats from line 4. If the maximum number of iterations of the loop is reached, the algorithm terminates without determining a satisfying control policy or the nonexistence thereof.

Input: System dynamics in (1), scLTL specification \phi, P_{\text{sat}}
Output: Satisfying control policy \pi^{\dagger} or proof of nonexistence
1  Construct IMDP \mathcal{I} from system (1), FSA \mathcal{A} from \phi, PIMDP \mathcal{P} from \mathcal{I} and \mathcal{A}, initial GP regression \hat{g}(x);
2  Calculate \check{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
3  Calculate \hat{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
4  while (\check{P}_{\max}<P_{\rm sat}) and (Count < MaxIterations) do
5    if \hat{P}_{\max}<P_{\rm sat} then
6      return No satisfying control policy exists;
7    end if
8    Find nonviolating sub-PIMDP \mathcal{P}^{\prime} using Algorithm 1;
9    Find MEC to cycle in with corresponding control policy \pi^{*} using Algorithm 2;
10   for NumInnerIterations do
11     Take action \pi^{*}(q) at current state q;
12     Sample y[k]=x[k+1]-f(x[k])-u[k];
13   end for
14   Construct GP \hat{g}(x) using collected samples y[k];
15   Recalculate transition probability intervals for each state in \mathcal{P};
16   Recalculate \check{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
17   Recalculate \hat{P}_{\max}((q_{0},\delta(q_{0},s_{0}))\models\phi) for initial state;
18 end while
19 if \check{P}_{\max}\geq P_{\text{sat}} then
20   return Control policy \pi^{\dagger}=\underset{\alpha\in A(q)}{\mathrm{argmax}}\ \check{P}_{\max}((q,s)\models\phi)\ \forall(q,s)\in\mathcal{P};
21 end if
Algorithm 3 Iterative Synthesis Algorithm