
Collaborative rover-copter path planning and exploration with temporal logic specifications based on Bayesian update under uncertain environments

Kazumune Hashimoto, Graduate School of Engineering, Osaka University, Suita, Japan; Natsuko Tsumagari, Graduate School of Engineering and Science, Osaka University, Toyonaka, Japan; and Toshimitsu Ushio, Graduate School of Engineering and Science, Osaka University, Toyonaka, Japan
(2021)
Abstract.

This paper investigates collaborative rover-copter path planning and exploration with temporal logic specifications under uncertain environments. The objective of the rover is to complete a mission expressed by a syntactically co-safe linear temporal logic (scLTL) formula, while the objective of the copter is to actively explore the environment and reduce its uncertainties, thereby assisting the rover and enhancing the efficiency of mission completion. To formalize our approach, we first capture the environmental uncertainties by environmental beliefs of the atomic propositions, under the assumption that it is unknown which properties (or atomic propositions) are satisfied in each area of the environment. The environmental beliefs of the atomic propositions are updated according to the Bayes rule based on Bernoulli-type sensor measurements provided by both the rover and the copter. Then, the optimal policy for the rover is synthesized by maximizing a belief of the satisfaction of the scLTL formula through automata-based model checking. An exploration policy for the copter is then synthesized by employing the notion of entropy, evaluated based on the environmental beliefs of the atomic propositions, and the path that the rover intends to follow according to the optimal policy. As such, the copter can actively explore regions whose uncertainties are high and that are relevant to the mission completion. Finally, some numerical examples illustrate the effectiveness of the proposed approach.

Collaborative motion planning, Temporal logics, Bayesian-based decision making, Uncertain environments
Copyright: ACM; journal year: 2021; doi: 10.1145/3470453; CCS: Computer systems organization, Robotics; Theory of computation, Modal and temporal logics

1. Introduction

Autonomous systems play an important role in accomplishing complex, high-level scientific missions autonomously under uncertain environments. To increase the efficiency of completing such missions, collaboration among multiple, heterogeneous robots has attracted much attention in recent years, see, e.g., (lewis2014, ). In this paper, we are particularly interested in the situation where the mission is completed through the collaboration of an unmanned ground vehicle (UGV), called a rover, and an unmanned aerial vehicle (UAV), called a (heli)copter. The utilization of the rover-copter collaboration is motivated by the fact that the rover has the role of completing the mission (e.g., searching for a target object), while the copter has the role of assisting the rover so as to enhance the efficiency of mission completion. Specifically, the copter aims at actively exploring the environment and reducing its uncertainties by revealing which properties (obstacles, free space, etc.) are satisfied in each area of the environment. For example, the copter checks that no obstacles are present along the path that the rover intends to follow in the environment. By doing so, the rover will be able to complete the mission while guaranteeing safety. Employing the rover-copter collaboration is also motivated by the fact that the National Aeronautics and Space Administration (NASA) is launching the Mars 2020 mission (balaram2018, ). In particular, to investigate Martian geology and habitability, NASA has decided to send copters to Mars in order to help the rover discover target samples in an efficient way (landau2015, ). Motivated by this fact, several motion planning techniques employing the rover-copter collaboration for Mars exploration have been investigated in recent years, see, e.g., (nilsson2018, ; bharadwaj2018, ; sasaki2020, ).

As briefly mentioned above, planning under environmental uncertainties involves two distinct major problems: the first is how to synthesize a control policy such that a complex, high-level mission specification can be satisfied in an automatic way, and the second is how to explore the environment so as to effectively reduce its uncertainties. In this paper, we propose a novel algorithm to solve these two problems by making use of the rover-copter collaboration. First, we tackle the former problem by employing temporal logic synthesis techniques (temporalreview1, ; temporalreview2, ; belta2017formal, ). More specifically, we express a mission specification by a syntactically co-safe linear temporal logic (scLTL) formula. In contrast to a simple reach-avoid task, an scLTL formula has the ability to describe various complex specifications that involve logical and temporal constraints (belta2017formal, ). Moreover, the optimal policy that fulfills the scLTL specification can be synthesized using a value iteration algorithm, which is in general more computationally efficient than synthesizing controllers for LTL (which requires solving a Rabin game) or STL (which requires solving a (mixed) integer program). Additionally, the use of scLTL is often natural in practice, since path planning problems typically deal with missions that terminate in finite time, rather than in infinite time as with LTL. To formalize our approach, we first capture the uncertain environment by assuming that it is unknown which properties or atomic propositions are satisfied in each area of the environment. Specifically, we define environmental beliefs of the atomic propositions, described by posterior probabilities, that evaluate their uncertainties based on the sensor measurements in each area of the environment. As will be detailed later, these beliefs are updated according to the Bayes rule based on sensor measurements provided by both the copter and the rover. The optimal policy for the rover is then synthesized by maximizing a belief of the satisfaction of the scLTL formula for the controlled trajectory of the rover. In particular, based on automata-based model checking (see, e.g., (baier, )), we combine a motion model of the rover described by a Markov decision process (MDP) and a finite state automaton (FSA) that accepts all good prefixes satisfying the scLTL formula. This combined model, called a product belief MDP, has a transition function induced by the current environmental beliefs of the atomic propositions. The problem of finding the optimal policy is then reduced to a finite-time reachability problem in the product belief MDP, which can be solved via a value iteration algorithm.

The latter problem (i.e., how to explore the environment so as to effectively reduce its uncertainties) will be mainly solved by the copter, since it is able to move more quickly and freely than the rover and is thus suited for exploration. Roughly speaking, the objective of the copter is to actively explore the environment and reduce its uncertainties by updating the environmental beliefs of the atomic propositions. We first describe the observations by employing a Bernoulli-type sensor model, see, e.g., (bertuccelli2005, ; wang2009, ; hussein2007, ; imai2013, ). The Bernoulli sensor abstracts the complexity of image processing into binary observations. Despite its simplicity, the Bernoulli sensor model is commonly used in the UAV community for the following reasons: (i) it is able to capture erroneous observations; (ii) it is able to capture the limited sensor range; (iii) in contrast to more sophisticated sensor models, such as those involving probability density functions, the Bayesian update can be computed without any integrals or approximations. In particular, the third feature is well-suited for the copter's exploration, as the computational power of the CPU and the battery capacity are often limited, and it is desirable to make the belief updates as "computationally light" as possible. The exploration algorithm is given by employing the notion of entropy, evaluated based on the environmental beliefs of the atomic propositions, and the path that the rover intends to follow according to the current optimal policy. As such, the copter can put an emphasis on actively exploring regions whose uncertainties are high and that are relevant to the mission completion.

Related works and contributions of this paper

Based on the above, the approach presented in this paper is related to previous works in the literature in terms of the following aspects:

  1. Motion planning/exploration employing the rover-copter collaboration;

  2. Temporal logic planning under environmental uncertainties.

In what follows, we discuss how our approach differs from the previous works and highlight our main contributions.

Several motion planning/exploration techniques employing the rover-copter collaboration have been provided, see, e.g., (nilsson2018, ; bharadwaj2018, ; sasaki2020, ). In particular, our approach is closely related to (nilsson2018, ), in the sense that the overall synthesis problem is decomposed into two sub-problems, i.e., the problem of synthesizing the copter's exploration policy in the uncertain environment, and the problem of synthesizing the optimal policy for the rover so as to satisfy the scLTL formula. Our approach builds upon this previous work, in terms of both the synthesis of the rover's optimal policy and that of the copter's exploration policy, in the following ways. Rather than capturing the environmental uncertainties by belief MDPs (see Section II.A in (nilsson2018, )), in which the environmental belief states are given in a discrete space, this paper captures the environmental uncertainties by assigning beliefs in a continuous space (i.e., the beliefs can take continuous values in the interval $[0,1]$). This allows us to apply the Bayes rule to update the beliefs according to the sensor measurements provided by both the rover and the copter. Moreover, while the previous work solves a value iteration over the state space of a product MDP that involves the set of environmental belief states, our approach defines a product MDP that does not involve such states, i.e., the set of states of the product MDP combines only the set of states of the MDP motion model of the rover and the set of states of the FSA that accepts all trajectories satisfying the scLTL formula (not including the environmental belief states). Hence, we can alleviate the time complexity of synthesizing the optimal policies for the rover in comparison with the previous work. In addition, we provide a theoretical convergence analysis of the proposed algorithm, in which the environmental beliefs of the atomic propositions are shown to converge to the appropriate values.

Besides the above, many temporal logic planning schemes under environmental uncertainties have been proposed. Most of the previous works assume that the environment has unknown properties (atomic propositions) (ayala2013, ; fu2016, ; maly2013, ; meng2013a, ; meng2015a, ; meng2018, ; livingston2012, ; wongpiromsarn2012, ), that the motion model of the robot includes uncertainties (lahijanian2010, ; yoo2016, ; sadigh16, ; vasile2016, ; leahy2019, ; wolff2012, ; ulusoy2014, ; wongpiromsarn2012, ; kazumune2020, ), or that the motion model of the robot is completely unknown (sadigh2014, ; jing2015, ; li2018, ; hasanbeig2019, ). Since this paper assumes that it is unknown which atomic propositions are satisfied in the environment, our approach is particularly related to the first category, i.e., (ayala2013, ; fu2016, ; maly2013, ; meng2013a, ; meng2015a, ; meng2018, ; livingston2012, ; wongpiromsarn2012, ). For example, (ayala2013, ) proposed to combine automata-based model checking and run-time verification for temporal logic motion planning under incomplete knowledge about the workspace. (fu2016, ) proposed a temporal logic synthesis under probabilistic semantic maps obtained by simultaneous localization and mapping (SLAM). Moreover, (meng2013a, ) proposed a planning revision scheme under incomplete knowledge about the workspace. Our approach is essentially different from the above previous works, in the sense that we incorporate sensor failures in the observations of the atomic propositions. In particular, as previously mentioned, we employ a Bernoulli-type sensor model to describe erroneous observations, and update the beliefs based on the Bayes rule. Besides, the synthesis approach (e.g., the construction of the product MDP) is also different from the above previous works, since we make use of the beliefs to synthesize control policies, see Section 4.3. Moreover, our approach differs in that we incorporate an explicit algorithm for exploration, so as to reduce the environmental uncertainties. Other than the above previous works, a few approaches that take into account sensor failures/noise have been provided, see, e.g., (johnson2013, ; johnson2015, ; nuzzo, ; TIGER2020325, ). For example, in (johnson2015, ), the authors proposed probabilistic model checking for reactive synthesis under sensor and actuator failures. Moreover, (nuzzo, ) introduced the concept of stochastic signal temporal logic (StSTL), and provided both verification and synthesis techniques using assume/guarantee contracts. In contrast to the above previous works, we here propose a Bayesian approach, in which beliefs assigned over the environment are introduced and updated based on observations provided by the Bernoulli sensors.

In summary, the main novelties of this paper with respect to the related works are as follows: using the copter-rover collaboration, we develop a new approach to synthesizing an optimal policy for the rover so as to satisfy an scLTL formula, and an exploration policy for the copter so as to update the environmental beliefs of the atomic propositions and reduce the environmental uncertainties. In particular:

  1. Using the Bernoulli sensor model and the Bayesian update, we propose a novel exploration algorithm for the copter so as to update the environmental beliefs and effectively reduce the environmental uncertainties (for details, see Section 4.2);

  2. We propose a novel framework to synthesize the optimal policy for the rover. In particular, we solve a value iteration over a product MDP whose state space does not involve the set of states of the environmental beliefs. This leads to a reduction of the time complexity of the value iteration in comparison with the previous work (for details, see Section 4.3);

  3. We provide a theoretical convergence analysis of the proposed algorithm, where it is shown that the environmental beliefs of the atomic propositions converge to the appropriate values (for details, see Section 5).

The remainder of this paper is organized as follows. In Section 2, we provide some preliminaries on Markov decision processes and syntactically co-safe LTL formulas. In Section 3, we formulate the problem that we seek to solve in this paper. In Section 4, we describe the main algorithm, which aims to synthesize both an exploration policy to update the environmental beliefs of the atomic propositions and the optimal policy to satisfy an scLTL formula. In Section 5, we analyze the convergence property of the proposed algorithm. In Section 6, we illustrate the effectiveness of the proposed approach through a simulation example. We finally conclude in Section 7.

Notation. Let $\mathbb{N}$, $\mathbb{N}_{\geq 0}$, $\mathbb{N}_{>0}$, and $\mathbb{N}_{a:b}$ be the set of integers, non-negative integers, positive integers, and integers in the interval $[a,b]$, respectively. Let $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, $\mathbb{R}_{>0}$, and $\mathbb{R}_{a:b}$ be the set of reals, non-negative reals, positive reals, and reals in the interval $[a,b]$, respectively. For a given vector $x\in\mathbb{R}^{n}$, denote by $x^{(i)}$ the $i$-th element of $x$. Given a finite set $X$, let $\mathcal{D}(X)$ denote the set of all probability distributions on $X$, i.e., the set of all functions $p:X\rightarrow[0,1]$ such that $\sum_{x\in X}p(x)=1$.

2. Preliminaries

2.1. Markov Decision Process

A Markov Decision Process (MDP) is defined as a tuple $\mathcal{M}=(X,x_{0},U,p)$, where $X$ is the finite set of states, $x_{0}\in X$ is the initial state, $U$ is the finite set of control inputs, and $p:X\times U\rightarrow\mathcal{D}(X)$ is the transition probability function that associates, with each state $x\in X$ and input $u\in U$, the corresponding probability distribution over $X$. For simplicity of presentation, we abbreviate $p(x,u)(x^{\prime})$ as $p(x^{\prime}|x,u)$. Given $\mathcal{M}=(X,x_{0},U,p)$, a policy sequence $\mu_{seq}=\mu_{1}\mu_{2}\ldots$ is defined as an infinite sequence of mappings $\mu_{k}:X\rightarrow U$, $k\in\mathbb{N}$. Namely, each $\mu_{k}$, $k\in\mathbb{N}_{\geq 0}$, represents a policy as a mapping from each state in $X$ onto the corresponding control input in $U$. The policy sequence $\mu_{seq}$ is called stationary if the policy is invariant for all times, i.e., $\mu_{k}=\mu_{k+1}$, $\forall k\in\mathbb{N}_{\geq 0}$. Given a policy sequence $\mu_{seq}=\mu_{1}\mu_{2}\ldots$, a trajectory induced by $\mu_{seq}$ is denoted by ${\bf x}_{\mu_{seq}}=x(0)x(1)\ldots\in X^{\omega}$, where $x(0)=x_{0}$ and $x(k+1)\sim p(\cdot|x(k),\mu_{k}(x(k)))$, $\forall k\in\mathbb{N}_{\geq 0}$.
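To make the notation concrete, the following minimal Python sketch samples a finite trajectory of an MDP under a stationary policy; the dictionary-based containers ($p[(x,u)]$ as a map from successor states to probabilities, $policy$ as a map from states to inputs) are illustrative assumptions, not part of the formal model.

import random

def sample_trajectory(x0, policy, p, horizon):
    """Sample x(0) x(1) ... x(horizon) with x(k+1) ~ p(.|x(k), mu(x(k))).

    Assumed containers: p[(x, u)] is a dict {x': probability} and
    policy maps each state to a control input (a stationary policy).
    """
    traj = [x0]
    for _ in range(horizon):
        x = traj[-1]
        u = policy[x]
        states, probs = zip(*p[(x, u)].items())
        traj.append(random.choices(states, weights=probs)[0])
    return traj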

2.2. Syntactically co-safe LTL

Syntactically co-safe LTL (scLTL for short) is defined using a set of atomic propositions $AP$, Boolean operators, and temporal operators. Atomic propositions are Boolean variables taking either true or false. Specifically, scLTL formulas are constructed according to the following grammar:

(1) $\phi::={\rm true}\ |\ ap\ |\ \neg ap\ |\ \phi_{1}\wedge\phi_{2}\ |\ \phi_{1}\vee\phi_{2}\ |\ \bigcirc\phi\ |\ \phi_{1}\,\mathit{U}\,\phi_{2},$

where $ap\in AP$ is an atomic proposition, $\neg$ (negation), $\wedge$ (conjunction), and $\vee$ (disjunction) are the Boolean connectives, and $\bigcirc$ (next) and $\mathit{U}$ (until) are the temporal operators. The semantics of an LTL formula is inductively defined over an infinite sequence of sets of atomic propositions ${\bf w}=w_{0}w_{1}\cdots\in(2^{AP})^{\omega}$. Intuitively, an atomic proposition $ap\in AP$ is satisfied iff $ap$ is true at $w_{0}$ (i.e., $ap\in w_{0}$). Moreover, $\neg ap$ is satisfied iff $ap$ is not true at $w_{0}$ (i.e., $ap\notin w_{0}$). $\phi_{1}\wedge\phi_{2}$ is satisfied iff both $\phi_{1}$ and $\phi_{2}$ are satisfied. $\phi_{1}\vee\phi_{2}$ is satisfied iff $\phi_{1}$ or $\phi_{2}$ is satisfied. $\bigcirc\phi$ is satisfied iff $\phi$ is satisfied for the suffix of ${\bf w}$ that begins at the next position, i.e., $w_{1}w_{2}\cdots$. Finally, $\phi_{1}\mathit{U}\phi_{2}$ is satisfied iff $\phi_{1}$ is satisfied until $\phi_{2}$ is satisfied. Given ${\bf w}=w_{0}w_{1}\cdots\in(2^{AP})^{\omega}$ and an scLTL formula $\phi$, we write ${\bf w}\models\phi$ iff ${\bf w}$ satisfies $\phi$. It is known that every ${\bf w}=w_{0}w_{1}\cdots$ that satisfies the scLTL formula $\phi$ contains a finite good prefix $w_{0}w_{1}\cdots w_{n}$ for some $n\in\mathbb{N}_{\geq 0}$, such that $w_{0}w_{1}\cdots w_{n}{\bf w}^{\prime}\in(2^{AP})^{\omega}$ also satisfies $\phi$ for all ${\bf w}^{\prime}\in(2^{AP})^{\omega}$.

A finite state automaton (FSA) is defined as a tuple $\mathcal{A}=(Q,\Sigma,\delta,q_{0},Q_{f})$, where $Q$ is a set of states, $\Sigma$ is the input alphabet, $\delta:Q\times\Sigma\rightarrow Q$ is the transition function, $q_{0}\in Q$ is the initial state, and $Q_{f}\subseteq Q$ is the set of accepting states. Moreover, denote by $Post:Q\rightarrow 2^{Q}$ the successors of each state in $Q$, i.e., $Post(q)=\{q^{\prime}\in Q\ |\ \exists\sigma\in\Sigma,\ q^{\prime}=\delta(q,\sigma)\}$. It is known that any scLTL formula $\phi$ can be translated into an FSA with $\Sigma=2^{AP}$ that accepts all good prefixes for $\phi$. We denote by $\mathcal{A}_{\phi}$ the FSA corresponding to the scLTL formula $\phi$. The translation from scLTL formulas to FSAs can be done automatically using several off-the-shelf tools, such as SCHECK2 (latvala2003, ).
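As an illustration, the following sketch hand-codes the FSA for the simple scLTL formula $\phi=\Diamond a$ ("eventually $a$"), which also appears in Fig. 1 later; the Python encoding of the states and the transition function is ours, not the output of a translation tool.

# Hand-coded FSA for the scLTL formula "eventually a" (cf. Fig. 1):
# it accepts exactly the finite words in which some letter contains 'a'.
q0, Qf = 'q0', {'q1'}

def delta(q, sigma):
    """Transition function; sigma is a set of atomic propositions."""
    if q == 'q0':
        return 'q1' if 'a' in sigma else 'q0'
    return 'q1'  # the accepting state q1 is absorbing

def accepts(word):
    """Check whether a finite word (a list of sets) is a good prefix."""
    q = q0
    for sigma in word:
        q = delta(q, sigma)
    return q in Qf

assert accepts([set(), {'b'}, {'a'}])   # 'a' eventually holds
assert not accepts([set(), {'b'}])      # 'a' never holds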

3. Problem formulation

In this section, we describe the uncertain environment, the motion and sensor models of the rover and the copter, and the main problem that we seek to solve in this paper.

3.1. Uncertain environment

We capture the environment as a two-dimensional map consisting of $n$ cells. For example, this map may be obtained by discretizing a given bounded search area into a uniform grid with $n$ cells. Let $x_{i}\in\mathbb{R}^{2}$, $i\in\{1,\ldots,n\}$, be the position (centroid) of cell $i$, and let $X=\{x_{1},\ldots,x_{n}\}$. Moreover, we denote by $AP$ the set of atomic propositions, which represents a set of labels or properties that can be satisfied in the states. In addition, we denote by $L:X\rightarrow 2^{AP}$ the labeling function, which maps each state $x\in X$ onto the corresponding set of atomic propositions that are satisfied in $x$. For example, if $L(x)=\{\mathit{obstacle}\}$ with $AP=\{\mathit{obstacle}\}$, it intuitively means that there is an obstacle in the state $x$. In this paper, it is assumed that the labeling function $L$ is unknown due to the uncertainty of the environment, i.e., we do not have complete knowledge about the properties of the states in the environment. Thus, instead of $L$, we make use of the belief, or the posterior probability (given the past observations), as follows:

(2) $\mathcal{B}(x\models ap)\in[0,1]$

for all $x\in X$ and $ap\in AP$, where $x\models ap$ denotes that $ap$ is satisfied in $x$ (i.e., $ap\in L(x)$). For example, $\mathcal{B}(x\models\mathit{obstacle})=1$ with $AP=\{\mathit{obstacle}\}$ intuitively means that it is certain that there exists an obstacle in $x$. In addition, $\mathcal{B}(x\models\mathit{obstacle})=0.5$ intuitively means that it is completely unknown whether there exists an obstacle in $x$. The belief that $ap$ is not satisfied in $x$ is denoted by $\mathcal{B}(x\models\neg ap)$ and, from (2), it is computed as $\mathcal{B}(x\models\neg ap)=1-\mathcal{B}(x\models ap)$. In what follows, the beliefs in (2) are called the environmental beliefs of the atomic propositions. As we will see later, the environmental beliefs of the atomic propositions are updated based on the observations provided by sensors mounted on the copter and the rover.

3.2. Rover and copter model

3.2.1. Motion model

The motion of the rover is modeled by an MDP $\mathcal{M}_{r}=(X,x_{r_{0}},U_{r},p_{r})$, where $X$ is the set of states (the environment), $x_{r_{0}}\in X$ is the initial state of the rover, $U_{r}$ is the finite set of inputs, and $p_{r}$ is the transition probability function. Similarly, the motion of the copter is modeled by an MDP $\mathcal{M}_{c}=(X,x_{c_{0}},U_{c},p_{c})$, where $X$ is the set of states, $x_{c_{0}}\in X$ is the initial state of the copter, $U_{c}$ is the finite set of inputs, and $p_{c}$ is the transition probability function.

3.2.2. Sensor model

The rover is equipped with sensors that can provide observations on several atomic propositions in $AP$. Specifically, let $AP_{r}\subseteq AP$ be the set of atomic propositions, or properties, that can be observed by the rover's sensors. For example, if $AP_{r}=\{\mathit{target}\}$, the rover is equipped with a sensor that can detect a target object. To describe erroneous observations, we use a Bernoulli-type sensor model (see, e.g., (bertuccelli2005, ; wang2009, ; hussein2007, ; imai2013, )) as follows. Suppose that the rover's position is $x\in X$, and we would like the rover to know whether an atomic proposition $ap\in AP_{r}$ is satisfied at the position $x^{\prime}\in X$. Since the rover can provide sensor measurements only within a limited range, it is assumed that $\|x-x^{\prime}\|\leq R^{r}_{ap}$, where $R^{r}_{ap}\in\mathbb{R}_{>0}$ is a given sensor range for $ap$. The corresponding observation is described by a binary variable, denoted by $Z^{r}_{x}(x^{\prime}\models ap)\in\{0,1\}$. For example, if $Z^{r}_{x}(x^{\prime}\models\mathit{obstacle})=1$ with $AP_{r}=\{\mathit{obstacle}\}$, the rover at position $x$ detects an obstacle at $x^{\prime}$ with the corresponding sensor. The conditional probabilities that the sensor provides the correct or the false measurement are characterized as follows:

$${\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=1\,|\,x^{\prime}\models ap\right]=\beta^{r}_{1,x}(x^{\prime},ap),\quad {\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=0\,|\,x^{\prime}\models ap\right]=1-\beta^{r}_{1,x}(x^{\prime},ap),$$
$${\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=0\,|\,x^{\prime}\models\neg ap\right]=\beta^{r}_{2,x}(x^{\prime},ap),\quad {\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=1\,|\,x^{\prime}\models\neg ap\right]=1-\beta^{r}_{2,x}(x^{\prime},ap),$$

where $\beta^{r}_{1,x}(x^{\prime},ap),\beta^{r}_{2,x}(x^{\prime},ap)\in[0,1]$ are given parameters that characterize the precision of the sensor. For example, given that $ap$ is satisfied in $x^{\prime}$ (i.e., $ap\in L(x^{\prime})$), the probability of making the correct measurement (i.e., $Z^{r}_{x}(x^{\prime}\models ap)=1$) is $\beta^{r}_{1,x}(x^{\prime},ap)$, while the probability of making the false measurement (i.e., $Z^{r}_{x}(x^{\prime}\models ap)=0$) is $1-\beta^{r}_{1,x}(x^{\prime},ap)$. For simplicity, it is assumed that the probabilities of making the correct measurements are the same, i.e., $\beta^{r}_{1,x}(x^{\prime},ap)=\beta^{r}_{2,x}(x^{\prime},ap)=\beta^{r}_{x}(x^{\prime},ap)$, $\forall x,x^{\prime}\in X$ and $ap\in AP_{r}$. Regarding $\beta^{r}_{x}$, we assume that it is characterized by a fourth-order polynomial function of $\|x-x^{\prime}\|$ as follows (wang2009, ; hussein2007, ):

(3) $\beta^{r}_{x}(x^{\prime},ap)=\dfrac{M^{r}_{ap}}{(R^{r}_{ap})^{4}}\left(\|x-x^{\prime}\|^{2}-(R^{r}_{ap})^{2}\right)^{2}+0.5,\quad {\rm if}\ \|x-x^{\prime}\|\leq R^{r}_{ap},$
(4) $\beta^{r}_{x}(x^{\prime},ap)=0.5,\quad {\rm if}\ \|x-x^{\prime}\|>R^{r}_{ap},$

for a given $M^{r}_{ap}\in(0,0.5]$. Equations (3) and (4) imply that the reliability of the sensor decreases (and eventually becomes $0.5$) as the distance between $x$ and $x^{\prime}$ becomes larger.
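For concreteness, a small sketch of (3) and (4) in Python follows; the function name and argument layout are our own, and positions are taken as 2-D tuples.

import math

def beta_r(x, x_prime, R_ap, M_ap):
    """Sensor precision beta^r_x(x', ap) from Eqs. (3)-(4).

    R_ap: sensor range R^r_ap; M_ap in (0, 0.5]: precision margin, so
    the precision is M_ap + 0.5 at distance 0 and 0.5 at the range.
    """
    d2 = (x[0] - x_prime[0]) ** 2 + (x[1] - x_prime[1]) ** 2
    if math.sqrt(d2) > R_ap:
        return 0.5  # beyond the range the observation is uninformative
    return (M_ap / R_ap ** 4) * (d2 - R_ap ** 2) ** 2 + 0.5

assert abs(beta_r((0, 0), (0, 0), 2.0, 0.4) - 0.9) < 1e-12  # at d = 0
assert abs(beta_r((0, 0), (2, 0), 2.0, 0.4) - 0.5) < 1e-12  # at d = R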

Similarly, let $AP_{c}\subseteq AP$ be the set of atomic propositions, or properties, that can be observed by the copter's sensors. For simplicity, it is assumed that $AP_{r}\cup AP_{c}=AP$. Moreover, we denote by $R^{c}_{ap}\in\mathbb{R}_{>0}$, for each $ap\in AP_{c}$, a given sensor range for $ap$. Suppose that the copter's position is $x\in X$, and we would like the copter to know whether an atomic proposition $ap\in AP_{c}$ is satisfied at the position $x^{\prime}\in X$ with $\|x^{\prime}-x\|\leq R^{c}_{ap}$. The corresponding observation is denoted by $Z^{c}_{x}(x^{\prime}\models ap)\in\{0,1\}$. Moreover, denote by $\beta^{c}_{x}(x^{\prime},ap)\in[0,1]$ a given parameter that characterizes the precision of the sensor for a given $M^{c}_{ap}\in(0,0.5]$, as in (3) and (4).

Despite its simplicity, the Bernoulli sensor model is commonly used in the UAV community for the following reasons: (i) it is able to capture erroneous observations; (ii) it is able to capture the limited sensor range; (iii) in contrast to more sophisticated sensor models, such as those involving probability density functions, the Bayesian update can be computed without any integrals or approximations. In particular, the third feature is well-suited for the copter's exploration, as the computational power of the CPU and the battery capacity are often limited, and it is desirable to make the belief update as computationally light as possible.

3.3. Mission specification and problem formulation

The mission specification that the rover should satisfy is expressed by an scLTL formula, denoted by $\phi$, over the set of atomic propositions $AP$. The satisfaction relation of the scLTL formula $\phi$ is given over the word generated by the trajectory of the rover. That is, given a policy sequence $\mu_{r,seq}=\mu_{r,0}\mu_{r,1}\mu_{r,2}\ldots$, we say that the trajectory ${\bf x}_{\mu_{r,seq}}=x(0)x(1)\ldots\in X^{\omega}$ satisfies $\phi$, denoted by ${\bf x}_{\mu_{r,seq}}\models\phi$, iff the corresponding word satisfies $\phi$, i.e., ${\bf w}=L(x(0))L(x(1))\ldots\models\phi$. Since the rover aims at achieving the satisfaction of $\phi$, we would like to derive an optimal policy such that the probability of satisfying $\phi$, i.e., ${\rm Pr}[{\bf x}_{\mu_{r,seq}}\models\phi]$, is maximized. However, since the labeling function $L$ is unknown, the value of ${\rm Pr}[{\bf x}_{\mu_{r,seq}}\models\phi]$ is also unknown (i.e., we do not have direct access to this probability value). Hence, we will instead compute and maximize the belief that the trajectory of the rover satisfies $\phi$, which we denote by

(5) $\mathcal{B}({\bf x}_{\mu_{r,seq}}\models\phi)\in[0,1].$

That is, $\mathcal{B}({\bf x}_{\mu_{r,seq}}\models\phi)$ indicates the posterior probability that the trajectory of the rover satisfies $\phi$ given the past observations (sensor measurements) provided by the rover and the copter. As we will see later, (5) is computed and maximized based on the environmental beliefs of the atomic propositions in (2). Since the environmental beliefs of the atomic propositions are updated based on the sensor measurements, the optimal policy that maximizes (5) is also updated accordingly. Moreover, since we would like to reduce the environmental uncertainties as much as possible (i.e., we would like to make $\mathcal{B}(x\models ap)$ converge to $1$ or $0$ for all $x\in X$, $ap\in AP$), it is also necessary to explore the state space $X$ so as to collect the sensor measurements and update the environmental beliefs of the atomic propositions. In this paper, the copter has the main role of exploring the uncertain environment, since, as previously mentioned in Section 1, it is able to move more quickly and freely than the rover. Therefore, we need to synthesize not only an optimal policy for the rover such that (5) is maximized, so as to increase the possibility of satisfying $\phi$, but also an exploration policy for the copter so as to update the environmental beliefs of the atomic propositions and effectively reduce the environmental uncertainties.

Problem 1.

Consider the MDP motion models of the rover $\mathcal{M}_{r}$ and of the copter $\mathcal{M}_{c}$, the Bernoulli sensor models described in Section 3.2.2, and the mission specification expressed by the scLTL formula $\phi$. Synthesize for the copter-rover team a policy that increases the possibility of achieving the satisfaction of $\phi$. Specifically, synthesize an optimal policy for the rover such that (5) is maximized, and an exploration policy for the copter so as to update the environmental beliefs of the atomic propositions in (2). $\Box$

4. Approach

In this section, we provide a solution approach to Problem 1. In Section 4.1, we give an overview of the approach. Then, we provide the algorithms of the exploration phase and the mission execution phase in Sections 4.2 and 4.3, respectively.

4.1. Overview of the approach

Following (nilsson2018, ), we consider a sequential approach to solve Problem 1. The overview of the approach is shown in Algorithm 1.

1  $k\leftarrow 0$ (initialize the time); $x_{c}\leftarrow x_{c_{0}}$ (initialize the position of the copter); $x_{r}\leftarrow x_{r_{0}}$ (initialize the position of the rover);
2  Using prior knowledge, initialize $\mathcal{B}(x\models ap)\in(0,1)$ for all $x\in X$ and $ap\in AP$;
3  The rover computes the optimal policy such that (5) is maximized, and computes the mapping $b_{\max}:X\rightarrow[0,1]$ (for details, see Section 4.3);
4  Repeat the following two phases:

  1. Exploration phase (see Section 4.2): The copter explores the state space for a given time period $T_{c}$ and updates the environmental beliefs of the atomic propositions $\mathcal{B}$:

     (6) $\{x_{c},\mathcal{B}\}\leftarrow\mathit{Exploration}(x_{c},\mathcal{B},b_{\max},T_{c}).$

     Set $k\leftarrow k+T_{c}$; the copter transmits $\mathcal{B}$ to the rover;

  2. Mission execution phase (see Section 4.3): The rover computes the optimal policy such that (5) is maximized and executes it for a given time period $T_{r}$. Moreover, the rover updates the environmental beliefs of the atomic propositions $\mathcal{B}$ as well as the mapping $b_{\max}$, i.e.,

     (7) $\{x_{r},b_{\max},\mathcal{B}\}\leftarrow\mathit{MissionExecution}(x_{r},\mathcal{B},T_{r}).$

     Set $k\leftarrow k+T_{r}$; the rover transmits $\mathcal{B}$ and $b_{\max}$ to the copter;

Algorithm 1 Overview of the main algorithm.

With a slight abuse of notation, we denote by $\mathcal{B}$ in the algorithm the set of all environmental beliefs of the atomic propositions, i.e., $\mathcal{B}(x\models ap)$, $x\in X$, $ap\in AP$. The approach mainly consists of two phases: the exploration phase and the mission execution phase. During the exploration phase, the copter explores the state space $X$ so as to update $\mathcal{B}$ for a given time period $T_{c}\in\mathbb{N}_{>0}$. In (6) (as well as (7)), $b_{\max}:X\rightarrow[0,1]$ denotes a mapping from each state onto the corresponding maximum belief that the rover will reach that state within the time period $T_{r}$ according to the current optimal policy; for the detailed definition, see Section 4.3.3. As we will see later, the exploration policy is given by making use of the mapping $b_{\max}$ and the entropy derived from the current environmental beliefs of the atomic propositions $\mathcal{B}$. Once the exploration is done, the copter transmits the updated environmental beliefs of the atomic propositions $\mathcal{B}$ to the rover and moves on to the mission execution phase. During the mission execution phase, the rover computes the optimal policy such that (5) is maximized, and executes the policy for a given time period $T_{r}\in\mathbb{N}_{>0}$. Moreover, during the execution, the rover provides sensor measurements to update $\mathcal{B}$. Once the execution is done, the rover transmits the updated $\mathcal{B}$ and $b_{\max}$ to the copter and moves back to the exploration phase. This sequential approach is motivated by the fact that, before allowing the rover to execute the optimal policy, we can let the copter explore regions around the rover's (future) path in advance. For example, the copter checks that no obstacles are present along the path that the rover intends to follow in the future. Then, if the copter finds some obstacles in the path, the rover can re-design the path to avoid the obstacles and try to find another way to complete the mission. Such a scheme may seem overly cautious, but it may be necessary especially for safety-critical systems, such as exploration on Mars.

As detailed below, the algorithms for both the exploration and the mission execution phases differ significantly from (nilsson2018, ) in the following three aspects. First, the environmental beliefs of the atomic propositions are updated based on the Bayes rule using the past sensor measurements. This allows us to provide a novel exploration scheme, in which the copter actively explores the environment by evaluating both the level of uncertainty and the relevance to the mission completion (see Section 4.2). Second, we propose a novel framework to synthesize the optimal policy for the rover. In particular, we solve a value iteration over a product MDP whose state space does not involve the set of states for the environmental beliefs. This leads to a reduction of the time complexity of solving the value iteration algorithm (for details, see Section 4.3). Finally, we provide a theoretical convergence analysis of the proposed algorithm, where it is shown that the environmental beliefs of the atomic propositions converge to the appropriate values (for details, see Section 5).

4.2. Exploration phase

In this subsection, we propose an algorithm for how the copter explores the environment so as to effectively update the environmental beliefs of the atomic propositions.

4.2.1. Bayesian belief update

Using the sensor model described in Section 3.2.2, the copter updates the environmental beliefs of the atomic propositions based on the Bayes filter (wang2009, ). Suppose that the copter is at the position $x\in X$ and, for some $x^{\prime}\in X$ and $ap\in AP_{c}$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$, it obtains the corresponding observation $Z^{c}_{x}(x^{\prime}\models ap)=z\in\{0,1\}$. Then, using this observation, the belief that $ap$ is satisfied in $x^{\prime}$, i.e., $\mathcal{B}(x^{\prime}\models ap)$, is updated by applying the Bayes rule as follows:

(8) $\mathcal{B}(x^{\prime}\models ap)\longleftarrow\cfrac{{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models ap]\,\mathcal{B}(x^{\prime}\models ap)}{{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z]}$

where ${\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models ap]={\beta^{c}_{x}(x^{\prime},ap)}^{z}(1-{\beta^{c}_{x}(x^{\prime},ap)})^{1-z}$, and ${\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z]$ is computed as

$$\begin{aligned}
{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z] &={\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models ap]\,\mathcal{B}(x^{\prime}\models ap)\\
&\quad+{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models\neg ap]\,(1-\mathcal{B}(x^{\prime}\models ap))\\
&={\beta^{c}_{x}(x^{\prime},ap)}^{z}(1-{\beta^{c}_{x}(x^{\prime},ap)})^{1-z}\,\mathcal{B}(x^{\prime}\models ap)\\
&\quad+{\beta^{c}_{x}(x^{\prime},ap)}^{1-z}(1-{\beta^{c}_{x}(x^{\prime},ap)})^{z}\,(1-\mathcal{B}(x^{\prime}\models ap)).
\end{aligned}$$

As will become clearer in the overall exploration algorithm given below, when the copter is placed at $x\in X$, it obtains the sensor measurements for all its neighbors, i.e., all $x^{\prime}\in X$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$, and all atomic propositions $ap\in AP_{c}$, and updates the corresponding environmental beliefs of the atomic propositions according to (8).
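A minimal sketch of the update (8) for a single observation is given below; the function signature is illustrative, with beta standing for $\beta^{c}_{x}(x^{\prime},ap)$.

def bayes_update(belief, z, beta):
    """One Bernoulli-sensor Bayes update of B(x' |= ap), Eq. (8).

    belief: prior B(x' |= ap); z in {0, 1}: the observation
    Z^c_x(x' |= ap); beta: the sensor precision beta^c_x(x', ap).
    """
    # likelihoods Pr[Z = z | x' |= ap] and Pr[Z = z | x' |= not ap]
    lik_ap = beta ** z * (1 - beta) ** (1 - z)
    lik_not = beta ** (1 - z) * (1 - beta) ** z
    # evidence Pr[Z = z] marginalizes over whether ap holds at x'
    evidence = lik_ap * belief + lik_not * (1 - belief)
    return lik_ap * belief / evidence

# repeated positive observations from a reliable sensor drive the
# belief from the uninformative prior 0.5 toward 1
b = 0.5
for _ in range(3):
    b = bayes_update(b, z=1, beta=0.8)
print(round(b, 3))  # 0.8 -> 0.941 -> 0.985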

4.2.2. Acquisition function for exploration

Let us now define an acquisition function to be evaluated for synthesizing the exploration strategy of the copter. First, we define the entropy $H:[0,1]\rightarrow[0,1]$ as follows (cover2006, ):

(9) $H(b)=-b\log b-(1-b)\log(1-b)$

for $b\in[0,1]$, where $\log$ is to the base $2$. In essence, $H(\mathcal{B}(x\models ap))$ for $x\in X$, $ap\in AP$ represents the level of uncertainty about whether $ap$ is satisfied in $x$; it takes the largest value if $\mathcal{B}(x\models ap)=0.5$ and the lowest value if $\mathcal{B}(x\models ap)=0$ or $1$. Hence, by actively exploring the states where the entropy is large and updating the corresponding beliefs according to (8), it is expected that the environmental uncertainties can be effectively reduced.

However, if the copter explored the environment only by evaluating the above entropy, it might explore states that are completely irrelevant to the rover's mission completion. In other words, since the copter knows the path that the rover intends to follow according to the current optimal policy, it is preferable that the copter investigate areas around this path (before the rover executes it) so as to update the environmental beliefs of the atomic propositions. For example, the copter checks that no obstacles are present along the path that the rover intends to follow in the environment, so that the rover will be able to complete the mission while avoiding any obstacles. In order to incorporate the rover's path into the exploration, recall that the current position of the rover is $x_{r}$ and we have the mapping $b_{\max}:X\rightarrow[0,1]$ (see Algorithm 1). As previously described in Section 4.1, $b_{\max}(x)$ for each $x\in X$ indicates the belief that the rover will reach $x$ from $x_{r}$ within the time period $T_{r}$ according to the rover's current optimal policy (for the detailed definition and the calculation of $b_{\max}$, see the mission execution phase in Section 4.3.3). Hence, for a large value of $b_{\max}(x)$ (i.e., $b_{\max}(x)\approx 1$), we have a high belief that the rover will reach $x$ within the time period $T_{r}$. Combining the entropy (9) and $b_{\max}$, we define the acquisition function $W:X\rightarrow\mathbb{R}_{>0}$ as follows:

(10) $W(x)=\sum_{ap\in AP_{c}}H(\mathcal{B}(x\models ap))+\alpha\, b_{\max}(x),$

for all $x\in X$, where $\alpha\in\mathbb{R}_{>0}$ is the weight associated with $b_{\max}(x)$.
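The entropy (9) and the acquisition (10) translate directly into code; a sketch follows, where the beliefs dictionary keyed by (state, proposition) pairs and the b_max dictionary are assumed containers of ours.

import math

def entropy(b):
    """Binary entropy H(b) of Eq. (9) in bits; H(0) = H(1) = 0."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)

def acquisition(x, beliefs, b_max, alpha, AP_c):
    """Acquisition W(x) of Eq. (10): total entropy of the beliefs at x
    plus alpha times the rover's reachability belief b_max(x)."""
    return sum(entropy(beliefs[(x, ap)]) for ap in AP_c) + alpha * b_max[x]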

4.2.3. Exploration algorithm

We now propose an exploration algorithm. In the following, we provide two different exploration strategies so as to take both the efficiency of computation and the coverage of exploration into account.

Input: $x$ (current copter position); $\mathcal{B}$ (current environmental beliefs of the atomic propositions); $b_{\max}$ (mapping indicating the reachability belief of the rover); $T_{c}$ (time period for exploration);
Output: $x$ (updated current position); $\mathcal{B}$ (updated environmental beliefs of the atomic propositions);
1  for $\ell=0:T_{c}-1$ do
2      Compute $H$ and $W$ according to (9) and (10), respectively;
3      From (11), compute the control input $u^{*}_{c}=\mu^{*}_{c_{1}}(x)$;
4      Apply $u^{*}_{c}$ and sample the next state $x_{next}\sim p_{c}(\cdot|x,u^{*}_{c})$;
5      $x\leftarrow x_{next}$;
6      for each $(x^{\prime},ap)\in X\times AP_{c}$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$ do
7          Obtain the corresponding observation $Z^{c}_{x}(x^{\prime}\models ap)=z$;
8          Update $\mathcal{B}(x^{\prime}\models ap)$ according to (8);
9      end for
10 end for
return $x$, $\mathcal{B}$;
Algorithm 2 $\mathit{Exploration}(x,\mathcal{B},b_{\max},T_{c})$ (local selection-based exploration)

(Local selection-based policy): The first exploration strategy is the local selection-based policy, in which the copter executes a one-step greedy exploration:

(11) $\mu^{*}_{c_{1}}(x)\in\arg\max_{u\in U_{c}}\ \mathbb{E}_{x^{\prime}\sim p_{c}(\cdot|x,u)}[W(x^{\prime})\,|\,x,u]$

for all $x\in X$. Equation (11) implies that the copter greedily selects a control input such that the corresponding next state provides the highest acquisition. The overall exploration algorithm based on the local selection-based policy is summarized in Algorithm 2. As shown in the algorithm, at each step $\ell$, the copter computes the acquisition function and a control input $u^{*}_{c}\in U_{c}$ according to Section 4.2.2 (lines 2-3). Once $u^{*}_{c}$ is obtained, the copter applies it and samples the next state $x_{next}$ (lines 4-5). Given the new current position, the copter makes new sensor measurements for all its neighbors and the atomic propositions in $AP_{c}$, and updates the corresponding environmental beliefs of the atomic propositions (lines 6-9). The above procedure is iterated for the copter's time period $T_{c}$. Note that, in order to enhance the exploration, the acquisition function as well as the policy computed in (11) are updated each time the copter obtains new observations.
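A sketch of the greedy selection (11) is shown below, under the same assumed dictionary containers as in the earlier sketches (p_c[(x, u)] mapping next states to probabilities, W mapping states to acquisition values).

def greedy_input(x, U_c, p_c, W):
    """One-step greedy exploration input of Eq. (11): the input whose
    successor distribution has the highest expected acquisition."""
    def expected_W(u):
        return sum(prob * W[x_next] for x_next, prob in p_c[(x, u)].items())
    return max(U_c, key=expected_W)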

The local selection-based policy is computationally efficient, since the optimal control input (line 3 in Algorithm 2) can be obtained by evaluating the acquisitions only for the next states. A disadvantage of this approach, however, is that it might not guarantee an effective exploration of the whole state space $X$, since the copter evaluates the acquisitions only locally. Another exploration strategy is therefore to select the optimal state to be visited by evaluating the acquisitions for all states in $X$ (instead of only the next states), and then collect the sensor measurements to update the corresponding environmental beliefs. This leads to a global selection-based approach, whose details are given below.

(Global selection-based policy): In the global selection-based approach, the copter first selects the optimal state that provides the highest acquisition over the whole state space $X$, i.e.,

(12) $x^{*}=\arg\max_{x^{\prime}\in X}\ W(x^{\prime}).$

Then, the copter computes the optimal policy $\mu^{*}_{c_{2}}:X\rightarrow U_{c}$ such that the probability of reaching $x^{*}$ is maximized, i.e.,

(13) $\mu^{*}_{c_{2}}\in\arg\max_{\mu_{c}}\ {\rm Pr}[{\bf x}_{c,\mu_{c}}\models\Diamond x^{*}],$

where ${\bf x}_{c,\mu_{c}}\in X^{\omega}$ denotes the state trajectory of the copter obtained by applying the policy $\mu_{c}$, and $\Diamond x^{*}$ indicates the property that the state trajectory reaches $x^{*}$ in finite time (corresponding to the "eventually" operator), i.e., ${\bf x}_{c,\mu_{c}}={x}^{0}_{c,\mu_{c}}{x}^{1}_{c,\mu_{c}}{x}^{2}_{c,\mu_{c}}\cdots\models\Diamond x^{*}$ iff there exists $k\in\mathbb{N}$ such that ${x}^{k}_{c,\mu_{c}}=x^{*}$. Problem (13) can indeed be solved via a value iteration algorithm; for details, see Section 4.3.2. Then, the copter moves to $x^{*}$ from the current state by applying $\mu^{*}_{c_{2}}$, so as to collect the corresponding sensor measurements and update the environmental beliefs of the atomic propositions. Once $x^{*}$ is reached, the copter re-computes a new $x^{*}$ based on the updated environmental beliefs, and iterates the same procedure for the time period $T_{c}$. The overall exploration algorithm based on the global selection-based policy is summarized in Algorithm 3. As shown in the algorithm, the copter first finds the optimal state $x^{*}$ that provides the highest acquisition over the whole state space $X$, and computes the optimal policy to reach $x^{*}$ (lines 4-5). The copter applies the optimal policy until it reaches $x^{*}$ and collects the corresponding sensor measurements (lines 6-13). Note that the copter collects not only the sensor measurements for $x^{*}$, but also those for the states traversed while moving toward $x^{*}$, so as to enhance the efficiency of exploration. The above procedure is iterated for the time period $T_{c}$. The variable $n_{succ}$ in Algorithm 3 counts the number of times the copter successfully reaches $x^{*}$.

The global selection-based approach requires heavier computation than the local selection-based approach, since it needs to find the optimal state by evaluating the acquisitions over the whole state space $X$, as well as to compute the optimal policy to reach $x^{*}$ via a value iteration. However, the advantage of employing this approach is that we can guarantee the coverage of exploration, i.e., by repeating Algorithm 3, the environmental beliefs of the atomic propositions converge to the appropriate values; for details, see Section 5.
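A sketch of the target selection (12) follows; the reachability policy (13) itself can then be obtained, for instance, by running the value iteration of Section 4.3.2 with the singleton target set $\{x^{*}\}$ in place of $S_{f}$ (the helper name is ours).

def best_state(X, W):
    """Global target selection of Eq. (12): the state with the highest
    acquisition over the whole state space X (ties broken arbitrarily)."""
    return max(X, key=lambda x: W[x])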

1  Input and Output are the same as in Algorithm 2;
2  $\ell\leftarrow 0$, $n_{succ}\leftarrow 0$;
3  while $\ell<T_{c}-1$ do
4      Compute $H$ and $W$ according to (9) and (10), respectively;
5      Compute $x^{*}\in X$ and $\mu^{*}_{c_{2}}:X\rightarrow U_{c}$ according to (12) and (13), respectively;
6      while $x\neq x^{*}$ and $\ell<T_{c}-1$ do
7          Apply $u^{*}_{c}=\mu^{*}_{c_{2}}(x)$ and sample the next state $x_{next}\sim p_{c}(\cdot|x,u^{*}_{c})$;
8          $x\leftarrow x_{next}$, $\ell\leftarrow\ell+1$;
9          for each $(x^{\prime},ap)\in X\times AP_{c}$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$ do
10             Obtain the corresponding observation $Z^{c}_{x}(x^{\prime}\models ap)=z$;
11             Update $\mathcal{B}(x^{\prime}\models ap)$ according to (8);
12         end for
13     end while
14     if $x=x^{*}$ then
15         $n_{succ}\leftarrow n_{succ}+1$;
16     end if
17 end while
return $x$, $\mathcal{B}$;
Algorithm 3 $\mathit{Exploration}(x,\mathcal{B},b_{\max},T_{c})$ (global selection-based exploration)

4.3. Mission execution phase

In this subsection, we propose a detailed algorithm for the rover's mission execution phase.

4.3.1. Product belief MDP

Given the environmental beliefs of the atomic propositions in (2), let $\mathcal{B}(x\models\sigma)$ for $x\in X$, $\sigma\in 2^{AP}$ be the joint belief that all atomic propositions in $\sigma$ are satisfied in $x$, i.e., $\mathcal{B}(x\models\sigma)=\prod_{ap\in\sigma}\mathcal{B}(x\models ap)$. Moreover, let $\mathcal{B}(x\models\sigma\wedge\neg(AP\backslash\sigma))$ be the joint belief that all atomic propositions in $\sigma$ are satisfied in $x$ and all atomic propositions in $AP$ other than $\sigma$ (i.e., $AP\backslash\sigma$) are not satisfied in $x$, i.e.,

(14) $\mathcal{B}(x\models\sigma\wedge\neg(AP\backslash\sigma))=\prod_{ap\in\sigma}\mathcal{B}(x\models ap)\prod_{ap\in AP\backslash\sigma}(1-\mathcal{B}(x\models ap)).$

For simplicity of presentation, let $\mathcal{B}_{alph}(x\models\sigma)=\mathcal{B}(x\models\sigma\wedge\neg(AP\backslash\sigma))$. Moreover, let $\mathcal{A}_{\phi}=(Q,2^{AP},\delta,q_{0},Q_{f})$ be the FSA corresponding to $\phi$, and, for each $(q,q^{\prime})\in Q\times Q$, denote by $en(q,q^{\prime})\subseteq 2^{AP}$ the subset of input alphabets for which the transition from $q$ to $q^{\prime}$ is allowed: $en(q,q^{\prime})=\{\sigma\in 2^{AP}\ |\ q^{\prime}=\delta(q,\sigma)\}$. In addition, given $x\in X$ and $q,q^{\prime}\in Q$, we let

(15) $\mathcal{B}_{en}(x\models en(q,q^{\prime}))=\sum_{\sigma\in en(q,q^{\prime})}\mathcal{B}_{alph}(x\models\sigma).$

That is, $\mathcal{B}_{en}(x\models en(q,q^{\prime}))$ represents the belief that $q$ makes the transition to $q^{\prime}$ given the atomic propositions that are satisfied in $x$. Note that $\sum_{q^{\prime}\in Post(q)}\mathcal{B}_{en}(x\models en(q,q^{\prime}))=1$, since the collection of all events (sets of atomic propositions) corresponding to the outgoing transitions from $q$ covers all possible events that can occur, i.e., $2^{AP}$. For a clarification, see the example below.

Figure 1. FSA that accepts all good prefixes for the scLTL formula $\phi=\Diamond a$. The marked node represents the accepting state.

(Example): Consider an environment with two states $X=\{x_{1},x_{2}\}$, let $AP=\{a,b\}$, and assume that the environmental beliefs of the atomic propositions are given by

(16) $\mathcal{B}(x_{1}\models a)=0.1,\ \mathcal{B}(x_{1}\models b)=0.1,\ \mathcal{B}(x_{2}\models a)=0.9,\ \mathcal{B}(x_{2}\models b)=0.2.$

Moreover, the scLTL formula is given by $\phi=\Diamond a$. The corresponding FSA $\mathcal{A}_{\phi}$ that accepts all good prefixes for $\phi$ is shown in Fig. 1. For example, since $q_{0}=\delta(q_{0},\varnothing)$ and $q_{0}=\delta(q_{0},\{b\})$, we have $en(q_{0},q_{0})=\{\varnothing,\{b\}\}$. Moreover, we have

(17) $\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{0}))=\mathcal{B}_{alph}(x_{1}\models\varnothing)+\mathcal{B}_{alph}(x_{1}\models\{b\})=0.9\times 0.9+0.9\times 0.1=0.9,$

which implies that, if the position of the rover is $x_{1}$, we have a high belief that $q_{0}$ takes the self-loop, i.e., the belief of reaching the accepting state $q_{1}$ is low. This is due to the fact that we have a low belief that $a$ is satisfied in $x_{1}$. Moreover, we have $\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{1}))=\mathcal{B}_{alph}(x_{1}\models\{a\})+\mathcal{B}_{alph}(x_{1}\models\{a,b\})=0.09+0.01=0.1$. Note that $\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{1}))+\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{0}))=1$, satisfying the probabilistic nature. This is due to the fact that the only outgoing transitions from $q_{0}$ are to $q_{0}$ and $q_{1}$, and the collection of all the corresponding events (sets of atomic propositions) is $\{\varnothing,\{a\},\{b\},\{a,b\}\}$ ($=2^{AP}$), which is indeed the set of all events that can occur. We also have $\mathcal{B}_{en}(x_{2}\models en(q_{0},q_{1}))=\mathcal{B}_{alph}(x_{2}\models\{a\})+\mathcal{B}_{alph}(x_{2}\models\{a,b\})=0.72+0.18=0.9$, implying a high belief of reaching the accepting state, which is due to the fact that we have a high belief that $a$ is satisfied in $x_{2}$. $\Box$
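The quantities in this example can be reproduced with a few lines of code; the following sketch evaluates (14) and (15) for the beliefs in (16), with our own dictionary encoding of the beliefs.

AP = ['a', 'b']
beliefs = {('x1', 'a'): 0.1, ('x1', 'b'): 0.1,   # Eq. (16)
           ('x2', 'a'): 0.9, ('x2', 'b'): 0.2}

def b_alph(x, sigma):
    """B_alph(x |= sigma), Eq. (14): the propositions in sigma hold at x
    and every remaining proposition in AP does not."""
    p = 1.0
    for ap in AP:
        b = beliefs[(x, ap)]
        p *= b if ap in sigma else (1 - b)
    return p

def b_en(x, en_set):
    """B_en(x |= en(q, q')), Eq. (15): sum over the enabling alphabets."""
    return sum(b_alph(x, sigma) for sigma in en_set)

# en(q0, q0) = {{}, {b}} and en(q0, q1) = {{a}, {a, b}} for phi = <>a
print(b_en('x1', [set(), {'b'}]))       # 0.81 + 0.09 = 0.9, as in (17)
print(b_en('x1', [{'a'}, {'a', 'b'}]))  # 0.09 + 0.01 = 0.1
print(b_en('x2', [{'a'}, {'a', 'b'}]))  # 0.72 + 0.18 = 0.9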

Based on the above, we define the product belief MDP as a composition of the rover's motion model $\mathcal{M}_{r}$ and the FSA $\mathcal{A}_{\phi}$ as follows:

Definition 1.

Let $\mathcal{M}_{r}=(X,x_{r},U_{r},p_{r})$ and $\mathcal{A}_{\phi}=(Q,2^{AP},\delta,q_{0},Q_{f})$ be the MDP motion model of the rover and the FSA corresponding to $\phi$, respectively. Moreover, given the environmental beliefs of the atomic propositions (2), let $\mathcal{B}_{en}(x\models en(q,q^{\prime}))$ for $x\in X$, $q,q^{\prime}\in Q$ be given by (15). Then, the product belief MDP between $\mathcal{M}_{r}$ and $\mathcal{A}_{\phi}$ is defined as a tuple $\mathcal{M}_{S}=(S,s_{0},U_{S},p_{S},S_{f})$, where

  • $S=X\times Q$ is the set of states;

  • $s_{0}=(x_{r},q_{0})\in S$ is the initial state;

  • $U_{S}=U_{r}$ is the set of control inputs;

  • $p_{S}:S\times U_{S}\rightarrow\mathcal{D}(S)$ is the transition belief function, defined as

    (18) $p_{S}((x^{\prime},q^{\prime})|(x,q),u)=p_{r}(x^{\prime}|x,u)\,\mathcal{B}_{en}(x\models en(q,q^{\prime}));$

  • $S_{f}=X\times Q_{f}\subseteq S$ is the set of accepting states. $\Box$

As shown in (18), the transition function is called a belief function instead of a probability function, since it is computed based on the environmental beliefs of the atomic propositions (i.e., the posteriors given the past observations). As previously mentioned, $\mathcal{B}_{en}(x\models en(q,q^{\prime}))$ represents the belief that $q$ makes the transition to $q^{\prime}$ according to the atomic propositions that are satisfied in $x$. Hence, $p_{S}((x^{\prime},q^{\prime})|(x,q),u)$ indicates the joint belief that the pair $(x,q)$ makes the transition to $(x^{\prime},q^{\prime})$ by applying $u$.

As shown in Definition 1, the set of states of the product MDP involves only $X$ and $Q$. This reduces the time complexity of synthesizing the optimal policy for the rover compared with the previous work; for details, see Section 4.3.2 (in particular, Remark 1).

4.3.2. Value iteration

Given $\mathcal{M}_{S}$, we denote by $\mu_{S}:S\rightarrow U_{S}(=U_{r})$ a policy for $\mathcal{M}_{S}$, which assigns a control input from $U_{S}$ to each state in $S$. Then, let $\mu_{S,seq}=\mu_{S}\mu_{S}\mu_{S}\ldots$ be the corresponding stationary policy sequence. We denote by ${\bf s}_{\mu_{S,seq}}=s(0)s(1)\ldots\in S^{\omega}$ with $s(\ell)=(x(\ell),q(\ell))$, $\forall\ell\in\mathbb{N}_{\geq 0}$, the trajectory of the product belief MDP $\mathcal{M}_{S}$ such that $s(0)=s_{0}$ (i.e., $x(0)=x_{r}$, $q(0)=q_{0}$) and $s(\ell+1)\sim p_{S}(\cdot|s(\ell),\mu_{S}(s(\ell)))$ for all $\ell\in\mathbb{N}_{\geq 0}$. Given ${\bf s}_{\mu_{S,seq}}=s(0)s(1)\ldots$ with $s(\ell)=(x(\ell),q(\ell))$, $\forall\ell\in\mathbb{N}_{\geq 0}$, we can induce the corresponding trajectory of $\mathcal{A}_{\phi}$ as $q(0)q(1)\ldots\in Q^{\omega}$. If the trajectory of $\mathcal{M}_{S}$ reaches an accepting state in $S_{f}$ in finite time, the corresponding trajectory of $\mathcal{A}_{\phi}$ reaches an accepting state in $Q_{f}$ in finite time (i.e., it satisfies $\phi$). Hence, the problem of maximizing the belief for the satisfaction of $\phi$ defined by (5) can be reduced to the problem of maximizing the belief that the trajectory of $\mathcal{M}_{S}$ reaches $S_{f}$ in finite time, i.e.,

(19) $\mu^{*}_{S}=\arg\max_{\mu_{S}}\ \mathcal{B}\left({\bf s}_{\mu_{S,seq}}\models\Diamond S_{f}\right).$

The problem (19) can indeed be solved via value iteration as follows (see, e.g., (abate2008, ; nilsson2018, )). Let $\mathsf{1}_{S_{f}}:S\rightarrow\{0,1\}$ be the indicator function given by $\mathsf{1}_{S_{f}}(s)=1$ if $s\in S_{f}$ and $0$ otherwise. Then, set $V^{0}(s)=\mathsf{1}_{S_{f}}(s)$, $\forall s\in S$, and for all $s\in S$, $\ell\in\mathbb{N}_{\geq 0}$,

(20) $V^{\ell+1}(s)=\max_{u\in U_{S}}\max\left(\mathsf{1}_{S_{f}}(s),\ \mathbb{E}_{s'\sim p_{S}(\cdot|s,u)}[V^{\ell}(s')|s,u]\right),$
(21) $\mu^{\ell+1}_{S}(s)=\arg\max_{u\in U_{S}}\max\left(\mathsf{1}_{S_{f}}(s),\ \mathbb{E}_{s'\sim p_{S}(\cdot|s,u)}[V^{\ell}(s')|s,u]\right).$

The above computations are iterated until they reach a fixed point, i.e., $\mu^{*}_{S}=\mu^{\ell'}_{S}=\mu^{\ell'+1}_{S}$ for some $\ell'$. Alternatively, one may iterate the above only for a given finite number of time steps $\overline{T}\in\mathbb{N}_{>0}$ with $\overline{T}\geq T_{r}$, i.e., iterate (20) and (21) for all $\ell\in\mathbb{N}_{0:\overline{T}-1}$ and set $\mu^{*}_{S}=\mu^{\overline{T}}_{S}$. This yields the optimal policy that maximizes the belief that the trajectory of $\mathcal{M}_{S}$ reaches $S_{f}$ within the time interval $[0,\overline{T}]$; in other words, it maximizes the belief that the length of the good prefix of the word satisfying $\phi$ is less than $\overline{T}$. Given the optimal policy $\mu^{*}_{S}$ computed as above, we can induce the policy sequence for the rover based on the trajectory of $\mathcal{M}_{S}$, i.e.,

(22) $\mu^{*}_{r,seq}=\mu^{*}_{r,0}\mu^{*}_{r,1}\mu^{*}_{r,2}\ldots,$

where $\mu^{*}_{r,\ell}(x(\ell))=\mu^{*}_{S}(s(\ell))$, $\forall\ell\in\mathbb{N}_{\geq 0}$.
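A minimal sketch of the reachability value iteration (20), (21), written against the matrices assembled in the sketch after Definition 1; the names and the matrix layout are assumptions of this sketch, not part of the formalism.

```python
import numpy as np

def value_iteration(P, accepting, tol=1e-8, max_iter=10_000):
    # P: list of (|S|, |S|) transition belief matrices, one per input u;
    # accepting: list of indices of the accepting states S_f.
    n = P[0].shape[0]
    one_f = np.zeros(n)
    one_f[accepting] = 1.0                    # indicator 1_{S_f}
    V = one_f.copy()                          # V^0 = 1_{S_f}
    for _ in range(max_iter):
        # Q_vals[u, s] = max(1_{S_f}(s), E_{s' ~ p_S(.|s,u)}[V(s')])
        Q_vals = np.stack([np.maximum(one_f, P_u @ V) for P_u in P])
        V_new = Q_vals.max(axis=0)            # eq. (20)
        if np.max(np.abs(V_new - V)) < tol:   # fixed point reached
            break
        V = V_new
    policy = Q_vals.argmax(axis=0)            # eq. (21)
    return V, policy
```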

Remark 1.

As shown in Definition 1, the product MDP involves only $X$ and $Q$, and does not involve the states of the environmental beliefs, as formulated in (nilsson2018, ). In particular, since the environmental beliefs of the atomic propositions are assigned to every state in $X$ in our problem setup, the time complexity of the value iteration algorithm is $O(|E^{|X|}\times X\times Q|^{2}\times|U_{S}|)$ in (nilsson2018, ), with $E$ being the set of states of the environmental belief, while it is $O(|X\times Q|^{2}\times|U_{S}|)$ in our approach. That is, the time complexity of the value iteration algorithm in the previous work is exponential in $|X|$, while it is polynomial in our approach. Therefore, our approach can reduce the running time and the memory usage of synthesizing the optimal policy for the rover in contrast to the previous work. $\Box$

4.3.3. Computing the reachability belief and $b_{\max}$

Suppose that the current rover's position is $x_{r}$ and the optimal policy $\mu^{*}_{S}$ is computed as above. Then, given $x\in X$ and $\ell\in\mathbb{N}_{0:T_{r}-1}$, we can compute a belief that the rover will reach $x$ from $x_{r}$ after $\ell$ time steps according to the optimal policy $\mu^{*}_{S}$. To this end, we denote the collection of all states of $\mathcal{M}_{S}$ by $S=\{s_{1},s_{2},s_{3},\ldots,s_{m}\}$, where $m$ is the number of states of $\mathcal{M}_{S}$. Given $x\in X$, we denote by $\mathcal{J}(x)\subseteq\{1,2,\ldots,m\}$ the set of indices whose corresponding states of $\mathcal{M}_{S}$ include $x$, i.e., $\mathcal{J}(x)=\{i\in\mathbb{N}_{1:m}\ |\ s_{i}=(x,q)\ {\rm for\ some}\ q\in Q\}$. If the policy $\mu^{*}_{S}$ is employed, the belief MDP $\mathcal{M}_{S}$ can be viewed as a belief Markov chain induced by $\mu^{*}_{S}$, denoted by $\mathcal{M}^{\mu^{*}_{S}}_{S}=(S,s_{0},p^{\mu^{*}_{S}}_{S})$, where $S=\{s_{1},s_{2},s_{3},\ldots,s_{m}\}$ is the set of states, $s_{0}=(x_{r},q_{0})$ is the initial state, and $p^{\mu^{*}_{S}}_{S}$ is the transition belief function defined by $p^{\mu^{*}_{S}}_{S}(s'|s)=p_{S}(s'|s,\mu^{*}_{S}(s))$, $\forall s,s'\in S$.

Input: $x_{r}$ (current rover's position); $\mathcal{B}$ (current environmental beliefs of the atomic propositions); $T_{r}$ (time period for the mission execution);
Output: $x_{r}$ (updated current position); $b_{\max}$ (the mapping that represents the maximum reachability belief); $\mathcal{B}$ (updated environmental beliefs of the atomic propositions);
1 Solve the value iteration (20), (21) until reaching a fixed point (or iterate them for all $\ell\in\mathbb{N}_{0:\overline{T}-1}$ for a given $\overline{T}\geq T_{r}$) and obtain the optimal policy $\mu^{*}_{S}:S\rightarrow U_{S}(=U_{r})$;
2 $x\leftarrow x_{r}$,  $q\leftarrow q_{0}$;
3 for $\ell=0:T_{r}-1$ do
4   $u^{*}_{r}\leftarrow\mu^{*}_{S}(x,q)$ and sample $(x_{next},q_{next})\sim p_{S}(\cdot|(x,q),u^{*}_{r})$;
5   $x\leftarrow x_{next}$,  $q\leftarrow q_{next}$;
6   for each $(x',ap)\in X\times AP_{r}$ with $\|x-x'\|\leq R^{r}_{ap}$ do
7     Provide the corresponding observation $Z^{r}_{x}(x',ap)=z$;
8     Update the belief as: $\mathcal{B}(x'\models ap)\leftarrow\cfrac{{\rm Pr}[Z^{r}_{x}(x',ap)=z\,|\,x'\models ap]\,\mathcal{B}(x'\models ap)}{{\rm Pr}[Z^{r}_{x}(x',ap)=z]}$;
9   end for
10 end for
11 $x_{r}\leftarrow x$ and solve the value iteration (20), (21) to update the optimal policy $\mu^{*}_{S}$;
12 Compute $b_{\max}$ according to Section 4.3.3;
13 return $x_{r}$, $b_{\max}$, $\mathcal{B}$;
Algorithm 4 $\mathit{MissionExecution}(x_{r},\mathcal{B},T_{r})$ (main algorithm for mission execution)

Now, let $b_{\ell}\in[0,1]^{m}$ for all $\ell\in\mathbb{N}_{0:T_{r}-1}$ be recursively given by $b_{\ell+1}=Ab_{\ell}$, where $A\in[0,1]^{m\times m}$ is the transition matrix of the belief Markov chain $\mathcal{M}^{\mu^{*}_{S}}_{S}$, and $b^{(i)}_{0}=1$ if $s_{i}$ is the initial state (i.e., $s_{i}=s_{0}=(x_{r},q_{0})$) and $b^{(i)}_{0}=0$ otherwise. That is, $b^{(i)}_{\ell}$ represents the belief that the state $s_{i}$ is reached after $\ell$ time steps from the initial state $s_{0}=(x_{r},q_{0})$ according to the optimal policy $\mu^{*}_{S}$. Based on the above, for each $x\in X$, we can compute the belief that $x$ is reached after $\ell$ time steps, denoted by $b_{\ell}(x)$, as $b_{\ell}(x)=\sum_{i\in\mathcal{J}(x)}b^{(i)}_{\ell}$. Then, let $b_{\max}(x)$ be given by $b_{\max}(x)=\max_{\ell\in\mathbb{N}_{0:T_{r}-1}}b_{\ell}(x)$, i.e., $b_{\max}(x)$ indicates the maximum belief that the rover will reach $x$ (starting from $x_{r}$) within $T_{r}$ time steps. That is, for a large value of $b_{\max}(x)$ ($b_{\max}(x)\approx 1$), we have a high belief that the rover will reach $x$ at some point in the time interval $[0,T_{r}]$.

As previously described in Section 4.2.2, the mapping $b_{\max}$ is utilized for the copter's exploration so as to effectively search cells that are relevant to the mission execution.
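A minimal sketch of this computation, assuming the transition matrix $A$ of the induced belief Markov chain is stored so that $b_{\ell+1}=Ab_{\ell}$ (i.e., $A[i,j]$ is the belief of moving from $s_{j}$ to $s_{i}$), and that J maps each cell $x$ to the index set $\mathcal{J}(x)$; the names are illustrative.

```python
import numpy as np

def max_reachability_belief(A, s0_index, J, T_r):
    # A: (m, m) transition matrix with A[i, j] = p_S^{mu*}(s_i | s_j),
    # so that b_{l+1} = A b_l; J: dict mapping each cell x to the list of
    # indices of product states (x, q), i.e., the set J(x).
    m = A.shape[0]
    b = np.zeros(m)
    b[s0_index] = 1.0                          # b_0: unit mass on s_0
    b_max = {x: b[idx].sum() for x, idx in J.items()}
    for _ in range(1, T_r):                    # l = 1, ..., T_r - 1
        b = A @ b                              # b_{l+1} = A b_l
        for x, idx in J.items():
            b_max[x] = max(b_max[x], b[idx].sum())
    return b_max
```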

4.3.4. Overall mission execution algorithm

We now summarize the main algorithm of the mission execution phase in Algorithm 4. As shown in the algorithm, the rover computes the optimal policy by solving the value iteration and applies it for the time period $T_{r}$. Moreover, while applying this policy, it takes the sensor measurements and updates the environmental beliefs of the atomic propositions (2) (lines 6–8 of Algorithm 4). Afterwards, the rover computes the mapping $b_{\max}$ according to the procedure described in Section 4.3.3. Finally, the algorithm returns the current rover's position $x_{r}$, the mapping $b_{\max}$, and the updated environmental beliefs of the atomic propositions in (2). A minimal sketch of this loop is given below.
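In the following sketch, which mirrors the structure of Algorithm 4, the objects mdp and sensor and all of their methods are hypothetical interfaces introduced only for illustration; the Bayesian update assumes a symmetric Bernoulli sensor, consistent with the sensor models of this paper.

```python
def mission_execution(x_r, B, T_r, mdp, sensor):
    # A sketch of Algorithm 4; mdp and sensor are hypothetical interfaces.
    policy = mdp.solve_value_iteration(B)          # line 1: eqs. (20), (21)
    x, q = x_r, mdp.q0                             # line 2
    for _ in range(T_r):                           # lines 3-10
        u = policy[(x, q)]                         # line 4
        x, q = mdp.sample((x, q), u)               # lines 4-5
        for (xp, ap) in sensor.cells_in_range(x):  # lines 6-8
            z = sensor.observe(x, xp, ap)          # Bernoulli measurement
            lik = sensor.likelihood(x, xp, ap, z)  # Pr[Z = z | x' |= ap]
            # For a symmetric sensor, Pr[Z = z | x' not|= ap] = 1 - lik.
            marg = lik * B[(xp, ap)] + (1 - lik) * (1 - B[(xp, ap)])
            B[(xp, ap)] = lik * B[(xp, ap)] / marg # Bayes rule, line 8
    policy = mdp.solve_value_iteration(B)          # line 11
    b_max = mdp.max_reachability_belief(x, policy, T_r)  # line 12
    return x, b_max, B                             # line 13
```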

5. Convergence analysis

In this section, we analyze the convergence properties of the algorithm presented in the previous section. In particular, we show that, by executing Algorithm 1 with the exploration phase given by the global selection-based approach (Algorithm 3), the environmental beliefs of the atomic propositions converge to the appropriate values, i.e., for all $x\in X$ and $ap\in AP$, $\mathcal{B}(x\models ap)\rightarrow 1$ if $ap\in L(x)$, and $\mathcal{B}(x\models ap)\rightarrow 0$ if $ap\notin L(x)$. Suppose that Algorithm 1 is implemented with the exploration phase given by Algorithm 3. To simplify the analysis, we make the following assumptions:

Assumption 1.

For every execution of Algorithm 3, it follows that $n_{succ}\geq 1$. $\Box$

Assumption 2.

For the sensor models of the copter and the rover, we assume the following:

  (i) $AP_{c}=AP_{r}=AP$.

  (ii) $R^{c}_{ap}=R^{r}_{ap}=0$ for all $ap\in AP$.

  (iii) $\beta^{c}_{x}(x,ap)=\beta^{c}_{x}(x,ap')=\beta^{r}_{x}(x,ap)=\beta^{r}_{x}(x,ap')$ for all $(ap,ap')\in AP\times AP$. $\Box$

Assumption 1 excludes the case where Algorithm 3 terminates without reaching any selected state $x^{*}$ to be explored. Moreover, the first condition in Assumption 2 means that both the copter and the rover are equipped with all sensors for $AP$. The second condition in Assumption 2 means, from (3) and (4), that the copter and the rover can take the sensor measurements only at their current states. The third condition in Assumption 2 means that the precision of the sensors is the same for both the rover and the copter.

For simplicity, we let $\beta=\beta^{c}_{x}(x,ap)=\beta^{c}_{x}(x,ap')=\beta^{r}_{x}(x,ap)=\beta^{r}_{x}(x,ap')$ for all $(ap,ap')\in AP\times AP$. Note that $\beta>0.5$ (see (3) and (4)). In addition, let $N_{x}\in\mathbb{N}_{>0}$ denote the total number of times the copter/rover visits the state $x$ and takes the corresponding sensor measurements for each $ap\in AP$, and let $m^{ap}_{N_{x}}\leq N_{x}$ ($x\in X$, $ap\in AP$) denote the number of times the corresponding sensor measurements for $ap$ are $1$. In other words, $N_{x}-m^{ap}_{N_{x}}$ represents the number of times the corresponding observations for $ap$ are $0$. Finally, we make the following assumption:

Assumption 3.

There exist $\varepsilon>0$ with $2\beta-2\varepsilon-1>0$ and $\bar{N}\in\mathbb{N}_{>0}$ such that, for all $x\in X$, $ap\in AP$, and $N_{x}\geq\bar{N}$, we have $\beta-\varepsilon\leq m^{ap}_{N_{x}}/N_{x}\leq\beta+\varepsilon$ if $ap\in L(x)$, and $\beta-\varepsilon\leq(N_{x}-m^{ap}_{N_{x}})/N_{x}\leq\beta+\varepsilon$ if $ap\notin L(x)$. $\Box$

Assumption 3 implies that the fraction of correct sensor measurements (i.e., $m^{ap}_{N_{x}}/N_{x}$ if $ap\in L(x)$ and $(N_{x}-m^{ap}_{N_{x}})/N_{x}$ if $ap\notin L(x)$) is $\varepsilon$-close to $\beta$ once the number of visits (sensor measurements) at $x$ is sufficiently large. The following theorem shows that, by executing Algorithm 1 with the exploration phase given by Algorithm 3, all the environmental beliefs of the atomic propositions converge to the appropriate values.

Theorem 1.

Let Assumptions 1–3 hold. Let $\alpha=0$ in (10) and suppose that Algorithm 1 is executed with the exploration phase given by Algorithm 3. Then, for all $x\in X$ and $ap\in AP$, it follows that

(23) $\mathcal{B}(x\models ap)\rightarrow 1,\ \ {\rm if}\ ap\in L(x),$
(24) $\mathcal{B}(x\models ap)\rightarrow 0,\ \ {\rm if}\ ap\notin L(x),$

as $k\rightarrow\infty$. $\Box$

Recall that $k$ is defined in Algorithm 1 and represents the (global) time step during the execution of Algorithm 1. Hence, Theorem 1 means that the environmental beliefs of the atomic propositions converge to the appropriate values as the number of iterations of the exploration/mission-execution phases goes to infinity. For the proof of Theorem 1, see Appendix A. A small numerical check of this convergence is sketched below.
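As a quick numerical illustration of Theorem 1 (not a substitute for the proof), the following sketch iterates the Bayesian update (8) for a single cell with a symmetric Bernoulli sensor of accuracy $\beta=0.9$; the belief approaches $1$ when $ap\in L(x)$, consistent with (23).

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9            # sensor accuracy (beta > 0.5), cf. (3), (4)
ap_holds = True       # ground truth: ap in L(x)
B = 0.5               # initial environmental belief B_0(x |= ap)
for _ in range(200):
    z = (rng.random() < beta) == ap_holds       # correct with prob. beta
    lik = beta if z else 1.0 - beta             # Pr[Z = z | x |= ap]
    marg = lik * B + (1.0 - lik) * (1.0 - B)    # Pr[Z = z]
    B = lik * B / marg                          # Bayes rule (8)
print(B)   # close to 1, as predicted by (23)
```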

Remark 2.

As indicated by Theorem 1, the convergence properties (23), (24) may not hold if $\alpha\neq 0$ in (10). Nevertheless, as previously stated in Section 4.2.2, setting $\alpha\neq 0$ is useful and important for practical applications, since it avoids the exploration of states that are completely irrelevant to the mission execution. Additionally, (23), (24) may not hold if the exploration phase is given by the local selection-based policy (Algorithm 2). Nevertheless, as previously stated in Section 4.2, utilizing Algorithm 2 is useful for practical applications where the computation capacity of the copter is severely limited, since the optimal control input can be obtained by evaluating the acquisitions only for the next states. $\Box$

Now, let $\hat{\mu}^{*}_{r,seq}$ denote the optimal policy sequence that maximizes the probability of satisfying $\phi$ under the assumption that the labeling function $L$ is known, i.e., $\hat{\mu}^{*}_{r,seq}=\arg\max_{\mu_{r,seq}}{\rm Pr}[{\bf x}_{\mu_{r,seq}}\models\phi]$. If $L$ is known, $\hat{\mu}^{*}_{r,seq}$ can be derived by constructing a product MDP with the knowledge of $L$ and solving the value iteration algorithm (for details, see the proof of Corollary 1 in Appendix B). Note that $\mu^{*}_{r,seq}$ given by (22) is not necessarily equal to $\hat{\mu}^{*}_{r,seq}$, since $\mu^{*}_{r,seq}$ is derived by maximizing the belief of satisfying $\phi$ based on the sensor measurements (i.e., $\mu^{*}_{r,seq}=\arg\max_{\mu_{r,seq}}\mathcal{B}({\bf x}_{\mu_{r,seq}}\models\phi)$). The following corollary, derived from Theorem 1, shows that $\mu^{*}_{r,seq}$ converges to $\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$.

Corollary 1.

Let Assumptions 1–3 hold. Let $\alpha=0$ in (10) and suppose that Algorithm 1 is executed with the exploration phase given by Algorithm 3. Then, it follows that $\mu^{*}_{r,seq}\rightarrow\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$. $\Box$

For the proof, see Appendix B.

6. Simulation results

In this section, we illustrate the effectiveness of the proposed algorithm through numerical simulations. The simulations were conducted in Python 3.7.9 on an AMD Ryzen 7 3700U (with Radeon Vega Mobile Gfx) CPU with 16 GB RAM.

6.1. Simulation 1

6.1.1. Problem setup

Figure 2. Environmental map of Simulation 1. Each letter (A, B, C, D, and O) labeling a cell indicates the atomic proposition that holds true in that cell.

We consider the environmental map consisting of $n=100$ cells as shown in Fig. 2. The set of states of the environment (i.e., the positions, or the centroids, of the cells) is given by $X=\{[i,j]^{\mathsf{T}}\in\mathbb{R}^{2},\ i=0,\ldots,9,\ j=0,\ldots,9\}$. The set of atomic propositions is given by $AP=\{A,B,C,D,O\}$, where $A,B,C$, and $D$ represent the atomic propositions of target objects that the rover seeks to discover, and $O$ represents the atomic proposition of obstacles that the rover needs to avoid at all times. For both the MDP motion models of the rover and the copter, the set of states is given by $X$. The sets of control inputs for the rover and the copter, i.e., $U_{r}$, $U_{c}$, are also the same and consist of 5 components: stay in the same cell, move up, move down, move right, and move left. Each control input drives the rover/copter towards one of the 8 cells adjacent to its current position. The transition probability for the rover's MDP is such that there is a 95% chance of moving from the current cell to the desired cell, and a remaining 5% chance of moving to one of the cells adjacent to the desired cell (with equal probability). Regarding the copter's MDP, it is assumed that there is a 90% chance of moving from the current cell to the desired cell, and a 10% chance of moving to one of the cells adjacent to the desired cell. We assume that the copter is able to move without regard to the obstacles (i.e., it can move over all states in $X$).

The rover is equipped with sensors to detect both the target objects and the obstacles, i.e., $AP_{r}=\{A,B,C,D,O\}$, while the copter is equipped with a sensor to detect only the obstacles, i.e., $AP_{c}=\{O\}$. Moreover, the sensor ranges are given by $R^{r}_{O}=R^{r}_{A}=R^{r}_{B}=R^{r}_{C}=R^{r}_{D}=2$, which implies from (3), (4) that the rover provides highly reliable sensor measurements only on its current cell and low-reliability measurements on its adjacent cells, and $R^{c}_{O}=4$, which implies that the copter can detect obstacles within 4 cells of its current cell. In addition, it is assumed that $M^{r}_{O}=M^{r}_{A}=M^{r}_{B}=M^{r}_{C}=M^{r}_{D}=0.5$, which implies from (3), (4) that the rover's maximum sensor accuracy is 100%. Moreover, $M^{c}_{O}=0.4$, which implies that the copter's maximum sensor accuracy is 90%. The scLTL specification for the rover is given by

(25) $\phi=\phi_{1}\vee\phi_{2}\vee\phi_{3},$

where

(26) $\phi_{1}=F_{o}A,\ \ \phi_{2}=F_{o}B\wedge\bigcirc F_{o}C,\ \ \phi_{3}=F_{o}C\wedge\bigcirc F_{o}D,$

with $F_{o}ap=\neg O\,U(\neg O\wedge ap)$ for $ap\in AP$. Intuitively, $\phi_{1}$ means that the rover should eventually discover the target $A$ while avoiding the obstacles. Moreover, $\phi_{2}$ (resp. $\phi_{3}$) means that the rover should eventually discover $B$ and then $C$ (resp. $C$ and then $D$) while avoiding the obstacles. During the execution of Algorithm 1, we set $T_{c}=5$, $T_{r}=3$, and $\alpha=1.5$ in the acquisition function (10). Moreover, we regard the mission as complete once the belief of satisfying $\phi$ under the rover's optimal policy exceeds $0.98$.

6.1.2. Simulation results

Figure 3. Snapshots of the environmental beliefs in Simulation 1. For simplicity, only the color maps for the three propositions $C,D,O$ are shown. The blue circles represent the rover's position, and the red circles represent the copter's position. The rover reached the cell $(0,5)$ and found $C$ at $k=325$, and then reached $(0,2)$, where $D$ exists, at $k=334$, at which point the mission is regarded as complete. It can also be verified that the rover avoided obstacles at all times; for details, see the animation in (animation, ).

Some snapshots of the simulation result obtained by applying Algorithm 1 are shown in Fig. 3. During the execution of Algorithm 1, the copter executes the global selection-based policy (Algorithm 3) until the mission is complete. In the figure, the environmental beliefs of the atomic propositions are illustrated as color maps, and, for simplicity, only the color maps for $C,D,O$ are shown. The rover's position is illustrated by the blue circle (shown only in the figures for $C,D$), and the copter's position is illustrated by the red circle (shown only in the figures for $O$). The figures show that the rover reached the cell where $C$ exists (at $k=325$), and then reached the cell where $D$ exists (at $k=334$), at which point the mission is regarded as complete. The whole behaviors of both the rover and the copter, as well as the time evolution of the environmental beliefs of the atomic propositions, can be seen in the animation; see (animation, ).

To compare the local (Algorithm 2) and the global (Algorithm 3) selection-based exploration for the copter, we iterated the following steps: (i) the initial positions of the rover and the copter are randomly chosen from $X$; (ii) using the generated initial positions, Algorithm 1 is executed with the local selection-based exploration (Algorithm 2); (iii) using the generated initial positions, Algorithm 1 is executed with the global selection-based exploration (Algorithm 3). The above steps were iterated 100 times, and for each exploration policy we counted the number of times the mission was successfully completed before $k=300$. At the same time, we measured the average running time (in sec) of the local/global exploration policy (i.e., the average execution time of Algorithm 2 and Algorithm 3). Table 1 illustrates the simulation results. The table indicates that the number of mission completions with the global selection-based exploration is larger than with the local one. This may be because the global selection-based exploration guarantees the convergence of the environmental beliefs of the atomic propositions, while the local one does not (see Section 4.2.3 and Section 5). On the other hand, the table also shows that the global policy requires heavier computation than the local one, because it needs to solve the value iteration algorithm to reach the selected state with the highest acquisition (for details, see (13)).

Table 1. The number of times the mission was successfully completed for the local/global exploration policy, and the average running time (in sec) of executing the local/global exploration policy for each iteration in Algorithm 1.

                       Number of mission completions   Average running time (s)
Local (Algorithm 2)    62                              3.0
Global (Algorithm 3)   71                              31

6.2. Simulation 2: comparison with the existing algorithm

In this section, we show that the proposed algorithm is advantageous over the existing algorithm (nilsson2018, ) in terms of the running time and the memory usage of solving the value iteration (see Remark 1 for a detailed explanation). We consider environments with different sizes of the state space: $n=|X|\in\mathcal{N}=\{6,9,12,15,50,100\}$. The set of atomic propositions is given by $AP=\{A,O\}$, where $A$ indicates the target object and $O$ indicates the obstacle. The mission specification is given by $\phi=\neg O\ U\ (\neg O\wedge A)$. For each $n\in\mathcal{N}$, we randomly generate the initial beliefs of the atomic propositions $\mathcal{B}(x\models ap)$ for all $x\in X$, $ap\in AP$ uniformly from the interval $(0,1)$ and solve the value iteration algorithm to synthesize the optimal policy for the rover according to the proposed approach (see Section 4.3, in particular, (20) and (21)). For the implementation of (nilsson2018, ), we assign the environmental belief to every state in the environment and solve the corresponding value iteration. The copter's exploration is not included in this simulation, since we focus on evaluating the running time of synthesizing the optimal policy for the rover. Table 2 shows the resulting running time (in sec) of solving the value iterations, where the symbol "—" indicates that the optimal policy could not be found due to memory overflow. The table shows that the running time of solving the value iteration with the existing algorithm increases rapidly as $n$ increases and, in particular, becomes infeasible for $n>12$ due to memory overflow.1) As described in Remark 1, such a blowup is due to the fact that the size of the state space of the product MDP increases exponentially with respect to $n$ $(=|X|)$. Therefore, the proposed approach is more useful than the existing approach in terms of the running time and the memory usage for synthesizing the optimal policy for the rover.

1) Note that the numerical simulation in (nilsson2018, ) considers a state space with $n=100$. This is possible because the atomic propositions are assigned only to some small regions of the state space. Specifically, the numerical simulation in (nilsson2018, ) considers that only 8 or 5 regions of the state space are of interest to be explored (see Fig. 6 in (nilsson2018, )), so that the computational complexity of solving the value iteration is reduced. In contrast, in our problem setup, the environmental beliefs of the atomic propositions are assigned to every single cell of the state space; thus, the algorithm in (nilsson2018, ) becomes infeasible for $n>12$ in our setup.

Table 2. Running time (in sec) of solving the value iterations using the proposed approach and the existing algorithm in (nilsson2018, ).

$n$                                    6      9      12     15     50     100
Proposed approach                      0.02   0.05   0.13   0.22   3.91   20.35
Previous approach in (nilsson2018, )   0.02   1.07   86.02  —      —      —

7. Conclusion and future works

In this paper, we investigated collaborative rover-copter path planning and exploration with temporal logic specifications under uncertain environments. The rover has the role of satisfying a mission specification expressed by an scLTL formula, while the copter has the role of assisting the rover by exploring the uncertain environment and reducing its uncertainties. The environmental uncertainties are captured by the environmental beliefs of the atomic propositions, which are the posterior probabilities that evaluate the level of uncertainty based on the sensor measurements. A control policy of the rover is then synthesized by maximizing the belief of the satisfaction of the scLTL formula through the implementation of an automata-based model checking. Then, an exploration policy for the copter is synthesized by evaluating the entropy, which represents the level of uncertainty, together with the rover's path according to the current optimal policy. Finally, the effectiveness of the proposed approach was validated through several numerical examples. Future work includes investigating safety guarantees (i.e., ensuring that the rover avoids obstacles at all times during the execution of Algorithm 1), as well as utilizing more sophisticated sensor models than the simple Bernoulli-type sensor models considered in this paper. In addition, extending the proposed approach to real-time and concurrency-related techniques, such as synthesizing a supervisor that determines the activity of both the rover and the copter, is left for future investigation.

Acknowledgement

This work was supported by JST ERATO Grant Number JPMJER1603, JST CREST Grant Number JPMJCR2012, Japan and JSPS Grant-in-Aid for Young Scientists Grant Number JP21K14184.

References

  • (1) F. L. Lewis, H. Zhang, K. Hengster-Movric, A. Das, Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches, Springer, 2014.
  • (2) B. Balaram, et al., Mars helicopter technology demonstrator, in: AIAA Atmospheric Flight Mechanics Conference, 2018, pp. 1–18.
  • (3) D. Brown, et al., Mars helicopter to fly on NASA's next red planet rover mission, in: NASA/JPL News Release, 2018.
  • (4) P. Nilsson, S. Haesaert, R. Thakker, K. Otsu, C. Vasile, A. Agha-Mohammadi, R. Murray, A. D. Ames, Toward specification-guided active mars exploration for cooperative robot teams, in: Proceedings of Robotics: Science and Systems (RSS), 2018.
  • (5) S. Bharadwaj, M. Ahmadi, T. Tanaka, U. Topcu, Transfer entropy in mdps with temporal logic specifications, in: 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 4173–4180.
  • (6) T. Sasaki, K. Otsu, R. Thakker, S. Haesaert, A. Agha-mohammadi, Where to map? iterative rover-copter path planning for mars exploration, IEEE Robotics and Automation Letters 5 (2) (2020) 2123–2130.
  • (7) H. Kress-Gazit, M. Lahijanian, V. Raman, Synthesis for Robots: Guarantees and Feedback for Robot Behavior, Annual Review of Control, Robotics, and Autonomous Systems 1 (2018) 211–236.
  • (8) C. Belta, A. Bicchi, M. Egerstedt, E. Frazzoli, E. Klavins, G. J. Pappas, Symbolic planning and control of robot motion [Grand Challenges of Robotics], IEEE Robotics and Automation Magazine 14 (1) (2007) 61–70.
  • (9) C. Belta, B. Yordanov, E. A. Gol, Formal methods for discrete-time dynamical systems, Vol. 89, Springer, 2017.
  • (10) C. Baier, J.-P. Katoen, Principles of model checking, The MIT Press, 2008.
  • (11) L. F. Bertuccelli, J. P. How, Robust uav search for environments with imprecise probability maps, in: Proceedings of the 44th IEEE Conference on Decision and Control, 2005, pp. 5680–5685.
  • (12) Y. Wang, I. I. Hussein, Bayesian-based decision making for object search and characterization, in: 2009 American Control Conference, 2009, pp. 1964–1969.
  • (13) I. I. Hussein, D. M. Stipanovic, Effective coverage control for mobile sensor networks with guaranteed collision avoidance, IEEE Transactions on Control Systems Technology 15 (4) (2007) 642–657.
  • (14) K. Imai, T. Ushio, Effective combination of search policy based on probability and entropy for heterogeneous mobile sensors, in: 2013 IEEE International Conference on Systems, Man, and Cybernetics, 2013, pp. 1981–1986.
  • (15) A. I. M. Ayala, S. B. Andersson, C. Belta, Temporal logic motion planning in unknown environments, in: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 5279–5284.
  • (16) J. Fu, N. Atanasov, U. Topcu, G. J. Pappas, Optimal temporal logic planning in probabilistic semantic maps, in: 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 3690–3697.
  • (17) M. Maly, M. Lahijanian, L. E. Kavraki, H. Kress-Gazit, M. Y. Vardi, Iterative temporal motion planning for hybrid systems in partially unknown environments, in: Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, 2013, p. 353–362.
  • (18) M. Guo, K. H. Johansson, D. V. Dimarogonas, Revising motion planning under linear temporal logic specifications in partially known workspaces, in: IEEE International Conference on Robotics and Automation (ICRA), 2013.
  • (19) M. Guo, D. V. Dimarogonas, Multi-agent plan reconfiguration under local ltl specifications, The International Journal of Robotics Research 34 (2) (2015) 218–235.
  • (20) M. Guo, M. M. Zavlanos, Probabilistic motion planning under temporal tasks and soft constraints, IEEE Transactions on Automatic Control 63 (12) (2018) 4051–4066.
  • (21) S. C. Livingston, R. M. Murray, J. W. Burdick, Backtracking temporal logic synthesis for uncertain environments, in: 2012 IEEE International Conference on Robotics and Automation, 2012, pp. 5163–5170.
  • (22) T. Wongpiromsarn, E. Frazzoli, Control of probabilistic systems under dynamic, partially known environments with temporal logic specifications, in: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 7644–7651.
  • (23) M. Lahijanian, J. Wasniewski, S. B. Andersson, C. Belta, Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees, in: 2010 IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 3227–3232.
  • (24) C. Yoo, C. Belta, Control with probabilistic signal temporal logic, in: Preprint: available on https://arxiv.org/pdf/1510.08474.pdf, 2015.
  • (25) D. Sadigh, A. Kapoor, Safe control under uncertainty with probabilistic signal temporal logic, in: Proceedings of Robotics: Science and Systems, 2016.
  • (26) C. Vasile, K. Leahy, E. Cristofalo, A. Jones, M. Schwager, C. Belta, Control in belief space with temporal logic specifications, in: 2016 IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 7419–7424.
  • (27) K. Leahy, E. Cristofalo, C. I. Vasile, A. Jones, E. Montijano, M. Schwager, C. Belta, Control in belief space with temporal logic specifications using vision-based localization, The International Journal of Robotics Research 38 (6) (2019) 702–722.
  • (28) E. M. Wolff, U. Topcu, R. M. Murray, Robust control of uncertain markov decision processes with temporal logic specifications, in: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 3372–3379.
  • (29) A. Ulusoy, T. Wongpiromsarn, C. Belta, Incremental controller synthesis in probabilistic environments with temporal logic constraints, The International Journal of Robotics Research 33 (8) (2014) 1130–1144.
  • (30) K. Hashimoto, A. Saoud, M. Kishida, T. Ushio, D. V. Dimarogonas, Learning-based symbolic abstractions for nonlinear control systems, in arxiv, available on https://arxiv.org/abs/2004.01879 (2020).
  • (31) D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, S. A. Seshia, A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications, in: 53rd IEEE Conference on Decision and Control, 2014, pp. 1091–1096.
  • (32) J. Wang, X. Ding, M. Lahijanian, I. Paschalidis, C. Belta, Temporal logic motion control using actor-critic methods, The International Journal of Robotics Research 34 (10) (2015) 1329–1344.
  • (33) X. Li, Y. Ma, C. Belta, A policy search method for temporal logic specified reinforcement learning tasks, in: 2018 Annual American Control Conference (ACC), 2018, pp. 240–245.
  • (34) M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, I. Lee, Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees, in: 2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 5338–5343.
  • (35) B. Johnson, H. Kress-Gazit, Analyzing and revising high-level robot behaviors under actuator error, in: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 741–748.
  • (36) B. Johnson, H. Kress-Gazit, Analyzing and revising synthesized controllers for robots with sensing and actuation errors, The International Journal of Robotics Research 34 (6) (2015) 816–832.
  • (37) P. Nuzzo, J. Li, A. L. Sangiovanni-Vincentelli, Y. Xi, D. Li, Stochastic assume-guarantee contracts for cyber-physical system design, ACM Transactions on Embedded Computing Systems 18 (1) (Jan. 2019).
  • (38) M. Tiger, F. Heintz, Incremental reasoning in probabilistic signal temporal logic, International Journal of Approximate Reasoning 119 (2020) 325 – 352.
  • (39) T. Latvala, Efficient model checking of safety properties, in: 10th International SPIN workshop, 2003, pp. 74–88.
  • (40) T. M. Cover, J. A. Thomas, Elements of Information Theory, Wiley Series, 2006.
  • (41) A. Abate, M. Prandini, J. Lygeros, S. Sastry, Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems, Automatica 44 (11) (2008) 2724–2734.
  • (42) [link to the animation]

Appendix A Proof of Theorem 1

Let us first rewrite the Bayesian update (8) as

(27) $\mathcal{B}_{j+1}(x\models ap)=\cfrac{{\rm Pr}[Z_{x}(x\models ap)=z\,|\,x\models ap]\,\mathcal{B}_{j}(x\models ap)}{{\rm Pr}[Z_{x}(x\models ap)=z]}$

for $j\in\mathbb{N}_{\geq 0}$, where we let $Z_{x}(x\models ap)=Z^{r}_{x}(x\models ap)$ (resp. $Z_{x}(x\models ap)=Z^{c}_{x}(x\models ap)$) if the rover (resp. copter) provides the sensor measurement for $ap$ at $x$. That is, $\mathcal{B}_{j}(x\models ap)$, $j\in\mathbb{N}_{\geq 0}$, represents the environmental belief of $ap$ at $x$ computed after $j$ visits (by either the rover or the copter) to $x$. If $Z_{x}(x\models ap)=1$ at the $(j+1)$-th visit to $x$, the Bayesian update is given by

(28) $\mathcal{B}_{j+1}(x\models ap)=\cfrac{\beta\mathcal{B}_{j}(x\models ap)}{\beta\mathcal{B}_{j}(x\models ap)+(1-\beta)(1-\mathcal{B}_{j}(x\models ap))}.$

After some simple calculations, we then obtain $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\frac{1-\beta}{\beta}\tilde{\mathcal{B}}_{j}(x\models ap)$, where we let $\tilde{\mathcal{B}}_{j}(x\models ap)=\frac{1}{\mathcal{B}_{j}(x\models ap)}-1$. On the other hand, if $Z_{x}(x\models ap)=0$ at the $(j+1)$-th visit, it follows that $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\frac{\beta}{1-\beta}\tilde{\mathcal{B}}_{j}(x\models ap)$. Therefore, we obtain

(29) $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\cfrac{1-\beta}{\beta}\,\tilde{\mathcal{B}}_{j}(x\models ap),\ \ {\rm if}\ Z_{x}(x\models ap)=1,$
(30) $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\cfrac{\beta}{1-\beta}\,\tilde{\mathcal{B}}_{j}(x\models ap),\ \ {\rm if}\ Z_{x}(x\models ap)=0.$
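For completeness, the calculation behind (29) is the following: taking the reciprocal of (28) and subtracting 1 gives

$\cfrac{1}{\mathcal{B}_{j+1}(x\models ap)}-1=\cfrac{\beta\mathcal{B}_{j}(x\models ap)+(1-\beta)(1-\mathcal{B}_{j}(x\models ap))}{\beta\mathcal{B}_{j}(x\models ap)}-1=\cfrac{1-\beta}{\beta}\left(\cfrac{1}{\mathcal{B}_{j}(x\models ap)}-1\right),$

which is exactly (29); the case $Z_{x}(x\models ap)=0$ in (30) follows in the same way with the roles of $\beta$ and $1-\beta$ exchanged.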

Suppose that the rover/copter visits $x$ a total of $N_{x}$ times and that $ap\in L(x)$. From (29), (30), we have

$\tilde{\mathcal{B}}_{N_{x}}(x\models ap)=\left(\cfrac{1-\beta}{\beta}\right)^{2m^{ap}_{N_{x}}-N_{x}}\tilde{\mathcal{B}}_{0}(x\models ap).$

Note that $\tilde{\mathcal{B}}_{0}(x\models ap)=\frac{1}{\mathcal{B}_{0}(x\models ap)}-1\in(0,\infty)$, since the initial belief of the atomic proposition is selected as $\mathcal{B}_{0}(x\models ap)\in(0,1)$ (see line 2 in Algorithm 1). From Assumption 3, it follows that $N_{x}(2\beta-2\varepsilon-1)\leq 2m^{ap}_{N_{x}}-N_{x}\leq N_{x}(2\beta+2\varepsilon-1)$ for all $N_{x}\geq\bar{N}$ and $ap\in L(x)$. Thus, we obtain $0<\tilde{\mathcal{B}}_{N_{x}}(x\models ap)\leq\{(1-\beta)/\beta\}^{N_{x}(2\beta-2\varepsilon-1)}\tilde{\mathcal{B}}_{0}(x\models ap)$ for all $N_{x}\geq\bar{N}$ and $ap\in L(x)$. Noting that $2\beta-2\varepsilon-1>0$ (see Assumption 3), $\beta>0.5$, and $\tilde{\mathcal{B}}_{0}(x\models ap)\in(0,\infty)$, we obtain $\tilde{\mathcal{B}}_{N_{x}}(x\models ap)\rightarrow 0$ as $N_{x}\rightarrow\infty$, which implies that $\mathcal{B}_{N_{x}}(x\models ap)\rightarrow 1$ as $N_{x}\rightarrow\infty$. Similarly, if $ap\notin L(x)$, we obtain $\tilde{\mathcal{B}}_{N_{x}}(x\models ap)\rightarrow\infty$, i.e., $\mathcal{B}_{N_{x}}(x\models ap)\rightarrow 0$ as $N_{x}\rightarrow\infty$. Hence, for all $x\in X$ and $ap\in AP$, we have

(31) $\lim_{N_{x}\rightarrow\infty}\mathcal{B}_{N_{x}}(x\models ap)=1,\ \ {\rm if}\ ap\in L(x),$
(32) $\lim_{N_{x}\rightarrow\infty}\mathcal{B}_{N_{x}}(x\models ap)=0,\ \ {\rm if}\ ap\notin L(x).$

In other words, (23), (24) are satisfied for all $x\in X$ and $ap\in AP$ if, for all $x\in X$, the number of visits to $x$ goes to infinity as $k\rightarrow\infty$, i.e., $N_{x}\rightarrow\infty$ as $k\rightarrow\infty$. In what follows, we show that $N_{x}\rightarrow\infty$ as $k\rightarrow\infty$ for all $x\in X$. With a slight abuse of notation, let $N_{x}(k)\leq k$ and $m^{ap}_{N_{x}}(k)\leq N_{x}(k)$ denote, respectively, the total number of times the rover/copter visits $x$ within the time step $k$ and the number of times the corresponding observations for $ap$ are $1$. To derive a contradiction, let $X_{not}\subset X$ denote the set of all states for which the number of visits does not go to infinity as $k\rightarrow\infty$, and assume that $X_{not}$ is non-empty. In other words, there exists a time step $k_{ter}\in\mathbb{N}_{>0}$ such that no $x\in X_{not}$ is visited after $k_{ter}$, i.e., $N_{x}(k)=N_{x}(k+1)$ for all $k\geq k_{ter}$. Hence, we have $\mathcal{B}_{N_{x}(k)}(x\models ap)=\mathcal{B}_{N_{x}(k+1)}(x\models ap)\in(0,1)$ for all $k\geq k_{ter}$, and, therefore, $H(\mathcal{B}_{N_{x}(k)}(x\models ap))=H(\mathcal{B}_{N_{x}(k+1)}(x\models ap))\in(0,1)$ for all $k\geq k_{ter}$, which implies that the entropy remains constant and does not converge to $0$. On the other hand, for all $x'\in X\backslash X_{not}$, $N_{x'}(k)\rightarrow\infty$ as $k\rightarrow\infty$. Hence, the environmental beliefs of the atomic propositions converge to the appropriate values, i.e., for all $x'\in X\backslash X_{not}$ and $ap\in AP$, $\mathcal{B}_{N_{x'}(k)}(x'\models ap)\rightarrow 1$ if $ap\in L(x')$ and $\rightarrow 0$ if $ap\notin L(x')$ as $k\rightarrow\infty$. Therefore, the entropy converges to $0$, i.e., for all $ap\in AP$ and $x'\in X\backslash X_{not}$, $H(\mathcal{B}_{N_{x'}(k)}(x'\models ap))\rightarrow 0$ as $k\rightarrow\infty$.

Now, since $\sum_{ap\in AP}H(\mathcal{B}_{N_{x'}(k)}(x'\models ap))\rightarrow 0$ for all $x'\in X\backslash X_{not}$, there exists a time step $\bar{k}\geq k_{ter}$ such that the following holds: for all $k\geq\bar{k}$, $x\in X_{not}$, and $x'\in X\backslash X_{not}$,

(33) $\sum_{ap\in AP}H(\mathcal{B}_{N_{x'}(k)}(x'\models ap))<\sum_{ap\in AP}H(\mathcal{B}_{N_{x}(k)}(x\models ap)).$

Recalling that we set $\alpha=0$ in (10), the inequality (33) implies that, at some time step after $\bar{k}\geq k_{ter}$, the copter would select some $x^{*}\in X_{not}$ at the beginning of the execution of Algorithm 3, which means, from Assumption 1, that the copter would visit $x^{*}$ after $\bar{k}\geq k_{ter}$. However, this contradicts the assumption that no $x\in X_{not}$ is visited after $k_{ter}$. This contradiction follows from the assumption that $X_{not}$ is non-empty. Therefore, $X_{not}$ is empty, and thus $N_{x}(k)\rightarrow\infty$ as $k\rightarrow\infty$ for all $x\in X$. In summary, (23) and (24) are satisfied for all $x\in X$ and $ap\in AP$. $\Box$

Appendix B Proof of Corollary 1

Let $\hat{\mathcal{M}}_{S}=(S,s_{0},U_{S},\hat{p}_{S},S_{f})$ denote the product MDP between $\mathcal{M}_{r}=(X,x_{r},U_{r},p_{r})$ and $\mathcal{A}_{\phi}=(Q,2^{AP},\delta,q_{0},Q_{f})$, where $S$, $s_{0}$, $U_{S}$, and $S_{f}$ are the same as in the product belief MDP of Definition 1, and $\hat{p}_{S}:S\times U_{S}\rightarrow\mathcal{D}(S)$ is the transition probability function defined by $\hat{p}_{S}((x',q')|(x,q),u)=p_{r}(x'|x,u)$ (for all $(x,q),(x',q')\in S$ and $u\in U_{S}$) iff $L(x)\in en(q,q')$, and $0$ otherwise. If $L$ is known, the optimal policy $\hat{\mu}^{*}_{r,seq}$ is obtained by solving the value iteration algorithm over the product MDP $\hat{\mathcal{M}}_{S}$. This fact, combined with Theorem 1, implies that if the product belief MDP $\mathcal{M}_{S}$ in Definition 1 converges to the product MDP $\hat{\mathcal{M}}_{S}$, i.e., $\mathcal{M}_{S}\rightarrow\hat{\mathcal{M}}_{S}$ as $k\rightarrow\infty$, then $\mu^{*}_{r,seq}\rightarrow\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$. To show that $\mathcal{M}_{S}\rightarrow\hat{\mathcal{M}}_{S}$ as $k\rightarrow\infty$, we need to show that $p_{S}((x',q')|(x,q),u)\rightarrow\hat{p}_{S}((x',q')|(x,q),u)$ for all $(x,q),(x',q')\in S$ and $u\in U_{S}$. Suppose that $\mathcal{B}(x\models ap)\rightarrow 1$ (resp. $\mathcal{B}(x\models ap)\rightarrow 0$) for all $ap\in L(x)$ (resp. $ap\notin L(x)$), i.e., all the environmental beliefs of the atomic propositions converge to the appropriate values. From (14), it then follows that $\mathcal{B}_{alph}(x\models\sigma)\rightarrow 1$ iff $\sigma=L(x)$, and $\rightarrow 0$ otherwise. From (15), we then obtain $\mathcal{B}_{en}(x\models en(q,q'))\rightarrow 1$ iff $L(x)\in en(q,q')$, and $\rightarrow 0$ otherwise. Hence, the transition belief function (18) satisfies $p_{S}((x',q')|(x,q),u)\rightarrow p_{r}(x'|x,u)$ (for all $(x,q),(x',q')\in S$ and $u\in U_{S}$) iff $L(x)\in en(q,q')$, and $\rightarrow 0$ otherwise, which indeed coincides with $\hat{p}_{S}((x',q')|(x,q),u)$. Hence, it follows that $p_{S}((x',q')|(x,q),u)\rightarrow\hat{p}_{S}((x',q')|(x,q),u)$ for all $(x,q),(x',q')\in S$ and $u\in U_{S}$, and, therefore, $\mathcal{M}_{S}\rightarrow\hat{\mathcal{M}}_{S}$ as $k\rightarrow\infty$. As described above, this directly implies $\mu^{*}_{r,seq}\rightarrow\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$. $\Box$