
Collaborative rover-copter path planning and exploration with temporal logic specifications based on Bayesian update under uncertain environments

Kazumune Hashimoto, Graduate School of Engineering, Osaka University, Suita, Japan; Natsuko Tsumagari, Graduate School of Engineering and Science, Osaka University, Toyonaka, Japan; and Toshimitsu Ushio, Graduate School of Engineering and Science, Osaka University, Toyonaka, Japan
(2021)
Abstract.

This paper investigates collaborative rover-copter path planning and exploration with temporal logic specifications under uncertain environments. The objective of the rover is to complete a mission expressed by a syntactically co-safe linear temporal logic (scLTL) formula, while the objective of the copter is to actively explore the environment and reduce its uncertainties, thereby assisting the rover and enhancing the efficiency of mission completion. To formalize our approach, we first capture the environmental uncertainties by environmental beliefs of the atomic propositions, under the assumption that it is unknown which properties (or atomic propositions) are satisfied in each area of the environment. The environmental beliefs of the atomic propositions are updated according to the Bayes rule based on Bernoulli-type sensor measurements provided by both the rover and the copter. Then, the optimal policy for the rover is synthesized by maximizing a belief of the satisfaction of the scLTL formula through automata-based model checking. An exploration policy for the copter is then synthesized by employing the notion of entropy, evaluated based on the environmental beliefs of the atomic propositions, and the path that the rover intends to follow according to the optimal policy. As such, the copter can actively explore regions whose uncertainties are high and that are relevant to the mission completion. Finally, some numerical examples illustrate the effectiveness of the proposed approach.

Collaborative motion planning, Temporal logics, Bayesian-based decision making, Uncertain environments
Copyright: ACM; journal year: 2021; doi: 10.1145/3470453; CCS: Computer systems organization, Robotics; Theory of computation, Modal and temporal logics

1. Introduction

Autonomous systems play an important role in accomplishing complex, high-level scientific missions autonomously under uncertain environments. To increase the efficiency of completing such missions, collaboration among multiple, heterogeneous robots has attracted much attention in recent years, see, e.g., (lewis2014, ). In this paper, we are particularly interested in the situation where the mission is completed through the collaboration of an unmanned ground vehicle (UGV), called a rover, and an unmanned aerial vehicle (UAV), called a (heli)copter. The utilization of the rover-copter collaboration is motivated by the fact that the rover has the role of completing the mission (e.g., searching for a target object), while the copter has the role of assisting the rover so as to enhance the efficiency of mission completion. Specifically, the copter aims at actively exploring the environment and reducing its uncertainties by revealing which properties (obstacles, free space, etc.) are satisfied in each area of the environment. For example, the copter checks that no obstacles are present along the path that the rover intends to follow in the environment. By doing so, the rover will be able to complete the mission while guaranteeing safety. Employing the rover-copter collaboration is also motivated by the fact that the National Aeronautics and Space Administration (NASA) is launching the Mars 2020 mission (balaram2018, ). In particular, to investigate Martian geology and habitability, NASA has decided to send copters to Mars in order to help the rover discover target samples in an efficient way (landau2015, ). Motivated by this fact, several motion planning techniques employing the rover-copter collaboration for Mars exploration have been investigated in recent years, see, e.g., (nilsson2018, ; bharadwaj2018, ; sasaki2020, ).

As briefly mentioned above, planning under environmental uncertainties involves two distinct major problems: the first is how to synthesize a control policy such that a complex, high-level mission specification can be satisfied in an automatic way, and the second is how to explore the environment so as to effectively reduce its uncertainties. In this paper, we propose a novel algorithm to solve these two problems by making use of the rover-copter collaboration. First, we tackle the former problem by employing temporal logic synthesis techniques (temporalreview1, ; temporalreview2, ; belta2017formal, ). More specifically, we express a mission specification by a syntactically co-safe linear temporal logic (scLTL) formula. In contrast to a simple reach-avoid task, an scLTL formula has the ability to describe various complex specifications that involve logical and temporal constraints (belta2017formal, ). Moreover, the optimal policy that fulfills the scLTL specification can be synthesized using a value iteration algorithm, which is in general more computationally efficient than synthesizing controllers for LTL (which requires solving a Rabin game) or STL (which requires solving a (mixed) integer program). Additionally, the use of scLTL is often natural in practice, since path planning problems typically deal with missions that terminate in finite time, rather than in infinite time as with LTL. To formalize our approach, we first capture the uncertain environment by assuming that it is unknown which properties or atomic propositions are satisfied in each area of the environment. Specifically, we define environmental beliefs of the atomic propositions, described by posterior probabilities, that evaluate their uncertainties based on the sensor measurements in each area of the environment. As will be detailed later, these beliefs are updated according to the Bayes rule based on sensor measurements provided by both the copter and the rover. The optimal policy for the rover is then synthesized by maximizing a belief of the satisfaction of the scLTL formula for the controlled trajectory of the rover. In particular, based on automata-based model checking (see, e.g., (baier, )), we combine a motion model of the rover described by a Markov decision process (MDP) and a finite state automaton (FSA) that accepts all good prefixes satisfying the scLTL formula. This combined model, called a product belief MDP, has a transition function induced by the current environmental beliefs of the atomic propositions. The problem of finding the optimal policy is then reduced to a finite-time reachability problem in the product belief MDP, which can be solved via a value iteration algorithm.

The latter problem (i.e., how to explore the environment so as to effectively reduce its uncertainties) will be mainly solved by the copter, since it is able to move more quickly and freely than the rover and is thus suited for exploration. Roughly speaking, the objective of the copter is to actively explore the environment and reduce its uncertainties by updating the environmental beliefs of the atomic propositions. We first describe the observations by employing a Bernoulli-type sensor model, see, e.g., (bertuccelli2005, ; wang2009, ; hussein2007, ; imai2013, ). The Bernoulli sensor abstracts the complexity of image processing into binary observations. Despite its simplicity, the Bernoulli sensor model is commonly used in the UAV community for the following reasons: (i) it is able to capture erroneous observations; (ii) it is able to capture the limited sensor range; (iii) in contrast to more sophisticated sensor models, such as those involving probability density functions, the Bayesian update can be computed without any integrals or approximations. In particular, the third feature is well-suited for the copter's exploration, as the computational power of the CPU and the battery capacity are often limited, and it is desirable to make the belief updates as "computationally light" as possible. The exploration algorithm is given by employing the notion of entropy, evaluated based on the environmental beliefs of the atomic propositions, and the path that the rover intends to follow according to the current optimal policy. As such, the copter can put an emphasis on actively exploring regions whose uncertainties are high and that are relevant to the mission completion.

Related works and contributions of this paper

Based on the above, the approach presented in this paper is related to previous works in the literature in terms of the following aspects:

  1. Motion planning/exploration employing the rover-copter collaboration;

  2. Temporal logic planning under environmental uncertainties.

In what follows, we discuss how our approach differs from the previous works and highlight our main contributions.

Several motion planning/exploration techniques employing the rover-copter collaboration have been provided, see, e.g., (nilsson2018, ; bharadwaj2018, ; sasaki2020, ). In particular, our approach is closely related to (nilsson2018, ), in the sense that the overall synthesis problem is decomposed into two sub-problems, i.e., the problem of synthesizing the copter's exploration policy in the uncertain environment, and the problem of synthesizing the optimal policy for the rover so as to satisfy the scLTL formula. Our approach builds upon this previous work, in terms of both the synthesis of the rover's optimal policy and that of the copter's exploration policy, in the following ways. Rather than capturing the environmental uncertainties by belief MDPs (see Section II.A in (nilsson2018, )), in which the environmental belief states are given in a discrete space, this paper captures the environmental uncertainties by assigning beliefs in a continuous space (i.e., the beliefs can take continuous values in the interval $[0,1]$). This allows us to apply the Bayes rule to update the beliefs according to the sensor measurements provided by both the rover and the copter. Moreover, while the previous work solves a value iteration over the state space of a product MDP that involves the set of environmental belief states, our approach defines a product MDP that does not involve such states, i.e., the set of states of the product MDP combines only the set of states of the MDP motion model of the rover and the set of states of the FSA that accepts all trajectories satisfying the scLTL formula (not including the environmental belief states). Hence, we can alleviate the time complexity of synthesizing the optimal policies for the rover in comparison with the previous work. In addition, we provide a theoretical convergence analysis of the proposed algorithm, in which the environmental beliefs of the atomic propositions are shown to converge to the appropriate values.

Besides the above, many temporal logic planning schemes under environmental uncertainties have been proposed. Most of the previous works assume that the environment has unknown properties (atomic propositions) (ayala2013, ; fu2016, ; maly2013, ; meng2013a, ; meng2015a, ; meng2018, ; livingston2012, ; wongpiromsarn2012, ), that the motion model of the robot includes uncertainties (lahijanian2010, ; yoo2016, ; sadigh16, ; vasile2016, ; leahy2019, ; wolff2012, ; ulusoy2014, ; wongpiromsarn2012, ; kazumune2020, ), or that the motion model of the robot is completely unknown (sadigh2014, ; jing2015, ; li2018, ; hasanbeig2019, ). Since this paper assumes that it is unknown which atomic propositions are satisfied in the environment, our approach is particularly related to the first category, i.e., (ayala2013, ; fu2016, ; maly2013, ; meng2013a, ; meng2015a, ; meng2018, ; livingston2012, ; wongpiromsarn2012, ). For example, (ayala2013, ) proposed to combine automata-based model checking and run-time verification for temporal logic motion planning under incomplete knowledge about the workspace. (fu2016, ) proposed a temporal logic synthesis under probabilistic semantic maps obtained by simultaneous localization and mapping (SLAM). Moreover, (meng2013a, ) proposed a planning revision scheme under incomplete knowledge about the workspace. Our approach is essentially different from the above previous works, in the sense that we incorporate sensor failures in the observations of the atomic propositions. In particular, as previously mentioned, we employ a Bernoulli-type sensor model to describe erroneous observations, and update the beliefs based on the Bayes rule. Besides, the synthesis approach (e.g., the construction of the product MDP) is also different from the above previous works, since we make use of the beliefs to synthesize control policies, see Section 4.3. Moreover, our approach differs in that we incorporate an explicit algorithm for exploration, so as to reduce the environmental uncertainties. Other than the above previous works, a few approaches that take into account sensor failures/noise have been provided, see, e.g., (johnson2013, ; johnson2015, ; nuzzo, ; TIGER2020325, ). For example, in (johnson2015, ), the authors proposed probabilistic model checking for reactive synthesis under sensor and actuator failures. Moreover, (nuzzo, ) introduced the concept of stochastic signal temporal logic (StSTL), and provided both verification and synthesis techniques using assume/guarantee contracts. In contrast to the above previous works, we here propose a Bayesian approach, in which beliefs assigned over the environment are introduced and updated based on observations provided by the Bernoulli sensors.

In summary, the main novelties of this paper with respect to the related works are as follows: using the copter-rover collaboration, we develop a new approach to synthesizing an optimal policy for the rover so as to satisfy an scLTL formula, and an exploration policy for the copter so as to update the environmental beliefs of the atomic propositions and reduce the environmental uncertainties. In particular:

  1. Using the Bernoulli sensor model and the Bayesian update, we propose a novel exploration algorithm for the copter so as to update the environmental beliefs and effectively reduce the environmental uncertainties (for details, see Section 4.2);

  2. We propose a novel framework to synthesize the optimal policy for the rover. In particular, we solve a value iteration over a product MDP whose state space does not involve the set of states of the environmental beliefs. This leads to a reduction of the time complexity of the value iteration in comparison with the previous work (for details, see Section 4.3);

  3. We provide a theoretical convergence analysis of the proposed algorithm, where it is shown that the environmental beliefs of the atomic propositions converge to the appropriate values (for details, see Section 5).

The remainder of this paper is organized as follows. In Section 2, we provide some preliminaries on Markov decision processes and syntactically co-safe LTL formulas. In Section 3, we formulate the problem that we seek to solve in this paper. In Section 4, we describe the main algorithm, which aims to synthesize both an exploration policy to update the environmental beliefs of the atomic propositions and the optimal policy to satisfy an scLTL formula. In Section 5, we analyze the convergence property of the proposed algorithm. In Section 6, we illustrate the effectiveness of the proposed approach through a simulation example. We finally conclude in Section 7.

Notation. Let $\mathbb{N}$, $\mathbb{N}_{\geq 0}$, $\mathbb{N}_{>0}$, and $\mathbb{N}_{a:b}$ be the set of integers, non-negative integers, positive integers, and integers in the interval $[a,b]$, respectively. Let $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, $\mathbb{R}_{>0}$, and $\mathbb{R}_{a:b}$ be the set of reals, non-negative reals, positive reals, and reals in the interval $[a,b]$, respectively. For a given vector $x\in\mathbb{R}^{n}$, denote by $x^{(i)}$ the $i$-th element of $x$. Given a finite set $X$, let $\mathcal{D}(X)$ denote the set of all probability distributions on $X$, i.e., the set of all functions $p:X\rightarrow[0,1]$ such that $\sum_{x\in X}p(x)=1$.

2. Preliminaries

2.1. Markov Decision Process

A Markov Decision Process (MDP) is defined as a tuple $\mathcal{M}=(X,x_{0},U,p)$, where $X$ is the finite set of states, $x_{0}\in X$ is the initial state, $U$ is the finite set of control inputs, and $p:X\times U\rightarrow\mathcal{D}(X)$ is the transition probability function that associates, with each state $x\in X$ and input $u\in U$, the corresponding probability distribution over $X$. For simplicity of presentation, we abbreviate $p(x,u)(x^{\prime})$ as $p(x^{\prime}|x,u)$. Given $\mathcal{M}=(X,x_{0},U,p)$, a policy sequence $\mu_{seq}=\mu_{1}\mu_{2}\ldots$ is defined as an infinite sequence of mappings $\mu_{k}:X\rightarrow U$, $k\in\mathbb{N}$. Namely, each $\mu_{k}$, $k\in\mathbb{N}_{\geq 0}$, represents a policy as a mapping from each state in $X$ onto the corresponding control input in $U$. The policy sequence $\mu_{seq}$ is called stationary if the policy is invariant for all times, i.e., $\mu_{k}=\mu_{k+1}$, $\forall k\in\mathbb{N}_{\geq 0}$. Given a policy sequence $\mu_{seq}=\mu_{1}\mu_{2}\ldots$, a trajectory induced by $\mu_{seq}$ is denoted by ${\bf x}_{\mu_{seq}}=x(0)x(1)\ldots\in X^{\omega}$, where $x(0)=x_{0}$ and $x(k+1)\sim p(\cdot|x(k),\mu_{k}(x(k)))$, $\forall k\in\mathbb{N}_{\geq 0}$.
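To make the notation concrete, the following minimal Python sketch samples a finite trajectory of an MDP under a stationary policy; the dictionary-based containers ($p[(x,u)]$ as a map from successor states to probabilities, $policy$ as a map from states to inputs) are illustrative assumptions, not part of the formal model.

import random

def sample_trajectory(x0, policy, p, horizon):
    """Sample x(0) x(1) ... x(horizon) with x(k+1) ~ p(.|x(k), mu(x(k))).

    Assumed containers: p[(x, u)] is a dict {x': probability} and
    policy maps each state to a control input (a stationary policy).
    """
    traj = [x0]
    for _ in range(horizon):
        x = traj[-1]
        u = policy[x]
        states, probs = zip(*p[(x, u)].items())
        traj.append(random.choices(states, weights=probs)[0])
    return traj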

2.2. Syntactically co-safe LTL

Syntactically co-safe LTL (scLTL for short) is defined using a set of atomic propositions $AP$, Boolean operators, and temporal operators. Atomic propositions are Boolean variables taking either true or false. Specifically, scLTL formulas are constructed according to the following grammar:

(1) $\phi::={\rm true}\ |\ ap\ |\ \neg ap\ |\ \phi_{1}\wedge\phi_{2}\ |\ \phi_{1}\vee\phi_{2}\ |\ \bigcirc\phi\ |\ \phi_{1}\,\mathit{U}\,\phi_{2},$

where $ap\in AP$ is an atomic proposition, $\neg$ (negation), $\wedge$ (conjunction), and $\vee$ (disjunction) are the Boolean connectives, and $\bigcirc$ (next) and $\mathit{U}$ (until) are the temporal operators. The semantics of an LTL formula is inductively defined over an infinite sequence of sets of atomic propositions ${\bf w}=w_{0}w_{1}\cdots\in(2^{AP})^{\omega}$. Intuitively, an atomic proposition $ap\in AP$ is satisfied iff $ap$ is true at $w_{0}$ (i.e., $ap\in w_{0}$). Moreover, $\neg ap$ is satisfied iff $ap$ is not true at $w_{0}$ (i.e., $ap\notin w_{0}$). $\phi_{1}\wedge\phi_{2}$ is satisfied iff both $\phi_{1}$ and $\phi_{2}$ are satisfied. $\phi_{1}\vee\phi_{2}$ is satisfied iff $\phi_{1}$ or $\phi_{2}$ is satisfied. $\bigcirc\phi$ is satisfied iff $\phi$ is satisfied for the suffix of ${\bf w}$ that begins at the next position, i.e., $w_{1}w_{2}\cdots$. Finally, $\phi_{1}\mathit{U}\phi_{2}$ is satisfied iff $\phi_{1}$ is satisfied until $\phi_{2}$ is satisfied. Given ${\bf w}=w_{0}w_{1}\cdots\in(2^{AP})^{\omega}$ and an scLTL formula $\phi$, we write ${\bf w}\models\phi$ iff ${\bf w}$ satisfies $\phi$. It is known that every ${\bf w}=w_{0}w_{1}\cdots$ that satisfies the scLTL formula $\phi$ contains a finite good prefix $w_{0}w_{1}\cdots w_{n}$ for some $n\in\mathbb{N}_{\geq 0}$, such that $w_{0}w_{1}\cdots w_{n}{\bf w}^{\prime}\in(2^{AP})^{\omega}$ also satisfies $\phi$ for all ${\bf w}^{\prime}\in(2^{AP})^{\omega}$.

A finite state automaton (FSA) is defined as a tuple $\mathcal{A}=(Q,\Sigma,\delta,q_{0},Q_{f})$, where $Q$ is a set of states, $\Sigma$ is the input alphabet, $\delta:Q\times\Sigma\rightarrow Q$ is the transition function, $q_{0}\in Q$ is the initial state, and $Q_{f}\subseteq Q$ is the set of accepting states. Moreover, denote by $Post:Q\rightarrow 2^{Q}$ the successors of each state in $Q$, i.e., $Post(q)=\{q^{\prime}\in Q\ |\ \exists\sigma\in\Sigma,\ q^{\prime}=\delta(q,\sigma)\}$. It is known that any scLTL formula $\phi$ can be translated into an FSA with $\Sigma=2^{AP}$ that accepts all good prefixes for $\phi$. We denote by $\mathcal{A}_{\phi}$ the FSA corresponding to the scLTL formula $\phi$. The translation from scLTL formulas to FSAs can be done automatically using several off-the-shelf tools, such as SCHECK2 (latvala2003, ).
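As an illustration, the following sketch hand-codes the FSA for the simple scLTL formula $\phi=\Diamond a$ ("eventually $a$"), which also appears in Fig. 1 later; the Python encoding of the states and the transition function is ours, not the output of a translation tool.

# Hand-coded FSA for the scLTL formula "eventually a" (cf. Fig. 1):
# it accepts exactly the finite words in which some letter contains 'a'.
q0, Qf = 'q0', {'q1'}

def delta(q, sigma):
    """Transition function; sigma is a set of atomic propositions."""
    if q == 'q0':
        return 'q1' if 'a' in sigma else 'q0'
    return 'q1'  # the accepting state q1 is absorbing

def accepts(word):
    """Check whether a finite word (a list of sets) is a good prefix."""
    q = q0
    for sigma in word:
        q = delta(q, sigma)
    return q in Qf

assert accepts([set(), {'b'}, {'a'}])   # 'a' eventually holds
assert not accepts([set(), {'b'}])      # 'a' never holds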

3. Problem formulation

In this section, we describe the uncertain environment, the motion and sensor models of the rover and the copter, and the main problem that we seek to solve in this paper.

3.1. Uncertain environment

We capture the environment as a two-dimensional map consisting of $n$ cells. For example, this map may be obtained by discretizing a given bounded search area into a uniform grid with $n$ cells. Let $x_{i}\in\mathbb{R}^{2}$, $i\in\{1,\ldots,n\}$, be the position (centroid) of cell $i$, and let $X=\{x_{1},\ldots,x_{n}\}$. Moreover, we denote by $AP$ the set of atomic propositions, which represents a set of labels or properties that can be satisfied in the states. In addition, we denote by $L:X\rightarrow 2^{AP}$ the labeling function, which maps each state $x\in X$ onto the corresponding set of atomic propositions that are satisfied in $x$. For example, if $L(x)=\{\mathit{obstacle}\}$ with $AP=\{\mathit{obstacle}\}$, it intuitively means that there is an obstacle in the state $x$. In this paper, it is assumed that the labeling function $L$ is unknown due to the uncertainty of the environment, i.e., we do not have complete knowledge about the properties of the states in the environment. Thus, instead of $L$, we make use of the belief, or the posterior probability (given the past observations), as follows:

(2) $\mathcal{B}(x\models ap)\in[0,1]$

for all $x\in X$ and $ap\in AP$, where $x\models ap$ denotes that $ap$ is satisfied in $x$ (i.e., $ap\in L(x)$). For example, $\mathcal{B}(x\models\mathit{obstacle})=1$ with $AP=\{\mathit{obstacle}\}$ intuitively means that it is certain that there exists an obstacle in $x$. In addition, $\mathcal{B}(x\models\mathit{obstacle})=0.5$ intuitively means that it is completely unknown whether there exists an obstacle in $x$. The belief that $ap$ is not satisfied in $x$ is denoted by $\mathcal{B}(x\models\neg ap)$ and, from (2), it is computed as $\mathcal{B}(x\models\neg ap)=1-\mathcal{B}(x\models ap)$. In what follows, the beliefs in (2) are called the environmental beliefs of the atomic propositions. As we will see later, the environmental beliefs of the atomic propositions are updated based on the observations provided by sensors mounted on the copter and the rover.

3.2. Rover and copter model

3.2.1. Motion model

The motion of the rover is modeled by an MDP $\mathcal{M}_{r}=(X,x_{r_{0}},U_{r},p_{r})$, where $X$ is the set of states (the environment), $x_{r_{0}}\in X$ is the initial state of the rover, $U_{r}$ is the finite set of inputs, and $p_{r}$ is the transition probability function. Similarly, the motion of the copter is modeled by an MDP $\mathcal{M}_{c}=(X,x_{c_{0}},U_{c},p_{c})$, where $X$ is the set of states, $x_{c_{0}}\in X$ is the initial state of the copter, $U_{c}$ is the finite set of inputs, and $p_{c}$ is the transition probability function.

3.2.2. Sensor model

The rover is equipped with sensors that can provide observations on several atomic propositions in $AP$. Specifically, let $AP_{r}\subseteq AP$ be the set of atomic propositions, or properties, that can be observed by the rover's sensors. For example, if $AP_{r}=\{\mathit{target}\}$, the rover is equipped with a sensor that can detect a target object. To describe erroneous observations, we use a Bernoulli-type sensor model (see, e.g., (bertuccelli2005, ; wang2009, ; hussein2007, ; imai2013, )) as follows. Suppose that the rover's position is $x\in X$, and we would like the rover to know whether an atomic proposition $ap\in AP_{r}$ is satisfied at the position $x^{\prime}\in X$. Since the rover can provide sensor measurements only within a limited range, it is assumed that $\|x-x^{\prime}\|\leq R^{r}_{ap}$, where $R^{r}_{ap}\in\mathbb{R}_{>0}$ is a given sensor range for $ap$. The corresponding observation is described by a binary variable, denoted by $Z^{r}_{x}(x^{\prime}\models ap)\in\{0,1\}$. For example, if $Z^{r}_{x}(x^{\prime}\models\mathit{obstacle})=1$ with $AP_{r}=\{\mathit{obstacle}\}$, the rover at position $x$ detects an obstacle at $x^{\prime}$ with the corresponding sensor. The conditional probabilities that the sensor provides the correct or the false measurement are characterized as follows:

$${\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=1\,|\,x^{\prime}\models ap\right]=\beta^{r}_{1,x}(x^{\prime},ap),\quad {\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=0\,|\,x^{\prime}\models ap\right]=1-\beta^{r}_{1,x}(x^{\prime},ap),$$
$${\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=0\,|\,x^{\prime}\models\neg ap\right]=\beta^{r}_{2,x}(x^{\prime},ap),\quad {\rm Pr}\left[Z^{r}_{x}(x^{\prime}\models ap)=1\,|\,x^{\prime}\models\neg ap\right]=1-\beta^{r}_{2,x}(x^{\prime},ap),$$

where $\beta^{r}_{1,x}(x^{\prime},ap),\beta^{r}_{2,x}(x^{\prime},ap)\in[0,1]$ are given parameters that characterize the precision of the sensor. For example, given that $ap$ is satisfied in $x^{\prime}$ (i.e., $ap\in L(x^{\prime})$), the probability of making the correct measurement (i.e., $Z^{r}_{x}(x^{\prime}\models ap)=1$) is $\beta^{r}_{1,x}(x^{\prime},ap)$, while the probability of making the false measurement (i.e., $Z^{r}_{x}(x^{\prime}\models ap)=0$) is $1-\beta^{r}_{1,x}(x^{\prime},ap)$. For simplicity, it is assumed that the probabilities of making the correct measurements are the same, i.e., $\beta^{r}_{1,x}(x^{\prime},ap)=\beta^{r}_{2,x}(x^{\prime},ap)=\beta^{r}_{x}(x^{\prime},ap)$, $\forall x,x^{\prime}\in X$ and $ap\in AP_{r}$. Regarding $\beta^{r}_{x}$, we assume that it is characterized by a fourth-order polynomial function of $\|x-x^{\prime}\|$ as follows (wang2009, ; hussein2007, ):

(3) $\beta^{r}_{x}(x^{\prime},ap)=\dfrac{M^{r}_{ap}}{(R^{r}_{ap})^{4}}\left(\|x-x^{\prime}\|^{2}-(R^{r}_{ap})^{2}\right)^{2}+0.5,\quad {\rm if}\ \|x-x^{\prime}\|\leq R^{r}_{ap},$
(4) $\beta^{r}_{x}(x^{\prime},ap)=0.5,\quad {\rm if}\ \|x-x^{\prime}\|>R^{r}_{ap},$

for a given $M^{r}_{ap}\in(0,0.5]$. Equations (3) and (4) imply that the reliability of the sensor decreases (and eventually becomes $0.5$) as the distance between $x$ and $x^{\prime}$ becomes larger.
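For concreteness, a small sketch of (3) and (4) in Python follows; the function name and argument layout are our own, and positions are taken as 2-D tuples.

import math

def beta_r(x, x_prime, R_ap, M_ap):
    """Sensor precision beta^r_x(x', ap) from Eqs. (3)-(4).

    R_ap: sensor range R^r_ap; M_ap in (0, 0.5]: precision margin, so
    the precision is M_ap + 0.5 at distance 0 and 0.5 at the range.
    """
    d2 = (x[0] - x_prime[0]) ** 2 + (x[1] - x_prime[1]) ** 2
    if math.sqrt(d2) > R_ap:
        return 0.5  # beyond the range the observation is uninformative
    return (M_ap / R_ap ** 4) * (d2 - R_ap ** 2) ** 2 + 0.5

assert abs(beta_r((0, 0), (0, 0), 2.0, 0.4) - 0.9) < 1e-12  # at d = 0
assert abs(beta_r((0, 0), (2, 0), 2.0, 0.4) - 0.5) < 1e-12  # at d = R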

Similarly, let $AP_{c}\subseteq AP$ be the set of atomic propositions, or properties, that can be observed by the copter's sensors. For simplicity, it is assumed that $AP_{r}\cup AP_{c}=AP$. Moreover, we denote by $R^{c}_{ap}\in\mathbb{R}_{>0}$, for each $ap\in AP_{c}$, a given sensor range for $ap$. Suppose that the copter's position is $x\in X$, and we would like the copter to know whether an atomic proposition $ap\in AP_{c}$ is satisfied at the position $x^{\prime}\in X$ with $\|x^{\prime}-x\|\leq R^{c}_{ap}$. The corresponding observation is denoted by $Z^{c}_{x}(x^{\prime}\models ap)\in\{0,1\}$. Moreover, denote by $\beta^{c}_{x}(x^{\prime},ap)\in[0,1]$ a given parameter that characterizes the precision of the sensor for a given $M^{c}_{ap}\in(0,0.5]$, as in (3) and (4).

Despite its simplicity, the Bernoulli sensor model is commonly used in the UAV community for the following reasons: (i) it is able to capture erroneous observations; (ii) it is able to capture the limited sensor range; (iii) in contrast to more sophisticated sensor models, such as those involving probability density functions, the Bayesian update can be computed without any integrals or approximations. In particular, the third feature is well-suited for the copter's exploration, as the computational power of the CPU and the battery capacity are often limited, and it is desirable to make the belief update as computationally light as possible.

3.3. Mission specification and problem formulation

The mission specification that the rover should satisfy is expressed by an scLTL formula, denoted by $\phi$, over the set of atomic propositions $AP$. The satisfaction relation of the scLTL formula $\phi$ is given over the word generated by the trajectory of the rover. That is, given a policy sequence $\mu_{r,seq}=\mu_{r,0}\mu_{r,1}\mu_{r,2}\ldots$, we say that the trajectory ${\bf x}_{\mu_{r,seq}}=x(0)x(1)\ldots\in X^{\omega}$ satisfies $\phi$, denoted by ${\bf x}_{\mu_{r,seq}}\models\phi$, iff the corresponding word satisfies $\phi$, i.e., ${\bf w}=L(x(0))L(x(1))\ldots\models\phi$. Since the rover aims at achieving the satisfaction of $\phi$, we would like to derive an optimal policy such that the probability of satisfying $\phi$, i.e., ${\rm Pr}[{\bf x}_{\mu_{r,seq}}\models\phi]$, is maximized. However, since the labeling function $L$ is unknown, the value of ${\rm Pr}[{\bf x}_{\mu_{r,seq}}\models\phi]$ is also unknown (i.e., we do not have direct access to this probability value). Hence, we will instead compute and maximize the belief that the trajectory of the rover satisfies $\phi$, which we denote by

(5) $\mathcal{B}({\bf x}_{\mu_{r,seq}}\models\phi)\in[0,1].$

That is, $\mathcal{B}({\bf x}_{\mu_{r,seq}}\models\phi)$ indicates the posterior probability that the trajectory of the rover satisfies $\phi$ given the past observations (sensor measurements) provided by the rover and the copter. As we will see later, (5) is computed and maximized based on the environmental beliefs of the atomic propositions in (2). Since the environmental beliefs of the atomic propositions are updated based on the sensor measurements, the optimal policy that maximizes (5) is also updated accordingly. Moreover, since we would like to reduce the environmental uncertainties as much as possible (i.e., we would like to make $\mathcal{B}(x\models ap)$ converge to $1$ or $0$ for all $x\in X$, $ap\in AP$), it is also necessary to explore the state space $X$ so as to collect the sensor measurements and update the environmental beliefs of the atomic propositions. In this paper, the copter has the main role of exploring the uncertain environment, since, as previously mentioned in Section 1, it is able to move more quickly and freely than the rover. Therefore, we need to synthesize not only an optimal policy for the rover such that (5) is maximized, so as to increase the possibility of satisfying $\phi$, but also an exploration policy for the copter so as to update the environmental beliefs of the atomic propositions and effectively reduce the environmental uncertainties.

Problem 1.

Consider the MDP motion models of the rover $\mathcal{M}_{r}$ and of the copter $\mathcal{M}_{c}$, the Bernoulli sensor models described in Section 3.2.2, and the mission specification expressed by the scLTL formula $\phi$. Synthesize for the copter-rover team a policy that increases the possibility of achieving the satisfaction of $\phi$. Specifically, synthesize an optimal policy for the rover such that (5) is maximized, and an exploration policy for the copter so as to update the environmental beliefs of the atomic propositions in (2). $\Box$

4. Approach

In this section, we provide a solution approach to Problem 1. In Section 4.1, we give an overview of the approach. Then, we provide the algorithms of the exploration phase and the mission execution phase in Sections 4.2 and 4.3, respectively.

4.1. Overview of the approach

Following (nilsson2018, ), we consider a sequential approach to solve Problem 1. The overview of the approach is shown in Algorithm 1.

1  $k\leftarrow 0$ (initialize the time); $x_{c}\leftarrow x_{c_{0}}$ (initialize the position of the copter); $x_{r}\leftarrow x_{r_{0}}$ (initialize the position of the rover);
2  Using prior knowledge, initialize $\mathcal{B}(x\models ap)\in(0,1)$ for all $x\in X$ and $ap\in AP$;
3  The rover computes the optimal policy such that (5) is maximized, and computes the mapping $b_{\max}:X\rightarrow[0,1]$ (for details, see Section 4.3);
4  Repeat the following two phases:

  1. Exploration phase (see Section 4.2): The copter explores the state space for a given time period $T_{c}$ and updates the environmental beliefs of the atomic propositions $\mathcal{B}$:

     (6) $\{x_{c},\mathcal{B}\}\leftarrow\mathit{Exploration}(x_{c},\mathcal{B},b_{\max},T_{c}).$

     Set $k\leftarrow k+T_{c}$; the copter transmits $\mathcal{B}$ to the rover;

  2. Mission execution phase (see Section 4.3): The rover computes the optimal policy such that (5) is maximized and executes it for a given time period $T_{r}$. Moreover, the rover updates the environmental beliefs of the atomic propositions $\mathcal{B}$ as well as the mapping $b_{\max}$, i.e.,

     (7) $\{x_{r},b_{\max},\mathcal{B}\}\leftarrow\mathit{MissionExecution}(x_{r},\mathcal{B},T_{r}).$

     Set $k\leftarrow k+T_{r}$; the rover transmits $\mathcal{B}$ and $b_{\max}$ to the copter;

Algorithm 1 Overview of the main algorithm.

With a slight abuse of notation, we denote by $\mathcal{B}$ in the algorithm the set of all environmental beliefs of the atomic propositions, i.e., $\mathcal{B}(x\models ap)$, $x\in X$, $ap\in AP$. The approach mainly consists of two phases: the exploration phase and the mission execution phase. During the exploration phase, the copter explores the state space $X$ so as to update $\mathcal{B}$ for a given time period $T_{c}\in\mathbb{N}_{>0}$. In (6) (as well as (7)), $b_{\max}:X\rightarrow[0,1]$ denotes a mapping from each state onto the corresponding maximum belief that the rover will reach that state within the time period $T_{r}$ according to the current optimal policy; for the detailed definition, see Section 4.3.3. As we will see later, the exploration policy is given by making use of the mapping $b_{\max}$ and the entropy derived from the current environmental beliefs of the atomic propositions $\mathcal{B}$. Once the exploration is done, the copter transmits the updated environmental beliefs of the atomic propositions $\mathcal{B}$ to the rover and moves on to the mission execution phase. During the mission execution phase, the rover computes the optimal policy such that (5) is maximized, and executes the policy for a given time period $T_{r}\in\mathbb{N}_{>0}$. Moreover, during the execution, the rover provides sensor measurements to update $\mathcal{B}$. Once the execution is done, the rover transmits the updated $\mathcal{B}$ and $b_{\max}$ to the copter and moves back to the exploration phase. This sequential approach is motivated by the fact that, before allowing the rover to execute the optimal policy, we can let the copter explore regions around the rover's (future) path in advance. For example, the copter checks that no obstacles are present along the path that the rover intends to follow in the future. Then, if the copter finds some obstacles in the path, the rover can re-design the path to avoid the obstacles and try to find another way to complete the mission. Such a scheme may seem overly cautious, but it may be necessary especially for safety-critical systems, such as exploration on Mars.

As detailed below, the algorithms for both the exploration and the mission execution phases differ significantly from (nilsson2018, ) in the following three aspects. First, the environmental beliefs of the atomic propositions are updated based on the Bayes rule using the past sensor measurements. This allows us to provide a novel exploration scheme, in which the copter actively explores the environment by evaluating both the level of uncertainty and the relevance to the mission completion (see Section 4.2). Second, we propose a novel framework to synthesize the optimal policy for the rover. In particular, we solve a value iteration over a product MDP whose state space does not involve the set of states for the environmental beliefs. This leads to a reduction of the time complexity of solving the value iteration algorithm (for details, see Section 4.3). Finally, we provide a theoretical convergence analysis of the proposed algorithm, where it is shown that the environmental beliefs of the atomic propositions converge to the appropriate values (for details, see Section 5).

4.2. Exploration phase

In this subsection, we propose an algorithm for how the copter explores the environment so as to effectively update the environmental beliefs of the atomic propositions.

4.2.1. Bayesian belief update

Using the sensor model described in Section 3.2.2, the copter updates the environmental beliefs of the atomic propositions based on the Bayes filter (wang2009, ). Suppose that the copter is at the position $x\in X$ and, for some $x^{\prime}\in X$ and $ap\in AP_{c}$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$, it obtains the corresponding observation $Z^{c}_{x}(x^{\prime}\models ap)=z\in\{0,1\}$. Then, using this observation, the belief that $ap$ is satisfied in $x^{\prime}$, i.e., $\mathcal{B}(x^{\prime}\models ap)$, is updated by applying the Bayes rule as follows:

(8) $\mathcal{B}(x^{\prime}\models ap)\longleftarrow\cfrac{{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models ap]\,\mathcal{B}(x^{\prime}\models ap)}{{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z]}$

where ${\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models ap]={\beta^{c}_{x}(x^{\prime},ap)}^{z}(1-{\beta^{c}_{x}(x^{\prime},ap)})^{1-z}$, and ${\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z]$ is computed as

$$\begin{aligned}
{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z] &={\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models ap]\,\mathcal{B}(x^{\prime}\models ap)\\
&\quad+{\rm Pr}[Z^{c}_{x}(x^{\prime}\models ap)=z\,|\,x^{\prime}\models\neg ap]\,(1-\mathcal{B}(x^{\prime}\models ap))\\
&={\beta^{c}_{x}(x^{\prime},ap)}^{z}(1-{\beta^{c}_{x}(x^{\prime},ap)})^{1-z}\,\mathcal{B}(x^{\prime}\models ap)\\
&\quad+{\beta^{c}_{x}(x^{\prime},ap)}^{1-z}(1-{\beta^{c}_{x}(x^{\prime},ap)})^{z}\,(1-\mathcal{B}(x^{\prime}\models ap)).
\end{aligned}$$

As will become clearer in the overall exploration algorithm given below, when the copter is placed at $x\in X$, it obtains the sensor measurements for all its neighbors, i.e., all $x^{\prime}\in X$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$, and all atomic propositions $ap\in AP_{c}$, and updates the corresponding environmental beliefs of the atomic propositions according to (8).
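A minimal sketch of the update (8) for a single observation is given below; the function signature is illustrative, with beta standing for $\beta^{c}_{x}(x^{\prime},ap)$.

def bayes_update(belief, z, beta):
    """One Bernoulli-sensor Bayes update of B(x' |= ap), Eq. (8).

    belief: prior B(x' |= ap); z in {0, 1}: the observation
    Z^c_x(x' |= ap); beta: the sensor precision beta^c_x(x', ap).
    """
    # likelihoods Pr[Z = z | x' |= ap] and Pr[Z = z | x' |= not ap]
    lik_ap = beta ** z * (1 - beta) ** (1 - z)
    lik_not = beta ** (1 - z) * (1 - beta) ** z
    # evidence Pr[Z = z] marginalizes over whether ap holds at x'
    evidence = lik_ap * belief + lik_not * (1 - belief)
    return lik_ap * belief / evidence

# repeated positive observations from a reliable sensor drive the
# belief from the uninformative prior 0.5 toward 1
b = 0.5
for _ in range(3):
    b = bayes_update(b, z=1, beta=0.8)
print(round(b, 3))  # 0.8 -> 0.941 -> 0.985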

4.2.2. Acquisition function for exploration

Let us now define an acquisition function to be evaluated for synthesizing the exploration strategy of the copter. First, we define the entropy $H:[0,1]\rightarrow[0,1]$ as follows (cover2006, ):

(9) $H(b)=-b\log b-(1-b)\log(1-b)$

for $b\in[0,1]$, where $\log$ is to the base $2$. In essence, $H(\mathcal{B}(x\models ap))$ for $x\in X$, $ap\in AP$ represents the level of uncertainty about whether $ap$ is satisfied in $x$; it takes the largest value if $\mathcal{B}(x\models ap)=0.5$ and the lowest value if $\mathcal{B}(x\models ap)=0$ or $1$. Hence, by actively exploring the states where the entropy is large and updating the corresponding beliefs according to (8), it is expected that the environmental uncertainties can be effectively reduced.

However, if the copter explored the environment only by evaluating the above entropy, it might explore states that are completely irrelevant to the rover's mission completion. In other words, since the copter knows the path that the rover intends to follow according to the current optimal policy, it is preferable that the copter investigate areas around this path (before the rover executes it) so as to update the environmental beliefs of the atomic propositions. For example, the copter checks that no obstacles are present along the path that the rover intends to follow in the environment, so that the rover will be able to complete the mission while avoiding any obstacles. In order to incorporate the rover's path into the exploration, recall that the current position of the rover is $x_{r}$ and we have the mapping $b_{\max}:X\rightarrow[0,1]$ (see Algorithm 1). As previously described in Section 4.1, $b_{\max}(x)$ for each $x\in X$ indicates the belief that the rover will reach $x$ from $x_{r}$ within the time period $T_{r}$ according to the rover's current optimal policy (for the detailed definition and the calculation of $b_{\max}$, see the mission execution phase in Section 4.3.3). Hence, for a large value of $b_{\max}(x)$ (i.e., $b_{\max}(x)\approx 1$), we have a high belief that the rover will reach $x$ within the time period $T_{r}$. Combining the entropy (9) and $b_{\max}$, we define the acquisition function $W:X\rightarrow\mathbb{R}_{>0}$ as follows:

(10) $W(x)=\sum_{ap\in AP_{c}}H(\mathcal{B}(x\models ap))+\alpha\, b_{\max}(x),$

for all $x\in X$, where $\alpha\in\mathbb{R}_{>0}$ is the weight associated with $b_{\max}(x)$.
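The entropy (9) and the acquisition (10) translate directly into code; a sketch follows, where the beliefs dictionary keyed by (state, proposition) pairs and the b_max dictionary are assumed containers of ours.

import math

def entropy(b):
    """Binary entropy H(b) of Eq. (9) in bits; H(0) = H(1) = 0."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)

def acquisition(x, beliefs, b_max, alpha, AP_c):
    """Acquisition W(x) of Eq. (10): total entropy of the beliefs at x
    plus alpha times the rover's reachability belief b_max(x)."""
    return sum(entropy(beliefs[(x, ap)]) for ap in AP_c) + alpha * b_max[x]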

4.2.3. Exploration algorithm

We now propose an exploration algorithm. In the following, we provide two different exploration strategies so as to take both the efficiency of computation and the coverage of exploration into account.

Input: $x$ (current copter position); $\mathcal{B}$ (current environmental beliefs of the atomic propositions); $b_{\max}$ (mapping indicating the reachability belief of the rover); $T_{c}$ (time period for exploration);
Output: $x$ (updated current position); $\mathcal{B}$ (updated environmental beliefs of the atomic propositions);
1  for $\ell=0:T_{c}-1$ do
2      Compute $H$ and $W$ according to (9) and (10), respectively;
3      From (11), compute the control input $u^{*}_{c}=\mu^{*}_{c_{1}}(x)$;
4      Apply $u^{*}_{c}$ and sample the next state $x_{next}\sim p_{c}(\cdot|x,u^{*}_{c})$;
5      $x\leftarrow x_{next}$;
6      for each $(x^{\prime},ap)\in X\times AP_{c}$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$ do
7          Obtain the corresponding observation $Z^{c}_{x}(x^{\prime}\models ap)=z$;
8          Update $\mathcal{B}(x^{\prime}\models ap)$ according to (8);
9      end for
10 end for
return $x$, $\mathcal{B}$;
Algorithm 2 $\mathit{Exploration}(x,\mathcal{B},b_{\max},T_{c})$ (local selection-based exploration)

(Local selection-based policy): The first exploration strategy is the local selection-based policy, in which the copter executes a one-step greedy exploration:

(11) $\mu^{*}_{c_{1}}(x)\in\arg\max_{u\in U_{c}}\ \mathbb{E}_{x^{\prime}\sim p_{c}(\cdot|x,u)}[W(x^{\prime})\,|\,x,u]$

for all $x\in X$. Equation (11) implies that the copter greedily selects a control input such that the corresponding next state provides the highest acquisition. The overall exploration algorithm based on the local selection-based policy is summarized in Algorithm 2. As shown in the algorithm, at each step $\ell$, the copter computes the acquisition function and a control input $u^{*}_{c}\in U_{c}$ according to Section 4.2.2 (lines 2-3). Once $u^{*}_{c}$ is obtained, the copter applies it and samples the next state $x_{next}$ (lines 4-5). Given the new current position, the copter makes new sensor measurements for all its neighbors and the atomic propositions in $AP_{c}$, and updates the corresponding environmental beliefs of the atomic propositions (lines 6-9). The above procedure is iterated for the copter's time period $T_{c}$. Note that, in order to enhance the exploration, the acquisition function as well as the policy computed in (11) are updated each time the copter obtains new observations.
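A sketch of the greedy selection (11) is shown below, under the same assumed dictionary containers as in the earlier sketches (p_c[(x, u)] mapping next states to probabilities, W mapping states to acquisition values).

def greedy_input(x, U_c, p_c, W):
    """One-step greedy exploration input of Eq. (11): the input whose
    successor distribution has the highest expected acquisition."""
    def expected_W(u):
        return sum(prob * W[x_next] for x_next, prob in p_c[(x, u)].items())
    return max(U_c, key=expected_W)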

The local selection-based policy is computationally efficient, since the optimal control input (line 3 in Algorithm 2) can be obtained by evaluating the acquisitions only for the next states. A disadvantage of this approach, however, is that it might not guarantee an effective exploration of the whole state space $X$, since the copter evaluates the acquisitions only locally. Another exploration strategy is therefore to select the optimal state to be visited by evaluating the acquisitions for all states in $X$ (instead of only the next states), and then collect the sensor measurements to update the corresponding environmental beliefs. This leads to a global selection-based approach, whose details are given below.

(Global selection-based policy): In the global selection-based approach, the copter first selects the optimal state that provides the highest acquisition over the whole state space $X$, i.e.,

(12) $x^{*}=\arg\max_{x^{\prime}\in X}\ W(x^{\prime}).$

Then, the copter computes the optimal policy $\mu^{*}_{c_{2}}:X\rightarrow U_{c}$ such that the probability of reaching $x^{*}$ is maximized, i.e.,

(13) $\mu^{*}_{c_{2}}\in\arg\max_{\mu_{c}}\ {\rm Pr}[{\bf x}_{c,\mu_{c}}\models\Diamond x^{*}],$

where ${\bf x}_{c,\mu_{c}}\in X^{\omega}$ denotes the state trajectory of the copter obtained by applying the policy $\mu_{c}$, and $\Diamond x^{*}$ indicates the property that the state trajectory reaches $x^{*}$ in finite time (corresponding to the "eventually" operator), i.e., ${\bf x}_{c,\mu_{c}}={x}^{0}_{c,\mu_{c}}{x}^{1}_{c,\mu_{c}}{x}^{2}_{c,\mu_{c}}\cdots\models\Diamond x^{*}$ iff there exists $k\in\mathbb{N}$ such that ${x}^{k}_{c,\mu_{c}}=x^{*}$. Problem (13) can indeed be solved via a value iteration algorithm; for details, see Section 4.3.2. Then, the copter moves to $x^{*}$ from the current state by applying $\mu^{*}_{c_{2}}$, so as to collect the corresponding sensor measurements and update the environmental beliefs of the atomic propositions. Once $x^{*}$ is reached, the copter re-computes a new $x^{*}$ based on the updated environmental beliefs, and iterates the same procedure for the time period $T_{c}$. The overall exploration algorithm based on the global selection-based policy is summarized in Algorithm 3. As shown in the algorithm, the copter first finds the optimal state $x^{*}$ that provides the highest acquisition over the whole state space $X$, and computes the optimal policy to reach $x^{*}$ (lines 4-5). The copter applies the optimal policy until it reaches $x^{*}$ and collects the corresponding sensor measurements (lines 6-13). Note that the copter collects not only the sensor measurements for $x^{*}$, but also those for the states traversed while moving toward $x^{*}$, so as to enhance the efficiency of exploration. The above procedure is iterated for the time period $T_{c}$. The variable $n_{succ}$ in Algorithm 3 counts the number of times the copter successfully reaches $x^{*}$.

The global selection-based approach requires heavier computation than the local selection-based approach, since it needs to find the optimal state by evaluating the acquisitions over the whole state space $X$, as well as to compute the optimal policy to reach $x^{*}$ via a value iteration. However, the advantage of employing this approach is that we can guarantee the coverage of exploration, i.e., by repeating Algorithm 3, the environmental beliefs of the atomic propositions converge to the appropriate values; for details, see Section 5.
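A sketch of the target selection (12) follows; the reachability policy (13) itself can then be obtained, for instance, by running the value iteration of Section 4.3.2 with the singleton target set $\{x^{*}\}$ in place of $S_{f}$ (the helper name is ours).

def best_state(X, W):
    """Global target selection of Eq. (12): the state with the highest
    acquisition over the whole state space X (ties broken arbitrarily)."""
    return max(X, key=lambda x: W[x])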

1  Input and Output are the same as in Algorithm 2;
2  $\ell\leftarrow 0$, $n_{succ}\leftarrow 0$;
3  while $\ell<T_{c}-1$ do
4      Compute $H$ and $W$ according to (9) and (10), respectively;
5      Compute $x^{*}\in X$ and $\mu^{*}_{c_{2}}:X\rightarrow U_{c}$ according to (12) and (13), respectively;
6      while $x\neq x^{*}$ and $\ell<T_{c}-1$ do
7          Apply $u^{*}_{c}=\mu^{*}_{c_{2}}(x)$ and sample the next state $x_{next}\sim p_{c}(\cdot|x,u^{*}_{c})$;
8          $x\leftarrow x_{next}$, $\ell\leftarrow\ell+1$;
9          for each $(x^{\prime},ap)\in X\times AP_{c}$ with $\|x-x^{\prime}\|\leq R^{c}_{ap}$ do
10             Obtain the corresponding observation $Z^{c}_{x}(x^{\prime}\models ap)=z$;
11             Update $\mathcal{B}(x^{\prime}\models ap)$ according to (8);
12         end for
13     end while
14     if $x=x^{*}$ then
15         $n_{succ}\leftarrow n_{succ}+1$;
16     end if
17 end while
return $x$, $\mathcal{B}$;
Algorithm 3 $\mathit{Exploration}(x,\mathcal{B},b_{\max},T_{c})$ (global selection-based exploration)

4.3. Mission execution phase

In this subsection, we propose a detailed algorithm for the rover's mission execution phase.

4.3.1. Product belief MDP

Given the environmental beliefs of the atomic propositions in (2), let $\mathcal{B}(x\models\sigma)$ for $x\in X$, $\sigma\in 2^{AP}$ be the joint belief that all atomic propositions in $\sigma$ are satisfied in $x$, i.e., $\mathcal{B}(x\models\sigma)=\prod_{ap\in\sigma}\mathcal{B}(x\models ap)$. Moreover, let $\mathcal{B}(x\models\sigma\wedge\neg(AP\backslash\sigma))$ be the joint belief that all atomic propositions in $\sigma$ are satisfied in $x$ and all atomic propositions in $AP$ other than $\sigma$ (i.e., $AP\backslash\sigma$) are not satisfied in $x$, i.e.,

(14) $\mathcal{B}(x\models\sigma\wedge\neg(AP\backslash\sigma))=\prod_{ap\in\sigma}\mathcal{B}(x\models ap)\prod_{ap\in AP\backslash\sigma}(1-\mathcal{B}(x\models ap)).$

For simplicity of presentation, let $\mathcal{B}_{alph}(x\models\sigma)=\mathcal{B}(x\models\sigma\wedge\neg(AP\backslash\sigma))$. Moreover, let $\mathcal{A}_{\phi}=(Q,2^{AP},\delta,q_{0},Q_{f})$ be the FSA corresponding to $\phi$, and, for each $(q,q^{\prime})\in Q\times Q$, denote by $en(q,q^{\prime})\subseteq 2^{AP}$ the subset of input alphabets for which the transition from $q$ to $q^{\prime}$ is allowed: $en(q,q^{\prime})=\{\sigma\in 2^{AP}\ |\ q^{\prime}=\delta(q,\sigma)\}$. In addition, given $x\in X$ and $q,q^{\prime}\in Q$, we let

(15) $\mathcal{B}_{en}(x\models en(q,q^{\prime}))=\sum_{\sigma\in en(q,q^{\prime})}\mathcal{B}_{alph}(x\models\sigma).$

That is, $\mathcal{B}_{en}(x\models en(q,q^{\prime}))$ represents the belief that $q$ makes the transition to $q^{\prime}$ given the atomic propositions that are satisfied in $x$. Note that $\sum_{q^{\prime}\in Post(q)}\mathcal{B}_{en}(x\models en(q,q^{\prime}))=1$, since the collection of all events (sets of atomic propositions) corresponding to the outgoing transitions from $q$ covers all possible events that can occur, i.e., $2^{AP}$. For a clarification, see the example below.

Figure 1. FSA that accepts all good prefixes for the scLTL formula $\phi=\Diamond a$. The marked node represents the accepting state.

(Example): Consider an environment with two states $X=\{x_{1},x_{2}\}$, let $AP=\{a,b\}$, and assume that the environmental beliefs of the atomic propositions are given by

(16) $\mathcal{B}(x_{1}\models a)=0.1,\ \mathcal{B}(x_{1}\models b)=0.1,\ \mathcal{B}(x_{2}\models a)=0.9,\ \mathcal{B}(x_{2}\models b)=0.2.$

Moreover, the scLTL formula is given by $\phi=\Diamond a$. The corresponding FSA $\mathcal{A}_{\phi}$ that accepts all good prefixes for $\phi$ is shown in Fig. 1. For example, since $q_{0}=\delta(q_{0},\varnothing)$ and $q_{0}=\delta(q_{0},\{b\})$, we have $en(q_{0},q_{0})=\{\varnothing,\{b\}\}$. Moreover, we have

(17) $\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{0}))=\mathcal{B}_{alph}(x_{1}\models\varnothing)+\mathcal{B}_{alph}(x_{1}\models\{b\})=0.9\times 0.9+0.9\times 0.1=0.9,$

which implies that, if the position of the rover is $x_{1}$, we have a high belief that $q_{0}$ takes the self-loop, i.e., the belief of reaching the accepting state $q_{1}$ is low. This is due to the fact that we have a low belief that $a$ is satisfied in $x_{1}$. Moreover, we have $\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{1}))=\mathcal{B}_{alph}(x_{1}\models\{a\})+\mathcal{B}_{alph}(x_{1}\models\{a,b\})=0.09+0.01=0.1$. Note that $\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{1}))+\mathcal{B}_{en}(x_{1}\models en(q_{0},q_{0}))=1$, satisfying the probabilistic nature. This is due to the fact that the only outgoing transitions from $q_{0}$ are to $q_{0}$ and $q_{1}$, and the collection of all the corresponding events (sets of atomic propositions) is $\{\varnothing,\{a\},\{b\},\{a,b\}\}$ ($=2^{AP}$), which is indeed the set of all events that can occur. We also have $\mathcal{B}_{en}(x_{2}\models en(q_{0},q_{1}))=\mathcal{B}_{alph}(x_{2}\models\{a\})+\mathcal{B}_{alph}(x_{2}\models\{a,b\})=0.72+0.18=0.9$, implying a high belief of reaching the accepting state, which is due to the fact that we have a high belief that $a$ is satisfied in $x_{2}$. $\Box$
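The quantities in this example can be reproduced with a few lines of code; the following sketch evaluates (14) and (15) for the beliefs in (16), with our own dictionary encoding of the beliefs.

AP = ['a', 'b']
beliefs = {('x1', 'a'): 0.1, ('x1', 'b'): 0.1,   # Eq. (16)
           ('x2', 'a'): 0.9, ('x2', 'b'): 0.2}

def b_alph(x, sigma):
    """B_alph(x |= sigma), Eq. (14): the propositions in sigma hold at x
    and every remaining proposition in AP does not."""
    p = 1.0
    for ap in AP:
        b = beliefs[(x, ap)]
        p *= b if ap in sigma else (1 - b)
    return p

def b_en(x, en_set):
    """B_en(x |= en(q, q')), Eq. (15): sum over the enabling alphabets."""
    return sum(b_alph(x, sigma) for sigma in en_set)

# en(q0, q0) = {{}, {b}} and en(q0, q1) = {{a}, {a, b}} for phi = <>a
print(b_en('x1', [set(), {'b'}]))       # 0.81 + 0.09 = 0.9, as in (17)
print(b_en('x1', [{'a'}, {'a', 'b'}]))  # 0.09 + 0.01 = 0.1
print(b_en('x2', [{'a'}, {'a', 'b'}]))  # 0.72 + 0.18 = 0.9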

Based on the above, we define the product belief MDP as a composition of the rover's motion model $\mathcal{M}_{r}$ and the FSA $\mathcal{A}_{\phi}$ as follows:

Definition 1.

Let $\mathcal{M}_{r}=(X,x_{r},U_{r},p_{r})$ and $\mathcal{A}_{\phi}=(Q,2^{AP},\delta,q_{0},Q_{f})$ be the MDP motion model of the rover and the FSA corresponding to $\phi$, respectively. Moreover, given the environmental beliefs of the atomic propositions (2), let $\mathcal{B}_{en}(x\models en(q,q^{\prime}))$ for $x\in X$, $q,q^{\prime}\in Q$ be given by (15). Then, the product belief MDP between $\mathcal{M}_{r}$ and $\mathcal{A}_{\phi}$ is defined as a tuple $\mathcal{M}_{S}=(S,s_{0},U_{S},p_{S},S_{f})$, where

  • $S=X\times Q$ is the set of states;

  • $s_{0}=(x_{r},q_{0})\in S$ is the initial state;

  • $U_{S}=U_{r}$ is the set of control inputs;

  • $p_{S}:S\times U_{S}\rightarrow\mathcal{D}(S)$ is the transition belief function, defined as

    (18) $p_{S}((x^{\prime},q^{\prime})|(x,q),u)=p_{r}(x^{\prime}|x,u)\,\mathcal{B}_{en}(x\models en(q,q^{\prime}));$

  • $S_{f}=X\times Q_{f}\subseteq S$ is the set of accepting states. $\Box$

As shown in (18), the transition function is called a belief function instead of a probability function, since it is computed based on the environmental beliefs of the atomic propositions (i.e., the posteriors given the past observations). As previously mentioned, $\mathcal{B}_{en}(x\models en(q,q^{\prime}))$ represents the belief that $q$ makes the transition to $q^{\prime}$ according to the atomic propositions that are satisfied in $x$. Hence, $p_{S}((x^{\prime},q^{\prime})|(x,q),u)$ indicates the joint belief that the pair $(x,q)$ makes the transition to $(x^{\prime},q^{\prime})$ by applying $u$.

As shown in Definition 1, the set of states of the product MDP involves only $X$ and $Q$. This reduces the time complexity of synthesizing the optimal policy for the rover compared with the previous work; for details, see Section 4.3.2 (in particular, Remark 1).

4.3.2. Value iteration

Given $\mathcal{M}_{S}$, we denote by $\mu_{S}:S\rightarrow U_{S}(=U_{r})$ a policy for $\mathcal{M}_{S}$, which assigns a control input from $U_{S}$ to each state in $S$. Then, let $\mu_{S,seq}=\mu_{S}\mu_{S}\mu_{S}\ldots$ be the corresponding stationary policy sequence. We denote by ${\bf s}_{\mu_{S,seq}}=s(0)s(1)\ldots\in S^{\omega}$ with $s(\ell)=(x(\ell),q(\ell))$, $\forall\ell\in\mathbb{N}_{\geq 0}$, the trajectory of the product belief MDP $\mathcal{M}_{S}$ such that $s(0)=s_{0}$ (i.e., $x(0)=x_{r}$, $q(0)=q_{0}$) and $s(\ell+1)\sim p_{S}(\cdot|s(\ell),\mu_{S}(s(\ell)))$ for all $\ell\in\mathbb{N}_{\geq 0}$. Given ${\bf s}_{\mu_{S,seq}}=s(0)s(1)\ldots$ with $s(\ell)=(x(\ell),q(\ell))$, $\forall\ell\in\mathbb{N}_{\geq 0}$, we can induce the corresponding trajectory of $\mathcal{A}_{\phi}$ as $q(0)q(1)\ldots\in Q^{\omega}$. If the trajectory of $\mathcal{M}_{S}$ reaches an accepting state in $S_{f}$ in finite time, the corresponding trajectory of $\mathcal{A}_{\phi}$ reaches an accepting state in $Q_{f}$ in finite time (i.e., it satisfies $\phi$). Hence, the problem of maximizing the belief for the satisfaction of $\phi$ defined by (5) can be reduced to the problem of maximizing the belief that the trajectory of $\mathcal{M}_{S}$ reaches $S_{f}$ in finite time, i.e.,

(19) $\mu^{*}_{S}=\arg\max_{\mu_{S}}\ \mathcal{B}\left({\bf s}_{\mu_{S,seq}}\models\Diamond S_{f}\right).$

The problem (19) can indeed be solved via value iteration as follows (see, e.g., (abate2008, ; nilsson2018, )). Let $\mathsf{1}_{S_{f}}:S\rightarrow\{0,1\}$ be the indicator function given by $\mathsf{1}_{S_{f}}(s)=1$ if $s\in S_{f}$ and $0$ otherwise. Then, set $V^{0}(s)=\mathsf{1}_{S_{f}}(s)$, $\forall s\in S$, and for all $s\in S$, $\ell\in\mathbb{N}_{\geq 0}$,

(20) $V^{\ell+1}(s)=\max_{u\in U_{S}}\max\left(\mathsf{1}_{S_{f}}(s),\ \mathbb{E}_{s'\sim p_{S}(\cdot|s,u)}[V^{\ell}(s')|s,u]\right),$
(21) $\mu^{\ell+1}_{S}(s)=\arg\max_{u\in U_{S}}\max\left(\mathsf{1}_{S_{f}}(s),\ \mathbb{E}_{s'\sim p_{S}(\cdot|s,u)}[V^{\ell}(s')|s,u]\right).$

The above computations are iterated until they reach a fixed point, i.e., $\mu^{*}_{S}=\mu^{\ell'}_{S}=\mu^{\ell'+1}_{S}$ for some $\ell'$. Alternatively, one may iterate the above only for a given finite number of time steps $\overline{T}\in\mathbb{N}_{>0}$ with $\overline{T}\geq T_{r}$, i.e., iterate (20) and (21) for all $\ell\in\mathbb{N}_{0:\overline{T}-1}$ and set $\mu^{*}_{S}=\mu^{\overline{T}}_{S}$. This yields the optimal policy that maximizes the belief that the trajectory of $\mathcal{M}_{S}$ reaches $S_{f}$ within the time interval $[0,\overline{T}]$; in other words, it maximizes the belief that the length of the good prefix of the word satisfying $\phi$ is less than $\overline{T}$. Given the optimal policy $\mu^{*}_{S}$ computed as above, we can induce the policy sequence for the rover based on the trajectory of $\mathcal{M}_{S}$, i.e.,

(22) $\mu^{*}_{r,seq}=\mu^{*}_{r,0}\mu^{*}_{r,1}\mu^{*}_{r,2}\ldots,$

where $\mu^{*}_{r,\ell}(x(\ell))=\mu^{*}_{S}(s(\ell))$, $\forall\ell\in\mathbb{N}_{\geq 0}$.
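A minimal sketch of the reachability value iteration (20), (21), written against the matrices assembled in the sketch after Definition 1; the names and the matrix layout are assumptions of this sketch, not part of the formalism.

```python
import numpy as np

def value_iteration(P, accepting, tol=1e-8, max_iter=10_000):
    # P: list of (|S|, |S|) transition belief matrices, one per input u;
    # accepting: list of indices of the accepting states S_f.
    n = P[0].shape[0]
    one_f = np.zeros(n)
    one_f[accepting] = 1.0                    # indicator 1_{S_f}
    V = one_f.copy()                          # V^0 = 1_{S_f}
    for _ in range(max_iter):
        # Q_vals[u, s] = max(1_{S_f}(s), E_{s' ~ p_S(.|s,u)}[V(s')])
        Q_vals = np.stack([np.maximum(one_f, P_u @ V) for P_u in P])
        V_new = Q_vals.max(axis=0)            # eq. (20)
        if np.max(np.abs(V_new - V)) < tol:   # fixed point reached
            break
        V = V_new
    policy = Q_vals.argmax(axis=0)            # eq. (21)
    return V, policy
```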

Remark 1.

As shown in Definition 1, the product MDP involves only $X$ and $Q$, and does not involve the states of the environmental beliefs, as formulated in (nilsson2018, ). In particular, since the environmental beliefs of the atomic propositions are assigned to every state in $X$ in our problem setup, the time complexity of the value iteration algorithm is $O(|E^{|X|}\times X\times Q|^{2}\times|U_{S}|)$ in (nilsson2018, ), with $E$ being the set of states of the environmental belief, while it is $O(|X\times Q|^{2}\times|U_{S}|)$ in our approach. That is, the time complexity of the value iteration algorithm in the previous work is exponential in $|X|$, while it is polynomial in our approach. Therefore, our approach can reduce the running time and the memory usage of synthesizing the optimal policy for the rover in contrast to the previous work. $\Box$

4.3.3. Computing the reachability belief and $b_{\max}$

Suppose that the current rover's position is $x_{r}$ and the optimal policy $\mu^{*}_{S}$ is computed as above. Then, given $x\in X$ and $\ell\in\mathbb{N}_{0:T_{r}-1}$, we can compute a belief that the rover will reach $x$ from $x_{r}$ after $\ell$ time steps according to the optimal policy $\mu^{*}_{S}$. To this end, we denote the collection of all states of $\mathcal{M}_{S}$ by $S=\{s_{1},s_{2},s_{3},\ldots,s_{m}\}$, where $m$ is the number of states of $\mathcal{M}_{S}$. Given $x\in X$, we denote by $\mathcal{J}(x)\subseteq\{1,2,\ldots,m\}$ the set of indices whose corresponding states of $\mathcal{M}_{S}$ include $x$, i.e., $\mathcal{J}(x)=\{i\in\mathbb{N}_{1:m}\ |\ s_{i}=(x,q)\ {\rm for\ some}\ q\in Q\}$. If the policy $\mu^{*}_{S}$ is employed, the belief MDP $\mathcal{M}_{S}$ can be viewed as a belief Markov chain induced by $\mu^{*}_{S}$, denoted by $\mathcal{M}^{\mu^{*}_{S}}_{S}=(S,s_{0},p^{\mu^{*}_{S}}_{S})$, where $S=\{s_{1},s_{2},s_{3},\ldots,s_{m}\}$ is the set of states, $s_{0}=(x_{r},q_{0})$ is the initial state, and $p^{\mu^{*}_{S}}_{S}$ is the transition belief function defined by $p^{\mu^{*}_{S}}_{S}(s'|s)=p_{S}(s'|s,\mu^{*}_{S}(s))$, $\forall s,s'\in S$.

Input: $x_{r}$ (current rover's position); $\mathcal{B}$ (current environmental beliefs of the atomic propositions); $T_{r}$ (time period for the mission execution);
Output: $x_{r}$ (updated current position); $b_{\max}$ (the mapping that represents the maximum reachability belief); $\mathcal{B}$ (updated environmental beliefs of the atomic propositions);
1 Solve the value iteration (20), (21) until reaching a fixed point (or iterate them for all $\ell\in\mathbb{N}_{0:\overline{T}-1}$ for a given $\overline{T}\geq T_{r}$) and obtain the optimal policy $\mu^{*}_{S}:S\rightarrow U_{S}(=U_{r})$;
2 $x\leftarrow x_{r}$,  $q\leftarrow q_{0}$;
3 for $\ell=0:T_{r}-1$ do
4   $u^{*}_{r}\leftarrow\mu^{*}_{S}(x,q)$ and sample $(x_{next},q_{next})\sim p_{S}(\cdot|(x,q),u^{*}_{r})$;
5   $x\leftarrow x_{next}$,  $q\leftarrow q_{next}$;
6   for each $(x',ap)\in X\times AP_{r}$ with $\|x-x'\|\leq R^{r}_{ap}$ do
7     Provide the corresponding observation $Z^{r}_{x}(x',ap)=z$;
8     Update the belief as: $\mathcal{B}(x'\models ap)\leftarrow\cfrac{{\rm Pr}[Z^{r}_{x}(x',ap)=z\,|\,x'\models ap]\,\mathcal{B}(x'\models ap)}{{\rm Pr}[Z^{r}_{x}(x',ap)=z]}$;
9   end for
10 end for
11 $x_{r}\leftarrow x$ and solve the value iteration (20), (21) to update the optimal policy $\mu^{*}_{S}$;
12 Compute $b_{\max}$ according to Section 4.3.3;
13 return $x_{r}$, $b_{\max}$, $\mathcal{B}$;
Algorithm 4 $\mathit{MissionExecution}(x_{r},\mathcal{B},T_{r})$ (main algorithm for mission execution)

Now, let $b_{\ell}\in[0,1]^{m}$ for all $\ell\in\mathbb{N}_{0:T_{r}-1}$ be recursively given by $b_{\ell+1}=Ab_{\ell}$, where $A\in[0,1]^{m\times m}$ is the transition matrix of the belief Markov chain $\mathcal{M}^{\mu^{*}_{S}}_{S}$, and $b^{(i)}_{0}=1$ if $s_{i}$ is the initial state (i.e., $s_{i}=s_{0}=(x_{r},q_{0})$) and $b^{(i)}_{0}=0$ otherwise. That is, $b^{(i)}_{\ell}$ represents the belief that the state $s_{i}$ is reached after $\ell$ time steps from the initial state $s_{0}=(x_{r},q_{0})$ according to the optimal policy $\mu^{*}_{S}$. Based on the above, for each $x\in X$, we can compute the belief that $x$ is reached after $\ell$ time steps, denoted by $b_{\ell}(x)$, as $b_{\ell}(x)=\sum_{i\in\mathcal{J}(x)}b^{(i)}_{\ell}$. Then, let $b_{\max}(x)$ be given by $b_{\max}(x)=\max_{\ell\in\mathbb{N}_{0:T_{r}-1}}b_{\ell}(x)$, i.e., $b_{\max}(x)$ indicates the maximum belief that the rover will reach $x$ (starting from $x_{r}$) within $T_{r}$ time steps. That is, for a large value of $b_{\max}(x)$ ($b_{\max}(x)\approx 1$), we have a high belief that the rover will reach $x$ at some point in the time interval $[0,T_{r}]$.

As previously described in Section 4.2.2, the mapping $b_{\max}$ is utilized for the copter's exploration so as to effectively search cells that are relevant to the mission execution.
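A minimal sketch of this computation, assuming the transition matrix $A$ of the induced belief Markov chain is stored so that $b_{\ell+1}=Ab_{\ell}$ (i.e., $A[i,j]$ is the belief of moving from $s_{j}$ to $s_{i}$), and that J maps each cell $x$ to the index set $\mathcal{J}(x)$; the names are illustrative.

```python
import numpy as np

def max_reachability_belief(A, s0_index, J, T_r):
    # A: (m, m) transition matrix with A[i, j] = p_S^{mu*}(s_i | s_j),
    # so that b_{l+1} = A b_l; J: dict mapping each cell x to the list of
    # indices of product states (x, q), i.e., the set J(x).
    m = A.shape[0]
    b = np.zeros(m)
    b[s0_index] = 1.0                          # b_0: unit mass on s_0
    b_max = {x: b[idx].sum() for x, idx in J.items()}
    for _ in range(1, T_r):                    # l = 1, ..., T_r - 1
        b = A @ b                              # b_{l+1} = A b_l
        for x, idx in J.items():
            b_max[x] = max(b_max[x], b[idx].sum())
    return b_max
```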

4.3.4. Overall mission execution algorithm

We now summarize the main algorithm of the mission execution phase in Algorithm 4. As shown in the algorithm, the rover computes the optimal policy by solving the value iteration and applies it for the time period $T_{r}$. Moreover, while applying this policy, it takes the sensor measurements and updates the environmental beliefs of the atomic propositions (2) (lines 6–8 of Algorithm 4). Afterwards, the rover computes the mapping $b_{\max}$ according to the procedure described in Section 4.3.3. Finally, the algorithm returns the current rover's position $x_{r}$, the mapping $b_{\max}$, and the updated environmental beliefs of the atomic propositions in (2). A minimal sketch of this loop is given below.
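In the following sketch, which mirrors the structure of Algorithm 4, the objects mdp and sensor and all of their methods are hypothetical interfaces introduced only for illustration; the Bayesian update assumes a symmetric Bernoulli sensor, consistent with the sensor models of this paper.

```python
def mission_execution(x_r, B, T_r, mdp, sensor):
    # A sketch of Algorithm 4; mdp and sensor are hypothetical interfaces.
    policy = mdp.solve_value_iteration(B)          # line 1: eqs. (20), (21)
    x, q = x_r, mdp.q0                             # line 2
    for _ in range(T_r):                           # lines 3-10
        u = policy[(x, q)]                         # line 4
        x, q = mdp.sample((x, q), u)               # lines 4-5
        for (xp, ap) in sensor.cells_in_range(x):  # lines 6-8
            z = sensor.observe(x, xp, ap)          # Bernoulli measurement
            lik = sensor.likelihood(x, xp, ap, z)  # Pr[Z = z | x' |= ap]
            # For a symmetric sensor, Pr[Z = z | x' not|= ap] = 1 - lik.
            marg = lik * B[(xp, ap)] + (1 - lik) * (1 - B[(xp, ap)])
            B[(xp, ap)] = lik * B[(xp, ap)] / marg # Bayes rule, line 8
    policy = mdp.solve_value_iteration(B)          # line 11
    b_max = mdp.max_reachability_belief(x, policy, T_r)  # line 12
    return x, b_max, B                             # line 13
```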

5. Convergence analysis

In this section, we analyze the convergence properties of the algorithm presented in the previous section. In particular, we show that, by executing Algorithm 1 with the exploration phase given by the global selection-based approach (Algorithm 3), the environmental beliefs of the atomic propositions converge to the appropriate values, i.e., for all $x\in X$ and $ap\in AP$, $\mathcal{B}(x\models ap)\rightarrow 1$ if $ap\in L(x)$, and $\mathcal{B}(x\models ap)\rightarrow 0$ if $ap\notin L(x)$. Suppose that Algorithm 1 is implemented with the exploration phase given by Algorithm 3. To simplify the analysis, we make the following assumptions:

Assumption 1.

For every execution of Algorithm 3, it follows that $n_{succ}\geq 1$. $\Box$

Assumption 2.

For the sensor models of the copter and the rover, we assume the following:

  (i) $AP_{c}=AP_{r}=AP$.

  (ii) $R^{c}_{ap}=R^{r}_{ap}=0$ for all $ap\in AP$.

  (iii) $\beta^{c}_{x}(x,ap)=\beta^{c}_{x}(x,ap')=\beta^{r}_{x}(x,ap)=\beta^{r}_{x}(x,ap')$ for all $(ap,ap')\in AP\times AP$. $\Box$

Assumption 1 excludes the case where Algorithm 3 terminates without reaching any selected state $x^{*}$ to be explored. Moreover, the first condition in Assumption 2 means that both the copter and the rover are equipped with all sensors for $AP$. The second condition in Assumption 2 means, from (3) and (4), that the copter and the rover can take the sensor measurements only at their current states. The third condition in Assumption 2 means that the precision of the sensors is the same for both the rover and the copter.

For simplicity, we let $\beta=\beta^{c}_{x}(x,ap)=\beta^{c}_{x}(x,ap')=\beta^{r}_{x}(x,ap)=\beta^{r}_{x}(x,ap')$ for all $(ap,ap')\in AP\times AP$. Note that $\beta>0.5$ (see (3) and (4)). In addition, let $N_{x}\in\mathbb{N}_{>0}$ denote the total number of times the copter/rover visits the state $x$ and takes the corresponding sensor measurements for each $ap\in AP$, and let $m^{ap}_{N_{x}}\leq N_{x}$ ($x\in X$, $ap\in AP$) denote the number of times the corresponding sensor measurements for $ap$ are $1$. In other words, $N_{x}-m^{ap}_{N_{x}}$ represents the number of times the corresponding observations for $ap$ are $0$. Finally, we make the following assumption:

Assumption 3.

There exist $\varepsilon>0$ with $2\beta-2\varepsilon-1>0$ and $\bar{N}\in\mathbb{N}_{>0}$ such that, for all $x\in X$, $ap\in AP$, and $N_{x}\geq\bar{N}$, we have $\beta-\varepsilon\leq m^{ap}_{N_{x}}/N_{x}\leq\beta+\varepsilon$ if $ap\in L(x)$, and $\beta-\varepsilon\leq(N_{x}-m^{ap}_{N_{x}})/N_{x}\leq\beta+\varepsilon$ if $ap\notin L(x)$. $\Box$

Assumption 3 implies that the fraction of correct sensor measurements (i.e., $m^{ap}_{N_{x}}/N_{x}$ if $ap\in L(x)$ and $(N_{x}-m^{ap}_{N_{x}})/N_{x}$ if $ap\notin L(x)$) is $\varepsilon$-close to $\beta$ once the number of visits (sensor measurements) at $x$ is sufficiently large. The following theorem shows that, by executing Algorithm 1 with the exploration phase given by Algorithm 3, all the environmental beliefs of the atomic propositions converge to the appropriate values.

Theorem 1.

Let Assumptions 1–3 hold. Let $\alpha=0$ in (10) and suppose that Algorithm 1 is executed with the exploration phase given by Algorithm 3. Then, for all $x\in X$ and $ap\in AP$, it follows that

(23) $\mathcal{B}(x\models ap)\rightarrow 1,\ \ {\rm if}\ ap\in L(x),$
(24) $\mathcal{B}(x\models ap)\rightarrow 0,\ \ {\rm if}\ ap\notin L(x),$

as $k\rightarrow\infty$. $\Box$

Recall that $k$ is defined in Algorithm 1 and represents the (global) time step during the execution of Algorithm 1. Hence, Theorem 1 means that the environmental beliefs of the atomic propositions converge to the appropriate values as the number of iterations of the exploration/mission-execution phases goes to infinity. For the proof of Theorem 1, see Appendix A. A small numerical check of this convergence is sketched below.
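As a quick numerical illustration of Theorem 1 (not a substitute for the proof), the following sketch iterates the Bayesian update (8) for a single cell with a symmetric Bernoulli sensor of accuracy $\beta=0.9$; the belief approaches $1$ when $ap\in L(x)$, consistent with (23).

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9            # sensor accuracy (beta > 0.5), cf. (3), (4)
ap_holds = True       # ground truth: ap in L(x)
B = 0.5               # initial environmental belief B_0(x |= ap)
for _ in range(200):
    z = (rng.random() < beta) == ap_holds       # correct with prob. beta
    lik = beta if z else 1.0 - beta             # Pr[Z = z | x |= ap]
    marg = lik * B + (1.0 - lik) * (1.0 - B)    # Pr[Z = z]
    B = lik * B / marg                          # Bayes rule (8)
print(B)   # close to 1, as predicted by (23)
```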

Remark 2.

As indicated by Theorem 1, the convergence properties (23), (24) may not hold if $\alpha\neq 0$ in (10). Nevertheless, as previously stated in Section 4.2.2, setting $\alpha\neq 0$ is useful and important for practical applications, since it avoids the exploration of states that are completely irrelevant to the mission execution. Additionally, (23), (24) may not hold if the exploration phase is given by the local selection-based policy (Algorithm 2). Nevertheless, as previously stated in Section 4.2, utilizing Algorithm 2 is useful for practical applications where the computation capacity of the copter is severely limited, since the optimal control input can be obtained by evaluating the acquisitions only for the next states. $\Box$

Now, let $\hat{\mu}^{*}_{r,seq}$ denote the optimal policy sequence that maximizes the probability of satisfying $\phi$ under the assumption that the labeling function $L$ is known, i.e., $\hat{\mu}^{*}_{r,seq}=\arg\max_{\mu_{r,seq}}{\rm Pr}[{\bf x}_{\mu_{r,seq}}\models\phi]$. If $L$ is known, $\hat{\mu}^{*}_{r,seq}$ can be derived by constructing a product MDP with the knowledge of $L$ and solving the value iteration algorithm (for details, see the proof of Corollary 1 in Appendix B). Note that $\mu^{*}_{r,seq}$ given by (22) is not necessarily equal to $\hat{\mu}^{*}_{r,seq}$, since $\mu^{*}_{r,seq}$ is derived by maximizing the belief of satisfying $\phi$ based on the sensor measurements (i.e., $\mu^{*}_{r,seq}=\arg\max_{\mu_{r,seq}}\mathcal{B}({\bf x}_{\mu_{r,seq}}\models\phi)$). The following corollary, derived from Theorem 1, shows that $\mu^{*}_{r,seq}$ converges to $\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$.

Corollary 1.

Let Assumptions 1–3 hold. Let $\alpha=0$ in (10) and suppose that Algorithm 1 is executed with the exploration phase given by Algorithm 3. Then, it follows that $\mu^{*}_{r,seq}\rightarrow\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$. $\Box$

For the proof, see Appendix B.

6. Simulation results

In this section, we illustrate the effectiveness of the proposed algorithm through numerical simulations. The simulations were conducted in Python 3.7.9 on an AMD Ryzen 7 3700U (with Radeon Vega Mobile Gfx) CPU with 16 GB RAM.

6.1. Simulation 1

6.1.1. Problem setup

Figure 2. Environmental map of Simulation 1. Each letter (A, B, C, D, and O) labeling a cell indicates the atomic proposition that holds true in that cell.

We consider the environmental map consisting of $n=100$ cells as shown in Fig. 2. The set of states of the environment (i.e., the positions, or the centroids, of the cells) is given by $X=\{[i,j]^{\mathsf{T}}\in\mathbb{R}^{2},\ i=0,\ldots,9,\ j=0,\ldots,9\}$. The set of atomic propositions is given by $AP=\{A,B,C,D,O\}$, where $A,B,C$, and $D$ represent the atomic propositions of target objects that the rover seeks to discover, and $O$ represents the atomic proposition of obstacles that the rover needs to avoid at all times. For both the MDP motion models of the rover and the copter, the set of states is given by $X$. The sets of control inputs for the rover and the copter, i.e., $U_{r}$, $U_{c}$, are also the same and consist of 5 components: stay in the same cell, move up, move down, move right, and move left. Each control input drives the rover/copter towards one of the 8 cells adjacent to its current position. The transition probability for the rover's MDP is such that there is a 95% chance of moving from the current cell to the desired cell, and a remaining 5% chance of moving to one of the cells adjacent to the desired cell (with equal probability). Regarding the copter's MDP, it is assumed that there is a 90% chance of moving from the current cell to the desired cell, and a 10% chance of moving to one of the cells adjacent to the desired cell. We assume that the copter is able to move without regard to the obstacles (i.e., it can move over all states in $X$).

The rover is equipped with sensors to detect both the target objects and the obstacles, i.e., $AP_{r}=\{A,B,C,D,O\}$, while the copter is equipped with a sensor to detect only the obstacles, i.e., $AP_{c}=\{O\}$. Moreover, the sensor ranges are given by $R^{r}_{O}=R^{r}_{A}=R^{r}_{B}=R^{r}_{C}=R^{r}_{D}=2$, which implies from (3), (4) that the rover provides highly reliable sensor measurements only on its current cell and low-reliability measurements on its adjacent cells, and $R^{c}_{O}=4$, which implies that the copter can detect obstacles within 4 cells of its current cell. In addition, it is assumed that $M^{r}_{O}=M^{r}_{A}=M^{r}_{B}=M^{r}_{C}=M^{r}_{D}=0.5$, which implies from (3), (4) that the rover's maximum sensor accuracy is 100%. Moreover, $M^{c}_{O}=0.4$, which implies that the copter's maximum sensor accuracy is 90%. The scLTL specification for the rover is given by

(25) $\phi=\phi_{1}\vee\phi_{2}\vee\phi_{3},$

where

(26) $\phi_{1}=F_{o}A,\ \ \phi_{2}=F_{o}B\wedge\bigcirc F_{o}C,\ \ \phi_{3}=F_{o}C\wedge\bigcirc F_{o}D,$

with $F_{o}ap=\neg O\,U(\neg O\wedge ap)$ for $ap\in AP$. Intuitively, $\phi_{1}$ means that the rover should eventually discover the target $A$ while avoiding the obstacles. Moreover, $\phi_{2}$ (resp. $\phi_{3}$) means that the rover should eventually discover $B$ and then $C$ (resp. $C$ and then $D$) while avoiding the obstacles. During the execution of Algorithm 1, we set $T_{c}=5$, $T_{r}=3$, and $\alpha=1.5$ in the acquisition function (10). Moreover, we regard the mission as complete once the belief of satisfying $\phi$ under the rover's optimal policy exceeds $0.98$.

6.1.2. Simulation results

Figure 3. Snapshots of the environmental beliefs in Simulation 1. For simplicity, only the color maps for the three propositions $C,D,O$ are shown. The blue circles represent the rover's position, and the red circles represent the copter's position. The rover reached the cell $(0,5)$ and found $C$ at $k=325$, and then reached $(0,2)$, where $D$ exists, at $k=334$, at which point the mission is regarded as complete. It can also be verified that the rover avoided obstacles at all times; for details, see the animation in (animation, ).

Some snapshots of the simulation result obtained by applying Algorithm 1 are shown in Fig. 3. During the execution of Algorithm 1, the copter executes the global selection-based policy (Algorithm 3) until the mission is complete. In the figure, the environmental beliefs of the atomic propositions are illustrated as color maps, and, for simplicity, only the color maps for $C,D,O$ are shown. The rover's position is illustrated by the blue circle (shown only in the figures for $C,D$), and the copter's position is illustrated by the red circle (shown only in the figures for $O$). The figures show that the rover reached the cell where $C$ exists (at $k=325$), and then reached the cell where $D$ exists (at $k=334$), at which point the mission is regarded as complete. The whole behaviors of both the rover and the copter, as well as the time evolution of the environmental beliefs of the atomic propositions, can be seen in the animation; see (animation, ).

To compare the local (Algorithm 2) and the global (Algorithm 3) selection-based exploration for the copter, we iterated the following steps: (i) the initial positions of the rover and the copter are randomly chosen from $X$; (ii) using the generated initial positions, Algorithm 1 is executed with the local selection-based exploration (Algorithm 2); (iii) using the generated initial positions, Algorithm 1 is executed with the global selection-based exploration (Algorithm 3). The above steps were iterated 100 times, and for each exploration policy we counted the number of times the mission was successfully completed before $k=300$. At the same time, we measured the average running time (in sec) of the local/global exploration policy (i.e., the average execution time of Algorithm 2 and Algorithm 3). Table 1 illustrates the simulation results. The table indicates that the number of mission completions with the global selection-based exploration is larger than with the local one. This may be because the global selection-based exploration guarantees the convergence of the environmental beliefs of the atomic propositions, while the local one does not (see Section 4.2.3 and Section 5). On the other hand, the table also shows that the global policy requires heavier computation than the local one, because it needs to solve the value iteration algorithm to reach the selected state with the highest acquisition (for details, see (13)).

Table 1. The number of times the mission was successfully completed for the local/global exploration policy, and the average running time (in sec) of executing the local/global exploration policy for each iteration in Algorithm 1.

                       Number of mission completions   Average running time (s)
Local (Algorithm 2)    62                              3.0
Global (Algorithm 3)   71                              31

6.2. Simulation 2: comparison with the existing algorithm

In this section, we show that the proposed algorithm is advantageous over the existing algorithm (nilsson2018, ) in terms of the running time and the memory usage of solving the value iteration (see Remark 1 for a detailed explanation). We consider environments with different sizes of the state space: $n=|X|\in\mathcal{N}=\{6,9,12,15,50,100\}$. The set of atomic propositions is given by $AP=\{A,O\}$, where $A$ indicates the target object and $O$ indicates the obstacle. The mission specification is given by $\phi=\neg O\ U\ (\neg O\wedge A)$. For each $n\in\mathcal{N}$, we randomly generate the initial beliefs of the atomic propositions $\mathcal{B}(x\models ap)$ for all $x\in X$, $ap\in AP$ uniformly from the interval $(0,1)$ and solve the value iteration algorithm to synthesize the optimal policy for the rover according to the proposed approach (see Section 4.3, in particular, (20) and (21)). For the implementation of (nilsson2018, ), we assign the environmental belief to every state in the environment and solve the corresponding value iteration. The copter's exploration is not included in this simulation, since we focus on evaluating the running time of synthesizing the optimal policy for the rover. Table 2 shows the resulting running time (in sec) of solving the value iterations, where the symbol "—" indicates that the optimal policy could not be found due to memory overflow. The table shows that the running time of solving the value iteration with the existing algorithm increases rapidly as $n$ increases and, in particular, becomes infeasible for $n>12$ due to memory overflow.1) As described in Remark 1, such a blowup is due to the fact that the size of the state space of the product MDP increases exponentially with respect to $n$ $(=|X|)$. Therefore, the proposed approach is more useful than the existing approach in terms of the running time and the memory usage for synthesizing the optimal policy for the rover.

1) Note that the numerical simulation in (nilsson2018, ) considers a state space with $n=100$. This is possible because the atomic propositions are assigned only to some small regions of the state space. Specifically, the numerical simulation in (nilsson2018, ) considers that only 8 or 5 regions of the state space are of interest to be explored (see Fig. 6 in (nilsson2018, )), so that the computational complexity of solving the value iteration is reduced. In contrast, in our problem setup, the environmental beliefs of the atomic propositions are assigned to every single cell of the state space; thus, the algorithm in (nilsson2018, ) becomes infeasible for $n>12$ in our setup.

Table 2. Running time (in sec) of solving the value iterations using the proposed approach and the existing algorithm in (nilsson2018, ).

$n$                                    6      9      12     15     50     100
Proposed approach                      0.02   0.05   0.13   0.22   3.91   20.35
Previous approach in (nilsson2018, )   0.02   1.07   86.02  —      —      —

7. Conclusion and future works

In this paper, we investigated collaborative rover-copter path planning and exploration with temporal logic specifications under uncertain environments. The rover has the role of satisfying a mission specification expressed by an scLTL formula, while the copter has the role of assisting the rover by exploring the uncertain environment and reducing its uncertainties. The environmental uncertainties are captured by the environmental beliefs of the atomic propositions, which are the posterior probabilities that evaluate the level of uncertainty based on the sensor measurements. A control policy of the rover is then synthesized by maximizing the belief of the satisfaction of the scLTL formula through the implementation of an automata-based model checking. Then, an exploration policy for the copter is synthesized by evaluating the entropy, which represents the level of uncertainty, together with the rover's path according to the current optimal policy. Finally, the effectiveness of the proposed approach was validated through several numerical examples. Future work includes investigating safety guarantees (i.e., ensuring that the rover avoids obstacles at all times during the execution of Algorithm 1), as well as utilizing more sophisticated sensor models than the simple Bernoulli-type sensor models considered in this paper. In addition, extending the proposed approach to real-time and concurrency-related techniques, such as synthesizing a supervisor that determines the activity of both the rover and the copter, is left for future investigation.

Acknowledgement

This work was supported by JST ERATO Grant Number JPMJER1603, JST CREST Grant Number JPMJCR2012, Japan and JSPS Grant-in-Aid for Young Scientists Grant Number JP21K14184.

References

  • (1) F. L. Lewis, H. Zhang, K. Hengster-Movric, A. Das, Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches, Springer, 2014.
  • (2) B. Balaram, et al., Mars helicopter technology demonstrator, in: AIAA Atmospheric Flight Mechanics Conference, 2018, pp. 1–18.
  • (3) D. Brown, et al., Mars helicopter to fly on NASA's next red planet rover mission, in: NASA/JPL News Release, 2018.
  • (4) P. Nilsson, S. Haesaert, R. Thakker, K. Otsu, C. Vasile, A. Agha-Mohammadi, R. Murray, A. D. Ames, Toward specification-guided active mars exploration for cooperative robot teams, in: Proceedings of Robotics: Science and Systems (RSS), 2018.
  • (5) S. Bharadwaj, M. Ahmadi, T. Tanaka, U. Topcu, Transfer entropy in mdps with temporal logic specifications, in: 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 4173–4180.
  • (6) T. Sasaki, K. Otsu, R. Thakker, S. Haesaert, A. Agha-mohammadi, Where to map? iterative rover-copter path planning for mars exploration, IEEE Robotics and Automation Letters 5 (2) (2020) 2123–2130.
  • (7) H. Kress-Gazit, M. Lahijanian, V. Raman, Synthesis for Robots: Guarantees and Feedback for Robot Behavior, Annual Review of Control, Robotics, and Autonomous Systems 1 (2018) 211–236.
  • (8) C. Belta, A. Bicchi, M. Egerstedt, E. Frazzoli, E. Klavins, G. J. Pappas, Symbolic planning and control of robot motion [Grand Challenges of Robotics], IEEE Robotics and Automation Magazine 14 (1) (2007) 61–70.
  • (9) C. Belta, B. Yordanov, E. A. Gol, Formal methods for discrete-time dynamical systems, Vol. 89, Springer, 2017.
  • (10) C. Baier, J.-P. Katoen, Principles of model checking, The MIT Press, 2008.
  • (11) L. F. Bertuccelli, J. P. How, Robust uav search for environments with imprecise probability maps, in: Proceedings of the 44th IEEE Conference on Decision and Control, 2005, pp. 5680–5685.
  • (12) Y. Wang, I. I. Hussein, Bayesian-based decision making for object search and characterization, in: 2009 American Control Conference, 2009, pp. 1964–1969.
  • (13) I. I. Hussein, D. M. Stipanovic, Effective coverage control for mobile sensor networks with guaranteed collision avoidance, IEEE Transactions on Control Systems Technology 15 (4) (2007) 642–657.
  • (14) K. Imai, T. Ushio, Effective combination of search policy based on probability and entropy for heterogeneous mobile sensors, in: 2013 IEEE International Conference on Systems, Man, and Cybernetics, 2013, pp. 1981–1986.
  • (15) A. I. M. Ayala, S. B. Andersson, C. Belta, Temporal logic motion planning in unknown environments, in: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 5279–5284.
  • (16) J. Fu, N. Atanasov, U. Topcu, G. J. Pappas, Optimal temporal logic planning in probabilistic semantic maps, in: 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 3690–3697.
  • (17) M. Maly, M. Lahijanian, L. E. Kavraki, H. Kress-Gazit, M. Y. Vardi, Iterative temporal motion planning for hybrid systems in partially unknown environments, in: Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, 2013, p. 353–362.
  • (18) M. Guo, K. H. Johansson, D. V. Dimarogonas, Revising motion planning under linear temporal logic specifications in partially known workspaces, in: IEEE International Conference on Robotics and Automation (ICRA), 2013.
  • (19) M. Guo, D. V. Dimarogonas, Multi-agent plan reconfiguration under local ltl specifications, The International Journal of Robotics Research 34 (2) (2015) 218–235.
  • (20) M. Guo, M. M. Zavlanos, Probabilistic motion planning under temporal tasks and soft constraints, IEEE Transactions on Automatic Control 63 (12) (2018) 4051–4066.
  • (21) S. C. Livingston, R. M. Murray, J. W. Burdick, Backtracking temporal logic synthesis for uncertain environments, in: 2012 IEEE International Conference on Robotics and Automation, 2012, pp. 5163–5170.
  • (22) T. Wongpiromsarn, E. Frazzoli, Control of probabilistic systems under dynamic, partially known environments with temporal logic specifications, in: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 7644–7651.
  • (23) M. Lahijanian, J. Wasniewski, S. B. Andersson, C. Belta, Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees, in: 2010 IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 3227–3232.
  • (24) C. Yoo, C. Belta, Control with probabilistic signal temporal logic, in: Preprint: available on https://arxiv.org/pdf/1510.08474.pdf, 2015.
  • (25) D. Sadigh, A. Kapoor, Safe control under uncertainty with probabilistic signal temporal logic, in: Proceedings of Robotics: Science and Systems, 2016.
  • (26) C. Vasile, K. Leahy, E. Cristofalo, A. Jones, M. Schwager, C. Belta, Control in belief space with temporal logic specifications, in: 2016 IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 7419–7424.
  • (27) K. Leahy, E. Cristofalo, C. I. Vasile, A. Jones, E. Montijano, M. Schwager, C. Belta, Control in belief space with temporal logic specifications using vision-based localization, The International Journal of Robotics Research 38 (6) (2019) 702–722.
  • (28) E. M. Wolff, U. Topcu, R. M. Murray, Robust control of uncertain markov decision processes with temporal logic specifications, in: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 3372–3379.
  • (29) A. Ulusoy, T. Wongpiromsarn, C. Belta, Incremental controller synthesis in probabilistic environments with temporal logic constraints, The International Journal of Robotics Research 33 (8) (2014) 1130–1144.
  • (30) K. Hashimoto, A. Saoud, M. Kishida, T. Ushio, D. V. Dimarogonas, Learning-based symbolic abstractions for nonlinear control systems, in arxiv, available on https://arxiv.org/abs/2004.01879 (2020).
  • (31) D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, S. A. Seshia, A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications, in: 53rd IEEE Conference on Decision and Control, 2014, pp. 1091–1096.
  • (32) J. Wang, X. Ding, M. Lahijanian, I. Paschalidis, C. Belta, Temporal logic motion control using actor-critic methods, The International Journal of Robotics Research 34 (10) (2015) 1329–1344.
  • (33) X. Li, Y. Ma, C. Belta, A policy search method for temporal logic specified reinforcement learning tasks, in: 2018 Annual American Control Conference (ACC), 2018, pp. 240–245.
  • (34) M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, I. Lee, Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees, in: 2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 5338–5343.
  • (35) B. Johnson, H. Kress-Gazit, Analyzing and revising high-level robot behaviors under actuator error, in: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 741–748.
  • (36) B. Johnson, H. Kress-Gazit, Analyzing and revising synthesized controllers for robots with sensing and actuation errors, The International Journal of Robotics Research 34 (6) (2015) 816–832.
  • (37) P. Nuzzo, J. Li, A. L. Sangiovanni-Vincentelli, Y. Xi, D. Li, Stochastic assume-guarantee contracts for cyber-physical system design, ACM Transactions on Embedded Computing Systems 18 (1) (Jan. 2019).
  • (38) M. Tiger, F. Heintz, Incremental reasoning in probabilistic signal temporal logic, International Journal of Approximate Reasoning 119 (2020) 325 – 352.
  • (39) T. Latvala, Efficient model checking of safety properties, in: 10th International SPIN workshop, 2003, pp. 74–88.
  • (40) T. M. Cover, J. A. Thomas, Elements of Information Theory, Wiley Series, 2006.
  • (41) A. Abate, M. Prandini, J. Lygeros, S. Sastry, Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems, Automatica 44 (11) (2008) 2724–2734.
  • (42) [link to the animation]

Appendix A Proof of Theorem 1

Let us first rewrite the Bayesian update (8) as

(27) $\mathcal{B}_{j+1}(x\models ap)=\cfrac{{\rm Pr}[Z_{x}(x\models ap)=z\,|\,x\models ap]\,\mathcal{B}_{j}(x\models ap)}{{\rm Pr}[Z_{x}(x\models ap)=z]}$

for $j\in\mathbb{N}_{\geq 0}$, where we let $Z_{x}(x\models ap)=Z^{r}_{x}(x\models ap)$ (resp. $Z_{x}(x\models ap)=Z^{c}_{x}(x\models ap)$) if the rover (resp. copter) provides the sensor measurement for $ap$ at $x$. That is, $\mathcal{B}_{j}(x\models ap)$, $j\in\mathbb{N}_{\geq 0}$, represents the environmental belief of $ap$ at $x$ computed after $j$ visits (by either the rover or the copter) to $x$. If $Z_{x}(x\models ap)=1$ at the $(j+1)$-th visit to $x$, the Bayesian update is given by

(28) $\mathcal{B}_{j+1}(x\models ap)=\cfrac{\beta\mathcal{B}_{j}(x\models ap)}{\beta\mathcal{B}_{j}(x\models ap)+(1-\beta)(1-\mathcal{B}_{j}(x\models ap))}.$

After some simple calculations, we then obtain $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\frac{1-\beta}{\beta}\tilde{\mathcal{B}}_{j}(x\models ap)$, where we let $\tilde{\mathcal{B}}_{j}(x\models ap)=\frac{1}{\mathcal{B}_{j}(x\models ap)}-1$. On the other hand, if $Z_{x}(x\models ap)=0$ at the $(j+1)$-th visit, it follows that $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\frac{\beta}{1-\beta}\tilde{\mathcal{B}}_{j}(x\models ap)$. Therefore, we obtain

(29) $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\cfrac{1-\beta}{\beta}\,\tilde{\mathcal{B}}_{j}(x\models ap),\ \ {\rm if}\ Z_{x}(x\models ap)=1,$
(30) $\tilde{\mathcal{B}}_{j+1}(x\models ap)=\cfrac{\beta}{1-\beta}\,\tilde{\mathcal{B}}_{j}(x\models ap),\ \ {\rm if}\ Z_{x}(x\models ap)=0.$
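For completeness, the calculation behind (29) is the following: taking the reciprocal of (28) and subtracting 1 gives

$\cfrac{1}{\mathcal{B}_{j+1}(x\models ap)}-1=\cfrac{\beta\mathcal{B}_{j}(x\models ap)+(1-\beta)(1-\mathcal{B}_{j}(x\models ap))}{\beta\mathcal{B}_{j}(x\models ap)}-1=\cfrac{1-\beta}{\beta}\left(\cfrac{1}{\mathcal{B}_{j}(x\models ap)}-1\right),$

which is exactly (29); the case $Z_{x}(x\models ap)=0$ in (30) follows in the same way with the roles of $\beta$ and $1-\beta$ exchanged.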

Suppose that the rover/copter visits $x$ a total of $N_{x}$ times and that $ap\in L(x)$. From (29), (30), we have

$\tilde{\mathcal{B}}_{N_{x}}(x\models ap)=\left(\cfrac{1-\beta}{\beta}\right)^{2m^{ap}_{N_{x}}-N_{x}}\tilde{\mathcal{B}}_{0}(x\models ap).$

Note that $\tilde{\mathcal{B}}_{0}(x\models ap)=\frac{1}{\mathcal{B}_{0}(x\models ap)}-1\in(0,\infty)$, since the initial belief of the atomic proposition is selected as $\mathcal{B}_{0}(x\models ap)\in(0,1)$ (see line 2 in Algorithm 1). From Assumption 3, it follows that $N_{x}(2\beta-2\varepsilon-1)\leq 2m^{ap}_{N_{x}}-N_{x}\leq N_{x}(2\beta+2\varepsilon-1)$ for all $N_{x}\geq\bar{N}$ and $ap\in L(x)$. Thus, we obtain $0<\tilde{\mathcal{B}}_{N_{x}}(x\models ap)\leq\{(1-\beta)/\beta\}^{N_{x}(2\beta-2\varepsilon-1)}\tilde{\mathcal{B}}_{0}(x\models ap)$ for all $N_{x}\geq\bar{N}$ and $ap\in L(x)$. Noting that $2\beta-2\varepsilon-1>0$ (see Assumption 3), $\beta>0.5$, and $\tilde{\mathcal{B}}_{0}(x\models ap)\in(0,\infty)$, we obtain $\tilde{\mathcal{B}}_{N_{x}}(x\models ap)\rightarrow 0$ as $N_{x}\rightarrow\infty$, which implies that $\mathcal{B}_{N_{x}}(x\models ap)\rightarrow 1$ as $N_{x}\rightarrow\infty$. Similarly, if $ap\notin L(x)$, we obtain $\tilde{\mathcal{B}}_{N_{x}}(x\models ap)\rightarrow\infty$, i.e., $\mathcal{B}_{N_{x}}(x\models ap)\rightarrow 0$ as $N_{x}\rightarrow\infty$. Hence, for all $x\in X$ and $ap\in AP$, we have

(31) $\lim_{N_{x}\rightarrow\infty}\mathcal{B}_{N_{x}}(x\models ap)=1,\ \ {\rm if}\ ap\in L(x),$
(32) $\lim_{N_{x}\rightarrow\infty}\mathcal{B}_{N_{x}}(x\models ap)=0,\ \ {\rm if}\ ap\notin L(x).$

In other words, (23), (24) are satisfied for all $x\in X$ and $ap\in AP$ if, for all $x\in X$, the number of visits to $x$ goes to infinity as $k\rightarrow\infty$, i.e., $N_{x}\rightarrow\infty$ as $k\rightarrow\infty$. In what follows, we show that $N_{x}\rightarrow\infty$ as $k\rightarrow\infty$ for all $x\in X$. With a slight abuse of notation, let $N_{x}(k)\leq k$ and $m^{ap}_{N_{x}}(k)\leq N_{x}(k)$ denote, respectively, the total number of times the rover/copter visits $x$ within the time step $k$ and the number of times the corresponding observations for $ap$ are $1$. To derive a contradiction, let $X_{not}\subset X$ denote the set of all states for which the number of visits does not go to infinity as $k\rightarrow\infty$, and assume that $X_{not}$ is non-empty. In other words, there exists a time step $k_{ter}\in\mathbb{N}_{>0}$ such that no $x\in X_{not}$ is visited after $k_{ter}$, i.e., $N_{x}(k)=N_{x}(k+1)$ for all $k\geq k_{ter}$. Hence, we have $\mathcal{B}_{N_{x}(k)}(x\models ap)=\mathcal{B}_{N_{x}(k+1)}(x\models ap)\in(0,1)$ for all $k\geq k_{ter}$, and, therefore, $H(\mathcal{B}_{N_{x}(k)}(x\models ap))=H(\mathcal{B}_{N_{x}(k+1)}(x\models ap))\in(0,1)$ for all $k\geq k_{ter}$, which implies that the entropy remains constant and does not converge to $0$. On the other hand, for all $x'\in X\backslash X_{not}$, $N_{x'}(k)\rightarrow\infty$ as $k\rightarrow\infty$. Hence, the environmental beliefs of the atomic propositions converge to the appropriate values, i.e., for all $x'\in X\backslash X_{not}$ and $ap\in AP$, $\mathcal{B}_{N_{x'}(k)}(x'\models ap)\rightarrow 1$ if $ap\in L(x')$ and $\rightarrow 0$ if $ap\notin L(x')$ as $k\rightarrow\infty$. Therefore, the entropy converges to $0$, i.e., for all $ap\in AP$ and $x'\in X\backslash X_{not}$, $H(\mathcal{B}_{N_{x'}(k)}(x'\models ap))\rightarrow 0$ as $k\rightarrow\infty$.

Now, since $\sum_{ap\in AP}H(\mathcal{B}_{N_{x'}(k)}(x'\models ap))\rightarrow 0$ for all $x'\in X\backslash X_{not}$, there exists a time step $\bar{k}\geq k_{ter}$ such that the following holds: for all $k\geq\bar{k}$, $x\in X_{not}$, and $x'\in X\backslash X_{not}$,

(33) $\sum_{ap\in AP}H(\mathcal{B}_{N_{x'}(k)}(x'\models ap))<\sum_{ap\in AP}H(\mathcal{B}_{N_{x}(k)}(x\models ap)).$

Recalling that we set $\alpha=0$ in (10), the inequality (33) implies that, at some time step after $\bar{k}\geq k_{ter}$, the copter would select some $x^{*}\in X_{not}$ at the beginning of the execution of Algorithm 3, which means, from Assumption 1, that the copter would visit $x^{*}$ after $\bar{k}\geq k_{ter}$. However, this contradicts the assumption that no $x\in X_{not}$ is visited after $k_{ter}$. This contradiction follows from the assumption that $X_{not}$ is non-empty. Therefore, $X_{not}$ is empty, and thus $N_{x}(k)\rightarrow\infty$ as $k\rightarrow\infty$ for all $x\in X$. In summary, (23) and (24) are satisfied for all $x\in X$ and $ap\in AP$. $\Box$

Appendix B Proof of Corollary 1

Let $\hat{\mathcal{M}}_{S}=(S,s_{0},U_{S},\hat{p}_{S},S_{f})$ denote the product MDP between $\mathcal{M}_{r}=(X,x_{r},U_{r},p_{r})$ and $\mathcal{A}_{\phi}=(Q,2^{AP},\delta,q_{0},Q_{f})$, where $S$, $s_{0}$, $U_{S}$, and $S_{f}$ are the same as in the product belief MDP of Definition 1, and $\hat{p}_{S}:S\times U_{S}\rightarrow\mathcal{D}(S)$ is the transition probability function defined by $\hat{p}_{S}((x',q')|(x,q),u)=p_{r}(x'|x,u)$ (for all $(x,q),(x',q')\in S$ and $u\in U_{S}$) iff $L(x)\in en(q,q')$, and $0$ otherwise. If $L$ is known, the optimal policy $\hat{\mu}^{*}_{r,seq}$ is obtained by solving the value iteration algorithm over the product MDP $\hat{\mathcal{M}}_{S}$. This fact, combined with Theorem 1, implies that if the product belief MDP $\mathcal{M}_{S}$ in Definition 1 converges to the product MDP $\hat{\mathcal{M}}_{S}$, i.e., $\mathcal{M}_{S}\rightarrow\hat{\mathcal{M}}_{S}$ as $k\rightarrow\infty$, then $\mu^{*}_{r,seq}\rightarrow\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$. To show that $\mathcal{M}_{S}\rightarrow\hat{\mathcal{M}}_{S}$ as $k\rightarrow\infty$, we need to show that $p_{S}((x',q')|(x,q),u)\rightarrow\hat{p}_{S}((x',q')|(x,q),u)$ for all $(x,q),(x',q')\in S$ and $u\in U_{S}$. Suppose that $\mathcal{B}(x\models ap)\rightarrow 1$ (resp. $\mathcal{B}(x\models ap)\rightarrow 0$) for all $ap\in L(x)$ (resp. $ap\notin L(x)$), i.e., all the environmental beliefs of the atomic propositions converge to the appropriate values. From (14), it then follows that $\mathcal{B}_{alph}(x\models\sigma)\rightarrow 1$ iff $\sigma=L(x)$, and $\rightarrow 0$ otherwise. From (15), we then obtain $\mathcal{B}_{en}(x\models en(q,q'))\rightarrow 1$ iff $L(x)\in en(q,q')$, and $\rightarrow 0$ otherwise. Hence, the transition belief function (18) satisfies $p_{S}((x',q')|(x,q),u)\rightarrow p_{r}(x'|x,u)$ (for all $(x,q),(x',q')\in S$ and $u\in U_{S}$) iff $L(x)\in en(q,q')$, and $\rightarrow 0$ otherwise, which indeed coincides with $\hat{p}_{S}((x',q')|(x,q),u)$. Hence, it follows that $p_{S}((x',q')|(x,q),u)\rightarrow\hat{p}_{S}((x',q')|(x,q),u)$ for all $(x,q),(x',q')\in S$ and $u\in U_{S}$, and, therefore, $\mathcal{M}_{S}\rightarrow\hat{\mathcal{M}}_{S}$ as $k\rightarrow\infty$. As described above, this directly implies $\mu^{*}_{r,seq}\rightarrow\hat{\mu}^{*}_{r,seq}$ as $k\rightarrow\infty$. $\Box$