Generalization of Safe Optimal Control Actions on Networked Multi-Agent Systems

Lin Song, Neng Wan, Aditya Gahlawat, Chuyuan Tao, Naira Hovakimyan, and Evangelos A. Theodorou This work is supported by the Air Force Office of Scientific Research (AFSOR), National Aeronautics and Space Administration (NASA) and National Science Foundation’s National Robotics Initiative (NRI) and Cyber-Physical Systems (CPS) awards #1830639, #1932529, and #1932288. Lin Song, Neng Wan, Aditya Gahlawat, Chuyuan Tao, and Naira Hovakimyan are with the Department of Mechanical Science and Engineering, University of Illinois at Urbana Champaign, Urbana, IL 61801 USA (e-mail:{linsong2, nengwan2, gahlawat, chuyuan2, nhovakim}@illinois.edu).Evangelos A. Theodorou is with the Department of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: {evangelos.theodorou}@gatech.edu).

Abstract

We propose a unified framework to fast generate a safe optimal control action for a new task from existing controllers on Multi-Agent Systems (MASs). The control action composition is achieved by taking a weighted mixture of the existing controllers according to the contribution of each component task. Instead of sophisticatedly tuning the cost parameters and other hyper-parameters for safe and reliable behavior in the optimal control framework, the safety of each single task solution is guaranteed using the control barrier functions (CBFs) for high-degree stochastic systems, which constrains the system state within a known safe operation region where it originates from. Linearity of CBF constraints in control enables the control action composition. The discussed framework can immediately provide reliable solutions to new tasks by taking a weighted mixture of solved component-task actions and filtering on some CBF constraints, instead of performing an extensive sampling to achieve a new controller. Our results are verified and demonstrated on both a single UAV and two cooperative UAV teams in an environment with obstacles.

Index Terms:

Stochastic Optimal Control, Safe Control, Multi-Agent Systems, Control Barrier Functions

I Introduction

Optimal control design by minimizing a cost function or maximizing a reward function has been investigated for various systems [1, 2]. Specifically, stochastic optimal control problems consider minimizing a cost function for dynamical systems subject to random noise [3, 4]. Although such formulation captures a wide class of real-world problems, stochastic optimal control actions are typically difficult and expensive to compute in large-scale systems due to the curse of dimensionality [5]. To overcome the computation challenges, many approximation-based approaches, including cost parameterization[6], path-integral formulation [7, 8], value function approximation [9] and policy approximation [10], have been proposed. Exponential transformation on the value function was applied in [11] such that linear-form solutions to stochastic control problems were achieved, and thus the computational efficiency was improved. Optimal control problems whose solutions can be obtained by solving reduced linear equations are generally categorized as linearly-solvable optimal control (LSOC) problems in [12]. These formulations leverage the benefits of LSOC problems, including compositionality and path-integral representation of the optimal solution [13]. Path integral (PI) control approach usually requires extensive sampling on given dynamical models. However, the compositionality property enables the construction of composite control by taking a weighted mixture of existing component controllers to solve a new task in a certain class without sampling on the dynamical system again, and thus improves the computation efficiency [14, 15]. The methodology of combining and reusing existing controllers has been explored and validated on different systems, from single-agent systems (e.g. 3-dimensional robotic arm [14] and physically-based character animation [16]) to networked multi-agent systems (e.g. a cooperative UAV team [17]).

Networked multi-agent systems enable coordination between agents and have a wide range of real-world applications, including cooperative vehicles [18, 19], robotics [20], sensor networks [21], and transportation [22]. The decentralized POMDP (Dec-POMDP) model in [23], formulates the control problem in multi-agent systems under uncertainty with only local state information available. A dynamic programming (DP)-based approach has been investigated on the Dec-POMDP model in [24] towards the search for the optimal solution. However, the computation required for solving such systems grows exponentially as the system state dimension scales. To tackle the difficulties in computation, approximation-based approaches and distributed algorithms based on certain network partition have been discussed. In [25], the expectation-maximization (EM) algorithm was introduced on the Dec-POMDP model, where optimal policies were represented as finite-state controllers to improve the scalability. Q-value functions are approximated and efficient computation of optimal policies is achieved in [26]. Furthermore, fully-decentralized reinforcement learning algorithms were investigated on networked MASs using function approximation in [27]. Compared to aforementioned approaches, path integral (PI) formalism of stochastic optimal control problem solutions relies on sampling of the stochastic differential equation, and is applicable to large-scale nonlinear systems. Path integral formulation has demonstrated higher efficiency and robustness in solving high-dimensional reinforcement learning problems in [7]. Path integral control approach has also been extended to MASs, and an approximate inference method was applied to approximate the exact solution in [28]. A distributed algorithm proposed in [29] partitions the networked MAS into multiple subsystems, from which local control actions are computed with limited resources.

However, the optimal solution to a cost-minimization problem is not always reliable and applicable in the real world, especially in safety-critical scenarios. Guaranteed-safe control actions have attracted researchers’ interest, both in controller design [30, 31], and in verification[32, 33]. Control barrier functions (CBFs) have been proposed as a powerful tool to enforce safe behavior of system states by introducing extra constraints on the control inputs [34]. Control barrier functions are required to be finite (for Reciprocal CBF) or positive (for Zero CBF) as the system operates within the known safe region. CBF can be constructed empirically [35] or learned from data [36, 37]. Along with optimization-based control actions or reinforcement learning (RL) techniques, the CBF-based approach to ensure safety has achieved satisfying results in various application scenarios, including bipedal locomotion under model uncertainty [38], autonomous vehicles [39], and UAVs [40]. Recently, CBF techniques were generalized to stochastic systems with high-probability guarantees, in cases of both complete and incomplete information in [41] and [42]. To achieve the state-trajectory tracking goal and associated safe certificates on nonlinear and complicated systems, contraction-based methods are applicable on both nominal dynamics [43] and uncertain dynamics [44, 45].

Refer to caption — \textcolorsubsectioncolorFig. 1: The architecture of the proposed safe composition control framework.

However, efficient computation of certified-safe stochastic optimal control solutions on MASs is still an open problem. The main contribution of this paper is a framework of generalizing optimal control actions while ensuring safety on networked MASs; the architecture is illustrated in Fig. 1. When multiple solved control problems on MASs share identical dynamical information, but are slightly different in some aspects, such as terminal states and final costs, instead of sampling on the given dynamical system again to solve a new task, the compositionality of achieved control actions can be leveraged, and the existing controllers of these solved problems can be weighted and mixed to solve a new problem and drive the system to a new target. The baseline optimal control for the new task is first obtained by mixing existing certified-safe controllers. Then, a post-composite constrained optimization using CBFs is formulated to filter the baseline control and thus guarantee the safety. The task generalization capability of resulting control actions allows direct and efficient computation of a new problem solution without re-sampling in a certain class. Compared with our preliminary work [17], we furthermore incorporate safety constraints on the compositionality of LSOC problem solutions. The proposed strategy is validated via numerical simulations on both single UAV and two cooperative UAV teams.

The rest of the paper is organized as follows: Section II introduces the preliminaries of formulating stochastic control problems, linearly-solvable optimal control (LSOC) problems, and stochastic CBFs; Section III introduces the compositionality of LSOC problem solutions on networked MASs; Section IV introduces the achieved certified-safe optimal control actions on MASs and the generalization of proposed control actions with safety guarantees; Section V provides numerical simulations in three scenarios validating the proposed approach; the conclusion of this paper and some open problems are discussed in Section VI.

II Preliminaries and Problem Formulation

We start by introducing some preliminary results of optimal control actions in stochastic systems, including both the single-agent and multi-agent scenarios. Then an extension of control barrier functions (CBFs) to stochastic systems is presented, which enables the proposed safe and optimal control framework introduced in Section IV.

II-A Stochastic optimal control problems

II-A1 Single-Agent Systems

Consider a continuous-time dynamical system described by the It $\hat{\textrm{o}}$ diffusion process:

	$\displaystyle dx^{t}$	$\displaystyle=g(x^{t})dt+B(x^{t})[u(x^{t},t)dt+\sigma d\omega]$		(1)
		$\displaystyle=f(x^{t},u^{t})dt+F(x^{t},u^{t})d\omega,$

where time-varying $x^{t}\in\mathbb{R}^{M}$ is the state vector with $M$ denoting the state dimension, $g(x^{t})\in\mathbb{R}^{M}$ , $B(x^{t})\in\mathbb{R}^{M\times P}$ , $u(x^{t},t)\in\mathbb{R}^{P}$ are the passive dynamics, control matrix and control input with $P$ denoting the input dimension, and $\omega\in\mathbb{R}^{P}$ denotes the Brownian noise with covariance $\sigma\in\mathbb{R}^{P\times P}$ .

Let $\mathcal{I}$ denote the set of interior states and $\mathcal{B}$ denote the set of boundary states. When $x^{t}\in\mathcal{I}$ , the running cost function is defined as:

c(x^{t},u^{t})=q(x^{t})+\frac{1}{2}u(x^{t},t)^{\top}Ru(x^{t},t),

(2)

where $q(x^{t})\geq 0$ is a state-related cost and $u(x^{t},t)^{\top}Ru(x^{t},t)$ is a control-quadratic term with $R$ being a positive definite matrix. When $x^{t_{f}}\in\mathcal{B}$ , the terminal cost function is denoted by $\phi(x^{t_{f}})$ , where $t_{f}$ is the exit time and determined online in the first-exit problem. The cost-to-go function $J^{u}(x^{t},t)$ for the first-exit problem under control action $u$ can be defined as:

J^{u}(x^{t},t)=\mathbb{E}_{x^{t},t}^{u}[\phi(x^{t_{f}})+\int_{t}^{t_{f}}c(x(\tau),u(\tau))d\tau].

(3)

The value function $V(x^{t},t)$ is defined as the optimal cost-to-go function:

V(x^{t},t)=\min_{u}\mathbb{E}_{x^{t},t}^{u}[\phi(x^{t_{f}})+\int_{t}^{t_{f}}c(x(\tau),u(\tau))d\tau].

(4)

The value function is the expected cumulative running cost starting from the state $x^{t}$ and acting optimally thereafter. For notation simplicity, the time-evolution of the state is omitted.

Define a stochastic second-order differentiator as $\mathcal{L}_{(u)}[V]=f^{\top}\nabla_{x}V+\frac{1}{2}\textrm{tr}(FF^{\top}\nabla_{xx}^{2}V)$ , and then the value function $V$ satisfies the stochastic Hamilton-Jacobi-Bellman (HJB) equation as follows:

0=\min_{u}\{c(x,u)+\mathcal{L}_{(u)}[V](x)\}.

(5)

With the desirability function $Z(x,t)=[\exp(-V(x,t))/\lambda]$ and under the nonlinearity cancellation condition $\sigma\sigma^{\top}=\lambda R^{-1}$ , the linear-form optimal control action for continuous-time stochastic systems is obtained as

u^{*}(x,t)=\sigma\sigma^{\top}B^{\top}(x)\frac{\nabla_{x}Z(x,t)}{Z(x,t)},

(6)

and the corresponding transformed linear HJB equation takes the form of $0=\mathcal{L}[Z]-qZ$ , where $\mathcal{L}[Z]=f^{\top}\nabla_{x}Z+\frac{1}{2}\textrm{tr}(FF^{\top}\nabla_{xx}^{2}Z)$ .

II-A2 Multi-Agent Systems

For a networked multi-agent system governed by mutually independent passive dynamics, the index set of all the agents neighboring or adjacent to agent $i$ is denoted by $\mathcal{N}_{i}$ . The factorial subsystem for agent $i$ includes all the agents directly communicating with agent $i$ and the agent itself, and is denoted by $\bar{\mathcal{N}}_{i}:=\mathcal{N}_{i}\cup\{i\}$ , and the cardinality of set $\bar{\mathcal{N}}_{i}$ is denoted by $|\bar{\mathcal{N}}_{i}|$ . Consider the joint continuous-time dynamics for factorial subsystem $\bar{\mathcal{N}}_{i}$ as in [29]:

d\bar{x}_{i}=\bar{g}_{i}(\bar{x}_{i})dt+\bar{B}_{i}(\bar{x}_{i})[\bar{u}_{i}(\bar{x}_{i},t)dt+\bar{\sigma}_{i}d\bar{\omega}_{i}],

(7)

where the joint state vector is denoted by $\bar{x}_{i}=[x_{i}^{\top},x_{j\in\mathcal{N}_{i}}^{\top}]^{\top}\in\mathbb{R}^{M\cdot|\bar{\mathcal{N}}_{i}|}$ , the joint passive dynamics vector is denoted by $\bar{g}_{i}(\bar{x}_{i})=[g_{i}(x_{i})^{\top},g_{j\in\mathcal{N}_{i}}(x_{j})^{\top}]^{\top}\in\mathbb{R}^{M\cdot|\bar{\mathcal{N}}_{i}|}$ , the joint control matrix is denoted by $\bar{B}_{i}(\bar{x}_{i})=[B_{i}(x_{i})\hskip 2.84526pt\mathbf{0};\mathbf{0}\hskip 2.84526ptB_{j\in\mathcal{N}_{i}}(x_{j})]\in\mathbb{R}^{M\cdot|\bar{\mathcal{N}}_{i}|\times P\cdot|\bar{\mathcal{N}}_{i}|}$ , the joint control action vector is $\bar{u}_{i}(\bar{x}_{i},t)=[u_{i}(\bar{x}_{i},t)^{\top},u_{j\in\mathcal{N}_{i}}(\bar{x}_{i},t)^{\top}]^{\top}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|}$ , and $\bar{\omega}_{i}=[\omega_{i}^{\top},\omega_{j\in\mathcal{N}_{i}}^{\top}]^{\top}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|}$ is the joint noise vector with covariance matrix $\bar{\sigma}_{i}=\textrm{diag}\{\sigma_{i},\sigma_{j\in\mathcal{N}_{i}}\}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|\times P\cdot|\bar{\mathcal{N}}_{i}|}$ .

Let $\bar{\mathcal{I}}_{i}$ denote the set of interior joint states, and $\bar{\mathcal{B}}_{i}$ denote the set of boundary joint states. When $\bar{x}_{i}\in\bar{\mathcal{I}}_{i}$ , the running cost function of the joint states of $\bar{\mathcal{N}}_{i}$ is defined as:

c_{i}(\bar{x}_{i},\bar{u}_{i})=q_{i}(\bar{x}_{i})+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},t)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},t),

(8)

where $q_{i}(\bar{x}_{i})\geq 0$ is a joint-state-related cost, and $\bar{u}_{i}(\bar{x}_{i},t)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},t)$ is a control-quadratic term with $\bar{R}_{i}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|\times P\cdot|\bar{\mathcal{N}}_{i}|}$ being a positive definite matrix. When $\bar{x}_{i}^{t_{f}}\in\bar{\mathcal{B}}_{i}$ , the terminal cost function is denoted by $\phi_{i}(\bar{x}_{i}^{t_{f}})$ , where $t_{f}$ is the exit time and determined online in the first-exit problem. Specifically, when the central agent state $x_{i}^{t_{f}}\in\mathcal{B}_{i}$ , we also have the terminal cost function $\phi_{i}(x_{i}^{t_{f}})$ defined. As an extension of the single-agent case, the cost-to-go function $J_{i}^{\bar{u}_{i}}(\bar{x}_{i}^{t},t)$ for the first-exit problem under joint control action $\bar{u}_{i}$ can be defined as:

J_{i}^{\bar{u}_{i}}(\bar{x}_{i}^{t},t)=\mathbb{E}_{\bar{x}_{i}^{t},t}^{\bar{u}_{i}}[\phi_{i}(\bar{x}_{i}^{t_{f}})+\int_{t}^{t_{f}}c_{i}(\bar{x}_{i}(\tau),\bar{u}_{i}(\tau))d\tau].

(9)

The value function $V_{i}(\bar{x}_{i},t)$ is defined as the optimal cost-to-go function:

V_{i}(\bar{x}_{i},t)=\min_{\bar{u}_{i}}\mathbb{E}_{\bar{x}_{i}^{t},t}^{\bar{u}_{i}}[\phi_{i}(\bar{x}_{i}^{t_{f}})+\int_{t}^{t_{f}}c_{i}(\bar{x}_{i}(\tau),\bar{u}_{i}(\tau))d\tau],

(10)

and the value function is the expected cumulative running cost starting from joint state $\bar{x}_{i}$ and acting optimally thereafter.

Similar to the single-agent scenario, the desirability function over the joint state $\bar{x}_{i}$ is defined as

Z(\bar{x}_{i},t)=\exp[-V_{i}(\bar{x}_{i},t)/\lambda_{i}]

(11)

under the condition that $\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}=\lambda_{i}\bar{R}_{i}^{-1}$ is satisfied to cancel the nonlinear terms. The linear-form joint optimal control action for continuous-time stochastic networked MAS under aforementioned decentralization topology is derived in [29] and the result is in the form:

\bar{u}_{i}^{*}(\bar{x}_{i},t)=\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}(\bar{x}_{i})^{\top}\frac{\nabla_{\bar{x}_{i}}Z(\bar{x}_{i},t)}{Z(\bar{x}_{i},t)}.

(12)

Remark 1.

Only the central agent $i$ of the factorial subsystem $\bar{\mathcal{N}}_{i}$ samples or selects the local optimal control action $u_{i}^{*}(\bar{x}_{i},t)$ from the joint optimal control action $\bar{u}^{*}_{i}(\bar{x}_{i},t)$ according to the computation results on factorial subsystem $\bar{\mathcal{N}}_{i}$ .

II-B Stochastic control barrier function (SCBF) as a safety filter

Optimization-based control actions achieve the goal of cost minimization, but seldom provide rigorous guarantees on satisfying the commanded constraints, which can be problematic in some safety-critical scenarios. Control barrier functions (CBFs) have been widely used to describe certain desired known safe operating region and can guarantee the state-invariant property within the proposed set by filtering existing control actions. Recently, CBFs have also been investigated in stochastic systems in [41].

Definition 1 (zero-CBF for stochastic systems).

The function $h(x)$ serves as a zero-CBF (ZCBF) for a system described by the stochastic differential equation (SDE) (1), if for all $x$ satisfying $h(x)>0$ there exists a $u$ satisfying

\frac{\partial{h}}{\partial x}(g(x)+B(x)u)+\frac{1}{2}\textrm{tr}(\sigma^{\top}B^{\top}(x)(\frac{\partial^{2}h}{\partial x^{2}})B(x)\sigma)\geq-h(x).

(13)

We next introduce a lemma and a theorem modified from [41] constructing ZCBFs for high-degree stochastic systems, which our proposed safe control framework is built upon.

Lemma 1 (State-invariant guarantees by definition of the ZCBF).

Define the safe operating region $\mathcal{C}=\{x:h(x)\geq 0\}$ . If $u^{t}$ satisfies (13) for all $t$ , then $P_{r}(x^{t}\in\mathcal{C}\hskip 5.69054pt\forall t)=1$ , provided $x^{0}\in\mathcal{C}$ .

Theorem 1 (Construction of ZCBF for high-degree stochastic systems).

Define the ZCBFs for high-degree stochastic systems (1) as $h_{0}(x)=h(x)$ , and subsequently

	$\displaystyle h_{i+1}(x)$	$\displaystyle=\frac{\partial{h_{i}}}{\partial x}g(x)$
		$\displaystyle+\frac{1}{2}\textrm{tr}(\sigma^{\top}B^{\top}(x)(\frac{\partial^{2}h_{i}}{\partial x^{2}})B(x)\sigma)+h_{i}(x),$		(14)

$i=0,1,2,\dots$ , and define $\bar{\mathcal{C}}_{r}=\bigcap_{i=0}^{r}\mathcal{C}_{i}$ with $\mathcal{C}_{i}=\{x:h_{i}(x)\geq 0\}$ and $\mathcal{C}=\{x:h(x)\geq 0\}$ . Suppose that there exists $r$ such that, for any $x\in\bar{\mathcal{C}}_{r}$ , we have $\frac{\partial{h_{i}}}{\partial x}B(x)u\geq 0$ for $i<r$ and

\frac{\partial{h}_{r}}{\partial x}(g(x)+B(x)u)+\frac{1}{2}\textrm{tr}(\sigma^{\top}B^{\top}(x)(\frac{\partial^{2}h_{r}}{\partial x^{2}})B(x)\sigma)\geq-h_{r}(x).

(15)

Then $P_{r}(x^{t}\in\mathcal{C}\hskip 5.69054pt\forall t)=1$ if $x^{0}\in\bar{\mathcal{C}}_{r}$ .

Proof.

Provided $u^{t}$ satisfies (15), by Lemma 1, we have $x^{t}\in{\mathcal{C}}_{r}\hskip 5.69054pt\forall t$ if $x^{0}\in\mathcal{C}_{r}$ , which is equivalent to $h_{r}(x^{t})\geq 0\hskip 5.69054pt\forall t$ if $x^{0}\in\mathcal{C}_{r}$ . By the construction of the high-degree ZCBFs as in (1), we have

h_{r}(x)=\frac{\partial{h_{r-1}}}{\partial x}g(x)+\frac{1}{2}\textrm{tr}(\sigma^{\top}B^{\top}(\frac{\partial^{2}h_{r-1}}{\partial x^{2}})B\sigma)+h_{r-1}(x)\geq 0.

(16)

Since $\frac{\partial{h_{r-1}}}{\partial x}B(x)u\geq 0$ , once added to both sides of (16), we have

	$\displaystyle h_{r}(x)+\frac{\partial{h_{r-1}}}{\partial x}B(x)u$	$\displaystyle=\frac{\partial{h_{r-1}}}{\partial x}(g(x)+B(x)u)$
		$\displaystyle+\frac{1}{2}\textrm{tr}(\sigma^{\top}B^{\top}(\frac{\partial^{2}h_{r-1}}{\partial x^{2}})B\sigma)$
		$\displaystyle+h_{r-1}(x)\geq 0,$

which satisfies (13), and thus we have $x^{t}\in{\mathcal{C}}_{r-1}\hskip 5.69054pt\forall t$ if $x^{0}\in\mathcal{C}_{r-1}$ . By induction, we have

x^{t}\in\mathcal{C}_{i}\hskip 5.69054pt\forall t\hskip 5.69054pt\textrm{if}\hskip 5.69054ptx^{0}\in\mathcal{C}_{i},i=0,1,2,\dots,r,

which is equivalent to $x^{t}\in\mathcal{C}_{0}=\mathcal{C}\hskip 5.69054pt\forall t$ , if $x^{0}\in\bigcap_{i=0}^{r}\mathcal{C}_{i}=\bar{\mathcal{C}}_{r}$ . ∎

III Control Compositionality

Considering the optimal control action takes a linear form ((6) in the single-agent scenario and (12) in the multi-agent scenario), assume the component problems and the composite problem share certain dynamical information and satisfy certain conditions on the final cost; then a composite control action can be obtained analytically by combining existing component control actions.

Assume there are $F$ problems in MASs to solve, governed by identical dynamics (7), running cost rates (8), a set of interior joint states $\bar{\mathcal{I}}_{i}$ and boundary joint states $\bar{\mathcal{B}}_{i}$ , but differ at the final costs. For factorial subsystem $\bar{\mathcal{N}}_{i}$ , denote the final cost of the component problem $f$ on joint states $\bar{x}_{i}$ as $\phi_{i}^{\{f\}}(\bar{x}_{i}^{t_{f}})$ , and the corresponding desirability function as $Z^{\{f\}}(\bar{x}_{i}^{t_{f}},t_{f})$ . Suppose the composite final cost $\phi_{i}(\bar{x}_{i}^{t_{f}})$ satisfies

\phi_{i}(\bar{x}_{i}^{t_{f}})=-\log(\sum_{f=1}^{F}\omega_{i}^{\{f\}}\exp(-\phi_{i}^{\{f\}}(\bar{x}_{i}^{t_{f}}))),

(17)

for some set of weights $\omega_{i}^{\{f\}}$ . Then the composite desirability function can be computed on the boundary states $\bar{\mathcal{B}}_{i}$ as:

Z(\bar{x}_{i}^{t_{f}},t_{f})=\sum_{f=1}^{F}\omega_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i}^{t_{f}},t_{f}).

(18)

Since the desirability function solves a linear HJB equation after the exponential transformation (11), once the composite condition on the desirability function (18) holds on the boundary, it holds everywhere, and the compositionality of optimal control actions can be extended to a task-generalization setting, i.e., the component problems and composite problem furthermore have different terminal states. The result of control generalization is formulated in the next theorem. Such formulation is especially useful when the component control actions are analytically solvable but costly; then the control action solving the new task can be constructed by composition in a sample-free manner and is less computationally expensive.

Theorem 2 (Continuous-time MAS compositionality).

Suppose there are $F$ multi-agent LSOC problems in continuous-time on factorial subsystem $\bar{\mathcal{N}}_{i}$ with joint states $\bar{x}_{i}$ and central agent state $x_{i}$ , governed by the same joint dynamics (7), running cost rates (8), and the set of interior joint states $\overline{\mathcal{I}}_{i}$ , but various terminal costs $\phi_{i}^{\{f\}}$ and terminal joint states $\bar{x}_{i}^{d^{\{f\}}}$ . Define the terminal joint state for a new problem as $\bar{x}_{i}^{d}$ and the composition weights as

\bar{\omega}_{i}^{\{f\}}=\exp(-\frac{1}{2}(\bar{x}_{i}^{d}-\bar{x}_{i}^{d^{\{f\}}})^{\top}P(\bar{x}_{i}^{d}-\bar{x}_{i}^{d^{\{f\}}})),

(19)

with $P$ being a positive definite diagonal matrix representing the kernel widths. Suppose the terminal cost for the new problem satisfies

\phi_{i}(\bar{x}_{i}^{t_{f}})=-\log(\sum\nolimits_{f=1}^{F}\tilde{\omega}_{i}^{\{f\}}\exp(-\phi_{i}^{\{f\}}(\bar{x}_{i}^{t_{f}}))),

(20)

where $\bar{x}_{i}^{t_{f}}$ denotes the boundary joint states and $\tilde{\omega}_{i}^{\{f\}}=\frac{\bar{\omega}_{i}^{\{f\}}}{\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}}$ can be interpreted as the probability weights. The optimal control law solving the new problem is obtained by a weighted combination of the existing controllers

\bar{u}_{i}^{*}(\bar{x}_{i},t)=\sum\nolimits_{f=1}^{F}\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)\bar{u}_{i}^{*^{\{f\}}}(\bar{x}_{i},t),

(21)

with

\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)=\frac{\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)}{\sum\nolimits_{e=1}^{F}\tilde{\omega}_{i}^{\{e\}}Z^{\{e\}}(\bar{x}_{i},t)}.

(22)

Proof.

From composition of the terminal-cost function in continuous-time given by (30), we have the following composition relationship of desirability functions by definition and it holds on the boundary joint states:

Z(\bar{x}_{i}^{t_{f}},t)=\sum\nolimits_{f=1}^{F}\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i}^{t_{f}},t).

(23)

Since the desirability function solves a linear HJB equation, once condition (23) holds on the boundary, it holds everywhere and we have:

Z(\bar{x}_{i},t)=\sum\nolimits_{f=1}^{F}\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t).

(24)

For the composite problem, the joint optimal control action in the form of (12), can thus be reduced to

	$\displaystyle\quad\bar{u}_{i}^{*}(\bar{x}_{i},t)=\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}^{\top}(\bar{x}_{i})\frac{\nabla_{\bar{x}_{i}}Z(\bar{x}_{i},t)}{Z(\bar{x}_{i},t)}$
	$\displaystyle=\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}^{\top}(\bar{x}_{i})\frac{\nabla_{\bar{x}_{i}}\Big{[}{\sum\nolimits_{f=1}^{F}\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)}\Big{]}}{\sum\nolimits_{f=1}^{F}\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)}$
	$\displaystyle=\frac{\sum\nolimits_{f=1}^{F}\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}^{\top}(\bar{x}_{i})\nabla_{\bar{x}_{i}}\Big{[}\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)\Big{]}}{\sum\nolimits_{e=1}^{F}\tilde{\omega}_{i}^{\{e\}}Z^{\{e\}}(\bar{x}_{i},t)}$
	$\displaystyle=\frac{\sum\nolimits_{f=1}^{F}\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}^{\top}(\bar{x}_{i})Z^{\{f\}}(\bar{x}_{i},t)\nabla_{\bar{x}_{i}}\Big{[}\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)\Big{]}}{\sum\nolimits_{e=1}^{F}\tilde{\omega}_{i}^{\{e\}}Z^{\{e\}}(\bar{x}_{i},t)Z^{\{f\}}(\bar{x}_{i},t)}$
	$\displaystyle=\sum\limits_{f=1}^{F}{\frac{\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)}{\sum\limits_{e=1}^{F}\tilde{\omega}_{i}^{\{e\}}Z^{\{e\}}(\bar{x}_{i},t)}\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}^{\top}(\bar{x}_{i})\frac{\nabla_{\bar{x}_{i}}Z^{\{f\}}(\bar{x}_{i},t)}{Z^{\{f\}}(\bar{x}_{i},t)}}$
	$\displaystyle=\sum\nolimits_{f=1}^{F}\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)\bar{u}_{i}^{*^{\{f\}}}(\bar{x}_{i},t),$

with

\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)=\frac{\tilde{\omega}_{i}^{\{f\}}Z^{\{f\}}(\bar{x}_{i},t)}{\sum\nolimits_{e=1}^{F}\tilde{\omega}_{i}^{\{e\}}Z^{\{e\}}(\bar{x}_{i},t)}.

∎

IV Composition on Certified-safe Control Actions

We first propose a revised optimal stochastic control law incorporating ZCBF in a multi-agent setting, which can provide safety guarantees by enforcing certain conditions.

Theorem 3 (Safe and optimal control in MASs).

In continuous-time joint dynamics (7), for factorial subsystem $\bar{\mathcal{N}}_{i}$ with joint states $\bar{x}_{i}$ and central agent state $x_{i}$ , the desirability function is represented by $Z(\bar{x}_{i},t)=\exp[-V_{i}(\bar{x}_{i},t)/\lambda_{i}]$ with $\lambda_{i}\in\mathbb{R}$ and $V_{i}(\cdot,t)$ being the value function, and the joint optimal control action is given by

\bar{u}_{i}^{*}(\bar{x}_{i},t)=\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}^{\top}(\bar{x}_{i})\frac{\nabla_{\bar{x}_{i}}Z(\bar{x}_{i},t)}{Z(\bar{x}_{i},t)}.

(25)

As discussed in Section II-A2, only the central agent $i$ of each factorial subsystem $\bar{\mathcal{N}}_{i}$ selects or samples its local control action $u_{i}^{*}(\bar{x}_{i},t)$ from (25). Define the ZCBFs for high-degree multi-agent systems as $h_{0}(x_{i})=h(x_{i})$ , and subsequently

	$\displaystyle h_{k+1}(x_{i})$	$\displaystyle=\frac{\partial{h_{k}}}{\partial x_{i}}g_{i}(x_{i})$
		$\displaystyle+\frac{1}{2}\textrm{tr}(\sigma_{i}^{\top}B_{i}^{\top}(\frac{\partial^{2}h_{k}}{\partial{x_{i}}^{2}})B_{i}\sigma_{i})+h_{k}(x_{i}),$		(26)

$k=0,1,2,\dots$ , and define $\bar{\mathcal{C}}_{r,i}=\bigcap_{k=0}^{r}\mathcal{C}_{k,i}$ with $\mathcal{C}_{k,i}=\{x_{i}:h_{k}(x_{i})\geq 0\}$ and $\mathcal{C}_{i}=\{x_{i}:h(x_{i})\geq 0\}$ . The CBF constraint for the high-degree multi-agent system is given by

\frac{\partial{h}_{r}}{\partial x_{i}}(g_{i}(x_{i})+B_{i}(x_{i})u)+\frac{1}{2}\textrm{tr}(\sigma_{i}^{\top}B_{i}^{\top}\frac{\partial^{2}h_{r}}{\partial x_{i}^{2}}B_{i}\sigma_{i})\geq-h_{r}(x_{i}),

(27)

where $r$ is selected such that $\frac{\partial{h}_{k}}{\partial x_{i}}B_{i}(x_{i})u\geq 0$ holds for $k<r$ and (27) holds for $k=r$ . Then, the optimal control for the central agent $i$ of each factorial subsystem $\bar{\mathcal{N}}_{i}$ that can also guarantee the subsystem safety (state-invariant) within $\mathcal{C}_{i}$ is given by

	$\displaystyle{u}_{s,i}^{*}$	$\displaystyle=\operatorname{arg\,min}_{u\in\mathbb{R}^{P}}\\|{u}_{i}^{}-u\\|_{2},$		(28)
	s.t.	$\displaystyle\eqref{eq:cbf-high-deg-const}$

where $u_{i}^{*}$ is the optimal local control action for the central agent $i$ sampled from (25).

Proof.

The first part (optimal control law) of the theorem comes from the solution to a linearly solvable optimal control problem and is established in (12) of Section II-A2. However, simply applying the optimal control strategy may not render the system safe considering the system stochastic noise, since the control goal such as obstacle-avoidance is captured in the cost function in the form of soft constraints. For the second part, a constrained optimization framework is applied to both minimizing the difference towards the ideal optimal control sampled from (12) and enforcing the ZCBF constraints (27) for safety concerns. For high-degree systems, $\frac{\partial{h_{k}}}{\partial x_{i}}B_{i}(x_{i})=0$ for some $x_{i}$ when $k<r$ ; and $r$ is selected as the least positive integer such that $\frac{\partial{h_{r}}}{\partial x_{i}}B_{i}(x_{i})\neq 0$ and where $u$ starts to explicitly show up linearly in (27). Finally, if the condition (27) holds, the system state is invariant within $\mathcal{C}_{i}$ and guaranteed by Theorem 1. ∎

Furthermore, utilizing the linear compositionality of the optimal control action in multi-agent systems, along with the fact that the ZCBF constraints are affine in control, the above results can be extended to a task-generalization setting, where the optimal solution to a new control problem can be achieved by taking a weighted mixture of existing control actions. Meanwhile, the composite control action is certified-safe given by an additional step of post-composite constrained optimization using ZCBFs.

Theorem 4 (Generalization of safe and optimal control in MASs).

Suppose there are $F$ multi-agent LSOC problems in continuous-time on factorial subsystem $\bar{\mathcal{N}}_{i}$ with joint states $\bar{x}_{i}$ and central agent state $x_{i}$ , governed by the same joint dynamics (7), running cost rates (8), and the set of interior joint states $\bar{\mathcal{I}}_{i}$ , but various terminal costs $\phi_{i}^{\{f\}}$ and terminal joint states $\bar{x}_{i}^{d^{\{f\}}}$ . The safe and optimal control action $u_{s,i}^{*^{\{f\}}}(f=1,2,\dots,F)$ for central agent $i$ solving each single problem is computed via (28). Denote the terminal joint state for a new problem as $\bar{x}_{i}^{d}$ and define the composition weights as

\bar{\omega}_{i}^{\{f\}}=\exp(-\frac{1}{2}(\bar{x}_{i}^{d}-\bar{x}_{i}^{d^{\{f\}}})^{\top}P(\bar{x}_{i}^{d}-\bar{x}_{i}^{d^{\{f\}}})),

(29)

with $P$ being a positive definite diagonal matrix representing the kernel widths. Suppose the terminal cost for the new problem satisfies

\phi_{i}(\bar{x}_{i}^{t_{f}})=-\log(\sum\nolimits_{f=1}^{F}\tilde{\omega}_{i}^{\{f\}}\exp(-\phi_{i}^{\{f\}}(\bar{x}_{i}^{t_{f}}))),

(30)

where $\bar{x}_{i}^{t_{f}}$ denotes the boundary joint states and $\tilde{\omega}_{i}^{\{f\}}=\frac{\bar{\omega}_{i}^{\{f\}}}{\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}}$ . The optimal control that solves the new problem is directly computable through a weighted combination of the component problem control solutions:

u_{i}^{*}(\bar{x}_{i},t)=\sum_{f=1}^{F}\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)u_{s,i}^{*^{\{f\}}}(\bar{x}_{i},t)

(31)

with

\displaystyle\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)

\displaystyle=\frac{\tilde{\omega}_{i}^{\{f\}}Z_{i}^{\{f\}}(\bar{x}_{i},t)}{\sum_{e=1}^{F}\tilde{\omega}_{i}^{\{e\}}Z_{i}^{\{e\}}(\bar{x}_{i},t)},

where the local control action $u_{s,j}^{*^{\{f\}}}$ is only sampled or selected from the computed joint control $\bar{u}_{s,j}^{*^{\{f\}}}$ on factorial subsystem $\bar{\mathcal{N}}_{j}$ . Furthermore, the optimal control for central agent $i$ of each factorial subsystem $\bar{\mathcal{N}}_{i}$ that can solve the composite task and also guarantee the subsystem safety (state-invariant) within a desired safe set $\mathcal{C}_{i}=\{x_{i}:h(x_{i})\geq 0\}$ is given by

	$\displaystyle{u}_{s,i}^{*}$	$\displaystyle=\operatorname{arg\,min}_{u\in\mathbb{R}^{P}}\\|{u}_{i}^{}-u\\|_{2},$		(32)
	s.t.	$\displaystyle\eqref{eq:cbf-high-deg-const}$

where $u_{i}^{*}$ is computed using (31).

Proof.

The first part of the theorem is an immediate result from Theorem 2, where the safe control from Theorem 3 is applied as the primitives, and only the sampled central agent control is considered. Since (27) is affine in control $u$ on considered system dynamics, and each component problem control law solves a constrained optimization problem where the baseline optimal control is linearly solvable (details for the optimal control linear compositionality can also be found in [17]), the linearity property preserves and enables the compositionality of such safe and optimal control actions computed using Theorem 3. Furthermore, condition (27) applied to the composite control action from (31) ensures that the system is state-invariant within $\mathcal{C}_{i}$ by Theorem 1. ∎

V Simulation Results

We performed numerical simulations in Matlab for both single-agent systems (single UAV) and cooperative networked multi-agent systems (cooperative UAV team) to demonstrate the proposed results. Each UAV is described by the following continuous-time dynamics:

{\left(\begin{matrix}dx_{i}\\ dy_{i}\\ dv_{i}\\ d\varphi_{i}\end{matrix}\right)=\left(\begin{matrix}v_{i}\cos\varphi_{i}\\ v_{i}\sin\varphi_{i}\\ 0\\ 0\end{matrix}\right)dt+\begin{pmatrix}0&0\\ 0&0\\ 1&0\\ 0&1\end{pmatrix}\left[\left(\begin{matrix}u_{i}\\ w_{i}\end{matrix}\right)dt+\begin{pmatrix}\sigma_{i}&0\\ 0&\nu_{i}\end{pmatrix}d\omega_{i}\right],}

(33)

where $(x_{i},y_{i}),v_{i},\varphi_{i}$ denote the position coordinate, forward velocity and heading angle for UAV $i$ . The system state vector is $(x_{i},y_{i},v_{i},\varphi_{i})^{\top}$ ; the forward acceleration $u_{i}$ and angular velocity $w_{i}$ are the control inputs, and $\omega_{i}$ is the standard Brownian-motion disturbance. We specify the noise level as $\sigma_{i}=0.05$ and $\nu_{i}=0.025$ throughout the simulation.

In all the following experiments, we compute the baseline optimal control actions ((6) in the single-agent scenario and (25), (31) in the multi-agent scenario) in a path integral approximation framework, and obtain the constrained optimal control actions using the above theorems. In the following obstacle-avoidance tasks, each individual obstacle is described by an independent set of ZCBF constraints. We also explicitly check that the start point satisfies all the ZCBF constraints in place (especially $x^{0}\in\bar{\mathcal{C}}_{r}$ as in Theorem 1), which renders the proposed methodology applicable.

V-A Single-agent system experiments with the proposed control strategy

V-A1 Single-problem experiment

We first compare the performance between the filtered control actions with CBFs and the baseline optimal control actions in a simple case. The UAV is governed by the dynamics in (33) and is tasked to fly from the start $(5,5)$ towards a target $(35,20)$ while avoiding the obstacles. The running cost for the UAV is in the following form:

q(\mathbf{x})=\|(x,y)-(x^{t_{f}},y^{t_{f}})\|_{2}-d^{\max},

(34)

where $\|(x,y)-(x^{t_{f}},y^{t_{f}})\|_{2}$ computes the distance to the goal position for the UAV, and $d^{\max}$ denotes the distance between the initial position and target position for the UAV. The obstacles are incorporated by greater state-related cost values (i.e. $q(x)=160$ ) in the baseline optimal control action computation. Here, we utilize CBFs for enhanced safety and construct a set of CBFs for the description of one circular obstacle centered at $(x_{c},y_{c})$ with a radius of $r_{c}$ as follows:

$\displaystyle h_{0}(x)$	$\displaystyle=h(x)=(x-x_{c})^{2}+(y-y_{c})^{2}-(r_{c}+D_{s})^{2},$	(35)
$\displaystyle h_{1}(x)$	$\displaystyle=\frac{\partial{h_{0}}}{\partial x}g(x)+\frac{1}{2}\textrm{tr}(\sigma^{\top}B^{\top}(\frac{\partial^{2}h_{0}}{\partial x^{2}})B\sigma)+h_{0}(x)$
	$\displaystyle=2(x-x_{c})v\cos\phi+2(y-y_{c})v\sin\phi$
	$\displaystyle+(x-x_{c})^{2}+(y-y_{c})^{2}-(r_{c}+D_{s})^{2},$	(36)

where $g(x),B,\sigma$ are defined in the dynamics (1) and $D_{s}$ is a user-defined safety margin.

The simulation runs for 10 times independently, under both the safe optimal control actions and the ideal optimal control actions, respectively. The execution trajectories are shown in Fig. 2, where the obstacles are denoted by the filled circles, the trajectories under the safe optimal control are denoted by the green lines, and the trajectories under the ideal optimal control are denoted by the red lines. As Fig. 2 shows, the performance of the ideal optimal control is not always uniform on avoiding the lowest obstacle and relies on fine parameter-tuning. However, the trajectories using the safe optimal control actions can always guarantee a safe margin surrounding the obstacles throughout all the simulations. Meanwhile, the safety margin can be tuned as a hyper-parameter for achieving the least conservative and feasible trajectory.

V-A2 Task-generalization experiment

In this example, we consider two component problems seeking safe optimal control actions under the identical dynamics (33), running cost rates (34), set of interior states, but have different final costs and terminal states. The final costs take the form of $\phi=\frac{d}{2}(|x-x_{d}|+c)+\alpha$ , where the two problems have different cost parameters $c,d,\alpha$ . The first problem ensures that the UAV starting from $(5,5)$ reaches the upper target $(35,28)$ , while the second problem ensures that the UAV starting from $(5,5)$ reaches the lower target $(35,14)$ . Further, we consider how these safe control actions can help with solving a new problem aiming at reaching the target $(35,21)$ . Here, we take a weighted mixture of safe and optimal control actions solving the component problems and obtain the optimal solution constrained by the ZCBFs. The execution trajectories for the component problems and the composite problem are shown in Fig. 3.

In Fig. 3, the executive trajectories of the two component problems are denoted by the dashed lines, and the executive trajectory running the composite safe optimal control action is denoted by the solid line. The targets of the two component tasks are denoted by the cross markers. As Fig. 3 shows, all these trajectories can avoid the obstacles with sufficient safety margins. The component problem solution can lead the UAV to the targeted goal accurately. However, the safe optimal control action by composition leads the UAV to a target within some acceptable error range, denoted by the red solid circle.

V-B Evaluation of the safe and optimal control law for a single task in a networked MAS under various safety margins

For the networked MAS simulations, we consider a cooperative UAV team as illustrated in Fig. 4, where UAVs 1 and 2 fly cooperatively (distance-minimized) and UAV 3 flies independently towards the goal joint states while avoiding the obstacles. According to the factorization introduced in Section II-A2, the joint states of the three factorial subsystems are $\bar{\mathbf{x}}_{1}=[\mathbf{x}_{1};\mathbf{x}_{2}]^{\top},\bar{\mathbf{x}}_{2}=[\mathbf{x}_{1};\mathbf{x}_{2};\mathbf{x}_{3}]^{\top},\bar{\mathbf{x}}_{3}=[\mathbf{x}_{2};\mathbf{x}_{3}]^{\top}$ , where $\mathbf{x}_{i}=[x_{i};y_{i};v_{i};\varphi_{i}]^{\top}$ and $(x_{i},y_{i}),v_{i},\varphi_{i}$ denote the position coordinate, forward velocity and heading angle for UAV $i$ . The system joint dynamics can be described by (7). The coordination between UAVs is considered by the running cost in the following form:

$\displaystyle q_{1}(\bar{\mathbf{x}}_{1})$	$\displaystyle=0.7\cdot(\\|(x_{1},y_{1})-(x_{1}^{t_{f}},y_{1}^{t_{f}})\\|_{2}-d_{1}^{\textrm{max}})$	(37)
	$\displaystyle\hskip 35.0pt+1.4\cdot(\\|(x_{1},y_{1})-(x_{2},y_{2})\\|_{2}-d_{12}^{\textrm{max}}),$
$\displaystyle q_{2}(\bar{\mathbf{x}}_{2})$	$\displaystyle=0.7\cdot(\\|(x_{2},y_{2})-(x_{2}^{t_{f}},y_{2}^{t_{f}})\\|_{2}-d_{2}^{\textrm{max}})$	(38)
	$\displaystyle\hskip 35.0pt+1.4\cdot(\\|(x_{2},y_{2})-(x_{1},y_{1})\\|_{2}-d_{21}^{\textrm{max}}),$
$\displaystyle q_{3}(\bar{\mathbf{x}}_{3})$	$\displaystyle=\\|(x_{3},y_{3})-(x_{3}^{t_{f}},y_{3}^{t_{f}})\\|_{2}-d_{3}^{\textrm{max}},$	(39)

where $\|(x_{i},y_{i})-(x_{i}^{t_{f}},y_{i}^{t_{f}})\|_{2}$ calculates the distance to the goal position for UAV $i$ , $\|(x_{i},y_{i})-(x_{j},y_{j})\|_{2}$ calculates the distance between UAVs $i$ and $j$ , $d_{i}^{\max}$ denotes the distance between the initial position and target position for UAV $i$ , and $d_{ij}^{\max}$ denotes the initial distance between UAVs $i$ and $j$ . The cost parameters and coefficients can be tuned for better performance and algorithm stability.

We first evaluate optimization-based control actions from (25) on the networked MAS, and run 10 independent simulations. The trajectories for the UAV team are shown in Fig. 5, where trajectories of UAVs 1, 2 and 3 are denoted by the red, blue, and green lines, respectively. The two obstacles are represented by the filled circles. The start point for UAV $2$ is labeled by ‘S2’ in Fig. 5. As the figure shows, all UAVs can reach the appointed targets. However, although the control performance can be improved by tuning the obstacle state costs and running cost coefficients, no guarantees on the obstacle-avoiding goal is achievable. Among the observed runs, there are some cases when UAV 3 collides with the lower obstacle due to the stochastic noise.

We further extend the experiments using the control actions subject to the CBF constraints according to (28). The CBFs are designed similarly as (35) and (36). We run 10 independent simulations and the executive trajectories are shown in Fig. 6. Similar to the case under ideal optimal control actions, all UAVs can reach the target states, and the coordination between UAVs 1 and 2 is achieved. Furthermore, although every single trajectory may differ much due to the stochastic noise, all trajectories of each single UAV can avoid the placed obstacles with some safety margins.

Compared with the case of using ideal optimal control actions without CBFs, where no margin can be achieved surrounding the obstacles for the trajectories, the safety margin achieved from Theorem 3 can further be tuned. We evaluate the relationship between the commanded margins in CBF design (e.g. $D_{s}$ parameter in (36)) and the achieved minimum distance to the nearest obstacles for all UAVs throughout the simulations, and a quantitative illustration result is given in Fig. 8. We can observe that the UAVs running unfiltered optimal control actions fail to meet the safety margin, where UAVs under safe optimal control actions can avoid all the obstacles and the minimum distance to both obstacles is larger than the threshold set in the CBF design in (35) and (36).

V-C Performance of the composite safe and optimal control law generalizing to a new task in a networked MAS

We further evaluate the proposed safe optimal control strategy in a task-generalization setting on the cooperative UAV team in Fig. 7. The five UAVs work in three groups, where UAVs 1 and 2, 4 and 5 fly cooperatively (distance-minimized) and UAV 3 flies independently towards the goal while avoiding some obstacles. We consider two component problems, subject to identical joint dynamics (7), joint running costs (8), and set of interior joint states $\bar{\mathcal{I}}_{i}$ for factorial subsystem $\bar{\mathcal{N}}_{i}$ , but different final costs and terminal joint states. In the two problems, the target position for all the UAVs are $(35,28)$ and $(35,14)$ , respectively. In each problem, the safe optimal control leading the UAV team to the target is obtained according to (28), and the execution trajectories are demonstrated in Fig. 9, where the trajectories of UAVs 1,2, 3, 4 and 5 are denoted by the red, blue, green, magenta and cyan lines, respectively. The target positions in the two different problems are labeled by the stars.

Once the component problem safe optimal control action is obtained, a weighted mixture on the primitives specified by (31) can be taken to achieve the composite safe optimal control action, and a further constrained optimization step according to (32) will ensure that the resulting optimal control action is safe-guaranteed. The execution trajectories of the UAV team using the composite safe control action to solve a new problem are shown in Fig. 10, where the red solid circle demonstrates an allowable error range. As Fig. 10 shows, under the filtered composite safe optimal control action, all the UAVs can avoid the obstacles with suitable safe margins, and UAVs 1 and 2, 4 and 5 can cooperate. They can also reach the target position but subject to some error.

Similar to what Fig. 3 illustrates, beyond the impact of the stochastic noise, the composition is not exactly accurate here since the desirability function solving the transformed linear HJB equation discussed in Section II-A is time-dependent for continuous-time systems, and the composition on time is not achievable. Furthermore, though each component controller by itself is in the feedback-control form, the composite control is not performing feedback on the new continuous states and thus fails to have the state error-compensating capacity; it is also fragile to the external noise. However, by running several simulations and selecting the optimal local control action for each agent independently, the obtained performance, as illustrated in Fig. 10, is satisfying, and the terminal error is controllable. Also, the composite safe control action proposed to solve the new problem using Theorem 4 is obtained in a sample-free manner by taking a weighted mixture of primitives and getting filtered by the CBF constraints. It is worth to apply especially in the case when each component problem solution can be solved analytically but is expensive to compute, when consideration of effort in solving a new problem dominates the control precision.

VI Conclusion

In this paper, we developed a framework of safe generalization of optimal control utilizing control barrier functions (CBFs) and linearly-solvable property in stochastic system control, in both single-agent and cooperative networked multi-agent system cases. The proposed control action simultaneously ensures optimality and guarantees safety by enforcing the CBF constraints, while minimizing the difference away from the ideal optimal control. Considering the linearity of considered CBF constraints and compositionality of linear-solvable optimal control (LSOC), we further extend the safe optimal control framework to a task generalization setting, where a weighted mixture of computed actions for component problems is taken to solve a new problem. The safety of such composite control action is incorporated by additional CBF constraints as a filter after the composition. The composite safe optimal control action is obtained in a sample-free manner and thus is less computationally-expensive, while the safety property can be reserved by the additional CBF constraint. We evaluate the proposed approach on numerical simulations of a single UAV and two cooperative UAV teams with obstacle-avoidance and target-reaching goals. The constructed composition control law can drive the teamed UAV to a new target within some acceptable error range. The error can be explained by the system stochastic noise and the error in composing a time-dependent desirability function of continuous-time dynamics. We hope the proposed strategy can be applied in scenarios when the computational cost of the component task safe optimal control solution dominates the precision of the implemented composite control. Future work will consider the safety guarantees on the control action generalization capability on networked MASs under incomplete information where not every state information is measurable, and such scenario can simulate the real-world environment with high-fidelity. Another direction of research for future work consists of leveraging images and high-dimensional sensor data, and introducing perception-based closed-loop control to the current framework.

References

[1] D. E. Kirk, Optimal control theory: an introduction. Courier Corporation, 2004.
[2] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[3] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification, and adaptive control. SIAM, 2015.
[4] H. J. Kappen, “Linear theory for control of nonlinear stochastic systems,” Physical Review Letters, vol. 95, no. 20, p. 200201, 2005.
[5] V. D. Blondel and J. N. Tsitsiklis, “A survey of computational complexity results in systems and control,” Automatica, vol. 36, no. 9, pp. 1249–1274, 2000.
[6] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic programming: an overview,” in Proceedings of 1995 34th IEEE conference on decision and control, vol. 1. IEEE, 1995, pp. 560–564.
[7] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” The Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.
[8] ——, “Reinforcement learning of motor skills in high dimensions: A path integral approach,” in 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 2397–2403.
[9] W. B. Powell and J. Ma, “A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications,” Journal of Control Theory and Applications, vol. 9, no. 3, pp. 336–352, 2011.
[10] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[11] E. Todorov, “Efficient computation of optimal actions,” Proceedings of National Academy of Sciences, vol. 106, no. 28, pp. 11 478–11 483, 2009.
[12] K. Dvijotham and E. Todorov, “Linearly solvable optimal control,” Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, vol. 17, pp. 119–141, 2012.
[13] ——, “A unified theory of linearly solvable optimal control,” Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), vol. 1, 2011.
[14] Y. Pan, E. A. Theodorou, and M. Kontitsis, “Sample efficient path integral control under uncertainty,” in Advances in Neural Information Processing Systems, vol. 2, 2015, pp. 2314 – 2322.
[15] E. Todorov, “Compositionality of optimal control laws,” in Advances in Neural Information Processing Systems, 2009, pp. 1856–1864.
[16] U. Muico, J. Popović, and Z. Popović, “Composite control of physically simulated characters,” ACM Transactions on Graphics (TOG), vol. 30, no. 3, pp. 1–11, 2011.
[17] L. Song, N. Wan, A. Gahlawat, N. Hovakimyan, and E. A. Theodorou, “Compositionality of linearly solvable optimal control in networked multi-agent systems,” in 2021 American Control Conference (ACC). IEEE, 2021, pp. 1334–1339.
[18] R. Mahony, V. Kumar, and P. Corke, “Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor,” IEEE Robotics and Automation Magazine, vol. 19, no. 3, pp. 20–32, 2012.
[19] V. Cichella, R. Choe, S. B. Mehdi, E. Xargay, N. Hovakimyan, V. Dobrokhodov, I. Kaminer, A. M. Pascoal, and A. P. Aguiar, “Safe coordinated maneuvering of teams of multirotor unmanned aerial vehicles: A cooperative control framework for multivehicle, time-critical missions,” IEEE Control Systems Magazine, vol. 36, no. 4, pp. 59–82, 2016.
[20] S. Wilson, P. Glotfelter, L. Wang, S. Mayya, G. Notomista, M. Mote, and M. Egerstedt, “The Robotarium: Globally impactful opportunities, challenges, and lessons learned in remote-access, distributed control of multirobot systems,” IEEE Control Systems Magazine, vol. 40, no. 1, pp. 26–44, 2020.
[21] M. Zheng, C.-L. Liu, and F. Liu, “Average-consensus tracking of sensor network via distributed coordination control of heterogeneous multi-agent systems,” IEEE Control Systems Letters, vol. 3, no. 1, pp. 132–137, 2018.
[22] Y. Liu, L. Liu, and W.-P. Chen, “Intelligent traffic light control using distributed multi-agent Q learning,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–8.
[23] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of Markov decision processes,” Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, 2002.
[24] E. A. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic programming for partially observable stochastic games,” in AAAI, vol. 4, 2004, pp. 709–715.
[25] J. Pajarinen and J. Peltonen, “Periodic finite state controllers for efficient POMDP and DEC-POMDP planning,” in Advances in Neural Information Processing Systems, vol. 24, 2011, pp. 2636–2644.
[26] F. A. Oliehoek, M. T. Spaan, and N. Vlassis, “Optimal and approximate Q-value functions for decentralized POMDPs,” Journal of Artificial Intelligence Research, vol. 32, pp. 289–353, 2008.
[27] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in International Conference on Machine Learning. PMLR, 2018, pp. 5872–5881.
[28] B. Van Den Broek, W. Wiegerinck, and B. Kappen, “Graphical model inference in optimal control of stochastic multi-agent systems,” Journal of Artificial Intelligence Research, vol. 32, pp. 95–122, 2008.
[29] N. Wan, A. Gahlawat, N. Hovakimyan, E. A. Theodorou, and P. G. Voulgaris, “Distributed algorithms for linearly-solvable optimal control in networked multi-agent systems,” arXiv preprint arXiv:2102.09104, 2021.
[30] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning-based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.
[31] T. Holicki, C. W. Scherer, and S. Trimpe, “Controller design via experimental exploration with robustness guarantees,” IEEE Control Systems Letters, vol. 5, no. 2, pp. 641–646, 2020.
[32] P. Tabuada, Verification and control of hybrid systems: a symbolic approach. Springer Science & Business Media, 2009.
[33] S. Mitra, T. Wongpiromsarn, and R. M. Murray, “Verifying cyber-physical interactions in safety-critical systems,” IEEE Security & Privacy, vol. 11, no. 4, pp. 28–37, 2013.
[34] A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016.
[35] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 3420–3431.
[36] M. Srinivasan, A. Dabholkar, S. Coogan, and P. A. Vela, “Synthesis of control barrier functions using a supervised machine learning approach,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 7139–7145.
[37] A. Robey, H. Hu, L. Lindemann, H. Zhang, D. V. Dimarogonas, S. Tu, and N. Matni, “Learning control barrier functions from expert demonstrations,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 3717–3724.
[38] J. Choi, F. Castaneda, C. J. Tomlin, and K. Sreenath, “Reinforcement learning for safety-critical control under model uncertainty, using control lyapunov functions and control barrier functions,” arXiv preprint arXiv:2004.07584, 2020.
[39] A. D. Ames, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs with application to adaptive cruise control,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 6271–6278.
[40] Y. Chen, A. Singletary, and A. D. Ames, “Guaranteed obstacle avoidance for multi-robot operations with limited actuation: A control barrier function approach,” IEEE Control Systems Letters, vol. 5, no. 1, pp. 127–132, 2020.
[41] A. Clark, “Control barrier functions for stochastic systems,” Automatica, vol. 130, p. 109688, 2021.
[42] ——, “Control barrier functions for complete and incomplete information stochastic systems,” in 2019 American Control Conference (ACC). IEEE, 2019, pp. 2928–2935.
[43] I. R. Manchester and J.-J. E. Slotine, “Control contraction metrics: Convex and intrinsic criteria for nonlinear feedback design,” IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 3046–3053, 2017.
[44] A. Gahlawat, A. Lakshmanan, L. Song, A. Patterson, Z. Wu, N. Hovakimyan, and E. A. Theodorou, “Contraction L1-adaptive control using gaussian processes,” in Learning for Dynamics and Control. PMLR, 2021, pp. 1027–1040.
[45] A. Lakshmanan, A. Gahlawat, and N. Hovakimyan, “Safe feedback motion planning: A contraction theory and L1-adaptive control based approach,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 1578–1583.