
Online Control Barrier Functions
for Decentralized Multi-Agent Navigation

Zhan Gao, Guang Yang and Amanda Prorok Zhan Gao, Guang Yang and Amanda Prorok are with the Department of Computer Science and Technology, University of Cambridge, CB3 0FD (email: zg292@cam.ac.uk; gy268@cam.ac.uk; asp45@cam.ac.uk). This work was supported by ERC Project 949940 (gAIa).
Abstract

Control barrier functions (CBFs) enable guaranteed safe multi-agent navigation in the continuous domain. The resulting navigation performance, however, is highly sensitive to the underlying hyperparameters. Traditional approaches consider fixed CBFs (where parameters are tuned a priori), and hence, typically do not perform well in cluttered and highly dynamic environments: conservative parameter values can lead to inefficient agent trajectories, or even failure to reach goal positions, whereas aggressive parameter values can lead to infeasible controls. To overcome these issues, in this paper, we propose online CBFs, whereby hyperparameters are tuned in real-time, as a function of what agents perceive in their immediate neighborhood. Since the explicit relationship between CBFs and navigation performance is hard to model, we leverage reinforcement learning to learn CBF-tuning policies in a model-free manner. Because we parameterize the policies with graph neural networks (GNNs), we are able to synthesize decentralized agent controllers that adjust parameter values locally, varying the degree of conservative and aggressive behaviors across agents. Simulations as well as real-world experiments show that (i) online CBFs are capable of solving navigation scenarios that are infeasible for fixed CBFs, and (ii) that they improve navigation performance by adapting to other agents and changes in the environment.

I INTRODUCTION

Multi-agent systems are ideally suited to tackle spatially distributed tasks, for which safe and efficient motion planning is a key enabling foundation [1, 2, 3]. In the context of multi-agent navigation, model-based approaches often assume full knowledge of the environment and system dynamics, and require designing explicit objective functions and well-tuned hyperparameters prior to agent deployment. Data-driven approaches are able to work with partially observed environments and complex system dynamics that are difficult to model, but often sacrifice safety and convergence guarantees. This work aims to find a middle ground that leverages the advantages of both approaches.

In this paper, we focus on designing decentralized controllers for multi-agent navigation with dynamical constraints. Different from classic path-finding problems [4, 5, 6], we generate feedback control inputs and perform collision avoidance in continuous space. Specifically, the problem of multi-agent navigation with convergence and safety guarantees can be formulated as a sequence of real-time optimization problems by using control barrier functions (CBFs) and control Lyapunov functions (CLFs), where the former allow agents to move safely without collision and the latter guide them towards target states. The combination of CBFs and CLFs has been widely used for safety-critical controls [7], [8]. Although CBFs provide safety guarantees, traditional approaches require manually setting parameters within CBF constraints [9, 7, 8, 10]. This may yield overly conservative trajectories with strong CBF constraints or overly aggressive trajectories with relaxed ones, both of which could lead to controller infeasibility, i.e., no admissible control exists, and agents are hence unable to steer to their destinations. Moreover, such issues occur more frequently when the environment is cluttered with moving agents and an increasing number of obstacles, because the number of CBF constraints scales with that of agents and obstacles. Many existing works preset CBF parameters before deployment and fix the latter throughout the navigation procedure [11, 12, 13]. This requires resetting CBF parameters in each new environment, and makes multi-agent systems incapable of operating in dynamic environments where agent configurations and obstacle constellations vary across time. Hence, we aim to develop methods that capture time-varying environment states and tune CBF parameters in real time as a function of these states.

Instead of hand-tuning and fixing CBF parameters at the outset, we propose a methodology that tunes the latter based on agent and obstacle states in a real-time and decentralized manner. The goal is to find an optimal sequence of time-varying CBF parameters that adapts to new environment configurations, and that varies the degree of conservative and aggressive behaviors across agents (striking a balance that aids in trajectory deconfliction). Due to the challenge of explicitly modeling the relationship between CBFs and navigation performance, we parameterize the CBF-tuning policy with graph neural networks (GNNs) and learn the latter with model-free reinforcement learning (RL). Thanks to the inherently distributed nature of GNNs [14, 15, 16, 17], the resulting policy allows for a decentralized implementation, i.e., it can be executed by each agent locally with only neighborhood information, yielding an efficient and scalable solution.

Related work. There are two main groups of CBF-based techniques in multi-agent control: model-based [18, 19, 20, 21] and data-driven approaches [22, 23, 24]. Model-based approaches require full knowledge of the environment and fix CBF parameters a priori. The CBF constraints are affine in the control variable, which allows formulating a quadratic program (QP) controller that provides safety guarantees for navigation. In contrast, data-driven approaches directly approximate CBFs with neural networks [25, 26, 27]. However, these approaches sacrifice safety guarantees in the process. The work in [28] combines model-based and data-driven approaches by learning a backup CBF to enhance safety. Regarding the feasibility of CBFs, [29] designs a feasibility-guaranteed controller for traffic-merging problems. The work in [30] studies the feasibility of a CBF-based model predictive controller (MPC) in a discrete-time setting, while [12] introduces a decaying term paired with CBFs to improve the feasibility of the MPC. Moreover, [31] extends the class $\mathcal{K}$ function of CBFs for forward invariance in continuous time. A more closely related work, [11], develops an SVM classifier to filter out infeasible CBF parameters and reduce the search space for finding optimal CBF parameters. However, it considers single-agent scenarios and the selected CBF parameters are fixed during navigation. To the best of our knowledge, none of the aforementioned works update CBF parameters online based on changes in the agents’ locally perceived environment.

Contributions. Our contributions are as follows:

  1. We propose an online safety-critical framework that adapts CBFs to dynamic environments in a decentralized manner. It inherits safety guarantees from traditional CBFs and facilitates the feasibility of the controller, due to the online tuning of CBF parameters.

  2. We parameterize the CBF-tuning policy with GNNs and conduct training with model-free RL. The former allows for a decentralized implementation, while the latter overcomes the challenge of explicitly modeling the relationship between CBFs and navigation performance.

  3. We validate our approach with numerical simulations and real-world experiments in various environment configurations. The results show that online CBFs can handle navigation scenarios that fail with fixed CBFs (even when we perform an exhaustive parameter search).

II PRELIMINARIES

We introduce preliminaries about system dynamics, CLFs and CBFs in decentralized multi-agent navigation.

System dynamics. Consider a multi-agent system with $N$ agents $\mathcal{A}=\{A_i\}_{i=1}^N$ in a 2-D environment with $M$ static obstacles $\{O_j\}_{j=1}^M$. The agent dynamics take the form of

$\dot{\mathbf{x}}_{i}=f(\mathbf{x}_{i})+g(\mathbf{x}_{i})\mathbf{u}_{i},$  (1)

where $\mathbf{x}_i \in \mathbb{R}^n$ is the internal state, $\mathbf{u}_i \in \mathbb{R}^m$ the control input, $\dot{\mathbf{x}}_i$ the derivative of $\mathbf{x}_i$ w.r.t. time $t$, and $f(\mathbf{x}_i)$, $g(\mathbf{x}_i)$ the flow vectors, for $i=1,\ldots,N$. Each agent $A_i$ has a sensing range $\sigma \in \mathbb{R}^+$ that provides partial observability of the entire environment, i.e., the states of the other agents $\{\mathbf{x}_j\}_{j\in\mathcal{N}_i}$ and the positions of the obstacles $\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_i}$ within the neighborhood of radius $\sigma$, where $\mathcal{N}_i$ is the neighbor set of $A_i$ – see Fig. 1. We consider decentralized control policies

$\pi_{i}\big(\mathbf{u}_{i}\,\big|\,\mathbf{x}_{i},\{\mathbf{x}_{j}\}_{j\in\mathcal{N}_{i}},\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_{i}}\big),\quad\text{for}\ i=1,\ldots,N$  (2)

that drive agents from initial states $\mathbf{X}^{(0)} := \{\mathbf{x}_i^{(0)}\}_{i=1}^N$ to target states $\mathbf{X}^d := \{\mathbf{x}_i^d\}_{i=1}^N$ with local neighborhood information.
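For concreteness, the following minimal Python sketch computes the neighbor sets $\mathcal{N}_i$ that feed the policies (2) under the disk-sensing assumption; the function and variable names are illustrative rather than part of our implementation.

```python
import numpy as np

def neighbor_sets(agent_pos, obstacle_pos, sigma):
    """For each agent, collect the indices of agents and obstacles
    within its sensing radius sigma (the neighbor set N_i)."""
    N = len(agent_pos)
    neighbors = []
    for i in range(N):
        d_agents = np.linalg.norm(agent_pos - agent_pos[i], axis=1)
        d_obst = np.linalg.norm(obstacle_pos - agent_pos[i], axis=1)
        nbr_agents = [j for j in range(N) if j != i and d_agents[j] <= sigma]
        nbr_obst = [l for l in range(len(obstacle_pos)) if d_obst[l] <= sigma]
        neighbors.append((nbr_agents, nbr_obst))
    return neighbors
```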

Control Lyapunov function (CLF). A CLF is designed to encode the goal-reaching requirement, i.e., the satisfaction of CLF constraints guarantees that agents converge to their target states. We define the exponentially-stabilizing CLF that ensures exponential convergence as follows [32].

Definition 1

Given the system dynamics (1) of agent $A_i$, a positive definite, continuously differentiable function $V_i(\mathbf{x}_i): \mathbb{R}^n \to \mathbb{R}$ is an exponentially-stabilizing CLF if there exists a positive constant $\epsilon$ such that for any $\mathbf{x}_i \in \mathbb{R}^n$,

$\inf_{\mathbf{u}_{i}\in\mathbb{U}_{i}}\big[\pounds_{f}V_{i}(\mathbf{x}_{i})+\pounds_{g}V_{i}(\mathbf{x}_{i})\mathbf{u}_{i}+\epsilon V_{i}(\mathbf{x}_{i})\big]\leq 0,$  (3)

where $\pounds_f V_i(\mathbf{x}_i) := \frac{\partial V_i(\mathbf{x}_i)}{\partial \mathbf{x}_i} f(\mathbf{x}_i)$ is the Lie derivative of $V_i(\mathbf{x}_i)$ [33] and $\mathbb{U}_i$ is the control space of agent $A_i$.

Figure 1: Agent $A_i$ communicates with the other agents and senses obstacles within its sensing range $\sigma$. In this sketch, there are two CBF constraints w.r.t. the other agents, and one w.r.t. obstacle $O_1$.

Control barrier function (CBF). A CBF is designed to avoid static obstacles as well as to prevent collisions among moving agents. It ensures forward invariance of the state trajectory, i.e., if an agent starts within the safety set, it remains within the safety set for all time [34]. Specifically, we encode the safety requirement of agent $A_i$ in a smooth function $h_i(\mathbf{x}_i): \mathbb{R}^n \to \mathbb{R}$, whose derivative w.r.t. time is given by

$\dot{h}_{i}(\mathbf{x}_{i})=\pounds_{f}h_{i}(\mathbf{x}_{i})+\pounds_{g}h_{i}(\mathbf{x}_{i})\mathbf{u}_{i},$  (4)

where $\pounds_f h_i(\mathbf{x}_i) := \frac{\partial h_i(\mathbf{x}_i)}{\partial \mathbf{x}_i} f(\mathbf{x}_i)$ and $\pounds_g h_i(\mathbf{x}_i) := \frac{\partial h_i(\mathbf{x}_i)}{\partial \mathbf{x}_i} g(\mathbf{x}_i)$ are the Lie derivatives of $h_i(\mathbf{x}_i)$. We define the higher-order CBF with relative degree one as follows [35].

Definition 2

Given the system dynamics (1) of agent $A_i$, a differentiable function $h_i: \mathbb{R}^n \to \mathbb{R}$ is a higher-order CBF with relative degree one if there exists a strictly increasing function $\alpha_i: \mathbb{R}^+ \to \mathbb{R}^+$ such that for any $\mathbf{x}_i \in \mathbb{R}^n$,

$h_{i}(\mathbf{x}_{i})\geq 0,\quad\dot{h}_{i}(\mathbf{x}_{i})+\alpha_{i}\big(h_{i}(\mathbf{x}_{i})\big)\geq 0.$  (5)

Here, $\alpha_i(\cdot)$ is referred to as a class $\mathcal{K}$ function for agent $A_i$, which is determined by the function parameters $\eta$ and $\zeta$. More details are given in Section IV-A.
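As an illustration, the class $\mathcal{K}$ function used in Section IV-A takes the form $\alpha_i(h) = \zeta h^{\eta}$; a minimal sketch follows (names illustrative), with the comment on the role of $\zeta$ reflecting the behavior observed in our experiments.

```python
def class_k(h, zeta, eta):
    """Candidate class-K function alpha(h) = zeta * h**eta for h >= 0.
    Larger zeta relaxes the CBF constraint (more aggressive behavior),
    smaller zeta tightens it (more conservative behavior)."""
    assert h >= 0.0, "the barrier value is assumed nonnegative"
    return zeta * h ** eta
```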

III PROBLEM FORMULATION

We first formulate the problem of decentralized multi-agent navigation with CLFs for state convergence and CBFs for safety guarantees. We then propose the problem of online CBF optimization, which generates time-varying CBFs based on instantaneously sensed states to optimize performance.

III-A Decentralized Multi-Agent Navigation

Assume agents and obstacles are disk-shaped with radii $\{R_i\}_{i=1}^N$ and $\{R_\ell\}_{\ell=1}^M$, respectively. Let $\{\mathbf{p}_i\}_{i=1}^N$, $\{\mathbf{v}_i\}_{i=1}^N$ and $\{\mathbf{d}_i\}_{i=1}^N$ be the positions, velocities and destinations of the agents $\mathcal{A}$, which are determined by the internal states $\{\mathbf{x}_i\}_{i=1}^N$, and let $\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell=1}^M$ be the obstacle positions. The goal is to move the agents towards their destinations while avoiding collisions in a decentralized manner. Destination convergence is equivalent to state convergence, i.e., $\lim_{t\to T}\mathbf{x}^{(t)}_i = \mathbf{x}^d_i$ with $T$ the maximal time step, for $i=1,\ldots,N$. Collision avoidance is equivalent to the safety constraints on the agent states, i.e.,

$\mathcal{C}^{(t)}_{i,\mathrm{a}}=\{\mathbf{x}^{(t)}_{i}\in\mathbb{R}^{n}\;|\;\|\mathbf{p}^{(t)}_{i}-\mathbf{p}^{(t)}_{j}\|\geq\tfrac{R_{i}+R_{j}}{2},\ j\neq i\},$  (6)
$\mathcal{C}^{(t)}_{i,\mathrm{o}}=\{\mathbf{x}^{(t)}_{i}\in\mathbb{R}^{n}\;|\;\|\mathbf{p}^{(t)}_{i}-\mathbf{p}^{(t)}_{\ell,\mathrm{o}}\|\geq\tfrac{R_{i}+R_{\ell}}{2},\ \ell=1,\dots,M\},$  (7)

where $\|\cdot\|$ is the vector norm. This allows us to formulate the problem of multi-agent navigation as follows.

Problem 1 (Decentralized Multi-Agent Navigation)

Given the multi-agent system $\mathcal{A}$ with dynamics (1), the initial states $\{\mathbf{x}_i^{(0)}\}_{i=1}^N$ and the target states $\{\mathbf{x}_i^d\}_{i=1}^N$ satisfying $\{f(\mathbf{x}_i^d) = 0\}_{i=1}^N$, find decentralized policies $\{\pi_i: \mathbb{R}^{|\mathcal{N}_i|\times n} \to \mathbb{R}^m\}_{i=1}^N$ [cf. (2)] such that for $i=1,\ldots,N$,

$\lim_{t\to T}\ \|\mathbf{x}_{i}^{(t)}-\mathbf{x}_{i}^{d}\|=0,$  (8)
$\text{s.t.}\ \ \mathbf{x}_{i}^{(t)}\in\mathcal{C}^{(t)}_{i,\mathrm{a}}\cap\mathcal{C}^{(t)}_{i,\mathrm{o}}=\mathcal{C}^{(t)}_{i}.$  (9)

The condition (8) guarantees state convergence, i.e., navigation, and the condition (9) guarantees state safety, i.e., collision avoidance. Problem 1 is challenging because (i) decentralized policies generate control inputs with only local neighborhood information; and (ii), safety sets can be non-convex and time-varying, depending on moving agents.
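For concreteness, a minimal sketch that checks the safety condition (9) for a single agent, written directly from (6)-(7); all names are illustrative.

```python
import numpy as np

def in_safety_set(i, agent_pos, agent_radii, obstacle_pos, obstacle_radii):
    """Check whether agent i satisfies the safety constraints (6)-(7),
    i.e., whether x_i lies in C_i = C_{i,a} ∩ C_{i,o}."""
    p_i, R_i = agent_pos[i], agent_radii[i]
    safe_agents = all(
        np.linalg.norm(p_i - agent_pos[j]) >= (R_i + agent_radii[j]) / 2
        for j in range(len(agent_pos)) if j != i
    )
    safe_obstacles = all(
        np.linalg.norm(p_i - obstacle_pos[l]) >= (R_i + obstacle_radii[l]) / 2
        for l in range(len(obstacle_pos))
    )
    return safe_agents and safe_obstacles
```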

We propose to solve Problem 1 with a CBF-CLF based quadratic programming (QP) controller. Specifically, at each time step $t$, we can formulate a QP problem as

$\begin{aligned}
\min_{\mathbf{u}^{(t)}_{i}\in\mathbb{U}_{i},\,\delta_{i}^{(t)}\in\mathbb{R}} \quad & \|\mathbf{u}_{i}^{(t)}\|_{2}+\xi(\delta_{i}^{(t)})^{2} \\
\text{s.t.} \quad & \pounds_{f}h_{i,j}(\mathbf{x}_{i}^{(t)})+\pounds_{g}h_{i,j}(\mathbf{x}_{i}^{(t)})\mathbf{u}_{i}^{(t)}+\alpha_{i}(h_{i,j}(\mathbf{x}_{i}^{(t)}))\geq 0, \quad \text{for all}\ j=0,\dots,C_{i}, \\
& \pounds_{f}V_{i}(\mathbf{x}_{i}^{(t)})+\pounds_{g}V_{i}(\mathbf{x}_{i}^{(t)})\mathbf{u}_{i}^{(t)}+\epsilon V_{i}(\mathbf{x}_{i}^{(t)})+\delta_{i}^{(t)}\leq 0, \\
& \mathbf{x}_{i}^{(t)}\in\mathcal{C}_{i}^{(t)}, \quad \text{for all}\ i=1,\dots,N,
\end{aligned}$  (10)

where $\xi \in \mathbb{R}^+$ is a penalty weight for the slack variable $\delta_i \in \mathbb{R}$, selected based on how strictly the CLF needs to be enforced, and $C_i$ is the number of CBF constraints based on agent $A_i$'s perceived agents and obstacles. For example, in Fig. 1 there are two CBF constraints w.r.t. the other agents and one w.r.t. the obstacle. The control input is bounded as $\mathbf{u}_i^{(t)} \in \mathbb{U}_i$ given the physical constraints of agent $A_i$. The QP is solved at every time step to generate $\mathbf{u}_i^{(t)}$ until task completion. The class $\mathcal{K}$ function $\alpha_i(\cdot)$ in the CBF constraint determines how strictly we want to enforce safety, and therefore changes the agent behavior to be either conservative or aggressive.
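As an illustration, a minimal CVXPY sketch of one per-agent instance of (10), assuming the Lie-derivative terms have already been evaluated at the current state; the box bound on the control and the default weight are illustrative stand-ins for $\mathbb{U}_i$ and $\xi$.

```python
import cvxpy as cp
import numpy as np

def cbf_clf_qp(Lf_h, Lg_h, alpha_h, Lf_V, Lg_V, eps_V, u_max, xi=10.0):
    """Solve one instance of the QP (10) for a single agent.
    Lf_h, alpha_h: length-C_i arrays (one entry per CBF constraint);
    Lg_h: (C_i, m) array; Lf_V, eps_V: scalars; Lg_V: (m,) array."""
    m = Lg_V.shape[0]
    u = cp.Variable(m)
    delta = cp.Variable()
    constraints = [Lf_h + Lg_h @ u + alpha_h >= 0,        # CBF constraints
                   Lf_V + Lg_V @ u + eps_V + delta <= 0,  # relaxed CLF constraint
                   cp.abs(u) <= u_max]                    # box stand-in for U_i
    prob = cp.Problem(cp.Minimize(cp.norm(u, 2) + xi * cp.square(delta)),
                      constraints)
    prob.solve()
    if prob.status not in ("optimal", "optimal_inaccurate"):
        return None  # no admissible control at this step
    return u.value
```

If the solver reports infeasibility, no admissible control exists at this time step, which is exactly the failure mode discussed in Section III-B.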

III-B Online CBF Optimization

The CBF-CLF-QP controller solves problem (10) to generate a sequence of control inputs $\{\mathbf{u}_i^{(t)}\}_{t=0}^T$ for each agent $A_i$, providing goal-reaching convergence and safety guarantees. However, these guarantees no longer hold when problem (10) becomes unsolvable at some time step $t$ during navigation, i.e., when there is no feasible solution given the CBF and CLF constraints at time step $t$. In this circumstance, an agent remains safe at its current state but stops progressing towards its destination due to the lack of feasible control inputs, resulting in the failure of multi-agent navigation.

Specifically, given the system dynamics (1), CBF constraints (5) and CLF constraints (3), the super-level set of agent $A_i$ that satisfies the constraints in problem (10) at time step $t$ is

$\mathbb{U}^{(t)}_{i,\mathrm{CBF},\mathrm{CLF}}:=\left\{\mathbf{u}_{i}^{(t)}\;\middle|\;\begin{aligned}&\pounds_{g}h_{i,j}(\mathbf{x}_{i}^{(t)})\mathbf{u}_{i}^{(t)}\geq-\pounds_{f}h_{i,j}(\mathbf{x}_{i}^{(t)})-\alpha_{i}\big(h_{i,j}(\mathbf{x}_{i}^{(t)})\big),\ \text{for all}\ j,\\&-\pounds_{g}V_{i}(\mathbf{x}_{i}^{(t)})\mathbf{u}_{i}^{(t)}\geq\pounds_{f}V_{i}(\mathbf{x}_{i}^{(t)})+\epsilon V_{i}(\mathbf{x}_{i}^{(t)})+\delta_{i}^{(t)}\end{aligned}\right\}.$  (11)

By combining (11) with the physical constraints $\mathbb{U}_i$, the space of feasible solutions of agent $A_i$ is given by

$\mathbb{U}_{i}\cap\mathbb{U}^{(t)}_{i,\mathrm{CBF},\mathrm{CLF}},\quad\text{for}\ i=1,\ldots,N.$  (12)

For conservative CBFs, the resulting constraints are strict and there may be no feasible solution in $\mathbb{U}^{(t)}_{i,\mathrm{CBF},\mathrm{CLF}}$, s.t. $\mathbb{U}_i \cap \mathbb{U}^{(t)}_{i,\mathrm{CBF},\mathrm{CLF}} = \varnothing$. For aggressive CBFs, the resulting constraints are relaxed and the agents may be too close to the obstacles or each other. In these cases, the QP controller may generate control inputs that require sudden changes beyond the agent's physical capability $\mathbb{U}_i$, s.t. $\mathbb{U}_i \cap \mathbb{U}^{(t)}_{i,\mathrm{CBF},\mathrm{CLF}} = \varnothing$. Both scenarios lead to the infeasibility of problem (10) and, thus, navigation failure, indicating an inherent trade-off that is defined by the CBF constraints – see Figs. 2(d)-2(e) for examples.

The aforementioned issue is exacerbated when the environment becomes cluttered with increasing numbers of agents and obstacles, which makes it challenging to hand-tune CBFs. Furthermore, fixing CBFs during navigation may not effectively handle the dynamic nature of the environment with moving agents, and even well-tuned CBFs could suffer from performance degradation with environment changes. These observations motivate the use of time-varying CBFs based on instantaneously sensed states, to tune agents’ conservative and aggressive behavior. We refer to the latter as online CBFs. Define decentralized CBF-tuning policies as

$\pi_{i,\mathrm{CBF}}\big(\alpha_{i}\,\big|\,\mathbf{x}_{i},\{\mathbf{x}_{j}\}_{j\in\mathcal{N}_{i}},\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_{i}}\big),\quad\text{for}\ i=1,\ldots,N,$  (13)

which generate the class $\mathcal{K}$ function $\alpha_i(\cdot)$, i.e., the CBFs, based on local neighborhood information. At each time step $t$, a new class $\mathcal{K}$ function $\alpha_i^{(t)}(\cdot)$ is generated for agent $A_i$ and passed into the CBFs for solving (10) to compute the control input $\mathbf{u}_i^{(t)}$. Given any objective function $F(\{\pi_{i,\mathrm{CBF}}\}_{i=1}^N, \mathbf{X}^{(0)}, \mathbf{X}^d)$ that represents the navigation performance, the initial states $\mathbf{X}^{(0)}$ and the target states $\mathbf{X}^d$, we can formulate the problem of online CBF optimization.

Problem 2 (Online CBF optimization)

Given the initial states $\mathbf{X}^{(0)}$ and target states $\mathbf{X}^d$, find decentralized CBF-tuning policies $\{\pi_{i,\mathrm{CBF}}\}_{i=1}^N$ [cf. (13)] that generate online CBFs with local neighborhood information, to guarantee the feasibility of problem (10) and maximize the objective function $F(\{\pi_{i,\mathrm{CBF}}\}_{i=1}^N, \mathbf{X}^{(0)}, \mathbf{X}^d)$.

The CBF-tuning policy conducts online CBF adjustments based on the local state of a dynamic environment, which provides control feasibility where fixed CBFs cannot, while maintaining safety guarantees. The generated time-varying CBFs strike a balance between conservative and aggressive behaviors among different agents. For scenarios where agent trajectories are in conflict (e.g., several agents need to navigate through a narrow space), this yields an inherent prioritization among agents and provides deconfliction for agent trajectories – see Fig. 3(a) for a demonstration.

IV METHODOLOGY

In this section, we specify the decentralized CBF-CLF-QP controller to solve Problem 1 and leverage model-free reinforcement learning with decentralized GNNs to solve Problem 2. We consider a linear system for each agent $A_i$, with the following system dynamics

$\begin{bmatrix}\dot{p}_{i,1}\\ \dot{p}_{i,2}\end{bmatrix}=\begin{bmatrix}0&0\\ 0&0\end{bmatrix}\begin{bmatrix}p_{i,1}\\ p_{i,2}\end{bmatrix}+\begin{bmatrix}1&0\\ 0&1\end{bmatrix}\begin{bmatrix}u_{i,1}\\ u_{i,2}\end{bmatrix},$  (14)

where $\mathbf{p}_i = [p_{i,1}, p_{i,2}]^\top$ is the position and $\mathbf{u}_i = [u_{i,1}, u_{i,2}]^\top$ is the control input of agent $A_i$, for $i=1,\ldots,N$.

IV-A CBF-CLF-QP Controller

Given the destination $\mathbf{d}_i = [d_{i,1}, d_{i,2}]^\top$ of agent $A_i$, define a Lyapunov function candidate as $V_i(\mathbf{x}_i) = (p_{i,1} - d_{i,1})^2 + (p_{i,2} - d_{i,2})^2$ and the CLF constraint as

$2(\mathbf{p}_{i}-\mathbf{d}_{i})^{\top}\mathbf{u}_{i}+\epsilon V_{i}(\mathbf{x}_{i})+\delta_{i}\leq 0$  (15)

for $i=1,\ldots,N$. Define barrier function candidates as

$\begin{aligned}h_{i,j,\mathrm{a}}(\mathbf{x}_{i})&=(p_{i,1}-p_{j,1})^{2}+(p_{i,2}-p_{j,2})^{2}-(R_{i}+R_{j})^{2},\\ h_{i,\ell,\mathrm{o}}(\mathbf{x}_{i})&=(p_{i,1}-p_{\ell,1,\mathrm{o}})^{2}+(p_{i,2}-p_{\ell,2,\mathrm{o}})^{2}-(R_{i}+R_{\ell})^{2},\end{aligned}$  (16)

where $h_{i,j,\mathrm{a}}$ is w.r.t. collision avoidance between the agents $A_i$ and $A_j$, and $h_{i,\ell,\mathrm{o}}$ is w.r.t. collision avoidance between agent $A_i$ and obstacle $O_\ell$ with $\mathbf{p}_{\ell,\mathrm{o}} = [p_{\ell,1,\mathrm{o}}, p_{\ell,2,\mathrm{o}}]^\top$ the obstacle position. The resulting CBF constraints are

$\begin{aligned}&2(\mathbf{p}_{i}-\mathbf{p}_{j})^{\top}\mathbf{u}_{i}-2(\mathbf{p}_{i}-\mathbf{p}_{j})^{\top}\dot{\mathbf{p}}_{j}+\zeta_{i,\mathrm{a}}\big(h_{i,j,\mathrm{a}}(\mathbf{x}_{i})\big)^{\eta_{i,\mathrm{a}}}\geq 0,\\&2(\mathbf{p}_{i}-\mathbf{p}_{\ell})^{\top}\mathbf{u}_{i}+\zeta_{i,\mathrm{o}}\big(h_{i,\ell,\mathrm{o}}(\mathbf{x}_{i})\big)^{\eta_{i,\mathrm{o}}}\geq 0,\end{aligned}$  (17)

where $\zeta_{i,\mathrm{a}}, \eta_{i,\mathrm{a}}$ are the CBF parameters of agent $A_i$ w.r.t. the other agents and $\zeta_{i,\mathrm{o}}, \eta_{i,\mathrm{o}}$ are those w.r.t. the obstacles. In this context, each agent has two sets of CBF parameters, one for the other agents and one for the obstacles. We can then specify the CBF-CLF-QP controller by substituting the CLF constraint (15) and the CBF constraints (17) into problem (10).
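For concreteness, a minimal sketch that assembles the CLF term (15) and the CBF terms (17) for the single-integrator dynamics (14), in the form consumed by the QP sketch of Section III-A (with $\pounds_f V_i = 0$ for (14)); the function and argument names are illustrative.

```python
import numpy as np

def single_integrator_constraints(p_i, d_i, nbr_pos, nbr_vel, nbr_radii, R_i,
                                  obs_pos, obs_radii, zeta, eta, eps):
    """Assemble the CLF constraint (15) and the CBF constraints (17) for
    agent i under (14). zeta = (zeta_a, zeta_o), eta = (eta_a, eta_o).
    Assumes the barrier values h are nonnegative along the trajectory."""
    # CLF: 2 (p_i - d_i)^T u_i + eps * V_i + delta <= 0
    Lg_V = 2.0 * (p_i - d_i)
    eps_V = eps * np.sum((p_i - d_i) ** 2)

    Lf_h, Lg_h, alpha_h = [], [], []
    # Agent-agent CBFs: the moving-neighbor term -2(p_i - p_j)^T pdot_j
    # enters as a known offset, folded into Lf_h here.
    for p_j, v_j, R_j in zip(nbr_pos, nbr_vel, nbr_radii):
        h = np.sum((p_i - p_j) ** 2) - (R_i + R_j) ** 2
        Lg_h.append(2.0 * (p_i - p_j))
        Lf_h.append(-2.0 * (p_i - p_j) @ v_j)
        alpha_h.append(zeta[0] * h ** eta[0])
    # Agent-obstacle CBFs: static obstacles, so no drift offset.
    for p_l, R_l in zip(obs_pos, obs_radii):
        h = np.sum((p_i - p_l) ** 2) - (R_i + R_l) ** 2
        Lg_h.append(2.0 * (p_i - p_l))
        Lf_h.append(0.0)
        alpha_h.append(zeta[1] * h ** eta[1])
    return np.array(Lf_h), np.array(Lg_h), np.array(alpha_h), Lg_V, eps_V
```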

IV-B Reinforcement Learning

The CBFs are determined by the class $\mathcal{K}$ function $\alpha_i(\cdot)$ with parameters $\zeta_{i,\mathrm{a}}, \eta_{i,\mathrm{a}}$ for the other agents and $\zeta_{i,\mathrm{o}}, \eta_{i,\mathrm{o}}$ for the obstacles. This indicates that we can learn CBFs by learning the CBF parameters $\boldsymbol{\zeta}_i = [\zeta_{i,\mathrm{a}}, \zeta_{i,\mathrm{o}}]^\top$, $\boldsymbol{\eta}_i = [\eta_{i,\mathrm{a}}, \eta_{i,\mathrm{o}}]^\top$ at each agent $A_i$. Since it is challenging to explicitly model the relationship between CBF parameters and navigation performance, we formulate Problem 2 in the RL domain and learn CBF-tuning policies in a model-free manner.

We start by defining a partially observable Markov decision process. At each time $t$, the agents are described by their states $\mathbf{X}^{(t)} = \{\mathbf{x}_i^{(t)}\}_{i=1}^N$. Each agent $A_i$ observes its local state $\mathbf{x}_i^{(t)}$, communicates with its neighboring agents, and senses its neighboring obstacles to collect the neighborhood information $\{\mathbf{x}_j^{(t)}\}_{j\in\mathcal{N}_i}$ and $\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_i}$. The CBF-tuning policy $\pi_{i,\mathrm{CBF}}$ generates the CBF parameters $\boldsymbol{\zeta}_i^{(t)}$ and $\boldsymbol{\eta}_i^{(t)}$, i.e., it is a distribution over $\boldsymbol{\zeta}_i^{(t)}$, $\boldsymbol{\eta}_i^{(t)}$ conditioned on $\mathbf{x}_i^{(t)}$, $\{\mathbf{x}_j^{(t)}\}_{j\in\mathcal{N}_i}$, $\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_i}$. The CBF parameters $\boldsymbol{\zeta}_i^{(t)}$, $\boldsymbol{\eta}_i^{(t)}$ are fed into the QP controller (10), which generates the control action $\mathbf{u}_i^{(t)}$ that drives the local state $\mathbf{x}_i^{(t)}$ to $\mathbf{x}_i^{(t+1)}$ based on the agent's dynamics (14). The reward function $r_i(\mathbf{X}^{(t)})$ represents the instantaneous navigation performance of agent $A_i$ at time $t$ and consists of two components: (i) the navigation reward $r_{i,\mathrm{nav}}$ and (ii) the QP's feasibility reward $r_{i,\mathrm{infs}}$, i.e.,

$r_{i}^{(t)}(\mathbf{X}^{(t)})=r_{i,\mathrm{nav}}^{(t)}(\mathbf{X}^{(t)})+\beta_{i}\,r_{i,\mathrm{infs}}^{(t)}(\mathbf{X}^{(t)}),$  (18)

where $\beta_i$ is a regularization parameter. The first term represents the task-relevant performance of agent $A_i$, while the second term corresponds to the feasibility of the QP controller (10) with the generated CBF parameters, e.g., it penalizes the scenario where the QP controller has no feasible solution due to overly conservative or aggressive CBFs. The total reward of the agents is $r^{(t)} = \sum_{i=1}^N r_i^{(t)}$. With a discount factor $\gamma$ that accounts for future rewards, the expected discounted reward can be represented as

$R\big(\mathbf{X}^{(0)},\mathbf{X}^{d},\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell=1}^{M}\,\big|\,\{\pi_{i,\mathrm{CBF}}\}_{i=1}^{N}\big)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r^{(t)}\Big],$  (19)

where $\mathbb{E}[\cdot]$ is w.r.t. the CBF-tuning policies. The expected discounted reward in (19) corresponds to the objective function in Problem 2, which transforms the problem into the RL domain. By parameterizing the policies $\{\pi_{i,\mathrm{CBF}}\}_{i=1}^N$ with information processing architectures $\{\boldsymbol{\Phi}_i(\mathbf{x}_i^{(t)}, \{\mathbf{x}_j^{(t)}\}_{j\in\mathcal{N}_i}, \{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_i}, \boldsymbol{\theta}_i)\}_{i=1}^N$ of parameters $\{\boldsymbol{\theta}_i\}_{i=1}^N$, the goal is to learn optimal parameters $\{\boldsymbol{\theta}_i^*\}_{i=1}^N$ that maximize $R(\mathbf{X}^{(0)}, \mathbf{X}^d, \{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell=1}^M | \{\boldsymbol{\theta}_i\}_{i=1}^N)$. We solve the latter by updating $\{\boldsymbol{\theta}_i\}_{i=1}^N$ through policy gradient ascent.
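For concreteness, a minimal sketch that forms the per-agent reward (18), the team reward, and a Monte-Carlo estimate of the discounted return (19) from one episode; the array shapes are illustrative assumptions.

```python
import numpy as np

def discounted_return(nav_rewards, infs_rewards, beta, gamma):
    """nav_rewards, infs_rewards: arrays of shape (T, N) holding
    r_{i,nav}^{(t)} and r_{i,infs}^{(t)}; beta: per-agent weights (N,)."""
    per_agent = nav_rewards + beta * infs_rewards  # r_i^{(t)}, cf. (18)
    team = per_agent.sum(axis=1)                   # r^{(t)} = sum_i r_i^{(t)}
    discounts = gamma ** np.arange(len(team))
    return float(np.sum(discounts * team))         # estimate of (19)
```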

Figure 2: (a) Agent trajectories with online CBFs generated by the GNN-based policy. Green circles are initial positions, blue squares are goal positions, and grey circles are obstacles. Green-to-blue lines are agent trajectories and the color bar represents the time scale. (b)-(c) Time-varying CBF parameters $\zeta_{3,\mathrm{a}}$, $\zeta_{3,\mathrm{o}}$ and $\eta_{3,\mathrm{a}}$, $\eta_{3,\mathrm{o}}$ of agent $A_3$ w.r.t. the other agents and the obstacles. The vertical lines in the top plots of (b)-(c) represent the maximal and minimal values of the time-varying CBF parameters. (d) Agent trajectories with the minimal fixed CBF parameters (i.e., the most conservative case). The agents $A_2$, $A_3$ and $A_4$ have overly conservative CBFs and their controllers have no feasible solution. (e) Agent trajectories with the maximal fixed CBF parameters (i.e., the most aggressive case). The agents $A_3$ and $A_4$ have overly aggressive CBFs and get stuck before the narrow passage between $O_3$ and $O_4$, where controllers have no feasible solution.

IV-C Graph Neural Networks

We parameterize CBF-tuning policies with GNNs, which allow for decentralized execution. They are inherently permutation equivariant (independent of agent ordering), and hence, generalize to unseen agent constellations [36, 37, 38].

Motivated by the observation that CBFs need only relative information (e.g., relative positions between agents and obstacles) [cf. (17)], we design a translation-invariant GNN that leverages message passing mechanisms to generate CBF parameters from relative information. For each agent $A_i$ with local state $\mathbf{x}_i$, the states of neighboring agents $\{\mathbf{x}_j\}_{j\in\mathcal{N}_i}$ and the positions of neighboring obstacles $\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_i}$, it generates CBF parameters with the message aggregation functions $\mathcal{F}_{\mathrm{m,a}}, \mathcal{F}_{\mathrm{m,o}}$ and the feature update function $\mathcal{F}_{\mathrm{u}}$ as

$\begin{aligned}(\boldsymbol{\zeta}_{i}^{(t)},\boldsymbol{\eta}_{i}^{(t)})&=\boldsymbol{\Phi}_{i}(\mathbf{x}_{i}^{(t)},\{\mathbf{x}_{j}^{(t)}\}_{j\in\mathcal{N}_{i}},\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell\in\mathcal{N}_{i}},\boldsymbol{\theta}_{i})\\&=\mathcal{F}_{\mathrm{u}}\Big(\sum_{j\in\mathcal{N}_{i}}\mathcal{F}_{\mathrm{m,a}}(\mathbf{x}_{j}-\mathbf{x}_{i})+\sum_{\ell\in\mathcal{N}_{i}}\mathcal{F}_{\mathrm{m,o}}(\mathbf{p}_{\ell,\mathrm{o}}-\mathbf{p}_{i})\Big),\end{aligned}$  (20)

where $\boldsymbol{\theta}_i$ are the function parameters of $\mathcal{F}_{\mathrm{m,a}}, \mathcal{F}_{\mathrm{m,o}}$ and $\mathcal{F}_{\mathrm{u}}$. By sharing $\mathcal{F}_{\mathrm{m,a}}, \mathcal{F}_{\mathrm{m,o}}$ and $\mathcal{F}_{\mathrm{u}}$ over all agents, we have $\boldsymbol{\theta}_1 = \cdots = \boldsymbol{\theta}_N$ and thus $\boldsymbol{\Phi}_1 = \cdots = \boldsymbol{\Phi}_N$.
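As an illustration, a minimal PyTorch sketch of the message-passing policy (20); the hidden sizes, input dimensions and output mapping are illustrative choices, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class CBFTuningGNN(nn.Module):
    """Translation-invariant policy of (20): F_m,a and F_m,o embed relative
    agent states and obstacle positions, their sum is passed to F_u,
    which outputs the CBF parameters (zeta_i, eta_i)."""
    def __init__(self, state_dim=4, hidden=64, n_params=4):
        super().__init__()
        self.f_ma = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.f_mo = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.f_u = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_params))

    def forward(self, x_i, x_nbrs, p_i, p_obs):
        # Relative information only: (x_j - x_i) and (p_l - p_i), cf. (20).
        msg = self.f_ma(x_nbrs - x_i).sum(dim=0)
        if p_obs.shape[0] > 0:
            msg = msg + self.f_mo(p_obs - p_i).sum(dim=0)
        out = self.f_u(msg)
        # In practice (zeta, eta) would be mapped to their admissible
        # positive ranges, e.g., via a softplus and rescaling.
        zeta, eta = out[:2], out[2:]
        return zeta, eta
```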

The GNN-based policy has the following properties:

  1. Decentralized execution: The functions $\mathcal{F}_{\mathrm{m,a}}, \mathcal{F}_{\mathrm{m,o}}, \mathcal{F}_{\mathrm{u}}$ require only neighborhood information and the policy can be executed in a decentralized manner.

  2. Translation invariance: The policy uses relative information and is invariant to translations in $\mathbb{R}^2$.

  3. Permutation equivariance: The functions $\mathcal{F}_{\mathrm{m,a}}, \mathcal{F}_{\mathrm{m,o}}, \mathcal{F}_{\mathrm{u}}$ are shared across agents and the aggregation is order-independent, so the policy is equivariant to permutations (i.e., agent reorderings).

V EXPERIMENTS

We evaluate our approach in this section. First, we conduct a proof of concept with four agents and four obstacles. Then, we show how our approach solves navigation scenarios that are infeasible with fixed CBFs. Next, we show the generalization of our approach in scenarios with more obstacles. Lastly, we report results from real-world experiments.

V-A Proof of Concept

We consider an environment shown in Fig. 2(a). The agents have radius 0.15 m, and are initialized randomly in the top region of the workspace and tasked towards goal positions in the bottom region. The obstacles are of radius 0.5 m, and are distributed between the initial and goal positions of agents.

Implementation details. The agents are represented by positions $\{\mathbf{p}_i\}_{i=1}^N$ and velocities $\{\mathbf{v}_i\}_{i=1}^N$, and the obstacles by positions $\{\mathbf{p}_{\ell,\mathrm{o}}\}_{\ell=1}^M$. At each time step, each agent generates desired CBF parameters with its local policy based on neighborhood information, and feeds the latter into the QP controller to generate the feasible velocity towards its destination. An episode ends if all agents reach their destinations or the episode times out. The sensing range, i.e., the communication radius, is 2 m, the maximal velocity is 0.5 m per time step in each direction, the maximal time step is 500, and the time interval is 0.05 s. At time $t$, the reward is defined as

$r_{i}^{(t)}=\Big(\frac{\mathbf{p}_{i}^{(t)}-\mathbf{d}_{i}}{\|\mathbf{p}_{i}^{(t)}-\mathbf{d}_{i}\|_{2}}\cdot\frac{\mathbf{v}_{i}^{(t)}}{\|\mathbf{v}_{i}^{(t)}\|_{2}}\Big)\|\mathbf{v}_{i}^{(t)}\|_{2}+r^{(t)}_{\mathrm{QP}},$  (21)

where the first term rewards fast movement towards the destination and the second term represents the infeasibility penalty of the QP controller. The message aggregation and feature update functions of the GNN are multi-layer perceptrons (MLPs), and the training is conducted with PPO [39].
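As an illustration, a minimal sketch of the per-step reward: it implements the progress term of (21) as the speed projected onto the unit vector towards the destination (so that, as described above, motion towards the goal is rewarded), plus an illustrative infeasibility penalty standing in for $r^{(t)}_{\mathrm{QP}}$; the direction convention and penalty value are assumptions.

```python
import numpy as np

def step_reward(p, d, v, qp_feasible, infeasibility_penalty=-1.0):
    """Per-step reward in the spirit of (21): projected progress towards
    the destination plus a penalty when the QP controller is infeasible."""
    to_goal = (d - p) / (np.linalg.norm(d - p) + 1e-8)
    progress = float(to_goal @ v)
    r_qp = 0.0 if qp_feasible else infeasibility_penalty
    return progress + r_qp
```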

Figure 3: Agent trajectories with online CBFs generated by the GNN-based tuning policy and with optimal fixed CBFs selected by exhaustive grid-search in different infeasible scenarios. (a) Narrow Passage scenario with online CBFs. (b) Narrow Passage scenario with optimal fixed CBFs. (c) Cross scenario with online CBFs. (d) Cross scenario with optimal fixed CBFs. (e) Singularity scenario with online CBFs. (f) Singularity scenario with optimal fixed CBFs.

Performance. Fig. 2(a) shows the agent trajectories with online CBFs. The agents move smoothly from initial positions to destinations without collision. Figs. 2(b)-2(c) show the variation of the CBF parameters $\boldsymbol{\zeta}$, $\boldsymbol{\eta}$ of an example agent $A_3$ w.r.t. the other agents and obstacles, respectively. We see that (i) the values of $\boldsymbol{\zeta}$ remain maximal for the majority of its trajectory, which can be interpreted as a relaxation of the CBF constraints to achieve fast velocities towards the destination; (ii) the values of $\boldsymbol{\zeta}$ drop and the values of $\boldsymbol{\eta}$ increase between time steps 100 and 150, which renders the CBF constraints conservative to avoid inter-agent congestion, i.e., $A_3$ slows its velocity for $A_4$ when preparing to pass through the narrow passage between $O_3$ and $O_4$; (iii) the values of $\boldsymbol{\zeta}$ and $\boldsymbol{\eta}$ tend to be random after time step 250, because $A_3$ is close to its destination and the CBF parameters play little role at that stage.

To show the trade-off between conservative and aggressive behavior inherent in CBFs, we select the minimal and maximal values of the time-varying CBF parameters in Figs. 2(b)-2(c) (dashed lines), corresponding to the most conservative and most aggressive CBFs, and perform navigation with fixed CBFs [21, 25] using these selected parameters. For the minimal values in Fig. 2(d), $A_2$ and $A_3$ keep still while $A_1$ and $A_4$ move along the environment boundary with overly conservative trajectories, and only $A_1$ reaches its destination. For the maximal values in Fig. 2(e), while all agents aggressively move towards their destinations, $A_3$ and $A_4$ get stuck before the narrow passage between $O_3$ and $O_4$. This is because the agents get too close to each other and to the obstacles, where their controllers have no feasible solution. This highlights the importance of our approach, which provides online CBFs.

V-B Feasibility

We evaluate our approach in three distinct navigation scenarios: Narrow Passage, Cross, and Singularity. For the fixed CBFs, we employ an exhaustive grid-search to find the optimal parameters $(\zeta, \eta)$ from $[0.1, 10] \times [1.0, 2.0]$ over 100 combinations.
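For concreteness, a minimal sketch of such a grid; the uniform spacing and the episode-evaluation helper in the usage comment (run_episode_with_fixed_cbf) are illustrative assumptions, not part of our released code.

```python
import itertools
import numpy as np

def fixed_cbf_grid(n_zeta=10, n_eta=10):
    """Exhaustive grid of fixed CBF parameters (zeta, eta) over
    [0.1, 10] x [1.0, 2.0], 100 combinations in total."""
    zetas = np.linspace(0.1, 10.0, n_zeta)
    etas = np.linspace(1.0, 2.0, n_eta)
    return list(itertools.product(zetas, etas))

# Usage (hypothetical helper): evaluate each candidate with the fixed-CBF
# controller and keep the best-performing pair, e.g.
# best = max(fixed_cbf_grid(), key=lambda ze: run_episode_with_fixed_cbf(*ze))
```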

Fig. 3 shows the performance of our approach and the optimal fixed CBFs in our three scenarios. Overall, our results show that, while we exhaustively traversed the parameter space, there still exist scenarios where there is no solution for fixed CBFs. In contrast, the online CBFs can solve these infeasible scenarios (i.e., all agents reach their destinations successfully). For the Narrow Passage scenario in Figs. 3(a)-3(b) and the Cross scenario in Figs. 3(c)-3(d), our approach deconflicts agents by prioritizing them with varying degrees of conservative / aggressive behaviors. For the Singularity scenario in Figs. 3(e)-3(f), the agent and its destination are aligned with an obstacle in the middle. The fixed CBF-based controller generates controls on the edge or vertex of the admissible control set, and hence, agents get stuck in local minima [40], while our approach helps agents escape such conditions due to online CBF tuning.

V-C Generalization

Figure 4: (a) Environment used to train the GNN-based policy for online CBFs and to conduct the grid-search for fixed CBFs. Agent trajectories are generated with online CBFs of the trained GNN-based policy. (b) Performance comparison between online CBFs generated by the GNN-based policy, optimal fixed CBFs with exhaustive grid-search and time-varying CBFs with random parameters.

We show the generalization of our approach by testing the trained policy on previously unseen environments. We consider two baselines: (i) optimal fixed CBFs with exhaustive grid-search and (ii) time-varying CBFs with random parameters. The first searches the parameter space exhaustively and selects the optimal values for fixed CBFs. As existing works primarily concentrate on identifying the optimal fixed CBFs for specific tasks [41, 11, 42], we consider this baseline to approximate the optimal performance of controllers with fixed CBFs. Note that it is inefficient and included here solely for reference. The second selects random CBF parameters every 10 time steps, which is as efficient as our approach.

We consider larger environments with 8 obstacles, where the maximal time step is 750. We train our approach for online CBFs and conduct the grid-search for fixed CBFs in the environment shown in Fig. 4(a), and test them by randomly shifting initial, goal and obstacle positions. The performance is measured by two metrics: (i) Success weighted by Path Length (SPL) [43] and (ii) the percentage of the maximal speed (PCTSpeed). The former is a stringent measure combining the success rate and the path length, while the latter represents the ratio of the average speed to the maximal one.
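For concreteness, minimal sketches of the two metrics, following the SPL definition of [43] and the PCTSpeed description above; the function names are illustrative.

```python
import numpy as np

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length [43]:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    s = np.asarray(successes, dtype=float)
    l = np.asarray(shortest_lengths, dtype=float)
    p = np.asarray(path_lengths, dtype=float)
    return float(np.mean(s * l / np.maximum(p, l)))

def pct_speed(avg_speeds, v_max):
    """Percentage of the maximal speed: average speed divided by v_max."""
    return float(np.mean(np.asarray(avg_speeds) / v_max))
```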

Fig. 4(b) shows the results averaged over 20 random initializations. Our approach outperforms the baselines in both metrics, with higher expectations and lower standard deviations. This is consistent with our analysis: (i) online CBFs coordinate agents' conservative / aggressive behaviors based on instantaneous environment states, which allows smooth navigation without congestion and hence a higher expected performance; (ii) online CBFs deconflict agents to solve infeasible scenarios, which improves robustness and hence lowers the standard deviations. Random CBFs exhibit a higher SPL than fixed CBFs because they deconflict some infeasible scenarios by randomly prioritizing agents and thus have a higher success rate. However, random CBFs show a lower PCTSpeed because randomly coordinated agents exhibit poor performance, even when the navigation tasks succeed.

V-D Real-World Experiments

Figure 5: Real-world experiments with DJI’s Robomasters. Robots are required to pass through a narrow passage and reach predefined goal positions. The online CBFs are able to deconflict robots.

We conduct real-world experiments to validate our approach. We consider a narrow passage scenario that requires deconfliction for multi-robot navigation. We use four customized DJI Robomasters with Raspberry Pis. Each robot has a partially observable view of the space with a sensing range of 2 m, and relies on an external tracking system (OptiTrack) for localization. We use ROS2 as communication middleware. At each time step, a robot receives its current state and its neighbors' states, and deploys the decentralized controller for navigation.

Fig. 5 shows that our approach steers all robots to their destinations without collision, because online CBFs deconflict robots by adapting between conservative and aggressive constraint values. When performing the same experiment with (optimized) fixed CBFs, robots fail to navigate through the narrow passage because online deconfliction is not facilitated. This insight corroborates our theoretical analysis and numerical simulations [44, Video Link].

VI CONCLUSION

This paper proposed online CBFs for decentralized multi-agent navigation. We formulated the problem of multi-agent navigation as a quadratic program with CLFs for state convergence and CBFs for safety constraints, and proposed an online CBF optimization that tunes CBFs based on instantaneous state information in a dynamic environment. We solved this problem by leveraging RL with a GNN parameterization, where the former allows for model-free training and the latter provides decentralized agent controllers. We showed, through simulations and real-world experiments, that our approach coordinates agent behaviors to deconflict their trajectories and improve overall navigation performance, all the while ensuring safety. In future work, we will extend our approach to higher-dimensional nonlinear systems.

References

  • [1] J. Ota, “Multi-agent robot systems as distributed autonomous systems,” Advanced Engineering Informatics, vol. 20, no. 1, pp. 59–70, 2006.
  • [2] Y. Wang, E. Garcia, D. Casbeer, and F. Zhang, “Cooperative control of multi-agent systems: Theory and applications,” 2017.
  • [3] A. Oroojlooy and D. Hajinezhad, “A review of cooperative multi-agent deep reinforcement learning,” Applied Intelligence, pp. 1–46, 2022.
  • [4] S. M. LaValle, J. J. Kuffner, B. Donald et al., “Rapidly-exploring random trees: Progress and prospects,” Algorithmic and computational robotics: new directions, vol. 5, pp. 293–308, 2001.
  • [5] G. Wagner and H. Choset, “M*: A complete multirobot path planning algorithm with performance bounds,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.
  • [6] D. Foead, A. Ghifari, M. B. Kusuma, N. Hanafiah, and E. Gunawan, “A systematic literature review of A* pathfinding,” Procedia Computer Science, vol. 179, pp. 507–514, 2021.
  • [7] A. D. Ames, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs with application to adaptive cruise control,” in IEEE Conference on Decision and Control (CDC), 2014.
  • [8] Q. Nguyen and K. Sreenath, “Exponential control barrier functions for enforcing high relative-degree safety-critical constraints,” in IEEE American Control Conference (ACC), 2016.
  • [9] S.-C. Hsu, X. Xu, and A. D. Ames, “Control barrier function based quadratic programs with application to bipedal robotic walking,” in IEEE American Control Conference (ACC), 2015.
  • [10] U. Borrmann, L. Wang, A. D. Ames, and M. Egerstedt, “Control barrier certificates for safe swarm behavior,” IFAC-PapersOnLine, vol. 48, no. 27, pp. 68–73, 2015.
  • [11] W. Xiao, C. A. Belta, and C. G. Cassandras, “Feasibility-guided learning for constrained optimal control problems,” in IEEE Conference on Decision and Control (CDC), 2020.
  • [12] J. Zeng, Z. Li, and K. Sreenath, “Enhancing feasibility and safety of nonlinear model predictive control with discrete-time control barrier functions,” in IEEE Conference on Decision and Control (CDC), 2021.
  • [13] R. Cheng, M. J. Khojasteh, A. D. Ames, and J. W. Burdick, “Safe multi-agent interaction through robust control barrier functions with learned uncertainties,” in IEEE Conference on Decision and Control (CDC), 2020.
  • [14] Z. Gao and A. Prorok, “Environment optimization for multi-agent navigation,” arXiv preprint arXiv:2209.11279, 2022.
  • [15] Q. Li, F. Gama, A. Ribeiro, and A. Prorok, “Graph neural networks for decentralized multi-robot path planning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [16] E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, “Learning decentralized controllers for robot swarms with graph neural networks,” in Conference on Robot Learning (CoRL), 2020.
  • [17] Z. Gao, F. Gama, and A. Ribeiro, “Wide and deep graph neural network with distributed online learning,” IEEE Transactions on Signal Processing, vol. 70, pp. 3862–3877, 2022.
  • [18] L. Lindemann and D. V. Dimarogonas, “Control barrier functions for multi-agent systems under conflicting local signal temporal logic tasks,” IEEE Control Systems Letters, vol. 3, no. 3, pp. 757–762, 2019.
  • [19] X. Tan and D. V. Dimarogonas, “Distributed implementation of control barrier functions for multi-agent systems,” IEEE Control Systems Letters, vol. 6, pp. 1879–1884, 2021.
  • [20] O. Özkahraman and P. Ogren, “Combining control barrier functions and behavior trees for multi-agent underwater coverage missions,” in IEEE Conference on Decision and Control (CDC), 2020.
  • [21] M. Srinivasan, S. Coogan, and M. Egerstedt, “Control of multi-agent systems with finite time control barrier certificates and temporal logic,” in IEEE Conference on Decision and Control (CDC), 2018.
  • [22] Z. Qin, K. Zhang, Y. Chen, J. Chen, and C. Fan, “Learning safe multi-agent control with decentralized neural barrier certificates,” arXiv preprint arXiv:2101.05436, 2021.
  • [23] M. Ahmadi, A. Singletary, J. W. Burdick, and A. D. Ames, “Safe policy synthesis in multi-agent pomdps via discrete-time barrier functions,” in IEEE Conference on Decision and Control (CDC), 2019.
  • [24] C. Yu, H. Yu, and S. Gao, “Learning control admissibility models with graph neural networks for multi-agent navigation,” arXiv preprint arXiv:2210.09378, 2022.
  • [25] C. Dawson, Z. Qin, S. Gao, and C. Fan, “Safe nonlinear control using robust neural lyapunov-barrier functions,” in Conference on Robot Learning (CoRL), 2022.
  • [26] A. Robey, H. Hu, L. Lindemann, H. Zhang, D. V. Dimarogonas, S. Tu, and N. Matni, “Learning control barrier functions from expert demonstrations,” in IEEE Conference on Decision and Control (CDC), 2020.
  • [27] P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer, “Training robust neural networks using lipschitz bounds,” IEEE Control Systems Letters, vol. 6, pp. 121–126, 2021.
  • [28] C. Folkestad, Y. Chen, A. D. Ames, and J. W. Burdick, “Data-driven safety-critical control: Synthesizing control barrier functions with koopman operators,” IEEE Control Systems Letters, vol. 5, no. 6, pp. 2012–2017, 2021.
  • [29] K. Xu, W. Xiao, and C. G. Cassandras, “Feasibility guaranteed traffic merging control using control barrier functions,” in IEEE American Control Conference (ACC), 2022.
  • [30] J. Zeng, B. Zhang, and K. Sreenath, “Safety-critical model predictive control with discrete-time control barrier function,” in IEEE American Control Conference (ACC), 2021.
  • [31] J. Breeden, K. Garg, and D. Panagou, “Control barrier functions in sampled-data systems,” IEEE Control Systems Letters, vol. 6, pp. 367–372, 2021.
  • [32] A. D. Ames, K. Galloway, K. Sreenath, and J. W. Grizzle, “Rapidly exponentially stabilizing control lyapunov functions and hybrid zero dynamics,” IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 876–891, 2014.
  • [33] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice Hall, 2002.
  • [34] X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames, “Robustness of control barrier functions for safety critical control,” IFAC-PapersOnLine, vol. 48, no. 27, pp. 54–61, 2015.
  • [35] W. Xiao and C. Belta, “Control barrier functions for systems with high relative degree,” in IEEE Conference on Decision and Control (CDC), 2019.
  • [36] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
  • [37] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018.
  • [38] Z. Gao, E. Isufi, and A. Ribeiro, “Stochastic graph neural networks,” IEEE Transactions on Signal Processing, vol. 69, pp. 4428–4443, 2021.
  • [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [40] L. Wang, “Multi-robot coordination and safe learning using barrier certificates,” 2018.
  • [41] X. Xiao, J. Dufek, and R. Murphy, “Explicit motion risk representation,” in IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), 2019.
  • [42] J. Usevitch and D. Panagou, “Adversarial resilience for sampled-data systems using control barrier function methods,” in IEEE American Control Conference (ACC), 2021.
  • [43] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva et al., “On evaluation of embodied navigation agents,” arXiv preprint arXiv:1807.06757, 2018.
  • [44] Z. Gao, G. Yang, and A. Prorok, “Video of real-world experiments,” https://youtu.be/SVVWLnRh1KY, 2023.