Vector Field-Guided Learning Predictive Control for Motion Planning of Mobile Robots with Uncertain Dynamics

Yang Lu1, Weijia Yao2, IEEE Member, Yongqian Xiao1, Xinglong Zhang1, IEEE Member
Xin Xu1, IEEE Senior Member, Yaonan Wang2, Dingbang Xiao1
1Yang Lu, Yongqian Xiao, Xinglong Zhang, Xin Xu, and Dingbang Xiao are with the College of Intelligence Science and Technology, National University of Defense Technology, Hunan, China (luyang18@mail.sdu.edu.cn, shawyongqian@gmail.com, zhangxinglong18@nudt.edu.cn, xuxin_mail@263.net, dingbangxiao@nudt.edu.cn). (Corresponding author: Xin Xu.) 2Weijia Yao and Yaonan Wang are with the School of Robotics, Hunan University, China (wjyao@hnu.edu.cn, yaonan@hnu.edu.cn).
Abstract

In obstacle-dense scenarios, providing safe guidance for mobile robots is critical to improving their safe maneuvering capability. However, the guidance provided by standard guiding vector fields (GVFs) may limit the motion capability because the integral curves can have improper curvature when traversing obstacles. Moreover, robotic system dynamics are often time-varying, uncertain, or even unknown during the motion planning process, so many existing kinodynamic motion planning methods cannot achieve satisfactory reliability in guaranteeing safety. To address these challenges, we propose a two-level Vector Field-guided Learning Predictive Control (VF-LPC) approach that improves safe maneuverability. The first level, the guiding level, generates safe desired trajectories using the designed kinodynamic GVF, enabling safe motion in obstacle-dense environments. The second level, the Integrated Motion Planning and Control (IMPC) level, first uses a deep Koopman operator to learn a nominal dynamics model offline and then updates the model uncertainties online using sparse Gaussian processes (GPs). The learned dynamics and a game-based safety barrier function are then incorporated into the LPC framework to generate near-optimal planning solutions. Extensive simulations and real-world experiments on quadrotor unmanned aerial vehicles and unmanned ground vehicles demonstrate that VF-LPC enables robots to maneuver safely.

Index Terms:
Collision avoidance, integrated planning and control, planning under uncertainty, reinforcement learning.

I Introduction

Prior information or desired paths are often required to guide robots’ motion. As an effective and efficient method, guiding vector field (GVF) techniques have recently been studied to successfully realize path-following or obstacle-avoidance tasks for robots such as fixed-wing airplanes [1], unicycle-type vehicles [2], and unmanned aerial vehicles [3]. The GVF typically addresses motion planning and control tasks in the following manner [4]: (1) simple kinematic models of robots are considered (such as single or double integrator models); (2) the GVF provides guidance signals to be tracked by the inner-loop dynamic controller. Therefore, the guidance level (i.e., the design of the GVF) and the control level can be designed separately. However, existing GVFs such as [5] may generate guidance signals with drastic variations when crossing static obstacles, posing a challenge to the robot’s maneuvering capabilities. On the other hand, robot dynamics and safety constraints (e.g., avoidance of suddenly appearing obstacles) are critical in improving the robots’ maneuvering capabilities. When the robotic dynamics constraint is considered separately at the planning and control levels, consistency and optimality have to be carefully addressed for real-world applications. Therefore, it is crucial to introduce an Integrated Motion Planning and Control (IMPC) approach that generates solutions transmitted directly to the robot dynamics. In light of the above two aspects, designing adaptive IMPC approaches that leverage an improved GVF as guidance for real-world robots with uncertain/unknown dynamics is promising, particularly in obstacle-dense scenarios; an illustration is presented in Fig. 1.

Figure 1: The proposed integrated motion planning and control architecture. In this figure, the guidance module generates desired linear and angular velocities $v$ and $\omega$; $F$ and $M$ are forces and moments used to control a robot directly.

Optimization-based motion planning methods have recently been studied for realizing IMPC [6, 7]. To address the challenge of uncertain robotic dynamics in model predictive control (MPC) methods, it is crucial to estimate model uncertainties in real time. This can be achieved through online estimation techniques, although solving nonlinear or non-convex optimization problems online may raise reliability and computational issues. Reinforcement learning (RL) and adaptive dynamic programming (ADP) are promising for solving optimal planning and control problems [8, 9, 10]. Recent works on RL-based IMPC [9, 10] generally design reward functions based on a priori information or desired paths to learn optimal policies. Given desired paths, existing efforts primarily realize safe tracking control by designing reward functions encompassing state errors, control inputs, and safety terms. However, in obstacle-dense scenarios, obtaining safe desired paths is not easy: when the desired path is simply a straight line from the starting point to the endpoint, it may traverse many obstacles. Furthermore, the desired speed or angular velocity along such paths needs to be computed separately. These two aspects increase the complexity of designing RL-based IMPC algorithms and pose challenges to achieving near-optimal performance; providing guidance for these methods is therefore important. Despite the above difficulties, RL-based IMPC has shown effectiveness and efficiency for robots with nonlinear system dynamics [8, 9, 10]. Letting RL-based IMPC approaches serve as the IMPC structure in Fig. 1 is thus promising if the above challenges can be well addressed, namely, a GVF provides kinematic guidance (e.g., linear and angular velocities) for RL-based IMPC approaches.

There exist two main categories of RL-based IMPC studies: model-based and model-free ones. Obtaining a precise dynamic model is nontrivial due to internal factors such as system nonlinearity and external factors such as uneven terrain and slippery surfaces. The advantage of model-free RL methods lies in their independence from precise models; however, they still lack generalization to unseen scenarios and data efficiency in policy learning. Efficient model-based RL (MBRL) approaches have shown their effectiveness in real-world applications [9, 10]. Since real-world robot dynamics are often uncertain and even unknown, MBRL approaches struggle to achieve satisfactory reliability. This is due to: (1) the online adaptability of current data-driven modeling methods is insufficient; (2) the safety terms in the reward function may lead to policy divergence of RL-based motion planning algorithms. Motivated by these two challenges, we propose compensating the data-driven model online and adopting a receding-horizon actor-critic framework (also called Learning Predictive Control [11], LPC) to ensure the convergence of planning policies over the prediction horizons.

To realize motion planning of mobile robots with uncertain/unknown dynamics in obstacle-dense scenarios, we propose a Vector Field-guided Learning Predictive Control (VF-LPC) approach. Specifically, the guiding level quickly plans preliminary trajectories that avoid dense static obstacles. Optimal performance under the uncertain/unknown system dynamics and the safety constraints imposed by (suddenly appearing) moving obstacles is then achieved by solving the Hamilton–Jacobi–Bellman (HJB) equation online over the prediction horizons. In particular, we introduce a sparsification technique in the model compensation and finite-horizon actor-critic learning processes to improve online efficiency and IMPC performance. The contributions of this paper are summarized as follows:

  • 1)

    The VF-LPC approach can achieve near-optimal motion planning for mobile robots with uncertain/unknown dynamics in obstacle-dense environments. The approach achieves higher computational efficiency and obtains more reliable solutions than advanced model predictive control (MPC) and RL methods in solving nonlinear optimization problems (see Sections IV-D and VI).

  • 2)

    Our proposed VF-LPC approach not only has theoretical guarantees but has also been demonstrated to be effective in practice through extensive simulations and experiments with quadrotor unmanned aerial vehicles (UAVs) and a Hongqi E-HS3 vehicle (see Sections V and VI).

  • 3)

    The VF-LPC approach updates online the uncertain dynamics of a fully data-driven model trained with the deep Koopman operator. This reduces the discrepancy between the real and learned system dynamics when the environment is time-varying or the dynamics are learned inaccurately, thereby improving the planning performance and guaranteeing safety (see Section IV-B).

  • 4)

    By adding virtual obstacles, the modified and improved discrete-time composite vector field adopted by our VF-LPC approach can satisfy robot kinodynamic constraints. In addition, the modified vector field does not suffer from the deadlock problem, which usually exists in traditional composite vector fields. Moreover, VF-LPC can deal with (suddenly appearing) moving obstacles by introducing a game-based barrier function (see Sections IV-A, IV-C).

The remainder of this paper is organized as follows. Section II reviews the related works. Then Section III provides the preliminaries and problem formulation. The VF-LPC is introduced in Section IV. Then, Section V presents the convergence results of VF-LPC. Section VI elaborates on the simulation and experimental validation. Finally, the conclusion is drawn in Section VII.

Notation: The notation $\left\|x\right\|_{Q}^{2}$ represents $x^{\top}Qx$, where $Q$ is a positive (semi-)definite matrix, and $\left\|x\right\|=\sqrt{x^{\top}x}$. The field of real numbers is denoted by $\mathbb{R}$. A diagonal matrix is denoted by $\mathrm{diag}\{\nu_{1},\cdots,\nu_{n}\}$, where $\nu_{1},\cdots,\nu_{n}\in\mathbb{R}$ are the entries on the diagonal. Throughout this paper, we use $I$ to denote the identity matrix of suitable dimensions. The notations $\otimes$ and $\odot$ denote the Kronecker and Hadamard products, respectively.

II Related Work

We first discuss several modeling methods and then present a literature review on MPC and learning-based approaches for motion planning of robots with uncertain dynamics.

Data-driven modeling. Current advanced data-driven modeling methods include least squares [12], recurrent neural networks [13], multi-layer perceptrons (MLPs) [14], other neural networks (NNs) [15], etc. As a linear operator, the Koopman operator enables modeling methods [16, 17, 18] that establish linear time-invariant system dynamics. Owing to this property, dynamic mode decomposition (DMD) [16], extended DMD (EDMD) [17], and kernel-based DMD [18] have received much attention in recent years. However, the modeling performance of the Koopman operator relies heavily on the design of the observable functions. Consequently, approaches using NNs for automated observable function construction [19, 20] were proposed and have been validated to be effective in numerical simulations. To improve the modeling accuracy, Xiao et al. [21] proposed a deep direct Koopman (DDK) method for identifying linear time-invariant vehicle dynamic models. Unlike these works, we further improve the online adaptability of offline-trained dynamics models by learning the uncertain dynamics of the offline-trained Koopman model.

Sampling-based and optimization-based motion planning algorithms for robots with uncertain dynamics. Several studies have integrated chance constraints with sampling-based rapidly-exploring random trees (CC-RRT), presenting efficient path planning capability in densely cluttered obstacle environments [22, 23]. In [23], Gaussian processes (GPs) were employed to determine dynamically feasible paths and CC-RRT to establish probabilistically feasible paths. These approaches fail to guarantee optimality because the robot’s motion dynamics are not fully considered.

Under uncertain dynamics, optimization-based motion planning methods typically involve objective functions incorporating conditional value-at-risk (CVaR) measures [24, 25]. Nakka et al. handle the motion planning problem of chance-constrained nonlinear stochastic systems by deriving a surrogate problem with convex constraints [6]. Zhu et al. designed a chance-constrained nonlinear MPC method to solve collision avoidance problems of multi-robot systems under various uncertainties such as motion disturbances [26]; to realize real-time optimization, they developed a tight bound for the approximation of the collision probability. In [27], Lew et al. proposed a robust trajectory optimization method for nonlinear systems with model uncertainty and disturbances, which is notably capable of handling non-convex obstacle constraints. In [28], Kalman filtering was utilized for state estimation, and risk-aware safety constraints arising from estimation errors were introduced into stochastic optimal motion planning problems. Many of the above studies address various uncertainties such as sensing, obstacle motion, and robot dynamics. According to [29], two major ideas are considered in the area of feedback motion planning, i.e., probabilistic guarantees on safety and bounded models of uncertainties. In this paper, the unknown system dynamics are modeled with the previously developed deep Koopman operator [21]. We consider the bounded uncertainty of the data-driven system dynamics and compensate for it online.

Learning-based motion planning for robots with uncertain dynamics. As discussed in [30], RL-based approaches to obstacle avoidance under uncertainties typically fall into two categories: the first constructs stochastic models of the uncertain dynamics inherent in robotic systems and leverages the resulting probabilistic models for planning or policy learning; the second devises plans that account for worst-case scenarios. Regarding the first category, a Gaussian process model was used in the policy search framework of PILCO to capture the system dynamics and estimate the probability of violating safety constraints [31]; during policy learning, the candidate policies are optimized toward safer, low-risk directions. In [32], a Monte Carlo motion planning method was proposed to sample feasible trajectories under uncertainties, thereby fulfilling probabilistic collision avoidance constraints. Model-based motion planning under uncertain dynamics can be found in [9, 10]. Regarding the second category, Snyder et al. proposed a trust-region-based online learning algorithm with provable regret bounds by minimizing worst-case regret [30]. To realize kinodynamic motion planning and control, we design an IMPC framework leveraging the guidance from the vector field and further consider the uncertainties of the data-driven model in obstacle-dense environments.

III Preliminaries and Problem Formulation

This section presents detailed preliminaries on the composite vector field, which will be developed later in this paper to generate preliminary kinodynamic trajectories. Then, we review a data-driven deep Koopman-based system modeling method. The previously developed sparse GP, which can efficiently identify model uncertainties online, is also reviewed here and is used to identify the uncertainties of the nominal model. Note that, on the one hand, a GP could be used to identify the full system dynamics by itself, but the long-horizon modeling accuracy is then difficult to guarantee; to enhance the accuracy, one has to use flawless samples and fine-tune the hyperparameters, which can be computationally demanding. On the other hand, the Koopman model may not characterize the exact model accurately, so estimating the uncertainty of the learned model is essential. Therefore, to improve the accuracy, we propose to combine Koopman model learning with the sparse GP. In particular, we employ an online sparse GP to compensate for the inaccuracy and uncertainty of the offline-trained Koopman model, as introduced later in our methodology. Finally, we present the problem formulation for optimal motion planning under the fully data-driven system model containing uncertain dynamics.

III-A Composite Vector Field

Consider the following ordinary differential equation

\dot{\xi}=\chi(\xi(t))

with the initial state $\xi(0)\in\mathbb{R}^{2}$, where $\chi(\cdot)$ is continuously differentiable with respect to $\xi$ and is designed to be a guiding vector field for path following [5].

In Fig. 2, the elements of the composite vector field are illustrated in detail. A reference path $\mathcal{P}$ is provided initially and may be occluded by obstacles; it is defined by

\mathcal{P}=\left\{\xi\in\mathbb{R}^{2}:\phi(\xi)=0\right\},

where $\phi:\mathbb{R}^{2}\rightarrow\mathbb{R}$ is twice continuously differentiable. For example, a circular path $\mathcal{P}$ can be described by choosing $\phi(x,y)=x^{2}+y^{2}-R^{2}$, where $R$ is the circle radius. To avoid collisions, Yao et al. [5] proposed a composite vector field for processing obstacle constraints. It involves a reactive boundary $\mathcal{R}_{i}^{t}$ and a repulsive boundary $\mathcal{Q}_{i}^{t}$, i.e.,

\mathcal{R}_{i}^{t}=\left\{\xi\in\mathbb{R}^{2}:\varphi_{i}(\xi,t)=0\right\},\quad\mathcal{Q}_{i}^{t}=\left\{\xi\in\mathbb{R}^{2}:\varphi_{i}(\xi,t)=c_{i}\right\},

where $\varphi_{i}:\mathbb{R}^{2}\times\mathbb{R}\to\mathbb{R}$ is twice continuously differentiable, $i\in\mathcal{I}=\{1,2,\cdots,m\}$, $m$ is the total number of obstacles, and $c_{i}<0$. The repulsive boundary $\mathcal{Q}_{i}^{t}$ is the boundary that tightly encloses the $i$-th obstacle at time $t$, and a robot is forbidden to cross this boundary to avoid collision with the obstacle. The reactive boundary $\mathcal{R}_{i}^{t}$ is larger than and encloses the repulsive boundary, and its interior is a region where a robot can detect an obstacle and become reactive. We use the prescripts $\rm{ex}$ and $\rm{in}$ to denote the exterior and interior regions of a boundary, respectively. For example, ${}^{\rm{ex}}\mathcal{Q}$ represents the exterior region of the repulsive boundary (see Fig. 2). An example of moving circular reactive and repulsive boundaries can be characterized by choosing $\varphi_{i}(x,y;t)=(x-t)^{2}+y^{2}-R^{2}$ and letting $|c_{i}|<R$. In this case, the reactive and repulsive boundaries are large and small concentric circles, respectively, moving along the $x$-axis as $t$ increases.

Figure 2: Component illustration of the composite vector field [5]. The oblique lines delimit a closed area. The annulus area is the sandwiched region $\mathcal{M}_{s}={}^{\mathrm{ex}}\mathcal{Q}\cap{}^{\mathrm{in}}\mathcal{R}$.

We denote the path-following vector field by $\chi_{\mathcal{P}}$ and the repulsive vector field by $\chi_{\mathcal{R}_{i}}$, and they are defined below:

\chi_{\mathcal{P}}(\xi)=\gamma_{0}E\nabla\phi(\xi)-k_{p}\phi(\xi)\nabla\phi(\xi),
\chi_{\mathcal{R}_{i}}(\xi)=\gamma_{i}E\nabla\varphi_{i}(\xi)-k_{r_{i}}\varphi_{i}(\xi)\nabla\varphi_{i}(\xi),\quad i\in\mathcal{I},

where $E=\left[\begin{smallmatrix}0&-1\\ 1&0\end{smallmatrix}\right]$ is the $90^{\circ}$ rotation matrix, $\gamma_{i}\in\{1,-1\}$, $i\in\mathcal{I}\cup\{0\}$, determines the moving direction (clockwise or counterclockwise), and $k_{p},k_{r_{i}}$ are positive coefficients. The composite vector field is as follows [5]:

\chi(\xi)=\left(\prod_{i\in\mathcal{I}}\sqcup_{\mathcal{Q}_{i}}(\xi)\right)\hat{\chi}_{\mathcal{P}}(\xi)+\sum_{i\in\mathcal{I}}\left(\sqcap_{\mathcal{R}_{i}}(\xi)\hat{\chi}_{\mathcal{R}_{i}}(\xi)\right), \quad (1)

where $\hat{(\cdot)}$ is the normalization notation (i.e., for a nonzero vector $v\in\mathbb{R}^{n}$, $\hat{v}=v/\lVert v\rVert$), $\sqcup_{\mathcal{Q}}(\xi)=\frac{f_{1}(\xi)}{f_{1}(\xi)+f_{2}(\xi)}$ and $\sqcap_{\mathcal{R}}(\xi)=\frac{f_{2}(\xi)}{f_{1}(\xi)+f_{2}(\xi)}$ are smooth bump functions, where $f_{1}(\xi)=0$ if $\varphi(\xi)\leq c$ and $f_{1}(\xi)=\exp\left(\frac{l_{1}}{c-\varphi(\xi)}\right)$ if $\varphi(\xi)>c$, $f_{2}(\xi)=\exp\left(\frac{l_{2}}{\varphi(\xi)}\right)$ if $\varphi(\xi)<0$ and $f_{2}(\xi)=0$ if $\varphi(\xi)\geq 0$, and $l_{1},l_{2}>0$ are coefficients for changing the decaying or increasing rate. Note that for simplicity, the subscripts $i$ of related symbols have been omitted above. These smooth bump functions blend parts of different vector fields and create a composite vector field for path following and collision avoidance; for more details, see [5]. To understand the composite vector field intuitively, we illustrate (1) in Fig. 3. For $\chi(\xi)$ in (1), it is equal to $\hat{\chi}_{\mathcal{P}}(\xi)$, $\sqcup_{\mathcal{Q}}(\xi)\hat{\chi}_{\mathcal{P}}(\xi)+\sqcap_{\mathcal{R}}(\xi)\hat{\chi}_{\mathcal{R}}(\xi)$, and $\hat{\chi}_{\mathcal{R}}(\xi)$ within the three regions ${}^{\rm{ex}}\mathcal{R}$, ${}^{\rm{ex}}\mathcal{Q}\cap{}^{\rm{in}}\mathcal{R}$, and ${}^{\rm{in}}\mathcal{Q}$, respectively.
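To make the blending in (1) concrete, the following Python sketch evaluates the composite field for one circular desired path and one circular obstacle; the characterizing functions, gains, and level-set constant $c$ are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

# Illustrative parameters (assumed, not from the paper)
l1, l2, c = 1.0, 1.0, -0.5                   # decay rates and repulsive level set (c < 0)
kp, kr, gamma0, gammar = 1.0, 1.0, 1.0, 1.0
E = np.array([[0.0, -1.0], [1.0, 0.0]])      # 90-degree rotation matrix

def phi(xi):                                 # desired path: circle of radius 5
    return xi[0]**2 + xi[1]**2 - 25.0

def grad_phi(xi):
    return np.array([2*xi[0], 2*xi[1]])

def varphi(xi):                              # reactive boundary of an obstacle at (5, 0)
    return (xi[0] - 5.0)**2 + xi[1]**2 - 1.0

def grad_varphi(xi):
    return np.array([2*(xi[0] - 5.0), 2*xi[1]])

def f1(xi):                                  # zero inside the repulsive boundary
    v = varphi(xi)
    return 0.0 if v <= c else np.exp(l1 / (c - v))

def f2(xi):                                  # zero outside the reactive boundary
    v = varphi(xi)
    return np.exp(l2 / v) if v < 0 else 0.0

def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 1e-9 else v

def composite_field(xi):
    chi_p = gamma0 * E @ grad_phi(xi) - kp * phi(xi) * grad_phi(xi)          # path-following field
    chi_r = gammar * E @ grad_varphi(xi) - kr * varphi(xi) * grad_varphi(xi)  # repulsive field
    a, b = f1(xi), f2(xi)
    cup, cap = a / (a + b), b / (a + b)                                       # smooth bump functions
    return cup * normalize(chi_p) + cap * normalize(chi_r)

print(composite_field(np.array([4.2, 0.5])))
```

Outside the reactive boundary the bump weights reduce the field to the normalized path-following term, and inside the repulsive boundary only the repulsive term remains, matching the three regions described above.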

Figure 3: An illustration of the composite vector field (1). We primarily illustrate the directions of the vector field in the vicinity of the path and directions partly within the interior region ${}^{\rm{in}}\mathcal{Q}$.

The singular sets of $\chi_{\mathcal{P}}$ and $\chi_{\mathcal{R}_{i}}$, denoted by $\mathcal{C}_{\mathcal{P}}$ and $\mathcal{C}_{\mathcal{R}_{i}}$, respectively, are defined below:

\mathcal{C}_{\mathcal{P}}=\left\{\xi\in\mathbb{R}^{2}:\chi_{\mathcal{P}}(\xi)=0\right\}=\left\{\xi\in\mathbb{R}^{2}:\nabla\phi(\xi)=0\right\},
\mathcal{C}_{\mathcal{R}_{i}}=\left\{\xi\in\mathbb{R}^{2}:\chi_{\mathcal{R}_{i}}(\xi)=0\right\}=\left\{\xi\in\mathbb{R}^{2}:\nabla\varphi_{i}(\xi)=0\right\}.

The elements of singular sets are called singular points, where vector fields vanish. Due to the possible presence of singular points, special designs are required to solve the deadlock problem. When we employ the guiding vector field as high-level guiding signals, neglecting the kinodynamic constraints usually deteriorates the control performance. Therefore, in the subsequent sections, we will consider the kinodynamic constraints.

III-B Offline Deep Koopman Operator-based System Modeling

Consider the following continuous-time nonlinear system

\dot{x}=f(x,u), \quad (2)

where $x\in\mathbb{R}^{n_{x}}$ denotes the system state, $f:\mathbb{R}^{n_{x}}\times\mathbb{R}^{n_{u}}\rightarrow\mathbb{R}^{n_{x}}$ is the system transition function, $u\in\Omega_{u}\subset\mathbb{R}^{n_{u}}$ denotes the control input, and $\Omega_{u}$ is the control constraint set. Note that the explicit dependence on time is dropped unless needed for clarity. We assume that $f(\cdot)$ is locally Lipschitz continuous.

The discrete-time Koopman operator of (2) can be described as follows:

\varUpsilon_{\infty}(x_{k+1})=(\mathcal{K}\varUpsilon_{\infty})(x_{k},u(x_{k})), \quad (3)

where $\mathcal{K}$ is an infinite-dimensional linear Koopman operator in a Hilbert space $\mathcal{H}$, and $\varUpsilon_{\infty}$ is the observable function. In [21], $\mathcal{K}$ is approximated by $n_{\mathcal{K}}$-order system dynamics by using deep neural networks, i.e.,

\varUpsilon(x_{k+1})=A\varUpsilon(x_{k})+Bu(x_{k}), \quad (4)

where $A\in\mathbb{R}^{n_{\mathcal{K}}\times n_{\mathcal{K}}}$ and $B\in\mathbb{R}^{n_{\mathcal{K}}\times n_{u}}$ are latent system matrices, and $\varUpsilon(x_{k})=[x_{k}^{\top},\rho_{e}^{\top}(x_{k})]^{\top}$, where $\rho_{e}:\mathbb{R}^{n_{x}}\to\mathbb{R}^{n_{\rho_{e}}}$ denotes the encoder module.
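To illustrate the lifted linear model (4), the sketch below rolls the dynamics forward with a toy encoder $\rho_e$ and assumed matrices $A$ and $B$; in the paper these are learned offline with the deep Koopman network of [21], so the dimensions and the encoder here are placeholders.

```python
import numpy as np

n_x, n_u, n_rho = 4, 2, 6
n_K = n_x + n_rho                      # lifted state dimension

def rho_e(x):
    # Placeholder encoder; in the paper this is a trained deep network.
    W = np.ones((n_rho, n_x)) * 0.1
    return np.tanh(W @ x)

def lift(x):
    # Upsilon(x) = [x; rho_e(x)] as in Eq. (4)
    return np.concatenate([x, rho_e(x)])

# Assumed (offline-learned) latent system matrices
A = np.eye(n_K) * 0.98
B = np.ones((n_K, n_u)) * 0.05

def rollout(x0, controls):
    """Propagate the lifted linear model Upsilon_{k+1} = A Upsilon_k + B u_k."""
    ups = lift(x0)
    traj = [ups]
    for u in controls:
        ups = A @ ups + B @ u
        traj.append(ups)
    return np.array(traj)

traj = rollout(np.zeros(n_x), [np.array([1.0, 0.0])] * 10)
print(traj[-1][:n_x])                  # the first n_x entries recover the predicted state
```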

III-C Sparse GP Regression for Online Compensation

Next, we review a sparse GP regression method called FITC [33], which reduces the computational complexity by selecting inducing samples and introducing a low-rank approximation of the covariance matrix, thereby transforming the original GP model into an efficient one. It is briefly introduced in the following.

III-C1 The formulation of full GP Regression

An independent training set is composed of input state vectors $\mathbf{z}=[z_{1},z_{2},\cdots,z_{n}]^{\top}\in\mathbb{R}^{n\times n_{z}}$ and the corresponding output vectors $\mathbf{y}=[y_{1},y_{2},\cdots,y_{n}]^{\top}\in\mathbb{R}^{n\times n_{y}}$. In [34], the mean and variance functions of each output dimension $a\in\{1,\cdots,n_{y}\}$ at a test point $z=[x^{\top},u^{\top}]^{\top}$ are computed by

m^{a}_{d}=K^{a}_{z\mathbf{z}}(K^{a}_{\mathbf{zz}}+\sigma^{2}_{a}I)^{-1}[\mathbf{y}]_{a}, \quad (5)
\Sigma^{a}_{d}=K^{a}_{zz}-K^{a}_{z\mathbf{z}}\left(K^{a}_{\mathbf{zz}}+\sigma_{a}^{2}I\right)^{-1}K^{a}_{\mathbf{z}z},

where $\sigma_{a}^{2}$ is the noise variance, $K_{\mathbf{zz}}^{a}=k^{a}(\mathbf{z},\mathbf{z})\in\mathbb{R}^{n\times n}$ is the Gram matrix containing the covariances of the training samples. Correspondingly, $K_{z\mathbf{z}}^{a}=(K_{\mathbf{z}z}^{a})^{\top}=k^{a}(z,\mathbf{z})$ denotes the covariance between a test sample and the training samples, and $K_{zz}^{a}=k^{a}(z,z)$ represents the prior variance at the test point; $k^{a}(\cdot,\cdot)$ is the squared exponential kernel function, defined as follows:

k^{a}\left(z_{i},z_{j}\right)=\sigma_{f,a}^{2}\exp\left(-\tfrac{1}{2}\left(z_{i}-z_{j}\right)^{\top}L_{a}^{-1}\left(z_{i}-z_{j}\right)\right), \quad (6)

where $\sigma_{f,a}^{2}$ is the signal variance and $L_{a}=\ell^{2}I$. Here $\sigma_{f,a}$ and $\ell$ are hyperparameters of the covariance function.
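The following sketch implements the squared exponential kernel (6) and the full GP posterior (5) for a single output dimension; the training data and hyperparameters are made-up values used only for illustration.

```python
import numpy as np

sigma_f, ell, sigma_n = 1.0, 0.5, 0.1       # assumed hyperparameters

def se_kernel(Z1, Z2):
    """Squared exponential kernel, Eq. (6), with L = ell^2 * I."""
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(z_test, Z_train, y_train):
    """Full GP posterior mean and variance, Eq. (5)."""
    Kzz = se_kernel(Z_train, Z_train) + sigma_n**2 * np.eye(len(Z_train))
    Kxz = se_kernel(z_test, Z_train)
    mean = Kxz @ np.linalg.solve(Kzz, y_train)
    var = se_kernel(z_test, z_test).diagonal() - np.einsum(
        "ij,ij->i", Kxz, np.linalg.solve(Kzz, Kxz.T).T)
    return mean, var

Z = np.random.randn(50, 3)                   # 50 training inputs, n_z = 3
y = np.sin(Z[:, 0]) + 0.1 * np.random.randn(50)
m, v = gp_posterior(np.zeros((1, 3)), Z, y)
print(m, v)
```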

III-C2 Sparse GP Regression

Given an inducing dictionary set $\{\mathbf{z}_{\rm{ind}},\mathbf{y}_{\rm{ind}}\}$ with $n_{\rm{ind}}$ samples from $\{\mathbf{z},\mathbf{y}\}$, the prior hyperparameters can be optimized by maximizing the marginal log-likelihood of the observed samples. In [33], the mean and variance functions of a full GP are approximated by using the inducing targets $\mathbf{y}_{\rm{ind}}$ and inputs $\mathbf{z}_{\rm{ind}}$, i.e.,

\tilde{m}_{d}^{a}(z)=Q_{z\mathbf{z}}^{a}(Q_{\mathbf{z}\mathbf{z}}^{a}+\Lambda)^{-1}[\mathbf{y}]_{a}, \quad (7)
\tilde{\Sigma}_{d}^{a}(z)=K_{zz}^{a}-Q_{z\mathbf{z}}^{a}(Q_{\mathbf{z}\mathbf{z}}^{a}+\Lambda)^{-1}Q_{\mathbf{z}z}^{a},

where $\Lambda=K_{\mathbf{z}\mathbf{z}}^{a}-Q_{\mathbf{z}\mathbf{z}}^{a}+\sigma_{a}^{2}I$ is diagonal and $Q_{\zeta\tilde{\zeta}}^{a}:=K_{\zeta\mathbf{z}_{\rm{ind}}}^{a}(K_{\mathbf{z}_{\rm{ind}}\mathbf{z}_{\rm{ind}}}^{a})^{-1}K_{\mathbf{z}_{\rm{ind}}\tilde{\zeta}}^{a}$. Several matrices in (7) do not depend on $z$ and can be precomputed, so they only need to be updated when $\mathbf{z}_{\rm{ind}}$ or the dictionary $\mathcal{D}$ itself is updated.

Finally, a multivariate GP is established by

d(z)\sim\mathcal{N}(\tilde{m}_{d},\tilde{\Sigma}_{d}), \quad (8)

where $\tilde{m}_{d}=[\tilde{m}_{d}^{1},\cdots,\tilde{m}_{d}^{n_{y}}]^{\top}$ and $\tilde{\Sigma}_{d}=\mathrm{diag}\{\tilde{\Sigma}_{d}^{1},\cdots,\tilde{\Sigma}_{d}^{n_{y}}\}$.
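Below is a minimal sketch of the FITC approximation (7)–(8) for one output dimension, reusing the squared exponential kernel; the inducing points (a random subset here) and all hyperparameters are assumptions made purely for illustration.

```python
import numpy as np

sigma_f, ell, sigma_n = 1.0, 0.5, 0.1

def se_kernel(Z1, Z2):
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def fitc_posterior(z_test, Z, y, Z_ind):
    """FITC sparse GP posterior mean/variance, Eq. (7), for one output dimension."""
    Kuu = se_kernel(Z_ind, Z_ind) + 1e-8 * np.eye(len(Z_ind))
    Kzu, Ktu = se_kernel(Z, Z_ind), se_kernel(z_test, Z_ind)
    Kuu_inv = np.linalg.inv(Kuu)
    Qzz = Kzu @ Kuu_inv @ Kzu.T                          # low-rank (Nystrom) term
    Lam = np.diag(np.diag(se_kernel(Z, Z) - Qzz)) + sigma_n**2 * np.eye(len(Z))
    Qtz = Ktu @ Kuu_inv @ Kzu.T                          # cross term between test and train
    W = np.linalg.solve(Qzz + Lam, np.eye(len(Z)))       # (Q_zz + Lambda)^{-1}
    mean = Qtz @ W @ y
    var = se_kernel(z_test, z_test).diagonal() - np.einsum(
        "ij,ij->i", Qtz, (W @ Qtz.T).T)
    return mean, var

Z = np.random.randn(200, 3)
y = np.sin(Z[:, 0]) + 0.1 * np.random.randn(200)
Z_ind = Z[np.random.choice(200, 20, replace=False)]      # 20 inducing inputs
print(fitc_posterior(np.zeros((1, 3)), Z, y, Z_ind))
```

The cost of each prediction is dominated by operations on the $n_{\rm{ind}}$ inducing points rather than on all $n$ training samples, which is what makes the online compensation in Section IV-B computationally tractable.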

III-D Problem Formulation

III-D1 Composite Vector Field with Kinodynamic Constraints

The composite vector field acts as a local path planner and should satisfy the kinodynamic constraint, leading to the problem of Vector-Field-guided Trajectory Planning with Kinodynamic Constraint (VF-TPKC), which is formulated in Definition 1. This problem is decomposed into two components: first, when obstacles obstruct the desired path, the planning method should ensure that the generated paths are safe, i.e., collision-free; second, the issue of satisfying dynamic constraints arises, requiring improvements upon the path planning method established in the first step.

Remark 1.

The term “kinodynamic constraint” refers to the requirement that a robot does not collide with obstacles when moving at different speeds. To address this issue, we transform it into satisfying a maximum lateral acceleration bound. \hfill\blacktriangleleft

Definition 1.

(VF-TPKC) Design a continuously differentiable vector field $\chi:\mathbb{R}\times\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ for $\dot{\xi}(t)=\chi(t,\xi(t))$ such that:

  • 1)

    It achieves path-following and collision avoidance. In addition, the path-following error is bounded, and no deadlocks exist.

  • 2)

    Given the robot’s velocities $v_{x}$, $v_{y}$ and the maximum centripetal acceleration $a_{\rm{max}}$, it holds that $(v_{x}^{2}+v_{y}^{2})\kappa(t)\leq a_{\rm{max}}$ for $t>0$, where $\kappa(t)$ is the curvature at time $t$.

A guiding vector field $\chi:\mathbb{R}\times\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ is designed to generate a continuously differentiable reference path, which is obtained by

\Xi=\int_{0}^{\infty}\chi(\xi(t))\,\mathrm{d}t.

Subsequently, we employ a learning-based predictive control approach to track the desired trajectories and avoid dynamic obstacles at the same time.

III-D2 Optimal Motion Planning to Avoid Moving Obstacles

Given the offline-learned system (4), it is feasible to use it to design an optimal IMPC. However, the interaction environments may be time-varying, causing the system dynamics to be uncertain. We therefore rewrite the exact system dynamics as the data-driven Koopman model plus an uncertain part:

\varUpsilon(x_{k+1})=\underbrace{A\varUpsilon(x_{k})+Bu(x_{k})}_{f_{\text{nom}}(\varUpsilon(x_{k}),u(x_{k}))}+B_{s}\underbrace{\left(g(\varUpsilon(x_{k}),u(x_{k}))+w_{k}\right)}_{y_{k}}, \quad (9)

where the above model consists of a known nominal part $f_{\text{nom}}$ and an additive term $y_{k}$, which lies within the subspace spanned by $B_{s}$ [35]. We assume that the process noise $w_{k}\sim\mathcal{N}\left(0,\Sigma^{w}\right)$ is independent and identically distributed (i.i.d.) and spatially uncorrelated, i.e., $\Sigma^{w}=\mathrm{diag}\{\sigma_{1}^{2},\ldots,\sigma_{n_{y}}^{2}\}$, where $n_{y}$ denotes the dimension of $y_{k}$.

Assuming that the desired trajectory can be denoted by

\varUpsilon(x_{r,k+1})=A\varUpsilon(x_{r,k})+Bu(x_{r,k}), \quad (10)

the subtraction of Eq. (10) from Eq. (9) yields the following error model, i.e.,

\tilde{x}_{k+1}=A\tilde{x}_{k}+B\tilde{u}(x_{k})+B_{s}\underbrace{\left(g(\varUpsilon(x_{k}),u(x_{k}))+w_{k}\right)}_{y_{k}}, \quad (11)

where $\tilde{x}_{k}=\varUpsilon(x_{k})-\varUpsilon(x_{r,k})$ is the error state, $x_{r,k}$ is the reference state, and $\tilde{u}(x_{k})=u(x_{k})-u(x_{r,k})$ is the control input.

We formally define the Optimal Motion Planning (OMP) problem, which consists of two subproblems. The first subproblem is the tracking control problem. To formulate this subproblem, we first define the value function as the cumulative discounted sum of the infinite-horizon costs:

V_{\infty}(\tilde{x}_{k})=\sum_{\tau=k}^{\infty}\gamma^{\tau-k}L(\tilde{x}_{\tau},\tilde{u}(\tilde{x}_{\tau})), \quad (12)

where $0<\gamma\leq 1$, $L(\tilde{x}_{\tau},\tilde{u}_{\tau})=\tilde{x}_{\tau}^{\top}Q\tilde{x}_{\tau}+\tilde{u}^{\top}(\tilde{x}_{\tau})R\tilde{u}(\tilde{x}_{\tau})$ is the cost function, $Q\succeq 0\in\mathbb{R}^{n_{\mathcal{K}}\times n_{\mathcal{K}}}$ is positive semi-definite, and $R\succ 0\in\mathbb{R}^{n_{u}\times n_{u}}$ is positive definite. The second subproblem is how the robot can avoid moving obstacles. Combining the two subproblems, we formulate the OMP problem as follows:

Problem 1.

(OMP) Design an optimal IMPC for the robot with uncertain system dynamics such that it

  • C.1:

    Starts at $x_{0}\in\mathbb{R}^{n_{x}}$ and tracks the reference path $\Xi$ by minimizing the value function $V_{\infty}(\tilde{x}_{k})$.

  • C.2:

    Avoids collisions with all obstacles $\mathcal{B}_{1},\dots,\mathcal{B}_{q}\subseteq\mathcal{W}\subseteq\mathbb{R}^{2}$, where $\mathcal{W}$ denotes the workspace.

IV Vector Field Guided Receding Horizon Reinforcement Learning for Mobile Robots with Uncertain System Dynamics

It is essential to generate local collision-free trajectories for guiding robots’ motion in obstacle-dense scenarios, which could improve safety and simplify the design of RL-based IMPC. Motivated by this aspect, we design a guiding vector field that considers dynamic constraints and excludes the deadlock problem (i.e., singular points). This part is illustrated by the safety guiding module in Fig. 4. Considering safety when robots track the desired trajectory, we must deal with the movements of (suddenly appearing) moving obstacles and the uncertainties of the nominal deep Koopman model. To this end, we develop an online receding-horizon reinforcement learning (RHRL) approach that employs a game-based exponential barrier function and a fast model compensation scheme. This part is illustrated by the learning predictive control module of Fig. 4. The details of each module and its sub-modules will be illustrated in the following subsections.

Figure 4: The overall framework of the VF-LPC algorithm.

IV-A Discrete-time Kinodynamic Composite Vector Field

In this subsection, we present a discrete-time kinodynamic composite vector field to generate locally feasible trajectories, corresponding to module A of Fig. 4.

IV-A1 Composite Vector Field with Kinodynamic Constraints

The first objective in Definition 1 can be achieved by the composite vector field (1). To accomplish the second objective in Definition 1, we first design the following kinodynamic composite vector field based on (1):

\chi_{c}(\xi)=\left(\prod_{i\in\mathcal{I}\backslash\mathcal{I}^{\prime}}\sqcup_{\mathcal{Q}_{i}}(\xi)\prod_{i\in\mathcal{I}^{\prime}}s_{i}(\xi)\right)\hat{\chi}_{\mathcal{P}}(\xi)+\sum_{i\in\mathcal{I}\backslash\mathcal{I}^{\prime}}\left(\sqcap_{\mathcal{R}_{i}}(\xi)\hat{\chi}_{\mathcal{R}_{i}}(\xi)\right)+\sum_{i\in\mathcal{I}^{\prime}}\left(1-s_{i}(\xi)\right)\hat{\chi}_{\mathcal{R}_{i}}(\xi), \quad (13)

where $\mathcal{I}^{\prime}$ is a set containing the indices of manually added virtual obstacles, and $s_{i}:\mathbb{R}^{2}\to\mathbb{R}$ is a function to be designed later. The path generated by the original composite vector field in (1) would often require the robot to make large turns within a limited distance for collision avoidance. The role of virtual obstacles here is to proactively modify the vector field such that the curvature of the robot trajectory is less than the maximum allowable value as the robot enters the sandwiched region

\mathcal{M}_{s}={}^{\mathrm{ex}}\mathcal{Q}_{\mathrm{actl}}\cap{}^{\mathrm{in}}\mathcal{R}_{\mathrm{actl}}, \quad (14)

(i.e., the area sandwiched between the repulsive and the reactive boundaries; see the white annulus region in Fig. 5), thereby satisfying the dynamic constraints. The definitions of the reactive and repulsive boundaries are expressed through the functions $\varphi_{i}(\xi)$, $i\in\mathcal{I}^{\prime}$. For example, the reactive boundary $\mathcal{R}_{\rm{vrtl}}$ corresponding to a virtual obstacle is described by $\{\xi\in\mathbb{R}^{2}:\varphi_{i}(\xi)=0\}$, $i\in\mathcal{I}^{\prime}$, in Fig. 5, and its repulsive boundary $\mathcal{Q}_{\rm{vrtl}}$ is described by $\{\xi\in\mathbb{R}^{2}:\varphi_{i}(\xi)=c_{i}\}$, $i\in\mathcal{I}^{\prime}$. When the robot’s position $\xi=(X,Y)\in\mathbb{R}^{2}$ enters the virtual reactive region ${}^{\mathrm{in}}\mathcal{R}_{\rm{vrtl}}$, it will be attracted towards the virtual repulsive boundary $\mathcal{Q}_{\rm{vrtl}}$. This provides a direction change before $\xi$ enters the actual reactive region ${}^{\mathrm{in}}\mathcal{R}_{\rm{actl}}$ corresponding to the actual obstacle, and the virtual obstacle will not affect the motion after $\xi$ enters ${}^{\mathrm{in}}\mathcal{R}_{\rm{actl}}$. Based on the above analysis, within/outside the buffer region (i.e., the shaded area in Fig. 5)

\mathcal{M}_{b}={}^{\mathrm{ex}}\mathcal{R}_{\text{actl}}\cap{}^{\mathrm{in}}\mathcal{R}_{\text{vrtl}}, \quad (15)

the function $s_{i}(\xi):\mathbb{R}^{2}\to\mathbb{R}$, $i\in\mathcal{I}^{\prime}$, is designed to be

s_{i}(\xi)=\begin{cases}\exp\left(\frac{k_{c,i}}{c_{i}-\varphi_{i}(\xi)}\right)&\xi\in\mathcal{M}_{b},\\ 1&\text{otherwise},\end{cases} \quad (16)

where the adjustable coefficient $k_{c,i}>0$ is used to change the convergence rate to $\mathcal{Q}_{\rm{vrtl}}$, and $\exp(\cdot)$ denotes the exponential function. Thus, in this design, $0<s_{i}(\xi)<1$ if $\xi\in\mathcal{M}_{b}$, and $s_{i}(\xi)=1$ otherwise.

Remark 2.

The placed virtual obstacle is assumed to satisfy $({}^{\mathrm{in}}\mathcal{Q}_{\rm{vrtl}}\cap\mathcal{P})\subset({}^{\mathrm{in}}\mathcal{R}_{\rm{actl}}\cap\mathcal{P})$. This assumption is used to let $\xi$ exit the sandwiched region $\mathcal{M}_{s}$ through $\mathcal{M}_{b}$, rather than converge to $\mathcal{Q}_{\rm{vrtl}}$, when the robot is in $\mathcal{M}_{b}$. \hfill\blacktriangleleft

Remark 3.

Virtual obstacles cease to exert their influence if $\xi$ enters $\mathcal{M}_{b}$ again from $\mathcal{M}_{s}$. This setting prevents the robot from returning to $\mathcal{M}_{s}$ and enables it to move towards the desired path to complete an obstacle avoidance process. \hfill\blacktriangleleft

Now, the guidance path generated by the vector field $\chi_{c}(\xi)$ avoids rapidly increasing curvature. To satisfy the maximum centripetal acceleration $a_{\rm{max}}$, speed planning is further performed along the path $\Xi_{f}=\int_{0}^{T}\chi(\xi(t))\,\mathrm{d}t$ under a given desired speed $v_{d}$, where $T>0$ determines the time duration. The maximum allowable speed is $(a_{\rm{max}}/\kappa(t))^{1/2}$. Speed planning then follows a simple strategy: if $v_{d}>(a_{\rm{max}}/\kappa(t))^{1/2}$, set the speed at $\xi(t)$ to $(a_{\rm{max}}/\kappa(t))^{1/2}$; otherwise, set the speed at $\xi(t)$ to $v_{d}$.
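The curvature-based speed cap can be written compactly; the sketch below assumes a discretized path whose curvature profile is already available, and the values of $a_{\rm{max}}$ and $v_{d}$ are illustrative.

```python
import numpy as np

def plan_speed(curvatures, v_d, a_max):
    """Cap the desired speed so that v^2 * kappa <= a_max at every path point."""
    curvatures = np.maximum(np.abs(curvatures), 1e-9)   # avoid division by zero
    v_limit = np.sqrt(a_max / curvatures)               # (a_max / kappa)^(1/2)
    return np.minimum(v_d, v_limit)

# Example: an illustrative curvature profile along a planned path Xi_f
kappa = np.array([0.01, 0.05, 0.3, 0.8, 0.3, 0.02])
print(plan_speed(kappa, v_d=5.0, a_max=2.0))
```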

Figure 5: An illustrative example of the kinodynamic composite vector field. The gray-shaded ellipse is the actual obstacle, and its corresponding repulsive and reactive boundaries are the red and black solid elliptic curves, respectively. The small and large blue circles with dashed lines represent the repulsive and reactive boundaries for a manually added virtual obstacle. The black curve is the desired path.

IV-A2 Analysis of the Composite Vector Field

The composite vector field (13) gives an idea of how to accommodate the kinodynamic constraint. By redesigning the characterizing functions $\phi$, $\varphi_{i}$, the bump functions, and the coefficients $c_{i}$, the first condition of Definition 1 can be met in practice [5]. The second condition can be satisfied by selecting proper positive coefficients $k_{c,i}$, $i\in\mathcal{I}^{\prime}$. These two conditions are satisfied in the simulations and experiments in this paper.

IV-A3 Discrete-time Kinodynamic Guiding Vector Field

With $\chi_{c}(\xi)$ defined in (13), the desired path is computed by its discrete-time form, i.e.,

\xi_{k+1}=\xi_{k}+\beta\chi_{c}(\xi_{k,grid}),\quad\xi_{0}=(x_{0},y_{0}), \quad (17)

where $\xi_{0}\in\mathbb{R}^{2}$ is the initial robot position, $\beta$ is the step length, and $\xi_{k,grid}=\arg\min_{\xi_{g}\in\mathcal{G}}\|\xi_{k}-\xi_{g}\|$, where the set $\mathcal{G}\subseteq\mathbb{R}^{2}$ is the grid map consisting of a finite number of chosen points in a selected region. For computational efficiency, it is advisable to precompute $\chi_{c}(\xi)$ on a mesh grid map. This makes it possible to quickly identify the nearest vector $\chi_{c}(\xi_{k,grid})$ on the mesh grid map based on the current position $\xi_{k}$. Finally, we obtain the planned trajectory. Note that a vector with nearly zero norm is not selected as the current vector; instead, the vector from the previous state is inherited to avoid the singularity issue. Namely, if $\|\chi_{c}(\xi_{k,grid})\|<\epsilon$, then $\chi_{c}(\xi_{k,grid})\leftarrow\chi_{c}(\xi_{k-1,grid})$, where $\epsilon$ is a small positive number. This allows us to plan guiding trajectories directly without causing the deadlock problem from singular points.
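A minimal sketch of the rollout (17), assuming the composite field $\chi_c$ has been precomputed on a regular grid; the toy field, grid resolution, step length, and number of steps are illustrative assumptions.

```python
import numpy as np

def rollout_gvf(field_grid, grid_pts, xi0, beta=0.05, eps=1e-6, n_steps=2000):
    """Forward-Euler rollout of Eq. (17) with nearest-grid lookup and a
    singularity fallback (inherit the previous vector when the norm is ~0)."""
    xi = np.array(xi0, dtype=float)
    traj = [xi.copy()]
    prev_vec = np.zeros(2)
    for _ in range(n_steps):
        idx = np.argmin(np.linalg.norm(grid_pts - xi, axis=1))  # nearest grid point
        vec = field_grid[idx]
        if np.linalg.norm(vec) < eps:        # near a singular point: keep the last vector
            vec = prev_vec
        xi = xi + beta * vec                  # Euler step of Eq. (17)
        prev_vec = vec
        traj.append(xi.copy())
    return np.array(traj)

# Example with a toy rotational field chi_c(x, y) = (-y, x) sampled on a grid
xs, ys = np.meshgrid(np.linspace(-5, 5, 101), np.linspace(-5, 5, 101))
grid_pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
field_grid = np.stack([-grid_pts[:, 1], grid_pts[:, 0]], axis=1)
traj = rollout_gvf(field_grid, grid_pts, xi0=(3.0, 0.0))
print(traj[-1])
```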

IV-B Online Compensation to Update the Offline Trained Deep Koopman Model

In this subsection, we present an online compensation method to update the offline trained deep Koopman model since model uncertainties exist. This subsection corresponds to module B-(1) of Fig. 4.

To construct the “input-output” form of a GP, Eq. (9) is rewritten as

d(z_{k})=B_{d}^{\dagger}\left(\varUpsilon(x_{k+1})-f_{\text{nom}}\left(\varUpsilon(x_{k}),u(x_{k})\right)\right), \quad (18)

where $B_{d}^{\dagger}$ is the Moore–Penrose pseudoinverse of $B_{d}$.

Remark 4.

Due to the necessity for efficient computation in the online compensation of vehicle dynamic models, we further apply the approximate linear dependence (ALD) strategy [9] to quantize the online-updated dictionary $\mathcal{D}_{\rm{GP}}$ for obtaining the training dataset $\mathcal{D}_{\rm{SGP}}$. Note that $\mathcal{D}_{\rm{GP}}$ is obtained in the same way as in [36]. Subsequently, we employ the sparse GP-based algorithm FITC [33] for online estimation to enhance computational efficiency. This approach aligns with our previous work, and additional details can be found in [9]. \hfill\blacktriangleleft

With the optimized parameters, GP models compensate for model uncertainties. Therefore, we define the learned model of (9) as

\varUpsilon(x_{k+1})=f_{\text{nom}}(\varUpsilon(x_{k}),u(x_{k}))+d(z_{k})+\epsilon, \quad (19)

where $\epsilon\in\mathbb{R}^{n_{\mathcal{K}}}$ is the estimation error. The reference trajectory can be expressed as $\varUpsilon(x_{r,k+1})=f_{\text{nom}}(\varUpsilon(x_{r,k}),u(x_{r,k}))$.

We can obtain the Jacobian matrices of (19) at a reference state $(\varUpsilon(x_{r,k}),u(x_{r,k}))$, i.e.,

A_{d,k}=A+\frac{\partial d(z_{k})}{\partial\varUpsilon(x_{k})},\quad B_{d,k}=B+\frac{\partial d(z_{k})}{\partial u_{k}}, \quad (20)

where $A_{d,k}\in\mathbb{R}^{n_{\mathcal{K}}\times n_{\mathcal{K}}}$ and $B_{d,k}\in\mathbb{R}^{n_{\mathcal{K}}\times n_{u}}$. Note that $A_{d,k}$ and $B_{d,k}$ are then used in the training process of model-based RL in Section IV-D. To obtain $\frac{\partial d(z_{k})}{\partial\varUpsilon(x_{k})}$ and $\frac{\partial d(z_{k})}{\partial u_{k}}$, one should first compute $\frac{\partial d(z_{k})}{\partial z_{k}}$, which is equivalent to calculating $\partial\left(Q_{z_{k}\mathbf{z}_{\rm{sp}}}(Q_{\mathbf{z}_{\rm{sp}}\mathbf{z}_{\rm{sp}}}+\Lambda)^{-1}\mathbf{y}_{\rm{sp}}\right)/\partial z_{k}$, i.e.,

\dfrac{\partial d(z_{k})}{\partial z_{k}}=\dfrac{1}{\ell^{2}}(\mathbf{z}_{\rm{sp}}-[z_{k}\otimes\mathds{1}])\left(Q_{z_{k}\mathbf{z}_{\rm{sp}}}\odot(Q_{\mathbf{z}_{\rm{sp}}\mathbf{z}_{\rm{sp}}}+\Lambda)^{-1}\mathbf{y}_{\rm{sp}}\right), \quad (21)

where $\mathbf{z}_{\rm{sp}}\in\mathbb{R}^{n_{z}\times n_{\rm{sp}}}$ and $\mathbf{y}_{\rm{sp}}\in\mathbb{R}^{n_{\rm{sp}}\times n_{y}}$ denote the inputs and outputs of the training samples in $\mathcal{D}_{\rm{SGP}}$, respectively, and $\mathds{1}$ denotes an $n_{\rm{sp}}$-dimensional row vector consisting entirely of ones. Ultimately, this yields the matrices $\frac{\partial d(z_{k})}{\partial\varUpsilon(x_{k})}$ and $\frac{\partial d(z_{k})}{\partial u_{k}}$. Then, assuming that the estimation error is sufficiently small, an error model derived from (11) is given by

\hat{\tilde{x}}_{k+1}=f_{\text{nom}}(\tilde{x}_{k},\tilde{u}_{k}(\tilde{x}_{k}))+d(z_{k}). \quad (22)
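For the squared exponential kernel, the derivative in (21) can be evaluated in closed form once the sparse GP weight vectors are precomputed; the sketch below shows this gradient (an equivalent form of (21) with the matrix products collapsed into weight vectors) and the resulting Jacobian update (20). All dimensions, inducing points, and weights are placeholder assumptions.

```python
import numpy as np

sigma_f, ell = 1.0, 0.5                       # assumed kernel hyperparameters
n_K, n_u = 6, 2                               # assumed lifted-state / input dimensions
n_z = n_K + n_u

def se_kernel_vec(z, Z):
    d2 = ((Z - z) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_grad(z, Z_ind, w):
    """Gradient of one GP mean m(z) = k(z, Z_ind) @ w w.r.t. z (SE kernel)."""
    k = se_kernel_vec(z, Z_ind)
    return ((Z_ind - z) / ell**2 * (k * w)[:, None]).sum(0)

# Assumed precomputed sparse-GP weights, one vector per compensated output dimension
Z_ind = np.random.randn(15, n_z)
W = np.random.randn(n_K, 15)                  # row a: weights of the a-th GP output

z = np.zeros(n_z)
J = np.stack([gp_grad(z, Z_ind, W[a]) for a in range(n_K)])   # d d(z)/d z, shape (n_K, n_z)
dd_dU, dd_du = J[:, :n_K], J[:, n_K:]         # split into state and input parts

A = np.eye(n_K) * 0.98                        # nominal Koopman matrices (assumed)
B = np.ones((n_K, n_u)) * 0.05
A_d, B_d = A + dd_dU, B + dd_du               # Eq. (20)
print(A_d.shape, B_d.shape)
```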

IV-C Exponential Barrier Function Incorporated Cost Function to Avoid Moving Obstacles

In this subsection, we design a cost function incorporated with an exponential barrier function to deal with the moving obstacle constraint, corresponding to module B-(2) of Fig. 4.

Note that C.2 in the OMP Problem can cause safety issues due to the obstacles’ movements. Consistent with [9], an implementable barrier function used for safety is designed as

h(x)=\mu_{o}\exp\left(-\lVert\pi(x,x_{p})\rVert\right), \quad (23)

where $\mu_{o}\geq 0$ is the penalty coefficient, $x_{p}\in\mathbb{R}^{2}$ is the nearest reactive boundary coordinate of the nearest obstacle, and $\pi:\mathbb{R}^{n_{x}}\times\mathbb{R}^{2}\rightarrow\mathbb{R}$ maps $x$ and $x_{p}$ to the distance error. With $h(x)$, the robot keeps a safe distance from obstacles.

The rule-based switching between tracking control and obstacle-avoidance modes in [9] does not consider obstacle velocities and therefore cannot guarantee the safety of mobile robots under extreme conditions, such as the hazardous driving behavior of oncoming vehicles. Our idea stems from the pursuit-evasion game studied in [37], where the “evader” aspires to avoid being caught by the “pursuer” by computing safe actions. Similarly, when moving obstacles obstruct the road ahead, our strategy is to keep a safe distance from the obstacles and to track the desired trajectory whenever the ego robot cannot be caught. Integrating [9] and [37], the exponential barrier function comes into effect if the nearest surrounding obstacle would collide with the robot, where this judgment is generated by the method in [37], as elaborated below.

Figure 6: Pursuer-evader game model. The barrier function comes into effect to ensure safety when the pursuer is situated within the $\mathcal{E}_{p}$ region.

As shown in Fig. 6, a Cartesian coordinate system is established with the direction of the evader’s velocity as the $Y$-axis to facilitate the design of the safety-guaranteed switching mechanism. In this figure, $l_{\rm{safe}}$ denotes the preset maximum safety margin, which depends on the maximum tracking error of the controller, and $l$ is the safety distance defined by the physical constraints of the robot and obstacles. The terminal circle is divided into the usable part (the black line) and the nonusable part (the green line), separated by the boundary of the usable part (BUP). In [37], the BUP is determined by $\bar{s}=\arccos(-v_{p}/v_{e})$, $\bar{s}\in(\pi/2,\pi]$, where $v_{p}$ and $v_{e}$ are the speeds of the pursuer and the evader, respectively. The barrier implies that optimal play by both agents starting from any point generates a path that does not penetrate this surface. The barrier meets the terminal circle tangentially at the BUP. Therefore, as shown in Fig. 6, we consider the plane formed by points $a$, $b$, and the points on the BUP as a conservative barrier to ensure safety. Fig. 6 shows that the coordinate plane is divided into two regions: $\mathcal{E}_{p}=\mathcal{E}_{p,1}\cap\mathcal{E}_{p,2}$ and $\mathcal{E}_{c}=\mathcal{E}\setminus\mathcal{E}_{p}$, where $\mathcal{E}_{p,1}$ and $\mathcal{E}_{p,2}$ are defined below:

\mathcal{E}_{p,1}=\left\{\zeta=[X_{e},Y_{e}]^{\top}\in\mathbb{R}^{2}:\lVert\zeta\rVert\leq l+l_{\text{safe}}\right\},
\mathcal{E}_{p,2}=\mathcal{H}_{1}\cup\mathcal{H}_{2}\cup\mathcal{H}_{3},

with

\mathcal{H}_{1}=\{[X_{e},Y_{e}]^{\top}:X_{e}\leq-l\sin\bar{s},\;Y_{e}\geq\tan(\bar{s})X_{e}+l/\cos\bar{s}\},
\mathcal{H}_{2}=\{[X_{e},Y_{e}]^{\top}:X_{e}\geq l\sin\bar{s},\;Y_{e}\geq-\tan(\bar{s})X_{e}+l/\cos\bar{s}\},
\mathcal{H}_{3}=\{[X_{e},Y_{e}]^{\top}:X_{e}\in(-l\sin\bar{s},l\sin\bar{s}),\;Y_{e}\geq-\sqrt{l^{2}-X_{e}^{2}}\}.

The switching mechanism is elucidated more clearly below: in the region $\mathcal{E}_{p}$, $\mu_{o}\neq 0$ in (23) is set to keep a safe distance from the obstacle, while in the region $\mathcal{E}_{c}$, we let $\mu_{o}=0$ to track the planned preliminary path. To satisfy the conditions C.1 and C.2 of the OMP Problem, the step cost function at the $k$-th stage is redesigned as

L(\tilde{x}_{k},\tilde{u}_{k})=\tilde{x}_{k}^{\top}Q\tilde{x}_{k}+\tilde{u}_{k}^{\top}R\tilde{u}_{k}+h(x_{k}). \quad (24)

Equipped with the autonomous switching mechanism, the cost function is designed to learn motion planning policies.
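The switching rule and the barrier-augmented stage cost (23)–(24) reduce to a few lines; the region test below checks only the distance condition $\mathcal{E}_{p,1}$ and uses a fixed penalty value, both simplifying assumptions made for illustration.

```python
import numpy as np

def in_danger_region(evader_frame_pos, l, l_safe):
    """Simplified membership test: only the distance condition E_{p,1} is checked;
    the half-plane conditions E_{p,2} of the game-based barrier are omitted here."""
    return np.linalg.norm(evader_frame_pos) <= l + l_safe

def stage_cost(x_err, u_err, dist_to_boundary, Q, R, mu_o):
    """Step cost (24): quadratic tracking terms plus the barrier h(x) of Eq. (23)."""
    h = mu_o * np.exp(-abs(dist_to_boundary))
    return x_err @ Q @ x_err + u_err @ R @ u_err + h

Q, R = np.eye(4), 0.1 * np.eye(2)
x_err, u_err = np.ones(4) * 0.2, np.array([0.1, -0.05])
pursuer_pos = np.array([1.0, 2.0])            # pursuer position in the evader frame

mu_o = 10.0 if in_danger_region(pursuer_pos, l=1.5, l_safe=1.0) else 0.0
print(stage_cost(x_err, u_err, dist_to_boundary=0.8, Q=Q, R=R, mu_o=mu_o))
```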

IV-D Online Receding-horizon Reinforcement Learning with a Game-based Exponential Barrier Function Design

In this subsection, we design a receding-horizon reinforcement learning scheme with a game-based exponential barrier function, corresponding to module B-(3) of Fig. 4. In the previous subsection, the game-based exponential barrier function was incorporated into the cost function to establish a constraint-free optimization problem, which is solved online in this subsection by the Receding-Horizon Reinforcement Learning (RHRL) approach. For this reason, we call it Exponential barrier function-based RHRL (Eb-RHRL).

To minimize the infinite-horizon value function $V_{\infty}(\tilde{x}_{k})$, we first present the definitions of the optimal solution (control) and value functions commonly used in infinite-horizon ADP and RL. As a class of ADP methods, the dual heuristic programming (DHP) approach minimizes the derivative of the value function with respect to the state. Consistent with the DHP-based framework, we re-define the value function as

\lambda_{\infty}(\tilde{x}_{k})=\partial V_{\infty}(\tilde{x}_{k})/\partial\tilde{x}_{k}. \quad (25)
Definition 2.

The optimal value function corresponding to the infinite-horizon optimization objective is defined by substituting (12) into (25), i.e.,

\lambda_{\infty}^{*}(\tilde{x}_{k})=2Q\tilde{x}_{k}+\gamma A_{d,k}^{\top}\lambda^{*}(\hat{\tilde{x}}_{k+1})+\dfrac{\partial h(x_{k})}{\partial\tilde{x}_{k}}. \quad (26)

Then, the optimal solution is defined by the following equation:

\tilde{u}^{*}(\tilde{x}_{k})=\arg\min_{\tilde{u}(\tilde{x}_{k})}\sum_{\tau=k}^{\infty}\gamma^{\tau-k}L(\hat{\tilde{x}}_{\tau},\tilde{u}(\hat{\tilde{x}}_{\tau})), \quad (27)

where $\gamma\in(0,1]$ is the discount factor.

Note that the optimal solution and value function can be obtained under the iterative learning framework [9]. However, the online convergence property is subject to the learning efficiency. Motivated by the receding-horizon concept from MPC, as discussed in [11], we divide $V_{\infty}(\tilde{x}_{k})$ into multiple sub-problems within the prediction horizon $[k,k+N]$, where $N$ is the prediction horizon length. The value function in the prediction horizon is expressed by

V(\hat{\tilde{x}}_{\tau})=\mathbb{E}\left[F(\hat{\tilde{x}}_{k+N})+\sum_{j=\tau}^{k+N-1}L(\hat{\tilde{x}}_{j},\tilde{u}(\hat{\tilde{x}}_{j}))\right],\qquad V(\hat{\tilde{x}}_{k+N})=\mathbb{E}\left[F(\hat{\tilde{x}}_{k+N})\right], \quad (28)

where $\mathbb{E}[\cdot]$ denotes the expectation, and $F(\hat{\tilde{x}}_{k+N})$ is the terminal cost function, defined by $F(\hat{\tilde{x}}_{k+N})=\hat{\tilde{x}}_{k+N}^{\top}P\hat{\tilde{x}}_{k+N}$, where $P\in\mathbb{R}^{n_{\mathcal{K}}\times n_{\mathcal{K}}}$ is the terminal penalty matrix, which can be determined in the same way as in [11].

The value function $V(\hat{\tilde{x}}_{\tau})$ in Eq. (28) is written as

V(\hat{\tilde{x}}_{\tau})=\mathbb{E}\left[L(\hat{\tilde{x}}_{\tau},\tilde{u}(\hat{\tilde{x}}_{\tau}))+\gamma V(\hat{\tilde{x}}_{\tau+1})\right]. \quad (29)

According to Bellman’s optimality principle, the optimal policy $\tilde{u}^{*}$ minimizes $V(\hat{\tilde{x}}_{\tau})$ over the prediction horizon, i.e.,

V^{*}(\hat{\tilde{x}}_{\tau})=\mathbb{E}\left[L(\hat{\tilde{x}}_{\tau},\tilde{u}^{*}(\hat{\tilde{x}}_{\tau}))+\gamma V^{*}(\hat{\tilde{x}}_{\tau+1})\right],\;\tau\in[k,k+N-1],\qquad V^{*}(\hat{\tilde{x}}_{k+N})=\mathbb{E}\left[F(\hat{\tilde{x}}_{k+N})\right]. \quad (30)
Definition 3.

For the critic network, the optimal value function in the finite horizon is re-defined as $\lambda^{*}(\hat{\tilde{x}}_{\tau})=\partial V^{*}(\hat{\tilde{x}}_{\tau})/\partial\hat{\tilde{x}}_{\tau}$, i.e.,

\lambda^{*}(\hat{\tilde{x}}_{\tau})=\begin{cases}2Q\hat{\tilde{x}}_{\tau}+\gamma A_{d,\tau}^{\top}\lambda^{*}(\hat{\tilde{x}}_{\tau+1})+\dfrac{\partial h(\hat{x}_{\tau})}{\partial\hat{\tilde{x}}_{\tau}},&\tau\in[k,k+N-1],\\ 2P\hat{\tilde{x}}_{k+N}+\partial h(\hat{x}_{k+N})/\partial\hat{\tilde{x}}_{k+N},&\tau=k+N.\end{cases} \quad (31)

With the optimal value function $\lambda^{*}(\hat{\tilde{x}}_{\tau})$, one can compute the optimal solution by setting $\partial V^{*}(\hat{\tilde{x}}_{\tau})/\partial\tilde{u}(\hat{\tilde{x}}_{\tau})=0$. Then, for the actor network, the optimal solution in the finite horizon is defined by

\tilde{u}^{*}(\hat{\tilde{x}}_{\tau})=-\dfrac{1}{2}\gamma R^{-1}B_{d,\tau}^{\top}\lambda^{*}(\hat{\tilde{x}}_{\tau+1}),\quad\tau\in[k,k+N-1]. \quad (32)
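Ignoring the barrier gradient and assuming a fixed predicted error-state sequence and linearized matrices $A_{d,\tau}$, $B_{d,\tau}$ are available, one backward sweep of (31)–(32) can be sketched as follows; it illustrates the recursions only and is not the iterative learning scheme used in the paper.

```python
import numpy as np

def backward_sweep(A_seq, B_seq, x_seq, Q, R, P, gamma=0.99):
    """One backward pass of the costate recursion (31) and control law (32),
    given a fixed predicted error-state sequence x_seq (length N+1) and
    linearized matrices A_d,tau, B_d,tau; the barrier gradient is omitted."""
    N = len(A_seq)
    lam = [None] * (N + 1)
    u = [None] * N
    lam[N] = 2.0 * P @ x_seq[N]                                  # terminal condition of (31)
    for tau in range(N - 1, -1, -1):
        u[tau] = -0.5 * gamma * np.linalg.solve(R, B_seq[tau].T @ lam[tau + 1])  # Eq. (32)
        lam[tau] = 2.0 * Q @ x_seq[tau] + gamma * A_seq[tau].T @ lam[tau + 1]    # Eq. (31)
    return u, lam

# Toy example with assumed dimensions and a constant linearization
n_K, n_u, N = 4, 2, 10
A_seq = [np.eye(n_K) * 0.95] * N
B_seq = [np.ones((n_K, n_u)) * 0.1] * N
x_seq = [np.ones(n_K) * 0.5] * (N + 1)
Q, R, P = np.eye(n_K), 0.1 * np.eye(n_u), 2.0 * np.eye(n_K)
u_seq, lam_seq = backward_sweep(A_seq, B_seq, x_seq, Q, R, P)
print(u_seq[0], lam_seq[0])
```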
Remark 5.

The optimal solution $\tilde{u}^{*}(\hat{\tilde{x}}_{\tau})$ and the optimal value function $\lambda^{*}(\hat{\tilde{x}}_{\tau})$ are difficult to solve analytically because $\lambda^{*}(\hat{\tilde{x}}_{\tau+1})$ is not available; see Eqs. (31) and (32). In the following, the optimal policies of the actor and critic networks are learned iteratively.

Input: Obstacle information, the robot's states, the desired path.
Output: The final safe trajectory.

Initialize $W_{c,1:N}^{1}$, $W_{a,1:N}^{1}$.
Function LPC:
    Initialize the coefficient $\mu_{o}$ with 0.
    if $(X,Y)\in\mathcal{E}_{p}$ then
        $\mu_{o}\leftarrow$ a positive real number.
    end if
    for $i=1,\cdots,i_{\max}$ (maximum iterations) do
        // Prediction horizon
        for $\tau=k,\cdots,k+N$ do
            Compute $\hat{\tilde{u}}^{i}(\hat{\tilde{x}}_{\tau})$ and $\hat{\lambda}^{i}(\hat{\tilde{x}}_{\tau})$ by Eq. (33).
            Update the system dynamics (22) and the Jacobian matrices (20) with the sparse GP.
            Compute $\epsilon_{c}^{i}(\hat{\tilde{x}}_{\tau})$ and $\epsilon_{a}^{i}(\hat{\tilde{x}}_{\tau})$ with (37) and (38).
            Update the weights $W_{c,\tau}^{i}$ and $W_{a,\tau}^{i}$ with (39).
            Apply $\hat{\tilde{u}}^{i}(\hat{\tilde{x}}_{\tau})$ to the system (11).
        end for
        if the weights converge then
            $W_{c,k:k+N}^{1}\leftarrow W_{c,k:k+N}^{i}$, $W_{a,k:k+N}^{1}\leftarrow W_{a,k:k+N}^{i}$.
            Return the sequence $\{\hat{\tilde{u}}^{i}(\tilde{x}_{k}),\cdots,\hat{\tilde{u}}^{i}(\hat{\tilde{x}}_{k+N})\}$.
        end if
    end for
end
Function QuantizedSparseGP:
    Select a representative sample set $\mathcal{D}_{\rm{SGP}}$ from $\mathcal{D}_{\rm{GP}}$.
    Optimize the hyper-parameters for online compensation.
end
Function Main():
    while not reaching the destination do
        Get the current state $x_{0}$ of the system.
        if reaching the end of the guiding trajectory then
            Generate the guidance (the desired trajectory) for the LPC module with Eqs. (13) and (17).
        end if
        if the online dictionary $\mathcal{D}_{\rm{GP}}$ has been collected then
            Perform QuantizedSparseGP.
        end if
        // Control horizon
        Perform LPC and apply the first solution of the sequence to the real-world robot.
    end while
end
Algorithm 1: VF-LPC algorithm

At the $i$-th iteration, the optimal policies of the actor and critic networks are approximated by two kernel-based network structures, i.e.,

u~^i(x~^τ)\displaystyle\hat{\tilde{u}}^{i}(\hat{\tilde{x}}_{\tau}) =(Wa,τi)Φ(x~^τ),\displaystyle=({W}_{a,\tau}^{i})^{\top}\Phi(\hat{\tilde{x}}_{\tau}), (33a)
λ^i(x~^τ)\displaystyle\hat{\lambda}^{i}(\hat{\tilde{x}}_{\tau}) =(Wc,τi)Φ(x~^τ),\displaystyle=({W}_{c,\tau}^{i})^{\top}\Phi(\hat{\tilde{x}}_{\tau}), (33b)

where Wa,τnΦ×nu{W}_{a,\tau}\in\mathbb{R}^{n_{\Phi}\times n_{u}} and Wc,τnΦ×n𝒦{W}_{c,\tau}\in\mathbb{R}^{n_{\Phi}\times n_{\mathcal{K}}} are the weights of the actor and critic networks, respectively, and nΦn_{\Phi} denotes the dimension of the RL training dictionary, which is obtained by employing the ALD strategy[9] to select representative samples from the robot’s state space. For the current state x~^τ\hat{\tilde{x}}_{\tau}, the basis function in Eq. (33) is constructed by

Φ(x~^τ)=[k(x~^τ,c1),,k(x~^τ,cnΦ)]nΦ,\Phi(\hat{\tilde{x}}_{\tau})=\left[k(\hat{\tilde{x}}_{\tau},c_{1}),\ldots,k(\hat{\tilde{x}}_{\tau},c_{n_{\Phi}})\right]^{\top}\in\mathbb{R}^{n_{\Phi}},

where c1,,cnΦc_{1},\cdots,c_{n_{\Phi}} are elements in the RL training dictionary and k(,):n𝒦×n𝒦k(\cdot,\cdot):\mathbb{R}^{n_{\mathcal{K}}}\times\mathbb{R}^{n_{\mathcal{K}}}\rightarrow\mathbb{R} is the Gaussian kernel function.
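For concreteness, the following Python sketch evaluates the Gaussian-kernel feature vector $\Phi(\cdot)$ and the kernel-based actor and critic outputs in (33); the dictionary centers, kernel width, and dimensions are random placeholders rather than the ALD-selected dictionary.

```python
import numpy as np

def gaussian_kernel(x, c, sigma=1.0):
    """Gaussian kernel k(x, c) = exp(-||x - c||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

def phi(x, centers, sigma=1.0):
    """Feature vector Phi(x): one Gaussian kernel per dictionary element."""
    return np.array([gaussian_kernel(x, c, sigma) for c in centers])

# Placeholder dimensions and dictionary (the paper selects the centers via ALD).
n_K, n_u, n_Phi = 10, 2, 50
rng = np.random.default_rng(0)
centers = rng.normal(size=(n_Phi, n_K))        # placeholder dictionary elements c_1, ..., c_{n_Phi}
W_a = np.zeros((n_Phi, n_u))                   # actor weights  W_{a,tau}
W_c = np.zeros((n_Phi, n_K))                   # critic weights W_{c,tau}

x_tilde = rng.normal(size=n_K)                 # current lifted error state
u_hat = W_a.T @ phi(x_tilde, centers)          # actor output,  Eq. (33a)
lam_hat = W_c.T @ phi(x_tilde, centers)        # critic output, Eq. (33b)
```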

With the estimated value function in (33b), and consistently with (32), the target optimal solution at the $i$-th iteration and the $\tau$-th stage is obtained as

u~i(x~^τ)=12γR1Bd,τλ^i(x~^τ+1),τ[k,k+N1].\tilde{u}^{i}(\hat{\tilde{x}}_{\tau})=-\dfrac{1}{2}\gamma R^{-1}B_{d,\tau}^{\top}\hat{\lambda}^{i}(\hat{\tilde{x}}_{\tau+1}),\tau\in[k,k+N-1]. (34)

Correspondingly, for $\tau\in[k,k+N]$, the value function $V^{i+1}(\hat{\tilde{x}}_{\tau})$ in Eq. (28) is written as

Vi+1(x~^τ)=𝔼[L(x~^τ,u~i(x~^τ))+γVi(x~^τ+1)].V^{i+1}(\hat{\tilde{x}}_{\tau})=\mathbb{E}\left[L(\hat{\tilde{x}}_{\tau},\tilde{u}^{i}(\hat{\tilde{x}}_{\tau}))+\gamma V^{i}(\hat{\tilde{x}}_{\tau+1})\right]. (35)

Note that x~^τ+1\hat{\tilde{x}}_{\tau+1} is obtained by the identified system dynamics (22). Substituting Eq. (33b) into Eq. (31), the target value function becomes

λi+1(x~^τ)={2Qx~^τ+γAd,τλ^i(x~^τ+1)+h(x^τ)x~^τ,τ[k,k+N1]2Px~^k+N+h(x^k+N)/x~^k+N,τ=k+N\lambda^{i+1}(\hat{\tilde{x}}_{\tau})=\left\{\begin{array}[]{cc}2Q\hat{\tilde{x}}_{\tau}+\gamma A_{d,\tau}^{\top}\hat{\lambda}^{i}(\hat{\tilde{x}}_{\tau+1})+\dfrac{\partial h(\hat{x}_{\tau})}{\partial\hat{\tilde{x}}_{\tau}},&\tau\in[k,k+N-1]\\ 2P\hat{\tilde{x}}_{k+N}+{\partial h({\hat{x}}_{k+N})}/{\partial\hat{\tilde{x}}_{k+N}},&\tau=k+N\end{array}\right.

(36)

The critic network aims at minimizing the error function between the target and the approximate value functions, i.e.,

ϵci(x~^τ)=12λi(x~^τ)λ^i(x~^τ)2.\epsilon_{c}^{i}(\hat{\tilde{x}}_{\tau})=\frac{1}{2}\lVert\lambda^{i}(\hat{\tilde{x}}_{\tau})-\hat{\lambda}^{i}(\hat{\tilde{x}}_{\tau})\rVert^{2}. (37)

Similarly, the actor network minimizes the error function between the target and the approximate solutions, defined as

ϵai(x~^τ)=12u~i(x~^τ)u~^i(x~^τ)2.\epsilon_{a}^{i}(\hat{\tilde{x}}_{\tau})=\frac{1}{2}\lVert\tilde{u}^{i}(\hat{\tilde{x}}_{\tau})-\hat{\tilde{u}}^{i}(\hat{\tilde{x}}_{\tau})\rVert^{2}. (38)

Therefore, one can derive the following update rules for the critic network and the actor network, respectively:

Wc,τi+1\displaystyle{W}_{c,\tau}^{i+1} =Wc,τiηcϵci(x~^τ)/Wc,τi,\displaystyle={W}_{c,\tau}^{i}-\eta_{c}{\partial\epsilon_{c}^{i}(\hat{\tilde{x}}_{\tau})}/{\partial W_{c,\tau}^{i}}, (39a)
Wa,τi+1\displaystyle{W}_{a,\tau}^{i+1} =Wa,τiηaϵai(x~^τ)/Wa,τi,\displaystyle={W}_{a,\tau}^{i}-\eta_{a}{\partial\epsilon_{a}^{i}(\hat{\tilde{x}}_{\tau})}/{\partial W_{a,\tau}^{i}}, (39b)

where $\eta_{c},\eta_{a}>0$ are the learning rates of the gradient updates.
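To make the updates concrete, the sketch below performs one actor/critic gradient step at a single non-terminal stage, following Eqs. (34) and (36)-(39); it reuses the `phi` feature map from the previous sketch, and the linearized matrices $A_{d}$, $B_{d}$, the barrier gradient, and the weights are assumed to be given.

```python
import numpy as np

def actor_critic_update(x_tilde, x_tilde_next, W_a, W_c, A_d, B_d, Q, R, dh_dx,
                        gamma, eta_a, eta_c, centers, sigma=1.0):
    """One actor/critic gradient step at a non-terminal stage, following Eqs. (34), (36)-(39).
    Reuses the feature map phi() from the previous sketch; all matrices are assumed given."""
    f = phi(x_tilde, centers, sigma)                         # Phi(x_tilde_tau)
    lam_next = W_c.T @ phi(x_tilde_next, centers, sigma)     # lambda_hat^i(x_tilde_{tau+1}), Eq. (33b)

    # Targets at stage tau (Eqs. (34) and (36)).
    u_target = -0.5 * gamma * np.linalg.solve(R, B_d.T @ lam_next)
    lam_target = 2.0 * Q @ x_tilde + gamma * A_d.T @ lam_next + dh_dx

    # Approximation errors (Eqs. (37) and (38)); the gradient of 0.5*||W^T f - target||^2
    # with respect to W is the outer product f * (W^T f - target)^T.
    e_a = W_a.T @ f - u_target
    e_c = W_c.T @ f - lam_target

    # Gradient-descent updates of the weights (Eq. (39)).
    return W_a - eta_a * np.outer(f, e_a), W_c - eta_c * np.outer(f, e_c)
```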

The overall VF-LPC algorithm is summarized in Algorithm 1. The convergence theorem of the Eb-RHRL algorithm is given below, and its proof is presented in Section V. In the following, expressions such as $\{\lambda^{i}\}$ and $\{\tilde{u}^{i}\}$ denote sequences over the prediction horizon at the $k$-th time instant; i.e.,

{λi}:=(λi(x~^k),,λi(x~^k+N1)),\{\lambda^{i}\}:=(\lambda^{i}(\hat{\tilde{x}}_{k}),\cdots,\lambda^{i}(\hat{\tilde{x}}_{k+N-1})),

and

{u~i}:=(u~i(x~^k),,u~i(x~^k+N1)).\{\tilde{u}^{i}\}:=(\tilde{u}^{i}(\hat{\tilde{x}}_{k}),\cdots,\tilde{u}^{i}(\hat{\tilde{x}}_{k+N-1})).
Theorem 1.

(Convergence of the Eb-RHRL). As the iteration number ii increases, sequences {λi}\{\lambda^{i}\} and {u~i}\{\tilde{u}^{i}\} converge to (31) and (32), respectively; i.e., limi{λi}={λ}\lim_{i\rightarrow\infty}\{\lambda^{i}\}=\{\lambda^{*}\} and limi{u~i}={u~}\lim_{i\rightarrow\infty}\{\tilde{u}^{i}\}=\{\tilde{u}^{*}\}.

Theorem 1 indicates that the Eb-RHRL approach in Algorithm 1 solves the optimal IMPC problem under state constraints by obtaining near-optimal solutions.

V Theoretic Analysis

This section presents the theoretical analysis of Theorem 1. Building upon the proofs in [38, 39, 11], we show that the Eb-RHRL algorithm, which integrates the game-based barrier function, converges to the optimal value function and the optimal solution, thereby solving the OMP problem.

Lemma 1.

Given a control sequence $\{\mu^{i}\}$ of arbitrary actions, $\Lambda^{i+1}(\hat{\tilde{x}}_{\tau})$ is defined by

Λi+1(x~^τ)=x~^τQx~^τ+μi(x~^τ)Rμi(x~^τ)+h(x^τ)+γΛi(x~^τ+1),\Lambda^{i+1}(\hat{\tilde{x}}_{\tau})=\hat{\tilde{x}}_{\tau}^{\top}Q\hat{\tilde{x}}_{\tau}+\mu^{i}(\hat{\tilde{x}}_{\tau})^{\top}R\mu^{i}(\hat{\tilde{x}}_{\tau})+h(\hat{x}_{\tau})+\gamma\Lambda^{i}(\hat{\tilde{x}}_{\tau+1}),

where μi():n𝒦nu\mu^{i}(\cdot):\mathbb{R}^{n_{\mathcal{K}}}\rightarrow\mathbb{R}^{n_{u}}. Rewrite (35) as follows:

Vi+1(x~^τ)=x~^τQx~^τ+u~i(x~^τ)Ru~i(x~^τ)+h(x^τ)+γVi(x~^τ+1),V^{i+1}(\hat{\tilde{x}}_{\tau})=\hat{\tilde{x}}_{\tau}^{\top}Q\hat{\tilde{x}}_{\tau}+\tilde{u}^{i}(\hat{\tilde{x}}_{\tau})^{\top}R\tilde{u}^{i}(\hat{\tilde{x}}_{\tau})+h(\hat{x}_{\tau})+\gamma V^{i}(\hat{\tilde{x}}_{\tau+1}),

where u~i(x~^τ)\tilde{u}^{i}(\hat{\tilde{x}}_{\tau}) is defined by (34). We can conclude that, if V0()=Λ0()=0V^{0}(\cdot)=\Lambda^{0}(\cdot)=0, then Vi(x~^τ)Λi(x~^τ)V^{i}(\hat{\tilde{x}}_{\tau})\leq\Lambda^{i}(\hat{\tilde{x}}_{\tau}) for all ii.

Proof.

The sequence $\{\tilde{u}^{i}\}$ is chosen to minimize $V^{i}(\hat{\tilde{x}}_{\tau})$ at every stage, whereas the elements of $\{\mu^{i}\}$ are arbitrary. Hence, by induction on $i$ from the common initialization $V^{0}(\cdot)=\Lambda^{0}(\cdot)=0$, $V^{i}(\hat{\tilde{x}}_{\tau})\leq\Lambda^{i}(\hat{\tilde{x}}_{\tau})$ holds for all $i$. $\hfill\blacksquare$

Lemma 1 is now used to prove Lemma 2.

Lemma 2.

Let the sequence $\{V^{i}\}$ be defined by (35). By applying an initial admissible control policy, the following conclusions hold: 1) There exists an upper bound $\bar{Z}^{i}(\hat{\tilde{x}}_{\tau})$ such that $0\leq V^{i}(\hat{\tilde{x}}_{\tau})\leq\bar{Z}^{i}(\hat{\tilde{x}}_{\tau})$ for all $i$. 2) If (32) is solvable, then $0\leq V^{i}(\hat{\tilde{x}}_{\tau})\leq V^{*}(\hat{\tilde{x}}_{\tau})\leq\bar{Z}^{i}(\hat{\tilde{x}}_{\tau})$ for all $i$, where $V^{*}(\hat{\tilde{x}}_{\tau})$ solves (30).

Proof.

To begin with, ηi():n𝒦nu\eta^{i}(\cdot):\mathbb{R}^{n_{\mathcal{K}}}\rightarrow\mathbb{R}^{n_{u}} is assumed to be an admissible policy. Let V0(x~^τ)=Z0(x~^τ)=0V^{0}(\hat{\tilde{x}}_{\tau})=Z^{0}(\hat{\tilde{x}}_{\tau})=0, where

Zi+1(x~^τ)=𝔼[L(x~^τ,ηi(x~^τ))+γZi(x~^τ+1)],i1.Z^{i+1}(\hat{\tilde{x}}_{\tau})=\mathbb{E}\left[L(\hat{\tilde{x}}_{\tau},\eta^{i}(\hat{\tilde{x}}_{\tau}))+\gamma Z^{i}(\hat{\tilde{x}}_{\tau+1})\right],\forall i\geq 1.

We define an upper bound

Z¯i(x~^τ)\displaystyle\bar{Z}^{i}(\hat{\tilde{x}}_{\tau}) =𝔼[j=ττ+N1γjτL(x~^j,ηi(x~^j))\displaystyle=\mathbb{E}\left[\sum_{j=\tau}^{\tau+N-1}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))\right. (40)
+γNZi(N1)(x~^τ+N)]\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \left.+\gamma^{N}Z^{i-(N-1)}(\hat{\tilde{x}}_{\tau+N})\right]
=𝔼[j=ττ+N1γjτL(x~^j,ηi(x~^j))+γNF(x~^τ+N)]\displaystyle=\mathbb{E}\left[\sum_{j=\tau}^{\tau+N-1}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))+\gamma^{N}F(\hat{\tilde{x}}_{\tau+N})\right]

with the dynamic model (22) and $L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))=\hat{\tilde{x}}_{j}^{\top}Q\hat{\tilde{x}}_{j}+(\eta^{i}(\hat{\tilde{x}}_{j}))^{\top}R\eta^{i}(\hat{\tilde{x}}_{j})+h(\hat{x}_{j})$. When propagating $Z^{i+1}(\hat{\tilde{x}}_{\tau})$ from the current stage $\tau$ to $\tau+N$, two cases arise and are analyzed below.

Case 1 (iN1i\leq N-1):

Zi+1(x~^τ)\displaystyle Z^{i+1}(\hat{\tilde{x}}_{\tau}) =𝔼[L(x~^τ,ηi(x~^τ))+γZi(x~^τ+1)]\displaystyle=\mathbb{E}\left[L(\hat{\tilde{x}}_{\tau},\eta^{i}(\hat{\tilde{x}}_{\tau}))+\gamma Z^{i}(\hat{\tilde{x}}_{\tau+1})\right] (41)
=𝔼[j=ττ+1γjτL(x~^j,ηi(x~^j))+γ2Zi1(x~^τ+2)]\displaystyle=\mathbb{E}\left[\sum_{j=\tau}^{\tau+1}\gamma^{j-\tau}L\left(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j})\right)+\gamma^{2}Z^{i-1}(\hat{\tilde{x}}_{\tau+2})\right]
\displaystyle\ldots
=𝔼[j=ττ+iγjτL(x~^j,ηi(x~^j))].\displaystyle=\mathbb{E}\left[\sum_{j=\tau}^{\tau+i}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))\right].

Since h(x^j)0h(\hat{x}_{j})\geq 0, by applying an admissible control ηi()\eta^{i}(\cdot) to the system, we can obtain

j=ττ+i1γjτL(x~^j,ηi(x~^j))<j=ττ+i2γjτL(x~^j,ηi(x~^j)),\sum_{j=\tau}^{\tau+i_{1}}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))<\sum_{j=\tau}^{\tau+i_{2}}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j})), (42)

where $i_{1},i_{2}\in[0,N]$ and $i_{1}<i_{2}$. Combining Eqs. (41) and (42), we have

Zi+1(x~^τ)=𝔼[j=ττ+iγjτL(x~^j,ηi(x~^j))]<𝔼[j=ττ+N1γjτL(x~^j,ηi(x~^j))+γNF(x~^τ+N)]=Z¯i(x~^τ).\begin{aligned} &Z^{i+1}(\hat{\tilde{x}}_{\tau})\\ &=\mathbb{E}\left[\sum_{j=\tau}^{\tau+i}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))\right]\\ &<\mathbb{E}\left[\sum_{j=\tau}^{\tau+N-1}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))+\gamma^{N}F(\hat{\tilde{x}}_{\tau+N})\right]\\ &=\bar{Z}^{i}(\hat{\tilde{x}}_{\tau}).\end{aligned}

(43)

Case 2 (iNi\geq N):

Zi+1(x~^τ)=𝔼[j=ττ+N1γjτL(x~^j,ηi(x~^j))+γNF(x~^τ+N)]=Z¯i(x~^τ).\begin{aligned} &Z^{i+1}(\hat{\tilde{x}}_{\tau})\\ &=\mathbb{E}\left[\sum_{j=\tau}^{\tau+N-1}\gamma^{j-\tau}L(\hat{\tilde{x}}_{j},\eta^{i}(\hat{\tilde{x}}_{j}))+\gamma^{N}F(\hat{\tilde{x}}_{\tau+N})\right]=\bar{Z}^{i}(\hat{\tilde{x}}_{\tau}).\end{aligned}

(44)

Combining Eqs. (43) and (44), Zi+1(x~^τ)Z¯i(x~^τ)Z^{i+1}(\hat{\tilde{x}}_{\tau})\leq\bar{Z}^{i}(\hat{\tilde{x}}_{\tau}) holds.

According to Lemma 1, setting the sequence {μi}\{\mu^{i}\} as {ηi}\{\eta^{i}\} and the sequence of the value function {Λi}\{\Lambda^{i}\} as {Zi}\{Z^{i}\}, it follows that Vi(x~^τ)Zi(x~^τ)Z¯i(x~^τ)V^{i}(\hat{\tilde{x}}_{\tau})\leq Z^{i}(\hat{\tilde{x}}_{\tau})\leq\bar{Z}^{i}(\hat{\tilde{x}}_{\tau}) for all ii, which proves 1). By setting {ηi}\{\eta^{i}\} as {u~}\{\tilde{u}^{*}\}, one has 0Vi(x~^τ)V(x~^τ)Z¯i(x~^τ)0\leq V^{i}(\hat{\tilde{x}}_{\tau})\leq V^{*}(\hat{\tilde{x}}_{\tau})\leq\bar{Z}^{i}(\hat{\tilde{x}}_{\tau}), and part 2) of Lemma 2 is now proved. \hfill\blacksquare

Lemma 3.

[11, Monotonicity property] Let Vi(x~^τ)V^{i}(\hat{\tilde{x}}_{\tau}) be defined by (35). With V0(x~^τ)=0V^{0}(\hat{\tilde{x}}_{\tau})=0, Vi+1(x~^τ)Vi(x~^τ)V^{i+1}(\hat{\tilde{x}}_{\tau})\geq V^{i}(\hat{\tilde{x}}_{\tau}) holds.

Proof.

The proof is given in[38].

With Lemmas 2 and 3 in hand, we now prove Theorem 1.

Proof of Theorem 1.

Following part 2) of Lemma 2, one has $\lim_{i\rightarrow\infty}V^{i}(\hat{\tilde{x}}_{\tau})\leq V^{*}(\hat{\tilde{x}}_{\tau})$. According to Lemmas 2 and 3, the sequence $\{V^{i}(\hat{\tilde{x}}_{\tau})\}$ is non-decreasing and bounded above, so it converges to some limit $V^{\infty}(\hat{\tilde{x}}_{\tau})$ satisfying $V^{\infty}(\hat{\tilde{x}}_{\tau})\geq V^{*}(\hat{\tilde{x}}_{\tau})$. These two facts imply $\lim_{i\rightarrow\infty}V^{i}(\hat{\tilde{x}}_{\tau})=V^{*}(\hat{\tilde{x}}_{\tau})$ and $\lim_{i\rightarrow\infty}\tilde{u}^{i}(\hat{\tilde{x}}_{\tau})=\tilde{u}^{*}(\hat{\tilde{x}}_{\tau})$; hence $\lim_{i\rightarrow\infty}\{V^{i}\}=\{V^{*}\}$ and $\lim_{i\rightarrow\infty}\{\tilde{u}^{i}\}=\{\tilde{u}^{*}\}$. Accordingly, $\lambda^{i}(\hat{\tilde{x}}_{\tau})$ defined by Eq. (36) converges to $\lambda^{*}(\hat{\tilde{x}}_{\tau})$, and the sequence $\{\lambda^{i}\}$ converges to $\{\lambda^{*}\}$ as $i\rightarrow\infty$. $\hfill\blacksquare$

VI Simulation and Experimental Results

In this section, we present simulation and experimental results to validate the proposed VF-LPC approach. The algorithm was deployed on a computer running Windows 11 with an Intel Core i7-11800H CPU @ 2.30 GHz. We first compare VF-LPC with advanced IMPC methods in the CarSim software. Then, the effectiveness of the online model-compensation strategy is verified through a tracking control task. Simulations on a quadrotor UAV were also performed. Finally, we tested the VF-LPC approach on an actual Hongqi E-HS3 ground vehicle; the platform is shown in the right part of Fig. 4.

VI-A Path Planning in an Obstacle-Dense Environment

To evaluate the effectiveness of the VF-LPC approach, we compare it with commonly used planning approaches in a scenario containing static obstacles; see Fig. 7. The baselines are Hybrid A*[40], RRT*-CFS[41], Timed Elastic Bands (TEB)[42], and OBTPAP[43], which iteratively solves a time-optimal motion planning problem. In these tests, we use the analytic kinematic vehicle model with the state variable $[X,Y,\psi]$, representing the $X$- and $Y$-coordinates and the heading, respectively. All the approaches are implemented in MATLAB 2023a.

VI-A1 Evaluation metrics

To evaluate the planning methods, we employ several performance metrics. The first is the trajectory length from the initial point to the destination, denoted as $L_{p}$, i.e., $L_{p}=\sum_{k=1}^{M-1}\lVert(X_{k+1},Y_{k+1})-(X_{k},Y_{k})\rVert$, where $M$ is the number of trajectory points. The second metric, $t_{\mathrm{cal}}$, is the calculation time of trajectory planning. The third metric, $\mathrm{SD}$, indicates whether the planned trajectory maintains a safe distance from obstacles.
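As an illustration, a minimal Python sketch of the trajectory-length metric $L_{p}$ is given below; the three-point path at the end is a placeholder example.

```python
import numpy as np

def trajectory_length(waypoints):
    """L_p: sum of Euclidean distances between consecutive (X, Y) waypoints."""
    pts = np.asarray(waypoints, dtype=float)                 # shape (M, 2)
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

# Example: a straight three-point path of total length 2.0 m.
print(trajectory_length([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]))   # -> 2.0
```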

VI-A2 Analyses of the results

As shown in TABLE I, the paths planned by VF-LPC, TEB, and OBTPAP are of similar length. Although the trajectory of Hybrid A* is the shortest, it is deemed unsafe because it does not maintain a safe distance when avoiding the first obstacle. Regarding computational efficiency, VF-LPC demonstrates the best performance among the compared methods.

TABLE I: Comparative Analysis of Planning Results Using Different Methods in the Obstacle-Dense Environment.

Metrics            VF-LPC            Hybrid A* [40]   RRT*-CFS [41]   TEB [42]   OBTPAP [43]
$L_{p}$ (m)        47.44             46.16            54.21           48.34      47.54
$t_{\mathrm{cal}}$ (s)   $\mathbf{0.29}$   2.29             50.28           4.80       5.72
$\mathrm{SD}$?     ✓                 ×                ✓               ✓          ✓

The bold indicates the best result of each metric.

Figure 7: The planned results using different comparison methods.

VI-B Collision-Avoidance Simulation in High-Fidelity CarSim

Tracking a desired path while avoiding static and moving obstacles is a fundamental task. To demonstrate the superiority of VF-LPC, we compare it with MPC-CBF[44], LMPCC[45], RHRL-KDP[10], and CFS[46] under different metrics.

VI-B1 Simulation settings

All the comparison methods utilize the vehicle dynamics model (45) and solve the resulting nonlinear optimization problem using the IPOPT solver[47] within the CasADi framework[48]. The vehicle dynamics are described as follows:

[X˙Y˙ψ˙v˙xv˙yω˙]=[vxcosψvysinψvxsinψ+vycosψωvyω+ax2Caf(δfmvy+lfωmvx)+2Carlrωvymvxvxω2Iz[lfCaf(δfvy+lfωvx)lrCarlrωvyvx]],\left[\begin{array}[]{l}\dot{X}\\ \dot{Y}\\ \dot{\psi}\\ \dot{v}_{x}\\ \dot{v}_{y}\\ \dot{\omega}\\ \end{array}\right]=\left[\begin{array}[]{c}v_{x}\cos\psi-v_{y}\sin\psi\\ v_{x}\sin\psi+v_{y}\cos\psi\\ \omega\\ v_{y}\omega+a_{x}\\ 2C_{af}(\frac{\delta_{f}}{m}-\frac{v_{y}+l_{f}\omega}{mv_{x}})+2C_{ar}\frac{l_{r}\omega-v_{y}}{mv_{x}}-v_{x}\omega\\ \frac{2}{I_{z}}\left[l_{f}C_{af}(\delta_{f}-\frac{v_{y}+l_{f}\omega}{v_{x}})-l_{r}C_{ar}\frac{l_{r}\omega-v_{y}}{v_{x}}\right]\\ \end{array}\right],

(45)

where x=[X,Y,ψ,vx,vy,ω]6x=[X,Y,\psi,v_{x},v_{y},\omega]^{\top}\in\mathbb{R}^{6} is the state vector, X,YX,Y are the global horizontal and vertical coordinates of the vehicle, respectively, ψ\psi is the yaw angle, vx,vyv_{x},v_{y} denote the longitudinal and lateral velocities, respectively, ω\omega denotes the yaw rate, lfl_{f} and lrl_{r} are the distances from the center of gravity (CoG) to the front and rear wheels, respectively; Caf,CarC_{af},C_{ar} represent the cornering stiffnesses of the front and rear wheels, respectively; IzI_{z} denotes the yaw moment of inertia; mm is the vehicle’s mass. Their values can be found in TABLE II. In this model, the acceleration axa_{x} and the steering angle δf\delta_{f} are two variables of the control vector uu, i.e., u=[ax,δf]u=[a_{x},\delta_{f}]^{\top}. The reference state is xr=[Xr,Yr,ψr,vxr,vyr,ωr]{x_{r}}=[X_{r},Y_{r},\psi_{r},v_{x}^{r},v_{y}^{r},\omega_{r}]^{\top}.

TABLE II: Vehicle Dynamic Parameters

$m$        $I_{z}$                  $l_{f}$    $l_{r}$    $C_{af}$       $C_{ar}$
2257 kg    3524.9 kg$\cdot$m$^{2}$   1.33 m     1.81 m     66900 N/rad    62700 N/rad
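For reference, the following Python sketch implements the bicycle model (45) with the parameter values of TABLE II; the forward-Euler step at the end is for illustration only and is not the integrator used by the comparison methods.

```python
import numpy as np

# Vehicle parameters from TABLE II.
m, Iz = 2257.0, 3524.9            # mass [kg], yaw moment of inertia [kg*m^2]
lf, lr = 1.33, 1.81               # CoG-to-front/rear-axle distances [m]
Caf, Car = 66900.0, 62700.0       # front/rear cornering stiffnesses [N/rad]

def bicycle_dynamics(x, u):
    """Continuous-time dynamic bicycle model of Eq. (45).
    x = [X, Y, psi, vx, vy, omega], u = [ax, delta_f]; requires vx != 0."""
    X, Y, psi, vx, vy, w = x
    ax, df = u
    dX = vx * np.cos(psi) - vy * np.sin(psi)
    dY = vx * np.sin(psi) + vy * np.cos(psi)
    dvx = vy * w + ax
    dvy = (2.0 * Caf * (df / m - (vy + lf * w) / (m * vx))
           + 2.0 * Car * (lr * w - vy) / (m * vx) - vx * w)
    dw = (2.0 / Iz) * (lf * Caf * (df - (vy + lf * w) / vx)
                       - lr * Car * (lr * w - vy) / vx)
    return np.array([dX, dY, w, dvx, dvy, dw])

# A single forward-Euler step (illustrative integration only).
x0 = np.array([0.0, 0.0, 0.0, 7.0, 0.0, 0.0])
x_next = x0 + 0.01 * bicycle_dynamics(x0, np.array([0.5, 0.02]))
```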

For VF-LPC, we employed the approach outlined in [21] to conduct system identification by collecting vehicle motion data from the CarSim solver. This yields a linear time-invariant system model (4) for the VF-LPC algorithm, where $n_{\mathcal{K}}=10$ and $n_{u}=2$. To construct an error model for the subsequent algorithm design, we subtract the desired state from the current state, i.e., $\tilde{x}=x-x_{r}$, and likewise the desired control from the system control, i.e., $\tilde{u}=[a_{x},\delta_{f}]^{\top}$.

VI-B2 Evaluation metrics

The desired speed $v_{d}$ of all the methods is set to 25 km/h. To evaluate the VF-LPC approach, we compare it with the other IMPC algorithms under several metrics: Aver. S.T. (average solution time per time step), $J_{\rm{Lat}}$ (cost of the lateral error), $J_{\rm{Heading}}$ (cost of the heading error), and $J_{\rm{Con}}$ (control cost). Specifically, they are defined by $J_{\rm{Lat}}=\lVert(X-X_{r})\sin\psi_{r}-(Y-Y_{r})\cos\psi_{r}\rVert_{q_{1}}^{2}$, $J_{\rm{Heading}}=\lVert\psi-\psi_{r}\rVert_{q_{2}}^{2}$, and $J_{\rm{Con}}=\lVert\tilde{u}\rVert_{R}^{2}$, where $q_{1},q_{2}>0$ and $R\in\mathbb{R}^{n_{u}\times n_{u}}$ is positive definite. We define a weighted average cost to evaluate the overall performance, which is

J_{\rm{MC}}=\frac{1}{M}\sum_{k=1}^{M}\left(J_{k,\rm{Lat}}+J_{k,\rm{Heading}}+J_{k,\rm{Con}}\right),

where MM is the number of waypoints of the driving trajectory. The other quantitative metrics are route length LpL_{p}, completion time tcomt_{\mathrm{com}}, and calculation time tcalt_{\rm{cal}}.
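As an illustration, the sketch below evaluates the per-step costs $J_{\rm{Lat}}$, $J_{\rm{Heading}}$, and $J_{\rm{Con}}$ and averages them into $J_{\rm{MC}}$; the weights $q_{1}$, $q_{2}$, and $R$ are placeholder values.

```python
import numpy as np

def step_cost(x, x_r, u_tilde, q1=1.0, q2=1.0, R=np.eye(2)):
    """Per-step cost J_Lat + J_Heading + J_Con for one waypoint; x = [X, Y, psi, ...]."""
    X, Y, psi = x[0], x[1], x[2]
    Xr, Yr, psir = x_r[0], x_r[1], x_r[2]
    e_lat = (X - Xr) * np.sin(psir) - (Y - Yr) * np.cos(psir)
    J_lat = q1 * e_lat ** 2                                         # lateral-error cost
    J_head = q2 * (psi - psir) ** 2                                 # heading-error cost
    J_con = float(np.asarray(u_tilde) @ R @ np.asarray(u_tilde))    # control cost
    return J_lat + J_head + J_con

def weighted_average_cost(states, refs, controls):
    """J_MC: average of the summed per-step costs over the M driven waypoints."""
    return float(np.mean([step_cost(x, xr, u) for x, xr, u in zip(states, refs, controls)]))
```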

VI-B3 Analyses of the results

The testing scenario requires the vehicle to track a reference path (black line in Fig. 8) while avoiding collisions with static and moving obstacles.

Figure 8: Simulation results of avoiding static and moving obstacles for four IMPC methods. The black arrow denotes the moving direction of the dynamic obstacle, and the black thin line is the desired path. Labels ‘EP’ and ‘SP’ denote the endpoint and start point, respectively.
TABLE III: Performance Evaluations of Integrated Motion Planning and Control Methods

Metrics                  VF-LPC              MPC-CBF [44]   LMPCC [45]   RHRL-KDP [10]   CFS [46]
$J_{\rm{MC}}$            $\mathbf{61.50}$    76.76          79.09        155.98          105.03
$L_{p}$ (m)              $\mathbf{252.54}$   287.45         266.76       307.29          273.71
$t_{\mathrm{com}}$ (s)   $\mathbf{38.40}$    41.13          38.88        44.72           39.03
$t_{\mathrm{cal}}$ (s)   $\mathbf{0.006}$    0.15           0.15         0.07            0.12

The bold indicates the best result of each metric.

As seen in TABLE III, VF-LPC achieves the lowest motion control cost $J_{\rm{MC}}$, the shortest trajectory, and the least completion time. Its computational cost is also lower than that of MPC-CBF, LMPCC, RHRL-KDP, and CFS: because VF-LPC optimizes the nonlinear motion planning problem with a linear time-invariant Koopman-based vehicle model under the Eb-RHRL scheme, its computation time is the smallest. In contrast, the MPC-based methods and RHRL-KDP require solving nonlinear optimization problems online, which results in a larger computational burden.

VI-C Model Uncertainties Learning

As mentioned in previous sections, the offline-trained deep Koopman model may not perfectly reflect the exact system dynamics when deployed online due to external uncertainties. This can degrade both planning and control performance and may even cause safety issues. In this section, we demonstrate sparse GP-based learning of the model uncertainties of the deep Koopman model.

The uncertainty in the system dynamics negatively affects the planning and control performance of robots. To mitigate this influence, a quantized sparse GP [9] is utilized to compensate for the difference between the offline-trained deep Koopman model and the exact vehicle model. The simulated dynamic model (45) reveals that the vehicle states $v_{x},v_{y},\omega$ and $\rho_{e}(x_{k})$ are primarily affected by the dynamic parameters. Therefore, we further derive $B_{d}=[\mathbf{0}_{3}\ I_{3}\ \mathbf{0}_{3\times n_{\rho_{e}}}]^{\top}$ and $g(x,u)=g(v_{x},v_{y},\omega,\rho_{e}^{\top},a_{x},\delta_{f}):\mathbb{R}^{n_{\rho_{e}}+5}\rightarrow\mathbb{R}^{3}$.

The algorithm was simulated on a racing track, and the desired speed $v_{d}$ is set to 6 m/s. Note that we collect motion data from the real vehicle (right half of Fig. 4) to train an offline nominal deep Koopman model with $n_{\rho_{e}}=6$ and $x=[X,Y,\psi,v_{x},v_{y},\omega]^{\top}\in\mathbb{R}^{6}$. The exact vehicle parameters are: $m=1257\,\rm{kg}$, $I_{z}=1524.9\,\rm{kg\cdot m^{2}}$, $C_{af}=8790\,\rm{N/rad}$, and $C_{ar}=30400\,\rm{N/rad}$. The i.i.d. process noise $w_{k}\in\mathbb{R}^{6}$ with zero mean and variance $\sigma=0.002$ is added to the vehicle dynamics. Specifically, $g(x,u)$ maps $[x,u]^{\top}$ to $[v_{x},v_{y},\omega]^{\top}$.
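For illustration, the sketch below fits the residual $g(x,u)$ on $[v_{x},v_{y},\omega]^{\top}$ with a standard full Gaussian process from scikit-learn; the paper instead uses a quantized sparse GP, and the training data here are random placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder training data: inputs z = [vx, vy, omega, ax, delta_f] (the paper additionally
# includes the lifted coordinates rho_e) and residuals of the nominal model on [vx, vy, omega].
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))
Y = rng.normal(scale=0.05, size=(200, 3))

kernel = RBF(length_scale=np.ones(Z.shape[1])) + WhiteKernel(noise_level=1e-3)
gps = [GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(Z, Y[:, d])
       for d in range(3)]                      # one GP per residual dimension

def predict_residual(z):
    """Predictive mean of g(x, u); in the paper this is added to the nominal Koopman model."""
    return np.array([gp.predict(np.atleast_2d(z))[0] for gp in gps])

print(predict_residual(np.zeros(5)))
```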

Figure 9: Tracking error comparison between VF-LPC (w/o ML) and VF-LPC (w/ ML).

To evaluate the effectiveness of our algorithm in learning model uncertainties, VF-LPC with and without model learning, abbreviated as VF-LPC (w/ ML) and VF-LPC (w/o ML), are tested separately on the racing track. At the initial stage, i.e., during the first 100 m, we update the dictionary online; it is then sparsified with Approximate Linear Dependence (ALD) similarly to [9]. With the sparsified dictionary, we update the Jacobian matrices (20) and compensate (4) over the remaining distance.

From the simulation results in Fig. 9, a notable improvement can be observed in the lateral tracking error of VF-LPC (w/ ML) compared to VF-LPC (w/o ML). The average lateral error over the entire test is lower for VF-LPC (w/ ML) than for VF-LPC (w/o ML). These results validate the capability of VF-LPC to handle model uncertainty and enhance control performance.

VI-D Validation on Unmanned Aerial Vehicles

In this subsection, we demonstrate the effectiveness of VF-LPC on motion planning tasks of a different class of mobile robots, i.e., quadrotor UAVs, which have intricate dynamics. We use a widely-used model [50] to simulate the dynamics.

Firstly, we conduct data-driven modeling of the quadrotor UAV using deep Koopman operators[21], and then employ VF-LPC to generate the robot's maneuvers in a 3D environment, as illustrated in Fig. 10. The state variables of the training data consist of the XYZ coordinates and velocities ($v_{x},v_{y},v_{z}$) of the quadrotor UAV, with the accelerations in the XYZ directions as control variables. A total of 148 data sequences, each comprising approximately 4500 samples, were collected randomly in the state and control space for training and model validation. The 6-dimensional state vector $[X,Y,Z,v_{x},v_{y},v_{z}]^{\top}$ was processed through an encoding layer, resulting in a final system model with 18 state variables and 3 control variables (i.e., $n_{\mathcal{K}}=18$, $n_{u}=3$). The obtained model is then incorporated into VF-LPC, which plans 3-axis accelerations for attitude control. The VF-LPC approach not only avoids multiple static obstacles but also handles suddenly appearing obstacles (see Fig. 10), and the average speed of the vehicle reaches 1 m/s.
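To illustrate how such a lifted linear model can be used for prediction, the sketch below encodes the 6-dimensional UAV state into an 18-dimensional lifted state and rolls the linear model forward over a control sequence; the encoder and the matrices $A$, $B$ are untrained placeholders, not the trained deep Koopman model of [21].

```python
import numpy as np

n_x, n_K, n_u = 6, 18, 3                # raw state, lifted state, and control dimensions (as in the text)
rng = np.random.default_rng(0)
A = np.eye(n_K) + 0.01 * rng.normal(size=(n_K, n_K))   # untrained placeholder Koopman matrices
B = 0.05 * rng.normal(size=(n_K, n_u))
W1, b1 = rng.normal(size=(n_K - n_x, n_x)), np.zeros(n_K - n_x)

def lift(x):
    """Encoder: keep the physical state and append learned features (placeholder tanh layer)."""
    return np.concatenate([x, np.tanh(W1 @ x + b1)])

def rollout(x0, controls):
    """Propagate the lifted linear model z_{k+1} = A z_k + B u_k over a control sequence."""
    z = lift(x0)
    traj = [z[:n_x]]                    # the first n_x lifted coordinates recover [X, Y, Z, vx, vy, vz]
    for u in controls:
        z = A @ z + B @ u
        traj.append(z[:n_x])
    return np.array(traj)

traj = rollout(np.zeros(n_x), [np.array([0.1, 0.0, 0.0])] * 20)
```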

The results show that the VF-LPC approach can achieve near-optimal motion planning for robots with offline-trained data-driven dynamics.

Figure 10: Validating test on UAVs: Avoiding multiple static and suddenly appearing obstacles when tracking a desired trajectory.

VI-E Real-World Experiments

To further validate the effectiveness of VF-LPC, real-world vehicular experiments were conducted on the Hongqi E-HS3 platform, which is shown in module B-(1) of Fig. 4.

At each time instant within the prediction horizon, the desired trajectory and the obstacles are first transformed into the vehicle body's local coordinate frame. Then, we fit the desired path points with polynomial curves, obtaining the desired path $\mathcal{P}$ via (13). Subsequently, the safe trajectory generated by the kinodynamic composite vector field is transformed back into the global coordinate frame.
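A minimal sketch of the global-to-body and body-to-global transformations used in this step is given below, assuming planar rigid-body frames defined by the vehicle pose $(X_{\rm ego}, Y_{\rm ego}, \psi_{\rm ego})$.

```python
import numpy as np

def global_to_body(points, x_ego, y_ego, psi_ego):
    """Express global (X, Y) points (desired path, obstacles) in the vehicle-body frame."""
    c, s = np.cos(psi_ego), np.sin(psi_ego)
    R = np.array([[c, s], [-s, c]])                  # rotation from global to body axes
    return (np.asarray(points, dtype=float) - np.array([x_ego, y_ego])) @ R.T

def body_to_global(points, x_ego, y_ego, psi_ego):
    """Map body-frame points (e.g., the planned safe trajectory) back to the global frame."""
    c, s = np.cos(psi_ego), np.sin(psi_ego)
    R = np.array([[c, -s], [s, c]])                  # rotation from body to global axes
    return np.asarray(points, dtype=float) @ R.T + np.array([x_ego, y_ego])

# Round trip for a single point as a quick check.
p_body = global_to_body([[5.0, 2.0]], 1.0, 1.0, np.pi / 6)
p_back = body_to_global(p_body, 1.0, 1.0, np.pi / 6)   # recovers [[5.0, 2.0]]
```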

VI-E1 Multiple Static Obstacles Avoidance

In this scenario, we evaluate the obstacle avoidance capability of the VF-LPC approach. As illustrated in Fig. 11, the black dashed line denotes the desired path. Due to the vehicle's long wheelbase of 4.9 m, its turning radius is large, while the desired path is spatially constrained and small in size, posing a significant challenge for safe obstacle avoidance and tracking. We set the speed to 1.5 m/s, and the vehicle successfully reaches the destination while avoiding multiple obstacles. VF-LPC plans a smooth trajectory to guide the finite-horizon actor-critic learning process, which also reflects the effectiveness and advantages of the kinodynamic guiding vector field in satisfying the kinodynamic constraints.

Figure 11: Avoiding multiple static obstacles when tracking a size-constrained desired path. The black vehicles represent the intelligent vehicle at different time instants. Labels ‘EP’ and ‘SP’ denote the endpoint and starting point.

VI-E2 Moving Obstacles Avoidance

As shown in Fig. 12, a human driver first drove the intelligent vehicle to generate the desired path $\mathcal{P}$ in this scenario. The test is set up as follows: the moving obstacles start to move at different preset velocities when their distance to the intelligent vehicle is less than 25 m, thereby validating the emergency collision avoidance capability. Using a gradient of gray, we label the obstacles' positions at different moments during their motion, where the darkest color represents the initial moment.

Figure 12: Avoiding multiple moving obstacles when tracking a desired path. Labels ‘EP’ and ‘SP’ denote the endpoint and start point, respectively.

From the overall tracking results in Fig. 12, the intelligent vehicle keeps a safe distance from the obstacles at all times and returns to the desired path smoothly. In addition, our method exhibits small lateral tracking errors and longitudinal velocity deviations. Moreover, the vehicle's maximum speed reaches 2.4 m/s. Finally, it arrives at the endpoint successfully and completes the motion planning and control task.

VII Conclusion and Future Work

This paper presents the Vector Field-guided Learning Predictive Control (VF-LPC) approach for mobile robots with safety guarantees, offering a framework for the Integrated Motion Planning and Control (IMPC) of robots. VF-LPC designs a kinodynamic guiding vector field for safe robot maneuvering, which is a notable improvement over existing composite vector fields. In addition, the learned deep Koopman model is updated online by sparse GPs to improve safety and control performance; it is then incorporated into LPC to solve nonlinear IMPC problems. Rigorous theoretical analysis is provided to establish the convergence of the online learning process.

VF-LPC is evaluated against motion planning methods that employ MPC and RL in the high-fidelity CarSim software. The results show that VF-LPC outperforms them in terms of completion time, route length, and average solution time. To further demonstrate the effectiveness and generalization of the proposed approach, we carried out path-tracking control tests of a mobile vehicle on a racing track to validate the model-uncertainty learning capability, and we also successfully implemented the approach on quadrotor UAVs. Finally, we conducted real-world experiments on a Hongqi E-HS3 vehicle.

Our work has several possible future directions. 1) With an additional prediction module, VF-LPC could be promising even in higher-speed tasks. 2) In unknown environments, one can construct safer barrier functions in real-time using the information perceived by onboard sensors.

References

  • [1] W. Yao, H. G. de Marina, B. Lin, and M. Cao, “Singularity-free guiding vector field for robot navigation,” IEEE Transactions on Robotics, vol. 37, no. 4, pp. 1206–1221, 2021.
  • [2] D. Panagou, “Motion planning and collision avoidance using navigation vector fields,” in 2014 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2014, pp. 2513–2518.
  • [3] A. Marchidan and E. Bakolas, “Collision avoidance for an unmanned aerial vehicle in the presence of static and moving obstacles,” Journal of Guidance, Control, and Dynamics, vol. 43, no. 1, pp. 96–110, 2020.
  • [4] Y. A. Kapitanyuk, A. V. Proskurnikov, and M. Cao, “A guiding vector-field algorithm for path-following control of nonholonomic mobile robots,” IEEE Transactions on Control Systems Technology, vol. 26, no. 4, pp. 1372–1385, 2018.
  • [5] W. Yao, B. Lin, B. D. Anderson, and M. Cao, “Guiding vector fields for following occluded paths,” IEEE Transactions on Automatic Control, vol. 67, no. 8, pp. 4091–4106, 2022.
  • [6] Y. K. Nakka and S.-J. Chung, “Trajectory optimization of chance-constrained nonlinear stochastic systems for motion planning under uncertainty,” IEEE Transactions on Robotics, vol. 39, no. 1, pp. 203–222, 2022.
  • [7] C. Pek and M. Althoff, “Fail-safe motion planning for online verification of autonomous vehicles using convex optimization,” IEEE Transactions on Robotics, vol. 37, no. 3, pp. 798–814, 2020.
  • [8] P. Deptula, H.-Y. Chen, R. A. Licitra, J. A. Rosenfeld, and W. E. Dixon, “Approximate optimal motion planning to avoid unknown moving avoidance regions,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 414–430, 2019.
  • [9] Y. Lu, X. Zhang, X. Xu, and W. Yao, “Learning-based near-optimal motion planning for intelligent vehicles with uncertain dynamics,” IEEE Robotics and Automation Letters, vol. 9, no. 2, pp. 1532–1539, 2024.
  • [10] X. Zhang, Y. Jiang, Y. Lu, and X. Xu, “Receding-horizon reinforcement learning approach for kinodynamic motion planning of autonomous vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 3, pp. 556–568, 2022.
  • [11] X. Xu, H. Chen, C. Lian, and D. Li, “Learning-based predictive control for discrete-time nonlinear systems with stochastic disturbances,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 6202–6213, 2018.
  • [12] B. A. H. Vicente, S. S. James, and S. R. Anderson, “Linear system identification versus physical modeling of lateral–longitudinal vehicle dynamics,” IEEE Transactions on Control Systems Technology, vol. 29, no. 3, pp. 1380–1387, 2020.
  • [13] T. Gräber, S. Lupberger, M. Unterreiner, and D. Schramm, “A hybrid approach to side-slip angle estimation with recurrent neural networks and kinematic vehicle models,” IEEE Transactions on Intelligent Vehicles, vol. 4, no. 1, pp. 39–47, 2019.
  • [14] N. A. Spielberg, M. Brown, N. R. Kapania, J. C. Kegelman, and J. C. Gerdes, “Neural network vehicle models for high-performance automated driving,” Science Robotics, vol. 4, no. 28, p. eaaw1975, 2019.
  • [15] M. Da Lio, D. Bortoluzzi, and G. P. Rosati Papini, “Modelling longitudinal vehicle dynamics with neural networks,” Vehicle System Dynamics, vol. 58, no. 11, pp. 1675–1693, 2020.
  • [16] P. J. Schmid, “Dynamic mode decomposition of numerical and experimental data,” Journal of Fluid Mechanics, vol. 656, pp. 5–28, 2010.
  • [17] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley, “A data–driven approximation of the koopman operator: Extending dynamic mode decomposition,” Journal of Nonlinear Science, vol. 25, pp. 1307–1346, 2015.
  • [18] I. Kevrekidis, C. W. Rowley, and M. Williams, “A kernel-based method for data-driven koopman spectral analysis,” Journal of Computational Dynamics, vol. 2, no. 2, pp. 247–265, 2016.
  • [19] B. Lusch, J. N. Kutz, and S. L. Brunton, “Deep learning for universal linear embeddings of nonlinear dynamics,” Nature Communications, vol. 9, no. 1, p. 4950, 2018.
  • [20] S. E. Otto and C. W. Rowley, “Linearly recurrent autoencoder networks for learning dynamics,” SIAM Journal on Applied Dynamical Systems, vol. 18, no. 1, pp. 558–593, 2019.
  • [21] Y. Xiao, X. Zhang, X. Xu, Y. Lu, and J. Li, “DDK: A deep koopman approach for longitudinal and lateral control of autonomous ground vehicles,” in 2023 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2023, pp. 975–981.
  • [22] B. Luders, M. Kothari, and J. How, “Chance constrained RRT for probabilistic robustness to environmental uncertainty,” in AIAA Guidance, Navigation, and Control Conference, 2010, p. 8160.
  • [23] G. S. Aoude, B. D. Luders, J. M. Joseph, N. Roy, and J. P. How, “Probabilistically safe motion planning to avoid dynamic obstacles with uncertain motion patterns,” Autonomous Robots, vol. 35, pp. 51–76, 2013.
  • [24] A. Hakobyan and I. Yang, “Distributionally robust risk map for learning-based motion planning and control: A semidefinite programming approach,” IEEE Transactions on Robotics, vol. 39, no. 1, pp. 718–737, 2022.
  • [25] A. Dixit, M. Ahmadi, and J. W. Burdick, “Risk-averse receding horizon motion planning for obstacle avoidance using coherent risk measures,” Artificial Intelligence, vol. 325, p. 104018, 2023.
  • [26] H. Zhu and J. Alonso-Mora, “Chance-constrained collision avoidance for MAVs in dynamic environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 776–783, 2019.
  • [27] T. Lew, R. Bonalli, and M. Pavone, “Chance-constrained sequential convex programming for robust trajectory optimization,” in 2020 European Control Conference (ECC).   IEEE, 2020, pp. 1871–1878.
  • [28] L. Lindemann, M. Cleaveland, Y. Kantaros, and G. J. Pappas, “Robust motion planning in the presence of estimation uncertainty,” in 2021 60th IEEE Conference on Decision and Control (CDC).   IEEE, 2021, pp. 5205–5212.
  • [29] S. Singh, B. Landry, A. Majumdar, J.-J. Slotine, and M. Pavone, “Robust feedback motion planning via contraction theory,” The International Journal of Robotics Research, vol. 42, no. 9, pp. 655–688, 2023.
  • [30] D. Snyder, M. Booker, N. Simon, W. Xia, D. Suo, E. Hazan, and A. Majumdar, “Online learning for obstacle avoidance,” in Conference on Robot Learning.   PMLR, 2023, pp. 2926–2954.
  • [31] K. Polymenakos, A. Abate, and S. Roberts, “Safe policy search using gaussian process models,” in Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, 2019, pp. 1565–1573.
  • [32] L. Janson, E. Schmerling, and M. Pavone, “Monte carlo motion planning for robot trajectory optimization under uncertainty,” in Robotics Research: Volume 2.   Springer, 2018, pp. 343–361.
  • [33] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” Advances in Neural Information Processing Systems, vol. 18, 2005.
  • [34] C. E. Rasmussen, “Gaussian processes in machine learning,” in Summer School on Machine Learning.   Springer, 2003, pp. 63–71.
  • [35] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using Gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2019.
  • [36] J. Kabzan, L. Hewing, A. Liniger, and M. N. Zeilinger, “Learning-based model predictive control for autonomous racing,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3363–3370, 2019.
  • [37] I. Exarchos, P. Tsiotras, and M. Pachter, “On the suicidal pedestrian differential game,” Dynamic Games and Applications, vol. 5, pp. 297–317, 2015.
  • [38] A. Heydari and S. N. Balakrishnan, “Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 145–157, 2012.
  • [39] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 943–949, 2008.
  • [40] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel, “Path planning for autonomous vehicles in unknown semi-structured environments,” The International Journal of Robotics Research, vol. 29, no. 5, pp. 485–501, 2010.
  • [41] J. Leu, G. Zhang, L. Sun, and M. Tomizuka, “Efficient robot motion planning via sampling and optimization,” in 2021 American Control Conference (ACC).   IEEE, 2021, pp. 4196–4202.
  • [42] C. Rösmann, F. Hoffmann, and T. Bertram, “Kinodynamic trajectory optimization and control for car-like robots,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5681–5686.
  • [43] B. Li, T. Acarman, Y. Zhang, Y. Ouyang, C. Yaman, Q. Kong, X. Zhong, and X. Peng, “Optimization-based trajectory planning for autonomous parking with irregularly placed obstacles: A lightweight iterative framework,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 11 970–11 981, 2022.
  • [44] J. Zeng, B. Zhang, and K. Sreenath, “Safety-critical model predictive control with discrete-time control barrier function,” in American Control Conference (ACC), 2021, pp. 3882–3889.
  • [45] B. Brito, B. Floor, L. Ferranti, and J. Alonso-Mora, “Model predictive contouring control for collision avoidance in unstructured dynamic environments,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4459–4466, 2019.
  • [46] C. Liu, C.-Y. Lin, and M. Tomizuka, “The convex feasible set algorithm for real time optimization in motion planning,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2712–2733, 2018.
  • [47] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.
  • [48] J. A. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl, “CasADi: a software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, no. 1, pp. 1–36, 2019.
  • [49] R. Rajamani, Vehicle dynamics and control.   Springer Science & Business Media, 2011.
  • [50] Y. Xu, W. Zheng, D. Luo, and H. Duan, “Dynamic affine formation control of networked under-actuated quad-rotor uavs with three-dimensional patterns,” Journal of Systems Engineering and Electronics, vol. 33, no. 6, pp. 1269–1285, 2022.