
RGB-D SLAM in Indoor Planar Environments with Multiple Large Dynamic Objects

Ran Long1,∗, Christian Rauch1, Tianwei Zhang2, Vladimir Ivan1, Tin Lun Lam2,3, Sethu Vijayakumar1,∗

Manuscript received: February 23, 2022; Revised: April 18, 2022; Accepted: June 11, 2022. This paper was recommended for publication by Editor Tamim Asfour upon evaluation of the Associate Editor and Reviewers' comments. This research is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017008 (Harmony) and the Alan Turing Institute.

1 The authors are with the Institute of Perception, Action and Behaviour, School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, U.K. 2 The authors are with the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS). 3 The author is with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen.

Corresponding authors: Ran Long (Ran.Long@ed.ac.uk), Sethu Vijayakumar (Sethu.Vijayakumar@ed.ac.uk).

Digital Object Identifier (DOI): see top of this page.
Abstract

This work presents a novel dense RGB-D SLAM approach for dynamic planar environments that enables simultaneous multi-object tracking, camera localisation and background reconstruction. Previous dynamic SLAM methods either rely on semantic segmentation to directly detect dynamic objects, or assume that dynamic objects occupy a smaller proportion of the camera view than the static background and can therefore be removed as outliers. With the aid of a camera motion prior, our approach enables dense SLAM even when the camera view is largely occluded by multiple dynamic objects. Dynamic planar objects are separated by their different rigid motions and tracked independently. The remaining dynamic non-planar areas are removed as outliers and are not mapped into the background. The evaluation demonstrates that our approach outperforms state-of-the-art methods in terms of localisation, mapping, dynamic segmentation and object tracking. We also demonstrate its robustness to large drift in the camera motion prior.

Index Terms:
SLAM, visual tracking, sensor fusion.

I Introduction

Simultaneous localisation and mapping (SLAM) is one of the core components in autonomous robots and virtual reality applications. In indoor environments, planes are common man-made features. Planar SLAM methods have used the characteristics of planes to reduce long-term drift and improve the accuracy of localisation [1, 2]. However, these methods assume that the environment is static – an assumption that is violated when the robot works in conjunction with other humans or robots, or manipulates objects in semi-automated warehouses.

The core problem of enabling SLAM in dynamic environments while differentiating multiple dynamic objects involves several challenges:

  1. There are usually an unknown number of third-party motions in addition to the camera motion in dynamic environments. The number of motions or dynamic objects is also changing.

  2. The static background is often assumed to account for the major proportion of the camera view. However, without semantic segmentation, dynamic objects that occupy a large proportion of the camera view can end up being classified as the static background.

  3. The majority of the colour and depth information can be occluded by dynamic objects, and the remaining static parts of the visual input may not be enough to support accurate camera ego-motion estimation.

Figure 1: Hierarchical segmentation based on planes and non-planar areas. The planes are extracted from the depth map and the non-planar areas are represented by a set of super-pixels.

Many dynamic SLAM methods have considered multiple dynamic objects [3, 4, 5], but either rely on semantic segmentation or assume that the static background is the largest rigid body in the camera view. To concurrently solve these problems, we propose a hierarchical representation of images that extracts planes from planar areas and over-segments non-planar areas into super-pixels (Figure 1). We consequently segment and track multiple dynamic planar rigid objects, and remove dynamic non-planar objects to enable camera localisation and mapping. For this, we assume that planes occupy a major fraction of the environment, including the static background and rigid dynamic objects. In addition, the camera motion can be distinguished from other third-party motions by a tightly-coupled camera motion prior from robot odometry.

In summary, this work contributes:

  1. a new methodology for online multimotion segmentation based on planes in indoor dynamic environments,

  2. a novel pipeline that simultaneously tracks multiple planar rigid objects, estimates camera ego-motion and reconstructs the static background,

  3. an RGB-D SLAM method that is robust to large occlusions of the camera view caused by multiple large dynamic objects.

II Related Work

II-A Dynamic SLAM

Dynamic SLAM methods can be categorised into outlier-based, semantic-based, multimotion and proprioception-aided methods.

Outlier-based methods assume that the static background occupies the major part of the camera view and that dynamic objects can be robustly removed during camera tracking. Joint-VO-SF (JF) [6] over-segments images into clusters and classifies each cluster as either static or dynamic by comparing the depth and intensity residuals. StaticFusion (SF) [7] introduces a continuous score to represent the probability that a cluster is static. The scores are used to segment the static background for camera localisation and mapping, while the remaining dynamic parts are discarded as outliers. Co-Fusion (CF) [5] and MultiMotionFusion [8] can further model the outliers as new objects and track them independently.

Semantic-based methods directly detect dynamic objects from the semantic segmentation. Based on Mask R-CNN [9], which provides pixel-wise object segmentation, EM-Fusion [10] integrates object tracking and SLAM into a single expectation maximisation (EM) framework. ClusterVO [3] can track camera ego-motion and multiple rigidly moving clusters simultaneously by combining semantic bounding boxes and ORB features [11]. DynaSLAM II [12] further integrates the multi-object tracking (MOT) and SLAM into a tightly-coupled formulation to improve its performance on both problems. However, these methods require that the object detector is pre-trained on a dataset which includes these objects or that the object model is provided in advance.

Multimotion methods explicitly model the dynamic component as third-party rigid motions and segment dynamic objects by their different motions relative to the camera motion. Multimotion visual odometry (MVO) [4] is an online multimotion segmentation and tracking method based on sparse keypoints. It iteratively samples all keypoints using RANSAC to generate motion hypotheses and automatically merges them when merging decreases the total energy. In addition to RANSAC, motion hypotheses can be generated from the increased N-points using a grid-based scene flow [13]. Instead of merging motion hypotheses, DymSLAM [14] computes a residual matrix from hypothetical motion models and directly estimates the number of dynamic objects from the residual matrix. The dense segmentation of multiple dynamic objects based on super-pixels [15] is acquired after the number of objects is specified. However, all these methods assume that the largest body of motion in the camera view is the static background, and they have not demonstrated their effectiveness when the major proportion of the camera view is occluded by dynamic objects.

Robot proprioception, such as an IMU or wheel odometry, can be fused with visual sensors to improve the accuracy and robustness of localisation in dynamic environments [16, 17]. Kim et al. [16] use the camera motion prior from an IMU to compensate for the camera motion and select static keypoints based on motion vectors. RigidFusion (RF) [17] uses the camera motion prior from wheel odometry and additional object motion priors to enable SLAM with single dynamic object reconstruction when the major part of the camera view is occluded. However, these methods are unable to track multiple dynamic objects independently.

II-B Planar SLAM

Planar features have been widely used in indoor SLAM methods. Infinite planes can be used as landmarks in the pose graph SLAM problem [18]. Based on keyframe management, a global dense planar map can be reconstructed using only a single CPU [19]. Planes can also be combined with keypoints and lines [2, 1] for more robust camera tracking. In structured environments, planes have been demonstrated to significantly reduce accumulated rotational drift under the Manhattan world (MW) assumption [1]. All these methods assume static environments, because planes in indoor environments, like walls, are often static. However, this assumption is violated when planar objects, such as boxes, are transported or manipulated by humans or robots.

III Methodology

III-A Overview and notation

Figure 2: The pipeline of our proposed method. (1) We first represent the input image in the current frame $t$ as a combination of planes and super-pixels. ORB features [11] are extracted and matched to the previous frame. (2) Planes with similar rigid motions are clustered into $M$ planar rigid bodies and their corresponding egocentric motions are estimated respectively. However, we are uncertain which planar rigid body belongs to the static background. (3) We, therefore, jointly separate the static background from the planes and super-pixels, and estimate the camera motion via frame-to-frame alignment. (4) The static part is used to reconstruct the background and refine the camera motion. (5, 6) Dynamic non-planar super-pixels are removed as outliers, while dynamic planar rigid bodies are matched with planar rigid bodies in the previous frame. Each matched planar rigid body is tracked using RANSAC on its ORB features and plane parameters.

The overview of our pipeline is illustrated in Figure 2. Our method takes RGB-D image pairs from two consecutive frames. At the $t$-th frame, we have a depth image $D_{t}$ and an intensity image $I_{t}$ computed from the colour image.

A plane is represented in the Hessian form $\Pi=(\mathbf{n}^{T},d)^{T}$, where $\mathbf{n}=(n_{x},n_{y},n_{z})$ is the normal of the plane and $d$ is the perpendicular distance between the plane and the camera origin. For each image frame, we extract planes directly from the depth map using PEAC [20], which provides the number of planes $P$, the pixel-wise segmentation of planes $\mathbf{V}^{p}=\{V^{p}_{i}\,|\,i\in[1,P]\}$ and their corresponding plane parameters. After plane extraction, the remaining non-planar areas are over-segmented into $S$ super-pixels $\mathbf{V}^{np}=\{V^{np}_{i}\,|\,i\in[1,S]\}$.

For the $i$-th plane, we extract a set of keypoints $\mathbf{K}_{i}$ using ORB features [21]. We then conduct multimotion segmentation on the planar areas $\mathbf{V}^{p}$ and cluster the planes into $M$ planar rigid bodies with different motions. For simplicity, we refer to the camera motion relative to an object's egocentric frame as the object egocentric motion [4] and denote these motions as ${}^{ego}\mathbf{T}=\{\tilde{\xi}_{m}\in\mathfrak{se}(3)\,|\,m\in[1,M]\}$. Since the static background may not be the largest rigid body in the camera view, we use the camera motion prior $\tilde{\xi}_{c}$, which may drift, to simultaneously classify all planes and super-pixels as either static or dynamic and estimate the camera ego-motion.

We use a score $\gamma_{i}\in[0,1]$ to represent the probability that a plane or super-pixel is static. The scores $\bm{\gamma}=\{\gamma_{i}\,|\,i\in[1,P+S]\}$ are assigned to each plane and super-pixel, where $\{\gamma_{i}\,|\,i\in[1,P]\}$ and $\{\gamma_{i}\,|\,i\in[P+1,P+S]\}$ represent the scores of planes and super-pixels respectively. At time $t$, the pixel-wise static/dynamic segmentation $\Gamma_{t}\in\mathbb{R}^{w\times h}$ can be estimated from $\bm{\gamma}$. The static parts of the intensity and depth images are used to estimate the camera motion $\xi_{c}\in\mathfrak{se}(3)$. The dynamic planar rigid bodies are used to track dynamic objects. Non-planar dynamic super-pixels, such as humans, are removed as outliers.
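As an illustration of how the pixel-wise segmentation $\Gamma_{t}$ follows from the per-segment scores, the minimal sketch below broadcasts $\bm{\gamma}$ to pixels through a segment-label image; the `labels` layout, the 0.5 threshold and the function name are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def pixelwise_segmentation(labels, gamma, thresh=0.5):
    """Broadcast per-segment static scores gamma to a pixel-wise static mask.

    labels : (h, w) integer image; labels[y, x] is the plane or super-pixel index
             of that pixel (hypothetical layout, 0 <= label < P + S).
    gamma  : (P + S,) array of per-segment static probabilities.
    thresh : pixels whose segment score exceeds this value are treated as static.
    """
    score_map = gamma[labels]      # per-pixel static probability
    return score_map > thresh      # boolean mask: True = static, False = dynamic
```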

The world, camera and $m$-th object frames are $W$, $C$ and $O^{m}$ respectively. The camera motion $T(\xi_{c})\coloneqq\exp(\xi_{c})$ is $T_{C_{t-1}C_{t}}=T_{WC_{t-1}}^{-1}T_{WC_{t}}$, which transforms homogeneous coordinates of a point in the current camera frame $C_{t}$ to the previous frame $C_{t-1}$. The function $\exp(\xi)$ is the matrix exponential map for the Lie group $SE(3)$. The $m$-th object egocentric transformation is the camera motion relative to this object [4]:

$$T(\tilde{\xi}_{m})={}^{m}T^{-1}_{C_{t-1}C_{t}}=T^{-1}_{O^{m}_{t-1}C_{t-1}}T_{O^{m}_{t}C_{t}}. \tag{1}$$

III-B Multimotion segmentation based on planes

To extract planes, we transform the depth map to a point cloud and cluster connected groups of points with close normal directions using the method from [20]. To match plane $i$ in the current frame with one in the previous frame, we first estimate the angle and point-to-plane distance between plane $i$ and all planes in the previous frame. A plane is chosen as a candidate if the angle and distance are below 10 degrees and 0.1 m respectively, which is the same as [2, 1]. However, rather than choosing the plane with the minimal distance [2, 1], we further consider the overlap between two planes using the Jaccard index, $J(V_{1},V_{2})=\frac{|V_{1}\cap V_{2}|}{|V_{1}\cup V_{2}|}$, where $|V_{1}|$ is the number of pixels in planar segment $1$. We choose the candidate plane with the maximal Jaccard index as the matched plane for plane $i$ and denote it as plane $i^{\prime}$.
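A minimal sketch of this plane association is given below, assuming planes are available in Hessian form together with their pixel masks. Approximating the point-to-plane distance by the difference of Hessian offsets is a simplification, and the function name is illustrative; only the 10-degree and 0.1 m gates follow the text.

```python
import numpy as np

def match_plane(normal, dist, mask, prev_planes, angle_thresh_deg=10.0, dist_thresh=0.1):
    """Associate a plane in the current frame with a plane in the previous frame.

    normal, dist : Hessian parameters (n, d) of the current plane.
    mask         : boolean pixel mask of the current planar segment.
    prev_planes  : list of (normal, dist, mask) tuples from the previous frame.
    Returns the index of the matched previous plane, or None if no candidate passes.
    """
    best_idx, best_jaccard = None, 0.0
    for idx, (n_prev, d_prev, m_prev) in enumerate(prev_planes):
        angle = np.degrees(np.arccos(np.clip(np.dot(normal, n_prev), -1.0, 1.0)))
        # Approximate the point-to-plane distance by the difference of Hessian offsets.
        if angle > angle_thresh_deg or abs(dist - d_prev) > dist_thresh:
            continue                                    # fails the geometric gates
        jaccard = (np.logical_and(mask, m_prev).sum()
                   / np.logical_or(mask, m_prev).sum())
        if jaccard > best_jaccard:                      # keep the largest overlap
            best_idx, best_jaccard = idx, jaccard
    return best_idx
```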

To estimate the object egocentric motion $T_{i}={}^{i}T^{-1}_{C_{t-1}C_{t}}$ of plane $i$, we extract and match ORB keypoints from planes $i$ and $i^{\prime}$. The error function is defined as:

$$e_{i}(T_{i})=\sum_{k\in\chi_{ii^{\prime}}}\rho\left(||\mathbf{x}^{k}_{i}-T_{i}\mathbf{x}^{k}_{i^{\prime}}||_{\Sigma}\right)+\lambda_{h}\,||q(\Pi_{i})-q(T_{i}^{-T}\Pi_{i^{\prime}})||^{2}_{2}, \tag{2}$$

where $\chi_{ii^{\prime}}$ is the set of keypoint matches between planes $i$ and $i^{\prime}$. $\mathbf{x}^{k}_{i}$ and $\mathbf{x}^{k}_{i^{\prime}}$ are the homogeneous coordinates $[x,y,z,1]$ of the two matched keypoints. $\rho(\cdot)$ is the robust Huber error function [21]. $\lambda_{h}$ weights the error between the Hessian forms of the planes. $q(\Pi)=\left[\arctan(\frac{n_{x}}{n_{z}}),\arctan(\frac{n_{y}}{n_{z}}),d\right]$ avoids over-parametrisation of the Hessian form [1].
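The following sketch evaluates Equation 2 with numpy under two simplifying assumptions: the Mahalanobis norm $||\cdot||_{\Sigma}$ is replaced by a plain Euclidean norm, and a fixed Huber threshold is used; the `delta` and `lambda_h` defaults and all names are placeholders.

```python
import numpy as np

def q(plane):
    """Minimal parametrisation of a Hessian-form plane (n_x, n_y, n_z, d)."""
    nx, ny, nz, d = plane
    return np.array([np.arctan(nx / nz), np.arctan(ny / nz), d])

def huber(r, delta=1.0):
    """Huber robust cost, applied element-wise."""
    r = np.abs(r)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def plane_error(T_i, pts_cur, pts_prev, plane_cur, plane_prev, lambda_h=1.0):
    """Evaluate Eq. (2): robust keypoint alignment plus plane-parameter consistency."""
    residuals = pts_cur - (T_i @ pts_prev.T).T            # (K, 4) homogeneous points
    e_kp = huber(np.linalg.norm(residuals[:, :3], axis=1)).sum()
    plane_pred = np.linalg.inv(T_i).T @ plane_prev        # T^{-T} applied to the Hessian form
    e_plane = lambda_h * np.sum((q(plane_cur) - q(plane_pred)) ** 2)
    return e_kp + e_plane
```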

To cluster planes with similar motions, we introduce a score $b_{ij}\in[0,1]$ for each pair of neighbouring planes $i$ and $j$ in the current frame. $b_{ij}$ represents the probability that the motions of planes $i$ and $j$ can be modelled by the same rigid transformation. We further introduce a new formulation based on planes to jointly estimate the motion of planes and merge planes into rigid bodies:

$$\min_{{}^{ego}\mathbf{T},\,\mathbf{b}}\ \sum_{i=1}^{P}e_{i}(T_{i})+\lambda_{1}\sum_{(i,j)\in E_{p}}b_{ij}f(T_{i},T_{j})-\lambda_{2}\sum_{(i,j)\in E_{p}}b_{ij},\quad\text{s.t. }b_{ij}\in[0,1]\ \forall i,j. \tag{3}$$

${}^{ego}\mathbf{T}$ is the set of egocentric transformations $\{T_{1},\cdots,T_{P}\}$ for all planes in the current frame and $\mathbf{b}=\{b_{ij}\,|\,(i,j)\in E_{p}\}$. $E_{p}$ is the connectivity graph of planes in the current frame, and $(i,j)\in E_{p}$ means that planes $i$ and $j$ are connected in space. The first term $e_{i}(T_{i})$ is introduced in Equation 2. In the second term, we propose $f(T_{i},T_{j})=\left[e_{i}(T_{j})+e_{j}(T_{i})\right]-\left[e_{i}(T_{i})+e_{j}(T_{j})\right]$ to quantify the error between two planes with egocentric motions $T_{i}$ and $T_{j}$ respectively. The last term penalises model complexity by maximising the sum of probabilities that neighbouring planes have similar motions.

The novelty of the formulation is that we treat each individual plane as a motion hypothesis and estimate the likelihood $\mathbf{b}$ that any two neighbouring hypotheses share the same motion. This is in contrast to MVO [4], which makes a discrete decision on whether two motion hypotheses are merged.

To minimise Equation 3, ${}^{ego}\mathbf{T}$ and $\mathbf{b}$ are decoupled. We first initialise all egocentric motions $T_{i}$ to identity and all scores $b_{ij}$ to 0. Then, at each iteration, we fix $\mathbf{b}$ and find the optimal ${}^{ego}\mathbf{T}$ by optimising each transformation independently. $\mathbf{b}$ is subsequently solved analytically by fixing the optimised transformations. After minimisation, we set a threshold $\hat{b}=0.9$ and merge planes $i$ and $j$ if $b_{ij}>\hat{b}$. We therefore acquire $M$ planar rigid bodies and use RANSAC to estimate their prior egocentric motions $\{\tilde{\xi}_{1},\cdots,\tilde{\xi}_{M}\}$ respectively. However, since the dynamic objects can occupy the major part of the images, we still need to decide which planar rigid body belongs to the static background.
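A sketch of the b-update and the subsequent merging into rigid bodies is given below, assuming the per-plane error $e_{i}(\cdot)$ of Equation 2 is available as a callback and the egocentric motions have already been refined in the preceding step. Because Equation 3 as written is linear in each $b_{ij}$, its minimiser lies on the boundary of $[0,1]$, which the sketch exploits; the $\lambda$ defaults and function names are illustrative, not the paper's actual solver.

```python
def merge_planes(errors, transforms, edges, lam1=1.0, lam2=0.5, b_hat=0.9):
    """One 'fix T, solve b' step of Eq. (3), followed by merging planes into
    planar rigid bodies wherever the pairwise score exceeds b_hat.

    errors     : callable errors(i, T) evaluating e_i(T) from Eq. (2) (assumed given).
    transforms : list of current per-plane egocentric motions T_i (4x4 matrices).
    edges      : neighbouring plane pairs (i, j) from the connectivity graph E_p.
    Returns (b, bodies), where bodies maps each plane index to a rigid-body id.
    """
    P = len(transforms)
    b = {}
    for (i, j) in edges:
        # f(T_i, T_j) = [e_i(T_j) + e_j(T_i)] - [e_i(T_i) + e_j(T_j)]
        f_ij = (errors(i, transforms[j]) + errors(j, transforms[i])) \
             - (errors(i, transforms[i]) + errors(j, transforms[j]))
        # Eq. (3) is linear in each b_ij, so its minimiser lies at 0 or 1.
        b[(i, j)] = 1.0 if lam1 * f_ij < lam2 else 0.0

    # Union-find over edges with b_ij > b_hat yields the M planar rigid bodies.
    parent = list(range(P))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in b.items():
        if score > b_hat:
            parent[find(i)] = find(j)
    bodies = {i: find(i) for i in range(P)}
    return b, bodies
```

In an alternating scheme, a routine like this would be called once per outer iteration after the per-plane motions are refined with $\mathbf{b}$ held fixed.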

III-C Joint camera tracking and background segmentation

We jointly track the camera motion and segment the static background based on a hierarchical representation of planes and non-planar super-pixels. This representation is more efficient in planar environments than the uniformly sampled clusters used in previous work [7, 17]. In addition, compared to RigidFusion [17], our method only requires the camera motion prior. Dynamic planar objects are detected by their rigid motions differing from the camera motion, while dynamic non-planar areas are removed because of their high residuals. To achieve this, we propose to minimise a combined formulation that consists of three energy terms:

$$\min_{\xi_{c},\bm{\gamma}}\ R(\xi_{c},\bm{\gamma})+G(\xi_{c},\bm{\gamma})+H(\xi_{c})\quad\text{s.t. }\gamma_{i}\in[0,1]\ \forall i, \tag{4}$$

where $\bm{\gamma}$ is the full set of probabilities that each plane or super-pixel is static and $\xi_{c}\in\mathfrak{se}(3)$ is the camera ego-motion in the world frame. Importantly, planes that belong to the same planar rigid body are assigned independent scores $\gamma$. The first term $R(\xi_{c},\bm{\gamma})$ aligns the static rigid body using weighted intensity and depth residuals. The second term $G(\xi_{c},\bm{\gamma})$ segments dynamic objects by either different motions or high residuals and maintains segmentation smoothness. The last regularisation term $H(\xi_{c})$ adds a soft constraint on the camera motion.

III-C1 Residual term

Following previous work [7, 17], we consider image pairs $(I_{t-1},D_{t-1})$ and $(I_{t},D_{t})$ from two consecutive frames. For a pixel $p$ with coordinate $\mathtt{x}^{p}_{t}\in\mathbb{R}^{2}$ in the current frame $t$, the intensity residual $r_{I}^{p}(\xi)$ and depth residual $r_{D}^{p}(\xi)$ against the previous frame under motion $\xi$ are given by:

$$r_{I}^{p}(\xi)=I_{t-1}\left(\mathcal{W}(\mathtt{x}^{p}_{t},\xi)\right)-I_{t}\left(\mathtt{x}^{p}_{t}\right) \tag{5}$$
$$r_{D}^{p}(\xi)=D_{t-1}\left(\mathcal{W}(\mathtt{x}^{p}_{t},\xi)\right)-\left|T(\xi)\,\pi^{-1}(\mathtt{x}^{p}_{t},D_{t}(\mathtt{x}^{p}_{t}))\right|_{z}\,, \tag{6}$$

where $\pi:\mathbb{R}^{3}\rightarrow\mathbb{R}^{2}$ is the camera projection function and $|\cdot|_{z}$ returns the $z$-coordinate of a 3D point. The image warping function $\mathcal{W}$ is:

$$\mathcal{W}(\mathtt{x}^{p}_{t},\xi)=\pi\left(T(\xi)\,\pi^{-1}(\mathtt{x}^{p}_{t},D_{t}(\mathtt{x}^{p}_{t}))\right), \tag{7}$$

which provides the corresponding coordinate $\mathtt{x}^{p}_{t-1}$ in the previous frame. Similar to SF, the weighted residual term is:

$$R(\xi_{c},\bm{\gamma})=\sum_{p=1}^{N}\gamma_{i(p)}\left[F(\alpha_{I}w_{I}^{p}r_{I}^{p}(\xi_{c}))+F(w_{D}^{p}r_{D}^{p}(\xi_{c}))\right], \tag{8}$$

where $N$ is the number of pixels with a valid depth value and $i(p)\in[1,P+S]$ indicates the index of the segment that contains pixel $p$. $\alpha_{I}$ is used to weight the intensity residuals. The Cauchy robust penalty:

$$F(r)=\frac{c^{2}}{2}\log\left(1+\left(\frac{r}{c}\right)^{2}\right) \tag{9}$$

is used to control the robustness of the minimisation, where $c$ is the inflection point of $F(r)$. Compared to SF, which assigns scores to each cluster, we represent the image as a combination of planes and super-pixels.
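The sketch below computes the per-pixel residuals of Equations 5-7 and the Cauchy penalty of Equation 9, assuming known intrinsics $K$ and using nearest-neighbour lookup instead of the interpolation a full implementation would use; the per-pixel weights $w_{I}^{p}$, $w_{D}^{p}$ and $\alpha_{I}$ are omitted, and all names are illustrative.

```python
import numpy as np

def cauchy(r, c=0.5):
    """Cauchy robust penalty F(r) of Eq. (9)."""
    return 0.5 * c ** 2 * np.log1p((r / c) ** 2)

def residuals(I_prev, D_prev, I_cur, D_cur, T, K):
    """Per-pixel intensity and depth residuals of Eqs. (5)-(7) under motion T.

    I_*, D_* : (h, w) intensity and depth images; T : 4x4 camera motion; K : 3x3 intrinsics.
    """
    h, w = D_cur.shape
    v, u = np.mgrid[0:h, 0:w]
    z = D_cur
    # Back-project current pixels, transform them by T, and project into the previous frame.
    pts = np.stack([(u - K[0, 2]) * z / K[0, 0],
                    (v - K[1, 2]) * z / K[1, 1],
                    z, np.ones_like(z)], axis=-1)
    warped = pts @ T.T
    z_w = warped[..., 2]
    safe = z_w > 1e-6
    denom = np.where(safe, z_w, 1.0)
    u_w = (K[0, 0] * warped[..., 0] / denom + K[0, 2]).round().astype(int)
    v_w = (K[1, 1] * warped[..., 1] / denom + K[1, 2]).round().astype(int)
    valid = safe & (z > 0) & (u_w >= 0) & (u_w < w) & (v_w >= 0) & (v_w < h)
    u_w, v_w = np.clip(u_w, 0, w - 1), np.clip(v_w, 0, h - 1)
    r_I = np.where(valid, I_prev[v_w, u_w] - I_cur, 0.0)   # Eq. (5)
    r_D = np.where(valid, D_prev[v_w, u_w] - z_w, 0.0)     # Eq. (6)
    return r_I, r_D
```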

III-C2 Segmentation term

The objective of $G(\xi_{c},\bm{\gamma})$ is to detect dynamic planar rigid bodies by their motions and dynamic non-planar super-pixels by their high residuals. $G(\xi_{c},\bm{\gamma})$ is computed as the sum of three terms:

$$G(\xi_{c},\bm{\gamma})=\lambda_{p}G_{p}(\xi_{c},\bm{\gamma})+\lambda_{np}G_{np}(\bm{\gamma})+\lambda_{r}G_{r}(\bm{\gamma}), \tag{10}$$

where $\lambda_{p}$, $\lambda_{np}$ and $\lambda_{r}$ are parameters that weight the different terms. The first term $G_{p}(\xi_{c},\bm{\gamma})$ classifies planes as dynamic when their egocentric motions differ from the camera motion $\xi_{c}$:

$$G_{p}(\xi_{c},\bm{\gamma})=\sum_{i=1}^{P}\gamma_{i}\,\rho\left(||\xi_{c}-\tilde{\xi}_{m(i)}||_{2}^{2}\right), \tag{11}$$

where $m(i)$ is the planar rigid body that contains plane $i$, $\tilde{\xi}_{m(i)}$ is the egocentric motion prior of the $m$-th planar rigid body, and the Huber cost function $\rho(\cdot)$ is used to robustly control the error.

The second term $G_{np}(\bm{\gamma})$ handles non-planar dynamic areas. We follow StaticFusion and assume they have a significantly higher residual under the camera motion:

$$G_{np}(\bm{\gamma})=F(\hat{c})\sum_{i=P+1}^{P+S}(1-\gamma_{i})N_{i}, \tag{12}$$

where we only consider super-pixels in the non-planar area and $N_{i}$ is the number of pixels with valid depth in the $i$-th super-pixel. The threshold $\hat{c}$ is chosen as the average residual over all $S$ super-pixels.

The last term $G_{r}(\bm{\gamma})$ maintains the spatial smoothness of the segmentation $\bm{\gamma}$ for both planar and non-planar areas by encouraging neighbouring areas to have similar scores:

$$G_{r}(\bm{\gamma})=\sum_{(i,j)\in E_{p}}b_{ij}(\gamma_{i}-\gamma_{j})^{2}+\sum_{(i,j)\in E_{np}}(\gamma_{i}-\gamma_{j})^{2}, \tag{13}$$

where $E_{p}$ and $E_{np}$ are the connectivity graphs for planes and non-planar super-pixels respectively, and $b_{ij}$ is acquired directly from the minimisation of Equation 3. This means that rather than assigning the same score $\gamma$ to all planes that belong to the same rigid body, we encourage them to have similar scores.

III-C3 Motion regularisation term

We add a soft constraint on the camera motion $\xi_{c}$ based on the motion prior $\tilde{\xi}_{c}$:

$$H(\xi_{c})=\lambda_{c}(1-\alpha_{s})\,\rho\left(||\xi_{c}-\tilde{\xi}_{c}||_{2}^{2}\right), \tag{14}$$

where $\alpha_{s}\in[0,1]$ is the ratio of the number of pixels associated with the static background to the total number of pixels with a valid depth reading. This means that we rely more on the camera motion prior when dynamic objects occupy a higher proportion of the image view. The robust Huber cost function $\rho(\cdot)$ is used to handle large potential drift in the camera motion prior.

The solver for Equation 4 is based on StaticFusion, and a similar coarse-to-fine scheme is applied to directly align the dense images. Specifically, we create an image pyramid for each incoming RGB-D image and start the optimisation from the coarsest level. The result acquired at each intermediate level is used to initialise the next level to allow correct convergence. We also decouple the camera motion $\xi_{c}$ and $\bm{\gamma}$ for more efficient computation. Concretely, we initialise the camera motion $\xi_{c}$ to identity and all $\bm{\gamma}$ to 1. In each iteration, we first fix $\bm{\gamma}$ and find the optimal $\xi_{c}$. The closed-form solution for $\bm{\gamma}$ is then obtained by fixing $\xi_{c}$. The solution from the previous iteration is used to initialise the current one.
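The overall structure of this solver can be summarised as in the sketch below, where the two per-step solvers are assumed callbacks (the paper does not spell them out at this level of detail) and the iteration count is a placeholder.

```python
import numpy as np

def solve_frame(pyramid, num_segments, motion_prior, solve_camera, solve_scores,
                iters_per_level=4):
    """Coarse-to-fine alternation for Eq. (4): fix gamma, solve xi_c; fix xi_c, solve gamma.

    pyramid       : per-level data (images, planes, super-pixels), coarsest level first.
    num_segments  : P + S, the number of planes plus super-pixels.
    motion_prior  : odometry prior used inside H(xi_c).
    solve_camera  : callable (level, gamma, xi_init, prior) -> xi_c   (assumed given).
    solve_scores  : callable (level, xi_c) -> gamma, the closed form  (assumed given).
    """
    xi_c = np.zeros(6)               # camera motion initialised to the identity twist
    gamma = np.ones(num_segments)    # every plane / super-pixel starts as static
    for level in pyramid:            # coarsest -> finest; each level warm-starts the next
        for _ in range(iters_per_level):
            xi_c = solve_camera(level, gamma, xi_c, motion_prior)  # gamma fixed
            gamma = solve_scores(level, xi_c)                      # xi_c fixed
    return xi_c, gamma
```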

III-D Background reconstruction and camera pose refinement

In the current frame $t$, after the minimisation of Equation 4, we acquire the optimised camera motion $\hat{\xi}_{c}$ and the static parts of the intensity and depth images $(I^{s}_{t},D^{s}_{t})$. These images are used to reconstruct the static background and refine the camera pose $\xi_{c}$ using frame-to-model alignment. Concretely, we render an image pair $(I^{r}_{t-1},D^{r}_{t-1})$ from the current static background model at the previous camera pose. The rendered image pair $(I^{r}_{t-1},D^{r}_{t-1})$ is directly aligned with $(I^{s}_{t},D^{s}_{t})$ by minimising

$$\min_{\xi_{c}}\ R(\xi_{c},\bm{\gamma}=\mathbf{1})+H(\xi_{c}). \tag{15}$$

The first term $R(\xi_{c},\bm{\gamma}=\mathbf{1})$ is the same as Equation 8, but $\bm{\gamma}$ is fixed to $\mathbf{1}$ because the input should contain only the static background. We append $H(\xi_{c})$ from Equation 14 as a soft constraint for the frame-to-model alignment, where $\alpha_{s}$ is estimated from the pixel-wise dynamic segmentation $\Gamma_{t}$. Since Equation 4 has already been solved, we start directly from the finest level of the image pyramid and initialise the solver of Equation 15 with the camera pose $\hat{\xi}_{c}$. The refined camera pose is used to fuse the static images $(I^{s}_{t},D^{s}_{t})$ into the surfel-based 3D model as described in SF [7].

III-E Planar objects tracking

After removing the static planes, we further track the dynamic planar rigid bodies independently. This differs from our previous work RigidFusion [17], which models the whole dynamic component with a single rigid transformation. For each dynamic planar rigid body $m$, we match it to the previous dynamic rigid bodies using the plane association and estimate its egocentric motion. If all the currently associated planes were static in the previous frame, we detect planar rigid body $m$ as a new object and denote its initial pose relative to the camera frame as $T_{init}$. If the object first appears at frame $t_{0}$, the object pose in the object's initial frame can be acquired by [14]:

$$T_{O^{m}_{t_{0}}O^{m}_{t}}=T_{C_{t_{0}}C_{t}}\,{}^{m}T_{C_{t_{0}}C_{t}}\,T_{init}^{-1}. \tag{16}$$
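The composition in Equation 16 reduces to a product of homogeneous transforms; a direct, illustrative transcription is sketched below, with all names assumed rather than taken from the original implementation.

```python
import numpy as np

def object_pose_in_initial_frame(T_cam_t0_t, T_ego_t0_t, T_init):
    """Compose Eq. (16): the pose of object m at frame t expressed in its initial frame.

    T_cam_t0_t : camera motion T_{C_{t0} C_t} accumulated since the object first appeared.
    T_ego_t0_t : the object's egocentric motion {}^m T_{C_{t0} C_t} over the same interval.
    T_init     : the object's initial pose relative to the camera frame.
    All arguments are 4x4 homogeneous transformation matrices.
    """
    return T_cam_t0_t @ T_ego_t0_t @ np.linalg.inv(T_init)
```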

IV Evaluation

IV-A Setup

Figure 3: (a) An omnidirectional wheeled platform with Vicon markers. (b) The first rigid object is put on a board with wheels and moved by a human. (c) The second rigid object is put on the youBot and is controlled remotely.

The sequences for evaluation are collected with an Azure Kinect DK RGB-D camera mounted on an omnidirectional robot (Figure 3(a)). The camera produces RGB-D image pairs with a resolution of 1280 × 720 at 30 Hz. The images are down-scaled and cropped to 640 × 480 (VGA) to accelerate pre-processing (Figure 2), such as super-pixel and plane extraction. In the solvers of Equations 4 and 15, we further down-scale images to 320 × 240 (QVGA).

The dynamic objects are created from stacked boxes and are either moved by humans or via a remotely controlled KUKA youBot (Figure 3). The ground-truth trajectories of the camera and objects are collected using a Vicon system by attaching Vicon markers to the camera and dynamic objects. The camera motion prior is acquired by adding synthetic drift to the camera ground-truth trajectories with a magnitude of around 7 cm/s (trans.) and 0.4 rad/s (rot.).
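The paper does not specify exactly how the synthetic drift is injected; one plausible sketch is to perturb the frame-to-frame ground-truth twists with Gaussian noise whose magnitude matches the stated 7 cm/s and 0.4 rad/s rates, as below (the noise model and all names are assumptions).

```python
import numpy as np

def add_synthetic_drift(rel_twists, dt=1.0 / 30.0, trans_drift=0.07, rot_drift=0.4, seed=0):
    """Corrupt ground-truth frame-to-frame camera motions with synthetic drift.

    rel_twists  : (N, 6) array of relative twists [tx, ty, tz, rx, ry, rz] per frame.
    dt          : frame interval in seconds (30 Hz camera -> 1/30 s).
    trans_drift : translational drift rate in m/s (~7 cm/s as stated in the text).
    rot_drift   : rotational drift rate in rad/s (~0.4 rad/s as stated in the text).
    """
    rng = np.random.default_rng(seed)
    noise = np.concatenate(
        [rng.normal(0.0, trans_drift * dt, size=(len(rel_twists), 3)),
         rng.normal(0.0, rot_drift * dt, size=(len(rel_twists), 3))], axis=1)
    return rel_twists + noise
```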

For quantitative evaluation, we estimate the absolute trajectory error (ATE) and the relative pose error (RPE) [22] against the ground truth. The proposed method is compared with PlanarSLAM (PS) [1], EM-Fusion (EMF) [10], Joint-VO-SF (JF) [6], StaticFusion (SF) [7], Co-Fusion (CF) [5] and RigidFusion (RF) [17]. We additionally provide the camera motion prior with drift to CF as the variant CF*. The original RF uses motion priors for both the camera and the object. Here we only provide RF with the camera motion prior and denote it as RF*, while our method with the camera motion prior is denoted as ours*.
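For reference, translational ATE and RPE RMSE can be sketched as below; this is a positions-only simplification of the standard benchmark metrics [22], which operate on full SE(3) poses and include a trajectory alignment step.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Translational ATE RMSE between time-associated, aligned (N, 3) trajectories."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return np.sqrt(np.mean(err ** 2))

def rpe_rmse(est_xyz, gt_xyz, delta=30):
    """Translational RPE RMSE over a fixed frame interval (30 frames ~ 1 s at 30 Hz)."""
    est_rel = est_xyz[delta:] - est_xyz[:-delta]
    gt_rel = gt_xyz[delta:] - gt_xyz[:-delta]
    err = np.linalg.norm(est_rel - gt_rel, axis=1)
    return np.sqrt(np.mean(err ** 2))
```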

We collect eight sequences with various camera and object movements in different planar environments. For example, in seq1, a human moves the taller box to clear the way for both the robot and the other object so that a potential collision can be avoided, while in seq5, the robot tries to overtake two dynamic objects ahead (Figure 4). All trajectories are designed such that the two dynamic objects and a human can be visible in the image at the same time and frequently occupy the major proportion of the camera view. We also run experiments on the sequences sitting_xyz and walking_xyz from the TUM RGB-D dataset [22], which include a large proportion of non-planar areas, and denote them as seq9 and seq10 respectively.

Figure 4: The ground truth trajectories of camera and dynamic objects in both 2D and 3D perspectives. Trajectories of humans are not plotted. The red segment of camera trajectories represents the part when there are moving objects in the camera view, while the blue segment means the camera moves in static environments. We mark trajectories’ start position with a black solid dot and end position with a circle-cross marker. The black arrows point to the direction of camera view.

IV-B Camera localisation

We estimate the ATE root-mean-square error (RMSE) and RPE RMSE between the estimated camera trajectories and the ground truth (Table I). In planar dynamic environments (seq. 1-8), the evaluation demonstrates that our method outperforms all other state-of-the-art methods (Figure 5). With the help of the camera motion prior, our method achieves the best performance and corrects the large drift of the camera motion prior. Even without the camera motion prior, we still achieve better performance than the JF, SF and CF baselines. PS is unable to estimate the correct camera pose because there are dynamic planes in the environment, while PS assumes that all planes are static. Our method also outperforms EMF because the semantic segmentation method [9] can only detect and segment certain categories of dynamic objects, such as humans.

In non-planar dynamic environments (seq. 9-10), EMF outperforms all other methods because the dynamic humans can be directly segmented by Mask R-CNN [9]. However, even without relying on semantic segmentation, our method performs comparably to StaticFusion. This is because our method can still detect dynamic super-pixels by their high residuals under the camera motion.

Figure 5: Visualisation of the estimated camera trajectories, camera motion prior and ground truth. The start position of all trajectories is aligned to the same position and is marked with a black solid dot. Our method (blue) achieves the lowest error compared to the ground truth (black solid) and can correct the drift of the camera motion prior (black dashed).
Seq   MP     PS     EMF    JF     SF     CF     CF*    RF*    ours   ours*
1     26.7   38.5   50.6   30.5   22.9   10.4   10.2   16.5   20.1   4.23
2     49.5   88.7   63.6   28.2   27.4   26.0   7.30   14.3   6.81   6.32
3     41.7   53.1   37.0   24.3   74.0   21.6   10.6   4.38   4.01   3.42
4     36.0   36.8   34.0   28.9   87.2   18.9   20.3   8.39   22.6   8.37
5     16.2   31.4   14.7   10.3   13.6   4.73   8.35   14.1   25.2   6.74
6     11.7   39.6   35.5   52.8   23.5   10.1   3.67   7.57   8.37   4.67
7     25.5   19.1   25.5   34.7   57.6   14.7   8.71   41.3   6.43   7.60
8     28.4   46.8   25.6   26.5   62.1   69.8   18.9   14.2   8.33   10.3
9     273    2.15   3.7†   11.1   4.0    2.7    5.63   9.73   3.81   5.54
10    197    29.8   6.6†   87.4   12.7   69.6   48.7   19.5   14.9   11.6
(a) Trans. ATE RMSE (cm)
Seq   MP     PS     EMF    JF     SF     CF     CF*    RF*    ours   ours*
1     7.64   23.8   43.6   28.4   17.2   7.58   9.07   9.41   10.9   4.50
2     7.31   51.6   22.8   26.9   11.2   12.6   5.61   3.99   3.58   3.06
3     7.87   25.1   14.8   23.5   26.1   6.8    4.22   7.13   3.11   2.78
4     7.38   29.9   26.5   28.2   64.3   15.9   15.7   6.13   14.2   6.52
5     7.61   25.8   6.31   31.4   3.31   3.62   6.34   3.90   13.5   4.77
6     7.51   17.1   30.6   25.4   18.1   7.02   4.67   4.26   4.38   3.18
7     7.52   12.8   15.4   31.3   62.4   7.54   6.43   25.1   4.73   4.09
8     7.29   20.0   15.6   24.1   36.4   28.3   11.2   7.54   5.91   4.41
9     7.36   3.12   2.6†   5.7    2.8    2.7    3.01   3.48   2.95   2.98
10    7.34   49.0   6.0†   27.7   12.1   32.9   41.9   13.4   9.59   8.67
(b) Trans. RPE RMSE (cm/s)
TABLE I: ATE and RPE RMSE for all ten RGB-D sequences. The asterisk (*) symbol represents that the method uses the camera motion prior with drift and the dagger (\dagger) symbol means the result is taken from the original paper [10]. Our method achieves the best performance in custom robotic sequences collected from planar environments (seq. 1-8) and estimates correct camera trajectories in TUM RGB-D dataset [22] which contains a large proportion of non-planar areas (seq. 9-10).

IV-C Multimotion segmentation

Figure 6: Segmentation result of the static background and dynamic objects on seq4 (opposing_move2) and seq8 (obj_transfer). Rows from top to bottom: RGB input with planes and super-pixels, SF, RF, CF, ours. In all four methods, the static part is marked in blue. In SF and RF, we use red to represent dynamic parts. In CF, we use different colours to show different objects. In our method, the non-planar dynamic areas are marked in red and the planar rigid objects are marked in other colours. Results show that only our method can segment multiple dynamic objects correctly and is robust to large occlusion.

For planar environments, we visualise the segmentation results of our method and compare them with SF, RF and CF (Figure 6). SF is unable to detect all dynamic objects because, taken together, they occlude a large proportion of the camera view, while RF tends to classify parts of the static background as dynamic. Both CF and our method can further distinguish between different dynamic objects. However, the segmentation of CF is not complete, and CF tends to have a delay when detecting a new object. We use two different colours (green and purple) to show that our method treats the taller object as a new object after it passes behind the front object. In non-planar environments, our method can still provide a correct binary segmentation of the static background and dynamic objects (Figure 7). However, we are unable to segment and track different non-planar dynamic objects independently.

Figure 7: Static/dynamic segmentation results on the walking_xyz sequence [22]. Rows from top to bottom: RGB images with segmentation of planes and super-pixels, SF, ours. Our method achieves close segmentation performance to SF in non-planar environments.

IV-D Background reconstruction

We qualitatively evaluate the reconstruction result of seq3 (Figure 8). Since we have no ground truth segmentation, we re-collect a new sequence with the same camera trajectory but no dynamic objects to recover the true background. As shown in the results, RF maps the dynamic objects into the static background model. CF has mapped the same static object twice, which is caused by wrong camera pose estimation. Only our proposed method can remove all dynamic objects and correctly reconstruct the background.

Figure 8: Reconstruction result of RGB-D sequence 3 (panels: ground truth, RF, CF, ours). The reconstruction failures are marked with red rectangles. RF has mapped dynamic objects into the background. CF has mapped the same static poster twice, which indicates wrong localisation results.

IV-E Planar rigid objects trajectory

For both objects, we compute the ATE RMSE between the estimated and ground-truth trajectories when they are in the camera view (Table II). Since the object can move out of or move into the camera view several times, one object trajectory can be divided into multiple parts. For each object, we, therefore, use the maximal ATE RMSE among the estimated trajectories of different parts for the final result. Our method can provide more accurate and complete object trajectories than CF, but loses track of a dynamic object when the object stops moving or is occluded by other objects (Figure 9).

Figure 9: Comparison between CF baseline (red) and our method with the camera motion prior (blue) in terms of the estimated object trajectories. Our method can detect an object as dynamic immediately when it starts to move and provide a more accurate trajectory than CF.
Method   seq1 obj1   seq1 obj2   seq4 obj1   seq4 obj2   seq7 obj1   seq7 obj2
CF       21.5        10.6        24.2        5.36        33.8        6.57
CF*      20.9        16.3        20.5        6.21        17.1        12.9
ours*    13.1        4.95        4.95        8.84        7.27        3.93
TABLE II: ATE RMSE of the object trajectories estimated from the CF baseline, CF* and ours*.

IV-F Impact of drift in motion prior

We increase the drift magnitude of the camera motion prior to test our method's robustness to different levels of drift. By comparing the RPE RMSE of the camera motion prior and the estimated trajectories, we find that our method outperforms the Co-Fusion baseline with drift of up to 24 cm/s (Figure 10). Even when the motion prior has a drift of nearly 30 cm/s, we can still reduce the drift to around 12 cm/s. Compared to Co-Fusion with the camera motion prior, our method is always better when using a motion prior with the same magnitude of drift.

Figure 10: RPE RMSE of the estimated trajectories against the drift magnitude of the wheel odometry. Our method performs better than CF when both use the camera motion prior with the same magnitude of drift, and it is robust to nearly 24 cm/s of odometry drift before it becomes comparable with the CF baseline.

V Conclusion

This work presented a dense RGB-D SLAM method that tracks multiple planar rigid objects without relying on semantic segmentation. We also proposed a novel online multimotion segmentation method and a dynamic segmentation pipeline based on a hierarchical representation of planes and super-pixels. The detailed evaluation demonstrates that our method achieves better localisation and mapping results than state-of-the-art approaches when multiple dynamic objects occupy the major proportion of the camera view. If one dynamic object is occluded by another, our method fails to track the object but detects it as a new object after it reappears in the camera view. Our future work will focus on re-detecting dynamic objects based on their models to support long-term object tracking. We also plan to extend our method to non-planar environments and enable the independent tracking of multiple large non-planar rigid objects.

References

  • [1] Y. Li, R. Yunus, N. Brasch, N. Navab, and F. Tombari, “RGB-D SLAM with structural regularities,” in IEEE International Conference on Robotics and Automation, 2021.
  • [2] X. Zhang, W. Wang, X. Qi, Z. Liao, and R. Wei, “Point-plane SLAM using supposed planes for indoor environments,” Sensors, 2019.
  • [3] J. Huang, S. Yang, T.-J. Mu, and S.-M. Hu, “ClusterVO: Clustering moving instances and estimating visual odometry for self and surroundings,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [4] K. M. Judd, J. D. Gammell, and P. Newman, “Multimotion visual odometry (MVO): Simultaneous estimation of camera and third-party motions,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018.
  • [5] M. Rünz and L. Agapito, “Co-Fusion: Real-time segmentation, tracking and fusion of multiple objects,” in IEEE International Conference on Robotics and Automation, 2017.
  • [6] M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, and D. Cremers, “Fast odometry and scene flow from RGB-D cameras based on geometric clustering,” in IEEE International Conference on Robotics and Automation, 2017.
  • [7] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers, “StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments,” in IEEE International Conference on Robotics and Automation, 2018.
  • [8] C. Rauch, R. Long, V. Ivan, and S. Vijayakumar, “Sparse-dense motion modelling and tracking for manipulation without prior object models,” arXiv preprint arXiv:2204.11923, 2022.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. of the IEEE International Conf. on Computer Vision, 2017.
  • [10] M. Strecke and J. Stueckler, “EM-Fusion: Dynamic object-level SLAM With probabilistic data association,” in 2019 IEEE/CVF International Conference on Computer Vision, 2019.
  • [11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in IEEE International Conference on Computer Vision, 2011.
  • [12] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, “DynaSLAM II: Tightly-coupled multi-object tracking and SLAM,” IEEE Robotics and Automation Letters, 2021.
  • [13] S. Lee, C. Y. Son, and H. J. Kim, “Robust real-time RGB-D visual odometry in dynamic environments via rigid motion model,” in IEEE/RSJ International Conf. on Intelligent Robots and Systems, 2019.
  • [14] C. Wang, B. Luo, Y. Zhang, Q. Zhao, L. Yin, W. Wang, X. Su, Y. Wang, and C. Li, “DymSLAM: 4D dynamic scene reconstruction based on geometrical motion segmentation,” IEEE Robotics and Automation Letters, 2020.
  • [15] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
  • [16] D.-H. Kim, S.-B. Han, and J.-H. Kim, “Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment,” in Robot Intelligence Technology and Applications 3.   Springer, 2015.
  • [17] R. Long, C. Rauch, T. Zhang, V. Ivan, and S. Vijayakumar, “RigidFusion: Robot localisation and mapping in environments with large dynamic rigid objects,” IEEE Robotics and Automation Letters, 2021.
  • [18] M. Kaess, “Simultaneous localization and mapping with infinite planes,” in IEEE International Conference on Robotics and Automation, 2015.
  • [19] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, “Keyframe-based dense planar SLAM,” in IEEE International Conference on Robotics and Automation, 2017.
  • [20] C. Feng, Y. Taguchi, and V. R. Kamat, “Fast plane extraction in organized point clouds using agglomerative hierarchical clustering,” in IEEE International Conference on Robotics and Automation, 2014.
  • [21] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, 2017.
  • [22] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.