
LVBA: LiDAR-Visual Bundle Adjustment for RGB Point Cloud Mapping

Rundong Li, Xiyuan Liu, Haotian Li, Zheng Liu, Jiarong Lin, Yixi Cai, and Fu Zhang
The authors are with the Department of Mechanical Engineering, The University of Hong Kong, Hong Kong, China (e-mail: {rdli10010, xliuaa, haotianl, u3007335, zivlin, yixicai}@connect.hku.hk; fuzhang@hku.hk).
Abstract

Point cloud maps with accurate color are crucial in robotics and mapping applications. Existing approaches for producing RGB-colorized maps are primarily based on real-time localization using filter-based estimation or sliding window optimization, which may lack accuracy and global consistency. In this work, we introduce a novel global LiDAR-visual bundle adjustment (BA) named LVBA to improve the quality of RGB point cloud mapping beyond existing baselines. LVBA first optimizes LiDAR poses via a global LiDAR BA, followed by a photometric visual BA that incorporates planar features from the LiDAR point cloud for camera pose optimization. Additionally, to address the challenge of map point occlusions when constructing the optimization problem, we implement a novel LiDAR-assisted global visibility algorithm in LVBA. To evaluate the effectiveness of LVBA, we conducted extensive experiments comparing its mapping quality against existing state-of-the-art baselines (i.e., R3LIVE and FAST-LIVO). Our results demonstrate that LVBA can reconstruct high-fidelity, accurate RGB point cloud maps, outperforming its counterparts.

Index Terms:
Mapping, Bundle Adjustment, Sensor Fusion

I Introduction and Related Works

High-fidelity colorized 3D maps are vital in diverse fields including robotics[1], environmental science[2], and civil engineering[3]. They serve various purposes: providing essential data for robot navigation, offering accurate models for environmental monitoring, and forming the basis of digital twin simulations.

Based on the sensors used, existing colorized 3D map reconstruction methods can be primarily categorized into three classes: 1) visual camera-based[4, 5]; 2) RGB-D sensor-based[6, 7]; 3) LiDAR-camera fusion-based[1, 8]. Visual camera-based methods [4, 5], lacking direct depth measurements, triangulate depth from multi-view geometry. This process usually involves feature tracking algorithms, which can introduce systematic errors and increase the risk of incorrect 3D geometry estimation[9]. Methods using RGB-D sensors can utilize direct, dense depth measurements through structured light. However, most of them are range-limited due to the intensity constraints of the detection light. In contrast, 3D points measured by LiDAR sensors are accurate, offering a long detection range with errors within centimeters. Hence, reconstructing colorized 3D maps with LiDAR-camera fusion-based methods is drawing increasing attention in the literature. In this work, we aim to develop a LiDAR-visual bundle adjustment method to improve the mapping quality of LiDAR-camera fusion-based methods.

Figure 1: An RGB point cloud map optimized using our method. The depicted data was captured at the Chong Yuet Ming Physics Building at The University of Hong Kong. Our method effectively optimizes both LiDAR and camera poses, achieving high levels of accuracy and consistency in the mapping process.

Although significant progress has been made in many recent approaches (e.g., [10, 11, 6, 12]), the fusion of LiDAR and camera measurements, as well as the reconstruction of high-fidelity colorized maps, remains challenging. Previous works (e.g., [1, 8]) for LiDAR-visual platforms have primarily focused on improving localization accuracy or enhancing system robustness against sensor degeneration. However, most of these works rely on filter or sliding window methods, which cannot avoid error accumulation and are susceptible to inaccurate extrinsic parameters. These issues directly affect the observation consistency of camera images, resulting in blurred colorized point clouds.

In fact, these problems can be solved by employing a global optimization method (i.e., bundle adjustment). By directly optimizing the camera’s observation consistency using existing geometry information from LiDAR point clouds, globally consistent high-fidelity point clouds can be constructed even without accurate extrinsic parameters.

To our knowledge, the only existing global optimization method for LiDAR-visual sensor fusion is the recently proposed Colmap-PCD[13]. Built on the Colmap[14, 15] framework, Colmap-PCD registers camera frames to a given LiDAR point cloud using a geometric visual bundle adjustment algorithm. Compared to Colmap-PCD, a key difference of our method is the use of photometric visual bundle adjustment, which bypasses the need for feature extraction and matching and is more robust in low-texture environments. Moreover, by effectively utilizing LiDAR scan information, we address the occlusion check without ray-tracing and effectively build up global constraints. This enhances the optimization’s efficiency, as ray-tracing is typically time-consuming. Another advantage of photometric visual bundle adjustment is its ability to estimate camera exposure parameters (e.g., exposure time, gain, etc.)[16], which directly affect image brightness. Existing works often directly estimate a scale factor called “exposure time”[16, 17] without considering the noise model and covariance, which can lead to brightness drift during global bundle adjustment. In this paper, we derive a relative exposure time with covariance consideration, allowing us to estimate exposure time globally.

In summary, we propose a LiDAR-visual bundle adjustment (LVBA) to optimize camera and LiDAR poses. LVBA works in two stages: first, it optimizes LiDAR poses through a LiDAR BA[18]; then, it optimizes camera poses using a photometric BA method, leveraging the geometric prior provided by the LiDAR point cloud. Our main contributions are:

  • We proposed a photometric bundle adjustment to estimate camera states using the prior LiDAR point cloud map, which improves the colorization quality of the point cloud even without accurate time alignment or finely calibrated extrinsic parameters.

  • We proposed a LiDAR-assisted scene point generation and visibility determination algorithm, which establishes global co-visibility among camera frames and facilitates the construction of the global photometric visual bundle adjustment problem.

  • We implemented a toolchain to evaluate the accuracy and consistency of the colorized maps. Using this toolchain, we conducted an extensive evaluation of our proposed LVBA against state-of-the-art LiDAR-Visual-Inertial mapping approaches. Our evaluation results demonstrate that LVBA outperforms other state-of-the-art works in accurately and consistently reconstructing colorized point cloud maps (refer to Fig. 1).

II Methodology

II-A System Overview

The overview of our proposed LiDAR-visual bundle adjustment (LVBA) is shown in Fig. 2. LVBA takes LiDAR scans, camera images, and their rough poses in the same world reference frame as input, typically from a front-end method such as [1, 8]. The system consists of two components: a LiDAR BA part and a visual BA part. First, the LiDAR states are optimized using a LiDAR BA method[18], producing optimal LiDAR pose estimates and planar features. Subsequently, the camera poses are optimized through an iterative coarse-to-fine photometric bundle adjustment, which utilizes the LiDAR scans and extracted planar feature points as priors.

Figure 2: The overview of our system. Our system consists of a LiDAR BA and a visual BA.

II-B LiDAR Bundle Adjustment

In LVBA, we utilize a LiDAR BA method called BALM[19, 18] (version 2) to optimize the LiDAR poses. This method formulates the LiDAR BA problem by leveraging the edge and plane features extracted from LiDAR point clouds. The optimization process seeks to minimize the Euclidean distance between each point in a scan and its neighboring edge or plane. We use BALM to optimize the 6-DoF pose of each LiDAR scan and to build a voxel map that contains plane features extracted from the LiDAR points, which is then used in our later visual BA. We denote by $\widehat{\mathbf{T}}_{L}$ the LiDAR pose after BALM optimization.

To ensure alignment between the camera trajectory and the optimized LiDAR poses, we also need to update the camera poses after BALM optimizes the LiDAR poses. To do so, for each camera frame $\mathbf{T}_{C}\in SE(3)$ before BALM optimization, the closest LiDAR frame $\mathbf{T}_{L}\in SE(3)$ before BALM optimization is found. Note that $\mathbf{T}_{C}$ and $\mathbf{T}_{L}$ are respectively the camera and LiDAR poses in the same world reference frame. Then the camera pose $\mathbf{T}_{C}$ is updated as:

$\widehat{\mathbf{T}}_{C}=\widehat{\mathbf{T}}_{L}\mathbf{T}_{L}^{-1}\mathbf{T}_{C}$   (1)
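As a minimal sketch of this pose update, the snippet below applies Eq. (1) to 4×4 homogeneous transform matrices stored as NumPy arrays; the function name and interface are illustrative and not part of LVBA.

```python
import numpy as np

def update_camera_pose(T_C: np.ndarray, T_L: np.ndarray, T_L_hat: np.ndarray) -> np.ndarray:
    """Propagate the BALM pose correction of the closest LiDAR frame to a camera frame, Eq. (1).

    All inputs are 4x4 homogeneous transforms in the same world frame: T_C and T_L are the
    camera and LiDAR poses before LiDAR BA, and T_L_hat is the LiDAR pose after LiDAR BA.
    """
    # T_C_hat = T_L_hat * T_L^{-1} * T_C applies to the camera the same rigid
    # correction that BALM applied to its closest LiDAR frame.
    return T_L_hat @ np.linalg.inv(T_L) @ T_C
```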

II-C Visual Bundle Adjustment with LiDAR Prior

The visual BA in LVBA first generates visual feature points named “scene points” and then formulates the cost function by projecting scene points onto different image frames and minimizing the photometric discrepancy between them. Two types of scene points are used: local scene points and global scene points. Sec. II-C1 and II-C2 present local scene point selection and visibility determination. Sec. II-C3 presents global scene point selection and visibility determination. Finally, Sec. II-C4 and II-C5 describe the construction of our photometric cost and its optimization.

II-C1 Local Scene Point Generation

Our visual BA begins by generating local scene points for each image frame. When selecting local scene points, we prefer points surrounded by complex textures, as they provide richer photometric details and more effective constraints when projected onto an image frame. Specifically, for each camera frame, its image is divided into grid cells. The LiDAR planar feature points captured when the LiDAR is close to this camera frame are then projected onto these grid cells. To ensure the view quality of the scene points, only those LiDAR feature points whose surface normal is roughly aligned with the view direction from the camera frame are considered. Specifically, the projected LiDAR feature points should fulfill the following condition:

$\Big|\mathbf{n}_{f}^{T}\dfrac{\mathbf{p}_{f}-\mathbf{t}_{C}}{\|\mathbf{p}_{f}-\mathbf{t}_{C}\|_{2}}\Big|>\alpha_{0}$   (2)

where $\mathbf{p}_{f}$ and $\mathbf{n}_{f}$ are the position and normal vector of the LiDAR planar feature point, respectively, $\mathbf{t}_{C}$ is the position of the camera frame, and $\alpha_{0}$ is a threshold. After each LiDAR feature point is projected, a score reflecting the intensity gradients around the candidate scene point is computed for every projected point using the Difference of Gaussians (DoG) method [20]. Within each grid cell, the point with the highest score is selected as a local scene point. These selected local scene points are denoted as $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})\in\mathbb{R}^{3}\times\mathbb{S}^{2}$, where $\mathbf{p}$ is the position of the point and $\mathbf{n}$ is the normal vector estimated by our LiDAR BA. For each selected local scene point, the corresponding camera frame is denoted as the “reference frame” of this local scene point. In scenarios where a grid cell contains only points with low scores (indicating a lack of useful photometric information, such as on a textureless surface), no point is selected from that grid cell.
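The per-frame selection described above can be sketched as follows. This is an illustrative implementation under simplifying assumptions: an undistorted pin-hole camera, a Difference-of-Gaussians response computed with SciPy as the texture score, and placeholder values for the grid cell size, the threshold $\alpha_0$, and the low-score cutoff; none of the names or constants are taken from the LVBA code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def select_local_scene_points(gray, K, R_c, t_c, feat_pts, feat_normals,
                              cell=32, alpha0=0.5, min_score=1e-3):
    """Pick at most one LiDAR planar feature point per image grid cell (Eq. (2) + DoG score).

    gray: HxW float image; K: 3x3 intrinsics; (R_c, t_c): camera-to-world pose;
    feat_pts / feat_normals: Nx3 planar feature points and their normals from the LiDAR BA.
    Returns a list of (point, normal, pixel) tuples, i.e. the local scene points.
    """
    H, W = gray.shape
    # Difference-of-Gaussians response used as a texture score around each pixel.
    dog = np.abs(gaussian_filter(gray, 1.0) - gaussian_filter(gray, 2.0))

    best = {}  # (cell_row, cell_col) -> (score, point, normal, pixel)
    for p, n in zip(feat_pts, feat_normals):
        view = p - t_c
        dist = np.linalg.norm(view)
        # Eq. (2): keep points whose surface normal is roughly aligned with the view ray.
        if dist < 1e-6 or abs(n @ view / dist) <= alpha0:
            continue
        pc = R_c.T @ view                      # point in camera coordinates
        if pc[2] <= 0.1:                       # behind or too close to the camera (extra guard)
            continue
        u = K @ (pc / pc[2])                   # pin-hole projection, u = [col, row, 1]
        col, row = int(u[0]), int(u[1])
        if not (0 <= col < W and 0 <= row < H):
            continue
        score = dog[row, col]
        key = (row // cell, col // cell)
        if key not in best or score > best[key][0]:
            best[key] = (score, p, n, (col, row))

    # Cells whose best score is too low (textureless regions) contribute no scene point.
    return [(p, n, uv) for s, p, n, uv in best.values() if s > min_score]
```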

II-C2 Local Visibility Determination

After generating local scene points for each frame, our next step is to identify the other frames that can observe these points. These identified frames are denoted as the “target frames” of each scene point and will then be used to construct the corresponding cost items in (8). To facilitate this, a local visibility determination process is implemented. In this process, only frames within a sliding window relative to the reference frame of a local scene point are considered. Since these frames are close to the reference frame, the parallax is insignificant and a local scene point is usually directly visible in these frames without a further occlusion check. As a consequence, we only have to examine whether the view direction between the local scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})$ and the candidate target frame at $\mathbf{T}_{t}=(\mathbf{R}_{t},\mathbf{t}_{t})$ is within the frame’s FoV. The FoV check is much more efficient than an occlusion check based on ray-casting, but it may give false visibility results for scene points on the boundary between foreground and background objects. To fix this issue, we project, using the initial sensor poses, the local scene point onto the reference frame and the candidate target frame to evaluate the photometric discrepancy between them. If a scene point is truly visible in the target frame, the initial photometric discrepancy is usually small. Finally, to ensure the view quality of the scene point, we choose target frames that view the scene point along a direction close to the point’s surface normal. Consequently, the local visibility determination checks are:

$|\mathbf{d}^{T}\mathbf{z}_{C}|>\alpha_{1},\quad|\mathbf{d}^{T}\mathbf{n}|>\alpha_{2},\quad\mathtt{NCC}(\mathbf{I}_{r},\mathbf{I}_{t})>\alpha_{3}$   (3)

where $\mathbf{d}=\frac{\mathbf{p}-\mathbf{t}_{t}}{\|\mathbf{p}-\mathbf{t}_{t}\|_{2}}$ denotes the view direction from the candidate target frame to the local scene point, $\mathbf{z}_{C}$ is the z-axis of the camera frame (perpendicular to the camera image plane), $\mathbf{I}_{r}$ and $\mathbf{I}_{t}$ are the RGB values of the patches generated on the reference frame and the candidate target frame, respectively, and $\mathtt{NCC}(\cdot,\cdot)$ is the Normalized Cross-Correlation (NCC) [21] between the two patches. The details of patch generation are given in Sec. II-C4. $\alpha_{1}$, $\alpha_{2}$, and $\alpha_{3}$ are three thresholds. If a local scene point is deemed visible by a frame (i.e., satisfying (3)), that frame is selected as a target frame and contributes to the final optimization cost in (8).
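A sketch of the three checks in Eq. (3) is given below; the patches are assumed to be precomputed RGB arrays, and the threshold values are placeholders rather than the ones used in LVBA.

```python
import numpy as np


def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized (RGB) patches."""
    a = patch_a.ravel().astype(float)
    b = patch_b.ravel().astype(float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 1e-9 else 0.0


def is_locally_visible(p, n, R_t, t_t, patch_ref, patch_tgt,
                       alpha1=0.5, alpha2=0.3, alpha3=0.7):
    """Evaluate the three conditions of Eq. (3) for one candidate target frame."""
    d = p - t_t
    d = d / np.linalg.norm(d)            # view direction from the target frame to the point
    z_c = R_t[:, 2]                      # camera z-axis (optical axis) in the world frame
    return (abs(d @ z_c) > alpha1 and    # roughly inside the field of view
            abs(d @ n) > alpha2 and      # viewed along a direction close to the normal
            ncc(patch_ref, patch_tgt) > alpha3)  # small initial photometric discrepancy
```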

II-C3 Global Scene Point Selection and Visibility Generation

Local scene points provide effective local constraints for camera pose optimization. However, they fail to provide more global constraints for camera poses far apart observing the same area. Therefore, we introduce global scene points and global visibility that provide more comprehensive, global constraints. When generating global scene points and determining global visibility, the distance between the reference and target frames increases, making the occlusion check necessary. Typically, a ray-casting algorithm (e.g., in [22]) is employed to address the occlusion check, but this approach not only increases the computation costs but also risks inaccuracies in sparser point cloud scenarios. To overcome these challenges, we implement a LiDAR-assisted method based on a global visibility voxel map, addressing the occlusion check with more reliable results and reduced computational load, as illustrated in Fig. 3.

Figure 3: LiDAR-assisted Global Visibility Map. A voxel map that stores the visibility information of each voxel is constructed from LiDAR scans (Sec. II-C3). After that, a global scene point is selected from all scene points in each voxel (Sec. II-C3). The visibility voxel map, together with the selected global scene points, is then used in the global visibility determination process (Sec. II-C3).
Global Visibility Voxel Map

Our method for global scene point selection and visibility determination relies on a global visibility voxel map created using LiDAR scan data. This map records the visibility information for each voxel and is constructed after the optimization of LiDAR poses using BALM. During construction, for each point in a LiDAR scan, the voxel it belongs to is appended with the pose of the LiDAR scan, indicating that the voxel is visible from the position of this LiDAR scan. Subsequently, camera frames proximate to these visible positions are identified and stored. Consequently, each voxel in the map retains a set of indices of camera frames from which it is visible.
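The construction of this voxel map can be sketched as below, assuming the scans are already transformed into the world frame with the BALM-optimized poses; the voxel size and the camera-to-LiDAR proximity radius are illustrative parameters, not values from LVBA.

```python
import numpy as np
from collections import defaultdict


def build_visibility_voxel_map(lidar_scans, lidar_positions, camera_positions,
                               voxel_size=1.0, cam_radius=0.5):
    """Build a global visibility voxel map from the optimized LiDAR scans.

    lidar_scans: list of (Ni x 3) world-frame point arrays; lidar_positions: the position of
    each scan; camera_positions: (M x 3) camera positions. Each voxel ends up storing the
    indices of camera frames close to a LiDAR viewpoint from which the voxel was observed.
    """
    cam_pos = np.asarray(camera_positions)
    voxel_map = defaultdict(set)  # voxel index (3-tuple) -> set of camera frame indices
    for pts, t_l in zip(lidar_scans, lidar_positions):
        # Camera frames proximate to this LiDAR viewpoint.
        near_cams = np.flatnonzero(np.linalg.norm(cam_pos - t_l, axis=1) < cam_radius)
        # Every voxel touched by this scan is visible from that viewpoint.
        keys = np.unique(np.floor(np.asarray(pts) / voxel_size).astype(int), axis=0)
        for key in map(tuple, keys):
            voxel_map[key].update(near_cams.tolist())
    return voxel_map
```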

Global Scene Point Selection

To limit the computation cost, only a subset of all local scene points, namely the global scene points, is used to construct global constraints. To select these global scene points, we distribute all local scene points into the visibility voxel map. For each voxel, the local scene point that exhibits the best observation quality from its reference frame is selected. To model the observation quality, we define a score $s$, which describes the projection area on the reference image of a small surface element at the scene point. A larger projection area indicates better observation quality. More specifically, for each candidate scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})$, the score $s$ is calculated as:

$s=\dfrac{\mathbf{n}^{T}(\mathbf{p}-\mathbf{t}_{r})}{\|\mathbf{p}-\mathbf{t}_{r}\|^{2}_{2}}$   (4)

where $\mathbf{t}_{r}$ is the position of the corresponding reference frame. The local scene point within each voxel that achieves the highest score is then selected as a global scene point.
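A sketch of this per-voxel selection using the score of Eq. (4) is shown below; it assumes the scene point normals are oriented toward their reference cameras so that the signed score is positive for well-observed points, and the voxel size is an illustrative parameter.

```python
import numpy as np


def select_global_scene_points(local_points, voxel_size=1.0):
    """Keep the local scene point with the largest Eq. (4) score in each voxel.

    local_points: iterable of (p, n, t_r) with the point position, its normal, and the
    position of its reference frame, each a length-3 NumPy array.
    """
    best = {}  # voxel index -> (score, p, n, t_r)
    for p, n, t_r in local_points:
        d = p - t_r
        score = (n @ d) / (d @ d)   # Eq. (4): projected area of a unit surface element
        key = tuple(np.floor(p / voxel_size).astype(int))
        if key not in best or score > best[key][0]:
            best[key] = (score, p, n, t_r)
    return [(p, n, t_r) for _, p, n, t_r in best.values()]
```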

Global Visibility Determination

Once the global scene points have been selected, the next step is to ascertain their visibility across all camera frames. This process comprises two main steps. First, using the global visibility map, we take the camera frames stored in the voxel containing the global scene point as candidate target frames, since these camera frames are the most likely to observe this voxel and hence the global scene point in it. Second, we identify the candidate target frames satisfying (3) as the true target frames, following the same reasoning as in the local visibility determination.
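The two-step lookup can be sketched as follows, reusing the is_locally_visible check from the earlier sketch; camera_poses, patches_fn, and the voxel size are assumed interfaces introduced for illustration only.

```python
import numpy as np


def global_target_frames(p, n, voxel_map, camera_poses, patches_fn, voxel_size=1.0):
    """Find the target frames of a global scene point via the visibility voxel map.

    camera_poses: list of (R_t, t_t) camera-to-world poses; patches_fn(frame_idx) returns
    the (reference, target) patch pair used by the NCC test in Eq. (3).
    """
    # Step 1: candidate target frames are the frames stored in the point's voxel.
    key = tuple(np.floor(p / voxel_size).astype(int))
    candidates = voxel_map.get(key, set())
    # Step 2: keep the candidates that pass the checks of Eq. (3).
    targets = []
    for idx in candidates:
        R_t, t_t = camera_poses[idx]
        patch_ref, patch_tgt = patches_fn(idx)
        if is_locally_visible(p, n, R_t, t_t, patch_ref, patch_tgt):
            targets.append(idx)
    return targets
```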

II-C4 Photometric Error Formulation

Figure 4: Photometric Error Formulation: A scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})$ is first projected onto the reference image frame at $\mathbf{T}_{r}$, and a reference patch $\{\mathbf{u}_{r}^{(i)}\}$ is generated. Then, the reference patch is projected and warped onto the target image frame at $\mathbf{T}_{t}$ by a homography transformation $\mathbf{H}$ to generate a target patch $\{\mathbf{u}_{t}^{(i)}\}$. Finally, a photometric error is constructed from the $L_2$-norm of the radiance error between the two patches.

The visual BA in LVBA optimizes the photometric discrepancy of a scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})\in\mathbb{R}^{3}\times\mathbb{S}^{2}$ (either a local or a global scene point) between the reference frame and a target frame, whose poses are denoted as $\mathbf{T}_{r}=(\mathbf{R}_{r},\mathbf{t}_{r})\in SE(3)$ and $\mathbf{T}_{t}=(\mathbf{R}_{t},\mathbf{t}_{t})\in SE(3)$, respectively. As shown in Fig. 4, a cost item is constructed as follows:

Firstly, the scene point $\boldsymbol{\pi}$ is projected onto the reference image plane via the pin-hole camera projection model:

$\hat{\mathbf{u}}_{r}=\mathbf{K}\mathbf{R}_{r}^{T}(\mathbf{p}-\mathbf{t}_{r})$   (5)

where $\mathbf{K}$ is the camera intrinsic matrix and $\hat{\mathbf{u}}_{r}\in\mathbb{R}^{3}$ is the homogeneous form of $\mathbf{u}_{r}\in\mathbb{R}^{2}$ in the reference image plane coordinates (hereafter, $\hat{(\cdot)}$ denotes the homogeneous form of a vector). Then, a reference patch $\{\mathbf{u}_{r}^{(i)}\}$ is generated around the projected point. Utilizing the plane normal of the scene point and the camera projection model, the reference patch can be projected onto the target frame via a homography transformation[23]:

$\hat{\mathbf{u}}_{t}^{(i)}=\mathbf{H}(\mathbf{T}_{r},\mathbf{T}_{t},\boldsymbol{\pi})\,\hat{\mathbf{u}}_{r}^{(i)}$   (6)

where the homography matrix $\mathbf{H}$ can be expressed as

$\mathbf{H}=\mathbf{K}\mathbf{R}_{t}^{-1}\big[\mathbf{n}^{T}(\mathbf{p}-\mathbf{t}_{r})\mathbf{E}+(\mathbf{t}_{r}-\mathbf{t}_{t})\mathbf{n}^{T}\big]\mathbf{R}_{r}\mathbf{K}^{-1}$   (7)

and $\mathbf{E}$ is the $3\times 3$ identity matrix. Finally, the photometric cost is constructed as:

$L_{\text{photo}}(\boldsymbol{\mathcal{T}}_{r},\boldsymbol{\mathcal{T}}_{t};\boldsymbol{\pi})=\sum_{i=0}^{N^{2}}\big\|\epsilon_{t}^{-1}\mathbf{I}_{t}(\mathbf{u}_{t}^{(i)})-\epsilon_{r}^{-1}\mathbf{I}_{r}(\mathbf{u}_{r}^{(i)})\big\|_{\boldsymbol{\Sigma}}^{2},\qquad\boldsymbol{\Sigma}=\Big(\dfrac{1}{\epsilon_{r}^{2}+\epsilon_{t}^{2}}\Big)\mathbf{E}$   (8)

where $\mathbf{I}_{r}(\cdot)$ and $\mathbf{I}_{t}(\cdot)$ return the RGB color vector by bi-linear interpolation on the reference and target images, respectively, $\boldsymbol{\mathcal{T}}_{r}=(\mathbf{T}_{r},\epsilon_{r})$ and $\boldsymbol{\mathcal{T}}_{t}=(\mathbf{T}_{t},\epsilon_{t})$ are the camera states of the reference and target frames, $\epsilon_{r}$ and $\epsilon_{t}$ are their relative exposure times, and $\boldsymbol{\Sigma}$ is the covariance matrix of the cost function. The relative exposure time $\epsilon$ and covariance matrix $\boldsymbol{\Sigma}$ are defined using a simplified camera measurement model; further details are provided in Appendix A (https://github.com/hku-mars/mapping_eval).
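For concreteness, the cost of Eqs. (5)-(8) for one scene point and one reference/target pair can be sketched as below. The sketch assumes camera-to-world poses, HxWx3 float RGB images, nearest-neighbour lookup instead of bilinear interpolation, and omits image boundary checks; it is an illustration of the formulation, not the LVBA implementation.

```python
import numpy as np


def homography(K, R_r, t_r, R_t, t_t, p, n):
    """Plane-induced homography of Eq. (7), mapping reference pixels to the target image."""
    E = np.eye(3)
    plane = float(n @ (p - t_r)) * E + np.outer(t_r - t_t, n)
    return K @ np.linalg.inv(R_t) @ plane @ R_r @ np.linalg.inv(K)


def photometric_cost(I_r, I_t, K, state_r, state_t, p, n, patch_half=2):
    """Photometric cost of Eq. (8) for one scene point between reference and target frames.

    state = (R, t, eps), with eps the relative exposure time of the frame.
    """
    R_r, t_r, eps_r = state_r
    R_t, t_t, eps_t = state_t
    # Eq. (5): project the scene point into the reference image.
    u = K @ (R_r.T @ (p - t_r))
    u = u[:2] / u[2]
    H = homography(K, R_r, t_r, R_t, t_t, p, n)
    info = eps_r ** 2 + eps_t ** 2          # Sigma^{-1} of Eq. (8), up to the identity factor
    cost = 0.0
    offsets = range(-patch_half, patch_half + 1)
    for dy in offsets:                       # reference patch {u_r^(i)} around the projection
        for dx in offsets:
            ur = np.array([u[0] + dx, u[1] + dy, 1.0])
            ut = H @ ur                      # Eq. (6): warp the pixel to the target image
            ut = ut[:2] / ut[2]
            c_r = I_r[int(round(ur[1])), int(round(ur[0]))]
            c_t = I_t[int(round(ut[1])), int(round(ut[0]))]
            res = c_t / eps_t - c_r / eps_r  # exposure-compensated radiance error
            cost += info * float(res @ res)
    return cost
```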

II-C5 Levenberg-Marquardt Optimization

With our constructed photometric cost items, the optimal estimation of the camera states $\boldsymbol{\mathcal{T}}^{\ast}$ is given by:

$\boldsymbol{\mathcal{T}}^{\ast}=\underset{\boldsymbol{\mathcal{T}}}{\operatorname{argmin}}\sum_{i=1}^{M_{s}}\sum_{j\in V_{i}}L_{\text{photo}}(\boldsymbol{\mathcal{T}}_{r(i)},\boldsymbol{\mathcal{T}}_{j};\boldsymbol{\pi}_{i})$   (9)

where $\boldsymbol{\mathcal{T}}=\{\boldsymbol{\mathcal{T}}_{1},\boldsymbol{\mathcal{T}}_{2},\cdots,\boldsymbol{\mathcal{T}}_{M_{p}}\}$ is the camera state set, $\boldsymbol{\Pi}=\{\boldsymbol{\pi}_{1},\boldsymbol{\pi}_{2},\cdots,\boldsymbol{\pi}_{M_{s}}\}$ is the scene point set (both local and global), $M_{p}$ and $M_{s}$ are the total numbers of camera poses and scene points, respectively, $r(i)$ is the reference frame index of the $i$-th scene point, $V_{i}$ is the target frame set (both local and global), and $L_{\text{photo}}$ is the photometric cost item defined in (8). In LVBA, we employ a Levenberg-Marquardt optimizer to minimize the cost function in (9).

To enhance the robustness of the cost function against inaccurate initial estimates, our visual BA employs an iterative coarse-to-fine strategy. This strategy down-samples the images into a pyramid and sequentially optimizes the camera states from the top (coarsest) layer of the pyramid down to the original resolution. In each iteration, we utilize the optimized states obtained from the previous, coarser layer to generate scene points and determine visibility.
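A sketch of this coarse-to-fine loop is given below, with the per-level scene point generation, visibility determination, and Levenberg-Marquardt solve abstracted into a placeholder callback optimize_level; the pyramid is built here by simple subsampling, and the number of levels is illustrative.

```python
import numpy as np


def coarse_to_fine_ba(images, K, init_states, optimize_level, n_levels=3):
    """Run the visual BA from the coarsest pyramid level down to the original resolution.

    images: list of HxWx3 float images; K: 3x3 float intrinsics; init_states: initial camera
    states; optimize_level(images_l, K_l, states) performs one round of scene point
    generation, visibility determination, and LM optimization of Eq. (9) at one level.
    """
    states = init_states
    for level in reversed(range(n_levels)):       # coarsest level first
        step = 2 ** level
        K_l = K.copy()
        K_l[:2, :] /= step                        # rescale focal lengths and principal point
        images_l = [img[::step, ::step] for img in images]  # simple image pyramid
        # Scene points and visibility are regenerated with the states refined at the
        # previous (coarser) level, then all camera states are optimized again.
        states = optimize_level(images_l, K_l, states)
    return states
```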

III Experiments

III-A Experiment Setup

During the experiments, we compared our LVBA with other state-of-the-art LiDAR-visual(-inertial) sensor fusion methods, including R3LIVE[1], FAST-LIVO[8], and Colmap-PCD[13]. We performed evaluations on three datasets: the R3LIVE dataset[1], the FAST-LIVO dataset[8], and MaRS-LVIG[24]. We conducted the mapping accuracy evaluation on all three datasets and the trajectory accuracy evaluation on MaRS-LVIG, which provides ground truth trajectories.

For both Colmap-PCD and LVBA, we utilize R3LIVE to provide the initial estimation when evaluating on the R3LIVE and MaRS-LVIG datasets, and FAST-LIVO to provide the initial estimation when evaluating on the FAST-LIVO dataset. Since these data are collected at a high rate (typically 10 Hz), optimizing all frame poses on the raw data is computationally demanding for LVBA and Colmap-PCD, which are both global optimization approaches. To constrain the computation load, we extracted keyframes from the camera frames. Furthermore, since some of the sequences are excessively long for Colmap-PCD and LVBA, leading to prohibitive time and memory costs, we split them into sub-sequences and take the average of the evaluation results over the sub-sequences as the final result for each sequence.

III-B Mapping Evaluation

III-B1 Evaluation Method

Figure 5: A sample output from our mapping evaluation algorithm. (a) displays the original image captured by the camera. (b) shows the image as rendered by our algorithm, and (c) presents the depth map, derived from the point cloud, which was used in the rendering of image (b). The rendered image (b) is then compared to the original (a) for evaluation using metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM).

To evaluate the mapping accuracy and consistency of our optimization results, we use a colorizing-and-rendering algorithm. Initially, an RGB-radiance point cloud is created using the LiDAR scans and camera images. For each LiDAR point $\mathbf{p}^{L}$ in the LiDAR frame $\mathbf{T}_{L}=(\mathbf{R}_{L},\mathbf{t}_{L})$, the nearest camera frame at $\mathbf{T}_{C}=(\mathbf{R}_{C},\mathbf{t}_{C})$ is identified. The radiance value of this point is calculated as:

$\mathbf{r}=\epsilon^{-1}\mathbf{I}\Big(\boldsymbol{\Pi}\big(\mathbf{R}_{C}^{T}(\mathbf{R}_{L}\mathbf{p}^{L}+\mathbf{t}_{L}-\mathbf{t}_{C})\big)\Big)$   (10)

where $\epsilon$ is the estimated relative exposure time of the camera frame (for benchmark methods that do not estimate the relative exposure time, $\epsilon$ is set to $1$), $\boldsymbol{\Pi}(\cdot)$ denotes the projection onto the camera's image plane, and $\mathbf{I}(\cdot)$ yields the RGB value. After constructing the radiance map, we render raw images from it. For each image with optimized camera pose and relative exposure time $\epsilon$, we project the colorized radiance map onto its image plane to create a rendered image, whose RGB color is determined by $\mathbf{I}=\epsilon\mathbf{r}$. An example of our rendering result is shown in Fig. 5. Finally, the discrepancy between these rendered images and the original raw images is evaluated using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics. The source code of our evaluation toolchain is available on GitHub (https://github.com/hku-mars/mapping_eval).
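The colorization step of Eq. (10) can be sketched as follows, assuming an undistorted pin-hole camera and nearest-neighbour pixel lookup; rendering then reuses the stored radiance scaled by each frame's relative exposure time.

```python
import numpy as np


def colorize_point(p_L, R_L, t_L, R_C, t_C, K, image, eps=1.0):
    """Radiance of one LiDAR point taken from its nearest camera frame, Eq. (10).

    p_L: point in the LiDAR frame; (R_L, t_L), (R_C, t_C): LiDAR and camera poses in the
    world frame; eps: estimated relative exposure time (1 for methods that do not estimate it).
    Returns None if the point does not project into the image.
    """
    pc = R_C.T @ (R_L @ p_L + t_L - t_C)     # point expressed in camera coordinates
    if pc[2] <= 0:
        return None
    u = K @ (pc / pc[2])                     # pin-hole projection Pi(.)
    col, row = int(round(u[0])), int(round(u[1]))
    H, W = image.shape[:2]
    if not (0 <= col < W and 0 <= row < H):
        return None
    return image[row, col].astype(float) / eps   # exposure-compensated radiance r
```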

III-B2 Evaluation Setup

In this section, we performed a mapping accuracy evaluation on all three datasets. Due to the strict time synchronization requirement of FAST-LIVO, which is not satisfied by the R3LIVE dataset, we did not evaluate FAST-LIVO on the R3LIVE dataset. Further, as LVBA and Colmap-PCD only estimate the poses of the keyframes, we assessed R3LIVE and FAST-LIVO only at those keyframes, although their results are obtained by running on all frames at 10 Hz.

To assess the effectiveness of our proposed global scene point selection and visibility determination, we ran our algorithm without any global scene point constraints (termed “w/o GSP”). Furthermore, to evaluate the effectiveness of our proposed relative exposure time estimation, we ran the algorithm with all relative exposure times fixed to one (termed “w/o RET”). Both experiments were conducted under the same settings as our full method described above.

III-B3 Comparison Results

TABLE I: Result of Mapping Evaluation. Each cell reports PSNR↑ / SSIM↑.

Dataset | Sequence | R3LIVE[1] | FAST-LIVO[8] | Colmap-PCD[13] | Ours (w/o GSP1) | Ours (w/o RET2) | Ours (Full)
R3LIVE[1] | hku campus seq 00 | 18.13 / 0.1741 | – | 21.36 / 0.3629 | 20.56 / 0.3328 | 21.06 / 0.3177 | 21.64 / 0.3852
R3LIVE[1] | hku campus seq 01 | 17.82 / 0.1995 | – | – | 19.72 / 0.3478 | 19.21 / 0.2905 | 20.14 / 0.3750
R3LIVE[1] | hku campus seq 02 | 16.97 / 0.1520 | – | – | 18.91 / 0.2965 | 18.31 / 0.2042 | 19.34 / 0.3193
R3LIVE[1] | hku campus seq 03 | 17.55 / 0.1827 | – | 17.45 / 0.2267 | 19.16 / 0.3262 | 18.59 / 0.2326 | 19.56 / 0.3544
R3LIVE[1] | hkust campus 00 | 18.03 / 0.1559 | – | – | 20.14 / 0.2791 | 19.71 / 0.2361 | 20.52 / 0.2934
R3LIVE[1] | hkust campus 01 | 17.91 / 0.1783 | – | – | 20.13 / 0.3006 | 19.64 / 0.2675 | 20.38 / 0.3159
R3LIVE[1] | hkust campus 02 | 16.47 / 0.1628 | – | – | 18.23 / 0.2631 | 18.29 / 0.2520 | 18.90 / 0.3263
R3LIVE[1] | hkust campus 03 | 17.54 / 0.1849 | – | – | 19.44 / 0.3014 | 19.27 / 0.2548 | 19.79 / 0.3285
FAST-LIVO[8] | hku1 | 19.02 / 0.1278 | 21.32 / 0.2052 | 22.45 / 0.3047 | 21.89 / 0.2773 | 22.23 / 0.2810 | 22.46 / 0.3364
FAST-LIVO[8] | hku2 | 23.69 / 0.2813 | 25.85 / 0.3639 | 26.71 / 0.4377 | 25.54 / 0.3483 | 26.84 / 0.4212 | 26.86 / 0.4223
FAST-LIVO[8] | visual challenge | 19.71 / 0.1027 | 21.84 / 0.1378 | 22.83 / 0.1976 | 21.23 / 0.1483 | 22.53 / 0.1786 | 24.11 / 0.3377
MaRS-LVIG[24] | HKisland 01 | 15.07 / 0.1061 | 14.50 / 0.0855 | 15.68 / 0.1656 | 17.95 / 0.3466 | 19.34 / 0.4872 | 19.71 / 0.5204
MaRS-LVIG[24] | AMvalley 01 | 22.69 / 0.1716 | 21.16 / 0.1198 | 22.23 / 0.2171 | 26.08 / 0.4386 | 26.44 / 0.4852 | 27.21 / 0.5190
MaRS-LVIG[24] | AMtown 01 | 19.90 / 0.1237 | 19.42 / 0.0854 | 19.17 / 0.1654 | 21.94 / 0.2274 | 22.40 / 0.2631 | 22.54 / 0.2679
MaRS-LVIG[24] | HKairport_GNSS 01 | 16.77 / 0.0832 | 16.30 / 0.0708 | 18.89 / 0.1799 | 18.10 / 0.1564 | 19.24 / 0.2228 | 19.54 / 0.2652

  • 1 GSP: Global scene point.  2 RET: Relative exposure time.  –: System failed or not evaluated.

Figure 6: The result of our benchmark experiment. Displayed from left to right: (a) depicts the original image captured by the camera. (b) shows the image generated using our proposed method. (c) shows the image generated using Colmap-PCD[13]. (d) and (e) illustrate the rendered result utilizing state estimations from FAST-LIVO[8] and R3LIVE[1], respectively.

The evaluation results are reported in TABLE I. Compared with the two LiDAR-visual-inertial odometry methods (i.e., R3LIVE and FAST-LIVO), our LVBA (Full) demonstrates significant improvements in mapping accuracy across all tested sequences. Both R3LIVE and FAST-LIVO are based on the ESIKF framework and lack the ability to effectively correct historical errors. When colorizing the point cloud with all images, image frames with substantial state errors manifest as color blur in the point cloud, especially as the sequences get longer. In contrast, our proposed global LiDAR-visual BA optimizes the state estimation of all image frames, leading to global consistency and low color blur, as illustrated in Fig. 6. Compared with Colmap-PCD, our LVBA (Full) achieved better performance on most of the sequences, except for the SSIM on hku2 in the FAST-LIVO dataset. Since the sequence hku2 was captured at night, the images were taken with a high camera gain, resulting in increased measurement noise that can impact our optimization process. In all the remaining sequences, Colmap-PCD is affected by varying degrees of error introduced by feature extraction, matching, and point-to-plane association, resulting in lower performance. Additionally, for sequences with significant exposure time variance, Colmap-PCD may yield inferior results compared to our method due to its lack of exposure time estimation (e.g., the visual challenge sequence), or even fail completely due to the difficulty of identifying proper feature matches (e.g., hku campus seq 01). Furthermore, when evaluating on the R3LIVE dataset, we observed that Colmap-PCD encountered challenges with system initialization, leading to system failures or inaccurate pose estimations. Some qualitative mapping results are illustrated in Fig. 6, demonstrating the better resolution and consistency of our method in constructing high-fidelity colorized 3D maps. More visualization results can be found in our video at https://youtu.be/jtIUBI0U76c.

Figure 7: A visualization of our ablation study. (a) and (b) show the original image captured by the camera. (a1) and (b1) show the details of the original image. (a2) and (b2) display the result produced by our full algorithm. (a3) and (b3) depict the outcome when global scene points are removed. (a4) and (b4) represent the result of our method when the estimation of relative exposure times is removed. This sequence effectively demonstrates the impact and contribution of each component in our method to the overall result.

Compared with the LVBA without global scene point constraints (“w/o GSP”) or relative exposure time estimation (“w/o RET”), our full algorithm, incorporating global constraints, significantly improves mapping quality. Fig. 7 illustrates that a lack of global constraints provided by our global scene points results in color blurring and separation in the colorized point cloud map, and failing to estimate relative exposure times leads to noticeable brightness and color discrepancies between the original image and our rendered result. The sequences hku campus seq 02/03, which provide the ground truth for camera exposure time, were used to compare our estimated relative exposure times against the actual values. As depicted in Fig. 8, while our estimates do not provide the absolute values due to the simplified model, they still successfully capture the general trend of the ground truth exposure times.

Figure 8: The comparison between our estimated relative exposure time and the ground truth.

IV Conclusion

In this paper, we introduced a LiDAR-Visual bundle adjustment framework, aimed at enhancing pose estimation accuracy and ensuring global RGB mapping consistency for LiDAR-camera platforms. By integrating photometric BA for the camera with geometric priors, we achieved high-precision alignment of camera poses with the LiDAR point cloud. Our LiDAR-assisted visibility determination algorithm allowed for the effective global application of this BA method. Through rigorous evaluations, we have demonstrated that our approach surpasses other state-of-the-art methods in both pose estimation accuracy and mapping consistency across multiple datasets. Looking ahead, we aim to incorporate IMU pre-integration and more advanced optimization techniques to further elevate the performance of our system.

References

  • [1] J. Lin and F. Zhang, “R3live: A robust, real-time, rgb-colored, lidar-inertial-visual tightly-coupled state estimation and mapping package,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 10672–10678.
  • [2] T. L. Pavlis and K. A. Mason, “The new world of 3d geologic mapping,” Gsa Today, vol. 27, no. 9, pp. 4–10, 2017.
  • [3] F. Fassi, L. Fregonese, S. Ackermann, and V. De Troia, “Comparison between laser scanning and automated 3d modelling techniques to reconstruct complex and extensive cultural heritage areas,” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 40, pp. 73–80, 2013.
  • [4] A. Delaunoy and M. Pollefeys, “Photometric bundle adjustment for dense multi-view 3d modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1486–1493.
  • [5] Z. Cui, L. Heng, Y. C. Yeo, A. Geiger, M. Pollefeys, and T. Sattler, “Real-time dense mapping for self-driving vehicles using fisheye cameras,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6087–6093.
  • [6] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, “Stereo visual inertial lidar simultaneous localization and mapping,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 370–377.
  • [7] T. Laidlow, M. Bloesch, W. Li, and S. Leutenegger, “Dense rgb-d-inertial slam with map deformations,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 6741–6748.
  • [8] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 4003–4009.
  • [9] H. Alismail, B. Browning, and S. Lucey, Photometric Bundle Adjustment for Vision-Based SLAM.   Springer International Publishing, 2017, pp. 324–341.
  • [10] J. Zhang and S. Singh, “Laser–visual–inertial odometry and mapping with high robustness and low drift,” Journal of field robotics, vol. 35, no. 8, pp. 1242–1264, 2018.
  • [11] J. Graeter, A. Wilczynski, and M. Lauer, “Limo: Lidar-monocular visual odometry,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2018, pp. 7872–7879.
  • [12] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, “Lic-fusion: Lidar-inertial-camera odometry,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 5848–5854.
  • [13] C. Bai, R. Fu, and X. Gao, “Colmap-pcd: An open-source tool for fine image-to-point cloud registration,” arXiv preprint arXiv:2310.05504, 2023.
  • [14] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” in European Conference on Computer Vision (ECCV), 2016.
  • [15] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [16] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
  • [17] J. Lin and F. Zhang, “R3live++: A robust, real-time, radiance reconstruction package with a tightly-coupled lidar-inertial-visual state estimator,” arXiv preprint arXiv:2209.03666, 2022.
  • [18] Z. Liu, X. Liu, and F. Zhang, “Efficient and consistent bundle adjustment on lidar point clouds,” IEEE Transactions on Robotics, vol. 39, no. 6, pp. 4366–4386, 2023.
  • [19] Z. Liu and F. Zhang, “Balm: Bundle adjustment for lidar mapping,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3184–3191, 2021. [Online]. Available: https://dx.doi.org/10.1109/lra.2021.3062815
  • [20] D. Marr and E. Hildreth, “Theory of edge detection,” Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 207, no. 1167, pp. 187–217, 1980.
  • [21] R. Brunelli, Template matching techniques in computer vision: theory and practice.   John Wiley & Sons, 2009.
  • [22] F. Kong, X. Liu, B. Tang, J. Lin, Y. Ren, Y. Cai, F. Zhu, N. Chen, and F. Zhang, “Marsim: A light-weight point-realistic simulator for lidar-based uavs,” IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2954–2961, 2023.
  • [23] A. M. Andrew, “Multiple view geometry in computer vision,” Kybernetes, vol. 30, no. 9/10, pp. 1333–1341, 2001.
  • [24] H. Li, Y. Zou, N. Chen, J. Lin, X. Liu, W. Xu, C. Zheng, R. Li, D. He, F. Kong et al., “Mars-lvig dataset: A multi-sensor aerial robots slam dataset for lidar-visual-inertial-gnss fusion,” The International Journal of Robotics Research, p. 02783649241227968, 2024.

Supplementary material: LVBA_supplementary.pdf