
LVBA: LiDAR-Visual Bundle Adjustment for RGB Point Cloud Mapping

Rundong Li, Xiyuan Liu, Haotian Li, Zheng Liu, Jiarong Lin, Yixi Cai, and Fu Zhang
The authors are with the Department of Mechanical Engineering, The University of Hong Kong, Hong Kong, China (e-mail: {rdli10010, xliuaa, haotianl, u3007335, zivlin, yixicai}@connect.hku.hk; fuzhang@hku.hk).
Abstract

Point cloud maps with accurate color are crucial in robotics and mapping applications. Existing approaches for producing RGB-colorized maps are primarily based on real-time localization using filter-based estimation or sliding window optimization, which may lack accuracy and global consistency. In this work, we introduce a novel global LiDAR-visual bundle adjustment (BA) named LVBA to improve the quality of RGB point cloud mapping beyond existing baselines. LVBA first optimizes LiDAR poses via a global LiDAR BA, followed by a photometric visual BA that incorporates planar features from the LiDAR point cloud for camera pose optimization. Additionally, to address the challenge of map point occlusions when constructing the optimization problem, we implement a novel LiDAR-assisted global visibility algorithm in LVBA. To evaluate the effectiveness of LVBA, we conducted extensive experiments comparing its mapping quality against existing state-of-the-art baselines (i.e., R3LIVE and FAST-LIVO). Our results demonstrate that LVBA can reconstruct high-fidelity, accurate RGB point cloud maps, outperforming its counterparts.

Index Terms:
Mapping, Bundle Adjustment, Sensor Fusion

I Introduction and Related Works

High-fidelity colorized 3D maps are vital in diverse fields including robotics[1], environmental science[2], and civil engineering[3]. They serve various purposes: providing essential data for robot navigation, offering accurate models for environmental monitoring, and forming the basis of digital twin simulations.

Based on the sensors used, existing colorized 3D map reconstruction methods can be primarily categorized into three classes: 1) visual camera-based[4, 5]; 2) RGB-D sensor-based[6, 7]; 3) LiDAR-camera fusion-based[1, 8]. Visual camera-based methods [4, 5], lacking direct depth measurements, triangulate depth from multi-view geometry. This process usually involves feature tracking algorithms, which can introduce systematic errors and increase the risk of incorrect 3D geometry estimation[9]. Methods using RGB-D sensors can utilize direct, dense depth measurements through structured light. However, most of them are range-limited due to the intensity constraints of the detection light. In contrast, 3D points measured by LiDAR sensors are accurate, offering a long detection range with errors within centimeters. Hence, reconstructing colorized 3D maps with LiDAR-camera fusion-based methods is drawing increasing attention in the literature. In this work, we aim to develop a LiDAR-visual bundle adjustment method to improve the mapping quality of LiDAR-camera fusion-based methods.

Figure 1: An RGB point cloud map optimized using our method. The depicted data was captured at the Chong Yuet Ming Physics Building at The University of Hong Kong. Our method effectively optimizes both LiDAR and camera poses, achieving high levels of accuracy and consistency in the mapping process.

Although significant progress has been made in many recent approaches (e.g., [10, 11, 6, 12]), the fusion of LiDAR and camera measurements, as well as the reconstruction of high-fidelity colorized maps, remains challenging. Previous works (e.g., [1, 8]) for LiDAR-visual platforms have primarily focused on improving localization accuracy or enhancing system robustness against sensor degeneration. However, most of these works rely on filter or sliding window methods, which cannot avoid error accumulation and are susceptible to inaccurate extrinsic parameters. These issues directly affect the observation consistency of camera images, resulting in blurred colorized point clouds.

In fact, these problems can be solved by employing a global optimization method (i.e., bundle adjustment). By directly optimizing the camera’s observation consistency using existing geometry information from LiDAR point clouds, globally consistent high-fidelity point clouds can be constructed even without accurate extrinsic parameters.

To our knowledge, the only existing global optimization method for LiDAR-visual sensor fusion is the recently proposed Colmap-PCD[13]. Built on the Colmap[14, 15] framework, Colmap-PCD registers camera frames to a given LiDAR point cloud using a geometric visual bundle adjustment algorithm. Compared to Colmap-PCD, a key difference of our method is the use of photometric visual bundle adjustment, which bypasses the need for feature extraction and matching and is more robust in low-texture environments. Moreover, by effectively utilizing LiDAR scan information, we address the occlusion check without ray-tracing and effectively build up global constraints. This enhances the optimization’s efficiency, as ray-tracing is typically time-consuming. Another advantage of photometric visual bundle adjustment is its ability to estimate camera exposure parameters (e.g., exposure time, gain, etc.)[16], which directly affect image brightness. Existing works often directly estimate a scale factor called “exposure time”[16, 17] without considering the noise model and covariance, which can lead to brightness drift during global bundle adjustment. In this paper, we derive a relative exposure time with covariance consideration, allowing us to estimate exposure time globally.

In summary, we propose a LiDAR-visual bundle adjustment (LVBA) to optimize camera and LiDAR poses. LVBA works in two stages: first, it optimizes LiDAR poses through a LiDAR BA[18]; then, it optimizes camera poses using a photometric BA method, leveraging the geometric prior provided by the LiDAR point cloud. Our main contributions are:

  • We proposed a photometric bundle adjustment to estimate camera states using the prior LiDAR point cloud map, which improves the colorization quality of the point cloud even without accurate time alignment or finely calibrated extrinsic parameters.

  • We proposed a LiDAR-assisted scene point generation and visibility determination algorithm, which establishes global co-visibility among camera frames and facilitates the construction of the global photometric visual bundle adjustment problem.

  • We implemented a toolchain to evaluate the accuracy and consistency of the colorized maps. Using this toolchain, we conducted an extensive evaluation of our proposed LVBA against state-of-the-art LiDAR-Visual-Inertial mapping approaches. Our evaluation results demonstrate that LVBA outperforms other state-of-the-art works in accurately and consistently reconstructing colorized point cloud maps (refer to Fig. 1).

II Methodology

II-A System Overview

The overview of our proposed LiDAR-visual bundle adjustment (LVBA) is shown in Fig. 2. LVBA takes LiDAR scans, camera images, and their rough poses in the same world reference frame as input, typically from a front-end method such as [1, 8]. The system consists of two components: a LiDAR BA part and a visual BA part. First, the LiDAR states are optimized using a LiDAR BA method[18], producing optimal LiDAR pose estimates and planar features. Subsequently, the camera poses are optimized through an iterative coarse-to-fine photometric bundle adjustment, which utilizes the LiDAR scans and extracted planar feature points as priors.

Figure 2: The overview of our system. Our system consists of a LiDAR BA and a visual BA.

II-B LiDAR Bundle Adjustment

In LVBA, we utilize a LiDAR BA method called BALM[19, 18] (version 2) to optimize the LiDAR poses. This method formulates the LiDAR BA problem by leveraging the edge and plane features extracted from LiDAR point clouds. The optimization process seeks to minimize the Euclidean distance between each point in a scan and its neighboring edge or plane. We use BALM to optimize the 6-DoF pose of each LiDAR scan and to build a voxel map that contains plane features extracted from the LiDAR points, which is then used in our later visual BA. We denote by $\widehat{\mathbf{T}}_{L}$ the LiDAR pose after BALM optimization.

To ensure alignment between the camera trajectory and the optimized LiDAR poses, we also need to update the camera poses after BALM optimizes the LiDAR poses. To do so, for each camera frame $\mathbf{T}_{C}\in SE(3)$ before BALM optimization, the closest LiDAR frame $\mathbf{T}_{L}\in SE(3)$ before BALM optimization is found. Note that $\mathbf{T}_{C}$ and $\mathbf{T}_{L}$ are respectively the camera and LiDAR poses in the same world reference frame. Then the camera pose $\mathbf{T}_{C}$ is updated as:

$\widehat{\mathbf{T}}_{C}=\widehat{\mathbf{T}}_{L}\mathbf{T}_{L}^{-1}\mathbf{T}_{C}$   (1)
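As a minimal sketch of this pose update, the snippet below applies Eq. (1) to 4×4 homogeneous transform matrices stored as NumPy arrays; the function name and interface are illustrative and not part of LVBA.

```python
import numpy as np

def update_camera_pose(T_C: np.ndarray, T_L: np.ndarray, T_L_hat: np.ndarray) -> np.ndarray:
    """Propagate the BALM pose correction of the closest LiDAR frame to a camera frame, Eq. (1).

    All inputs are 4x4 homogeneous transforms in the same world frame: T_C and T_L are the
    camera and LiDAR poses before LiDAR BA, and T_L_hat is the LiDAR pose after LiDAR BA.
    """
    # T_C_hat = T_L_hat * T_L^{-1} * T_C applies to the camera the same rigid
    # correction that BALM applied to its closest LiDAR frame.
    return T_L_hat @ np.linalg.inv(T_L) @ T_C
```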

II-C Visual Bundle Adjustment with LiDAR Prior

The visual BA in LVBA first generates visual feature points named “scene points” and then formulates the cost function by projecting scene points onto different image frames and minimizing the photometric discrepancy between them. Two types of scene points are used: local scene points and global scene points. Sec. II-C1 and II-C2 present local scene point selection and visibility determination. Sec. II-C3 presents global scene point selection and visibility determination. Finally, Sec. II-C4 and II-C5 describe the construction of our photometric cost and its optimization.

II-C1 Local Scene Point Generation

Our visual BA begins by generating local scene points for each image frame. When selecting local scene points, we prefer points surrounded by complex textures, as they provide richer photometric details and more effective constraints when projected onto an image frame. Specifically, for each camera frame, its image is divided into grid cells. The LiDAR planar feature points captured when the LiDAR is close to this camera frame are then projected onto these grid cells. To ensure the view quality of the scene points, only those LiDAR feature points whose surface normal is roughly aligned with the view direction from the camera frame are considered. Specifically, the projected LiDAR feature points should fulfill the following condition:

$\Big|\mathbf{n}_{f}^{T}\dfrac{\mathbf{p}_{f}-\mathbf{t}_{C}}{\|\mathbf{p}_{f}-\mathbf{t}_{C}\|_{2}}\Big|>\alpha_{0}$   (2)

where $\mathbf{p}_{f}$ and $\mathbf{n}_{f}$ are the position and normal vector of the LiDAR planar feature point, respectively, $\mathbf{t}_{C}$ is the position of the camera frame, and $\alpha_{0}$ is a threshold. After each LiDAR feature point is projected, a score reflecting the intensity gradients around the candidate scene point is computed for every projected point using the Difference of Gaussians (DoG) method [20]. Within each grid cell, the point with the highest score is selected as a local scene point. These selected local scene points are denoted as $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})\in\mathbb{R}^{3}\times\mathbb{S}^{2}$, where $\mathbf{p}$ is the position of the point and $\mathbf{n}$ is the normal vector estimated by our LiDAR BA. For each selected local scene point, the corresponding camera frame is denoted as the “reference frame” of this local scene point. In scenarios where a grid cell contains only points with low scores (indicating a lack of useful photometric information, such as on a textureless surface), no point is selected from that grid cell.
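The per-frame selection described above can be sketched as follows. This is an illustrative implementation under simplifying assumptions: an undistorted pin-hole camera, a Difference-of-Gaussians response computed with SciPy as the texture score, and placeholder values for the grid cell size, the threshold $\alpha_0$, and the low-score cutoff; none of the names or constants are taken from the LVBA code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def select_local_scene_points(gray, K, R_c, t_c, feat_pts, feat_normals,
                              cell=32, alpha0=0.5, min_score=1e-3):
    """Pick at most one LiDAR planar feature point per image grid cell (Eq. (2) + DoG score).

    gray: HxW float image; K: 3x3 intrinsics; (R_c, t_c): camera-to-world pose;
    feat_pts / feat_normals: Nx3 planar feature points and their normals from the LiDAR BA.
    Returns a list of (point, normal, pixel) tuples, i.e. the local scene points.
    """
    H, W = gray.shape
    # Difference-of-Gaussians response used as a texture score around each pixel.
    dog = np.abs(gaussian_filter(gray, 1.0) - gaussian_filter(gray, 2.0))

    best = {}  # (cell_row, cell_col) -> (score, point, normal, pixel)
    for p, n in zip(feat_pts, feat_normals):
        view = p - t_c
        dist = np.linalg.norm(view)
        # Eq. (2): keep points whose surface normal is roughly aligned with the view ray.
        if dist < 1e-6 or abs(n @ view / dist) <= alpha0:
            continue
        pc = R_c.T @ view                      # point in camera coordinates
        if pc[2] <= 0.1:                       # behind or too close to the camera (extra guard)
            continue
        u = K @ (pc / pc[2])                   # pin-hole projection, u = [col, row, 1]
        col, row = int(u[0]), int(u[1])
        if not (0 <= col < W and 0 <= row < H):
            continue
        score = dog[row, col]
        key = (row // cell, col // cell)
        if key not in best or score > best[key][0]:
            best[key] = (score, p, n, (col, row))

    # Cells whose best score is too low (textureless regions) contribute no scene point.
    return [(p, n, uv) for s, p, n, uv in best.values() if s > min_score]
```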

II-C2 Local Visibility Determination

After generating local scene points for each frame, our next step is to identify the other frames that can observe these points. These identified frames are denoted as the “target frames” of each scene point and will then be used to construct the corresponding cost items in (8). To facilitate this, a local visibility determination process is implemented. In this process, only frames within a sliding window relative to the reference frame of a local scene point are considered. Since these frames are close to the reference frame, the parallax is insignificant and a local scene point is usually directly visible in these frames without a further occlusion check. As a consequence, we only have to examine whether the view direction between the local scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})$ and the candidate target frame at $\mathbf{T}_{t}=(\mathbf{R}_{t},\mathbf{t}_{t})$ is within the frame’s FoV. The FoV check is much more efficient than an occlusion check based on ray-casting, but it may give false visibility results for scene points on the boundary between foreground and background objects. To fix this issue, we project, using the initial sensor poses, the local scene point onto the reference frame and the candidate target frame to evaluate the photometric discrepancy between them. If a scene point is truly visible in the target frame, the initial photometric discrepancy is usually small. Finally, to ensure the view quality of the scene point, we choose target frames that view the scene point along a direction close to the point’s surface normal. Consequently, the local visibility determination checks are:

$|\mathbf{d}^{T}\mathbf{z}_{C}|>\alpha_{1},\quad|\mathbf{d}^{T}\mathbf{n}|>\alpha_{2},\quad\mathtt{NCC}(\mathbf{I}_{r},\mathbf{I}_{t})>\alpha_{3}$   (3)

where $\mathbf{d}=\frac{\mathbf{p}-\mathbf{t}_{t}}{\|\mathbf{p}-\mathbf{t}_{t}\|_{2}}$ denotes the view direction from the candidate target frame to the local scene point, $\mathbf{z}_{C}$ is the z-axis of the camera frame (perpendicular to the camera image plane), $\mathbf{I}_{r}$ and $\mathbf{I}_{t}$ are the RGB values of the patches generated on the reference frame and the candidate target frame, respectively, and $\mathtt{NCC}(\cdot,\cdot)$ is the Normalized Cross-Correlation (NCC) [21] between the two patches. The details of patch generation are given in Sec. II-C4. $\alpha_{1}$, $\alpha_{2}$, and $\alpha_{3}$ are three thresholds. If a local scene point is deemed visible by a frame (i.e., satisfying (3)), that frame is selected as a target frame and contributes to the final optimization cost in (8).
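A sketch of the three checks in Eq. (3) is given below; the patches are assumed to be precomputed RGB arrays, and the threshold values are placeholders rather than the ones used in LVBA.

```python
import numpy as np


def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized (RGB) patches."""
    a = patch_a.ravel().astype(float)
    b = patch_b.ravel().astype(float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 1e-9 else 0.0


def is_locally_visible(p, n, R_t, t_t, patch_ref, patch_tgt,
                       alpha1=0.5, alpha2=0.3, alpha3=0.7):
    """Evaluate the three conditions of Eq. (3) for one candidate target frame."""
    d = p - t_t
    d = d / np.linalg.norm(d)            # view direction from the target frame to the point
    z_c = R_t[:, 2]                      # camera z-axis (optical axis) in the world frame
    return (abs(d @ z_c) > alpha1 and    # roughly inside the field of view
            abs(d @ n) > alpha2 and      # viewed along a direction close to the normal
            ncc(patch_ref, patch_tgt) > alpha3)  # small initial photometric discrepancy
```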

II-C3 Global Scene Point Selection and Visibility Generation

Local scene points provide effective local constraints for camera pose optimization. However, they fail to provide more global constraints for camera poses far apart observing the same area. Therefore, we introduce global scene points and global visibility that provide more comprehensive, global constraints. When generating global scene points and determining global visibility, the distance between the reference and target frames increases, making the occlusion check necessary. Typically, a ray-casting algorithm (e.g., in [22]) is employed to address the occlusion check, but this approach not only increases the computation costs but also risks inaccuracies in sparser point cloud scenarios. To overcome these challenges, we implement a LiDAR-assisted method based on a global visibility voxel map, addressing the occlusion check with more reliable results and reduced computational load, as illustrated in Fig. 3.

Figure 3: LiDAR-assisted Global Visibility Map. A voxel map that stores the visibility information of each voxel is constructed from LiDAR scans (Sec. II-C3). After that, a global scene point is selected from all scene points in each voxel (Sec. II-C3). The visibility voxel map, together with the selected global scene points, is then used in the global visibility determination process (Sec. II-C3).
Global Visibility Voxel Map

Our method for global scene point selection and visibility determination relies on a global visibility voxel map created using LiDAR scan data. This map records the visibility information for each voxel and is constructed after the optimization of LiDAR poses using BALM. During construction, for each point in a LiDAR scan, the voxel it belongs to is appended with the pose of the LiDAR scan, indicating that the voxel is visible from the position of this LiDAR scan. Subsequently, camera frames proximate to these visible positions are identified and stored. Consequently, each voxel in the map retains a set of indices of camera frames from which it is visible.
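The construction of this voxel map can be sketched as below, assuming the scans are already transformed into the world frame with the BALM-optimized poses; the voxel size and the camera-to-LiDAR proximity radius are illustrative parameters, not values from LVBA.

```python
import numpy as np
from collections import defaultdict


def build_visibility_voxel_map(lidar_scans, lidar_positions, camera_positions,
                               voxel_size=1.0, cam_radius=0.5):
    """Build a global visibility voxel map from the optimized LiDAR scans.

    lidar_scans: list of (Ni x 3) world-frame point arrays; lidar_positions: the position of
    each scan; camera_positions: (M x 3) camera positions. Each voxel ends up storing the
    indices of camera frames close to a LiDAR viewpoint from which the voxel was observed.
    """
    cam_pos = np.asarray(camera_positions)
    voxel_map = defaultdict(set)  # voxel index (3-tuple) -> set of camera frame indices
    for pts, t_l in zip(lidar_scans, lidar_positions):
        # Camera frames proximate to this LiDAR viewpoint.
        near_cams = np.flatnonzero(np.linalg.norm(cam_pos - t_l, axis=1) < cam_radius)
        # Every voxel touched by this scan is visible from that viewpoint.
        keys = np.unique(np.floor(np.asarray(pts) / voxel_size).astype(int), axis=0)
        for key in map(tuple, keys):
            voxel_map[key].update(near_cams.tolist())
    return voxel_map
```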

Global Scene Point Selection

To limit the computation cost, only a subset of all local scene points, namely the global scene points, is used to construct global constraints. To select these global scene points, we distribute all local scene points into the visibility voxel map. For each voxel, the local scene point that exhibits the best observation quality from its reference frame is selected. To model the observation quality, we define a score $s$, which describes the projection area on the reference image of a small surface element at the scene point. A larger projection area indicates better observation quality. More specifically, for each candidate scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})$, the score $s$ is calculated as:

$s=\dfrac{\mathbf{n}^{T}(\mathbf{p}-\mathbf{t}_{r})}{\|\mathbf{p}-\mathbf{t}_{r}\|^{2}_{2}}$   (4)

where $\mathbf{t}_{r}$ is the position of the corresponding reference frame. The local scene point within each voxel that achieves the highest score is then selected as a global scene point.
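A sketch of this per-voxel selection using the score of Eq. (4) is shown below; it assumes the scene point normals are oriented toward their reference cameras so that the signed score is positive for well-observed points, and the voxel size is an illustrative parameter.

```python
import numpy as np


def select_global_scene_points(local_points, voxel_size=1.0):
    """Keep the local scene point with the largest Eq. (4) score in each voxel.

    local_points: iterable of (p, n, t_r) with the point position, its normal, and the
    position of its reference frame, each a length-3 NumPy array.
    """
    best = {}  # voxel index -> (score, p, n, t_r)
    for p, n, t_r in local_points:
        d = p - t_r
        score = (n @ d) / (d @ d)   # Eq. (4): projected area of a unit surface element
        key = tuple(np.floor(p / voxel_size).astype(int))
        if key not in best or score > best[key][0]:
            best[key] = (score, p, n, t_r)
    return [(p, n, t_r) for _, p, n, t_r in best.values()]
```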

Global Visibility Determination

Once the global scene points have been selected, the next step is to ascertain their visibility across all camera frames. This process comprises two main steps. First, using the global visibility map, we take the camera frames stored in the voxel containing the global scene point as candidate target frames, since these camera frames are the most likely to observe this voxel and hence the global scene point in it. Second, we identify the candidate target frames satisfying (3) as the true target frames, following the same reasoning as in the local visibility determination.
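The two-step lookup can be sketched as follows, reusing the is_locally_visible check from the earlier sketch; camera_poses, patches_fn, and the voxel size are assumed interfaces introduced for illustration only.

```python
import numpy as np


def global_target_frames(p, n, voxel_map, camera_poses, patches_fn, voxel_size=1.0):
    """Find the target frames of a global scene point via the visibility voxel map.

    camera_poses: list of (R_t, t_t) camera-to-world poses; patches_fn(frame_idx) returns
    the (reference, target) patch pair used by the NCC test in Eq. (3).
    """
    # Step 1: candidate target frames are the frames stored in the point's voxel.
    key = tuple(np.floor(p / voxel_size).astype(int))
    candidates = voxel_map.get(key, set())
    # Step 2: keep the candidates that pass the checks of Eq. (3).
    targets = []
    for idx in candidates:
        R_t, t_t = camera_poses[idx]
        patch_ref, patch_tgt = patches_fn(idx)
        if is_locally_visible(p, n, R_t, t_t, patch_ref, patch_tgt):
            targets.append(idx)
    return targets
```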

II-C4 Photometric Error Formulation

Figure 4: Photometric Error Formulation: A scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})$ is first projected onto the reference image frame at $\mathbf{T}_{r}$, and a reference patch $\{\mathbf{u}_{r}^{(i)}\}$ is generated. Then, the reference patch is projected and warped onto the target image frame at $\mathbf{T}_{t}$ by a homography transformation $\mathbf{H}$ to generate a target patch $\{\mathbf{u}_{t}^{(i)}\}$. Finally, a photometric error is constructed from the $L_2$-norm of the radiance error between the two patches.

The visual BA in LVBA optimizes the photometric discrepancy of a scene point $\boldsymbol{\pi}=(\mathbf{p},\mathbf{n})\in\mathbb{R}^{3}\times\mathbb{S}^{2}$ (either a local or a global scene point) between the reference frame and a target frame, whose poses are denoted as $\mathbf{T}_{r}=(\mathbf{R}_{r},\mathbf{t}_{r})\in SE(3)$ and $\mathbf{T}_{t}=(\mathbf{R}_{t},\mathbf{t}_{t})\in SE(3)$, respectively. As shown in Fig. 4, a cost item is constructed as follows:

Firstly, the scene point $\boldsymbol{\pi}$ is projected onto the reference image plane via the pin-hole camera projection model:

$\hat{\mathbf{u}}_{r}=\mathbf{K}\mathbf{R}_{r}^{T}(\mathbf{p}-\mathbf{t}_{r})$   (5)

where $\mathbf{K}$ is the camera intrinsic matrix and $\hat{\mathbf{u}}_{r}\in\mathbb{R}^{3}$ is the homogeneous form of $\mathbf{u}_{r}\in\mathbb{R}^{2}$ in the reference image plane coordinates (hereafter, $\hat{(\cdot)}$ denotes the homogeneous form of a vector). Then, a reference patch $\{\mathbf{u}_{r}^{(i)}\}$ is generated around the projected point. Utilizing the plane normal of the scene point and the camera projection model, the reference patch can be projected onto the target frame via a homography transformation[23]:

$\hat{\mathbf{u}}_{t}^{(i)}=\mathbf{H}(\mathbf{T}_{r},\mathbf{T}_{t},\boldsymbol{\pi})\,\hat{\mathbf{u}}_{r}^{(i)}$   (6)

where the homography matrix $\mathbf{H}$ can be expressed as

$\mathbf{H}=\mathbf{K}\mathbf{R}_{t}^{-1}\big[\mathbf{n}^{T}(\mathbf{p}-\mathbf{t}_{r})\mathbf{E}+(\mathbf{t}_{r}-\mathbf{t}_{t})\mathbf{n}^{T}\big]\mathbf{R}_{r}\mathbf{K}^{-1}$   (7)

and $\mathbf{E}$ is the $3\times 3$ identity matrix. Finally, the photometric cost is constructed as:

$L_{\text{photo}}(\boldsymbol{\mathcal{T}}_{r},\boldsymbol{\mathcal{T}}_{t};\boldsymbol{\pi})=\sum_{i=0}^{N^{2}}\big\|\epsilon_{t}^{-1}\mathbf{I}_{t}(\mathbf{u}_{t}^{(i)})-\epsilon_{r}^{-1}\mathbf{I}_{r}(\mathbf{u}_{r}^{(i)})\big\|_{\boldsymbol{\Sigma}}^{2},\qquad\boldsymbol{\Sigma}=\Big(\dfrac{1}{\epsilon_{r}^{2}+\epsilon_{t}^{2}}\Big)\mathbf{E}$   (8)

where $\mathbf{I}_{r}(\cdot)$ and $\mathbf{I}_{t}(\cdot)$ return the RGB color vector by bi-linear interpolation on the reference and target images, respectively, $\boldsymbol{\mathcal{T}}_{r}=(\mathbf{T}_{r},\epsilon_{r})$ and $\boldsymbol{\mathcal{T}}_{t}=(\mathbf{T}_{t},\epsilon_{t})$ are the camera states of the reference and target frames, $\epsilon_{r}$ and $\epsilon_{t}$ are their relative exposure times, and $\boldsymbol{\Sigma}$ is the covariance matrix of the cost function. The relative exposure time $\epsilon$ and covariance matrix $\boldsymbol{\Sigma}$ are defined using a simplified camera measurement model; further details are provided in Appendix A (https://github.com/hku-mars/mapping_eval).
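For concreteness, the cost of Eqs. (5)-(8) for one scene point and one reference/target pair can be sketched as below. The sketch assumes camera-to-world poses, HxWx3 float RGB images, nearest-neighbour lookup instead of bilinear interpolation, and omits image boundary checks; it is an illustration of the formulation, not the LVBA implementation.

```python
import numpy as np


def homography(K, R_r, t_r, R_t, t_t, p, n):
    """Plane-induced homography of Eq. (7), mapping reference pixels to the target image."""
    E = np.eye(3)
    plane = float(n @ (p - t_r)) * E + np.outer(t_r - t_t, n)
    return K @ np.linalg.inv(R_t) @ plane @ R_r @ np.linalg.inv(K)


def photometric_cost(I_r, I_t, K, state_r, state_t, p, n, patch_half=2):
    """Photometric cost of Eq. (8) for one scene point between reference and target frames.

    state = (R, t, eps), with eps the relative exposure time of the frame.
    """
    R_r, t_r, eps_r = state_r
    R_t, t_t, eps_t = state_t
    # Eq. (5): project the scene point into the reference image.
    u = K @ (R_r.T @ (p - t_r))
    u = u[:2] / u[2]
    H = homography(K, R_r, t_r, R_t, t_t, p, n)
    info = eps_r ** 2 + eps_t ** 2          # Sigma^{-1} of Eq. (8), up to the identity factor
    cost = 0.0
    offsets = range(-patch_half, patch_half + 1)
    for dy in offsets:                       # reference patch {u_r^(i)} around the projection
        for dx in offsets:
            ur = np.array([u[0] + dx, u[1] + dy, 1.0])
            ut = H @ ur                      # Eq. (6): warp the pixel to the target image
            ut = ut[:2] / ut[2]
            c_r = I_r[int(round(ur[1])), int(round(ur[0]))]
            c_t = I_t[int(round(ut[1])), int(round(ut[0]))]
            res = c_t / eps_t - c_r / eps_r  # exposure-compensated radiance error
            cost += info * float(res @ res)
    return cost
```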

II-C5 Levenberg-Marquardt Optimization

With our constructed photometric cost items, the optimal estimation of the camera states $\boldsymbol{\mathcal{T}}^{\ast}$ is given by:

$\boldsymbol{\mathcal{T}}^{\ast}=\underset{\boldsymbol{\mathcal{T}}}{\operatorname{argmin}}\sum_{i=1}^{M_{s}}\sum_{j\in V_{i}}L_{\text{photo}}(\boldsymbol{\mathcal{T}}_{r(i)},\boldsymbol{\mathcal{T}}_{j};\boldsymbol{\pi}_{i})$   (9)

where $\boldsymbol{\mathcal{T}}=\{\boldsymbol{\mathcal{T}}_{1},\boldsymbol{\mathcal{T}}_{2},\cdots,\boldsymbol{\mathcal{T}}_{M_{p}}\}$ is the camera state set, $\boldsymbol{\Pi}=\{\boldsymbol{\pi}_{1},\boldsymbol{\pi}_{2},\cdots,\boldsymbol{\pi}_{M_{s}}\}$ is the scene point set (both local and global), $M_{p}$ and $M_{s}$ are the total numbers of camera poses and scene points, respectively, $r(i)$ is the reference frame index of the $i$-th scene point, $V_{i}$ is the target frame set (both local and global), and $L_{\text{photo}}$ is the photometric cost item defined in (8). In LVBA, we employ a Levenberg-Marquardt optimizer to minimize the cost function in (9).

To enhance the robustness of the cost function against inaccurate initial estimates, our visual BA employs an iterative coarse-to-fine strategy. This strategy down-samples the images into a pyramid and sequentially optimizes the camera states from the top (coarsest) layer of the pyramid down to the original resolution. In each iteration, we utilize the optimized states obtained from the previous, coarser layer to generate scene points and determine visibility.
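A sketch of this coarse-to-fine loop is given below, with the per-level scene point generation, visibility determination, and Levenberg-Marquardt solve abstracted into a placeholder callback optimize_level; the pyramid is built here by simple subsampling, and the number of levels is illustrative.

```python
import numpy as np


def coarse_to_fine_ba(images, K, init_states, optimize_level, n_levels=3):
    """Run the visual BA from the coarsest pyramid level down to the original resolution.

    images: list of HxWx3 float images; K: 3x3 float intrinsics; init_states: initial camera
    states; optimize_level(images_l, K_l, states) performs one round of scene point
    generation, visibility determination, and LM optimization of Eq. (9) at one level.
    """
    states = init_states
    for level in reversed(range(n_levels)):       # coarsest level first
        step = 2 ** level
        K_l = K.copy()
        K_l[:2, :] /= step                        # rescale focal lengths and principal point
        images_l = [img[::step, ::step] for img in images]  # simple image pyramid
        # Scene points and visibility are regenerated with the states refined at the
        # previous (coarser) level, then all camera states are optimized again.
        states = optimize_level(images_l, K_l, states)
    return states
```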

III Experiments

III-A Experiment Setup

During the experiments, we compared our LVBA with other state-of-the-art LiDAR-visual(-inertial) sensor fusion methods, including R3LIVE[1], FAST-LIVO[8], and Colmap-PCD[13]. We performed evaluations on three datasets: the R3LIVE dataset[1], the FAST-LIVO dataset[8], and MaRS-LVIG[24]. We conducted the mapping accuracy evaluation on all three datasets and the trajectory accuracy evaluation on MaRS-LVIG, which provides ground truth trajectories.

For both Colmap-PCD and LVBA, we utilize R3LIVE to provide the initial estimation when evaluating on the R3LIVE and MaRS-LVIG datasets, and FAST-LIVO to provide the initial estimation when evaluating on the FAST-LIVO dataset. Since these data are collected at a high rate (typically 10 Hz), optimizing all frame poses on the raw data is computationally demanding for LVBA and Colmap-PCD, which are both global optimization approaches. To constrain the computation load, we extracted keyframes from the camera frames. Furthermore, since some of the sequences are excessively long for Colmap-PCD and LVBA, leading to prohibitive time and memory costs, we split them into sub-sequences and take the average of the evaluation results over the sub-sequences as the final result for each sequence.

III-B Mapping Evaluation

III-B1 Evaluation Method

Figure 5: A sample output from our mapping evaluation algorithm. (a) displays the original image captured by the camera. (b) shows the image as rendered by our algorithm, and (c) presents the depth map, derived from the point cloud, which was used in the rendering of image (b). The rendered image (b) is then compared to the original (a) for evaluation using metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM).

To evaluate the mapping accuracy and consistency of our optimization results, we use a colorizing-and-rendering algorithm. Initially, an RGB-radiance point cloud is created using the LiDAR scans and camera images. For each LiDAR point $\mathbf{p}^{L}$ in the LiDAR frame $\mathbf{T}_{L}=(\mathbf{R}_{L},\mathbf{t}_{L})$, the nearest camera frame at $\mathbf{T}_{C}=(\mathbf{R}_{C},\mathbf{t}_{C})$ is identified. The radiance value of this point is calculated as:

$\mathbf{r}=\epsilon^{-1}\mathbf{I}\Big(\boldsymbol{\Pi}\big(\mathbf{R}_{C}^{T}(\mathbf{R}_{L}\mathbf{p}^{L}+\mathbf{t}_{L}-\mathbf{t}_{C})\big)\Big)$   (10)

where $\epsilon$ is the estimated relative exposure time of the camera frame (for benchmark methods that do not estimate the relative exposure time, $\epsilon$ is set to $1$), $\boldsymbol{\Pi}(\cdot)$ denotes the projection onto the camera's image plane, and $\mathbf{I}(\cdot)$ yields the RGB value. After constructing the radiance map, we render raw images from it. For each image with optimized camera pose and relative exposure time $\epsilon$, we project the colorized radiance map onto its image plane to create a rendered image, whose RGB color is determined by $\mathbf{I}=\epsilon\mathbf{r}$. An example of our rendering result is shown in Fig. 5. Finally, the discrepancy between these rendered images and the original raw images is evaluated using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics. The source code of our evaluation toolchain is available on GitHub (https://github.com/hku-mars/mapping_eval).
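The colorization step of Eq. (10) can be sketched as follows, assuming an undistorted pin-hole camera and nearest-neighbour pixel lookup; rendering then reuses the stored radiance scaled by each frame's relative exposure time.

```python
import numpy as np


def colorize_point(p_L, R_L, t_L, R_C, t_C, K, image, eps=1.0):
    """Radiance of one LiDAR point taken from its nearest camera frame, Eq. (10).

    p_L: point in the LiDAR frame; (R_L, t_L), (R_C, t_C): LiDAR and camera poses in the
    world frame; eps: estimated relative exposure time (1 for methods that do not estimate it).
    Returns None if the point does not project into the image.
    """
    pc = R_C.T @ (R_L @ p_L + t_L - t_C)     # point expressed in camera coordinates
    if pc[2] <= 0:
        return None
    u = K @ (pc / pc[2])                     # pin-hole projection Pi(.)
    col, row = int(round(u[0])), int(round(u[1]))
    H, W = image.shape[:2]
    if not (0 <= col < W and 0 <= row < H):
        return None
    return image[row, col].astype(float) / eps   # exposure-compensated radiance r
```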

III-B2 Evaluation Setup

In this section, we performed a mapping accuracy evaluation on all three datasets. Due to the strict time synchronization requirement of FAST-LIVO, which is not satisfied by the R3LIVE dataset, we did not evaluate FAST-LIVO on the R3LIVE dataset. Further, as LVBA and Colmap-PCD only estimate the poses of the keyframes, we assessed R3LIVE and FAST-LIVO only at those keyframes, although their results are obtained by running on all frames at 10 Hz.

To assess the effectiveness of our proposed global scene point selection and visibility determination, we ran our algorithm without any global scene point constraints (termed “w/o GSP”). Furthermore, to evaluate the effectiveness of our proposed relative exposure time estimation, we ran the algorithm with all relative exposure times fixed to one (termed “w/o RET”). Both experiments were conducted under the same settings as our full method described above.

III-B3 Comparison Results

TABLE I: Result of Mapping Evaluation. Each cell reports PSNR↑ / SSIM↑.

Dataset | Sequence | R3LIVE[1] | FAST-LIVO[8] | Colmap-PCD[13] | Ours (w/o GSP1) | Ours (w/o RET2) | Ours (Full)
R3LIVE[1] | hku campus seq 00 | 18.13 / 0.1741 | – | 21.36 / 0.3629 | 20.56 / 0.3328 | 21.06 / 0.3177 | 21.64 / 0.3852
R3LIVE[1] | hku campus seq 01 | 17.82 / 0.1995 | – | – | 19.72 / 0.3478 | 19.21 / 0.2905 | 20.14 / 0.3750
R3LIVE[1] | hku campus seq 02 | 16.97 / 0.1520 | – | – | 18.91 / 0.2965 | 18.31 / 0.2042 | 19.34 / 0.3193
R3LIVE[1] | hku campus seq 03 | 17.55 / 0.1827 | – | 17.45 / 0.2267 | 19.16 / 0.3262 | 18.59 / 0.2326 | 19.56 / 0.3544
R3LIVE[1] | hkust campus 00 | 18.03 / 0.1559 | – | – | 20.14 / 0.2791 | 19.71 / 0.2361 | 20.52 / 0.2934
R3LIVE[1] | hkust campus 01 | 17.91 / 0.1783 | – | – | 20.13 / 0.3006 | 19.64 / 0.2675 | 20.38 / 0.3159
R3LIVE[1] | hkust campus 02 | 16.47 / 0.1628 | – | – | 18.23 / 0.2631 | 18.29 / 0.2520 | 18.90 / 0.3263
R3LIVE[1] | hkust campus 03 | 17.54 / 0.1849 | – | – | 19.44 / 0.3014 | 19.27 / 0.2548 | 19.79 / 0.3285
FAST-LIVO[8] | hku1 | 19.02 / 0.1278 | 21.32 / 0.2052 | 22.45 / 0.3047 | 21.89 / 0.2773 | 22.23 / 0.2810 | 22.46 / 0.3364
FAST-LIVO[8] | hku2 | 23.69 / 0.2813 | 25.85 / 0.3639 | 26.71 / 0.4377 | 25.54 / 0.3483 | 26.84 / 0.4212 | 26.86 / 0.4223
FAST-LIVO[8] | visual challenge | 19.71 / 0.1027 | 21.84 / 0.1378 | 22.83 / 0.1976 | 21.23 / 0.1483 | 22.53 / 0.1786 | 24.11 / 0.3377
MaRS-LVIG[24] | HKisland 01 | 15.07 / 0.1061 | 14.50 / 0.0855 | 15.68 / 0.1656 | 17.95 / 0.3466 | 19.34 / 0.4872 | 19.71 / 0.5204
MaRS-LVIG[24] | AMvalley 01 | 22.69 / 0.1716 | 21.16 / 0.1198 | 22.23 / 0.2171 | 26.08 / 0.4386 | 26.44 / 0.4852 | 27.21 / 0.5190
MaRS-LVIG[24] | AMtown 01 | 19.90 / 0.1237 | 19.42 / 0.0854 | 19.17 / 0.1654 | 21.94 / 0.2274 | 22.40 / 0.2631 | 22.54 / 0.2679
MaRS-LVIG[24] | HKairport_GNSS 01 | 16.77 / 0.0832 | 16.30 / 0.0708 | 18.89 / 0.1799 | 18.10 / 0.1564 | 19.24 / 0.2228 | 19.54 / 0.2652

  • 1 GSP: Global scene point.  2 RET: Relative exposure time.  –: System failed or not evaluated.

Figure 6: The result of our benchmark experiment. Displayed from left to right: (a) depicts the original image captured by the camera. (b) shows the image generated using our proposed method. (c) shows the image generated using Colmap-PCD[13]. (d) and (e) illustrate the rendered result utilizing state estimations from FAST-LIVO[8] and R3LIVE[1], respectively.

The evaluation results are reported in TABLE I. Compared with the two LiDAR-visual-inertial odometry methods (i.e., R3LIVE and FAST-LIVO), our LVBA (Full) demonstrates significant improvements in mapping accuracy across all tested sequences. Both R3LIVE and FAST-LIVO are based on the ESIKF framework and lack the ability to effectively correct historical errors. When colorizing the point cloud with all images, image frames with substantial state errors manifest as color blur in the point cloud, especially as the sequences get longer. In contrast, our proposed global LiDAR-visual BA optimizes the state estimation of all image frames, leading to global consistency and low color blur, as illustrated in Fig. 6. Compared with Colmap-PCD, our LVBA (Full) achieved better performance on most of the sequences, except for the SSIM on hku2 in the FAST-LIVO dataset. Since the sequence hku2 was captured at night, the images were taken with a high camera gain, resulting in increased measurement noise that can impact our optimization process. In all the remaining sequences, Colmap-PCD is affected by varying degrees of error introduced by feature extraction, matching, and point-to-plane association, resulting in lower performance. Additionally, for sequences with significant exposure time variance, Colmap-PCD may yield inferior results compared to our method due to its lack of exposure time estimation (e.g., the visual challenge sequence), or even fail completely due to the difficulty of identifying proper feature matches (e.g., hku campus seq 01). Furthermore, when evaluating on the R3LIVE dataset, we observed that Colmap-PCD encountered challenges with system initialization, leading to system failures or inaccurate pose estimations. Some qualitative mapping results are illustrated in Fig. 6, demonstrating the better resolution and consistency of our method in constructing high-fidelity colorized 3D maps. More visualization results can be found in our video at https://youtu.be/jtIUBI0U76c.

Figure 7: A visualization of our ablation study. (a) and (b) show the original image captured by the camera. (a1) and (b1) show the details of the original image. (a2) and (b2) display the result produced by our full algorithm. (a3) and (b3) depict the outcome when global scene points are removed. (a4) and (b4) represent the result of our method when the estimation of relative exposure times is removed. This sequence effectively demonstrates the impact and contribution of each component in our method to the overall result.

Compared with the LVBA without global scene point constraints (“w/o GSP”) or relative exposure time estimation (“w/o RET”), our full algorithm, incorporating global constraints, significantly improves mapping quality. Fig. 7 illustrates that a lack of global constraints provided by our global scene points results in color blurring and separation in the colorized point cloud map, and failing to estimate relative exposure times leads to noticeable brightness and color discrepancies between the original image and our rendered result. The sequences hku campus seq 02/03, which provide the ground truth for camera exposure time, were used to compare our estimated relative exposure times against the actual values. As depicted in Fig. 8, while our estimates do not provide the absolute values due to the simplified model, they still successfully capture the general trend of the ground truth exposure times.

Figure 8: The comparison between our estimated relative exposure time and the ground truth.

IV Conclusion

In this paper, we introduced a LiDAR-Visual bundle adjustment framework, aimed at enhancing pose estimation accuracy and ensuring global RGB mapping consistency for LiDAR-camera platforms. By integrating photometric BA for the camera with geometric priors, we achieved high-precision alignment of camera poses with the LiDAR point cloud. Our LiDAR-assisted visibility determination algorithm allowed for the effective global application of this BA method. Through rigorous evaluations, we have demonstrated that our approach surpasses other state-of-the-art methods in both pose estimation accuracy and mapping consistency across multiple datasets. Looking ahead, we aim to incorporate IMU pre-integration and more advanced optimization techniques to further elevate the performance of our system.

References

  • [1] J. Lin and F. Zhang, “R3live: A robust, real-time, rgb-colored, lidar-inertial-visual tightly-coupled state estimation and mapping package,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 10672–10678.
  • [2] T. L. Pavlis and K. A. Mason, “The new world of 3d geologic mapping,” Gsa Today, vol. 27, no. 9, pp. 4–10, 2017.
  • [3] F. Fassi, L. Fregonese, S. Ackermann, and V. De Troia, “Comparison between laser scanning and automated 3d modelling techniques to reconstruct complex and extensive cultural heritage areas,” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 40, pp. 73–80, 2013.
  • [4] A. Delaunoy and M. Pollefeys, “Photometric bundle adjustment for dense multi-view 3d modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1486–1493.
  • [5] Z. Cui, L. Heng, Y. C. Yeo, A. Geiger, M. Pollefeys, and T. Sattler, “Real-time dense mapping for self-driving vehicles using fisheye cameras,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6087–6093.
  • [6] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, “Stereo visual inertial lidar simultaneous localization and mapping,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 370–377.
  • [7] T. Laidlow, M. Bloesch, W. Li, and S. Leutenegger, “Dense rgb-d-inertial slam with map deformations,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 6741–6748.
  • [8] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 4003–4009.
  • [9] H. Alismail, B. Browning, and S. Lucey, Photometric Bundle Adjustment for Vision-Based SLAM.   Springer International Publishing, 2017, pp. 324–341.
  • [10] J. Zhang and S. Singh, “Laser–visual–inertial odometry and mapping with high robustness and low drift,” Journal of field robotics, vol. 35, no. 8, pp. 1242–1264, 2018.
  • [11] J. Graeter, A. Wilczynski, and M. Lauer, “Limo: Lidar-monocular visual odometry,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2018, pp. 7872–7879.
  • [12] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, “Lic-fusion: Lidar-inertial-camera odometry,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 5848–5854.
  • [13] C. Bai, R. Fu, and X. Gao, “Colmap-pcd: An open-source tool for fine image-to-point cloud registration,” arXiv preprint arXiv:2310.05504, 2023.
  • [14] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” in European Conference on Computer Vision (ECCV), 2016.
  • [15] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [16] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
  • [17] J. Lin and F. Zhang, “R3live++: A robust, real-time, radiance reconstruction package with a tightly-coupled lidar-inertial-visual state estimator,” arXiv preprint arXiv:2209.03666, 2022.
  • [18] Z. Liu, X. Liu, and F. Zhang, “Efficient and consistent bundle adjustment on lidar point clouds,” IEEE Transactions on Robotics, vol. 39, no. 6, pp. 4366–4386, 2023.
  • [19] Z. Liu and F. Zhang, “Balm: Bundle adjustment for lidar mapping,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3184–3191, 2021. [Online]. Available: https://dx.doi.org/10.1109/lra.2021.3062815
  • [20] D. Marr and E. Hildreth, “Theory of edge detection,” Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 207, no. 1167, pp. 187–217, 1980.
  • [21] R. Brunelli, Template matching techniques in computer vision: theory and practice.   John Wiley & Sons, 2009.
  • [22] F. Kong, X. Liu, B. Tang, J. Lin, Y. Ren, Y. Cai, F. Zhu, N. Chen, and F. Zhang, “Marsim: A light-weight point-realistic simulator for lidar-based uavs,” IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2954–2961, 2023.
  • [23] A. M. Andrew, “Multiple view geometry in computer vision,” Kybernetes, vol. 30, no. 9/10, pp. 1333–1341, 2001.
  • [24] H. Li, Y. Zou, N. Chen, J. Lin, X. Liu, W. Xu, C. Zheng, R. Li, D. He, F. Kong et al., “Mars-lvig dataset: A multi-sensor aerial robots slam dataset for lidar-visual-inertial-gnss fusion,” The International Journal of Robotics Research, p. 02783649241227968, 2024.

Supplementary material: LVBA_supplementary.pdf