
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding

Tianci Wen1,2  Zhiang Liu1,2  Yongchun Fang1,2,*
1 IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
2 IITRS, Shenzhen Research Institute, Nankai University
Abstract

3D Gaussian splatting (3D-GS) has recently revolutionized novel view synthesis in the simultaneous localization and mapping (SLAM) problem. However, most existing algorithms fail to fully capture the underlying structure, resulting in structural inconsistency. Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian splatting SLAM system that achieves high-quality photorealistic mapping. Our main contributions are two-fold. First, we propose a structure-enhanced photorealistic mapping (SEPM) framework that, for the first time, leverages a highly structured point cloud to initialize structured 3D Gaussians, leading to significant improvements in rendering quality. Second, we propose Appearance-from-Motion embedding (AfME), enabling 3D Gaussians to better model image appearance variations across different camera poses. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that SEGS-SLAM significantly outperforms state-of-the-art (SOTA) methods in photorealistic mapping quality, e.g., an improvement of 19.86\% in PSNR over MonoGS on the TUM RGB-D dataset for monocular cameras. The project page is available at https://segs-slam.github.io/.

Figure 1: Our method SEGS-SLAM outperforms SOTA methods (GS-ICP SLAM [9], Photo-SLAM [12], SplaTAM [14], MonoGS [24]) in photorealistic mapping quality across monocular, stereo, and RGB-D cameras. The scenes are from the TUM RGB-D dataset [32] (the top row and the left three images in the bottom row) and the EuRoC MAV dataset [2] (the right three images in the bottom row). Non-obvious differences in quality are highlighted by arrows/insets.
* Corresponding author.

1 Introduction

Visual simultaneous localization and mapping (SLAM) is a fundamental problem in 3D computer vision, with wide applications in autonomous driving, robotics, virtual reality, and augmented reality. SLAM aims to construct dense or sparse maps to represent the scene. Recently, neural radiance fields (NeRF) [26] has been integrated into SLAM pipelines, significantly enhancing scene representation capabilities. The latest advancement in radiance field rendering is 3D Gaussian splatting (3D-GS) [15], an explicit scene representation that achieves revolutionary improvements in rendering and training speed. Recent SLAM works [12, 24, 14, 40, 27, 11, 9, 17, 39] incorporating 3D-GS have demonstrated that explicit representations provide more promising rendering performance when compared with implicit ones.

However, most SLAM algorithms based on 3D-GS have neglected the latent structure in the scene, which constrains their rendering quality. While some methods [14, 40, 27, 9, 11, 24, 43, 17, 39] have explored improvements in accuracy, efficiency, and semantics, insufficient structural exploitation remains a critical issue. For instance, as evidenced by the 2nd row in Fig. 1, MonoGS [24] produces a highly disorganized reconstruction of the ladder structure due to this limitation. In contrast, only a few methods, such as Photo-SLAM [12], leverage scene structure. Photo-SLAM [12] initializes 3D Gaussians with the point cloud obtained from indirect visual SLAM and incorporates a geometry-based densification module. In the original 3D-GS, 3D Gaussians are initialized from COLMAP [30] points. Since indirect visual SLAM and COLMAP [30] share similar pipeline structures, the generated point clouds exhibit similar intrinsic properties. Hence, the 3D Gaussians of Photo-SLAM [12] converge to a relatively good result with fewer iterations, yet the method still underutilizes the underlying scene structure. The blurry reconstruction of the mouse edge by Photo-SLAM [12] is still apparent, as shown in the 2nd row of Fig. 1.

Another partially unresolved challenge in these methods [14, 40, 27, 9, 11, 24, 12, 43, 17, 39] is the significant appearance variations within the scene (e.g., exposure, lighting). To address this issue, NeRF-W [23] refines the appearance embeddings (AE) in NRW [25] and introduces them into NeRF. However, AE has a notable limitation: its training involves (the left half of) each ground-truth image from the test set. In novel view synthesis tasks, the test set contains 12.5% of all views, whereas in SLAM tasks this rises to 80%, making it more challenging for AE to accurately predict the appearance of novel views. Additionally, these approaches [14, 40, 27, 9, 11, 24, 12, 21, 43, 17, 39] fail to capture high-frequency details (e.g., object edges, complex texture regions). FreGS [45] introduces frequency regularization to model local details, but its effectiveness is constrained by the use of a single-scale frequency spectrum.

To address the above limitations, this paper presents SEGS-SLAM, a novel 3D Gaussian splatting SLAM system. First, we investigate the benefits of leveraging scene structure to improve rendering accuracy. While the point cloud produced by ORB-SLAM3 [3] preserves strong latent structure, we observe that the anchor-based 3D Gaussians in [21] effectively exploit the underlying structure. Motivated by this, we propose a structure-enhanced photorealistic mapping (SEPM) framework, which initializes anchor points using the ORB-SLAM3 [3] point cloud, significantly enhancing the utilization of scene structure. Experimental results validate the effectiveness of this simple yet powerful strategy, and we hope this insight will inspire further research in this direction. Second, we propose Appearance-from-Motion embedding (AfME), which takes camera poses as input and eliminates the need for training on the left half of each ground-truth image in the test set. We further introduce a frequency pyramid regularization (FPR) technique to better capture high-frequency details in the scene. The main contributions of this work are as follows:

  1.

    To our knowledge, structure-enhanced photorealistic mapping (SEPM) is the first SLAM framework that initializes anchor points with the ORB-SLAM3 point cloud to strengthen the utilization of scene structure, leading to significant rendering improvements.

  2.

    We propose Appearance-from-Motion embedding (AfME), which models per-image appearance variations in a latent space derived from the camera pose.

  3.

    Extensive evaluations on various public datasets demonstrate that our method significantly surpasses state-of-the-art (SOTA) methods in photorealistic mapping quality across monocular, stereo, and RGB-D cameras, while maintaining competitive tracking accuracy.

2 Related Work

Visual SLAM. Traditional visual SLAM methods can be classified into two categories: indirect methods and direct methods. Indirect methods [16, 3], such as ORB-SLAM3 [3], rely on extracting and tracking features between consecutive frames to estimate poses and build sparse maps by minimizing a reprojection error. Direct methods [6, 7] estimate motion and structure by minimizing a photometric error and can build sparse or semi-dense maps. Recently, some methods [1, 33, 35, 36, 18] have integrated deep learning into visual SLAM systems; among them, the current SOTA method is DROID-SLAM [35]. More recently, Lipson et al. [18] combine optical flow prediction with a pose-solving layer to achieve camera tracking. Our approach builds on traditional indirect visual SLAM.

Implicit Representation based SLAM. iMAP [34] pioneers the use of neural implicit representations to achieve tracking and mapping through reconstruction error. Subsequently, many works [50, 41, 4, 13, 37, 49, 22, 42, 51] have explored new representation forms, including voxel-based neural implicit surface representation [41], multi-scale tri-planes [13] and point-based neural implicit representation [28]. Recently, SNI-SLAM [49] and IBD-SLAM [42] introduce a hierarchical semantic representation and an xyz-map representation, respectively. Some works [48, 19, 5, 10, 29, 47, 44] address other challenges, including loop closure [48, 19]. However, most efforts focus on scene geometry reconstruction, with Point-SLAM [28] extending to novel view synthesis. Moreover, the NeRF models used in these methods do not account for appearance variations.

3D Gaussian Splatting based SLAM. Recently, an explicit representation, 3D-GS [15], has been introduced into visual SLAM. Most methods enhance RGB-D SLAM in rendering quality [9, 14, 40, 43], efficiency [27, 9], robustness [11, 39], and semantics [17]. For example, GS-ICP SLAM [9] achieves photorealistic mapping by fusing 3D-GS with Generalized ICP. Few methods improve SLAM rendering accuracy and efficiency across monocular, stereo, and RGB-D cameras, with MonoGS [24] and Photo-SLAM [12] as exceptions. However, neither of these methods effectively leverages latent scene structures or enhances the modeling of scene details, which limits their rendering quality. To address this, our proposed SEGS-SLAM further reinforces the utilization of global scene structure.

Figure 2: Overview of our method. Our method supports monocular, stereo, and RGB-D cameras. The input image stream is processed by the localization and geometric mapping modules, generating point clouds and accurate poses. SEPM incrementally initializes anchor points (middle) based on the point cloud (top left), which preserves the underlying structure. The poses are then fed into AfME to model appearance variations in the scene. Additionally, we introduce FPR to improve the reconstruction of high-frequency details in the scene.

3 Preliminaries

In this section, we first introduce the structured 3D Gaussian splatting in [21]. Subsequently, we review the localization and mapping process of ORB-SLAM3 [3].

3.1 Structured 3D Gaussian Splatting

Structured 3D Gaussians are a hierarchical representation introduced in [21]. Anchor points are constructed by voxelizing the point cloud obtained from COLMAP [30]. An anchor point is the center $\mathbf{t}_{v}$ of a voxel, equipped with a context feature $\hat{f}_{v}\in\mathbb{R}^{32}$, a scale factor $l_{v}\in\mathbb{R}^{3}$, and a set of $k$ learnable offsets $\mathbf{O}_{v}=\{\mathcal{O}_{0},\ldots,\mathcal{O}_{k-1}\}\in\mathbb{R}^{k\times 3}$. For each anchor point $\mathbf{t}_{v}$, $k$ 3D Gaussians are generated, whose positions are calculated as:

$\{\mu_{0},\ldots,\mu_{k-1}\} = \mathbf{t}_{v} + \{\mathcal{O}_{0},\ldots,\mathcal{O}_{k-1}\}\cdot l_{v}.$ (1)

Other parameters of the $k$ 3D Gaussians are decoded using individual MLPs, denoted as $M_{\alpha}$, $M_{C}$, $M_{q}$, and $M_{s}$, respectively. The colors of the Gaussians are obtained as follows:

$\{c_{0},\dots,c_{k-1}\} = M_{C}(\hat{f}_{v},\ \delta_{vc},\ \vec{\mathbf{d}}_{vc}),$ (2)

where $\delta_{vc}=\|\mathbf{t}_{v}-\mathbf{t}_{c}\|_{2}$ is the relative distance between the camera position $\mathbf{t}_{c}$ and an anchor point, and $\vec{\mathbf{d}}_{vc}=(\mathbf{t}_{v}-\mathbf{t}_{c})/\delta_{vc}$ is their viewing direction. The opacities $\{\alpha_{i}\}$, quaternions $\{q_{i}\}$, and scales $\{s_{i}\}$ are obtained similarly. After obtaining the parameters of each 3D Gaussian $G(x)$ within the view frustum, it is projected onto the image plane to form a 2D Gaussian $G^{\prime}_{i}(x^{\prime})$. Following 3D-GS [15], a tile-based rasterizer is used to sort the 2D Gaussians, and $\alpha$-blending is employed to complete the rendering:

$C(x^{\prime})=\sum_{i\in N}c_{i}\delta_{i}\prod_{j=1}^{i-1}(1-\delta_{j}),\quad \delta_{i}=\alpha_{i}G^{\prime}_{i}(x^{\prime}),$ (3)

where $x^{\prime}$ is the pixel position and $N$ is the ordered set of 2D Gaussians overlapping that pixel.
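
The following is a minimal PyTorch sketch of this pipeline (Eqs. 1-3): Gaussians are spawned from anchors via learned offsets, their colors are decoded from the anchor feature, distance, and view direction, and a single pixel is composed by front-to-back alpha-blending. The shapes, MLP widths, and the per-pixel blending function are illustrative assumptions rather than the tile-based rasterizer of [15, 21].

```python
# Sketch of anchor-based Gaussian generation (Eq. 1-2) and alpha-blending (Eq. 3).
import torch
import torch.nn as nn

M, k = 1024, 10                      # anchors, Gaussians per anchor (assumed)
t_v = torch.randn(M, 3)              # anchor centers
O_v = torch.randn(M, k, 3) * 0.1     # learnable offsets
l_v = torch.rand(M, 3)               # per-anchor scale factors
f_v = torch.randn(M, 32)             # context features
t_c = torch.zeros(3)                 # camera position

# Eq. (1): Gaussian means = anchor + offset * scale
mu = t_v[:, None, :] + O_v * l_v[:, None, :]            # (M, k, 3)

# Eq. (2): color decoder M_C conditioned on feature, distance, view direction
delta = (t_v - t_c).norm(dim=-1, keepdim=True)          # (M, 1)
d_vc = (t_v - t_c) / delta                              # (M, 3)
color_mlp = nn.Sequential(nn.Linear(32 + 1 + 3, 32), nn.ReLU(),
                          nn.Linear(32, k * 3), nn.Sigmoid())
colors = color_mlp(torch.cat([f_v, delta, d_vc], dim=-1)).view(M, k, 3)

# Eq. (3): front-to-back alpha blending for one pixel
def alpha_blend(c, delta_i):
    """c: (N,3) colors; delta_i: (N,) values alpha_i * G'_i(x'), depth-sorted."""
    T = torch.cumprod(torch.cat([torch.ones(1), 1 - delta_i[:-1]]), dim=0)
    return (c * (delta_i * T)[:, None]).sum(dim=0)

delta_i = torch.rand(M * k) * 0.5                       # stand-in blending weights
pixel_color = alpha_blend(colors.view(-1, 3), delta_i)
```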

3.2 Localization and Geometry Mapping

ORB-SLAM3 [3] can track camera poses and generate point clouds accurately. The camera pose is represented as $(\mathbf{R},\mathbf{t})$, where $\mathbf{R}\in\mathrm{SO}(3)$ denotes orientation and $\mathbf{t}\in\mathbb{R}^{3}$ represents position. The camera poses $(\mathbf{R},\mathbf{t})$ and the point cloud $\{\mathbf{P}_{0},\dots,\mathbf{P}_{\eta}\}\in\mathbb{R}^{\eta\times 3}$ of the scene are solved through local or global bundle adjustment (BA):

$\{\mathbf{P}_{m},\mathbf{R}_{l},\mathbf{t}_{l}\} = \mathop{\mathrm{argmin}}\limits_{\mathbf{P}_{m},\mathbf{R}_{l},\mathbf{t}_{l}} \sum_{\kappa\in\mathcal{K}_{L}\cup\mathcal{K}_{F}} \sum_{j\in\mathcal{X}_{k}} \rho(E(\kappa,j)),$ (4)
$E(\kappa,j) = \|\mathbf{p}_{j}-\pi(\mathbf{R}_{\kappa}\mathbf{P}_{j}+\mathbf{t}_{\kappa})\|^{2}_{\Sigma_{g}},$ (5)

where $m\in P_{L}$, $l\in\mathcal{K}_{L}$, $\mathcal{K}_{L}$ is a set of covisible keyframes, $P_{L}$ are the points seen in $\mathcal{K}_{L}$, $\mathcal{K}_{F}$ are the other keyframes, $\mathcal{X}_{k}$ is the set of matched points between a keyframe $\kappa$ and $P_{L}$, $\rho$ is the robust Huber cost function, $E(\kappa,j)$ is the reprojection error between the matched 3D point $\mathbf{P}_{j}$ and 2D feature point $\mathbf{p}_{j}$, $\pi$ is the projection function, and $\Sigma_{g}$ denotes the covariance matrix associated with the keypoint's scale. We provide more details in Sec. 11.2 of the supplementary material.
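
For concreteness, the sketch below evaluates the Huber-robustified reprojection cost of Eqs. 4-5 for one keyframe, assuming a simple pinhole projection; the intrinsics and variable names are illustrative and not taken from ORB-SLAM3's implementation.

```python
# Sketch of the reprojection residual E(kappa, j) with a Huber robust cost.
import torch

def project(P_cam, fx, fy, cx, cy):
    """Pinhole projection pi(.) of camera-frame points (N,3) to pixels (N,2)."""
    return torch.stack([fx * P_cam[:, 0] / P_cam[:, 2] + cx,
                        fy * P_cam[:, 1] / P_cam[:, 2] + cy], dim=-1)

def reprojection_cost(R, t, P_w, p_obs, sigma2, delta=1.345):
    """Sum of Huber-weighted squared reprojection errors for one keyframe."""
    P_cam = P_w @ R.T + t                                     # world -> camera
    r = p_obs - project(P_cam, 500.0, 500.0, 320.0, 240.0)    # residuals (N,2)
    e2 = (r ** 2).sum(dim=-1) / sigma2                        # scale-weighted squared error
    e = e2.sqrt()
    huber = torch.where(e <= delta, 0.5 * e2, delta * (e - 0.5 * delta))
    return huber.sum()
```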

4 SEGS-SLAM

In this section, we first give an overview of our system. We then present details of our two key innovations, SEPM and AfME. Finally, we introduce our training loss and FPR. An overview of SEGS-SLAM is shown in Fig. 2. First, the input image stream is processed through the tracking and geometric mapping process of Sec. 3.2 to obtain camera poses and point clouds. On one hand, SEPM voxelizes the point cloud to initialize anchors, enhancing the exploitation of the underlying scene structure. On the other hand, AfME encodes the camera poses to model appearance variations. Finally, the rendered images are supervised by ground-truth images, with the assistance of FPR, while the parameters of the 3D Gaussians and the weights of AfME are jointly optimized throughout training.

(a) Photo-SLAM
(b) Ours w/ SEPM only
(c) Ground Truth
Figure 3: Visualization of Photo-SLAM's 3D Gaussians (a) and of our method's anchor points with only SEPM (b) after 30K iterations. SEPM enhances the underlying structure of the 3D representation.

4.1 Structure-Enhanced Photorealistic Mapping

A common limitation of existing 3D-GS-based SLAM systems is the gradual degradation of the underlying structure of the Gaussians during optimization, which limits the quality of the rendered results. Our key insight is that preserving strong scene structure throughout the optimization process is crucial for achieving high-fidelity rendering. Guided by this insight, we make several observations. First, ORB-SLAM3 [3] generates a point cloud that preserves the scene structure well. Photo-SLAM [12] initializes 3D Gaussians with this point cloud, but as shown in Fig. 3 (a), relying solely on the latent structure of the point cloud is insufficient, as the Gaussians still undergo structural degradation during optimization. We further observe that the hierarchical structure in [21] leverages scene structure, using anchor points to manage 3D Gaussians. These anchor points remain fixed during the optimization process, ensuring that the 3D Gaussians retain the underlying structure of the scene.

Based on this observation, we propose incrementally voxelizing the point cloud $P_{k}$ of each keyframe to construct anchor points, as follows:

$V_{k}=\{\lfloor \tfrac{P_{k}}{\epsilon} \rceil\}\cdot\epsilon,$ (6)

where $V_{k}\in\mathbb{R}^{N\times 3}$ denotes the voxel centers, $\epsilon$ is the voxel size, and $\{\cdot\}$ is the operation that removes redundant points.

Fig. 2 visualizes this process. The ORB-SLAM3 point cloud (top left) is voxelized into the anchor points (middle), which preserve the underlying structure effectively. In this way, the structural prior from ORB-SLAM3's point cloud and the anchor-based organization are seamlessly fused, enhancing the exploitation of the underlying structure. Specifically, the anchor points inherit the strong structural properties of the point cloud and, due to their fixed positions, consistently preserve the latent scene structure throughout training. As shown in Fig. 3 (b), this effectively prevents structural degradation of the Gaussians over training, a key issue in prior methods. This strategy yields substantial improvements in rendering accuracy, as demonstrated in our ablation studies. Although conceptually simple, our approach proves highly effective. After anchor point construction, 3D Gaussians are generated according to Eq. 1 and Eq. 8, and rendered via Eq. 3.
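
A minimal sketch of the incremental voxelization in Eq. 6 is given below; the set-based bookkeeping used to skip voxels that already host an anchor is an assumption about how the redundancy-removal operator $\{\cdot\}$ could be realized.

```python
# Sketch of Eq. (6): incrementally voxelize a keyframe point cloud into new anchors.
import torch

def update_anchors(P_k, occupied, eps=0.001):
    """P_k: (N,3) keyframe points; occupied: set of voxel index tuples already used."""
    idx = torch.unique(torch.round(P_k / eps).long(), dim=0)   # integer voxel indices
    new = [i for i in idx if tuple(i.tolist()) not in occupied]
    occupied.update(tuple(i.tolist()) for i in new)
    new_anchors = torch.stack(new).float() * eps if new else torch.empty(0, 3)
    return new_anchors, occupied                               # voxel centers of new anchors

# Usage: anchors accumulate as keyframes arrive; existing voxels are never duplicated.
occ = set()
a1, occ = update_anchors(torch.randn(5000, 3), occ)
a2, occ = update_anchors(torch.randn(5000, 3), occ)
```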

4.2 Appearance-from-Motion Embedding

(a) Appearance embedding in NeRF-W [23].
(b) Our Appearance-from-Motion embedding.
Figure 4: AE [23] and the proposed AfME. The differences between them are: (1) AE uses image indices as input, whereas AfME leverages camera poses. (2) AE adopts learnable vectors (LV), while AfME utilizes a tiny MLP. (3) Most critically, our AfME requires no additional training on novel views.

To further enhance the rendering quality of SEGS-SLAM, we observe a drop in photorealistic mapping quality under appearance changes. AE is a proven solution [23, 21] for handling such variations. However, the limitation of the AE in [23] is that the left half of each ground-truth image in the test set is involved in training, as shown in the middle of Fig. 4 (a). To address this issue, we propose Appearance-from-Motion embedding (AfME), which employs a lightweight multilayer perceptron (MLP) $M_{\theta_{a}}$ to learn a shared appearance representation. As illustrated in Fig. 4 (b), the input of the encoder $M_{\theta_{a}}$ is the camera pose $(\mathbf{R},\ \mathbf{t})$. The MLP encodes the pose and outputs an embedding vector $\bm{\ell}^{(a)}_{\mathbf{R},\mathbf{t}}$, as shown below:

$\bm{\ell}^{(a)}_{\mathbf{R},\mathbf{t}} = M_{\theta_{a}}(\mathbf{R},\ \mathbf{t}).$ (7)

Subsequently, the embedding vector $\bm{\ell}^{(a)}_{\mathbf{R},\mathbf{t}}$ is fed into the color decoder $M_{C}$. After introducing AfME, the color prediction of the 3D Gaussians changes from Eq. 2 to:

$\{c_{0},\dots,c_{k-1}\} = M_{C}(\hat{f}_{v},\ \delta_{vc},\ \vec{\mathbf{d}}_{vc},\ \bm{\ell}^{(a)}_{\mathbf{R},\mathbf{t}}).$ (8)

More details are in Sec. 11.4 of the supplementary material. We choose camera poses as inputs for several reasons: 1) like image indices, camera poses are unique for each view; 2) camera poses naturally encode spatial information, enabling AfME to predict appearance from spatial context; 3) camera poses are more continuous than image indices.

We adopt AfME to encode the scene appearance into the continuous pose space. Through training on the training set, AfME learns the mapping between appearance and camera pose, enabling it to predict the appearance of novel views. We conduct an experiment to demonstrate this. After training, we fix $\hat{f}_{v},\ \delta_{vc},\ \vec{\mathbf{d}}_{vc}$ in the input of Eq. 8 and vary only the pose $\{\mathbf{R},\ \mathbf{t}\}$ fed into AfME. As shown in Fig. 5, the lighting conditions under the same view change consistently with the pose $\{\mathbf{R},\ \mathbf{t}\}$, matching the illumination of the corresponding viewpoints. This demonstrates that illumination conditions can be embedded by AfME effectively.
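
The following sketch illustrates AfME as described in Eqs. 7-8: a tiny MLP maps the camera pose to an embedding that is appended to the color decoder's input. The layer sizes and the embedding dimension are illustrative assumptions, not the system's exact configuration.

```python
# Sketch of AfME: pose -> appearance embedding -> color decoder input.
import torch
import torch.nn as nn

class AfME(nn.Module):
    def __init__(self, embed_dim=16):
        super().__init__()
        # input: 9 (flattened rotation) + 3 (translation) = 12
        self.encoder = nn.Sequential(nn.Linear(12, 32), nn.ReLU(),
                                     nn.Linear(32, embed_dim))

    def forward(self, R, t):
        return self.encoder(torch.cat([R.reshape(-1), t], dim=-1))

embed_dim, k = 16, 10
# Color decoder M_C now also consumes the appearance embedding (Eq. 8).
color_mlp = nn.Sequential(nn.Linear(32 + 1 + 3 + embed_dim, 32), nn.ReLU(),
                          nn.Linear(32, k * 3), nn.Sigmoid())

afme = AfME(embed_dim)
l_a = afme(torch.eye(3), torch.zeros(3))                 # embedding for this view
f_v, delta, d_vc = torch.randn(32), torch.ones(1), torch.tensor([0.0, 0.0, 1.0])
colors = color_mlp(torch.cat([f_v, delta, d_vc, l_a])).view(k, 3)
```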

(a) Same-view renderings with different appearance embedded by AfME.
(b) Renderings from different views correspond to different appearances.
Figure 5: Visualization of AfME controlling appearance. The rendering viewpoints of the three top-row images are the same; only the input to AfME has been changed. The input poses to AfME in the top-row images correspond to those of the bottom-row images. The results show that only the color and illumination change, while the geometry stays fixed.

4.3 Frequency Pyramid Regularization

A minor improvement is the frequency pyramid regularization (FPR), which leverages a multi-scale frequency representation to enhance the reconstruction of high-frequency details in the scene. To achieve this, we apply bilinear interpolation to downsample both the rendered images $I_{r}$ and the ground-truth images $I_{g}$. Let $s\in\mathcal{S}=\{s_{0},s_{1},\ldots,s_{n}\}$ denote the scale of an image. We apply a 2D fast Fourier transform (FFT) to obtain the frequency spectra $\mathcal{F}(I_{r}^{s})(u,v)$ and $\mathcal{F}(I_{g}^{s})(u,v)$. The loss $\mathcal{L}_{hf}$ is computed as

$\mathcal{L}_{hf}=\sum_{s\in\mathcal{S}}\frac{1}{\mathcal{N}}\lambda_{s}\sum_{u,v}\left|F_{hf,r}^{s}(u,v)-F_{hf,g}^{s}(u,v)\right|,$ (9)
$F_{hf,i}^{s}(u,v)=H_{hf}(u,v)\cdot\mathcal{F}(I_{i}^{s})(u,v),\ i\in\{r,g\},$ (10)

where $F_{hf,r}^{s}(u,v)$ and $F_{hf,g}^{s}(u,v)$ are the high-frequency components extracted by a high-pass filter $H_{hf}(u,v)$, $\mathcal{N}=HW$ denotes the image size, and $\lambda_{s}$ represents the weight of each scale. More details are in Sec. 8 of the supplementary material.
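
A possible realization of the FPR loss of Eqs. 9-10 is sketched below: the rendered and ground-truth images are downsampled, transformed with a 2D FFT, masked by an ideal high-pass filter, and compared with an L1 penalty per scale. The filter radius and the scale weights $\lambda_{s}$ are assumptions for illustration.

```python
# Sketch of the multi-scale high-frequency loss L_hf.
import torch
import torch.nn.functional as F

def high_pass_mask(h, w, radius=0.1):
    """1 outside a centered low-frequency disc, 0 inside (for fftshifted spectra)."""
    fy = torch.fft.fftshift(torch.fft.fftfreq(h))
    fx = torch.fft.fftshift(torch.fft.fftfreq(w))
    yy, xx = torch.meshgrid(fy, fx, indexing="ij")
    return ((yy ** 2 + xx ** 2).sqrt() > radius).float()

def fpr_loss(I_r, I_g, scales=(1.0, 0.5, 0.25), weights=(1.0, 0.5, 0.25)):
    """I_r, I_g: (B, C, H, W) rendered and ground-truth images."""
    loss = 0.0
    for s, lam in zip(scales, weights):
        r = I_r if s == 1.0 else F.interpolate(I_r, scale_factor=s, mode="bilinear", align_corners=False)
        g = I_g if s == 1.0 else F.interpolate(I_g, scale_factor=s, mode="bilinear", align_corners=False)
        Fr = torch.fft.fftshift(torch.fft.fft2(r), dim=(-2, -1))
        Fg = torch.fft.fftshift(torch.fft.fft2(g), dim=(-2, -1))
        mask = high_pass_mask(r.shape[-2], r.shape[-1])
        loss = loss + lam * (mask * (Fr - Fg)).abs().mean()   # L1 on masked spectra
    return loss
```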

4.4 Losses Design

The optimization of the learnable parameters and the MLPs $M_{\alpha}$, $M_{C}$, $M_{q}$, $M_{s}$, and $M_{\theta_{a}}$ is achieved by minimizing the L1 loss $\mathcal{L}_{1}$, the SSIM term [38] $\mathcal{L}_{\text{SSIM}}$, the frequency regularization $\mathcal{L}_{hf}$, and the volume regularization [20] $\mathcal{L}_{\text{vol}}$ between the rendered images and the ground-truth images, denoted as

$\mathcal{L}=(1-\lambda)\mathcal{L}_{1}+\lambda(1-\mathcal{L}_{\text{SSIM}})+\lambda_{\text{vol}}\mathcal{L}_{\text{vol}}+\lambda_{hf}\mathcal{L}_{hf}.$ (11)

Following [21], we also incorporate $\mathcal{L}_{\text{vol}}$.
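
A compact sketch of the total loss in Eq. 11 follows; the SSIM function and the Gaussian scales consumed by the volume regularizer are assumed to be provided externally, and the exact form of the volume term is an assumption following [20].

```python
# Sketch of the total training objective (Eq. 11).
import torch

def total_loss(I_r, I_g, scales, ssim, fpr_loss,
               lam=0.2, lam_vol=0.01, lam_hf=0.01):
    l1 = (I_r - I_g).abs().mean()                 # L1 photometric term
    l_ssim = 1.0 - ssim(I_r, I_g)                 # SSIM term (external function)
    l_vol = torch.prod(scales, dim=-1).mean()     # volume regularization on 3D scales
    l_hf = fpr_loss(I_r, I_g)                     # frequency pyramid term (Sec. 4.3)
    return (1 - lam) * l1 + lam * l_ssim + lam_vol * l_vol + lam_hf * l_hf
```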

Datasets (Camera) Replica (RGB-D) TUM RGB-D (RGB-D)
Method PSNR \uparrow SSIM \uparrow LPIPS \downarrow PSNR \uparrow SSIM \uparrow LPIPS \downarrow
MonoGS [24] 36.81 0.964 0.069 24.11 0.800 0.231
Photo-SLAM [12] 35.50 0.949 0.056 20.99 0.736 0.213
Photo-SLAM-30K 36.94 0.952 0.040 21.73 0.757 0.186
RTG-SLAM [27] 32.79 0.918 0.124 16.47 0.574 0.461
GS-SLAM [40] 34.27 0.975 0.082 - - -
SplaTAM [14] 33.85 0.936 0.099 21.41 0.764 0.265
SGS-SLAM [17] 33.96 0.969 0.099 - - -
GS-ICP SLAM [9] 37.14 0.968 0.045 17.81 0.642 0.361
Ours 39.42 0.975 0.021 26.03 0.843 0.107
Table 1: Quantitative evaluation of our method compared to SOTA methods for RGB-D cameras on the Replica and TUM RGB-D datasets. Best results are marked as best score, second best score, and third best score. The result of GS-SLAM is taken from [40]; all other results are obtained in our experiments. '-' denotes that the system does not provide valid results.
(a) GS-ICP SLAM [9]
(b) SplaTAM [14]
(c) RTG-SLAM [27]
(d) Ours
(e) Ground Truth
Figure 6: Comparison of our method to SOTA methods for RGB-D cameras. The top scene is office2 from the Replica dataset, and the bottom is fr3/office from the TUM RGB-D dataset. Non-obvious differences in quality are highlighted by insets.
Datasets (Camera) Replica (Mono) TUM RGB-D (Mono) EuRoC (Stereo)
Method PSNR \uparrow SSIM \uparrow LPIPS \downarrow PSNR \uparrow SSIM \uparrow LPIPS \downarrow PSNR \uparrow SSIM \uparrow LPIPS \downarrow
MonoGS [24] 28.34 0.878 0.256 21.00 0.705 0.393 22.60 0.789 0.274
Photo-SLAM [12] 33.60 0.934 0.077 20.17 0.708 0.224 11.90 0.409 0.439
Photo-SLAM-30K 36.08 0.947 0.054 21.06 0.733 0.186 11.77 0.405 0.430
Ours 37.96 0.964 0.037 25.17 0.825 0.122 23.64 0.791 0.182
Table 2: Quantitative evaluation of our method compared to SOTA methods for monocular (Mono) and stereo cameras on the Replica, TUM RGB-D, and EuRoC MAV datasets. Best results are marked as best score and second best score.

5 Experiment

5.1 Experiment Setup

Implementation. Our SEGS-SLAM is fully implemented using the LibTorch framework with C++ and CUDA. The training and rendering of 3D Gaussians involve three key modules, SEPM, AfME, and FPR, operating as a parallel thread alongside the localization and geometric mapping process. SEGS-SLAM trains the 3D Gaussians using only keyframe images, point clouds, and poses, where keyframes are selected based on co-visibility. In each iteration, SEGS-SLAM randomly samples a viewpoint from the current set of keyframes. We use the images and poses of keyframes as the training set, while the remaining images and poses serve as the test set. Moreover, following FreGS [45], we activate FPR once the structure of the anchor points stabilizes and terminate it when anchor point densification is complete. The scale level of FPR is set to 3. Except for the non-open-source GS-SLAM [40], all methods compared in this paper are run on the same machine using their official code. The machine is equipped with an NVIDIA RTX 4090 GPU and a Ryzen 5995WX CPU. By default, our method runs for 30K iterations. The voxel size $\epsilon$ is $0.001\,\text{m}$. For Eq. 11, we set $\lambda=0.2$, $\lambda_{\text{vol}}=0.01$, and $\lambda_{hf}=0.01$.

Baselines. We first list the baseline methods used to evaluate photorealistic mapping. For monocular and stereo cameras, we compare our method with Photo-SLAM [12], MonoGS [24], and Photo-SLAM-30K. For RGB-D cameras, we additionally include comparisons with RTG-SLAM [27], GS-SLAM [40], SplaTAM [14], SGS-SLAM [17], and GS-ICP SLAM [9], all of which represent SOTA SLAM methods based on 3D-GS. To ensure fairness, we set the maximum iteration limit to 30K for all methods, following the original 3D-GS [15]. Photo-SLAM-30K refers to Photo-SLAM [12] with a fixed iteration count of 30K. For camera pose estimation, we also compare ours with ORB-SLAM3 [3] and DROID-SLAM [35].

Metrics. We follow the evaluation protocol of MonoGS [24] to assess both camera pose estimation and novel view synthesis. For camera pose estimation, we report the root mean square error (RMSE) of the absolute trajectory error (ATE) [8] over all frames. For photorealistic mapping, we report standard rendering quality metrics, including PSNR, SSIM, and LPIPS [46]. To evaluate photorealistic mapping quality, we compute the average metrics over novel views only, for all methods, and report the average across five runs. To ensure fairness, no training views are included in the evaluation, and for all RGB-D SLAM methods, no masks are applied to either the rendered or ground-truth images during metric calculation. As a result, the reported metrics for Photo-SLAM [12] are slightly lower than those in the original paper, which averages over both novel and training views. Similarly, the metrics of SplaTAM [14], SGS-SLAM [17], and GS-ICP SLAM [9] are slightly lower than reported, as the original methods use a mask to exclude outliers from both the rendered and ground-truth images based on anomalies in the depth image.
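
For reference, once the estimated trajectory has been aligned to the ground truth (alignment is handled by evo [8] in practice), the reported ATE RMSE reduces to the following computation; this is a generic sketch, not the evaluation script used in the experiments.

```python
# Sketch of the ATE RMSE metric over aligned camera positions.
import torch

def ate_rmse(est_t, gt_t):
    """est_t, gt_t: (N,3) aligned camera positions; returns RMSE in the same unit."""
    return ((est_t - gt_t).norm(dim=-1) ** 2).mean().sqrt()
```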

Datasets. Following [12, 40, 9, 14, 24, 11, 27, 28, 34, 17], we evaluate all methods on all sequences of the Replica dataset [31] for monocular and RGB-D cameras. Following [12, 40, 9, 24, 27, 34, 50], we use the fr1/desk, fr2/xyz, and fr3/office sequences of the TUM RGB-D dataset [32] for monocular and RGB-D cameras. Following [12], we use the MH01, MH02, V101, and V201 sequences of the EuRoC MAV dataset [2] for stereo cameras.

Camera Type RGB-D Monocular Stereo
Datasets Replica TUM R Avg. Replica TUM R Avg. EuRoC
Method RMSE \downarrow RMSE \downarrow RMSE \downarrow RMSE \downarrow RMSE \downarrow RMSE \downarrow RMSE \downarrow
ORB-SLAM3 [3] 1.780 2.196 1.988 51.744 46.004 48.874 10.907
DROID-SLAM [35] 74.264 74.216 74.24 76.600 1.689 39.145 1.926
MonoGS [24] 0.565 1.502 1.033 37.054 4.009 63.437 49.241
Photo-SLAM [12] 0.582 1.870 1.226 0.930 1.539 1.235 11.023
RTG-SLAM [27] 0.191 0.985 0.581 - - - -
GS-SLAM [40] 0.500 3.700 2.100 - - - -
SplaTAM [14] 0.343 4.215 2.279 - - - -
SGS-SLAM [17] 0.365 - - - - - -
GS-ICP SLAM [9] 0.177 2.921 1.549 - - - -
Ours 0.430 1.528 0.979 0.833 1.505 1.169 7.462
Table 3: Camera tracking results on the Replica, TUM RGB-D (TUM R), and EuRoC MAV datasets for monocular, stereo, and RGB-D cameras. RMSE of ATE (cm) is reported. The best results are marked as best score, second best score, and third best score. '-' denotes that the system does not provide valid results.
(a) MonoGS [24]
(b) Photo-SLAM [12]
(c) Photo-SLAM-30k
(d) Ours
(e) Ground Truth
Figure 7: Comparison of our method to SOTA methods for monocular and stereo cameras. The top scene is room1 from the Replica dataset, and the bottom is V201 from the EuRoC MAV dataset. Non-obvious differences in quality are highlighted by insets.

5.2 Results Analysis

Camera Tracking Accuracy. As shown in Tab. 3, our method demonstrates competitive accuracy in tracking for monocular, stereo, and RGB-D cameras when compared with SOTA methods. This highlights the advantage of indirect visual SLAM in terms of localization accuracy.

Novel View Synthesis. The quantitative rendering results for novel views in RGB-D scenarios are shown in Tab. 1, where SEGS-SLAM significantly outperforms the comparison methods, achieving the highest average rendering quality on both the TUM RGB-D and Replica datasets. At the top of Fig. 6, it is evident that on the Replica dataset, only our method accurately recovers the contours of edge regions. The TUM RGB-D dataset presents a greater challenge than Replica, with highly cluttered scene structures and substantial lighting variations. GS-ICP SLAM [9], a leading RGB-D SLAM method based on 3D-GS, achieves the second-highest rendering accuracy on the Replica dataset. However, it relies heavily on depth images. As shown at the bottom of Fig. 6, GS-ICP SLAM [9] and the other methods perform poorly on the TUM RGB-D dataset. Our SEGS-SLAM better reconstructs scene structure and lighting variations, benefiting from the proposed SEPM and AfME.

Tab. 2 presents quantitative rendering results for monocular scenarios, where SEGS-SLAM surpasses the other methods. Notably, SEGS-SLAM continues to significantly outperform the comparison methods on the TUM RGB-D dataset. Importantly, compared with the RGB-D scenario, MonoGS [24] experiences a sharp decline. The top of Fig. 7 further demonstrates that, on the Replica dataset, our method models high-frequency details more realistically in regions such as the edge of the wall.

Moreover, our method remains effective in stereo scenarios. The corresponding quantitative results for photorealistic mapping are reported in Tab. 2, where our method achieves the highest rendering quality, surpassing the current SOTA method, MonoGS [24]. As shown at the bottom of Fig. 7, our approach better reconstructs the global structure and local details of the scene. We highlight that a key factor enabling our method to achieve superior performance across different camera types and datasets is the proposed SEPM; its enhancement of rendering quality is highly generalizable, as further demonstrated in the ablation studies.

5.3 Ablation Studies

Structure-enhanced photorealistic mapping. To evaluate the impact of SEPM on photorealistic mapping metrics, we additionally train two variants of our method: one without SEPM, AfME, and FPR, and another without AfME and FPR. The variant without SEPM, AfME, and FPR directly uses the original 3D-GS [15]. As shown in rows (1) and (2) of Tab. 4, introducing SEPM consistently yields significant improvements in rendering quality across all scenes. This validates the effectiveness of enhancing the exploitation of latent structure. Moreover, it demonstrates that SEPM generalizes strongly across diverse camera inputs and datasets, which we hope will inspire future research. Notably, with only SEPM, our rendering metrics already surpass existing SOTA methods. To further illustrate the benefits of SEPM, Fig. 8 presents visual comparisons, where a more complete and accurate scene structure leads to superior rendering quality.

Ours w/o FPR, AfME, SEPM
Ours w/o FPR, AfME
Ground Truth
Figure 8: Ablation of SEPM. The figure presents qualitative results on the TUM, Replica, and EuRoC datasets. It is evident that, across all three datasets, incorporating SEPM enables the reconstruction of more complete and accurate structure.

Appearance-from-Motion embedding. To evaluate the impact of the proposed AfME, we train an additional model without AfME. Comparing rows (4) and (5) of Tab. 4, our full method (5) outperforms the model without AfME (4) in terms of PSNR. As shown at the top of Fig. 9, AfME effectively predicts the lighting conditions of novel views. On the Replica dataset, the improvements from AfME are relatively modest: Replica is an easier dataset, on which PSNR already exceeds 37 without AfME, indicating that the scenes are well reconstructed. Although incorporating AfME still yields a PSNR gain there, the benefits are much more pronounced on the challenging TUM dataset, as clearly observed in both Fig. 9 and Tab. 4.

Frequency pyramid regularization. To evaluate the effect of the proposed FPR on photorealistic mapping metrics, we train an additional model without FPR. As shown in rows (3) and (5) of Tab. 4, our full method (5) surpasses the model without FPR (3) in terms of PSNR. Additionally, after applying FPR, the model renders finer details in highly textured regions, as demonstrated by the curtain at the bottom of Fig. 9.

Ours w/o AfME
Ours
Ground Truth
Ours w/o FPR
Ours
Ground Truth
Figure 9: Ablation of AfME (Top) and FPR (Bottom). It is evident that with the introduction of AfME, the lighting conditions at novel views are accurately predicted, indicating the effectiveness of our AfME. The result without AfME is darker than the ground truth.
Camera type RGB-D Mono Stereo
Datasets Replica TUM R Replica TUM R EuRoC
# Method PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow
(1) w/o FPR, AfME, SEPM 36.07 21.73 35.46 21.10 11.76
(2) w/o FPR, AfME 38.98 24.20 36.31 23.54 22.91
(3) w/o FPR 39.18 25.04 36.44 24.91 23.52
(4) w/o AfME 39.12 24.66 37.48 23.69 22.99
(5) Ours 39.42 26.03 37.96 25.17 23.64
Table 4: Ablation Study on the key components (1) - (5). The best results are marked as best score.

5.4 Limitations

One limitation of our method is that a poorly structured point cloud leads to a decline in photorealistic mapping quality. Additionally, while our method achieves real-time tracking and rendering at 17 FPS and 400 FPS, respectively, its rendering speed is reduced by the increased number of 3D Gaussians used to model high-frequency details. Currently, AfME can only handle static scenes; by encoding more complex inputs, it could tackle more sophisticated dynamic scenarios.

6 Conclusion

We propose a novel SLAM system with progressively refined 3D-GS, termed SEGS-SLAM. Experimental results show that our method surpasses SOTA methods in rendering quality across monocular, stereo, and RGB-D datasets. We demonstrate that by enhancing the utilization of the underlying scene structure, SEPM improves the visual quality of rendering. Furthermore, our proposed AfME and FPR effectively predict the appearance of novel views and refine the scene details, respectively.

Acknowledgements: This work is supported by the National Natural Science Foundation of China under Grant 62233011.

References

  • Bloesch et al. [2018] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J. Davison. Codeslam — learning a compact, optimisable representation for dense visual slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Burri et al. [2016] Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W Achtelik, and Roland Siegwart. The euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.
  • Campos et al. [2021] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
  • Chung et al. [2023] Chi-Ming Chung, Yang-Che Tseng, Ya-Ching Hsu, Xiang-Qian Shi, et al. Orbeez-slam: A real-time monocular visual slam with orb features and nerf-realized mapping. In 2023 IEEE International Conference on Robotics and Automation, pages 9400–9406, 2023.
  • Deng et al. [2024] Tianchen Deng, Guole Shen, Tong Qin, Jianyu Wang, Wentao Zhao, Jingchuan Wang, Danwei Wang, and Weidong Chen. Plgslam: Progressive neural scene represenation with local to global bundle adjustment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19657–19666, 2024.
  • Engel et al. [2014] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European Conference on Computer Vision, pages 834–849, 2014.
  • Engel et al. [2018] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2018.
  • Grupp [2017] Michael Grupp. evo: Python package for the evaluation of odometry and slam. https://github.com/MichaelGrupp/evo, 2017.
  • Ha et al. [2024] Seongbo Ha, Jiung Yeon, and Hyeonwoo Yu. Rgbd gs-icp slam. In European Conference on Computer Vision, pages 180–197, 2024.
  • Hu et al. [2023] Jiarui Hu, Mao Mao, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Cp-slam: Collaborative neural point-based slam system. In Advances in Neural Information Processing Systems, pages 39429–39442, 2023.
  • Hu et al. [2024] Jiarui Hu, Xianhao Chen, Boyin Feng, Guanglin Li, Liangjing Yang, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Cg-slam: Efficient dense rgb-d slam in a consistent uncertainty-aware 3d gaussian field. In European Conference on Computer Vision, pages 93–112, 2024.
  • Huang et al. [2024] Huajian Huang, Longwei Li, Hui Cheng, and Sai-Kit Yeung. Photo-slam: Real-time simultaneous localization and photorealistic mapping for monocular stereo and rgb-d cameras. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21584–21593, 2024.
  • Johari et al. [2023] Mohammad Mahdi Johari, Camilla Carta, and François Fleuret. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17408–17419, 2023.
  • Keetha et al. [2024] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21357–21366, 2024.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  • Klein and Murray [2007] Georg Klein and David Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234, 2007.
  • Li et al. [2025] Mingrui Li, Shuhong Liu, Heng Zhou, Guohao Zhu, Na Cheng, Tianchen Deng, and Hongyu Wang. Sgs-slam: Semantic gaussian splatting for neural dense slam. In European Conference on Computer Vision, pages 163–179, 2025.
  • Lipson and Deng [2024] Lahav Lipson and Jia Deng. Multi-session slam with differentiable wide-baseline pose optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19626–19635, 2024.
  • Liso et al. [2024] Lorenzo Liso, Erik Sandström, Vladimir Yugay, Luc Van Gool, and Martin R. Oswald. Loopy-slam: Dense neural slam with loop closures. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20363–20373, 2024.
  • Lombardi et al. [2021] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics, 40(4), 2021.
  • Lu et al. [2024] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024.
  • Mao et al. [2024] Yunxuan Mao, Xuan Yu, Zhuqing Zhang, Kai Wang, Yue Wang, Rong Xiong, and Yiyi Liao. Ngel-slam: Neural implicit representation-based global consistent low-latency slam system. In 2024 IEEE International Conference on Robotics and Automation, pages 6952–6958, 2024.
  • Martin-Brualla et al. [2021] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.
  • Matsuki et al. [2024] Hidenobu Matsuki, Riku Murai, Paul H.J. Kelly, and Andrew J. Davison. Gaussian splatting slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18039–18048, 2024.
  • Meshry et al. [2019] Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421, 2020.
  • Peng et al. [2024] Zhexi Peng, Tianjia Shao, Yong Liu, Jingke Zhou, Yin Yang, Jingdong Wang, and Kun Zhou. Rtg-slam: Real-time 3d reconstruction at scale using gaussian splatting. In ACM SIGGRAPH, 2024.
  • Sandström et al. [2023a] Erik Sandström, Yue Li, Luc Van Gool, and Martin R. Oswald. Point-slam: Dense neural point cloud-based slam. In IEEE/CVF International Conference on Computer Vision, pages 18433–18444, 2023a.
  • Sandström et al. [2023b] Erik Sandström, Kevin Ta, Luc Van Gool, and Martin R. Oswald. Uncle-slam: Uncertainty learning for dense neural slam. In IEEE/CVF International Conference on Computer Vision Workshops, pages 4537–4548, 2023b.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition, 2016.
  • Straub et al. [2019] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  • Sturm et al. [2012] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580, 2012.
  • Sucar et al. [2020] Edgar Sucar, Kentaro Wada, and Andrew Davison. Nodeslam: Neural object descriptors for multi-view shape reconstruction. In 2020 International Conference on 3D Vision, pages 949–958, 2020.
  • Sucar et al. [2021] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J. Davison. imap: Implicit mapping and positioning in real-time. In IEEE/CVF International Conference on Computer Vision, pages 6229–6238, 2021.
  • Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. In Advances in Neural Information Processing Systems, pages 16558–16569, 2021.
  • Teed et al. [2023] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. In Advances in Neural Information Processing Systems, pages 39033–39051, 2023.
  • Wang et al. [2023] Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13293–13302, 2023.
  • Wang et al. [2004] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Xu et al. [2024] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. Dg-slam: Robust dynamic gaussian splatting slam with hybrid pose optimization. In In Advances in Neural Information Processing Systems, 2024.
  • Yan et al. [2024] Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Gs-slam: Dense visual slam with 3d gaussian splatting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19595–19604, 2024.
  • Yang et al. [2022] Xingrui Yang, Hai Li, Hongjia Zhai, Yuhang Ming, Yuqian Liu, and Guofeng Zhang. Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation. In 2022 IEEE International Symposium on Mixed and Augmented Reality, pages 499–507, 2022.
  • Yin et al. [2024] Minghao Yin, Shangzhe Wu, and Kai Han. Ibd-slam: Learning image-based depth fusion for generalizable slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10563–10573, 2024.
  • Yugay et al. [2023] Vladimir Yugay, Yue Li, Theo Gevers, and Martin R Oswald. Gaussian-slam: Photo-realistic dense slam with gaussian splatting. arXiv preprint arXiv:2312.10070, 2023.
  • Zhai et al. [2024] Hongjia Zhai, Gan Huang, Qirui Hu, Guanglin Li, Hujun Bao, and Guofeng Zhang. Nis-slam: Neural implicit semantic rgb-d slam for 3d consistent scene understanding. IEEE Transactions on Visualization and Computer Graphics, pages 1–11, 2024.
  • Zhang et al. [2024a] Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric Xing. Fregs: 3d gaussian splatting with progressive frequency regularization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21424–21433, 2024a.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Zhang et al. [2024b] Wei Zhang, Tiecheng Sun, Sen Wang, Qing Cheng, and Norbert Haala. Hi-slam: Monocular real-time dense mapping with hybrid implicit fields. IEEE Robotics and Automation Letters, 9(2):1548–1555, 2024b.
  • Zhang et al. [2023] Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. Go-slam: Global optimization for consistent 3d instant reconstruction. In IEEE/CVF International Conference on Computer Vision, pages 3727–3737, 2023.
  • Zhu et al. [2024a] Siting Zhu, Guangming Wang, Hermann Blum, Jiuming Liu, Liang Song, Marc Pollefeys, and Hesheng Wang. Sni-slam: Semantic neural implicit slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21167–21177, 2024a.
  • Zhu et al. [2022] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. Nice-slam: Neural implicit scalable encoding for slam. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12786–12796, 2022.
  • Zhu et al. [2024b] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, and Marc Pollefeys. Nicer-slam: Neural implicit scene encoding for rgb slam. In International Conference on 3D Vision, pages 42–52, 2024b.
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding

Supplementary Material

7 Overview

The supplementary material is organized as follows: (1) Sec. 8 introduces more details of FPR. (2) Sec. 9 presents additional ablation studies. (3) Sec. 10 provides real-time performance for all methods. (4) Sec. 11 provides additional implementation details, including the detailed pipeline for localization and geometry mapping (Sec. 11.2), the MLP architecture used for structured 3D Gaussians (Sec. 11.3), the MLP structure for Appearance-from-Motion embedding (Sec. 11.4), and anchor point refinement (Sec. 8). (5) Sec. 12 presents quantitative results for each scene and includes more comparative renderings.

8 Details of FPR

In scenes with simple structures, our structured 3D Gaussians can effectively model both structure and appearance changes. However, we observe that structured 3D Gaussians perform poorly in rendering high-frequency details, such as object edges and areas with complex textures. Hence, we propose the frequency pyramid regularization (FPR) technique, which effectively leverages multi-scale frequency spectra. Here, we introduce the frequency pyramid to improve the consistency of rendering details for the same object across varying viewpoint distances. Unlike FreGS [45], we leverage only high-frequency information, as low-frequency components typically represent scene structure, which is already effectively captured by our structured 3D Gaussians as shown in Tab. 6.

(a) Ours w/o FPR
(b) Ours
(c) Ground Truth
Figure 10: Visualization of anchor points after 30K iterations. Increasing the number of Gaussians along edges improves the rendering quality. In the figure, the number of 3D Gaussians is $k=10$ times the number of anchor points.

The primary effect of FPR is to guide the densification of anchor points. Specifically, when the average gradient of all Gaussians within a voxel exceeds a threshold $\tau_{g}$, a new anchor point is added at the center of that voxel. Consequently, if a high-frequency region in the scene exhibits a substantial discrepancy between the rendered output and the ground truth, the frequency regularization term $\mathcal{L}_{hf}$ in the total loss increases, pushing the average gradient beyond $\tau_{g}$. More anchors are then added in that region, and scene edges become sharper, as shown in Fig. 10. Thus, the high-frequency details of the scene are refined by FPR.
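
A minimal sketch of this gradient-driven anchor growing is shown below; the voxel bookkeeping and the numeric threshold are illustrative assumptions rather than the exact implementation.

```python
# Sketch of anchor growing: add an anchor where the average Gaussian gradient
# inside a voxel exceeds tau_g and the voxel is not yet occupied.
import torch

def grow_anchors(gauss_pos, gauss_grad_norm, occupied, eps=0.001, tau_g=0.0002):
    """gauss_pos: (N,3) Gaussian means; gauss_grad_norm: (N,) accumulated gradient norms."""
    keys = torch.round(gauss_pos / eps).long()                   # voxel index per Gaussian
    uniq, inv = torch.unique(keys, dim=0, return_inverse=True)
    new_anchors = []
    for i, key in enumerate(uniq):
        avg_grad = gauss_grad_norm[inv == i].mean()
        k = tuple(key.tolist())
        if avg_grad > tau_g and k not in occupied:
            occupied.add(k)
            new_anchors.append(key.float() * eps)                # anchor at voxel center
    return torch.stack(new_anchors) if new_anchors else torch.empty(0, 3)
```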

9 Additional Ablation Studies

Scale level of FPR. We believe that multiple scales improve the consistency of rendered details under varying observation distances. The scale level of FPR is set to 3. Results are shown in Tab. 5.

Camera type RGB-D Mono Stereo
Datasets Replica TUM R Replica TUM R EuRoC
Scale level PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow
1 39.14 25.82 37.65 23.83 23.58
2 39.29 25.53 37.55 23.77 23.57
4 39.34 26.02 37.51 24.75 23.42
3 (Ours) 39.42 26.03 37.96 25.17 23.64
Table 5: Ablation Study on the scale level of FPR.

The low-frequency component in FPR. In our experiments, we find that the low-frequency component of FPR conflicts with the structured 3D Gaussians, resulting in degradation. The best result is obtained using only the high-frequency component in FPR, as shown in Tab. 6.

Camera type RGB-D Mono Stereo
Datasets Replica TUM R Replica TUM R EuRoC
Metric PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow
low freq. 38.85 25.76 36.79 24.79 23.37
low & high freq. 39.23 25.84 37.72 24.84 23.27
high freq. (Ours) 39.42 26.03 37.96 25.17 23.64
Table 6: Ablation study on the low frequency component of FPR.

Replacement of key components. We train two additional models: one replaces AfME with the appearance embeddings (AE) from Scaffold-GS [21], and the other replaces FPR with single-scale frequency regularization (SFR). Our AfME differs from AE in two primary ways: 1) AfME uses camera poses as input, whereas AE uses camera indices; 2) AfME employs an MLP, while AE utilizes an embedding layer. SFR refers to using only the original-scale image frequencies in Eq. 9; the distinction between FPR and SFR lies in FPR's use of multi-scale image frequencies. As shown in rows (1), (2), and (3) of Tab. 7, our full method (3) achieves the highest PSNR scores. This demonstrates that, compared with AE, our AfME is more effective at predicting appearance variations across a wide range of novel views while avoiding additional training on the test set. It also highlights that, by introducing the frequency pyramid, the model maintains consistency in scene details across varying viewpoint distances, leading to superior rendering quality.

Camera type RGB-D Mono Stereo
Datasets Replica TUM R Replica TUM R EuRoC
# Method PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow PSNR \uparrow
(1) replace AfME with AE 38.22 19.56 36.09 19.33 18.39
(2) replace FPR with SFR 39.14 25.82 37.65 23.83 23.58
(3) Ours 39.42 26.03 37.96 25.17 23.64
Table 7: Ablation Study on the replacement of key components.

10 Real-time performance

Our method, following Photo-SLAM, employs two parallel threads: localization & geometry mapping, and 3D-GS mapping. We note that only tracking and rendering are real-time. The runtimes of all methods are provided in Tab. 8.

Metric (RGB-D) MonoGS Photo-SLAM (-30K) RTG-SLAM SplaTAM SGS-SLAM GS-ICP SLAM Ours
Rendering FPS \uparrow 706 1562 (1439) 447 531 486 630 400
Tracking FPS \uparrow 1.33 30.30 (30.87) 17.24 0.15 0.14 30.32 17.18
Mapping Time \downarrow 37m40s 1m20s (6m32s) 12m03s 3h45m 4h05m 1m32s 11m14s
Table 8: Real-time performance.

11 Implementation details

11.1 System Overview

Our system comprises two main modules: localization and geometry mapping, and progressively refined 3D Gaussian splatting (3D-GS). In our implementation, these two modules run in separate threads. The localization and geometry mapping module focuses on camera pose estimation and scene point cloud mapping. The progressively refined 3D-GS module takes the estimated keyframe poses and point clouds from the localization and geometry mapping module and incrementally completes the photorealistic mapping of the scene.

11.2 Localization and Geometry Mapping

In our implementation, the localization and geometric mapping module consists of three main threads: tracking, local mapping, and loop closing, along with an on-demand thread for global bundle adjustment (BA). Specifically, the tracking thread performs a motion-only BA to optimize camera poses. The local mapping thread optimizes keyframe poses and map point clouds within a local sliding window via local BA. Lastly, the loop closing thread continuously checks for loop closures. If a loop is detected, a global BA is triggered to jointly optimize the camera poses of all keyframes and all points of the scene.

Motion-only BA. We optimize the camera orientation $\mathbf{R}\in\mathrm{SO}(3)$ and position $\mathbf{t}\in\mathbb{R}^{3}$ through motion-only BA. The camera poses $(\mathbf{R}_{\iota},\mathbf{t}_{\iota})$ are optimized by minimizing the reprojection error between the matched 3D points $\mathbf{P}_{\iota}\in\mathbb{R}^{3}$ and 2D feature points $\mathbf{p}_{\iota}$ within a sliding window:

$\{\mathbf{R}_{\iota},\mathbf{t}_{\iota}\} = \mathop{\mathrm{argmin}}\limits_{\mathbf{R}_{\iota},\mathbf{t}_{\iota}} \sum_{\iota\in\mathcal{X}} \rho\left(\|\mathbf{p}_{\iota}-\pi(\mathbf{R}_{\iota}\mathbf{P}_{\iota}+\mathbf{t}_{\iota})\|^{2}_{\Sigma_{g}}\right)$ (12)

where $\mathcal{X}$ represents the set of all matches, $\Sigma_{g}$ denotes the covariance matrix associated with the keypoint's scale, $\pi$ is the projection function, and $\rho$ is the robust Huber cost function.

Local BA. We perform local BA by optimizing a set of covisible keyframes $\mathcal{K}_{L}$ along with the set of points $P_{L}$ observed in those keyframes, as follows:

$\{\mathbf{P}_{m},\mathbf{R}_{l},\mathbf{t}_{l}\} = \mathop{\mathrm{argmin}}\limits_{\mathbf{P}_{m},\mathbf{R}_{l},\mathbf{t}_{l}} \sum_{\kappa\in\mathcal{K}_{L}\cup\mathcal{K}_{F}} \sum_{j\in\mathcal{X}_{k}} \rho(E(\kappa,j))$ (13)
$E(\kappa,j) = \|\mathbf{p}_{j}-\pi(\mathbf{R}_{\kappa}\mathbf{P}_{j}+\mathbf{t}_{\kappa})\|^{2}_{\Sigma_{g}}$ (14)

where mPLm\in{P}_{L}, l𝒦Ll\in\mathcal{K}_{L}, 𝒦F\mathcal{K}_{F} denotes the set of all other keyframes that observe points in PL{P}_{L} but remain fixed during the optimization, and 𝒳k\mathcal{X}_{k} is the set of matches between keypoints in a keyframe κ\kappa and points in PL{P}_{L}.

Global BA. Global BA is a special case of local BA, where all keyframes and map points are included in the optimization, except the origin keyframe, which is kept fixed to eliminate the gauge freedom.

11.3 Structured 3D Gaussians

MLPs as feature decoders. Following [21], we employ four MLPs as decoders to derive the parameters of each 3D Gaussian: the opacity MLP MαM_{\alpha}, the color MLP MCM_{C}, and the covariance MLPs Mq,MsM_{q},M_{s} for rotation and scaling. Each MLP consists of a linear layer, a ReLU activation, and another linear layer. The outputs are activated by their respective activation functions to obtain the final parameters of each 3D Gaussian. The detailed architecture of these MLPs is illustrated in Fig. 11. In our implementation, the hidden layer dimensions of all MLPs are set to 32.

  • For opacity, we use Tanh()\mathrm{Tanh(\cdot)} to activate the output of the final linear layer. Since the opacity values of 3D Gaussians are typically positive, we constrain the value range to [0,1)[0,1) to ensure valid outputs.

  • For color, we use the Sigmoid\mathrm{Sigmoid} function to activate the output of the final linear layer, which constrains the color values to the range [0,1)[0,1).

  • For rotation, following 3D-GS [15], we normalize the output of the final linear layer to obtain a valid unit-quaternion representation of the rotation.

  • For scaling, a Sigmoid\mathrm{Sigmoid} function is applied to activate the output of the final linear layer. Finally, the scaling of each 3D Gaussian is determined by adjusting the scaling lvl_{v} of its associated anchor based on the MLP’s output, as formulated below:

    {s0,,sk1}=Ms(f^v,δvc,𝐝vc)lv\displaystyle\{s_{0},\dots,s_{k-1}\}=M_{s}(\hat{f}_{v},\ \delta_{vc},\ \vec{\mathrm{\mathbf{d}}}_{vc})\cdot l_{v} (15)
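The following PyTorch sketch summarizes these four decoders under the stated dimensions (hidden width 32, k 3D Gaussians per anchor); the input composition, tensor shapes, and function names are our assumptions for illustration and do not reproduce the released implementation.

```python
# Minimal PyTorch sketch of the four per-anchor decoders described above
# (opacity, color, quaternion, scale). Sizes follow the text; everything else
# is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

k, feat_dim, hidden = 10, 32, 32  # Gaussians per anchor, anchor feature size


def mlp(out_dim):
    # "a linear layer, a ReLU activation, and another linear layer"
    return nn.Sequential(nn.Linear(feat_dim + 1 + 3, hidden),
                         nn.ReLU(),
                         nn.Linear(hidden, out_dim))

M_alpha = mlp(k)      # opacity per Gaussian
M_C = mlp(3 * k)      # RGB color per Gaussian
M_q = mlp(4 * k)      # quaternion per Gaussian
M_s = mlp(3 * k)      # scaling per Gaussian


def decode(f_v, delta_vc, d_vc, l_v):
    """f_v: (N, feat_dim) anchor features, delta_vc: (N, 1) camera-anchor
    distance, d_vc: (N, 3) viewing direction, l_v: (N, 3) anchor base scaling."""
    x = torch.cat([f_v, delta_vc, d_vc], dim=-1)
    opacity = torch.tanh(M_alpha(x))                  # Tanh activation (clamping omitted here)
    color = torch.sigmoid(M_C(x)).view(-1, k, 3)      # per-Gaussian RGB
    quat = F.normalize(M_q(x).view(-1, k, 4), dim=-1) # unit quaternions
    scale = torch.sigmoid(M_s(x)).view(-1, k, 3) * l_v.unsqueeze(1)  # Eq. (15)
    return opacity, color, quat, scale


# Example with 4 anchors.
out = decode(torch.randn(4, feat_dim), torch.rand(4, 1),
             F.normalize(torch.randn(4, 3), dim=-1), torch.rand(4, 3))
```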

11.4 Appearance-from-Motion Embedding

Figure 11: Structure of the MLPs MαM_{\alpha}, MCM_{C}, MsM_{s}, and MqM_{q}. For each anchor, we use these MLPs to predict the opacity, color, scale, and quaternion of kk 3D Gaussians. The inputs to the MLPs include the relative distance δvc\delta_{vc} and the viewing direction 𝐝vc\vec{\mathrm{\mathbf{d}}}_{vc} between the camera position 𝐭c\mathrm{\mathbf{t}}_{c} and an anchor point. Since NaN_{a} is not a fixed parameter, its specific value is not included in the figure.

MLP as feature encoder. For AfME, we employ an MLP MθaM_{\theta_{a}} as the encoder, whose input is the camera pose corresponding to each image. The MLP MθaM_{\theta_{a}} extracts pose features and feeds them into the color decoder MCM_{C}, forming the encoder-decoder architecture illustrated in Fig. 12, which also details the entire pipeline for obtaining the Gaussian color. The MLP MθaM_{\theta_{a}} consists of a linear layer followed by a linear activation function. Unlike FreGS [45], we leverage only high-frequency information, as low-frequency components typically represent scene structure, which is already effectively captured by our structured 3D Gaussians.

Figure 12: Structure of the MLPs MθaM_{\theta_{a}} and MCM_{C}. We adopt an encoder-decoder architecture, where an encoder MLP MθaM_{\theta_{a}} first extracts features from the camera poses. For each anchor, the feature 𝐑,𝐭(a)\bm{\ell}^{(a)}_{\mathrm{\mathbf{R}},\ \mathrm{\mathbf{t}}}, context feature f^v\hat{f}_{v}, the relative distance δvc\delta_{vc} between camera position 𝐭c\mathrm{\mathbf{t}}_{c} and the anchor point, and their viewing direction 𝐝vc\vec{\mathrm{\mathbf{d}}}_{vc} are then fed into a decoder MCM_{C} to predict the color of each Gaussian.
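As a complement to Fig. 12, the following PyTorch sketch shows one plausible realization of the AfME encoder-decoder path; the pose parameterization (flattened rotation plus translation), the layer sizes, and the way the appearance feature is concatenated with the anchor inputs are assumptions made for illustration, not the released implementation.

```python
# Illustrative sketch of the AfME path: an encoder MLP maps the per-image
# camera pose to an appearance feature, which is fed with the anchor inputs
# into the color decoder M_C. Parameterization and sizes are assumptions.
import torch
import torch.nn as nn

N_a, feat_dim, k, hidden = 32, 32, 10, 32

# "a linear layer followed by a linear activation function"
M_theta_a = nn.Linear(12, N_a)

M_C = nn.Sequential(
    nn.Linear(N_a + feat_dim + 1 + 3, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 3 * k),
    nn.Sigmoid(),
)


def gaussian_colors(R_c, t_c, f_v, delta_vc, d_vc):
    """R_c: (3, 3) rotation, t_c: (3,) translation, f_v: (N, feat_dim) anchor
    features, delta_vc: (N, 1) camera-anchor distance, d_vc: (N, 3) direction."""
    pose = torch.cat([R_c.reshape(-1), t_c])        # 12-dim pose vector
    app = M_theta_a(pose).expand(f_v.shape[0], -1)  # appearance feature per anchor
    x = torch.cat([app, f_v, delta_vc, d_vc], dim=-1)
    return M_C(x).view(-1, k, 3)                    # per-Gaussian RGB


colors = gaussian_colors(torch.eye(3), torch.zeros(3), torch.randn(4, feat_dim),
                         torch.rand(4, 1), torch.randn(4, 3))
```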
Datasets Replica TUM RGB-D
Method Metric R0 R1 R2 Of0 Of1 Of2 Of3 Of4 Avg. fr1/d fr2/x fr3/o Avg.
MonoGS [24] PSNR\uparrow 34.29 35.77 36.79 40.87 40.73 35.22 35.89 34.98 36.81 23.59 24.46 24.29 24.11
SSIM\uparrow 0.953 0.957 0.965 0.979 0.977 0.961 0.962 0.955 0.964 0.783 0.789 0.829 0.800
LPIPS\downarrow 0.071 0.078 0.074 0.048 0.052 0.074 0.061 0.092 0.069 0.244 0.227 0.223 0.231
Photo-SLAM [12] PSNR\uparrow 32.09 34.15 35.91 38.70 39.53 33.13 34.15 36.35 35.50 20.14 22.15 20.68 20.99
SSIM\uparrow 0.920 0.941 0.959 0.967 0.964 0.943 0.943 0.956 0.949 0.722 0.765 0.721 0.736
LPIPS\downarrow 0.069 0.055 0.041 0.048 0.045 0.075 0.064 0.053 0.056 0.258 0.169 0.211 0.213
Photo-SLAM-30K PSNR\uparrow 31.41 35.84 38.41 40.44 41.06 34.56 35.43 38.36 36.94 21.78 21.57 21.84 21.73
SSIM\uparrow 0.873 0.955 0.971 0.975 0.972 0.952 0.954 0.967 0.952 0.766 0.755 0.751 0.757
LPIPS\downarrow 0.046 0.036 0.026 0.033 0.033 0.059 0.049 0.036 0.040 0.212 0.182 0.165 0.186
RTG-SLAM [27] PSNR\uparrow 28.49 31.27 32.96 37.32 36.12 31.14 31.19 33.81 32.79 13.62 17.08 18.70 16.47
SSIM\uparrow 0.834 0.902 0.927 0.957 0.943 0.923 0.918 0.937 0.918 0.501 0.573 0.648 0.574
LPIPS\downarrow 0.152 0.119 0.122 0.084 0.103 0.145 0.139 0.125 0.124 0.557 0.403 0.422 0.461
GS-SLAM [40] PSNR\uparrow 31.56 32.86 32.59 38.70 41.17 32.36 32.03 32.92 34.27 - - - -
SSIM\uparrow 0.968 0.973 0.971 0.986 0.993 0.978 0.970 0.968 0.975 - - - -
LPIPS\downarrow 0.094 0.075 0.093 0.050 0.033 0.094 0.110 0.112 0.082 - - - -
SplaTAM [14] PSNR\uparrow 32.54 33.58 35.03 38.00 38.85 31.71 29.74 31.40 33.85 21.02 23.39 19.81 21.41
SSIM\uparrow 0.938 0.936 0.952 0.963 0.955 0.928 0.902 0.914 0.936 0.753 0.806 0.731 0.764
LPIPS\downarrow 0.068 0.096 0.072 0.087 0.095 0.100 0.119 0.157 0.099 0.341 0.204 0.249 0.265
SGS-SLAM [17] PSNR\uparrow 32.48 33.50 35.11 38.22 38.91 31.86 30.05 31.53 33.96 - - - -
SSIM\uparrow 0.975 0.968 0.983 0.983 0.982 0.966 0.952 0.946 0.969 - - - -
LPIPS\downarrow 0.071 0.099 0.073 0.083 0.091 0.099 0.118 0.154 0.099 - - - -
GS-ICP SLAM [9] PSNR\uparrow 34.89 37.15 37.89 41.62 42.86 32.69 31.45 38.54 37.14 15.67 18.49 19.25 17.81
SSIM\uparrow 0.955 0.965 0.970 0.981 0.981 0.965 0.959 0.969 0.968 0.574 0.667 0.692 0.642
LPIPS\downarrow 0.048 0.045 0.047 0.027 0.031 0.057 0.057 0.045 0.045 0.444 0.308 0.329 0.361
Ours PSNR\uparrow 37.07 39.54 40.33 42.04 43.21 36.38 37.18 39.62 39.42 25.29 26.35 26.46 26.03
SSIM\uparrow 0.968 0.977 0.980 0.982 0.979 0.967 0.969 0.977 0.975 0.839 0.831 0.859 0.843
LPIPS\downarrow 0.023 0.016 0.015 0.020 0.019 0.035 0.026 0.018 0.021 0.136 0.081 0.105 0.107
Table 9: Quantitative evaluation of our method compared to state-of-the-art methods for the RGB-D camera on the Replica and TUM RGB-D datasets. The best results are marked as best score, second best score and third best score. The results of GS-SLAM are taken from [40]; all others are obtained in our experiments. ’-’ denotes that the system does not provide valid results.
Datasets (Camera) Replica (Mono) TUM RGB-D (Mono)
Method Metric R0 R1 R2 Of0 Of1 Of2 Of3 Of4 Avg. fr1/d fr2/x fr3/o Avg.
MonoGS [24] PSNR\uparrow 26.19 25.42 27.83 31.90 34.22 26.09 28.56 26.49 28.34 20.38 21.21 21.41 21.00
SSIM\uparrow 0.819 0.798 0.889 0.911 0.930 0.881 0.898 0.897 0.878 0.691 0.690 0.735 0.705
LPIPS\downarrow 0.246 0.368 0.252 0.249 0.192 0.268 0.189 0.284 0.256 0.377 0.377 0.426 0.393
Photo-SLAM [12] PSNR\uparrow 30.43 32.11 32.89 37.24 38.10 31.60 32.27 34.16 33.60 19.56 20.82 20.12 20.17
SSIM\uparrow 0.890 0.926 0.937 0.960 0.955 0.932 0.928 0.943 0.934 0.705 0.718 0.702 0.708
LPIPS\downarrow 0.099 0.073 0.069 0.062 0.061 0.094 0.084 0.073 0.077 0.281 0.158 0.233 0.224
Photo-SLAM-30K PSNR\uparrow 32.13 33.14 37.27 38.04 41.73 35.22 34.88 36.22 36.08 22.57 20.54 20.08 21.06
SSIM\uparrow 0.896 0.921 0.965 0.964 0.974 0.952 0.949 0.955 0.947 0.787 0.714 0.697 0.733
LPIPS\downarrow 0.056 0.086 0.035 0.055 0.033 0.061 0.057 0.052 0.054 0.179 0.166 0.213 0.186
Ours PSNR\uparrow 34.94 37.96 38.28 41.19 42.23 36.30 35.44 37.33 37.96 23.94 25.39 26.17 25.17
SSIM\uparrow 0.949 0.967 0.971 0.978 0.972 0.964 0.952 0.959 0.964 0.804 0.813 0.857 0.825
LPIPS\downarrow 0.039 0.027 0.026 0.027 0.036 0.038 0.055 0.050 0.037 0.135 0.121 0.110 0.122
Table 10: Quantitative evaluation of our method compared to state-of-the-art methods for Monocular (Mono) camera on Replica and TUM RGB-D datasets. The best results are marked as best score and second best score.
Datasets Replica TUM RGB-D
Method Metric R0 R1 R2 Of0 Of1 Of2 Of3 Of4 Avg. fr1/d fr2/x fr3/o Avg.
ORB-SLAM3 [3] RMSE\downarrow 0.500 0.537 0.731 0.762 1.338 0.636 0.419 9.319 1.780 5.056 0.390 1.143 2.196
DROID-SLAM [35] RMSE\downarrow 95.994 52.471 62.908 54.807 36.038 118.191 94.200 79.510 74.264 36.057 16.749 169.844 74.216
MonoGS [24] RMSE\downarrow 0.444 0.273 0.274 0.442 0.469 0.220 0.159 2.237 0.565 1.531 1.440 1.535 1.502
Photo-SLAM [12] RMSE\downarrow 0.529 0.397 0.295 0.501 0.379 1.202 0.768 0.585 0.582 3.578 0.337 1.696 1.870
RTG-SLAM [27] RMSE\downarrow 0.222 0.258 0.248 0.201 0.190 0.115 0.156 0.136 0.191 1.582 0.377 0.996 0.985
GS-SLAM [40] RMSE\downarrow 0.480 0.530 0.330 0.520 0.410 0.590 0.460 0.700 0.500 3.300 1.300 6.600 3.700
SplaTAM [14] RMSE\downarrow 0.501 0.220 0.298 0.316 0.582 0.256 0.288 0.279 0.343 5.102 1.339 3.329 4.215
SGS-SLAM [17] RMSE\downarrow 0.463 0.216 0.300 0.339 0.547 0.299 0.451 0.311 0.365 - - - -
GS-ICP SLAM [9] RMSE\downarrow 0.189 0.132 0.216 0.201 0.236 0.160 0.162 0.117 0.177 3.539 2.251 2.972 2.921
Ours RMSE\downarrow 0.296 0.264 0.182 0.429 0.354 1.040 0.434 0.441 0.430 3.187 0.370 1.026 1.528
Table 11: Camera tracking result on Replica and TUM RGB-D datasets for RGB-D camera. RMSE of ATE (cm) is reported. The best results are marked as best score, second best score and third best score. ’-’ denotes the system does not provide valid results.
Datasets (Camera) Replica (Mono) TUM RGB-D (Mono)
Method Metric R0 R1 R2 Of0 Of1 Of2 Of3 Of4 Avg. fr1/d fr2/x fr3/o Avg.
ORB-SLAM3 [3] RMSE\downarrow 51.388 26.384 4.330 110.212 103.948 65.359 51.145 1.188 51.744 4.3269 10.4598 123.226 46.004
DROID-SLAM [35] RMSE\downarrow 103.892 53.146 66.939 53.267 34.431 119.311 98.089 83.732 76.600 1.769 0.458 2.839 1.689
MonoGS [24] RMSE\downarrow 12.623 56.357 25.350 43.245 19.729 39.148 11.754 88.230 37.054 4.575 4.605 2.847 4.009
Photo-SLAM [12] RMSE\downarrow 0.336 0.551 0.234 2.703 0.505 2.065 0.399 0.644 0.930 1.633 0.935 2.050 1.539
Ours RMSE\downarrow 0.288 0.388 0.215 0.579 0.320 3.963 0.307 0.603 0.833 3.187 0.370 1.026 1.505
Table 12: Camera tracking result on Replica and TUM RGB-D datasets for monocular camera. RMSE of ATE (cm) is reported. Best results are marked as best score, second best score and third best score.

11.5 Anchor Points Refinement

Our anchor refinement strategy follows [21] and is included here for completeness.

Anchor Growing. 3D Gaussians are spatially quantized into voxels of size ϵg=0.001\epsilon_{g}=0.001. For all 3D Gaussians within each voxel, we compute the average gradient after Nt=100N_{t}=100 training iterations, denoted as g\nabla_{g}. When the average gradient g\nabla_{g} within a voxel exceeds a threshold τg=0.0002\tau_{g}=0.0002, a new anchor is added at the center of the voxel. Since the total loss includes the frequency regularization hf\mathcal{L}_{hf} in Eq. 9, anchor points grow toward underrepresented high-frequency regions in the scene. Ultimately, the local details of the scene are refined. In our implementation, the scene is quantized into a multi-resolution voxel grid, allowing new anchors to be added to regions of varying sizes, as defined by

ϵg(m)=ϵg/4m1,τg(m)=τg2m1\displaystyle\epsilon_{g}^{(m)}=\epsilon_{g}/4^{m-1},\ \tau_{g}^{(m)}=\tau_{g}\cdot 2^{m-1} (16)

where mm denotes the level of quantization. Additionally, we adopt a random candidate pruning strategy to moderate the growth rate of anchors.

Anchor Pruning. To eliminate redundant anchor points, we evaluate their opacity. Specifically, after NtN_{t} training iterations, we accumulate the opacity values of the 3D Gaussians associated with each anchor. If the accumulated value αp\alpha_{p} falls below a pre-defined threshold, the anchor is removed from the scene.
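The multi-resolution growing rule of Eq. (16) and the opacity-based pruning can be summarized by the following NumPy sketch; the voxel hashing, the random candidate pruning, and the dummy inputs are simplified stand-ins rather than the actual implementation.

```python
# NumPy sketch of multi-resolution anchor growing (Eq. 16) and opacity-based
# pruning. Gradients and opacities below are dummy inputs; candidate pruning
# and other details are intentionally simplified.
import numpy as np

eps_g, tau_g, levels = 0.001, 0.0002, 3


def grow_anchors(positions, grads):
    """positions: (N, 3) Gaussian centers, grads: (N,) averaged gradients."""
    new_anchors = []
    for m in range(1, levels + 1):
        eps_m = eps_g / 4 ** (m - 1)   # finer voxels at higher levels
        tau_m = tau_g * 2 ** (m - 1)   # stricter threshold at higher levels
        voxels = np.floor(positions / eps_m).astype(np.int64)
        uniq, inv = np.unique(voxels, axis=0, return_inverse=True)
        inv = inv.reshape(-1)
        mean_grad = np.bincount(inv, weights=grads) / np.bincount(inv)
        centers = (uniq[mean_grad > tau_m] + 0.5) * eps_m  # anchor at voxel center
        new_anchors.append(centers)
    return np.concatenate(new_anchors, axis=0)


def prune_anchors(anchors, accumulated_opacity, thresh=0.005):
    """Drop anchors whose Gaussians accumulated too little opacity over N_t iterations."""
    return anchors[accumulated_opacity >= thresh]


positions = np.random.rand(1000, 3) * 0.01
grads = np.abs(np.random.randn(1000)) * 1e-4
anchors = grow_anchors(positions, grads)
kept = prune_anchors(anchors, np.random.rand(len(anchors)))
print("grown:", anchors.shape[0], "kept after pruning:", kept.shape[0])
```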

11.6 Experimental Parameters

The dimension NaN_{a} of AfME is set to 3232, except for the monocular camera on the Replica dataset, where it is set to 1. Each anchor manages k=10k=10 3D Gaussians. Anchors with an opacity value below 0.0050.005 are removed. The loss weights λ\lambda, λvol\lambda_{\text{vol}}, and λhf\lambda_{hf} are set to 0.20.2, 0.010.01, and 0.010.01, respectively; for the monocular camera on the Replica dataset, the weight λhf\lambda_{hf} for frequency regularization is increased to 0.0250.025 and the frequency pyramid consists of 33 levels.
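For convenience, the hyperparameters listed above can be collected into a single configuration, as in the illustrative snippet below; the key names are ours and do not correspond to the released code.

```python
# The hyperparameters listed above, gathered into one illustrative config dict.
config = {
    "appearance_dim_N_a": 32,            # 1 for the monocular Replica setting
    "gaussians_per_anchor_k": 10,
    "anchor_opacity_prune_thresh": 0.005,
    "lambda": 0.2,                       # loss weight lambda
    "lambda_vol": 0.01,
    "lambda_hf": 0.01,                   # 0.025 for the monocular Replica setting
    "frequency_pyramid_levels": 3,       # monocular Replica setting
}
```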

12 Additional Qualitative Results

12.1 Per-scene Results

Tab. 9, Tab. 10, Tab. 13(a), Tab. 11, Tab. 12, and Tab. 13(b) present the photorealistic mapping and localization results of our method across all datasets for each scene. Additionally, Fig. 13, Fig. 14, Fig. 15, Fig. 16, and Fig. 17 show more rendering comparisons between our method and all baseline methods for each scene.

Datasets (Camera) EuRoC (Stereo)
Method Metric MH01 MH02 V101 V201 Avg.
MonoGS [24] PSNR\uparrow 22.84 25.53 23.39 18.66 22.60
SSIM\uparrow 0.789 0.850 0.831 0.687 0.789
LPIPS\downarrow 0.243 0.181 0.287 0.384 0.274
Photo-SLAM [12] PSNR\uparrow 11.22 11.14 13.78 11.46 11.90
SSIM\uparrow 0.300 0.306 0.520 0.509 0.409
LPIPS\downarrow 0.469 0.464 0.394 0.427 0.439
Photo-SLAM-30K PSNR\uparrow 11.10 11.04 13.66 11.26 11.77
SSIM\uparrow 0.296 0.300 0.516 0.508 0.405
LPIPS\downarrow 0.466 0.457 0.389 0.409 0.430
Ours PSNR\uparrow 22.50 22.30 24.90 24.89 23.64
SSIM\uparrow 0.750 0.727 0.843 0.842 0.791
LPIPS\downarrow 0.220 0.269 0.122 0.117 0.182
(a) Quantitative evaluation of our method compared to state-of-the-art methods for Stereo camera on EuRoC MAV datasets. The best results are marked as best score and second best score.
Datasets (Camera) EuRoC (Stereo)
Method Metric MH01 MH02 V101 V201 Avg.
ORB-SLAM3 [3] RMSE\downarrow 4.806 4.938 8.829 25.057 10.907
DROID-SLAM [35] RMSE\downarrow 1.177 1.169 3.678 1.680 1.926
MonoGS [24] RMSE\downarrow 11.194 8.327 29.365 148.080 49.241
Photo-SLAM [12] RMSE\downarrow 3.997 4.547 8.882 26.665 11.023
Ours RMSE\downarrow 3.948 3.863 8.823 13.217 7.462
(b) Camera tracking result on EuRoC MAV datasets for stereo camera. RMSE of ATE (cm) is reported. The best results are marked as best score and second best score.
Table 13: Quantitative evaluation of our method compared to state-of-the-art methods for Stereo camera on EuRoC MAV datasets.
(a) GS-ICP SLAM [9]
(b) SplaTAM [14]
(c) RTG-SLAM [27]
(d) Ours
(e) Ground Truth
Figure 13: We show comparisons of ours to state-of-the-art methods on TUM RGB-D dataset for RGB-D camera. From top to bottom, the scenes are fr1/desk (rows 1–2), fr2/xyz (rows 3–4), and fr3/office (rows 5–6). Non-obvious differences in quality are highlighted by arrows.
(a) GS-ICP SLAM [9]
(b) SplaTAM [14]
(c) RTG-SLAM [27]
(d) Ours
(e) Ground Truth
Figure 14: We show comparisons of ours to state-of-the-art methods on Replica dataset for RGB-D camera. From top to bottom, the scenes are Office0, Office1, Office2, Office3, Office4, room0, room1, and room2. Non-obvious differences in quality are highlighted by arrows/insets. Since all methods achieve high-quality rendering results on the Replica dataset, we have highlighted specific regions in the images. In these annotated areas, our method consistently demonstrates sharper edges or finer textures.
(a) MonoGS [24]
(b) Photo-SLAM [12]
(c) Photo-SLAM-30K
(d) Ours
(e) Ground Truth
Figure 15: We show comparisons of ours to state-of-the-art methods on TUM RGB-D dataset for Monocular camera. From top to bottom, the scenes are fr1/desk (rows 1–2), fr2/xyz (rows 3–4), and fr3/office (rows 5–6). Non-obvious differences in quality are highlighted by arrows. Photo-SLAM [12] uses a set of distortion parameters to undistort the images used as ground-truth supervision. Consequently, its rendered images for fr1/desk and fr2/xyz exhibit black borders.
(a) MonoGS [24]
(b) Photo-SLAM [12]
(c) Photo-SLAM-30K
(d) Ours
(e) Ground Truth
Figure 16: We show comparisons of ours to state-of-the-art methods on Replica dataset for Monocular camera. From top to bottom, the scenes are Office0, Office1, Office2, Office3, Office4, room0, room1, and room2. Non-obvious differences in quality are highlighted by arrows/insets. Since all methods achieve high-quality rendering results on the Replica dataset, we have highlighted specific regions in the images. In these annotated areas, our method consistently demonstrates sharper edges or finer textures.
(a) MonoGS [24]
(b) Photo-SLAM [12]
(c) Photo-SLAM-30K
(d) Ours
(e) Ground Truth
Figure 17: We show comparisons of ours to state-of-the-art methods on the EuRoC MAV dataset for Stereo camera. From top to bottom, the scenes are MH01 (rows 1–2), MH02 (rows 3–4), V101 (rows 5–6), and V201 (rows 7–8). Non-obvious differences in quality are highlighted by arrows.