
IL-NeRF: Incremental Learning for Neural Radiance Fields with Camera Pose Alignment

Letian Zhang Department of Computer Science, Middle Tennessee State University Ming Li Center for Research in Computer Vision, University of Central Florida Chen Chen Center for Research in Computer Vision, University of Central Florida Jie Xu Department of Electrical and Computer Engineering, University of Miami
Abstract

Neural radiance fields (NeRF) is a promising approach for generating photorealistic images and representing complex scenes. However, when processing data sequentially, it can suffer from catastrophic forgetting, where previous data is easily forgotten after training with new data. Existing incremental learning methods using knowledge distillation assume that continuous data chunks contain both 2D images and corresponding camera pose parameters, pre-estimated from the complete dataset. This poses a paradox as the necessary camera pose must be estimated from the entire dataset, even though the data arrives sequentially and future chunks are inaccessible. In contrast, we focus on a practical scenario where camera poses are unknown. We propose IL-NeRF, a novel framework for incremental NeRF training, to address this challenge. IL-NeRF’s key idea lies in selecting a set of past camera poses as references to initialize and align the camera poses of incoming image data. This is followed by a joint optimization of camera poses and replay-based NeRF distillation. Our experiments on real-world indoor and outdoor scenes show that IL-NeRF handles incremental NeRF training and outperforms the baselines by up to 54.04\% in rendering quality. The project page is https://ilnerf.github.io/.

1 Introduction

Neural Radiance Fields (NeRF) [26] has recently shown great promise in producing photorealistic images from sparse image sets by encoding a 3D scene with a neural network that maps the location of 3D points to color and volume density. However, to achieve this level of performance, NeRF typically requires access to all training data at once in order to estimate the camera pose based on the entire dataset [7, 26]. In practical applications such as automotive and remote sensing, data is acquired sequentially, necessitating an immediately available updated 3D scene representation. Moreover, scenarios arise where a user acquires scans of a scene to train NeRF, only to find that the training yielded less effective results than expected. Consequently, the user rescans the scene with new data to enhance the fidelity of the images rendered by NeRF. In these instances, the scene representation must be trained in an incremental training environment, where the model can only access a limited number of views at each training stage, while still undertaking the task of reconstructing the entire scene.

Refer to caption
Figure 1: Current works require accurate camera poses pre-estimated from the whole image data. Our IL-NeRF can incrementally learn the 3D scene without camera poses. The output of IL-NeRF consists of both the aligned camera poses and the NeRF model.

In the context of incremental NeRF training, existing works [12, 7, 34, 5] generally operate under the assumption that data is segmented into multiple sequential chunks, with access limited to the current chunk while previously processed data is discarded. This incremental setting presents a notable challenge for NeRF: the model must update its knowledge with new data without erasing previously learned information, i.e., without suffering from catastrophic forgetting [53]. To address this challenge, prior works [12, 7, 34, 5] have investigated the implementation of incremental learning strategies for NeRF training, incorporating a knowledge distillation technique [53] to mitigate catastrophic forgetting. Specifically, before training the NeRF model with the current data chunk, pseudo-RGB values for the scene are generated using the previously trained NeRF model. These RGB values are subsequently utilized to train the NeRF model with the current data chunk, and the process is iteratively repeated, with the prior NeRF model acting as the supervisory teacher for subsequent data chunks. This enables the NeRF model to learn from new data while retaining knowledge from previously discarded data.

Motivation. While existing works propose effective incremental learning methods through knowledge distillation, these approaches are founded on the assumption that continuous data chunks comprise not only 2D images but also corresponding camera pose parameters, pre-estimated from the complete dataset. This assumption poses a paradox as the required camera pose must be estimated from the entire dataset, yet the data is intended to arrive sequentially, relatively independently, with future data chunks remaining unknown and inaccessible while previous chunks are discarded. In contrast, as shown in Figure 1, our work addresses a more practical scenario where pre-estimated camera poses are unavailable for each training data chunk, necessitating the consideration of camera pose estimation and the alignment of its coordinate system.

Challenge. Since the previous training data have been discarded, the incoming training data cannot simply be used directly for camera pose estimation: camera poses estimated in isolation will not lie in the same coordinate system as the previous camera poses, which leads to NeRF training misalignment and failure to render the 3D scene. Therefore, accurately estimating the camera poses of the sequentially arriving data within the same coordinate system becomes a crucial issue for incremental NeRF training.

Contribution. To deal with the above challenge, we present a novel framework for incremental NeRF training, named IL-NeRF, which can incrementally estimate the incoming data’s camera poses and effectively tackle the issue of catastrophic forgetting. (1) We propose an incremental camera pose alignment module that selects a suitable set of camera poses from previous estimates. These chosen poses help render prior training images from NeRF and aid in the joint camera pose estimation process. They also act as a reference coordinate system, facilitating the alignment of newly arrived and previous camera poses. (2) To ensure the appropriate selection of camera poses from the previously estimated ones, we transform this selection task into a graph-based reward-collection optimization problem. We then introduce a practical greedy algorithm to effectively solve this optimization problem. (3) To align camera pose coordinates, we use selected camera poses as a reference to derive a transfer matrix, transforming the current camera poses into the previous coordinate system. (4) We utilize a joint optimization method for camera poses and replay-based NeRF distillation, mitigating catastrophic forgetting and refining the accuracy of the camera poses.

The experimental results on three diverse datasets show that our proposed framework can improve PSNR, SSIM, and LPIPS by up to 54.04\% compared to the original NeRF, significantly mitigating the negative impact of catastrophic forgetting in NeRF’s incremental learning process. Moreover, our framework can effectively estimate and align the camera pose parameters in a consistent coordinate system.

2 Related Works

NeRF. NeRF [26] employs volume rendering to depict a continuous scene and achieve high-quality view synthesis. Several subsequent works have been introduced based on the success of NeRF, aiming to improve view synthesis efficiency and quality, including faster training and rendering [28, 6, 10], deformable or dynamic scene synthesis [32, 35, 36], editable view synthesis [30, 49], light changes [23, 27], surface enhancements [31, 45, 51], depth priors [8, 37, 48], etc. However, most of these methods presume access to all training data and require pre-estimated camera pose parameters from the entire dataset. In this study, we tackle a more practical scenario where NeRF learns the scene incrementally with a sequential data stream, without pre-estimated camera pose parameters.
Incremental NeRF Learning. Incremental learning, constrained by limited data availability during each training iteration, often causes catastrophic forgetting [9]. Methods to mitigate this issue include parameter isolation [13, 22, 21], replay [20, 38, 41], and regularization [1, 50, 2]. There is limited research on combining NeRF with incremental learning. Existing studies [12, 7, 34, 5] use knowledge distillation [53] to mitigate catastrophic forgetting by accessing past training data, specifically RGB values, from a pre-trained network. The retrieved data is merged with new incoming data to train NeRF. In [12], a regularization-based filter is used to select relevant information from randomly sampled camera views. In [7], a small network is trained to generate previously seen rays that are directed toward the scene. The authors in [34] employ the same replay-based method in [7] but they substitute NeRF with Instant-NGP [28] to expedite the training process. In [5], a prioritized replay buffer is introduced to keep the images and their camera poses with the lowest historical rendering qualities for continual learning. However, these methods require pre-estimated camera poses from the entire training data for each coming data chunk. In this work, we consider a more realistic scenario where pre-estimated camera poses are unavailable for each training data chunk, thus requiring incremental camera pose alignment.
NeRF With Pose Refinement. Pose refinement is widely used in NeRF training. iNeRF [52] refines camera poses for novel view images using a reconstructed NeRF model. NeRFmm [47] jointly optimizes both camera intrinsic and extrinsic parameters during NeRF training, and BARF [19] proposes a coarse-to-fine positional encoding strategy for camera poses and NeRF joint optimization. SC-NeRF [15] further considers camera distortion refinement and employs a geometric loss to regularize rays. In this paper, IL-NeRF uses the pose refinement in SC-NeRF [15]. Note that IL-NeRF is not limited to only the pose refinement in SC-NeRF [15]; any other well-designed pose refinement methods can also be transferred to IL-NeRF.

3 Incremental NeRF Training Preliminary

NeRF. NeRF aims to learn a 3D scene with a simple neural network, e.g., MLPs, that takes a 3D location \textbf{x} and view direction \textbf{r}_{d} as input and produces RGB color \textbf{c} and volume density \sigma as output. For each ray \textbf{r}=(\textbf{r}_{o},\textbf{r}_{d}) emitted from the camera origin \textbf{r}_{o} in direction \textbf{r}_{d}, NeRF samples M 3D points along the ray, \textbf{x}_{i}=\textbf{r}_{o}+z_{i}\textbf{r}_{d}, where z_{i} is the distance from the camera to the sample point \textbf{x}_{i}. The pixel color C can then be integrated by volumetric rendering as follows:

C(\textbf{r})=\sum_{i=1}^{M}\alpha_{i}(1-\delta_{i})\textbf{c}(\textbf{x}_{i})  (1)

where \delta_{i}=\exp(-(z_{i}-z_{i-1})\sigma(\textbf{x}_{i})) represents the transmittance of the ray segment between sample points \textbf{x}_{i-1} and \textbf{x}_{i}, and \alpha_{i}=\prod_{j=1}^{i-1}\delta_{j} is the ray attenuation from the origin \textbf{r}_{o} to the sample point \textbf{x}_{i}. Since the whole pipeline is differentiable, NeRF can be trained by minimizing the photometric error between the rendered views and the ground-truth views.

\mathcal{L}=\sum_{\textbf{r}\in R}\parallel C^{*}(\textbf{r})-C(\textbf{r})\parallel^{2}_{2}  (2)
\Theta^{*}=\arg\min_{\Theta}\mathcal{L}(C\mid C^{*},\mathcal{P})  (3)

where R represents a group of rays from one or more camera views, obtained from the camera pose parameters \mathcal{P}. C^{*} is the ground-truth pixel color, and \Theta denotes the parameters of the network. For more details of NeRF, we refer the readers to [26].
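To make the rendering model concrete, the following is a minimal NumPy sketch of Eq. (1) and the photometric loss in Eq. (2), written in the paper's per-segment transmittance notation; the function and variable names (render_ray, photometric_loss, z, sigma, rgb) are ours, and the network \mathcal{F} is assumed to have already produced the per-sample densities and colors.

```python
import numpy as np

def render_ray(z, sigma, rgb):
    """Composite one ray following Eq. (1).

    z:     (M,) sample depths along the ray
    sigma: (M,) predicted densities at the samples
    rgb:   (M, 3) predicted colors at the samples
    """
    dists = np.diff(z, prepend=z[0])                          # z_i - z_{i-1}; first segment has zero length
    delta = np.exp(-dists * sigma)                            # per-segment transmittance delta_i
    alpha = np.cumprod(np.concatenate(([1.0], delta[:-1])))   # attenuation alpha_i = prod_{j<i} delta_j
    weights = alpha * (1.0 - delta)                           # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)               # pixel color C(r)

def photometric_loss(pred_rgb, gt_rgb):
    """Squared L2 photometric error of Eq. (2), summed over rays."""
    return float(np.sum((pred_rgb - gt_rgb) ** 2))
```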

Incremental NeRF Training. To achieve impressive performance, NeRF assumes access to all training data covering all views of a scene at once. However, the entire training data might not be available simultaneously in practical applications due to physical or hardware limitations, e.g., edge devices with a limited amount of memory and data storage. As a result, data needs to be processed sequentially; in other words, NeRF must incrementally learn the scene from new training data without revisiting the old data. Concretely, we consider a time-slotted system wherein each time slot t\in\{0,\cdots,T-1\}, a chunk of image data G^{t} consists of N images, i.e., G^{t}=\{I_{n}^{t},n\in\{0,\cdots,N-1\}\}. Generally, only the latest image data chunk G^{t} is available while previous image data chunks G^{0:t-1} are inaccessible. The objective of incremental NeRF is to minimize the reconstruction loss across all provided chunks of image data in \{G^{0},\dots,G^{T-1}\}. Note that unlike the existing works [7, 12, 34, 5], in our work each incoming chunk contains only the images, without corresponding camera poses. Thus, the main aim of incremental learning for NeRF in this paper is to enable the network to learn and adapt continually by ensuring that the camera poses of new image data are estimated and aligned in a consistent coordinate system, while preventing catastrophic forgetting across all previously seen image data.

Refer to caption
Figure 2: Overview of the IL-NeRF pipeline. First, the network \Theta_{t-1}^{*} from the previous NeRF is frozen. Then, incremental camera pose alignment is employed to estimate the current camera poses \mathcal{P}^{c} through (a) finding optimal camera poses among the previous camera poses; (b) estimating the camera poses for the incoming image data and the images rendered from the selected camera poses; (c) aligning the current camera poses into the previous camera coordinate system. Finally, the network \Theta_{t}, the current estimated poses \mathcal{P}^{c}, and the previous poses \mathcal{P}^{p} are jointly trained on both the current image data rays C^{c} and the distilled past rays C^{p} simultaneously.

4 IL-NeRF

In this section, we introduce our proposed framework, IL-NeRF, which mitigates catastrophic forgetting via replay-based knowledge distillation retrieved from the NeRF model itself (Section 4.1) and utilizes incremental camera pose alignment to estimate and align the camera poses of incoming training image data within the same coordinate system as the previous camera poses (Section 4.2). The overall pipeline is illustrated in Figure 2.

4.1 Replay-Based NeRF Distillation

The problem of catastrophic forgetting occurs when a network, trained only with the currently available data at each time step, struggles to remember previously learned knowledge, resulting in low-quality image synthesis for previously seen views. To address this, we adopt a replay-based NeRF distillation strategy in NeRF training. At each time slot t, we copy and freeze the parameters of the network as a teacher network before training on the incoming image data chunk t. Since the network has been trained on the t-1 previous image data chunks, we use \Theta^{*}_{t-1} to denote the frozen parameters of the teacher network. During each training iteration for image data chunk t, we use the past camera poses \mathcal{P}^{p} to obtain the pixel colors of the past training rays from the teacher network \Theta^{*}_{t-1} as follows:

C^{p}=\mathcal{F}(\mathcal{P}^{p},\Theta^{*}_{t-1})  (4)

By treating C^{p} as pseudo ground truth, we optimize \Theta_{t} by learning new knowledge from the new incoming image data and old knowledge from the teacher network, thus mitigating the forgetting problem. The new loss function is defined as follows:

\mathcal{L}=\sum_{\textbf{r}^{c}}\parallel\hat{C}^{c}-C^{c}\parallel^{2}_{2}+\sum_{\textbf{r}^{p}}\parallel\hat{C}^{p}-C^{p}\parallel^{2}_{2}  (5)
\Theta_{t}^{*}=\arg\min_{\Theta_{t}}\mathcal{L}(\hat{C}^{c},\hat{C}^{p}\mid C^{c},\mathcal{P}^{c},\Theta_{t-1}^{*},\mathcal{P}^{p})  (6)

where \textbf{r}^{c} and \textbf{r}^{p} are the training rays obtained from the current camera poses \mathcal{P}^{c} and the past camera poses \mathcal{P}^{p}. \hat{C}^{c} and \hat{C}^{p} are the colors estimated by the network being trained, given the concatenated camera poses [\mathcal{P}^{c},\mathcal{P}^{p}]:

[\hat{C}^{c},\hat{C}^{p}]=\mathcal{F}([\mathcal{P}^{c},\mathcal{P}^{p}],\Theta_{t})  (7)

The existing works [7, 12, 34] assume that the camera poses are provided with each incoming image data chunk and are not saved on the device when the next chunk of training data arrives. This requires additional effort to train an auxiliary neural network to remember or filter the previously trained rays. However, this paper considers a different scenario. Specifically, the incoming image data chunk includes only the training images and not the camera poses; hence the camera poses need to be obtained using camera calibration techniques and saved on the device. It is worth noting that only the previous camera poses are stored, not the previous training images, and the camera poses require only a small amount of storage. In the following section, we discuss how to estimate and align the previous and new camera poses into the same coordinate system.
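As a sketch of how one replay-distillation step of Eqs. (4)-(7) could be implemented, assuming student and teacher are generic callables that render pixel colors from poses or rays (stand-ins for \mathcal{F}(\cdot,\Theta), not the paper's actual implementation):

```python
import torch

def distillation_step(student, teacher, optimizer, rays_current, gt_current, poses_past):
    """One replay-based distillation step.

    student, teacher: callables that render pixel colors given rays/poses
    rays_current:     rays sampled from the current chunk's estimated poses P^c
    gt_current:       ground-truth colors C^c for those rays
    poses_past:       stored past camera poses P^p (the past images are discarded)
    """
    with torch.no_grad():
        pseudo_gt = teacher(poses_past)          # C^p rendered by the frozen teacher, Eq. (4)

    pred_current = student(rays_current)         # \hat{C}^c for the current chunk
    pred_past = student(poses_past)              # \hat{C}^p re-rendered by the network being trained

    # Eq. (5): photometric loss on new data plus distillation loss on past rays.
    loss = ((pred_current - gt_current) ** 2).sum() + ((pred_past - pseudo_gt) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```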

4.2 Incremental Camera Pose Alignment

Let \mathcal{P}^{p} represent the previously aligned camera poses, which include the intrinsic camera matrix K and the extrinsic camera matrices \{\pi^{p},\psi^{p}\}, where \pi^{p} is the set of rotation matrices and \psi^{p} is the set of translations. When it comes to estimating camera poses from the incoming image data, it is not sufficient to estimate them independently: poses estimated in isolation do not lie in the coordinate system of the previous camera poses, leading to a significant coordinate misalignment issue. To this end, we introduce an incremental camera pose alignment method that leverages the trained NeRF. We start by choosing D camera poses from prior instances based on low training losses and comprehensive coverage. These are then combined with the incoming images to enhance camera pose estimation.

Finding Previous Optimal Camera Poses. Our approach is to formulate a reward-collection optimization problem on a graph. In this graph, the nodes represent camera positions (i.e., the translations in the camera poses) with each node assigned a reward corresponding to the negative value of the preceding training loss. The edges represent Euclidean distances between each camera pair’s positions. The goal is to find a path that collects as much reward as possible, subject to constraints on the total number of visited nodes and camera view coverage. Concretely, the objective optimization problem can be formulated as

\max\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}R_{k};~~~~s.t.~\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}=D;  (8)
~~~~S(K)\geq S_{th},~K=\{k|x_{k}=1,k=1,\dots,|\mathcal{P}^{p}|\};  (9)
~~~~E(x_{k})\leq 1,~\forall k\in\{1,\dots,|\mathcal{P}^{p}|\};  (10)

where x_{k} is the binary decision variable: x_{k}=1 if node k is visited and x_{k}=0 otherwise. S(K) is the shortest path that connects all the selected nodes, and E(x_{k}) is the number of incoming edges of each selected node. The first constraint (8) ensures that only D previous camera poses are selected. The second constraint (9) requires the view coverage of the selected cameras to be larger than a threshold, because a large field-of-view coverage of the selected cameras improves the accuracy of camera pose estimation. The third constraint (10) guarantees that every node has at most one incoming edge; in other words, every node is visited at most once. Consequently, this reward-collection optimization problem can be viewed as a hybrid of the Knapsack Problem and the Travelling Salesperson Problem, which is NP-hard. To solve it, we propose a greedy algorithm that reduces computation time by several orders of magnitude compared with the Brute-Force method. Let e_{i,j}=e_{j,i} denote the edge between cameras i and j. At camera node i, we define an approximation edge length for every node connected to node i as \hat{e}_{i,j}=R_{j}+\lambda(\frac{S_{th}}{D}-e_{i,j}), where \lambda is a parameter for adjusting the units of R_{j} and \frac{S_{th}}{D}-e_{i,j}. This approximation edge plays a role similar to a Lagrange multiplier [4] for handling the constraint (9). We introduce an auxiliary starting node into the graph, which connects to all nodes with the same edge length. The greedy algorithm begins at this auxiliary starting node and selects the unvisited node with the maximum approximation edge as the next starting node. This process is repeated until a total of D nodes have been selected. For a comprehensive understanding of the process, we offer a detailed description of the greedy algorithm, along with pseudocode, in the supplementary material; a minimal code sketch is also given below.
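Below is a minimal Python sketch of this greedy selection, assuming the previous cameras are given as 3D centers with per-camera rewards (negative training losses); positions, rewards, and lam are our own names.

```python
import numpy as np

def select_camera_poses(positions, rewards, D, S_th, lam=1.0):
    """Greedy selection of D reference cameras.

    positions: (N, 3) camera centers (translations of the previous poses)
    rewards:   (N,) negative previous training loss per camera
    D:         number of poses to select
    S_th:      coverage threshold on the connecting path length
    lam:       unit-balancing weight lambda
    """
    N = len(positions)
    selected = []
    current = None  # auxiliary start node: all cameras reachable at equal (zero) edge length
    for _ in range(D):
        best_j, best_score = None, -np.inf
        for j in range(N):
            if j in selected:
                continue
            e_ij = 0.0 if current is None else np.linalg.norm(positions[current] - positions[j])
            score = rewards[j] + lam * (S_th / D - e_ij)   # approximation edge \hat{e}_{i,j}
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        current = best_j
    return selected
```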

Camera Pose Alignment. After solving the above reward-collection optimization problem, we select D camera poses with large view coverage from the previously aligned camera poses. The reason for selecting D camera poses instead of using all the camera poses is that camera poses with poorly rendered images provide inaccurate features, leading to large camera estimation errors.

These D camera poses are used to render images from the NeRF model, which are combined with the incoming image data as a group. The camera poses of this group can be estimated using camera calibration methods, such as traditional SfM or SLAM techniques. Let \pi_{d} be the rotation matrix of a selected camera in time slot t-1 and \tilde{\pi}_{d} be the corresponding rotation matrix of the same camera estimated in time slot t. Similarly, let \psi_{d} be the translation of the selected camera in time slot t-1 and \tilde{\psi}_{d} be the corresponding translation estimated in time slot t. We can then compute the transfer matrices of the rotation and translation from the coordinate system in time slot t to the coordinate system in time slot t-1 by:

\triangle\pi=\pi_{d}\tilde{\pi}_{d}^{-1},~~~~\triangle\psi=\psi_{d}-\triangle\pi\tilde{\psi}_{d}  (11)

We can use the transfer matrices to align the camera poses \{\pi,\psi\} of the new images to the original camera coordinate system by:

\tilde{\pi}=\triangle\pi\pi,~~~~\tilde{\psi}=\triangle\pi\psi+\triangle\psi  (12)
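The alignment step can be sketched as follows, under the simplifying assumption that the transfer is a rigid transform computed from a single selected pose pair (in practice one could estimate it jointly from all D pairs, and any SfM scale ambiguity is ignored here); the function names are ours.

```python
import numpy as np

def compute_transfer(R_old, t_old, R_new, t_new):
    """Transfer matrices from the time-t frame to the time-(t-1) frame.

    R_old, t_old: rotation (3x3) and translation (3,) of a selected camera at time t-1
    R_new, t_new: the same camera re-estimated in the new (time-t) coordinate system
    """
    dR = R_old @ R_new.T          # rotation inverse equals transpose
    dt = t_old - dR @ t_new
    return dR, dt

def align_pose(dR, dt, R, t):
    """Map a newly estimated pose {pi, psi} into the previous frame, as in Eq. (12)."""
    return dR @ R, dR @ t + dt
```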

Joint Optimization of Poses and NeRF. However, we have observed that the camera pose alignment still leaves errors in the camera poses. To address this, we draw inspiration from previous works [47, 15] and jointly optimize the camera poses and NeRF during training of our proposed IL-NeRF. Rather than directly optimizing the initial camera pose \tilde{\pi} and \tilde{\psi}, we employ a 6-dimensional vector \Phi=[a,b] to define the trainable parameters of each camera pose, where a\in\mathbb{R}^{3} represents the 3D rotation angles and b\in\mathbb{R}^{3} denotes the increment of the translation. To ensure the orthogonality of the rotation matrix, Rodrigues’ formula \Omega(a) is used to generate the 3D rotation matrix. The final rotation and translation are:

\tilde{\pi}=\Omega(a)\tilde{\pi},~~~~\tilde{\psi}=\tilde{\psi}+b  (13)

By integrating the 6-dimensional vector \Phi into the NeRF training pipeline, the camera parameters and scene representation can be jointly optimized during training. Here, we slightly abuse the notation and use \Phi to represent the group of all cameras’ trainable parameters. Mathematically, the pose refinement can be written as:

\Theta_{t}^{*},\Phi_{t}^{*}=\arg\min_{\Theta_{t},\Phi_{t}}\mathcal{L}(\hat{C}^{c},\hat{C}^{p}\mid C^{c},\mathcal{P}^{c},\Theta_{t-1}^{*},\mathcal{P}^{p})  (14)
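For illustration, a minimal NumPy sketch of the 6-D pose correction \Phi=[a,b] in Eq. (13) is given below; in the actual pipeline \Phi would be a trainable tensor optimized jointly with \Theta_{t} as in Eq. (14), whereas plain arrays and our own function names are used here.

```python
import numpy as np

def rodrigues(a):
    """Rodrigues' formula Omega(a): 3D rotation angles -> orthogonal rotation matrix."""
    theta = np.linalg.norm(a)
    if theta < 1e-8:
        return np.eye(3)
    k = a / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def refine_pose(phi, R_init, t_init):
    """Apply the trainable 6-D correction Phi = [a, b] to an initial aligned pose."""
    a, b = phi[:3], phi[3:]
    return rodrigues(a) @ R_init, t_init + b
```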
Algorithm 1 IL-NeRF Pseudo Code
1: Initialize \mathcal{P}^{p}=\emptyset.
2: for t=0 do
3:     Estimate camera poses \mathcal{P}^{c}_{0} from G^{0}
4:     Jointly train NeRF network \Theta_{0} with camera poses \mathcal{P}^{c}_{0}
5:     \mathcal{P}^{p}=\mathcal{P}^{p}\bigcup\mathcal{P}^{c}_{0}
6: end for
7: for t=1 to T-1 do
8:     Copy and freeze the network as \Theta_{t-1}^{*}
9:     Obtain past training data C^{p} by \mathcal{F}(\mathcal{P}^{p},\Theta^{*}_{t-1})
10:     Align the camera poses \mathcal{P}^{c}_{t} based on \mathcal{P}^{p}
11:     Jointly train NeRF network \Theta_{t} with camera poses \mathcal{P}^{c}_{t} and \mathcal{P}^{p}
12:     \mathcal{P}^{p}=\mathcal{P}^{p}\bigcup\mathcal{P}^{c}_{t}
13: end for

5 Experiment

Dataset. We use three different datasets to evaluate different aspects of our model, namely Mip-NeRF360 [3], LLFF [25], and NeRF-real360 [26]. To simulate the incremental scenarios, we reorganize the camera order so that it moves sequentially, and we select a portion of the dataset where the previous images are not revisited. All the training images of each scene are divided into four training chunks denoted as \mathcal{G}=\{G^{0},G^{1},G^{2},G^{3}\}. We acquire the camera pose parameters using COLMAP [40].

Table 1: Performance comparison with the baselines on PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves comparable results with NeRF-Distill (NeRF incrementally trained with replay-based distillation). Note that NeRF, NeRF-Distill, and EWC require ground-truth camera poses pre-estimated from the entire image data, whereas IL-NeRF estimates and aligns camera poses with the proposed incremental camera pose alignment module. NeRF-Distill can be treated as representative of the existing incremental learning works with accurate camera poses [7, 12, 34, 5].
Data Method Need Pose G^{0} G^{1} G^{2} G^{3} (each cell reports PSNR \Uparrow / SSIM \Uparrow / LPIPS \Downarrow)
Mip-NeRF360 Counter NeRF-Distill Yes 32.17 / 0.92 / 0.07 29.58 / 0.86 / 0.14 28.03 / 0.82 / 0.18 28.28 / 0.85 / 0.18
NeRF Yes 32.12 / 0.91 / 0.07 24.62 / 0.72 / 0.25 21.94 / 0.65 / 0.34 20.30 / 0.62 / 0.37
EWC Yes 32.12 / 0.91 / 0.07 23.83 / 0.72 / 0.25 22.56 / 0.65 / 0.33 21.11 / 0.61 / 0.36
NeRF-SLAM No 31.75 / 0.91 / 0.07 28.30 / 0.83 / 0.21 26.84 / 0.79 / 0.28 25.30 / 0.77 / 0.31
IL-NeRF (Ours) No 32.13 / 0.91 / 0.07 29.63 / 0.87 / 0.12 28.56 / 0.85 / 0.15 27.82 / 0.83 / 0.17
Kitchen NeRF-Distill Yes 31.05 / 0.91 / 0.07 29.72 / 0.88 / 0.13 29.33 / 0.85 / 0.15 29.18 / 0.84 / 0.14
NeRF Yes 31.17 / 0.91 / 0.08 27.01 / 0.75 / 0.25 21.42 / 0.70 / 0.31 23.69 / 0.75 / 0.24
EWC Yes 31.17 / 0.91 / 0.08 26.76 / 0.74 / 0.25 22.09 / 0.70 / 0.31 23.39 / 0.74 / 0.23
NeRF-SLAM No 30.87 / 0.90 / 0.09 29.63 / 0.85 / 0.20 27.65 / 0.81 / 0.24 27.71 / 0.82 / 0.20
IL-NeRF (Ours) No 31.27 / 0.92 / 0.07 30.66 / 0.89 / 0.10 29.84 / 0.87 / 0.12 29.34 / 0.86 / 0.13
LLFF Fortress NeRF-Distill Yes 31.75 / 0.86 / 0.11 31.83 / 0.84 / 0.09 30.90 / 0.86 / 0.14 29.81 / 0.85 / 0.11
NeRF Yes 31.56 / 0.85 / 0.11 29.38 / 0.80 / 0.15 27.05 / 0.78 / 0.17 25.39 / 0.78 / 0.16
EWC Yes 31.56 / 0.85 / 0.11 29.41 / 0.79 / 0.15 25.83 / 0.77 / 0.16 24.40 / 0.78 / 0.15
NeRF-SLAM No 31.08 / 0.82 / 0.12 31.09 / 0.82 / 0.12 29.77 / 0.83 / 0.15 28.53 / 0.82 / 0.14
IL-NeRF (Ours) No 31.69 / 0.85 / 0.11 31.02 / 0.84 / 0.10 30.33 / 0.84 / 0.11 29.45 / 0.83 / 0.12
Horns NeRF-Distill Yes 29.86 / 0.89 / 0.07 29.67 / 0.89 / 0.06 29.24 / 0.89 / 0.07 28.87 / 0.87 / 0.08
NeRF Yes 29.78 / 0.86 / 0.09 27.04 / 0.75 / 0.09 26.04 / 0.74 / 0.12 24.01 / 0.70 / 0.14
EWC Yes 29.78 / 0.86 / 0.09 26.77 / 0.75 / 0.09 27.08 / 0.74 / 0.11 24.68 / 0.69 / 0.14
NeRF-SLAM No 28.86 / 0.83 / 0.10 28.77 / 0.84 / 0.08 28.19 / 0.83 / 0.10 27.56 / 0.82 / 0.12
IL-NeRF (Ours) No 29.92 / 0.89 / 0.07 29.50 / 0.89 / 0.07 29.01 / 0.89 / 0.07 28.96 / 0.87 / 0.09
NeRF-real360 Pinecone NeRF-Distill Yes 26.88 / 0.89 / 0.12 24.23 / 0.79 / 0.16 24.03 / 0.73 / 0.19 23.18 / 0.74 / 0.21
NeRF Yes 26.22 / 0.84 / 0.16 22.90 / 0.64 / 0.24 21.15 / 0.58 / 0.33 18.94 / 0.49 / 0.41
EWC Yes 26.22 / 0.84 / 0.16 22.70 / 0.63 / 0.24 21.42 / 0.57 / 0.32 18.81 / 0.48 / 0.41
NeRF-SLAM No 25.63 / 0.81 / 0.18 24.09 / 0.73 / 0.22 23.01 / 0.68 / 0.29 21.79 / 0.65 / 0.34
IL-NeRF (Ours) No 26.3 / 0.87 / 0.10 24.56 / 0.78 / 0.17 23.78 / 0.74 / 0.20 22.93 / 0.72 / 0.23
Vasedeck NeRF-Distill Yes 29.27 / 0.86 / 0.07 27.93 / 0.85 / 0.12 26.03 / 0.74 / 0.16 26.18 / 0.74 / 0.18
NeRF Yes 29.03 / 0.85 / 0.07 23.99 / 0.70 / 0.26 22.73 / 0.69 / 0.24 21.57 / 0.64 / 0.31
EWC Yes 29.03 / 0.85 / 0.07 24.36 / 0.69 / 0.25 22.25 / 0.68 / 0.24 20.52 / 0.64 / 0.30
NeRF-SLAM No 27.98 / 0.79 / 0.11 26.41 / 0.77 / 0.21 25.10 / 0.72 / 0.21 24.62 / 0.71 / 0.26
IL-NeRF (Ours) No 29.48 / 0.86 / 0.07 27.38 / 0.82 / 0.10 26.11 / 0.76 / 0.14 26.15 / 0.75 / 0.17

Baseline and Metrics. IL-NeRF is compared with the following baselines. NeRF: the original NeRF is incrementally trained with only the current image data chunk and ground-truth camera poses, but without NeRF distillation, making it susceptible to catastrophic forgetting. EWC [16]: similar to NeRF, EWC incrementally trains the model with only the current image data chunk and ground-truth camera poses; however, it utilizes a widely-used regularization-based method that penalizes changes in parameters that are important for past training sets. NeRF-Distill: NeRF is incrementally trained with ground-truth camera poses under replay-based NeRF distillation. Note that the ground-truth camera poses are estimated from all the training images. NeRF-Distill can be treated as representative of the existing works [7, 12, 34, 5], which require the ground-truth camera poses for each incoming image data chunk. NeRF-SLAM: we follow the general implementation of NeRF-SLAM [39], which uses SLAM to align the camera poses of the incoming image data. We replace Droid-SLAM [42] in NeRF-SLAM with ORB-SLAM2 [29] because Droid-SLAM utilizes a complex deep learning model for camera pose estimation, which needs to be trained on image data before NeRF training.

We evaluate IL-NeRF and baselines in three aspects, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [46] and Learned Perceptual Image Patch Similarity (LPIPS) [54]. We use AlexNet [17] as the backbone of LPIPS. It should be noted that as joint optimization of poses and NeRF is performed for all camera poses, including the previous and current poses, at each time step, the camera poses are continuously changing. This means it is not possible to fix the camera poses for test data, and all evaluation metrics compare the rendered images from the training camera poses with the ground truth.
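For reference, PSNR follows the standard definition sketched below (SSIM and LPIPS are computed with their standard implementations [46, 54]); this is a generic sketch, not code from the paper.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio between a rendered image and its ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```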

Refer to caption
Figure 3: Qualitative comparison of the original NeRF and IL-NeRF on images rendered for the first image data chunk after each incremental training step. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process.

5.1 Results

Table 1 shows partial results obtained by IL-NeRF and the baseline methods on the three datasets. Due to page limitations, more comparisons are shown in the supplementary material. From the results, we can see that IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves comparable results with NeRF-Distill. We explain the results in more detail next.

NeRF: In comparison to the original NeRF, IL-NeRF demonstrates substantial improvements of 10.87\% to 36.55\% in PSNR, 15.18\% to 46.90\% in SSIM, and 25.00\% to 54.05\% in LPIPS across the three datasets. On the initial image data chunk G^{0}, IL-NeRF outperforms the original NeRF slightly, as a result of the joint optimization performed by IL-NeRF. However, the performance of the original NeRF rapidly declines thereafter due to catastrophic forgetting.

EWC: EWC fails to reduce the adverse effects of catastrophic forgetting and, in some cases, is even worse than the original NeRF. This is due to the lack of previous training images, the inability of EWC to recover previous scenes, and its penalty mechanism, which discourages learning from newly scanned images.

NeRF-Distill: Comparing NeRF-Distill with IL-NeRF may not be entirely fair, given that NeRF-Distill benefits from having access to all training images to estimate camera poses for the image data chunks, while in our scenario the camera poses are not provided and must be derived through incremental camera pose alignment. Nonetheless, IL-NeRF performs comparably to NeRF-Distill, largely due to the incremental camera pose alignment module used during training.

NeRF-SLAM: While NeRF-SLAM uses SLAM to align camera poses in incoming image data to a common coordinate system, it lags behind our IL-NeRF in terms of performance. This difference stems from NeRF-SLAM’s exclusive reliance on selected keyframes for replay-based training, resulting in overfitting to specific rays and compromising multiview consistency. Furthermore, NeRF-SLAM demands more memory storage space. For instance, on the ‘Garden’ image data in Mip-NeRF360, NeRF-SLAM necessitates an additional 251.3 MB of memory for storing keyframes and point clouds. In contrast, IL-NeRF only requires an additional 37.7 KB of memory for storing previous camera poses.

Figure 3 provides additional insights by presenting a qualitative comparison of the performance of the original NeRF and IL-NeRF on the ‘Kitchen’ and ‘Counter’ scenes in the Mip-NeRF360 dataset. Specifically, we demonstrate the rendering results on the first image data after each incremental training. It is evident that the original NeRF suffers from the catastrophic forgetting problem, resulting in images with significant distortions such as noise and blur, whereas IL-NeRF generates highly realistic images with quality comparable to the ground truth. This observation indicates that IL-NeRF is highly effective in mitigating the forgetting problem and estimating the camera poses. More results are shown in the supplementary material.

Refer to caption
Figure 4: Influence of optimal pose count D on IL-NeRF.

5.2 Ablation Study

Effect of Pose Selection. The first step of incremental camera pose alignment is to find the previous D optimal camera poses as references for estimating and aligning the camera poses of the incoming image data. To illustrate the influence of D on IL-NeRF performance, we set the value of D to 1, 5, 10, 20, and all, respectively. Figure 4 portrays the PSNR of IL-NeRF across varying values of D on the ‘Bicycle’ scene from the Mip-NeRF360 dataset. As depicted, when D is small, the lack of adequate reference images results in the estimated camera poses of the incoming image data deviating from the original camera pose coordinate system, thereby leading to considerably poor rendering. As the value of D increases, the estimated camera poses of the incoming image data become increasingly precise, and thus the PSNR increases. Note that an excessively large value of D introduces poorly rendered cameras, subsequently leading to a decrease in PSNR. To identify the optimal D camera poses, we address a reward-collection optimization problem on a graph. To demonstrate the effectiveness of our camera selection approach, we conduct a comparative analysis with two baselines: (1) randomly selecting D camera poses as the reference, and (2) myopically selecting the D camera poses with the lowest training losses. Figure 5 shows the performance of IL-NeRF and the two baselines on the ‘Bicycle’ scene. As we can see, our proposed method surpasses the other two approaches. This is primarily due to our method’s ability to ensure the quality of rendered images used as references while providing broader camera view coverage, thereby facilitating more accurate camera pose estimation for incoming image data.
Effect of Transfer Matrices. The transfer matrices are obtained by computing the corresponding rotation matrices and translations of the selected D camera poses in time slots t-1 and t. These matrices are then employed to align the camera poses of new images to the original camera pose coordinate system. To investigate the effectiveness of the transfer matrices, we compare IL-NeRF with IL-NeRF without the transfer matrices, denoted as ‘IL-NeRF w/o TM’ in Figure 6, on the ‘Garden’ scene from the Mip-NeRF360 dataset. The results reveal a significant decline in performance without the transfer matrices, achieving only 15.76 dB in terms of PSNR. This decline can be attributed to separate camera pose estimation for the two tasks resulting in camera poses in two independent coordinate systems, which misleads the model during training and degrades performance.
Effect of Pose Refinement. Despite the transfer matrices’ ability to align the camera poses to the original coordinate system, they may still contain noise and inaccuracies. To mitigate this issue, we use the coordinate-aligned camera poses as initial values and jointly optimize the camera poses and scene representation during NeRF training, a process known as pose refinement. We perform an ablation study to investigate the effectiveness of pose refinement by comparing IL-NeRF with and without it. The results in Figure 6 indicate that IL-NeRF with pose refinement outperforms IL-NeRF without it (i.e., IL-NeRF w/o PR). Figure 7 further shows the qualitative comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF. More results are shown in the supplementary material.
Camera Pose. Our goal is to incrementally train a NeRF model given only RGB images as input, without known camera poses. In other words, we need to find out the camera poses associated with each input image while training the NeRF model. We treat the COLMAP estimation from all training images as ground-truth (GT) camera poses and report the difference between our optimized camera poses and theirs on the training images. Figure 8 shows the camera pose trajectories of GT and IL-NeRF. As we can see, IL-NeRF recovers accurate camera poses with the help of camera coordinate alignment and pose refinement. More results are shown in the supplementary material.

Refer to caption
Figure 5: Our camera pose selection method outperforms random selection and myopic selection.
Refer to caption
Figure 6: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR and IL-NeRF. IL-NeRF outperforms these two cases.
Refer to caption
Figure 7: Comparison of IL-NeRF w/o Transfer Matrices (TM), IL-NeRF w/o Pose Refinement (PR) and IL-NeRF.
Refer to caption
(a) Kitchen
Refer to caption
(b) Vasedeck
Figure 8: Camera pose estimation comparison. GT means the camera poses estimated by COLMAP from all the training images. IL-NeRF recovers accurate camera poses with the help of incremental camera pose alignment.

6 Conclusion

In this study, we introduce an incremental learning algorithm called IL-NeRF that tackles the problems of catastrophic forgetting and camera coordinate misalignment when training NeRF in incremental learning settings. IL-NeRF employs a replay-based NeRF distillation pipeline to retain past information while learning from new data. Furthermore, an incremental camera pose alignment method is introduced to estimate the camera poses associated with incoming data chunks during NeRF training. Experimental results show that IL-NeRF outperforms the original NeRF model in sequential data settings.

References

  • Aljundi et al. [2017] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3375, 2017.
  • Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
  • Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  • Bertsekas [2014] Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.
  • Cai and Mueller [2023] Zhipeng Cai and Matthias Mueller. Clnerf: Continual learning meets nerf. arXiv preprint arXiv:2308.14816, 2023.
  • Chen et al. [2022] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 333–350. Springer, 2022.
  • Chung et al. [2022] Jaeyoung Chung, Kanggeon Lee, Sungyong Baik, and Kyoung Mu Lee. Meil-nerf: Memory-efficient incremental learning of neural radiance fields. arXiv preprint arXiv:2212.08328, 2022.
  • Deng et al. [2022] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  • French [1999] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  • Golden et al. [1987] Bruce L Golden, Larry Levy, and Rakesh Vohra. The orienteering problem. Naval Research Logistics (NRL), 34(3):307–318, 1987.
  • Guo et al. [2022] Mengqi Guo, Chen Li, and Gim Hee Lee. Incremental learning for neural radiance field with uncertainty-filtered knowledge distillation. arXiv preprint arXiv:2212.10950, 2022.
  • Hung et al. [2019] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Jandaghi et al. [2021] Hossein Jandaghi, Ali Divsalar, and Saeed Emami. The categorized orienteering problem with count-dependent profits. Applied Soft Computing, 113:107962, 2021.
  • Jeong et al. [2021] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5846–5854, 2021.
  • Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • kwea123 [2022] kwea123. ngp-pl: a pytorch implementation of instant-ngp, 2022.
  • Lin et al. [2021] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5741–5751, 2021.
  • Lopez-Paz and Ranzato [2017] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.
  • Mallya and Lazebnik [2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • Mallya et al. [2018] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
  • Martin-Brualla et al. [2021] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.
  • Martins et al. [2021] Leandro do C Martins, Rafael D Tordecilla, Juliana Castaneda, Angel A Juan, and Javier Faulin. Electric vehicle routing, arc routing, and team orienteering problems in sustainable transportation. Energies, 14(16):5131, 2021.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Mildenhall et al. [2022] Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P Srinivasan, and Jonathan T Barron. Nerf in the dark: High dynamic range view synthesis from noisy raw images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16190–16199, 2022.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  • Mur-Artal and Tardós [2017] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE transactions on robotics, 33(5):1255–1262, 2017.
  • Niemeyer and Geiger [2021] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
  • Oechsle et al. [2021] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021.
  • Park et al. [2021] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
  • Pěnička et al. [2017] Robert Pěnička, Jan Faigl, Petr Váňa, and Martin Saska. Dubins orienteering problem. IEEE Robotics and Automation Letters, 2(2):1210–1217, 2017.
  • Po et al. [2023] Ryan Po, Zhengyang Dong, Alexander W Bergman, and Gordon Wetzstein. Instant continual learning of neural radiance fields. arXiv preprint arXiv:2309.01811, 2023.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
  • Rebain et al. [2021] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14153–14161, 2021.
  • Roessle et al. [2022] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12892–12901, 2022.
  • Rolnick et al. [2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Rosinol et al. [2022] Antoni Rosinol, John J Leonard, and Luca Carlone. Nerf-slam: Real-time dense monocular slam with neural radiance fields. arXiv preprint arXiv:2210.13641, 2022.
  • Schonberger and Frahm [2016] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  • Shin et al. [2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
  • Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
  • Vansteenwegen and Gunawan [2019] Pieter Vansteenwegen and Aldy Gunawan. Orienteering problems. EURO Advanced Tutorials on Operational Research, 2019.
  • Vansteenwegen et al. [2011] Pieter Vansteenwegen, Wouter Souffriau, and Dirk Van Oudheusden. The orienteering problem: A survey. European Journal of Operational Research, 209(1):1–10, 2011.
  • Wang et al. [2021a] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021a.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wang et al. [2021b] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021b.
  • Wei et al. [2021] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5610–5619, 2021.
  • Xiang et al. [2021] Fanbo Xiang, Zexiang Xu, Milos Hasan, Yannick Hold-Geoffroy, Kalyan Sunkavalli, and Hao Su. Neutex: Neural texture mapping for volumetric neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7128, 2021.
  • Xu and Zhu [2018] Ju Xu and Zhanxing Zhu. Reinforced continual learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Yariv et al. [2021] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  • Yen-Chen et al. [2021] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.
  • Zhang et al. [2019] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3713–3722, 2019.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

Supplementary Material

7 Finding Previous Optimal Camera Poses

To find the previous D optimal camera poses, we formulate a reward-collection optimization problem on a graph. In this graph, the nodes represent camera positions (i.e., the translations in the camera poses), with each node assigned a reward corresponding to the negative value of the preceding training loss. The edges represent Euclidean distances between each camera pair’s positions. The goal is to find a path that collects as much reward as possible, subject to constraints on the total number of visited nodes and camera view coverage. Concretely, the optimization problem can be formulated as

\max\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}R_{k}  (15)
s.t.~\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}=D  (16)
~~~~S(K)\geq S_{th},~K=\{k|x_{k}=1,k=1,\dots,|\mathcal{P}^{p}|\}  (17)
~~~~E(x_{k})\leq 1,~\forall k\in\{1,\dots,|\mathcal{P}^{p}|\}  (18)

where x_{k} is the binary decision variable: x_{k}=1 if node k is visited and x_{k}=0 otherwise. S(K) is the shortest path that connects all the selected nodes, and E(x_{k}) is the number of incoming edges of each selected node. The first constraint (16) ensures that only D previous camera poses are selected. The second constraint (17) means that the view coverage of the selected cameras is larger than a threshold; this is because a large field-of-view coverage of the selected cameras improves the accuracy of camera pose estimation. The third constraint (18) guarantees that every node has at most one incoming edge; in other words, every node is visited at most once. Consequently, this reward-collection optimization problem can be viewed as a hybrid of the Knapsack Problem and the Travelling Salesperson Problem, which is NP-hard.

Related Work for the Reward-Collection Optimization Problem. The reward-collection problem, also known as the orienteering problem, is an optimization problem that aims to determine the most efficient route for visiting multiple locations while maximizing the value or score collected at the visited places, all within a specified time frame and beginning and ending at particular points [11]. This problem is widely utilized in the tourism sector [44], robot routing [33], food delivery [43], and transportation control [24]. As the orienteering problem belongs to the NP-hard class of problems, no algorithm can solve it optimally within a reasonable amount of time [14]. Different from the traditional orienteering problem, the optimization problem in this paper is more complex: we do not restrict the start and end points, and at the same time we limit the number of visited nodes, which prevents existing approaches from being applied directly to our setting.

Algorithm 2 Brute-Force for Selecting Cameras
1: Generate all possible D-camera combinations \mathcal{K}.
2: Initialize \mathcal{B}=\emptyset and b=-\infty.
3: for K\in\mathcal{K} do
4:       Use Breadth-First-Search to find the shortest path S(K) that visits all the nodes in K.
5:       if S(K)\geq S_{th} then
6:             Sum the total reward \mathcal{R} of the nodes in K.
7:             if \mathcal{R}\geq b then
8:                    \mathcal{B}=K
9:                    b=\mathcal{R}
10:             end if
11:       end if
12: end for
13: Return \mathcal{B}

Brute-Force Method. The straightforward approach to this problem is the Brute-Force method, shown in Algorithm 2. This method involves: (1) determining the shortest path that visits all nodes for each D-camera combination; (2) selecting the camera combination with the highest total reward, while ensuring compliance with all constraints. However, the time complexity of this approach is O((2^{D}\times D)\times\binom{N}{D}), where \binom{N}{D} is the number of D-combinations of the N previous camera poses and 2^{D}\times D is the time complexity of finding the shortest path for each D-camera combination. While this method guarantees an optimal solution, it becomes impractical for large numbers of nodes due to its time complexity.
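For illustration, a minimal Python sketch of this exhaustive baseline is shown below; it enumerates node orderings with itertools.permutations to find the shortest connecting path, a simpler (and even slower) substitute for the breadth-first search above, and all names are ours.

```python
import itertools
import numpy as np

def brute_force_select(positions, rewards, D, S_th):
    """Exhaustive baseline: test every D-combination of cameras and keep the
    feasible one (path coverage >= S_th) with the highest total reward."""
    positions = np.asarray(positions)
    rewards = np.asarray(rewards)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    best_K, best_reward = None, -np.inf
    for K in itertools.combinations(range(len(positions)), D):
        # Shortest path visiting all selected nodes, here by trying every ordering.
        shortest = min(
            sum(dist[p[i], p[i + 1]] for i in range(D - 1))
            for p in itertools.permutations(K)
        )
        total = rewards[list(K)].sum()
        if shortest >= S_th and total > best_reward:
            best_K, best_reward = K, total
    return best_K
```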

Proposed Greedy Algorithm. Let $\mathbb{G}(V,E)$ be the graph of the previous camera poses, where $V$ denotes the set of cameras (nodes) and $E$ denotes the set of edges connecting each pair of cameras. Let $e_{i,j}=e_{j,i}$, with $e_{i,j},e_{j,i}\in E$, denote the edge between cameras $i$ and $j$; note that $\mathbb{G}(V,E)$ is an undirected, weighted, complete graph. The core idea of our greedy algorithm is to traverse the unvisited nodes starting from a specific node: at each step, the algorithm computes an approximation edge between the current node and every connected unvisited node, and selects the node with the maximum approximation edge as the next starting node. This process is repeated until a total of $D$ nodes have been selected. The complete description of the greedy algorithm is outlined in Algorithm 3.

Step 1 (line 1 to line 2): Introduce an auxiliary starting node $V_{0}$ into the graph, connected to all nodes with edges of length 0. We define a set $\mathcal{B}$ to keep track of the visited nodes during traversal and initialize it as $\mathcal{B}=\{V_{0}\}$. Additionally, the current node index is initialized as $k=0$.

Step 2 (line 4 to line 10): Identify all nodes connected to the current node $V_{k}$ that have not been visited yet (i.e., $V_{i}\notin\mathcal{B}$). Based on the reward $R_{i}$ and the edge length $e_{k,i}$ of each unvisited node, we compute the approximation edge length $\hat{e}_{k,i}$ and insert it into a temporary set $\hat{E}$. Specifically, $\hat{e}_{k,i}=R_{i}+\lambda(\frac{S_{th}}{D}-e_{k,i})$, where $\lambda$ is a parameter that balances the units of $R_{i}$ and $\frac{S_{th}}{D}-e_{k,i}$. This approximation edge plays a role similar to a Lagrange multiplier [4] in handling the constraint (17).

Step 3 (line 11 to line 12): Select the node with the maximum $\hat{e}_{k,i}$ as the next visited node, update the current index as $k=\arg\max\hat{E}$, and insert the newly visited node into the set of visited nodes $\mathcal{B}$.

Step 4: Repeat Step 2 and Step 3 until a total of $D$ nodes have been visited.

Algorithm 3 Proposed Greedy Algorithm
1: Add an auxiliary starting node $V_{0}$ linking all the nodes.
2: Initialize the visited set $\mathcal{B}=\{V_{0}\}$ and $k=0$.
3: while $|\mathcal{B}|<D+1$ do
4:       $\hat{E}=\{\}$
5:       for $V_{i}\in V$ do
6:             if $V_{i}\notin\mathcal{B}$ then
7:                    $\hat{e}_{k,i}=R_{i}+\lambda(\frac{S_{th}}{D}-e_{k,i})$
8:                    $\hat{E}.append(\hat{e}_{k,i})$
9:             end if
10:       end for
11:       $k=\arg\max\hat{E}$
12:       $\mathcal{B}.append(V_{k})$
13: end while
14: Return $\mathcal{B}.remove(V_{0})$

The time complexity of our greedy algorithm is $O(D\times N\log N)$, which reduces the computation time by several orders of magnitude compared with the Brute-Force method.
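The following is a minimal Python sketch of Algorithm 3, not our actual implementation. It assumes the same `rewards` and `dist` inputs as above and a unit-balancing weight `lam` for $\lambda$; the auxiliary node $V_{0}$ is modeled implicitly by giving the first pick zero-length edges, and `greedy_select` is a hypothetical helper name.

```python
def greedy_select(rewards, dist, D, S_th, lam=1.0):
    """Greedily pick D previous cameras by repeatedly taking the largest approximation edge."""
    N = len(rewards)
    selected = []                       # the visited set B (the auxiliary node V_0 is implicit)
    k = None                            # current node; None stands for V_0
    while len(selected) < D:
        best_i, best_score = None, float("-inf")
        for i in range(N):
            if i in selected:
                continue
            e_ki = 0.0 if k is None else dist[k][i]        # edges from V_0 have length 0
            score = rewards[i] + lam * (S_th / D - e_ki)   # approximation edge e_hat_{k,i}
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)         # visit the node with the largest approximation edge
        k = best_i
    return selected
```

Each of the $D$ greedy steps only scans the remaining unvisited nodes, which is what keeps the runtime far below the combinatorial cost of the Brute-Force method.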

Comparison of the Brute-Force Method with Ours. Here, we compare the Brute-Force method and our greedy algorithm in terms of PSNR and computation time. As shown in Tables 2, 3 and 4, IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm, making the Brute-Force method impractical for incremental training scenarios. Note that the runtime of the Brute-Force method grows exponentially with the size of the training data. For example, the ‘Bicycle’ scene in Mip-NeRF360 contains 194 training images and the ‘Horns’ scene in LLFF contains 62 training images, and the Brute-Force method’s runtimes for these two scenes are 5 days 6 hours and 1 hour 47 min, respectively.

Table 2: Performance comparison of the Brute-Force method and our IL-NeRF on the Mip-NeRF360 dataset. IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm.
Scene Method PSNR running time
Bicycle Brute-Force 22.36 5 days 6 hours
Greedy (Ours) 22.34 10.92 ms
Bonsai Brute-Force 28.96 10 days 8 hours
Greedy (Ours) 28.96 25.57 ms
Counter Brute-Force 27.86 10 days 2 hours
Greedy (Ours) 27.82 23.78 ms
Garden Brute-Force 24.83 4 days 18 hours
Greedy (Ours) 24.82 9.87 ms
Kitchen Brute-Force 29.34 11 days 6 hours
Greedy (Ours) 29.34 28.39 ms
Room Brute-Force 31.49 12 days 10 hours
Greedy (Ours) 31.45 37.58 ms
Stump Brute-Force 24.91 4 days 12 hours
Greedy (Ours) 24.89 8.73 ms
Table 3: Performance comparison of the Brute-Force method and our IL-NeRF on the LLFF dataset. IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm.
Scene Method PSNR running time
Fern Brute-Force 25.27 62.01 s
Greedy (Ours) 25.26 4.38 ms
Flower Brute-Force 30.49 4.63 min
Greedy (Ours) 30.49 6.81 ms
Fortress Brute-Force 29.45 14.17 min
Greedy (Ours) 29.45 8.78 ms
Horns Brute-Force 28.97 1 hour 47 min
Greedy (Ours) 28.96 9.87 ms
Leaves Brute-Force 23.88 65.78 s
Greedy (Ours) 23.88 4.95 ms
Orchids Brute-Force 23.67 57.84 s
Greedy (Ours) 23.67 5.58 ms
Room Brute-Force 31.88 12.48 min
Greedy (Ours) 31.88 8.73 ms
Trex Brute-Force 27.81 57.97 min
Greedy (Ours) 27.81 9.37 ms
Table 4: Performance comparison of the Brute-Force method and our IL-NeRF on the NeRF-real360 dataset. IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm.
Scene Method PSNR running time
Pinecone Brute-Force 22.96 4 days 10 hours
Greedy (Ours) 22.93 9.58 ms
Vasedeck Brute-Force 26.24 5 days 2 hours
Greedy (Ours) 26.15 10.61 ms

7.1 Implementation Details

We implement our framework following the architecture of Instant-NeRF [28, 18]. We use two separate Adam optimizers, one for the NeRF model and one for camera pose refinement, with initial learning rates of 0.01 and 0.005, respectively. The learning rate of the NeRF model decays every iteration by a factor of 0.9954 (exponential decay), and the learning rate of the pose refinement decays every 100 iterations by a factor of 0.9. In each incremental step, we train the network for 30k iterations with $D=10$ on Mip-NeRF360, 5k iterations with $D=5$ on LLFF, and 20k iterations with $D=10$ on NeRF-real360.
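To make the schedule above concrete, the following is a minimal PyTorch sketch of the two-optimizer setup, not our training code. The small MLP, the per-image pose tensor `pose_params`, the iteration count, and the placeholder loss are illustrative stand-ins for the Instant-NeRF network, the learnable camera-pose corrections, and the photometric-plus-distillation objective.

```python
import torch

# Placeholder stand-ins: a tiny MLP for the NeRF network and per-image se(3) pose corrections.
nerf_model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
pose_params = torch.zeros(10, 6, requires_grad=True)

nerf_opt = torch.optim.Adam(nerf_model.parameters(), lr=0.01)
pose_opt = torch.optim.Adam([pose_params], lr=0.005)

# NeRF learning rate: multiplied by 0.9954 every iteration (exponential decay).
nerf_sched = torch.optim.lr_scheduler.ExponentialLR(nerf_opt, gamma=0.9954)
# Pose learning rate: multiplied by 0.9 once every 100 iterations.
pose_sched = torch.optim.lr_scheduler.StepLR(pose_opt, step_size=100, gamma=0.9)

num_iters = 5000                        # e.g., 5k for LLFF; 30k / 20k for the 360-degree datasets
for it in range(num_iters):
    # Placeholder loss; in IL-NeRF this would be the photometric loss on the current
    # chunk plus the replay-based distillation loss.
    x = torch.randn(1024, 3)
    loss = nerf_model(x).square().mean() + pose_params.square().mean()
    nerf_opt.zero_grad(); pose_opt.zero_grad()
    loss.backward()
    nerf_opt.step(); pose_opt.step()
    nerf_sched.step()                   # decays every iteration
    pose_sched.step()                   # StepLR applies the decay every 100 iterations
```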

8 More Comparisons for Results

In the main text, we report PSNR, SSIM, and LPIPS for only some scenes of the three datasets; here we give the full results. Table 5 shows the results obtained by IL-NeRF and the baseline methods on the Mip-NeRF360 dataset with seven real-world indoor and outdoor scenes. Similarly, Table 6 presents the results on the LLFF dataset with eight forward-facing scenes, and Table 7 shows the results on the NeRF-real360 dataset with two real-world object-centric scenes. From the results, we can see that IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves results comparable to NeRF.

Table 5: Performance comparison on the Mip-NeRF360 dataset with the baselines: PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, NeRF-SLAM and achieves comparable results with NeRF.
Scene Method Pose PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Bicycle NeRF Yes 22.76 / 0.61 / 0.33 18.58 / 0.47 / 0.46 20.03 / 0.52 / 0.43 20.03 / 0.52 / 0.44
EWC Yes 22.76 / 0.61 / 0.33 18.80 / 0.47 / 0.45 19.41 / 0.51 / 0.43 19.89 / 0.52 / 0.43
NeRF Yes 22.88 / 0.62 / 0.33 20.23 / 0.49 / 0.43 22.03 / 0.53 / 0.39 22.18 / 0.54 / 0.41
NeRF-SLAM No 22.78 / 0.61 / 0.33 19.67 / 0.48 / 0.45 21.37 / 0.53 / 0.41 21.61 / 0.53 / 0.42
IL-NeRF No 22.90 / 0.62 / 0.33 19.84 / 0.48 / 0.44 22.05 / 0.54 / 0.40 22.34 / 0.55 / 0.40
Bonsai NeRF Yes 33.30 / 0.93 / 0.07 25.47 / 0.75 / 0.25 23.53 / 0.66 / 0.34 22.12 / 0.68 / 0.35
EWC Yes 33.30 / 0.93 / 0.07 25.62 / 0.75 / 0.25 22.35 / 0.66 / 0.33 21.51 / 0.68 / 0.34
NeRF Yes 33.48 / 0.93 / 0.07 29.93 / 0.88 / 0.15 28.03 / 0.84 / 0.18 28.18 / 0.84 / 0.21
NeRF-SLAM No 33.32 / 0.93 / 0.07 29.13 / 0.84 / 0.21 28.01 / 0.79 / 0.29 26.85 / 0.80 / 0.29
IL-NeRF No 33.54 / 0.93 / 0.07 30.73 / 0.89 / 0.12 29.77 / 0.86 / 0.16 28.96 / 0.85 / 0.18
Counter NeRF Yes 32.12 / 0.91 / 0.07 24.62 / 0.72 / 0.25 21.94 / 0.65 / 0.34 20.30 / 0.62 / 0.37
EWC Yes 32.12 / 0.91 / 0.07 23.83 / 0.72 / 0.25 22.56 / 0.65 / 0.33 21.11 / 0.61 / 0.36
NeRF Yes 32.17 / 0.92 / 0.07 29.58 / 0.86 / 0.14 28.03 / 0.82 / 0.18 28.28 / 0.85 / 0.18
NeRF-SLAM No 31.75 / 0.91 / 0.07 28.30 / 0.83 / 0.21 26.84 / 0.79 / 0.28 25.30 / 0.77 / 0.31
IL-NeRF No 32.13 / 0.91 / 0.07 29.63 / 0.87 / 0.12 28.56 / 0.85 / 0.15 27.82 / 0.83 / 0.17
Garden NeRF Yes 24.70 / 0.71 / 0.20 22.34 / 0.64 / 0.25 20.17 / 0.59 / 0.31 19.42 / 0.54 / 0.38
EWC Yes 24.70 / 0.71 / 0.20 23.38 / 0.63 / 0.24 20.09 / 0.58 / 0.31 19.81 / 0.54 / 0.37
NeRF Yes 24.72 / 0.73 / 0.19 24.93 / 0.72 / 0.18 24.68 / 0.69 / 0.22 24.48 / 0.67 / 0.21
NeRF-SLAM No 24.72 / 0.71 / 0.20 24.03 / 0.69 / 0.23 23.50 / 0.65 / 0.28 23.37 / 0.61 / 0.33
IL-NeRF No 24.73 / 0.73 / 0.19 24.80 / 0.70 / 0.22 24.86 / 0.69 / 0.23 24.82 / 0.67 / 0.23
Kitchen NeRF Yes 31.17 / 0.91 / 0.08 27.01 / 0.75 / 0.25 21.42 / 0.70 / 0.31 23.69 / 0.75 / 0.24
EWC Yes 31.17 / 0.91 / 0.08 26.76 / 0.74 / 0.25 22.09 / 0.70 / 0.31 23.39 / 0.74 / 0.23
NeRF Yes 31.05 / 0.91 / 0.07 29.72 / 0.88 / 0.13 29.33 / 0.85 / 0.15 29.18 / 0.84 / 0.14
NeRF-SLAM No 30.87 / 0.90 / 0.09 29.63 / 0.85 / 0.20 27.65 / 0.81 / 0.24 27.71 / 0.82 / 0.20
IL-NeRF No 31.27 / 0.92 / 0.07 30.66 / 0.89 / 0.10 29.84 / 0.87 / 0.12 29.34 / 0.86 / 0.13
Room NeRF Yes 35.98 / 0.96 / 0.04 30.78 / 0.91 / 0.09 26.34 / 0.80 / 0.21 27.44 / 0.86 / 0.16
EWC Yes 35.98 / 0.96 / 0.04 31.84 / 0.90 / 0.09 27.38 / 0.79 / 0.20 28.08 / 0.86 / 0.16
NeRF Yes 36.18 / 0.96 / 0.03 33.93 / 0.95 / 0.05 32.03 / 0.92 / 0.08 31.99 / 0.93 / 0.06
NeRF-SLAM No 35.74 / 0.94 / 0.08 33.20 / 0.93 / 0.07 30.36 / 0.88 / 0.17 30.73 / 0.89 / 0.13
IL-NeRF No 36.04 / 0.96 / 0.04 34.02 / 0.94 / 0.04 32.35 / 0.92 / 0.07 31.45 / 0.91 / 0.09
Stump NeRF Yes 25.62 / 0.77 / 0.28 22.30 / 0.52 / 0.38 21.25 / 0.46 / 0.42 20.55 / 0.44 / 0.47
EWC Yes 25.62 / 0.77 / 0.28 22.55 / 0.51 / 0.37 21.09 / 0.45 / 0.42 21.48 / 0.44 / 0.46
NeRF Yes 26.18 / 0.79 / 0.27 25.93 / 0.64 / 0.37 25.12 / 0.62 / 0.38 25.18 / 0.64 / 0.39
NeRF-SLAM No 24.98 / 0.74 / 0.31 24.76 / 0.61 / 0.36 24.05 / 0.56 / 0.40 23.93 / 0.57 / 0.43
IL-NeRF No 25.96 / 0.77 / 0.28 25.75 / 0.66 / 0.32 25.09 / 0.60 / 0.37 24.89 / 0.58 / 0.37
Table 6: Performance comparison on the LLFF dataset with the baselines: PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, NeRF-SLAM and achieves comparable results with NeRF.
Scene Method Pose PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Fern NeRF Yes 29.19 / 0.90 / 0.06 24.58 / 0.80 / 0.19 23.21 / 0.67 / 0.24 22.43 / 0.65 / 0.25
EWC Yes 29.19 / 0.90 / 0.06 24.88 / 0.79 / 0.19 22.60 / 0.66 / 0.23 23.32 / 0.64 / 0.25
NeRF Yes 29.26 / 0.90 / 0.06 26.71 / 0.88 / 0.09 25.79 / 0.85 / 0.12 25.19 / 0.82 / 0.13
NeRF-SLAM No 28.77 / 0.88 / 0.16 25.22 / 0.85 / 0.15 24.95 / 0.79 / 0.23 24.33 / 0.77 / 0.21
IL-NeRF No 29.30 / 0.90 / 0.06 26.68 / 0.87 / 0.10 25.63 / 0.81 / 0.13 25.26 / 0.80 / 0.15
Flower NeRF Yes 34.12 / 0.96 / 0.01 30.30 / 0.91 / 0.02 28.40 / 0.90 / 0.03 27.50 / 0.88/ 0.04
EWC Yes 34.12 / 0.96 / 0.01 29.84 / 0.90 / 0.02 28.16 / 0.90 / 0.03 27.14 / 0.88 / 0.03
NeRF Yes 34.28 / 0.96 / 0.01 31.76 / 0.93 / 0.01 30.98 / 0.93 / 0.02 30.68 / 0.93 / 0.02
NeRF-SLAM No 33.28 / 0.96 / 0.01 31.34 / 0.92 / 0.01 30.29 / 0.92 / 0.02 29.72 / 0.91 / 0.03
IL-NeRF No 34.22 / 0.96 / 0.01 31.81 / 0.94 / 0.01 31.11 / 0.94 / 0.02 30.49 / 0.93 / 0.02
Fortress NeRF Yes 31.56 / 0.85 / 0.11 29.38 / 0.80 / 0.15 27.05 / 0.78 / 0.17 25.39 / 0.78 / 0.16
EWC Yes 31.56 / 0.85 / 0.11 29.41 / 0.79 / 0.15 25.83 / 0.77 / 0.16 24.40 / 0.78 / 0.15
NeRF Yes 31.75 / 0.86 / 0.11 31.83 / 0.84 / 0.09 30.90 / 0.86 / 0.14 29.81 / 0.85 / 0.11
NeRF-SLAM No 31.08 / 0.82 / 0.12 31.09 / 0.82 / 0.12 29.77 / 0.83 / 0.15 28.53 / 0.82 / 0.14
IL-NeRF No 31.69 / 0.85 / 0.11 31.02 / 0.84 / 0.10 30.33 / 0.84 / 0.11 29.45 / 0.83 / 0.12
Horns NeRF Yes 29.78 / 0.86 / 0.09 27.04 / 0.75 / 0.09 26.04 / 0.74 / 0.12 24.01 / 0.70 / 0.14
EWC Yes 29.78 / 0.86 / 0.09 26.77 / 0.75 / 0.09 27.08 / 0.74 / 0.11 24.68 / 0.69 / 0.14
NeRF Yes 29.86 / 0.89 / 0.07 29.67 / 0.89 / 0.06 29.24 / 0.89 / 0.07 28.87 / 0.87 / 0.08
NeRF-SLAM No 28.86 / 0.83 / 0.10 28.77 / 0.84 / 0.08 28.19 / 0.83 / 0.10 27.56 / 0.82 / 0.12
IL-NeRF No 29.92 / 0.89 / 0.07 29.50 / 0.89 / 0.07 29.01 / 0.89 / 0.07 28.96 / 0.87 / 0.09
Leaves NeRF Yes 25.51 / 0.90 / 0.06 22.12 / 0.79 / 0.13 21.03 / 0.75 / 0.15 20.62 / 0.73 / 0.16
EWC Yes 25.51 / 0.90 / 0.06 21.18 / 0.79 / 0.13 21.96 / 0.74 / 0.15 20.39 / 0.72 / 0.16
NeRF Yes 25.58 / 0.90 / 0.06 24.81 / 0.89 / 0.07 24.23 / 0.87 / 0.07 23.84 / 0.86 / 0.08
NeRF-SLAM No 24.83 / 0.87 / 0.08 23.93 / 0.86 / 0.10 23.27 / 0.83 / 0.12 22.86 / 0.82 / 0.13
IL-NeRF No 25.62 / 0.90 / 0.06 24.74 / 0.88 / 0.07 24.26 / 0.87 / 0.07 23.88 / 0.86 / 0.08
Orchids NeRF Yes 25.68 / 0.85 / 0.08 22.78 / 0.77 / 0.10 21.43 / 0.74 / 0.12 20.77 / 0.71 / 0.14
EWC Yes 25.68 / 0.85 / 0.08 22.18 / 0.77/ 0.10 21.41 / 0.73 / 0.12 20.23 / 0.70 / 0.14
NeRF Yes 25.88 / 0.87 / 0.08 24.37 / 0.84 / 0.10 23.69 / 0.80 / 0.12 23.59 / 0.79 / 0.12
NeRF-SLAM No 24.32 / 0.81 / 0.09 23.94 / 0.82/ 0.10 23.26 / 0.78 / 0.12 22.76 / 0.76 / 0.13
IL-NeRF No 25.78 / 0.86 / 0.08 24.17 / 0.82 / 0.10 23.89 / 0.80 / 0.12 23.67 / 0.77 / 0.13
Room NeRF Yes 32.35 / 0.92 / 0.09 29.58 / 0.89 / 0.10 28.82 / 0.89 / 0.11 30.59 / 0.92 / 0.07
EWC Yes 31.14 / 0.92 / 0.09 28.46 / 0.88 / 0.10 28.78 / 0.89 / 0.11 30.29 / 0.91 / 0.08
NeRF Yes 32.26 / 0.92 / 0.09 31.52 / 0.92 / 0.08 31.21 / 0.92 / 0.08 31.98 / 0.92 / 0.08
NeRF-SLAM No 30.84 / 0.89 / 0.10 31.02 / 0.91 / 0.09 30.76 / 0.91 / 0.10 31.63 / 0.92 / 0.07
IL-NeRF No 32.50 / 0.92 / 0.08 31.76 / 0.92 / 0.08 31.58 / 0.92 / 0.08 31.88 / 0.92 / 0.07
Trex NeRF Yes 28.50 / 0.90 / 0.07 27.29 / 0.89 / 0.07 26.40 / 0.88 / 0.07 26.24 / 0.86 / 0.09
EWC Yes 28.50 / 0.90 / 0.07 26.42 / 0.88 / 0.07 25.97 / 0.89 / 0.07 25.18 / 0.91 / 0.09
NeRF Yes 28.74 / 0.91 / 0.06 28.32 / 0.90 / 0.06 28.11 / 0.90 / 0.07 27.98 / 0.90 / 0.06
NeRF-SLAM No 27.26 / 0.90 / 0.07 28.05 / 0.89 / 0.06 27.60 / 0.89 / 0.07 27.37 / 0.88 / 0.08
IL-NeRF No 28.70 / 0.90 / 0.07 28.14 / 0.90 / 0.06 27.90 / 0.90 / 0.07 27.81 / 0.90 / 0.06
Table 7: Performance comparison on the NeRF-real360 dataset with the baselines: PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, NeRF-SLAM and achieves comparable results with NeRF.
Scene Method Pose PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Pinecone NeRF Yes 26.22 / 0.84 / 0.16 22.90 / 0.64 / 0.24 21.15 / 0.58 / 0.33 18.94 / 0.49 / 0.41
EWC Yes 26.22 / 0.84 / 0.16 22.70 / 0.63 / 0.24 21.42 / 0.57 / 0.32 18.81 / 0.48 / 0.41
NeRF Yes 26.88 / 0.89 / 0.12 24.23 / 0.79 / 0.16 24.03 / 0.73 / 0.19 23.18 / 0.74 / 0.21
NeRF-SLAM No 25.63 / 0.81 / 0.18 24.09 / 0.73 / 0.22 23.01 / 0.68 / 0.29 21.79 / 0.65 / 0.34
IL-NeRF No 26.31 / 0.87 / 0.10 24.56 / 0.78 / 0.17 23.78 / 0.74 / 0.20 22.93 / 0.72 / 0.23
Vasedeck NeRF Yes 29.03 / 0.85 / 0.07 23.99 / 0.70 / 0.26 22.73 / 0.69 / 0.24 21.57 / 0.64 / 0.31
EWC Yes 29.03 / 0.85 / 0.07 24.36 / 0.69 / 0.25 22.25 / 0.68 / 0.24 20.52 / 0.64 / 0.30
NeRF Yes 29.27 / 0.86 / 0.07 27.93 / 0.85 / 0.12 26.03 / 0.74 / 0.16 26.18 / 0.74 / 0.18
NeRF-SLAM No 27.98 / 0.79 / 0.11 26.41 / 0.77 / 0.21 25.10 / 0.72 / 0.21 24.62 / 0.71 / 0.26
IL-NeRF No 29.48 / 0.86 / 0.07 27.38 / 0.82 / 0.10 26.11 / 0.76 / 0.14 26.15 / 0.75 / 0.17

Furthermore, in the main text we compare the original NeRF and IL-NeRF only on the ‘Kitchen’ and ‘Counter’ scenes of the Mip-NeRF360 dataset. Here, we give more visualization results of the original NeRF and IL-NeRF on all scenes of the three datasets.

Figures 9 to 16 provide additional insight by presenting a qualitative comparison of the original NeRF and IL-NeRF. Specifically, we show the rendering results on the first task after each incremental training step. It is evident that the original NeRF suffers from catastrophic forgetting, producing images with significant distortions such as noise and blur, whereas IL-NeRF generates highly realistic images with quality comparable to the ground truth. This observation indicates that IL-NeRF is highly effective in mitigating the forgetting problem and addressing the coordinate-shifting issue.

Video Demo. To further demonstrate the performance of IL-NeRF, we provide a video demonstration in the supplementary material, named ‘sm_video.mp4’. In this video, we show rendered images from all baselines and IL-NeRF.

Refer to caption
Figure 9: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Counter’ and ’Bonsai’ in the Mip-NeRF360 dataset.
Refer to caption
Figure 10: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Garden’ and ’Bicycle’ in the Mip-NeRF360 dataset.
Refer to caption
Figure 11: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Room’ and ’Stump’ in the Mip-NeRF360 dataset.
Refer to caption
Figure 12: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Fern’ and ’Flower’ in the LLFF dataset.
Refer to caption
Figure 13: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Fortress’ and ’Horns’ in the LLFF dataset.
Refer to caption
Figure 14: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Leaves’ and ’Orchids’ in the LLFF dataset.
Refer to caption
Figure 15: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Room’ and ’Trex’ in the LLFF dataset.
Refer to caption
Figure 16: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Pinecone’ and ’Vasedeck’ in the NeRF-real360 dataset.

9 Ablation Study

In the main paper, we analyze the effectiveness of the camera coordinate alignment and the pose refinement added to IL-NeRF on the ’Garden’ scene. Here, we give more numerical results for the ablation study.

Tables 8 to 10 show the performance comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on all three datasets (the corresponding results of the original NeRF and NeRF are reported in Tables 5 to 7). As we can see, IL-NeRF outperforms both ablated variants. The results reveal a significant decline in performance on all test data without the transfer matrices (i.e., IL-NeRF w/o TM). This decline can be attributed to estimating camera poses separately for the two tasks, which yields poses in two independent coordinate systems and misleads the model during training. The results of IL-NeRF w/o PR indicate that IL-NeRF with pose refinement outperforms IL-NeRF without it, since the camera poses aligned by the transfer matrices may still contain noise and inaccuracies.

Figure 17 shows the camera pose trajectories of GT and IL-NeRF. We treat the COLMAP estimates from all training images as ground-truth (GT) camera poses. As we can see, IL-NeRF recovers accurate camera poses thanks to the incremental camera pose alignment.

Table 8: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on the Mip-NeRF360 dataset. IL-NeRF outperforms both ablated variants.
Scene Method PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Bicycle w/o TM 22.90 / 0.62 / 0.33 13.64 / 0.32 / 0.66 11.86 / 0.29 / 0.79 11.14 / 0.26 / 0.83
w/o PR 22.76 / 0.61 / 0.33 18.67 / 0.41 / 0.52 20.74 / 0.46 / 0.50 21.06 / 0.46 / 0.52
IL-NeRF 22.90 / 0.62 / 0.33 19.84 / 0.48 / 0.44 22.05 / 0.54 / 0.40 22.34 / 0.55 / 0.40
Bonsai w/o TM 33.54 / 0.93 / 0.07 21.13 / 0.59 / 0.18 19.48 / 0.46 / 0.31 18.40 / 0.40 / 0.37
w/o PR 33.30 / 0.93 / 0.07 28.30 / 0.84 / 0.18 27.80 / 0.81 / 0.21 27.33 / 0.80 / 0.22
IL-NeRF 33.54 / 0.93 / 0.07 30.73 / 0.89 / 0.12 29.77 / 0.86 / 0.16 28.96 / 0.85 / 0.18
Counter w/o TM 32.13 / 0.91 / 0.07 23.91 / 0.58 / 0.18 19.02 / 0.45 / 0.29 13.87 / 0.39 / 0.35
w/o PR 32.12 / 0.91 / 0.07 27.89 / 0.83 / 0.13 27.05 / 0.81 / 0.16 26.47 / 0.79 / 0.17
IL-NeRF 32.13 / 0.91 / 0.07 29.63 / 0.87 / 0.12 28.56 / 0.85 / 0.15 27.82 / 0.83 / 0.17
Garden w/o TM 24.73 / 0.73 / 0.19 17.05 / 0.46 / 0.33 13.37 / 0.37 / 0.45 15.76 / 0.31 / 0.47
w/o PR 24.70 / 0.71 / 0.20 23.34 / 0.67 / 0.20 23.17 / 0.69 / 0.21 22.42 / 0.67 / 0.23
IL-NeRF 24.73 / 0.73 / 0.19 24.80 / 0.70 / 0.22 24.86 / 0.69 / 0.23 24.82 / 0.67 / 0.23
Kitchen w/o TM 31.27 / 0.92 / 0.07 21.08 / 0.59 / 0.15 16.05 / 0.46 / 0.23 14.63 / 0.40 / 0.26
w/o PR 31.17 / 0.91 / 0.08 28.54 / 0.84 / 0.12 27.86 / 0.82 / 0.15 27.48 / 0.78 / 0.18
IL-NeRF 31.27 / 0.92 / 0.07 30.66 / 0.89 / 0.10 29.84 / 0.87 / 0.12 29.34 / 0.86 / 0.13
Room w/o TM 36.04 / 0.96 / 0.04 27.45 / 0.62 / 0.06 17.40 / 0.49 / 0.13 19.98 / 0.43 / 0.18
w/o PR 35.98 / 0.96 / 0.04 32.67 / 0.92 / 0.07 31.12 / 0.86 / 0.17 30.35 / 0.84 / 0.13
IL-NeRF 36.04 / 0.96 / 0.04 34.02 / 0.94 / 0.04 32.35 / 0.92 / 0.07 31.45 / 0.91 / 0.09
Stump w/o TM 25.96 / 0.77 / 0.28 20.78 / 0.44 / 0.48 16.71 / 0.32 / 0.73 15.81 / 0.27 / 0.80
w/o PR 25.62 / 0.77 / 0.28 24.25 / 0.58 / 0.37 23.77 / 0.53 / 0.43 22.43 / 0.50 / 0.46
IL-NeRF 25.96 / 0.77 / 0.28 25.75 / 0.66 / 0.32 25.09 / 0.60 / 0.37 24.89 / 0.58 / 0.39
Table 9: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on the LLFF dataset. IL-NeRF outperforms both ablated variants.
Scene Method PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Fern w/o TM 29.30 / 0.90 / 0.06 18.34 / 0.58 / 0.15 13.79 / 0.43 / 0.25 16.04 / 0.37 / 0.31
w/o PR 29.19 / 0.90 / 0.06 25.37 / 0.82 / 0.11 24.62 / 0.77 / 0.19 24.77 / 0.78 / 0.20
IL-NeRF 29.30 / 0.90 / 0.06 26.68 / 0.87 / 0.10 25.63 / 0.81 / 0.13 25.26 / 0.80 / 0.15
Flower w/o TM 34.22 / 0.96 / 0.01 25.67 / 0.62 / 0.15 20.72 / 0.50 / 0.39 14.37 / 0.43 / 0.44
w/o PR 34.12 / 0.96 / 0.01 30.82 / 0.92 / 0.02 30.69 / 0.91 / 0.03 28.41 / 0.89 / 0.03
IL-NeRF 34.22 / 0.96 / 0.01 31.81 / 0.94 / 0.01 31.11 / 0.94 / 0.02 30.49 / 0.93 / 0.02
Fortress w/o TM 31.6 / 0.85 / 0.11 21.33 / 0.56 / 0.15 20.20 / 0.45 / 0.21 16.71 / 0.39 / 0.34
w/o PR 31.56 / 0.85 / 0.11 30.62 / 0.82 / 0.14 29.75 / 0.79 / 0.15 28.78 / 0.76 / 0.17
IL-NeRF 31.69 / 0.85 / 0.11 31.02 / 0.84 / 0.10 30.33 / 0.84 / 0.11 29.45 / 0.83 / 0.12
Horns w/o TM 31.6 / 0.85 / 0.11 21.33 / 0.56 / 0.15 20.20 / 0.45 / 0.21 16.71 / 0.39 / 0.34
w/o PR 29.92 / 0.89 / 0.07 20.28 / 0.59 / 0.20 15.61 / 0.47 / 0.33 14.44 / 0.41 / 0.48
IL-NeRF 29.92 / 0.89 / 0.07 29.50 / 0.89 / 0.07 29.01 / 0.89 / 0.07 28.96 / 0.87 / 0.09
Leaves w/o TM 25.62 / 0.90 / 0.06 17.01 / 0.58 / 0.20 13.05 / 0.46 / 0.33 11.17 / 0.40 / 0.46
w/o PR 25.51 / 0.90 / 0.06 24.20 / 0.87 / 0.06 23.77 / 0.84 / 0.08 22.69 / 0.83 / 0.09
IL-NeRF 25.62 / 0.90 / 0.06 24.74 / 0.88 / 0.07 24.26 / 0.87 / 0.07 23.88 / 0.86 / 0.08
Orchids w/o TM 25.78 / 0.86 / 0.08 19.50 / 0.54 / 0.15 15.85 / 0.42 / 0.23 12.03 / 0.36 / 0.46
w/o PR 25.68 / 0.85 / 0.08 23.08 / 0.79/ 0.10 22.96 / 0.75 / 0.12 22.23 / 0.72 / 0.13
IL-NeRF 25.78 / 0.86 / 0.08 24.17 / 0.82 / 0.10 23.89 / 0.80 / 0.12 23.67 / 0.77 / 0.13
Room w/o TM 32.50 / 0.92 / 0.08 25.63 / 0.61 / 0.12 20.99 / 0.49 / 0.25 16.25 / 0.43 / 0.34
w/o PR 31.14 / 0.92 / 0.09 30.37 / 0.90 / 0.09 30.56 / 0.91 / 0.09 30.58 / 0.90 / 0.08
IL-NeRF 32.50 / 0.92 / 0.08 31.76 / 0.92 / 0.08 31.58 / 0.92 / 0.08 31.88 / 0.92 / 0.07
Trex w/o TM 28.70 / 0.90 / 0.07 19.35 / 0.60 / 0.19 15.01 / 0.48 / 0.33 13.86 / 0.42 / 0.42
w/o PR 28.50 / 0.90 / 0.07 27.01 / 0.87 / 0.09 26.26 / 0.88 / 0.07 26.54 / 0.88 / 0.08
IL-NeRF 28.70 / 0.90 / 0.07 28.14 / 0.90 / 0.06 27.90 / 0.90 / 0.07 27.81 / 0.90 / 0.06
Table 10: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on the NeRF-real360 dataset. IL-NeRF outperforms both ablated variants.
Scene Method PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Pinecone w/o TM 26.31 / 0.87 / 0.10 19.82 / 0.52 / 0.25 15.84 / 0.39 / 0.39 14.56 / 0.34 / 0.47
w/o PR 26.22 / 0.84 / 0.16 23.87 / 0.61 / 0.22 22.24 / 0.66 / 0.28 21.13 / 0.66 / 0.32
IL-NeRF 26.31 / 0.87 / 0.10 24.56 / 0.78 / 0.17 23.78 / 0.74 / 0.20 22.93 / 0.72 / 0.23
Vasedeck w/o TM 29.48 / 0.86 / 0.07 22.09 / 0.54 / 0.25 16.05 / 0.40 / 0.35 13.04 / 0.35 / 0.37
w/o PR 29.03 / 0.85 / 0.07 25.30 / 0.74 / 0.18 24.80 / 0.68 / 0.21 24.33 / 0.63 / 0.23
IL-NeRF 29.48 / 0.86 / 0.07 27.38 / 0.82 / 0.10 26.11 / 0.76 / 0.14 26.15 / 0.75 / 0.17
Refer to caption
(a) Bicycle
Refer to caption
(b) Counter
Refer to caption
(c) Garden
Refer to caption
(d) Bonsai
Refer to caption
(e) Room
Refer to caption
(f) Stump
Refer to caption
(g) Fern
Refer to caption
(h) Flower
Refer to caption
(i) Fortress
Refer to caption
(j) Leaves
Refer to caption
(k) Horns
Refer to caption
(l) Meeting room
Refer to caption
(m) Orchids
Refer to caption
(n) Trex
Refer to caption
(o) Pinecone
Figure 17: Camera pose estimation comparison. GT means the camera poses estimated by COLMAP from all the training images. IL-NeRF recovers accurate camera poses thanks to the incremental camera pose alignment.

10 Limitation

For large-scale scenes whose training views have limited overlap, the performance of IL-NeRF may be suboptimal: the limited overlap between views can introduce significant errors into, or even prevent the computation of, the transfer matrices during camera coordinate alignment.