
IL-NeRF: Incremental Learning for Neural Radiance Fields with Camera Pose Alignment

Letian Zhang Department of Computer Science, Middle Tennessee State University Ming Li Center for Research in Computer Vision, University of Central Florida Chen Chen Center for Research in Computer Vision, University of Central Florida Jie Xu Department of Electrical and Computer Engineering, University of Miami
Abstract

Neural radiance fields (NeRF) is a promising approach for generating photorealistic images and representing complex scenes. However, when processing data sequentially, it can suffer from catastrophic forgetting, where previous data is easily forgotten after training with new data. Existing incremental learning methods using knowledge distillation assume that continuous data chunks contain both 2D images and corresponding camera pose parameters, pre-estimated from the complete dataset. This poses a paradox as the necessary camera pose must be estimated from the entire dataset, even though the data arrives sequentially and future chunks are inaccessible. In contrast, we focus on a practical scenario where camera poses are unknown. We propose IL-NeRF, a novel framework for incremental NeRF training, to address this challenge. IL-NeRF’s key idea lies in selecting a set of past camera poses as references to initialize and align the camera poses of incoming image data. This is followed by a joint optimization of camera poses and replay-based NeRF distillation. Our experiments on real-world indoor and outdoor scenes show that IL-NeRF handles incremental NeRF training and outperforms the baselines by up to 54.04\% in rendering quality. The project page is https://ilnerf.github.io/.

1 Introduction

Neural Radiance Fields (NeRF) [26] has recently shown great promise in producing photorealistic images from sparse image sets by encoding a 3D scene with a neural network that maps the location of 3D points to color and volume density. However, to achieve this level of performance, NeRF typically requires access to all training data at once in order to estimate the camera pose based on the entire dataset [7, 26]. In practical applications such as automotive and remote sensing, data is acquired sequentially, necessitating an immediately available updated 3D scene representation. Moreover, scenarios arise where a user acquires scans of a scene to train NeRF, only to find that the training yielded less effective results than expected. Consequently, the user rescans the scene with new data to enhance the fidelity of the images rendered by NeRF. In these instances, the scene representation must be trained in an incremental training environment, where the model can only access a limited number of views at each training stage, while still undertaking the task of reconstructing the entire scene.

Refer to caption
Figure 1: Current works require accurate camera poses pre-estimated from the whole image data. Our IL-NeRF can incrementally learn the 3D scene without camera poses. The output of IL-NeRF consists of both the aligned camera poses and the NeRF model.

In the context of incremental NeRF training, existing works [12, 7, 34, 5] generally operate under the assumption that data is segmented into multiple sequential chunks, with access limited to the current chunk while previously processed data is discarded. This incremental setting presents a notable challenge for NeRF: the model must update its knowledge with new data without erasing previously learned information, i.e., without suffering from catastrophic forgetting [53]. To address this challenge, prior works [12, 7, 34, 5] have investigated the implementation of incremental learning strategies for NeRF training, incorporating a knowledge distillation technique [53] to mitigate catastrophic forgetting. Specifically, before training the NeRF model with the current data chunk, pseudo-RGB values for the scene are generated using the previously trained NeRF model. These RGB values are subsequently utilized to train the NeRF model with the current data chunk, and the process is iteratively repeated, with the prior NeRF model acting as the supervisory teacher for subsequent data chunks. This enables the NeRF model to learn from new data while retaining knowledge from previously discarded data.

Motivation. While existing works propose effective incremental learning methods through knowledge distillation, these approaches are founded on the assumption that continuous data chunks comprise not only 2D images but also corresponding camera pose parameters, pre-estimated from the complete dataset. This assumption poses a paradox as the required camera pose must be estimated from the entire dataset, yet the data is intended to arrive sequentially, relatively independently, with future data chunks remaining unknown and inaccessible while previous chunks are discarded. In contrast, as shown in Figure 1, our work addresses a more practical scenario where pre-estimated camera poses are unavailable for each training data chunk, necessitating the consideration of camera pose estimation and the alignment of its coordinate system.

Challenge. Since the previous training data have been discarded, the incoming training data cannot simply be used directly for camera pose estimation: camera poses estimated in isolation will not lie in the same coordinate system as the previous camera poses, which leads to NeRF training misalignment and failure to render the 3D scene. Therefore, accurately estimating the camera poses of the sequentially arriving data within the same coordinate system becomes a crucial issue for incremental NeRF training.

Contribution. To deal with the above challenge, we present a novel framework for incremental NeRF training, named IL-NeRF, which can incrementally estimate the incoming data’s camera poses and effectively tackle the issue of catastrophic forgetting. (1) We propose an incremental camera pose alignment module that selects a suitable set of camera poses from previous estimates. These chosen poses help render prior training images from NeRF and aid in the joint camera pose estimation process. They also act as a reference coordinate system, facilitating the alignment of newly arrived and previous camera poses. (2) To ensure the appropriate selection of camera poses from the previously estimated ones, we transform this selection task into a graph-based reward-collection optimization problem. We then introduce a practical greedy algorithm to effectively solve this optimization problem. (3) To align camera pose coordinates, we use selected camera poses as a reference to derive a transfer matrix, transforming the current camera poses into the previous coordinate system. (4) We utilize a joint optimization method for camera poses and replay-based NeRF distillation, mitigating catastrophic forgetting and refining the accuracy of the camera poses.

The experimental results on three diverse datasets show that our proposed framework can improve PSNR, SSIM, and LPIPS by up to 54.04\% compared to the original NeRF, significantly mitigating the negative impact of catastrophic forgetting in NeRF’s incremental learning process. Moreover, our framework can effectively estimate and align the camera pose parameters in a consistent coordinate system.

2 Related Works

NeRF. NeRF [26] employs volume rendering to depict a continuous scene and achieve high-quality view synthesis. Several subsequent works have been introduced based on the success of NeRF, aiming to improve view synthesis efficiency and quality, including faster training and rendering [28, 6, 10], deformable or dynamic scene synthesis [32, 35, 36], editable view synthesis [30, 49], light changes [23, 27], surface enhancements [31, 45, 51], depth priors [8, 37, 48], etc. However, most of these methods presume access to all training data and require pre-estimated camera pose parameters from the entire dataset. In this study, we tackle a more practical scenario where NeRF learns the scene incrementally with a sequential data stream, without pre-estimated camera pose parameters.
Incremental NeRF Learning. Incremental learning, constrained by limited data availability during each training iteration, often causes catastrophic forgetting [9]. Methods to mitigate this issue include parameter isolation [13, 22, 21], replay [20, 38, 41], and regularization [1, 50, 2]. There is limited research on combining NeRF with incremental learning. Existing studies [12, 7, 34, 5] use knowledge distillation [53] to mitigate catastrophic forgetting by accessing past training data, specifically RGB values, from a pre-trained network. The retrieved data is merged with new incoming data to train NeRF. In [12], a regularization-based filter is used to select relevant information from randomly sampled camera views. In [7], a small network is trained to generate previously seen rays that are directed toward the scene. The authors in [34] employ the same replay-based method in [7] but they substitute NeRF with Instant-NGP [28] to expedite the training process. In [5], a prioritized replay buffer is introduced to keep the images and their camera poses with the lowest historical rendering qualities for continual learning. However, these methods require pre-estimated camera poses from the entire training data for each coming data chunk. In this work, we consider a more realistic scenario where pre-estimated camera poses are unavailable for each training data chunk, thus requiring incremental camera pose alignment.
NeRF With Pose Refinement. Pose refinement is widely used in NeRF training. iNeRF [52] refines camera poses for novel view images using a reconstructed NeRF model. NeRFmm [47] jointly optimizes both camera intrinsic and extrinsic parameters during NeRF training, and BARF [19] proposes a coarse-to-fine positional encoding strategy for camera poses and NeRF joint optimization. SC-NeRF [15] further considers camera distortion refinement and employs a geometric loss to regularize rays. In this paper, IL-NeRF uses the pose refinement in SC-NeRF [15]. Note that IL-NeRF is not limited to only the pose refinement in SC-NeRF [15]; any other well-designed pose refinement methods can also be transferred to IL-NeRF.

3 Incremental NeRF Training Preliminary

NeRF. NeRF aims to learn a 3D scene with a simple neural network, e.g., MLPs, that takes a 3D location \textbf{x} and view direction \textbf{r}_{d} as input and produces RGB color \textbf{c} and volume density \sigma as output. For each ray \textbf{r}=(\textbf{r}_{o},\textbf{r}_{d}) emitted from the camera origin \textbf{r}_{o} in direction \textbf{r}_{d}, NeRF samples M 3D points along the ray, \textbf{x}_{i}=\textbf{r}_{o}+z_{i}\textbf{r}_{d}, where z_{i} is the distance from the camera to the sample point \textbf{x}_{i}. The pixel color C can then be integrated by volumetric rendering as follows:

C(\textbf{r})=\sum_{i=1}^{M}\alpha_{i}(1-\delta_{i})\textbf{c}(\textbf{x}_{i})  (1)

where \delta_{i}=\exp(-(z_{i}-z_{i-1})\sigma(\textbf{x}_{i})) represents the transmittance of the ray segment between sample points \textbf{x}_{i-1} and \textbf{x}_{i}, and \alpha_{i}=\prod_{j=1}^{i-1}\delta_{j} is the ray attenuation from the origin \textbf{r}_{o} to the sample point \textbf{x}_{i}. Since the whole pipeline is differentiable, NeRF can be trained by minimizing the photometric error between the rendered views and the ground-truth views.

\mathcal{L}=\sum_{\textbf{r}\in R}\parallel C^{*}(\textbf{r})-C(\textbf{r})\parallel^{2}_{2}  (2)
\Theta^{*}=\arg\min_{\Theta}\mathcal{L}(C\mid C^{*},\mathcal{P})  (3)

where R represents a group of rays from one or more camera views, obtained from the camera pose parameters \mathcal{P}. C^{*} is the ground-truth pixel color, and \Theta denotes the parameters of the network. For more details of NeRF, we refer the readers to [26].
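To make the rendering model concrete, the following is a minimal NumPy sketch of Eq. (1) and the photometric loss in Eq. (2), written in the paper's per-segment transmittance notation; the function and variable names (render_ray, photometric_loss, z, sigma, rgb) are ours, and the network \mathcal{F} is assumed to have already produced the per-sample densities and colors.

```python
import numpy as np

def render_ray(z, sigma, rgb):
    """Composite one ray following Eq. (1).

    z:     (M,) sample depths along the ray
    sigma: (M,) predicted densities at the samples
    rgb:   (M, 3) predicted colors at the samples
    """
    dists = np.diff(z, prepend=z[0])                          # z_i - z_{i-1}; first segment has zero length
    delta = np.exp(-dists * sigma)                            # per-segment transmittance delta_i
    alpha = np.cumprod(np.concatenate(([1.0], delta[:-1])))   # attenuation alpha_i = prod_{j<i} delta_j
    weights = alpha * (1.0 - delta)                           # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)               # pixel color C(r)

def photometric_loss(pred_rgb, gt_rgb):
    """Squared L2 photometric error of Eq. (2), summed over rays."""
    return float(np.sum((pred_rgb - gt_rgb) ** 2))
```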

Incremental NeRF Training. To achieve impressive performance, NeRF assumes access to all training data covering all views of a scene at once. However, the entire training data might not be available simultaneously in practical applications due to physical or hardware limitations, e.g., edge devices with a limited amount of memory and data storage. As a result, data needs to be processed sequentially; in other words, NeRF must incrementally learn the scene from new training data without revisiting the old data. Concretely, we consider a time-slotted system wherein each time slot t\in\{0,\cdots,T-1\}, a chunk of image data G^{t} consists of N images, i.e., G^{t}=\{I_{n}^{t},n\in\{0,\cdots,N-1\}\}. Generally, only the latest image data chunk G^{t} is available while previous image data chunks G^{0:t-1} are inaccessible. The objective of incremental NeRF is to minimize the reconstruction loss across all provided chunks of image data in \{G^{0},\dots,G^{T-1}\}. Note that unlike the existing works [7, 12, 34, 5], in our work each incoming chunk contains only the images, without corresponding camera poses. Thus, the main aim of incremental learning for NeRF in this paper is to enable the network to learn and adapt continually by ensuring that the camera poses of new image data are estimated and aligned in a consistent coordinate system, while preventing catastrophic forgetting across all previously seen image data.

Refer to caption
Figure 2: Overview of the IL-NeRF pipeline. First, the network \Theta_{t-1}^{*} from the previous NeRF is frozen. Then, incremental camera pose alignment is employed to estimate the current camera poses \mathcal{P}^{c} through (a) finding optimal camera poses among the previous camera poses; (b) estimating the camera poses for the incoming image data and the images rendered from the selected camera poses; (c) aligning the current camera poses into the previous camera coordinate system. Finally, the network \Theta_{t}, the current estimated poses \mathcal{P}^{c}, and the previous poses \mathcal{P}^{p} are jointly trained on both the current image data rays C^{c} and the distilled past rays C^{p} simultaneously.

4 IL-NeRF

In this section, we introduce our proposed framework, IL-NeRF, which mitigates catastrophic forgetting via replay-based knowledge distillation retrieved from the NeRF model itself (Section 4.1) and utilizes incremental camera pose alignment to estimate and align the camera poses of incoming training image data within the same coordinate system as the previous camera poses (Section 4.2). The overall pipeline is illustrated in Figure 2.

4.1 Replay-Based NeRF Distillation

The problem of catastrophic forgetting occurs when a network, trained only with the currently available data at each time step, struggles to remember previously learned knowledge, resulting in low-quality image synthesis for previously seen views. To address this, we adopt a replay-based NeRF distillation strategy in NeRF training. At each time slot t, we copy and freeze the parameters of the network as a teacher network before training on the incoming image data chunk t. Since the network has been trained on the t-1 previous image data chunks, we use \Theta^{*}_{t-1} to denote the frozen parameters of the teacher network. During each training iteration for image data chunk t, we use the past camera poses \mathcal{P}^{p} to obtain the pixel colors of the past training rays from the teacher network \Theta^{*}_{t-1} as follows:

C^{p}=\mathcal{F}(\mathcal{P}^{p},\Theta^{*}_{t-1})  (4)

By treating C^{p} as pseudo ground truth, we optimize \Theta_{t} by learning new knowledge from the new incoming image data and old knowledge from the teacher network, thus mitigating the forgetting problem. The new loss function is defined as follows:

\mathcal{L}=\sum_{\textbf{r}^{c}}\parallel\hat{C}^{c}-C^{c}\parallel^{2}_{2}+\sum_{\textbf{r}^{p}}\parallel\hat{C}^{p}-C^{p}\parallel^{2}_{2}  (5)
\Theta_{t}^{*}=\arg\min_{\Theta_{t}}\mathcal{L}(\hat{C}^{c},\hat{C}^{p}\mid C^{c},\mathcal{P}^{c},\Theta_{t-1}^{*},\mathcal{P}^{p})  (6)

where \textbf{r}^{c} and \textbf{r}^{p} are the training rays obtained from the current camera poses \mathcal{P}^{c} and the past camera poses \mathcal{P}^{p}. \hat{C}^{c} and \hat{C}^{p} are the colors estimated by the network being trained, given the concatenated camera poses [\mathcal{P}^{c},\mathcal{P}^{p}]:

[\hat{C}^{c},\hat{C}^{p}]=\mathcal{F}([\mathcal{P}^{c},\mathcal{P}^{p}],\Theta_{t})  (7)

The existing works [7, 12, 34] assume that the camera poses are provided with each incoming image data chunk and are not saved on the device when the next chunk of training data arrives. This requires additional effort to train an auxiliary neural network to remember or filter the previously trained rays. However, this paper considers a different scenario. Specifically, the incoming image data chunk includes only the training images and not the camera poses; hence the camera poses need to be obtained using camera calibration techniques and saved on the device. It is worth noting that only the previous camera poses are stored, not the previous training images, and the camera poses require only a small amount of storage. In the following section, we discuss how to estimate and align the previous and new camera poses into the same coordinate system.
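As a sketch of how one replay-distillation step of Eqs. (4)-(7) could be implemented, assuming student and teacher are generic callables that render pixel colors from poses or rays (stand-ins for \mathcal{F}(\cdot,\Theta), not the paper's actual implementation):

```python
import torch

def distillation_step(student, teacher, optimizer, rays_current, gt_current, poses_past):
    """One replay-based distillation step.

    student, teacher: callables that render pixel colors given rays/poses
    rays_current:     rays sampled from the current chunk's estimated poses P^c
    gt_current:       ground-truth colors C^c for those rays
    poses_past:       stored past camera poses P^p (the past images are discarded)
    """
    with torch.no_grad():
        pseudo_gt = teacher(poses_past)          # C^p rendered by the frozen teacher, Eq. (4)

    pred_current = student(rays_current)         # \hat{C}^c for the current chunk
    pred_past = student(poses_past)              # \hat{C}^p re-rendered by the network being trained

    # Eq. (5): photometric loss on new data plus distillation loss on past rays.
    loss = ((pred_current - gt_current) ** 2).sum() + ((pred_past - pseudo_gt) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```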

4.2 Incremental Camera Pose Alignment

Let \mathcal{P}^{p} represent the previously aligned camera poses, which include the intrinsic camera matrix K and the extrinsic camera matrices \{\pi^{p},\psi^{p}\}, where \pi^{p} is the set of rotation matrices and \psi^{p} is the set of translations. When it comes to estimating camera poses from the incoming image data, it is not sufficient to estimate them independently: poses estimated in isolation do not lie in the coordinate system of the previous camera poses, leading to a significant coordinate misalignment issue. To this end, we introduce an incremental camera pose alignment method that leverages the trained NeRF. We start by choosing D camera poses from prior instances based on low training losses and comprehensive coverage. These are then combined with the incoming images to enhance camera pose estimation.

Finding Previous Optimal Camera Poses. Our approach is to formulate a reward-collection optimization problem on a graph. In this graph, the nodes represent camera positions (i.e., the translations in the camera poses) with each node assigned a reward corresponding to the negative value of the preceding training loss. The edges represent Euclidean distances between each camera pair’s positions. The goal is to find a path that collects as much reward as possible, subject to constraints on the total number of visited nodes and camera view coverage. Concretely, the objective optimization problem can be formulated as

\max\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}R_{k};~~~~s.t.~\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}=D;  (8)
~~~~S(K)\geq S_{th},~K=\{k|x_{k}=1,k=1,\dots,|\mathcal{P}^{p}|\};  (9)
~~~~E(x_{k})\leq 1,~\forall k\in\{1,\dots,|\mathcal{P}^{p}|\};  (10)

where x_{k} is the binary decision variable: x_{k}=1 if node k is visited and x_{k}=0 otherwise. S(K) is the shortest path that connects all the selected nodes, and E(x_{k}) is the number of incoming edges of each selected node. The first constraint (8) ensures that only D previous camera poses are selected. The second constraint (9) requires the view coverage of the selected cameras to be larger than a threshold, because a large field-of-view coverage of the selected cameras improves the accuracy of camera pose estimation. The third constraint (10) guarantees that every node has at most one incoming edge; in other words, every node is visited at most once. Consequently, this reward-collection optimization problem can be viewed as a hybrid of the Knapsack Problem and the Travelling Salesperson Problem, which is NP-hard. To solve it, we propose a greedy algorithm that reduces computation time by several orders of magnitude compared with the Brute-Force method. Let e_{i,j}=e_{j,i} denote the edge between cameras i and j. At camera node i, we define an approximation edge length for every node connected to node i as \hat{e}_{i,j}=R_{j}+\lambda(\frac{S_{th}}{D}-e_{i,j}), where \lambda is a parameter for adjusting the units of R_{j} and \frac{S_{th}}{D}-e_{i,j}. This approximation edge plays a role similar to a Lagrange multiplier [4] for handling the constraint (9). We introduce an auxiliary starting node into the graph, which connects to all nodes with the same edge length. The greedy algorithm begins at this auxiliary starting node and selects the unvisited node with the maximum approximation edge as the next starting node. This process is repeated until a total of D nodes have been selected. For a comprehensive understanding of the process, we offer a detailed description of the greedy algorithm, along with pseudocode, in the supplementary material; a minimal code sketch is also given below.
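Below is a minimal Python sketch of this greedy selection, assuming the previous cameras are given as 3D centers with per-camera rewards (negative training losses); positions, rewards, and lam are our own names.

```python
import numpy as np

def select_camera_poses(positions, rewards, D, S_th, lam=1.0):
    """Greedy selection of D reference cameras.

    positions: (N, 3) camera centers (translations of the previous poses)
    rewards:   (N,) negative previous training loss per camera
    D:         number of poses to select
    S_th:      coverage threshold on the connecting path length
    lam:       unit-balancing weight lambda
    """
    N = len(positions)
    selected = []
    current = None  # auxiliary start node: all cameras reachable at equal (zero) edge length
    for _ in range(D):
        best_j, best_score = None, -np.inf
        for j in range(N):
            if j in selected:
                continue
            e_ij = 0.0 if current is None else np.linalg.norm(positions[current] - positions[j])
            score = rewards[j] + lam * (S_th / D - e_ij)   # approximation edge \hat{e}_{i,j}
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        current = best_j
    return selected
```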

Camera Pose Alignment. After solving the above reward-collection optimization problem, we select D camera poses with large view coverage from the previously aligned camera poses. The reason for selecting D camera poses instead of using all the camera poses is that camera poses with poorly rendered images provide inaccurate features, leading to large camera estimation errors.

These D camera poses are used to render images from the NeRF model, which are combined with the incoming image data as a group. The camera poses of this group can be estimated using camera calibration methods, such as traditional SfM or SLAM techniques. Let \pi_{d} be the rotation matrix of a selected camera in time slot t-1 and \tilde{\pi}_{d} be the corresponding rotation matrix of the same camera estimated in time slot t. Similarly, let \psi_{d} be the translation of the selected camera in time slot t-1 and \tilde{\psi}_{d} be the corresponding translation estimated in time slot t. We can then compute the transfer matrices of the rotation and translation from the coordinate system in time slot t to the coordinate system in time slot t-1 by:

\triangle\pi=\pi_{d}\tilde{\pi}_{d}^{-1},~~~~\triangle\psi=\psi_{d}-\triangle\pi\tilde{\psi}_{d}  (11)

We can use the transfer matrices to align the camera poses \{\pi,\psi\} of the new images to the original camera coordinate system by:

\tilde{\pi}=\triangle\pi\pi,~~~~\tilde{\psi}=\triangle\pi\psi+\triangle\psi  (12)
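The alignment step can be sketched as follows, under the simplifying assumption that the transfer is a rigid transform computed from a single selected pose pair (in practice one could estimate it jointly from all D pairs, and any SfM scale ambiguity is ignored here); the function names are ours.

```python
import numpy as np

def compute_transfer(R_old, t_old, R_new, t_new):
    """Transfer matrices from the time-t frame to the time-(t-1) frame.

    R_old, t_old: rotation (3x3) and translation (3,) of a selected camera at time t-1
    R_new, t_new: the same camera re-estimated in the new (time-t) coordinate system
    """
    dR = R_old @ R_new.T          # rotation inverse equals transpose
    dt = t_old - dR @ t_new
    return dR, dt

def align_pose(dR, dt, R, t):
    """Map a newly estimated pose {pi, psi} into the previous frame, as in Eq. (12)."""
    return dR @ R, dR @ t + dt
```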

Joint Optimization of Poses and NeRF. However, we have observed that the camera pose alignment still leaves errors in the camera poses. To address this, we draw inspiration from previous works [47, 15] and jointly optimize the camera poses and NeRF during training of our proposed IL-NeRF. Rather than directly optimizing the initial camera pose \tilde{\pi} and \tilde{\psi}, we employ a 6-dimensional vector \Phi=[a,b] to define the trainable parameters of each camera pose, where a\in\mathbb{R}^{3} represents the 3D rotation angles and b\in\mathbb{R}^{3} denotes the increment of the translation. To ensure the orthogonality of the rotation matrix, Rodrigues’ formula \Omega(a) is used to generate the 3D rotation matrix. The final rotation and translation are:

\tilde{\pi}=\Omega(a)\tilde{\pi},~~~~\tilde{\psi}=\tilde{\psi}+b  (13)

By integrating the 6-dimensional vector \Phi into the NeRF training pipeline, the camera parameters and scene representation can be jointly optimized during training. Here, we slightly abuse the notation and use \Phi to represent the group of all cameras’ trainable parameters. Mathematically, the pose refinement can be written as:

\Theta_{t}^{*},\Phi_{t}^{*}=\arg\min_{\Theta_{t},\Phi_{t}}\mathcal{L}(\hat{C}^{c},\hat{C}^{p}\mid C^{c},\mathcal{P}^{c},\Theta_{t-1}^{*},\mathcal{P}^{p})  (14)
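For illustration, a minimal NumPy sketch of the 6-D pose correction \Phi=[a,b] in Eq. (13) is given below; in the actual pipeline \Phi would be a trainable tensor optimized jointly with \Theta_{t} as in Eq. (14), whereas plain arrays and our own function names are used here.

```python
import numpy as np

def rodrigues(a):
    """Rodrigues' formula Omega(a): 3D rotation angles -> orthogonal rotation matrix."""
    theta = np.linalg.norm(a)
    if theta < 1e-8:
        return np.eye(3)
    k = a / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def refine_pose(phi, R_init, t_init):
    """Apply the trainable 6-D correction Phi = [a, b] to an initial aligned pose."""
    a, b = phi[:3], phi[3:]
    return rodrigues(a) @ R_init, t_init + b
```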
Algorithm 1 IL-NeRF Pseudo Code
1: Initialize \mathcal{P}^{p}=\emptyset.
2: for t=0 do
3:     Estimate camera poses \mathcal{P}^{c}_{0} from G^{0}
4:     Jointly train NeRF network \Theta_{0} with camera poses \mathcal{P}^{c}_{0}
5:     \mathcal{P}^{p}=\mathcal{P}^{p}\bigcup\mathcal{P}^{c}_{0}
6: end for
7: for t=1 to T-1 do
8:     Copy and freeze the network as \Theta_{t-1}^{*}
9:     Obtain past training data C^{p} by \mathcal{F}(\mathcal{P}^{p},\Theta^{*}_{t-1})
10:     Align the camera poses \mathcal{P}^{c}_{t} based on \mathcal{P}^{p}
11:     Jointly train NeRF network \Theta_{t} with camera poses \mathcal{P}^{c}_{t} and \mathcal{P}^{p}
12:     \mathcal{P}^{p}=\mathcal{P}^{p}\bigcup\mathcal{P}^{c}_{t}
13: end for

5 Experiment

Dataset. We use three different datasets to evaluate different aspects of our model, namely Mip-NeRF360 [3], LLFF [25], and NeRF-real360 [26]. To simulate the incremental scenarios, we reorganize the camera order so that it moves sequentially, and we select a portion of the dataset where the previous images are not revisited. All the training images of each scene are divided into four training chunks denoted as \mathcal{G}=\{G^{0},G^{1},G^{2},G^{3}\}. We acquire the camera pose parameters using COLMAP [40].

Table 1: Performance comparison with the baselines on PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves comparable results with NeRF-Distill (NeRF incrementally trained with replay-based distillation). Note that NeRF, NeRF-Distill, and EWC require ground-truth camera poses pre-estimated from the entire image data, whereas IL-NeRF estimates and aligns camera poses with the proposed incremental camera pose alignment module. NeRF-Distill can be treated as representative of the existing incremental learning works with accurate camera poses [7, 12, 34, 5].
Data Method Need Pose G^{0} G^{1} G^{2} G^{3} (each cell reports PSNR \Uparrow / SSIM \Uparrow / LPIPS \Downarrow)
Mip-NeRF360 Counter NeRF-Distill Yes 32.17 / 0.92 / 0.07 29.58 / 0.86 / 0.14 28.03 / 0.82 / 0.18 28.28 / 0.85 / 0.18
NeRF Yes 32.12 / 0.91 / 0.07 24.62 / 0.72 / 0.25 21.94 / 0.65 / 0.34 20.30 / 0.62 / 0.37
EWC Yes 32.12 / 0.91 / 0.07 23.83 / 0.72 / 0.25 22.56 / 0.65 / 0.33 21.11 / 0.61 / 0.36
NeRF-SLAM No 31.75 / 0.91 / 0.07 28.30 / 0.83 / 0.21 26.84 / 0.79 / 0.28 25.30 / 0.77 / 0.31
IL-NeRF (Ours) No 32.13 / 0.91 / 0.07 29.63 / 0.87 / 0.12 28.56 / 0.85 / 0.15 27.82 / 0.83 / 0.17
Kitchen NeRF-Distill Yes 31.05 / 0.91 / 0.07 29.72 / 0.88 / 0.13 29.33 / 0.85 / 0.15 29.18 / 0.84 / 0.14
NeRF Yes 31.17 / 0.91 / 0.08 27.01 / 0.75 / 0.25 21.42 / 0.70 / 0.31 23.69 / 0.75 / 0.24
EWC Yes 31.17 / 0.91 / 0.08 26.76 / 0.74 / 0.25 22.09 / 0.70 / 0.31 23.39 / 0.74 / 0.23
NeRF-SLAM No 30.87 / 0.90 / 0.09 29.63 / 0.85 / 0.20 27.65 / 0.81 / 0.24 27.71 / 0.82 / 0.20
IL-NeRF (Ours) No 31.27 / 0.92 / 0.07 30.66 / 0.89 / 0.10 29.84 / 0.87 / 0.12 29.34 / 0.86 / 0.13
LLFF Fortress NeRF-Distill Yes 31.75 / 0.86 / 0.11 31.83 / 0.84 / 0.09 30.90 / 0.86 / 0.14 29.81 / 0.85 / 0.11
NeRF Yes 31.56 / 0.85 / 0.11 29.38 / 0.80 / 0.15 27.05 / 0.78 / 0.17 25.39 / 0.78 / 0.16
EWC Yes 31.56 / 0.85 / 0.11 29.41 / 0.79 / 0.15 25.83 / 0.77 / 0.16 24.40 / 0.78 / 0.15
NeRF-SLAM No 31.08 / 0.82 / 0.12 31.09 / 0.82 / 0.12 29.77 / 0.83 / 0.15 28.53 / 0.82 / 0.14
IL-NeRF (Ours) No 31.69 / 0.85 / 0.11 31.02 / 0.84 / 0.10 30.33 / 0.84 / 0.11 29.45 / 0.83 / 0.12
Horns NeRF-Distill Yes 29.86 / 0.89 / 0.07 29.67 / 0.89 / 0.06 29.24 / 0.89 / 0.07 28.87 / 0.87 / 0.08
NeRF Yes 29.78 / 0.86 / 0.09 27.04 / 0.75 / 0.09 26.04 / 0.74 / 0.12 24.01 / 0.70 / 0.14
EWC Yes 29.78 / 0.86 / 0.09 26.77 / 0.75 / 0.09 27.08 / 0.74 / 0.11 24.68 / 0.69 / 0.14
NeRF-SLAM No 28.86 / 0.83 / 0.10 28.77 / 0.84 / 0.08 28.19 / 0.83 / 0.10 27.56 / 0.82 / 0.12
IL-NeRF (Ours) No 29.92 / 0.89 / 0.07 29.50 / 0.89 / 0.07 29.01 / 0.89 / 0.07 28.96 / 0.87 / 0.09
NeRF-real360 Pinecone NeRF-Distill Yes 26.88 / 0.89 / 0.12 24.23 / 0.79 / 0.16 24.03 / 0.73 / 0.19 23.18 / 0.74 / 0.21
NeRF Yes 26.22 / 0.84 / 0.16 22.90 / 0.64 / 0.24 21.15 / 0.58 / 0.33 18.94 / 0.49 / 0.41
EWC Yes 26.22 / 0.84 / 0.16 22.70 / 0.63 / 0.24 21.42 / 0.57 / 0.32 18.81 / 0.48 / 0.41
NeRF-SLAM No 25.63 / 0.81 / 0.18 24.09 / 0.73 / 0.22 23.01 / 0.68 / 0.29 21.79 / 0.65 / 0.34
IL-NeRF (Ours) No 26.3 / 0.87 / 0.10 24.56 / 0.78 / 0.17 23.78 / 0.74 / 0.20 22.93 / 0.72 / 0.23
Vasedeck NeRF-Distill Yes 29.27 / 0.86 / 0.07 27.93 / 0.85 / 0.12 26.03 / 0.74 / 0.16 26.18 / 0.74 / 0.18
NeRF Yes 29.03 / 0.85 / 0.07 23.99 / 0.70 / 0.26 22.73 / 0.69 / 0.24 21.57 / 0.64 / 0.31
EWC Yes 29.03 / 0.85 / 0.07 24.36 / 0.69 / 0.25 22.25 / 0.68 / 0.24 20.52 / 0.64 / 0.30
NeRF-SLAM No 27.98 / 0.79 / 0.11 26.41 / 0.77 / 0.21 25.10 / 0.72 / 0.21 24.62 / 0.71 / 0.26
IL-NeRF (Ours) No 29.48 / 0.86 / 0.07 27.38 / 0.82 / 0.10 26.11 / 0.76 / 0.14 26.15 / 0.75 / 0.17

Baseline and Metrics. IL-NeRF is compared with the following baselines. NeRF: the original NeRF is incrementally trained with only the current image data chunk and ground-truth camera poses, but without NeRF distillation, making it susceptible to catastrophic forgetting. EWC [16]: similar to NeRF, EWC incrementally trains the model with only the current image data chunk and ground-truth camera poses; however, it utilizes a widely-used regularization-based method that penalizes changes in parameters that are important for past training sets. NeRF-Distill: NeRF is incrementally trained with ground-truth camera poses under replay-based NeRF distillation. Note that the ground-truth camera poses are estimated from all the training images. NeRF-Distill can be treated as representative of the existing works [7, 12, 34, 5], which require the ground-truth camera poses for each incoming image data chunk. NeRF-SLAM: we follow the general implementation of NeRF-SLAM [39], which uses SLAM to align the camera poses of the incoming image data. We replace Droid-SLAM [42] in NeRF-SLAM with ORB-SLAM2 [29] because Droid-SLAM utilizes a complex deep learning model for camera pose estimation, which needs to be trained on image data before NeRF training.

We evaluate IL-NeRF and baselines in three aspects, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [46] and Learned Perceptual Image Patch Similarity (LPIPS) [54]. We use AlexNet [17] as the backbone of LPIPS. It should be noted that as joint optimization of poses and NeRF is performed for all camera poses, including the previous and current poses, at each time step, the camera poses are continuously changing. This means it is not possible to fix the camera poses for test data, and all evaluation metrics compare the rendered images from the training camera poses with the ground truth.
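For reference, PSNR follows the standard definition sketched below (SSIM and LPIPS are computed with their standard implementations [46, 54]); this is a generic sketch, not code from the paper.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio between a rendered image and its ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```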

Refer to caption
Figure 3: Qualitative comparison of the original NeRF and IL-NeRF on images rendered for the first image data chunk after each incremental training step. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process.

5.1 Results

Table 1 shows partial results obtained by IL-NeRF and the baseline methods on the three datasets. Due to page limitations, more comparisons are shown in the supplementary material. From the results, we can see that IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves comparable results with NeRF-Distill. We explain the results in more detail next.

NeRF: In comparison to the original NeRF, IL-NeRF demonstrates substantial improvements of 10.87\% to 36.55\% in PSNR, 15.18\% to 46.90\% in SSIM, and 25.00\% to 54.05\% in LPIPS across the three datasets. On the initial image data chunk G^{0}, IL-NeRF outperforms the original NeRF slightly, as a result of the joint optimization performed by IL-NeRF. However, the performance of the original NeRF rapidly declines thereafter due to catastrophic forgetting.

EWC: EWC fails to reduce the adverse effects of catastrophic forgetting and, in some cases, is even worse than the original NeRF. This is due to the lack of previous training images, the inability of EWC to recover previous scenes, and its penalty mechanism, which discourages learning from newly scanned images.

NeRF-Distill: Comparing NeRF-Distill with IL-NeRF may not be entirely fair, given that NeRF-Distill benefits from having access to all training images to estimate camera poses for the image data chunks, while in our scenario the camera poses are not provided and must be derived through incremental camera pose alignment. Nonetheless, IL-NeRF performs comparably to NeRF-Distill, largely due to the incremental camera pose alignment module used during training.

NeRF-SLAM: While NeRF-SLAM uses SLAM to align camera poses in incoming image data to a common coordinate system, it lags behind our IL-NeRF in terms of performance. This difference stems from NeRF-SLAM’s exclusive reliance on selected keyframes for replay-based training, resulting in overfitting to specific rays and compromising multiview consistency. Furthermore, NeRF-SLAM demands more memory storage space. For instance, on the ‘Garden’ image data in Mip-NeRF360, NeRF-SLAM necessitates an additional 251.3 MB of memory for storing keyframes and point clouds. In contrast, IL-NeRF only requires an additional 37.7 KB of memory for storing previous camera poses.

Figure 3 provides additional insights by presenting a qualitative comparison of the performance of the original NeRF and IL-NeRF on the ‘Kitchen’ and ‘Counter’ scenes in the Mip-NeRF360 dataset. Specifically, we demonstrate the rendering results on the first image data after each incremental training. It is evident that the original NeRF suffers from the catastrophic forgetting problem, resulting in images with significant distortions such as noise and blur, whereas IL-NeRF generates highly realistic images with quality comparable to the ground truth. This observation indicates that IL-NeRF is highly effective in mitigating the forgetting problem and estimating the camera poses. More results are shown in the supplementary material.

Refer to caption
Figure 4: Influence of optimal pose count D on IL-NeRF.

5.2 Ablation Study

Effect of Pose Selection. The first step of incremental camera pose alignment is to find the previous D optimal camera poses as references for estimating and aligning the camera poses of the incoming image data. To illustrate the influence of D on IL-NeRF performance, we set the value of D to 1, 5, 10, 20, and all, respectively. Figure 4 portrays the PSNR of IL-NeRF across varying values of D on the ‘Bicycle’ scene from the Mip-NeRF360 dataset. As depicted, when D is small, the lack of adequate reference images results in the estimated camera poses of the incoming image data deviating from the original camera pose coordinate system, thereby leading to considerably poor rendering. As the value of D increases, the estimated camera poses of the incoming image data become increasingly precise, and thus the PSNR increases. Note that an excessively large value of D introduces poorly rendered cameras, subsequently leading to a decrease in PSNR. To identify the optimal D camera poses, we address a reward-collection optimization problem on a graph. To demonstrate the effectiveness of our camera selection approach, we conduct a comparative analysis with two baselines: (1) randomly selecting D camera poses as the reference, and (2) myopically selecting the D camera poses with the lowest training losses. Figure 5 shows the performance of IL-NeRF and the two baselines on the ‘Bicycle’ scene. As we can see, our proposed method surpasses the other two approaches. This is primarily due to our method’s ability to ensure the quality of rendered images used as references while providing broader camera view coverage, thereby facilitating more accurate camera pose estimation for incoming image data.
Effect of Transfer Matrices. The transfer matrices are obtained by computing the corresponding rotation matrices and translations of the selected D camera poses in time slots t-1 and t. These matrices are then employed to align the camera poses of new images to the original camera pose coordinate system. To investigate the effectiveness of the transfer matrices, we compare IL-NeRF with IL-NeRF without the transfer matrices, denoted as ‘IL-NeRF w/o TM’ in Figure 6, on the ‘Garden’ scene from the Mip-NeRF360 dataset. The results reveal a significant decline in performance without the transfer matrices, achieving only 15.76 dB in terms of PSNR. This decline can be attributed to separate camera pose estimation for the two tasks resulting in camera poses in two independent coordinate systems, which misleads the model during training and degrades performance.
Effect of Pose Refinement. Despite the transfer matrices’ ability to align the camera poses to the original coordinate system, they may still contain noise and inaccuracies. To mitigate this issue, we use the coordinate-aligned camera poses as initial values and jointly optimize the camera poses and scene representation during NeRF training, a process known as pose refinement. We perform an ablation study to investigate the effectiveness of pose refinement by comparing IL-NeRF with and without it. The results in Figure 6 indicate that IL-NeRF with pose refinement outperforms IL-NeRF without it (i.e., IL-NeRF w/o PR). Figure 7 further shows the qualitative comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF. More results are shown in the supplementary material.
Camera Pose. Our goal is to incrementally train a NeRF model given only RGB images as input, without known camera poses. In other words, we need to find out the camera poses associated with each input image while training the NeRF model. We treat the COLMAP estimation from all training images as ground-truth (GT) camera poses and report the difference between our optimized camera poses and theirs on the training images. Figure 8 shows the camera pose trajectories of GT and IL-NeRF. As we can see, IL-NeRF recovers accurate camera poses with the help of camera coordinate alignment and pose refinement. More results are shown in the supplementary material.

Refer to caption
Figure 5: Our camera pose selection method outperforms random selection and myopic selection.
Refer to caption
Figure 6: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR and IL-NeRF. IL-NeRF outperforms these two cases.
Refer to caption
Figure 7: Comparison of IL-NeRF w/o Transfer Matrices (TM), IL-NeRF w/o Pose Refinement (PR) and IL-NeRF.
Refer to caption
(a) Kitchen
Refer to caption
(b) Vasedeck
Figure 8: Camera pose estimation comparison. GT means the camera poses estimated by COLMAP from all the training images. IL-NeRF recovers accurate camera poses with the help of incremental camera pose alignment.

6 Conclusion

In this study, we introduce an incremental learning algorithm called IL-NeRF that tackles the problems of catastrophic forgetting and camera coordinate misalignment when training NeRF in incremental learning settings. IL-NeRF employs a replay-based NeRF distillation pipeline to retain past information while learning from new data. Furthermore, an incremental camera pose alignment method is introduced to estimate the camera poses associated with incoming data chunks during NeRF training. Experimental results show that IL-NeRF outperforms the original NeRF model in sequential data settings.

References

  • Aljundi et al. [2017] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3375, 2017.
  • Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
  • Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  • Bertsekas [2014] Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.
  • Cai and Mueller [2023] Zhipeng Cai and Matthias Mueller. Clnerf: Continual learning meets nerf. arXiv preprint arXiv:2308.14816, 2023.
  • Chen et al. [2022] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 333–350. Springer, 2022.
  • Chung et al. [2022] Jaeyoung Chung, Kanggeon Lee, Sungyong Baik, and Kyoung Mu Lee. Meil-nerf: Memory-efficient incremental learning of neural radiance fields. arXiv preprint arXiv:2212.08328, 2022.
  • Deng et al. [2022] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  • French [1999] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  • Golden et al. [1987] Bruce L Golden, Larry Levy, and Rakesh Vohra. The orienteering problem. Naval Research Logistics (NRL), 34(3):307–318, 1987.
  • Guo et al. [2022] Mengqi Guo, Chen Li, and Gim Hee Lee. Incremental learning for neural radiance field with uncertainty-filtered knowledge distillation. arXiv preprint arXiv:2212.10950, 2022.
  • Hung et al. [2019] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Jandaghi et al. [2021] Hossein Jandaghi, Ali Divsalar, and Saeed Emami. The categorized orienteering problem with count-dependent profits. Applied Soft Computing, 113:107962, 2021.
  • Jeong et al. [2021] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5846–5854, 2021.
  • Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • kwea123 [2022] kwea123. ngp-pl: a pytorch implementation of instant-ngp, 2022.
  • Lin et al. [2021] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5741–5751, 2021.
  • Lopez-Paz and Ranzato [2017] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.
  • Mallya and Lazebnik [2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • Mallya et al. [2018] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
  • Martin-Brualla et al. [2021] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.
  • Martins et al. [2021] Leandro do C Martins, Rafael D Tordecilla, Juliana Castaneda, Angel A Juan, and Javier Faulin. Electric vehicle routing, arc routing, and team orienteering problems in sustainable transportation. Energies, 14(16):5131, 2021.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Mildenhall et al. [2022] Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P Srinivasan, and Jonathan T Barron. Nerf in the dark: High dynamic range view synthesis from noisy raw images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16190–16199, 2022.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  • Mur-Artal and Tardós [2017] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE transactions on robotics, 33(5):1255–1262, 2017.
  • Niemeyer and Geiger [2021] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
  • Oechsle et al. [2021] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021.
  • Park et al. [2021] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
  • Pěnička et al. [2017] Robert Pěnička, Jan Faigl, Petr Váňa, and Martin Saska. Dubins orienteering problem. IEEE Robotics and Automation Letters, 2(2):1210–1217, 2017.
  • Po et al. [2023] Ryan Po, Zhengyang Dong, Alexander W Bergman, and Gordon Wetzstein. Instant continual learning of neural radiance fields. arXiv preprint arXiv:2309.01811, 2023.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
  • Rebain et al. [2021] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14153–14161, 2021.
  • Roessle et al. [2022] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12892–12901, 2022.
  • Rolnick et al. [2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Rosinol et al. [2022] Antoni Rosinol, John J Leonard, and Luca Carlone. Nerf-slam: Real-time dense monocular slam with neural radiance fields. arXiv preprint arXiv:2210.13641, 2022.
  • Schonberger and Frahm [2016] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  • Shin et al. [2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
  • Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
  • Vansteenwegen and Gunawan [2019] Pieter Vansteenwegen and Aldy Gunawan. Orienteering problems. EURO Advanced Tutorials on Operational Research, 2019.
  • Vansteenwegen et al. [2011] Pieter Vansteenwegen, Wouter Souffriau, and Dirk Van Oudheusden. The orienteering problem: A survey. European Journal of Operational Research, 209(1):1–10, 2011.
  • Wang et al. [2021a] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021a.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wang et al. [2021b] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021b.
  • Wei et al. [2021] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5610–5619, 2021.
  • Xiang et al. [2021] Fanbo Xiang, Zexiang Xu, Milos Hasan, Yannick Hold-Geoffroy, Kalyan Sunkavalli, and Hao Su. Neutex: Neural texture mapping for volumetric neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7128, 2021.
  • Xu and Zhu [2018] Ju Xu and Zhanxing Zhu. Reinforced continual learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Yariv et al. [2021] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  • Yen-Chen et al. [2021] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.
  • Zhang et al. [2019] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3713–3722, 2019.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

Supplementary Material

7 Finding Previous Optimal Camera Poses

To find the previous D optimal camera poses, we formulate a reward-collection optimization problem on a graph. In this graph, the nodes represent camera positions (i.e., the translations in the camera poses), with each node assigned a reward corresponding to the negative value of the preceding training loss. The edges represent Euclidean distances between each camera pair’s positions. The goal is to find a path that collects as much reward as possible, subject to constraints on the total number of visited nodes and camera view coverage. Concretely, the optimization problem can be formulated as

\max\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}R_{k}  (15)
s.t.~\sum_{k=1}^{|\mathcal{P}^{p}|}x_{k}=D  (16)
~~~~S(K)\geq S_{th},~K=\{k|x_{k}=1,k=1,\dots,|\mathcal{P}^{p}|\}  (17)
~~~~E(x_{k})\leq 1,~\forall k\in\{1,\dots,|\mathcal{P}^{p}|\}  (18)

where x_{k} is the binary decision variable: x_{k}=1 if node k is visited and x_{k}=0 otherwise. S(K) is the shortest path that connects all the selected nodes, and E(x_{k}) is the number of incoming edges of each selected node. The first constraint (16) ensures that only D previous camera poses are selected. The second constraint (17) means that the view coverage of the selected cameras is larger than a threshold; this is because a large field-of-view coverage of the selected cameras improves the accuracy of camera pose estimation. The third constraint (18) guarantees that every node has at most one incoming edge; in other words, every node is visited at most once. Consequently, this reward-collection optimization problem can be viewed as a hybrid of the Knapsack Problem and the Travelling Salesperson Problem, which is NP-hard.

Related Work for the Reward-Collection Optimization Problem. The reward-collection problem, also known as the orienteering problem, is an optimization problem that aims to determine the most efficient route for visiting multiple locations while maximizing the value or score collected at the visited places, all within a specified time frame and beginning and ending at particular points [11]. This problem is widely utilized in the tourism sector [44], robot routing [33], food delivery [43], and transportation control [24]. As the orienteering problem belongs to the NP-hard class of problems, no algorithm can solve it optimally within a reasonable amount of time [14]. Different from the traditional orienteering problem, the optimization problem in this paper is more complex: we do not restrict the start and end points, and at the same time we limit the number of visited nodes, which prevents existing approaches from being applied directly to our setting.

Algorithm 2 Brute-Force for Selecting Cameras
1: Generate all possible D-camera combinations \mathcal{K}.
2: Initialize \mathcal{B}=\emptyset and b=-\infty.
3: for K\in\mathcal{K} do
4:       Use Breadth-First-Search to find the shortest path S(K) that visits all the nodes in K.
5:       if S(K)\geq S_{th} then
6:             Sum the total reward \mathcal{R} of the nodes in K.
7:             if \mathcal{R}\geq b then
8:                    \mathcal{B}=K
9:                    b=\mathcal{R}
10:             end if
11:       end if
12: end for
13: Return \mathcal{B}

Brute-Force Method. The straightforward approach to this problem is the Brute-Force method, shown in Algorithm 2. This method involves: (1) determining the shortest path that visits all nodes for each D-camera combination; (2) selecting the camera combination with the highest total reward, while ensuring compliance with all constraints. However, the time complexity of this approach is O((2^{D}\times D)\times\binom{N}{D}), where \binom{N}{D} is the number of D-combinations of the N previous camera poses and 2^{D}\times D is the time complexity of finding the shortest path for each D-camera combination. While this method guarantees an optimal solution, it becomes impractical for large numbers of nodes due to its time complexity.
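For illustration, a minimal Python sketch of this exhaustive baseline is shown below; it enumerates node orderings with itertools.permutations to find the shortest connecting path, a simpler (and even slower) substitute for the breadth-first search above, and all names are ours.

```python
import itertools
import numpy as np

def brute_force_select(positions, rewards, D, S_th):
    """Exhaustive baseline: test every D-combination of cameras and keep the
    feasible one (path coverage >= S_th) with the highest total reward."""
    positions = np.asarray(positions)
    rewards = np.asarray(rewards)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    best_K, best_reward = None, -np.inf
    for K in itertools.combinations(range(len(positions)), D):
        # Shortest path visiting all selected nodes, here by trying every ordering.
        shortest = min(
            sum(dist[p[i], p[i + 1]] for i in range(D - 1))
            for p in itertools.permutations(K)
        )
        total = rewards[list(K)].sum()
        if shortest >= S_th and total > best_reward:
            best_K, best_reward = K, total
    return best_K
```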

Proposed Greedy Algorithm. Let $\mathbb{G}(V,E)$ be the graph of the previous camera poses, where $V$ denotes the set of cameras (nodes) and $E$ denotes the set of edges connecting each pair of cameras. Let $e_{i,j}=e_{j,i}$, with $e_{i,j},e_{j,i}\in E$, denote the edge between cameras $i$ and $j$; note that $\mathbb{G}(V,E)$ is an undirected, weighted, complete graph. The core idea of our greedy algorithm is to traverse the unvisited nodes starting from a specific node: at each step, the algorithm computes an approximation edge between the current node and every connected unvisited node, and selects the node with the maximum approximation edge as the next starting node. This process is repeated until a total of $D$ nodes have been selected. The complete description of the greedy algorithm is outlined in Algorithm 3.

Step 1 (line 1 to line 2): Introduce an auxiliary starting node $V_{0}$ into the graph, connected to all nodes with edges of length 0. We define a set $\mathcal{B}$ to keep track of the visited nodes during traversal and initialize it as $\mathcal{B}=\{V_{0}\}$. Additionally, the current node index is initialized as $k=0$.

Step 2 (line 4 to line 10): Identify all nodes connected to the current node $V_{k}$ that have not been visited yet (i.e., $V_{i}\notin\mathcal{B}$). Based on the reward $R_{i}$ and the edge length $e_{k,i}$ of each unvisited node, we compute the approximation edge length $\hat{e}_{k,i}$ and insert it into a temporary set $\hat{E}$. Specifically, $\hat{e}_{k,i}=R_{i}+\lambda(\frac{S_{th}}{D}-e_{k,i})$, where $\lambda$ is a parameter that balances the units of $R_{i}$ and $\frac{S_{th}}{D}-e_{k,i}$. This approximation edge plays a role similar to a Lagrange multiplier [4] in handling the constraint (17).

Step 3 (line 11 to line 12): Select the node with the maximum $\hat{e}_{k,i}$ as the next visited node, update the current index as $k=\arg\max\hat{E}$, and insert the newly visited node into the set of visited nodes $\mathcal{B}$.

Step 4: Repeat Step 2 and Step 3 until a total of $D$ nodes have been visited.

Algorithm 3 Proposed Greedy Algorithm
1: Add an auxiliary starting node $V_{0}$ linking all the nodes.
2: Initialize the visited set $\mathcal{B}=\{V_{0}\}$ and $k=0$.
3: while $|\mathcal{B}|<D+1$ do
4:       $\hat{E}=\{\}$
5:       for $V_{i}\in V$ do
6:             if $V_{i}\notin\mathcal{B}$ then
7:                    $\hat{e}_{k,i}=R_{i}+\lambda(\frac{S_{th}}{D}-e_{k,i})$
8:                    $\hat{E}.append(\hat{e}_{k,i})$
9:             end if
10:       end for
11:       $k=\arg\max\hat{E}$
12:       $\mathcal{B}.append(V_{k})$
13: end while
14: Return $\mathcal{B}.remove(V_{0})$

The time complexity of our greedy algorithm is $O(D\times N\log N)$, which reduces the computation time by several orders of magnitude compared with the Brute-Force method.
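The following is a minimal Python sketch of Algorithm 3, not our actual implementation. It assumes the same `rewards` and `dist` inputs as above and a unit-balancing weight `lam` for $\lambda$; the auxiliary node $V_{0}$ is modeled implicitly by giving the first pick zero-length edges, and `greedy_select` is a hypothetical helper name.

```python
def greedy_select(rewards, dist, D, S_th, lam=1.0):
    """Greedily pick D previous cameras by repeatedly taking the largest approximation edge."""
    N = len(rewards)
    selected = []                       # the visited set B (the auxiliary node V_0 is implicit)
    k = None                            # current node; None stands for V_0
    while len(selected) < D:
        best_i, best_score = None, float("-inf")
        for i in range(N):
            if i in selected:
                continue
            e_ki = 0.0 if k is None else dist[k][i]        # edges from V_0 have length 0
            score = rewards[i] + lam * (S_th / D - e_ki)   # approximation edge e_hat_{k,i}
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)         # visit the node with the largest approximation edge
        k = best_i
    return selected
```

Each of the $D$ greedy steps only scans the remaining unvisited nodes, which is what keeps the runtime far below the combinatorial cost of the Brute-Force method.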

Comparison of the Brute-Force Method with Ours. Here, we compare the Brute-Force method and our greedy algorithm in terms of PSNR and computation time. As shown in Tables 2, 3 and 4, IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm, making the Brute-Force method impractical for incremental training scenarios. Note that the runtime of the Brute-Force method grows exponentially with the size of the training data. For example, the ‘Bicycle’ scene in Mip-NeRF360 contains 194 training images and the ‘Horns’ scene in LLFF contains 62 training images, and the Brute-Force method’s runtimes for these two scenes are 5 days 6 hours and 1 hour 47 min, respectively.

Table 2: Performance comparison of the Brute-Force method and our IL-NeRF on the Mip-NeRF360 dataset. IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm.
Scene Method PSNR running time
Bicycle Brute-Force 22.36 5 days 6 hours
Greedy (Ours) 22.34 10.92 ms
Bonsai Brute-Force 28.96 10 days 8 hours
Greedy (Ours) 28.96 25.57 ms
Counter Brute-Force 27.86 10 days 2 hours
Greedy (Ours) 27.82 23.78 ms
Garden Brute-Force 24.83 4 days 18 hours
Greedy (Ours) 24.82 9.87 ms
Kitchen Brute-Force 29.34 11 days 6 hours
Greedy (Ours) 29.34 28.39 ms
Room Brute-Force 31.49 12 days 10 hours
Greedy (Ours) 31.45 37.58 ms
Stump Brute-Force 24.91 4 days 12 hours
Greedy (Ours) 24.89 8.73 ms
Table 3: Performance comparison of the Brute-Force method and our IL-NeRF on the LLFF dataset. IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm.
Scene Method PSNR running time
Fern Brute-Force 25.27 62.01 s
Greedy (Ours) 25.26 4.38 ms
Flower Brute-Force 30.49 4.63 min
Greedy (Ours) 30.49 6.81 ms
Fortress Brute-Force 29.45 14.17 min
Greedy (Ours) 29.45 8.78 ms
Horns Brute-Force 28.97 1 hour 47 min
Greedy (Ours) 28.96 9.87 ms
Leaves Brute-Force 23.88 65.78 s
Greedy (Ours) 23.88 4.95 ms
Orchids Brute-Force 23.67 57.84 s
Greedy (Ours) 23.67 5.58 ms
Room Brute-Force 31.88 12.48 min
Greedy (Ours) 31.88 8.73 ms
Trex Brute-Force 27.81 57.97 min
Greedy (Ours) 27.81 9.37 ms
Table 4: Performance comparison of the Brute-Force method and our IL-NeRF on the NeRF-real360 dataset. IL-NeRF achieves PSNR comparable to the Brute-Force method, while the runtime of the Brute-Force method is several orders of magnitude larger than that of our proposed greedy algorithm.
Scene Method PSNR running time
Pinecone Brute-Force 22.96 4 days 10 hours
Greedy (Ours) 22.93 9.58 ms
Vasedeck Brute-Force 26.24 5 days 2 hours
Greedy (Ours) 26.15 10.61 ms

7.1 Implementation Details

We implement our framework following the architecture of Instant-NeRF [28, 18]. We use two separate Adam optimizers, one for the NeRF model and one for camera pose refinement, with initial learning rates of 0.01 and 0.005, respectively. The learning rate of the NeRF model decays every iteration by a factor of 0.9954 (exponential decay), and the learning rate of the pose refinement decays every 100 iterations by a factor of 0.9. In each incremental step, we train the network for 30k iterations with $D=10$ on Mip-NeRF360, 5k iterations with $D=5$ on LLFF, and 20k iterations with $D=10$ on NeRF-real360.
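To make the schedule above concrete, the following is a minimal PyTorch sketch of the two-optimizer setup, not our training code. The small MLP, the per-image pose tensor `pose_params`, the iteration count, and the placeholder loss are illustrative stand-ins for the Instant-NeRF network, the learnable camera-pose corrections, and the photometric-plus-distillation objective.

```python
import torch

# Placeholder stand-ins: a tiny MLP for the NeRF network and per-image se(3) pose corrections.
nerf_model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
pose_params = torch.zeros(10, 6, requires_grad=True)

nerf_opt = torch.optim.Adam(nerf_model.parameters(), lr=0.01)
pose_opt = torch.optim.Adam([pose_params], lr=0.005)

# NeRF learning rate: multiplied by 0.9954 every iteration (exponential decay).
nerf_sched = torch.optim.lr_scheduler.ExponentialLR(nerf_opt, gamma=0.9954)
# Pose learning rate: multiplied by 0.9 once every 100 iterations.
pose_sched = torch.optim.lr_scheduler.StepLR(pose_opt, step_size=100, gamma=0.9)

num_iters = 5000                        # e.g., 5k for LLFF; 30k / 20k for the 360-degree datasets
for it in range(num_iters):
    # Placeholder loss; in IL-NeRF this would be the photometric loss on the current
    # chunk plus the replay-based distillation loss.
    x = torch.randn(1024, 3)
    loss = nerf_model(x).square().mean() + pose_params.square().mean()
    nerf_opt.zero_grad(); pose_opt.zero_grad()
    loss.backward()
    nerf_opt.step(); pose_opt.step()
    nerf_sched.step()                   # decays every iteration
    pose_sched.step()                   # StepLR applies the decay every 100 iterations
```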

8 More Comparisons for Results

In the main text, we report PSNR, SSIM, and LPIPS for only some scenes of the three datasets; here we give the full results. Table 5 shows the results obtained by IL-NeRF and the baseline methods on the Mip-NeRF360 dataset with seven real-world indoor and outdoor scenes. Similarly, Table 6 presents the results on the LLFF dataset with eight forward-facing scenes, and Table 7 shows the results on the NeRF-real360 dataset with two real-world object-centric scenes. From the results, we can see that IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves results comparable to NeRF.

Table 5: Performance comparison on the Mip-NeRF360 dataset with the baselines: PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, NeRF-SLAM and achieves comparable results with NeRF.
Scene Method Pose PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Bicycle NeRF Yes 22.76 / 0.61 / 0.33 18.58 / 0.47 / 0.46 20.03 / 0.52 / 0.43 20.03 / 0.52 / 0.44
EWC Yes 22.76 / 0.61 / 0.33 18.80 / 0.47 / 0.45 19.41 / 0.51 / 0.43 19.89 / 0.52 / 0.43
NeRF Yes 22.88 / 0.62 / 0.33 20.23 / 0.49 / 0.43 22.03 / 0.53 / 0.39 22.18 / 0.54 / 0.41
NeRF-SLAM No 22.78 / 0.61 / 0.33 19.67 / 0.48 / 0.45 21.37 / 0.53 / 0.41 21.61 / 0.53 / 0.42
IL-NeRF No 22.90 / 0.62 / 0.33 19.84 / 0.48 / 0.44 22.05 / 0.54 / 0.40 22.34 / 0.55 / 0.40
Bonsai NeRF Yes 33.30 / 0.93 / 0.07 25.47 / 0.75 / 0.25 23.53 / 0.66 / 0.34 22.12 / 0.68 / 0.35
EWC Yes 33.30 / 0.93 / 0.07 25.62 / 0.75 / 0.25 22.35 / 0.66 / 0.33 21.51 / 0.68 / 0.34
NeRF Yes 33.48 / 0.93 / 0.07 29.93 / 0.88 / 0.15 28.03 / 0.84 / 0.18 28.18 / 0.84 / 0.21
NeRF-SLAM No 33.32 / 0.93 / 0.07 29.13 / 0.84 / 0.21 28.01 / 0.79 / 0.29 26.85 / 0.80 / 0.29
IL-NeRF No 33.54 / 0.93 / 0.07 30.73 / 0.89 / 0.12 29.77 / 0.86 / 0.16 28.96 / 0.85 / 0.18
Counter NeRF Yes 32.12 / 0.91 / 0.07 24.62 / 0.72 / 0.25 21.94 / 0.65 / 0.34 20.30 / 0.62 / 0.37
EWC Yes 32.12 / 0.91 / 0.07 23.83 / 0.72 / 0.25 22.56 / 0.65 / 0.33 21.11 / 0.61 / 0.36
NeRF Yes 32.17 / 0.92 / 0.07 29.58 / 0.86 / 0.14 28.03 / 0.82 / 0.18 28.28 / 0.85 / 0.18
NeRF-SLAM No 31.75 / 0.91 / 0.07 28.30 / 0.83 / 0.21 26.84 / 0.79 / 0.28 25.30 / 0.77 / 0.31
IL-NeRF No 32.13 / 0.91 / 0.07 29.63 / 0.87 / 0.12 28.56 / 0.85 / 0.15 27.82 / 0.83 / 0.17
Garden NeRF Yes 24.70 / 0.71 / 0.20 22.34 / 0.64 / 0.25 20.17 / 0.59 / 0.31 19.42 / 0.54 / 0.38
EWC Yes 24.70 / 0.71 / 0.20 23.38 / 0.63 / 0.24 20.09 / 0.58 / 0.31 19.81 / 0.54 / 0.37
NeRF Yes 24.72 / 0.73 / 0.19 24.93 / 0.72 / 0.18 24.68 / 0.69 / 0.22 24.48 / 0.67 / 0.21
NeRF-SLAM No 24.72 / 0.71 / 0.20 24.03 / 0.69 / 0.23 23.50 / 0.65 / 0.28 23.37 / 0.61 / 0.33
IL-NeRF No 24.73 / 0.73 / 0.19 24.80 / 0.70 / 0.22 24.86 / 0.69 / 0.23 24.82 / 0.67 / 0.23
Kitchen NeRF Yes 31.17 / 0.91 / 0.08 27.01 / 0.75 / 0.25 21.42 / 0.70 / 0.31 23.69 / 0.75 / 0.24
EWC Yes 31.17 / 0.91 / 0.08 26.76 / 0.74 / 0.25 22.09 / 0.70 / 0.31 23.39 / 0.74 / 0.23
NeRF Yes 31.05 / 0.91 / 0.07 29.72 / 0.88 / 0.13 29.33 / 0.85 / 0.15 29.18 / 0.84 / 0.14
NeRF-SLAM No 30.87 / 0.90 / 0.09 29.63 / 0.85 / 0.20 27.65 / 0.81 / 0.24 27.71 / 0.82 / 0.20
IL-NeRF No 31.27 / 0.92 / 0.07 30.66 / 0.89 / 0.10 29.84 / 0.87 / 0.12 29.34 / 0.86 / 0.13
Room NeRF Yes 35.98 / 0.96 / 0.04 30.78 / 0.91 / 0.09 26.34 / 0.80 / 0.21 27.44 / 0.86 / 0.16
EWC Yes 35.98 / 0.96 / 0.04 31.84 / 0.90 / 0.09 27.38 / 0.79 / 0.20 28.08 / 0.86 / 0.16
NeRF Yes 36.18 / 0.96 / 0.03 33.93 / 0.95 / 0.05 32.03 / 0.92 / 0.08 31.99 / 0.93 / 0.06
NeRF-SLAM No 35.74 / 0.94 / 0.08 33.20 / 0.93 / 0.07 30.36 / 0.88 / 0.17 30.73 / 0.89 / 0.13
IL-NeRF No 36.04 / 0.96 / 0.04 34.02 / 0.94 / 0.04 32.35 / 0.92 / 0.07 31.45 / 0.91 / 0.09
Stump NeRF Yes 25.62 / 0.77 / 0.28 22.30 / 0.52 / 0.38 21.25 / 0.46 / 0.42 20.55 / 0.44 / 0.47
EWC Yes 25.62 / 0.77 / 0.28 22.55 / 0.51 / 0.37 21.09 / 0.45 / 0.42 21.48 / 0.44 / 0.46
NeRF Yes 26.18 / 0.79 / 0.27 25.93 / 0.64 / 0.37 25.12 / 0.62 / 0.38 25.18 / 0.64 / 0.39
NeRF-SLAM No 24.98 / 0.74 / 0.31 24.76 / 0.61 / 0.36 24.05 / 0.56 / 0.40 23.93 / 0.57 / 0.43
IL-NeRF No 25.96 / 0.77 / 0.28 25.75 / 0.66 / 0.32 25.09 / 0.60 / 0.37 24.89 / 0.58 / 0.37
Table 6: Performance comparison on the LLFF dataset with the baselines: PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, NeRF-SLAM and achieves comparable results with NeRF.
Scene Method Pose PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Fern NeRF Yes 29.19 / 0.90 / 0.06 24.58 / 0.80 / 0.19 23.21 / 0.67 / 0.24 22.43 / 0.65 / 0.25
EWC Yes 29.19 / 0.90 / 0.06 24.88 / 0.79 / 0.19 22.60 / 0.66 / 0.23 23.32 / 0.64 / 0.25
NeRF Yes 29.26 / 0.90 / 0.06 26.71 / 0.88 / 0.09 25.79 / 0.85 / 0.12 25.19 / 0.82 / 0.13
NeRF-SLAM No 28.77 / 0.88 / 0.16 25.22 / 0.85 / 0.15 24.95 / 0.79 / 0.23 24.33 / 0.77 / 0.21
IL-NeRF No 29.30 / 0.90 / 0.06 26.68 / 0.87 / 0.10 25.63 / 0.81 / 0.13 25.26 / 0.80 / 0.15
Flower NeRF Yes 34.12 / 0.96 / 0.01 30.30 / 0.91 / 0.02 28.40 / 0.90 / 0.03 27.50 / 0.88/ 0.04
EWC Yes 34.12 / 0.96 / 0.01 29.84 / 0.90 / 0.02 28.16 / 0.90 / 0.03 27.14 / 0.88 / 0.03
NeRF Yes 34.28 / 0.96 / 0.01 31.76 / 0.93 / 0.01 30.98 / 0.93 / 0.02 30.68 / 0.93 / 0.02
NeRF-SLAM No 33.28 / 0.96 / 0.01 31.34 / 0.92 / 0.01 30.29 / 0.92 / 0.02 29.72 / 0.91 / 0.03
IL-NeRF No 34.22 / 0.96 / 0.01 31.81 / 0.94 / 0.01 31.11 / 0.94 / 0.02 30.49 / 0.93 / 0.02
Fortress NeRF Yes 31.56 / 0.85 / 0.11 29.38 / 0.80 / 0.15 27.05 / 0.78 / 0.17 25.39 / 0.78 / 0.16
EWC Yes 31.56 / 0.85 / 0.11 29.41 / 0.79 / 0.15 25.83 / 0.77 / 0.16 24.40 / 0.78 / 0.15
NeRF Yes 31.75 / 0.86 / 0.11 31.83 / 0.84 / 0.09 30.90 / 0.86 / 0.14 29.81 / 0.85 / 0.11
NeRF-SLAM No 31.08 / 0.82 / 0.12 31.09 / 0.82 / 0.12 29.77 / 0.83 / 0.15 28.53 / 0.82 / 0.14
IL-NeRF No 31.69 / 0.85 / 0.11 31.02 / 0.84 / 0.10 30.33 / 0.84 / 0.11 29.45 / 0.83 / 0.12
Horns NeRF Yes 29.78 / 0.86 / 0.09 27.04 / 0.75 / 0.09 26.04 / 0.74 / 0.12 24.01 / 0.70 / 0.14
EWC Yes 29.78 / 0.86 / 0.09 26.77 / 0.75 / 0.09 27.08 / 0.74 / 0.11 24.68 / 0.69 / 0.14
NeRF Yes 29.86 / 0.89 / 0.07 29.67 / 0.89 / 0.06 29.24 / 0.89 / 0.07 28.87 / 0.87 / 0.08
NeRF-SLAM No 28.86 / 0.83 / 0.10 28.77 / 0.84 / 0.08 28.19 / 0.83 / 0.10 27.56 / 0.82 / 0.12
IL-NeRF No 29.92 / 0.89 / 0.07 29.50 / 0.89 / 0.07 29.01 / 0.89 / 0.07 28.96 / 0.87 / 0.09
Leaves NeRF Yes 25.51 / 0.90 / 0.06 22.12 / 0.79 / 0.13 21.03 / 0.75 / 0.15 20.62 / 0.73 / 0.16
EWC Yes 25.51 / 0.90 / 0.06 21.18 / 0.79 / 0.13 21.96 / 0.74 / 0.15 20.39 / 0.72 / 0.16
NeRF Yes 25.58 / 0.90 / 0.06 24.81 / 0.89 / 0.07 24.23 / 0.87 / 0.07 23.84 / 0.86 / 0.08
NeRF-SLAM No 24.83 / 0.87 / 0.08 23.93 / 0.86 / 0.10 23.27 / 0.83 / 0.12 22.86 / 0.82 / 0.13
IL-NeRF No 25.62 / 0.90 / 0.06 24.74 / 0.88 / 0.07 24.26 / 0.87 / 0.07 23.88 / 0.86 / 0.08
Orchids NeRF Yes 25.68 / 0.85 / 0.08 22.78 / 0.77 / 0.10 21.43 / 0.74 / 0.12 20.77 / 0.71 / 0.14
EWC Yes 25.68 / 0.85 / 0.08 22.18 / 0.77/ 0.10 21.41 / 0.73 / 0.12 20.23 / 0.70 / 0.14
NeRF Yes 25.88 / 0.87 / 0.08 24.37 / 0.84 / 0.10 23.69 / 0.80 / 0.12 23.59 / 0.79 / 0.12
NeRF-SLAM No 24.32 / 0.81 / 0.09 23.94 / 0.82/ 0.10 23.26 / 0.78 / 0.12 22.76 / 0.76 / 0.13
IL-NeRF No 25.78 / 0.86 / 0.08 24.17 / 0.82 / 0.10 23.89 / 0.80 / 0.12 23.67 / 0.77 / 0.13
Room NeRF Yes 32.35 / 0.92 / 0.09 29.58 / 0.89 / 0.10 28.82 / 0.89 / 0.11 30.59 / 0.92 / 0.07
EWC Yes 31.14 / 0.92 / 0.09 28.46 / 0.88 / 0.10 28.78 / 0.89 / 0.11 30.29 / 0.91 / 0.08
NeRF Yes 32.26 / 0.92 / 0.09 31.52 / 0.92 / 0.08 31.21 / 0.92 / 0.08 31.98 / 0.92 / 0.08
NeRF-SLAM No 30.84 / 0.89 / 0.10 31.02 / 0.91 / 0.09 30.76 / 0.91 / 0.10 31.63 / 0.92 / 0.07
IL-NeRF No 32.50 / 0.92 / 0.08 31.76 / 0.92 / 0.08 31.58 / 0.92 / 0.08 31.88 / 0.92 / 0.07
Trex NeRF Yes 28.50 / 0.90 / 0.07 27.29 / 0.89 / 0.07 26.40 / 0.88 / 0.07 26.24 / 0.86 / 0.09
EWC Yes 28.50 / 0.90 / 0.07 26.42 / 0.88 / 0.07 25.97 / 0.89 / 0.07 25.18 / 0.91 / 0.09
NeRF Yes 28.74 / 0.91 / 0.06 28.32 / 0.90 / 0.06 28.11 / 0.90 / 0.07 27.98 / 0.90 / 0.06
NeRF-SLAM No 27.26 / 0.90 / 0.07 28.05 / 0.89 / 0.06 27.60 / 0.89 / 0.07 27.37 / 0.88 / 0.08
IL-NeRF No 28.70 / 0.90 / 0.07 28.14 / 0.90 / 0.06 27.90 / 0.90 / 0.07 27.81 / 0.90 / 0.06
Table 7: Performance comparison on the NeRF-real360 dataset with the baselines: PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, NeRF-SLAM and achieves comparable results with NeRF.
Scene Method Pose PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Pinecone NeRF Yes 26.22 / 0.84 / 0.16 22.90 / 0.64 / 0.24 21.15 / 0.58 / 0.33 18.94 / 0.49 / 0.41
EWC Yes 26.22 / 0.84 / 0.16 22.70 / 0.63 / 0.24 21.42 / 0.57 / 0.32 18.81 / 0.48 / 0.41
NeRF Yes 26.88 / 0.89 / 0.12 24.23 / 0.79 / 0.16 24.03 / 0.73 / 0.19 23.18 / 0.74 / 0.21
NeRF-SLAM No 25.63 / 0.81 / 0.18 24.09 / 0.73 / 0.22 23.01 / 0.68 / 0.29 21.79 / 0.65 / 0.34
IL-NeRF No 26.31 / 0.87 / 0.10 24.56 / 0.78 / 0.17 23.78 / 0.74 / 0.20 22.93 / 0.72 / 0.23
Vasedeck NeRF Yes 29.03 / 0.85 / 0.07 23.99 / 0.70 / 0.26 22.73 / 0.69 / 0.24 21.57 / 0.64 / 0.31
EWC Yes 29.03 / 0.85 / 0.07 24.36 / 0.69 / 0.25 22.25 / 0.68 / 0.24 20.52 / 0.64 / 0.30
NeRF Yes 29.27 / 0.86 / 0.07 27.93 / 0.85 / 0.12 26.03 / 0.74 / 0.16 26.18 / 0.74 / 0.18
NeRF-SLAM No 27.98 / 0.79 / 0.11 26.41 / 0.77 / 0.21 25.10 / 0.72 / 0.21 24.62 / 0.71 / 0.26
IL-NeRF No 29.48 / 0.86 / 0.07 27.38 / 0.82 / 0.10 26.11 / 0.76 / 0.14 26.15 / 0.75 / 0.17

Furthermore, in the main text we compare the original NeRF and IL-NeRF only on the ‘Kitchen’ and ‘Counter’ scenes of the Mip-NeRF360 dataset. Here, we give more visualization results of the original NeRF and IL-NeRF on all scenes of the three datasets.

Figures 9 to 16 provide additional insight by presenting a qualitative comparison of the original NeRF and IL-NeRF. Specifically, we show the rendering results on the first task after each incremental training step. It is evident that the original NeRF suffers from catastrophic forgetting, producing images with significant distortions such as noise and blur, whereas IL-NeRF generates highly realistic images with quality comparable to the ground truth. This observation indicates that IL-NeRF is highly effective in mitigating the forgetting problem and addressing the coordinate-shifting issue.

Video Demo. To further demonstrate the performance of IL-NeRF, we provide a video demonstration in the supplementary material, named ‘sm_video.mp4’. In this video, we show rendered images from all baselines and IL-NeRF.

Refer to caption
Figure 9: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Counter’ and ’Bonsai’ in the Mip-NeRF360 dataset.
Refer to caption
Figure 10: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Garden’ and ’Bicycle’ in the Mip-NeRF360 dataset.
Refer to caption
Figure 11: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Room’ and ’Stump’ in the Mip-NeRF360 dataset.
Refer to caption
Figure 12: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Fern’ and ’Flower’ in the LLFF dataset.
Refer to caption
Figure 13: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Fortress’ and ’Horns’ in the LLFF dataset.
Refer to caption
Figure 14: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Leaves’ and ’Orchids’ in the LLFF dataset.
Refer to caption
Figure 15: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Room’ and ’Trex’ in the LLFF dataset.
Refer to caption
Figure 16: Qualitative comparison of the original NeRF and IL-NeRF on the rendering images in the first image data after each incremental training. GT means the ground truth of the training image. The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information. In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process. Testsets are the scenes ’Pinecone’ and ’Vasedeck’ in the NeRF-real360 dataset.

9 Ablation Study

In the main paper, we analyze the effectiveness of the camera coordinate alignment and the pose refinement added to IL-NeRF on the ’Garden’ scene. Here, we give more numerical results for the ablation study.

Tables 8 to 10 show the performance comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on all three datasets (the corresponding results of the original NeRF and NeRF are reported in Tables 5 to 7). As we can see, IL-NeRF outperforms both ablated variants. The results reveal a significant decline in performance on all test data without the transfer matrices (i.e., IL-NeRF w/o TM). This decline can be attributed to estimating camera poses separately for the two tasks, which yields poses in two independent coordinate systems and misleads the model during training. The results of IL-NeRF w/o PR indicate that IL-NeRF with pose refinement outperforms IL-NeRF without it, since the camera poses aligned by the transfer matrices may still contain noise and inaccuracies.

Figure 17 shows the camera pose trajectories of GT and IL-NeRF. We treat the COLMAP estimates from all training images as ground-truth (GT) camera poses. As we can see, IL-NeRF recovers accurate camera poses thanks to the incremental camera pose alignment.

Table 8: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on the Mip-NeRF360 dataset. IL-NeRF outperforms both ablated variants.
Scene Method PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Bicycle w/o TM 22.90 / 0.62 / 0.33 13.64 / 0.32 / 0.66 11.86 / 0.29 / 0.79 11.14 / 0.26 / 0.83
w/o PR 22.76 / 0.61 / 0.33 18.67 / 0.41 / 0.52 20.74 / 0.46 / 0.50 21.06 / 0.46 / 0.52
IL-NeRF 22.90 / 0.62 / 0.33 19.84 / 0.48 / 0.44 22.05 / 0.54 / 0.40 22.34 / 0.55 / 0.40
Bonsai w/o TM 33.54 / 0.93 / 0.07 21.13 / 0.59 / 0.18 19.48 / 0.46 / 0.31 18.40 / 0.40 / 0.37
w/o PR 33.30 / 0.93 / 0.07 28.30 / 0.84 / 0.18 27.80 / 0.81 / 0.21 27.33 / 0.80 / 0.22
IL-NeRF 33.54 / 0.93 / 0.07 30.73 / 0.89 / 0.12 29.77 / 0.86 / 0.16 28.96 / 0.85 / 0.18
Counter w/o TM 32.13 / 0.91 / 0.07 23.91 / 0.58 / 0.18 19.02 / 0.45 / 0.29 13.87 / 0.39 / 0.35
w/o PR 32.12 / 0.91 / 0.07 27.89 / 0.83 / 0.13 27.05 / 0.81 / 0.16 26.47 / 0.79 / 0.17
IL-NeRF 32.13 / 0.91 / 0.07 29.63 / 0.87 / 0.12 28.56 / 0.85 / 0.15 27.82 / 0.83 / 0.17
Garden w/o TM 24.73 / 0.73 / 0.19 17.05 / 0.46 / 0.33 13.37 / 0.37 / 0.45 15.76 / 0.31 / 0.47
w/o PR 24.70 / 0.71 / 0.20 23.34 / 0.67 / 0.20 23.17 / 0.69 / 0.21 22.42 / 0.67 / 0.23
IL-NeRF 24.73 / 0.73 / 0.19 24.80 / 0.70 / 0.22 24.86 / 0.69 / 0.23 24.82 / 0.67 / 0.23
Kitchen w/o TM 31.27 / 0.92 / 0.07 21.08 / 0.59 / 0.15 16.05 / 0.46 / 0.23 14.63 / 0.40 / 0.26
w/o PR 31.17 / 0.91 / 0.08 28.54 / 0.84 / 0.12 27.86 / 0.82 / 0.15 27.48 / 0.78 / 0.18
IL-NeRF 31.27 / 0.92 / 0.07 30.66 / 0.89 / 0.10 29.84 / 0.87 / 0.12 29.34 / 0.86 / 0.13
Room w/o TM 36.04 / 0.96 / 0.04 27.45 / 0.62 / 0.06 17.40 / 0.49 / 0.13 19.98 / 0.43 / 0.18
w/o PR 35.98 / 0.96 / 0.04 32.67 / 0.92 / 0.07 31.12 / 0.86 / 0.17 30.35 / 0.84 / 0.13
IL-NeRF 36.04 / 0.96 / 0.04 34.02 / 0.94 / 0.04 32.35 / 0.92 / 0.07 31.45 / 0.91 / 0.09
Stump w/o TM 25.96 / 0.77 / 0.28 20.78 / 0.44 / 0.48 16.71 / 0.32 / 0.73 15.81 / 0.27 / 0.80
w/o PR 25.62 / 0.77 / 0.28 24.25 / 0.58 / 0.37 23.77 / 0.53 / 0.43 22.43 / 0.50 / 0.46
IL-NeRF 25.96 / 0.77 / 0.28 25.75 / 0.66 / 0.32 25.09 / 0.60 / 0.37 24.89 / 0.58 / 0.39
Table 9: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on the LLFF dataset. IL-NeRF outperforms both ablated variants.
Scene Method PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Fern w/o TM 29.30 / 0.90 / 0.06 18.34 / 0.58 / 0.15 13.79 / 0.43 / 0.25 16.04 / 0.37 / 0.31
w/o PR 29.19 / 0.90 / 0.06 25.37 / 0.82 / 0.11 24.62 / 0.77 / 0.19 24.77 / 0.78 / 0.20
IL-NeRF 29.30 / 0.90 / 0.06 26.68 / 0.87 / 0.10 25.63 / 0.81 / 0.13 25.26 / 0.80 / 0.15
Flower w/o TM 34.22 / 0.96 / 0.01 25.67 / 0.62 / 0.15 20.72 / 0.50 / 0.39 14.37 / 0.43 / 0.44
w/o PR 34.12 / 0.96 / 0.01 30.82 / 0.92 / 0.02 30.69 / 0.91 / 0.03 28.41 / 0.89 / 0.03
IL-NeRF 34.22 / 0.96 / 0.01 31.81 / 0.94 / 0.01 31.11 / 0.94 / 0.02 30.49 / 0.93 / 0.02
Fortress w/o TM 31.6 / 0.85 / 0.11 21.33 / 0.56 / 0.15 20.20 / 0.45 / 0.21 16.71 / 0.39 / 0.34
w/o PR 31.56 / 0.85 / 0.11 30.62 / 0.82 / 0.14 29.75 / 0.79 / 0.15 28.78 / 0.76 / 0.17
IL-NeRF 31.69 / 0.85 / 0.11 31.02 / 0.84 / 0.10 30.33 / 0.84 / 0.11 29.45 / 0.83 / 0.12
Horns w/o TM 31.6 / 0.85 / 0.11 21.33 / 0.56 / 0.15 20.20 / 0.45 / 0.21 16.71 / 0.39 / 0.34
w/o PR 29.92 / 0.89 / 0.07 20.28 / 0.59 / 0.20 15.61 / 0.47 / 0.33 14.44 / 0.41 / 0.48
IL-NeRF 29.92 / 0.89 / 0.07 29.50 / 0.89 / 0.07 29.01 / 0.89 / 0.07 28.96 / 0.87 / 0.09
Leaves w/o TM 25.62 / 0.90 / 0.06 17.01 / 0.58 / 0.20 13.05 / 0.46 / 0.33 11.17 / 0.40 / 0.46
w/o PR 25.51 / 0.90 / 0.06 24.20 / 0.87 / 0.06 23.77 / 0.84 / 0.08 22.69 / 0.83 / 0.09
IL-NeRF 25.62 / 0.90 / 0.06 24.74 / 0.88 / 0.07 24.26 / 0.87 / 0.07 23.88 / 0.86 / 0.08
Orchids w/o TM 25.78 / 0.86 / 0.08 19.50 / 0.54 / 0.15 15.85 / 0.42 / 0.23 12.03 / 0.36 / 0.46
w/o PR 25.68 / 0.85 / 0.08 23.08 / 0.79/ 0.10 22.96 / 0.75 / 0.12 22.23 / 0.72 / 0.13
IL-NeRF 25.78 / 0.86 / 0.08 24.17 / 0.82 / 0.10 23.89 / 0.80 / 0.12 23.67 / 0.77 / 0.13
Room w/o TM 32.50 / 0.92 / 0.08 25.63 / 0.61 / 0.12 20.99 / 0.49 / 0.25 16.25 / 0.43 / 0.34
w/o PR 31.14 / 0.92 / 0.09 30.37 / 0.90 / 0.09 30.56 / 0.91 / 0.09 30.58 / 0.90 / 0.08
IL-NeRF 32.50 / 0.92 / 0.08 31.76 / 0.92 / 0.08 31.58 / 0.92 / 0.08 31.88 / 0.92 / 0.07
Trex w/o TM 28.70 / 0.90 / 0.07 19.35 / 0.60 / 0.19 15.01 / 0.48 / 0.33 13.86 / 0.42 / 0.42
w/o PR 28.50 / 0.90 / 0.07 27.01 / 0.87 / 0.09 26.26 / 0.88 / 0.07 26.54 / 0.88 / 0.08
IL-NeRF 28.70 / 0.90 / 0.07 28.14 / 0.90 / 0.06 27.90 / 0.90 / 0.07 27.81 / 0.90 / 0.06
Table 10: Comparison of IL-NeRF w/o TM, IL-NeRF w/o PR, and IL-NeRF on the NeRF-real360 dataset. IL-NeRF outperforms both ablated variants.
Scene Method PSNR↑ / SSIM↑ / LPIPS↓
G^0 G^1 G^2 G^3
Pinecone w/o TM 26.31 / 0.87 / 0.10 19.82 / 0.52 / 0.25 15.84 / 0.39 / 0.39 14.56 / 0.34 / 0.47
w/o PR 26.22 / 0.84 / 0.16 23.87 / 0.61 / 0.22 22.24 / 0.66 / 0.28 21.13 / 0.66 / 0.32
IL-NeRF 26.31 / 0.87 / 0.10 24.56 / 0.78 / 0.17 23.78 / 0.74 / 0.20 22.93 / 0.72 / 0.23
Vasedeck w/o TM 29.48 / 0.86 / 0.07 22.09 / 0.54 / 0.25 16.05 / 0.40 / 0.35 13.04 / 0.35 / 0.37
w/o PR 29.03 / 0.85 / 0.07 25.30 / 0.74 / 0.18 24.80 / 0.68 / 0.21 24.33 / 0.63 / 0.23
IL-NeRF 29.48 / 0.86 / 0.07 27.38 / 0.82 / 0.10 26.11 / 0.76 / 0.14 26.15 / 0.75 / 0.17
Refer to caption
(a) Bicycle
Refer to caption
(b) Counter
Refer to caption
(c) Garden
Refer to caption
(d) Bonsai
Refer to caption
(e) Room
Refer to caption
(f) Stump
Refer to caption
(g) Fern
Refer to caption
(h) Flower
Refer to caption
(i) Fortress
Refer to caption
(j) Leaves
Refer to caption
(k) Horns
Refer to caption
(l) Meeting room
Refer to caption
(m) Orchids
Refer to caption
(n) Trex
Refer to caption
(o) Pinecone
Figure 17: Camera pose estimation comparison. GT means the camera poses estimated by COLMAP from all the training images. IL-NeRF recovers accurate camera poses thanks to the incremental camera pose alignment.

10 Limitation

For large-scale scenes whose training views have limited overlap, the performance of IL-NeRF may be suboptimal: the limited overlap between views can introduce significant errors into, or even prevent the computation of, the transfer matrices during camera coordinate alignment.