NPR: Nocturnal Place Recognition in Streets

Bingxi Liu^1,2, Yujie Fu³, Feng Lu^2,4, Jinqiang Cui², Yihong Wu³, Hong Zhang^1,∗
¹Southern University of Science and Technology, Shenzhen, China.
²Peng Cheng Laboratory, Shenzhen, China.
³Institute of Automation, Chinese Academy of Sciences, Beijing, China.
⁴Tsinghua University, Shenzhen, China.
liubx@pcl.ac.cn, hzhang@sustech.edu.cn

Abstract

Visual Place Recognition (VPR) is the task of retrieving database images similar to a query photo by comparing it to a large database of known images. In real-world applications, extreme illumination changes caused by query images taken at night pose a significant obstacle that VPR needs to overcome. However, a training set with day-night correspondence for city-scale, street-level VPR does not exist. To address this challenge, we propose a novel pipeline that divides VPR and conquers Nocturnal Place Recognition (NPR). Specifically, we first established a street-level day-night dataset, NightStreet, and used it to train an unpaired image-to-image translation model. Then we used this model to process existing large-scale VPR datasets to generate the VPR-Night datasets and demonstrated how to combine them with two popular VPR pipelines. Finally, we proposed a divide-and-conquer VPR framework and provided explanations at the theoretical, experimental, and application levels. Under our framework, previous methods can significantly improve performance on two public datasets, including the top-ranked method.

1 Introduction

Visual Place Recognition (VPR) is a fundamental task in the fields of computer vision [4, 19, 34, 37, 38, 48] and robotics [16, 22, 23, 25, 28, 43], which involves returning database images that are similar to a query image by comparing it with a known large-scale database image. As previously reported in research, challenges for VPR include database scale [6], repeated structures [38], structural changes [18], occlusion [4], viewpoint[7], visual scale [14], illumination [3, 9, 23, 22, 32, 36, 37, 43, 48], and seasonal changes [25, 35]. Almost all recent VPR methods are learning on large-scale datasets [6, 24, 26, 37, 45]. A neural network is often used to map images into an embedding space that can efficiently distinguish images captured from different places. However, existing datasets restrict the progress of VPR in nighttime street scenes.

Refer to caption — Figure 1: A demo for NPR. A 1.2km $\times$ 1.2km satellite map is presented, where densely purple dots represent the locations where the database images were captured. Our proposed method can retrieve the correct results from a database of 75,984 images, even when provided with a nighttime captured image that includes a significant change in the view direction.

Training sets. Previous research has established large-scale VPR training sets using the downloading interface of Google Street View [1, 2, 4, 6]. These training sets were collected using either car-mounted panoramic cameras or pedestrian-carried street-view cameras and covered almost all of the challenges above, except for nighttime scenes. Some VPR training sets for autonomous driving scenes include nighttime scenes but lack other challenges [23, 43]. For example, cars are limited to roads and mostly use forward-facing cameras, resulting in a lack of view direction changes.

Testing sets. Two representative testing sets for VPR in nighttime street scenes are Tokyo 24/7 [37] and Aachen Day/Night. i) The Tokyo 24/7 dataset comprises daytime, sunset, and nighttime scenes. However, all research [4, 6, 8, 15, 21] has yet to separate these scenes for testing purposes; even the original paper [37] tested only sunset and nighttime together. This oversight obscures the fact that VPR performance is poor at night. ii) The Aachen Day/Night dataset evaluates Visual Localization (VL) but can also assess VPR as it serves as the first stage in a two-stage VL approach [12, 30, 33]. The dataset is divided into daytime and nighttime and ranked on an evaluation system without visible ground truth. The top-ranked method used VPR to recall 50 candidate frames, which is poor in practical applications.

Under the limitations of the training set, a straightforward idea is to apply Image Enhancement (IE) or image-to-image translation for the nighttime queries to daytime queries. However, these methods or their application in VPR have the following problems: i) These methods introduce additional computing resources and time. ii) Learning-based IE’s training sets usually pairs of low-exposure to high-exposure images rather than night-to-day pairs [44], even not outdoors, so they may have poor performance on the VPR dataset. iii) Inaccurate or erroneous night-to-day image-to-image translation can result in a degradation of VPR performance [3].

In this article, we successfully address training set issues through reverse ”Night-to-Day” and propose a method to divide VPR and conquer Nocturnal Place Recognition (NPR). To summarize, our contributions are as follows:

•

We propose a dataset comprising street scene images captured during daytime and nighttime and trained an unpaired image-to-image translation network on this dataset.
•

Using the above translation network, we processed existing VPR datasets to generate the VPR-Night datasets and demonstrated two popular VPR pipelines on how to leverage the VPR-Night datasets.
•

We propose a divide-and-conquer VPR framework and provide theoretical, experimental, and practical explanations for the framework. Furthermore, under this framework, previous methods exhibit superior performance on public datasets.

2 Related Work

Visual Place Recognition has been dominated by deep learning methods. Previous research can be summarized in three main aspects: model, loss function, and data. At the model level, Convolutional Neural Networks (CNNs) [4] or Visual Transformers (ViTs) [41] are typically used as the feature extraction network backbone, followed by a pooling or aggregation layer [4, 29], such as NetVLAD. At the loss function level, triplet loss and contrastive loss are commonly used to increase the Euclidean margin for better feature embedding [8]. However, triplet loss has a significant issue with mining hard negative samples. To address this problem, Liu et al. [21] proposed the statistical attention embedding loss, which efficiently utilizes multiple negative samples for supervision. Moreover, building on [15], Ge et al. [15] utilized network output scores as self-supervised labels and achieved new state-of-the-art results through iterative training. Berton et al. [6] used the Large Margin Cosine Loss (LMCL) in VPR tasks and demonstrated superior performance over triplet loss. At the data level, it can be further divided into two scenarios: road scenes and street scenes. Road scene datasets typically have obvious characteristics [16, 23, 43], such as the camera facing forward and the fixed viewing direction. Although these datasets may contain nighttime data, these characteristics do not suit VPR tasks in street scenes. Street scene datasets are usually obtained from the Google Street View download interface [1, 2, 6], which can be arbitrarily expanded and includes almost all challenges of VPR tasks, except for nighttime scenes. Therefore, it is reasonable to conclude that the performance of existing VPR models in nighttime scenarios benefits from data augmentation during the training phase and the generalization capability during the inference phase.

Nighttime Computer Vision involves addressing classic downstream tasks using nighttime images, which can be categorized into two-stage and one-stage methods. Two-stage methods [3, 46] involve converting nighttime images to daytime images before performing downstream tasks, while one-stage methodologies exploit raw nighttime data for training such tasks [11, 48]. For instance, Anoosheh et al. [3] propose a GAN method for converting nighttime images into daytime images and perform retrieval-based localization. Cui et al. [11] propose a method to perform reverse ISP and dark processing and train an end-to-end model for dark object detection. Xu et al. [46] propose a GAN-based approach to convert nighttime images into daytime images for semantic segmentation of autonomous driving scenes. However, in the context of street-level VPR, two-stage methods suffer from poor generalization and increased computational requirements, whereas the one-stage approach is constrained by a scarcity of suitable training datasets. To address this limitation, we introduce a novel paradigm that integrates the two perspectives by transforming daytime images into nighttime images prior to VPR training.

Image-to-Image Translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set. The classification of these methods is based on the type of training set, which can be pixel-level corresponding image-to-image or unpaired image-to-image [50, 47]. Obtaining large-scale, street-level, pixel-level correspondence day-night image pairs in the real world is an extremely challenging task. Therefore, we have considered the second type of training set. There are two main categories of methods used for unpaired image-to-image translation: two-side mapping and one-side mapping. The former, which includes CycleGAN [50] and DualGAN [47], relies on the cycle-consistency constraint. This constraint ensures that the translated image can be reconstructed using an inverse mapping. However, the bijective projection required by this approach is often too limiting, which has led to the development of one-side mapping techniques. One such approach involves using a geometry distance to preserve content. DistanceGAN [13], for instance, maintains the distances between images within domains, while GCGAN [5] ensures geometry-consistency between input and output. Another technique, CUT [27], maximizes mutual information between the input and output using contrastive learning. In particular, the diffusion model [49] has recently demonstrated outstanding performance in solving image-to-image translation problems. However, its usage necessitates substantial computational resources. After a thorough comparison of several approaches, we proceeded to train our day-to-night image-to-image translation network utilizing NEGCUT [42] as the foundation.

3 The Proposed Dataset

In this chapter, we introduce a series of datasets that have been proposed, which are divided into two categories: NightStreet dataset for training a day-to-night image-to-image translation network and VPR-Night dataset for training night-to-day VPR networks.

3.1 The NightStreet Dataset

Some datasets have been previously identified as potentially useful for day-to-night image-to-image translation. Nevertheless, they are not suitable for VPR tasks in street scenes. As depicted in Figure 2 (a), the current low light enhancement research predominantly focuses on tackling low light conditions (such as those caused by backlighting) or weak exposure. However, utilizing these datasets in reverse fails to accurately capture the changes that occur during day-to-night translation. As demonstrated in Figure 2 (b), there exist significant differences between automatic driving scenes and street scenes. For example, roadways typically constitute more than one-third of the image content, and images captured while the car is in motion tend to exhibit blurriness. Additionally, lighting conditions (e.g., street lights and car tail lights) generally exhibit limited variation. In regards to Figure 2 (c), several time-lapse photography datasets have been proposed in the image generation field. However, these datasets share a common characteristic: the photographers who capture them tend to focus on distant views and skylines, which differ greatly from the urban street scenes.

We constructed day-night image pairs by directly rearranging the query images from Tokyo 24/7 and Aachen Day/Night datasets. Tokyo 24/7 [37] provides 375 daytime and nighttime images each, while Aachen Day/Night [33] includes 234 daytime and 196 nighttime images. Importantly, we did not exploit the ground-truth relationship between the query and database images. Instead, we trained our translation network under unpaired images setting.

3.2 The VPR-Night Datasets

We used a trained day-to-night image-to-image translation model to process the existing VPR datasets and obtained the VPR-Night datasets. Theoretically, this method can be applied to any VPR dataset. In this study, we processed the Pitts-30k and SF-XL-small datasets, and named the new datasets Pitts-30k-N and SF-XL-small-N, respectively.

Pitts-30k-N. Pitts-30k is a subset of Pittsburgh-250k that is widely used in the research of VPR because of its high experimental efficiency [4]. It consists of 30k database images from Google Street View and 24k test queries generated from Street View but taken at different times, and is divided into three roughly equal parts for training, validation, and testing. According to the method designed in section 4.2, we only need to perform day-to-night transfer on the test queries. Therefore, there are 24k night-style query images and 30k daytime database images in Pitts-30k-N dataset.

SF-XL-small-N. The San Francisco eXtra Large (SF-XL) [6] is a city-scale, dense, time-varying dataset. It crawls 3.43M equirectangular panoramas images from Google Street View and divides them into 12 crops, with the entire dataset consisting of 41.2M images. Each crop is labeled with 6 DoF information including GPS and heading orientation. Unlike the Pittsburgh dataset, the training set is not divided into query images and database images. To quickly validate our method, we opted to use a subset of SF-XL, namely SF-XL-small, comprising 100k street view images. Similarly, SF-XL-small-N contains 100k original images and 100k nighttime images.

4 Method

In this chapter, we first introduce the image-to-image translation from day-to-night, and then describe how to train two VPR pipelines on the generated nighttime data. Finally, we explain the rationale and implementation of dividing VPR and conquering NPR.

4.1 Day-to-Night Image-to-Image Translation

In this section, we introduce a contrastive learning-based unpaired image-to-image translation network [27, 42], which is trained on the NightStreet dataset and generates the VPR-Night datasets. Our goal is to preserve the content of daytime images while specifying nighttime style. We define the daytime images from the input domain as $\mathcal{X}\subset\mathbb{R}^{H\times W\times 3}$ and the nighttime images from the output domain as $\mathcal{Y}\subset\mathbb{R}^{H\times W\times 3}$ , and aim to learn a mapping $G:\mathcal{X}\to\mathcal{Y}$ . The NightStreet dataset can be represented as an instance $X=\{x\in\mathcal{X}\}$ and $Y=\{y\in\mathcal{Y}\}$ . The mapping function $G$ is decomposed into an encoder $G_{\mathrm{enc}}$ and a generator $G_{\mathrm{dec}}$ , so the process of producing output images can be represented as:

\hat{y}=G(x)=G_{\mathrm{dec}}(G_{\mathrm{enc}}(x)).

(1)

We encourage the output images to have a nighttime style similar to the target domain by using a discriminator $D$ and the following adversarial loss [50]:

\mathcal{L}_{\mathrm{GAN}}=\mathbb{E}_{y\sim Y}\log{D(y)}+\mathbb{E}_{x\sim X}\log{(1-D(G(x)))}.

(2)

Then, a contrastive learning framework is employed to maintain local content consistency between input $x$ and output $\hat{y}$ . We extract a certain number $N$ of patches from the images, where one pair of patches is located at the same position in $x$ and $\hat{y}$ , denoted as $q$ and $p$ , while the remaining $N-2$ patches are randomly selected from the $x$ . These patches are then fed into a feature extraction network to obtain feature vectors, and an $(N-1)$ classification problem is established. The feature extraction network used here is based on the $G_{\mathrm{enc}}$ , followed by a 2-layer MLP.

4.2 Two Night-to-Day VPR Pipelines

In this section, we introduce strategies for utilizing nighttime data to train two popular VPR frameworks, namely the triplet network shown in Figure 3 (a) and the classification network shown in Figure 3 (b).

Triplet Network. From the training set with GPS, anchor samples $q$ are selected, and data sharing the same GPS label or with close proximity are considered positive samples $p$ , while the remaining samples are regarded as negative samples $n$ . These samples are then fed into $f$ , a pre-trained deep neural network with an aggregation layer, obtaining feature vectors $f(q),f(p),$ and $f(n)$ in the embedding space. Finally, the Euclidean distance between the $f(q)$ and $f(p)$ , as well as $f(q)$ and $f(n)$ , are calculated in the embedding space to obtain a triplet loss, which is formulated as:

L_{\mathrm{triplet}}=l(||f(q)-f(p)||_{2}^{2}-||f(q)-f(n)||_{2}^{2}+m),

(3)

where $l$ is the hingle loss $l(x)=\mathrm{max}(x,0)$ , $m$ is the margin parameter that controls the distance between samples of the same class and different classes.

Considering that we aim to achieve Night-to-Day VPR, we need to construct sample pairs with different styles, i.e., we need to transfer the anchor samples to nighttime style or convert both positive and negative samples to nighttime style. The latter method is obviously less efficient than the former.

Classification Network. VPR can be treated as a classification problem based on labels. Following [6], the training set can be partitioned into classes by splitting it into square geographical cells using UTM coordinates $\{\textit{east},\textit{north}\}$ and further dividing each cell into a set of classes based on the orientation $\{\textit{heading}\}$ of each image. We transformed all images in the database into a night style while preserving their original labels. Then we employed the Large Margin Cosine Loss (LMCL) [40] to train a model:

L_{\mathrm{lmc}}=\frac{1}{N}\sum_{i}{-\log{\frac{e^{s(\cos(\theta_{{y_{i}},i})-m)}}{e^{s(\cos(\theta_{{y_{i}},i})-m)}+\sum_{j\neq y_{i}}{e^{s\cos(\theta_{j,i})}}}}},

(4)

subject to

\begin{split}cos(\theta_{j},i)&={W_{j}}^{T}x_{i},\end{split}

(5)

where $N$ is the numer of training images, $x_{i}$ is the $i$ -th embedding vector corresponding to the ground truth class of $y_{i}$ , the $W_{j}$ is the weight vector of the $j$ -th class, and $\theta_{j}$ is the angle between $W_{j}$ and $x_{i}$ . $s$ and $m$ are two hyperparameters that control the weights of the intra-class distance and inter-class distance in the loss function, respectively.

4.3 Dividing VPR and Conquering NPR

In this section, we introduce concepts from three fields: deep learning, neuroscience, and computer algorithms. Our aim is to explain the rationale behind the need to divide night vision problem from general visual problem¹¹1Although we focus solely on the VPR task, we believe that this framework should be applicable to many other computer vision tasks..

Deep learning. We utilize data-driven network learning, where the training set and test set should have similar distributions [17]. However, previous research on nighttime VPR does not follow this principle, which is the reason why all methods degrade severely in nighttime scenes. When the fitting ability of the model is limited, increasing the amount of nighttime data will also cause the VPR performance in daytime scenes to degrade.

Neuroscience. There are two types of photoreceptor cells distributed on the retina: cone cells and rod cells [39]. Cone cells are primarily responsible for color and detail discrimination and are only activated under relatively adequate lighting conditions. Rod cells, on the other hand, are mainly responsible for identifying the intensity of light and motion, and can be responsive under low-light conditions. Nocturnal animals possess a greater number of rod cells.

Computer algorithm. Divide-and-conquer [10] (D&C) is a classic algorithm that decomposes a large problem into several smaller but structurally similar sub-problems, recursively solving these sub-problems, and then combining the sub-problem solutions, to obtain the original problem solution.

Based on the above knowledge, it is suggested that the original model should be used for daytime scenes, while the VPR-Night trained or fine-tuned model should be used for nighttime scenes. Then we provide three implementation ideas for the ”divide” step: i) training a day-night network for classification, ii) using the global average brightness and an empirical threshold for classification, and iii) using system time and local sunset time ²²2https://www.timeanddate.com/astronomy/ for classification. The first method can achieve better results, but has low practicality in real-world applications. The second method may fail in some specific situations, such as scenes with strong backlight being misclassified as night and scenes with bright light being misclassified as day. The third method is independent of image content, but is only applicable to devices equipped with communication functions, such as smartphones and robots. In our experiments, we use a combination of the second and third methods for decomposing the test set. Finally, we emphasize that the execution of ”divide” in the real world is a one-time process, so the additional computation required for new pipelines can be ignored.

5 Experiments

Method

Backbone

Aggregation

Method

Feature

Dim

Loss

Function

Training

Dataset

R@1

All queries

R@1

Day queries

R@1

Sunset queries

R@1

Night queries

NetVLAD [4] (CVPR’16)

NetVLAD-NPR

NetVLAD-D&C

VGG-16

NetVLAD

32768

Triplet Loss

Pitts-30k

Pitts-30k-N

63.8

62.5 (-1.3)

68.6 (+4.8)

88.6

80.0 (-8.6)

88.6

75.2

64.8 (-10.4)

74.3 (-0.9)

27.6

42.9 (+15.3)

NetVLAD [4] (CVPR’16)

VGG-16

NetVLAD+PCA

4096

Triplet Loss

Pitts-30k

68.9

DIR [29] (T-PAMI’18)

Res-101

GeM+FC

2048

Triplet Loss

Google Landmark

74.9

92.4

81.9

50.5

SARE [21] (ICCV’19)

VGG-16

NetVLAD+PCA

4096

SARE-Joint

Pitts-30k

74.8

SFRS [15] (ECCV’20)

VGG-16

NetVLAD+PCA

4096

SARE-Joint

Pitts-30k

78.5

APPSVR [28] (ICCV’21)

VGG-16

APP+PCA

4096

Triplet Loss

Pitts-30k

77.1

DVG [8] (CVPR’22)

DVG-NPR

DVG-D&C

Res-18

NetVLAD

16384

Triplet Loss

Pitts-30k

Pitts-30k-N

66.7

74.0 (+7.3)

74.9 (+8.2)

90.5

85.7 (-4.8)

90.5

76.2

80.0 (+3.8)

78.1 (+1.9)

33.3

56.2 (+22.9)

CosPlace [6] (CVPR’22)

CosPlace-NPR

CosPlace-D&C

VGG-16

GeM+FC

512

LCM Loss

SF-XL

Fine tuning on small-N

81.9

82.9 (+1.0)

84.1 (+2.2)

90.5

87.6 (-2.9)

90.5

89.5

83.8 (-5.7)

84.8 (-4.7)

65.7

77.1 (+11.4)

CosPlace [6]

CosPlace-NPR

Res-50

GeM+FC

512

LCM Loss

SF-XL

Fine tuning on small-N

88.6

92.1 (+3.5)

95.2

96.2 (+1.0)

90.5

93.3 (+2.8)

80.0

86.7 (+6.7)

Table 1: Comparisons of various methods on Tokyo 24/7 [37]. We reproduced three methods and added their NPR and D&C versions. As previously mentioned, none of the previous methods conducted experiments on the last three columns. Therefore, we currently only report their Recall@1 metrics for all queries.

In this chapter, we describe the implementation details, the test set, and the evaluation methods. We provide quantitative and qualitative results, followed by visualizations that demonstrate the utility of NPR as well as its current limitations.

5.1 Implementation details

Day-to-Night Image-to-Image Translation. We use the same training parameters as in [42] and select the model from the 400th epoch for inference. Since the image size of NightStreet is complex and differs greatly from that of the VPR training set, we scale the NightStreet images proportionally with the shorter side fixed at 640 and then randomly crop them to a size of 512 $\times$ 512 during training. This can maintain scale consistency between the training set and testing set as much as possible. The translation network takes approximately one day to process the SF-XL-small dataset, at a resolution of 512 $\times$ 512 on a single 3090 Ti.

Night-to-Day Visual Place Recognition. The essence of our method is to transfer the nighttime style to existing datasets and combine it with existing pipelines. Therefore, it can be compatible with any VPR method and improve their performance in nighttime scenes. We replicated the work of [4, 8, 6], and applied our method. Since most of the previous methods used the VGG-16 structure, we trained and compared it with this structure. The current top-ranked method on Tokyo 24/7 is based on ResNet-50, so we also conducted experiments based on it. We reproduce the best performance of the baseline method or directly use the model open-sourced by the author. Specifically, data augmentation was not used during NPR training and the learning rate used when fine-tuning the model was 1e-6.

Night-to-Day Visual Localization. We adopted the hierarchical localization framework provided by [30] and replaced the VPR module with Superpoint [31] and Superglue [12] for local feature extraction and matching, respectively. Notably, we adjusted the input image resolution of the VPR module to the recommended parameters for this project.

5.2 Datasets and evaluation methodology

We reported results on multiple publicly available datasets.

Tokyo 24/7 v2 [37] contains 75,984 database images from Google Street View and 315 query images taken using mobile phone cameras. This is an extremely challenging dataset where the queries were taken at daytime, sunset, nighttime, while the database images were only taken at daytime. However, to the best of our knowledge, all works using this dataset have not tested them by category. Even the original work that proposed this dataset evaluated sunset and night as one category. This incident seriously overshadowed the fact that the performance of VPR was poor under the cover of night. We propose two ways to test the dataset: one is to directly split by labels, and the other is to use the information of exchangeable image file format (EXIF) and local sunset time ³³3https://www.timeanddate.com/sun/japan/tokyo?month=9&year=2014 for partitioning. The second method will split the sunset testing set into two categories, test them separately, and then merge the results.

Aachen Day/Night v1 [33] comprises 922 query images captured during day and night, and 4328 reference images collected over a span of two years using hand-held cameras. All query images were captured using mobile phone cameras, hence the Aachen Day-Night dataset considers the scenario of localization using mobile devices, e.g., for augmented or mixed reality applications.

Other datasets. Additionally, we employ the Pitts-30k and SF-XL-test-v1 datasets to investigate the phenomenon of model degradation.

Evaluation metric. In the VPR dataset, we use the recall@N with a 25 meters threshold, i.e., the percentage of queries for which at least one of the first N predictions is within a 25 meters distance from the ground truth position of the query, following standard procedure [4, 37, 28, 6, 21, 15]. In the visual localization dataset, we evaluate VPR using the success rate of localization under different recall@N metrics. Specifically, the visual localization system considers higher localization precision than that required by the VPR dataset.

5.3 Comparison with the State-of-the-art Methods

Method

Backbone

Aggregation Method

Feature Dim

(0.25m, 2°) / (0.5m, 5

\degree

) / (5m, 10°)

R@1

R@5

R@10

R@20

NetVLAD [4]

VGG-16

NetVLAD+PCA

4096

68.4 / 78.6 / 85.7

75.5 / 84.7 / 93.9

82.7 / 88.8 / 95.9

83.7 / 92.9 / 99.0

CosPlace

CosPlace-NPR

Res-50

GeM

512

69.4 / 82.7 / 89.8

71.4 / 84.7 / 93.9

77.6 / 87.8 / 95.9

78.6 / 88.8 / 95.9

77.6 / 87.8 / 95.9

79.6 / 88.8 / 95.9

83.7 / 90.8 / 98.0

84.7 / 91.8 / 98.0

Table 2: Comparisons of three methods on Aachen Day/Night [36]. All three methods employ the same VL pipeline [30], with only the VPR method being replaced. The success rates under different levels of localization accuracy were evaluated at Recall@N.

As shown in Table 1, we have replicated the experimental results of NetVLAD [4], DIR [29], DVG [8], and CosPlace [6] on the Tokyo 24/7 dataset, while citing the results of NetVLAD with PCA [4], SARE [21], SFRS [15], and APPSVR [28] from the Deep Visual Geo-Localization Benchmark [8]. Our method was applied to these three replication methods and labeled with the name suffixes NPR and D&C accordingly. The results can be summarized in a few points:

1) The Recall@1 for nighttime queries are significantly lower than those for daytime queries across all methods. The Recall@1 for all queries is a trade-off result. This confirms our viewpoint that the challenge of Nighttime VPR has been overlooked for a considerable period of time.

2) All methods trained on the VPR-Night datasets showed a significant improvement in performance on nighttime queries. Models with weaker fitting abilities, such as VGG-16 and ResNet-18, showed a corresponding degradation in daytime scenes, whereas networks with stronger fitting abilities, such as ResNet-50, were able to improve Recall@1 for both daytime and nighttime queries.

3) To address the issue of imbalanced performance between daytime and nighttime for small models, a divide-and-conquer algorithm can effectively maintain performance balance, which is particularly beneficial for models deployed on mobile platforms.

As illustrated in Figure 4, we present the variation of accuracy with respect to different recall values, N. Our proposed method shows a significant performance improvement over the baseline approach across all recall values. It is worth noting that the vertical axes of the two plots have different origins.

As shown in Table 2, our proposed method outperforms the baseline approach on the nighttime testing subset of the Aachen dataset. While our localization success rates at R@10 and R@20 are slightly lower than those of NetVLAD with PCA, we think this may be attributed to the fact that the size of the database cannot fully reflect the advantages of our model, and our output dimensionality is also significantly lower than that of NetVLAD.

5.4 Daytime VPR experiments

	Pitts-30k-test		SF-XL-test-v1
	R@1	R@5	R@1	R@5
CosPlace	88.4	94.6	68.1	78.9
CosPlace-NPR	88.0 (-0.4)	94.1 (-0.5)	65.4 (-2.7)	75.8 (-3.1)

Table 3: Results on Pitts-30k-test [4] and SF-XL-test-v1 [6]. Our Model is CosPlace+NPR (VGG-16), while the comparative method is CosPlace (VGG-16). It should be noted that for testing SF-XL-test-v1, we used the SF-XL-small database instead of SF-XL. Therefore, the CosPlace’s result presented here are slightly better than that reported in [6].

While we did not have high expectations for the performance of the NPR during daytime, we still demonstrated its performance on other daytime datasets and compared it with baseline methods. As shown in Table 3, our method exhibits a slight decrease in performance on the Pitts-30k and SF-XL-test-v1 datasets.

5.5 Qualitative experiment

In this section, we present samples from the generated VPR-Night datasets and the results of CosPlace-NPR. As shown in Figure 5, (1) and (2) represent two locations, (a), (c), and (e) are data from different times of day in Google Street View, while (b), (d), and (f) are generated results. Our method preserves scale, appearance, and viewpoint changes while adding a nighttime style.

As shown in Figure 6, our incremental improvement significantly enhances performance at night. These nocturnal query images not only exhibit extreme illumination differences with database images, but also undergo architectural resets, changes in perspective, and scale variations. This is highly consistent with real-world scenarios.

5.6 Limitations

We believe that our theory is not limited, but the current implementation is constrained. i) We require a larger NightStreet dataset to capture a richer range of day-to-night variations. Fortunately, the loose requirement for unpaired images makes the dataset easily extensible. And constructing the NightStreet dataset is undoubtedly less challenging than creating a large-scale, street-level, day-to-night corresponding VPR dataset. ii) Rendering large-scale datasets such as SF-XL requires significant GPU resources. We plan to address both of these limitations in our future work.

6 Conclusions

In this work, we address the challenging problem of nighttime VPR, which has been hindered by the lack of appropriate training datasets and inaccurate testing methodologies. To overcome these issues, we propose a dedicated pipeline for Noctural Place Recognition. First, we construct the NightStreet dataset and train a day-to-night image-to-image translation network. We then apply the network to process existing large-scale VPR datasets and demonstrate how to integrate them into two popular VPR pipelines. Finally, we introduce the idea of differentiating between VPR and NPR, providing a multidimensional interpretation. Our experimental results show that our pipeline significantly improves previous methods.

References

[1] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. Gsv-cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
[2] Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stéphane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh Weaver. Google street view: Capturing the world at street level. Computer, 43(6):32–38, 2010.
[3] Asha Anoosheh, Torsten Sattler, Radu Timofte, Marc Pollefeys, and Luc Van Gool. Night-to-day image translation for retrieval-based localization. In 2019 International Conference on Robotics and Automation (ICRA), pages 5958–5964. IEEE, 2019.
[4] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 40(06):1437–1451, 2018.
[5] Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. Advances in neural information processing systems, 30, 2017.
[6] Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo-localization for large-scale applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4878–4888, 2022.
[7] Gabriele Berton, Carlo Masone, Valerio Paolicelli, and Barbara Caputo. Viewpoint invariant dense matching for visual geolocalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12169–12178, October 2021.
[8] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5396–5407, June 2022.
[9] Bingyi Cao, Andre Araujo, and Jack Sim. Unifying deep local and global features for image search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 726–743. Springer, 2020.
[10] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022.
[11] Ziteng Cui, Guo-Jun Qi, Lin Gu, Shaodi You, Zenghui Zhang, and Tatsuya Harada. Multitask aet with orthogonal tangent regularity for dark object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2553–2562, 2021.
[12] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018.
[13] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, and Dacheng Tao. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2427–2436, 2019.
[14] Yujie Fu, Pengju Zhang, Bingxi Liu, Zheng Rong, and Yihong Wu. Learning to reduce scale differences for large-scale invariant image matching. IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[15] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 369–386. Springer, 2020.
[16] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[18] Daniel Cabrini Hauagge and Noah Snavely. Image matching using local symmetry features. In 2012 IEEE conference on computer vision and pattern recognition, pages 206–213. IEEE, 2012.
[19] Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14141–14152, 2021.
[20] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (proceedings of SIGGRAPH), 33(4), 2014.
[21] Liu Liu, Hongdong Li, and Yuchao Dai. Stochastic attraction-repulsion embedding for large scale image localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2570–2579, 2019.
[22] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. ieee transactions on robotics, 32(1):1–19, 2015.
[23] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
[24] Michael J Milford and Gordon F Wyeth. Mapping a suburb with a single camera using a biologically inspired slam system. IEEE Transactions on Robotics, 24(5):1038–1053, 2008.
[25] Michael J Milford and Gordon F Wyeth. Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights. In 2012 IEEE international conference on robotics and automation, pages 1643–1649. IEEE, 2012.
[26] Kohei Ozaki and Shuhei Yokoo. Large-scale landmark retrieval/recognition under a noisy and diverse dataset. ArXiv, 2019.
[27] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 319–345. Springer, 2020.
[28] Guohao Peng, Jun Zhang, Heshan Li, and Danwei Wang. Attentional pyramid pooling of salient visual residuals for place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 885–894, 2021.
[29] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
[30] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
[31] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.
[32] Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys. Lamar: Benchmarking localization and mapping for augmented reality. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII, pages 686–704. Springer, 2022.
[33] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8601–8610, 2018.
[34] Torsten Sattler, Akihiko Torii, Josef Sivic, Marc Pollefeys, Hajime Taira, Masatoshi Okutomi, and Tomas Pajdla. Are large-scale 3d models really necessary for accurate visual localization? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1646, 2017.
[35] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. In Proc. of workshop on long-term autonomy, IEEE international conference on robotics and automation (ICRA), page 2013, 2013.
[36] Carl Toft, Will Maddern, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, et al. Long-term visual localization revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):2074–2088, 2020.
[37] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 14, 2017.
[38] Akihiko Torii, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. Visual place recognition with repetitive structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2346–2359, 2015.
[39] Jonathan D Trobe. The neurology of vision. Oxford university press, 2001.
[40] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
[41] Ruotong Wang, Yanqing Shen, Weiliang Zuo, Sanping Zhou, and Nanning Zheng. Transvpr: Transformer-based place recognition with multi-level attention aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13648–13657, 2022.
[42] Weilun Wang, Wengang Zhou, Jianmin Bao, Dong Chen, and Houqiang Li. Instance-wise hard negative example generation for contrastive learning in unpaired image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14020–14029, 2021.
[43] Frederik Warburg, Soren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2626–2635, 2020.
[44] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.
[45] T. Weyand, A. Araujo, B. Cao, and J. Sim. Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval. In Proc. CVPR, 2020.
[46] Weihuang Xu, Nasim Souly, and Pratik Prabhanjan Brahma. Reliability of gan generated data to train and validate perception systems for autonomous vehicles. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 171–180, 2021.
[47] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849–2857, 2017.
[48] Mubariz Zaffar, Sourav Garg, Michael Milford, Julian Kooij, David Flynn, Klaus McDonald-Maier, and Shoaib Ehsan. Vpr-bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. International Journal of Computer Vision, 129(7):2136–2174, 2021.
[49] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. arXiv preprint arXiv:2207.06635, 2022.
[50] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.