Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection
Abstract
Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, they either neglect inter-modal feature interaction during alignment or fail to effectively align features at the same spatial location across modalities. To alleviate these problems, we propose a new alignment model called Radar Camera Alignment (RCAlign). Specifically, we design a Dual-Route Alignment (DRA) module based on contrastive learning to align and fuse the features between radar and camera. Moreover, considering the sparsity of radar BEV features, a Radar Feature Enhancement (RFE) module is proposed to densify radar BEV features with a knowledge distillation loss. Experiments show that RCAlign achieves a new state-of-the-art on the public nuScenes benchmark for radar-camera 3D object detection. Furthermore, RCAlign achieves a significant performance gain (4.3% NDS and 8.4% mAP) in real-time 3D detection compared to the latest state-of-the-art method (RCBEVDet).
Index Terms:
3D Object Detection, Multi-Modal, Alignment, Contrastive Learning.
I Introduction
Autonomous driving systems are designed to enable vehicles to perform driving tasks without human intervention. Therefore, it is crucial to accurately perceive the environment around the vehicle. 3D object detection is an important research area in autonomous driving perception as it allows for accurate localization and classification of objects in the surrounding environment. Recently, multi-view 3D object detection [1, 2, 3, 4, 5, 6] has made significant strides in performance. Currently, cameras are widely used in practice as the main acquisition device for 3D object detection. However, cameras perform poorly in harsh environments and do not model the depth of objects well [7]. Radar is used as an auxiliary sensor because it is well suited to addressing the above problems and is low-cost.
Combining the rich semantic information provided by cameras with the accurate depth information captured by radar can lead to more robust and reliable 3D object detection. However, due to the domain gap between modalities, accurately aligning the same object acquired by the radar and camera sensors during fusion is crucial. To better align the data of these two modalities, researchers have proposed a number of alignment methods, which can be broadly classified into two groups: dense bird's-eye-view (BEV) alignment methods [8] and sparse BEV alignment methods [9, 10, 11].

The dense BEV alignment methods [8] use deformable attention to update the features of one modality from another. As shown in Fig. 1 (a), these methods use image BEV features (Query) to find features that represent the same object in radar BEV features (Value) via deformable attention [12], which are then used to update the corresponding image BEV features. The same process is also applied with radar BEV features as the Query. Deformable attention is effective at correlating features representing the same object across modalities, but it cannot guarantee that the updated features of the two modalities precisely represent the same object at corresponding locations, which makes the subsequent fusion block sub-optimal. Different from these approaches, the sparse BEV alignment methods [9, 10, 11] use deformable attention to extract features from the two modalities separately, which are then used to update a set of sparse queries. Specifically, as shown in Fig. 1 (b), these methods use sparse queries to model the objects in the environment; deformable attention is then employed to find related features in the image features and radar BEV features to update the sparse queries, respectively. Using sparse queries as an intermediate pivot is advantageous for aligning features representing the same object in radar and images. However, these methods do not consider feature interaction between modalities. In summary, existing alignment methods either neglect inter-modal feature interaction during alignment or fail to effectively align features at the same spatial location across modalities.
Based on the above analysis, we redesign the routing path of sparse queries in the sparse BEV alignment methods and propose a novel radar camera alignment model named RCAlign for 3D object detection. RCAlign not only enables feature interaction between modalities during alignment but also guarantees the alignment of features representing the same object across modalities, ensuring consistency and accuracy in feature representation. Specifically, as shown in Fig. 1 (c), we develop a Dual-Route Alignment (DRA) module that updates the sparse queries through two routing paths respectively. The updated sparse queries from the two paths are then aligned using a contrastive loss. Besides, considering the sparsity of the radar BEV features, a Radar Feature Enhancement (RFE) module is proposed, which enhances the representation of radar BEV features via a knowledge distillation loss and allows a more effective fusion with image features. The experimental results on the nuScenes benchmark demonstrate the effectiveness of RCAlign. Compared with existing works, our work makes the following contributions.
• We propose a novel radar camera alignment model called RCAlign based on the sparse BEV alignment methods for 3D object detection.
• In our model, a Dual-Route Alignment module is proposed to more efficiently align the features of two modalities and perform inter-modal interactions during the alignment process.
• We design a Radar Feature Enhancement (RFE) module to obtain enhanced radar BEV features by knowledge distillation loss for better fusion with image features.
• Extensive experiments demonstrate that RCAlign achieves a new state-of-the-art performance on the nuScenes benchmark, with a result of 67.3% NDS on the nuScenes test set.
II Related work
In this section, we briefly introduce recent developments related to our study, focusing on three topics: camera-based 3D object detection, radar-camera 3D object detection, and contrastive learning.
II-A Camera-based 3D Object Detection
The central focus of recent works on 3D object detection has shifted from PV-based methods [13, 14, 15, 16, 17] to BEV-based methods [1, 3, 18, 19, 20, 21] due to various challenges encountered in PV, such as occlusion and viewpoint transformation. The BEV-based methods can be divided into dense BEV-based methods [4, 20, 22, 21, 23, 24, 25] and query-based methods [2, 3, 18, 19, 26]. The dense BEV-based methods usually employ Lift-Splat-Shoot (LSS) [27] to transform image features into BEV space. The BEVDet series [1, 28, 29] successfully introduces LSS to 3D object detection and achieves excellent performance. Considering the inaccuracy of LSS for depth estimation, BEVDepth [21] introduces lidar to supervise the predicted depth. After that, in order to obtain more accurate velocity information, temporal information [28, 20, 21] is introduced, further improving the performance of 3D object detection. The query-based methods usually predefine some sparse queries to model the objects and then use the queries to index the related image features. After DETR3D [19] first introduced sparse queries for 3D object detection, numerous methods based on sparse queries have been proposed. PETR [2] uses 3D positional encoding to transform image features into 3D position-aware features for end-to-end 3D object detection. In StreamPETR [3], a memory queue that also consists of sparse queries is designed to store temporal information. Since query-based methods do not need to model the background and are more efficient than dense BEV-based methods, we design RCAlign based on the query-based paradigm in this work.
II-B Radar-camera 3D object detection
Radar excels at acquiring both the Doppler velocity and distance of an object, remains virtually unaffected by challenging environmental conditions, and is cost-effective. Therefore, radar can serve as an excellent auxiliary sensor to the camera for low-cost 3D object detection. Recently, 3D object detection algorithms based on radar and camera fusion [11, 8, 30, 31, 32, 33] have proven effective and have achieved remarkable performance. Similar to the research direction in 3D object detection for multi-view images, researchers initially focused on the PV space [30, 31, 32], establishing methods for associating radar points with objects in images. CenterFusion [30] uses a frustum to associate radar points with objects. Considering the inaccuracy of the radar projection position, RADIANT [31] corrects the projected radar point positions and designs specific rules to associate radar points with objects in the image using Euclidean coordinates, while CRAFT [32] uses polar coordinates for this association. For the BEV-based methods, in order to obtain more accurate image BEV features, CRN [11] utilizes radar assistance to transform image features from PV into BEV. It then concatenates the radar features and image features to serve as sparse queries and employs deformable attention to align the features from both modalities. Different from CRN in how the sparse queries are obtained, SparseFusion3D [10] uses pre-defined sparse queries as an intermediate state to align the features from the two modalities; this type of method does not perform inter-modal interaction during the alignment process. RCBEVDet [8] designs a new backbone to extract radar features and then uses radar features or image features as sparse queries to query the features of the other modality for alignment. However, it fails to achieve true alignment between the features from different modalities. To alleviate the above problems, we design a dual-route alignment module with a contrastive loss that effectively aligns features representing the same object across modalities.

II-C Contrastive Learning
Contrastive learning [34, 35, 36] aims to pull samples of the same class closer while pushing samples of different classes further apart. It has been applied to object detection to pull together features that represent the same object. SoCo [37] adopts contrastive learning to maximize the similarity of features representing the same object obtained through different augmentation methods. In order to obtain contrastive-aware proposal embeddings, FSCE [38] designs a contrastive branch that maximizes within-category agreement and cross-category disagreement. CAT-Det [39] utilizes contrastive learning for data augmentation at the point and object levels, significantly improving detection accuracy. Thus, contrastive learning encourages similarity among features that represent the same object. Here, we utilize contrastive learning to align the updated sparse queries produced by the two routing paths.
III Proposed Method
As analyzed in Sec. I, existing alignment methods for radar and camera fusion either neglect inter-modal feature interaction during alignment or fail to align features at corresponding positions. Therefore, we propose a novel model, Radar Camera Alignment by Contrastive Learning (RCAlign), to alleviate these problems. In this section, we first describe the overall structure of RCAlign. After that, the multimodal feature extraction and the proposed DRA and RFE modules of RCAlign are presented in detail.
III-A Overall Architecture
As shown in Fig. 2, the proposed RCAlign includes an image backbone, a radar backbone, a radar head, a dual-route alignment module, a radar feature enhancement module and a 3D detection head. During the multimodal feature extraction stage, the image backbone and the radar backbone are used to extract multi-view image PV features and radar BEV features, respectively. Subsequently, the radar BEV features are passed through the radar head to obtain the centres of 3D boxes and radar heatmaps. The radar points corresponding to the top probability values in the radar heatmaps are then selected as part of the sparse queries. In addition to these radar queries, the sparse queries also contain initial queries and temporal queries. The sparse queries, image PV features and radar BEV features are input into the proposed DRA module to align and fuse the features of the two modalities. After that, the fused features are used to predict the categories and 3D boxes. Finally, considering the sparsity of the radar BEV features, we project the centres of the predicted 3D boxes into the BEV grid as occupancy features; the occupancy features and the radar BEV features are fed into the RFE module to obtain the enhanced radar BEV features. The enhanced radar features are used to distill the original radar features, guiding the network to optimize toward acquiring denser radar features.
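The pseudocode below sketches this pipeline at a high level; all module and tensor names are illustrative assumptions rather than the released implementation.

```python
import torch

def rcalign_forward(images, radar_points, model):
    """Minimal sketch of the RCAlign pipeline; module names are assumed placeholders."""
    # 1. Multimodal feature extraction.
    img_pv_feats = model.image_backbone(images)            # multi-view PV features
    radar_bev_feats = model.radar_backbone(radar_points)   # radar BEV features

    # 2. Radar head predicts heatmaps and 3D box centres; top-k points become radar queries.
    heatmap, centers = model.radar_head(radar_bev_feats)
    radar_queries = model.select_topk_queries(heatmap, centers)

    # 3. Sparse queries = radar queries + initial queries + temporal queries.
    queries = torch.cat([radar_queries, model.init_queries, model.temporal_queries], dim=1)

    # 4. Dual-Route Alignment aligns and fuses image PV / radar BEV features into the queries.
    fused_queries, contrastive_loss = model.dra(queries, img_pv_feats, radar_bev_feats)

    # 5. Detection head predicts categories and 3D boxes from the fused queries.
    preds = model.det_head(fused_queries)

    # 6. RFE: predicted centres -> occupancy features -> enhanced radar BEV features,
    #    which then distill the original radar BEV features (teacher detach is an assumption).
    occupancy = model.boxes_to_occupancy(preds["boxes"])
    enhanced_radar = model.rfe(radar_bev_feats, occupancy)
    kd_loss = ((radar_bev_feats - enhanced_radar.detach()) ** 2).mean()

    return preds, contrastive_loss, kd_loss
```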

III-B Multimodal Feature Extraction
For the image branch, given the images from $N$ views, we use the backbone and the FPN [40] to extract multi-view image PV features. For the radar branch, given the radar data corresponding to the images, we first aggregate five consecutive frames of radar data. During aggregation, we use the measured velocity to compensate the position coordinates of radar points from the previous frames. Specifically, for a radar point at frame $t-1$, we compensate its position using the following equation:

(1)   $(x_t, y_t) = (x_{t-1}, y_{t-1}) + v_{t-1} \cdot \Delta t$

where $(x_{t-1}, y_{t-1})$ and $(x_t, y_t)$ denote the coordinates of the radar point at time $t-1$ and $t$, and $v_{t-1}$ and $\Delta t$ denote the velocity of the radar point at time $t-1$ and the time interval between $t-1$ and $t$, respectively. Finally, we use PointPillars [41] to extract the radar BEV features.
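As a concrete illustration of this aggregation step, the sketch below merges several radar sweeps into the current frame using Eq. (1); the column layout of the point arrays is an assumption for the example.

```python
import numpy as np

def aggregate_radar_frames(frames, timestamps):
    """Aggregate consecutive radar sweeps, compensating past (x, y) positions
    with the measured velocities as in Eq. (1).

    frames:     list of (N_i, D) arrays, frames[-1] is the current sweep;
                columns 0:2 are (x, y) and 2:4 are (vx, vy) -- assumed layout.
    timestamps: list of sweep timestamps in seconds, aligned with `frames`.
    """
    t_cur = timestamps[-1]
    compensated = []
    for pts, t in zip(frames, timestamps):
        pts = pts.copy()
        dt = t_cur - t                    # time gap to the current frame
        pts[:, 0:2] += pts[:, 2:4] * dt   # (x, y) += (vx, vy) * dt
        compensated.append(pts)
    return np.concatenate(compensated, axis=0)
```

For example, the five sweeps used by RCAlign would be passed as `frames`, and the merged point cloud is then fed to PointPillars.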
III-C Dual-Route Alignment
In order to better align the features of the two modalities and to perform inter-modal feature interaction during the alignment process, we propose the Dual-Route Alignment (DRA) module. The DRA achieves alignment and fusion by aggregating image PV features and radar BEV features into the sparse queries. As shown in Fig. 3 (a), DRA includes two routing paths, each applying two deformable attention [12] operations for inter-modal feature interaction. Subsequently, the updated sparse queries from the two paths are aligned and fused using the fusion block. Specifically, for the upper path, given the sparse queries $Q$, the radar BEV features $F_R$ and the reference points $p$, the aggregation of radar BEV features can be formulated as:

(2)   $Q_R = \mathrm{DA}(Q, p, F_R) = \sum_{h=1}^{H} W_h \sum_{k=1}^{K} A_{hk} \cdot W'_h F_R(p + \Delta p_{hk})$
where $H$ and $K$ indicate the number of attention heads and sampling points, respectively, $A_{hk}$ and $\Delta p_{hk}$ stand for the attention weight and sampling offset, respectively, and $W_h$ and $W'_h$ denote the weights of the linear projections. We denote the updated sparse queries as $Q_R$. For the image PV features $F_I$, the aggregation of $F_I$ using $Q_R$ can be represented as follows:

(3)   $Q_{RI} = \mathrm{DA}(Q_R, p, F_I) = \sum_{h=1}^{H} V_h \sum_{k=1}^{K} \tilde{A}_{hk} \cdot V'_h F_I(p + \Delta \tilde{p}_{hk})$
where $V_h$ and $V'_h$ indicate the weights of the linear projections, and $\tilde{A}_{hk}$ and $\Delta \tilde{p}_{hk}$ indicate the attention weight and sampling offset, respectively. The updated sparse queries are denoted as $Q_{RI}$. For the lower path, the same operations are performed, but the image PV features are aggregated before the radar BEV features; the resulting updated sparse queries are denoted as $Q_{IR}$. After the aforementioned operations, $Q_{RI}$ and $Q_{IR}$ are intended to represent the same object at corresponding positions, and their features should have similar semantics. Therefore, we introduce contrastive learning [42] to align the features at corresponding positions of $Q_{RI}$ and $Q_{IR}$, which can be calculated as:

(4)   $\hat{Q}_{RI} = \mathrm{Norm}(Q_{RI}), \quad \hat{Q}_{IR} = \mathrm{Norm}(Q_{IR}),$
      $S = \tau \cdot \big(\hat{Q}_{RI} \otimes \hat{Q}_{IR}^{\top}\big),$
      $\mathcal{L}_{cl} = \tfrac{1}{2}\big(\mathrm{CE}(S, T) + \mathrm{CE}(S^{\top}, T)\big)$
where $\hat{Q}_{RI}$ and $\hat{Q}_{IR}$ stand for the normalized features, and $\otimes$, $\tau$ and $T$ denote matrix multiplication, the logit scale, and the target matrix, respectively. $\mathrm{CE}$ represents the cross-entropy loss. Finally, we use element-wise addition to obtain the fusion queries $Q_F = Q_{RI} + Q_{IR}$, which are utilized for the 3D object detection task.
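As a concrete illustration of Eq. (4), the snippet below computes a symmetric, CLIP-style [42] contrastive loss between the two sets of updated queries; the tensor shapes and the scalar logit scale are assumptions of this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dual_route_contrastive_loss(q_ri, q_ir, logit_scale):
    """Symmetric contrastive loss aligning queries from the two routing paths.

    q_ri, q_ir : (N, C) updated sparse queries from the upper / lower paths.
    logit_scale: scalar temperature (e.g. a learnable parameter, as in CLIP).
    """
    # L2-normalize the query features.
    q_ri = F.normalize(q_ri, dim=-1)
    q_ir = F.normalize(q_ir, dim=-1)

    # Pairwise similarity logits between the two paths.
    logits = logit_scale * q_ri @ q_ir.t()              # (N, N)

    # Target: query i from one path should match query i from the other.
    target = torch.arange(q_ri.size(0), device=q_ri.device)

    # Symmetric cross-entropy over rows and columns.
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```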
III-D Radar Feature Enhancement
The radar data are obtained by aggregating five consecutive frames, but they are still sparse compared to lidar. Since the number of fusion queries greatly exceeds the number of ground truth boxes, multiple queries may predict the same box. Most of the predicted boxes representing the same object have low classification probabilities and are somewhat biased relative to the ground truth boxes, yet they still surround them. Therefore, it is feasible to use the predicted centres of the 3D boxes to enhance the radar features. Based on this, we design the radar feature enhancement module.
As illustrated in Fig. 3 (b), the original radar BEV features and the occupancy features are fed into RFE to obtain the enhanced radar features. Specifically, we first transform the centres of the 3D boxes predicted by the fusion queries into BEV grids to get the occupancy features. The radar BEV features and occupancy features are then concatenated along the channel dimension, and the concatenated features are fed into a 3-layer conv block to obtain the enhanced radar BEV features. The conv block is composed of a convolutional layer, followed by batch normalization [43] and ReLU [44]. After obtaining the enhanced radar BEV features, on the one hand, they are input to the radar head as a second auxiliary task, sharing parameters with the radar head mentioned in Sec. III-A. On the other hand, they are used to distill the original radar features by knowledge distillation [45], guiding the network to optimize toward acquiring the enhanced radar features. The knowledge distillation loss can be calculated as follows:
(5)   $\mathcal{L}_{kd} = \dfrac{1}{HWC} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \big( F_R^{(i,j,c)} - F_R^{e(i,j,c)} \big)^2$

where $H$, $W$ and $C$ represent the height, width, and number of channels of the radar BEV features $F_R$, respectively, and $F_R^{e}$ denotes the enhanced radar BEV features.
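A minimal sketch of the RFE computation is given below, assuming an MSE-style feature distillation and a plain conv-BN-ReLU stack; the layer widths, the single-channel occupancy map, and the detached teacher are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadarFeatureEnhancement(nn.Module):
    """Sketch of the RFE module: concatenate radar BEV features with box-centre
    occupancy features, refine them with a 3-layer conv block, and distill the
    original radar features toward the enhanced ones."""

    def __init__(self, radar_channels, occ_channels=1):
        super().__init__()
        layers, c_in = [], radar_channels + occ_channels
        for _ in range(3):  # 3-layer conv block: conv -> BN -> ReLU
            layers += [nn.Conv2d(c_in, radar_channels, 3, padding=1),
                       nn.BatchNorm2d(radar_channels),
                       nn.ReLU(inplace=True)]
            c_in = radar_channels
        self.block = nn.Sequential(*layers)

    def forward(self, radar_bev, occupancy):
        # radar_bev: (B, C, H, W); occupancy: (B, 1, H, W) grid of predicted box centres.
        enhanced = self.block(torch.cat([radar_bev, occupancy], dim=1))
        # Distillation pulls the original radar features toward the enhanced ones.
        kd_loss = F.mse_loss(radar_bev, enhanced.detach())
        return enhanced, kd_loss
```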
III-E Loss Function
Here, we summarize the overall loss function, which includes the 3D object detection task losses $\mathcal{L}_{det}$, the losses corresponding to the two radar heads $\mathcal{L}_{rh1}$ and $\mathcal{L}_{rh2}$, the contrastive loss $\mathcal{L}_{cl}$ and the knowledge distillation loss $\mathcal{L}_{kd}$. The final optimization objective can be expressed as:

(6)   $\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{rh1} + \mathcal{L}_{rh2} + \lambda_{cl} \mathcal{L}_{cl} + \lambda_{kd} \mathcal{L}_{kd}$

where $\lambda_{cl}$ and $\lambda_{kd}$ are hyper-parameters.
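For completeness, a one-line sketch of Eq. (6) with the weights reported in Sec. IV-A ($\lambda_{cl}=1$, $\lambda_{kd}=5$); the argument names are placeholders.

```python
def total_loss(l_det, l_rh1, l_rh2, l_cl, l_kd, lambda_cl=1.0, lambda_kd=5.0):
    """Eq. (6): detection loss + two radar-head losses + weighted contrastive and KD terms."""
    return l_det + l_rh1 + l_rh2 + lambda_cl * l_cl + lambda_kd * l_kd
```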
Method | Modality | Backbone | Input Size | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|
BEVDet [1] | C | R50 | 256×704 | 0.392 | 0.312 | 0.691 | 0.272 | 0.523 | 0.909 | 0.247 | 15.6 |
BEVDepth [21] | C | R50 | 256×704 | 0.475 | 0.351 | 0.639 | 0.267 | 0.479 | 0.428 | 0.198 | 11.6 |
SOLOFusion [46] | C | R50 | 256×704 | 0.534 | 0.427 | 0.567 | 0.274 | 0.411 | 0.252 | 0.188 | 11.4 |
StreamPETR [3] | C | R50 | 256×704 | 0.540 | 0.432 | 0.581 | 0.272 | 0.413 | 0.295 | 0.195 | 27.1 |
SparseBEV [47] | C | R50 | 256×704 | 0.545 | 0.432 | 0.606 | 0.274 | 0.387 | 0.251 | 0.186 | - |
BEVNeXt [4] | C | R50 | 256×704 | 0.548 | 0.437 | 0.550 | 0.265 | 0.427 | 0.260 | 0.208 | - |
CenterFusion [30] | C+R | DLA34 | 448×800 | 0.453 | 0.332 | 0.649 | 0.263 | 0.535 | 0.540 | 0.142 | - |
RCBEV4d [48] | C+R | Swin-T | 256×704 | 0.497 | 0.381 | 0.526 | 0.272 | 0.445 | 0.465 | 0.185 | 7.5 |
CRAFT [32] | C+R | DLA34 | 448×800 | 0.517 | 0.411 | 0.494 | 0.276 | 0.454 | 0.486 | 0.176 | 4.1 |
CRN [11] | C+R | R50 | 256×704 | 0.560 | 0.490 | 0.487 | 0.277 | 0.542 | 0.344 | 0.197 | 20.4 |
RCBEVDet [8] | C+R | R50 | 256×704 | 0.568 | 0.453 | 0.486 | 0.285 | 0.404 | 0.220 | 0.192 | 21.3 |
RCAlign | C+R | R50 | 256×704 | 0.611 | 0.537 | 0.463 | 0.261 | 0.462 | 0.192 | 0.194 | 14.6 |
DETR3D∗ [19] | C | R101 | 900×1600 | 0.434 | 0.349 | 0.716 | 0.268 | 0.379 | 0.842 | 0.200 | 3.7 |
PETR∗ [2] | C | R101 | 900×1600 | 0.442 | 0.370 | 0.711 | 0.267 | 0.383 | 0.865 | 0.201 | 1.7 |
BEVFormer∗ [20] | C | R101 | 900×1600 | 0.517 | 0.416 | 0.673 | 0.274 | 0.372 | 0.394 | 0.198 | 1.7 |
BEVDepth [21] | C | R101 | 512×1408 | 0.535 | 0.412 | 0.565 | 0.266 | 0.358 | 0.331 | 0.190 | 5.0 |
SOLOFusion [46] | C | R101 | 512×1408 | 0.582 | 0.483 | 0.503 | 0.264 | 0.381 | 0.246 | 0.207 | - |
SparseBEV∗ [47] | C | R101 | 512×1408 | 0.592 | 0.501 | 0.562 | 0.265 | 0.321 | 0.243 | 0.195 | - |
StreamPETR∗ [3] | C | R101 | 512×1408 | 0.592 | 0.504 | 0.569 | 0.262 | 0.315 | 0.257 | 0.199 | 6.4 |
BEVNeXt∗ [4] | C | R101 | 512×1408 | 0.597 | 0.500 | 0.487 | 0.260 | 0.343 | 0.245 | 0.197 | 4.4 |
MVFusion∗ [33] | C+R | R101 | 900×1600 | 0.455 | 0.380 | 0.675 | 0.258 | 0.372 | 0.833 | 0.196 | - |
CRN [11] | C+R | R101 | 512×1408 | 0.592 | 0.525 | 0.460 | 0.273 | 0.443 | 0.352 | 0.180 | 7.2 |
RCAlign∗ | C+R | R101 | 512×1408 | 0.645 | 0.570 | 0.457 | 0.257 | 0.319 | 0.186 | 0.187 | 5.4 |
Method | Car | Truck | Bus | Trailer | C.V. | Ped. | M.C. | Bicycle | T.C. | Barrier | mAP |
---|---|---|---|---|---|---|---|---|---|---|---|
CenterFusion [30] | 0.524 | 0.265 | 0.362 | 0.154 | 0.055 | 0.389 | 0.305 | 0.299 | 0.563 | 0.470 | 0.332 |
RCBEV4d [48] | 0.683 | 0.323 | 0.369 | 0.148 | 0.108 | 0.443 | 0.357 | 0.270 | 0.552 | 0.557 | 0.381
CRAFT [32] | 0.696 | 0.376 | 0.473 | 0.201 | 0.107 | 0.462 | 0.395 | 0.310 | 0.571 | 0.511 | 0.411 |
CRN [11] | 0.736 | 0.445 | 0.556 | 0.220 | 0.154 | 0.502 | 0.547 | 0.489 | 0.614 | 0.638 | 0.490 |
RCAlign | 0.777 | 0.488 | 0.589 | 0.234 | 0.231 | 0.617 | 0.596 | 0.534 | 0.667 | 0.636 | 0.537 |
IV Experiments
In this section, we conduct extensive experiments on RCAlign using the nuScenes dataset [49]. First, we compare RCAlign with other relevant camera-based and radar-camera fusion methods on 3D object detection and tracking tasks. After that, rigorous ablation experiments are given to demonstrate the effectiveness of the proposed modules. Then, we analyse key parameters that may affect the performance of the model. Finally, we conduct a robustness analysis. The above experimental results validate the effectiveness of the proposed model.
IV-A Experimental Settings
Datasets and Metrics. The nuScenes dataset [49] is a publicly available large-scale dataset for autonomous driving, comprising 1000 driving scenes. Among these, 700 scenes are designated for training, 150 for validation, and the remaining 150 for testing. Each scene spans a duration of 20 seconds and is annotated at a frequency of 2 Hz. The data in each scene are captured by 6 cameras providing a full 360-degree field of view, along with 5 radars and 1 lidar. The annotated 3D boxes initially cover 23 classes, which are aggregated into 10 categories for the 3D object detection evaluation. The mean Average Precision (mAP), nuScenes Detection Score (NDS), mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE) and mean Average Attribute Error (mAAE) are used as the evaluation metrics for 3D object detection. For 3D object tracking, we compare the Average Multi-Object Tracking Accuracy (AMOTA), Average Multi-Object Tracking Precision (AMOTP), number of False Positives (FP), number of False Negatives (FN) and number of Identity Switches (IDS) with those of other methods.
Implementation Details. We use ResNet-50 [50], ResNet-101 and V2-99 [51] backbones to extract image features. The R50 and R101 backbones are used for the nuScenes val set and are pre-trained on ImageNet [52] and nuImages [49]. The V2-99 backbone is used for the nuScenes test set and initialized from DD3D [53]. We use six attributes of each radar point as the radar features. The point cloud range is set to [-51.2m, 51.2m] for the X and Y axes, and [-5.0m, 3.0m] for the Z axis. The radar backbone follows the architecture of FUTR3D [9]. The radar head adopts CenterNet [14]; as an auxiliary task, we simplify it by retaining only the prediction of the heatmap and the centres of the 3D boxes. The 3D detection head is the same as in DETR3D [19].
During training, 2D supervised losses [54] and query denoising [55] are introduced. All experiments are conducted without CBGS [56] or TTA strategies. For image and BEV data augmentation, we follow PETR [2], and the radar data undergo the corresponding transformation when BEV data augmentation is performed. Following StreamPETR [3], the model is trained for 60 epochs when compared with other state-of-the-art methods and for 24 epochs in the ablation experiments. The AdamW [57] optimizer is used with the learning rate set to 4e-4, which is updated by a cosine annealing strategy. The models are trained end-to-end. The number of radar queries is set to 30, $\lambda_{cl}$ is set to 1, and $\lambda_{kd}$ is set to 5. We implement RCAlign with the mmdetection3D framework and train it on NVIDIA 3090 GPUs.
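The settings above can be summarized in a config-style sketch; the key names are illustrative and do not correspond to an actual released configuration file.

```python
# Hypothetical summary of the training setup described in this section.
rcalign_cfg = dict(
    image_backbone="ResNet-50 / ResNet-101 (val) or V2-99 (test)",
    radar_backbone="FUTR3D-style radar encoder with PointPillars BEV features",
    point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],  # X/Y: ±51.2 m, Z: [-5.0, 3.0] m
    radar_sweeps=5,                 # aggregated consecutive radar frames
    num_radar_queries=30,
    loss_weights=dict(contrastive=1.0, distillation=5.0),     # lambda_cl, lambda_kd
    optimizer=dict(type="AdamW", lr=4e-4),
    lr_schedule="cosine annealing",
    epochs=dict(sota_comparison=60, ablation=24),
)
```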
IV-B Comparison with State-of-the-Art Methods
In this section, we first present the results of RCAlign and other state-of-the-art methods for 3D object detection on the nuScenes val set and test set. Following that, visualisation results of RCAlign on the nuScenes val set are presented. Finally, we give the results of RCAlign in 3D object tracking.
3D Object Detection on val set. We compare RCAlign with the state-of-the-art methods on the nuScenes val set for 3D object detection in Tab. I. (1) With the R50 [50] backbone and an input resolution of 256×704, RCAlign outperforms both existing camera-based methods and radar-camera fusion methods, including the latest state-of-the-art methods CRN [11] and RCBEVDet [8]. NDS and mAP improve significantly compared to RCBEVDet, with gains of 4.3% and 8.4%, respectively. (2) To the best of our knowledge, this may be the first radar-camera fusion algorithm to achieve an NDS exceeding 60% in real-time 3D object detection. (3) Furthermore, RCAlign with the R50 backbone even outperforms CRN with the R101 backbone at an input resolution of 512×1408. (4) When the model size and image size increase to R101 and 512×1408, RCAlign still outperforms all previous camera-based or radar-camera fusion methods. Alongside the significant improvements in NDS and mAP, mAVE also improves remarkably, which can be attributed to the ability of radar to capture object velocity.
Per-class AP on val set. In order to better demonstrate the performance of RCAlign, we analyse the per-class AP of the radar-camera fusion models in Tab. II. The table shows that all classes except barriers are improved. In particular, pedestrians, which are often small or densely clustered, see a significant improvement, with AP increasing by 11.5% compared to CRN. For barriers, the performance is almost equal to that of CRN. In terms of mAP, we also observe a significant improvement over the other radar-camera fusion methods.
Method | Modality | Backbone | Input Size | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
---|---|---|---|---|---|---|---|---|---|---|
DETR3D [19] | C | V2-99 | 900×1600 | 0.479 | 0.412 | 0.641 | 0.255 | 0.394 | 0.845 | 0.133
BEVFormer [20] | C | V2-99 | 900×1600 | 0.569 | 0.481 | 0.582 | 0.256 | 0.375 | 0.378 | 0.126
PETRv2 [58] | C | V2-99 | 640×1600 | 0.582 | 0.490 | 0.561 | 0.243 | 0.361 | 0.343 | 0.120
BEVDepth [21] | C | V2-99 | 640×1600 | 0.600 | 0.503 | 0.445 | 0.245 | 0.378 | 0.320 | 0.126
SOLOFusion [46] | C | ConvNeXt-B | 640×1600 | 0.619 | 0.540 | 0.453 | 0.257 | 0.376 | 0.276 | 0.148
StreamPETR [3] | C | V2-99 | 640×1600 | 0.636 | 0.550 | 0.479 | 0.239 | 0.317 | 0.241 | 0.119
SparseBEV [47] | C | V2-99 | 640×1600 | 0.636 | 0.556 | 0.485 | 0.244 | 0.332 | 0.246 | 0.117
BEVNeXt [4] | C | V2-99 | 640×1600 | 0.642 | 0.557 | 0.409 | 0.241 | 0.352 | 0.233 | 0.129
CenterFusion [30] | C+R | DLA34 | 448×800 | 0.449 | 0.326 | 0.631 | 0.261 | 0.516 | 0.614 | 0.115
RCBEV [48] | C+R | Swin-T | 256×704 | 0.486 | 0.406 | 0.484 | 0.257 | 0.587 | 0.702 | 0.140
MVFusion [33] | C+R | V2-99 | 640×1600 | 0.517 | 0.453 | 0.569 | 0.246 | 0.379 | 0.781 | 0.128
CRAFT [32] | C+R | DLA34 | 448×800 | 0.523 | 0.411 | 0.467 | 0.268 | 0.456 | 0.519 | 0.114
CRN [11] | C+R | ConvNeXt-B | 640×1600 | 0.624 | 0.575 | 0.416 | 0.264 | 0.456 | 0.365 | 0.130
RCBEVDet [8] | C+R | V2-99 | 640×1600 | 0.639 | 0.550 | 0.390 | 0.234 | 0.362 | 0.259 | 0.113
RCAlign | C+R | V2-99 | 640×1600 | 0.673 | 0.606 | 0.385 | 0.241 | 0.360 | 0.191 | 0.123

3D Object Detection on test set. For the nuScenes test set, we use the V2-99 [51] backbone with an input resolution of 640×1600. Table III demonstrates that RCAlign achieves a notable advancement over existing state-of-the-art methods. Compared to RCBEVDet, there is an improvement of 3.4% NDS and 5.6% mAP. Additionally, RCBEVDet [8] improves its baseline (BEVDepth) by 3.4% NDS and 3.5% mAP, whereas RCAlign improves its baseline (StreamPETR) by 3.7% NDS and 5.6% mAP, surpassing RCBEVDet. The above analyses show that RCAlign achieves outstanding results, validating the effectiveness of the proposed model.

Visualisation Results. We show the visualisation results in the camera view (left) and BEV view (right) in Fig. 4. In the BEV view, RCAlign presents excellent detection results, particularly for the densely gathered crowd highlighted in the orange circle. However, as indicated by the green circle, there are some failure cases where objects are either misdetected or missed. This may be because these objects are far from the ego vehicle, resulting in poorer detection. More visualisation results for different weather conditions are provided in Fig. 5, which show that RCAlign obtains detection results close to the ground truth in various weather conditions.
3D Object Tracking on test set. As shown in Tab. IV, when compared to the other state-of-the-art methods with the V2-99 backbone, RCAlign achieves the best performance in both AMOTA and AMOTP. AMOTA is improved by 2.7% compared to BEVNeXt. When methods using the ConvNeXt-B backbone are also included in the comparison, RCAlign still performs best on AMOTA, the primary evaluation metric for 3D object tracking on the nuScenes test set, improving by 3.6% over CRN. For the other metrics, RCAlign achieves the best or second-best performance.
| | C | R | DRA-DA | DRA-SDA | DRA-CL | RH | RFE-SRH | RFE-KD | NDS | mAP | mATE | mAOE |
(a) | ✔ | 0.518 | 0.412 | 0.609 | 0.516 | |||||||
✔ | ✔ | ✔ | 0.569 | 0.486 | 0.525 | 0.551 | ||||||
✔ | ✔ | ✔ | ✔ | 0.576 | 0.496 | 0.522 | 0.537 | |||||
✔ | ✔ | ✔ | ✔ | 0.583 | 0.501 | 0.517 | 0.503 | |||||
✔ | ✔ | ✔ | ✔ | ✔ | 0.584 | 0.507 | 0.512 | 0.515 | ||||
(b) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.586 | 0.504 | 0.523 | 0.476 | ||
✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.588 | 0.508 | 0.519 | 0.479 | ||
✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.592 | 0.515 | 0.502 | 0.487 |
SM | NDS | mAP | mATE | mAOE |
---|---|---|---|---|
Random | 0.585 | 0.500 | 0.505 | 0.494 |
FPS | 0.588 | 0.512 | 0.518 | 0.515 |
Topk | 0.592 | 0.515 | 0.502 | 0.487 |
AL | NDS | mAP | mATE | mAOE |
---|---|---|---|---|
L1 | 0.514 | 0.421 | 0.594 | 0.693 |
Cos | 0.576 | 0.497 | 0.536 | 0.517 |
CL | 0.592 | 0.515 | 0.502 | 0.487 |
IV-C Ablation Study
In order to verify the validity of the important components of RCAlign, in this part we present ablation experiments for DRA, RFE and their constituent sub-modules.
Effectiveness of DRA. In Tab. V (a), we validate the effectiveness of each sub-module of the DRA. (1) First, we present the results of StreamPETR [3] with deformable attention (baseline) in the first line. Then, a radar branch is added to the baseline (Rbaseline), and the features from both modalities are aggregated using deformable attention, resulting in a significant improvement over the baseline of 5.1% NDS and 7.4% mAP. This confirms that radar can indeed assist 3D object detection. (2) Afterwards, the contrastive loss or the second deformable attention is added to Rbaseline; lines 3 and 4 show that adding either sub-module of DRA improves Rbaseline on both NDS and mAP, with similar gains observed for mATE and mAOE. These results demonstrate that the proposed alignment strategy and the inter-modal feature interaction are both beneficial. (3) Finally, adding both sub-modules to Rbaseline yields further performance improvements. (4) In summary, with the DRA, the model outperforms the baseline by 6.6% NDS and 9.5% mAP and surpasses Rbaseline by 1.5% NDS and 2.1% mAP. These results validate the effectiveness of DRA.
Effectiveness of RH and RFE. We conduct ablation experiments on the radar head (RH) and the RFE module; the results are presented in Tab. V (b). (1) We first introduce the RH module on top of the DRA module, which yields a 0.2% gain in NDS. (2) After that, we use the second radar head (SRH) to classify and regress the enhanced radar BEV features, further improving NDS and mAP by 0.2% and 0.4%, respectively. (3) Furthermore, by exploiting the knowledge distillation loss, NDS and mAP improve by 0.4% and 0.7%, respectively. After adding RH and RFE, both NDS and mAP increase by 0.8%, verifying the effectiveness of RH and RFE. (4) It is worth noting that, compared with the baseline, the proposed RCAlign enhances NDS by 7.4% and mAP by 10.3%. Even compared with the designed Rbaseline, RCAlign achieves significant improvements, with NDS and mAP increasing by 2.3% and 2.9%, respectively.
IV-D Parametric Analysis
Here, we analyse the important parameters of the model, including the impact of different radar point sampling methods, the impact of different alignment losses, and the impact of the contrastive loss and distillation loss weights.
$\lambda_{cl}$ | NDS | mAP | mATE | mAOE |
---|---|---|---|---|
0.1 | 0.583 | 0.503 | 0.523 | 0.516 |
1 | 0.592 | 0.515 | 0.502 | 0.487 |
10 | 0.585 | 0.507 | 0.531 | 0.494 |
$\lambda_{kd}$ | NDS | mAP | mATE | mAOE |
---|---|---|---|---|
1 | 0.581 | 0.502 | 0.517 | 0.528 |
5 | 0.592 | 0.515 | 0.502 | 0.487 |
10 | 0.580 | 0.502 | 0.527 | 0.523 |
Impact of sampling methods. As depicted in Fig. 2, radar points are selected to become part of the sparse queries after the radar head. Here, we analyze the impact of different sampling methods for radar points on the experimental results. As shown in Tab. VI, (1) we start with a random sampling approach, for which both NDS and mAP are poor. (2) We then use farthest point sampling (FPS), as employed in PointNet++ [60], taking the radar point with the highest probability value as the initial point. Compared with random sampling, NDS and mAP improve somewhat, but mATE and mAOE degrade. (3) Finally, we select the radar points corresponding to the top probability values (Topk) in the radar heatmap. Compared to the previous two methods, this approach improves NDS, mAP, mATE, and mAOE. (4) Therefore, we choose Topk as the selection method for radar points.
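A minimal sketch of the Topk selection described above is given below, assuming a single-channel radar heatmap; the function and variable names are illustrative.

```python
import torch

def select_topk_radar_queries(heatmap, radar_bev_feats, k=30):
    """Pick the k radar BEV cells with the highest heatmap scores as radar queries.

    heatmap:         (B, 1, H, W) per-cell object probabilities from the radar head.
    radar_bev_feats: (B, C, H, W) radar BEV features.
    Returns radar queries of shape (B, k, C) gathered at the top-k locations.
    """
    B, C, H, W = radar_bev_feats.shape
    scores = heatmap.view(B, -1)                    # (B, H*W)
    topk_idx = scores.topk(k, dim=1).indices        # (B, k)
    feats = radar_bev_feats.view(B, C, -1)          # (B, C, H*W)
    idx = topk_idx.unsqueeze(1).expand(-1, C, -1)   # (B, C, k)
    return feats.gather(2, idx).permute(0, 2, 1)    # (B, k, C)
```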
Impact of alignment loss. We experiment with three different alignment losses for aligning the sparse queries after the two routing paths. As shown in Tab. VII, (1) we initially use the L1 loss to align the two sets of sparse queries in DRA, but it produces poor results. (2) When the alignment loss is changed to a cosine loss, there is a significant improvement in the overall metrics. This is because, although the features sampled by the two routing paths are the same, the different sampling order means that the final feature representations should only be semantically similar, rather than forced to have identical values. (3) After that, we change the alignment loss to a contrastive loss, and all the metrics improve further. In contrast to the cosine loss, the contrastive loss not only pulls together features at the same location but also pushes apart features at different locations to some extent. (4) Based on the above analysis, we choose the contrastive loss for aligning the two routing paths.
Impact of $\lambda_{cl}$ and $\lambda_{kd}$. We conduct experiments on how the weights of the contrastive loss and the knowledge distillation loss in Eq. 6 affect the performance of the model. As shown in Tab. VIII, as $\lambda_{cl}$ increases, all metrics first improve and then degrade, and the model achieves optimal performance when the weight of the contrastive loss is set to 1. Similarly, Tab. IX shows that the model performs optimally when the weight of the knowledge distillation loss is set to 5.
IV-E Robustness Analysis
Method | Modality | Sunny | Rainy | Day | Night |
---|---|---|---|---|---|
CenterPoint | L | 0.629 | 0.592 | 0.628 | 0.354 |
RCBEV | C+R | 0.361 | 0.385 | 0.371 | 0.155 |
CRN | C+R | 0.548 | 0.570 | 0.551 | 0.304 |
RCAlign | C+R | 0.567 | 0.610 | 0.575 | 0.357 |
A robust detection algorithm is crucial for enhancing vehicle safety in autonomous driving. Therefore, we conduct robustness experiments under different weather and lighting conditions. Following CRN [11], we use the R101 backbone and an input resolution of 512×1408. As shown in Tab. X, compared to the radar-camera fusion methods, RCAlign achieves state-of-the-art performance across various weather and lighting scenes. Additionally, because radar is not affected by weather or lighting conditions, RCAlign even outperforms the lidar-based method (CenterPoint) in rainy and night scenes. The above experimental results demonstrate that RCAlign is both effective and robust under different weather conditions.
V Conclusion
In this paper, we propose a novel alignment model, RCAlign, for radar and camera fusion. First, we design the DRA, which consists of two deformable attention operations and a fusion block. The deformable attention operations are employed for inter-modal feature interaction, while the fusion block aligns features using a contrastive loss and fuses the features from the two modalities through element-wise addition. Afterwards, considering the sparsity of radar features, we design a radar feature enhancement module that uses the predicted centres of the 3D boxes to densify the original radar features via a knowledge distillation loss. Besides, we introduce a radar head acting on the original and enhanced radar features as an auxiliary task. Finally, the designed model is optimized by combining the above losses with the 3D object detection task losses. Extensive experiments conducted on the nuScenes benchmark illustrate that the proposed RCAlign achieves a new state-of-the-art performance, and rigorous ablation experiments demonstrate the effectiveness of the proposed DRA and RFE.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grant 62272035.
References
- [1] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, “Bevdet: High-performance multi-camera 3d object detection in bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
- [2] Y. Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position embedding transformation for multi-view 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 531–548.
- [3] S. Wang, Y. Liu, T. Wang, Y. Li, and X. Zhang, “Exploring object-centric temporal modeling for efficient multi-view 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3621–3631.
- [4] Z. Li, S. Lan, J. M. Alvarez, and Z. Wu, “Bevnext: Reviving dense bev frameworks for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 113–20 123.
- [5] C. Shu, J. Deng, F. Yu, and Y. Liu, “3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3580–3589.
- [6] Z. Chen, Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, F. Wu, and F. Zhao, “Graph-detr4d: Spatio-temporal graph modeling for multi-view 3d object detection,” IEEE Transactions on Image Processing, 2024.
- [7] P. Wolters, J. Gilg, T. Teepe, F. Herzog, A. Laouichi, M. Hofmann, and G. Rigoll, “Unleashing hydra: Hybrid fusion, depth consistency and radar for unified 3d perception,” arXiv preprint arXiv:2403.07746, 2024.
- [8] Z. Lin, Z. Liu, Z. Xia, X. Wang, Y. Wang, S. Qi, Y. Dong, N. Dong, L. Zhang, and C. Zhu, “Rcbevdet: Radar-camera fusion in bird’s eye view for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 928–14 937.
- [9] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, “Futr3d: A unified sensor fusion framework for 3d detection,” in proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 172–181.
- [10] Z. Yu, W. Wan, M. Ren, X. Zheng, and Z. Fang, “Sparsefusion3d: Sparse sensor fusion for 3d object detection by radar and camera in environmental perception,” IEEE Transactions on Intelligent Vehicles, 2023.
- [11] Y. Kim, J. Shin, S. Kim, I.-J. Lee, J. W. Choi, and D. Kum, “Crn: Camera radar net for accurate, robust, efficient 3d perception,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 615–17 626.
- [12] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
- [13] T. Wang, X. Zhu, J. Pang, and D. Lin, “Fcos3d: Fully convolutional one-stage monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 913–922.
- [14] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
- [15] L. Wang, L. Zhang, Y. Zhu, Z. Zhang, T. He, M. Li, and X. Xue, “Progressive coordinate transforms for monocular 3d object detection,” Advances in Neural Information Processing Systems, vol. 34, pp. 13 364–13 377, 2021.
- [16] J. U. Kim, H.-I. Kim, and Y. M. Ro, “Stereoscopic vision recalling memory for monocular 3d object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 2749–2760, 2023.
- [17] C. Huang, T. He, H. Ren, W. Wang, B. Lin, and D. Cai, “Obmo: One bounding box multiple objects for monocular 3d object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 6570–6581, 2023.
- [18] J. Hou, Z. Liu, Z. Zou, X. Ye, X. Bai et al., “Query-based temporal fusion with explicit motion for 3d object detection,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [19] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning. PMLR, 2022, pp. 180–191.
- [20] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European conference on computer vision. Springer, 2022, pp. 1–18.
- [21] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, “Bevdepth: Acquisition of reliable depth for multi-view 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1477–1485.
- [22] C. Yang, Y. Chen, H. Tian, C. Tao, X. Zhu, Z. Zhang, G. Huang, H. Li, Y. Qiao, L. Lu et al., “Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 830–17 839.
- [23] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, “M2 bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation,” arXiv preprint arXiv:2204.05088, 2022.
- [24] Y. Li, H. Bao, Z. Ge, J. Yang, J. Sun, and Z. Li, “Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1486–1494.
- [25] H. Shi, C. Pang, J. Zhang, K. Yang, Y. Wu, H. Ni, Y. Lin, R. Stiefelhagen, and K. Wang, “Cobev: Elevating roadside 3d object detection with depth and height complementarity,” IEEE Transactions on Image Processing, 2024.
- [26] X. Lin, T. Lin, Z. Pei, L. Huang, and Z. Su, “Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion,” arXiv preprint arXiv:2211.10581, 2022.
- [27] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer, 2020, pp. 194–210.
- [28] J. Huang and G. Huang, “Bevdet4d: Exploit temporal cues in multi-camera 3d object detection,” arXiv preprint arXiv:2203.17054, 2022.
- [29] ——, “Bevpoolv2: A cutting-edge implementation of bevdet toward deployment,” arXiv preprint arXiv:2211.17111, 2022.
- [30] R. Nabati and H. Qi, “Centerfusion: Center-based radar and camera fusion for 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1527–1536.
- [31] Y. Long, A. Kumar, D. Morris, X. Liu, M. Castro, and P. Chakravarty, “Radiant: Radar-image association network for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1808–1816.
- [32] Y. Kim, S. Kim, J. W. Choi, and D. Kum, “Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 1160–1168.
- [33] Z. Wu, G. Chen, Y. Gan, L. Wang, and J. Pu, “Mvfusion: Multi-view 3d object detection with semantic-aligned radar and camera fusion,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2766–2773.
- [34] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.
- [35] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
- [36] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [37] F. Wei, Y. Gao, Z. Wu, H. Hu, and S. Lin, “Aligning pretraining for detection via object-level contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 22 682–22 694, 2021.
- [38] B. Sun, B. Li, S. Cai, Y. Yuan, and C. Zhang, “Fsce: Few-shot object detection via contrastive proposal encoding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7352–7362.
- [39] Y. Zhang, J. Chen, and D. Huang, “Cat-det: Contrastively augmented transformer for multi-modal 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 908–917.
- [40] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- [41] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 697–12 705.
- [42] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- [43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. pmlr, 2015, pp. 448–456.
- [44] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
- [45] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [46] J. Park, C. Xu, S. Yang, K. Keutzer, K. M. Kitani, M. Tomizuka, and W. Zhan, “Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection,” in The Eleventh International Conference on Learning Representations, 2022.
- [47] H. Liu, Y. Teng, T. Lu, H. Wang, and L. Wang, “Sparsebev: High-performance sparse 3d object detection from multi-camera videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18 580–18 590.
- [48] T. Zhou, J. Chen, Y. Shi, K. Jiang, M. Yang, and D. Yang, “Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1523–1535, 2023.
- [49] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
- [50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [51] Y. Lee, J.-w. Hwang, S. Lee, Y. Bae, and J. Park, “An energy and gpu-computation efficient backbone network for real-time object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 0–0.
- [52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
- [53] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152.
- [54] S. Wang, X. Jiang, and Y. Li, “Focal-petr: Embracing foreground for efficient multi-camera 3d object detection,” IEEE Transactions on Intelligent Vehicles, 2023.
- [55] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 13 619–13 627.
- [56] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu, “Class-balanced grouping and sampling for point cloud 3d object detection,” arXiv preprint arXiv:1908.09492, 2019.
- [57] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [58] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang, “Petrv2: A unified framework for 3d perception from multi-camera images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
- [59] Y. Li, Y. Chen, X. Qi, Z. Li, J. Sun, and J. Jia, “Unifying voxel-based representation with transformer for 3d object detection,” Advances in Neural Information Processing Systems, vol. 35, pp. 18 442–18 455, 2022.
- [60] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems, vol. 30, 2017.