
Minkowski Tracker: A Sparse Spatio-Temporal R-CNN
for Joint Object Detection and Tracking

JunYoung Gwak, Silvio Savarese, Jeannette Bohg
Abstract

Recent research in multi-task learning reveals the benefit of solving related problems in a single neural network. 3D object detection and multi-object tracking (MOT) are two heavily intertwined problems that predict and associate object instance locations across time. However, most previous works in 3D MOT treat the detector as a separate preceding pipeline, disjointly taking the output of the detector as the input to the tracker. In this work, we present Minkowski Tracker, a sparse spatio-temporal R-CNN that jointly solves object detection and tracking. Inspired by region-based CNN (R-CNN), we propose to solve tracking as a second stage of the object detector that predicts assignment probabilities to tracks. First, Minkowski Tracker takes 4D point clouds as input and generates a spatio-temporal Bird’s-eye-view (BEV) feature map through a 4D sparse convolutional encoder network. Then, our proposed TrackAlign aggregates track region-of-interest (ROI) features from the BEV features. Finally, Minkowski Tracker updates the track and its confidence score based on the detection-to-track match probability predicted from the ROI features. We show in large-scale experiments that the overall performance gain of our method is due to four factors: 1. the temporal reasoning of the 4D encoder improves detection performance; 2. the multi-task learning of object detection and MOT jointly enhances both tasks; 3. the detection-to-track match score learns an implicit motion model that enhances track assignment; 4. the detection-to-track match score improves the quality of the track confidence score. As a result, Minkowski Tracker achieves state-of-the-art performance on the Nuscenes tracking task without hand-designed motion models.

Introduction

3D Multi-object tracking (MOT) allows autonomous agents to perceive the motion of various entities in their surroundings. MOT is one of the core perception problems of various real-time robotics applications such as autonomous driving (Luo, Yang, and Yuille 2021; Petrovskaya and Thrun 2008) and collaborative robotics (Gross et al. 2011; Erol et al. 2018). With the increasing demand for AI-based, real-world automation (Maurer et al. 2016) and the better affordability of 3D sensors (Hecht 2018), the need for robust 3D MOT has become ever higher. Consequently, the advent of multiple large-scale 3D MOT datasets (Geiger et al. 2013; Caesar et al. 2020; Martin-Martin et al. 2021; Sun et al. 2020) and improved techniques to process 3D data (Graham 2014; Choy, Gwak, and Savarese 2019) have led to recent significant advances in 3D MOT.

The most popular multi-object tracking (MOT) framework solves correspondences among detections over time, commonly known as the tracking-by-detection paradigm. Most previous approaches to tracking-by-detection treat detection and tracking as separate modules, running an object detector frame-by-frame and then establishing correspondences over time. This pipelined approach is so widely adopted that many MOT challenges (Leal-Taixé et al. 2015; Dendorfer et al. 2019) provide baseline public object detection results, and most trackers in the Nuscenes challenge (Caesar et al. 2020) evaluate using the same object detection method (Yin, Zhou, and Krahenbuhl 2021) for a fair comparison. However, recent research in multi-task learning reveals that jointly solving related tasks improves the generalization performance of all the tasks (Zhang and Yang 2021), which could extend to detection and tracking.

The last decade of machine learning research has revealed two important lessons: 1. end-to-end learning tends to achieve better results than pipelined approaches; 2. related learning tasks can help each other, as demonstrated in transfer learning (Zhuang et al. 2020), multi-task learning (Zhang and Yang 2021), and cross-task learning (Zamir et al. 2020). The most common tracking-by-detection approach builds hand-designed filters outside of the neural network, which is neither end-to-end nor always fast. Furthermore, detection and tracking are related problems that predict object instances and locations over time and could therefore benefit from multi-task learning. To this end, we propose Minkowski Tracker, a sparse spatio-temporal R-CNN for joint object detection and tracking.

The input space of our 3D MOT is 4D Minkowski spacetime, a combination of 3D Euclidean space and time into a 4D manifold. Being a true end-to-end system, Minkowski Tracker takes a sequence of 3D point clouds of length $T$ as input and directly processes it using a 4D sparse encoder. The explicit temporal reasoning of the 4D encoder is robust to temporal changes such as occlusion, truncation, and transformation. Furthermore, the 4D encoder generates a spatio-temporal bird’s-eye-view (BEV) feature map, which enables our tracking framework.

Inspired by the success of region-based CNN (R-CNN) in predicting additional information regarding detected instances such as mask (He et al. 2017) and 3D mesh (Gkioxari, Malik, and Johnson 2019), Minkowski Tracker solves tracking-by-detection MOT by predicting the match probability of each detected instance to current tracks. To enable matching between detections and tracks, our proposed TrackAlign samples collective region-of-interest (ROI) features from the spatio-temporal BEV feature map. As a result, Minkowski Tracker learns implicit motion models by predicting the detection’s match probability to the tracks. Moreover, the multi-task learning of object detection and MOT jointly enhances the performance of each other, as we will show in experiments.

Lastly, prediction confidence is crucial information for any system that takes uncertain inputs. For example, informative object track confidence could be helpful for robotics applications such as anomaly detection and decision making. Consequently, the recent MOT evaluation metric AMOTA (Weng et al. 2020a) takes track confidence into account. A common practice of previous works is to take an average of the object detection confidences within the track, which lacks information on the confidence of associations across the track. Minkowski Tracker complements this missing information using the network-predicted detection-to-track match probability scores.

To summarize, we propose Minkowski Tracker, a sparse spatio-temporal R-CNN for joint object detection and tracking. Our proposed R-CNN structure solves tracking by sharing features with the detector to learn implicit motion models. Additionally, we propose TrackAlign, a spatio-temporal region-of-interest (ROI) feature aggregation function that samples collective ROI features between detections and tracks. The performance gain of our method, as we ablate in Table 3, is due to four factors:

  • Minkowski Tracker improves 3D object detection and tracking by reasoning directly on 4D spatio-temporal space using a 4D sparse convolutional encoder.

  • The multi-task learning of 3D object detection and multi-object tracking in the same network enhances the performance of each other.

  • Minkowski Tracker learns implicit motion models as the second stage of a region-based CNN (R-CNN) by predicting the detection’s match probability to the tracks, enhancing the quality of track assignment.

  • Minkowski Tracker offers more robust track confidence by incorporating the detection-to-track match probability score.

Related Work

LiDAR-based 3D object detection

Most modern 3D object detectors for outdoor autonomous driving scenarios follow a similar structure. First, a point cloud encoder encodes 3D point clouds into a BEV feature map. Second, detection heads infer 3D object detections on the extracted BEV features.

First, for the point cloud encoder, VoxelNet (Zhou and Tuzel 2018) proposes to voxelize point clouds using a 3D voxel feature encoder based on PointNet (Qi et al. 2017), followed by a 3D CNN. PointPillars (Lang et al. 2019) proposes to encode 3D point clouds into a 2D pillar feature map followed by a 2D CNN. SECOND (Yan, Mao, and Li 2018) makes VoxelNet efficient by using sparse convolution (Graham 2014) instead of a dense 3D CNN. The point cloud encoder of Minkowski Tracker is a 4D sparse convolutional encoder. The 4D encoder improves object detection by reasoning over a long temporal horizon, making it robust to temporal changes such as occlusion, truncation, and transformation. Additionally, the 4D encoder outputs a spatio-temporal BEV feature map to enable our tracking framework.

Second, the detection heads infer 3D object detections on the BEV feature map. PointRCNN (Shi, Wang, and Li 2019) and Voxel R-CNN (Deng et al. 2021) propose two-stage region proposal networks similar to Faster R-CNN (Ren et al. 2015). CenterPoint (Yin, Zhou, and Krahenbuhl 2021) proposes a two-stage anchor-free keypoint detector network. 3DSSD (Yang et al. 2020) proposes an anchor-free single-stage 3D detection network. Our proposed method is agnostic to the choice of object detection heads. We use CenterPoint as our baseline detector network for a fair comparison against previous works in MOT.

Region-based CNN

Region-based CNN (R-CNN) is a neural network architecture that infers scene knowledge based on region-of-interest (ROI) features. The initial goal of R-CNN was object detection (Girshick 2015; Ren et al. 2015), extracting ROI features based on region proposals to predict and refine object detection parameters. As the technique matured, the ROI features found a new use for predicting additional information about the detected objects. Mask R-CNN (He et al. 2017) proposes to solve instance segmentation as the second stage of object detection by making an object mask prediction from bilinear-interpolated ROI features. Mesh R-CNN (Gkioxari, Malik, and Johnson 2019) proposes reconstructing a corresponding 3D mesh from the detection’s ROI features. The proposed Minkowski Tracker similarly extends object detection networks to multi-object tracking using a spatio-temporal ROI feature. Our proposed TrackAlign samples collective ROI features of detections and tracks to predict the detected object’s match likelihood to current tracks.

LiDAR-based 3D multi-object tracking

There are numerous track association algorithms for detection-to-track trackers. The simplest yet powerful association algorithm is global nearest neighbor (GNN) matching (Rana et al. 2014), which associates a detection with its closest track in the Bird’s-Eye-View (BEV) global Euclidean coordinate space. CenterPoint (Yin, Zhou, and Krahenbuhl 2021) slightly improves this algorithm by backprojecting a detection based on its predicted velocity into the timestep of the track and then performing GNN matching there. Some use Kalman filters (Kalman 1960) to find a posterior estimate of a tracked object instance: a simple motion model predicts a tracked object’s next state, which is then matched to a detection based on the Mahalanobis distance (Chiu et al. 2020) or IoU/GIoU (Rezatofighi et al. 2019; Weng et al. 2020b). Others use random finite sets (RFS) (Pang, Morris, and Radha 2021; Liu et al. 2022), factor graphs (Pöschmann, Pfeifer, and Protzel 2020; Liang and Meyer 2022), and graph neural networks (Zaech et al. 2022; Brasó and Leal-Taixé 2020; Weng et al. 2020c) for track association and management. Unlike previous works, our network directly predicts associations between the detections and tracks, learning an implicit motion model.

In addition to the association algorithms, (Chiu et al. 2020; Weng et al. 2020c; Baser et al. 2019) propose to fuse the aforementioned motion models with appearance features to improve track management quality. Additionally, (Wang et al. 2021; Wu et al. 2021) propose maintaining invisible tracks to prevent ID switches during occlusions. The contribution of Minkowski Tracker is complementary to these efforts, and it may benefit from multi-modality input and advanced track management algorithms.

Method

Figure 1: Visualization of Minkowski Tracker R-CNN with the number of frames $T=3$. First, the 4D sparse encoder takes 4D point clouds $\mathcal{P}$ as an input to generate a spatio-temporal Bird’s-eye-view feature map $\mathcal{F}$. Then, the 3D object detector outputs 3D bounding boxes $\mathcal{B}$. The next stage is an online tracker, which first extracts track ROI features $\mathbb{F}$ using our proposed TrackAlign. Finally, the DetectionToTrackClassifier predicts the track match probability $\mathcal{S}$ for each detection.

In this section, we illustrate Minkowski Tracker in detail. First off, Minkowski Tracker deals with spatio-temporal data. For consistent notation of time, let $T$ be the maximum number of frames. Additionally, let $t$ and any superscript $\cdot^{t}$ be a relative temporal index, where $t=1$ is the current and latest frame where we detect objects, $t=2$ is the previous frame, and $t=T$ is the oldest frame the network processes.

The outline of our method is in Algorithm 1. The input to Minkowski Tracker is a 4D spatio-temporal point cloud $\mathcal{P}$, which at each timestamp is a temporal stack of up to $T$ previous 3D lidar point cloud observations.

The core component of our framework is Minkowski Tracker R-CNN, a two-stage detection-to-track sparse spatio-temporal R-CNN, visualized in Figure 1. First, the 4D sparse encoder takes 4D point clouds $\mathcal{P}$ as input to generate a spatio-temporal Bird’s-eye-view feature map $\mathcal{F}$ (Line 13). Then, the 3D object detector head outputs 3D detection bounding boxes $\mathcal{B}$ on the BEV feature $\mathcal{F}^{1}$ at the current frame ($t=1$) (Line 14). The next stage is an online tracker, which first extracts spatio-temporal ROI features $\mathbb{F}$ from a combination of detections $\mathcal{B}$ and current tracks $\mathcal{T}$ using our proposed TrackAlign (Line 16). Then, the DetectionToTrackClassifier outputs a match probability score $\mathcal{S}$ between the detections and tracks from $\mathbb{F}$ (Line 17).

Finally, given the network output detections $\mathcal{B}$ and detection-to-track scores $\mathcal{S}$, Minkowski Tracker updates the tracks $\mathcal{T}$ with a simple Hungarian matching over $\mathcal{S}$ and calculates its track confidence score (Line 20). In the following subsections, we detail each component of our framework.

Algorithm 1 Outline of Minkowski Tracker
1: Input $\{\mathcal{P}^{t}\}\ \forall t$: 3D lidar point clouds at relative time $t$
2: Input $T$: Maximum temporal axis size
3: Output $\mathcal{T}$: Tracks
4: $\mathcal{T}\leftarrow\{\}$
5: for $i\in\{1,2,\ldots\}$ do
6:     $\mathcal{P}\leftarrow\{\mathcal{P}^{t}\}\ \forall t\in[1,\min(T,i)]$
7:     procedure Minkowski Tracker R-CNN($\mathcal{P},\mathcal{T}$)
8:         Input $\mathcal{P}$: 4D spatio-temporal lidar point clouds
9:         Input $\mathcal{T}$: Tracks
10:        Output $\mathcal{B}$: Detection bounding boxes
11:        Output $\mathcal{S}$: Detection-to-track match probability
12:        ▷ First stage: 3D object detector
13:        $\mathcal{F}\leftarrow\texttt{4DSparseEncoder}(\mathcal{P})$
14:        $\mathcal{B}\leftarrow\texttt{3DObjectDetector}(\mathcal{F}^{1})$
15:        ▷ Second stage: 3D online tracker
16:        $\mathbb{F}\leftarrow\texttt{TrackAlign}(\mathcal{F},\mathcal{B},\mathcal{T})$
17:        $\mathcal{S}\leftarrow\texttt{DetectionToTrackClassifier}(\mathbb{F})$
18:    end procedure
19:    ▷ Track match and track confidence update
20:    $\mathcal{T}\leftarrow\texttt{TrackManager}(\mathcal{B},\mathcal{T},\mathcal{S})$
21: end for

4D sparse encoder

Minkowski Tracker R-CNN takes a 4D spatio-temporal sparse point cloud in Minkowski spacetime as input: $\mathcal{P}=\{(x,y,z,t,r)_{i}\}$, where $(x,y,z)$ is the 3D location in Euclidean space, $t$ is the relative temporal index, and $r$ is the reflectance measurement. The output of the encoder is a spatio-temporal Bird’s-eye-view (BEV) feature map $\mathcal{F}=\{\mathcal{F}^{1},\mathcal{F}^{2},\ldots,\mathcal{F}^{T}\}$.

In short, our encoder is a 4D equivalent of SECOND (Yan, Mao, and Li 2018) with some adjustments on the temporal axis to maintain its temporal shape. First, the 4D voxelizer quantizes the points in 3D Euclidean space while retaining the temporal index. Let the 3D locations of the point clouds range within $W$, $H$, $D$ along the $X$, $Y$, $Z$ axes, respectively. The voxelizer groups points along the 3D Euclidean voxel grid of size $v_{W}$, $v_{H}$, $v_{D}$ for each $t$. Then, the voxel feature encoding (VFE) layer (Zhou and Tuzel 2018) extracts voxel features for each voxel group. Note that the voxelizer does not apply any grouping nor VFE along the temporal axis, in order to retain the temporal dimension of the voxel tensor. The resulting 5D sparse tensor is of size $(C_{in}\times T\times\lfloor\frac{D}{v_{D}}\rfloor\times\lfloor\frac{H}{v_{H}}\rfloor\times\lfloor\frac{W}{v_{W}}\rfloor)$, where $C_{in}$ is the size of the input feature channel.
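
To make the voxelization step concrete, the following is a minimal NumPy sketch of 4D voxelization that quantizes only the spatial axes and keeps the frame index $t$ intact; the mean-pooling here is a simplification of the actual VFE layer, and the array layout is our assumption, not the authors' implementation.

    import numpy as np

    def voxelize_4d(points, pc_range, voxel_size):
        """Quantize (x, y, z) while keeping the integer temporal index t intact.

        points:     (N, 5) array of (x, y, z, t, r)
        pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max)
        voxel_size: (v_W, v_H, v_D) in meters
        Returns spatio-temporal voxel coordinates (t, iz, iy, ix) and one mean-pooled
        feature per voxel (a stand-in for the VFE layer).
        """
        x_min, y_min, z_min, x_max, y_max, z_max = pc_range
        # Drop points outside the crop range.
        keep = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
                (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
                (points[:, 2] >= z_min) & (points[:, 2] < z_max))
        pts = points[keep]
        # Spatial quantization only; t is already a discrete frame index.
        ix = ((pts[:, 0] - x_min) / voxel_size[0]).astype(np.int64)
        iy = ((pts[:, 1] - y_min) / voxel_size[1]).astype(np.int64)
        iz = ((pts[:, 2] - z_min) / voxel_size[2]).astype(np.int64)
        t = pts[:, 3].astype(np.int64)
        coords = np.stack([t, iz, iy, ix], axis=1)
        # Group points that fall into the same (t, z, y, x) cell and mean-pool them.
        uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
        feats = np.zeros((len(uniq), pts.shape[1]), dtype=np.float32)
        counts = np.bincount(inverse, minlength=len(uniq)).astype(np.float32)
        np.add.at(feats, inverse, pts.astype(np.float32))
        feats /= counts[:, None]
        return uniq, feats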

Then, the 4D sparse convolutional middle encoder network processes the voxelized 5D sparse tensor to generate dense spatio-temporal BEV images $\mathcal{F}$. Our middle encoder follows the same architecture as that of SECOND (Yan, Mao, and Li 2018) with the following modifications. First, our middle encoder uses 4D sparse convolution instead of 3D sparse convolution to enable spatio-temporal reasoning. Second, the network never strides along the temporal axis, preserving the temporal shape $T$. Given an encoder network spatial stride factor of $\mathrm{st}$, the resulting dense 4D spatio-temporal BEV tensor $\mathcal{F}$ is of size $(C_{out}\times T\times\lfloor\lfloor\frac{H}{v_{H}}\rfloor\frac{1}{\mathrm{st}}\rfloor\times\lfloor\lfloor\frac{W}{v_{W}}\rfloor\frac{1}{\mathrm{st}}\rfloor)$, where $C_{out}$ is the output channel size and the $Z$ axis is squashed down by the network design. Finally, we split $\mathcal{F}$ along the temporal axis to get a 2D BEV feature image at each specific $t$: $\mathcal{F}=\{\mathcal{F}^{1},\mathcal{F}^{2},\ldots,\mathcal{F}^{T}\}$. As a result, our 4D sparse encoder fully propagates information across the entire 4D spatio-temporal space, explicitly reasoning on the 3D spatial domain and its temporal changes.
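
As a sketch of the middle encoder's key design choice (spatial striding only), the block below uses the MinkowskiEngine sparse-convolution library (Choy, Gwak, and Savarese 2019). The channel widths are illustrative, and we assume the library accepts a per-axis stride list so that the temporal axis is never strided; this is not the authors' released code.

    import torch
    import MinkowskiEngine as ME  # sparse convolution library for 4D convolutions

    class Sparse4DEncoderBlock(torch.nn.Module):
        """One downsampling block of a hypothetical 4D middle encoder: the spatial
        axes are strided while the temporal axis keeps all T frames."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = ME.MinkowskiConvolution(
                in_ch, out_ch, kernel_size=3,
                stride=[1, 2, 2, 2],   # (t, z, y, x): assumed per-axis stride, no temporal striding
                dimension=4)
            self.norm = ME.MinkowskiBatchNorm(out_ch)
            self.relu = ME.MinkowskiReLU()

        def forward(self, x):
            return self.relu(self.norm(self.conv(x)))

    # coords: (N, 5) int tensor of (batch, t, iz, iy, ix); feats: (N, C_in) float tensor
    # x = ME.SparseTensor(features=feats, coordinates=coords)
    # After the encoder, the tensor is densified and the Z axis collapsed to yield a BEV
    # map of shape (C_out, T, H', W'), which is then split into {F^1, ..., F^T}.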

3D object detector

The first stage of Minkowski Tracker R-CNN, including the encoder above, is a 3D object detector. It takes the 2D BEV feature image $\mathcal{F}^{1}$ at the current frame ($t=1$) as an input and outputs 3D bounding boxes $\mathcal{B}=\{(u,v,d,w,l,h,\alpha,vel,c,s_{\text{det}})\}$ with a center location $(u,v,d)$, 3D size $(w,l,h)$, rotation $\alpha$, velocity $vel$, semantic class $c$, and detection confidence score $s_{\text{det}}$ per box.
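
For illustration only, the per-box output can be held in a record like the following; the field names and types are our assumptions, not the authors' code.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Detection:
        """One first-stage 3D detection (illustrative container)."""
        center: Tuple[float, float, float]   # (u, v, d) center location
        size: Tuple[float, float, float]     # (w, l, h) box size
        yaw: float                           # rotation alpha
        velocity: Tuple[float, float]        # predicted BEV velocity vel
        cls: int                             # semantic class c
        score: float                         # detection confidence s_det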

Minkowski Tracker R-CNN is agnostic to the choice of the detection algorithm. Without loss of generality, we chose CenterPoint (Yin, Zhou, and Krahenbuhl 2021) as our 3D detection model for a fair comparison against most previous works. The performance of the tracker is heavily affected by that of the detector. Minkowski Tracker R-CNN is a modular network that can incorporate any research advance in object detectors to improve its general performance. For simplicity, we define any loss functions used to train the 3D object detector as $L_{\text{det}}$.

Although Minkowski Tracker R-CNN shares an identical detector head with CenterPoint (Yin, Zhou, and Krahenbuhl 2021), there are two crucial differences that set our detection results apart. First, the explicit temporal reasoning of the 4D encoder is robust to temporal changes such as occlusion, truncation, and transformation. Second, the multi-task learning of object detection and tracking enhances the quality of both. In our experiments, we quantitatively ablate the two to demonstrate how Minkowski Tracker R-CNN improves detection quality without any changes in the detector head.

Figure 2: Visualization of a simple linear motion model compared to our proposed implicit motion model. Red cars represent the current track, the blue car is the prediction of a linear motion model, and the green car is the ground-truth motion. (left) A simple motion model may suffer from motions that are out of its model’s scope. (right) In contrast, our proposed network-predicted implicit motion model can perceive the entire spatio-temporal scene to capture any motion.

3D online tracker

The second stage of our network is a 3D online tracker. The input to our tracker is:

  • 3D object bounding boxes from the detector $\mathcal{B}=\{\mathcal{B}_{1},\mathcal{B}_{2},\ldots,\mathcal{B}_{N}\}$, where $N$ is the number of detected objects.

  • Current tracks $\mathcal{T}=\{\mathcal{T}_{1},\mathcal{T}_{2},\ldots,\mathcal{T}_{M}\}$, where $M$ is the number of tracks and each track $\mathcal{T}_{i}=\{\mathcal{T}_{i}^{2},\mathcal{T}_{i}^{3},\ldots,\mathcal{T}_{i}^{T}\}$ is a set of 3D bounding boxes of the same instance moving across the previous frame $t=2$ to the last frame $t=T$. Note that part of the track could be missing due to occlusion or missing detections (e.g. $\mathcal{T}_{i}=\{\mathcal{T}_{i}^{2},\mathcal{T}_{i}^{4},\mathcal{T}_{i}^{5}\}$).

  • The spatio-temporal BEV image features $\mathcal{F}=\{\mathcal{F}^{1},\mathcal{F}^{2},\ldots,\mathcal{F}^{T}\}$ from the 4D encoder.

The tracker outputs a matrix $\mathcal{S}$ of size $N\times M$, where each element is a match probability between a detection and a track.

A common way to associate detections to tracks is based on a hand-designed filter that often uses a simple linear motion model. The filter predicts the next state of each track, and the tracker measures the distance of the predicted states to the detections to build a cost matrix. However, simple motion models may not be sufficient to represent complicated real-world motions, such as a vehicle following a curvy road, as visualized in Figure 2. In contrast, a neural network can directly observe the car’s motion along the road and associate its movement. Therefore, we propose to implicitly model motions within the neural network by directly predicting the match probability of detections to current tracks.
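
For contrast, the sketch below shows the kind of hand-designed constant-velocity association that the learned score replaces: detections are backprojected by their predicted velocity and matched to the nearest track center. This is an illustrative baseline in the spirit of the CenterPoint tracker described above, not our method, and the array layout is an assumption.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def constant_velocity_match(det_xy, det_vel, track_xy, dt, gate):
        """Match detections to tracks with a hand-designed constant-velocity model.

        det_xy:   (N, 2) detection BEV centers at t=1
        det_vel:  (N, 2) predicted velocities
        track_xy: (M, 2) last known track BEV centers (t=2)
        dt:       time gap between frames in seconds
        gate:     gating distance G in meters
        """
        backprojected = det_xy - det_vel * dt                 # where each detection was at t=2
        cost = np.linalg.norm(backprojected[:, None] - track_xy[None], axis=-1)  # (N, M)
        cost[cost > gate] = 1e6                               # forbid matches beyond the gate
        rows, cols = linear_sum_assignment(cost)
        return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]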

Furthermore, object detection and tracking are related problems, predicting the object instance and location over time. Research in multi-task learning reveals that solving related problems jointly with a shared network enhances both (Zhang and Yang 2021). However, most previous detection-to-track approaches decouple detection and tracking into independent pipelines. To this end, we propose solving online tracking as the second stage of a detector by aligning spatio-temporal ROI features with our novel TrackAlign. Our experiments quantitatively verify that matching tracks based on the learned detection-to-track matching score improves tracking quality.

TrackAlign

Figure 3: Visualization of TrackAlign with the number of frames $T=3$, the $i$-th bounding box $\mathcal{B}_{i}$, the $j$-th track $\mathcal{T}_{j}=\{\mathcal{T}_{j}^{2},\mathcal{T}_{j}^{3}\}$, and spatio-temporal BEV image features $\mathcal{F}=\{\mathcal{F}^{1},\mathcal{F}^{2},\mathcal{F}^{3}\}$. For each timestep, rotated ROIAlign samples bounding box features using bilinear interpolation. Then, we MaxPool the features across the temporal axis.

We visualize TrackAlign in Figure 3. TrackAlign samples ROI features $\mathbb{F}$ of size $N\times M$ for every combination of $\mathcal{B}$ and $\mathcal{T}$. Without loss of generality, we describe how TrackAlign extracts the $i$-th row and $j$-th column of $\mathbb{F}$, $\mathbb{F}_{i\times j}$, from the $i$-th bounding box $\mathcal{B}_{i}$ and $j$-th track $\mathcal{T}_{j}$.

First, rotated ROIAlign (He et al. 2017) extracts the temporal ROI feature $\mathbb{F}_{i\times j}^{t}$ of each timestep $t$ from a 2D BEV feature $\mathcal{F}^{t}$ and a 2D rotated bounding box $\mathbb{B}^{t}$. At inference time, $\mathbb{B}^{1}=\texttt{BBox3Dto2D}(\mathcal{B}_{i})$ and $\mathbb{B}^{t}=\texttt{BBox3Dto2D}(\mathcal{T}_{j}^{t})\ \forall t\in[2,T]$. BBox3Dto2D is a function that converts a 3D bounding box to a 2D BEV bounding box by removing its $Z$-axis elements, $d$ and $h$. In case a track is missing at certain timestamps $t_{\text{miss}}$, we skip extracting the feature: $\mathbb{F}_{i\times j}^{t_{\text{miss}}}=\emptyset$. Finally, we MaxPool all extracted ROI features across the temporal axis: $\mathbb{F}_{i\times j}=\texttt{MaxPool}(\mathbb{F}_{i\times j}^{1},\mathbb{F}_{i\times j}^{2},\ldots,\mathbb{F}_{i\times j}^{T})$.

There are three benefits of TrackAlign worth mentioning: 1. the aggregated feature $\mathbb{F}$ is aware of the heading of an instance since we use rotated ROIAlign; 2. the bilinear interpolation of ROIAlign allows sub-pixel reasoning, which matters in practice because a pixel of the BEV image used in our network is 60cm, bigger than the average human shoulder-to-shoulder distance of 40cm; 3. TrackAlign is agnostic to the length of the temporal window $T$ and to missing detections in tracks, owing to the symmetric function MaxPool.
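
A simplified sketch of TrackAlign is given below; for brevity it substitutes torchvision's axis-aligned roi_align for the rotated ROIAlign used in the paper and assumes boxes have already been converted to BEV pixel coordinates.

    import torch
    from torchvision.ops import roi_align

    def track_align(bev_feats, det_box_bev, track_boxes_bev, out_size=7):
        """Aggregate one detection-to-track ROI feature F_{i x j}.

        bev_feats:       list of T tensors, each (1, C, H', W'), i.e. {F^1, ..., F^T}
        det_box_bev:     (4,) tensor (x1, y1, x2, y2) of detection B_i in BEV pixels
        track_boxes_bev: dict {t: (4,) tensor} of track T_j boxes for t in [2, T];
                         missing timesteps are simply absent from the dict
        Note: the paper uses rotated ROIAlign; the axis-aligned roi_align here is a
        simplification for illustration.
        """
        boxes = {1: det_box_bev, **track_boxes_bev}
        per_t = []
        for t, box in boxes.items():
            feat = roi_align(bev_feats[t - 1], [box[None]], output_size=out_size,
                             spatial_scale=1.0, aligned=True)   # (1, C, out, out)
            per_t.append(feat)
        # Symmetric temporal aggregation: element-wise max over available timesteps,
        # so the result is agnostic to T and to missing detections in the track.
        return torch.stack(per_t, dim=0).max(dim=0).values      # (1, C, out, out)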

DetectionToTrackClassifier

Given the ROI features $\mathbb{F}$ of size $N\times M$ for every combination of detections and tracks, the DetectionToTrackClassifier predicts a cost matrix of match probabilities between detections and tracks, $\mathcal{S}$ of size $N\times M$. The DetectionToTrackClassifier is simply a few layers of element-wise multilayer perceptron (MLP) applied to $\mathbb{F}$ to predict match probability logits for logistic regression.
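
A minimal sketch of this head in PyTorch, assuming the pooled ROI feature of each (detection, track) pair has been flattened into a vector; the hidden width is illustrative.

    import torch.nn as nn

    class DetectionToTrackClassifier(nn.Module):
        """Element-wise MLP mapping each pooled ROI feature F_{i x j} to a match logit."""
        def __init__(self, in_dim, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, 1))       # one logit per (detection, track) pair

        def forward(self, roi_feats):        # roi_feats: (N, M, in_dim)
            return self.mlp(roi_feats).squeeze(-1)   # (N, M) logits; sigmoid gives S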

During training, we are given ground-truth tracks $\mathcal{T}_{\text{GT}}=\{\{\mathcal{T}^{t}_{GTi}\}\ \forall t\in[1,T]\}\ \forall i\in[1,K]$, where $K$ is the number of ground-truth tracks. Using $\mathcal{T}_{\text{GT}}$, we generate a set of positive and negative matching tracks from combinations of tracks at the current frame ($t=1$) and previous frames ($t\in[2,T]$):

$\{\mathcal{T}^{1}_{GTi},\mathcal{T}^{2}_{GTj},\mathcal{T}^{3}_{GTj},\ldots,\mathcal{T}^{T}_{GTj}\}\ \forall i\in[1,K]\ \forall j\in[1,K]$

$\mathcal{T}^{1}_{GTi}$ resembles detection $\mathcal{B}_{i}$ and $\{\mathcal{T}^{2}_{GTj},\mathcal{T}^{3}_{GTj},\ldots,\mathcal{T}^{T}_{GTj}\}$ resembles track $\mathcal{T}_{j}$ during evaluation. Then, we consider the matching tracks ($i=j$) as positive and all others ($i\neq j$) as negative matching tracks.

However, two issues arise with this approach. First, the positive-to-negative ratio $1:(K-1)$ is heavily biased. Second, most of the negative samples are too easy, e.g. matching a pedestrian on one end of the map to a car on the other end of the map. To mitigate these issues, we apply two fixes. First, we extract hard negative samples using a heuristic filter $f$. We define the heuristic filter $f(\mathcal{T}^{1}_{GTi},\{\mathcal{T}^{2}_{GTj},\mathcal{T}^{3}_{GTj},\ldots,\mathcal{T}^{T}_{GTj}\})$ as follows: 1. $c(\mathcal{T}_{GTi})=c(\mathcal{T}_{GTj})$, where $c$ is the semantic class of an object; 2. $dist(\mathcal{T}^{1}_{GTi},\mathcal{T}^{1}_{GTj})<G$, where $dist$ is the center-to-center 2D Euclidean BEV distance and $G$ is a hyperparameter gating distance. Any tracks that do not satisfy $f$ are filtered out to make the negative samples of the training data more meaningful. Second, we use a binary focal loss (Lin et al. 2017), where $\hat{p}$ is the predicted match probability, $\alpha$ is the post-filter positive-to-negative ratio, and $\gamma=2$:

$L_{\text{track}}=-1[i=j]\,\alpha(1-\hat{p})^{\gamma}\log(\hat{p})-1[i\neq j]\,\hat{p}^{\gamma}\log(1-\hat{p})$
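
A hedged sketch of this loss in PyTorch; the tensor shapes and the averaging over surviving pairs are our assumptions.

    import torch

    def detection_to_track_focal_loss(logits, is_match, alpha, gamma=2.0):
        """Binary focal loss over (detection, track) pairs that survive the filter f.

        logits:   (P,) match logits for the surviving pairs
        is_match: (P,) float tensor, 1 where i == j (positive pair), 0 otherwise
        alpha:    post-filter positive-to-negative ratio
        """
        p = torch.sigmoid(logits)
        pos = -alpha * (1.0 - p) ** gamma * torch.log(p.clamp(min=1e-6))
        neg = -(p ** gamma) * torch.log((1.0 - p).clamp(min=1e-6))
        return (is_match * pos + (1.0 - is_match) * neg).mean()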

The final training loss of Minkowski Tracker R-CNN is as follows, where $\lambda_{\text{track}}$ is a loss balancing term:

$L=L_{\text{det}}+\lambda_{\text{track}}L_{\text{track}}$
Method AMOTA↑ MOTA↑ RECALL↑ FP↓ FN↓ IDS↓ FRAG↓
StanfordIPRL-TRI (Chiu et al. 2020) 0.550 0.459 0.600 17533 33216 950 776
CenterPoint (Yin, Zhou, and Krahenbuhl 2021) 0.638 0.537 0.675 18612 22928 760 529
OGR3MOT (Zaech et al. 2022) 0.656 0.554 0.692 17877 24013 288 371
Belief Propagation (Meyer et al. 2018) 0.666 0.571 0.684 16884 22381 182 245
SimpleTrack (Pang, Li, and Wang 2021) 0.668 0.566 0.703 17514 23451 575 591
ImmortalTracker (Wang et al. 2021) 0.677 0.572 0.714 18012 21661 320 477
GNN-PMB (Liu et al. 2022) 0.678 0.563 0.696 17071 21521 770 431
NEBP (Liang and Meyer 2022) 0.683 0.584 0.705 16773 21971 227 299
TransFusion-L (Bai et al. 2022) 0.686 0.571 0.731 17851 23437 893 626
Minkowski Tracker (ours) 0.698 0.578 0.757 19340 21220 325 217
Table 1: Tracking results on the Nuscenes test set. Our proposed method achieves state-of-the-art performance on the AMOTA metric. Minkowski Tracker has the highest recall and the fewest false negatives, meaning that our 4D object detector trained jointly with online tracking recovers as many tracks as possible. Minkowski Tracker also has the lowest track fragmentation and low ID switches, demonstrating that our network successfully learns implicit motion models to assign detections to tracks.

Track match and track confidence update

Minkowski Tracker updates the tracks $\mathcal{T}$ given the outputs of Minkowski Tracker R-CNN: the detection bounding boxes $\mathcal{B}$ and the detection-to-track match probabilities $\mathcal{S}$. Prior to assigning detections to tracks, the tracker filters out obvious mismatches using the aforementioned heuristic filter $f(\mathcal{B}_{i},\mathcal{T}_{j})$.

Additionally, in order to enforce geometric relations in the matching scheme, we also define a distance ratio $\mathcal{D}$ of size $N\times M$ as $\mathcal{D}_{i\times j}=dist(\mathcal{B}_{i},\mathcal{T}^{2}_{j}+vel_{j}^{2})/G$. Since any tracks with $dist(\cdot,\cdot)>G$ are filtered out, the distance ratio is within $[0,1]$, the smaller the better. Then, we Hungarian-match detections $\mathcal{B}$ to tracks $\mathcal{T}$ to minimize the cost matrix $\lambda_{D}\mathcal{D}-(1-\lambda_{D})\mathcal{S}$, where $\lambda_{D}$ is a balancing hyperparameter. Any unmatched detections are initialized as new tracks. Any tracks unmatched for longer than $T$ frames are considered dead tracks.

Lastly, we estimate the confidence of each track. Track confidence scores determine the reliability of a track, which is crucial for its applications; thus, the tracking evaluation metric AMOTA (Weng et al. 2020a) takes track confidence into account. The most commonly defined track confidence score is the average of the detection confidence scores $s_{\text{det}}$ in a track. This is a reasonable choice since a track is composed of detections. However, it lacks information on the confidence of the associations among the detections. Therefore, we complement the missing information by defining the track score as the track average of $(1-\lambda_{S})s_{\text{det},i}+\lambda_{S}\mathcal{S}_{i\times j}$, a linear combination of a detection score and its detection-to-track match probability score. In our experiments, we quantitatively confirm that our score incorporating the detection-to-track match probability improves AMOTA.
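
A minimal sketch of this track match and confidence update, assuming the heuristic filter $f$ has been precomputed as a boolean mask; $\lambda_{D}$ and $\lambda_{S}$ are as defined above, and the array layout is an assumption.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_and_score(S, D, det_scores, feasible, lambda_d=0.5, lambda_s=0.2):
        """Assign detections to tracks and compute per-detection track-score contributions.

        S:          (N, M) detection-to-track match probabilities from the network
        D:          (N, M) distance ratios in [0, 1] (gated center distance / G)
        det_scores: (N,) detection confidences s_det
        feasible:   (N, M) boolean mask from the heuristic filter f (same class, within G)
        """
        cost = lambda_d * D - (1.0 - lambda_d) * S
        cost = np.where(feasible, cost, 1e6)          # forbid filtered-out pairs
        rows, cols = linear_sum_assignment(cost)
        matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
        # Track confidence contribution of each matched detection: a linear blend of its
        # detection score and its match probability; the final track confidence is the
        # average of these contributions along the track. Unmatched detections start new
        # tracks elsewhere in the track manager.
        contrib = {i: (1.0 - lambda_s) * det_scores[i] + lambda_s * S[i, j]
                   for i, j in matches}
        return matches, contrib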

Implementation details

Our implementation is based on an open-source implementation of CenterPoint (Yin, Zhou, and Krahenbuhl 2021) by OpenMMLab. Our encoder architecture follows that of SECOND (Yan, Mao, and Li 2018) with some changes to process the 4D data, as described in the 4D sparse encoder section. The last layer of the encoder is composed of three deformable convolutions (Dai et al. 2017) that learn different features for classification, bounding box regression, and instance prediction. The classification and bounding box regression features are consumed by the first-stage CenterPoint head, as discussed in the 3D object detector section. The instance prediction features are consumed by the second-stage tracker described in the 3D online tracker section.

For hyperparameters, the gating distance $G$, following the tracker scheme of CenterPoint, is based on the per-class 99th-percentile error of the velocity prediction: 4m for cars and trucks, 5.5m for buses, 3m for trailers, 1m for pedestrians, 13m for motorcycles, and 3m for bicycles. The focal loss term $\alpha$ is the post-filtering per-class positive-to-negative ratio: 4 for cars and pedestrians, and 2 for all else. The detection and track loss balancing term is $\lambda_{\text{track}}=1$. The track assignment cost matrix balancing parameter is $\lambda_{D}=0.5$ and the track confidence score balancing term is $\lambda_{S}=0.2$, as shown in our ablation study.

For the Nuscenes dataset, we crop point clouds to the range $W=[-54,54]$, $H=[-54,54]$, $D=[-5,3]$ and voxelize them at size $(v_{W},v_{H},v_{D})=(0.075,0.075,0.2)$ for the X, Y, and Z axes, respectively. The temporal axis size is $T=3$. Each $t$ spans 0.5 seconds, following the 2 Hz sampling rate of the tracking key sequences, and is composed of a concatenation of 10 samples of high-frequency lidar point clouds. We train our network for 30 epochs with the AdamW (Loshchilov and Hutter 2018) optimizer with a learning rate of 1e-4 and a weight decay of 1e-2, using a one-cycle learning rate scheduler. We trained our model with a batch size of 16 on 8 TITAN RTX GPUs.
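
For reference, the key settings above can be collected into a single configuration sketch; the dictionary layout is illustrative and not tied to the OpenMMLab config format.

    config = dict(
        point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],  # assumed (x_min, y_min, z_min, x_max, y_max, z_max) order
        voxel_size=[0.075, 0.075, 0.2],                           # (v_W, v_H, v_D) in meters
        num_frames=3,                                             # temporal axis size T, 0.5 s apart (2 Hz keyframes)
        sweeps_per_frame=10,                                      # high-frequency lidar sweeps concatenated per t
        lambda_track=1.0,                                         # detection/track loss balance
        lambda_d=0.5,                                             # distance vs. match-score balance in assignment
        lambda_s=0.2,                                             # detection vs. match-score balance in confidence
        gating_distance=dict(car=4.0, truck=4.0, bus=5.5, trailer=3.0,
                             pedestrian=1.0, motorcycle=13.0, bicycle=3.0),  # meters
        optimizer=dict(type="AdamW", lr=1e-4, weight_decay=1e-2),
        epochs=30, batch_size=16,
    )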

Experiments

Dataset

The Nuscenes dataset (Caesar et al. 2020) is a large-scale autonomous driving dataset. It comprises 1000 driving scenes of 20 seconds in length. Each scene is labeled with 3D bounding boxes of 7 object classes at 2 Hz to facilitate multi-object tracking algorithms. The dataset provides multiple sensor modalities, while we only use lidar point clouds as input to our system. Following the tracking challenge protocol, our temporal window at each $t$ is roughly 0.5 seconds.

Metrics

Following the Nuscenes tracking challenge, we use the following metrics to evaluate our tracker: AMOTA (Weng et al. 2020a), MOTA (Bernardin and Stiefelhagen 2008), RECALL, FP (False Positive), FN (False Negative), IDS (track ID switches), and FRAG (track FRAGmentation). MOTA has long been a standard evaluation metric for object tracking, measuring the common track association errors such as FP, FN, and IDS. AMOTA takes track confidence scores into account, being an integral of MOTA over recall scores. AMOTA is the primary metric of the Nuscenes tracking challenge.

Baseline model

CenterPoint (Yin, Zhou, and Krahenbuhl 2021) is our primary baseline 3D object detection and online tracking model. CenterPoint builds a simple yet powerful 3D online tracking baseline by backprojecting each detection with its predicted velocity and matching it to the closest track. Our method is agnostic to the choice of a 3D object detector. We chose CenterPoint since most of the previous works in the Nuscenes tracking challenge take the detection result of CenterPoint as an input to the tracker.

Results

We tested our method on the test set of the Nuscenes tracking challenge. As shown in Table 1, our method achieved the state-of-the-art AMOTA among all published submissions based only on lidar point clouds. Minkowski Tracker has notably high recall and the fewest false negatives among all methods. This result indicates that our method is superb at recovering ground-truth tracks with limited observations, owing to the 4D encoder’s temporal reasoning and the multi-task training of detection and tracking. Minkowski Tracker also has the lowest track fragmentation and low ID switches, demonstrating that the network’s implicit motion model successfully connects the tracks. Overall, our proposed framework quantitatively outperforms previous works on various metrics.

It is also worth noting that the performance of multi-object tracking is bound by that of object detection. Minkowski Tracker uses CenterPoint as its object detection head, like most of the other CenterPoint-based trackers in Table 1. Our network is modular, and the latest developments in 3D object detection can easily be incorporated by updating the detection head.

[Figure 4 image grid: three scenes, each showing CenterPoint (top row) vs. Ours (bottom row) at t=1, t=2, t=3.]
Figure 4: Qualitative results comparing the tracking results of CenterPoint with Minkowski Tracker. Our method predicts consistent tracklets of pedestrians, being more robust to false positives, false negatives, and ID switches.

In Figure 4, we visualize qualitative tracking results on the validation set of the Nuscenes dataset, comparing CenterPoint with Minkowski Tracker. First, CenterPoint suffers from detection errors such as false positives and false negatives. In contrast, our method predicts persistent detections, benefiting from the temporal reasoning of the 4D encoder. This result aligns with the quantitative result that our method achieves the best recall among all methods. Furthermore, CenterPoint suffers from ID switches, whereas our method robustly tracks the target, demonstrating that our implicit motion model successfully assigns detections to tracks. This result explains why our method achieved the lowest track fragmentation among all methods and fewer than half the ID switches of CenterPoint.

Ablation Study

Method mAP↑ NDS↑
CenterPoint 59.6 66.8
CenterPoint + 4DEncoder 61.9 69.3
CenterPoint + 4DEncoder + track 62.4 69.5
Table 2: Detection result comparison on Nuscenes detection validation split between CenterPoint (Yin, Zhou, and Krahenbuhl 2021) and Minkowski Tracker R-CNN. 4DEncoder refers to our network using a 4D sparse encoder. track refers to a detector jointly trained with a tracking loss.
Method AMOTA↑
CenterPoint 66.9
CenterPoint + 4DEncoder 67.4
CenterPoint + 4DEncoder + track 67.8
CenterPoint + 4DEncoder + track + score 69.1
CenterPoint + 4DEncoder + track + conf 68.5
CenterPoint + 4DEncoder + track + score + conf 70.3
Table 3: Multi-object tracking ablation study on Nuscenes tracking validation split. 4DEncoder refers to our network using a 4D sparse encoder. track refers to a detector jointly trained with a tracking loss. score refers to the track assignment based on detection-to-track score. conf refers to the detection-to-track score-based track confidence.

We demonstrate in large-scale experiments that the overall performance gain of our method is due to four factors: 1. the 4D encoder improves detection by adding temporal reasoning; 2. the multi-task learning of object detection and online tracking enhances both; 3. the network learns an implicit motion model that improves track assignment; 4. the track confidence score incorporating the detection-to-track score is more informative. In this section, we thoroughly analyze the performance gain of each component.

First, the 4D encoder of Minkowski Tracker R-CNN improves the object detection performance, subsequently improving the tracking performance. In Tables 2 and 3, we compare the detection and tracking performance of CenterPoint with the original 3D encoder and with our 4D encoder. Minkowski Tracker R-CNN outperforms CenterPoint in object detection owing to the extended horizon of temporal reasoning through the 4D sparse encoder. Subsequently, the tracking result of global-nearest-neighbor matching on the 4D detector’s outputs outperforms CenterPoint.

Second, the multi-task learning of detection and tracking jointly enhances the performance of both. In Tables 2 and 3, we compare the detection and tracking performance of the 4D-encoder CenterPoint trained with and without the tracking loss ($\lambda_{\text{track}}=\{1,0\}$). A detector and tracker trained with the tracking loss (track) outperforms the bare detector. The improved detection from the 4D encoder and multi-task learning explains the quantitative result that our method has the highest recall and the fewest false negatives among all methods.

Figure 5: Ablation study of the impact of the detection-to-track score on track assignment and confidence score. $\lambda_{D}=1$ is a global nearest neighbor matching similar to the tracking algorithm of CenterPoint, whereas $\lambda_{D}=0$ is a matching based solely on our proposed detection-to-track score. $\lambda_{S}=0$ is a tracking confidence score based only on detection confidence, whereas $\lambda_{S}=1$ is a tracking confidence score based only on the detection-to-track score.
Figure 6: Runtime analysis and comparison with other works. The x-axis is tracking performance, and the y-axis is the total runtime, separated into the runtime of the neural network and that of the tracker. The runtime bar chart of each tracker is plotted at the x-axis position of its corresponding tracking performance.

Third, the cost matrix incorporating the detection-to-track match probability score improves the quality of the track assignment. As shown in Table 3, the advanced track assignment with the learned score, denoted as score, gives a consistent quantitative improvement in the quality of tracks. This result indicates that our network learns an implicit motion model that improves track assignment, which explains the lowest track fragmentation and low ID switches in the quantitative results.

Fourth, the detection-to-track score prediction by Minkowski Tracker R-CNN improves the quality of the track confidence. As shown in Table 3, the track confidence incorporating the matching score, denoted as conf, gives a consistent quantitative improvement in the AMOTA metric, a tracking metric that takes the track confidence score into account.

Hyperparameter Search

In Figure 5, we measure the performance of our model with varying hyperparameters. First, the network-predicted matching score $\mathcal{S}$ and the distance ratio $\mathcal{D}$ are both effective measures for track assignment. Nevertheless, a linear combination of $\mathcal{D}$ and $\mathcal{S}$ gives a noticeable improvement in tracking performance, hinting that there is complementary information between the two.

Second, the combination of the network-predicted matching score $\mathcal{S}$ and the detection confidence $s_{\text{det}}$ results in more robust track confidence. The fact that a track is composed of detections and their temporal associations explains why track confidence should incorporate both the detection confidence $s_{\text{det}}$ and the association score $\mathcal{S}$.

Runtime Analysis

In Figure 6, we analyze and compare the runtime of Minkowski Tracker with CenterPoint (Yin, Zhou, and Krahenbuhl 2021) and ImmortalTracker (Wang et al. 2021). The runtime is measured on a single-core AMD Ryzen 7 1800X CPU with a single TITAN RTX GPU. The primary runtime concern of our method is the 4D encoder, which requires at least $T\times$ more computation than its 3D counterpart. Not surprisingly, the network runtime of our method is roughly $T=3$ times slower than that of CenterPoint.

However, Minkowski Tracker does not require complicated hand-designed filters to achieve high performance. CenterPoint performs relatively poorly in tracking since it uses a bare-minimum motion model. ImmortalTracker, based on CenterPoint, requires a much higher tracker runtime to compute a 3D Kalman filter with a linear motion model and 3D GIoU (Rezatofighi et al. 2019) for reliable matching. In contrast, Minkowski Tracker relies on the network’s implicit motion model and takes a simple linear assignment of the detection-to-track score, which requires little runtime in the tracker. As a result, Minkowski Tracker is $6\times$ faster than ImmortalTracker while achieving state-of-the-art tracking performance.

Conclusion

We propose Minkowski Tracker, a sparse spatio-temporal R-CNN for joint object detection and tracking. Our network processes 4D point clouds using a 4D sparse encoder to improve detection with temporal reasoning. Our proposed TrackAlign enables two-stage joint detection and tracking for improved multi-task learning. Furthermore, our network learns an implicit motion model, which advances track assignment and the confidence score. As a result, Minkowski Tracker reaches state-of-the-art performance on the Nuscenes tracking challenge, achieving the highest recall and the fewest false negatives, as validated in our qualitative results and ablation study.

Acknowledgement

Toyota Research Institute provided funds to support this work.

References

  • Bai et al. (2022) Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; and Tai, C.-L. 2022. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1090–1099.
  • Baser et al. (2019) Baser, E.; Balasubramanian, V.; Bhattacharyya, P.; and Czarnecki, K. 2019. Fantrack: 3d multi-object tracking with feature association network. In 2019 IEEE Intelligent Vehicles Symposium (IV), 1426–1433. IEEE.
  • Bernardin and Stiefelhagen (2008) Bernardin, K.; and Stiefelhagen, R. 2008. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008: 1–10.
  • Brasó and Leal-Taixé (2020) Brasó, G.; and Leal-Taixé, L. 2020. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6247–6257.
  • Caesar et al. (2020) Caesar, H.; Bankiti, V.; Lang, A. H.; Vora, S.; Liong, V. E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11621–11631.
  • Chiu et al. (2020) Chiu, H.-k.; Prioletti, A.; Li, J.; and Bohg, J. 2020. Probabilistic 3D Multi-Object Tracking for Autonomous Driving. arXiv preprint arXiv:2001.05673.
  • Choy, Gwak, and Savarese (2019) Choy, C.; Gwak, J.; and Savarese, S. 2019. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3075–3084.
  • Dai et al. (2017) Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, 764–773.
  • Dendorfer et al. (2019) Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; and Leal-Taixé, L. 2019. CVPR19 Tracking and Detection Challenge: How crowded can it get? arXiv:1906.04567 [cs]. ArXiv: 1906.04567.
  • Deng et al. (2021) Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; and Li, H. 2021. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1201–1209.
  • Erol et al. (2018) Erol, B. A.; Majumdar, A.; Lwowski, J.; Benavidez, P.; Rad, P.; and Jamshidi, M. 2018. Improved deep neural network object tracking system for applications in home robotics. In Computational Intelligence for Pattern Recognition, 369–395. Springer.
  • Geiger et al. (2013) Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11): 1231–1237.
  • Girshick (2015) Girshick, R. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, 1440–1448.
  • Gkioxari, Malik, and Johnson (2019) Gkioxari, G.; Malik, J.; and Johnson, J. 2019. Mesh r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9785–9795.
  • Graham (2014) Graham, B. 2014. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070.
  • Gross et al. (2011) Gross, H.-M.; Schroeter, C.; Mueller, S.; Volkhardt, M.; Einhorn, E.; Bley, A.; Langner, T.; Martin, C.; and Merten, M. 2011. I’ll keep an eye on you: Home robot companion for elderly people with cognitive impairment. In 2011 IEEE International Conference on Systems, Man, and Cybernetics, 2481–2488. IEEE.
  • He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2961–2969.
  • Hecht (2018) Hecht, J. 2018. Lidar for self-driving cars. Optics and Photonics News, 29(1): 26–33.
  • Kalman (1960) Kalman, R. E. 1960. A new approach to linear filtering and prediction problems. Basic Engineering.
  • Lang et al. (2019) Lang, A. H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; and Beijbom, O. 2019. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12697–12705.
  • Leal-Taixé et al. (2015) Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; and Schindler, K. 2015. MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv:1504.01942 [cs]. ArXiv: 1504.01942.
  • Liang and Meyer (2022) Liang, M.; and Meyer, F. 2022. Neural Enhanced Belief Propagation for Data Assocation in Multiobject Tracking. arXiv preprint arXiv:2203.09948.
  • Lin et al. (2017) Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988.
  • Liu et al. (2022) Liu, J.; Bai, L.; Xia, Y.; Huang, T.; and Zhu, B. 2022. GNN-PMB: A Simple but Effective Online 3D Multi-Object Tracker without Bells and Whistles. arXiv e-prints, arXiv–2206.
  • Loshchilov and Hutter (2018) Loshchilov, I.; and Hutter, F. 2018. Fixing weight decay regularization in adam. N/A.
  • Luo, Yang, and Yuille (2021) Luo, C.; Yang, X.; and Yuille, A. 2021. Exploring simple 3d multi-object tracking for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10488–10497.
  • Martin-Martin et al. (2021) Martin-Martin, R.; Patel, M.; Rezatofighi, H.; Shenoi, A.; Gwak, J.; Frankel, E.; Sadeghian, A.; and Savarese, S. 2021. Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE transactions on pattern analysis and machine intelligence.
  • Maurer et al. (2016) Maurer, M.; Gerdes, J. C.; Lenz, B.; and Winner, H. 2016. Autonomous driving: technical, legal and social aspects. Springer Nature.
  • Meyer et al. (2018) Meyer, F.; Kropfreiter, T.; Williams, J. L.; Lau, R.; Hlawatsch, F.; Braca, P.; and Win, M. Z. 2018. Message passing algorithms for scalable multitarget tracking. Proceedings of the IEEE, 106(2): 221–259.
  • Pang, Morris, and Radha (2021) Pang, S.; Morris, D.; and Radha, H. 2021. 3D Multi-Object Tracking using Random Finite Set-based Multiple Measurement Models Filtering (RFS-M 3) for Autonomous Vehicles. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 13701–13707. IEEE.
  • Pang, Li, and Wang (2021) Pang, Z.; Li, Z.; and Wang, N. 2021. Simpletrack: Understanding and rethinking 3d multi-object tracking. arXiv preprint arXiv:2111.09621.
  • Petrovskaya and Thrun (2008) Petrovskaya, A.; and Thrun, S. 2008. Model based vehicle tracking for autonomous driving in urban environments. Proceedings of robotics: science and systems IV, Zurich, Switzerland, 34.
  • Pöschmann, Pfeifer, and Protzel (2020) Pöschmann, J.; Pfeifer, T.; and Protzel, P. 2020. Factor graph based 3d multi-object tracking in point clouds. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10343–10350. IEEE.
  • Qi et al. (2017) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652–660.
  • Rana et al. (2014) Rana, R.; Yang, M.; Wark, T.; Chou, C. T.; and Hu, W. 2014. SimpleTrack: Adaptive trajectory compression with deterministic projection matrix for mobile sensor networks. IEEE Sensors Journal, 15(1): 365–373.
  • Ren et al. (2015) Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
  • Rezatofighi et al. (2019) Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; and Savarese, S. 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 658–666.
  • Shi, Wang, and Li (2019) Shi, S.; Wang, X.; and Li, H. 2019. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 770–779.
  • Sun et al. (2020) Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. 2020. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2446–2454.
  • Wang et al. (2021) Wang, Q.; Chen, Y.; Pang, Z.; Wang, N.; and Zhang, Z. 2021. Immortal Tracker: Tracklet Never Dies. arXiv preprint arXiv:2111.13672.
  • Weng et al. (2020a) Weng, X.; Wang, J.; Held, D.; and Kitani, K. 2020a. 3d multi-object tracking: A baseline and new evaluation metrics. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10359–10366. IEEE.
  • Weng et al. (2020b) Weng, X.; Wang, J.; Held, D.; and Kitani, K. 2020b. AB3DMOT: A Baseline for 3D Multi-Object Tracking and New Evaluation Metrics. ECCVW.
  • Weng et al. (2020c) Weng, X.; Wang, Y.; Man, Y.; and Kitani, K. M. 2020c. Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6499–6508.
  • Wu et al. (2021) Wu, H.; Han, W.; Wen, C.; Li, X.; and Wang, C. 2021. 3d multi-object tracking in point clouds based on prediction confidence-guided data association. IEEE Transactions on Intelligent Transportation Systems.
  • Yan, Mao, and Li (2018) Yan, Y.; Mao, Y.; and Li, B. 2018. Second: Sparsely embedded convolutional detection. Sensors, 18(10): 3337.
  • Yang et al. (2020) Yang, Z.; Sun, Y.; Liu, S.; and Jia, J. 2020. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11040–11048.
  • Yin, Zhou, and Krahenbuhl (2021) Yin, T.; Zhou, X.; and Krahenbuhl, P. 2021. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11784–11793.
  • Zaech et al. (2022) Zaech, J.-N.; Liniger, A.; Dai, D.; Danelljan, M.; and Van Gool, L. 2022. Learnable online graph representations for 3d multi-object tracking. IEEE Robotics and Automation Letters, 7(2): 5103–5110.
  • Zamir et al. (2020) Zamir, A. R.; Sax, A.; Cheerla, N.; Suri, R.; Cao, Z.; Malik, J.; and Guibas, L. J. 2020. Robust learning through cross-task consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11197–11206.
  • Zhang and Yang (2021) Zhang, Y.; and Yang, Q. 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering.
  • Zhou and Tuzel (2018) Zhou, Y.; and Tuzel, O. 2018. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4490–4499.
  • Zhuang et al. (2020) Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; and He, Q. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109: 43–76.