
DetFlowTrack: 3D Multi-object Tracking based on Simultaneous Optimization of Object Detection and Scene Flow Estimation

Yueling Shen, Guangming Wang, and Hesheng Wang *This work was supported in part by the Natural Science Foundation of China under Grant 62073222 and U1913204, in part by the “Shu Guang” Project supported by the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation under Grant 19SG08, in part by the Shenzhen Science and Technology Program under Grant JSGG20201103094400002, in part by the Science and Technology Commission of Shanghai Municipality under Grant 21511101900, and in part by grants from NVIDIA Corporation. Corresponding Author: Hesheng Wang. The authors are with the Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai Jiao Tong University, Shanghai 200240, China.
Abstract

3D Multi-Object Tracking (MOT) is an important part of the perception module of unmanned vehicles. Most methods optimize object detection and data association independently, which complicates the network structure and limits the improvement of MOT accuracy. We propose a 3D MOT framework based on simultaneous optimization of object detection and scene flow estimation. Within this framework, a detection-guidance scene flow module is proposed to relieve the problem of incorrect inter-frame association. To obtain more accurate scene flow labels, especially in the case of motion with rotation, a box-transformation-based scene flow ground truth calculation method is proposed. Experimental results on the KITTI MOT dataset show competitive performance against state-of-the-art methods and robustness under extreme motion with rotation.

I INTRODUCTION

3D MOT is an important perception technique required for autonomous driving applications[1, 2]. It takes a continuous frame sequence as input and outputs the trajectories of targets of specific categories. The trajectory of a target is represented by 3D boxes with a tracking ID in consecutive frames, and the same target is identified by the same unique ID across frames.

Most methods[2, 3, 1, 4, 5] follow the tracking-by-detection framework, which splits the task into two stages: object detection and data association. Given the object detection results, the data association module calculates similarities between detection pairs in adjacent frames and assigns the same unique ID to the same target. Because detection and data association are optimized independently in this framework, it limits the improvement of tracking accuracy and makes the whole pipeline redundant and complex.

Traditional data association methods[2, 3, 1] associate boxes in adjacent frames with motion prediction methods such as Kalman filters or particle filters. These filter-based methods are vulnerable under extreme motion conditions such as sudden braking or turning. Moreover, their fixed motion parameters are not suitable for multi-category MOT, because different types of targets may follow different motion models.

Optical flow[6, 7] and scene flow[8, 9] estimate the inter-frame motion of points, whereas MOT requires the inter-frame motion of objects rather than points. Considering the correlation between scene flow estimation and MOT, many studies[10, 4, 11] estimate the inter-frame motion of objects from scene flow. FlowMOT[4] directly replaces the motion estimation model with a trained scene flow model, which improves robustness under extreme motion; however, detection and scene flow estimation are still optimized independently. PointTrackNet[11] proposes an end-to-end 3D object detection and tracking network that optimizes object detection first and then performs data association based on scene flow estimation. It does not consider incorrect association between points belonging to different targets, and its scene flow labels are not accurate enough when the target rotates. These problems limit the accuracy of the MOT results.

To solve the above problems, we propose a 3D multi-object tracking framework based on simultaneous optimization of object detection and scene flow estimation, which simplifies the network structure, improves MOT accuracy, and increases robustness under extreme motion with rotation. The contributions of our work are as follows:

  • A new framework of 3D MOT that simultaneously optimizes object detection and scene flow estimation is proposed. It takes two adjacent point clouds as input and outputs MOT trajectories.

  • A detection-guidance scene flow module is proposed to relieve incorrect inter-frame association of points belonging to different objects.

  • A box-transformation-based scene flow label calculation method is proposed to improve the robustness in the presence of extreme motion with rotation.

II THE PROPOSED METHOD

II-A Overall Framework

Instead of optimizing object detection and scene flow estimation independently, we propose a 3D multi-object tracking framework that optimizes both simultaneously. The proposed framework takes point cloud sequences as input and outputs object trajectories.

As shown in Figure 1, point-wise data association between adjacent frames is achieved by the scene flow head to avoid dependence on motion model parameters. To simplify the network structure, we extract both the object detection feature and the scene flow feature in the feature extraction module. During training, the object detection loss and the scene flow loss are summed with weights, and the total loss is used to optimize the overall network. Because the detection results contain semantic information about the points, we feed them into the scene flow module as a guide to avoid incorrect association of points belonging to different objects. To achieve multi-object tracking, boxes in adjacent frames are matched in the box association module according to the object detection and scene flow results. The trajectory generation module takes the box matches as input and updates the trajectories.

Figure 1: Overall Framework. With point clouds as input, the point-wise detection feature and scene flow feature are extracted simultaneously by the feature extraction module. Guided by the detection results, the scene flow features of two adjacent frames are fed into the scene flow head to estimate the inter-frame point motion. From the estimated point motion, inter-frame box motion is estimated in the box association module, which is used to associate boxes between frames and finally generate the trajectories.

II-B Feature Extraction

To extract the point-wise detection feature and flow feature of the point cloud simultaneously, we utilize the backbone of PV-RCNN[12]. As shown in Figure 2, voxel features at multiple resolutions are extracted in the voxel feature extraction stage. The overhead feature map is then obtained from the last voxel feature layer by vertical projection. The point cloud is downsampled by farthest point sampling to obtain key points. In the key point feature extraction stage, the multi-resolution voxel features are summarized into the feature embeddings of this small set of key points. Two parallel multi-layer perceptrons take the key point feature embeddings as input and output the detection feature and the scene flow feature simultaneously.

Figure 2: The Framework of Feature Extraction
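As a rough illustration of the two parallel branches (not the authors' released code, and with all layer widths chosen as illustrative assumptions), the split of the key point embedding into a detection feature and a scene flow feature could look like this:

```python
import torch
import torch.nn as nn

# Minimal sketch: after a PV-RCNN-style backbone produces per-keypoint embeddings,
# two parallel MLP branches emit the detection feature and the scene flow feature.
class ParallelFeatureHeads(nn.Module):
    def __init__(self, in_dim=128, det_dim=128, flow_dim=64):
        super().__init__()
        self.det_branch = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, det_dim))
        self.flow_branch = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, flow_dim))

    def forward(self, keypoint_feat):            # (N, in_dim) keypoint embeddings
        f_det = self.det_branch(keypoint_feat)   # (N, det_dim) detection feature
        f_flow = self.flow_branch(keypoint_feat) # (N, flow_dim) scene flow feature
        return f_det, f_flow

f_det, f_flow = ParallelFeatureHeads()(torch.randn(2048, 128))  # e.g. 2048 key points
```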

II-C Detection Head

For accurate detection results, we apply a two-stage detection head as in PV-RCNN[12]. In the first stage, 3D box proposals are generated from the overhead feature map. In the proposal refinement stage, the 3D box proposals are refined using the detection features of the points inside them. For 3D box proposal $i$, the detection head outputs the box result $b_{i}=(px_{i},py_{i},pz_{i},w_{i},l_{i},h_{i},\theta_{i},c_{i})$ and the score result $s_{i}$.

II-D Scene Flow Head

The scene flow head contains a cost layer and a scene flow calculation layer. The cost layer performs feature aggregation between the two point clouds and outputs the cost volume. The scene flow calculation layer takes the cost volume as input and outputs the scene flow for each point in frame t-1.

II-D1 Cost Layer

The inputs of the cost layer are two adjacent key point clouds $P^{t-1}=\{p_{i}^{t-1}=(x_{i}^{t-1},y_{i}^{t-1},z_{i}^{t-1})\}_{i=1}^{N}$ and $P^{t}=\{p_{i}^{t}=(x_{i}^{t},y_{i}^{t},z_{i}^{t})\}_{i=1}^{N}$, the corresponding flow features $F_{flow}^{t-1}=\{f_{i}^{t-1}\}_{i=1}^{N}$ and $F_{flow}^{t}=\{f_{i}^{t}\}_{i=1}^{N}$, and the corresponding detection results $O_{det}^{t-1}=\{o_{i}^{t-1}\}_{i=1}^{N}$ and $O_{det}^{t}=\{o_{i}^{t}\}_{i=1}^{N}$, where $N$ is the number of key points in each frame and $t$ ranges from 2 to the number of frames. The flow feature vector is $f_{i}^{t}\in\mathbb{R}^{C_{flow}}$, and the detection result vector $o_{i}^{t}$ is obtained by concatenating the box vector $b_{i}^{t}$ and the score $s_{i}^{t}$ of the detection result. The cost layer outputs the cost volume $c_{i}^{t-1}$ associated with point $p_{i}^{t-1}$ in frame t-1.

Figure 3: The Visualization of Cost Layer

Figure 3 is the visualization of the cost layer. For point $p_{i}^{t-1}$, we find the $K$ nearest points in frame $t$, denoted as $N_{i}^{t}=\{p_{j}^{t}\}_{j=1}^{K}\subset P^{t}$.

Taking point $p_{i}^{t-1}$ as the center point, the features of its neighbor points $N_{i}^{t}$ are aggregated to learn the relative motion between frames.

The previous work[11] concatenates the direction vector $(p_{i}^{t-1}-p_{j}^{t})$, the flow feature of the neighbor point $f_{j}^{t}$, and the flow feature of the center point $f_{i}^{t-1}$ into an embedded feature, and then applies multi-layer perceptrons and element-wise max pooling to directly obtain the point-wise tracking displacement.

This method takes all neighbor points into consideration during aggregation. For accurate motion estimation, however, only the neighbor points belonging to the same object as point $p_{i}^{t-1}$ should be included. As shown in Figure 3, the red point with dotted edges belongs to a tree, while the center point belongs to a car. Although it lies within the neighborhood range, aggregating a point that belongs to a different object may disturb the relative motion estimation.

The previous scene flow estimation network[8] takes the embedded features as input; the cost between point $p_{i}^{t-1}$ and its neighbor point $p_{j}^{t}$ is defined as Equation 1, where $h(\cdot)$ is a function with learnable parameters.

$\operatorname{Cost}\left(i,j\right)=h\left(f_{i}^{t-1},f_{j}^{t},p_{i}^{t-1}-p_{j}^{t}\right).$ (1)

Feature aggregation for point $p_{i}^{t-1}$ is conducted by a weighted sum of the costs of all its neighbor points, as defined in Equation 2, where $w(i,j)$ denotes the weight of the match $(p_{i}^{t-1},p_{j}^{t})$. After feature aggregation, we obtain the cost volume $c_{i}^{t-1}$ for point $p_{i}^{t-1}$.

$c_{i}^{t-1}=\sum_{p_{j}^{t}\in N_{i}^{t}}w(i,j)\cdot h\left(f_{i}^{t-1},f_{j}^{t},p_{i}^{t-1}-p_{j}^{t}\right).$ (2)

The weight $w(i,j)$ is calculated as shown in Equation 3. It is learned as a continuous function of the directional vector $(p_{i}^{t-1}-p_{j}^{t})$ with multi-layer perceptrons.

$w(i,j)=MLP\left(p_{i}^{t-1}-p_{j}^{t}\right).$ (3)

This weight calculation only takes the directional vector as input. As a result, the closer a neighbor point is to the center, the greater its weight.

Considering that the detection result contains information about the object a point belongs to, we use the detection results to guide feature aggregation. Our weight is calculated by Equations 4-6, where $w_{g}(i,j)$ denotes the reciprocal of the distance in geometric space, $w_{o}(i,j)$ denotes the reciprocal of the distance in detection result space, $MEAN(x,y)$ denotes the average of $x$ and $y$, and the weight $w(i,j)$ is the average of the normalized $w_{g}(i,j)$ and the normalized $w_{o}(i,j)$. As a result, only nearby points belonging to the same object are given greater weight.

$w_{g}(i,j)=\frac{1}{\|p_{i}^{t-1}-p_{j}^{t}\|_{2}}$ (4)
$w_{o}(i,j)=\frac{1}{\|o_{i}^{t-1}-o_{j}^{t}\|_{2}}$ (5)
$w(i,j)=MEAN\left(\frac{w_{g}(i,j)}{\sum_{k=1}^{K}w_{g}(i,k)},\frac{w_{o}(i,j)}{\sum_{k=1}^{K}w_{o}(i,k)}\right)$ (6)

Substituting Equation 6 into Equation 2, the cost volume $c_{i}^{t-1}$ contains the inter-frame motion information of point $p_{i}^{t-1}$.
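The following sketch (an illustration under our own shape conventions, not the released code) shows how the detection-guided weights of Equations 4-6 and the weighted aggregation of Equation 2 could be implemented, assuming the $K$ nearest neighbors and the pairwise costs $h(\cdot)$ have already been computed; the small epsilon is added only for numerical safety and is not part of the paper's formulation.

```python
import torch

def guided_weights(p_t1, p_t_knn, o_t1, o_t_knn, eps=1e-8):
    # p_t1: (N, 3) centre points;      p_t_knn: (N, K, 3) neighbour coordinates in frame t
    # o_t1: (N, D) detection vectors;  o_t_knn: (N, K, D) neighbour detection vectors
    d_geo = torch.norm(p_t1.unsqueeze(1) - p_t_knn, dim=-1)   # (N, K) geometric distance
    d_det = torch.norm(o_t1.unsqueeze(1) - o_t_knn, dim=-1)   # (N, K) detection-space distance
    w_g = 1.0 / (d_geo + eps)                                 # Eq. 4
    w_o = 1.0 / (d_det + eps)                                 # Eq. 5
    w_g = w_g / w_g.sum(dim=1, keepdim=True)                  # normalise over the K neighbours
    w_o = w_o / w_o.sum(dim=1, keepdim=True)
    return 0.5 * (w_g + w_o)                                  # Eq. 6

def aggregate_cost(weights, pair_costs):
    # weights: (N, K); pair_costs: (N, K, C) = h(f_i^{t-1}, f_j^t, p_i^{t-1} - p_j^t)
    return (weights.unsqueeze(-1) * pair_costs).sum(dim=1)    # (N, C) cost volume, Eq. 2
```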

II-D2 Scene Flow Calculation Layer

The inputs of the scene flow calculation layer are the cost volume set $C^{t-1}=\{c_{i}^{t-1}\}_{i=1}^{N}$, the flow feature set of the first point cloud $F_{flow}^{t-1}$, and the first point cloud $P^{t-1}$. First, the cost volume and the flow feature are concatenated point-wise. Then, PointConv[13] layers are applied to merge the flow features locally, followed by MLP layers that estimate the scene flow $D^{t-1,t}$ corresponding to the first point cloud $P^{t-1}$.
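A minimal sketch of this layer is given below; for brevity, the PointConv local merging step is approximated by a point-wise shared MLP, which is an assumption made only for illustration, and the channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class SceneFlowCalc(nn.Module):
    # Sketch: concatenate flow feature and cost volume per point, then regress a 3D flow.
    def __init__(self, flow_dim=64, cost_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(flow_dim + cost_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))                    # 3D flow vector per point

    def forward(self, flow_feat_t1, cost_volume):    # both (N, C)
        x = torch.cat([flow_feat_t1, cost_volume], dim=-1)
        return self.mlp(x)                           # (N, 3) flow for frame t-1 points
```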

II-D3 Scene Flow Label Calculation

In practical applications, the ground truth label of scene flow is hard to obtain. To solve this problem, we propose an MOT-induced scene flow label calculation algorithm. The MOT ground truth provides detection results with tracking IDs for the objects in each frame, and the tracking ID of the same object stays the same across the sequence. As a result, the detection results in two adjacent frames can be matched according to the tracking ID, so that the inter-frame relative pose transformation of the object can be calculated. Under the rigid-body assumption, the pose transformation of a point belonging to an object is equal to the pose transformation of the object.

PointTrackNet[11] calculates the scene flow label from the bounding boxes' translation between two frames. This box-translation-based method is not accurate enough when the target rotates between frames. We propose a box-transformation-based method to obtain a more accurate scene flow ground truth label. For point $p_{i}^{t-1}$ belonging to box $m$, we transform it with the box transformation $T_{m}$ and obtain the point $\tilde{p}_{i}^{t-1}$. The scene flow ground truth label $\hat{d}_{i}^{t-1}$ is then given by Equation 7.

$\hat{d}_{i}^{t-1}=\tilde{p}_{i}^{t-1}-p_{i}^{t-1}$ (7)
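The label computation can be sketched as follows, assuming KITTI-style boxes parameterized by a center and a yaw angle about the Z axis; the box transformation $T_{m}$ is built from the two boxes sharing the same tracking ID. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def yaw_to_rot(yaw):
    # Rotation about the Z axis by the given yaw angle.
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def scene_flow_label(points_t1, box_t1, box_t):
    # points_t1: (M, 3) points inside the box in frame t-1
    # box_*: (cx, cy, cz, yaw) centre and heading of the matched box in each frame
    c1, yaw1 = np.asarray(box_t1[:3]), box_t1[3]
    c2, yaw2 = np.asarray(box_t[:3]), box_t[3]
    R = yaw_to_rot(yaw2 - yaw1)                   # relative rotation of the box
    transformed = (points_t1 - c1) @ R.T + c2     # apply T_m = (R, c2 - R c1)
    return transformed - points_t1                # scene flow label, Eq. 7
```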

II-E Box Association

Box association converts the inter-frame motion of points, estimated by the scene flow, into the inter-frame motion of objects. First, the inter-frame transformation of each trajectory box is predicted from the scene flow of its internal points. Then, the predictive trajectory box is obtained by transforming the previous-frame trajectory box with the predicted transformation. The matching cost matrix between the predictive trajectory boxes and the detection boxes is calculated from the boxes' intersection over union. Finally, the Hungarian algorithm[14] is applied to the matching cost matrix and outputs the matches between the predictive trajectory boxes and the detection boxes.

The key to box association is to estimate the inter-frame motion of boxes from the inter-frame motion of points given by the scene flow. This process is the reverse of the scene flow ground truth label calculation. Accordingly, we solve the boxes' inter-frame transformation by point cloud registration. Specifically, for each trajectory box $b_{i}^{t-1}$, its internal points are extracted to form the in-box point cloud $c_{i}^{t-1}$. Each point in $c_{i}^{t-1}$ is translated by its scene flow to form the predictive in-box point cloud $\tilde{c}_{i}^{t-1}$. Then, the transformation $T_{i}^{t-1,t}$ between $c_{i}^{t-1}$ and $\tilde{c}_{i}^{t-1}$ is solved with the SVD method. The transformation $T_{i}^{t-1,t}$ is applied to the trajectory box $b_{i}^{t-1}$ to obtain the predictive trajectory box $\tilde{b}_{i}^{t-1}$.
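The rigid-fit and matching steps could look like the sketch below, which uses the standard SVD (Kabsch) solution for the in-box registration and SciPy's Hungarian solver for the box matching; the IoU computation is omitted and the minimum-IoU threshold is an assumption, not a value from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fit_rigid_transform(src, dst):
    # src: (M, 3) in-box points at t-1; dst: (M, 3) = src + estimated scene flow
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t                              # T_i^{t-1,t} = (R, t)

def match_boxes(iou_matrix, min_iou=0.1):
    # iou_matrix: (num_pred_traj, num_det); Hungarian algorithm maximises total IoU
    rows, cols = linear_sum_assignment(-iou_matrix)
    return [(r, c) for r, c in zip(rows, cols) if iou_matrix[r, c] >= min_iou]
```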

II-F Trajectory Generation

The matching result from the box association module is input into the trajectory generation module to update the trajectories. The matching result is represented by three lists: the match list $m^{t-1,t}$, the unmatched detection box list $um_{(d)}^{t-1,t}$, and the unmatched trajectory box list $um_{(t)}^{t-1,t}$. The match list $m^{t-1,t}$ consists of matched pairs of detection boxes and trajectory boxes. The lists $um_{(d)}^{t-1,t}$ and $um_{(t)}^{t-1,t}$ consist of the unmatched detection boxes and the unmatched predictive trajectory boxes, respectively. To cope with occlusion and detection errors, track boxes are added and deleted according to the counts of matches and losses.
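A minimal sketch of this bookkeeping is shown below; the thresholds min_hits and max_lost are assumptions, since the paper only states that tracks are added and deleted according to the counts of matches and losses.

```python
class Track:
    def __init__(self, box, track_id):
        self.box, self.id = box, track_id
        self.hits, self.lost = 1, 0          # match count and consecutive-miss count

def update_tracks(tracks, matches, unmatched_dets, det_boxes,
                  next_id, min_hits=2, max_lost=3):
    for traj_idx, det_idx in matches:        # matched pairs: refresh box, reset miss count
        tracks[traj_idx].box = det_boxes[det_idx]
        tracks[traj_idx].hits += 1
        tracks[traj_idx].lost = 0
    matched_traj = {t for t, _ in matches}
    for i, trk in enumerate(tracks):         # unmatched trajectory boxes: count a miss
        if i not in matched_traj:
            trk.lost += 1
    for det_idx in unmatched_dets:           # unmatched detections: start new tracks
        tracks.append(Track(det_boxes[det_idx], next_id))
        next_id += 1
    tracks = [t for t in tracks if t.lost <= max_lost]      # delete long-lost tracks
    confirmed = [t for t in tracks if t.hits >= min_hits]   # report only stable tracks
    return tracks, confirmed, next_id
```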

II-G Loss Function

The loss function consists of two parts, the detection loss and the scene flow loss, as shown in Equation 8.

$L=L_{det}+L_{sceneflow}.$ (8)

In PV-RCNN[12], the detection loss is split into the region proposal loss $L_{rpn}$, the keypoint segmentation loss $L_{seg}$, and the proposal refinement loss $L_{rcnn}$. In our work, we fix the parameters of the RPN, so our detection loss is calculated as Equation 9.

$L_{det}=L_{seg}+L_{rcnn}.$ (9)

For the scene flow loss, we calculate an in-box mask $M_{i}^{(\mathrm{inbox})}$ for each point according to the MOT ground truth label. The scene flow loss is the root mean square error (RMSE) given in Equation 10; only points belonging to objects are considered.

$L_{sceneflow}=\sqrt{\frac{\sum_{i=1}^{N}M_{i}^{(\mathrm{inbox})}\left(d_{i}-\hat{d}_{i}\right)^{2}}{N_{(\mathrm{inbox})}}},$ (10)

where $d_{i}$ and $\hat{d}_{i}$ are the scene flow estimation result and the scene flow label for point $i$, and $N_{(\mathrm{inbox})}$ is the number of points inside object boxes.
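A direct reading of Equation 10 in code is sketched below; the small epsilon and the clamp on the in-box count are added only for numerical safety and are not part of the paper's formulation.

```python
import torch

def scene_flow_loss(pred_flow, gt_flow, inbox_mask, eps=1e-8):
    # pred_flow, gt_flow: (N, 3); inbox_mask: (N,) with 1 for points inside object boxes
    sq_err = ((pred_flow - gt_flow) ** 2).sum(dim=-1)   # per-point squared flow error
    n_inbox = inbox_mask.sum().clamp(min=1)             # number of in-box points
    return torch.sqrt((inbox_mask * sq_err).sum() / n_inbox + eps)
```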

III EXPERIMENTS RESULTS AND DISCUSSIONS

III-A Dataset and Evaluation Metrics

We train and evaluate our DetFlowTrack on the KITTI MOT dataset[15], which consists of 21 training sequences and 29 testing sequences. Each sequence provides a sequence of point clouds collected by LiDAR and a sequence of images collected by a color monocular camera. We split the training sequences into two parts: Seq-0000 to Seq-0015 for training and Seq-0016 to Seq-0020 for validation.

To improve sample diversity and avoid overfitting, we introduce a data augmentation approach for both object detection and scene flow estimation. First, we generate an object database from the KITTI Object dataset[15] by extracting the labels and the in-box point clouds of all ground truth targets. During training, several ground truth objects are randomly selected from this database and placed at the same random location in the scenes of two adjacent frames. This greatly increases both the number of objects and the diversity of object locations. After that, global flipping, rotation by a random angle, and scaling by a random ratio are applied. To increase the diversity of inter-frame object motion, a random translation is additionally applied to each object independently.
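As an illustration of why the two frames must be augmented consistently, the sketch below applies the same global flip, yaw rotation, and scaling to both point clouds and transforms the scene flow labels accordingly; the parameter ranges are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def global_augment(pc_t1, pc_t, flow_gt, rng=np.random.default_rng()):
    # pc_t1, pc_t: (N, 3) point clouds of frames t-1 and t; flow_gt: (N, 3) flow labels
    if rng.random() < 0.5:                            # random flip across the X-Z plane
        for arr in (pc_t1, pc_t, flow_gt):
            arr[:, 1] *= -1.0                         # flip Y of points and of flow
    yaw = rng.uniform(-np.pi / 4, np.pi / 4)          # random global yaw rotation
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.95, 1.05)                   # random global scaling
    pc_t1, pc_t = scale * pc_t1 @ R.T, scale * pc_t @ R.T
    flow_gt = scale * flow_gt @ R.T                   # flow is a displacement: rotate and scale only
    return pc_t1, pc_t, flow_gt
```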

We adopt HOTA[16] (Higher Order Tracking Accuracy) as the evaluation metric. HOTA can be decomposed into DetA (Detection Accuracy Score), AssA (Association Accuracy Score), and LocA (Localization Accuracy Score).

III-B Network Details

We fix the parameters of the RPN in the PV-RCNN backbone and train the detection network and the scene flow estimation network simultaneously. We use the Adam optimizer with the one-cycle learning rate policy[17]; the learning rate follows a cosine schedule with an upper bound of 0.01.

We train on an Intel Gold 6128 CPU (3.40 GHz) and 4 GeForce GTX 1080Ti GPUs with a batch size of 4. Our DetFlowTrack is implemented in PyTorch with CUDA 10.0.

III-C Ablation Study

An ablation study is designed to verify the effectiveness of the proposed data augmentation and the detection results' guidance in the scene flow module. The results are shown in Table I. For both the car and pedestrian categories, the proposed data augmentation and the detection results' guidance effectively improve MOT accuracy.

TABLE I: Ablation Study Results for Data Augmentation and Detection Results’ Guidance
Data Augmentation | Detection Results’ Guidance | HOTA\uparrow
Car Category
– | – | 75.89%
\checkmark | – | 77.43%
\checkmark | \checkmark | 78.80%
Pedestrian Category
– | – | 42.77%
\checkmark | – | 46.42%
\checkmark | \checkmark | 78.80%

III-D Quantitative Results

We tested the trained model on the KITTI MOT test set and submitted the results to the KITTI website for evaluation. The proposed algorithm is submitted under the name FlowDetTrack, and its results are compared with recent online MOT algorithms. Table II shows the evaluation results. The HOTA of the proposed method exceeds that of state-of-the-art methods with point cloud input.

TABLE II: Evaluation Results on the KITTI MOT Test Set. Bold font highlights the best results.
Car Category
Method HOTA\uparrow DetA\uparrow AssA\uparrow LocA\uparrow
Complexer-YOLO[18] 49.12% 62.44% 39.34% 81.47%
AB3DMOT[2] 68.61% 71.06% 69.06% 86.80%
PointTrackNet[11] 57.20% 55.71% 59.15% 80.07%
FlowDetTrack(ours) 71.52% 72.87% 70.89% 87.79%
Pedestrian Category
Complexer-YOLO[18] 14.08% 24.91% 8.15 % 68.64%
AB3DMOT[2] 35.57% 32.99% 38.58% 71.26%
FlowDetTrack(ours) 39.64% 40.90% 38.72% 72.04%

The robustness in the presence of extreme motion with rotation is tested on a hard dataset. The hard dataset was obtained by adding random translations in the range of 0-0.5 m and random rotations around the Z-axis in the range of 0-20 degrees to each target in the KITTI MOT validation set. The results for the car and pedestrian categories are shown in Table III. Compared with the state-of-the-art 3D MOT method AB3DMOT[2], the proposed method still maintains a high HOTA on the hard dataset.

TABLE III: Comparative Evaluation Results on the Hard Dataset.
HOTA\uparrow DetA\uparrow AssA\uparrow LocA\uparrow
Car Category
AB3DMOT[2] 43.34% 58.07% 32.51% 88.08%
FlowDetTrack(ours) 64.64% 67.00% 62.54% 88.59%
Pedestrian Category
AB3DMOT[2] 15.14% 23.02% 10.08% 80.20%
FlowDetTrack(ours) 32.21% 46.82% 22.28% 83.88%

III-E Qualitative Evaluation

Figure 4: The Visualization of MOT Results.
Figure 5: The Visualization of Aggregation Weight Calculation. The subimage (1) is the result of weight calculation with distance in geometric space. The subimage (2) is the result of weight calculation with distance in both geometric space and detection result space.
Figure 6: The Visualization of Scene Flow Label Calculation. The subimage (1) is the result of box-translation-based method. The subimage (2) is the result of box-transformation-based method.

Figure 4 shows the visualization of the MOT results. Trajectory boxes are projected onto the image, and boxes with the same tracking ID are painted in the same color.

Figure 5 visualizes the feature aggregation in the cost layer of the scene flow module. When only the distance in geometric space is considered, points that are close to the aggregation center but do not belong to the object are given a large aggregation weight, as indicated by the red circle in subimage (1). With the proposed detection-guidance feature aggregation, the aggregation weights of those background points become small enough to be ignored.

Figure 6 shows the comparative visualization of the scene flow label calculation methods when the target has rotational motion. We translate the points in frame $t-1$ with the calculated scene flow ground truth label; the result is shown as green points. With the box-translation-based method as in PointTrackNet[11], the green points indicated by the red circle do not properly match the points in frame $t$. With the proposed box-transformation-based method, the green points match the points in frame $t$ nicely. This demonstrates that the proposed box-transformation-based scene flow label calculation method obtains more accurate scene flow ground truth when the target rotates.

IV CONCLUSIONS

We proposed a 3D MOT framework that optimizes object detection and scene flow estimation simultaneously. A detection-guidance scene flow module is proposed to exploit the role of object detection in promoting scene flow estimation, and a box-transformation-based scene flow label calculation method is proposed to obtain more accurate scene flow labels when targets rotate. The experimental results demonstrate that our network achieves competitive results compared with the state-of-the-art and remains robust under extreme motion with rotation.

References

  • [1] H.-k. Chiu, A. Prioletti, J. Li, and J. Bohg, “Probabilistic 3d multi-object tracking for autonomous driving,” arXiv preprint arXiv:2001.05673, 2020.
  • [2] X. Weng, J. Wang, D. Held, and K. Kitani, “3d multi-object tracking: A baseline and new evaluation metrics,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10 359–10 366.
  • [3] A. Shenoi, M. Patel, J. Gwak, P. Goebel, A. Sadeghian, H. Rezatofighi, R. Martín-Martín, and S. Savarese, “Jrmot: A real-time 3d multi-object tracker and a new large-scale dataset,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10 335–10 342.
  • [4] G. Zhai, X. Kong, J. Cui, Y. Liu, and Z. Yang, “Flowmot: 3d multi-object tracking by scene flow association,” arXiv preprint arXiv:2012.07541, 2020.
  • [5] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, “Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6498–6507.
  • [6] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2758–2766.
  • [7] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
  • [8] W. Wu, Z. Wang, Z. Li, W. Liu, and L. Fuxin, “Pointpwc-net: A coarse-to-fine network for supervised and self-supervised scene flow estimation on 3d point clouds,” arXiv preprint arXiv:1911.12408, 2019.
  • [9] X. Liu, C. R. Qi, and L. J. Guibas, “Flownet3d: Learning scene flow in 3d point clouds,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 529–537.
  • [10] J. Zhang, S. Zhou, X. Chang, F. Wan, J. Wang, Y. Wu, and D. Huang, “Multiple object tracking by flowing and fusing,” arXiv preprint arXiv:2001.11180, 2020.
  • [11] S. Wang, Y. Sun, C. Liu, and M. Liu, “Pointtracknet: An end-to-end network for 3-d object detection and tracking from point clouds,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3206–3212, 2020.
  • [12] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10 526–10 535.
  • [13] W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep convolutional networks on 3d point clouds,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9613–9622.
  • [14] H. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [15] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [16] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “Hota: A higher order metric for evaluating multi-object tracking,” International journal of computer vision, vol. 129, no. 2, pp. 548–578, 2021.
  • [17] L. Smith and N. Topin, “Super-convergence: Very fast training of neural networks using large learning rates,” in Artificial intelligence and machine learning for multi-domain operations applications, vol. 11006.   International Society for Optics and Photonics, 2019, p. 1100612.
  • [18] M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. Michael Gross, “Complexer-yolo: Real-time 3d object detection and tracking on semantic point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.