SODFormer: Streaming Object Detection with Transformer Using Events and Frames
Abstract
The DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges (e.g., fast motion blur and low-light). However, how to effectively leverage rich temporal cues and fuse two heterogeneous visual streams remains a challenging endeavor. To address this challenge, we propose a novel streaming object detector with Transformer, namely SODFormer, which is the first to integrate events and frames to continuously detect objects in an asynchronous manner. Technically, we first build a large-scale multimodal neuromorphic object detection dataset (i.e., PKU-DAVIS-SOD) with more than 1080.1k manual labels. Then, we design a spatiotemporal Transformer architecture to detect objects via an end-to-end sequence prediction problem, where a novel temporal Transformer module leverages rich temporal cues from two visual streams to improve detection performance. Finally, an asynchronous attention-based fusion module is proposed to integrate the two heterogeneous sensing modalities and exploit the complementary advantages of each, so that the detector can be queried at any time to locate objects, breaking through the limited output frequency of synchronized frame-based fusion strategies. The results show that the proposed SODFormer outperforms four state-of-the-art methods and our eight baselines by a significant margin. We also show that our unifying framework works well even in cases where the conventional frame-based camera fails, e.g., high-speed motion and low-light conditions. Our dataset and code are available at https://github.com/dianzl/SODFormer.
Index Terms:
Neuromorphic Vision, Event Cameras, Object Detection, Transformer, Multimodal Fusion.
1 Introduction
Object detection [1, 2, 3], one of the most fundamental and challenging topics, supports a wide range of computer vision tasks, such as autonomous driving, intelligent surveillance, robot vision, etc. In fact, with conventional frame-based cameras, object detection performance [4, 5, 6] has suffered from a significant drop in some challenging conditions (e.g., high-speed motion blur, low-light, and overexposure). A key question still remains: How to utilize a novel sensing paradigm that makes up for the limitations of conventional cameras?
More recently, the Dynamic and Active-Pixel Vision Sensor [7, 8] (i.e., DAVIS), a multimodal vision sensor, has been designed in that spirit, combining a bio-inspired event camera and a conventional frame-based camera in the same pixel array. Its core is the novel event camera (i.e., DVS [9]), which works differently from frame-based cameras and reacts to light changes by triggering asynchronous events. Since the event camera offers high temporal resolution (on the order of microseconds) and a high dynamic range (HDR, up to 120 dB), it has brought a new perspective to addressing some limitations of conventional cameras in fast-motion and challenging-light scenarios. However, just as conventional frame-based cameras fail in extreme-light or high-speed motion blur scenarios, event cameras perform poorly in static or extremely slow-motion scenes. We refer to this phenomenon as unimodal degradation because it is mainly caused by the limitations of a single camera modality (see Fig. 1). While event cameras are insensitive in static or extremely slow-motion scenes, frame-based cameras conversely provide static fine textures (i.e., absolute brightness) directly. Indeed, event cameras and frame-based cameras are complementary, which has motivated novel computer vision tasks (e.g., feature tracking [10], depth estimation [11], and video reconstruction [12]) that take advantage of both. Therefore, we aim at making complementary use of asynchronous events and frames to maximize object detection accuracy.
One problem is that most existing event-based object detectors [13, 14, 15, 16] run feed-forward models independently and thus fail to utilize the rich temporal cues in continuous visual streams. While an isolated event image [17] may suffer from issues involving small objects, occlusion, and defocus, it is natural for the human visual system [18, 19] to identify objects along the temporal dimension and then associate them. In other words, detection performance can be further improved by leveraging rich temporal cues from event streams and adjacent frames. Although the field of computer vision has witnessed significant achievements of CNNs and RNNs, they show limited ability in modeling long-term temporal dependencies. Consequently, the emerging Transformer is gaining more and more attention in object detection tasks [20, 21]. Thanks to the self-attention mechanism, the Transformer [22, 23] is particularly suited to modeling long-term temporal dependencies in video sequence tasks [24, 25, 26]. However, there is still no Transformer-based object detector that exploits temporal cues from events and frames. Meanwhile, the rare multimodal neuromorphic object detection datasets (e.g., PKU-DDD17-CAR [14] and DAD [16]) only provide isolated images and synchronized events; that is, there is a lack of temporally long-term, continuous, and large-scale multimodal datasets including events, frames, and manual labels.
Another problem is that current fusion strategies (e.g., post-processing [13, 14], concatenation [27], and averaging [16, 28]) for events and frames are inadequate at exploiting the complementary information between the two streams and can only predict results synchronously at the frame rate, incurring bottlenecks in both performance improvement and inference frequency. Since asynchronous events are sparse points in the spatiotemporal domain with a higher temporal resolution than structured frames, existing frame-based multimodal detectors cannot directly process the two heterogeneous visual streams. As a result, some fusion strategies first split the continuous event stream into discrete image-like representations at the same sampling frequency as the frames and then integrate the two synchronized streams via post-processing or feature aggregation. In short, one drawback is that these fusion strategies cannot distinguish the degree of degradation across modalities and regions, and therefore fail to eliminate unimodal degradation thoroughly. The other drawback is that the inference frequency of these synchronized fusion strategies is limited by the sampling rate of conventional frames. Yet, there is no work that designs an effective asynchronous multimodal fusion module for events and frames in the object detection task.
To address the aforementioned problems, this paper proposes a novel streaming object detector with Transformer, namely SODFormer, which continuously detects objects in an asynchronous way by integrating events and frames. Note that this work does not aim to optimize Transformer-based object detectors (e.g., DETR [20]) on each isolated image. Instead, our goal is to overcome the following challenges: (i) Lack of dataset - How do we create a large-scale multimodal neuromorphic dataset for object detection? (ii) Temporal correlation - How do we design a Transformer-based object detector that leverages rich temporal cues? (iii) Asynchronous fusion - What is the unifying mechanism that takes advantage of both streams and achieves a high inference frequency? To this end, we first build a large-scale multimodal object detection dataset (i.e., PKU-DAVIS-SOD), which provides manual bounding boxes at a frequency of 25 Hz for 3 classes, yielding more than 1080.1k labels. Then, we propose a spatiotemporal Transformer that aggregates spatial information from continuous streams and outputs the detection results via an end-to-end sequence prediction problem. Finally, an asynchronous attention-based fusion module is designed to effectively integrate the two sensing modalities while eliminating unimodal degradation and breaking the limitation of the frame rate. The results show that our SODFormer outperforms the state-of-the-art methods and our two single-modality baselines by a large margin. We also verify the efficacy of our SODFormer in fast-motion and low-light scenarios.
To sum up, the main contributions of this work are summarized as follows:
• We propose a novel Transformer-based framework for streaming object detection (SODFormer), which is the first to integrate events and frames via Transformer to continuously detect objects in an asynchronous manner.
• We design an effective temporal Transformer module, which leverages rich temporal cues from two visual streams to improve object detection performance.
• We develop an asynchronous attention-based fusion module that exploits the complementary advantages of each modality, eliminates unimodal degradation, and overcomes the inference frequency limit imposed by the frame rate.
• We establish a large-scale and temporally long-term multimodal neuromorphic dataset (i.e., PKU-DAVIS-SOD), which opens an opportunity for research on this challenging problem.
The rest of the paper is organized as follows. Section 2 reviews prior works. We describe how to build a competitive dataset in Section 3. Section 4 introduces the camera's working principle and formulates the novel problem. Section 5 presents our solution for streaming object detection with Transformer. Finally, Section 6 analyzes the performance of the proposed method, some discussions are presented in Section 7, and conclusions are drawn in Section 8.
2 Related Work
This section first reviews neuromorphic object detection datasets (Section 2.1) and the corresponding object detectors (Section 2.2), followed by an overview of object detectors with Transformer (Section 2.3) and a survey of fusion approaches for events and frames (Section 2.4).
2.1 Neuromorphic Object Detection Datasets
Publicly available object detection datasets utilizing event cameras are limited in number [29, 30, 31, 32]. The Gen1 Detection dataset [33] and the 1Mpx Detection dataset [34] provide large-scale annotations for object detection using event cameras. However, it may be difficult to obtain fine textures and achieve high-precision object detection by only using DVS events. Although event-based simulators (e.g., ESIM [35], V2E [36], and RS [37]) can directly convert video object datasets to the neuromorphic domain, these conversion strategies fail to reflect realistic high-speed motion or extreme light scenarios, which are exactly what event cameras are good at. Besides, the PKU-DDD17-CAR dataset [14] and the DAD dataset [16] provide 3155 and 6247 isolated hybrid sequences, respectively. Nevertheless, these two small-scale datasets are not continuous streams that allow modeling temporal dependencies. Therefore, this work aims to establish a large-scale and temporally long-term multimodal object detection dataset involving challenging scenarios.
2.2 Neuromorphic Object Detection Approaches
The existing object detectors using event cameras can be roughly divided into two categories. The first category [38, 39, 40, 41, 42, 43, 44, 45, 46] comprises single-modality methods that only process asynchronous events. These methods usually first convert asynchronous events into 2D image-like representations [47] and then feed them into frame-based detectors (e.g., YOLOs [48]). Although utilizing only dynamic events can achieve satisfactory performance in some specific scenes, it has become clear that high-precision detection requires static fine textures.
The second category [49, 14, 13, 6, 15, 16, 27, 50, 51] comprises multi-modality methods that combine multiple visual streams. For example, some works [49, 14, 13] first detect objects on each isolated frame or event temporal bin, and then merge the detection results of the two modalities by post-processing (e.g., NMS [52]). Besides, a grafting algorithm [6] is proposed to integrate events and thermal images. Some attention-based aggregation operations [15, 27] are designed to fuse each isolated frame and event temporal bin from the PKU-DDD17-CAR [14] dataset, while others [16, 28] average the features from each modality. However, these joint frameworks have not explored the rich temporal cues in continuous visual streams. Moreover, post-processing and aggregation fusion operations struggle to capture global context across the two sensing modalities, while the averaging strategy cannot eliminate unimodal degradation thoroughly. Thus, we design a streaming object detector that leverages rich temporal cues and makes fully complementary use of the two visual streams.
2.3 Object Detection with Transformer
The Transformer, an attention-based architecture, was first introduced by [22] for machine translation. Its core attention mechanism [53] scans through each element of a sequence and updates it by aggregating information from the whole sequence with different attention weights. This global computation makes Transformers perform better than RNNs on long sequences. By now, Transformers have been migrated to computer vision and have replaced RNNs in many problems. In video object detection, while RNNs [54, 55, 56] have achieved great success, they require meticulously hand-designed components (e.g., anchor generation). To simplify these pipelines, DETR [20], an end-to-end Transformer-based object detector, was proposed. It is worth mentioning that DETR [20] is the first Transformer-based object detector formulated as a sequence prediction model. While DETR largely simplifies the classical CNN-based detection paradigm, it suffers from very slow convergence and relatively low performance at detecting small objects. More recently, a few approaches have attempted to design optimized architectures to help detect small objects and speed up training. For instance, Deformable DETR [21] adopts a multi-scale deformable attention module that achieves better performance than DETR with 10× fewer training epochs. PnP-DETR [57] significantly reduces spatial redundancy and achieves more efficient computation via a two-step poll-and-pool sampling module. However, most works operate on each isolated image and do not exploit temporal cues from continuous visual streams. Inspired by the ability of the Transformer to model long-term dependencies in video sequence tasks [24, 25, 26], we propose a temporal Transformer model to leverage rich temporal cues from continuous events and adjacent frames.
2.4 Multimodal Fusion for Events and Frames
Some computer vision tasks (e.g., video reconstruction [12, 58, 59, 60], object detection [14, 13], object tracking [61, 62], depth estimation [11], and SLAM [63, 64, 65]) have sought to integrate the two complementary modalities of events and frames. For example, JDF [14] adopts the Dempster-Shafer theory to fuse events and frames for object detection. A recurrent asynchronous multimodal network [11] is introduced to fuse events and frames for depth estimation. In fact, most fusion strategies (e.g., post-processing, concatenation, and averaging) cannot distinguish the degradation degree of different modalities and regions, and thus fail to eliminate unimodal degradation completely (see Section 2.2). More importantly, most fusion strategies are synchronized with the event and frame streams, so the joint output frequency is limited by the sampling rate of conventional frames. Obviously, this limited inference frequency cannot meet the needs of fast object detection in real-time high-speed scenarios, so we seek to exploit the high temporal resolution of event cameras while making joint use of events and frames. RAMNet [11] utilizes an RNN to realize an asynchronous fusion method that allows decoding the task variable at any time. Nevertheless, to the best of our knowledge, there are no prior attention-based works involving an asynchronous fusion strategy for events and frames in the object detection task. Thus, we design a novel asynchronous attention-based fusion module that eliminates unimodal degradation in the two modalities with a high inference frequency.
3 PKU-DAVIS-SOD Dataset
This section first presents the details of how to build our dataset (Section 3.1). Then, we give detailed statistics to better understand the newly built dataset (Section 3.2). Finally, we make a comparison of related datasets (Section 3.3).
3.1 Data Collection and Annotation
The goal of this dataset is to offer a dedicated platform for the training and evaluation of streaming object detection algorithms. Thus, a novel multimodal vision sensor (i.e., DAVIS346, resolution 346×260) is used to record multiple hybrid sequences in a driving car (see Fig. 2). The DAVIS346 simultaneously outputs events with high temporal resolution and conventional frames at 25 FPS. Mostly, we fix the DAVIS346 in a driving car (see Fig. 2(a)) and record sequences while the car is driving on city roads. Nevertheless, it is difficult to capture high-speed motion blur owing to the small relative speed between vehicles on city roads. To conveniently acquire high-speed objects, we additionally provide some sequences in which the camera is set at the flank of the road. The raw recordings consider velocity distribution, light condition, category diversity, object scale, etc. To provide manual bounding boxes in challenging scenarios (e.g., high-speed motion blur and low-light), grayscale images are reconstructed from asynchronous events using E2VID [66] at 25 FPS when the RGB frames are of low quality. Even when objects are not clearly visible in any of the three modalities (i.e., RGB frames, event images, and reconstructed frames), they can still be labeled because our PKU-DAVIS-SOD dataset provides continuous visual streams, so information about unclear objects can be obtained from nearby images; some objects that are unclear in a single image become visible over a stretch of video. After the temporal calibration, we first select three common and important object classes (i.e., car, pedestrian, and two-wheeler) in our daily life. Then, all bounding boxes are annotated from RGB frames or synchronized reconstructed images by a well-trained professional team.
TABLE I: Number of labeled frames and bounding boxes per scenario in each subset of PKU-DAVIS-SOD.

| Name | Normal frames | Normal boxes | Motion blur frames | Motion blur boxes | Low-light frames | Low-light boxes |
|---|---|---|---|---|---|---|
| Training | 115104 | 528944 | 31587 | 87056 | 22826 | 55343 |
| Validation | 31786 | 155778 | 9763 | 24828 | 8892 | 14143 |
| Test | 35459 | 150492 | 11859 | 44757 | 8729 | 18807 |
| Total | 182349 | 835214 | 53209 | 156641 | 40447 | 88293 |
3.2 Dataset Statistics
The PKU-DAVIS-SOD dataset provides 220 hybrid driving sequences with labels at a frequency of 25 Hz. As a result, this dataset contains 276k labeled timestamps (i.e., labeled frames) and 1080.1k bounding boxes in total. We split them into 671.3k labels for training, 194.7k for validation, and 214.1k for testing. Table I shows the number of labeled frames and bounding boxes in each set. Besides, we further analyze the attributes of the newly built dataset from the following four perspectives.
Category Diversity. Fig. 3(a) displays the number distributions of the three types of labeled objects (i.e., car, pedestrian, and two-wheeler) in our PKU-DAVIS-SOD dataset. The numbers of cars, pedestrians, and two-wheelers in each set (i.e., training, validation, and testing) are (580340, 162894, 193118), (34744, 5800, 5680), and (67599, 28400, 19144), respectively. Intuitively, the corresponding ratios are approximately 3.5 : 1 : 1.2, 6 : 1 : 1, and 3.5 : 1.5 : 1.
Object Scale. Fig. 3(b) shows the proportions of object scales in our PKU-DAVIS-SOD dataset. The object height is used to reflect its scale since the height and the scale are closely related in driving scenes [67]. We broadly classify all objects into three scales (i.e., large, medium, and small): small objects are less than 20 pixels in height, the medium scale ranges from 20 to 80 pixels, and the large scale refers to objects taller than 80 pixels.
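For reference, the following minimal Python sketch (the function name and return labels are our own illustration, not part of the dataset toolkit) maps an object's pixel height to the three scale bins defined above.

```python
def scale_bin(box_height_px: float) -> str:
    """Map an object's pixel height to the scale bins used in the
    PKU-DAVIS-SOD statistics: small (< 20 px), medium (20-80 px), large (> 80 px)."""
    if box_height_px < 20:
        return "small"
    if box_height_px <= 80:
        return "medium"
    return "large"
```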
Velocity Distribution. As illustrated in Fig. 3(c), we divide the motion speed into two levels: normal-speed and high-speed. The event camera, offering high temporal resolution, can capture high-speed moving objects, such as a rushing car. In contrast, RGB frames may suffer from motion blur due to the limited frame rate. When dividing the dataset, we use the video as the basic unit: if a sequence contains many motion blur scenarios, it is classified as high-speed; otherwise, it is classified as normal-speed. Fig. 3(c) shows that the proportion of high-speed scenarios is 13%, and the remaining 87% involves normal-speed objects.
Light Intensity. Since the raw visual streams are recorded from daytime to night, they cover a wide range of illumination. The sequences belonging to the low-light scenario are generally collected in low-light conditions (night, rainy days, tunnels, etc.), and their RGB frames consist mainly of low-light scenes. Thus, we can easily judge the light condition by viewing the RGB frames with our eyes. Fig. 3(d) illustrates that the proportions of normal light and low-light are 92% and 8%, respectively.
For better visualization, we show representative samples (see Fig. 4) including RGB frames, event images, and DVS reconstructed images, which cover the category diversity, object scale, velocity, and light change.
TABLE II: Comparison with related object detection datasets using event cameras.

| Dataset | Year | Venue | Resolution | Modality | Classes | Boxes | Label | Frequency | High-speed | Low-light |
|---|---|---|---|---|---|---|---|---|---|---|
| PKU-DDD17-CAR [14] | 2019 | ICME | 346×260 | Events, Frames | 1 | 3155 | Manual | 1 Hz | ✗ | ✓ |
| TUM-Pedestrian [13] | 2019 | ICRA | 240×180 | Events, Frames | 1 | 9203 | Manual | 1 Hz | ✗ | ✗ |
| Pedestrian Detection [41] | 2019 | FNR | 240×180 | Events | 1 | 28109 | Manual | 1 Hz | ✗ | ✗ |
| Gen1 Detection [33] | 2020 | arXiv | 304×240 | Events | 2 | 255k | Pseudo | 1, 4 Hz | ✗ | ✓ |
| 1 Mpx Detection [34] | 2020 | NIPS | 1280×720 | Events | 3 | 25M | Pseudo | 60 Hz | ✗ | ✓ |
| DAD [16] | 2021 | ICIP | 346×260 | Events, Frames | 1 | 6427 | Manual | 1 Hz | ✗ | ✓ |
| PKU-DAVIS-SOD | 2022 | Ours | 346×260 | Events, Frames | 3 | 1080.1k | Manual | 25 Hz | ✓ | ✓ |
3.3 Comparison with Other Datasets
To clarify the advancements of the newly built dataset, we compare it with related object detection datasets using event cameras in Table II. Apparently, our PKU-DAVIS-SOD dataset is the first large-scale, open-source multimodal neuromorphic object detection dataset (https://www.pkuml.org/research/pku-davis-sod-dataset.html). In contrast, the two publicly available large-scale datasets (i.e., Gen1 Detection [33] and 1 Mpx Detection [34]) only provide temporally long-term event streams, so it is difficult for them to support high-precision object detection, especially in static or extremely slow-motion scenarios. What's more, the TUM-Pedestrian dataset [13] and the Pedestrian Detection dataset [41] provide event streams for pedestrian detection, but they have not yet been made publicly available. The DAD dataset [16], like the PKU-DDD17-CAR dataset [14], offers isolated frames and event temporal bins rather than temporally continuous visual streams.
All in all, the novel bio-inspired multimodal camera and the labor-intensive professional annotation make our PKU-DAVIS-SOD a competitive dataset with multiple characteristics: (i) high temporal sampling resolution, with 12 Meps from the event streams; (ii) high dynamic range (120 dB) from the event streams; (iii) two temporally long-term visual streams with labels at 25 Hz; (iv) real-world scenarios with abundant diversity in object category, object scale, moving velocity, and light change.
4 Preliminary and Problem Definition
This section first presents the DAVIS camera's working principle (Section 4.1). Then, we formulate the definition of streaming object detection (Section 4.2).
4.1 DAVIS Sampling Principle
DAVIS is a multimodal vision sensor that simultaneously outputs two complementary modalities of asynchronous events and frames under the same pixel array. Due to the complementarity of events and frames, the output of the DAVIS camera contains more information, making it naturally surpass unimodal sensors such as conventional RGB cameras and event cameras. In particular, the core event camera [68], namely DVS, has independent pixels that react to changes in light intensity with a stream of events. More specifically, an event $e_k = (x_k, y_k, t_k, p_k)$ is a four-attribute tuple in the address-event representation (AER) [69], triggered at the pixel $(x_k, y_k)$ at the timestamp $t_k$ when the log-intensity change exceeds the pre-defined threshold $C$. This process can be depicted as:

$$\Delta L(x_k, y_k, t_k) = L(x_k, y_k, t_k) - L(x_k, y_k, t_k - \Delta t_k) = p_k C, \tag{1}$$

where $\Delta t_k$ is the temporal sampling interval at the pixel $(x_k, y_k)$, and the polarity $p_k \in \{-1, +1\}$ refers to whether the brightness is decreasing or increasing.
Intuitively, asynchronous events appear as sparse and discrete points [70] in the spatiotemporal domain, which can be described as follows:

$$\mathcal{E}(x, y, t) = \sum_{k=1}^{N_e} p_k\, \delta(x - x_k, y - y_k)\, \delta(t - t_k), \tag{2}$$

where $N_e$ is the event number during the spatiotemporal window and $\delta(\cdot)$ refers to the Dirac delta function, with $\delta(a) = 1$ if $a = 0$ and $\delta(a) = 0$ otherwise.
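To make the sampling principle concrete, the sketch below simulates event triggering per Eq. (1); it is an illustrative model under our own assumptions (dense frame differencing against a per-pixel reference), not the actual asynchronous DAVIS hardware pipeline.

```python
import numpy as np

def trigger_events(log_ref, log_curr, t, C=0.2):
    """Emit (x, y, t, p) tuples at pixels where the log-intensity change since the
    last event exceeds the contrast threshold C, following Eq. (1)."""
    diff = log_curr - log_ref                             # Delta L at each pixel
    ys, xs = np.nonzero(np.abs(diff) >= C)                # pixels crossing the threshold
    pols = np.sign(diff[ys, xs]).astype(np.int8)          # +1: brighter, -1: darker
    events = [(int(x), int(y), t, int(p)) for x, y, p in zip(xs, ys, pols)]
    new_ref = log_ref.copy()
    new_ref[ys, xs] = log_curr[ys, xs]                    # reset reference only where events fired
    return events, new_ref
```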
4.2 Streaming Object Detection
Let $\mathcal{E} = \{e_k\}_{k=1}^{N_e}$ and $\mathcal{I} = \{I_j\}_{j=1}^{N_f}$ be the two complementary modalities of an asynchronous event stream and a frame sequence from a DAVIS camera. In general, to make the asynchronous events compatible with deep learning methods, the continuous event stream needs to be divided into event temporal bins $\{E_i\}$. For the current timestamp $t_c$, the spatiotemporal location information of objects (i.e., bounding boxes) can be calculated using the $m$ adjacent frames $\{I_j\}_{j=1}^{m}$ and the $n$ temporal bins $\{E_i\}_{i=1}^{n}$ preceding $t_c$, which can be formulated as:

$$B_{t_c} = \mathcal{F}\big(\{I_j\}_{j=1}^{m},\ \{E_i\}_{i=1}^{n}\big), \tag{3}$$

where $B_{t_c} = \{b_1, \dots, b_q\}$ is a list of bounding boxes at the timestamp $t_c$. More specifically, each box $b = (x, y, w, h, c, s)$, where $w$ and $h$ are the width and the height of the bounding box, $(x, y)$ are the corresponding upper-left coordinates, and $c$ and $s$ are the predicted class and the confidence score of the bounding box, respectively. The function $\mathcal{F}(\cdot)$ is the proposed streaming object detector, which can leverage rich temporal cues from adjacent frames and temporal bins. The parameters $m$ and $n$ determine the length of the mined temporal information and affect the fusion strategy between the two heterogeneous visual streams.
Given the ground truth $\hat{B}_{t_c}$, we aim at making the output $B_{t_c}$ of the optimized detector fit $\hat{B}_{t_c}$ as closely as possible, which can be formulated as the following minimization problem:

$$\mathcal{F}^{*} = \arg\min_{\mathcal{F}}\ \mathcal{L}\big(B_{t_c}, \hat{B}_{t_c}\big), \tag{4}$$

where $\mathcal{L}(\cdot)$ is the designed loss function of the proposed streaming object detector.
Note that this novel streaming object detector using events and frames works differently from feed-forward object detection frameworks (e.g., YOLOs [48] and DETR [20]). It has two unique properties, which exactly account for the nomenclature of "streaming": (i) objects can be accurately detected by leveraging rich temporal cues; (ii) event streams offering high temporal resolution can meet the need of detecting objects at any time within continuous visual streams and overcome the limited inference frequency of conventional frames. Consequently, following similar works using point clouds [71, 72] and video streams [73], we refer to our proposed detector as a streaming object detector.
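At the interface level, Eq. (3) amounts to the call pattern sketched below (names and types are our own illustration; the actual detector operates on features rather than raw inputs and is described in Section 5).

```python
from typing import Callable, List, Sequence, Tuple

# (x, y, w, h, class_id, score) for one detected object at the queried timestamp.
Box = Tuple[float, float, float, float, int, float]

def streaming_detect(detector: Callable[[Sequence, Sequence], List[Box]],
                     frames: Sequence, event_bins: Sequence,
                     m: int, n: int) -> List[Box]:
    """Eq. (3): predict boxes at the current timestamp from the m most recent frames
    and the n most recent event temporal bins; the query can fall between frames."""
    return detector(frames[-m:], event_bins[-n:])
```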
5 Methodology
In this section, we first give an overview of our approach (Section 5.1). Besides, we briefly revisit the event representation (Section 5.2) and the spatial Transformer (Section 5.3). Then, we present the details of how to exploit temporal cues from continuous visual streams (Section 5.4). Finally, we design an asynchronous attention-based fusion strategy for events and frames (Section 5.5).
5.1 Network Overview
This work aims at designing a novel streaming object detector with Transformer, termed SODFormer, which continuously detects objects in an asynchronous way by integrating events and frames. As illustrated in Fig. 5, our framework consists of four modules: event representation, spatial Transformer, temporal Transformer, and the asynchronous attention-based fusion module. More precisely, to make asynchronous events compatible with deep learning methods, the continuous event stream is first divided into event temporal bins $\{E_i\}$, and each bin is converted into a 2D image-like representation (i.e., event embeddings [47]). Then, the spatial Transformer adopts the main structure of Deformable DETR [21] to extract feature representations from each sensing modality, which involves a feature extraction backbone (e.g., ResNet50 [74]) and the spatial Transformer encoder (STE). Besides, the proposed temporal Transformer contains the temporal deformable Transformer encoder (TDTE) and the temporal deformable Transformer decoder (TDTD). The TDTE aggregates the outputs of the STE along the temporal dimension, which improves the accuracy of object detection by leveraging rich temporal information. Meanwhile, the proposed asynchronous attention-based fusion module exploits the attention mechanism to generate a fused feature, which can eliminate unimodal degradation in the two modalities with a high inference frequency. Finally, the TDTD integrates object queries and the fused feature to predict the bounding boxes $B_{t_c}$.
5.2 Event Representation
When using deep learning methods to process asynchronous events, the spatiotemporal point sets are usually converted into successive measurements. In general, a kernel function $k(\cdot)$ [47] is used to map an event temporal bin $E_i$ into an event embedding $V_i$, which should ideally exploit the spatiotemporal information of the asynchronous events. As a result, we can formulate this mapping as follows:

$$V_i = k(E_i), \tag{5}$$

where the designed kernel function $k(\cdot)$ can be a deep neural network or a handcrafted operation.
In this work, we attempt to encode the event temporal bin into three existing event representations (i.e., event images [17], voxel grids [75], and the sigmoid representation [39]), considering the accuracy-speed trade-off. Actually, any event representation can be an alternative because our SODFormer provides a generic interface accepting various input types for the event-based object detector.
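As an illustration of Eq. (5), a minimal histogram-style kernel is sketched below (our own simplified rendering of the event-image representation [17]; variable names are assumptions), mapping one temporal bin of (x, y, t, p) tuples to a two-channel count image.

```python
import numpy as np

def event_image(event_bin, height, width):
    """Simplified kernel k(.) from Eq. (5): accumulate one event temporal bin E_i
    into a 2-channel image counting positive and negative events per pixel."""
    rep = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in event_bin:
        channel = 0 if p > 0 else 1
        rep[channel, y, x] += 1.0
    return rep
```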
5.3 Revisiting Spatial Transformer
In DETR [20], the multi-head self-attention (MSA) combines multiple attention heads in parallel to increase the feature diversity. To be concrete, the feature maps are first extracted via a CNN-based backbone, and the query and key-value elements are derived from these feature maps. Let $q \in \Omega_q$ index a query element with representation feature $z_q$, and $k \in \Omega_k$ index a key-value element with representation feature $x_k$. The MSA operation integrates the outputs of $M$ attention heads as:

$$\mathrm{MSA}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W_m' x_k \Big], \tag{6}$$

where $m$ indexes the single attention head, and $W_m$ and $W_m'$ are learnable weights. $A_{mqk} \propto \exp\!\big(\tfrac{z_q^{\top} U_m^{\top} V_m x_k}{\sqrt{d}}\big)$ is the attention weight, normalized such that $\sum_{k \in \Omega_k} A_{mqk} = 1$, in which $U_m$ and $V_m$ are also learnable weights and $\sqrt{d}$ is a scaling factor.
To achieve fast convergence and high computational efficiency, Deformable DETR [21] designs a deformable multi-head self-attention (DMSA) operation that attends to a small set of local sampling points instead of all pixels in the feature map $x$. Specifically, DMSA first determines the corresponding location of each query in the feature maps, which we refer to as the reference point, and then adds learnable sampling offsets to the reference point to obtain the locations of the sampling points. It can be described as:

$$\mathrm{DMSA}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W_m'\, x(p_q + \Delta p_{mqk}) \Big], \tag{7}$$

where $p_q$ is a 2D reference point, $\Delta p_{mqk}$ denotes the sampling offset relative to $p_q$, and $A_{mqk}$ is the learnable attention weight of the $k$-th sampling point in the $m$-th self-attention head.
In this study, we adopt the encoder of the Deformable DETR as our spatial Transformer encoder (STE) to extract features from each visual stream. Meanwhile, we further extend the spatial DMSA strategy to the temporal domain (Section 5.4), which leverages rich temporal cues from continuous event stream and adjacent frames.
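For concreteness, the sketch below is a simplified, single-scale PyTorch re-implementation of the deformable attention in Eq. (7), written under our own assumptions (bilinear sampling via `grid_sample`, reference points normalized to [0, 1], one output projection shared across heads); the official Deformable DETR implementation uses a custom CUDA kernel and multi-scale feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformAttn(nn.Module):
    """Single-scale sketch of Eq. (7): each query attends to K sampled points per head
    around its reference point, with offsets and weights predicted from the query."""
    def __init__(self, dim=256, heads=8, points=4):
        super().__init__()
        self.heads, self.points, self.head_dim = heads, points, dim // heads
        self.offsets = nn.Linear(dim, heads * points * 2)    # Delta p_{mqk}
        self.weights = nn.Linear(dim, heads * points)        # A_{mqk}, softmax over k
        self.value_proj = nn.Linear(dim, dim)                 # W'_m (all heads stacked)
        self.out_proj = nn.Linear(dim, dim)                   # W_m  (all heads stacked)

    def forward(self, query, ref_points, value):
        # query: (N, Lq, C); ref_points: (N, Lq, 2) in [0, 1]; value: (N, C, H, W)
        N, Lq, _ = query.shape
        H, W = value.shape[-2:]
        v = self.value_proj(value.flatten(2).transpose(1, 2))          # (N, H*W, C)
        v = v.transpose(1, 2).reshape(N * self.heads, self.head_dim, H, W)
        offsets = self.offsets(query).view(N, Lq, self.heads, self.points, 2)
        attn = self.weights(query).view(N, Lq, self.heads, self.points).softmax(-1)
        # Sampling locations, converted to the [-1, 1] grid convention of grid_sample.
        loc = (ref_points[:, :, None, None, :] + offsets) * 2.0 - 1.0  # (N, Lq, M, K, 2)
        loc = loc.permute(0, 2, 1, 3, 4).reshape(N * self.heads, Lq, self.points, 2)
        sampled = F.grid_sample(v, loc, align_corners=False)           # (N*M, hd, Lq, K)
        attn = attn.permute(0, 2, 1, 3).reshape(N * self.heads, 1, Lq, self.points)
        out = (sampled * attn).sum(-1)                                  # weighted sum over K
        out = out.view(N, self.heads * self.head_dim, Lq).transpose(1, 2)
        return self.out_proj(out)                                       # (N, Lq, C)
```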
5.4 Temporal Transformer
The temporal Transformer aggregates multiple spatial feature maps from the spatial Transformer and generates the predicted bounding boxes. It includes two main modules: temporal deformable Transformer encoder (TDTE) (Section 5.4.1) and temporal deformable Transformer decoder (TDTD) (Section 5.4.2).
5.4.1 Temporal Deformable Transformer Encoder
Our TDTE aims at aggregating multiple spatial feature maps from the spatial Transformer encoder (STE) using the temporal deformable multi-head self-attention (TDMSA) operation and then generating a spatiotemporal representation via the designed multiple stacked blocks.
The core idea of the proposed TDMSA operation is that the temporal deformable attention only attends to local sampling points in the spatial domain while aggregating the sampling points from all feature maps in the temporal domain. Obviously, TDMSA directly extends DMSA [21] from a single feature map to multiple feature maps $\{x^t\}_{t=1}^{T}$ in the temporal domain (see Fig. 6). Meanwhile, TDMSA has a total of $M$ attention heads for each temporal deformable attention operation, and it can be expressed as:

$$\mathrm{TDMSA}\big(z_q, p_q, \{x^t\}_{t=1}^{T}\big) = \sum_{m=1}^{M} W_m \Big[ \sum_{t=1}^{T} \sum_{k=1}^{K} A_{mtqk} \cdot W_m'\, x^t\big(p_q^t + \Delta p_{mtqk}\big) \Big], \tag{8}$$

where $p_q^t$ is the 2D reference point in the feature map $x^t$, and $A_{mtqk}$ and $\Delta p_{mtqk}$ are the learnable attention weight and the sampling offset of the $k$-th sampling point from the $t$-th feature map and the $m$-th attention head, respectively.
TDTE includes multiple consecutively stacked blocks (see Fig. 5). Each block performs MSA, TDMSA, and feed-forward network (FFN) operations, and each operation is followed by dropout [76] and layer normalization [77]. Specifically, at the timestamp $t_c$, the TDTE first takes the current feature map $x^{t_c}$ as both the query and the key-value of the MSA. Then, TDMSA integrates the current feature map and the adjacent feature maps to leverage rich temporal cues. Finally, the FFN outputs a spatiotemporal representation, shown as $z^f$ or $z^e$ in Fig. 5. Thus, our TDTE can be formulated as:

$$\begin{aligned}
y^{(1)} &= \mathrm{MSA}\big(x^{t_c}, x^{t_c}\big), \\
y^{(2)} &= \mathrm{TDMSA}\big(y^{(1)}, p_q, \{x^t\}_{t=1}^{T}\big), \\
z &= \mathrm{FFN}\big(y^{(2)}\big),
\end{aligned} \tag{9}$$

where $y^{(1)}$ and $y^{(2)}$ are the outputs of the MSA and TDMSA operations in our TDTE, respectively. Note that, although the frame flow and the event flow are processed in the same way in this step, two TDTEs that do not share weights process the frame flow and the event flow separately.
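The sketch below shows how one TDTE block chains the MSA, TDMSA, and FFN of Eq. (9), each followed by dropout and layer normalization; it is our own schematic composition with standard residual connections assumed, and `tdmsa` is a placeholder module expected to implement Eq. (8) with the interface shown.

```python
import torch.nn as nn

class TDTEBlock(nn.Module):
    """One encoder block of the temporal Transformer (Eq. (9)): MSA on the current
    feature map, TDMSA over the T adjacent feature maps, then an FFN, with dropout
    and layer normalization (and assumed residual connections) after each operation."""
    def __init__(self, dim=256, heads=8, ffn_dim=1024, dropout=0.1, tdmsa=None):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.tdmsa = tdmsa   # temporal deformable attention module (Eq. (8)), assumed
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.drop = nn.Dropout(dropout)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x_t, ref_points, history):
        # x_t: (N, H*W, C) current feature map; history: the T feature maps it attends to.
        y1, _ = self.msa(x_t, x_t, x_t)                  # query = key-value = x_t
        y1 = self.norm1(x_t + self.drop(y1))
        y2 = self.tdmsa(y1, ref_points, history)         # aggregate rich temporal cues
        y2 = self.norm2(y1 + self.drop(y2))
        return self.norm3(y2 + self.drop(self.ffn(y2)))  # spatiotemporal representation z
```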
5.4.2 Temporal Deformable Transformer Decoder
The goal of our TDTD is to output the final result via a sequence prediction problem. The inputs of the TDTD are the fused feature maps from our fusion module (Section 5.5) and the object queries. The object queries are a small fixed number (denoted as $N_q$) of learnable positional embeddings, and they serve as the "query" part of the input to the TDTD. Each stacked block in the decoder consists of three operations (i.e., MSA, DMSA, and FFN). In the MSA, object queries interact with each other in parallel via multi-head self-attention at each decoder layer, where the query elements and key-value elements are both the object queries. As illustrated in Fig. 5, the DMSA [21] transforms the query flow and the fused features from the two streams into output embeddings. Ideally, each object query learns to focus on a specific region in the image and extracts region features for the final detection, similar to the anchors in Faster R-CNN [56] but simpler and more flexible. Finally, the FFN independently processes each output embedding to predict the bounding boxes $B_{t_c}$. Notably, $N_q$ is typically greater than the actual number of objects in the current scene, so each predicted bounding box can either be a detection or "no object" (labeled as $\varnothing$), as in DETR [20]. In particular, our TDTD achieves more efficient computation and faster convergence by replacing the MSA of [20] with DMSA.
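To make the final prediction step concrete, a minimal DETR-style head is sketched below (our own illustration; the hidden sizes and names are assumptions): each of the $N_q$ decoder output embeddings is independently mapped to class logits, including a "no object" slot, and a normalized box.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """DETR-style prediction heads applied independently to each of the N_q decoder
    output embeddings: class logits with an extra 'no object' class, and a small MLP
    regressing a normalized box."""
    def __init__(self, dim=256, num_classes=3):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)        # +1 for the 'no object' slot
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4))

    def forward(self, decoder_out):
        # decoder_out: (N, N_q, C) output embeddings from the TDTD
        logits = self.cls_head(decoder_out)                     # (N, N_q, num_classes + 1)
        boxes = self.box_head(decoder_out).sigmoid()            # (N, N_q, 4), normalized to [0, 1]
        return logits, boxes
```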
5.5 Asynchronous Attention-Based Fusion Module
To integrate the two heterogeneous visual streams, most existing fusion strategies [49, 13, 14, 15, 16] first split asynchronous events into discrete temporal bins synchronized with the frames, and then fuse the two streams via post-processing or feature aggregation (e.g., averaging and concatenation). However, these synchronized fusion strategies have two limitations: (i) the post-processing or feature aggregation operations cannot distinguish the degradation degree of different modalities and regions, which makes it difficult to eliminate unimodal degradation thoroughly and thus creates a bottleneck for performance improvement; (ii) the joint output frequency is limited by the sampling rate of conventional frames (e.g., 25 Hz), which fails to meet the needs of fast object detection in real-time high-speed scenarios. Therefore, we design an asynchronous attention-based fusion module to overcome these two limitations.
One key question for the two heterogeneous visual streams is how to achieve a high joint output frequency on demand for a given task. In general, the high temporal resolution event stream can be split into discrete event temporal bins, whose frequency can be flexibly adjusted to be much higher than the conventional frame rate in real-time high-speed scenarios. As shown in Fig. 7, the event flow has more temporal sampling timestamps than the frame flow in an asynchronous manner. For an event sampling timestamp $t_e$, let $z^{e}$ be the output of the FFN in the TDTE. The proposed fusion strategy first searches for the most recent sampling timestamp in the frame flow, which can be described as:

$$t_f^{*} = \arg\min_{t_f \le t_e}\ (t_e - t_f), \tag{10}$$

where $\Delta t = t_e - t_f^{*}$ is the temporal difference between the event sampling timestamp and the frame sampling timestamp.
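A minimal sketch of this lookup is given below (illustrative only; frame timestamps are assumed to be sorted and at least one frame is assumed to precede the query): for an event-bin timestamp it returns the most recent frame timestamp and the temporal difference $\Delta t$ of Eq. (10).

```python
import numpy as np

def most_recent_frame(frame_timestamps, t_event):
    """Find the latest frame timestamp not later than the event-bin timestamp t_event
    (Eq. (10)), together with the temporal difference Delta t >= 0."""
    ts = np.asarray(frame_timestamps)                  # sorted frame timestamps
    idx = np.searchsorted(ts, t_event, side="right") - 1
    assert idx >= 0, "no frame precedes the queried event timestamp"
    return ts[idx], t_event - ts[idx]
```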
Another key question for the two complementary data streams is how to tackle the unimodal degradation that feature aggregation operations (e.g., averaging) fail to overcome, and how to implement pixel-wise adaptive fusion for better performance. Unlike the previous weight mask computed by a multi-layer perceptron (MLP) [78], we compute the pixel-wise weight mask using the attention mechanism, which simplifies our model and endows it with stronger interpretability. Our fusion strategy starts from two intuitive assumptions: (i) the stronger the association between two pixels, the greater the attention weight between them; and (ii) the pixels in degraded regions are supposed to have a weaker association with other pixels. Combining these two points, we conclude that the sum of the attention weights between a pixel and all other pixels represents the degradation degree of that pixel to some extent, and thus can serve as the weight for that pixel during fusion. Next, we take the event flow as an example to illustrate how our attention weight mask module (see Fig. 5) obtains the fusion weights. To implement this, we first remove the softmax operation in the standard self-attention to calculate the attention weights between each pair of pixels, and then accumulate the attention weights of each pixel into a mask $M^{e}$, which can be summarized as follows:

$$M^{e}(i) = \sum_{j=1}^{HW} \frac{\big(z^{e} W_Q\big)_i \big(z^{e} W_K\big)_j^{\top}}{\sqrt{d}}, \quad i = 1, \dots, HW, \tag{11}$$

where $M^{e}$ is the weight mask with the width $W$ and the height $H$ of the feature map for the event flow, $W_Q$ and $W_K$ are learnable matrix parameters, and $\sqrt{d}$ is the scaling factor in the self-attention [22]. As shown in Fig. 7, exactly the same steps are applied to the frame flow to obtain $M^{f}$.
Then, we concatenate the two weight masks and apply a softmax at each pixel (i.e., over the pair $[M^{f}(i), M^{e}(i)]$) to normalize the pixel-wise attention weights. Meanwhile, we split the normalized result again to obtain the frame weight mask $\tilde{M}^{f}$ and the event weight mask $\tilde{M}^{e}$ as follows:

$$\big[\tilde{M}^{f}(i), \tilde{M}^{e}(i)\big] = \mathrm{softmax}\big(\big[M^{f}(i), M^{e}(i)\big]\big), \quad i = 1, \dots, HW. \tag{12}$$
Finally, we obtain the fused feature map by summing the unimodal feature maps weighted by the corresponding pixel-wise masks:

$$z^{fused} = \tilde{M}^{f} \odot z^{f} + \tilde{M}^{e} \odot z^{e}, \tag{13}$$

where $+$ denotes the sum operation and $\odot$ refers to element-wise multiplication.
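The sketch below is a simplified PyTorch rendering of Eqs. (11)-(13) under our own naming: softmax-free attention weights are accumulated into one mask per modality, the two masks are normalized pixel-wise with a softmax, and the unimodal features are blended accordingly.

```python
import torch
import torch.nn as nn

class AttentionWeightFusion(nn.Module):
    """Asynchronous attention-based fusion (Eqs. (11)-(13)): accumulate softmax-free
    attention weights into a per-pixel mask for each modality, normalize the two masks
    pixel-wise, and sum the weighted frame and event features."""
    def __init__(self, dim=256):
        super().__init__()
        self.q_f = nn.Linear(dim, dim, bias=False)   # W_Q for the frame flow
        self.k_f = nn.Linear(dim, dim, bias=False)   # W_K for the frame flow
        self.q_e = nn.Linear(dim, dim, bias=False)   # W_Q for the event flow
        self.k_e = nn.Linear(dim, dim, bias=False)   # W_K for the event flow
        self.scale = dim ** -0.5

    def _mask(self, z, q_proj, k_proj):
        # z: (N, H*W, C). Accumulate each pixel's attention weights over all pixels (Eq. (11)).
        attn = q_proj(z) @ k_proj(z).transpose(1, 2) * self.scale   # (N, H*W, H*W), no softmax
        return attn.sum(dim=-1)                                     # (N, H*W)

    def forward(self, z_frame, z_event):
        m_f = self._mask(z_frame, self.q_f, self.k_f)
        m_e = self._mask(z_event, self.q_e, self.k_e)
        w = torch.softmax(torch.stack([m_f, m_e], dim=-1), dim=-1)  # Eq. (12): per-pixel softmax
        w_f, w_e = w[..., 0:1], w[..., 1:2]                         # (N, H*W, 1) each
        return w_f * z_frame + w_e * z_event                        # Eq. (13): fused feature
```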
6 Experiments
TABLE III: Performance evaluation of our single-modality baselines (events or frames) and our SODFormer (frames + events) in various scenarios on the PKU-DAVIS-SOD dataset.

| Scenario | Modality | AP50 (Car) | AP50 (Pedestrian) | AP50 (Two-wheeler) | mAP | mAP50 | mAP75 | mAP_S | mAP_M | mAP_L |
|---|---|---|---|---|---|---|---|---|---|---|
| Normal | Events | 0.440 | 0.220 | 0.451 | 0.147 | 0.371 | 0.090 | 0.072 | 0.268 | 0.526 |
| Normal | Frames | 0.747 | 0.365 | 0.558 | 0.228 | 0.557 | 0.138 | 0.166 | 0.336 | 0.539 |
| Normal | Frames + Events | 0.752 | 0.359 | 0.596 | 0.241 | 0.569 | 0.163 | 0.166 | 0.363 | 0.609 |
| Motion blur | Events | 0.327 | 0.159 | 0.380 | 0.113 | 0.289 | 0.064 | 0.051 | 0.181 | 0.255 |
| Motion blur | Frames | 0.561 | 0.303 | 0.394 | 0.163 | 0.419 | 0.096 | 0.100 | 0.201 | 0.365 |
| Motion blur | Frames + Events | 0.570 | 0.285 | 0.441 | 0.183 | 0.432 | 0.125 | 0.109 | 0.230 | 0.387 |
| Low-light | Events | 0.524 | 0.0002 | 0.294 | 0.093 | 0.273 | 0.039 | 0.075 | 0.183 | 0.286 |
| Low-light | Frames | 0.570 | 0.128 | 0.357 | 0.114 | 0.351 | 0.048 | 0.082 | 0.198 | 0.344 |
| Low-light | Frames + Events | 0.595 | 0.130 | 0.399 | 0.122 | 0.374 | 0.048 | 0.091 | 0.206 | 0.376 |
| All | Events | 0.424 | 0.188 | 0.390 | 0.128 | 0.334 | 0.071 | 0.065 | 0.210 | 0.348 |
| All | Frames | 0.700 | 0.316 | 0.452 | 0.195 | 0.489 | 0.116 | 0.142 | 0.264 | 0.417 |
| All | Frames + Events | 0.705 | 0.313 | 0.493 | 0.207 | 0.504 | 0.133 | 0.144 | 0.285 | 0.454 |
This section first describes the experimental setting and implementation details (Section 6.1). Then, the effective test (Section 6.2) is conducted to verify the validity of our SODFormer, which contains a performance evaluation in various scenarios and a comparison to related detectors. Meanwhile, we further implement the ablation study (Section 6.3) to see why and how our approach works, where we investigate the influences of each designed module and parameter settings. Finally, the scalability test (Section 6.4) provides interpretable explorations of our SODFormer.
6.1 Experimental Settings
In this part, we will present the dataset, implementation details, and evaluation metrics as follows.
Dataset and Setup. Our PKU-DAVIS-SOD dataset is designed for streaming object detection in driving scenes, which provides two heterogeneous and temporally long-term visual streams with manual labels at 25 Hz (see Section 3). This newly built dataset contains three subsets including 671.3k labels for training, 194.7k for validation, and 214.1k for testing. Each subset is further split into three typical scenarios (i.e., normal, low-light, and motion blur).
Implementation Details. We select event images [17] as the event representation and ResNet50 [74] as the backbone to achieve a trade-off between accuracy and speed. All parameters of the backbone and the spatial Transformer encoder (STE) are shared among different temporal bins of the same modality. We set the number of attention heads to 8 and the number of sampling points to 4 for the deformable multi-head self-attention (DMSA) in Eq. (7) and the temporal deformable multi-head self-attention (TDMSA) in Eq. (8). The temporal aggregation size in TDMSA is set to 9 to balance accuracy and speed. During training, the height of an image is randomly resized to a value from 256 to 576 with a step of 32, and the width is scaled accordingly to maintain the aspect ratio. Similarly, for evaluation, all images are resized to 352 in height and 468 in width to maintain the aspect ratio. Other data augmentation methods such as cropping and horizontal flipping are not utilized because our temporal Transformer requires accurate reference points. Following Deformable DETR [21], we adopt the matching cost and the Hungarian loss for training, with loss weights of 2, 5, and 2 for the classification, L1, and GIoU losses, respectively. All networks are trained for 25 epochs using the Adam optimizer [79] with weight decay; the initial learning rate is decayed by a factor of 0.1 after the 20th epoch. All experiments are conducted on NVIDIA Tesla V100 GPUs.
Evaluation Metrics. To compare different approaches, the mean average precision (i.e., COCO mAP [80]) and running time (ms) are selected as the two evaluation metrics, which are the most broadly utilized in the object detection task. In the effective test, we give a comprehensive evaluation using the average precision at various IoU thresholds (i.e., mAP, mAP50, and mAP75) and the AP across different scales (i.e., mAP_S, mAP_M, and mAP_L). In the other tests, we report the detection performance using mAP50. Following video object detection, we compute the AP for each class and the mAP over the three classes at all labeled timestamps. In other words, we evaluate the object detection performance at an output frequency of 25 Hz at every labeled timestamp.
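For reference, COCO-style mAP can be computed with pycocotools as sketched below (a generic usage sketch; the JSON file names are hypothetical and not part of the released toolkit).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical paths: COCO-format ground truth and per-timestamp detection results.
coco_gt = COCO("pku_davis_sod_test_gt.json")
coco_dt = coco_gt.loadRes("sodformer_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # reports mAP, mAP50, mAP75 and the scale-wise APs
```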
TABLE IV: Comparison with state-of-the-art methods on the PKU-DAVIS-SOD dataset.

| Modality | Method | Input representation | Backbone | Temporal | mAP50 | Runtime (ms) |
|---|---|---|---|---|---|---|
| Events | SSD-events [38] | Event image | SSD | No | 0.221 | 7.2 |
| Events | NGA-events [6] | Voxel grid | YOLOv3 | No | 0.232 | 8.0 |
| Events | YOLOv3-RGB [48] | Reconstructed image | YOLOv3 | No | 0.244 | 178.51 |
| Events | Faster R-CNN [56] | Event image | R-CNN | No | 0.251 | 74.5 |
| Events | Deformable DETR [21] | Event image | DETR | No | 0.307 | 21.6 |
| Events | LSTM-SSD* [81] | Event image | SSD | Yes | 0.273 | 22.7 |
| Events | ASTMNet* [43] | Event embedding | Rec-Conv-SSD | Yes | 0.291 | 21.3 |
| Events | Our baseline* | Event image | Our spatiotemporal Transformer | Yes | 0.334 | 25.0 |
| Frames | Faster R-CNN [56] | RGB frame | R-CNN | No | 0.443 | 75.2 |
| Frames | YOLOv3-RGB [48] | RGB frame | YOLOv3 | No | 0.426 | 7.9 |
| Frames | Deformable DETR [21] | RGB frame | DETR | No | 0.461 | 21.5 |
| Frames | LSTM-SSD* [81] | RGB frame | SSD | Yes | 0.456 | 22.4 |
| Frames | Our baseline* | RGB frame | Our spatiotemporal Transformer | Yes | 0.489 | 24.9 |
| Events + Frames | MFEPD [13] | Event image + RGB frame | YOLOv3 | No | 0.438 | 8.2 |
| Events + Frames | JDF [14] | Channel image + RGB frame | YOLOv3 | No | 0.442 | 8.3 |
| Events + Frames | Our SODFormer | Event image + RGB frame | Deformable DETR | Yes | 0.504 | 39.7 |
6.2 Effective Test
The objective of this part is to assess the validity of our SODFormer, so we implement two main experiments including performance evaluation in various scenarios (Section 6.2.1) and comparison with state-of-the-art methods (Section 6.2.2) as follows.
6.2.1 Performance Evaluation in Various Scenarios
To give a comprehensive evaluation on our PKU-DAVIS-SOD dataset, we report the quantitative results (see Table III) and representative visualization results (see Fig. 8) from the three following perspectives.
Normal Scenarios. Performance evaluations of all modalities (i.e., frames, events, and the two visual streams combined) in normal scenarios can be found in Table III. Our single-modality baseline only processes frames or events without using the asynchronous attention-based fusion module. We can see that our baseline using RGB frames achieves better performance than that using events in normal scenarios. This is because the event camera captures only weak texture in the spatial domain, so dynamic events (i.e., brightness changes) alone can hardly achieve high-precision recognition, especially in static or extremely slow-motion scenes (see Fig. 8(a)). In contrast, RGB frames provide static fine textures (i.e., absolute brightness) under ideal conditions. In particular, our SODFormer obtains a performance enhancement from a single stream to two visual streams by introducing the asynchronous attention-based fusion module.
Motion Blur Scenarios. Comparing the normal and motion blur scenarios in Table III, we find that the detection performance using events degrades less than that using frames. This can be explained by the fact that RGB frames suffer from motion blur in high-speed motion scenes (see Fig. 8(b)), resulting in a remarkable decrease in object detection performance. Note that, by introducing the event stream, the performance of our SODFormer appreciably improves over our baseline using RGB frames.
Low-Light Scenarios. As illustrated in Table III, the performance of our baseline using RGB frames drops sharply by 0.206 in low-light scenarios. Meanwhile, although the mAP of DVS events also drops by 0.098 in absolute terms, this is significantly less than that of RGB frames (i.e., 0.098 vs. 0.206). On a relative basis, there is also an evident difference, as the mAP of RGB frames and DVS events drops by 37.0% and 26.4%, respectively. This may be caused by the fact that the event camera has the advantage of HDR to sense moving objects in extreme light conditions (see Fig. 8(c)). Notably, only the AP50 of "Car" objects increases while the others decrease. We attribute this to the different image quality of different object classes in the event modality. Intuitively, "Car" objects tend to have more distinct outlines and higher image quality. As a result, "Car" objects are less disturbed by the increased noise in low-light scenes [82] and thus achieve performance comparable to that in normal scenes. More specifically, our SODFormer outperforms the baseline using RGB frames by a large margin (0.374 versus 0.351) in low-light scenarios.
In summary, our SODFormer consistently obtains better performance than the single-modality methods in all three scenarios while keeping a relatively comparable computational speed. By introducing DVS events, the mAP on our PKU-DAVIS-SOD dataset improves by an average of 1.5% over using RGB frames alone. What's more, the representative visualization results in Fig. 8 show that RGB frames fail to detect objects in high-speed motion blur and low-light scenarios. In contrast, the event camera, offering high temporal resolution and HDR, provides new insight into addressing the shortcomings of conventional cameras.
6.2.2 Comparison with State-of-the-Art Methods
We will investigate why and how our SODFormer works from the following three perspectives.
Evaluation on DVS Modality. To evaluate our temporal Transformer for DVS events, we compare our baseline* (i.e., our SODFormer without the asynchronous attention-based fusion module) with five feed-forward event-based object detectors (i.e., event image for SSD [38], voxel grid for YOLOv3 [6], reconstructed image for YOLOv3 [48], event image for Faster R-CNN [56], and event image for Deformable DETR [21]) and two recurrent object detectors that leverage temporal cues (event image for LSTM-SSD [81] and event embedding for ASTMNet [43]). As shown in Table IV, our baseline*, feeding event images into our temporal Transformer, achieves a significant improvement over all these object detectors. Besides, our other baseline, which uses E2VID [66] to reconstruct frames, performs better than the other input representations (e.g., event image and voxel grid), but it spends a long time on the two-stage pipeline of image reconstruction followed by object detection.
Evaluation on RGB Modality. We compare four state-of-the-art object detectors for RGB frames, namely (i) a single RGB frame for Faster R-CNN, (ii) a single RGB frame for YOLOv3, (iii) a single RGB frame for Deformable DETR, and (iv) sequential RGB frames for LSTM-SSD, with our baseline* that feeds sequential RGB frames into our spatiotemporal Transformer. Compared to the three object detectors utilizing single RGB frames, our baseline* obtains the best performance by introducing the temporal Transformer for RGB frames (see Table IV). Furthermore, it obtains better performance than LSTM-SSD while maintaining a comparable computational speed. This is because the designed spatiotemporal Transformer is more effective than the standard convolutional-recurrent operation. In other words, the results agree with our motivation that leveraging rich temporal cues is beneficial for streaming object detection.
Benefit From Multimodal Fusion. In Table IV, we compare our SODFormer with two existing joint object detectors (i.e., MFEPD [13] and JDF [14]). Apparently, our SODFormer outperforms the two competitors by a large margin. We can find that our SODFormer, incorporating the temporal Transformer and the asynchronous attention-based fusion module, outperforms four state-of-the-art detectors and our eight baselines. More precisely, by introducing DVS events, our SODFormer achieves a 1.5% improvement over the best baseline using sequential RGB frames. Compared with the best competitor using only DVS events, our SODFormer truly shines (0.504 versus 0.334), which indicates that RGB frames with fine textures may be very significant for high-precision object detection.
We further present some visualization results on our PKU-DAVIS-SOD dataset in Fig. 9. Note that our SODFormer outperforms the single-modality methods and the best multi-modality competitor (i.e., JDF [14]). Unfortunately, RGB frames fail to detect objects in high-speed or low-light scenes. Remarkably, our SODFormer can robustly detect objects in these challenging situations, thanks to the high temporal resolution and HDR properties inherited from DVS events, as well as the fine textures provided by RGB frames.
6.3 Ablation Test
In this section, we implement an ablation study to investigate how parameter setting and each design choice influence our SODFormer.
6.3.1 Contribution of Each Component
TABLE V: Contribution of each component on the PKU-DAVIS-SOD dataset.

| Method | Baseline | (a) | (b) | Ours |
|---|---|---|---|---|
| Events | | | ✓ | ✓ |
| Temporal Transformer | | ✓ | ✓ | ✓ |
| Fusion module | | | | ✓ |
| mAP50 | 0.461 | 0.489 | 0.494 | 0.504 |
| Runtime (ms) | 21.5 | 24.9 | 39.4 | 39.7 |
To explore the impact of each component on the final performance, we choose the feed-forward detector (i.e., RGB frames for Deformable DETR [21]) as the baseline. As illustrated in Table V, the three variants, namely (a), (b), and our SODFormer, which add the temporal Transformer, the averaging fusion module, and the asynchronous attention-based fusion module, respectively, consistently achieve higher performance on our PKU-DAVIS-SOD dataset than the baseline. More specifically, method (a), adopting the temporal Transformer to leverage rich temporal cues, obtains a 2.8% mAP improvement over the baseline. Comparing method (b) with (a), the absolute improvement is merely 0.5%, which indicates that simply combining RGB frames and DVS events by averaging is insufficient. Moreover, our SODFormer, which replaces the averaging fusion with the asynchronous attention-based fusion strategy, achieves a 1.0% mAP improvement over method (b) while maintaining comparable computational complexity. Intuitively, our SODFormer employs these effective components to process the two heterogeneous visual streams and achieves robust object detection on our PKU-DAVIS-SOD dataset.
6.3.2 Influence of Temporal Aggregation Size
To analyze the temporal aggregation strategy in our SODFormer, we evaluate the temporal Transformer with different temporal aggregation sizes (i.e., 1, 3, 5, 9, and 13). The temporal aggregation size is the number of event temporal bins or adjacent frames processed at a given timestamp; it corresponds to $T$ in Eq. (8). As depicted in Table VI, our strategy improves the mAP by 1.2%, 1.9%, 2.8%, and 2.7% compared to a temporal aggregation size of 1. Note that, by aggregating visual streams over a larger temporal size, richer temporal cues can be utilized in the temporal Transformer, resulting in better object detection performance. Nevertheless, as the temporal aggregation size grows, the computational time also gradually increases. Additionally, we find that the model's capacity for modeling temporally long-term dependencies saturates as the temporal aggregation size increases. In this study, the temporal aggregation size is set to 9 as a trade-off between accuracy and speed.
TABLE VI: Influence of the temporal aggregation size.

| Temporal aggregation size | 1 | 3 | 5 | 9 | 13 |
|---|---|---|---|---|---|
| mAP50 | 0.461 | 0.473 | 0.480 | 0.489 | 0.488 |
| Runtime (ms) | 21.5 | 24.1 | 24.2 | 24.9 | 25.5 |
To improve the interpretability of streaming object detection, we further present some comparative visualization results with and without rich temporal cues (see Fig. 10). The feed-forward baseline using RGB frames suffers from failure cases involving small objects and occluded objects, as shown in the first and third rows of Fig. 10. Fortunately, our SODFormer overcomes these limitations by leveraging rich temporal cues from adjacent frames or event streams. Therefore, trading some speed for accuracy may be beneficial and necessary for scenarios requiring a higher level of detection accuracy.
TABLE VII: Influence of the fusion strategy.

| Method | mAP50 | Runtime (ms) |
|---|---|---|
| NMS [52] | 0.438 | 8.2 |
| Concatenation [83] | 0.478 | 39.6 |
| Averaging [28] | 0.494 | 39.4 |
| Ours | 0.504 | 39.7 |
6.3.3 Influence of Fusion Strategy
To evaluate the effectiveness of our fusion module, we compare it with some typical fusion methods in Table VII. Specifically, the concatenation and averaging methods are both applied to the feature maps produced by the temporal Transformer, referred to as $z^{f}$ and $z^{e}$ in Fig. 5: $z^{f}$ and $z^{e}$ are concatenated along the last dimension for the concatenation strategy and are averaged element-wise for the averaging strategy, and the fused feature map is fed to the decoder in both cases. Note that our approach obtains the best performance against the post-processing operation (i.e., NMS [52]) and the end-to-end feature aggregation techniques (i.e., averaging [28] and concatenation [83]). More precisely, our strategy achieves improvements of around 6.6%, 2.6%, and 1.0% on our PKU-DAVIS-SOD dataset over the three corresponding methods. Compared with the averaging operation, our approach improves the mAP from 0.494 to 0.504 while maintaining comparable computation time. Additionally, we present some visual comparison results in Fig. 11. Three representative instances show that our approach performs better than the averaging operation. This is because our asynchronous attention-based fusion module can adaptively generate pixel-wise weight masks and eliminate unimodal degradation more thoroughly.
6.3.4 Influence of Event Representation
To verify the generality of our SODFormer with various event representations, we compare three typical event representations (i.e., event images [17], voxel grids [75], and the sigmoid representation [39]) chosen for their accuracy-speed trade-offs. As illustrated in Table VIII, the performance of our SODFormer varies with the event representation, but the running speed is almost identical. This indicates that our SODFormer provides a generic interface for various input representations, such as image-like representations (e.g., event images [17] and the sigmoid representation [39]) and spatiotemporal representations (i.e., voxel grids [75]). Meanwhile, it is obvious that the performance depends strongly on the quality of the representation and drops heavily with weaker ones (e.g., the sigmoid representation). Indeed, we believe that a good event representation makes asynchronous events directly compatible with our SODFormer while maximizing the detection performance.
6.3.5 Influence of the Number of Transformer Layers
We will analyze the effect of the number of layers in the designed temporal Transformer on the final performance from the following two perspectives.
The Number of TDTE Layers. As shown in Table IX, we first explore the influence of the number of encoder layers in our TDTE on our PKU-DAVIS-SOD dataset. As the number of encoder layers increases, our SODFormer shows only slight improvements up to 6 layers and then declines rapidly due to overfitting, while the inference time gradually grows. Since adding encoder layers beyond 6 does not improve detection performance but increases the computational time, the number of encoder layers in our TDTE is set to 6.
The number of encoder layers | 2 | 4 | 6 | 8 | 12 |
---|---|---|---|---|---|
mAP50 | 0.481 | 0.480 | 0.489 | 0.471 | 0.450 |
Runtime (ms) | 23.5 | 24.2 | 24.9 | 25.2 | 26.1 |
The Number of TDTD Layers. Table X further illustrates the influence of the number of decoder layers in our TDTD on our PKU-DAVIS-SOD dataset. The best performance occurs with 8 decoder layers; however, the mAP is almost unchanged while the computational time increases rapidly once the number of decoder layers exceeds 6. To balance detection performance and computational complexity, we set the number of TDTD layers to 6.
The number of decoder layers | 2 | 4 | 6 | 8 | 12 |
---|---|---|---|---|---|
mAP50 | 0.476 | 0.480 | 0.489 | 0.491 | 0.489 |
Runtime (ms) | 21.1 | 22.3 | 24.9 | 29.5 | 34.8 |
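Putting the layer-count ablations together, a hypothetical configuration reflecting the choices above might look as follows; the field names and the embedding width are illustrative assumptions, not the exact hyperparameters of our released code.

```python
from dataclasses import dataclass

@dataclass
class TemporalTransformerConfig:
    """Hypothetical hyperparameters summarizing the ablations in Tables VI, IX, and X."""
    temporal_aggregation_size: int = 9  # Table VI: best accuracy-speed trade-off
    num_encoder_layers: int = 6         # Table IX: deeper TDTE overfits beyond 6 layers
    num_decoder_layers: int = 6         # Table X: 8 layers gain little mAP at a higher runtime
    num_object_queries: int = 300       # prediction slots analyzed in Section 6.4.1
    d_model: int = 256                  # illustrative embedding width (assumption)

cfg = TemporalTransformerConfig()
```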
6.4 Scalability Test
This subsection will present the predicted output slot analysis (Section 6.4.1) and the visualization of the designed temporal deformable attention (Section 6.4.2). Then, we further present how to process two heterogeneous visual streams in an asynchronous manner (Section 6.4.3). Finally, we analyze some failure cases of our SODFormer (Section 6.4.4).
6.4.1 The Predicted Output Slot Analysis
To increase the interpretability of our SODFormer, we visualize the predicted bounding boxes of different slots for all labeled timestamps in the test subset of our PKU-DAVIS-SOD dataset. Following DETR [20], each predicted box is represented as a point whose coordinates are the box center normalized by the image size. Fig. 12 shows 10 representative examples out of the 300 predicted slots in our temporal deformable Transformer decoder (TDTD). Each slot learns to focus on certain areas and bounding box sizes; in other words, our SODFormer learns a different specialization for each object query slot, and all slots together cover the distribution of objects in our PKU-DAVIS-SOD dataset.
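A minimal sketch of how such a slot visualization can be produced is given below, assuming DETR-style predictions of normalized (cx, cy, w, h) boxes; the function name and array layout are illustrative and need not match how Fig. 12 was actually generated.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_query_slots(pred_boxes: np.ndarray, slot_ids: list) -> None:
    """Scatter the predicted box centers of selected object-query slots, as in
    Fig. 12. `pred_boxes` has shape [num_samples, num_slots, 4] with DETR-style
    normalized (cx, cy, w, h) coordinates."""
    fig, axes = plt.subplots(1, len(slot_ids), figsize=(3 * len(slot_ids), 3))
    for ax, slot in zip(np.atleast_1d(axes), slot_ids):
        centers = pred_boxes[:, slot, :2]                 # normalized (cx, cy) of this slot
        areas = pred_boxes[:, slot, 2] * pred_boxes[:, slot, 3]
        ax.scatter(centers[:, 0], 1.0 - centers[:, 1],    # flip y so the image origin is top-left
                   s=4, c=areas, cmap="viridis")
        ax.set(title=f"slot {slot}", xlim=(0, 1), ylim=(0, 1))
    plt.tight_layout()
    plt.show()
```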
6.4.2 Visualization of Temporal Deformable Attention
To better understand the proposed temporal deformable attention, we visualize the sampling points and attention weights of the last layer in our temporal deformable Transformer encoder (TDTE). As shown in Fig. 13(a), our feed-forward baseline fails to detect the middle car owing to occlusion, while our unimodal SODFormer succeeds by utilizing rich temporal cues in adjacent frames. In addition, we present the distribution of sampling points in the last layer of our TDTE (see Fig. 13(b)). All sampling points are mapped to the corresponding locations in the RGB frames, and each sampling point is marked as a color-filled circle whose color encodes its attention weight; the reference point is labeled with a green cross. We find that the sampling points concentrate radially on the foreground area surrounding the reference point rather than spreading over the whole image. As for the attention weights, the closer a sampling point is to the reference point, the larger its attention weight. In particular, the sampling points in adjacent frames are consistently clustered in the foreground area, providing rich temporal cues about the objects in the temporal domain. This indicates that the proposed temporal deformable attention can adapt its sampling points to concentrate on foreground objects across adjacent frames.
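For intuition, the following sketch shows single-query, single-head temporal deformable sampling: features are bilinearly sampled at learned offsets around a reference point in each adjacent frame and aggregated with softmax-normalized weights. The actual TDTE is multi-head and multi-scale in the spirit of Deformable DETR [21]; the tensor shapes and the function name here are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_deformable_sample(frames: torch.Tensor, ref_point: torch.Tensor,
                               offsets: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Single-query, single-head sketch of temporal deformable sampling.

    frames:    [T, C, H, W] features of adjacent frames / event temporal bins
    ref_point: [2] normalized (x, y) reference location in [0, 1]
    offsets:   [T, K, 2] learned sampling offsets in normalized coordinates
    weights:   [T, K]    unnormalized attention logits over all T*K points
    """
    T, C, H, W = frames.shape
    K = offsets.shape[1]
    # Sampling locations mapped to grid_sample's [-1, 1] coordinate convention.
    loc = (ref_point.view(1, 1, 2) + offsets) * 2.0 - 1.0                   # [T, K, 2]
    sampled = F.grid_sample(frames, loc.unsqueeze(1), align_corners=False)  # [T, C, 1, K]
    attn = torch.softmax(weights.flatten(), dim=0).view(T, K)               # joint softmax over T*K
    return (sampled.squeeze(2) * attn.unsqueeze(1)).sum(dim=(0, 2))         # aggregated [C] feature
```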
6.4.3 Asynchronous Inference Analysis
In general, the global shutter of conventional frame-based cameras limits their sampling rate, resulting in a relatively long time interval between two adjacent RGB frames. As a consequence, it is difficult to accurately locate moving objects at a high output frequency in real-time high-speed scenarios. To this end, the proposed asynchronous attention-based fusion module aims at breaking through the limited output frequency of synchronized frame-based fusion methods. As illustrated in Table XI and Fig. 14, we design two experiments to verify the effectiveness of our asynchronous fusion strategy. First, we compare our SODFormer with the single-modality baseline* (which does not use the asynchronous attention-based fusion module) at different frame rates. Note that we only change the frame rate, while the event temporal bins are fixed at 25 Hz, as is the output frequency; this ensures that every prediction has a corresponding annotation. Table XI shows that both methods suffer a reduction in performance as the frame rate decreases, but the drop of our SODFormer is significantly smaller than that of baseline*. Moreover, from the last row of Table XI, our SODFormer remains acceptable even when the output frequency is four times the frame rate (0.448 at a frame rate of 6.25 FPS), indicating that our asynchronous attention-based fusion module can robustly detect objects from two asynchronous modalities. To further illustrate how our SODFormer performs asynchronous detection when the input frequency is high, we conduct another experiment in which we slide a 0.04 s window over the continuous event stream and shift it forward by 0.01 s at a time, so that the frequency of the resulting event temporal bins reaches 100 Hz (see the scheduling sketch after Table XI). We then feed 25 Hz RGB frames and 100 Hz event temporal bins into our SODFormer, as shown in Fig. 14. The four middle panels show detections driven by the event stream between two adjacent RGB frames. Therefore, our asynchronous fusion strategy fills the gap between two adjacent RGB frames and enables our SODFormer to detect objects in an asynchronous manner.
Method | 25 FPS | 12.5 FPS | 8.33 FPS | 6.25 FPS |
---|---|---|---|---|
Baseline* | 0.489 | 0.458 | 0.412 | 0.372 |
SODFormer | 0.504 | 0.495 | 0.472 | 0.448 |
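The second experiment can be summarized by the scheduling sketch below: a 0.04 s event window slides forward every 0.01 s (100 Hz), and each resulting event bin is paired with the most recent 25 Hz RGB frame available at that time. The function name and the dictionary layout are illustrative assumptions, not part of our released code.

```python
from bisect import bisect_right

def asynchronous_schedule(t_start, t_end, frame_times, win=0.04, stride=0.01):
    """Sketch of the asynchronous schedule in Section 6.4.3: a `win`-long window
    over the event stream is shifted by `stride` (here 100 Hz detections), and
    each event bin is paired with the latest RGB frame captured before its end."""
    n_steps = int(round((t_end - t_start - win) / stride)) + 1
    schedule = []
    for i in range(max(n_steps, 0)):
        t = t_start + win + i * stride                    # query (detection) timestamp
        idx = bisect_right(frame_times, t) - 1            # most recent frame at or before t
        schedule.append({"event_window": (t - win, t),
                         "frame_time": frame_times[idx] if idx >= 0 else None})
    return schedule

# Example: 25 Hz RGB frames over 0.2 s paired with 100 Hz event bins.
frames = [i * 0.04 for i in range(6)]                     # 0.00, 0.04, ..., 0.20
plan = asynchronous_schedule(0.0, 0.2, frames)            # 17 detection timestamps
```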
6.4.4 Failure Case Analysis
Although our SODFormer achieves satisfactory results even in challenging scenarios, some failure cases remain far from being solved. As depicted in Fig. 15, the first row shows that static or slow-moving cars in low-light conditions are hard to detect robustly. This is because the relatively low dynamic range of conventional cameras results in poor image quality in low-light scenes, while the event camera, although sensitive to dynamic changes, generates almost no events in static or slow-moving scenarios. The second row illustrates that small objects moving at high speed fail to be detected. This is likely because a fast-moving object is almost invisible in the RGB frames owing to severe motion blur, while event cameras capture high-speed moving objects but provide only weak textures. These cases indicate that our PKU-DAVIS-SOD dataset remains challenging, and such failures, which lie beyond the capability of our current SODFormer, need to be addressed in future work.
7 Discussion
An effective and robust streaming object detector will further highlight the potential of the unifying framework using events and frames. Here, we discuss the generality and limitations of our method.
Generality. One might think that this work does not explore one core issue, namely how to design a novel event representation. In fact, any event representation can serve as the input of our SODFormer, and the ablation study verifies that our SODFormer improves detection performance with different event representations of the DVS stream (see Section 6.3.4). Highly realistic synthetic datasets are also worth exploring in the future, as they can provide large-scale data for model training and testing in simulated scenarios; however, synthetic data from existing simulators are not yet realistic enough to verify the effectiveness of our SODFormer in extreme scenarios (e.g., low-light or high-speed motion blur). Some may argue that the backbone of our SODFormer is a CNN-based feature extractor; indeed, this study does not design a pre-trained vision Transformer backbone (e.g., ViT [84] and Swin Transformer [85]) for event streams. Investigating an event-based vision Transformer to learn an effective event representation is an interesting direction with wide applicability in computer vision tasks (e.g., video reconstruction, object detection, and depth estimation).
Limitation. Currently, the distribution of the sampling points in TDMSA is almost identical across different reference frames, being distributed radially around the reference point (see Section 6.4.2). However, this sampling strategy does not take the motion trajectory of the object into account and limits the model's capacity for modeling long-term temporal dependency (see Section 6.3.2). One option is to follow [86] and use optical flow to track motion trajectories when selecting sampling points, but we do not include this in our SODFormer due to the additional computational complexity. How to track objects efficiently with low computational complexity remains a topic worth exploring.
8 Conclusion
This paper presents a novel streaming object detector with Transformer (i.e., SODFormer) using events and frames, which highlights how events and frames can be jointly exploited to deal with major object detection challenges (e.g., fast motion blur and low-light). To the best of our knowledge, this is the first attempt to explore a Transformer-based architecture that continuously detects objects from two heterogeneous visual streams. To achieve this, we first build a large-scale multimodal object detection dataset (i.e., the PKU-DAVIS-SOD dataset) including two visual streams and manual labels. Then, a spatiotemporal Transformer is designed to detect objects via an end-to-end sequence prediction problem, whose two core modules are the temporal Transformer and the asynchronous attention-based fusion module. The results demonstrate that our SODFormer outperforms four state-of-the-art methods and our eight baselines, inheriting the high temporal resolution and high dynamic range of DVS events (i.e., brightness changes) and the fine textures of RGB frames (i.e., absolute brightness). We believe that this study makes a major step towards solving the problem of fusing events and frames for streaming object detection.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Grant 62027804, Grant 61825101 and Grant 62088102.
References
- [1] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, “Deep learning for generic object detection: A survey,” Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, 2020.
- [2] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” Int. J. Comput. Vis., pp. 1–24, 2020.
- [3] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, “Imbalance problems in object detection: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3388–3415, 2021.
- [4] K. Sun, W. Wu, T. Liu, S. Yang, Q. Wang, Q. Zhou, Z. Ye, and C. Qian, “Fab: A robust facial landmark detection framework for motion-blurred videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5462–5471.
- [5] M. Sayed and G. Brostow, “Improved handling of motion blur in online object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1706–1716.
- [6] Y. Hu, T. Delbruck, and S.-C. Liu, “Learning to exploit multiple vision modalities by using grafted networks,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 85–101.
- [7] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor,” IEEE J. Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, 2014.
- [8] D. P. Moeys, F. Corradi, C. Li, S. A. Bamford, L. Longinotti, F. F. Voigt, S. Berry, G. Taverni, F. Helmchen, and T. Delbruck, “A sensitive dynamic and active pixel vision sensor for color or neural imaging applications,” IEEE Trans. Biomed. Circuits Syst., vol. 12, no. 1, pp. 123–136, 2017.
- [9] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
- [10] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Eklt: Asynchronous photometric feature tracking using events and frames,” Int. J. Comput. Vis., vol. 128, no. 3, pp. 601–618, 2020.
- [11] D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, “Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction,” IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 2822–2829, 2021.
- [12] S. Tulyakov, D. Gehrig, S. Georgoulis, J. Erbach, M. Gehrig, Y. Li, and D. Scaramuzza, “Time lens: Event-based video frame interpolation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16155–16164.
- [13] Z. Jiang, P. Xia, K. Huang, W. Stechele, G. Chen, Z. Bing, and A. Knoll, “Mixed frame-/event-driven fast pedestrian detection,” in Proc. IEEE Conf. Robot. Autom., 2019, pp. 8332–8338.
- [14] J. Li, S. Dong, Z. Yu, Y. Tian, and T. Huang, “Event-based vision enhanced: A joint detection framework in autonomous driving,” in Proc. IEEE Int. Conf. Multimedia Expo., 2019, pp. 1396–1401.
- [15] H. Cao, G. Chen, J. Xia, G. Zhuang, and A. Knoll, “Fusion-based feature attention gate component for vehicle detection based on event camera,” IEEE Sensors J., pp. 1–9, 2021.
- [16] M. Liu, N. Qi, Y. Shi, and B. Yin, “An attention fusion network for event-based vehicle object detection,” in Proc. IEEE Int. Conf. Image Process., 2021, pp. 3363–3367.
- [17] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5419–5427.
- [18] R. J. Dolan, G. Fink, E. Rolls, M. Booth, A. Holmes, R. Frackowiak, and K. Friston, “How the brain learns to see objects and faces in an impoverished context,” Nature, vol. 389, no. 6651, pp. 596–599, 1997.
- [19] A. G. Huth, S. Nishimoto, A. T. Vu, and J. L. Gallant, “A continuous semantic space describes the representation of thousands of object and action categories across the human brain,” Neuron, vol. 76, no. 6, pp. 1210–1224, 2012.
- [20] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 213–229.
- [21] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in Proc. Int. Conf. Learn. Represent., 2020, pp. 1–16.
- [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
- [23] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” arXiv, pp. 1–28, 2021.
- [24] T. H. Kim, M. S. Sajjadi, M. Hirsch, and B. Scholkopf, “Spatio-temporal transformer network for video restoration,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 106–122.
- [25] L. He, Q. Zhou, X. Li, L. Niu, G. Cheng, X. Li, W. Liu, Y. Tong, L. Ma, and L. Zhang, “End-to-end video object detection with spatial-temporal transformers,” in Proc. ACM Int. Conf. Multimedia., 2021.
- [26] Y. Zhang, X. Li, C. Liu, B. Shuai, Y. Zhu, B. Brattoli, H. Chen, I. Marsic, and J. Tighe, “Vidtr: Video transformer without convolutions,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 13577–13587.
- [27] A. Tomy, A. Paigwar, K. S. Mann, A. Renzaglia, and C. Laugier, “Fusing event-based and RGB camera for robust object detection in adverse conditions,” in Proc. IEEE Conf. Robot. Autom., 2022, pp. 933–939.
- [28] H. Li, X.-J. Wu, and J. Kittler, “Mdlatlrr: A novel decomposition method for infrared and visible image fusion,” IEEE Trans. Image Process., vol. 29, no. 1, pp. 4733–4746, 2020.
- [29] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis et al., “Event-based vision: A survey,” in IEEE Trans. Pattern Anal. Mach. Intell., 2020, pp. 1–25.
- [30] S.-C. Liu, B. Rueckauer, E. Ceolini, A. Huber, and T. Delbruck, “Event-driven sensing for efficient perception: Vision and audition algorithms,” IEEE Signal Process. Mag., vol. 36, no. 6, pp. 29–37, 2019.
- [31] K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machine intelligence with neuromorphic computing,” Nature, vol. 575, no. 7784, pp. 607–617, 2019.
- [32] G. Chen, H. Cao, J. Conradt, H. Tang, F. Rohrbein, and A. Knoll, “Event-based neuromorphic vision for autonomous driving: A paradigm shift for bio-inspired visual sensing and perception,” IEEE Signal Process. Mag., vol. 37, no. 4, pp. 34–49, 2020.
- [33] P. de Tournemire, D. Nitti, E. Perot, D. Migliore, and A. Sironi, “A large scale event-based detection dataset for automotive,” in arXiv, 2020, pp. 1–8.
- [34] E. Perot, P. de Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1–14.
- [35] H. Rebecq, D. Gehrig, and D. Scaramuzza, “Esim: An open event camera simulator,” in Proc. Conf. Robot Learn., 2018, pp. 969–982.
- [36] Y. Hu, S.-C. Liu, and T. Delbruck, “V2e: From video frames to realistic dvs events,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2021, pp. 1312–1321.
- [37] Z. Kang, J. Li, L. Zhu, and Y. Tian, “Retinomorphic sensing: A novel paradigm for future multimedia computing,” in Proc. ACM Int. Conf. Multimedia., 2021, pp. 144–152.
- [38] M. Iacono, S. Weber, A. Glover, and C. Bartolozzi, “Towards event-driven object detection with off-the-shelf deep learning,” in Proc. IEEE Conf. Intell. Robot. Syst., 2018, pp. 1–9.
- [39] N. F. Chen, “Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2018, pp. 644–653.
- [40] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci, “Event-based convolutional networks for object detection in neuromorphic cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2019, pp. 1656–1665.
- [41] G. Chen, H. Cao, C. Ye, Z. Zhang, X. Liu, X. Mo, Z. Qu, J. Conradt, F. Röhrbein, and A. Knoll, “Multi-cue event information fusion for pedestrian detection with neuromorphic vision sensors,” Frontiers in Neurorobotics, vol. 13, p. 10, 2019.
- [42] C. Ryan, B. O’Sullivan, A. Elrasad, A. Cahill, J. Lemley, P. Kielty, C. Posch, and E. Perot, “Real-time face & eye tracking and blink detection using event cameras,” Neural Netw., vol. 141, pp. 87–97, 2021.
- [43] J. Li, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, “Asynchronous spatio-temporal memory network for continuous event-based object detection,” IEEE Trans. Image Process., vol. 31, pp. 2975–2987, 2022.
- [44] X. Xiang, L. Zhu, J. Li, Y. Tian, and T. Huang, “Temporal up-sampling for asynchronous events,” in Proc. IEEE Int. Conf. Multimedia Expo., 2022, pp. 1–6.
- [45] D. Wang, X. Jia, Y. Zhang, X. Zhang, Y. Wang, Z. Zhang, D. Wang, and H. Lu, “Dual memory aggregation network for event-based object detection with learnable representation,” in Proc. AAAI Conf. on Artificial Intell., 2023, pp. 2492–2500.
- [46] M. Gehrig and D. Scaramuzza, “Recurrent vision transformers for object detection with event cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023.
- [47] D. Gehrig, A. Loquercio, K. G. Derpanis, and D. Scaramuzza, “End-to-end learning of representations for asynchronous event-based data,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5633–5643.
- [48] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” in arXiv, 2018, pp. 1–6.
- [49] H. Liu, D. P. Moeys, G. Das, D. Neil, S.-C. Liu, and T. Delbrück, “Combined frame-and event-based detection and tracking,” in Proc. IEEE Int. Symposium Circuits Syst., 2016, pp. 2511–2514.
- [50] Z. El Shair and S. A. Rawashdeh, “High-temporal-resolution object detection and tracking using images and events,” J. Imag., vol. 8, no. 8, pp. 1–21, 2022.
- [51] N. Messikommer, D. Gehrig, M. Gehrig, and D. Scaramuzza, “Bridging the gap between events and frames through unsupervised domain adaptation,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 3515–3522, 2022.
- [52] J. Hosang, R. Benenson, and B. Schiele, “Learning non-maximum suppression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4507–4515.
- [53] D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
- [54] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
- [55] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
- [56] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
- [57] T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan, “Pnp-detr: Towards efficient visual analysis with transformers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 4661–4670.
- [58] W. Shang, D. Ren, D. Zou, J. S. Ren, P. Luo, and W. Zuo, “Bringing events into video deblurring with non-consecutively blurry frames,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 4531–4540.
- [59] L. Zhu, J. Li, X. Wang, T. Huang, and Y. Tian, “Neuspike-net: High speed video reconstruction via bio-inspired neuromorphic cameras,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2400–2409.
- [60] P. Duan, Z. Wang, B. Shi, O. Cossairt, T. Huang, and A. Katsaggelos, “Guided event filtering: Synergy between intensity images and neuromorphic events for high performance imaging,” IEEE Trans. Pattern Anal. Mach. Intell., 2021.
- [61] J. Zhang, X. Yang, Y. Fu, X. Wei, B. Yin, and B. Dong, “Object tracking by jointly exploiting frame and event domain,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 13043–13052.
- [62] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” in arXiv, 2021.
- [63] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Ultimate slam? combining events, images, and imu for robust visual slam in hdr and high-speed scenarios,” IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 994–1001, 2018.
- [64] Y.-F. Zuo, J. Yang, J. Chen, X. Wang, Y. Wang, and L. Kneip, “Devo: Depth-event camera visual odometry in challenging conditions,” in Proc. IEEE Conf. Robot. Autom., 2022, pp. 2179–2185.
- [65] L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip, “Vector: A versatile event-centric benchmark for multi-sensor slam,” IEEE Robot. Autom. Lett., 2022.
- [66] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “Events-to-video: Bringing modern computer vision to event cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3857–3866.
- [67] Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong, “Tju-dhd: A diverse high-resolution dataset for object detection,” IEEE Trans. Image Process., vol. 30, pp. 207–219, 2020.
- [68] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck, “Retinomorphic event-based vision sensors: bioinspired cameras with spiking output,” Proc. IEEE., vol. 102, no. 10, pp. 1470–1484, 2014.
- [69] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Sivilotti, and D. Gillespie, “Silicon auditory processors as computer peripherals,” IEEE Trans. Neural Netw., vol. 4, no. 3, pp. 523–528, 1993.
- [70] J. Li, Y. Fu, S. Dong, Z. Yu, T. Huang, and Y. Tian, “Asynchronous spatiotemporal spike metric for event cameras,” IEEE Trans. Neural Netw. Learn. Syst., 2021.
- [71] W. Han, Z. Zhang, B. Caine, B. Yang, C. Sprunk, O. Alsharif, J. Ngiam, V. Vasudevan, J. Shlens, and Z. Chen, “Streaming object detection for 3-d point clouds,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 423–441.
- [72] Q. Chen, S. Vora, and O. Beijbom, “Polarstream: Streaming object detection and segmentation with polar pillars,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 26871–26883.
- [73] J. Yang, S. Liu, Z. Li, X. Li, and J. Sun, “Streamyolo: Real-time object detection for streaming perception,” arXiv, 2022.
- [74] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
- [75] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “Unsupervised event-based learning of optical flow, depth, and egomotion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 989–997.
- [76] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [77] J. Xu, X. Sun, Z. Zhang, G. Zhao, and J. Lin, “Understanding and improving layer normalization,” Proc. Adv. Neural Inf. Process. Syst., vol. 32, pp. 4381–4391, 2019.
- [78] O. Mazhar, R. Babuška, and J. Kober, “Gem: Glare or gloom, I can still see you–end-to-end multi-modal object detection,” IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 6321–6328, 2021.
- [79] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–15.
- [80] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 740–755.
- [81] M. Zhu and M. Liu, “Mobile video object detection with temporally-aware feature maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5686–5695.
- [82] S. Guo and T. Delbruck, “Low cost and latency event camera background activity denoising,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 785–795, 2023.
- [83] J. Xu, W. Wang, H. Wang, and J. Guo, “Multi-model ensemble with rich spatial information for object detection,” Pattern Recognit., vol. 99, p. 107098, 2020.
- [84] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent., 2020.
- [85] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE Int. Conf. Comput. Vis., 2021.
- [86] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, “Flow-guided feature aggregation for video object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 408–417.
Dianze Li received the B.S. degree from the School of Electronics Engineering and Computer Science, Peking University, Beijing, China, in 2022. He is currently pursuing the Ph.D. degree with the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China. His current research interests include event-based vision, spiking neural networks, and neuromorphic engineering.
Jianing Li (Member, IEEE) received the B.S. degree from the College of Computer and Information Technology, China Three Gorges University, China, in 2014, the M.S. degree from the School of Microelectronics and Communication Engineering, Chongqing University, China, in 2017, and the Ph.D. degree from the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China, in 2022. He is the author or coauthor of over 20 technical papers in refereed journals and conferences, such as the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, CVPR, ICCV, AAAI, and ACM MM. He received the Lixin Tang Scholarship from Chongqing University, Chongqing, China, in 2016. His research interests include event-based vision, neuromorphic engineering, and robotics.
Yonghong Tian (S’00-M’06-SM’10-F’22) is currently the Dean of the School of Electronics and Computer Engineering, Peking University, Shenzhen Graduate School, 518055, China, a Boya Distinguished Professor with the School of Computer Science, Peking University, China, and is also the deputy director of Artificial Intelligence Research, PengCheng Laboratory, Shenzhen, China. His research interests include neuromorphic vision, distributed machine learning and multimedia big data. He is the author or coauthor of over 300 technical articles in refereed journals and conferences. Prof. Tian was/is an Associate Editor of IEEE TCSVT (2018.1-2021.12), IEEE TMM (2014.8-2018.8), IEEE Multimedia Mag. (2018.1-2022.8), and IEEE Access (2017.1-2021.12). He co-initiated IEEE Int’l Conf. on Multimedia Big Data (BigMM) and served as the TPC Co-chair of BigMM 2015, and also served as the Technical Program Co-chair of IEEE ICME 2015, IEEE ISM 2015 and IEEE MIPR 2018/2019, and General Co-chair of IEEE MIPR 2020 and ICME 2021. He is a TPC Member of more than ten conferences such as CVPR, ICCV, ACM KDD, AAAI, ACM MM and ECCV. He was the recipient of the Chinese National Science Foundation for Distinguished Young Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award for Journal on Image and Video Processing, the best paper award of IEEE BigMM 2018, and the 2022 IEEE SA Standards Medallion and SA Emerging Technology Award. He is a Fellow of IEEE, a senior member of CIE and CCF, and a member of ACM.