
SODFormer: Streaming Object Detection with Transformer Using Events and Frames

Dianze Li, Jianing Li, and Yonghong Tian. Dianze Li and Jianing Li are with the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing 100871, China. E-mail: dianzeli@stu.pku.edu.cn, lijianing@pku.edu.cn. Yonghong Tian is with the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing 100871, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China. E-mail: yhtian@pku.edu.cn. Manuscript received September 23, 2022. (Corresponding authors: Yonghong Tian and Jianing Li.)
Abstract

The DAVIS camera, which streams two complementary sensing modalities (asynchronous events and frames), has gradually been used to address major object detection challenges (e.g., fast motion blur and low light). However, how to effectively leverage rich temporal cues and fuse two heterogeneous visual streams remains a challenging endeavor. To address this challenge, we propose a novel streaming object detector with Transformer, namely SODFormer, which is the first to integrate events and frames to continuously detect objects in an asynchronous manner. Technically, we first build a large-scale multimodal neuromorphic object detection dataset (i.e., PKU-DAVIS-SOD) with more than 1080.1k manual labels. Then, we design a spatiotemporal Transformer architecture that detects objects via an end-to-end sequence prediction problem, where the novel temporal Transformer module leverages rich temporal cues from the two visual streams to improve detection performance. Finally, an asynchronous attention-based fusion module is proposed to integrate the two heterogeneous sensing modalities and take complementary advantage of each; it can be queried at any time to locate objects, breaking through the limited output frequency of synchronized frame-based fusion strategies. The results show that the proposed SODFormer outperforms four state-of-the-art methods and our eight baselines by a significant margin. We also show that our unifying framework works well even in cases where the conventional frame-based camera fails, e.g., high-speed motion and low-light conditions. Our dataset and code are available at https://github.com/dianzl/SODFormer.

Index Terms:
Neuromorphic Vision, Event Cameras, Object Detection, Transformer, Multimodal Fusion.

1 Introduction

Object detection [1, 2, 3], one of the most fundamental and challenging topics, supports a wide range of computer vision applications, such as autonomous driving, intelligent surveillance, and robot vision. In fact, with conventional frame-based cameras, object detection performance [4, 5, 6] drops significantly under some challenging conditions (e.g., high-speed motion blur, low light, and overexposure). A key question thus remains: how can we exploit a novel sensing paradigm that compensates for the limitations of conventional cameras?

More recently, the Dynamic and Active-Pixel Vision Sensor (DAVIS) [7, 8], a multimodal vision sensor, has been designed in that spirit, combining a bio-inspired event camera and a conventional frame-based camera in the same pixel array. Its core is the novel event camera (i.e., DVS [9]), which works differently from frame-based cameras and reacts to light changes by triggering asynchronous events. Since the event camera offers high temporal resolution (μs) and high dynamic range (HDR, up to 120 dB), it brings a new perspective to address some limitations of conventional cameras in fast-motion and challenging-light scenarios. However, just as conventional frame-based cameras fail in extreme-light or high-speed motion blur scenarios, event cameras perform poorly in static or extremely slow-motion scenes. We refer to this phenomenon as unimodal degradation because it is mainly caused by the limitations of a single sensing modality (see Fig. 1). While event cameras are insensitive in static or extremely slow-motion scenes, frame-based cameras directly provide static fine textures (i.e., absolute brightness). Indeed, event cameras and frame-based cameras are complementary, which has motivated novel solutions to several computer vision tasks (e.g., feature tracking [10], depth estimation [11], and video reconstruction [12]) that take advantage of both. Therefore, we aim at making complementary use of asynchronous events and frames to maximize object detection accuracy.

One problem is that most existing event-based object detectors [13, 14, 15, 16] run feed-forward models independently and thus fail to utilize the rich temporal cues in continuous visual streams. While an isolated event image [17] may suffer from small objects, occlusion, and defocus, the human visual system [18, 19] naturally identifies objects along the temporal dimension and then associates them. In other words, detection performance can be further improved by leveraging rich temporal cues from event streams and adjacent frames. Although the field of computer vision has witnessed significant achievements of CNNs and RNNs, they show limited ability in modeling long-term temporal dependencies. Consequently, the emerging Transformer is gaining more and more attention in object detection tasks [20, 21]. Thanks to the self-attention mechanism, Transformer [22, 23] is particularly suited to modeling long-term temporal dependencies for video sequence tasks [24, 25, 26]. However, no Transformer-based object detector yet exploits temporal cues from events and frames. Meanwhile, the rare multimodal neuromorphic object detection datasets (e.g., PKU-DDD17-CAR [14] and DAD [16]) only provide isolated images and synchronized events, which means there is a lack of temporally long-term, continuous, and large-scale multimodal datasets including events, frames, and manual labels.

Another problem is that current fusion strategies (e.g., post-processing [13, 14], concatenation [27], and averaging [16, 28]) for events and frames are inadequate at exploiting the complementary information between the two streams and can only predict results synchronously at the frame rate, incurring bottlenecks in both performance and inference frequency. Since asynchronous events are sparse points in the spatiotemporal domain with much higher temporal resolution than structured frames, existing frame-based multimodal detectors cannot directly deal with the two heterogeneous visual streams. As a result, some fusion strategies first split the continuous event stream into discrete image-like representations at the same sampling frequency as the frames and then integrate the two synchronized streams via post-processing or feature aggregation. One drawback is that these fusion strategies cannot distinguish the degree of degradation across modalities and regions, and therefore fail to eliminate unimodal degradation thoroughly. The other drawback is that the inference frequency of these synchronized fusion strategies is limited by the sampling rate of conventional frames. Yet, no prior work designs an effective asynchronous multimodal fusion module for events and frames in the object detection task.

[Figure 1 panels: (a) Events and frames; (b) Motion blur; (c) Low-light condition; (d) Static scenes]

Figure 1: This work makes complementary use of events and frames to detect objects. (a) DAVIS [7] outputs events and frames. (b)-(d) The left refers to frames, and the right denotes events. Note that the event camera offers high-speed and high dynamic range sensing, but it fails to capture static textures as a conventional camera does.

To address the aforementioned problems, this paper proposes a novel streaming object detector with Transformer, namely SODFormer, which continuously detects objects in an asynchronous way by integrating events and frames. Note that this work does not aim to optimize Transformer-based object detectors (e.g., DETR [20]) on each isolated image. Instead, our goal is to overcome the following challenges: (i) Lack of dataset - how do we create a large-scale multimodal neuromorphic dataset for object detection? (ii) Temporal correlation - how do we design a Transformer-based object detector that leverages rich temporal cues? (iii) Asynchronous fusion - what unifying mechanism takes advantage of the two streams and achieves a high inference frequency? To this end, we first build a large-scale multimodal object detection dataset (i.e., PKU-DAVIS-SOD), which provides manual bounding boxes at a frequency of 25 Hz for 3 classes, yielding more than 1080.1k labels. Then, we propose a spatiotemporal Transformer that aggregates spatial information from continuous streams and outputs the detection results via an end-to-end sequence prediction problem. Finally, an asynchronous attention-based fusion module is designed to effectively integrate the two sensing modalities while eliminating unimodal degradation and breaking the limitation of the frame rate. The results show that our SODFormer outperforms the state-of-the-art methods and our two single-modality baselines by a large margin. We also verify the efficacy of our SODFormer in fast motion and low-light scenarios.

To sum up, the main contributions of this work are summarized as follows:

  • We propose a novel Transformer-based framework for streaming object detection (SODFormer), which is the first to integrate events and frames via Transformer to continuously detect objects in an asynchronous manner.

  • We design an effective temporal Transformer module, which leverages rich temporal cues from two visual streams to improve the object detection performance.

  • We develop an asynchronous attention-based fusion module that takes complementary advantage of each modality, eliminates unimodal degradation, and overcomes the inference-frequency limit imposed by the frame rate.

  • We establish a large-scale and temporally long-term multimodal neuromorphic dataset (i.e., PKU-DAVIS-SOD), which will open an opportunity for the research of this challenging problem.

The rest of the paper is organized as follows. Section 2 reviews prior works. We describe how to build a competitive dataset in Section 3. Section 4 introduces the camera working principle and formulates the novel problem. Section 5 presents our solution of streaming object detection with Transformer. Finally, Section 6 analyzes the performance of the proposed method, some discussions are presented in Section 7, and conclusions are drawn in Section 8.

2 Related Work

This section first reviews neuromorphic object detection datasets (Section 2.1) and the corresponding object detectors (Section 2.2), followed by an overview of object detectors with Transformer (Section 2.3) and a survey of fusion approaches for events and frames (Section 2.4).

2.1 Neuromorphic Object Detection Datasets

Publicly available object detection datasets using event cameras remain scarce [29, 30, 31, 32]. The Gen1 Detection dataset [33] and the 1Mpx Detection dataset [34] provide large-scale annotations for object detection using event cameras. However, it may be difficult to obtain fine textures and achieve high-precision object detection using DVS events alone. Although event-based simulators (e.g., ESIM [35], V2E [36], and RS [37]) can directly convert video object detection datasets to the neuromorphic domain, the converted data fail to reflect realistic high-speed motion or extreme-light scenarios, which are exactly what event cameras are good at. Besides, the PKU-DDD17-CAR dataset [14] and the DAD dataset [16] provide 3155 and 6247 isolated hybrid sequences, respectively. Nevertheless, these two small-scale datasets do not offer continuous streams for modeling temporal dependencies. Therefore, this work aims to establish a large-scale and temporally long-term multimodal object detection dataset covering challenging scenarios.

2.2 Neuromorphic Object Detection Approaches

Existing object detectors using event cameras can be roughly divided into two categories. The first category [38, 39, 40, 41, 42, 43, 44, 45, 46] comprises single-modality methods that only process asynchronous events. These methods usually first convert asynchronous events into 2D image-like representations [47] and then feed them into frame-based detectors (e.g., YOLOs [48]). Although using dynamic events alone can achieve satisfactory performance in some specific scenes, it is clear that high-precision detection requires static fine textures.

The second category [49, 14, 13, 6, 15, 16, 27, 50, 51] comprises multi-modality methods that combine multiple visual streams. For example, some works [49, 14, 13] first detect objects on each isolated frame or event temporal bin and then merge the detection results of the two modalities by post-processing (e.g., NMS [52]). Besides, a grafting algorithm [6] is proposed to integrate events and thermal images. Some attention-based aggregation operations [15, 27] are designed to fuse each isolated frame and event temporal bin from the PKU-DDD17-CAR [14] dataset, while others [16, 28] average the features from each modality. However, these joint frameworks have not explored the rich temporal cues in continuous visual streams. Moreover, post-processing and aggregation fusion operations struggle to capture global context across the two sensing modalities, while the averaging strategy cannot eliminate unimodal degradation thoroughly. Thus, we design a streaming object detector that aims at leveraging rich temporal cues and making fully complementary use of the two visual streams.

2.3 Object Detection with Transformer

Transformer, an attention-based architecture, was first introduced by [22] for machine translation. Its core attention mechanism [53] scans through each element of a sequence and updates it by aggregating information from the whole sequence with different attention weights. This global computation makes Transformers perform better than RNNs on long sequences. Transformers have since been migrated to computer vision and have replaced RNNs in many problems. In video object detection, while CNN-based and RNN-based detectors [54, 55, 56] have achieved great success, they require meticulously hand-designed components (e.g., anchor generation). To simplify these pipelines, DETR [20], an end-to-end Transformer-based object detector, was proposed; it is the first Transformer-based object detector formulated as a sequence prediction model. While DETR largely simplifies the classical CNN-based detection paradigm, it suffers from very slow convergence and relatively low performance on small objects. More recently, a few approaches design optimized architectures to help detect small objects and speed up training. For instance, Deformable DETR [21] adopts a multi-scale deformable attention module that achieves better performance than DETR with 10× fewer training epochs. PnP-DETR [57] significantly reduces spatial redundancy and achieves more efficient computation via a two-step poll-and-pool sampling module. However, most works operate on each isolated image and do not exploit temporal cues from continuous visual streams. Inspired by the ability of Transformer to model long-term dependencies in video sequence tasks [24, 25, 26], we propose a temporal Transformer model to leverage rich temporal cues from continuous events and adjacent frames.

2.4 Multimodal Fusion for Events and Frames

Several computer vision tasks (e.g., video reconstruction [12, 58, 59, 60], object detection [14, 13], object tracking [61, 62], depth estimation [11], and SLAM [63, 64, 65]) have sought to integrate the two complementary modalities of events and frames. For example, JDF [14] adopts the Dempster-Shafer theory to fuse events and frames for object detection. A recurrent asynchronous multimodal network [11] is introduced to fuse events and frames for depth estimation. In fact, most fusion strategies (e.g., post-processing, concatenation, and averaging) cannot distinguish the degradation degree of different modalities and regions, and thus fail to eliminate unimodal degradation completely (see Section 2.2). More importantly, most fusion strategies are synchronized with the event and frame streams, so the joint output frequency is limited by the sampling rate of conventional frames. Obviously, this limited inference frequency cannot meet the needs of fast object detection in real-time high-speed scenarios, so we seek to exploit the high temporal resolution of event cameras while making joint use of events and frames. RAMNet [11] utilizes RNNs to realize an asynchronous fusion method that allows decoding the task variable at any time. Nevertheless, to the best of our knowledge, there is no prior attention-based work on asynchronous fusion of events and frames for object detection. Thus, we design a novel asynchronous attention-based fusion module that eliminates unimodal degradation across the two modalities with a high inference frequency.

[Figure 2 panels: (a) Recording platform; (b) DAVIS346 camera]

Figure 2: Experimental setup. A DAVIS346 camera is installed on the front windshield of the driving car.

3 PKU-DAVIS-SOD Dataset

This section first presents the details of how to build our dataset (Section 3.1). Then, we give detailed statistics to better understand the newly built dataset (Section 3.2). Finally, we make a comparison of related datasets (Section 3.3).

3.1 Data Collection and Annotation

The goal of this dataset is to offer a dedicated platform for the training and evaluation of streaming object detection algorithms. Thus, a multimodal vision sensor (i.e., DAVIS346, resolution 346×260) is used to record multiple hybrid sequences from a driving car (see Fig. 2). The camera simultaneously outputs events with high temporal resolution and conventional frames at 25 FPS. In most recordings, we fix the DAVIS346 in a driving car (see Fig. 2(a)) and record sequences while the car drives on city roads. Nevertheless, it is difficult to capture high-speed motion blur owing to the limited relative speed between vehicles on city roads. To conveniently acquire high-speed objects, we additionally provide some sequences in which the camera is placed at the side of the road. The raw recordings consider velocity distribution, light conditions, category diversity, object scale, etc. To provide manual bounding boxes in challenging scenarios (e.g., high-speed motion blur and low-light), grayscale images are reconstructed from asynchronous events using E2VID [66] at 25 FPS when the RGB frames are of low quality. Even when objects are unclear in all three modalities (i.e., RGB frames, event images, and reconstructed frames), they can still be labeled because our PKU-DAVIS-SOD dataset provides continuous visual streams, so information about unclear objects can be obtained from nearby images; some objects that are unclear in a single image become visible over a piece of video. After temporal calibration, we first select three common and important object classes (i.e., car, pedestrian, and two-wheeler) in our daily life. Then, all bounding boxes are annotated from RGB frames or synchronized reconstructed images by a well-trained professional team.

TABLE I: The number of labeled frames and bounding boxes in each set of our PKU-DAVIS-SOD dataset.
Name         Normal              Motion blur         Low-light
             frames    boxes     frames    boxes     frames    boxes
Training     115104    528944    31587     87056     22826     55343
Validation   31786     155778    9763      24828     8892      14143
Test         35459     150492    11859     44757     8729      18807
Total        182349    835214    53209     156641    40447     88293

3.2 Dataset Statistics

The PKU-DAVIS-SOD dataset provides 220 hybrid driving sequences and labels at a frequency of 25 Hz. As a result, this dataset contains 276k timestamps (i.e., labeled frames) and 1080.1k bounding boxes in total. We split the boxes into 671.3k for training, 194.7k for validation, and 214.1k for testing. Table I shows the number of bounding boxes in each set. Besides, we further analyze the attributes of the newly built dataset from the four following perspectives.

Category Diversity. Fig. 3(a) displays the number distributions of three types of labeled objects (i.e., car, pedestrian, and two-wheeler) in our PKU-DAVIS-SOD dataset. We can find that the numbers of cars, pedestrians, and two-wheelers in each set (i.e., training, validation, and testing) are (580340, 162894, 193118), (34744, 5800, 5680), and (67599, 28400, 19144), respectively. For intuition, the ratios of the above numbers are approximately 3.5 : 1 : 1.2, 6 : 1 : 1, and 3.5 : 1.5 : 1, respectively.

Object Scale. Fig. 3(b) shows the proportions of object scales in our PKU-DAVIS-SOD dataset. The object height is used to reflect its scale since height and scale are closely related in driving scenes [67]. We broadly classify all objects into three scales (i.e., large, medium, and small): small objects are less than 20 pixels in height, medium objects range from 20 to 80 pixels, and large objects are greater than 80 pixels in height.

Velocity Distribution. As illustrated in Fig. 3(c), we divide the motion speed into normal-speed and high-speed levels. The event camera, offering high temporal resolution, can capture high-speed moving objects, such as a rushing car. In contrast, RGB frames may suffer from motion blur due to the limited frame rate. When dividing the dataset, we use the video sequence as the basic unit: if a sequence contains many motion-blur scenarios, it is classified as high-speed; otherwise, it is classified as normal-speed. Fig. 3(c) shows that the proportion of high-speed scenarios is 13%, and the remaining 87% involve normal-speed objects.

Light Intensity. Since the raw visual streams are recorded from daytime to night, they cover diverse illumination conditions. The sequences belonging to the low-light scenario are generally collected in low-light conditions (night, rainy days, tunnels, etc.), and their RGB frames consist mainly of low-light scenes. Thus, we can easily judge the light condition by visually inspecting the RGB frames. Fig. 3(d) illustrates that the proportions of normal-light and low-light scenarios are 92% and 8%, respectively.

For better visualization, we show representative samples (see Fig. 4) including RGB frames, event images, and DVS reconstructed images, which cover the category diversity, object scale, velocity, and light change.

[Figure 3 panels: (a) Category diversity; (b) Object scale; (c) Velocity distribution; (d) Light intensity]

Figure 3: Data statistics of our PKU-DAVIS-SOD dataset. (a) The distributions of three types of objects (i.e., car, pedestrian, and two-wheeler). (b)-(d) The proportions of object scales, moving speeds, and light intensities.

[Figure 4 panels: (a) Category diversity; (b) Object scale; (c) Velocity distribution; (d) Light change]

Figure 4: Representative samples of our PKU-DAVIS-SOD dataset. The three rows from top to bottom refer to RGB frames, event images, and DVS reconstructed images from asynchronous events using E2VID [66].
TABLE II: Comparison with related object detection datasets using event cameras. Note that, our PKU-DAVIS-SOD dataset is the first large-scale multimodal neuromorphic object detection dataset that provides temporally long-term asynchronous events, conventional frames, and manual labels at 25 Hz.
Dataset                     Year   Venue   Resolution   Modality         Classes   Boxes     Label    Frequency   High-speed   Low-light
PKU-DDD17-CAR [14]          2019   ICME    346×260      Events, Frames   1         3155      Manual   1 Hz
TUM-Pedestrian [13]         2019   ICRA    240×180      Events, Frames   1         9203      Manual   1 Hz
Pedestrian Detection [41]   2019   FNR     240×180      Events           1         28109     Manual   1 Hz
Gen1 Detection [33]         2020   arXiv   304×240      Events           2         255k      Pseudo   1, 4 Hz
1 Mpx Detection [34]        2020   NIPS    1280×720     Events           3         25M       Pseudo   60 Hz
DAD [16]                    2021   ICIP    346×260      Events, Frames   1         6427      Manual   1 Hz
PKU-DAVIS-SOD               2022   Ours    346×260      Events, Frames   3         1080.1k   Manual   25 Hz

3.3 Comparison with Other Datasets

To clarify the advances of the newly built dataset, we compare it with related object detection datasets using event cameras in Table II. Our PKU-DAVIS-SOD dataset is the first large-scale and open-source multimodal neuromorphic object detection dataset (https://www.pkuml.org/research/pku-davis-sod-dataset.html). In contrast, the two publicly available large-scale datasets (i.e., Gen1 Detection [33] and 1 Mpx Detection [34]) only provide temporally long-term event streams, so it is difficult for them to support high-precision object detection, especially in static or extremely slow-motion scenarios. Moreover, the TUM-Pedestrian dataset [13] and the Pedestrian Detection dataset [41] provide event streams for pedestrian detection, but they have not yet been made publicly available. The DAD dataset [16], following the PKU-DDD17-CAR dataset [14], offers isolated frames and event temporal bins rather than temporally continuous visual streams.

All in all, such a novel bio-inspired multimodal camera and a professional, labor-intensive annotation effort make our PKU-DAVIS-SOD a competitive dataset with multiple characteristics: (i) high temporal sampling resolution, with up to 12 Meps from the event streams; (ii) high dynamic range (120 dB) from the event streams; (iii) two temporally long-term visual streams with labels at 25 Hz; and (iv) real-world scenarios with abundant diversity in object category, object scale, moving velocity, and light change.

4 Preliminary and Problem Definition

This section first presents the working principle of the DAVIS camera (Section 4.1). Then, we formulate the definition of streaming object detection (Section 4.2).

4.1 DAVIS Sampling Principle

DAVIS is a multimodal vision sensor that simultaneously outputs two complementary modalities, asynchronous events and frames, from the same pixel array. Due to this complementarity, the output of the DAVIS camera contains richer information, allowing it to naturally surpass unimodal sensors such as conventional RGB cameras and event-based cameras. In particular, the core event camera [68], namely the DVS, has independent pixels that react to changes in light intensity $R(\bm{u},t)$ with a stream of events. More specifically, an event $e_n$ is a four-attribute tuple $(x_n, y_n, t_n, p_n)$ in the address-event representation (AER) [69], triggered at the pixel $\bm{u}=(x_n,y_n)$ and the timestamp $t_n$ when the log-intensity change exceeds the pre-defined threshold $\theta_{th}$. This process can be depicted as:

$$\ln R(\bm{u}_n, t_n) - \ln R(\bm{u}_n, t_n - \Delta t_n) = p_n\theta_{th}, \qquad (1)$$

where $\Delta t_n$ is the temporal sampling interval at a pixel, and the polarity $p_n \in \{-1, 1\}$ indicates whether the brightness is decreasing or increasing.

Intuitively, asynchronous events appear as sparse and discrete points [70] in the spatiotemporal domain, which can be described as follows:

$$S(x,y,t) = \sum_{n=1}^{N_e} p_n\,\delta(x - x_n, y - y_n, t - t_n), \qquad (2)$$

where $N_e$ is the number of events within the spatiotemporal window, and $\delta(\cdot)$ denotes the Dirac delta function, with $\int\delta(t)\,dt = 1$ and $\delta(t) = 0$ for all $t \neq 0$.
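To make the event model concrete, the following minimal sketch (our own illustration, not part of the released code) simulates an idealized per-pixel event generator that follows Eq. (1): an event is emitted whenever the log-intensity at a pixel drifts from its last reference value by more than the contrast threshold $\theta_{th}$.

```python
import numpy as np

def generate_events(intensity, timestamps, theta_th=0.2, eps=1e-6):
    """Toy event generator following Eq. (1).

    `intensity` is a (T, H, W) array of linear brightness samples taken at
    `timestamps`. An AER event (x, y, t, p) is emitted whenever the
    log-intensity at a pixel drifts from its reference value by at least
    theta_th. This is an idealized simulation, not the DVS circuit model.
    """
    log_i = np.log(intensity + eps)
    reference = log_i[0].copy()
    events = []
    for t_idx in range(1, len(timestamps)):
        diff = log_i[t_idx] - reference
        fired = np.abs(diff) >= theta_th
        ys, xs = np.nonzero(fired)
        for x, y in zip(xs, ys):
            p = 1 if diff[y, x] > 0 else -1
            events.append((x, y, timestamps[t_idx], p))
            reference[y, x] += p * theta_th  # move the reference by one threshold step
    return events
```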

4.2 Streaming Object Detection

Let $S(x,y,t)$ and $I=\{I_1,\ldots,I_N\}$ be the two complementary modalities of asynchronous events and the frame sequence from a DAVIS camera. In general, to make the asynchronous events compatible with deep learning methods, the continuous event stream needs to be divided into $K$ event temporal bins $S=\{S_1,\ldots,S_K\}$. For the current timestamp $t_i$, the spatiotemporal locations of objects (i.e., bounding boxes) can be estimated from the adjacent frames $\{I_{i-n},\ldots,I_i\}$, $n\in[0,N]$, and multiple temporal bins $\{S_{i-k},\ldots,S_i\}$, $k\in[0,K]$, which can be formulated as:

$$B_i = \mathcal{D}\left(\{I_{i-n},\ldots,I_i\}, \{S_{i-k},\ldots,S_i\}\right), \qquad (3)$$

where $B_i=\{(x_{i,j}, y_{i,j}, w_{i,j}, h_{i,j}, c_{i,j}, p_{i,j}, t_i)\}_{j\in[1,J]}$ is a list of $J$ bounding boxes at the timestamp $t_i$. More specifically, $(w_{i,j}, h_{i,j})$ are the width and height of the $j$-th bounding box, $(x_{i,j}, y_{i,j})$ are its upper-left coordinates, and $c_{i,j}$ and $p_{i,j}$ are its predicted class and confidence score, respectively. The function $\mathcal{D}$ is the proposed streaming object detector, which can leverage rich temporal cues from $n+1$ adjacent frames and $k+1$ temporal bins. The parameters $n$ and $k$ determine how much temporal information is mined and affect the fusion strategy between the two heterogeneous visual streams.

Given the ground truth $\bar{B}=\{\bar{B}_1,\ldots,\bar{B}_N\}$, we aim at making the output $B=\{B_1,\ldots,B_N\}$ of the optimized detector $\hat{\mathcal{D}}$ fit $\bar{B}$ as closely as possible, which can be formulated as the following minimization problem:

$$\hat{\mathcal{D}} = \mathop{\arg\min}_{\mathcal{D}}\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\mathcal{D}}(B_i,\bar{B}_i), \qquad (4)$$

where $\mathcal{L}_{\mathcal{D}}$ is the designed loss function of the proposed streaming object detector.

Note that this novel streaming object detector using events and frames works differently from feed-forward object detection frameworks (e.g., YOLOs [48] and DETR [20]). It has two unique properties, which account for the nomenclature of "streaming": (i) objects can be accurately detected by leveraging rich temporal cues; (ii) event streams offering high temporal resolution make it possible to detect objects at any time within the continuous visual streams and to overcome the limited inference frequency of conventional frames. Consequently, following similar works using point clouds [71, 72] and video streams [73], we refer to our proposed detector as a streaming object detector.
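For intuition, the streaming protocol of Eq. (3) can be sketched as the loop below (our own pseudo-interface with illustrative names, not the released API): at any query timestamp, the detector consumes the most recent $n+1$ frames and $k+1$ event temporal bins, so the query rate is decoupled from the 25 FPS frame rate.

```python
from collections import deque

def stream_detect(detector, frames_with_ts, event_bins_with_ts, query_times, n=3, k=8):
    """Sketch of the streaming protocol in Eq. (3).

    `frames_with_ts` and `event_bins_with_ts` are lists of (timestamp, data)
    pairs sorted by time; `detector` is any callable implementing
    D({I_{i-n},...,I_i}, {S_{i-k},...,S_i}). Buffer sizes n and k are illustrative.
    """
    frame_buf, event_buf = deque(maxlen=n + 1), deque(maxlen=k + 1)
    fi = ei = 0
    results = []
    for t_q in sorted(query_times):
        # Advance each buffer so it holds the most recent samples up to t_q.
        while fi < len(frames_with_ts) and frames_with_ts[fi][0] <= t_q:
            frame_buf.append(frames_with_ts[fi][1]); fi += 1
        while ei < len(event_bins_with_ts) and event_bins_with_ts[ei][0] <= t_q:
            event_buf.append(event_bins_with_ts[ei][1]); ei += 1
        # Event bins arrive far more often than frames, so t_q is not tied to the frame rate.
        results.append((t_q, detector(list(frame_buf), list(event_buf))))
    return results
```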

Figure 5: The pipeline of the proposed streaming object detection with Transformer (SODFormer). The event stream is first split into event temporal bins and then encoded into event embeddings [47]. Then, we use a modal-specific shared backbone (e.g., ResNet50 [74]) to extract features from frames and event embeddings, respectively. Meanwhile, our temporal Transformer encoder links the outputs of the spatial Transformer encoder (STE) for each stream. Our asynchronous attention-based fusion module exploits the attention mechanism to generate a fused representation. Finally, the designed temporal Transformer decoder integrates object queries and the fused flow to predict the final bounding boxes.

5 Methodology

In this section, we first give an overview of our approach (Section 5.1). Besides, we briefly revisit the event representation (Section 5.2) and the spatial Transformer (Section 5.3). Then, we present the details of how to exploit temporal cues from continuous visual streams (Section 5.4). Finally, we design an asynchronous attention-based fusion strategy for events and frames (Section 5.5).

5.1 Network Overview

This work aims at designing a novel streaming object detector with Transformer, termed SODFormer, which continuously detects objects in an asynchronous way by integrating events and frames. As illustrated in Fig. 5, our framework consists of four modules: event representation, spatial Transformer, temporal Transformer, and the asynchronous attention-based fusion module. More precisely, to make asynchronous events compatible with deep learning methods, the continuous event stream is first divided into event temporal bins $S=\{S_1,\ldots,S_K\}$, and each bin $S_i$ is converted into a 2D image-like representation $E_i$ (i.e., event embeddings [47]). Then, the spatial Transformer adopts the main structure of Deformable DETR [21] to extract feature representations from each sensing modality; it involves a feature extraction backbone (e.g., ResNet50 [74]) and the spatial Transformer encoder (STE). Besides, the proposed temporal Transformer contains the temporal deformable Transformer encoder (TDTE) and the temporal deformable Transformer decoder (TDTD). The TDTE aggregates the outputs of the STE along the temporal dimension, which improves object detection accuracy by leveraging rich temporal information. Meanwhile, the proposed asynchronous attention-based fusion module exploits the attention mechanism to generate a fused feature, which eliminates unimodal degradation in the two modalities with a high inference frequency. Finally, the TDTD integrates object queries and the fused feature to predict the bounding boxes $B_i$.

5.2 Event Representation

When using deep learning methods to process asynchronous events, the spatiotemporal point sets are usually converted into dense, image-like measurements. In general, a kernel function [47] is used to map the event temporal bin $S_i$ into an event embedding $E_i$, which should ideally exploit the spatiotemporal information of the asynchronous events. This mapping can be formulated as follows:

$$\bm{E}_i = \sum_{e_n\in\Delta T}\mathcal{K}(x - x_n, y - y_n, t - t_n), \qquad (5)$$

where the kernel function $\mathcal{K}(x,y,t)$ can be a deep neural network or a handcrafted operation.

In this work, we encode the event temporal bin into one of three existing event representations (i.e., event images [17], voxel grids [75], and the sigmoid representation [39]), owing to the accuracy-speed trade-off. Actually, any event representation can serve as an alternative because our SODFormer provides a generic interface accepting various input types for the event-based object detector.
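As a concrete example of Eq. (5) with a handcrafted kernel, the sketch below builds a simple two-channel event-count image for one temporal bin; the exact encodings used in this work follow [17], [75], and [39], so this should be read as an illustrative stand-in rather than the released implementation.

```python
import numpy as np

def event_image(events, height=260, width=346):
    """Two-channel event-count image for one temporal bin S_i.

    `events` is an (N, 4) array of AER tuples (x, y, t, p) with p in {-1, +1}.
    Channel 0 counts positive events and channel 1 counts negative events,
    i.e., one possible instance of the kernel K(x, y, t) in Eq. (5).
    """
    img = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pos = events[:, 3] > 0
    np.add.at(img[0], (y[pos], x[pos]), 1.0)
    np.add.at(img[1], (y[~pos], x[~pos]), 1.0)
    return img
```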

5.3 Revisiting Spatial Transformer

In DETR [20], the multi-head self-attention (MSA) combines multiple attention heads in parallel to increase feature diversity. Concretely, feature maps are first extracted by a CNN-based backbone. Then, $X_q$ and $X_k$ are derived from these feature maps as the query and key-value elements of the attention mechanism, both of size $(HW, d)$. Let $q$ index a query element in the feature map $X_q$, and let $k$ index a key-value element in the feature map $X_k$. The MSA operation integrates the outputs of $M$ attention heads as:

$$\text{MSA}(X_q, X_k) = \sum_{m=1}^{M}\bm{W}_m\left[\sum_{k} A_{mqk}\cdot\bm{W}'_m X_k\right], \qquad (6)$$

where $m$ indexes a single attention head, and $\bm{W}_m$ and $\bm{W}'_m$ are learnable weights. $A_{mqk}=\exp\left(\frac{\bm{X}_q^T\bm{W}_{mq}^T\bm{W}_{mk}\bm{X}_k}{\sqrt{d_k}}\right)$ is the attention weight, in which $\bm{W}_{mq}$ and $\bm{W}_{mk}$ are also learnable weights and $d_k$ is a scaling factor.

To achieve fast convergence and high computational efficiency, Deformable DETR [21] designs a deformable multi-head self-attention (DMSA) operation that attends to $L$ local sampling points instead of all pixels in the feature map $X$. Specifically, DMSA first determines the corresponding location of each query in the feature map, which we refer to as the reference point, and then adds learnable sampling offsets to the reference point to obtain the locations of the sampling points. It can be described as:

$$\text{DMSA}(X_q, X, P_q) = \sum_{m=1}^{M}\bm{W}_m\left[\sum_{l=1}^{L} A_{mlq}\cdot\bm{W}'_m X(P_q + \Delta P_{mlq})\right], \qquad (7)$$

where $P_q$ is a 2D reference point, $\Delta P_{mlq}$ denotes the sampling offset relative to $P_q$, and $A_{mlq}$ is the learnable attention weight of the $l$-th point in the $m$-th self-attention head.

In this study, we adopt the encoder of the Deformable DETR as our spatial Transformer encoder (STE) to extract features from each visual stream. Meanwhile, we further extend the spatial DMSA strategy to the temporal domain (Section 5.4), which leverages rich temporal cues from continuous event stream and adjacent frames.
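For intuition, the sketch below shows the core of Eq. (7) for a single feature level: each query predicts per-head sampling offsets and attention weights, features are bilinearly sampled at the offset reference points with `F.grid_sample`, and the weighted samples are combined. This is a simplified re-implementation for illustration, not the optimized CUDA kernel shipped with Deformable DETR; module and argument names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttnSketch(nn.Module):
    """Single-level deformable attention in the spirit of Eq. (7)."""

    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        self.h, self.l, self.dh = n_heads, n_points, d_model // n_heads
        self.offsets = nn.Linear(d_model, n_heads * n_points * 2)   # Delta P_{mlq}
        self.weights = nn.Linear(d_model, n_heads * n_points)       # A_{mlq}
        self.value_proj = nn.Linear(d_model, d_model)               # W'_m
        self.out_proj = nn.Linear(d_model, d_model)                 # combined W_m

    def forward(self, query, feat, ref_points):
        # query: (B, Nq, C); feat: (B, C, H, W); ref_points: (B, Nq, 2), (x, y) in [0, 1]
        B, Nq, C = query.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat.flatten(2).transpose(1, 2))           # (B, HW, C)
        value = value.view(B, H, W, self.h, self.dh).permute(0, 3, 4, 1, 2)
        offs = self.offsets(query).view(B, Nq, self.h, self.l, 2) / query.new_tensor([W, H])
        attn = self.weights(query).view(B, Nq, self.h, self.l).softmax(-1)
        grid = 2.0 * (ref_points[:, :, None, None, :] + offs) - 1.0        # to [-1, 1]
        out = []
        for m in range(self.h):
            sampled = F.grid_sample(value[:, m], grid[:, :, m],
                                    align_corners=False)                   # (B, dh, Nq, L)
            out.append((sampled * attn[:, None, :, m]).sum(-1))            # weight and sum over L
        return self.out_proj(torch.cat(out, dim=1).transpose(1, 2))        # (B, Nq, C)
```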

5.4 Temporal Transformer

The temporal Transformer aggregates multiple spatial feature maps from the spatial Transformer and generates the predicted bounding boxes. It includes two main modules: temporal deformable Transformer encoder (TDTE) (Section 5.4.1) and temporal deformable Transformer decoder (TDTD) (Section 5.4.2).

5.4.1 Temporal Deformable Transformer Encoder

Our TDTE aims at aggregating multiple spatial feature maps from the spatial Transformer encoder (STE) using the temporal deformable multi-head self-attention (TDMSA) operation and then generating a spatiotemporal representation via the designed multiple stacked blocks.

The core idea of the proposed TDMSA operation is that the temporal deformable attention only attends to local sampling points in the spatial domain while aggregating sampling points from all timestamps in the temporal domain. In other words, TDMSA directly extends DMSA [21] from a single feature map $X_i$ to multiple feature maps $\bm{X}=\{X_{i-k},\ldots,X_i\}$ in the temporal domain (see Fig. 6). Meanwhile, TDMSA employs a total of $M$ attention heads for each temporal deformable attention operation, and it can be expressed as:

$$\text{TDMSA}(X_q, \bm{X}, P_q) = \sum_{m=1}^{M}\bm{W}_m\left[\sum_{l=1}^{L}\sum_{j=i-k}^{i-1} A_{mljq}\cdot\bm{W}'_m X_j(P_{jq} + \Delta P_{mljq})\right], \qquad (8)$$

where $P_{jq}$ is a 2D reference point in the $j$-th feature map $X_j$, and $A_{mljq}$ and $\Delta P_{mljq}$ are the learnable attention weight and the sampling offset of the $l$-th sampling point from the $j$-th feature map in the $m$-th attention head.

TDTE consists of multiple consecutively stacked blocks (see Fig. 5). Each block performs MSA, TDMSA, and feed-forward network (FFN) operations, and each operation is followed by dropout [76] and layer normalization [77]. Specifically, TDTE first takes $X_i$ as both the query and the key-value of the MSA at the timestamp $t_i$. Then, TDMSA integrates the current feature map $X_i$ and the adjacent feature maps $\{X_{i-k},\ldots,X_{i-1}\}$ to leverage rich temporal cues. Finally, the FFN outputs a spatiotemporal representation, shown as $X_I$ or $X_S$ in Fig. 5. Thus, our TDTE can be formulated as:

$$\hat{Z}_i = \text{MSA}(X_i, X_i),\quad \tilde{Z}_i = \text{TDMSA}(\hat{Z}_i, \{X_{i-k},\ldots,X_{i-1}\}, P_q),\quad X_S = \text{FFN}(\tilde{Z}_i), \qquad (9)$$

where $\hat{Z}_i$ and $\tilde{Z}_i$ are the outputs of the MSA and TDMSA operations in our TDTE, respectively. Note that, although the frame flow and the event flow are processed in the same way at this step, two TDTEs with unshared weights process the frame flow and the event flow separately.
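The block structure of Eq. (9) can be sketched as follows, with standard multi-head attention standing in for the deformable variants (the actual encoder uses the deformable operations of Eqs. (7)-(8)); layer names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class TDTEBlockSketch(nn.Module):
    """Structure of one TDTE block per Eq. (9): MSA -> TDMSA -> FFN."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=1024, dropout=0.1):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.tdmsa = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(d_ffn, d_model))
        self.norm = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x_i, x_past):
        # x_i: (B, HW, C) current feature map; x_past: (B, k*HW, C) flattened
        # adjacent feature maps {X_{i-k}, ..., X_{i-1}} of the same modality.
        z_hat = self.norm[0](x_i + self.drop(self.msa(x_i, x_i, x_i)[0]))
        z_tilde = self.norm[1](z_hat + self.drop(self.tdmsa(z_hat, x_past, x_past)[0]))
        return self.norm[2](z_tilde + self.drop(self.ffn(z_tilde)))   # X_S or X_I
```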


Figure 6: The architecture of the proposed temporal deformable multi-head self-attention (TDMSA).

5.4.2 Temporal Deformable Transformer Decoder

The goal of our TDTD is to output the final result via a sequence prediction problem. The inputs of the TDTD are the fused feature maps from our fusion module (Section 5.5) and the object queries. The object queries are a small fixed number (denoted as $N_q$) of learnable positional embeddings, which serve as the "query" part of the input to the TDTD. Each stacked block in the decoder consists of three operations (i.e., MSA, DMSA, and FFN). In the MSA, the $N_q$ object queries interact with each other in parallel via multi-head self-attention at each decoder layer, where the query elements and key-value elements are both the object queries. As illustrated in Fig. 5, the DMSA [21] transforms the query flow and the fused features from the two streams into output embeddings. Ideally, each object query learns to focus on a specific region of the image and extracts region features for the final detection, similar to the anchors in Faster R-CNN [56] but simpler and more flexible. Finally, the FFN independently processes each output embedding to predict $N_q$ bounding boxes $\bm{B}=\{B_1, B_2, \cdots, B_{N_q}\}$. Notably, $N_q$ is typically greater than the actual number of objects in the current scene, so each predicted bounding box can either be a detection or "no object" (labeled as $\phi$), as in DETR [20]. In particular, our TDTD achieves more efficient computation and faster convergence by replacing the MSA of [20] with DMSA.
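A minimal sketch of how object queries drive the prediction step is given below (plain cross-attention stands in for DMSA, and a single decoder layer is shown); the query count, class count, and layer names are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class QueryHeadsSketch(nn.Module):
    """Object queries plus FFN prediction heads, in the spirit of the TDTD."""

    def __init__(self, d_model=256, n_queries=100, n_classes=3, n_heads=8):
        super().__init__()
        self.queries = nn.Embedding(n_queries, d_model)      # learnable positional embeddings
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls_head = nn.Linear(d_model, n_classes + 1)    # +1 for the "no object" class phi
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, fused_feat):
        # fused_feat: (B, HW, C), the fused representation F_f from the fusion module.
        q = self.queries.weight.unsqueeze(0).expand(fused_feat.size(0), -1, -1)
        emb, _ = self.cross_attn(q, fused_feat, fused_feat)  # each query focuses on a region
        return self.cls_head(emb), self.box_head(emb)        # (B, Nq, K+1), (B, Nq, 4)
```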

5.5 Asynchronous Attention-Based Fusion Module

To integrate the two heterogeneous visual streams, most existing fusion strategies [49, 13, 14, 15, 16] first split asynchronous events into discrete temporal bins synchronized with the frames and then fuse the two streams via post-processing or feature aggregation (e.g., averaging and concatenation). However, these synchronized fusion strategies have two limitations: (i) the post-processing or feature aggregation operations cannot distinguish the degradation degree of different modalities and regions, so they can hardly eliminate unimodal degradation thoroughly, which incurs a bottleneck in performance improvement; (ii) the joint output frequency is limited by the sampling rate of conventional frames (e.g., 25 Hz), which fails to meet the needs of fast object detection in real-time high-speed scenarios. Therefore, we design an asynchronous attention-based fusion module to overcome these two limitations.


Figure 7: Overview of the proposed asynchronous attention-based fusion module. It can fuse two heterogeneous data at different sampling timestamps and exploit the high temporal resolution of event streams.

One key question for the two heterogeneous visual streams is how to achieve a high joint output frequency on demand for a given task. In general, the high temporal resolution event stream can be split into discrete event temporal bins, whose frequency can be flexibly adjusted to be much higher than the conventional frame rate in real-time high-speed scenarios. As shown in Fig. 7, the event flow has more temporal sampling timestamps than the frame flow in an asynchronous manner. For the timestamp $t_i$, $X_S$ is the output of the FFN in the TDTE. The proposed fusion strategy first searches for the most recent sampling timestamp $t_j$ in the frame flow, which can be described as:

$$j = \mathop{\arg\min}_{k}\left|t_i - t_k\right|,\quad \Delta t \geq 0, \qquad (10)$$

where $\Delta t = t_i - t_k$ is the temporal difference between the event sampling timestamp and the frame sampling timestamp.
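Since the frame timestamps are sorted, Eq. (10) amounts to a search for the latest frame that does not come after the event bin, which can be sketched in a few lines (function name and return convention are our own):

```python
import bisect

def latest_frame_index(frame_timestamps, t_i):
    """Eq. (10): index of the most recent frame timestamp t_j with t_i - t_j >= 0.

    `frame_timestamps` is a sorted list; returns None if no frame precedes t_i.
    """
    j = bisect.bisect_right(frame_timestamps, t_i) - 1
    return j if j >= 0 else None

# Usage: frames at 25 Hz (40 ms apart), event bin queried at t = 95 ms.
print(latest_frame_index([0.0, 0.04, 0.08, 0.12], 0.095))  # -> 2 (the frame at 80 ms)
```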

Another key question for the two complementary streams is how to tackle the unimodal degradation that feature aggregation operations (e.g., averaging) fail to overcome, and how to implement pixel-wise adaptive fusion for better performance. Unlike previous work that computes a weight mask with a multi-layer perceptron (MLP) [78], we compute the pixel-wise weight mask using the attention mechanism, which simplifies our model and endows it with stronger interpretability. Our fusion strategy starts from two intuitive assumptions: (i) the stronger the association between two pixels, the greater the attention weight between them; and (ii) pixels in degraded regions are supposed to have weaker associations with other pixels. Combining the two points above, we conclude that the sum of the attention weights between a pixel and all other pixels reflects the degradation degree of that pixel to some extent, and thus can serve as the weight of that pixel during fusion. Next, we take the event flow as an example to illustrate how our attention weight mask module (see Fig. 5) obtains the fusion weights. To implement this, we first remove the softmax operation in the standard self-attention to calculate the attention weights $\bm{W}_{XY}^S \in \mathbb{R}^{HW\times HW}$ between every pair of pixels, and then accumulate the attention weights for each pixel $x$ as $\bm{W}_x^S=\sum_y \bm{W}_{xy}^S$, which can be summarized as follows:

$$\bm{W}_X^S = \sum_{y}\frac{(X_S\times\bm{M}^{Q_S})\times(X_S\times\bm{M}^{K_S})^T}{\sqrt{d_k}}, \qquad (11)$$

where $\bm{W}_X^S \in \mathbb{R}^{WH\times 1}$ is the weight mask for the event flow with width $W$ and height $H$, $\bm{M}^{Q_S}$ and $\bm{M}^{K_S}$ are learnable matrix parameters, and $d_k$ is the scaling factor of self-attention [22]. As shown in Fig. 7, the same steps are applied to the frame flow.

Then, we concatenate the two weight masks and apply a softmax at each pixel (i.e., the softmax of $[\bm{W}_j^I, \bm{W}_j^S]$, $j=1,2,\cdots,HW$) to normalize the pixel-wise attention weights across the two modalities. Afterwards, we split the normalized masks to obtain the frame weight mask $\bm{M}_I$ and the event weight mask $\bm{M}_S$ as follows:

$$\bm{M}_I, \bm{M}_S = \text{split}\left(\text{softmax}\left(\begin{bmatrix}\bm{W}_1^I & \bm{W}_1^S\\ \vdots & \vdots\\ \bm{W}_{HW}^I & \bm{W}_{HW}^S\end{bmatrix}\right)\right). \qquad (12)$$

Finally, we obtain the fused feature map $F_f$ by summing the unimodal feature maps weighted by the corresponding pixel-wise weight masks:

$$F_f = (\bm{M}_I\odot X_I)\oplus(\bm{M}_S\odot X_S), \qquad (13)$$

where $\oplus$ denotes the element-wise sum and $\odot$ denotes the element-wise multiplication.
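The core of Eqs. (11)-(13) can be condensed into the sketch below: for each modality, the un-normalized self-attention weights are summed over all key pixels to score each pixel, a per-pixel softmax across the two modalities yields the masks $\bm{M}_I$ and $\bm{M}_S$, and the masked features are summed. This is our simplified reading of the module; projection names and shapes are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class AsyncAttentionFusionSketch(nn.Module):
    """Pixel-wise weight-mask fusion in the spirit of Eqs. (11)-(13)."""

    def __init__(self, d_model=256):
        super().__init__()
        self.q_frame = nn.Linear(d_model, d_model, bias=False)  # M^{Q_I}
        self.k_frame = nn.Linear(d_model, d_model, bias=False)  # M^{K_I}
        self.q_event = nn.Linear(d_model, d_model, bias=False)  # M^{Q_S}
        self.k_event = nn.Linear(d_model, d_model, bias=False)  # M^{K_S}
        self.scale = d_model ** -0.5

    def _mask(self, x, q_proj, k_proj):
        # x: (B, HW, C) -> accumulated attention weight per pixel, Eq. (11).
        attn = q_proj(x) @ k_proj(x).transpose(1, 2) * self.scale   # (B, HW, HW), no softmax
        return attn.sum(dim=2, keepdim=True)                        # (B, HW, 1)

    def forward(self, x_frame, x_event):
        w_i = self._mask(x_frame, self.q_frame, self.k_frame)
        w_s = self._mask(x_event, self.q_event, self.k_event)
        masks = torch.softmax(torch.cat([w_i, w_s], dim=2), dim=2)  # Eq. (12), per pixel
        m_i, m_s = masks[..., :1], masks[..., 1:]
        return m_i * x_frame + m_s * x_event                        # Eq. (13)
```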

6 Experiments

TABLE III: Performance evaluation of our PKU-DAVIS-SOD dataset in various scenarios. Our SODFormer integrates frames and events to detect objects in a continuous manner, and our baseline processes the single-modality (i.e., frames or events) without utilizing the asynchronous attention-based fusion module.
Scenario      Modality          AP50 (Car)   AP50 (Pedestrian)   AP50 (Two-wheeler)   mAP     mAP50   mAP75   mAP_S   mAP_M   mAP_L
Normal        Events            0.440        0.220               0.451                0.147   0.371   0.090   0.072   0.268   0.526
              Frames            0.747        0.365               0.558                0.228   0.557   0.138   0.166   0.336   0.539
              Frames + Events   0.752        0.359               0.596                0.241   0.569   0.163   0.166   0.363   0.609
Motion blur   Events            0.327        0.159               0.380                0.113   0.289   0.064   0.051   0.181   0.255
              Frames            0.561        0.303               0.394                0.163   0.419   0.096   0.100   0.201   0.365
              Frames + Events   0.570        0.285               0.441                0.183   0.432   0.125   0.109   0.230   0.387
Low-light     Events            0.524        0.0002              0.294                0.093   0.273   0.039   0.075   0.183   0.286
              Frames            0.570        0.128               0.357                0.114   0.351   0.048   0.082   0.198   0.344
              Frames + Events   0.595        0.130               0.399                0.122   0.374   0.048   0.091   0.206   0.376
All           Events            0.424        0.188               0.390                0.128   0.334   0.071   0.065   0.210   0.348
              Frames            0.700        0.316               0.452                0.195   0.489   0.116   0.142   0.264   0.417
              Frames + Events   0.705        0.313               0.493                0.207   0.504   0.133   0.144   0.285   0.454

[Figure 8 panels: (a) Normal; (b) Motion blur; (c) Low-light]

Figure 8: Representative visualization results on various scenarios of our PKU-DAVIS-SOD dataset. The four rows from top to bottom refer to our baseline using RGB frames, our baseline using event images, our SODFormer using RGB frames and DVS events, and ground truth labeled in event reconstructed images [66].

This section first describes the experimental setting and implementation details (Section 6.1). Then, the effective test (Section 6.2) is conducted to verify the validity of our SODFormer, which contains a performance evaluation in various scenarios and a comparison to related detectors. Meanwhile, we further implement the ablation study (Section 6.3) to see why and how our approach works, where we investigate the influences of each designed module and parameter settings. Finally, the scalability test (Section 6.4) provides interpretable explorations of our SODFormer.

6.1 Experimental Settings

In this part, we will present the dataset, implementation details, and evaluation metrics as follows.

Dataset and Setup. Our PKU-DAVIS-SOD dataset is designed for streaming object detection in driving scenes, which provides two heterogeneous and temporally long-term visual streams with manual labels at 25 Hz (see Section 3). This newly built dataset contains three subsets including 671.3k labels for training, 194.7k for validation, and 214.1k for testing. Each subset is further split into three typical scenarios (i.e., normal, low-light, and motion blur).

Implementation Details. We select event images [17] as the event representation and ResNet50 [74] as the backbone to achieve a trade-off between accuracy and speed. All parameters of the backbone and the spatial Transformer encoder (STE) are shared among different temporal bins of the same modality. We set the number of attention heads $M$ to 8 and the number of sampling points $L$ to 4 for the deformable multi-head self-attention (DMSA) in Eq. (7) and the temporal deformable multi-head self-attention (TDMSA) in Eq. (8). The temporal aggregation size in TDMSA is set to 9, owing to the balance between accuracy and speed. During training, the height of an image is randomly resized to a value in [256: 576: 32], and the width is scaled accordingly to maintain the aspect ratio. Similarly, for evaluation, all images are resized to 352 in height and 468 in width to maintain the aspect ratio. Other data augmentation methods such as cropping and horizontal flipping are not utilized because our temporal Transformer requires accurate reference points. Following Deformable DETR [21], we adopt the matching cost and the Hungarian loss for training, with loss weights of 2, 5, and 2 for the classification, $L_1$, and GIoU terms, respectively. All networks are trained for 25 epochs using the Adam optimizer [79] with an initial learning rate of $2\times10^{-4}$, decayed by a factor of 0.1 after the 20th epoch, and a weight decay of $10^{-4}$. All experiments are conducted on NVIDIA Tesla V100 GPUs.
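A minimal optimizer and schedule sketch consistent with the settings above is shown below (Adam, initial learning rate 2e-4, decay by 0.1 after the 20th of 25 epochs, weight decay 1e-4, and the stated loss weights); the released training script may organize these pieces differently.

```python
import torch

def build_optimizer(model):
    """Optimizer/schedule sketch matching the stated training settings."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-4)
    # Decay the learning rate by a factor of 0.1 after the 20th epoch (of 25 total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
    return optimizer, scheduler

# Loss weights for the Hungarian loss, as stated: classification 2, L1 5, GIoU 2.
LOSS_WEIGHTS = {"classification": 2.0, "l1": 5.0, "giou": 2.0}
```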

Evaluation Metrics. To compare different approaches, the mean average precision (e.g., COCO mAP [80]) and the running time (ms) are selected as the two evaluation metrics, which are the most widely used in the object detection task. In the effective test, we give a comprehensive evaluation using the average precision at various IoUs (e.g., AP, AP0.5, and AP0.75) and the AP across different scales (i.e., AP_S, AP_M, and AP_L). In the other tests, we report the detection performance using AP0.5. Following video object detection, we compute the AP for each class and the mAP over the three classes at all labeled timestamps. In other words, we evaluate object detection performance at an output frequency of 25 Hz at every labeled timestamp.

TABLE IV: Comparison with state-of-the-art methods and our baselines on our PKU-DAVIS-SOD dataset. Our SODFormer, making complementary use of RGB frames and DVS events, outperforms seven state-of-the-art methods involving event cameras, four using RGB frames, and two joint object detectors. * denotes that a method leverages temporal cues.
Modality          Method                  Input representation        Backbone                          Temporal   mAP50   Runtime (ms)
Events            SSD-events [38]         Event image                 SSD                               No         0.221   7.2
                  NGA-events [6]          Voxel grid                  YOLOv3                            No         0.232   8.0
                  YOLOv3-RGB [48]         Reconstructed image         YOLOv3                            No         0.244   178.51
                  Faster R-CNN [56]       Event image                 R-CNN                             No         0.251   74.5
                  Deformable DETR [21]    Event image                 DETR                              No         0.307   21.6
                  LSTM-SSD* [81]          Event image                 SSD                               Yes        0.273   22.7
                  ASTMNet* [43]           Event embedding             Rec-Conv-SSD                      Yes        0.291   21.3
                  Our baseline*           Event image                 Our spatiotemporal Transformer    Yes        0.334   25.0
Frames            Faster R-CNN [56]       RGB frame                   R-CNN                             No         0.443   75.2
                  YOLOv3-RGB [48]         RGB frame                   YOLOv3                            No         0.426   7.9
                  Deformable DETR [21]    RGB frame                   DETR                              No         0.461   21.5
                  LSTM-SSD* [81]          RGB frame                   SSD                               Yes        0.456   22.4
                  Our baseline*           RGB frame                   Our spatiotemporal Transformer    Yes        0.489   24.9
Events + Frames   MFEPD [13]              Event image + RGB frame     YOLOv3                            No         0.438   8.2
                  JDF [14]                Channel image + RGB frame   YOLOv3                            No         0.442   8.3
                  Our SODFormer           Event image + RGB frame     Deformable DETR                   Yes        0.504   39.7

6.2 Effective Test

The objective of this part is to assess the validity of our SODFormer, so we implement two main experiments including performance evaluation in various scenarios (Section 6.2.1) and comparison with state-of-the-art methods (Section 6.2.2) as follows.

6.2.1 Performance Evaluation in Various Scenarios

To give a comprehensive evaluation of our PKU-DAVIS-SOD dataset, we report the quantitative results (see Table III) and representative visualization results (see Fig. 8) from the three following perspectives.

Normal Scenarios. Performance evaluations of all modalities (i.e., frames, events, and the two visual streams) in normal scenarios can be found in Table III. Our single-modality baseline only processes frames or events without using the asynchronous attention-based fusion module. We can see that our baseline using RGB frames achieves better performance than the one using events in normal scenarios. This is because the event camera provides only weak texture in the spatial domain, so dynamic events (i.e., brightness changes) alone can hardly achieve high-precision recognition, especially in static or extremely slow-motion scenes (see Fig. 8(a)). On the contrary, RGB frames can provide static fine textures (i.e., absolute brightness) under ideal conditions. In particular, our SODFormer obtains a performance enhancement from the single stream to the two visual streams by introducing the asynchronous attention-based fusion module.

Motion Blur Scenarios. By comparing normal scenarios and motion blur scenarios in Table III, we can find that the detection performance using events degrades less than that using frames. This can be explained by the fact that RGB frames suffer from motion blur in high-speed motion scenes (see Fig. 8(b)), resulting in a remarkable decrease in object detection performance. Note that, by introducing the event stream, the performance of our SODFormer appreciably improves over our baseline using RGB frames.

Low-Light Scenarios. As illustrated in Table III, the performance of our baseline using RGB frames drops sharply by 0.206 in low-light scenarios. Meanwhile, although the mAP using DVS events also drops by 0.098 in absolute terms, this is significantly less than the drop for RGB frames (i.e., 0.098 vs. 0.206). On a relative basis, there is also an evident difference, as the mAPs of RGB frames and DVS events drop by 0.370 and 0.264, respectively. This may be caused by the fact that the event camera has the advantage of HDR to sense moving objects in extreme light conditions (see Fig. 8(c)). Notably, only the AP50 of "Car" objects increases while the others decrease. We attribute this to the different image quality of different object classes in the event modality. Intuitively, "Car" objects tend to have more distinct outlines and higher image quality. As a result, "Car" objects are less disturbed by the increase of noise in low-light scenes [82] and thus achieve performance comparable to that in normal scenes. More specifically, our SODFormer outperforms the baseline using RGB frames by a large margin (0.374 versus 0.351) in low-light scenarios.

In summary, our SODFormer consistently obtains better performance than the single-modality methods in all three scenarios while keeping comparable computational speed. By introducing DVS events, the mAP on our PKU-DAVIS-SOD dataset improves by an average of 1.5% over using RGB frames alone. What's more, the representative visualization results in Fig. 8 show that RGB frames fail to support detection in high-speed motion blur and low-light scenarios, whereas the event camera, offering high temporal resolution and HDR, provides new insight into addressing the shortcomings of conventional cameras.


Figure 9: Representative examples of different object detection results on our PKU-DAVIS-SOD dataset. (a) Moving cars in normal scenario. (b) Rushing cars in overexposure scenario. (c) High-speed running two-wheelers with motion blur. (d) Traveling two-wheelers in low-light scenario.

6.2.2 Comparison with State-of-the-Art Methods

We will investigate why and how our SODFormer works from the following three perspectives.

Evaluation on DVS Modality. To evaluate our temporal Transformer for DVS events, we compare our baseline* (i.e., our SODFormer without the asynchronous attention-based fusion module) with five feed-forward event-based object detectors (i.e., event image for SSD [38], voxel grid for YOLOv3 [6], reconstructed image for YOLOv3 [48], event image for Faster R-CNN [56], and event image for Deformable DETR [21]) and two recurrent object detectors that leverage temporal cues (event image for LSTM-SSD [81] and event embedding for ASTMNet [43]). As shown in Table IV, our baseline*, utilizing event images for our temporal Transformer, achieves significant improvement over all these object detectors. Besides, our other baseline, using E2VID [66] to reconstruct frames, performs better than other input representations (e.g., event image and voxel grid), but it spends a long time on the first stage of image reconstruction before the second stage of object detection.

Evaluation on RGB Modality. We compare our baseline*, which utilizes sequential RGB frames for our spatiotemporal Transformer, with four state-of-the-art object detectors for RGB frames: (i) single RGB frame for Faster R-CNN, (ii) single RGB frame for YOLOv3, (iii) single RGB frame for Deformable DETR, and (iv) sequential RGB frames for LSTM-SSD. Compared to the three object detectors utilizing single RGB frames, our baseline* obtains the best performance by introducing the temporal Transformer for RGB frames (see Table IV). Furthermore, it obtains better performance than LSTM-SSD while maintaining comparable computational speed. This is because the designed spatiotemporal Transformer is more effective than the standard convolutional-recurrent operation. In other words, the results agree with our motivation that leveraging rich temporal cues for streaming object detection is beneficial.

Benefit From Multimodal Fusion. In Table IV, we compare our SODFormer with two existing joint object detectors (i.e., MFEPD [13] and JDF [14]). Apparently, our SODFormer outperforms both competitors by a large margin. We can find that our SODFormer, incorporating the temporal Transformer and the asynchronous attention-based fusion module, outperforms four state-of-the-art detectors and our eight baselines. More precisely, by introducing DVS events, our SODFormer achieves a 1.5% improvement over the best baseline using sequential RGB frames. Compared with the best competitor using DVS events, our SODFormer truly shines (0.504 versus 0.334), which indicates that RGB frames with fine textures may be very significant for high-precision object detection.

We further present some visualization results on our PKU-DAVIS-SOD dataset in Fig. 9. Note that, our SODFormer outperforms the single-modality methods and the best multi-modality competitor (i.e., JDF [14]). Unfortunately, RGB frames fail to detect objects in high-speed or low-light scenes. Remarkably, our SODFormer can robustly detect objects in these challenging situations, owing to the high temporal resolution and HDR properties inherited from DVS events, as well as the fine textures that come from RGB frames.

6.3 Ablation Test

In this section, we implement an ablation study to investigate how parameter setting and each design choice influence our SODFormer.


Figure 10: Representative visualization results in continuous sequences of our PKU-DAVIS-SOD dataset. Note that, our SODFormer achieves better object detection performance than the feed-forward baseline without using temporal cues, especially when involving small objects in the distance or partially occluded objects.

6.3.1 Contribution of Each Component

TABLE V: The contribution of each component to our SODFormer on our PKU-DAVIS-SOD dataset. All results are obtained with the baseline using RGB frames for Deformable DETR [21].
Method                  Baseline    (a)      (b)      Ours
Events                     –         –        ✓        ✓
Temporal Transformer       –         ✓        ✓        ✓
Fusion module              –         –        ✓        ✓
mAP50                    0.461     0.489    0.494    0.504
Runtime (ms)              21.5      24.9     39.4     39.7

To explore the impact of each component on the final performance, we choose the feed-forward detector (i.e., using RGB frames for Deformable DETR [21]) as the baseline. As illustrated in Table V, the three methods, namely (a), (b), and our SODFormer, which adopt the temporal Transformer, the averaging fusion module, and the asynchronous attention-based fusion module respectively, consistently achieve higher performance on our PKU-DAVIS-SOD dataset than the baseline. More specifically, method (a), adopting the temporal Transformer to leverage rich temporal cues, obtains a 2.8% mAP improvement over the baseline. Comparing method (b) with (a), the absolute improvement is merely 0.5%, which indicates that combining RGB frames and DVS events with a simple averaging operation is insufficient. Moreover, our SODFormer, adopting the asynchronous attention-based fusion strategy in place of averaging fusion, achieves a 1.0% mAP improvement over method (b) while maintaining comparable computational complexity. Intuitively, our SODFormer employs these effective components to process two heterogeneous visual streams and achieves robust object detection on our PKU-DAVIS-SOD dataset.

6.3.2 Influence of Temporal Aggregation Size

To analyze the temporal aggregation strategy in our SODFormer, we set the temporal Transformer with different temporal aggregation sizes (e.g., 1, 3, 5, 9, and 13). The temporal aggregation size denotes the number of event temporal bins S and adjacent frames I processed at a given timestamp t_i, and corresponds to k+1 in Eq. 8. As depicted in Table VI, our strategy improves mAP by 1.2%, 1.9%, 2.8%, and 2.7% when compared to a temporal aggregation size of 1. Note that, by aggregating visual streams over a larger temporal window, richer temporal cues can be utilized in the temporal Transformer, resulting in better object detection performance. Nevertheless, as the temporal aggregation size becomes larger, the computational time also gradually increases. Additionally, we find that the model's capacity for modeling temporally long-term dependency saturates as the temporal aggregation size increases. In this study, the temporal aggregation size is set to 9 as a trade-off between accuracy and speed.
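To make the aggregation procedure concrete, the following is a minimal PyTorch sketch of buffering the latest k+1 per-timestamp features before feeding them to a temporal encoder; the names TemporalAggregator, backbone, and temporal_encoder are illustrative placeholders under our assumptions, not the authors' implementation.

    import torch
    from collections import deque

    class TemporalAggregator:
        """Keeps the latest k+1 feature maps and feeds them to a temporal encoder."""
        def __init__(self, backbone, temporal_encoder, agg_size=9):
            self.backbone = backbone                  # per-timestamp feature extractor (hypothetical)
            self.temporal_encoder = temporal_encoder  # temporal Transformer encoder (hypothetical)
            self.buffer = deque(maxlen=agg_size)      # sliding buffer of the latest k+1 features

        def step(self, x_t):
            # x_t: one RGB frame or event temporal bin, shape (C, H, W)
            feat = self.backbone(x_t.unsqueeze(0))    # (1, D, h, w)
            self.buffer.append(feat)
            # Stack buffered features along a new temporal axis: (1, T, D, h, w), T <= agg_size
            seq = torch.stack(list(self.buffer), dim=1)
            # The temporal encoder attends across the T buffered timestamps
            return self.temporal_encoder(seq)

With agg_size set to 1 the buffer degenerates to a feed-forward detector, which corresponds to the first column of Table VI.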

TABLE VI: The influence of temporal aggregation size on our PKU-DAVIS-SOD dataset. The single-modality baseline is conducted using RGB frames for Deformable DETR [21].
Temporal aggregation size    1       3       5       9       13
mAP50                      0.461   0.473   0.480   0.489   0.488
Runtime (ms)                21.5    24.1    24.2    24.9    25.5

To improve the interpretability of streaming object detection, we further present some comparative visualization results with and without rich temporal cues (see Fig. 10). The feed-forward baseline using RGB frames suffers from failure cases involving small objects and occluded objects, as shown in the first and third rows of Fig. 10. Fortunately, our SODFormer overcomes these limitations by leveraging rich temporal cues from adjacent frames or event streams. Therefore, trading some speed for a larger temporal aggregation size may be beneficial and even necessary for scenarios requiring a higher level of object detection accuracy.

TABLE VII: Comparison of the proposed asynchronous attention-based fusion module with typical fusion strategies on our PKU-DAVIS-SOD dataset.
     Method      mAP50      Runtime (ms)
     NMS [52]      0.438      8.2
     Concatenation [83]      0.478      39.6
     Averaging [28]      0.494      39.4
     Ours      0.504      39.7


Figure 11: Representative instances of fusion results on our PKU-DAVIS-SOD dataset. The four columns from left to right are RGB frames, event images, fused images using averaging fusion, and fused images using our fusion strategy.

6.3.3 Influence of Fusion Strategy

In order to evaluate the effectiveness of our fusion module, we compare it with some typical fusion methods in Table VII. Specifically, the concatenation and averaging methods are both applied to the feature maps produced by the temporal Transformer, referred to as X_S and X_I in Fig. 5. For concatenation, X_I and X_S are concatenated along the last dimension; for averaging, they are averaged element-wise. In both methods, the fused feature map is fed to the decoder. Note that, our approach obtains the best performance against both the post-processing operation (i.e., NMS [52]) and the end-to-end feature aggregation techniques (i.e., averaging [28] and concatenation [83]). More precisely, our strategy obtains improvements of around 6.6%, 2.6%, and 1.0% on our PKU-DAVIS-SOD dataset over these three methods, respectively. Compared with the averaging operation, our approach improves the mAP from 0.494 to 0.504 while maintaining comparable computation time. Additionally, we present some visualization comparison results in Fig. 11. Three representative instances show that our approach performs better than the averaging operation. This is because our asynchronous attention-based fusion module can adaptively generate pixel-wise weight masks and thus effectively suppress unimodal degradation.
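For reference, the following PyTorch sketch contrasts the three end-to-end fusion variants compared above. It operates on spatial feature maps of shape (B, D, h, w) for simplicity (whereas the paper concatenates along the last feature dimension), and AttentionFusion is only a simplified stand-in for the asynchronous attention-based fusion module under our assumptions, not its exact design.

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        """Pixel-wise weighted fusion: a simplified stand-in for the
        asynchronous attention-based fusion module (layer names are hypothetical)."""
        def __init__(self, dim):
            super().__init__()
            # predicts a per-pixel weight mask from the concatenated features
            self.mask_net = nn.Sequential(
                nn.Conv2d(2 * dim, dim, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(dim, 1, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x_i, x_s):
            # x_i: frame features (B, D, h, w); x_s: event features (B, D, h, w)
            w = self.mask_net(torch.cat([x_i, x_s], dim=1))   # (B, 1, h, w)
            return w * x_i + (1.0 - w) * x_s                  # pixel-wise weighted sum

    def average_fusion(x_i, x_s):
        return 0.5 * (x_i + x_s)                              # element-wise averaging

    def concat_fusion(x_i, x_s, proj):
        # proj maps 2*D channels back to D, e.g. nn.Conv2d(2 * dim, dim, kernel_size=1)
        return proj(torch.cat([x_i, x_s], dim=1))             # concatenation followed by projection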


Figure 12: Visualization of all predicted bounding boxes from the test subset of our PKU-DAVIS-SOD dataset. We present 10 representative examples from 300 predicted slots. Boxes are regarded as points with centers obtained from their coordinates. Blue points refer to large vertical boxes, red points to horizontal boxes, and green points to small boxes. Notably, each slot can be trained to focus on certain areas and object sizes with a variety of operating modes.
TABLE VIII: Comparison of our SODFormer with various event representations on our PKU-DAVIS-SOD dataset.
Method                          mAP50    Runtime (ms)
Sigmoid representation [39]     0.309    39.4
Voxel grid [75]                 0.491    41.5
Event images [17]               0.504    39.7

6.3.4 Influence of Event Representation

To verify the generality of our SODFormer for various event representations, we compare three typical event representations (i.e., event images [17], voxel grids [75], and the sigmoid representation [39]) in terms of the accuracy-speed trade-off. As illustrated in Table VIII, the performance of our SODFormer varies with different event representations, but the running speed is almost identical. This indicates that our SODFormer provides a generic interface for various input representations, such as image-like representations (e.g., event images [17] and the sigmoid representation [39]) and spatiotemporal representations (i.e., voxel grids [75]). Meanwhile, it is obvious that the performance is highly dependent on the quality of the representation and drops heavily for weaker representations (e.g., the sigmoid representation). Indeed, we believe that a good event representation makes asynchronous events directly compatible with our SODFormer while maximizing the detection performance.
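As a concrete illustration of two of these representations, the following NumPy sketch builds a two-channel event image and a voxel grid from raw events given as (x, y, t, p) arrays with integer pixel coordinates and polarity p in {-1, +1}; the exact channel layouts and bin counts used in [17] and [75] may differ, so the parameters here are assumptions.

    import numpy as np

    def event_image(x, y, p, height, width):
        """Two-channel per-pixel histogram of positive and negative events."""
        img = np.zeros((2, height, width), dtype=np.float32)
        np.add.at(img[0], (y[p > 0], x[p > 0]), 1.0)   # positive-polarity counts
        np.add.at(img[1], (y[p < 0], x[p < 0]), 1.0)   # negative-polarity counts
        return img

    def voxel_grid(x, y, t, p, height, width, num_bins=5):
        """Spatiotemporal voxel grid with linear interpolation along time."""
        grid = np.zeros((num_bins, height, width), dtype=np.float32)
        t_norm = (t - t[0]) / max(t[-1] - t[0], 1e-9) * (num_bins - 1)
        lo = np.floor(t_norm).astype(int)
        frac = t_norm - lo
        hi = np.clip(lo + 1, 0, num_bins - 1)
        np.add.at(grid, (lo, y, x), (1.0 - frac) * p)  # weight assigned to the lower temporal bin
        np.add.at(grid, (hi, y, x), frac * p)          # weight assigned to the upper temporal bin
        return grid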

6.3.5 Influence of the Number of Transformer Layers

We will analyze the effect of the number of layers in the designed temporal Transformer on the final performance from the following two perspectives.

The Number of TDTE’s Layers. As shown in Table IX, we first explore the influence of the number of encoder layers in our TDTE on our PKU-DAVIS-SOD dataset. To our surprise, as the number of encoder layers increases, our SODFormer shows only slight improvements up to 6 layers and then a rapid decline due to overfitting, while the inference time gradually increases. Considering that adding encoder layers beyond 6 does not improve the detection performance but increases the computational time, the number of encoder layers in our TDTE is set to 6.

TABLE IX: The influence of the number of TDTE’s layers on our PKU-DAVIS-SOD dataset.
The number of encoder layers 2 4 6 8 12
mAP50 0.481 0.480 0.489 0.471 0.450
Runtime (ms) 23.5 24.2 24.9 25.2 26.1

The Number of TDTD’s Layers. Table X further illustrates the influence of the number of decoder layers in our TDTD on our PKU-DAVIS-SOD dataset. We find that the best performance occurs when the number of decoder layers is set to 8. However, the mAP is almost unchanged while the computational time increases rapidly once the number of decoder layers exceeds 6. To balance detection performance and computational complexity, it is reasonable to set the number of TDTD’s layers to 6.

TABLE X: The influence of the number of TDTD’s layers on our PKU-DAVIS-SOD dataset.
The number of decoder layers 2 4 6 8 12
mAP50 0.476 0.480 0.489 0.491 0.489
Runtime (ms) 21.1 22.3 24.9 29.5 34.8

6.4 Scalability Test

This subsection will present the predicted output slot analysis (Section 6.4.1) and the visualization of the designed temporal deformable attention (Section 6.4.2). Then, we further present how to process two heterogeneous visual streams in an asynchronous manner (Section 6.4.3). Finally, we analyze some failure cases of our SODFormer (Section 6.4.4).



Figure 13: Visualization of TDMSA in the last layer of our TDTE. (a) Comparison of object detection results between the feed-forward baseline (top) and our unimodal SODFormer without using events (bottom). (b) Distribution of sampling points in reference frames. The reference point is denoted by a green cross marker, and each sampling point is marked as a color-filled circle whose color represents the size of its weight.


Figure 14: Visualization of our asynchronous fusion strategy. The top figure depicts two adjacent frames and an event stream. The bottom figures present the detection performance of the two frames and the four sampling timestamps from the continuous event stream.

6.4.1 The Predicted Output Slot Analysis

To increase the interpretability of our SODFormer, we visualize the predicted bounding boxes of different slots for all labeled timestamps in the test subset of our PKU-DAVIS-SOD dataset. Inspired by DETR [20], each predicted box is represented as a point whose coordinates are its center normalized by the image size. Fig. 12 shows 10 representative examples out of the 300 predicted slots in our temporal deformable Transformer decoder (TDTD). We find that each slot learns to focus on certain areas and bounding box sizes. In fact, our SODFormer learns a different specialization for each object query slot, and together the slots cover the distribution of all objects in our PKU-DAVIS-SOD dataset.
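A minimal sketch of this visualization is given below, assuming the predictions of one query slot are available as an (N, 4) array in normalized (cx, cy, w, h) format; the color-coding thresholds for "small", "vertical", and "horizontal" boxes are illustrative choices, not necessarily those used for Fig. 12.

    import matplotlib.pyplot as plt

    def plot_slot(boxes, ax):
        """boxes: (N, 4) ndarray of normalized (cx, cy, w, h) predictions of one query slot."""
        cx, cy, w, h = boxes.T
        small = (w * h) < 0.05                   # small boxes -> green (threshold is illustrative)
        vertical = ~small & (h >= w)             # large vertical boxes -> blue
        horizontal = ~small & (h < w)            # large horizontal boxes -> red
        ax.scatter(cx[small], cy[small], s=4, c="green")
        ax.scatter(cx[vertical], cy[vertical], s=4, c="blue")
        ax.scatter(cx[horizontal], cy[horizontal], s=4, c="red")
        ax.set_xlim(0, 1)
        ax.set_ylim(1, 0)                        # flip y to match image coordinates (origin top-left)

Calling plot_slot once per selected slot on a grid of subplots reproduces the layout of Fig. 12.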

6.4.2 Visualization of Temporal Deformable Attention

To better understand the proposed temporal deformable attention, we visualize the sampling points and attention weights of the last layer in our temporal deformable Transformer encoder (TDTE). As shown in Fig. 13(a), our feed-forward baseline fails to detect the middle car owing to the occlusion, while our unimodal SODFormer succeeds by utilizing rich temporal cues in adjacent frames. In addition, we present the distribution of sampling points in the last layer of our TDTE (see Fig. 13(b)). All sampling points are mapped to the corresponding locations in the RGB frames, and each sampling point is marked as a color-filled circle whose color represents the size of its weight. The reference point is labeled with a green cross marker. We can find that the distribution of sampling points is focused on the foreground area radially around the reference point rather than spread over the whole image. As for the attention weights, it is obvious that the closer a sampling point is to the reference point, the greater its attention weight. Specifically, the sampling points in adjacent frames are uniformly clustered in the foreground area, providing rich temporal cues about the objects in the temporal domain. This indicates that the proposed temporal deformable attention can adapt its sampling points to concentrate on the foreground object in adjacent frames.
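To make the sampling mechanism explicit, the following is a highly simplified PyTorch sketch of deformable-attention aggregation across adjacent frames for a single query, a single head, and a single feature level; in the actual module the offsets and weights are predicted from the query, so this is only an illustration under those assumptions rather than the authors' implementation.

    import torch
    import torch.nn.functional as F

    def temporal_deformable_sample(values, ref_point, offsets, weights):
        """
        values:    (T, D, h, w) feature maps of T buffered timestamps
        ref_point: (2,) normalized (x, y) reference point in [0, 1]
        offsets:   (T, K, 2) normalized sampling offsets around the reference point
        weights:   (T, K) attention weights, softmax-normalized over all T*K points
        returns:   (D,) aggregated feature for this query
        """
        # map sampling locations to [-1, 1] in (x, y) order, as expected by grid_sample
        loc = (ref_point + offsets) * 2.0 - 1.0                 # (T, K, 2)
        sampled = F.grid_sample(values, loc.unsqueeze(1),       # grid shape: (T, 1, K, 2)
                                mode="bilinear", align_corners=False)
        sampled = sampled.squeeze(2)                            # (T, D, K)
        # weighted sum over sampling points and timestamps
        return (sampled * weights.unsqueeze(1)).sum(dim=(0, 2))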

6.4.3 Asynchronous Inference Analysis

In general, the global shutter of conventional frame-based cameras limits their sampling rate, resulting in a relatively long time interval between two adjacent RGB frames. As a consequence, it is difficult to accurately locate moving objects at a high output frequency in real-time high-speed scenarios. To this end, the proposed asynchronous attention-based fusion module aims at breaking through the limited output frequency of synchronized frame-based fusion methods. As illustrated in Table XI and Fig. 14, we design two specific experiments to verify the effectiveness of our asynchronous fusion strategy. First, we compare our single-modality baseline*, which does not utilize the asynchronous attention-based fusion module, with our SODFormer at different frame rates. Note that we only change the frame rate, while the event temporal bins are fixed to 25 Hz, as is the output frequency; this ensures that every prediction has a corresponding annotation. We can see from Table XI that both methods suffer a reduction in performance as the frame rate decreases, but the performance of our SODFormer degrades significantly less than that of baseline*. Meanwhile, the last row shows that the performance of our SODFormer is still acceptable when the output frequency is four times the frame rate. This indicates that our asynchronous attention-based fusion module has a strong ability to detect objects from two asynchronous modalities. To further illustrate how our SODFormer performs asynchronous detection when the input frequency is high, we conduct another experiment in which we slide a 0.04 s window over the continuous event stream and shift it forward by 0.01 s at a time, so that the frequency of the resulting event temporal bins can be as high as 100 Hz. We then input the 25 Hz RGB frames and the 100 Hz event temporal bins into our SODFormer, as shown in Fig. 14. The four middle figures show the detection results at event-stream timestamps between the two RGB frames. Therefore, we can confirm that our asynchronous fusion strategy is able to fill the gap between two adjacent RGB frames and helps our SODFormer detect objects in an asynchronous manner.

TABLE XI: Comparison of the performance of SODFormer and Baseline* at different frame rates. Our Baseline* denotes SODFormer that processes only the frame modality without utilizing the asynchronous attention-based fusion module.
Method        Frame rate (FPS)
              25      12.5    8.33    6.25
Baseline*     0.489   0.458   0.412   0.372
SODFormer     0.504   0.495   0.472   0.448
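A minimal sketch of the sliding-window event binning described above is given below, assuming event timestamps t (in seconds) are sorted; apart from the stated 0.04 s window length and 0.01 s stride, the function name and details are illustrative assumptions.

    import numpy as np

    def sliding_event_bins(t, window=0.04, stride=0.01):
        """Yield (start, end) index pairs into the sorted event arrays, one pair per query."""
        query_times = np.arange(t[0] + window, t[-1], stride)
        for t_end in query_times:
            start = np.searchsorted(t, t_end - window)   # first event inside the window
            end = np.searchsorted(t, t_end)              # first event after the window
            yield start, end                             # events[start:end] form one temporal bin

Each such bin can then be fused with the features of the most recent RGB frame (our reading of the asynchronous setup), which is how detections are produced at timestamps between two adjacent frames.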


Figure 15: Representative failure cases of our SODFormer on the PKU-DAVIS-SOD dataset. The four columns from left to right refer to RGB frames, event images, our SODFormer using RGB frames and DVS events, and ground truth labeled in reconstructed images.

6.4.4 Failure Case Analysis

Although our SODFormer achieves satisfactory results even in challenging scenarios, some failure cases remain far from being solved. As depicted in Fig. 15, the first row shows that static or slow-moving cars in low-light conditions are difficult to detect robustly. This is because the relatively low dynamic range of conventional cameras results in poor image quality in low-light scenes, while the event camera readily senses dynamic changes but static or slow-moving objects generate almost no events. The second row illustrates that small objects moving at high speed fail to be detected. This may be because such objects, which have high relative speed in RGB frames, are almost invisible owing to severe motion blur, while event cameras capture high-speed moving objects but provide only weak textures. These observations indicate that our PKU-DAVIS-SOD dataset is a challenging and competitive dataset. Moreover, these failure cases, which are beyond the capability of our current SODFormer, need to be addressed in future work.

7 Discussion

An effective and robust streaming object detector will further highlight the potential of the unifying framework using events and frames. Here, we discuss the generality and the limitations of our method as follows.

Generality. One might think that this work does not explore the core issue of how to design a novel event representation. Actually, any event representation can be used as the input of our SODFormer. The ablation study verifies that our SODFormer can improve detection performance by introducing DVS events with different event representations (see Section 6.3.4). Highly realistic synthetic datasets are also worth exploring in the future, as they can provide large-scale data for model training and testing in simulation scenarios. However, synthetic datasets from existing simulators are unrealistic and are not suited for verifying the effectiveness of our SODFormer in extreme scenarios (e.g., low-light or high-speed motion blur). Some may argue that the backbone of our SODFormer is a CNN-based feature extractor. Indeed, this study does not design a pre-trained vision Transformer backbone (e.g., ViT [84] and Swin Transformer [85]) for event streams. Investigating an event-based vision Transformer to learn an effective event representation is interesting and of wide applicability in computer vision tasks (e.g., video reconstruction, object detection, and depth estimation).

Limitation. Currently, the distribution of the sampling points in TDMSA is almost identical across different reference frames, clustered radially around the reference point (see Section 6.4.2). However, this sampling strategy does not take the motion trajectory of the object into account and limits the model’s capacity for modeling temporally long-term dependency (see Section 6.3.2). We considered following [86] in using optical flow to track motion trajectories when selecting sampling points, but this is not included in our SODFormer due to the additional complexity. In fact, how to track objects efficiently with low computational complexity remains a topic worth exploring.

8 Conclusion

This paper presents a novel streaming object detector with Transformer (i.e., SODFormer) using events and frames, which highlights how events and frames can be utilized to deal with major object detection challenges (e.g., fast motion blur and low-light). To the best of our knowledge, this is the first attempt to explore a Transformer-based architecture that continuously detects objects from two heterogeneous visual streams. To achieve this, we first build a large-scale multimodal object detection dataset (i.e., the PKU-DAVIS-SOD dataset) including two visual streams and manual labels. Then, a spatiotemporal Transformer is designed to detect objects via an end-to-end sequence prediction problem, whose two core innovative modules are the temporal Transformer and the asynchronous attention-based fusion module. The results demonstrate that our SODFormer outperforms four state-of-the-art methods and our eight baselines, inheriting high temporal resolution and HDR properties from DVS events (i.e., brightness changes) and fine textures from RGB frames (i.e., absolute brightness). We believe that this study makes a major step towards solving the problem of fusing events and frames for streaming object detection.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant 62027804, Grant 61825101 and Grant 62088102.

References

  • [1] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, “Deep learning for generic object detection: A survey,” Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, 2020.
  • [2] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” Int. J. Comput. Vis., pp. 1–24, 2020.
  • [3] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, “Imbalance problems in object detection: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3388–3415, 2021.
  • [4] K. Sun, W. Wu, T. Liu, S. Yang, Q. Wang, Q. Zhou, Z. Ye, and C. Qian, “Fab: A robust facial landmark detection framework for motion-blurred videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5462–5471.
  • [5] M. Sayed and G. Brostow, “Improved handling of motion blur in online object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1706–1716.
  • [6] Y. Hu, T. Delbruck, and S.-C. Liu, “Learning to exploit multiple vision modalities by using grafted networks,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 85–101.
  • [7] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240×180 130 dB 3 μs latency global shutter spatiotemporal vision sensor,” IEEE J. Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, 2014.
  • [8] D. P. Moeys, F. Corradi, C. Li, S. A. Bamford, L. Longinotti, F. F. Voigt, S. Berry, G. Taverni, F. Helmchen, and T. Delbruck, “A sensitive dynamic and active pixel vision sensor for color or neural imaging applications,” IEEE Trans. Biomed. Circuits Syst., vol. 12, no. 1, pp. 123–136, 2017.
  • [9] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
  • [10] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Eklt: Asynchronous photometric feature tracking using events and frames,” Int. J. Comput. Vis., vol. 128, no. 3, pp. 601–618, 2020.
  • [11] D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, “Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction,” IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 2822–2829, 2021.
  • [12] S. Tulyakov, D. Gehrig, S. Georgoulis, J. Erbach, M. Gehrig, Y. Li, and D. Scaramuzza, “Time lens: Event-based video frame interpolation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16 155–16 164.
  • [13] Z. Jiang, P. Xia, K. Huang, W. Stechele, G. Chen, Z. Bing, and A. Knoll, “Mixed frame-/event-driven fast pedestrian detection,” in Proc. IEEE Conf. Robot. Autom., 2019, pp. 8332–8338.
  • [14] J. Li, S. Dong, Z. Yu, Y. Tian, and T. Huang, “Event-based vision enhanced: A joint detection framework in autonomous driving,” in Proc. IEEE Int. Conf. Multimedia Expo., 2019, pp. 1396–1401.
  • [15] H. Cao, G. Chen, J. Xia, G. Zhuang, and A. Knoll, “Fusion-based feature attention gate component for vehicle detection based on event camera,” IEEE Sensors J., pp. 1–9, 2021.
  • [16] M. Liu, N. Qi, Y. Shi, and B. Yin, “An attention fusion network for event-based vehicle object detection,” in Proc. IEEE Int. Conf. Image Process., 2021, pp. 3363–3367.
  • [17] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5419–5427.
  • [18] R. J. Dolan, G. Fink, E. Rolls, M. Booth, A. Holmes, R. Frackowiak, and K. Friston, “How the brain learns to see objects and faces in an impoverished context,” Nature, vol. 389, no. 6651, pp. 596–599, 1997.
  • [19] A. G. Huth, S. Nishimoto, A. T. Vu, and J. L. Gallant, “A continuous semantic space describes the representation of thousands of object and action categories across the human brain,” Neuron, vol. 76, no. 6, pp. 1210–1224, 2012.
  • [20] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 213–229.
  • [21] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in Proc. Int. Conf. Learn. Represent., 2020, pp. 1–16.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
  • [23] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” arXiv, pp. 1–28, 2021.
  • [24] T. H. Kim, M. S. Sajjadi, M. Hirsch, and B. Scholkopf, “Spatio-temporal transformer network for video restoration,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 106–122.
  • [25] L. He, Q. Zhou, X. Li, L. Niu, G. Cheng, X. Li, W. Liu, Y. Tong, L. Ma, and L. Zhang, “End-to-end video object detection with spatial-temporal transformers,” in Proc. ACM Int. Conf. Multimedia., 2021.
  • [26] Y. Zhang, X. Li, C. Liu, B. Shuai, Y. Zhu, B. Brattoli, H. Chen, I. Marsic, and J. Tighe, “Vidtr: Video transformer without convolutions,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 13 577–13 587.
  • [27] A. Tomy, A. Paigwar, K. S. Mann, A. Renzaglia, and C. Laugier, “Fusing event-based and RGB camera for robust object detection in adverse conditions,” in Proc. IEEE Conf. Robot. Autom., 2022, pp. 933–939.
  • [28] H. Li, X.-J. Wu, and J. Kittler, “Mdlatlrr: A novel decomposition method for infrared and visible image fusion,” IEEE Trans. Image Process., vol. 29, no. 1, pp. 4733–4746, 2020.
  • [29] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis et al., “Event-based vision: A survey,” in IEEE Trans. Pattern Anal. Mach. Intell., 2020, pp. 1–25.
  • [30] S.-C. Liu, B. Rueckauer, E. Ceolini, A. Huber, and T. Delbruck, “Event-driven sensing for efficient perception: Vision and audition algorithms,” IEEE Signal Process. Mag., vol. 36, no. 6, pp. 29–37, 2019.
  • [31] K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machine intelligence with neuromorphic computing,” Nature, vol. 575, no. 7784, pp. 607–617, 2019.
  • [32] G. Chen, H. Cao, J. Conradt, H. Tang, F. Rohrbein, and A. Knoll, “Event-based neuromorphic vision for autonomous driving: A paradigm shift for bio-inspired visual sensing and perception,” IEEE Signal Process. Mag., vol. 37, no. 4, pp. 34–49, 2020.
  • [33] P. de Tournemire, D. Nitti, E. Perot, D. Migliore, and A. Sironi, “A large scale event-based detection dataset for automotive,” in arXiv, 2020, pp. 1–8.
  • [34] E. Perot, P. de Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1–14.
  • [35] H. Rebecq, D. Gehrig, and D. Scaramuzza, “Esim: An open event camera simulator,” in Proc. Conf. Robot Learn., 2018, pp. 969–982.
  • [36] Y. Hu, S.-C. Liu, and T. Delbruck, “V2e: From video frames to realistic dvs events,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2021, pp. 1312–1321.
  • [37] Z. Kang, J. Li, L. Zhu, and Y. Tian, “Retinomorphic sensing: A novel paradigm for future multimedia computing,” in Proc. ACM Int. Conf. Multimedia., 2021, pp. 144–152.
  • [38] M. Iacono, S. Weber, A. Glover, and C. Bartolozzi, “Towards event-driven object detection with off-the-shelf deep learning,” in Proc. IEEE Conf. Intell. Robot. Syst., 2018, pp. 1–9.
  • [39] N. F. Chen, “Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2018, pp. 644–653.
  • [40] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci, “Event-based convolutional networks for object detection in neuromorphic cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2019, pp. 1656–1665.
  • [41] G. Chen, H. Cao, C. Ye, Z. Zhang, X. Liu, X. Mo, Z. Qu, J. Conradt, F. Röhrbein, and A. Knoll, “Multi-cue event information fusion for pedestrian detection with neuromorphic vision sensors,” Frontiers in Neurorobotics, vol. 13, p. 10, 2019.
  • [42] C. Ryan, B. O’Sullivan, A. Elrasad, A. Cahill, J. Lemley, P. Kielty, C. Posch, and E. Perot, “Real-time face & eye tracking and blink detection using event cameras,” Neural Netw., vol. 141, pp. 87–97, 2021.
  • [43] J. Li, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, “Asynchronous spatio-temporal memory network for continuous event-based object detection,” IEEE Trans. Image Process., vol. 31, pp. 2975–2987, 2022.
  • [44] X. Xiang, L. Zhu, J. Li, Y. Tian, and T. Huang, “Temporal up-sampling for asynchronous events,” in Proc. IEEE Int. Conf. Multimedia Expo., 2022, pp. 1–6.
  • [45] D. Wang, X. Jia, Y. Zhang, X. Zhang, Y. Wang, Z. Zhang, D. Wang, and H. Lu, “Dual memory aggregation network for event-based object detection with learnable representation,” in Proc. AAAI Conf. on Artificial Intell., 2023, pp. 2492–2500.
  • [46] M. Gehrig and D. Scaramuzza, “Recurrent vision transformers for object detection with event cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023.
  • [47] D. Gehrig, A. Loquercio, K. G. Derpanis, and D. Scaramuzza, “End-to-end learning of representations for asynchronous event-based data,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5633–5643.
  • [48] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” in arXiv, 2018, pp. 1–6.
  • [49] H. Liu, D. P. Moeys, G. Das, D. Neil, S.-C. Liu, and T. Delbrück, “Combined frame-and event-based detection and tracking,” in Proc. IEEE Int. Symposium Circuits Syst., 2016, pp. 2511–2514.
  • [50] Z. El Shair and S. A. Rawashdeh, “High-temporal-resolution object detection and tracking using images and events,” J. Imag., vol. 8, no. 8, pp. 1–21, 2022.
  • [51] N. Messikommer, D. Gehrig, M. Gehrig, and D. Scaramuzza, “Bridging the gap between events and frames through unsupervised domain adaptation,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 3515–3522, 2022.
  • [52] J. Hosang, R. Benenson, and B. Schiele, “Learning non-maximum suppression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4507–4515.
  • [53] D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
  • [54] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
  • [55] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
  • [56] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
  • [57] T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan, “Pnp-detr: Towards efficient visual analysis with transformers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 4661–4670.
  • [58] W. Shang, D. Ren, D. Zou, J. S. Ren, P. Luo, and W. Zuo, “Bringing events into video deblurring with non-consecutively blurry frames,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 4531–4540.
  • [59] L. Zhu, J. Li, X. Wang, T. Huang, and Y. Tian, “Neuspike-net: High speed video reconstruction via bio-inspired neuromorphic cameras,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2400–2409.
  • [60] P. Duan, Z. Wang, B. Shi, O. Cossairt, T. Huang, and A. Katsaggelos, “Guided event filtering: Synergy between intensity images and neuromorphic events for high performance imaging,” IEEE Trans. Pattern Anal. Mach. Intell., 2021.
  • [61] J. Zhang, X. Yang, Y. Fu, X. Wei, B. Yin, and B. Dong, “Object tracking by jointly exploiting frame and event domain,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 13 043–13 052.
  • [62] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” in arXiv, 2021.
  • [63] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Ultimate slam? combining events, images, and imu for robust visual slam in hdr and high-speed scenarios,” IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 994–1001, 2018.
  • [64] Y.-F. Zuo, J. Yang, J. Chen, X. Wang, Y. Wang, and L. Kneip, “Devo: Depth-event camera visual odometry in challenging conditions,” in Proc. IEEE Conf. Robot. Autom., 2022, pp. 2179–2185.
  • [65] L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip, “Vector: A versatile event-centric benchmark for multi-sensor slam,” IEEE Robot. Autom. Lett., 2022.
  • [66] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “Events-to-video: Bringing modern computer vision to event cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3857–3866.
  • [67] Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong, “Tju-dhd: A diverse high-resolution dataset for object detection,” IEEE Trans. Image Process., vol. 30, pp. 207–219, 2020.
  • [68] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck, “Retinomorphic event-based vision sensors: bioinspired cameras with spiking output,” Proc. IEEE., vol. 102, no. 10, pp. 1470–1484, 2014.
  • [69] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Sivilotti, and D. Gillespie, “Silicon auditory processors as computer peripherals,” IEEE Trans. Neural Netw., vol. 4, no. 3, pp. 523–528, 1993.
  • [70] J. Li, Y. Fu, S. Dong, Z. Yu, T. Huang, and Y. Tian, “Asynchronous spatiotemporal spike metric for event cameras,” IEEE Trans. Neural Netw. Learn. Syst., 2021.
  • [71] W. Han, Z. Zhang, B. Caine, B. Yang, C. Sprunk, O. Alsharif, J. Ngiam, V. Vasudevan, J. Shlens, and Z. Chen, “Streaming object detection for 3-d point clouds,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 423–441.
  • [72] Q. Chen, S. Vora, and O. Beijbom, “Polarstream: Streaming object detection and segmentation with polar pillars,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 26 871–26 883.
  • [73] J. Yang, S. Liu, Z. Li, X. Li, and J. Sun, “Streamyolo: Real-time object detection for streaming perception,” arXiv, 2022.
  • [74] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
  • [75] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “Unsupervised event-based learning of optical flow, depth, and egomotion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 989–997.
  • [76] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [77] J. Xu, X. Sun, Z. Zhang, G. Zhao, and J. Lin, “Understanding and improving layer normalization,” Proc. Adv. Neural Inf. Process. Syst., vol. 32, pp. 4381–4391, 2019.
  • [78] O. Mazhar, R. Babuška, and J. Kober, “Gem: Glare or gloom, i can still see you–end-to-end multi-modal object detection,” IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 6321–6328, 2021.
  • [79] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–15.
  • [80] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 740–755.
  • [81] M. Zhu and M. Liu, “Mobile video object detection with temporally-aware feature maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5686–5695.
  • [82] S. Guo and T. Delbruck, “Low cost and latency event camera background activity denoising,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 785–795, 2023.
  • [83] J. Xu, W. Wang, H. Wang, and J. Guo, “Multi-model ensemble with rich spatial information for object detection,” Pattern Recognit., vol. 99, p. 107098, 2020.
  • [84] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent., 2020.
  • [85] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE Int. Conf. Comput. Vis., 2021.
  • [86] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, “Flow-guided feature aggregation for video object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 408–417.
Dianze Li received the B.S. degree from the School of Electronics Engineering and Computer Science, Peking University, Beijing, China, in 2022. He is currently pursuing the Ph.D. degree with the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China. His current research interests include event-based vision, spiking neural networks, and neuromorphic engineering.
Jianing Li (Member, IEEE) received the B.S. degree from the College of Computer and Information Technology, China Three Gorges University, China, in 2014, and the M.S. degree from the School of Microelectronics and Communication Engineering, Chongqing University, China, in 2017, and the Ph.D. degree from the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China, in 2022. He is the author or coauthor of over 20 technical papers in refereed journals and conferences, such as the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, CVPR, ICCV, AAAI, and ACM MM. He received the Lixin Tang Scholarship from Chongqing University, Chongqing, China, in 2016. His research interests include event-based vision, neuromorphic engineering, and robotics.
Yonghong Tian (S’00-M’06-SM’10-F’22) is currently the Dean of School of Electronics and Computer Engineering, Peking University, Shenzhen Graduate School, 518055, China, a Boya Distinguished Professor with the School of Computer Science, Peking University, China, and is also the deputy director of Artificial Intelligence Research, PengCheng Laboratory, Shenzhen, China. His research interests include neuromorphic vision, distributed machine learning and multimedia big data. He is the author or coauthor of over 300 technical articles in refereed journals and conferences. Prof. Tian was/is an Associate Editor of IEEE TCSVT (2018.1-2021.12), IEEE TMM (2014.8-2018.8), IEEE Multimedia Mag. (2018.1-2022.8), and IEEE Access (2017.1-2021.12). He co-initiated IEEE Int’l Conf. on Multimedia Big Data (BigMM) and served as the TPC Co-chair of BigMM 2015, and also served as the Technical Program Co-chair of IEEE ICME 2015, IEEE ISM 2015 and IEEE MIPR 2018/2019, and General Co-chair of IEEE MIPR 2020 and ICME2021. He is a TPC Member of more than ten conferences such as CVPR, ICCV, ACM KDD, AAAI, ACM MM and ECCV. He was the recipient of the Chinese National Science Foundation for Distinguished Young Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award for Journal on Image and Video Processing, and the best paper award of IEEE BigMM 2018, and the 2022 IEEE SA Standards Medallion and SA Emerging Technology Award. He is a Fellow of IEEE, a senior member of CIE and CCF, a member of ACM.