
SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

Shuailei Ma, Yuefeng Wang, Ying Wei, Jiaqi Fan, Enming Zhang, Xinyu Sun, and Peihao Chen

Shuailei Ma, Yuefeng Wang, Jiaqi Fan, and Enming Zhang are with the College of Information Science and Engineering, Northeastern University, Shenyang, China, 110819. E-mail: {xiaomabufei, wangyuefeng0203, f1074979751, z1693663290}@gmail.com
Ying Wei is the corresponding author, with the College of Information Science and Engineering, Northeastern University, Shenyang, China, 110819. E-mail: weiying@ise.neu.edu.cn
Xinyu Sun and Peihao Chen are with the School of Software Engineering, South China University of Technology, China, 510641. E-mail: {csxinyusun, phchencs}@gmail.com
Abstract

Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) benchmarks and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and incrementally learn them. The large pre-trained vision-language grounding models (VLM, e.g., GLIP) have rich knowledge about the open world, but are limited by text prompts and cannot localize indescribable objects. However, there are many detection scenarios in which pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the combination of a simple knowledge distillation approach and the automatic pseudo-labeling mechanism in OWOD can achieve better performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures for known objects, leading to catastrophic forgetting. To alleviate these problems, we propose the down-weight loss function for knowledge distillation from vision-language to single vision modality. Meanwhile, we propose the cascade decoupled decoding structure that decouples the learning of localization and recognition to reduce the impact of category interactions of known and unknown objects on the localization learning process. Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world, we propose two benchmarks, which we name “StandardSet\heartsuit” and “IntensiveSet\spadesuit” respectively, based on the complexity of their testing scenarios. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods. The code and proposed dataset are available at https://github.com/xiaomabufei/SKDF.

Index Terms:
Open World Object Detection, Knowledge Distillation Framework, Down-Weight Loss Function, Decoupled Cascade Decoding Structure.

1 Introduction

Open-world object detection (OWOD) is a more practical detection problem in computer vision, facilitating the development of object detection (OD) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] in the real world. Within the OWOD paradigm, the model's lifespan is driven by an iterative learning process, as shown in Fig. 2 (a). At each episode, the model trained on the data with known object annotations is expected to detect known objects and all potential unknown objects. Human annotators then gradually label a few of the tagged unknown classes of interest. Given these newly added annotations, the model continues to incrementally update its knowledge without retraining from scratch.

Figure 1: SKDF leverages the proposed down-weight training strategy to distill open-world knowledge from the large open-vocabulary pre-trained vision-language model into an expert open-world detector with faster detection speed and better performance, using small amounts of data.

In existing works [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], open-world detectors are expected to learn about the open world from several small-scale datasets [24, 25]. However, the annotations of these datasets are too few to provide adequate object attributes for the model, making it difficult for the model to achieve the ideal goal through these datasets alone. The large pre-trained vision-language grounding models (VLM) [26, 27, 28, 29, 30] have rich knowledge of the open world owing to their enormous numbers of parameters, huge open-world training datasets, and training costs.

However, their detection cannot proceed without text prompts. Before detection, the text prompts of all objects to be detected must be listed in advance, and objects whose text prompts are not listed cannot be detected. There are many detection scenarios in which pre-defined language descriptions are unavailable during inference. Therefore, it is challenging to equip the VLM with language-agnostic unknown object detection capability. In addition, their detection speed is also a drawback for the following reasons: i) the huge number of parameters and FLOPs; ii) the large pre-trained vision-language grounding model can only infer with several text prompts at a time to maintain detection performance, so it must run inference many times when the number of prompts is large.

Humans' ability to recognize objects they have not seen before largely depends on their brains' knowledge base. Inspired by how humans face the open world, we propose to learn from the large pre-trained vision-language grounding models by knowledge distillation for OWOD. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector, as shown in Fig. 1. We were surprised to observe that a simple knowledge distillation approach for the OWOD algorithm can achieve better performance for unknown object detection than the large open-vocabulary pre-trained vision-language model, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures for known objects, due to the following challenges:

(I) Different from traditional knowledge distillation [31, 32, 33, 34, 35], the teacher and student involve different modalities and have different training manners, structures, and inference procedures. Meanwhile, the learning process of OWOD algorithms [12, 13, 36, 15, 14, 37] always has its own specific open-world pseudo-unknown labels. Therefore, the existing distillation objectives do not work, and it is difficult to mix the learning of known ground truth, unknown distilled knowledge, and unknown pseudo-knowledge. The performance of detecting known objects is also crucial for OWOD. Through experiments, we reveal that the direct use of distilled knowledge dramatically affects the model's ability to learn from the original annotations, and the model's performance in detecting known objects is damaged substantially. To alleviate this, we propose the down-weight training loss for distillation training, which utilizes the distilled labels' object confidence from the large pre-trained vision-language grounding model and the searched pseudo objectness to reduce the weight of the unknown loss in the total loss during training, as shown in Fig. 2 (b).

(II) The presence of objects with features highly similar to known classes within the “unknown objects” can greatly affect the process of open-world object identification. In particular, the inclusion of unknown objects impacts the detector's performance on known objects. This issue affects not only the identification process but also the localization process for models that use coupled information for both tasks. Therefore, we propose decoupling the detection learning process. However, a fully decoupled structure completely separates localization and identification, leading to many potential problems (e.g., a mismatch between localization and identification results). Therefore, we propose a cascade structure that decouples the detection process via two decoders and connects them in a cascade manner, as shown in Fig. 2 (a). In this structure, foreground localization can be protected from category knowledge because the identification loss is diluted by the latter decoder. Moreover, the identification process can utilize the localization information because it takes the former decoder's output embeddings as its input queries.

Figure 2: Overall scheme of the proposed framework. (a) illustrates the lifespan of the cascade open-world object detector: the model detects known objects and potential unknowns, human annotators progressively label some unknown classes, and the model incrementally updates its knowledge using these new labels without fully retraining. (b) exhibits the down-weight training strategy, which leverages the objectness to separate the learning weights of the annotated known knowledge, distilled open-world knowledge, and searched pseudo-open-world unknown knowledge. (c) describes the distillation procedure, which leverages the large-scale vocabulary prompt to mine the open-world knowledge in the open-vocabulary vision-language pre-trained model.

Since existing datasets [24, 25] are manually annotated with predefined categories, current benchmarks cannot comprehensively measure the detection performance of open-world detectors for unknown objects due to the lack of bounding box annotations for unknown entities in the test scenarios. UnSniffer [38] chooses testing images with only a few unknown foreground objects (no more than 3) from the MSCOCO dataset [24] to evaluate the detector's ability to identify unknown objects. Inspired by this, we propose two benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit respectively, based on the complexity of their testing scenarios. Unlike UnSniffer, which follows the manual annotation guidelines of the MSCOCO [24] dataset, we draw on the more detailed annotation criteria of finer-grained datasets [39]: we manually select suitable evaluation scenes (with no fewer than three unknown objects) and provide more meticulous manual annotations of unknown objects. Moreover, our IntensiveSet\spadesuit features an average of over 33 unknown open-world object instances per test scene. Our contributions can be summarized fourfold:

  •  We observe that simple knowledge distillation can transfer the open-world knowledge in the large pre-trained vision-language grounding model to the specialized OWOD task, and we propose a simple framework with surprisingly good performance.

  •  To mitigate the effect of distilled knowledge on the detection performance for known objects, we propose the down-weight training loss function for the detector's mixed learning of known ground truth, distilled unknown knowledge, and the pseudo unknown knowledge in the OWOD algorithm. Meanwhile, a cascade decoupled detection transformer structure is proposed to alleviate the influence of unknown objects on detecting known objects.

  •  We propose two novel benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit respectively based on the complexity of their testing scenarios, to comprehensively evaluate the ability of open-world detectors to detect unknown open-world objects.

  •  Our extensive experiments on existing and proposed benchmarks demonstrate the effectiveness of our framework. Our model exceeds the distilled large pre-trained vision-language grounding model on OWOD, and state-of-the-art methods on OWOD and IOD.

2 Related Works

Large pre-trained vision-language models: Recently, inspired by the success of vision-language (VL) pre-training methods [29] and their good zero-shot ability, [40, 26, 28, 27, 30] attempted to perform zero-shot detection on a larger range of domains by using pre-trained vision-language models. [40] proposed a zero-shot detection method that distills knowledge from a pre-trained vision-language image classification model. [26] tried to align region and language features using a dot-product operation and could be trained end-to-end on grounding and detection data. [28] proposed an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. [30] proposed a parallel visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary. However, VLMs require predefined object categories to drive the model, so they cannot be directly applied to OWOD tasks. Moreover, VLMs require substantial computational resources and a long runtime.

Semi-Supervised Learning For Object Detection: In this area, there are two dominant approaches: consistency-based methods [41, 42] and pseudo-label methods [43, 44, 45, 46]. STAC [43] deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. Xu et al. [44] proposed an end-to-end pseudo-labeling framework that avoids the complicated training process and also achieves better performance. Liu et al. [45] improved the pseudo-label generation model via a teacher-student mutual learning regimen and addressed the crucial imbalance issue in generated pseudo-labels. However, these distillation methods target single-modality closed-set object detection tasks. This paper introduces a simple framework for distilling from a multimodal open-vocabulary model to a single-modality open-world model.

Figure 3: Overall architecture of the proposed cascade decoupled open-world detector. The proposed detector consists of a multi-scale feature extractor, the decoupled cascade transformer decoders, and the regression prediction branch. The multi-scale feature extractor comprises a mainstream feature extraction backbone and a deformable transformer encoder for extracting multi-scale features. The decoupled cascade transformer decoders are deformable transformer decoders and decouple the localization and identification processes in a cascade way. The regression prediction branch contains the bounding box regression branch $F_{reg}$, the novelty objectness branch $F_{obj}$, and the novelty classification branch $F_{cls}$. The novelty classification and objectness branches are single-layer feed-forward networks (FFN), and the regression branch is a 3-layer FFN.

Open-World Object Detection (OWOD): ORE [12] introduced the OWOD task and adapted the Faster R-CNN model with feature-space contrastive clustering, an RPN-based unknown detector, and an Energy Based Unknown Identifier (EBUI) for the OWOD objective. Recently, several works [14, 15, 37] attempted to extend ORE. OCPL [14] was proposed to learn discriminative embeddings of known classes in the feature space to minimize the overlapping distributions of known and unknown classes. 2B-OCD [15] proposed a two-branch objectness-centric open-world object detection framework consisting of a bias-guided detector and an objectness-centric calibrator. OW-DETR [13] proposed to utilize a pseudo-labeling scheme to supervise unknown object detection, where unmatched object proposals with high backbone activation are selected as unknown objects. Existing methods limit the training of the model to a small subset of object annotations, which fails to fully teach the model how to recognize objects and foreground. In this paper, we propose distilling open-world knowledge from large pre-trained models to scale up the open-world knowledge in the training data. Experiments show that even a simple distillation framework can stimulate the model's ability to distinguish unknown objects from the background without the need for vast amounts of data, even achieving capabilities that surpass those of the teacher.

3 Problem Formulation

Let $\mathcal{K}^{t}=\{1,2,\ldots,C\}$ denote the set of known object classes and $\mathcal{U}^{t}=\{C+1,\ldots\}$ denote the unknown classes which might be encountered at test time, at time $t$. The known object categories $\mathcal{K}^{t}$ are labeled in the dataset $\mathcal{D}^{t}=\{\mathcal{J}^{t},\mathcal{L}^{t}\}$, where $\mathcal{J}^{t}$ denotes the input images and $\mathcal{L}^{t}$ denotes the corresponding labels at time $t$. The training image set consists of $M$ images $\mathcal{J}^{t}=\{i_{1},i_{2},\ldots,i_{M}\}$ and corresponding labels $\mathcal{L}^{t}=\{\ell_{1},\ell_{2},\ldots,\ell_{M}\}$. Each $\ell_{i}=\{\mathcal{T}_{1},\mathcal{T}_{2},\ldots,\mathcal{T}_{N}\}$ denotes a set of $N$ object instances with their class labels $c_{n}\in\mathcal{K}^{t}$ and locations $\{x_{n},y_{n},w_{n},h_{n}\}$, which denote the bounding box center coordinates, width, and height, respectively.

The artificial assumptions and restrictions of closed-set object detection are removed in Open-World Object Detection, which aligns object detection more closely with real life. It requires the trained model $\mathcal{M}_{t}$ to detect the previously encountered $C$ known classes and to identify an unseen class instance as belonging to the unknown class. In addition, it requires the object detector to be capable of incremental updates with new knowledge, and this cycle continues over the detector's lifespan. In the incremental updating phase, the unknown instances identified by $\mathcal{M}_{t}$ are annotated manually. Along with their corresponding training examples, they update $\mathcal{D}^{t}$ to $\mathcal{D}^{t+1}$ and $\mathcal{K}^{t}$ to $\mathcal{K}^{t+1}=\{1,2,\ldots,C,\ldots,C+n\}$. The model adds the $n$ new classes to the known classes and updates itself to $\mathcal{M}_{t+1}$ without retraining from scratch on the whole dataset $\mathcal{D}^{t+1}$.

4 Proposed method

This section elaborates on the proposed framework in detail. We start by illustrating the overall scheme of our proposed framework in Sec. 4.1. Furthermore, we introduce the open-world object detector in Sec. 4.2, the distillation process in Sec. 4.3, and the matching and pseudo-labeling procedure of the OWOD algorithm in Sec. 4.4. Last but not least, we describe the down-weight training strategy in Sec. 4.5 and the inference phase in Sec. 4.6.

4.1 Overall Scheme

Fig. 2 illustrates the overall scheme of our framework. A given image $x\in\mathbb{R}^{H\times W\times 3}$ is first sent into the open-world detector and the large pre-trained vision-language grounding model simultaneously. The detector leverages the visual features of the input to predict the localization, box score, and classification. The large pre-trained vision-language grounding model mines unknown open-world knowledge from the input. The known ground truth and unknown distilled knowledge together form the open-world supervision. In the training phase, we match the predictions and the open-world supervision according to the regression loss, classification score, and supervision confidence. After matching, pseudo labels are selected according to the predicted box score. All labels are then leveraged to train the open-world detector with the down-weight training loss function. In addition, when new categories are introduced at each episode, our detector continues to learn over its lifespan via exemplar replay-based finetuning, which alleviates catastrophic forgetting of learned classes by finetuning on a balanced set of exemplars stored for all known classes.

Figure 4: The detailed data analysis of StandardSet\heartsuit and IntensiveSet\spadesuit. In (a), we calculate the area distribution of instances in the two benchmark test scenes, with the vertical axis representing the logarithm of the count with respect to Euler’s number e. In (b) and (c), we respectively analyze the aspect ratio and the spatial distribution of the instance bounding box annotations.

4.2 Open-World Object Detector

As shown in Fig. 3, the open-world detector first uses a hierarchical feature extraction backbone to extract multi-scale features $\mathrm{Z}_{i}$. The feature maps $\mathrm{Z}_{i}$ are projected from dimension $C_{s}$ to dimension $C_{d}$ using 1×1 convolutions and, after flattening, concatenated into $N_{s}$ vectors of dimension $C_{d}$. Afterward, along with supplementary positional encodings $P\in\mathbb{R}^{N_{s}\times C_{d}}$, the multi-scale features are sent into the deformable transformer encoder to encode semantic features. The encoded semantic features are then sent into the localization decoder together with a set of $N$ learnable location queries.

Aided by interleaved cross-attention and self-attention modules, the localization decoder transforms the location queries $Q_{Location}\in\mathbb{R}^{N\times D}$ into a set of $N$ location query embeddings $E_{Location}\in\mathbb{R}^{N\times D}$. Meanwhile, $E_{Location}$ is used as the class queries and sent into the identification decoder together with the encoded multi-scale features. The identification decoder transforms the class queries into $N$ class query embeddings $E_{Class}$ that correspond to the location query embeddings. The operation of the cascade decoders is expressed as follows:

$E_{Location}=F_{LD}(F_{E}(\varnothing(x),P),Q_{Location},R)$, (1)
$E_{Class}=F_{ID}(F_{E}(\varnothing(x),P),E_{Location},R)$. (2)

where $F_{LD}(\cdot)$ and $F_{ID}(\cdot)$ denote the localization and identification decoders, $F_{E}(\cdot)$ is the encoder, and $\varnothing(\cdot)$ is the backbone. $R$ represents the reference points and $x$ denotes the input image. Eventually, $E_{Class}$ is sent into the classification branch to predict the category $cls\in[0,1]^{N_{obj}}$, while $E_{Location}$ is input to the regression and box score branches to locate $N$ foreground bounding boxes $b\in[0,1]^{4}$ and predict the box score $bs\in[0,1]$.

In this decoupling structure, foreground localization can be protected from category knowledge, and the identification process can utilize the localization information. Therefore, we alleviate the influence of unknown objects on the detection performance for known objects, as well as the confusion between the category and location of the same object.
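
To make the cascade decoupled decoding concrete, the following minimal PyTorch sketch mirrors Eq. (1)-(2) and the three prediction branches. It is illustrative only: standard multi-head attention stands in for deformable attention, reference points are omitted, and the module names, layer sizes, and class count are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CascadeDecoupledHead(nn.Module):
    """Sketch of the cascade decoupled decoding structure (Eq. 1-2)."""

    def __init__(self, d_model=256, num_queries=100, num_classes=81, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.loc_decoder = nn.TransformerDecoder(layer, num_layers)   # F_LD
        self.id_decoder = nn.TransformerDecoder(layer, num_layers)    # F_ID
        self.location_queries = nn.Embedding(num_queries, d_model)    # Q_Location
        self.reg_branch = nn.Sequential(                              # F_reg: 3-layer FFN
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4))
        self.obj_branch = nn.Linear(d_model, 1)                       # F_obj: box score
        self.cls_branch = nn.Linear(d_model, num_classes)             # F_cls (known + unknown, illustrative)

    def forward(self, memory):
        # memory: encoded multi-scale features, shape (B, N_s, d_model)
        q = self.location_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        e_location = self.loc_decoder(q, memory)           # Eq. (1): location embeddings
        e_class = self.id_decoder(e_location, memory)      # Eq. (2): class embeddings
        boxes = self.reg_branch(e_location).sigmoid()      # b in [0, 1]^4
        box_score = self.obj_branch(e_location).sigmoid()  # bs in [0, 1]
        cls = self.cls_branch(e_class).sigmoid()           # per-class scores
        return boxes, box_score, cls

# Usage: dummy encoder output, batch of 2 with 1000 feature tokens.
head = CascadeDecoupledHead()
boxes, box_score, cls = head(torch.randn(2, 1000, 256))
```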

4.3 Open-World Object Supervision

In the knowledge distillation phase, the teacher leverages a large pre-trained vision-language grounding model and a text prompt that contains as many object categories as possible to mine unknown open-world knowledge from the input $x$. In this paper, we utilize GLIP [26] as the large pre-trained vision-language grounding model and the categories of LVIS [39] as the text prompts. The distilled open-world knowledge is first processed through an NMS procedure. Then we align the distilled open-world knowledge and the ground truth with an alignment module, which maps the generated labels into the given annotation space (e.g., translating the LVIS categories trailer truck, tow truck, etc. into the COCO category truck) and excludes the known set from the distilled open-world knowledge. In addition, the identification confidence of the unknown labels from the large pre-trained vision-language grounding model is retained for the subsequent training process and used as the supervision confidence.
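
The following sketch illustrates how distilled teacher detections could be turned into unknown-object supervision (NMS, mapping fine-grained LVIS names into the known label space, excluding knowns, and keeping the teacher confidence). The mapping table, the confidence threshold, and the function name are hypothetical placeholders; the real alignment module and GLIP interface may differ.

```python
import torch
from torchvision.ops import nms

# Hypothetical mapping from fine-grained LVIS names to the known label space.
LVIS_TO_KNOWN = {"trailer_truck": "truck", "tow_truck": "truck", "minivan": "car"}
KNOWN_CLASSES = {"truck", "car"}          # illustrative subset of the current known set

def align_distilled_labels(boxes, scores, lvis_names, iou_thr=0.5, conf_thr=0.3):
    """Turn raw teacher detections (xyxy boxes, confidences, LVIS names) into
    unknown-object supervision, keeping the teacher confidence S-hat."""
    keep = nms(boxes, scores, iou_thr)                 # NMS over teacher detections
    boxes, scores = boxes[keep], scores[keep]
    names = [lvis_names[i] for i in keep.tolist()]

    unk_boxes, unk_conf = [], []
    for box, score, name in zip(boxes, scores, names):
        mapped = LVIS_TO_KNOWN.get(name, name)         # align to the known label space
        if mapped in KNOWN_CLASSES or score < conf_thr:
            continue                                   # knowns come from ground truth instead
        unk_boxes.append(box)
        unk_conf.append(score)
    if not unk_boxes:
        return torch.empty(0, 4), torch.empty(0)
    return torch.stack(unk_boxes), torch.stack(unk_conf)

# Usage with dummy teacher outputs.
boxes = torch.tensor([[0., 0., 50., 50.], [2., 2., 52., 52.], [80., 80., 120., 160.]])
scores = torch.tensor([0.9, 0.85, 0.6])
names = ["trailer_truck", "tow_truck", "umbrella"]
unk_boxes, unk_conf = align_distilled_labels(boxes, scores, names)
```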

4.4 Matching and Evolving

Following the existing open-world detectors [12, 13, 36, 15, 14, 37], we add a box score prediction branch and use the predicted box score to automatically select pseudo-unknown labels from the regression boxes that remain unmatched after the matching process of each training iteration; this serves as the automatic pseudo-labeling mechanism. When matching the open-world supervision and the predictions, we consider the box regression loss, the predicted class score, and the confidence of the supervision.

After matching, we leverage the box score branch to help the detector learn more unknown open-world objects beyond the ground truth and the knowledge distilled from the large pre-trained vision-language grounding model by selecting pseudo labels during training. We denote by $\hat{y}$ the open-world supervision and by $y=\{y_{i}\}_{i=1}^{N}$ the set of $N$ predictions. To find the best bipartite matching between them, a permutation of $N$ elements $\omega\in\mathfrak{S}_{N}$ with the lowest cost is searched for as follows:

$\hat{\omega}=\underset{\omega\in\mathfrak{S}_{N}}{\arg\min}\sum_{i}^{N}\mathcal{L}_{\operatorname{match}}(\hat{y}_{i},y_{\omega(i)})$, (3)

where $\mathcal{L}_{\operatorname{match}}(\hat{y}_{i},y_{\omega(i)})$ is the pair-wise matching cost between $\hat{y}_{i}$, which represents a ground-truth label or a label from the large pre-trained vision-language grounding model, and the prediction $y$ with index $\omega(i)$, as shown in Equation 4. Inspired by [2, 3, 28], we use the Hungarian algorithm to find the optimal assignment.

$\mathcal{L}_{\operatorname{match}}(\hat{y}_{i},y_{\omega(i)})=L_{r}(\hat{b}_{i},b_{\omega(i)})-cls_{\omega(i)}(\hat{c}_{i})-\hat{S}_{i}$, (4)

where $L_{r}$ denotes the regression loss, which consists of a box loss and a GIoU loss [47]. $\hat{b}$ and $b$ represent the open-world supervision box and the prediction box, respectively. $\hat{S}$ is the confidence of the open-world supervision. Then, the pseudo labels are selected as follows:

$l_{p}=\operatorname{Topk}(\{bs_{i}\}_{i\notin\hat{\omega}})$, (5)

where $bs$ denotes the predicted box score. The pseudo labels prevent the model from relying entirely on the knowledge of the large pre-trained vision-language grounding model and help it discover unseen objects beyond the distilled knowledge.
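
A minimal sketch of the matching cost of Eq. (4), the Hungarian assignment of Eq. (3), and the top-k pseudo-label selection of Eq. (5) is given below, assuming normalized (cx, cy, w, h) boxes and unweighted cost terms; the actual loss weights and implementation details may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou, box_convert

def match_and_select_pseudo(pred_boxes, pred_cls, pred_bs, sup_boxes, sup_cls, sup_conf, k=5):
    """Hungarian matching against open-world supervision, then top-k
    pseudo-unknown selection among unmatched queries by box score."""
    # Pairwise regression cost: L1 + GIoU (the L_r term of Eq. 4).
    l1 = torch.cdist(sup_boxes, pred_boxes, p=1)                      # (G, N)
    giou = generalized_box_iou(
        box_convert(sup_boxes, "cxcywh", "xyxy"),
        box_convert(pred_boxes, "cxcywh", "xyxy"))                    # (G, N)
    cost = l1 + (1.0 - giou)
    # Subtract the predicted score of the supervised class and the supervision confidence.
    cost = cost - pred_cls[:, sup_cls].T - sup_conf[:, None]
    gt_idx, query_idx = linear_sum_assignment(cost.numpy())           # Eq. (3)

    # Eq. (5): unmatched queries with the highest box scores become pseudo unknowns.
    unmatched = torch.ones(pred_boxes.size(0), dtype=torch.bool)
    unmatched[torch.as_tensor(query_idx)] = False
    cand = unmatched.nonzero(as_tuple=True)[0]
    pseudo = cand[pred_bs[cand].topk(min(k, cand.numel())).indices]
    return torch.as_tensor(gt_idx), torch.as_tensor(query_idx), pseudo

# Usage with N=100 queries, 81 class scores, and 3 supervision boxes.
N, C, G = 100, 81, 3
gt_idx, query_idx, pseudo = match_and_select_pseudo(
    torch.rand(N, 4), torch.rand(N, C), torch.rand(N),
    torch.rand(G, 4), torch.randint(0, C, (G,)), torch.rand(G))
```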

4.5 Down-Weight Training Strategy

The inclusion of distilled knowledge inevitably influences the model's learning of the known set, because its quality cannot be guaranteed and it increases the difficulty of the detector's learning. Therefore, we propose the down-weight training strategy, which leverages the identification confidence of the distilled knowledge to generate a harmonic factor that down-weights the unknown training loss, and trains the detector in an end-to-end manner as shown in Fig. 2 (b). The training loss function is as follows:

$L=L_{r}+L_{bs}+L_{cls}+L_{r}^{kd}+L_{bs}^{kd}+L_{cls}^{kd}+L_{cls}^{p}$, (6)

where $L_{r}$ is the common regression loss consisting of a box loss and a GIoU loss [47]. $L_{bs}$ and $L_{cls}$ represent the box score and classification losses, respectively; both use the common sigmoid focal loss [48]. For simplicity, we omit their formulations. In addition, $L_{r}^{kd}$, $L_{bs}^{kd}$, $L_{cls}^{kd}$, and $L_{cls}^{p}$ all utilize the corresponding down-weight loss functions we propose, formulated as follows:

$L_{r}^{kd}=\frac{1}{|l_{kd}|}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}\hat{S}_{\hat{\omega}(i)}[\|b_{i}-\hat{b}_{\hat{\omega}(i)}\|_{1}+1-\mathcal{G}(b_{i},\hat{b}_{\hat{\omega}(i)})]$, (7)
$L_{bs}^{kd}=\frac{1}{\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}\|bs_{i}\|_{1}}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}[l_{sf}(bs_{i},\hat{S}_{\hat{\omega}(i)})]$, (8)
$L_{cls}^{kd}=\frac{1}{\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}\|cls_{i}\|_{1}}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}[l_{sf}(cls_{i},\hat{S}_{\hat{\omega}(i)})]$, (9)
$L_{cls}^{p}=\frac{1}{\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{p}\}}\|cls_{i}\|_{1}}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{p}\}}[l_{sf}(cls_{i},bs_{i})]$, (10)

where $l_{kd}$ and $l_{p}$ denote the distilled-knowledge supervision labels and the pseudo supervision labels, respectively. $N_{q}$ represents the number of queries, and $\hat{\omega}(i)$ represents the index of the label matched to prediction $i$. $l_{sf}$ denotes the sigmoid focal loss function and $\mathcal{G}(\cdot)$ represents the GIoU loss function. $\hat{S}$ is the confidence of the supervision labels. $b$, $bs$, and $cls$ are the predicted box, box score, and classification score, respectively.
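
For illustration, the sketch below computes the distilled-knowledge terms $L_{r}^{kd}$ (Eq. 7) and $L_{bs}^{kd}$ (Eq. 8) for the queries matched to distilled labels, using torchvision's sigmoid focal loss with the teacher confidence as a soft target; the tensor layout, the use of logits for the box score, and the normalization details are assumptions of this sketch.

```python
import torch
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou, box_convert

def down_weight_kd_losses(pred_boxes, pred_bs_logits, matched_boxes, matched_conf):
    """Down-weight KD losses for queries matched to distilled unknown labels;
    `matched_conf` is the teacher confidence S-hat."""
    # Eq. (7): regression loss (L1 + GIoU), each term scaled by the teacher confidence.
    l1 = (pred_boxes - matched_boxes).abs().sum(dim=-1)
    giou = generalized_box_iou(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(matched_boxes, "cxcywh", "xyxy")).diag()
    l_reg = (matched_conf * (l1 + 1.0 - giou)).sum() / max(len(matched_conf), 1)

    # Eq. (8): box-score loss, sigmoid focal loss against the (soft) teacher confidence,
    # normalized by the summed predicted box scores of the matched queries.
    bs_prob = pred_bs_logits.sigmoid()
    l_bs = sigmoid_focal_loss(pred_bs_logits, matched_conf, reduction="sum")
    l_bs = l_bs / bs_prob.abs().sum().clamp(min=1e-6)
    return l_reg, l_bs

# Usage for 5 queries matched to distilled unknown labels.
l_reg, l_bs = down_weight_kd_losses(
    torch.rand(5, 4), torch.randn(5), torch.rand(5, 4), torch.rand(5))
```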

4.6 Inference

During inference, our detector only utilizes the visual features of the input to detect open-world objects, without any information from other modalities. The inference process composites the detector outputs into open-world object instances. Formally, the $i$-th output prediction is generated as $<b_{i},\ bs_{i},\ cls_{i}>$. Using $s_{i},l_{i}=\max(cls_{i})$, the result is obtained as $<l_{i},\ s_{i},\ b_{i}>$, where $l$ is the category, $s$ denotes the confidence, and $b$ represents the predicted bounding box.
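
A minimal sketch of this decoding step is shown below; the ranking by confidence and the top-k cutoff are assumptions added for illustration, not part of the method description above.

```python
import torch

def decode_predictions(boxes, cls_scores, top_k=100):
    """For each query output <b_i, bs_i, cls_i>, take s_i, l_i = max(cls_i)
    and return <l_i, s_i, b_i>, ranked by confidence (ranking is an assumption)."""
    conf, labels = cls_scores.max(dim=-1)          # s_i, l_i = max(cls_i)
    order = conf.argsort(descending=True)[:top_k]  # keep the most confident detections
    return [(labels[i].item(), conf[i].item(), boxes[i]) for i in order.tolist()]

# Usage with 100 queries and 81 class scores (80 known + unknown, illustrative).
detections = decode_predictions(torch.rand(100, 4), torch.rand(100, 81))
```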

Dataset | Images | Avg. Known | Avg. Unknown
StandardSet\heartsuit | 694 | 2.7 | 3.4
IntensiveSet\spadesuit | 489 | 5.8 | 33.7
TABLE I: Statistics on the number of scenes and instances for the proposed testing sets: StandardSet\heartsuit and IntensiveSet\spadesuit.

5 Proposed Dataset

Since existing datasets [24, 25] are manually annotated with predefined categories, current benchmarks cannot comprehensively measure the detection performance of open-world detectors for unknown objects due to the lack of bounding box annotations for unknown entities in the test scenarios. In this paper, we propose two benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit respectively, based on the complexity of their testing scenarios. For the proposed benchmarks, the known categories are defined as those of the PASCAL VOC [25] dataset, and we train the open-world detectors on the images and annotations of the PASCAL VOC [25] training set. Meanwhile, inspired by the detailed annotation criteria of finer-grained datasets [39], we manually select suitable evaluation scenes (with no fewer than five unknown objects) and provide more meticulous manual annotations of unknown objects to construct the testing sets, as shown in TABLE I.

5.1 Standard Testing Set

The StandardSet\heartsuit contains a total of 694 testing scenes selected from the MS-COCO [24] dataset, in which we manually annotate the unknown objects. With respect to the predefined known categories, it contains 1897 known instances and 2378 unknown open-world instances. We report statistics on instance area, aspect ratio, and position distribution in Fig. 4. The statistical results suggest that, compared to IntensiveSet\spadesuit, StandardSet\heartsuit has a more uniform distribution of instance areas. Additionally, the aspect ratios of bounding boxes in the StandardSet\heartsuit testing set tend towards a uniform distribution, with the majority of bounding box annotations concentrated in the central region of the scenes.

5.2 Intensive Testing Set

In the IntensiveSet\spadesuit testing set, we elevate the complexity by selecting 489 highly complex scenes, which altogether encompass 2859 annotations of known objects and 16482 manually annotated bounding boxes of unknown open-world objects, averaging 5.8 known-object annotations and 33.7 unknown-object annotations per image. As shown in Fig. 4, compared to the StandardSet\heartsuit testing set, the bounding box annotations in the IntensiveSet\spadesuit testing set primarily cover smaller areas, with a more uniform distribution of aspect ratios and spatial positions.

OWOD split
Task ID → | Task 1 | Task 2 | Task 3 | Task 4
Semantic split → | VOC Classes | Outdoor, Accessories, Appliances, Truck | Sports, Food | Electronic, Indoor, Kitchen, Furniture
# training images | 16551 | 45520 | 39402 | 40260
# test images | 4952 | 1914 | 1642 | 1738
# train instances | 47223 | 113741 | 114452 | 138996
# test instances | 14976 | 4966 | 4826 | 6039

MS-COCO split
Task ID → | Task 1 | Task 2 | Task 3 | Task 4
Semantic split → | Animals, Person, Vehicles | Appliances, Accessories, Outdoor, Furniture | Sports, Food | Electronic, Indoor, Kitchen
# training images | 89490 | 55870 | 39402 | 38903
# test images | 3793 | 2351 | 1642 | 1691
# train instances | 421243 | 163512 | 114452 | 160794
# test instances | 17786 | 7159 | 4826 | 7010

TABLE II: Task composition in the OWOD and MS-COCO splits for the open-world evaluation protocol. The semantics of each task and the number of images and instances (objects) across splits are shown.

6 Experiment

6.1 Datasets and Metrics

For a fair comparison, we conduct the experiments on the two mainstream splits of the MS-COCO [24] and Pascal VOC [25] datasets. We group the classes into a set of non-overlapping tasks $\{T^{1},\ldots,T^{t},\ldots\}$. A class in task $T^{c}$ only appears in tasks $T^{t}$ where $t\geq c$. At time $t$, classes encountered in $\{T^{c}:c\leq t\}$ are considered known, and classes in $\{T^{c}:c>t\}$ are considered unknown.
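
The known/unknown bookkeeping implied by this protocol can be summarized by the small sketch below, with purely illustrative class names.

```python
# Classes from tasks T^c with c <= t are known at time t; all later ones are unknown.
TASK_CLASSES = {1: ["person", "car"], 2: ["truck", "umbrella"],
                3: ["pizza", "kite"], 4: ["tv", "book"]}   # illustrative subsets

def known_unknown_at(t):
    known = [c for task, cls in TASK_CLASSES.items() if task <= t for c in cls]
    unknown = [c for task, cls in TASK_CLASSES.items() if task > t for c in cls]
    return known, unknown

known, unknown = known_unknown_at(2)   # -> 4 known classes, 4 unknown classes
```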

OWOD SPLIT [12] divides the 80 classes of MS-COCO into 4 tasks and selects a training set for each task from the MS-COCO and Pascal VOC training sets. The Pascal VOC test set and the MS-COCO validation set are used for evaluation. Detailed data statistics are shown in TABLE II.

MS-COCO SPLIT [13] mitigates data leakage across tasks in [12] and is more challenging, as shown in TABLE II. The training and testing data are selected from MS-COCO.

Metrics: Following the most common evaluation metric for object detection, we use mean average precision (mAP) to evaluate known objects. Inspired by [12, 49, 13, 50, 51], U-Recall, which measures the ability of the model to retrieve unknown object instances in the OWOD problem, is used as the metric for unknown objects. For the proposed StandardSet\heartsuit and IntensiveSet\spadesuit benchmarks, we also use unknown AP and unknown precision to comprehensively measure the detection performance of open-world detectors on unknown open-world objects.
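
As an illustration, the following sketch computes a simple IoU-based U-Recall (the fraction of ground-truth unknown boxes recovered by at least one predicted unknown box at IoU >= 0.5); the exact matching protocol used by the benchmark code may differ in its details.

```python
import torch
from torchvision.ops import box_iou

def unknown_recall(pred_unknown_boxes, gt_unknown_boxes, iou_thr=0.5):
    """Fraction of ground-truth unknown instances covered by at least one
    predicted unknown box with IoU >= iou_thr (xyxy boxes)."""
    if gt_unknown_boxes.numel() == 0:
        return 1.0                                     # no unknowns to recover (convention)
    if pred_unknown_boxes.numel() == 0:
        return 0.0
    ious = box_iou(gt_unknown_boxes, pred_unknown_boxes)   # (num_gt, num_pred)
    recovered = (ious.max(dim=1).values >= iou_thr).float()
    return recovered.mean().item()

# Usage: two GT unknown boxes, one of which is recovered -> U-Recall = 0.5.
gt = torch.tensor([[0., 0., 10., 10.], [20., 20., 40., 40.]])
pred = torch.tensor([[1., 1., 10., 10.]])
print(unknown_recall(pred, gt))
```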

6.2 Implementation Details

The multi-scale feature extractor consists of a ResNet-50 [52] pretrained on ImageNet [53] in a self-supervised [54] manner and a deformable transformer encoder with 6 layers. The two cascade decoders are both deformable transformer decoders, each with 6 layers. We set the number of queries to 100, the dimension of the embeddings to $D=256$, and the number of pseudo-labels to $k=5$. We use GLIP [26] as the large pre-trained vision-language grounding model and the categories of the LVIS dataset [39] as text prompts to assist the training process. For GLIP, we use GLIP-L [26] without finetuning on the COCO dataset [24]; it consists of Swin-Large [55], the text encoder of CLIP [29], DyHead [56], BERT layers [57], and a fusion module [26]. For the incremental object detection experiments, we only use our open-world detector without the help of GLIP. The PyTorch library and eight NVIDIA RTX 3090 GPUs are used to train our SKDF framework with a batch size of 3 images per GPU. In each task, the SKDF framework is trained for 50 epochs and finetuned for 20 epochs during the incremental learning step. We train SKDF using the Adam optimizer with a base learning rate of $2\times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $10^{-4}$. For finetuning during the incremental step, the learning rate is reduced by a factor of 10 and the model is trained using a set of 50 stored exemplars per known class.
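
For reference, a minimal sketch of the optimizer configuration described above is given below; the stand-in module and variable names are placeholders.

```python
import torch

# Adam with base lr 2e-4, betas (0.9, 0.999), weight decay 1e-4; lr divided by 10 for finetuning.
detector = torch.nn.Linear(256, 4)          # stand-in for the SKDF detector
optimizer = torch.optim.Adam(detector.parameters(), lr=2e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
finetune_optimizer = torch.optim.Adam(detector.parameters(), lr=2e-5,
                                      betas=(0.9, 0.999), weight_decay=1e-4)
EPOCHS, FINETUNE_EPOCHS = 50, 20
```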

SPLIT | Method | Param# | FLOPs | Inference Rate (s/img) | U-Recall Task 1 (\uparrow) | U-Recall Task 2 (\uparrow) | U-Recall Task 3 (\uparrow)
OWOD | OV-VLM | 321.9M | 965 GMac | 9.22 | 37.0 | 35.5 | 34.9
OWOD | Ours | 42.9M | 212 GMac | 0.08 | 39.0 | 36.7 | 36.1
MS-COCO | OV-VLM | 321.9M | 965 GMac | 9.22 | 52.6 | 54.5 | 53.3
MS-COCO | Ours | 42.9M | 212 GMac | 0.08 | 60.9 | 60.0 | 58.6
TABLE III: Comparison with the large pre-trained open-vocabulary grounding vision-language model for open-world object detection on the OWOD and MS-COCO splits. For the large pre-trained open-vocabulary grounding vision-language model, predictions beyond the known categories are set to unknown. The experiments demonstrate that SKDF distills open-world knowledge from the large open-vocabulary pre-trained vision-language model into an expert open-world detector with faster detection speed and better performance on unknown open-world objects.
Figure 5: Qualitative results. Visualization results are based on the setting of Task 1. Our model can detect the unknown objects in yellow boxes beyond the unknown labels from GLIP and the LVIS text prompts. The animation and game categories in the figures appear neither in the LVIS text prompt nor in our training dataset, so our detector cannot have learned them from GLIP.

6.3 Detailed Comparison With the Teacher

In this section, we validate the advantages of our distillation framework in terms of performance and speed through detailed comparison and quantitative experiments.

6.3.1 Experimental Results

To preserve the teacher's ability to generalize to open-world objects in this comparison, we do not finetune the teacher on the known dataset, because the finetuning process severely destroys the teacher's generality. Thus, we only show the results for the comparison of unknown object detection ability. The results compared with the large pre-trained vision-language grounding model (GLIP) on the OWOD problem are shown in TABLE III. Regarding the number of parameters and FLOPs, our model is significantly smaller than GLIP. In particular, our inference speed is about $115\times\sim 116\times$ that of GLIP. Furthermore, GLIP is trained on 64M images, while our model only needs a small amount of data in each task of the different splits, roughly $\frac{1}{237}\sim\frac{1}{16}$ of that. Compared with GLIP's U-Recall of 37.0, 35.5, and 34.9 on Tasks 1, 2, and 3 of the OWOD split, ours achieves 39.0, 36.7, and 36.1 in the corresponding tasks, yielding significant absolute gains of up to 2.0, 1.2, and 1.2, respectively. Compared with GLIP's U-Recall of 52.6, 54.5, and 53.3 on Tasks 1, 2, and 3 of the MS-COCO split, ours achieves 60.9, 60.0, and 58.6 in the corresponding tasks, yielding significant absolute gains of up to 8.3, 5.5, and 5.3, respectively. This demonstrates that our model has a better ability to detect unknown objects for OWOD.

6.3.2 Qualitative Results

To present the ability to detect unknown open-world objects intuitively, we select several open scenes from games and comics, such as "Digital Monster", "StarCraft", "League of Legends", and "Pokemon". These scenes contain objects outside the categories of the LVIS [39] text prompts; visualization results are shown in Fig. 5. The visualization results show that our detector can identify nearly all unknown open-world objects, such as the creatures from "Digital Monster" and "Pokemon", the robots from "StarCraft", and the minions and turrets from "League of Legends". Even the weapons and equipment from "League of Legends" can be detected by our detector; however, there are instances of misidentification, such as mistaking "Liandry's Anguish" for a TV monitor. Thus, the qualitative results demonstrate that SKDF discovers novel unknown objects beyond the knowledge of the large pre-trained vision-language grounding model.

6.4 Ablation Study

In this subsection, a series of experiments is designed to clearly understand the contribution of each constituent component. We conduct all experiments on the OWOD split. We start by ablating each component, followed by ablating the text prompts and the open-vocabulary large pre-trained teacher.

Task IDs → | Task 1 | Task 2 | Task 3
Metrics → | U-Recall (\uparrow) / mAP (Current) (\uparrow) | U-Recall (\uparrow) / mAP (Previously / Current / Both) (\uparrow) | U-Recall (\uparrow) / mAP (Previously / Current / Both) (\uparrow)
Baseline (GLIP) | 37.0 / 42.0 | 35.5 / 42.1 / 22.6 / 32.4 | 34.9 / 32.4 / 19.2 / 28.2
Distillation | 39.4 / 41.1 | 36.6 / 27.9 / 10.5 / 19.2 | 36.6 / 18.0 / 6.2 / 14.1
Distillation + DW | 39.2 / 53.5 | 36.3 / 47.5 / 22.3 / 34.9 | 35.9 / 32.0 / 11.5 / 25.1
Distillation + CS | 39.1 / 51.9 | 36.9 / 46.7 / 22.6 / 34.6 | 36.1 / 32.4 / 12.8 / 25.9
Final: SKDF | 39.0 / 56.8 | 36.7 / 52.3 / 28.3 / 40.3 | 36.1 / 36.9 / 16.4 / 30.1
TABLE IV: Experiments on ablating each component. Our method significantly improves the detection performance of the GLIP baseline. DW represents the down-weight training loss function for unknown open-world supervision. When DW is none, we use the same loss function as the ground truth supervision for unknown open-world supervision. CS represents the cascade decoupling structure. When CS is none, we leverage the normal decoder structure as DDETR.
(a) Ablating different prompts.
Prompt | SPLIT | Method | mAP | UR | UR(sam)
LVIS | OWOD | GLIP | 42.2 | 37.0 | 8.04
LVIS | OWOD | Ours | 56.8 | 39.0 | 16.5
LVIS | MSCOCO | GLIP | 46.5 | 52.6 | 9.4
LVIS | MSCOCO | Ours | 69.4 | 60.9 | 20.0
LVIS w/o COCO | OWOD | GLIP | - | - | 7.5
LVIS w/o COCO | OWOD | Ours | 57.6 | 38.8 | 16.8
LVIS w/o COCO | MSCOCO | GLIP | - | - | 7.6
LVIS w/o COCO | MSCOCO | Ours | 70.6 | 58.9 | 20.4
COCO | OWOD | GLIP | 67.8 | 43.3 | 3.2
COCO | OWOD | Ours | 60.9 | 41.1 | 12.9
COCO | MSCOCO | GLIP | 73.1 | 64.8 | 5.3
COCO | MSCOCO | Ours | 74.8 | 69.6 | 16.7

(b) Ablating different structures (OWOD split).
Metrics | DDETR | CAT | Ours
mAP | 53.5 | 55.3 | 56.8
UR | 39.2 | 39.4 | 39.0
UR(sam) | 16.3 | 16.5 | 16.5

(c) Ablating different teachers (OWOD split).
Teacher | mAP | UR | UR(sam)
GLIP | 56.8 | 39.0 | 16.5
SAM | 19.4 | 33.5 | 38.3

(d) Qualitative ablation (figure): left (right) learns from the COCO (LVIS) prompt.

TABLE V: Detailed ablation experiments for our SKDF. (a) Ablation experiments on different text prompts, where UR denotes the unknown recall on the original unknown annotations and UR(sam) denotes the unknown recall on unknown annotations generated with SAM. (b) Ablation experiments on different detector structures. (c) Ablation experiments on different teacher models. (d) Qualitative ablation on the LVIS and COCO prompts.

6.4.1 Ablating components

To study the contribution of each component, we design the ablation experiments in TABLE IV. We set GLIP as our baseline. Distillation improves the performance on unknown object detection but reduces the detector's ability on known objects. Adding the down-weight training loss function significantly improves the performance on detecting known objects and on incremental object detection, yielding significant absolute gains of more than 10 points. As analyzed, the cascade decoupling structure alleviates the influence of unknown objects on known-object detection and the confusion between the categories and locations of the same objects; it also significantly improves the performance of detecting known objects, with absolute gains of more than 10 points. Moreover, the two components combine effectively, improving performance with absolute gains of up to 15.7, 21.1, and 16.0 points, without reducing the ability to detect unknown objects. Thus, each component plays a critical role in open-world object detection.

Task IDs → | Task 1 | Task 2 | Task 3 | Task 4
Metrics → | U-Recall (\uparrow) / mAP (Current) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | mAP (Previously / Current / Both)
UC-OWOD [37] | 2.4 / 50.7 | 3.4 / 33.1 / 30.5 / 31.8 | 8.7 / 28.8 / 16.3 / 24.6 | 25.6 / 15.9 / 23.2
ORE-EBUI [12] | 4.9 / 56.0 | 2.9 / 52.7 / 26.0 / 39.4 | 3.9 / 38.2 / 12.7 / 29.7 | 29.6 / 12.4 / 25.3
OW-DETR [13] | 7.5 / 59.2 | 6.2 / 53.6 / 33.5 / 42.9 | 5.7 / 38.3 / 15.8 / 30.8 | 31.4 / 17.1 / 27.8
OCPL [14] | 8.3 / 56.6 | 7.7 / 50.6 / 27.5 / 39.1 | 11.9 / 38.7 / 14.7 / 30.7 | 30.7 / 14.4 / 26.7
2B-OCD [15] | 12.1 / 56.4 | 9.4 / 51.6 / 25.3 / 38.5 | 11.6 / 37.2 / 13.2 / 29.2 | 30.0 / 13.3 / 25.8
Ours | 39.0 / 56.8 | 36.7 / 52.3 / 28.3 / 40.3 | 36.1 / 36.9 / 16.4 / 30.1 | 31.0 / 14.7 / 26.9
TABLE VI: State-of-the-art comparison for open-world object detection on the OWOD split. The comparison is shown in terms of U-Recall and known-class mAP. U-Recall measures the ability of the model to retrieve unknown object instances in the OWOD problem. For a fair comparison, we compare with the recently introduced methods. Our SKDF achieves improved metrics over the existing works across all unknown detection tasks, demonstrating our model's effectiveness for the OWOD problem. U-Recall cannot be computed for Task 4 due to the absence of unknown test annotations, since all 80 classes are known.
Task IDs → | Task 1 | Task 2 | Task 3 | Task 4
Metrics → | U-Recall (\uparrow) / mAP (Current) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | mAP (Previously / Current / Both)
ORE-EBUI [12] | 1.5 / 61.4 | 3.9 / 56.5 / 26.1 / 40.6 | 3.6 / 38.7 / 23.7 / 33.7 | 33.6 / 26.3 / 31.8
OW-DETR [13] | 5.7 / 71.5 | 6.2 / 62.8 / 27.5 / 43.8 | 6.9 / 45.2 / 24.9 / 38.5 | 38.2 / 28.1 / 33.1
Ours | 60.9 / 69.4 | 60.0 / 63.8 / 26.9 / 44.4 | 58.6 / 46.2 / 28.0 / 40.1 | 41.8 / 29.6 / 38.7
TABLE VII: State-of-the-art comparison for open-world object detection on the MS-COCO split. As the code and weights of UC-OWOD [37], OCPL [14], and 2B-OCD [15] are not publicly available, we cannot obtain their results or evaluate them on the MS-COCO split; thus, we only compare our model with ORE [12] and OW-DETR [13]. Although the MS-COCO split is more challenging, our model obtains an even more significant improvement on it than on the OWOD split. The significant metric improvements demonstrate that our SKDF can retrieve new knowledge beyond the closed set and is not limited by the category knowledge of existing objects.

6.4.2 Ablating different prompts

To ablate the impact of text prompts on knowledge distillation, we run experiments with different text prompts, i.e., LVIS, LVIS without COCO, and COCO, as shown in TABLE V(a). From the experimental results (i.e., the UR comparison between the LVIS and COCO prompts and the qualitative results in TABLE V(d)), we find that the unknown annotations in the test set are not adequate and mostly correspond to the original 80 COCO categories. Therefore, the existing "unknown recall" cannot evaluate the ability to detect unknown open-world objects beyond the COCO annotations. To evaluate that, we use SAM [58] to generate annotations for all object-like regions in the test set. Although these annotations are noisy and contain many non-object boxes, the recall on them can evaluate the unknown detection ability of detectors. The comparison between the LVIS and COCO prompts shows that less distilled knowledge reduces the impact on learning known objects. On the other hand, less distilled knowledge limits the performance of the knowledge distillation framework: as shown in the OWOD split with the COCO prompt, our detector does not learn better detection performance for the original unknown objects. The results with the LVIS w/o COCO prompt demonstrate that our framework discovers more unknown open-world objects; although we exclude all corresponding COCO categories from the LVIS prompt, our detector's detection ability for the original unknown objects does not appear to be impacted.

6.4.3 Ablating different detection transformer structures

As shown in TABLE V(b), we compare our structure with the original parallel structure [2] and the structure of [36]. CAT proposes a cascade decoupled decoding scheme that uses a shared decoder to decode twice, once for localization and once for identification. Unlike them, we employ two separate decoders to decouple the decoding process and specifically train these two decoders for localization and recognition, respectively. The results demonstrate that our structure is better at alleviating the influence of distilled knowledge on known objects. Our decoupled architecture outperforms the decoupled decoding scheme of CAT [36] by 1.5 points in known mAP. Moreover, our architecture surpasses the conventional DDETR [2] structure by 3.3 points.

6.4.4 Ablating different teachers

In TABLE V(c), we compare different teachers, i.e., GLIP [26] and SAM [58]. When distilling open-world knowledge from SAM, we leverage the "segment anything" mode, which uses predefined grid points and a selection strategy to generate pseudo labels. For SAM, due to the absence of supervision confidence, we cannot use the down-weight loss function; therefore, the known-object detection ability is severely reduced. In addition, due to SAM's tendency to over-detect, the detector cannot distill a better unknown detection ability.

6.5 Comparison With SOTA OWOD Methods

In this section, we conduct a detailed comparison with existing state-of-the-art methods on two existing benchmarks: OWOD and MS-COCO SPLIT, as well as on two benchmarks we propose: IntensiveSet\spadesuit and StandardSet\heartsuit. In addition, we compare our model with existing methods on the Incremental Object Detection task.

6.5.1 OWOD SPLIT

The results compared with state-of-the-art methods on the OWOD split for the OWOD problem are shown in TABLE VI. The ability of our model to detect unknown objects, quantified by U-Recall, is more than $3\times$ that reported by previous state-of-the-art OWOD methods. Compared with 2B-OCD [15], which has the highest U-Recall of 12.1, 9.4, and 11.6 on Tasks 1, 2, and 3, ours achieves 39.0, 36.7, and 36.1 in the corresponding tasks, yielding significant absolute gains of 26.9, 27.3, and 24.5, respectively. Benefiting from the cascade decoder structure and the down-weight training loss, which mitigate the effect of unknown objects on detecting known objects, our model's performance on known objects is also superior to most state-of-the-art methods.

6.5.2 MS-COCO SPLIT

We report the results on the MS-COCO split in TABLE VII. The MS-COCO split mitigates data leakage across tasks and assigns more data to each task, and our model receives an even more significant boost than on the OWOD split. Our model's unknown object detection capability, quantified by U-Recall, is almost $10\times\sim 11\times$ that reported by previous state-of-the-art OWOD methods. Compared with OW-DETR's U-Recall of 5.7, 6.2, and 6.9 on Tasks 1, 2, and 3, ours achieves 60.9, 60.0, and 58.6 in the corresponding tasks, yielding significant absolute gains of 55.2, 53.8, and 51.7, respectively. This demonstrates that our model has a stronger ability to retrieve new knowledge and detect unknown objects when faced with more difficult tasks.

Dataset → | IntensiveSet\spadesuit | StandardSet\heartsuit
Group / Method | Known mAP (\uparrow) / Unknown Recall / Precision / mAP (\uparrow) | Known mAP (\uparrow) / Unknown Recall / Precision / mAP (\uparrow)
(I) OWDETR [13] | 36.9 / 5.6 / 7.3 / 1.2 | 54.6 / 16.9 / 2.3 / 0.6
(I) CAT [36] | 38.6 / 17.4 / 9.5 / 7.6 | 55.4 / 48.8 / 2.7 / 4.3
(I) UnSniffer [38] | 39.5 / 9.4 / 22.2 / 2.0 | 53.7 / 34.6 / 21.5 / 12.5
(I) UnSniffer\dagger [38] | 39.5 / 12.4 / 12.0 / 2.6 | 53.7 / 41.2 / 9.6 / 11.6
(II) SKDF-DW | 27.7 / 40.2 / 15.6 / 14.8 | 44.0 / 75.1 / 3.0 / 17.1
(II) SKDF-CS | 30.7 / 39.7 / 16.2 / 16.8 | 48.3 / 75.2 / 3.1 / 9.2
(II) SKDF | 32.3 / 39.6 / 16.2 / 16.7 | 48.7 / 74.3 / 3.0 / 7.4
(II) SKDF\ddagger | 32.3 / 35.8 / 37.9 / 24.5 | 48.7 / 66.8 / 11.4 / 24.4
TABLE VIII: State-of-the-art comparison for open-world object detection on IntensiveSet\spadesuit and StandardSet\heartsuit. As the code and weights of UC-OWOD [37], OCPL [14], and 2B-OCD [15] are not publicly available, we cannot obtain their results or evaluate them on our proposed benchmarks. For a fair comparison, UnSniffer\dagger means that we remove the specific unknown post-processing operations. SKDF\ddagger denotes that we add the same unknown post-processing operations as UnSniffer to our SKDF.

6.5.3 Intensive Testing Set

The comparison results on IntensiveSet\spadesuit are shown in the left half of TABLE VIII. The experimental results indicate that our model's performance in detecting unknown objects surpasses that of existing state-of-the-art methods, achieving more than double the highest unknown recall, obtained by CAT [36], exceeding the highest unknown precision of UnSniffer [38] by 4.2 points, and surpassing the existing methods in unknown mAP by more than threefold. Meanwhile, our model is highly adaptable to unknown post-processing techniques: when incorporating the post-processing methods from UnSniffer [38], our model's unknown precision further increases by 21.7 points and its unknown mAP by 7.8 points. With this post-processing, SKDF surpasses UnSniffer in unknown precision by more than 15 points. However, although our components can mitigate the impact of unknown objects on the detection of known objects, there is still a certain gap between our model and existing state-of-the-art models on known objects.

6.5.4 Standard Testing Set

In the right half of TABLE VIII, we compare SKDF with the state-of-the-art methods on StandardSet\heartsuit. Our model surpasses the existing detection models on the vast majority of evaluation metrics, except for known-object detection performance and unknown precision. For unknown precision, UnSniffer outperforms our SKDF. This is because StandardSet\heartsuit primarily comprises common COCO categories beyond the predefined VOC classes, and UnSniffer is specifically tailored to these category distributions. However, this does reflect a shortcoming of our model. In particular, comparing our model's performance on IntensiveSet\spadesuit and StandardSet\heartsuit shows that, although it has very good recall for unknown objects, its ability to recognize unknown objects still needs to be improved.

6.5.5 Incremental Object Detection

To intuitively present our detector's ability to detect object instances, we compare it to [59, 60, 12, 13] on the incremental object detection (IOD) task, without assistance from the large pre-trained vision-language grounding model. We evaluate three standard settings, where a group of classes (10, 5, and the last class) is introduced incrementally to a detector trained on the remaining classes (10, 15, and 19), based on the PASCAL VOC 2007 dataset [25]. As the results in TABLE IX show, our model outperforms the existing methods by a large margin on all three settings, indicating the power of the cascade detection transformer for IOD.

10 + 10 Setting | aero | cycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | bike | person | plant | sheep | sofa | train | tv | mAP
ILOD | 69.9 | 70.4 | 69.4 | 54.3 | 48 | 68.7 | 78.9 | 68.4 | 45.5 | 58.1 | 59.7 | 72.7 | 73.5 | 73.2 | 66.3 | 29.5 | 63.4 | 61.6 | 69.3 | 62.2 | 63.2
Faster ILOD | 72.8 | 75.7 | 71.2 | 60.5 | 61.7 | 70.4 | 83.3 | 76.6 | 53.1 | 72.3 | 36.7 | 70.9 | 66.8 | 67.6 | 66.1 | 24.7 | 63.1 | 48.1 | 57.1 | 43.6 | 62.1
ORE - (CC + EBUI) | 53.3 | 69.2 | 62.4 | 51.8 | 52.9 | 73.6 | 83.7 | 71.7 | 42.8 | 66.8 | 46.8 | 59.9 | 65.5 | 66.1 | 68.6 | 29.8 | 55.1 | 51.6 | 65.3 | 51.5 | 59.4
ORE - EBUI | 63.5 | 70.9 | 58.9 | 42.9 | 34.1 | 76.2 | 80.7 | 76.3 | 34.1 | 66.1 | 56.1 | 70.4 | 80.2 | 72.3 | 81.8 | 42.7 | 71.6 | 68.1 | 77 | 67.7 | 64.5
OW-DETR | 75.4 | 63.9 | 57.9 | 50.0 | 52.0 | 70.9 | 79.5 | 72.4 | 44.3 | 57.9 | 59.7 | 73.5 | 77.7 | 75.2 | 76.2 | 44.9 | 68.8 | 65.4 | 79.3 | 69.0 | 65.7
Ours | 77.1 | 72.3 | 74.5 | 53.4 | 57.4 | 78.1 | 78.7 | 83.9 | 46.2 | 71.4 | 59.5 | 77.4 | 73.3 | 76.6 | 73.3 | 39.7 | 70.6 | 59.0 | 78.4 | 70.9 | 68.6
15 + 5 Setting | aero | cycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | bike | person | plant | sheep | sofa | train | tv | mAP
ILOD | 70.5 | 79.2 | 68.8 | 59.1 | 53.2 | 75.4 | 79.4 | 78.8 | 46.6 | 59.4 | 59 | 75.8 | 71.8 | 78.6 | 69.6 | 33.7 | 61.5 | 63.1 | 71.7 | 62.2 | 65.8
Faster ILOD | 66.5 | 78.1 | 71.8 | 54.6 | 61.4 | 68.4 | 82.6 | 82.7 | 52.1 | 74.3 | 63.1 | 78.6 | 80.5 | 78.4 | 80.4 | 36.7 | 61.7 | 59.3 | 67.9 | 59.1 | 67.9
ORE - (CC + EBUI) | 65.1 | 74.6 | 57.9 | 39.5 | 36.7 | 75.1 | 80 | 73.3 | 37.1 | 69.8 | 48.8 | 69 | 77.5 | 72.8 | 76.5 | 34.4 | 62.6 | 56.5 | 80.3 | 65.7 | 62.6
ORE - EBUI | 75.4 | 81 | 67.1 | 51.9 | 55.7 | 77.2 | 85.6 | 81.7 | 46.1 | 76.2 | 55.4 | 76.7 | 86.2 | 78.5 | 82.1 | 32.8 | 63.6 | 54.7 | 77.7 | 64.6 | 68.5
OW-DETR | 78.0 | 80.7 | 79.4 | 70.4 | 58.8 | 65.1 | 84.0 | 86.2 | 56.5 | 76.7 | 62.4 | 84.8 | 85.0 | 81.8 | 81.0 | 34.3 | 48.2 | 57.9 | 62.0 | 57.0 | 69.4
Ours | 79.5 | 85.1 | 83.1 | 73.1 | 62.5 | 68.7 | 83.0 | 88.4 | 55.5 | 78.3 | 69.7 | 83.0 | 86.6 | 73.2 | 78.8 | 30.8 | 67.6 | 60.8 | 76.0 | 58.7 | 72.1
19 + 1 Setting | aero | cycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | bike | person | plant | sheep | sofa | train | tv | mAP
ILOD | 69.4 | 79.3 | 69.5 | 57.4 | 45.4 | 78.4 | 79.1 | 80.5 | 45.7 | 76.3 | 64.8 | 77.2 | 80.8 | 77.5 | 70.1 | 42.3 | 67.5 | 64.4 | 76.7 | 62.7 | 68.2
Faster ILOD | 64.2 | 74.7 | 73.2 | 55.5 | 53.7 | 70.8 | 82.9 | 82.6 | 51.6 | 79.7 | 58.7 | 78.8 | 81.8 | 75.3 | 77.4 | 43.1 | 73.8 | 61.7 | 69.8 | 61.1 | 68.5
ORE - (CC + EBUI) | 60.7 | 78.6 | 61.8 | 45 | 43.2 | 75.1 | 82.5 | 75.5 | 42.4 | 75.1 | 56.7 | 72.9 | 80.8 | 75.4 | 77.7 | 37.8 | 72.3 | 64.5 | 70.7 | 49.9 | 64.9
ORE - EBUI | 67.3 | 76.8 | 60 | 48.4 | 58.8 | 81.1 | 86.5 | 75.8 | 41.5 | 79.6 | 54.6 | 72.8 | 85.9 | 81.7 | 82.4 | 44.8 | 75.8 | 68.2 | 75.7 | 60.1 | 68.8
OW-DETR | 82.2 | 80.7 | 73.9 | 56.0 | 58.6 | 72.1 | 82.4 | 79.6 | 48.0 | 72.8 | 64.2 | 83.3 | 83.1 | 82.3 | 78.6 | 42.1 | 65.5 | 55.4 | 82.9 | 60.1 | 70.2
Ours | 83.6 | 85.7 | 77.1 | 61.5 | 58.9 | 74.3 | 86.3 | 81.5 | 52.2 | 78.4 | 71.4 | 81.9 | 84.6 | 80.2 | 80.8 | 39.9 | 68.3 | 63.3 | 84.6 | 63.0 | 72.9
TABLE IX: Detailed comparison with existing approaches on PASCAL VOC for incremental object detection. We only use our open-world object detector, without assistance from the large pre-trained vision-language grounding model. Evaluation is performed on three standard settings, where a group of classes (10, 5, and the last class) is introduced incrementally to a detector trained on the remaining classes (10, 15, and 19). Our model performs favorably against existing approaches in all three settings.
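For illustration, the class splits behind the three IOD settings in TABLE IX can be expressed as below. This is a minimal sketch under the assumption that classes follow the standard PASCAL VOC ordering shown in the table header; it is not the exact data pipeline of our released code.

```python
# PASCAL VOC 2007 classes in the order used in TABLE IX.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def incremental_split(num_base):
    """Return (base_classes, new_classes) for a 'num_base + rest' IOD setting."""
    return VOC_CLASSES[:num_base], VOC_CLASSES[num_base:]

# The three standard settings: 10+10, 15+5, and 19+1.
for num_base in (10, 15, 19):
    base, new = incremental_split(num_base)
    print(f"{num_base}+{len(new)}: first train on {len(base)} classes, then add {new}")
```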

6.5.6 Qualitative Comparison with OWOD method

We exhibit more qualitative results in Fig. 6 and compare them with the state-of-the-art OWOD detectors. The results show that our model can detect most of the potential unknown objects in open-world scenes, far exceeding existing models. We define the 20 categories of the PASCAL VOC [25] dataset as known classes, and all categories outside of these are considered unknown. As shown in the first column, our model accurately detects all the foreground categories in the scene and identifies everything outside the predefined known classes as unknown, such as the books on the table and the teddy bear, which are not detected by the state-of-the-art model. As shown in the second column, our model accurately detects even objects such as the picture frames on the wall. In the third column, the state-of-the-art model mistakenly identifies the characters in the background as unknown objects, whereas our model is not confused by this. In the last column, our model again greatly surpasses the detection performance of the state-of-the-art model.

Figure 6: Visualization results comparing the SOTA method and our model (known objects in blue, unknown objects in yellow). The categories of PASCAL VOC [25] are set as known. Our model significantly outperforms the SOTA (OW-DETR) for open-world object detection, accurately detecting almost all unknown objects in the open scene, including clothes and hats worn by people, while most of the unknown objects detected by the SOTA model correspond to background.

7 Societal Impact, Limitations and Future Works

Open-world object detection makes artificial intelligence better equipped to handle the problems it faces in real life. It raises object detection to a cognitive level: the model must do more than simply remember the objects it has learned; it must reason more deeply about the scene. In applications such as autonomous driving, the significance of open-world object detection becomes even more pronounced. In such scenarios, vehicles need to rapidly and accurately identify and comprehend various objects and obstacles on the road, including but not limited to pedestrians, other vehicles, traffic signals, and road signs. Breakthroughs in open-world object detection will make autonomous driving systems more intelligent, enabling them to handle unforeseen or rare situations rather than being limited to pre-trained object categories.

Although our results demonstrate significant improvements over existing state-of-the-art methods, the absolute performance is still low due to the challenging nature of the open-world detection problem. In this paper, we mainly focus on enhancing the model’s ability to explore unknown classes. However, the detection confidence and recognition ability of our model for unknown objects still need improvement, and this is what we will strive for in the future. Currently, our model still sometimes mistakes background regions for unknown objects, and the benchmark we proposed is not yet comprehensive, as it includes only a single-moment task.

In future work, these two issues will be the main focus of our research. In addition, post-processing for the predicted boxes of unknown objects is urgently needed, so developing an NMS-like operation dedicated to unknown objects is another direction we plan to pursue; a minimal class-agnostic variant is sketched below. We will pursue these goals carefully so that open-world object detection algorithms can be integrated into everyday use and contribute positively to society.
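To make the NMS-like direction concrete, the following is a minimal sketch of a greedy class-agnostic suppression over unknown-object boxes. The box format, scores, and IoU threshold are illustrative assumptions; this is not the post-processing used by UnSniffer [38], nor the one shipped with our code.

```python
def _iou(a, b):
    # Boxes are [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def unknown_nms(boxes, scores, iou_thr=0.6):
    """Greedy class-agnostic suppression for unknown-object predictions.

    boxes:  list of [x1, y1, x2, y2]
    scores: unknown-object confidences, one per box
    Returns indices of the kept boxes, highest-scoring first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap an already-kept box.
        if all(_iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```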

8 Conclusions

In this paper, we start from the observation that simple knowledge distillation can transfer the open-world knowledge of a large pre-trained vision-language grounding model to the specialized OWOD task, and we propose a simple framework with surprisingly good performance. We further propose a down-weight training loss for the detector’s mixed learning of known ground truth, distilled unknown knowledge, and the pseudo unknown labels produced by the OWOD algorithm, mitigating the effect of the distilled knowledge on known-object detection. Besides, a cascade decoupled detection transformer structure is proposed to alleviate the influence of unknown objects on detecting known objects. Last but not least, we propose two novel benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit according to the complexity of their testing scenarios, to comprehensively evaluate the ability of open-world detectors to detect unknown open-world objects. Extensive experiments on existing and proposed benchmarks demonstrate the effectiveness of our framework: our model exceeds both the distilled large pre-trained vision-language grounding model and state-of-the-art methods on OWOD and IOD.

Acknowledgments

This work is supported by National Natural Science Foundation of China (grant No.61871106 and No.61370152), Key R&D projects of Liaoning Province, China (grant No.2020JH2/10100029), and the Open Project Program Foundation of the Key Laboratory of Opto-Electronics Information Processing, Chinese Academy of Sciences (OEIP-O-202002).

References

  • [1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [2] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision, pp. 213–229, Springer, 2020.
  • [4] Y. Lu, X. Chen, Z. Wu, and J. Yu, “Decoupled metric network for single-stage few-shot object detection,” IEEE Transactions on Cybernetics, 2022.
  • [5] Y. Pang, T. Wang, R. M. Anwer, F. S. Khan, and L. Shao, “Efficient featurized image pyramid network for single shot detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7336–7344, 2019.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
  • [7] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
  • [9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
  • [10] X. Chen, J. Yu, S. Kong, Z. Wu, and L. Wen, “Joint anchor-feature refinement for real-time accurate object detection in images and videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 2, pp. 594–607, 2020.
  • [11] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” arXiv preprint arXiv:1905.05055, 2019.
  • [12] K. J. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, “Towards open world object detection,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5826–5836, 2021.
  • [13] A. Gupta, S. Narayan, K. Joseph, S. Khan, F. S. Khan, and M. Shah, “Ow-detr: Open-world detection transformer,” in CVPR, 2022.
  • [14] J. Yu, L. Ma, Z. Li, Y. Peng, and S. Xie, “Open-world object detection via discriminative class prototype learning,” in 2022 IEEE International Conference on Image Processing (ICIP), pp. 626–630, IEEE, 2022.
  • [15] Y. Wu, X. Zhao, Y. Ma, D. Wang, and X. Liu, “Two-branch objectness-centric open world detection,” in Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, pp. 35–40, 2022.
  • [16] O. Zohar, K.-C. Wang, and S. Yeung, “Prob: Probabilistic objectness for open world object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11444–11453, 2023.
  • [17] J. Yu, L. Ma, Z. Li, Y. Peng, and S. Xie, “Open-world object detection via discriminative class prototype learning,” arXiv preprint arXiv:2302.11757, 2023.
  • [18] Y. Ma, H. Li, Z. Zhang, J. Guo, S. Zhang, R. Gong, and X. Liu, “Annealing-based label-transfer learning for open world object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11454–11463, 2023.
  • [19] J. Dong, Z. Zhang, S. He, Y. Liang, Y. Ma, J. Yu, R. Zhang, and B. Li, “A parallel open-world object detection framework with uncertainty mitigation for campus monitoring,” Applied Sciences, vol. 13, no. 23, p. 12806, 2023.
  • [20] S. Jamonnak, J. Guo, W. He, L. Gou, and L. Ren, “Ow-adapter: Human-assisted open-world object detection with a few examples,” IEEE Transactions on Visualization and Computer Graphics, 2023.
  • [21] X. Zhao, X. Liu, Y. Shen, Y. Ma, Y. Qiao, and D. Wang, “Revisiting open world object detection,” arXiv preprint arXiv:2201.00471, 2022.
  • [22] Y. Wang, Z. Yue, X.-S. Hua, and H. Zhang, “Random boxes are open-world object detectors,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6233–6243, 2023.
  • [23] D. Kim, T.-Y. Lin, A. Angelova, I. S. Kweon, and W. Kuo, “Learning open-world object proposals without learning to classify,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5453–5460, 2022.
  • [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014.
  • [25] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [26] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975, 2022.
  • [27] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. H. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, “Glipv2: Unifying localization and vision-language understanding,” in Advances in Neural Information Processing Systems, 2022.
  • [28] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790, 2021.
  • [29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  • [30] L. Yao, J. Han, Y. Wen, X. Liang, D. Xu, W. Zhang, Z. Li, C. Xu, and H. Xu, “Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection,” arXiv preprint arXiv:2209.09407, 2022.
  • [31] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
  • [32] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3967–3976, 2019.
  • [33] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 1365–1374, 2019.
  • [34] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, “Improved knowledge distillation via teacher assistant,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 5191–5198, 2020.
  • [35] B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang, “Decoupled knowledge distillation,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp. 11953–11962, 2022.
  • [36] S. Ma, Y. Wang, J. Fan, Y. Wei, T. H. Li, H. Liu, and F. Lv, “Cat: Localization and identification cascade detection transformer for open-world object detection,” arXiv preprint arXiv:2301.01970, 2023.
  • [37] Z. Wu, Y. Lu, X. Chen, Z. Wu, L. Kang, and J. Yu, “Uc-owod: Unknown-classified open world object detection,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, pp. 193–210, Springer, 2022.
  • [38] W. Liang, F. Xue, Y. Liu, G. Zhong, and A. Ming, “Unknown sniffer for object detection: Don’t turn a blind eye to unknown objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3230–3239, 2023.
  • [39] A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5356–5364, 2019.
  • [40] X. Gu, T. Lin, W. Kuo, and Y. Cui, “Zero-shot detection via vision and language knowledge distillation,” arXiv preprint arXiv:2104.13921, 2021.
  • [41] J. Jeong, S. Lee, J. Kim, and N. Kwak, “Consistency-based semi-supervised learning for object detection,” Advances in neural information processing systems, vol. 32, 2019.
  • [42] P. Tang, C. Ramaiah, Y. Wang, R. Xu, and C. Xiong, “Proposal learning for semi-supervised object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2291–2301, 2021.
  • [43] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, and T. Pfister, “A simple semi-supervised learning framework for object detection,” arXiv preprint arXiv:2005.04757, 2020.
  • [44] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069, 2021.
  • [45] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi-supervised object detection,” arXiv preprint arXiv:2102.09480, 2021.
  • [46] Y. Li, D. Huang, D. Qin, L. Wang, and B. Gong, “Improving object detection with selective self-supervised self-training,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX, pp. 589–607, Springer, 2020.
  • [47] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666, 2019.
  • [48] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
  • [49] A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-shot object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 384–400, 2018.
  • [50] A. Dhamija, M. Gunther, J. Ventura, and T. Boult, “The overlooked elephant of object detection: Open set,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1021–1030, 2020.
  • [51] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, “Dropout sampling for robust object detection in open-set conditions,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3243–3249, IEEE, 2018.
  • [52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009.
  • [54] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.
  • [55] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
  • [56] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic detr: End-to-end object detection with dynamic attention,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2988–2997, 2021.
  • [57] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [58] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [59] K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object detectors without catastrophic forgetting,” in Proceedings of the IEEE international conference on computer vision, pp. 3400–3409, 2017.
  • [60] C. Peng, K. Zhao, and B. C. Lovell, “Faster ilod: Incremental learning for object detectors based on faster rcnn,” Pattern recognition letters, vol. 140, pp. 109–115, 2020.
[Uncaptioned image] Shuailei Ma received the B.S. degree from Northeastern University, Shenyang, China, in 2022. Currently, he is pursuing a Ph.D. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Yuefeng Wang received the B.S. degree from Nanjing University of Information Science and Technology, Nanjing, China, in 2021. Currently, he is pursuing an M.Sc. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Ying Wei received the B.Sc. degree from Harbin Institute of Technology, China in 1990, and received the M.Sc. and the Ph.D. degree from Northeastern University, China in 1997 and 2001, respectively. Her research interests include image processing & pattern recognition, computer vision, medical image computation and analysis, and deep learning, etc. She is now a full-time professor at Northeastern University, China. She has more than 60 journal papers and 5 granted patents in her research fields.
[Uncaptioned image] Jiaqi Fan received a B.S. degree from Northeastern University, China, in 2022. Currently, he is pursuing an M.Sc. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Enming Zhang received a B.S. degree from Northeastern University, China, in 2022. Currently, he is pursuing an M.Sc. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Xinyu Sun received the B.E. degree in Automation Science and Engineering from South China University of Technology, China, in 2021. He is working toward the M.Sc. degree in the School of Software Engineering, South China University of Technology, China. His research interests include embodied AI and multi-modal video understanding.
[Uncaptioned image] Peihao Chen received the B.E. degree in Automation Science and Engineering from South China University of Technology, China, in 2018. He is working toward the PhD degree in the School of Software Engineering, South China University of Technology, China. His research interests include embodied AI and multi-modal video understanding.