
SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

Shuailei Ma, Yuefeng Wang, Ying Wei, Jiaqi Fan, Enming Zhang, Xinyu Sun, and Peihao Chen

Shuailei Ma, Yuefeng Wang, Jiaqi Fan, and Enming Zhang are with the College of Information Science and Engineering, Northeastern University, Shenyang, China, 110819. E-mail: {xiaomabufei, wangyuefeng0203, f1074979751, z1693663290}@gmail.com
Ying Wei is the corresponding author, with the College of Information Science and Engineering, Northeastern University, Shenyang, China, 110819. E-mail: weiying@ise.neu.edu.cn
Xinyu Sun and Peihao Chen are with the School of Software Engineering, South China University of Technology, China, 510641. E-mail: {csxinyusun, phchencs}@gmail.com
Abstract

Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) benchmarks and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and incrementally learn them. The large pre-trained vision-language grounding models (VLM, e.g., GLIP) have rich knowledge about the open world, but are limited by text prompts and cannot localize indescribable objects. However, there are many detection scenarios in which pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the combination of a simple knowledge distillation approach and the automatic pseudo-labeling mechanism in OWOD can achieve better performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures for known objects, leading to catastrophic forgetting. To alleviate these problems, we propose the down-weight loss function for knowledge distillation from vision-language to single vision modality. Meanwhile, we propose the cascade decoupled decoding structure that decouples the learning of localization and recognition to reduce the impact of category interactions of known and unknown objects on the localization learning process. Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world, we propose two benchmarks, which we name “StandardSet\heartsuit” and “IntensiveSet\spadesuit” respectively, based on the complexity of their testing scenarios. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods. The code and proposed dataset are available at https://github.com/xiaomabufei/SKDF.

Index Terms:
Open World Object Detection, Knowledge Distillation Framework, Down-Weight Loss Function, Decoupled Cascade Decoding Structure.

1 Introduction

Open-world object detection (OWOD) is a more practical detection problem in computer vision, facilitating the development of object detection (OD) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] in the real world. Within the OWOD paradigm, the model's lifespan is driven by an iterative learning process, as shown in Fig. 2 (a). At each episode, the model trained on the data with known object annotations is expected to detect known objects and all potential unknown objects. Human annotators then gradually label a few of the tagged unknown classes of interest. Given these newly added annotations, the model continues to incrementally update its knowledge without retraining from scratch.

Figure 1: SKDF leverages the proposed down-weight training strategy to distill open-world knowledge from the large open-vocabulary pre-trained vision-language model into an expert open-world detector with faster detection speed and better performance, using small amounts of data.

In existing works [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], open-world detectors are expected to learn about the open world from several small-scale datasets [24, 25]. However, the annotations of these datasets are too few to provide adequate object attributes for the model, making it difficult for the model to achieve the ideal goal through these datasets alone. The large pre-trained vision-language grounding models (VLM) [26, 27, 28, 29, 30] have rich knowledge of the open world owing to their enormous numbers of parameters, huge open-world training datasets, and training costs.

However, their detection cannot proceed without text prompts. Before detection, the text prompts of all objects to be detected must be listed in advance, and objects whose text prompts are not listed cannot be detected. There are many detection scenarios in which pre-defined language descriptions are unavailable during inference. Therefore, it is challenging to equip the VLM with language-agnostic unknown object detection capability. In addition, their detection speed is also a drawback for the following reasons: i) the huge number of parameters and FLOPs; ii) the large pre-trained vision-language grounding model can only infer with several text prompts at a time to maintain detection performance, so it must run inference many times when the number of prompts is large.

Humans' ability to recognize objects they have not seen before largely depends on their brains' knowledge base. Inspired by how humans face the open world, we propose to learn from the large pre-trained vision-language grounding models by knowledge distillation for OWOD. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector, as shown in Fig. 1. We were surprised to observe that a simple knowledge distillation approach for the OWOD algorithm can achieve better performance for unknown object detection than the large open-vocabulary pre-trained vision-language model, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures for known objects, due to the following challenges:

(I) Different from traditional knowledge distillation [31, 32, 33, 34, 35], the teacher and student involve different modalities and have different training manners, structures, and inference procedures. Meanwhile, the learning process of OWOD algorithms [12, 13, 36, 15, 14, 37] always has its own specific open-world pseudo-unknown labels. Therefore, the existing distillation objectives do not work, and it is difficult to mix the learning of known ground truth, unknown distilled knowledge, and unknown pseudo-knowledge. The performance of detecting known objects is also crucial for OWOD. Through experiments, we reveal that the direct use of distilled knowledge dramatically affects the model's ability to learn from the original annotations, and the model's performance in detecting known objects is damaged substantially. To alleviate this, we propose the down-weight training loss for distillation training, which utilizes the distilled labels' object confidence from the large pre-trained vision-language grounding model and the searched pseudo objectness to reduce the weight of the unknown loss in the total loss during training, as shown in Fig. 2 (b).

(II) The presence of objects with features highly similar to known classes within the “unknown objects” can greatly affect the process of open-world object identification. In particular, the inclusion of unknown objects impacts the detector's performance on known objects. This issue affects not only the identification process but also the localization process for models that use coupled information for both tasks. Therefore, we propose decoupling the detection learning process. However, a fully decoupled structure completely separates localization and identification, leading to many potential problems (e.g., a mismatch between localization and identification results). Therefore, we propose a cascade structure that decouples the detection process via two decoders and connects them in a cascade manner, as shown in Fig. 2 (a). In this structure, foreground localization can be protected from category knowledge because the identification loss is diluted by the latter decoder. Moreover, the identification process can utilize the localization information because it takes the former decoder's output embeddings as its input queries.

Figure 2: Overall scheme of the proposed framework. (a) illustrates the lifespan of the cascade open-world object detector: the model detects known objects and potential unknowns, human annotators progressively label some unknown classes, and the model incrementally updates its knowledge using these new labels without fully retraining. (b) exhibits the down-weight training strategy, which leverages the objectness to separate the learning weights of the annotated known knowledge, distilled open-world knowledge, and searched pseudo-open-world unknown knowledge. (c) describes the distillation procedure, which leverages the large-scale vocabulary prompt to mine the open-world knowledge in the open-vocabulary vision-language pre-trained model.

Since existing datasets [24, 25] are manually annotated with predefined categories, current benchmarks cannot comprehensively measure the detection performance of open-world detectors for unknown objects due to the lack of bounding box annotations for unknown entities in the test scenarios. UnSniffer [38] chooses testing images with only a few unknown foreground objects (no more than 3) from the MSCOCO dataset [24] to evaluate the detector's ability to identify unknown objects. Inspired by this, we propose two benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit respectively, based on the complexity of their testing scenarios. Unlike UnSniffer, which follows the manual annotation guidelines of the MSCOCO [24] dataset, we draw on the more detailed annotation criteria of finer-grained datasets [39]: we manually select suitable evaluation scenes (with no fewer than three unknown objects) and provide more meticulous manual annotations of unknown objects. Moreover, our IntensiveSet\spadesuit features an average of over 33 unknown open-world object instances per test scene. Our contributions can be summarized fourfold:

  •  We observe that simple knowledge distillation can transfer the open-world knowledge in the large pre-trained vision-language grounding model to the specialized OWOD task, and we propose a simple framework with surprisingly good performance.

  •  To mitigate the effect of distilled knowledge on the detection performance for known objects, we propose the down-weight training loss function for the detector's mixed learning of known ground truth, distilled unknown knowledge, and the pseudo unknown knowledge in the OWOD algorithm. Meanwhile, a cascade decoupled detection transformer structure is proposed to alleviate the influence of unknown objects on detecting known objects.

  •  We propose two novel benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit respectively based on the complexity of their testing scenarios, to comprehensively evaluate the ability of open-world detectors to detect unknown open-world objects.

  •  Our extensive experiments on existing and proposed benchmarks demonstrate the effectiveness of our framework. Our model exceeds the distilled large pre-trained vision-language grounding model on OWOD, and state-of-the-art methods on OWOD and IOD.

2 Related Works

Large pre-trained vision-language models: Recently, inspired by the success of vision-language (VL) pre-training methods [29] and their good zero-shot ability, [40, 26, 28, 27, 30] attempted to perform zero-shot detection on a larger range of domains by using pre-trained vision-language models. [40] proposed a zero-shot detection method that distills knowledge from a pre-trained vision-language image classification model. [26] tried to align region and language features using a dot-product operation and could be trained end-to-end on grounding and detection data. [28] proposed an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. [30] proposed a parallel visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary. However, VLMs require predefined object categories to drive the model, so they cannot be directly applied to OWOD tasks. Moreover, VLMs require substantial computational resources and a long runtime.

Semi-Supervised Learning For Object Detection: In this area, there are two dominant approaches: consistency-based methods [41, 42] and pseudo-label methods [43, 44, 45, 46]. STAC [43] deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. Xu et al. [44] proposed an end-to-end pseudo-labeling framework that avoids the complicated training process and also achieves better performance. Liu et al. [45] improved the pseudo-label generation model via a teacher-student mutual learning regimen and addressed the crucial imbalance issue in generated pseudo-labels. However, these distillation methods target single-modality closed-set object detection tasks. This paper introduces a simple framework for distilling from a multimodal open-vocabulary model to a single-modality open-world model.

Figure 3: Overall architecture of the proposed cascade decoupled open-world detector. The proposed detector consists of a multi-scale feature extractor, the decoupled cascade transformer decoders, and the regression prediction branch. The multi-scale feature extractor comprises a mainstream feature extraction backbone and a deformable transformer encoder for extracting multi-scale features. The decoupled cascade transformer decoders are deformable transformer decoders and decouple the localization and identification processes in a cascade way. The regression prediction branch contains the bounding box regression branch $F_{reg}$, the novelty objectness branch $F_{obj}$, and the novelty classification branch $F_{cls}$. The novelty classification and objectness branches are single-layer feed-forward networks (FFN), and the regression branch is a 3-layer FFN.

Open-World Object Detection (OWOD): ORE [12] introduced the OWOD task and adapted the Faster R-CNN model with feature-space contrastive clustering, an RPN-based unknown detector, and an Energy Based Unknown Identifier (EBUI) for the OWOD objective. Recently, several works [14, 15, 37] attempted to extend ORE. OCPL [14] was proposed to learn discriminative embeddings of known classes in the feature space to minimize the overlapping distributions of known and unknown classes. 2B-OCD [15] proposed a two-branch objectness-centric open-world object detection framework consisting of a bias-guided detector and an objectness-centric calibrator. OW-DETR [13] proposed to utilize a pseudo-labeling scheme to supervise unknown object detection, where unmatched object proposals with high backbone activation are selected as unknown objects. Existing methods limit the training of the model to a small subset of object annotations, which fails to fully teach the model how to recognize objects and foreground. In this paper, we propose distilling open-world knowledge from large pre-trained models to scale up the open-world knowledge in the training data. Experiments show that even a simple distillation framework can stimulate the model's ability to distinguish unknown objects from the background without the need for vast amounts of data, even achieving capabilities that surpass those of the teacher.

3 Problem Formulation

Let $\mathcal{K}^{t}=\{1,2,\ldots,C\}$ denote the set of known object classes and $\mathcal{U}^{t}=\{C+1,\ldots\}$ denote the unknown classes which might be encountered at test time, at time $t$. The known object categories $\mathcal{K}^{t}$ are labeled in the dataset $\mathcal{D}^{t}=\{\mathcal{J}^{t},\mathcal{L}^{t}\}$, where $\mathcal{J}^{t}$ denotes the input images and $\mathcal{L}^{t}$ denotes the corresponding labels at time $t$. The training image set consists of $M$ images $\mathcal{J}^{t}=\{i_{1},i_{2},\ldots,i_{M}\}$ and corresponding labels $\mathcal{L}^{t}=\{\ell_{1},\ell_{2},\ldots,\ell_{M}\}$. Each $\ell_{i}=\{\mathcal{T}_{1},\mathcal{T}_{2},\ldots,\mathcal{T}_{N}\}$ denotes a set of $N$ object instances with their class labels $c_{n}\in\mathcal{K}^{t}$ and locations $\{x_{n},y_{n},w_{n},h_{n}\}$, which denote the bounding box center coordinates, width, and height, respectively.

The artificial assumptions and restrictions of closed-set object detection are removed in Open-World Object Detection, which aligns object detection more closely with real life. It requires the trained model $\mathcal{M}_{t}$ to detect the previously encountered $C$ known classes and to identify an unseen class instance as belonging to the unknown class. In addition, it requires the object detector to be capable of incremental updates with new knowledge, and this cycle continues over the detector's lifespan. In the incremental updating phase, the unknown instances identified by $\mathcal{M}_{t}$ are annotated manually. Along with their corresponding training examples, they update $\mathcal{D}^{t}$ to $\mathcal{D}^{t+1}$ and $\mathcal{K}^{t}$ to $\mathcal{K}^{t+1}=\{1,2,\ldots,C,\ldots,C+n\}$. The model adds the $n$ new classes to the known classes and updates itself to $\mathcal{M}_{t+1}$ without retraining from scratch on the whole dataset $\mathcal{D}^{t+1}$.

4 Proposed method

This section elaborates on the proposed framework in detail. We start by illustrating the overall scheme of our proposed framework in Sec. 4.1. Furthermore, we introduce the open-world object detector in Sec. 4.2, the distillation process in Sec. 4.3, and the matching and pseudo-labeling procedure of the OWOD algorithm in Sec. 4.4. Last but not least, we describe the down-weight training strategy in Sec. 4.5 and the inference phase in Sec. 4.6.

4.1 Overall Scheme

Fig. 2 illustrates the overall scheme of our framework. A given image $x\in\mathbb{R}^{H\times W\times 3}$ is first sent into the open-world detector and the large pre-trained vision-language grounding model simultaneously. The detector leverages the visual features of the input to predict the localization, box score, and classification. The large pre-trained vision-language grounding model mines unknown open-world knowledge from the input. The known ground truth and unknown distilled knowledge together form the open-world supervision. In the training phase, we match the predictions and the open-world supervision according to the regression loss, classification score, and supervision confidence. After matching, pseudo labels are selected according to the predicted box score. All labels are then leveraged to train the open-world detector with the down-weight training loss function. In addition, when new categories are introduced at each episode, our detector continues to learn over its lifespan via exemplar replay-based finetuning, which alleviates catastrophic forgetting of learned classes by finetuning on a balanced set of exemplars stored for all known classes.

Figure 4: The detailed data analysis of StandardSet\heartsuit and IntensiveSet\spadesuit. In (a), we calculate the area distribution of instances in the two benchmark test scenes, with the vertical axis representing the logarithm of the count with respect to Euler’s number e. In (b) and (c), we respectively analyze the aspect ratio and the spatial distribution of the instance bounding box annotations.

4.2 Open-World Object Detector

As shown in Fig. 3, the open-world detector first uses a hierarchical feature extraction backbone to extract multi-scale features $\mathrm{Z}_{i}$. The feature maps $\mathrm{Z}_{i}$ are projected from dimension $C_{s}$ to dimension $C_{d}$ using 1×1 convolutions and, after flattening, concatenated into $N_{s}$ vectors of dimension $C_{d}$. Afterward, along with supplementary positional encodings $P\in\mathbb{R}^{N_{s}\times C_{d}}$, the multi-scale features are sent into the deformable transformer encoder to encode semantic features. The encoded semantic features are then sent into the localization decoder together with a set of $N$ learnable location queries.

Aided by interleaved cross-attention and self-attention modules, the localization decoder transforms the location queries $Q_{Location}\in\mathbb{R}^{N\times D}$ into a set of $N$ location query embeddings $E_{Location}\in\mathbb{R}^{N\times D}$. Meanwhile, $E_{Location}$ is used as the class queries and sent into the identification decoder together with the encoded multi-scale features. The identification decoder transforms the class queries into $N$ class query embeddings $E_{Class}$ that correspond to the location query embeddings. The operation of the cascade decoders is expressed as follows:

$E_{Location}=F_{LD}(F_{E}(\varnothing(x),P),Q_{Location},R)$, (1)
$E_{Class}=F_{ID}(F_{E}(\varnothing(x),P),E_{Location},R)$. (2)

where $F_{LD}(\cdot)$ and $F_{ID}(\cdot)$ denote the localization and identification decoders, $F_{E}(\cdot)$ is the encoder, and $\varnothing(\cdot)$ is the backbone. $R$ represents the reference points and $x$ denotes the input image. Eventually, $E_{Class}$ is sent into the classification branch to predict the category $cls\in[0,1]^{N_{obj}}$, while $E_{Location}$ is input to the regression and box score branches to locate $N$ foreground bounding boxes $b\in[0,1]^{4}$ and predict the box score $bs\in[0,1]$.

In this decoupling structure, foreground localization can be protected from category knowledge, and the identification process can utilize the localization information. Therefore, we alleviate the influence of unknown objects on the detection performance for known objects, as well as the confusion between the category and location of the same object.
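
To make the cascade decoupled decoding concrete, the following minimal PyTorch sketch mirrors Eq. (1)-(2) and the three prediction branches. It is illustrative only: standard multi-head attention stands in for deformable attention, reference points are omitted, and the module names, layer sizes, and class count are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CascadeDecoupledHead(nn.Module):
    """Sketch of the cascade decoupled decoding structure (Eq. 1-2)."""

    def __init__(self, d_model=256, num_queries=100, num_classes=81, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.loc_decoder = nn.TransformerDecoder(layer, num_layers)   # F_LD
        self.id_decoder = nn.TransformerDecoder(layer, num_layers)    # F_ID
        self.location_queries = nn.Embedding(num_queries, d_model)    # Q_Location
        self.reg_branch = nn.Sequential(                              # F_reg: 3-layer FFN
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4))
        self.obj_branch = nn.Linear(d_model, 1)                       # F_obj: box score
        self.cls_branch = nn.Linear(d_model, num_classes)             # F_cls (known + unknown, illustrative)

    def forward(self, memory):
        # memory: encoded multi-scale features, shape (B, N_s, d_model)
        q = self.location_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        e_location = self.loc_decoder(q, memory)           # Eq. (1): location embeddings
        e_class = self.id_decoder(e_location, memory)      # Eq. (2): class embeddings
        boxes = self.reg_branch(e_location).sigmoid()      # b in [0, 1]^4
        box_score = self.obj_branch(e_location).sigmoid()  # bs in [0, 1]
        cls = self.cls_branch(e_class).sigmoid()           # per-class scores
        return boxes, box_score, cls

# Usage: dummy encoder output, batch of 2 with 1000 feature tokens.
head = CascadeDecoupledHead()
boxes, box_score, cls = head(torch.randn(2, 1000, 256))
```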

4.3 Open-World Object Supervision

In the knowledge distillation phase, the teacher leverages a large pre-trained vision-language grounding model and a text prompt that contains as many object categories as possible to mine unknown open-world knowledge from the input $x$. In this paper, we utilize GLIP [26] as the large pre-trained vision-language grounding model and the categories of LVIS [39] as the text prompts. The distilled open-world knowledge is first processed through an NMS procedure. Then we align the distilled open-world knowledge and the ground truth with an alignment module, which maps the generated labels into the given annotation space (e.g., translating the LVIS categories trailer truck, tow truck, etc. into the COCO category truck) and excludes the known set from the distilled open-world knowledge. In addition, the identification confidence of the unknown labels from the large pre-trained vision-language grounding model is retained for the subsequent training process and used as the supervision confidence.
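
The following sketch illustrates how distilled teacher detections could be turned into unknown-object supervision (NMS, mapping fine-grained LVIS names into the known label space, excluding knowns, and keeping the teacher confidence). The mapping table, the confidence threshold, and the function name are hypothetical placeholders; the real alignment module and GLIP interface may differ.

```python
import torch
from torchvision.ops import nms

# Hypothetical mapping from fine-grained LVIS names to the known label space.
LVIS_TO_KNOWN = {"trailer_truck": "truck", "tow_truck": "truck", "minivan": "car"}
KNOWN_CLASSES = {"truck", "car"}          # illustrative subset of the current known set

def align_distilled_labels(boxes, scores, lvis_names, iou_thr=0.5, conf_thr=0.3):
    """Turn raw teacher detections (xyxy boxes, confidences, LVIS names) into
    unknown-object supervision, keeping the teacher confidence S-hat."""
    keep = nms(boxes, scores, iou_thr)                 # NMS over teacher detections
    boxes, scores = boxes[keep], scores[keep]
    names = [lvis_names[i] for i in keep.tolist()]

    unk_boxes, unk_conf = [], []
    for box, score, name in zip(boxes, scores, names):
        mapped = LVIS_TO_KNOWN.get(name, name)         # align to the known label space
        if mapped in KNOWN_CLASSES or score < conf_thr:
            continue                                   # knowns come from ground truth instead
        unk_boxes.append(box)
        unk_conf.append(score)
    if not unk_boxes:
        return torch.empty(0, 4), torch.empty(0)
    return torch.stack(unk_boxes), torch.stack(unk_conf)

# Usage with dummy teacher outputs.
boxes = torch.tensor([[0., 0., 50., 50.], [2., 2., 52., 52.], [80., 80., 120., 160.]])
scores = torch.tensor([0.9, 0.85, 0.6])
names = ["trailer_truck", "tow_truck", "umbrella"]
unk_boxes, unk_conf = align_distilled_labels(boxes, scores, names)
```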

4.4 Matching and Evolving

Following the existing open-world detectors [12, 13, 36, 15, 14, 37], we add a box score prediction branch and use the predicted box score to automatically select pseudo-unknown labels from the regression boxes that remain unmatched after the matching process of each training iteration; this serves as the automatic pseudo-labeling mechanism. When matching the open-world supervision and the predictions, we consider the box regression loss, the predicted class score, and the confidence of the supervision.

After matching, we leverage the box score branch to help the detector learn more unknown open-world objects beyond the ground truth and the knowledge distilled from the large pre-trained vision-language grounding model by selecting pseudo labels during training. We denote by $\hat{y}$ the open-world supervision and by $y=\{y_{i}\}_{i=1}^{N}$ the set of $N$ predictions. To find the best bipartite matching between them, a permutation of $N$ elements $\omega\in\mathfrak{S}_{N}$ with the lowest cost is searched for as follows:

$\hat{\omega}=\underset{\omega\in\mathfrak{S}_{N}}{\arg\min}\sum_{i}^{N}\mathcal{L}_{\operatorname{match}}(\hat{y}_{i},y_{\omega(i)})$, (3)

where $\mathcal{L}_{\operatorname{match}}(\hat{y}_{i},y_{\omega(i)})$ is the pair-wise matching cost between $\hat{y}_{i}$, which represents a ground-truth label or a label from the large pre-trained vision-language grounding model, and the prediction $y$ with index $\omega(i)$, as shown in Equation 4. Inspired by [2, 3, 28], we use the Hungarian algorithm to find the optimal assignment.

$\mathcal{L}_{\operatorname{match}}(\hat{y}_{i},y_{\omega(i)})=L_{r}(\hat{b}_{i},b_{\omega(i)})-cls_{\omega(i)}(\hat{c}_{i})-\hat{S}_{i}$, (4)

where $L_{r}$ denotes the regression loss, which consists of a box loss and a GIoU loss [47]. $\hat{b}$ and $b$ represent the open-world supervision box and the prediction box, respectively. $\hat{S}$ is the confidence of the open-world supervision. Then, the pseudo labels are selected as follows:

$l_{p}=\operatorname{Topk}(\{bs_{i}\}_{i\notin\hat{\omega}})$, (5)

where $bs$ denotes the predicted box score. The pseudo labels prevent the model from relying entirely on the knowledge of the large pre-trained vision-language grounding model and help it discover unseen objects beyond the distilled knowledge.
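
A minimal sketch of the matching cost of Eq. (4), the Hungarian assignment of Eq. (3), and the top-k pseudo-label selection of Eq. (5) is given below, assuming normalized (cx, cy, w, h) boxes and unweighted cost terms; the actual loss weights and implementation details may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou, box_convert

def match_and_select_pseudo(pred_boxes, pred_cls, pred_bs, sup_boxes, sup_cls, sup_conf, k=5):
    """Hungarian matching against open-world supervision, then top-k
    pseudo-unknown selection among unmatched queries by box score."""
    # Pairwise regression cost: L1 + GIoU (the L_r term of Eq. 4).
    l1 = torch.cdist(sup_boxes, pred_boxes, p=1)                      # (G, N)
    giou = generalized_box_iou(
        box_convert(sup_boxes, "cxcywh", "xyxy"),
        box_convert(pred_boxes, "cxcywh", "xyxy"))                    # (G, N)
    cost = l1 + (1.0 - giou)
    # Subtract the predicted score of the supervised class and the supervision confidence.
    cost = cost - pred_cls[:, sup_cls].T - sup_conf[:, None]
    gt_idx, query_idx = linear_sum_assignment(cost.numpy())           # Eq. (3)

    # Eq. (5): unmatched queries with the highest box scores become pseudo unknowns.
    unmatched = torch.ones(pred_boxes.size(0), dtype=torch.bool)
    unmatched[torch.as_tensor(query_idx)] = False
    cand = unmatched.nonzero(as_tuple=True)[0]
    pseudo = cand[pred_bs[cand].topk(min(k, cand.numel())).indices]
    return torch.as_tensor(gt_idx), torch.as_tensor(query_idx), pseudo

# Usage with N=100 queries, 81 class scores, and 3 supervision boxes.
N, C, G = 100, 81, 3
gt_idx, query_idx, pseudo = match_and_select_pseudo(
    torch.rand(N, 4), torch.rand(N, C), torch.rand(N),
    torch.rand(G, 4), torch.randint(0, C, (G,)), torch.rand(G))
```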

4.5 Down-Weight Training Strategy

The inclusion of distilled knowledge inevitably influences the model's learning of the known set, because its quality cannot be guaranteed and it increases the difficulty of the detector's learning. Therefore, we propose the down-weight training strategy, which leverages the identification confidence of the distilled knowledge to generate a harmonic factor that down-weights the unknown training loss, and trains the detector in an end-to-end manner as shown in Fig. 2 (b). The training loss function is as follows:

$L=L_{r}+L_{bs}+L_{cls}+L_{r}^{kd}+L_{bs}^{kd}+L_{cls}^{kd}+L_{cls}^{p}$, (6)

where $L_{r}$ is the common regression loss consisting of a box loss and a GIoU loss [47]. $L_{bs}$ and $L_{cls}$ represent the box score and classification losses, respectively; both use the common sigmoid focal loss [48]. For simplicity, we omit their formulations. In addition, $L_{r}^{kd}$, $L_{bs}^{kd}$, $L_{cls}^{kd}$, and $L_{cls}^{p}$ all utilize the corresponding down-weight loss functions we propose, formulated as follows:

$L_{r}^{kd}=\frac{1}{|l_{kd}|}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}\hat{S}_{\hat{\omega}(i)}[\|b_{i}-\hat{b}_{\hat{\omega}(i)}\|_{1}+1-\mathcal{G}(b_{i},\hat{b}_{\hat{\omega}(i)})]$, (7)
$L_{bs}^{kd}=\frac{1}{\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}\|bs_{i}\|_{1}}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}[l_{sf}(bs_{i},\hat{S}_{\hat{\omega}(i)})]$, (8)
$L_{cls}^{kd}=\frac{1}{\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}\|cls_{i}\|_{1}}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{kd}\}}[l_{sf}(cls_{i},\hat{S}_{\hat{\omega}(i)})]$, (9)
$L_{cls}^{p}=\frac{1}{\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{p}\}}\|cls_{i}\|_{1}}\sum_{i=1}^{N_{q}}\mathbf{1}_{\{i\in l_{p}\}}[l_{sf}(cls_{i},bs_{i})]$, (10)

where $l_{kd}$ and $l_{p}$ denote the distilled-knowledge supervision labels and the pseudo supervision labels, respectively. $N_{q}$ represents the number of queries, and $\hat{\omega}(i)$ represents the index of the label matched to prediction $i$. $l_{sf}$ denotes the sigmoid focal loss function and $\mathcal{G}(\cdot)$ represents the GIoU loss function. $\hat{S}$ is the confidence of the supervision labels. $b$, $bs$, and $cls$ are the predicted box, box score, and classification score, respectively.
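
For illustration, the sketch below computes the distilled-knowledge terms $L_{r}^{kd}$ (Eq. 7) and $L_{bs}^{kd}$ (Eq. 8) for the queries matched to distilled labels, using torchvision's sigmoid focal loss with the teacher confidence as a soft target; the tensor layout, the use of logits for the box score, and the normalization details are assumptions of this sketch.

```python
import torch
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou, box_convert

def down_weight_kd_losses(pred_boxes, pred_bs_logits, matched_boxes, matched_conf):
    """Down-weight KD losses for queries matched to distilled unknown labels;
    `matched_conf` is the teacher confidence S-hat."""
    # Eq. (7): regression loss (L1 + GIoU), each term scaled by the teacher confidence.
    l1 = (pred_boxes - matched_boxes).abs().sum(dim=-1)
    giou = generalized_box_iou(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(matched_boxes, "cxcywh", "xyxy")).diag()
    l_reg = (matched_conf * (l1 + 1.0 - giou)).sum() / max(len(matched_conf), 1)

    # Eq. (8): box-score loss, sigmoid focal loss against the (soft) teacher confidence,
    # normalized by the summed predicted box scores of the matched queries.
    bs_prob = pred_bs_logits.sigmoid()
    l_bs = sigmoid_focal_loss(pred_bs_logits, matched_conf, reduction="sum")
    l_bs = l_bs / bs_prob.abs().sum().clamp(min=1e-6)
    return l_reg, l_bs

# Usage for 5 queries matched to distilled unknown labels.
l_reg, l_bs = down_weight_kd_losses(
    torch.rand(5, 4), torch.randn(5), torch.rand(5, 4), torch.rand(5))
```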

4.6 Inference

During inference, our detector only utilizes the visual features of the input to detect open-world objects, without any information from other modalities. The inference process composites the detector outputs into open-world object instances. Formally, the $i$-th output prediction is generated as $<b_{i},\ bs_{i},\ cls_{i}>$. Using $s_{i},l_{i}=\max(cls_{i})$, the result is obtained as $<l_{i},\ s_{i},\ b_{i}>$, where $l$ is the category, $s$ denotes the confidence, and $b$ represents the predicted bounding box.
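
A minimal sketch of this decoding step is shown below; the ranking by confidence and the top-k cutoff are assumptions added for illustration, not part of the method description above.

```python
import torch

def decode_predictions(boxes, cls_scores, top_k=100):
    """For each query output <b_i, bs_i, cls_i>, take s_i, l_i = max(cls_i)
    and return <l_i, s_i, b_i>, ranked by confidence (ranking is an assumption)."""
    conf, labels = cls_scores.max(dim=-1)          # s_i, l_i = max(cls_i)
    order = conf.argsort(descending=True)[:top_k]  # keep the most confident detections
    return [(labels[i].item(), conf[i].item(), boxes[i]) for i in order.tolist()]

# Usage with 100 queries and 81 class scores (80 known + unknown, illustrative).
detections = decode_predictions(torch.rand(100, 4), torch.rand(100, 81))
```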

Dataset | Images | Avg. Known | Avg. Unknown
StandardSet\heartsuit | 694 | 2.7 | 3.4
IntensiveSet\spadesuit | 489 | 5.8 | 33.7
TABLE I: Statistics on the number of scenes and instances for the proposed testing sets: StandardSet\heartsuit and IntensiveSet\spadesuit.

5 Proposed Dataset

Since existing datasets [24, 25] are manually annotated with predefined categories, current benchmarks cannot comprehensively measure the detection performance of open-world detectors for unknown objects due to the lack of bounding box annotations for unknown entities in the test scenarios. In this paper, we propose two benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit respectively, based on the complexity of their testing scenarios. For the proposed benchmarks, the known categories are defined as those of the PASCAL VOC [25] dataset, and we train the open-world detectors on the images and annotations of the PASCAL VOC [25] training set. Meanwhile, inspired by the detailed annotation criteria of finer-grained datasets [39], we manually select suitable evaluation scenes (with no fewer than five unknown objects) and provide more meticulous manual annotations of unknown objects to construct the testing sets, as shown in TABLE I.

5.1 Standard Testing Set

The StandardSet\heartsuit contains a total of 694 testing scenes selected from the MS-COCO [24] dataset, in which we manually annotate the unknown objects. With respect to the predefined known categories, it contains 1897 known instances and 2378 unknown open-world instances. We report statistics on instance area, aspect ratio, and position distribution in Fig. 4. The statistical results suggest that, compared to IntensiveSet\spadesuit, StandardSet\heartsuit has a more uniform distribution of instance areas. Additionally, the aspect ratios of bounding boxes in the StandardSet\heartsuit testing set tend towards a uniform distribution, with the majority of bounding box annotations concentrated in the central region of the scenes.

5.2 Intensive Testing Set

In the IntensiveSet\spadesuit testing set, we elevate the complexity by selecting 489 highly complex scenes, which altogether encompass 2859 annotations of known objects and 16482 manually annotated bounding boxes of unknown open-world objects, averaging 5.8 known-object annotations and 33.7 unknown-object annotations per image. As shown in Fig. 4, compared to the StandardSet\heartsuit testing set, the bounding box annotations in the IntensiveSet\spadesuit testing set primarily cover smaller areas, with a more uniform distribution of aspect ratios and spatial positions.

OWOD split
Task ID → | Task 1 | Task 2 | Task 3 | Task 4
Semantic split → | VOC Classes | Outdoor, Accessories, Appliances, Truck | Sports, Food | Electronic, Indoor, Kitchen, Furniture
# training images | 16551 | 45520 | 39402 | 40260
# test images | 4952 | 1914 | 1642 | 1738
# train instances | 47223 | 113741 | 114452 | 138996
# test instances | 14976 | 4966 | 4826 | 6039

MS-COCO split
Task ID → | Task 1 | Task 2 | Task 3 | Task 4
Semantic split → | Animals, Person, Vehicles | Appliances, Accessories, Outdoor, Furniture | Sports, Food | Electronic, Indoor, Kitchen
# training images | 89490 | 55870 | 39402 | 38903
# test images | 3793 | 2351 | 1642 | 1691
# train instances | 421243 | 163512 | 114452 | 160794
# test instances | 17786 | 7159 | 4826 | 7010

TABLE II: Task composition in the OWOD and MS-COCO splits for the open-world evaluation protocol. The semantics of each task and the number of images and instances (objects) across splits are shown.

6 Experiment

6.1 Datasets and Metrics

For a fair comparison, we conduct the experiments on the two mainstream splits of the MS-COCO [24] and Pascal VOC [25] datasets. We group the classes into a set of non-overlapping tasks $\{T^{1},\ldots,T^{t},\ldots\}$. A class in task $T^{c}$ only appears in tasks $T^{t}$ where $t\geq c$. At time $t$, classes encountered in $\{T^{c}:c\leq t\}$ are considered known, and classes in $\{T^{c}:c>t\}$ are considered unknown.
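
The known/unknown bookkeeping implied by this protocol can be summarized by the small sketch below, with purely illustrative class names.

```python
# Classes from tasks T^c with c <= t are known at time t; all later ones are unknown.
TASK_CLASSES = {1: ["person", "car"], 2: ["truck", "umbrella"],
                3: ["pizza", "kite"], 4: ["tv", "book"]}   # illustrative subsets

def known_unknown_at(t):
    known = [c for task, cls in TASK_CLASSES.items() if task <= t for c in cls]
    unknown = [c for task, cls in TASK_CLASSES.items() if task > t for c in cls]
    return known, unknown

known, unknown = known_unknown_at(2)   # -> 4 known classes, 4 unknown classes
```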

OWOD SPLIT [12] divides the 80 classes of MS-COCO into 4 tasks and selects a training set for each task from the MS-COCO and Pascal VOC training sets. The Pascal VOC test set and the MS-COCO validation set are used for evaluation. Detailed data statistics are shown in TABLE II.

MS-COCO SPLIT [13] mitigates data leakage across tasks in [12] and is more challenging, as shown in TABLE II. The training and testing data are selected from MS-COCO.

Metrics: Following the most common evaluation metric for object detection, we use mean average precision (mAP) to evaluate known objects. Inspired by [12, 49, 13, 50, 51], U-Recall, which measures the ability of the model to retrieve unknown object instances in the OWOD problem, is used as the metric for unknown objects. For the proposed StandardSet\heartsuit and IntensiveSet\spadesuit benchmarks, we also use unknown AP and unknown precision to comprehensively measure the detection performance of open-world detectors on unknown open-world objects.
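
As an illustration, the following sketch computes a simple IoU-based U-Recall (the fraction of ground-truth unknown boxes recovered by at least one predicted unknown box at IoU >= 0.5); the exact matching protocol used by the benchmark code may differ in its details.

```python
import torch
from torchvision.ops import box_iou

def unknown_recall(pred_unknown_boxes, gt_unknown_boxes, iou_thr=0.5):
    """Fraction of ground-truth unknown instances covered by at least one
    predicted unknown box with IoU >= iou_thr (xyxy boxes)."""
    if gt_unknown_boxes.numel() == 0:
        return 1.0                                     # no unknowns to recover (convention)
    if pred_unknown_boxes.numel() == 0:
        return 0.0
    ious = box_iou(gt_unknown_boxes, pred_unknown_boxes)   # (num_gt, num_pred)
    recovered = (ious.max(dim=1).values >= iou_thr).float()
    return recovered.mean().item()

# Usage: two GT unknown boxes, one of which is recovered -> U-Recall = 0.5.
gt = torch.tensor([[0., 0., 10., 10.], [20., 20., 40., 40.]])
pred = torch.tensor([[1., 1., 10., 10.]])
print(unknown_recall(pred, gt))
```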

6.2 Implementation Details

The multi-scale feature extractor consists of a ResNet-50 [52] pretrained on ImageNet [53] in a self-supervised [54] manner and a deformable transformer encoder with 6 layers. The two cascade decoders are both deformable transformer decoders, each with 6 layers. We set the number of queries to 100, the dimension of the embeddings to $D=256$, and the number of pseudo-labels to $k=5$. We use GLIP [26] as the large pre-trained vision-language grounding model and the categories of the LVIS dataset [39] as text prompts to assist the training process. For GLIP, we use GLIP-L [26] without finetuning on the COCO dataset [24]; it consists of Swin-Large [55], the text encoder of CLIP [29], DyHead [56], BERT layers [57], and a fusion module [26]. For the incremental object detection experiments, we only use our open-world detector without the help of GLIP. The PyTorch library and eight NVIDIA RTX 3090 GPUs are used to train our SKDF framework with a batch size of 3 images per GPU. In each task, the SKDF framework is trained for 50 epochs and finetuned for 20 epochs during the incremental learning step. We train SKDF using the Adam optimizer with a base learning rate of $2\times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $10^{-4}$. For finetuning during the incremental step, the learning rate is reduced by a factor of 10 and the model is trained using a set of 50 stored exemplars per known class.
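
For reference, a minimal sketch of the optimizer configuration described above is given below; the stand-in module and variable names are placeholders.

```python
import torch

# Adam with base lr 2e-4, betas (0.9, 0.999), weight decay 1e-4; lr divided by 10 for finetuning.
detector = torch.nn.Linear(256, 4)          # stand-in for the SKDF detector
optimizer = torch.optim.Adam(detector.parameters(), lr=2e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
finetune_optimizer = torch.optim.Adam(detector.parameters(), lr=2e-5,
                                      betas=(0.9, 0.999), weight_decay=1e-4)
EPOCHS, FINETUNE_EPOCHS = 50, 20
```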

SPLIT | Method | Param# | FLOPs | Inference Rate (s/img) | U-Recall Task 1 (\uparrow) | U-Recall Task 2 (\uparrow) | U-Recall Task 3 (\uparrow)
OWOD | OV-VLM | 321.9M | 965 GMac | 9.22 | 37.0 | 35.5 | 34.9
OWOD | Ours | 42.9M | 212 GMac | 0.08 | 39.0 | 36.7 | 36.1
MS-COCO | OV-VLM | 321.9M | 965 GMac | 9.22 | 52.6 | 54.5 | 53.3
MS-COCO | Ours | 42.9M | 212 GMac | 0.08 | 60.9 | 60.0 | 58.6
TABLE III: Comparison with the large pre-trained open-vocabulary grounding vision-language model for open-world object detection on the OWOD and MS-COCO splits. For the large pre-trained open-vocabulary grounding vision-language model, predictions beyond the known categories are set to unknown. The experiments demonstrate that SKDF distills open-world knowledge from the large open-vocabulary pre-trained vision-language model into an expert open-world detector with faster detection speed and better performance on unknown open-world objects.
Figure 5: Qualitative results. Visualization results are based on the setting of Task 1. Our model can detect the unknown objects in yellow boxes beyond the unknown labels from GLIP and the LVIS text prompts. The animation and game categories in the figures appear neither in the LVIS text prompt nor in our training dataset, so our detector cannot have learned them from GLIP.

6.3 Detailed Comparison With the Teacher

In this section, we validate the advantages of our distillation framework in terms of performance and speed through detailed comparison and quantitative experiments.

6.3.1 Experimental Results

To preserve the teacher's ability to generalize to open-world objects in this comparison, we do not finetune the teacher on the known dataset, because the finetuning process severely destroys the teacher's generality. Thus, we only show the results for the comparison of unknown object detection ability. The results compared with the large pre-trained vision-language grounding model (GLIP) on the OWOD problem are shown in TABLE III. Regarding the number of parameters and FLOPs, our model is significantly smaller than GLIP. In particular, our inference speed is about $115\times\sim 116\times$ that of GLIP. Furthermore, GLIP is trained on 64M images, while our model only needs a small amount of data in each task of the different splits, roughly $\frac{1}{237}\sim\frac{1}{16}$ of that. Compared with GLIP's U-Recall of 37.0, 35.5, and 34.9 on Tasks 1, 2, and 3 of the OWOD split, ours achieves 39.0, 36.7, and 36.1 in the corresponding tasks, yielding significant absolute gains of up to 2.0, 1.2, and 1.2, respectively. Compared with GLIP's U-Recall of 52.6, 54.5, and 53.3 on Tasks 1, 2, and 3 of the MS-COCO split, ours achieves 60.9, 60.0, and 58.6 in the corresponding tasks, yielding significant absolute gains of up to 8.3, 5.5, and 5.3, respectively. This demonstrates that our model has a better ability to detect unknown objects for OWOD.

6.3.2 Qualitative Results

To present the ability to detect unknown open-world objects intuitively, we select several open scenes from games and comics, such as "Digital Monster", "StarCraft", "League of Legends", and "Pokemon". These scenes contain objects outside the categories of the LVIS [39] text prompts; visualization results are shown in Fig. 5. The visualization results show that our detector can identify nearly all unknown open-world objects, such as the creatures from "Digital Monster" and "Pokemon", the robots from "StarCraft", and the minions and turrets from "League of Legends". Even the weapons and equipment from "League of Legends" can be detected by our detector; however, there are instances of misidentification, such as mistaking "Liandry's Anguish" for a TV monitor. Thus, the qualitative results demonstrate that SKDF discovers novel unknown objects beyond the knowledge of the large pre-trained vision-language grounding model.

6.4 Ablation Study

In this subsection, a series of experiments is designed to clearly understand the contribution of each constituent component. We conduct all experiments on the OWOD split. We start by ablating each component, followed by ablating the text prompts and the open-vocabulary large pre-trained teacher.

Task IDs → | Task 1 | Task 2 | Task 3
Metrics → | U-Recall (\uparrow) / mAP (Current) (\uparrow) | U-Recall (\uparrow) / mAP (Previously / Current / Both) (\uparrow) | U-Recall (\uparrow) / mAP (Previously / Current / Both) (\uparrow)
Baseline (GLIP) | 37.0 / 42.0 | 35.5 / 42.1 / 22.6 / 32.4 | 34.9 / 32.4 / 19.2 / 28.2
Distillation | 39.4 / 41.1 | 36.6 / 27.9 / 10.5 / 19.2 | 36.6 / 18.0 / 6.2 / 14.1
Distillation + DW | 39.2 / 53.5 | 36.3 / 47.5 / 22.3 / 34.9 | 35.9 / 32.0 / 11.5 / 25.1
Distillation + CS | 39.1 / 51.9 | 36.9 / 46.7 / 22.6 / 34.6 | 36.1 / 32.4 / 12.8 / 25.9
Final: SKDF | 39.0 / 56.8 | 36.7 / 52.3 / 28.3 / 40.3 | 36.1 / 36.9 / 16.4 / 30.1
TABLE IV: Experiments on ablating each component. Our method significantly improves the detection performance of the GLIP baseline. DW represents the down-weight training loss function for unknown open-world supervision. When DW is none, we use the same loss function as the ground truth supervision for unknown open-world supervision. CS represents the cascade decoupling structure. When CS is none, we leverage the normal decoder structure as DDETR.
(a) Ablating different prompts.
Prompt | SPLIT | Method | mAP | UR | UR(sam)
LVIS | OWOD | GLIP | 42.2 | 37.0 | 8.04
LVIS | OWOD | Ours | 56.8 | 39.0 | 16.5
LVIS | MSCOCO | GLIP | 46.5 | 52.6 | 9.4
LVIS | MSCOCO | Ours | 69.4 | 60.9 | 20.0
LVIS w/o COCO | OWOD | GLIP | - | - | 7.5
LVIS w/o COCO | OWOD | Ours | 57.6 | 38.8 | 16.8
LVIS w/o COCO | MSCOCO | GLIP | - | - | 7.6
LVIS w/o COCO | MSCOCO | Ours | 70.6 | 58.9 | 20.4
COCO | OWOD | GLIP | 67.8 | 43.3 | 3.2
COCO | OWOD | Ours | 60.9 | 41.1 | 12.9
COCO | MSCOCO | GLIP | 73.1 | 64.8 | 5.3
COCO | MSCOCO | Ours | 74.8 | 69.6 | 16.7

(b) Ablating different structures (OWOD split).
Metrics | DDETR | CAT | Ours
mAP | 53.5 | 55.3 | 56.8
UR | 39.2 | 39.4 | 39.0
UR(sam) | 16.3 | 16.5 | 16.5

(c) Ablating different teachers (OWOD split).
Teacher | mAP | UR | UR(sam)
GLIP | 56.8 | 39.0 | 16.5
SAM | 19.4 | 33.5 | 38.3

(d) Qualitative ablation (figure): left (right) learns from the COCO (LVIS) prompt.

TABLE V: Detailed ablation experiments for our SKDF. (a) Ablation experiments on different text prompts, where UR denotes the unknown recall on the original unknown annotations and UR(sam) denotes the unknown recall on unknown annotations generated with SAM. (b) Ablation experiments on different detector structures. (c) Ablation experiments on different teacher models. (d) Qualitative ablation on the LVIS and COCO prompts.

6.4.1 Ablating components

To study the contribution of each component, we design the ablation experiments in TABLE IV. We set GLIP as our baseline. Distillation improves the performance on unknown object detection but reduces the detector's ability on known objects. Adding the down-weight training loss function significantly improves the performance on detecting known objects and on incremental object detection, yielding significant absolute gains of more than 10 points. As analyzed, the cascade decoupling structure alleviates the influence of unknown objects on known-object detection and the confusion between the categories and locations of the same objects; it also significantly improves the performance of detecting known objects, with absolute gains of more than 10 points. Moreover, the two components combine effectively, improving performance with absolute gains of up to 15.7, 21.1, and 16.0 points, without reducing the ability to detect unknown objects. Thus, each component plays a critical role in open-world object detection.

Task IDs → | Task 1 | Task 2 | Task 3 | Task 4
Metrics → | U-Recall (\uparrow) / mAP (Current) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | mAP (Previously / Current / Both)
UC-OWOD [37] | 2.4 / 50.7 | 3.4 / 33.1 / 30.5 / 31.8 | 8.7 / 28.8 / 16.3 / 24.6 | 25.6 / 15.9 / 23.2
ORE-EBUI [12] | 4.9 / 56.0 | 2.9 / 52.7 / 26.0 / 39.4 | 3.9 / 38.2 / 12.7 / 29.7 | 29.6 / 12.4 / 25.3
OW-DETR [13] | 7.5 / 59.2 | 6.2 / 53.6 / 33.5 / 42.9 | 5.7 / 38.3 / 15.8 / 30.8 | 31.4 / 17.1 / 27.8
OCPL [14] | 8.3 / 56.6 | 7.7 / 50.6 / 27.5 / 39.1 | 11.9 / 38.7 / 14.7 / 30.7 | 30.7 / 14.4 / 26.7
2B-OCD [15] | 12.1 / 56.4 | 9.4 / 51.6 / 25.3 / 38.5 | 11.6 / 37.2 / 13.2 / 29.2 | 30.0 / 13.3 / 25.8
Ours | 39.0 / 56.8 | 36.7 / 52.3 / 28.3 / 40.3 | 36.1 / 36.9 / 16.4 / 30.1 | 31.0 / 14.7 / 26.9
TABLE VI: State-of-the-art comparison for open-world object detection on the OWOD split. The comparison is shown in terms of U-Recall and known-class mAP. U-Recall measures the ability of the model to retrieve unknown object instances in the OWOD problem. For a fair comparison, we compare with the recently introduced methods. Our SKDF achieves improved metrics over the existing works across all unknown detection tasks, demonstrating our model's effectiveness for the OWOD problem. U-Recall cannot be computed for Task 4 due to the absence of unknown test annotations, since all 80 classes are known.
Task IDs → | Task 1 | Task 2 | Task 3 | Task 4
Metrics → | U-Recall (\uparrow) / mAP (Current) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | U-Recall (\uparrow) / mAP (Previously / Current / Both) | mAP (Previously / Current / Both)
ORE-EBUI [12] | 1.5 / 61.4 | 3.9 / 56.5 / 26.1 / 40.6 | 3.6 / 38.7 / 23.7 / 33.7 | 33.6 / 26.3 / 31.8
OW-DETR [13] | 5.7 / 71.5 | 6.2 / 62.8 / 27.5 / 43.8 | 6.9 / 45.2 / 24.9 / 38.5 | 38.2 / 28.1 / 33.1
Ours | 60.9 / 69.4 | 60.0 / 63.8 / 26.9 / 44.4 | 58.6 / 46.2 / 28.0 / 40.1 | 41.8 / 29.6 / 38.7
TABLE VII: State-of-the-art comparison for open-world object detection on the MS-COCO split. As the code and weights of UC-OWOD [37], OCPL [14], and 2B-OCD [15] are not publicly available, we cannot obtain their results or evaluate them on the MS-COCO split; thus, we only compare our model with ORE [12] and OW-DETR [13]. Although the MS-COCO split is more challenging, our model obtains an even more significant improvement on it than on the OWOD split. The significant metric improvements demonstrate that our SKDF can retrieve new knowledge beyond the closed set and is not limited by the category knowledge of existing objects.

6.4.2 Ablating different prompts

To ablate the impact of text prompts on knowledge distillation, we run experiments with different text prompts, i.e., LVIS, LVIS without COCO, and COCO, as shown in TABLE V(a). From the experimental results (i.e., the UR comparison between the LVIS and COCO prompts and the qualitative results in TABLE V(d)), we find that the unknown annotations in the test set are not adequate and mostly correspond to the original 80 COCO categories. Therefore, the existing "unknown recall" cannot evaluate the ability to detect unknown open-world objects beyond the COCO annotations. To evaluate that, we use SAM [58] to generate annotations for all object-like regions in the test set. Although these annotations are noisy and contain many non-object boxes, the recall on them can evaluate the unknown detection ability of detectors. The comparison between the LVIS and COCO prompts shows that less distilled knowledge reduces the impact on learning known objects. On the other hand, less distilled knowledge limits the performance of the knowledge distillation framework: as shown in the OWOD split with the COCO prompt, our detector does not learn better detection performance for the original unknown objects. The results with the LVIS w/o COCO prompt demonstrate that our framework discovers more unknown open-world objects; although we exclude all corresponding COCO categories from the LVIS prompt, our detector's detection ability for the original unknown objects does not appear to be impacted.

6.4.3 Ablating different detection transformer structures

As shown in TABLE V(b), we compare our structure with the original parallel structure [2] and the structure of [36]. CAT proposes a cascade decoupled decoding scheme that uses a shared decoder to decode twice, once for localization and once for identification. Unlike them, we employ two separate decoders to decouple the decoding process and specifically train these two decoders for localization and recognition, respectively. The results demonstrate that our structure is better at alleviating the influence of distilled knowledge on known objects. Our decoupled architecture outperforms the decoupled decoding scheme of CAT [36] by 1.5 points in known mAP. Moreover, our architecture surpasses the conventional DDETR [2] structure by 3.3 points.

6.4.4 Ablating different teachers

In TABLE V(c), we compare different teachers, i.e., GLIP [26] and SAM [58]. When distilling open-world knowledge from SAM, we leverage the "segment anything" mode, which uses predefined grid points and a selection strategy to generate pseudo labels. For SAM, due to the absence of supervision confidence, we cannot use the down-weight loss function; therefore, the known-object detection ability is severely reduced. In addition, due to SAM's tendency to over-detect, the detector cannot distill a better unknown detection ability.

6.5 Comparison With SOTA OWOD Methods

In this section, we conduct a detailed comparison with existing state-of-the-art methods on two existing benchmarks: OWOD and MS-COCO SPLIT, as well as on two benchmarks we propose: IntensiveSet\spadesuit and StandardSet\heartsuit. In addition, we compare our model with existing methods on the Incremental Object Detection task.

6.5.1 OWOD SPLIT

The results compared with state-of-the-art methods on the OWOD split for the OWOD problem are shown in TABLE VI. The ability of our model to detect unknown objects, quantified by U-Recall, is more than $3\times$ that reported by previous state-of-the-art OWOD methods. Compared with 2B-OCD [15], which has the highest U-Recall of 12.1, 9.4, and 11.6 on Tasks 1, 2, and 3, ours achieves 39.0, 36.7, and 36.1 in the corresponding tasks, yielding significant absolute gains of 26.9, 27.3, and 24.5, respectively. Benefiting from the cascade decoder structure and the down-weight training loss, which mitigate the effect of unknown objects on detecting known objects, our model's performance on known objects is also superior to most state-of-the-art methods.

6.5.2 MS-COCO SPLIT

We report the results on the MS-COCO split in TABLE VII. The MS-COCO split mitigates data leakage across tasks and assigns more data to each task, and our model receives an even more significant boost than on the OWOD split. Our model's unknown object detection capability, quantified by U-Recall, is almost $10\times\sim 11\times$ that reported by previous state-of-the-art OWOD methods. Compared with OW-DETR's U-Recall of 5.7, 6.2, and 6.9 on Tasks 1, 2, and 3, ours achieves 60.9, 60.0, and 58.6 in the corresponding tasks, yielding significant absolute gains of 55.2, 53.8, and 51.7, respectively. This demonstrates that our model has a stronger ability to retrieve new knowledge and detect unknown objects when faced with more difficult tasks.

Dataset → | IntensiveSet\spadesuit | StandardSet\heartsuit
Group / Method | Known mAP (\uparrow) / Unknown Recall / Precision / mAP (\uparrow) | Known mAP (\uparrow) / Unknown Recall / Precision / mAP (\uparrow)
(I) OWDETR [13] | 36.9 / 5.6 / 7.3 / 1.2 | 54.6 / 16.9 / 2.3 / 0.6
(I) CAT [36] | 38.6 / 17.4 / 9.5 / 7.6 | 55.4 / 48.8 / 2.7 / 4.3
(I) UnSniffer [38] | 39.5 / 9.4 / 22.2 / 2.0 | 53.7 / 34.6 / 21.5 / 12.5
(I) UnSniffer\dagger [38] | 39.5 / 12.4 / 12.0 / 2.6 | 53.7 / 41.2 / 9.6 / 11.6
(II) SKDF-DW | 27.7 / 40.2 / 15.6 / 14.8 | 44.0 / 75.1 / 3.0 / 17.1
(II) SKDF-CS | 30.7 / 39.7 / 16.2 / 16.8 | 48.3 / 75.2 / 3.1 / 9.2
(II) SKDF | 32.3 / 39.6 / 16.2 / 16.7 | 48.7 / 74.3 / 3.0 / 7.4
(II) SKDF\ddagger | 32.3 / 35.8 / 37.9 / 24.5 | 48.7 / 66.8 / 11.4 / 24.4
TABLE VIII: State-of-the-art comparison for open-world object detection on IntensiveSet\spadesuit and StandardSet\heartsuit. As the code and weights of UC-OWOD [37], OCPL [14], and 2B-OCD [15] are not publicly available, we cannot obtain their results or evaluate them on our proposed benchmarks. For a fair comparison, UnSniffer\dagger means that we remove the specific unknown post-processing operations. SKDF\ddagger denotes that we add the same unknown post-processing operations as UnSniffer to our SKDF.

6.5.3 Intensive Testing Set

The comparison results on IntensiveSet\spadesuit are shown in the left half of TABLE VIII. The experimental results indicate that our model's performance in detecting unknown objects surpasses that of existing state-of-the-art methods, achieving more than double the highest unknown recall, obtained by CAT [36], exceeding the highest unknown precision of UnSniffer [38] by 4.2 points, and surpassing the existing methods in unknown mAP by more than threefold. Meanwhile, our model is highly adaptable to unknown post-processing techniques: when incorporating the post-processing methods from UnSniffer [38], our model's unknown precision further increases by 21.7 points and its unknown mAP by 7.8 points. With this post-processing, SKDF surpasses UnSniffer in unknown precision by more than 15 points. However, although our components can mitigate the impact of unknown objects on the detection of known objects, there is still a certain gap between our model and existing state-of-the-art models on known objects.

6.5.4 Standard Testing Set

In the right half of TABLE VIII, we compare SKDF with the state-of-the-art methods on StandardSet\heartsuit. Our model surpasses the existing detection models on the vast majority of evaluation metrics, except for known-object detection performance and unknown precision. For unknown precision, UnSniffer outperforms our SKDF. This is because StandardSet\heartsuit primarily comprises common COCO categories beyond the predefined VOC classes, and UnSniffer is specifically tailored to these category distributions. However, this does reflect a shortcoming of our model. In particular, comparing our model's performance on IntensiveSet\spadesuit and StandardSet\heartsuit shows that, although it has very good recall for unknown objects, its ability to recognize unknown objects still needs to be improved.

6.5.5 Incremental Object Detection

To intuitively present our detector's ability to detect object instances, we compare it to [59, 60, 12, 13] on the incremental object detection (IOD) task, without assistance from the large pre-trained vision-language grounding model. We evaluate three standard settings, where a group of classes (10, 5, and the last class) is introduced incrementally to a detector trained on the remaining classes (10, 15, and 19), based on the PASCAL VOC 2007 dataset [25]. As the results in TABLE IX show, our model outperforms the existing methods by a large margin on all three settings, indicating the power of the cascade detection transformer for IOD.

10 + 10 Setting | aero | cycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | bike | person | plant | sheep | sofa | train | tv | mAP
ILOD | 69.9 | 70.4 | 69.4 | 54.3 | 48 | 68.7 | 78.9 | 68.4 | 45.5 | 58.1 | 59.7 | 72.7 | 73.5 | 73.2 | 66.3 | 29.5 | 63.4 | 61.6 | 69.3 | 62.2 | 63.2
Faster ILOD | 72.8 | 75.7 | 71.2 | 60.5 | 61.7 | 70.4 | 83.3 | 76.6 | 53.1 | 72.3 | 36.7 | 70.9 | 66.8 | 67.6 | 66.1 | 24.7 | 63.1 | 48.1 | 57.1 | 43.6 | 62.1
ORE - (CC + EBUI) | 53.3 | 69.2 | 62.4 | 51.8 | 52.9 | 73.6 | 83.7 | 71.7 | 42.8 | 66.8 | 46.8 | 59.9 | 65.5 | 66.1 | 68.6 | 29.8 | 55.1 | 51.6 | 65.3 | 51.5 | 59.4
ORE - EBUI | 63.5 | 70.9 | 58.9 | 42.9 | 34.1 | 76.2 | 80.7 | 76.3 | 34.1 | 66.1 | 56.1 | 70.4 | 80.2 | 72.3 | 81.8 | 42.7 | 71.6 | 68.1 | 77 | 67.7 | 64.5
OW-DETR | 75.4 | 63.9 | 57.9 | 50.0 | 52.0 | 70.9 | 79.5 | 72.4 | 44.3 | 57.9 | 59.7 | 73.5 | 77.7 | 75.2 | 76.2 | 44.9 | 68.8 | 65.4 | 79.3 | 69.0 | 65.7
Ours | 77.1 | 72.3 | 74.5 | 53.4 | 57.4 | 78.1 | 78.7 | 83.9 | 46.2 | 71.4 | 59.5 | 77.4 | 73.3 | 76.6 | 73.3 | 39.7 | 70.6 | 59.0 | 78.4 | 70.9 | 68.6
15 + 5 Setting | aero | cycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | bike | person | plant | sheep | sofa | train | tv | mAP
ILOD | 70.5 | 79.2 | 68.8 | 59.1 | 53.2 | 75.4 | 79.4 | 78.8 | 46.6 | 59.4 | 59 | 75.8 | 71.8 | 78.6 | 69.6 | 33.7 | 61.5 | 63.1 | 71.7 | 62.2 | 65.8
Faster ILOD | 66.5 | 78.1 | 71.8 | 54.6 | 61.4 | 68.4 | 82.6 | 82.7 | 52.1 | 74.3 | 63.1 | 78.6 | 80.5 | 78.4 | 80.4 | 36.7 | 61.7 | 59.3 | 67.9 | 59.1 | 67.9
ORE - (CC + EBUI) | 65.1 | 74.6 | 57.9 | 39.5 | 36.7 | 75.1 | 80 | 73.3 | 37.1 | 69.8 | 48.8 | 69 | 77.5 | 72.8 | 76.5 | 34.4 | 62.6 | 56.5 | 80.3 | 65.7 | 62.6
ORE - EBUI | 75.4 | 81 | 67.1 | 51.9 | 55.7 | 77.2 | 85.6 | 81.7 | 46.1 | 76.2 | 55.4 | 76.7 | 86.2 | 78.5 | 82.1 | 32.8 | 63.6 | 54.7 | 77.7 | 64.6 | 68.5
OW-DETR | 78.0 | 80.7 | 79.4 | 70.4 | 58.8 | 65.1 | 84.0 | 86.2 | 56.5 | 76.7 | 62.4 | 84.8 | 85.0 | 81.8 | 81.0 | 34.3 | 48.2 | 57.9 | 62.0 | 57.0 | 69.4
Ours | 79.5 | 85.1 | 83.1 | 73.1 | 62.5 | 68.7 | 83.0 | 88.4 | 55.5 | 78.3 | 69.7 | 83.0 | 86.6 | 73.2 | 78.8 | 30.8 | 67.6 | 60.8 | 76.0 | 58.7 | 72.1
19 + 1 Setting | aero | cycle | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | bike | person | plant | sheep | sofa | train | tv | mAP
ILOD | 69.4 | 79.3 | 69.5 | 57.4 | 45.4 | 78.4 | 79.1 | 80.5 | 45.7 | 76.3 | 64.8 | 77.2 | 80.8 | 77.5 | 70.1 | 42.3 | 67.5 | 64.4 | 76.7 | 62.7 | 68.2
Faster ILOD | 64.2 | 74.7 | 73.2 | 55.5 | 53.7 | 70.8 | 82.9 | 82.6 | 51.6 | 79.7 | 58.7 | 78.8 | 81.8 | 75.3 | 77.4 | 43.1 | 73.8 | 61.7 | 69.8 | 61.1 | 68.5
ORE - (CC + EBUI) | 60.7 | 78.6 | 61.8 | 45 | 43.2 | 75.1 | 82.5 | 75.5 | 42.4 | 75.1 | 56.7 | 72.9 | 80.8 | 75.4 | 77.7 | 37.8 | 72.3 | 64.5 | 70.7 | 49.9 | 64.9
ORE - EBUI | 67.3 | 76.8 | 60 | 48.4 | 58.8 | 81.1 | 86.5 | 75.8 | 41.5 | 79.6 | 54.6 | 72.8 | 85.9 | 81.7 | 82.4 | 44.8 | 75.8 | 68.2 | 75.7 | 60.1 | 68.8
OW-DETR | 82.2 | 80.7 | 73.9 | 56.0 | 58.6 | 72.1 | 82.4 | 79.6 | 48.0 | 72.8 | 64.2 | 83.3 | 83.1 | 82.3 | 78.6 | 42.1 | 65.5 | 55.4 | 82.9 | 60.1 | 70.2
Ours | 83.6 | 85.7 | 77.1 | 61.5 | 58.9 | 74.3 | 86.3 | 81.5 | 52.2 | 78.4 | 71.4 | 81.9 | 84.6 | 80.2 | 80.8 | 39.9 | 68.3 | 63.3 | 84.6 | 63.0 | 72.9
TABLE IX: Detailed comparison with existing approaches on PASCAL VOC for incremental object detection. We only use our open-world object detector, without assistance from the large pre-trained vision-language grounding model. Evaluation is performed on three standard settings, where a group of classes (10, 5, and the last class) is introduced incrementally to a detector trained on the remaining classes (10, 15, and 19). Our model performs favorably against existing approaches in all three settings.
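For illustration, the class splits behind the three IOD settings in TABLE IX can be expressed as below. This is a minimal sketch under the assumption that classes follow the standard PASCAL VOC ordering shown in the table header; it is not the exact data pipeline of our released code.

```python
# PASCAL VOC 2007 classes in the order used in TABLE IX.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def incremental_split(num_base):
    """Return (base_classes, new_classes) for a 'num_base + rest' IOD setting."""
    return VOC_CLASSES[:num_base], VOC_CLASSES[num_base:]

# The three standard settings: 10+10, 15+5, and 19+1.
for num_base in (10, 15, 19):
    base, new = incremental_split(num_base)
    print(f"{num_base}+{len(new)}: first train on {len(base)} classes, then add {new}")
```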

6.5.6 Qualitative Comparison with OWOD method

We exhibit more qualitative results in Fig. 6 and compare them with the state-of-the-art OWOD detectors. The results show that our model can detect most of the potential unknown objects in open-world scenes, far exceeding existing models. We define the 20 categories of the PASCAL VOC [25] dataset as known classes, and all categories outside of these are considered unknown. As shown in the first column, our model accurately detects all the foreground categories in the scene and identifies everything outside the predefined known classes as unknown, such as the books on the table and the teddy bear, which are not detected by the state-of-the-art model. As shown in the second column, our model accurately detects even objects such as the picture frames on the wall. In the third column, the state-of-the-art model mistakenly identifies the characters in the background as unknown objects, whereas our model is not confused by this. In the last column, our model again greatly surpasses the detection performance of the state-of-the-art model.

Figure 6: Visualization results comparing the SOTA method and our model (known objects in blue, unknown objects in yellow). The categories of PASCAL VOC [25] are set as known. Our model significantly outperforms the SOTA (OW-DETR) for open-world object detection, accurately detecting almost all unknown objects in the open scene, including clothes and hats worn by people, while most of the unknown objects detected by the SOTA model correspond to background.

7 Societal Impact, Limitations and Future Works

Open-world object detection makes artificial intelligence better equipped to handle the problems it faces in real life. It raises object detection to a cognitive level: the model must do more than simply remember the objects it has learned; it must reason more deeply about the scene. In applications such as autonomous driving, the significance of open-world object detection becomes even more pronounced. In such scenarios, vehicles need to rapidly and accurately identify and comprehend various objects and obstacles on the road, including but not limited to pedestrians, other vehicles, traffic signals, and road signs. Breakthroughs in open-world object detection will make autonomous driving systems more intelligent, enabling them to handle unforeseen or rare situations rather than being limited to pre-trained object categories.

Although our results demonstrate significant improvements over existing state-of-the-art methods, the absolute performance is still low due to the challenging nature of the open-world detection problem. In this paper, we mainly focus on enhancing the model’s ability to explore unknown classes. However, the detection confidence and recognition ability of our model for unknown objects still need improvement, and this is what we will strive for in the future. Currently, our model still sometimes mistakes background regions for unknown objects, and the benchmark we proposed is not yet comprehensive, as it includes only a single-moment task.

In future work, these two issues will be the main focus of our research. In addition, post-processing for the predicted boxes of unknown objects is urgently needed, so developing an NMS-like operation dedicated to unknown objects is another direction we plan to pursue; a minimal class-agnostic variant is sketched below. We will pursue these goals carefully so that open-world object detection algorithms can be integrated into everyday use and contribute positively to society.
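To make the NMS-like direction concrete, the following is a minimal sketch of a greedy class-agnostic suppression over unknown-object boxes. The box format, scores, and IoU threshold are illustrative assumptions; this is not the post-processing used by UnSniffer [38], nor the one shipped with our code.

```python
def _iou(a, b):
    # Boxes are [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def unknown_nms(boxes, scores, iou_thr=0.6):
    """Greedy class-agnostic suppression for unknown-object predictions.

    boxes:  list of [x1, y1, x2, y2]
    scores: unknown-object confidences, one per box
    Returns indices of the kept boxes, highest-scoring first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap an already-kept box.
        if all(_iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```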

8 Conclusions

In this paper, we start from the observation that simple knowledge distillation can transfer the open-world knowledge of a large pre-trained vision-language grounding model to the specialized OWOD task, and we propose a simple framework with surprisingly good performance. We further propose a down-weight training loss for the detector’s mixed learning of known ground truth, distilled unknown knowledge, and the pseudo unknown labels produced by the OWOD algorithm, mitigating the effect of the distilled knowledge on known-object detection. Besides, a cascade decoupled detection transformer structure is proposed to alleviate the influence of unknown objects on detecting known objects. Last but not least, we propose two novel benchmarks, named StandardSet\heartsuit and IntensiveSet\spadesuit according to the complexity of their testing scenarios, to comprehensively evaluate the ability of open-world detectors to detect unknown open-world objects. Extensive experiments on existing and proposed benchmarks demonstrate the effectiveness of our framework: our model exceeds both the distilled large pre-trained vision-language grounding model and state-of-the-art methods on OWOD and IOD.

Acknowledgments

This work is supported by National Natural Science Foundation of China (grant No.61871106 and No.61370152), Key R&D projects of Liaoning Province, China (grant No.2020JH2/10100029), and the Open Project Program Foundation of the Key Laboratory of Opto-Electronics Information Processing, Chinese Academy of Sciences (OEIP-O-202002).

References

  • [1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [2] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision, pp. 213–229, Springer, 2020.
  • [4] Y. Lu, X. Chen, Z. Wu, and J. Yu, “Decoupled metric network for single-stage few-shot object detection,” IEEE Transactions on Cybernetics, 2022.
  • [5] Y. Pang, T. Wang, R. M. Anwer, F. S. Khan, and L. Shao, “Efficient featurized image pyramid network for single shot detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7336–7344, 2019.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
  • [7] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
  • [9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
  • [10] X. Chen, J. Yu, S. Kong, Z. Wu, and L. Wen, “Joint anchor-feature refinement for real-time accurate object detection in images and videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 2, pp. 594–607, 2020.
  • [11] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” arXiv preprint arXiv:1905.05055, 2019.
  • [12] K. J. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, “Towards open world object detection,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5826–5836, 2021.
  • [13] A. Gupta, S. Narayan, K. Joseph, S. Khan, F. S. Khan, and M. Shah, “Ow-detr: Open-world detection transformer,” in CVPR, 2022.
  • [14] J. Yu, L. Ma, Z. Li, Y. Peng, and S. Xie, “Open-world object detection via discriminative class prototype learning,” in 2022 IEEE International Conference on Image Processing (ICIP), pp. 626–630, IEEE, 2022.
  • [15] Y. Wu, X. Zhao, Y. Ma, D. Wang, and X. Liu, “Two-branch objectness-centric open world detection,” in Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, pp. 35–40, 2022.
  • [16] O. Zohar, K.-C. Wang, and S. Yeung, “Prob: Probabilistic objectness for open world object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11444–11453, 2023.
  • [17] J. Yu, L. Ma, Z. Li, Y. Peng, and S. Xie, “Open-world object detection via discriminative class prototype learning,” arXiv preprint arXiv:2302.11757, 2023.
  • [18] Y. Ma, H. Li, Z. Zhang, J. Guo, S. Zhang, R. Gong, and X. Liu, “Annealing-based label-transfer learning for open world object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11454–11463, 2023.
  • [19] J. Dong, Z. Zhang, S. He, Y. Liang, Y. Ma, J. Yu, R. Zhang, and B. Li, “A parallel open-world object detection framework with uncertainty mitigation for campus monitoring,” Applied Sciences, vol. 13, no. 23, p. 12806, 2023.
  • [20] S. Jamonnak, J. Guo, W. He, L. Gou, and L. Ren, “Ow-adapter: Human-assisted open-world object detection with a few examples,” IEEE Transactions on Visualization and Computer Graphics, 2023.
  • [21] X. Zhao, X. Liu, Y. Shen, Y. Ma, Y. Qiao, and D. Wang, “Revisiting open world object detection,” arXiv preprint arXiv:2201.00471, 2022.
  • [22] Y. Wang, Z. Yue, X.-S. Hua, and H. Zhang, “Random boxes are open-world object detectors,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6233–6243, 2023.
  • [23] D. Kim, T.-Y. Lin, A. Angelova, I. S. Kweon, and W. Kuo, “Learning open-world object proposals without learning to classify,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5453–5460, 2022.
  • [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014.
  • [25] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [26] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975, 2022.
  • [27] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. H. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, “Glipv2: Unifying localization and vision-language understanding,” in Advances in Neural Information Processing Systems, 2022.
  • [28] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790, 2021.
  • [29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  • [30] L. Yao, J. Han, Y. Wen, X. Liang, D. Xu, W. Zhang, Z. Li, C. Xu, and H. Xu, “Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection,” arXiv preprint arXiv:2209.09407, 2022.
  • [31] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
  • [32] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3967–3976, 2019.
  • [33] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 1365–1374, 2019.
  • [34] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, “Improved knowledge distillation via teacher assistant,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 5191–5198, 2020.
  • [35] B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang, “Decoupled knowledge distillation,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp. 11953–11962, 2022.
  • [36] S. Ma, Y. Wang, J. Fan, Y. Wei, T. H. Li, H. Liu, and F. Lv, “Cat: Localization and identification cascade detection transformer for open-world object detection,” arXiv preprint arXiv:2301.01970, 2023.
  • [37] Z. Wu, Y. Lu, X. Chen, Z. Wu, L. Kang, and J. Yu, “Uc-owod: Unknown-classified open world object detection,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, pp. 193–210, Springer, 2022.
  • [38] W. Liang, F. Xue, Y. Liu, G. Zhong, and A. Ming, “Unknown sniffer for object detection: Don’t turn a blind eye to unknown objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3230–3239, 2023.
  • [39] A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5356–5364, 2019.
  • [40] X. Gu, T. Lin, W. Kuo, and Y. Cui, “Zero-shot detection via vision and language knowledge distillation,” arXiv preprint arXiv:2104.13921, 2021.
  • [41] J. Jeong, S. Lee, J. Kim, and N. Kwak, “Consistency-based semi-supervised learning for object detection,” Advances in neural information processing systems, vol. 32, 2019.
  • [42] P. Tang, C. Ramaiah, Y. Wang, R. Xu, and C. Xiong, “Proposal learning for semi-supervised object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2291–2301, 2021.
  • [43] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, and T. Pfister, “A simple semi-supervised learning framework for object detection,” arXiv preprint arXiv:2005.04757, 2020.
  • [44] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069, 2021.
  • [45] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi-supervised object detection,” arXiv preprint arXiv:2102.09480, 2021.
  • [46] Y. Li, D. Huang, D. Qin, L. Wang, and B. Gong, “Improving object detection with selective self-supervised self-training,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX, pp. 589–607, Springer, 2020.
  • [47] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666, 2019.
  • [48] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
  • [49] A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-shot object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 384–400, 2018.
  • [50] A. Dhamija, M. Gunther, J. Ventura, and T. Boult, “The overlooked elephant of object detection: Open set,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1021–1030, 2020.
  • [51] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, “Dropout sampling for robust object detection in open-set conditions,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3243–3249, IEEE, 2018.
  • [52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009.
  • [54] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.
  • [55] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
  • [56] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic detr: End-to-end object detection with dynamic attention,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2988–2997, 2021.
  • [57] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [58] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [59] K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object detectors without catastrophic forgetting,” in Proceedings of the IEEE international conference on computer vision, pp. 3400–3409, 2017.
  • [60] C. Peng, K. Zhao, and B. C. Lovell, “Faster ilod: Incremental learning for object detectors based on faster rcnn,” Pattern recognition letters, vol. 140, pp. 109–115, 2020.
[Uncaptioned image] Shuailei Ma received the B.S. degree from Northeastern University, Shenyang, China, in 2022. Currently, he is pursuing a Ph.D. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Yuefeng Wang received the B.S. degree from Nanjing University of Information Science and Technology, Nanjing, China, in 2021. Currently, he is pursuing an M.Sc. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Ying Wei received the B.Sc. degree from Harbin Institute of Technology, China in 1990, and received the M.Sc. and the Ph.D. degree from Northeastern University, China in 1997 and 2001, respectively. Her research interests include image processing & pattern recognition, computer vision, medical image computation and analysis, and deep learning, etc. She is now a full-time professor at Northeastern University, China. She has more than 60 journal papers and 5 granted patents in her research fields.
[Uncaptioned image] Jiaqi Fan received a B.S. degree from Northeastern University, China, in 2022. Currently, he is pursuing an M.Sc. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Enming Zhang received a B.S. degree from Northeastern University, China, in 2022. Currently, he is pursuing an M.Sc. degree in the College of Information Science and Engineering at Northeastern University. His research interests include computer vision and deep learning.
[Uncaptioned image] Xinyu Sun received the B.E. degree in Automation Science and Engineering from South China University of Technology, China, in 2021. He is working toward the M.Sc. degree in the School of Software Engineering, South China University of Technology, China. His research interests include embodied AI and multi-modal video understanding.
[Uncaptioned image] Peihao Chen received the B.E. degree in Automation Science and Engineering from South China University of Technology, China, in 2018. He is working toward the PhD degree in the School of Software Engineering, South China University of Technology, China. His research interests include embodied AI and multi-modal video understanding.