
Proposal Distribution Calibration for
Few-Shot Object Detection

Bohao Li, Chang Liu, Mengnan Shi, Xiaozhong Chen,
Xiangyang Ji,  Qixiang Ye
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61836012, 62171431, and 62225208, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA27000000. B. Li, X. Chen, and Q. Ye are with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China. E-mail: {libohao20, chenxiaozhong16@mails.ucas.ac.cn, qxye@ucas.ac.cn}. C. Liu, M. Shi, and X. Ji are with the Department of Automation, Tsinghua University, Beijing, 100084, China. E-mail: {liuchang2022, shimn2022, xyji@tsinghua.edu.cn}. Chang Liu is the corresponding author.
Abstract

Adapting object detectors learned with sufficient supervision to novel classes under low data regimes is appealing yet challenging. In few-shot object detection (FSOD), a two-step training paradigm is widely adopted to mitigate the severe sample imbalance: holistic pre-training on base classes, followed by partial fine-tuning in a balanced setting with all classes. Since unlabeled instances are suppressed as background in the base training phase, the learned RPN is prone to producing biased proposals for novel instances, resulting in dramatic performance degradation. Unfortunately, the extreme data scarcity aggravates the proposal distribution bias, hindering the RoI head from evolving toward novel classes. In this paper, we introduce a simple yet effective proposal distribution calibration (PDC) approach to enhance the localization and classification abilities of the RoI head by recycling the localization ability endowed in base training and enriching high-quality positive samples for semantic fine-tuning. Specifically, we sample proposals based on the base-class proposal statistics to calibrate the distribution bias and impose additional localization and classification losses upon the sampled proposals to quickly expand the base detector to novel classes. Experiments on the commonly used Pascal VOC and MS COCO datasets demonstrate state-of-the-art performance and justify the efficacy of our PDC for FSOD. Code is available at github.com/Bohao-Lee/PDC.

Index Terms:
Distribution Calibration, Few-shot Object Detection, Object Detection, Proposal Distribution Calibration

I Introduction

Given the heavy burden of large-scale dataset collection and annotation, together with the aspiration to imitate human cognition, deep learning under low data regimes has attracted growing attention. To simulate data-scarce scenarios, such as identifying rare diseases, species, military objectives, etc., few-shot learning (FSL) has been proposed to pursue fast knowledge adaptation from base classes with sufficient labeled data to novel classes with limited instances and annotations. Coupling classification with localization, few-shot object detection (FSOD) presents a more practical yet challenging problem that is far from well studied.

Figure 1: Overview of our proposal distribution calibration (PDC) on Pascal VOC under 5-shot fine-tuning. We show the offset distributions of the center point location, width, and height of RPN proposals for novel instances before and after fine-tuning (upper left), and plot the RPN proposal offset distributions of the base training data after base training (upper right). Although the RPN is class-agnostic on base classes, the novel proposals are still dramatically biased, since novel classes are suppressed as background in the base training phase. Coupled with data scarcity, this distribution shift can hardly be mitigated in the fine-tuning phase, resulting in severe detection degradation of the RoI head (lower left). In contrast, our PDC calibrates the distribution bias by sampling high-quality proposals based on base training statistics for novel class adaptation (upper right), and reports superior classification and localization performance (lower right).

To avoid overfitting under extreme data imbalance, FSOD methods [1] typically adopt a two-step training paradigm of base training and fast adaptation. TFA [5] further reveals that novel classes can be well represented based on base features, inspiring a line of actively studied work that maximizes the usage of base knowledge. Specifically, Halluc [22] introduces a data hallucinator, trained with an EM-style procedure on top of the base detector, to improve the model variation of novel classes by transferring the shared within-class variation from base classes. FADI [10] imitates specific base class feature spaces for novel class embedding by explicitly associating similar base-novel class pairs, and then scatters them by disentangling the base and novel classifiers.

Since novel instances may be suppressed as background during base training, the learned RPN is typically inadequate for recognizing novel instances. Worse still, the biased RPN proposals can hardly be rectified by the RoI head with limited novel annotations, resulting in sub-optimal classification and localization adaptation [26]. This intrinsic contradiction is far from satisfactorily solved and essentially hinders the development of FSOD. Nonetheless, we wonder whether there is a “free lunch” [17] that can serve as a remedy.

To investigate this issue, we plot the offset distributions of RPN proposals against ground truths in both learning stages and observe a noticeable shift, Fig. 1 (upper). It is impracticable to reverse this deviation with scarce and low-quality novel proposals. Since comprehensive coverage of training data typically ensures superior model generalization, we resort to mitigating the data scarcity in fine-tuning by reusing the proposal distribution statistics from base training to crop sufficient high-quality novel proposals. In this way, the augmented novel proposals involve less context and artifact noise, enabling sufficient and correct assimilation of novel semantics. Also, as localization is relatively independent of classification, the de-biased high-IoU proposals facilitate a milder repurposing procedure of the RoI head by recycling the localization ability endowed in the base training stage. Additionally, we impose extra objective functions, i.e., a supervised contrastive loss [44], on the sampled proposal set to enhance the classification and localization adaptation of the detection head toward novel classes.

In summary, we propose a simple yet effective proposal distribution calibration (PDC) approach that alleviates the critical objective contradiction and data scarcity in the current two-step FSOD paradigm. Being plug-and-play, the proposed PDC can be easily inserted into seminal two-stage few-shot detectors to enhance both classification and localization generalization. With theoretical and experimental analyses, we justify the efficacy of our PDC and achieve new state-of-the-art performance on the commonly used benchmarks, i.e., Pascal VOC 2007 [41], VOC 2012 [42], and MS COCO [43].

II Related Works

II-A Few-Shot Learning

FSL methods can be roughly divided into three categories: 1) meta-learning-based methods [36, 35, 34, 39] typically design specific models or optimization procedures for fast model adaptation; 2) metric learning-based methods [28, 29, 30, 31, 32, 33, 7] pursue optimizing the embedding manifold to enhance feature discrimination; 3) data augmentation-based methods [38, 37] target augmenting novel data to improve model generalizability.

As FSL shares similar intrinsic challenges with FSOD while being relatively simpler, seminal FSL works continually inspire the development of FSOD [40]. Concretely, MVT [16] transfers factors of variation across classes via semantic transformations to alleviate the data requirement. DC [17] steps forward by economically migrating distribution statistics from base classes to their most similar novel classes for data sampling. These valuable explorations encourage in-depth attempts in FSOD to exploit base knowledge at a low cost.

II-B Few-Shot Object Detection

General object detection methods are commonly categorized as one-stage and two-stage detectors, which continually evolve with the tremendous progress in deep neural architectures [21]. One-stage detectors [18, 19] locate and classify objects simultaneously to achieve better efficiency. Two-stage detectors [20] first propose candidate object regions and then discriminate and rectify each proposal for superior accuracy. However, these achievements are severely data-hungry for generalization, which limits their real-world application.

Few-shot object detection, initially inspired by meta-learning [1, 2], targets expanding the object vocabulary from label-rich scenarios to the open world with limited supervised showcases. TFA [5] advocates keeping the backbone learned during base training frozen and fine-tuning only the box regressor and classifier. Such a simple two-step training scheme reports surprisingly superior performance, which encourages and inspires the FSOD community to maximize the usage of base knowledge for fast adaptation.

Concretely, subsequent fine-tuning-based few-shot detectors typically delve into optimizing the embedding space via metric learning [8, 7], relation reasoning [27], and feature interaction [11]; alleviating the essential objective contradiction across training stages via base knowledge retention [13], detection head decoupling [9, 10], and annotation rectification [26]; and augmenting novel instances via object pyramids [6], data hallucination [22], etc. In this work, we bridge the semantic gap in repurposing the class-aware RoI head for novel classes by applying PDC to the class-agnostic RPN, using base proposal statistics as a “free lunch.”

Figure 2: Framework of our PDC approach in the fine-tuning stage, which consists of two branches: the baseline detector (bottom) and proposal distribution calibration (top). Grey indicates frozen parameters, while green indicates parameters to be fine-tuned.

III Proposal Distribution Calibration

III-A Background

Task definition

Given a dataset ${\cal D}$, the objects are divided into base and novel classes. Each base class contains sufficient bounding-box annotations, while each novel class has only limited ones. Following FSRW [1], few-shot detectors usually adopt a two-step paradigm to avoid overfitting. Specifically, in the base-training stage, the whole detector is sufficiently trained on the base class dataset ${\cal D}_{b}$ with only base class annotations. In the fine-tuning stage, the detection head is fine-tuned on both base and novel classes with the balanced data setting of ${\cal D}_{f}$, while the backbone is kept frozen. In ${\cal D}_{f}$, each class has $K$ annotated instances. When the novel class number is $N$, this is the so-called $N$-way $K$-shot FSOD setting.
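As a rough illustration of this data setting, the sketch below assembles a balanced $K$-shot fine-tuning set from per-instance annotations; the annotation format and helper name are assumptions for illustration, not part of the released code.

```python
import random
from collections import defaultdict

def build_balanced_finetune_set(annotations, base_classes, novel_classes, k, seed=0):
    """Sample K annotated instances per class (base and novel) to form the
    balanced fine-tuning set D_f of the N-way K-shot setting.

    `annotations` is assumed to be a list of (image_id, box, class_label) tuples.
    """
    random.seed(seed)
    per_class = defaultdict(list)
    for ann in annotations:
        per_class[ann[2]].append(ann)

    d_f = []
    for c in list(base_classes) + list(novel_classes):
        instances = per_class[c]
        # keep at most K instances per class so base and novel data are balanced
        d_f.extend(random.sample(instances, min(k, len(instances))))
    return d_f
```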

Baseline detector

We use imTED [12] as our baseline detector, which modifies Faster R-CNN [20] by holistically introducing the MAE [15] pre-trained model, Fig. 2 (bottom branch). It adopts the MAE encoder as the detector’s backbone and the MAE decoder as the RoI head. For proposal generation, imTED uses an FPN and an RPN following the settings defined in Faster R-CNN. For proposal feature extraction, imTED skips the FPN and simply extracts RoI features from the feature map of the last backbone layer. In this way, imTED maximally utilizes the pre-trained parameters of MAE to improve the generalization ability of object representation. The proposal feature extraction pathway is thus integrally migrated from the pre-trained MAE to imTED, which is observed to be important for few-shot object detection.

Note that our PDC only attaches an additional branch to imTED after base training, so we focus on the fine-tuning stage modifications in the following. Denote the detection head parameters of imTED as $\theta_{det}$. The baseline object detection loss can be defined as

$$\underset{\theta_{det}}{\arg\min}\,\mathcal{L}_{Rcnn}^{imTED}(\theta_{det})=\underset{\theta_{Roi}}{\arg\min}\big(\underbrace{\mathcal{L}_{Roi}^{cls}+\mathcal{L}_{Roi}^{reg}(\theta_{Roi})}_{\mathcal{L}_{Roi}}\big)+\underset{\theta_{Rpn},\theta_{Fpn}}{\arg\min}\big(\underbrace{\mathcal{L}_{Rpn}^{obj}+\mathcal{L}_{Rpn}^{reg}(\theta_{Rpn},\theta_{Fpn})}_{\mathcal{L}_{Rpn}}\big),$$ (1)

where $\theta_{Roi}$, $\theta_{Rpn}$, and $\theta_{Fpn}$ respectively denote the RoI, RPN, and FPN parameters, and $\theta_{det}=\theta_{Roi}\cup\theta_{Rpn}\cup\theta_{Fpn}$. $\mathcal{L}_{Roi}^{reg}$ and $\mathcal{L}_{Roi}^{cls}$ are the regression and classification losses for the RoI head. $\mathcal{L}_{Rpn}^{reg}$ and $\mathcal{L}_{Rpn}^{obj}$ are the regression and objectness losses for the RPN head.

III-B Proposal Sampling

Distribution statistics in base training

Given an image $I_{m}\in{\cal D}_{b}$ for base training, denote its annotations as ${\cal G}(I_{m})=\{(\bm{b}_{i}^{m},c_{i}^{m}),\ i=1,\cdots,N_{I}\}$, where $\bm{b}_{i}=[x_{i},y_{i},w_{i},h_{i}]$; $(x_{i},y_{i})$, $w_{i}$, and $h_{i}$ are the center position, width, and height of the $i$-th bounding box; $c_{i}$ is the class label of $\bm{b}_{i}$. We omit $I$ for brevity.

We calculate the scale-normalized offset statistics of RPN proposals for distribution fitting. Concretely, for the $k$-th proposal $\bm{b}_{ik}=[x_{ik},y_{ik},w_{ik},h_{ik}]$ ($k=1,\cdots,K_{i}$) with its corresponding ground-truth bounding box $\bm{b}_{i}$, define its four-dimensional offset as $\Delta\bm{b}_{ik}=(\bm{b}_{ik}-\bm{b}_{i})/\bm{s}_{i}$, where $\bm{s}_{i}=[w_{i},h_{i},w_{i},h_{i}]$ is the scale coefficient.

According to the law of large numbers, we assume the RPN proposal offset distribution of base classes after base training to be Gaussian. For simplicity, we denote it as $\Delta\bm{b}\sim{\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$, where $\bm{\mu}_{b}=[\mu_{x},\mu_{y},\mu_{w},\mu_{h}]$ is the mean vector and $\bm{\sigma}_{b}^{2}=\mathrm{diag}([\sigma_{x}^{2},\sigma_{y}^{2},\sigma_{w}^{2},\sigma_{h}^{2}])$ is the covariance matrix. $(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ can be calculated as follows:

$$\bm{\mu}_{b}=\frac{\sum_{I_{m}\in{\cal D}_{b}}\sum_{i=1}^{N_{I}}\sum_{k=1}^{K_{i}}\Delta\bm{b}_{ik}^{m}}{\sum_{I_{m}\in{\cal D}_{b}}\sum_{i=1}^{N_{I}}\sum_{k=1}^{K_{i}}1},\qquad\bm{\sigma}_{b}^{2}=\mathrm{diag}\Big(\frac{\sum_{I_{m}\in{\cal D}_{b}}\sum_{i=1}^{N_{I}}\sum_{k=1}^{K_{i}}(\Delta\bm{b}_{ik}^{m}-\bm{\mu}_{b})^{2}}{\sum_{I_{m}\in{\cal D}_{b}}\sum_{i=1}^{N_{I}}\sum_{k=1}^{K_{i}}1}\Big).$$ (2)
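A minimal NumPy sketch of this statistics collection is given below, assuming proposals and ground-truth boxes are available in (cx, cy, w, h) format; the data structure and function name are illustrative assumptions.

```python
import numpy as np

def proposal_offset_statistics(proposals_per_gt):
    """Estimate the Gaussian offset statistics (mu_b, sigma_b^2) of Eq. 2.

    `proposals_per_gt` is a list of (gt_box, proposal_boxes) pairs collected
    from the base-training RPN, with all boxes in (cx, cy, w, h) format.
    """
    offsets = []
    for gt, props in proposals_per_gt:
        cx, cy, w, h = gt
        scale = np.array([w, h, w, h])                    # s_i = [w, h, w, h]
        for p in np.asarray(props, dtype=np.float64):
            offsets.append((p - np.asarray(gt)) / scale)  # delta_b_ik = (b_ik - b_i) / s_i
    offsets = np.stack(offsets)                           # (num_proposals, 4)
    mu_b = offsets.mean(axis=0)                           # mean vector mu_b
    sigma_b2 = offsets.var(axis=0)                        # per-dimension variances sigma_b^2
    return mu_b, sigma_b2
```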
Distribution calibration for fine-tuning

With the Gaussian offset distribution $\Delta\bm{b}\sim{\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$, we can perturb the object location annotations for proposal sampling. Concretely, we sample $J_{\bm{b}_{i}}$ proposals for $\bm{b}_{i}$, all of which share the same category $c_{i}$. The sampled proposal set ${\cal P}_{S}$ can be expressed as,

$${\cal P}_{S}=\cup_{i=1}^{N}{\cal P}_{\bm{b}_{i}},$$ (3)
$${\cal P}_{\bm{b}_{i}}=\{\bm{p}_{ij}\,|\,\bm{p}_{ij}=\bm{b}_{i}\times(1+\Delta\bm{b}_{ij}\bm{s}_{i}),\ j=1,\cdots,J_{\bm{b}_{i}}\}.$$ (4)

Denote the RPN proposal set as ${\cal P}_{R}$. We apply the calibrated proposal set ${\cal P}_{F}={\cal P}_{S}\cup{\cal P}_{R}$ for fine-tuning. In particular, ${\cal P}_{S}$ is used to assist in repurposing the RoI head for novel classes.
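A sketch of the sampling step is given below. It assumes boxes in (cx, cy, w, h) format and recovers each sampled proposal by inverting the offset definition, i.e., $\bm{p}=\bm{b}_{i}+\Delta\bm{b}\odot\bm{s}_{i}$; the function name and defaults are illustrative.

```python
import numpy as np

def sample_calibrated_proposals(gt_boxes, mu_b, sigma_b2, num_samples=50, seed=0):
    """Sample proposals around each ground-truth box from N(mu_b, sigma_b^2).

    gt_boxes: (G, 4) ground-truth boxes in (cx, cy, w, h) format.
    Returns the sampled proposal set P_S of shape (G * num_samples, 4);
    each sampled proposal inherits the class label of its ground-truth box.
    """
    rng = np.random.default_rng(seed)
    sampled = []
    for gt in np.asarray(gt_boxes, dtype=np.float64):
        w, h = gt[2], gt[3]
        scale = np.array([w, h, w, h])                             # s_i = [w, h, w, h]
        offsets = rng.normal(mu_b, np.sqrt(sigma_b2), size=(num_samples, 4))
        sampled.append(gt + offsets * scale)                       # invert delta_b = (p - b_i) / s_i
    return np.concatenate(sampled, axis=0)                         # P_S, later merged with RPN proposals P_R
```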

III-C Repurposed RoI Head

Classification

For a mini-batch of $M$ images, we apply the supervised contrastive loss [44] on the sampled proposal set $\cup_{1}^{M}{\cal P}^{m}_{S}$, which can be defined as

$$\mathcal{L}_{Roi}^{con}(\cup_{1}^{M}{\cal P}^{m}_{S})=\frac{\sum_{\bm{p}_{ij}^{m}\in\cup_{1}^{M}{\cal P}^{m}_{S}}l(\bm{p}_{ij}^{m})}{|\cup_{1}^{M}{\cal P}^{m}_{S}|},$$ (5)
$$l(\bm{p}_{ij}^{m})=\frac{-1}{N_{c_{ij}^{m}}-1}\sum_{m^{\prime},i^{\prime},j^{\prime}}\mathbbm{1}(c_{i^{\prime}j^{\prime}}^{m^{\prime}}=c_{ij}^{m})\cdot\log\frac{\exp(\bm{z}_{ij}^{m}\cdot\bm{z}_{i^{\prime}j^{\prime}}^{m^{\prime}}/\tau)}{\sum_{m^{\prime\prime},i^{\prime\prime},j^{\prime\prime}}\mathbbm{1}(\bm{p}_{i^{\prime\prime}j^{\prime\prime}}^{m^{\prime\prime}}\neq\bm{p}_{ij}^{m})\exp(\bm{z}_{ij}^{m}\cdot\bm{z}_{i^{\prime\prime}j^{\prime\prime}}^{m^{\prime\prime}}/\tau)},$$ (6)

where $N_{c_{ij}^{m}}$ denotes the number of sampled proposals of class $c_{ij}^{m}$; $\bm{z}_{ij}^{m}$, $\bm{z}_{i^{\prime}j^{\prime}}^{m^{\prime}}$, and $\bm{z}_{i^{\prime\prime}j^{\prime\prime}}^{m^{\prime\prime}}$ are the contrastive-head-encoded RoI features of the sampled proposals $\bm{p}_{ij}^{m}$, $\bm{p}_{i^{\prime}j^{\prime}}^{m^{\prime}}$, and $\bm{p}_{i^{\prime\prime}j^{\prime\prime}}^{m^{\prime\prime}}$, respectively; $\mathbbm{1}(\cdot)$ equals 1 if the input statement is true and 0 otherwise; $\tau$ is the temperature hyper-parameter.
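A compact PyTorch sketch of Eq. 5-6 is shown below, where `features` stands for the contrastive-head outputs of all sampled proposals in the mini-batch and `labels` holds their class labels; the L2 normalization and masking details are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, tau=0.2):
    """Supervised contrastive loss over sampled-proposal features (Eq. 5-6)."""
    z = F.normalize(features, dim=1)                   # contrastive-head embeddings z
    sim = torch.matmul(z, z.T) / tau                   # pairwise z_i . z_j / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)             # drop the anchor itself from the denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # positives: other proposals sharing the anchor's class label
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    num_pos = pos_mask.sum(dim=1).clamp(min=1)         # N_c - 1 in Eq. 6
    loss = -(log_prob * pos_mask).sum(dim=1) / num_pos
    return loss.mean()                                 # average over sampled proposals (Eq. 5)
```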

Combining the original RoI classification loss, the updated classification loss for sampled proposals on the repurposed RoI head can be defined as

$$\mathcal{L}_{Re\mbox{-}Roi}^{cls}({\cal P}_{S})=\mathcal{L}_{Roi}^{cls}({\cal P}_{S})+\mathcal{L}_{Roi}^{con}({\cal P}_{S}).$$ (7)
Localization

The sampled proposal set ${\cal P}_{S}$ is also fed into the localization branch of the repurposed RoI head to recycle the localization ability endowed in base training for generalization enhancement. The regression loss in the Re-RoI head can be calculated as

$$\mathcal{L}_{Re\mbox{-}Roi}^{loc}({\cal P}_{S})=\mathcal{L}_{Roi}^{reg}({\cal P}_{S}).$$ (8)

With Eq. 7 and Eq. 8, the whole objective of the Re-RoI head is updated as

$$\mathcal{L}_{Re\mbox{-}Roi}({\cal P}_{S},\theta_{Roi})=\mathcal{L}_{Re\mbox{-}Roi}^{cls}({\cal P}_{S},\theta_{Roi})+\mathcal{L}_{Re\mbox{-}Roi}^{loc}({\cal P}_{S},\theta_{Roi}).$$ (9)

Hence, the total fine-tuning loss function of our PDC can be defined as

$$\underset{\theta_{det}}{\arg\min}\,\mathcal{L}_{Rcnn}^{PDC}({\cal P},\theta_{det})=\underset{\theta_{det}}{\arg\min}\,\mathcal{L}_{Rcnn}^{imTED}({\cal P}_{R},\theta_{det})+\underset{\theta_{Roi}}{\arg\min}\,\lambda\mathcal{L}_{Re\mbox{-}Roi}({\cal P}_{S},\theta_{Roi}),$$ (10)

where $\lambda$ is a hyper-parameter balancing the loss terms and is set to 0.1 by default.
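A minimal sketch of how the loss terms of Eq. 9-10 could be combined in a training step is given below; the dictionary keys and function name are illustrative assumptions rather than the actual implementation.

```python
def pdc_finetune_loss(baseline_losses, sampled_losses, lam=0.1):
    """Combine the baseline imTED loss on RPN proposals P_R (Eq. 1) with the
    Re-RoI losses on the calibrated proposal set P_S (Eq. 7-9)."""
    loss_baseline = sum(baseline_losses.values())        # L_Rcnn^imTED on P_R
    loss_re_roi = (sampled_losses['cls']                 # RoI classification loss on P_S
                   + sampled_losses['con']               # supervised contrastive loss on P_S
                   + sampled_losses['reg'])              # RoI regression loss on P_S
    return loss_baseline + lam * loss_re_roi             # Eq. 10 with lambda = 0.1 by default
```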

III-D Generalization Analysis

Assume the input proposals of the RoI head after base training are denoted by $\bm{p}_{B}\sim{\mathbb{P}}_{B}$ and the input proposals during fine-tuning are denoted by $\bm{p}_{F}\sim{\mathbb{P}}_{F}$, where ${\mathbb{P}}_{B}$ and ${\mathbb{P}}_{F}$ are the corresponding proposal distributions; the proposals are drawn i.i.d. from their respective distributions. Let ${\mathbb{P}}$ denote the proposal distribution after training on a sufficient dataset ${\cal D}$; ${\mathbb{P}}_{B}$ is approximately equal to ${\mathbb{P}}$.

We can determine the relationship between the true risk and the empirical risk as follows:

$${\rm E}_{\mathcal{D}}(l(\bm{p}_{F}))\leq\hat{{\rm E}}_{{\mathcal{D}}_{f}}(l(\bm{p}_{F}))+2R_{n}(\mathcal{L})+\sqrt{\frac{\ln 1/\delta}{n}},$$ (11)

where $l(\bm{p})\in\mathcal{L}$ is the fine-tuning loss function for the RoI head, with $\mathcal{L}=\mathcal{L}_{Re\mbox{-}Roi}+\mathcal{L}_{Roi}$; $n$ is the number of proposals, $R_{n}(\mathcal{L})$ is the Rademacher complexity of $\mathcal{L}$, and the bound holds with probability at least $1-\delta$. ${\rm E}_{\mathcal{D}}(l(\bm{p}_{F}))$ denotes the true risk and $\hat{{\rm E}}_{{\mathcal{D}}_{f}}(l(\bm{p}_{F}))$ denotes the empirical risk over the available labeled data.

Following [23], we can obtain the following risk upper bound:

$${\rm E}_{\mathcal{D}}(l(\bm{p}_{F}))\leq\hat{{\rm E}}_{{\cal D}_{f}}(l(\bm{p}_{F}))+2R_{n}(\mathcal{L})+{\rm MMD}[{\mathbb{P}}_{F},{\mathbb{P}}_{B}]+\sqrt{\frac{\ln 1/\delta}{n}}.$$ (12)

Note that $\sqrt{\frac{\ln 1/\delta}{n}}$ and $2R_{n}(\mathcal{L})$ are constants once the dataset and the model are given. As $\hat{{\rm E}}_{{\cal D}_{f}}(l(\bm{p}_{F}))$ is minimized during model fine-tuning, Eq. 12 can be approximated as

$${\rm E}_{\mathcal{D}}(l(\bm{p}_{F}))\propto{\rm MMD}[{\mathbb{P}}_{F},{\mathbb{P}}_{B}],$$ (13)

where MMD [24] denotes the maximum mean discrepancy, defined as

$${\rm MMD}({\mathbb{P}}_{F},{\mathbb{P}}_{B}):=\Big\|\frac{1}{N_{f}}\sum_{\bm{p}_{F}}\bm{p}_{F}-\frac{1}{N_{b}}\sum_{\bm{p}_{B}}\bm{p}_{B}\Big\|_{\mathcal{H}},$$ (14)

where $N_{b}$ and $N_{f}$ denote the numbers of input proposals of the RoI head during base training and fine-tuning, respectively, and $\mathcal{H}$ denotes the reproducing kernel Hilbert space (RKHS).
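As an illustration, the empirical estimate of Eq. 14 with an identity feature map, so that the RKHS norm reduces to the Euclidean distance between proposal means, can be sketched as follows.

```python
import torch

def empirical_mmd(p_f, p_b):
    """Empirical MMD between fine-tuning proposals P_F (N_f, 4) and
    base-training proposals P_B (N_b, 4), using the identity feature map."""
    mean_f = p_f.mean(dim=0)            # (1 / N_f) * sum over p_F
    mean_b = p_b.mean(dim=0)            # (1 / N_b) * sum over p_B
    return torch.norm(mean_f - mean_b)  # distance between the two mean embeddings
```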

According to Eq. 14, when the differences between the means and variances of the distributions ${\mathbb{P}}_{B}$ and ${\mathbb{P}}_{F}$ decrease, the maximum mean discrepancy between ${\mathbb{P}}_{B}$ and ${\mathbb{P}}_{F}$ is reduced accordingly. Combining Eq. 13 and Eq. 14, the risk upper bound is thus effectively reduced.

TABLE I: Ablation study of PDC modules for FSOD on Pascal VOC novel classes (split-1). “Dim.” denotes the output dimension of the contrastive head. “avg. Δ” denotes the average performance improvement. “*” denotes that we fit the base training proposal distribution ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ (defined in Eq. 2) with the optimal uniform distribution parameters by maximizing their intersection area.
Proposal Sampling Classification Localization Novel set 1 (AP50) avg. Δ
$\mathcal{L}_{Re\mbox{-}Roi}^{con}$ $\mathcal{L}_{Re\mbox{-}Roi}^{cls}$ $\mathcal{L}_{Re\mbox{-}Roi}^{reg}$
{Distribution} {Number} {$\tau$} {Dim.} {${\cal P}_{S}$} {${\cal P}_{R}$} 1 2 3 5 10
1 - - - - 43.4 51.0 58.1 67.6 66.6 -
2 - 0.2 128 - - - 45.3 49.4 57.6 67.4 66.0 -0.2
3 $\mathbb{U}(-0.02,0.02)$ 10 - 46.3 51.8 56.5 67.8 66.0 +0.3
4 $\mathbb{U}(-0.05,0.05)$ - 47.8 49.3 60.5 67.3 66.7 +1.0
5 $\mathbb{U}^{*}$ - 49.3 50.0 56.4 69.8 67.2 +1.2
6 $\mathbb{U}(-0.1,0.1)$ - 47.2 50.4 57.5 67.9 67.3 +0.9
7 $\mathbb{U}(-0.2,0.2)$ - 47.8 47.7 57.5 68.6 67.0 +0.4
8 $\mathbb{U}^{*}$ 20 - 46.8 52.7 57.6 69.5 66.7 +1.3
9 50 - 46.5 52.7 58.3 68.7 67.9 +1.5
10 100 - 45.4 51.5 57.8 69.4 67.3 +0.9
11 ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ 50 - 48.4 53.4 58.1 67.9 67.2 +1.6
12 50 0.07 128 - 47.4 52.6 57.6 69.6 67.9 +1.7
13 0.2 - 50.1 52.3 60.2 70.7 68.4 +3.0
14 0.7 - 47.8 51.0 58.4 69.7 67.9 +1.6
15 0.2 64 - 47.4 57.0 57.8 69.5 69.2 +2.8
16 256 - 45.8 49.4 60.1 69.0 67.3 +1.0
17 128 - 48.2 54.0 59.4 69.4 68.7 +2.6
18 49.6 48.5 50.7 69.0 67.9 +1.8
19 - - - 42.4 49.6 58.5 68.1 67.6 -0.1
20 - - 43.5 54.3 62.7 69.1 66.9 +2.0
21 - - 45.4 49.0 58.8 68.0 66.8 +0.3
22 ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ 50 0.2 128 - 50.1 52.3 60.2 70.7 68.4 +3.0

IV Experiments

IV-A Experimental Setting

Datasets. The Pascal VOC 2007 [41], VOC 2012 [42], and MS COCO [43] datasets are commonly used for FSOD evaluation. They are typically split into fully annotated base classes and $K$-shot novel classes in accordance with the settings in FSRW [1]. Specifically, the Pascal VOC dataset is divided into three splits for cross validation, where 5 classes are chosen as novel classes and the remaining 15 classes serve as base classes. The number of annotated instances $K$ is set to 1, 2, 3, 5, and 10. For MS COCO, twenty categories are chosen as novel classes with $K$ set to 10 and 30, and the remaining 60 categories are treated as base classes.

Implementation Details. We mainly conduct the ablation study and model analysis of PDC based on the baseline detector imTED [12]. ImTED [12] with the ViT-S [45] backbone is implemented with PyTorch 1.8.0, mmcv 1.4.0, and mmdetection 2.11.0 on 8 NVIDIA A40 GPUs. We feed each GPU 2 images with data augmentations of random cropping, horizontal flipping, random resizing, and size normalization for base training and fine-tuning, using the AdamW optimizer.
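For reference, an mmdetection-style configuration consistent with the description above might look roughly as follows; the learning rate, image scale, and pipeline entries are assumptions for illustration, not the released settings (random cropping is omitted here for brevity).

```python
# Sketch of a training configuration in mmdetection 2.x style.
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.05)    # assumed values
data = dict(samples_per_gpu=2, workers_per_gpu=2)             # 2 images per GPU, 8 GPUs

img_norm_cfg = dict(mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),  # random resizing (assumed scale)
    dict(type='RandomFlip', flip_ratio=0.5),                      # horizontal flipping
    dict(type='Normalize', **img_norm_cfg),                       # value normalization
    dict(type='Pad', size_divisor=32),                            # size normalization
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```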

TABLE II: Performance (AP50) comparisons on Pascal VOC. * denotes our re-implementation. The best (second best) is in bold (underlined).
Method Novel set 1 Novel set 2 Novel set 3
1 2 3 5 10 1 2 3 5 10 1 2 3 5 10
FSRW[ICCV2019] [1] 14.8 15.5 26.7 33.9 47.2 15.7 15.3 22.7 30.1 40.5 21.3 25.6 28.4 42.8 45.9
Meta R-CNN[ICCV2019] [3] 19.9 25.5 35.0 45.7 51.5 10.4 19.4 29.6 34.8 45.4 14.3 18.2 27.5 41.2 48.1
Viewpoint[ECCV2020] [4] 24.2 35.3 42.2 49.1 57.4 21.6 24.6 31.9 37.0 45.7 21.2 30.0 37.2 43.8 49.6
TFA w/cos[ICML2020] [5] 39.8 36.1 44.7 55.7 56.0 23.5 26.9 34.1 35.1 39.1 30.8 34.8 42.8 49.5 49.8
MPSR[ECCV2020] [6] 41.7 42.5 51.4 52.2 61.8 24.4 29.3 39.2 39.9 47.8 35.6 41.8 42.3 48.0 49.7
MPSR+PDC (Ours) 41.6 44.2 50.8 56.7 61.5 31.1 30.6 41.1 43.1 48.6 34.6 43.0 43.0 49.5 52.2
CME[CVPR2021] [7] 41.5 47.5 50.4 58.2 60.9 27.2 30.2 41.4 42.5 46.8 34.3 39.6 45.1 48.3 51.5
FSCE[CVPR2021] [8] 44.2 43.8 51.4 61.9 63.4 27.3 29.5 43.5 44.2 50.2 37.2 41.9 47.5 54.6 58.5
DeFRCN[ICCV2021] [9] 40.2 53.6 58.2 63.6 66.5 29.5 39.7 43.4 48.1 52.8 35.0 38.3 52.9 57.7 60.8
FADI[NIPS2021] [10] 50.3 54.8 54.2 59.3 63.2 30.6 35.0 40.3 42.8 48.0 45.7 49.7 49.1 55.0 59.6
FCT[CVPR2022] [11] 49.9 57.1 57.9 63.2 67.1 27.6 34.5 43.7 49.2 51.2 39.5 54.7 52.3 57.0 58.7
LVC[CVPR2022] [26] 54.5 53.2 58.8 63.2 65.7 32.8 29.2 50.7 49.8 50.6 48.4 52.7 55.0 59.6 59.6
DC[NIPS2022] [47] 45.8 59.1 62.1 66.8 68.0 31.8 41.7 46.6 50.3 53.7 39.6 52.1 56.3 60.3 63.3
imTED-S*[Arxiv2022] [12] 43.4 51.0 58.1 67.6 66.6 23.2 26.9 39.4 44.2 52.7 49.9 48.8 56.4 61.4 61.1
imTED-S+PDC (Ours) 50.1 52.3 60.2 70.7 68.4 23.3 28.5 43.2 48.4 54.6 53.0 50.8 57.7 63.8 62.9
imTED-B*[Arxiv2022] [12] 56.8 64.8 69.4 80.1 76.8 37.4 38.1 59.1 57.6 60.9 60.9 59.3 70.0 73.9 75.7
imTED-B+PDC (Ours) 61.8 69.1 70.2 78.7 79.6 42.9 41.2 60.0 56.3 65.9 60.3 63.1 70.6 73.3 76.7

IV-B Ablation Study

The validity of the PDC components with proper hyper-parameters on Pascal VOC split-1 is justified in Table I. The average improvement of PDC with only the proposal sampling module over the baseline method is 1.6%. With contrastive learning, the gain further rises to 3.0%, demonstrating the significant advancement brought by our PDC.

Proposal Sampling. In Table I, we show that sampling proposals from distributions similar to that produced by the base-trained RPN, i.e., ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ defined in Eq. 2, ensures superior novel class adaptation. Specifically, we denote the optimal uniform distribution with maximum intersection with ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ as $\mathbb{U}^{*}$, where $\mathbb{U}^{*}$ = [U(-0.055,0.055), U(-0.036,0.036), U(-0.077,0.077), U(-0.057,0.057)]. As shown in rows 3-7 of Table I, when sampling from uniform distributions with different perturbation intensities, the performance gain progressively improves as the sampling distribution approaches ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$, justifying the efficacy of calibrating proposals for fine-tuning using base training statistics. From rows 5 and 8-10, one can see that with more high-quality positives sampled (no more than 50 per instance), the average performance gain increases from 1.0 to 1.5, while it drops to 0.9 when too many positives break the balance of positive and negative samples for RoI head repurposing. As adopting ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ reports slightly higher performance than $\mathbb{U}^{*}$, i.e., a 0.1 “avg. Δ” gain (rows 9 and 11), we sample 50 proposals from the Gaussian distribution ${\mathbb{N}}(\bm{\mu}_{b},\bm{\sigma}_{b}^{2})$ by default.

Figure 3: The IoU histograms of RoI bounding boxes for novel instances after 5-shot fine-tuning, where the baseline (imTED) and our approach (PDC) are represented by light and dark colors, respectively. Especially, the black dotted line denotes the RoI predictions of base classes after base training.
Figure 4: The RoI classification improvement for novel bounding boxes of different IoUs after 5-shot fine-tuning, where the baseline (imTED) and our approach (PDC) are represented by light and dark colors, respectively. The base class statistics after base training are indicated by the black dotted line.
Figure 5: t-SNE visualization of prototypes produced by the baseline method and our PDC on Pascal VOC split-1 with ViT-S. Novel class distributions are indicated by solid line boxes and bolded in the legend.
Figure 6: Detection results of the baseline method and our PDC on Pascal VOC and MS COCO datasets. Blue boxes indicate objects detected by both baseline and PDC. Green boxes indicate objects detected only by PDC.

Contrastive Learning. Following FSCE [8], we introduce an MLP head for contrastive learning. From rows 1-2 in Table I, one can see that it is not wise to directly impose the contrastive loss on the biased RPN proposals. According to rows 12-16, with $\mathcal{L}_{Re\mbox{-}Roi}^{con}$ applied to the proposals sampled from the calibrated distribution, the detection performance improves consistently under different settings of the temperature and the output dimension of the contrastive MLP head. With the superior setting of temperature 0.2 and output dimension 128, we confirm again from rows 13, 17, and 18 that it is better to absorb discriminative representations by applying contrastive learning only to high-quality proposals from ${\cal P}_{S}$. These facts underscore the importance of the proposed PDC approach for the current two-step FSOD training paradigm.

Losses for Repurposing the RoI Head. As defined in Eq. 7, Eq. 8, and Eq. 9, our PDC attaches three loss terms to repurpose the RoI head for novel class detection, i.e., $\mathcal{L}_{Re\mbox{-}Roi}^{con}$, $\mathcal{L}_{Re\mbox{-}Roi}^{cls}$, and $\mathcal{L}_{Re\mbox{-}Roi}^{reg}$. From rows 19-22 in Table I, the combination of the three losses based on ${\cal P}_{S}$ reports the best performance (+3.0 “avg. Δ” gain), indicating that high-quality proposals with these losses facilitate both the classification and localization abilities of the detector.

TABLE III: Performance comparisons on MS COCO. The best is in bold.
Method Shots AP AP50 AP75
FSRW[ICCV2019] [1] 10 5.6 12.3 4.6
Meta R-CNN[ICCV2019] [3] 8.7 19.1 6.6
TFA w/cos[ICML2020] [5] 10.0 - 9.3
Viewpoint[ECCV2020] [4] 12.5 27.3 9.8
MPSR[ECCV2020] [6] 9.8 17.9 9.7
FSCE[CVPR2021] [8] 11.9 - 10.5
FADI[NIPS2021] [10] 12.2 22.7 11.9
AirDet[ECCV2022] [46] 13.0 23.9 12.4
CME[CVPR2021] [7] 15.1 24.6 16.4
DeFRCN[ICCV2021] [9] 16.8 - -
FCT[CVPR2022] [11] 17.1 30.2 17.0
DC[NIPS2022] [47] 18.0 - -
imTED-S[Arxiv2022] [12] 15.0 25.7 15.2
imTED-S+PDC (Ours) 15.7 26.8 15.8
imTED-B[Arxiv2022] [12] 22.5 36.6 23.7
imTED-B+PDC (Ours) 23.4 38.1 24.5
FSRW[ICCV2019] [1] 30 9.1 19.0 7.6
Meta R-CNN[ICCV2019] [3] 12.4 25.3 10.8
TFA w/cos[ICML2020] [5] 13.7 - 13.4
Viewpoint[ECCV2020] [4] 14.7 30.6 12.2
MPSR[ECCV2020] [6] 14.1 25.4 14.2
FSCE[CVPR2021] [8] 16.4 - 16.2
FADI[NIPS2021] [10] 16.1 29.1 15.8
CME[CVPR2021] [7] 16.9 28.0 17.8
DeFRCN[ICCV2021] [9] 21.2 - -
FCT[CVPR2022] [11] 21.4 35.5 22.1
DC[NIPS2022] [47] 22.2 - -
imTED-S[Arxiv2022] [12] 21.0 34.5 21.8
imTED-S+PDC (Ours) 22.1 35.8 23.4
imTED-B[Arxiv2022] [12] 30.2 47.4 32.5
imTED-B+PDC (Ours) 30.8 47.3 33.5

IV-C Model Analysis

In this subsection, we quantitatively and qualitatively reveal the specific advantages of our PDC based on imTED [12] under the 5-shot setting, covering the RoI head’s localization and classification capabilities, the embedding space, and detection results.

Concretely, we visualize the IoU histograms of bounding boxes predicted by the RoI head of the baseline method, i.e., imTED [12], with or without our PDC after fine-tuning in Fig. 3. It is clear that our PDC remarkably improves the localization ability of the RoI head, which predicts more accurate bounding boxes (high IoU with ground truth) whose distribution approaches the base class statistics produced with sufficient supervision. In Fig. 4, we explore the detailed classification precision improvement of the RoI head for localized instances with bounding boxes of different IoUs. PDC achieves better results than the baseline, not limited to proposals of high IoU, which validates that fine-tuning with the calibrated proposal distribution allows the improved classification ability to generalize to low-IoU cases (cf. Sec. III-D), confirming the classification improvement of PDC.

In Fig. 5, we visualize the RoI feature embeddings of the baseline method and PDC after 5-shot fine-tuning. It is obvious that our PDC produces a superior feature space with better discrimination, especially for novel classes. Fig. 6 showcases detection results of the baseline and PDC on the Pascal VOC and MS COCO datasets. Since the calibrated proposal distribution encourages recalling the novel classes suppressed in base training, our PDC significantly reduces missed detections, i.e., the green boxes in the figure.

IV-D Performance Comparison

IV-D1 Pascal VOC

In Table II, our PDC achieves 1.0%, 2.8%, and 1.0% performance improvements on split-1, split-2, and split-3 of Pascal VOC with the MPSR [6] baseline detector, respectively. Compared with imTED [12] with the ViT-S backbone, PDC reports 3.0%, 2.3%, and 2.1% performance gains on the three splits. With the ViT-B backbone, PDC outperforms imTED by 2.3%, 2.6%, and 0.8% on average, respectively. The consistent advantages of PDC on different detectors justify its outstanding efficacy and generalization.

IV-D2 MS COCO

Compared with Pascal VOC, the MS COCO dataset is more challenging, with more categories and images. In Table III, with ViT-S, imTED-S+PDC shows performance improvements of 0.7% AP, 1.1% AP50, and 0.6% AP75 in the 10-shot setting and 1.1% AP, 1.3% AP50, and 1.6% AP75 in the 30-shot setting. Similar gains can also be found with the ViT-B backbone (e.g., 0.9% AP, 1.5% AP50, and 0.8% AP75 in the 10-shot setting), justifying that our approach generalizes well with stronger baseline models.

V Conclusion

In this paper, we alleviate the intrinsic learning objective contradiction across the two-step learning phases via a simple yet effective proposal distribution calibration (PDC) approach. PDC samples proposals using base class statistics during fine-tuning to offer an additional calibrated, high-quality corpus for mild novel class adaptation. With additional classification and localization losses, PDC achieves strong generalization toward novel classes in a plug-and-play fashion. Experimental results on Pascal VOC [41, 42] and MS COCO [43] with two baseline detectors, i.e., MPSR [6] and imTED [12], justify the efficacy and generalizability of our PDC. Although developed for general two-stage FSOD, we hope our PDC can also inspire creative insights for one-stage FSOD.

References

  • [1] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell, “Few-shot object detection via feature reweighting,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 8420–8429.
  • [2] Y.-X. Wang, D. Ramanan, and M. Hebert, “Meta-learning to detect rare objects,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 9925–9934.
  • [3] X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, and L. Lin, “Meta r-cnn: Towards general solver for instance-level low-shot learning,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 9577–9586.
  • [4] Y. Xiao and R. Marlet, “Few-shot object detection and viewpoint estimation for objects in the wild,” in Proc. ECCV, 2020, pp. 192–210.
  • [5] X. Wang, T. E. Huang, T. Darrell, J. E. Gonzalez, and F. Yu, “Frustratingly simple few-shot object detection,” in Proc. ICML, 2020, pp. 9919–9928.
  • [6] J. Wu, S. Liu, D. Huang, and Y. Wang, “Multi-scale positive sample refinement for few-shot object detection,” in Proc. ECCV, 2020, pp. 456–472.
  • [7] B. Li, B. Yang, C. Liu, F. Liu, R. Ji, and Q. Ye, “Beyond max-margin: Class margin equilibrium for few-shot object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 7363–7372.
  • [8] B. Sun, B. Li, S. Cai, Y. Yuan, and C. Zhang, “Fsce: Few-shot object detection via contrastive proposal encoding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 7352–7362.
  • [9] L. Qiao, Y. Zhao, Z. Li, X. Qiu, J. Wu, and C. Zhang, “Defrcn: Decoupled faster r-cnn for few-shot object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 8681–8690.
  • [10] Y. Cao, J. Wang, Y. Jin, T. Wu, K. Chen, Z. Liu, and D. Lin, “Few-shot object detection via association and discrimination,” Proc. NIPS, vol. 34, pp. 16 570–16 581, 2021.
  • [11] G. Han, J. Ma, S. Huang, L. Chen, and S.-F. Chang, “Few-shot object detection with fully cross-transformer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 5321–5330.
  • [12] X. Zhang, F. Liu, Z. Peng, Z. Guo, F. Wan, X. Ji, and Q. Ye, “Integral migrating pre-trained transformer encoder-decoders for visual object detection,” arXiv:2205.09613, 2022.
  • [13] Z. Fan, Y. Ma, Z. Li, and J. Sun, “Generalized few-shot object detection without forgetting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 4527–4536.
  • [14] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 13 619–13 627.
  • [15] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 16 000–16 009.
  • [16] S.-J. Park, S. Han, J.-W. Baek, I. Kim, J. Song, H. B. Lee, J.-J. Han, and S. J. Hwang, “Meta variance transfer: Learning to augment from the others,” in Proc. ICML, 2020, pp. 7510–7520.
  • [17] S. Yang, L. Liu, and M. Xu, “Free lunch for few-shot learning: Distribution calibration,” in Proc. ICLR, 2020.
  • [18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788.
  • [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Proc. ECCV, 2016, pp. 21–37.
  • [20] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Proc. NIPS, vol. 28, 2015.
  • [21] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. ECCV, 2020, pp. 213–229.
  • [22] W. Zhang and Y.-X. Wang, “Hallucination improves few-shot object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 13 008–13 017.
  • [23] Z. Wang and J. Ye, “Querying discriminative and representative samples for batch mode active learning,” Trans. TKDD, vol. 9, no. 3, pp. 1–23, 2015.
  • [24] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf et al., “Kernel mean embedding of distributions: A review and beyond,” Foundations and Trends® in Machine Learning, vol. 10, no. 1-2, pp. 1–141, 2017.
  • [25] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” arXiv:2203.16527, 2022.
  • [26] P. Kaul, W. Xie, and A. Zisserman, “Label, verify, correct: A simple few shot object detection method,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 14 237–14 247.
  • [27] C. Zhu, F. Chen, U. Ahmed, Z. Shen, and M. Savvides, “Semantic relation reasoning for shot-stable few-shot object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 8782–8791.
  • [28] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” Proc. NIPS, pp. 3630–3638, 2016.
  • [29] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 1199–1208.
  • [30] C. Zhang, Y. Cai, G. Lin, and C. Shen, “Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 12 200–12 210.
  • [31] B. Yang, C. Liu, B. Li, J. Jiao, and Q. Ye, “Prototype mixture models for few-shot semantic segmentation,” in Proc. ECCV, vol. 12353, 2020, pp. 763–778.
  • [32] B. Liu, Y. Ding, J. Jiao, J. Xiangyang, and Q. Ye, “Anti-aliasing semantic reconstruction for few-shot semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2021.
  • [33] B. Liu, J. Jiao, and Q. Ye, “Harmonic feature activation for few-shot semantic segmentation,” IEEE Trans. Image Process., vol. 30, pp. 3142–3153, 2021.
  • [34] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
  • [35] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in Proc. ICLR, 2017.
  • [36] Y. Wang and M. Hebert, “Learning to learn: Model regression networks for easy small sample learning,” in Proc. ECCV, 2016, pp. 616–634.
  • [37] Y. Wang, R. B. Girshick, M. Hebert, and B. Hariharan, “Low-shot learning from imaginary data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 7278–7286.
  • [38] B. Hariharan and R. B. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 3037–3046.
  • [39] M. A. Jamal and G.-J. Qi, “Task agnostic meta-learning for few-shot learning,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 111 719–111 727.
  • [40] M. Köhler, M. Eisenbach, and H.-M. Gross, “Few-shot object detection: A survey,” arXiv:2112.11699, 2021.
  • [41] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [42] M. Everingham, S. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
  • [43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.
  • [44] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Proc. NIPS, vol. 33, pp. 18 661–18 673, 2020.
  • [45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” Proc. ICLR, 2020.
  • [46] B. Li, C. Wang, P. Reddy, S. Kim, and S. Scherer, “Airdet: Few-shot detection without fine-tuning for autonomous exploration,” in Proc. ECCV, 2022, pp. 427–444.
  • [47] B.-B. Gao, X. Chen, Z. Huang, C. Nie, J. Liu, J. Lai, G. Jiang, X. Wang, and C. Wang, “Decoupling classifier for boosting few-shot object detection and instance segmentation,” in Proc. NIPS, 2022.