¹ Waseda University; ² Tencent AI Lab; ³ Nanjing University of Science and Technology
Email: jokerzz@fuji.waseda.jp; zequn.nus@gmail.com; koushin@toki.waseda.jp; czhengli@njust.edu.cn; yoshie@waseda.jp

Delving into the Imbalance of Positive Proposals in Two-stage Object Detection

Zheng Ge¹,², Zequn Jie², Xin Huang¹,², Chengzheng Li³, Osamu Yoshie¹

Abstract

Imbalance is a major yet unsolved bottleneck for current object detection models. In this work, we observe two crucial yet never discussed imbalance issues. The first lies in the large number of low-quality RPN proposals, which makes the R-CNN module (i.e., post-classification layers) highly biased towards negative proposals in the early training stage. The second stems from the unbalanced numbers of ground-truth instances across different testing images, which results in an imbalance in the number of potentially existing positive proposals in the testing phase. To tackle these two issues, we incorporate two innovations into Faster R-CNN: 1) an R-CNN Gradient Annealing (RGA) strategy to enhance the impact of positive proposals in the early training stage; 2) a set of Parallel R-CNN Modules (PRM) trained with different positive/negative sampling ratios on one shared backbone. Together, RGA and PRM bring a 2.0% improvement in AP on COCO[14] $minival$. Experiments on CrowdHuman[23] further validate the effectiveness of our innovations across various kinds of object detection tasks.

Keywords:

Object detection, proposal imbalance, gradient annealing, sampling ratios.

Figure 1: (a) The actual number of sampled positive proposals during a training process of Faster R-CNN. (b) The distributions of positive proposal numbers generated by a well-trained RPN model w.r.t. different numbers of ground-truth instances in testing images.

1 Introduction

In recent years, the great success of deep learning has pushed forward state-of-the-art object detection approaches, e.g., Faster R-CNN [22], SSD [16] and Cascade R-CNN [1]. However, most existing works focus on novel detection pipelines (i.e., one-stage and two-stage detectors) and network architecture design (e.g., Feature Pyramid Network [12] and Path Aggregation Network [15]), while less attention is paid to the training paradigm.

Imbalance is a severe issue when training an object detector. A few existing works notice and try to address the imbalance issue including OHEM [24], RetinaNet [13] and Libra R-CNN [19]. These works mainly deal with the training sample imbalance, e.g., the imbalance between positive/negative samples and difficult/easy samples. However, the imbalance issue lies in far more than the aforementioned situations, and many ignored imbalance issues are preventing the power of well-designed model architectures from being fully exploited.

“Jointly training” for two-stage detectors has become mainstream in many popular open-source frameworks (e.g., [3, 27]) due to its convenience and desirable performance. In this paper, we investigate two crucial yet never discussed imbalance issues based on this training scheme. The first imbalance comes from the large number of low-quality RPN proposals in the early training stage. Such low-quality RPN proposals can hardly provide a sufficient number of positive samples for training the R-CNN module (i.e., layers from RoI alignment [9] to the final classification and bounding box regression), leading to an extremely unbalanced number of positive proposals along the whole training process. Fig. 1(a) shows the actual number of positive proposals used for the R-CNN module vs. training iterations. In the early training stage, the dominant negative proposals push the model to be highly biased towards the background class. Although the issue is gradually relieved as the quality of the RPN proposals increases during training, the significant initial learning bias towards the background class hurts both training efficiency and model performance, as clearly shown in Sec. 4.1.

Another serious but also ignored imbalance issue stems from the inconsistency in the potentially existing number of positive proposals across different testing images, which may require different optimal training strategies (i.e., different positive/negative sampling ratios for training the R-CNN module). Fig. 1(b) shows the distributions of positive proposal numbers generated by a well-trained RPN model w.r.t. different numbers of ground-truth instances in testing images. One can find that for a well-trained RPN model, the more ground-truth instances a testing image contains, the more positive proposals it will probably provide to the following R-CNN module. Note that here the positive proposals include the positive ones of all classes except the background. It is well known that, for an image classification task, a better per-image accuracy on the testing set can be achieved when the sample ratios across the classes of the training set are more consistent with those of the testing set. A similar phenomenon is observed in object detection: when various positive/negative proposal sampling ratios are used during training, the testing performance differs accordingly. Fig. 2 Left and Right show the detection performance on two subsets of COCO $minival$ containing images with the ground-truth instance number ranging in [1,3] and [8,+$\infty$) respectively, using different training sampling ratios. One can find that a higher positive sampling ratio is desired for testing images with larger numbers of ground-truth instances, and vice versa. Such a phenomenon is natural. With a high sampling ratio of positive proposals, the model tends to predict higher scores for all proposals, which is beneficial for images with larger numbers of ground-truth instances. Conversely, with a low positive sampling ratio, the model gives positive predictions more prudently. Therefore, the inconsistency between a single training sampling ratio and the diverse ground-truth instance numbers in testing images makes the current training sampling strategy a sub-optimal solution.

To overcome the first observed imbalance issue, which occurs in the early training stage, a novel R-CNN Gradient Annealing (RGA) strategy is proposed. The gradients of positive proposals are magnified at the beginning so that they are not overwhelmed by the gradients of the huge number of negative proposals. As the training progresses, the magnification factor is gradually decreased to guarantee that the gradients from positive and negative proposals can always rival each other.

To address the second imbalance issue, we propose to build two parallel R-CNN modules (PRM) on top of a shared backbone and RPN. To strengthen the adaptation of the detector to a wide range of ground-truth instance numbers contained in testing images, more diverse sampling ratios are expected simultaneously during training. To this end, the two R-CNN modules are trained with two sets of proposals sampled using different positive/negative ratios from the same set of proposals generated by the shared RPN. In the testing phase, a proposal-level “average ensemble” of the results of the two R-CNN modules can thus be easily performed, effectively incorporating the knowledge learned with diverse class biases. As an extra bonus, even without the “average ensemble”, PRM enhances the detection performance of each individual R-CNN module compared to training each module alone, as seen in Table 1. This phenomenon reveals that, apart from result ensemble, gradient ensemble in the detector backbone from the two R-CNN modules also boosts the model’s adaptation to testing images with significantly diverse ground-truth instance numbers.

Table 1: Results of PRM with sampling ratios of 1:1 and 1:9. The ensemble result outperforms its baseline by 1.0%, and the performance of each single R-CNN module also improves.

1:1	1:9	R-CNN1 AP (%)	R-CNN2 AP (%)	Ensemble AP (%)
✓		36.3	-	-
	✓	-	34.9	-
✓	✓	37.0	36.9	37.3

Experiments on COCO and CrowdHuman benchmarks strongly validate the efficacy of RGA and PRM over several two-stage detector baselines, e.g., Faster R-CNN and Cascade R-CNN.

To summarize, our work has the following contributions:

1. We observe two crucial but never discussed imbalance issues in object detection and analyze their causes, indicating further room for improvement in current two-stage object detection approaches.

2. We propose an R-CNN Gradient Annealing strategy to remedy the lack of positive proposals in the early stage of the training phase.

3. We propose a novel PRM which integrates a set of parallel R-CNN modules trained with different sampling ratios on one shared backbone. Our PRM can alleviate the inconsistency in the number of positive proposals across different testing samples.

2 Related Work

2.1 Deep Architectures for Object Detection

Recently, deep learning based object detection methods have been popularized by two-stage and one-stage detectors. Two-stage detectors generate a set of region proposals, and then refine them by region-wise classification and regression. To reduce the redundant computation of feature extraction in R-CNN[7], [10] and [6] propose Spatial-Pyramid-Pooling and RoI-Pooling layers respectively, leading to remarkable improvements in speed and accuracy. After that, the Region Proposal Network (RPN) is proposed in Faster R-CNN[22] to improve the efficiency of detectors. RPN also allows two-stage detectors to be trained end-to-end. FPN[12] alleviates the scale mismatch between RPN’s receptive fields and actual object sizes via feature pyramids. Cascade R-CNN[1] applies a cascade architecture to regress BBoxes with a set of sequentially increasing IoU thresholds for progressive refinement. Mask R-CNN[9] extends Faster R-CNN by constructing a mask branch that refines the detection results with the help of multi-task learning. On the other hand, one-stage detectors are popularized by YOLO[20] and SSD[16] due to their computational efficiency. RetinaNet[13] with focal loss is proposed to address the extreme foreground-background class imbalance in dense object detection and gains higher accuracy than previous works. Other object detection methods focus on cascade procedures[17, 5, 4, 28], imbalance solutions[24, 26, 18, 25, 2] and multi-scale adversarial mechanisms[29]. All of them make significant contributions to the current object detection field.

2.2 Imbalance in Object Detection

When training an object detector, imbalance is a common and hard-to-avoid problem, which prevents well-designed models from being fully exploited. Existing solutions can be mainly divided into two categories. The first is hard example mining, which relies on the hypothesis that hard examples are particularly significant for improving detection performance. OHEM[24] proposes a systematic approach considering the loss values of positive and negative samples and drives the focus towards hard examples according to their confidences. IoU-balanced sampling[19] associates the hardness of examples with their IoUs and applies a sampling method to negative examples only, rather than computing the loss function for the entire set. The second is soft sampling, which scales the contribution of each example according to its importance to the training process. Focal loss[13] is the pioneering approach that dynamically assigns more weight to hard examples. GHM[11] defines the gradient density to handle the disharmony of the gradient norm distribution and avoid paying too much attention to outliers, which is shown to be useful for both classification and regression tasks. By alleviating imbalance in the training process, object detectors can be trained better and thus obtain better results.

3 Methodology

In this section, we first illustrate the shortage of positive proposals at the beginning of the training phase and describe the R-CNN Gradient Annealing (RGA) strategy in detail. Next, we illustrate why the number of positive proposals is unbalanced across different testing samples. Finally, we introduce the structure of parallel R-CNN modules (PRM) and experimentally explain how PRM alleviates the problem mentioned above.

3.1 R-CNN Gradient Annealing

Shortage of Positive Proposals in the Beginning. In the early training stage, since RPN cannot yet discriminate accurately between foreground and background, a large number of low-quality RPN proposals are fed into the following R-CNN module as training samples. Most of these proposals can hardly satisfy the “positive proposal definition” (e.g., having an IoU that exceeds 0.5 with any ground-truth instance), resulting in a severe shortage of positive proposals in the early training stage. Although a fixed positive/negative sampling ratio, e.g., 1:3, is often specified when training current detectors, what is actually adopted in practical implementations is a “soft” sampling ratio of 1:3. For example, assuming a batch size of 512 for R-CNN, a sampling ratio of 1:3 requires sampling 128 positive and 384 negative proposals from all the RPN proposals. In the early training stage, the number of all positive proposals $N_{real}^{+}$ provided by RPN is probably lower than 128. In this case, current implementations simply use all the $N_{real}^{+}$ positive proposals and $512-N_{real}^{+}$ negative ones to form the batch. The actual positive/negative sampling ratio is therefore smaller than 1:3, which is what we refer to as the “soft” sampling ratio 1:3 here. The insufficient number of positive proposals in the early training stage hurts the model performance, especially on positive samples.
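The “soft” sampling behavior described above can be illustrated with a minimal sketch. The helper `soft_sample_counts` and its parameter names are hypothetical and not taken from any framework; the sketch only computes how many positives and negatives end up in a batch, not which proposals are drawn.

```python
def soft_sample_counts(num_pos_available, num_neg_available,
                       batch_size=512, pos_fraction=0.25):
    """Return (num_pos, num_neg) for one R-CNN training batch.

    pos_fraction=0.25 corresponds to a nominal 1:3 positive/negative ratio.
    Hypothetical helper for illustration only.
    """
    target_pos = int(batch_size * pos_fraction)        # 128 for a batch of 512
    # If RPN provides fewer positives than the target, use all of them.
    num_pos = min(target_pos, num_pos_available)
    # Fill the remainder of the batch with negatives.
    num_neg = min(batch_size - num_pos, num_neg_available)
    return num_pos, num_neg


# Early training (assumed numbers): only 20 positives are available,
# so the actual ratio is 20:492 rather than 128:384.
early = soft_sample_counts(num_pos_available=20, num_neg_available=2000)
# Later in training (assumed numbers): enough positives, nominal 1:3 is met.
late = soft_sample_counts(num_pos_available=300, num_neg_available=2000)
print(early, late)
```

With these assumed inputs, the early batch contains far fewer positives than the nominal ratio suggests, which is the shortage that RGA targets.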

For a better understanding, we analyze the model performance on both positive and negative samples as the training progresses. From Fig. 3 Left and Fig. 3 Middle, one can observe that at the beginning of training, due to the overwhelming number of negative proposals, nearly all proposals are predicted as negative. As the training progresses, the accuracy on positive proposals gradually increases while the accuracy on negative ones decreases until becoming stable. The entire training phase can thus be viewed as a process of learning what is positive. From this perspective, the shortage of positive training proposals in the beginning is an obstacle to training a high-quality R-CNN module.

A straightforward countermeasure to the above issue is to copy the positive proposals multiple times to achieve a “hard sampling ratio”. However, the “hard sampling ratio” always causes a performance drop in our experiments (see the supplementary materials). The reason is that when the “soft sampling ratio” is applied, the R-CNN module is capable of modeling the intrinsic proportion of positive proposals vs. all proposals, which is beneficial to the testing performance, whereas the “hard sampling ratio” imposes a strong classification bias on R-CNN, which proves to be harmful. Thus, a potential solution to this issue needs to raise the importance of positive proposals in the early training stage while leaving the intrinsic proportion of positive proposals unchanged.

Magnifying and Annealing R-CNN Gradients. As mentioned above, the entire training phase can be viewed as a process of learning what is positive. Fig. 3 Middle shows that the classification accuracy of negative proposals first reaches 98% and then starts to drop gradually, while the classification accuracy of positive proposals increases steadily. Such a phenomenon indicates that the gradients of positive proposals take over the training process at a very early stage (i.e., the average gradient of positive proposals is larger than that of negative proposals). We thus propose to simultaneously magnify the gradients from both positive and negative proposals by a factor $\lambda$, because such an operation has an effect similar to magnifying only the weight of positive proposals while keeping the proportion of positive proposals vs. all proposals unchanged. The magnification factor $\lambda$ is gradually decreased as the training progresses, to guarantee that the gradients from positive and negative proposals can always rival each other. We name this solution “R-CNN Gradient Annealing” (RGA). A formal description of RGA is as follows.

$$\begin{split}\theta_{t+1}&=\theta_{t}-\alpha\left(\lambda\,\frac{\partial}{\partial\theta_{t}}J\left(\theta_{t}\right)\right)\\ \text{s.t.}\quad\lambda&=\lambda_{0}-\frac{(\lambda_{0}-1)\,t}{T}\end{split}\tag{1}$$

where $\theta_{t}$ represents the parameters of the R-CNN module at the $t$-th optimization step, $\alpha$ is the current learning rate, $T$ is the total number of optimization steps, $\lambda_{0}$ is the initial magnification factor, and $J$ is the loss function of the R-CNN module.
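As a minimal sketch of Eq. (1), the snippet below computes the linearly annealed factor $\lambda$ and applies one plain SGD update with the scaled gradient. The function names are hypothetical, and momentum and weight decay are omitted; how the scaling is wired into a real training loop is an assumption, not something specified here.

```python
def rga_lambda(step, total_steps, lambda0=7.0):
    """Linearly anneal the magnification factor from lambda0 down to 1 (Eq. 1)."""
    return lambda0 - (lambda0 - 1.0) * step / total_steps


def rga_sgd_step(params, grads, lr, step, total_steps, lambda0=7.0):
    """One plain SGD update of R-CNN parameters with gradients scaled by lambda.

    params and grads are flat lists of floats standing in for the R-CNN
    module's parameters and their gradients (illustrative only).
    """
    lam = rga_lambda(step, total_steps, lambda0)
    return [p - lr * lam * g for p, g in zip(params, grads)]


total = 1000
# Factor at the start, the midpoint, and the end of training.
schedule = [rga_lambda(t, total) for t in (0, 500, 1000)]
print(schedule)  # [7.0, 4.0, 1.0]

# At step 0 the update is 7 times larger than a standard SGD step.
updated = rga_sgd_step([1.0, -2.0], [0.5, 0.5], lr=0.1, step=0, total_steps=total)
```

At $t=T$ the factor reaches 1, so the update reduces to an ordinary SGD step on the R-CNN module.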

3.2 Parallel R-CNN Modules

Positive Proposal Imbalance in Testing Phase. As mentioned in Sec. 1, a better per-image accuracy on the testing set can be achieved when the sample ratios across the classes of the training set are more consistent with those of the testing set. Object detection can be viewed as proposal-level classification. Therefore, for better testing performance, the positive/negative proposal ratio used when training the R-CNN module should also be more consistent with that in the testing phase. Although the positive/negative proposal ratios in testing images are not directly determined by the ground-truth instance numbers, the two are still highly correlated, as shown in Fig. 1(b). Therefore, we conclude that testing images with diverse ground-truth instance numbers require different positive/negative sampling ratios during the training of the R-CNN module for optimal testing results. Based on this finding, we claim that using a single positive/negative sampling ratio for R-CNN module training, as in existing works, is a sub-optimal solution when facing the great diversity of ground-truth instance numbers in the testing set (e.g., MS COCO).

Model Structure. To address the above issue, better consistency between the training sampling ratio and ground-truth instance number of each testing image is desired. To achieve this goal, an ideal solution could be: a Faster R-CNN with a set of parallel R-CNN modules trained with different positive/negative proposal sampling ratios; in the testing phase, the model dynamically dispatches each testing image to the best matched R-CNN module for prediction, based on its ground-truth instance number. However, two obstacles impede the effective implementation of this solution. Firstly, the ground-truth instance number is unknown during testing, and the task of the accurate prediction of the ground-truth instance number remains difficult. Secondly, even if the accurate prediction of the ground-truth instance number is possible, the exact matching from each ground-truth instance number to the optimal training sampling ratio is also hard to decide.

To avoid the difficult per-image dispatch to the single optimal training sampling ratio, we propose to utilize the “average ensemble” strategy. Specifically, the model is a Faster R-CNN with a set of parallel R-CNN modules trained with different positive/negative proposal sampling ratios. In the testing phase, each R-CNN needs to process all the testing images. The final classification scores of the multiple R-CNN modules before softmax normalization are averaged to produce the final classification score. The bounding box regression output from the R-CNN module trained with the highest positive/negative sampling ratio is directly adopted. Notice that only positive proposals are used to train the BBox regression heads, and thus the training samples of other BBox regression heads are almost subsets of the one with the largest number of positive proposals. This is the reason why we fully trust the results from the BBox regression head trained with the most positive proposals. Fig. 4 gives an illustration of Faster R-CNN with Parallel R-CNN Modules (PRM).
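The test-time combination described above can be sketched for a single proposal as follows. This is an illustrative sketch under stated assumptions: `prm_ensemble` and its argument names are hypothetical, each head is represented only by its outputs, and the toy numbers are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prm_ensemble(head_logits, head_boxes, head_pos_fractions):
    """Combine the outputs of parallel R-CNN heads for one proposal.

    head_logits: per-head class scores before softmax, one list per head.
    head_boxes: per-head regressed box as (x1, y1, x2, y2).
    head_pos_fractions: positive sampling fraction each head was trained with.
    """
    num_heads = len(head_logits)
    num_classes = len(head_logits[0])
    # Average the pre-softmax scores across heads, class by class.
    mean_logits = [sum(h[c] for h in head_logits) / num_heads
                   for c in range(num_classes)]
    # Adopt the box from the head trained with the highest positive ratio.
    best = max(range(num_heads), key=lambda i: head_pos_fractions[i])
    return softmax(mean_logits), head_boxes[best]


scores, box = prm_ensemble(
    head_logits=[[2.0, 0.0], [0.0, 2.0]],             # heads trained at 1:1 and 1:9
    head_boxes=[(10, 10, 50, 50), (12, 11, 49, 52)],
    head_pos_fractions=[0.5, 0.1],                    # 1:1 -> 0.5, 1:9 -> 0.1
)
```

In this toy case the two heads disagree symmetrically, so the averaged scores are uniform, and the box comes from the 1:1 head because it was trained with the most positive proposals.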

3.3 Mechanism of PRM

In this section, we analyze the mechanism of PRM about how it benefits the detection on images with diverse ground-truth instance numbers. Specifically, two mechanisms are discovered, i.e., Result Ensemble in the testing phase and Gradient Ensemble in the training phase. We thus decouple the two mechanisms for further analysis.

Result Ensemble. In the testing phase, multiple R-CNN modules following a shared backbone and RPN allow “average ensemble” to be conveniently performed on each testing proposal. “Average ensemble” effectively combines the decisions of R-CNN modules biased towards different class distributions. Therefore, when encountering testing images with diverse ground-truth instance numbers, the combined results are naturally better than those of an R-CNN module trained with a single sampling ratio.

Note that, unlike in image classification and semantic segmentation, “average ensemble” is not a common result ensemble technique in object detection. Due to the lack of a clear correspondence between the detection boxes generated by different models, “joint NMS”, which mixes the detection boxes together and then performs NMS among the mixed boxes, is commonly used for model ensemble in object detection. However, “joint NMS” has quadratic complexity w.r.t. the total number of detection boxes produced by all the models in the ensemble. In comparison, “average ensemble” in PRM introduces almost no extra cost, as only the multiple R-CNN modules are unshared, and these usually contain only a few fully connected layers.

Gradient Ensemble. Apart from result ensemble in the testing phase, gradient ensemble in the training phase also plays a key role in PRM. Even without result ensemble, PRM enhances the detection performance of each individual R-CNN module compared to training each module alone. This performance gain of each individual R-CNN module can only be attributed to the gradient ensemble in the shared backbone. This is similar to the phenomenon in multi-task learning where tasks benefit each other. Gradients from different tasks are fused together in the shared backbone, strengthening the adaptation of the backbone to different tasks (i.e., testing images with diverse ground-truth instance numbers in our case). We plot the gradients of the weights of Block4.Conv1 in the shared backbone propagated from the different R-CNN modules in Fig. 5 Left. It can be seen that the magnitude of the vector sum of the gradients from the two R-CNN modules is always smaller than the sum of the two gradients’ magnitudes, which means that the two gradients never point in exactly the same direction, i.e., the knowledge provided by the two R-CNN modules is not exactly the same. As shown in Fig. 5 Right, a similar phenomenon is observed in Mask R-CNN[9], where the BBox head and the Mask head can be viewed as multi-task heads.
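The comparison between the magnitude of the summed gradient and the sum of the magnitudes follows from the triangle inequality, $\lVert g_1+g_2\rVert \le \lVert g_1\rVert+\lVert g_2\rVert$, with equality only when the two gradients are parallel. A minimal sketch with made-up two-dimensional vectors (not actual backbone gradients; `gradient_overlap` is a hypothetical helper) shows the quantities involved:

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def gradient_overlap(g1, g2):
    """Return (||g1 + g2||, ||g1|| + ||g2||, cosine similarity of g1 and g2)."""
    summed = [a + b for a, b in zip(g1, g2)]
    dot = sum(a * b for a, b in zip(g1, g2))
    return norm(summed), norm(g1) + norm(g2), dot / (norm(g1) * norm(g2))


# Toy gradients from two heads: similar direction, but not identical.
vec_sum, mag_sum, cos = gradient_overlap([3.0, 4.0], [4.0, 3.0])
print(vec_sum < mag_sum, cos)

# Perfectly aligned gradients: the two quantities coincide.
aligned = gradient_overlap([1.0, 0.0], [2.0, 0.0])
```

A strict gap between the two quantities therefore indicates that the heads push the shared weights in at least partly different directions.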

4 Experiments

4.1 Experiments on COCO

Dataset. COCO[14] is a widely used benchmark in the object detection field. In this work, we train all our models on COCO trainval35k, which consists of 118k images, and evaluate our results on COCO $minival$, which consists of 5k testing images. Our evaluation metric follows the standard COCO-style mean Average Precision (AP) at different BBox IoU thresholds.

Table 2: Experimental results of our proposed RGA strategy and PRM across different backbones, learning rate schedules and other variants of two-stage detectors. Results are reported on COCO $minival$. The learning rate schedules of “1x”, “2x” follow the definitions in Detectron [8].

Method	Backbone	Schedule	AP	AP₅₀	AP₇₅
Faster R-CNN[22]	ResNet-50	1x	36.1	58.1	38.8
Faster R-CNN	ResNet-50	2x	37.3	59.0	40.5
Faster R-CNN	ResNeXt-101_32x4d	1x	40.1	62.0	43.8
Cascade R-CNN[1]	ResNet-50	1x	40.5	58.6	44.2
Faster R-CNN w/ RGA	ResNet-50	1x	37.3	59.4	40.8
Faster R-CNN w/ RGA	ResNet-50	2x	38.4	59.9	41.8
Faster R-CNN w/ RGA	ResNeXt-101_32x4d	1x	41.0	63.2	45.1
Cascade R-CNN w/ RGA	ResNet-50	1x	41.1	59.8	44.8
Faster R-CNN w/ RGA+PRM	ResNet-50	1x	38.1	60.0	41.6
Faster R-CNN w/ RGA+PRM	ResNet-50	2x	39.0	60.5	42.3
Faster R-CNN w/ RGA+PRM	ResNeXt-101_32x4d	1x	41.4	63.3	45.4
Cascade R-CNN w/ RGA+PRM	ResNet-50	1x	41.7	60.0	45.5

Implementation Details. To provide a strong baseline, we incorporate FPN[12] and RoI-Align[9] into the naive Faster R-CNN[22]. Unless otherwise noted, the term “Faster R-CNN” in this paper refers to this modified version, optimized in the “jointly training” manner. We train detectors on 8 GPUs (2 images per GPU) with an initial learning rate of 0.02, and multiply it by 0.1 at the 8th and 11th epochs. The magnification factor $\lambda$ is initialized to 7. Following the structure in [12], our R-CNN module contains only 2 shared $fc$ layers and 2 separate $fc$ layers for classification and regression respectively. In our re-implemented Faster R-CNN[22], the sampling ratio of positive/negative proposals is set to 1:3, the same as in the original paper. For PRM, we use two R-CNN modules with sampling ratios of 1:1 and 1:9.

Main Results. Results on COCO minival are presented in Table 2. Incorporating RGA into Faster R-CNN brings a 1.2% improvement in AP. The gain reaches 2.0% after further adopting PRM (36.1% vs. 38.1%). RGA and PRM still improve the baseline by 1.7% (37.3% vs. 39.0%) when we increase the total training time from 1x to 2x to make sure the models are better optimized. Furthermore, we evaluate our method on a better backbone, ResNeXt-101, and on a better two-stage detector, Cascade R-CNN. The results in Table 2 show that our proposed RGA strategy and PRM yield consistent improvements across different learning rate schedules, backbones and variants of two-stage detectors. It is worth mentioning that our method improves AP₇₅ in particular, by 2.8% (38.8% vs. 41.6%), which illustrates the effectiveness of RGA and PRM under a stricter evaluation protocol.

To better compare our method with other state-of-the-art detectors, we evaluate our method on COCO test-dev. As shown in Table 3, a single Faster R-CNN with RGA and PRM with the backbone ResNeXt-101 can reach 42.9% on AP, which is comparable to Libra R-CNN. Our method can also improve the performance of Cascade R-CNN by 1.0%. After incorporating the cascade mechanism [1] into our model, the performance of a single model can reach 45.3% without any bells and whistles (e.g., longer learning rate schedule, deformable convolution and training/testing time augmentation).

Table 3: Comparisons with state-of-the-art methods on COCO test-dev. * means our re-implemented results. All the two-stage detectors follow 1x training schedule. ResNeXt-101 means ResNeXt-101_64x4d by default.

Method	Backbone	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
YOLOv2[21]	DarkNet-19	21.6	44.0	19.2	5.0	22.4	35.5
SSD512[16]	ResNet-101	31.2	50.4	33.3	10.2	34.5	49.8
RetinaNet[13]	ResNet-101-FPN	39.1	59.1	42.3	21.8	42.7	50.2
Faster R-CNN[22]	ResNet-101-FPN	36.2	59.1	39.0	18.2	39.0	48.2
Mask R-CNN[9]	ResNet-101-FPN	38.2	60.3	41.7	20.1	41.1	50.2
Libra R-CNN[19]	ResNeXt-101-FPN	43.0	64.0	47.0	25.3	45.6	54.6
Faster R-CNN*	ResNet-101-FPN	38.7	60.8	42.3	22.3	42.2	48.5
Cascade R-CNN*	ResNet-101-FPN	42.1	61.0	46.0	23.5	45.5	54.7
Faster R-CNN w/ RGA+PRM	ResNet-101-FPN	40.2	62.1	43.9	23.4	43.6	50.7
Faster R-CNN w/ RGA+PRM	ResNeXt-101-FPN	42.9	64.9	46.7	25.6	46.2	53.7
Cascade R-CNN w/ RGA+PRM	ResNet-101-FPN	43.1	61.5	47.0	24.1	46.1	55.4
Cascade R-CNN w/ RGA+PRM	ResNeXt-101-FPN	45.3	64.0	49.5	26.6	48.1	57.8

4.2 Ablation Study

R-CNN Gradient Annealing. We study how annealing the magnification factor $\lambda$ helps improve the final prediction results by comparing against a constant $\lambda$. Comparing the 1st, 2nd and 5th rows in Table 4, we find that although magnifying the gradients in the R-CNN module 7 times without annealing brings a 0.6% improvement in AP over the baseline (36.1%), introducing annealing further improves the final performance by 0.6%, which demonstrates the effectiveness of gradient annealing. Since a new hyper-parameter $\lambda_{0}$ is introduced, we test the influence of different values of $\lambda_{0}$ in Table 4. The results indicate that our method is not sensitive to the value of $\lambda_{0}$ within the range $[5,9]$. It is worth mentioning that RGA brings no extra computation cost at testing time. All of these merits make our RGA strategy more applicable.

Table 4: Varying $\lambda_{0}$ in RGA. The RGA strategy yields consistent improvement over baseline when $\lambda_{0}$ varies from 5 to 9.

$\lambda_{0}$	Anneal	AP	AP₅₀	AP₇₅
1	-	36.1	58.1	38.8
7	-	36.7	58.5	39.8
3	✓	36.9	59.0	40.2
5	✓	37.2	59.0	40.8
7	✓	37.3	59.4	40.6
9	✓	37.2	59.3	40.0

Sampling Ratios. The sampling ratio is crucial to PRM. To understand this issue better, we investigate the case of two R-CNN modules and run experiments with a set of different sampling ratio pairs. Results are presented in Table 5. When the two sampling ratios are 1:1 and 1:9, the detector achieves the highest AP, 37.3%. When the two sampling ratios are 1:1 and 1:1, the detector achieves the lowest AP, 36.8%. From these results, we conclude that the detection performance is correlated with the gap between the two sampling ratios in a pair: the larger the gap between the two sampling ratios, the better the final result. In addition, to show that the performance gain of PRM does not come from more parameters, we train two separate Faster R-CNN models with the sampling ratios of 1:1 and 1:9, whose total number of parameters is far larger than that of a single Faster R-CNN with PRM. The results are presented in the second row of Table 5. Comparing the results in the second and fifth rows, we can see that Faster R-CNN with PRM outperforms the ensemble of two separate Faster R-CNN models. This means that the benefit brought by Gradient Ensemble, which is one main source of the gain of PRM, is larger than the benefit of additional parameters.

Table 5: Varying sampling ratios across R-CNN modules. “R” and “E” are the abbreviations of “R-CNN” and “Ensemble”, respectively. “Faster R-CNNx2” stands for two separate Faster R-CNN models.

Method	Ratio (R1)	Ratio (R2)	AP (R1)	AP (R2)	AP (E)
Faster R-CNN	1:1	-	36.3	-	-
Faster R-CNNx2	1:1	1:9	36.3	34.9	36.5
Faster R-CNN w/ PRM	1:1	1:1	36.7	36.7	36.8
Faster R-CNN w/ PRM	1:1	1:5	36.8	36.9	37.1
Faster R-CNN w/ PRM	1:1	1:9	37.0	36.9	37.3

Table 6: With the third R-CNN module of the sampling ratio 1:3, the performance drops 0.2% both with and without RGA.

3rd R-CNN	w/ RGA	AP	AP₅₀	AP₇₅
-	-	37.3	58.8	40.4
✓	-	37.1	58.8	40.1
-	✓	38.1	60.0	41.6
✓	✓	37.9	59.6	41.2

Number of R-CNN Modules. The number of R-CNN modules is another hyper-parameter in PRM. Following the conclusion drawn in the previous paragraph, we choose PRM with two R-CNN modules, with sampling ratios of 1:1 and 1:9, as our baseline. After adding a third R-CNN module with a sampling ratio of 1:3, as shown in Table 6, we observe a small performance drop both with and without RGA (-0.2%), which indicates that two R-CNN modules are sufficient. This probably happens because all R-CNN modules sit on one shared backbone, so as the number of R-CNN modules increases, jointly optimizing them may over-amplify the gradients from the different losses in the backbone.

4.3 Further Analysis

The Gain of RGA. The performance gain of RGA has been shown in Sec. 4.1. However, where this gain comes from remains unverified. The answer lies in Fig. 3, which describes two full training processes, for a baseline Faster R-CNN and for Faster R-CNN with RGA, respectively. Fig. 3 Middle shows that the training accuracy on negative proposals is nearly identical with and without RGA, which means that RGA does not improve the model’s ability to identify negative proposals. However, Fig. 3 Left shows that after applying RGA, the training accuracy on positive proposals is consistently better than the baseline. These results verify our assumption that, when the gradients from both positive and negative proposals are magnified, it is the impact of the positive proposals that is amplified, so that the R-CNN module can better identify positive proposals. Consequently, the validation performance is improved.

Comparisons between Different R-CNN Modules. Since the different R-CNN modules are trained with the same optimization objectives, it is natural to ask: do these modules actually learn similar parameters? We try to answer this question by examining the discrepancy between their prediction scores. The detector we use is a well-trained Faster R-CNN with two R-CNN modules trained with the sampling ratios of 1:1 and 1:9. Fig. 6 Left shows the distributions of predicted scores on COCO $minival$ from the two R-CNN modules. From Fig. 6, we can see that the R-CNN module trained with the smaller ratio of positive proposals tends to predict lower scores, and vice versa, as already stated in Sec. 1. Fig. 6 Right visualizes the absolute difference of scores between corresponding outputs. It shows that more than 21.9% of the pairs of prediction scores have an absolute difference larger than 0.1. This verifies our claim in Sec. 1 that R-CNN modules with different sampling ratios show different prediction biases, which is why the output ensemble brings improvements. The improvements of PRM on different subsets of COCO $minival$ can be seen in Fig. 2.
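The discrepancy statistic described above can be sketched as follows. This is a hedged illustration rather than the script used to produce Fig. 6: the function name `score_discrepancy` and the toy scores are hypothetical, and the two arrays are assumed to hold scores for the same proposals and classes in the same order.

```python
import numpy as np


def score_discrepancy(scores_a, scores_b, threshold=0.1):
    """Absolute score differences and the fraction of pairs above `threshold`.

    Assumption: scores_a[i] and scores_b[i] are the scores that two R-CNN
    modules assign to the same proposal and class.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both modules must score the same set of outputs")
    diff = np.abs(a - b)
    return diff, float((diff > threshold).mean())


# Toy scores from a module trained at 1:1 and a module trained at 1:9.
scores_1to1 = [0.90, 0.55, 0.30, 0.80, 0.05]
scores_1to9 = [0.85, 0.30, 0.10, 0.78, 0.02]
diff, frac = score_discrepancy(scores_1to1, scores_1to9)
print(frac)
```

In this toy example two of the five pairs differ by more than 0.1, so the reported fraction is 0.4.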

4.4 Experiments on CrowdHuman

To demonstrate the generalization ability of RGA and PRM, we evaluate them on an additional dataset, CrowdHuman[23]. CrowdHuman is a benchmark for detecting humans in crowded scenes. It contains $15,000$, $4,370$, and $5,000$ images for training, validation and testing, respectively. On average, there are around 23 persons per image, making CrowdHuman a challenging benchmark. In CrowdHuman, there are three kinds of annotations: full body, visible body and head. We focus on full body in our experiments. All the configurations follow the original paper[23]. When applying our method, we set $\lambda_{0}$ to 7 and use Faster R-CNN with PRM trained with the sampling ratios of 1:1 and 1:9. The log-average miss rate (mMR, lower is better) and AP₅₀ are reported in Table 7. As can be seen in Table 7, our RGA and PRM together reduce mMR by 2.29% and improve AP₅₀ by 1.35% over our re-implemented baseline, which supports the effectiveness of our method across detection tasks.

Table 7: RGA and PRM bring remarkable improvements on the CrowdHuman dataset. * denotes our re-implementation.

method	mMR	AP₅₀
Faster R-CNN Baseline in [23]	50.42	84.95
Faster R-CNN*	47.42	85.02
Faster R-CNN w/ PRM	46.30	85.43
Faster R-CNN w/ RGA+PRM	45.13	86.37
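
For readers unfamiliar with mMR, the sketch below shows one common way to compute a log-average miss rate. It assumes the Caltech-style definition, in which the miss rate is sampled at nine FPPI (false positives per image) reference points evenly spaced in log space over [10⁻², 10⁰] and averaged geometrically; the exact protocol used for Table 7 is the one in [23], and the function name and toy curves here are hypothetical.

```python
import numpy as np


def log_average_miss_rate(fppi, miss_rate):
    """Geometric mean of the miss rate at 9 log-spaced FPPI reference points.

    Assumptions: `fppi` is sorted in increasing order and `miss_rate[i]`
    is the miss rate of the operating point with `fppi[i]`.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    ref_points = np.logspace(-2.0, 0.0, num=9)
    sampled = []
    for ref in ref_points:
        idx = np.where(fppi <= ref)[0]
        # If the curve never reaches an FPPI this low, treat all objects as missed.
        sampled.append(miss_rate[idx[-1]] if idx.size > 0 else 1.0)
    # Clamp to avoid log(0) for a detector with a zero miss rate.
    sampled = np.maximum(np.asarray(sampled), 1e-10)
    return float(np.exp(np.log(sampled).mean()))


# A flat curve with a miss rate of 0.5 at every operating point.
flat = log_average_miss_rate([0.001, 0.01, 0.1, 1.0], [0.5, 0.5, 0.5, 0.5])
# A curve with a single operating point at FPPI = 0.5.
single = log_average_miss_rate([0.5], [0.2])
print(flat, single)
```

Because the average is taken in log space, the metric emphasizes the low-FPPI regime, where only a few false positives per image are allowed.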

5 Conclusion

In this paper, we propose the R-CNN Gradient Annealing (RGA) strategy, a gradient manipulation operation that alleviates the imbalance in the number of positive proposals in the training phase. We also propose PRM, a new two-stage detector design that deploys several parallel R-CNN modules, trained with different positive/negative proposal sampling ratios, on the same backbone. This design addresses the imbalance of positive proposals across testing images. Together, these two innovations bring a 2.0% AP improvement over a modified Faster R-CNN baseline, which validates the utility of the proposed approach.

6 Appendix

Appendix A Hard Sampling and Soft Sampling

Table 8: Performance comparison between hard and soft sampling strategies.

sampling ratio	hard sampling (AP)	soft sampling (AP)
1:1	33.3	36.3
1:3	35.2	36.1
1:5	35.5	35.7
1:7	35.2	35.4

As stated in Sec. 1 of the main paper, in real experiments the number of positive proposals can hardly reach the desired number given a sampling ratio of 1:1 or 1:3. Copying the positive proposals multiple times to achieve a “hard sampling ratio” is a more natural way to enhance the gradients from positive proposals than RGA. However, as Table 8 shows, “hard sampling” performs consistently worse than “soft sampling”, because a “hard sampling ratio” imposes a strong classification bias on the R-CNN, which can be harmful. Another key difference between the “hard sampling ratio” and RGA is that the former needs to reduce the number of negative proposals, which reduces their diversity, whereas RGA avoids this. That is why RGA yields better results while the “hard sampling ratio” does not, although both enhance the gradients of positive proposals.
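
The difference between the two strategies can be illustrated with the following sketch. It is not the sampler used in our experiments: the function name `sample_proposals`, the batch size of 512 proposals per image, and the use of sampling with replacement to stand in for "copying the positive proposals multiple times" are assumptions made for illustration.

```python
import numpy as np


def sample_proposals(pos_idx, neg_idx, batch_size=512, pos_fraction=0.5,
                     hard=False, seed=0):
    """Sample proposal indices under a soft or a hard positive/negative ratio.

    pos_fraction=0.5 corresponds to a 1:1 ratio, 0.25 to 1:3, and so on.
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.asarray(pos_idx)
    neg_idx = np.asarray(neg_idx)
    target_pos = int(round(batch_size * pos_fraction))
    if len(pos_idx) == 0:
        pos = pos_idx  # nothing to sample from
    elif hard:
        # Hard ratio: duplicate positives when there are fewer than required.
        replace = len(pos_idx) < target_pos
        pos = rng.choice(pos_idx, size=target_pos, replace=replace)
    else:
        # Soft ratio: use at most target_pos positives, never duplicate.
        num_pos = min(len(pos_idx), target_pos)
        pos = rng.choice(pos_idx, size=num_pos, replace=False)
    # Negatives fill whatever remains of the batch.
    num_neg = min(len(neg_idx), batch_size - len(pos))
    neg = rng.choice(neg_idx, size=num_neg, replace=False) if num_neg > 0 else neg_idx[:0]
    return pos, neg


# 20 positive and 2000 negative proposals, desired ratio 1:1.
pos_idx = np.arange(20)
neg_idx = np.arange(20, 2020)
soft_pos, soft_neg = sample_proposals(pos_idx, neg_idx, hard=False)
hard_pos, hard_neg = sample_proposals(pos_idx, neg_idx, hard=True)
print(len(soft_pos), len(soft_neg), len(hard_pos), len(hard_neg))
```

With 20 available positives, the soft strategy keeps 20 positives and 492 negatives, whereas the hard strategy yields 256 positives (only 20 of them distinct at most) and 256 negatives, which shows how the hard ratio shrinks the pool of negative proposals.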

Appendix B Implementation of RGA

RGA is easy to implement. Here is our implementation of the R-CNN Gradient Annealing strategy in MMDetection.

# mmdetection/mmdet/core/utils/dist_utils.py
def after_train_iter(self, runner):
    runner.optimizer.zero_grad()
    runner.outputs['loss'].backward()
    #####################
    # RGA, lambda0=7
    #####################
    # Linearly anneal the gradient multiplier from 7 down to 1 over training.
    weight = (7. - 6. * runner.iter / runner.max_iters)
    for name, param in runner.model.module.named_parameters():
        # Scale only the gradients of the R-CNN head (bbox_head).
        if 'bbox_head' in name.split('.')[0]:
            param.grad *= weight
    #####################
    allreduce_grads(runner.model.parameters(), self.coalesce,
                    self.bucket_size_mb)
    if self.grad_clip is not None:
        self.clip_grads(runner.model.parameters())
    runner.optimizer.step()
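
The hard-coded constants in the listing above correspond to $\lambda_{0}=7$. As a small sketch, the same linear schedule can be written for an arbitrary $\lambda_{0}$; the helper name `rga_weight`, the clamping of the progress to [0, 1], and the iteration count of 90000 are hypothetical and only serve the illustration.

```python
def rga_weight(cur_iter, max_iters, lambda0=7.0):
    """Gradient multiplier that decays linearly from lambda0 to 1.

    For lambda0 = 7 this equals 7 - 6 * cur_iter / max_iters,
    i.e. the expression used in the listing.
    """
    if max_iters <= 0:
        raise ValueError("max_iters must be positive")
    # Clamp so that the multiplier never leaves the range [1, lambda0].
    progress = min(max(cur_iter / max_iters, 0.0), 1.0)
    return lambda0 - (lambda0 - 1.0) * progress


# Multiplier at the start, middle and end of a 90000-iteration schedule.
weights = [rga_weight(i, 90000) for i in (0, 45000, 90000)]
print(weights)
```

The multiplier starts at $\lambda_{0}$ and reaches 1 at the last iteration, so the R-CNN head is trained with unmodified gradients by the end of training.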

References

[1] Cai, Z., Vasconcelos, N.: Cascade R-CNN: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6154–6162 (2018)
[2] Cao, Y., Chen, K., Loy, C.C., Lin, D.: Prime sample attention in object detection. arXiv preprint arXiv:1904.04821 (2019)
[3] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
[4] Gidaris, S., Komodakis, N.: Object detection via a multi-region and semantic segmentation-aware CNN model. In: Proceedings of the IEEE international conference on computer vision. pp. 1134–1142 (2015)
[5] Gidaris, S., Komodakis, N.: Attend refine repeat: Active box proposal generation via in-out localization. arXiv preprint arXiv:1606.04446 (2016)
[6] Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. pp. 1440–1448 (2015)
[7] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 580–587 (2014)
[8] Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., He, K.: Detectron (2018)
[9] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
[10] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37(9), 1904–1916 (2015)
[11] Li, B., Liu, Y., Wang, X.: Gradient harmonized single-stage detector. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8577–8584 (2019)
[12] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
[13] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
[14] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
[15] Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8759–8768 (2018)
[16] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European conference on computer vision. pp. 21–37. Springer (2016)
[17] Ouyang, W., Wang, K., Zhu, X., Wang, X.: Chained cascade network for object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1938–1946 (2017)
[18] Ouyang, W., Wang, X., Zhang, C., Yang, X.: Factors in finetuning deep model for object detection with long-tail distribution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 864–873 (2016)
[19] Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., Lin, D.: Libra R-CNN: Towards balanced learning for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 821–830 (2019)
[20] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 779–788 (2016)
[21] Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7263–7271 (2017)
[22] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)
[23] Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018)
[24] Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 761–769 (2016)
[25] Singh, B., Davis, L.S.: An analysis of scale invariance in object detection SNIP. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3578–3587 (2018)
[26] Singh, B., Najibi, M., Davis, L.S.: SNIPER: Efficient multi-scale training. In: Advances in Neural Information Processing Systems. pp. 9310–9320 (2018)
[27] Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github.com/facebookresearch/detectron2 (2019)
[28] Yang, B., Yan, J., Lei, Z., Li, S.Z.: CRAFT objects from images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6043–6051 (2016)
[29] Zhu, X., Pang, J., Yang, C., Shi, J., Lin, D.: Adapting object detectors via selective cross-domain alignment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 687–696 (2019)