
Non-iterative optimization of pseudo-labeling thresholds for training object detection models from multiple datasets

Abstract

We propose a non-iterative method to optimize pseudo-labeling thresholds for learning object detection from a collection of low-cost datasets, each of which is annotated for only a subset of all the object classes. A popular approach to this problem is first to train teacher models and then to use their confident predictions as pseudo ground-truth labels when training a student model. To obtain the best result, however, the thresholds on prediction confidence must be tuned. This process typically involves iterative search and repeated training of student models and is therefore time-consuming. We develop a method that optimizes the thresholds without iterative search by maximizing the $F_{\beta}$-score on a validation dataset, which reflects the quality of the pseudo labels and can be computed without training a student model. We experimentally demonstrate that our proposed method achieves an mAP comparable to that of grid search on the COCO and VOC datasets.

Index Terms—  Non-iterative optimization, pseudo labeling, object detection, weakly supervised learning

1 Introduction

Object detection [1, 2, 3, 4] has achieved significant progress in deep learning with a tremendous number of images and annotations, but it becomes quite expensive to collect them. This creates a significant barrier when it comes to moving from the research stage to practical application. Recently, research on how to train a model with low-cost datasets has become more active.

There are several paradigms to learn from low-cost datasets. Examples include semi-supervised learning [5, 6, 7, 8] and weakly supervised learning [9, 10]. In semi-supervised learning, models are trained from a limited amount of labeled data and a lot of unlabeled data (Fig. 1(a)), while in weakly supervised learning, models are trained from only image-level annotations and no bounding boxes (Fig. 1(b)). By contrast, we aim at training a single object detection model for all classes from multiple datasets that have different class sets without additional annotations [11, 12, 13]. This setting (Fig. 1(c)) is important for practical applications, because we can add object categories simply by combining datasets that are made for different purposes.


(a) Semi-supervised learning


(b) Weakly supervised learning


(c) Learning from multiple datasets with different class sets

Fig. 1: Examples of learning paradigms for object detection using low-cost datasets. In these examples, the goal is to train a model that detects people and bicycles in images by using training datasets with annotations as illustrated above. The two images are contained in COCO [14] and VOC [15] datasets, respectively.
Fig. 2: Schematic picture of the pseudo-labeling method for training an object detection model from multiple datasets with different class sets. First, a dataset $DS_i$ is used to train a model $M_i$. Then the model $M_i$ generates pseudo labels of the object classes in $C_i$ for a dataset $DS_j$ ($i \neq j$). Finally, the datasets with pseudo labels are used to train the final detection model $M$.

Typically, pseudo labeling is used to train an object detection model in the current problem setting (Fig. 2). Specifically, we first train one teacher model from each dataset and then use them to predict locations of unlabeled objects. A prediction is used as a pseudo label if its confidence score is higher than a predetermined threshold. Finally, we train a single student model for all classes by using both the ground-truth labels and the pseudo labels.

To achieve the best performance with pseudo labeling, it is imperative to decide this threshold properly, but the optimization process is time-consuming. This optimization usually involves iterative search; we generate pseudo labels, train and evaluate a student model, and then repeat these steps multiple times. If we use a common value for all classes, this might require, e.g., 10 repetitions. Moreover, if we wish to find an optimal threshold for each of $K$ classes, the naive grid search algorithm requires $10^{K}$ iterations, which becomes infeasible if $K$ is more than a few.

In this work, we propose a non-iterative method for optimizing thresholds for pseudo labeling. We determine the thresholds so that the $F_{\beta}$-score of the pseudo labels, or that of both ground-truth and pseudo labels, is maximized; see Sections 2.2 and 2.3 for the definitions of these quantities. Importantly, these metrics can be measured without training a student model, unlike the detection performance of a student model on a validation dataset, which is usually used as the criterion for finding the optimal thresholds. With our proposed method, a student model needs to be trained only once, when we obtain the final detection model. We experimentally demonstrate that our proposed method achieves an mAP comparable to that of the conventional grid search method, which involves repeated training of student models.

2 Method

We provide a brief overview of pseudo labeling in Section 2.1 and then explain our proposed method for optimizing pseudo-labeling thresholds in the following subsections.

2.1 Pseudo labeling

In Fig. 2, we illustrate the pseudo-labeling method for learning object detection from multiple datasets with different sets of object classes. Assume that the entire dataset $DS$ consists of $N$ datasets $DS_1, DS_2, \dots, DS_N$ that have different class sets $C_1, C_2, \dots, C_N$, respectively. The class set of dataset $DS$ can be written as $C = C_1 \cup C_2 \cup \dots \cup C_N$, and the complementary class set of dataset $DS_i$ can be written as $\overline{C_i} = C \setminus C_i$. First, we train an object detection model $M_i$ for the class set $C_i$ on the dataset $DS_i$ by supervised learning. We train models on the $N$ datasets individually, so $N$ object detection models are generated. Second, we generate pseudo labels. To this end, the models that cover $\overline{C_i}$ are used to predict object locations in the dataset $DS_i$. If the raw predictions were used as pseudo labels, they would be too noisy, so we only adopt predictions whose confidence scores exceed a certain threshold. Finally, we train an object detection model $M$ with the original ground-truth labels and the generated pseudo labels.
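To make the basic one-threshold procedure concrete, the following Python sketch shows how a teacher's predictions could be filtered into pseudo labels; the `Box` structure and the function name are our own illustrative assumptions, not the interface of any particular detection library.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class Box:
    cls: str              # predicted object class
    score: float          # confidence score of the prediction
    xyxy: Tuple[float, float, float, float]  # box coordinates (x1, y1, x2, y2)

def generate_pseudo_labels(predictions: List[Box],
                           complementary_classes: Set[str],
                           tau: float) -> List[Box]:
    """Keep teacher predictions for classes not annotated in this dataset
    whose confidence exceeds the threshold tau; these become pseudo labels."""
    return [p for p in predictions
            if p.cls in complementary_classes and p.score >= tau]
```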

There is a variant of the pseudo-labeling method in which two thresholds, $\tau_h$ and $\tau_l$, are used to further reduce the noise of pseudo labels [11]. If a prediction has a confidence score higher than $\tau_h$, it is used as a pseudo label, while if the confidence scores of all classes in a region are below $\tau_l$, that region is treated as pseudo background. Regions that satisfy neither condition are ignored in the training of the student model.
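A minimal sketch of how the two-threshold rule could label each predicted region follows; the per-region score dictionary and the status strings are illustrative assumptions rather than the exact formulation of Ref. [11].

```python
def region_status(class_scores: dict, tau_h: float, tau_l: float) -> str:
    """Assign a training status to a region from its per-class confidence scores."""
    top = max(class_scores.values())
    if top >= tau_h:
        return "pseudo_label"       # confident prediction: used as a pseudo label
    if top < tau_l:
        return "pseudo_background"  # all class scores are low: treated as background
    return "ignore"                 # ambiguous region: excluded from the student's loss
```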

2.2 Maximizing $F_{\beta}$-score of pseudo labels

The goal of optimizing the thresholds is to find a set of pseudo labels that yields the best student model. However, it is time-consuming to measure the performance of a student model for each pseudo-labeled dataset generated with a different threshold. Therefore, we propose to use the $F_{\beta}$-score of the pseudo labels as a surrogate for the student performance. It is defined as the weighted harmonic mean of precision and recall:

$F_{\beta}=\dfrac{(1+\beta^{2})\cdot\mathrm{precision}\cdot\mathrm{recall}}{\beta^{2}\cdot\mathrm{precision}+\mathrm{recall}}.$   (1)

If the $\beta$ value is less than 1, $F_{\beta}$ is a precision-weighted metric, while if the $\beta$ value is more than 1, $F_{\beta}$ is a recall-weighted metric. Importantly, because this essentially measures performance of the teacher models, we do not need to train a student model to calculate the $F_{\beta}$-score.
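For reference, Eq. (1) translates directly into a small helper function; the function name and the example values are ours.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall, Eq. (1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1.0 + beta**2) * precision * recall / (beta**2 * precision + recall)

# beta < 1 emphasizes precision, beta > 1 emphasizes recall; e.g., with precision 0.8 and recall 0.6:
# f_beta(0.8, 0.6, 0.5) > f_beta(0.8, 0.6, 1.0) > f_beta(0.8, 0.6, 2.0)
```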

Ideally, we would evaluate the $F_{\beta}$-score of the pseudo labels themselves, but this is impossible because there are no ground-truth labels in the dataset for which we want to generate pseudo labels. Therefore, we measure the $F_{\beta}$-score on a pseudo-label validation dataset. This dataset can be prepared either by splitting it off from a training dataset or as a separate dataset, provided that it has annotations for the evaluation categories.

To find the optimal thresholds, we use the teacher models $M_i$ to generate predictions of object locations in the validation dataset and calculate the $F_{\beta}$-score. For the one-threshold variant of pseudo labeling, we adopt as $\tau$ the threshold that yields the maximum $F_{1}$-score. For the two-threshold variant, on the other hand, we take as $\tau_h$ the threshold that maximizes the $F_{0.5}$-score, a typical precision-weighted metric, and as $\tau_l$ the threshold that maximizes the $F_{2}$-score, a typical recall-weighted metric.
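Below is a sketch of this selection step. It assumes a callable `precision_recall_at(tau)` that matches the teacher predictions against the validation ground truth (e.g., by IoU) and returns their precision and recall at threshold `tau`; that callable, the candidate grid, and the reuse of the `f_beta` helper from the previous sketch are our own assumptions.

```python
import numpy as np

def best_threshold(precision_recall_at, beta: float) -> float:
    """Return the candidate threshold maximizing the F_beta-score of the
    teacher predictions on the pseudo-label validation dataset."""
    candidates = np.arange(0.05, 1.00, 0.05)  # dense sweep; no student training needed
    scores = [f_beta(*precision_recall_at(tau), beta) for tau in candidates]
    return float(candidates[int(np.argmax(scores))])

# One-threshold variant:  tau   = best_threshold(pr_fn, beta=1.0)   (maximizes F1)
# Two-threshold variant:  tau_h = best_threshold(pr_fn, beta=0.5)   (precision-weighted)
#                         tau_l = best_threshold(pr_fn, beta=2.0)   (recall-weighted)
```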

2.3 Maximizing $F_{\beta}$-score of all labels

Fig. 3: Schematic image of a precision-recall curve of all labels including both ground-truth and pseudo labels. Thresholds are determined using not only automatically generated pseudo labels but also human-annotated ground-truth labels.

The method discussed above determines thresholds by maximizing the $F_{\beta}$-score of pseudo labels. However, because training of a student model uses both the human-annotated ground-truth labels and the generated pseudo labels, it is preferable to take both of them into account to determine the thresholds.

Figure 3 illustrates the basic idea. For the object class $j$, let $x_j$ be the ratio $g_j / t_j$ of the number $g_j$ of labeled objects to the total number $t_j$ of labeled and unlabeled ones. We assume that the ground-truth labels are definitely correct, so we treat them as predictions with precision 1 and confidence score 1. We can then calculate the precision and recall of all the labels as

$p_{DS_j}(\tau)=\begin{cases}1 & (\tau=1),\\ \dfrac{x_j t_j+(1-x_j)\,t_j\,r_j(\tau)}{x_j t_j+\dfrac{(1-x_j)\,t_j\,r_j(\tau)}{p_j(\tau)}} & (\tau<1),\end{cases}$   (2)

$r_{DS_j}(\tau)=\begin{cases}\dfrac{g_j}{t_j} & (\tau=1),\\ x_j+(1-x_j)\,r_j(\tau) & (\tau<1).\end{cases}$   (3)

Here, $p_j(\tau)$ and $r_j(\tau)$ are precision and recall of pseudo labels only. By using these expressions, we can calculate the $F_{\beta}$-score of all labels and select the threshold $\tau$, or the pair $\tau_h$ and $\tau_l$, in the same way as explained in Section 2.2.
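In code form, Eqs. (2) and (3) could be evaluated as in the sketch below; the function name is ours, and `p_j` and `r_j` denote the pseudo-label precision and recall of class $j$ already measured at the threshold `tau`.

```python
def combined_precision_recall(x_j: float, t_j: float,
                              p_j: float, r_j: float, tau: float):
    """Precision and recall over ground-truth plus pseudo labels, Eqs. (2)-(3).
    x_j: fraction of class-j objects that already carry ground-truth labels.
    t_j: total number of class-j objects (labeled + unlabeled)."""
    if tau >= 1.0 or p_j == 0.0:   # no pseudo labels are added; only ground truth remains
        return 1.0, x_j
    true_labels = x_j * t_j + (1.0 - x_j) * t_j * r_j        # GT + correct pseudo labels
    all_labels = x_j * t_j + (1.0 - x_j) * t_j * r_j / p_j   # GT + all pseudo labels
    return true_labels / all_labels, x_j + (1.0 - x_j) * r_j
```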

Under the current problem setting, where no knowledge about unlabeled instances is available, $t_j$ is also unknown because it includes the number of unlabeled objects. In such cases, a crude but simple way to estimate $x_j$ is to use the ratio of the number of images in the datasets annotated with object class $j$ to the number of images in the entire dataset.
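This crude estimate can be computed directly from the dataset statistics; a short sketch with our own variable names:

```python
def estimate_x_j(num_images_annotating_class_j: int, num_images_total: int) -> float:
    """Approximate x_j by the fraction of images coming from datasets
    whose class set contains object class j."""
    return num_images_annotating_class_j / num_images_total
```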

3 Experiments

In this section, we experimentally demonstrate the effectiveness of our proposed method. We compare four methods, two of which are our proposed methods. “w/o PL” uses only the ground-truth annotations without pseudo labels; this sets the lower-bound performance for the other methods. “Grid search” finds the optimal threshold from a predefined pool of candidate values. With this method, we use the same threshold value for all classes, because otherwise grid search is infeasible due to the enormous search space. “$\mathrm{Fmax_{PL}}$” and “$\mathrm{Fmax_{DS}}$” are our proposed methods. The former uses the $F_{\beta}$-scores of the pseudo labels to determine the optimal thresholds (Section 2.2), while the latter uses those of all labels, including both ground-truth and pseudo labels (Section 2.3).

3.1 Datasets

We experiment with two semi-synthetic datasets, COCO-splitting and COCO-VOC, which emulate the problem setting as described in Section 1.

COCO-splitting. We split Microsoft COCO Detection 2017 (COCO) [14], a general object detection dataset consisting of 80 object classes, into $N$ subsets ($N=2,5,10$) to make multiple datasets with different class sets, as follows: split 110,000 images from the COCO training data into $N$ subsets to make $N$ datasets ($DS_i$); assign the category ids equal to $i$ (mod $N$) to $C_i$; and remove images in $DS_i$ that have no objects belonging to a category in $C_i$. In this setting, the $N$ splits are expected to have characteristics similar to each other. The remaining 5,000 images of the COCO training split are used as a pseudo-label validation dataset.
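The construction of the splits can be sketched as follows. The round-robin assignment of images to subsets and the dictionary-based data layout are our own assumptions for illustration, not the exact script used in the experiments.

```python
def build_coco_splits(images, category_ids, n_splits: int):
    """Split the images into N disjoint subsets, assign category id c to the
    class set C_{c mod N}, and drop images left without objects of their split's classes."""
    class_sets = {s: {c for c in category_ids if c % n_splits == s}
                  for s in range(n_splits)}
    datasets = {s: [] for s in range(n_splits)}
    for idx, img in enumerate(images):
        s = idx % n_splits                          # assumed round-robin image split
        kept = [a for a in img["annotations"]
                if a["category_id"] in class_sets[s]]
        if kept:                                    # remove images with no objects in C_s
            datasets[s].append({"file_name": img["file_name"], "annotations": kept})
    return class_sets, datasets
```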

COCO-VOC. This dataset combines COCO [14] and Pascal VOC [15] (the COCO-VOC dataset), mimicking the situation where there is a domain gap between the constituent datasets. The VOC dataset consists of 20 object categories, which form a subset of COCO's 80 classes. We remove annotations of the overlapping classes from COCO so that the two datasets have mutually exclusive sets of object classes. Similarly to COCO-splitting, we use 5,000 images from COCO's training split and 500 images from VOC's trainval split for validation. For testing, we use the same dataset as in Ref. [11], which is taken from VOC's test set and COCO's validation set. The VOC portion of this test set is annotated not only with the 20 VOC categories but also with the additional 60 categories that are annotated only in the COCO training dataset. This allows us to measure mAP over all 80 classes on both portions of the test set.

3.2 Implementation details

We use M2Det [4] as the object detection model. M2Det is a one-stage detector that applies a multi-level feature pyramid network (MLFPN) to feature maps generated by a backbone and performs detection using multi-scale feature maps. The input image size is $320\times 320$ pixels, and a VGG16 network pretrained on ImageNet is used as the backbone. For each experiment, the model is trained for 150 epochs using NesterovAG with an initial learning rate of 0.01 and a momentum of 0.9.
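For reference, the training setup described above can be collected into a single configuration; the dictionary keys are our own naming and do not correspond to any specific framework's API.

```python
train_config = {
    "detector": "M2Det",                        # one-stage detector with an MLFPN neck
    "backbone": "VGG16 (ImageNet-pretrained)",
    "input_size": (320, 320),                   # pixels
    "epochs": 150,
    "optimizer": "NesterovAG",                  # Nesterov accelerated gradient
    "initial_learning_rate": 0.01,
    "momentum": 0.9,
}
```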

In the experiment with COCO-splitting, we use the simpler version of the pseudo-labeling method with one threshold to reduce the computational cost. We also found in preliminary experiments that the gain from the second threshold was tiny for this dataset. On the other hand, we adopt the two-threshold variant for COCO-VOC, because preliminary experiments suggested that this setting benefits significantly from the second, lower threshold.

For the grid search algorithm, we need to predetermine the sets of candidate threshold values. When tuning the single threshold $\tau$, we take $[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]$ as the pool of candidates. Note that taking the value $\tau=1$ is equivalent to “w/o PL”, which simply combines datasets without any additional pseudo labels. On the other hand, the thresholds $\tau_h$ and $\tau_l$ are selected from $[0.2, 0.4, 0.6, 0.8, 1.0]$, with the constraint $\tau_h \geq \tau_l$. This results in 15 pairs of candidate threshold values for this algorithm.
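For concreteness, the two candidate pools can be enumerated as below; applying the constraint $\tau_h \geq \tau_l$ to the 5 × 5 grid leaves exactly the 15 pairs mentioned above.

```python
from itertools import product

single_candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]   # pool for tau
pair_candidates = [(h, l)
                   for h, l in product([0.2, 0.4, 0.6, 0.8, 1.0], repeat=2)
                   if h >= l]                                        # pool for (tau_h, tau_l)
assert len(pair_candidates) == 15
```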

In the $\mathrm{Fmax_{DS}}$ method, we need to estimate, for each object class, the ratio of the number of labeled objects to that of all objects. In the COCO-splitting experiment, for simplicity, we use the true value of $x_j$, which is known from COCO's original annotations. With COCO-VOC, on the other hand, we follow the simple procedure explained in Section 2.3.

3.3 Results

Fig. 4: Results of the COCO-splitting dataset. We compare our proposed methods ($\mathrm{Fmax_{PL}}$ and $\mathrm{Fmax_{DS}}$) with grid search. Note that grid search with the threshold equal to one is equivalent to “w/o PL”. The thresholds for the proposed methods are the average over the thresholds for the 80 classes.
Table 1: Results of COCO-splitting dataset. We show mAP50 for “w/o PL”, grid search, and our proposed methods.
Method                  2 splits   5 splits   10 splits
w/o PL                  0.411      0.323      0.272
Grid search             0.449      0.372      0.304
$\mathrm{Fmax_{PL}}$    0.446      0.366      0.298
$\mathrm{Fmax_{DS}}$    0.447      0.373      0.307

Figure 4 shows the experimental results for the COCO-splitting datasets. The plot indicates that one of our proposed methods, $\mathrm{Fmax_{DS}}$, performs better than or comparably to the grid search method, without any iterative search for the optimal $\tau$. $\mathrm{Fmax_{PL}}$ prefers smaller thresholds, which means that it uses noisier pseudo labels to train a student model. This slightly degrades the detection accuracy compared with grid search and $\mathrm{Fmax_{DS}}$, but it is still competitive (Table 1).

Interestingly, Fig. 4 implies that the optimal value of the threshold depends on the dataset. In other words, there is no universal threshold value that leads to the optimal performance for every dataset. This observation corroborates the underlying premise of our work that the threshold must be optimized to achieve the best results. Figure 4 also indicates that the value found by $\mathrm{Fmax_{DS}}$ is close to the optimal value found by grid search.

Table 2: Results of COCO-VOC dataset. We show mAP50 for “w/o PL”, grid search, and our proposed methods. “C+V” denotes the evaluation on the combined test dataset of COCO and VOC, while “C” and “V” are the COCO and VOC portions of the test dataset, respectively. $\tau_h$ and $\tau_l$ are the higher and lower thresholds chosen by each method.
Method                  C+V     C       V       ($\tau_h$, $\tau_l$)
w/o PL                  0.425   0.422   0.343   -
Grid search             0.480   0.489   0.414   (0.8, 0.2)
$\mathrm{Fmax_{PL}}$    0.478   0.482   0.431   (0.61, 0.24)
$\mathrm{Fmax_{DS}}$    0.481   0.483   0.422   (0.82, 0.29)

Table 2 lists the experimental results of the COCO-VOC dataset. Without repeated training of student models, $\mathrm{Fmax_{DS}}$ achieves the best result with respect to mAP over the whole test set (“C+V”). This result suggests that our proposed method is robust to the domain gap between the constituent datasets.

4 Conclusion

In this paper, we proposed a non-iterative method to optimize pseudo-labeling thresholds for training a single object detection model from multiple datasets. To avoid training a student model multiple times, we used the $F_{\beta}$-score to measure the quality of pseudo labels and to find the optimal thresholds. Experimental results showed that our method achieved an mAP competitive with grid search, but with significantly lower computational cost. This work should help bring deep learning to practical applications at low cost.

References

  • [1] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision, 2016, pp. 21–37.
  • [3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • [4] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling, “M2Det: A single-shot object detector based on multi-level feature pyramid network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 9259–9266.
  • [5] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak, “Consistency-based semi-supervised learning for object detection,” in Advances in Neural Information Processing Systems, 2019.
  • [6] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister, “A simple semi-supervised learning framework for object detection,” arXiv preprint arXiv:2005.04757, 2020.
  • [7] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3060–3069.
  • [8] Jisoo Jeong, Vikas Verma, Minsung Hyun, Juho Kannala, and Nojun Kwak, “Interpolation-based semi-supervised learning for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11602–11611.
  • [9] Hakan Bilen and Andrea Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
  • [10] Peng Tang, Xinggang Wang, Song Bai, Wei Shen, Xiang Bai, Wenyu Liu, and Alan Yuille, “PCL: Proposal cluster learning for weakly supervised object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 176–191, 2020.
  • [11] Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, and Ying Wu, “Object detection with a unified label space from multiple datasets,” in European Conference on Computer Vision, 2020, pp. 178–193.
  • [12] Bowen Zhao, Chen Chen, Wanpeng Xiao, Xi Xiao, Qi Ju, and Shutao Xia, “Towards a category-extended object detector without relabeling or conflicts,” arXiv preprint arXiv:2012.14115, 2020.
  • [13] Yongqiang Yao, Yan Wang, Yu Guo, Jiaojiao Lin, Hongwei Qin, and Junjie Yan, “Cross-dataset training for class increasing object detection,” arXiv preprint arXiv:2001.04621, 2020.
  • [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision, 2014, pp. 740–755.
  • [15] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman, “The pascal visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2010.