
Non-iterative optimization of pseudo-labeling thresholds for training object detection models from multiple datasets

Abstract

We propose a non-iterative method to optimize pseudo-labeling thresholds for learning object detection from a collection of low-cost datasets, each of which is annotated for only a subset of all the object classes. A popular approach to this problem is first to train teacher models and then to use their confident predictions as pseudo ground-truth labels when training a student model. To obtain the best result, however, the thresholds on prediction confidence must be tuned. This process typically involves iterative search and repeated training of student models and is therefore time-consuming. We develop a method that optimizes the thresholds without iterative search by maximizing the $F_{\beta}$-score on a validation dataset, which reflects the quality of the pseudo labels and can be computed without training a student model. We experimentally demonstrate that our proposed method achieves an mAP comparable to that of grid search on the COCO and VOC datasets.

Index Terms—  Non-iterative optimization, pseudo labeling, object detection, weakly supervised learning

1 Introduction

Object detection [1, 2, 3, 4] has achieved significant progress in deep learning with a tremendous number of images and annotations, but it becomes quite expensive to collect them. This creates a significant barrier when it comes to moving from the research stage to practical application. Recently, research on how to train a model with low-cost datasets has become more active.

There are several paradigms to learn from low-cost datasets. Examples include semi-supervised learning [5, 6, 7, 8] and weakly supervised learning [9, 10]. In semi-supervised learning, models are trained from a limited amount of labeled data and a lot of unlabeled data (Fig. 1(a)), while in weakly supervised learning, models are trained from only image-level annotations and no bounding boxes (Fig. 1(b)). By contrast, we aim at training a single object detection model for all classes from multiple datasets that have different class sets without additional annotations [11, 12, 13]. This setting (Fig. 1(c)) is important for practical applications, because we can add object categories simply by combining datasets that are made for different purposes.


(a) Semi-supervised learning


(b) Weakly supervised learning


(c) Learning from multiple datasets with different class sets

Fig. 1: Examples of learning paradigms for object detection using low-cost datasets. In these examples, the goal is to train a model that detects people and bicycles in images by using training datasets with annotations as illustrated above. The two images are contained in COCO [14] and VOC [15] datasets, respectively.
Fig. 2: Schematic picture of the pseudo-labeling method for training an object detection model from multiple datasets with different class sets. First, a dataset $DS_i$ is used to train a model $M_i$. Then the model $M_i$ generates pseudo labels of the object classes in $C_i$ for a dataset $DS_j$ ($i \neq j$). Finally, the datasets with pseudo labels are used to train the final detection model $M$.

Typically, pseudo labeling is used to train an object detection model in the current problem setting (Fig. 2). Specifically, we first train one teacher model from each dataset and then use them to predict locations of unlabeled objects. A prediction is used as a pseudo label if its confidence score is higher than a predetermined threshold. Finally, we train a single student model for all classes by using both the ground-truth labels and the pseudo labels.

To achieve the best performance with pseudo labeling, it is imperative to decide this threshold properly, but the optimization process is time-consuming. This optimization usually involves iterative search; we generate pseudo labels, train and evaluate a student model, and then repeat these steps multiple times. If we use a common value for all classes, this might require, e.g., 10 repetitions. Moreover, if we wish to find an optimal threshold for each of $K$ classes, the naive grid search algorithm requires $10^{K}$ iterations, which becomes infeasible if $K$ is more than a few.

In this work, we propose a non-iterative method for optimizing thresholds for pseudo labeling. We determine the thresholds so that the $F_{\beta}$-score of the pseudo labels, or that of both ground-truth and pseudo labels, is maximized; see Sections 2.2 and 2.3 for the definitions of these quantities. Importantly, these metrics can be measured without training a student model, unlike the detection performance of a student model on a validation dataset, which is usually used as the criterion for finding the optimal thresholds. With our proposed method, a student model needs to be trained only once, when we obtain the final detection model. We experimentally demonstrate that our proposed method achieves an mAP comparable to that of the conventional grid search method, which involves repeated training of student models.

2 Method

We provide a brief overview of pseudo labeling in Section 2.1 and then explain our proposed method for optimizing pseudo-labeling thresholds in the following subsections.

2.1 Pseudo labeling

In Fig. 2, we illustrate the pseudo-labeling method for learning object detection from multiple datasets with different sets of object classes. Assume that the entire dataset $DS$ consists of $N$ datasets $DS_1, DS_2, \dots, DS_N$ that have different class sets $C_1, C_2, \dots, C_N$, respectively. The class set of dataset $DS$ can be written as $C = C_1 \cup C_2 \cup \dots \cup C_N$, and the complementary class set of dataset $DS_i$ can be written as $\overline{C_i} = C \setminus C_i$. First, we train an object detection model $M_i$ for the class set $C_i$ on the dataset $DS_i$ by supervised learning. We train models on the $N$ datasets individually, so $N$ object detection models are generated. Second, we generate pseudo labels. To this end, the models that cover $\overline{C_i}$ are used to predict object locations in the dataset $DS_i$. If the raw predictions were used as pseudo labels, they would be too noisy, so we only adopt predictions whose confidence scores exceed a certain threshold. Finally, we train an object detection model $M$ with the original ground-truth labels and the generated pseudo labels.
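To make the basic one-threshold procedure concrete, the following Python sketch shows how a teacher's predictions could be filtered into pseudo labels; the `Box` structure and the function name are our own illustrative assumptions, not the interface of any particular detection library.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class Box:
    cls: str              # predicted object class
    score: float          # confidence score of the prediction
    xyxy: Tuple[float, float, float, float]  # box coordinates (x1, y1, x2, y2)

def generate_pseudo_labels(predictions: List[Box],
                           complementary_classes: Set[str],
                           tau: float) -> List[Box]:
    """Keep teacher predictions for classes not annotated in this dataset
    whose confidence exceeds the threshold tau; these become pseudo labels."""
    return [p for p in predictions
            if p.cls in complementary_classes and p.score >= tau]
```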

There is a variant of the pseudo-labeling method in which two thresholds, $\tau_h$ and $\tau_l$, are used to further reduce the noise of pseudo labels [11]. If a prediction has a confidence score higher than $\tau_h$, it is used as a pseudo label, while if the confidence scores of all classes in a region are below $\tau_l$, that region is treated as pseudo background. Regions that satisfy neither condition are ignored in the training of the student model.
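A minimal sketch of how the two-threshold rule could label each predicted region follows; the per-region score dictionary and the status strings are illustrative assumptions rather than the exact formulation of Ref. [11].

```python
def region_status(class_scores: dict, tau_h: float, tau_l: float) -> str:
    """Assign a training status to a region from its per-class confidence scores."""
    top = max(class_scores.values())
    if top >= tau_h:
        return "pseudo_label"       # confident prediction: used as a pseudo label
    if top < tau_l:
        return "pseudo_background"  # all class scores are low: treated as background
    return "ignore"                 # ambiguous region: excluded from the student's loss
```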

2.2 Maximizing $F_{\beta}$-score of pseudo labels

The goal of optimizing the thresholds is to find a set of pseudo labels that yields the best student model. However, it is time-consuming to measure the performance of a student model for each pseudo-labeled dataset generated with a different threshold. Therefore, we propose to use the $F_{\beta}$-score of the pseudo labels as a surrogate for the student performance. It is defined as the weighted harmonic mean of precision and recall:

$F_{\beta}=\dfrac{(1+\beta^{2})\cdot\mathrm{precision}\cdot\mathrm{recall}}{\beta^{2}\cdot\mathrm{precision}+\mathrm{recall}}.$   (1)

If the $\beta$ value is less than 1, $F_{\beta}$ is a precision-weighted metric, while if the $\beta$ value is more than 1, $F_{\beta}$ is a recall-weighted metric. Importantly, because this essentially measures performance of the teacher models, we do not need to train a student model to calculate the $F_{\beta}$-score.
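For reference, Eq. (1) translates directly into a small helper function; the function name and the example values are ours.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall, Eq. (1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1.0 + beta**2) * precision * recall / (beta**2 * precision + recall)

# beta < 1 emphasizes precision, beta > 1 emphasizes recall; e.g., with precision 0.8 and recall 0.6:
# f_beta(0.8, 0.6, 0.5) > f_beta(0.8, 0.6, 1.0) > f_beta(0.8, 0.6, 2.0)
```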

Ideally, we would evaluate the $F_{\beta}$-score of the pseudo labels themselves, but this is impossible because there are no ground-truth labels in the dataset for which we want to generate pseudo labels. Therefore, we measure the $F_{\beta}$-score on a pseudo-label validation dataset. This dataset can be prepared either by splitting it off from a training dataset or as a separate dataset, provided that it has annotations for the evaluation categories.

To find the optimal thresholds, we use the teacher models $M_i$ to generate predictions of object locations in the validation dataset and calculate the $F_{\beta}$-score. For the one-threshold variant of pseudo labeling, we adopt as $\tau$ the threshold that yields the maximum $F_{1}$-score. For the two-threshold variant, on the other hand, we take as $\tau_h$ the threshold that maximizes the $F_{0.5}$-score, a typical precision-weighted metric, and as $\tau_l$ the threshold that maximizes the $F_{2}$-score, a typical recall-weighted metric.
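Below is a sketch of this selection step. It assumes a callable `precision_recall_at(tau)` that matches the teacher predictions against the validation ground truth (e.g., by IoU) and returns their precision and recall at threshold `tau`; that callable, the candidate grid, and the reuse of the `f_beta` helper from the previous sketch are our own assumptions.

```python
import numpy as np

def best_threshold(precision_recall_at, beta: float) -> float:
    """Return the candidate threshold maximizing the F_beta-score of the
    teacher predictions on the pseudo-label validation dataset."""
    candidates = np.arange(0.05, 1.00, 0.05)  # dense sweep; no student training needed
    scores = [f_beta(*precision_recall_at(tau), beta) for tau in candidates]
    return float(candidates[int(np.argmax(scores))])

# One-threshold variant:  tau   = best_threshold(pr_fn, beta=1.0)   (maximizes F1)
# Two-threshold variant:  tau_h = best_threshold(pr_fn, beta=0.5)   (precision-weighted)
#                         tau_l = best_threshold(pr_fn, beta=2.0)   (recall-weighted)
```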

2.3 Maximizing $F_{\beta}$-score of all labels

Fig. 3: Schematic image of a precision-recall curve of all labels including both ground-truth and pseudo labels. Thresholds are determined using not only automatically generated pseudo labels but also human-annotated ground-truth labels.

The method discussed above determines thresholds by maximizing the $F_{\beta}$-score of pseudo labels. However, because training of a student model uses both the human-annotated ground-truth labels and the generated pseudo labels, it is preferable to take both of them into account to determine the thresholds.

Figure 3 illustrates the basic idea. For the object class $j$, let $x_j$ be the ratio $g_j / t_j$ of the number $g_j$ of labeled objects to the total number $t_j$ of labeled and unlabeled ones. We assume that the ground-truth labels are definitely correct, so we treat them as predictions with precision 1 and confidence score 1. We can then calculate the precision and recall of all the labels as

$p_{DS_j}(\tau)=\begin{cases}1 & (\tau=1),\\ \dfrac{x_j t_j+(1-x_j)\,t_j\,r_j(\tau)}{x_j t_j+\dfrac{(1-x_j)\,t_j\,r_j(\tau)}{p_j(\tau)}} & (\tau<1),\end{cases}$   (2)

$r_{DS_j}(\tau)=\begin{cases}\dfrac{g_j}{t_j} & (\tau=1),\\ x_j+(1-x_j)\,r_j(\tau) & (\tau<1).\end{cases}$   (3)

Here, $p_j(\tau)$ and $r_j(\tau)$ are precision and recall of pseudo labels only. By using these expressions, we can calculate the $F_{\beta}$-score of all labels and select the threshold $\tau$, or the pair $\tau_h$ and $\tau_l$, in the same way as explained in Section 2.2.
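In code form, Eqs. (2) and (3) could be evaluated as in the sketch below; the function name is ours, and `p_j` and `r_j` denote the pseudo-label precision and recall of class $j$ already measured at the threshold `tau`.

```python
def combined_precision_recall(x_j: float, t_j: float,
                              p_j: float, r_j: float, tau: float):
    """Precision and recall over ground-truth plus pseudo labels, Eqs. (2)-(3).
    x_j: fraction of class-j objects that already carry ground-truth labels.
    t_j: total number of class-j objects (labeled + unlabeled)."""
    if tau >= 1.0 or p_j == 0.0:   # no pseudo labels are added; only ground truth remains
        return 1.0, x_j
    true_labels = x_j * t_j + (1.0 - x_j) * t_j * r_j        # GT + correct pseudo labels
    all_labels = x_j * t_j + (1.0 - x_j) * t_j * r_j / p_j   # GT + all pseudo labels
    return true_labels / all_labels, x_j + (1.0 - x_j) * r_j
```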

Under the current problem setting, where no knowledge about unlabeled instances is available, $t_j$ is also unknown because it includes the number of unlabeled objects. In such cases, a crude but simple way to estimate $x_j$ is to use the ratio of the number of images in the datasets annotated with object class $j$ to the number of images in the entire dataset.
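This crude estimate can be computed directly from the dataset statistics; a short sketch with our own variable names:

```python
def estimate_x_j(num_images_annotating_class_j: int, num_images_total: int) -> float:
    """Approximate x_j by the fraction of images coming from datasets
    whose class set contains object class j."""
    return num_images_annotating_class_j / num_images_total
```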

3 Experiments

In this section, we experimentally demonstrate the effectiveness of our proposed method. We compare four methods, two of which are our proposed methods. “w/o PL” uses only the ground-truth annotations without pseudo labels; this sets the lower-bound performance for the other methods. “Grid search” finds the optimal threshold from a predefined pool of candidate values. With this method, we use the same threshold value for all classes, because otherwise grid search is infeasible due to the enormous search space. “$\mathrm{Fmax_{PL}}$” and “$\mathrm{Fmax_{DS}}$” are our proposed methods. The former uses the $F_{\beta}$-scores of the pseudo labels to determine the optimal thresholds (Section 2.2), while the latter uses those of all labels, including both ground-truth and pseudo labels (Section 2.3).

3.1 Datasets

We experiment with two semi-synthetic datasets, COCO-splitting and COCO-VOC, which emulate the problem setting as described in Section 1.

COCO-splitting. We split Microsoft COCO Detection 2017 (COCO) [14], a general object detection dataset consisting of 80 object classes, into $N$ subsets ($N=2,5,10$) to make multiple datasets with different class sets, as follows: split 110,000 images from the COCO training data into $N$ subsets to make $N$ datasets ($DS_i$); assign the category ids equal to $i$ (mod $N$) to $C_i$; and remove images in $DS_i$ that have no objects belonging to a category in $C_i$. In this setting, the $N$ splits are expected to have characteristics similar to each other. The remaining 5,000 images of the COCO training split are used as a pseudo-label validation dataset.
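The construction of the splits can be sketched as follows. The round-robin assignment of images to subsets and the dictionary-based data layout are our own assumptions for illustration, not the exact script used in the experiments.

```python
def build_coco_splits(images, category_ids, n_splits: int):
    """Split the images into N disjoint subsets, assign category id c to the
    class set C_{c mod N}, and drop images left without objects of their split's classes."""
    class_sets = {s: {c for c in category_ids if c % n_splits == s}
                  for s in range(n_splits)}
    datasets = {s: [] for s in range(n_splits)}
    for idx, img in enumerate(images):
        s = idx % n_splits                          # assumed round-robin image split
        kept = [a for a in img["annotations"]
                if a["category_id"] in class_sets[s]]
        if kept:                                    # remove images with no objects in C_s
            datasets[s].append({"file_name": img["file_name"], "annotations": kept})
    return class_sets, datasets
```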

COCO-VOC. This dataset combines COCO [14] and Pascal VOC [15] (the COCO-VOC dataset), mimicking the situation where there is a domain gap between the constituent datasets. The VOC dataset consists of 20 object categories, which form a subset of COCO's 80 classes. We remove annotations of the overlapping classes from COCO so that the two datasets have mutually exclusive sets of object classes. Similarly to COCO-splitting, we use 5,000 images from COCO's training split and 500 images from VOC's trainval split for validation. For testing, we use the same dataset as in Ref. [11], which is taken from VOC's test set and COCO's validation set. The VOC portion of this test set is annotated not only with the 20 VOC categories but also with the additional 60 categories that are annotated only in the COCO training dataset. This allows us to measure mAP over all 80 classes on both portions of the test set.

3.2 Implementation details

We use M2Det [4] as the object detection model. M2Det is a one-stage detector that applies a multi-level feature pyramid network (MLFPN) to feature maps generated by a backbone and performs detection using multi-scale feature maps. The input image size is $320\times 320$ pixels, and a VGG16 network pretrained on ImageNet is used as the backbone. For each experiment, the model is trained for 150 epochs using NesterovAG with an initial learning rate of 0.01 and a momentum of 0.9.
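For reference, the training setup described above can be collected into a single configuration; the dictionary keys are our own naming and do not correspond to any specific framework's API.

```python
train_config = {
    "detector": "M2Det",                        # one-stage detector with an MLFPN neck
    "backbone": "VGG16 (ImageNet-pretrained)",
    "input_size": (320, 320),                   # pixels
    "epochs": 150,
    "optimizer": "NesterovAG",                  # Nesterov accelerated gradient
    "initial_learning_rate": 0.01,
    "momentum": 0.9,
}
```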

In the experiment with COCO-splitting, we use the simpler version of the pseudo-labeling method with one threshold to reduce the computational cost. We also found in preliminary experiments that the gain from the second threshold was tiny for this dataset. On the other hand, we adopt the two-threshold variant for COCO-VOC, because preliminary experiments suggested that this setting benefits significantly from the second, lower threshold.

For the grid search algorithm, we need to predetermine the sets of candidate threshold values. When tuning the single threshold $\tau$, we take $[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]$ as the pool of candidates. Note that taking the value $\tau=1$ is equivalent to “w/o PL”, which simply combines datasets without any additional pseudo labels. On the other hand, the thresholds $\tau_h$ and $\tau_l$ are selected from $[0.2, 0.4, 0.6, 0.8, 1.0]$, with the constraint $\tau_h \geq \tau_l$. This results in 15 pairs of candidate threshold values for this algorithm.
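For concreteness, the two candidate pools can be enumerated as below; applying the constraint $\tau_h \geq \tau_l$ to the 5 × 5 grid leaves exactly the 15 pairs mentioned above.

```python
from itertools import product

single_candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]   # pool for tau
pair_candidates = [(h, l)
                   for h, l in product([0.2, 0.4, 0.6, 0.8, 1.0], repeat=2)
                   if h >= l]                                        # pool for (tau_h, tau_l)
assert len(pair_candidates) == 15
```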

In the $\mathrm{Fmax_{DS}}$ method, we need to estimate, for each object class, the ratio of the number of labeled objects to that of all objects. In the COCO-splitting experiment, for simplicity, we use the true value of $x_j$, which is known from COCO's original annotations. With COCO-VOC, on the other hand, we follow the simple procedure explained in Section 2.3.

3.3 Results

Fig. 4: Results of the COCO-splitting dataset. We compare our proposed methods ($\mathrm{Fmax_{PL}}$ and $\mathrm{Fmax_{DS}}$) with grid search. Note that grid search with the threshold equal to one is equivalent to “w/o PL”. The thresholds for the proposed methods are the average over the thresholds for the 80 classes.
Table 1: Results of COCO-splitting dataset. We show mAP50 for “w/o PL”, grid search, and our proposed methods.
Method                  2 splits   5 splits   10 splits
w/o PL                  0.411      0.323      0.272
Grid search             0.449      0.372      0.304
$\mathrm{Fmax_{PL}}$    0.446      0.366      0.298
$\mathrm{Fmax_{DS}}$    0.447      0.373      0.307

Figure 4 shows the experimental results for the COCO-splitting datasets. The plot indicates that one of our proposed methods, $\mathrm{Fmax_{DS}}$, performs better than or comparably to the grid search method, without any iterative search for the optimal $\tau$. $\mathrm{Fmax_{PL}}$ prefers smaller thresholds, which means that it uses noisier pseudo labels to train a student model. This slightly degrades the detection accuracy compared with grid search and $\mathrm{Fmax_{DS}}$, but it is still competitive (Table 1).

Interestingly, Fig. 4 implies that the optimal value of the threshold depends on the dataset. In other words, there is no universal threshold value that leads to the optimal performance for every dataset. This observation corroborates the underlying premise of our work that the threshold must be optimized to achieve the best results. Figure 4 also indicates that the value found by $\mathrm{Fmax_{DS}}$ is close to the optimal value found by grid search.

Table 2: Results of COCO-VOC dataset. We show mAP50 for “w/o PL”, grid search, and our proposed methods. “C+V” denotes the evaluation on the combined test dataset of COCO and VOC, while “C” and “V” are the COCO and VOC portions of the test dataset, respectively. $\tau_h$ and $\tau_l$ are the higher and lower thresholds chosen by each method.
Method                  C+V     C       V       ($\tau_h$, $\tau_l$)
w/o PL                  0.425   0.422   0.343   -
Grid search             0.480   0.489   0.414   (0.8, 0.2)
$\mathrm{Fmax_{PL}}$    0.478   0.482   0.431   (0.61, 0.24)
$\mathrm{Fmax_{DS}}$    0.481   0.483   0.422   (0.82, 0.29)

Table 2 lists the experimental results of the COCO-VOC dataset. Without repeated training of student models, $\mathrm{Fmax_{DS}}$ achieves the best result with respect to mAP over the whole test set (“C+V”). This result suggests that our proposed method is robust to the domain gap between the constituent datasets.

4 Conclusion

In this paper, we proposed a non-iterative method to optimize pseudo-labeling thresholds for training a single object detection model from multiple datasets. To avoid training a student model multiple times, we used the $F_{\beta}$-score to measure the quality of pseudo labels and to find the optimal thresholds. Experimental results showed that our method achieved an mAP competitive with grid search, but with significantly lower computational cost. This work should help bring deep learning to practical applications at low cost.

References

  • [1] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision, 2016, pp. 21–37.
  • [3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • [4] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling, “M2Det: A single-shot object detector based on multi-level feature pyramid network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 9259–9266.
  • [5] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak, “Consistency-based semi-supervised learning for object detection,” in Advances in Neural Information Processing Systems, 2019.
  • [6] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister, “A simple semi-supervised learning framework for object detection,” arXiv preprint arXiv:2005.04757, 2020.
  • [7] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3060–3069.
  • [8] Jisoo Jeong, Vikas Verma, Minsung Hyun, Juho Kannala, and Nojun Kwak, “Interpolation-based semi-supervised learning for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11602–11611.
  • [9] Hakan Bilen and Andrea Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
  • [10] Peng Tang, Xinggang Wang, Song Bai, Wei Shen, Xiang Bai, Wenyu Liu, and Alan Yuille, “PCL: Proposal cluster learning for weakly supervised object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 176–191, 2020.
  • [11] Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, and Ying Wu, “Object detection with a unified label space from multiple datasets,” in European Conference on Computer Vision, 2020, pp. 178–193.
  • [12] Bowen Zhao, Chen Chen, Wanpeng Xiao, Xi Xiao, Qi Ju, and Shutao Xia, “Towards a category-extended object detector without relabeling or conflicts,” arXiv preprint arXiv:2012.14115, 2020.
  • [13] Yongqiang Yao, Yan Wang, Yu Guo, Jiaojiao Lin, Hongwei Qin, and Junjie Yan, “Cross-dataset training for class increasing object detection,” arXiv preprint arXiv:2001.04621, 2020.
  • [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision, 2014, pp. 740–755.
  • [15] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman, “The pascal visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2010.