
1st Place Solutions for UG2+ Challenge 2021 - (Semi-) supervised Face detection in the low light condition

Pengcheng Wang Lingqiao Ji Zhilong Ji Yuan Gao Xiao Liu
Tomorrow Advancing Life (TAL) Education Group
{wangpengcheng2, jilingqiao, jizhilong, gaoyuan23, liuxiao15}@tal.com
Abstract

In this technical report, we briefly introduce the solution of our team "TAL-ai" to the (Semi-)supervised Face Detection in Low Light Conditions track of the UG2+ Challenge at CVPR 2021. Through several experiments with popular image enhancement and image transfer methods, we brought the low light images and the normal images into a closer domain, and observed that training on these data achieves better performance. We also adopt several popular object detection frameworks, e.g., DetectoRS and Cascade R-CNN, as well as large backbones such as Swin-Transformer. Finally, we ensemble several models, achieving an mAP of 74.89 on the test set and ranking 1st on the final leaderboard.

1 Introduction

The (Semi-)supervised Face Detection in Low Light Conditions challenge at CVPR 2021 is part of the UG2+ Prize Challenge workshop. In this task, we need to detect faces in low light images. The given DARKFACE set contains 6000 low light images with corresponding face annotations (some under extremely low light conditions, as shown in Figure 1) as the training and validation sets, and the final test set consists of 4000 low light images. According to [17], these samples were collected on several busy streets around Beijing; they contain faces of various scales and poses, and the resolution of the images is 1080 × 720 (down-sampled from 6K × 4K).

2 Overview of Methods

As stated in the previous work [17], two-stage methods achieve the best results. These methods usually start from a WIDERFACE [16] pre-trained model and then fine-tune on the properly pre-processed DARKFACE set. We follow this idea, but differently, our training set includes not only the pre-processed DARKFACE set but also external sets; specifically, we use WIDERFACE and UFDD [10] as our external sets.

In this work, we experiment with both image enhancement methods [8, 6] and several object detection frameworks. For image enhancement, we follow the experimental settings in [8] and [6] to process the given low light images. Besides, following [14], we transfer normal images from the WIDERFACE and UFDD datasets to a domain closer to that of the processed DARKFACE images. Moreover, we fuse the saliency map [7] of each image into the network input to suppress false negatives. After that, we evaluate the performance of different object detection frameworks [3, 11, 5]. All experimental results and conclusions are given subsequently.

Figure 1: DARKFACE dataset samples; the red boxes are ground truth.
Figure 2: The whole framework.

2.1 Low Light Image Enhancement

To enhance the dark illumination of low light images, we apply MSRCR [8], which achieves simultaneous dynamic range compression, color consistency, and lightness rendition. An enhanced image is shown in Figure 3. Besides, another data-driven brightness restoration method [6] is also used, which formulates light enhancement as image-specific curve estimation with a deep network. The brightness restoration result is shown in Figure 5.
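For illustration, below is a minimal sketch of the multiscale retinex core underlying MSRCR, using OpenCV and NumPy. It omits the color restoration (CR) and gain/offset steps of the full MSRCR pipeline, and the Gaussian scale choices are illustrative assumptions rather than the settings of [8].

```python
import cv2
import numpy as np

def multiscale_retinex(img: np.ndarray, sigmas=(15, 80, 250)) -> np.ndarray:
    """Multiscale retinex (MSR) core: average log-ratio over several Gaussian scales."""
    img = img.astype(np.float32) + 1.0  # avoid log(0)
    msr = np.zeros_like(img)
    for sigma in sigmas:
        # Illumination estimate at this scale via Gaussian smoothing.
        illumination = cv2.GaussianBlur(img, (0, 0), sigma)
        msr += np.log(img) - np.log(illumination)
    msr /= len(sigmas)
    # Stretch the result back to the displayable [0, 255] range.
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-8) * 255.0
    return msr.astype(np.uint8)
```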

Additionally, the saliency map $R_{saliency}$ [7], extracted from the enhanced low light image $R_{msrcr}$, is fused with $R_{msrcr}$ to suppress false negatives. The fusion result $R_{saliency\_enhanced}$ is obtained as:

$R_{saliency\_enhanced} = \alpha \cdot R_{saliency} + (1 - \alpha) \cdot R_{msrcr}$ (1)

where $\alpha$ is set to 0.3 in our work; the result is shown in Figure 4.
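As a concrete illustration of Eq. (1), the following sketch blends a precomputed saliency map into the MSRCR result. It assumes both inputs are already aligned and scaled to [0, 255], which is our reading of the pipeline rather than a detail stated in the report.

```python
import numpy as np

def fuse_saliency(r_msrcr: np.ndarray, r_saliency: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Eq. (1): R_saliency_enhanced = alpha * R_saliency + (1 - alpha) * R_msrcr."""
    # Broadcast a single-channel saliency map across the color channels.
    if r_saliency.ndim == 2 and r_msrcr.ndim == 3:
        r_saliency = r_saliency[..., None]
    fused = alpha * r_saliency.astype(np.float32) + (1.0 - alpha) * r_msrcr.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```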

2.2 Normal Image Domain Transfer

Different from using WIDERFACE and UFDD only as pre-training sets, we merge them with the pre-processed DARKFACE set to build a more robust detector. Considering the domain gap between the pre-processed DARKFACE samples and the normal images (WIDERFACE, UFDD), as stated in [14], we first transfer WIDERFACE and UFDD to a domain closer to that of the processed DARKFACE set. There are two ways to achieve this: the traditional one darkens the normal images, adds noise, and then processes them with MSRCR [8]; the result is shown in Figure 6. The other, like [14], uses a Pix2Pix network to synthesize noise; its result is shown in Figure 7. Based on the above low light enhancement and domain transfer methods, we obtain training samples from a closer domain, consisting of low light enhanced images and domain transferred normal images.
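A minimal sketch of the traditional transfer path is given below. The gamma value and Gaussian noise level are illustrative assumptions, not the settings used in our experiments, and multiscale_retinex refers to the MSR sketch from Section 2.1.

```python
import numpy as np

def darken_and_corrupt(img: np.ndarray, gamma: float = 3.0, noise_std: float = 10.0) -> np.ndarray:
    """Darken a normal image and add sensor-like noise before MSRCR re-enhancement."""
    # Gamma darkening: pushes most pixel values toward black.
    dark = 255.0 * (img.astype(np.float32) / 255.0) ** gamma
    # Additive Gaussian noise approximating low light sensor noise.
    noisy = dark + np.random.normal(0.0, noise_std, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Traditional transfer: darken + noise, then re-enhance (see the MSR sketch in Section 2.1).
# transferred = multiscale_retinex(darken_and_corrupt(normal_img))
```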

Figure 3: MSRCR method.
Figure 4: Saliency map enhanced.
Figure 5: ZeroDCE method.
Figure 6: Traditional transfer method.
Figure 7: HLAFace method.
Figure 8: Distribution of face size (width).

2.3 Detection methods

We build a low light face detector based on two-stage detection frameworks, including Cascade R-CNN [3] and DetectoRS [11]. We take Cascade R-CNN as an example to describe the details; the whole framework is shown in Figure 2.

Dataset Split Firstly, we split the DARKFACE set into several groups according to the number of faces in each image, then randomly choose 10% of the samples in each group as the validation part and the remaining 90% as the training part. We pre-process the DARKFACE samples with the methods described in Section 2.1, and add the WIDERFACE and UFDD datasets to our training set after pre-processing them with the methods described in Section 2.2.
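This split amounts to simple stratified sampling over face-count buckets; a sketch follows. The bucket boundaries are illustrative assumptions, since the report does not specify the group definitions.

```python
import random
from collections import defaultdict

def stratified_split(face_counts: dict, val_frac: float = 0.1, seed: int = 0):
    """face_counts maps image_id -> number of annotated faces."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for img_id, n in face_counts.items():
        # Bucket by face count; the boundaries here are assumptions.
        groups[min(n // 10, 5)].append(img_id)
    train, val = [], []
    for ids in groups.values():
        rng.shuffle(ids)
        k = max(1, int(len(ids) * val_frac))
        val.extend(ids[:k])
        train.extend(ids[k:])
    return train, val
```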

Training Strategy We apply multi-scale training, resizing samples to sizes ranging from [2160, 1440] to [4320, 2880], and apply random crops of size [1000, 800]. Additionally, we use the popular image augmentation tool [2] to process our training samples online, including random flips, random lightness, color jittering, and several filtering methods. We use the AdamW optimizer with an initial learning rate of 0.0001, decayed at epochs 27 and 33 out of 36 total epochs, with a weight decay of 0.05.
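The online augmentation can be expressed with Albumentations [2]; the concrete transforms and probabilities below are assumptions illustrating the categories mentioned above (flip, lightness, color jitter, filtering), not our exact pipeline.

```python
import albumentations as A

# Crop size [1000, 800] is read here as width x height; this is an assumption.
train_aug = A.Compose(
    [
        A.RandomCrop(height=800, width=1000),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),   # random lightness
        A.ColorJitter(p=0.3),                # color jittering
        A.OneOf([A.Blur(blur_limit=3), A.MedianBlur(blur_limit=3)], p=0.2),  # filtering
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```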

Model Refinement Feature representation is key to the object detection task, and the backbone largely determines the representation ability of the network. Thus, we adopt two powerful architecture series, Swin-Transformer [9] and ResNet [15], as our backbones. Besides, we adopt PAFPN [13] to replace the FPN in Cascade R-CNN. After analyzing the distribution of face sizes in the DARKFACE set, we notice that small faces dominate, as illustrated in Figure 8. We therefore set smaller anchors to capture more tiny faces. We also add attention modules such as GCNet [4] to the backbone to obtain more powerful representations, and adopt the RoI-align module to predict more precise bounding boxes.
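In an mmdetection-style config, the smaller anchors can be expressed as below; the exact scales and ratios are assumptions for illustration, not the values used in our experiments.

```python
# mmdetection-style RPN anchor snippet (a sketch; values are assumptions).
rpn_anchor_generator = dict(
    type='AnchorGenerator',
    scales=[2, 4, 8],          # smaller than the default scale of 8 to catch tiny faces
    ratios=[0.5, 1.0, 2.0],
    strides=[4, 8, 16, 32, 64])
```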

Model Ensemble Finally, we train Cascade R-CNN and DetectoRS with various backbones, such as Swin-large, Swin-base, and ResNet50, to obtain a more diverse set of detectors. The performance of these models trained on the whole set can be found in Table 1. We use Weighted Boxes Fusion (WBF) [12] and Test Time Augmentation (TTA) to combine the predictions of our detectors, and Soft-NMS [1] is employed before the model ensemble process.
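WBF is available as the weighted_boxes_fusion function of the ensemble-boxes package released by the authors of [12]. The sketch below fuses toy predictions from two detectors; box coordinates must be normalized to [0, 1] as the package requires, and the thresholds here are illustrative rather than our tuned values.

```python
from ensemble_boxes import weighted_boxes_fusion

# Toy predictions from two detectors; coordinates normalized to [0, 1].
boxes_list = [
    [[0.10, 0.10, 0.30, 0.30]],   # model 1
    [[0.12, 0.11, 0.31, 0.29]],   # model 2
]
scores_list = [[0.9], [0.8]]
labels_list = [[0], [0]]

fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1, 1],       # equal model weights (an assumption)
    iou_thr=0.55,         # IoU threshold for grouping boxes (illustrative)
    skip_box_thr=0.0001)  # drop near-zero-confidence boxes
```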

Table 1: Validation results. DetoRS denotes DetectoRS, CasR denotes Cascade R-CNN, GCR means GCNet is adopted in the RoI-align module, and Swin-b/Swin-l denote Swin-base/Swin-large.
Methods     Setting                             mAP
D-R         DetoRS (ResNet50)                   0.817
D-RG        DetoRS (ResNet50, GCR)              0.819
D-RGT       DetoRS (ResNet50, GCR, TTA)         0.821
CR-Sb       CasR (Swin-b)                       0.810
CR-SbG      CasR (Swin-b, GCR)                  0.813
CR-SbGP     CasR (Swin-b, GCR, PAFPN)           0.820
CR-SbGPT    CasR (Swin-b, GCR, PAFPN, TTA)      0.823
CR-SlGP     CasR (Swin-l, GCR, PAFPN)           0.830
CR-SlGPT    CasR (Swin-l, GCR, PAFPN, TTA)      0.835
WBF         D-RGT, CR-SbGPT, CR-SlGPT           0.843

3 Conclusion

In our submission to the (Semi-)supervised Face Detection in Low Light Conditions track of the UG2+ Challenge at CVPR 2021, we adopt two low light image enhancement methods to achieve brightness rendition. Besides, to obtain more training images, we transfer a large number of normal images (WIDERFACE and UFDD) to a domain closer to that of the brightness-rendered images. Finally, we employ several powerful detectors to localize face bounding boxes. In future work, we will explore end-to-end detection methods for this task.

References

  • [1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: Improving object detection with one line of code. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5562–5570, 2017.
  • [2] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin. Albumentations: Fast and flexible image augmentations. Information, 11(2):125, Feb 2020.
  • [3] Z. Cai and N. Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1483–1498, 2021.
  • [4] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond, 2019.
  • [5] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks, 2017.
  • [6] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong. Zero-reference deep curve estimation for low-light image enhancement. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1777–1786, 2020.
  • [7] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):194–201, 2012.
  • [8] D. Jobson, Z. Rahman, and G. Woodell. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image Processing, 6(7):965–976, 1997.
  • [9] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
  • [10] H. Nada, V. A. Sindagi, H. Zhang, and V. M. Patel. Pushing the limits of unconstrained face detection: a challenge dataset and baseline results, 2018.
  • [11] S. Qiao, L.-C. Chen, and A. Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution, 2020.
  • [12] R. Solovyev, W. Wang, and T. Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, 107:104117, 2021.
  • [13] M. Tan, R. Pang, and Q. V. Le. Efficientdet: Scalable and efficient object detection, 2020.
  • [14] W. Wang, W. Yang, and J. Liu. HLA-Face: Joint high-low adaptation for low light face detection, 2021.
  • [15] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks, 2017.
  • [16] S. Yang, P. Luo, C. C. Loy, and X. Tang. Wider face: A face detection benchmark. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5525–5533, 2016.
  • [17] W. Yang, Y. Yuan, W. Ren, J. Liu, W. J. Scheirer, Z. Wang, T. Zhang, Q. Zhong, D. Xie, S. Pu, Y. Zheng, Y. Qu, Y. Xie, L. Chen, Z. Li, C. Hong, H. Jiang, S. Yang, Y. Liu, X. Qu, P. Wan, S. Zheng, M. Zhong, T. Su, L. He, Y. Guo, Y. Zhao, Z. Zhu, J. Liang, J. Wang, T. Chen, Y. Quan, Y. Xu, B. Liu, X. Liu, Q. Sun, T. Lin, X. Li, F. Lu, L. Gu, S. Zhou, C. Cao, S. Zhang, C. Chi, C. Zhuang, Z. Lei, S. Z. Li, S. Wang, R. Liu, D. Yi, Z. Zuo, J. Chi, H. Wang, K. Wang, Y. Liu, X. Gao, Z. Chen, C. Guo, Y. Li, H. Zhong, J. Huang, H. Guo, J. Yang, W. Liao, J. Yang, L. Zhou, M. Feng, and L. Qin. Advancing image understanding in poor visibility environments: A collective benchmark study. IEEE Transactions on Image Processing, 29:5737–5752, 2020.