CBA: Contextual Background Attack against
Optical Aerial Detection in the Physical World
Abstract
Patch-based physical attacks have increasingly aroused concern. However, most existing methods focus on hiding targets captured on the ground, and some of these methods have simply been extended to deceive aerial detectors. They smear the targeted objects in the physical world with elaborated adversarial patches, which only slightly sway the aerial detectors' predictions and transfer poorly across models. To address these issues, we propose Contextual Background Attack (CBA), a novel physical attack framework against aerial detection that achieves strong attack efficacy and transferability in the physical world without smudging the objects of interest at all. Specifically, the targets of interest, i.e. the aircraft in aerial images, are adopted to mask the adversarial patches. The pixels outside the mask area are optimized so that the generated adversarial patches closely cover the critical contextual background area for detection, which gifts the adversarial patches with more robust and transferable attack potency in the real world. To further strengthen the attack performance, the adversarial patches are forced to be outside the targets during training, so that the detected objects of interest, both on and outside the patches, contribute to the accumulation of attack efficacy. Consequently, the elaborately designed patches are gifted with solid fooling efficacy against objects both on and outside the adversarial patches simultaneously. Extensive proportionally scaled experiments are performed in physical scenarios, demonstrating the superiority and potential of the proposed framework for physical attacks. We expect that the proposed physical attack method will serve as a benchmark for assessing the adversarial robustness of diverse aerial detectors and defense methods. The code has been released at https://github.com/JiaweiLian/CBA.
Index Terms:
Contextual background attack, aerial detection, physical world, adversarial patches, benchmark.
I Introduction
Deep neural networks (DNNs) have shown great potency over the past few years. However, some works [1, 2] have demonstrated that adversarial examples can easily deceive DNNs. When elaborated, human-imperceptible perturbations are added to a clean image [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], DNNs will produce a completely different, wrong prediction, which raises serious concerns for security-critical applications. Consequently, adversarial attacks have increasingly garnered attention, since they help to further understand the vulnerability and interpretability of DNNs by delving into negative examples. Moreover, the study of malicious examples also provides ideas and data for improving the adversarial robustness of DNNs.
Nowadays, aerial detection is indispensable and widely used in real scenarios, such as environmental surveillance [13], aerial search and rescue [14], surveying and mapping [15], etc. Unfortunately, aerial detectors also share this vulnerability to adversarial examples [16, 17, 18], as shown in Fig. 1. Nonetheless, most of the existing attacks [19, 17] are designed for the digital domain [12, 20, 21, 22]. The few physical attacks against aerial detection [16, 18] are directly derived from general real-scenario attack methods [23, 24, 25], which focus on hiding particular objects captured on the ground from being detected by placing malicious patches on the targeted objects, such as persons, traffic signs, and cars. In aerial detection, however, most targets are far smaller than those in natural images, so the detector can hardly capture the adversarial pattern of a tiny patch, while enlarging the patch causes severe occlusion issues, as shown in Fig. 2. Recently, some works [18, 16] have attempted to perform attacks with patches outside the target, but their attack efficacy is unsatisfactory and unstable.

To solve the above problems, we propose an innovative physical attack approach called Contextual Background Attack (CBA), which protects the targeted objects from being identified by using contextual background adversarial patches. Specifically, the shape of the protected targets, such as aircraft in aerial detection, is extracted to mask the protected object, yielding a contextual background patch embedded with the target of interest, in which the pixels of the background area are optimized iteratively during training. Moreover, we devise a novel training strategy in which the patches are placed outside the targets, so that the perceived targets, both in and outside the patches, are adopted to calculate the gradients and optimize the contextual adversarial patch. Given the shortage of a standardized benchmark to assess physical attacks against aerial detection, extensive experiments in both digital and physical domains are conducted to verify the effectiveness of the proposed CBA and evaluate the adversarial robustness of various aerial detectors.

In summary, our contributions are four-fold as follows:
• A brand-new Contextual Background Attack (CBA) framework is devised to deceive aerial detection methods in the physical world, which gifts the contextual background patches with SOTA attack performance in both white-box and black-box settings without smearing the protected targets at all.
• A novel training strategy is proposed to elaborate adversarial perturbations in the contextual background area. The generated contextual adversarial patches are masked by the objects of interest and can simultaneously hide objects both on and outside the adversarial patch from being recognized.
• To the best of our knowledge, we are the first to benchmark physical attacks against aerial detection, in which rigorous and exhaustive tests are conducted to evaluate the adversarial robustness of various aerial detectors, and the adversarial patch set is made public.
• Comprehensive proportionally scaled experiments are conducted in the physical world, demonstrating the substantial physical attack effect and transferability of the elaborated adversarial patches in the contextual background.
This work extends our previous conference version [26] in the following aspects. First, we provide more details and a deep analysis of our CBA. Second, comprehensive experiments are conducted in both white-box and black-box settings, and proxy models are extended from 4 to 20. Third, the attack efficacy is extensively validated in both the digital and physical domains. Finally, the physical robustness is thoroughly verified via exhaustive experiments in the physical world.
The rest of this article is organized as follows. Section II reviews the related work on physical adversarial attacks in detail. Next, Section III introduces the details of the proposed Contextual Background Attack framework (CBA) for generating contextual background adversarial patches against aerial detection. Then, Section IV verifies the effectiveness of the proposed CBA and demonstrates the advantages of the generated contextual background patches. Finally, Section V concludes the proposed CBA method and discusses potential future work concerning patch-based physical attacks.
II Related Work
In this part, the related works concerning digital attacks are briefly introduced at first. Subsequently, we review the typical physical attack methods and physical attacks against aerial detection in detail.
II-A Digital Attack
Digital attack methods can be categorized as optimization-based or gradient-based according to how the adversarial perturbations are crafted. Optimization-based methods, e.g. limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [1], DeepFool [9], and C&W [10], conduct attacks via box-constrained mechanisms. Gradient-based methods, e.g. the fast gradient sign method (FGSM) [2], iterative FGSM (I-FGSM) [27], momentum iterative FGSM (MI-FGSM) [7], and projected gradient descent (PGD) [11], design adversarial perturbations based on the gradient information of models. The approaches mentioned above are all performed under white-box conditions, i.e. the training data, victim model structure, and victim model parameters are available to the attacker. However, the imperceptible adversarial perturbations generated by digital attack algorithms are of little use for physical attacks, because imaging devices can barely capture such indistinguishable noise. Consequently, physical attack methods are progressively garnering attention.
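To make the gradient-based family concrete, below is a minimal sketch of a one-step FGSM attack in PyTorch; the classifier, labels, and step size ε are illustrative placeholders rather than settings from this paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """One-step FGSM: x_adv = x + epsilon * sign(grad_x L(model(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)  # loss w.r.t. the true labels
    loss.backward()
    # Take a single signed-gradient step, then clamp to the valid image range.
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
```

I-FGSM and PGD iterate this step with a smaller step size (PGD additionally projects back into an ε-ball), while MI-FGSM accumulates a momentum term over the gradients.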
II-B Physical Attack
II-B1 General Physical Attack Methods
Adversarial patches [28] are widely used for physical attacks, such as face recognition [29, 30, 31], object detection [32, 33, 24, 25, 34], autonomous driving [35, 36], etc. We review the related work according to application domains as follows:
Face recognition: Sharif et al. [29] developed a systematic method to generate attacks realized by printing a pair of eyeglass frames. In [31], the authors proposed another kind of adversarial patch: the Meaningful Adversarial Sticker, a physically feasible and stealthy attack method that uses actual stickers existing in our life. A framework dubbed PadvFace [30] was devised to precisely model the challenging physical variations. In addition, some other physical attacks [37, 38, 39, 40] have also been devised to deceive face recognition systems.
Object detection: Hu et al. [24] proposed a method to craft natural-looking adversarial patches by leveraging the learned image manifold of a GAN pretrained on real-world images. In [32], the authors proposed an evaluation framework for patch attacks against object detectors. [25] introduced an approach to generate a patch that can successfully hide a person from a person detector. To bridge the gap between digital and physical attacks, [33] exploited the entire 3D vehicle surface to propose a robust Full-coverage Camouflage Attack. In addition, some works [41, 42] focus on fooling thermal infrared pedestrian detection methods.
Autonomous driving: Cheng et al. [36] proposed an optimization-based method to generate stealthy physical-object-oriented adversarial patches to attack depth estimation. In [35], the authors realized the first physical backdoor attacks on lane detection systems, including two attack methodologies (poison-annotation and clean-annotation) to generate poisoned samples. Besides, [43, 24] also delve into camouflaging adversarial patches in the physical world.
II-B2 Physical Attacks against Aerial Detection
DNNs have been broadly adopted to process aerial imagery [44, 45, 46]. Consequently, delving into adversarial attacks against aerial detection paves a critical path to better explaining and improving model robustness. However, most adversarial attack methods [17, 19, 47, 48, 49] against aerial detection concentrate on the digital domain. In contrast, physical attacks against aerial detection remain scarce, even though they are more critical and practical. Du et al. [16] demonstrated one of the first efforts at physical adversarial attacks on aerial imagery, whereby malicious patches were optimized, fabricated, and installed on or near target objects to reduce the efficacy of an object detector applied to overhead images. In [18], a novel adaptive-patch-based physical attack (APPA) framework was proposed to generate adversarial patches adapted to both physical dynamics and varying scales. Furthermore, they devised a new loss to optimize the adversarial patch by entirely using the detection results, which can significantly accelerate the optimization process. However, the above attacks against aerial detection are derived from the aforementioned general physical attack methods, which are not aggressive enough and need to smear the targets.
III Methodology

In this section, we formulate the problem and then elaborate on the proposed CBA with aircraft as the targets of interest; CBA is expected to work similarly for other targets of interest. The overall pipeline is displayed in Fig. 3.
III-A Problem Formulation
Given a benign aerial image $x$, the attack purpose in aerial detection is to hide the specified targets from being detected by pasting elaborated adversarial patches on the clean image. Specifically, the adversarial example $x_{adv}$ with the adversarial patch $p$ can be defined as:

$$x_{adv} = (1 - M_{p}) \odot x + M_{p} \odot p \qquad (1)$$

where $M_{p}$ (the pixel values of the foreground are 1, the rest are 0) and $\odot$ represent the mask of the adversarial patches and the Hadamard product, respectively.
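As a minimal PyTorch sketch of this composition (the tensor names and shapes are our own illustration, not the authors' released code):

```python
import torch

def apply_patch(x, patch, mask):
    """Eq. (1): x_adv = (1 - M_p) * x + M_p * p (elementwise).

    x, patch, and mask are (C, H, W) tensors; mask is 1 on the patch
    foreground and 0 elsewhere, so only the masked pixels are replaced.
    """
    return (1 - mask) * x + mask * patch
```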
The previous approaches focus on optimizing all pixels of an adversarial patch by putting it on or outside objects, as shown in Fig. 2. In comparison, our method designs the contextual background adversarial patch embedded with a specified shape to match the protected target, as shown in Fig. 3, which can fool various aerial detectors (CNN-based and Transformer-based, One-stage and Two-stage, Anchor-based and Anchor-free) when the targeted object is placed on the contextual adversarial background in the physical world. In the following sections, we will systematically introduce how to elaborate the contextual adversarial background with aircraft as the protected objects, i.e. adversarial aircraft, by our proposed framework CBA.

III-B Adversarial Aircraft Elaboration
Based on the previous work [25, 18, 50], the following observations can be obtained:
• The bigger the adversarial patch, the stronger the attack efficacy;
• The closer the adversarial patch is to the targeted object, the stronger the attack efficacy;
• According to the attention maps shown in Fig. 4, the contextual background area plays a key role during detection.
Therefore, to achieve more robust attack efficacy, we propose to perform contextual background attacks on the aerial detection task, by which we can generate adversarial patches as big as the aircraft without taking up extra area. In addition, our method places the adversarial patches as close as possible to the targeted objects, with no need to smear or obscure them.
Technically, we first take a picture of the aircraft model on a black backdrop to get the original patch $p_{0}$. Secondly, a saliency detector [51] is adopted to extract the saliency map of the aircraft. Thirdly, the saliency map is binarized into an aircraft mask $M_{a}$. Consequently, the background mask $M_{b}$ of the adversarial patch is defined as:

$$M_{b} = 1 - M_{a} \qquad (2)$$
Finally, we formulate the adversarial aircraft $p_{e,i}$ as follows:

$$p_{e,i} = M_{a} \odot p_{0} + M_{b} \odot \delta_{e,i} \qquad (3)$$

where $\delta_{e,i}$ represents the optimized adversarial patch, and $e$ and $i$ are the indices of epochs and iterations during training, respectively. It can be found that only the background pixels are used to update the adversarial aircraft, as shown in Fig. 3.
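A minimal sketch of this composition (function and variable names are our own illustration): the aircraft pixels stay frozen while only the background tensor is exposed to the optimizer.

```python
import torch

def adversarial_aircraft(p0, mask_a, delta):
    """Compose the contextual patch of Eqs. (2)-(3).

    p0:     photo of the aircraft model on a black backdrop, (C, H, W)
    mask_a: binarized aircraft mask M_a (1 on the aircraft, 0 on background)
    delta:  trainable background pixels (created with requires_grad=True)
    """
    mask_b = 1 - mask_a                  # Eq. (2): M_b = 1 - M_a
    return mask_a * p0 + mask_b * delta  # Eq. (3): only M_b pixels receive gradients
```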
III-C Patch Adaption and Application
Since our proposed CBA aims to generate an adversarial patch with solid attack efficacy in physical scenarios, physical accommodations are adopted to simulate real-scenario dynamics, including different noises, adaptive scales, random rotations, and varying lighting. The above physical adaptations are bundled in the patch transformation function $\mathcal{T}$, similar to [25, 18], and the augmented patches are shown in the top-left part of Fig. 3.
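A hedged sketch of such a transformation function is given below; the noise level, brightness range, and rotation range are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torchvision.transforms.functional as TF

def transform_patch(patch, scale=1.0):
    """Randomly perturb a (C, H, W) patch to simulate physical dynamics."""
    # Additive noise approximates printing and imaging error.
    patch = (patch + 0.05 * torch.randn_like(patch)).clamp(0, 1)
    # Varying lighting, modeled as a random brightness factor.
    patch = TF.adjust_brightness(patch, 0.8 + 0.4 * torch.rand(1).item())
    # Random in-plane rotation: aerial targets have arbitrary headings.
    patch = TF.rotate(patch, angle=float(torch.empty(1).uniform_(-180, 180)))
    # Adaptive rescaling to the size dictated by the target in the scene.
    _, h, w = patch.shape
    new_size = [max(1, int(h * scale)), max(1, int(w * scale))]
    return TF.resize(patch, new_size, antialias=True)
```

All four operations are differentiable, so gradients can flow back through the augmentation to the patch pixels.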
Next, the elaborated adversarial patch must be pasted on the benign image with the proper size and location. Usually, during training, the patch should be placed where it would work in actual applications. Thus, we try to place the adversarial patch in an aircraft shape (the extracted background area, as shown in Fig. 3). Technically, we adopt an oriented aerial detector [52] to recognize the aircraft and its orientation. However, the extracted patch still cannot match the aircraft precisely, and severe occlusion exists due to intra-class differences between aircraft and varying view angles. Consequently, we propose a novel training strategy to overcome these difficulties: putting the adversarial aircraft patches outside the targets during training, as in the adversarial example shown in Fig. 3. In this way, the aircraft both in and outside the patches can be detected, as shown in Fig. 3, which significantly strengthens the attack efficacy for hiding targets both in and outside the adversarial patches.
Specifically, we place the adversarial patches outside the targets at the proper distance and size, adaptive to the position and size of the targets according to the ground truth $\mathcal{G} = (x_{t}, y_{t}, w_{t}, h_{t})$. The coordinates and side length of the square adversarial patch are calculated similarly to [18] as follows:

$$(x_{p},\ y_{p}) = (x_{t} + \alpha \cdot w_{t},\ y_{t} + \alpha \cdot h_{t}) \qquad (4)$$

$$l_{p} = \beta \cdot \sqrt{w_{t} \cdot h_{t}} \qquad (5)$$
where $\alpha$ and $\beta$ are coefficients for adaptively adjusting the patch distance and size. Then, we acquire the mask $M_{T}$ of the adversarial aircraft and applied patches by putting $p_{e,i}$ in the proper size and location according to $\mathcal{G}$ through the application function $T$:

$$M_{T} = T(p_{e,i},\ \mathcal{G}) \qquad (6)$$

Finally, we transform the initial formulation Eq. (1) into

$$x_{adv} = (1 - M_{T}) \odot x + M_{T} \odot T(p_{e,i},\ \mathcal{G}) \qquad (7)$$
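The sketch below illustrates this adaptive placement under our reconstruction of Eqs. (4)-(7); the offset rule, the default α and β, and the (x1, y1, x2, y2) box format are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def paste_outside_target(x, patch, bbox, alpha=0.3, beta=1.0):
    """Paste a square patch beside a target box without occluding it.

    x:    clean image (C, H, W); bbox: target box (x1, y1, x2, y2) in pixels.
    alpha and beta adaptively control the patch distance and size.
    """
    x = x.clone()
    x1, y1, x2, y2 = bbox
    w_t, h_t = x2 - x1, y2 - y1
    side = max(1, int(beta * max(w_t, h_t)))   # patch size tracks target size
    patch = F.interpolate(patch.unsqueeze(0), size=(side, side),
                          mode="bilinear", align_corners=False).squeeze(0)
    _, H, W = x.shape
    px = min(int(x2 + alpha * w_t), W - side)  # offset the patch beside the target
    py = min(int(y1), H - side)
    x[:, py:py + side, px:px + side] = patch   # overwrite only the patch region
    return x
```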
III-D Loss Design
In this paper, we aim to perform untargeted physical attacks, i.e. to hide the specified targets from being detected in physical scenarios. Therefore, the objective function comprises an adversarial objectiveness loss and a smoothness constraint.
Adversarial objectiveness: We thoroughly use the detected objects both in and outside the adversarial patches to optimize the patch, as shown in Fig. 3. Specifically, all the objectiveness scores from the detection results are taken into account, which is written as:

$$\mathcal{L}_{obj} = \frac{1}{N} \sum_{j=1}^{N} \mathcal{F}_{obj}(\mathcal{D}_{j}) \qquad (8)$$

where $N$ is the number of detected objects. Each detection result $\mathcal{D}_{j}$ usually contains the coordinates $(x, y, w, h)$, the objectiveness score $s_{obj}$, and the class probability $p_{cls}$ of the object, and $\mathcal{F}_{obj}$ represents extracting the objectiveness loss from $\mathcal{D}_{j}$. The adversarial objectiveness loss gifts the adversarial patch with attack efficacy during the optimization.
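A sketch of this extraction, assuming the detections arrive as an (N, 6) tensor whose rows hold (x, y, w, h, s_obj, p_cls); this layout is our assumption, and real detector heads differ.

```python
import torch

def objectness_loss(detections, conf_thresh=0.4):
    """Eq. (8): average objectiveness score over all detected objects."""
    scores = detections[:, 4]                  # column 4: objectiveness score s_obj
    scores = scores[scores > conf_thresh]      # keep confident detections only
    if scores.numel() == 0:
        return detections.new_zeros(())        # nothing detected: nothing to suppress
    return scores.mean()                       # minimizing this suppresses detection
```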
Smoothness constraint: For the physical attack, it is hard for detection systems to capture the gaps between adjacent pixels. Hence, total variation [29] is adopted to constrain the smoothness of the generated adversarial patch, which can be written as:

$$\mathcal{L}_{tv} = \sum_{m,n} \sqrt{(p_{m,n} - p_{m+1,n})^{2} + (p_{m,n} - p_{m,n+1})^{2}} \qquad (9)$$

where $p_{m,n}$ is the pixel value of the adversarial aircraft at coordinate $(m, n)$. Total variation is indispensable for its key role in maintaining the attack efficacy of the adversarial patch during the physical-digital transformation in physical attacks.
Finally, the overall loss is as follows:

$$\mathcal{L} = \mathcal{L}_{obj} + \lambda \cdot \mathcal{L}_{tv} \qquad (10)$$

where $\lambda$ is a balance parameter. The detailed discussion w.r.t. the selection of $\lambda$ will be presented in Sec. IV-D.
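Both terms translate directly into a few lines of PyTorch; the sketch below follows our reconstruction of Eqs. (9)-(10), with λ = 1.5 taken from the implementation details in Sec. IV-A.

```python
import torch

def total_variation(p, eps=1e-8):
    """Eq. (9): smoothness penalty for a (C, H, W) patch p."""
    dh = p[:, 1:, :-1] - p[:, :-1, :-1]        # vertical neighbor differences
    dw = p[:, :-1, 1:] - p[:, :-1, :-1]        # horizontal neighbor differences
    return torch.sqrt(dh ** 2 + dw ** 2 + eps).sum()  # eps avoids sqrt(0) gradients

def cba_loss(obj_loss, patch, lam=1.5):
    """Eq. (10): L = L_obj + lambda * L_tv."""
    return obj_loss + lam * total_variation(patch)
```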
III-E Overall Training Procedures
In this section, we choose aircraft as the targeted object to describe the overall training procedures of the proposed CBA as shown in Algorithm 1. The comprehensive explanation of the algorithm is given as follows:
1) Extract the saliency map of the protected object from the original patch $p_{0}$ to mask the protected aircraft;
2) We adopt the extracted masks $M_{a}$ and $M_{b}$ to formulate the adversarial aircraft $p_{e,i}$ as the contextual adversarial patch;
3) The adversarial aircraft are transformed by $\mathcal{T}$ to accommodate dynamic physical conditions and varying target sizes; then these patches are pasted on the clean image in the appropriate position and adaptive size by $T$;
4) The victim aerial detector takes the elaborated adversarial example $x_{adv}$ as input to make a prediction;
5) The objectiveness loss is extracted from the detection results by the function $\mathcal{F}_{obj}$;
6) We use the extracted objectiveness loss plus the total variation loss to calculate the gradients with respect to the contextual adversarial aircraft, which are adopted to optimize the pixel values $\delta_{e,i}$ of the contextual adversarial aircraft;
7) Finally, repeat steps 2 to 6 until the end of training (see the sketch below).
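This end-to-end sketch stitches together the illustrative helpers defined above (adversarial_aircraft, transform_patch, paste_outside_target, objectness_loss, total_variation); it is a hedged approximation of Algorithm 1, not the authors' released implementation.

```python
import torch

def train_cba(detector, loader, p0, mask_a, epochs=100, lr=0.03, lam=1.5):
    """Optimize the contextual background pixels (steps 2-7 of Algorithm 1)."""
    delta = torch.rand_like(p0).requires_grad_(True)  # trainable background pixels
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(epochs):
        for x, bbox in loader:                        # clean image + ground-truth box
            patch = adversarial_aircraft(p0, mask_a, delta)   # Eqs. (2)-(3)
            patch = transform_patch(patch)                    # physical adaptations T
            x_adv = paste_outside_target(x, patch, bbox)      # Eqs. (4)-(7)
            detections = detector(x_adv)                      # victim prediction
            loss = objectness_loss(detections) + lam * total_variation(delta)  # Eq. (10)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                delta.clamp_(0, 1)                    # keep pixel values printable
    return adversarial_aircraft(p0, mask_a, delta).detach()
```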
IV Experiments
TABLE I: DETAILED RESULTS OF WHITE-BOX ATTACKS.
Detector | Method | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 | T17 | T18 | Average
SSD[53] | APPA (on) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.45 | 0.00 | 0.00 | 0.22 | 0.40 | 0.22 | 0.94 | 0.40 | 0.84 | 0.31 | 0.65 | 0.246 |
APPA (outside) | 0.95 | 1.00 | 0.97 | 0.98 | 0.96 | 0.72 | 0.98 | 0.99 | 0.96 | 0.96 | 0.99 | 0.50 | 0.93 | 1.00 | 0.86 | 0.98 | 1.00 | 0.99 | 0.929 | |
Thys et al. | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.72 | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 0.41 | 0.00 | 0.83 | 0.21 | 0.82 | 0.179 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
Faster R-CNN[54] | APPA (on) | 0.00 | 0.90 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 0.99 | 0.99 | 0.00 | 0.66 | 1.00 | 0.55 | 0.72 | 0.27 | 1.00 | 0.00 | 0.99 | 0.503 |
APPA (outside) | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.81 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.66 | 1.00 | 1.00 | 1.00 | 0.968 | |
Thys et al. | 0.00 | 0.99 | 0.97 | 0.21 | 0.37 | 0.00 | 1.00 | 1.00 | 0.98 | 0.76 | 0.95 | 1.00 | 0.98 | 0.36 | 0.96 | 1.00 | 1.00 | 1.00 | 0.752 | |
Ours | 0.27 | 0.33 | 0.00 | 0.93 | 0.99 | 0.00 | 0.00 | 0.00 | 0.00 | 0.32 | 0.98 | 0.00 | 0.00 | 0.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.264 | |
Swin Transformer[55] | APPA (on) | 0.98 | 1.00 | 0.93 | 0.25 | 1.00 | 0.74 | 0.99 | 1.00 | 1.00 | 0.84 | 0.94 | 0.97 | 0.97 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.867 |
APPA (outside) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.889 | |
Thys et al. | 0.97 | 1.00 | 1.00 | 0.95 | 1.00 | 0.86 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.91 | 0.99 | 1.00 | 1.00 | 0.982 | |
Ours | 0.00 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.022 | |
YOLOv2[56] | APPA (on) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.69 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.038 |
APPA (outside) | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.96 | 1.00 | 0.99 | 1.00 | 0.93 | 0.99 | 0.771 | |
Thys et al. | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 | 0.73 | 0.00 | 0.00 | 0.096 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
YOLOv3[57] | APPA (on) | 0.88 | 0.87 | 0.84 | 0.88 | 0.91 | 0.37 | 0.87 | 0.87 | 0.90 | 0.83 | 0.89 | 0.87 | 0.86 | 0.90 | 0.86 | 0.88 | 0.92 | 0.91 | 0.851 |
APPA (outside) | 0.90 | 0.85 | 0.84 | 0.70 | 0.88 | 0.00 | 0.22 | 0.85 | 0.80 | 0.85 | 0.89 | 0.88 | 0.85 | 0.88 | 0.87 | 0.88 | 0.90 | 0.89 | 0.774 | |
Thys et al. | 0.86 | 0.88 | 0.79 | 0.86 | 0.90 | 0.43 | 0.80 | 0.89 | 0.87 | 0.83 | 0.82 | 0.87 | 0.36 | 0.89 | 0.76 | 0.89 | 0.88 | 0.91 | 0.805 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.014 | |
YOLOv5n[58] | APPA (on) | 0.00 | 0.81 | 0.62 | 0.62 | 0.88 | 0.00 | 0.82 | 0.85 | 0.85 | 0.82 | 0.87 | 0.84 | 0.77 | 0.76 | 0.61 | 0.83 | 0.86 | 0.86 | 0.704 |
APPA (outside) | 0.84 | 0.88 | 0.79 | 0.84 | 0.86 | 0.00 | 0.80 | 0.85 | 0.86 | 0.76 | 0.90 | 0.77 | 0.87 | 0.86 | 0.33 | 0.80 | 0.87 | 0.88 | 0.764 | |
Thys et al. | 0.82 | 0.77 | 0.75 | 0.50 | 0.83 | 0.00 | 0.84 | 0.84 | 0.87 | 0.73 | 0.80 | 0.86 | 0.82 | 0.82 | 0.60 | 0.83 | 0.87 | 0.87 | 0.746 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
Cascade R-CNN [59] | APPA (on) | 1.00 | 0.99 | 1.00 | 0.96 | 0.99 | 0.00 | 1.00 | 1.00 | 1.00 | 0.62 | 1.00 | 1.00 | 0.98 | 1.00 | 0.85 | 1.00 | 1.00 | 1.00 | 0.911 |
APPA (outside) | 1.00 | 1.00 | 0.99 | 0.95 | 1.00 | 0.00 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.92 | 0.98 | 1.00 | 1.00 | 0.934 | |
Thys et al. | 1.00 | 1.00 | 1.00 | 0.89 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 0.85 | 1.00 | 1.00 | 0.99 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 0.873 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
RetinaNet [60] | APPA (on) | 0.44 | 0.77 | 0.86 | 0.00 | 0.00 | 0.00 | 0.52 | 0.95 | 0.95 | 0.44 | 0.62 | 0.96 | 0.59 | 0.71 | 0.32 | 0.89 | 0.99 | 0.84 | 0.603 |
APPA (outside) | 0.93 | 0.95 | 0.86 | 0.80 | 0.92 | 0.57 | 0.93 | 0.96 | 0.97 | 0.80 | 0.99 | 0.94 | 1.00 | 0.99 | 0.82 | 0.79 | 0.99 | 0.97 | 0.899 | |
Thys et al. | 0.40 | 0.86 | 0.97 | 0.91 | 0.00 | 0.00 | 0.87 | 0.92 | 0.91 | 0.77 | 0.99 | 0.99 | 0.95 | 0.98 | 0.93 | 0.95 | 0.89 | 0.84 | 0.785 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
Mask R-CNN [61] | APPA (on) | 0.73 | 0.94 | 0.21 | 0.90 | 0.71 | 0.43 | 0.94 | 0.98 | 0.99 | 0.90 | 0.82 | 0.98 | 0.70 | 0.96 | 0.95 | 0.99 | 0.99 | 0.98 | 0.839 |
APPA (outside) | 0.73 | 0.51 | 0.61 | 0.73 | 0.86 | 0.00 | 0.53 | 0.27 | 0.91 | 0.92 | 0.98 | 0.94 | 0.96 | 0.97 | 0.93 | 0.86 | 0.67 | 0.99 | 0.743 | |
Thys et al. | 0.71 | 0.97 | 0.97 | 0.86 | 0.92 | 0.45 | 0.95 | 0.99 | 0.99 | 0.89 | 0.74 | 0.99 | 0.33 | 0.97 | 0.96 | 0.98 | 0.94 | 0.99 | 0.867 | |
Ours | 0.28 | 0.53 | 0.00 | 0.42 | 0.48 | 0.00 | 0.00 | 0.00 | 0.00 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.35 | 0.126 | |
FreeAnchor [62] | APPA (on) | 1.00 | 0.91 | 1.00 | 0.82 | 0.94 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.43 | 0.99 | 1.00 | 1.00 | 0.894 |
APPA (outside) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.93 | 1.00 | 1.00 | 1.00 | 0.941 | |
Thys et al. | 0.22 | 0.89 | 1.00 | 0.47 | 0.91 | 0.00 | 1.00 | 1.00 | 1.00 | 0.97 | 1.00 | 1.00 | 0.98 | 1.00 | 0.49 | 0.99 | 0.99 | 0.97 | 0.827 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
FSAF [63] | APPA (on) | 0.35 | 0.51 | 0.77 | 0.21 | 0.31 | 0.00 | 0.76 | 0.86 | 0.76 | 0.49 | 0.51 | 0.78 | 0.68 | 0.60 | 0.67 | 0.84 | 0.61 | 0.88 | 0.588 |
APPA (outside) | 0.86 | 0.81 | 0.77 | 0.61 | 0.87 | 0.38 | 0.49 | 0.69 | 0.78 | 0.86 | 0.83 | 0.84 | 0.82 | 0.77 | 0.30 | 0.76 | 0.78 | 0.80 | 0.723 | |
Thys et al. | 0.32 | 0.46 | 0.72 | 0.43 | 0.47 | 0.00 | 0.63 | 0.68 | 0.71 | 0.28 | 0.56 | 0.83 | 0.62 | 0.42 | 0.46 | 0.74 | 0.57 | 0.61 | 0.528 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
RepPoints [64] | APPA (on) | 0.00 | 0.00 | 0.50 | 0.00 | 0.35 | 0.00 | 0.61 | 0.88 | 0.35 | 0.00 | 0.38 | 0.90 | 0.38 | 0.34 | 0.46 | 0.63 | 0.00 | 0.45 | 0.346 |
APPA (outside) | 0.56 | 0.81 | 0.73 | 0.57 | 0.53 | 0.00 | 0.54 | 0.71 | 0.78 | 0.75 | 0.89 | 0.84 | 0.71 | 0.63 | 0.73 | 0.85 | 0.72 | 0.89 | 0.680 | |
Thys et al. | 0.00 | 0.23 | 0.57 | 0.23 | 0.22 | 0.00 | 0.23 | 0.87 | 0.49 | 0.24 | 0.50 | 0.89 | 0.43 | 0.46 | 0.21 | 0.77 | 0.52 | 0.52 | 0.410 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
TOOD [65] | APPA (on) | 0.00 | 0.00 | 0.00 | 0.00 | 0.30 | 0.00 | 0.36 | 0.44 | 0.00 | 0.00 | 0.00 | 0.85 | 0.44 | 0.46 | 0.00 | 0.00 | 0.00 | 0.38 | 0.179 |
APPA (outside) | 0.71 | 0.78 | 0.84 | 0.70 | 0.85 | 0.28 | 0.51 | 0.82 | 0.63 | 0.42 | 0.57 | 0.87 | 0.59 | 0.72 | 0.87 | 0.81 | 0.83 | 0.87 | 0.704 | |
Thys et al. | 0.00 | 0.20 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 0.30 | 0.43 | 0.00 | 0.00 | 0.69 | 0.38 | 0.24 | 0.59 | 0.21 | 0.20 | 0.40 | 0.214 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
ATSS [66] | APPA (on) | 0.29 | 0.57 | 0.60 | 0.38 | 0.59 | 0.00 | 0.61 | 0.70 | 0.66 | 0.00 | 0.22 | 0.67 | 0.37 | 0.43 | 0.49 | 0.60 | 0.32 | 0.59 | 0.449 |
APPA (outside) | 0.62 | 0.56 | 0.73 | 0.53 | 0.56 | 0.00 | 0.72 | 0.75 | 0.76 | 0.47 | 0.53 | 0.74 | 0.51 | 0.57 | 0.64 | 0.75 | 0.68 | 0.74 | 0.603 | |
Thys et al. | 0.45 | 0.67 | 0.74 | 0.54 | 0.56 | 0.00 | 0.64 | 0.68 | 0.73 | 0.34 | 0.40 | 0.63 | 0.24 | 0.57 | 0.35 | 0.55 | 0.49 | 0.51 | 0.505 | |
Ours | 0.00 | 0.26 | 0.00 | 0.00 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 0.00 | 0.00 | 0.22 | 0.28 | 0.074 | |
FoveaBox [67] | APPA (on) | 0.86 | 0.95 | 0.83 | 0.89 | 0.84 | 0.00 | 0.78 | 0.82 | 0.73 | 0.56 | 0.93 | 0.86 | 0.76 | 0.94 | 0.89 | 0.81 | 0.58 | 0.91 | 0.774 |
APPA (outside) | 0.82 | 0.95 | 0.78 | 0.81 | 0.90 | 0.00 | 0.49 | 0.85 | 0.83 | 0.85 | 0.89 | 0.80 | 0.88 | 0.91 | 0.82 | 0.86 | 0.73 | 0.89 | 0.781 | |
Thys et al. | 0.70 | 0.92 | 0.92 | 0.74 | 0.87 | 0.00 | 0.81 | 0.88 | 0.85 | 0.00 | 0.47 | 0.94 | 0.00 | 0.71 | 0.53 | 0.82 | 0.93 | 0.87 | 0.664 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
VarifocalNet [68] | APPA (on) | 0.00 | 0.76 | 0.89 | 0.43 | 0.78 | 0.44 | 0.89 | 0.92 | 0.94 | 0.22 | 0.61 | 0.92 | 0.59 | 0.81 | 0.86 | 0.90 | 0.74 | 0.92 | 0.701 |
APPA (outside) | 0.79 | 0.87 | 0.91 | 0.82 | 0.84 | 0.77 | 0.93 | 0.88 | 0.92 | 0.67 | 0.78 | 0.89 | 0.83 | 0.90 | 0.83 | 0.88 | 0.85 | 0.88 | 0.847 | |
Thys et al. | 0.23 | 0.58 | 0.89 | 0.22 | 0.25 | 0.00 | 0.23 | 0.90 | 0.83 | 0.30 | 0.00 | 0.94 | 0.33 | 0.53 | 0.63 | 0.73 | 0.49 | 0.87 | 0.497 | |
Ours | 0.66 | 0.87 | 0.59 | 0.46 | 0.41 | 0.00 | 0.00 | 0.24 | 0.42 | 0.62 | 0.71 | 0.42 | 0.52 | 0.67 | 0.00 | 0.42 | 0.46 | 0.41 | 0.438 |
• Strongest attack results are highlighted in bold.
• "on" and "outside" represent patches on and outside targets, respectively.
TABLE II: DETAILED RESULTS OF BLACK-BOX ATTACKS.
Detector | Method | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 | T17 | T18 | Average
SSD[53] | APPA (on) | 0.00 | 0.00 | 0.90 | 0.99 | 0.82 | 0.43 | 0.79 | 0.98 | 0.93 | 0.83 | 0.99 | 0.98 | 0.76 | 0.99 | 0.50 | 0.96 | 0.77 | 0.95 | 0.754 |
APPA (outside) | 0.91 | 0.99 | 0.63 | 0.94 | 0.99 | 0.48 | 0.94 | 0.91 | 0.95 | 0.97 | 1.00 | 0.97 | 0.94 | 1.00 | 0.94 | 0.95 | 1.00 | 1.00 | 0.917 | |
Thys et al. | 0.00 | 0.00 | 0.34 | 0.00 | 0.00 | 0.00 | 0.96 | 0.93 | 0.76 | 0.00 | 0.67 | 0.98 | 0.44 | 0.98 | 0.86 | 0.94 | 0.53 | 0.95 | 0.519 | |
Ours | 0.00 | 0.00 | 0.00 | 0.37 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.80 | 0.00 | 0.00 | 0.00 | 0.00 | 0.082 | |
Faster R-CNN[54] | APPA (on) | 0.00 | 0.77 | 0.00 | 0.00 | 0.82 | 0.00 | 0.99 | 1.00 | 0.98 | 0.44 | 0.92 | 1.00 | 0.94 | 0.25 | 0.33 | 0.99 | 0.54 | 0.99 | 0.609 |
APPA (outside) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.91 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.98 | 1.00 | 1.00 | 1.00 | 0.993 | |
Thys et al. | 0.99 | 0.94 | 0.99 | 0.89 | 0.78 | 0.00 | 1.00 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.96 | 0.63 | 0.98 | 1.00 | 0.99 | 0.99 | 0.893 | |
Ours | 0.00 | 0.24 | 0.00 | 0.51 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 0.80 | 0.00 | 0.00 | 0.00 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 0.128 | |
Swin Transformer[55] | APPA (on) | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 1.00 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.993 |
APPA (outside) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.56 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 0.975 | |
Thys et al. | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.42 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 0.967 | |
Ours | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.029 | |
YOLOv2[56] | APPA (on) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.73 | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 | 0.00 | 0.096 |
APPA (outside) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 0.98 | 1.00 | 0.82 | 0.98 | 0.88 | 0.99 | 0.868 | |
Thys et al. | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.99 | 0.99 | 1.00 | 0.00 | 0.00 | 0.97 | 0.00 | 0.98 | 0.00 | 0.97 | 0.00 | 0.75 | 0.425 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
YOLOv3[57] | APPA (on) | 0.88 | 0.87 | 0.72 | 0.88 | 0.90 | 0.00 | 0.87 | 0.88 | 0.89 | 0.74 | 0.87 | 0.86 | 0.84 | 0.91 | 0.80 | 0.90 | 0.91 | 0.91 | 0.813 |
APPA (outside) | 0.87 | 0.89 | 0.83 | 0.82 | 0.90 | 0.71 | 0.84 | 0.87 | 0.88 | 0.86 | 0.91 | 0.91 | 0.88 | 0.91 | 0.88 | 0.89 | 0.89 | 0.90 | 0.869 | |
Thys et al. | 0.72 | 0.50 | 0.78 | 0.85 | 0.90 | 0.00 | 0.86 | 0.89 | 0.88 | 0.82 | 0.66 | 0.89 | 0.56 | 0.89 | 0.82 | 0.90 | 0.91 | 0.92 | 0.764 | |
Ours | 0.00 | 0.67 | 0.00 | 0.74 | 0.87 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.48 | 0.88 | 0.00 | 0.00 | 0.00 | 0.00 | 0.202 | |
YOLOv5l[58] | APPA (on) | 0.90 | 0.91 | 0.89 | 0.83 | 0.91 | 0.00 | 0.89 | 0.90 | 0.91 | 0.67 | 0.90 | 0.90 | 0.89 | 0.92 | 0.82 | 0.89 | 0.91 | 0.90 | 0.830 |
APPA (outside) | 0.90 | 0.92 | 0.90 | 0.89 | 0.89 | 0.75 | 0.87 | 0.88 | 0.89 | 0.82 | 0.93 | 0.91 | 0.93 | 0.92 | 0.84 | 0.89 | 0.91 | 0.89 | 0.885 | |
Thys et al. | 0.85 | 0.89 | 0.89 | 0.88 | 0.89 | 0.22 | 0.90 | 0.92 | 0.89 | 0.87 | 0.90 | 0.90 | 0.90 | 0.91 | 0.88 | 0.90 | 0.91 | 0.90 | 0.856 | |
Ours | 0.00 | 0.89 | 0.00 | 0.52 | 0.82 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.88 | 0.00 | 0.00 | 0.45 | 0.00 | 0.198 | |
Cascade R-CNN [59] | APPA (on) | 0.98 | 1.00 | 1.00 | 0.99 | 1.00 | 0.97 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.99 | 0.99 | 1.00 | 0.884 |
APPA (outside) | 1.00 | 1.00 | 0.99 | 0.95 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.36 | 1.00 | 1.00 | 1.00 | 0.960 | |
Thys et al. | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.71 | 1.00 | 1.00 | 1.00 | 0.927 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
RetinaNet [60] | APPA (on) | 0.00 | 0.83 | 0.97 | 0.95 | 0.98 | 0.49 | 0.93 | 0.93 | 0.97 | 0.00 | 0.93 | 0.99 | 0.85 | 0.98 | 0.80 | 0.94 | 0.98 | 0.95 | 0.804 |
APPA (outside) | 0.98 | 1.00 | 0.91 | 0.98 | 0.97 | 0.85 | 0.97 | 0.97 | 0.98 | 0.99 | 1.00 | 0.96 | 0.96 | 0.99 | 0.93 | 0.93 | 0.98 | 0.98 | 0.963 | |
Thys et al. | 0.28 | 0.89 | 0.96 | 0.77 | 0.59 | 0.00 | 0.96 | 0.97 | 0.96 | 0.95 | 0.83 | 0.99 | 0.78 | 0.98 | 0.90 | 0.98 | 0.95 | 0.99 | 0.818 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
Mask R-CNN [61] | APPA (on) | 0.55 | 0.80 | 0.23 | 0.79 | 0.40 | 0.00 | 0.86 | 0.97 | 0.92 | 0.39 | 0.37 | 0.99 | 0.57 | 0.50 | 0.57 | 0.97 | 0.00 | 0.97 | 0.603 |
APPA (outside) | 0.48 | 0.99 | 0.84 | 0.97 | 0.98 | 0.76 | 0.74 | 0.87 | 0.93 | 0.99 | 0.98 | 0.54 | 0.98 | 0.98 | 0.70 | 0.98 | 0.69 | 0.99 | 0.855 | |
Thys et al. | 0.98 | 0.98 | 0.94 | 0.91 | 0.56 | 0.00 | 0.96 | 0.98 | 0.96 | 0.97 | 0.97 | 0.99 | 0.90 | 0.94 | 0.92 | 0.98 | 0.85 | 0.98 | 0.876 | |
Ours | 0.55 | 0.44 | 0.00 | 0.72 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.85 | 0.00 | 0.00 | 0.00 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.208 | |
FreeAnchor [62] | APPA (on) | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.85 | 1.00 | 1.00 | 0.96 | 1.00 | 0.88 | 1.00 | 0.92 | 0.99 | 0.921 |
APPA (outside) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.87 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 | 1.00 | 1.00 | 1.00 | 0.989 | |
Thys et al. | 1.00 | 0.99 | 1.00 | 0.94 | 0.73 | 0.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.85 | 0.99 | 0.99 | 1.00 | 0.98 | 1.00 | 0.914 | |
Ours | 0.00 | 0.00 | 0.00 | 0.40 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.034 | |
FSAF [63] | APPA (on) | 0.00 | 0.56 | 0.61 | 0.00 | 0.43 | 0.00 | 0.76 | 0.84 | 0.69 | 0.00 | 0.40 | 0.81 | 0.60 | 0.32 | 0.48 | 0.74 | 0.65 | 0.76 | 0.481 |
APPA (outside) | 0.86 | 0.85 | 0.67 | 0.76 | 0.85 | 0.80 | 0.82 | 0.85 | 0.73 | 0.86 | 0.92 | 0.86 | 0.86 | 0.74 | 0.62 | 0.88 | 0.88 | 0.86 | 0.815 | |
Thys et al. | 0.82 | 0.70 | 0.72 | 0.46 | 0.43 | 0.00 | 0.67 | 0.83 | 0.71 | 0.63 | 0.69 | 0.93 | 0.75 | 0.51 | 0.64 | 0.82 | 0.64 | 0.80 | 0.653 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
RepPoints [64] | APPA (on) | 0.00 | 0.00 | 0.64 | 0.00 | 0.35 | 0.00 | 0.79 | 0.90 | 0.58 | 0.00 | 0.27 | 0.87 | 0.31 | 0.00 | 0.32 | 0.73 | 0.26 | 0.80 | 0.379 |
APPA (outside) | 0.87 | 0.90 | 0.84 | 0.86 | 0.87 | 0.61 | 0.85 | 0.86 | 0.79 | 0.87 | 0.90 | 0.83 | 0.89 | 0.89 | 0.44 | 0.89 | 0.24 | 0.83 | 0.791 | |
Thys et al. | 0.00 | 0.38 | 0.67 | 0.31 | 0.68 | 0.00 | 0.77 | 0.86 | 0.75 | 0.45 | 0.75 | 0.94 | 0.50 | 0.43 | 0.65 | 0.86 | 0.26 | 0.71 | 0.554 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
TOOD [65] | APPA (on) | 0.00 | 0.39 | 0.24 | 0.42 | 0.38 | 0.00 | 0.60 | 0.50 | 0.41 | 0.00 | 0.49 | 0.79 | 0.44 | 0.36 | 0.35 | 0.36 | 0.27 | 0.45 | 0.358 |
APPA (outside) | 0.74 | 0.88 | 0.78 | 0.73 | 0.70 | 0.70 | 0.77 | 0.79 | 0.81 | 0.78 | 0.68 | 0.68 | 0.44 | 0.67 | 0.76 | 0.79 | 0.50 | 0.76 | 0.720 | |
Thys et al. | 0.57 | 0.69 | 0.73 | 0.66 | 0.56 | 0.00 | 0.68 | 0.60 | 0.39 | 0.34 | 0.53 | 0.81 | 0.36 | 0.41 | 0.63 | 0.79 | 0.40 | 0.47 | 0.534 | |
Ours | 0.00 | 0.25 | 0.00 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.34 | 0.00 | 0.00 | 0.22 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 0.082 | |
ATSS [66] | APPA (on) | 0.35 | 0.67 | 0.63 | 0.37 | 0.42 | 0.00 | 0.61 | 0.71 | 0.62 | 0.00 | 0.25 | 0.69 | 0.39 | 0.64 | 0.29 | 0.56 | 0.41 | 0.51 | 0.451 |
APPA (outside) | 0.61 | 0.76 | 0.72 | 0.58 | 0.57 | 0.71 | 0.70 | 0.75 | 0.70 | 0.58 | 0.54 | 0.72 | 0.53 | 0.66 | 0.61 | 0.73 | 0.56 | 0.62 | 0.647 | |
Thys et al. | 0.41 | 0.60 | 0.67 | 0.54 | 0.56 | 0.00 | 0.63 | 0.67 | 0.65 | 0.49 | 0.42 | 0.76 | 0.44 | 0.62 | 0.59 | 0.67 | 0.50 | 0.61 | 0.546 | |
Ours | 0.00 | 0.00 | 0.42 | 0.00 | 0.00 | 0.00 | 0.30 | 0.42 | 0.00 | 0.00 | 0.00 | 0.46 | 0.36 | 0.00 | 0.26 | 0.43 | 0.00 | 0.00 | 0.147 | |
FoveaBox [67] | APPA (on) | 0.38 | 0.93 | 0.84 | 0.54 | 0.90 | 0.00 | 0.89 | 0.89 | 0.75 | 0.00 | 0.80 | 0.92 | 0.28 | 0.75 | 0.48 | 0.91 | 0.83 | 0.95 | 0.669 |
APPA (outside) | 0.93 | 0.94 | 0.81 | 0.95 | 0.92 | 0.86 | 0.83 | 0.83 | 0.84 | 0.94 | 0.95 | 0.82 | 0.93 | 0.93 | 0.72 | 0.84 | 0.92 | 0.90 | 0.881 | |
Thys et al. | 0.92 | 0.92 | 0.91 | 0.93 | 0.95 | 0.00 | 0.89 | 0.88 | 0.81 | 0.82 | 0.81 | 0.93 | 0.83 | 0.85 | 0.53 | 0.88 | 0.87 | 0.89 | 0.812 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | |
VarifocalNet [68] | APPA (on) | 0.55 | 0.84 | 0.83 | 0.65 | 0.73 | 0.24 | 0.89 | 0.90 | 0.88 | 0.30 | 0.34 | 0.90 | 0.81 | 0.68 | 0.58 | 0.82 | 0.56 | 0.90 | 0.689 |
APPA (outside) | 0.88 | 0.93 | 0.89 | 0.86 | 0.85 | 0.90 | 0.91 | 0.91 | 0.91 | 0.89 | 0.88 | 0.93 | 0.88 | 0.87 | 0.87 | 0.90 | 0.77 | 0.85 | 0.882 | |
Thys et al. | 0.76 | 0.85 | 0.89 | 0.83 | 0.88 | 0.00 | 0.90 | 0.92 | 0.91 | 0.79 | 0.73 | 0.94 | 0.72 | 0.66 | 0.91 | 0.92 | 0.81 | 0.87 | 0.794 | |
Ours | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 |
• Strongest attack results are highlighted in bold.
• "on" and "outside" represent patches on and outside targets, respectively.
• The proxy model is YOLOv5n in the black-box setting.
In this section, we perform comprehensive experiments to validate the effectiveness of the proposed contextual background attack framework. We first outline the experimental settings and then separately report the results of the digital and physical attacks. Subsequently, we describe the experimental results of the ablation study on total variation. Finally, we give some possible explanations for the unexpected experimental results. Video demos of our proposed CBA physical attack method have been released (https://www.youtube.com/watch?v=wng9LZbQeJA and https://www.youtube.com/shorts/BlBlCNEi_I4).
IV-A Experimental Settings
Datasets: Two well-known large-scale datasets, i.e. DOTA [69] and RSOD (https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset-), are adopted to train the aerial detectors and the adversarial patches, respectively, in the experiments.
Target models: Dozens of mainstream detectors are adopted to validate the effect of the proposed framework, including YOLOv2 [56], YOLOv3 [57], YOLOv5 [58], SSD [53], Faster R-CNN [54], Swin Transformer [55], Cascade R-CNN [59], RetinaNet [60], Mask R-CNN [61], FoveaBox [67], FreeAnchor [62], FSAF [63], RepPoints [64], TOOD [65], ATSS [66], and VarifocalNet [68].
Compared methods: Two SOTA methods are selected for comparison: the adversarial patches proposed by Thys et al. [25] and APPA [18]. Note that APPA is conducted to generate adversarial patches both on and outside the targets of interest, denoted as 'APPA (on)' and 'APPA (outside)', respectively. The average precision (AP) is adopted as the quantitative detection metric.
Implementation details: During training, $\lambda$ is set to 1.5 to balance the objectiveness loss and the total variation. The learning rate starts from 0.03, and the thresholds for the intersection over union (IoU) and the objectiveness confidence are set to 0.45 and 0.4, respectively. All the code is implemented in PyTorch on RTX 3090 (24 GB) GPUs, and a Color LaserJet Pro MFP M479dw printer is adopted to print all the adversarial patches generated by the different methods.
IV-B Attack in Digital Domain
For our proposed CBA, the digital attacks are conducted in the same settings as training, i.e. the adversarial patches are pasted outside the targets detected by the aerial detectors in the proper size and positions. The contextual background patches generated by our CBA are displayed in Fig. 5. When trained on different versions of YOLOv5, CBA generates adversarial patches with similar pattern styles. The patches trained on YOLOv3 and YOLOv5 are also somewhat similar, indicating that detectors with similar structures may yield analogous adversarial patches. However, the generated patches vary greatly when trained on detectors with different structures.
Ten aerial detectors are chosen for the quantitative evaluation. We set the detection results on the clean images as the ground truth to calculate the AP, i.e. the AP on the clean dataset is 100%, so that targets the original detector fails to detect are not counted as successful attacks. The experimental results are shown in Fig. 6. It is observed that:
• For the attackers, our proposed CBA achieves the best attack performance in both white-box and black-box settings for quite a few cases, even though the adversarial patch is placed outside the targets of interest and part of the patch area is sacrificed for the physical attack;
• For the detectors, YOLOv2 is the easiest to attack with patches on targets, even in the black-box setting. Various versions of YOLOv5 are robust in diverse attack settings, while Faster R-CNN and SSD are easier to attack. In general, Swin Transformer is the most robust detector.
IV-C Attack in Physical World
IV-C1 Overall experimental results
In this paper, the proposed framework is mainly designed to conduct physical attacks, so extensive and rigorous proportionally scaled experiments are performed to verify the effectiveness of the proposed physical attack framework CBA.
In this section, 1:400 proportionally scaled experiments are conducted to verify the attack performance in the physical world. Specifically, we train 20 mainstream aerial detectors as victim models and record the average confidence of 18 aircraft (with the threshold set to 0.2, i.e. a detected object is ignored if its detection confidence is lower than 0.2) to compare the physical attack efficacy. The experimental results are shown in Fig. 7. It is observed that:
• Our CBA causes a considerable number of detectors to fail to detect any targets, i.e. the values are 0 (highlighted in bold), which barely happens with other methods;
• Our CBA transfers its attack efficacy well between different detectors, even for some hard-to-attack detectors (e.g. various versions of YOLOv5), which are only slightly swayed by other methods;
• In contrast, the different forms of YOLOv5 remain the toughest detectors to attack for all other approaches, while our elaborated adversarial patches can easily blind them and generalize well between different versions of YOLOv5;
• Similar to the digital attack, YOLOv2 is still the weakest detector. However, it seems immune to patches placed outside targets in the physical world.
In conclusion, the result area of our CBA in Fig. 7 is significantly redder than other areas, indicating that our proposed CBA can achieve excellent attack performance in both white-box and black-box settings and significantly outperform other methods when trained by most detectors. The fly in the ointment is that the adversarial patch trained by YOLOv2 presents poor attack transferability, and we will discuss this in Sec. IV-E.
IV-C2 Detailed experimental results of white-box attacks
We report the detailed quantitative and qualitative results in a white-box setting, as shown in Table I and Fig. 8, respectively. It is observed that:
• The contextual adversarial patches generated by the proposed CBA can completely blind quite a few aerial detectors, i.e. the victim detectors cannot recognize any protected targets at all, including SSD, YOLOv2, YOLOv5n, Cascade R-CNN, RetinaNet, FreeAnchor, FSAF, RepPoints, TOOD, and FoveaBox;
• The remaining detectors can recognize the protected objects correctly, but only with low average confidence, no higher than 0.438;
• In contrast, most patches generated by the comparison methods can barely misguide the detectors and only slightly sway the confidence of the correct detections;
• None of the comparison methods can successfully hide all the objects of interest from being perceived.
IV-C3 Detailed experimental results of black-box attacks
We report the detailed quantitative and qualitative results in a black-box setting, as shown in Table II and Fig. 9, respectively. It is observed that:
• Even under the black-box setting, the contextual adversarial patches generated by the proposed CBA transfer their attack efficacy well between various aerial detectors;
• The contextual adversarial patch trained on YOLOv5n successfully protects all the targets of interest from being recognized by YOLOv2, Cascade R-CNN, RetinaNet, FSAF, RepPoints, FoveaBox, and VarifocalNet;
• Under the attack of our CBA, all the average confidences are lower than 0.208, which significantly outperforms other physical attack methods.
IV-C4 Physical attack robustness
The robustness of our proposed CBA for the physical attack is further validated under varying imaging angles and lighting conditions. The experimental results are shown in Fig. 10 and Fig. 11, respectively. We can observe that our proposed CBA achieves successful attacks on all the aircraft, i.e. none of the protected targets is recognized correctly with a prediction confidence higher than 0.2, demonstrating our method's physical attack robustness against dynamic conditions in real-world scenarios.
To exclude the effect of patch location and size, APPA's patches with the same settings as ours are adopted for comparison. Fig. 12 shows that APPA's patches can barely sway the predictions. We also visualize the attention maps [50] of YOLOv5s before and after the physical attacks, as shown in Fig. 13. It is observed that the proposed CBA can completely blind this powerful detector in the physical world.
IV-D Ablation Study
This part discusses the influence of total variation on the physical attack. We evaluate the effectiveness of the smoothness constraint on SSD in physical scenarios. Specifically, we vary $\lambda$ to generate corresponding adversarial patches, as shown in Fig. 14. In addition, we perform physical attack experiments in the proportionally scaled physical scenario, including 18 aircraft targets (T1-T18), to quantitatively compare the attack efficacy. The detection results are shown in Fig. 15, and we can observe that:
• If $\lambda$ is too small or equal to 0, the generated patch is not smooth enough, which may cause a significant loss of attack efficacy during the physical-digital transformation;
• If $\lambda$ is too large, the pattern of the adversarial patch will be oversimplified, which critically weakens the attack efficacy of the generated patch.
Consequently, we choose $\lambda = 1.5$ to balance the two parts of the total loss in our experiments.
IV-E Discussion
Surprisingly, the adversarial patch trained against YOLOv2 shows weak transferability. We try to explain this phenomenon from a different view. Training adversarial patches is similar to training networks: the only difference is that the pixels of the patch are updated when training adversarial patches, whereas the parameters of the network are updated when training networks. Therefore, the adversarial patches are influenced by the training samples, the victim network model, and the optimization strategy. Consequently, when the training samples and the optimization process are fixed, the victim model is crucial to generating adversarial patches. Thus, from a poor detector such as YOLOv2, a strong attack method can only learn limited information, which may be just enough for the white-box attack but not enough to attack a more powerful model. Similarly, the above analysis may also explain why the adversarial patches of different versions of YOLOv5 share a similar pattern while differing from the styles of the other patches.
V Conclusion
This paper proposes a brand-new physical attack framework against aerial detection based on contextual adversarial patches. The target of interest, i.e. the aircraft, is adopted to mask the contextual adversarial patches, and the pixels outside the mask area are optimized iteratively. Moreover, the contextual adversarial patches are forced to be outside the targets during training, by which the detected targets of interest, both on and outside the patches, contribute to improving the attack effectiveness of the crafted contextual patches. Extensive and rigorous experiments have been conducted to validate the effectiveness of the proposed physical attack framework, demonstrating that our elaborated contextual adversarial patches possess solid fooling efficacy for hiding objects both on and outside the patches. In addition, the elaborately crafted adversarial patches can dramatically fool aerial detectors in dynamic physical scenarios, such as varying lighting conditions and view angles, and consistently outperform existing methods in both white-box and black-box settings.
References
- [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
- [2] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
- [3] Y. Dong, H. Su, B. Wu, Z. Li, W. Liu, T. Zhang, and J. Zhu, “Efficient decision-based black-box adversarial attacks on face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7714–7722.
- [4] Y. Deng and L. J. Karam, “Frequency-tuned universal adversarial attacks on texture recognition,” IEEE Transactions on Image Processing, vol. 31, pp. 5856–5868, 2022.
- [5] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, “Adversarial examples for semantic segmentation and object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1369–1378.
- [6] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM workshop on artificial intelligence and security, 2017, pp. 15–26.
- [7] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9185–9193.
- [8] K. Mahmood, R. Mahmood, and M. Van Dijk, “On the robustness of vision transformers to adversarial examples,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7838–7847.
- [9] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.
- [10] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
- [11] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018.
- [12] B. Chen, Y. Feng, T. Dai, J. Bai, Y. Jiang, S.-T. Xia, and X. Wang, “Adversarial examples generation for deep product quantization networks on image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [13] D. Mao, Z. Wang, Y. Wang, C.-Y. Choi, M. Jia, M. Jackson, and R. Fuller, “Remote observations in china’s ramsar sites: wetland dynamics, anthropogenic threats, and implications for sustainable development goals,” Journal of Remote Sensing, vol. 2021, 2021.
- [14] S. Mei, R. Jiang, M. Ma, and C. Song, “Rotation-invariant feature learning via convolutional neural network with cyclic polar coordinates convolutional layer,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023.
- [15] L. Liu, X. Zhang, Y. Gao, X. Chen, X. Shuai, and J. Mi, “Finer-resolution mapping of global land cover: Recent developments, consistency analysis, and prospects,” Journal of Remote Sensing, 2021.
- [16] A. Du, B. Chen, T.-J. Chin, Y. W. Law, M. Sasdelli, R. Rajasegaran, and D. Campbell, “Physical adversarial attacks on an aerial imagery object detector,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1796–1806.
- [17] Y. Xu and P. Ghamisi, “Universal adversarial examples in remote sensing: Methodology and benchmark,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.
- [18] J. Lian, S. Mei, S. Zhang, and M. Ma, “Benchmarking adversarial patch against aerial detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022.
- [19] Y. Xu, B. Du, and L. Zhang, “Assessing the threat of adversarial examples on deep neural networks for remote sensing scene classification: Attacks and defenses,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 2, pp. 1604–1617, 2020.
- [20] S. Vellaichamy, M. Hull, Z. J. Wang, N. Das, S. Peng, H. Park, and D. H. P. Chau, “Detectordetective: Investigating the effects of adversarial examples on object detectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21484–21491.
- [21] Z. Cai, S. Rane, A. E. Brito, C. Song, S. V. Krishnamurthy, A. K. Roy-Chowdhury, and M. S. Asif, “Zero-query transfer attacks on context-aware object detectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15024–15034.
- [22] Y. Shi, Y. Han, Q. Hu, Y. Yang, and Q. Tian, “Query-efficient black-box adversarial attack with customized iteration and sampling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [23] Z. Wang, S. Zheng, M. Song, Q. Wang, A. Rahimpour, and H. Qi, “advpattern: physical-world attacks on deep person re-identification via adversarially transformable patterns,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8341–8350.
- [24] Y.-C.-T. Hu, B.-H. Kung, D. S. Tan, J.-C. Chen, K.-L. Hua, and W.-H. Cheng, “Naturalistic physical adversarial patch for object detectors,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7848–7857.
- [25] S. Thys, W. Van Ranst, and T. Goedemé, “Fooling automated surveillance cameras: adversarial patches to attack person detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 49–55.
- [26] J. Lian, X. Wang, Y. Su, M. Ma, and S. Mei, “Contextual adversarial attack against aerial detection in the physical world,” arXiv preprint arXiv:2302.13487, 2023.
- [27] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Artificial intelligence safety and security. Chapman and Hall/CRC, 2018, pp. 99–112.
- [28] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,” arXiv preprint arXiv:1712.09665, 2017.
- [29] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, “Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 1528–1540.
- [30] X. Zheng, Y. Fan, B. Wu, Y. Zhang, J. Wang, and S. Pan, “Robust physical-world attacks on face recognition,” Pattern Recognition, vol. 133, p. 109009, 2023.
- [31] X. Wei, Y. Guo, and J. Yu, “Adversarial sticker: A stealthy attack method in the physical world,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [32] P. Labarbarie, A. Chan-Hon-Tong, S. Herbin, and M. Leyli-Abadi, “Benchmarking and deeper analysis of adversarial patch attack on object detectors,” in Workshop on Artificial Intelligence Safety (AISafety), IJCAI-ECAI, 2022.
- [33] D. Wang, T. Jiang, J. Sun, W. Zhou, Z. Gong, X. Zhang, W. Yao, and X. Chen, “FCA: Learning a 3D full-coverage vehicle camouflage for multi-view physical adversarial attack,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 2414–2422.
- [34] X. Wei, Y. Guo, J. Yu, and B. Zhang, “Simultaneously optimizing perturbations and positions for black-box adversarial patch attacks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [35] X. Han, G. Xu, Y. Zhou, X. Yang, J. Li, and T. Zhang, “Physical backdoor attacks to lane detection systems in autonomous driving,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2957–2968.
- [36] Z. Cheng, J. Liang, H. Choi, G. Tao, Z. Cao, D. Liu, and X. Zhang, “Physical attack on monocular depth estimation with optimal adversarial patches,” in European Conference on Computer Vision. Springer, 2022, pp. 514–532.
- [37] E. Kaziakhmedov, K. Kireev, G. Melnikov, M. Pautov, and A. Petiushko, “Real-world attack on MTCNN face detection system,” in 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON). IEEE, 2019, pp. 422–427.
- [38] M. Pautov, G. Melnikov, E. Kaziakhmedov, K. Kireev, and A. Petiushko, “On adversarial patches: Real-world attack on ArcFace-100 face recognition system,” in 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON). IEEE, 2019, pp. 391–396.
- [39] S. Komkov and A. Petiushko, “AdvHat: Real-world adversarial attack on ArcFace Face ID system,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 819–826.
- [40] D.-L. Nguyen, S. S. Arora, Y. Wu, and H. Yang, “Adversarial light projection attacks on face recognition systems: A feasibility study,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 814–815.
- [41] X. Zhu, X. Li, J. Li, Z. Wang, and X. Hu, “Fooling thermal infrared pedestrian detectors in real world using small bulbs,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3616–3624.
- [42] X. Zhu, Z. Hu, S. Huang, J. Li, and X. Hu, “Infrared invisible clothing: Hiding from infrared detectors at multiple angles in real world,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13317–13326.
- [43] L. Huang, C. Gao, Y. Zhou, C. Xie, A. L. Yuille, C. Zou, and N. Liu, “Universal physical camouflage attacks on object detectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 720–729.
- [44] S. Mei, X. Chen, Y. Zhang, J. Li, and A. Plaza, “Accelerating convolutional neural network-based hyperspectral image classification by step activation quantization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2021.
- [45] S. Mei, X. Li, X. Liu, H. Cai, and Q. Du, “Hyperspectral image classification using attention-based bidirectional long short-term memory network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2021.
- [46] J. Tian, B. Wang, R. Guo, Z. Wang, K. Cao, and X. Wang, “Adversarial attacks and defenses for deep-learning-based unmanned aerial vehicles,” IEEE Internet of Things Journal, vol. 9, no. 22, pp. 22399–22409, 2022.
- [47] J.-C. Burnel, K. Fatras, R. Flamary, and N. Courty, “Generating natural adversarial remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
- [48] G. Cheng, X. Sun, K. Li, L. Guo, and J. Han, “Perturbation-seeking generative adversarial networks: A defense framework for remote sensing image scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2021.
- [49] Y. Zhang, Y. Zhang, J. Qi, K. Bin, H. Wen, X. Tong, and P. Zhong, “Adversarial patch attack on multi-scale object detection for UAV remote sensing images,” Remote Sensing, vol. 14, no. 21, p. 5298, 2022.
- [50] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 618–626.
- [51] M. Zhuge, D.-P. Fan, N. Liu, D. Zhang, D. Xu, and L. Shao, “Salient object detection via integrity learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [52] X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented R-CNN for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3520–3529.
- [53] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
- [54] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
- [55] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
- [56] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
- [57] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
- [58] G. Jocher et al., “YOLOv5,” accessed July 21, 2021. [Online]. Available: https://github.com/ultralytics/yolov5
- [59] Z. Cai and N. Vasconcelos, “Cascade R-CNN: High quality object detection and instance segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1483–1498, 2019.
- [60] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2980–2988.
- [61] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2961–2969.
- [62] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye, “FreeAnchor: Learning to match anchors for visual object detection,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [63] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module for single-shot object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
- [64] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “RepPoints: Point set representation for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9657–9666.
- [65] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “TOOD: Task-aligned one-stage object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3490–3499.
- [66] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9759–9768.
- [67] T. Kong, F. Sun, H. Liu, Y. Jiang, L. Li, and J. Shi, “FoveaBox: Beyound anchor-based object detection,” IEEE Transactions on Image Processing, vol. 29, pp. 7389–7398, 2020.
- [68] H. Zhang, Y. Wang, F. Dayoub, and N. Sunderhauf, “VarifocalNet: An IoU-aware dense object detector,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 8514–8523.
- [69] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “DOTA: A large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.