
1 Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University
2 RealAI    3 Sea AI Lab, Singapore
4 Peng Cheng Laboratory; Pazhou Laboratory (Huangpu), Guangzhou, China
Email: {yangxiao19, dyp17}@mails.tsinghua.edu.cn, tianyupang@sea.com, {suhangss, dcszj}@tsinghua.edu.cn

Boosting Transferability of Targeted Adversarial Examples via Hierarchical Generative Networks

Xiao Yang1    Yinpeng Dong1,2    Tianyu Pang3    Hang Su1,4    Jun Zhu1,2,4 (corresponding author)
Abstract

Transfer-based adversarial attacks can evaluate model robustness in the black-box setting. Several methods have demonstrated impressive untargeted transferability, but achieving targeted transferability efficiently remains challenging. To this end, we develop a simple yet effective framework that crafts targeted transfer-based adversarial examples with a hierarchical generative network. In particular, we contribute amortized designs that adapt well to multi-class targeted attacks. Extensive experiments on ImageNet show that our method improves the success rates of targeted black-box attacks by a significant margin over existing methods: it reaches an average success rate of 29.1% against six diverse models based on only one substitute white-box model, significantly outperforming state-of-the-art gradient-based attack methods. Moreover, the proposed method is more than an order of magnitude more efficient than gradient-based methods.

1 Introduction

Recent progress in adversarial machine learning demonstrates that deep neural networks (DNNs) are highly vulnerable to adversarial examples [14, 48], which are maliciously generated to mislead a model to produce incorrect predictions. It has been demonstrated that adversarial examples possess an intriguing property of transferability [51, 20, 5] — the adversarial examples crafted for a white-box model can also mislead other unknown models, making black-box attacks feasible. The threats of adversarial examples have raised severe concerns in numerous security-sensitive applications, such as autonomous driving [11] and face recognition [55, 56, 54].

Tremendous efforts have been made to develop more effective transfer-based black-box attack methods, since they serve as an important surrogate for evaluating model robustness in real-world scenarios [33, 9]. Current methods have achieved impressive performance for untargeted black-box attacks, which aim to cause misclassification by the black-box models. However, targeted black-box attacks, which aim to mislead black-box models into outputting an adversary-desired target class, either perform unsatisfactorily or require computation that scales with the number of classes [8, 59]. Technically, the inefficiency of targeted adversarial attacks could result in an over-estimation of model robustness under the challenging black-box attack setting [17].

Existing efforts on targeted black-box attacks can be categorized into instance-specific and instance-agnostic attacks. Specifically, instance-specific attack methods [13, 35, 26, 9, 28] craft adversarial examples by performing iterative gradient updates, and they obtain unsatisfactory performance for targeted black-box attacks because they easily overfit the white-box model [9, 52]. Recently, Zhao et al. [60] proposed several improvements for instance-specific targeted attacks, so we treat the method in [60] as one of the strong instance-specific baselines in our experiments.

On the other hand, instance-agnostic attack methods learn a universal perturbation [59] or a universal function [44, 36] on the data distribution, independent of specific instances. They can produce more general and transferable adversarial examples, since the universal perturbation or function alleviates data-specific overfitting by training on an unlabeled dataset. CD-AP [36], an effective instance-agnostic method, adopts a generative model as the universal function and obtains acceptable performance for a single specified target class. However, CD-AP needs to learn one generative model per target class when performing a multi-target attack [15], i.e., crafting adversarial examples targeted at different classes. It is therefore not scalable to a growing number of targets, such as hundreds of classes, which limits its practical efficiency.

To address the aforementioned issues and develop a targeted black-box attack for practical scenarios, in this paper we propose a conditional generative model as the universal adversarial function to craft adversarial perturbations. We can thus craft adversarial perturbations targeted at different classes using a single model backbone with different class embeddings. The proposed generative method is simple yet practical and obtains superior targeted black-box performance, together with two technical improvements: (i) a smooth projection mechanism that helps the generator probe targeted semantic knowledge from the classifier; and (ii) adaptive Gaussian smoothing, which makes the generated results adaptive against adversarially trained models. Therefore, our approach has several advantages over existing generative attacks [36, 39, 40], as described in the following.

One model for multiple target classes. The previous generative methods [36, 39] require costly training of N models when performing a multi-target attack with N classes. In contrast, ours trains only one model and reaches an average success rate of 51.1% against six naturally trained models and 36.4% against three adversarially trained models based on only one substitute white-box model on ImageNet, which outperforms CD-AP by large margins of 6.0% and 31.3%, respectively.

Hierarchical partition of classes. When handling a large number of classes (e.g., 1,000 classes in ImageNet), the effectiveness of a single generative model for crafting targeted adversarial examples degrades due to the difficulty of loss convergence in adversarial learning [53, 1]. We therefore train a feasible number of models (e.g., 10–20 models on ImageNet) to further promote effectiveness beyond a single model backbone. Specifically, each model is learned on a subset of classes specified by a designed hierarchical partition mechanism that accounts for the diversity among subsets, seeking a balance between effectiveness and scalability. Our method reaches an average success rate of 29.1% against six different models based on only one substitute white-box model, outperforming the state-of-the-art gradient-based methods by a large margin. Moreover, the proposed method achieves a substantial speedup over the mainstream gradient-based methods.

Strong semantic patterns. We experimentally find that the adversarial perturbations generated by the proposed Conditional Generative model arise as strong Semantic Patterns (C-GSP), as shown in Fig. 1(a). Furthermore, we present additional analyses in Sec. 4.6, showing that the generated semantic pattern itself generalizes well across different models and is robust to the influence of the data. These analyses are instructive for understanding and designing adversarial examples.

Technically, our main contributions can be summarized as follows:

  • We propose a simple yet practical conditional generative targeted attack with a scalable hierarchical partition mechanism, which can generate targeted adversarial examples without any per-instance parameter tuning.

  • Extensive experiments demonstrate that our method significantly improves the success rates of targeted black-box attacks over the existing methods.

  • As a by-product, our baseline experiments provide a systematic evaluation of previous targeted black-box attacks, both instance-specific and instance-agnostic, on the large-scale ImageNet dataset and on face recognition.

2 Related Work

In this section, we review related work on different types of adversarial attacks.

Instance-specific attacks. Some recent works [35, 60] adopt gradient-based optimization methods to generate data-dependent perturbations. MIM [9] introduces a momentum term into the iterative attack process to improve black-box transferability. DIM [52] and TI [10] achieve better transferability through input or gradient diversity. Recent works [21, 22] also train multiple auxiliary classifiers, at considerable cost, to improve the black-box performance of iterative methods. In contrast, we improve transferability over instance-specific methods while also improving inference-time efficiency.

Instance-agnostic attacks. Different from instance-specific attacks, instance-agnostic attacks are image-independent (universal) methods. The first line of work learns a universal perturbation: UAP [34] fools a model by adding a learned universal noise vector. Another line of attacks introduces learned generative models to craft adversarial examples: GAP [39] crafts adversarial perturbations directly from target data, while AAA [40] relies on class impressions. Previous methods, whether based on universal perturbations or universal functions, require costly training of one model per target class. Our method can generate adversarial examples for multiple targets simultaneously with better attack performance.

Multi-target attacks. Instance-specific attacks can specify any target in the optimization phase. As elaborated in the introduction, these methods have degraded transferability and time-consuming iterative procedures. MAN [15] trains a generative model on ImageNet under an $\ell_{2}$-norm constraint to explore targeted attacks, specifying all 1,000 ImageNet categories in a single model for extreme speed and storage efficiency. However, MAN does not fully compare multi-target black-box performance with previous instance-specific or instance-agnostic attacks, and its authors also note that too many categories make it hard to transfer to another model. Recent approaches [59, 37] achieve better single-target transferability by learning a universal perturbation or function, but they require training multiple models when specifying multiple targets. In comparison, our method can generate adversarial examples for multiple specified targets, and the generated strong semantic patterns outperform existing attacks by a significant margin.

Figure 1: (a) shows the targeted adversarial examples crafted by MIM [9] and C-GSP given the target class Viaduct with the maximum perturbation $\epsilon=16$. The predicted labels and probabilities are given by another black-box model. (b) presents an overview of our proposed generative method for crafting C-GSP, including the conditional generator and the classifier. The generator integrates the image and the conditional class vector from the mapping network into a hidden incorporation. Only the generator is trained in the whole pipeline, to probe the target boundaries of the classifier.

3 Method

In this section, we introduce a conditional generative model that learns a universal adversarial function, which can achieve effective multi-target black-box attacks. When handling a large number of classes, we design a hierarchical partition mechanism that makes the generative model capable of specifying any target class with a feasible number of models, regarding both effectiveness and scalability.

3.1 Problem Formulation

We use $\bm{x}_{s}$ to denote an input image belonging to an unlabeled training set $\mathcal{X}_{s}\subset\mathbb{R}^{d}$, and use $c\in\mathcal{C}$ to denote a specific target class. Let $\mathcal{F}_{\phi}:\mathcal{X}_{s}\rightarrow\mathbb{R}^{K}$ denote a classification network that outputs a class probability vector over $K$ classes. To craft a targeted adversarial example $\bm{x}_{s}^{*}$ from a real example $\bm{x}_{s}$, the targeted attack aims to fool the classifier $\mathcal{F}_{\phi}$ into outputting the specific label $c$, i.e., $\operatorname*{arg\,max}_{i\in\mathcal{C}}{\mathcal{F}_{\phi}(\bm{x}_{s}^{*})}_{i}=c$, while the $\ell_{\infty}$ norm of the adversarial perturbation is required to be no more than a threshold $\epsilon$, i.e., $\|\bm{x}_{s}^{*}-\bm{x}_{s}\|_{\infty}\leq\epsilon$.

Although some generative methods [39, 36] can learn targeted adversarial perturbations, they do not consider the effectiveness of multi-target generation, which makes them inconvenient in practice. To make the generative model learn how to specify multiple targets, we propose a conditional generative network $\mathcal{G}_{\theta}$ that effectively crafts multi-target adversarial perturbations by modeling the class-conditional distribution. Different from previous single-target methods [36, 39], the target label $c$ is regarded as a discrete variable rather than a constant. As illustrated in Fig. 1(b), our model contains a conditional generator $\mathcal{G}_{\theta}$ and a classification network $\mathcal{F}_{\phi}$, parameterized by $\theta$ and $\phi$, respectively. The conditional generative model $\mathcal{G}_{\theta}:(\mathcal{X}_{s},\mathcal{C})\rightarrow\mathcal{P}$ learns a perturbation $\bm{\delta}=\mathcal{G}_{\theta}(\bm{x}_{s},c)\in\mathcal{P}\subset\mathbb{R}^{d}$ on the training data. The output $\bm{\delta}$ of $\mathcal{G}_{\theta}$ is projected within the fixed $\ell_{\infty}$ norm, yielding the perturbed image $\bm{x}_{s}^{*}=\bm{x}_{s}+\bm{\delta}$.

Given a pretrained network $\mathcal{F}_{\phi}$ parameterized by $\phi$, we propose to generate the targeted adversarial perturbations by solving

\min_{\theta}\;\mathbb{E}_{(\bm{x}_{s}\sim\mathcal{X}_{s},\,c\sim\mathcal{C})}\big[\mathbb{CE}\big(\mathcal{F}_{\phi}(\mathcal{G}_{\theta}(\bm{x}_{s},c)+\bm{x}_{s}),c\big)\big],\quad\text{s.t. }\|\mathcal{G}_{\theta}(\bm{x}_{s},c)\|_{\infty}\leq\epsilon,  (1)

where $\mathbb{CE}$ is the cross-entropy loss. By solving problem (1), we obtain a targeted conditional generator that minimizes the loss of the specified target class on the unlabeled training dataset. Note that we only optimize the parameters $\theta$ of the generator $\mathcal{G}_{\theta}$ using the training data $\mathcal{X}_{s}$; the targeted adversarial example $\bm{x}_{t}^{*}$ can then be crafted as $\bm{x}_{t}^{*}=\bm{x}_{t}+\mathcal{G}_{\theta}(\bm{x}_{t},c)$ for any given image $\bm{x}_{t}$ in the test data $\mathcal{X}_{t}$, which requires only a single inference pass for this target image $\bm{x}_{t}$.
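For illustration, a minimal PyTorch-style sketch of this inference step is given below; the generator interface (image plus one-hot target) and the [0, 1] image range are assumptions for exposition, not the exact implementation.

import torch

def craft_targeted_example(G, x_t, one_hot_c, eps=16/255.):
    # One forward pass: x_t* = x_t + G(x_t, c); G already bounds the perturbation via eps * tanh(.)
    G.eval()
    with torch.no_grad():
        delta = G(x_t, one_hot_c)
        x_adv = torch.clamp(x_t + delta, 0.0, 1.0)  # keep a valid image range
    return x_adv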

We experimentally find that the objective (1) enforces transferability of the generated perturbation $\bm{\delta}$. A reasonable explanation is that $\bm{\delta}$ arises as a strong and well-generalizing semantic pattern inherent to the target class, which is robust to the influence of any training data. In Sec. 4.6, we illustrate and corroborate this claim by directly feeding scaled adversarial perturbations (linearly scaled from $[-\epsilon, \epsilon]$ to $[0, 255]$) from different methods into the classifier. Indeed, we find that our semantic pattern is classified as the target class with high confidence, whereas the perturbation from MIM [9] behaves like noise; moreover, the scaled semantic pattern transfers well across different black-box models.

3.2 Network Architecture

We now present the details of the conditional generative model for targeted attacks, as illustrated in Fig. 1(b). Specifically, we design a mapping network to generate a target-specific vector in the implicit space of each target, and train the conditional generator $\mathcal{G}_{\theta}$ to reflect this vector by constantly misleading the classifier $\mathcal{F}_{\phi}$.

Mapping network. Given the one-hot class encoding $\mathbbm{1}_{c}\in\mathbb{R}^{K}$ of the target class $c$, the mapping network generates the targeted latent vector $\bm{w}=\mathcal{W}(\mathbbm{1}_{c})$, where $\bm{w}\in\mathbb{R}^{M}$ and $\mathcal{W}(\cdot)$ consists of a multi-layer perceptron (MLP) and a normalization layer, which can construct diverse targeted vectors $\bm{w}$ for a given target class $c$. Thus $\mathcal{W}$ is capable of learning effective targeted latent vectors by randomly sampling different classes $c\in\mathcal{C}$ during training.

Generator. Given an input image $\bm{x}_{s}$, the encoder first computes the feature map $\bm{F}\in\mathbb{R}^{N\times H\times W}$, where $N$, $H$ and $W$ refer to the number of channels, height and width of the feature map, respectively. The target latent vector $\bm{w}$, derived from the mapping network $\mathcal{W}$ for a specific target class $c$, is expanded along the height and width directions to obtain the label feature map $\bm{w}_{s}\in\mathbb{R}^{M\times H\times W}$. The two feature maps are then concatenated along the channel dimension to obtain $\bm{F}^{\prime}\in\mathbb{R}^{(N+M)\times H\times W}$, and the mixed feature map is fed to the subsequent network. Therefore, our generator $\mathcal{G}_{\theta}$ translates an input image $\bm{x}_{s}$ and latent target vector $\bm{w}$ into an output image $\mathcal{G}_{\theta}(\bm{x}_{s},\bm{w})$, which enables $\mathcal{G}_{\theta}$ to synthesize adversarial images for a series of targets. For the output feature map $\bm{f}\in\mathbb{R}^{d}$ of the decoder, we adopt a smooth projection $P(\cdot)$ that performs a change of variables over $\bm{f}$, rather than directly minimizing its $\ell_{2}$ norm as in [15] or clipping values outside the fixed norm as in [36]:

\bm{\delta}=P(\bm{f})=\epsilon\cdot\mathrm{tanh}(\bm{f}),  (2)

where $\epsilon$ is the perturbation strength. Since $-1\leq\mathrm{tanh}(\bm{f})\leq 1$, $\bm{\delta}$ automatically satisfies the $\ell_{\infty}$-ball bound with perturbation budget $\epsilon$. This transformation provides a smoother gradient than directly clipping values outside the fixed norm, which also helps $\mathcal{G}_{\theta}$ probe and learn the targeted semantic knowledge from $\mathcal{F}_{\phi}$.
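To make the conditioning concrete, the sketch below follows the description above (mapping network, spatial expansion and channel-wise concatenation of the class vector, and the smooth tanh projection of Eq. (2)); the layer sizes and the encoder/decoder backbones are illustrative placeholders rather than the exact architecture used in our implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    # W(.): one-hot class encoding -> normalized latent target vector w (Sec. 3.2)
    def __init__(self, num_classes, latent_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_classes, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, one_hot_c):
        return F.normalize(self.mlp(one_hot_c), dim=1)   # normalization layer

class ConditionalGenerator(nn.Module):
    # G_theta(x, c): encoder -> concat expanded class map -> decoder -> eps * tanh(f)
    def __init__(self, encoder, decoder, mapping, eps=16/255.):
        super().__init__()
        self.encoder, self.decoder, self.mapping, self.eps = encoder, decoder, mapping, eps

    def forward(self, x, one_hot_c):
        feat = self.encoder(x)                                        # (B, N, H, W)
        w = self.mapping(one_hot_c)                                   # (B, M)
        w_map = w[:, :, None, None].expand(-1, -1, *feat.shape[2:])   # (B, M, H, W)
        mixed = torch.cat([feat, w_map], dim=1)                       # (B, N+M, H, W)
        f = self.decoder(mixed)                                       # raw decoder output f
        return self.eps * torch.tanh(f)                               # smooth l_inf projection, Eq. (2)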

Training objectives. The training objective seeks to minimize the classification error on the perturbed images produced by the generator:

\theta^{*}\leftarrow\operatorname*{arg\,min}_{\theta}\;\mathbb{CE}\Big(\mathcal{F}_{\phi}\big(\bm{x}_{s}+\mathcal{G}_{\theta}(\bm{x}_{s},\mathcal{W}(\mathbbm{1}_{c}))\big),c\Big),  (3)

which adopts an end-to-end training paradigm with the goal of generating adversarial images that mislead the classifier toward the target label, where $\mathbb{CE}$ is the cross-entropy loss. Previous studies have tried different classification losses [59, 36], and we find that the cross-entropy loss works well in our setting. The detailed optimization procedure is summarized in Algorithm 1.

Algorithm 1 Training Algorithm for the Conditional Generative Attack
1: Input: training data $\mathcal{D}_{s}$; a generative network $\mathcal{G}_{\theta}$; a classification network $\mathcal{F}_{\phi}$; a mapping network $\mathcal{W}$.
2: Output: the trained generator parameters $\theta$, which define the adversarial perturbation function.
3: for iter in MaxIterations $T$ do
4:     Randomly sample $B$ images $\{\bm{x}_{s_{i}}\}_{i=1}^{B}$;
5:     Randomly sample $B$ target classes $\{c_{i}\}_{i=1}^{B}$;
6:     Forward pass $c_{i}$ into $\mathcal{W}$ to compute the targeted latent vectors $\bm{w}_{i}$;
7:     Obtain the perturbed images by $\bm{x}_{s_{i}}^{*}=\epsilon\cdot\mathrm{tanh}(\mathcal{G}(\bm{x}_{s_{i}},\bm{w}_{i}))+\bm{x}_{s_{i}}$;
8:     Forward pass $\bm{x}_{s_{i}}^{*}$ through $\mathcal{F}_{\phi}$ and compute the loss in Eq. (3);
9:     Backward pass and update $\mathcal{G}_{\theta}$;
10: end for
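A minimal PyTorch-style sketch of this training loop is given below, assuming the conditional generator sketched in Sec. 3.2; the Adam optimizer, the data-loader interface, and the names (train_generator, class_subset) are illustrative assumptions, with the learning rate and mini-batch handling following Sec. 4.1.

import torch
import torch.nn.functional as F

def train_generator(G, classifier, loader, class_subset, lr=2e-5, max_iters=100000, device="cuda"):
    classifier.eval()                                   # F_phi is frozen
    for p in classifier.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    num_classes, it = len(class_subset), 0
    while it < max_iters:
        for x, _ in loader:                             # unlabeled images (labels unused)
            x = x.to(device)
            # randomly sample B target classes from the subset handled by this generator
            idx = torch.randint(num_classes, (x.size(0),), device=device)
            one_hot = F.one_hot(idx, num_classes).float()
            delta = G(x, one_hot)                       # already bounded by eps * tanh(.)
            logits = classifier(torch.clamp(x + delta, 0.0, 1.0))
            target = torch.tensor([class_subset[i] for i in idx.tolist()], device=device)
            loss = F.cross_entropy(logits, target)      # Eq. (3)
            opt.zero_grad(); loss.backward(); opt.step()
            it += 1
            if it >= max_iters:
                break
    return G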

3.3 Hierarchical Partition for Classes

While handling a large number of classes, the effectiveness of a conditional generative model decreases, as illustrated in Fig. 4, because the representative capacity of a single generator is limited. Therefore, when the class number $K$ is large, e.g., 1,000 classes in ImageNet, we propose to divide all classes into a feasible number of subsets and train one model per subset, with the aim of preserving the effectiveness of targeted black-box attacks. To obtain a good partition, we introduce a representative target class space, which is nearly equivalent to the original class space $\mathcal{C}$. Specifically, we utilize the weights $\phi_{cls}\in\mathbb{R}^{D\times K}$ of the classifier layer of the classification network $\mathcal{F}_{\phi}$. $\phi_{cls}$ can be regarded as an alternative class space, since each weight vector $\bm{d}_{c}\in\mathbb{R}^{D}$ from $\phi_{cls}$ represents a class center of the feature embeddings of input images with the same class $c$.

Note that when the subsets fed as conditional inputs to the generative network contain classes that are close in the target class space $\phi_{cls}$ (e.g., with larger cosine similarity), they obtain worse loss convergence and transferability than diverse subsets, due to mutual influence among the input conditions, as illustrated in Fig. 5. Thus we focus on selecting target classes that do not overlap or lie close to each other to form the subsets. To capture more diverse examples in a given sampling space, we adopt k-determinantal point processes (k-DPP) [25, 24] to achieve a hierarchical partition, which exploits the diversity among subsets by assigning subset probabilities proportional to determinants of a kernel matrix.

First, we compute the RBF kernel matrix $L$ of $\phi_{cls}$ and the eigendecomposition of $L$, and a random subset $V$ of the eigenvectors is chosen by treating the eigenvalues as sampling probabilities. Second, we select a new class $c_{i}$ to add to the set and update $V$ in a manner that de-emphasizes items similar to the one selected. Each successive point is selected and $V$ is updated by Gram–Schmidt orthogonalization, so the distribution shifts to avoid points near those already chosen. By performing the above procedure, we obtain a subset of size $k$. Thus, when handling $K$ conditional classes, we hierarchically apply this algorithm to obtain the final $K/k$ subsets, which are regarded as conditional variables of the generative models to craft adversarial examples. The details are presented in Appendix A.
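The partition itself can be sketched as follows under the stated assumptions; sample_kdpp denotes a k-DPP sampler in the spirit of Algorithm 2 (Appendix A), and its name and interface are hypothetical.

import numpy as np

def hierarchical_partition(phi_cls, k, sample_kdpp):
    # Split the K class-weight vectors (columns of phi_cls, shape D x K) into K/k diverse subsets
    remaining = list(range(phi_cls.shape[1]))
    subsets = []
    while len(remaining) > k:
        vecs = phi_cls[:, remaining]                   # weights of classes not yet assigned
        picked = sample_kdpp(vecs, k)                  # indices into `remaining`, diverse under an RBF kernel
        subsets.append([remaining[i] for i in picked])
        remaining = [c for j, c in enumerate(remaining) if j not in set(picked)]
    if remaining:
        subsets.append(remaining)                      # leftover classes form the last subset
    return subsets                                     # each subset conditions one generator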

4 Experiments

In this section, we present extensive experiments to demonstrate the effectiveness of the proposed method for targeted black-box attacks (code at https://github.com/ShawnXYang/C-GSP).

4.1 Experimental Settings

Datasets. We consider the following datasets for training: the widely used object detection dataset MS-COCO [31] and the ImageNet training set [6]. We focus on standard and comprehensive testing settings; inference is performed on the ImageNet validation set (50k samples), a 5k subset of ImageNet proposed by [29], and ImageNet-NeurIPS (1k) proposed by [38].

Networks. We consider several naturally trained networks, i.e., Inception-v3 (Inv3) [47], Inception-v4 (Inv4) [45], ResNet-v2-152 (R152) [16] and Inception-ResNet-v2 (IR-v2) [45], which are widely used for evaluating transferability. Besides, we add DenseNet-201 (DN) [18], GoogleNet (GN) [46] and VGG-16 (VGG) [43] to fully evaluate transferability. Some adversarially trained networks [49] are also selected to evaluate the performance, i.e., ens3-adv-Inception-v3 ($\textrm{Inv3}_{\textrm{ens3}}$), ens4-adv-Inception-v3 ($\textrm{Inv3}_{\textrm{ens4}}$) and ens-adv-Inception-ResNet-v2 ($\textrm{IR-v2}_{\textrm{ens}}$).

Implementation details. For instance-specific attacks, we compare our method with several baselines, including MIM [9], DIM [52], TI [10], SI [30] and the state-of-the-art targeted attack Logit [60]. All instance-specific attacks adopt the optimal hyperparameters provided in their original work. Specifically, the attack iterations $M$ of MIM, DIM and Logit are set to 10, 20 and 300, respectively, and $\|W\|_{1}=5$ is used for TI [10] as suggested by [12, 60]. We choose the same ResNet autoencoder architecture as in [23, 36] for the basic generator network, which consists of downsampling, residual and upsampling layers. We initialize the learning rate as 2e-5 and set the mini-batch size as 32. A smoothing mechanism has been shown to improve transferability against adversarially trained models [10]. Instead of smoothing the generated perturbation only after training as in CD-AP [36], we introduce an adaptive Gaussian smoothing kernel to compute $\bm{\delta}$ in Eq. (2) during the training phase, named adaptive Gaussian smoothing, so that the generated results gain adaptive ability. More implementation details and a discussion of other networks (e.g., BigGAN [3]) are provided in Appendix B.

Table 1: Transferability comparison for multi-target attacks on the ImageNet NeurIPS validation set (1k images) with the perturbation budget of $\ell_{\infty}\leq 16$. The results are averaged over 8 different target classes. Note that CD-AP requires training 8 models to obtain these results, while our method trains only one conditional generative model. * indicates white-box attacks.
Method Time (ms) Model Number Naturally Trained Adversarially Trained
Inv3 Inv4 IR-v2 R152 DN GN VGG-16 $\textrm{Inv3}_{\textrm{ens3}}$ $\textrm{Inv3}_{\textrm{ens4}}$ $\textrm{IR-v2}_{\textrm{ens}}$
Inv3 MIM \sim130 - 99.9 0.8 1.0 0.4 0.2 0.2 0.3 <0.1 0.1 <0.1
TI-MIM \sim130 - 99.9 0.9 1.1 0.4 0.4 0.3 0.5 0.1 0.2 0.1
SI-MIM \sim130 - 99.8 1.5 2.0 0.8 0.7 0.7 0.5 0.3 0.3 0.1
DIM \sim260 - 95.6 4.0 4.8 1.3 1.9 0.8 1.3 0.1 0.2 0.1
TI-DIM \sim260 - 96.0 4.4 5.1 1.4 2.4 1.1 1.8 0.3 0.4 0.2
SI-DIM \sim260 - 98.4 5.6 5.9 2.8 3.0 2.3 1.6 0.9 0.9 0.3
Logit \sim3900 - 99.6 5.6 6.5 1.7 3.0 0.8 1.5 0.2 0.3 0.1
CD-AP \sim 15 8 94.2 57.6 60.1 37.1 41.6 32.3 41.7 1.5 2.2 1.2
CD-AP-gs \sim 15 8 69.7 31.3 30.8 18.6 20.1 14.8 20.2 5.0 5.8 4.5
Ours \sim 15 1 93.4 66.9 66.6 41.6 46.4 40.0 45.0 39.7 37.2 32.2
R152 MIM \sim185 - 0.5 0.4 0.6 99.7 0.3 0.3 0.2 0.1 0.1 <0.1
TI-MIM \sim185 - 0.3 0.3 1.0 96.5 0.5 0.3 0.3 0.3 0.2 0.3
SI-MIM \sim185 - 1.3 1.2 1.6 99.5 1.0 1.4 0.7 0.3 0.4 0.2
DIM \sim370 - 2.8 3.1 5.0 93.6 3.5 1.7 1.3 0.4 0.4 0.3
TI-DIM \sim370 - 4.3 4.1 5.8 92.9 4.3 2.1 1.4 0.8 0.7 0.4
SI-DIM \sim370 - 7.2 8.4 10.4 97.4 7.6 6.4 2.6 0.8 0.7 1.3
Logit \sim5550 - 10.1 10.7 12.8 95.7 12.4 3.7 3.5 1.1 0.9 0.4
CD-AP \sim 10 8 33.3 43.7 42.7 96.6 53.8 36.6 34.1 15.7 15.2 12.0
CD-AP-gs \sim 10 8 7.8 11.3 10.0 53.6 20.4 8.7 12.5 4.9 6.4 6.2
Ours \sim 10 1 37.7 47.6 45.1 93.2 64.2 41.7 45.9 31.6 32.0 29.9
Figure 2: Comparison of different projection functions and modes of Gaussian Smoothing. Results are reported with Inv3 network on ImageNet NeurIPS validation set.
Table 2: The untargeted fooling ratio (UT-FR) and targeted fooling ratio (T-FR) for adversarial attacks on the ImageNet validation set (50k images) with the perturbation budget of $\ell_{\infty}\leq 10$. The attack is performed in the same setting as [59], with the target class 'sea lion' and the training dataset MS-COCO. * indicates white-box attacks.
Source Method VGG-16 (UT-FR / T-FR) VGG-19 (UT-FR / T-FR) R152 (UT-FR / T-FR)
VGG-16 UAE [59] 93.62* / 82.90* 82.99 / 13.69 36.03 / 0.01
VGG-16 Ours 95.30* / 83.54* 90.13 / 38.59 35.15 / 0.14
VGG-19 UAE [59] 83.40 / 44.53 92.53* / 75.61* 35.36 / 0.01
VGG-19 Ours 88.20 / 48.96 92.69* / 73.96* 35.96 / 0.14
R152 UAE [59] 55.05 / 1.63 55.12 / 1.05 82.58* / 70.20*
R152 Ours 83.90 / 29.81 83.24 / 24.81 91.14* / 80.47*

4.2 Transferability Evaluation

We consider the 8 target classes from [59] to form the multi-target black-box attack testing protocol, resulting in 8k attacks on the 1k-image ImageNet NeurIPS set.

Efficiency of multi-target black-box attacks. Among the compared methods, instance-specific attacks, i.e., MIM, DIM and Logit, require an iterative procedure with $M$ steps of gradient computation to obtain adversarial examples. Given the costs $t_{C}^{FP}$ and $t_{C}^{BP}$ of a forward and a backward pass through the classifier, the computing cost per image is $T^{IS}=t_{C}^{FP}\cdot M+t_{C}^{BP}\cdot M$ in Table 1. Instance-agnostic methods only require the inference cost of the trained generator, $T^{IA}=t_{G}^{FP}$, and are thus preferable for attack scenarios with limited time. However, the existing instance-agnostic methods need to train 8 models to cover the 8 different target classes. Due to their time-consuming training and larger storage, we only reproduce the strong generative method CD-AP [36] as a baseline, which has already demonstrated superior performance over other generative methods such as GAP [39] in its original work. In comparison, our conditional generative method trains only one model for inference and outperforms the other methods w.r.t. efficiency.
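As a rough, back-of-the-envelope illustration consistent with the timings reported in Table 1 (the per-pass costs below are illustrative estimates, not measured constants): with $t_{C}^{FP}+t_{C}^{BP}\approx 13$ ms for Inv3,

T^{IS}\approx 13\,\text{ms}\times 10=130\,\text{ms (MIM, }M=10\text{)},\qquad T^{IS}\approx 13\,\text{ms}\times 300\approx 3900\,\text{ms (Logit, }M=300\text{)},\qquad T^{IA}=t_{G}^{FP}\approx 15\,\text{ms},

so a single generator inference is roughly $9\times$ faster than MIM and more than $250\times$ faster than Logit per image.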

Effectiveness of multi-target black-box attacks. Table 1 shows the transferability comparison of different methods on both naturally and adversarially trained models. The success rates of instance-specific attacks are very unsatisfactory, possibly because overfitting to the data point makes it hard to transfer to another model. The instance-agnostic attack CD-AP obtains acceptable performance, yet it is inferior to the proposed method w.r.t. black-box transferability. The primary reasons for this trend lie in two distinctions: 1) the direct clip projection in CD-AP versus our smooth projection in Eq. (2), and 2) their Gaussian smoothing versus our adaptive Gaussian smoothing, as described in Sec. 4.1 and Appendix B. Fig. 2 empirically shows the comparison of single-target black-box attacks based on the CD-AP framework. The proposed conditional generative method can thus serve as a reliable baseline for targeted black-box attacks, regarding both effectiveness and efficiency.

Results of single-target black-box attacks. Recent related works, e.g., UAE [59] and TTP [37], report excellent single-target black-box performance based on universal perturbations or functions. We obtain a single-target version of our model by specifying one target label during training. The black-box targeted attack performance of the different methods is presented in Table 2. Besides, we analyze TTP and present the compared results in Appendix C. Furthermore, some other instance-agnostic adversarial methods, e.g., UAP [34], GAP [39] and RHP [29], focus on the untargeted black-box problem. We therefore also follow the corresponding untargeted setting and compare with these methods in Appendix C. Our method consistently improves under both targeted and untargeted black-box settings.

Table 3: Transferability comparison with the perturbation budget of $\ell_{\infty}\leq 16$. The white-box substitute model is Inv3 for all attacks, following the standard protocol [10] with 1,000 randomly specified target classes.
Targeted Black-box Attack in the NeurIPS 2017 Competition (1,000 target classes)
Method Inv4 IRv2 R152 DN GN VGG-16
MIM 0.1 <0.1 <0.1 0.3 0.1 <0.1
TI-MIM 0.3 0.3 <0.1 0.4 <0.1 0.1
SI-MIM 0.6 0.6 0.1 0.4 0.3 0.1
DIM 2.9 2.5 0.6 1.2 0.2 0.6
TI-DIM 2.9 2.5 0.5 1.7 0.3 1.0
SI-DIM 4.3 4.1 1.7 1.9 1.8 1.1
Logit 4.7 2.4 1.2 2.4 0.4 0.8
Ours 35.9 37.4 25.0 26.8 22.9 26.6
Table 4: The success rate of black-box impersonation attacks on face verification with the perturbation budget of $\ell_{\infty}\leq 16$. ArcFace is chosen as the white-box model.
Black-box Impersonation Attack in Face Recognition
Protocol Method FaceNet CosFace SphereFace MobileFace
I MIM 34.4 16.6 22.4 35.0
I DIM 38.8 21.2 27.4 44.3
I Ours 65.2 56.2 52.2 83.5
II MIM 31.3 13.6 21.1 22.3
II DIM 36.1 16.4 24.4 31.9
II Ours 66.8 49.1 47.9 67.8
Figure 3: Examples of generated adversarial images with the perturbation budget of $\ell_{\infty}\leq 16$. We separately adopt the ImageNet and MS-COCO datasets as the training data for generating targeted perturbations. Our method generates semantic patterns independent of the training dataset.
Figure 4: Attack success rate vs. number of conditional target classes against the Inv3 and VGG-16 models.
Figure 5: Comparison of loss convergence and transferability between diverse and close conditional subsets.

4.3 Effectiveness on NeurIPS 2017 Competition

To illustrate the effectiveness of our proposed attack method in a practical 1,000-class setting, we follow the official setting of the NeurIPS 2017 adversarial competition [27] for testing targeted black-box transferability. Given limited resources, previous instance-agnostic attacks are not included as baselines since they would require training 1,000 models; we thus focus on strong instance-specific attacks for comparison, including the official top attack methods in the NeurIPS 2017 adversarial competition. Compared with other instance-agnostic attacks, our hierarchical partition mechanism makes conditional generative networks capable of specifying any target class with a feasible number of models for scalability. Specifically, in this setting we consider 20 models, each specifying 50 diverse classes obtained from the k-DPP hierarchical partition, to implement the targeted attack with only one inference pass per target image. As shown in Table 3, our method clearly outperforms all other methods. In addition, these trained generative models can be directly applied to craft adversarial examples, which is more convenient and efficient than instance-specific attacks when handling large-scale datasets (e.g., millions of images).

4.4 Effectiveness on Realistic Face Recognition

Adversarial perturbations added to original face images can evade recognition or impersonate another individual [42, 57]. In this section, we consider the transferability of impersonation attacks to further illustrate the generalization of our method, which corresponds to the targeted attack in the image classification task.

Dataset and models. We conduct the experiments on Labeled Faces in the Wild (LFW) [19] and introduce two test protocols. For Protocol I, defined as single-target impersonation attack, we choose 1 target identity and 1k source face images belonging to different identities from LFW as the attackers, thus forming 1k pairs. For Protocol II, named multi-target impersonation attack, 5 target identities and 1k source face images are selected to form the attack pairs, which requires implementing 5k attacks. We involve several strong face recognition models for black-box testing, including SphereFace [32], CosFace [50], FaceNet [41] and MobileFace [4]. These models differ in architectures and training objectives. In all experiments, we only use one model, ArcFace [7], as the substitute model to craft adversarial examples, and test attack performance against the other unknown models.

Figure 6: The second and third columns show the adversarial examples and the extracted perturbations scaled to image-pixel space, respectively. The last column presents the predictive confidence obtained by directly feeding the extracted perturbation into the classifier.
Figure 7: Plots of the logit vectors of the adversarial image ($L_{img}$) and the scaled crafted perturbation ($L_{adv}$) for MIM and the proposed generative method, with their respective PCC values.

Evaluation metrics. We first compute the optimal threshold of every face recognition model on the LFW dataset by following the standard protocols. If the similarity of a pair of images exceeds the threshold, we regard them as the same identity; otherwise, as different identities.
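As a small sketch of this decision rule (cosine similarity over normalized embeddings is assumed here; each model in practice may use its own distance measure and threshold):

import torch
import torch.nn.functional as F

def same_identity(model, img_a, img_b, threshold):
    # Judge a pair as the same identity if embedding similarity exceeds the model's optimal threshold
    with torch.no_grad():
        fa = F.normalize(model(img_a), dim=1)
        fb = F.normalize(model(img_b), dim=1)
    sim = (fa * fb).sum(dim=1)                 # cosine similarity of the two embeddings
    return sim > threshold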

Black-box attack results. We adjust the optimization objective to adapt the chosen attack methods to face recognition (detailed in Appendix D), and report the success rates of black-box impersonation attacks in Table 4. Our method achieves nearly twice the success rates of DIM in both Protocol I and Protocol II. The results indicate that our method is superior to other methods beyond image classification as well.

4.5 Comparison Study about Target Classes

We conduct an extensive study to investigate two key points about target classes.

Different numbers of target classes. We evaluate the effectiveness for different numbers of target classes in Fig. 4. The results are strong within a feasible number of targets, whereas beyond a certain point the effectiveness tends to decay. Therefore, the effectiveness of conditional generative networks is influenced by the number of conditional classes, due to the representative capacity of a single generator. We thus divide all classes into a feasible number of subsets when handling a large number of classes.

Comparison of different multi-target conditions. We select closer conditional classes with larger cosine similarity in the target class space $\phi_{cls}$, and diverse conditional classes obtained by the k-DPP method. As shown in Fig. 5, closer conditional classes lead to worse loss convergence and transferability than diverse ones, due to mutual influence among the conditions.

4.6 More Analyses

Targeted adversarial examples from the proposed generative method exhibit semantic patterns inherent to the target class, as shown in Fig. 3. Why do these generated semantic patterns work?

First, generative methods can produce strong targeted semantic patterns that are robust to the influence of data, obtained by minimizing the loss of the specified target class in the training phase. To corroborate this claim, we directly feed the scaled perturbations crafted by the instance-specific attack MIM and by our generative method into the classifier. Indeed, we find that our generative perturbation is classified as the target class with high confidence, whereas the perturbation from MIM behaves like noise, as shown in Fig. 6. Furthermore, we plot the logit relationship by computing PCC (Pearson correlation coefficient) values between the scaled crafted perturbation and the adversarial image in Fig. 7. The numerical results are also consistent with this claim.

Second, the generated adversarial semantic pattern generalizes well across different models. We feed 1k images from the ImageNet test set into the generator trained against the Inv3 model to obtain 1k semantic patterns, which are scaled to image pixel space and then fed into different classifiers. We obtain a mean confidence of 0.46 for DN, 0.44 for Inv4, and 0.35 for R152, whereas the confidence of the perturbation from MIM is lower than 0.01. The results show that our scaled semantic pattern directly generalizes well across models, possibly because different classifiers trained on the same data distribution exploit similar feature knowledge for the same class. Such similar patterns are thus instrumental for transferability among models.
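A sketch of this check under our assumptions (the scaling follows the footnote in Sec. 3.1; classifier-specific preprocessing is omitted for brevity):

import torch
import torch.nn.functional as F

def pattern_confidence(G, classifier, x, target_one_hot, target_class, eps=16/255.):
    # Scale the crafted perturbation to image range and read the classifier's target-class confidence
    with torch.no_grad():
        delta = G(x, target_one_hot)                    # values in [-eps, eps]
        pattern = (delta + eps) / (2 * eps)             # linearly scaled to [0, 1] (i.e., [0, 255])
        probs = F.softmax(classifier(pattern), dim=1)
    return probs[:, target_class].mean().item()         # mean confidence of the target class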

5 Discussion and Conclusion

The transferability of targeted black-box attacks is affected by both the data and the model. Instance-specific methods easily overfit the data point and the white-box model, resulting in weak transferability. In comparison, the proposed generative method with powerful learning capacity reduces the dependence on individual data points by training on unlabeled data, enabling the model to learn semantic patterns and improve the transferability of targeted black-box attacks. Extensive experiments demonstrate that the proposed generative method significantly improves the success rates of targeted black-box attacks against various models, while achieving more than an order of magnitude speedup over gradient-based methods. This method can therefore be regarded as a new baseline for targeted black-box attacks and provides a novel framework to explore the vulnerabilities of DNNs.

Acknowledgement. This work was supported by the National Key Research and Development Program of China (Nos. 2020AAA0104304, 2017YFA0700904), NSFC Projects (Nos. 62061136001, 61621136008, 62076147, U19B2034, U19A2081, U1811461), the major key project of PCL (No. PCL2021A12), Tsinghua-Alibaba Joint Research Program, Tsinghua-OPPO Joint Research Center, and the High Performance Computing Center, Tsinghua University.

References

  • [1] Berthelot, D., Schumm, T., Metz, L.: Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017)
  • [2] Bircanoğlu, C.: https://www.kaggle.com/cenkbircanoglu/comic-books-classification. Kaggle (2017)
  • [3] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
  • [4] Chen, S., Liu, Y., Gao, X., Han, Z.: Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In: Chinese Conference on Biometric Recognition. pp. 428–438. Springer (2018)
  • [5] Demontis, A., Melis, M., Pintor, M., Jagielski, M., Biggio, B., Oprea, A., Nita-Rotaru, C., Roli, F.: Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks. In: 28th USENIX Security Symposium (USENIX Security 19). pp. 321–338 (2019)
  • [6] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
  • [7] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4690–4699 (2019)
  • [8] Dong, Y., Fu, Q.A., Yang, X., Pang, T., Su, H., Xiao, Z., Zhu, J.: Benchmarking adversarial robustness. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
  • [9] Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J.: Boosting adversarial attacks with momentum. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • [10] Dong, Y., Pang, T., Su, H., Zhu, J.: Evading defenses to transferable adversarial examples by translation-invariant attacks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [11] Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., Song, D.: Robust physical-world attacks on deep learning visual classification. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1625–1634 (2018)
  • [12] Gao, L., Zhang, Q., Song, J., Liu, X., Shen, H.T.: Patch-wise attack for fooling deep neural network. In: European Conference on Computer Vision. pp. 307–322. Springer (2020)
  • [13] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http://www.deeplearningbook.org
  • [14] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (ICLR) (2015)
  • [15] Han, J., Dong, X., Zhang, R., Chen, D., Zhang, W., Yu, N., Luo, P., Wang, X.: Once a man: Towards multi-target attack via learning multi-target adversarial network once. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5158–5167 (2019)
  • [16] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision (ECCV). pp. 630–645. Springer (2016)
  • [17] Hendrycks, D., Carlini, N., Schulman, J., Steinhardt, J.: Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916 (2021)
  • [18] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
  • [19] Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report (2007)
  • [20] Huang, Q., Katsman, I., He, H., Gu, Z., Belongie, S., Lim, S.N.: Enhancing adversarial example transferability with an intermediate level attack. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4733–4742 (2019)
  • [21] Inkawhich, N., Liang, K., Wang, B., Inkawhich, M., Carin, L., Chen, Y.: Perturbing across the feature hierarchy to improve standard and strict blackbox attack transferability. Advances in Neural Information Processing Systems 33, 20791–20801 (2020)
  • [22] Inkawhich, N., Liang, K.J., Carin, L., Chen, Y.: Transferable perturbations of deep feature distributions. arXiv preprint arXiv:2004.12519 (2020)
  • [23] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision. pp. 694–711. Springer (2016)
  • [24] Kulesza, A., Taskar, B.: k-dpps: Fixed-size determinantal point processes. In: ICML (2011)
  • [25] Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012)
  • [26] Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. In: International Conference on Learning Representations (ICLR) Workshops (2017)
  • [27] Kurakin, A., Goodfellow, I., Bengio, S., Dong, Y., Liao, F., Liang, M., Pang, T., Zhu, J., Hu, X., Xie, C., et al.: Adversarial attacks and defences competition. In: The NIPS’17 Competition: Building Intelligent Systems, pp. 195–231. Springer (2018)
  • [28] Li, M., Deng, C., Li, T., Yan, J., Gao, X., Huang, H.: Towards transferable targeted attack. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 641–649 (2020)
  • [29] Li, Y., Bai, S., Xie, C., Liao, Z., Shen, X., Yuille, A.L.: Regional homogeneity: Towards learning transferable universal adversarial perturbations against defenses. arXiv preprint arXiv:1904.00979 (2019)
  • [30] Lin, J., Song, C., He, K., Wang, L., Hopcroft, J.E.: Nesterov accelerated gradient and scale invariance for adversarial attacks. In: International Conference on Learning Representations (2019)
  • [31] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
  • [32] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 212–220 (2017)
  • [33] Liu, Y., Chen, X., Liu, C., Song, D.: Delving into transferable adversarial examples and black-box attacks. In: ICLR (2017)
  • [34] Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1765–1773 (2017)
  • [35] Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [36] Naseer, M.M., Khan, S.H., Khan, M.H., Khan, F.S., Porikli, F.: Cross-domain transferability of adversarial perturbations. In: Advances in Neural Information Processing Systems. pp. 12905–12915 (2019)
  • [37] Naseer, M., Khan, S., Hayat, M., Khan, F.S., Porikli, F.: On generating transferable targeted perturbations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7708–7717 (2021)
  • [38] NeurIPS: https://www.kaggle.com/c/nips-2017-defense-against-adversarial-attack/data. Kaggle (2017)
  • [39] Poursaeed, O., Katsman, I., Gao, B., Belongie, S.: Generative adversarial perturbations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4422–4431 (2018)
  • [40] Reddy Mopuri, K., Krishna Uppala, P., Venkatesh Babu, R.: Ask, acquire, and attack: Data-free uap generation using class impressions. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 19–34 (2018)
  • [41] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 815–823 (2015)
  • [42] Sharif, M., Bhagavatula, S., Bauer, L., Reiter, M.K.: Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. pp. 1528–1540 (2016)
  • [43] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [44] Song, Y., Shu, R., Kushman, N., Ermon, S.: Constructing unrestricted adversarial examples with generative models. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)
  • [45] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI (2017)
  • [46] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1–9 (2015)
  • [47] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [48] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (ICLR) (2014)
  • [49] Tramèr, F., Kurakin, A., Papernot, N., Boneh, D., McDaniel, P.: Ensemble adversarial training: Attacks and defenses. In: International Conference on Learning Representations (ICLR) (2018)
  • [50] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5265–5274 (2018)
  • [51] Wu, D., Wang, Y., Xia, S.T., Bailey, J., Ma, X.: Skip connections matter: On the transferability of adversarial examples generated with resnets. arXiv preprint arXiv:2002.05990 (2020)
  • [52] Xie, C., Zhang, Z., Zhou, Y., Bai, S., Wang, J., Ren, Z., Yuille, A.L.: Improving transferability of adversarial examples with input diversity. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [53] Xu, K., Li, C., Zhu, J., Zhang, B.: Understanding and stabilizing gans’ training dynamics with control theory. arXiv preprint arXiv:1909.13188 (2019)
  • [54] Yang, X., Dong, Y., Pang, T., Xiao, Z., Su, H., Zhu, J.: Controllable evaluation and generation of physical adversarial patch on face recognition. arXiv e-prints pp. arXiv–2203 (2022)
  • [55] Yang, X., Dong, Y., Pang, T., Zhu, J., Su, H.: Towards privacy protection by generating adversarial identity masks. arXiv preprint arXiv:2003.06814 (2020)
  • [56] Yang, X., Wei, F., Zhang, H., Zhu, J.: Design and interpretation of universal adversarial patches in face detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16. pp. 174–191. Springer (2020)
  • [57] Yang, X., Yang, D., Dong, Y., Yu, W., Su, H., Zhu, J.: Delving into the adversarial robustness on face recognition. arXiv preprint arXiv:2007.04118 (2020)
  • [58] Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014)
  • [59] Zhang, C., Benz, P., Imtiaz, T., Kweon, I.S.: Understanding adversarial examples from the mutual influence of images and perturbations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14521–14530 (2020)
  • [60] Zhao, Z., Liu, Z., Larson, M.: On success and simplicity: A second look at transferable targeted attacks. Advances in Neural Information Processing Systems 34 (2021)

Appendix 0.A Sampling Algorithm

We summarize the overall sampling procedure based on k-DPP [24] in Algorithm 2.

  • Compute the RBF kernel matrix $L$ of $\phi_{cls}$ and the eigendecomposition of $L$.

  • Choose a random subset $V$ of the eigenvectors by treating the eigenvalues as sampling probabilities.

  • Select a new class $c_{i}$ to add to the set and update $V$ in a manner that de-emphasizes items similar to the one selected.

  • Update $V$ by Gram–Schmidt orthogonalization, so that the distribution shifts to avoid points near those already chosen.

By performing Algorithm 2, we obtain a subset of size $k$. Thus, when handling $K$ conditional classes, we hierarchically apply this algorithm to obtain the final $K/k$ subsets, which are regarded as conditional variables of the generative models to craft adversarial examples.

Algorithm 2 Sampling Algorithm by kDPP
1: Input: classifier weight vectors $\phi_{cls}$; subset size $k$.
2: Output: a subset $C$.
3: Compute the RBF kernel matrix $L$ of $\phi_{cls}$;
4: Compute the eigenvector/eigenvalue pairs $\{v_{n},\lambda_{n}\}_{n=1}^{N}$ of $L$;
5: // Phase I:
6: $J\leftarrow\emptyset$, $e_{k}(\lambda_{1},\ldots,\lambda_{N})=\sum_{|J|=k}\prod_{n\in J}\lambda_{n}$;
7: for $n=N,\ldots,1$ do
8:     if $u\sim U[0,1]<\lambda_{n}\frac{e^{n-1}_{k-1}}{e^{n}_{k}}$ and $k>0$ then
9:         $J\leftarrow J\cup\{n\}$; $k\leftarrow k-1$;
10:     end if
11: end for
12: // Phase II:
13: $V\leftarrow\{v_{n}\}_{n\in J}$, $C\leftarrow\emptyset$;
14: while $|V|>0$ do
15:     Select $c_{i}$ from $\mathcal{C}$ with $\operatorname{P}(c_{i})=\frac{1}{|V|}\sum_{v\in V}(v^{\top}e_{i})^{2}$;
16:     $C\leftarrow C\cup\{c_{i}\}$; $V\leftarrow V_{\perp}$, an orthonormal basis for the subspace of $V$ orthogonal to $e_{i}$;
17: end while
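A NumPy sketch of this two-phase sampler is given below; the RBF bandwidth, the random-number handling, and other numerical details are illustrative assumptions rather than the exact implementation.

import numpy as np

def sample_kdpp(weights, k, gamma=1.0, rng=None):
    # Sample a diverse size-k subset of class indices via a k-DPP over an RBF kernel
    # built from the classifier weight vectors `weights` (shape D x K), one column per class.
    rng = np.random.default_rng() if rng is None else rng
    X = weights.T                                            # K x D, one row per class
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
    L = np.exp(-gamma * sq)                                  # RBF kernel matrix
    lam, vecs = np.linalg.eigh(L)                            # eigendecomposition of L

    # Phase I: choose k eigenvectors using the elementary symmetric polynomials e_l^n
    N = len(lam)
    E = np.zeros((k + 1, N + 1)); E[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, N + 1):
            E[l, n] = E[l, n - 1] + lam[n - 1] * E[l - 1, n - 1]
    J, l = [], k
    for n in range(N, 0, -1):
        if l == 0:
            break
        if rng.random() < lam[n - 1] * E[l - 1, n - 1] / E[l, n]:
            J.append(n - 1); l -= 1

    # Phase II: sequentially select items, shifting the distribution away from chosen ones
    V = vecs[:, J]
    chosen = []
    while V.shape[1] > 0:
        p = (V ** 2).sum(axis=1) / V.shape[1]                # P(i) = (1/|V|) * sum_v (v^T e_i)^2
        i = rng.choice(N, p=p / p.sum())
        chosen.append(i)
        j = np.argmax(np.abs(V[i, :]))                       # a column with nonzero i-th entry
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])         # zero out the i-th coordinate
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                           # Gram-Schmidt re-orthonormalization
    return chosen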
Figure 8: Some examples of adversarial images with the perturbation budget of $\ell_{\infty}\leq 16$. We separately adopt the ImageNet, MS-COCO and Comics datasets as the training data for generating targeted perturbations.

Appendix 0.B Some Implementation Details

The study of the smoothing mechanism. The smoothing mechanism has been shown to improve transferability against adversarially trained models. CD-AP [36] uses a direct clip projection to enforce the fixed norm $\epsilon$, and applies smoothing to the generated perturbation only after the generator $\mathcal{G}$ has been trained, i.e.,

\textbf{Train: }\bm{x}_{s_{i}}^{*}=\mathrm{Clip}_{\epsilon}(\mathcal{G}(\bm{x}_{s_{i}})),\qquad\textbf{Test: }\bm{x}_{s_{i}}^{*}=\mathrm{W}*\mathrm{Clip}_{\epsilon}(\mathcal{G}(\bm{x}_{s_{i}})),  (4)

where $\mathrm{W}$ denotes a Gaussian smoothing kernel of size 3, $*$ denotes the convolution operation, and $\mathrm{Clip}_{\epsilon}$ clips values outside the fixed norm $\epsilon$. As a comparison, we introduce an adaptive Gaussian smoothing kernel to compute the adversarial images $\bm{x}_{s_{i}}^{*}$ in the training phase, named adaptive Gaussian smoothing:

\textbf{Train \& Test: }\bm{x}_{s_{i}}^{*}=\epsilon\cdot\mathrm{W}*\mathrm{tanh}(\mathcal{G}(\bm{x}_{s_{i}}))+\bm{x}_{s_{i}},  (5)

which makes the generated results adaptive already in the training phase. We perform training on the ImageNet dataset to report all results, including the comparable baselines.
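A sketch of the adaptive smoothing in Eq. (5), applied identically at train and test time; the kernel size and sigma are illustrative choices, and f is assumed to be the generator's pre-projection decoder output from Sec. 3.2.

import torch
import torch.nn.functional as F

def gaussian_kernel(size=3, sigma=1.0, channels=3):
    # A fixed depthwise Gaussian kernel W (size and sigma are illustrative)
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(g, g)
    k2d = k2d / k2d.sum()
    return k2d.expand(channels, 1, size, size).clone()

def adaptive_smooth_perturb(f, x, eps=16/255., kernel=None):
    # Eq. (5): x* = eps * (W * tanh(f)) + x; smoothing a [-1, 1] signal with a normalized
    # non-negative kernel keeps it in [-1, 1], so the l_inf bound eps is preserved
    kernel = gaussian_kernel(channels=x.size(1)).to(x.device) if kernel is None else kernel
    smoothed = F.conv2d(torch.tanh(f), kernel, padding=kernel.size(-1) // 2, groups=x.size(1))
    return torch.clamp(x + eps * smoothed, 0.0, 1.0)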

Network architecture of the generator. We adopt the same autoencoder architecture as [36] for the basic generator network. Besides, we also explore BigGAN [3] as the conditional generator network. It obtains very weak testing performance even in the white-box attack scenario, possibly explained by the weak diversity of BigGAN's Gaussian latent variable in the training phase, whereas the autoencoder can take full advantage of a large-scale training dataset, e.g., ImageNet. Furthermore, we also train the autoencoder with Gaussian noise as the training data and obtain similarly inferior performance in the white-box attack scenario, indicating that a large-scale training dataset is crucial for generating transferable targeted adversarial examples.

Appendix 0.C Additional Experimental Results

Results on different datasets. We craft adversarial examples on different datasets, including the ImageNet training set, MS-COCO, and the Comics dataset [2], which consist of 1.2M, 82k, and 50k images, respectively. The MS-COCO dataset is used for large-scale object detection and segmentation, and the images from the Comics dataset belong to a domain different from the natural images in ImageNet. Despite these diverse training sources, we still find a common property of the adversarial examples crafted by our method. Specifically, we craft adversarial images with the perturbation budget of $\ell_{\infty}\leq 16$, separately adopting the ImageNet, MS-COCO, and Comics datasets as the training data for generating targeted perturbations. As illustrated in Fig. 8, we produce semantic patterns independent of the training dataset.

Table 5: Comparison results of targeted black-box attacks on different datasets. Inv3 is the substitute model.
Dataset DN VGG-16 GN
ImageNet 79.9 81.9 73.2
MS-COCO 70.3 71.3 64.1
Comics 60.4 63.0 61.3

We also report the success rates of targeted black-box attacks in Table 5. We experimentally find that the semantic patterns derived from the ImageNet dataset achieve better black-box performance, possibly explained by the more diverse data in ImageNet.

Results of untargeted black-box attacks. We evaluate our method against other generative methods, including UAP [34], GAP [39] and RHP [29]. Untargeted transfer from naturally trained models to adversarially trained models is difficult due to differences in model sources, data types and other factors, making this a challenging comparison. As illustrated in Table 6, we report the increase in error rate from clean to adversarial images to evaluate the different untargeted attacks. Our method consistently improves over the baselines on different black-box models in the untargeted black-box setting.

Results for different $\epsilon$. We also present results with reduced perturbation budgets in Table 7 to verify the consistent effectiveness. Furthermore, we choose the smaller perturbation budget of $\ell_{\infty}\leq 8$ to make the adversarial examples more imperceptible. In this setting, the proposed generative method still outperforms the state-of-the-art iterative attack Logit [60] by a large margin.

Compared results with TTP [37]. TTP proposes a generative approach for highly transferable targeted perturbations by introducing mutual distribution matching. To compare performance, we conduct multi-target black-box experiments by adopting 8 mutually exclusive targeted sets. 1) Efficiency: TTP needs to train 8 models to perform an 8-class targeted attack, whereas our conditional generative method trains only one model for inference. 2) Effectiveness: TTP obtains black-box attack success rates comparable to ours, as shown in Table 8. Overall, the proposed conditional generative method can serve as a better baseline for targeted black-box attacks regarding both effectiveness and efficiency.

Table 6: Untargeted transferability, measured as the increase in error rate after attack, on a subset of ImageNet (5k images) with perturbation budgets of $\ell_{\infty}\leq 16/32$.
Source Method $\textrm{Inv3}_{\textrm{ens3}}$ ($\epsilon=16$ / $\epsilon=32$) $\textrm{Inv3}_{\textrm{ens4}}$ ($\epsilon=16$ / $\epsilon=32$) $\textrm{IR-v2}_{\textrm{ens}}$ ($\epsilon=16$ / $\epsilon=32$)
Inv3 UAP [34] 1.00 / 7.82 1.80 / 5.60 1.88 / 5.60
Inv3 GAP [39] 5.48 / 33.3 4.14 / 29.4 3.76 / 22.5
Inv3 RHP [29] 32.5 / 60.8 31.6 / 58.7 24.6 / 57.0
Inv4 UAP [34] 2.08 / 7.68 1.94 / 6.92 2.34 / 6.78
Inv4 RHP [29] 27.5 / 60.3 26.7 / 62.5 21.2 / 58.5
IR-v2 UAP [34] 1.88 / 8.28 1.74 / 7.22 1.96 / 8.18
IR-v2 RHP [29] 29.7 / 62.3 29.8 / 63.3 26.8 / 62.8
CD-AP [36] 28.34 / 71.3 29.9 / 66.72 19.84 / 60.88
CD-AP-gs [36] 41.06 / 71.96 42.68 / 71.58 37.4 / 72.86
Ours 46.20 / 72.58 42.98 / 72.34 37.9 / 73.26
Table 7: Comparison results of targeted black-box attacks under different $\epsilon$.
Source Method VGG-16 ($\epsilon=16$ / $\epsilon=12$ / $\epsilon=8$) R152 ($\epsilon=16$ / $\epsilon=12$ / $\epsilon=8$)
Inv3 Logit [60] 4.4 / 3.4 / 2.1 1.2 / 1.1 / 0.8
Inv3 Ours 61.9 / 53.7 / 36.1 49.6 / 31.0 / 16.3
Table 8: Comparison results of targeted black-box attacks with TTP.
Source Method Inv4 IR-v2 R152 DN GN VGG-16
Inv3 TTP 65.4 55.3 39.4 44.0 35.9 36.1
Ours 66.9 66.6 41.6 46.4 40.0 45.0

Appendix 0.D Impersonation Attack of Face Recognition

We list the attack methods for face recognition as follows. Given an input $\bm{x}$ and an image $\bm{x}^{r}$ belonging to another identity, an attack method generates an adversarial example $\bm{x}^{adv}$ with perturbation budget $\epsilon$ under the $\ell_{p}$ norm ($\|\bm{x}^{adv}-\bm{x}\|_{p}\leq\epsilon$). The impersonation attack aims to achieve the objective

\mathcal{C}(\bm{x}^{adv},\bm{x}^{r})=\mathbb{I}(\mathcal{D}_{f}(\bm{x}^{adv},\bm{x}^{r})<\delta),  (6)

where $\mathbb{I}$ is the indicator function, $\delta$ is a threshold, and $\mathcal{D}_{f}(\bm{x}^{adv},\bm{x}^{r})=\|f(\bm{x}^{adv})-f(\bm{x}^{r})\|_{2}^{2}$.

The Basic Iterative Method (BIM) [26] extends FGSM [14] by iteratively taking multiple small gradient steps:

\bm{x}_{t+1}^{adv}=\mathrm{clip}_{\bm{x},\epsilon}\big(\bm{x}_{t}^{adv}-\alpha\cdot\mathrm{sign}(\nabla_{\bm{x}}\mathcal{D}_{f}(\bm{x}_{t}^{adv},\bm{x}^{r}))\big),  (7)

where $\mathrm{clip}_{\bm{x},\epsilon}$ projects the adversarial example to satisfy the $\ell_{\infty}$ constraint and $\alpha$ is the step size.

The Momentum Iterative Method (MIM) [9] introduces a momentum term into BIM to improve the transferability of adversarial examples:

\bm{g}_{t+1}=\mu\cdot\bm{g}_{t}+\frac{\nabla_{\bm{x}}\mathcal{D}_{f}(\bm{x}_{t}^{adv},\bm{x}^{r})}{\|\nabla_{\bm{x}}\mathcal{D}_{f}(\bm{x}_{t}^{adv},\bm{x}^{r})\|_{1}};\qquad\bm{x}^{adv}_{t+1}=\mathrm{clip}_{\bm{x},\epsilon}(\bm{x}^{adv}_{t}-\alpha\cdot\mathrm{sign}(\bm{g}_{t+1})).  (8)

The training objective of our generative method minimizes the feature distance between the perturbed image produced by the generator and the target identity:

\min_{\theta}\;\mathbb{E}_{(\bm{x}\sim\mathcal{X},\,c\sim\mathcal{C})}\big[\mathcal{D}_{f}\big(\bm{x}+\mathcal{G}_{\theta}(\bm{x},c),\bm{x}^{r}_{c}\big)\big],  (9)

where $\bm{x}^{r}_{c}$ refers to $\bm{x}^{r}$ with the corresponding identity $c$. In the training phase, we randomly select 1,000 identities from CASIA-WebFace [58] as the training dataset to craft adversarial examples. Therefore, our method can be applied beyond image classification as well.
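For concreteness, a sketch of the feature distance $\mathcal{D}_{f}$ and of one MIM-style update from Eq. (8) adapted to this objective is given below; the embedding model f, the step size, and the momentum value are illustrative assumptions.

import torch

def feature_distance(f, x_adv, x_ref):
    # D_f(x_adv, x_ref) = || f(x_adv) - f(x_ref) ||_2^2, averaged over the batch
    return ((f(x_adv) - f(x_ref)) ** 2).sum(dim=1).mean()

def mim_impersonation_step(f, x, x_adv, x_ref, g, eps=16/255., alpha=2/255., mu=1.0):
    # One momentum iterative update that moves x_adv's embedding toward the target identity x_ref
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = feature_distance(f, x_adv, x_ref)
    grad = torch.autograd.grad(loss, x_adv)[0]
    g = mu * g + grad / grad.abs().sum(dim=(1, 2, 3), keepdim=True)   # momentum accumulation, Eq. (8)
    x_adv = x_adv.detach() - alpha * g.sign()                          # descend to reduce D_f
    x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project into the l_inf ball around x
    return x_adv, g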

Appendix 0.E More Examples

We also show more semantic patterns from different target models, as illustrated in Fig. 9.

Figure 9: Some examples of adversarial images with the perturbation budget of $\ell_{\infty}\leq 16$. We separately adopt the ImageNet, MS-COCO and Comics datasets as the training data for generating targeted perturbations.