
Classifier Guidance Enhances Diffusion-based Adversarial Purification by Preserving Predictive Information

Mingkun Zhang    Jianing Li    Wei Chen Corresponding Author. Email: chenwei2022@ict.ac.cn    Jiafeng Guo    Xueqi Cheng CAS Key Laboratory of AI Safety
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Key Laboratory of Network Data Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
Abstract

Adversarial purification is one of the promising approaches to defend neural networks against adversarial attacks. Recently, methods utilizing diffusion probabilistic models have achieved great success for adversarial purification in image classification tasks. However, such methods fall into the dilemma of balancing the needs for noise removal and information preservation. This paper points out that existing adversarial purification methods based on diffusion models gradually lose sample information during the core denoising process, causing occasional label shift in subsequent classification tasks. As a remedy, we suggest suppressing such information loss by introducing guidance from the classifier confidence. Specifically, we propose the Classifier-cOnfidence gUided Purification (COUP) algorithm, which purifies adversarial examples while keeping away from the classifier decision boundary. Experimental results show that COUP can achieve better adversarial robustness under strong attack methods.


1 Introduction

Extensive research has shown that neural networks are vulnerable to well-designed adversarial examples, which are created by adding imperceptible perturbations on benign samples [12, 28, 3, 6]. Various approaches have been explored to improve model robustness, including model training enhancement [28, 49, 14] and input data preprocessing [33, 26]. While these works have significantly improved adversarial robustness, there is still a clear gap in the classification accuracy between clean and adversarial data.

In recent years, adversarial purification with diffusion probabilistic models [29, 46, 4, 45, 41] has become an effective approach to defend against adversarial attacks in image classification tasks. The key idea is to preprocess the input image using an auxiliary diffusion model before feeding it into the downstream classifier. Leveraging the strong ability of generative models to fit data distributions, adversarial purification methods are able to purify the adversarial examples by pushing them toward the manifold of benign data. Such a process is essentially a denoising process, which gradually removes possible noise from input data.

Though adversarial purification methods have achieved advanced performance on robust image classification tasks, they rely solely on the denoising function during purification and thus inevitably fall into the dilemma of balancing the need for noise removal and information preservation [29]. While stronger purification may destroy the image details that are necessary for classification, weaker purification may not be sufficient to remove the adversarial perturbations completely. The passive strategy to balance this, i.e. controlling the global purification steps [29], is limited in its effect, in the sense that data information is monotonically lost as the purification steps grow. The existing way to mitigate the loss of information is to constrain the distance between the input adversarial example and the purified image [41, 45]. However, such a constraint may inhibit the purified example from escaping the adversarial region effectively.

In this paper, we aim to propose a method that directly takes into consideration the need for information preservation. We borrow the idea of classifier guidance for diffusion models [39, 17, 8, 21], using the classifier confidence on the current class label y given data x as an indicator of the degree of preservation, and try to maintain high confidence during the purification process. Staying away from low-confidence areas is beneficial to successful purification since such areas are close to the decision boundary and are more sensitive to small perturbations. Approaching a low-confidence area can result in a potential label shift problem, i.e. a sample that initially has the correct label is misclassified after purification, especially when combined with stochastic defense strategies.

Specifically, we propose a Classifier-cOnfidence gUided Purification algorithm (COUP) with diffusion models to match the requirement of information preservation. The key idea is to gradually push input data towards high probability density regions while keeping relatively high confidence for classification. This process is realized by applying the denoising process together with a regularization term which improves the confidence score of the downstream classifier. This guidance discourages the purification process from moving toward decision boundaries, where the classifier becomes confused, and the confidence decreases.

We empirically evaluate our algorithm using strong adversarial attack methods, including AutoAttack [6], which contains both white-box and black-box attacks, Backward Pass Differentiable Approximation (BPDA) [1], as well as EOT [1] to tackle the randomness in the defense strategy. Results show that COUP outperforms purification methods without classifier guidance, e.g., DiffPure [29], in terms of robustness on the CIFAR-10 and CIFAR-100 datasets.

Our work has the following main contributions:

  • We propose a new adversarial purification algorithm COUP. By leveraging the confidence score from the downstream classifier, COUP is able to preserve the predictive information while removing the malicious perturbation.

  • We provide both theoretical and empirical analysis of the effect of confidence guidance, showing that keeping away from the decision boundary preserves predictive information and alleviates the label shift problem, both of which are beneficial for classification. Though classifier guidance has been proven to be useful for better generation quality in previous works [17], we are, to the best of our knowledge, the first to demonstrate its necessity for adversarial purification.

  • Experiments demonstrate that COUP can achieve significantly higher adversarial robustness against strong attack methods, reaching a robustness score of 73.05\% for l_{\infty} and 83.13\% for l_{2} under AutoAttack on the CIFAR-10 dataset.

(a) Objective
(b) Purification Path
(c) Purified Image
Figure 1: The distinction between the existing purification method and classifier-confidence guided purification, in terms of (a) the purification objective, (b) a visualization of the purification path, and (c) the resultant purified image. In (a), the blue curve and green curve represent p(x|y=0) and p(x|y=1), respectively. The orange dotted line indicates the optimization objective (p(x) or \max_{y}p(x|y)) and the direction of the gradient is shown accordingly by the orange arrow. This comparison underscores the importance of classifier confidence guidance in directing the purification process toward the category center. The purification path in (b) demonstrates that classifier guidance effectively preserves essential predictive information, which is crucial for successful classification. Furthermore, the purified images shown in (c) serve as evidence that classifier guidance retains the information necessary for enhanced purification quality.

2 Related Work

Adversarial Training Adversarial training strengthens the discriminative model by enriching the training data [28]. Such methods include generating adversarial examples during model training [28, 49, 30, 20], or using an auxiliary generative model for data augmentation [14, 31, 42]. Though effective, such methods still face adversarial vulnerability to unseen threats [24] and suffer from computational complexity during training [43]. These works are orthogonal to ours and can be combined with our purification method.

Adversarial Purification Adversarial purification is another effective approach to defense. The idea is to use a generative model to purify the adversarial examples before feeding them into the discriminative model for classification. Based on different generative models [11, 22, 25, 18, 39], corresponding purification methods are proposed [33, 26, 10, 15, 16, 4] to convert the perturbed sample into a benign one. Recently, adversarial purification methods based on diffusion models have been proposed and achieved better performance [48, 46, 29, 45, 41]. Among these works, DiffPure [29] achieves the most remarkable result, which is the focus of our comparison.

Classifier Guided Diffusion Models Diffusion models [35, 18, 36, 39] are recently proposed generative models achieving high generation quality on images. Some works further leverage the guidance of a classifier to achieve controllable generation and improve image synthesis [17, 39, 8, 21]. Although the idea of classifier guidance has been proven to be beneficial for better image generation quality, whether guided diffusion is helpful for adversarial purification has not yet been verified. In our work, we utilize classifier guidance to mitigate the loss of predictive information, so as to strike a balance between information preservation and purification.

3 Objective of Adversarial Purification

In this section, we present an objective of adversarial purification from the perspective of classification tasks and discuss how to achieve such an objective. The analysis results indicate the importance of considering the need for information preservation directly, which can be achieved by introducing guidance from classifier confidence during the purification process.

The concept of adversarial examples is first proposed by Szegedy et al. [40], showing that neural networks are vulnerable to imperceptible perturbations. Data \bm{x}_{\mathrm{adv}} is called an adversarial example w.r.t. \bm{x}\in\mathbb{R}^{d} if it is close enough and belongs to the same class y_{\mathrm{true}} under the ground truth classifier p(y|\cdot), but has a different label under the model \hat{p}(y|\cdot), such that

\mathop{\arg\max}_{y}\hat{p}(y|\bm{x}_{\mathrm{adv}})\neq\mathop{\arg\max}_{y}\hat{p}(y|\bm{x})=y_{\mathrm{true}}, (1)

with the constraint that \|\bm{x}_{\mathrm{adv}}-\bm{x}\|\leq\epsilon.
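For illustration, the following is a minimal PyTorch-style sketch of how an adversarial example satisfying Eq. (1) can be crafted with the single-step FGSM attack [12]; here classifier (assumed to output logits) and the perturbation budget are placeholders, not part of our method.

import torch
import torch.nn.functional as F

def fgsm_attack(classifier, x, y_true, eps=8 / 255):
    """One-step l_inf attack: x_adv = x + eps * sign(grad_x CE(classifier(x), y_true))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(x), y_true)
    grad, = torch.autograd.grad(loss, x)
    x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0)  # keep pixels in the valid [0, 1] range
    return x_adv.detach()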

The idea of adversarial purification is to introduce a purification process before feeding data into the classifier. Though the optimal purification result would be converting \bm{x}_{\mathrm{adv}} back to \bm{x}, it is almost impossible and not necessary. For the task of classification, it is sufficient as long as \bm{x}_{\mathrm{adv}} shares the same label with \bm{x}. Therefore, the objective of adversarial purification from the classification perspective can be formulated as

\mathop{\max}_{r}\ \hat{p}(y_{\mathrm{true}}|r(\bm{x}_{\mathrm{adv}})), (2)

where r(\cdot):\mathbb{R}^{d}\to\mathbb{R}^{d} is the purification function.

In practice, the above objective cannot be optimized directly since the ground truth label y_{\mathrm{true}} is in general unknown, and so is the clean data \bm{x}. A substitute idea may be using the label of a nearby but not too close data point \bm{x}^{\prime} with high likelihood p(\bm{x}^{\prime}) instead. The reasons include two aspects: first, the classification model is more trustworthy in high-density areas, thus can eliminate the effect of adversarial perturbation; second, compared with far-away data, a data point with moderate distance from \bm{x}_{\mathrm{adv}} is more likely to share the same label with the clean data \bm{x}. The search for such nearby data is non-trivial; therefore, as an alternative, we can use r(\bm{x}_{\mathrm{adv}}) itself as the nearby data and take \mathop{\arg\max}_{y}\ \hat{p}(y|r(\bm{x}_{\mathrm{adv}})) as an approximation of y_{\mathrm{true}}.

According to this idea, the ideal r(\bm{x}_{\mathrm{adv}}) for solving Eq.(2) should balance the following requirements, summarized by the sketch after this list:

  1. Maximizing the likelihood p(r(\bm{x}_{\mathrm{adv}})). This helps to remove the adversarial noise.

  2. Maximizing the classifier confidence \mathop{\max}_{y}\ \hat{p}(y|r(\bm{x}_{\mathrm{adv}})). This helps to preserve the essential information for classification.

  3. Controlling the distance \|r(\bm{x}_{\mathrm{adv}})-\bm{x}_{\mathrm{adv}}\|. This helps to avoid significant semantic changes.
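Taken together, the three requirements can be read as the following relaxed objective (a sketch for exposition only; the weight \lambda>0 and the distance budget \delta are introduced here purely for illustration, and Section 4 realizes this trade-off with a guided reverse-time SDE whose global timestep controls the distance):

\mathop{\max}_{r}\ \log\hat{p}(r(\bm{x}_{\mathrm{adv}}))+\lambda\log\mathop{\max}_{y}\hat{p}(y|r(\bm{x}_{\mathrm{adv}}))\quad\text{s.t.}\quad\|r(\bm{x}_{\mathrm{adv}})-\bm{x}_{\mathrm{adv}}\|\leq\delta.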

Discussion of existing works. Existing adversarial purification methods usually utilize a generative model \hat{p}(\bm{x}) to approximate p(\bm{x}), and try to maximize \hat{p}(r(\bm{x}_{\mathrm{adv}})) while keeping the distance \|r(\bm{x}_{\mathrm{adv}})-\bm{x}_{\mathrm{adv}}\| small enough. We show a few examples here. DefenseGAN [33] is an early work of this kind: it uses generative adversarial nets as the generative model and optimizes \mathop{\min}_{\mathbf{z}}\|G(\mathbf{z})-\bm{x}_{\mathrm{adv}}\|_{2} for purification. The l_{2} norm is used for controlling the distance, and r(\bm{x}_{\mathrm{adv}})=G(\mathbf{z}) is guaranteed to have a high likelihood under the generator G(\cdot). DiffPure [29] is a recently proposed adversarial purification method that uses diffusion probabilistic models as the generative model. Its purification process is a stochastic differential equation whose main part is a score-function update, which essentially increases the likelihood of r(\bm{x}_{\mathrm{adv}}). Meanwhile, it has been shown that the l_{2} distance is implicitly controlled by the global update steps.

We find that the requirement of classifier confidence maximization is widely overlooked in existing works. A possible explanation might be that this requirement is partially addressed through density maximization. This is the case when there is little overlap between different classes of data, such that high likelihood generally means high classifier confidence. However, when such overlap exists, i.e. the decision boundary crosses high probability density areas, maximizing likelihood alone can be problematic. Consider the case where the areas around the decision boundary have the highest density: the purification process without classifier confidence guidance will drive nearby samples towards the boundary, causing potential label shift, especially when combined with stochastic defense strategies. A specific example is shown in Fig. 1. As a result, we suggest directly addressing the need for information preservation by maximizing the classifier confidence simultaneously.

4 Classifier-Confidence Guided Purification

Motivated by the objective of adversarial purification discussed in Section 3, we propose a Classifier-cOnfidence gUided Purification (COUP) method with score-based diffusion models to achieve this objective.

4.1 Methodology

In order to meet the three requirements in Section 3, we address each of them separately: we use a score-based diffusion model and apply the denoising process to maximize the likelihood of the purified image (i.e. \max_{\bm{r}}\hat{p}(\bm{r}(\bm{x}_{\mathrm{adv}}))); we query the classifier during purification and maximize the classifier confidence (i.e. \max_{\bm{r}}\max_{y}\hat{p}(y|\bm{r}(\bm{x}_{\mathrm{adv}}))); we control the distance by choosing an appropriate global timestep t^{*}. We will first introduce the diffusion model used for denoising, and then explain how we utilize the classifier confidence for adversarial purification.

Algorithm 1 Classifier-Confidence Guided Purification (COUP).

Input: Perturbed example \bm{x}_{\mathrm{adv}}.
Output: Purified example \hat{\bm{x}}_{\mathrm{ben}}, predicted label \hat{y}.
Required: Trained classifier f_{\mathrm{cls}}(\bm{x})\approx\mathop{\max}_{y}\ \hat{p}(y|\bm{x}), score function \bm{s}_{\bm{\theta}}(\bm{x},t)\approx\nabla_{\bm{x}}\log p_{t}(\bm{x}) of a fully trained diffusion model, optimal timestep t^{*}, and a regularization parameter \lambda.

Set up drift function: \bm{f}(\bm{x},t)\leftarrow-\frac{1}{2}\beta(t)\bm{x}-\beta(t)\{\bm{s}_{\theta}(\bm{x},t)+\lambda\nabla_{\bm{x}}\log[f_{\mathrm{cls}}(\bm{x})]\}

Set up diffusion function: g(t)\leftarrow\sqrt{\beta(t)}

Solve SDE for purification according to Eq. 5: \hat{\bm{x}}_{\mathrm{ben}}\leftarrow\text{SDE}(\bm{x}_{\mathrm{adv}},\bm{f}(\bm{x},t),g(t),t^{*},0)

Classification: \hat{y}\leftarrow\mathop{\arg\max}_{y}\ f_{\mathrm{cls}}(\hat{\bm{x}}_{\mathrm{ben}},y)

return predicted label \hat{y}

4.1.1 Score-based Diffusion Models

Diffusion probabilistic models are deep generative models that have recently shown remarkable generation ability. Among existing diffusion models, Score SDE is a unified architecture that models the diffusion process by a stochastic differential equation (SDE) and the denoising process by a corresponding reverse-time SDE. Specifically, the forward SDE is formalized as

\mathrm{d}\bm{x}=\bm{f}(\bm{x},t)\mathrm{d}t+g(t)\mathrm{d}\bm{w}, (3)

where t is the time, \bm{f}(\bm{x},t) is the drift function, g(t) is the diffusion coefficient, and \bm{w} is a standard Wiener process (Brownian motion). The effect of the forward SDE is to progressively inject Gaussian noise into the input and eventually transform the original data into a Gaussian distribution. The corresponding reverse-time SDE is

\mathrm{d}\bm{x}=[\bm{f}(\bm{x},t)-g(t)^{2}\nabla_{\bm{x}}\log p_{t}(\bm{x})]\mathrm{d}t+g(t)\mathrm{d}\overline{\bm{w}}, (4)

where \overline{\bm{w}} is a standard reverse-time Wiener process. \nabla_{\bm{x}}\log p_{t}(\bm{x}) is parameterized by a neural model \bm{s}_{\bm{\theta}}(\bm{x}_{t},t), which is also called the score function. The score function is the core of the reverse process, driving data \bm{x} towards higher likelihood by improving \log p_{t}(\bm{x}), where p_{t}(\bm{x}) can be viewed as an approximation of p(\bm{x}).
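As a concrete reference, the following is a minimal Euler-Maruyama sketch (not the exact solver we use) of the reverse-time VP-SDE in Eq. (4) with the drift of Eq. (6); score_fn stands in for the trained score network \bm{s}_{\bm{\theta}} and beta for the \beta(t) schedule.

import torch

def reverse_vpsde_euler(x_start, score_fn, beta, t_start, n_steps=100):
    """Integrate dx = [f(x,t) - g(t)^2 s_theta(x,t)] dt + g(t) dw_bar from t_start down to 0."""
    x = x_start.clone()
    dt = t_start / n_steps
    for i in range(n_steps):
        t = t_start - i * dt
        b = beta(t)
        drift = -0.5 * b * x - b * score_fn(x, t)  # f(x,t) - g(t)^2 * s_theta(x,t)
        x = x - drift * dt + (b * dt) ** 0.5 * torch.randn_like(x)  # reverse-time Euler step
    return x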

4.1.2 Purification with Guidance of Classifier Confidence

The reverse-time SDE has been used for adversarial purification in previous diffusion-based purification methods. Our key idea is to introduce the guidance signal \max_{y}\ \hat{p}(y|\bm{x}) into the reverse-time SDE, such that we use \log\hat{p}(\bm{x})+\lambda\cdot\log\max_{y}\ \hat{p}(y|\bm{x}) to replace \log\hat{p}(\bm{x}) in the score function \bm{s}_{\theta}(\bm{x},t)=\nabla_{\bm{x}}\log\hat{p}(\bm{x}). Therefore, the purification update rule becomes

\mathrm{d}\bm{x}=\{\bm{f}(\bm{x},t)-g(t)^{2}[\bm{s}_{\theta}(\bm{x},t)+\lambda\underbrace{\nabla_{\bm{x}}\log\mathop{\max}\limits_{y}\hat{p}(y|\bm{x})}_{\text{classifier guidance}}]\}\mathrm{d}t+g(t)\mathrm{d}\overline{\bm{w}}, (5)

where \hat{p}(y|\bm{x}) is the classifier confidence estimated by a fully trained classifier. The coefficient \lambda>0 is determined by how much we can trust the classifier. The more accurate the classifier is, the larger value we can take for \lambda. More discussions on the choice of \lambda can be found in Section 5.4.
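The guidance term in Eq. (5) can be computed by backpropagating through the classifier; below is a minimal PyTorch-style sketch, where classifier is assumed to output logits.

import torch
import torch.nn.functional as F

def confidence_guidance(classifier, x):
    """Return grad_x log max_y p_hat(y|x), the classifier-confidence guidance in Eq. (5)."""
    x = x.detach().requires_grad_(True)
    log_probs = F.log_softmax(classifier(x), dim=-1)  # log p_hat(y|x)
    log_conf = log_probs.max(dim=-1).values.sum()     # summing over the batch gives per-sample gradients
    grad, = torch.autograd.grad(log_conf, x)
    return grad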

In practice, we use VP-SDE [39], such that the drift function and diffusion coefficient are

\bm{f}(\bm{x},t)=-\frac{1}{2}\beta(t)\bm{x},\quad g(t)=\sqrt{\beta(t)}, (6)

where \beta(t) is a linear interpolation from \beta_{\mathrm{min}} to \beta_{\mathrm{max}}. The purification process starts from \bm{x}_{t^{*}}=\bm{x}_{\mathrm{adv}} at time t=t^{*} and ends at time t=0 to get \bm{x}_{0}. The global purification timestep t^{*} controls the distance between \bm{x}_{t^{*}} and \bm{x}_{0}, which we will later explain in Section 4.3.

4.2 The COUP Algorithm

According to the purification rule of Eq. 5, we design our Classifier-cOnfidence gUided Purification (COUP) algorithm in Algo. 1. Our algorithm first sets up the drift function and diffusion scale of the guided reverse-time SDE according to Eq. 6. Then, we run an SDE process from t=t^{*} to t=0 to get the purified image \hat{\bm{x}}_{\mathrm{ben}}, where the input is the adversarial example \bm{x}_{\mathrm{adv}}. Finally, we use the trained classifier to predict the label of the purified image \hat{\bm{x}}_{\mathrm{ben}}. We omit the forward diffusion process since it yields no positive impact on the objectives of adversarial purification discussed in Section 3, and may cause a potential semantic shift. A detailed discussion can be found in Appendix C.1.

Note that COUP can use the off-the-shelf diffusion model and the fully trained classifier. In other words, we combine the off-the-shelf generative model and the trained classifier to achieve higher robustness.
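Putting the pieces together, a compact, hypothetical sketch of Algorithm 1 under an Euler-Maruyama discretization is shown below; score_fn, classifier, and beta stand in for the off-the-shelf score network, the trained classifier (assumed to output logits), and the \beta(t) schedule, and the step count is illustrative.

import torch
import torch.nn.functional as F

def coup_purify_and_classify(x_adv, score_fn, classifier, beta, t_star=0.1, lam=1.0, n_steps=100):
    """Guided reverse-time VP-SDE from t* down to 0 (Eq. 5), followed by classification."""
    x = x_adv.clone()
    dt = t_star / n_steps
    for i in range(n_steps):
        t = t_star - i * dt
        b = beta(t)
        with torch.enable_grad():  # guidance term: grad_x log max_y p_hat(y|x)
            xg = x.detach().requires_grad_(True)
            log_conf = F.log_softmax(classifier(xg), dim=-1).max(dim=-1).values.sum()
            guid, = torch.autograd.grad(log_conf, xg)
        drift = -0.5 * b * x - b * (score_fn(x, t) + lam * guid)  # drift of Eq. (5)
        x = x - drift * dt + (b * dt) ** 0.5 * torch.randn_like(x)
    with torch.no_grad():
        y_hat = classifier(x).argmax(dim=-1)
    return x, y_hat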

4.2.1 Adaptive Attack

In order to evaluate our defense method against strong attacks, we propose an augmented SDE to compute the gradient of COUP for gradient-based attacks. In other words, we expose our purification strategy to the attacker to obtain a strict robustness evaluation. In this way, we can make a fair comparison with other adversarial defense methods. The key idea of the adaptive attack is as follows. Suppose \hat{\bm{x}}_{\mathrm{ben}} is the input of the classifier; then \frac{\partial\mathcal{L}}{\partial\hat{\bm{x}}_{\mathrm{ben}}} can be obtained easily, and we can get the full gradient \frac{\partial\mathcal{L}}{\partial\bm{x}_{\mathrm{adv}}} according to the augmented SDE. For the SDE in Eq. 5 with input \bm{x}_{\mathrm{adv}} and output \hat{\bm{x}}_{\mathrm{ben}}, the augmented SDE is

\left(\begin{array}{c}\bm{x}_{\mathrm{adv}}\\ \frac{\partial\mathcal{L}}{\partial\bm{x}_{\mathrm{adv}}}\end{array}\right)=\operatorname{sdeint}\left(\left(\begin{array}{c}\hat{\bm{x}}_{\mathrm{ben}}\\ \frac{\partial\mathcal{L}}{\partial\hat{\bm{x}}_{\mathrm{ben}}}\end{array}\right),\tilde{\bm{f}},\tilde{\bm{g}},\tilde{\bm{w}},0,t^{*}\right),

where \frac{\partial\mathcal{L}}{\partial\hat{\bm{x}}_{\mathrm{ben}}} is the gradient of the objective \mathcal{L} w.r.t. the output \hat{\bm{x}}_{\mathrm{ben}} of the SDE defined in Eq. 5, and

\tilde{\bm{f}}([\bm{x};\bm{z}],t)=\left(\begin{array}{c}\bm{f}(\bm{x},t)-g(t)^{2}\{\bm{s}_{\theta}(\bm{x},t)+\nabla_{\bm{x}}\log[f_{\mathrm{cls}}(\bm{x})]\}\\ \left\{\frac{\partial[\bm{f}(\bm{x},t)-g(t)^{2}\bm{s}_{\theta}(\bm{x},t)]}{\partial\bm{x}}-g(t)^{2}\nabla^{2}_{\bm{x}}\log[f_{\mathrm{cls}}(\bm{x})]\right\}\bm{z}\end{array}\right),
\tilde{\bm{g}}(t)=\left(\begin{array}{c}-g(t)\mathbf{1}_{d}\\ \bm{0}_{d}\end{array}\right),\quad\tilde{\bm{w}}(t)=\left(\begin{array}{c}-\bm{w}(1-t)\\ -\bm{w}(1-t)\end{array}\right),

with \mathbf{1}_{d} and \bm{0}_{d} representing the d-dimensional vectors of all ones and all zeros, respectively. Empirically, we use the stochastic adjoint method [27] to compute the pathwise gradients of Score SDE.
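For readers who prefer a simpler (but more memory-hungry) alternative to the stochastic adjoint method, the full gradient can also be obtained by direct backpropagation through an unrolled, differentiable discretization of the purification SDE; the sketch below is illustrative and assumes purify keeps the computation graph (no detaching inside the loop).

import torch
import torch.nn.functional as F

def gradient_through_purification(x_adv, purify, classifier, y_true):
    """Compute dL/dx_adv by backpropagating through a differentiable purification path."""
    x = x_adv.clone().detach().requires_grad_(True)
    x_ben = purify(x)                                   # unrolled Euler steps, graph retained
    loss = F.cross_entropy(classifier(x_ben), y_true)   # attacker's objective L
    grad, = torch.autograd.grad(loss, x)
    return grad                                         # used by gradient-based adaptive attacks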

4.3 Analysis of COUP

In this section, we further analyze the effectiveness of our method. First, we show that under the guidance of classifier confidence, our method can better preserve information for classification. Second, under the guided reverse-time VP-SDE, the distance \|r(\bm{x}_{\mathrm{adv}})-\bm{x}_{\mathrm{adv}}\| can be bounded through controlling t^{*}.

To show that our proposed confidence guidance helps to preserve data information, we give a theoretical analysis on a simple case where such guidance can be proved to alleviate the label shift problem. Consider a 1-dimensional SDE \mathrm{d}x=f(x,t)\mathrm{d}t+g(t)\mathrm{d}w with starting point x_{t=0}=x_{0}>0 and final solution x_{t=1}, which is simulated using the Euler method with step size \Delta t=1/n. Denote by P_{<0}(x_{0},f,g) the label flip probability, i.e. the probability that there exists t^{*}\in[0,1] satisfying x_{t^{*}}<0. We have the following proposition:

Proposition 1.

If for any t\in[0,1] and x>0, there is f_{0}(x,t)<f_{1}(x,t) and f_{0}(x,t) is strictly monotonically increasing w.r.t. x, then

P_{<0}(x_{0},f_{1},g)<P_{<0}(x_{0},f_{0},g). (7)

Proposition 1 supports the claim that forces pushing the data away from the decision boundary help to avoid the label shift problem. Consider the case where the data is composed of two classes: one distribution follows N(\mu,\sigma^{2}) and the other follows N(-\mu,\sigma^{2}). The conditions in Proposition 1 would be satisfied using a VP-SDE (as f_{0}) and a corresponding SDE with guidance (as f_{1}), since the added gradient of classifier confidence is always positive on (0,+\infty). In this case, Proposition 1 shows that with guidance from the ground-truth classifier, a sample is less likely to change its label during the purification process. We provide the proof of Proposition 1 in Appendix A.1.

Next, we show that the distance between the input sample \bm{x} and the purified sample r(\bm{x}) can be bounded under our proposed method, thus avoiding severe semantic changes during purification. The result of Proposition 2 indicates that for an adversarial example \bm{x}_{\mathrm{adv}}, the distance \|r(\bm{x}_{\mathrm{adv}})-\bm{x}_{\mathrm{adv}}\| has an upper bound that is monotonically increasing w.r.t. t^{*}. As a result, the maximal distance can be controlled by adjusting t^{*}.

Proposition 2.

Under the assumption that \left\|\bm{s}_{\theta}(\bm{x},t)\right\|\leq\frac{1}{2}C_{s}, \left\|\nabla_{\bm{x}}p(\cdot|\bm{x})\right\|\leq\frac{1}{2}C_{p}, and \|\bm{x}\|\leq C_{x}, the denoising error of our guided reverse variance preserving SDE (VP-SDE) can be bounded as

\|\bm{r}(\bm{x})-\bm{x}\|\leq\gamma(t^{*})(C_{s}+C_{p})+(e^{\gamma(t^{*})}-1)C_{x}+\sqrt{e^{2\gamma(t^{*})}-1}\left\|\bm{\epsilon}\right\|, (8)

where \gamma(t^{*}):=\int_{0}^{t^{*}}\frac{1}{2}\beta(s)\mathrm{d}s and \bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{1}) is the noise added by the reverse-time Wiener process.
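As a rough sanity check of the bound, take the linear schedule stated in the proof of Proposition 2 (\beta(0)=0.1, \beta(1)=20) and t^{*}=0.1; the constants in Eq. (8) then evaluate to

\gamma(0.1)=\int_{0}^{0.1}\frac{1}{2}(0.1+19.9s)\,\mathrm{d}s=\frac{1}{2}\left(0.1\cdot 0.1+19.9\cdot\frac{0.1^{2}}{2}\right)\approx 0.055,\qquad e^{\gamma(0.1)}-1\approx 0.056,\qquad\sqrt{e^{2\gamma(0.1)}-1}\approx 0.34,

illustrating how all three terms of the bound shrink as t^{*}\to 0.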

5 Experiments

In this section, we mainly evaluate the adversarial robustness of COUP against AutoAttack [6], including both black-box and white-box attacks. Furthermore, we analyze the mechanism of the classifier confidence guidance through a case study and an ablation study to verify the effectiveness of COUP. Besides, we combine our work with the state-of-the-art adversarial training method for further improvement.

Table 1: Accuracy and Robustness against AutoAttack under the l_{\infty} (\epsilon=8/255) threat model on CIFAR-10. The model architecture is reverse-time VP-SDE with t^{*}=0.1. We use the fully trained WideResNet-28-10 for the classifier.
Defense Accuracy (%) Robustness (%)
- 96.09 0.00
AWP - w/o Aug [44] 85.36 59.18
GAIRAT [50] 89.36 59.96
Adv. Train - Aug [31] 87.33 61.72
AWP - Aug [44] 88.25 62.11
Adv. Train - Tricks [13] 89.48 62.70
Adv. Train - Aug [14] 87.50 65.24
DiffPure [29] 89.02 70.64
Our COUP 90.04 73.05

5.1 Experimental Settings

Dataset and Models We evaluate our method on the CIFAR-10 and CIFAR-100 [23] datasets. To make a fair comparison with other diffusion-based purification methods, we follow the settings of DiffPure [29], evaluating the robustness on 512 randomly sampled images. As for the purification model, we use the variance preserving SDE (VP-SDE) of Score SDE [39] with t^{*}=0.1 for the l_{\infty} threat model and t^{*}=0.075 for l_{2}. We select two classifier backbones, WRN-28-10 (WideResNet-28-10) and WRN-70-16 (WideResNet-70-16).

Baselines We compare our method with (1) robust optimization methods, i.e. adversarial defenses based on discriminative models, including those that use generative models for data augmentation; and (2) adversarial purification methods that apply generative models before classification. Since some diffusion-based adversarial purification methods do not support gradient computation and do not design an adaptive attack, we compare with the SOTA method [29] supported by AutoAttack [6].

(a) Case Study
(b) Ablation Study
(c) Combine with Adv. CLS
Figure 2: (a) Expectation of the discriminative confidence \hat{p}(y|\bm{x}) of WRN-28-10 on the label y_{\mathrm{true}} and the adversarial label y_{\mathrm{adv}} over 14 bad cases of the reverse-time SDE. (b) Robustness of different denoising methods, including the forward-reverse-time SDE (DiffPure) and the reverse-time SDE only (COUP and COUP w/o Guidance), under different classifier backbones. (c) Robustness of the SOTA adversarial training method [42] (marked as Adv. CLS), in both the cross setting (trained under a different l_{p} norm from the evaluation) and the non-cross setting, and combined with our COUP, against APGD-ce.

Evaluation Method We evaluate our method against AutoAttack [6] under the l_{\infty} and l_{2} threat models and against Backward Pass Differentiable Approximation (BPDA) [1]. Since our method contains Brownian motion, we use both the standard mode (including three white-box attacks, APGD-ce, APGD-t, and FAB-t, and one black-box attack, Square) and the rand mode (including two white-box attacks, APGD-ce and APGD-dlr, with EOT=20) of AutoAttack, choosing the worse result to eliminate the 'fake robustness' brought by randomness. Since white-box attacks exhibit stronger attack behavior, we evaluate our algorithm across different classifier backbones and in other analysis experiments against APGD-ce, one of the white-box attacks in AutoAttack.

Table 2: Accuracy and Robustness against AutoAttack under the l_{2} (\epsilon=0.5) threat model on CIFAR-10. The model architecture is reverse-time VP-SDE with t^{*}=0.075. We use the fully trained WideResNet-28-10 for the classifier.
Defense Accuracy (%) Robustness (%)
- 96.09 0.00
Adv. Train - DDN [32] 89.05 66.41
Adv. Train - MMA [9] 88.02 67.77
AWP [44] 88.51 72.85
PORT [34] 90.31 75.39
RATIO [2] 92.23 77.93
Adv. Train - Aug[31] 91.79 78.32
DiffPure [29] 91.03 78.58
Our COUP 92.58 83.13
Table 3: Accuracy and Robustness against BPDA + EOT [1] under the l_{\infty} (\epsilon=8/255) threat model on CIFAR-10. The model architecture is reverse-time VP-SDE with t^{*}=0.1. We use the fully trained WideResNet-28-10 for the classifier.
Defense Accuracy (%) Robustness (%)
- 96.09 0.00
PixelDefend [37] 95.00 9.00
ME-Net [47] 94.00 15.00
Purification - EBM[16] 84.12 54.90
ADP [48] 86.14 70.01
GDMP [41] 93.50 76.22
DiffPure [29] 89.02 81.40
Our COUP 90.04 83.20

5.2 Comparison with Related Work

5.2.1 AutoAttack

We evaluate our COUP against AutoAttack and compare its robustness with other advanced defense methods on CIFAR-10 according to the results reported in RobustBench [7]. DiffPure [29] is the work most related to ours. According to the results in Table 1 and Table 2, our COUP achieves better robustness (+2.41% for l_{\infty} and +4.55% for l_{2}) as well as better accuracy (+1.02% for l_{\infty} and +1.55% for l_{2}), showing the effectiveness of classifier guidance in enhancing adversarial robustness through better purification.

Moreover, we also adapt our COUP to the DDPM architecture [18] in consideration of inference efficiency and evaluate the adversarial robustness against APGD-ce on CIFAR-100. We employ the purification method suggested by Chen et al. [5] and add classifier confidence guidance during likelihood maximization. Our results indicate that COUP attains 40.55% robustness under the l_{\infty} threat model with \epsilon=8/255, surpassing the 38.67% achieved by likelihood maximization without classifier guidance.

Besides, owing to the adaptive attack, we can make a fair comparison with other purification algorithms as well as robust optimization methods based on discriminative models. The state-of-the-art robust optimization method is an improved adversarial training that uses data generated by diffusion models. COUP is orthogonal to those methods. We further discuss the comparison and combination with the SOTA work among them in Section 5.3.

5.2.2 BPDA

We also evaluate the robustness against BPDA + EOT [1] in order to compare with other guided diffusion-based purification methods [45, 41] (since they do not support the adaptive attack). The results are presented in Table 3. Considering that both Wu et al. [45] and Wang et al. [41] utilize adversarial samples as guidance, we compare with the more proficient of the two, GDMP [41] (according to the results in Table 1 of Wu et al. [45]). We test our method on the CIFAR-10 dataset against PGD-20, employing a setting of \epsilon=8/255 and \alpha=0.007. GDMP achieves 76.22% robustness and DiffPure achieves 81.40%, while our COUP achieves the best result of 83.20%. These results verify the effectiveness of our method against BPDA in addition to the adaptive attack, and show better performance than purification algorithms guided by adversarial examples.

(a) Probability on toy data
(b) Label flip probability
Figure 3: (a) The 2-Gaussian toy data and classifier with noise level c=1. (b) Label flip probability under different noise levels c and guidance weights \lambda on the 2-Gaussian toy data.

5.3 Experimental Analysis

5.3.1 Case Study

In this part, we plot the curves of \hat{p}(y_{\mathrm{true}}|\bm{x}) and \hat{p}(y_{\mathrm{adv}}|\bm{x}) to show what happens from the view of the classifier during purification, where \hat{p}(y|\bm{x}) is the confidence predicted by the classifier for label y. Moreover, we analyze the mechanism of the classifier guidance. To obtain the adversarial examples, we use APGD-ce under the l_{\infty} threat model to attack the reverse-time SDE (COUP without Guidance) and collect its bad cases. To focus on the purification process, we do not consider Brownian motion at inference time.

According to the analysis in Section 3, we use the predicted confidence of the ground-truth label to evaluate the degree of information preservation during purification, and plot the curves shown in Fig. 2a. The rise of \hat{p}(y_{\mathrm{adv}}|\bm{x}) is the reason for the successful attack. Meanwhile, the decrease of \hat{p}(y_{\mathrm{true}}|\bm{x}) shows that predictive information keeps being lost during purification. After 90 purification steps, \hat{p}(y_{\mathrm{true}}|\bm{x}) suddenly declines to a very low level due to "over-purification" and \hat{p}(y_{\mathrm{adv}}|\bm{x}) dominates the prediction confidence, which leads to vulnerability. Next, we further explore the mechanism of classifier guidance. After adding classifier guidance, \hat{p}(y_{\mathrm{true}}|\bm{x}) rises rapidly at the very beginning. Moreover, the guidance of the classifier alleviates the information loss (and also weakens the influence of the adversarial perturbation) during purification and finally results in correct classification.

5.3.2 Analysis of Information Preservation on Toy Data

To verify the effectiveness of our method for information preservation, we use 2-Gaussian toy data and run simulations of the pure SDE and COUP to show that COUP can alleviate the label shift problem. The data distribution is a one-dimensional uniform mixture of two symmetric Gaussian distributions \mathcal{N}(-0.5,1) and \mathcal{N}(0.5,1), whose data we label as y=0 and y=1, respectively. Starting from the point x_{0}=0.2, we apply the COUP algorithm with guidance weight ranging from 0 to 10.0. To simulate the adversarial vulnerability of the classifier, we use a noisy classifier p(y=1|x)=\frac{p_{1}(x)}{p_{0}(x)+p_{1}(x)+c\cdot n(x)}, where p_{i}(x) is the density function of class i, n(x)=\frac{\sin(100x)}{100} is the noise, and c is the noise level. We apply the Euler method for SDE simulation, using step size 1e-3 and t^{*}=0.1. We run 100,000 simulations and estimate the label flip probability, i.e. p(x_{t^{*}}<0). The result in Fig. 3b shows that the guidance signal is overall helpful for keeping the label unchanged under small or no noise. When the noise level is high, the classifier becomes untrustworthy; thus, an appropriate \lambda should be chosen.
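A minimal NumPy sketch of this simulation is given below as a hypothetical re-implementation, not our exact script: the exact mixture score is used in place of a learned score network, the gradient of the log-confidence is taken numerically, and the default number of runs is reduced for speed (the paper uses 100,000 runs).

import numpy as np
from scipy.stats import norm

def label_flip_prob(lam=1.0, c=0.0, x0=0.2, t_star=0.1, dt=1e-3, n_runs=10_000, seed=0):
    """Estimate p(x_{t*} < 0) for the guided reverse-time VP-SDE on 2-Gaussian toy data."""
    rng = np.random.default_rng(seed)
    beta = lambda t: 0.1 + 19.9 * t                      # linear beta(t) schedule
    p0 = lambda x: norm.pdf(x, loc=-0.5, scale=1.0)      # class y = 0
    p1 = lambda x: norm.pdf(x, loc=0.5, scale=1.0)       # class y = 1
    score = lambda x: -((x + 0.5) * p0(x) + (x - 0.5) * p1(x)) / (p0(x) + p1(x))
    log_conf = lambda x: np.log(max(p0(x), p1(x)) /      # log max_y p(y|x) with the noisy normalizer
                                (p0(x) + p1(x) + c * np.sin(100 * x) / 100.0))
    n_steps, h = int(t_star / dt), 1e-4
    flips = 0
    for _ in range(n_runs):
        x = x0
        for i in range(n_steps):
            t = t_star - i * dt
            b = beta(t)
            guid = (log_conf(x + h) - log_conf(x - h)) / (2 * h)  # numerical d/dx log max_y p(y|x)
            drift = -0.5 * b * x - b * (score(x) + lam * guid)
            x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal()
        flips += (x < 0)
    return flips / n_runs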

(a) Accuracy and robustness across t^{*}
(b) Accuracy and robustness across \lambda
Figure 4: Accuracy and Robustness against APGD-ce under the l_{\infty} (\epsilon=8/255) threat model for (a) varying purification timestep t^{*} and (b) varying guidance weight \lambda.

5.3.3 Analysis on Different Classifier Architectures

In order to evaluate the effectiveness of our guidance method for different classifier architectures, we apply our purification method to both WRN-28-10 and WRN-70-16 against the APGD-ce attack (under l_{\infty}). The results in Fig. 2b show that COUP achieves better robustness on both classifier backbones. In other words, our method is effective across different classifier architectures.

5.3.4 Ablation Study

In order to demonstrate the effectiveness of classifier guidance, we evaluate the robustness of COUP and COUP w/o Guidance (i.e. the reverse-time SDE) against the APGD-ce attack (under l_{\infty}) as an ablation. The results in Fig. 2b support that the robustness improvement of our COUP in Table 1 and Table 2 is mainly caused by the classifier guidance rather than the structure of the diffusion model (since we remove the forward process from DiffPure).

5.4 Hyperparameters

Analysis on Purification Timestep t^{*} Since the purification timestep t^{*} is a critical hyperparameter deciding the degree of denoising, we evaluate the robustness against APGD-ce across different t^{*} and find that it performs best at t^{*}=0.1. The experimental results in Fig. 4a show that both an insufficient purification timestep and "over-purification" lead to lower robustness. This phenomenon strongly supports that balancing denoising and information preservation is very important. Besides, it is intuitive that accuracy decreases as the timestep t^{*} grows.

Analysis on Guidance Weight \lambda We experimentally find that COUP obtains the highest robustness against APGD-ce under \lambda=1, i.e. the diffusion model and the classifier have equal weight. Note that this implements the same effect as a conditional generative model, since by the Bayes formula p(\bm{x})\cdot p(y|\bm{x})=p(\bm{x}|y)\cdot p(y) and the prior probability p(y) of category y is considered uniform.
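Written in terms of scores, this choice of \lambda=1 recovers class-conditional generation: with a uniform prior we have \nabla_{\bm{x}}\log p(y)=0, so

\nabla_{\bm{x}}\log\hat{p}(\bm{x})+\nabla_{\bm{x}}\log\hat{p}(y|\bm{x})=\nabla_{\bm{x}}\log\left[\hat{p}(\bm{x})\,\hat{p}(y|\bm{x})\right]=\nabla_{\bm{x}}\log\hat{p}(\bm{x}|y).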

5.4.1 Comparison and Combination with SOTA of Adversarial Training

Considering the settings of robust evaluation, we argue that it is not entirely fair to compare our COUP with adversarial training algorithms: we do not make any assumption about the attack, while adversarial training methods are specifically trained for the evaluation attack. Therefore, we additionally evaluate the SOTA adversarial training method [42] under both cross and non-cross settings. Specifically, in the cross setting, we use the model trained for l_{2} (l_{\infty}) to defend against the attack under l_{\infty} (l_{2}). The results in Fig. 2c show that it [42] suffers a severe robustness drop under the cross setting. In other words, its robustness becomes poor when defending against unseen attacks.

Besides, to take advantage of their work [42], we combine our purification method with the adversarially trained classifier. When the classifier has better clean accuracy (95.16% under l_{2}), it can further improve the robustness against the APGD-ce attack. However, the lower accuracy under l_{\infty} (92.44%) may provide inappropriate guidance for purification. Note that, even in that case, our purification method COUP further improves their robustness from 70.90% to 77.15%.

6 Conclusion

To address the principal challenge in purification, i.e., achieving a balance between noise removal and information preservation, we introduce classifier guidance into the purification process. We find that classifier-confidence guidance aids in preserving predictive information, which facilitates the purification of adversarial examples towards the category center. Specifically, we propose Classifier-cOnfidence gUided Purification (COUP) and assess its performance against AutoAttack and BPDA, comparing it with other advanced defense algorithms under the RobustBench benchmark. The results demonstrate that COUP achieves superior adversarial robustness.

Acknowledgments

This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDB0680101, CAS Project for Young Scientists in Basic Research under Grant No. YSBR-034, Innovation Project of ICT CAS under Grant No. E261090, and the project under Grant No. JCKY2022130C039. This paper acknowledges the valuable assistance of Xiaojie Sun in the experiment execution.

References

  • Athalye et al. [2018] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274–283. PMLR, 2018.
  • Augustin et al. [2020] M. Augustin, A. Meinke, and M. Hein. Adversarial robustness on in-and out-distribution improves explainability. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pages 228–245. Springer, 2020.
  • Brendel et al. [2017] W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
  • Carlini et al. [2022] N. Carlini, F. Tramer, J. Z. Kolter, et al. (certified!!) adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022.
  • Chen et al. [2023] H. Chen, Y. Dong, Z. Wang, X. Yang, C. Duan, H. Su, and J. Zhu. Robust classification via a single diffusion model. arXiv preprint arXiv:2305.15241, 2023.
  • Croce and Hein [2020] F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206–2216. PMLR, 2020.
  • Croce et al. [2020] F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M. Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
  • Dhariwal and Nichol [2021] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Ding et al. [2018] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang. Mma training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018.
  • Du and Mordatch [2019] Y. Du and I. Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
  • Goodfellow et al. [2020] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Goodfellow et al. [2014] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Gowal et al. [2020] S. Gowal, C. Qin, J. Uesato, T. Mann, and P. Kohli. Uncovering the limits of adversarial training against norm-bounded adversarial examples. arXiv preprint arXiv:2010.03593, 2020.
  • Gowal et al. [2021] S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.
  • Grathwohl et al. [2019] W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
  • Hill et al. [2020] M. Hill, J. Mitchell, and S.-C. Zhu. Stochastic security: Adversarial defense using long-run dynamics of energy-based models. arXiv preprint arXiv:2005.13525, 2020.
  • Ho and Salimans [2022] J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hyvärinen and Dayan [2005] A. Hyvärinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Jiang et al. [2019] H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and T. Zhao. Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. arXiv preprint arXiv:1911.03437, 2019.
  • Kawar et al. [2022] B. Kawar, R. Ganz, and M. Elad. Enhancing diffusion-based image synthesis with robust classifier guidance. arXiv preprint arXiv:2208.08664, 2022.
  • Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Laidlaw et al. [2020] C. Laidlaw, S. Singla, and S. Feizi. Perceptual adversarial robustness: Defense against unseen threat models. arXiv preprint arXiv:2006.12655, 2020.
  • LeCun et al. [2006] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
  • Li and Ji [2019] X. Li and S. Ji. Defense-vae: A fast and accurate defense against adversarial attacks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 191–207. Springer, 2019.
  • Li et al. [2020] X. Li, T.-K. L. Wong, R. T. Chen, and D. K. Duvenaud. Scalable gradients and variational inference for stochastic differential equations. In Symposium on Advances in Approximate Bayesian Inference, pages 1–28. PMLR, 2020.
  • Madry et al. [2017] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Nie et al. [2022] W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.
  • Qin et al. [2019] C. Qin, J. Martens, S. Gowal, D. Krishnan, K. Dvijotham, A. Fawzi, S. De, R. Stanforth, and P. Kohli. Adversarial robustness through local linearization. Advances in Neural Information Processing Systems, 32, 2019.
  • Rebuffi et al. [2021] S.-A. Rebuffi, S. Gowal, D. A. Calian, F. Stimberg, O. Wiles, and T. Mann. Fixing data augmentation to improve adversarial robustness. arXiv preprint arXiv:2103.01946, 2021.
  • Rony et al. [2019] J. Rony, L. G. Hafemann, L. S. Oliveira, I. B. Ayed, R. Sabourin, and E. Granger. Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4322–4330, 2019.
  • Samangouei et al. [2018] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
  • Sehwag et al. [2021] V. Sehwag, S. Mahloujifar, T. Handina, S. Dai, C. Xiang, M. Chiang, and P. Mittal. Robust learning meets generative models: Can proxy distributions improve adversarial robustness? arXiv preprint arXiv:2104.09425, 2021.
  • Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Song and Ermon [2019] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song et al. [2017] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
  • Song et al. [2020a] Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020a.
  • Song et al. [2020b] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Szegedy et al. [2013] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Wang et al. [2022] J. Wang, Z. Lyu, D. Lin, B. Dai, and H. Fu. Guided diffusion model for adversarial purification. arXiv preprint arXiv:2205.14969, 2022.
  • Wang et al. [2023] Z. Wang, T. Pang, C. Du, M. Lin, W. Liu, and S. Yan. Better diffusion models further improve adversarial training. arXiv preprint arXiv:2302.04638, 2023.
  • Wong et al. [2020] E. Wong, L. Rice, and J. Z. Kolter. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.
  • Wu et al. [2020] D. Wu, S.-T. Xia, and Y. Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969, 2020.
  • Wu et al. [2022] Q. Wu, H. Ye, and Y. Gu. Guided diffusion model for adversarial purification from random noise. arXiv preprint arXiv:2206.10875, 2022.
  • Xiao et al. [2022] C. Xiao, Z. Chen, K. Jin, J. Wang, W. Nie, M. Liu, A. Anandkumar, B. Li, and D. Song. Densepure: Understanding diffusion models towards adversarial robustness. arXiv preprint arXiv:2211.00322, 2022.
  • Yang et al. [2019] Y. Yang, G. Zhang, D. Katabi, and Z. Xu. Me-net: Towards effective adversarial robustness with matrix estimation. arXiv preprint arXiv:1905.11971, 2019.
  • Yoon et al. [2021] J. Yoon, S. J. Hwang, and J. Lee. Adversarial purification with score-based generative models. In International Conference on Machine Learning, pages 12062–12072. PMLR, 2021.
  • Zhang et al. [2019] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR, 2019.
  • Zhang et al. [2020] J. Zhang, J. Zhu, G. Niu, B. Han, M. Sugiyama, and M. Kankanhalli. Geometry-aware instance-reweighted adversarial training. arXiv preprint arXiv:2010.01736, 2020.

In Appendix A, we give the proofs of Proposition 1 and Proposition 2. In Appendix B, we provide supplementary analysis to show the necessity of introducing the guidance of classifier confidence, stated in the second requirement in Section 3. In Appendix C, we further discuss the evidence for omitting the forward process and why the first requirement in Section 3, "maximizing the probability density of the purified image helps to remove the adversarial noise", is valid. In Appendix D, we show more implementation details. In Appendix E, we provide more experimental results, including the robustness against salt-and-pepper noise and a speed test. Moreover, we further discuss the effectiveness of our method through the purified images and observe the semantic drift and information loss caused by DiffPure and COUP w/o Guidance, respectively. Finally, we analyze the limitations in Appendix F and broader impacts in Appendix G.

Appendix A Proofs of Propositions

A.1 Proof of Proposition 1

Proposition 1 (restated) If for any t\in[0,1] and x>0, there is f_{0}(x,t)<f_{1}(x,t) and f_{0}(x,t) is strictly monotonically increasing w.r.t. x, then

P_{<0}(x_{0},f_{1},g)<P_{<0}(x_{0},f_{0},g). (9)

Proof: Define the noise term g(t_{i})z\sqrt{\Delta t} in update step i as z_{i}, where z\sim\mathcal{N}(0,1) is a standard Gaussian noise, so that each update by the Euler method is

x_{t_{i+1}}=f(x_{t_{i}},t_{i})\Delta t+z_{i}.

Define an auxiliary function

\hat{f}(x,t)\triangleq\left\{\begin{array}{ll}f(x,t)\Delta t,&\text{if }f(x,t)>0,\\ -\infty,&\text{otherwise},\end{array}\right.

then whether there exists t^{*}\in[0,1] such that x_{t^{*}}<0 can be written as

c(x_{0},f,z_{0},\cdots,z_{n-1})\triangleq\mathbb{I}(\hat{f}(\cdots\hat{f}(\hat{f}(x_{0},0)+z_{0})+z_{1}\cdots)+z_{n-1}<0),

where \mathbb{I}(\cdot) is the indicator function. Therefore, the label flip probability will be

P_{<0}(x_{0},f,g)=\mathbb{E}_{z_{0}}\cdots\mathbb{E}_{z_{n-1}}c(x_{0},f,z_{0},\cdots,z_{n-1}).

Since f_{1}(x,t)>f_{0}(x,t) for any x>0, we have \hat{f}_{1}(x_{0},t)+z_{0}\geq\hat{f}_{0}(x_{0},t)+z_{0}, where equality holds only when f_{1}(x_{0},t)\leq 0. Considering that f is strictly monotonically increasing on (0,+\infty), there is

\hat{f}_{1}(\hat{f}_{1}(x_{0},t)+z_{0})\geq\hat{f}_{0}(\hat{f}_{1}(x_{0},t)+z_{0})\geq\hat{f}_{0}(\hat{f}_{0}(x_{0},t)+z_{0}).

By parity of reasoning, we get

\hat{f}_{1}(\cdots\hat{f}_{1}(\hat{f}_{1}(x_{0},0)+z_{0})+z_{1}\cdots)+z_{n-1}\geq\hat{f}_{0}(\cdots\hat{f}_{0}(\hat{f}_{0}(x_{0},0)+z_{0})+z_{1}\cdots)+z_{n-1},

so that

c(x_{0},f_{1},z_{0},\cdots,z_{n-1})\leq c(x_{0},f_{0},z_{0},\cdots,z_{n-1}),

thus

P_{<0}(x_{0},f_{1},g)\leq P_{<0}(x_{0},f_{0},g).

Here equality holds only when P_{<0}(x_{0},f_{1},g)=1, which is not possible due to the random nature of the Gaussian noise, since there is always a positive probability that x_{t}>0 is kept during each update. Therefore,

P_{<0}(x_{0},f_{1},g)<P_{<0}(x_{0},f_{0},g).

A.2 Proof of Proposition 2

Proposition 2 (restated) Under the assumption that \left\|\bm{s}_{\theta}(\bm{x},t)\right\|\leq\frac{1}{2}C_{s}, \left\|\nabla_{\bm{x}}p(\cdot|\bm{x})\right\|\leq\frac{1}{2}C_{p}, and \|\bm{x}\|\leq C_{x}, the denoising error of our guided reverse variance preserving SDE (VP-SDE) can be bounded as

\|\bm{r}(\bm{x})-\bm{x}\|\leq\sqrt{e^{2\gamma(t^{*})}-1}\left\|\bm{\epsilon}\right\|+\gamma(t^{*})(C_{s}+C_{p})+(e^{\gamma(t^{*})}-1)C_{x}, (10)

where \gamma(t^{*}):=\int_{0}^{t^{*}}\frac{1}{2}\beta(s)\mathrm{d}s and \bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{1}) is the noise added by the reverse-time Wiener process.

Proof: Suppose given an example \bm{x}, the COUP purification function \bm{r}(\bm{x}) is a guided reverse-time SDE (depicted in Section 4.1.2) from t=t^{*} to t=0. Thus, the l_{2} distance between \bm{x} and the purified example \bm{r}(\bm{x}) is

\|\bm{r}(\bm{x})-\bm{x}\| (11)
=\Bigl\|\bm{x}-\int_{t^{*}}^{0}\frac{1}{2}\beta(t)\left[\bm{x}_{t}+2\bm{s}_{\theta}(\bm{x}_{t},t)+2\nabla_{\bm{x}_{t}}\log[\max_{y}p(y|\bm{x}_{t})]\right]dt+\int_{t^{*}}^{0}\sqrt{\beta(t)}d\overline{\bm{w}}-\bm{x}\Bigr\|
\leq\left\|\bm{x}+\int_{t^{*}}^{0}-\frac{1}{2}\beta(t)\bm{x}_{t}dt+\int_{t^{*}}^{0}\sqrt{\beta(t)}d\overline{\bm{w}}-\bm{x}\right\|
\quad+\left\|\int_{t^{*}}^{0}-\beta(t)\bm{s}_{\theta}(\bm{x}_{t},t)dt\right\|
\quad+\left\|\int_{t^{*}}^{0}-\beta(t)\nabla_{\bm{x}_{t}}\log[\max_{y}p(y|\bm{x}_{t})]dt\right\|.

Under the assumption that \left\|\bm{s}_{\theta}(\bm{x},t)\right\|\leq\frac{1}{2}C_{s} and \left\|\nabla_{\bm{x}}p(\cdot|\bm{x})\right\|\leq\frac{1}{2}C_{p}, we have

\left\|\int_{t^{*}}^{0}-\beta(t)\bm{s}_{\theta}(\bm{x}_{t},t)dt\right\|+\left\|\int_{t^{*}}^{0}-\beta(t)\nabla_{\bm{x}_{t}}\log[\max_{y}p(y|\bm{x}_{t})]dt\right\|\leq\gamma(t^{*})(C_{s}+C_{p}), (12)

where \gamma(t^{*}):=\int_{0}^{t^{*}}\frac{1}{2}\beta(s)ds and \beta(t) is a linear interpolation with \beta(0)=0.1 and \beta(1)=20. Since \bm{x}+\int_{t^{*}}^{0}-\frac{1}{2}\beta(t)\bm{x}_{t}dt+\int_{t^{*}}^{0}\sqrt{\beta(t)}d\overline{\bm{w}} is the integral of a linear SDE, its solution \hat{\bm{x}}(0) follows a Gaussian distribution whose mean \bm{\mu} and covariance \bm{\Sigma} satisfy

\frac{\mathrm{d}\bm{\mu}}{\mathrm{d}t}=-\frac{1}{2}\beta(t)\bm{\mu},\qquad\frac{\mathrm{d}\bm{\Sigma}}{\mathrm{d}t}=-\beta(t)\bm{\Sigma}+\beta(t)\bm{I}_{d}. (13)

The initial values are \bm{\mu}(t^{*})=\bm{x}(t^{*}) and \bm{\Sigma}(t^{*})=\bm{0}. Solving Eq. 13 from t=t^{*} to t=0 gives \bm{\mu}(0)=e^{\gamma(t^{*})}\bm{x}(t^{*}) and \bm{\Sigma}(0)=(e^{2\gamma(t^{*})}-1)\bm{I}_{d}, so the conditional probability is P(\hat{\bm{x}}(0)|\bm{x}(t^{*})=\bm{x})\sim\mathcal{N}\left(e^{\gamma(t^{*})}\bm{x},\left(e^{2\gamma(t^{*})}-1\right)\bm{I}_{d}\right). Hence the prediction \hat{\bm{x}}(0) of the linear SDE from t=t^{*} to t=0 can be represented as

\hat{\bm{x}}(0)=e^{\gamma(t^{*})}\bm{x}+\sqrt{e^{2\gamma(t^{*})}-1}\,\bm{\epsilon}, (14)

where \bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}).

Thus, we have

\|\hat{\bm{x}}(0)-\bm{x}\|=\left\|(e^{\gamma(t^{*})}-1)\bm{x}+\sqrt{e^{2\gamma(t^{*})}-1}\,\bm{\epsilon}\right\|. (15)

Under the assumption that \|\bm{x}\|\leq C_{x}, we have

\|\hat{\bm{x}}(0)-\bm{x}\|\leq\sqrt{e^{2\gamma(t^{*})}-1}\left\|\bm{\epsilon}\right\|+(e^{\gamma(t^{*})}-1)C_{x}. (16)

Therefore, we have

\|\bm{r}(\bm{x})-\bm{x}\|\leq\sqrt{e^{2\gamma(t^{*})}-1}\left\|\bm{\epsilon}\right\|+\gamma(t^{*})(C_{s}+C_{p})+(e^{\gamma(t^{*})}-1)C_{x}. (17)

This completes the proof of Proposition 2. It shows that, under our purification method, the l_{2} distance between the input and the purified sample is upper bounded, which helps avoid unexpected semantic changes.
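For concreteness, the sketch below shows one Euler–Maruyama step of the classifier-confidence guided reverse VP-SDE analyzed above. The names score_model, classifier, and guidance_weight are placeholders, and the linear beta schedule is the one assumed in the proof; this is an illustrative sketch, not the exact implementation in our released code.

```python
import math
import torch

def beta(t, beta_min=0.1, beta_max=20.0):
    # Linear schedule with beta(0)=0.1 and beta(1)=20, as assumed in the proof.
    return beta_min + t * (beta_max - beta_min)

def guided_reverse_step(x, t, dt, score_model, classifier, guidance_weight=1.0):
    """One Euler-Maruyama step of the classifier-confidence guided reverse VP-SDE,
    moving backwards in time from t to t - dt (t and dt are Python floats)."""
    x = x.detach().requires_grad_(True)
    # Classifier-confidence guidance: gradient of log max_y p(y | x_t) w.r.t. x_t.
    log_conf = torch.log_softmax(classifier(x), dim=-1).max(dim=-1).values.sum()
    guidance = torch.autograd.grad(log_conf, x)[0]
    with torch.no_grad():
        score = score_model(x, t)
        drift = -0.5 * beta(t) * (x + 2.0 * score + 2.0 * guidance_weight * guidance)
        noise = math.sqrt(beta(t) * dt) * torch.randn_like(x)
        # Reverse-time update: x_{t-dt} = x_t - drift * dt + sqrt(beta(t) * dt) * z.
        return x - drift * dt + noise
```

In practice, the purification r(x) chains many such steps from t=t^{*} down to 0, and guidance_weight plays the role of the guidance weight \lambda mentioned in Appendix F.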

Appendix B Additional Analysis on the Objective of Adversarial Purification

In this section, we provide supplementary explanations for the content in Section 3.

B.1 The Relation between Probability Density and Classifier Credibility

We analyze a simple case to show that the classification model is more trustworthy in high-density areas. Consider a binary classification problem with label space y\in\{0,1\}, where the ground-truth classifier is p(y|x) and the learned classification model is \hat{p}(y|x). Assume each class contains n training samples (2n in total). The value of \hat{p}(y=1|x) is determined by the proportion of class y=1 samples among all samples falling in the neighborhood of x, i.e. (x-\Delta x/2,x+\Delta x/2). For points on the ground-truth class boundary, i.e. p(y|x)=\frac{1}{2}, we have

p(x|y)=\frac{p(x)\cdot p(y|x)}{p(y)}=\frac{p(x)\cdot\frac{1}{2}}{\frac{1}{2}}=p(x). (18)

According to the central limit theorem, as long as n is large enough, the number of class y=1 samples falling in the neighborhood approximately follows a Gaussian distribution with mean and variance

\mu=n\cdot p(x|y=1)\Delta x,\qquad\sigma^{2}=n\cdot p(x|y=1)\Delta x\cdot(1-p(x|y=1)\Delta x). (19)

Applying Eq. 18, we get

\mu=n\cdot p(x)\Delta x,\qquad\sigma^{2}=n\cdot p(x)\Delta x\cdot(1-p(x)\Delta x). (20)

Therefore the estimated model would be

\hat{p}(y|x)\approx\frac{n\cdot p(x)\Delta x+\sqrt{n\cdot p(x)\Delta x\cdot(1-p(x)\Delta x)}\,\epsilon_{1}}{2n\cdot p(x)\Delta x+\sqrt{n\cdot p(x)\Delta x\cdot(1-p(x)\Delta x)}\,(\epsilon_{1}+\epsilon_{2})}=\frac{1+\sqrt{\frac{1}{n}(\frac{1}{p(x)\Delta x}-1)}\,\epsilon_{1}}{2+\sqrt{\frac{1}{n}(\frac{1}{p(x)\Delta x}-1)}\,(\epsilon_{1}+\epsilon_{2})}, (21)

where \epsilon_{1} and \epsilon_{2} are independent standard normal variables. To estimate the approximate variance of \hat{p}(y|x), we assume \epsilon_{1} and \epsilon_{2} are limited to a range \epsilon_{1},\epsilon_{2}\in(-c,c). Under this assumption, to first order we have

\hat{p}(y|x)\in\frac{1}{2}\pm\frac{c}{2}\sqrt{\frac{1}{n}\left(\frac{1}{p(x)\Delta x}-1\right)}. (22)

Thus, the larger p(x) is, the smaller the variance of \hat{p}(y|x). As a result, it is beneficial to seek high-density areas, where the classification boundary is more stable.
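As a sanity check of this argument, one can simulate the neighborhood counts directly (using binomial counts rather than the Gaussian approximation) and observe how the spread of \hat{p}(y|x) around 1/2 shrinks as p(x)\Delta x grows. The NumPy sketch below uses purely hypothetical values of n and p(x)\Delta x.

```python
import numpy as np

def spread_of_estimate(p_x_dx, n=1000, n_trials=100000, seed=0):
    """Spread of the local estimate p_hat(y=1|x) at a decision-boundary point,
    where each class contributes Binomial(n, p(x)*dx) samples to the bin."""
    rng = np.random.default_rng(seed)
    k1 = rng.binomial(n, p_x_dx, size=n_trials)  # class-1 counts in the bin
    k0 = rng.binomial(n, p_x_dx, size=n_trials)  # class-0 counts in the bin
    nonempty = (k0 + k1) > 0
    p_hat = k1[nonempty] / (k0[nonempty] + k1[nonempty])
    return p_hat.std()

for p_x_dx in [0.001, 0.01, 0.1]:
    print(p_x_dx, spread_of_estimate(p_x_dx))
# Higher density p(x) -> smaller spread of p_hat(y|x) around 1/2.
```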

B.2 The Relation between Probability Density and Classifier Confidence

We claimed in Section 3 that optimizing the likelihood is sometimes consistent with optimizing the classifier confidence. Here we explain when such a relation holds and when it does not.

Again, consider a binary classification task with one-dimensional data x and a uniform label prior; then

p(y=1|x)=\frac{p(x|y=1)}{p(x|y=1)+p(x|y=0)}. (23)

Consider how p(y=1|x) changes as x (and hence p(x)) changes:

\frac{\partial p(y=1|x)}{\partial x}=\frac{\frac{\partial p(x|y=1)}{\partial x}p(x|y=0)-\frac{\partial p(x|y=0)}{\partial x}p(x|y=1)}{[p(x|y=1)+p(x|y=0)]^{2}}. (24)

Assuming

\frac{\partial p(x)}{\partial x}=\frac{1}{2}\left(\frac{\partial p(x|y=1)}{\partial x}+\frac{\partial p(x|y=0)}{\partial x}\right)>0, (25)

the sign of \frac{\partial p(y=1|x)}{\partial x} is not determined. If data from different classes are well-separated, such that \frac{\partial p(x|y=1)}{\partial x}>0 and \frac{\partial p(x|y=0)}{\partial x}<0, then \frac{\partial p(y=1|x)}{\partial x}>0. Conversely, if \frac{\partial p(x|y=1)}{\partial x}<0 and \frac{\partial p(x|y=0)}{\partial x}>0, then \frac{\partial p(y=1|x)}{\partial x}<0.

In conclusion, for most cases in a dataset with well-separated classes, optimizing p(x) has a similar effect to optimizing p(y|x). However, there are exceptions where the two objectives are opposed; a concrete one-dimensional illustration is sketched below.
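The sketch below makes this dichotomy concrete with hypothetical one-dimensional Gaussian class conditionals and a uniform prior: for well-separated classes the numerical gradients of p(x) and p(y=1|x) share a sign, while for overlapping classes of unequal width they can disagree.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gradients(x, mu1, s1, mu0, s0, h=1e-5):
    """Numerical gradients of p(x) and p(y=1|x) for a two-Gaussian mixture
    with a uniform class prior (illustrative parameters only)."""
    density = lambda u: 0.5 * (gaussian_pdf(u, mu1, s1) + gaussian_pdf(u, mu0, s0))
    posterior = lambda u: gaussian_pdf(u, mu1, s1) / (
        gaussian_pdf(u, mu1, s1) + gaussian_pdf(u, mu0, s0))
    d_density = (density(x + h) - density(x - h)) / (2 * h)
    d_posterior = (posterior(x + h) - posterior(x - h)) / (2 * h)
    return d_density, d_posterior

# Well-separated classes: both gradients are positive at x = 2.
print(gradients(x=2.0, mu1=3.0, s1=1.0, mu0=-3.0, s0=1.0))
# Overlapping classes: p(x) increases while p(y=1|x) decreases at x = 1.
print(gradients(x=1.0, mu1=-1.0, s1=2.0, mu0=2.0, s0=0.5))
```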

Appendix C Additional Analysis of Our Methodology

C.1 Effect of Forward Process for Adversarial Purification

Motivation for removing the forward process The objective of the forward process (injecting random noise) is to guide the training procedure of the reverse process. During adversarial purification, the model aims to recover the true label by maximizing the likelihood using the score function of the reverse process. Our method exclusively employs the reverse process, avoiding the forward process due to concerns over potential semantic changes induced by the random noise injection.

Existing adversarial purification methods [29, 46, 4, 45, 41, 48] follow the forward-reverse process of image generation tasks. However, they neither offer ablation studies nor provide theoretical evidence for the necessity of the forward process when the reverse process is already in use. The habitual combination of forward and reverse processes when applying diffusion models may not be a deliberate choice, warranting further consideration and evaluation.

The better performance observed after removing the forward process (see Fig. 2b) does not conflict with existing works. To put it more rigorously, the conclusion drawn from existing works [29, 46, 4, 45, 41, 48] is that including random noise is necessary for successful purification. Note that both the forward and reverse processes contain a noise-adding operation, so retaining either process introduces considerable random noise. Yoon et al. [48] is an example that applies the forward process but removes the noise term in the reverse process. Our work applies the reverse process with the Brownian-motion noise term, which is empirically sufficient to submerge the adversarial perturbation.

Evidence We therefore choose to discard the forward process, since it could lead to unexpected category drift and increase the risk of misclassification. The case study in Fig. 5 in Appendix E.3 illustrates this difference, where DiffPure causes more semantic change than COUP w/o guidance. For numerical evidence, we empirically find that after removing the forward process, "COUP w/o guidance" (i.e., DiffPure w/o forward process) achieves better robustness, as shown in Fig. 2b. For theoretical support, we refer to Proposition 2: introducing the forward noise doubles the \|\epsilon\| term, leading to a larger upper bound on the distance between the purified image and the input image.

C.2 Score Function Plays a Role as Denoiser

In this section, we explain why maximizing the likelihood p(r(\bm{x}_{\mathrm{adv}})) can remove adversarial noise, and then discuss the effect of the score function.

Specifically, the training objective [39] of the score function \bm{s}_{\theta}(\bm{x}_{t},t) is to mimic \nabla_{\bm{x}_{t}}\log p_{0t}(\bm{x}_{t}\mid\bm{x}_{0}) by score matching [19, 38]:

\bm{\theta}^{*}=\underset{\bm{\theta}}{\arg\min}\;\mathbb{E}_{t}\Bigl\{\lambda(t)\mathbb{E}_{\bm{x}_{0}}\mathbb{E}_{\bm{x}_{t}\mid\bm{x}_{0}}\Bigl[\bigl\|\bm{s}_{\bm{\theta}}(\bm{x}_{t},t)-\nabla_{\bm{x}_{t}}\log p_{0t}(\bm{x}_{t}\mid\bm{x}_{0})\bigr\|_{2}^{2}\Bigr]\Bigr\}. (26)

For the VP-SDE, we have p_{0t}(\bm{x}_{t}\mid\bm{x}_{0})\sim\mathcal{N}\left(\sqrt{\alpha_{t}}\bm{x}_{0},(1-\alpha_{t})\bm{I}\right), where \alpha_{t}=e^{-\int_{0}^{t}\beta(s)\mathrm{d}s}. Since the score of a Gaussian distribution is \nabla_{\mathbf{x}}\log p(\mathbf{x})=-\frac{\mathbf{x}-\bm{\mu}}{\sigma^{2}}, and \bm{x}_{t}=\sqrt{\alpha_{t}}\bm{x}_{0}+\sqrt{1-\alpha_{t}}\,\bm{\epsilon} with \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), we have

\bm{s}_{\theta}(\bm{x}_{t},t)\approx\mathbb{E}_{p_{0t}(\bm{x}_{0})}\left[\nabla_{\bm{x}_{t}}\log p_{0t}(\bm{x}_{t}\mid\bm{x}_{0})\right]=\mathbb{E}_{p_{0t}(\bm{x}_{0})}\left[-\frac{\bm{x}_{t}-\sqrt{\alpha_{t}}\bm{x}_{0}}{1-\alpha_{t}}\right]=\mathbb{E}_{p_{0t}(\bm{x}_{0})}\left[-\frac{\sqrt{1-\alpha_{t}}\,\bm{\epsilon}_{\theta}(\bm{x}_{t},t)}{1-\alpha_{t}}\right]=-\frac{\bm{\epsilon}_{\theta}(\bm{x}_{t},t)}{\sqrt{1-\alpha_{t}}}, (27)

where \bm{\epsilon}_{\theta}(\bm{x}_{t},t) is the predicted noise contained in \bm{x}_{t} and \alpha_{t} is the scaling factor. Therefore, the effect of the score function \bm{s}_{\bm{\theta}}(\bm{x}_{t},t) is to denoise the image \bm{x}_{t} while increasing the data likelihood.
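In code, this relation means that a noise-prediction network can be converted into a score estimate by a simple rescaling. The PyTorch sketch below assumes the linear VP-SDE schedule used above; eps_model is a placeholder for a noise-prediction network and not the exact interface of the checkpoint we use.

```python
import torch

def alpha_t(t, beta_min=0.1, beta_max=20.0):
    # alpha_t = exp(-int_0^t beta(s) ds) for the linear schedule beta(s).
    t = torch.as_tensor(t, dtype=torch.float32)
    return torch.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2))

def score_from_noise_prediction(eps_model, x_t, t):
    """Convert the predicted noise eps_theta(x_t, t) into the score s_theta(x_t, t)
    via s_theta = -eps_theta / sqrt(1 - alpha_t), following Eq. 27."""
    eps = eps_model(x_t, t)
    return -eps / torch.sqrt(1.0 - alpha_t(t))
```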

Appendix D More Implementation Details

Our code is available in an anonymous Github repository: https://anonymous.4open.science/r/COUP-3D8C/README.md.

D.1 Off-the-shelf Models

We develop our method on the basis of DiffPure: https://github.com/NVlabs/DiffPure/tree/master. For the checkpoint of the off-the-shelf generative model, we use vp/cifar10_ddpmpp_deep_continuous pre-trained by Score SDE: https://github.com/yang-song/score_sde. The robustness numbers of the baselines and the checkpoint of the classifier are provided by RobustBench: https://github.com/RobustBench/robustbench.
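As an illustration, the classifier checkpoint can be obtained through the RobustBench API roughly as follows; model_name="Standard" refers to the standardly trained WideResNet-28-10 for CIFAR-10, and this sketch covers only the loading step, not our full pipeline.

```python
import torch
from robustbench.utils import load_model

# Standardly-trained WideResNet-28-10 on CIFAR-10 from the RobustBench model zoo.
classifier = load_model(model_name="Standard", dataset="cifar10", threat_model="Linf").eval()

with torch.no_grad():
    dummy = torch.rand(1, 3, 32, 32)  # stands in for a purified CIFAR-10 image
    print(classifier(dummy).argmax(dim=-1))
```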

D.2 Key Consideration for Reproduction

In Table 1 and Table 2, we experiment with our method on an NVIDIA A100 GPU, while other experiments are tested on an NVIDIA V100 GPU.

Random Seed Following the setting of DiffPure [29], we use seed = 121 and data seed = 0.

Automatic Mixed Precision (AMP) In our experiments, we call torch.cuda.amp.autocast(enabled=False) to avoid the loss of precision caused by the A100 GPU's automatic use of reduced-precision (fp16) acceleration. This matters because the classifier-guidance term is small relative to the drift of each SDE step, so it is easily lost in low precision. Empirically, our method performs even better on a V100 GPU (which does not apply this acceleration), but the computation becomes too expensive. In other words, the results in Table 1 and Table 2 may improve further if such acceleration strategies are avoided, e.g., by testing on a V100 GPU.
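Concretely, the guarded call looks roughly like the following, where purify stands in for the COUP purification routine and is a placeholder rather than the exact function name in our code.

```python
import torch

def purify_full_precision(purify, x_adv):
    """Run purification with automatic mixed precision explicitly disabled, so the
    small classifier-guidance term in the SDE drift is not lost to reduced precision."""
    with torch.cuda.amp.autocast(enabled=False):
        return purify(x_adv.float())
```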

D.3 Resources and Running Time

To evaluate the robustness of COUP against AutoAttack, we use 2 NVIDIA A100 GPUs, and we run the analysis experiments against the APGD-ce attack on 2 NVIDIA V100 GPUs. It takes about 18 days to evaluate COUP against AutoAttack once (8 days for the "rand" mode + 10 days for the "standard" mode). For this reason, we do not provide error bars.

Appendix E More Experimental Results

E.1 Robustness against Salt-and-Pepper Noise Attack

Thanks to the powerful capability of the diffusion model, modeling the data distribution serves as a defensive measure not only against adversarial attacks but also against corruptions. Though this is not the main focus of this paper, we conduct a preliminary experiment on mitigating salt-and-pepper noise.

Specifically, we randomly select 500 examples from the CIFAR-10 dataset for evaluation and adopt t^{*}=0.06 against a salt-and-pepper noise attack with \epsilon=0.4. In this setting, the robustness of WideResNet-28-10 without purification is 87.80%, COUP without guidance achieves 92.60%, and our full COUP reaches 93.00%. These results indicate that COUP is also capable of defending against salt-and-pepper noise, with classifier-confidence guidance providing a further gain.
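For reference, one simple way to generate salt-and-pepper corruption at a given pixel-flip ratio \epsilon is sketched below; the exact attack implementation used in our evaluation may differ, so treat this purely as an illustration of the noise model.

```python
import torch

def salt_and_pepper(x, eps=0.4, seed=None):
    """Corrupt a batch of images x in [0, 1]: each pixel is independently set to
    0 ("pepper") or 1 ("salt") with total probability eps (illustrative only)."""
    if seed is not None:
        torch.manual_seed(seed)
    mask = torch.rand_like(x)
    x_noisy = x.clone()
    x_noisy[mask < eps / 2] = 0.0                     # pepper
    x_noisy[(mask >= eps / 2) & (mask < eps)] = 1.0   # salt
    return x_noisy
```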

E.2 Speed Test of Inference Time

We measure the inference time of DiffPure, the reverse-time SDE, and COUP for different t^{*}. The results in Table 4, tested on an NVIDIA V100 GPU, show that the inference time increases markedly with larger timestep t^{*}. The additional inference time of COUP comes from computing the gradient of the classifier confidence at each diffusion timestep.

Table 4: Inference time (in seconds) for different diffusion timesteps t^{*} on one example of CIFAR-10; the classifier is WideResNet-28-10.
Method t^{*}=0 t^{*}=0.001 t^{*}=0.050 t^{*}=0.100 t^{*}=0.150
DiffPure 0.006 0.140 3.056 5.770 8.648
COUP w/o Guidance 0.006 0.148 3.032 6.930 8.732
Our COUP 0.006 0.170 3.894 7.662 11.134

E.3 Purified Images

We plot the images purified by different purification methods, including DiffPure, COUP without Guidance, and our COUP. We follow the setting of Fig. 2b and use WRN-28-10 as the classifier.

As shown in Fig. 5, the images purified by DiffPure suffer from unexpected distortion caused by the forward diffusion process, while those purified by COUP w/o Guidance lose some details (e.g. texture) due to the denoising process. Specifically, the 2nd example purified by DiffPure suffers a significant semantic drift from a "frog" to a "bird", and the 3rd example acquires unexpected features that are not beneficial for classification. Moreover, COUP w/o Guidance loses a wing of the airplane in the 3rd image and blurs the ears of the cat in the 4th. In contrast, our COUP enhances classification-related features while removing the malicious perturbation.

We further focus on the purification effectiveness of COUP compared with the method without guidance.

In Fig. 6, we plot the bad cases of COUP w/o Guidance that are correctly classified after adding classifier guidance. It shows that the reverse-time SDE does destroy image details (e.g., the body of the ship in the 1st example, the texture of the dog, and the details of the truck), which we view as "over-purification". Our COUP alleviates this phenomenon, maintaining the predictive information that leads to correct classification.

In conclusion, these examples show that

  • DiffPure sometimes causes semantic changes due to the forward diffusion process.

  • Using only the reverse-time SDE (i.e. COUP w/o Guidance) avoids such semantic drift, but it suffers from information loss during denoising.

  • Our method tackles both problems by removing the forward diffusion process and enhancing classification-related features.

Appendix F Limitations

Despite its effectiveness, our method suffers from high computational cost at inference time. In addition, the timestep t^{*} and the guidance weight \lambda need to be tuned empirically, which further increases the computational overhead.

Appendix G Broader Impacts

Our method does not create new models; instead, it takes advantage of off-the-shelf generative and discriminative models. Moreover, we design a method for purifying adversarial perturbations, an adversarial defense that improves model robustness. It does not introduce risks of misuse; on the contrary, our purification method helps eliminate the impact of malicious attacks by removing adversarial perturbations.

Figure 5: Purified images by DiffPure, SDE, COUP.
Figure 6: Purified images to show the impact of classifier guidance.