
Certified \ell_{2} Attribution Robustness via Uniformly Smoothed Attributions

Fan Wang   Adams Wai-Kin Kong
School of Computer Science and Engineering
Nanyang Technological University
fan005@e.ntu.edu.sg   adamskong@ntu.edu.sg
Abstract

Model attribution is a popular tool to explain the rationales behind model predictions. However, recent work suggests that attributions are vulnerable to minute perturbations, which can be added to input samples to fool the attributions while maintaining the prediction outputs. Although empirical studies have shown positive performance via adversarial training, an effective certified defense method is urgently needed to understand the robustness of attributions. In this work, we propose a uniform smoothing technique that augments the vanilla attributions with noise uniformly sampled from a certain space. It is proved that, for all perturbations within the attack region, the cosine similarity between the uniformly smoothed attributions of the perturbed and unperturbed samples is guaranteed to be lower bounded. We also derive alternative formulations of the certification that are equivalent to the original one and provide the maximum size of perturbation or the minimum smoothing radius such that the attribution cannot be perturbed beyond a given threshold. We evaluate the proposed method on three datasets and show that it can effectively protect attributions from attacks, regardless of network architecture, training scheme and dataset size.

1 Introduction

The development and wider use of deep learning models in various security-sensitive applications, such as autonomous driving, medical diagnosis, and legal judgments, have raised discussions about the trustworthiness of these models. The lack of explainability of deep learning models, especially the recently popular large language models (Brown et al., 2020), has been one of the main concerns. Regulators have started to require the explainability of AI models in some applications (Goodman & Flaxman, 2017), and explainability has been one of the main focuses of the research community (Doshi-Velez & Kim, 2017). Model attributions, as one of the important tools to explain the rationales behind model predictions, have been used to understand the decision-making process of the models. For example, medical practitioners use the explanations generated by attribution methods to assist them in making important medical decisions (Antoniadi et al., 2021; Hrinivich et al., 2023; Du et al., 2022). Similarly, in autonomous vehicles, explainability helps to deal with potential liability and responsibility gaps (Atakishiyev et al., 2021; Burton et al., 2020), and people naturally require confirmation of safety-critical decisions. However, since these users often lack expertise in machine learning and the technical details of attributions, there is a risk that attributions are manipulated without their noticing. Attackers may specifically target attributions to mislead investigators, propagate false narratives, or evade detection, potentially leading to serious consequences. Consequently, practitioners could lose trust in these methods, resulting in their refusal to use them. Thus, a trustworthy application requires not only the predictions made by the AI models, but also the explanations produced by attributions. However, attributions have recently been shown to be vulnerable to small perturbations (Ghorbani et al., 2019). Similar to adversarial attacks, attribution attacks generate perturbations that can be added to input samples. These perturbations distort the attributions while leaving the prediction outputs unchanged, giving a false sense of security to practitioners who blindly trust the attributions. An effective defense method is urgently needed to protect attributions from such attacks.

Unlike adversarial defense, which has been extensively investigated with both empirical (Madry et al., 2018; Athalye et al., 2018; Carlini & Wagner, 2017) and certified (Cohen et al., 2019; Wong & Kolter, 2018; Yang et al., 2020; Lecuyer et al., 2019) methods to mitigate the harm of adversarial attacks, attribution defense has been largely neglected. Almost all attribution protection works focus on adversarially training the model by augmenting the training data with manipulated samples to improve the robustness of attributions (Boopathy et al., 2020; Chen et al., 2019; Ivankay et al., 2020; Sarkar et al., 2021; Wang & Kong, 2022). Only the recent work by Wang & Kong (2023) attempted to derive a practical upper bound for the worst-case attribution deviation after samples are attacked, but it suffers from strict assumptions and heavy computation. Overall, none of the previous works is capable of providing a certification ensuring that an attribution attack cannot change the attribution by more than a given threshold. In this work, we focus on a smoothed version of attributions and seek to provide a theoretical guarantee that the attributions are robust to any perturbation within an \ell_{2} attack budget.

Based on the previously defined formulation of attribution robustness (Wang & Kong, 2023), given a network f, its attribution function g and a perturbation \bm{\delta}\in\mathbb{R}^{d}, attribution robustness is defined as the worst-case attribution difference D(\cdot,\cdot) under the provided attack budget,

\max_{\bm{\delta}}\quad D(g(\bm{x}),g(\bm{x}+\bm{\delta}))\quad\text{s.t.}\quad\|\bm{\delta}\|\leq\epsilon. (1)

However, the aforementioned study only provides an approximate method to solve the optimization problem under the strict assumption that the network is locally linear, and, as a result, it does not generalize to modern neural networks. To give a complete certification of the problem, we follow this formulation and attempt to find an effective upper bound of the attribution difference, or equivalently, a lower bound of the attribution similarity, which can be applied to any network. We propose to use the uniformly smoothed attribution, a smoothed version of the original attribution, and show that, for all perturbations within the allowable attack budget, the cosine similarity between the perturbed and unperturbed uniformly smoothed attributions can be certified to be lower bounded. The contributions of this paper are summarized as follows:

  • We provide a theoretical guarantee on the robustness of the uniformly smoothed attribution to any perturbation within the allowable region. The robustness is measured by the similarity between the perturbed and unperturbed smoothed attributions. The method can be applied to any neural network and scales efficiently to larger images. To the best of the authors' knowledge, this is the first work that provides a theoretical guarantee for attribution robustness.

  • We present alternative formulations of the certification that are equivalent to the original one and practical to implement. The alternative formulations determine the maximum size of perturbation, or the minimum radius of smoothing, ensuring that the attribution remains within a given tolerance.

  • We demonstrate that uniform smoothing can protect the attribution against \ell_{2} attacks. More importantly, we evaluate the proposed method on the well-bounded integrated gradients and show that it can be effectively implemented and can successfully protect and certify the attributions against \ell_{2} attacks.

The rest of this paper is organized as follows. In Section 2, we review related works. In Section 3 and Section 4, we introduce the uniformly smoothed attribution and show that it can be certified against attribution attacks. In Section 5, we present experimental results and evaluate the proposed method. Finally, we conclude this paper in Section 6.

2 Related works

2.1 Attribution methods

Attribution methods study the importance of each input feature and measure how much each feature contributes to the model prediction. One popular attribution approach is the gradient-based method. Based on the property that the gradient measures the rate of change, gradient-based methods compute feature importance by weighting the gradients in different ways. Examples of gradient-based methods include saliency (Simonyan et al., 2014), integrated gradients (IG) (Sundararajan et al., 2017), and full-gradient (Srinivas & Fleuret, 2019). Other attribution methods include occlusion (Simonyan et al., 2014), which measures the importance of each input feature by occluding the feature and measuring the change in the model prediction, layer-wise relevance propagation (LRP) (Bach et al., 2015), which propagates the output relevance to the input layer, and SHAP-related methods (Lundberg & Lee, 2017; Sundararajan & Najmi, 2020; Kwon & Zou, 2022). An important property of attribution methods is the axiom of completeness, \sum_{i}g_{i}(\bm{x})=f_{j}(\bm{x}), which relates the attribution to the prediction score. Note that gradient-based attribution methods are upper-bounded since the gradients of the model output with respect to the input are upper-bounded.

2.2 Attribution attacks and defenses

Ghorbani et al. (2019) first pointed out that attributions can be fragile to iterative attribution attacks, and Dombrowski et al. (2019) extended the attack to a targeted setting in which attributions can be purposely changed into any preset pattern. Similar to adversarial attacks, attribution attacks maximize a loss function that measures the difference between the original attributions and the target attributions. In addition, the attribution attacks are constrained not to alter the classification results. To defend against attribution attacks, adversarial training (Madry et al., 2018) approaches have been adopted. Chen et al. (2019) and Boopathy et al. (2020) minimize the \ell_{p}-norm differences between perturbed and original attributions, and Ivankay et al. (2020) considers Pearson's correlation coefficient. Wang & Kong (2022) suggests using cosine similarity, emphasizing that attributions should be compared in direction rather than magnitude. All of the above defenses improve attribution robustness empirically, but none of them is able to provide certifications ensuring that the attributions will not change under any perturbation within the allowable attack region.

Figure 1: (left) Examples of attributions (zoom in for better visibility): (a) original image, (b) IG, (c) Gaussian smoothed IG, (d) uniformly smoothed IG. We show the integrated gradients (IG) and its corresponding smoothing results. For Gaussian smoothing, the noise level is set to \sigma=0.2, and for uniformly smoothed IG, an \ell_{2} ball with radius \sqrt{3}\sigma is used. (right) A 2D illustration of the volumes of \mathcal{B}(\bm{x};r) and \mathcal{B}(\tilde{\bm{x}};r), as well as the relationship between h(\bm{x}) and h(\tilde{\bm{x}}). Here h(\bm{x}) is the original attribution, and a\bm{v} represents the magnitude and direction of the translation of h(\bm{x}) after the sample is perturbed. V_{U} in Theorem 1 is the volume of the shaded region in the figure, and V_{\mathcal{S}} is the volume of each individual ball. When h(\bm{x}) is fixed, the lower bound of the cosine similarity between h(\bm{x}) and h(\bm{x})+a\bm{v} can be derived as a function of the volumes.

2.3 Randomized smoothing

The smoothing technique has been popular in improving certified adversarial robustness (Liu et al., 2018; Lecuyer et al., 2019; Cohen et al., 2019; Yang et al., 2020). A smoothed classifier takes a batch of inputs randomly sampled from the neighbourhood of the original input under a certain distribution \mu and makes the decision based on the most likely output, i.e., \operatorname*{arg\,max}_{y}\mathbb{P}_{\bm{\eta}\sim\mu}[F(\bm{x}+\bm{\eta})=y]. It provides a certification of the maximum radius such that no perturbation within the radius can alter the classification label. Cohen et al. (2019) certifies the \ell_{2} attack based on the Neyman-Pearson lemma, and the result is alternatively proved by Salman et al. (2019) using explicit Lipschitz constants. Lecuyer et al. (2019) and Teng et al. (2020) consider Laplacian smoothing for the \ell_{1} attack. Yang et al. (2020) derived a similar result for \ell_{\infty}, though the radius becomes small when the dimension of the data gets large. Kumar & Goldstein (2021) applied randomized smoothing to structured outputs, to which attributions belong, but the resulting bound for attributions is too loose to be meaningful. Thus, no existing work provides a defense and valid certifications for attributions using randomized smoothing, due to the difficulty of defining attribution robustness and of computing the attribution gradient. In this work, an effective method that formulates the smoothed attribution robustness as a simple optimization problem is proposed to provide certifications against attribution attacks.

3 Uniformly smoothed attribution

Consider a classifier f:\mathbb{R}^{d}\rightarrow\left[0,1\right]^{c} that maps the input \bm{x}\in\mathbb{R}^{d} to softmax outputs, and its attribution function g(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}. The smoothed attribution of f is a new attribution h constructed by taking the mean of the attributions at \bm{x}+\bm{\eta}, where \bm{\eta} is randomly drawn from a density \mu, i.e., the smoothed attribution h is defined as follows:

h(\bm{x})=\mathbb{E}_{\bm{\eta}\sim\mu}[g(\bm{x}+\bm{\eta})]. (2)

To construct the smoothed attribution, the density \mu can be chosen arbitrarily, and the smoothed attributions provide visually sharpened gradient-based attributions (see Figure 1 (left)). Smilkov et al. (2017) choose \mu to be a multivariate Gaussian distribution on input gradients to weaken the visually noisy attribution. In this work, we aim at certifying the attribution robustness under \ell_{2} attack; thus we choose \mu to be a uniform distribution on a d-dimensional closed space \mathcal{S} centered at \bm{0}, specifically the \ell_{2}-norm ball of radius r, \mathcal{B}(\bm{0};r)=\left\{\bm{y}:\|\bm{y}\|_{2}\leq r,\bm{y}\in\mathbb{R}^{d}\right\}. It can be seen that smoothing under this setting also provides high attribution quality (Figure 1(d)). To quantitatively evaluate the effectiveness of uniformly smoothed attribution, we further evaluate its performance using GridPG introduced by Rao et al. (2022), which quantifies the significance of individual features in terms of positive contributions or influences. The GridPG values of IG, uniformly smoothed IG and Gaussian smoothed IG for 5,000 randomly selected ImageNet examples are 0.4021, 0.4093 and 0.4110, respectively, which suggests that uniformly smoothed attributions achieve GridPG performance comparable to Gaussian smoothed attributions as well as the original non-smoothed attributions.

4 Certifying the cosine similarity of smoothed attributions

We now consider \mathcal{S} to be an \ell_{2}-norm ball for ease of analysing the certification. Adapted from the formulation in Eq. (1), the robustness of attribution is defined as the minimum possible attribution similarity when a natural image is perturbed by attribution attacks. Cosine similarity is chosen to measure the similarity, as it has been shown to be the most suitable alternative to the non-differentiable Kendall's rank correlation, which is the most common evaluation index (Wang & Kong, 2022). Suppose an \ell_{2} attribution attack is performed on an input sample \bm{x} with maximum allowable perturbation \epsilon, i.e., \tilde{\bm{x}}=\bm{x}+\bm{\delta}, where \|\bm{\delta}\|_{2}\leq\epsilon. For all \|\bm{\delta}\|_{2}\leq\epsilon, we want to find the minimum value of \cos(h(\bm{x}),h(\tilde{\bm{x}})), i.e., the lower bound of the cosine similarity between attributions. Thus, the optimization problem we are interested in is formulated as follows:

\min_{\bm{\delta}}\quad\cos(h(\bm{x}),h(\bm{x}+\bm{\delta}))\quad\text{s.t.}\quad\|\bm{\delta}\|_{2}\leq\epsilon. (3)

That is, we will show that, given an arbitrary sample point \bm{x}, for all perturbed samples \tilde{\bm{x}}, the cosine similarity between the original smoothed attribution and the corresponding perturbed smoothed attribution is guaranteed to be lower bounded by the optimum of (3). Alternatively, from a more practical perspective, given a threshold T for the cosine similarity, we want to know the maximum size of perturbation, or the minimum smoothing radius, such that no perturbation inside would cause the attribution difference to exceed the threshold.

However, although optimizing the cosine similarity of attributions is an intuitive way to find the lower bound, it is difficult in this problem to study the cosine function directly. Moreover, it is also intractable to optimize the cosine similarity with respect to a vector \bm{\delta}. To address this issue, we first reformulate the problem into an optimization over two scalars and then solve the alternative problem to obtain the lower bound of the cosine similarity. All proofs and derivations of theorems, lemmas and corollaries are provided in Appendix A.

4.1 One-dimensional reformulation

We note that the cosine similarity of attributions is in fact an inner product of their normalized vectors. Besides, we also observe that h(\bm{x}) is the mean of g(\bm{x}+\bm{\eta}) with respect to \bm{\eta}, which is, by definition, equivalent to the integral of g weighted by the density of the uniform distribution over \mathcal{B}(\bm{x};r). More importantly, for the perturbed example \tilde{\bm{x}}, the weighting region is \mathcal{B}(\tilde{\bm{x}};r), a translated version of \mathcal{B}(\bm{x};r), and the two regions intersect whenever the distance between their centers is smaller than twice the radius r. Therefore, given the input sample \bm{x} and its attribution h(\bm{x}), we can rewrite the minimization of the cosine similarity as follows:

\min_{a,\bm{v}}\quad\frac{h(\bm{x})^{T}}{\|h(\bm{x})\|}\left(\frac{h(\bm{x})+a\bm{v}}{\|h(\bm{x})+a\bm{v}\|}\right) (4)

where a represents the magnitude of the translation and the unit vector \bm{v} is the direction of translation, as shown in Figure 1 (right). It can be shown that the magnitude a is constrained by a constant related to the intersecting volume and the property of the attribution function itself.

In the high-dimensional case, for a fixed cosine similarity value, a given h(\bm{x}) and h(\tilde{\bm{x}}) in fact form a spherical cone. Thus, we can decompose the directional unit vector \bm{v} into \bm{v}=\cos\theta\bm{v}_{\parallel}+\sin\theta\bm{v}_{\perp}, where \bm{v}_{\perp} is perpendicular to h(\bm{x}) and \bm{v}_{\parallel} is parallel to h(\bm{x}). Then, we have

\min_{a,\theta}\quad\frac{\|h(\bm{x})\|+a\cos\theta}{\sqrt{(\|h(\bm{x})\|+a\cos\theta)^{2}+(a\sin\theta)^{2}}}, (5)

where 0\leq\theta\leq 2\pi.

Since a is a scalar representing the magnitude of the translation from h(\bm{x}) to h(\tilde{\bm{x}}), it can be shown that a is upper bounded by the magnitude of the gradient weighted by the ratio of volume change during the translation. Specifically, a\leq MV_{U}/V_{\mathcal{S}}, where V_{\mathcal{S}} is the volume of the \ell_{2}-ball \mathcal{B}(\bm{0};r), V_{U} is the volume of the union of the two sampling spaces centered at \bm{x} and \tilde{\bm{x}} minus their intersection, and M is a constant that depends on the upper bound of g. We notice that a gradient-based attribution is a function of the input gradient \nabla f(\bm{x}), which is bounded for Lipschitz continuous networks (see Lemma 1 in Appendix A); thus, the upper bound can be derived separately for different attribution functions.

4.2 Lower bound of cosine similarity for bounded attributions

We have now reformulated the optimization with respect to the vector \bm{v} into a simpler one with respect to the scalars a and \theta. By solving the alternative problem, our result shows that the smoothed attribution is robust within the half-angle of a spherical cone, as follows.

Theorem 1.

Let g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be an upper-bounded attribution function, and \bm{\eta}\stackrel{U}{\sim}\mathcal{B}(\bm{0};r). Let h be the smoothed version of g as defined in (2). Then, for all \tilde{\bm{x}}\in\left\{\bm{x}+\bm{\delta}\,|\,\|\bm{\delta}\|_{2}\leq\epsilon\right\}, we have \cos\left(h(\bm{x}),h(\tilde{\bm{x}})\right)\geq T, where

T=\frac{\|h(\bm{x})\|_{2}}{\sqrt{\|h(\bm{x})\|_{2}^{2}+M^{2}V_{U}^{2}/V^{2}_{\mathcal{S}}}} (6)

Here, M is the upper bound of g, V_{\mathcal{S}} is the volume of the \ell_{2}-ball \mathcal{B}(\bm{0};r), and V_{U} is the volume of the union of the two sampling spaces centered at \bm{x} and \tilde{\bm{x}} minus their intersection.

The entire proof can be found in Appendix A. The theorem points out that the lower bound of the cosine similarity is related to the smoothing space around the input samples and the maximum allowable perturbation size. Moreover, when the smoothing space is an \ell_{2}-norm ball, the above result can be obtained by directly computing the two volumes V_{U} and V_{\mathcal{S}}, which can be explicitly calculated by

V_{U}=2V_{\mathcal{S}}\times\left(1-I_{(2rh-h^{2})/r^{2}}\left(\frac{d+1}{2},\frac{1}{2}\right)\right) (7)

where h=r-\epsilon/2\geq 0 and I_{x}(a,b) is the regularized incomplete beta function, i.e., the cumulative distribution function of the beta distribution (Li, 2010).
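As an illustration of how the certificate can be evaluated in practice, the following sketch computes the lower bound T of Theorem 1 from Eq. (6) and Eq. (7) using SciPy's regularized incomplete beta function. The function name and the numeric values in the example call are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import betainc


def certified_lower_bound(h_norm, M, r, eps, d):
    """Lower bound T on cos(h(x), h(x~)) from Theorem 1 (l2 smoothing ball).

    h_norm : l2 norm of the smoothed attribution h(x)
    M      : upper bound on ||g||_2 for the base attribution
    r      : radius of the uniform smoothing ball
    eps    : l2 attack budget (the certificate requires r >= eps / 2)
    d      : input dimension
    """
    cap_h = r - eps / 2.0
    assert cap_h >= 0, "certificate requires r >= eps / 2"
    z = (2 * r * cap_h - cap_h ** 2) / r ** 2                    # argument of I_x(a, b) in Eq. (7)
    vu_over_vs = 2.0 * (1.0 - betainc((d + 1) / 2.0, 0.5, z))    # V_U / V_S
    return h_norm / np.sqrt(h_norm ** 2 + (M * vu_over_vs) ** 2)


# e.g. a 28x28 MNIST input; ||h(x)|| and M are illustrative placeholders
print(certified_lower_bound(h_norm=1.0, M=2.0, r=1.5, eps=0.5, d=784))
```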

We observe the following properties of the above theorem.

  1. Unlike previous attribution robustness works, such as Dombrowski et al. (2019), Singh et al. (2020), Boopathy et al. (2020) and Wang & Kong (2022), which require the networks to be twice-differentiable and need to change the ReLU activation into Softplus, the proposed result makes no assumption on the classifier. Thus, it can be safely applied to any neural network and any architecture.

  2. The lower bound depends on the radius of smoothing, r. The bound becomes larger as the radius of smoothing grows. In the extreme case when r tends to infinity, the smoothing spaces of the two samples completely overlap, the smoothed attributions coincide, and their cosine similarity becomes 1.

  3. At the same time, the lower bound decreases when the attack budget \epsilon increases. Besides, when \epsilon increases beyond the constraint h=r-\epsilon/2\geq 0 and tends to infinity, the two attributions move further apart and their cosine similarity tends to 0 in high-dimensional space, which makes the lower bound trivial.

  4. The proposed method can be scaled to datasets with large images. The lower bound is efficient to compute since only the smoothed attribution of the given sample is needed. In contrast, previous works that approximately estimate the attribution robustness (Wang & Kong, 2023) require computing the input Hessian and the corresponding eigenvalues and eigenvectors, which becomes intractable for larger images on modern neural networks.

Table 1: Comparison of top-k and Kendall's rank correlation between \ell_{2} perturbed and non-perturbed attributions on standard and robust models using smoothed and non-smoothed attributions.

                    Standard   IG-NORM (Chen et al., 2019)   TRADES (Zhang et al., 2019)   IGR (Wang & Kong, 2022)
SM        top-k     0.3449     0.6575                        0.6450                        0.8354
          Kendall   0.1496     0.4709                        0.4642                        0.7553
SmoothSM  top-k     0.3853     0.6261                        0.6082                        0.8363
          Kendall   0.1670     0.4238                        0.4119                        0.7568
IG        top-k     0.4742     0.7075                        0.6821                        0.8402
          Kendall   0.1744     0.5098                        0.5030                        0.7839
SmoothIG  top-k     0.5302     0.6730                        0.6528                        0.8460
          Kendall   0.3819     0.4494                        0.4533                        0.7612

4.3 Alternative formulations of the attribution robustness

The previous section formulates the robustness of the smoothed attribution in terms of the smallest cosine similarity between the original and perturbed smoothed attributions when the attack budget and the smoothing radius are fixed. In some scenarios, practitioners want to formulate the robustness in different ways. For example, we may want to find the maximum allowable perturbation \epsilon such that the cosine similarity between the original and perturbed smoothed attributions is guaranteed to be greater than a predefined threshold T. On the other hand, one can also obtain the minimum smoothing radius needed such that the desired attribution robustness is achieved within the allowable attack region. The following corollary provides these alternative formulations of the attribution robustness.

Corollary 1.

Let g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be a bounded attribution function, and \bm{\eta}\stackrel{U}{\sim}\mathcal{B}(\bm{0};r). Let h be the smoothed version of g as defined in (2).

  (i) Given a predefined threshold T\in[0,1], for all \|\bm{\delta}\|_{2}\leq\epsilon we have \cos(h(\bm{x}),h(\bm{x}+\bm{\delta}))\geq T, where

    \epsilon=2r\sqrt{1-I^{-1}_{Z}\left(\frac{d+1}{2},\frac{1}{2}\right)}. (8)
  (ii) Given a predefined threshold T\in[0,1] and the maximum perturbation size \epsilon\geq 0, the smoothed attribution satisfies \cos(h(\bm{x}),h(\tilde{\bm{x}}))\geq T for all \tilde{\bm{x}}\in\left\{\bm{x}+\bm{\delta}\,|\,\|\bm{\delta}\|_{2}\leq\epsilon\right\} when r\geq R, where

    R=\frac{\epsilon}{2}\left(1-I^{-1}_{Z}\left(\frac{d+1}{2},\frac{1}{2}\right)\right)^{-\frac{1}{2}}. (9)

I^{-1}_{z}(a,b) is the inverse of the regularized incomplete beta function, and Z is defined as

Z=1-\frac{\|h(\bm{x})\|_{2}}{2M}\sqrt{\frac{1}{T^{2}}-1} (10)

The derivation of the Corollary can be found in Appendix A. We notice that, in these formulations, when the smoothing radius is larger, the maximum allowable perturbation is also larger, which allows stronger attacks while keeping the attribution similarity within a controllable range. Similarly, when the maximum allowable perturbation is larger, the minimum smoothing radius is also larger, meaning that a larger smoothing radius is required to maintain the attribution similarity.
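A sketch of how these two alternative certificates can be evaluated is given below, using SciPy's inverse regularized incomplete beta function. The function name and the example values are illustrative assumptions, and the bound M and \|h(\bm{x})\|_{2} are assumed to have been estimated beforehand.

```python
import numpy as np
from scipy.special import betaincinv


def corollary_quantities(h_norm, M, T, d, r=None, eps=None):
    """Alternative certificates from Corollary 1 (illustrative sketch).

    Returns the maximum certified l2 perturbation for a given smoothing
    radius r, or the minimum smoothing radius for a given attack budget eps.
    Z must lie in [0, 1] for the certificate to be non-trivial.
    """
    Z = 1.0 - (h_norm / (2.0 * M)) * np.sqrt(1.0 / T ** 2 - 1.0)
    inv = betaincinv((d + 1) / 2.0, 0.5, Z)        # I_Z^{-1}((d+1)/2, 1/2)
    if r is not None:                              # Corollary 1 (i): max perturbation
        return 2.0 * r * np.sqrt(1.0 - inv)
    if eps is not None:                            # Corollary 1 (ii): min radius
        return (eps / 2.0) / np.sqrt(1.0 - inv)


# illustrative values on a 32x32x3 CIFAR-10 input
print(corollary_quantities(h_norm=1.0, M=2.0, T=0.5, d=3072, r=2.0))
print(corollary_quantities(h_norm=1.0, M=2.0, T=0.5, d=3072, eps=0.5))
```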

5 Experiments and results

Table 2: The theoretical lower bound (T in Eqn. (6)) for cosine similarity evaluated on baseline models using MNIST. Note that the bound is not achievable for r=0.5 when \epsilon=1.0, since the radius must be greater than \epsilon/2.

\epsilon=0.5
\ell_{2} radius (r)   0.5      1.0      1.5      2.0      2.5      3.0      3.5
Standard              0.3002   0.3141   0.3385   0.3732   0.4144   0.4600   0.5057
IG-NORM               0.4038   0.4189   0.4432   0.4729   0.5055   0.5466   0.5909
IGR                   0.4145   0.4269   0.4482   0.4792   0.5208   0.5748   0.6392

\epsilon=1.0
\ell_{2} radius (r)   0.5      1.0      1.5      2.0      2.5      3.0      3.5
Standard              /        0.3034   0.3092   0.3264   0.3716   0.3990   0.4178
IG-NORM               /        0.3650   0.3822   0.3892   0.4220   0.4974   0.5365
IGR                   /        0.3834   0.4025   0.4558   0.4914   0.5237   0.5358

In this section, we evaluate the effectiveness of the uniformly smoothed attribution. Following previous work on attribution robustness, we use the \ell_{2} attribution attack adapted from the Iterative Feature Importance Attacks (IFIA) by Ghorbani et al. (2019). It is first shown that uniformly smoothed attributions are less likely to be perturbed than non-smoothed attributions when attacked by IFIA. After that, we present the results of certification using the proposed lower bound of cosine similarity. The experiments are conducted on baseline models including adversarially robust models (Madry et al., 2018; Zhang et al., 2019), attributionally robust models (Chen et al., 2019; Singh et al., 2020; Ivankay et al., 2020; Wang & Kong, 2022), as well as non-robust models trained for standard classification tasks. Following those baselines, the method is tested on the validation sets of MNIST (LeCun et al., 2010) using a small convolutional network, on CIFAR-10 (Krizhevsky, 2009) using a ResNet-18 (He et al., 2016), and on ImageNet (Russakovsky et al., 2015) using a ResNet-50 (He et al., 2016). More details of the experiments are described in the Appendix. All experiments are run on an NVIDIA GeForce RTX 3090. Source code will be released later.

To empirically compute the uniformly smoothed attribution for every sample, N points are uniformly sampled from the d-dimensional \ell_{2} ball and added to the input sample. To do so, the sampling technique introduced by Box & Muller (1958) is applied. The integration in Eqn. 2 is then empirically estimated using Monte Carlo integration, i.e., \hat{h}(\bm{x})=\frac{1}{N}\sum g(\bm{x}+\bm{\eta}_{i}), where \bm{\eta}_{i}\stackrel{U}{\sim}\mathcal{B}(\bm{0};r). For large N, the estimator \hat{h}(\bm{x}) almost surely converges to h(\bm{x}) (Feller, 1991); hence the convergence of \hat{T} to T can be obtained (see Appendix C.1 for details). Unless otherwise stated, we choose the number of samples N to be 100,000 to compute the proposed lower bound, and N^{\ast}=300 for the uniformly smoothed attribution being attacked, in all experiments.
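A minimal PyTorch-style sketch of this Monte Carlo estimator is given below, assuming a base attribution function that operates on batches. The Gaussian-direction plus radial-rescaling sampler is one standard way to draw uniform points in an \ell_{2} ball (Box-Muller underlies the normal deviates); all function names, shapes and batch sizes are illustrative assumptions.

```python
import torch


def sample_l2_ball(n, d, r, device="cpu"):
    """Draw n points uniformly from the d-dimensional l2 ball of radius r:
    a Gaussian direction rescaled by r * U^(1/d)."""
    g = torch.randn(n, d, device=device)
    directions = g / g.norm(dim=1, keepdim=True)
    radii = r * torch.rand(n, 1, device=device) ** (1.0 / d)
    return directions * radii


def smoothed_attribution(attr_fn, x, r, n_samples=300, batch=50):
    """Monte Carlo estimate of h(x) = E_eta[g(x + eta)], eta ~ U(B(0; r)).

    attr_fn : base attribution g, mapping a batch of inputs to attributions
    x       : a single input of shape (C, H, W)
    """
    d = x.numel()
    total = torch.zeros_like(x)
    done = 0
    while done < n_samples:
        m = min(batch, n_samples - done)
        eta = sample_l2_ball(m, d, r, device=x.device).view(m, *x.shape)
        total += attr_fn(x.unsqueeze(0) + eta).detach().sum(dim=0)
        done += m
    return total / n_samples
```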

5.1 Evaluation of the robustness of uniformly smoothed attribution

Table 3: Theoretical lower bounds evaluated on the non-robust model and ImageNet using different radii r and attack sizes \epsilon. Note that the method is only applicable when the radius r is greater than \epsilon/2.

r              0.5      1.0      1.5      2.0      2.5      3.0      3.5
\epsilon=0.5   0.2612   0.2594   0.2704   0.2716   0.2892   0.2973   0.3029
\epsilon=1.0   /        0.1803   0.1992   0.1746   0.1904   0.2127   0.2502
\epsilon=2.0   /        /        0.1753   0.1852   0.2044   0.2015   0.2045

We first conduct an experiment to verify that the uniformly smoothed attribution itself is more robust than the original attribution. Uniform smoothing over the \ell_{2} ball with radius 0.5 is applied to the saliency map (SM) (Simonyan et al., 2014) and integrated gradients (IG) (Sundararajan et al., 2017), and evaluated on CIFAR-10. The resulting attributions are denoted by SmoothSM and SmoothIG, respectively. We then attack the attributions using the \ell_{2} IFIA attack and evaluate the robustness using Kendall's rank correlation and top-k intersection (Ghorbani et al., 2019). The experiments are evaluated on both a non-robust model (Standard) and robust models (IG-NORM, TRADES, IGR). Note that IFIA is performed directly on SmoothSM and SmoothIG, instead of their original counterparts. Since the PGD-like attribution attack requires taking the derivative of the attribution to determine the direction of gradient descent, double backpropagation is needed. Thus, it is necessary to replace the ReLU activation with the twice-differentiable Softplus during the attack (Dombrowski et al., 2019). The results are shown in Table 1.
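For concreteness, the sketch below outlines an untargeted \ell_{2} attribution attack of this kind applied to a simple saliency attribution: it iteratively pushes the attribution away from the original one while keeping the predicted label fixed and projecting onto the \ell_{2} ball. It only illustrates the attack structure, not the exact IFIA procedure (which uses a top-k loss); the cosine loss, step size and helper names are assumptions.

```python
import torch
import torch.nn.functional as F


def input_gradient(model, x, label, create_graph=False):
    """Saliency-style attribution: gradient of the class score w.r.t. x.
    x must already require grad (or be part of a graph)."""
    score = model(x)[0, label]
    return torch.autograd.grad(score, x, create_graph=create_graph)[0]


def l2_attribution_attack(model, x, label, eps=0.5, steps=200, step_size=0.1):
    """Push the attribution of x + delta away from that of x (cosine loss),
    keep ||delta||_2 <= eps and keep the predicted label unchanged.
    x has shape (1, C, H, W); the model should use Softplus activations so
    that the attribution is differentiable."""
    x0 = x.detach().requires_grad_(True)
    g0 = input_gradient(model, x0, label).detach().flatten()

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        x_adv = x + delta
        g = input_gradient(model, x_adv, label, create_graph=True).flatten()
        loss = F.cosine_similarity(g0, g, dim=0)        # similarity to be minimized
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            new_delta = delta - step_size * grad / (grad.norm() + 1e-12)
            if new_delta.norm() > eps:                  # project onto the l2 ball
                new_delta = new_delta * eps / new_delta.norm()
            if model(x + new_delta).argmax(dim=1).item() == label:
                delta.copy_(new_delta)                  # keep the prediction unchanged
    return (x + delta).detach()
```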

We observe that, for the non-robust model, both SmoothSM and SmoothIG perform better than their non-smoothed counterparts on both metrics, which shows that the uniformly smoothed attribution itself is more resistant to attribution attacks. For models that are specifically trained to defend against attribution attacks using heuristic methods adapted from adversarial training, e.g., IG-NORM, TRADES and IGR, the smoothed attributions show robustness comparable to the non-smoothed attributions. Moreover, we also notice that SmoothIG performs better than SmoothSM, especially for the non-robust model. This can be attributed to the fact that IG satisfies the axiom of completeness, which ensures that the sum of IG is upper-bounded by the model output. In addition, we also observe that the smoothing technique does not always enhance the robustness of attribution, as measured by top-k intersection and Kendall's rank correlation. It is worth noting that Yeh et al. (2019) argued that randomized smoothing can reduce attribution sensitivity and consequently improve robustness. The disparity in our findings stems from the fact that we specifically evaluated Kendall's rank correlation, whereas their study defined sensitivity based on \ell_{2}-norm distance, which was subsequently deemed inappropriate for evaluating attribution robustness by Wang & Kong (2022).

5.2 Evaluation of the certification of uniformly smoothed attributions

In this section, the lower bound of the cosine similarity between the original and attacked attributions is reported. We use integrated gradients as an example since they are well-bounded due to the axiom of completeness; the technique can also be applied to any other gradient-based attribution.

Table 2 reports the theoretical bound evaluated on MNIST, computed using Theorem 1. We include the non-robust model and two attributionally robust models and compute the bound for different \ell_{2} radius r and attack size \epsilon pairs. The lower bounds are validated by examining actual \ell_{2} attacks. Specifically, each input sample is attacked 20 times and the cosine similarities between the resulting perturbed attributions and the original attributions are examined. In total, 200,000 attacked images are tested for each parameter pair, and none of the measured cosine similarities falls below the theoretical bound. Moreover, we also observe that the bound becomes tighter when the \ell_{2} radius r increases and when the attack size \epsilon decreases. Besides, since IGR is more robust than IG-NORM (Wang & Kong, 2022), we can observe that the lower bound is also a valid measurement of the robustness of the models.

In Figure 2, we show the gap between the theoretical lower bound and the empirical cosine similarity between the original and perturbed attributions. The results are evaluated on CIFAR-10 using IGR. Out of 10,000 testing samples, the 200 with the smallest gaps are chosen for each pair of r and \epsilon, and the gaps are sorted for better visualization. We notice that the gaps between the theoretical bound and the empirical cosine similarity are positive and small, which shows the validity and the tightness of the proposed bound.

In Table 3, we also include the theoretical lower bound evaluated on ImageNet to show that the proposed method is applicable to large-scale datasets. Since current attribution attacks and attribution defense methods do not scale to large-scale datasets, we only include the non-robust model. Since our method does not rely on the second-order derivative of the output with respect to the input, it can be scaled to ImageNet-size datasets. For the experiments on ResNet-50, each certification of a single sample takes around 15 seconds. We observe that the reported bounds are also consistent with our theoretical findings.

6 Conclusion and limitations

Figure 2: The gap between theoretical bounds and empirical cosine similarity between original and perturbed attributions evaluated on CIFAR-10 using IGR: (a) r=1.5, (b) r=2.0, (c) r=2.5.

In this paper, we use the uniformly smoothed attribution to certify attribution robustness evaluated by cosine similarity. The smoothed attribution is constructed by taking the mean of the attributions computed from input samples augmented with noise uniformly sampled from an \ell_{2} ball. It is proved that the cosine similarity between the original and perturbed smoothed attributions is lower-bounded, based on a geometric formulation related to the volume of the hyperspherical cap. Alternative formulations are provided to find the maximum allowable size of perturbations and the minimum radius of smoothing required to maintain the attribution robustness. The method works on bounded gradient-based attribution methods for all convolutional neural networks and is scalable to large datasets. We empirically demonstrate that the method can be used to certify attribution robustness, using the well-bounded integrated gradients and state-of-the-art attributionally robust models on MNIST, CIFAR-10 and ImageNet.

7 Limitations and broader impacts

The method in this paper can be generally applied to any convolutional neural network and any bounded attribution method. Although the existence of an upper bound has been shown for all gradient-based methods, in extreme cases where the upper bound for a certain attribution is trivial, i.e., an extremely large value, the proposed lower bound for attribution robustness also becomes trivial. In future work, we will investigate the upper bounds of other bounded attribution methods and provide the corresponding lower bounds for attribution robustness.

Our work attempts to draw the attention of the community to the need for a guarantee of attribution robustness. With the increasingly large number of applications of deep learning, the transparency and trustworthiness of neural networks are crucial for users to understand the outcomes and to avoid any abuse of the techniques. While the study of the security of networks could reveal potential risks that can be misused, we believe this work has more positive impacts on the community and can encourage the development of more trustworthy deep learning applications.

References

  • Antoniadi et al. (2021) Anna Markella Antoniadi, Yuhan Du, Yasmine Guendouz, Lan Wei, Claudia Mazo, Brett A Becker, and Catherine Mooney. Current challenges and future opportunities for xai in machine learning-based clinical decision support systems: a systematic review. Applied Sciences, 11(11):5088, 2021.
  • Atakishiyev et al. (2021) Shahin Atakishiyev, Mohammad Salameh, Hengshuai Yao, and Randy Goebel. Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions. arXiv preprint arXiv:2112.11561, 2021.
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp.  274–283. PMLR, 2018.
  • Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
  • Boopathy et al. (2020) Akhilan Boopathy, Sijia Liu, Gaoyuan Zhang, Cynthia Liu, Pin-Yu Chen, Shiyu Chang, and Luca Daniel. Proper network interpretability helps adversarial robustness in classification. In International Conference on Machine Learning, pp.  1014–1023. PMLR, 2020.
  • Box & Muller (1958) George EP Box and Mervin E Muller. A note on the generation of random normal deviates. The annals of mathematical statistics, 29(2):610–611, 1958.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Burton et al. (2020) Simon Burton, Ibrahim Habli, Tom Lawton, John McDermid, Phillip Morgan, and Zoe Porter. Mind the gaps: Assuring the safety of autonomous systems from an engineering, ethical, and legal perspective. Artificial Intelligence, 279:103201, 2020.
  • Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp.  39–57. IEEE Computer Society, 2017.
  • Chen et al. (2019) Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, and Somesh Jha. Robust attribution regularization. In Advances in Neural Information Processing Systems, pp.  14300–14310, 2019.
  • Cohen et al. (2019) Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp.  1310–1320. PMLR, 2019.
  • Das & Geisler (2021) Abhranil Das and Wilson S Geisler. A method to integrate and classify normal distributions. Journal of Vision, 21(10):1–1, 2021.
  • Davies (1980) Robert B Davies. The distribution of a linear combination of \chi^{2} random variables. Journal of the Royal Statistical Society Series C: Applied Statistics, 29(3):323–333, 1980.
  • Dombrowski et al. (2019) Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp.  13589–13600, 2019.
  • Doshi-Velez & Kim (2017) Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  • Du et al. (2022) Yuhan Du, Anthony R Rafferty, Fionnuala M McAuliffe, Lan Wei, and Catherine Mooney. An explainable machine learning-based clinical decision support system for prediction of gestational diabetes mellitus. Scientific Reports, 12(1):1170, 2022.
  • Duchesne & De Micheaux (2010) Pierre Duchesne and Pierre Lafaye De Micheaux. Computing the distribution of quadratic forms: Further comparisons between the liu–tang–zhang approximation and exact methods. Computational Statistics & Data Analysis, 54(4):858–862, 2010.
  • Feller (1991) William Feller. An introduction to probability theory and its applications, Volume 2, volume 81. John Wiley & Sons, 1991.
  • Ghorbani et al. (2019) Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  3681–3688, 2019.
  • Goodman & Flaxman (2017) Bryce Goodman and Seth Flaxman. European union regulations on algorithmic decision-making and a “right to explanation”. AI magazine, 38(3):50–57, 2017.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hrinivich et al. (2023) William Thomas Hrinivich, Tonghe Wang, and Chunhao Wang. Interpretable and explainable machine learning models in oncology. Frontiers in Oncology, 13:1184428, 2023.
  • Ivankay et al. (2020) Adam Ivankay, Ivan Girardi, Chiara Marchiori, and Pascal Frossard. Far: A general framework for attributional robustness. arXiv preprint arXiv:2010.07393, 2020.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • Kumar & Goldstein (2021) Aounon Kumar and Tom Goldstein. Center smoothing: Certified robustness for networks with structured outputs. Advances in Neural Information Processing Systems, 34:5560–5575, 2021.
  • Kwon & Zou (2022) Yongchan Kwon and James Y Zou. Weightedshap: analyzing and improving shapley based feature attributions. Advances in Neural Information Processing Systems, 35:34363–34376, 2022.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Lecuyer et al. (2019) Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp.  656–672. IEEE, 2019.
  • Li (2010) Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010.
  • Liu et al. (2018) Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  369–385, 2018.
  • Lundberg & Lee (2017) Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Paulavičius & Žilinskas (2006) Remigijus Paulavičius and Julius Žilinskas. Analysis of different norms and corresponding lipschitz constants for global optimization. Technological and Economic Development of Economy, 12(4):301–306, 2006.
  • Rao et al. (2022) Sukrut Rao, Moritz Böhle, and Bernt Schiele. Towards better understanding attribution methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10223–10232, 2022.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Salman et al. (2019) Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
  • Sarkar et al. (2021) Anindya Sarkar, Anirban Sarkar, and Vineeth N Balasubramanian. Enhanced regularizers for attributional robustness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  2532–2540, 2021.
  • Simonyan et al. (2014) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014.
  • Singh et al. (2020) Mayank Singh, Nupur Kumari, Puneet Mangla, Abhishek Sinha, Vineeth N Balasubramanian, and Balaji Krishnamurthy. Attributional robustness training using input-gradient spatial alignment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pp.  515–533. Springer, 2020.
  • Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
  • Srinivas & Fleuret (2019) Suraj Srinivas and François Fleuret. Full-gradient representation for neural network visualization. Advances in neural information processing systems, 32, 2019.
  • Sundararajan & Najmi (2020) Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pp.  9269–9278. PMLR, 2020.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp.  3319–3328. PMLR, 2017.
  • Teng et al. (2020) Jiaye Teng, Guang-He Lee, and Yang Yuan. $\ell_1$ adversarial robustness certificates: a randomized smoothing approach, 2020. URL https://openreview.net/forum?id=H1lQIgrFDS.
  • Wang & Kong (2022) Fan Wang and Adams Wai-Kin Kong. Exploiting the relationship between kendall’s rank correlation and cosine similarity for attribution protection. Advances in Neural Information Processing Systems, 35:20580–20591, 2022.
  • Wang & Kong (2023) Fan Wang and Adams Wai-Kin Kong. A practical upper bound for the worst-case attribution deviations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  24616–24625, 2023.
  • Wong & Kolter (2018) Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp.  5286–5295. PMLR, 2018.
  • Yang et al. (2020) Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pp.  10693–10705. PMLR, 2020.
  • Yeh et al. (2019) Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and Pradeep K Ravikumar. On the (in) fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp.  7472–7482. PMLR, 2019.

Appendix A Proofs

Lemma 1.

(Paulavičius & Žilinskas (2006)) For an L-Lipschitz function f:\mathbb{R}^{d}\rightarrow\mathbb{R},

|f(\bm{x})-f(\bm{y})|\leq L\|\bm{x}-\bm{y}\|_{q} (11)

where L=\max\left\{\|\nabla f(\bm{x})\|_{p}:\bm{x}\in S\right\} is the Lipschitz constant. Thus, \|\nabla f(\bm{x})\|_{p}\leq L.

Proof.

Refer to Paulavičius & Žilinskas (2006) for the proof. ∎

Theorem 1 (restated).

Proof.

As defined in Eqn. (2)

h(\bm{x})=\mathbb{E}_{\bm{\eta}\sim\mathcal{B}(\bm{0};r)}[g(\bm{x}+\bm{\eta})]=\frac{1}{V_{\mathcal{S}}}\int_{\bm{\eta}\sim\mathcal{B}(\bm{0};r)}g(\bm{x}+\bm{\eta})d\bm{\eta} (12)

where V_{\mathcal{S}} is the volume of the \ell_{p}-ball with radius r. Similarly, let \tilde{\bm{x}}=\bm{x}+\bm{\delta}, where \bm{\delta}\in\mathbb{R}^{d} is a vector and \|\bm{\delta}\|_{2}\leq\epsilon. Then, we have

h(\tilde{\bm{x}})=\frac{1}{V_{\mathcal{S}}}\int_{\bm{\eta}\sim\mathcal{B}(\bm{0};r)}g(\tilde{\bm{x}}+\bm{\eta})d\bm{\eta} (13)

We note that when \bm{\eta}\sim\mathcal{B}(\bm{0};r), \bm{x}+\bm{\eta}\sim\mathcal{B}(\bm{x};r) and \tilde{\bm{x}}+\bm{\eta}\sim\mathcal{B}(\tilde{\bm{x}};r). We then rewrite h(\bm{x}) and h(\tilde{\bm{x}}) as follows:

h(\bm{x})=\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}g(\bm{x})d\bm{x}}_{R_{1}}+\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\cap\mathcal{B}(\bm{x};r)}g(\bm{x})d\bm{x}}_{R_{2}} (14)

and

h(\tilde{\bm{x}})=\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\cap\mathcal{B}(\bm{x};r)}g(\bm{x})d\bm{x}}_{R_{2}}+\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}g(\bm{x})d\bm{x}}_{R_{3}} (15)

Hence,

h(\tilde{\bm{x}})=h(\bm{x})-R_{1}+R_{3} (16)

Denote a\bm{v}=R_{3}-R_{1}, where \bm{v} is a unit vector in the direction of R_{3}-R_{1} and a=\|R_{3}-R_{1}\|_{2} is its magnitude. Then, we have

\cos(h(\bm{x}),h(\tilde{\bm{x}}))=\frac{h(\bm{x})^{\top}}{\|h(\bm{x})\|_{2}}\left(\frac{h(\bm{x})+a\bm{v}}{\|h(\bm{x})+a\bm{v}\|_{2}}\right) (17)

Note that the attribution g(\bm{x}) is upper bounded by M, specifically, \|g(\bm{x})\|_{2}\leq M for some constant M. Thus, we can derive that

a =\|R_{3}-R_{1}\|_{2} (18)
=\left\|\frac{1}{V_{\mathcal{S}}}\left(\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}g(\bm{x})d\bm{x}-\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}g(\bm{x})d\bm{x}\right)\right\|_{2} (19)
\leq\frac{1}{V_{\mathcal{S}}}\left(\left\|\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}g(\bm{x})d\bm{x}\right\|_{2}+\left\|\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}g(\bm{x})d\bm{x}\right\|_{2}\right) (20)
\leq\frac{1}{V_{\mathcal{S}}}\left(\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}\left\|g(\bm{x})\right\|_{2}d\bm{x}+\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}\left\|g(\bm{x})\right\|_{2}d\bm{x}\right) (21)
\leq\frac{1}{V_{\mathcal{S}}}\left(\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}Md\bm{x}+\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}Md\bm{x}\right) (22)
=M\times\frac{V_{\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)\,\cup\,\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}}{V_{\mathcal{S}}}=M\frac{V_{U}}{V_{\mathcal{S}}} (23)

Thus, the lower bound of \cos(h(\bm{x}),h(\tilde{\bm{x}})) can be found by solving the following optimization problem (\|\cdot\| in the remainder of this proof denotes the \ell_{2}-norm unless otherwise specified).

\min_{\bm{v}}\quad\frac{h(\bm{x})^{\top}}{\|h(\bm{x})\|}\left(\frac{h(\bm{x})+a\bm{v}}{\|h(\bm{x})+a\bm{v}\|}\right) (24)
\text{s.t.}\quad\|\bm{v}\|=1,\quad a\leq M\frac{V_{U}}{V_{\mathcal{S}}}

Since h(\bm{x}) and h(\tilde{\bm{x}}) form a spherical cone, we can decompose \bm{v} as \bm{v}=\cos\theta\bm{v}_{\parallel}+\sin\theta\bm{v}_{\perp}, where \bm{v}_{\parallel} and \bm{v}_{\perp} are two orthogonal unit vectors such that h^{\top}(\bm{x})\bm{v}_{\perp}=0 and \bm{v}_{\parallel}=h(\bm{x})/\|h(\bm{x})\|. Then, the optimization problem can be rewritten as

\min\quad\bm{v}_{\parallel}^{\top}\left(\frac{h(\bm{x})+a(\cos\theta\bm{v}_{\parallel}+\sin\theta\bm{v}_{\perp})}{\|h(\bm{x})+a(\cos\theta\bm{v}_{\parallel}+\sin\theta\bm{v}_{\perp})\|}\right) (25)
\Rightarrow\quad\min\quad\bm{v}_{\parallel}^{\top}\left(\frac{\|h(\bm{x})\|\bm{v}_{\parallel}+a(\cos\theta\bm{v}_{\parallel}+\sin\theta\bm{v}_{\perp})}{\|\|h(\bm{x})\|\bm{v}_{\parallel}+a(\cos\theta\bm{v}_{\parallel}+\sin\theta\bm{v}_{\perp})\|}\right) (26)
\Rightarrow\quad\min\quad\frac{(\|h(\bm{x})\|+a\cos\theta)\bm{v}_{\parallel}^{\top}\bm{v}_{\parallel}+a\sin\theta\,\bm{v}_{\parallel}^{\top}\bm{v}_{\perp}}{\sqrt{(\|h(\bm{x})\|+a\cos\theta)^{2}\bm{v}_{\parallel}^{\top}\bm{v}_{\parallel}+(a\sin\theta)^{2}\bm{v}_{\perp}^{\top}\bm{v}_{\perp}}} (27)
\Rightarrow\quad\min\quad\frac{\|h(\bm{x})\|+a\cos\theta}{\sqrt{(\|h(\bm{x})\|+a\cos\theta)^{2}+(a\sin\theta)^{2}}} (28)

Since h(\bm{x}) is known for a given sample, the optimization problem can be written as follows by taking \|h(\bm{x})\|=c:

\min\quad\frac{c+a\cos\theta}{\sqrt{(c+a\cos\theta)^{2}+(a\sin\theta)^{2}}} (29)
\text{s.t.}\quad a\leq M\frac{V_{U}}{V_{\mathcal{S}}}

We now consider the Lagrange function of the optimization problem:

\mathcal{L}(a,\theta,\lambda)=\frac{c+a\cos\theta}{\sqrt{(c+a\cos\theta)^{2}+(a\sin\theta)^{2}}}-\lambda\left(a-M\frac{V_{U}}{V_{\mathcal{S}}}\right) (30)

Taking the derivatives of \mathcal{L} with respect to a and \theta and setting them to zero, we have

\frac{\partial}{\partial a}\mathcal{L}=\frac{1}{T^{2}}\left(T\cos\theta-\frac{1}{T}\left(c\cos\theta+2a\right)\times(c+a\cos\theta)\right)-\lambda=0 (31)

and

\frac{\partial}{\partial\theta}\mathcal{L}=\frac{1}{T^{2}}\left(-a\sin\theta\cdot T+\frac{1}{T}\left(c^{2}a\sin\theta+ca^{2}\sin\theta\cos\theta\right)\right)=0 (32)

where T=\sqrt{(c+a\cos\theta)^{2}+(a\sin\theta)^{2}}. Solving the above equations, we have

\cos\theta=0\quad\text{or}\quad a=0 (33)

where a=0 attains the maximum and \cos\theta=0 attains the minimum. Therefore, the lower bound of \cos(h(\bm{x}),h(\tilde{\bm{x}})) is

\cos(h(\bm{x}),h(\tilde{\bm{x}}))\geq\frac{c}{\sqrt{c^{2}+\left(M\frac{V_{U}}{V_{\mathcal{S}}}\right)^{2}}}=\frac{\|h(\bm{x})\|}{\sqrt{\|h(\bm{x})\|^{2}+(MV_{U}/V_{\mathcal{S}})^{2}}} (34)

See Corollary 1.

Proof.

Corollary 1 can be obtained by fixing T and treating r as the unknown, and by fixing T and treating \epsilon as the unknown, respectively. We first derive that

I_{(2rh-h^{2})/r^{2}}\left(\frac{d+1}{2},\frac{1}{2}\right)=1-\frac{\|h(\bm{x})\|_{2}}{2M}\sqrt{\frac{1}{T^{2}}-1}=Z (35)

Using the inverse of the regularized incomplete beta function, i.e., x=I^{-1}_{y}(a,b), and h=r-\epsilon/2, we have

I^{-1}_{Z}\left(\frac{d+1}{2},\frac{1}{2}\right)=(2rh-h^{2})/r^{2}=1-\frac{\epsilon^{2}}{4r^{2}} (36)

The results in Corollary 1 then follow accordingly. ∎
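For concreteness, the following is a minimal numerical sketch (ours, not the authors' released code) of how Eqns. (35) and (36) can be combined to recover the maximum allowable perturbation size from a threshold T, the smoothed-attribution norm c=\|h(\bm{x})\|_{2}, the bound M, the input dimension d and the smoothing radius r; the helper name and the values in the example call are purely illustrative.

```python
# Hedged sketch: solve Eqns. (35)-(36) for the maximum allowable l2 perturbation size.
import numpy as np
from scipy.special import betaincinv  # inverse regularized incomplete beta function

def max_perturbation_size(T, c, M, d, r):
    # Eqn. (35): Z = 1 - (c / (2M)) * sqrt(1 / T^2 - 1), with c = ||h(x)||_2.
    Z = 1.0 - (c / (2.0 * M)) * np.sqrt(1.0 / T**2 - 1.0)
    # Eqn. (36): I^{-1}_Z((d + 1)/2, 1/2) = 1 - eps^2 / (4 r^2), so eps = 2r sqrt(1 - x).
    x = betaincinv((d + 1) / 2.0, 0.5, Z)
    return 2.0 * r * np.sqrt(max(1.0 - x, 0.0))

# Illustrative (hypothetical) values only.
print(max_perturbation_size(T=0.9, c=1.0, M=5.0, d=784, r=1.0))
```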

Appendix B Implementation details

In the experiments, we implemented the \ell_{2} attribution attack adapted from Ghorbani et al. (2019). The attack uses the top-k intersection as its loss function. Following previous works, we choose k=100 for MNIST and k=1000 for CIFAR-10. The PGD-like attack runs for 200 iterations with a step size of 0.1. As mentioned in the main content, we do not run the attack on ImageNet since attribution attacks do not scale to large images. In the following parts of this section, we provide more details of the evaluations in the experiments.
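The following PyTorch sketch (ours, not the authors' implementation) illustrates the attack loop described above; the differentiable surrogate loss, which pushes attribution mass away from the originally top-k features, the \ell_{2} projection details, and the default values of eps are illustrative choices, and the iteration count and step size follow the setting stated above.

```python
# Hedged sketch of an l2, PGD-style attribution attack in the spirit of Ghorbani et al. (2019).
import torch

def saliency(model, x, y, create_graph=False):
    # Gradient of the true-class score with respect to the input.
    score = model(x).gather(1, y.view(-1, 1)).sum()
    return torch.autograd.grad(score, x, create_graph=create_graph)[0]

def topk_attribution_attack(model, x, y, eps=1.0, step_size=0.1, iters=200, k=100):
    x0 = x.clone().detach().requires_grad_(True)
    topk_idx = saliency(model, x0, y).detach().flatten(1).abs().topk(k, dim=1).indices
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        attr = saliency(model, x_adv, y, create_graph=True).flatten(1).abs()
        # Surrogate loss: attribution mass remaining on the originally top-k features.
        loss = attr.gather(1, topk_idx).sum()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            g = grad.flatten(1)
            g = g / (g.norm(dim=1, keepdim=True) + 1e-12)
            x_adv = x_adv - step_size * g.view_as(x_adv)      # descend on the surrogate
            delta = x_adv - x
            norm = delta.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1)))
            x_adv = x + delta * torch.clamp(eps / (norm + 1e-12), max=1.0)  # l2 projection
            x_adv = x_adv.clamp(0.0, 1.0)
    # A full implementation would additionally discard perturbations that change
    # the predicted label, so that the prediction output stays unchanged.
    return x_adv.detach()
```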

B.1 Attribution methods

We used saliency maps (SM) and integrated gradients (IG) in the evaluation sections. These two methods are defined as follows:

  • Saliency maps: \text{SM}(\bm{x})=\frac{\partial f(\bm{x})}{\partial\bm{x}}.

  • Integrated gradients: \text{IG}(\bm{x})=(\bm{x}-\bm{x}^{\prime})\times\int_{\alpha=0}^{1}\frac{\partial f(\bm{x}^{\prime}+\alpha(\bm{x}-\bm{x}^{\prime}))}{\partial\bm{x}}\,d\alpha, where \bm{x}^{\prime} is a baseline input.

SmoothSM and SmoothIG denote the uniformly smoothed versions of SM and IG, respectively.
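For illustration, a minimal PyTorch sketch (ours) of SM, IG and their uniformly smoothed counterparts is given below; the noise is drawn uniformly from the \ell_{2} ball of radius r, and the number of integration steps and noise samples are illustrative choices.

```python
# Hedged sketch of SM, IG, and Monte Carlo uniform smoothing over an l2 ball.
import torch

def sm(model, x, y):
    x = x.detach().requires_grad_(True)
    score = model(x).gather(1, y.view(-1, 1)).sum()
    return torch.autograd.grad(score, x)[0]

def ig(model, x, y, baseline=None, steps=32):
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(xi).gather(1, y.view(-1, 1)).sum()
        total = total + torch.autograd.grad(score, xi)[0]
    return (x - baseline) * total / steps  # Riemann-sum approximation of the path integral

def uniform_l2_noise(shape, r, device):
    # Uniform sample from the l2 ball of radius r: uniform direction, radius density ~ rho^(d-1).
    d = int(torch.tensor(shape[1:]).prod())
    direction = torch.randn(shape, device=device).flatten(1)
    direction = direction / direction.norm(dim=1, keepdim=True)
    radius = r * torch.rand(shape[0], 1, device=device) ** (1.0 / d)
    return (radius * direction).view(shape)

def smoothed(attr_fn, model, x, y, r=1.0, n=50):
    # Monte Carlo estimate of the uniformly smoothed attribution (SmoothSM / SmoothIG).
    acc = torch.zeros_like(x)
    for _ in range(n):
        acc = acc + attr_fn(model, x + uniform_l2_noise(x.shape, r, x.device), y)
    return acc / n
```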

B.2 Evaluation metrics

Given the original attribution g(\bm{x}) and the perturbed attribution g(\tilde{\bm{x}}), we use the top-k intersection, Kendall's rank correlation (Ghorbani et al., 2019) and the cosine similarity (Wang & Kong, 2022) to evaluate their differences; a short sketch of these metrics follows the list.

  • Top-k intersection measures the proportion of the k largest features that overlap between g(\bm{x}) and g(\tilde{\bm{x}}).

  • Kendall's rank correlation measures the proportion of feature pairs that have the same order in g(\bm{x}) and g(\tilde{\bm{x}}): \frac{2}{d(d-1)}\sum_{i=1}^{d}\sum_{j=i+1}^{d}\mathbf{1}_{\{g(\bm{x})_{i}>g(\bm{x})_{j}\}}\mathbf{1}_{\{g(\tilde{\bm{x}})_{i}>g(\tilde{\bm{x}})_{j}\}}.

  • Cosine similarity measures the cosine of the angle between g(\bm{x}) and g(\tilde{\bm{x}}): \frac{g(\bm{x})^{\top}g(\tilde{\bm{x}})}{\|g(\bm{x})\|\|g(\tilde{\bm{x}})\|}.
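A small sketch (ours) of these metrics on flattened attribution vectors is given below; SciPy's Kendall tau is used as a stand-in for the count-based definition above.

```python
# Hedged sketch of the three evaluation metrics on flattened attribution maps g, g_tilde.
import numpy as np
from scipy.stats import kendalltau

def topk_intersection(g, g_tilde, k):
    top_a = set(np.argsort(-np.abs(g))[:k])
    top_b = set(np.argsort(-np.abs(g_tilde))[:k])
    return len(top_a & top_b) / k

def kendall_rank_correlation(g, g_tilde):
    # Standard Kendall tau, used here as a stand-in for the concordance count above.
    tau, _ = kendalltau(g, g_tilde)
    return tau

def cosine_similarity(g, g_tilde):
    return float(g @ g_tilde / (np.linalg.norm(g) * np.linalg.norm(g_tilde) + 1e-12))
```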

B.3 Baseline methods

We compare with the following adversarially and attributionally robust models:

IG-NORM (Chen et al., 2019)

\textmd{CE}(f(\bm{x}),y)+\lambda\max_{\tilde{\bm{x}}\in\mathcal{B}_{\varepsilon}(\bm{x})}\|\textmd{IG}(\bm{x},\tilde{\bm{x}})\|_{1} (37)

TRADES (Zhang et al., 2019)

\textmd{CE}(f(\tilde{\bm{x}}),y)+\beta\,\textmd{KL}(f(\bm{x})\|f(\tilde{\bm{x}})) (38)

IGR (Wang & Kong, 2022)

\textmd{CE}(f(\tilde{\bm{x}}),y)+\beta\,\textmd{KL}(f(\bm{x})\|f(\tilde{\bm{x}}))+\lambda\left(1-\cos(\textmd{IG}(\bm{x}),\textmd{IG}(\tilde{\bm{x}}))\right) (39)

Here CE denotes the cross-entropy loss and KL denotes the Kullback-Leibler divergence.
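For reference, a hedged PyTorch sketch (ours) of the TRADES and IGR objectives in Eqns. (38) and (39) is given below; \tilde{\bm{x}} is assumed to come from an inner maximization that is omitted here, ig_fn stands for an integrated-gradients routine such as the one sketched in Appendix B.1, the values of beta and lam are illustrative, and the inner maximization of IG-NORM in Eqn. (37) is likewise omitted.

```python
# Hedged sketch of the baseline training objectives in Eqns. (38)-(39).
import torch
import torch.nn.functional as F

def trades_loss(model, x, x_tilde, y, beta=6.0):
    # Eqn. (38): CE(f(x_tilde), y) + beta * KL(f(x) || f(x_tilde)).
    ce = F.cross_entropy(model(x_tilde), y)
    kl = F.kl_div(F.log_softmax(model(x_tilde), dim=1),
                  F.softmax(model(x), dim=1), reduction="batchmean")
    return ce + beta * kl

def igr_loss(model, x, x_tilde, y, ig_fn, beta=6.0, lam=1.0):
    # Eqn. (39): the TRADES terms plus a cosine term aligning IG(x) and IG(x_tilde).
    g = ig_fn(model, x, y).flatten(1)
    g_tilde = ig_fn(model, x_tilde, y).flatten(1)
    cos = F.cosine_similarity(g, g_tilde, dim=1).mean()
    return trades_loss(model, x, x_tilde, y, beta) + lam * (1.0 - cos)
```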

Appendix C Additional experiments

C.1 Test on Monte Carlo estimation

Note that the bound given by Theorem 1 is deterministic. In this section, we provide a probabilistic bound for the attribution robustness. Specifically, we want to find the value of t such that Pr(T\leq t)=1-\alpha, where T is defined in Eqn. (6) and \alpha is the significance level. Recall that T is defined as follows:

T=\frac{\|h(\bm{x})\|_{2}}{\sqrt{\|h(\bm{x})\|_{2}^{2}+c}} (40)

where c=M^{2}V_{U}^{2}/V_{S}^{2}. Denoting Q=\|h(\bm{x})\|_{2}, we have

Pr(T\leq t)=Pr\left(\frac{Q}{\sqrt{Q^{2}+c}}\leq t\right)=Pr\left(Q^{2}\leq\frac{ct^{2}}{1-t^{2}}\right) (41)

Note that we used Monte Carlo integration to calculate the integral in h(\bm{x}), which estimates h(\bm{x}) by sampling \bm{\eta} from \mathcal{B}, i.e.,

\hat{h}(\bm{x})=\frac{1}{N}\sum_{i=1}^{N}g(\bm{x}+\bm{\eta}_{i}),\quad\bm{\eta}_{i}\sim\mathcal{B}. (42)

Note that \hat{h}(\bm{x}) is an unbiased estimator of h(\bm{x}), i.e., \mathbb{E}[\hat{h}(\bm{x})]=h(\bm{x}). The estimator converges almost surely to h(\bm{x}) as N\rightarrow\infty, i.e., \lim_{N\rightarrow\infty}\hat{h}(\bm{x})=h(\bm{x}) almost surely. By the Central Limit Theorem, the estimator \hat{h}(\bm{x}) has the following asymptotic distribution,

\hat{h}(\bm{x})\stackrel{a.s.}{\sim}\mathcal{N}(h(\bm{x}),D), (43)

where the covariance matrix D=\text{diag}(\sigma_{ii}^{2}/N) can be estimated from the empirical variances of g(\bm{x}+\bm{\eta}_{i}). Thus, the quadratic form \|\hat{h}(\bm{x})\|_{2}^{2} can be seen as generalized chi-square distributed. We can therefore obtain the cumulative distribution function of the Monte Carlo estimator T_{MC} at t as the cumulative distribution function of this generalized chi-square distribution evaluated at \frac{ct^{2}}{1-t^{2}}, i.e.,

Pr(T_{MC}\leq t)=F\left(\frac{ct^{2}}{1-t^{2}}\right), (44)

where F is the cumulative distribution function of the generalized chi-square distribution constructed from the quadratic form of a Gaussian random variable with mean h(\bm{x}) and covariance D (Davies, 1980; Das & Geisler, 2021). In this work, we use the R package CompQuadForm (Duchesne & De Micheaux, 2010) to compute this cumulative distribution function. For any fixed image sample \bm{x}, we can validate that t_{2}-t_{1} is close to 0 when Pr(t_{1}\leq T_{MC}\leq t_{2})=1-\alpha by solving the two equations below. For a small \alpha=0.01 and N=100{,}000 noise samples, we found that the values of t_{2}-t_{1} are on the order of 10^{-4} for MNIST and CIFAR-10 and 10^{-3} for ImageNet, computed over 10{,}000 samples from each dataset. This validates that the error introduced by the Monte Carlo integration is minute and that the probabilistic bound is close to the deterministic bound.

F\left(\frac{ct_{2}^{2}}{1-t_{2}^{2}}\right)=1-\alpha/2\quad\text{and}\quad F\left(\frac{ct_{1}^{2}}{1-t_{1}^{2}}\right)=\alpha/2. (45)
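As a complement, the sketch below (ours) approximates the same quantity in Python; whereas the paper evaluates the generalized chi-square CDF F with the R package CompQuadForm, here F(ct^{2}/(1-t^{2})) is estimated by Monte Carlo sampling from the Gaussian approximation of \hat{h}(\bm{x}) in Eqn. (43), and the sample count is an illustrative choice.

```python
# Hedged sketch: Monte Carlo approximation of Pr(T_MC <= t) = F(c t^2 / (1 - t^2)).
import numpy as np

def prob_T_leq_t(h_hat, var_diag, c, t, n_mc=200_000, seed=0):
    rng = np.random.default_rng(seed)
    d = h_hat.shape[0]
    # Samples from N(h_hat, D) with D = diag(var_diag); Q^2 is their squared l2 norm.
    samples = h_hat[None, :] + rng.standard_normal((n_mc, d)) * np.sqrt(var_diag)[None, :]
    q2 = np.sum(samples ** 2, axis=1)
    return float(np.mean(q2 <= c * t**2 / (1.0 - t**2)))  # estimate of Pr(T_MC <= t)
```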
Figure 3: Additional visualization of the attribution maps of the (a) original image, (b) IG, (c) Gaussian smoothed IG, and (d) uniformly smoothed IG.

C.2 Additional visualization of the uniformly smoothed attributions

In Figure 1 (left), we have shown that the uniformly smoothed attributions have comparable quality to the original attributions. More examples are provided in Figure 3 to further illustrate the quality of the uniformly smoothed attributions.

C.3 Evaluation of Center Smoothing (Kumar & Goldstein, 2021) on attributions

Table 4: Evaluation of center smoothing on attributions.
\epsilon_{1}    0.1     0.2     0.3     0.4     0.5
SmoothSM        1.207   1.729   1.843   1.907   1.998

To compare with center smoothing (Kumar & Goldstein, 2021), we also implemented that method to certify attributions. Specifically, we compute the bound for SmoothSM on the IG-NORM model using MNIST, and follow the same setting by choosing h=1 and \epsilon_{1}=0.1,0.2,\cdots,0.5. Directly applying center smoothing with the cosine similarity is not possible since the cosine similarity does not satisfy the triangle inequality. Following the relaxation in Sec. 4 of Kumar & Goldstein (2021), a multiplier \gamma=2 is therefore added. In addition, we use 1-\cos\theta, a distance rather than a similarity. The results are shown in Table 4. It can be observed that the upper bound on 1-\cos\theta is greater than 1 for all choices of \epsilon_{1}; such a bound holds trivially since we only consider \cos\theta\in[0,1], i.e., 1-\cos\theta\leq 1. Thus, the upper bound provided by the aforementioned method can be too loose in our setting.

C.4 Evaluation of alternative formulations

In Section 4.3, we introduced two alternative formulations of the proposed method that can be applied in specific scenarios. In this section, we report additional experiments on these two formulations.

Table 5: Maximum allowable perturbation size for different thresholds (T=0.8 and T=0.9) under various choices of the \ell_{2} smoothing radius r, evaluated on MNIST.
T=0.9   \ell_{2} radius (r)   0.5   1.0   1.5   2.0   2.5   3.0   3.5
Standard 0.0389 0.0951 0.1550 0.2164 0.2783 0.3404 0.4029
IG-NORM 0.0394 0.0957 0.1557 0.2170 0.2790 0.3420 0.4067
IGR 0.0390 0.0952 0.1552 0.2174 0.2818 0.3477 0.4163
T=0.8   \ell_{2} radius (r)   0.5   1.0   1.5   2.0   2.5   3.0   3.5
Standard 0.0447 0.1051 0.1691 0.2345 0.3004 0.3664 0.4329
IG-NORM 0.0448 0.1052 0.1692 0.2354 0.3037 0.3733 0.4456
IGR 0.0452 0.1057 0.1697 0.2350 0.3010 0.3680 0.4365
Table 6: Maximum allowable perturbation size for different thresholds (T=0.8 and T=0.9) under various choices of the \ell_{2} smoothing radius r, evaluated on CIFAR-10.
T=0.9   \ell_{2} radius (r)   0.5   1.0   1.5   2.0   2.5   3.0   3.5
Standard 0.0086 0.0469 0.0885 0.1322 0.1773 0.2222 0.2683
IG-NORM 0.0323 0.0705 0.1104 0.1510 0.1923 0.2337 0.2749
IGR 0.0167 0.0545 0.1032 0.1586 0.2150 0.2588 0.2805
T=0.8   \ell_{2} radius (r)   0.5   1.0   1.5   2.0   2.5   3.0   3.5
Standard 0.0128 0.0522 0.0951 0.1402 0.1866 0.2330 0.2868
IG-NORM 0.0343 0.0742 0.1157 0.1580 0.2009 0.2439 0.2867
IGR 0.0237 0.0693 0.1258 0.1861 0.2546 0.3090 0.3559
Table 7: Maximum allowable perturbation size for different thresholds (T=0.8 and T=0.9) under various choices of the \ell_{2} smoothing radius r, evaluated on ImageNet.
\ell_{2} radius (r)   0.5      1.0      1.5      2.0      2.5      3.0      3.5
T=0.9                 0.0046   0.0100   0.0152   0.0295   0.0494   0.0628   0.0768
T=0.8                 0.0058   0.0127   0.0196   0.0369   0.0618   0.0820   0.1040
Table 8: Empirical cosine similarity between original and perturbed smoothed attributions under various choices of the \ell_{2} smoothing radius r, using the perturbation sizes computed in Table 5 (T=0.8).
r   0.5   1.0   1.5   2.0   2.5   3.0   3.5
Standard 0.8636 0.8522 0.8347 0.8127 0.8477 0.8603 0.8310
IG-NORM 0.8308 0.8181 0.8504 0.8728 0.8502 0.8193 0.8199
IGR 0.8231 0.8800 0.8720 0.8603 0.8362 0.8135 0.8567
Table 9: Minimum smoothing radius required to achieve the threshold (T=0.8 and T=0.9) under various choices of the \ell_{2} perturbation size \varepsilon. IG-NORM and IGR are omitted since they are not scalable to ImageNet.
                               MNIST            CIFAR-10         ImageNet
perturbation size (\epsilon)   0.5      1.0     0.5      1.0     0.5       1.0
T=0.9   Standard               5.1902   5.8752  5.9752   7.9504  74.6272   149.2544
        IG-NORM                5.1189   5.7699  5.6860   7.3720  /         /
        IGR                    5.0265   5.6623  5.2895   6.5790  /         /
T=0.8   Standard               3.8927   4.4064  5.7082   7.4164  48.2095   96.4190
        IG-NORM                3.8392   4.3274  5.4875   6.9750  /         /
        IGR                    3.7699   4.2468  5.0287   6.0573  /         /

In Tables 5, 6 and 7, which correspond to MNIST, CIFAR-10 and ImageNet, respectively, we report the computed values of the maximum allowable perturbation size. Under this size constraint, no perturbation can be found by the attacks against uniformly smoothed IG of a given radius such that the cosine similarity between clean and perturbed attributions falls below the given threshold (T=0.8 and T=0.9). The results are consistent with our theory. For larger smoothing radii, the maximum allowable perturbation size is also larger. When the threshold requirement is stricter, the maximum allowable perturbation size is smaller, which means that only weaker attacks can be certified against. The method is also scalable to ImageNet, taking around 15 seconds per sample. Moreover, we also applied attribution attacks using the same radius and the maximum perturbation size \epsilon computed using Eqn. (8). Similar to the experiments in Section 5, we performed 20 attacks on each sample. We found that for all of the 200,000 attacked samples, the cosine similarities between clean and perturbed attributions were higher than the given threshold, confirming that the computed bound is valid (see Table 8).

We also evaluate the third formulation, which gives the minimum smoothing radius required such that, for the given perturbation sizes, the cosine similarity between original and perturbed smoothed attributions remains larger than the given threshold. In Table 9, the computed minimum smoothing radius is reported. Similarly, we observe that the minimum smoothing radius is larger when the threshold requirement is stricter and when the attack is stronger. This is also consistent with our theory. We also notice that the radius for ImageNet is extremely large, which indicates that ImageNet is difficult to defend under such strict threshold requirements.