
Adversarial Training Makes Weight Loss Landscape
Sharper in Logistic Regression

Masanori Yamada1 (Contact Author), Sekitoshi Kanai2, Tomoharu Iwata3, Tomokatsu Takahashi1, Yuki Yamanaka1, Hiroshi Takahashi2, and Atsutoshi Kumagai1,2
1NTT Secure Platform Laboratories
2NTT Software Innovation Center
3NTT Communication Science Laboratories
{masanori.yamada.cm, sekitoshi.kanai.fu, tomoharu.iwata.gy, tomokatsu.takahashi.wd, yuuki.yamanaka.kb, hiroshi.takahashi.bm, atsutoshi.kumagai.ht}@hco.ntt.co.jp
Abstract

Adversarial training is actively studied for learning models that are robust against adversarial examples. A recent study finds that adversarially trained models degrade in generalization performance on adversarial examples when their weight loss landscape, i.e., the change of the loss with respect to the weights, is sharp. It has been experimentally shown that adversarial training sharpens the weight loss landscape, but this phenomenon has not been clarified theoretically. Therefore, we theoretically analyze this phenomenon in this paper. As a first step, we prove that adversarial training with $L_2$ norm constraints sharpens the weight loss landscape in the linear logistic regression model. Our analysis reveals that the sharpness of the weight loss landscape is caused by the noise of adversarial training being aligned with the direction that increases the loss. We theoretically and experimentally confirm that the weight loss landscape becomes sharper as the magnitude of the adversarial training noise increases in the linear logistic regression model. Moreover, we experimentally confirm the same phenomenon in ResNet18 with softmax as a more general case.

1 Introduction

Deep neural networks (DNNs) have been successfully used in a wide range of applications such as image recognition He et al. (2016), speech recognition Wang et al. (2017), and natural language processing Devlin et al. (2019). Although DNNs have high generalization performance, they are vulnerable to adversarial examples, which are imperceptibly perturbed data that make DNNs misclassify Szegedy et al. (2014). For real-world applications of deep learning, DNNs need to be made secure against this vulnerability.

Many methods to defend models against adversarial examples have been presented, such as adversarial training Goodfellow et al. (2015); Kurakin et al. (2017); Madry et al. (2018); Zhang et al. (2019); Wang et al. (2020), adversarial detection Metzen et al. (2017); Huang et al. (2019); Hu et al. (2019); Liu et al. (2019), defensive distillation Papernot et al. (2016), and random smoothing Cohen et al. (2019); Yang et al. (2020); Salman et al. (2019). Among them, adversarial training experimentally achieves high robustness Madry et al. (2018); Zhang et al. (2019); Wang et al. (2020); Wu et al. (2020).

Adversarial training trains models on adversarial examples. Wu et al. (2020) experimentally show that adversarially trained models obtain high generalization performance when they have a flat weight loss landscape, which is the change of the loss with respect to the weights. However, Prabhu et al. (2019) experimentally find that adversarial training sharpens the weight loss landscape more than standard training, which is training without adversarial noise. Thus, the following question is important to answer: Does Adversarial Training Always Sharpen the Weight Loss Landscape? If the answer is yes, adversarial training always has a larger generalization gap than standard training because a sharp weight loss landscape degrades generalization performance.

In this paper, to answer this question theoretically, we focus on the logistic regression model with adversarial training as a first step of theoretical analysis. First, we use the definition of sharpness of the weight loss landscape based on the eigenvalues of the Hessian matrix Keskar et al. (2017). Next, to simplify the analysis, we decompose the adversarial noise into its magnitude and direction. Finally, we show that the eigenvalues of the Hessian matrix of the weight loss landscape are proportional to the norm of the adversarial noise. As a result, we theoretically show that the weight loss landscape becomes sharper as the noise of adversarial training becomes larger, as stated in Theorem 1. We experimentally confirm Theorem 1 on MNIST2, a dataset that restricts MNIST to two classes. Moreover, we experimentally show that in multi-class classification with a nonlinear model (ResNet18) as a more general case, the weight loss landscape also becomes sharper as the noise of adversarial training becomes larger. Finally, to check whether the sharpness of the weight loss landscape is a problem specific to adversarial examples, we compare the weight loss landscape of training with random noise against training with adversarially perturbed data in logistic regression. As a result, we confirm theoretically and experimentally that adversarial noise sharpens the weight loss landscape much more than random noise. This is caused by adversarial noise always being added in the direction that increases the loss. We conclude that the sharpness of the weight loss landscape needs to be reduced to reduce the generalization gap in adversarial training, because the weight loss landscape becomes sharp in adversarial training and the generalization gap becomes large.

Our contributions are as follows:

  • We show theoretically and experimentally that in logistic regression with an $L_2$ norm constraint, the weight loss landscape becomes sharper as the norm of the adversarial training noise increases.

  • We show theoretically and experimentally that adversarial noise in the data space sharpens the weight loss landscape in logistic regression much more than random noise does (random noise does not sharpen it to any extreme degree).

  • We experimentally show that a larger norm of adversarial training noise makes the weight loss landscape sharper in a nonlinear model (ResNet18) with softmax. As a result, the generalization gap becomes larger as the norm of the adversarial noise becomes larger.

2 Preliminary

2.1 Logistic Regression

We consider a binary classification task with ${\bm{x}}\equiv(x_{1},\dots,x_{d})\in\mathbb{R}^{d}$ and $y\in\{-1,1\}$. A data point is represented as ${\bm{x}}^{n}$, where $n$ is the data index, and its true label is $y^{n}$. The loss function of logistic regression is defined as

$$L\left(\bm{x},y,\bm{w}\right)=\frac{1}{N}\sum_{n}\ell\left(\bm{x}^{n},y^{n},\bm{w}\right),\qquad\ell\left(\bm{x}^{n},y^{n},\bm{w}\right)\equiv\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}\right)\right), \tag{1}$$

where $N$ is the total number of data points and ${\bm{w}}\equiv(w_{1},\dots,w_{d})\in\mathbb{R}^{d}$ is the training parameter of the model.
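As a reference point for the analysis below, Eq. (1) can be computed with a few lines of NumPy; this is our own minimal sketch (the function and variable names are ours, not from the paper):

```python
import numpy as np

def logistic_loss(X, y, w):
    """Average logistic loss of Eq. (1).
    X: (N, d) data, y: (N,) labels in {-1, +1}, w: (d,) weights."""
    margins = y * (X @ w)                    # y^n (w . x^n) for each n
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -margins))
```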

2.2 Adversarial Example

An adversarial example $\left({\bm{x}}^{\prime}\right)^{n}={\bm{x}}^{n}+{\bm{\eta}}^{n}$ is defined as

$$\bm{\eta}^{n}=\underset{\bm{\eta}^{n}\in\mathbb{B}_{\rm p}\left(\bm{x}^{n},\varepsilon\right)}{\operatorname{argmax}}\ \ell\left(\bm{x}^{n}+\bm{\eta}^{n},y^{n},\bm{w}\right), \tag{2}$$

where $\mathbb{B}_{\rm p}\left({\bm{x}}^{n},\varepsilon\right)$ is the region around ${\bm{x}}^{n}$ in which the $L_{\rm p}$ norm of the perturbation is at most $\varepsilon$. Projected gradient descent (PGD) Madry et al. (2018), a powerful adversarial attack, uses a multi-step search as

$$\bm{x}^{n}\leftarrow\Pi_{\varepsilon}\left(\bm{x}^{n}+\lambda\,{\rm sign}\left(\frac{\partial}{\partial\bm{x}^{n}}\ell\left(\bm{x}^{n},y^{n},\bm{w}\right)\right)\right), \tag{3}$$

where $\lambda$ is the step size and $\Pi_{\varepsilon}$ is the projection onto the constrained space. Since PGD can be fooled by gradient obfuscation, AutoAttack Croce and Hein (2020), which tries various attacks, has recently been used to evaluate robustness more reliably.
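For concreteness, a minimal single-example sketch of the $L_{\infty}$ PGD update in Eq. (3) for the linear logistic loss might look as follows; this is our own illustration (real implementations batch the computation, use random starts, and also clip to the valid pixel range):

```python
import numpy as np

def linf_pgd(x, y, w, eps, step, iters):
    """L_inf PGD (Eq. 3) on the logistic loss of a linear model (sketch)."""
    x_adv = x.copy()
    for _ in range(iters):
        # d/dx log(1 + exp(-y w.x)) = -y * w / (1 + exp(y w.x))
        grad = -y * w / (1.0 + np.exp(y * np.dot(w, x_adv)))
        x_adv = x_adv + step * np.sign(grad)       # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection Pi_eps
    return x_adv
```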

2.3 Adversarial Training

A promising way to defend the model against adversarial examples is adversarial training Goodfellow et al. (2015); Kurakin et al. (2017); Madry et al. (2018). Adversarial training learns the model with adversarial examples as

$$\min_{{\bm{w}}}\frac{1}{N}\sum_{n}\max_{{\bm{\eta}}^{n}\in\mathbb{B}_{\rm p}\left({\bm{x}}^{n},\varepsilon\right)}\ell\left(\bm{x}^{n}+\bm{\eta}^{n},y^{n},\bm{w}\right). \tag{4}$$

Since adversarial training uses information from adversarial examples, it can improve the robustness of the model against the adversarial attack used during training. In this paper, we refer to models obtained by standard training and by adversarial training as clean and robust models, respectively.
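For the linear logistic regression analyzed in this paper, the inner maximization of Eq. (4) under an $L_2$ constraint has a closed form: $\log\left(1+\exp\left(-y\bm{w}\cdot(\bm{x}+\bm{\eta})\right)\right)$ is maximized over $\|\bm{\eta}\|\leq\varepsilon$ by $\bm{\eta}^{\ast}=-\varepsilon y\bm{w}/\|\bm{w}\|$, i.e., full-norm noise aligned with the direction that increases the loss. The following gradient-descent sketch of Eq. (4) uses this fact; it is our own construction, not code from the paper:

```python
import numpy as np

def adv_train_l2(X, y, eps, lr=0.1, epochs=200):
    """Adversarial training (Eq. 4) for linear logistic regression with the
    closed-form L2 inner maximizer eta* = -eps * y * w / ||w|| (sketch)."""
    N, d = X.shape
    w = 1e-3 * np.ones(d)                              # small init so w/||w|| exists
    for _ in range(epochs):
        w_dir = w / np.linalg.norm(w)
        X_adv = X - eps * y[:, None] * w_dir[None, :]  # worst-case inputs
        sig = 1.0 / (1.0 + np.exp(y * (X_adv @ w)))    # sigmoid(-margin)
        w -= lr * (-(sig * y) @ X_adv / N)             # gradient step on Eq. (4)
    return w
```

By Danskin's theorem, the gradient of the inner maximum equals the loss gradient evaluated at the maximizer, so the outer update can treat the adversarial examples as fixed.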

2.4 Visualizing Loss Landscape

Li et al. (2018) presented filter normalization for visualizing the weight loss landscape. Filter normalization works not only for standard training but also for adversarial training Wu et al. (2020). With adversarial training, filter normalization visualizes the change of the loss when noise is added to the weights as

$$g\left(\alpha\right)=\max_{{\bm{\eta}}^{n}\in\mathbb{B}_{\rm p}\left({\bm{x}}^{n},\varepsilon\right)}\frac{1}{N}\sum_{n}\ell\left({\bm{x}}^{n}+{\bm{\eta}}^{n},y^{n},{\bm{w}}+\alpha{\bm{h}}\right), \tag{5}$$

where $\alpha$ is the magnitude of the noise and ${\bm{h}}\in\mathbb{R}^{d}$ is its direction. The direction ${\bm{h}}$ is sampled from a Gaussian distribution and filter-wise normalized by

$${\bm{h}}^{\left(l,m\right)}\leftarrow\frac{{\bm{h}}^{\left(l,m\right)}}{\left\|{\bm{h}}^{\left(l,m\right)}\right\|_{\rm F}}\left\|{\bm{w}}^{\left(l,m\right)}\right\|_{\rm F}, \tag{6}$$

where ${\bm{h}}^{(l,m)}$ is the $m$-th filter at the $l$-th layer of ${\bm{h}}$ and $\left\|\cdot\right\|_{\rm F}$ is the Frobenius norm. We note that the Frobenius norm is equal to the $L_2$ norm in logistic regression because ${\bm{w}}$ can be regarded as $\bm{w}\in\mathbb{R}^{1\times d}$. This normalization removes the scaling freedom of the weights and thereby avoids misleading loss landscapes Li et al. (2018). For instance, Dinh et al. (2017) exploit this scaling freedom to build pairs of equivalent networks with different apparent sharpness. Filter normalization absorbs the scaling freedom of the weights and enables the loss landscape to be visualized faithfully. With this normalization, the sharpness of the weight loss landscape strongly correlates with the generalization gap Li et al. (2018), and in adversarial training it strongly correlates with the robust generalization gap Wu et al. (2020).
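A minimal sketch of Eqs. (5) and (6) for the logistic regression case, where the whole weight vector acts as a single filter and filter normalization reduces to rescaling a random direction to the norm of $\bm{w}$ (the names are ours; for adversarial training, loss_fn should evaluate the loss on adversarial examples, e.g., generated with the attack sketched in Section 2.2):

```python
import numpy as np

def loss_landscape_1d(X, y, w, loss_fn, alphas):
    """g(alpha) of Eq. (5) along a filter-normalized direction (Eq. 6)."""
    h = np.random.randn(*w.shape)
    h = h / np.linalg.norm(h) * np.linalg.norm(w)   # filter normalization
    return [loss_fn(X, y, w + a * h) for a in alphas]
```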

3 Theoretical Analysis

Since the robust model is trained with added adversarial noise, the data loss landscape becomes flat. On the other hand, Wu et al. (2020) experimentally confirmed that in adversarial training the weight loss landscape becomes sharp and the generalization gap becomes large. In this section, we theoretically show that the weight loss landscape becomes sharp in adversarial training.

3.1 Definition of Weight Loss Landscape Sharpness

Several definitions of the sharpness/flatness of the weight loss landscape have been presented in studies of generalization performance and loss landscape sharpness in deep learning Hochreiter and Schmidhuber (1997); Keskar et al. (2017); Chaudhari et al. (2017). In this paper, we use a simple definition based on the eigenvalues of the Hessian matrix, presented in Keskar et al. (2017) and also used in Li et al. (2018). To clarify the relationship between the weight loss landscape and the generalization gap, we apply filter normalization to the weights. In other words, we define the sharpness of the weight loss landscape as the eigenvalues of $\frac{\partial^{2}L}{\partial w_{i}\partial w_{j}}$ with the normalized weights: the larger the eigenvalues, the sharper the weight loss landscape.

3.2 Main Results

This section presents Theorem 1, which provides the relation between the weight loss landscape and the norm of adversarial noise.

Theorem 1

When the loss of the linear logistic regression model converges to the minimum for each data point ($\frac{\partial}{\partial w_{i}}\ell\left({\bm{x}}^{n}+{\bm{\eta}},y^{n},{\bm{w}}\right)=0$), the weight loss landscape becomes sharper for robust models trained with a larger norm of adversarial noise constrained in the $L_2$ norm.

Let us prove Theorem 1. For the special case of linear logistic regression, Eq. (1) can be written with the adversarial noise $\eta_{i}^{n}$ and the angle $\theta^{n}$ as

$$L_{\eta}=\frac{1}{N}\sum_{n}\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\left(\bm{x}^{n}+\bm{\eta}^{n}\right)\right)\right)=\frac{1}{N}\sum_{n}\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)\right), \tag{7}$$

where $\left\|{\bm{w}}\right\|$ and $\left\|{\bm{\eta}}^{n}\right\|$ are $L_2$ norms arising from the inner product. Since an adversarial example must increase the loss, Lemma 1 implies $y^{n}\cos\theta^{n}=1$.

Lemma 1

$L_{\eta}$ is a monotonically increasing function of $\cos\theta^{n}$ when $y^{n}=1$, while $L_{\eta}$ is a monotonically decreasing function when $y^{n}=-1$.

All the proofs of lemmas are provided in the supplementary material.

Substituting $y^{n}\cos\theta^{n}=1$ into Eq. (7), the gradient of the loss with respect to the weights, $\frac{\partial L_{\eta}}{\partial w_{i}}$, is

$$\frac{1}{N}\sum_{n}\frac{-y^{n}x_{i}^{n}+\frac{w_{i}}{\left\|\bm{w}\right\|}\left\|\bm{\eta}^{n}\right\|}{1+\exp\left(y^{n}\bm{w}\cdot\bm{x}^{n}-\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\right)}. \tag{8}$$

An optimal weight ${\bm{w}}^{\ast}$ of $L_{\eta}$ for each data point satisfies $\frac{w_{i}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|}=\frac{y^{n}x_{i}^{n}}{\left\|{\bm{\eta}}^{n}\right\|}$. Next, we consider the Hessian matrix of the loss at an optimal weight. The $(i,j)$-th element of the Hessian matrix is obtained as

$$\left.\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}\right|_{{\bm{w}}={\bm{w}}^{\ast}}=\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right). \tag{9}$$

See the supplementary material for the derivation. We now consider the eigenvalues of the Hessian matrix. This matrix has trivial eigenvectors:

$$\sum_{j}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)v_{j}=v_{i}, \tag{10}$$
$$\sum_{j}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)w_{j}^{\ast}=0, \tag{11}$$

where $v_{j}$ is an element of ${\bm{v}}$, an arbitrary vector orthogonal to ${\bm{w}}^{\ast}$. Hence this matrix has eigenvalues $1$ and $0$. Since $\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)\in\mathbb{R}^{d\times d}$ is positive semidefinite, the factor $\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}$ determines the sharpness of the weight loss landscape. Let $\left\|{\bm{\eta}}_{a}\right\|\geq\left\|{\bm{\eta}}_{b}\right\|$ be the norms of the adversarial noise for perturbation strengths $\varepsilon_{a}\geq\varepsilon_{b}$, and let ${\bm{w}}_{a}^{\ast}$ and ${\bm{w}}_{b}^{\ast}$ be the optimal weights in adversarial training with $\left\|{\bm{\eta}}_{a}\right\|$ and $\left\|{\bm{\eta}}_{b}\right\|$, respectively. The relation between the nonzero eigenvalues is

$$\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}_{a}\right\|}{\left\|{\bm{w}}_{a}^{\ast}\right\|}\geq\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}_{b}\right\|}{\left\|{\bm{w}}_{b}^{\ast}\right\|}, \tag{12}$$

since filter normalization makes the scales of the weights the same, $\left\|{\bm{w}}_{a}^{\ast}\right\|\approx\left\|{\bm{w}}_{b}^{\ast}\right\|$ (in particular, in logistic regression this is natural because $\left\|{\bm{w}}\right\|_{\rm F}=\left\|{\bm{w}}\right\|$). Therefore, Theorem 1 is proved by Eq. (12). Moreover, since the eigenvalues are never negative, the loss at the optimum of adversarial training is always convex, which is a natural result.
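Eq. (9) can be checked numerically. At the per-sample optimum the Hessian is a scaled projector, $\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(I-\frac{{\bm{w}}^{\ast}{\bm{w}}^{\ast\top}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)$, whose nonzero eigenvalues all equal the scale factor. A small sketch of ours, under the assumption that every adversarial noise has full norm $\varepsilon$ (as the closed-form $L_2$ maximizer gives):

```python
import numpy as np

d, eps = 5, 0.5
w = np.random.randn(d)
P = np.eye(d) - np.outer(w, w) / np.dot(w, w)   # projector in Eq. (9)
H = (eps / (2.0 * np.linalg.norm(w))) * P       # Hessian when ||eta^n|| = eps
eigs = np.sort(np.linalg.eigvalsh(H))[::-1]
# The top d-1 eigenvalues all equal eps / (2 ||w||); doubling eps doubles
# them, which is exactly the linear sharpening stated in Theorem 1.
print(eigs)
```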

Random Noise

To clarify whether the sharpening of the weight loss landscape by noise in the data space is a phenomenon unique to adversarial training, we consider the weight loss landscape of a model trained with random noise. Let us assume uniformly distributed noise. We have the following theorem:

Theorem 2

Consider the logistic regression model trained with uniformly distributed noise constrained in the $L_{\infty}$ norm by $\varepsilon$, i.e., $\eta_{i}\sim U\left(-\varepsilon,\varepsilon\right)$. The eigenvalues of the Hessian matrix of this model converge to 0 as the loss converges to the minimum for each data point ($\frac{\partial}{\partial w_{i}}\ell\left({\bm{x}}^{n}+{\bm{\eta}},y^{n},{\bm{w}}\right)=0$). Furthermore, when the loss deviates slightly from the minimum, the eigenvalues of the Hessian matrix grow with the norm of the random noise.

This theorem shows that the weight loss landscape of the model trained with random noise is not as sharp as that of the adversarially trained model.

Let us prove Theorem 2. The derivative of the loss with arbitrary noise is

$$\frac{\partial L_{\eta}}{\partial w_{i}}=-\frac{1}{N}\sum_{n}\frac{y^{n}\left(g^{n}-1\right)\left(x_{i}^{n}+\eta_{i}^{n}\right)}{g^{n}}, \tag{13}$$

where $g^{n}\equiv 1+\exp\left(-y^{n}\bm{w}\cdot\left(\bm{x}^{n}+\bm{\eta}^{n}\right)\right)$. The $(i,j)$-th element of the Hessian matrix is obtained as

$$\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}=\frac{1}{N}\sum_{n}\frac{g^{n}-1}{\left(g^{n}\right)^{2}}\left(x_{i}^{n}+\eta_{i}^{n}\right)\left(x_{j}^{n}+\eta_{j}^{n}\right), \tag{14}$$

where we used $\left(y^{n}\right)^{2}=1$. Since the derivative of $\ell$ is zero at the optimal weight, the numerator of Eq. (13) becomes zero as $\kappa^{n}\equiv g^{n}-1\rightarrow 0$. Thus, the eigenvalues of the Hessian matrix become 0 when $g^{n}=1$. In other words, the weight loss landscape of the model trained with random noise becomes flatter as the loss converges to the minimum. Let us consider the case where $\kappa^{n}$ is sufficiently small but takes a finite value independent of ${\bm{\eta}}$. Since we assume uniformly distributed noise, ${\bm{x}}$ and ${\bm{\eta}}$ are uncorrelated and the mean of ${\bm{\eta}}$ is zero, so $\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n}\eta_{i}^{n}x_{j}^{n}=0$, and the variance of the noise is $\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n}\eta_{i}^{n}\eta_{j}^{n}=\frac{\varepsilon^{2}}{3}\delta_{ij}$. Thus, $\lim_{\kappa^{n}\rightarrow 0}\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}$ for large $N$ can be written as

$$\frac{1}{N}\sum_{n}\left(x_{i}^{n}x_{j}^{n}+\frac{\varepsilon^{2}}{3}\delta_{ij}\right)\lim_{\kappa^{n}\rightarrow 0}\frac{\kappa^{n}}{\left(\kappa^{n}+1\right)^{2}}=0. \tag{15}$$

Thus, when $\kappa^{n}$ is sufficiently small but takes a finite value independent of ${\bm{\eta}}$, the weight loss landscape becomes sharper along with the norm of the random noise. We note that since $\delta_{ij}$ is an element of the identity matrix, it increases the eigenvalues of the Hessian matrix. Considering Theorems 1 and 2, adversarial noise sharpens the weight loss landscape much more than training with random noise does. We have considered random noise constrained in the $L_{\infty}$ norm, but this statement holds whenever the mean of the random noise is zero and its variance increases with the noise norm. For example, for Gaussian noise $\bm{\eta}\sim N\left(0,\sigma I\right)$, the same conclusion holds when the noise is constrained in the $L_2$ norm.
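The two ingredients of this argument are easy to check by Monte Carlo: for $\eta_{i}\sim U(-\varepsilon,\varepsilon)$ the cross term $\frac{1}{N}\sum_{n}\eta_{i}^{n}x_{j}^{n}$ vanishes and the noise covariance approaches $\frac{\varepsilon^{2}}{3}\delta_{ij}$. A quick sketch of ours, with an arbitrary dimension and sample count:

```python
import numpy as np

N, d, eps = 100_000, 4, 0.3
rng = np.random.default_rng(0)
eta = rng.uniform(-eps, eps, size=(N, d))   # L_inf-bounded uniform noise
x = rng.normal(size=(N, d))                 # stand-in for uncorrelated data

print(eta.T @ x / N)     # cross term, approaches the zero matrix
print(eta.T @ eta / N)   # approaches (eps**2 / 3) * I = 0.03 * I here
```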

4 Related Work

4.1 Weight Loss Landscape in Adversarial Training

Prabhu et al. (2019) and Wu et al. (2020) experimentally studied the weight loss landscape in adversarial training. Wu et al. (2020) demonstrated that models with flatter weight loss landscapes have smaller generalization gaps and presented Adversarial Weight Perturbation (AWP), which consequently achieves higher robust accuracy. Prabhu et al. (2019) reported that the robust model has a sharper loss landscape than the clean model. However, these studies did not theoretically analyze the weight loss landscape. Recently, Liu et al. (2020) theoretically analyzed the loss landscape of robust models. The main topic of Liu et al. (2020) is Lipschitzian smoothness in adversarial training, and their supplementary material contains a theoretical analysis of the weight loss landscape in logistic regression. The difference between their analysis and ours is that Liu et al. (2020) compared the weight loss landscape at different $x$ positions within a single model ($L\left({\bm{x}},y,{\bm{w}}\right)$ and $L\left({\bm{x}}+{\bm{\eta}},y,{\bm{w}}\right)$), while we compare the weight loss landscapes of different models trained with different magnitudes of adversarial noise ($L\left({\bm{x}}+{\bm{\eta}}_{a},y,{\bm{w}}^{\ast}_{a}\right)$ and $L\left({\bm{x}}+{\bm{\eta}}_{b},y,{\bm{w}}^{\ast}_{b}\right)$, where ${\bm{w}}^{\ast}_{a}$ and ${\bm{w}}^{\ast}_{b}$ denote the respective optimal weights). Also, Liu et al. (2020) used an approximation of the loss function, $L_{\eta}=\frac{1}{N}\sum_{n}y^{n}\log\left(1+\exp\left(-y^{n}{\bm{w}}\cdot{\bm{x}}^{n}+{\bm{\eta}}^{n}\right)\right)$, to simplify the problem. In contrast, we do not approximate the loss function, as shown in Eq. (7). Liu et al. (2020) derived that since the logistic loss is a monotonically decreasing convex function, adding noise in the direction of increasing the loss must increase the gradient, which leads to a sharp weight loss landscape.

4.2 Weight Loss Landscape in Standard Training

The relationship between the weight loss landscape and generalization performance in deep learning has been investigated theoretically and experimentally Foret et al. (2020); Jiang et al. (2020); Keskar et al. (2017); Dinh et al. (2017). In a large experiment evaluating 40 complexity measures, Jiang et al. (2020) found that measures based on the weight loss landscape had the highest correlation with the generalization error, and that generalization performance improves as the weight loss landscape becomes flatter. To improve generalization performance by flattening the weight loss landscape, several methods have been presented, such as optimizing a diffused loss landscape Mobahi (2016), local entropy regularization Chaudhari et al. (2017), and Sharpness-Aware Minimization (SAM) Foret et al. (2020), an optimizer that searches for a flat weight loss landscape. In particular, since SAM modifies the optimizer, it can be combined with various methods, and it achieved strong experimental results that advanced the state of the art on many datasets, including CIFAR10, CIFAR100, and ImageNet. Since the weight loss landscape becomes sharper and the generalization gap larger in adversarial training than in standard training, we believe that finding a flat solution is even more important in adversarial training than in standard training.

5 Experiments

To verify the validity of Theorem 1, we visualize the sharpness of the weight loss landscape with various noise magnitudes in logistic regression in Section 5.1. Next, to investigate whether the sharpness of the weight loss landscape is a problem peculiar to adversarial training, we compare training with random noise against training with adversarial noise in Section 5.1. Finally, we visualize the weight loss landscape in a more general case (multi-class classification by softmax with deep learning) in Section 5.2. We also show the relationship between the weight loss landscape and the generalization gap.

Experimental setup

We provide details of the experimental conditions in the supplementary material. We used three datasets: MNIST2, CIFAR10 Krizhevsky et al. (2009), and SVHN Netzer et al. (2011). MNIST2 is explained in the subsection below. We used PGD for adversarial training and for evaluating the robust generalization gap. AutoAttack should be used for robustness evaluation; however, we did not use it in our experiments since we focus on the generalization gap. For visualizing the weight loss landscape, we used the filter normalization introduced in Section 2.4. The hyper-parameter settings for PGD were based on Madry et al. (2018) for MNIST2 and CIFAR10. Since Madry et al. (2018) did not experiment on SVHN, we based the SVHN hyper-parameters on Wu et al. (2020). The $L_2$ norm of the perturbation was set to $\varepsilon\in\{0,0.2,0.4,0.6,0.8,1\}$ for MNIST2. The $L_{\infty}$ norm of the perturbation was set to $\varepsilon\in\{\frac{0}{255},\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255},\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$ for MNIST2 and $\varepsilon\in\{\frac{0}{255},\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255}\}$ for CIFAR10 and SVHN. For PGD on MNIST2, we updated the perturbation for 40 iterations at training time and 100 iterations at evaluation time; the step size of $L_2$ PGD is $0.15\times\varepsilon$, while the step size of $L_{\infty}$ PGD is $0.01$. For CIFAR10 and SVHN, we updated the perturbation for 7 iterations at training time and 20 iterations at evaluation time; the step size of $L_{\infty}$ PGD is $\frac{2}{255}$ for CIFAR10 and $\frac{1}{255}$ for SVHN. For random noise, we used uniformly distributed noise constrained in the $L_2$ or $L_{\infty}$ norm.

Figure 1: Weight Loss Landscape against $L_2$ PGD and $L_2$ RANDOM in MNIST2
Figure 2: Weight Loss Landscape against $L_{\infty}$ PGD and $L_{\infty}$ RANDOM in MNIST2
Figure 3: Eigenvalues of the Hessian matrix against PGD in MNIST2: (a) $L_2$ PGD, (b) $L_{\infty}$ PGD
Table 1: Robust accuracy (%) against $L_2$ PGD in MNIST2
$\varepsilon$ train acc test acc gap
0.00 99.89 99.95 -0.06
0.20 99.79 99.95 -0.16
0.40 99.75 99.86 -0.11
0.60 99.76 99.76 -0.00
0.80 99.64 99.76 -0.12
1.00 99.49 99.62 -0.13
Table 2: Robust accuracy (%) against $L_2$ RANDOM in MNIST2
$\varepsilon$ train acc test acc gap
0.00 99.89 99.95 -0.06
0.20 99.89 99.95 -0.06
0.40 99.89 99.95 -0.06
0.60 99.89 99.95 -0.06
0.80 99.89 99.95 -0.06
1.00 99.88 99.95 -0.07
Table 3: Robust accuracy (%) against $L_{\infty}$ PGD in MNIST2
$\varepsilon$ train acc test acc gap
0/255 99.89 99.91 -0.02
4/255 99.79 99.95 -0.16
8/255 99.74 99.86 -0.12
12/255 99.79 99.76 0.02
16/255 99.72 99.76 -0.05
25.5/255 99.46 99.67 -0.21
51/255 97.84 98.53 -0.70
76.5/255 95.40 96.64 -1.24
Table 4: Robust accuracy (%) against $L_{\infty}$ RANDOM in MNIST2
$\varepsilon$ train acc test acc gap
0/255 99.89 99.91 -0.02
4/255 99.89 99.91 -0.02
8/255 99.89 99.91 -0.02
12/255 99.89 99.91 -0.02
16/255 99.89 99.91 -0.02

5.1 Binary Logistic Regression

We conducted experiments on the image dataset MNIST LeCun et al. (1998), which is widely used in adversarial example settings. Since the linear logistic regression model did not perform well in classifying two-class CIFAR10, we evaluated on MNIST. To experiment with logistic regression for binary classification, we created a two-class dataset, MNIST2, from MNIST, using only class 0 and class 1. Figures 1 and 2 show the weight loss landscape with various noise magnitudes in the $L_2$ and $L_{\infty}$ norms in logistic regression. We can confirm that the weight loss landscape becomes sharper as the noise magnitude increases. For the sake of clarity, we have excluded the large values $\varepsilon\in\{\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$ in $L_{\infty}$ from Fig. 2. The results for the large $\varepsilon\in\{\frac{0}{255},\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$ in the $L_{\infty}$ norm, which is often used in 10-class classification on MNIST, are included in the supplementary material; they are similar: the larger the noise magnitude, the sharper the weight loss landscape. Table 1 also shows that the absolute value of the generalization gap is larger in adversarial training than in standard training ($\varepsilon=0$). The test robust accuracy is larger than the training accuracy because of early stopping Rice et al. (2020), but we emphasize that the training and test accuracies diverge. In Table 3, the relationship between the generalization gap and $\varepsilon$ is difficult to see because the experiments were designed with small noise so that the training loss reaches zero. However, in the range where $\varepsilon$ is large in Table 3, the absolute value of the generalization gap increases as $\varepsilon$ increases.
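Constructing MNIST2 is simple class filtering; the following is a hedged sketch using torchvision (the label mapping to $\{-1,+1\}$ and the scaling to $[0,1]$ are our assumptions about preprocessing, not details specified by the paper):

```python
import torch
from torchvision import datasets

def mnist2(root="./data", train=True):
    """MNIST2: the subset of MNIST with classes 0 and 1 only,
    labels mapped to {-1, +1} for binary logistic regression."""
    ds = datasets.MNIST(root, train=train, download=True)
    mask = (ds.targets == 0) | (ds.targets == 1)
    X = ds.data[mask].float().reshape(-1, 784) / 255.0  # flatten, scale to [0, 1]
    y = ds.targets[mask].float() * 2.0 - 1.0            # 0 -> -1, 1 -> +1
    return X, y
```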

Figure 4: Weight Loss Landscape against $L_{\infty}$ PGD in ResNet18: (a) CIFAR10, (b) SVHN
Table 5: Robust accuracy (%) against $L_{\infty}$ PGD in CIFAR10
$\varepsilon$ train acc test acc gap
0/255 99.00 93.89 5.11
4/255 81.90 69.73 12.17
8/255 63.81 50.98 12.83
12/255 50.83 36.69 14.14
16/255 47.56 25.61 21.95
Table 6: Robust accuracy (%) against $L_{\infty}$ PGD in SVHN
$\varepsilon$ train acc test acc gap
0/255 100.00 96.13 3.87
4/255 83.53 73.51 10.01
8/255 70.96 51.82 19.14
12/255 64.58 36.93 27.65
16/255 60.93 29.59 31.34

Eigenvalue of Hessian matrix

We analyze the eigenvalues of the Hessian matrix to confirm Eq. (9). We show the results of the linear logistic regression model on MNIST2. Figure 3 shows the top three eigenvalues of the models trained with $L_2$ PGD and $L_{\infty}$ PGD for different values of $\varepsilon$. These figures show that the eigenvalues are nearly linear in $\varepsilon$, as in our theoretical analysis. We estimated the eigenvalues without filter normalization, but we checked that the norm of the weights does not change significantly across $\varepsilon$.
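For a linear model the Hessian is only $d\times d$ and can be formed exactly from Eq. (14), with $\bm{x}^{n}+\bm{\eta}^{n}$ being the adversarial examples; the following is our own sketch of how the top eigenvalues in Figure 3 could be computed, not the paper's code:

```python
import numpy as np

def hessian_top_eigs(X_adv, y, w, k=3):
    """Top-k eigenvalues of the logistic-loss Hessian (Eq. 14)
    evaluated on adversarial examples X_adv at weights w."""
    N = X_adv.shape[0]
    g = 1.0 + np.exp(-y * (X_adv @ w))           # g^n of Eq. (14)
    coef = (g - 1.0) / g**2                      # per-sample curvature weight
    H = (X_adv * coef[:, None]).T @ X_adv / N    # (1/N) sum coef * x x^T
    return np.linalg.eigvalsh(H)[-k:][::-1]      # largest k, descending
```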

Comparison with Random Noise

To clarify whether the sharpening of the weight loss landscape by noise in the data space is a phenomenon unique to adversarial training, we compare the weight loss landscape of training with random noise against training with adversarial noise (adversarial training with PGD). We used a logistic regression model on MNIST2; random noise was generated from a uniform distribution constrained in the $L_2$ norm with $\varepsilon\in\{0,0.2,0.4,0.8,1\}$ or the $L_{\infty}$ norm with $\varepsilon\in\{0,\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255}\}$, and then clipped together with the image so that the perturbed image fits in the normalized range $[0,1]$. Figures 1 and 2 compare the weight loss landscape under adversarial noise with that under random noise in the $L_2$ and $L_{\infty}$ norms, respectively. As in our theoretical analysis, these figures show that adversarial noise sharpens the weight loss landscape much more than training with random noise does for both norms. As a result, the generalization gap with random noise is smaller than that with adversarial noise, as shown in Tables 2 and 4.

5.2 Multi-class Classification

In the more general case of multi-class classification using softmax with a residual network He et al. (2016), we confirm that the weight loss landscape becomes sharper in adversarial training, as in logistic regression. We use ResNet18 He et al. (2016) with softmax on CIFAR10 and SVHN. Figure 4 shows the weight loss landscape for $\varepsilon\in\{\frac{0}{255},\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255}\}$. We confirmed that the weight loss landscape becomes sharper as the noise magnitude of adversarial training becomes larger. Tables 5 and 6 also show that the generalization gap becomes larger in most cases as the magnitude of the noise becomes larger.

6 Conclusion and Future work

In this paper, we showed theoretically and experimentally that the weight loss landscape of the linear logistic regression model becomes sharper as the noise of adversarial training becomes stronger. In linear logistic regression, we also showed, both theoretically and experimentally, that not all data-space noise makes the landscape extremely sharp, but adversarial examples do. Theoretical analysis of more general nonlinear models (such as residual networks) with softmax is future work. To motivate it, we experimentally showed that the weight loss landscape becomes sharper as the noise of adversarial training becomes larger. We conclude that the sharpness of the weight loss landscape needs to be reduced to reduce the generalization gap in adversarial training, because the weight loss landscape becomes sharp in adversarial training and the generalization gap becomes large.

Appendix A Proof of Lemma 1

We show that the loss is monotonically increasing in $\cos\theta^{n}$ when $y^{n}=1$ and monotonically decreasing when $y^{n}=-1$:
$$\frac{\partial}{\partial\left(\cos\theta^{n}\right)}\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)\right)=\frac{y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)}{1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)}.$$
Since the exponential function always takes a positive value, the sign of this derivative is the sign of $y^{n}$, which proves Lemma 1.

Appendix B Hessian matrix on the optimal weight

We compute the Hessian matrix of the adversarially trained logistic regression model. The loss function is

$$L\left(\bm{x},y,\bm{w}\right)=\frac{1}{N}\sum_{n}\ell\left(\bm{x}^{n},y^{n},\bm{w}\right),\qquad\ell\left(\bm{x}^{n},y^{n},\bm{w}\right)\equiv\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}\right)\right).$$

The Hessian matrix of the loss with respect to the weights, $\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}$, follows from the product rule applied to Eq. (8):

$$\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}=\frac{1}{N}\sum_{n}\left[\frac{\frac{\partial}{\partial w_{j}}\left(-y^{n}x_{i}^{n}+\frac{w_{i}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)}{1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}+\left(-y^{n}x_{i}^{n}+\frac{w_{i}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)\frac{\partial}{\partial w_{j}}\frac{1}{1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}\right].$$

Using $\frac{\partial}{\partial w_{j}}\frac{w_{i}}{\left\|{\bm{w}}\right\|}=\frac{1}{\left\|{\bm{w}}\right\|}\left(\delta_{ij}-\frac{w_{i}w_{j}}{\left\|{\bm{w}}\right\|^{2}}\right)$, this becomes

$$\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}=\frac{1}{N}\sum_{n}\left[\frac{\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}\right\|}\left(\delta_{ij}-\frac{w_{i}w_{j}}{\left\|{\bm{w}}\right\|^{2}}\right)}{1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}+\frac{\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}{\left(1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)\right)^{2}}\left(y^{n}x_{i}^{n}-\frac{w_{i}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)\left(y^{n}x_{j}^{n}-\frac{w_{j}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)\right].$$

The optimal weight condition $\frac{w_{i}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|}=\frac{y^{n}x_{i}^{n}}{\left\|{\bm{\eta}}^{n}\right\|}$ eliminates the second term. The first term becomes

$$\left.\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}\right|_{{\bm{w}}={\bm{w}}^{\ast}}=\frac{1}{N}\sum_{n}\frac{\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)}{1+\exp\left(y^{n}{\bm{w}}^{\ast}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}^{\ast}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}=\frac{1}{N}\sum_{n}\frac{\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)}{1+\exp\left(\sum_{i}w_{i}^{\ast}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}w_{i}^{\ast}-\left\|{\bm{w}}^{\ast}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}=\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right),$$
where the exponent vanishes because $\sum_{i}w_{i}^{\ast}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}w_{i}^{\ast}=\left\|{\bm{w}}^{\ast}\right\|\left\|{\bm{\eta}}^{n}\right\|$.
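This closed form can also be verified by finite differences. The optimality condition $\frac{w_{i}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|}=\frac{y^{n}x_{i}^{n}}{\left\|{\bm{\eta}}^{n}\right\|}$ implies $\left\|{\bm{\eta}}^{n}\right\|=\left\|{\bm{x}}^{n}\right\|$ and ${\bm{w}}^{\ast}\propto y^{n}{\bm{x}}^{n}$ for a single data point, so an optimum can be constructed directly; the following check is our own sketch:

```python
import numpy as np

def loss(w, x, y, eta_norm):
    # single-sample loss after substituting y cos(theta) = 1 in Eq. (7)
    return np.log1p(np.exp(-y * np.dot(w, x) + np.linalg.norm(w) * eta_norm))

d, h = 4, 1e-5
rng = np.random.default_rng(1)
x, y = rng.normal(size=d), 1.0
eta_norm = np.linalg.norm(x)        # optimality requires ||eta|| = ||x||
w = 0.7 * y * x                     # then w/||w|| = y x / ||eta||

H = np.zeros((d, d))
E = np.eye(d) * h                   # E[i] is a step of size h along axis i
for i in range(d):
    for j in range(d):
        H[i, j] = (loss(w + E[i] + E[j], x, y, eta_norm)
                   - loss(w + E[i], x, y, eta_norm)
                   - loss(w + E[j], x, y, eta_norm)
                   + loss(w, x, y, eta_norm)) / h**2

P = np.eye(d) - np.outer(w, w) / np.dot(w, w)
# Matches the derived Hessian ||eta|| / (2 ||w||) * P up to truncation error.
print(np.allclose(H, eta_norm / (2.0 * np.linalg.norm(w)) * P, atol=1e-3))
```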

Appendix C Experimental Details

In this appendix, we describe the experimental details. We use five image datasets: MNIST LeCun et al. (1998), MNIST2, CIFAR10 Krizhevsky et al. (2009), CIFAR2, and SVHN Netzer et al. (2011). MNIST2 and CIFAR2 are our own datasets for binary classification: MNIST2 uses only MNIST classes 0 and 1, and CIFAR2 uses only the frog and ship classes of CIFAR10. We chose the frog and ship classes because they have the highest classification accuracy. For CIFAR10 and CIFAR2, the model architecture, data standardization, and PGD attack parameters were set as in Madry et al. (2018). For SVHN, they were set as in Wu et al. (2020).

Appendix D Additional Experiments

D.1 MNIST2 with the larger magnitude of noise

Figure 5: Weight Loss Landscape against $L_{\infty}$ PGD in MNIST2

Figure 5 shows the results of the experiment using the noise magnitude range $\varepsilon\in\{0,\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$, which is commonly used for 10-class classification on MNIST.

References

  • Chaudhari et al. [2017] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In ICLR, 2017.
  • Cohen et al. [2019] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, pages 1310–1320, 2019.
  • Croce and Hein [2020] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. ICML, 2020.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Dinh et al. [2017] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In ICML, 2017.
  • Foret et al. [2020] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
  • Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
  • Hu et al. [2019] Shengyuan Hu, Tao Yu, Chuan Guo, Wei-Lun Chao, and Kilian Q Weinberger. A new defense against adversarial images: Turning a weakness into a strength. In NeurIPS, volume 32, pages 1635–1646, 2019.
  • Huang et al. [2019] Bo Huang, Yi Wang, and Wei Wang. Model-agnostic adversarial detection by random perturbations. In IJCAI, pages 4689–4696, 2019.
  • Jiang et al. [2020] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2020.
  • Keskar et al. [2017] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Kurakin et al. [2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Li et al. [2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, pages 6389–6399, 2018.
  • Liu et al. [2019] Jiayang Liu, Weiming Zhang, Yiwei Zhang, Dongdong Hou, Yujia Liu, Hongyue Zha, and Nenghai Yu. Detection based defense against adversarial examples from the steganalysis point of view. In CVPR, pages 4825–4834, 2019.
  • Liu et al. [2020] Chen Liu, Mathieu Salzmann, Tao Lin, Ryota Tomioka, and Sabine Süsstrunk. On the loss landscape of adversarial training: Identifying challenges and how to overcome them. NeurIPS, 33, 2020.
  • Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
  • Metzen et al. [2017] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. ICLR, 2017.
  • Mobahi [2016] Hossein Mobahi. Training recurrent neural networks by diffusion. CoRR, abs/1601.04114, 2016.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Papernot et al. [2016] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In S&P (Oakland), pages 582–597. IEEE, 2016.
  • Prabhu et al. [2019] Vinay Uday Prabhu, Dian Ang Yap, Joyce Xu, and John Whaley. Understanding adversarial robustness through loss landscape geometries. arXiv preprint arXiv:1907.09061, 2019.
  • Rice et al. [2020] Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In ICML, pages 8093–8104. PMLR, 2020.
  • Salman et al. [2019] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, pages 11292–11303, 2019.
  • Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
  • Wang et al. [2017] Yisen Wang, Xuejiao Deng, Songbai Pu, and Zhiheng Huang. Residual convolutional ctc networks for automatic speech recognition. arXiv preprint arXiv:1702.07793, 2017.
  • Wang et al. [2020] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In ICLR, 2020.
  • Wu et al. [2020] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. NeurIPS, 33, 2020.
  • Yang et al. [2020] Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In ICML, volume 119, pages 10693–10705, 2020.
  • Zhang et al. [2019] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.