
Adversarial Training Makes Weight Loss Landscape
Sharper in Logistic Regression

Masanori Yamada1 (Contact Author), Sekitoshi Kanai2, Tomoharu Iwata3, Tomokatsu Takahashi1, Yuki Yamanaka1, Hiroshi Takahashi2, and Atsutoshi Kumagai1,2
1NTT Secure Platform Laboratories
2NTT Software Innovation Center
3NTT Communication Science Laboratories
{masanori.yamada.cm, sekitoshi.kanai.fu, tomoharu.iwata.gy, tomokatsu.takahashi.wd, yuuki.yamanaka.kb, hiroshi.takahashi.bm, atsutoshi.kumagai.ht}@hco.ntt.co.jp
Abstract

Adversarial training is actively studied for learning models that are robust against adversarial examples. A recent study finds that adversarially trained models degrade in generalization performance on adversarial examples when their weight loss landscape, i.e., the change of the loss with respect to the weights, is sharp. It has been experimentally shown that adversarial training sharpens the weight loss landscape, but this phenomenon has not been clarified theoretically. Therefore, we theoretically analyze this phenomenon in this paper. As a first step, we prove that adversarial training with $L_2$ norm constraints sharpens the weight loss landscape in the linear logistic regression model. Our analysis reveals that the sharpness of the weight loss landscape is caused by the noise of adversarial training being aligned with the direction that increases the loss. We theoretically and experimentally confirm that the weight loss landscape becomes sharper as the magnitude of the adversarial training noise increases in the linear logistic regression model. Moreover, we experimentally confirm the same phenomenon in ResNet18 with softmax as a more general case.

1 Introduction

Deep neural networks (DNNs) have been successfully used in a wide range of applications such as image recognition He et al. (2016), speech recognition Wang et al. (2017), and natural language processing Devlin et al. (2019). Although DNNs have high generalization performance, they are vulnerable to adversarial examples, which are imperceptibly perturbed data that make DNNs misclassify Szegedy et al. (2014). For real-world applications of deep learning, DNNs need to be made secure against this vulnerability.

Many methods to defend models against adversarial examples have been presented, such as adversarial training Goodfellow et al. (2015); Kurakin et al. (2017); Madry et al. (2018); Zhang et al. (2019); Wang et al. (2020), adversarial detection Metzen et al. (2017); Huang et al. (2019); Hu et al. (2019); Liu et al. (2019), defensive distillation Papernot et al. (2016), and random smoothing Cohen et al. (2019); Yang et al. (2020); Salman et al. (2019). Among them, adversarial training experimentally achieves high robustness Madry et al. (2018); Zhang et al. (2019); Wang et al. (2020); Wu et al. (2020).

Adversarial training trains models on adversarial examples. Wu et al. (2020) experimentally show that adversarially trained models obtain high generalization performance when they have a flat weight loss landscape, which is the change of the loss with respect to the weights. However, Prabhu et al. (2019) experimentally find that adversarial training sharpens the weight loss landscape more than standard training, which is training without adversarial noise. Thus, the following question is important to answer: Does Adversarial Training Always Sharpen the Weight Loss Landscape? If the answer is yes, adversarial training always has a larger generalization gap than standard training because a sharp weight loss landscape degrades generalization performance.

In this paper, to answer this question theoretically, we focus on the logistic regression model with adversarial training as a first step of theoretical analysis. First, we use the definition of sharpness of the weight loss landscape based on the eigenvalues of the Hessian matrix Keskar et al. (2017). Next, to simplify the analysis, we decompose the adversarial noise into its magnitude and direction. Finally, we show that the eigenvalues of the Hessian matrix of the weight loss landscape are proportional to the norm of the adversarial noise. As a result, we theoretically show that the weight loss landscape becomes sharper as the noise of adversarial training becomes larger, as stated in Theorem 1. We experimentally confirm Theorem 1 on MNIST2, a dataset that restricts MNIST to two classes. Moreover, we experimentally show that in multi-class classification with a nonlinear model (ResNet18) as a more general case, the weight loss landscape also becomes sharper as the noise of adversarial training becomes larger. Finally, to check whether the sharpness of the weight loss landscape is a problem specific to adversarial examples, we compare the weight loss landscape of training with random noise against training with adversarially perturbed data in logistic regression. As a result, we confirm theoretically and experimentally that adversarial noise sharpens the weight loss landscape much more than random noise. This is caused by adversarial noise always being added in the direction that increases the loss. We conclude that the sharpness of the weight loss landscape needs to be reduced to reduce the generalization gap in adversarial training, because the weight loss landscape becomes sharp in adversarial training and the generalization gap becomes large.

Our contributions are as follows:

  • We show theoretically and experimentally that in logistic regression with an $L_2$ norm constraint, the weight loss landscape becomes sharper as the norm of the adversarial training noise increases.

  • We show theoretically and experimentally that adversarial noise in the data space sharpens the weight loss landscape in logistic regression much more than random noise does (random noise does not sharpen it to any extreme degree).

  • We experimentally show that a larger norm of adversarial training noise makes the weight loss landscape sharper in a nonlinear model (ResNet18) with softmax. As a result, the generalization gap becomes larger as the norm of the adversarial noise becomes larger.

2 Preliminary

2.1 Logistic Regression

We consider a binary classification task with ${\bm{x}}\equiv(x_{1},\dots,x_{d})\in\mathbb{R}^{d}$ and $y\in\{-1,1\}$. A data point is represented as ${\bm{x}}^{n}$, where $n$ is the data index, and its true label is $y^{n}$. The loss function of logistic regression is defined as

$$L\left(\bm{x},y,\bm{w}\right)=\frac{1}{N}\sum_{n}\ell\left(\bm{x}^{n},y^{n},\bm{w}\right),\qquad\ell\left(\bm{x}^{n},y^{n},\bm{w}\right)\equiv\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}\right)\right), \tag{1}$$

where $N$ is the total number of data points and ${\bm{w}}\equiv(w_{1},\dots,w_{d})\in\mathbb{R}^{d}$ is the training parameter of the model.
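As a reference point for the analysis below, Eq. (1) can be computed with a few lines of NumPy; this is our own minimal sketch (the function and variable names are ours, not from the paper):

```python
import numpy as np

def logistic_loss(X, y, w):
    """Average logistic loss of Eq. (1).
    X: (N, d) data, y: (N,) labels in {-1, +1}, w: (d,) weights."""
    margins = y * (X @ w)                    # y^n (w . x^n) for each n
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -margins))
```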

2.2 Adversarial Example

An adversarial example $\left({\bm{x}}^{\prime}\right)^{n}={\bm{x}}^{n}+{\bm{\eta}}^{n}$ is defined as

$$\bm{\eta}^{n}=\underset{\bm{\eta}^{n}\in\mathbb{B}_{\rm p}\left(\bm{x}^{n},\varepsilon\right)}{\operatorname{argmax}}\ \ell\left(\bm{x}^{n}+\bm{\eta}^{n},y^{n},\bm{w}\right), \tag{2}$$

where $\mathbb{B}_{\rm p}\left({\bm{x}}^{n},\varepsilon\right)$ is the region around ${\bm{x}}^{n}$ in which the $L_{\rm p}$ norm of the perturbation is at most $\varepsilon$. Projected gradient descent (PGD) Madry et al. (2018), a powerful adversarial attack, uses a multi-step search as

$$\bm{x}^{n}\leftarrow\Pi_{\varepsilon}\left(\bm{x}^{n}+\lambda\,{\rm sign}\left(\frac{\partial}{\partial\bm{x}^{n}}\ell\left(\bm{x}^{n},y^{n},\bm{w}\right)\right)\right), \tag{3}$$

where $\lambda$ is the step size and $\Pi_{\varepsilon}$ is the projection onto the constrained space. Since PGD can be fooled by gradient obfuscation, AutoAttack Croce and Hein (2020), which tries various attacks, has recently been used to evaluate robustness more reliably.
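For concreteness, a minimal single-example sketch of the $L_{\infty}$ PGD update in Eq. (3) for the linear logistic loss might look as follows; this is our own illustration (real implementations batch the computation, use random starts, and also clip to the valid pixel range):

```python
import numpy as np

def linf_pgd(x, y, w, eps, step, iters):
    """L_inf PGD (Eq. 3) on the logistic loss of a linear model (sketch)."""
    x_adv = x.copy()
    for _ in range(iters):
        # d/dx log(1 + exp(-y w.x)) = -y * w / (1 + exp(y w.x))
        grad = -y * w / (1.0 + np.exp(y * np.dot(w, x_adv)))
        x_adv = x_adv + step * np.sign(grad)       # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection Pi_eps
    return x_adv
```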

2.3 Adversarial Training

A promising way to defend the model against adversarial examples is adversarial training Goodfellow et al. (2015); Kurakin et al. (2017); Madry et al. (2018). Adversarial training learns the model with adversarial examples as

$$\min_{{\bm{w}}}\frac{1}{N}\sum_{n}\max_{{\bm{\eta}}^{n}\in\mathbb{B}_{\rm p}\left({\bm{x}}^{n},\varepsilon\right)}\ell\left(\bm{x}^{n}+\bm{\eta}^{n},y^{n},\bm{w}\right). \tag{4}$$

Since adversarial training uses information from adversarial examples, it can improve the robustness of the model against the adversarial attack used during training. In this paper, we refer to models obtained by standard training and by adversarial training as clean and robust models, respectively.
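For the linear logistic regression analyzed in this paper, the inner maximization of Eq. (4) under an $L_2$ constraint has a closed form: $\log\left(1+\exp\left(-y\bm{w}\cdot(\bm{x}+\bm{\eta})\right)\right)$ is maximized over $\|\bm{\eta}\|\leq\varepsilon$ by $\bm{\eta}^{\ast}=-\varepsilon y\bm{w}/\|\bm{w}\|$, i.e., full-norm noise aligned with the direction that increases the loss. The following gradient-descent sketch of Eq. (4) uses this fact; it is our own construction, not code from the paper:

```python
import numpy as np

def adv_train_l2(X, y, eps, lr=0.1, epochs=200):
    """Adversarial training (Eq. 4) for linear logistic regression with the
    closed-form L2 inner maximizer eta* = -eps * y * w / ||w|| (sketch)."""
    N, d = X.shape
    w = 1e-3 * np.ones(d)                              # small init so w/||w|| exists
    for _ in range(epochs):
        w_dir = w / np.linalg.norm(w)
        X_adv = X - eps * y[:, None] * w_dir[None, :]  # worst-case inputs
        sig = 1.0 / (1.0 + np.exp(y * (X_adv @ w)))    # sigmoid(-margin)
        w -= lr * (-(sig * y) @ X_adv / N)             # gradient step on Eq. (4)
    return w
```

By Danskin's theorem, the gradient of the inner maximum equals the loss gradient evaluated at the maximizer, so the outer update can treat the adversarial examples as fixed.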

2.4 Visualizing Loss Landscape

Li et al. (2018) presented filter normalization for visualizing the weight loss landscape. Filter normalization works not only for standard training but also for adversarial training Wu et al. (2020). With adversarial training, filter normalization visualizes the change of the loss when noise is added to the weights as

$$g\left(\alpha\right)=\max_{{\bm{\eta}}^{n}\in\mathbb{B}_{\rm p}\left({\bm{x}}^{n},\varepsilon\right)}\frac{1}{N}\sum_{n}\ell\left({\bm{x}}^{n}+{\bm{\eta}}^{n},y^{n},{\bm{w}}+\alpha{\bm{h}}\right), \tag{5}$$

where $\alpha$ is the magnitude of the noise and ${\bm{h}}\in\mathbb{R}^{d}$ is its direction. The direction ${\bm{h}}$ is sampled from a Gaussian distribution and filter-wise normalized by

$${\bm{h}}^{\left(l,m\right)}\leftarrow\frac{{\bm{h}}^{\left(l,m\right)}}{\left\|{\bm{h}}^{\left(l,m\right)}\right\|_{\rm F}}\left\|{\bm{w}}^{\left(l,m\right)}\right\|_{\rm F}, \tag{6}$$

where ${\bm{h}}^{(l,m)}$ is the $m$-th filter at the $l$-th layer of ${\bm{h}}$ and $\left\|\cdot\right\|_{\rm F}$ is the Frobenius norm. We note that the Frobenius norm is equal to the $L_2$ norm in logistic regression because ${\bm{w}}$ can be regarded as $\bm{w}\in\mathbb{R}^{1\times d}$. This normalization removes the scaling freedom of the weights and thereby avoids misleading loss landscapes Li et al. (2018). For instance, Dinh et al. (2017) exploit this scaling freedom to build pairs of equivalent networks with different apparent sharpness. Filter normalization absorbs the scaling freedom of the weights and enables the loss landscape to be visualized faithfully. With this normalization, the sharpness of the weight loss landscape strongly correlates with the generalization gap Li et al. (2018), and in adversarial training it strongly correlates with the robust generalization gap Wu et al. (2020).
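A minimal sketch of Eqs. (5) and (6) for the logistic regression case, where the whole weight vector acts as a single filter and filter normalization reduces to rescaling a random direction to the norm of $\bm{w}$ (the names are ours; for adversarial training, loss_fn should evaluate the loss on adversarial examples, e.g., generated with the attack sketched in Section 2.2):

```python
import numpy as np

def loss_landscape_1d(X, y, w, loss_fn, alphas):
    """g(alpha) of Eq. (5) along a filter-normalized direction (Eq. 6)."""
    h = np.random.randn(*w.shape)
    h = h / np.linalg.norm(h) * np.linalg.norm(w)   # filter normalization
    return [loss_fn(X, y, w + a * h) for a in alphas]
```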

3 Theoretical Analysis

Since the robust model is trained with added adversarial noise, the data loss landscape becomes flat. On the other hand, Wu et al. (2020) experimentally confirmed that in adversarial training the weight loss landscape becomes sharp and the generalization gap becomes large. In this section, we theoretically show that the weight loss landscape becomes sharp in adversarial training.

3.1 Definition of Weight Loss Landscape Sharpness

Several definitions of the sharpness/flatness of the weight loss landscape have been presented in studies of generalization performance and loss landscape sharpness in deep learning Hochreiter and Schmidhuber (1997); Keskar et al. (2017); Chaudhari et al. (2017). In this paper, we use a simple definition based on the eigenvalues of the Hessian matrix, presented in Keskar et al. (2017) and also used in Li et al. (2018). To clarify the relationship between the weight loss landscape and the generalization gap, we apply filter normalization to the weights. In other words, we define the sharpness of the weight loss landscape as the eigenvalues of $\frac{\partial^{2}L}{\partial w_{i}\partial w_{j}}$ with the normalized weights: the larger the eigenvalues, the sharper the weight loss landscape.

3.2 Main Results

This section presents Theorem 1, which provides the relation between the weight loss landscape and the norm of adversarial noise.

Theorem 1

When the loss of the linear logistic regression model converges to the minimum for each data point ($\frac{\partial}{\partial w_{i}}\ell\left({\bm{x}}^{n}+{\bm{\eta}},y^{n},{\bm{w}}\right)=0$), the weight loss landscape becomes sharper for robust models trained with a larger norm of adversarial noise constrained in the $L_2$ norm.

Let us prove Theorem 1. For the special case of linear logistic regression, Eq. (1) can be written with the adversarial noise $\eta_{i}^{n}$ and the angle $\theta^{n}$ as

$$L_{\eta}=\frac{1}{N}\sum_{n}\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\left(\bm{x}^{n}+\bm{\eta}^{n}\right)\right)\right)=\frac{1}{N}\sum_{n}\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)\right), \tag{7}$$

where $\left\|{\bm{w}}\right\|$ and $\left\|{\bm{\eta}}^{n}\right\|$ are $L_2$ norms arising from the inner product. Since an adversarial example must increase the loss, Lemma 1 implies $y^{n}\cos\theta^{n}=1$.

Lemma 1

$L_{\eta}$ is a monotonically increasing function of $\cos\theta^{n}$ when $y^{n}=1$, while $L_{\eta}$ is a monotonically decreasing function when $y^{n}=-1$.

All the proofs of lemmas are provided in the supplementary material.

Substituting $y^{n}\cos\theta^{n}=1$ into Eq. (7), the gradient of the loss with respect to the weights, $\frac{\partial L_{\eta}}{\partial w_{i}}$, is

$$\frac{1}{N}\sum_{n}\frac{-y^{n}x_{i}^{n}+\frac{w_{i}}{\left\|\bm{w}\right\|}\left\|\bm{\eta}^{n}\right\|}{1+\exp\left(y^{n}\bm{w}\cdot\bm{x}^{n}-\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\right)}. \tag{8}$$

An optimal weight ${\bm{w}}^{\ast}$ of $L_{\eta}$ for each data point satisfies $\frac{w_{i}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|}=\frac{y^{n}x_{i}^{n}}{\left\|{\bm{\eta}}^{n}\right\|}$. Next, we consider the Hessian matrix of the loss at an optimal weight. The $(i,j)$-th element of the Hessian matrix is obtained as

$$\left.\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}\right|_{{\bm{w}}={\bm{w}}^{\ast}}=\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right). \tag{9}$$

See the supplementary material for the derivation. We now consider the eigenvalues of the Hessian matrix. This matrix has trivial eigenvectors:

$$\sum_{j}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)v_{j}=v_{i}, \tag{10}$$
$$\sum_{j}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)w_{j}^{\ast}=0, \tag{11}$$

where $v_{j}$ is an element of ${\bm{v}}$, an arbitrary vector orthogonal to ${\bm{w}}^{\ast}$. Hence this matrix has eigenvalues $1$ and $0$. Since $\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)\in\mathbb{R}^{d\times d}$ is positive semidefinite, the factor $\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}$ determines the sharpness of the weight loss landscape. Let $\left\|{\bm{\eta}}_{a}\right\|\geq\left\|{\bm{\eta}}_{b}\right\|$ be the norms of the adversarial noise for perturbation strengths $\varepsilon_{a}\geq\varepsilon_{b}$, and let ${\bm{w}}_{a}^{\ast}$ and ${\bm{w}}_{b}^{\ast}$ be the optimal weights in adversarial training with $\left\|{\bm{\eta}}_{a}\right\|$ and $\left\|{\bm{\eta}}_{b}\right\|$, respectively. The relation between the nonzero eigenvalues is

$$\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}_{a}\right\|}{\left\|{\bm{w}}_{a}^{\ast}\right\|}\geq\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}_{b}\right\|}{\left\|{\bm{w}}_{b}^{\ast}\right\|}, \tag{12}$$

since filter normalization makes the scales of the weights the same, $\left\|{\bm{w}}_{a}^{\ast}\right\|\approx\left\|{\bm{w}}_{b}^{\ast}\right\|$ (in particular, in logistic regression this is natural because $\left\|{\bm{w}}\right\|_{\rm F}=\left\|{\bm{w}}\right\|$). Therefore, Theorem 1 is proved by Eq. (12). Moreover, since the eigenvalues are never negative, the loss at the optimum of adversarial training is always convex, which is a natural result.
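Eq. (9) can be checked numerically. At the per-sample optimum the Hessian is a scaled projector, $\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(I-\frac{{\bm{w}}^{\ast}{\bm{w}}^{\ast\top}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)$, whose nonzero eigenvalues all equal the scale factor. A small sketch of ours, under the assumption that every adversarial noise has full norm $\varepsilon$ (as the closed-form $L_2$ maximizer gives):

```python
import numpy as np

d, eps = 5, 0.5
w = np.random.randn(d)
P = np.eye(d) - np.outer(w, w) / np.dot(w, w)   # projector in Eq. (9)
H = (eps / (2.0 * np.linalg.norm(w))) * P       # Hessian when ||eta^n|| = eps
eigs = np.sort(np.linalg.eigvalsh(H))[::-1]
# The top d-1 eigenvalues all equal eps / (2 ||w||); doubling eps doubles
# them, which is exactly the linear sharpening stated in Theorem 1.
print(eigs)
```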

Random Noise

To clarify whether the sharpening of the weight loss landscape by noise in the data space is a phenomenon unique to adversarial training, we consider the weight loss landscape of a model trained with random noise. Let us assume uniformly distributed noise. We have the following theorem:

Theorem 2

Consider the logistic regression model trained with uniformly distributed noise constrained in the $L_{\infty}$ norm by $\varepsilon$, i.e., $\eta_{i}\sim U\left(-\varepsilon,\varepsilon\right)$. The eigenvalues of the Hessian matrix of this model converge to 0 as the loss converges to the minimum for each data point ($\frac{\partial}{\partial w_{i}}\ell\left({\bm{x}}^{n}+{\bm{\eta}},y^{n},{\bm{w}}\right)=0$). Furthermore, when the loss deviates slightly from the minimum, the eigenvalues of the Hessian matrix grow with the norm of the random noise.

This theorem shows that the weight loss landscape of the model trained with random noise is not as sharp as that of the adversarially trained model.

Let us prove Theorem 2. The derivative of the loss with arbitrary noise is

$$\frac{\partial L_{\eta}}{\partial w_{i}}=-\frac{1}{N}\sum_{n}\frac{y^{n}\left(g^{n}-1\right)\left(x_{i}^{n}+\eta_{i}^{n}\right)}{g^{n}}, \tag{13}$$

where $g^{n}\equiv 1+\exp\left(-y^{n}\bm{w}\cdot\left(\bm{x}^{n}+\bm{\eta}^{n}\right)\right)$. The $(i,j)$-th element of the Hessian matrix is obtained as

$$\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}=\frac{1}{N}\sum_{n}\frac{g^{n}-1}{\left(g^{n}\right)^{2}}\left(x_{i}^{n}+\eta_{i}^{n}\right)\left(x_{j}^{n}+\eta_{j}^{n}\right), \tag{14}$$

where we used $\left(y^{n}\right)^{2}=1$. Since the derivative of $\ell$ is zero at the optimal weight, the numerator of Eq. (13) becomes zero as $\kappa^{n}\equiv g^{n}-1\rightarrow 0$. Thus, the eigenvalues of the Hessian matrix become 0 when $g^{n}=1$. In other words, the weight loss landscape of the model trained with random noise becomes flatter as the loss converges to the minimum. Let us consider the case where $\kappa^{n}$ is sufficiently small but takes a finite value independent of ${\bm{\eta}}$. Since we assume uniformly distributed noise, ${\bm{x}}$ and ${\bm{\eta}}$ are uncorrelated and the mean of ${\bm{\eta}}$ is zero, so $\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n}\eta_{i}^{n}x_{j}^{n}=0$, and the variance of the noise is $\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n}\eta_{i}^{n}\eta_{j}^{n}=\frac{\varepsilon^{2}}{3}\delta_{ij}$. Thus, $\lim_{\kappa^{n}\rightarrow 0}\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}$ for large $N$ can be written as

$$\frac{1}{N}\sum_{n}\left(x_{i}^{n}x_{j}^{n}+\frac{\varepsilon^{2}}{3}\delta_{ij}\right)\lim_{\kappa^{n}\rightarrow 0}\frac{\kappa^{n}}{\left(\kappa^{n}+1\right)^{2}}=0. \tag{15}$$

Thus, when $\kappa^{n}$ is sufficiently small but takes a finite value independent of ${\bm{\eta}}$, the weight loss landscape becomes sharper along with the norm of the random noise. We note that since $\delta_{ij}$ is an element of the identity matrix, it increases the eigenvalues of the Hessian matrix. Considering Theorems 1 and 2, adversarial noise sharpens the weight loss landscape much more than training with random noise does. We have considered random noise constrained in the $L_{\infty}$ norm, but this statement holds whenever the mean of the random noise is zero and its variance increases with the noise norm. For example, for Gaussian noise $\bm{\eta}\sim N\left(0,\sigma I\right)$, the same conclusion holds when the noise is constrained in the $L_2$ norm.
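The two ingredients of this argument are easy to check by Monte Carlo: for $\eta_{i}\sim U(-\varepsilon,\varepsilon)$ the cross term $\frac{1}{N}\sum_{n}\eta_{i}^{n}x_{j}^{n}$ vanishes and the noise covariance approaches $\frac{\varepsilon^{2}}{3}\delta_{ij}$. A quick sketch of ours, with an arbitrary dimension and sample count:

```python
import numpy as np

N, d, eps = 100_000, 4, 0.3
rng = np.random.default_rng(0)
eta = rng.uniform(-eps, eps, size=(N, d))   # L_inf-bounded uniform noise
x = rng.normal(size=(N, d))                 # stand-in for uncorrelated data

print(eta.T @ x / N)     # cross term, approaches the zero matrix
print(eta.T @ eta / N)   # approaches (eps**2 / 3) * I = 0.03 * I here
```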

4 Related Work

4.1 Weight Loss Landscape in Adversarial Training

Prabhu et al. (2019) and Wu et al. (2020) experimentally studied the weight loss landscape in adversarial training. Wu et al. (2020) demonstrated that models with flatter weight loss landscapes have smaller generalization gaps and presented Adversarial Weight Perturbation (AWP), which consequently achieves higher robust accuracy. Prabhu et al. (2019) reported that the robust model has a sharper loss landscape than the clean model. However, these studies did not theoretically analyze the weight loss landscape. Recently, Liu et al. (2020) theoretically analyzed the loss landscape of robust models. The main topic of Liu et al. (2020) is Lipschitzian smoothness in adversarial training, and their supplementary material contains a theoretical analysis of the weight loss landscape in logistic regression. The difference between their analysis and ours is that Liu et al. (2020) compared the weight loss landscape at different $x$ positions within a single model ($L\left({\bm{x}},y,{\bm{w}}\right)$ and $L\left({\bm{x}}+{\bm{\eta}},y,{\bm{w}}\right)$), while we compare the weight loss landscapes of different models trained with different magnitudes of adversarial noise ($L\left({\bm{x}}+{\bm{\eta}}_{a},y,{\bm{w}}^{\ast}_{a}\right)$ and $L\left({\bm{x}}+{\bm{\eta}}_{b},y,{\bm{w}}^{\ast}_{b}\right)$, where ${\bm{w}}^{\ast}_{a}$ and ${\bm{w}}^{\ast}_{b}$ denote the respective optimal weights). Also, Liu et al. (2020) used an approximation of the loss function, $L_{\eta}=\frac{1}{N}\sum_{n}y^{n}\log\left(1+\exp\left(-y^{n}{\bm{w}}\cdot{\bm{x}}^{n}+{\bm{\eta}}^{n}\right)\right)$, to simplify the problem. In contrast, we do not approximate the loss function, as shown in Eq. (7). Liu et al. (2020) derived that since the logistic loss is a monotonically decreasing convex function, adding noise in the direction of increasing the loss must increase the gradient, which leads to a sharp weight loss landscape.

4.2 Weight Loss Landscape in Standard Training

The relationship between the weight loss landscape and generalization performance in deep learning has been investigated theoretically and experimentally Foret et al. (2020); Jiang et al. (2020); Keskar et al. (2017); Dinh et al. (2017). In a large experiment evaluating 40 complexity measures, Jiang et al. (2020) found that measures based on the weight loss landscape had the highest correlation with the generalization error, and that generalization performance improves as the weight loss landscape becomes flatter. To improve generalization performance by flattening the weight loss landscape, several methods have been presented, such as optimizing a diffused loss landscape Mobahi (2016), local entropy regularization Chaudhari et al. (2017), and Sharpness-Aware Minimization (SAM) Foret et al. (2020), an optimizer that searches for a flat weight loss landscape. In particular, since SAM modifies the optimizer, it can be combined with various methods, and it achieved strong experimental results that advanced the state of the art on many datasets, including CIFAR10, CIFAR100, and ImageNet. Since the weight loss landscape becomes sharper and the generalization gap larger in adversarial training than in standard training, we believe that finding a flat solution is even more important in adversarial training than in standard training.

5 Experiments

To verify the validity of Theorem 1, we visualize the sharpness of the weight loss landscape with various noise magnitudes in logistic regression in Section 5.1. Next, to investigate whether the sharpness of the weight loss landscape is a problem peculiar to adversarial training, we compare training with random noise against training with adversarial noise in Section 5.1. Finally, we visualize the weight loss landscape in a more general case (multi-class classification by softmax with deep learning) in Section 5.2. We also show the relationship between the weight loss landscape and the generalization gap.

Experimental setup

We provide details of the experimental conditions in the supplementary material. We used three datasets: MNIST2, CIFAR10 Krizhevsky et al. (2009), and SVHN Netzer et al. (2011). MNIST2 is explained in the subsection below. We used PGD for adversarial training and for evaluating the robust generalization gap. AutoAttack should be used for robustness evaluation; however, we did not use it in our experiments since we focus on the generalization gap. For visualizing the weight loss landscape, we used the filter normalization introduced in Section 2.4. The hyper-parameter settings for PGD were based on Madry et al. (2018) for MNIST2 and CIFAR10. Since Madry et al. (2018) did not experiment on SVHN, we based the SVHN hyper-parameters on Wu et al. (2020). The $L_2$ norm of the perturbation was set to $\varepsilon\in\{0,0.2,0.4,0.6,0.8,1\}$ for MNIST2. The $L_{\infty}$ norm of the perturbation was set to $\varepsilon\in\{\frac{0}{255},\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255},\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$ for MNIST2 and $\varepsilon\in\{\frac{0}{255},\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255}\}$ for CIFAR10 and SVHN. For PGD on MNIST2, we updated the perturbation for 40 iterations at training time and 100 iterations at evaluation time; the step size of $L_2$ PGD is $0.15\times\varepsilon$, while the step size of $L_{\infty}$ PGD is $0.01$. For CIFAR10 and SVHN, we updated the perturbation for 7 iterations at training time and 20 iterations at evaluation time; the step size of $L_{\infty}$ PGD is $\frac{2}{255}$ for CIFAR10 and $\frac{1}{255}$ for SVHN. For random noise, we used uniformly distributed noise constrained in the $L_2$ or $L_{\infty}$ norm.

Figure 1: Weight Loss Landscape against $L_2$ PGD and $L_2$ RANDOM in MNIST2
Figure 2: Weight Loss Landscape against $L_{\infty}$ PGD and $L_{\infty}$ RANDOM in MNIST2
Figure 3: Eigenvalues of the Hessian matrix against PGD in MNIST2: (a) $L_2$ PGD, (b) $L_{\infty}$ PGD
Table 1: Robust accuracy (%) against $L_2$ PGD in MNIST2
$\varepsilon$ train acc test acc gap
0.00 99.89 99.95 -0.06
0.20 99.79 99.95 -0.16
0.40 99.75 99.86 -0.11
0.60 99.76 99.76 -0.00
0.80 99.64 99.76 -0.12
1.00 99.49 99.62 -0.13
Table 2: Robust accuracy (%) against $L_2$ RANDOM in MNIST2
$\varepsilon$ train acc test acc gap
0.00 99.89 99.95 -0.06
0.20 99.89 99.95 -0.06
0.40 99.89 99.95 -0.06
0.60 99.89 99.95 -0.06
0.80 99.89 99.95 -0.06
1.00 99.88 99.95 -0.07
Table 3: Robust accuracy (%) against $L_{\infty}$ PGD in MNIST2
$\varepsilon$ train acc test acc gap
0/255 99.89 99.91 -0.02
4/255 99.79 99.95 -0.16
8/255 99.74 99.86 -0.12
12/255 99.79 99.76 0.02
16/255 99.72 99.76 -0.05
25.5/255 99.46 99.67 -0.21
51/255 97.84 98.53 -0.70
76.5/255 95.40 96.64 -1.24
Table 4: Robust accuracy (%) against $L_{\infty}$ RANDOM in MNIST2
$\varepsilon$ train acc test acc gap
0/255 99.89 99.91 -0.02
4/255 99.89 99.91 -0.02
8/255 99.89 99.91 -0.02
12/255 99.89 99.91 -0.02
16/255 99.89 99.91 -0.02

5.1 Binary Logistic Regression

We conducted experiments on the image dataset MNIST LeCun et al. (1998), which is widely used in adversarial example settings. Since the linear logistic regression model did not perform well in classifying two-class CIFAR10, we evaluated on MNIST. To experiment with logistic regression for binary classification, we created a two-class dataset, MNIST2, from MNIST, using only class 0 and class 1. Figures 1 and 2 show the weight loss landscape with various noise magnitudes in the $L_2$ and $L_{\infty}$ norms in logistic regression. We can confirm that the weight loss landscape becomes sharper as the noise magnitude increases. For the sake of clarity, we have excluded the large values $\varepsilon\in\{\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$ in $L_{\infty}$ from Fig. 2. The results for the large $\varepsilon\in\{\frac{0}{255},\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$ in the $L_{\infty}$ norm, which is often used in 10-class classification on MNIST, are included in the supplementary material; they are similar: the larger the noise magnitude, the sharper the weight loss landscape. Table 1 also shows that the absolute value of the generalization gap is larger in adversarial training than in standard training ($\varepsilon=0$). The test robust accuracy is larger than the training accuracy because of early stopping Rice et al. (2020), but we emphasize that the training and test accuracies diverge. In Table 3, the relationship between the generalization gap and $\varepsilon$ is difficult to see because the experiments were designed with small noise so that the training loss reaches zero. However, in the range where $\varepsilon$ is large in Table 3, the absolute value of the generalization gap increases as $\varepsilon$ increases.
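Constructing MNIST2 is simple class filtering; the following is a hedged sketch using torchvision (the label mapping to $\{-1,+1\}$ and the scaling to $[0,1]$ are our assumptions about preprocessing, not details specified by the paper):

```python
import torch
from torchvision import datasets

def mnist2(root="./data", train=True):
    """MNIST2: the subset of MNIST with classes 0 and 1 only,
    labels mapped to {-1, +1} for binary logistic regression."""
    ds = datasets.MNIST(root, train=train, download=True)
    mask = (ds.targets == 0) | (ds.targets == 1)
    X = ds.data[mask].float().reshape(-1, 784) / 255.0  # flatten, scale to [0, 1]
    y = ds.targets[mask].float() * 2.0 - 1.0            # 0 -> -1, 1 -> +1
    return X, y
```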

Figure 4: Weight Loss Landscape against $L_{\infty}$ PGD in ResNet18: (a) CIFAR10, (b) SVHN
Table 5: Robust accuracy (%) against $L_{\infty}$ PGD in CIFAR10
$\varepsilon$ train acc test acc gap
0/255 99.00 93.89 5.11
4/255 81.90 69.73 12.17
8/255 63.81 50.98 12.83
12/255 50.83 36.69 14.14
16/255 47.56 25.61 21.95
Table 6: Robust accuracy (%) against $L_{\infty}$ PGD in SVHN
$\varepsilon$ train acc test acc gap
0/255 100.00 96.13 3.87
4/255 83.53 73.51 10.01
8/255 70.96 51.82 19.14
12/255 64.58 36.93 27.65
16/255 60.93 29.59 31.34

Eigenvalue of Hessian matrix

We analyze the eigenvalues of the Hessian matrix to confirm Eq. (9). We show the results of the linear logistic regression model on MNIST2. Figure 3 shows the top three eigenvalues of the models trained with $L_2$ PGD and $L_{\infty}$ PGD for different values of $\varepsilon$. These figures show that the eigenvalues are nearly linear in $\varepsilon$, as in our theoretical analysis. We estimated the eigenvalues without filter normalization, but we checked that the norm of the weights does not change significantly across $\varepsilon$.
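For a linear model the Hessian is only $d\times d$ and can be formed exactly from Eq. (14), with $\bm{x}^{n}+\bm{\eta}^{n}$ being the adversarial examples; the following is our own sketch of how the top eigenvalues in Figure 3 could be computed, not the paper's code:

```python
import numpy as np

def hessian_top_eigs(X_adv, y, w, k=3):
    """Top-k eigenvalues of the logistic-loss Hessian (Eq. 14)
    evaluated on adversarial examples X_adv at weights w."""
    N = X_adv.shape[0]
    g = 1.0 + np.exp(-y * (X_adv @ w))           # g^n of Eq. (14)
    coef = (g - 1.0) / g**2                      # per-sample curvature weight
    H = (X_adv * coef[:, None]).T @ X_adv / N    # (1/N) sum coef * x x^T
    return np.linalg.eigvalsh(H)[-k:][::-1]      # largest k, descending
```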

Comparison with Random Noise

To clarify whether the sharpening of the weight loss landscape by noise in the data space is a phenomenon unique to adversarial training, we compare the weight loss landscape of training with random noise against training with adversarial noise (adversarial training with PGD). We used a logistic regression model on MNIST2; random noise was generated from a uniform distribution constrained in the $L_2$ norm with $\varepsilon\in\{0,0.2,0.4,0.8,1\}$ or the $L_{\infty}$ norm with $\varepsilon\in\{0,\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255}\}$, and then clipped together with the image so that the perturbed image fits in the normalized range $[0,1]$. Figures 1 and 2 compare the weight loss landscape under adversarial noise with that under random noise in the $L_2$ and $L_{\infty}$ norms, respectively. As in our theoretical analysis, these figures show that adversarial noise sharpens the weight loss landscape much more than training with random noise does for both norms. As a result, the generalization gap with random noise is smaller than that with adversarial noise, as shown in Tables 2 and 4.

5.2 Multi-class Classification

In the more general case of multi-class classification using softmax with a residual network He et al. (2016), we confirm that the weight loss landscape becomes sharper in adversarial training, as in logistic regression. We use ResNet18 He et al. (2016) with softmax on CIFAR10 and SVHN. Figure 4 shows the weight loss landscape for $\varepsilon\in\{\frac{0}{255},\frac{4}{255},\frac{8}{255},\frac{12}{255},\frac{16}{255}\}$. We confirmed that the weight loss landscape becomes sharper as the noise magnitude of adversarial training becomes larger. Tables 5 and 6 also show that the generalization gap becomes larger in most cases as the magnitude of the noise becomes larger.

6 Conclusion and Future work

In this paper, we showed theoretically and experimentally that the weight loss landscape of the linear logistic regression model becomes sharper as the noise of adversarial training becomes stronger. In linear logistic regression, we also showed, both theoretically and experimentally, that not all data-space noise makes the landscape extremely sharp, but adversarial examples do. Theoretical analysis of more general nonlinear models (such as residual networks) with softmax is future work. To motivate it, we experimentally showed that the weight loss landscape becomes sharper as the noise of adversarial training becomes larger. We conclude that the sharpness of the weight loss landscape needs to be reduced to reduce the generalization gap in adversarial training, because the weight loss landscape becomes sharp in adversarial training and the generalization gap becomes large.

Appendix A Proof of Lemma 1

We show that the loss is monotonically increasing in $\cos\theta^{n}$ when $y^{n}=1$ and monotonically decreasing when $y^{n}=-1$:
$$\frac{\partial}{\partial\left(\cos\theta^{n}\right)}\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)\right)=\frac{y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)}{1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}+y^{n}\left\|\bm{w}\right\|\left\|\bm{\eta}^{n}\right\|\cos\theta^{n}\right)}.$$
Since the exponential function always takes a positive value, the sign of this derivative is the sign of $y^{n}$, which proves Lemma 1.

Appendix B Hessian matrix on the optimal weight

We compute the Hessian matrix of the adversarially trained logistic regression model. The loss function is

$$L\left(\bm{x},y,\bm{w}\right)=\frac{1}{N}\sum_{n}\ell\left(\bm{x}^{n},y^{n},\bm{w}\right),\qquad\ell\left(\bm{x}^{n},y^{n},\bm{w}\right)\equiv\log\left(1+\exp\left(-y^{n}\bm{w}\cdot\bm{x}^{n}\right)\right).$$

The Hessian matrix of the loss with respect to the weights, $\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}$, follows from the product rule applied to Eq. (8):

$$\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}=\frac{1}{N}\sum_{n}\left[\frac{\frac{\partial}{\partial w_{j}}\left(-y^{n}x_{i}^{n}+\frac{w_{i}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)}{1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}+\left(-y^{n}x_{i}^{n}+\frac{w_{i}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)\frac{\partial}{\partial w_{j}}\frac{1}{1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}\right].$$

Using $\frac{\partial}{\partial w_{j}}\frac{w_{i}}{\left\|{\bm{w}}\right\|}=\frac{1}{\left\|{\bm{w}}\right\|}\left(\delta_{ij}-\frac{w_{i}w_{j}}{\left\|{\bm{w}}\right\|^{2}}\right)$, this becomes

$$\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}=\frac{1}{N}\sum_{n}\left[\frac{\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}\right\|}\left(\delta_{ij}-\frac{w_{i}w_{j}}{\left\|{\bm{w}}\right\|^{2}}\right)}{1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}+\frac{\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}{\left(1+\exp\left(y^{n}{\bm{w}}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)\right)^{2}}\left(y^{n}x_{i}^{n}-\frac{w_{i}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)\left(y^{n}x_{j}^{n}-\frac{w_{j}}{\left\|{\bm{w}}\right\|}\left\|{\bm{\eta}}^{n}\right\|\right)\right].$$

The optimal weight condition $\frac{w_{i}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|}=\frac{y^{n}x_{i}^{n}}{\left\|{\bm{\eta}}^{n}\right\|}$ eliminates the second term. The first term becomes

$$\left.\frac{\partial^{2}L_{\eta}}{\partial w_{i}\partial w_{j}}\right|_{{\bm{w}}={\bm{w}}^{\ast}}=\frac{1}{N}\sum_{n}\frac{\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)}{1+\exp\left(y^{n}{\bm{w}}^{\ast}\cdot{\bm{x}}^{n}-\left\|{\bm{w}}^{\ast}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}=\frac{1}{N}\sum_{n}\frac{\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right)}{1+\exp\left(\sum_{i}w_{i}^{\ast}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}w_{i}^{\ast}-\left\|{\bm{w}}^{\ast}\right\|\left\|{\bm{\eta}}^{n}\right\|\right)}=\frac{1}{2N}\sum_{n}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}\left(\delta_{ij}-\frac{w_{i}^{\ast}w_{j}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|^{2}}\right),$$
where the exponent vanishes because $\sum_{i}w_{i}^{\ast}\frac{\left\|{\bm{\eta}}^{n}\right\|}{\left\|{\bm{w}}^{\ast}\right\|}w_{i}^{\ast}=\left\|{\bm{w}}^{\ast}\right\|\left\|{\bm{\eta}}^{n}\right\|$.
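This closed form can also be verified by finite differences. The optimality condition $\frac{w_{i}^{\ast}}{\left\|{\bm{w}}^{\ast}\right\|}=\frac{y^{n}x_{i}^{n}}{\left\|{\bm{\eta}}^{n}\right\|}$ implies $\left\|{\bm{\eta}}^{n}\right\|=\left\|{\bm{x}}^{n}\right\|$ and ${\bm{w}}^{\ast}\propto y^{n}{\bm{x}}^{n}$ for a single data point, so an optimum can be constructed directly; the following check is our own sketch:

```python
import numpy as np

def loss(w, x, y, eta_norm):
    # single-sample loss after substituting y cos(theta) = 1 in Eq. (7)
    return np.log1p(np.exp(-y * np.dot(w, x) + np.linalg.norm(w) * eta_norm))

d, h = 4, 1e-5
rng = np.random.default_rng(1)
x, y = rng.normal(size=d), 1.0
eta_norm = np.linalg.norm(x)        # optimality requires ||eta|| = ||x||
w = 0.7 * y * x                     # then w/||w|| = y x / ||eta||

H = np.zeros((d, d))
E = np.eye(d) * h                   # E[i] is a step of size h along axis i
for i in range(d):
    for j in range(d):
        H[i, j] = (loss(w + E[i] + E[j], x, y, eta_norm)
                   - loss(w + E[i], x, y, eta_norm)
                   - loss(w + E[j], x, y, eta_norm)
                   + loss(w, x, y, eta_norm)) / h**2

P = np.eye(d) - np.outer(w, w) / np.dot(w, w)
# Matches the derived Hessian ||eta|| / (2 ||w||) * P up to truncation error.
print(np.allclose(H, eta_norm / (2.0 * np.linalg.norm(w)) * P, atol=1e-3))
```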

Appendix C Experimental Details

In this appendix, we describe the experimental details. We use five image datasets: MNIST LeCun et al. (1998), MNIST2, CIFAR10 Krizhevsky et al. (2009), CIFAR2, and SVHN Netzer et al. (2011). MNIST2 and CIFAR2 are our own datasets for binary classification: MNIST2 uses only MNIST classes 0 and 1, and CIFAR2 uses only the frog and ship classes of CIFAR10. We chose the frog and ship classes because they have the highest classification accuracy. For CIFAR10 and CIFAR2, the model architecture, data standardization, and PGD attack parameters were set as in Madry et al. (2018). For SVHN, they were set as in Wu et al. (2020).

Appendix D Additional Experiments

D.1 MNIST2 with the larger magnitude of noise

Figure 5: Weight Loss Landscape against $L_{\infty}$ PGD in MNIST2

Figure 5 shows the results of the experiment using the noise magnitude range $\varepsilon\in\{0,\frac{25.5}{255},\frac{51}{255},\frac{76.5}{255}\}$, which is commonly used for 10-class classification on MNIST.

References

  • Chaudhari et al. [2017] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In ICLR, 2017.
  • Cohen et al. [2019] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, pages 1310–1320, 2019.
  • Croce and Hein [2020] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. ICML, 2020.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Dinh et al. [2017] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In ICML, 2017.
  • Foret et al. [2020] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
  • Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
  • Hu et al. [2019] Shengyuan Hu, Tao Yu, Chuan Guo, Wei-Lun Chao, and Kilian Q Weinberger. A new defense against adversarial images: Turning a weakness into a strength. In NeurIPS, volume 32, pages 1635–1646, 2019.
  • Huang et al. [2019] Bo Huang, Yi Wang, and Wei Wang. Model-agnostic adversarial detection by random perturbations. In IJCAI, pages 4689–4696, 2019.
  • Jiang et al. [2020] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2020.
  • Keskar et al. [2017] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Kurakin et al. [2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Li et al. [2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, pages 6389–6399, 2018.
  • Liu et al. [2019] Jiayang Liu, Weiming Zhang, Yiwei Zhang, Dongdong Hou, Yujia Liu, Hongyue Zha, and Nenghai Yu. Detection based defense against adversarial examples from the steganalysis point of view. In CVPR, pages 4825–4834, 2019.
  • Liu et al. [2020] Chen Liu, Mathieu Salzmann, Tao Lin, Ryota Tomioka, and Sabine Süsstrunk. On the loss landscape of adversarial training: Identifying challenges and how to overcome them. NeurIPS, 33, 2020.
  • Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
  • Metzen et al. [2017] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. ICLR, 2017.
  • Mobahi [2016] Hossein Mobahi. Training recurrent neural networks by diffusion. CoRR, abs/1601.04114, 2016.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Papernot et al. [2016] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In S&P (Oakland), pages 582–597. IEEE, 2016.
  • Prabhu et al. [2019] Vinay Uday Prabhu, Dian Ang Yap, Joyce Xu, and John Whaley. Understanding adversarial robustness through loss landscape geometries. arXiv preprint arXiv:1907.09061, 2019.
  • Rice et al. [2020] Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In ICML, pages 8093–8104. PMLR, 2020.
  • Salman et al. [2019] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, pages 11292–11303, 2019.
  • Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
  • Wang et al. [2017] Yisen Wang, Xuejiao Deng, Songbai Pu, and Zhiheng Huang. Residual convolutional ctc networks for automatic speech recognition. arXiv preprint arXiv:1702.07793, 2017.
  • Wang et al. [2020] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In ICLR, 2020.
  • Wu et al. [2020] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. NeurIPS, 33, 2020.
  • Yang et al. [2020] Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In ICML, volume 119, pages 10693–10705, 2020.
  • Zhang et al. [2019] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.