
Laplacian Networks: Bounding Indicator Function Smoothness for Neural Networks Robustness

Carlos Eduardo Rosar Kos Lassance, Vincent Gripon
IMT Atlantique
Brest, France
first dot lastname at imt-atlantique dot fr
Antonio Ortega
University of Southern California
Los Angeles, USA
first dot lastname at sipi dot usc dot edu
Abstract

For the past few years, Deep Neural Network (DNN) robustness has become a question of paramount importance: in sensitive settings, misclassification can lead to dramatic consequences. Such misclassifications are likely to occur when facing adversarial attacks, hardware failures or limitations, and imperfect signal acquisition. To address this question, authors have proposed different approaches aiming at increasing the robustness of DNNs, such as adding regularizers or training with noisy examples. In this paper we propose a new regularizer built upon the Laplacian of similarity graphs obtained from the representation of training data at each layer of the DNN architecture. This regularizer penalizes large changes (across consecutive layers of the architecture) in the distance between examples of different classes, and as such enforces smooth variations of the class boundaries. Since it is agnostic to the type of deformations expected at prediction time, the proposed regularizer can be combined with existing ad-hoc methods. We provide theoretical justification for this regularizer and demonstrate its effectiveness in improving the robustness of DNNs on classical supervised learning vision datasets.

1 Introduction

Deep Neural Networks (DNNs) provide state-of-the-art performance in many machine learning challenges (He et al., 2016; Wu et al., 2016). Their ability to achieve good generalization is often explained by the fact that they use very few priors about data (LeCun et al., 2015). On the other hand, their strong dependence on data may lead them to overfit irrelevant features of the training dataset, resulting in non-robust classification performance.

In the literature, authors have been interested in studying the robustness of DNNs in various conditions. These conditions include:

  • Robustness to isotropic noise, i.e., small isotropic variations of the input (Mallat, 2016), typically meaning that the network function has a small Lipschitz constant.

  • Robustness to adversarial attacks, which can exploit knowledge about the network parameters or the training dataset (Szegedy et al., 2013; Goodfellow et al., 2014).

  • Robustness to implementation defects, which can result in only approximately correct computations (Hubara et al., 2017).

To improve DNN robustness, three main families of solutions have been proposed in the literature. The first involves enforcing smoothness of the operators, as measured by a Lipschitz constant, together with a minimum separation margin (Mallat, 2016). A similar approach is proposed in (Cisse et al., 2017), where the authors restrict the network function to be contractive. A second class of methods uses intermediate representations obtained at various layers during the prediction phase (Papernot and McDaniel, 2018). Finally, in (Kurakin et al., 2016; Pezeshki et al., 2016), the authors propose to train the network using noisy inputs so that it better generalizes to this type of noise. This has been shown to improve the robustness of the network to the specific type of noise used during training, but there is no guarantee that this robustness extends to other types of deformations.

In this work, we introduce a new regularizer that does not focus on a specific type of deformation, but aims at increasing robustness in general. As such, the proposed regularizer can be combined with other existing methods. It is inspired by recent developments in Graph Signal Processing (GSP) (Shuman et al., 2013). GSP is a mathematical framework that extends classical Fourier analysis to complex topologies described by graphs, by introducing notions of frequency for signals defined on graphs. Thus, signals that are smooth on the graph (i.e., change slowly from one node to its neighbors) will have most of their energy concentrated in the low frequencies.

The proposed regularizer is based on constructing a series of graphs, one for each layer of the DNN architecture, where each graph captures the similarity between all training examples given their intermediate representation at that layer. Our proposed regularizer penalizes large changes in the smoothness of class indicator vectors (viewed here as graph signals) from one layer to the next. As a consequence, the distances between pairs of examples in different classes are only allowed to change slowly from one layer to the next. Note that because we use deep architectures, the regularizer does not prevent the smoothness from achieving its maximum value, but constraining the size of changes from layer to layer reduces the risk of overfitting by controlling the distance to the boundary region, as supported by experiments in Section 4.

The outline of the paper is as follows. In Section 2 we present related work. In Section 3 we introduce the proposed regularizer. In Section 4 we evaluate the performance of our proposed method in various conditions and on vision benchmarks. Section 5 summarizes our conclusions.

2 Related work

DNN robustness may refer to many different problems. In this work we are mostly interested in stability to deformations (Mallat, 2016), or noise, which can be due to the multiple factors mentioned in the introduction. Stability to deformations is most studied in the context of adversarial attacks, where it has been shown that very small, imperceptible changes to the input of a trained DNN can result in its misclassification (Szegedy et al., 2013; Goodfellow et al., 2014). These works were instrumental in showing that DNNs may not be as robust to deformations as test accuracy benchmarks would lead one to believe. Other works, such as (Recht et al., 2018), have shown that DNNs may also suffer drops in performance when facing deformations that do not originate from adversarial attacks, but simply from re-sampling the test images.

Multiple ways to improve robustness have been proposed in the literature. They range from using a model ensemble composed of k-nearest neighbor classifiers at each layer (Papernot and McDaniel, 2018), to using distillation as a means to protect the network (Papernot et al., 2016a). Other methods introduce regularizers (Gu and Rigazio, 2014), control the Lipschitz constant of the network function (Cisse et al., 2017), or implement strategies revolving around using deformations as a data augmentation procedure during the training phase (Goodfellow et al., 2014; Kurakin et al., 2016; Moosavi Dezfooli et al., 2016).

Compared to these works, our proposed method can be viewed as a regularizer that penalizes large deformations of the class boundaries throughout the network architecture, instead of focusing on a specific deformation of the input. As such, it can be combined with the other strategies mentioned above. Indeed, we demonstrate that the proposed method can be implemented in combination with (Cisse et al., 2017), resulting in a network function in which small variations of the input lead to small variations of the decision, as in (Cisse et al., 2017), while limiting the amount of change to the class boundaries. Note that our approach does not require training data affected by a specific deformation, and our results could be further improved if such data were available for training.

The combination of GSP and machine learning has recently sparked interest. For example, the authors of (Gripon et al., 2018) show that it is possible to detect overfitting by tracking the evolution of the smoothness of a graph containing only training set examples. Another example is (Anirudh et al., 2017), where the authors introduce different GSP-based quantities that can be used to extract interpretable results from DNNs. In (Svoboda et al., 2018) the authors exploit graph convolutional layers (Bronstein et al., 2017) to increase the robustness of the network.

To the best of our knowledge, this is the first use of graph signal smoothness as a regularizer for deep neural network design.

3 Methodology

3.1 Similarity preset and postset graphs

Consider a deep neural network architecture. Such a network is obtained by assembling layers of various types. Of particular interest are layers of the form $\mathbf{x}^{\ell}\mapsto\mathbf{x}^{\ell+1}=h^{\ell}(\mathbf{W}^{\ell}\mathbf{x}^{\ell}+\mathbf{b}^{\ell})$, where $h^{\ell}$ is a nonlinear function, typically a ReLU, $\mathbf{W}^{\ell}$ is the weight tensor at layer $\ell$, $\mathbf{x}^{\ell}$ is the intermediate representation of the input at layer $\ell$, and $\mathbf{b}^{\ell}$ is the corresponding bias tensor. Note that strides or pooling may be used. Assembling can be achieved in various ways (composition, concatenation, sums, etc.), so that we obtain a global function $f$ that associates an input tensor $\mathbf{x}^{0}$ with an output tensor $\mathbf{y}=f(\mathbf{x}^{0})$.

When computing the output $\mathbf{y}$ associated with the input $\mathbf{x}^{0}$, each layer $\ell$ of the architecture processes some input $\mathbf{x}^{\ell}$ and computes the corresponding output $\mathbf{y}^{\ell}=h^{\ell}(\mathbf{W}^{\ell}\mathbf{x}^{\ell}+\mathbf{b}^{\ell})$. For a given layer $\ell$ and a batch of $b$ inputs $\mathcal{X}=\{\mathbf{x}_{1},\dots,\mathbf{x}_{b}\}$, we obtain two sets: $\mathcal{X}^{\ell}=\{\mathbf{x}^{\ell}_{1},\dots,\mathbf{x}^{\ell}_{b}\}$, called the preset, and $\mathcal{Y}^{\ell}=\{\mathbf{y}^{\ell}_{1},\dots,\mathbf{y}^{\ell}_{b}\}$, called the postset.

Given a similarity measure $s$ on tensors, from a preset we can build the similarity preset matrix $\mathbf{M}^{\ell}_{pre}[i,j]=s(\mathbf{x}_{i}^{\ell},\mathbf{x}_{j}^{\ell}),\ \forall 1\leq i,j\leq b$, where $\mathbf{M}[i,j]$ denotes the element at row $i$ and column $j$ of $\mathbf{M}$. The postset matrix is defined similarly.

Consider a similarity (preset or postset) matrix $\mathbf{M}^{\ell}$. This matrix can be used to build a $k$-nearest neighbor similarity weighted graph $G^{\ell}=\langle V,\mathbf{A}^{\ell}\rangle$, where $V=\{1,\dots,b\}$ is the set of vertices and $\mathbf{A}^{\ell}$ is the weighted adjacency matrix defined as:

$$\mathbf{A}^{\ell}[i,j]=\begin{cases}\mathbf{M}^{\ell}[i,j]&\text{if }\mathbf{M}^{\ell}[i,j]\in\arg\max_{i'\neq j}(\mathbf{M}^{\ell}[i',j],k)\,\cup\,\arg\max_{j'\neq i}(\mathbf{M}^{\ell}[i,j'],k)\\0&\text{otherwise}\end{cases}\quad\forall i,j\in V, \qquad (1)$$

where $\arg\max_{i}(a_{i},k)$ denotes the indices of the $k$ largest elements of $\{a_{1},\dots,a_{b}\}$. Note that by construction $\mathbf{A}^{\ell}$ is symmetric.
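As a concrete reference, the following NumPy sketch builds such a symmetric $k$-nearest neighbor adjacency matrix from a batch of intermediate representations, assuming cosine similarity as the measure $s$ (the choice used later in Section 3.2); the function name and interface are illustrative, not the authors' code.

```python
import numpy as np

def knn_adjacency(X, k):
    """X: (b, d) array of flattened layer representations. Returns (b, b) A."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows
    M = Xn @ Xn.T                                       # cosine similarity matrix
    np.fill_diagonal(M, -np.inf)                        # exclude self-loops
    # mask[i, j] is True when j is among the k most similar examples to i
    top_k = np.argsort(M, axis=1)[:, -k:]
    mask = np.zeros(M.shape, dtype=bool)
    mask[np.arange(M.shape[0])[:, None], top_k] = True
    # Union with the transposed mask, as in Eq. (1): keep an edge when either
    # endpoint selects the other, which makes A symmetric by construction.
    mask |= mask.T
    return np.where(mask, M, 0.0)
```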

3.2 Smoothness of label signals

Given a weighted graph $G^{\ell}=\langle V,\mathbf{A}^{\ell}\rangle$, we call Laplacian of $G^{\ell}$ the matrix $\mathbf{L}^{\ell}=\mathbf{D}^{\ell}-\mathbf{A}^{\ell}$, where $\mathbf{D}^{\ell}$ is the diagonal degree matrix defined by $\mathbf{D}^{\ell}[i,i]=\sum_{j}\mathbf{A}^{\ell}[i,j],\ \forall i\in V$. Because $\mathbf{L}^{\ell}$ is symmetric and real-valued, it can be written:

$$\mathbf{L}^{\ell}=\mathbf{F}^{\ell}\mathbf{\Lambda}^{\ell}\mathbf{F}^{\ell\top}, \qquad (2)$$

where $\mathbf{F}^{\ell}$ is orthonormal and contains the eigenvectors of $\mathbf{L}^{\ell}$ as columns, $\mathbf{F}^{\ell\top}$ denotes the transpose of $\mathbf{F}^{\ell}$, and $\mathbf{\Lambda}^{\ell}$ is diagonal and contains the eigenvalues of $\mathbf{L}^{\ell}$ in ascending order. Note that the constant vector $\mathbf{1}\in\mathbb{R}^{b}$ is an eigenvector of $\mathbf{L}^{\ell}$ with eigenvalue 0, and that all eigenvalues of $\mathbf{L}^{\ell}$ are nonnegative. Consequently, $\mathbf{1}/\sqrt{b}$ can be chosen as the first column of $\mathbf{F}^{\ell}$.

Given a vector $\mathbf{s}\in\mathbb{R}^{b}$, we define its Graph Fourier Transform (GFT) $\hat{\mathbf{s}}$ on $G^{\ell}$ as (Shuman et al., 2013):

$$\hat{\mathbf{s}}=\mathbf{F}^{\ell\top}\mathbf{s}. \qquad (3)$$

Because the eigenvectors are ordered by ascending eigenvalue, if only the first few entries of $\hat{\mathbf{s}}$ are nonzero then $\mathbf{s}$ is low frequency (smooth). In the extreme case where only the first entry of $\hat{\mathbf{s}}$ is nonzero, $\mathbf{s}$ is constant (maximum smoothness). More generally, the smoothness $\sigma^{\ell}(\mathbf{s})$ of a signal $\mathbf{s}$ can be measured using the quadratic form of the Laplacian:

$$\sigma^{\ell}(\mathbf{s})=\mathbf{s}^{\top}\mathbf{L}^{\ell}\mathbf{s}=\frac{1}{2}\sum_{i,j=1}^{b}\mathbf{A}^{\ell}[i,j](\mathbf{s}[i]-\mathbf{s}[j])^{2}=\sum_{i=1}^{b}\mathbf{\Lambda}^{\ell}[i,i]\hat{\mathbf{s}}[i]^{2}, \qquad (4)$$

where we note that $\mathbf{s}$ is smoother when $\sigma^{\ell}(\mathbf{s})$ is smaller.

In this paper we are particularly interested in the smoothness of label signals. The label signal $\mathbf{s}_{c}$ associated with class $c$ is the binary ($\{0,1\}$) vector whose nonzero coordinates are the ones corresponding to inputs of class $c$. In other words, $\mathbf{s}_{c}[i]=1\Leftrightarrow\mathbf{x}_{i}$ is in class $c$, $\forall 1\leq i\leq b$.

Denote by $u$ the last layer of the architecture, so that $\mathbf{y}^{u}_{i}=\mathbf{y}_{i},\ \forall i$. Note that in typical settings, where the outputs of the network are one-hot encoded and no regularizer is used, at the end of the learning process it is expected that $\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\approx 1$ if $i$ and $j$ belong to the same class, and $\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\approx 0$ otherwise.

Thus, assuming that cosine similarity is used to build the graph, the last-layer smoothness would satisfy $\sigma^{u}_{post}(\mathbf{s}_{c})\approx 0$ for all $c$, since edge weights between nodes with different labels are close to zero by Equation (4). More generally, the smoothness of $\mathbf{s}_{c}$ at the preset or postset of a given layer measures the average similarity between examples in class $c$ and examples in other classes ($\sigma(\mathbf{s}_{c})$ decreases as the weights of edges connecting nodes in different classes decrease). Because the last layer can achieve $\sigma(\mathbf{s}_{c})\approx 0$, we expect the smoothness metric $\sigma$ at each layer to decrease as we go deeper in the network. Next we introduce a regularization strategy that limits how much $\sigma$ can decrease from one layer to the next, and can even prevent the last layer from achieving $\sigma(\mathbf{s}_{c})=0$. This will be shown to improve generalization and robustness. The theoretical motivation for this choice is discussed in Section 3.4.
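To make these quantities concrete, here is a short sketch (reusing the hypothetical `knn_adjacency` above) that builds the Laplacian of a similarity graph and evaluates the smoothness of Eq. (4) for a binary label signal; the toy data and names are ours.

```python
import numpy as np

def laplacian(A):
    """Combinatorial Laplacian L = D - A of a weighted adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A

def smoothness(L, s):
    """Quadratic form s^T L s of Eq. (4); smaller means smoother."""
    return float(s @ L @ s)

# Toy usage: a batch of b = 6 representations, the first three in class c.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
L = laplacian(knn_adjacency(X, k=3))
s_c = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # label signal for class c
print(smoothness(L, s_c))  # accumulates A[i, j] over pairs with different labels
```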

3.3 Proposed regularizer

3.3.1 Definition

We propose to measure the deformation induced by a given layer $\ell$ in the relative positions of examples by computing the difference between label signal smoothness before and after the layer, summed over all labels:

$$\delta_{\sigma}^{\ell}=\left|\sum_{c}\left[\sigma^{\ell}_{post}(\mathbf{s}_{c})-\sigma^{\ell}_{pre}(\mathbf{s}_{c})\right]\right|. \qquad (5)$$

These quantities are used to regularize modifications made to each of the layers during the learning process.
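Eq. (5) translates directly into a differentiable penalty. The PyTorch sketch below assumes the preset and postset Laplacians have been built from the current batch (e.g., as in the NumPy sketch above, with the discrete top-$k$ edge selection treated as fixed so that gradients flow only through the similarity values); it is an illustration under those assumptions, not the authors' implementation.

```python
import torch

def smoothness_gap(L_pre, L_post, labels, num_classes):
    """delta_sigma^ell of Eq. (5). L_pre, L_post: (b, b) Laplacians of the
    preset and postset graphs of one layer; labels: (b,) integer classes."""
    gap = 0.0
    for c in range(num_classes):
        s = (labels == c).float()                   # binary label signal s_c
        gap = gap + s @ L_post @ s - s @ L_pre @ s  # smoothness difference
    return gap.abs()                                # absolute value of the sum
```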

Remark 1: Since we only consider label signals, we depend solely on the similarities between examples that belong to distinct classes. In other words, the regularizer only focuses on the boundary region, and does not vary if the distance between examples with the same label grows or shrinks. This is deliberate: forcing similarities between examples of the same class to evolve slowly could prevent the network from training appropriately.

Remark 2: Compared with (Cisse et al., 2017), there are three key differences that characterize the proposed regularizer:

  1. Not all pairwise distances are taken into account in the regularization; only distances between examples belonging to different classes play a role.

  2. We allow a limited amount of both contraction and dilation of the metric space. Experimental work (e.g., (Gripon et al., 2018; Papernot and McDaniel, 2018)) has shown that the evolution of metric spaces across DNN layers is complex, and restricting ourselves to contractions only could lower overall performance.

  3. The proposed criterion is an average (sum) over all distances, rather than a stricter criterion (e.g., Lipschitz) that would force each pair of vectors $(\mathbf{x}_{i},\mathbf{x}_{j})$ to obey the constraint.

Illustrative example:

In Figure 1 we depict a toy example to motivate the proposed regularizer. We consider a one-dimensional, two-class problem. To linearly separate circles and crosses, it is necessary to group all circles together. Without regularization (setting i), the resulting embedding is likely to considerably increase the distance between examples and the size of the boundary region between classes. In contrast, by penalizing large variations in the smoothness of label signals (setting ii), the average distance between circles and crosses must be preserved in the embedding domain, resulting in more precise control of distances within the boundary region.

[Figure 1: initial problem (class domain boundary); i) no regularization; ii) proposed regularization]
Figure 1: Illustration of the effect of our proposed regularizer. In this example, the goal is to classify circles and crosses (top). Without regularization (bottom left), the resulting embedding may considerably stretch the boundary regions (as illustrated by the irregular spacing between the ticks). Forcing small variations of the smoothness of label signals (bottom right), we ensure the topology is not dramatically changed in the boundary regions.

3.4 Motivation: label signal bandwidth and powers of the Laplacian

Recent work (Anis et al., 2017) develops an asymptotic analysis of the bandwidth of label signals, $BW(\mathbf{s})$, where bandwidth is defined as the highest nonzero graph frequency of $\mathbf{s}$, i.e., the index of the last nonzero entry of $\hat{\mathbf{s}}$. An estimate of the bandwidth can be obtained by computing:

$$BW_{m}(\mathbf{s})=\left(\frac{\mathbf{s}^{\top}\mathbf{L}^{m}\mathbf{s}}{\mathbf{s}^{\top}\mathbf{s}}\right)^{1/m} \qquad (6)$$

for large $m$. This can be viewed as a generalization of the smoothness metric of Eq. (4). (Anis et al., 2017) shows that, as the number of labeled points $\mathbf{x}$ (assumed drawn from a distribution $p(\mathbf{x})$) grows asymptotically, the bandwidth of the label signal converges in probability to the supremum of $p(\mathbf{x})$ over the region of overlap between classes. This motivates our work in three ways.

First, it provides theoretical justification for using $\sigma^{\ell}(\mathbf{s})$ for regularization, since lower values of $\sigma^{\ell}(\mathbf{s})$ are indicative of better separation between classes. Second, the asymptotic analysis suggests that using higher powers of the Laplacian would lead to better regularization, since the bandwidth estimate $BW_{m}(\mathbf{s})$ becomes increasingly accurate as $m$ increases. Finally, this regularization can be seen as protective against overfitting, by preventing $\sigma^{\ell}(\mathbf{s})$ from decreasing "too fast". For most problems of interest, given a sufficiently large amount of labeled data, it is reasonable to expect the bandwidth of $\mathbf{s}$ not to be arbitrarily small, because the classes cannot be exactly separated; a network that reduces the bandwidth too much therefore risks overfitting.
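For reference, Eq. (6) can be evaluated without forming $\mathbf{L}^{m}$ explicitly, via repeated matrix-vector products; a minimal sketch:

```python
import numpy as np

def bandwidth_estimate(L, s, m):
    """BW_m(s) of Eq. (6), computed as (s^T L^m s / s^T s)^(1/m)."""
    v = s.copy()
    for _ in range(m):
        v = L @ v                 # after the loop, v = L^m s
    return (s @ v / (s @ s)) ** (1.0 / m)
```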

3.5 Analysis of the Laplacian powers

In Figure 2 we depict the Laplacian and squared Laplacian of similarity graphs obtained at different layers of a trained vanilla architecture. In the deep layers, we can clearly see blocks corresponding to the classes, while the situation in the middle layer is less clear. This figure illustrates how using the squared Laplacian helps modify the distances to improve separation. Note that we normalize the squared Laplacian values by dividing them by the largest absolute value.

[Figure 2: $\mathbf{L}$ and $\mathbf{L}^{2}$ of similarity graphs for a middle layer and a deep layer]
Figure 2: Samples of the Laplacian and squared Laplacian of similarity graphs in a trained vanilla architecture. Examples in the batch are ordered so that those belonging to the same class are consecutive. Dark values correspond to high similarity.

In Figure 3, we plot the average evolution of the smoothness of label signals over 100 batches, as a function of layer depth in the architecture, and for different choices of the regularizer. In the left part, we measure smoothness using the Laplacian; in the right part, using the squared Laplacian. We can clearly see the effectiveness of the regularizer in enforcing small variations of smoothness across the architecture. Note that for the model regularized with $\mathbf{L}^{2}$, changes in smoothness measured by $\mathbf{L}$ are hard to see. This suggests that some of the gains achieved via $\mathbf{L}^{2}$ regularization come from changes that would be "invisible" when looking at the layers from the perspective of $\mathbf{L}$ smoothness. The same normalization as in Figure 2 is used for $\mathbf{L}^{2}$.

[Figure 3: two panels of smoothness versus layer depth, measured with $\mathbf{L}$ and with $\mathbf{L}^{2}$; curves for the vanilla network and for networks regularized with $\mathbf{L}$ and with $\mathbf{L}^{2}$]
Figure 3: Evolution of the smoothness of label signals as a function of layer depth, for various regularizers and choices of $m$, the power of the Laplacian matrix.

4 Experiments

In the following paragraphs we evaluate the proposed method using various tests. We use the well-known CIFAR-10 dataset of tiny images (Krizhevsky and Hinton, 2009). For the DNN, we use the same 18-layer PreActResNet architecture (He et al., 2016) in all tests. All inputs, including those of the test set, are normalized using the mean and standard deviation of the training set images. Details on the implementation of Parseval training and on the hyperparameters can be found in the Appendix.

We depict the obtained results using box plots, where data is aggregated from 10 networks trained with different random seeds and batch orders. In the first experiment (leftmost plot of Figure 4), we plot the baseline accuracy of the models on the clean test set (no deformation is added at this point). These experiments agree with the claim of (Cisse et al., 2017), where the authors show that their method increases the performance of the network on the clean test set. We observe that our proposed method leads to a minor decrease in performance on this test; however, the following experiments show that this is mitigated by increased robustness to deformations.

4.1 Isotropic deformation

In this scenario we evaluate the robustness of the network function to small isotropic variations of the input. We generate 40 different deformations by adding $\mathcal{N}(0,0.25)$ random noise to the test set inputs, scaled so that $SNR\approx 15$ and $SNR\approx 20$. The middle and rightmost plots of Figure 4 show that the proposed method increases the robustness of the network to isotropic deformations. Note that in both scenarios the best results are achieved by combining Parseval training and our proposed method (lowermost box in both plots).
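One plausible way to implement this scaling is sketched below; the exact procedure is not spelled out in the text, and treating the quoted SNR values as decibels is our assumption.

```python
import torch

def add_noise_at_snr(x, snr_db):
    """Add N(0, 0.25) noise to x, rescaled so that the signal-to-noise
    ratio over the batch matches snr_db (interpreted in dB)."""
    noise = torch.randn_like(x) * 0.5                 # std 0.5 -> variance 0.25
    target_ratio = 10.0 ** (snr_db / 10.0)            # linear SNR
    scale = (x.pow(2).mean() / (noise.pow(2).mean() * target_ratio)).sqrt()
    return x + scale * noise
```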

[Figure 4: three box plots of test set accuracy (Vanilla, Parseval, Proposed Method, Proposed + Parseval) for $SNR\approx\infty$, $SNR\approx 20$, and $SNR\approx 15$]
Figure 4: Test set accuracy under Gaussian Noise with varying signal-to-noise ratio.

4.2 Adversarial Robustness

We next evaluate robustness to adversarial inputs, which are specifically built to fool the network function. Such adversarial inputs can be generated and evaluated in multiple ways. Here we implement two approaches: first, an average case of adversarial noise, where the adversary can use only one forward and one backward pass to generate the deformation; and second, a worst-case scenario, where the adversary can use multiple forward and backward passes to find the smallest deformation that fools the network.

For the first approach, we add the scaled gradient sign (FGSM attack) to the input (Kurakin et al., 2016), so as to reach a target $SNR$. Results are depicted in the left and center plots of Figure 5. In the left plot the noise is added after normalizing the input, whereas in the middle plot it is added before normalizing. As in the isotropic noise case, the combination of the Parseval method and our proposed approach achieves maximum robustness.

[Figure 5: test set accuracy under FGSM after normalization and FGSM before normalization, and mean $\mathcal{L}_{2}$ pixel distance under DeepFool, for the four methods]
Figure 5: Robustness against an adversary, measured by the test set accuracy under FGSM attack (left and center plots) and by the mean $\mathcal{L}_{2}$ pixel distance needed to fool the network using DeepFool (right plot).

For the second approach, the worst-case scenario, we use the Foolbox toolbox (Rauber et al., 2017) implementation of DeepFool (Moosavi Dezfooli et al., 2016). The conclusions (right plot of Figure 5) are similar to those obtained for the first adversarial attack approach.

4.3 Implementation robustness

Finally, in a third series of experiments we evaluate the robustness of the network functions to faulty implementations, where approximate computations are made during the test phase. These consist of random erasures of memory values (dropout) or quantization of the weights (Hubara et al., 2017).

In the dropout case, we compute the test set accuracy when the network has a probability of either 25% or 40% of dropping a neuron's value after each block. We run each experiment 40 times. The results are depicted in the left and center plots of Figure 6. Interestingly, the Parseval-trained functions drop in performance as soon as we reach a 40% probability of dropout, yielding an average accuracy lower than the vanilla networks. In contrast, the proposed method is the most robust to these perturbations.

[Figure 6: test set accuracy under 25% dropout, 40% dropout, and 5-bit quantization, for the four methods]
Figure 6: Test set accuracy under different types of implementation related noise.

For the quantization of the weights, we consider a scenario where the network size in memory must be shrunk by a factor of six. We therefore quantize the weights of the networks to 5 bits (instead of 32) and re-evaluate the test set accuracy. The right plot of Figure 6 shows that the proposed method provides better robustness to this kind of deformation than the tested counterparts.
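As an illustration of this test protocol (the exact quantizer is not specified here), a symmetric uniform per-tensor quantization of the weights to $n$ bits could be implemented as follows:

```python
import torch

@torch.no_grad()
def quantize_weights(model, n_bits=5):
    """Round each weight tensor to a symmetric uniform n-bit grid."""
    levels = 2 ** (n_bits - 1) - 1          # e.g. 15 positive levels for 5 bits
    for p in model.parameters():
        step = p.abs().max() / levels       # per-tensor step size
        if step > 0:
            p.copy_(torch.round(p / step) * step)
```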

5 Conclusion

In this paper we have introduced a new regularizer that enforces small variations of the smoothness of label signals on similarity graphs obtained at intermediate layers of a deep neural network architecture. Our experiments show that it can lead to improved robustness in various conditions compared to existing counterparts. We also demonstrated that combining the proposed regularizer with existing methods can result in even better robustness under some conditions.

Future work includes a more systematic study of the effectiveness of the method on other datasets, models and deformations. For the first two points, we expect the method to transfer, given that the authors of (Moosavi-Dezfooli et al., 2017; Papernot et al., 2016b) argue that adversarial noise is transferable between models and datasets.

One possible extension of the proposed method is to use it in a fine-tuning stage, combined with techniques already established in the literature. An extension using a combination of input barycenter and class barycenter signals instead of the class signal could also be interesting, as that would be comparable to (Zhang et al., 2017). In the same vein, using random signals could be beneficial for semi-supervised or unsupervised learning challenges.

References

  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Mallat (2016) Stéphane Mallat. Understanding deep convolutional networks. Phil. Trans. R. Soc. A, 374(2065):20150203, 2016.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Hubara et al. (2017) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2017.
  • Cisse et al. (2017) Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863, 2017.
  • Papernot and McDaniel (2018) Nicolas Papernot and Patrick D. McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. CoRR, abs/1803.04765, 2018. URL http://arxiv.org/abs/1803.04765.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
  • Pezeshki et al. (2016) Mohammad Pezeshki, Linxi Fan, Philemon Brakel, Aaron Courville, and Yoshua Bengio. Deconstructing the ladder network architecture. In International Conference on Machine Learning, pages 2368–2376, 2016.
  • Shuman et al. (2013) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
  • Recht et al. (2018) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar-10 classifiers generalize to cifar-10? arXiv preprint arXiv:1806.00451, 2018.
  • Papernot et al. (2016a) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016a.
  • Gu and Rigazio (2014) Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
  • Moosavi Dezfooli et al. (2016) Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Gripon et al. (2018) Vincent Gripon, Antonio Ortega, and Benjamin Girault. An inside look at deep neural networks using graph signal processing. In Proceedings of ITA, February 2018.
  • Anirudh et al. (2017) Rushil Anirudh, Jayaraman J Thiagarajan, Rahul Sridhar, and Timo Bremer. Influential sample selection: A graph signal processing approach. arXiv preprint arXiv:1711.05407, 2017.
  • Svoboda et al. (2018) Jan Svoboda, Jonathan Masci, Federico Monti, Michael M Bronstein, and Leonidas Guibas. Peernets: Exploiting peer wisdom against adversarial attacks. arXiv preprint arXiv:1806.00088, 2018.
  • Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • Anis et al. (2017) Aamir Anis, Aly El Gamal, Salman Avestimehr, and Antonio Ortega. A sampling theory perspective of graph-based semi-supervised learning. arXiv preprint arXiv:1705.09518, 2017.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.
  • Rauber et al. (2017) Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017. URL http://arxiv.org/abs/1707.04131.
  • Moosavi-Dezfooli et al. (2017) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. arXiv preprint, 2017.
  • Papernot et al. (2016b) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016b.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Kovačević and Chebira (2008) Jelena Kovačević and Amina Chebira. An introduction to frames. Foundations and Trends in Signal Processing, 2(1):1–94, 2008.

Appendix A Parseval Training and implementation

We compare our results with those obtained using the method described in (Cisse et al., 2017). There are three modifications to the normal training procedure: an orthogonality constraint, convolutional renormalization, and a convexity constraint.

For the orthogonality constraint we enforce Parseval tightness (Kovačević and Chebira, 2008) through a layer-wise regularizer:

$$R_{\beta}(W^{\ell})=\frac{\beta}{2}\|W^{\ell\top}W^{\ell}-I\|^{2}_{2}, \qquad (7)$$

where $W^{\ell}$ is the weight tensor at layer $\ell$. This function can be approximately optimized by gradient descent using the update:

$$W^{\ell}\leftarrow(1+\beta)W^{\ell}-\beta W^{\ell}W^{\ell\top}W^{\ell}. \qquad (8)$$

Given that our network is small, we can apply this optimization to the entirety of $W^{\ell}$, instead of to 30% of it as in the original paper; this increases the strength of the Parseval tightness.
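A sketch of this update in PyTorch, applied to one weight tensor after each gradient step (convolution kernels flattened to 2-D first); the function name is illustrative:

```python
import torch

@torch.no_grad()
def parseval_step(weight, beta=0.01):
    """One retraction step of Eq. (8): W <- (1 + beta) W - beta W W^T W."""
    W = weight.view(weight.shape[0], -1)    # flatten conv kernels to 2-D
    W.copy_((1 + beta) * W - beta * W @ W.t() @ W)
```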

For the convolutional renormalization, each matrix $W^{\ell}$ is reparametrized as $W^{\ell}/\sqrt{2k_{s}+1}$ before being applied in the convolution, where $k_{s}$ is the kernel size.

In our architecture, the inputs of a layer come from either one or two other layers. When the inputs come from only one layer, the convexity constraint parameter $\alpha$ is set to 1. When the inputs come from the sum of two layers, we use $\alpha=0.5$ for both of them, which constrains our Lipschitz constant; this is softer than the convexity constraint of the original paper.

Appendix B Hyperparameters

We train our networks using classical stochastic gradient descent with momentum ($0.9$), a batch size of $b=100$ images, and L2-norm weight decay with coefficient $\lambda=0.0005$. We train for 100 epochs. The learning rate starts at $0.1$ and decreases to $0.001$ after half of the training ($50$ epochs).

We add to our loss function the mean of the smoothness differences between successive layers, so that the loss becomes:

$$\mathcal{L}=\text{CategoricalCrossEntropy}+\lambda\,\text{WeightDecay}+\gamma\Delta, \qquad (9)$$

where $\Delta=\frac{1}{d-1}\sum_{\ell=1}^{d}|\delta_{\sigma}^{\ell}|$ and $d$ is the depth of the network. We perform experiments with various powers of the Laplacian, $m=1,2,3$, in which case the scaling coefficient $\gamma$ is raised to the same power as the Laplacian.
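As a sketch of how the per-layer gaps of Eq. (5) combine into $\Delta$ (reusing the hypothetical `smoothness_gap` from Section 3.3.1; the weight decay term is typically handled by the optimizer):

```python
import torch

def total_gap(gaps):
    """Delta of Eq. (9): average of the per-layer |delta_sigma^ell| values
    collected at each ReLU during the forward pass (up to the exact
    normalization constant)."""
    return torch.stack(gaps).mean()

# loss = cross_entropy + gamma * total_gap(gaps)   # gamma = 0.01, m = 2 here
```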

We tested multiple values of $\beta$ (the Parseval tightness parameter), $\gamma$ (the weight of the smoothness difference cost), and $m$ (the power of the Laplacian). We found that the best values for this specific architecture, dataset, and training scheme were $\beta=0.01$, $\gamma=0.01$, $m=2$, and $k=b$.

Appendix C Depiction of the network

Figure 7 depicts the network used in all experiments of this paper, where $f=64$ is the number of filters in the first layer. Conv layers use 3×3 kernels and are always preceded by batch normalization and ReLU (except for the first layer, which receives the raw input). The smoothness gaps are computed at each ReLU.

Figure 7: Depiction of the studied network
Conv layer, $f$ → Conv layer, $f$ → Conv layer, $f$ → Conv layer, $f$ → Conv layer, $f$ → Strided Conv layer, $2f$ → Conv layer, $2f$ → Conv layer, $2f$ → Conv layer, $2f$ → Strided Conv layer, $4f$ → Conv layer, $4f$ → Conv layer, $4f$ → Conv layer, $4f$ → Strided Conv layer, $8f$ → Conv layer, $8f$ → Conv layer, $8f$ → Conv layer, $8f$ → Global Avg pooling, $8f$ → Linear+Softmax, $10$