
Manifold attack

Khanh-Hung Tran
Institut LIST, CEA
Université Paris-Saclay
Gif-sur-Yvette, 91191
khanh-hung.tran@cea.fr
Fred Maurice Ngole Mboula
Institut LIST, CEA
Université Paris-Saclay
Gif-sur-Yvette, 91191
fred-maurice.ngole-mboula@cea.fr
Jean-Luc Starck
Astrophysics Department, CEA
Université Paris-Saclay
Gif-sur-Yvette, 91191
jean-luc.starck@cea.fr
Abstract

Machine Learning in general and Deep Learning in particular have gained much interest in the last decade and have shown significant performance improvements for many Computer Vision and Natural Language Processing tasks. In this paper, we introduce manifold attack, a combination of manifold learning and adversarial learning, which aims at improving the regularization of neural network models. We show that applying manifold attack provides a significant improvement in robustness to adversarial examples and even slightly enhances the accuracy rate. Several implementations of our work can be found at https://github.com/ktran1/Manifold-attack.

Index terms Deep learning, Robustness, Manifold learning, Adversarial learning, Geometric structure, Semi-Supervised learning.

1 Introduction

Deep Learning (DL) [1] was first introduced by Alexey Ivakhnenko (1967); its derivative, Convolutional Neural Networks [2], was then introduced by Fukushima (1980). Over the years, it was improved and refined by Yann LeCun [3] (1998). Since then, deep neural networks have yielded groundbreaking performance on various classification tasks [4, 5, 6, 7, 8]. However, training deep neural networks involves different regularization techniques, which generally serve two goals: generalization and adversarial robustness. On the one hand, regularization for generalization aims at improving the accuracy rate on data that have not been used for training. In particular, this regularization is critical when the number of training samples is relatively small compared to the complexity of the neural network model. On the other hand, regularization for adversarial robustness aims at improving the accuracy rate with respect to adversarial samples.

In the last few years, much research in machine learning has focused on obtaining models with high adversarial robustness without losing generalization. Our work focuses on the preservation of geometric structure (PGS) between the original representation of data and a latent representation produced by a neural network model (NNM). PGS is part of manifold learning techniques: we assume that the data of interest lie on a low dimensional piecewise smooth manifold, with respect to which we want to compute embeddings of the original samples. In neural network models (NNMs), the outputs of the hidden layers are considered to be candidate low dimensional embeddings. For classification tasks, combining PGS as a regularization loss with a supervised loss was shown to improve generalization ability by Weston et al. in 2008 [9]. The concept of adversarial samples was later introduced by Goodfellow et al. [10] in 2014, as NNM architectures became deeper and hence more complex. Adversarial robustness in NNMs is closely related to PGS, since PGS for NNMs precisely aims at ensuring that the NNM gives similar outputs for close inputs. Figure 1 shows an illustration of PGS including a failure example. The black point can be considered an adversarial example for the dimension reduction model.

Figure 1: Preservation of geometric structure and a failure example. On the left, the original representation of S curve data (3 dimensions). On the right, an embedding of these data in dimension 2. The similarity between samples is indicated by different colors.

The novelty of this paper is to reinforce PGS by introducing synthetic samples as supplementary data. We make use of adversarial learning methodology to compute relevant synthetic samples. Then, these synthetic samples are added to the original real samples for training the model $f_{\theta}$, which is expected to improve adversarial robustness.

Contributions:

  • We present the concept of manifold attack to reinforce PGS task.

  • We show empirically that by applying manifold attack for a classification task, NNMs get better in both adversarial robustness and generalization.

The outline of the paper is as follows. In section 2, we present related works that aim to increase adversarial robustness of NNMs. In section 3, we revisit some classical PGS methods. In section 4, we present manifold attack. The numerical experiments are presented and discussed in section 5. Our conclusions and perspectives are presented in section 6.

2 Related works

In this section, we present several strategies to reinforce adversarial robustness, sorting them into two categories.

The first category is related to off-manifold adversarial examples, i.e. adversarial examples that lie off the data manifold. In a supervised setting, these adversarial examples are used as additional data for training the model [10]. In a semi-supervised setting, these adversarial examples are used to enhance the consistency between representations [11]. In plain words, a sample and its corresponding adversarial example must have a similar representation at the output of the model. However, some works [12, 13] state that adversarial robustness with respect to off-manifold adversarial examples is in conflict with generalization. In other words, improving adversarial robustness is detrimental to generalization and vice versa.

In the second category, the adversarial robustness is reinforced with respect to on-manifold adversarial examples [14, 15, 16]. In all these works, a generative model such as a VAE [17], GAN [18] or VAE-GAN [19] is first learned, so that one can generate a sample on the data manifold from parameters in a latent space. These works differ in the technique used to create adversarial noise in the latent space to produce on-manifold adversarial examples. Finally, these on-manifold adversarial examples are used as additional data for training the model. Interestingly, [15] states that adversarial robustness and generalization can be jointly enhanced by using on-manifold adversarial examples. We note that a hybrid approach has been proposed in [20], where both off-manifold and on-manifold adversarial examples are used as supplementary data for training the model.

Our method falls into the second category since we want to enhance both the adversarial robustness and the generalization in both supervised and semi-supervised settings. However the adversarial examples are generated using a different paradigm.

3 Preservation of geometric structure

In most cases, a PGS process has two stages: extracting properties of the original representation, then computing an embedding that preserves these properties in a low dimensional space. Given a data set $\mathcal{X}=\{\mathbf{x}_{1},...,\mathbf{x}_{N}\}$, $\mathbf{x}_{i}\in\mathbb{R}^{n}$, and its embedding set $\mathcal{A}=\{\mathbf{a}_{1},...,\mathbf{a}_{N}\}$, $\mathbf{a}_{i}\in\mathbb{R}^{p}$, where $\mathbf{a}_{i}$ is the embedded representation of $\mathbf{x}_{i}$, we denote by

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c}) \tag{1}$$

the embedding loss, where $\mathcal{A}_{i}^{c}$ is the complement of $\{\mathbf{a}_{i}\}$ in $\mathcal{A}$. The objective function of PGS is then defined as:

$$\min_{\mathcal{A}}\mathcal{L}_{t}=\min_{\mathcal{A}}\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c}) \tag{2}$$

We present some popular embedding losses hereafter:

- Multi-Dimensional scaling (MDS) [21]:

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}\big(d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})-d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})\big)^{2}, \tag{3}$$

where $d_{a}()$ and $d_{x}()$ are measures of dissimilarity. By default, they are both Euclidean distances. This method aims at preserving pairwise distances from the original representation in the embedding space.
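As an illustration of eq. (3), the following minimal NumPy sketch evaluates the total MDS loss for a given embedding. It assumes Euclidean distances on both sides and stores one sample per row; the function names `pairwise_dist` and `mds_total_loss` are hypothetical and not taken from the authors' repository.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mds_total_loss(X, A):
    """Sum over i of the MDS embedding loss of eq. (3).

    X: (N, n) original samples, A: (N, p) embeddings, one row per sample.
    """
    Dx, Da = pairwise_dist(X), pairwise_dist(A)
    mask = ~np.eye(len(X), dtype=bool)      # exclude the j == i terms
    return ((Da - Dx)[mask] ** 2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# An orthogonal transform of X preserves all pairwise distances.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
loss_rigid = mds_total_loss(X, X @ Q)

# Dropping one coordinate generally distorts the distances.
loss_proj = mds_total_loss(X, X[:, :2])
```

The first loss is zero up to rounding, while the second is strictly positive, which is the behaviour the MDS objective is meant to penalize.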

- Laplacian eigenmaps (LE) [22]:

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})\,d_{a}(\mathbf{a}_{i},\mathbf{a}_{j}), \tag{4}$$

where $d_{x}()$ is a similarity measure (for instance $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\big(\frac{-\left\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\right\rVert_{2}^{2}}{2\sigma_{i}^{2}}\big)$) and $d_{a}()$ is a dissimilarity measure (for instance $d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})=\left\lVert\mathbf{a}_{i}-\mathbf{a}_{j}\right\rVert_{2}^{2}$). This method learns the manifold structure by emphasizing the preservation of local distances. In order to further reduce the effect of large distances, some papers set $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})$ directly to zero if $\mathbf{x}_{j}$ is not among the $k$ nearest neighbors of $\mathbf{x}_{i}$, or vice versa if $\mathbf{x}_{i}$ is not among the $k$ nearest neighbors of $\mathbf{x}_{j}$. Alternatively, one can set $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})=0$ if $\left\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\right\rVert_{2}^{2}>\kappa$. Let $\mathbf{W}$ be the matrix defined by $\mathbf{W}_{ij}=d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})$; $\mathbf{W}$ is symmetric if $d_{x}()$ is symmetric. We can then write the objective function of the Laplacian eigenmaps method in matrix form:

$$\mathcal{L}_{t}=\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{i=1}^{N}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\mathbf{W}_{ij}\left\lVert\mathbf{a}_{i}-\mathbf{a}_{j}\right\rVert_{2}^{2}=2\operatorname{tr}(\mathbf{A}\mathbf{L}\mathbf{A}^{\top}), \tag{5}$$

where $\mathbf{L}=\mathbf{D}-\mathbf{W}$ and $\mathbf{D}$ is the diagonal matrix with $\mathbf{D}_{ii}=\sum_{j=1}^{N}\mathbf{W}_{ij}$. $\mathbf{L}$ is a graph Laplacian matrix: it is symmetric, each of its rows sums to 0, and its off-diagonal elements are non-positive.
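The matrix form in eq. (5) can be checked numerically. The sketch below is only an illustration: it assumes a Gaussian similarity with a single shared bandwidth `sigma` (so that $\mathbf{W}$ is symmetric) and stores the embeddings as the columns of `A`, as in the trace expression.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p, sigma = 8, 3, 2, 1.0
X = rng.normal(size=(N, n))      # original samples, one per row
A = rng.normal(size=(p, N))      # embeddings, one per column as in eq. (5)

# Gaussian similarity W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sq_x = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_x / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)         # the j == i term does not enter the loss

D = np.diag(W.sum(axis=1))
L = D - W                        # graph Laplacian

# Left-hand side: explicit double sum over pairs.
sq_a = ((A.T[:, None, :] - A.T[None, :, :]) ** 2).sum(-1)
lhs = (W * sq_a).sum()

# Right-hand side: trace form.
rhs = 2 * np.trace(A @ L @ A.T)
```

Both quantities agree up to rounding, and every row of `L` sums to zero.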

- Locally Linear Embedding (LLE) [23]:

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\left\lVert\mathbf{a}_{i}-\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}\lambda_{ij}\mathbf{a}_{j}\right\rVert_{2}^{2}, \tag{6}$$

where $\lambda_{ij}$ are determined by solving the following problem:

$$\begin{split}&\min_{\lambda_{ij}}\left\lVert\mathbf{x}_{i}-\sum\limits_{j}\lambda_{ij}\mathbf{x}_{j}\right\rVert_{2}^{2},\\ \text{subject to: }&\begin{cases}\sum\limits_{j}\lambda_{ij}=1,\text{ if }\mathbf{x}_{j}\in\text{knn}(\mathbf{x}_{i}),\\ \lambda_{ij}=0\text{ if not,}\end{cases}\end{split} \tag{7}$$

where knn($\mathbf{x}_{i}$) denotes the set of the $k$ nearest neighbors (in Euclidean distance) of the sample $\mathbf{x}_{i}$. Assuming that the observed data $\mathcal{X}$ is sampled from a smooth manifold and provided that the sampling is dense enough, one can assume that the samples lie locally on linear patches. Thus, LLE first computes the barycentric coordinates of each sample w.r.t. its nearest neighbors. These barycentric coordinates characterize the local geometry of the underlying manifold. Then, LLE computes a low dimensional (embedded) representation which is compatible with these local barycentric coordinates. Introducing $\mathbf{V}\in\mathbb{R}^{N\times N}$, a matrix representation of $\lambda$ with $\mathbf{V}[j,i]=\lambda_{ij}$, the loss $\mathcal{L}_{t}=\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})$ can be rewritten as $\mathcal{L}_{t}=\left\lVert\mathbf{A}-\mathbf{A}\mathbf{V}\right\rVert_{F}^{2}=\operatorname{tr}(\mathbf{A}\mathbf{L}\mathbf{A}^{\top})$, where $\mathbf{L}=(\mathbf{I}_{N}-\mathbf{V})(\mathbf{I}_{N}-\mathbf{V})^{\top}=\mathbf{I}_{N}-\mathbf{V}-\mathbf{V}^{\top}+\mathbf{V}\mathbf{V}^{\top}$. Thus, the loss $\mathcal{L}_{t}$ can be interpreted as a Laplacian eigenmaps loss, based on an implicit metric $d_{x}$ for measuring the distance between two samples.
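The two LLE stages can be sketched as follows. This is a hedged illustration, not the authors' implementation: the helper `lle_weights` is hypothetical, the small regularization of the local Gram matrix is an added assumption (commonly needed when $k>n$), and the check uses $\mathbf{L}=(\mathbf{I}_{N}-\mathbf{V})(\mathbf{I}_{N}-\mathbf{V})^{\top}$, the matrix for which the Frobenius-norm form and the trace form coincide.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Barycentric weights of eq. (7). Returns V with V[j, i] = lambda_ij."""
    N = X.shape[0]
    V = np.zeros((N, N))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(N):
        nb = np.argsort(sq[i])[1:k + 1]        # k nearest neighbours, skipping i itself
        Z = X[nb] - X[i]                       # neighbours centred on x_i
        G = Z @ Z.T                            # local Gram matrix, shape (k, k)
        G += reg * np.trace(G) * np.eye(k)     # regularisation (assumption, see lead-in)
        w = np.linalg.solve(G, np.ones(k))
        V[nb, i] = w / w.sum()                 # enforce sum_j lambda_ij = 1
    return V

rng = np.random.default_rng(0)
N, n, p, k = 12, 3, 2, 4
X = rng.normal(size=(N, n))                    # samples, one per row
A = rng.normal(size=(p, N))                    # embeddings, one per column
V = lle_weights(X, k)

I = np.eye(N)
L = (I - V) @ (I - V).T
frob_form = np.linalg.norm(A - A @ V, "fro") ** 2
trace_form = np.trace(A @ L @ A.T)
```

Each column of `V` sums to one, and `frob_form` matches `trace_form` up to rounding.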

- Contrastive loss:

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}\Big(d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})\,d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})+\big(1-d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})\big)\max\big(0,\tau-d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})\big)\Big),$$

where $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})$ is a discrete similarity metric which is equal to 1 if $\mathbf{x}_{j}$ is in the neighborhood of $\mathbf{x}_{i}$ and 0 otherwise, and $d_{a}()$ is a measure of dissimilarity.

- Stochastic Neighbor Embedding (SNE) [24]:

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}P_{ij}\log\frac{P_{ij}}{Q_{ij}},$$

where $P_{ij}=\frac{d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})}{\sum_{k\neq i}d_{x}(\mathbf{x}_{i},\mathbf{x}_{k})}$ and $Q_{ij}=\frac{d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})}{\sum_{k\neq i}d_{a}(\mathbf{a}_{i},\mathbf{a}_{k})}$; $d_{x}$ and $d_{a}$ are both similarity measures. The objective of this method is to keep the distribution of pairwise similarities in the embedded representation close to the one in the original representation, as measured by the Kullback–Leibler (KL) divergence.
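A minimal sketch of the SNE loss is given below. It assumes Gaussian similarities with one shared bandwidth `sigma` for both $d_{x}$ and $d_{a}$, which is only one possible choice; the function name `sne_loss` is hypothetical.

```python
import numpy as np

def sne_loss(X, A, sigma=1.0):
    """Sum over i of KL(P_i || Q_i); X is (N, n), A is (N, p), one row per sample."""
    def neighbour_probs(Z, s):
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        S = np.exp(-sq / (2 * s ** 2))     # Gaussian similarity (assumed choice)
        np.fill_diagonal(S, 0.0)           # exclude k == i from the normalisation
        return S / S.sum(axis=1, keepdims=True)

    P = neighbour_probs(X, sigma)
    Q = neighbour_probs(A, sigma)
    mask = ~np.eye(len(X), dtype=bool)     # only j != i terms enter the loss
    return (P[mask] * np.log(P[mask] / Q[mask])).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
kl_same = sne_loss(X, X)            # identical representations
kl_proj = sne_loss(X, X[:, :2])     # embedding that drops one coordinate
```

The loss vanishes when the two similarity distributions coincide and is positive otherwise.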

Traditionally, PGS finds its applications in dimensionality reduction or data visualization, which refers to the techniques used to help the analyst see the underlying structure of data and explore it. For instance, [25] proposes a variant of SNE, which has been used in a wide range of fields. Classical nonlinear PGS methods do not require a mapping model, that is, a function $g()$ with trainable parameters that maps a sample $\mathbf{x}$ to its embedded representation $\mathbf{a}$ as $\mathbf{a}=g(\mathbf{x})$. Instead, $\mathbf{a}$ is directly used as the optimization variable.

Finally, it is worth noting that for several PGS methods such as LE and LLE, some supplementary constraints are required to avoid a trivial solution (for instance, all embedded points collapsing into a single point). Usually, mean and covariance constraints are applied:

$$\begin{gathered}\mathrm{m}(\mathbf{A})=[\mu(\mathbf{A}[1,:]),..,\mu(\mathbf{A}[p,:])]^{\top}=\mathbf{0}\\ \text{Cov}(\mathbf{A},\mathbf{A})=\big(\mathbf{A}-\mathrm{M}(\mathbf{A})\big)\big(\mathbf{A}-\mathrm{M}(\mathbf{A})\big)^{\top}=\mathbf{I}_{p}\end{gathered} \tag{8}$$

4 Manifold attack

We introduce a new strategy called “manifold attack” to reinforce PGS methods that use a mapping model, by combining them with adversarial learning. In the following, $g()$ denotes a differentiable function that maps a sample $\mathbf{x}\in\mathbb{R}^{n}$ to its embedded representation $\mathbf{a}\in\mathbb{R}^{p}$.

4.1 Individual attack point versus data points

We define a virtual point as a synthetic sample generated in such a way that it is likely to lie on the manifold underlying the observed samples. An anchor point is a sample used for generating a virtual point. An attack point is a virtual point that locally maximizes a chosen measure of model distortion. For example, an attack point can be a sample perturbed with adversarial noise.

We use the same notation as in section 3. Given a dataset $\mathcal{X}=\{\mathbf{x}_{1},...,\mathbf{x}_{N}\}$ and the corresponding embedded set $\mathcal{A}=\{\mathbf{a}_{1},...,\mathbf{a}_{N}\}$, $\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})$ denotes the embedding loss, $\mathcal{A}_{i}^{c}$ being the complement of $\{\mathbf{a}_{i}\}$ in $\mathcal{A}$. We consider the objective function of PGS defined as $\mathcal{L}_{t}=\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})$. Consider $p$ anchor points $\mathbf{z}_{1},\mathbf{z}_{2},...,\mathbf{z}_{p}\in\mathbb{R}^{n}$; a virtual point $\tilde{\mathbf{x}}\in\mathbb{R}^{n}$ is defined as:

$$\begin{split}&\tilde{\mathbf{x}}=\gamma_{1}\mathbf{z}_{1}+\gamma_{2}\mathbf{z}_{2}+...+\gamma_{p}\mathbf{z}_{p},\\ \text{ subject to: }&\gamma_{1},\gamma_{2},..,\gamma_{p}\geq 0,\\ &\gamma_{1}+\gamma_{2}+...+\gamma_{p}=1.\end{split} \tag{9}$$

The anchor points $\mathbf{z}_{i}$ define a region, or feasible zone, in which a virtual point $\tilde{\mathbf{x}}$ must be located, and $\gamma=[\gamma_{1},\gamma_{2},..,\gamma_{p}]$ is the coordinate vector of $\tilde{\mathbf{x}}$. Anchor points are sampled from the dataset $\mathcal{X}$ with different strategies, which are defined according to predefined rules (see section 4.4 for several examples). Figure 2 shows an example of anchor points setting and relations between points. For the embedding $\mathbf{a}_{i}$ of the observed sample $\mathbf{x}_{i}$, $\mathbf{a}_{i}=g(\mathbf{x}_{i})$, the embedding loss is defined as:

$$\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\mathcal{L}_{e}\big(g(\mathbf{x}_{i}),\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}\backslash\{g(\mathbf{x}_{i})\}\big). \tag{10}$$

Similarly, for the embedding $\tilde{\mathbf{a}}$ of the virtual point $\tilde{\mathbf{x}}$, $\tilde{\mathbf{a}}=g(\tilde{\mathbf{x}})$, the embedding loss is defined as:

$$\mathcal{L}_{e}(\tilde{\mathbf{a}},\mathcal{A})=\mathcal{L}_{e}\big(g(\tilde{\mathbf{x}}),\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}\big). \tag{11}$$

Algorithm 1 describes the computation of an attack point. It consists in finding the local coordinates $\gamma$ that maximize the embedding loss $\mathcal{L}_{e}(\tilde{\mathbf{a}},\mathcal{A})$ for the current state of the model $g()$. Hence, $\gamma$ is estimated through projected gradient ascent.

Figure 2: Virtual point and anchor points illustration. The three anchor points ($\mathbf{z}_{i}$, $i\in\{1,2,3\}$) are computed as $\mathbf{z}_{i}=\mu([\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}])+s\big(\mathbf{x}_{i}-\mu([\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}])\big)$, where $\mu([\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}])=\frac{\mathbf{x}_{1}+\mathbf{x}_{2}+\mathbf{x}_{3}}{3}$. The dotted lines represent the zone defined by the anchor points $\mathbf{z}_{i}$, within which the virtual point $\tilde{\mathbf{x}}$ necessarily lies. The parameter $s$ determines whether the anchor points lie strictly within the convex hull of $[\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}]$ ($0<s<1$) or strictly outside it ($s>1$). The dashed lines represent the coordinates $\gamma$ of $\tilde{\mathbf{x}}$. The solid lines illustrate the relatedness of $\tilde{\mathbf{x}}$ to the data points that the embedding will try to preserve by minimizing $\mathcal{L}_{e}(g(\tilde{\mathbf{x}}),\{g(\mathbf{x}_{1}),g(\mathbf{x}_{2}),g(\mathbf{x}_{3})\})$ in this case.
Algorithm 1 Individual manifold attack
  Require: Anchor points $\{\mathbf{z}_{1},..,\mathbf{z}_{p}\}$, data points $\{\mathbf{x}_{1},..,\mathbf{x}_{N}\}$, embedding loss $\mathcal{L}_{e}()$, model $g()$, $\xi$, $n\_iters$.
  initialize: $\gamma\in\mathbb{R}^{p}$, $\gamma=[\gamma_{1},..,\gamma_{p}]$ satisfying the constraints in eq. 9
  $\tilde{\mathbf{x}}=\gamma_{1}\mathbf{z}_{1}+\gamma_{2}\mathbf{z}_{2}+...+\gamma_{p}\mathbf{z}_{p}$
  for $i=1$ to $n\_iters$ do
     $L=\mathcal{L}_{e}\big(g(\tilde{\mathbf{x}}),\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}\big)$
     $\gamma\leftarrow\gamma+\xi\nabla_{\gamma}L(\tilde{\mathbf{x}})$
     $\gamma\leftarrow\Pi_{ps}(\gamma)$
     $\tilde{\mathbf{x}}=\gamma_{1}\mathbf{z}_{1}+\gamma_{2}\mathbf{z}_{2}+...+\gamma_{p}\mathbf{z}_{p}$
  end for
  Output: $\tilde{\mathbf{x}}$
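The loop of Algorithm 1 can be sketched in NumPy as below. Several choices here are assumptions made for the sake of a self-contained example: the embedding loss is an MDS-style loss, the model `g` is a fixed toy map, the gradient with respect to $\gamma$ is approximated by central finite differences instead of automatic differentiation, and the projection uses a sort-based simplex projection rather than the alternating scheme of Algorithm 2. All function names are hypothetical.

```python
import numpy as np

def project_simplex(kappa, c=1.0):
    """Euclidean projection onto {gamma >= 0, sum(gamma) = c} (sort-based variant)."""
    u = np.sort(kappa)[::-1]
    css = np.cumsum(u) - c
    rho = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
    return np.maximum(kappa - css[rho] / (rho + 1), 0.0)

def embedding_loss(a, A_data, x, X):
    """MDS-style loss between one embedded point and all embedded data points."""
    d_a = np.linalg.norm(A_data - a, axis=1)
    d_x = np.linalg.norm(X - x, axis=1)
    return ((d_a - d_x) ** 2).sum()

def individual_attack(Z, X, g, xi=0.05, n_iters=20, eps=1e-5, seed=0):
    """Projected gradient ascent on gamma for one set of anchor points Z (p, n)."""
    rng = np.random.default_rng(seed)
    p = Z.shape[0]
    gamma = rng.uniform(size=p)
    gamma /= gamma.sum()                       # initial point satisfies eq. (9)
    A_data = g(X)                              # the model is fixed during the attack

    def loss(gam):
        x_tilde = gam @ Z
        return embedding_loss(g(x_tilde), A_data, x_tilde, X)

    for _ in range(n_iters):
        # Central finite differences stand in for automatic differentiation.
        grad = np.array([(loss(gamma + eps * e) - loss(gamma - eps * e)) / (2 * eps)
                         for e in np.eye(p)])
        gamma = project_simplex(gamma + xi * grad)   # ascent step, then projection
    return gamma @ Z, gamma

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
M = rng.normal(size=(2, 3))
g = lambda x: np.tanh(x @ M.T)                 # toy differentiable model (assumption)
Z = X[:3]                                      # three anchor points
x_tilde, gamma = individual_attack(Z, X, g)
```

Whatever the step size, the returned coordinates stay on the simplex, so the attack point remains inside the convex hull of its anchor points.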

In order to guarantee the constraints in equation (9), we use the projector $\Pi_{ps}$ defined by the problem:

$$\begin{split}&\min_{\gamma\in\mathbb{R}^{p}}\frac{1}{2}\left\lVert\kappa-\gamma\right\rVert_{2}^{2},\\ \text{ subject to: }&\gamma_{1},\gamma_{2},..,\gamma_{p}\geq 0,\\ &\gamma_{1}+\gamma_{2}+...+\gamma_{p}=c,\ (c>0).\end{split} \tag{12}$$

This constrained convex problem can be solved quickly by a simple sequential projection that alternates between the sum constraint and the positivity constraint (algorithm 2). The proof follows from the method of Lagrange multipliers; a simple proof can be found in the Appendix.

Algorithm 2 Projection for positive and sum constraint $\Pi_{ps}$
  Require: $\kappa\in\mathbb{R}^{p}$, $c=1$ (by default).
  $\delta=(c-\sum_{i=1}^{p}\kappa_{i})/p$
  $\gamma_{i}\leftarrow\kappa_{i}+\delta,\ \forall i=1,..,p$
  while $\exists i\in\{1,..,p\}:\gamma_{i}<0$ do
     $\mathcal{P}=\{i \mid \gamma_{i}>0\}$ and $\mathcal{N}=\{i \mid \gamma_{i}<0\}$
     $\gamma_{i}\leftarrow 0,\ \forall i\in\mathcal{N}$
     $\delta=(c-\sum_{i\in\mathcal{P}}\gamma_{i})/|\mathcal{P}|$
     $\gamma_{i}\leftarrow\gamma_{i}+\delta,\ \forall i\in\mathcal{P}$
  end while
  Output: $\gamma=[\gamma_{1},\gamma_{2},..,\gamma_{p}]$
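A direct NumPy transcription of Algorithm 2 could look as follows. It is a sketch: the function name `project_ps` is hypothetical, and the starting value of $\gamma$ is taken to be $\kappa$ shifted onto the hyperplane of sum $c$.

```python
import numpy as np

def project_ps(kappa, c=1.0):
    """Alternate the sum constraint and the positivity constraint (Algorithm 2)."""
    gamma = np.asarray(kappa, dtype=float).copy()
    gamma += (c - gamma.sum()) / gamma.size              # shift onto sum(gamma) = c
    while np.any(gamma < 0):
        pos = gamma > 0                                  # active (positive) coordinates
        gamma[gamma < 0] = 0.0                           # clip the negative coordinates
        gamma[pos] += (c - gamma[pos].sum()) / pos.sum() # restore the sum on the active set
    return gamma

g1 = project_ps(np.array([0.5, 0.8, -0.3]))
g2 = project_ps(np.array([2.0, 0.0, 0.0]))
g3 = project_ps(np.array([0.2, 0.3, 0.5]))               # already feasible
```

For the first input, the shift is zero, the negative coordinate is clipped, and the excess mass 0.3 is removed equally from the two remaining coordinates, giving (0.35, 0.65, 0). A coordinate that has been set to zero never re-enters the active set, so the loop runs at most $p$ times.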

By default, each set of anchor points has one attack point. However, we can generate more than one attack point for the same set of anchor points by using different initializations of $\gamma$, so as to find different local maxima. The double embedding loss, on the observed samples and on the attack points, is expected to enforce the smoothness of the model $g()$ over the underlying manifold, including in areas of low sample density. The general optimization scheme goes as follows: we alternate between attack stages and model update stages until convergence. In the attack stage, we optimize the attack points through $\gamma$ while fixing the model $g()$; in the model update stage, we optimize the model $g()$ while fixing the attack points.

4.2 Attack points as data augmentation

In algorithm 1, an attack point only interacts with observed samples. In the general manifold attack (algorithm 3), attack points and observed samples are not distinguished in the model update stage. Generating attack points can therefore be considered a data augmentation technique. We denote by $\mathcal{B}$ the set that contains all embedded points (both attack points and observed samples). $\mathcal{B}_{s}$ is a random subset of $\mathcal{B}$, used as a batch for batch optimization. In each step, only attack points from the current batch are used to distort the manifold by maximizing the batch loss $L$.

Algorithm 3 Manifold attack
  Require: Data points $\{\mathbf{x}_{1},..,\mathbf{x}_{N}\}$, embedding loss $\mathcal{L}_{e}()$, model $g()$, $\xi$, $n\_iters$, $n\_epoch$, an anchoring rule.
  initialize: $g()$
  Create $M$ sets of anchor points $\{\mathbf{z}_{1}^{k},..,\mathbf{z}_{p}^{k}\},\ \forall k=1,..,M$ by the anchoring rule
  for $epoch=1$ to $n\_epoch$ do
     Initialize $\gamma^{k}\in\mathbb{R}^{p}$ satisfying the constraints in eq. 9
     $\tilde{\mathbf{x}}^{k}=\gamma_{1}^{k}\mathbf{z}_{1}^{k}+\gamma_{2}^{k}\mathbf{z}_{2}^{k}+...+\gamma_{p}^{k}\mathbf{z}_{p}^{k},\ \forall k=1,..,M$
     Set $\mathcal{B}=\{g(\tilde{\mathbf{x}}^{1}),..,g(\tilde{\mathbf{x}}^{M})\}\cup\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}$ and divide it into subsets $\mathcal{B}_{s}$
     for each $\mathcal{B}_{s}$ do
        $L=\sum_{\mathbf{a}\in\mathcal{B}_{s}}\mathcal{L}_{e}\big(\mathbf{a},\mathcal{B}_{s}\backslash\{\mathbf{a}\}\big)$
        Update $\{\tilde{\mathbf{x}}^{i} \mid g(\tilde{\mathbf{x}}^{i})\in\mathcal{B}_{s}\}$ to maximize $L$ by algorithm 4
        Update $g()$ to minimize $L$
     end for
  end for
  Output: $g()$
Algorithm 4 Virtual points update
  Require: $m$ sets of anchor points $\{\mathbf{z}_{1}^{k},..,\mathbf{z}_{p}^{k}\}$, $\gamma^{k}$, $\forall k=1,..,m$, loss $L$, $\xi$, $n\_iters$.
  $\tilde{\mathbf{x}}^{k}=\gamma_{1}^{k}\mathbf{z}_{1}^{k}+\gamma_{2}^{k}\mathbf{z}_{2}^{k}+...+\gamma_{p}^{k}\mathbf{z}_{p}^{k},\ \forall k=1,..,m$
  for $i=1$ to $n\_iters$ do
     Calculate the gradient $\nabla_{\gamma}L$ (w.r.t. $[\gamma^{1},..,\gamma^{m}]$) of the function $L(\tilde{\mathbf{x}}^{1},..,\tilde{\mathbf{x}}^{m})$
     $[\gamma^{1},..,\gamma^{m}]\leftarrow[\gamma^{1},..,\gamma^{m}]+\xi\nabla_{\gamma}L(\tilde{\mathbf{x}}^{1},..,\tilde{\mathbf{x}}^{m})$
     $\gamma^{k}\leftarrow\Pi_{ps}(\gamma^{k}),\ \forall k=1,..,m$
     $\tilde{\mathbf{x}}^{k}=\gamma_{1}^{k}\mathbf{z}_{1}^{k}+\gamma_{2}^{k}\mathbf{z}_{2}^{k}+...+\gamma_{p}^{k}\mathbf{z}_{p}^{k},\ \forall k=1,..,m$
  end for
  Output: $\tilde{\mathbf{x}}^{1},..,\tilde{\mathbf{x}}^{m}$

Algorithm 4 represents the update step for multiple attack points. We assumed that the embedding loss $\mathcal{L}_{e}$ is smooth with respect to $\gamma$ and used a gradient-based algorithm to estimate the latter. Nevertheless, in PGS, there are several methods whose embedding loss is not even continuous. For example, in LLE (6), the embedding loss takes into account the $k$ nearest neighbors of a point, which might change throughout the estimation of an attack point, producing a discontinuity. To circumvent this problem, we use several strategies to avoid singularities:

  • By reducing the gradient step $\xi$, which limits the virtual point displacement.

  • By taking a small number of attack points in each subset $\mathcal{B}_{s}$, or by randomly using only a part of the attack points to perform the attack while fixing the other attack points, in an attack stage.

  • By updating $\gamma$ only if the embedding loss increases.

Besides, some metrics might be approximated by smooth functionals. For instance, in the contrastive loss of section 3, we can replace the metric $d_{x}()$, which outputs only 0 or 1, with $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\big(\frac{-\left\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\right\rVert_{2}^{2}}{2\sigma_{i}^{2}}\big)$ to make the embedding loss continuous.

4.3 Pairwise preservation of geometric structure

For some PGS methods such as MDS or LE, the embedding loss $\mathcal{L}_{e}$ can be decomposed into a sum of elementary pairwise losses $l_{e}$:

$$\mathcal{L}_{e}(\mathbf{a},\mathcal{B})=\sum_{\mathbf{b}\in\mathcal{B}}l_{e}(\mathbf{a},\mathbf{b}) \tag{13}$$

Then the batch loss $L$ (in algorithm 3) can be rewritten as:

$$L=\sum_{\mathbf{a}\in\mathcal{B}_{s}}\mathcal{L}_{e}\big(\mathbf{a},\mathcal{B}_{s}\backslash\{\mathbf{a}\}\big)=\sum_{\mathbf{a}\in\mathcal{B}_{s}}\sum_{\mathbf{b}\in\mathcal{B}_{s},\mathbf{b}\neq\mathbf{a}}l_{e}(\mathbf{a},\mathbf{b}) \tag{14}$$

Following this change, $L$ can be decomposed into three parts:

$$\begin{aligned}&\sum_{\mathbf{a}\in\mathcal{B}_{s}^{d}}\sum_{\mathbf{b}\in\mathcal{B}_{s}^{d},\mathbf{b}\neq\mathbf{a}}l_{e}(\mathbf{a},\mathbf{b})&&(\text{data-data})\\ &\sum_{\mathbf{a}\in\mathcal{B}_{s}^{d}}\sum_{\mathbf{b}\in\mathcal{B}_{s}^{v}}l_{e}(\mathbf{a},\mathbf{b})&&(\text{data-virtual})\\ &\sum_{\mathbf{a}\in\mathcal{B}_{s}^{v}}\sum_{\mathbf{b}\in\mathcal{B}_{s}^{v},\mathbf{b}\neq\mathbf{a}}l_{e}(\mathbf{a},\mathbf{b})&&(\text{virtual-virtual}),\end{aligned}$$

where $\mathcal{B}_{s}^{d}$ and $\mathcal{B}_{s}^{v}$ are respectively the sets that contain all embedded data points (or observed samples) and all embedded virtual points of $\mathcal{B}_{s}$.

For PGS losses that can be decomposed into a sum of elementary pairwise losses, we can balance the effect of the virtual points with respect to the observed samples, not only by tuning the ratio between the number of virtual points and the number of observed samples in $\mathcal{B}_{s}$, but also by weighting each of the three parts above, which correspond to the settings observed-observed, observed-virtual and virtual-virtual.
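The weighting of the three parts can be sketched as follows. The example assumes that the pairwise losses of a batch have already been collected in a matrix `P` with `P[i, j]` $=l_{e}(\mathbf{a}_{i},\mathbf{a}_{j})$, and that a boolean mask marks the virtual points; the function name and the weights `w_dd`, `w_dv`, `w_vv` are hypothetical. Squared embedding distances are used as a stand-in for $l_{e}$.

```python
import numpy as np

def weighted_batch_loss(P, is_virtual, w_dd=1.0, w_dv=1.0, w_vv=1.0):
    """Weighted sum of the data-data, data-virtual and virtual-virtual parts."""
    P = P.copy()
    np.fill_diagonal(P, 0.0)                 # the b == a terms are excluded
    d, v = ~is_virtual, is_virtual
    dd = P[np.ix_(d, d)].sum()               # data-data
    dv = P[np.ix_(d, v)].sum()               # data-virtual
    vv = P[np.ix_(v, v)].sum()               # virtual-virtual
    return w_dd * dd + w_dv * dv + w_vv * vv

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))                                  # embedded batch
P = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)           # symmetric stand-in for l_e
is_virtual = np.array([False, False, False, True, True])

full_sum = P.sum() - np.trace(P)                             # double sum of eq. (14)
recovered = weighted_batch_loss(P, is_virtual, w_dv=2.0)
virtual_only = weighted_batch_loss(P, is_virtual, w_dd=0.0, w_dv=0.0)
```

With a symmetric pairwise loss, each data-virtual pair appears twice in the double sum of eq. (14), so the full batch loss is recovered with `w_dv=2.0`; other weights shift the balance between observed and virtual points.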

4.4 Settings of anchor points and initialization of virtual points

In this section, we provide two settings (or rules) for computing anchor points with the corresponding initializations. These settings need to be chosen carefully to guarantee that virtual points are on the sample underlying manifold.

Neighbor anchors: The first anchor point $\mathbf{z}_{1}$ is taken randomly from $\mathcal{X}$, then the next $(p-1)$ anchor points $\mathbf{z}_{2},..,\mathbf{z}_{p}$ are taken as the $(p-1)$ nearest neighbors of $\mathbf{z}_{1}$ in $\mathcal{X}$ (Euclidean metric by default). Here, we assume that the convex hull of a sample and its neighbors is likely to be contained in the manifold of the samples. The number of anchors $p$ needs to be small compared to the number of data points $N$. The virtual points can be initialized by taking $\gamma_{i}\sim\mathcal{U}(0,1),\ \forall i=1,..,p$, then normalizing to have $\sum_{i=1}^{p}\gamma_{i}=1$.

Random anchors: The second setting is inspired by the Mix-up method [26]. $p$ anchors are taken randomly from $\mathcal{X}$ and we take $\gamma\sim\text{Dirichlet}(\alpha_{1},..,\alpha_{p})$. The Dirichlet distribution returns $\gamma$ with $\gamma_{i}\geq 0$ and $\sum_{i=1}^{p}\gamma_{i}=1$. In particular, if $\alpha_{i}\ll 1,\ \forall i=1,..,p$, one coefficient $\gamma_{k}$ is much greater than the other ones with high probability, which implies that virtual points are more likely to lie in the neighborhood of a data sample. Since the manifold attack only looks for a local maximum with a gradient-based method, if $\xi$ and $n\_iters$ are both small, we expect that attack points do not move too far from their initial position during the attack stage, and thus remain on the data manifold. Note that, in the case $\alpha_{i}=1,\ \forall i=1,..,p$, the Dirichlet distribution becomes the uniform distribution over the simplex.
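The effect of the Dirichlet parameters on the initial coordinates can be illustrated with NumPy's `Generator.dirichlet`. The values `p = 4`, `alpha = 0.1` and the sample size are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
p, M = 4, 1000                                            # anchors per set, number of sets

gamma_sparse = rng.dirichlet(np.full(p, 0.1), size=M)     # alpha_i << 1
gamma_flat = rng.dirichlet(np.ones(p), size=M)            # alpha_i = 1: uniform on the simplex

# Average value of the largest coefficient in each draw.
dominant_sparse = gamma_sparse.max(axis=1).mean()
dominant_flat = gamma_flat.max(axis=1).mean()
```

Every draw satisfies the constraints of eq. (9), and the largest coefficient is on average much closer to 1 for small $\alpha_{i}$, so the corresponding virtual point starts close to one of its anchors.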

To ensure that the coefficient $\gamma_{k}$ is always much greater than the other ones, we apply one more constraint, $\gamma_{k}\geq\tau$, with $\tau$ taken close to 1. The constraints in (9) become:

$$\begin{split}&\gamma_{1},\gamma_{2},..,\gamma_{p}\geq 0,\\ &\gamma_{1}+\gamma_{2}+...+\gamma_{p}=1,\\ &\gamma_{k}\geq\tau,\ (\tau<1).\end{split} \tag{15}$$

Then the projection in algorithm 2 needs to be slightly modified to incorporate this new constraint. We define the projection $\gamma=\Pi_{ps}^{\prime}(\kappa)$ as follows:

$$\begin{split}&\kappa^{\prime}\leftarrow\kappa\\ &\kappa^{\prime}_{k}\leftarrow\kappa^{\prime}_{k}-\tau\\ &\gamma\leftarrow\Pi_{ps}(\kappa^{\prime},c=1-\tau)\\ &\gamma_{k}\leftarrow\gamma_{k}+\tau\end{split} \tag{16}$$
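The modified projection of eq. (16) can be sketched on top of the alternating projection of Algorithm 2. The function names are hypothetical, and the base projection starts from $\kappa$ shifted onto the hyperplane of sum $c$.

```python
import numpy as np

def project_ps(kappa, c=1.0):
    """Projection onto {gamma >= 0, sum(gamma) = c} by alternating constraints."""
    gamma = np.asarray(kappa, dtype=float).copy()
    gamma += (c - gamma.sum()) / gamma.size
    while np.any(gamma < 0):
        pos = gamma > 0
        gamma[gamma < 0] = 0.0
        gamma[pos] += (c - gamma[pos].sum()) / pos.sum()
    return gamma

def project_ps_dominant(kappa, k, tau):
    """Eq. (16): projection under the extra constraint gamma_k >= tau."""
    shifted = np.array(kappa, dtype=float)
    shifted[k] -= tau                              # reserve a mass tau for coordinate k
    gamma = project_ps(shifted, c=1.0 - tau)       # distribute the remaining mass 1 - tau
    gamma[k] += tau                                # give the reserved mass back
    return gamma

gamma = project_ps_dominant(np.array([0.4, 0.3, 0.3]), k=0, tau=0.8)
```

In this example the shifted vector is (-0.4, 0.3, 0.3) with target sum 0.2; its projection is (0, 0.1, 0.1), so the result is (0.8, 0.1, 0.1), which satisfies all three constraints in (15).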

5 Applications of manifold attack

We present several applications of manifold attack for NNMs that use PGS task. Firstly, we show advantages of manifold attack for a PGS task when only few training samples are available. Secondly, we show that applying manifold attack in moderation improves both generalization and adversarial robustness.

5.1 Preservation of geometric structure with few training samples

For this experiment, we use the S curve data and the Digit data. The S curve data contains $N=1000$ 3-dimensional samples, as shown in figure 3. The Digit data contains $N=1797$ images of digits, each of size $8\times 8$. We want to compute 2-dimensional embeddings for these data. Each data set is split into two sets: $N_{tr}$ samples are randomly taken as the training set and the $N_{te}$ remaining samples are used for testing. $\mathbf{x}^{tr}$ denotes a training sample and $\mathbf{x}^{te}$ denotes a testing sample. We perform four training modes, as described in table 1, with a neural network model $g()$. The evaluation loss, after optimizing the model $g()$, is defined as:

$$L_{ev}=\frac{1}{N_{te}}\sum_{i=1}^{N_{te}}\mathcal{L}_{e}\big(g(\mathbf{x}^{te}_{i}),\{g(\mathbf{x}^{te}_{j}) \mid j\neq i\}\big) \tag{17}$$
Figure 3: Left: S curve data 1000 samples. Center: S curve data 50 samples. Right: Digit data.
| Mode | Description of objective function |
|---|---|
| 1. REF | PGS that takes into account both training and testing samples: $L_{tr}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{e}(g(\mathbf{x}_{i}),\{g(\mathbf{x}_{j})\mid j\neq i\})$. The result of this training mode is taken as the "reference" against which the other training modes are compared. |
| 2. DD | PGS that takes only the training (observed) data samples: $L_{tr}=\frac{1}{N_{tr}}\sum_{i=1}^{N_{tr}}\mathcal{L}_{e}(g(\mathbf{x}_{i}^{tr}),\{g(\mathbf{x}_{j}^{tr})\mid j\neq i\})$. |
| 3. RV | Virtual points are used as supplementary data; they are only randomly initialized, without an attack stage (by setting $n\_iters=0$ in algorithm 4): $L_{tr}=\frac{1}{\lvert\mathcal{B}\rvert}\sum_{\mathbf{a}_{i}\in\mathcal{B}}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{B}\backslash\{\mathbf{a}_{i}\})$, where $\mathcal{B}=\{g(\tilde{\mathbf{x}}^{1}),..,g(\tilde{\mathbf{x}}^{M})\}\cup\{g(\mathbf{x}_{1}^{tr}),..,g(\mathbf{x}_{N_{tr}}^{tr})\}$. |
| 4. MA | Manifold attack: the same objective function as the previous case, except that $n\_iters\neq 0$ (virtual points become attack points). |

Table 1: Four training modes: REF (Reference), DD (Data-Data), RV (Random Virtual) and MA (Manifold Attack), and their corresponding objective functions.

The anchoring rule, the embedding loss $\mathcal{L}_{e}$ and the model $g()$ are specified below.

Anchoring rule. Two settings are considered:

  • Neighbor anchors (NA): A set of anchor points is composed of a sample and its 4 nearest neighbors. In this case, we have $M=N_{tr}$ sets of anchor points and $p=5$ anchor points in each set. The coefficient $\gamma$ is initialized from the uniform distribution.

  • Random anchors (RA): We randomly take 2 points among the $N_{tr}$ training points to create a set of anchor points. In this case, we have $M=\binom{N_{tr}}{2}$ sets of anchor points and $p=2$ anchor points in each set. The coefficient $\gamma$ is initialized from the Dirichlet distribution with $\alpha_{i}=0.5,\forall i=1,..,p$.
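The two anchoring rules above can be sketched as index computations. This is a minimal NumPy sketch under stated assumptions: nearest neighbors are found with the Euclidean metric by brute force, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def neighbor_anchor_sets(x_train, n_neighbors=4):
    """One anchor set per sample: the sample followed by its nearest neighbors."""
    diff = x_train[:, None, :] - x_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
    own = np.arange(len(x_train))[:, None]
    return np.hstack([own, nearest])  # shape (N_tr, n_neighbors + 1)

def random_anchor_sets(n_train):
    """All unordered pairs of training indices, shape (C(N_tr, 2), 2)."""
    return np.array(list(combinations(range(n_train), 2)))

x_train = np.random.default_rng(0).normal(size=(50, 3))
print(neighbor_anchor_sets(x_train).shape)  # M = N_tr sets, p = 5 anchors
print(random_anchor_sets(len(x_train)).shape)  # M = C(N_tr, 2) sets, p = 2
```

In practice one would draw pairs from the second array at random for each batch rather than enumerate all of them, since their number grows quadratically with $N_{tr}$.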

The embedding loss. We employ the MDS and LE embedding losses described in section 3, with the default metrics. For the similarity metric $d_{x}()$ in the LE method, we take $\sigma=0.2$ for the S curve data and $\sigma=0.5$ for the Digit data.

The model. A simple CNN structure is used. Since the input dimensions differ between the two datasets, a separate architecture is used for each:

  • S curve data: Conv1d[1,4,2] $\rightarrow$ ReLU $\rightarrow$ Conv1d[4,4,2] $\rightarrow$ ReLU $\rightarrow$ Flatten $\rightarrow$ Fc[4,2].

  • Digit data: Conv2d[1,8,3] $\rightarrow$ ReLU $\rightarrow$ Conv2d[8,16,3] $\rightarrow$ ReLU $\rightarrow$ Flatten $\rightarrow$ Fc[64,2].
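As a sanity check on the S curve architecture, the layer sizes can be traced by hand. The sketch below assumes that the bracket notation means [input channels, output channels, kernel size], that each 3-dimensional sample is fed as a single-channel signal of length 3, and that the convolutions use stride 1 and no padding; none of these settings is stated explicitly in the text.

```python
def conv_output_length(length, kernel_size, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel_size) // stride + 1

def s_curve_flatten_size(input_length=3):
    """Number of features reaching Fc for Conv1d[1,4,2] -> Conv1d[4,4,2] -> Flatten."""
    length = conv_output_length(input_length, kernel_size=2)  # 3 -> 2
    length = conv_output_length(length, kernel_size=2)        # 2 -> 1
    out_channels = 4
    return out_channels * length

print(s_curve_flatten_size())
```

Under these assumptions the flattened size is 4, which agrees with the input size of Fc[4,2].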

For the LE method, two additional constraints are imposed to avoid trivial embeddings:

$$\begin{gathered}\mathrm{E}(\mathbf{A}^{tr})=[\mathrm{E}(\mathbf{A}^{tr}[1,:]),..,\mathrm{E}(\mathbf{A}^{tr}[d,:])]^{\top}=\mathbf{0}_{d}\\ \Sigma(\mathbf{A}^{tr},\mathbf{A}^{tr})=I_{d}\end{gathered} \tag{18}$$

where $d=2$ is the number of output dimensions and $\mathbf{A}^{tr}=[\mathbf{a}_{1}^{tr},..,\mathbf{a}_{N_{tr}}^{tr}]=[g(\mathbf{x}_{1}^{tr}),..,g(\mathbf{x}_{N_{tr}}^{tr})]$. To satisfy these constraints, we add a normalization layer at the end of the model $g()$: $(g(\mathbf{x})-\mathrm{E}(\mathbf{A}^{tr}))\Sigma^{-1}(\mathbf{A}^{tr},\mathbf{A}^{tr})$, where $\Sigma^{-1}$ is computed via Cholesky decomposition.
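One way to realize such a normalization layer is a standard whitening step. The sketch below is an assumption about how the Cholesky decomposition is used, not the authors' code: it multiplies the centered embeddings by the inverse of the Cholesky factor of the training covariance, which is what makes the training embeddings have zero mean and identity covariance as required by (18). Function names are hypothetical.

```python
import numpy as np

def fit_normalization(a_train):
    """a_train: embeddings of the training samples, shape (N_tr, d)."""
    mean = a_train.mean(axis=0)
    cov = np.cov(a_train, rowvar=False, bias=True)
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol.T
    return mean, chol

def normalize(a, mean, chol):
    """Center and whiten embeddings so that the training covariance is I_d."""
    # Solve chol @ z = (a - mean).T instead of forming an explicit inverse.
    return np.linalg.solve(chol, (a - mean).T).T

rng = np.random.default_rng(0)
a_train = rng.normal(size=(50, 2)) @ np.array([[2.0, 0.5], [0.0, 1.0]]) + 3.0
z = normalize(a_train, *fit_normalization(a_train))
print(np.round(np.cov(z, rowvar=False, bias=True), 6))
```

The same `mean` and `chol`, fitted on the training embeddings, would be reused for virtual points and testing samples.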

To simulate the case of few training samples, we fix $N_{tr}=100$ for the MDS method and $N_{tr}=50$ for the LE method. The balance between virtual points and samples is controlled by the pair $\lambda=$ (number of virtual points in $\mathcal{B}_{s}$, number of observed samples in $\mathcal{B}_{s}$). We set $\lambda=(2,5)$ for the MDS method and $\lambda=(5,10)$ for the LE method. The gradient step $\xi$ is selected from $\{0.1,1,10\}$ and the number of iterations is fixed at $n\_iters=2$.

The initialization of the model $g()$ has a strong impact, especially since there are few training data. Five different initializations of the model are performed for each method. The mean and the standard deviation of the evaluation loss $L_{ev}$ are reported in table 2. First, we see that using random virtual (RV) points as additional data points gives a lower loss than using only data points (DD). Second, using manifold attack (MA) further improves the results, which shows the benefit of the proposed approach for regularizing the model.

For the S curve data, initialization by Neighbor Anchors (NA) generally gives a better result than initialization by Random Anchors (RA). For the Digit data, however, initialization by Random Anchors gives a better result. This is because, in the S curve data, Neighbor Anchors cover the data manifold better than Random Anchors. In the Digit data, on the other hand, Neighbor Anchors (which use the Euclidean metric to determine nearest neighbors) are more likely to generate a virtual point that does not lie on the data manifold, which leads to a greater evaluation loss than with Random Anchors.

| Mode / Method | S curve data, MDS | S curve data, LE | Digit data, MDS | Digit data, LE |
|---|---|---|---|---|
| REF | $130.7\pm 24.74$ | $0.399\pm 0.07$ | $2015\pm 14$ | $0.07\pm 0.002$ |
| DD | $352.56\pm 119.19$ | $1.21\pm 0.46$ | $2409\pm 78$ | $0.58\pm 0.07$ |
| RV (NA) | $173.87\pm 9.38$ | $0.59\pm 0.11$ | $2395\pm 73$ | $0.31\pm 0.03$ |
| MA (NA) | $170.62\pm 5.89$ | $0.55\pm 0.07$ | $2362\pm 63$ | $0.24\pm 0.03$ |
| RV (RA) | $183.42\pm 18.13$ | $0.65\pm 0.14$ | $2342\pm 56$ | $0.22\pm 0.03$ |
| MA (RA) | $169.04\pm 5.30$ | $0.63\pm 0.14$ | $2331\pm 56$ | $0.2\pm 0.02$ |

Table 2: Evaluation loss $L_{ev}$ of two PGS methods, MDS and LE, in four modes: Reference (REF), Data-Data (DD), Random Virtual (RV), Manifold Attack (MA) (as described in table 1) and two initialization strategies: Neighbor Anchors (NA) and Random Anchors (RA).

The embedded representations of the testing samples in the S curve data, obtained with five different initializations of $g()$, are shown in figure 4 for the MDS method and in figure 5 for the LE method.

Figure 4: Five evaluations with different initializations of the model for the S curve data, using the MDS method with four modes: Reference (REF), Data-Data (DD), Random Virtual (RV), Manifold Attack (MA) (as described in table 1) and two initialization strategies: Neighbor Anchors (NA) and Random Anchors (RA). The effect of Manifold Attack is clearly visible in the second column: the embedded representation of Manifold Attack is more spread out than that of Random Virtual.
Figure 5: Five evaluations with different initializations of the model for the S curve data, using the LE method with four modes: Reference (REF), Data-Data (DD), Random Virtual (RV), Manifold Attack (MA) (as described in table 1) and two initialization strategies: Neighbor Anchors (NA) and Random Anchors (RA). The effect of Manifold Attack is clearly visible in the fourth column, where the shape of the embedded representation of Manifold Attack is more similar to the Reference than that of Random Virtual.

5.2 Robustness to adversarial examples

In this subsection, we combine manifold attack with supervised learning to assess the adversarial robustness of the model. Let us start with a general objective function composed of a supervised loss and a PGS loss:

$$\begin{split}&\min_{\theta}\max_{\gamma}\left(L_{s}+\lambda L_{pgs}\right)\\ =&\min_{\theta}\max_{\gamma}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\mathcal{L}_{c}(f_{\theta}(\mathbf{x}^{l}_{i}),y_{i})+\lambda\mathcal{L}_{t}\left(f_{\theta}^{(l)}(\mathbf{x}_{1}^{l}),..,f_{\theta}^{(l)}(\mathbf{x}_{N_{l}}^{l}),f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{1}),..,f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{M})\right)\end{split} \tag{19}$$

where $(\mathbf{x}^{l}_{i},y_{i})$ denotes a sample and its corresponding label. $\mathcal{L}_{c}$ is a dissimilarity metric, for instance the Cross Entropy. $\mathcal{L}_{t}$ is a PGS loss that may include all observed samples $\mathbf{x}$ and virtual points $\tilde{\mathbf{x}}$. Note that the coordinates $\gamma$ of all virtual points are constrained by (9).

In the following, we construct $\mathcal{L}_{t}$ from a particular PGS loss, called the Mix-up manifold learning loss [27, 28, 26]:

$$\begin{split}&l_{mu}\left(f_{\theta}^{(l)}(\tilde{\mathbf{x}}),f_{\theta}^{(l)}(\mathbf{x}_{i}),f_{\theta}^{(l)}(\mathbf{x}_{j})\right)\\ =&\mathcal{L}_{c}\left(f_{\theta}^{(l)}(\gamma_{1}\mathbf{x}_{i}+\gamma_{2}\mathbf{x}_{j}),\gamma_{1}f_{\theta}^{(l)}(\mathbf{x}_{i})+\gamma_{2}f_{\theta}^{(l)}(\mathbf{x}_{j})\right)\\ \text{where: }&\gamma_{1}+\gamma_{2}=1\text{ and }\gamma_{1}\sim\text{Beta}(\alpha,\alpha),0<\alpha\ll 1\end{split} \tag{20}$$

In this configuration, the virtual point is $\tilde{\mathbf{x}}=\gamma_{1}\mathbf{x}_{i}+\gamma_{2}\mathbf{x}_{j}$, where $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are two anchor points. The Beta distribution is just the particular case of the Dirichlet distribution in which the number of anchor points is $p=2$ (see section 4.4). In the case of supervised learning, we perform PGS between the original sample representation and the final NNM output (which can be considered as a latent representation). Thus, we implement a slightly modified version of equation (20):

$$\begin{split}&l_{mu}\left(f_{\theta}(\tilde{\mathbf{x}}),f_{\theta}(\mathbf{x}^{l}_{i}),f_{\theta}(\mathbf{x}^{l}_{j})\right)\\ =&\mathcal{L}_{c}\left(f_{\theta}(\gamma_{1}\mathbf{x}_{i}^{l}+\gamma_{2}\mathbf{x}_{j}^{l}),\gamma_{1}y_{i}+\gamma_{2}y_{j}\right)\\ \text{where: }&\gamma_{1}+\gamma_{2}=1\text{ and }\gamma_{1}\sim\text{Beta}(\alpha,\alpha),0<\alpha\ll 1\end{split} \tag{21}$$
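For a single pair of labelled samples, the loss in (21) can be written out directly. The following NumPy sketch assumes one-hot labels, Cross Entropy as $\mathcal{L}_{c}$, and a hypothetical `model` callable that returns logits; it is meant only to make the roles of $\gamma_{1}$ and $\gamma_{2}$ concrete.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mixup_loss(model, x_i, y_i, x_j, y_j, gamma_1):
    """Cross entropy between f(gamma_1 x_i + gamma_2 x_j) and the mixed label."""
    gamma_2 = 1.0 - gamma_1
    x_tilde = gamma_1 * x_i + gamma_2 * x_j  # virtual point
    y_tilde = gamma_1 * y_i + gamma_2 * y_j  # mixed one-hot labels
    probs = softmax(model(x_tilde))
    return float(-(y_tilde * np.log(probs)).sum())

# Toy linear classifier with 3 classes and 4 input features (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x_i, x_j = rng.normal(size=4), rng.normal(size=4)
y_i, y_j = np.eye(3)[0], np.eye(3)[2]
print(mixup_loss(lambda x: W @ x, x_i, y_i, x_j, y_j, gamma_1=0.9))
```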

Then, we substitute this explicit PGS loss into equation (19) and take $\lambda=1$:

\begin{align}
\min_{\theta}\max_{\gamma}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}&\mathcal{L}_{c}(f_{\theta}(\mathbf{x}^{l}_{i}),y_{i})+\lambda\frac{1}{N_{l}^{2}N_{\gamma}}\sum_{k=1}^{N_{\gamma}}\sum_{i=1}^{N_{l}}\sum_{\substack{j=1\\ j\neq i}}^{N_{l}}\mathcal{L}_{c}\left(f_{\theta}(\gamma_{1}\mathbf{x}_{i}^{l}+\gamma_{2}\mathbf{x}_{j}^{l}),\gamma_{1}y_{i}+\gamma_{2}y_{j}\right) \tag{22}\\
=\min_{\theta}\max_{\gamma}&\frac{1}{N_{l}^{2}N_{\gamma}}\sum_{k=1}^{N_{\gamma}}\sum_{i=1}^{N_{l}}\sum_{j=1}^{N_{l}}\mathcal{L}_{c}\left(f_{\theta}(\gamma_{1}\mathbf{x}_{i}^{l}+\gamma_{2}\mathbf{x}_{j}^{l}),\gamma_{1}y_{i}+\gamma_{2}y_{j}\right) \tag{23}\\
\text{where: }&\gamma_{1}+\gamma_{2}=1\text{ and }\gamma_{1}\sim\text{Beta}(\alpha,\alpha),0<\alpha\ll 1 \tag{24}\\
&N_{\gamma}\text{ is the number of samplings of }\gamma\text{ from Beta}(\alpha,\alpha). \tag{25}
\end{align}

We call problem (22) adversarial Mix-up because it is derived from Mix-up [26] by adding adversarial learning for $\gamma$. In practice, to optimize problem (22), we perform an attack stage (algorithm 4) to find a $\gamma$ that gives a larger PGS loss, before performing the model update stage for $f_{\theta}()$. We alternate these two stages until convergence. Note that we can take $\gamma_{2}=1-\gamma_{1}$, so that only one variable, $\gamma_{1}$, needs to be handled to maximize the PGS loss. The projection (12) for $\gamma_{1}$ then reduces to the clamping function, which ensures that $\gamma_{1}$ stays between 0 and 1.
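The attack stage on the single variable $\gamma_{1}$ can be sketched as gradient ascent followed by clamping. This is a simplified illustration rather than algorithm 4 itself: the function name is hypothetical, and the derivative is approximated by central finite differences as a stand-in for the automatic differentiation one would use with a neural network.

```python
def attack_gamma(loss_fn, gamma_1, xi, n_iters=1, low=0.0, high=1.0, h=1e-5):
    """Gradient ascent on gamma_1, each step followed by clamping to [low, high].

    loss_fn maps gamma_1 to the PGS loss of the corresponding virtual point.
    """
    for _ in range(n_iters):
        grad = (loss_fn(gamma_1 + h) - loss_fn(gamma_1 - h)) / (2.0 * h)
        gamma_1 = min(max(gamma_1 + xi * grad, low), high)
    return gamma_1

# Toy loss that grows as gamma_1 moves away from 0.2 (illustrative only).
toy_loss = lambda g: (g - 0.2) ** 2
print(attack_gamma(toy_loss, gamma_1=0.6, xi=0.1))   # small step, stays inside [0, 1]
print(attack_gamma(toy_loss, gamma_1=0.6, xi=10.0))  # large step, clamped to 1.0
```

After the attack stage, the model parameters would be updated on the loss evaluated at the attacked $\gamma_{1}$; a smaller `xi` keeps the attack point closer to its initial position.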

We compare four supervised training methods: ERM (Empirical Risk Minimization, i.e. classical supervised learning with the Cross Entropy loss), Mix-up, Adversarial Mix-up and Cut-Mix [29], on the ImageNet dataset with the ResNet-50 model [6], which has about 25.8M trainable parameters. We retrieve 948 classes, each consisting of 400 labelled training samples and 50 testing samples, to evaluate the models.

| Method | Testing set, Top-1 | Testing set, Top-5 | Adversarial examples, Top-1 | Adversarial examples, Top-5 |
|---|---|---|---|---|
| ERM | 33.84 | 12.46 | 81.69 | 59.14 |
| Mix-up [26] | 32.13 | 11.35 | 75.57 | 49.41 |
| Adversarial Mix-up (1) ($\xi=0.1\rightarrow 0.01$) | 32.57 | 10.98 | 63.62 | 36.15 |
| Adversarial Mix-up (2) ($\xi=0.01\rightarrow 0.001$) | 31.82 | 11.15 | 70.82 | 43.96 |
| Cut-Mix [29] | 30.94 | 10.41 | 81.24 | 58.72 |

Table 3: ImageNet error rate (top-1 and top-5, in %) on the testing set and on adversarial examples for different training modes: ERM, Mix-up, Adversarial Mix-up and Cut-Mix.

We evaluate the error rate on the testing set at the end of each epoch and report the best error rate (top-1 and top-5) in table 3. We create adversarial examples using the Fast Gradient Sign Method (FGSM) [10] on another trained ERM model, with $\epsilon=0.05$. In Adversarial Mix-up, $n\_iters$ is fixed at 1. The attack stage is parametrized by the gradient step $\xi$, which follows one of two configurations: (1) $\xi$ is reduced linearly from 0.1 to 0.01 and (2) $\xi$ is reduced linearly from 0.01 to 0.001. Following the original articles, $\alpha$ is set to 0.2 for Mix-up and Adversarial Mix-up and to 1 for Cut-Mix. More details on the hyper-parameters can be found in the Appendix.
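For reference, FGSM perturbs each input by $\epsilon$ in the direction of the sign of the loss gradient with respect to that input. The sketch below illustrates this on a toy linear softmax classifier with NumPy; the weight matrix `W` is a hypothetical stand-in for the separately trained ERM model, and any clipping of the perturbed input to a valid pixel range is left out because the text does not specify it.

```python
import numpy as np

def fgsm(x, grad_x, epsilon=0.05):
    """Fast Gradient Sign Method: x_adv = x + epsilon * sign(dL/dx)."""
    return x + epsilon * np.sign(grad_x)

def cross_entropy_input_grad(W, x, y_onehot):
    """Gradient of the softmax cross entropy w.r.t. x for logits = W @ x."""
    logits = W @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return W.T @ (probs - y_onehot)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # toy model: 3 classes, 4 input features
x, y = rng.normal(size=4), np.eye(3)[1]
x_adv = fgsm(x, cross_entropy_input_grad(W, x, y), epsilon=0.05)
print(np.abs(x_adv - x).max())  # perturbation is bounded by epsilon
```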

First, Mix-up [26] combines a supervised learning loss with a PGS loss, and it gives better error rates on both the testing set and the adversarial examples than ERM, which is a standard supervised learning task.

Second, in Adversarial Mix-up (1) and (2), a trade-off has to be found between the error rate on the testing set and the error rate on adversarial examples. For large values of $\xi$ (1), the top-1 error rate on the testing set is about 0.5% worse than that of Mix-up (which has no attack stage), but robustness against adversarial examples improves by about 12% in top-1 and 13% in top-5 error rate. On the other hand, if $\xi$ takes moderate values as in (2), the error rates on both the testing set and the adversarial examples are lower than those of Mix-up, but the gain in robustness against adversarial examples is only about 5%. These two observations show the effect of manifold attack, which is an improved PGS procedure. We conclude that manifold attack not only improves generalization but also significantly improves the adversarial robustness of the model.

Finally, Adversarial Mix-up (2) gives a slightly worse error rate than Cut-Mix on the testing set (about 1%), but it gains about 10% on adversarial examples.

It is worth noting that, in the attack stage, the model $g()$ needs to be differentiable. Thus, when using NNMs with Dropout layers for example, the active connections need to be fixed during an attack stage.

5.3 Semi-supervised manifold attack

In this subsection, manifold attack is applied to reinforce semi-supervised neural network models that use PGS as regularization. Problem (19) is extended to include unlabelled samples. We assume that the training set contains $N$ samples $\mathbf{x}$, of which the first $N_{l}$ samples $\mathbf{x}^{l}_{i}$ are labelled and the remaining $N_{u}=N-N_{l}$ samples $\mathbf{x}^{u}_{i}$ are unlabelled. Hence, the objective function for a semi-supervised manifold attack method is:

$$\begin{split}&\min_{\theta}\max_{\gamma}\left(L_{s}+\lambda L_{pgs}+\beta L_{u}\right)\\ =&\min_{\theta}\max_{\gamma}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\mathcal{L}_{c}(f_{\theta}(\mathbf{x}^{l}_{i}),y_{i})+\lambda\mathcal{L}_{t}\left(f_{\theta}^{(l)}(\mathbf{x}_{1}),..,f_{\theta}^{(l)}(\mathbf{x}_{N}),f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{1}),..,f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{M})\right)+\beta L_{u}\end{split} \tag{26}$$

where $L_{u}$ refers to other possible unsupervised losses such as pseudo-label, self-supervised learning, etc.

In the following, we make the terms of problem (26) explicit. We build upon MixMatch [30], a state-of-the-art semi-supervised learning method, which uses the Mix-up manifold learning loss for PGS. We then add adversarial learning for $\gamma$ to MixMatch, and refer to the resulting instance of (26) as Adversarial MixMatch. Note that $\gamma_{1}$ in MixMatch differs slightly from Mix-up:

$$\gamma_{1}+\gamma_{2}=1,\ \gamma_{1}\sim\text{Beta}(\alpha,\alpha)\text{ and }\gamma_{1}\geq\gamma_{2} \tag{27}$$

The projection for $\gamma_{1}$ is then the clamping function between 0.5 and 1. When two separate variables $\gamma_{1}$ and $\gamma_{2}$ are defined, we can use the projector $\Pi_{ps}^{\prime}$ (16) defined in subsection 4.4. We use the PyTorch implementation of MixMatch by Yui [31] (with all hyper-parameters at their default values) and introduce attack stages with the number of iterations $n\_iters=1$. In each experiment, for both the CIFAR-10 and SVHN datasets, we divide the training set into three parts: labelled set, unlabelled set and validation set. The number of validation samples is fixed at 5000. The number of labelled samples is 250, and the remaining samples form the unlabelled set. We repeat the experiment four times, with different samplings of labelled, unlabelled and validation samples and different initializations of the model Wide ResNet-28 [32], which has about 1.47M trainable parameters. More details on the hyper-parameters can be found in the Appendix.
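The constraint (27) and its clamping projection are simple enough to write with the standard library alone. In this sketch the function names are hypothetical and the value of `alpha` is only illustrative; the sampling step enforces $\gamma_{1}\geq\gamma_{2}$ by taking the larger of the drawn value and its complement.

```python
import random

def sample_mixmatch_gamma(alpha, rng):
    """Draw gamma_1 ~ Beta(alpha, alpha) and enforce gamma_1 >= gamma_2."""
    gamma = rng.betavariate(alpha, alpha)
    return max(gamma, 1.0 - gamma)

def clamp_mixmatch_gamma(gamma_1):
    """Projection used after an attack step: keep gamma_1 in [0.5, 1]."""
    return min(max(gamma_1, 0.5), 1.0)

rng = random.Random(0)
gamma_1 = sample_mixmatch_gamma(0.75, rng)   # illustrative alpha
attacked = clamp_mixmatch_gamma(gamma_1 - 0.3)  # e.g. after a gradient step
print(gamma_1, attacked)
```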

| Data | Method | Test 1 | Test 2 | Test 3 | Test 4 | Mean |
|---|---|---|---|---|---|---|
| CIFAR-10 | MixMatch | 10.62 | 12.72 | 12.02 | 15.26 | $12.65\pm 1.68$ |
| CIFAR-10 | Adversarial MixMatch | 8.84 | 10.46 | 10.09 | 12.89 | $10.57\pm 1.47$ |
| SVHN | MixMatch | 6.0925 | 6.73 | 7.802 | 7.37 | $7.0\pm 0.65$ |
| SVHN | Adversarial MixMatch | 5.07 | 5.93 | 5.42 | 5.47 | $5.47\pm 0.3$ |

Table 4: CIFAR-10 and SVHN error rate in four different configurations; each configuration consists of a data partitioning and an initialization of the model parameters. The number of labelled samples is fixed at 250 and the model used is Wide ResNet-28.

The error rate on the testing set corresponding to the best validation error rate is reported in table 4, for both MixMatch and Adversarial MixMatch. Adversarial MixMatch improves on MixMatch, reducing the mean error rate by about 2.1% on CIFAR-10 and about 1.5% on SVHN. There is a considerable difference between the MixMatch error rate reproduced in our experiments and the one reported in the official paper, which might come from the sampling, the initialization of the model, the library used (PyTorch vs TensorFlow) and the computation of the error rate (error rate associated with the best validation error vs the median error rate of the last 20 checkpoints).
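The summary statistics of table 4 can be recomputed from the four per-run error rates. The sketch below uses only the standard library; the use of the population standard deviation (dividing by the number of runs) is an inference, chosen because it reproduces the reported values.

```python
from statistics import mean, pstdev

# Per-run test error rates (in %) copied from table 4.
runs = {
    ("CIFAR-10", "MixMatch"): [10.62, 12.72, 12.02, 15.26],
    ("CIFAR-10", "Adversarial MixMatch"): [8.84, 10.46, 10.09, 12.89],
    ("SVHN", "MixMatch"): [6.0925, 6.73, 7.802, 7.37],
    ("SVHN", "Adversarial MixMatch"): [5.07, 5.93, 5.42, 5.47],
}

def summarize(values):
    """Mean and population standard deviation of the per-run error rates."""
    return mean(values), pstdev(values)

for key, values in runs.items():
    m, s = summarize(values)
    print(key, round(m, 2), round(s, 2))

for data in ("CIFAR-10", "SVHN"):
    gain = mean(runs[(data, "MixMatch")]) - mean(runs[(data, "Adversarial MixMatch")])
    print(data, "mean error reduction:", round(gain, 2))
```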

| Method | CIFAR-10 | SVHN |
|---|---|---|
| Pi Model [33] | $53.02\pm 2.05$ | $17.56\pm 0.275$ |
| Pseudo Label [34] | $49.98\pm 1.17$ | $21.16\pm 0.88$ |
| VAT [11] | $36.03\pm 2.82$ | $8.41\pm 1.01$ |
| SESEMI SSL [35] | - | $8.32\pm 0.13$ |
| Mean Teacher [36] | $47.32\pm 4.71$ | $6.45\pm 2.43$ |
| Dual Student [37] | - | $4.24\pm 0.10$ |
| MixMatch [30] | $11.08\pm 0.87$ | $3.78\pm 0.26$ |
| MixMatch * [30] | $12.65\pm 1.68$ | $7.0\pm 0.65$ |
| Adversarial MixMatch * | $10.57\pm 1.47$ | $5.47\pm 0.3$ |
| Real Mix [38] | $9.79\pm 0.75$ | $3.53\pm 0.38$ |
| EnAET [39] | $7.6\pm 0.34$ | $3.21\pm 0.21$ |
| ReMixMatch [40] | $6.27\pm 0.34$ | $3.10\pm 0.50$ |
| Fix Match [41] | $5.07\pm 0.33$ | $2.48\pm 0.38$ |

Table 5: CIFAR-10 and SVHN error rate of different semi-supervised learning methods. The number of labelled samples is fixed at 250. () means that the results are reported from [30]. (*) means that the results are reported from our experiments. The remaining results are reported from their corresponding official papers.

Table 5 shows error rates among semi-supervised methods based on NNMs, for both the CIFAR-10 and SVHN datasets with only 250 labelled samples. We also refer readers to the site PapersWithCode, which provides the latest records for each dataset: CIFAR-10 (https://paperswithcode.com/sota/semi-supervised-image-classification-on-cifar-6) and SVHN (https://paperswithcode.com/sota/semi-supervised-image-classification-on-svhn-1).

6 Conclusion

| | Adversarial Example | Manifold Attack |
|---|---|---|
| Illustration (red point: a sample; blue line: border of feasible zone) | [illustration] | [illustration] |
| Feasible zone | Locality of each sample | Convex hull |
| Variable | Noise $\epsilon$ that has the same size as the samples | $\gamma$, whose size equals the number of anchor points |
| PGS task | Points in the locality of a sample must have a similar embedded representation | Available for almost all PGS tasks |

Table 6: A simple comparison between adversarial examples and manifold attack in general. Note that, in manifold attack, the feasible zone is the whole convex hull in the case of Neighbor Anchors (subsection 4.4). In the case of Random Anchors with the Dirichlet distribution, on the other hand, the feasible zone is the neighborhood of each anchor in the convex hull rather than its center.

We introduced manifold attack as an improved PGS procedure. First, it is more general than adversarial examples (see the comparison in table 6). Second, we confirm empirically a statement from [26]: by using Mix-up as PGS combined with a supervised loss, we enhance generalization and significantly improve adversarial robustness. Third, we show that applying manifold attack to Mix-up further enhances generalization and adversarial robustness. There is a trade-off to be found between generalization and adversarial robustness. However, in our experiments, in the "worst case", we lose less than 1% in generalization for a gain of about 13% in adversarial robustness.

To further improve our method, several directions could be investigated:

  • Optimizing the choice of layers between which PGS is applied. Indeed, we could also implement PGS between one latent representation and another one in NNMs, as in [28].

  • Mode collapse is a common problem when training GAN models. Even when a GAN is trained with samples that are well balanced among classes, the samples produced by the generator can be biased towards only a few classes (as shown in figure 6). This is because the latent representation is not well regularized. We expect to overcome mode collapse by introducing manifold attack from the latent representation back to the original representation, as shown in figure 7, and by optimizing problem (28), where $\mathcal{L}_{t}$ is a PGS task as described in section 3.

Figure 6: Mode collapse observed in a GAN with the MNIST dataset. Left: Samples generated by the generator in GAN. Right: The latent representation of GAN that follows uniform distribution. In this example, the class 1 dominates in the samples generated. Credit: [42].
[Diagram labels: $z\sim p(z)$, generator $G(z)$, $x_{fake}$, $x_{real}\sim p(x)$, discriminator $D(x)$ (real?), manifold attack]
Figure 7: A GAN in which PGS is applied from the latent representation back to the original representation.
$$\min_{G}\max_{D}\max_{z_{1},..,z_{M}\in[0,1]^{p}}\Big(\mathbb{E}_{x\sim p(x)}[\log D(x)]+\mathbb{E}_{z\sim p(z)}[\log(1-D(G(z)))]+\lambda\mathcal{L}_{t}\big(G(z_{1}),..,G(z_{M})\big)\Big) \tag{28}$$

Acknowledgement

This research is supported by the European Community through the grant DEDALE (contract no. 665044) and the Cross-Disciplinary Program on Numerical Simulation of CEA, the French Alternative Energies and Atomic Energy Commission.

References

  • [1] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, may 2015.
  • [2] K Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
  • [3] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [7] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
  • [8] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • [9] Jason Weston and Frédéric Ratle. Deep learning via semi-supervised embedding. In International Conference on Machine Learning, 2008.
  • [10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2014.
  • [11] Takeru Miyato, Shin ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning, 2017.
  • [12] Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? – a comprehensive study on the robustness of 18 deep image classification models, 2019.
  • [13] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy, 2019.
  • [14] Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Constructing unrestricted adversarial examples with generative models, 2018.
  • [15] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization, 2019.
  • [16] Ousmane Amadou Dia, Elnaz Barshan, and Reza Babanezhad. Semantics preserving adversarial learning, 2019.
  • [17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013.
  • [18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, page 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
  • [19] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2016.
  • [20] Wei-An Lin, Chun Pong Lau, Alexander Levine, Rama Chellappa, and Soheil Feizi. Dual manifold adversarial robustness: Defense against lp and non-lp adversarial attacks, 2020.
  • [21] J.B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.
  • [22] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
  • [23] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. SCIENCE, 290:2323–2326, 2000.
  • [24] Geoffrey Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in neural information processing systems, 15:833–840, 2003.
  • [25] L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-sne. Journal of Machine Learning Research, 9(nov):2579–2605, 2008.
  • [26] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017.
  • [27] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning, 2019.
  • [28] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states, 2019.
  • [29] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. CoRR, abs/1905.04899, 2019.
  • [30] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning, 2019.
  • [31] Yui. Pytorch Implementation for Mix Match, 2019. imikushana@gmail.com.
  • [32] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.
  • [33] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242, 2016.
  • [34] Dong hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, 2013.
  • [35] Phi Vu Tran. Exploring self-supervised regularization for supervised and semi-supervised learning, 2019.
  • [36] Antti Tarvainen and Harri Valpola. Weight-averaged consistency targets improve semi-supervised deep learning results. CoRR, abs/1703.01780, 2017.
  • [37] Zhanghan Ke, Daoye Wang, Qiong Yan, Jimmy Ren, and Rynson W. H. Lau. Dual student: Breaking the limits of the teacher in semi-supervised learning, 2019.
  • [38] Varun Nair, Javier Fuentes Alonso, and Tony Beltramelli. Realmix: Towards realistic semi-supervised deep learning algorithms, 2019.
  • [39] Xiao Wang, Daisuke Kihara, Jiebo Luo, and Guo-Jun Qi. Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning, 2019.
  • [40] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring, 2019.
  • [41] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence, 2020.
  • [42] Ngoc-Trung Tran, Tuan-Anh Bui, and Ngai-Man Cheung. Dist-gan: An improved gan using distance constraints, 2018.

Appendix

Projection for sum and positive

Proof, by using Lagrange multiplier, problem (12) becomes:

minγ,μp,λ12i=1pκiγi22+λ(i=1pγic)+i=1pμiγi subject to : μ1,μ2,..,μp0\begin{split}\min_{\gamma,\mu\in\mathbb{R}^{p},\lambda}\frac{1}{2}\sum_{i=1}^{p}\left\lVert\kappa_{i}-\gamma_{i}\right\rVert_{2}^{2}+&\lambda(\sum_{i=1}^{p}\gamma_{i}-c)+\sum_{i=1}^{p}\mu_{i}\gamma_{i}\\ \text{ subject to : }&\mu_{1},\mu_{2},..,\mu_{p}\leq 0\\ \end{split}

We solve the following system of equations:
{γiκi+λ+μi=0i=1pγi=cμiγi=0μi0γi0{λ=1p(i=1pκii=1pμic)γi=κi1p(i=1pκic)p1pμi+1pjiμji=1pγi=cμiγi=0μi0γi0\begin{cases}\gamma_{i}-\kappa_{i}+\lambda+\mu_{i}=0\\ \sum_{i=1}^{p}\gamma_{i}=c\\ \mu_{i}\gamma_{i}=0\\ \mu_{i}\leq 0\\ \gamma_{i}\geq 0\\ \end{cases}\Leftrightarrow\begin{cases}\lambda=\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-\sum_{i=1}^{p}\mu_{i}-c)\\ \gamma_{i}=\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)-\frac{p-1}{p}\mu_{i}+\frac{1}{p}\sum_{j\neq i}\mu_{j}\\ \sum_{i=1}^{p}\gamma_{i}=c\\ \mu_{i}\gamma_{i}=0\\ \mu_{i}\leq 0\\ \gamma_{i}\geq 0\\ \end{cases}

Consider first the case $\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)<0$. From the second equation of the right-hand system, we infer that $\mu_{i}\neq 0$ (because if $\mu_{i}=0$ and $\mu_{j}\leq 0$ as in inequality 5, then $\gamma_{i}<0$, in contradiction to inequality 6). From $\mu_{i}\neq 0$, we infer that $\gamma_{i}=0$ by equation 4.

Consider next the case $\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)=0$. From the second equation of the right-hand system, if $\mu_{i}=0$ then $\gamma_{i}\leq 0$ since $\mu_{j}\leq 0$ as in inequality 5, and together with inequality 6 we infer $\gamma_{i}=0$. If instead $\mu_{i}\neq 0$, then we infer that $\gamma_{i}=0$ by equation 4.

Let $\mathcal{P}=\{i \mid \kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)>0\}$ and $\mathcal{N}=\{i \mid \kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)\leq 0\}$. By the two cases above, $\gamma_{i}=0$ for every $i\in\mathcal{N}$, so we recover exactly the same problem as before, restricted to the active indices in the set $\mathcal{P}$:

$$\begin{cases}\gamma_{i}-\kappa_{i}+\lambda+\mu_{i}=0,\ \forall i\in\mathcal{P}\\ \sum_{i\in\mathcal{P}}\gamma_{i}=c\\ \mu_{i}\gamma_{i}=0,\ \forall i\in\mathcal{P}\\ \mu_{i}\leq 0,\ \forall i\in\mathcal{P}\\ \gamma_{i}\geq 0,\ \forall i\in\mathcal{P}\end{cases}$$

We then repeat this procedure until $\gamma$ satisfies the constraints. Convergence follows from a counting argument: $\gamma$ has exactly $p$ elements $\gamma_{i}$, and each projection that yields a new active set $\mathcal{P}$ strictly reduces the number of active elements. Since this number is a positive integer that decreases at every step, the procedure terminates after at most $p$ iterations. Here is an implementation for multiple $\kappa$ ($\kappa\in\mathbb{R}^{M\times p}$) in PyTorch.

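import torch


def prox_positive_and_sum_constraint(x, c):
    """ x is a 2-dimensional tensor (M x p); each row is projected separately """
    n = x.size()[1]
    # Shift every row so that its entries sum to c.
    k = (c - torch.sum(x, dim=1)) / float(n)
    x_0 = x + k[:, None]
    # Repeat while some entry violates the positivity constraint.
    while len(torch.where(x_0 < 0)[0]) != 0:
        idx_negative = torch.where(x_0 < 0)
        x_0[idx_negative] = 0.
        # Mask of the entries that are still active (strictly positive).
        one = x_0 > 0
        n_0 = one.sum(dim=1)
        # Redistribute the excess over the active entries of each row.
        k_0 = (c - torch.sum(x_0, dim=1)) / n_0
        x_0 = x_0 + k_0[:, None] * one
    return x_0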
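
As a sanity check of the iteration described above, the following is a minimal pure-Python sketch for a single vector $\kappa$. The function name `project_sum_positive` and the example values are illustrative assumptions, not part of the released implementation, and the sketch assumes $c>0$ so that the active set never becomes empty.

```python
def project_sum_positive(kappa, c=1.0):
    """Project one vector onto {gamma_i >= 0, sum_i gamma_i = c}, assuming c > 0."""
    gamma = [float(v) for v in kappa]
    active = list(range(len(gamma)))  # indices still allowed to be non-zero
    while True:
        # Shift the active entries so that they sum to c.
        shift = (c - sum(gamma[i] for i in active)) / len(active)
        for i in active:
            gamma[i] += shift
        negative = [i for i in active if gamma[i] < 0]
        if not negative:
            return gamma
        # Clamp violating entries to zero and drop them from the active set.
        for i in negative:
            gamma[i] = 0.0
        active = [i for i in active if gamma[i] > 0]


# Hypothetical example: the third entry is clamped to zero in the first pass,
# then the remaining excess is removed equally from the two active entries.
gamma = project_sum_positive([0.8, 0.6, -0.4], c=1.0)
print(gamma)  # approximately [0.6, 0.4, 0.0]
```

In this example the loop runs twice, which illustrates why the number of passes is bounded by the number of elements: every pass that does not terminate removes at least one index from the active set.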

Manifold attack for embedded representation

Architecture of model $g()$ (PyTorch style):

  • S curve data: Conv1d[1,4,2] → ReLU → Conv1d[4,4,2] → ReLU → Flatten → Fc[4,2].

  • Digit data: Conv2d[1,8,3] → ReLU → Conv2d[8,16,3] → ReLU → Flatten → Fc[64,2].
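
To make the layer notation concrete, the sketch below traces feature sizes through the S curve model. It rests on several assumptions that the text does not state: the first convolution is read as 1 input channel, 4 output channels and kernel size 2 (and the second as 4, 4 and 2), both use stride 1 without padding, and each S curve sample is fed as a single-channel sequence of length 3. The helper names are hypothetical.

```python
def conv_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a convolution along one axis (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def trace_s_curve(input_len=3):
    """Return the flattened feature size that reaches the final Fc layer."""
    length = conv_out_len(input_len, kernel_size=2)  # first Conv1d: 3 -> 2
    length = conv_out_len(length, kernel_size=2)     # second Conv1d: 2 -> 1
    out_channels = 4                                 # channels after the second Conv1d
    return out_channels * length                     # Flatten


flat = trace_s_curve()
print(flat)  # 4
```

Under these assumptions the flattened size is 4, which agrees with the input size of the final fully connected layer mapping 4 features to the 2-dimensional embedding.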

Optimizer: stochastic gradient descent, with learning rate $lr=0.001$ and momentum = 0.9. The learning rate is reduced by $lr=lr^{0.5}$ after every 10 epochs. The number of epochs is 40.

Robustness to adversarial examples

Hyper-parameters: optimizer = stochastic gradient descent, number of epochs = 300, learning rate = 0.1, momentum = 0.9, learning rate reduced by $lr=0.1\,lr$ after every 75 epochs, batch size = 200, weight decay = 0.0001.
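
The step decay listed above can be written as a small function. This is a sketch under the assumption that epochs are counted from 0 and that the decay is applied at epochs 75, 150 and 225; the name `step_lr` is illustrative. The rule behaves like PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=75` and `gamma=0.1`, although the text does not say which scheduler was actually used.

```python
def step_lr(epoch, base_lr=0.1, step_size=75, gamma=0.1):
    """Learning rate in effect during the given (0-indexed) epoch."""
    return base_lr * gamma ** (epoch // step_size)


# Print the learning rate just before and after each decay point.
schedule = {e: step_lr(e) for e in (0, 74, 75, 150, 225, 299)}
for epoch, lr in schedule.items():
    print(epoch, lr)
```

Over the 300 epochs this gives four plateaus, from 0.1 down to roughly 0.0001 (up to floating-point rounding).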

Semi-supervised manifold attack

Hyper-parameters: optimizer = Adam, number of epochs = 1024, learning rate = 0.002, $\alpha=0.75$, labelled batch size = unlabelled batch size = 64, $T=0.5$ (in sharpening), $\lambda=75$ (linearly ramped up from 0), EMA = 0.999, validation error computed after every 1024 batches.
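
Two of these settings describe update rules rather than constants, so a short sketch may help. It assumes that the ramp-up of $\lambda$ spans the whole training run (the text does not give its length), that one epoch corresponds to 1024 batches, and that EMA = 0.999 is the decay of an exponential moving average of the model parameters. The function names are hypothetical.

```python
def ramp_up_weight(step, total_steps, max_weight=75.0):
    """Linearly increase the weight lambda from 0 to max_weight."""
    return max_weight * min(1.0, step / total_steps)


def ema_update(ema_params, params, decay=0.999):
    """Return the exponential moving average of the parameters after one step."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


total_steps = 1024 * 1024  # assumed: 1024 epochs of 1024 batches each
print(ramp_up_weight(0, total_steps))                 # 0.0
print(ramp_up_weight(total_steps // 2, total_steps))  # 37.5
print(ema_update([1.0], [0.0]))                       # [0.999]
```

With a decay of 0.999, each update moves the averaged parameters only 0.1% of the way towards the current ones, so the averaged model changes slowly compared with the trained one.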

To make an experiment reproducible, we define the function seed_ as:

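import random

import numpy as np
import torch


def seed_(p):
    """ seed every random number generator, for reproducibility """
    torch.manual_seed(p)
    np.random.seed(p)
    random.seed(p)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(p)
        # Use deterministic cuDNN kernels and disable auto-tuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    return 0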

The four experiments 1, 2, 3 and 4 in Table 4 are launched with seed_(0), seed_(1), seed_(2) and seed_(3), respectively.

In Mix-up Adversarial, $n\_iters$ is fixed at 1. On CIFAR-10, $\xi$ starts at 0.1 and decreases linearly to 0.01 after 1024 epochs. On SVHN, $\xi$ starts at 0.1 and decreases linearly to 0.01 after 1024 epochs for seed_(0) and seed_(3); it starts at 0.1 and decreases linearly to 0.001 after 1024 epochs for seed_(2); and it starts at 0.01 and decreases linearly to 0.001 after 1024 epochs for seed_(1).
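
The schedules for $\xi$ can be summarised in a few lines. This is a sketch that assumes $\xi$ is updated once per epoch (the text does not say whether it is updated per epoch or per batch); the names `linear_decay` and `SVHN_XI` are illustrative.

```python
def linear_decay(epoch, start, end, total_epochs=1024):
    """Linearly interpolate from start (epoch 0) to end (epoch total_epochs)."""
    t = min(max(epoch / total_epochs, 0.0), 1.0)
    return start + (end - start) * t


# (start, end) values of xi for each seed on SVHN, as listed in the text.
SVHN_XI = {0: (0.1, 0.01), 1: (0.01, 0.001), 2: (0.1, 0.001), 3: (0.1, 0.01)}

for seed, (start, end) in SVHN_XI.items():
    print(seed, linear_decay(0, start, end), linear_decay(1024, start, end))
```

On CIFAR-10 the same function applies with `start=0.1` and `end=0.01` for every seed.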