Manifold attack

Khanh-Hung Tran
Institut LIST, CEA
Université Paris-Saclay
Gif-sur-Yvette, 91191
khanh-hung.tran@cea.fr
Fred Maurice Ngole Mboula
Institut LIST, CEA
Université Paris-Saclay
Gif-sur-Yvette, 91191
fred-maurice.ngole-mboula@cea.fr
Jean-Luc Starck
Astrophysics Department, CEA
Université Paris-Saclay
Gif-sur-Yvette, 91191
jean-luc.starck@cea.fr

Abstract

Machine Learning in general, and Deep Learning in particular, have gained much interest over the last decade and have shown significant performance improvements on many Computer Vision and Natural Language Processing tasks. In this paper, we introduce manifold attack, a combination of manifold learning and adversarial learning, which aims at improving the regularization of neural network models. We show that applying manifold attack significantly improves robustness to adversarial examples and even slightly improves the accuracy rate. Several implementations of our work can be found at https://github.com/ktran1/Manifold-attack.

Index terms Deep learning, Robustness, Manifold learning, Adversarial learning, Geometric structure, Semi-Supervised learning.

1 Introduction

Deep Learning (DL) [1] was first introduced by Alexey Ivakhnenko (1967); its derivative, the Convolutional Neural Network [2], was introduced by Fukushima (1980) and later improved and refined by Yann LeCun [3] (1998). Since then, deep neural networks have yielded groundbreaking performance on various classification tasks [4, 5, 6, 7, 8]. However, training deep neural networks involves different regularization techniques, which generally serve two essential goals: generalization and adversarial robustness. On the one hand, regularization for generalization aims at improving the accuracy rate on data that have not been used for training. In particular, this regularization is critical when the number of training samples is relatively small compared to the complexity of the neural network model. On the other hand, regularization for adversarial robustness aims at improving the accuracy rate with respect to adversarial samples.

In the last few years, much research in machine learning has focused on obtaining models with high adversarial robustness without loss of generalization. Our work focuses on the preservation of geometric structure (PGS) between the original representation of data and a latent representation produced by a neural network model (NNM). PGS is part of manifold learning techniques: we assume that the data of interest lie on a low-dimensional piecewise smooth manifold, with respect to which we want to compute embeddings of the original samples. In neural network models (NNMs), the outputs of the hidden layers are considered to be candidate low-dimensional embeddings. For classification tasks, combining PGS as a regularization loss with a supervised loss was shown to improve generalization ability by Weston et al. in 2008 [9]. The concept of adversarial samples was later introduced by Goodfellow et al. [10] in 2014, as NNM architectures became deeper and hence more complex. Adversarial robustness in NNMs is closely related to PGS, since PGS for NNMs precisely aims at ensuring that the NNM gives similar outputs for close inputs. Figure 1 shows an illustration of PGS including a failure example. The black point can be considered an adversarial example for the dimension reduction model.

Figure 1: Preservation of geometric structure and a failure example. On the left, the original representation of the S curve data (3 dimensions). On the right, an embedding of these data in 2 dimensions. The similarity between samples is indicated by different colors.

The novelty of this paper is to reinforce PGS by introducing synthetic samples as supplementary data. We make use of adversarial learning methodology to compute relevant synthetic samples. Then, these synthetic samples are added to the original real samples for training the model $f_{\theta}$ which is expected to improve adversarial robustness.

Contributions:

• We present the concept of manifold attack to reinforce the PGS task.
• We show empirically that by applying manifold attack to a classification task, NNMs improve in both adversarial robustness and generalization.

The outline of the paper is as follows. In section 2, we present related works that aim to increase adversarial robustness of NNMs. In section 3, we revisit some classical PGS methods. In section 4, we present manifold attack. The numerical experiments are presented and discussed in section 5. Our conclusions and perspectives are presented in section 6.

2 Related works

In this section, we present several strategies to reinforce adversarial robustness, sorting them into two categories.

The first category is related to off-manifold adversarial examples, i.e. adversarial examples that lie off the data manifold. In a supervised setting, these adversarial examples are used as additional data for training the model [10]. In a semi-supervised setting, these adversarial examples are used to enhance the consistency between representations [11]. In plain words, a sample and its corresponding adversarial example must have similar representations at the output of the model. However, some works [12, 13] state that adversarial robustness with respect to off-manifold adversarial examples is in conflict with generalization. In other words, improving adversarial robustness is detrimental to generalization and vice versa.

In the second category, adversarial robustness is reinforced with respect to on-manifold adversarial examples [14, 15, 16]. In all these works, a generative model such as a VAE [17], a GAN [18] or a VAE-GAN [19] is first learnt, so that one can generate a sample on the data manifold from parameters in a latent space. These works differ in the technique used to create adversarial noise in the latent space to produce on-manifold adversarial examples. Finally, these on-manifold adversarial examples are used as additional data for training the model. Interestingly, [15] states that by using on-manifold adversarial examples, adversarial robustness and generalization can be jointly enhanced. We note that a hybrid approach has been proposed in [20], where both off-manifold and on-manifold adversarial examples are used as supplementary data for training the model.

Our method falls into the second category, since we want to enhance both adversarial robustness and generalization in both supervised and semi-supervised settings. However, the adversarial examples are generated using a different paradigm.

3 Preservation of geometric structure

In most cases, a PGS process has two stages: extracting properties of the original representation, then computing an embedding which preserves these properties in a low-dimensional space. Given a data set $\mathcal{X}=\{\mathbf{x}_{1},...,\mathbf{x}_{N}\}$ , $\mathbf{x}_{i}\in\mathbb{R}^{n}$ and its embeddings set $\mathcal{A}=\{\mathbf{a}_{1},...,\mathbf{a}_{N}\},\mathbf{a}_{i}\in\mathbb{R}^{p}$ , where $\mathbf{a}_{i}$ is the embedded representation of $\mathbf{x}_{i}$ , we denote by

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c}),

(1)

the embedding loss, where $\mathcal{A}_{i}^{c}$ is the complement of $\{\mathbf{a}_{i}\}$ in $\mathcal{A}$ . The objective function of PGS is then defined as:

\operatorname*{min}_{\mathcal{A}}\mathcal{L}_{t}=\operatorname*{min}_{\mathcal{A}}\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})

(2)

We present some popular embedding losses hereafter:

- Multi-Dimensional scaling (MDS) [21]:

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}(d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})-d_{x}(\mathbf{x}_{i},\mathbf{x}_{j}))^{2},

(3)

where $d_{a}()$ and $d_{x}()$ are measures of dissimilarity. By default, they are both Euclidean distances. This method aims at preserving pairwise distances from the original representation in the embedding space.
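As an illustration, the MDS loss of equation (3), summed over all samples as in equation (2), can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: samples are stored as rows (the text stores them as columns), both metrics are Euclidean, and the names `pairwise_dist` and `mds_total_loss` are hypothetical helpers introduced here.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix between the rows of Z (shape N x dim)."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mds_total_loss(X, A):
    """Sum over i of the MDS embedding loss of eq. (3), Euclidean d_x and d_a."""
    gap = pairwise_dist(A) - pairwise_dist(X)
    # The diagonal (j = i) is zero by construction, so it adds nothing.
    return float((gap ** 2).sum())

rng = np.random.default_rng(0)
X_flat = np.zeros((6, 3))
X_flat[:, :2] = rng.normal(size=(6, 2))    # toy samples lying in a plane of R^3

loss_isometric = mds_total_loss(X_flat, X_flat[:, :2])      # distances preserved
loss_stretched = mds_total_loss(X_flat, 2 * X_flat[:, :2])  # distances doubled
```

An embedding that preserves all pairwise distances gives a zero loss, while any rescaling of it is penalized.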

- Laplacian eigenmaps (LE) [22]:

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})d_{a}(\mathbf{a}_{i},\mathbf{a}_{j}),

(4)

where $d_{x}()$ is a measure of similarity (for instance $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})=\text{exp}\big{(}\frac{-\left\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\right\rVert_{2}^{2}}{2\sigma_{i}^{2}}\big{)}$ ) and $d_{a}()$ is a measure of dissimilarity (for instance $d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})=\left\lVert\mathbf{a}_{i}-\mathbf{a}_{j}\right\rVert_{2}^{2}$ ). This method learns the manifold structure by emphasizing the preservation of local distances. In order to further reduce the effect of large distances, in some papers $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})$ is set directly to zero if $\mathbf{x}_{j}$ is not among the $k$ nearest neighbors of $\mathbf{x}_{i}$ , or vice versa if $\mathbf{x}_{i}$ is not among the $k$ nearest neighbors of $\mathbf{x}_{j}$ . Alternatively, one can set $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})=0$ if $\left\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\right\rVert_{2}^{2}>\kappa$ . Let $\mathbf{W}$ be the matrix defined by $\mathbf{W}_{ij}=d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})$ ; $\mathbf{W}$ is symmetric if $d_{x}()$ is symmetric. The objective function of the Laplacian eigenmaps method can then be written in matrix form:

\mathcal{L}_{t}=\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{i=1}^{N}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}\mathbf{W}_{ij}\left\lVert\mathbf{a}_{i}-\mathbf{a}_{j}\right\rVert_{2}^{2}=2\operatorname{tr}(\mathbf{A}\mathbf{L}\mathbf{A}^{\top}),

(5)

where $\mathbf{L}=\mathbf{D}-\mathbf{W}$ and $\mathbf{D}$ is the diagonal matrix with $\mathbf{D}_{ii}=\sum_{j=1}^{N}\mathbf{W}_{ij}$ . $\mathbf{L}$ is a graph Laplacian matrix: it is symmetric, each of its rows sums to 0, and its off-diagonal elements are non-positive.
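Identity (5) can be checked numerically on random data. The sketch below assumes a Gaussian similarity with a single bandwidth `sigma` (the text allows a per-sample $\sigma_{i}$, which would make $\mathbf{W}$ non-symmetric in general) and stores samples as columns, as in the text; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p = 8, 3, 2
X = rng.normal(size=(n, N))   # original samples as columns
A = rng.normal(size=(p, N))   # arbitrary embeddings as columns

sigma = 1.0                   # single bandwidth (assumption)
sq_x = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
W = np.exp(-sq_x / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)      # the sum in eq. (5) excludes j = i
D = np.diag(W.sum(axis=1))
L = D - W

sq_a = ((A[:, :, None] - A[:, None, :]) ** 2).sum(axis=0)
pairwise_form = (W * sq_a).sum()              # double sum in eq. (5)
trace_form = 2.0 * np.trace(A @ L @ A.T)      # matrix form in eq. (5)
```

Both quantities agree up to floating-point error, which holds for any symmetric $\mathbf{W}$.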

- Locally Linear Embedding (LLE) [23]:

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\left\lVert\mathbf{a}_{i}-\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}\lambda_{ij}\mathbf{a}_{j}\right\rVert_{2}^{2},

(6)

where $\lambda_{ij}$ are determined by solving the following problem:

\begin{split}&\operatorname*{min}_{\lambda_{ij}}\left\lVert\mathbf{x}_{i}-\sum\limits_{j}\lambda_{ij}\mathbf{x}_{j}\right\rVert_{2}^{2},\\ \text{subject to: }&\begin{cases}\sum\limits_{j}\lambda_{ij}=1,\text{ if }\mathbf{x}_{j}\in\text{knn}(\mathbf{x}_{i}),\\ \lambda_{ij}=0\text{ if not.}\\ \end{cases}\\ \end{split}

(7)

where knn( $\mathbf{x}_{i}$ ) denotes the set containing the indices of the $k$ nearest neighbors (in Euclidean distance) of the sample $\mathbf{x}_{i}$ . Assuming that the observed data $\mathcal{X}$ are sampled from a smooth manifold and provided that the sampling is dense enough, one can assume that the samples lie locally on linear patches. Thus, LLE first computes the barycentric coordinates of each sample w.r.t. its nearest neighbors. These barycentric coordinates characterize the local geometry of the underlying manifold. Then, LLE computes a low-dimensional representation (embedding) which is compatible with these local barycentric coordinates. Introducing a matrix representation $\mathbf{V}\in\mathbb{R}^{N\times N}$ of $\lambda$ , with $\mathbf{V}[j,i]=\lambda_{ij}$ , the loss $\mathcal{L}_{t}=\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})$ can be rewritten as $\mathcal{L}_{t}=\left\lVert\mathbf{A}-\mathbf{A}\mathbf{V}\right\rVert_{F}^{2}=\operatorname{tr}(\mathbf{A}\mathbf{L}\mathbf{A}^{\top})$ , where $\mathbf{L}=(\mathbf{I}_{N}-\mathbf{V})(\mathbf{I}_{N}-\mathbf{V})^{\top}=\mathbf{I}_{N}-\mathbf{V}-\mathbf{V}^{\top}+\mathbf{V}\mathbf{V}^{\top}$ . Thus, the loss $\mathcal{L}_{t}$ can be interpreted as a Laplacian eigenmaps loss, based on an implicit metric $d_{x}$ for measuring the distance between two samples.
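The two LLE stages can be sketched as follows. This is only an illustration under assumptions: the barycentric weights of problem (7) are obtained by solving the local Gram system, with a small ridge term `reg` added for numerical stability (a common practical choice, not stated in the text), and `lle_weights` and `lle_loss` are hypothetical helper names.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """X: (n, N), samples as columns. Returns V with V[j, i] = lambda_ij."""
    N = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    V = np.zeros((N, N))
    for i in range(N):
        order = np.argsort(sq[i])
        nbrs = order[order != i][:k]           # k nearest neighbors of x_i
        Z = X[:, nbrs] - X[:, [i]]             # neighbors centered on x_i
        G = Z.T @ Z                            # local Gram matrix (k x k)
        G += reg * np.trace(G) * np.eye(k)     # ridge term (assumption)
        w = np.linalg.solve(G, np.ones(k))
        V[nbrs, i] = w / w.sum()               # weights sum to one
    return V

def lle_loss(A, V):
    """Total LLE loss: squared Frobenius norm of A - A V."""
    return float(np.linalg.norm(A - A @ V, "fro") ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 12))
A = rng.normal(size=(2, 12))
V = lle_weights(X, k=3)
total = lle_loss(A, V)
```

Column $i$ of `V` holds the weights of sample $i$, so `A @ V` stacks the reconstructions $\sum_{j}\lambda_{ij}\mathbf{a}_{j}$ as columns.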

- Contrastive loss :

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}\Big{(}d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})+(1-d_{x}(\mathbf{x}_{i},\mathbf{x}_{j}))\text{ max}(0,\tau-d_{a}(\mathbf{a}_{i},\mathbf{a}_{j}))\Big{)},

where $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})$ is a discrete similarity metric which is equal to 1 if $\mathbf{x}_{j}$ is in the neighborhood of $\mathbf{x}_{i}$ and 0 otherwise, and $d_{a}()$ is a measure of dissimilarity.

- Stochastic Neighbor Embedding (SNE) [24]:

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\sum_{\mathbf{a}_{j}\in\mathcal{A}_{i}^{c}}P_{ij}\text{log}\frac{P_{ij}}{Q_{ij}},

where $P_{ij}=\frac{d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})}{\sum_{k\neq i}d_{x}(\mathbf{x}_{i},\mathbf{x}_{k})}$ and $Q_{ij}=\frac{d_{a}(\mathbf{a}_{i},\mathbf{a}_{j})}{\sum_{k\neq i}d_{a}(\mathbf{a}_{i},\mathbf{a}_{k})}$ , and $d_{x}$ and $d_{a}$ are both similarity metrics. The objective of this method is to preserve the similarity between two distributions of pairwise distances, one in the original representation and the other in the embedded representation, by minimizing the Kullback–Leibler (KL) divergence between them.
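The SNE loss can be sketched as below. The sketch assumes, for illustration only, that both $d_{x}$ and $d_{a}$ are Gaussian similarities with the same bandwidth and that samples are stored as rows; the function names are hypothetical.

```python
import numpy as np

def row_normalized_similarity(Z, sigma=1.0):
    """Rows of Z are points. Entry (i, j) is d(z_i, z_j) / sum_{k != i} d(z_i, z_k)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)                   # exclude k = i from the sums
    return S / S.sum(axis=1, keepdims=True)

def sne_total_loss(X, A, sigma=1.0):
    """Sum over i of the KL divergence between row i of P and row i of Q."""
    P = row_normalized_similarity(X, sigma)
    Q = row_normalized_similarity(A, sigma)
    off_diag = ~np.eye(len(X), dtype=bool)     # only pairs with j != i
    return float((P[off_diag] * np.log(P[off_diag] / Q[off_diag])).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
loss_identity = sne_total_loss(X, X)           # same geometry on both sides
loss_projected = sne_total_loss(X, X[:, :2])   # one coordinate dropped
```

The loss vanishes when the two similarity distributions coincide and is strictly positive otherwise, as expected from a KL divergence.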

Traditionally, PGS finds its applications in dimensionality reduction or data visualization, which refers to the techniques used to help the analyst see the underlying structure of data and explore it. For instance, [25] proposes a variant of SNE, which has been used in a wide range of fields. Classical nonlinear PGS methods do not require a mapping model, that is, a function $g()$ with trainable parameters that maps a sample $\mathbf{x}$ to its embedded representation $\mathbf{a}$ as $\mathbf{a}=g(\mathbf{x})$ . Instead, $\mathbf{a}$ is directly used as the optimization variable.

Finally, it is worth noting that for several PGS methods such as LE and LLE, some additional constraints are required to avoid a trivial solution (for instance, all embedded points collapsing into a single point). Usually, mean and covariance constraints are applied:

\begin{gathered}\mathrm{m}(\mathbf{A})=[\mu(\mathbf{A}[1,:]),..,\mu(\mathbf{A}[p,:])]^{\top}=\mathbb{0}\\ \text{Cov}(\mathbf{A},\mathbf{A})=\big{(}\mathbf{A}-\mathrm{M}(\mathbf{A})\big{)}\big{(}\mathbf{A}-\mathrm{M}(\mathbf{A})\big{)}^{\top}=\mathbf{I}_{p}\\ \end{gathered}

(8)

4 Manifold attack

We introduce a new strategy called “manifold attack” to reinforce PGS methods with mapping model, by combining with adversarial learning. In the following $g()$ denotes a differentiable function that maps a sample $\mathbf{x}\in\mathbb{R}^{n}$ to its embedded representation $\mathbf{a}\in\mathbb{R}^{p}$ .

4.1 Individual attack point versus data points

We define a virtual point as a synthetic sample generated in such a way that it is likely to lie on the manifold underlying the observed samples. An anchor point is a sample used for generating a virtual point. An attack point is a virtual point that locally maximizes a chosen measure of model distortion. For example, an attack point can be a sample perturbed with adversarial noise.

We use the same notation as in section 3. Given a dataset $\mathcal{X}=\{\mathbf{x}_{1},...,\mathbf{x}_{N}\}$ and the corresponding embedded set $\mathcal{A}=\{\mathbf{a}_{1},...,\mathbf{a}_{N}\}$ , $\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})$ denotes the embedding loss, $\mathcal{A}_{i}^{c}$ being the complement of $\{\mathbf{a}_{i}\}$ in $\mathcal{A}$ . We consider the objective function of PGS defined as $\mathcal{L}_{t}=\sum_{i=1}^{N}\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})$ . Let’s consider $p$ anchor points $\mathbf{z}_{1},\mathbf{z}_{2},...,\mathbf{z}_{p}\in\mathbb{R}^{n}$ , a virtual point $\tilde{\mathbf{x}}\in\mathbb{R}^{n}$ is defined as:

\begin{split}&\tilde{\mathbf{x}}=\gamma_{1}\mathbf{z}_{1}+\gamma_{2}\mathbf{z}_{2}+...+\gamma_{p}\mathbf{z}_{p},\\ \text{ subject to: }&\gamma_{1},\gamma_{2},..,\gamma_{p}\geq 0,\\ &\gamma_{1}+\gamma_{2}+...+\gamma_{p}=1.\end{split}

(9)

The anchor points $\mathbf{z}_{i}$ define a region or feasible zone, in which a virtual point $\tilde{\mathbf{x}}$ must be located and $\gamma=[\gamma_{1},\gamma_{2},..,\gamma_{p}]$ is the coordinate of $\tilde{\mathbf{x}}$ . Anchor points are sampled from the dataset $\mathcal{X}$ with different strategies, which are defined according to predefined rules (see section 4.4 for several examples). Figure 2 shows an example of anchor points setting and relations between points. For the embedding $\mathbf{a}_{i}$ of observed sample $\mathbf{x}_{i}$ , $\mathbf{a}_{i}=g(\mathbf{x}_{i})$ , the embedding loss is defined as:

\mathcal{L}_{e}(\mathbf{a}_{i},\mathcal{A}_{i}^{c})=\mathcal{L}_{e}\big{(}g(\mathbf{x}_{i}),\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}\backslash\{g(\mathbf{x}_{i})\}\big{)}.

(10)

Similarly, for the embedding $\tilde{\mathbf{a}}$ of virtual point $\tilde{\mathbf{x}}$ , $\tilde{\mathbf{a}}=g(\tilde{\mathbf{x}})$ , the embedding loss is defined as:

\mathcal{L}_{e}(\tilde{\mathbf{a}},\mathcal{A})=\mathcal{L}_{e}\big{(}g(\tilde{\mathbf{x}}),\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}\big{)}.

(11)

Algorithm 1 describes the computation of an attack point. It consists in finding the local coordinates $\gamma$ that maximize the embedding loss $\mathcal{L}_{e}(\tilde{\mathbf{a}},\mathcal{A})$ for the current state of the model $g()$ . Hence, $\gamma$ is estimated through a projected gradient ascent.

Figure 2: Virtual point and anchor points illustration. The three anchor points ( $\mathbf{z}_{i},i\in\{1,2,3\}$ ) are computed as $\mathbf{z}_{i}=\mu([\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}])+s(\mathbf{x}_{i}-\mu([\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}]))$ , where $\mu([\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}])=\frac{\mathbf{x}_{1}+\mathbf{x}_{2}+\mathbf{x}_{3}}{3}$ . The dotted lines represent the zone defined by the anchor points $\mathbf{z}_{i}$ , within which the virtual point $\tilde{\mathbf{x}}$ necessarily lies. The parameter $s$ determines whether the anchor points lie strictly within the convex hull of $[\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}]$ ( $1>s>0$ ) or strictly outside it ( $s>1$ ). The dashed lines represent the coordinates $\gamma$ of $\tilde{\mathbf{x}}$ . The solid lines illustrate the relatedness of $\tilde{\mathbf{x}}$ to the data points, which the embedding will try to preserve by minimizing $\mathcal{L}_{e}(g(\tilde{\mathbf{x}}),\{g(\mathbf{x}_{1}),g(\mathbf{x}_{2}),g(\mathbf{x}_{3})\})$ in this case.

Algorithm 1 Individual manifold attack

Require: Anchor points $\{\mathbf{z}_{1},..,\mathbf{z}_{p}\}$ , data points $\{\mathbf{x}_{1},..,\mathbf{x}_{N}\}$ , embedding loss $\mathcal{L}_{e}()$ , model $g()$ , $\xi,n\_iters$
Initialize: $\gamma\in\mathbb{R}^{p},\gamma=[\gamma_{1},..,\gamma_{p}]$ satisfying the constraints in eq. 9
$\tilde{\mathbf{x}}=\gamma_{1}\mathbf{z}_{1}+\gamma_{2}\mathbf{z}_{2}+...+\gamma_{p}\mathbf{z}_{p}$
for $i=1$ to $n\_iters$ do
    $L=\mathcal{L}_{e}\big{(}g(\tilde{\mathbf{x}}),\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}\big{)}$
    $\gamma\leftarrow\gamma+\xi\nabla_{\gamma}L(\tilde{\mathbf{x}})$
    $\gamma\leftarrow\Pi_{ps}(\gamma)$
    $\tilde{\mathbf{x}}=\gamma_{1}\mathbf{z}_{1}+\gamma_{2}\mathbf{z}_{2}+...+\gamma_{p}\mathbf{z}_{p}$
end for
Output: $\tilde{\mathbf{x}}$

In order to guarantee the constraints in equation (9), we use the projector $\Pi_{ps}$ defined by the problem:

\begin{split}&\min_{\gamma\in\mathbb{R}^{p}}\frac{1}{2}\left\lVert\kappa-\gamma\right\rVert_{2}^{2},\\ \text{ subject to: }&\gamma_{1},\gamma_{2},..,\gamma_{p}\geq 0,\\ &\gamma_{1}+\gamma_{2}+...+\gamma_{p}=c,(c>0).\\ \end{split}

(12)

This constrained convex problem can be solved quickly by a simple sequential projection that alternates between the sum constraint and the positivity constraint (algorithm 2). The proof follows from the Lagrange multiplier method; a simple proof can be found in the Appendix.

Algorithm 2 Projection for positive and sum constraint $\Pi_{ps}$

Require: $\kappa\in\mathbb{R}^{p}$ , $c=1$ (by default).
$\delta=(c-\sum_{i=1}^{p}\kappa_{i})/p$
$\gamma_{i}\leftarrow\kappa_{i}+\delta,\forall i=1,..,p$
while $\exists i\in\{1,..,p\}:\gamma_{i}<0$ do
    $\mathcal{P}=\{i|\gamma_{i}>0\}$ and $\mathcal{N}=\{i|\gamma_{i}<0\}$
    $\gamma_{i}\leftarrow 0,\forall i\in\mathcal{N}$
    $\delta=(c-\sum_{i\in\mathcal{P}}\gamma_{i})/|\mathcal{P}|$
    $\gamma_{i}\leftarrow\gamma_{i}+\delta,\forall i\in\mathcal{P}$
end while
Output: $\gamma=[\gamma_{1},\gamma_{2},..,\gamma_{p}]$
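The alternating projection of algorithm 2 can be sketched in NumPy as follows. The function name `project_ps` is a hypothetical choice, and the sketch reads the first update as starting from the input $\kappa$.

```python
import numpy as np

def project_ps(kappa, c=1.0):
    """Project kappa onto {gamma >= 0, sum(gamma) = c} by alternating projections."""
    gamma = np.asarray(kappa, dtype=float).copy()
    gamma += (c - gamma.sum()) / gamma.size      # enforce the sum constraint
    while (gamma < 0).any():
        positive = gamma > 0                     # set P, fixed before clipping
        gamma[gamma < 0] = 0.0                   # set N is clipped to zero
        # Redistribute the excess over the strictly positive coordinates only.
        gamma[positive] += (c - gamma[positive].sum()) / positive.sum()
    return gamma

gamma = project_ps([0.8, 0.6, -0.4])   # approximately [0.6, 0.4, 0.0]
unchanged = project_ps([0.2, 0.3, 0.5])
```

A coordinate that has been clipped to zero never re-enters the positive set, so the loop runs at most $p$ times.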

By default, each set of anchor points has one attack point. However, we can generate more than one attack point for the same set of anchor points by using different initializations of $\gamma$ , so as to find different local maxima. The double embedding loss, on the observed samples and on the attack points, is expected to enforce the smoothness of the model $g()$ over the underlying manifold, including in areas of low sample density. The general optimization scheme goes as follows: we alternate between attack stages and model update stages until convergence. In the attack stage, we optimize the attack points through $\gamma$ while fixing the model $g()$ ; in the model update stage, we optimize the model $g()$ while fixing the attack points.
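The attack stage for a single attack point can be sketched as below. This is a toy sketch, not the reference implementation: the model `g` is a fixed random linear map, the embedding loss is the MDS loss of equation (3), and the gradient with respect to $\gamma$ is approximated by central finite differences instead of automatic differentiation. All function names are hypothetical.

```python
import numpy as np

def project_ps(kappa, c=1.0):
    """Projection onto {gamma >= 0, sum(gamma) = c} (alternating projections)."""
    gamma = np.asarray(kappa, dtype=float).copy()
    gamma += (c - gamma.sum()) / gamma.size
    while (gamma < 0).any():
        positive = gamma > 0
        gamma[gamma < 0] = 0.0
        gamma[positive] += (c - gamma[positive].sum()) / positive.sum()
    return gamma

def attack_loss(gamma, anchors, X, g):
    """MDS embedding loss of the virtual point against all data points."""
    x_virtual = gamma @ anchors                       # convex combination, eq. (9)
    d_x = np.linalg.norm(X - x_virtual, axis=1)
    d_a = np.linalg.norm(g(X) - g(x_virtual), axis=1)
    return float(((d_a - d_x) ** 2).sum())

def numerical_grad(f, gamma, eps=1e-6):
    """Central finite-difference gradient (stand-in for autodiff)."""
    grad = np.zeros_like(gamma)
    for i in range(gamma.size):
        step = np.zeros_like(gamma)
        step[i] = eps
        grad[i] = (f(gamma + step) - f(gamma - step)) / (2 * eps)
    return grad

def individual_attack(anchors, X, g, xi=0.1, n_iters=2, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    gamma = rng.uniform(size=len(anchors))
    gamma /= gamma.sum()                              # feasible initialization
    for _ in range(n_iters):
        grad = numerical_grad(lambda v: attack_loss(v, anchors, X, g), gamma)
        gamma = project_ps(gamma + xi * grad)         # ascent step, then projection
    return gamma @ anchors, gamma

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                          # data points as rows
W = rng.normal(size=(2, 3))
g = lambda Z: Z @ W.T                                 # toy linear model
anchors = X[:3]
x_attack, gamma = individual_attack(anchors, X, g, rng=rng)
```

In a full training loop, this stage would alternate with a model update stage in which the parameters of `g` are optimized while `x_attack` is held fixed.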

4.2 Attack points as data augmentation

In algorithm 1, an attack point only interacts with observed samples. In the general manifold attack (algorithm 3), attack points and observed samples are treated identically in the model update stage. Generating attack points can hence be considered a data augmentation technique. We denote by $\mathcal{B}$ the set that contains all embedded points (both attack points and observed samples). $\mathcal{B}_{s}$ is a random subset of $\mathcal{B}$ , used as a batch for batch optimization. In each step, only attack points from the current batch are used to distort the manifold by maximizing the batch loss $L$ .

Algorithm 3 Manifold attack

Require: Data points $\{\mathbf{x}_{1},..,\mathbf{x}_{N}\}$ , embedding loss $\mathcal{L}_{e}()$ , model $g()$ , $\xi$ , $n\_iters$ , an anchoring rule.
Initialize: $g()$
Create $M$ sets of anchor points $\{\mathbf{z}_{1}^{k},..,\mathbf{z}_{p}^{k}\},\forall k=1,..,M$ by the anchoring rule
for $epoch=1$ to $n\_epoch$ do
    Initialize $\gamma^{k}\in\mathbb{R}^{p}$ satisfying the constraints in eq. 9, $\tilde{\mathbf{x}}^{k}=\gamma_{1}^{k}\mathbf{z}_{1}^{k}+\gamma_{2}^{k}\mathbf{z}_{2}^{k}+...+\gamma_{p}^{k}\mathbf{z}_{p}^{k},\forall k=1,..,M$
    Set $\mathcal{B}=\{g(\tilde{\mathbf{x}}^{1}),..,g(\tilde{\mathbf{x}}^{M})\}\cup\{g(\mathbf{x}_{1}),..,g(\mathbf{x}_{N})\}$ and divide it into subsets $\mathcal{B}_{s}$
    for each $\mathcal{B}_{s}$ do
        $L=\sum_{\mathbf{a}\in\mathcal{B}_{s}}\mathcal{L}_{e}\big{(}\mathbf{a},\mathcal{B}_{s}\backslash\{\mathbf{a}\}\big{)}$
        Update $\{\tilde{\mathbf{x}}^{i}|g(\tilde{\mathbf{x}}^{i})\in\mathcal{B}_{s}\}$ to maximize $L$ by algorithm 4
        Update $g()$ to minimize $L$
    end for
end for
Output: $g()$

Algorithm 4 Virtual points update

Require: $m$ sets of anchor points $\{\mathbf{z}_{1}^{k},..,\mathbf{z}_{p}^{k}\}$ , $\gamma^{k}$ , $\forall k=1,..,m$ , loss $L$ , $\xi,n\_iters$
$\tilde{\mathbf{x}}^{k}=\gamma_{1}^{k}\mathbf{z}_{1}^{k}+\gamma_{2}^{k}\mathbf{z}_{2}^{k}+...+\gamma_{p}^{k}\mathbf{z}_{p}^{k},\forall k=1,..,m$
for $i=1$ to $n\_iters$ do
    Calculate gradient $\nabla L$ (w.r.t. $[\gamma^{1},..,\gamma^{m}]$ ) of function $L(\tilde{\mathbf{x}}^{1},..,\tilde{\mathbf{x}}^{m})$
    $[\gamma^{1},..,\gamma^{m}]\leftarrow[\gamma^{1},..,\gamma^{m}]+\xi\nabla_{\gamma}L(\tilde{\mathbf{x}}^{1},..,\tilde{\mathbf{x}}^{m})$
    $\gamma^{k}\leftarrow\Pi_{ps}(\gamma^{k}),\forall k=1,..,{m}$
    $\tilde{\mathbf{x}}^{k}=\gamma_{1}^{k}\mathbf{z}_{1}^{k}+\gamma_{2}^{k}\mathbf{z}_{2}^{k}+...+\gamma_{p}^{k}\mathbf{z}_{p}^{k},\forall k=1,..,m$
end for
Output: $\tilde{\mathbf{x}}^{1},..,\tilde{\mathbf{x}}^{m}$

Algorithm 4 represents the update step for multiple attack points. We assumed that the embedding loss $\mathcal{L}_{e}$ is smooth with respect to $\gamma$ and used a gradient-based algorithm to estimate the latter. Nevertheless, in PGS, there are several methods whose embedding loss is not even continuous. For example, in LLE (6), the embedding loss takes into account the $k$ nearest neighbors of a point, which might change throughout the estimation of an attack point, producing a discontinuity. To circumvent this problem, we use several strategies to avoid singularities:

- By reducing the gradient step $\xi$ , which limits the virtual point displacement.
- By taking a small number of attack points in each subset $\mathcal{B}_{s}$ , or by randomly using only a part of the attack points to perform the attack while fixing the other attack points, in an attack stage.
- By updating $\gamma$ only if the embedding loss increases.

Besides, some metrics might be approximated by smooth functions. For instance, in the contrastive loss (section 3), we can replace the metric $d_{x}()$ , which outputs only 0 or 1, with $d_{x}(\mathbf{x}_{i},\mathbf{x}_{j})=\text{exp}\big{(}\frac{-\left\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\right\rVert_{2}^{2}}{2\sigma_{i}^{2}}\big{)}$ to make the embedding loss continuous.

4.3 Pairwise preservation of geometric structure

For some PGS methods such as MDS or LE, the embedding loss $\mathcal{L}_{e}$ can be decomposed into a sum of elementary pairwise losses $l_{e}$ :

\mathcal{L}_{e}(\mathbf{a},\mathcal{B})=\sum_{\mathbf{b}\in\mathcal{B}}l_{e}(\mathbf{a},\mathbf{b})

(13)

Then the batch loss $L$ (in algorithm 3) can be modified into:

L=\sum_{\mathbf{a}\in\mathcal{B}_{s}}\mathcal{L}_{e}\big{(}\mathbf{a},\mathcal{B}_{s}\backslash\{\mathbf{a}\}\big{)}=\sum_{\mathbf{a}\in\mathcal{B}_{s}}\sum_{\mathbf{b}\in\mathcal{B}_{s},\mathbf{b}\neq\mathbf{a}}l_{e}(\mathbf{a},\mathbf{b})

(14)

Following this change, $L$ can be decomposed into three parts:

	$\displaystyle\sum_{\mathbf{a}\in\mathcal{B}_{s}^{d}}\sum_{\mathbf{b}\in\mathcal{B}_{s}^{d},\mathbf{b}\neq\mathbf{a}}l_{e}(\mathbf{a},\mathbf{b})\quad\quad\quad$	$\displaystyle(\text{data-data})$
	$\displaystyle\sum_{\mathbf{a}\in\mathcal{B}_{s}^{d}}\sum_{\mathbf{b}\in\mathcal{B}_{s}^{v}}l_{e}(\mathbf{a},\mathbf{b})$	$\displaystyle(\text{data-virtual})$
	$\displaystyle\sum_{\mathbf{a}\in\mathcal{B}_{s}^{v}}\sum_{\mathbf{b}\in\mathcal{B}_{s}^{v},\mathbf{b}\neq\mathbf{a}}l_{e}(\mathbf{a},\mathbf{b})$	$\displaystyle(\text{virtual-virtual}),$

where $\mathcal{B}_{s}^{d}$ and $\mathcal{B}_{s}^{v}$ are respectively the set of all embedded data points (or observed samples) and the set of all embedded virtual points of $\mathcal{B}_{s}$ .

For PGS losses that can be decomposed into a sum of elementary pairwise losses, we can balance the effect of the virtual points with respect to the observed samples, not only by tuning the ratio between the number of virtual points and the number of observed samples in $\mathcal{B}_{s}$ , but also by weighting each of the three parts above, which correspond to the settings observed-observed, observed-virtual and virtual-virtual.
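The weighting of the three parts can be sketched as follows, using the elementary MDS loss $l_{e}(\mathbf{a},\mathbf{b})=(d_{a}(\mathbf{a},\mathbf{b})-d_{x}(\mathbf{x}_{a},\mathbf{x}_{b}))^{2}$ as an example. The weights `w_dd`, `w_dv`, `w_vv` and the function names are hypothetical; samples are stored as rows.

```python
import numpy as np

def pairwise_mds(Xa, Xb, Ea, Eb):
    """Matrix of elementary MDS losses l_e between two groups of points."""
    d_x = np.linalg.norm(Xa[:, None, :] - Xb[None, :, :], axis=-1)
    d_a = np.linalg.norm(Ea[:, None, :] - Eb[None, :, :], axis=-1)
    return (d_a - d_x) ** 2

def weighted_batch_loss(X_d, X_v, E_d, E_v, w_dd=1.0, w_dv=1.0, w_vv=1.0):
    """Weighted sum of the data-data, data-virtual and virtual-virtual parts."""
    dd = pairwise_mds(X_d, X_d, E_d, E_d).sum()   # diagonal terms are zero
    dv = pairwise_mds(X_d, X_v, E_d, E_v).sum()
    vv = pairwise_mds(X_v, X_v, E_v, E_v).sum()
    return float(w_dd * dd + w_dv * dv + w_vv * vv)

rng = np.random.default_rng(0)
X_d, X_v = rng.normal(size=(5, 3)), rng.normal(size=(2, 3))   # inputs
E_d, E_v = rng.normal(size=(5, 2)), rng.normal(size=(2, 2))   # embeddings

X_all, E_all = np.vstack([X_d, X_v]), np.vstack([E_d, E_v])
full_loss = float(pairwise_mds(X_all, X_all, E_all, E_all).sum())
split_loss = weighted_batch_loss(X_d, X_v, E_d, E_v, w_dv=2.0)
```

For a symmetric $l_{e}$, the ordered double sum of equation (14) visits each data-virtual pair twice, so the unweighted batch loss is recovered here with `w_dv=2.0`; other weights shift the balance between observed samples and virtual points.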

4.4 Settings of anchor points and initialization of virtual points

In this section, we provide two settings (or rules) for computing anchor points with the corresponding initializations. These settings need to be chosen carefully to guarantee that virtual points are on the sample underlying manifold.

Neighbor anchors: The first anchor point $\mathbf{z}_{1}$ is taken randomly from $\mathcal{X}$ , then the next $(p-1)$ anchor points $\mathbf{z}_{2},..,\mathbf{z}_{p}$ are taken as the $(p-1)$ nearest neighbors of $\mathbf{z}_{1}$ in $\mathcal{X}$ (Euclidean metric by default). Here, we assume that the convex hull of a sample and its neighbors is likely to be contained in the sample manifold. The number of anchors $p$ needs to be small compared to the number of data points $N$ . The virtual points can be initialized by taking $\gamma_{i}\sim\mathcal{U}(0,1),\forall i=1,..,p$ and then normalizing so that $\sum_{i=1}^{p}\gamma_{i}=1$ .
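A minimal sketch of the neighbor-anchor rule is given below. As an assumption, it builds one anchor set per sample (each sample playing the role of $\mathbf{z}_{1}$ once) rather than drawing $\mathbf{z}_{1}$ at random; samples are stored as rows and the function names are hypothetical.

```python
import numpy as np

def neighbor_anchor_sets(X, p):
    """Index array (N, p): each sample followed by its p-1 nearest neighbors."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, -1.0)       # make each sample rank first in its own row
    return np.argsort(sq, axis=1)[:, :p]

def init_gamma_uniform(M, p, rng):
    """Draw gamma_i ~ U(0, 1), then normalize each row to sum to one."""
    gamma = rng.uniform(size=(M, p))
    return gamma / gamma.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, n, p = 30, 3, 5
X = rng.normal(size=(N, n))
idx = neighbor_anchor_sets(X, p)                 # anchor indices, one set per row
gamma = init_gamma_uniform(N, p, rng)
anchors = X[idx]                                 # shape (N, p, n)
virtual = np.einsum("mp,mpn->mn", gamma, anchors)  # one virtual point per set
```

Each virtual point is a convex combination of its anchors and therefore lies inside their convex hull.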

Random anchors: The second setting is inspired by the Mix-up method [26]. $p$ anchors are taken randomly from $\mathcal{X}$ and we take $\gamma\sim\text{Dirichlet}(\alpha_{1},..,\alpha_{p})$ , which returns $\gamma$ with $\gamma_{i}\geq 0$ and $\sum_{i=1}^{p}\gamma_{i}=1$ . In particular, if $\alpha_{i}\ll 1,\forall i=1,..,p$ , one coefficient $\gamma_{k}$ is much greater than the others with high probability, which implies that virtual points are more likely to be in the neighborhood of a data sample. Since the manifold attack only searches for a local maximum with a gradient-based method, if $\xi$ and $n\_iters$ are both small, we expect that attack points do not move too far from their initial position during the attack stage and thus remain on the data manifold. Note that in the case $\alpha_{i}=1,\forall i=1,..,p$ , the Dirichlet distribution becomes the uniform distribution over the simplex.
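The effect of the concentration parameters can be observed empirically. The sketch below compares $\alpha_{i}=0.1$ with $\alpha_{i}=1$ using NumPy's Dirichlet sampler; the threshold 0.9 used to call a coefficient dominant is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
p, M = 4, 2000

gamma_small = rng.dirichlet(np.full(p, 0.1), size=M)   # alpha_i << 1
gamma_flat = rng.dirichlet(np.ones(p), size=M)         # alpha_i = 1 (uniform)

# Fraction of draws in which one coefficient exceeds 0.9.
dominant_small = (gamma_small.max(axis=1) > 0.9).mean()
dominant_flat = (gamma_flat.max(axis=1) > 0.9).mean()
```

With small $\alpha_{i}$, a dominant coefficient appears far more often than under the uniform case, so most virtual points start close to one of their anchors.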

To ensure that the coefficient $\gamma_{k}$ is always much greater than the others, we add one more constraint, $\gamma_{k}\geq\tau$ , with $\tau$ close to 1. The constraints in (9) become:

\begin{split}&\gamma_{1},\gamma_{2},..,\gamma_{p}\geq 0,\\ &\gamma_{1}+\gamma_{2}+...+\gamma_{p}=1,\\ &\gamma_{k}\geq\tau,(\tau<1).\\ \end{split}

(15)

Then the projection in algorithm 2 needs to be slightly modified to incorporate this new constraint. We define the projection $\gamma=\Pi_{ps}^{{}^{\prime}}(\kappa)$ as follows:

\begin{split}&\kappa^{\prime}\leftarrow\kappa\\ &\kappa^{\prime}_{k}\leftarrow\kappa^{\prime}_{k}-\tau\\ &\gamma\leftarrow\Pi_{ps}(\kappa^{\prime},c=1-\tau)\\ &\gamma_{k}\leftarrow\gamma_{k}+\tau\\ \end{split}

(16)
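The modified projection (16) reuses the projection of algorithm 2 with the reduced budget $c=1-\tau$. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def project_ps(kappa, c=1.0):
    """Projection onto {gamma >= 0, sum(gamma) = c} (alternating projections)."""
    gamma = np.asarray(kappa, dtype=float).copy()
    gamma += (c - gamma.sum()) / gamma.size
    while (gamma < 0).any():
        positive = gamma > 0
        gamma[gamma < 0] = 0.0
        gamma[positive] += (c - gamma[positive].sum()) / positive.sum()
    return gamma

def project_ps_dominant(kappa, k, tau):
    """Projection under the extra constraint gamma_k >= tau, following (16)."""
    shifted = np.asarray(kappa, dtype=float).copy()
    shifted[k] -= tau                          # reserve tau for coordinate k
    gamma = project_ps(shifted, c=1.0 - tau)   # distribute the remaining mass
    gamma[k] += tau                            # give the reserved mass back
    return gamma

gamma = project_ps_dominant([0.5, 0.3, 0.2], k=0, tau=0.9)
# approximately [0.9, 0.1, 0.0]
```

The shift by $\tau$ turns the constraint $\gamma_{k}\geq\tau$ into a plain non-negativity constraint, which is why the unmodified projector can be reused.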

5 Applications of manifold attack

We present several applications of manifold attack for NNMs that use PGS task. Firstly, we show advantages of manifold attack for a PGS task when only few training samples are available. Secondly, we show that applying manifold attack in moderation improves both generalization and adversarial robustness.

5.1 Preservation of geometric structure with few training samples

For this experiment, we use the S curve data and the Digit data. The S curve data contains $N=1000$ 3-dimensional samples, as shown in figure 3. The Digit data contains $N=1797$ images of digits, each of size $8\times 8$ . We want to compute 2-dimensional embeddings for these data. Each dataset is split into two sets: $N_{tr}$ samples are randomly taken for the training set and the $N_{te}$ remaining samples are used for testing. $\mathbf{x}^{tr}$ denotes a training sample and $\mathbf{x}^{te}$ denotes a testing sample. We perform four training modes, as described in table 1, with a neural network model $g()$ . The evaluation loss, after optimizing the model $g()$ , is defined as:

L_{ev}=\frac{1}{N_{te}}\sum_{i=1}^{N_{te}}\mathcal{L}_{e}(g(\mathbf{x}^{te}_{i}),\{g(\mathbf{x}^{te}_{j})|j\neq i\})

(17)

Mode	Description of objective function
1. REF	PGS that takes into account both training and testing samples: $L_{tr}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{e}(g(\textbf{x}_{i}),\{g(\textbf{x}_{j})|j\neq i\})$ . The result of this training mode is considered as the “reference” against which the other training modes are compared.
2. DD	PGS that takes only training (observed) data samples: $L_{tr}=\frac{1}{N_{tr}}\sum_{i=1}^{N_{tr}}\mathcal{L}_{e}(g(\textbf{x}_{i}^{tr}),\{g(\textbf{x}_{j}^{tr})|j\neq i\})$ .
3. RV	Using virtual points as supplementary data; virtual points are only randomly initialized, without attack stage (by setting $n\_iters=0$ in algorithm 4): $L_{tr}=\frac{1}{|\mathcal{B}|}\sum_{\textbf{a}_{i}\in\mathcal{B}}\mathcal{L}_{e}(\textbf{a}_{i},\mathcal{B}\backslash\{\textbf{a}_{i}\})$ , where $\mathcal{B}=\{g(\tilde{\textbf{x}}^{1}),..,g(\tilde{\textbf{x}}^{M})\}\cup\{g(\textbf{x}_{1}^{tr}),..,g(\textbf{x}_{N_{tr}}^{tr})\}$ .
4. MA	Using manifold attack: the same objective function as the previous case, except that $n\_iters\neq 0$ (virtual points become attack points).

Table 1: Four training modes: REF (Reference), DD (Data-Data), RV (Random virtual) and MA (Manifold Attack), and their corresponding objective function.

The anchoring rule, the embedding loss $\mathcal{L}_{e}$ and the model $g()$ are specified below.

Anchoring rule. Two settings are considered:

-

Neighbor anchors (NA): A set of anchor points is composed by a sample with its 4 nearest neighbors. In this case, we have $M=N_{tr}$ sets of anchor points and $p=5$ anchor points in each set. The coefficient $\gamma$ is initialized by uniform distribution.
-

Random anchors (RA): We randomly take 2 points among the $N_{tr}$ training points to create a set of anchor points. In this case, we have $M=\binom{N_{tr}}{2}$ sets of anchor points and $p=2$ anchor points in each set. The coefficient $\gamma$ is initialized from the Dirichlet distribution with $\alpha_{i}=0.5,\forall i=1,..,p$ .
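The two anchoring rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, samples are assumed to be stored as rows of a 2-D tensor, and the uniform initialization is assumed to be normalized so that each $\gamma$ sums to 1.

```python
import torch

def neighbor_anchor_sets(x, k=4):
    """Indices of shape (N, k + 1): each sample followed by its k nearest neighbors."""
    flat = x.reshape(x.shape[0], -1)
    dist = torch.cdist(flat, flat)   # pairwise Euclidean distances
    dist.fill_diagonal_(-1.0)        # force each sample to rank first in its own set
    return dist.topk(k + 1, largest=False).indices

def pair_anchor_sets(n_samples):
    """Indices of shape (N * (N - 1) / 2, 2): every unordered pair of samples."""
    return torch.combinations(torch.arange(n_samples), r=2)

def init_gamma(n_sets, p, alpha=None):
    """Coefficients on the simplex: normalized uniform draws, or Dirichlet(alpha)."""
    if alpha is None:
        gamma = torch.rand(n_sets, p)
        return gamma / gamma.sum(dim=1, keepdim=True)  # assumption: rows sum to 1
    return torch.distributions.Dirichlet(torch.full((p,), alpha)).sample((n_sets,))

def virtual_points(x, anchor_idx, gamma):
    """Convex combination of the anchors in each set; x has shape (N, D)."""
    return (gamma[:, :, None] * x[anchor_idx]).sum(dim=1)
```

With `N_tr = 100`, `neighbor_anchor_sets` yields 100 sets of 5 anchors, and `pair_anchor_sets` yields 4950 sets of 2 anchors, matching the values of $M$ and $p$ given above.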

The embedding loss. We use the MDS and LE embedding losses as described in section 3, with the default metrics. For the similarity metric $d_{x}()$ in the LE method, we take $\sigma=0.2$ for S curve data and $\sigma=0.5$ for Digit data.

The model. A simple CNN structure is used. Since the input dimensions differ between the two datasets, each one has its own architecture:

-

S curve data: Conv1d[ $1,4,2$ ] $\rightarrow$ ReLu $\rightarrow$ Conv1d[ $4,4,2$ ] $\rightarrow$ ReLu $\rightarrow$ Flatten $\rightarrow$ Fc[ $4,2$ ].
-

Digit data: Conv2d[ $1,8,3$ ] $\rightarrow$ ReLu $\rightarrow$ Conv2d[ $8,16,3$ ] $\rightarrow$ ReLu $\rightarrow$ Flatten $\rightarrow$ Fc[ $64,2$ ].
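As an illustration, the S curve architecture can be written in PyTorch as below. This is a sketch under assumptions: the bracket notation is read as [input channels, output channels, kernel size], the stride is 1 with no padding, and each 3-dimensional sample is treated as one channel of length 3. The class name is hypothetical. The Digit model is not sketched because the stride and padding that lead to 64 flattened features are not stated here.

```python
import torch
from torch import nn

class SCurveEmbedder(nn.Module):
    """Conv1d[1,4,2] -> ReLU -> Conv1d[4,4,2] -> ReLU -> Flatten -> Fc[4,2]."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=2),  # length 3 -> 2
            nn.ReLU(),
            nn.Conv1d(4, 4, kernel_size=2),  # length 2 -> 1
            nn.ReLU(),
            nn.Flatten(),                    # (batch, 4, 1) -> (batch, 4)
            nn.Linear(4, 2),                 # 2-dimensional embedding
        )

    def forward(self, x):
        # x has shape (batch, 3); add the channel dimension expected by Conv1d
        return self.net(x.unsqueeze(1))
```

Under these assumptions the flattened size is exactly 4, which is consistent with the Fc[ $4,2$ ] layer.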

For LE method, two additional constraints are imposed to avoid trivial embeddings:

\begin{gathered}\mathrm{E}(\mathbf{A}^{tr})=[\mathrm{E}(\mathbf{A}^{tr}[1,:]),..,\mathrm{E}(\mathbf{A}^{tr}[d,:])]^{\top}=\mathbb{0}_{d}\\ \Sigma(\mathbf{A}^{tr},\mathbf{A}^{tr})=I_{d}\\ \end{gathered}

(18)

where $d=2$ is the number of output dimensions and $\mathbf{A}^{tr}=[\mathbf{a}_{1}^{tr},..,\mathbf{a}_{N_{tr}}^{tr}]=[g(\mathbf{x}_{1}^{tr}),..,g(\mathbf{x}_{N_{tr}}^{tr})]$ . To satisfy these constraints, we add a normalization layer at the end of model $g()$ : $(g(\mathbf{x})-\mathrm{E}(\mathbf{A}^{tr}))\Sigma^{-1}(\mathbf{A}^{tr},\mathbf{A}^{tr})$ , where $\Sigma^{-1}$ is computed via a Cholesky decomposition.
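One common way to enforce a zero mean and an identity covariance with a Cholesky factor is sketched below. It is an assumption about how such a layer could look, not the authors' code: the function names are hypothetical, and the sketch applies the inverse of the Cholesky factor of the covariance, which is what yields an identity covariance on the training embeddings.

```python
import torch

def fit_whitening(a_tr):
    """Estimate mean and Cholesky factor of the covariance of a_tr (N_tr x d)."""
    mean = a_tr.mean(dim=0)
    centered = a_tr - mean
    cov = centered.T @ centered / a_tr.shape[0]
    chol = torch.linalg.cholesky(cov)  # cov = chol @ chol.T (requires full-rank cov)
    return mean, chol

def whiten(a, mean, chol):
    """Center a (N x d) and remove correlations using the Cholesky factor."""
    # Solve chol @ z.T = (a - mean).T, so that cov(z) = chol^-1 cov chol^-T = I
    return torch.linalg.solve_triangular(chol, (a - mean).T, upper=False).T
```

The statistics are estimated on the training embeddings and then reused unchanged for any other input, in the same way as the normalization layer described above depends only on $\mathbf{A}^{tr}$ .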

To simulate the case of few training samples, we fix $N_{tr}=100$ for MDS method and $N_{tr}=50$ for LE method. The balance between virtual points and samples is controlled by the couple $\lambda=$ (number of virtual points in $\mathcal{B}_{s}$ , number of observed samples in $\mathcal{B}_{s}$ ). We set $\lambda=(2,5)$ for MDS method and $\lambda=(5,10)$ for LE method. The gradient step $\xi$ is selected from $\{0.1,1,10\}$ and the number of iterations is fixed at $n\_iters=2$ .

The initialization of model $g()$ has a strong impact, especially since there are few training data. Five different initializations of the model are performed for each method. The mean and the standard deviation of the evaluation loss $L_{ev}$ are reported in table 2. Firstly, we see that using random virtual (RV) points as additional data points gives a better loss than using only data points (DD). Secondly, using manifold attack (MA) further improves the results, which shows the benefit of the proposed approach to regularize the model.

For the S curve data, initialization by Neighbor Anchors (NA) generally gives a better result than initialization by Random Anchors (RA). However, for the Digit data, initialization by Random Anchors gives a better result. This is because, in the S curve data, Neighbor Anchors cover the manifold of the data better than Random Anchors. On the other hand, in the Digit data, Neighbor Anchors (which use the Euclidean metric to determine nearest neighbors) are more likely to generate a virtual point that is not on the manifold of the data. This leads to a greater evaluation loss compared to Random Anchors.

	S curve data		Digit data
Mode / Method	MDS	LE	MDS	LE
REF	$130.7\pm 24.74$	$0.399\pm 0.07$	$2015\pm 14$	$0.07\pm 0.002$
DD	$352.56\pm 119.19$	$1.21\pm 0.46$	$2409\pm 78$	$0.58\pm 0.07$
RV (NA)	$173.87\pm 9.38$	$0.59\pm 0.11$	$2395\pm 73$	$0.31\pm 0.03$
MA (NA)	$170.62\pm 5.89$	$0.55\pm 0.07$	$2362\pm 63$	$0.24\pm 0.03$
RV (RA)	$183.42\pm 18.13$	$0.65\pm 0.14$	$2342\pm 56$	$0.22\pm 0.03$
MA (RA)	$169.04\pm 5.30$	$0.63\pm 0.14$	$2331\pm 56$	$0.2\pm 0.02$

Table 2: Evaluation loss $L_{ev}$ of the two PGS methods MDS and LE, in four modes: Reference (REF), Data-Data (DD), Random Virtual (RV) and Manifold Attack (MA) (as described in table 1), with two initialization strategies: Neighbor Anchors (NA) and Random Anchors (RA).

The five embedded representations of the testing samples in the S curve data, obtained with the five different initializations of $g()$ , are shown in figure 4 for the MDS method and in figure 5 for the LE method.

5.2 Robustness to adversarial examples

In this subsection, we combine manifold attack with supervised learning to assess the adversarial robustness of the model. Let us start with a general objective function composed of a supervised loss and a PGS loss:

\begin{split}&\min_{\theta}\max_{\gamma}\left(L_{s}+\lambda L_{pgs}\right)\\ =&\min_{\theta}\max_{\gamma}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\mathcal{L}_{c}(f_{\theta}(\mathbf{x}^{l}_{i}),y_{i})+\lambda\mathcal{L}_{t}\left(f_{\theta}^{(l)}(\mathbf{x}_{1}^{l}),..,f_{\theta}^{(l)}(\mathbf{x}_{N_{l}}^{l}),f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{1}),..,f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{M})\right)\end{split}

(19)

where $(\mathbf{x}^{l}_{i},y_{i})$ denotes a sample and its corresponding label. $\mathcal{L}_{c}$ is a dissimilarity metric, for instance the Cross Entropy. $\mathcal{L}_{t}$ is a PGS loss that may include all observed samples $\mathbf{x}$ and virtual points $\tilde{\mathbf{x}}$ . Note that the coordinates $\gamma$ of all virtual points are constrained by (9).

In the following, we construct $\mathcal{L}_{t}$ by a particular PGS loss, called Mix-up manifold learning loss [27, 28, 26] :

\begin{split}&l_{mu}\left(f_{\theta}^{(l)}(\tilde{\mathbf{x}}),f_{\theta}^{(l)}(\mathbf{x}_{i}),f_{\theta}^{(l)}(\mathbf{x}_{j})\right)\\ =&\mathcal{L}_{c}\left(f_{\theta}^{(l)}(\gamma_{1}\mathbf{x}_{i}+\gamma_{2}\mathbf{x}_{j}),\gamma_{1}f_{\theta}^{(l)}(\mathbf{x}_{i})+\gamma_{2}f_{\theta}^{(l)}(\mathbf{x}_{j})\right)\\ \text{where: }&\gamma_{1}+\gamma_{2}=1\text{ and }\gamma_{1}\sim\text{Beta}(\alpha,\alpha),0<\alpha\ll 1\end{split}

(20)

In this configuration, the virtual point is $\tilde{\mathbf{x}}=\gamma_{1}\mathbf{x}_{i}+\gamma_{2}\mathbf{x}_{j}$ , where $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are two anchor points. The Beta distribution is the particular case of the Dirichlet distribution (section 4.4) in which the number of anchor points is $p=2$ . In the case of supervised learning, we perform PGS between the original sample representation and the final NNM output (which can be considered as a latent representation). Thus, we implement a slightly modified version of equation (20):

\begin{split}&l_{mu}\left(f_{\theta}(\tilde{\mathbf{x}}),f_{\theta}(\mathbf{x}^{l}_{i}),f_{\theta}(\mathbf{x}^{l}_{j})\right)\\ =&\mathcal{L}_{c}\left(f_{\theta}(\gamma_{1}\mathbf{x}_{i}^{l}+\gamma_{2}\mathbf{x}_{j}^{l}),\gamma_{1}y_{i}+\gamma_{2}y_{j}\right)\\ \text{where: }&\gamma_{1}+\gamma_{2}=1\text{ and }\gamma_{1}\sim\text{Beta}(\alpha,\alpha),0<\alpha\ll 1\end{split}

(21)
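Equation (21) can be sketched in PyTorch as follows. This is a minimal illustration with hypothetical function names; it assumes one-hot labels, a soft-label Cross Entropy for $\mathcal{L}_{c}$ , one $\gamma_{1}$ drawn per pair, and pairs formed by a random permutation of the batch.

```python
import torch
import torch.nn.functional as F

def mixup_pairs(x, y_onehot, alpha=0.2):
    """Build virtual points gamma1 * x_i + gamma2 * x_j and their mixed labels."""
    n = x.shape[0]
    perm = torch.randperm(n)  # assumption: partner j is a random sample of the batch
    gamma1 = torch.distributions.Beta(alpha, alpha).sample((n,))
    g = gamma1.view(n, *([1] * (x.dim() - 1)))  # broadcast over feature dimensions
    x_mix = g * x + (1 - g) * x[perm]
    y_mix = gamma1[:, None] * y_onehot + (1 - gamma1[:, None]) * y_onehot[perm]
    return x_mix, y_mix, gamma1

def mixup_loss(model, x, y_onehot, alpha=0.2):
    """Soft-label Cross Entropy between f(x_mix) and the mixed labels."""
    x_mix, y_mix, _ = mixup_pairs(x, y_onehot, alpha)
    log_prob = F.log_softmax(model(x_mix), dim=1)
    return -(y_mix * log_prob).sum(dim=1).mean()
```

Because $\gamma_{2}=1-\gamma_{1}$ , each mixed label remains a valid probability vector.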

Then, we substitute this explicit PGS loss into equation (19) and take $\lambda=1$ :

\begin{align}&\min_{\theta}\max_{\gamma}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\mathcal{L}_{c}(f_{\theta}(\mathbf{x}^{l}_{i}),y_{i})+\lambda\frac{1}{N_{l}^{2}N_{\gamma}}\sum_{k=1}^{N_{\gamma}}\sum_{i=1}^{N_{l}}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N_{l}}\mathcal{L}_{c}\left(f_{\theta}(\gamma_{1}\mathbf{x}_{i}^{l}+\gamma_{2}\mathbf{x}_{j}^{l}),\gamma_{1}y_{i}+\gamma_{2}y_{j}\right)\tag{22}\\ =&\min_{\theta}\max_{\gamma}\frac{1}{N_{l}^{2}N_{\gamma}}\sum_{k=1}^{N_{\gamma}}\sum_{i=1}^{N_{l}}\sum_{j=1}^{N_{l}}\mathcal{L}_{c}\left(f_{\theta}(\gamma_{1}\mathbf{x}_{i}^{l}+\gamma_{2}\mathbf{x}_{j}^{l}),\gamma_{1}y_{i}+\gamma_{2}y_{j}\right)\tag{23}\\ \text{where: }&\gamma_{1}+\gamma_{2}=1\text{ and }\gamma_{1}\sim\text{Beta}(\alpha,\alpha),0<\alpha\ll 1\tag{24}\\ &N_{\gamma}\text{ is the number of samplings of $\gamma$ from }\text{Beta}(\alpha,\alpha).\tag{25}\end{align}

We call problem (22) adversarial Mix-up because it is developed from Mix-up [26] by adding adversarial learning for $\gamma$ . In practice, to optimize problem (22), we perform an attack stage (algorithm 4) to find the $\gamma$ that increases the PGS loss, before performing the model update stage for $f_{\theta}()$ . We alternate these two stages until convergence. Note that we can take $\gamma_{2}=1-\gamma_{1}$ , so that only one variable $\gamma_{1}$ has to be handled to maximize the PGS loss. The projection (12) for $\gamma_{1}$ then reduces to the clamping function, which ensures that $\gamma_{1}$ stays between 0 and 1.
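The alternation between attack stage and model update stage can be sketched as below. This is an assumption-laden illustration rather than a transcription of algorithm 4: the function names are hypothetical, the attack uses a plain gradient ascent step of size $\xi$ on $\gamma_{1}$ , labels are one-hot, and $\mathcal{L}_{c}$ is a soft-label Cross Entropy.

```python
import torch
import torch.nn.functional as F

def mixed_loss(model, x_i, x_j, y_i, y_j, gamma1):
    """Loss of equation (21) for given anchors and coefficients gamma1 (shape: batch)."""
    g = gamma1.view(-1, *([1] * (x_i.dim() - 1)))
    logits = model(g * x_i + (1 - g) * x_j)
    target = gamma1[:, None] * y_i + (1 - gamma1[:, None]) * y_j
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def attack_gamma(model, x_i, x_j, y_i, y_j, gamma1, xi=0.1, n_iters=1):
    """Attack stage: move gamma1 along the loss gradient, then clamp to [0, 1]."""
    gamma1 = gamma1.detach().clone()
    for _ in range(n_iters):
        gamma1.requires_grad_(True)
        loss = mixed_loss(model, x_i, x_j, y_i, y_j, gamma1)
        (grad,) = torch.autograd.grad(loss, gamma1)
        # ascent step (assumed rule), followed by the clamping projection
        gamma1 = (gamma1 + xi * grad).clamp(0.0, 1.0).detach()
    return gamma1

def train_step(model, optimizer, x_i, x_j, y_i, y_j, gamma1, xi=0.1):
    """One attack stage followed by one model update stage."""
    gamma_adv = attack_gamma(model, x_i, x_j, y_i, y_j, gamma1, xi)
    optimizer.zero_grad()
    loss = mixed_loss(model, x_i, x_j, y_i, y_j, gamma_adv)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only $\gamma_{1}$ is stored, since $\gamma_{2}=1-\gamma_{1}$ ; the model parameters receive no gradient during the attack stage because `torch.autograd.grad` is taken with respect to $\gamma_{1}$ alone.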

We compare four supervised training methods: ERM (Empirical Risk Minimization, i.e. classical supervised learning with the Cross Entropy loss), Mix-up, Adversarial Mix-up and Cut-Mix [29], on the ImageNet dataset with the ResNet-50 model [6], which has about 25.8M trainable parameters. We retrieve 948 classes, each consisting of 400 labelled training samples and 50 testing samples used to evaluate the models.

Method / Data evaluation	Testing set		Adversarial examples
	Top-1	Top-5	Top-1	Top-5
ERM	33.84	12.46	81.69	59.14
Mix-up [26]	32.13	11.35	75.57	49.41
Adversarial Mix-up (1) ( $\xi=0.1\rightarrow 0.01$ )	32.57	10.98	63.62	36.15
Adversarial Mix-up (2) ( $\xi=0.01\rightarrow 0.001$ )	31.82	11.15	70.82	43.96
Cut-Mix [29]	30.94	10.41	81.24	58.72

Table 3: ImageNet error rate (top-1 and top-5 in %) on the testing set and on adversarial examples for different training modes: ERM, Mix-up, Adversarial Mix-up and Cut-Mix.

We evaluate the error rate on the testing set at the end of each epoch and report the best error rate (top-1 and top-5) in table 3. We create adversarial examples using the Fast Gradient Sign Method (FGSM) [10], on another trained ERM model, with $\epsilon=0.05$ . In Adversarial Mix-up, $n\_iters$ is fixed at 1. The attack stage is parametrized by the gradient step $\xi$ , which is set following two configurations: (1) $\xi$ is reduced linearly from 0.1 to 0.01 and (2) $\xi$ is reduced linearly from 0.01 to 0.001. Following the original articles, $\alpha$ is set at 0.2 for Mix-up and Adversarial Mix-up and $\alpha=1$ for Cut-Mix. More details on hyper-parameters can be found in the Appendix.
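For reference, FGSM perturbs each input by $\epsilon$ times the sign of the loss gradient with respect to that input. A minimal sketch is given below; the function name is hypothetical, the Cross Entropy loss is assumed, and no clipping to a valid pixel range is applied because the text does not specify one.

```python
import torch
import torch.nn.functional as F

def fgsm_examples(source_model, x, y, epsilon=0.05):
    """Return x + epsilon * sign(grad_x loss), computed on source_model."""
    x = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(source_model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return (x + epsilon * grad.sign()).detach()
```

In the setting described above, `source_model` would be the separately trained ERM model, and the resulting examples would then be fed to each evaluated model.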

Firstly, Mix-up [26], which combines a supervised learning loss with a PGS loss, gives better error rates on both the testing set and the adversarial examples than ERM, which is a standard supervised learning task.

Secondly, in Adversarial Mix-up (1) and (2), a trade-off has to be found between the error rate on the testing set and the error rate on adversarial examples. For large values of $\xi$ (1), the top-1 error rate on the testing set is about 0.5% worse than that of Mix-up (without attack stage), but the method gains about 13% in robustness against adversarial examples. On the other hand, if $\xi$ takes moderate values as in (2), the error rates on both the testing set and the adversarial examples are smaller than those of Mix-up, but the gain in robustness against adversarial examples is only about 5%. These two observations show the effect of manifold attack, which is an improved PGS procedure. We conclude that manifold attack can improve generalization while significantly improving the adversarial robustness of the model.

Finally, Adversarial Mix-up (2) provides a slightly worse error rate than Cut-Mix (about 1% on the testing set), but it gains about 10% on adversarial examples.

It is worth noting that, in the attack stage, the model needs to be differentiable. Thus, when using NNMs with a Dropout layer for example, the active connections need to be fixed during an attack stage.

5.3 Semi-supervised manifold attack

In this subsection, manifold attack is applied to reinforce semi-supervised neural network models that use PGS as regularization. Problem (19) is extended to include unlabelled samples. We assume that the training set contains $N$ samples $\mathbf{x}$ , of which the first $N_{l}$ samples $\mathbf{x}^{l}_{i}$ are labelled and the remaining $N_{u}=N-N_{l}$ samples $\mathbf{x}^{u}_{i}$ are unlabelled. Hence, the objective function for a semi-supervised manifold attack method is:

\begin{split}&\min_{\theta}\max_{\gamma}\left(L_{s}+\lambda L_{pgs}+\beta L_{u}\right)\\ =&\min_{\theta}\max_{\gamma}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\mathcal{L}_{c}(f_{\theta}(\mathbf{x}^{l}_{i}),y_{i})+\lambda\mathcal{L}_{t}\left(f_{\theta}^{(l)}(\mathbf{x}_{1}),..,f_{\theta}^{(l)}(\mathbf{x}_{N}),f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{1}),..,f_{\theta}^{(l)}(\tilde{\mathbf{x}}^{M})\right)+\beta L_{u}\end{split}

(26)

where $L_{u}$ refers to other possible unsupervised losses such as pseudo-label, self-supervised learning, etc.

In the following, we make the terms of problem (26) explicit. We build upon MixMatch [30], a state-of-the-art semi-supervised learning method, which uses the Mix-up manifold learning loss for PGS. We then add adversarial learning for $\gamma$ to MixMatch, and refer to the resulting instance of (26) as Adversarial MixMatch. Note that $\gamma_{1}$ in MixMatch is defined slightly differently from Mix-up:

\gamma_{1}+\gamma_{2}=1,\gamma_{1}\sim\text{Beta}(\alpha,\alpha)\text{ and }\gamma_{1}\geq\gamma_{2}

(27)

The projection for $\gamma_{1}$ is then the clamping function between 0.5 and 1. When two separate variables $\gamma_{1}$ and $\gamma_{2}$ are defined, we can use the projector $\Pi_{ps}^{{}^{\prime}}$ (16) defined in subsection 4.4. We use the PyTorch implementation of MixMatch by Yui [31] (with all hyper-parameters at their default values), into which we introduce attack stages with the number of iterations $n\_iters=1$ . In each experiment, for both the CIFAR-10 and SVHN datasets, we divide the training set into three parts: labelled set, unlabelled set and validation set. The number of validation samples is fixed at 5000. The number of labelled samples is 250, and the remaining samples form the unlabelled set. We repeat the experiment four times, with different samplings of labelled, unlabelled and validation samples and different initializations of the Wide ResNet-28 model [32], which has about 1.47M trainable parameters. More details on hyper-parameters can be found in the Appendix.
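The constraint (27) and the associated clamping projection can be sketched as follows. The function names are hypothetical, and enforcing $\gamma_{1}\geq\gamma_{2}$ by taking the maximum of a Beta draw and its complement is the construction used in MixMatch [30]; the default $\alpha=0.75$ is the value listed in the Appendix.

```python
import torch

def sample_gamma1_mixmatch(n, alpha=0.75):
    """Draw gamma1 ~ Beta(alpha, alpha) and enforce gamma1 >= gamma2 = 1 - gamma1."""
    gamma1 = torch.distributions.Beta(alpha, alpha).sample((n,))
    return torch.maximum(gamma1, 1.0 - gamma1)

def project_gamma1(gamma1):
    """Projection used after an attack step: clamp gamma1 to [0.5, 1]."""
    return gamma1.clamp(0.5, 1.0)
```

After each attack step on $\gamma_{1}$ , applying `project_gamma1` keeps the virtual point closer to its first anchor than to its second one.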

Data	Method / Test	1	2	3	4	Mean
CIFAR-10	MixMatch	10.62	12.72	12.02	15.26	$12.65\pm 1.68$
CIFAR-10	Adversarial MixMatch	8.84	10.46	10.09	12.89	$10.57\pm 1.47$
SVHN	MixMatch	6.0925	6.73	7.802	7.37	$7.0\pm 0.65$
SVHN	Adversarial MixMatch	5.07	5.93	5.42	5.47	$5.47\pm 0.3$

Table 4: CIFAR-10 and SVHN error rates in four different configurations, each consisting of a data partitioning and an initialization of model parameters. The number of labelled samples is fixed at 250 and the model used is Wide ResNet-28.

The error rate on the testing set that corresponds to the best validation error rate is reported in table 4, for both MixMatch and Adversarial MixMatch. We see that Adversarial MixMatch improves on MixMatch, reducing the mean error rate by about 2% on CIFAR-10 and about 1.5% on SVHN. There is a considerable difference between the error rate of MixMatch reproduced in our experiments and the one reported in the official paper, which might come from the sampling, the initialization of the model, the library used (PyTorch vs TensorFlow) and the computation of the error rate (error rate associated with the best validation error vs the median error rate of the last 20 checkpoints).

Method / Data	CIFAR-10	SVHN
Pi Model ^⋄ [33]	$53.02\pm 2.05$	$17.56\pm 0.275$
Pseudo Label ^⋄ [34]	$49.98\pm 1.17$	$21.16\pm 0.88$
VAT ^⋄ [11]	$36.03\pm 2.82$	$8.41\pm 1.01$
SESEMI SSL [35]		$8.32\pm 0.13$
Mean Teacher ^⋄ [36]	$47.32\pm 4.71$	$6.45\pm 2.43$
Dual Student [37]		$4.24\pm 0.10$
MixMatch ^⋄ [30]	$11.08\pm 0.87$	$3.78\pm 0.26$
MixMatch * [30]	$12.65\pm 1.68$	$7.0\pm 0.65$
Adversarial MixMatch *	$10.57\pm 1.47$	$5.47\pm 0.3$
Real Mix [38]	$9.79\pm 0.75$	$3.53\pm 0.38$
EnAET [39]	$7.6\pm 0.34$	$3.21\pm 0.21$
ReMixMatch [40]	$6.27\pm 0.34$	$3.10\pm 0.50$
Fix Match [41]	$5.07\pm 0.33$	$2.48\pm 0.38$

Table 5: CIFAR-10 and SVHN error rates of different semi-supervised learning methods. The number of labelled samples is fixed at 250. (^⋄) means that the results are reported from [30]. (*) means that the results are reported from our experiments. The remaining results are reported from their corresponding official papers.

Table 5 shows the error rates of semi-supervised methods based on NNMs, for both the CIFAR-10 and SVHN datasets with only 250 labelled samples. We also refer readers to the PapersWithCode site, which provides the latest records for each dataset: CIFAR-10 (https://paperswithcode.com/sota/semi-supervised-image-classification-on-cifar-6) and SVHN (https://paperswithcode.com/sota/semi-supervised-image-classification-on-svhn-1).

6 Conclusion

	Adversarial Example	Manifold Attack
Illustration (red point: a sample; blue line: border of feasible zone)	[image]	[image]
Feasible zone	Locality of each sample	Convex hull
Variable	Noise $\epsilon$ that has the same size as samples	$\gamma$ , whose size equals the number of anchor points
PGS task	Points in the locality of a sample must have a similar embedded representation	Available for almost all PGS tasks

Table 6: A simple comparison between adversarial examples and manifold attack in general. Note that, in manifold attack, the feasible zone is the whole convex hull in the case of Nearest Anchors (subsection 4.4). In the case of Random Anchors with the Dirichlet distribution, the feasible zone is instead concentrated in the neighborhood of each anchor within the convex hull, rather than around its center.

We introduced manifold attack as an improved PGS procedure. Firstly, it is more general than adversarial examples (see the comparison in table 6). Secondly, we empirically confirm a statement from [26]: by using Mix-up as PGS combined with a supervised loss, we enhance generalization and significantly improve adversarial robustness. Thirdly, we show that applying manifold attack to Mix-up can further enhance generalization and adversarial robustness. There is a trade-off to be found between generalization and adversarial robustness. However, in our experiments, in the ’worst case’, we lose at most about 1% in generalization for a gain of about 13% in adversarial robustness.

To further improve our method, several directions could be investigated:

•

Optimizing the choice of the layers between which PGS is applied. Indeed, we could also implement PGS between one latent representation and another one in NNMs, as in [28].
•

Mode collapse is a common problem when training GAN models. Even when a GAN is trained with samples that are well balanced among classes, the samples produced by the generator are biased toward only a few classes (as shown in figure 6). This is because the latent representation is not well regularized. We expect to overcome mode collapse by introducing manifold attack from the latent representation back to the original representation, as shown in figure 7, and by optimizing problem (28), where $\mathcal{L}_{t}$ is a PGS task as described in section 3.

Figure 7: A GAN in which PGS is applied from the latent representation back to the original representation.

\min_{G}\max_{D}\max_{z_{1},..,z_{M}\in[0,1]^{p}}\Big{(}\mathbb{E}_{x\sim p(x)}[\log D(x)]+\mathbb{E}_{z\sim p(z)}[1-\log D(G(z))]+\lambda\mathcal{L}_{t}\big{(}G(z_{1}),..,G(z_{M})\big{)}\Big{)}

(28)

Acknowledgement

This research is supported by the European Community through the grant DEDALE (contract no. 665044) and the Cross-Disciplinary Program on Numerical Simulation of CEA, the French Alternative Energies and Atomic Energy Commission.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[2] K Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
[3] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[7] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[8] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[9] Jason Weston and Frédéric Ratle. Deep learning via semi-supervised embedding. In International Conference on Machine Learning, 2008.
[10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2014.
[11] Takeru Miyato, Shin ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning, 2017.
[12] Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? – a comprehensive study on the robustness of 18 deep image classification models, 2019.
[13] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy, 2019.
[14] Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Constructing unrestricted adversarial examples with generative models, 2018.
[15] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization, 2019.
[16] Ousmane Amadou Dia, Elnaz Barshan, and Reza Babanezhad. Semantics preserving adversarial learning, 2019.
[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013.
[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, page 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
[19] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2016.
[20] Wei-An Lin, Chun Pong Lau, Alexander Levine, Rama Chellappa, and Soheil Feizi. Dual manifold adversarial robustness: Defense against lp and non-lp adversarial attacks, 2020.
[21] J.B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.
[22] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
[23] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. SCIENCE, 290:2323–2326, 2000.
[24] Geoffrey Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in neural information processing systems, 15:833–840, 2003.
[25] L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-sne. Journal of Machine Learning Research, 9(nov):2579–2605, 2008.
[26] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017.
[27] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning, 2019.
[28] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states, 2019.
[29] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. CoRR, abs/1905.04899, 2019.
[30] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning, 2019.
[31] Yui. Pytorch Implementation for Mix Match, 2019. imikushana@gmail.com.
[32] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.
[33] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242, 2016.
[34] Dong hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, 2013.
[35] Phi Vu Tran. Exploring self-supervised regularization for supervised and semi-supervised learning, 2019.
[36] Antti Tarvainen and Harri Valpola. Weight-averaged consistency targets improve semi-supervised deep learning results. CoRR, abs/1703.01780, 2017.
[37] Zhanghan Ke, Daoye Wang, Qiong Yan, Jimmy Ren, and Rynson W. H. Lau. Dual student: Breaking the limits of the teacher in semi-supervised learning, 2019.
[38] Varun Nair, Javier Fuentes Alonso, and Tony Beltramelli. Realmix: Towards realistic semi-supervised deep learning algorithms, 2019.
[39] Xiao Wang, Daisuke Kihara, Jiebo Luo, and Guo-Jun Qi. Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning, 2019.
[40] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring, 2019.
[41] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence, 2020.
[42] Ngoc-Trung Tran, Tuan-Anh Bui, and Ngai-Man Cheung. Dist-gan: An improved gan using distance constraints, 2018.

Appendix

Projection for sum and positive

Proof. Using Lagrange multipliers, problem (12) becomes:

\begin{split}\min_{\gamma,\mu\in\mathbb{R}^{p},\lambda}\frac{1}{2}\sum_{i=1}^{p}\left\lVert\kappa_{i}-\gamma_{i}\right\rVert_{2}^{2}+&\lambda(\sum_{i=1}^{p}\gamma_{i}-c)+\sum_{i=1}^{p}\mu_{i}\gamma_{i}\\ \text{ subject to : }&\mu_{1},\mu_{2},..,\mu_{p}\leq 0\\ \end{split}

We solve the following system of equations:
$\begin{cases}\gamma_{i}-\kappa_{i}+\lambda+\mu_{i}=0\\ \sum_{i=1}^{p}\gamma_{i}=c\\ \mu_{i}\gamma_{i}=0\\ \mu_{i}\leq 0\\ \gamma_{i}\geq 0\\ \end{cases}\Leftrightarrow\begin{cases}\lambda=\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-\sum_{i=1}^{p}\mu_{i}-c)\\ \gamma_{i}=\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)-\frac{p-1}{p}\mu_{i}+\frac{1}{p}\sum_{j\neq i}\mu_{j}\\ \sum_{i=1}^{p}\gamma_{i}=c\\ \mu_{i}\gamma_{i}=0\\ \mu_{i}\leq 0\\ \gamma_{i}\geq 0\\ \end{cases}$

Consider the case $\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)<0$ . From the second equation, we infer that $\mu_{i}\neq 0$ (because if $\mu_{i}=0$ , then since $\mu_{j}\leq 0$ by inequality 5, we would get $\gamma_{i}<0$ , in contradiction with inequality 6). From $\mu_{i}\neq 0$ , we infer that $\gamma_{i}=0$ by equation 4.

Consider now the case $\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)=0$ . From the second equation, if $\mu_{i}=0$ then $\gamma_{i}\leq 0$ since $\mu_{j}\leq 0$ by inequality 5, and inequality 6 then gives $\gamma_{i}=0$ . If instead $\mu_{i}\neq 0$ , then $\gamma_{i}=0$ by equation 4.

Let $\mathcal{P}=\{i|\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)>0\}$ and $\mathcal{N}=\{i|\kappa_{i}-\frac{1}{p}(\sum_{i=1}^{p}\kappa_{i}-c)\leq 0\}$ . Since $\gamma_{i}=0$ for every $i\in\mathcal{N}$ , we recover exactly the same problem as before, restricted to the active indices in the set $\mathcal{P}$ :

$\begin{cases}\gamma_{i}-\kappa_{i}+\lambda+\mu_{i}=0,\forall i\in\mathcal{P}\\ \sum_{i\in\mathcal{P}}\gamma_{i}=c\\ \mu_{i}\gamma_{i}=0,\forall i\in\mathcal{P}\\ \mu_{i}\leq 0,\forall i\in\mathcal{P}\\ \gamma_{i}\geq 0,\forall i\in\mathcal{P}\\ \end{cases}$

We then repeat this step until $\gamma$ satisfies the constraints. Regarding convergence, $\gamma$ has exactly $p$ elements $\gamma_{i}$ , and each time we project to get a new active set $\mathcal{P}$ , we reduce the number of active elements $\gamma_{i}$ . Since the number of active elements is a positive integer that decreases at every step, the procedure terminates after at most $p$ steps. Here is an implementation for multiple $\kappa$ ( $\kappa\in\mathbb{R}^{M\times p}$ ) in PyTorch.

import torch


def prox_positive_and_sum_constraint(x, c):
    """Project each row of x (shape M x p) onto {gamma >= 0, sum(gamma) = c}."""
    n = x.size(1)
    # Shift every row so that it sums to c.
    k = (c - torch.sum(x, dim=1)) / float(n)
    x_0 = x + k[:, None]
    while (x_0 < 0).any():
        # Deactivate negative entries by setting them to zero.
        x_0[x_0 < 0] = 0.
        # Redistribute the missing mass over the entries that are still active.
        one = x_0 > 0
        n_0 = one.sum(dim=1)
        k_0 = (c - torch.sum(x_0, dim=1)) / n_0
        x_0 = x_0 + k_0[:, None] * one
    return x_0
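As a hand-checkable illustration of the iteration above, consider a single row with the hypothetical values $\kappa=(0.5,0.8,-0.3)$ , $p=3$ and $c=1$ (these numbers are chosen for illustration only and do not come from the experiments):

```latex
\begin{aligned}
\text{step 1: }&\kappa_{i}-\tfrac{1}{3}\Big(\sum_{i=1}^{3}\kappa_{i}-1\Big)=\kappa_{i}-0\ \Rightarrow\ (0.5,\,0.8,\,-0.3),\quad\mathcal{P}=\{1,2\},\ \mathcal{N}=\{3\},\ \gamma_{3}=0\\
\text{step 2: }&\gamma_{i}=\kappa_{i}-\tfrac{1}{2}\Big(\sum_{i\in\mathcal{P}}\kappa_{i}-1\Big)=\kappa_{i}-0.15\ \Rightarrow\ \gamma=(0.35,\,0.65,\,0)
\end{aligned}
```

After the second step every component is non-negative and the components sum to $c=1$ , so the procedure stops.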

Manifold attack for embedded representation

Architecture of model $g()$ (Pytorch style) :

-

S curve data : Conv1d[ $1,4,2$ ] $\rightarrow$ ReLu $\rightarrow$ Conv1d[ $4,4,2$ ] $\rightarrow$ ReLu $\rightarrow$ Flatten $\rightarrow$ Fc[ $4,2$ ].
-

Digit data : Conv2d[ $1,8,3$ ] $\rightarrow$ ReLu $\rightarrow$ Conv2d[ $8,16,3$ ] $\rightarrow$ ReLu $\rightarrow$ Flatten $\rightarrow$ Fc[ $64,2$ ].

Optimizer : Stochastic gradient descent, with learning rate $lr=0.001$ and momentum = 0.9. The learning rate is reduced by $lr=lr^{0.5}$ every 10 epochs. The number of epochs is 40.

Robustness to adversarial examples

Hyper-parameters : optimizer = Stochastic gradient descent, number of epochs = 300, learning rate = 0.1, momentum = 0.9, learning rate reduced by $lr=0.1lr$ every 75 epochs, batch size = 200, weight decay = 0.0001.

Semi-supervised manifold attack

Hyper-parameters : optimizer = Adam, number of epochs = 1024, learning rate = 0.002, $\alpha=0.75$ , batch size labelled = batch size unlabelled = 64, T = 0.5 (in sharpening), $\lambda=75$ (linearly ramped up from 0), EMA = 0.999, validation error computed every 1024 batches.

To reproduce an experiment, we define function seed_ as:

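import random

import numpy as np
import torch


def seed_(p):
    """Seed all random number generators for reproducibility."""
    torch.manual_seed(p)
    np.random.seed(p)
    random.seed(p)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(p)
        # Make cuDNN deterministic, at the cost of some speed.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    return 0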

The four experiments 1, 2, 3, 4 in table 4 are launched with seed_(0), seed_(1), seed_(2) and seed_(3) respectively.

In Adversarial MixMatch, $n\_iters$ is fixed at 1. For the CIFAR-10 dataset, $\xi$ starts at 0.1 and decreases linearly to 0.01 after 1024 epochs. For the SVHN dataset, $\xi$ starts at 0.1 and decreases linearly to 0.01 after 1024 epochs for seed_(0) and seed_(3); $\xi$ starts at 0.1 and decreases linearly to 0.001 after 1024 epochs for seed_(2); $\xi$ starts at 0.01 and decreases linearly to 0.001 after 1024 epochs for seed_(1).
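A linear schedule of this kind can be written as below. The function name is hypothetical, and the sketch assumes that epochs are indexed from 0 and that the final value is reached at the last epoch; the exact convention used in the experiments is not specified here.

```python
def linear_xi(epoch, n_epochs=1024, xi_start=0.1, xi_end=0.01):
    """Gradient step xi at a given epoch, interpolated linearly between two values."""
    # fraction of training completed, clipped to [0, 1]
    frac = min(max(epoch / max(n_epochs - 1, 1), 0.0), 1.0)
    return xi_start + frac * (xi_end - xi_start)
```

For example, the SVHN run with seed_(1) would correspond to `linear_xi(epoch, 1024, 0.01, 0.001)`.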