
Multi-Agent Adversarial Training Using Diffusion Learning

Abstract

This work focuses on adversarial learning over graphs. We propose a general adversarial training framework for multi-agent systems using diffusion learning. We analyze the convergence properties of the proposed scheme for convex optimization problems, and illustrate its enhanced robustness to adversarial attacks.

Index Terms—  Adversarial training, decentralized optimization, diffusion strategy, multi-agent systems.

1 Introduction

In many machine learning algorithms, small malicious perturbations that are imperceptible to the human eye can cause classifiers to reach erroneous conclusions [1, 2, 3, 4, 5]. To mitigate the negative effect of adversarial examples, one methodology is adversarial training [6], in which clean training samples are augmented with adversarial samples generated by adding purposefully crafted perturbations. Due to the lack of an explicit definition for the imperceptibility of perturbations, additive attacks are usually restricted to a small bounded region. Most earlier studies, such as [7, 8, 6, 9, 10], have focused on adversarial training in the context of single-agent learning. In this work, we devise a robust training algorithm for multi-agent networked systems by relying on diffusion learning, which has been shown to have a wider stability range and improved performance guarantees for adaptation in comparison to other decentralized strategies [5, 11, 12].

There exist, of course, other works in the literature that apply adversarial learning to a multiplicity of agents, albeit using different architectures. For example, the works [13, 14, 15] employ multiple GPUs and a fusion center, while the works [16, 17, 18] consider graph neural networks. In this work, we focus on a fully decentralized architecture where each agent corresponds to a learning unit in its own right, and interactions occur locally over neighborhoods determined by a graph topology.

We formulate a sequential minimax optimization problem involving adversarial samples, and assume in this article that the perturbations are within an $\ell_{2}$-bounded region. We hasten to add though that the analysis can be extended to other norms, such as $\ell_{1}$- and $\ell_{\infty}$-bounded perturbations. For simplicity, and due to space limitations, we consider the $\ell_{2}$ case here.

In the performance analysis, we examine the convergence of the proposed framework for convex settings due to space limitations, but note that similar bounds can be derived for nonconvex environments by showing convergence towards local minimizers. In particular, we show here that with strongly convex loss functions, the proposed algorithm approaches an $O(\mu)$-neighborhood of the global minimizer after sufficient iterations, where $\mu$ is the step-size parameter.

2 Problem Setting

Consider a collection of $K$ agents where each agent $k$ observes independent realizations of some random data $(\boldsymbol{x}_{k},\boldsymbol{y}_{k})$, where $\boldsymbol{x}_{k}$ plays the role of the feature vector and $\boldsymbol{y}_{k}$ plays the role of the label variable. Adversarial training in the decentralized setting deals with the following stochastic minimax optimization problem

w^{\star}=\mathop{\mathrm{argmin}}_{w\in\mathbbm{R}^{M}}\left\{J(w)\overset{\Delta}{=}\sum_{k=1}^{K}\pi_{k}J_{k}(w)\right\} \qquad (1)

where $\{\pi_{k}\}_{k=1}^{K}$ are positive scaling weights adding up to one, and each individual risk function is defined by

J_{k}(w)=\mathds{E}_{\{\boldsymbol{x}_{k},\boldsymbol{y}_{k}\}}\left\{\max_{\|\delta_{k}\|\leq\epsilon}Q_{k}(w;\boldsymbol{x}_{k}+\delta_{k},\boldsymbol{y}_{k})\right\} \qquad (2)

in terms of a loss function $Q_{k}(\cdot)$. In this formulation, the variable $\delta_{k}$ represents an $\ell_{2}$ norm-bounded perturbation used to generate adversarial examples, and $\boldsymbol{y}_{k}$ is the true label of sample $\boldsymbol{x}_{k}$. We refer to $w^{\star}$ as the robust model. In this paper, we assume all agents observe data sampled independently (over time and space) from the same statistical distribution.

One methodology for solving (1) is to first determine the inner maximizer in (2), thus reducing the minimax problem to a standard stochastic minimization formulation. Then, the traditional stochastic gradient method could be used to seek the minimizer. We denote the true maximizer of the perturbed loss function in (2) by

\boldsymbol{\delta}_{k}^{\star}(w)\in\mathop{\mathrm{argmax}}_{\|\delta_{k}\|\leq\epsilon}Q_{k}(w;\boldsymbol{x}_{k}+\delta_{k},\boldsymbol{y}_{k}) \qquad (3)

where the dependence of $\boldsymbol{\delta}_{k}^{\star}$ on $w$ is shown explicitly. To apply the stochastic gradient method, we would need to evaluate the gradient of $Q_{k}(w;\boldsymbol{x}_{k}+\boldsymbol{\delta}_{k}^{\star}(w),\boldsymbol{y}_{k})$ relative to $w$, which can be challenging since $\boldsymbol{\delta}_{k}^{\star}(w)$ is also dependent on $w$. This difficulty can be resolved by appealing to Danskin’s theorem [19, 20, 21]. Let

g(w)\overset{\Delta}{=}\max_{\|\delta_{k}\|\leq\epsilon}Q_{k}(w;\boldsymbol{x}_{k}+\delta_{k},\boldsymbol{y}_{k}) \qquad (4)

Then, the theorem asserts that $g(w)$ is convex over $w$ if $Q_{k}(w;\cdot,\cdot)$ is convex over $w$. Moreover, $g(w)$ need not be differentiable over $w$ even when $Q_{k}(w;\cdot,\cdot)$ is differentiable. However, and importantly for our purposes, we can determine a subgradient for $g(w)$ by using the actual gradient of the loss evaluated at the worst perturbation, namely, it holds that

\nabla_{w}Q_{k}(w;\boldsymbol{x}_{k}+\boldsymbol{\delta}_{k}^{\star},\boldsymbol{y}_{k})\in\partial_{w}g(w) \qquad (5)

where $\partial_{w}$ refers to the subdifferential set of its argument. In (5), the gradient of $Q_{k}(\cdot)$ relative to $w$ at the maximizer $\boldsymbol{\delta}_{k}^{\star}$ is computed by treating $\boldsymbol{\delta}_{k}^{\star}$ as a stand-alone vector and ignoring its dependence on $w$. When $\boldsymbol{\delta}_{k}^{\star}$ in (3) happens to be unique, the subdifferential set on the right of (5) reduces to the single gradient on the left, so that in that case the function $g(w)$ is differentiable.
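To illustrate property (5), the following short Python sketch (with illustrative names and synthetic data) anticipates the logistic-regression setting of Sec. 4, where the inner maximizer has the closed form (19), and compares a finite-difference gradient of $g(w)$ with the gradient of the loss evaluated at the worst-case perturbation.

```python
import numpy as np

# A minimal numerical check of the Danskin property (5) for the logistic loss of Sec. 4;
# all names and values below are illustrative assumptions, not the paper's settings.

def loss(w, x, y, delta):
    # Q(w; x + delta, y) = ln(1 + exp(-y (x + delta)^T w))
    return np.log1p(np.exp(-y * (x + delta) @ w))

def worst_delta(w, y, eps):
    # Closed-form worst-case l2-bounded perturbation for the logistic loss, cf. (19).
    return -eps * y * w / np.linalg.norm(w)

def danskin_grad(w, x, y, eps):
    # Gradient of Q w.r.t. w at delta*, treating delta* as a stand-alone vector.
    d = worst_delta(w, y, eps)
    u = -y * (x + d) @ w
    return (1.0 / (1.0 + np.exp(-u))) * (-y * (x + d))

rng = np.random.default_rng(0)
M, eps = 5, 0.3
w, x, y = rng.standard_normal(M), rng.standard_normal(M), 1.0

# Finite-difference gradient of g(w) = max_{||delta|| <= eps} Q(w; x + delta, y).
g = lambda v: loss(v, x, y, worst_delta(v, y, eps))
fd = np.array([(g(w + 1e-6 * e) - g(w - 1e-6 * e)) / 2e-6 for e in np.eye(M)])

print(np.allclose(fd, danskin_grad(w, x, y, eps), atol=1e-5))  # expect True
```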

Motivated by these properties, and using (5), we can now propose an algorithm to enhance the robustness of multi-agent systems to adversarial perturbations. To do so, we rely on the adapt-then-combine (ATC) version of the diffusion strategy [11, 12] and write down the following adversarial extension to solve (1)–(2)

\boldsymbol{x}_{k,n}^{\star}=\boldsymbol{x}_{k,n}+\boldsymbol{\delta}^{\star}_{k,n} \qquad (6a)
\boldsymbol{\phi}_{k,n}=\boldsymbol{w}_{k,n-1}-\mu\nabla_{w}Q_{k}(\boldsymbol{w}_{k,n-1};\boldsymbol{x}_{k,n}^{\star},\boldsymbol{y}_{k,n}) \qquad (6b)
\boldsymbol{w}_{k,n}=\sum_{\ell\in\mathcal{N}_{k}}a_{\ell k}\boldsymbol{\phi}_{\ell,n} \qquad (6c)

where

\boldsymbol{\delta}^{\star}_{k,n}\in\mathop{\mathrm{argmax}}_{\|\delta_{k}\|\leq\epsilon}Q_{k}(\boldsymbol{w}_{k,n-1};\boldsymbol{x}_{k,n}+\delta_{k},\boldsymbol{y}_{k,n}) \qquad (7)

In this implementation, expression (6a) computes the worst-case adversarial example at iteration $n$ using the perturbation from (7), while (6b) is the intermediate adaptation step in which all agents simultaneously update their parameters with step-size $\mu$. Relation (6c) is the convex combination step where the intermediate states $\boldsymbol{\phi}_{\ell,n}$ from the neighbors of agent $k$ are combined together. The scalars $a_{\ell k}$ are non-negative and they add up to one over $\ell\in\mathcal{N}_{k}$.
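For concreteness, the following Python sketch simulates recursion (6a)–(6c) for the robust logistic regression model described later in Sec. 4, where (7) admits the closed-form solution (19). The topology, the averaging combination rule, the synthetic streaming data, and all hyperparameters below are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

# Sketch of the adversarial ATC diffusion strategy (6a)-(6c) for logistic regression.
# Graph, combination weights, data model, and hyperparameters are illustrative assumptions.

rng = np.random.default_rng(0)
K, M, mu, eps, T = 20, 10, 0.05, 0.1, 2000   # agents, dimension, step-size, l2 bound, iterations

# Ring topology with self-loops; normalizing each column makes A = [a_{lk}] left-stochastic.
A = np.eye(K)
for k in range(K):
    A[k, (k + 1) % K] = A[(k + 1) % K, k] = 1.0
A = A / A.sum(axis=0, keepdims=True)

w_true = rng.standard_normal(M)              # used only to generate synthetic labels
W = np.zeros((K, M))                         # current iterate w_{k,n-1} at each agent

def grad_Q(w, x, y):
    # Gradient of ln(1 + exp(-y x^T w)) w.r.t. w (the perturbation is already folded into x).
    u = np.clip(-y * (x @ w), -50.0, 50.0)
    return (1.0 / (1.0 + np.exp(-u))) * (-y * x)

for n in range(T):
    Phi = np.zeros_like(W)
    for k in range(K):
        x = rng.standard_normal(M)
        y = 1.0 if x @ w_true > 0 else -1.0                       # streaming sample at agent k
        delta = -eps * y * W[k] / (np.linalg.norm(W[k]) + 1e-12)  # (6a): worst-case perturbation, cf. (19)
        Phi[k] = W[k] - mu * grad_Q(W[k], x + delta, y)           # (6b): adapt
    W = A.T @ Phi                                                 # (6c): combine over neighborhoods
```

The uniform averaging rule used above is only one possible choice; any combination matrix satisfying Assumption 1 below can be used in its place.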

3 Convergence Analysis

This section analyzes the convergence of the adversarial diffusion strategy (6a)–(6c) for the case of strongly convex loss functions. We list the following assumptions, which are commonly used in the literature of decentralized multi-agent learning and single-agent adversarial training [11, 22, 23, 24, 25].

Assumption 1.

(Strongly-connected graph) The entries of the combination matrix $A=[a_{\ell k}]$ satisfy $a_{\ell k}\geq 0$, and the entries on each column add up to one, which means that $A$ is left-stochastic. Moreover, the graph is assumed to be strongly connected, meaning that there exists a path with nonzero weights $\{a_{\ell k}\}$ linking any pair of agents and, in addition, at least one node $k$ in the network has a self-loop with $a_{kk}>0$.
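As an illustration only, a small helper of the following form (not part of the paper) can be used to verify whether a given combination matrix satisfies the conditions of Assumption 1.

```python
import numpy as np

# Illustrative check of Assumption 1 for a combination matrix A = [a_{lk}].
def satisfies_assumption_1(A, tol=1e-10):
    K = A.shape[0]
    nonnegative = bool(np.all(A >= -tol))
    left_stochastic = bool(np.allclose(A.sum(axis=0), 1.0))        # columns add up to one
    # Strong connectivity: with B the 0/1 support of A, (I + B)^(K-1) has no zero entry.
    B = (A > tol).astype(float)
    strongly_connected = bool(np.all(np.linalg.matrix_power(np.eye(K) + B, K - 1) > 0))
    has_self_loop = bool(np.any(np.diag(A) > tol))                 # at least one a_{kk} > 0
    return nonnegative and left_stochastic and strongly_connected and has_self_loop
```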

Assumption 2.

(Strong convexity) For each agent $k$, the loss function $Q_{k}(w;\cdot)$ is $\nu$-strongly convex over $w$, namely, for any $w_{1},w_{2},x\in\mathbbm{R}^{M}$ and $y\in\mathbbm{R}$, it holds that

Q_{k}(w_{2};x,y)\geq Q_{k}(w_{1};x,y)+\nabla_{w^{\sf T}}Q_{k}(w_{1};x,y)(w_{2}-w_{1})+\frac{\nu}{2}\|w_{2}-w_{1}\|^{2} \qquad (8)

$\square$

We remark that it also follows from Danskin’s theorem [19, 20, 21] that, when $Q_{k}(w;\cdot,\cdot)$ is $\nu$-strongly convex over $w$, the adversarial risk $J_{k}(w)$ defined by (2) will be strongly convex. As a result, the aggregate risk $J(w)$ in (1) will be strongly convex as well.

Assumption 3.

(Smooth loss functions) For each agent $k$, the gradients of the loss function relative to $w$ and $x$ are Lipschitz in relation to the three variables $\{w,x,y\}$ in the following manner:

\|\nabla_{w}Q_{k}(w_{2};x+\delta,y)-\nabla_{w}Q_{k}(w_{1};x+\delta,y)\|\leq L_{1}\|w_{2}-w_{1}\| \qquad (9a)
\|\nabla_{w}Q_{k}(w;x_{2}+\delta,y)-\nabla_{w}Q_{k}(w;x_{1}+\delta,y)\|\leq L_{2}\|x_{2}-x_{1}\| \qquad (9b)
\|\nabla_{w}Q_{k}(w;x+\delta,y_{2})-\nabla_{w}Q_{k}(w;x+\delta,y_{1})\|\leq L_{3}\|y_{2}-y_{1}\| \qquad (9c)

and

\|\nabla_{x}Q_{k}(w_{2};x+\delta,y)-\nabla_{x}Q_{k}(w_{1};x+\delta,y)\|\leq L_{4}\|w_{2}-w_{1}\| \qquad (10a)
\|\nabla_{x}Q_{k}(w;x_{2}+\delta,y)-\nabla_{x}Q_{k}(w;x_{1}+\delta,y)\|\leq L_{5}\|x_{2}-x_{1}\| \qquad (10b)

where $\|\delta\|\leq\epsilon$. For later use, we let $L=\max\{L_{1},L_{2},L_{3},L_{4},L_{5}\}$.

Assumption 4.

(Bounded gradient disagreement) For any pair of agents $k$ and $\ell$, the squared gradient disagreements are uniformly bounded on average, namely, for any $w\in\mathbbm{R}^{M}$ and $\|\delta\|\leq\epsilon$, it holds that

\mathds{E}_{\{\boldsymbol{x},\boldsymbol{y}\}}\|\nabla_{w}Q_{k}(w;\boldsymbol{x}+\delta,\boldsymbol{y})-\nabla_{w}Q_{\ell}(w;\boldsymbol{x}+\delta,\boldsymbol{y})\|^{2}\leq C^{2} \qquad (11)

$\square$

Note that (11) is automatically satisfied when all agents use the same loss function and collect data independently from the same distribution.

To evaluate the performance of the proposed framework (6a)–(6c), it is critical to compute the inner maximizer $\boldsymbol{\delta}_{k,n}^{\star}$ defined by (7). Fortunately, for some convex problems, such as logistic regression, the maximization in (7) has a unique closed-form solution, which will be shown in the simulation section. Thus, we analyze the convergence properties of (6a)–(6c) when $\boldsymbol{\delta}_{k,n}^{\star}$ is unique. We first establish the following affine Lipschitz result for the risk function in (2); proofs are omitted due to space limitations.

Lemma 1.

(Affine Lipschitz) For each agent $k$, the gradient of $J_{k}(w)$ is affine Lipschitz, namely, for any $w_{1},w_{2}\in\mathbbm{R}^{M}$, it holds that

\|\nabla_{w}J_{k}(w_{2})-\nabla_{w}J_{k}(w_{1})\|^{2}\leq 2L^{2}\|w_{2}-w_{1}\|^{2}+8L^{2}\epsilon^{2} \qquad (12)

$\square$

Contrary to the traditional analysis of decentralized learning algorithms, where the gradients of the risk functions $J_{k}(w)$ are typically Lipschitz, it follows from (12) that under adversarial perturbations the gradients of the risks in (2) are only affine Lipschitz. This requires adjustments to the convergence arguments. A similar situation arises, for example, when one studies the convergence of decentralized learning under non-smooth losses; see [5, 26, 27].

To proceed with the convergence analysis, we introduce the gradient noise process, which is defined by

\boldsymbol{s}_{k,n}(\boldsymbol{w}_{k,n-1})=\nabla_{w}Q_{k}(\boldsymbol{w}_{k,n-1};\boldsymbol{x}_{k,n}^{\star},\boldsymbol{y}_{k,n})-\nabla_{w}J_{k}(\boldsymbol{w}_{k,n-1}) \qquad (13)

This quantity measures the difference between the approximate gradient (represented by the gradient of the loss) and the true gradient (represented by the gradient of the risk). The following result establishes some useful properties for the gradient noise process, namely, it has zero mean and bounded second-order moment (conditioned on past history).

Lemma 2.

(Moments of gradient noise) For each agent $k$, the gradient noise defined in (13) is zero mean and its variance satisfies

\mathds{E}\left\{\boldsymbol{s}_{k,n}(\boldsymbol{w}_{k,n-1})\,|\,\boldsymbol{\mathcal{F}}_{n-1}\right\}=0 \qquad (14)
\mathds{E}\left\{\|\boldsymbol{s}_{k,n}(\boldsymbol{w}_{k,n-1})\|^{2}\,|\,\boldsymbol{\mathcal{F}}_{n-1}\right\}\leq\beta_{k,\epsilon}^{2}\|\widetilde{\boldsymbol{w}}_{k,n-1}\|^{2}+\sigma_{k,\epsilon}^{2} \qquad (15)

for some non-negative scalars $\beta_{k,\epsilon}^{2}$ and $\sigma_{k,\epsilon}^{2}$ that depend on $\epsilon$ and can vary across agents. In the above notation, the quantity $\boldsymbol{\mathcal{F}}_{n-1}$ is the filtration generated by the past history of the random process $\mathrm{col}\{\boldsymbol{w}_{k,n}\}$, and

\widetilde{\boldsymbol{w}}_{k,n-1}=w^{\star}-\boldsymbol{w}_{k,n-1} \qquad (16)

$\square$

The main convergence result is stated next; the proof is again omitted due to space limitations.

Theorem 1.

(Network mean-square-error stability) Consider a network of $K$ agents running the adversarial diffusion learning algorithm (6a)–(6c). Under Assumptions 1–4 and for a sufficiently small step size $\mu$, the network converges asymptotically to an $O(\mu)$-neighborhood around the global minimizer $w^{\star}$ at an exponential rate, namely,

\limsup_{n\to\infty}\mathds{E}\|\widetilde{\boldsymbol{w}}_{k,n-1}\|^{2}\leq O(\mu) \qquad (17)

$\square$

The above theorem indicates that the proposed algorithm enables the network to approach an $O(\mu)$-neighborhood of the robust minimizer $w^{\star}$ after enough iterations, so that the worst-case performance over all possible perturbations in the small region bounded by $\epsilon$ can be effectively minimized.

4 Computer Simulations

In this section, we illustrate the performance of the proposed algorithm using a logistic regression application. Let $\boldsymbol{\gamma}$ be a binary variable that takes values from $\{-1,1\}$, and let $\boldsymbol{h}\in\mathbbm{R}^{M}$ be a feature variable. The robust logistic regression problem over a network of agents employs the risk functions:

J_{k}(w)=\mathds{E}\max_{\|\delta\|\leq\epsilon}\left\{\ln\left(1+e^{-\boldsymbol{\gamma}(\boldsymbol{h}+\delta)^{\sf T}w}\right)\right\} \qquad (18)

The analytical solution for the inner maximizer (i.e., the worst-case perturbation) is given by

\boldsymbol{\delta}^{\star}=-\epsilon\boldsymbol{\gamma}\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|} \qquad (19)

which is consistent with the perturbation computed from the fast gradient method (FGM) [7].
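As a simple numerical sanity check with illustrative values, one can confirm that (19) attains the inner maximum in (18): no random perturbation on the $\ell_2$ ball of radius $\epsilon$ should yield a larger loss.

```python
import numpy as np

# Illustrative check that the closed-form perturbation (19) attains the inner maximum
# of the logistic loss over the l2 ball of radius eps; values below are arbitrary.
rng = np.random.default_rng(1)
M, eps = 8, 0.5
w, h, gamma = rng.standard_normal(M), rng.standard_normal(M), -1.0

loss = lambda d: np.log1p(np.exp(-gamma * (h + d) @ w))
delta_star = -eps * gamma * w / np.linalg.norm(w)

best_random = -np.inf
for _ in range(10000):
    v = rng.standard_normal(M)
    best_random = max(best_random, loss(eps * v / np.linalg.norm(v)))  # random boundary point

print(loss(delta_star) >= best_random)  # expect True
```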


Fig. 1: A randomly generated graph structure used in the simulations.


Fig. 2: The convergence plots for the two datasets: (a) The evolution of the average classification error over adversarial examples bounded by $\epsilon=4$ during training for MNIST; (b) The evolution of the average classification error over adversarial examples bounded by $\epsilon=1.5$ during training for CIFAR10.


Fig. 3: The robustness plots for the two datasets: (a) Classification error versus perturbation size for MNIST; (b) Classification error versus perturbation size for CIFAR10. The graphs show three plots illustrating the behavior of the traditional (nonrobust) algorithm to worst-case perturbations generated by means of FGM, as well as the performance of the proposed adversarial diffusion strategy (6a)–(6c) to attacks generated by FGM and DeepFool.


Fig. 4: Visualization of the original and adversarial samples. The first row consists of 10 random original samples with the titles representing their true classes. The second row shows the adversarial examples generated by DeepFool and applied to the standard (nonrobust) algorithm. The third row shows the results obtained by the adversarial (robust) algorithm. The titles are the predictions by the corresponding models. The same construction is repeated in the last two rows using FGM. If the prediction of an image is wrong, the title is shown in red color. It is seen that the adversarial algorithm fails less frequently.

In our experiments, we use both the MNIST [28] and CIFAR10 [29] datasets, and randomly generate a graph with 20 nodes, shown in Fig. 1. We limit our simulations to binary classification in this example. For this reason, we consider samples with digits 0 and 1 from MNIST, and images of airplanes and automobiles from CIFAR10. We set the perturbation bound in (18) to $\epsilon=4$ for MNIST and $\epsilon=1.5$ for CIFAR10. In the test phase, we compute the average classification error across the network to measure the performance of the multi-agent system against perturbations of different strengths.

We first illustrate the convergence of our algorithm, as anticipated by Theorem 1. From Fig. 2, we observe a steady decrease in the classification error towards a limiting value.

The robust behavior of the proposed algorithm is illustrated in Fig. 3 for both MNIST and CIFAR10. We explain the curves for MNIST; similar remarks hold for CIFAR10. In the simulation, we use perturbations generated in one of two ways: the FGM worst-case construction and the DeepFool construction [5, 30]. The figure shows three curves. The red curve is obtained by training the network using the traditional diffusion learning strategy without accounting for robustness. The network is subsequently fed with worst-case perturbed samples (generated using FGM) during testing, corresponding to different levels of $\epsilon$. The red curve shows that the classification error deteriorates rapidly. The blue curve repeats the same experiment, except that the network is now trained with the adversarial diffusion strategy (6a)–(6c). It is seen in the blue curve that the testing error is more resilient and the degradation is better controlled. The third curve repeats the experiment with the same adversarially trained network, where the perturbed samples are now generated using DeepFool instead of FGM. Here again, the network is resilient and the degradation in performance is better controlled.
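For reference, a test-phase sweep of this kind can be sketched as follows; the arrays H and gammas, the trained model w, and the function name are assumptions used for illustration only.

```python
import numpy as np

# Illustrative test-phase sweep: classification error of a trained model w under FGM-style
# worst-case perturbations of increasing strength; H (N x M) and gammas (N,) are assumed given.
def classification_error(w, H, gammas, eps):
    D = -eps * gammas[:, None] * w / np.linalg.norm(w)   # per-sample perturbation from (19)
    preds = np.sign((H + D) @ w)                         # perturbed logistic-regression decisions
    return float(np.mean(preds != gammas))

# errors = [classification_error(w, H_test, gamma_test, e) for e in np.linspace(0, 4, 9)]
```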

In Fig. 4, we plot some randomly selected CIFAR10 images, their perturbed versions, and the classification decisions generated by the nonrobust algorithm and its adversarial version (6a)–(6c). We observe from the figure that no matter which attack method is applied, the perturbations are always imperceptible to the human eye. Moreover, while the nonrobust algorithm fails to classify correctly in most cases, the adversarial algorithm is more robust and leads to fewer classification errors.

5 Conclusion

In this paper, we proposed a diffusion defense mechanism against adversarial attacks. We analyzed the convergence of the proposed method under convex losses and showed that it approaches a small $O(\mu)$-neighborhood around the robust solution. We further illustrated the behavior of the trained network under perturbations generated by the FGM and DeepFool constructions and observed the enhanced robust behavior. Similar results are applicable to nonconvex losses and will be described in future work.

References

  • [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proc. International Conference on Learning Representations, Banff, 2014, pp. 1–10.
  • [2] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman, “PixelDefend: Leveraging generative models to understand and defend against adversarial examples,” in Proc. International Conference on Learning Representations, Vancouver, 2018, pp. 1–20.
  • [3] R. Jia and P. Liang, “Adversarial examples for evaluating reading comprehension systems,” in Proc. Empirical Methods in Natural Language Processing, Copenhagen, 2017, pp. 1–11.
  • [4] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proc. International Conference on Machine Learning, Sydney, 2017, pp. 2817–2826.
  • [5] A. H. Sayed, Inference and Learning from Data.   Cambridge University Press, 2022.
  • [6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. International Conference on Learning Representations, Vancouver, 2018, pp. 1–23.
  • [7] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proc. International Conference on Learning Representations, Toulon, 2017, pp. 1–11.
  • [8] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proc. International Conference on Learning Representations, San Diego, 2015, pp. 1–11.
  • [9] P. Maini, E. Wong, and Z. Kolter, “Adversarial robustness against the union of multiple perturbation models,” in Proc. International Conference on Machine Learning, 2020, pp. 6640–6650.
  • [10] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in Proc. International Conference on Machine Learning, California, 2019, pp. 7472–7482.
  • [11] A. H. Sayed, “Adaptation, learning, and optimization over networks,” Foundations and Trends in Machine Learning, vol. 7, pp. 311–801, 2014.
  • [12] A. H. Sayed, “Adaptive networks,” Proc. IEEE, vol. 102, no. 4, pp. 460–497, 2014.
  • [13] C. Qin, J. Martens, S. Gowal, D. Krishnan, K. Dvijotham, A. Fawzi, S. De, R. Stanforth, and P. Kohli, “Adversarial robustness through local linearization,” in Proc. Advances in Neural Information Processing Systems, Vancouver, 2019, pp. 13824–13833.
  • [14] G. Zhang, S. Lu, S. Liu, X. Chen, P.-Y. Chen, L. Martie, L. Horesh, and M. Hong, “Distributed adversarial training to robustify deep neural networks at scale,” in Proc. Conference on Uncertainty in Artificial Intelligence, Eindhoven, 2022, pp. 2353–2363.
  • [15] Y. Liu, X. Chen, M. Cheng, C.-J. Hsieh, and Y. You, “Concurrent adversarial learning for large-batch training,” in Proc. International Conference on Learning Representations, 2022, pp. 1–17.
  • [16] F. Feng, X. He, J. Tang, and T.-S. Chua, “Graph adversarial training: Dynamically regularizing based on graph structure,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 6, pp. 2493–2504, 2021.
  • [17] K. Xu, H. Chen, S. Liu, P. Y. Chen, T. W. Weng, M. Hong, and X. Lin, “Topology attack and defense for graph neural networks: An optimization perspective,” in Proc. International Joint Conference on Artificial Intelligence, Macao, 2019, pp. 3961–3967.
  • [18] X. Wang, X. Liu, and C.-J. Hsieh, “GraphDefense: Towards robust graph convolutional networks,” arXiv preprint arXiv:1911.04429, pp. 1–9, 2019.
  • [19] T. Lin, C. Jin, and M. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” in Proc. International Conference on Machine Learning, 2020, pp. 6083–6093.
  • [20] R. T. Rockafellar, Convex Analysis.   Princeton University Press, 2015.
  • [21] K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh, “Efficient algorithms for smooth minimax optimization,” in Proc. Advances in Neural Information Processing Systems, Vancouver, 2019, pp. 12659–12670.
  • [22] A. Sinha, H. Namkoong, R. Volpi, and J. Duchi, “Certifying some distributional robustness with principled adversarial training,” in Proc. International Conference on Learning Representations, Vancouver, 2018, pp. 1–34.
  • [23] S. Vlaski and A. H. Sayed, “Diffusion learning in non-convex environments,” in Proc. IEEE ICASSP, Brighton, 2019, pp. 5262–5266.
  • [24] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments—Part I: Agreement at a linear rate,” IEEE Transactions on Signal Processing, vol. 69, pp. 1242–1256, 2021.
  • [25] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments—Part II: Polynomial escape from saddle-points,” IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021.
  • [26] B. Ying and A. H. Sayed, “Performance limits of stochastic sub-gradient learning, Part I: Single agent case,” Signal Processing, vol. 144, pp. 271–282, 2018.
  • [27] B. Ying and A. H. Sayed, “Performance limits of stochastic sub-gradient learning, Part II: Multi-agent case,” Signal Processing, vol. 144, pp. 253–264, 2018.
  • [28] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [29] A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR-10 (Canadian Institute for Advanced Research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
  • [30] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “DeepFool: a simple and accurate method to fool deep neural networks,” in Proc. IEEE CVPR, Las Vegas, 2016, pp. 2574–2582.