
Federated Adversarial Learning: A Framework with Convergence Analysis

Xiaoxiao Li xiaoxiao.li@ece.ubc.ca. University of British Columbia.    Zhao Song zsong@adobe.com. Adobe Research.    Jiaming Yang jiamyang@umich.edu. University of Michigan, Ann Arbor.

Federated learning (FL) is a trending training paradigm to utilize decentralized training data. FL allows clients to update model parameters locally for several epochs, then share them with a central server for aggregation into a global model. This training paradigm, with multiple local updates before aggregation, exposes unique vulnerabilities to adversarial attacks. Adversarial training is a popular and effective method to improve the robustness of networks against adversaries. In this work, we formulate a general form of federated adversarial learning (FAL) that is adapted from adversarial learning in the centralized setting. On the client side of FL training, FAL has an inner loop to generate adversarial samples for adversarial training and an outer loop to update local model parameters. On the server side, FAL aggregates local model updates and broadcasts the aggregated model. We design a global robust training loss and formulate FAL training as a min-max optimization problem. Unlike the convergence analysis in classical centralized training, which relies on the gradient direction, it is significantly harder to analyze convergence in FAL for three reasons: 1) the complexity of min-max optimization, 2) the model not updating in the gradient direction due to multiple local updates on the client side before aggregation, and 3) inter-client heterogeneity. We address these challenges by using appropriate gradient approximation and coupling techniques and present the convergence analysis in the over-parameterized regime. Our main result theoretically shows that the minimum loss under our algorithm can converge to a small \epsilon with appropriately chosen learning rates and communication rounds. It is noteworthy that our analysis is feasible for non-IID clients.

1 Introduction

Federated learning (FL) plays an important role nowadays, as it allows different clients to train models collaboratively without sharing private information. One popular FL paradigm called FedAvg [MMR+17] introduces an easy-to-implement distributed learning method without data sharing. Specifically, it requires a central server to aggregate model updates computed by the local clients (also known as nodes or participants) using their private local data. The central server then uses these aggregated updates to train a global model.

Nowadays deep learning models are exposed to severe threats from adversarial samples: small adversarial perturbations of the inputs can dramatically change the outputs or induce wrong answers [SZS+13]. In this regard, much effort has been made to improve neural networks’ resistance to such perturbations using adversarial learning [TKP+17, SKC18, MMS+18]. Among these studies, the adversarial training scheme in [MMS+18] has achieved good robustness in practice. [MMS+18] proposes an adversarial training scheme that uses projected gradient descent (PGD) to generate adversarial samples as an augmented training set. Generating adversarial examples during neural network training is considered one of the most effective approaches to adversarial training according to the literature [CW17, ACW18, CH20].

Although adversarial learning has attracted much attention in the centralized domain, its practice in FL is under-explored [ZRSB20]. Like classical deep neural networks trained with gradient-based methods, FL paradigms are vulnerable to adversarial samples. Adversarial learning in FL brings multiple open challenges due to FL’s slow convergence, application in non-IID environments, and secure aggregation solutions. Hence applying adversarial learning in an FL paradigm may lead to unstable training loss and a lack of robustness. However, a recent practical work [ZRSB20] observed that although there exist difficulties with convergence, the federation of adversarial training with suitable hyperparameter settings can achieve adversarial robustness and acceptable performance. Motivated by these empirical results, we aim to address the provable properties of combining adversarial learning with FL from a theoretical perspective.

This work aims to theoretically study the unexplored convergence challenges that lie in the interaction between adversarial training and FL. To achieve a general understanding, we consider a general form of federated adversarial learning (FAL), which deploys an adversarial training scheme on local clients within the most common FL paradigm, FedAvg [MMR+17]. Specifically, FAL has an inner loop of local updating that generates adversarial samples (e.g., using [MMS+18]) for adversarial training and an outer loop to update local model weights on the client side. The global model is then aggregated using FedAvg [MMR+17]. Our algorithm is detailed in Algorithm 1.

We are interested in theoretically understanding the proposed FAL scheme from the aspects of model robustness and convergence:

Can federated adversarial training fit training data robustly and converge with an over-parameterized neural network?

The theoretical convergence analysis of adversarial training is challenging even in the centralized training setting. [TZT18] recently proposed a general theoretical method to analyze the risk bound with adversaries but did not address the convergence problem. The investigation of convergence for over-parameterized neural networks has achieved tremendous progress [DLL+19, AZLS19b, AZLS19a, DZPS19, ADH+19b]. The basic statement is that training can converge to sufficiently small training loss in polynomially many iterations using gradient descent or stochastic gradient descent, when the width of the network is polynomial in the number of training examples and the weights are initialized randomly. Recent theoretical analysis [GCL+19, ZPD+20] extends these standard training convergence results to adversarial training settings. To answer the above interesting but challenging question, we formulate FAL as a min-max optimization problem. We extend the convergence analysis on the general formulation of over-parameterized neural networks to the FL setting, allowing each client to perform min-max training and generate adversarial examples (see Algorithm 1). Additional challenges arise in FL convergence analysis due to its unique optimization method: 1) unlike the classical centralized setting, the global model of FL does not update in the gradient direction; 2) the inter-client heterogeneity issue needs to be considered.

Despite the challenges, we give an affirmative answer to the above question. To the best of our knowledge, this is the first theoretical study of the convergence of adversarial training in FL. The contributions of this paper are:

  • We propose a framework to analyze a general form of FAL in over-parameterized neural networks. We follow a natural and valid data separability assumption: the training data are well separated relative to the magnitude of the adversarial perturbations. After sufficient rounds of global communication, with a certain number of local gradient descent steps in each round t, we obtain a minimum loss close to zero. Notably, our assumptions do not rely on the data distribution, so the proposed analysis framework is feasible for non-IID clients.

  • We are the first to theoretically formulate the convergence of the FAL problem as a min-max optimization framework with the proposed loss function. In FL, the update of the global model is no longer directly determined by gradient directions, due to multiple local steps. To tackle this challenge, we define a new notion of gradient, the FL gradient. Under ReLU Lipschitz and over-parameterization assumptions, we use gradient coupling for gradient updates in FL to show that the model update of each global round is bounded in federated adversarial learning.

2 Related Work

Federated Learning

An efficient and privacy-preserving way to learn from distributed data collected on edge devices (a.k.a. clients) is FL. FedAvg is an easy-to-implement distributed learning strategy that aggregates local model updates on the server side and then transmits the averaged model back to the local clients. Later, many FL methods were developed based on FedAvg. These FL schemes can be divided into aggregation schemes [MMR+17, WLL+20, LJZ+21] and optimization schemes [RCZ+20, ZHD+20]. Nearly all of them share the common characteristic that client models are updated using gradient descent-based methods, which are vulnerable to adversarial attacks. In addition, data heterogeneity brings huge challenges to FL. For IID data, FL has been proven effective. However, in practice, data are mostly distributed in a non-IID manner. Non-IID data can substantially degrade the performance of FL models [ZLL+18, LHY+19, LJZ+21, LSZ+20]. Despite the potential security risks and unstable performance in the non-IID setting, as FL mitigates the concern of data sharing, it is still a popular and practical solution for distributed data learning in many real applications, such as healthcare [LGD+20, RHL+20], autonomous driving [LLC+19], and IoT [WTS+19, LLH+20].

Learning with Adversaries

Ever since adversarial examples were discovered [SZS+13], efforts have been made to propose more effective defense methods to make neural networks robust to perturbations. As adversarial examples are an issue of robustness, a popular scheme is learning with adversarial examples, which can be traced back to [GSS14]: it produces adversarial examples and injects them into the training data. Later, Madry et al. [MMS+18] proposed training on multi-step PGD adversaries and empirically observed that adversarial training consistently achieves small and robust training loss in wide neural networks.

Federated Adversarial Learning

Adversarial examples, which may not be visually distinguishable from benign samples, are often misclassified. This poses potential security threats for practical machine learning applications. Adversarial training [GSS14, KGB16] is a popular protocol to train more adversarially robust models by inserting adversarial examples into training. The use of adversarial training in FL presents a number of open challenges, including poor convergence due to multiple local update steps, instability and heterogeneity of clients, communication cost and security requirements, and so on. To defend against adversarial attacks in federated learning, a limited number of recent studies have proposed to include adversarial training on clients in the local training steps [BCMC19, ZRSB20]. These two works empirically demonstrated the performance of adversarial training, while the theoretical convergence analysis remains under-explored. [DKM20] focused on the problem of distributionally robust FL with an emphasis on reducing the communication rounds, trading an O(T^{1/8}) convergence rate for O(T^{1/4}) communication rounds. In addition, different from our focus on a generic theoretical analysis framework, [ZLL+21] is a methodology paper that proposed an adversarial training strategy in the classical distributed setting, with a focus on specific training strategies (PGD, FGSM), which could be generalized to a method in FAL.

Convergence via Over-parameterization

Convergence analysis of over-parameterized neural networks falls into two lines. In the first line of work [LL18, AZLS19b, AZLS19a, AZLL19], data separability plays a crucial role and is widely used to theoretically show convergence results in the over-parameterized neural network setting. To be specific, data separability theory shows that to guarantee convergence, the width m of a neural network must be at least a polynomial in all parameters (i.e. m\geq\operatorname{poly}(n,d,1/\delta)), where \delta is the minimum distance between all pairs of data points, n is the number of data points and d is the data dimension. Another line of work [DZPS19, ADH+19a, ADH+19b, SY19, LSS+20, BPSW21, SYZ21, SZZ21, HLSY21, Zha22, MOSW22] builds on the neural tangent kernel (NTK) [JGH18]. In this line of work, the minimum eigenvalue \lambda of the NTK is required to be lower bounded to guarantee convergence. Our analysis follows the former approach based on data separability.

Robustness of Federated Learning

Previously, several works theoretically analyzed the robustness of federated learning under noise. [YCKB18] developed distributed optimization algorithms that are provably robust against arbitrary and potentially adversarial behavior in distributed computing systems, mainly focusing on achieving optimal statistical performance. [RFPJ20] developed a robust federated learning algorithm by considering a structured affine distribution shift in users’ data. Their analysis was built on several assumptions on the loss functions without a direct connection to neural networks.

3 Problem Formulation

To explore the properties of FAL in deep learning, we formulate the problem in the over-parameterized neural network regime. We start by presenting the notation and setup required for federated adversarial learning; then we describe the loss function we use and our FAL algorithm.

3.1 Notations

For a vector x, we use \|x\|_{p} to denote its \ell_{p} norm; in this paper we mainly consider p=1,2, or \infty. For a matrix U\in\mathbb{R}^{d\times m}, we use U^{\top} to denote its transpose and \operatorname{tr}[U] to denote its trace. We use \|U\|_{1} to denote its entry-wise \ell_{1} norm, \|U\|_{2} to denote its spectral norm, and \|U\|_{F} to denote its Frobenius norm. For j\in[m], we let U_{j}\in\mathbb{R}^{d} be the j-th column of U. We let \|U\|_{2,1} denote \sum_{j=1}^{m}\|U_{j}\|_{2} and \|U\|_{2,\infty} denote \max_{j\in[m]}\|U_{j}\|_{2}.

We denote the Gaussian distribution with mean \mu and covariance \Sigma as \mathcal{N}(\mu,\Sigma). We use \sigma(\cdot) to denote the ReLU function \sigma(x)=\max\{x,0\}, and use \mathds{1}\{A\} to denote the indicator function of event A.

3.2 Problem Setup

Two-layer ReLU network in FAL

Following recent theoretical work on understanding neural network training in deep learning [DZPS19, ADH+19a, ADH+19b, SY19, LSS+20, SYZ21, Zha22], in this paper we focus on a two-layer neural network that has m neurons in the hidden layer, where each neuron is a ReLU activation function.

We define the global network as

f_{U}(x):=\sum_{r=1}^{m}a_{r}\cdot\sigma(\langle U_{r},x\rangle+b_{r})  (1)

and for c\in[N], we define the local network of client c as

f_{W_{c}}(x):=\sum_{r=1}^{m}a_{r}\cdot\sigma(\langle W_{c,r},x\rangle+b_{r}).  (2)

Here U=(U_{1},U_{2},\dots,U_{m})\in\mathbb{R}^{d\times m} is the global hidden weight matrix, W_{c}=(W_{c,1},\dots,W_{c,m})\in\mathbb{R}^{d\times m} is the local hidden weight matrix of client c, a=(a_{1},a_{2},\dots,a_{m})\in\mathbb{R}^{m} denotes the output weights, and b=(b_{1},b_{2},\dots,b_{m})\in\mathbb{R}^{m} denotes the bias.

During the process of federated adversarial learning, we only update the values of U and W while keeping a and b equal to their initialization, so we can write the global network as f_{U}(x) and the local network as f_{W_{c}}(x). When the weight matrix is clear from context, we write f(x) or f_{c}(x) for short.
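As a concrete reference for Eqs. (1) and (2), the following minimal numpy sketch evaluates such a two-layer ReLU network; the d\times m weight layout and the function name are our own illustration, not part of the paper.

import numpy as np

def two_layer_relu(U, a, b, x):
    """Evaluate f_U(x) = sum_r a_r * sigma(<U_r, x> + b_r), with sigma = ReLU.

    U: (d, m) hidden weights, a: (m,) output weights, b: (m,) bias, x: (d,).
    """
    pre = U.T @ x + b                        # <U_r, x> + b_r for every neuron r
    return float(a @ np.maximum(pre, 0.0))   # sum_r a_r * ReLU(...)

A local network f_{W_c} is evaluated the same way with W_c in place of U.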

Next, we make some standard assumptions regarding our training set.

Definition 3.1 (Dataset).

There are N clients and n=NJ data points in total. (For simplicity, we assume that all clients have the same number of training data; generalizing our result to the setting where each client has a different number of data is left as future work.) Let \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c}, where \mathcal{S}_{c}=\{(x_{c,1},y_{c,1}),\dots,(x_{c,J},y_{c,J})\}\subseteq\mathbb{R}^{d}\times\mathbb{R} denotes the J training data of client c. Without loss of generality, we assume \|x_{c,j}\|_{2}=1 holds for all c\in[N],j\in[J], and the last coordinate of each point equals 1/2, so we consider \mathcal{X}:=\{x\in\mathbb{R}^{d}:\|x\|_{2}=1,\ x_{d}=1/2\}. For simplicity, we assume that |y_{c,j}|\leq 1 holds for all c\in[N] and j\in[J]. (Our assumptions on data points are reasonable since we can scale up. In addition, \ell_{2} norm normalization is a typical technique in experiments. The same assumptions also appear in many previous theoretical works like [ADH+19b, AZLL19, AZLS19b].)

We now define the initialization for the neural networks.

Definition 3.2 (Initialization).

The initialization of a\in\mathbb{R}^{m}, U\in\mathbb{R}^{d\times m}, b\in\mathbb{R}^{m} is a(0)\in\mathbb{R}^{m}, U(0)\in\mathbb{R}^{d\times m}, b(0)\in\mathbb{R}^{m}. The initialization of client c’s local weight matrix W_{c} is W_{c}(0,0)=U(0). Here the second index in W_{c} denotes the local-step iteration.

  • For each r\in[m], a_{r}(0) is i.i.d. sampled uniformly from [-1/m^{1/3},+1/m^{1/3}].

  • For each i\in[d],r\in[m], U_{i,r}(0) and b_{r}(0) are i.i.d. random Gaussians sampled from \mathcal{N}(0,1/m). Here U_{i,r} denotes the (i,r)-entry of U.

For each global iteration t\in[T],

  • For each c\in[N], the initial value of client c’s local weight matrix W_{c} is W_{c}(t,0)=U(t).
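For concreteness, here is a minimal numpy sketch of the initialization in Definition 3.2; the function name and seed handling are our own illustration.

import numpy as np

def init_params(d, m, seed=0):
    """Sample a(0), U(0), b(0) as in Definition 3.2."""
    rng = np.random.default_rng(seed)
    a0 = rng.uniform(-m ** (-1 / 3), m ** (-1 / 3), size=m)  # a_r(0) uniform
    U0 = rng.normal(0.0, m ** (-1 / 2), size=(d, m))         # N(0, 1/m): std = 1/sqrt(m)
    b0 = rng.normal(0.0, m ** (-1 / 2), size=m)
    return a0, U0, b0

Each client then copies the global weights, W_c(t,0) = U(t), at the start of every round t.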

Next we formulate the adversary model that will be used.

Definition 3.3 (\rho-Bounded adversary).

Let \mathcal{F} denote the function class. An adversary is a mapping \mathcal{A}:\mathcal{F}\times\mathcal{X}\times\mathbb{R}\rightarrow\mathcal{X} which denotes the adversarial perturbation. For \rho>0, we define the \ell_{2} ball \mathcal{B}_{2}(x,\rho):=\{\widetilde{x}\in\mathbb{R}^{d}:\|\widetilde{x}-x\|_{2}\leq\rho\}\cap\mathcal{X}. We say an adversary \mathcal{A} is \rho-bounded if it satisfies \mathcal{A}(f,x,y)\in\mathcal{B}_{2}(x,\rho). Furthermore, given \rho>0, we denote the worst-case adversary as \mathcal{A}^{*}:=\operatorname{argmax}_{\widetilde{x}\in\mathcal{B}_{2}(x,\rho)}\ell(f(\widetilde{x}),y), where \ell is defined in Definition 3.5.
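The analysis does not fix a particular \mathcal{A}, but a common instantiation of a \rho-bounded adversary is \ell_{2} PGD [MMS+18]. The numpy sketch below only enforces the \ell_{2}-ball constraint (it omits the intersection with \mathcal{X} for brevity); grad_loss_x, which returns the gradient of \ell(f(x),y) with respect to the input x, is an assumed callable.

import numpy as np

def pgd_l2(grad_loss_x, x, y, rho, steps=7, alpha=0.02):
    """Sketch of a rho-bounded adversary A(f, x, y) via l2 PGD."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_loss_x(x_adv, y)
        x_adv = x_adv + alpha * g / (np.linalg.norm(g) + 1e-12)  # ascent step
        delta = x_adv - x
        norm = np.linalg.norm(delta)
        if norm > rho:                        # project back onto B_2(x, rho)
            x_adv = x + delta * (rho / norm)
    return x_adv

By construction the output stays within distance \rho of x, so this adversary is \rho-bounded.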

Well-separated training set

In the over-parameterized regime, it is a standard assumption that the training set is well-separated. Since we deal with adversarial perturbations, we require the following \gamma-separability, which is slightly stronger.

Definition 3.4 (\gamma-separability).

Let \gamma\in(0,1/2),\delta\in(0,1/2),\rho\in(0,1/2) denote three parameters such that \gamma\leq\delta\cdot(\delta-2\rho). We say our training set \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c}=\cup_{c\in[N],j\in[J]}\{(x_{c,j},y_{c,j})\}\subset\mathbb{R}^{d}\times\mathbb{R} is globally \gamma-separable w.r.t. a \rho-bounded adversary if \|x_{c_{1},j_{1}}-x_{c_{2},j_{2}}\|_{2}\geq\delta holds for any c_{1}\neq c_{2} and j_{1}\neq j_{2}.

Note that in the above definition, \gamma is introduced only to simplify the statement of Theorem 4.1, and the assumption \gamma\leq\delta\cdot(\delta-2\rho) is reasonable and easy to achieve in adversarial training. It is also noteworthy that our problem setup needs no assumption of independent and identically distributed (IID) data, so the formulation applies to the uniquely challenging non-IID setting in FL.
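As a quick illustration of Definition 3.4, the following numpy sketch checks whether a pooled dataset satisfies \gamma\leq\delta\cdot(\delta-2\rho); the helper names are hypothetical.

import numpy as np

def min_pairwise_distance(X):
    """delta: minimum l2 distance over all pairs of distinct rows of X."""
    n = X.shape[0]
    return min(np.linalg.norm(X[i] - X[j])
               for i in range(n) for j in range(i + 1, n))

def is_gamma_separable(X, gamma, rho):
    """Check gamma <= delta * (delta - 2 * rho) for the pooled data X."""
    delta = min_pairwise_distance(X)
    return gamma <= delta * (delta - 2 * rho)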

3.3 Federated Adversarial Learning

Adversary and robust loss

We adopt the following loss for simplicity of technical presentation, as is customary in prior studies [GCL+19, AZLL19]:

Definition 3.5 (Lipschitz convex loss).

A loss function \ell:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R} is said to be a Lipschitz convex loss if it satisfies the following properties: (i) convex w.r.t. the first input of \ell; (ii) 1-Lipschitz, which means |\ell(x_{1},y_{1})-\ell(x_{2},y_{2})|\leq\|(x_{1},y_{1})-(x_{2},y_{2})\|_{2}; and (iii) \ell(y,y)=0 for all y\in\mathbb{R}.

In this paper we assume \ell is a Lipschitz convex loss. Next we define the robust loss function of a network, which is based on the adversarial samples generated by a \rho-bounded adversary \mathcal{A}.

Definition 3.6 (Training loss).

Given a client’s training set \mathcal{S}_{c}=\{(x_{c,j},y_{c,j})\}_{j=1}^{J}\subset\mathbb{R}^{d}\times\mathbb{R} of J samples, let f_{c}:\mathbb{R}^{d}\rightarrow\mathbb{R} be a network. The classical training loss of f_{c} is \mathcal{L}(f_{c},\mathcal{S}_{c}):=\frac{1}{J}\sum_{j=1}^{J}\ell(f_{c}(x_{c,j}),y_{c,j}). Given \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c}, we define the global loss as

\mathcal{L}(f_{U},\mathcal{S}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(x_{c,j}),y_{c,j}).

Given an adversary \mathcal{A} that is \rho-bounded, we define the global loss with respect to \mathcal{A} as

\mathcal{L}_{\mathcal{A}}(f_{U}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(\mathcal{A}(f_{c},x_{c,j},y_{c,j})),y_{c,j})=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(\widetilde{x}_{c,j}),y_{c,j})

and also define the global robust loss (in terms of worst-case) as

\mathcal{L}_{\mathcal{A}^{*}}(f_{U}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(\mathcal{A}^{*}(f_{c},x_{c,j},y_{c,j})),y_{c,j})=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\max_{x_{c,j}^{*}\in\mathcal{B}_{2}(x_{c,j},\rho)}\ell(f_{U}(x_{c,j}^{*}),y_{c,j}).

Moreover, since we deal with the pseudo-network (Definition 5.1), we also define the loss of a pseudo-network as

\mathcal{L}(g_{c},\mathcal{S}_{c}):=\frac{1}{J}\sum_{j=1}^{J}\ell(g_{c}(x_{c,j}),y_{c,j})

and

\mathcal{L}(g_{U},\mathcal{S}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(g_{U}(x_{c,j}),y_{c,j}).

Algorithm

We focus on a general FAL framework that adapts the most common adversarial training scheme from the classical setting to the client. Specifically, we describe the adversarial learning of a local neural network f_{W_{c}} against an adversary \mathcal{A} that generates adversarial examples during training, as shown in Algorithm 1. Since we aim at a general theoretical analysis framework, we do not specify the explicit form of \mathcal{A}.

The FAL algorithm contains two procedures: ClientUpdate, running on the client side, and ServerExecution, running on the server side. These two procedures are processed iteratively through communication rounds. Adversarial training is addressed in the ClientUpdate procedure, which therefore has two loops: the outer loop iterates local model updates, and the inner loop generates adversarial samples via the adversary \mathcal{A}. In the ServerExecution procedure, the neural network’s parameters are updated to reduce the prediction loss on the new adversarial samples.

Algorithm 1 Federated Adversarial Learning (FAL)

Notations: Training sets of clients, where each client is indexed by c: \mathcal{S}_{c}=\{(x_{c,j},y_{c,j})\}_{j=1}^{J}; adversary \mathcal{A}; local learning rate \eta_{\operatorname{local}}; global learning rate \eta_{\operatorname{global}}; local updating iterations K; global communication rounds T.

1: Initialization a(0)\in\mathbb{R}^{m}, U(0)\in\mathbb{R}^{d\times m}, b(0)\in\mathbb{R}^{m}
2: For t=0\to T, we iteratively run Procedure A then Procedure B
3: procedure A. ClientUpdate(t,c)
4:     \mathcal{S}_{c}(t)\leftarrow\emptyset
5:     W_{c}(t,0)\leftarrow U(t) \triangleright Receive global model weights update.
6:     for k=0\to K-1 do
7:         for j=1\to J do
8:             \widetilde{x}_{c,j}^{(t)}\leftarrow\mathcal{A}(f_{W_{c}(t,k)},x_{c,j},y_{c,j}) \triangleright Adversarial samples. f_{W_{c}} is defined in (2).
9:             \mathcal{S}_{c}(t)\leftarrow\mathcal{S}_{c}(t)\cup\{(\widetilde{x}_{c,j}^{(t)},y_{c,j})\}
10:         end for
11:         W_{c}(t,k+1)\leftarrow W_{c}(t,k)-\eta_{\operatorname{local}}\cdot\nabla_{W_{c}}\mathcal{L}(f_{W_{c}(t,k)},\mathcal{S}_{c}(t))
12:     end for
13:     \Delta U_{c}(t)\leftarrow W_{c}(t,K)-U(t)
14:     Send \Delta U_{c}(t) to ServerExecution
15: end procedure
16: procedure B. ServerExecution(t):
17:     for each client c in parallel do
18:         \Delta U_{c}(t)\leftarrow ClientUpdate(c,t) \triangleright Receive local model weights update.
19:         \Delta U(t)\leftarrow\frac{1}{N}\sum_{c\in[N]}\Delta U_{c}(t)
20:         U(t+1)\leftarrow U(t)+\eta_{\operatorname{global}}\cdot\Delta U(t) \triangleright Aggregation on the server side.
21:         Send U(t+1) to client c for ClientUpdate(c,t)
22:     end for
23: end procedure
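To complement the pseudocode, the following numpy sketch mirrors Algorithm 1’s control flow. It is a minimal illustration, not the paper’s implementation: adversary(W, x, y) and grad_loss(W, S), the gradient of \mathcal{L}(f_{W},\mathcal{S}) with respect to W, are assumed callables.

import numpy as np

def client_update(U, S_c, adversary, grad_loss, eta_local, K):
    """Procedure A (ClientUpdate): K local steps of adversarial training."""
    W = U.copy()                                   # W_c(t, 0) <- U(t)
    for _ in range(K):
        # Inner loop: regenerate adversarial samples against the current W.
        S_adv = [(adversary(W, x, y), y) for (x, y) in S_c]
        W = W - eta_local * grad_loss(W, S_adv)    # local gradient step
    return W - U                                   # Delta U_c(t)

def server_execution(U, deltas, eta_global):
    """Procedure B (ServerExecution): average client updates and aggregate."""
    delta_U = sum(deltas) / len(deltas)            # Delta U(t)
    return U + eta_global * delta_U                # U(t+1)

def fal(U0, clients, adversary, grad_loss, eta_local, eta_global, K, T):
    """Run Algorithm 1 for T communication rounds; return all global weights."""
    U, weights = U0, []
    for _ in range(T):
        deltas = [client_update(U, S_c, adversary, grad_loss, eta_local, K)
                  for S_c in clients]
        U = server_execution(U, deltas, eta_global)
        weights.append(U)
    return weights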

4 Our Result

The main result of this work shows the convergence of the FAL algorithm (Algorithm 1) for over-parameterized neural networks. Specifically, our global training loss (Definition 3.6) converges to a small \epsilon with suitably chosen communication rounds T and local and global learning rates \eta_{\operatorname{local}}, \eta_{\operatorname{global}}.

We now formally present our main result.

Theorem 4.1 (Federated Adversarial Learning).

Let c_{0}\in(0,1) be a fixed constant. Let N denote the total number of clients and J the number of data points per client. Suppose that our training set \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c} is globally \gamma-separable for some \gamma>0. Then, for all \epsilon\in(0,1), there exists R=\operatorname{poly}((NJ/\epsilon)^{1/\gamma}) that satisfies: for every K\geq 1 and T\geq\operatorname{poly}(R/\epsilon), for all m\geq\operatorname{poly}(d,(NJ/\epsilon)^{1/\gamma}), with probability \geq 1-\exp(-\Omega(m^{1/3})), running federated adversarial learning (Algorithm 1) with step size choices

\eta_{\operatorname{global}}=1/\operatorname{poly}(NJ,R,1/\epsilon)\quad\text{and}\quad\eta_{\operatorname{local}}=1/K

will output a list of weights \{U(1),U(2),\cdots,U(T)\}\subset\mathbb{R}^{d\times m} that satisfies:

\min_{t\in[T]}\mathcal{L}_{\mathcal{A}}(f_{U(t)})\leq\epsilon.

The randomness comes from a(\tau)\in\mathbb{R}^{m}, U(\tau)\in\mathbb{R}^{d\times m}, b(\tau)\in\mathbb{R}^{m} for \tau=0.

Discussion

As we can see in Theorem 4.1, one key quantity that affects the parameters m, R, T is the data separability \gamma. As the data separability bound becomes larger, the parameter R becomes smaller, allowing a larger global learning rate \eta_{\operatorname{global}} to be chosen while still achieving convergence. We also conduct numerical experiments in Section 6 to verify Theorem 4.1 empirically.

5 Proof Sketch

To handle the min-max objective in FAL, we formulate the optimization of FAL in the framework of online gradient descent (we refer readers to [Haz16] for more details on online gradient descent): at each local step k on the client side, the adversary first generates adversarial samples and computes the loss function \mathcal{L}(f_{W_{c}(t,k)},\mathcal{S}_{c}(t)); then the local client learner takes the fresh loss function and updates W_{c}(t,k+1)=W_{c}(t,k)-\eta_{\operatorname{local}}\cdot\nabla_{W_{c}}\mathcal{L}(f_{W_{c}(t,k)},\mathcal{S}_{c}(t)).

Compared with the centralized setting, the key difficulties in the convergence analysis of FL are induced by the multiple local update steps on the client side and the step updates on both the local and global sides. Specifically, when K\geq 2 the local updates do not follow the standard gradient as in centralized adversarial training. We use -\Delta U(t) as a substitute for the real gradient of U to update the value of U(t), which brings challenges in bounding the gradient of the neural network. Moreover, gradient bounding is already challenging in adversarial training alone.

To this end, we use the gradient coupling method twice to solve this core problem: first we bound the difference between the real gradient and the FL gradient (defined below), then we bound the difference between the pseudo gradient and the real gradient.

5.1 Existence of small robust loss

In this section, we denote the initialization of the global weights U by \widetilde{U}=U(0) and denote by U(t) the global weights at communication round t. U^{*} is the value of U after small perturbations from \widetilde{U} which satisfies \|U^{*}-\widetilde{U}\|_{2,\infty}\leq R/m^{c_{1}}, where c_{1}\in(0,1) is a constant (e.g. c_{1}=2/3), m is the width of the neural network and R is a parameter. We specify the concrete values of these parameters in the appendix.

We study a pseudo-network that well approximates the over-parameterized neural network under gradient descent when its weights are close to initialization. The pseudo-network can be seen as a linear approximation of our two-layer ReLU neural network near initialization, and introducing it makes the proof more intuitive.

Definition 5.1 (Pseudo-network).

Given weights U\in\mathbb{R}^{d\times m}, a\in\mathbb{R}^{m} and b\in\mathbb{R}^{m}, for a neural network f_{U}(x)=\sum_{r=1}^{m}a_{r}\cdot\sigma(\langle U_{r},x\rangle+b_{r}), we define the corresponding pseudo-network g_{U}:\mathbb{R}^{d}\rightarrow\mathbb{R} as g_{U}(x):=\sum_{r=1}^{m}a_{r}\cdot\langle U_{r}(t)-U_{r}(0),x\rangle\cdot\mathds{1}\{\langle U_{r}(0),x\rangle+b_{r}\geq 0\}.
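To make the approximation concrete, here is a minimal numpy sketch (our own illustration) that evaluates the real network f_U and the pseudo-network g_U side by side; near initialization the two outputs are close because the ReLU activation patterns rarely flip.

import numpy as np

def f_real(U, a, b, x):
    """f_U(x): the two-layer ReLU network."""
    return float(a @ np.maximum(U.T @ x + b, 0.0))

def g_pseudo(U, U0, a, b, x):
    """g_U(x): linear in (U - U0), with ReLU patterns frozen at the init U0."""
    active = (U0.T @ x + b >= 0.0)            # indicator 1{<U_r(0), x> + b_r >= 0}
    return float(a @ (((U - U0).T @ x) * active))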

Existence of small robust loss

To obtain our main theorem, we first show that we can find a U^{*} which is close to U(0) and also makes \mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}) sufficiently small. Later, in Theorem 5.6, we show that the average of \mathcal{L}_{\mathcal{A}}(f_{U(t)}) is dominated by \mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}); thus we can prove the minimum of \mathcal{L}_{\mathcal{A}}(f_{U(t)}) is at most \epsilon.

Theorem 5.2 (Existence, informal version of Theorem F.3).

For all \epsilon\in(0,1), there are M_{0}=\operatorname{poly}(d,(NJ/\epsilon)^{1/\gamma}) and R=\operatorname{poly}((NJ/\epsilon)^{1/\gamma}) satisfying: for all m\geq M_{0}, with high probability there exists U^{*}\in\mathbb{R}^{d\times m} that satisfies \|U^{*}-U(0)\|_{2,\infty}\leq R/m^{c_{1}} and \mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}})\leq\epsilon.

5.2 Convergence result for federated learning

Definition 5.3 (Gradient).

For a local real network f_{W_{c}(t,k)}, we denote its gradient by

\nabla(f_{c},t,k):=\nabla_{W_{c}}\mathcal{L}(f_{W_{c}(t,k)},\mathcal{S}_{c}(t)).

If the corresponding pseudo-network is g_{W_{c}(t,k)}, then we denote the pseudo-network gradient by

\nabla(g_{c},t,k):=\nabla_{W_{c}}\mathcal{L}(g_{W_{c}(t,k)},\mathcal{S}_{c}(t)).

Now we consider the global network. We define the pseudo gradient as \nabla(g,t):=\nabla_{U}\mathcal{L}(g_{U(t)},\mathcal{S}(t)) and define the FL gradient as \widetilde{\nabla}(f,t):=-\frac{1}{N}\Delta U(t), which is used in the proof of Theorem 5.6. We present our gradient coupling methods in the following two lemmas.

Lemma 5.4 (Bound the difference between real gradient and FL gradient, informal version of Lemma E.4).

With probability \geq 1-\exp(-\Omega(m^{c_{0}})), for iterations t satisfying \|U(t)-U(0)\|_{2,\infty}\leq 1/o(m), the gradients satisfy

\|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\leq o(m).

The randomness is from a(\tau)\in\mathbb{R}^{m}, U(\tau)\in\mathbb{R}^{d\times m}, b(\tau)\in\mathbb{R}^{m} for \tau=0.

Lemma 5.5 (Bound the difference between pseudo gradient and real gradient, informal version of Lemma E.5).

With probability \geq 1-\exp(-\Omega(m^{c_{0}})), for iterations t satisfying \|U(t)-U(0)\|_{2,\infty}\leq 1/o(m), the gradients satisfy

\|\nabla(g,t)-\nabla(f,t)\|_{2,1}\lesssim NJ\cdot o(m).

The randomness is from a(\tau)\in\mathbb{R}^{m}, U(\tau)\in\mathbb{R}^{d\times m}, b(\tau)\in\mathbb{R}^{m} for \tau=0.

The above two lemmas are essential in proving Theorem 5.6, which is our convergence result.

Theorem 5.6 (Convergence result, informal version of Theorem E.3).

Let R\geq 1 and \epsilon\in(0,1). Let K\geq 1 and T\geq\operatorname{poly}(R/\epsilon). There is M=\operatorname{poly}(n,R,1/\epsilon) such that for every m\geq M, with probability \geq 1-\exp(-\Omega(m^{c_{0}})), for every U^{*} satisfying \|U^{*}-U(0)\|_{2,\infty}\leq R/m^{c_{1}}, running Algorithm 1 with \eta_{\operatorname{global}}=1/\operatorname{poly}(NJ,R,1/\epsilon) and \eta_{\operatorname{local}}=1/K will output weights (U(t))_{t=1}^{T} that satisfy

\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{\mathcal{A}}(f_{U(t)})\leq\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}})+\epsilon.

The randomness comes from a(\tau)\in\mathbb{R}^{m}, U(\tau)\in\mathbb{R}^{d\times m}, b(\tau)\in\mathbb{R}^{m} for \tau=0.

In the proof of Theorem 5.6 we first bound the local gradient \nabla_{r}(f_{c},t,k). We consider the pseudo-network and bound

\mathcal{L}(g_{U(t)},\mathcal{S}(t))-\mathcal{L}(g_{U^{*}},\mathcal{S}(t))\leq\alpha(t)+\beta(t)+\gamma(t),

where

\alpha(t):=\langle\widetilde{\nabla}(f,t),U(t)-U^{*}\rangle,
\beta(t):=\|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty},
\gamma(t):=\|\nabla(g,t)-\nabla(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty}.

In bounding \alpha(t), we unfold \|U(t+1)-U^{*}\|_{F}^{2} and have

\alpha(t)=\frac{\eta_{\operatorname{global}}}{2}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2\eta_{\operatorname{global}}}\cdot(\|U(t)-U^{*}\|_{F}^{2}-\|U(t+1)-U^{*}\|_{F}^{2}).
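To see where this identity comes from, expand the server update U(t+1)=U(t)+\eta_{\operatorname{global}}\Delta U(t) inside the squared Frobenius norm:

\|U(t+1)-U^{*}\|_{F}^{2}=\|U(t)-U^{*}\|_{F}^{2}+2\eta_{\operatorname{global}}\langle\Delta U(t),U(t)-U^{*}\rangle+\eta_{\operatorname{global}}^{2}\|\Delta U(t)\|_{F}^{2};

rearranging the inner-product term \langle-\Delta U(t),U(t)-U^{*}\rangle then recovers the right-hand side above (up to the normalization used in \widetilde{\nabla}(f,t)).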

We bound \|\Delta U(t)\|_{F}^{2}\leq\eta_{\operatorname{local}}K\cdot o(m). Summing over t, we have

\sum_{t=1}^{T}\alpha(t)=\frac{\eta_{\operatorname{global}}}{2}\sum_{t=1}^{T}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2\eta_{\operatorname{global}}}\cdot\sum_{t=1}^{T}(\|U(t)-U^{*}\|_{F}^{2}-\|U(t+1)-U^{*}\|_{F}^{2})
\leq\frac{\eta_{\operatorname{global}}}{2}\sum_{t=1}^{T}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2\eta_{\operatorname{global}}}\cdot\|U(1)-U^{*}\|_{F}^{2}
\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}TK\cdot o(m)+\frac{1}{\eta_{\operatorname{global}}}mD_{U^{*}}^{2}.

In bounding \beta(t), we apply Lemma 5.4 and have

\beta(t)=\|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty}
\lesssim o(m)\cdot\|U(t)-U^{*}\|_{2,\infty}
\lesssim o(m)\cdot(\|U(t)-\widetilde{U}\|_{2,\infty}+D_{U^{*}}),

where D_{U^{*}}:=\|U^{*}-\widetilde{U}\|_{2,\infty}\leq R/m^{c_{1}}. As for the first term, we bound

\|U(t)-\widetilde{U}\|_{2,\infty}\leq\eta_{\operatorname{global}}\sum_{\tau=1}^{t}\|\Delta U(\tau)\|_{2,\infty}
=\eta_{\operatorname{global}}\sum_{\tau=1}^{t}\Big\|\frac{\eta_{\operatorname{local}}}{N}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\nabla(f_{c},t,k)\Big\|_{2,\infty}
\leq\frac{\eta_{\operatorname{global}}\eta_{\operatorname{local}}}{N}\sum_{\tau=1}^{t}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\|\nabla(f_{c},t,k)\|_{2,\infty}
\leq\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKm^{-1/3}

and have \beta(t)\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}tK\cdot o(m)+o(m)\cdot D_{U^{*}}; summation then gives

\sum_{t=1}^{T}\beta(t)\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}T^{2}K\cdot o(m)+o(m)\cdot TD_{U^{*}}.

In bounding \gamma(t), we apply Lemma 5.5 and have

\gamma(t)=\|\nabla(g,t)-\nabla(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty}
\lesssim NJ\cdot o(m)\cdot(\|U(t)-\widetilde{U}\|_{2,\infty}+D_{U^{*}}).

Then summation over t gives

\sum_{t=1}^{T}\gamma(t)\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}T^{2}KNJ\cdot o(m)+TNJ\cdot o(m)D_{U^{*}}.

Putting it together with our choice of all parameters (i.e. \eta_{\operatorname{local}},\eta_{\operatorname{global}},R,K,T,m), we obtain

\frac{1}{T}\sum_{\tau=1}^{T}\mathcal{L}(g_{U(\tau)},\mathcal{S}(\tau))-\frac{1}{T}\sum_{\tau=1}^{T}\mathcal{L}(g_{U^{*}},\mathcal{S}(\tau))\leq\frac{1}{T}\Big(\sum_{\tau=1}^{T}\alpha(\tau)+\sum_{\tau=1}^{T}\beta(\tau)+\sum_{\tau=1}^{T}\gamma(\tau)\Big)\leq O(\epsilon).

From Theorem D.2 in the appendix, we have \sup_{x\in\mathcal{X}}|f_{U}(x)-g_{U}(x)|\leq O(\epsilon) and thus

\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}(f_{U(t)},\mathcal{S}(t))-\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}(f_{U^{*}},\mathcal{S}(t))\leq O(\epsilon).  (3)

From the definition of \mathcal{A}^{*} we have \mathcal{L}(f_{U^{*}},\mathcal{S}(t))\leq\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}). From the definition of the loss we have \mathcal{L}(f_{U(t)},\mathcal{S}(t))=\mathcal{L}_{\mathcal{A}}(f_{U(t)}). Moreover, since Eq. (3) holds for all \epsilon>0, we can replace O(\epsilon) with \epsilon. Thus we prove that for all \epsilon>0,

\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{\mathcal{A}}(f_{U(t)})\leq\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}})+\epsilon.

Combining the results

From Theorem 5.2 we obtain a U^{*} that is close to U(0) and makes \mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}) close to zero; from Theorem 5.6 we have that the average of \mathcal{L}_{\mathcal{A}}(f_{U(t)}) is dominated by \mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}). Combining these two results, we prove that the minimum of \mathcal{L}_{\mathcal{A}}(f_{U(t)}) is at most \epsilon, which finishes the proof of our main Theorem 4.1.

6 Numerical Results

In this section, we examine our theoretical results (Theorem 4.1) with respect to data separability \gamma (Definition 3.4), a standard assumption in over-parameterized neural network convergence analysis. We simulate synthetic data with different levels of data separability as shown in Fig. 1. Specifically, each data point has two dimensions. Each class is generated from two Gaussian distributions (std=1) with different means, i.e. each class consists of two Gaussian clusters, where the intra-class cluster centroids are closer than the inter-class distances. We perform binary classification tasks on the simulated datasets using a multi-layer perceptron (MLP) with one hidden layer of 128 neurons. To increase the learning difficulty, 5% of labels are randomly flipped. For each class, we simulated 400 data points as the training set and 100 data points as the testing set. The training data is evenly divided into four parts to simulate four clients. To simulate different levels of separability, we expand/shrink the data features by (2.5, 1.5, 0.85) to construct (large, medium, small) data separability. Note that the whole dataset is not normalized before being fed into the classifier.
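A minimal numpy sketch of this data simulation is given below; the cluster centers are illustrative placeholders (the paper does not report the exact means), while the std, label-flip rate, sample counts, and scaling factors follow the setup above.

import numpy as np

rng = np.random.default_rng(0)

def make_separability_dataset(scale, n_per_class=400, std=1.0, flip=0.05):
    """Two classes, each a mixture of two Gaussian clusters (std=1);
    multiplying the features by `scale` expands/shrinks separability."""
    centers = {0: [(-2.0, 0.0), (-2.0, 3.0)],  # class 0: two intra-class clusters
               1: [(2.0, 0.0), (2.0, 3.0)]}    # class 1 (centers are illustrative)
    X, y = [], []
    for label, cluster_centers in centers.items():
        for c in cluster_centers:
            X.append(rng.normal(c, std, size=(n_per_class // 2, 2)))
            y += [label] * (n_per_class // 2)
    X, y = np.vstack(X) * scale, np.array(y)
    flips = rng.random(y.size) < flip          # randomly flip 5% of labels
    y[flips] = 1 - y[flips]
    return X, y

# large / medium / small separability, matching the scales above
datasets = {s: make_separability_dataset(s) for s in (2.5, 1.5, 0.85)}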

Figure 1: Simulated data with different levels of data separability in the numerical experiment.

We deploy PGD [MMS+18] to generate adversarial examples during FAL training with perturbation radius \rho=0.0314, 7 perturbation steps, and step length 0.00784. Model aggregation follows FedAvg [MMR+17] after each local update. The batch size is 50, and the SGD optimizer is used. We depict the training and testing accuracy curves in Fig. 2(a), where solid lines stand for training and dashed lines for testing. The total number of communication rounds is 100, and we observe training convergence for the high (blue) and medium (green) separability datasets with learning rate 1e-5. However, the low separability dataset requires a smaller learning rate (i.e., 5e-6) to avoid divergence. From Theorem 4.1, a larger data separability bound \gamma results in a smaller R, and we can choose a larger learning rate to achieve convergence. Hence, the selection of the learning rate for small separability is consistent with the constraint on the learning rate \eta_{\operatorname{global}} implied in Theorem 4.1. We empirically observe that a dataset with larger data separability \gamma converges faster, with the flexibility of choosing a large learning rate, affirming our theoretical result that the convergence round T\geq\operatorname{poly}(R/\epsilon) has a larger lower bound for smaller \gamma, where R=\operatorname{poly}((NJ/\epsilon)^{1/\gamma}). In addition, we compare with the accuracy curves obtained by using FedAvg [MMR+17]. As shown in Fig. 2(b), all the datasets converge at around round 40. Therefore, we observe that the same data separability scales have a larger effect on FAL training than on standard FedAvg training.

Figure 2: Training and testing curves on datasets with different levels of data separability for (a) FAL and (b) FedAvg. Solid lines present training curves and dashed lines present testing curves.

7 Conclusion

We have studied the convergence of a general form of adversarial training adopted in the FL setting to improve FL training robustness. We propose a general framework, FAL, which deploys an adversarial-sample-generation-based adversarial training method on the client side and then aggregates local models using FedAvg [MMR+17]. In FAL, each client is trained via min-max optimization, with an inner loop for adversarial generation and an outer loop for loss minimization. As far as we know, we are the first to detail the proof of a theoretical convergence guarantee for over-parameterized ReLU networks under the presented FAL strategy, using gradient descent. Unlike the convergence of adversarial training in classical settings, we consider the updates on both the local client and global server sides. Our result indicates that we can control the learning rates \eta_{\operatorname{local}} and \eta_{\operatorname{global}} according to the local update steps K and global communication rounds T to make the minimum loss close to zero. The technical challenges lie in the multiple local update steps and heterogeneous data, which lead to difficulties in convergence. Under ReLU Lipschitz and over-parameterization assumptions, we use gradient coupling methods twice; together, these show that the model update of each global round is bounded in our federated adversarial learning. Note that we do not require IID assumptions on the data distribution. In sum, the proposed FAL formulation and analysis framework can handle the multiple local updates and non-IID data in FL well. Moreover, our framework can be generalized to other FL aggregation methods, such as sketching and selective aggregation.

Roadmap

The appendix is organized as follows. We introduce the probability tools to be used in our proof in Section A. In addition, we introduce the preliminaries in Section B. We present the proof overview in Section C and additional remarks used in the proof sketch in Section D. We show the detailed proof for the convergence in Section E and the detailed proof of existence in Section F correspondingly.

Appendix A Probability Tools

We introduce the probability tools that will be used in our proof. First we present two lemmas about tail bounds of random variables, Lemma A.1 and Lemma A.2:

Lemma A.1 (Chernoff bound [Che52]).

Let x=\sum_{i=1}^{n}x_{i}, where x_{i}=1 with probability p_{i} and x_{i}=0 with probability 1-p_{i}, and all x_{i} are independent. Let \mu=\operatorname{\mathbb{E}}[x]=\sum_{i=1}^{n}p_{i}. Then
1. \Pr[x\geq(1+\delta)\mu]\leq\exp(-\delta^{2}\mu/3), \forall\delta>0;
2. \Pr[x\leq(1-\delta)\mu]\leq\exp(-\delta^{2}\mu/2), \forall 0<\delta<1.

Lemma A.2 (Bernstein inequality [Ber24]).

Let Y_{1},\cdots,Y_{n} be independent zero-mean random variables. Suppose that |Y_{i}|\leq M almost surely for all i\in[n]. Then for all t>0, we have

\Pr\Big[\sum_{i=1}^{n}Y_{i}>t\Big]\leq\exp\Big(-\frac{t^{2}/2}{\sum_{i=1}^{n}\operatorname{\mathbb{E}}[Y_{i}^{2}]+Mt/3}\Big).

Next, we introduce Lemma A.3 about the CDF of the Gaussian distribution:

Lemma A.3.

Let Z\sim\mathcal{N}(0,\sigma^{2}) denote a Gaussian random variable; then we have

\Pr[|Z|\leq t]\in\Big(\frac{2}{3}\frac{t}{\sigma},\frac{4}{5}\frac{t}{\sigma}\Big).
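As a quick sanity check of Lemma A.3 (for small t/\sigma, where both bounds are meaningful), here is a Monte Carlo estimate in numpy with hypothetical values t=0.1, \sigma=1:

import numpy as np

rng = np.random.default_rng(1)
sigma, t = 1.0, 0.1
z = rng.normal(0.0, sigma, size=1_000_000)
p_hat = np.mean(np.abs(z) <= t)            # estimate of Pr[|Z| <= t]
# lower bound < p_hat < upper bound should hold:
print(2 / 3 * t / sigma, p_hat, 4 / 5 * t / sigma)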

Finally, we introduce Claim A.4 about an elementary anti-concentration property of the Gaussian distribution.

Claim A.4.

Let z\sim\mathcal{N}(0,I_{d}) and u\sim\mathcal{N}(0,1) be independent Gaussian random variables. Then for all t\geq 0 and x\in\mathbb{R}^{d} satisfying \|x\|_{2}=1, we have

\Pr[|\langle x,z\rangle+u|\leq t]=O(t).

Appendix B Preliminaries

B.1 Notations

For a vector x, we use \|x\|_{p} to denote its \ell_{p} norm; in this paper we mainly consider p=1,2, or \infty.

For a matrix U\in\mathbb{R}^{d\times m}, we use U^{\top} to denote its transpose and \operatorname{tr}[U] to denote its trace. We use \|U\|_{1} to denote its entry-wise \ell_{1} norm, \|U\|_{2} to denote its spectral norm, and \|U\|_{F} to denote its Frobenius norm. For j\in[m], we let U_{j}\in\mathbb{R}^{d} be the j-th column of U. We let \|U\|_{2,1} denote \sum_{j=1}^{m}\|U_{j}\|_{2} and \|U\|_{2,\infty} denote \max_{j\in[m]}\|U_{j}\|_{2}. For two matrices X and Y, we denote their Euclidean inner product as \langle X,Y\rangle:=\operatorname{tr}[X^{\top}Y].

We denote the Gaussian distribution with mean \mu and covariance \Sigma as \mathcal{N}(\mu,\Sigma). We use \sigma(\cdot) to denote the ReLU function, and use \mathds{1}\{A\} to denote the indicator function of A.

B.2 Two-layer neural network and initialization

In this paper, we focus on a two-layer neural network that has m neurons in the hidden layer, where each neuron is a ReLU activation function. We define the global network as

f_{U}(x):=\sum_{r=1}^{m}a_{r}\cdot\sigma(\langle U_{r},x\rangle+b_{r})  (4)

and for c\in[N], we define the local network of client c as

f_{W_{c}}(x):=\sum_{r=1}^{m}a_{r}\cdot\sigma(\langle W_{c,r},x\rangle+b_{r}).  (5)

Here U=(U_{1},U_{2},\dots,U_{m})\in\mathbb{R}^{d\times m} is the global hidden weight matrix, W_{c}=(W_{c,1},\dots,W_{c,m})\in\mathbb{R}^{d\times m} is the local hidden weight matrix of client c, a=(a_{1},a_{2},\dots,a_{m})\in\mathbb{R}^{m} denotes the output weights, and b=(b_{1},b_{2},\dots,b_{m})\in\mathbb{R}^{m} denotes the bias. During the process of federated adversarial learning, for convenience we keep a and b equal to their initialized values and only update U and W, so we can write the global network as f_{U}(x) and the local network as f_{W_{c}}(x). When the weight matrix is clear from context, we write f(x) or f_{c}(x) for short. Next, we make some standard assumptions regarding our training set.

Definition B.1 (Dataset).

There are N clients and n=NJ data points in total. (For simplicity, we assume that all clients have the same number of training data; generalizing our result to the setting where each client has a different number of data is left as future work.) Let \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c}, where \mathcal{S}_{c}=\{(x_{c,1},y_{c,1}),\dots,(x_{c,J},y_{c,J})\}\subseteq\mathbb{R}^{d}\times\mathbb{R} denotes the J training data of client c. Without loss of generality, we assume that \|x_{c,j}\|_{2}=1 holds for all c\in[N],j\in[J], and the last coordinate of each point equals 1/2, so we consider \mathcal{X}:=\{x\in\mathbb{R}^{d}:\|x\|_{2}=1,\ x_{d}=1/2\}. For simplicity, we assume that |y_{c,j}|\leq 1 holds for all c\in[N] and j\in[J]. (Our assumptions on data points are reasonable since we can scale up. In addition, \ell_{2} norm normalization is a typical technique in experiments. The same assumptions also appear in many previous theoretical works like [ADH+19b, AZLL19, AZLS19b].)

We now define the initialization for the neural networks.

Definition B.2 (Initialization).

The initialization of a\in\mathbb{R}^{m}, U\in\mathbb{R}^{d\times m}, b\in\mathbb{R}^{m} is a(0)\in\mathbb{R}^{m}, U(0)\in\mathbb{R}^{d\times m}, b(0)\in\mathbb{R}^{m}. The initialization of client c’s local weight matrix W_{c} is W_{c}(0,0)=U(0). Here the second index in W_{c} denotes the local-step iteration.

  • For each r\in[m], a_{r}(0) is i.i.d. sampled uniformly from [-1/m^{1/3},+1/m^{1/3}].

  • For each i\in[d],r\in[m], U_{i,r}(0) and b_{r}(0) are i.i.d. random Gaussians sampled from \mathcal{N}(0,1/m). Here U_{i,r} denotes the (i,r)-entry of U.

For each global iteration t\in[T],

  • For each c\in[N], the initial value of client c’s local weight matrix W_{c} is W_{c}(t,0)=U(t).

B.3 Adversary and Well-separated training sets

We first formulate the adversary as a mapping.

Definition B.3 (ρ\rho-Bounded adversary).

Let \mathcal{F} denote the function class. An adversary is a mapping \mathcal{A}:\mathcal{F}\times\mathcal{X}\times\mathbb{R}\rightarrow\mathcal{X} which denotes the adversarial perturbation. For \rho>0, we define the \ell_{2} ball as \mathcal{B}_{2}(x,\rho):=\{\widetilde{x}\in\mathbb{R}^{d}:\|\widetilde{x}-x\|_{2}\leq\rho\}\cap\mathcal{X}. We say an adversary \mathcal{A} is \rho-bounded if it satisfies

\mathcal{A}(f,x,y)\in\mathcal{B}_{2}(x,\rho).

Moreover, given \rho>0, we denote the worst-case adversary as \mathcal{A}^{*}:=\operatorname{argmax}_{\widetilde{x}\in\mathcal{B}_{2}(x,\rho)}\ell(f(\widetilde{x}),y), where \ell is defined in Definition B.5.

In the over-parameterized regime, it is a standard assumption that the training set is well-separated. Since we deal with adversarial perturbations, we require the following \gamma-separability, which is slightly stronger.

Definition B.4 (\gamma-separability).

Let \gamma\in(0,1/2),\delta\in(0,1/2),\rho\in(0,1/2) denote three parameters such that \gamma\leq\delta\cdot(\delta-2\rho). We say our training set \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c}=\cup_{c\in[N],j\in[J]}\{(x_{c,j},y_{c,j})\}\subset\mathbb{R}^{d}\times\mathbb{R} is globally \gamma-separable w.r.t. a \rho-bounded adversary, if

\min_{c_{1}\neq c_{2},j_{1}\neq j_{2}}\|x_{c_{1},j_{1}}-x_{c_{2},j_{2}}\|_{2}\geq\delta.

It is noteworthy that our problem setup needs no assumption of independent and identically distributed (IID) data, so the formulation applies to the uniquely challenging non-IID setting in FL.

B.4 Robust loss function

We define the following Lipschitz convex loss function that will be used.

Definition B.5 (Lipschitz convex loss).

A loss function \ell:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R} is said to be a Lipschitz convex loss if it satisfies the following four properties:

  • non-negative;

  • convex in the first input of \ell;

  • 1-Lipschitz, which means |\ell(x_{1},y_{1})-\ell(x_{2},y_{2})|\leq\|(x_{1},y_{1})-(x_{2},y_{2})\|_{2};

  • \ell(y,y)=0 for all y\in\mathbb{R}.

In this paper we assume \ell is a Lipschitz convex loss. Next, we define the robust loss function of a neural network, which is based on the adversarial examples generated by a \rho-bounded adversary \mathcal{A}.

Definition B.6 (Training loss).

Given a client’s training set \mathcal{S}_{c}=\{(x_{c,j},y_{c,j})\}_{j=1}^{J}\subset\mathbb{R}^{d}\times\mathbb{R} of J samples, let f_{c}:\mathbb{R}^{d}\rightarrow\mathbb{R} be a network. We define the loss as \mathcal{L}(f_{c},\mathcal{S}_{c}):=\frac{1}{J}\sum_{j=1}^{J}\ell(f_{c}(x_{c,j}),y_{c,j}). Given \mathcal{S}=\cup_{c\in[N]}\mathcal{S}_{c}, the global loss is defined as

\mathcal{L}(f_{U},\mathcal{S}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(x_{c,j}),y_{c,j}).

Given an adversary \mathcal{A} that is \rho-bounded, we define the global loss with respect to \mathcal{A} as

\mathcal{L}_{\mathcal{A}}(f_{U}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(\mathcal{A}(f_{c},x_{c,j},y_{c,j})),y_{c,j})=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(\widetilde{x}_{c,j}),y_{c,j})

and also define the global robust loss (in terms of worst-case) as

\mathcal{L}_{\mathcal{A}^{*}}(f_{U}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(f_{U}(\mathcal{A}^{*}(f_{c},x_{c,j},y_{c,j})),y_{c,j})=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\max_{x_{c,j}^{*}\in\mathcal{B}_{2}(x_{c,j},\rho)}\ell(f_{U}(x_{c,j}^{*}),y_{c,j}).

Moreover, since we deal with the pseudo-network defined in Definition D.1, we also define the loss of a pseudo-network as \mathcal{L}(g_{c},\mathcal{S}_{c}):=\frac{1}{J}\sum_{j=1}^{J}\ell(g_{c}(x_{c,j}),y_{c,j}) and \mathcal{L}(g_{U},\mathcal{S}):=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\ell(g_{U}(x_{c,j}),y_{c,j}).

B.5 Federated Adversarial Learning algorithm

The classical adversarial training algorithm can be found in [ZPD+20]. Different from the classical setting, our federated adversarial learning of a local neural network f_{W_c} against an adversary 𝒜 is shown in Algorithm 2, which consists of two procedures: ClientUpdate, running on the client side, and ServerExecution, running on the server side. The two procedures are executed alternately over communication rounds. Adversarial training happens inside ClientUpdate, which therefore contains two loops: the inner loop generates adversarial samples with the adversary 𝒜, and the outer loop updates the local model parameters to reduce the prediction loss on the newly generated adversarial samples. ServerExecution then aggregates the local updates and broadcasts the new global model. These loops constitute an intertwined dynamic; a minimal code sketch follows the algorithm listing.

Algorithm 2 Federated Adversarial Learning (FAL). Complete and formal version of Algorithm 1.
1:/*Defining notations and parameters*/
2:     We use cc to denote the client’s index
3:     The training set of client cc is denoted as 𝒮c={(xc,j,yc,j)}j=1J\mathcal{S}_{c}=\{(x_{c,j},y_{c,j})\}_{j=1}^{J}
4:     Let 𝒜\mathcal{A} be the adversary
5:     We denote local learning rate as ηlocal\eta_{\operatorname{local}}
6:     We denote global learning rate as ηglobal\eta_{\mathrm{global}}
7:     We denote local updating iterations as KK
8:     We denote global communication round as TT
9:
10:/*Initialization*/
11:     Initialize a(0) ∈ ℝ^m, U(0) ∈ ℝ^{d×m}, b(0) ∈ ℝ^m
12:     For t=0Tt=0\to T, we iteratively run Procedure A then Procedure B
13:
14:/* Procedure running on client side */
15:procedure A. ClientUpdate(t,ct,c)
16:     𝒮c(t)\mathcal{S}_{c}(t)\leftarrow\emptyset
17:     Wc(t,0)U(t)W_{c}(t,0)\leftarrow U(t) \triangleright Receive global model weights update
18:     for k=0K1k=0\to K-1  do
19:         for j=1Jj=1\to J do
20:              x~c,j(t)𝒜(fWc(t,k),xc,j,yc,j)\widetilde{x}_{c,j}^{(t)}\leftarrow\mathcal{A}(f_{W_{c}(t,k)},x_{c,j},y_{c,j}) \triangleright Adversarial examples, fWcf_{W_{c}} is defined as (5)
21:              𝒮c(t)𝒮c(t)(x~c,j(t),yc,j)\mathcal{S}_{c}(t)\leftarrow\mathcal{S}_{c}(t)\cup(\widetilde{x}_{c,j}^{(t)},y_{c,j})
22:         end for
23:         Wc(t,k+1)Wc(t,k)ηlocalWc(fWc(t,k),𝒮c(t))W_{c}{(t,k+1)}\leftarrow W_{c}{(t,k)}-\eta_{\operatorname{local}}\cdot\leavevmode\nobreak\ \nabla_{W_{c}}\mathcal{L}(f_{W_{c}(t,k)},\mathcal{S}_{c}(t))
24:     end for
25:     ΔUc(t)Wc(t,K)U(t)\Delta U_{c}(t)\leftarrow W_{c}(t,K)-U(t)
26:     Send ΔUc(t)\Delta U_{c}(t) to ServerExecution
27:end procedure
28:
29:/*Procedure running on server side*/
30:procedure B. ServerExecution(tt):
31:     for each client c in parallel do
32:         ΔU_c(t) ← ClientUpdate(c, t) \triangleright Receive local model weights update
33:     end for
34:     ΔU(t) ← (1/N) Σ_{c∈[N]} ΔU_c(t)
35:     U(t+1) ← U(t) + η_global · ΔU(t) \triangleright Aggregation on the server side
36:     Send U(t+1) to every client for the next round's ClientUpdate(c, t+1)
37:end procedure
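To make the two procedures concrete, the following is a minimal NumPy sketch of one communication round of Algorithm 2 for the two-layer ReLU network f_U(x) = Σ_r a_r σ(⟨U_r, x⟩ + b_r), where only U is trained. The absolute loss, the ℓ2 PGD instantiation of the ρ-bounded adversary 𝒜, and the use of fresh adversarial examples at each local step (a simplification of the accumulated set 𝒮_c(t)) are assumptions of this sketch, as are all function names.

import numpy as np

def f(U, a, b, x):
    # two-layer ReLU network; a and b stay at initialization, only U is trained
    return a @ np.maximum(U.T @ x + b, 0.0)

def grad_U(U, a, b, batch):
    # gradient w.r.t. U of the batch-averaged absolute loss |f(x) - y|
    G = np.zeros_like(U)
    for x, y in batch:
        gate = (U.T @ x + b >= 0.0).astype(float)              # ReLU gates
        G += np.sign(f(U, a, b, x) - y) * np.outer(x, a * gate)
    return G / len(batch)

def pgd_adversary(U, a, b, x, y, rho, steps=10):
    # rho-bounded l2 PGD: ascend the loss, then project back onto B_2(x, rho)
    x_adv = x.copy()
    for _ in range(steps):
        gate = (U.T @ x_adv + b >= 0.0).astype(float)
        g = np.sign(f(U, a, b, x_adv) - y) * (U @ (a * gate))  # d loss / d x
        x_adv = x_adv + (2.5 * rho / steps) * g / (np.linalg.norm(g) + 1e-12)
        delta = x_adv - x
        x_adv = x + delta * min(1.0, rho / (np.linalg.norm(delta) + 1e-12))
    return x_adv

def fal_round(U, a, b, clients, rho, K, eta_local, eta_global):
    # one communication round t of Algorithm 2
    deltas = []
    for S_c in clients:                                        # ClientUpdate
        W = U.copy()
        for _ in range(K):                                     # K local steps
            S_adv = [(pgd_adversary(W, a, b, x, y, rho), y) for (x, y) in S_c]
            W = W - eta_local * grad_U(W, a, b, S_adv)
        deltas.append(W - U)                                   # Delta U_c(t)
    return U + eta_global * np.mean(deltas, axis=0)            # ServerExecution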

Appendix C Proof Overview

In this section we give an overview of the proof of our main result. The two theorems used are Theorem E.3 and Theorem F.3.

C.1 Pseudo-network

To analyze gradient descent for over-parameterized neural networks whose weights stay close to initialization, we study a pseudo-network that approximates the real network well. Introducing the pseudo-network makes the proof more intuitive.

To be specific, we give the definition of the pseudo-network in Section D and state Theorem D.2, which shows that the pseudo-network approximates the real network uniformly well. The notion of pseudo-network is used several times in our proof.

C.2 Online gradient descent in federated adversarial learning

Our federated adversarial learning algorithm is formulated in the online gradient descent framework: at each local step k on the client side, the adversary first generates adversarial samples and the loss function ℒ(f_{W_c(t,k)}, 𝒮_c(t)) is computed; the local learner then takes this fresh loss function and updates W_c(t,k+1) = W_c(t,k) − η_local · ∇_{W_c} ℒ(f_{W_c(t,k)}, 𝒮_c(t)). We refer readers to [GCL+19, Haz16] for more details on online learning and online gradient descent.

Compared with the centralized setting, the key difficulties in the convergence analysis of FL are induced by the multiple local steps on the client side and by the interplay of local and global updates. Specifically, when K ≥ 2 the local updates no longer follow the standard gradient used in centralized adversarial training: we use −ΔU(t) as a substitute for the real gradient of U when updating U(t). This makes it challenging to bound the gradients of the neural networks, and gradient bounding is already challenging in adversarial training alone. We apply a gradient coupling argument twice to resolve this core problem: first we bound the difference between the real gradient and the FL gradient in Lemma E.4, then we bound the difference between the pseudo gradient and the real gradient in Lemma E.5. We show the connection between online gradient descent and federated adversarial learning in the proof of Theorem E.3. The sketch below illustrates the first difficulty.
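The following sketch (reusing grad_U from the code after Algorithm 2; the adversary is omitted for clarity) compares the accumulated local update −ΔU_c(t) on a single client with a single gradient step of the same total size: the two coincide when K = 1 and generally differ once K ≥ 2 and the local iterates drift.

import numpy as np

def fl_gradient_gap(U, a, b, S_c, K, eta_local):
    W = U.copy()
    for _ in range(K):                               # K local steps
        W = W - eta_local * grad_U(W, a, b, S_c)
    fl_grad = -(W - U)                               # the -Delta U_c(t) surrogate
    one_step = K * eta_local * grad_U(U, a, b, S_c)  # one step of equal total size
    return np.linalg.norm(fl_grad - one_step)        # 0 when K = 1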

C.3 Existence of robust network near initialization

In Section F we show that there exists a global network fUf_{U^{*}} whose weight is close to the initial value U(0)U(0) and makes the worst-case global loss 𝒜(fU)\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}) sufficiently small. We show that the required width mm is poly(d,(NJ/ϵ)1/γ)\operatorname{poly}(d,(NJ/\epsilon)^{1/\gamma}).

Suppose we are given a ρ\rho-bounded adversary. For a globally γ\gamma-separable training set, to prove Theorem F.3, first we state Lemma F.1 which shows the existence of function ff^{*} that has "low complexity" and satisfies f(x~c,j)yc,jf^{*}(\widetilde{x}_{c,j})\approx y_{c,j} for all data point (xc,j,yc,j)(x_{c,j},y_{c,j}) and perturbation inputs x~c,j2(xc,j,ρ)\widetilde{x}_{c,j}\in\mathcal{B}_{2}(x_{c,j},\rho).

Then, we state Lemma F.2, which shows the existence of a pseudo-network g_{U*} that approximates f* well. Finally, by using Theorem D.2 we show that f_{U*} approximates g_{U*} well. By combining these results, we finish the proof of Theorem F.3.

Appendix D The real network approximates the pseudo-network

To complement the proof sketch in Section 5, in this section we state a tool related to our definition of pseudo-network that will be used in our proof. First, we recall the definition of the pseudo-network.

Definition D.1 (Pseudo-network).

Given weights Ud×mU\in\mathbb{R}^{d\times m}, ama\in\mathbb{R}^{m} and bmb\in\mathbb{R}^{m}, the global neural network function fU:df_{U}:\mathbb{R}^{d}\rightarrow\mathbb{R} is defined as

fU(x):=r=1marσ(Ur,x+br).\displaystyle f_{U}(x):=\sum_{r=1}^{m}a_{r}\cdot\sigma(\langle U_{r},x\rangle+b_{r}).

Given this fU(x)f_{U}(x), we define the corresponding pseudo-network function gU:dg_{U}:\mathbb{R}^{d}\rightarrow\mathbb{R} as

g_{U}(x):=\sum_{r=1}^{m}a_{r}\cdot\langle U_{r}-U_{r}(0),x\rangle\cdot\operatorname{\mathds{1}}\{\langle U_{r}(0),x\rangle+b_{r}\geq 0\}.

From the definition, the pseudo-network can be seen as a linearization, around initialization, of the two-layer ReLU network we study: the ReLU gates are frozen at their initial values, and the output is linear in U − U(0). Next, we cite a theorem from [ZPD+20] which gives a uniform bound on the difference between a network and its pseudo-network.

Theorem D.2 (Uniform approximation, Theorem 5.1 in [ZPD+20]).

Suppose R ≥ 1 is a constant and let ρ := exp(−Ω(m^{1/3})). As long as m ≥ poly(d), with probability ≥ 1 − ρ, for every U ∈ ℝ^{d×m} satisfying ∥U − U(0)∥_{2,∞} ≤ R/m^{2/3}, we have sup_{x∈𝒳} |f_U(x) − g_U(x)| ≤ O(R²/m^{1/6}).

The randomness is due to initialization.
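The following is a numerical illustration of Theorem D.2, not a proof. It assumes a paired (symmetric) initialization so that f_{U(0)} vanishes identically, with |a_r| = m^{−1/3} matching the bound used in Eq. (6) below; the paper's exact initialization scheme may differ.

import numpy as np

rng = np.random.default_rng(0)
d, m, R = 10, 100_000, 1.0  # m even

# Paired initialization (an assumption of this sketch): neurons come in
# +/- pairs sharing (U_r(0), b_r), so f_{U(0)}(x) = 0 for every x.
a_h = rng.choice([-1.0, 1.0], size=m // 2) / m ** (1 / 3)
a = np.concatenate([a_h, -a_h])
U0_h = rng.standard_normal((d, m // 2)) / np.sqrt(d)
U0 = np.concatenate([U0_h, U0_h], axis=1)
b_h = rng.standard_normal(m // 2) / np.sqrt(d)
b = np.concatenate([b_h, b_h])

def f_net(U, x):     # real network f_U
    return a @ np.maximum(U.T @ x + b, 0.0)

def g_pseudo(U, x):  # pseudo-network g_U: ReLU gates frozen at initialization
    gate = (U0.T @ x + b >= 0.0).astype(float)
    return a @ (((U - U0).T @ x) * gate)

# Perturb each column of U0 by exactly R / m^{2/3} in l2 norm, so that
# ||U - U(0)||_{2,inf} = R / m^{2/3} as required by Theorem D.2.
P = rng.standard_normal((d, m))
P *= (R / m ** (2 / 3)) / np.linalg.norm(P, axis=0)
U = U0 + P

x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # unit-norm input
print(abs(f_net(U, x) - g_pseudo(U, x)), R ** 2 / m ** (1 / 6))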

Appendix E Convergence

Table 1: List of theorems and lemmas in Section E. The main result of this section is Theorem E.3. By saying "Statements Used" we mean these statements are used in the proof in the corresponding section. For example, Lemma E.4, E.5 and Theorem D.2 are used in the proof of Theorem E.3.
Section Statement Comment Statements Used
E.1 Definition E.1 and E.2 Definition -
E.2 Theorem E.3 Convergence result Lem. E.4, E.5, Thm. D.2
E.3 Lemma E.4 Approximates real gradient -
E.4 Lemma E.5 Approximates pseudo gradient Claim E.6
E.5 Claim E.6 Auxiliary bounding Claim A.4

E.1 Definitions and notations

In Section E, we follow the notation of Definition D.1. Since we are dealing with the pseudo-network, we first introduce some additional definitions and notation regarding gradients.

Definition E.1 (Gradient).

For a local real network fWc(t,k)f_{W_{c}(t,k)}, we denote its gradient by

(fc,t,k):=\displaystyle\nabla(f_{c},t,k):= Wc(fWc(t,k),𝒮c(t)).\displaystyle\leavevmode\nobreak\ \nabla_{W_{c}}\mathcal{L}(f_{W_{c}(t,k)},\mathcal{S}_{c}(t)).

If the corresponding pseudo-network is gWc(t,k)g_{W_{c}(t,k)}, then we define the pseudo-network gradient as

(gc,t,k):=\displaystyle\nabla(g_{c},t,k):= Wc(gWc(t,k),𝒮c(t)).\displaystyle\leavevmode\nobreak\ \nabla_{W_{c}}\mathcal{L}(g_{W_{c}(t,k)},\mathcal{S}_{c}(t)).

Now we consider the global matrix. For convenience we write (f,t):=U(fU(t),𝒮(t))\nabla(f,t):=\nabla_{U}\mathcal{L}(f_{U(t)},\mathcal{S}(t)) and (g,t):=U(gU(t),𝒮(t))\nabla(g,t):=\nabla_{U}\mathcal{L}(g_{U(t)},\mathcal{S}(t)). We define the FL gradient as ~(f,t):=1NΔU(t)\widetilde{\nabla}(f,t):=-\frac{1}{N}\Delta U(t).

Definition E.2 (Distance).

For Ud×mU^{*}\in\mathbb{R}^{d\times m} such that UU~2,R/m3/4\|U^{*}-\widetilde{U}\|_{2,\infty}\leq R/m^{3/4}, we define the following distance for simplicity:

Dmax:=\displaystyle D_{\max}:= maxt[T]U~U(t)2,\displaystyle\leavevmode\nobreak\ \max_{t\in[T]}\|\widetilde{U}-U(t)\|_{2,\infty}
DU:=\displaystyle D_{U^{*}}:= U~U2,\displaystyle\leavevmode\nobreak\ \|\widetilde{U}-U^{*}\|_{2,\infty}

We have D_{U*} = O(R/m^{3/4}), and ∥U(t) − U*∥_{2,∞} ≤ D_max + D_{U*} by the triangle inequality.
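In our notation, both matrix norms act per neuron: for M ∈ ℝ^{d×m} with columns M_r, we have ∥M∥_{2,∞} = max_{r∈[m]} ∥M_r∥_2 and ∥M∥_{2,1} = Σ_{r∈[m]} ∥M_r∥_2. A two-function sketch:

import numpy as np

def norm_2_inf(M):
    # max over neurons r of the l2 norm of column M[:, r]
    return np.linalg.norm(M, axis=0).max()

def norm_2_1(M):
    # sum over neurons r of the l2 norms of the columns M[:, r]
    return np.linalg.norm(M, axis=0).sum()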

Table 2: Notations of global model weights in federated learning to be used in this section.
Notation Meaning Satisfy
U(0)U(0) or U~\widetilde{U} Initialization of UU Wc(0,0)=U(0)W_{c}(0,0)=U(0)
U(t)U(t) The value of UU after tt iterations Dmax=maxU(t)U~2,D_{\text{max}}=\max\|U(t)-\widetilde{U}\|_{2,\infty}
UU^{*} The value of UU after small perturbations from U~\widetilde{U} UU~2,R/m3/4\|U^{*}-\widetilde{U}\|_{2,\infty}\leq R/m^{3/4}

E.2 Convergence result

We are going to prove Theorem E.3 in this section.

Theorem E.3 (Convergence, formal version of Theorem 5.6).

Let R ≥ 1, ϵ ∈ (0,1), and K ≥ 1. Let T ≥ poly(R/ϵ). There is M = poly(NJ, R, 1/ϵ) such that for every m ≥ M, with probability ≥ 1 − exp(−Ω(m^{1/3})), if we run Algorithm 2 by setting

ηglobal=1/poly(NJ,R,1/ϵ)andηlocal=1/K,\displaystyle\eta_{\operatorname{global}}=1/\operatorname{poly}(NJ,R,1/\epsilon)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ and\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \eta_{\operatorname{local}}=1/K,

then for every UU^{*} such that UU(0)2,R/m3/4\|U^{*}-U(0)\|_{2,\infty}\leq R/m^{3/4}, the output weights (U(t))t=1T(U(t))_{t=1}^{T} satisfy

1Tt=1T𝒜(fU(t))𝒜(fU)+ϵ.\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{\mathcal{A}}\left(f_{U(t)}\right)\leq\mathcal{L}_{\mathcal{A}^{*}}\left(f_{U^{*}}\right)+\epsilon.

The randomness is from a(0)ma(0)\in\mathbb{R}^{m}, U(0)d×mU(0)\in\mathbb{R}^{d\times m}, b(0)mb(0)\in\mathbb{R}^{m}.

Proof.

We set our parameters as follows:

M\displaystyle M =Ω(max{(NJ)8,(Rϵ)12})\displaystyle=\Omega\Big{(}\max\big{\{}(NJ)^{8},(\frac{R}{\epsilon})^{12}\big{\}}\Big{)}
ηglobal\displaystyle\eta_{\operatorname{global}} =O(ϵNm1/3poly(R/ϵ))\displaystyle=O(\frac{\epsilon}{Nm^{1/3}\cdot\operatorname{poly}(R/\epsilon)})
ηlocal\displaystyle\eta_{\operatorname{local}} =1/K\displaystyle=1/K

Since the loss function is 1-Lipschitz, we first bound the ℓ2 norm of the real network's gradient:

\|\nabla_{r}(f_{c},t,k)\|_{2}\leq|a_{r}|\cdot\Big(\frac{1}{J}\sum_{j=1}^{J}\sigma^{\prime}(\langle W_{c,r}(t,k),\widetilde{x}_{c,j}\rangle+b_{r})\cdot\|\widetilde{x}_{c,j}\|_{2}\Big)\leq|a_{r}|\leq\frac{1}{m^{1/3}}. (6)

Now we consider the pseudo-network gradient. The loss ℒ(g_U, 𝒮(t)) is convex in U because g_U is linear in U. Then we have

(gU(t),𝒮(t))(gU,𝒮(t))\displaystyle\leavevmode\nobreak\ \mathcal{L}(g_{U(t)},\mathcal{S}(t))-\mathcal{L}(g_{U^{*}},\mathcal{S}(t))
\displaystyle\leq U(gU(t),𝒮(t)),U(t)U\displaystyle\leavevmode\nobreak\ \langle\nabla_{U}\mathcal{L}(g_{U(t)},\mathcal{S}(t)),U(t)-U^{*}\rangle
=\displaystyle= ~(f,t),U(t)U+(f,t)~(f,t),U(t)U+(g,t)(f,t),U(t)U\displaystyle\leavevmode\nobreak\ \langle\widetilde{\nabla}(f,t),U(t)-U^{*}\rangle+\langle\nabla(f,t)-\widetilde{\nabla}(f,t),U(t)-U^{*}\rangle+\langle\nabla(g,t)-\nabla(f,t),U(t)-U^{*}\rangle
\displaystyle\leq α(t)+β(t)+γ(t)\displaystyle\leavevmode\nobreak\ \alpha(t)+\beta(t)+\gamma(t)

where the last step follows from the matrix Hölder inequality ⟨A, B⟩ ≤ ∥A∥_{2,1} · ∥B∥_{2,∞} applied to the last two inner products, with

α(t):=\displaystyle\alpha(t):= ~(f,t),U(t)U,\displaystyle\leavevmode\nobreak\ \langle\widetilde{\nabla}(f,t),U(t)-U^{*}\rangle,
β(t):=\displaystyle\beta(t):= (f,t)~(f,t)2,1U(t)U2,,\displaystyle\leavevmode\nobreak\ \|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty},
γ(t):=\displaystyle\gamma(t):= (g,t)(f,t)2,1U(t)U2,.\displaystyle\leavevmode\nobreak\ \|\nabla(g,t)-\nabla(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty}.

Note that the FL gradient ∇̃(f,t) = −(1/N)ΔU(t) is the direction in which the server moves, whereas ∇(f,t) is the true gradient of f. We deal with these three terms separately. As for α(t), we have

U(t+1)UF2=\displaystyle\|U(t+1)-U^{*}\|_{F}^{2}= U(t)+ηglobalΔU(t)UF2\displaystyle\leavevmode\nobreak\ \|U(t)+\eta_{\operatorname{global}}\Delta U(t)-U^{*}\|_{F}^{2}
=\displaystyle= U(t)UF22Nηglobalα(t)+ηglobal2ΔU(t)F2\displaystyle\leavevmode\nobreak\ \|U(t)-U^{*}\|_{F}^{2}-2N\eta_{\operatorname{global}}\alpha(t)+\eta_{\operatorname{global}}^{2}\|\Delta U(t)\|_{F}^{2}

and by rearranging the equation we get

α(t)=ηglobal2NΔU(t)F2+12Nηglobal(U(t)UF2U(t+1)UF2).\displaystyle\alpha(t)=\frac{\eta_{\operatorname{global}}}{2N}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2N\eta_{\operatorname{global}}}\cdot(\|U(t)-U^{*}\|_{F}^{2}-\|U(t+1)-U^{*}\|_{F}^{2}).

Next, we need to upper bound ΔU(t)F2\|\Delta U(t)\|_{F}^{2},

ΔU(t)F2\displaystyle\|\Delta U(t)\|_{F}^{2} =ηlocalNc=1Nk=0K1(fc,t,k)F2\displaystyle=\|\frac{\eta_{\operatorname{local}}}{N}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\nabla(f_{c},t,k)\|_{F}^{2}
ηlocalNc=1Nk=0K1r=1mr(fc,t,k)22\displaystyle\leq\frac{\eta_{\operatorname{local}}}{N}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\sum_{r=1}^{m}\|\nabla_{r}(f_{c},t,k)\|_{2}^{2}
=ηlocalKm1/3\displaystyle=\eta_{\operatorname{local}}Km^{1/3}
=m1/3.\displaystyle=m^{1/3}. (7)

where the last step follows from Kη_local = 1. Summing over t, we have

t=1Tα(t)=\displaystyle\sum_{t=1}^{T}\alpha(t)= ηglobal2Nt=1TΔU(t)F2+12Nηglobalt=1T(U(t)UF2U(t+1)UF2)\displaystyle\leavevmode\nobreak\ \frac{\eta_{\operatorname{global}}}{2N}\sum_{t=1}^{T}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2N\eta_{\operatorname{global}}}\cdot\sum_{t=1}^{T}(\|U(t)-U^{*}\|_{F}^{2}-\|U(t+1)-U^{*}\|_{F}^{2})
=\displaystyle= ηglobal2Nt=1TΔU(t)F2+12Nηglobal(U(1)UF2U(T+1)UF2)\displaystyle\leavevmode\nobreak\ \frac{\eta_{\operatorname{global}}}{2N}\sum_{t=1}^{T}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2N\eta_{\operatorname{global}}}\cdot(\|U(1)-U^{*}\|_{F}^{2}-\|U(T+1)-U^{*}\|_{F}^{2})
\displaystyle\leq ηglobal2Nt=1TΔU(t)F2+12NηglobalU(1)UF2\displaystyle\leavevmode\nobreak\ \frac{\eta_{\operatorname{global}}}{2N}\sum_{t=1}^{T}\|\Delta U(t)\|_{F}^{2}+\frac{1}{2N\eta_{\operatorname{global}}}\cdot\|U(1)-U^{*}\|_{F}^{2}
\displaystyle\lesssim ηglobalNTm1/3+1NηglobalmDU2\displaystyle\leavevmode\nobreak\ \frac{\eta_{\operatorname{global}}}{N}Tm^{1/3}+\frac{1}{N\eta_{\operatorname{global}}}mD_{U^{*}}^{2}

where the last step follows from Eq. (7) and ∥Ũ − U*∥_F² ≤ m · ∥Ũ − U*∥_{2,∞}² = m · D_{U*}².

As for β(t), we apply Lemma E.4 and the triangle inequality to get

β(t)=\displaystyle\beta(t)= (f,t)~(f,t)2,1U(t)U2,\displaystyle\leavevmode\nobreak\ \|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty}
\displaystyle\lesssim m2/3U(t)U2,\displaystyle\leavevmode\nobreak\ m^{2/3}\cdot\|U(t)-U^{*}\|_{2,\infty}
\displaystyle\lesssim m2/3(U(t)U~2,+DU).\displaystyle\leavevmode\nobreak\ m^{2/3}\cdot(\|U(t)-\widetilde{U}\|_{2,\infty}+D_{U^{*}}).

Using Eq. (6), we bound ∥U(t) − Ũ∥_{2,∞}:

U(t)U~2,\displaystyle\|U(t)-\widetilde{U}\|_{2,\infty} ηglobalτ=1tΔU(τ)2,\displaystyle\leq\eta_{\operatorname{global}}\sum_{\tau=1}^{t}\|\Delta U(\tau)\|_{2,\infty}
=\eta_{\operatorname{global}}\sum_{\tau=1}^{t}\Big\|\frac{\eta_{\operatorname{local}}}{N}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\nabla(f_{c},\tau,k)\Big\|_{2,\infty}
\leq\frac{\eta_{\operatorname{global}}\eta_{\operatorname{local}}}{N}\sum_{\tau=1}^{t}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\|\nabla(f_{c},\tau,k)\|_{2,\infty}
\leq\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKm^{-1/3}

and have

β(t)ηglobalηlocaltKm1/3+m2/3DU.\displaystyle\beta(t)\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKm^{1/3}+m^{2/3}D_{U^{*}}.

Summing over t, we have

t=1Tβ(t)\displaystyle\sum_{t=1}^{T}\beta(t) t=1T(ηglobalηlocaltKm1/3+m2/3DU)\displaystyle\lesssim\sum_{t=1}^{T}(\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKm^{1/3}+m^{2/3}D_{U^{*}})
ηglobalηlocalT2Km1/3+m2/3TDU\displaystyle\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}T^{2}Km^{1/3}+m^{2/3}TD_{U^{*}}
ηglobalT2m1/3+m2/3TDU.\displaystyle\lesssim\eta_{\operatorname{global}}T^{2}m^{1/3}+m^{2/3}TD_{U^{*}}.

As for γ(t)\gamma(t), we apply Lemma E.5 and have

γ(t)=\displaystyle\gamma(t)= (g,t)(f,t)2,1U(t)U2,\displaystyle\leavevmode\nobreak\ \|\nabla(g,t)-\nabla(f,t)\|_{2,1}\cdot\|U(t)-U^{*}\|_{2,\infty}
\displaystyle\lesssim NJm13/24(U(t)U~2,+DU).\displaystyle\leavevmode\nobreak\ NJm^{13/24}\cdot(\|U(t)-\widetilde{U}\|_{2,\infty}+D_{U^{*}}).

Since U(t)U~2,ηglobalηlocaltKm1/3\|U(t)-\widetilde{U}\|_{2,\infty}\leq\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKm^{-1/3}, we have

γ(t)ηglobalηlocaltKNJm5/24+NJm13/24DU.\displaystyle\gamma(t)\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKNJm^{5/24}+NJm^{13/24}D_{U^{*}}.

Summing over t, we have

t=1Tγ(t)\displaystyle\sum_{t=1}^{T}\gamma(t) t=1T(ηglobalηlocaltKNJm5/24+NJm13/24DU)\displaystyle\lesssim\sum_{t=1}^{T}\big{(}\eta_{\operatorname{global}}\eta_{\operatorname{local}}tKNJm^{5/24}+NJm^{13/24}D_{U^{*}}\big{)}
ηglobalηlocalT2KNJm5/24+NJm13/24TDU\displaystyle\lesssim\eta_{\operatorname{global}}\eta_{\operatorname{local}}T^{2}KNJm^{5/24}+NJm^{13/24}TD_{U^{*}}
ηglobalT2NJm5/24+NJm13/24TDU.\displaystyle\lesssim\eta_{\operatorname{global}}T^{2}NJm^{5/24}+NJm^{13/24}TD_{U^{*}}.

Next we put everything together. Note that D_{U*} = O(R/m^{3/4}); thus we obtain

t=1T(gU(t),𝒮(t))t=1T(gU,𝒮(t))\displaystyle\sum_{t=1}^{T}\mathcal{L}(g_{U(t)},\mathcal{S}(t))-\sum_{t=1}^{T}\mathcal{L}(g_{U^{*}},\mathcal{S}(t))
\displaystyle\leq t=1Tα(t)+t=1Tβ(t)+t=1Tγ(t)\displaystyle\sum_{t=1}^{T}\alpha(t)+\sum_{t=1}^{T}\beta(t)+\sum_{t=1}^{T}\gamma(t)
\displaystyle\lesssim ηglobalNTm1/3+1NηglobalmDU2+ηglobalT2m1/3\displaystyle\leavevmode\nobreak\ \frac{\eta_{\operatorname{global}}}{N}Tm^{1/3}+\frac{1}{N\eta_{\operatorname{global}}}mD_{U^{*}}^{2}+\eta_{\operatorname{global}}T^{2}m^{1/3}
+m2/3TDU+ηglobalT2NJm5/24+NJm13/24TDU\displaystyle+m^{2/3}TD_{U^{*}}+\eta_{\operatorname{global}}T^{2}NJm^{5/24}+NJm^{13/24}TD_{U^{*}}
\displaystyle\lesssim ηglobalNTm1/3+1NηglobalR2m1/2+ηglobalT2m1/3\displaystyle\leavevmode\nobreak\ \frac{\eta_{\operatorname{global}}}{N}Tm^{1/3}+\frac{1}{N\eta_{\operatorname{global}}}R^{2}m^{-1/2}+\eta_{\operatorname{global}}T^{2}m^{1/3}
+RTm1/12+ηglobalT2NJm5/24+NJm5/24RT.\displaystyle+RTm^{-1/12}+\eta_{\operatorname{global}}T^{2}NJm^{5/24}+NJm^{-5/24}RT.

We then have

1Tτ=1T(gU(τ),𝒮(τ))1Tτ=1T(gU,𝒮(τ))\displaystyle\frac{1}{T}\sum_{\tau=1}^{T}\mathcal{L}(g_{U(\tau)},\mathcal{S}(\tau))-\frac{1}{T}\sum_{\tau=1}^{T}\mathcal{L}(g_{U^{*}},\mathcal{S}(\tau))
\displaystyle\lesssim\leavevmode\nobreak\ ηglobalNm1/3+1NηglobalTR2m1/2+ηglobalTm1/3+Rm1/12\displaystyle\frac{\eta_{\operatorname{global}}}{N}m^{1/3}+\frac{1}{N\eta_{\operatorname{global}}T}R^{2}m^{-1/2}+\eta_{\operatorname{global}}Tm^{1/3}+Rm^{-1/12}
+ηglobalTNJm5/24+NJm5/24R.\displaystyle+\eta_{\operatorname{global}}TNJm^{5/24}+NJm^{-5/24}R.
\displaystyle\lesssim\leavevmode\nobreak\ 1NηglobalTR2m1/2+ηglobalTm1/3+Rm1/12+ηglobalTNJm5/24+NJm5/24R\displaystyle\frac{1}{N\eta_{\operatorname{global}}T}R^{2}m^{-1/2}+\eta_{\operatorname{global}}Tm^{1/3}+Rm^{-1/12}+\eta_{\operatorname{global}}TNJm^{5/24}+NJm^{-5/24}R (8)
\displaystyle\leq\leavevmode\nobreak\ O(ϵ).\displaystyle O(\epsilon).

From Theorem D.2 we know

supx𝒳|fU(x)gU(x)|O(R2/m1/6)=O(ϵ).\displaystyle\sup_{x\in\mathcal{X}}|f_{U}(x)-g_{U}(x)|\leq O(R^{2}/m^{1/6})=O(\epsilon).

In addition,

1Tt=1T((fU(t),𝒮(t))(fU,𝒮(t)))\displaystyle\frac{1}{T}\sum_{t=1}^{T}\left(\mathcal{L}(f_{U(t)},\mathcal{S}(t))-\mathcal{L}(f_{U^{*}},\mathcal{S}(t))\right) O(ϵ)\displaystyle\leq O(\epsilon) (9)

From the definition of 𝒜* we have ℒ(f_{U*}, 𝒮(t)) ≤ ℒ_{𝒜*}(f_{U*}). From the definition of the loss we have ℒ(f_{U(t)}, 𝒮(t)) = ℒ_𝒜(f_{U(t)}). Moreover, since Eq. (9) holds for every ϵ > 0, we can rescale ϵ to absorb the hidden constant c (i.e., run the argument with ϵ/c in place of ϵ). Thus we prove that for all ϵ > 0,

\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{\mathcal{A}}(f_{U(t)})\leq\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}})+\epsilon. ∎
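As a usage illustration of this guarantee (a sketch reusing f, fal_round, pgd_adversary, and global_adv_loss from the earlier sketches; the synthetic data and all hyperparameter values are placeholders), one would monitor exactly the round-averaged adversarial loss appearing on the left-hand side:

import numpy as np

rng = np.random.default_rng(0)
d, m, N, J, K, T, rho = 5, 512, 4, 8, 3, 20, 0.05
a = rng.choice([-1.0, 1.0], size=m) / m ** (1 / 3)
U = rng.standard_normal((d, m)) / np.sqrt(d)
b = rng.standard_normal(m) / np.sqrt(d)
clients = [[(x / np.linalg.norm(x), float(rng.uniform(-1, 1)))
            for x in rng.standard_normal((J, d))] for _ in range(N)]

losses = []
for t in range(T):
    U = fal_round(U, a, b, clients, rho, K, eta_local=1.0 / K, eta_global=0.5)
    adversary = lambda fn, x, y: pgd_adversary(U, a, b, x, y, rho)
    losses.append(global_adv_loss(lambda x: f(U, a, b, x), clients, adversary))
print(np.mean(losses))  # tracks (1/T) * sum_t L_A(f_{U(t)}) from Theorem E.3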

E.3 The FL gradient approximates the real global gradient

We are going to prove Lemma E.4 in this section.

Lemma E.4 (Bounding the difference between real gradient and FL gradient).

Let ρ := exp(−Ω(m^{1/3})). With probability ≥ 1 − ρ, for iterations t satisfying

\|U(t)-U(0)\|_{2,\infty}\leq O(m^{-15/24}),

the following holds:

(f,t)~(f,t)2,1O(m2/3).\displaystyle\|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\leq O(m^{2/3}).

The randomness is from a(τ) ∈ ℝ^m, U(τ) ∈ ℝ^{d×m}, b(τ) ∈ ℝ^m for τ = 0.

Proof.

Notice that (f,t)=U(fU(t),𝒮(t))\nabla(f,t)=\nabla_{U}\mathcal{L}(f_{U(t)},\mathcal{S}(t)) and

~(f,t)=1NΔU(t)=1Nc=1NΔUc(t)=ηlocalNc=1Nk=0K1(fc,t,k).\displaystyle\widetilde{\nabla}(f,t)=-\frac{1}{N}\Delta U(t)=-\frac{1}{N}\sum_{c=1}^{N}\Delta U_{c}(t)=\frac{\eta_{\operatorname{local}}}{N}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\nabla(f_{c},t,k).

So we have

(f,t)~(f,t)2,1\displaystyle\|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1} =r=1mr(f,t)~r(f,t)2\displaystyle=\sum_{r=1}^{m}\|\nabla_{r}(f,t)-\widetilde{\nabla}_{r}(f,t)\|_{2}
=1Nr=1mNr(f,t)ηlocalc=1Nk=0K1r(fc,t,k)2\displaystyle=\frac{1}{N}\sum_{r=1}^{m}\|N\cdot\nabla_{r}(f,t)-\eta_{\operatorname{local}}\sum_{c=1}^{N}\sum_{k=0}^{K-1}\nabla_{r}(f_{c},t,k)\|_{2}
ηlocalNr=1mk=0K1Nr(f,t)Kηlocalc=1Nr(fc,t,k)2\displaystyle\leq\frac{\eta_{\operatorname{local}}}{N}\sum_{r=1}^{m}\sum_{k=0}^{K-1}\|\frac{N\cdot\nabla_{r}(f,t)}{K\eta_{\operatorname{local}}}-\sum_{c=1}^{N}\nabla_{r}(f_{c},t,k)\|_{2}
=1NKr=1mk=0K1Nr(f,t)c=1Nr(fc,t,k)2\displaystyle=\frac{1}{NK}\sum_{r=1}^{m}\sum_{k=0}^{K-1}\|N\cdot\nabla_{r}(f,t)-\sum_{c=1}^{N}\nabla_{r}(f_{c},t,k)\|_{2}

where the last step follows from the assumption that ηlocal=1K\eta_{\operatorname{local}}=\frac{1}{K}.

As for Nr(f,t)c=1Nr(fc,t,k)2\|N\cdot\nabla_{r}(f,t)-\sum_{c=1}^{N}\nabla_{r}(f_{c},t,k)\|_{2}, we have

Nr(f,t)c=1Nr(fc,t,k)2\displaystyle\|N\cdot\nabla_{r}(f,t)-\sum_{c=1}^{N}\nabla_{r}(f_{c},t,k)\|_{2}
\displaystyle\leq |ar||(NNJc=1Nj=1J𝟙{Ur(t),xc,j+br0}\displaystyle\leavevmode\nobreak\ |a_{r}|\cdot\Big{|}\big{(}\frac{N}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\operatorname{\mathds{1}}\{\langle U_{r}(t),x_{c,j}\rangle+b_{r}\geq 0\}
1Jc=1Nj=1J𝟙{Wc,r(t,k),xc,j+br0})xc,j2|\displaystyle-\frac{1}{J}\sum_{c=1}^{N}\sum_{j=1}^{J}\operatorname{\mathds{1}}\{\langle W_{c,r}(t,k),x_{c,j}\rangle+b_{r}\geq 0\}\big{)}\cdot\|x_{c,j}\|_{2}\Big{|}
\displaystyle\leq 1m1/31Jc=1Nj=1J|𝟙{Ur(t),xc,j+br0}𝟙{Wc,r(t,k),xc,j+br0}|\displaystyle\leavevmode\nobreak\ \frac{1}{m^{1/3}}\cdot\frac{1}{J}\sum_{c=1}^{N}\sum_{j=1}^{J}|\operatorname{\mathds{1}}\{\langle U_{r}(t),x_{c,j}\rangle+b_{r}\geq 0\}-\operatorname{\mathds{1}}\{\langle W_{c,r}(t,k),x_{c,j}\rangle+b_{r}\geq 0\}|
\displaystyle\leq Nm1/3.\displaystyle\leavevmode\nobreak\ \frac{N}{m^{1/3}}.

Then, summing over r and k, we have

(f,t)~(f,t)2,1\displaystyle\|\nabla(f,t)-\widetilde{\nabla}(f,t)\|_{2,1}\leq 1NKr=1mk=0K1Nr(f,t)c=1Nr(fc,t,k)2\displaystyle\leavevmode\nobreak\ \frac{1}{NK}\sum_{r=1}^{m}\sum_{k=0}^{K-1}\|N\cdot\nabla_{r}(f,t)-\sum_{c=1}^{N}\nabla_{r}(f_{c},t,k)\|_{2}
\displaystyle\leq 1NKr=1mk=0K1Nm1/3\displaystyle\leavevmode\nobreak\ \frac{1}{NK}\sum_{r=1}^{m}\sum_{k=0}^{K-1}\frac{N}{m^{1/3}}
=\displaystyle= m2/3.\displaystyle\leavevmode\nobreak\ m^{2/3}.

Thus we finish the proof. ∎

E.4 The real gradient approximates the pseudo global gradient

We are going to prove Lemma E.5 in this section.

Lemma E.5 (Bounding the difference between pseudo gradient and real gradient).

Let ρ:=exp(Ω(m1/3))\rho:=\exp(-\Omega(m^{1/3})). With probability 1ρ\geq 1-\rho, for iterations tt satisfying

U(t)U(0)2,O(m15/24),\displaystyle\|U(t)-U(0)\|_{2,\infty}\leq O(m^{-15/24}),

the following holds:

(g,t)(f,t)2,1O(NJm13/24).\displaystyle\|\nabla(g,t)-\nabla(f,t)\|_{2,1}\leq O(NJm^{13/24}).

The randomness is from a(τ) ∈ ℝ^m, U(τ) ∈ ℝ^{d×m}, b(τ) ∈ ℝ^m for τ = 0.

Proof.

Notice that (g,t)=U(gU(t),𝒮(t))\nabla(g,t)=\nabla_{U}\mathcal{L}(g_{U(t)},\mathcal{S}(t)) and (f,t)=U(fU(t),𝒮(t))\nabla(f,t)=\nabla_{U}\mathcal{L}(f_{U(t)},\mathcal{S}(t)). By Claim E.6, with the given probability we have

r=1m𝟙{r(g,t)r(f,t)}O(NJm7/8).\displaystyle\sum_{r=1}^{m}\operatorname{\mathds{1}}\{\nabla_{r}(g,t)\not=\nabla_{r}(f,t)\}\leq O(NJm^{7/8}).

For indices r[m]r\in[m] satisfying r(g,t)r(f,t)\nabla_{r}(g,t)\not=\nabla_{r}(f,t), the following holds:

r(g,t)r(f,t)2=\displaystyle\|\nabla_{r}(g,t)-\nabla_{r}(f,t)\|_{2}= U,r(gU(t),S(t))U,r(fU(t),S(t))2\displaystyle\leavevmode\nobreak\ \|\nabla_{U,r}\mathcal{L}(g_{U(t)},S(t))-\nabla_{U,r}\mathcal{L}(f_{U(t)},S(t))\|_{2}
\displaystyle\leq |ar|1NJc=1Nj=1Jxc,j2|𝟙{U~r,xc,j+br0}\displaystyle\leavevmode\nobreak\ |a_{r}|\cdot\frac{1}{NJ}\cdot\sum_{c=1}^{N}\sum_{j=1}^{J}\|x_{c,j}\|_{2}\cdot\big{|}\operatorname{\mathds{1}}\{\langle\widetilde{U}_{r},x_{c,j}\rangle+b_{r}\geq 0\}
𝟙{Ur,xc,j+br0}|\displaystyle-\operatorname{\mathds{1}}\{\langle U_{r},x_{c,j}\rangle+b_{r}\geq 0\}\big{|}
\displaystyle\leq 1m1/31NJc=1Nj=1J|𝟙{U~r,xc,j+br0}𝟙{Ur,xc,j+br0}|\displaystyle\leavevmode\nobreak\ \frac{1}{m^{1/3}}\cdot\frac{1}{NJ}\cdot\sum_{c=1}^{N}\sum_{j=1}^{J}\big{|}\operatorname{\mathds{1}}\{\langle\widetilde{U}_{r},x_{c,j}\rangle+b_{r}\geq 0\}-\operatorname{\mathds{1}}\{\langle U_{r},x_{c,j}\rangle+b_{r}\geq 0\}\big{|}
\displaystyle\leq 1m1/3.\displaystyle\leavevmode\nobreak\ \frac{1}{m^{1/3}}.

where the first step is by definition, the second step follows from the loss function being 1-Lipschitz, the third step follows from |a_r| ≤ 1/m^{1/3} and ∥x_{c,j}∥_2 = 1, and the last step follows from each indicator difference being at most 1. Thus, we conclude:

(g,t)(f,t)2,1\displaystyle\leavevmode\nobreak\ \|\nabla(g,t)-\nabla(f,t)\|_{2,1}
=\displaystyle= r=1mr(g,t)r(f,t)2𝟙{r(g,t)r(f,t)}\displaystyle\leavevmode\nobreak\ \sum_{r=1}^{m}\|\nabla_{r}(g,t)-\nabla_{r}(f,t)\|_{2}\cdot\operatorname{\mathds{1}}\{\nabla_{r}(g,t)\not=\nabla_{r}(f,t)\}
\displaystyle\leq 1m1/3r=1m𝟙{r(g,t)r(f,t)}\displaystyle\leavevmode\nobreak\ \frac{1}{m^{1/3}}\sum_{r=1}^{m}\operatorname{\mathds{1}}\{\nabla_{r}(g,t)\not=\nabla_{r}(f,t)\}
\displaystyle\leq 1m1/3O(NJm7/8)\displaystyle\leavevmode\nobreak\ \frac{1}{m^{1/3}}\cdot O(NJm^{7/8})
=\displaystyle= O(NJm13/24)\displaystyle\leavevmode\nobreak\ O(NJm^{13/24})

and finish the proof. ∎

E.5 Bounding the auxiliary quantity

Claim E.6 (Bounding the auxiliary quantity).

Let ρ := exp(−Ω(m^{1/3})). With probability ≥ 1 − ρ, we have

r=1m𝟙{r(g,t)r(f,t)}O(NJm7/8).\displaystyle\sum_{r=1}^{m}\operatorname{\mathds{1}}\{\nabla_{r}(g,t)\not=\nabla_{r}(f,t)\}\leq O(NJm^{7/8}).

The randomness is from a(τ)ma(\tau)\in\mathbb{R}^{m}, U(τ)d×mU(\tau)\in\mathbb{R}^{d\times m}, b(τ)mb(\tau)\in\mathbb{R}^{m} for τ=0\tau=0.

Proof.

For r[m]r\in[m], let Ir:=𝟙{r(g,t)r(f,t)}I_{r}:=\operatorname{\mathds{1}}\{\nabla_{r}(g,t)\not=\nabla_{r}(f,t)\}. By Claim A.4 we know that for each xc,jx_{c,j} we have

Pr[|W~c,r,xc,j+br|m15/24]O(m1/8).\displaystyle\Pr[|\langle\widetilde{W}_{c,r},x_{c,j}\rangle+b_{r}|\leq m^{-15/24}]\leq O(m^{-1/8}).

By taking a union bound over c and j, we get

Pr[c[N],j[J],|W~c,r,xc,j+br|m15/24]O(NJm1/8).\displaystyle\Pr\big{[}\exists c\in[N],j\in[J],\leavevmode\nobreak\ |\langle\widetilde{W}_{c,r},x_{c,j}\rangle+b_{r}|\leq m^{-15/24}\big{]}\leq O(NJm^{-1/8}).

Since

Pr[Ir=1]Pr[j[J],c[N],|W~c,r,xc,j+br|m15/24],\displaystyle\Pr[I_{r}=1]\leq\Pr[\exists j\in[J],c\in[N],\leavevmode\nobreak\ |\langle\widetilde{W}_{c,r},x_{c,j}\rangle+b_{r}|\leq m^{-15/24}],

we have

Pr[Ir=1]O(NJm1/8).\displaystyle\Pr[I_{r}=1]\leq O(NJm^{-1/8}).

By applying a Chernoff-type concentration bound [Che52] to the independent Bernoulli random variables I_r, r ∈ [m], we obtain that with probability

1exp(Ω(NJm7/8))>1ρ,\displaystyle\geq 1-\exp(-\Omega(NJm^{7/8}))>1-\rho,

the following holds:

r=1mIrO(NJm7/8).\displaystyle\sum_{r=1}^{m}I_{r}\leq O(NJm^{7/8}).

Thus we finish the proof. ∎
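A quick Monte Carlo sanity check of the anti-concentration step (a sketch that stands in a standard Gaussian for the pre-activation ⟨W̃_{c,r}, x_{c,j}⟩ + b_r at initialization; Claim A.4 is the precise statement, and the exact variance there may differ):

import numpy as np

rng = np.random.default_rng(1)
m = 10 ** 6
tau = m ** (-15 / 24)

z = rng.standard_normal(10 ** 7)  # stand-in for <W_r, x> + b_r at init
p_hat = np.mean(np.abs(z) <= tau)
print(p_hat, m ** (-1 / 8))       # empirical probability vs. the O(m^{-1/8}) bound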

E.6 Further Discussion

Note that in the proof of Theorem E.3 we require the hidden layer's width m to be at least of order ϵ^{−12}, which seems impractical: if we choose the convergence accuracy to be 10^{−2}, the required width becomes 10^{24}, which is impossible to achieve.

However, we claim that the exponent 12 is not intrinsic to our theorem and proof; the lower bound on m can be further improved to O((R/ϵ)^{c_2}) for some constant c_2 between 3 and 4. To be specific, we observe from Eq. (8) that the exponent 12 comes from the term Rm^{−1/12}, where −1/12 = 2/3 − 3/4: the 2/3 appears in Lemma E.4, and the 3/4 appears in the assumption D_{U*} ≤ R/m^{3/4} in Definition E.2. In our view, the 2/3 term is hard to improve. On the other hand, we can adjust the value of D_{U*} as long as we ensure

DUR/mc3\displaystyle D_{U^{*}}\leq R/m^{c_{3}}

for some constant c_3 ∈ (0,1). As we let c_3 → 1, the final lower bound on m approaches

O((R/\epsilon)^{3}),

which is much more feasible in practice: with ϵ = 10^{−2} and constant R, this is a width on the order of 10^6.

As the first work and first step towards understanding the convergence of federated adversarial learning, our priority is not achieving the tightest bounds; our main goal is to show the convergence of a general federated adversarial learning framework. Nevertheless, we plan to improve the bound in the final version.

Appendix F Existence

In this section we prove the existence of UU^{*} that is close to U(0)U(0) and makes 𝒜(fU)\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}}) close to zero.

F.1 Tools from previous work

In order to prove our existence result, we first state two lemmas that will be used.

Lemma F.1 (Lemma 6.2 from [ZPD+20]).

Suppose ∥x_{c_1,j_1} − x_{c_2,j_2}∥_2 ≥ δ holds for every pair of distinct data points x_{c_1,j_1}, x_{c_2,j_2}. Let D = 24γ^{−1} ln(48NJ/ϵ). Then there exists a polynomial g : ℝ → ℝ with coefficients of magnitude at most O(γ^{−1} 2^{6D}) and degree at most D, such that for all c_0 ∈ [N], j_0 ∈ [J] and x̃_{c_0,j_0} ∈ 𝓑_2(x_{c_0,j_0}, ρ),

|c=1Nj=1Jyc,jg(xc,j,x~c0,j0)yc0,j0|ϵ3.\displaystyle\left|\sum_{c=1}^{N}\sum_{j=1}^{J}y_{c,j}\cdot g(\langle x_{c,j},\widetilde{x}_{c_{0},j_{0}}\rangle)-y_{c_{0},j_{0}}\right|\leq\frac{\epsilon}{3}.

We let f(x):=c=1Nj=1Jyc,jg(xc,j,x)f^{*}(x):=\sum_{c=1}^{N}\sum_{j=1}^{J}y_{c,j}\cdot g(\langle x_{c,j},x\rangle) and have |f(x~c0,j0)yc0,j0|ϵ/3|f^{*}(\widetilde{x}_{c_{0},j_{0}})-y_{c_{0},j_{0}}|\leq\epsilon/3.

Lemma F.2 (Lemma 6.5 from [ZPD+20]).

Suppose ϵ(0,1)\epsilon\in(0,1). Suppose

M=poly((NJ/ϵ)1/γ,d)andR=poly((NJ/ϵ)1/γ)\displaystyle M=\operatorname{poly}((NJ/\epsilon)^{1/\gamma},d)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ R=\operatorname{poly}((NJ/\epsilon)^{1/\gamma})

As long as m ≥ M, with probability ≥ 1 − exp(−Ω(√(m/NJ))), there exists U* ∈ ℝ^{d×m} that satisfies

UU(0)2,R/m2/3andsupx𝒳|gU(x)f(x)|ϵ/3.\displaystyle\|U^{*}-U(0)\|_{2,\infty}\leq R/m^{2/3}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \sup_{x\in\cal{X}}|g_{U^{*}}(x)-f^{*}(x)|\leq\epsilon/3.

The randomness is due to a(τ)ma(\tau)\in\mathbb{R}^{m}, U(τ)d×mU(\tau)\in\mathbb{R}^{d\times m}, b(τ)mb(\tau)\in\mathbb{R}^{m} for τ=0\tau=0.

F.2 Existence result

We present Theorem F.3 and its proof in this section.

Theorem F.3 (Existence, formal version of Theorem 5.2).

Suppose that ϵ(0,1)\epsilon\in(0,1). Suppose

M0=poly(d,(NJ/ϵ)1/γ)andR=poly((NJ/ϵ)1/γ)\displaystyle M_{0}=\operatorname{poly}(d,(NJ/\epsilon)^{1/\gamma})\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ and\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ R=\operatorname{poly}((NJ/\epsilon)^{1/\gamma})

As long as m ≥ M_0, with probability ≥ 1 − exp(−Ω(m^{1/3})), there exists U* ∈ ℝ^{d×m} satisfying

UU(0)2,R/m2/3and𝒜(fU)ϵ.\displaystyle\|U^{*}-U(0)\|_{2,\infty}\leq R/m^{2/3}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}})\leq\epsilon.

The randomness comes from a(τ)ma(\tau)\in\mathbb{R}^{m}, U(τ)d×mU(\tau)\in\mathbb{R}^{d\times m}, b(τ)mb(\tau)\in\mathbb{R}^{m} for τ=0\tau=0.

Proof.

For convenience, we define

\rho_{0}:=\exp(-\Omega(\sqrt{m/NJ}))+\exp(-\Omega(m^{1/3})).

From Lemma F.1 we obtain the function f*. From Lemma F.2 we obtain M_0 = poly(d, (NJ/ϵ)^{1/γ}) and R = poly((NJ/ϵ)^{1/γ}).

By combining these two results with Theorem D.2 and taking a union bound over the two failure events, we have that for all m ≥ poly(d, (NJ/ϵ)^{1/γ}), with probability

1ρ0,\displaystyle\geq 1-\rho_{0},

there exists U* ∈ ℝ^{d×m} that satisfies ∥U* − U(0)∥_{2,∞} ≤ R/m^{2/3}.

In addition, the following properties hold:

  • maxx𝒳|gU(x)f(x)|\max_{x\in{\cal X}}|g_{U^{*}}(x)-f^{*}(x)| is at most ϵ/3\epsilon/3

  • maxx𝒳|fU(x)gU(x)|\max_{x\in{\cal X}}|f_{U^{*}}(x)-g_{U^{*}}(x)| is at most O(R2/m1/6)O(R^{2}/m^{1/6})

Consider the loss function. Since ℓ is 1-Lipschitz and ℓ(y, y) = 0, for all c ∈ [N], j ∈ [J] and x̃_{c,j} ∈ 𝓑_2(x_{c,j}, ρ), we have

(fU(x~c,j),yc,j)\displaystyle\ell(f_{U^{*}}(\widetilde{x}_{c,j}),y_{c,j})\leq |fU(x~c,j)yc,j|\displaystyle\leavevmode\nobreak\ |f_{U^{*}}(\widetilde{x}_{c,j})-y_{c,j}|
\displaystyle\leq |fU(x~c,j)gU(x~c,j)|+|gU(x~c,j)f(x~c,j)|+|f(x~c,j)yc,j|\displaystyle\leavevmode\nobreak\ |f_{U^{*}}(\widetilde{x}_{c,j})-g_{U^{*}}(\widetilde{x}_{c,j})|+|g_{U^{*}}(\widetilde{x}_{c,j})-f^{*}(\widetilde{x}_{c,j})|+|f^{*}(\widetilde{x}_{c,j})-y_{c,j}|
\displaystyle\leq O(R2/m1/6)+ϵ3+ϵ3\displaystyle\leavevmode\nobreak\ O(R^{2}/m^{1/6})+\frac{\epsilon}{3}+\frac{\epsilon}{3}
\displaystyle\leq ϵ,\displaystyle\leavevmode\nobreak\ \epsilon,

Thus, we have that

\mathcal{L}_{\mathcal{A}^{*}}(f_{U^{*}})=\frac{1}{NJ}\sum_{c=1}^{N}\sum_{j=1}^{J}\max_{x_{c,j}^{*}\in\mathcal{B}_{2}(x_{c,j},\rho)}\ell(f_{U^{*}}(x_{c,j}^{*}),y_{c,j})\leq\epsilon.

Furthermore, since the m we consider satisfies m ≥ Ω((NJ)^{1/γ}), the success probability is

\displaystyle\geq 1ρ0\displaystyle\leavevmode\nobreak\ 1-\rho_{0}
=\displaystyle= 1exp(Ω(m1/3)).\displaystyle\leavevmode\nobreak\ 1-\exp(-\Omega(m^{1/3})).

This finishes the proof of the theorem. ∎

References

  • [ACW18] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
  • [ADH+19a] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In NeurIPS. https://arxiv.org/pdf/1904.11955.pdf, 2019.
  • [ADH+19b] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In ICML. https://arxiv.org/pdf/1901.08584.pdf, 2019.
  • [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In NeurIPS. https://arxiv.org/pdf/1811.04918.pdf, 2019.
  • [AZLS19a] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML. https://arxiv.org/pdf/1811.03962.pdf, 2019.
  • [AZLS19b] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS. https://arxiv.org/pdf/1810.12065.pdf, 2019.
  • [BCMC19] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, pages 634–643. PMLR, 2019.
  • [Ber24] Sergei Bernstein. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.
  • [BPSW21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS. https://arxiv.org/pdf/2006.11648.pdf, 2021.
  • [CH20] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206–2216. PMLR, 2020.
  • [Che52] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493–507, 1952.
  • [CW17] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
  • [DKM20] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33:15111–15122, 2020.
  • [DLL+19] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In ICML. https://arxiv.org/pdf/1811.03804, 2019.
  • [DZPS19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. https://arxiv.org/pdf/1810.02054.pdf, 2019.
  • [GCL+19] Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32:13029–13040, 2019.
  • [GSS14] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • [HLSY21] Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. Fl-ntk: A neural tangent kernel-based framework for federated learning analysis. In International Conference on Machine Learning, pages 4423–4434. PMLR, 2021.
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NeurIPS), pages 8571–8580, 2018.
  • [KGB16] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
  • [LGD+20] Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence Staib, Pamela Ventola, and James S Duncan. Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: Abide results. Medical Image Analysis, 2020.
  • [LHY+19] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
  • [LJZ+21] Xiaoxiao Li, Meirui JIANG, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Federated learning on non-iid features via local batch normalization. In International Conference on Learning Representations, 2021.
  • [LL18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS. https://arxiv.org/pdf/1808.01204.pdf, 2018.
  • [LLC+19] Xinle Liang, Yang Liu, Tianjian Chen, Ming Liu, and Qiang Yang. Federated transfer reinforcement learning for autonomous driving. arXiv preprint arXiv:1910.06001, 2019.
  • [LLH+20] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(3):2031–2063, 2020.
  • [LSS+20] Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. Generalized leverage score sampling for neural networks. In NeurIPS, 2020.
  • [LSZ+20] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Machine Learning and Systems, 2020a, 2020.
  • [MMR+17] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
  • [MMS+18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR. https://arxiv.org/pdf/1706.06083.pdf, 2018.
  • [MOSW22] Alexander Munteanu, Simon Omlor, Zhao Song, and David Woodruff. Bounding the width of neural networks via coupled initialization a worst case analysis. In International Conference on Machine Learning, pages 16083–16122. PMLR, 2022.
  • [RCZ+20] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [RFPJ20] Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, and Ali Jadbabaie. Robust federated learning: The case of affine distribution shifts. arXiv preprint arXiv:2006.08907, 2020.
  • [RHL+20] Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3(1):1–7, 2020.
  • [SKC18] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
  • [SY19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. In arXiv preprint. https://arxiv.org/pdf/1906.03593.pdf, 2019.
  • [SYZ21] Zhao Song, Shuo Yang, and Ruizhe Zhang. Does preprocessing help training over-parameterized neural networks? Advances in Neural Information Processing Systems, 34, 2021.
  • [SZS+13] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In arXiv preprint. https://arxiv.org/pdf/1312.6199.pdf, 2013.
  • [SZZ21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • [TKP+17] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
  • [TZT18] Zhuozhuo Tu, Jingwei Zhang, and Dacheng Tao. Theoretical analysis of adversarial learning: A minimax approach. arXiv preprint arXiv:1811.05232, 2018.
  • [WLL+20] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481, 2020.
  • [WTS+19] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205–1221, 2019.
  • [YCKB18] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR, 2018.
  • [Zha22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
  • [ZHD+20] Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. Fedpd: A federated learning framework with optimal rates and adaptivity to non-iid data. arXiv preprint arXiv:2005.11418, 2020.
  • [ZLL+18] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • [ZLL+21] Gaoyuan Zhang, Songtao Lu, Sijia Liu, Xiangyi Chen, Pin-Yu Chen, Lee Martie, Lior Horesh, and Mingyi Hong. Distributed adversarial training to robustify deep neural networks at scale. 2021.
  • [ZPD+20] Yi Zhang, Orestis Plevrakis, Simon S Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Over-parameterized adversarial training: An analysis overcoming the curse of dimensionality. In NeurIPS. arXiv preprint arXiv:2002.06668, 2020.
  • [ZRSB20] Giulio Zizzo, Ambrish Rawat, Mathieu Sinn, and Beat Buesser. Fat: Federated adversarial training. arXiv preprint arXiv:2012.01791, 2020.