
Qinzi Zhang, Boston College, USA (zhangbcu@bc.edu)
Lewis Tseng, Boston College, USA (lewis.tseng@bc.edu, https://orcid.org/0000-0002-4717-4038)

CCS Concepts: Computing methodologies – Distributed computing methodologies; Computing methodologies – Machine learning

Acknowledgements.
The authors would like to acknowledge Nitin H. Vaidya and the anonymous reviewers for their helpful comments.

OPODIS 2020

Echo-CGC: A Communication-Efficient Byzantine-tolerant Distributed Machine Learning Algorithm in Single-Hop Radio Network

Qinzi Zhang    Lewis Tseng
Abstract

In the past few years, many Byzantine-tolerant distributed machine learning (DML) algorithms have been proposed in the point-to-point communication model. In this paper, we focus on a popular DML framework – the parameter server computation paradigm and iterative learning algorithms that proceed in rounds, e.g., [11, 8, 6]. One limitation of prior algorithms in this domain is their high communication complexity. All the Byzantine-tolerant DML algorithms that we are aware of need to send n d-dimensional vectors from worker nodes to the parameter server in each round, where n is the number of workers and d is the number of dimensions of the feature space (which may be in the order of millions). In a wireless network, power consumption is proportional to the number of bits transmitted. Consequently, it is extremely difficult, if not impossible, to deploy these algorithms in power-limited wireless devices. Motivated by this observation, we aim to reduce the communication complexity of Byzantine-tolerant DML algorithms in the single-hop radio network [1, 3, 14].

Inspired by the CGC filter developed by Gupta and Vaidya, PODC 2020 [11], we propose a gradient descent-based algorithm, Echo-CGC. Our main novelty is a mechanism that utilizes the broadcast properties of the radio network to avoid transmitting the raw gradients (full d-dimensional vectors). In the radio network, each worker is able to overhear previous gradients that were transmitted to the parameter server. Roughly speaking, in Echo-CGC, if a worker “agrees” with a combination of prior gradients, it broadcasts an “echo message” instead of its raw local gradient. The echo message contains a vector of coefficients (of size at most n) and the ratio of the magnitude between two gradients (a float). In comparison, traditional approaches need to send n local gradients in each round, where each gradient is typically a vector in an ultra-high dimensional space (d \gg n). The improvement in communication complexity of our algorithm depends on multiple factors, including the number of nodes, the number of faulty workers in an execution, and the cost function. We numerically analyze the improvement, and show that with a large number of nodes, Echo-CGC saves 80% of the communication under standard assumptions.

keywords:
Distributed Machine Learning, Single-hop Radio Network, Byzantine Fault, Communication Complexity, Wireless Communication, Parameter Server

1 Introduction

Machine learning has been widely adopted and explored recently [24, 16]. Due to the exponential growth of datasets and the computation power they require, distributed machine learning (DML) has become a necessity. There is also an emerging trend [22, 13] to apply DML in power-limited wireless networked systems, e.g., sensor networks, distributed robots, smart homes, and the Industrial Internet-of-Things (IIoT). In these applications, the devices are usually small and fragile, and susceptible to malicious attacks and/or malfunction. More importantly, it is necessary to reduce communication complexity so that (over-)communication does not drain the device battery. Most prior research on fault-tolerant DML (e.g., [8, 4, 11, 6]) has focused on use cases in clusters or datacenters. These algorithms achieve high resilience (number of faults tolerated), but also incur high communication complexity. As a result, most prior Byzantine-tolerant DML algorithms are extremely difficult, if not impossible, to deploy in power-limited wireless networks.

Motivated by these observations, we aim to design a Byzantine-tolerant DML algorithm with reduced communication complexity. We consider wireless systems that are modeled as a single-hop radio network, and focus on the popular parameter server computation paradigm (e.g., [11, 8, 6]). We propose Echo-CGC, and prove its correctness under typical assumptions [4, 8]. For the communication complexity, we formally analyze the expected number of bits that need to be sent from workers to the parameter server. The extension to multi-hop radio networks is left as interesting future work.

Recent Development in Distributed Machine Learning   Distributed Machine Learning (DML) is designed to handle a large amount of computation over big data. In the parameter server model, there is a centralized parameter server that distributes the computation tasks to n workers. These workers have access to the same dataset (which may be stored externally). Similar to [4, 11, 6], we focus on synchronous gradient descent DML algorithms, where the server and workers proceed in synchronous rounds. In each round, each worker computes a local gradient over the parameter received from the server; the server then aggregates the gradients collected from the workers and updates the parameter. Under suitable assumptions, prior algorithms [4, 11, 6] converge to the optimal point in the d-dimensional space \mathbb{R}^{d} even if up to f workers may become Byzantine faulty.

To our knowledge, most Byzantine-tolerant DML or distributed optimization algorithms focus on clusters and datacenters, which are modeled as a point-to-point network. For example, Reference [6], Krum [4], Kardam [7], and ByzSGD [8] focus on stochastic gradient descent algorithms under several different settings (synchronous, asynchronous, and distributed parameter server). References [21, 11, 20] focus on gradient descent algorithms for the general distributed optimization framework. Zeno [25] uses failure detection to improve resilience. None of these works aims to reduce communication complexity.

Another closely related research direction is on reducing the communication complexity of non-Byzantine-tolerant DML algorithms, e.g., [15, 13, 23]. These algorithms are not Byzantine fault-tolerant, and adopt a completely different design. For example, reference [15] utilizes relaxed consistency (of the underlying shared data), reference [23] discards coordinates (of the local gradients) aggressively, and reference [13] uses intermediate aggregation. It is not clear how to integrate these techniques with Byzantine fault-tolerance, as these approaches reduce the redundancy, making it difficult to mask the impact from Byzantine workers.

Single-Hop Radio Network   We consider the problem in a single-hop radio network, which is a proper theoretical model for wireless networks. Following [1, 3, 14], we assume that single-hop wireless communication is reliable and authenticated, and there is no jamming nor spoofing. Moreover, nodes follow a specific TDMA schedule so that there is no collision. In Section 2.1, we briefly argue why such an assumption is realistic for modeling wireless communication. In the single-hop radio network model, we aim to minimize the total number of bits to be transmitted in each round. If we directly adapt prior gradient descent-based algorithms [4, 11] to the radio network model, then each worker needs to broadcast a vector of size d, where d is the number of dimensions of the feature space. In practical applications (e.g., [9, 13]), d might be in the order of millions, and the gradients may require a few GBs. Since power consumption is proportional to the communication complexity in wireless channels, prior Byzantine DML algorithms are not adequate for power-limited wireless networks.

Main Contributions   Inspired by the CGC filter developed by Gupta and Vaidya, PODC 2020 [11], we propose a gradient descent-based algorithm, Echo-CGC, for the parameter server model in the single-hop radio network. Our main observation is that since workers can overhear gradients transmitted earlier, they can use this information to avoid sending the raw gradients in some cases. Particularly, if a worker “agrees” with some reference gradient(s) transmitted earlier in the same round, then it sends a small message to “echo” the reference gradient(s). The size of the echo message (O(n) bits) is negligible compared to the raw gradient (O(d) bits), since in typical ML applications, d \gg n.

Our proof is more sophisticated than the one in [11], even though Echo-CGC is inspired by the CGC filter. The reason is that the “echo message” does not necessarily contain worker i’s local gradient; instead, it is used to construct an approximate gradient, which intuitively equals a combination of i’s local gradient and the gradients broadcast by previous workers. We need to ensure that such an approximation does not affect the aggregation at the server. Moreover, the CGC filter [11] works on deterministic gradients – each worker computes the gradient of its local cost function using the full dataset. In our case, each worker computes a stochastic gradient, i.e., a gradient over a small random data batch. We prove that under appropriate assumptions, Echo-CGC converges to the optimal point.

Echo-CGC is correct under the same set of assumptions as prior work [4]; however, there is an inherent trade-off between resilience, the proven bound on the communication complexity reduction, and the cost function. For a fixed cost function, we derive necessary conditions on n so that Echo-CGC is guaranteed to perform better. We also perform numerical analysis to understand the trade-off. In general, Echo-CGC saves more communication as f/n becomes smaller. Moreover, our algorithm performs better when the variance of the data is relatively small. For example, our algorithm tolerates 10% faulty workers and saves over 75% of the communication cost when the standard deviation of the computed gradients is less than 10% of the true gradient.

2 Preliminaries

In this section, we formally define our models, and introduce the assumptions and notations.

2.1 Models

Single-Hop Radio Network   We consider the standard radio network model in the literature, e.g., [1, 3, 14]. In particular, the underlying communication layer ensures the reliable local broadcast property [3]. In other words, the channel is perfectly reliable, and a local broadcast is correctly received by all neighbors. As noted in [1, 3], this assumption does not typically hold in currently deployed wireless networks, but it is possible to realize such a property with high probability in practice with help from the MAC layer [2] or the physical layer [17].

In our system, nodes can be uniquely identified, i.e., each node has a unique identifier. We assume that a faulty node may not spoof another node’s identity. The communication network is assumed to be single-hop; that is, each pair of nodes is within communication range of each other. Moreover, time is divided into slots, and each node proceeds synchronously. Message collision is not possible, because nodes follow a pre-determined TDMA schedule that determines the transmitting node in each slot, and the transmission protocol is jam-resistant. Each slot is assumed to be large enough that a node can transmit a gradient. We also assume that each communication round (or communication step) is divided into n slots, and the TDMA schedule assigns each node to a unique slot. For ease of discussion, node i is scheduled to transmit in slot i.

Stochastic Gradient Descent and Parameter Server   In this work, we focus on Byzantine-tolerant distributed Stochastic Gradient Descent (SGD) algorithms, which are popular in the optimization and machine learning literature [4, 8, 11, 5]. Given a cost function Q, the (sequential) SGD algorithm outputs an optimal parameter w^{*} such that

w^{*}=\operatorname*{argmin}_{w\in\mathbb{R}^{d}}Q(w)   (1)

An SGD algorithm executes in an iterative fashion: in each round t, the algorithm computes the gradient of the cost function Q at parameter w^{t} and updates the parameter with the gradient.

Synchronous Parameter Server Model   Computation of gradients is typically expensive and slow. One popular framework to speed up the computation is the parameter server model, in which the parameter server distributes the computation tasks to n workers and aggregates their computed gradients to update the parameter in each round. Following the convention, we use node and worker interchangeably.

We assume a synchronous system, i.e., the computation and communication delays are bounded, and the server and workers know the bound. Consequently, if the server does not receive a message from worker i by the end of some round, then the server identifies worker i as faulty.

Formally speaking, a distributed SGD algorithm in the parameter server model proceeds in synchronous rounds, and executes the following three steps in each round t:

  1. The parameter server broadcasts parameter w^{t} to the workers.

  2. Each worker j randomly chooses a random data batch \xi_{j}^{t} from the dataset (shared by all the workers) and computes an estimate, g_{j}^{t}, of the gradient \nabla Q(w^{t}) of the cost function Q using \xi_{j}^{t} and w^{t}.

  3. The server aggregates the estimated gradients from all workers and updates the parameter using the gradient descent approach with step size \eta:

     w^{t+1}=w^{t}-\eta\sum_{j=1}^{n}g_{j}^{t}   (2)
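The three steps above can be sketched in a few lines of Python/NumPy. The toy quadratic cost, batch size, and step size below are our own illustrative assumptions, not part of the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 5, 3, 0.02            # workers, dimensions, step size (illustrative)
data = rng.normal(size=(100, d))  # toy dataset shared by all workers

def stochastic_gradient(w, batch):
    # Toy cost Q(w) = E ||w - x||^2 / 2, so a batch's gradient
    # estimate is w - mean(batch).
    return w - batch.mean(axis=0)

w = rng.normal(size=d)            # server's initial parameter w^0
for t in range(300):
    # Step 1: server broadcasts w^t.
    # Step 2: each worker j samples a random batch xi_j^t and
    #         computes its gradient estimate g_j^t.
    grads = [stochastic_gradient(w, data[rng.choice(100, size=10)])
             for _ in range(n)]
    # Step 3: server aggregates and applies the update in Equation (2).
    w = w - eta * np.sum(grads, axis=0)

# w ends up near the dataset mean, the minimizer of the toy cost.
```

Note that the step size must account for the sum over n workers in Equation (2); a too-large eta makes the iteration diverge even in this fault-free toy setting.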

Fault Model and Byzantine SGD   Following [11, 4, 6], our system consists of n workers, up to f of which might be Byzantine faulty. We assume that the central parameter server is always fault-free.

Byzantine workers may be controlled by an omniscient adversary that has knowledge of the current parameter (at the server) and the local gradients of all the other workers, and they may behave arbitrarily, including sending arbitrary messages. However, due to the reliable local broadcast property of the radio network model, they cannot send inconsistent messages to the server and other workers. They also cannot spoof another node’s identity. Our goal is therefore to design a distributed SGD algorithm that solves Equation (1) in the presence of up to f Byzantine workers.

Workers that are not Byzantine faulty are called fault-free workers. These workers follow the algorithm specification faithfully. For a given execution of the algorithm, we denote by \mathcal{H} the set of fault-free workers and by \mathcal{B} the set of Byzantine workers. For brevity, we denote h=|\mathcal{H}| and b=|\mathcal{B}|; hence, we have b\leq f and h\geq n-f.

Communication Complexity   We are interested in minimizing the total number of bits that need to be transmitted from workers to the parameter server in each round. Prior algorithms [11, 4] transmit n gradients in a d-dimensional space in each round, since each node needs to transmit its local gradient to the centralized server. Typically, each gradient consists of d floats or doubles (i.e., a single primitive floating point data structure for each dimension).

2.2 Assumptions and Notations

We assume that the cost function Q satisfies some standard properties used in the literature [4, 8, 6], including convexity, differentiability, Lipschitz smoothness, and strong convexity. Following the convention, we use \left<a,b\right> to represent the dot product of two vectors a and b in the d-dimensional space \mathbb{R}^{d}.

Assumption 1 (Convexity and smoothness).

Q is convex and differentiable.

Assumption 2 (L-Lipschitz smoothness).

There exists L>0 such that for all w,w^{\prime}\in\mathbb{R}^{d},

\lVert\nabla Q(w)-\nabla Q(w^{\prime})\rVert\leq L\lVert w-w^{\prime}\rVert   (3)
Assumption 3 (\mu-strong convexity).

There exists \mu>0 such that for all w,w^{\prime}\in\mathbb{R}^{d},

\left<\nabla Q(w)-\nabla Q(w^{\prime}),w-w^{\prime}\right>\geq\mu\lVert w-w^{\prime}\rVert^{2}   (4)
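As a concrete example (ours, not the paper's), a quadratic cost shows how L and \mu arise as extreme eigenvalues:

```latex
% For Q(w) = \tfrac{1}{2} w^{T} A w with A symmetric positive definite,
% we have \nabla Q(w) = A w, hence for all w, w' \in \mathbb{R}^{d}:
\lVert \nabla Q(w) - \nabla Q(w') \rVert
    = \lVert A (w - w') \rVert \le \lambda_{\max}(A)\, \lVert w - w' \rVert,
\qquad
\left< \nabla Q(w) - \nabla Q(w'),\, w - w' \right>
    \ge \lambda_{\min}(A)\, \lVert w - w' \rVert^{2}.
% Thus Assumptions 2 and 3 hold with L = \lambda_{\max}(A) and
% \mu = \lambda_{\min}(A), consistent with \mu \le L (Lemma 4.1 below).
```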

We also assume that the random data batches are independently and identically distributed samples from the dataset. Before stating the assumptions, we formally introduce the notion of randomness in the framework. As in typical stochastic gradient descent algorithms, the only randomness is due to the random data batches \xi_{j}^{t} sampled by each fault-free worker j\in\mathcal{H} in each round t, which makes g_{j}^{t} as well as w^{t+1} non-deterministic. In the case when a worker uses the entire dataset to train the model, g_{j}^{t}=\nabla Q(w^{t}); hence, the result is deterministic, i.e., each fault-free worker derives the same gradient. In practice, a data batch is a small sample of the entire dataset.¹

¹ Reference [11] works on a different formulation in which each worker may have a different local cost function.

Formally speaking, we denote by \mathbb{E}_{\Xi^{t}}(\cdot\mid w^{t},\mathcal{G}_{\mathcal{B}}^{t}) the conditional expectation operator over the set of random batches \Xi^{t}=\{\xi_{j}^{t} : j=1,2,\ldots,n\} in round t, given (i) the parameter w^{t}, and (ii) the set of Byzantine gradients \mathcal{G}_{\mathcal{B}}^{t}=\{g_{j}^{t}:j\in\mathcal{B}\}. This conditional expectation operator allows us to treat w^{t}, Q(w^{t}), and \nabla Q(w^{t}), as well as the Byzantine gradients, as constants. This is reasonable because (i) we have knowledge of Q and w^{t} given an execution, and (ii) the Byzantine gradients are arbitrary and do not depend on the data batches. From now on, without further specification, we abbreviate the operator \mathbb{E}_{\Xi^{t}}(\cdot\mid w^{t},\mathcal{G}_{\mathcal{B}}^{t}) as \mathbb{E}.

Below we present two further assumptions on the local stochastic gradient g^{t}_{j} at each fault-free worker j. Similar to [4, 8], we rely on the following two assumptions for the correctness proof.

Assumption 4 (IID Random Batches).

For all j\in\mathcal{H} and t\in\mathbb{N},

\mathbb{E}(g_{j}^{t})=\nabla Q(w^{t})   (5)
Assumption 5 (Bounded Variance).

For all j\in\mathcal{H} and t\in\mathbb{N},

\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}   (6)

Notation   We list the most important notations and constants used in our algorithm and analysis in the following table.

\mathcal{H}   set of fault-free workers; h=|\mathcal{H}|
\mathcal{B}   set of faulty workers; b=|\mathcal{B}|
t   round number, t=0,1,2,\ldots
w^{*}   optimal solution to Q, i.e., w^{*}=\operatorname*{argmin}_{w\in\mathbb{R}^{d}}Q(w)
w^{t}   parameter in round t
g_{j}^{t}   estimated gradient of j in round t
\tilde{g}_{j}^{t}   “reconstructed” gradient of j by the server in round t
\hat{g}_{j}^{t}   gradient of j in round t after applying the CGC filter
\eta   fixed step size as in Equation (2)
L   Lipschitz constant
\mu   strong convexity constant
r   deviation ratio, a key parameter in our algorithm
k^{*}   constant defined in Lemma 4.2, k^{*}\approx 1.12
Table 1: Notations and constants used in this paper.

3 Our Algorithm: Echo-CGC

Our algorithm is inspired by Gupta and Vaidya [11]. Specifically, we integrate their CGC filter with a novel aggregation phase. Our aggregation mechanism utilizes the broadcast property of the radio network to improve the communication complexity. In the CGC algorithm [11], each worker needs to send a d-dimensional gradient to the server, whereas in our algorithm, some workers only need to send the “echo message”, which is of size O(n) bits. Note that in typical machine learning applications, d \gg n.

We design our algorithm for the synchronous parameter server model, so the algorithm is presented in an iterative fashion. That is, each worker and the parameter server proceed in synchronous rounds, and the algorithm specifies the exact steps for each round t. Our algorithm, Echo-CGC, is presented in Algorithm 1. The algorithm uses the notations and constants summarized in Table 1.

Algorithm Description

Initially, the parameter server randomly generates an initial parameter w^{0}\in\mathbb{R}^{d}. Each round t\geq 0 consists of three phases: (i) computation phase, (ii) communication phase, and (iii) aggregation phase. Echo-CGC takes the following inputs: step size \eta, deviation ratio r, number of workers n, and maximum number of tolerable faults f. The exact requirements on the values of these inputs will become clear later. For example, n, f, r need to satisfy the bound derived in Lemma 4.3. More discussion will be presented in Section 4.3.

Computation Phase   In the computation phase of round t, the server broadcasts w^{t} to the workers. Each worker j then computes the local stochastic gradient g_{j}^{t}=\nabla Q_{j}(w^{t}) using w^{t} and its random data batch \xi_{j}^{t}. Since we assume the parameter server is fault-free, each worker receives the identical w^{t}. The local gradient is stochastic, because each worker uses a random data batch to compute the local gradient g^{t}_{j}.

Communication Phase   In the communication phase, each worker needs to send the information regarding its local gradient to the parameter server. This phase is our main novelty, and differs from prior algorithms [11, 4, 6]. We utilize the property of the broadcast channel to reduce the communication complexity. As mentioned earlier, the communication phase of round t is divided into n slots t_{1},\ldots,t_{n}. Without loss of generality, we assume that each worker j is scheduled to broadcast its information in slot t_{j} (of round t). Note that we assume that the underlying physical or MAC layer is jamming-resistant and reliable; hence, each fault-free worker can reliably broadcast its information to all the other nodes.

Steps for Worker j:   Each worker j stores a set of gradients that it overhears in round t. Denote by R_{j} the set of stored gradients. By assumption, at the beginning of slot t_{j}, R_{j} consists of gradients g_{i}^{t} for i<j. Upon receiving a gradient g_{i}^{t} (in the form of a vector in \mathbb{R}^{d}), worker j stores it in R_{j} if g_{i}^{t} is linearly independent of all existing gradients in R_{j}. In slot t_{j}, worker j computes the “echo gradient” using the vectors stored in R_{j}. Specifically, worker j takes the following steps:

  • It expresses R_{j} as R_{j}=\{g_{i_{1}}^{t},\ldots,g_{i_{|R_{j}|}}^{t}\} and constructs a matrix A_{j}^{t}\in\mathbb{R}^{d\times|R_{j}|} as

    A_{j}^{t}=\begin{bmatrix}g_{i_{1}}^{t}&g_{i_{2}}^{t}&\dotsm&g_{i_{|R_{j}|}}^{t}\end{bmatrix}
  • It then computes the Moore-Penrose inverse (M-P inverse in short) of A_{j}^{t}, defined as

    (A_{j}^{t})^{+}=((A^{t}_{j})^{T}A^{t}_{j})^{-1}(A^{t}_{j})^{T},

    where A^{T} is the transpose of matrix A. The existence of the M-P inverse is guaranteed, intuitively because all columns of A_{j}^{t} are linearly independent by construction. The formal proof is presented in Appendix D.

  • Next, worker j computes a vector x_{j}^{t}\in\mathbb{R}^{|R_{j}|} using the M-P inverse:

    x_{j}^{t}=(A_{j}^{t})^{+}g_{j}^{t},

    where g_{j}^{t} is the local stochastic gradient of Q computed by j in the computation phase. Note that x_{j}^{t} is of size O(n), since R_{j} contains at most n elements.

  • Finally, it computes the “echo gradient” as

    (g_{j}^{t})^{*}=A_{j}^{t}x_{j}^{t}

    Mathematically, (g_{j}^{t})^{*} is the projection of g_{j}^{t} onto the span of the vectors in R_{j}, i.e., the closest vector to g_{j}^{t} in the span of R_{j}.

Next, worker j checks whether the following inequality holds, where (g_{j}^{t})^{*} is the echo gradient, g_{j}^{t} the local stochastic gradient, and r the deviation ratio:

\lVert(g_{j}^{t})^{*}-g_{j}^{t}\rVert\leq r\lVert g_{j}^{t}\rVert   (7)

Worker j performs one of two actions, depending on the result of Inequality (7):

  • If Inequality (7) holds, then j sends the echo message (\lVert g_{j}^{t}\rVert/\lVert(g_{j}^{t})^{*}\rVert,\,x_{j}^{t},\,I_{j}^{t}) to the server, where I_{j}^{t}=\{i_{1},\ldots,i_{|R_{j}|}\} is a sorted list of the worker IDs whose gradients are stored in R_{j}.

  • Otherwise, worker j broadcasts the raw gradient g_{j}^{t} to the server and all the other workers.
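The projection and the echo/raw decision above can be sketched with NumPy as follows; the helper name echo_or_raw and the toy gradients are our own illustrative assumptions.

```python
import numpy as np

def echo_or_raw(g_j, R_j, r):
    """Decide worker j's message: an echo tuple or the raw gradient.

    g_j : local stochastic gradient (shape (d,))
    R_j : list of linearly independent overheard gradients
    r   : deviation ratio from Inequality (7)
    """
    if R_j:
        A = np.column_stack(R_j)         # A_j^t, shape (d, |R_j|)
        x = np.linalg.pinv(A) @ g_j      # x_j^t = (A_j^t)^+ g_j^t
        g_star = A @ x                   # echo gradient: projection of g_j
        if np.linalg.norm(g_star - g_j) <= r * np.linalg.norm(g_j):
            k = np.linalg.norm(g_j) / np.linalg.norm(g_star)
            return ("echo", k, x)        # O(n)-size message
    return ("raw", g_j)                  # fall back to the full d-vector

# If g_j lies in the span of R_j, the echo message reconstructs it exactly:
rng = np.random.default_rng(1)
R = [rng.normal(size=1000), rng.normal(size=1000)]
g = 0.3 * R[0] + 0.7 * R[1]
msg = echo_or_raw(g, R, r=0.1)
k, x = msg[1], msg[2]
reconstructed = k * np.column_stack(R) @ x   # what the server would rebuild
```

Here np.linalg.pinv computes the M-P inverse via SVD rather than the normal-equation formula above; the two coincide when the columns are linearly independent.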

Steps for Parameter Server:   The parameter server uses a vector G to store the gradients from workers. Specifically, in each round t, for each worker j, the server computes \tilde{g}_{j}^{t} and stores it as the j-th element of G. At the beginning of round t, every element G[j] is initialized as an empty placeholder \perp. During the communication phase, the parameter server takes two possible actions upon receiving a message from worker j:

  • If the message is a vector, then the server stores \tilde{g}_{j}^{t}=g_{j}^{t} in G[j].

  • Otherwise, the message is a tuple (k,x,I). The server then does the following:

    • If there exists some i\in I such that G[i]=\perp (i.e., the server has not received a message from worker i), then due to the reliable broadcast property, the server can safely identify j as a Byzantine worker. By convention, we let the server store \tilde{g}_{j}^{t}=\vec{0}, the zero vector in \mathbb{R}^{d}, in G[j].

    • Otherwise, denote by A_{I} the matrix A_{I}=\begin{bmatrix}G[i_{1}],&\ldots,&G[i_{|I|}]\end{bmatrix}, where I=\{i_{1},\ldots,i_{|I|}\}. The server stores \tilde{g}_{j}^{t}=kA_{I}x in G[j].

Aggregation Phase   The final phase is identical to the algorithm in [11], in which the server updates the parameter using the CGC filter. First, the server sorts the stored gradients in G in increasing order of their Euclidean norm and relabels the IDs so that \lVert\tilde{g}_{i_{1}}^{t}\rVert\leq\dotsm\leq\lVert\tilde{g}_{i_{n}}^{t}\rVert. Then the server applies the CGC filter as follows:

\hat{g}_{j}^{t}=\begin{cases}\frac{\lVert\tilde{g}_{i_{n-f}}^{t}\rVert}{\lVert\tilde{g}_{j}^{t}\rVert}\,\tilde{g}_{j}^{t},&j\in\{i_{n-f+1},\ldots,i_{n}\}\\ \tilde{g}_{j}^{t},&j\in\{i_{1},\ldots,i_{n-f}\}\end{cases}   (8)

Finally, the server aggregates the gradients by g^{t}=\sum_{j=1}^{n}\hat{g}_{j}^{t} and updates the parameter by w^{t+1}=w^{t}-\eta g^{t}, where \eta is the fixed step size.
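A minimal sketch of the CGC filter in Equation (8) and the final update, with toy 2-dimensional gradients standing in for the reconstructed gradients (the values and step size are illustrative):

```python
import numpy as np

def cgc_filter(G, f):
    """Clip the f largest gradients (by norm) to the (n-f)-th smallest
    norm, as in Equation (8); the rest pass through unchanged."""
    norms = [np.linalg.norm(g) for g in G]
    order = np.argsort(norms)             # IDs sorted by increasing norm
    clip = norms[order[len(G) - f - 1]]   # ||g_{i_{n-f}}||, the clipping norm
    out = list(G)
    for j in order[len(G) - f:]:          # the f largest are scaled down
        out[j] = (clip / norms[j]) * G[j]
    return out

# Toy round: n = 4 gradients, f = 1; the outlier's norm gets clipped.
G = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([1.0, 1.0]), np.array([100.0, 0.0])]  # last one is suspicious
G_hat = cgc_filter(G, f=1)
g_t = np.sum(G_hat, axis=0)          # aggregate g^t
w_next = np.zeros(2) - 0.1 * g_t     # update with step size eta = 0.1
```

The filter only rescales directions, so a Byzantine gradient can still bias the direction of the aggregate; the convergence analysis in Section 4 accounts for this bounded influence.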

Algorithm 1 Algorithm Echo-CGC
1: Parameters:
2:      \eta>0 is the step size defined in Equation (2)
3:      r>0 is the deviation ratio
4:      n, f, r satisfy the resilience bounds stated in Lemma 4.3
5: Initialization at server:   w^{0}\leftarrow a random vector in \mathbb{R}^{d}
6: for t\leftarrow 0 to \infty do
7:      /* Computation Phase */
8:      At server:   broadcast w^{t} to all workers;   G\leftarrow a \perp-vector of length n
9:      At worker j:
10:          receive w^{t} from the server
11:          g_{j}^{t}\leftarrow\nabla Q_{j}(w^{t});   R_{j}\leftarrow\{\}   \triangleright local stochastic gradient at worker j
12:      /* Communication Phase */
13:      for i\leftarrow 1 to n do
14:          (i) At worker i:
15:          if |R_{i}|=0 then
16:               broadcast g_{i}^{t}
17:          else
18:               A\leftarrow[g]_{g\in R_{i}};   A^{+}\leftarrow(A^{T}A)^{-1}A^{T};   x\leftarrow A^{+}g_{i}^{t}   \triangleright Ax is the echo gradient
19:               if \lVert Ax-g_{i}^{t}\rVert\leq r\lVert g_{i}^{t}\rVert then
20:                    I\leftarrow\{i^{\prime}:g_{i^{\prime}}^{t}\in R_{i}\} in ascending order
21:                    broadcast (\lVert g_{i}^{t}\rVert/\lVert Ax\rVert,x,I)   \triangleright echo message
22:               else
23:                    broadcast g_{i}^{t}   \triangleright raw local gradient
24:               end if
25:          end if
26:          (ii) At worker j>i:
27:          if j receives vector g_{i}^{t} from worker i then
28:               A\leftarrow[g]_{g\in R_{j}};   A^{+}\leftarrow(A^{T}A)^{-1}A^{T}
29:               if R_{j}=\{\} or g_{i}^{t} is linearly independent of R_{j} (i.e., AA^{+}g_{i}^{t}\neq g_{i}^{t}) then
30:                    R_{j}\leftarrow R_{j}\cup\{g_{i}^{t}\}
31:               end if
32:          end if
33:          (iii) At server:
34:          if it receives a vector g_{j}^{t} from worker j then
35:               G[j]\leftarrow g_{j}^{t}   \triangleright j transmitted a raw gradient
36:          else if it receives an echo message (k,x,I) from worker j then
37:               if \exists i\in I such that G[i]=\perp then
38:                    G[j]\leftarrow\vec{0}   \triangleright j is a Byzantine worker
39:               else
40:                    A_{I}\leftarrow[\tilde{g}_{i}^{t}]_{i\in I};   G[j]\leftarrow kA_{I}x   \triangleright j transmitted an echo message
41:               end if
42:          end if
43:      end for
44:      /* Aggregation Phase (applying CGC filter from [11]) */
45:      g^{t}\leftarrow\sum_{g\in G}CGC(g)   \triangleright CGC(\cdot) defined in Equation (8)
46:      w^{t+1}\leftarrow w^{t}-\eta\cdot g^{t}   \triangleright \eta defined in Equation (2)
47: end for

4 Convergence Analysis

In this section, we prove the convergence of our algorithm, Echo-CGC. The proof is more complicated than the one in [11], even though both algorithms use the CGC filter. This is mainly due to two reasons: (i) we use stochastic gradients, whereas [11] uses deterministic gradients; and (ii) echo messages only result in an approximate gradient (i.e., the echo gradient, which may deviate from the local stochastic gradient by a ratio r). Intuitively, in addition to the Byzantine tampering, we need to deal with the non-determinism from stochastic gradients and the noise from echo messages.

4.1 Convergence Rate Analysis

In this part, we first analyze the convergence rate \rho, which is a constant defined later in Equation (13). Recall that h=|\mathcal{H}| and b=|\mathcal{B}|, where, given the execution, \mathcal{H} is the set of fault-free workers and \mathcal{B} is the set of Byzantine workers. Also recall that L and \mu are the constants defined in Assumptions 2 and 3, respectively; \sigma is the variance bound defined in Assumption 5; and r is the deviation ratio used in Echo-CGC. To derive \rho, we need to define a series of constants based on the given parameters n,f,h,b,L,\mu,r, and \sigma.

We first define a constant \beta as

\beta=(n-2f)\frac{\mu-r(1+\sigma)L}{1+r}-b(1+k_{h}\sigma)L, (9)

where k_{x} is defined as

k_{x}=1+\frac{x-1}{\sqrt{2x-1}},\enspace\forall x\geq 1. (10)

We then define a constant \gamma as

\gamma=nL^{2}\left(h(1+\sigma^{2})+b\alpha_{h}\right), (11)

where

\alpha_{x}=x\sigma^{2}+(1+k_{h}\sigma)^{2},\enspace\forall x\geq 1. (12)

Finally, we define the convergence rate \rho using \beta and \gamma as follows:

\rho=1-2\beta\eta+\gamma\eta^{2}. (13)
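To make these definitions concrete, the following sketch (our own illustration, not code from the paper) evaluates \beta, \gamma, and \rho from Equations (9)-(13); the parameter values are assumptions chosen only to satisfy the conditions introduced later in this section.

```python
import math

def k(x):
    # Equation (10): k_x = 1 + (x - 1) / sqrt(2x - 1), for x >= 1
    return 1 + (x - 1) / math.sqrt(2 * x - 1)

def alpha(h, sigma):
    # Equation (12) evaluated at x = h: alpha_h = h * sigma^2 + (1 + k_h * sigma)^2
    return h * sigma ** 2 + (1 + k(h) * sigma) ** 2

def convergence_rate(n, f, h, b, L, mu, r, sigma, eta):
    beta = (n - 2 * f) * (mu - r * (1 + sigma) * L) / (1 + r) \
           - b * (1 + k(h) * sigma) * L                              # Equation (9)
    gamma = n * L ** 2 * (h * (1 + sigma ** 2) + b * alpha(h, sigma))  # Equation (11)
    rho = 1 - 2 * beta * eta + gamma * eta ** 2                      # Equation (13)
    return beta, gamma, rho

# Hypothetical setting: 100 workers, 5 of them Byzantine (worst case h = n - f)
beta, gamma, rho = convergence_rate(n=100, f=5, h=95, b=5,
                                    L=1.0, mu=1.0, r=0.01, sigma=0.05, eta=0.005)
```

With this setting \beta>0 and \rho\in[0,1), so the contraction argument of Section 4.2 applies.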

We will prove that, under some standard assumptions, the convergence rate \rho lies in the interval [0,1). We first present several auxiliary lemmas. Due to the page limit, most proofs are deferred to Appendix A.

Lemma 4.1.

Let L,\mu>0 be the Lipschitz constant and the strong convexity constant defined in Assumptions 2 and 3, respectively. Then we have \mu\leq L.

Lemma 4.2.

Denote k^{*}=\sup_{x}\{k_{x}/\sqrt{x}:x\geq 1\}. Then k^{*}<\infty, and numerically k^{*}\approx 1.12. Equivalently, k_{h}\leq k^{*}\sqrt{h} for all h\geq 1.
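Lemma 4.2 is easy to check numerically. The sketch below (our own check) scans k_{x}/\sqrt{x} over a grid; since k_{x}/\sqrt{x}\to 1/\sqrt{2} as x\to\infty, restricting the grid to [1,5] suffices to locate the supremum, which occurs near x\approx 1.91 (cf. Appendix A.2).

```python
import math

def k(x):
    # Equation (10)
    return 1 + (x - 1) / math.sqrt(2 * x - 1)

# Evaluate k_x / sqrt(x) on a fine grid over [1, 5]
grid = [1 + i / 10000 for i in range(40001)]
values = [k(x) / math.sqrt(x) for x in grid]
k_star = max(values)                  # numerical approximation of k*
x_star = grid[values.index(k_star)]   # location of the maximum
```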

Lemma 4.3.

Assume n\mu-(3+k_{n}\sigma)fL>0. Then there exists r>0 that satisfies the inequality below.

r<\frac{n\mu-(3+k_{n}\sigma)fL}{(n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL}. (14)

Moreover, if r>0 satisfies Inequality (14), then \beta>0.

Lemma 4.3 implies that we need to bound \sigma for convergence. In general, Echo-CGC is correct if \sigma=o(\log{n}). For brevity, we make the following assumption to simplify the proof of convergence and the analysis of communication complexity. We stress that this assumption can be relaxed using essentially the same analysis with more involved mathematical manipulation.

Assumption 6.

Let \sigma be the variance bound defined in Assumption 5. We further assume that \sigma<\frac{1}{\sqrt{n}}.

Under Assumption 6, we can narrow down the bound on r in Lemma 4.3 to loosen our assumption on fault tolerance.

Lemma 4.4.

Assume n\mu-(3+k^{*})fL>0 (where k^{*}\approx 1.12). Then there exists r>0 satisfying Inequality (15) such that \beta>0.

r<\frac{n\mu-(3+k^{*})fL}{(n-2f)(1+\sigma)L+(1+k^{*})fL}. (15)
Theorem 4.5.

Assume n\mu-(3+k^{*})fL>0 and let r be a value that satisfies Inequality (15). Then we can find an \eta>0 such that \eta<2\beta/\gamma, which in turn makes \rho\in[0,1).
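A sketch of the parameter choice suggested by Lemma 4.4 and Theorem 4.5 (our own illustration with hypothetical values; 1.12 is the numerical constant k^{*} from Lemma 4.2, and we instantiate the worst case h=n-f, b=f):

```python
import math

K_STAR = 1.12  # numerical value of k* (Lemma 4.2)

def k(x):
    # Equation (10)
    return 1 + (x - 1) / math.sqrt(2 * x - 1)

def pick_r_and_eta(n, f, L, mu, sigma):
    assert n * mu - (3 + K_STAR) * f * L > 0, "resilience condition violated"
    # Inequality (15): any r strictly below this bound works; take half of it
    r_bound = (n * mu - (3 + K_STAR) * f * L) / \
              ((n - 2 * f) * (1 + sigma) * L + (1 + K_STAR) * f * L)
    r = r_bound / 2
    h, b = n - f, f   # worst case: f Byzantine workers
    beta = (n - 2 * f) * (mu - r * (1 + sigma) * L) / (1 + r) \
           - b * (1 + k(h) * sigma) * L
    gamma = n * L ** 2 * (h * (1 + sigma ** 2)
                          + b * (h * sigma ** 2 + (1 + k(h) * sigma) ** 2))
    eta = beta / gamma   # the minimizer eta* of rho(eta); any eta < 2*beta/gamma works
    rho = 1 - 2 * beta * eta + gamma * eta ** 2
    return r, eta, rho

r, eta, rho = pick_r_and_eta(n=100, f=5, L=1.0, mu=1.0, sigma=0.05)
```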

4.2 Proof of Convergence

Next, we prove the convergence of our algorithm; that is, Echo-CGC converges to the optimal point w^{*} of the cost function Q. We prove convergence under the assumption that n\mu-(3+k^{*})fL>0. Due to the page limit, we present the key proofs here; the rest can be found in Appendix B.

Recall our definition of the conditional expectation \mathbb{E}=\mathbb{E}_{\Xi^{t}}(\cdot\mid w^{t},G_{\mathcal{B}}^{t}) introduced in Section 2.2. Before proving the main theorem, we introduce some preliminary lemmas.

Lemma 4.6.

For all t and for all j\in\mathcal{H},

\mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)\lVert\nabla Q(w^{t})\rVert. (16)
Lemma 4.7.

Recall that \hat{g}_{j}^{t} is the gradient after applying the CGC filter. For all t and for all j\in\{1,2,\ldots,n\},

\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert\leq(1+k_{h}\sigma)\lVert\nabla Q(w^{t})\rVert. (17)

The proof of Lemma 4.7 is based on Lemma 4.6 and the following prior results: Gumbel [10] and Hartley and David [12] proved that, given identical means and variances (\mu,\sigma^{2}), the expectation of the largest of n independent random variables is at most \mu+\frac{\sigma(n-1)}{\sqrt{2n-1}}.

Lemma 4.8.

Following the same setup, for all t and for all j\in\{1,2,\ldots,n\},

\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}. (18)

The proof of Lemma 4.8 is based on Lemma 4.6 and the following result: Papadatos [18] proved that for n i.i.d. random variables X_{1}\leq X_{2}\leq\dotsm\leq X_{n} with finite variance \sigma^{2}, the variance of X_{n} is bounded above by n\sigma^{2}.
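Both order-statistics facts can be sanity-checked by simulation. The sketch below (our own check, using Gaussian samples purely for illustration) verifies that the empirical mean and variance of the maximum of n i.i.d. variables stay below the Gumbel/Hartley-David mean bound and the Papadatos variance bound:

```python
import random
import statistics

random.seed(0)
n, trials = 10, 20000
mu0, sigma0 = 5.0, 1.0   # common mean and standard deviation

# Sample the maximum of n i.i.d. Gaussian variables, many times
maxima = [max(random.gauss(mu0, sigma0) for _ in range(n)) for _ in range(trials)]

mean_bound = mu0 + sigma0 * (n - 1) / (2 * n - 1) ** 0.5  # Gumbel / Hartley-David
var_bound = n * sigma0 ** 2                               # Papadatos

emp_mean = statistics.fmean(maxima)
emp_var = statistics.variance(maxima)
```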

Lemma 4.7 and Lemma 4.8 provide upper bounds on \mathbb{E}\lVert\hat{g}_{j}^{t}\rVert and \mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}. These two bounds allow us to bound the impact of bogus gradients transmitted by a faulty node j. If j transmitted an extreme gradient, it would be dropped by the CGC filter; otherwise, the two bounds imply that the filtered gradient \hat{g}_{j}^{t} retains some useful properties even if j is faulty. For fault-free gradients, Lemma 4.6 provides a tighter bound.

Theorem 4.9.

Assume that n\mu-(3+k^{*})fL>0. We can find r>0 that satisfies Inequality (15) and \eta>0 such that \eta<2\beta/\gamma. Echo-CGC with the chosen r and \eta converges to the optimal parameter w^{*} as t\to\infty.

Proof 4.10.

Our ultimate goal is to show that the sequence \{\mathbb{E}\lVert w^{t}-w^{*}\rVert^{2}\}_{t=0}^{\infty} converges to 0. Recall that the aggregation rule of the algorithm is w^{t+1}=w^{t}-\eta g^{t}. Thus, we obtain that

\mathbb{E}\lVert w^{t+1}-w^{*}\rVert^{2}\leq\mathbb{E}\lVert w^{t}-w^{*}-\eta g^{t}\rVert^{2}=\underbrace{\mathbb{E}\lVert w^{t}-w^{*}\rVert^{2}}_{A}-\underbrace{2\eta\mathbb{E}\left<w^{t}-w^{*},g^{t}\right>}_{B}+\underbrace{\eta^{2}\mathbb{E}\lVert g^{t}\rVert^{2}}_{C}. (19)

Since w^{t} is known, it can be treated as a constant, and \mathbb{E}\lVert w^{t}-w^{*}\rVert^{2}=\lVert w^{t}-w^{*}\rVert^{2}.

Part C: In Appendix B.4, we show that the following inequality holds.

\mathbb{E}\lVert g^{t}\rVert^{2}\leq\gamma\lVert w^{t}-w^{*}\rVert^{2}. (20)

Part B: By linearity of the inner product,

\left<w^{t}-w^{*},g^{t}\right>=\sum_{j=1}^{n}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>=\sum_{j\in\mathcal{H}}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>+\sum_{j\in\mathcal{B}}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>. (21)

First, by the Cauchy-Schwarz inequality, \left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>\geq-\lVert w^{t}-w^{*}\rVert\lVert\hat{g}_{j}^{t}\rVert; by Lemma 4.7 and the L-Lipschitz assumption, \mathbb{E}\lVert\hat{g}_{j}^{t}\rVert\leq(1+k_{h}\sigma)L\lVert w^{t}-w^{*}\rVert. Thus,

\mathbb{E}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>\geq-(1+k_{h}\sigma)L\lVert w^{t}-w^{*}\rVert^{2},\enspace\forall j\in\mathcal{B}. (22)

Next, observe that by our algorithm, for each j\in\mathcal{H}, the received gradient before the CGC filter, \tilde{g}_{j}^{t}, satisfies (i) \lVert\tilde{g}_{j}^{t}\rVert=\lVert g_{j}^{t}\rVert and (ii) \tilde{g}_{j}^{t}=a_{j}(g_{j}^{t}+\Delta g_{j}^{t}), for some constant a_{j}=\lVert g_{j}^{t}\rVert/\lVert g_{j}^{t}+\Delta g_{j}^{t}\rVert and a vector \Delta g_{j}^{t} such that \lVert\Delta g_{j}^{t}\rVert\leq r\lVert g_{j}^{t}\rVert. This implies a_{j}\geq 1/(1+r). Therefore,

\mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>=\mathbb{E}\left<w^{t}-w^{*},a_{j}(g_{j}^{t}+\Delta g_{j}^{t})\right>\geq\frac{1}{1+r}\left(\left<w^{t}-w^{*},\mathbb{E}g_{j}^{t}\right>+\mathbb{E}\left<w^{t}-w^{*},\Delta g_{j}^{t}\right>\right),\enspace\forall j\in\mathcal{H}. (23)

By Assumption 4, \mathbb{E}g_{j}^{t}=\nabla Q(w^{t}); by strong convexity,

\left<w^{t}-w^{*},\nabla Q(w^{t})\right>\geq\mu\lVert w^{t}-w^{*}\rVert^{2}.

By the Cauchy-Schwarz inequality, \mathbb{E}\left<w^{t}-w^{*},\Delta g_{j}^{t}\right>\geq-\lVert w^{t}-w^{*}\rVert\,\mathbb{E}\lVert\Delta g_{j}^{t}\rVert, and \mathbb{E}\lVert\Delta g_{j}^{t}\rVert\leq r\mathbb{E}\lVert g_{j}^{t}\rVert. By Lemma 4.6 and the L-Lipschitz assumption, \mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)L\lVert w^{t}-w^{*}\rVert. Thus,

\mathbb{E}\left<w^{t}-w^{*},\Delta g_{j}^{t}\right>\geq-r(1+\sigma)L\lVert w^{t}-w^{*}\rVert^{2}.

Upon substituting these results into Equation (23), we obtain that

\mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>\geq\frac{\mu-r(1+\sigma)L}{1+r}\lVert w^{t}-w^{*}\rVert^{2},\enspace\forall j\in\mathcal{H}. (24)

We partition \mathcal{H} into two parts: \mathcal{H}_{1}=\mathcal{H}\cap\{i_{1},\ldots,i_{n-f}\} and \mathcal{H}_{2}=\mathcal{H}\setminus\mathcal{H}_{1}. For each j\in\mathcal{H}_{1}, the received gradient is unchanged by the CGC filter, i.e., \hat{g}_{j}^{t}=\tilde{g}_{j}^{t}. Therefore, Equation (24) also holds for \hat{g}_{j}^{t}, for all j\in\mathcal{H}_{1}.

The case of \mathcal{H}_{2} is similar. Note that for each j\in\mathcal{H}_{2}, the gradient \tilde{g}_{j}^{t} is scaled down to \hat{g}_{j}^{t} by the CGC filter. In other words, there exists some constant a_{j}^{\prime}\geq 0 such that \hat{g}_{j}^{t}=a_{j}^{\prime}\tilde{g}_{j}^{t}. Therefore,

\mathbb{E}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>=\mathbb{E}\left<w^{t}-w^{*},a_{j}^{\prime}\tilde{g}_{j}^{t}\right>=a_{j}^{\prime}\mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>,\enspace\forall j\in\mathcal{H}_{2}.

We can verify that if r>0 satisfies Inequality (15), then \mu-r(1+\sigma)L>0; and Equation (24) then implies that \mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>\geq 0. Therefore,

\mathbb{E}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>\geq 0,\enspace\forall j\in\mathcal{H}_{2}. (25)

Note that |\mathcal{H}_{1}|\geq h-2f. Upon substituting Equations (22), (24), and (25) into Equation (21), we obtain that

\mathbb{E}\left<w^{t}-w^{*},g^{t}\right>\geq\left((n-2f)\frac{\mu-r(1+\sigma)L}{1+r}-b(1+k_{h}\sigma)L\right)\lVert w^{t}-w^{*}\rVert^{2}. (26)

By the definition of \beta in Equation (9), this implies \mathbb{E}\left<w^{t}-w^{*},g^{t}\right>\geq\beta\lVert w^{t}-w^{*}\rVert^{2}.

Conclusion: Upon combining parts A, B, and C, by the definition of \rho in Equation (13),

\mathbb{E}\lVert w^{t+1}-w^{*}\rVert^{2}\leq\rho\lVert w^{t}-w^{*}\rVert^{2},\enspace\forall t=0,1,2,\ldots

Recall the definition of the conditional expectation operator \mathbb{E}. This implies that

\mathbb{E}(\lVert w^{t}-w^{*}\rVert^{2}\mid w^{0},\mathcal{G}_{\mathcal{B}}^{0},\ldots,\mathcal{G}_{\mathcal{B}}^{t})\leq\rho^{t}\lVert w^{0}-w^{*}\rVert^{2}.

By Theorem 4.5, \rho\in[0,1). Therefore, as t\to\infty, \lVert w^{t}-w^{*}\rVert^{2} converges to 0; in other words, w^{t} converges to the optimal parameter w^{*}. This proves the theorem.
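The contraction in the proof can be illustrated with a toy simulation. The sketch below is not the full Echo-CGC protocol (it omits broadcasting and echo messages entirely) and uses a made-up quadratic cost and noise model; it only demonstrates that the CGC-style aggregation step w^{t+1}=w^{t}-\eta\sum_{j}CGC(g_{j}^{t}) drives w^{t} toward w^{*} despite f Byzantine workers:

```python
import random

random.seed(1)
d, n, f = 5, 20, 3          # dimension, workers, Byzantine workers
eta, sigma = 0.005, 0.05    # step size and relative noise level
w_star = [1.0] * d          # optimum of Q(w) = 0.5 * ||w - w*||^2

def grad(w):                # true gradient of Q at w
    return [wi - si for wi, si in zip(w, w_star)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

w = [10.0] * d
for t in range(200):
    g_true = grad(w)
    grads = []
    for j in range(n):
        if j < f:           # Byzantine workers push a large bogus vector
            grads.append([100.0] * d)
        else:               # fault-free workers: gradient plus small relative noise
            noise = [random.uniform(-1, 1) for _ in range(d)]
            scale = sigma * norm(g_true) / (norm(noise) or 1)
            grads.append([gi + scale * ni for gi, ni in zip(g_true, noise)])
    # CGC filter: scale every gradient down to the (n-f)-th largest norm
    thresh = sorted((norm(g) for g in grads), reverse=True)[n - f - 1]
    clipped = [[gi * min(1.0, thresh / (norm(g) or 1)) for gi in g] for g in grads]
    agg = [sum(g[i] for g in clipped) for i in range(d)]
    w = [wi - eta * ai for wi, ai in zip(w, agg)]

final_dist = norm([wi - si for wi, si in zip(w, w_star)])
```

Because the Byzantine contribution is clipped to the (n-f)-th largest norm, it shrinks in proportion to the fault-free gradients, and the distance to w^{*} decays geometrically, mirroring the \rho^{t} bound above.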

4.3 Communication Complexity

We analyze the communication complexity of the Echo-CGC algorithm and show that, under suitable conditions, it effectively reduces communication complexity compared to prior algorithms [4, 11]. First consider a ball in \mathbb{R}^{d} whose center is the true gradient \nabla Q(w^{t}):

B\left(\nabla Q(w^{t}),\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\right)=\{u\in\mathbb{R}^{d}:\lVert u-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\}, (27)

where r>0 is the deviation ratio. With a slight abuse of notation, we abbreviate the ball as B. This should not be confused with \mathcal{B}, the set of Byzantine workers. We present only the main results; the proofs can be found in Appendix C.

Lemma 4.11.

For all u,v\in B, \lVert u-v\rVert\leq r\lVert u\rVert (and \lVert u-v\rVert\leq r\lVert v\rVert).

Given Lemma 4.11, we compute the probability that an arbitrary gradient g_{j}^{t} is in the ball B. By Markov's Inequality,

\Pr(g_{j}^{t}\in B)=\Pr\left(\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\right)\geq 1-\frac{\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}}{\frac{r^{2}}{(2+r)^{2}}\lVert\nabla Q(w^{t})\rVert^{2}}. (28)

By Assumption 5, \mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}, so we conclude that \Pr(g_{j}^{t}\in B)\geq p, where p is the lower bound defined as p=1-(1+2/r)^{2}\sigma^{2}.

Denote n_{B}=|\{j:g_{j}^{t}\in B\}|, and let n^{*} be the number of workers that send an echo message in a round. By Lemma 4.11, n^{*}\geq n_{B}-1. Since each event \{g_{j}^{t}\in B\} is independent and has a fixed probability, n_{B} follows a Binomial distribution whose success probability is bounded below by p. Therefore,

\mathbb{E}n^{*}\geq\mathbb{E}n_{B}-1\geq np-1.
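The lower bound p can be sanity-checked empirically. In the sketch below (our own illustration), the noise is Gaussian with \mathbb{E}\lVert\text{noise}\rVert^{2}=\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}, one admissible instance of Assumption 5; we estimate the fraction of stochastic gradients that land inside the ball B of Equation (27):

```python
import random

random.seed(2)
d, r, sigma = 10, 1.0, 0.1
p = 1 - (1 + 2 / r) ** 2 * sigma ** 2   # lower bound on Pr(g in B)
g_true = [1.0] * d                       # stand-in for the true gradient

def norm(v):
    return sum(x * x for x in v) ** 0.5

radius = r / (2 + r) * norm(g_true)      # radius of the ball B

hits, trials = 0, 10000
for _ in range(trials):
    # per-coordinate std sigma*||g||/sqrt(d) gives E||noise||^2 = sigma^2 ||g||^2
    noise = [random.gauss(0, sigma * norm(g_true) / d ** 0.5) for _ in range(d)]
    if norm(noise) <= radius:
        hits += 1

frac = hits / trials
```

As expected, the empirical fraction is at least p; Markov's Inequality is loose here, so the fraction is typically much larger.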

For n\gg 1, we assume that 1/n\approx 0. Also, in practice d\gg n, so the message complexity of each echo message (O(n) bits) is negligible compared to that of a raw gradient (O(d) bits). Hence, the ratio of the bit complexity of our algorithm to that of prior algorithms (e.g., [4, 11]) can be approximately bounded above as follows:

\frac{\text{bit complexity of Echo-CGC}}{\text{bit complexity of prior algorithms}}=\frac{n^{*}O(n)+(n-n^{*})O(d)}{nO(d)}\leq\frac{(np-1)O(n)+[n-(np-1)]O(d)}{nO(d)}\approx 1-p.

We denote by C=1-p=(1+2/r)^{2}\sigma^{2} the resulting upper bound on the ratio of the communication complexity of Echo-CGC to that of prior algorithms.

Analysis  Substituting the upper bound on r from Inequality (14) into C=(1+2/r)^{2}\sigma^{2} and applying Lemma 4.2 (k_{n}\leq k^{*}\sqrt{n}), C can be bounded as

C\leq\sigma^{2}\left(1+2\cdot\frac{(1-2x)(1+\sigma)+(1+\sigma k^{*}\sqrt{n})x}{\mu/L-(3+\sigma k^{*}\sqrt{n})x}\right)^{2}, (29)

where x=f/n is the fault-tolerance factor.

As Equation (29) shows, the ratio C depends on four non-trivial variables: (i) the variance bound \sigma\geq 0; (ii) the resilience x=f/n, satisfying the assumption in Lemma 4.3, i.e.,

\mu/L-(3+\sigma k^{*}\sqrt{n})x>0;

(iii) the constant \mu/L, which is determined by the cost function Q and satisfies 0<\mu/L\leq 1 by Lemma 4.1; and (iv) the number of workers n>0.

[Figure 1: The upper bound C as a function of each parameter, with the other three fixed. (a) C vs. \sigma (\mu/L=1, x=0.1, n=100); (b) C vs. \mu/L (\sigma=0.1, x=0.1, n=100); (c) C vs. x (\sigma=0.1, \mu/L=1, n=100); (d) C vs. n (\sigma=0.1, \mu/L=1, x=0.1).]

We first plot the relation between each factor and C while fixing the other three. We begin with the most significant factor, \sigma, fixing \mu/L=1, x=0.1, and n=100. As Figure 1(a) shows, C grows roughly quadratically in \sigma because of the \sigma^{2} term in Equation (29). Therefore, our algorithm is guaranteed to have lower communication complexity when the variance of gradients is relatively low, especially when \sigma\leq 0.1. In practice, this is the scenario in which the data set consists mainly of similar data instances.

Then, we plot C against \mu/L with fixed \sigma=0.1, x=0.1, and n=100. As Figure 1(b) shows, C decreases as \mu/L approaches 1. For \mu/L>0.75, C<0.5, meaning that [0.75,1] is the range of \mu/L in which our algorithm is guaranteed to perform significantly better.

Next, we plot C against x with fixed \sigma=0.1, \mu/L=1, and n=100. As Figure 1(c) shows, there is a trade-off between C and the fault resilience x. As x approaches the maximum resilience from Lemma 4.3, i.e., x_{\max}=\frac{\mu/L}{3+\sigma k^{*}\sqrt{n}}, the theoretical upper bound C blows up. Moreover, for x<0.15, C<0.4; thus [0,0.15] is a proper range for x.

Finally, we plot C against n with fixed \sigma=0.1, \mu/L=1, and x=0.1. As Figure 1(d) shows, C increases almost linearly in n with a relatively flat slope. In other words, n is not a significant factor of C, and the performance of our algorithm is stable over a wide range of n.

In conclusion, our algorithm is guaranteed to require lower communication complexity when (i) \sigma is low, i.e., data instances are similar, and (ii) \mu/L is close to 1. Also, there is a trade-off between resilience and efficiency. As a concrete example, when \sigma=0.1, x=0.1, \mu/L=1, and n=100, C\approx 0.25, meaning that our algorithm is guaranteed to save at least 75\% of the communication cost.
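The concluding numbers follow directly from Equation (29). The helper below (our own sketch; `C_bound` is a hypothetical name) reproduces the example and also illustrates the blow-up of the bound as x approaches x_{\max}, consistent with Figure 1(c):

```python
def C_bound(sigma, x, mu_over_L, n, k_star=1.12):
    # Equation (29): upper bound on the communication-cost ratio C
    num = (1 - 2 * x) * (1 + sigma) + (1 + sigma * k_star * n ** 0.5) * x
    den = mu_over_L - (3 + sigma * k_star * n ** 0.5) * x
    return sigma ** 2 * (1 + 2 * num / den) ** 2

c_example = C_bound(sigma=0.1, x=0.1, mu_over_L=1.0, n=100)   # roughly 0.22
c_near_max = C_bound(sigma=0.1, x=0.2, mu_over_L=1.0, n=100)  # degrades near x_max
```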

5 Summary

In this paper, we present a Byzantine-tolerant DML algorithm that incurs lower communication complexity in a single-hop radio network (under suitable conditions). Our algorithm is inspired by the CGC filter [11], but we devise new proofs to handle the randomness and noise introduced by our mechanism.

There are two interesting open problems: (i) extending the algorithm to multi-hop radio networks; and (ii) different mechanisms for constructing echo messages, e.g., using angles rather than distance ratios.

References

  • [1] Dan Alistarh, Seth Gilbert, Rachid Guerraoui, Zarko Milosevic, and Calvin Newport. Securing every bit: Authenticated broadcast in radio networks. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, page 50–59, New York, NY, USA, 2010. Association for Computing Machinery. URL: https://doi.org/10.1145/1810479.1810489, doi:10.1145/1810479.1810489.
  • [2] Baruch Awerbuch, Andrea Richa, and Christian Scheideler. A jamming-resistant mac protocol for single-hop wireless networks. In Proceedings of the Twenty-Seventh ACM Symposium on Principles of Distributed Computing, PODC ’08, page 45–54, New York, NY, USA, 2008. Association for Computing Machinery. URL: https://doi.org/10.1145/1400751.1400759, doi:10.1145/1400751.1400759.
  • [3] Vartika Bhandari and Nitin H. Vaidya. On reliable broadcast in a radio network. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of Distributed Computing, PODC ’05, page 138–147, New York, NY, USA, 2005. Association for Computing Machinery. URL: https://doi.org/10.1145/1073814.1073841, doi:10.1145/1073814.1073841.
  • [4] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. NIPS’17, page 118–128, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, USA, 2004.
  • [6] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst., 1(2), December 2017. URL: https://doi.org/10.1145/3154503, doi:10.1145/3154503.
  • [7] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Rhicheek Patra, and Mahsa Taziki. Asynchronous Byzantine machine learning (the case of SGD). volume 80 of Proceedings of Machine Learning Research, pages 1145–1154, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL: http://proceedings.mlr.press/v80/damaskinos18a.html.
  • [8] El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Lê Nguyên Hoang, and Sébastien Rouault. Genuinely distributed byzantine machine learning. In Proceedings of the 39th Symposium on Principles of Distributed Computing, PODC ’20, page 355–364, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3382734.3405695, doi:10.1145/3382734.3405695.
  • [9] J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, pages 101 – 148, 01 2010.
  • [10] E. J. Gumbel. The maxima of the mean largest value and of the range. The Annals of Mathematical Statistics, 25(1):76–84, 1954. URL: http://www.jstor.org/stable/2236513.
  • [11] Nirupam Gupta and Nitin H. Vaidya. Fault-tolerance in distributed optimization: The case of redundancy. In Proceedings of the 39th Symposium on Principles of Distributed Computing, PODC ’20, page 365–374, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3382734.3405748, doi:10.1145/3382734.3405748.
  • [12] H. O. Hartley and H. A. David. Universal bounds for mean range and extreme observation. The Annals of Mathematical Statistics, 25(1):85–99, 1954. URL: http://www.jstor.org/stable/2236514.
  • [13] Seyyedali Hosseinalipour, Christopher G. Brinton, Vaneet Aggarwal, Huaiyu Dai, and Mung Chiang. From federated learning to fog learning: Towards large-scale distributed machine learning in heterogeneous wireless networks, 2020. arXiv:2006.03594.
  • [14] Chiu-Yuen Koo. Broadcast in radio networks tolerating byzantine adversarial behavior. In Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing, PODC ’04, page 275–282, New York, NY, USA, 2004. Association for Computing Machinery. URL: https://doi.org/10.1145/1011767.1011807, doi:10.1145/1011767.1011807.
  • [15] Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 19–27, Cambridge, MA, USA, 2014. MIT Press.
  • [16] Ruben Mayer and Hans-Arno Jacobsen. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Comput. Surv., 53(1), February 2020. URL: https://doi.org/10.1145/3363554, doi:10.1145/3363554.
  • [17] V. Navda, A. Bohra, S. Ganguly, and D. Rubenstein. Using channel hopping to increase 802.11 resilience to jamming attacks. In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pages 2526–2530, 2007.
  • [18] Nickos Papadatos. Maximum variance of order statistics. Annals of the Institute of Statistical Mathematics, 47:185–193, 02 1995. doi:10.1007/BF00773423.
  • [19] Michael D. Perlman. Jensen’s inequality for a convex vector-valued function on an infinite-dimensional space. Journal of Multivariate Analysis, 4(1):52 – 65, 1974. URL: http://www.sciencedirect.com/science/article/pii/0047259X74900050, doi:https://doi.org/10.1016/0047-259X(74)90005-0.
  • [20] Lili Su and Nitin H. Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In George Giakkoupis, editor, Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, PODC 2016, Chicago, IL, USA, July 25-28, 2016, pages 425–434. ACM, 2016. URL: https://doi.org/10.1145/2933057.2933105, doi:10.1145/2933057.2933105.
  • [21] Lili Su and Nitin H. Vaidya. Non-bayesian learning in the presence of byzantine agents. In Distributed Computing - 30th International Symposium, DISC 2016, Paris, France, September 27-29, 2016. Proceedings, pages 414–427, 2016. URL: https://doi.org/10.1007/978-3-662-53426-7_30, doi:10.1007/978-3-662-53426-7\_30.
  • [22] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao. Application of machine learning in wireless networks: Key techniques and open issues. IEEE Communications Surveys Tutorials, 21(4):3072–3108, 2019.
  • [23] Zeyi Tao and Qun Li. esgd: Communication efficient distributed deep learning on the edge. In USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), Boston, MA, July 2018. USENIX Association. URL: https://www.usenix.org/conference/hotedge18/presentation/tao.
  • [24] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. A survey on distributed machine learning. ACM Comput. Surv., 53(2), March 2020. URL: https://doi.org/10.1145/3377454, doi:10.1145/3377454.
  • [25] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. volume 97 of Proceedings of Machine Learning Research, pages 6893–6901, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL: http://proceedings.mlr.press/v97/xie19b.html.

Appendix A Proof of Lemmas in Section 4.1

A.1 Proof of Lemma 4.1

Proof A.1.

The Cauchy-Schwarz inequality and the L-Lipschitz smoothness of the cost function imply that

\left<w-w^{*},\nabla Q(w)\right>\leq\lVert w-w^{*}\rVert\lVert\nabla Q(w)\rVert\leq L\lVert w-w^{*}\rVert^{2},\enspace\forall w\in\mathbb{R}^{d}. (30)

Also, by strong convexity,

\left<w-w^{*},\nabla Q(w)\right>\geq\mu\lVert w-w^{*}\rVert^{2},\enspace\forall w\in\mathbb{R}^{d}. (31)

Equations (30) and (31) together imply that \mu\leq L.

A.2 Proof of Lemma 4.2

Proof A.2.

First, we check the derivative of k_{x}/\sqrt{x}:

\frac{d}{dx}\frac{k_{x}}{\sqrt{x}}=-\frac{(2x-1)^{3/2}-3x+1}{2(2x^{2}-x)^{3/2}}.

We can then verify numerically that k_{x}/\sqrt{x} reaches its maximum at x\approx 1.91 and that k^{*}=\sup_{x\geq 1}(k_{x}/\sqrt{x})\approx 1.12.

A.3 Proof of Lemma 4.3

Proof A.3.

By definition, k_{x}=1+\frac{x-1}{\sqrt{2x-1}} is increasing for x\geq 1. Recall that h\leq n and b\leq f, so k_{h}\leq k_{n}. By the definition of \beta, this further implies that

\beta\geq(n-2f)\frac{\mu-r(1+\sigma)L}{1+r}-fL(1+k_{n}\sigma).

Therefore, \beta>0 if

(n-2f)\left(\mu-r(1+\sigma)L\right)>(1+r)(1+k_{n}\sigma)fL
\iff (n-2f)\mu-(1+k_{n}\sigma)fL>\left((n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL\right)r.

Recall that by Lemma 4.1, \mu\leq L, so the above inequality holds if

n\mu-(3+k_{n}\sigma)fL>\left((n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL\right)r. (32)

Since we assume n-2f>0, the right-hand side is always positive. Therefore, if n\mu-(3+k_{n}\sigma)fL>0, then there exists r>0 that satisfies Inequality (32); such r also yields \beta>0. This proves the lemma.

A.4 Proof of Lemma 4.4

Proof A.4.

By Lemma 4.2, k_{n}\leq k^{*}\sqrt{n}. Given the assumption that \sigma<1/\sqrt{n},

\frac{n\mu-(3+k_{n}\sigma)fL}{(n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL}>\frac{n\mu-(3+k^{*})fL}{(n-2f)(1+\sigma)L+(1+k^{*})fL}.

Therefore, if r>0 satisfies Inequality (15), then it also satisfies Inequality (14); and by Lemma 4.3, such r yields \beta>0.

A.5 Proof of Theorem 4.5

Proof A.5.

First observe that \rho can be represented as a quadratic function of \eta, i.e., \rho(\eta)=\gamma\eta^{2}-2\beta\eta+1. We prove the theorem by bounding this function.

By the definition of \alpha_{x} in Equation (12), \alpha_{h}>0 for all h\geq 1. Thus, by the definition of \gamma in Equation (11), \gamma>0. Also, by Lemma 4.4, for r>0 that satisfies Inequality (15), \beta>0.

Recall that a quadratic function q(x)=ax^{2}-bx+1 with a,b>0 reaches its minimum at x^{*}=b/2a. Moreover, q(0)=q(2x^{*}), and for all x\in(0,2x^{*}), q(x)\in[q(x^{*}),q(0)). In our case, a=\gamma and b=2\beta. Therefore, the minimum occurs at \eta^{*}=\beta/\gamma, and \eta^{*}>0. Since \rho(0)=1, for all \eta\in(0,2\eta^{*}), \rho(\eta)\in[\rho(\eta^{*}),1).

The remaining task is to compute the minimum value \rho(\eta^{*}). Upon substituting \eta^{*}=\beta/\gamma into \rho(\eta)=\gamma\eta^{2}-2\beta\eta+1, we obtain \rho(\eta^{*})=1-\beta^{2}/\gamma.

Recalling the definitions of \beta and \gamma in Equations (9) and (11), we first see that \beta\leq n\mu. Also, since \alpha_{h}\geq 1, \gamma\geq nL^{2}(h+b)=n^{2}L^{2}. Finally, by Lemma 4.1, \mu\leq L. These imply that \beta^{2}\leq n^{2}\mu^{2}\leq n^{2}L^{2}\leq\gamma, and thus \rho(\eta^{*})=1-\beta^{2}/\gamma>0.

In conclusion, for all \eta>0 such that \eta<2\eta^{*}, \rho\in[0,1). This proves the theorem.
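The quadratic bookkeeping in this proof can be verified numerically; the sketch below uses arbitrary positive test values for \beta and \gamma (any pair with \beta^{2}\leq\gamma behaves the same) and checks that \rho(\eta) is minimized at \eta^{*}=\beta/\gamma with minimum 1-\beta^{2}/\gamma, and that \rho(0)=\rho(2\eta^{*})=1:

```python
# Arbitrary positive test values with beta^2 <= gamma
beta, gamma = 32.4, 1.06e4

def rho(eta):
    # rho(eta) = gamma * eta^2 - 2 * beta * eta + 1, as in the proof
    return gamma * eta ** 2 - 2 * beta * eta + 1

eta_star = beta / gamma   # minimizer of the quadratic
min_rho = rho(eta_star)   # equals 1 - beta^2 / gamma
```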

Appendix B Proof of Lemmas in Section 4.2

B.1 Proof of Lemma 4.6

Proof B.1.

Consider any j\in\mathcal{H}. First, observe that

\mathbb{E}\lVert g_{j}^{t}\rVert=\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})+\nabla Q(w^{t})\rVert\leq\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert+\mathbb{E}\lVert\nabla Q(w^{t})\rVert. (33)

By Jensen's inequality, for any random variable X, (\mathbb{E}X)^{2}\leq\mathbb{E}(X^{2}). Upon substituting X=\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert, we obtain that

\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert\leq\sqrt{\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}}.

By Assumption 5, this further implies that

\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert\leq\sqrt{\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}}=\sigma\lVert\nabla Q(w^{t})\rVert.

Upon substituting this into Equation (33), we obtain that

\mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)\lVert\nabla Q(w^{t})\rVert.

This completes the proof.

B.2 Proof of Lemma 4.7

Proof B.2.

Gumbel [10] and Hartley and David [12] proved that, given identical means and variances (\mu,\sigma^{2}), the expectation of the largest of n independent random variables is at most \mu+\frac{\sigma(n-1)}{\sqrt{2n-1}}. In our model, \{g_{j}^{t}:j\in\mathcal{H}\} can be viewed as a set of independent and identically distributed random vectors with expectation \nabla Q(w^{t}). Recall that \mathcal{H} is the set of fault-free workers. Therefore, \{\lVert g_{j}^{t}\rVert:j\in\mathcal{H}\} is also a set of i.i.d. random variables; and by our algorithm, the norms of the received gradients satisfy \lVert\tilde{g}_{j}^{t}\rVert=\lVert g_{j}^{t}\rVert.

Denote the mean and variance of each \lVert g_{j}^{t}\rVert by (\epsilon,\delta^{2}), and let M=\max_{j\in\mathcal{H}}\{\lVert g_{j}^{t}\rVert\}. Thus, [10, 12] imply that \mathbb{E}M\leq\epsilon+\frac{\delta(h-1)}{\sqrt{2h-1}}. We bound \epsilon and \delta next.

Recall that Lemma 4.6 gives an upper bound on \mathbb{E}\lVert g_{j}^{t}\rVert, i.e.,

\epsilon=\mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)\lVert\nabla Q(w^{t})\rVert.

Therefore, we only need to compute an upper bound on \delta^{2}. Consider a random vector X\in\mathbb{R}^{d} and denote \mu=\mathbb{E}X. By linearity of expectation, for any constant vector v\in\mathbb{R}^{d}, \mathbb{E}\left<v,X\right>=\left<v,\mathbb{E}X\right>. Therefore,

\mathbb{E}\lVert X-\mu\rVert^{2}=\mathbb{E}\left<X-\mu,X-\mu\right>=\mathbb{E}\left<X,X\right>-2\mathbb{E}\left<\mu,X\right>+\mathbb{E}\left<\mu,\mu\right>=\mathbb{E}\lVert X\rVert^{2}-\lVert\mu\rVert^{2}. (34)

Perlman [19] proved the extended Jensen’s Inequality to random vectors such that ϕ(𝔼X)𝔼ϕ(X)\phi(\mathbb{E}X)\leq\mathbb{E}\phi(X) for any convex function ϕ\phi. Since \lVert\cdot\rVert is convex, this implies μ𝔼X\lVert\mu\rVert\leq\mathbb{E}\lVert X\rVert. Thus,

\begin{align*}
\mathrm{Var}\lVert X\rVert &=\mathbb{E}\lVert X\rVert^{2}-(\mathbb{E}\lVert X\rVert)^{2}\\
&\leq\mathbb{E}\lVert X\rVert^{2}-\lVert\mu\rVert^{2}\\
&=\mathbb{E}\lVert X-\mu\rVert^{2}.
\end{align*}
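As a quick numerical illustration (not part of the proof), one can treat a finite set of vectors as the entire distribution, so that expectations become exact sample means; both the identity $\mathbb{E}\lVert X-\mu\rVert^{2}=\mathbb{E}\lVert X\rVert^{2}-\lVert\mu\rVert^{2}$ and the bound $\mathrm{Var}\lVert X\rVert\leq\mathbb{E}\lVert X-\mu\rVert^{2}$ can then be checked directly. The sample size and dimension below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
# A finite set of vectors treated as the whole distribution:
# expectations are then exact sample means (up to float error).
X = rng.normal(size=(1000, 3))
mu = X.mean(axis=0)                                # E[X]

lhs = ((X - mu) ** 2).sum(axis=1).mean()           # E||X - mu||^2
rhs = (X ** 2).sum(axis=1).mean() - mu @ mu        # E||X||^2 - ||mu||^2
assert np.isclose(lhs, rhs)                        # identity (34)

norms = np.linalg.norm(X, axis=1)
assert norms.var() <= lhs                          # Var||X|| <= E||X - mu||^2
```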

Substituting $X=g_{j}^{t}$, we obtain that

\begin{align}
\delta^{2}=\mathrm{Var}\lVert g_{j}^{t}\rVert &\leq\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}\nonumber\\
&\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}. &&\text{(Assumption 5)}\tag{35}
\end{align}

The upper bounds on $\epsilon$ and $\delta^{2}$ together yield the following bound on $\mathbb{E}M$:

\[\mathbb{E}M\leq\epsilon+\frac{h-1}{\sqrt{2h-1}}\delta\leq\left((1+\sigma)+\frac{h-1}{\sqrt{2h-1}}\sigma\right)\lVert\nabla Q(w^{t})\rVert.\]

By the definition of $k_{x}$ in Equation (10), this simplifies to

\[\mathbb{E}M\leq(1+k_{h}\sigma)\lVert\nabla Q(w^{t})\rVert.\]

Finally, recall the definition of the CGC filter in Equation (8). Since $|\mathcal{H}|\geq n-f$, there are at most $f$ Byzantine gradients, so at least one of the $f+1$ received gradients with the largest norms is fault-free. This implies that the threshold of the CGC filter satisfies $\lVert\tilde{g}_{i_{n-f}}^{t}\rVert\leq M$. Since the CGC filter guarantees that $\lVert\hat{g}_{j}^{t}\rVert\leq\lVert\tilde{g}_{i_{n-f}}^{t}\rVert$ for all $j$, together we have $\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert\leq\mathbb{E}\lVert\tilde{g}_{i_{n-f}}^{t}\rVert\leq\mathbb{E}M$, which proves the lemma.
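The threshold argument can be illustrated concretely. In the sketch below (with hypothetical worker counts and norm values of our own choosing), even adversarially large Byzantine norms cannot push the $(n-f)$-th smallest norm above the largest fault-free norm $M$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, f = 10, 3                       # illustrative worker and Byzantine counts
h = n - f

# Fault-free norms plus adversarially large Byzantine norms.
honest = rng.uniform(0.0, 1.0, size=h)
byzantine = rng.uniform(5.0, 10.0, size=f)
norms = np.sort(np.concatenate([honest, byzantine]))

threshold = norms[n - f - 1]       # (n-f)-th smallest norm (0-indexed)
M = honest.max()                   # largest fault-free norm
assert threshold <= M
```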

B.3 Proof of Lemma 4.8

Proof B.3.

Papadatos [18] proved that for the order statistics $X_{1}\leq X_{2}\leq\dotsb\leq X_{n}$ of $n$ i.i.d. random variables with finite variance $\sigma^{2}$, the variance of the maximum $X_{n}$ is bounded above by $n\sigma^{2}$. In the same setup as Section B.2, with $M=\max\{\lVert g_{j}^{t}\rVert:j\in\mathcal{H}\}$, Papadatos' result together with Equation (35) in Section B.2 implies that

\[\mathrm{Var}\,M\leq h\,\mathrm{Var}\lVert g_{j}^{t}\rVert\leq h\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}.\]
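Papadatos' bound on the variance of the maximum can also be checked by simulation; the Gaussian distribution and the parameter values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 5, 2.0

# Empirical variance of the maximum of n i.i.d. variables
# with variance sigma^2, over many batches.
samples = rng.normal(0.0, sigma, size=(100_000, n))
var_max = samples.max(axis=1).var()

# Papadatos-style bound: Var(max) <= n * sigma^2.
assert var_max <= n * sigma ** 2
```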

Recall that for a random variable $Y$, $\mathrm{Var}\,Y=\mathbb{E}Y^{2}-(\mathbb{E}Y)^{2}$. Therefore,

\[\mathbb{E}M^{2}=\mathrm{Var}\,M+(\mathbb{E}M)^{2}\leq\left(h\sigma^{2}+(1+k_{h}\sigma)^{2}\right)\lVert\nabla Q(w^{t})\rVert^{2}.\]

Recall the definition of $\alpha_{x}$ in Equation (12). This implies

\[\mathbb{E}M^{2}\leq\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}.\]

Finally, in Section B.2 we proved $\lVert\hat{g}_{j}^{t}\rVert\leq M$ for all $j$, so $\lVert\hat{g}_{j}^{t}\rVert^{2}\leq M^{2}$ and $\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\mathbb{E}M^{2}$ for all $j$, which proves the lemma.

B.4 Part C in Proof of Theorem 4.9

Proof B.4.

Part C: By the definition of $g^{t}$ and convexity of $\lVert\cdot\rVert^{2}$,

\begin{align*}
\lVert g^{t}\rVert^{2}=\left\lVert\sum_{j=1}^{n}\hat{g}_{j}^{t}\right\rVert^{2} &\leq n\sum_{j=1}^{n}\lVert\hat{g}_{j}^{t}\rVert^{2} &&\text{($\lVert\cdot\rVert^{2}$ is convex)}\\
&=n\sum_{j\in\mathcal{H}}\lVert\hat{g}_{j}^{t}\rVert^{2}+n\sum_{j\in\mathcal{B}}\lVert\hat{g}_{j}^{t}\rVert^{2}.
\end{align*}
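The convexity step, $\lVert\sum_{j}v_{j}\rVert^{2}\leq n\sum_{j}\lVert v_{j}\rVert^{2}$, holds for any collection of vectors; a quick numerical check with arbitrary dimensions of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 4
g = rng.normal(size=(n, d))                        # n gradient-like vectors in R^d

lhs = np.linalg.norm(g.sum(axis=0)) ** 2           # ||sum_j g_j||^2
rhs = n * (np.linalg.norm(g, axis=1) ** 2).sum()   # n * sum_j ||g_j||^2
assert lhs <= rhs
```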

By the definition of $\hat{g}_{j}^{t}$, $\lVert\hat{g}_{j}^{t}\rVert\leq\lVert g_{j}^{t}\rVert$ for all $j\in\mathcal{H}$. Also, by Equation (34) in Section B.2, for each fault-free worker $j\in\mathcal{H}$,

\begin{align}
\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\mathbb{E}\lVert g_{j}^{t}\rVert^{2} &\leq\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}+\lVert\nabla Q(w^{t})\rVert^{2}\nonumber\\
&\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}+\lVert\nabla Q(w^{t})\rVert^{2}.\tag{36}
\end{align}

By Lemma 4.8, for each Byzantine worker $j\in\mathcal{B}$,

\[\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}.\tag{37}\]

Recall the $L$-Lipschitz assumption (Assumption 2), which gives $\lVert\nabla Q(w^{t})\rVert\leq L\lVert w^{t}-w^{*}\rVert$. Hence, combining Equations (36) and (37), we obtain that

\begin{align*}
\mathbb{E}\lVert g^{t}\rVert^{2} &\leq nh(1+\sigma^{2})\lVert\nabla Q(w^{t})\rVert^{2}+nb\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}\\
&\leq nL^{2}\left(h(1+\sigma^{2})+b\alpha_{h}\right)\lVert w^{t}-w^{*}\rVert^{2}.
\end{align*}

Recall the definition of $\gamma$ in Equation (11). This proves Part C of Theorem 4.9.

Appendix C Proof in Section 4.3

C.1 Proof of Lemma 4.11

Proof C.1.

First, by the triangle inequality, for all $u,v\in B$,

\[\lVert u-v\rVert\leq\lVert u-\nabla Q(w^{t})\rVert+\lVert v-\nabla Q(w^{t})\rVert.\tag{38}\]

Again by the triangle inequality and the radius of the ball $B$, for each $u\in B$,

\[\lVert\nabla Q(w^{t})\rVert-\lVert u\rVert\leq\lVert u-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert.\]

Equivalently,

\[\left(1-\frac{r}{2+r}\right)\lVert\nabla Q(w^{t})\rVert\leq\lVert u\rVert\iff\lVert\nabla Q(w^{t})\rVert\leq\frac{2+r}{2}\lVert u\rVert.\]

Therefore,

\[\lVert u-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\leq\frac{r}{2}\lVert u\rVert,\enspace\forall u\in B.\]

Finally, substituting this into Equation (38) proves the lemma.
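The resulting inequality, $\lVert u-v\rVert\leq\frac{r}{2}\lVert u\rVert+\frac{r}{2}\lVert v\rVert$ for $u,v\in B$, can be spot-checked numerically. The sketch below samples points inside a ball of the lemma's radius around a stand-in gradient vector; the dimension and the value of $r$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 5, 0.8
grad = rng.normal(size=d)                       # stand-in for the true gradient
radius = r / (2 + r) * np.linalg.norm(grad)     # radius of the ball B

def sample_in_ball():
    """Random point within distance `radius` of grad."""
    x = rng.normal(size=d)
    return grad + x * (rng.uniform(0, radius) / np.linalg.norm(x))

# Check ||u - v|| <= (r/2)(||u|| + ||v||) for sampled u, v in B.
ok = True
for _ in range(1000):
    u, v = sample_in_ball(), sample_in_ball()
    bound = (r / 2) * (np.linalg.norm(u) + np.linalg.norm(v))
    ok = ok and np.linalg.norm(u - v) <= bound + 1e-9
assert ok
```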

Appendix D Proof of existence of M-P inverse

Proof D.1.

First of all, recall that for a matrix $A\in\mathbb{R}^{m\times n}$ with $m>n$, if $A$ has full column rank, then $A^{T}A$ is invertible.

Consider arbitrary $j,t$. First, we prove by induction that the gradients $g_{i}^{t}$ stored in $R_{j}$ are always linearly independent. The base case holds when $|R_{j}|=1$, since a single nonzero gradient is linearly independent. Now assume that when $|R_{j}|=k$, the gradients in $R_{j}$ are linearly independent. Denote by $A$ the matrix whose columns are the gradients in $R_{j}$, i.e., $A=[g]_{g\in R_{j}}$. Then $A^{T}A$ is invertible and the M-P inverse of $A$ exists, i.e., $A^{+}$ in line 29 of Algorithm 1 is well-defined.

Now suppose that $j$ receives a gradient $g$ and stores it in $R_{j}$. Then $g$ must pass the condition in line 29, i.e., $AA^{+}g\neq g$. Suppose for contradiction that $g$ is linearly dependent on $R_{j}$; then there exists $x\in\mathbb{R}^{k}$ such that $g=Ax$. Note that $A^{+}A=I$, the identity matrix, since $A$ has full column rank. This implies that $AA^{+}g=AA^{+}(Ax)=A(A^{+}A)x=Ax=g$, which is a contradiction. Hence, $g$ is linearly independent of $R_{j}$.

In the induction proof, we already showed that $A^{+}$ always exists because $R_{j}$ always consists of linearly independent gradients. This proves that the M-P inverse always exists.
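The line-29 test can be illustrated with NumPy's Moore-Penrose inverse. The matrix shapes and coefficient values below are arbitrary illustrations of the dependence check, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
# Columns of A play the role of the stored, linearly independent
# gradients in R_j (a random tall matrix has full column rank a.s.).
A = rng.normal(size=(d, 3))
A_pinv = np.linalg.pinv(A)     # Moore-Penrose inverse A^+

# A^+ A = I when A has full column rank.
assert np.allclose(A_pinv @ A, np.eye(3))

# A gradient in the column span of A satisfies A A^+ g = g,
# so it fails the condition in line 29 and is not stored ...
g_dep = A @ np.array([1.0, -2.0, 0.5])
assert np.allclose(A @ (A_pinv @ g_dep), g_dep)

# ... while a generic new gradient satisfies A A^+ g != g and is stored.
g_new = rng.normal(size=d)
assert not np.allclose(A @ (A_pinv @ g_new), g_new)
```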