
Qinzi Zhang, Boston College, USA (zhangbcu@bc.edu)
Lewis Tseng, Boston College, USA (lewis.tseng@bc.edu, https://orcid.org/0000-0002-4717-4038)

CCS Concepts: Computing methodologies – Distributed computing methodologies; Computing methodologies – Machine learning

Acknowledgements.
The authors would like to acknowledge Nitin H. Vaidya and the anonymous reviewers for their helpful comments.

OPODIS 2020

Echo-CGC: A Communication-Efficient Byzantine-tolerant Distributed Machine Learning Algorithm in Single-Hop Radio Network

Qinzi Zhang    Lewis Tseng
Abstract

In the past few years, many Byzantine-tolerant distributed machine learning (DML) algorithms have been proposed in the point-to-point communication model. In this paper, we focus on a popular DML framework – the parameter server computation paradigm and iterative learning algorithms that proceed in rounds, e.g., [11, 8, 6]. One limitation of prior algorithms in this domain is their high communication complexity. All the Byzantine-tolerant DML algorithms that we are aware of need to send n d-dimensional vectors from worker nodes to the parameter server in each round, where n is the number of workers and d is the number of dimensions of the feature space (which may be in the order of millions). In a wireless network, power consumption is proportional to the number of bits transmitted. Consequently, it is extremely difficult, if not impossible, to deploy these algorithms in power-limited wireless devices. Motivated by this observation, we aim to reduce the communication complexity of Byzantine-tolerant DML algorithms in the single-hop radio network [1, 3, 14].

Inspired by the CGC filter developed by Gupta and Vaidya, PODC 2020 [11], we propose a gradient descent-based algorithm, Echo-CGC. Our main novelty is a mechanism that utilizes the broadcast properties of the radio network to avoid transmitting the raw gradients (full d-dimensional vectors). In the radio network, each worker is able to overhear previous gradients that were transmitted to the parameter server. Roughly speaking, in Echo-CGC, if a worker “agrees” with a combination of prior gradients, it broadcasts an “echo message” instead of its raw local gradient. The echo message contains a vector of coefficients (of size at most n) and the ratio of the magnitude between two gradients (a float). In comparison, traditional approaches need to send n local gradients in each round, where each gradient is typically a vector in an ultra-high dimensional space (d \gg n). The improvement in communication complexity of our algorithm depends on multiple factors, including the number of nodes, the number of faulty workers in an execution, and the cost function. We numerically analyze the improvement, and show that with a large number of nodes, Echo-CGC saves 80% of the communication under standard assumptions.

keywords:
Distributed Machine Learning, Single-hop Radio Network, Byzantine Fault, Communication Complexity, Wireless Communication, Parameter Server

1 Introduction

Machine learning has been widely adopted and explored recently [24, 16]. Due to the exponential growth of datasets and the computation power they require, distributed machine learning (DML) has become a necessity. There is also an emerging trend [22, 13] to apply DML in power-limited wireless networked systems, e.g., sensor networks, distributed robots, smart homes, and the Industrial Internet-of-Things (IIoT). In these applications, the devices are usually small and fragile, and susceptible to malicious attacks and/or malfunction. More importantly, it is necessary to reduce communication complexity so that (over-)communication does not drain the device battery. Most prior research on fault-tolerant DML (e.g., [8, 4, 11, 6]) has focused on use cases in clusters or datacenters. These algorithms achieve high resilience (number of faults tolerated), but also incur high communication complexity. As a result, most prior Byzantine-tolerant DML algorithms are extremely difficult, if not impossible, to deploy in power-limited wireless networks.

Motivated by these observations, we aim to design a Byzantine-tolerant DML algorithm with reduced communication complexity. We consider wireless systems that are modeled as a single-hop radio network, and focus on the popular parameter server computation paradigm (e.g., [11, 8, 6]). We propose Echo-CGC, and prove its correctness under typical assumptions [4, 8]. For the communication complexity, we formally analyze the expected number of bits that need to be sent from workers to the parameter server. The extension to multi-hop radio networks is left as interesting future work.

Recent Development in Distributed Machine Learning   Distributed Machine Learning (DML) is designed to handle a large amount of computation over big data. In the parameter server model, there is a centralized parameter server that distributes the computation tasks to n workers. These workers have access to the same dataset (which may be stored externally). Similar to [4, 11, 6], we focus on synchronous gradient descent DML algorithms, where the server and workers proceed in synchronous rounds. In each round, each worker computes a local gradient over the parameter received from the server; the server then aggregates the gradients collected from the workers and updates the parameter. Under suitable assumptions, prior algorithms [4, 11, 6] converge to the optimal point in the d-dimensional space \mathbb{R}^{d} even if up to f workers may become Byzantine faulty.

To our knowledge, most Byzantine-tolerant DML or distributed optimization algorithms focus on clusters and datacenters, which are modeled as a point-to-point network. For example, Reference [6], Krum [4], Kardam [7], and ByzSGD [8] focus on stochastic gradient descent algorithms under several different settings (synchronous, asynchronous, and distributed parameter server). References [21, 11, 20] focus on gradient descent algorithms for the general distributed optimization framework. Zeno [25] uses failure detection to improve resilience. None of these works aims to reduce communication complexity.

Another closely related research direction is on reducing the communication complexity of non-Byzantine-tolerant DML algorithms, e.g., [15, 13, 23]. These algorithms are not Byzantine fault-tolerant, and adopt a completely different design. For example, reference [15] utilizes relaxed consistency (of the underlying shared data), reference [23] discards coordinates (of the local gradients) aggressively, and reference [13] uses intermediate aggregation. It is not clear how to integrate these techniques with Byzantine fault-tolerance, as these approaches reduce the redundancy, making it difficult to mask the impact from Byzantine workers.

Single-Hop Radio Network   We consider the problem in a single-hop radio network, which is a proper theoretical model for wireless networks. Following [1, 3, 14], we assume that single-hop wireless communication is reliable and authenticated, and there is no jamming nor spoofing. Moreover, nodes follow a specific TDMA schedule so that there is no collision. In Section 2.1, we briefly argue why such an assumption is realistic for modeling wireless communication. In the single-hop radio network model, we aim to minimize the total number of bits to be transmitted in each round. If we directly adapt prior gradient descent-based algorithms [4, 11] to the radio network model, then each worker needs to broadcast a vector of size d, where d is the number of dimensions of the feature space. In practical applications (e.g., [9, 13]), d might be in the order of millions, and the gradients may require a few GBs. Since power consumption is proportional to the communication complexity in wireless channels, prior Byzantine DML algorithms are not adequate for power-limited wireless networks.

Main Contributions   Inspired by the CGC filter developed by Gupta and Vaidya, PODC 2020 [11], we propose a gradient descent-based algorithm, Echo-CGC, for the parameter server model in the single-hop radio network. Our main observation is that since workers can overhear gradients transmitted earlier, they can use this information to avoid sending the raw gradients in some cases. Particularly, if a worker “agrees” with some reference gradient(s) transmitted earlier in the same round, then it sends a small message to “echo” the reference gradient(s). The size of the echo message (O(n) bits) is negligible compared to the raw gradient (O(d) bits), since in typical ML applications, d \gg n.

Our proof is more sophisticated than the one in [11], even though Echo-CGC is inspired by the CGC filter. The reason is that the “echo message” does not necessarily contain worker i’s local gradient; instead, it is used to construct an approximate gradient, which intuitively equals a combination of i’s local gradient and the gradients broadcast by previous workers. We need to ensure that such an approximation does not affect the aggregation at the server. Moreover, the CGC filter [11] works on deterministic gradients – each worker computes the gradient of its local cost function using the full dataset. In our case, each worker computes a stochastic gradient, i.e., a gradient over a small random data batch. We prove that under appropriate assumptions, Echo-CGC converges to the optimal point.

Echo-CGC is correct under the same set of assumptions as prior work [4]; however, there is an inherent trade-off between resilience, the proven bound on the communication complexity reduction, and the cost function. For a fixed cost function, we derive necessary conditions on n so that Echo-CGC is guaranteed to perform better. We also perform numerical analysis to understand the trade-off. In general, Echo-CGC saves more communication as f/n becomes smaller. Moreover, our algorithm performs better when the variance of the data is relatively small. For example, our algorithm tolerates 10% faulty workers and saves over 75% of the communication cost when the standard deviation of the computed gradients is less than 10% of the true gradient.

2 Preliminaries

In this section, we formally define our models, and introduce the assumptions and notations.

2.1 Models

Single-Hop Radio Network   We consider the standard radio network model in the literature, e.g., [1, 3, 14]. In particular, the underlying communication layer ensures the reliable local broadcast property [3]. In other words, the channel is perfectly reliable, and a local broadcast is correctly received by all neighbors. As noted in [1, 3], this assumption does not typically hold in currently deployed wireless networks, but it is possible to realize such a property with high probability in practice with help from the MAC layer [2] or the physical layer [17].

In our system, nodes can be uniquely identified, i.e., each node has a unique identifier. We assume that a faulty node may not spoof another node’s identity. The communication network is assumed to be single-hop; that is, each pair of nodes is within communication range of each other. Moreover, time is divided into slots, and each node proceeds synchronously. Message collision is not possible, because nodes follow a pre-determined TDMA schedule that determines the transmitting node in each slot, and the transmission protocol is jam-resistant. Each slot is assumed to be large enough that a node can transmit a gradient. We also assume that each communication round (or communication step) is divided into n slots, and the TDMA schedule assigns each node to a unique slot. For ease of discussion, node i is scheduled to transmit in slot i.

Stochastic Gradient Descent and Parameter Server   In this work, we focus on Byzantine-tolerant distributed Stochastic Gradient Descent (SGD) algorithms, which are popular in the optimization and machine learning literature [4, 8, 11, 5]. Given a cost function Q, the (sequential) SGD algorithm outputs an optimal parameter w^{*} such that

w^{*}=\operatorname*{argmin}_{w\in\mathbb{R}^{d}}Q(w)   (1)

An SGD algorithm executes in an iterative fashion: in each round t, the algorithm computes the gradient of the cost function Q at parameter w^{t} and updates the parameter with the gradient.

Synchronous Parameter Server Model   Computation of gradients is typically expensive and slow. One popular framework to speed up the computation is the parameter server model, in which the parameter server distributes the computation tasks to n workers and aggregates their computed gradients to update the parameter in each round. Following the convention, we use node and worker interchangeably.

We assume a synchronous system, i.e., the computation and communication delays are bounded, and the server and workers know the bound. Consequently, if the server does not receive a message from worker i by the end of some round, then the server identifies worker i as faulty.

Formally speaking, a distributed SGD algorithm in the parameter server model proceeds in synchronous rounds, and executes the following three steps in each round t:

  1. The parameter server broadcasts parameter w^{t} to the workers.

  2. Each worker j randomly chooses a random data batch \xi_{j}^{t} from the dataset (shared by all the workers) and computes an estimate, g_{j}^{t}, of the gradient \nabla Q(w^{t}) of the cost function Q using \xi_{j}^{t} and w^{t}.

  3. The server aggregates the estimated gradients from all workers and updates the parameter using the gradient descent approach with step size \eta:

     w^{t+1}=w^{t}-\eta\sum_{j=1}^{n}g_{j}^{t}   (2)
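The three steps above can be sketched in a few lines of Python/NumPy. The toy quadratic cost, batch size, and step size below are our own illustrative assumptions, not part of the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 5, 3, 0.02            # workers, dimensions, step size (illustrative)
data = rng.normal(size=(100, d))  # toy dataset shared by all workers

def stochastic_gradient(w, batch):
    # Toy cost Q(w) = E ||w - x||^2 / 2, so a batch's gradient
    # estimate is w - mean(batch).
    return w - batch.mean(axis=0)

w = rng.normal(size=d)            # server's initial parameter w^0
for t in range(300):
    # Step 1: server broadcasts w^t.
    # Step 2: each worker j samples a random batch xi_j^t and
    #         computes its gradient estimate g_j^t.
    grads = [stochastic_gradient(w, data[rng.choice(100, size=10)])
             for _ in range(n)]
    # Step 3: server aggregates and applies the update in Equation (2).
    w = w - eta * np.sum(grads, axis=0)

# w ends up near the dataset mean, the minimizer of the toy cost.
```

Note that the step size must account for the sum over n workers in Equation (2); a too-large eta makes the iteration diverge even in this fault-free toy setting.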

Fault Model and Byzantine SGD   Following [11, 4, 6], our system consists of n workers, up to f of which might be Byzantine faulty. We assume that the central parameter server is always fault-free.

Byzantine workers may be controlled by an omniscient adversary that has knowledge of the current parameter (at the server) and the local gradients of all the other workers, and they may behave arbitrarily, including sending arbitrary messages. However, due to the reliable local broadcast property of the radio network model, they cannot send inconsistent messages to the server and other workers. They also cannot spoof another node’s identity. Our goal is therefore to design a distributed SGD algorithm that solves Equation (1) in the presence of up to f Byzantine workers.

Workers that are not Byzantine faulty are called fault-free workers. These workers follow the algorithm specification faithfully. For a given execution of the algorithm, we denote by \mathcal{H} the set of fault-free workers and by \mathcal{B} the set of Byzantine workers. For brevity, we denote h=|\mathcal{H}| and b=|\mathcal{B}|; hence, we have b\leq f and h\geq n-f.

Communication Complexity   We are interested in minimizing the total number of bits that need to be transmitted from workers to the parameter server in each round. Prior algorithms [11, 4] transmit n gradients in a d-dimensional space in each round, since each node needs to transmit its local gradient to the centralized server. Typically, each gradient consists of d floats or doubles (i.e., a single primitive floating point data structure for each dimension).

2.2 Assumptions and Notations

We assume that the cost function Q satisfies some standard properties used in the literature [4, 8, 6], including convexity, differentiability, Lipschitz smoothness, and strong convexity. Following the convention, we use \left<a,b\right> to represent the dot product of two vectors a and b in the d-dimensional space \mathbb{R}^{d}.

Assumption 1 (Convexity and smoothness).

Q is convex and differentiable.

Assumption 2 (L-Lipschitz smoothness).

There exists L>0 such that for all w,w^{\prime}\in\mathbb{R}^{d},

\lVert\nabla Q(w)-\nabla Q(w^{\prime})\rVert\leq L\lVert w-w^{\prime}\rVert   (3)
Assumption 3 (\mu-strong convexity).

There exists \mu>0 such that for all w,w^{\prime}\in\mathbb{R}^{d},

\left<\nabla Q(w)-\nabla Q(w^{\prime}),w-w^{\prime}\right>\geq\mu\lVert w-w^{\prime}\rVert^{2}   (4)
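As a concrete example (ours, not the paper's), a quadratic cost shows how L and \mu arise as extreme eigenvalues:

```latex
% For Q(w) = \tfrac{1}{2} w^{T} A w with A symmetric positive definite,
% we have \nabla Q(w) = A w, hence for all w, w' \in \mathbb{R}^{d}:
\lVert \nabla Q(w) - \nabla Q(w') \rVert
    = \lVert A (w - w') \rVert \le \lambda_{\max}(A)\, \lVert w - w' \rVert,
\qquad
\left< \nabla Q(w) - \nabla Q(w'),\, w - w' \right>
    \ge \lambda_{\min}(A)\, \lVert w - w' \rVert^{2}.
% Thus Assumptions 2 and 3 hold with L = \lambda_{\max}(A) and
% \mu = \lambda_{\min}(A), consistent with \mu \le L (Lemma 4.1 below).
```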

We also assume that the random data batches are independently and identically distributed samples from the dataset. Before stating the assumptions, we formally introduce the notion of randomness in the framework. As in typical stochastic gradient descent algorithms, the only randomness is due to the random data batches \xi_{j}^{t} sampled by each fault-free worker j\in\mathcal{H} in each round t, which makes g_{j}^{t} as well as w^{t+1} non-deterministic. In the case when a worker uses the entire dataset to train the model, g_{j}^{t}=\nabla Q(w^{t}); hence, the result is deterministic, i.e., each fault-free worker derives the same gradient. In practice, a data batch is a small sample of the entire dataset.¹

¹ Reference [11] works on a different formulation in which each worker may have a different local cost function.

Formally speaking, we denote by \mathbb{E}_{\Xi^{t}}(\cdot\mid w^{t},\mathcal{G}_{\mathcal{B}}^{t}) the conditional expectation operator over the set of random batches \Xi^{t}=\{\xi_{j}^{t} : j=1,2,\ldots,n\} in round t, given (i) the parameter w^{t}, and (ii) the set of Byzantine gradients \mathcal{G}_{\mathcal{B}}^{t}=\{g_{j}^{t}:j\in\mathcal{B}\}. This conditional expectation operator allows us to treat w^{t}, Q(w^{t}), and \nabla Q(w^{t}), as well as the Byzantine gradients, as constants. This is reasonable because (i) we have knowledge of Q and w^{t} given an execution, and (ii) the Byzantine gradients are arbitrary and do not depend on the data batches. From now on, without further specification, we abbreviate the operator \mathbb{E}_{\Xi^{t}}(\cdot\mid w^{t},\mathcal{G}_{\mathcal{B}}^{t}) as \mathbb{E}.

Below we present two further assumptions on the local stochastic gradient g^{t}_{j} at each fault-free worker j. Similar to [4, 8], we rely on the following two assumptions for the correctness proof.

Assumption 4 (IID Random Batches).

For all j\in\mathcal{H} and t\in\mathbb{N},

\mathbb{E}(g_{j}^{t})=\nabla Q(w^{t})   (5)
Assumption 5 (Bounded Variance).

For all j\in\mathcal{H} and t\in\mathbb{N},

\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}   (6)

Notation   We list the most important notations and constants used in our algorithm and analysis in the following table.

\mathcal{H}   set of fault-free workers; h=|\mathcal{H}|
\mathcal{B}   set of faulty workers; b=|\mathcal{B}|
t   round number, t=0,1,2,\ldots
w^{*}   optimal solution to Q, i.e., w^{*}=\operatorname*{argmin}_{w\in\mathbb{R}^{d}}Q(w)
w^{t}   parameter in round t
g_{j}^{t}   estimated gradient of j in round t
\tilde{g}_{j}^{t}   “reconstructed” gradient of j by the server in round t
\hat{g}_{j}^{t}   gradient of j in round t after applying the CGC filter
\eta   fixed step size as in Equation (2)
L   Lipschitz constant
\mu   strong convexity constant
r   deviation ratio, a key parameter in our algorithm
k^{*}   constant defined in Lemma 4.2, k^{*}\approx 1.12
Table 1: Notations and constants used in this paper.

3 Our Algorithm: Echo-CGC

Our algorithm is inspired by Gupta and Vaidya [11]. Specifically, we integrate their CGC filter with a novel aggregation phase. Our aggregation mechanism utilizes the broadcast property of the radio network to improve the communication complexity. In the CGC algorithm [11], each worker needs to send a d-dimensional gradient to the server, whereas in our algorithm, some workers only need to send the “echo message”, which is of size O(n) bits. Note that in typical machine learning applications, d \gg n.

We design our algorithm for the synchronous parameter server model, so the algorithm is presented in an iterative fashion. That is, each worker and the parameter server proceed in synchronous rounds, and the algorithm specifies the exact steps for each round t. Our algorithm, Echo-CGC, is presented in Algorithm 1. The algorithm uses the notations and constants summarized in Table 1.

Algorithm Description

Initially, the parameter server randomly generates an initial parameter w^{0}\in\mathbb{R}^{d}. Each round t\geq 0 consists of three phases: (i) computation phase, (ii) communication phase, and (iii) aggregation phase. Echo-CGC takes the following inputs: step size \eta, deviation ratio r, number of workers n, and maximum number of tolerable faults f. The exact requirements on the values of these inputs will become clear later. For example, n, f, r need to satisfy the bound derived in Lemma 4.3. More discussion will be presented in Section 4.3.

Computation Phase   In the computation phase of round t, the server broadcasts w^{t} to the workers. Each worker j then computes the local stochastic gradient g_{j}^{t}=\nabla Q_{j}(w^{t}) using w^{t} and its random data batch \xi_{j}^{t}. Since we assume the parameter server is fault-free, each worker receives the identical w^{t}. The local gradient is stochastic, because each worker uses a random data batch to compute the local gradient g^{t}_{j}.

Communication Phase   In the communication phase, each worker needs to send the information regarding its local gradient to the parameter server. This phase is our main novelty, and differs from prior algorithms [11, 4, 6]. We utilize the property of the broadcast channel to reduce the communication complexity. As mentioned earlier, the communication phase of round t is divided into n slots t_{1},\ldots,t_{n}. Without loss of generality, we assume that each worker j is scheduled to broadcast its information in slot t_{j} (of round t). Note that we assume that the underlying physical or MAC layer is jamming-resistant and reliable; hence, each fault-free worker can reliably broadcast its information to all the other nodes.

Steps for Worker j:   Each worker j stores a set of gradients that it overhears in round t. Denote by R_{j} the set of stored gradients. By assumption, at the beginning of slot t_{j}, R_{j} consists of gradients g_{i}^{t} for i<j. Upon receiving a gradient g_{i}^{t} (in the form of a vector in \mathbb{R}^{d}), worker j stores it in R_{j} if g_{i}^{t} is linearly independent of all existing gradients in R_{j}. In slot t_{j}, worker j computes the “echo gradient” using the vectors stored in R_{j}. Specifically, worker j takes the following steps:

  • It expresses R_{j} as R_{j}=\{g_{i_{1}}^{t},\ldots,g_{i_{|R_{j}|}}^{t}\} and constructs a matrix A_{j}^{t}\in\mathbb{R}^{d\times|R_{j}|} as

    A_{j}^{t}=\begin{bmatrix}g_{i_{1}}^{t}&g_{i_{2}}^{t}&\dotsm&g_{i_{|R_{j}|}}^{t}\end{bmatrix}
  • It then computes the Moore-Penrose inverse (M-P inverse in short) of A_{j}^{t}, defined as

    (A_{j}^{t})^{+}=((A^{t}_{j})^{T}A^{t}_{j})^{-1}(A^{t}_{j})^{T},

    where A^{T} is the transpose of matrix A. The existence of the M-P inverse is guaranteed, intuitively because all columns of A_{j}^{t} are linearly independent by construction. The formal proof is presented in Appendix D.

  • Next, worker j computes a vector x_{j}^{t}\in\mathbb{R}^{|R_{j}|} using the M-P inverse:

    x_{j}^{t}=(A_{j}^{t})^{+}g_{j}^{t},

    where g_{j}^{t} is the local stochastic gradient of Q computed by j in the computation phase. Note that x_{j}^{t} is of size O(n), since R_{j} contains at most n elements.

  • Finally, it computes the “echo gradient” as

    (g_{j}^{t})^{*}=A_{j}^{t}x_{j}^{t}

    Mathematically, (g_{j}^{t})^{*} is the projection of g_{j}^{t} onto the span of the vectors in R_{j}, i.e., the closest vector to g_{j}^{t} in the span of R_{j}.

Next, worker j checks whether the following inequality holds, where (g_{j}^{t})^{*} is the echo gradient, g_{j}^{t} the local stochastic gradient, and r the deviation ratio:

\lVert(g_{j}^{t})^{*}-g_{j}^{t}\rVert\leq r\lVert g_{j}^{t}\rVert   (7)

Worker j performs one of two actions, depending on the result of Inequality (7):

  • If Inequality (7) holds, then j sends the echo message (\lVert g_{j}^{t}\rVert/\lVert(g_{j}^{t})^{*}\rVert,\,x_{j}^{t},\,I_{j}^{t}) to the server, where I_{j}^{t}=\{i_{1},\ldots,i_{|R_{j}|}\} is a sorted list of the worker IDs whose gradients are stored in R_{j}.

  • Otherwise, worker j broadcasts the raw gradient g_{j}^{t} to the server and all the other workers.
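The projection and the echo/raw decision above can be sketched with NumPy as follows; the helper name echo_or_raw and the toy gradients are our own illustrative assumptions.

```python
import numpy as np

def echo_or_raw(g_j, R_j, r):
    """Decide worker j's message: an echo tuple or the raw gradient.

    g_j : local stochastic gradient (shape (d,))
    R_j : list of linearly independent overheard gradients
    r   : deviation ratio from Inequality (7)
    """
    if R_j:
        A = np.column_stack(R_j)         # A_j^t, shape (d, |R_j|)
        x = np.linalg.pinv(A) @ g_j      # x_j^t = (A_j^t)^+ g_j^t
        g_star = A @ x                   # echo gradient: projection of g_j
        if np.linalg.norm(g_star - g_j) <= r * np.linalg.norm(g_j):
            k = np.linalg.norm(g_j) / np.linalg.norm(g_star)
            return ("echo", k, x)        # O(n)-size message
    return ("raw", g_j)                  # fall back to the full d-vector

# If g_j lies in the span of R_j, the echo message reconstructs it exactly:
rng = np.random.default_rng(1)
R = [rng.normal(size=1000), rng.normal(size=1000)]
g = 0.3 * R[0] + 0.7 * R[1]
msg = echo_or_raw(g, R, r=0.1)
k, x = msg[1], msg[2]
reconstructed = k * np.column_stack(R) @ x   # what the server would rebuild
```

Here np.linalg.pinv computes the M-P inverse via SVD rather than the normal-equation formula above; the two coincide when the columns are linearly independent.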

Steps for Parameter Server:   The parameter server uses a vector G to store the gradients from workers. Specifically, in each round t, for each worker j, the server computes \tilde{g}_{j}^{t} and stores it as the j-th element of G. At the beginning of round t, every element G[j] is initialized as an empty placeholder \perp. During the communication phase, the parameter server takes two possible actions upon receiving a message from worker j:

  • If the message is a vector, then the server stores \tilde{g}_{j}^{t}=g_{j}^{t} in G[j].

  • Otherwise, the message is a tuple (k,x,I). The server then does the following:

    • If there exists some i\in I such that G[i]=\perp (i.e., the server has not received a message from worker i), then due to the reliable broadcast property, the server can safely identify j as a Byzantine worker. By convention, we let the server store \tilde{g}_{j}^{t}=\vec{0}, the zero vector in \mathbb{R}^{d}, in G[j].

    • Otherwise, denote by A_{I} the matrix A_{I}=\begin{bmatrix}G[i_{1}],&\ldots,&G[i_{|I|}]\end{bmatrix}, where I=\{i_{1},\ldots,i_{|I|}\}. The server stores \tilde{g}_{j}^{t}=kA_{I}x in G[j].

Aggregation Phase   The final phase is identical to the algorithm in [11], in which the server updates the parameter using the CGC filter. First, the server sorts the stored gradients in G in increasing order of their Euclidean norm and relabels the IDs so that \lVert\tilde{g}_{i_{1}}^{t}\rVert\leq\dotsm\leq\lVert\tilde{g}_{i_{n}}^{t}\rVert. Then the server applies the CGC filter as follows:

\hat{g}_{j}^{t}=\begin{cases}\frac{\lVert\tilde{g}_{i_{n-f}}^{t}\rVert}{\lVert\tilde{g}_{j}^{t}\rVert}\,\tilde{g}_{j}^{t},&j\in\{i_{n-f+1},\ldots,i_{n}\}\\ \tilde{g}_{j}^{t},&j\in\{i_{1},\ldots,i_{n-f}\}\end{cases}   (8)

Finally, the server aggregates the gradients by g^{t}=\sum_{j=1}^{n}\hat{g}_{j}^{t} and updates the parameter by w^{t+1}=w^{t}-\eta g^{t}, where \eta is the fixed step size.
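A minimal sketch of the CGC filter in Equation (8) and the final update, with toy 2-dimensional gradients standing in for the reconstructed gradients (the values and step size are illustrative):

```python
import numpy as np

def cgc_filter(G, f):
    """Clip the f largest gradients (by norm) to the (n-f)-th smallest
    norm, as in Equation (8); the rest pass through unchanged."""
    norms = [np.linalg.norm(g) for g in G]
    order = np.argsort(norms)             # IDs sorted by increasing norm
    clip = norms[order[len(G) - f - 1]]   # ||g_{i_{n-f}}||, the clipping norm
    out = list(G)
    for j in order[len(G) - f:]:          # the f largest are scaled down
        out[j] = (clip / norms[j]) * G[j]
    return out

# Toy round: n = 4 gradients, f = 1; the outlier's norm gets clipped.
G = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([1.0, 1.0]), np.array([100.0, 0.0])]  # last one is suspicious
G_hat = cgc_filter(G, f=1)
g_t = np.sum(G_hat, axis=0)          # aggregate g^t
w_next = np.zeros(2) - 0.1 * g_t     # update with step size eta = 0.1
```

The filter only rescales directions, so a Byzantine gradient can still bias the direction of the aggregate; the convergence analysis in Section 4 accounts for this bounded influence.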

Algorithm 1 Algorithm Echo-CGC
1: Parameters:
2:      \eta>0 is the step size defined in Equation (2)
3:      r>0 is the deviation ratio
4:      n, f, r satisfy the resilience bounds stated in Lemma 4.3
5: Initialization at server:   w^{0}\leftarrow a random vector in \mathbb{R}^{d}
6: for t\leftarrow 0 to \infty do
7:      /* Computation Phase */
8:      At server:   broadcast w^{t} to all workers;   G\leftarrow a \perp-vector of length n
9:      At worker j:
10:          receive w^{t} from the server
11:          g_{j}^{t}\leftarrow\nabla Q_{j}(w^{t});   R_{j}\leftarrow\{\}   \triangleright local stochastic gradient at worker j
12:      /* Communication Phase */
13:      for i\leftarrow 1 to n do
14:          (i) At worker i:
15:          if |R_{i}|=0 then
16:               broadcast g_{i}^{t}
17:          else
18:               A\leftarrow[g]_{g\in R_{i}};   A^{+}\leftarrow(A^{T}A)^{-1}A^{T};   x\leftarrow A^{+}g_{i}^{t}   \triangleright Ax is the echo gradient
19:               if \lVert Ax-g_{i}^{t}\rVert\leq r\lVert g_{i}^{t}\rVert then
20:                    I\leftarrow\{i^{\prime}:g_{i^{\prime}}^{t}\in R_{i}\} in ascending order
21:                    broadcast (\lVert g_{i}^{t}\rVert/\lVert Ax\rVert,x,I)   \triangleright echo message
22:               else
23:                    broadcast g_{i}^{t}   \triangleright raw local gradient
24:               end if
25:          end if
26:          (ii) At worker j>i:
27:          if j receives vector g_{i}^{t} from worker i then
28:               A\leftarrow[g]_{g\in R_{j}};   A^{+}\leftarrow(A^{T}A)^{-1}A^{T}
29:               if R_{j}=\{\} or g_{i}^{t} is linearly independent of R_{j} (i.e., AA^{+}g_{i}^{t}\neq g_{i}^{t}) then
30:                    R_{j}\leftarrow R_{j}\cup\{g_{i}^{t}\}
31:               end if
32:          end if
33:          (iii) At server:
34:          if it receives a vector g_{j}^{t} from worker j then
35:               G[j]\leftarrow g_{j}^{t}   \triangleright j transmitted a raw gradient
36:          else if it receives an echo message (k,x,I) from worker j then
37:               if \exists i\in I such that G[i]=\perp then
38:                    G[j]\leftarrow\vec{0}   \triangleright j is a Byzantine worker
39:               else
40:                    A_{I}\leftarrow[\tilde{g}_{i}^{t}]_{i\in I};   G[j]\leftarrow kA_{I}x   \triangleright j transmitted an echo message
41:               end if
42:          end if
43:      end for
44:      /* Aggregation Phase (applying CGC filter from [11]) */
45:      g^{t}\leftarrow\sum_{g\in G}CGC(g)   \triangleright CGC(\cdot) defined in Equation (8)
46:      w^{t+1}\leftarrow w^{t}-\eta\cdot g^{t}   \triangleright \eta defined in Equation (2)
47: end for

4 Convergence Analysis

In this section, we prove the convergence of our algorithm, Echo-CGC. The proof is more complicated than the one in [11], even though both algorithms use the CGC filter. This is mainly due to two reasons: (i) we use stochastic gradients, whereas [11] uses deterministic gradients; and (ii) echo messages only result in an approximate gradient (i.e., the echo gradient, which may deviate from the local stochastic gradient by a ratio r). Intuitively, in addition to the Byzantine tampering, we need to deal with the non-determinism from stochastic gradients and the noise from echo messages.

4.1 Convergence Rate Analysis

In this part, we first analyze the convergence rate \rho, which is a constant defined later in Equation (13). Recall that h=|\mathcal{H}| and b=|\mathcal{B}|, where, given the execution, \mathcal{H} is the set of fault-free workers and \mathcal{B} is the set of Byzantine workers. Also recall that L and \mu are the constants defined in Assumptions 2 and 3, respectively; \sigma is the variance bound defined in Assumption 5; and r is the deviation ratio used in Echo-CGC. To derive \rho, we need to define a series of constants based on the given parameters n,f,h,b,L,\mu,r, and \sigma.

We first define a constant \beta as

\beta=(n-2f)\frac{\mu-r(1+\sigma)L}{1+r}-b(1+k_{h}\sigma)L, (9)

where k_{x} is defined as

k_{x}=1+\frac{x-1}{\sqrt{2x-1}},\enspace\forall x\geq 1. (10)

We then define a constant \gamma as

\gamma=nL^{2}\left(h(1+\sigma^{2})+b\alpha_{h}\right), (11)

where

\alpha_{x}=x\sigma^{2}+(1+k_{h}\sigma)^{2},\enspace\forall x\geq 1. (12)

Finally, we define the convergence rate \rho using \beta and \gamma as follows:

\rho=1-2\beta\eta+\gamma\eta^{2}. (13)
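To make these definitions concrete, the following sketch (our own illustration, not code from the paper) evaluates \beta, \gamma, and \rho from Equations (9)-(13); the parameter values are assumptions chosen only to satisfy the conditions introduced later in this section.

```python
import math

def k(x):
    # Equation (10): k_x = 1 + (x - 1) / sqrt(2x - 1), for x >= 1
    return 1 + (x - 1) / math.sqrt(2 * x - 1)

def alpha(h, sigma):
    # Equation (12) evaluated at x = h: alpha_h = h * sigma^2 + (1 + k_h * sigma)^2
    return h * sigma ** 2 + (1 + k(h) * sigma) ** 2

def convergence_rate(n, f, h, b, L, mu, r, sigma, eta):
    beta = (n - 2 * f) * (mu - r * (1 + sigma) * L) / (1 + r) \
           - b * (1 + k(h) * sigma) * L                              # Equation (9)
    gamma = n * L ** 2 * (h * (1 + sigma ** 2) + b * alpha(h, sigma))  # Equation (11)
    rho = 1 - 2 * beta * eta + gamma * eta ** 2                      # Equation (13)
    return beta, gamma, rho

# Hypothetical setting: 100 workers, 5 of them Byzantine (worst case h = n - f)
beta, gamma, rho = convergence_rate(n=100, f=5, h=95, b=5,
                                    L=1.0, mu=1.0, r=0.01, sigma=0.05, eta=0.005)
```

With this setting \beta>0 and \rho\in[0,1), so the contraction argument of Section 4.2 applies.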

We will prove that, under some standard assumptions, the convergence rate \rho lies in the interval [0,1). We first present several auxiliary lemmas. Due to the page limit, most proofs are deferred to Appendix A.

Lemma 4.1.

Let L,\mu>0 be the Lipschitz constant and the strong convexity constant defined in Assumptions 2 and 3, respectively. Then we have \mu\leq L.

Lemma 4.2.

Denote k^{*}=\sup_{x}\{k_{x}/\sqrt{x}:x\geq 1\}. Then k^{*}<\infty, and numerically k^{*}\approx 1.12. Equivalently, k_{h}\leq k^{*}\sqrt{h} for all h\geq 1.
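Lemma 4.2 is easy to check numerically. The sketch below (our own check) scans k_{x}/\sqrt{x} over a grid; since k_{x}/\sqrt{x}\to 1/\sqrt{2} as x\to\infty, restricting the grid to [1,5] suffices to locate the supremum, which occurs near x\approx 1.91 (cf. Appendix A.2).

```python
import math

def k(x):
    # Equation (10)
    return 1 + (x - 1) / math.sqrt(2 * x - 1)

# Evaluate k_x / sqrt(x) on a fine grid over [1, 5]
grid = [1 + i / 10000 for i in range(40001)]
values = [k(x) / math.sqrt(x) for x in grid]
k_star = max(values)                  # numerical approximation of k*
x_star = grid[values.index(k_star)]   # location of the maximum
```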

Lemma 4.3.

Assume n\mu-(3+k_{n}\sigma)fL>0. Then there exists r>0 that satisfies the inequality below.

r<\frac{n\mu-(3+k_{n}\sigma)fL}{(n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL}. (14)

Moreover, if r>0 satisfies Inequality (14), then \beta>0.

Lemma 4.3 implies that we need to bound \sigma for convergence. In general, Echo-CGC is correct if \sigma=o(\log{n}). For brevity, we make the following assumption to simplify the proof of convergence and the analysis of communication complexity. We stress that this assumption can be relaxed using essentially the same analysis with more involved mathematical manipulation.

Assumption 6.

Let \sigma be the variance bound defined in Assumption 5. We further assume that \sigma<\frac{1}{\sqrt{n}}.

Under Assumption 6, we can narrow down the bound on r in Lemma 4.3 to loosen our assumption on fault tolerance.

Lemma 4.4.

Assume n\mu-(3+k^{*})fL>0 (where k^{*}\approx 1.12). Then there exists r>0 satisfying Inequality (15) such that \beta>0.

r<\frac{n\mu-(3+k^{*})fL}{(n-2f)(1+\sigma)L+(1+k^{*})fL}. (15)
Theorem 4.5.

Assume n\mu-(3+k^{*})fL>0 and let r be a value that satisfies Inequality (15). Then we can find an \eta>0 such that \eta<2\beta/\gamma, which in turn makes \rho\in[0,1).
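A sketch of the parameter choice suggested by Lemma 4.4 and Theorem 4.5 (our own illustration with hypothetical values; 1.12 is the numerical constant k^{*} from Lemma 4.2, and we instantiate the worst case h=n-f, b=f):

```python
import math

K_STAR = 1.12  # numerical value of k* (Lemma 4.2)

def k(x):
    # Equation (10)
    return 1 + (x - 1) / math.sqrt(2 * x - 1)

def pick_r_and_eta(n, f, L, mu, sigma):
    assert n * mu - (3 + K_STAR) * f * L > 0, "resilience condition violated"
    # Inequality (15): any r strictly below this bound works; take half of it
    r_bound = (n * mu - (3 + K_STAR) * f * L) / \
              ((n - 2 * f) * (1 + sigma) * L + (1 + K_STAR) * f * L)
    r = r_bound / 2
    h, b = n - f, f   # worst case: f Byzantine workers
    beta = (n - 2 * f) * (mu - r * (1 + sigma) * L) / (1 + r) \
           - b * (1 + k(h) * sigma) * L
    gamma = n * L ** 2 * (h * (1 + sigma ** 2)
                          + b * (h * sigma ** 2 + (1 + k(h) * sigma) ** 2))
    eta = beta / gamma   # the minimizer eta* of rho(eta); any eta < 2*beta/gamma works
    rho = 1 - 2 * beta * eta + gamma * eta ** 2
    return r, eta, rho

r, eta, rho = pick_r_and_eta(n=100, f=5, L=1.0, mu=1.0, sigma=0.05)
```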

4.2 Proof of Convergence

Next, we prove the convergence of our algorithm; that is, Echo-CGC converges to the optimal point w^{*} of the cost function Q. We prove convergence under the assumption that n\mu-(3+k^{*})fL>0. Due to the page limit, we present the key proofs here; the rest can be found in Appendix B.

Recall our definition of the conditional expectation \mathbb{E}=\mathbb{E}_{\Xi^{t}}(\cdot\mid w^{t},G_{\mathcal{B}}^{t}) introduced in Section 2.2. Before proving the main theorem, we introduce some preliminary lemmas.

Lemma 4.6.

For all t and for all j\in\mathcal{H},

\mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)\lVert\nabla Q(w^{t})\rVert. (16)
Lemma 4.7.

Recall that \hat{g}_{j}^{t} is the gradient after applying the CGC filter. For all t and for all j\in\{1,2,\ldots,n\},

\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert\leq(1+k_{h}\sigma)\lVert\nabla Q(w^{t})\rVert. (17)

The proof of Lemma 4.7 is based on Lemma 4.6 and the following prior results: Gumbel [10] and Hartley and David [12] proved that, given identical means and variances (\mu,\sigma^{2}), the expectation of the largest of n independent random variables is at most \mu+\frac{\sigma(n-1)}{\sqrt{2n-1}}.

Lemma 4.8.

Following the same setup, for all t and for all j\in\{1,2,\ldots,n\},

\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}. (18)

The proof of Lemma 4.8 is based on Lemma 4.6 and the following result: Papadatos [18] proved that for n i.i.d. random variables X_{1}\leq X_{2}\leq\dotsm\leq X_{n} with finite variance \sigma^{2}, the variance of X_{n} is bounded above by n\sigma^{2}.
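Both order-statistics facts can be sanity-checked by simulation. The sketch below (our own check, using Gaussian samples purely for illustration) verifies that the empirical mean and variance of the maximum of n i.i.d. variables stay below the Gumbel/Hartley-David mean bound and the Papadatos variance bound:

```python
import random
import statistics

random.seed(0)
n, trials = 10, 20000
mu0, sigma0 = 5.0, 1.0   # common mean and standard deviation

# Sample the maximum of n i.i.d. Gaussian variables, many times
maxima = [max(random.gauss(mu0, sigma0) for _ in range(n)) for _ in range(trials)]

mean_bound = mu0 + sigma0 * (n - 1) / (2 * n - 1) ** 0.5  # Gumbel / Hartley-David
var_bound = n * sigma0 ** 2                               # Papadatos

emp_mean = statistics.fmean(maxima)
emp_var = statistics.variance(maxima)
```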

Lemma 4.7 and Lemma 4.8 provide upper bounds on \mathbb{E}\lVert\hat{g}_{j}^{t}\rVert and \mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}. These two bounds allow us to bound the impact of bogus gradients transmitted by a faulty node j. If j transmitted an extreme gradient, it would be dropped by the CGC filter; otherwise, the two bounds imply that the filtered gradient \hat{g}_{j}^{t} retains some useful properties even if j is faulty. For fault-free gradients, Lemma 4.6 provides a tighter bound.

Theorem 4.9.

Assume that n\mu-(3+k^{*})fL>0. We can find r>0 that satisfies Inequality (15) and \eta>0 such that \eta<2\beta/\gamma. Echo-CGC with the chosen r and \eta converges to the optimal parameter w^{*} as t\to\infty.

Proof 4.10.

Our ultimate goal is to show that the sequence \{\mathbb{E}\lVert w^{t}-w^{*}\rVert^{2}\}_{t=0}^{\infty} converges to 0. Recall that the aggregation rule of the algorithm is w^{t+1}=w^{t}-\eta g^{t}. Thus, we obtain that

\mathbb{E}\lVert w^{t+1}-w^{*}\rVert^{2}\leq\mathbb{E}\lVert w^{t}-w^{*}-\eta g^{t}\rVert^{2}=\underbrace{\mathbb{E}\lVert w^{t}-w^{*}\rVert^{2}}_{A}-\underbrace{2\eta\mathbb{E}\left<w^{t}-w^{*},g^{t}\right>}_{B}+\underbrace{\eta^{2}\mathbb{E}\lVert g^{t}\rVert^{2}}_{C}. (19)

Since w^{t} is known, it can be treated as a constant, and \mathbb{E}\lVert w^{t}-w^{*}\rVert^{2}=\lVert w^{t}-w^{*}\rVert^{2}.

Part C: In Appendix B.4, we show that the following inequality holds.

\mathbb{E}\lVert g^{t}\rVert^{2}\leq\gamma\lVert w^{t}-w^{*}\rVert^{2}. (20)

Part B: By linearity of the inner product,

\left<w^{t}-w^{*},g^{t}\right>=\sum_{j=1}^{n}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>=\sum_{j\in\mathcal{H}}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>+\sum_{j\in\mathcal{B}}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>. (21)

First, by the Cauchy-Schwarz inequality, \left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>\geq-\lVert w^{t}-w^{*}\rVert\lVert\hat{g}_{j}^{t}\rVert; by Lemma 4.7 and the L-Lipschitz assumption, \mathbb{E}\lVert\hat{g}_{j}^{t}\rVert\leq(1+k_{h}\sigma)L\lVert w^{t}-w^{*}\rVert. Thus,

\mathbb{E}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>\geq-(1+k_{h}\sigma)L\lVert w^{t}-w^{*}\rVert^{2},\enspace\forall j\in\mathcal{B}. (22)

Next, observe that by our algorithm, for each j\in\mathcal{H}, the received gradient before the CGC filter, \tilde{g}_{j}^{t}, satisfies (i) \lVert\tilde{g}_{j}^{t}\rVert=\lVert g_{j}^{t}\rVert and (ii) \tilde{g}_{j}^{t}=a_{j}(g_{j}^{t}+\Delta g_{j}^{t}), for some constant a_{j}=\lVert g_{j}^{t}\rVert/\lVert g_{j}^{t}+\Delta g_{j}^{t}\rVert and a vector \Delta g_{j}^{t} such that \lVert\Delta g_{j}^{t}\rVert\leq r\lVert g_{j}^{t}\rVert. This implies a_{j}\geq 1/(1+r). Therefore,

\mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>=\mathbb{E}\left<w^{t}-w^{*},a_{j}(g_{j}^{t}+\Delta g_{j}^{t})\right>\geq\frac{1}{1+r}\left(\left<w^{t}-w^{*},\mathbb{E}g_{j}^{t}\right>+\mathbb{E}\left<w^{t}-w^{*},\Delta g_{j}^{t}\right>\right),\enspace\forall j\in\mathcal{H}. (23)

By Assumption 4, \mathbb{E}g_{j}^{t}=\nabla Q(w^{t}); by strong convexity,

\left<w^{t}-w^{*},\nabla Q(w^{t})\right>\geq\mu\lVert w^{t}-w^{*}\rVert^{2}.

By the Cauchy-Schwarz inequality, \mathbb{E}\left<w^{t}-w^{*},\Delta g_{j}^{t}\right>\geq-\lVert w^{t}-w^{*}\rVert\,\mathbb{E}\lVert\Delta g_{j}^{t}\rVert, and \mathbb{E}\lVert\Delta g_{j}^{t}\rVert\leq r\mathbb{E}\lVert g_{j}^{t}\rVert. By Lemma 4.6 and the L-Lipschitz assumption, \mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)L\lVert w^{t}-w^{*}\rVert. Thus,

\mathbb{E}\left<w^{t}-w^{*},\Delta g_{j}^{t}\right>\geq-r(1+\sigma)L\lVert w^{t}-w^{*}\rVert^{2}.

Upon substituting these results into Equation (23), we obtain that

\mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>\geq\frac{\mu-r(1+\sigma)L}{1+r}\lVert w^{t}-w^{*}\rVert^{2},\enspace\forall j\in\mathcal{H}. (24)

We partition \mathcal{H} into two parts: \mathcal{H}_{1}=\mathcal{H}\cap\{i_{1},\ldots,i_{n-f}\} and \mathcal{H}_{2}=\mathcal{H}\setminus\mathcal{H}_{1}. For each j\in\mathcal{H}_{1}, the received gradient is unchanged by the CGC filter, i.e., \hat{g}_{j}^{t}=\tilde{g}_{j}^{t}. Therefore, Equation (24) also holds for \hat{g}_{j}^{t}, for all j\in\mathcal{H}_{1}.

The case of \mathcal{H}_{2} is similar. Note that for each j\in\mathcal{H}_{2}, the gradient \tilde{g}_{j}^{t} is scaled down to \hat{g}_{j}^{t} by the CGC filter. In other words, there exists some constant a_{j}^{\prime}\geq 0 such that \hat{g}_{j}^{t}=a_{j}^{\prime}\tilde{g}_{j}^{t}. Therefore,

\mathbb{E}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>=\mathbb{E}\left<w^{t}-w^{*},a_{j}^{\prime}\tilde{g}_{j}^{t}\right>=a_{j}^{\prime}\mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>,\enspace\forall j\in\mathcal{H}_{2}.

We can verify that if r>0 satisfies Inequality (15), then \mu-r(1+\sigma)L>0; and Equation (24) then implies that \mathbb{E}\left<w^{t}-w^{*},\tilde{g}_{j}^{t}\right>\geq 0. Therefore,

\mathbb{E}\left<w^{t}-w^{*},\hat{g}_{j}^{t}\right>\geq 0,\enspace\forall j\in\mathcal{H}_{2}. (25)

Note that |\mathcal{H}_{1}|\geq h-2f. Upon substituting Equations (22), (24), and (25) into Equation (21), we obtain that

\mathbb{E}\left<w^{t}-w^{*},g^{t}\right>\geq\left((n-2f)\frac{\mu-r(1+\sigma)L}{1+r}-b(1+k_{h}\sigma)L\right)\lVert w^{t}-w^{*}\rVert^{2}. (26)

By the definition of \beta in Equation (9), this implies \mathbb{E}\left<w^{t}-w^{*},g^{t}\right>\geq\beta\lVert w^{t}-w^{*}\rVert^{2}.

Conclusion: Upon combining parts A, B, and C, by the definition of \rho in Equation (13),

\mathbb{E}\lVert w^{t+1}-w^{*}\rVert^{2}\leq\rho\lVert w^{t}-w^{*}\rVert^{2},\enspace\forall t=0,1,2,\ldots

Recall the definition of the conditional expectation operator \mathbb{E}. This implies that

\mathbb{E}(\lVert w^{t}-w^{*}\rVert^{2}\mid w^{0},\mathcal{G}_{\mathcal{B}}^{0},\ldots,\mathcal{G}_{\mathcal{B}}^{t})\leq\rho^{t}\lVert w^{0}-w^{*}\rVert^{2}.

By Theorem 4.5, \rho\in[0,1). Therefore, as t\to\infty, \lVert w^{t}-w^{*}\rVert^{2} converges to 0; in other words, w^{t} converges to the optimal parameter w^{*}. This proves the theorem.
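The contraction in the proof can be illustrated with a toy simulation. The sketch below is not the full Echo-CGC protocol (it omits broadcasting and echo messages entirely) and uses a made-up quadratic cost and noise model; it only demonstrates that the CGC-style aggregation step w^{t+1}=w^{t}-\eta\sum_{j}CGC(g_{j}^{t}) drives w^{t} toward w^{*} despite f Byzantine workers:

```python
import random

random.seed(1)
d, n, f = 5, 20, 3          # dimension, workers, Byzantine workers
eta, sigma = 0.005, 0.05    # step size and relative noise level
w_star = [1.0] * d          # optimum of Q(w) = 0.5 * ||w - w*||^2

def grad(w):                # true gradient of Q at w
    return [wi - si for wi, si in zip(w, w_star)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

w = [10.0] * d
for t in range(200):
    g_true = grad(w)
    grads = []
    for j in range(n):
        if j < f:           # Byzantine workers push a large bogus vector
            grads.append([100.0] * d)
        else:               # fault-free workers: gradient plus small relative noise
            noise = [random.uniform(-1, 1) for _ in range(d)]
            scale = sigma * norm(g_true) / (norm(noise) or 1)
            grads.append([gi + scale * ni for gi, ni in zip(g_true, noise)])
    # CGC filter: scale every gradient down to the (n-f)-th largest norm
    thresh = sorted((norm(g) for g in grads), reverse=True)[n - f - 1]
    clipped = [[gi * min(1.0, thresh / (norm(g) or 1)) for gi in g] for g in grads]
    agg = [sum(g[i] for g in clipped) for i in range(d)]
    w = [wi - eta * ai for wi, ai in zip(w, agg)]

final_dist = norm([wi - si for wi, si in zip(w, w_star)])
```

Because the Byzantine contribution is clipped to the (n-f)-th largest norm, it shrinks in proportion to the fault-free gradients, and the distance to w^{*} decays geometrically, mirroring the \rho^{t} bound above.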

4.3 Communication Complexity

We analyze the communication complexity of the Echo-CGC algorithm and show that, under suitable conditions, it effectively reduces communication complexity compared to prior algorithms [4, 11]. First consider a ball in \mathbb{R}^{d} whose center is the true gradient \nabla Q(w^{t}):

B\left(\nabla Q(w^{t}),\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\right)=\{u\in\mathbb{R}^{d}:\lVert u-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\}, (27)

where r>0 is the deviation ratio. With a slight abuse of notation, we abbreviate the ball as B. This should not be confused with \mathcal{B}, the set of Byzantine workers. We present only the main results; the proofs can be found in Appendix C.

Lemma 4.11.

For all u,v\in B, \lVert u-v\rVert\leq r\lVert u\rVert (and \lVert u-v\rVert\leq r\lVert v\rVert).

Given Lemma 4.11, we compute the probability that an arbitrary gradient g_{j}^{t} is in the ball B. By Markov's Inequality,

\Pr(g_{j}^{t}\in B)=\Pr\left(\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\right)\geq 1-\frac{\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}}{\frac{r^{2}}{(2+r)^{2}}\lVert\nabla Q(w^{t})\rVert^{2}}. (28)

By Assumption 5, \mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}, so we conclude that \Pr(g_{j}^{t}\in B)\geq p, where p is the lower bound defined as p=1-(1+2/r)^{2}\sigma^{2}.

Denote n_{B}=|\{j:g_{j}^{t}\in B\}|, and let n^{*} be the number of workers that send an echo message in a round. By Lemma 4.11, n^{*}\geq n_{B}-1. Since each event \{g_{j}^{t}\in B\} is independent and has a fixed probability, n_{B} follows a Binomial distribution whose success probability is bounded below by p. Therefore,

\mathbb{E}n^{*}\geq\mathbb{E}n_{B}-1\geq np-1.
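The lower bound p can be sanity-checked empirically. In the sketch below (our own illustration), the noise is Gaussian with \mathbb{E}\lVert\text{noise}\rVert^{2}=\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}, one admissible instance of Assumption 5; we estimate the fraction of stochastic gradients that land inside the ball B of Equation (27):

```python
import random

random.seed(2)
d, r, sigma = 10, 1.0, 0.1
p = 1 - (1 + 2 / r) ** 2 * sigma ** 2   # lower bound on Pr(g in B)
g_true = [1.0] * d                       # stand-in for the true gradient

def norm(v):
    return sum(x * x for x in v) ** 0.5

radius = r / (2 + r) * norm(g_true)      # radius of the ball B

hits, trials = 0, 10000
for _ in range(trials):
    # per-coordinate std sigma*||g||/sqrt(d) gives E||noise||^2 = sigma^2 ||g||^2
    noise = [random.gauss(0, sigma * norm(g_true) / d ** 0.5) for _ in range(d)]
    if norm(noise) <= radius:
        hits += 1

frac = hits / trials
```

As expected, the empirical fraction is at least p; Markov's Inequality is loose here, so the fraction is typically much larger.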

For n\gg 1, we assume that 1/n\approx 0. Also, in practice d\gg n, so the message complexity of each echo message (O(n) bits) is negligible compared to that of a raw gradient (O(d) bits). Hence, the ratio of the bit complexity of our algorithm to that of prior algorithms (e.g., [4, 11]) can be approximately bounded above as follows:

\frac{\text{bit complexity of Echo-CGC}}{\text{bit complexity of prior algorithms}}=\frac{n^{*}O(n)+(n-n^{*})O(d)}{nO(d)}\leq\frac{(np-1)O(n)+[n-(np-1)]O(d)}{nO(d)}\approx 1-p.

We denote by C=1-p=(1+2/r)^{2}\sigma^{2} the resulting upper bound on the ratio of the communication complexity of Echo-CGC to that of prior algorithms.

Analysis  Substituting the upper bound on r from Inequality (14) into C=(1+2/r)^{2}\sigma^{2} and applying Lemma 4.2 (k_{n}\leq k^{*}\sqrt{n}), C can be bounded as

C\leq\sigma^{2}\left(1+2\cdot\frac{(1-2x)(1+\sigma)+(1+\sigma k^{*}\sqrt{n})x}{\mu/L-(3+\sigma k^{*}\sqrt{n})x}\right)^{2}, (29)

where x=f/n is the fault-tolerance factor.

As Equation (29) shows, the ratio C depends on four non-trivial variables: (i) the variance bound \sigma\geq 0; (ii) the resilience x=f/n, satisfying the assumption in Lemma 4.3, i.e.,

\mu/L-(3+\sigma k^{*}\sqrt{n})x>0;

(iii) the constant \mu/L, which is determined by the cost function Q and satisfies 0<\mu/L\leq 1 by Lemma 4.1; and (iv) the number of workers n>0.

[Figure 1: The upper bound C as a function of each parameter, with the other three fixed. (a) C vs. \sigma (\mu/L=1, x=0.1, n=100); (b) C vs. \mu/L (\sigma=0.1, x=0.1, n=100); (c) C vs. x (\sigma=0.1, \mu/L=1, n=100); (d) C vs. n (\sigma=0.1, \mu/L=1, x=0.1).]

We first plot the relation between each factor and C while fixing the other three. We begin with the most significant factor, \sigma, fixing \mu/L=1, x=0.1, and n=100. As Figure 1(a) shows, C grows roughly quadratically in \sigma because of the \sigma^{2} term in Equation (29). Therefore, our algorithm is guaranteed to have lower communication complexity when the variance of gradients is relatively low, especially when \sigma\leq 0.1. In practice, this is the scenario in which the data set consists mainly of similar data instances.

Then, we plot C against \mu/L with fixed \sigma=0.1, x=0.1, and n=100. As Figure 1(b) shows, C decreases as \mu/L approaches 1. For \mu/L>0.75, C<0.5, meaning that [0.75,1] is the range of \mu/L in which our algorithm is guaranteed to perform significantly better.

Next, we plot C against x with fixed \sigma=0.1, \mu/L=1, and n=100. As Figure 1(c) shows, there is a trade-off between C and the fault resilience x. As x approaches the maximum resilience from Lemma 4.3, i.e., x_{\max}=\frac{\mu/L}{3+\sigma k^{*}\sqrt{n}}, the theoretical upper bound C blows up. Moreover, for x<0.15, C<0.4; thus [0,0.15] is a proper range for x.

Finally, we plot C against n with fixed \sigma=0.1, \mu/L=1, and x=0.1. As Figure 1(d) shows, C increases almost linearly in n with a relatively flat slope. In other words, n is not a significant factor of C, and the performance of our algorithm is stable over a wide range of n.

In conclusion, our algorithm is guaranteed to require lower communication complexity when (i) \sigma is low, i.e., data instances are similar, and (ii) \mu/L is close to 1. Also, there is a trade-off between resilience and efficiency. As a concrete example, when \sigma=0.1, x=0.1, \mu/L=1, and n=100, C\approx 0.25, meaning that our algorithm is guaranteed to save at least 75\% of the communication cost.
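The concluding numbers follow directly from Equation (29). The helper below (our own sketch; `C_bound` is a hypothetical name) reproduces the example and also illustrates the blow-up of the bound as x approaches x_{\max}, consistent with Figure 1(c):

```python
def C_bound(sigma, x, mu_over_L, n, k_star=1.12):
    # Equation (29): upper bound on the communication-cost ratio C
    num = (1 - 2 * x) * (1 + sigma) + (1 + sigma * k_star * n ** 0.5) * x
    den = mu_over_L - (3 + sigma * k_star * n ** 0.5) * x
    return sigma ** 2 * (1 + 2 * num / den) ** 2

c_example = C_bound(sigma=0.1, x=0.1, mu_over_L=1.0, n=100)   # roughly 0.22
c_near_max = C_bound(sigma=0.1, x=0.2, mu_over_L=1.0, n=100)  # degrades near x_max
```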

5 Summary

In this paper, we present a Byzantine-tolerant DML algorithm that incurs lower communication complexity in a single-hop radio network (under suitable conditions). Our algorithm is inspired by the CGC filter [11], but we devise new proofs to handle the randomness and noise introduced by our mechanism.

There are two interesting open problems: (i) extending the algorithm to multi-hop radio networks; and (ii) different mechanisms for constructing echo messages, e.g., using angles rather than distance ratios.

References

  • [1] Dan Alistarh, Seth Gilbert, Rachid Guerraoui, Zarko Milosevic, and Calvin Newport. Securing every bit: Authenticated broadcast in radio networks. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, page 50–59, New York, NY, USA, 2010. Association for Computing Machinery. URL: https://doi.org/10.1145/1810479.1810489, doi:10.1145/1810479.1810489.
  • [2] Baruch Awerbuch, Andrea Richa, and Christian Scheideler. A jamming-resistant mac protocol for single-hop wireless networks. In Proceedings of the Twenty-Seventh ACM Symposium on Principles of Distributed Computing, PODC ’08, page 45–54, New York, NY, USA, 2008. Association for Computing Machinery. URL: https://doi.org/10.1145/1400751.1400759, doi:10.1145/1400751.1400759.
  • [3] Vartika Bhandari and Nitin H. Vaidya. On reliable broadcast in a radio network. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of Distributed Computing, PODC ’05, page 138–147, New York, NY, USA, 2005. Association for Computing Machinery. URL: https://doi.org/10.1145/1073814.1073841, doi:10.1145/1073814.1073841.
  • [4] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. NIPS’17, page 118–128, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, USA, 2004.
  • [6] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst., 1(2), December 2017. URL: https://doi.org/10.1145/3154503, doi:10.1145/3154503.
  • [7] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Rhicheek Patra, and Mahsa Taziki. Asynchronous Byzantine machine learning (the case of SGD). volume 80 of Proceedings of Machine Learning Research, pages 1145–1154, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL: http://proceedings.mlr.press/v80/damaskinos18a.html.
  • [8] El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Lê Nguyên Hoang, and Sébastien Rouault. Genuinely distributed byzantine machine learning. In Proceedings of the 39th Symposium on Principles of Distributed Computing, PODC ’20, page 355–364, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3382734.3405695, doi:10.1145/3382734.3405695.
  • [9] J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, pages 101 – 148, 01 2010.
  • [10] E. J. Gumbel. The maxima of the mean largest value and of the range. The Annals of Mathematical Statistics, 25(1):76–84, 1954. URL: http://www.jstor.org/stable/2236513.
  • [11] Nirupam Gupta and Nitin H. Vaidya. Fault-tolerance in distributed optimization: The case of redundancy. In Proceedings of the 39th Symposium on Principles of Distributed Computing, PODC ’20, page 365–374, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3382734.3405748, doi:10.1145/3382734.3405748.
  • [12] H. O. Hartley and H. A. David. Universal bounds for mean range and extreme observation. The Annals of Mathematical Statistics, 25(1):85–99, 1954. URL: http://www.jstor.org/stable/2236514.
  • [13] Seyyedali Hosseinalipour, Christopher G. Brinton, Vaneet Aggarwal, Huaiyu Dai, and Mung Chiang. From federated learning to fog learning: Towards large-scale distributed machine learning in heterogeneous wireless networks, 2020. arXiv:2006.03594.
  • [14] Chiu-Yuen Koo. Broadcast in radio networks tolerating byzantine adversarial behavior. In Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing, PODC ’04, page 275–282, New York, NY, USA, 2004. Association for Computing Machinery. URL: https://doi.org/10.1145/1011767.1011807, doi:10.1145/1011767.1011807.
  • [15] Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 19–27, Cambridge, MA, USA, 2014. MIT Press.
  • [16] Ruben Mayer and Hans-Arno Jacobsen. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Comput. Surv., 53(1), February 2020. URL: https://doi.org/10.1145/3363554, doi:10.1145/3363554.
  • [17] V. Navda, A. Bohra, S. Ganguly, and D. Rubenstein. Using channel hopping to increase 802.11 resilience to jamming attacks. In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pages 2526–2530, 2007.
  • [18] Nickos Papadatos. Maximum variance of order statistics. Annals of the Institute of Statistical Mathematics, 47:185–193, 02 1995. doi:10.1007/BF00773423.
  • [19] Michael D. Perlman. Jensen’s inequality for a convex vector-valued function on an infinite-dimensional space. Journal of Multivariate Analysis, 4(1):52 – 65, 1974. URL: http://www.sciencedirect.com/science/article/pii/0047259X74900050, doi:https://doi.org/10.1016/0047-259X(74)90005-0.
  • [20] Lili Su and Nitin H. Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In George Giakkoupis, editor, Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, PODC 2016, Chicago, IL, USA, July 25-28, 2016, pages 425–434. ACM, 2016. URL: https://doi.org/10.1145/2933057.2933105, doi:10.1145/2933057.2933105.
  • [21] Lili Su and Nitin H. Vaidya. Non-bayesian learning in the presence of byzantine agents. In Distributed Computing - 30th International Symposium, DISC 2016, Paris, France, September 27-29, 2016. Proceedings, pages 414–427, 2016. URL: https://doi.org/10.1007/978-3-662-53426-7_30, doi:10.1007/978-3-662-53426-7\_30.
  • [22] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao. Application of machine learning in wireless networks: Key techniques and open issues. IEEE Communications Surveys Tutorials, 21(4):3072–3108, 2019.
  • [23] Zeyi Tao and Qun Li. esgd: Communication efficient distributed deep learning on the edge. In USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), Boston, MA, July 2018. USENIX Association. URL: https://www.usenix.org/conference/hotedge18/presentation/tao.
  • [24] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. A survey on distributed machine learning. ACM Comput. Surv., 53(2), March 2020. URL: https://doi.org/10.1145/3377454, doi:10.1145/3377454.
  • [25] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. volume 97 of Proceedings of Machine Learning Research, pages 6893–6901, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL: http://proceedings.mlr.press/v97/xie19b.html.

Appendix A Proof of Lemmas in Section 4.1

A.1 Proof of Lemma 4.1

Proof A.1.

The Cauchy-Schwarz inequality and the L-Lipschitz smoothness of the cost function imply that

\left<w-w^{*},\nabla Q(w)\right>\leq\lVert w-w^{*}\rVert\lVert\nabla Q(w)\rVert\leq L\lVert w-w^{*}\rVert^{2},\enspace\forall w\in\mathbb{R}^{d}. (30)

Also, by strong convexity,

\left<w-w^{*},\nabla Q(w)\right>\geq\mu\lVert w-w^{*}\rVert^{2},\enspace\forall w\in\mathbb{R}^{d}. (31)

Equations (30) and (31) together imply that \mu\leq L.

A.2 Proof of Lemma 4.2

Proof A.2.

First, we check the derivative of k_{x}/\sqrt{x}:

\frac{d}{dx}\frac{k_{x}}{\sqrt{x}}=-\frac{(2x-1)^{3/2}-3x+1}{2(2x^{2}-x)^{3/2}}.

We can then verify numerically that k_{x}/\sqrt{x} reaches its maximum at x\approx 1.91 and that k^{*}=\sup_{x\geq 1}(k_{x}/\sqrt{x})\approx 1.12.

A.3 Proof of Lemma 4.3

Proof A.3.

By definition, k_{x}=1+\frac{x-1}{\sqrt{2x-1}} is increasing for x\geq 1. Recall that h\leq n and b\leq f, so k_{h}\leq k_{n}. By the definition of \beta, this further implies that

\beta\geq(n-2f)\frac{\mu-r(1+\sigma)L}{1+r}-fL(1+k_{n}\sigma).

Therefore, \beta>0 if

(n-2f)\left(\mu-r(1+\sigma)L\right)>(1+r)(1+k_{n}\sigma)fL
\iff (n-2f)\mu-(1+k_{n}\sigma)fL>\left((n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL\right)r.

Recall that by Lemma 4.1, \mu\leq L, so the above inequality holds if

n\mu-(3+k_{n}\sigma)fL>\left((n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL\right)r. (32)

Since we assume n-2f>0, the right-hand side is always positive. Therefore, if n\mu-(3+k_{n}\sigma)fL>0, then there exists r>0 that satisfies Inequality (32); such r also yields \beta>0. This proves the lemma.

A.4 Proof of Lemma 4.4

Proof A.4.

By Lemma 4.2, k_{n}\leq k^{*}\sqrt{n}. Given the assumption that \sigma<1/\sqrt{n},

\frac{n\mu-(3+k_{n}\sigma)fL}{(n-2f)(1+\sigma)L+(1+k_{n}\sigma)fL}>\frac{n\mu-(3+k^{*})fL}{(n-2f)(1+\sigma)L+(1+k^{*})fL}.

Therefore, if r>0 satisfies Inequality (15), then it also satisfies Inequality (14); and by Lemma 4.3, such r yields \beta>0.

A.5 Proof of Theorem 4.5

Proof A.5.

First observe that \rho can be represented as a quadratic function of \eta, i.e., \rho(\eta)=\gamma\eta^{2}-2\beta\eta+1. We prove the theorem by bounding this function.

By the definition of \alpha_{x} in Equation (12), \alpha_{h}>0 for all h\geq 1. Thus, by the definition of \gamma in Equation (11), \gamma>0. Also, by Lemma 4.4, for r>0 that satisfies Inequality (15), \beta>0.

Recall that a quadratic function q(x)=ax^{2}-bx+1 with a,b>0 reaches its minimum at x^{*}=b/2a. Moreover, q(0)=q(2x^{*}), and for all x\in(0,2x^{*}), q(x)\in[q(x^{*}),q(0)). In our case, a=\gamma and b=2\beta. Therefore, the minimum occurs at \eta^{*}=\beta/\gamma, and \eta^{*}>0. Since \rho(0)=1, for all \eta\in(0,2\eta^{*}), \rho(\eta)\in[\rho(\eta^{*}),1).

The remaining task is to compute the minimum value \rho(\eta^{*}). Upon substituting \eta^{*}=\beta/\gamma into \rho(\eta)=\gamma\eta^{2}-2\beta\eta+1, we obtain \rho(\eta^{*})=1-\beta^{2}/\gamma.

Recalling the definitions of \beta and \gamma in Equations (9) and (11), we first see that \beta\leq n\mu. Also, since \alpha_{h}\geq 1, \gamma\geq nL^{2}(h+b)=n^{2}L^{2}. Finally, by Lemma 4.1, \mu\leq L. These imply that \beta^{2}\leq n^{2}\mu^{2}\leq n^{2}L^{2}\leq\gamma, and thus \rho(\eta^{*})=1-\beta^{2}/\gamma>0.

In conclusion, for all \eta>0 such that \eta<2\eta^{*}, \rho\in[0,1). This proves the theorem.
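The quadratic bookkeeping in this proof can be verified numerically; the sketch below uses arbitrary positive test values for \beta and \gamma (any pair with \beta^{2}\leq\gamma behaves the same) and checks that \rho(\eta) is minimized at \eta^{*}=\beta/\gamma with minimum 1-\beta^{2}/\gamma, and that \rho(0)=\rho(2\eta^{*})=1:

```python
# Arbitrary positive test values with beta^2 <= gamma
beta, gamma = 32.4, 1.06e4

def rho(eta):
    # rho(eta) = gamma * eta^2 - 2 * beta * eta + 1, as in the proof
    return gamma * eta ** 2 - 2 * beta * eta + 1

eta_star = beta / gamma   # minimizer of the quadratic
min_rho = rho(eta_star)   # equals 1 - beta^2 / gamma
```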

Appendix B Proof of Lemmas in Section 4.2

B.1 Proof of Lemma 4.6

Proof B.1.

Consider any j\in\mathcal{H}. First, observe that

\mathbb{E}\lVert g_{j}^{t}\rVert=\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})+\nabla Q(w^{t})\rVert\leq\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert+\mathbb{E}\lVert\nabla Q(w^{t})\rVert. (33)

By Jensen's inequality, for any random variable X, (\mathbb{E}X)^{2}\leq\mathbb{E}(X^{2}). Upon substituting X=\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert, we obtain that

\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert\leq\sqrt{\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}}.

By Assumption 5, this further implies that

\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert\leq\sqrt{\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}}=\sigma\lVert\nabla Q(w^{t})\rVert.

Upon substituting this into Equation (33), we obtain that

\mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)\lVert\nabla Q(w^{t})\rVert.

This completes the proof.

B.2 Proof of Lemma 4.7

Proof B.2.

Gumbel [10] and Hartley and David [12] proved that, given identical means and variances (\mu,\sigma^{2}), the expectation of the largest of n independent random variables is at most \mu+\frac{\sigma(n-1)}{\sqrt{2n-1}}. In our model, \{g_{j}^{t}:j\in\mathcal{H}\} can be viewed as a set of independent and identically distributed random vectors with expectation \nabla Q(w^{t}). Recall that \mathcal{H} is the set of fault-free workers. Therefore, \{\lVert g_{j}^{t}\rVert:j\in\mathcal{H}\} is also a set of i.i.d. random variables; and by our algorithm, the norms of the received gradients satisfy \lVert\tilde{g}_{j}^{t}\rVert=\lVert g_{j}^{t}\rVert.

Denote the mean and variance of each \lVert g_{j}^{t}\rVert by (\epsilon,\delta^{2}), and let M=\max_{j\in\mathcal{H}}\{\lVert g_{j}^{t}\rVert\}. Thus, [10, 12] imply that \mathbb{E}M\leq\epsilon+\frac{\delta(h-1)}{\sqrt{2h-1}}. We bound \epsilon and \delta next.

Recall that Lemma 4.6 gives an upper bound on \mathbb{E}\lVert g_{j}^{t}\rVert, i.e.,

\epsilon=\mathbb{E}\lVert g_{j}^{t}\rVert\leq(1+\sigma)\lVert\nabla Q(w^{t})\rVert.

Therefore, we only need to compute an upper bound on \delta^{2}. Consider a random vector X\in\mathbb{R}^{d} and denote \mu=\mathbb{E}X. By linearity of expectation, for any constant vector v\in\mathbb{R}^{d}, \mathbb{E}\left<v,X\right>=\left<v,\mathbb{E}X\right>. Therefore,

\mathbb{E}\lVert X-\mu\rVert^{2}=\mathbb{E}\left<X-\mu,X-\mu\right>=\mathbb{E}\left<X,X\right>-2\mathbb{E}\left<\mu,X\right>+\mathbb{E}\left<\mu,\mu\right>=\mathbb{E}\lVert X\rVert^{2}-\lVert\mu\rVert^{2}. (34)

Perlman [19] proved the extended Jensen’s Inequality to random vectors such that ϕ(𝔼X)𝔼ϕ(X)\phi(\mathbb{E}X)\leq\mathbb{E}\phi(X) for any convex function ϕ\phi. Since \lVert\cdot\rVert is convex, this implies μ𝔼X\lVert\mu\rVert\leq\mathbb{E}\lVert X\rVert. Thus,

\begin{align*}
\mathrm{Var}\lVert X\rVert &=\mathbb{E}\lVert X\rVert^{2}-(\mathbb{E}\lVert X\rVert)^{2}\\
&\leq\mathbb{E}\lVert X\rVert^{2}-\lVert\mu\rVert^{2}\\
&=\mathbb{E}\lVert X-\mu\rVert^{2}.
\end{align*}
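As a quick numerical illustration (not part of the proof), one can treat a finite set of vectors as the entire distribution, so that expectations become exact sample means; both the identity $\mathbb{E}\lVert X-\mu\rVert^{2}=\mathbb{E}\lVert X\rVert^{2}-\lVert\mu\rVert^{2}$ and the bound $\mathrm{Var}\lVert X\rVert\leq\mathbb{E}\lVert X-\mu\rVert^{2}$ can then be checked directly. The sample size and dimension below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
# A finite set of vectors treated as the whole distribution:
# expectations are then exact sample means (up to float error).
X = rng.normal(size=(1000, 3))
mu = X.mean(axis=0)                                # E[X]

lhs = ((X - mu) ** 2).sum(axis=1).mean()           # E||X - mu||^2
rhs = (X ** 2).sum(axis=1).mean() - mu @ mu        # E||X||^2 - ||mu||^2
assert np.isclose(lhs, rhs)                        # identity (34)

norms = np.linalg.norm(X, axis=1)
assert norms.var() <= lhs                          # Var||X|| <= E||X - mu||^2
```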

Substituting $X=g_{j}^{t}$, we obtain that

\begin{align}
\delta^{2}=\mathrm{Var}\lVert g_{j}^{t}\rVert &\leq\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}\nonumber\\
&\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}. &&\text{(Assumption 5)}\tag{35}
\end{align}

The upper bounds on $\epsilon$ and $\delta^{2}$ together yield the following bound on $\mathbb{E}M$:

\[\mathbb{E}M\leq\epsilon+\frac{h-1}{\sqrt{2h-1}}\delta\leq\left((1+\sigma)+\frac{h-1}{\sqrt{2h-1}}\sigma\right)\lVert\nabla Q(w^{t})\rVert.\]

By the definition of $k_{x}$ in Equation (10), this simplifies to

\[\mathbb{E}M\leq(1+k_{h}\sigma)\lVert\nabla Q(w^{t})\rVert.\]

Finally, recall the definition of the CGC filter in Equation (8). Since $|\mathcal{H}|\geq n-f$, there are at most $f$ Byzantine gradients, so at least one of the $f+1$ received gradients with the largest norms is fault-free. This implies that the threshold of the CGC filter satisfies $\lVert\tilde{g}_{i_{n-f}}^{t}\rVert\leq M$. Since the CGC filter guarantees that $\lVert\hat{g}_{j}^{t}\rVert\leq\lVert\tilde{g}_{i_{n-f}}^{t}\rVert$ for all $j$, together we have $\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert\leq\mathbb{E}\lVert\tilde{g}_{i_{n-f}}^{t}\rVert\leq\mathbb{E}M$, which proves the lemma.
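The threshold argument can be illustrated concretely. In the sketch below (with hypothetical worker counts and norm values of our own choosing), even adversarially large Byzantine norms cannot push the $(n-f)$-th smallest norm above the largest fault-free norm $M$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, f = 10, 3                       # illustrative worker and Byzantine counts
h = n - f

# Fault-free norms plus adversarially large Byzantine norms.
honest = rng.uniform(0.0, 1.0, size=h)
byzantine = rng.uniform(5.0, 10.0, size=f)
norms = np.sort(np.concatenate([honest, byzantine]))

threshold = norms[n - f - 1]       # (n-f)-th smallest norm (0-indexed)
M = honest.max()                   # largest fault-free norm
assert threshold <= M
```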

B.3 Proof of Lemma 4.8

Proof B.3.

Papadatos [18] proved that for the order statistics $X_{1}\leq X_{2}\leq\dotsb\leq X_{n}$ of $n$ i.i.d. random variables with finite variance $\sigma^{2}$, the variance of the maximum $X_{n}$ is bounded above by $n\sigma^{2}$. In the same setup as Section B.2, with $M=\max\{\lVert g_{j}^{t}\rVert:j\in\mathcal{H}\}$, Papadatos' result together with Equation (35) in Section B.2 implies that

\[\mathrm{Var}\,M\leq h\,\mathrm{Var}\lVert g_{j}^{t}\rVert\leq h\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}.\]
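Papadatos' bound on the variance of the maximum can also be checked by simulation; the Gaussian distribution and the parameter values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 5, 2.0

# Empirical variance of the maximum of n i.i.d. variables
# with variance sigma^2, over many batches.
samples = rng.normal(0.0, sigma, size=(100_000, n))
var_max = samples.max(axis=1).var()

# Papadatos-style bound: Var(max) <= n * sigma^2.
assert var_max <= n * sigma ** 2
```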

Recall that for a random variable $Y$, $\mathrm{Var}\,Y=\mathbb{E}Y^{2}-(\mathbb{E}Y)^{2}$. Therefore,

\[\mathbb{E}M^{2}=\mathrm{Var}\,M+(\mathbb{E}M)^{2}\leq\left(h\sigma^{2}+(1+k_{h}\sigma)^{2}\right)\lVert\nabla Q(w^{t})\rVert^{2}.\]

Recall the definition of $\alpha_{x}$ in Equation (12). This implies

\[\mathbb{E}M^{2}\leq\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}.\]

Finally, in Section B.2 we proved $\lVert\hat{g}_{j}^{t}\rVert\leq M$ for all $j$, so $\lVert\hat{g}_{j}^{t}\rVert^{2}\leq M^{2}$ and $\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\mathbb{E}M^{2}$ for all $j$, which proves the lemma.

B.4 Part C in Proof of Theorem 4.9

Proof B.4.

Part C: By the definition of $g^{t}$ and convexity of $\lVert\cdot\rVert^{2}$,

\begin{align*}
\lVert g^{t}\rVert^{2}=\left\lVert\sum_{j=1}^{n}\hat{g}_{j}^{t}\right\rVert^{2} &\leq n\sum_{j=1}^{n}\lVert\hat{g}_{j}^{t}\rVert^{2} &&\text{($\lVert\cdot\rVert^{2}$ is convex)}\\
&=n\sum_{j\in\mathcal{H}}\lVert\hat{g}_{j}^{t}\rVert^{2}+n\sum_{j\in\mathcal{B}}\lVert\hat{g}_{j}^{t}\rVert^{2}.
\end{align*}
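The convexity step, $\lVert\sum_{j}v_{j}\rVert^{2}\leq n\sum_{j}\lVert v_{j}\rVert^{2}$, holds for any collection of vectors; a quick numerical check with arbitrary dimensions of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 4
g = rng.normal(size=(n, d))                        # n gradient-like vectors in R^d

lhs = np.linalg.norm(g.sum(axis=0)) ** 2           # ||sum_j g_j||^2
rhs = n * (np.linalg.norm(g, axis=1) ** 2).sum()   # n * sum_j ||g_j||^2
assert lhs <= rhs
```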

By the definition of $\hat{g}_{j}^{t}$, $\lVert\hat{g}_{j}^{t}\rVert\leq\lVert g_{j}^{t}\rVert$ for all $j\in\mathcal{H}$. Also, by Equation (34) in Section B.2, for each fault-free worker $j\in\mathcal{H}$,

\begin{align}
\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\mathbb{E}\lVert g_{j}^{t}\rVert^{2} &\leq\mathbb{E}\lVert g_{j}^{t}-\nabla Q(w^{t})\rVert^{2}+\lVert\nabla Q(w^{t})\rVert^{2}\nonumber\\
&\leq\sigma^{2}\lVert\nabla Q(w^{t})\rVert^{2}+\lVert\nabla Q(w^{t})\rVert^{2}.\tag{36}
\end{align}

By Lemma 4.8, for each Byzantine worker $j\in\mathcal{B}$,

\[\mathbb{E}\lVert\hat{g}_{j}^{t}\rVert^{2}\leq\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}.\tag{37}\]

Recall the $L$-Lipschitz assumption (Assumption 2), which gives $\lVert\nabla Q(w^{t})\rVert\leq L\lVert w^{t}-w^{*}\rVert$. Hence, combining Equations (36) and (37), we obtain that

\begin{align*}
\mathbb{E}\lVert g^{t}\rVert^{2} &\leq nh(1+\sigma^{2})\lVert\nabla Q(w^{t})\rVert^{2}+nb\alpha_{h}\lVert\nabla Q(w^{t})\rVert^{2}\\
&\leq nL^{2}\left(h(1+\sigma^{2})+b\alpha_{h}\right)\lVert w^{t}-w^{*}\rVert^{2}.
\end{align*}

Recall the definition of $\gamma$ in Equation (11). This proves Part C of Theorem 4.9.

Appendix C Proof in Section 4.3

C.1 Proof of Lemma 4.11

Proof C.1.

First, by the triangle inequality, for all $u,v\in B$,

\[\lVert u-v\rVert\leq\lVert u-\nabla Q(w^{t})\rVert+\lVert v-\nabla Q(w^{t})\rVert.\tag{38}\]

Again by the triangle inequality and the radius of the ball $B$, for each $u\in B$,

\[\lVert\nabla Q(w^{t})\rVert-\lVert u\rVert\leq\lVert u-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert.\]

Equivalently,

\[\left(1-\frac{r}{2+r}\right)\lVert\nabla Q(w^{t})\rVert\leq\lVert u\rVert\iff\lVert\nabla Q(w^{t})\rVert\leq\frac{2+r}{2}\lVert u\rVert.\]

Therefore,

\[\lVert u-\nabla Q(w^{t})\rVert\leq\frac{r}{2+r}\lVert\nabla Q(w^{t})\rVert\leq\frac{r}{2}\lVert u\rVert,\enspace\forall u\in B.\]

Finally, substituting this into Equation (38) proves the lemma.
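The resulting inequality, $\lVert u-v\rVert\leq\frac{r}{2}\lVert u\rVert+\frac{r}{2}\lVert v\rVert$ for $u,v\in B$, can be spot-checked numerically. The sketch below samples points inside a ball of the lemma's radius around a stand-in gradient vector; the dimension and the value of $r$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 5, 0.8
grad = rng.normal(size=d)                       # stand-in for the true gradient
radius = r / (2 + r) * np.linalg.norm(grad)     # radius of the ball B

def sample_in_ball():
    """Random point within distance `radius` of grad."""
    x = rng.normal(size=d)
    return grad + x * (rng.uniform(0, radius) / np.linalg.norm(x))

# Check ||u - v|| <= (r/2)(||u|| + ||v||) for sampled u, v in B.
ok = True
for _ in range(1000):
    u, v = sample_in_ball(), sample_in_ball()
    bound = (r / 2) * (np.linalg.norm(u) + np.linalg.norm(v))
    ok = ok and np.linalg.norm(u - v) <= bound + 1e-9
assert ok
```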

Appendix D Proof of existence of M-P inverse

Proof D.1.

First of all, recall that for a matrix $A\in\mathbb{R}^{m\times n}$ with $m>n$, if $A$ has full column rank, then $A^{T}A$ is invertible.

Consider arbitrary $j,t$. First, we prove by induction that the gradients $g_{i}^{t}$ stored in $R_{j}$ are always linearly independent. The base case holds when $|R_{j}|=1$, since a single nonzero gradient is linearly independent. Now assume that when $|R_{j}|=k$, the gradients in $R_{j}$ are linearly independent. Denote by $A$ the matrix whose columns are the gradients in $R_{j}$, i.e., $A=[g]_{g\in R_{j}}$. Then $A^{T}A$ is invertible and the M-P inverse of $A$ exists, i.e., $A^{+}$ in line 29 of Algorithm 1 is well-defined.

Now suppose that $j$ receives a gradient $g$ and stores it in $R_{j}$. Then $g$ must pass the condition in line 29, i.e., $AA^{+}g\neq g$. Suppose for contradiction that $g$ is linearly dependent on $R_{j}$; then there exists $x\in\mathbb{R}^{k}$ such that $g=Ax$. Note that $A^{+}A=I$, the identity matrix, since $A$ has full column rank. This implies that $AA^{+}g=AA^{+}(Ax)=A(A^{+}A)x=Ax=g$, which is a contradiction. Hence, $g$ is linearly independent of $R_{j}$.

In the induction proof, we already showed that $A^{+}$ always exists because $R_{j}$ always consists of linearly independent gradients. This proves that the M-P inverse always exists.
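The line-29 test can be illustrated with NumPy's Moore-Penrose inverse. The matrix shapes and coefficient values below are arbitrary illustrations of the dependence check, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
# Columns of A play the role of the stored, linearly independent
# gradients in R_j (a random tall matrix has full column rank a.s.).
A = rng.normal(size=(d, 3))
A_pinv = np.linalg.pinv(A)     # Moore-Penrose inverse A^+

# A^+ A = I when A has full column rank.
assert np.allclose(A_pinv @ A, np.eye(3))

# A gradient in the column span of A satisfies A A^+ g = g,
# so it fails the condition in line 29 and is not stored ...
g_dep = A @ np.array([1.0, -2.0, 0.5])
assert np.allclose(A @ (A_pinv @ g_dep), g_dep)

# ... while a generic new gradient satisfies A A^+ g != g and is stored.
g_new = rng.normal(size=d)
assert not np.allclose(A @ (A_pinv @ g_new), g_new)
```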