Detection and Mitigation of
Byzantine Attacks in Distributed Training

Konstantinos Konstantinidis, Namrata Vaswani, and Aditya Ramamoorthy

This work was supported in part by the National Science Foundation (NSF) under grants CCF-1910840 and CCF-2115200. The material in this work has appeared in part at the 2022 IEEE International Symposium on Information Theory (ISIT). (Corresponding author: Aditya Ramamoorthy.) The authors are with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA (e-mail: {kostas, namrata, adityar}@iastate.edu).

Abstract

A plethora of modern machine learning tasks require the utilization of large-scale distributed clusters as a critical component of the training pipeline. However, abnormal Byzantine behavior of the worker nodes can derail the training and compromise the quality of the inference. Such behavior can be attributed to unintentional system malfunctions or orchestrated attacks; as a result, some nodes may return arbitrary results to the parameter server (PS) that coordinates the training. Recent work considers a wide range of attack models and has explored robust aggregation and/or computational redundancy to correct the distorted gradients.

In this work, we consider attack models ranging from strong ones ($q$ omniscient adversaries with full knowledge of the defense protocol, whose identities can change from iteration to iteration) to weak ones ($q$ randomly chosen adversaries with limited collusion abilities, whose identities change only every few iterations). Our algorithms rely on redundant task assignments coupled with detection of adversarial behavior. We also show the convergence of our method to the optimal point under common assumptions and settings considered in the literature. For strong attacks, we demonstrate a reduction in the fraction of distorted gradients ranging from 16% to 99% compared to the prior state of the art. Our top-1 classification accuracy results on the CIFAR-10 data set demonstrate a 25% advantage in accuracy (averaged over strong and weak scenarios) under the most sophisticated attacks compared to state-of-the-art methods.

Index Terms:

Byzantine resilience, distributed training, gradient descent, deep learning, optimization, security.

I Introduction and Background

Increasingly complex machine learning models with large data set sizes are nowadays routinely trained on distributed clusters. A typical setup consists of a single central machine (parameter server or PS) and multiple worker machines. The PS owns the data set, assigns gradient tasks to workers, and coordinates the protocol. The workers then compute gradients of the loss function with respect to the model parameters. These computations are returned to the PS, which aggregates them, updates the model, and maintains the global copy of it. The new copy is communicated back to the workers. Multiple iterations of this process are performed until convergence has been achieved. PyTorch [1], TensorFlow [2], MXNet [3], CNTK [4] and other frameworks support this architecture.

These setups offer significant speedup benefits and enable training challenging, large-scale models. Nevertheless, they are vulnerable to misbehavior by the worker nodes, i.e., when a subset of them returns erroneous computations to the PS, either inadvertently or on purpose. This “Byzantine” behavior can be attributed to a wide range of reasons. The principal causes of inadvertent errors are hardware and software malfunctions (e.g., [5]). Reference [6] exposes the vulnerability of neural networks to such failures and identifies weight parameters that could maximize accuracy degradation. The gradients may also be distorted in an adversarial manner. As ML problems demand more resources, many jobs are often outsourced to external commodity servers (cloud) whose security cannot be guaranteed. Thus, an adversary may be able to gain control of some devices and fool the model. The distorted gradients can derail the optimization and lead to low test accuracy or slow convergence.

Achieving robustness in the presence of Byzantine node behavior and devising training algorithms that can efficiently aggregate the gradients has inspired several works [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. The first idea is to filter the corrupted computations from the training without attempting to identify the Byzantine workers. Specifically, many existing papers use majority voting and median-based defenses [7, 8, 9, 10, 11, 12, 13] for this purpose. In addition, several works also operate by replicating the gradient tasks [14, 15, 16, 17, 18] allowing for consistency checks across the cluster. The second idea for mitigating Byzantine behavior involves detecting the corrupted devices and subsequently ignoring their calculations [19, 20, 21], in some instances paired with redundancy [17]. In this work, we propose a technique that combines the usage of redundant tasks, filtering, and detection of Byzantine workers. Our work is applicable to a broad range of assumptions on the Byzantine behavior.

There is much variability in the adversarial assumptions that prior work considers. For instance, prior work differs in the maximum number of adversaries considered, their ability to collude, their possession of knowledge involving the data assignment and the protocol, and whether the adversarial machines are chosen at random or systematically. We will initially examine our methods under strong adversarial models similar to those in prior work [22, 14, 23, 11, 10, 24, 25]. We will then extend our algorithms to tackle weaker failures that are not necessarily adversarial but rather common in commodity machines [5, 6, 26]. We expand on related work in the upcoming Section II.

II Related Work and Summary of Contributions

II-A Related Work

All work in this area (including ours) assumes a reliable parameter server that possesses the global data set and can assign specific subsets of it to workers. Robust aggregation methods have also been proposed for federated learning [27, 28]; however, as we make no assumption of privacy, our work, as well as the methods we compare with, does not apply to federated learning.

One category of defenses splits the data set into $K$ batches and assigns one to each worker with the ultimate goal of suitably aggregating the results from the workers. Early work in the area [12] established that no linear aggregation method (such as averaging) can be robust even to a single adversarial worker. This has inspired alternative methods collectively known as robust aggregation. Majority voting, geometric median, and squared-distance-based techniques fall into this category [8, 9, 10, 11, 12, 13].

One of the most popular robust aggregation techniques is known as mean-around-median or trimmed mean [11, 10]. It handles each dimension of the gradient separately and returns the average of a subset of the values that are closest to the median. Auror [25] is a variant of trimmed mean which partitions the values of each dimension into two clusters using k-means and discards the smaller cluster if the distance between the two exceeds a threshold; the values of the larger cluster are then averaged. signSGD in [26] transmits only the sign of the gradient vectors from the workers to the PS and exploits majority voting to decide the overall update; this practice reduces the communication time and denies any individual worker too much effect on the update.
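The coordinate-wise idea behind mean-around-median can be illustrated with a minimal sketch. The function name and the parameter `keep` (how many values nearest the median are averaged) are assumptions made here for illustration; the sketch is not the exact aggregator of [11, 10].

```python
from statistics import median

def mean_around_median(gradients, keep):
    """Coordinate-wise average of the `keep` values closest to the median.

    `gradients` is a list of equal-length lists (one per worker);
    `keep` is a hypothetical tuning parameter.
    """
    dim = len(gradients[0])
    result = []
    for d in range(dim):
        column = [g[d] for g in gradients]      # values of dimension d
        med = median(column)
        closest = sorted(column, key=lambda v: abs(v - med))[:keep]
        result.append(sum(closest) / keep)
    return result

# Three similar gradients and one outlier (e.g., from a Byzantine worker).
grads = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [100.0, -50.0]]
robust = mean_around_median(grads, keep=3)
```

In this toy input the outlier is farthest from the median in both dimensions, so it is excluded from both averages.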

Krum in [12] chooses a single honest worker for the next model update, discarding the data from the rest of them. The chosen gradient is the one closest to its $k\in\mathbb{N}$ nearest neighbors. In later work [24], the authors recognized that Krum may converge to an ineffectual model in the landscape of non-convex high dimensional problems, such as in neural networks. They showed that a large adversarial change to a single parameter with a minor impact on the $L^{p}$ norm can make the model ineffective. In the same work, they present an alternative defense called Bulyan to oppose such attacks. The algorithm works in two stages. In the first part, a selection set of potentially benign values is iteratively constructed. In the second part, a variant of trimmed mean is applied to the selection set. Nevertheless, if $K$ machines are used, Bulyan is designed to defend against at most $(K-3)/4$ corrupted workers.

Another category of defenses is based on redundancy and seeks resilience to Byzantines by replicating the gradient computations such that each of them is processed by more than one machine [15, 16, 17, 18]. Even though this approach requires more computation load, it comes with stronger guarantees of correcting the erroneous gradients. Existing redundancy-based techniques are sometimes combined with robust aggregation [16]. The main drawback of recent work in this category is that the training can be easily disrupted by a powerful, omniscient adversary that has full control of a subset of the nodes and can mount judicious attacks [14].

The redundancy-based method DRACO [17] uses a simple Fractional Repetition Code (FRC) (that operates by grouping workers) and the cyclic repetition code introduced in [29, 30] to ensure robustness; majority voting and Fourier decoders are used to alleviate the adversarial effects. DRACO ensures exact recovery (as if the system had no adversaries) with $q$ Byzantine nodes, when each task is replicated $r\geq 2q+1$ times; the bound is the information-theoretic minimum, and DRACO is not applicable if it is violated. Nonetheless, this requirement is very restrictive under the typical assumption that up to half of the workers can be Byzantine.
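To see why $r\geq 2q+1$ replicas suffice for a single task, consider the following minimal sketch of a majority-vote decoder. It only illustrates the voting step on one replicated task under the assumption that honest replicas return bit-identical values; it is not DRACO's FRC or cyclic-code implementation, and the function name is hypothetical.

```python
from collections import Counter

def majority_vote(replicas):
    """Return the value reported by a strict majority of replicas, else None."""
    value, count = Counter(replicas).most_common(1)[0]
    return value if count > len(replicas) / 2 else None

q = 2                       # number of Byzantine workers
r = 2 * q + 1               # replication factor meeting the bound
true_gradient = (0.5, -1.0) # tuples are hashable, so Counter can count them
replicas = [true_gradient] * (r - q) + [(9.9, 9.9)] * q
recovered = majority_vote(replicas)

# With only r = 2q replicas, colluding Byzantines can force a tie.
tied = majority_vote([true_gradient] * q + [(9.9, 9.9)] * q)
```

With $r=2q+1$ the $q+1$ honest copies always outnumber the corrupted ones, whereas with $r=2q$ the vote can be tied and no strict majority exists.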

DETOX in [16] extends DRACO and uses a simple grouping strategy to assign the gradients. It performs multiple stages of aggregation to gradually filter the adversarial values. The first stage involves majority voting, while the following stages perform robust aggregation. Unlike DRACO, the authors do not seek exact recovery; hence the minimum requirement in $r$ is small. However, the theoretical resilience guarantees that DETOX provides depend heavily on a “random choice” of the adversarial workers. In fact, we have crafted simple attacks [14] to make this aggregator fail under a more careful choice of adversaries. Furthermore, their theoretical results hold when the fraction of Byzantines is less than $1/40$ .

A third category focuses on ranking and/or detection [19, 17, 20]; the objective is to rank workers using a reputation score to identify suspicious machines and exclude them or give them lower weight in the model update. This is achieved by means of computing reputation scores for each machine or by using ideas from coding theory to assign tasks to workers (encoding) and to detect the adversaries (decoding). Zeno in [20] ranks each worker using a score that depends on the estimated loss and the magnitude of the update. Zeno requires strict assumptions on the smoothness of the loss function and the gradient estimates’ variance to tolerate an adversarial majority in the cluster. Similarly, ByGARS [19] computes reputation scores for the nodes based on an auxiliary data set; these scores are used to weigh the contribution of each gradient to the model update.

II-B Contributions

In this paper, we propose novel techniques which combine redundancy, detection, and robust aggregation for Byzantine resilience under a range of attack models and assumptions on the dataset/loss function.

Our first scheme Aspis is a subset-based assignment method for allocating tasks to workers in strong adversarial settings: up to $q$ omniscient, colluding adversaries that can change at each iteration. We also consider weaker attacks: adversaries chosen randomly with limited collusion abilities, changing only after a few iterations at a time. It is conceivable that Aspis should continue to perform well with weaker attacks. However, as discussed later (Section V-B), Aspis requires large batch sizes (for the mini-batch SGD). It is well-recognized that large batch sizes often cause performance degradation in training [31]. Accordingly, for this class of attacks, we present a different algorithm called Aspis+ that can work with much smaller batch sizes. Both Aspis and Aspis+ use combinatorial ideas to assign the tasks to the worker nodes. Our work builds on our initial work in [22] and makes the following contributions.

• We demonstrate a worst-case upper bound (under any possible attack) on the fraction of corrupted gradients when Aspis is used. Even in this adverse scenario, our method enjoys a reduction in the fraction of corrupted gradients of more than 90% compared with DETOX [16]. A weaker variation of this attack is where the adversaries do not collude and act randomly. In this case, we demonstrate that the Aspis protocol allows for detecting all the adversaries. In both scenarios, we provide theoretical guarantees on the fraction of corrupted gradients.

• In the setting where the dataset is distributed i.i.d. and the loss function is strongly convex and other technical conditions hold, we demonstrate a proof of convergence for Aspis. We demonstrate numerical results on the linear regression problem in this part; these show the advantage of Aspis over competing methods such as DETOX.

• For weaker attacks (discussed above), our experimental results indicate that Aspis+ detects all adversaries within approximately 5 iterations.

• We present top-1 classification accuracy experiments on the CIFAR-10 [32] data set for various gradient distortion attacks coupled with choice/behavior patterns of the adversarial nodes. Under the most sophisticated distortion methods [23], the performance gap between Aspis/Aspis+ and other state-of-the-art methods is substantial, e.g., for Aspis it is 43% in the strong scenario (cf. Figure 7(a)), and for Aspis+ 19% in the weak scenario (cf. Figure 13).

Figure 1: Aggregation of gradients on a cluster.

III Distributed Training Formulation

Assume a loss function $l_{i}(\mathbf{w})$ for the $i^{\mathrm{th}}$ sample of the dataset where $\mathbf{w}\in\mathbb{R}^{d}$ is the set of parameters of the model (the paper’s heavily-used notation is summarized in Appendix Table II). The objective of distributed training is to minimize the empirical loss function $\hat{L}(\mathbf{w})$ with respect to $\mathbf{w}$ , where

\hat{L}(\mathbf{w})=\frac{1}{n}\sum\limits_{i=1}^{n}l_{i}(\mathbf{w}).

Here $n$ denotes the number of samples.

We use either gradient descent (GD) or mini-batch Stochastic Gradient Descent (SGD) to solve this optimization. In both methods, initially $\mathbf{w}$ is randomly set to $\mathbf{w}_{0}$ ( $\mathbf{w}_{t}$ is the model state at the end of iteration $t$ ). When using GD, the update equation is

\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\frac{1}{n}\sum\limits_{i=1}^{n}\nabla l_{i}(\mathbf{w}_{t}). \tag{1}

Under mini-batch SGD a random batch $B_{t}$ of $b$ samples is chosen to perform the update in the $t^{\mathrm{th}}$ iteration. Thus,

\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\frac{1}{|B_{t}|}\sum\limits_{i\in B_{t}}\nabla l_{i}(\mathbf{w}_{t}). \tag{2}

In both methods, $\eta_{t}$ is the learning rate at the $t^{\mathrm{th}}$ iteration. The workers, denoted $U_{1},U_{2},\dots,U_{K}$ , compute gradients on subsets of the batch. The training is synchronous, i.e., the PS waits for all workers to return before performing an update. The PS stores the data set and the model and coordinates the protocol. GD can be viewed as an instance of mini-batch SGD in which the batch at each iteration is the entire dataset. Our discussion below is in the context of mini-batch SGD but applies to the GD case through this observation.
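The mini-batch update in (2) can be sketched on a toy problem. The per-sample loss below, $l_{i}(\mathbf{w})=\frac{1}{2}(\mathbf{x}_{i}^{\top}\mathbf{w}-y_{i})^{2}$, the synthetic data, the constant learning rate, and all function names are assumptions chosen for illustration; they are not the models or hyperparameters used in the experiments of this paper.

```python
import random

def grad_sample(w, sample):
    """Gradient of the assumed loss 0.5 * (x.w - y)^2 for one sample."""
    x, y = sample
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def sgd_step(w, batch, eta):
    """One update: w <- w - eta * (1/|B_t|) * sum of per-sample gradients."""
    total = [0.0] * len(w)
    for s in batch:
        total = [a + b for a, b in zip(total, grad_sample(w, s))]
    return [wi - eta * ti / len(batch) for wi, ti in zip(w, total)]

random.seed(0)
# Noise-free samples of y = 2 + 3x with features (1, x).
data = [((1.0, float(i)), 2.0 + 3.0 * i) for i in range(-5, 6)]
w = [0.0, 0.0]                                # w_0
for t in range(2000):
    batch = random.sample(data, 4)            # random batch B_t with b = 4
    w = sgd_step(w, batch, eta=0.05)
```

Setting the batch to the whole data set in every iteration turns the same `sgd_step` into the GD update in (1).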

We consider settings in this work that depend on the underlying assumptions on the dataset and the loss function. Setting-I does not make any assumption on the dataset or the loss function. In Setting-II at the top-level (technical details appear in Section VI) we assume that the data samples are distributed i.i.d. and the loss function is strongly-convex. The results that we provide depend on the underlying setting.

Task assignment: Each batch $B_{t}$ is split into $f$ disjoint subsets $\{B_{t,i}\}_{i=0}^{f-1}$ , which are then assigned to the workers according to our placement policy. In what follows we refer to these as “files” to avoid confusion with other subsets that we need to refer to. Computational redundancy is introduced by assigning a given file to $r>1$ workers. As the load on all the workers is equal it follows that each worker is responsible for $l=fr/K$ files ( $l$ is the computation load). We let $\mathcal{N}^{w}(U_{j})$ be the set of files assigned to worker $U_{j}$ and $\mathcal{N}^{f}(B_{t,i})$ be the group of workers assigned to file $B_{t,i}$ ; our placement scheme is such that $\mathcal{N}^{f}(B_{t,i})$ uniquely identifies the file $B_{t,i}$ ; thus, we will sometimes refer to the file $B_{t,i}$ by its worker assignment, $\mathcal{N}^{f}(B_{t,i})$ . We will also occasionally use the term group (of the assigned workers) to refer to a file. We discuss the actual placement algorithms used in this work in the upcoming subsection III-A.

Training: Each worker $U_{j}$ is given the task of computing the sum of the gradients on each of its assigned files. For example, if file $B_{t,i}$ is assigned to $U_{j}$ , then it calculates $\sum_{i^{\prime}\in B_{t,i}}\nabla l_{i^{\prime}}(\mathbf{w}_{t})$ and returns it to the PS. In every iteration, once the PS has received the results from all the workers, it runs our detection algorithm in an effort to identify the $q$ adversaries and acts according to the detection outcome.

Figure 1 depicts this process. There are $K=6$ machines and $f=4$ distinct files (represented by colored circles) replicated $r=3$ times. Each worker is assigned to $l=2$ files and computes the sum of gradients (or a distorted value) on each of them. The “d” ellipses refer to the PS’s detection operations immediately after receiving all the gradients. (Some arrows and ellipses have been omitted from Figure 1; however, all files go through detection.)

Metrics: We consider various metrics in our work. For Setting-I we consider (i) the fraction of distorted files, and (ii) the top-1 test accuracy of the final trained model. For the distortion fraction, let us denote the number of distorted files upon detection and aggregation by $c^{(q)}$ and its maximum value (under a worst-case attack) by $c_{\mathrm{max}}^{(q)}$ . The distortion fraction is $\epsilon:=c^{(q)}/f$ . The top-1 test accuracy is determined via numerical experiments. In Setting-II, in addition we consider proofs and rates of convergence of the proposed algorithms. We provide theoretical results and supporting experimental results on these.

TABLE I: Adversarial models considered in literature.

Scheme	Byzantine choice/orchestration	Gradient distortion
DRACO [17]	optimal	reversed gradient, constant
DETOX [16]	random	ALIE, reversed gradient, constant
ByzShield [14]	optimal	ALIE, reversed gradient, constant
Bulyan [24]	N/A	$\ell_{2}$ -norm attack targeted on Bulyan
Multi-Krum [12]	N/A	random high-variance Gaussian vector
Aspis	ATT-1, ATT-2	ALIE, FoE, reversed gradient
Aspis+	ATT-3	ALIE, constant

III-A Task Assignment

Let $\mathcal{U}$ be the set of workers. Our scheme has $|\mathcal{U}|\leq f$ (i.e., no more workers than files). Our assignment of files to worker nodes is specified by a bipartite graph $\mathbf{G}_{task}$ where the left vertices correspond to the workers, and the right vertices correspond to the files. An edge in $\mathbf{G}_{task}$ between worker $U_{i}$ and a file $B_{t,j}$ indicates that $U_{i}$ is responsible for processing file $B_{t,j}$ .

III-A1 Aspis

For the Aspis scheme we construct $\mathbf{G}_{task}$ as follows. The left vertex set is $\{1,2,\dots,K\}$ and the right vertex set corresponds to $r$ -sized subsets of $\{1,2,\dots,K\}$ (there are $\binom{K}{r}$ of them). An edge between $1\leq i\leq K$ and $S\subset\{1,2,\dots,K\}$ (where $|S|=r$ ) exists if $i\in S$ . The worker set $\{U_{1},\dots,U_{K}\}$ is in one-to-one correspondence with $\{1,2,\dots,K\}$ and the files $B_{t,0},\dots,B_{t,f-1}$ are in one-to-one correspondence with the $r$ -sized subsets.

Example 1.

Consider $K=7$ workers $U_{1},U_{2}\dots,U_{7}$ and $r=3$ . Based on our protocol, the $f=\binom{7}{3}=35$ files of each batch $B_{t}$ are associated one-to-one with 3-subsets of $\mathcal{U}$ , e.g., the subset $S=\{U_{1},U_{2},U_{3}\}$ corresponds to file $B_{t,0}$ and will be processed by $U_{1}$ , $U_{2}$ , and $U_{3}$ .
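The subset-based placement of Example 1 can be sketched as follows. The lexicographic ordering of the subsets (and hence which file index maps to which subset beyond $B_{t,0}$) is an assumption of this sketch, as is the function name.

```python
from itertools import combinations
from math import comb

def aspis_assignment(K, r):
    """Map file index -> set of r workers (labelled 1..K), one file per r-subset."""
    return {i: set(S) for i, S in enumerate(combinations(range(1, K + 1), r))}

K, r = 7, 3
files = aspis_assignment(K, r)
f = len(files)                                   # number of files per batch

# Computation load of each worker, expected to equal l = f * r / K.
load = {j: sum(1 for S in files.values() if j in S) for j in range(1, K + 1)}

# Number of files shared by workers U_1 and U_2.
common = sum(1 for S in files.values() if {1, 2} <= S)
```

For $K=7$ and $r=3$ this gives $f=35$ files, a load of $l=15$ files per worker, and $\binom{K-2}{r-2}=5$ files shared by any two workers.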

Remark 1.

Our task assignment ensures that every pair of workers has $\binom{K-2}{r-2}$ files in common. Moreover, the number of adversaries is $q<K/2$ . Thus, upon receiving the gradients from the workers, the PS can examine them for consistency and flag certain nodes as adversarial if their computed gradients differ from those of $q+1$ or more of the other nodes. We use this intuition to detect and mitigate the adversarial effects and compute the fraction of corrupted files.

III-A2 Aspis+

For Aspis+, we use combinatorial designs [33] to assign the gradient tasks to workers. Formally, a design is a pair $(X,\mathcal{A})$ consisting of a set of $v$ elements (points), $X$ , and a family $\mathcal{A}$ (i.e., multiset) of nonempty subsets of $X$ called blocks, where each block has the same cardinality $k$ . Similar to Aspis, the workers and files are in one-to-one correspondence with the points and the blocks, respectively. Hence, for our purposes, the $k$ parameter of the design is the redundancy. A $t-(v,k,\lambda)$ design is one where any subset of $t$ points appears together in exactly $\lambda$ blocks. The case of $t=2$ has been studied extensively in the literature and is referred to as a balanced incomplete block design (BIBD). A bipartite graph representing the incidence between the points and the blocks can be obtained naturally by letting the points correspond to the left vertices, and the blocks correspond to the right vertices. An edge exists between a point and a block if the point is contained in the block.

Example 2.

A $2-(7,3,1)$ design, also known as the Fano plane, consists of the $v=7$ points $X=\{1,2,\dots,7\}$ and the block multiset $\mathcal{A}$ contains the blocks $\{1,2,3\}$ , $\{1,4,7\}$ , $\{2,4,6\}$ , $\{3,4,5\}$ , $\{2,5,7\}$ , $\{1,5,6\}$ and $\{3,6,7\}$ with each block being of size $k=3$ . In the bipartite graph $\mathbf{G}_{task}$ representation, we would have an edge, e.g., between point $2$ and blocks $\{1,2,3\},\{2,4,6\}$ , and $\{2,5,7\}$ .

In Aspis+ we construct $\mathbf{G}_{task}$ by the bipartite graph representing an appropriate $2-(v,k,\lambda)$ design.

Another change compared to the Aspis placement scheme is that the points of the design will be randomly permuted at each iteration, i.e., for permutation $\pi$ , the PS will map $\{U_{1},U_{2},\dots,U_{K}\}\xrightarrow{\pi}\{\pi(U_{1}),\pi(U_{2}),\dots,\pi(U_{K})\}$ . For instance, let us circularly permute the points of the Fano plane in Example 2 as $\pi(U_{i})=U_{i+1},i=1,2,\dots,K-1$ and $\pi(U_{K})=U_{1}$ . Then, the file assignment at the next iteration will be based on the block collection $\mathcal{A}=\{\{2,3,4\},\{1,2,5\},\{3,5,7\},\{4,5,6\},\{1,3,6\},\{2,6,7\},\{1,4,7\}\}$ . Permuting the assignment causes each Byzantine to disagree with more workers and to be detected in fewer iterations; details will be discussed in Section V-C. Owing to this permutation, we use a time subscript for the files assigned to $U_{i}$ for the $t^{\mathrm{th}}$ iteration; this is denoted by $\mathcal{N}_{t}^{w}(U_{i})$ .
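The Fano-plane placement of Example 2 and the circular permutation described above can be checked with a short sketch. The helper names are hypothetical; in the protocol the permutation is drawn at random, whereas this sketch uses the fixed circular shift from the text.

```python
from itertools import combinations

# Blocks of the 2-(7,3,1) design from Example 2.
FANO = [{1, 2, 3}, {1, 4, 7}, {2, 4, 6}, {3, 4, 5}, {2, 5, 7}, {1, 5, 6}, {3, 6, 7}]

def pair_counts(blocks, v):
    """Number of blocks in which each pair of points appears together."""
    return {p: sum(1 for b in blocks if set(p) <= b)
            for p in combinations(range(1, v + 1), 2)}

def permute(blocks, pi):
    """Apply the point permutation `pi` (a dict) to every block."""
    return [{pi[x] for x in b} for b in blocks]

v = 7
shift = {i: i % v + 1 for i in range(1, v + 1)}   # i -> i+1, and 7 -> 1
next_blocks = permute(FANO, shift)
```

Every pair of points appears in exactly $\lambda=1$ block both before and after the permutation, so the permuted placement is again a $2-(7,3,1)$ design.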

IV Adversarial Attack Models and Gradient Distortion Methods

We now discuss the different Byzantine models that we consider in this work. For all the models, we assume that at most $q<K/2$ workers can be adversarial. For each assigned file $B_{t,i}$ a worker $U_{j}$ will return the value $\hat{\mathbf{g}}_{t,i}^{(j)}$ to the PS. Then,

\hat{\mathbf{g}}_{t,i}^{(j)}=\left\{\begin{array}{ll}\mathbf{g}_{t,i}&\text{if }U_{j}\text{ is honest},\\ \mathbf{*}&\text{otherwise},\end{array}\right. \tag{3}

where $\mathbf{g}_{t,i}$ is the sum of the loss gradients on all samples in file $B_{t,i}$ , i.e.,

\mathbf{g}_{t,i}=\sum\limits_{j\in B_{t,i}}\nabla l_{j}(\mathbf{w}_{t})

and $\mathbf{*}$ is any arbitrary vector in $\mathbb{R}^{d}$ . Within this setup, we examine adversarial scenarios that differ based on the behavior of the workers. Table I provides a high-level summary of the Byzantine models considered in this work as well as in related papers. As we will discuss in Section VIII-B, for those schemes that do not involve redundancy and merely split the work equally among the $K$ workers, all possible choices of the Byzantine set are equivalent, and no orchestration of them will change the defense’s output; hence, those cases are denoted by “N/A” in the table. (We use the term orchestration to refer to the method adversaries use to collude and attack collectively as a group.)

IV-A Attack 1

We first consider a weak attack, denoted ATT-1, where the Byzantine nodes operate independently (i.e., do not collude) and attempt to distort the gradient on any file they participate in. For instance, a node may try to return arbitrary gradients on all its assigned files. For this attack, the identity of the Byzantine workers may be arbitrary at each iteration as long as there are at most $q$ of them.

Remark 2.

We emphasize that even though we call this attack “weak”, this is the attack model considered in several prior works [16, 17]. To the best of our knowledge, most of them have not considered the adversarial problem through the lens of detection.

IV-B Attack 2

Our second scenario, named ATT-2, is the strongest one we consider. We assume that the adversaries have full knowledge of the task assignment at each iteration and the detection strategies employed by the PS. The adversaries can collude in the “best” possible way to corrupt as many gradients as possible. Moreover, the set of adversaries can also change from iteration to iteration as long as there are at most $q$ of them.

IV-C Attack 3

This attack is similar to ATT-1 and will be called ATT-3. On the one hand, it is weaker in the sense that the set of Byzantines (denoted $A$ ) does not change in every iteration. Instead, we will assume that there is a “Byzantine window” of $T_{b}$ iterations in which the set $A$ remains fixed. Also, the set $A$ will be a randomly chosen set of $q$ workers from $\mathcal{U}$ , i.e., it will not be chosen systematically. A new set will be chosen at random at all iterations $t$ , where $t\equiv 0$ (mod $T_{b}$ ). Conversely, it is stronger than ATT-1 since we allow for limited collusion amongst the adversarial nodes. In particular, the Byzantines simulated by ATT-3 will distort only the files for which a Byzantine majority exists.
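The ATT-3 behavior can be sketched as a small simulation. Several choices below are assumptions made for illustration: a “Byzantine majority” is taken to mean that strictly more than half of a file's $r$ workers are Byzantine, the Fano-plane placement of Example 2 is held fixed across iterations (Aspis+ additionally permutes it), and the function names and parameter values are hypothetical.

```python
import random

# Blocks of the 2-(7,3,1) design from Example 2 (file index = list position).
FANO = [{1, 2, 3}, {1, 4, 7}, {2, 4, 6}, {3, 4, 5}, {2, 5, 7}, {1, 5, 6}, {3, 6, 7}]

def choose_byzantines(workers, q, rng):
    """Pick q workers uniformly at random (no systematic choice)."""
    return set(rng.sample(sorted(workers), q))

def byzantine_majority_files(blocks, A):
    """Indices of files whose worker group contains a strict Byzantine majority."""
    return [i for i, b in enumerate(blocks) if 2 * len(b & A) > len(b)]

rng = random.Random(0)
workers = set(range(1, 8))
q, T_b = 3, 5
history = []
A = set()
for t in range(12):
    if t % T_b == 0:                      # a new Byzantine window starts
        A = choose_byzantines(workers, q, rng)
    history.append((frozenset(A), byzantine_majority_files(FANO, A)))
```

For example, with $A=\{U_{1},U_{2},U_{4}\}$ the distorted files are the three blocks containing two of these points, while with $A=\{U_{1},U_{2},U_{3}\}$ only the single block $\{1,2,3\}$ has a Byzantine majority.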

IV-D Gradient Distortion Methods

For each of the attacks considered above, the adversaries can distort the gradient in specific ways. Several such techniques have been considered in the literature, and our numerical experiments use them to compare the different defenses. For instance, ALIE [23] involves communication among the Byzantines in which they jointly estimate the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ of the batch’s gradient for each dimension $i$ and subsequently use them to construct a distorted gradient that attempts to distort the median of the results. Another powerful attack is Fall of Empires (FoE) [34], which performs “inner product manipulation” to make the inner product between the true gradient and the robust estimator negative even when their distance is upper bounded by a small value. Reversed gradient distortion returns $-c\mathbf{g}$ , for $c>0$ , to the PS instead of the true gradient $\mathbf{g}$ . The constant attack involves the Byzantine workers sending a constant gradient with all elements equal to a fixed value. To the best of our knowledge, the ALIE algorithm is the most sophisticated attack in the literature for deep learning techniques.
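The two simplest distortions above, reversed gradient and constant, can be sketched directly from their descriptions. The function names and the example values of $c$ and of the constant are assumptions; ALIE and FoE are not reproduced here because their construction depends on details given in [23] and [34].

```python
def reversed_gradient(g, c=1.0):
    """Return -c * g for a scaling factor c > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return [-c * x for x in g]

def constant_gradient(dim, value):
    """Return a vector of length `dim` with every element equal to `value`."""
    return [value] * dim

g = [0.5, -2.0, 1.0]                    # true gradient of some file
rev = reversed_gradient(g, c=2.0)
const = constant_gradient(len(g), value=-1.0)
```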

V Defense Strategies in Aspis and Aspis+

In our work we use the Aspis task assignment and detection strategy for attacks ATT-1 and ATT-2. For ATT-3, we will use Aspis+. Recall that the methods differ in their corresponding task assignments. Nevertheless, the central idea in both detection methods is for the PS to apply a set of consistency checks on the obtained gradients from the different workers at each iteration to identify the adversaries.

Let the current set of adversaries be $A\subset\{U_{1},U_{2},\dots,U_{K}\}$ with $|A|=q$ ; also, let $H$ be the honest worker set. The set $A$ is unknown, but our goal is to provide an estimate $\hat{A}$ of it. Ideally, the two sets should be identical. In general, depending on the adversarial behavior, we will be able to provide a set $\hat{A}$ such that $\hat{A}\subseteq A$ . For each file, there is a group of $r$ workers which have processed it, and there are ${r\choose 2}$ pairs of workers in each group. Each such pair may or may not agree on the gradient value for the file. For iteration $t$ , let us encode the agreement of workers $U_{j_{1}},U_{j_{2}}$ on a common file $i$ by

\alpha_{t,i}^{(j_{1},j_{2})}:=\left\{\begin{array}{ll}1&\text{if }\hat{\mathbf{g}}_{t,i}^{(j_{1})}=\hat{\mathbf{g}}_{t,i}^{(j_{2})},\\ 0&\text{otherwise}.\end{array}\right. \tag{4}

Across all files, the total number of agreements between a pair of workers $U_{j_{1}},U_{j_{2}}$ during the $t^{\mathrm{th}}$ iteration is denoted by

\alpha_{t}^{(j_{1},j_{2})}:=\sum_{i\in\mathcal{N}_{t}^{w}(U_{j_{1}})\cap\mathcal{N}_{t}^{w}(U_{j_{2}})}\alpha_{t,i}^{(j_{1},j_{2})}. \tag{5}

Since the placement is known, the PS can always perform the above computation. Next, we form an undirected graph $\mathbf{G}_{t}$ whose vertices correspond to all workers $\{U_{1},U_{2},\dots,U_{K}\}$ . An edge $(U_{j_{1}},U_{j_{2}})$ exists in $\mathbf{G}_{t}$ only if the computed gradients (at iteration $t$ ) of $U_{j_{1}}$ and $U_{j_{2}}$ match in “all” their common assignments.
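The construction of $\mathbf{G}_{t}$ from the pairwise agreement counts can be sketched as follows. The data layout (dictionaries keyed by file and by worker-file pair), the function name, and the use of exact equality on tuples are assumptions of this sketch; it also assumes that every pair of workers shares at least one file, as is the case for the placements discussed in Section III-A.

```python
from itertools import combinations

def agreement_graph(assignment, returned):
    """Build the edge set of the agreement graph for one iteration.

    assignment: file index -> set of workers that processed it
    returned:   (worker, file index) -> returned gradient (a tuple)
    An edge (j1, j2), j1 < j2, is added only if the two workers agree
    on all of their common files.
    """
    workers = sorted({j for S in assignment.values() for j in S})
    edges = set()
    for j1, j2 in combinations(workers, 2):
        common = [i for i, S in assignment.items() if j1 in S and j2 in S]
        agreements = sum(returned[(j1, i)] == returned[(j2, i)] for i in common)
        if common and agreements == len(common):
            edges.add((j1, j2))
    return edges

# Toy run: K = 4, r = 2 (all 2-subsets), worker 4 distorts all of its files.
assignment = {i: set(S) for i, S in enumerate(combinations(range(1, 5), 2))}
byzantine = {4}
returned = {(j, i): ((-1.0,) if j in byzantine else (float(i),))
            for i, S in assignment.items() for j in S}
edges = agreement_graph(assignment, returned)
```

In this toy run the honest workers remain pairwise connected, while the Byzantine worker loses all of its edges.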

V-A Aspis Detection Rule

In what follows, we suppress the iteration index $t$ since the Aspis algorithm is the same for each iteration. For the Aspis task assignment (cf. Section III-A1), any two workers, $U_{j_{1}}$ and $U_{j_{2}}$ , have ${{K-2}\choose{r-2}}$ common files.

Let us index the $q$ adversaries in $A=\{A_{1},A_{2},\dots,A_{q}\}$ and the honest workers in $H$ . We say that two workers $U_{j_{1}}$ and $U_{j_{2}}$ disagree if there is no edge between them in $\mathbf{G}$ . The non-existence of an edge between $U_{j_{1}}$ and $U_{j_{2}}$ only means that they disagree in at least one of the $\binom{K-2}{r-2}$ files that they jointly participate in. For corrupting the gradients, each adversary has to disagree on the computations with a subset of the honest workers. An adversary may also disagree with other adversaries.

A clique in an undirected graph is defined as a subset of vertices with an edge between any pair of them. A maximal clique is one that cannot be enlarged by adding additional vertices to it. A maximum clique is one such that there is no clique with more vertices in the given graph. We note that the set of honest workers $H$ will pair-wise agree on all common tasks. Thus, $H$ forms a clique (of size $K-q$ ) within $\mathbf{G}$ . The clique containing the honest workers may not be maximal. However, it will have a size of at least $K-q$ . Let the maximum clique on $\mathbf{G}$ be $M_{\mathbf{G}}$ . Any worker $U_{j}$ with $\deg(U_{j})<K-q-1$ will not belong to a maximum clique and can right away be eliminated as a “detected” adversary.

Input: Computed gradients $\hat{\mathbf{g}}_{t,i}^{(j)}$ , $i=0,1,\dots,f-1$ , $j=1,2,\dots,K$ , redundancy $r$ and empty graph $\mathbf{G}$ with worker vertices $\mathcal{U}$ .

1 for each pair $(U_{j_{1}},U_{j_{2}}),j_{1}\neq j_{2}$ of workers do
2     PS computes the number of agreements $\alpha^{(j_{1},j_{2})}$ of the pair $U_{j_{1}},U_{j_{2}}$ on the gradient value.
3     if $\alpha^{(j_{1},j_{2})}={{K-2}\choose{r-2}}$ then
4         Connect vertex $U_{j_{1}}$ to vertex $U_{j_{2}}$ in $\mathbf{G}$ .
5     end if
7 end for
9 PS enumerates all $k$ maximum cliques $M_{\mathbf{G}}^{(1)},M_{\mathbf{G}}^{(2)},\dots,M_{\mathbf{G}}^{(k)}$ in $\mathbf{G}$ .
10 if there is a unique maximum clique $M_{\mathbf{G}}$ ( $k=1$ ) then
11     PS determines the honest workers $H=M_{\mathbf{G}}$ and the adversarial machines $\hat{A}=\mathcal{U}-M_{\mathbf{G}}$ .
12 else
13     PS declares unsuccessful detection.
14 end if

Algorithm 1 Proposed Aspis graph-based detection.

Input: Data set of $n$ samples, batch size $b$, computation load $l$, redundancy $r$, number of files $f$, maximum iterations $T$, file assignments $\{\mathcal{N}^{w}(U_{i})\}_{i=1}^{K}$, robust estimator function $\widehat{\mathrm{med}}$.

1 The PS randomly initializes the model’s parameters to $\mathbf{w}_{0}$.
2 for $t=0$ to $T-1$ do
3   PS chooses a random batch $B_{t}\subseteq\{1,2,\dots,n\}$ of $b$ samples, partitions it into $f$ files $\{B_{t,i}\}_{i=0}^{f-1}$ and assigns them to workers according to $\{\mathcal{N}^{w}(U_{i})\}_{i=1}^{K}$. It then transmits $\mathbf{w}_{t}$ to all workers.
4   for each worker $U_{j}$ do
5     if $U_{j}$ is honest then
6       for each file $i\in\mathcal{N}^{w}(U_{j})$ do
7         $U_{j}$ computes the sum of gradients $\hat{\mathbf{g}}_{t,i}^{(j)}=\sum\limits_{k\in B_{t,i}}\nabla l_{k}(\mathbf{w}_{t})$.
8       end for
9     else
10       $U_{j}$ constructs $l$ adversarial vectors $\hat{\mathbf{g}}_{t,i_{1}}^{(j)},\hat{\mathbf{g}}_{t,i_{2}}^{(j)},\dots,\hat{\mathbf{g}}_{t,i_{l}}^{(j)}$.
11     end if
12     $U_{j}$ returns $\hat{\mathbf{g}}_{t,i_{1}}^{(j)},\hat{\mathbf{g}}_{t,i_{2}}^{(j)},\dots,\hat{\mathbf{g}}_{t,i_{l}}^{(j)}$ to the PS.
13   end for
14   PS runs a detection algorithm to identify the adversaries.
15   if detection is successful then
16     Let $H$ be the detected honest workers. Initialize a non-corrupted gradient set as $\mathcal{G}=\emptyset$.
17     for each file in $\{B_{t,i}\}_{i=0}^{f-1}$ do
18       PS chooses the gradient of a worker in $\mathcal{N}^{f}(B_{t,i})\cap H$ (if non-empty) and adds it to $\mathcal{G}$.
19     end for
20     PS updates the model via $\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\frac{1}{|\mathcal{G}|}\sum\limits_{\mathbf{g}\in\mathcal{G}}\mathbf{g}$.
21   else
22     for each file in $\{B_{t,i}\}_{i=0}^{f-1}$ do
23       PS determines the $r$ workers in $\mathcal{N}^{f}(B_{t,i})$ which have processed $B_{t,i}$ and computes $\mathbf{m}_{i}=\mathrm{majority}\left\{\hat{\mathbf{g}}_{t,i}^{(j)}:U_{j}\in\mathcal{N}^{f}(B_{t,i})\right\}$.
24     end for
25     PS updates the model via $\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\times\widehat{\mathrm{med}}\{\mathbf{m}_{i}:i=0,1,\dots,f-1\}$.
26   end if
27 end for

Algorithm 2 Proposed Aspis/Aspis+ aggregation protocol to alleviate Byzantine effects.

The essential idea of our detection is to run a clique-finding algorithm on $\mathbf{G}$ (summarized in Algorithm 1). The detection may be successful or unsuccessful depending on which attack is used; we discuss this in more detail shortly.

We note that clique-finding is well known to be an NP-complete problem [35]. Nevertheless, there are fast, practical algorithms with excellent performance on graphs even up to hundreds of nodes [36, 37]. Specifically, the authors of [37] have shown that their proposed algorithm, which enumerates all maximal cliques, has complexity similar to that of other methods [38, 39], which are used to find a single maximum clique. We utilize this algorithm. Our extensive experimental evidence suggests that clique-finding is not a computational bottleneck for the size and structure of the graphs that Aspis uses. We have experimented with clique-finding on a graph of $K=100$ workers and $r=5$ for different values of $q$; in all cases, enumerating all maximal cliques took no more than 15 milliseconds. These experiments and the asymptotic complexity of the entire protocol are addressed in Supplement Section XI-A.

During aggregation (see Algorithm 2), the PS will perform a majority vote across the computations of each file (implementation details in Supplement Section XI-B). Recall that $r$ workers have processed each file. For each such file $B_{t,i}$ , the PS decides a majority value $\mathbf{m}_{i}$

$$\mathbf{m}_{i}:=\mathrm{majority}\left\{\hat{\mathbf{g}}_{t,i}^{(j)}:U_{j}\in\mathcal{N}^{f}(B_{t,i})\right\}.\tag{6}$$

Assume that $r$ is odd and let $r^{\prime}=\frac{r+1}{2}$ . Under the rule in Eq. (6), the gradient on a file is distorted only if at least $r^{\prime}$ of the computations are performed by Byzantines. Following the majority vote, we will further filter the gradients using a robust estimator $\widehat{\mathrm{med}}$ (see Algorithm 2, line 25). This robust estimator is either the coordinate-wise median or the geometric median; a similar setup was considered in [14, 16]. For example, in Figure 1, all returned values for the red file will be evaluated by a majority vote function on the PS, which decides a single output value; a similar voting is done for the other 3 files. After the voting process, Aspis applies the robust estimator $\widehat{\mathrm{med}}$ on the “winning” gradients $\mathbf{m}_{i}$ , $i=0,1,\dots,f-1$ .
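The two-stage aggregation described above (majority vote per file, then a robust estimator over the winning gradients) can be illustrated with a minimal Python sketch. It assumes that gradient copies are compared by exact equality and uses the coordinate-wise median as $\widehat{\mathrm{med}}$; the function names and the toy input are hypothetical and are not taken from the implementation described in the Supplement.

```python
from collections import Counter
from statistics import median

def majority(votes):
    """Return the most frequent gradient among the r copies of one file."""
    # Copies are compared exactly, so each vote is stored as a hashable tuple.
    counts = Counter(tuple(v) for v in votes)
    winner, _ = counts.most_common(1)[0]
    return list(winner)

def coordinate_wise_median(vectors):
    """Apply the scalar median independently to every coordinate."""
    return [median(coords) for coords in zip(*vectors)]

def aggregate(returned):
    """returned[i] holds the r gradient copies that the workers sent for file i."""
    winners = [majority(copies) for copies in returned]
    return coordinate_wise_median(winners)

# Toy input (assumed): f = 3 files, r = 3 copies each, gradients of dimension 2.
returned = [
    [[1.0, 2.0], [1.0, 2.0], [9.0, 9.0]],   # one distorted copy, outvoted
    [[1.5, 2.5], [1.5, 2.5], [1.5, 2.5]],   # all copies agree
    [[7.0, 7.0], [7.0, 7.0], [0.5, 1.0]],   # Byzantine majority wins this vote
]
update_direction = aggregate(returned)
```

In this toy input the third file is distorted because $r^{\prime}=2$ of its copies are adversarial, yet the median over the three winning gradients still returns the value of an undistorted file.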

V-A1 Defense Strategy Against ATT-1

Under ATT-1, it is clear that a Byzantine node will disagree with at least $K-q$ honest nodes (as, by assumption in Section IV-A, it will disagree with all of them), and thus, the degree of the node in $\mathbf{G}$ will be at most $q-1<K-q-1$, and it will not be part of the maximum clique. Thus, each of the adversaries will be detected, and their returned gradients will not be considered further. The algorithm declares the (unique) maximum clique as honest and proceeds to aggregation. In particular, assume that $h$ workers $U_{i_{1}},U_{i_{2}},\dots,U_{i_{h}}$ have been identified as honest. For each of the $f$ files, if at least one honest worker processed it, the PS will pick one of the “honest” gradient values. The chosen gradients are then averaged for the update (cf. Eq. (2)). For instance, in Figure 1, assume that $U_{1}$, $U_{2}$, and $U_{4}$ have been identified as faulty. During aggregation, the PS will ignore the red file as all 3 copies have been compromised. For the orange file, it will pick either the gradient computed by $U_{5}$ or $U_{6}$ as both of them are “honest.” The only files that can be distorted in this case are those whose copies were all assigned to adversarial nodes.

Figure 2(a) (corresponding to Example 1) shows an example where, in a cluster of size $K=7$, the $q=3$ adversaries are $A=\{U_{1},U_{2},U_{3}\}$ and the remaining workers are honest with $H=\{U_{4},U_{5},U_{6},U_{7}\}$. In this case, the unique maximum clique is $M_{\mathbf{G}}=H$, and detection is successful. Under this attack, the distorted files are those for which all copies have been compromised, i.e., $c^{(q)}=\binom{q}{r}$.
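The count $c^{(q)}=\binom{q}{r}$ can be checked by brute force over the subset assignment, in which every $r$-subset of the $K$ workers corresponds to one file. The sketch below is only illustrative; the function name is hypothetical, and $r=3$ is an assumed redundancy for the $K=7$ cluster.

```python
from itertools import combinations
from math import comb

def distorted_files_att1(K, r, adversaries):
    """Count files whose r copies all sit on (detected) adversaries."""
    adversaries = set(adversaries)
    # Under the subset assignment, every r-subset of the K workers is one file.
    files = combinations(range(1, K + 1), r)
    return sum(1 for workers in files if set(workers) <= adversaries)

K, r = 7, 3          # r = 3 is an assumption made for this illustration
A = {1, 2, 3}        # adversaries U_1, U_2, U_3
c = distorted_files_att1(K, r, A)
```

For this choice, the only file without an honest copy is the one processed by $\{U_{1},U_{2},U_{3}\}$, which agrees with $\binom{3}{3}=1$.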

V-A2 Defense Strategy Against ATT-2 (Robust Aggregation)

Let $D_{i}$ denote the set of disagreement workers for adversary $A_{i}$, $i=1,2,\dots,q$, where $D_{i}$ can contain members from $A$ and from $H$. If the attack ATT-2 is used on Aspis, upon the formation of $\mathbf{G}$ we know that a worker $U_{j}$ will be flagged as adversarial if $\deg(U_{j})<K-q-1$. Since $\deg(A_{i})=K-1-|D_{i}|$, a necessary condition for adversary $A_{i}$ to avoid detection is that $|D_{i}|\leq q$.

We now upper bound the number of files that can be corrupted under any possible strategy employed by the adversaries. Note that according to Algorithm 2, we resort to robust aggregation in case of more than one maximum clique in $\mathbf{G}$ . In this scenario, a gradient can only be corrupted if a majority of the assigned workers computing it are adversarial and agree on a wrong value. The proof of the following theorem appears in Appendix Section X-A.

Theorem 1.

Consider a training cluster of $K$ workers with $q$ adversaries using the algorithm in Section III-A1 to assign the $f=\binom{K}{r}$ files to workers, and Algorithm 1 for adversary detection. Under any adversarial strategy, the maximum number of files that can be corrupted is

$$c_{\mathrm{max}}^{(q)}=\frac{1}{2}\binom{2q}{r}.\tag{7}$$

Furthermore, this upper bound can be achieved if all adversaries fix a set $D\subset H$ of honest workers with which they will consistently disagree on the gradient (by distorting it).

Remark 3.

We emphasize that the maximum fraction of corrupted gradients $c_{\mathrm{max}}^{(q)}/f$ is much smaller than the baseline $q/K$ and than that of other schemes as well (details in Sec. VII). For instance, for $K=15$, $r=3$, and $q=3$, at most a 0.022 fraction of the gradients is corrupted in Aspis, as against 0.2 for the baseline scheme.
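The numbers in this remark follow directly from Eq. (7) and $f=\binom{K}{r}$. A minimal Python sketch of the arithmetic is given below; the function names are hypothetical, and exact rational arithmetic is used only to avoid rounding in the illustration.

```python
from fractions import Fraction
from math import comb

def aspis_max_fraction(K, q, r):
    """Upper bound c_max / f of Theorem 1 with f = C(K, r) files."""
    c_max = Fraction(comb(2 * q, r), 2)
    return c_max / comb(K, r)

def baseline_fraction(K, q):
    """Without redundancy, each of the q adversaries corrupts one of K gradients."""
    return Fraction(q, K)

eps_aspis = aspis_max_fraction(15, 3, 3)   # (1/2) * C(6,3) / C(15,3) = 10/455
eps_base = baseline_fraction(15, 3)        # 3/15
```

Here `float(eps_aspis)` is approximately 0.022 and `float(eps_base)` equals 0.2, matching the values quoted above.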

In Appendix Section X-A we show that under ATT-2 there is bound to be more than one maximum clique in the detection graph. Thus, the PS cannot unambiguously decide which one is the honest one; detection fails and we fall back to the robust aggregation technique.

An example is shown in Figure 2(b) for the setup of Example 1. The adversaries $A=\{U_{1},U_{2},U_{3}\}$ consistently disagree with the workers in $D=\{U_{4},U_{5},U_{6}\}\subset H$ . The ambiguity as to which of the two maximum cliques ( $\{U_{1},U_{2},U_{3},U_{7}\}$ or $\{U_{4},U_{5},U_{6},U_{7}\}$ ) is the honest one makes an accurate detection impossible; robust aggregation will be performed instead.
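Both outcomes of Figure 2 can be reproduced with a small sketch of the detection rule of Algorithm 1. The code below is a simplified illustration, not the clique enumeration of [37]: it uses the basic Bron-Kerbosch recursion, takes the list of disagreeing pairs as given instead of computing agreements from gradients, and assumes that the adversaries agree with each other. All function and variable names are hypothetical.

```python
def build_graph(K, disagreements):
    """Adjacency sets; an edge exists unless the pair is listed as disagreeing."""
    adj = {u: set(range(1, K + 1)) - {u} for u in range(1, K + 1)}
    for u, v in disagreements:
        adj[u].discard(v)
        adj[v].discard(u)
    return adj

def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(R, P, X):
        if not P and not X:
            found.append(frozenset(R))
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return found

def detect(K, disagreements):
    """Return (honest, adversarial) if the maximum clique is unique, else None."""
    cliques = maximal_cliques(build_graph(K, disagreements))
    size = max(len(c) for c in cliques)
    maximum = [c for c in cliques if len(c) == size]
    if len(maximum) != 1:
        return None                      # unsuccessful detection
    honest = set(maximum[0])
    return honest, set(range(1, K + 1)) - honest

A, H = {1, 2, 3}, {4, 5, 6, 7}
att1 = [(a, h) for a in A for h in H]            # Figure 2(a): disagree with all of H
att2 = [(a, h) for a in A for h in {4, 5, 6}]    # Figure 2(b): disagree with D only
result_a = detect(7, att1)
result_b = detect(7, att2)
```

With the first input the unique maximum clique is $H$, so `result_a` separates the two sets; with the second input the cliques $\{U_{1},U_{2},U_{3},U_{7}\}$ and $\{U_{4},U_{5},U_{6},U_{7}\}$ tie, and `result_b` is `None`.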

V-B Motivation for Aspis+

Our motivation for proposing Aspis+ originates in the limitations of the subset assignment of Aspis. It is evident from the experimental results in Section VIII-B that Aspis is better suited to worst-case attacks where the adversaries collude and distort the maximum number of tasks in an undetected fashion; in this case, the accuracy gap between Aspis and prior methods is maximal. Aspis does not perform as well under weaker attacks such as the reversed gradient attack (cf. Figures 8(a), 8(b), 8(c)), even though it achieves a much smaller distortion fraction $\epsilon$, as discussed in Section VII. This can be attributed to the fact that the number of tasks is $\binom{K}{r}$; even for the considered cluster of $K=15$, $r=3$, it would require splitting the batch into 455 files, and hence the batch size must be a multiple of 455. There is significant evidence that large batch sizes can hurt generalization and make the model converge slowly [31, 40, 41]. Some workarounds have been proposed to solve this problem. For instance, the work of [41] uses layer-wise adaptive rate scaling to update each layer using a different learning rate. The authors of [42] perform implicit regularization using warmup and cosine annealing to tune the learning rate as well as gradient clipping. However, these methods require training for a significantly larger number of epochs. For the above reasons, we have extended our work and proposed Aspis+ to handle weaker Byzantine failures (cf. ATT-3) while requiring a much smaller batch size.

Input: Computed gradients $\hat{\mathbf{g}}_{t,i}^{(j)}$, $i=0,1,\dots,f-1$, $j=1,2,\dots,K$, $2-(v,k,\lambda)$ design, length of detection window $T_{d}$, maximum iterations $T$.

1 for $t=0$ to $T-1$ do
2   Let $t^{\prime}=t\ (\mathrm{mod}\ T_{d})+1$.
3   if $t^{\prime}=1$ then
4     Set $\mathbf{G}$ as the complete graph with worker vertices $\mathcal{U}$.
5     $\forall j_{1},j_{2}$, set $\alpha^{(j_{1},j_{2})}=0$.
6   end if
7   for each pair $(U_{j_{1}},U_{j_{2}}),j_{1}\neq j_{2}$ of workers do
8     PS computes the number of agreements $\alpha_{t}^{(j_{1},j_{2})}$ of the pair $U_{j_{1}},U_{j_{2}}$ on the gradient value.
9     Update $\alpha^{(j_{1},j_{2})}=\alpha^{(j_{1},j_{2})}+\alpha_{t}^{(j_{1},j_{2})}$.
10   end for
11   for each pair $(U_{j_{1}},U_{j_{2}}),j_{1}\neq j_{2}$ of workers do
12     if $\alpha^{(j_{1},j_{2})}<\lambda\times t^{\prime}$ then
13       Remove edge $(U_{j_{1}},U_{j_{2}})$ from $\mathbf{G}$.
14     end if
15   end for
16   for each worker $U_{j}\in\mathcal{U}$ do
17     if $\deg(U_{j})<K-q-1$ then
18       $\hat{A}=\hat{A}\cup\{U_{j}\}$.
19     end if
20   end for
21   if $|\hat{A}|>q$ then
22     Set $\hat{A}$ to be the $q$ most recently detected Byzantines.
23   end if
24 end for

Algorithm 3 Proposed Aspis+ graph-based detection.

V-C Aspis+ Detection Rule

The principal intuition of the Aspis+ detection approach (used for ATT-3) is to iteratively refine the graph $\mathbf{G}$, in which the edges encode the agreements of workers during consecutive and non-overlapping windows of $T_{d}$ iterations. At the beginning of each such window, the PS will reset $\mathbf{G}$ to be a complete graph, i.e., as if all workers pairwise agree with each other. Then, it will gradually remove edges from $\mathbf{G}$ as disagreements between the workers are observed; hence, the graph will be updated at each of the $T_{d}$ iterations of the window, and the PS will assume that the Byzantine set does not change within a detection window. In practice, as we do not know the “Byzantine window,” we will not assume an alignment between the two kinds of windows, and we will set $T_{d}\neq T_{b}$ for our experiments. The detection method will be the same for all detection windows; thus, we will analyze the process in one window of $T_{d}$ steps.

For a detection window, let us encode the agreement of workers $U_{j_{1}},U_{j_{2}}$ on common file $i$ during the current iteration $t$ of the window $t=1,2,\dots,T_{d}$ as

$$\alpha_{t,i}^{(j_{1},j_{2})}:=\left\{\begin{array}{ll}1&\text{if }\hat{\mathbf{g}}_{t,i}^{(j_{1})}=\hat{\mathbf{g}}_{t,i}^{(j_{2})},\\ 0&\text{otherwise}.\end{array}\right.\tag{8}$$

Across all files, the total number of agreements between a pair of workers $U_{j_{1}},U_{j_{2}}$ during the $t^{\mathrm{th}}$ iteration is denoted by

$$\alpha_{t}^{(j_{1},j_{2})}:=\sum_{i\in\mathcal{N}_{t}^{w}(U_{j_{1}})\cap\mathcal{N}_{t}^{w}(U_{j_{2}})}\alpha_{t,i}^{(j_{1},j_{2})}.\tag{9}$$

Assume that the current iteration of the window is indexed with $t^{\prime}\in\{1,2,\dots,T_{d}\}$ . The PS will collect all agreements for each pair of workers $U_{j_{1}},U_{j_{2}}$ up until the current iteration as

$$\alpha^{(j_{1},j_{2})}:=\sum\limits_{t=1}^{t^{\prime}}\alpha_{t}^{(j_{1},j_{2})}.\tag{10}$$

Since the placement is known, the PS can always perform the above computation. Next, it will examine the agreements and update $\mathbf{G}$ as necessary.

Based on the task placement (cf. Section III-A2), an edge $(U_{j_{1}},U_{j_{2}})$ exists in $\mathbf{G}$ only if the computed gradients of $U_{j_{1}}$ and $U_{j_{2}}$ match in all their $\lambda$ common groups in all iterations up to the current one indexed with $t^{\prime}$, i.e., a pair $U_{j_{1}},U_{j_{2}}$ needs to have $\alpha^{(j_{1},j_{2})}=\lambda\times t^{\prime}$ for an edge $(U_{j_{1}},U_{j_{2}})$ to be in $\mathbf{G}$. If this is not the case, the edge $(U_{j_{1}},U_{j_{2}})$ will be removed from $\mathbf{G}$. After all such edges are examined, detection is done using degree counting. Given that there are $q$ Byzantines in the cluster, after examining all pairs of workers and determining the form of $\mathbf{G}$, a worker $U_{j}$ will be flagged as Byzantine if $\deg(U_{j})<K-q-1$. Based on Eq. (10), it is not hard to see that such workers can be eliminated and their gradients will not be considered again until the last iteration of the current window. The only exception to this is if the Byzantine set changes before the end of the current detection window. This is possible due to a potential misalignment between the “Byzantine window” and the detection window (recall that $T_{d}\neq T_{b}$ is assumed to avoid trivialities). In this case, more than $q$ workers may be detected as Byzantines; the PS will, by convention, choose $\hat{A}$ to be the $q$ most recently detected Byzantines. Algorithm 3 summarizes the detection protocol. Following detection, the PS will act as follows. If at least one Byzantine has been detected, it will ignore the votes of detected Byzantines, and for each group, if there is at least one “honest” vote, it will use this as the output of the majority voting group; also, if a group consists solely of detected Byzantines, it will completely ignore the group. The remaining groups will go through robust aggregation (as in Section V-A). In our experiments in Section VIII-C, all Byzantines are detected successfully in at most 5 iterations. Example 3 showcases the utility of permutations in our detection algorithm using $K=7$ workers.

Example 3.

We will use the assignment of Example 2 with $K=7$ workers $\mathcal{U}=\{1,2,\dots,7\}$ assigned to tasks according to a $2-(7,3,1)$ Fano plane. Let us denote the assignment of workers to groups (blocks of the design) during the $t^{\mathrm{th}}$ iteration by $\mathcal{A}_{t}$, initially equal to $\mathcal{A}_{1}=\{\{1,2,3\},\{1,4,7\},\{2,4,6\},\{3,4,5\},\{2,5,7\},\{1,5,6\},\{3,6,7\}\}$. For the windows, assume that $T_{d}>2$ and $T_{b}>2$. Also, let $q=2$ and the Byzantine set be $A=\{U_{1},U_{2}\}$. Based on ATT-3, workers $U_{1},U_{2}$ are in majority within a group in which they disagree with worker $U_{3}$. After the first permutation, a possible assignment is $\mathcal{A}_{2}=\{\{1,3,6\},\{3,4,7\},\{2,4,6\},\{1,4,5\},\{5,6,7\},\{2,3,5\},\{1,2,7\}\}$. Then, $U_{1},U_{2}$ are in the same group as the honest $U_{7}$ with which they disagree; hence, $\deg(U_{1})=\deg(U_{2})=4=K-q-1$, and neither of them can afford to disagree with more honest workers if it is to remain undetected. However, if the next permutation assigns the workers as $\mathcal{A}_{3}=\{\{1,3,6\},\{1,4,7\},\{4,6,5\},\{2,3,4\},\{2,6,7\},\{1,2,5\},\{3,5,7\}\}$, then the adversaries will cast a different vote than $U_{5}$ as well. Both of them will be detected after only three iterations.
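The three iterations of this example can be replayed with a short Python sketch of the window-based detection (Eqs. (8)-(10) followed by degree counting). It is a simplified illustration under stated assumptions: the Byzantines are assumed to distort a group only when they hold a majority in it and to return the same distorted value, votes are modeled as labels rather than gradients, and the function and variable names are hypothetical.

```python
from itertools import combinations

def run_window(assignments, byzantines, K, q, lam=1):
    """Replay one detection window; return the iteration at which each worker is flagged."""
    workers = range(1, K + 1)
    # Start from the complete graph: every pair is assumed to agree.
    edges = {frozenset(p) for p in combinations(workers, 2)}
    total = {e: 0 for e in edges}            # cumulative agreements, Eq. (10)
    flagged = {}
    for t_prime, blocks in enumerate(assignments, start=1):
        for block in blocks:
            # Assumed attack: distort only where Byzantines form a majority.
            n_bad = sum(1 for u in block if u in byzantines)
            distort = n_bad > len(block) / 2
            vote = {u: "bad" if distort and u in byzantines else "true" for u in block}
            for u, v in combinations(block, 2):
                total[frozenset((u, v))] += int(vote[u] == vote[v])
        # Keep an edge only if the pair agreed on all lam * t' common groups.
        edges = {e for e in edges if total[e] == lam * t_prime}
        for u in workers:
            degree = sum(1 for e in edges if u in e)
            if degree < K - q - 1 and u not in flagged:
                flagged[u] = t_prime
    return flagged

A1 = [{1, 2, 3}, {1, 4, 7}, {2, 4, 6}, {3, 4, 5}, {2, 5, 7}, {1, 5, 6}, {3, 6, 7}]
A2 = [{1, 3, 6}, {3, 4, 7}, {2, 4, 6}, {1, 4, 5}, {5, 6, 7}, {2, 3, 5}, {1, 2, 7}]
A3 = [{1, 3, 6}, {1, 4, 7}, {4, 6, 5}, {2, 3, 4}, {2, 6, 7}, {1, 2, 5}, {3, 5, 7}]
flagged = run_window([A1, A2, A3], byzantines={1, 2}, K=7, q=2)
```

After the first two iterations, $U_{1}$ and $U_{2}$ each have degree $4=K-q-1$ and remain undetected; the third assignment lowers their degree to 3, so both are flagged at $t^{\prime}=3$ while no honest worker is flagged.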

Remark 4.

Using a $2-(v,3,\lambda)$ design, i.e., a design with $k=3$ (a typical value for the redundancy), to assign the files on a cluster with $q$ Byzantines, the maximum fraction of files one can distort is $\lambda\binom{q}{2}/|\mathcal{B}|$ [33], where $|\mathcal{B}|$ is the total number of files; this occurs when each possible pair of Byzantines, among the $\binom{q}{2}$ possible ones, appears together in a distinct block and distorts the corresponding file. In Aspis+, the focus is on weak attacks, and determining the worst-case choice of adversaries that maximizes the number of distorted files is beyond the scope of our work.

VI Convergence Results and Experiments under Setting-II

In this section, we operate under Setting-II (cf. Section III). By leveraging the work of Chen et al. [13] we demonstrate that our training algorithm converges to the optimal point.

We assume that the data samples are distributed i.i.d. from some unknown distribution $\mu$. We are interested in finding $\mathbf{w}^{*}$ that minimizes $L(\mathbf{w})=\mathbb{E}(l_{1}(\mathbf{w}))$ over $\mathbf{w}\in\mathcal{W}$; here the expectation is over the distribution $\mu$ and the $l_{i}(\mathbf{w})$’s are distributed i.i.d. as well. In general, since the distribution is unknown, $\mathbb{E}(l_{1}(\mathbf{w}))$ cannot be computed and we instead minimize the empirical loss function given by $\hat{L}(\mathbf{w})=\frac{1}{n}\sum_{i=1}^{n}l_{i}(\mathbf{w})$. We need the following additional assumptions.

In the discussion below, we say that a random vector $\mathbf{z}$ is sub-exponential with sub-exponential norm $K$ if for every unit-vector $v$ , $\mathbf{z}^{\top}v$ is a sub-exponential random variable with sub-exponential norm at most $K$ , i.e., $\sup_{v:||v||\leq 1}Pr(|\mathbf{z}^{\top}v|>t)\leq\exp(-t/K)$ [43, Sec 2.7]. To keep notation simple, we reuse the letter $C$ to denote different numerical constants in each use. This practice is common when working with classes of distributions such as sub-exponential.

• The minimization of $\hat{L}(\mathbf{w})$ is performed by using Aspis or Aspis+ along with gradient descent (cf. Eq. (1)). This means that in Algorithm 2, the batch size $b=n$ for all iterations (cf. discussion after (2)). The robust estimator $\widehat{\mathrm{med}}$ is the geometric median.

• The function $L(\mathbf{w})$ is $\beta$-strongly convex and differentiable with respect to $\mathbf{w}$ with $\tilde{M}$-Lipschitz gradient. This means that for all $\mathbf{w}$ and $\mathbf{w}^{\prime}$ we have

$$L(\mathbf{w}^{\prime})\geq L(\mathbf{w})+\nabla L(\mathbf{w})^{T}(\mathbf{w}^{\prime}-\mathbf{w})+\frac{\beta}{2}\norm{\mathbf{w}-\mathbf{w}^{\prime}}^{2},\text{ and}$$

$$\norm{\nabla L(\mathbf{w})-\nabla L(\mathbf{w}^{\prime})}\leq\tilde{M}\norm{\mathbf{w}-\mathbf{w}^{\prime}}.$$

• The random vectors $\nabla l_{i}(\mathbf{w})$ for $i=1,\dots,n$ are sub-exponential with sub-exponential norm $C$. This assumption ensures that $\frac{1}{n}\sum_{i=1}^{n}\nabla l_{i}(\mathbf{w}^{*})$ concentrates around its mean $\nabla L(\mathbf{w}^{*})=0$.

• Let $h_{i}(\mathbf{w})=\nabla l_{i}(\mathbf{w})-\nabla l_{i}(\mathbf{w}^{*})$. For $i=1,\dots,n$, the random vectors $h_{i}(\mathbf{w})$ are sub-exponential with sub-exponential norm $C\norm{\mathbf{w}-\mathbf{w}^{*}}$.

• For any $\delta\in(0,1)$ there exists $\tilde{M}^{\prime}$ (dependent on $n$ and $\delta$) that is non-increasing in $n$ such that $\hat{L}(\mathbf{w})$ is $\tilde{M}^{\prime}$-smooth with high probability, i.e.,

$$P\left(\sup_{\mathbf{w},\mathbf{w}^{\prime}\in\mathcal{W}:\mathbf{w}\neq\mathbf{w}^{\prime}}\frac{\norm{\frac{1}{n}\sum_{i=1}^{n}(\nabla l_{i}(\mathbf{w})-\nabla l_{i}(\mathbf{w}^{\prime}))}}{\norm{\mathbf{w}-\mathbf{w}^{\prime}}}\leq\tilde{M}^{\prime}\right)\geq 1-\frac{\delta}{3}.$$

Here $\mathcal{W}$ is the feasible parameter set.

For Aspis, Theorem 1 guarantees an upper bound on the fraction of corrupted gradients regardless of what attack is used. In particular, treating the majority logic and clique finding as a pre-processing step, we arrive at a set of $f$ files, at most $c^{(q)}_{\max}$ (cf. Theorem 1) of which are “arbitrarily” corrupted. At this point, the PS applies the robust estimator $\widehat{\mathrm{med}}$ - “geometric median” and uses it to perform the update step. We can leverage Theorem 5 of [13] to obtain the following result where $d$ is the length of the parameter vector and for $p_{i}\in(0,1),i=1,2$ the quantity $D(p_{1}||p_{2})=p_{1}\log_{2}(\frac{p_{1}}{p_{2}})+(1-p_{1})\log_{2}(\frac{1-p_{1}}{1-p_{2}})$ .

Theorem 2.

(adapted from [13]) Suppose that $\beta,\tilde{M}$ are all constants and $\log\tilde{M}^{\prime}=\mathcal{O}(\log d)$ . Assume that $\mathcal{W}\subset\{\mathbf{w}~{}:~{}\norm{\mathbf{w}-\mathbf{w}^{*}}\leq\tilde{r}\sqrt{d}\}$ for positive $\tilde{r}$ such that $\log\tilde{r}=\mathcal{O}(d\log(n/f))$ and $2(1+\epsilon)c^{(q)}_{\max}\leq f$ . Fix any $\alpha\in(c^{(q)}_{\max}/f,1/2)$ and any $\delta>0$ such that $\delta\leq\alpha-c^{(q)}_{\max}/f$ and $\log(1/\delta)=\mathcal{O}(d)$ . There exist universal constants $c_{1},c_{2}$ such that if

$$\frac{n}{f}\geq c_{1}C_{\alpha}^{2}d\log(n/f),$$

then with probability at least $1-\exp(-fD(\alpha-\frac{c^{(q)}_{\max}}{f}~{}||~{}\delta))$ , for all $t\geq 1$ , the iterates of our algorithm with $\eta=\beta/(2\tilde{M}^{2})$ satisfy

$$\norm{\mathbf{w}_{t}-\mathbf{w}^{*}}\leq\left(\frac{1}{2}+\frac{1}{2}\sqrt{1-\frac{\beta^{2}}{4\tilde{M}^{2}}}\right)^{t}\norm{\mathbf{w}_{0}-\mathbf{w}^{*}}+c_{2}\sqrt{\frac{df}{n}}.\tag{11}$$
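To get a feel for the quantities in Theorem 2, the following Python sketch evaluates the contraction factor in Eq. (11) and the probability lower bound $1-\exp(-fD(\alpha-c^{(q)}_{\max}/f\,||\,\delta))$. The numerical values of $\beta$, $\tilde{M}$, $\alpha$, and $\delta$ are purely illustrative assumptions (they are not taken from our experiments), the universal constants $c_{1},c_{2}$ are left out because they are unspecified, and the function names are hypothetical.

```python
from math import exp, log2, sqrt

def binary_kl(p1, p2):
    """D(p1 || p2) with base-2 logarithms, as defined before Theorem 2."""
    return p1 * log2(p1 / p2) + (1 - p1) * log2((1 - p1) / (1 - p2))

def contraction_factor(beta, M):
    """Per-iteration factor multiplying ||w_0 - w*|| in Eq. (11)."""
    return 0.5 + 0.5 * sqrt(1 - beta**2 / (4 * M**2))

def success_probability(f, c_max, alpha, delta):
    """Lower bound 1 - exp(-f * D(alpha - c_max/f || delta)) from Theorem 2."""
    return 1 - exp(-f * binary_kl(alpha - c_max / f, delta))

# Illustrative (assumed) constants: beta = M = 1, alpha = 0.3, delta = 0.1.
rho = contraction_factor(beta=1.0, M=1.0)
# f = C(15, 3) = 455 files and c_max = 10 correspond to K = 15, r = 3, q = 3.
prob = success_probability(f=455, c_max=10, alpha=0.3, delta=0.1)
```

Since $\beta\leq\tilde{M}$ for a $\beta$-strongly convex function with $\tilde{M}$-Lipschitz gradient, the factor `rho` always lies in $[\frac{1}{2}+\frac{\sqrt{3}}{4},1)$, so the first term of Eq. (11) decays geometrically.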

An instance of a problem that satisfies the assumptions presented above is the linear regression problem. Formally, the data set consists of $n$ vectors $\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\}$ , where $\forall i,\mathbf{x}_{i}\in\mathbb{R}^{d}$ . We construct the data matrix $X$ of size $n\times d$ using these vectors as its rows. The $n$ labels corresponding to the data points are computed as follows: $\mathbf{y}=X\mathbf{w}$ , where $\mathbf{w}$ denotes the parameter set. For this problem, our loss function is the least-squares loss, i.e., we have $l_{i}(\mathbf{w})=\frac{1}{2}(\mathbf{y}_{i}-\mathbf{x}_{i}^{T}\mathbf{w})^{2}$ for $i=1,\dots,n$ where $\mathbf{x}_{i}^{T}$ denotes the $i^{\mathrm{th}}$ row of $X$ .

VI-A Numerical Experiments

We use the GD algorithm (1) with the initial randomly chosen parameter vector $\mathbf{w}_{0}\sim\mathcal{N}(\mathbf{0}_{d},I_{d})$. We partition the data matrix $X$ row-wise into $f$ submatrices $X_{1},X_{2},\dots,X_{f}$, and correspondingly the label vector $\mathbf{y}$ into $f$ sub-vectors $\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{f}$, where $f$ is the number of files of the distributed algorithm. A file $B_{t,i}$ consists of a pair $(X_{i},\mathbf{y}_{i})$. For each of its assigned files $B_{t,i}=(X_{i},\mathbf{y}_{i})\in\mathcal{N}^{w}(U_{j})$, worker $U_{j}$ either computes the honest partial gradient or a distorted value and returns it to the PS. Using the formulation of Section IV, the gradient in Eq. (3) for linear regression is $\mathbf{g}_{t,i}=X_{i}^{T}X_{i}\mathbf{w}-X_{i}^{T}\mathbf{y}_{i}$.
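As a small illustration of the per-file computation, the sketch below evaluates $X_{i}^{T}X_{i}\mathbf{w}-X_{i}^{T}\mathbf{y}_{i}=X_{i}^{T}(X_{i}\mathbf{w}-\mathbf{y}_{i})$ on a toy data set and checks that the file gradients add up to the full-batch gradient. It uses plain Python lists to stay self-contained; the helper names and the toy data are hypothetical and unrelated to our released code.

```python
def file_gradient(X_i, y_i, w):
    """Return X_i^T X_i w - X_i^T y_i, i.e., X_i^T (X_i w - y_i), for one file."""
    # Residual of every sample in the file: x_k^T w - y_k.
    residuals = [sum(x * wj for x, wj in zip(row, w)) - y for row, y in zip(X_i, y_i)]
    d = len(w)
    return [sum(row[j] * res for row, res in zip(X_i, residuals)) for j in range(d)]

def partition(rows, f):
    """Split a list row-wise into f equally sized files (len(rows) % f == 0 assumed)."""
    size = len(rows) // f
    return [rows[k * size:(k + 1) * size] for k in range(f)]

# Toy data (assumed): n = 4 samples, d = 2, noiseless labels y = X w_true.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
w_true = [3.0, -2.0]
y = [sum(x * wj for x, wj in zip(row, w_true)) for row in X]
w = [0.0, 0.0]

grads = [file_gradient(Xi, yi, w) for Xi, yi in zip(partition(X, 2), partition(y, 2))]
summed = [sum(g[j] for g in grads) for j in range(len(w))]
full = file_gradient(X, y, w)
```

Because the loss is a sum over samples, `summed` equals `full`, and the gradient vanishes at `w_true`, which is the minimizer in this noiseless toy problem.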

Metrics: For each scheme and value of $q$, we run multiple Monte Carlo simulations and calculate the average least-squares loss that each algorithm converges to across the simulations. For each simulation, we declare convergence if the final empirical loss is less than 0.1. We record the fraction of experiments that converged and the rate of convergence. In computing the average loss, the experiments that did not converge are not taken into account (for more details, please see Supplement Section XI-C).

VI-A1 Experiment Setup

In our experiments, we set $n=50000$ , $d=100$ while our cluster consists of $K=15$ workers. All replication-based schemes use $r=3$ . For Aspis+, we considered a $2-(15,3,1)$ design [33].

The geometric median is available as a Python library [44]. Initially, we tuned the learning rate for each scheme and each distortion method to decide the one to use for the Monte Carlo simulations; all learning rates $10^{-x}$, $x\in\{1,2,\dots,6\}$, have been tested. Also, we fix the random seeds of our experiments to be the same across all schemes; this guarantees that the data matrix as well as the original model estimate $\mathbf{w}_{0}$ will be the same across all methods. At the beginning of the algorithm, all elements of $X$ and $\mathbf{w}$ are generated randomly according to a $\mathcal{N}(0,1)$ distribution. For all runs, we chose to terminate the algorithm when the norm of the gradient is less than $10^{-10}$ or the algorithm has reached a maximum number of 2000 iterations. Our code is available online at https://github.com/kkonstantinidis/Aspis.
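For readers unfamiliar with the geometric median, the following sketch approximates it with Weiszfeld's fixed-point iteration. This is a generic textbook method shown only for illustration; it is not the library of [44] and not necessarily the routine used in our experiments, and the function name and default parameters are assumptions.

```python
from math import sqrt

def geometric_median(points, iters=200, eps=1e-12):
    """Approximate argmin_z sum_i ||z - p_i|| with Weiszfeld's fixed-point iteration."""
    d = len(points[0])
    # Start from the coordinate-wise mean.
    z = [sum(p[j] for p in points) / len(points) for j in range(d)]
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [
            1.0 / max(sqrt(sum((pj - zj) ** 2 for pj, zj in zip(p, z))), eps)
            for p in points
        ]
        total = sum(weights)
        z = [sum(wt * p[j] for wt, p in zip(weights, points)) / total for j in range(d)]
    return z

# Three identical "honest" vectors and one far-away outlier (toy data).
gm = geometric_median([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [100.0, 100.0]])
```

In this toy input the mean is pulled to $(25,25)$ by the outlier, whereas the iteration settles at the point shared by the majority, which is the behavior that makes the estimator attractive for robust aggregation.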

VI-A2 Results

The first set of experiments is for the strong attack ATT-2. The baseline scheme, where the geometric median is applied on all $K$ gradients returned by the workers, is referred to as GeoMed, and it has no redundancy. DETOX, Aspis, and Aspis+ use the geometric median as part of the robust aggregation. Under the reversed gradient attack (see Figure 4(a)), it is clear that all schemes perform well and achieve similar loss for $q=2,4$ Byzantines. Nevertheless, the baseline geometric median needed at least 100 iterations to converge, while the redundancy-based schemes have a faster convergence rate. However, the situation is very different for $q=6$, where Aspis converges within 30 iterations. In contrast, the DETOX loss diverged in all 100 simulations, while the convergence rate of the baseline scheme is much slower.

For the constant attack (see Fig. 4(b)), however, the relative performance of the baseline scheme and Aspis is reversed, i.e., the baseline scheme has a faster convergence rate than Aspis. Moreover, Aspis and DETOX have roughly similar convergence rates for $q=2,4$. As before, for $q=6$, none of the DETOX simulations converged.

Our last set of results is with the ALIE attack, reported in Figure 6. As the baseline geometric median simulations converged much more slowly than the other schemes, they could not fit properly into the figure; the baseline achieved a final loss approximately equal to $2.07\times 10^{-28}$, $9.71\times 10^{-28}$, and $1.77\times 10^{-28}$ for $q=2,4$, and $6$, respectively. On the other hand, Aspis converged to $1.69\times 10^{-23}$ in 30 iterations for $q=6$. For this attack, all schemes converged to very low loss values in all simulation runs. Nevertheless, as is evident from Figure 6, both Aspis and DETOX converge to loss values less than $10^{-5}$ within 15 iterations.

Another experiment we performed compares our two proposed methods, Aspis and Aspis+. For both schemes, we generate a new random Byzantine set $A$ every $T_{b}=50$ iterations (introduced as ATT-3 in Section IV-C), while the detection window for Aspis+ is of length $T_{d}=15$. For a comparable attack, we use ATT-1 on Aspis (cf. Section IV-A), i.e., all adversaries distort all their assigned files. We compare the two schemes under the reversed gradient attack in Figure 5(a) and under the constant attack in Figure 5(b). Both methods achieve a low final loss on the order of $10^{-20}$ or lower; Aspis converged to lower losses on the order of $10^{-24}$ in approximately 280 iterations in all cases. Nevertheless, Aspis+ achieves a faster convergence rate, which aligns with the fact that it is mainly suited to weaker adversaries.

VII Distortion Fraction Evaluation

The main motivation of our distortion fraction analysis is that our deep learning experiments (cf. Section VIII-B) and prior work [14] show that $\epsilon=c^{(q)}/f$ is a surrogate of the model’s convergence with respect to accuracy. This comparison involves our work and state-of-the-art schemes under the best- and worst-case choice of the $q$ adversaries in terms of the achievable value of $\epsilon$. We also compare our work with baseline approaches that do not involve redundancy or majority voting, in which aggregation is applied directly to the $K$ gradients returned by the workers ($f=K$, $c_{\mathrm{max}}^{(q)}=q$ and $\epsilon=q/K$).

For Aspis, we used the proposed attack ATT-2 from Section IV-B and the corresponding computation of $c^{(q),Aspis}$ of Theorem 1. DETOX in [16] employs a redundant assignment followed by majority voting and offers robustness guarantees which crucially rely on a “random choice” of the Byzantines. Our prior work [14] (ByzShield) has demonstrated the importance of a careful task assignment and observed that redundancy by itself is not sufficient to allow for Byzantine resilience. That work proposed an optimal choice of the $q$ Byzantines that maximizes $\epsilon^{DETOX}$ , which we used in our current experiments. In short, DETOX splits the $K$ workers into $K/r$ groups. All workers within a group process the same subset of the batch, specifically containing $br/K$ samples. This phase is followed by majority voting on a group-by-group basis. Reference [14] suggests choosing the Byzantines so that at least $r^{\prime}$ workers in each group are adversarial in order to distort the corresponding gradients. In this case, $c^{(q),DETOX}=\lfloor\frac{q}{r^{\prime}}\rfloor$ and $\epsilon^{DETOX}=\lfloor\frac{q}{r^{\prime}}\rfloor\times r/K$ . We also compare with the distortion fraction incurred by ByzShield [14] under a worst-case scenario. For this scheme, there is no known optimal attack, and we performed an exhaustive combinatorial search to find the $q$ adversaries that maximize $\epsilon^{ByzShield}$ among all possible options; we follow the same process here to simulate ByzShield’s distortion fraction computation while utilizing the scheme of that work based on mutually orthogonal Latin squares. The reader can refer to Figure 3(a) and Appendix Tables III, IV, and V for our results. Aspis achieves major reductions in $\epsilon$ ; for instance, $\epsilon^{Aspis,ATT-2}$ is reduced by up to 99% compared to both $\epsilon^{Baseline}$ and $\epsilon^{DETOX}$ in Figure 3(a).

Next, we consider the weak attack, ATT-1. For our scheme, we will make an arbitrary choice of $q$ adversaries which carry out the method introduced in Section V-A1, i.e., they will distort all files, and a successful detection is possible. As discussed in Section V-A1, the fraction of corrupted gradients is $\epsilon^{Aspis,ATT-1}=\binom{q}{r}/\binom{K}{r}$. For DETOX, a simple benign attack is used. To that end, let the $K/r$ files be $B_{t,0},B_{t,1},\dots,B_{t,K/r-1}$. Initialize $A=\emptyset$ and choose the $q$ Byzantines as follows: for $i=0,1,\dots,q-1$, among the remaining workers in $\{U_{1},U_{2},\dots,U_{K}\}-A$, add a worker from the group $B_{t,i\bmod(K/r)}$ to the adversarial set $A$. Then,

c^{(q),DETOX}=\left\{\begin{array}[]{ll}q-\frac{K}{r}(r^{\prime}-1)&\text{if }q>\frac{K}{r}(r^{\prime}-1),\\ 0&\text{otherwise}.\end{array}\right.

The results of this scenario are in Figure 3(b).
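The DETOX and Aspis (ATT-1) distortion fractions quoted above can be reproduced with a few lines of Python. The following is a minimal sketch rather than the code used in our experiments: the function names are hypothetical, the weak attack is simulated by spreading the Byzantines round-robin over the $K/r$ groups, and the cap of the optimal-attack count at $K/r$ groups is our own assumption for large $q$.

```python
from math import comb


def r_prime(r):
    """Minimum number of distorted copies needed to flip a majority vote (r odd)."""
    return (r + 1) // 2


def detox_optimal_eps(K, r, q):
    """DETOX distortion fraction when the Byzantines are packed r' per group."""
    groups = K // r
    corrupted = min(q // r_prime(r), groups)  # cap is an assumption for large q
    return corrupted / groups


def detox_weak_eps(K, r, q):
    """DETOX distortion fraction when the Byzantines are spread round-robin."""
    groups = K // r
    byz_per_group = [0] * groups
    for i in range(q):
        byz_per_group[i % groups] += 1
    corrupted = sum(1 for a in byz_per_group if a >= r_prime(r))
    return corrupted / groups


def aspis_att1_eps(K, r, q):
    """Aspis fraction under ATT-1: files whose r workers are all adversarial."""
    return comb(q, r) / comb(K, r)


if __name__ == "__main__":
    K, r = 15, 3
    for q in (4, 6, 7):
        print(q, detox_optimal_eps(K, r, q), detox_weak_eps(K, r, q),
              round(aspis_att1_eps(K, r, q), 3))
```

For $(K,r)=(15,3)$ and $q=6$ the round-robin simulation corrupts one of the five groups, which agrees with the closed-form expression $q-\frac{K}{r}(r^{\prime}-1)=1$ given above.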

VIII Large-Scale Deep Learning Experiments

All these experiments are performed under Setting-I, i.e., no assumptions are made about the dataset or the loss function. Accordingly, the evaluation here is in terms of the distortion fraction (see Section VII) and numerical experiments (described below). For the experiments, we used mini-batch SGD (see (2)), with the coordinate-wise median as the robust estimator (see Algorithm 2).
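For readers unfamiliar with the coordinate-wise median, the following minimal sketch illustrates it on plain Python lists. It is not our PyTorch implementation; the function name and the toy gradient values are hypothetical.

```python
from statistics import median


def coordinate_wise_median(gradients):
    """Return the per-coordinate median of equal-length gradient vectors."""
    if not gradients:
        raise ValueError("need at least one gradient")
    dim = len(gradients[0])
    if any(len(g) != dim for g in gradients):
        raise ValueError("all gradients must have the same dimension")
    return [median(g[k] for g in gradients) for k in range(dim)]


# Toy example (made-up values): four honest gradients and one outlier.
honest = [[1.0, -2.0, 0.5], [1.1, -1.9, 0.4], [0.9, -2.1, 0.6], [1.0, -2.0, 0.5]]
byzantine = [[100.0, 100.0, -100.0]]
print(coordinate_wise_median(honest + byzantine))  # [1.0, -2.0, 0.5]
```

A single outlier cannot move the per-coordinate median away from the honest values, which is why the estimator degrades only as the fraction of distorted inputs grows.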

VIII-A Experiment Setup

We have evaluated the performance of our methods and competing techniques in classification tasks on Amazon EC2 clusters. The project is written in PyTorch [1] and uses the MPICH library for communication between the different nodes. We worked with the CIFAR-10 data set [32] using the ResNet-18 [45] model. We used clusters of $K=15$ , $21$ , $25$ workers, redundancy $r=3$ , and simulated values of $q=2,4,6,7,9$ during training. Detailed information about the implementation can be found in Appendix Section X-B.

Competing methods: We compare Aspis against the baseline implementations of median-of-means [46], Bulyan [24], and Multi-Krum [12]. If $c_{\mathrm{max}}^{(q)}$ is the number of adversarial computations, then Bulyan requires at least $4c_{\mathrm{max}}^{(q)}+3$ computations in total, while the corresponding number for Multi-Krum is $2c_{\mathrm{max}}^{(q)}+3$ . These constraints make these methods inapplicable for larger values of $q$ for which our methods are robust. The second class of comparisons is with methods that use redundancy, specifically DETOX [16]. For the baseline scheme, we compare with median-based techniques since they originate from robust statistics and are the basis for many aggregators. Multi-Krum combines the intuitions of majority-based and squared-distance-based methods. Draco [17] is a closely related method that uses redundancy. However, we do not compare with it since it is resilient to only a very limited number of Byzantines.

Note that for a baseline scheme, all choices of $A$ are equivalent in terms of the value of $\epsilon$ . In our comparisons between Aspis and DETOX we will consider two attack scenarios concerning the choice of the adversaries. For the optimal attack on DETOX, we will use the method proposed in [14] and compare with the attack introduced in Section V-A2. For the weak one, we will choose the adversaries such that they incur the minimum value of $\epsilon$ in DETOX for given $q$ and compare its performance with the scenario of Section V-A1. All schemes compared with Aspis+ consider random sets of Byzantines, and for Aspis+, we will use the attack ATT-3.

VIII-B Aspis Experimental Results

VIII-B1 Comparison under Optimal Attacks

We compare the different defense algorithms under optimal attack scenarios using ATT-2 for Aspis. Figure 7(a) compares our scheme Aspis with the baseline implementation of coordinate-wise median ( $\epsilon=0.133,0.267$ for $q=2,4$ , respectively) and DETOX with median-of-means ( $\epsilon=0.2,0.4$ for $q=2,4$ , respectively) under the ALIE attack. Aspis converges faster and achieves at least a 35% average accuracy boost (at the end of the training) for both values of $q$ ( $\epsilon^{Aspis}=0.004,0.062$ for $q=2,4$ , respectively); please refer to Appendix Tables III and IV for the values of the distortion fraction $\epsilon$ each scheme incurs. In Figures 7(b) and 7(c), we observe similar trends in our experiments with Bulyan and Multi-Krum, where Aspis significantly outperforms these techniques. For the current setup, Bulyan is not applicable for $q=4$ since $K=15<4c_{\mathrm{max}}^{(q)}+3=4q+3=19$ . Also, neither Bulyan nor Multi-Krum can be paired with DETOX for $q\geq 4$ since the inequalities $f\geq 4c_{\mathrm{max}}^{(q)}+3$ and $f\geq 2c_{\mathrm{max}}^{(q)}+3$ , where $f=f_{\mathrm{DETOX}}=K/r$ , cannot be satisfied; for the specific case of Bulyan even $q=2,3$ would not be supported by DETOX. Please refer to Section VIII-A and Section VII for more details on these requirements. Also, note that the accuracy of most competing methods fluctuates more than in the results presented in the corresponding papers [16] and [23]. This is expected as we consider stronger attacks than those papers, i.e., optimal deterministic attacks on DETOX and, in general, up to 27% adversarial workers in the cluster. Also, we have done multiple experiments with different random seeds to demonstrate the stability and superiority of our accuracy results compared to other methods (against median-based defenses in Appendix Figure 15, Bulyan in Figure 16 and Multi-Krum in Supplement Figure 17); we point the reader to Appendix Section X-B3 for this analysis. 
This analysis is missing from most prior work, including that of ALIE [23], whose presented results are only a snapshot of a single experiment. The results for the reversed gradient attack are shown in Figures 8(a), 8(b), and 8(c). Given that this is a weaker attack [14, 16], all schemes, including the baseline methods, are expected to perform well; indeed, in most cases, the model converges to approximately 80% accuracy. However, DETOX fails to converge to high accuracy for $q=4$ as in the case of ALIE; one explanation is that $\epsilon^{DETOX}=0.4$ for $q=4$ . Under the Fall of Empires (FoE) distortion (cf. Figure 11), our method still enjoys an accuracy advantage over the baseline and DETOX schemes, which becomes more important as the number of Byzantines in the cluster increases.
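The applicability constraints for Bulyan and Multi-Krum discussed above can be checked mechanically. The sketch below uses hypothetical helper names and assumes that the aggregator sees $K$ gradients of which $q$ are adversarial in the baseline case, and $K/r$ majority votes of which $\lfloor q/r^{\prime}\rfloor$ are distorted when it is paired with DETOX under the optimal attack.

```python
def bulyan_applicable(n, c):
    """Bulyan needs at least 4c + 3 inputs when c of them may be adversarial."""
    return n >= 4 * c + 3


def multi_krum_applicable(n, c):
    """Multi-Krum needs at least 2c + 3 inputs when c of them may be adversarial."""
    return n >= 2 * c + 3


K, r = 15, 3
r_prime = (r + 1) // 2
for q in (2, 3, 4):
    votes, c_detox = K // r, q // r_prime  # inputs and distorted inputs for DETOX
    print(q,
          bulyan_applicable(K, q), multi_krum_applicable(K, q),
          bulyan_applicable(votes, c_detox), multi_krum_applicable(votes, c_detox))
```

For $K=15$ and $r=3$ the check reports that baseline Bulyan fails at $q=4$ , that Bulyan paired with DETOX already fails at $q=2$ , and that Multi-Krum paired with DETOX fails from $q=4$ onward, in line with the discussion above.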

We have also performed experiments on larger clusters ( $K=21$ workers). The results for the ALIE distortion with the ATT-2 attack can be found in Figure 12. They exhibit behavior similar to that of the $K=15$ case.

VIII-B2 Comparison under Weak Attacks

For baseline schemes, the discussion of weak versus optimal choice of the adversaries is not very relevant as any choice of the $q$ Byzantines can overall distort exactly $q$ out of the $K$ gradients. Hence, for weak scenarios, we chose to compare mostly with DETOX while using ATT-1 on Aspis. The accuracy is reported in Figures 11 and 11, according to which Aspis shows an improvement under attacks on the more challenging end of the spectrum (ALIE). According to Appendix Table III(b), Aspis enjoys a fraction $\epsilon^{Aspis}=0.044$ while $\epsilon^{Baseline}=0.4$ and $\epsilon^{DETOX}=0.2$ for $q=6$ .

VIII-C Aspis+ Experimental Results

For Aspis+, we considered the attack ATT-3 discussed in Section IV-C. We tested clusters of $K=15$ with $q=2,4$ and $K=25$ workers among which $q=7,9$ are Byzantine. In the former case, a $2-(15,3,1)$ design [33] with $f=35$ blocks (files) was used for the placement, while in the latter case, we used a $2-(25,3,1)$ design [33] with $f=100$ blocks (files). A new random Byzantine set $A$ is generated every $T_{b}=50$ iterations while the detection window is of length $T_{d}=15$ .
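The block counts quoted above follow from the standard counting identities for a $2-(v,k,1)$ design, namely $v(v-1)/(k(k-1))$ blocks in total and $(v-1)/(k-1)$ blocks through each point. The sketch below evaluates them; the function name is hypothetical, and we assume that the workers play the role of points and the files the role of blocks, so that the second quantity is the per-worker load.

```python
def design_parameters(v, k):
    """Return (number of blocks, blocks per point) for a 2-(v, k, 1) design.

    Only the necessary divisibility conditions are checked; they do not
    by themselves guarantee that such a design exists.
    """
    pairs, pairs_per_block = v * (v - 1), k * (k - 1)
    if pairs % pairs_per_block or (v - 1) % (k - 1):
        raise ValueError("divisibility conditions for a 2-(v, k, 1) design fail")
    return pairs // pairs_per_block, (v - 1) // (k - 1)


print(design_parameters(15, 3))  # (35, 7)
print(design_parameters(25, 3))  # (100, 12)
```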

The results for $K=15$ are in Figure 13. We tested against the ALIE distortion, and all compared methods use median-based defenses to filter the gradients. Aspis+ demonstrates an advantage of at least 15% compared with other algorithms (cf. $q=2$ ). For $K=25$ , we tried a weaker distortion than ALIE, i.e., the constant attack paired with signSGD-based defenses [26]. In signSGD, the PS will output the majority of the gradients’ signs for each dimension. Following the advice of [16], we pair this defense with the stronger constant attack as sign flips (e.g., reversed gradient) are unlikely to affect the gradient’s distribution. Aspis+ with median still enjoys an accuracy improvement of at least 20% for $q=7$ and a larger one for $q=9$ . The results are in Figure 14; in this figure, the DETOX accuracy is an average of two experiments using two different random seeds.
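The sign-based majority vote described above can be sketched as follows. This is an illustration on plain Python lists with made-up values, not the signSGD implementation used in the experiments; the function names and the constant Byzantine vector are hypothetical.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_majority(gradients):
    """Per-coordinate majority vote over the signs of the submitted gradients."""
    dim = len(gradients[0])
    return [sign(sum(sign(g[k]) for g in gradients)) for k in range(dim)]


honest = [[0.3, -0.2, 0.1], [0.5, -0.1, 0.2], [0.1, -0.4, 0.3]]
byzantine = [[-1.0, -1.0, -1.0]] * 2  # hypothetical constant vectors
print(sign_majority(honest + byzantine))  # [1, -1, 1]
```

In this toy run the three honest workers outvote the two Byzantines in every coordinate; the vote flips only once the distorted inputs form a majority.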

IX Conclusions and Future Work

In this work, we have presented Aspis and Aspis+, two Byzantine-resilient distributed schemes that use redundancy and robust aggregation in novel ways to detect failures of the workers. Our theoretical analysis and numerical experiments clearly indicate their superior performance compared to state-of-the-art. Our experiments show that these methods require increased computation and communication time as compared to prior work, e.g., note that each worker has to transmit $l$ gradients instead of $1$ in related work [16, 17] (see Appendix Section X-B4 for details). We emphasize, however, that our schemes converge to high accuracy in our experiments, while other methods remain at much lower accuracy values regardless of how long the algorithm runs for.

Our experiments involve clusters of up to $25$ workers. As we scale Aspis to more workers, the total number of files and the computation load $l$ of each worker will also scale; this increases the memory needed to store the gradients during aggregation. For complex neural networks, the memory to store the model and the intermediate gradient computations is by far the most memory-consuming aspect of the algorithm. For these reasons, Aspis is mostly suitable for training large data sets using fairly compact models that do not require too much memory. Aspis+, on the other hand, is a good fit for clusters that suffer from non-adversarial failures that can lead to inaccurate gradients. Finally, utilizing GPUs and communication-related algorithmic improvements are worth exploring to reduce the time overhead.
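To make the scaling concrete, the sketch below computes the number of files and the per-worker load, assuming one file per $r$ -subset of the $K$ workers (which matches the $(K,f,l,r)$ tuples in the captions of Appendix Tables III, IV, and V). The function name is hypothetical.

```python
from math import comb


def aspis_files_and_load(K, r):
    """Return (f, l) when every r-subset of the K workers shares one file."""
    f = comb(K, r)
    # Double counting: f files with r copies each are spread over K workers.
    l = f * r // K
    return f, l


for K in (15, 21, 24):
    print(K, aspis_files_and_load(K, 3))
```

Under this assumption the load equals $\binom{K-1}{r-1}$ , so for $r=3$ the number of gradients each worker computes and transmits grows quadratically in $K$ .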

X Appendix

TABLE II: Main notation of the paper.

Symbol	Meaning
$K$	number of workers
$q$	number of adversaries
$r$	redundancy (number of workers each file is assigned to)
$b$	batch size
$B_{t}$	samples of batch of $t^{\mathrm{th}}$ iteration
$f$	number of files (alternatively called groups or tasks)
$U_{j}$	$j^{\mathrm{th}}$ worker
$l$	computation load (number of files per worker)
$\mathcal{N}^{w}(U_{j})$	set of files of worker $U_{j}$
$\mathcal{N}^{f}(B_{t,i})$	set of workers assigned to file $B_{t,i}$
$\mathbf{g}_{t,i}$	true gradient of file $B_{t,i}$ with respect to $\mathbf{w}$
$\hat{\mathbf{g}}_{t,i}^{(j)}$	returned gradient of $U_{j}$ for file $B_{t,i}$ with respect to $\mathbf{w}$
$\mathbf{m}_{i}$	majority gradient for file $B_{t,i}$
$\mathcal{U}$	worker set $\{U_{1},U_{2},...,U_{K}\}$
$\mathbf{G}_{task}$	graph used to encode the task assignments to workers
$\mathbf{G}_{t}$	graph indicating the agreements of pairs of workers in all of their common gradient tasks in $t^{\mathrm{th}}$ iteration
$A$	set of adversaries
$M_{\mathbf{G}}$	maximum clique in $\mathbf{G}$
$c^{(q)}$	number of distorted gradients after detection and aggregation
$c_{\mathrm{max}}^{(q)}$	maximum number of distorted gradients after detection and aggregation (worst-case)
$D_{i}$	disagreement set (of workers) for $i^{\mathrm{th}}$ adversary
$r^{\prime}$	$(r+1)/2$ , i.e., minimum number of distorted copies needed to corrupt majority vote for a file
$\epsilon$	$c^{(q)}/f$ , i.e., fraction of distorted gradients after detection and aggregation
$X_{j}$	subset of files where the set of active adversaries is of size $j$ ; for linear regression, the data matrix corresponding to the $j^{\mathrm{th}}$ file
$X$	data matrix of linear regression
$n$	number of points of linear regression
$d$	dimensionality of linear regression model

TABLE III: Distortion fraction of optimal and weak attacks for $(K,f,l,r)=(15,455,91,3)$ and comparison.

$q$	$\epsilon^{Aspis}_{ATT-2}$	$\epsilon^{Baseline}$	$\epsilon^{DETOX}$	$\epsilon^{ByzShield}$
$2$	0.004	0.133	0.2	0.04
$3$	0.022	0.2	0.2	0.12
$4$	0.062	0.267	0.4	0.2
$5$	0.132	0.333	0.4	0.32
$6$	0.242	0.4	0.6	0.48
$7$	0.4	0.467	0.6	0.56

III(a) Optimal attacks.

$q$	$\epsilon^{Aspis}_{ATT-1}$	$\epsilon^{Baseline}$	$\epsilon^{DETOX}$
$2$	0.002	0.133	0
$3$	0.002	0.2	0
$4$	0.009	0.267	0
$5$	0.022	0.333	0
$6$	0.044	0.4	0.2
$7$	0.077	0.467	0.4

III(b) Weak attacks.

TABLE IV: Distortion fraction of optimal and weak attacks for $(K,f,l,r)=(21,1330,190,3)$ and comparison.

$q$	$\epsilon^{Aspis}_{ATT-2}$	$\epsilon^{Baseline}$	$\epsilon^{DETOX}$	$\epsilon^{ByzShield}$
$2$	0.002	0.095	0.143	0.02
$3$	0.008	0.143	0.143	0.06
$4$	0.021	0.19	0.286	0.1
$5$	0.045	0.238	0.286	0.16
$6$	0.083	0.286	0.429	0.24
$7$	0.137	0.333	0.429	0.33
$8$	0.211	0.381	0.571	0.43
$9$	0.307	0.429	0.571	0.51
$10$	0.429	0.476	0.714	0.59

IV(a) Optimal attacks.

$q$	$\epsilon^{Aspis}_{ATT-1}$	$\epsilon^{Baseline}$	$\epsilon^{DETOX}$
$2$	0.001	0.095	0
$3$	0.001	0.143	0
$4$	0.003	0.19	0
$5$	0.008	0.238	0
$6$	0.015	0.286	0
$7$	0.026	0.333	0
$8$	0.042	0.381	0.143
$9$	0.063	0.429	0.286
$10$	0.09	0.476	0.429

IV(b) Weak attacks.

TABLE V: Distortion fraction of optimal and weak attacks for $(K,f,l,r)=(24,2024,253,3)$ and comparison.

$q$	$\epsilon^{Aspis}_{ATT-2}$	$\epsilon^{Baseline}$	$\epsilon^{DETOX}$	$\epsilon^{ByzShield}$
$2$	0.001	0.083	0.125	0.031
$3$	0.005	0.125	0.125	0.063
$4$	0.014	0.167	0.25	0.125
$5$	0.03	0.208	0.25	0.188
$6$	0.054	0.25	0.375	0.281
$7$	0.09	0.292	0.375	0.375
$8$	0.138	0.333	0.5	0.5
$9$	0.202	0.375	0.5	0.5
$10$	0.282	0.417	0.625	0.531
$11$	0.38	0.458	0.625	0.625

V(a) Optimal attacks.

$q$	$\epsilon^{Aspis}_{ATT-1}$	$\epsilon^{Baseline}$	$\epsilon^{DETOX}$
$2$	0	0.083	0
$3$	0	0.125	0
$4$	0.002	0.167	0
$5$	0.005	0.208	0
$6$	0.01	0.25	0
$7$	0.017	0.292	0
$8$	0.028	0.333	0
$9$	0.042	0.375	0.125
$10$	0.059	0.417	0.25
$11$	0.082	0.458	0.375

V(b) Weak attacks.

X-A Proof of Theorem 1

For a given file $F$ , let $A^{\prime}\subseteq A$ with $|A^{\prime}|\geq r^{\prime}$ be the set of “active adversaries” in it, i.e., $A^{\prime}\subseteq F$ consists of Byzantines that collude to create a majority that distorts the gradient on it. In this case, the remaining workers in $F$ belong to $\cap_{i\in A^{\prime}}D_{i}$ , where we note that $|\cap_{i\in A^{\prime}}D_{i}|\leq q$ . Let $X_{j},j=r^{\prime},r^{\prime}+1,\dots,r$ denote the subset of files where the set of active adversaries is of size $j$ ; note that $X_{j}$ depends on the disagreement sets $D_{i},i=1,2,\dots,q$ . Formally,

X_{j}=\{F:\exists A^{\prime}\subseteq A\cap F,|A^{\prime}|=j,\text{ and }\forall~U_{j}\in F\setminus A^{\prime},U_{j}\in\cap_{i\in A^{\prime}}D_{i}\}.

(12)

Then, for a given choice of disagreement sets, the number of files that can be corrupted is given by $|\cup_{j=r^{\prime}}^{r}X_{j}|$ . We obtain an upper bound on the maximum number of corrupted files by maximizing this quantity with respect to the choice of $D_{i},i=1,2,\dots,q$ , i.e.,

c_{\mathrm{max}}^{(q)}=\max\limits_{D_{i},|D_{i}|\leq q,i=1,2,\dots,q}|\cup_{j=r^{\prime}}^{r}X_{j}|

(13)

where the maximization is over the choice of the disagreement sets $D_{1},D_{2},\dots,D_{q}$ . With $X_{j}$ given in (12), assuming $q\geq r^{\prime}$ , the number of distorted files is upper bounded by

|\cup_{j=r^{\prime}}^{r}X_{j}|\leq\sum_{j=r^{\prime}}^{r}|X_{j}|\text{ (by the union bound).}

(14)

For that, recall that $r^{\prime}=(r+1)/2$ and that an adversarial majority of at least $r^{\prime}$ distorted computations for a file is needed to corrupt that particular file. Note that $X_{j}$ consists of those files where the active adversaries $A^{\prime}$ are of size $j$ ; these can be chosen in $\binom{q}{j}$ ways. The remaining workers in the file belong to $\cap_{i\in A^{\prime}}D_{i}$ where $|\cap_{i\in A^{\prime}}D_{i}|\leq q$ . Thus, the remaining workers can be chosen in at most $\binom{q}{r-j}$ ways. It follows that

|X_{j}|\leq\binom{q}{j}\binom{q}{r-j}.

(15)

Therefore,

\begin{align}
c_{\mathrm{max}}^{(q)}&\leq\binom{q}{r^{\prime}}\binom{q}{r-r^{\prime}}+\binom{q}{r^{\prime}+1}\binom{q}{r-(r^{\prime}+1)}+\cdots+\binom{q}{r-1}\binom{q}{r-(r-1)}+\binom{q}{r}\tag{16}\\
&=\sum_{i=r^{\prime}}^{q}\binom{q}{i}\binom{q}{r-i}\tag{17}\\
&=\sum_{i=0}^{q}\binom{q}{i}\binom{q}{r-i}-\sum_{i=0}^{r^{\prime}-1}\binom{q}{i}\binom{q}{r-i}\tag{18}\\
&=\frac{1}{2}\binom{2q}{r}.\tag{19}
\end{align}

Eq. (17) follows from the convention that ${n\choose k}=0$ when $k>n$ or $k<0$ . Eq. (19) follows from Eq. (18) using the following observations

• $\sum_{i=0}^{q}{{q}\choose{i}}{{q}\choose{r-i}}=\sum_{i=0}^{r}{{q}\choose{i}}{{q}\choose{r-i}}={2q\choose r}$ in which the first equality is straightforward to show by taking all possible cases: $q<r$ , $q=r$ and $q>r$ .
• By symmetry, $\sum_{i=0}^{r^{\prime}-1}{{q}\choose{i}}{{q}\choose{r-i}}=\sum_{i=r^{\prime}}^{q}{{q}\choose{i}}{{q}\choose{r-i}}=\frac{1}{2}{2q\choose r}$ .

The upper bound in Eq. (16) is met with equality when all adversaries choose the same disagreement set, which is a $q$ -sized subset of the honest workers, i.e., $D_{i}=D\subset H$ for $i=1,\dots,q$ . In this case, it can be seen that the sets $X_{j},j=r^{\prime},\dots,r$ are disjoint so that (14) is met with equality. Moreover, (15) is also an equality. This finally implies that (16) is also an equality, i.e., this choice of disagreement sets saturates the upper bound.
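As a sanity check of the derivation, the following sketch compares the closed form in Eq. (19), the sum in Eq. (17), and a brute-force count for the saturating strategy $D_{i}=D$ . It assumes one file per $r$ -subset of the $K$ workers and labels the adversaries $0,\dots,q-1$ and the common disagreement set $q,\dots,2q-1$ ; these labels and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def c_max_closed_form(q, r):
    """Closed form of Eq. (19); r is assumed odd."""
    return comb(2 * q, r) // 2


def c_max_sum(q, r):
    """Sum of Eq. (17); terms with i > r vanish, so the range stops at r."""
    r_prime = (r + 1) // 2
    return sum(comb(q, i) * comb(q, r - i) for i in range(r_prime, r + 1))


def c_max_brute_force(K, q, r):
    """Count corrupted files when all adversaries share one disagreement set."""
    r_prime = (r + 1) // 2
    adversaries = set(range(q))          # A
    disagreement = set(range(q, 2 * q))  # D, a q-sized subset of honest workers
    count = 0
    for file_workers in combinations(range(K), r):
        active = [u for u in file_workers if u in adversaries]
        rest = [u for u in file_workers if u not in adversaries]
        if len(active) >= r_prime and all(u in disagreement for u in rest):
            count += 1
    return count


K, r = 15, 3
for q in (2, 4, 7):
    c = c_max_brute_force(K, q, r)
    print(q, c, c_max_sum(q, r), c_max_closed_form(q, r), round(c / comb(K, r), 3))
```

For $(K,r)=(15,3)$ and $q=4$ all three computations give $28$ corrupted files, i.e., $\epsilon=28/455\approx 0.062$ , which is the value reported in Table III(a).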

It can also be seen that in this case, the adversarial strategy yields a graph $\mathbf{G}$ with multiple maximum cliques. To see this, we note that the adversaries in $A$ agree with all the computed gradients in $H\setminus D$ . Thus, together with the workers in $H\setminus D$ , they form a clique $M_{\mathbf{G}}^{(1)}$ of size $K-q$ in $\mathbf{G}$ . Furthermore, the honest workers in $H$ form another clique $M_{\mathbf{G}}^{(2)}$ , which is also of size $K-q$ . Thus, the detection algorithm cannot select one over the other, so the adversaries evade detection and the fallback robust aggregation strategy applies.
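The two-clique argument can be illustrated on a small instance. The sketch below builds the agreement graph for $K=15$ and $q=2$ with a common disagreement set and enumerates its maximal cliques with a plain-Python Bron-Kerbosch routine. Our implementation relies on NetworkX for clique finding; the routine, the worker labels, and the helper names here are hypothetical and used only to keep the example dependency-free.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph (Bron-Kerbosch with pivoting)."""

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield clique
            return
        # Pivot on the vertex with the most neighbors among the candidates.
        pivot = max(candidates | excluded, key=lambda u: len(adj[u] & candidates))
        for v in list(candidates - adj[pivot]):
            yield from expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adj), set())


K, q = 15, 2
A = set(range(q))         # adversaries
D = set(range(q, 2 * q))  # common disagreement set, a subset of the honest workers
H = set(range(q, K))      # honest workers


def agree(u, v):
    """Two workers agree unless one is an adversary and the other lies in D."""
    return not ((u in A and v in D) or (v in A and u in D))


adj = {u: {v for v in range(K) if v != u and agree(u, v)} for u in range(K)}
cliques = list(maximal_cliques(adj))
largest = max(len(c) for c in cliques)
maximum_cliques = [c for c in cliques if len(c) == largest]
print(largest, len(maximum_cliques))  # 13 2
```

The two maximum cliques are $H$ and $A\cup(H\setminus D)$ , both of size $K-q=13$ , so the maximum clique is not unique in this instance.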

X-B Experiment Setup Details

X-B1 Cluster Setup

We used clusters of $K=15$ , $21$ , and $25$ workers arranged in various setups within Amazon EC2. Initially, we used a PS of type i3.16xlarge and several workers of type c5.4xlarge to set up a distributed cluster; for the experiments that we adapted to GPUs, g3s.xlarge instances were used. However, purely distributed implementations require training data to be transmitted from the PS to every single machine, based on our current implementation; an alternative approach one can follow is to set up shared storage space accessible by all machines to store the training data. Also, some instances were automatically terminated by AWS per the AWS spot instance policy limitations (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html); this incurred some delays in resuming the experiments that were stopped. In order to facilitate our evaluation and avoid these issues, we decided to simulate the PS and the workers for the rest of the experiments on a single instance of type either x1.16xlarge or i3.16xlarge. We emphasize that the choice of the EC2 setup does not affect any of the numerical results in this paper since in all cases, we used a single virtual machine image with the same dependencies. Handling of the GPU floating-point precision errors is discussed in Supplement Section XI-B.

X-B2 Data Set Preprocessing and Hyperparameter Tuning

The CIFAR-10 images have been normalized using standard mean and standard deviation values for the data set. The value used for momentum (for gradient descent) was set to $0.9$ , and we trained for $16$ epochs in all experiments. The number of epochs is precisely the invariant we maintain across all experiments, i.e., all schemes process the training data the same number of times. The batch size and the learning rate are chosen independently for each method; the number of iterations is adjusted accordingly to account for the number of epochs. For Section VIII-B, we followed the advice of the authors of DETOX and chose $(K,b)=(15,480)$ and $(K,b)=(21,672)$ for the DETOX and baseline schemes. For Aspis, we used $(K,b)=(15,14560)$ (32 samples per file) and $(K,b)=(21,3990)$ (3 samples per file) for the ALIE experiments and $b=1365$ (3 samples per file) for the remaining experiments except for the FoE optimal attack $q=4$ (cf. Figure 11) for which $b=14560$ performed better. In Section VIII-C, we used $(K,b)=(15,480)$ and $(K,b)=(25,800)$ for DETOX as well as for baseline schemes while for Aspis+ we used $(K,b)=(15,770)$ for the ALIE experiments and $(K,b)=(25,1800)$ for the constant attack experiments. In Supplement Table VI, a learning rate schedule is denoted by $(x,y)$ ; this notation signifies the fact that we start with a rate equal to $x$ , and every $z$ iterations, we set the rate equal to $x\times y^{t/z}$ , where $t$ is the index of the current iteration and $z$ is set to be the number of iterations occurring between two consecutive checkpoints in which we store the model (points in the accuracy figures). We also index the schemes in order of appearance in the corresponding figure’s legend. Experiments that appear in multiple figures are not repeated in Supplement Table VI (we ran those training processes once). In order to pick the optimal hyperparameters for each scheme, we performed an extensive grid search involving different combinations of $(x,y)$ . 
In particular, the values of $x$ we tested are 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, and 0.0003, and for $y$ we tried 1, 0.975, 0.95, 0.7, and 0.5. For each method, we ran 3 epochs for each such combination and chose the one that gave the lowest value of average cross-entropy loss (principal criterion) and the highest value of top-1 accuracy (secondary criterion).
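The schedule notation $(x,y)$ and the search grid can be sketched as follows. We read the rule as a step decay, i.e., the rate is held constant between checkpoints and multiplied by $y$ every $z$ iterations; this reading, the function name, and the example value of $z$ are assumptions of the sketch.

```python
from itertools import product


def learning_rate(x, y, t, z):
    """Rate in effect at iteration t: start at x, multiply by y every z iterations."""
    return x * y ** (t // z)


X_VALUES = (0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003)
Y_VALUES = (1, 0.975, 0.95, 0.7, 0.5)
grid = list(product(X_VALUES, Y_VALUES))

print(len(grid))  # 35 candidate schedules per scheme
print([learning_rate(0.1, 0.5, t, z=10) for t in (0, 9, 10, 25)])
```

Each of the 35 combinations would then be trained for 3 epochs and ranked by average cross-entropy loss, with top-1 accuracy as the secondary criterion.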

X-B3 Error Bars

In order to examine whether the choice of the random seed affects the accuracy of the trained model we have performed the experiments of Section VIII-B for the ALIE distortion for two different seeds for the values $q=2,4$ for every scheme; we used $428$ and $50$ as random seeds. These tests have been performed for the case of $K=15$ workers. In Figure 15(a), for a given method, we report the minimum accuracy, the maximum accuracy, and their average for each evaluation point. We repeat the same process in Figures 16(a) and Supplement Figure 17(a) when comparing with Bulyan and Multi-Krum, respectively. The corresponding experiments for $q=4$ are shown in Figures 15(b), 16(b), and Supplement Figure 17(b).

Given the fact that these experiments take a significant amount of time and that they are computationally expensive, we chose to perform this consistency check for a subset of our experiments. Nevertheless, these results indicate that prior schemes [16, 8, 24] are sensitive to the choice of the random seed and demonstrate an unstable behavior in terms of convergence. In all of these cases, the achieved value of accuracy at the end of the 16 epochs of training is small compared to Aspis. On the other hand, the accuracy results for Aspis are almost identical for both choices of the random seed.

X-B4 Computation and Communication Overhead

Our schemes provide robustness under powerful attacks and sophisticated distortion methods at the expense of increased computation and communication time. Note that each worker has to perform $l$ forward/backward propagation computations and transmit $l$ gradients per iteration. In related baseline [24, 12] and redundancy-based methods [16, 17], each worker is responsible for a single such computation. Experimentally, we have observed that Aspis needs up to $5\times$ the overall training time of other schemes to complete the same number of training epochs. We emphasize that the training time incurred by each scheme depends on a wide range of parameters, including the utilized defense, the batch size, and the number of iterations, and can vary significantly. Our implementation supports GPUs, and we used NVIDIA CUDA [47] for some experiments to alleviate a significant part of the overhead; however, a detailed time cost analysis is not an objective of our current work. Communication-related algorithmic improvements are also worth exploring. Finally, our implementation natively supports resuming from a checkpoint (trained model); hence, when new data becomes available, we can use only that data to perform more training epochs.

X-B5 Software

Our implementation of the Aspis and Aspis+ algorithms used for the experiments builds on ByzShield’s [14] PyTorch skeleton and has been provided along with dependency information and instructions (https://github.com/kkonstantinidis/Aspis). The implementation of ByzShield is available at [48] and uses the standard GitHub license. We utilized the NetworkX package [49] for the clique-finding; its license is 3-clause BSD. The CIFAR-10 data set [32] comes with the MIT license; we have cited its technical report, as required.

References

[1] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, December 2019, pp. 8024–8035.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), November 2016, pp. 265–283.
[3] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” December 2015. [Online]. Available: https://arxiv.org/abs/1512.01274
[4] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, p. 2135.
[5] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors,” in Proceedings of the 41st Annual International Symposium on Computer Architecture, June 2014, pp. 361–372.
[6] A. S. Rakin, Z. He, and D. Fan, “Bit-flip attack: Crushing neural network with progressive bit search,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 2019, pp. 1211–1220.
[7] N. Gupta and N. H. Vaidya, “Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), September 2019, pp. 415–420.
[8] G. Damaskinos, E. M. El Mhamdi, R. Guerraoui, A. H. A. Guirguis, and S. L. A. Rouault, “Aggregathor: Byzantine machine learning via robust gradient aggregation,” in Conference on Systems and Machine Learning (SysML) 2019, March 2019, p. 19.
[9] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Defending against saddle point attack in Byzantine-robust distributed learning,” in Proceedings of the 36th International Conference on Machine Learning, June 2019, pp. 7074–7084.
[10] ——, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 5650–5659.
[11] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine-tolerant SGD,” March 2018. [Online]. Available: https://arxiv.org/abs/1802.10116
[12] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Advances in Neural Information Processing Systems, December 2017, pp. 119–129.
[13] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proc. ACM Meas. Anal. Comput. Syst., vol. 1, no. 2, December 2017.
[14] K. Konstantinidis and A. Ramamoorthy, “ByzShield: An efficient and robust system for distributed training,” in Machine Learning and Systems 3 (MLSys 2021), April 2021, pp. 812–828.
[15] Q. Yu, S. Li, N. Raviv, S. M. M. Kalan, M. Soltanolkotabi, and S. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security and privacy,” April 2019. [Online]. Available: https://arxiv.org/abs/1806.00939
[16] S. Rajput, H. Wang, Z. Charles, and D. Papailiopoulos, “DETOX: A redundancy-based framework for faster and more robust gradient aggregation,” in Advances in Neural Information Processing Systems, December 2019, pp. 10 320–10 330.
[17] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient distributed training via redundant gradients,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 903–912.
[18] D. Data, L. Song, and S. Diggavi, “Data encoding for Byzantine-resilient distributed gradient descent,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), October 2018, pp. 863–870.
[19] J. Regatti, H. Chen, and A. Gupta, “ByGARS: Byzantine SGD with arbitrary number of attackers,” December 2020. [Online]. Available: https://arxiv.org/abs/2006.13421
[20] C. Xie, S. Koyejo, and I. Gupta, “Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance,” in Proceedings of the 36th International Conference on Machine Learning, June 2019, pp. 6893–6901.
[21] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine stochastic gradient descent,” in Advances in Neural Information Processing Systems, December 2018.
[22] K. Konstantinidis and A. Ramamoorthy, “Aspis: Robust detection for distributed learning,” in IEEE International Symposium on Information Theory (ISIT), June 2022, pp. 2058–2063.
[23] G. Baruch, M. Baruch, and Y. Goldberg, “A Little Is Enough: Circumventing defenses for distributed learning,” in Advances in Neural Information Processing Systems, December 2019, pp. 8635–8645.
[24] E. M. El Mhamdi, R. Guerraoui, and S. Rouault, “The hidden vulnerability of distributed learning in Byzantium,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 3521–3530.
[25] S. Shen, S. Tople, and P. Saxena, “Auror: Defending against poisoning attacks in collaborative deep learning systems,” in Proceedings of the 32nd Annual Conference on Computer Security Applications, December 2016, pp. 508–519.
[26] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar, “signSGD with majority vote is communication efficient and fault tolerant,” February 2019. [Online]. Available: https://arxiv.org/abs/1810.05291
[27] J. So, B. Güler, and A. S. Avestimehr, “Byzantine-resilient secure federated learning,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2168–2181, July 2021.
[28] R. Jin, Y. Huang, X. He, H. Dai, and T. Wu, “Stochastic-sign SGD for federated learning with theoretical guarantees,” September 2021. [Online]. Available: https://arxiv.org/abs/2002.10940
[29] N. Raviv, R. Tandon, A. Dimakis, and I. Tamo, “Gradient coding from cyclic MDS codes and expander graphs,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 4302–4310.
[30] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, August 2017, pp. 3368–3376.
[31] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, May 2018.
[32] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
[33] D. R. Stinson, Combinatorial Designs: Constructions and Analysis. New York: Springer, 2004.
[34] C. Xie, O. Koyejo, and I. Gupta, “Fall of Empires: Breaking Byzantine-tolerant SGD by inner product manipulation,” in 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019, July 2019, pp. 6893–6901.
[35] R. M. Karp, Reducibility among Combinatorial Problems. Boston, MA: Springer US, 1972.
[36] F. Cazals and C. Karande, “A note on the problem of reporting maximal cliques,” Theoretical Computer Science, vol. 407, no. 1, pp. 564–568, November 2008.
[37] E. Tomita, A. Tanaka, and H. Takahashi, “The worst-case time complexity for generating all maximal cliques and computational experiments,” Theoretical Computer Science, vol. 363, no. 1, pp. 28–42, October 2006.
[38] J. Robson, “Algorithms for maximum independent sets,” Journal of Algorithms, vol. 7, no. 3, pp. 425–440, September 1986.
[39] R. E. Tarjan and A. E. Trojanowski, “Finding a maximum independent set,” SIAM Journal on Computing, vol. 6, no. 3, pp. 537–546, 1977.
[40] D. Masters and C. Luschi, “Revisiting small batch training for deep neural networks,” April 2018. [Online]. Available: https://arxiv.org/abs/1804.07612
[41] Y. You, I. Gitman, and B. Ginsburg, “Large batch training of convolutional networks,” September 2017. [Online]. Available: https://arxiv.org/abs/1708.03888
[42] J. Geiping, M. Goldblum, P. E. Pope, M. Moeller, and T. Goldstein, “Stochastic training is not necessary for generalization,” April 2022. [Online]. Available: https://arxiv.org/abs/2109.14119
[43] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
[44] K. Pillutla, S. M. Kakade, and Z. Harchaoui, “Robust aggregation for federated learning,” IEEE Transactions on Signal Processing, vol. 70, pp. 1142–1154, February 2022.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
[46] S. Minsker, “Geometric median and robust estimation in Banach spaces,” Bernoulli, vol. 21, no. 4, pp. 2308–2335, November 2015.
[47] “NVIDIA CUDA toolkit,” September 2022. [Online]. Available: https://developer.nvidia.com/cuda-toolkit
[48] “Repository of ByzShield implementation,” August 2022. [Online]. Available: https://github.com/kkonstantinidis/ByzShield
[49] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure, dynamics, and function using networkx,” in Proceedings of the 7th Python in Science Conference, August 2008, pp. 11–15.
[50] R. S. Boyer and J. S. Moore, MJRTY—A Fast Majority Vote Algorithm. Dordrecht: Springer Netherlands, 1991.

XI Supplement

XI-A Asymptotic Complexity

If the gradient computation has linear complexity (assuming $\mathcal{O}(1)$ cost for the gradient computation with respect to one model parameter), then, since each worker is assigned $l$ files of $b/f$ samples each, the gradient computation cost at the worker level is $\mathcal{O}((lb/f)d)$ (with $K$ such computations running in parallel). In our schemes, however, $b$ is a constant multiple of $f$ , and in general $r<l$ ( $l=\binom{K-1}{r-1}$ for Aspis while $r=3$ is a typical redundancy value used in literature as well as in Aspis+); hence, the complexity becomes $\mathcal{O}(ld)$ which is similar to other redundancy-based schemes in [14, 16, 17].

For Aspis, as there are $\binom{K}{2}$ pairs of workers, each pair sharing $\binom{K-2}{r-2}$ files, the complexity to determine their agreements and form the graph is $\mathcal{O}(\binom{K}{2}\binom{K-2}{r-2})$ . The clique-finding problem that follows as part of Aspis detection is NP-complete. However, our experimental evidence suggests that for the kind of graphs we construct, this computation takes a negligible fraction of the execution time. The NetworkX package [49], which we use for enumerating all maximal cliques, is based on the algorithm of [37] and has asymptotic complexity $\mathcal{O}(3^{K/3})$ . We provide extensive simulations of the graph construction and clique enumeration time under the Aspis file assignment for $K=50,r=5$ and $K=200,r=3$ (cf. Tables VII(a), VII(b), VIII(a), and VIII(b) for the weak (ATT-1) and optimal (ATT-2) attacks introduced in Sections IV and V). We emphasize that this value of $K$ far exceeds the typical values of $K$ in prior work, and that this number of servers would suffice for most challenging training tasks. Even in this case, the cost of enumerating all cliques is negligible. For this experiment, we used an EC2 instance of type i3.16xlarge.

The complexity of robust aggregation varies significantly depending on the operator. For example, majority voting can be done in time that scales linearly with the number of votes using MJRTY, proposed in [50]. In our case, this is $\mathcal{O}(Kd)$ as the PS needs to use the $d$ -dimensional input from all $K$ machines. Krum [12], Multi-Krum [12], and Bulyan [24] are applied to all $K$ workers by default and require $\mathcal{O}(K^{2}(d+\log K))$ time.
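The linear-time majority vote mentioned above can be illustrated with a minimal sketch of the Boyer-Moore MJRTY idea [50]. The function name `mjrty` and the representation of each vote as a hashable tuple are assumptions made for illustration; this is not the implementation used in our experiments, and it relies on exact equality of votes.

```python
from typing import Hashable, Optional, Sequence


def mjrty(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the strict-majority element of `votes`, or None if there is none."""
    candidate, count = None, 0
    # Pass 1: cancel out pairs of differing votes; a true majority survives.
    for vote in votes:
        if count == 0:
            candidate, count = vote, 1
        elif vote == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify that the surviving candidate holds more than half the votes.
    occurrences = sum(1 for vote in votes if vote == candidate)
    if votes and occurrences > len(votes) // 2:
        return candidate
    return None


# Hypothetical example: three replicas of one file's gradient, one distorted.
replicas = [(0.5, -1.0), (9.0, 9.0), (0.5, -1.0)]
winner = mjrty(replicas)
```

Both passes touch each vote once, so the cost is linear in the number of votes; comparing two $d$ -dimensional votes costs $\mathcal{O}(d)$ .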

XI-B Floating-Point Precision and Gradient Equality Check

A gradient equality check is needed to determine whether two gradient vectors, e.g., $\mathbf{a}$ and $\mathbf{b}$ , are equal for our majority voting procedure to work. This check can be performed on an element-by-element basis or using the norm of the difference. There are two distinct cases we have considered:

• Case 1: Execution on CPUs: If we use the CPUs of the workers to compute the gradients, we have observed that two “honest” gradients, $\mathbf{a}$ and $\mathbf{b}$ , will always be exactly equal to each other element-wise. In this case, we use the numpy.array_equal function for all equality checks. If one of $\mathbf{a}$ , $\mathbf{b}$ is corrupted and the other one is “honest,” the program will effectively flag this as a disagreement between the corresponding workers.

• Case 2: Execution on GPUs: Most deep learning libraries [2, 1] provide non-deterministic back-propagation for the sake of faster and more efficient computations. In our implementation, we use NVIDIA CUDA [47]; hence, two “honest” floating-point (e.g., numpy.float32) gradients $\mathbf{a}$ and $\mathbf{b}$ computed by two different GPUs will not be exactly equal to each other. However, the floating-point precision errors were less than $10^{-6}$ in all of our experiments. In this case, we decide that the two workers agree with each other if the following criterion is satisfied for a small tolerance value of $10^{-5}$ :

$\frac{\lVert\mathbf{a}-\mathbf{b}\rVert_{2}}{\mathrm{max}\{\lVert\mathbf{a}\rVert_{2},\lVert\mathbf{b}\rVert_{2}\}}\leq 10^{-5}.$

On the other hand, if one of $\mathbf{a}$ , $\mathbf{b}$ is distorted even by the most sophisticated inner manipulation attack ALIE [23], then $\frac{\lVert\mathbf{a}-\mathbf{b}\rVert_{2}}{\mathrm{max}\{\lVert\mathbf{a}\rVert_{2},\lVert\mathbf{b}\rVert_{2}\}}$ is at least five orders of magnitude larger and typically ranges in $[1,100]$ .

In both cases, we have an integrity check in place to throw an exception if two “honest” gradients for the same task violate this criterion. We have not observed any violation of this in any of our exhaustive experiments.
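The two cases above can be sketched as a single agreement test. This is a minimal standard-library illustration, not our implementation (which relies on NumPy); the function name `gradients_agree`, the `exact` flag, and the treatment of two all-zero vectors as agreeing are assumptions.

```python
import math
from typing import Sequence

REL_TOL = 1e-5  # tolerance stated in the text for GPU-computed gradients


def l2_norm(x: Sequence[float]) -> float:
    """Euclidean norm of a vector given as a plain sequence."""
    return math.sqrt(sum(v * v for v in x))


def gradients_agree(a: Sequence[float], b: Sequence[float],
                    exact: bool = False, rel_tol: float = REL_TOL) -> bool:
    """Case 1 (exact=True): element-wise equality.
    Case 2 (exact=False): relative L2 distance at most rel_tol."""
    if len(a) != len(b):
        raise ValueError("gradients must have the same dimension")
    if exact:
        return list(a) == list(b)
    scale = max(l2_norm(a), l2_norm(b))
    if scale == 0.0:
        return True  # assumption: two all-zero gradients are treated as equal
    diff = l2_norm([x - y for x, y in zip(a, b)])
    return diff / scale <= rel_tol


honest_1 = [1.0, 2.0]
honest_2 = [1.0 + 1e-7, 2.0]   # small floating-point discrepancy
distorted = [-1.0, -2.0]       # hypothetical distorted gradient
```

Normalizing by the larger of the two norms makes the test scale-free, so one tolerance can serve gradients of very different magnitudes.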

Input: Loss vectors $\mathbf{l}_{i}$ , $i=1,2,\dots,c$ , maximum iterations $T$ .
1 Set $\mathbf{l}_{\mathrm{MC}}$ to be an empty vector.
2 for $t=1$ to $T$ do
3 Let $\mathbf{v}_{t}$ (of length $v$ ) be the vector with the losses of all $v$ simulations that ran for at least $t$ iterations, i.e., $\mathbf{v}_{t}=\begin{bmatrix}\mathbf{l}_{1,t},\mathbf{l}_{2,t},\dots,\mathbf{l}_{v,t}\end{bmatrix}$ . Append $\frac{\sum_{i}\mathbf{v}_{t,i}}{v}$ to $\mathbf{l}_{\mathrm{MC}}$ .
4 end for
Return $\mathbf{l}_{\mathrm{MC}}$ .

Algorithm 4 Average Monte Carlo loss across the simulations.

XI-C Note on Computation of Average Monte Carlo Loss

As discussed in Section VI-A, we ran 100 Monte Carlo simulations for each scheme and value of $q$ . Among the 100 simulations of a given experiment, we kept only those that converged to a final empirical loss of less than $0.1$ (cf. Section VI-A). To report the average loss of the converged simulations in Figures 4(a), 4(b), 5(a), 5(b), and 6, we need to take into account the fact that each of them may have run for a different number of iterations until convergence. To that end, let us collect the loss of each Monte Carlo simulation into a vector $\mathbf{l}_{i}$ , $i=1,2,\dots,c$ , where $c$ is the number of converged simulations. The vectors $\mathbf{l}_{i}$ do not necessarily have the same length; we used Algorithm 4 to compute and report the average loss of these vectors.
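The averaging over runs of unequal length can be sketched as follows. The function name `average_mc_loss` and the early exit once no run reaches the current iteration are assumptions added for illustration (the early exit avoids a division by zero when $T$ exceeds the longest run).

```python
from typing import List, Sequence


def average_mc_loss(losses: Sequence[Sequence[float]], max_iters: int) -> List[float]:
    """Per-iteration mean loss over the runs that lasted at least that long."""
    l_mc: List[float] = []
    for t in range(max_iters):
        # v_t: losses at iteration t+1 of every run with at least t+1 iterations.
        v_t = [run[t] for run in losses if len(run) > t]
        if not v_t:
            break  # assumption: stop once every run has already terminated
        l_mc.append(sum(v_t) / len(v_t))
    return l_mc


# Hypothetical loss traces of two converged runs with different lengths.
runs = [[4.0, 2.0, 1.0], [2.0, 1.0]]
curve = average_mc_loss(runs, max_iters=5)
```

Here the first two entries of `curve` average both runs, while the third uses only the longer run, so later iterations are averaged over fewer simulations.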

TABLE VI: Parameters used for training.

Figure	Schemes	Learning rate schedule
7(a)	1,2,5,6	$(0.01,0.7)$
7(a)	3,4	$(0.1,0.95)$
7(b)	1	$(0.001,0.95)$
7(c)	1,2	$(0.01,0.7)$
8(a)	1,2	$(0.1,0.7)$
8(a)	3,4	$(0.1,0.95)$
8(a)	5,6	$(0.01,0.7)$
8(b)	1	$(0.1,0.7)$
8(c)	1,2	$(0.01,0.975)$
11	1,2	$(0.1,0.7)$
11	3,4	$(0.1,0.95)$
11	5,6	$(0.01,0.95)$
11	1	$(0.1,0.95)$
11	2	$(0.01,0.7)$
11	1	$(0.01,0.7)$
11	2	$(0.1,0.95)$
11	3	$(0.01,0.7)$
12(a)	1,2	$(0.01,0.7)$
12(a)	3	$(0.1,0.95)$
12(b)	2	$(0.01,0.7)$
12(c)	2	$(0.01,0.95)$
13	1,2	$(0.01,0.7)$
13	3,4	$(0.01,0.975)$
13	5,6	$(0.1,0.975)$
14	1,2,3,4	$(0.0003,0.7)$
14	5,6	$(0.3,0.975)$

TABLE VII: Graph construction (in seconds) and clique enumeration (in milliseconds) time in Aspis graph of $K=50$ vertices and redundancy $r=5$ .

$q$	Graph construction (s)	Clique finding (ms)
5	15	1
10	15	$<1$
15	14	1
20	14	1

VII(a) Adversaries carry out weak attack ATT-1.

$q$	Graph construction (s)	Clique finding (ms)
5	16	2
10	14	1
15	14	1
20	14	1

VII(b) Adversaries carry out optimal attack ATT-2.

TABLE VIII: Graph construction (in seconds) and clique enumeration (in milliseconds) time in Aspis graph of $K=100$ vertices and redundancy $r=3$ .

$q$	Graph construction (s)	Clique finding (ms)
5	4	51
10	4	46
15	4	43
20	4	40

VIII(a) Adversaries carry out weak attack ATT-1.

$q$	Graph construction (s)	Clique finding (ms)
5	4	55
10	4	54
15	4	55
20	4	55

VIII(b) Adversaries carry out optimal attack ATT-2.
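The two timed steps, graph construction and clique enumeration, can be illustrated with a small pure-Python sketch. It is a stand-in for the NetworkX routine [49] used in our experiments, written as Bron-Kerbosch recursion with pivoting in the spirit of [37]. The data layout `returned[w][f]`, the function names, and the rule that an edge joins two workers that agree on every shared file are assumptions made for this illustration.

```python
from itertools import combinations
from math import comb
from typing import Dict, Iterator, Optional, Set


def build_agreement_graph(returned: Dict[int, Dict[int, tuple]]) -> Dict[int, Set[int]]:
    """returned[w][f] is worker w's gradient for file f (hypothetical layout)."""
    adj: Dict[int, Set[int]] = {w: set() for w in returned}
    for u, v in combinations(returned, 2):
        shared = returned[u].keys() & returned[v].keys()
        # Assumed edge rule: the two workers agree on all files they share.
        if all(returned[u][f] == returned[v][f] for f in shared):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj: Dict[int, Set[int]], r: Optional[Set[int]] = None,
                    p: Optional[Set[int]] = None,
                    x: Optional[Set[int]] = None) -> Iterator[Set[int]]:
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    r = set() if r is None else r
    p = set(adj) if p is None else p
    x = set() if x is None else x
    if not p and not x:
        yield set(r)
        return
    # Pivot on the vertex with the most neighbors among the candidates.
    pivot = max(p | x, key=lambda u: len(adj[u] & p))
    for v in list(p - adj[pivot]):
        yield from maximal_cliques(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)


# Toy instance: K = 4 workers, r = 2, one file per pair of workers.
K, R = 4, 2
files = list(combinations(range(K), R))
returned = {w: {f: (float(f),) for f, ws in enumerate(files) if w in ws}
            for w in range(K)}
returned[3] = {f: (-1.0,) for f in returned[3]}  # worker 3 distorts its results
adj = build_agreement_graph(returned)
largest = max(maximal_cliques(adj), key=len)

# Pairwise file comparisons needed for K = 50, r = 5 (the setting of Table VII).
comparisons = comb(50, 2) * comb(48, 3)
```

In this toy instance the three honest workers form the largest clique and the distorting worker is isolated. For the setting of Table VII, the count of pairwise file comparisons is $\binom{50}{2}\binom{48}{3}=21{,}187{,}600$ , which is consistent with graph construction dominating the measured time.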
