
Detection and Mitigation of
Byzantine Attacks in Distributed Training

Konstantinos Konstantinidis, Namrata Vaswani, and Aditya Ramamoorthy

This work was supported in part by the National Science Foundation (NSF) under grants CCF-1910840 and CCF-2115200. The material in this work has appeared in part at the 2022 IEEE International Symposium on Information Theory (ISIT). (Corresponding author: Aditya Ramamoorthy.) The authors are with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA (e-mail: {kostas, namrata, adityar}@iastate.edu).

Abstract

A plethora of modern machine learning tasks require the utilization of large-scale distributed clusters as a critical component of the training pipeline. However, abnormal Byzantine behavior of the worker nodes can derail the training and compromise the quality of the inference. Such behavior can be attributed to unintentional system malfunctions or orchestrated attacks; as a result, some nodes may return arbitrary results to the parameter server (PS) that coordinates the training. Recent work considers a wide range of attack models and has explored robust aggregation and/or computational redundancy to correct the distorted gradients.

In this work, we consider attack models ranging from strong ones ($q$ omniscient adversaries with full knowledge of the defense protocol that can change from iteration to iteration) to weak ones ($q$ randomly chosen adversaries with limited collusion abilities that change only every few iterations). Our algorithms rely on redundant task assignments coupled with detection of adversarial behavior. We also show the convergence of our method to the optimal point under common assumptions and settings considered in the literature. For strong attacks, we demonstrate a reduction in the fraction of distorted gradients ranging from 16% to 99% compared with the prior state of the art. Our top-1 classification accuracy results on the CIFAR-10 data set demonstrate a 25% advantage in accuracy (averaged over strong and weak scenarios) under the most sophisticated attacks compared with state-of-the-art methods.

Index Terms:
Byzantine resilience, distributed training, gradient descent, deep learning, optimization, security.

I Introduction and Background

Increasingly complex machine learning models with large data set sizes are nowadays routinely trained on distributed clusters. A typical setup consists of a single central machine (parameter server or PS) and multiple worker machines. The PS owns the data set, assigns gradient tasks to workers, and coordinates the protocol. The workers then compute gradients of the loss function with respect to the model parameters. These computations are returned to the PS, which aggregates them, updates the model, and maintains the global copy of it. The new copy is communicated back to the workers. Multiple iterations of this process are performed until convergence has been achieved. PyTorch [1], TensorFlow [2], MXNet [3], CNTK [4] and other frameworks support this architecture.

These setups offer significant speedup benefits and enable training challenging, large-scale models. Nevertheless, they are vulnerable to misbehavior by the worker nodes, i.e., when a subset of them returns erroneous computations to the PS, either inadvertently or on purpose. This “Byzantine” behavior can be attributed to a wide range of reasons. The principal causes of inadvertent errors are hardware and software malfunctions (e.g., [5]). Reference [6] exposes the vulnerability of neural networks to such failures and identifies weight parameters that could maximize accuracy degradation. The gradients may also be distorted in an adversarial manner. As ML problems demand more resources, many jobs are often outsourced to external commodity servers (cloud) whose security cannot be guaranteed. Thus, an adversary may be able to gain control of some devices and fool the model. The distorted gradients can derail the optimization and lead to low test accuracy or slow convergence.

Achieving robustness in the presence of Byzantine node behavior and devising training algorithms that can efficiently aggregate the gradients has inspired several works [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. The first idea is to filter the corrupted computations from the training without attempting to identify the Byzantine workers. Specifically, many existing papers use majority voting and median-based defenses [7, 8, 9, 10, 11, 12, 13] for this purpose. In addition, several works also operate by replicating the gradient tasks [14, 15, 16, 17, 18] allowing for consistency checks across the cluster. The second idea for mitigating Byzantine behavior involves detecting the corrupted devices and subsequently ignoring their calculations [19, 20, 21], in some instances paired with redundancy [17]. In this work, we propose a technique that combines the usage of redundant tasks, filtering, and detection of Byzantine workers. Our work is applicable to a broad range of assumptions on the Byzantine behavior.

There is much variability in the adversarial assumptions that prior work considers. For instance, prior work differs in the maximum number of adversaries considered, their ability to collude, their possession of knowledge involving the data assignment and the protocol, and whether the adversarial machines are chosen at random or systematically. We will initially examine our methods under strong adversarial models similar to those in prior work [22, 14, 23, 11, 10, 24, 25]. We will then extend our algorithms to tackle weaker failures that are not necessarily adversarial but rather common in commodity machines [5, 6, 26]. We expand on related work in the upcoming Section II.

II Related Work and Summary of Contributions

II-A Related Work

All work in this area (including ours) assumes a reliable parameter server that possesses the global data set and can assign specific subsets of it to workers. Robust aggregation methods have also been proposed for federated learning [27, 28]; however, as we make no assumption of privacy, neither our work nor the methods we compare with apply to federated learning.

One category of defenses splits the data set into $K$ batches and assigns one to each worker with the ultimate goal of suitably aggregating the results from the workers. Early work in the area [12] established that no linear aggregation method (such as averaging) can be robust even to a single adversarial worker. This has inspired alternative methods collectively known as robust aggregation. Majority voting, geometric median, and squared-distance-based techniques fall into this category [8, 9, 10, 11, 12, 13].

One of the most popular robust aggregation techniques is known as mean-around-median or trimmed mean [11, 10]. It handles each dimension of the gradient separately and returns the average of a subset of the values that are closest to the median. Auror [25] is a variant of trimmed mean which partitions the values of each dimension into two clusters using k-means and discards the smaller cluster if the distance between the two exceeds a threshold; the values of the larger cluster are then averaged. signSGD in [26] transmits only the sign of the gradient vectors from the workers to the PS and exploits majority voting to decide the overall update; this practice reduces the communication time and denies any individual worker too much effect on the update.
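To make the mean-around-median idea concrete, the following is a minimal sketch, not the implementation of [10, 11]. The function name `mean_around_median` and the parameter `keep` (how many values nearest the median are averaged) are assumptions introduced only for illustration.

```python
from statistics import median

def mean_around_median(gradients, keep):
    """Coordinate-wise mean of the `keep` values closest to the median.

    `gradients` is a list of equal-length lists, one per worker.
    """
    dims = len(gradients[0])
    aggregated = []
    for d in range(dims):
        column = [g[d] for g in gradients]
        med = median(column)
        # Keep the values nearest to the median of this coordinate.
        closest = sorted(column, key=lambda v: abs(v - med))[:keep]
        aggregated.append(sum(closest) / keep)
    return aggregated

# Four workers; the last one returns an outlier in every dimension.
grads = [[1.0, 2.0], [1.2, 2.2], [0.8, 1.8], [100.0, -50.0]]
robust = mean_around_median(grads, keep=3)
```

In this toy input the outlier is farthest from the median in both coordinates, so it is excluded from the average.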

Krum in [12] chooses a single honest worker for the next model update, discarding the data from the rest of them. The chosen gradient is the one closest to its $k\in\mathbb{N}$ nearest neighbors. In later work [24], the authors recognized that Krum may converge to an ineffectual model in the landscape of non-convex high dimensional problems, such as in neural networks. They showed that a large adversarial change to a single parameter with a minor impact on the $L^{p}$ norm can make the model ineffective. In the same work, they present an alternative defense called Bulyan to oppose such attacks. The algorithm works in two stages. In the first part, a selection set of potentially benign values is iteratively constructed. In the second part, a variant of trimmed mean is applied to the selection set. Nevertheless, if $K$ machines are used, Bulyan is designed to defend against at most $(K-3)/4$ corrupted workers.

Another category of defenses is based on redundancy and seeks resilience to Byzantines by replicating the gradient computations such that each of them is processed by more than one machine [15, 16, 17, 18]. Even though this approach requires more computation load, it comes with stronger guarantees of correcting the erroneous gradients. Existing redundancy-based techniques are sometimes combined with robust aggregation [16]. The main drawback of recent work in this category is that the training can be easily disrupted by a powerful, omniscient adversary that has full control of a subset of the nodes and can mount judicious attacks [14].

The redundancy-based method DRACO in [17] uses a simple Fractional Repetition Code (FRC) (that operates by grouping workers) and the cyclic repetition code introduced in [29, 30] to ensure robustness; majority voting and Fourier decoders try to alleviate the adversarial effects. Their work ensures exact recovery (as if the system had no adversaries) with $q$ Byzantine nodes, when each task is replicated $r\geq 2q+1$ times; this bound is the information-theoretic minimum, and DRACO is not applicable if it is violated. Nonetheless, this requirement is very restrictive for the typical assumption that up to half of the workers can be Byzantine.

DETOX in [16] extends DRACO and uses a simple grouping strategy to assign the gradients. It performs multiple stages of aggregation to gradually filter the adversarial values. The first stage involves majority voting, while the following stages perform robust aggregation. Unlike DRACO, the authors do not seek exact recovery; hence the minimum requirement in $r$ is small. However, the theoretical resilience guarantees that DETOX provides depend heavily on a “random choice” of the adversarial workers. In fact, we have crafted simple attacks [14] to make this aggregator fail under a more careful choice of adversaries. Furthermore, their theoretical results hold when the fraction of Byzantines is less than $1/40$.

A third category focuses on ranking and/or detection [19, 17, 20]; the objective is to rank workers using a reputation score to identify suspicious machines and exclude them or give them lower weight in the model update. This is achieved by means of computing reputation scores for each machine or by using ideas from coding theory to assign tasks to workers (encoding) and to detect the adversaries (decoding). Zeno in [20] ranks each worker using a score that depends on the estimated loss and the magnitude of the update. Zeno requires strict assumptions on the smoothness of the loss function and the gradient estimates’ variance to tolerate an adversarial majority in the cluster. Similarly, ByGARS [19] computes reputation scores for the nodes based on an auxiliary data set; these scores are used to weigh the contribution of each gradient to the model update.

II-B Contributions

In this paper, we propose novel techniques which combine redundancy, detection, and robust aggregation for Byzantine resilience under a range of attack models and assumptions on the dataset/loss function.

Our first scheme Aspis is a subset-based assignment method for allocating tasks to workers in strong adversarial settings: up to $q$ omniscient, colluding adversaries that can change at each iteration. We also consider weaker attacks: adversaries chosen randomly with limited collusion abilities, changing only after a few iterations at a time. It is conceivable that Aspis should continue to perform well with weaker attacks. However, as discussed later (Section V-B), Aspis requires large batch sizes (for the mini-batch SGD). It is well-recognized that large batch sizes often cause performance degradation in training [31]. Accordingly, for this class of attacks, we present a different algorithm called Aspis+ that can work with much smaller batch sizes. Both Aspis and Aspis+ use combinatorial ideas to assign the tasks to the worker nodes. Our work builds on our initial work in [22] and makes the following contributions.

  • We demonstrate a worst-case upper bound (under any possible attack) on the fraction of corrupted gradients when Aspis is used. Even in this adverse scenario, our method enjoys a reduction in the fraction of corrupted gradients of more than 90% compared with DETOX [16]. A weaker variation of this attack is where the adversaries do not collude and act randomly. In this case, we demonstrate that the Aspis protocol allows for detecting all the adversaries. In both scenarios, we provide theoretical guarantees on the fraction of corrupted gradients.

  • In the setting where the dataset is distributed i.i.d. and the loss function is strongly convex and other technical conditions hold, we demonstrate a proof of convergence for Aspis. We demonstrate numerical results on the linear regression problem in this part; these show the advantage of Aspis over competing methods such as DETOX.

  • For weaker attacks (discussed above), our experimental results indicate that Aspis+ detects all adversaries within approximately 5 iterations.

  • We present top-1 classification accuracy experiments on the CIFAR-10 [32] data set for various gradient distortion attacks coupled with choice/behavior patterns of the adversarial nodes. Under the most sophisticated distortion methods [23], the performance gap between Aspis/Aspis+ and other state-of-the-art methods is substantial, e.g., for Aspis it is 43% in the strong scenario (cf. Figure 7(a)), and for Aspis+ 19% in the weak scenario (cf. Figure 13).

Figure 1: Aggregation of gradients on a cluster.

III Distributed Training Formulation

Assume a loss function $l_{i}(\mathbf{w})$ for the $i^{\mathrm{th}}$ sample of the dataset, where $\mathbf{w}\in\mathbb{R}^{d}$ is the set of parameters of the model. (The paper’s heavily-used notation is summarized in Appendix Table II.) The objective of distributed training is to minimize the empirical loss function $\hat{L}(\mathbf{w})$ with respect to $\mathbf{w}$, where

$$\hat{L}(\mathbf{w})=\frac{1}{n}\sum\limits_{i=1}^{n}l_{i}(\mathbf{w}).$$

Here $n$ denotes the number of samples.

We use either gradient descent (GD) or mini-batch Stochastic Gradient Descent (SGD) to solve this optimization. In both methods, initially $\mathbf{w}$ is randomly set to $\mathbf{w}_{0}$ ($\mathbf{w}_{t}$ is the model state at the end of iteration $t$). When using GD, the update equation is

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\frac{1}{n}\sum\limits_{i=1}^{n}\nabla l_{i}(\mathbf{w}_{t}). \tag{1}$$

Under mini-batch SGD a random batch $B_{t}$ of $b$ samples is chosen to perform the update in the $t^{\mathrm{th}}$ iteration. Thus,

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\frac{1}{|B_{t}|}\sum\limits_{i\in B_{t}}\nabla l_{i}(\mathbf{w}_{t}). \tag{2}$$

In both methods $\eta_{t}$ is the learning rate at the $t^{\mathrm{th}}$ iteration. The workers, denoted $U_{1},U_{2},\dots,U_{K}$, compute gradients on subsets of the batch. The training is synchronous, i.e., the PS waits for all workers to return before performing an update. It stores the data set and the model and coordinates the protocol. It can be observed that GD can be considered an instance of mini-batch SGD where the batch at each iteration is the entire dataset. Our discussion below is in the context of mini-batch SGD but can easily be applied to the GD case by using this observation.
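As an illustration of update (2), the sketch below runs mini-batch SGD on a toy one-dimensional least-squares problem. The per-sample loss $l_i(w)=\tfrac{1}{2}(w x_i-y_i)^2$, the synthetic data, the constant learning rate, and the helper names `grad_sample` and `sgd_step` are assumptions chosen only for this example; they are not taken from the paper's experiments.

```python
import random

def grad_sample(w, x, y):
    # Gradient of the assumed per-sample loss l_i(w) = 0.5 * (w * x - y) ** 2.
    return (w * x - y) * x

def sgd_step(w, data, batch, lr):
    # Update (2): average the per-sample gradients over the batch B_t.
    avg = sum(grad_sample(w, *data[i]) for i in batch) / len(batch)
    return w - lr * avg

rng = random.Random(0)
# Synthetic, noiseless data generated by the model y = 3 * x.
data = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(200))]

w = 0.0
for t in range(500):
    batch = rng.sample(range(len(data)), 20)  # random batch B_t with b = 20
    w = sgd_step(w, data, batch, lr=0.5)
```

Passing `batch = range(len(data))` at every iteration recovers the GD update (1), in line with the observation that GD is mini-batch SGD with the whole dataset as the batch.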

We consider settings in this work that depend on the underlying assumptions on the dataset and the loss function. Setting-I does not make any assumption on the dataset or the loss function. In Setting-II at the top-level (technical details appear in Section VI) we assume that the data samples are distributed i.i.d. and the loss function is strongly-convex. The results that we provide depend on the underlying setting.

Task assignment: Each batch $B_{t}$ is split into $f$ disjoint subsets $\{B_{t,i}\}_{i=0}^{f-1}$, which are then assigned to the workers according to our placement policy. In what follows we refer to these as “files” to avoid confusion with other subsets that we need to refer to. Computational redundancy is introduced by assigning a given file to $r>1$ workers. As the load on all the workers is equal, it follows that each worker is responsible for $l=fr/K$ files ($l$ is the computation load). We let $\mathcal{N}^{w}(U_{j})$ be the set of files assigned to worker $U_{j}$ and $\mathcal{N}^{f}(B_{t,i})$ be the group of workers assigned to file $B_{t,i}$; our placement scheme is such that $\mathcal{N}^{f}(B_{t,i})$ uniquely identifies the file $B_{t,i}$; thus, we will sometimes refer to the file $B_{t,i}$ by its worker assignment, $\mathcal{N}^{f}(B_{t,i})$. We will also occasionally use the term group (of the assigned workers) to refer to a file. We discuss the actual placement algorithms used in this work in the upcoming subsection III-A.

Training: Each worker $U_{j}$ is given the task of computing the sum of the gradients on each of its assigned files. For example, if file $B_{t,i}$ is assigned to $U_{j}$, then it calculates $\sum_{i^{\prime}\in B_{t,i}}\nabla l_{i^{\prime}}(\mathbf{w}_{t})$ and returns it to the PS. In every iteration, the PS will run our detection algorithm once it receives the results from all the workers in an effort to identify the $q$ adversaries and will act according to the detection outcome.

Figure 1 depicts this process. There are $K=6$ machines and $f=4$ distinct files (represented by colored circles) replicated $r=3$ times. (Some arrows and ellipses have been omitted from Figure 1; however, all files will be going through detection.) Each worker is assigned to $l=2$ files and computes the sum of gradients (or a distorted value) on each of them. The “d” ellipses refer to PS’s detection operations immediately after receiving all the gradients.

Metrics: We consider various metrics in our work. For Setting-I we consider (i) the fraction of distorted files, and (ii) the top-1 test accuracy of the final trained model. For the distortion fraction, let us denote the number of distorted files upon detection and aggregation by $c^{(q)}$ and its maximum value (under a worst-case attack) by $c_{\mathrm{max}}^{(q)}$. The distortion fraction is $\epsilon:=c^{(q)}/f$. The top-1 test accuracy is determined via numerical experiments. In Setting-II, we additionally consider proofs and rates of convergence of the proposed algorithms. We provide theoretical results and supporting experimental results on these.

TABLE I: Adversarial models considered in literature.

| Scheme | Byzantine choice/orchestration | Gradient distortion |
|---|---|---|
| DRACO [17] | optimal | reversed gradient, constant |
| DETOX [16] | random | ALIE, reversed gradient, constant |
| ByzShield [14] | optimal | ALIE, reversed gradient, constant |
| Bulyan [24] | N/A | $\ell_{2}$-norm attack targeted on Bulyan |
| Multi-Krum [12] | N/A | random high-variance Gaussian vector |
| Aspis | ATT-1, ATT-2 | ALIE, FoE, reversed gradient |
| Aspis+ | ATT-3 | ALIE, constant |

III-A Task Assignment

Let $\mathcal{U}$ be the set of workers. Our scheme has $|\mathcal{U}|\leq f$ (i.e., no more workers than files). Our assignment of files to worker nodes is specified by a bipartite graph $\mathbf{G}_{task}$ where the left vertices correspond to the workers, and the right vertices correspond to the files. An edge in $\mathbf{G}_{task}$ between worker $U_{i}$ and a file $B_{t,j}$ indicates that $U_{i}$ is responsible for processing file $B_{t,j}$.

III-A1 Aspis

For the Aspis scheme we construct $\mathbf{G}_{task}$ as follows. The left vertex set is $\{1,2,\dots,K\}$ and the right vertex set corresponds to $r$-sized subsets of $\{1,2,\dots,K\}$ (there are $\binom{K}{r}$ of them). An edge between $1\leq i\leq K$ and $S\subset\{1,2,\dots,K\}$ (where $|S|=r$) exists if $i\in S$. The worker set $\{U_{1},\dots,U_{K}\}$ is in one-to-one correspondence with $\{1,2,\dots,K\}$ and the files $B_{t,0},\dots,B_{t,f-1}$ are in one-to-one correspondence with the $r$-sized subsets.

Example 1.

Consider $K=7$ workers $U_{1},U_{2},\dots,U_{7}$ and $r=3$. Based on our protocol, the $f=\binom{7}{3}=35$ files of each batch $B_{t}$ are associated one-to-one with 3-subsets of $\mathcal{U}$, e.g., the subset $S=\{U_{1},U_{2},U_{3}\}$ corresponds to file $B_{t,0}$ and will be processed by $U_{1}$, $U_{2}$, and $U_{3}$.
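The subset-based placement of Example 1 can be sketched in a few lines. The variable names `file_groups` and `files_of` are assumptions standing in for $\mathcal{N}^{f}(\cdot)$ and $\mathcal{N}^{w}(\cdot)$, and ordering the subsets lexicographically (so that $\{U_1,U_2,U_3\}$ is file $B_{t,0}$) is one possible indexing consistent with the example, not necessarily the one used by the authors.

```python
from itertools import combinations

K, r = 7, 3
workers = range(1, K + 1)

# One file per r-subset of workers; list index i plays the role of B_{t,i}.
file_groups = list(combinations(workers, r))  # N^f(B_{t,i})

# Files assigned to each worker, i.e. N^w(U_j).
files_of = {u: [i for i, g in enumerate(file_groups) if u in g] for u in workers}

f = len(file_groups)     # number of files
load = len(files_of[1])  # computation load l of worker U_1
```

Here every worker ends up with the same load, which agrees with $l=fr/K$.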

Remark 1.

Our task assignment ensures that every pair of workers processes $\binom{K-2}{r-2}$ files in common. Moreover, the number of adversaries is $q<K/2$. Thus, upon receiving the gradients from the workers, the PS can examine them for consistency and flag certain nodes as adversarial if their computed gradients differ from those of $q+1$ or more of the other nodes. We use this intuition to detect and mitigate the adversarial effects and compute the fraction of corrupted files.
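The pairwise count in Remark 1 can be checked by brute force for small parameters. The helper name `common_file_counts` is hypothetical; the sketch simply enumerates all $r$-subsets and counts how many contain each pair of workers.

```python
from itertools import combinations
from math import comb

def common_file_counts(K, r):
    """Number of r-subsets of {1, ..., K} that contain each pair of workers."""
    counts = {pair: 0 for pair in combinations(range(1, K + 1), 2)}
    for group in combinations(range(1, K + 1), r):
        for pair in combinations(group, 2):
            counts[pair] += 1
    return counts

counts = common_file_counts(K=7, r=3)
expected = comb(7 - 2, 3 - 2)  # binom(K-2, r-2)
```

For $K=7$ and $r=3$ every pair of workers shares the same number of files, matching $\binom{K-2}{r-2}$.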

III-A2 Aspis+

For Aspis+, we use combinatorial designs [33] to assign the gradient tasks to workers. Formally, a design is a pair ($X$, $\mathcal{A}$) consisting of a set of $v$ elements (points), $X$, and a family $\mathcal{A}$ (i.e., multiset) of nonempty subsets of $X$ called blocks, where each block has the same cardinality $k$. Similar to Aspis, the workers and files are in one-to-one correspondence with the points and the blocks, respectively. Hence, for our purposes, the $k$ parameter of the design is the redundancy. A $t$-$(v,k,\lambda)$ design is one where any subset of $t$ points appear together in exactly $\lambda$ blocks. The case of $t=2$ has been studied extensively in the literature and is referred to as a balanced incomplete block design (BIBD). A bipartite graph representing the incidence between the points and the blocks can be obtained naturally by letting the points correspond to the left vertices, and the blocks correspond to the right vertices. An edge exists between a point and a block if the point is contained in the block.

Example 2.

A $2$-$(7,3,1)$ design, also known as the Fano plane, consists of the $v=7$ points $X=\{1,2,\dots,7\}$ and the block multiset $\mathcal{A}$ contains the blocks $\{1,2,3\}$, $\{1,4,7\}$, $\{2,4,6\}$, $\{3,4,5\}$, $\{2,5,7\}$, $\{1,5,6\}$ and $\{3,6,7\}$ with each block being of size $k=3$. In the bipartite graph $\mathbf{G}_{task}$ representation, we would have an edge, e.g., between point $2$ and blocks $\{1,2,3\},\{2,4,6\}$, and $\{2,5,7\}$.
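The defining property of the design in Example 2 can be verified directly. The helper `pair_coverage` is a hypothetical name; it counts, for every pair of points, the number of blocks containing both.

```python
from itertools import combinations

# Blocks of the Fano plane as listed in Example 2.
fano_blocks = [{1, 2, 3}, {1, 4, 7}, {2, 4, 6}, {3, 4, 5},
               {2, 5, 7}, {1, 5, 6}, {3, 6, 7}]

def pair_coverage(points, blocks):
    """For each pair of points, count the blocks that contain both."""
    return {pair: sum(set(pair) <= b for b in blocks)
            for pair in combinations(points, 2)}

coverage = pair_coverage(range(1, 8), fano_blocks)
is_2_design_lambda_1 = set(coverage.values()) == {1}

# Blocks incident to point 2 in the bipartite graph G_task.
blocks_of_point_2 = [sorted(b) for b in fano_blocks if 2 in b]
```

Every pair of points appears in exactly one block ($\lambda=1$), and point $2$ is incident to the three blocks named in the example.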

In Aspis+ we construct $\mathbf{G}_{task}$ by the bipartite graph representing an appropriate $2$-$(v,k,\lambda)$ design.

Another change compared to the Aspis placement scheme is that the points of the design will be randomly permuted at each iteration, i.e., for permutation $\pi$, the PS will map $\{U_{1},U_{2},\dots,U_{K}\}\xrightarrow{\pi}\{\pi(U_{1}),\pi(U_{2}),\dots,\pi(U_{K})\}$. For instance, let us circularly permute the points of the Fano plane in Example 2 as $\pi(U_{i})=U_{i+1}$, $i=1,2,\dots,K-1$ and $\pi(U_{K})=U_{1}$. Then, the file assignment at the next iteration will be based on the block collection $\mathcal{A}=\{\{2,3,4\},\{1,2,5\},\{3,5,7\},\{4,5,6\},\{1,3,6\},\{2,6,7\},\{1,4,7\}\}$. Permuting the assignment causes each Byzantine to disagree with more workers and to be detected in fewer iterations; details will be discussed in Section V-C. Owing to this permutation, we use a time subscript for the files assigned to $U_{i}$ for the $t^{\mathrm{th}}$ iteration; this is denoted by $\mathcal{N}_{t}^{w}(U_{i})$.
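The circular permutation above can be reproduced with a short sketch. The helpers `permute_blocks` and `random_permutation` are hypothetical names; the second shows one way a PS could draw a fresh random permutation per iteration, which is an assumption about implementation rather than the authors' code.

```python
import random

def permute_blocks(blocks, pi):
    """Apply a point permutation pi (a dict) to every block of the design."""
    return [sorted(pi[p] for p in block) for block in blocks]

def random_permutation(K, rng):
    """A uniformly random permutation of the points 1, ..., K."""
    images = list(range(1, K + 1))
    rng.shuffle(images)
    return dict(zip(range(1, K + 1), images))

K = 7
fano_blocks = [[1, 2, 3], [1, 4, 7], [2, 4, 6], [3, 4, 5],
               [2, 5, 7], [1, 5, 6], [3, 6, 7]]

# pi(U_i) = U_{i+1} for i < K and pi(U_K) = U_1.
shift = {i: i % K + 1 for i in range(1, K + 1)}
next_blocks = permute_blocks(fano_blocks, shift)

# A random relabeling keeps the block sizes (the redundancy) unchanged.
random_blocks = permute_blocks(fano_blocks, random_permutation(K, random.Random(1)))
```

The list `next_blocks` coincides with the block collection stated in the text for the next iteration.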

IV Adversarial Attack Models and Gradient Distortion Methods

We now discuss the different Byzantine models that we consider in this work. For all the models, we assume that at most $q<K/2$ workers can be adversarial. For each assigned file $B_{t,i}$ a worker $U_{j}$ will return the value $\hat{\mathbf{g}}_{t,i}^{(j)}$ to the PS. Then,

$$\hat{\mathbf{g}}_{t,i}^{(j)}=\left\{\begin{array}{ll}\mathbf{g}_{t,i}&\text{ if }U_{j}\text{ is honest},\\ \mathbf{*}&\text{otherwise},\end{array}\right. \tag{3}$$

where $\mathbf{g}_{t,i}$ is the sum of the loss gradients on all samples in file $B_{t,i}$, i.e.,

$$\mathbf{g}_{t,i}=\sum\limits_{j\in B_{t,i}}\nabla l_{j}(\mathbf{w}_{t})$$

and $\mathbf{*}$ is any arbitrary vector in $\mathbb{R}^{d}$. Within this setup, we examine adversarial scenarios that differ based on the behavior of the workers. Table I provides a high-level summary of the Byzantine models considered in this work as well as in related papers. As we will discuss in Section VIII-B, for those schemes that do not involve redundancy and merely split the work equally among the $K$ workers, all possible choices of the Byzantine set are equivalent, and no orchestration of them will change the defense’s output (we use the term orchestration to refer to the method adversaries use to collude and attack collectively as a group); hence, those cases are denoted by “N/A” in the table.

IV-A Attack 1

We first consider a weak attack, denoted ATT-1, where the Byzantine nodes operate independently (i.e., do not collude) and attempt to distort the gradient on any file they participate in. For instance, a node may try to return arbitrary gradients on all its assigned files. For this attack, the identity of the adversarial workers may be arbitrary at each iteration as long as there are at most $q$ of them.

Remark 2.

We emphasize that even though we call this attack “weak”, this is the attack model considered in several prior works [16, 17]. To the best of our knowledge, most of them have not considered the adversarial problem through the lens of detection.

IV-B Attack 2

Our second scenario, named ATT-2, is the strongest one we consider. We assume that the adversaries have full knowledge of the task assignment at each iteration and the detection strategies employed by the PS. The adversaries can collude in the “best” possible way to corrupt as many gradients as possible. Moreover, the set of adversaries can also change from iteration to iteration as long as there are at most $q$ of them.

IV-C Attack 3

This attack is similar to ATT-1 and will be called ATT-3. On the one hand, it is weaker in the sense that the set of Byzantines (denoted $A$) does not change in every iteration. Instead, we will assume that there is a “Byzantine window” of $T_{b}$ iterations in which the set $A$ remains fixed. Also, the set $A$ will be a randomly chosen set of $q$ workers from $\mathcal{U}$, i.e., it will not be chosen systematically. A new set will be chosen at random at all iterations $t$, where $t\equiv 0 \pmod{T_{b}}$. Conversely, it is stronger than ATT-1 since we allow for limited collusion amongst the adversarial nodes. In particular, the Byzantines simulated by ATT-3 will distort only the files for which a Byzantine majority exists.
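A rough simulation of the two ingredients of ATT-3 (a random Byzantine set that is fixed within each window of $T_b$ iterations, and distortion restricted to files with a Byzantine majority) might look as follows. The function names and the per-window seeding scheme are assumptions made for reproducibility of the sketch; the Fano-plane blocks from Example 2 serve as the file groups.

```python
import random

def byzantine_set(t, K, q, T_b, seed=0):
    """Random set A of q workers, fixed within each window of T_b iterations."""
    window = t // T_b
    rng = random.Random(seed * 1_000_003 + window)  # hypothetical seeding scheme
    return set(rng.sample(range(1, K + 1), q))

def majority_files(file_groups, byzantines):
    """Indices of files whose worker group has a strict Byzantine majority."""
    return [i for i, group in enumerate(file_groups)
            if 2 * len(byzantines & set(group)) > len(group)]

fano_blocks = [[1, 2, 3], [1, 4, 7], [2, 4, 6], [3, 4, 5],
               [2, 5, 7], [1, 5, 6], [3, 6, 7]]

targets = majority_files(fano_blocks, {1, 2, 4})  # files ATT-3 would distort
A_t = byzantine_set(t=3, K=7, q=3, T_b=5)
```

With redundancy $r=3$, a file is targeted only when at least two of its three workers are Byzantine.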

IV-D Gradient Distortion Methods

For each of the attacks considered above, the adversaries can distort the gradient in specific ways. Several such techniques have been considered in the literature and our numerical experiments use these methods for comparing different methods. For instance, ALIE [23] involves communication among the Byzantines in which they jointly estimate the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ of the batch’s gradient for each dimension $i$ and subsequently use them to construct a distorted gradient that attempts to distort the median of the results. Another powerful attack is Fall of Empires (FoE) [34] which performs “inner product manipulation” to make the inner product between the true gradient and the robust estimator negative even when their distance is upper bounded by a small value. Reversed gradient distortion returns $-c\mathbf{g}$ for $c>0$ to the PS instead of the true gradient $\mathbf{g}$. The constant attack involves the Byzantine workers sending a constant gradient with all elements equal to a fixed value. To the best of our knowledge, the ALIE algorithm is the most sophisticated attack in the literature for deep learning techniques.
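The two simplest distortions described above, reversed gradient and constant, can be sketched as follows. The function names, the scaling $c=2$, and the constant value $-1$ are illustrative assumptions; ALIE and FoE are not reproduced here because their details are given in [23] and [34].

```python
def reversed_gradient(g, c=1.0):
    """Return -c * g for some c > 0 instead of the true gradient g."""
    if c <= 0:
        raise ValueError("c must be positive")
    return [-c * x for x in g]

def constant_gradient(dim, value):
    """Return a gradient whose elements all equal the same fixed value."""
    return [value] * dim

g = [0.5, -2.0, 1.5]                   # a toy "true" gradient
rev = reversed_gradient(g, c=2.0)
const = constant_gradient(len(g), -1.0)

# The reversed gradient always has a negative inner product with a nonzero g.
inner = sum(a * b for a, b in zip(g, rev))
```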

V Defense Strategies in Aspis and Aspis+

In our work we use the Aspis task assignment and detection strategy for attacks ATT-1 and ATT-2. For ATT-3, we will use Aspis+. Recall that the methods differ in their corresponding task assignments. Nevertheless, the central idea in both detection methods is for the PS to apply a set of consistency checks on the obtained gradients from the different workers at each iteration to identify the adversaries.

Let the current set of adversaries be $A\subset\{U_{1},U_{2},\dots,U_{K}\}$ with $|A|=q$; also, let $H$ be the honest worker set. The set $A$ is unknown, but our goal is to provide an estimate $\hat{A}$ of it. Ideally, the two sets should be identical. In general, depending on the adversarial behavior, we will be able to provide a set $\hat{A}$ such that $\hat{A}\subseteq A$. For each file, there is a group of $r$ workers which have processed it, and there are $\binom{r}{2}$ pairs of workers in each group. Each such pair may or may not agree on the gradient value for the file. Let us encode the agreement of workers $U_{j_{1}},U_{j_{2}}$ on common file $i$ during the current iteration $t$ by

$$\alpha_{t,i}^{(j_{1},j_{2})}:=\left\{\begin{array}{ll}1&\text{if }\hat{\mathbf{g}}_{t,i}^{(j_{1})}=\hat{\mathbf{g}}_{t,i}^{(j_{2})},\\ 0&\text{otherwise}.\end{array}\right. \tag{4}$$

Across all files, the total number of agreements between a pair of workers $U_{j_{1}},U_{j_{2}}$ during the $t^{\mathrm{th}}$ iteration is denoted by

$$\alpha_{t}^{(j_{1},j_{2})}:=\sum_{i\in\mathcal{N}_{t}^{w}(U_{j_{1}})\cap\mathcal{N}_{t}^{w}(U_{j_{2}})}\alpha_{t,i}^{(j_{1},j_{2})}. \tag{5}$$

Since the placement is known, the PS can always perform the above computation. Next, we form an undirected graph $\mathbf{G}_{t}$ whose vertices correspond to all workers $\{U_{1},U_{2},\dots,U_{K}\}$. An edge $(U_{j_{1}},U_{j_{2}})$ exists in $\mathbf{G}_{t}$ only if the computed gradients (at iteration $t$) of $U_{j_{1}}$ and $U_{j_{2}}$ match in “all” their common assignments.
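The agreement counts in (4) and (5) and the resulting graph $\mathbf{G}_t$ can be sketched as below. The data layout (a dictionary `returned` keyed by `(worker, file)`), the function name `agreement_graph`, and the toy cluster with $K=4$, $r=2$ and a single Byzantine worker are assumptions for illustration. The sketch also assumes that every pair of workers shares at least one file, as is the case for the placements discussed here.

```python
from itertools import combinations

def agreement_graph(returned, files_of):
    """Edges (j1, j2) between workers that agree on all their common files."""
    edges = set()
    for j1, j2 in combinations(sorted(files_of), 2):
        common = set(files_of[j1]) & set(files_of[j2])
        # Eq. (5): total number of agreements over the common files.
        alpha = sum(returned[j1, i] == returned[j2, i] for i in common)
        if alpha == len(common):
            edges.add((j1, j2))
    return edges

# Toy cluster: one file per pair of workers (K = 4, r = 2).
file_groups = list(combinations(range(1, 5), 2))
files_of = {u: [i for i, g in enumerate(file_groups) if u in g] for u in range(1, 5)}
true_grad = {i: (float(i), 1.0) for i in range(len(file_groups))}

# Worker 4 is Byzantine and returns a distorted vector on every file it holds.
returned = {(u, i): (true_grad[i] if u != 4 else (-1.0, -1.0))
            for u in files_of for i in files_of[u]}

edges = agreement_graph(returned, files_of)
```

In this toy run the honest workers 1, 2, and 3 remain pairwise connected, while worker 4 loses all of its edges.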

V-A Aspis Detection Rule

In what follows, we suppress the iteration index $t$ since the Aspis algorithm is the same for each iteration. For the Aspis task assignment (cf. Section III-A1), any two workers, $U_{j_{1}}$ and $U_{j_{2}}$, have $\binom{K-2}{r-2}$ common files.

Let us index the $q$ adversaries in $A=\{A_{1},A_{2},\dots,A_{q}\}$ and the honest workers in $H$. We say that two workers $U_{j_{1}}$ and $U_{j_{2}}$ disagree if there is no edge between them in $\mathbf{G}$. The non-existence of an edge between $U_{j_{1}}$ and $U_{j_{2}}$ only means that they disagree in at least one of the $\binom{K-2}{r-2}$ files that they jointly participate in. For corrupting the gradients, each adversary has to disagree on the computations with a subset of the honest workers. An adversary may also disagree with other adversaries.

A clique in an undirected graph is defined as a subset of vertices with an edge between any pair of them. A maximal clique is one that cannot be enlarged by adding additional vertices to it. A maximum clique is one such that there is no clique with more vertices in the given graph. We note that the set of honest workers $H$ will pair-wise agree on all common tasks. Thus, $H$ forms a clique (of size $K-q$) within $\mathbf{G}$. The clique containing the honest workers may not be maximal. However, it will have a size of at least $K-q$. Let the maximum clique on $\mathbf{G}$ be $M_{\mathbf{G}}$. Any worker $U_{j}$ with $\deg(U_{j})<K-q-1$ will not belong to a maximum clique and can right away be eliminated as a “detected” adversary.

Input: Computed gradients $\hat{\mathbf{g}}_{t,i}^{(j)}$, $i=0,1,\dots,f-1$, $j=1,2,\dots,K$, redundancy $r$ and empty graph $\mathbf{G}$ with worker vertices $\mathcal{U}$.
1 for each pair $(U_{j_1},U_{j_2})$, $j_1\neq j_2$, of workers do
2       PS computes the number of agreements $\alpha^{(j_1,j_2)}$ of the pair $U_{j_1},U_{j_2}$ on the gradient value.
3       if $\alpha^{(j_1,j_2)}=\binom{K-2}{r-2}$ then
4             Connect vertex $U_{j_1}$ to vertex $U_{j_2}$ in $\mathbf{G}$.
5       end if
7 end for
9 PS enumerates all $k$ maximum cliques $M_{\mathbf{G}}^{(1)},M_{\mathbf{G}}^{(2)},\dots,M_{\mathbf{G}}^{(k)}$ in $\mathbf{G}$.
10 if there is a unique maximum clique $M_{\mathbf{G}}$ ($k=1$) then
11       PS determines the honest workers $H=M_{\mathbf{G}}$ and the adversarial machines $\hat{A}=\mathcal{U}-M_{\mathbf{G}}$.
12 else
13       PS declares unsuccessful detection.
14 end if
Algorithm 1 Proposed Aspis graph-based detection.
Input: Data set of $n$ samples, batch size $b$, computation load $l$, redundancy $r$, number of files $f$, maximum iterations $T$, file assignments $\{\mathcal{N}^w(U_i)\}_{i=1}^{K}$, robust estimator function $\widehat{\mathrm{med}}$.
1 The PS randomly initializes the model’s parameters to $\mathbf{w}_0$.
2 for $t=0$ to $T-1$ do
3       PS chooses a random batch $B_t\subseteq\{1,2,\dots,n\}$ of $b$ samples, partitions it into $f$ files $\{B_{t,i}\}_{i=0}^{f-1}$ and assigns them to workers according to $\{\mathcal{N}^w(U_i)\}_{i=1}^{K}$. It then transmits $\mathbf{w}_t$ to all workers.
4       for each worker $U_j$ do
5             if $U_j$ is honest then
6                   for each file $i\in\mathcal{N}^w(U_j)$ do
7                         $U_j$ computes the sum of gradients $\hat{\mathbf{g}}_{t,i}^{(j)}=\sum_{k\in B_{t,i}}\nabla l_k(\mathbf{w}_t)$.
8                   end for
10            else
11                  $U_j$ constructs $l$ adversarial vectors $\hat{\mathbf{g}}_{t,i_1}^{(j)},\hat{\mathbf{g}}_{t,i_2}^{(j)},\dots,\hat{\mathbf{g}}_{t,i_l}^{(j)}$.
12            end if
13            $U_j$ returns $\hat{\mathbf{g}}_{t,i_1}^{(j)},\hat{\mathbf{g}}_{t,i_2}^{(j)},\dots,\hat{\mathbf{g}}_{t,i_l}^{(j)}$ to the PS.
14      end for
15      PS runs a detection algorithm to identify the adversaries.
16      if detection is successful then
17            Let $H$ be the detected honest workers. Initialize a non-corrupted gradient set as $\mathcal{G}=\emptyset$.
18            for each file in $\{B_{t,i}\}_{i=0}^{f-1}$ do
19                  PS chooses the gradient of a worker in $\mathcal{N}^f(B_{t,i})\cap H$ (if non-empty) and adds it to $\mathcal{G}$.
20            end for
21            $\mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t\frac{1}{|\mathcal{G}|}\sum_{\mathbf{g}\in\mathcal{G}}\mathbf{g}$.
22      else
23            for each file in $\{B_{t,i}\}_{i=0}^{f-1}$ do
24                  PS determines the $r$ workers in $\mathcal{N}^f(B_{t,i})$ which have processed $B_{t,i}$ and computes $\mathbf{m}_i=\mathrm{majority}\left\{\hat{\mathbf{g}}_{t,i}^{(j)}:U_j\in\mathcal{N}^f(B_{t,i})\right\}$.
25            end for
26            PS updates the model via $\mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t\times\widehat{\mathrm{med}}\{\mathbf{m}_i:i=0,1,\dots,f-1\}$.
27      end if
29 end for
Algorithm 2 Proposed Aspis/Aspis+ aggregation protocol to alleviate Byzantine effects.

The essential idea of our detection is to run a clique-finding algorithm on $\mathbf{G}$ (summarized in Algorithm 1). The detection may be successful or unsuccessful depending on which attack is used; we discuss this in more detail shortly.
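The detection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout of `returned`, and the plain Bron-Kerbosch recursion (used here in place of the faster enumeration algorithm of [37]) are assumptions made for illustration. Files are taken to be all $r$-subsets of the workers, and the toy adversaries distort every file with mutually different values.

```python
from itertools import combinations
from math import comb


def detection_graph(returned, K, r):
    """Adjacency sets: edge (a, b) iff a and b agree on all shared files."""
    agreements = {pair: 0 for pair in combinations(range(1, K + 1), 2)}
    for votes in returned.values():          # votes: worker -> reported value
        for a, b in combinations(sorted(votes), 2):
            if votes[a] == votes[b]:
                agreements[(a, b)] += 1
    adj = {u: set() for u in range(1, K + 1)}
    for (a, b), alpha in agreements.items():
        if alpha == comb(K - 2, r - 2):      # agreement on every common file
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximum_cliques(adj):
    """Enumerate maximal cliques (basic Bron-Kerbosch), keep the largest."""
    maximal = []

    def expand(R, P, X):
        if not P and not X:
            maximal.append(R)
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    best = max(len(c) for c in maximal)
    return [c for c in maximal if len(c) == best]


def detect(returned, K, r):
    """Return (honest, adversarial), or None when detection is ambiguous."""
    cliques = maximum_cliques(detection_graph(returned, K, r))
    if len(cliques) != 1:
        return None
    honest = cliques[0]
    return honest, set(range(1, K + 1)) - honest


# Toy run (hypothetical values): K = 7, r = 3, adversaries {1, 2, 3}.
K, r, adversaries = 7, 3, {1, 2, 3}
returned = {
    file: {w: ("bad", w) if w in adversaries else "true" for w in file}
    for file in combinations(range(1, K + 1), r)
}
result = detect(returned, K, r)
```

In this toy run the honest workers form the only maximum clique, so `detect` returns them together with the flagged adversaries; when several maximum cliques exist it returns `None`, mirroring the unsuccessful-detection branch.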

We note that clique-finding is well-known to be an NP-complete problem [35]. Nevertheless, there are fast, practical algorithms with excellent performance on graphs of up to hundreds of nodes [36, 37]. Specifically, the authors of [37] have shown that their proposed algorithm, which enumerates all maximal cliques, has complexity similar to that of other methods [38, 39], which are used to find a single maximum clique. We utilize this algorithm. Our extensive experimental evidence suggests that clique-finding is not a computational bottleneck for the size and structure of the graphs that Aspis uses. We have experimented with clique-finding on a graph of $K=100$ workers and $r=5$ for different values of $q$; in all cases, enumerating all maximal cliques took no more than 15 milliseconds. These experiments and the asymptotic complexity of the entire protocol are addressed in Supplement Section XI-A.

During aggregation (see Algorithm 2), the PS will perform a majority vote across the computations of each file (implementation details in Supplement Section XI-B). Recall that $r$ workers have processed each file. For each such file $B_{t,i}$, the PS decides a majority value $\mathbf{m}_i$

$$\mathbf{m}_i:=\mathrm{majority}\left\{\hat{\mathbf{g}}_{t,i}^{(j)}:U_j\in\mathcal{N}^f(B_{t,i})\right\}. \tag{6}$$

Assume that $r$ is odd and let $r'=\frac{r+1}{2}$. Under the rule in Eq. (6), the gradient on a file is distorted only if at least $r'$ of the computations are performed by Byzantines. Following the majority vote, we will further filter the gradients using a robust estimator $\widehat{\mathrm{med}}$ (see Algorithm 2, line 26). This robust estimator is either the coordinate-wise median or the geometric median; a similar setup was considered in [14, 16]. For example, in Figure 1, all returned values for the red file will be evaluated by a majority vote function on the PS, which decides a single output value; a similar vote is held for the other 3 files. After the voting process, Aspis applies the robust estimator $\widehat{\mathrm{med}}$ on the “winning” gradients $\mathbf{m}_i$, $i=0,1,\dots,f-1$.
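As a rough illustration of this two-stage filtering, the following Python sketch applies a per-file majority vote and then a coordinate-wise median. The helper names and the toy gradients are hypothetical; gradients are represented as tuples so that exact equality can be counted, and the geometric median variant is not shown.

```python
from collections import Counter
from statistics import median


def majority(votes):
    """Most common gradient among the r copies of one file."""
    return Counter(votes).most_common(1)[0][0]


def coordinate_wise_median(vectors):
    """Median taken independently along each coordinate."""
    return tuple(median(coords) for coords in zip(*vectors))


# Hypothetical returns for f = 3 files, r = 3 copies each, d = 2 coordinates.
returned = [
    [(1.0, 2.0), (1.0, 2.0), (9.0, 9.0)],    # one Byzantine copy, outvoted
    [(1.2, 1.8), (1.2, 1.8), (1.2, 1.8)],    # all copies agree
    [(7.0, -5.0), (7.0, -5.0), (0.8, 2.1)],  # two Byzantine copies win the vote
]
winners = [majority(v) for v in returned]
update_direction = coordinate_wise_median(winners)
```

The third file is distorted because $r'=2$ of its copies are adversarial, yet the median over the three winning gradients still returns a value taken from the honest files.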

Figure 2: Detection graph $\mathbf{G}$ for $K=7$ workers among which $U_1$, $U_2$, and $U_3$ are the adversaries. (a) Unique max-clique, detection succeeds. (b) Two max-cliques, detection fails.

V-A1 Defense Strategy Against ATT-1

Under ATT-1, a Byzantine node will disagree with all $K-q$ honest nodes (by the assumption in Section IV-A); thus, the degree of the node in $\mathbf{G}$ will be at most $q-1<K-q-1$, and it will not be part of the maximum clique. Hence, each of the adversaries will be detected, and their returned gradients will not be considered further. The algorithm declares the (unique) maximum clique as honest and proceeds to aggregation. In particular, assume that $h$ workers $U_{i_1},U_{i_2},\dots,U_{i_h}$ have been identified as honest. For each of the $f$ files, if at least one honest worker processed it, the PS will pick one of the “honest” gradient values. The chosen gradients are then averaged for the update (cf. Eq. (2)). For instance, in Figure 1, assume that $U_1$, $U_2$, and $U_4$ have been identified as faulty. During aggregation, the PS will ignore the red file as all 3 copies have been compromised. For the orange file, it will pick either the gradient computed by $U_5$ or $U_6$ as both of them are “honest.” The only files that can be distorted in this case are those processed exclusively by adversarial nodes.

Figure 2(a) (corresponding to Example 1) shows an example where in a cluster of size $K=7$, the $q=3$ adversaries are $A=\{U_1,U_2,U_3\}$ and the remaining workers are honest with $H=\{U_4,U_5,U_6,U_7\}$. In this case, the unique maximum clique is $M_{\mathbf{G}}=H$, and detection is successful. Under this attack, the distorted files are those for which all copies have been compromised, i.e., $c^{(q)}=\binom{q}{r}$.
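The count $c^{(q)}=\binom{q}{r}$ can be checked by direct enumeration. The short Python sketch below assumes, as in the subset assignment, that the files are all $r$-subsets of the $K$ workers; the value $r=3$ and the variable names are chosen here for illustration.

```python
from itertools import combinations
from math import comb

K, r = 7, 3
adversaries = {1, 2, 3}

# One file per r-subset of workers (subset assignment).
files = list(combinations(range(1, K + 1), r))

# Under ATT-1 with successful detection, a file stays distorted only if
# every worker that processed it is adversarial.
distorted = [f for f in files if set(f) <= adversaries]
```

Here a single file out of 35, the one handled by $U_1$, $U_2$, and $U_3$, has no honest copy.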

V-A2 Defense Strategy Against ATT-2 (Robust Aggregation)

Let $D_i$ denote the set of disagreement workers for adversary $A_i$, $i=1,2,\dots,q$, where $D_i$ can contain members from $A$ and from $H$. If the attack ATT-2 is used on Aspis, upon the formation of $\mathbf{G}$ we know that a worker $U_j$ will be flagged as adversarial if $\deg(U_j)<K-q-1$. Since the degree of adversary $A_i$ is $K-1-|D_i|$, a necessary condition to avoid detection is that $|D_i|\leq q$.

We now upper bound the number of files that can be corrupted under any possible strategy employed by the adversaries. Note that according to Algorithm 2, we resort to robust aggregation in case of more than one maximum clique in $\mathbf{G}$. In this scenario, a gradient can only be corrupted if a majority of the assigned workers computing it are adversarial and agree on a wrong value. The proof of the following theorem appears in Appendix Section X-A.

Theorem 1.

Consider a training cluster of $K$ workers with $q$ adversaries using the algorithm in Section III-A1 to assign the $f=\binom{K}{r}$ files to workers, and Algorithm 1 for adversary detection. Under any adversarial strategy, the maximum number of files that can be corrupted is

$$c_{\mathrm{max}}^{(q)}=\frac{1}{2}\binom{2q}{r}. \tag{7}$$

Furthermore, this upper bound can be achieved if all adversaries fix a set $D\subset H$ of honest workers with which they will consistently disagree on the gradient (by distorting it).

Remark 3.

We emphasize that the maximum fraction of corrupted gradients $c_{\mathrm{max}}^{(q)}/f$ is much smaller than the baseline $q/K$ and than that of other schemes as well (details in Sec. VII). For instance, for $K=15$, $r=3$, and $q=3$, at most a 0.022 fraction of the gradients is corrupted in Aspis, as against 0.2 for the baseline scheme.
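The numbers in this remark follow directly from Eq. (7) and $f=\binom{K}{r}$. A small Python sketch of the arithmetic is given below; the function names are ours, and $r=3$ is assumed to match the cluster considered in the experiments.

```python
from math import comb


def aspis_max_fraction(K, q, r):
    """c_max^(q) / f with c_max^(q) = (1/2) * C(2q, r) and f = C(K, r)."""
    return comb(2 * q, r) / 2 / comb(K, r)


def baseline_fraction(K, q):
    """No redundancy: each of the q adversaries corrupts one of K gradients."""
    return q / K


aspis = aspis_max_fraction(K=15, q=3, r=3)   # 10 / 455
baseline = baseline_fraction(K=15, q=3)      # 3 / 15
```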

In Appendix Section X-A we show that under ATT-2 there is bound to be more than one maximum clique in the detection graph. Thus, the PS cannot unambiguously decide which one is the honest one; detection fails and we fall back to the robust aggregation technique.

An example is shown in Figure 2(b) for the setup of Example 1. The adversaries $A=\{U_1,U_2,U_3\}$ consistently disagree with the workers in $D=\{U_4,U_5,U_6\}\subset H$. The ambiguity as to which of the two maximum cliques ($\{U_1,U_2,U_3,U_7\}$ or $\{U_4,U_5,U_6,U_7\}$) is the honest one makes an accurate detection impossible; robust aggregation will be performed instead.
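The ambiguity in this example can be reproduced with a brute-force search, which is affordable for $K=7$. The sketch below is illustrative only and assumes that the adversaries agree with each other and with $U_7$, so that the only missing edges are those between $A$ and $D$.

```python
from itertools import combinations

K = 7
A, D = {1, 2, 3}, {4, 5, 6}
workers = range(1, K + 1)

# Complete graph minus every adversary-to-D edge.
edges = {
    frozenset(p)
    for p in combinations(workers, 2)
    if not ((p[0] in A and p[1] in D) or (p[1] in A and p[0] in D))
}


def is_clique(nodes):
    return all(frozenset(p) in edges for p in combinations(nodes, 2))


# Try the largest sizes first; stop at the first size that admits a clique.
for size in range(K, 0, -1):
    cliques = [set(c) for c in combinations(workers, size) if is_clique(c)]
    if cliques:
        break
```

The search stops at size 4 with two candidates, so the PS has no way of telling which clique is honest.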

V-B Motivation for Aspis+

Our motivation for proposing Aspis+ originates in the limitations of the subset assignment of Aspis. It is evident from the experimental results in Section VIII-B that Aspis is more suitable for worst-case attacks where the adversaries collude and distort the maximum number of tasks in an undetected fashion; in this case, the accuracy gap between Aspis and prior methods is maximal. Aspis does not perform as well under weaker attacks such as the reversed gradient attack (cf. Figures 8(a), 8(b), 8(c)) even though it achieves a much smaller distortion fraction $\epsilon$, as discussed in Section VII. This can be attributed to the fact that the number of tasks is $\binom{K}{r}$; even for the considered cluster of $K=15$, $r=3$, it would require splitting the batch into 455 files, and hence the batch size must be a multiple of 455. There is significant evidence that large batch sizes can hurt generalization and make the model converge slowly [31, 40, 41]. Some workarounds have been proposed to solve this problem. For instance, the work of [41] uses layer-wise adaptive rate scaling to update each layer using a different learning rate. The authors of [42] perform implicit regularization using warmup and cosine annealing to tune the learning rate as well as gradient clipping. However, these methods require training for a significantly larger number of epochs. For the above reasons, we have extended our work and proposed Aspis+ to handle weaker Byzantine failures (cf. ATT-3) while requiring a much smaller batch size.

Input: Computed gradients $\hat{\mathbf{g}}_{t,i}^{(j)}$, $i=0,1,\dots,f-1$, $j=1,2,\dots,K$, $2$-$(v,k,\lambda)$ design, length of detection window $T_d$, maximum iterations $T$.
1 for $t=0$ to $T-1$ do
2       Let $t'=t\ (\mathrm{mod}\ T_d)+1$.
3       if $t'=1$ then
4             Set $\mathbf{G}$ as the complete graph with worker vertices $\mathcal{U}$.
5             $\forall j_1,j_2$, set $\alpha^{(j_1,j_2)}=0$.
6       end if
7       for each pair $(U_{j_1},U_{j_2})$, $j_1\neq j_2$, of workers do
8             PS computes the number of agreements $\alpha_t^{(j_1,j_2)}$ of the pair $U_{j_1},U_{j_2}$ on the gradient value.
9             Update $\alpha^{(j_1,j_2)}=\alpha^{(j_1,j_2)}+\alpha_t^{(j_1,j_2)}$.
10      end for
11      for each pair $(U_{j_1},U_{j_2})$, $j_1\neq j_2$, of workers do
12            if $\alpha^{(j_1,j_2)}<\lambda\times t'$ then
13                  Remove edge $(U_{j_1},U_{j_2})$ from $\mathbf{G}$.
14            end if
16      end for
17      for each worker $U_j\in\mathcal{U}$ do
18            if $\deg(U_j)<K-q-1$ then
19                  $\hat{A}=\hat{A}\cup\{U_j\}$.
20            end if
22      end for
23      if $|\hat{A}|>q$ then
24            Set $\hat{A}$ to be the $q$ most recently detected Byzantines.
25      end if
27 end for
Algorithm 3 Proposed Aspis+ graph-based detection.

V-C Aspis+ Detection Rule

The principal intuition of the Aspis+ detection approach (used for ATT-3) is to iteratively keep refining the graph $\mathbf{G}$ in which the edges encode the agreements of workers during consecutive and non-overlapping windows of $T_d$ iterations. At the beginning of each such window, the PS will reset $\mathbf{G}$ to be a complete graph, i.e., as if all workers pairwise agree with each other. Then, it will gradually remove edges from $\mathbf{G}$ as disagreements between the workers are observed; hence, the graph will be updated at each of the $T_d$ iterations of the window, and the PS will assume that the Byzantine set does not change within a detection window. In practice, as we do not know the “Byzantine window,” we will not assume an alignment between the two kinds of windows, and we will set $T_d\neq T_b$ for our experiments. The detection method will be the same for all detection windows; thus, we will analyze the process in one window of $T_d$ steps.

For a detection window, let us encode the agreement of workers $U_{j_1},U_{j_2}$ on common file $i$ during the current iteration $t$ of the window, $t=1,2,\dots,T_d$, as

$$\alpha_{t,i}^{(j_1,j_2)}:=\begin{cases}1 & \text{if } \hat{\mathbf{g}}_{t,i}^{(j_1)}=\hat{\mathbf{g}}_{t,i}^{(j_2)},\\ 0 & \text{otherwise}.\end{cases} \tag{8}$$

Across all files, the total number of agreements between a pair of workers $U_{j_1},U_{j_2}$ during the $t^{\mathrm{th}}$ iteration is denoted by

$$\alpha_t^{(j_1,j_2)}:=\sum_{i\in\mathcal{N}_t^w(U_{j_1})\cap\mathcal{N}_t^w(U_{j_2})}\alpha_{t,i}^{(j_1,j_2)}. \tag{9}$$

Assume that the current iteration of the window is indexed with $t'\in\{1,2,\dots,T_d\}$. The PS will collect all agreements for each pair of workers $U_{j_1},U_{j_2}$ up until the current iteration as

$$\alpha^{(j_1,j_2)}:=\sum_{t=1}^{t'}\alpha_t^{(j_1,j_2)}. \tag{10}$$

Since the placement is known, the PS can always perform the above computation. Next, it will examine the agreements and update $\mathbf{G}$ as necessary.

Based on the task placement (cf. Section III-A2), an edge $(U_{j_1},U_{j_2})$ exists in $\mathbf{G}$ only if the computed gradients of $U_{j_1}$ and $U_{j_2}$ match in all their $\lambda$ common groups in all iterations up to the current one indexed with $t'$, i.e., a pair $U_{j_1},U_{j_2}$ needs to have $\alpha^{(j_1,j_2)}=\lambda\times t'$ for an edge $(U_{j_1},U_{j_2})$ to be in $\mathbf{G}$. If this is not the case, the edge $(U_{j_1},U_{j_2})$ will be removed from $\mathbf{G}$. After all such edges are examined, detection is done using degree counting. Given that there are $q$ Byzantines in the cluster, after examining all pairs of workers and determining the form of $\mathbf{G}$, a worker $U_j$ will be flagged as Byzantine if $\deg(U_j)<K-q-1$. Based on Eq. (10), it is not hard to see that such workers can be eliminated and their gradients will not be considered again until the last iteration of the current window. The only exception to this is if the Byzantine set changes before the end of the current detection window. This is possible due to a potential misalignment between the “Byzantine window” and the detection window (recall that $T_d\neq T_b$ is assumed to avoid trivialities). In this case, more than $q$ workers may be detected as Byzantines; the PS will, by convention, choose $\hat{A}$ to be the $q$ most recently detected Byzantines. Algorithm 3 summarizes the detection protocol. Following detection, the PS will act as follows. If at least one Byzantine has been detected, it will ignore the votes of detected Byzantines, and for each group, if there is at least one “honest” vote, it will use this as the output of the majority voting group; also, if a group consists merely of detected Byzantines, it will completely ignore the group. The remaining groups will go through robust aggregation (as in Section V-A). In our experiments in Section VIII-C, all Byzantines are detected successfully in at most 5 iterations. Example 3 showcases the utility of permutations in our detection algorithm using $K=7$ workers.

Example 3.

We will use the assignment of Example 2 with $K=7$ workers $\mathcal{U}=\{1,2,\dots,7\}$ assigned to tasks according to a $2$-$(7,3,1)$ design (the Fano plane). Let us denote the assignment of workers to groups (blocks of the design) during the $t^{\mathrm{th}}$ iteration by $\mathcal{A}_t$, initially equal to $\mathcal{A}_1=\{\{1,2,3\},\{1,4,7\},\{2,4,6\},\{3,4,5\},\{2,5,7\},\{1,5,6\},\{3,6,7\}\}$. For the windows, assume that $T_d>2$ and $T_b>2$. Also, let $q=2$ and the Byzantine set be $A=\{U_1,U_2\}$. Based on ATT-3, workers $U_1,U_2$ are in the majority within a group in which they disagree with worker $U_3$. After the first permutation, a possible assignment is $\mathcal{A}_2=\{\{1,3,6\},\{3,4,7\},\{2,4,6\},\{1,4,5\},\{5,6,7\},\{2,3,5\},\{1,2,7\}\}$. Then, $U_1,U_2$ are in the same group as the honest $U_7$ with which they disagree; hence, $\deg(U_1)=\deg(U_2)=4=K-q-1$, and neither of them can afford to disagree with more honest workers if it is to remain undetected. However, if the next permutation assigns the workers as $\mathcal{A}_3=\{\{1,3,6\},\{1,4,7\},\{4,6,5\},\{2,3,4\},\{2,6,7\},\{1,2,5\},\{3,5,7\}\}$, then the adversaries will cast a different vote than $U_5$ as well. Both of them will be detected after only three iterations.
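The three iterations of this example can be traced with a short Python simulation of the degree-based rule. It is only a sketch under stated assumptions: the Byzantines are assumed to distort (with a common value) exactly those groups in which they hold a majority and to report the true gradient elsewhere, and all variable names are ours.

```python
from itertools import combinations

K, q, lam = 7, 2, 1
byzantines = {1, 2}
assignments = [
    [{1, 2, 3}, {1, 4, 7}, {2, 4, 6}, {3, 4, 5}, {2, 5, 7}, {1, 5, 6}, {3, 6, 7}],
    [{1, 3, 6}, {3, 4, 7}, {2, 4, 6}, {1, 4, 5}, {5, 6, 7}, {2, 3, 5}, {1, 2, 7}],
    [{1, 3, 6}, {1, 4, 7}, {4, 6, 5}, {2, 3, 4}, {2, 6, 7}, {1, 2, 5}, {3, 5, 7}],
]

workers = range(1, K + 1)
agree = {frozenset(p): 0 for p in combinations(workers, 2)}
edges = set(agree)                 # complete graph at the start of the window
detected_at = {}                   # worker -> window iteration of detection

for t_prime, blocks in enumerate(assignments, start=1):
    for block in blocks:
        bad = block & byzantines
        distort = len(bad) > len(block) / 2      # Byzantine majority in group
        votes = {w: "bad" if (distort and w in bad) else "true" for w in block}
        for a, b in combinations(block, 2):
            agree[frozenset((a, b))] += int(votes[a] == votes[b])
    # Keep an edge only if the pair agreed on all lam common groups so far.
    edges = {e for e in edges if agree[e] >= lam * t_prime}
    for w in workers:
        degree = sum(1 for e in edges if w in e)
        if degree < K - q - 1 and w not in detected_at:
            detected_at[w] = t_prime
```

After the second iteration both adversaries sit exactly at degree $K-q-1=4$; the third assignment removes one more edge from each, and both are flagged.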

Remark 4.

Using a $2$-$(v,3,\lambda)$ design, i.e., a design with $k=3$ (a typical value for the redundancy), to assign the files on a cluster with $q$ Byzantines, the maximum fraction of files one can distort is $\lambda\binom{q}{2}/|\mathcal{B}|$ [33], where $|\mathcal{B}|$ is the total number of files; this is attained when each of the $\binom{q}{2}$ possible pairs of Byzantines appears together in distinct blocks and distorts the corresponding files. In Aspis+, the focus is on weak attacks, and determining the worst-case choice of adversaries that maximizes the number of distorted files is beyond the scope of our work.
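For a sense of scale, the expression in this remark can be evaluated with the standard block count of a $2$-$(v,k,\lambda)$ design, $|\mathcal{B}|=\lambda v(v-1)/(k(k-1))$. The Python sketch below uses function names of our own choosing and simply evaluates the formula; it does not check whether a given choice of Byzantines attains it.

```python
from math import comb


def num_blocks(v, k, lam):
    """Number of blocks |B| of a 2-(v, k, lam) design."""
    return lam * v * (v - 1) // (k * (k - 1))


def max_distortion_fraction(v, lam, q, k=3):
    """lam * C(q, 2) / |B| for redundancy k = 3."""
    return lam * comb(q, 2) / num_blocks(v, k, lam)


fano_blocks = num_blocks(7, 3, 1)                 # Fano plane of Example 3
design_blocks = num_blocks(15, 3, 1)              # 2-(15,3,1) design
fraction = max_distortion_fraction(15, 1, q=3)    # 3 pairs over 35 blocks
```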

Figure 3: Distortion fraction of optimal and weak attacks for $(K,r)=(50,3)$ and comparison. (a) Optimal attacks. (b) Weak attacks.

VI Convergence Results and Experiments under Setting-II

In this section, we operate under Setting-II (cf. Section III). By leveraging the work of Chen et al. [13] we demonstrate that our training algorithm converges to the optimal point.

We assume that the data samples are distributed i.i.d. from some unknown distribution $\mu$. We are interested in finding $\mathbf{w}^*$ that minimizes $L(\mathbf{w})=\mathbb{E}(l_1(\mathbf{w}))$ over $\mathbf{w}\in\mathcal{W}$; here the expectation is over the distribution $\mu$ and the $l_i(\mathbf{w})$’s are distributed i.i.d. as well. In general, since the distribution is unknown, $\mathbb{E}(l_1(\mathbf{w}))$ cannot be computed and we instead minimize the empirical loss function given by $\hat{L}(\mathbf{w})=\frac{1}{n}\sum_{i=1}^{n}l_i(\mathbf{w})$. We need the following additional assumptions.

In the discussion below, we say that a random vector $\mathbf{z}$ is sub-exponential with sub-exponential norm $K$ if for every unit vector $v$, $\mathbf{z}^{\top}v$ is a sub-exponential random variable with sub-exponential norm at most $K$, i.e., $\sup_{v:\|v\|\leq 1}\Pr(|\mathbf{z}^{\top}v|>t)\leq\exp(-t/K)$ [43, Sec 2.7]. To keep notation simple, we reuse the letter $C$ to denote different numerical constants in each use. This practice is common when working with classes of distributions such as sub-exponential.

  • The minimization of $\hat{L}(\mathbf{w})$ is performed by using Aspis or Aspis+ along with gradient descent (cf. Eq. (1)). This means that in Algorithm 2, the batch size $b=n$ for all iterations (cf. discussion after Eq. (2)). The robust estimator $\widehat{\mathrm{med}}$ is the geometric median.

  • The function $L(\mathbf{w})$ is $\beta$-strongly convex, and differentiable with respect to $\mathbf{w}$ with $\tilde{M}$-Lipschitz gradient. This means that for all $\mathbf{w}$ and $\mathbf{w}'$ we have

    $$L(\mathbf{w}')\geq L(\mathbf{w})+\nabla L(\mathbf{w})^{T}(\mathbf{w}'-\mathbf{w})+\frac{\beta}{2}\|\mathbf{w}-\mathbf{w}'\|^{2},\text{ and}$$
    $$\|\nabla L(\mathbf{w})-\nabla L(\mathbf{w}')\|\leq\tilde{M}\|\mathbf{w}-\mathbf{w}'\|.$$

  • The random vectors $\nabla l_i(\mathbf{w})$ for $i=1,\dots,n$ are sub-exponential with sub-exponential norm $C$. This assumption ensures that $\frac{1}{n}\sum_{i=1}^{n}\nabla l_i(\mathbf{w}^*)$ concentrates around its mean $\nabla L(\mathbf{w}^*)=0$.

  • Let $h_i(\mathbf{w})=\nabla l_i(\mathbf{w})-\nabla l_i(\mathbf{w}^*)$. For $i=1,\dots,n$, the random vectors $h_i(\mathbf{w})$ are sub-exponential with sub-exponential norm $C\|\mathbf{w}-\mathbf{w}^*\|$.

  • For any $\delta\in(0,1)$ there exists $\tilde{M}'$ (dependent on $n$ and $\delta$) that is non-increasing in $n$ such that $\hat{L}(\mathbf{w})$ is $\tilde{M}'$-smooth with high probability, i.e.,

    $$P\left(\sup_{\mathbf{w},\mathbf{w}'\in\mathcal{W}:\mathbf{w}\neq\mathbf{w}'}\frac{\left\|\frac{1}{n}\sum_{i=1}^{n}(\nabla l_i(\mathbf{w})-\nabla l_i(\mathbf{w}'))\right\|}{\|\mathbf{w}-\mathbf{w}'\|}\leq\tilde{M}'\right)\geq 1-\frac{\delta}{3}.$$

    Here $\mathcal{W}$ is the feasible parameter set.

For Aspis, Theorem 1 guarantees an upper bound on the fraction of corrupted gradients regardless of what attack is used. In particular, treating the majority logic and clique finding as a pre-processing step, we arrive at a set of $f$ files, at most $c^{(q)}_{\max}$ (cf. Theorem 1) of which are “arbitrarily” corrupted. At this point, the PS applies the robust estimator $\widehat{\mathrm{med}}$ (the geometric median) and uses it to perform the update step. We can leverage Theorem 5 of [13] to obtain the following result, where $d$ is the length of the parameter vector and, for $p_i\in(0,1)$, $i=1,2$, the quantity $D(p_1\|p_2)=p_1\log_2\left(\frac{p_1}{p_2}\right)+(1-p_1)\log_2\left(\frac{1-p_1}{1-p_2}\right)$.

Theorem 2.

(adapted from [13]) Suppose that $\beta,\tilde{M}$ are all constants and $\log\tilde{M}'=\mathcal{O}(\log d)$. Assume that $\mathcal{W}\subset\{\mathbf{w}:\|\mathbf{w}-\mathbf{w}^*\|\leq\tilde{r}\sqrt{d}\}$ for positive $\tilde{r}$ such that $\log\tilde{r}=\mathcal{O}(d\log(n/f))$ and $2(1+\epsilon)c^{(q)}_{\max}\leq f$. Fix any $\alpha\in(c^{(q)}_{\max}/f,1/2)$ and any $\delta>0$ such that $\delta\leq\alpha-c^{(q)}_{\max}/f$ and $\log(1/\delta)=\mathcal{O}(d)$. There exist universal constants $c_1,c_2$ such that if

$$\frac{n}{f}\geq c_1C_{\alpha}^{2}d\log(n/f),$$

then with probability at least $1-\exp\left(-fD\left(\alpha-\frac{c^{(q)}_{\max}}{f}\,\middle\|\,\delta\right)\right)$, for all $t\geq 1$, the iterates of our algorithm with $\eta=\beta/(2\tilde{M}^{2})$ satisfy

$$\|\mathbf{w}_t-\mathbf{w}^*\|\leq\left(\frac{1}{2}+\frac{1}{2}\sqrt{1-\frac{\beta^{2}}{4\tilde{M}^{2}}}\right)^{t}\|\mathbf{w}_0-\mathbf{w}^*\|+c_2\sqrt{\frac{df}{n}}. \tag{11}$$

An instance of a problem that satisfies the assumptions presented above is the linear regression problem. Formally, the data set consists of $n$ vectors $\{\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_n\}$, where $\mathbf{x}_i\in\mathbb{R}^{d}$ for all $i$. We construct the data matrix $X$ of size $n\times d$ using these vectors as its rows. The $n$ labels corresponding to the data points are computed as follows: $\mathbf{y}=X\mathbf{w}$, where $\mathbf{w}$ denotes the parameter set. For this problem, our loss function is the least-squares loss, i.e., we have $l_i(\mathbf{w})=\frac{1}{2}(\mathbf{y}_i-\mathbf{x}_i^{T}\mathbf{w})^{2}$ for $i=1,\dots,n$, where $\mathbf{x}_i^{T}$ denotes the $i^{\mathrm{th}}$ row of $X$.

VI-A Numerical Experiments

We use the GD algorithm (1) with the initial randomly chosen parameter vector $\mathbf{w}_0\sim\mathcal{N}(\mathbf{0}_d,I_d)$. We partition the data matrix $X$ row-wise into $f$ submatrices $X_1,X_2,\dots,X_f$, and correspondingly the label vector $\mathbf{y}$ into $f$ sub-vectors $\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_f$, where $f$ is the number of files of the distributed algorithm. A file $B_{t,i}$ consists of a pair $(X_i,\mathbf{y}_i)$. For each of its assigned files $B_{t,i}=(X_i,\mathbf{y}_i)\in\mathcal{N}^{w}(U_j)$, worker $U_j$ either computes the honest partial gradient or a distorted value and returns it to the PS. Using the formulation of Section IV, the gradient in Eq. (3) for linear regression is given by $\mathbf{g}_{t,i}=X_i^{T}X_i\mathbf{w}-X_i^{T}\mathbf{y}_i$.
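The per-file gradient $X_i^{T}X_i\mathbf{w}-X_i^{T}\mathbf{y}_i=X_i^{T}(X_i\mathbf{w}-\mathbf{y}_i)$ can be sketched in plain Python as follows. The helper name and the tiny integer data set are made up for illustration and are unrelated to the experiments reported below.

```python
def file_gradient(X_i, y_i, w):
    """X_i^T (X_i w - y_i) for one file, using plain Python lists."""
    residual = [
        sum(x * wj for x, wj in zip(row, w)) - y for row, y in zip(X_i, y_i)
    ]
    return [
        sum(row[j] * res for row, res in zip(X_i, residual))
        for j in range(len(w))
    ]


# Toy data (hypothetical): n = 4 samples, d = 2, labels y = X w_true.
X = [[1, 0], [0, 1], [1, 1], [2, 1]]
w_true = [1, 2]
y = [sum(x * wj for x, wj in zip(row, w_true)) for row in X]

# Row-wise split into f = 2 files of two samples each.
files = [(X[:2], y[:2]), (X[2:], y[2:])]

w = [0, 0]
partial = [file_gradient(X_i, y_i, w) for X_i, y_i in files]
full = file_gradient(X, y, w)
```

Summing the honest per-file gradients recovers the full-batch gradient, which is what the PS relies on once distorted copies have been filtered out.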

Metrics: For each scheme and value of $q$, we run multiple Monte Carlo simulations and calculate the average least-squares loss that each algorithm converges to across the Monte Carlo simulations. For each simulation, we declare convergence if the final empirical loss is less than 0.1. We record the fraction of experiments that converged and the rate of convergence. In computing the average loss, the experiments that did not converge are not taken into account (for more details, please see Supplement Section XI-C).
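One way to compute these two summary statistics is sketched below in Python. The function name and the example losses are hypothetical; only the 0.1 threshold comes from the text.

```python
def summarize(final_losses, threshold=0.1):
    """Fraction of converged runs and mean loss over converged runs only."""
    converged = [loss for loss in final_losses if loss < threshold]
    fraction = len(converged) / len(final_losses)
    average = sum(converged) / len(converged) if converged else None
    return fraction, average


# Hypothetical final losses of four Monte Carlo runs; the third one diverged.
fraction, average = summarize([0.01, 0.03, 5.0, 0.02])
```

When no run converges, the sketch reports `None` for the average rather than a number, since there is nothing to average over.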

VI-A1 Experiment Setup

In our experiments, we set $n=50000$ and $d=100$, while our cluster consists of $K=15$ workers. All replication-based schemes use $r=3$. For Aspis+, we considered a $2$-$(15,3,1)$ design [33].

The geometric median is available as a Python library [44]. Initially, we tuned the learning rate for each scheme and each distortion method to decide the one to use for the Monte Carlo simulations; all learning rates $10^{-x}$, $x\in\{1,2,\dots,6\}$, have been tested. Also, we fix the random seeds of our experiments to be the same across all schemes; this guarantees that the data matrix as well as the original model estimate $\mathbf{w}_0$ will be the same across all methods. At the beginning of the algorithm, all elements of $X$ and $\mathbf{w}$ are generated randomly according to a $\mathcal{N}(0,1)$ distribution. For all runs, we chose to terminate the algorithm when the norm of the gradient is less than $10^{-10}$ or the algorithm has reached a maximum number of 2000 iterations. Our code is available online at https://github.com/kkonstantinidis/Aspis.

VI-A2 Results

The first set of experiments is for the strong attack ATT-2. The baseline scheme where the geometric median is applied on all $K$ gradients returned by the workers is referred to as GeoMed, and it has no redundancy. DETOX, Aspis, and Aspis+ use the geometric median as part of the robust aggregation. Under reversed gradient (see Figure 4(a)), it is clear that all schemes perform well and achieve similar loss for $q=2,4$ Byzantines. Nevertheless, the baseline geometric median needed at least 100 iterations to converge, while the redundancy-based schemes have a faster convergence rate. However, the situation is very different for $q=6$, where Aspis converges within 30 iterations. In contrast, the DETOX loss diverged in all 100 simulations, while the convergence rate of the baseline scheme is much slower.

For the constant attack (see Figure 4(b)), however, the relative performance of the baseline scheme and Aspis is reversed, i.e., the baseline scheme has a faster convergence rate than Aspis. Moreover, Aspis and DETOX have roughly similar convergence rates for $q=2,4$. As before, for $q=6$, none of the DETOX simulations converged.

Our last set of results is for the ALIE attack and is reported in Figure 6. As the baseline geometric median simulations converged much more slowly than the other schemes, they could not fit properly into the figure; they achieved a final loss approximately equal to $2.07\times 10^{-28}$, $9.71\times 10^{-28}$, and $1.77\times 10^{-28}$ for $q=2$, $4$, and $6$, respectively. On the other hand, Aspis converged to $1.69\times 10^{-23}$ in 30 iterations for $q=6$. For this attack, all schemes converged to very low loss values in all simulation runs. Nevertheless, as is evident from Figure 6, both Aspis and DETOX converge to loss values less than $10^{-5}$ within 15 iterations.

Figure 4: Linear regression least-squares loss, optimal attacks, ATT-2 (Aspis), geometric median defenses, $K=15$. (a) Reversed gradient distortion. (b) Constant distortion.
Figure 5: Linear regression least-squares loss, random Byzantines (ATT-3), geometric median defenses, $K=15$. (a) Reversed gradient distortion. (b) Constant distortion.
Figure 6: Linear regression least-squares loss, ALIE distortion, optimal attacks, ATT-2 (Aspis), geometric median defenses, $K=15$.

Another experiment we performed compares our two proposed methods, Aspis and Aspis+. For both schemes, we generate a new random Byzantine set $A$ every $T_b=50$ iterations (introduced as ATT-3 in Section IV-C), while the detection window for Aspis+ is of length $T_d=15$. For a comparable attack, we use ATT-1 on Aspis (cf. Section IV-A), i.e., all adversaries distort all their assigned files. We compare the two schemes under the reversed gradient attack in Figure 5(a) and under the constant attack in Figure 5(b). Both methods achieve a low final loss on the order of $10^{-20}$ or lower; Aspis converged to lower losses on the order of $10^{-24}$ in approximately 280 iterations in all cases. Nevertheless, Aspis+ achieves a faster convergence rate, which aligns with the fact that it is mostly suitable for weaker adversaries.

VII Distortion Fraction Evaluation

The main motivation of our distortion fraction analysis is that our deep learning experiments (cf. Section VIII-B) and prior work [14] show that $\epsilon=c^{(q)}/f$ is a surrogate of the model's convergence with respect to accuracy. This comparison involves our work and state-of-the-art schemes under the best- and worst-case choice of the $q$ adversaries in terms of the achievable value of $\epsilon$. We also compare our work with baseline approaches that do not involve redundancy or majority voting; in these approaches, aggregation is applied directly to the $K$ gradients returned by the workers ($f=K$, $c_{\mathrm{max}}^{(q)}=q$, and $\epsilon=q/K$).

For Aspis, we used the proposed attack ATT-2 from Section IV-B and the corresponding computation of $c^{(q),Aspis}$ of Theorem 1. DETOX [16] employs a redundant assignment followed by majority voting and offers robustness guarantees which crucially rely on a "random choice" of the Byzantines. Our prior work [14] (ByzShield) has demonstrated the importance of a careful task assignment and observed that redundancy by itself is not sufficient to allow for Byzantine resilience. That work proposed an optimal choice of the $q$ Byzantines that maximizes $\epsilon^{DETOX}$, which we used in our current experiments. In short, DETOX splits the $K$ workers into $K/r$ groups. All workers within a group process the same subset of the batch, specifically containing $br/K$ samples. This phase is followed by majority voting on a group-by-group basis. Reference [14] suggests choosing the Byzantines so that at least $r^{\prime}$ workers in each group are adversarial in order to distort the corresponding gradients. In this case, $c^{(q),DETOX}=\lfloor\frac{q}{r^{\prime}}\rfloor$ and $\epsilon^{DETOX}=\lfloor\frac{q}{r^{\prime}}\rfloor\times r/K$. We also compare with the distortion fraction incurred by ByzShield [14] under a worst-case scenario. For this scheme, there is no known optimal attack, and we performed an exhaustive combinatorial search to find the $q$ adversaries that maximize $\epsilon^{ByzShield}$ among all possible options; we follow the same process here to simulate ByzShield's distortion fraction computation while utilizing the scheme of that work based on mutually orthogonal Latin squares. The reader can refer to Figure 3(a) and Appendix Tables III, IV, and V for our results. Aspis achieves major reductions in $\epsilon$; for instance, $\epsilon^{Aspis,ATT-2}$ is reduced by up to 99% compared to both $\epsilon^{Baseline}$ and $\epsilon^{DETOX}$ in Figure 3(a).
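The optimal-attack distortion fractions discussed above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are ours, and it assumes that Aspis uses $f=\binom{K}{r}$ files (consistent with $f=455$ for $K=15$, $r=3$ in Table III) and that the bound $c_{\mathrm{max}}^{(q)}=\frac{1}{2}\binom{2q}{r}$ of Theorem 1 is attained.

```python
from math import comb

def eps_baseline(K, q):
    # No redundancy: each Byzantine distorts exactly one of the K gradients.
    return q / K

def eps_detox_optimal(K, r, q):
    # Attack of [14]: r' Byzantines per corrupted group, K/r groups in total.
    r_prime = (r + 1) // 2
    return (q // r_prime) * r / K

def eps_aspis_att2(K, r, q):
    # Assumption: f = C(K, r) files and c_max = C(2q, r) / 2 (Theorem 1).
    return comb(2 * q, r) / 2 / comb(K, r)

if __name__ == "__main__":
    K, r = 15, 3
    for q in range(2, 8):
        print(q,
              round(eps_aspis_att2(K, r, q), 3),
              round(eps_baseline(K, q), 3),
              round(eps_detox_optimal(K, r, q), 3))
```

For $K=15$ and $r=3$, the printed values should agree with the Aspis, baseline, and DETOX columns of Table III(a) after rounding to three decimals.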

Next, we consider the weak attack, ATT-1. For our scheme, we will make an arbitrary choice of $q$ adversaries which carry out the method introduced in Section V-A1, i.e., they will distort all files, and a successful detection is possible. As discussed in Section V-A1, the fraction of corrupted gradients is $\epsilon^{Aspis,ATT-1}=\binom{q}{r}/\binom{K}{r}$. For DETOX, a simple benign attack is used. To that end, let the $K/r$ files be $B_{t,0},B_{t,1},\dots,B_{t,K/r-1}$. Initialize $A=\emptyset$ and choose the $q$ Byzantines as follows: for $i=0,1,\dots,q-1$, among the remaining workers in $\{U_{1},U_{2},\dots,U_{K}\}-A$, add a worker from the group $B_{t,i \bmod K/r}$ to the adversarial set $A$. Then,

$$c^{(q),DETOX}=\left\{\begin{array}{ll}q-\frac{K}{r}(r^{\prime}-1)&\text{if }q>\frac{K}{r}(r^{\prime}-1),\\ 0&\text{otherwise}.\end{array}\right.$$

The results of this scenario are in Figure 3(b).
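As a sanity check of the weak-attack counts, the round-robin placement of Byzantines over the DETOX groups can be simulated directly and compared with the closed-form expression for $c^{(q),DETOX}$. The sketch below is illustrative only; the function names are ours, and the comparison is meaningful for $q\leq\frac{K}{r}r^{\prime}$, after which every group is already corrupted.

```python
from math import comb

def detox_weak_simulated(K, r, q):
    """Count corrupted groups when Byzantines are placed round-robin."""
    groups = K // r
    r_prime = (r + 1) // 2
    byz_per_group = [0] * groups
    for i in range(q):
        byz_per_group[i % groups] += 1  # i-th Byzantine joins group i mod K/r
    # A majority vote is corrupted once a group holds at least r' Byzantines.
    return sum(1 for n in byz_per_group if n >= r_prime)

def detox_weak_closed_form(K, r, q):
    """Closed form for the number of corrupted groups under the weak attack."""
    threshold = (K // r) * ((r + 1) // 2 - 1)
    return q - threshold if q > threshold else 0

def eps_aspis_att1(K, r, q):
    # Fraction of files whose r workers are all Byzantine.
    return comb(q, r) / comb(K, r)

if __name__ == "__main__":
    K, r = 15, 3
    for q in range(3, 8):
        c = detox_weak_simulated(K, r, q)
        print(q, round(eps_aspis_att1(K, r, q), 3), round(c * r / K, 3))
```

For $K=15$, $r=3$, and $q=6$, this gives $\epsilon^{DETOX}=0.2$ and $\epsilon^{Aspis,ATT-1}\approx 0.044$.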

Figure 7: ALIE distortion under optimal attack scenarios, ATT-2 for Aspis, CIFAR-10, $K=15$. (a) Median-based defenses. (b) Bulyan-based defenses. (c) Multi-Krum-based defenses.
Figure 8: Reversed gradient distortion under optimal attack scenarios, ATT-2 for Aspis, CIFAR-10, $K=15$. (a) Median-based defenses. (b) Bulyan-based defenses. (c) Multi-Krum-based defenses.

VIII Large-Scale Deep Learning Experiments

All these experiments are performed under Setting-I, i.e., no assumptions are made about the data set or the loss function. Accordingly, the evaluation here is in terms of the distortion fraction (see Section VII) and numerical experiments (described below). For the experiments, we used mini-batch SGD (see (2)), and the robust estimator (see Algorithm 2) is the coordinate-wise median.
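For readers unfamiliar with the coordinate-wise median, the following is a minimal sketch of the estimator on plain Python lists. It is not the implementation used in our experiments; the function name and the toy vectors are hypothetical and chosen only to show that a single outlier vector does not move the output far from the honest values.

```python
from statistics import median

def coordinate_wise_median(gradients):
    """Return the per-coordinate median of a list of equal-length vectors."""
    # zip(*gradients) groups the values of each coordinate across all vectors.
    return [median(coords) for coords in zip(*gradients)]

if __name__ == "__main__":
    honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 1.9]]
    distorted = [[100.0, -100.0]]  # toy Byzantine vector
    print(coordinate_wise_median(honest + distorted))
```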

VIII-A Experiment Setup

We have evaluated the performance of our methods and competing techniques in classification tasks on Amazon EC2 clusters. The project is written in PyTorch [1] and uses the MPICH library for communication between the different nodes. We worked with the CIFAR-10 data set [32] using the ResNet-18 [45] model. We used clusters of $K=15$, $21$, and $25$ workers, redundancy $r=3$, and simulated values of $q=2,4,6,7,9$ during training. Detailed information about the implementation can be found in Appendix Section X-B.

Competing methods: We compare Aspis against the baseline implementations of median-of-means [46], Bulyan [24], and Multi-Krum [12]. If $c_{\mathrm{max}}^{(q)}$ is the number of adversarial computations, then Bulyan requires a total of at least $4c_{\mathrm{max}}^{(q)}+3$ computations, while Multi-Krum requires at least $2c_{\mathrm{max}}^{(q)}+3$. These constraints make these methods inapplicable for larger values of $q$ for which our methods are robust. The second class of comparisons is with methods that use redundancy, specifically DETOX [16]. For the baseline scheme, we compare with median-based techniques since they originate from robust statistics and are the basis for many aggregators. Multi-Krum combines the intuitions of majority-based and squared-distance-based methods. Draco [17] is a closely related method that uses redundancy. However, we do not compare with it since it is resilient to only a very limited number of Byzantines.
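The two applicability constraints above are easy to check programmatically. The sketch below is a hedged illustration with function names of our choosing; it takes the number of aggregated gradients $f$ and the number of distorted ones $c_{\mathrm{max}}^{(q)}$, using $f=K$, $c_{\mathrm{max}}^{(q)}=q$ for the baseline and $f=K/r$, $c_{\mathrm{max}}^{(q)}=\lfloor q/r^{\prime}\rfloor$ for DETOX under the attack of [14] (see Section VII).

```python
def bulyan_applicable(f, c_max):
    # Bulyan needs at least 4 * c_max + 3 gradients to aggregate.
    return f >= 4 * c_max + 3

def multi_krum_applicable(f, c_max):
    # Multi-Krum needs at least 2 * c_max + 3 gradients to aggregate.
    return f >= 2 * c_max + 3

if __name__ == "__main__":
    K, r = 15, 3
    r_prime = (r + 1) // 2
    for q in (2, 4):
        print("baseline", q,
              bulyan_applicable(K, q), multi_krum_applicable(K, q))
        c_detox = q // r_prime
        print("DETOX", q,
              bulyan_applicable(K // r, c_detox),
              multi_krum_applicable(K // r, c_detox))
```

With $K=15$ and $r=3$, baseline Bulyan is ruled out at $q=4$ (it would need $19$ gradients), and DETOX with only $K/r=5$ votes cannot be paired with Bulyan even for $q=2$.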

Note that for a baseline scheme, all choices of $A$ are equivalent in terms of the value of $\epsilon$. In our comparisons between Aspis and DETOX, we will consider two attack scenarios concerning the choice of the adversaries. For the optimal attack on DETOX, we will use the method proposed in [14] and compare with the attack introduced in Section V-A2. For the weak one, we will choose the adversaries such that they incur the minimum value of $\epsilon$ in DETOX for a given $q$ and compare its performance with the scenario of Section V-A1. All schemes compared with Aspis+ consider random sets of Byzantines, and for Aspis+, we will use the attack ATT-3.

Figure 9: FoE distortion, optimal attacks, ATT-2 (Aspis) and median-based defenses (CIFAR-10), $K=15$.
Figure 10: Reversed gradient distortion, weak attacks, ATT-1 (Aspis) and median-based defenses (CIFAR-10), $K=15$.
Figure 11: ALIE distortion, weak attacks, ATT-1 (Aspis) and median-based defenses (CIFAR-10), $K=15$.
Figure 12: ALIE distortion under optimal attack scenarios, ATT-2 for Aspis, CIFAR-10, $K=21$. (a) Median-based defenses. (b) Bulyan-based defenses. (c) Multi-Krum-based defenses.

VIII-B Aspis Experimental Results

VIII-B1 Comparison under Optimal Attacks

We compare the different defense algorithms under optimal attack scenarios using ATT-2 for Aspis. Figure 7(a) compares our scheme Aspis with the baseline implementation of coordinate-wise median ($\epsilon=0.133,0.267$ for $q=2,4$, respectively) and DETOX with median-of-means ($\epsilon=0.2,0.4$ for $q=2,4$, respectively) under the ALIE attack. Aspis converges faster and achieves at least a 35% average accuracy boost (at the end of the training) for both values of $q$ ($\epsilon^{Aspis}=0.004,0.062$ for $q=2,4$, respectively); please refer to Appendix Tables III and IV for the values of the distortion fraction $\epsilon$ each scheme incurs. In Figures 7(b) and 7(c), we observe similar trends in our experiments with Bulyan and Multi-Krum, where Aspis significantly outperforms these techniques. For the current setup, Bulyan is not applicable for $q=4$ since $K=15<4c_{\mathrm{max}}^{(q)}+3=4q+3=19$. Also, neither Bulyan nor Multi-Krum can be paired with DETOX for $q\geq 4$ since the inequalities $f\geq 4c_{\mathrm{max}}^{(q)}+3$ and $f\geq 2c_{\mathrm{max}}^{(q)}+3$, where $f=f_{\mathrm{DETOX}}=K/r$, cannot be satisfied; for the specific case of Bulyan, even $q=2,3$ would not be supported by DETOX. Please refer to Section VIII-A and Section VII for more details on these requirements. Also, note that the accuracy of most competing methods fluctuates more than in the results presented in the corresponding papers [16] and [23]. This is expected as we consider stronger attacks than those papers, i.e., optimal deterministic attacks on DETOX and, in general, up to 27% adversarial workers in the cluster.
Also, we have done multiple experiments with different random seeds to demonstrate the stability and superiority of our accuracy results compared to other methods (against median-based defenses in Appendix Figure 15, Bulyan in Figure 16, and Multi-Krum in Supplement Figure 17); we point the reader to Appendix Section X-B3 for this analysis. This analysis is missing from most prior work, including that of ALIE [23], whose presented results are only a snapshot of a single experiment. The results for the reversed gradient attack are shown in Figures 8(a), 8(b), and 8(c). Given that this is a weaker attack [14, 16], all schemes, including the baseline methods, are expected to perform well; indeed, in most cases, the model converges to approximately 80% accuracy. However, DETOX fails to converge to high accuracy for $q=4$, as in the case of ALIE; one explanation is that $\epsilon^{DETOX}=0.4$ for $q=4$. Under the Fall of Empires (FoE) distortion (cf. Figure 9), our method still enjoys an accuracy advantage over the baseline and DETOX schemes, which becomes more important as the number of Byzantines in the cluster increases.

We have also performed experiments on larger clusters ($K=21$ workers). The results for the ALIE distortion with the ATT-2 attack can be found in Figure 12. They exhibit behavior similar to the case of $K=15$.

Figure 13: ALIE distortion and random Byzantines, $K=15$ (median-based defenses). ATT-3 used on Aspis+.
Figure 14: Constant distortion and random Byzantines, $K=25$ (signSGD-based defenses). ATT-3 used on Aspis+.

VIII-B2 Comparison under Weak Attacks

For baseline schemes, the discussion of weak versus optimal choice of the adversaries is not very relevant, as any choice of the $q$ Byzantines can overall distort exactly $q$ out of the $K$ gradients. Hence, for weak scenarios, we chose to compare mostly with DETOX while using ATT-1 on Aspis. The accuracy is reported in Figures 10 and 11, according to which Aspis shows an improvement under attacks on the more challenging end of the spectrum (ALIE). According to Appendix Table III(b), Aspis enjoys a fraction $\epsilon^{Aspis}=0.044$, while $\epsilon^{Baseline}=0.4$ and $\epsilon^{DETOX}=0.2$ for $q=6$.

VIII-C Aspis+ Experimental Results

For Aspis+, we considered the attack ATT-3 discussed in Section IV-C. We tested clusters of $K=15$ with $q=2,4$ and $K=25$ workers among which $q=7,9$ are Byzantine. In the former case, a $2$-$(15,3,1)$ design [33] with $f=35$ blocks (files) was used for the placement, while in the latter case, we used a $2$-$(25,3,1)$ design [33] with $f=100$ blocks (files). A new random Byzantine set $A$ is generated every $T_{b}=50$ iterations while the detection window is of length $T_{d}=15$.
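The block counts quoted above follow from the standard counting identities of a $2$-$(v,k,\lambda)$ design: the number of blocks is $\lambda v(v-1)/(k(k-1))$ and each point lies in $\lambda(v-1)/(k-1)$ blocks. The short sketch below evaluates these identities; interpreting points as workers and blocks as files (so that the second quantity is the per-worker load) is our reading of the placement, and the function name is hypothetical.

```python
def design_parameters(v, k, lam=1):
    """Return (number of blocks, blocks per point) of a 2-(v, k, lam) design."""
    blocks, rem_b = divmod(lam * v * (v - 1), k * (k - 1))
    per_point, rem_r = divmod(lam * (v - 1), k - 1)
    # Both divisions must be exact for such a design to exist.
    if rem_b or rem_r:
        raise ValueError("parameters violate the divisibility conditions")
    return blocks, per_point

if __name__ == "__main__":
    print(design_parameters(15, 3))  # files and per-worker load for K = 15
    print(design_parameters(25, 3))  # files and per-worker load for K = 25
```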

The results for $K=15$ are in Figure 13. We tested against the ALIE distortion, and all compared methods use median-based defenses to filter the gradients. Aspis+ demonstrates an advantage of at least 15% compared with other algorithms (cf. $q=2$). For $K=25$, we tried a weaker distortion than ALIE, i.e., the constant attack paired with signSGD-based defenses [26]. In signSGD, the PS will output the majority of the gradients' signs for each dimension. Following the advice of [16], we pair this defense with the stronger constant attack, as sign flips (e.g., reversed gradient) are unlikely to affect the gradient's distribution. Aspis+ with median still enjoys an accuracy improvement of at least 20% for $q=7$ and a larger one for $q=9$. The results are in Figure 14; in this figure, the DETOX accuracy is an average of two experiments using two different random seeds.
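The sign-based majority vote mentioned above can be sketched as follows. This is a simplified illustration and not the signSGD implementation of [26]; in particular, returning $0$ for a tied coordinate is an assumption of this sketch, and the function names are ours.

```python
def sign(x):
    # Returns 1, -1, or 0 for positive, negative, and zero inputs.
    return (x > 0) - (x < 0)

def sign_majority_vote(gradients):
    """For each coordinate, output the sign held by the majority of vectors."""
    # Summing the signs and taking the sign of the sum is a majority vote;
    # a tie yields 0 (assumption of this sketch).
    return [sign(sum(sign(g) for g in coords)) for coords in zip(*gradients)]

if __name__ == "__main__":
    toy_gradients = [[0.5, -1.0], [2.0, -0.1], [-3.0, -4.0]]
    print(sign_majority_vote(toy_gradients))
```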

IX Conclusions and Future Work

In this work, we have presented Aspis and Aspis+, two Byzantine-resilient distributed schemes that use redundancy and robust aggregation in novel ways to detect failures of the workers. Our theoretical analysis and numerical experiments clearly indicate their superior performance compared to state-of-the-art. Our experiments show that these methods require increased computation and communication time as compared to prior work, e.g., note that each worker has to transmit $l$ gradients instead of $1$ in related work [16, 17] (see Appendix Section X-B4 for details). We emphasize, however, that our schemes converge to high accuracy in our experiments, while other methods remain at much lower accuracy values regardless of how long the algorithm runs for.

Our experiments involve clusters of up to $25$ workers. As we scale Aspis to more workers, the total number of files and the computation load $l$ of each worker will also scale; this increases the memory needed to store the gradients during aggregation. For complex neural networks, the memory to store the model and the intermediate gradient computations is by far the most memory-consuming aspect of the algorithm. For these reasons, Aspis is mostly suitable for training large data sets using fairly compact models that do not require too much memory. Aspis+, on the other hand, is a good fit for clusters that suffer from non-adversarial failures that can lead to inaccurate gradients. Finally, utilizing GPUs and communication-related algorithmic improvements are worth exploring to reduce the time overhead.

X Appendix

TABLE II: Main notation of the paper.
Symbol   Meaning
$K$   number of workers
$q$   number of adversaries
$r$   redundancy (number of workers each file is assigned to)
$b$   batch size
$B_{t}$   samples of batch of $t^{\mathrm{th}}$ iteration
$f$   number of files (alternatively called groups or tasks)
$U_{j}$   $j^{\mathrm{th}}$ worker
$l$   computation load (number of files per worker)
$\mathcal{N}^{w}(U_{j})$   set of files of worker $U_{j}$
$\mathcal{N}^{f}(B_{t,i})$   set of workers assigned to file $B_{t,i}$
$\mathbf{g}_{t,i}$   true gradient of file $B_{t,i}$ with respect to $\mathbf{w}$
$\hat{\mathbf{g}}_{t,i}^{(j)}$   returned gradient of $U_{j}$ for file $B_{t,i}$ with respect to $\mathbf{w}$
$\mathbf{m}_{i}$   majority gradient for file $B_{t,i}$
$\mathcal{U}$   worker set $\{U_{1},U_{2},\dots,U_{K}\}$
$\mathbf{G}_{task}$   graph used to encode the task assignments to workers
$\mathbf{G}_{t}$   graph indicating the agreements of pairs of workers in all of their common gradient tasks in $t^{\mathrm{th}}$ iteration
$A$   set of adversaries
$M_{\mathbf{G}}$   maximum clique in $\mathbf{G}$
$c^{(q)}$   number of distorted gradients after detection and aggregation
$c_{\mathrm{max}}^{(q)}$   maximum number of distorted gradients after detection and aggregation (worst-case)
$D_{i}$   disagreement set (of workers) for $i^{\mathrm{th}}$ adversary
$r^{\prime}$   $(r+1)/2$, i.e., minimum number of distorted copies needed to corrupt majority vote for a file
$\epsilon$   $c^{(q)}/f$, i.e., fraction of distorted gradients after detection and aggregation
$X_{j}$   subset of files where the set of active adversaries is of size $j$; for linear regression this is the data matrix corresponding to the $i^{\mathrm{th}}$ file
$X$   data matrix of linear regression
$n$   number of points of linear regression
$d$   dimensionality of linear regression model
TABLE III: Distortion fraction of optimal and weak attacks for $(K,f,l,r)=(15,455,91,3)$ and comparison.
$q$   $\epsilon^{Aspis}_{ATT-2}$   $\epsilon^{Baseline}$   $\epsilon^{DETOX}$   $\epsilon^{ByzShield}$
2   0.004   0.133   0.2   0.04
3   0.022   0.2     0.2   0.12
4   0.062   0.267   0.4   0.2
5   0.132   0.333   0.4   0.32
6   0.242   0.4     0.6   0.48
7   0.4     0.467   0.6   0.56
III(a) Optimal attacks.
$q$   $\epsilon^{Aspis}_{ATT-1}$   $\epsilon^{Baseline}$   $\epsilon^{DETOX}$
2   0.002   0.133   0
3   0.002   0.2     0
4   0.009   0.267   0
5   0.022   0.333   0
6   0.044   0.4     0.2
7   0.077   0.467   0.4
III(b) Weak attacks.
TABLE IV: Distortion fraction of optimal and weak attacks for $(K,f,l,r)=(21,1330,190,3)$ and comparison.
$q$   $\epsilon^{Aspis}_{ATT-2}$   $\epsilon^{Baseline}$   $\epsilon^{DETOX}$   $\epsilon^{ByzShield}$
2    0.002   0.095   0.143   0.02
3    0.008   0.143   0.143   0.06
4    0.021   0.19    0.286   0.1
5    0.045   0.238   0.286   0.16
6    0.083   0.286   0.429   0.24
7    0.137   0.333   0.429   0.33
8    0.211   0.381   0.571   0.43
9    0.307   0.429   0.571   0.51
10   0.429   0.476   0.714   0.59
IV(a) Optimal attacks.
$q$   $\epsilon^{Aspis}_{ATT-1}$   $\epsilon^{Baseline}$   $\epsilon^{DETOX}$
2    0.001   0.095   0
3    0.001   0.143   0
4    0.003   0.19    0
5    0.008   0.238   0
6    0.015   0.286   0
7    0.026   0.333   0
8    0.042   0.381   0.143
9    0.063   0.429   0.286
10   0.09    0.476   0.429
IV(b) Weak attacks.
TABLE V: Distortion fraction of optimal and weak attacks for $(K,f,l,r)=(24,2024,253,3)$ and comparison.
$q$   $\epsilon^{Aspis}_{ATT-2}$   $\epsilon^{Baseline}$   $\epsilon^{DETOX}$   $\epsilon^{ByzShield}$
2    0.001   0.083   0.125   0.031
3    0.005   0.125   0.125   0.063
4    0.014   0.167   0.25    0.125
5    0.03    0.208   0.25    0.188
6    0.054   0.25    0.375   0.281
7    0.09    0.292   0.375   0.375
8    0.138   0.333   0.5     0.5
9    0.202   0.375   0.5     0.5
10   0.282   0.417   0.625   0.531
11   0.38    0.458   0.625   0.625
V(a) Optimal attacks.
$q$   $\epsilon^{Aspis}_{ATT-1}$   $\epsilon^{Baseline}$   $\epsilon^{DETOX}$
2    0       0.083   0
3    0       0.125   0
4    0.002   0.167   0
5    0.005   0.208   0
6    0.01    0.25    0
7    0.017   0.292   0
8    0.028   0.333   0
9    0.042   0.375   0.125
10   0.059   0.417   0.25
11   0.082   0.458   0.375
V(b) Weak attacks.

X-A Proof of Theorem 1

For a given file $F$, let $A^{\prime}\subseteq A$ with $|A^{\prime}|\geq r^{\prime}$ be the set of "active adversaries" in it, i.e., $A^{\prime}\subseteq F$ consists of Byzantines that collude to create a majority that distorts the gradient on it. In this case, the remaining workers in $F$ belong to $\cap_{i\in A^{\prime}}D_{i}$, where we note that $|\cap_{i\in A^{\prime}}D_{i}|\leq q$. Let $X_{j},j=r^{\prime},r^{\prime}+1,\dots,r$ denote the subset of files where the set of active adversaries is of size $j$; note that $X_{j}$ depends on the disagreement sets $D_{i},i=1,2,\dots,q$. Formally,

$$X_{j}=\{F:\exists A^{\prime}\subseteq A\cap F,\ |A^{\prime}|=j,\ \text{and}\ \forall\,U_{j}\in F\setminus A^{\prime},\ U_{j}\in\cap_{i\in A^{\prime}}D_{i}\}.\tag{12}$$

Then, for a given choice of disagreement sets, the number of files that can be corrupted is given by $|\cup_{j=r^{\prime}}^{r}X_{j}|$. We obtain an upper bound on the maximum number of corrupted files by maximizing this quantity with respect to the choice of $D_{i},i=1,2,\dots,q$, i.e.,

$$c_{\mathrm{max}}^{(q)}=\max\limits_{D_{i},|D_{i}|\leq q,i=1,2,\dots,q}|\cup_{j=r^{\prime}}^{r}X_{j}|\tag{13}$$

where the maximization is over the choice of the disagreement sets $D_{1},D_{2},\dots,D_{q}$. With $X_{j}$ given in (12), assuming $q\geq r^{\prime}$, the number of distorted files is upper bounded by

$$|\cup_{j=r^{\prime}}^{r}X_{j}|\leq\sum_{j=r^{\prime}}^{r}|X_{j}|\text{ (by the union bound).}\tag{14}$$

For that, recall that $r^{\prime}=(r+1)/2$ and that an adversarial majority of at least $r^{\prime}$ distorted computations for a file is needed to corrupt that particular file. Note that $X_{j}$ consists of those files where the active adversaries $A^{\prime}$ are of size $j$; these can be chosen in $\binom{q}{j}$ ways. The remaining workers in the file belong to $\cap_{i\in A^{\prime}}D_{i}$ where $|\cap_{i\in A^{\prime}}D_{i}|\leq q$. Thus, the remaining workers can be chosen in at most $\binom{q}{r-j}$ ways. It follows that

$$|X_{j}|\leq\binom{q}{j}\binom{q}{r-j}.\tag{15}$$

Therefore,

$$\begin{aligned}
c_{\mathrm{max}}^{(q)} &\leq \binom{q}{r^{\prime}}\binom{q}{r-r^{\prime}}+\binom{q}{r^{\prime}+1}\binom{q}{r-(r^{\prime}+1)}+\cdots+\binom{q}{r-1}\binom{q}{r-(r-1)}+\binom{q}{r} && (16)\\
&= \sum_{i=r^{\prime}}^{q}\binom{q}{i}\binom{q}{r-i} && (17)\\
&= \sum_{i=0}^{q}\binom{q}{i}\binom{q}{r-i}-\sum_{i=0}^{r^{\prime}-1}\binom{q}{i}\binom{q}{r-i} && (18)\\
&= \frac{1}{2}\binom{2q}{r}. && (19)
\end{aligned}$$

Eq. (17) follows from the convention that $\binom{n}{k}=0$ when $k>n$ or $k<0$. Eq. (19) follows from Eq. (18) using the following observations:

  • $\sum_{i=0}^{q}\binom{q}{i}\binom{q}{r-i}=\sum_{i=0}^{r}\binom{q}{i}\binom{q}{r-i}=\binom{2q}{r}$, in which the first equality is straightforward to show by taking all possible cases ($q<r$, $q=r$, and $q>r$) and the second is Vandermonde's identity.

  • By symmetry, $\sum_{i=0}^{r^{\prime}-1}\binom{q}{i}\binom{q}{r-i}=\sum_{i=r^{\prime}}^{q}\binom{q}{i}\binom{q}{r-i}=\frac{1}{2}\binom{2q}{r}$.
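The identity behind Eq. (19) can be checked numerically for small parameters. The snippet below is only a sanity check under the stated convention for out-of-range binomial coefficients, with odd $r$ so that $r^{\prime}=(r+1)/2$ is an integer; the helper names are ours.

```python
from math import comb

def binom(n, k):
    # Convention used in the proof: C(n, k) = 0 when k > n or k < 0.
    return comb(n, k) if 0 <= k <= n else 0

def tail_sum(q, r):
    """Right-hand side of Eq. (17): sum over i = r', ..., q."""
    r_prime = (r + 1) // 2
    return sum(binom(q, i) * binom(q, r - i) for i in range(r_prime, q + 1))

def identity_holds(q, r):
    # Eq. (19) states that the tail sum equals C(2q, r) / 2.
    return 2 * tail_sum(q, r) == comb(2 * q, r)

if __name__ == "__main__":
    checks = [identity_holds(q, r)
              for r in (3, 5, 7)
              for q in range((r + 1) // 2, 12)]
    print(all(checks))
```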

The upper bound in Eq. (16) is met with equality when all adversaries choose the same disagreement set, which is a $q$-sized subset of the honest workers, i.e., $D_{i}=D\subset H$ for $i=1,\dots,q$. In this case, it can be seen that the sets $X_{j},j=r^{\prime},\dots,r$ are disjoint so that (14) is met with equality. Moreover, (15) is also an equality. This finally implies that (16) is also an equality, i.e., this choice of disagreement sets saturates the upper bound.

It can also be seen that in this case, the adversarial strategy yields a graph $\mathbf{G}$ with multiple maximum cliques. To see this, we note that the adversaries in $A$ agree with all the computed gradients in $H\setminus D$. Thus, together with the workers in $H\setminus D$, they form a clique $M_{\mathbf{G}}^{(1)}$ of size $K-q$ in $\mathbf{G}$. Furthermore, the honest workers in $H$ form another clique $M_{\mathbf{G}}^{(2)}$, which is also of size $K-q$. Thus, the detection algorithm cannot select one over the other and the adversaries will evade detection; the fallback robust aggregation strategy will then apply.
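The clique tie described above can be illustrated on a toy agreement graph. The sketch below is not the detection code of the paper: it hard-codes a small hypothetical instance ($K=7$, $q=2$), assumes that every pair of workers shares at least one file and that the adversaries agree with each other, and finds maximum cliques by brute force.

```python
from itertools import combinations

def maximum_cliques(nodes, edges):
    """Brute-force all maximum cliques (only viable for a few workers)."""
    for size in range(len(nodes), 0, -1):
        found = [set(c) for c in combinations(nodes, size)
                 if all(frozenset(p) in edges for p in combinations(c, 2))]
        if found:
            return found
    return []

def agreement_graph(K, A, D):
    """Edges between workers that agree on all of their common files."""
    H = set(range(K)) - A
    edges = set()
    for u, v in combinations(range(K), 2):
        both_honest = u in H and v in H
        both_adversarial = u in A and v in A
        # An adversary agrees with an honest worker unless that worker is in D.
        mixed_agree = ((u in A) != (v in A)) and not ({u, v} & D)
        if both_honest or both_adversarial or mixed_agree:
            edges.add(frozenset((u, v)))
    return edges

if __name__ == "__main__":
    K, A, D = 7, {0, 1}, {2, 3}  # toy instance: D is a q-subset of H
    cliques = maximum_cliques(list(range(K)), agreement_graph(K, A, D))
    print(cliques)
```

In this toy instance, both the honest set $H$ and the set $A\cup(H\setminus D)$ are maximum cliques of size $K-q=5$, so a detector based on a unique maximum clique cannot separate them.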

X-B Experiment Setup Details

X-B1 Cluster Setup

We used clusters of $K=15$, $21$, and $25$ workers arranged in various setups within Amazon EC2. Initially, we used a PS of type i3.16xlarge and several workers of type c5.4xlarge to set up a distributed cluster; for the experiments that we adapted to GPUs, g3s.xlarge instances were used. However, purely distributed implementations require training data to be transmitted from the PS to every single machine, based on our current implementation; an alternative approach one can follow is to set up shared storage space accessible by all machines to store the training data. Also, some instances were automatically terminated by AWS per the AWS spot instance policy limitations (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html); this incurred some delays in resuming the experiments that were stopped. In order to facilitate our evaluation and avoid these issues, we decided to simulate the PS and the workers for the rest of the experiments on a single instance of type either x1.16xlarge or i3.16xlarge. We emphasize that the choice of the EC2 setup does not affect any of the numerical results in this paper since, in all cases, we used a single virtual machine image with the same dependencies. Handling of the GPU floating-point precision errors is discussed in Supplement Section XI-B.

X-B2 Data Set Preprocessing and Hyperparameter Tuning

The CIFAR-10 images have been normalized using standard mean and standard deviation values for the data set. The value used for momentum (for gradient descent) was set to $0.9$, and we trained for $16$ epochs in all experiments. The number of epochs is precisely the invariant we maintain across all experiments, i.e., all schemes process the training data the same number of times. The batch size and the learning rate are chosen independently for each method; the number of iterations is adjusted accordingly to account for the number of epochs. For Section VIII-B, we followed the advice of the authors of DETOX and chose $(K,b)=(15,480)$ and $(K,b)=(21,672)$ for the DETOX and baseline schemes. For Aspis, we used $(K,b)=(15,14560)$ (32 samples per file) and $(K,b)=(21,3990)$ (3 samples per file) for the ALIE experiments and $b=1365$ (3 samples per file) for the remaining experiments except for the FoE optimal attack with $q=4$ (cf. Figure 9), for which $b=14560$ performed better. In Section VIII-C, we used $(K,b)=(15,480)$ and $(K,b)=(25,800)$ for DETOX as well as for baseline schemes, while for Aspis+ we used $(K,b)=(15,770)$ for the ALIE experiments and $(K,b)=(25,1800)$ for the constant attack experiments. In Supplement Table VI, a learning rate schedule is denoted by $(x,y)$; this notation signifies that we start with a rate equal to $x$, and every $z$ iterations, we set the rate equal to $x\times y^{t/z}$, where $t$ is the index of the current iteration and $z$ is set to be the number of iterations occurring between two consecutive checkpoints in which we store the model (points in the accuracy figures). We also index the schemes in order of appearance in the corresponding figure's legend. Experiments that appear in multiple figures are not repeated in Supplement Table VI (we ran those training processes once).
In order to pick the optimal hyperparameters for each scheme, we performed an extensive grid search involving different combinations of $(x,y)$. In particular, the values of $x$ we tested are 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, and 0.0003, and for $y$ we tried 1, 0.975, 0.95, 0.7, and 0.5. For each method, we ran 3 epochs for each such combination and chose the one that gave the lowest value of average cross-entropy loss (principal criterion) and the highest value of top-1 accuracy (secondary criterion).
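The learning rate schedule $(x,y)$ and the search grid described above can be written compactly as follows. This is a sketch of our reading of the schedule, in which the rate is refreshed only at multiples of $z$ and therefore equals $x\times y^{\lfloor t/z\rfloor}$ in between; the function and variable names are hypothetical.

```python
def learning_rate(x, y, t, z):
    """Rate in effect at iteration t for schedule (x, y), checkpoint spacing z."""
    # The rate changes only every z iterations, hence the integer division.
    return x * y ** (t // z)

# Candidate values listed in the text for the initial rate x and the decay y.
INITIAL_RATES = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003]
DECAY_FACTORS = [1, 0.975, 0.95, 0.7, 0.5]
GRID = [(x, y) for x in INITIAL_RATES for y in DECAY_FACTORS]

if __name__ == "__main__":
    print(len(GRID))  # number of (x, y) combinations in the grid search
    # Rates seen at a few iterations, with z = 10 chosen purely for illustration.
    print([learning_rate(0.1, 0.5, t, 10) for t in (0, 9, 10, 25)])
```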

Figure 15: ALIE optimal attack and median-based defenses (CIFAR-10), $K=15$ with different random seeds, ATT-2 (Aspis). (a) $q=2$ adversaries. (b) $q=4$ adversaries.
Figure 16: ALIE optimal attack and Bulyan-based defenses (CIFAR-10), $K=15$ with different random seeds, ATT-1 (Aspis). (a) $q=2$ adversaries. (b) $q=4$ adversaries.

X-B3 Error Bars

In order to examine whether the choice of the random seed affects the accuracy of the trained model, we have performed the experiments of Section VIII-B for the ALIE distortion with two different seeds for the values $q=2,4$ for every scheme; we used $428$ and $50$ as random seeds. These tests have been performed for the case of $K=15$ workers. In Figure 15(a), for a given method, we report the minimum accuracy, the maximum accuracy, and their average for each evaluation point. We repeat the same process in Figure 16(a) and Supplement Figure 17(a) when comparing with Bulyan and Multi-Krum, respectively. The corresponding experiments for $q=4$ are shown in Figures 15(b), 16(b), and Supplement Figure 17(b).
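The per-point summary reported in these figures amounts to the following computation. It is a minimal sketch, with a hypothetical function name and the assumption that the accuracy curves of the different seeds are aligned by evaluation point.

```python
def seed_envelope(curves):
    """Summarize accuracy curves obtained with different random seeds.

    `curves` holds one list of accuracies per seed, all evaluated at the
    same checkpoints. Returns one (minimum, maximum, average) tuple per
    evaluation point.
    """
    return [(min(p), max(p), sum(p) / len(p)) for p in zip(*curves)]
```

With two seeds, the average is simply the midpoint of the minimum and the maximum at each evaluation point.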

Since these experiments are time-consuming and computationally expensive, we chose to perform this consistency check for a subset of our experiments. Nevertheless, these results indicate that prior schemes [16, 8, 24] are sensitive to the choice of the random seed and exhibit unstable convergence behavior. In all of these cases, the accuracy achieved at the end of the 16 epochs of training is low compared to that of Aspis. On the other hand, the accuracy results for Aspis are almost identical for both choices of the random seed.

X-B4 Computation and Communication Overhead

Our schemes provide robustness under powerful attacks and sophisticated distortion methods at the expense of increased computation and communication time. Note that each worker has to perform $l$ forward/backward propagation computations and transmit $l$ gradients per iteration. In related baseline [24, 12] and redundancy-based methods [16, 17], each worker is responsible for a single such computation. Experimentally, we have observed that Aspis needs up to $5\times$ the overall training time of other schemes to complete the same number of training epochs. We emphasize that the training time incurred by each scheme depends on a wide range of parameters, including the utilized defense, the batch size, and the number of iterations, and can vary significantly. Our implementation supports GPUs, and we used NVIDIA CUDA [47] for some experiments to alleviate a significant part of the overhead; however, a detailed time cost analysis is not an objective of our current work. Communication-related algorithmic improvements are also worth exploring. Finally, our implementation natively supports resuming from a checkpoint (trained model); hence, when new data becomes available, we can use only that data to perform more training epochs.
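To give a sense of the per-worker load $l$, the short sketch below evaluates it for the Aspis configurations used in our experiments. Two assumptions are made here and are not restated in this section: the total number of files is taken to be $f=\binom{K}{r}$ (one file per $r$-subset of workers), and the redundancy is taken to be $r=3$; both are consistent with the samples-per-file counts reported above. The function name is hypothetical.

```python
from math import comb


def aspis_load(K: int, r: int, b: int):
    """Return (f, l, samples_per_file) for K workers, redundancy r, batch b.

    f = C(K, r) is an assumption (one file per r-subset of workers);
    l = C(K-1, r-1) is the number of files assigned to each worker.
    """
    f = comb(K, r)
    l = comb(K - 1, r - 1)
    return f, l, b // f


# K = 15, r = 3, b = 1365: 455 files, 91 files per worker, 3 samples per file.
print(aspis_load(15, 3, 1365))
```

Under these assumptions, each of the $K=15$ workers computes $91$ gradients per iteration, compared to a single gradient in the baseline schemes.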

X-B5 Software

Our implementation of the Aspis and Aspis+ algorithms used for the experiments builds on ByzShield’s [14] PyTorch skeleton and is available, along with dependency information and instructions, at https://github.com/kkonstantinidis/Aspis. The implementation of ByzShield is available at [48] and uses the standard GitHub license. We utilized the NetworkX package [49] for the clique-finding; its license is 3-clause BSD. The CIFAR-10 data set [32] comes with the MIT license; we have cited its technical report, as required.

References

  • [1] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, December 2019, pp. 8024–8035.
  • [2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), November 2016, pp. 265–283.
  • [3] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” December 2015. [Online]. Available: https://arxiv.org/abs/1512.01274
  • [4] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, p. 2135.
  • [5] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors,” in Proceeding of the 41st Annual International Symposium on Computer Architecture, June 2014, pp. 361–372.
  • [6] A. S. Rakin, Z. He, and D. Fan, “Bit-flip attack: Crushing neural network with progressive bit search,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 2019, pp. 1211–1220.
  • [7] N. Gupta and N. H. Vaidya, “Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), September 2019, pp. 415–420.
  • [8] G. Damaskinos, E. M. El Mhamdi, R. Guerraoui, A. H. A. Guirguis, and S. L. A. Rouault, “Aggregathor: Byzantine machine learning via robust gradient aggregation,” in Conference on Systems and Machine Learning (SysML) 2019, March 2019, p. 19.
  • [9] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Defending against saddle point attack in Byzantine-robust distributed learning,” in Proceedings of the 36th International Conference on Machine Learning, June 2019, pp. 7074–7084.
  • [10] ——, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 5650–5659.
  • [11] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine-tolerant SGD,” March 2018. [Online]. Available: https://arxiv.org/abs/1802.10116
  • [12] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Advances in Neural Information Processing Systems, December 2017, pp. 119–129.
  • [13] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proc. ACM Meas. Anal. Comput. Syst., vol. 1, no. 2, December 2017.
  • [14] K. Konstantinidis and A. Ramamoorthy, “ByzShield: An efficient and robust system for distributed training,” in Machine Learning and Systems 3 (MLSys 2021), April 2021, pp. 812–828.
  • [15] Q. Yu, S. Li, N. Raviv, S. M. M. Kalan, M. Soltanolkotabi, and S. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security and privacy,” April 2019. [Online]. Available: https://arxiv.org/abs/1806.00939
  • [16] S. Rajput, H. Wang, Z. Charles, and D. Papailiopoulos, “DETOX: A redundancy-based framework for faster and more robust gradient aggregation,” in Advances in Neural Information Processing Systems, December 2019, pp. 10 320–10 330.
  • [17] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient distributed training via redundant gradients,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 903–912.
  • [18] D. Data, L. Song, and S. Diggavi, “Data encoding for Byzantine-resilient distributed gradient descent,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), October 2018, pp. 863–870.
  • [19] J. Regatti, H. Chen, and A. Gupta, “ByGARS: Byzantine SGD with arbitrary number of attackers,” December 2020. [Online]. Available: https://arxiv.org/abs/2006.13421
  • [20] C. Xie, S. Koyejo, and I. Gupta, “Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance,” in Proceedings of the 36th International Conference on Machine Learning, June 2019, pp. 6893–6901.
  • [21] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine stochastic gradient descent,” in Advances in Neural Information Processing Systems, December 2018.
  • [22] K. Konstantinidis and A. Ramamoorthy, “Aspis: Robust detection for distributed learning,” in IEEE International Symposium on Information Theory (ISIT), June 2022, pp. 2058–2063.
  • [23] G. Baruch, M. Baruch, and Y. Goldberg, “A Little Is Enough: Circumventing defenses for distributed learning,” in Advances in Neural Information Processing Systems, December 2019, pp. 8635–8645.
  • [24] E. M. El Mhamdi, R. Guerraoui, and S. Rouault, “The hidden vulnerability of distributed learning in Byzantium,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 3521–3530.
  • [25] S. Shen, S. Tople, and P. Saxena, “Auror: Defending against poisoning attacks in collaborative deep learning systems,” in Proceedings of the 32nd Annual Conference on Computer Security Applications, December 2016, pp. 508–519.
  • [26] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar, “signSGD with majority vote is communication efficient and fault tolerant,” February 2019. [Online]. Available: https://arxiv.org/abs/1810.05291
  • [27] J. So, B. Güler, and A. S. Avestimehr, “Byzantine-resilient secure federated learning,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2168–2181, July 2021.
  • [28] R. Jin, Y. Huang, X. He, H. Dai, and T. Wu, “Stochastic-sign SGD for federated learning with theoretical guarantees,” September 2021. [Online]. Available: https://arxiv.org/abs/2002.10940
  • [29] N. Raviv, R. Tandon, A. Dimakis, and I. Tamo, “Gradient coding from cyclic MDS codes and expander graphs,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 4302–4310.
  • [30] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, August 2017, pp. 3368–3376.
  • [31] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, May 2018.
  • [32] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
  • [33] D. R. Stinson, Combinatorial Designs: Constructions and Analysis.   New York: Springer, 2004.
  • [34] C. Xie, O. Koyejo, and I. Gupta, “Fall of Empires: Breaking byzantine-tolerant sgd by inner product manipulation,” in 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019, July 2019, pp. 6893–6901.
  • [35] R. M. Karp, Reducibility among Combinatorial Problems.   Boston, MA: Springer US, 1972.
  • [36] F. Cazals and C. Karande, “A note on the problem of reporting maximal cliques,” Theoretical Computer Science, vol. 407, no. 1, pp. 564–568, November 2008.
  • [37] T. Etsuji, T. Akira, and T. Haruhisa, “The worst-case time complexity for generating all maximal cliques and computational experiments,” Theoretical Computer Science, vol. 363, no. 1, pp. 28–42, October 2006.
  • [38] J. Robson, “Algorithms for maximum independent sets,” Journal of Algorithms, vol. 7, no. 3, pp. 425–440, September 1986.
  • [39] R. E. Tarjan and A. E. Trojanowski, “Finding a maximum independent set,” SIAM Journal on Computing, vol. 6, no. 3, pp. 537–546, 1977.
  • [40] D. Masters and C. Luschi, “Revisiting small batch training for deep neural networks,” April 2018. [Online]. Available: https://arxiv.org/abs/1804.07612
  • [41] Y. You, I. Gitman, and B. Ginsburg, “Large batch training of convolutional networks,” September 2017. [Online]. Available: https://arxiv.org/abs/1708.03888
  • [42] J. Geiping, M. Goldblum, P. E. Pope, M. Moeller, and T. Goldstein, “Stochastic training is not necessary for generalization,” April 2022. [Online]. Available: https://arxiv.org/abs/2109.14119
  • [43] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science.   Cambridge University Press, 2018.
  • [44] K. Pillutla, S. M. Kakade, and Z. Harchaoui, “Robust aggregation for federated learning,” IEEE Transactions on Signal Processing, vol. 70, pp. 1142–1154, February 2022.
  • [45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
  • [46] S. Minsker, “Geometric median and robust estimation in Banach spaces,” Bernoulli, vol. 21, no. 4, pp. 2308–2335, November 2015.
  • [47] “NVIDIA CUDA toolkit,” September 2022. [Online]. Available: https://developer.nvidia.com/cuda-toolkit
  • [48] “Repository of ByzShield implementation,” August 2022. [Online]. Available: https://github.com/kkonstantinidis/ByzShield
  • [49] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure, dynamics, and function using networkx,” in Proceedings of the 7th Python in Science Conference, August 2008, pp. 11–15.
  • [50] R. S. Boyer and J. S. Moore, MJRTY—A Fast Majority Vote Algorithm.   Dordrecht: Springer Netherlands, 1991.

XI Supplement

XI-A Asymptotic Complexity

If the gradient computation has linear complexity (assuming $\mathcal{O}(1)$ cost for the gradient computation with respect to one model parameter) and since each worker is assigned to $l$ files of $b/f$ samples each, the gradient computation cost at the worker level is $\mathcal{O}((lb/f)d)$ ($K$ such computations in parallel). In our schemes, however, $b$ is a constant multiple of $f$, and in general $r<l$ ($l=\binom{K-1}{r-1}$ for Aspis while $r=3$ is a typical redundancy value used in literature as well as in Aspis+); hence, the complexity becomes $\mathcal{O}(ld)$ which is similar to other redundancy-based schemes in [14, 16, 17]. For Aspis, as there are $\binom{K}{2}$ pairs of workers, each sharing $\binom{K-2}{r-2}$ files, the complexity to determine their agreements and form the graph is $\mathcal{O}(\binom{K}{2}\binom{K-2}{r-2})$. The clique-finding problem that follows as part of Aspis detection is NP-complete. However, our experimental evidence suggests that for the kind of graphs we construct, this computation takes an infinitesimal fraction of the execution time. The NetworkX package [49], which we use for enumerating all maximal cliques, is based on the algorithm of [37] and has asymptotic complexity $\mathcal{O}(3^{K/3})$. We provide extensive simulations of the graph construction and clique enumeration time under the Aspis file assignment for $K=50,r=5$ and $K=200,r=3$ (cf. Supplement Tables VII(a), VII(b), VIII(a), and VIII(b) for the weak (ATT-1) and optimal (ATT-2) attack as introduced in Sections IV and V). We emphasize that this value of $K$ exceeds by far the typical values of $K$ of prior work, and the number of servers would suffice for most challenging training tasks. Even in this case, the cost of enumerating all cliques is negligible. For this experiment, we used an EC2 instance of type i3.16xlarge. The complexity of robust aggregation varies significantly depending on the operator. 
For example, majority voting can be done in time that scales linearly with the number of votes using the MJRTY algorithm proposed in [50]. In our case, this is $\mathcal{O}(Kd)$ as the PS needs to use the $d$-dimensional input from all $K$ machines. Krum [12], Multi-Krum [12] and Bulyan [24] are applied to all $K$ workers by default and require $\mathcal{O}(K^{2}(d+\log K))$ time.
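The linear-time majority vote mentioned above can be sketched as follows. This is a minimal illustration of the MJRTY idea and not the code of our implementation; the function names are hypothetical, and votes are represented as tuples that are compared for exact equality.

```python
def mjrty(votes):
    """Boyer-Moore majority vote: single pass, constant extra memory.

    Returns the only value that can be a strict majority. If no strict
    majority exists, the returned candidate is arbitrary.
    """
    candidate, count = None, 0
    for v in votes:
        if count == 0:
            candidate, count = v, 1
        elif v == candidate:
            count += 1
        else:
            count -= 1
    return candidate


def majority(votes):
    """Return the strict-majority vote, or None when there is none."""
    votes = list(votes)
    if not votes:
        return None
    candidate = mjrty(votes)
    # Second pass: confirm that the candidate really holds a majority.
    occurrences = sum(v == candidate for v in votes)
    return candidate if 2 * occurrences > len(votes) else None
```

Each comparison of two $d$-dimensional gradients costs $\mathcal{O}(d)$, so the total cost grows linearly with the number of votes.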

XI-B Floating-Point Precision and Gradient Equality Check

A gradient equality check is needed to determine whether two gradient vectors, e.g., $\mathbf{a}$ and $\mathbf{b}$, are equal for our majority voting procedure to work. This check can be performed on an element-by-element basis or using the norm of the difference. There are two distinct cases we have considered:

  • Case 1: Execution on CPUs: If we use the CPUs of the workers to compute the gradients, we have observed that two “honest” gradients, $\mathbf{a}$ and $\mathbf{b}$, will always be exactly equal to each other element-wise. In this case, we use the numpy.array_equal function for all equality checks. If one of $\mathbf{a}$, $\mathbf{b}$ is corrupted and the other one is “honest,” the program will effectively flag this as a disagreement between the corresponding workers.

  • Case 2: Execution on GPUs: Most deep learning libraries [2, 1] provide non-deterministic back-propagation for the sake of faster and more efficient computations. In our implementation, we use NVIDIA CUDA [47]; hence, two “honest” float (e.g., numpy.float32) gradients $\mathbf{a}$ and $\mathbf{b}$ computed by two different GPUs will not be exactly equal to each other. However, the floating-point precision errors were less than $10^{-6}$ in all of our experiments. In this case, we decide that the two workers agree with each other if the following criterion is satisfied for a small tolerance value of $10^{-5}$:

    $$\frac{\lVert\mathbf{a}-\mathbf{b}\rVert_{2}}{\max\{\lVert\mathbf{a}\rVert_{2},\lVert\mathbf{b}\rVert_{2}\}}\leq 10^{-5}.$$

    On the other hand, if one of $\mathbf{a}$, $\mathbf{b}$ is distorted even by the most sophisticated inner manipulation attack ALIE [23], then $\frac{\lVert\mathbf{a}-\mathbf{b}\rVert_{2}}{\max\{\lVert\mathbf{a}\rVert_{2},\lVert\mathbf{b}\rVert_{2}\}}$ is at least five orders of magnitude larger and typically ranges in $[1,100]$.

In both cases, we have an integrity check in place to throw an exception if two “honest” gradients for the same task violate this criterion. We have not observed any violation of this in any of our exhaustive experiments.
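The relative-difference criterion used in the GPU case can be sketched as below. This is a standard-library illustration rather than the code of our implementation (which operates on NumPy arrays); the function names are hypothetical, and treating two all-zero gradients as equal is an added assumption.

```python
import math


def l2_norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))


def gradients_agree(a, b, tol=1e-5):
    """Check whether ||a - b||_2 / max(||a||_2, ||b||_2) <= tol."""
    if len(a) != len(b):
        raise ValueError("gradients must have the same dimension")
    scale = max(l2_norm(a), l2_norm(b))
    if scale == 0.0:
        # Both vectors are all-zero (assumption: counted as agreement).
        return True
    diff = l2_norm([x - y for x, y in zip(a, b)])
    return diff / scale <= tol
```

Normalizing by the larger of the two norms makes the test insensitive to the overall gradient scale, so a single tolerance can be used throughout training.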

Input: Loss vectors $\mathbf{l}_{i}$, $i=1,2,\dots,c$, maximum iterations $T$.
1 Set $\mathbf{l}_{\mathrm{MC}}$ to be an empty vector.
2 for $t=1$ to $T$ do
3       Let $\mathbf{v}_{t}$ (of length $v$) be the vector with the losses of all $v$ simulations that ran for at least $t$ iterations, i.e., $\mathbf{v}_{t}=\begin{bmatrix}\mathbf{l}_{1,t},\mathbf{l}_{2,t},\dots,\mathbf{l}_{v,t}\end{bmatrix}$. Append $\frac{\sum_{i}\mathbf{v}_{t,i}}{v}$ to $\mathbf{l}_{\mathrm{MC}}$.
4 end for
Return $\mathbf{l}_{\mathrm{MC}}$.
Algorithm 4 Average Monte Carlo loss across the simulations.

XI-C Note on Computation of Average Monte Carlo Loss

As discussed in Section VI-A, we ran 100 Monte Carlo simulations for each scheme and value of $q$. Among the 100 simulations of a given experiment, we kept only those that converged to a final empirical loss less than $0.1$ (cf. Section VI-A). In order to report the average loss of the converged simulations in Figures 4(a), 4(b), 5(a), 5(b), and 6, we need to take into account the fact that each of them may have run for a different number of iterations until convergence. To that end, let us collect the loss for each Monte Carlo simulation in a vector $\mathbf{l}_{i}$, $i=1,2,\dots,c$, where $c$ is the number of converged simulations. The vectors $\mathbf{l}_{i}$, $i=1,2,\dots,c$, do not necessarily have the same length; we used Algorithm 4 to compute and report the average loss of these vectors.
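A direct transcription of Algorithm 4 could look as follows. The function name is hypothetical, and stopping as soon as no simulation reaches iteration $t$ is an assumption, since the algorithm does not specify that case.

```python
def average_mc_loss(loss_vectors, max_iters):
    """Average the losses of runs of unequal length, iteration by iteration.

    `loss_vectors` holds one list of per-iteration losses per converged
    simulation. Entry t of the result averages only over the simulations
    that ran for more than t iterations (0-based indexing).
    """
    l_mc = []
    for t in range(max_iters):
        v_t = [losses[t] for losses in loss_vectors if len(losses) > t]
        if not v_t:
            # No simulation ran this long (assumption: stop early).
            break
        l_mc.append(sum(v_t) / len(v_t))
    return l_mc


# Two runs of lengths 3 and 2: the last point averages a single run.
print(average_mc_loss([[4.0, 2.0, 1.0], [2.0, 1.0]], max_iters=5))
```

The number of runs contributing to each average shrinks as $t$ grows, so the tail of the reported curve is based on fewer simulations than its head.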

TABLE VI: Parameters used for training.
Figure | Schemes | Learning rate schedule
7(a) | 1,2,5,6 | $(0.01,0.7)$
7(a) | 3,4 | $(0.1,0.95)$
7(b) | 1 | $(0.001,0.95)$
7(c) | 1,2 | $(0.01,0.7)$
8(a) | 1,2 | $(0.1,0.7)$
8(a) | 3,4 | $(0.1,0.95)$
8(a) | 5,6 | $(0.01,0.7)$
8(b) | 1 | $(0.1,0.7)$
8(c) | 1,2 | $(0.01,0.975)$
11 | 1,2 | $(0.1,0.7)$
11 | 3,4 | $(0.1,0.95)$
11 | 5,6 | $(0.01,0.95)$
11 | 1 | $(0.1,0.95)$
11 | 2 | $(0.01,0.7)$
11 | 1 | $(0.01,0.7)$
11 | 2 | $(0.1,0.95)$
11 | 3 | $(0.01,0.7)$
12(a) | 1,2 | $(0.01,0.7)$
12(a) | 3 | $(0.1,0.95)$
12(b) | 2 | $(0.01,0.7)$
12(c) | 2 | $(0.01,0.95)$
13 | 1,2 | $(0.01,0.7)$
13 | 3,4 | $(0.01,0.975)$
13 | 5,6 | $(0.1,0.975)$
14 | 1,2,3,4 | $(0.0003,0.7)$
14 | 5,6 | $(0.3,0.975)$
TABLE VII: Graph construction (in seconds) and clique enumeration (in milliseconds) time in Aspis graph of $K=50$ vertices and redundancy $r=5$.
VII(a) Adversaries carry out weak attack ATT-1.
$q$ | Graph construction (s) | Clique finding (ms)
5 | 15 | 1
10 | 15 | $<1$
15 | 14 | 1
20 | 14 | 1
VII(b) Adversaries carry out optimal attack ATT-2.
$q$ | Graph construction (s) | Clique finding (ms)
5 | 16 | 2
10 | 14 | 1
15 | 14 | 1
20 | 14 | 1
TABLE VIII: Graph construction (in seconds) and clique enumeration (in milliseconds) time in Aspis graph of $K=100$ vertices and redundancy $r=3$.
VIII(a) Adversaries carry out weak attack ATT-1.
$q$ | Graph construction (s) | Clique finding (ms)
5 | 4 | 51
10 | 4 | 46
15 | 4 | 43
20 | 4 | 40
VIII(b) Adversaries carry out optimal attack ATT-2.
$q$ | Graph construction (s) | Clique finding (ms)
5 | 4 | 55
10 | 4 | 54
15 | 4 | 55
20 | 4 | 55
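The two operations timed in Tables VII and VIII, namely building the agreement graph and enumerating its maximal cliques, can be illustrated with the plain-Python sketch below. It is not the code used for the measurements, which rely on NetworkX; here a Bron-Kerbosch enumeration with pivoting is swapped in so that the example is self-contained. The function names, the input format (a list of disagreeing worker pairs), and the selection of the largest clique are assumptions made for illustration.

```python
from itertools import combinations


def agreement_graph(K, disagreements):
    """Adjacency sets over workers 0..K-1.

    Two workers are adjacent unless they were found to disagree on at
    least one common file (pairs listed in `disagreements`).
    """
    bad = {frozenset(pair) for pair in disagreements}
    adj = {i: set() for i in range(K)}
    for i, j in combinations(range(K), 2):
        if frozenset((i, j)) not in bad:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""
    def expand(R, P, X):
        if not P and not X:
            yield set(R)
            return
        # Pivot with the most neighbors in P to prune the branching.
        pivot = max(P | X, key=lambda u: len(P & adj[u]))
        for v in list(P - adj[pivot]):
            yield from expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    yield from expand(set(), set(adj), set())


def maximum_clique(adj):
    """Largest maximal clique (ties broken arbitrarily)."""
    return max(maximal_cliques(adj), key=len)
```

In a toy instance with $K=5$ where workers 3 and 4 disagree with workers 0, 1, and 2 but agree with each other, the maximal cliques are $\{0,1,2\}$ and $\{3,4\}$, and the largest one contains workers 0, 1, and 2.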
Figure 17: ALIE optimal attack and Multi-Krum-based defenses (CIFAR-10), $K=15$ with different random seeds, ATT-2 (Aspis). Panels: (a) $q=2$ adversaries; (b) $q=4$ adversaries.