
Basil: A Fast and Byzantine-Resilient Approach for Decentralized Training

Ahmed Roushdy Elkordy, Saurav Prakash, Salman Avestimehr. The authors are with the Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA (e-mail: aelkordy@usc.edu, sauravpr@usc.edu, avestimehr@ee.usc.edu).
Abstract

Decentralized (i.e., serverless) training across edge nodes can suffer substantially from potential Byzantine nodes that can degrade the training performance. However, detection and mitigation of Byzantine behaviors in a decentralized learning setting is a daunting task, especially when the data distribution at the users is heterogeneous. As our main contribution, we propose Basil, a fast and computationally efficient Byzantine-robust algorithm for decentralized training systems, which leverages a novel sequential, memory-assisted and performance-based criterion for training over a logical ring while filtering the Byzantine users. In the IID dataset setting, we provide the theoretical convergence guarantees of Basil, demonstrating its linear convergence rate. Furthermore, for the IID setting, we experimentally demonstrate that Basil is robust to various Byzantine attacks, including the strong Hidden attack, while providing up to {\sim}16\% higher absolute test accuracy over the state-of-the-art Byzantine-resilient decentralized learning approach. Additionally, we generalize Basil to the non-IID setting by proposing Anonymous Cyclic Data Sharing (ACDS), a technique that allows each node to anonymously share a random fraction of its local non-sensitive dataset (e.g., landmark images) with all other nodes. Finally, to reduce the overall latency of Basil resulting from its sequential implementation over the logical ring, we propose Basil+, which enables Byzantine-robust parallel training across groups of logical rings while retaining the performance gains of Basil due to sequential training within each group. Furthermore, we experimentally demonstrate the scalability gains of Basil+ through different sets of experiments.

Index Terms:
decentralized training, federated learning, Byzantine-robustness.

I Introduction

Thanks to the large amounts of data generated and held by edge devices, machine learning (ML) applications can achieve significant performance gains [1, 2]. However, privacy concerns and regulations [3] make it extremely difficult to pool clients’ datasets for a centralized ML training procedure. As a result, distributed machine learning methods are gaining a surge of recent interest. The key underlying goal in distributed machine learning at the edge is to learn a global model using the data stored across the edge devices.

Federated learning (FL) has emerged as a promising framework for distributed machine learning [4]. In FL, the training process is facilitated by a central server, while the task of training itself is federated among the clients. Specifically, each participating client trains a local model based on its own (private) training dataset and shares only the trained local model with the central entity, which appropriately aggregates the clients’ local models. While the existence of the parameter server in FL is advantageous for orchestrating the training process, it brings new security and efficiency drawbacks [5, 4]. Particularly, the parameter server in FL is a single point of failure, as the entire learning process could fail when the server crashes or gets hacked. Additionally, the parameter server can become a performance bottleneck itself due to the large number of mobile devices that need to be handled simultaneously.

Training using a decentralized setup is another approach for distributed machine learning that does not rely on a central coordinator (e.g., a parameter server), thus avoiding the aforementioned limitations of FL. Instead, it only requires on-device computations at the edge nodes and peer-to-peer communications. Many training algorithms have been proposed for this decentralized setup. In particular, a class of gossip-based algorithms over random graphs has been proposed, e.g., [6, 7, 8], in which all the nodes participate in each training round. During training, each node maintains a local model and communicates with others over a graph-based decentralized network. More specifically, every node updates its local model using its local dataset, as well as the models received from the nodes in its neighborhood. For example, a simple aggregation rule at each node is to average the locally updated model with the models from the neighboring nodes. Thus, each node performs both model training and model aggregation.

Although decentralized training provides many benefits, its decentralized nature makes it vulnerable to performance degradation due to system failures, malicious nodes, and data heterogeneity [4]. Specifically, one of the key challenges in decentralized training is the presence of different threats that can alter the learning process, such as software/hardware errors and adversarial attacks. Particularly, some clients can become faulty due to software bugs or hardware malfunctions, or can even get hacked during training, and thus send arbitrary or malicious values to other clients, severely degrading the overall convergence performance. Such faults, where client nodes arbitrarily deviate from the agreed-upon protocol, are called Byzantine faults [9]. To mitigate Byzantine nodes in a graph-based decentralized setup where nodes are randomly connected to each other, some Byzantine-robust optimization algorithms have been introduced recently, e.g., [10, 11]. In these algorithms, each node combines the set of models received from its neighbors by using robust aggregation rules, to ensure that the training is not impacted by the Byzantine nodes. However, to the best of our knowledge, none of these algorithms have considered the scenario where the data distribution at the nodes is heterogeneous. Data heterogeneity makes the detection of Byzantine nodes a daunting task, since it becomes unclear whether the model drift can be attributed to a Byzantine node or to the inherently heterogeneous nature of the data. Even in the absence of Byzantine nodes, data heterogeneity can degrade the convergence rate [4].

The limited computation and communication resources of edge devices (e.g., IoT devices) are another important consideration in the decentralized training setting. In fact, the resources at the edge devices are considered a critical bottleneck for performing distributed training of large models [4, 1]. In prior Byzantine-robust decentralized algorithms (e.g., [10, 11]), which are based on parallel training over a random graph, all nodes need to be always active and perform training during the entire training process. Therefore, they might not be suitable for resource-constrained edge devices, as the parallel nature of their training requires the devices to be perpetually active, which could drain their resources. In contrast to parallel training over a random graph, our work takes the view that sequential training over a logical ring is better suited for decentralized training in the resource-constrained edge setting. Specifically, sequential training over a logical ring allows each node to become active and perform model training only when it receives the model from its counterclockwise neighbor. Since nodes need not be active during the whole training time, the sequential training nature makes it suitable for IoT devices with limited computational resources.

I-A Contributions

To overcome the aforementioned limitations of prior graph-based Byzantine-robust algorithms, we propose Basil, an efficient Byzantine mitigation algorithm that achieves Byzantine robustness in a decentralized setting by leveraging sequential training over a logical ring. To highlight some of the benefits of Basil, Fig. 1(a) illustrates a sample result that demonstrates the performance gains and the cost reduction compared to UBAR, the state-of-the-art Byzantine-robust algorithm that leverages parallel training over a graph-based setting. We observe that Basil attains {\sim}16\% higher absolute accuracy than UBAR. Additionally, we note that while UBAR achieves its highest accuracy in {\sim}500 rounds, Basil achieves UBAR’s highest accuracy in just {\sim}100 rounds. This implies that, for achieving UBAR’s highest accuracy, each client in Basil uses 5\times less computation and communication resources than in UBAR, confirming the gain attained from the sequential training nature of Basil.

In the following, we further highlight the key aspects and performance gains of Basil:

  • In Basil, the defense technique to filter out Byzantine nodes is a performance-based strategy, wherein each node evaluates a received set of models from its counterclockwise neighbors by using its own local dataset to select the best candidate.

  • We theoretically show that Basil for convex loss functions in the IID data setting has a linear convergence rate with respect to the product of the number of benign nodes and the total number of training rounds over the ring. Thus, our theoretical result demonstrates scalable performance for Basil with respect to the number of nodes.

  • We empirically demonstrate the superior performance of Basil compared to UBAR, the state-of-the-art Byzantine-resilient decentralized learning algorithm over graphs, under different Byzantine attacks in the IID setting. Additionally, we study the performance of Basil and UBAR with respect to the wall-clock time in Appendix H, showing that the training time of Basil is comparable to that of UBAR.

  • To extend the superior benefits of Basil to the scenario where the data distribution is non-IID across devices, we propose Anonymous Cyclic Data Sharing (ACDS), to be applied on top of Basil. To the best of our knowledge, no prior decentralized Byzantine-robust algorithm has considered the scenario where the data distribution at the nodes is non-IID. ACDS allows each node to share a random fraction of its local non-sensitive dataset (e.g., landmark images captured during tours) with all other nodes, while guaranteeing anonymity of the node identity. As highlighted in Section I-B, there are multiple real-world use cases where anonymous data sharing is sufficient to meet the privacy concerns of the users.

  • We experimentally demonstrate that using ACDS with only 5\% data sharing on top of Basil provides resiliency to Byzantine behaviors, unlike UBAR which diverges in the non-IID setting (Fig. 1(c)).

  • As the number of participating nodes in Basil scales, the increase in the overall latency of sequential training over the logical ring topology may limit the practicality of implementing Basil. Therefore, we propose a parallel extension of Basil, named Basil+, that provides further scalability by enabling Byzantine-robust parallel training across groups of logical rings, while retaining the performance gains of Basil through sequential training within each group.

[Figure 1: three panels: (a) Gaussian Attack (IID setting); (b) No Attack (non-IID setting); (c) Gaussian Attack.]
Figure 1: A highlight of the performance benefits of Basil, compared with the state-of-the-art (UBAR) [10], for CIFAR10 under different settings: In Fig. 1(a), we can see the superior performance of Basil over UBAR with {\sim}16\% improvement in test accuracy under the Gaussian attack in the IID setting. Fig. 1(b) demonstrates that, in the non-IID setting, the test accuracy obtained by sequential training over the ring topology can be increased by up to {\sim}10\% in the absence of Byzantine nodes when each node shares only 5\% of its local data anonymously with other nodes. Fig. 1(c) shows that ACDS on top of Basil not only provides Byzantine robustness to the Gaussian attack in the non-IID setting, but also gives higher performance than UBAR in the IID setting. Furthermore, UBAR in the non-IID setting completely fails in the presence of this attack. For further details, please refer to Section VI.

I-B Related Works

Many Byzantine-robust strategies have been proposed recently for the distributed training setup (federated learning), where a central server orchestrates the training process [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. These Byzantine-robust optimization algorithms combine the gradients received from all workers using robust aggregation rules, to ensure that training is not impacted by malicious nodes. Some of these strategies [15, 16, 19, 17, 18] are distance-based, while others are based on performance-based criteria [12, 13, 14, 21]. The key idea in distance-based defense solutions is to filter out the updates that are far from the average of the updates from the benign nodes. It has been shown that distance-based solutions are vulnerable to the sophisticated Hidden attack proposed in [23]. In this attack, Byzantine nodes craft gradients that are malicious but indistinguishable from benign gradients in distance. On the other hand, performance-based filtering strategies rely on having some auxiliary dataset at the server to evaluate the model received from each node.

Compared to the large number of Byzantine-robust training algorithms for distributed training in the presence of a central server, there have been only a few recent works on Byzantine resiliency in the decentralized training setup with no central coordinator. In particular, to address Byzantine failures in a decentralized training setup over a random graph in the scenario where the data distribution at the nodes is IID, the authors in [11] propose a trimmed-mean distance-based approach called BRIDGE to mitigate Byzantine nodes. However, the authors in [10] demonstrate that BRIDGE is defeated by the Hidden attack proposed in [23]. To address the limitations of distance-based approaches in the decentralized setup, [10] proposes an algorithm called UBAR in which a combination of performance-based and distance-based stages is used to mitigate the Byzantine nodes, where the performance-based stage at a particular node leverages only its local dataset. As demonstrated numerically in [10], the combination of these two strategies allows UBAR to defeat the Byzantine attack proposed in [23]. However, UBAR is not suitable for training over resource-constrained edge devices, as the training is carried out in parallel and nodes remain active all the time. In contrast, Basil is a fast and computationally efficient Byzantine-robust algorithm, which leverages a novel sequential, memory-assisted and performance-based criterion for training over a logical ring while filtering the Byzantine users.

Data heterogeneity in the decentralized setting has been studied in some recent works (e.g., [24]) in the absence of Byzantine nodes. In particular, the authors of TornadoAggregate [24] propose to cluster users into groups using two algorithms, Group-BY-IID and CLUSTER, both of which rely on the EMD (earth mover’s distance), which can approximately model the learning divergence between the models, to complete the grouping. However, computing the EMD relies on having publicly shared data at each node, which can be collected similarly as in [25]. In particular, to improve training on non-IID data in federated learning, [25] proposed sharing small portions of the users’ data with the server. The parameter server pools the received subsets of data, thus creating a small subset of the data distributed at the clients, which is then globally shared between all the nodes to make the data distribution closer to IID. However, this data sharing approach is insecure in scenarios where users are fine with sharing some of their datasets with each other but want to keep their identities anonymous, i.e., data shares should not reveal who the data owners are.

There are multiple real-world use cases where anonymous data sharing is sufficient to address privacy concerns. For example, mobile users may be fine with sharing with others some of their own text data that does not contain any personal or sensitive information, as long as their personal identities remain anonymous. Another example is the sharing of some non-private data (such as landmark images) collected by a person with others. In this scenario, although the data itself is not private, revealing the identity of the users can potentially leak private information such as personal interests, location, or travel history. Our proposed ACDS strategy is suitable for such scenarios, as it guarantees that the identities of the owners of the shared data points are kept hidden.

As a final remark, we point out that for anonymous data sharing, [26] proposed an approach based on utilizing a secure sum operation along with anonymous ID assignment (AIDA). This involves computational operations at the nodes, such as polynomial evaluations and arithmetic operations like modular reductions. Thus, this algorithm may fail in the presence of Byzantine faults arising during these computations. In particular, computation errors or software bugs during the AIDA algorithm can lead to the failure of the anonymous ID assignment, and errors during the secure sum algorithm can distort the shared data.

II Problem Statement

We formally define the decentralized learning system in the presence of Byzantine faults.

II-A Decentralized System Model

We consider a decentralized learning setting in which a set \mathcal{N}=\{1,\dots,N\} of |\mathcal{N}|=N nodes collaboratively train a machine learning (ML) model \mathbf{x}\in\mathbb{R}^{d}, where d is the model size, based on all the training data samples \cup_{n\in\mathcal{N}}\mathcal{Z}_{n} that are generated and stored at these distributed nodes, where the size of each local dataset is |\mathcal{Z}_{i}|=D_{i} data points.

In this work, we are motivated by the edge IoT setting, where users want to collaboratively train an ML model in the absence of a centralized parameter server. The communication in this setting leverages the underlay communication fabric of the internet that connects any pair of nodes directly via overlay communication protocols. Specifically, we assume that there is no central parameter server, and consider the setup where the training process is carried out in a sequential fashion over a clockwise directed ring. Each node in this ring topology takes part in the training process when it receives the model from the previous counterclockwise node. In Section III-A, we propose a method by which nodes can consensually agree on a random ordering on a logical ring at the beginning of the training process, so that each node knows the logical ring positions of the other nodes. Therefore, without loss of generality, we assume for notational simplicity that the indices of nodes in the ring are arranged in ascending order starting from node 1. In this setup, each node can send its model update to any set of users in the network.

In this decentralized setting, an unknown \beta-proportion of nodes can be Byzantine, where \beta\in(0,1), meaning they can send arbitrary and possibly malicious results to the other nodes. We denote \mathcal{R} (with cardinality |\mathcal{R}|=r) and \mathcal{B} (with cardinality |\mathcal{B}|=b) as the sets of benign nodes and Byzantine nodes, respectively. Furthermore, Byzantine nodes are uniformly distributed over the ring due to the consensus-based random order agreement. Finally, we assume nodes can authenticate the source of a message, so no Byzantine node can forge its identity or create multiple fake ones [27].

II-B Model Training

Each node in the set \mathcal{R} of benign nodes uses its own dataset to collaboratively train a shared model by solving the following optimization problem in the presence of Byzantine nodes:

\mathbf{x}^{*}=\arg\min_{\mathbf{x}\in\mathbb{R}^{d}}\left[f(\mathbf{x}):=\frac{1}{r}\sum_{i=1}^{r}f_{i}(\mathbf{x})\right], (1)

where \mathbf{x} is the optimization variable, and f_{i}(\mathbf{x}) is the expected loss function of node i such that f_{i}(\mathbf{x})=\mathbb{E}_{\zeta_{i}\sim\mathcal{P}_{i}}[l_{i}(\mathbf{x},\zeta_{i})]. Here, l_{i}(\mathbf{x},\zeta_{i})\in\mathbb{R} denotes the loss function for model parameter \mathbf{x}\in\mathbb{R}^{d} for a given realization \zeta_{i}, which is generated from a distribution \mathcal{P}_{i}.

The general update rule in this decentralized setting is given as follows. At the k-th round, the current active node i updates the global model according to:

\mathbf{x}^{(i)}_{k}=\bar{\mathbf{x}}^{(i)}_{k}-\eta_{k}^{(i)}g_{i}(\bar{\mathbf{x}}^{(i)}_{k}), (2)

where \bar{\mathbf{x}}^{(i)}_{k}=\mathcal{A}(\mathbf{x}^{(j)}_{v},\,j\in\mathcal{N},\,v=1,\dots,k) is the model selected by node i according to the underlying aggregation rule \mathcal{A}, g_{i}(\bar{\mathbf{x}}^{(i)}_{k}) is the stochastic gradient computed by node i using a random sample from its local dataset \mathcal{Z}_{i}, and \eta_{k}^{(i)} is the learning rate used by node i in round k.

Threat model: A Byzantine node i\in\mathcal{B} could send a faulty or malicious update \mathbf{x}^{(i)}_{k}=*, where “*” denotes that \mathbf{x}^{(i)}_{k} can be an arbitrary vector in \mathbb{R}^{d}. Furthermore, Byzantine nodes cannot forge their identities or create multiple fake ones. This assumption has been used in different prior works (e.g., [27]).

Our goal is to design an algorithm for the decentralized training setup discussed earlier, while mitigating the impact of the Byzantine nodes. Towards achieving this goal, we propose Basil that is described next.

III The Proposed Basil Algorithm

Now, we describe Basil, our proposed approach for mitigating both malicious and faulty updates in the IID setting, where the local dataset \mathcal{Z}_{i} at node i consists of IID data samples from a distribution \mathcal{P}_{i}, with \mathcal{P}_{i}=\mathcal{P} for all i\in\mathcal{N}, and characterize the complexity of Basil. Note that, in Section IV, we extend Basil to the non-IID setting by integrating it with our proposed Anonymous Cyclic Data Sharing scheme.

III-A Basil for IID Setting

Our proposed Basil algorithm, given in Algorithm 1, consists of two phases: an initialization phase and a training phase, which are described below.
Phase 1: Order Agreement over the Logical Ring. Before the training starts, nodes consensually agree on their order on the ring by using the following simple steps. 1) All users first share their IDs with each other, and we assume WLOG that nodes’ IDs can be arranged in ascending order. 2) Each node locally generates the order permutation for the users’ IDs by using a pseudo random number generator (PRNG) initialized via a common seed (e.g., N). This ensures that all nodes will generate the same IDs’ order for the ring.
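
To make this concrete, the following Python sketch (an illustration of our own, not code from the paper; the helper name ring_order is ours) shows how every node can locally derive the same random ring order from the exchanged IDs and a common seed such as N:

import random

def ring_order(node_ids, common_seed):
    # Step 1: arrange the exchanged IDs canonically (ascending order).
    ordered = sorted(node_ids)
    # Step 2: shuffle with a PRNG seeded by a commonly known value (e.g., N),
    # so every node obtains exactly the same permutation without coordination.
    rng = random.Random(common_seed)
    rng.shuffle(ordered)
    return ordered  # position p in this list = position p on the logical ring

ids = [11, 42, 7, 23, 5, 30]
print(ring_order(ids, common_seed=len(ids)))  # identical output at every node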

Phase 2: Robust Training. As illustrated in Fig. 2, Basil leverages sequential training over the logical ring to mitigate the effect of Byzantine nodes. At a high level, in the k-th round, the current active node carries out the model update step in (2), and then multicasts its updated model to the next S=b+1 clockwise nodes, where b is the worst-case number of Byzantine nodes. We note that multicasting each model to the next b+1 neighbors is crucial to make sure that the benign subgraph, which is generated by excluding the Byzantine nodes, is connected. Connectivity of the benign subgraph is important as it ensures that each benign node can still receive information from a few other non-faulty nodes, i.e., the good updates can successfully propagate between the benign nodes. Even in the scenario where all Byzantine nodes come in a row, multicasting each updated model to the next S clockwise neighbors preserves the connectivity of the benign nodes.
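
The role of the connectivity parameter can also be checked with a short simulation. The sketch below (our own illustration, with an assumed helper name) verifies that with S = b + 1 every benign node keeps at least one benign node among its S counterclockwise neighbors even when all b Byzantine nodes are consecutive, whereas S = b can leave a benign node cut off:

def benign_subgraph_connected(N, byzantine, S):
    # Every benign node must have at least one benign node among its S
    # counterclockwise neighbors, so that good models keep propagating.
    benign = [i for i in range(N) if i not in byzantine]
    for i in benign:
        predecessors = [(i - s) % N for s in range(1, S + 1)]
        if not any(p not in byzantine for p in predecessors):
            return False
    return True

N, b = 10, 3
worst_case = set(range(4, 4 + b))  # b Byzantine nodes placed in a row
print(benign_subgraph_connected(N, worst_case, S=b + 1))  # True: S = b + 1 suffices
print(benign_subgraph_connected(N, worst_case, S=b))      # False: the first benign node after the run is isolated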

Figure 2: Basil with N=6 nodes, where node 3 and node 6 are Byzantine. Node 1, the current active benign node in the k-th round, selects the one model out of its 3 stored models that gives the lowest loss when evaluated on a mini-batch from its local dataset \mathcal{Z}_{1}. After that, node 1 updates the selected model using the same mini-batch according to (2) before multicasting it to the next 3 clockwise neighbors.

We now describe how the aggregation rule \mathcal{A}_{\text{Basil}} in Basil, which node i implements for obtaining the model \bar{\mathbf{x}}^{(i)}_{k} used in the update (2), works. Node i stores the S latest models received from its previous S counterclockwise neighbors. As highlighted above, the reason for storing S models is to make sure that each stored set of models at node i contains at least one good model. When node i is active, it implements our proposed performance-based criterion to pick the best model out of its S stored models. In the following, we formally define our model selection criterion:

Definition 1

(Basil Aggregation Rule) In the k-th round over the ring, let \mathcal{N}^{i}_{k}=\{\mathbf{y}_{1},\dots,\mathbf{y}_{S}\} be the set of the S latest models from the S counterclockwise neighbors of node i. Let \zeta_{i} be a random sample from the local dataset \mathcal{Z}_{i}, and let l_{i}(\mathbf{y}_{j})=l_{i}(\mathbf{y}_{j},\zeta_{i}) be the loss function of node i evaluated on the model \mathbf{y}_{j}\in\mathcal{N}^{i}_{k} using the random sample \zeta_{i}. The proposed Basil aggregation rule is defined as

\bar{\mathbf{x}}_{k}^{(i)}=\mathcal{A}_{\text{Basil}}(\mathcal{N}^{i}_{k})=\arg\min_{\mathbf{y}\in\mathcal{N}^{i}_{k}}\mathbb{E}\left[l_{i}(\mathbf{y},\zeta_{i})\right]. (3)

In practice, node i can sample a mini-batch from its dataset and leverage it as validation data to test the performance (i.e., the loss function value) of each of the S neighboring models, and set \bar{\mathbf{x}}^{(i)}_{k} to be the model with the lowest loss among the S stored models. As demonstrated in our experiments in Section VI, this practical mini-batch implementation of the Basil criterion in Definition 1 is sufficient to mitigate Byzantine nodes in the network, while achieving superior performance over the state-of-the-art.
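
As a concrete illustration of this mini-batch rule, the numpy sketch below (a toy example of our own with a linear model and squared loss, not the paper's implementation) selects the stored model with the lowest mini-batch loss and then applies the update in (2); an obviously faulty model is filtered out automatically.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic IID local data at node i: y = w_true . x + noise (toy linear model).
d, D = 5, 200
w_true = rng.normal(size=d)
X = rng.normal(size=(D, d))
y = X @ w_true + 0.1 * rng.normal(size=D)

def loss(w, Xb, yb):
    # Mean squared error of model w on a mini-batch (Xb, yb).
    return float(np.mean((Xb @ w - yb) ** 2))

def basil_step(stored_models, X, y, lr=0.05, batch=32):
    # One active-node step of Basil: evaluate the S stored models on a local
    # mini-batch (Definition 1), keep the best one, then apply the SGD update (2).
    idx = rng.choice(len(y), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    best = min(stored_models, key=lambda w: loss(w, Xb, yb))  # performance-based selection
    grad = 2 * Xb.T @ (Xb @ best - yb) / batch                # stochastic gradient at the selected model
    return best - lr * grad                                   # updated model, to be multicast next

# S = 3 stored models: two reasonable ones and one "Byzantine" (arbitrary) model.
stored = [rng.normal(size=d), w_true + 0.05 * rng.normal(size=d), 1e3 * rng.normal(size=d)]
print(loss(basil_step(stored, X, y), X, y))  # small value: the faulty model was filtered out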

Input: \mathcal{N} (nodes); S (connectivity); \{\mathcal{Z}_{i}\}_{i\in\mathcal{N}} (local datasets); \mathbf{x}^{0} (initial model); K (number of rounds)
Initialization:
for each node i\in\mathcal{N} do
    StoredModels[i].insert(\mathbf{x}^{0}) // the queue “StoredModels[i]” is used by node i to keep the last S inserted models, denoted by \mathcal{N}^{i}; it is initialized by inserting \mathbf{x}^{0}
end for
Order \leftarrow RandomOrderAgreement(\mathcal{N}) // users’ order generation for the logical ring topology according to Section III-A
Robust Training:
for each round k=1,\dots,K do
    for each node i=1,\dots,N in sequence do
        if node i\in\text{benign set }\mathcal{R} then
            \bar{\mathbf{x}}_{k}^{(i)}\leftarrow\mathcal{A}_{\text{Basil}}(\mathcal{N}^{i}) // Basil performance-based strategy to select one model from \mathcal{N}^{i} using (3)
            \mathbf{x}^{(i)}_{k}\leftarrow Update(\bar{\mathbf{x}}_{k}^{(i)},\mathcal{Z}_{i}) // model update using (2)
        else
            \mathbf{x}^{(i)}_{k}\leftarrow* // Byzantine node sends a faulty model
        end if
        multicast \mathbf{x}^{(i)}_{k} to the next S clockwise neighbors
        for neighbor s=1,\dots,S do
            StoredModels[(i+s)\mod N].insert(\mathbf{x}^{(i)}_{k}) // insert \mathbf{x}^{(i)}_{k} into the queue of the s-th neighbor of node i
        end for
    end for
end for
Return \{\mathbf{x}^{(i)}_{K}\}_{i\in\mathcal{N}}
Algorithm 1 Basil

In the following, we characterize the communication, computation, and storage costs of Basil.

Proposition 1. The communication, computation, and storage complexities of the Basil algorithm are all \mathcal{O}(Sd) for each node at each iteration, where d is the model size.

The complexity of the prior graph-based Byzantine-robust decentralized algorithms UBAR [10] and BRIDGE [11] is \mathcal{O}(H_{i}d), where H_{i} is the number of neighbors (i.e., the connectivity parameter) of node i in the graph. We therefore conclude that Basil maintains the same per-round complexity as BRIDGE and UBAR, but with higher performance, as we show in Section VI.

The costs in Proposition 1 can be reduced by relaxing the connectivity parameter S to S<b+1 while guaranteeing the success of Basil (benign subgraph connectivity) with high probability, as formally presented in the following.

Proposition 2. The number of models that each benign node needs to receive, store, and evaluate from its counterclockwise neighbors for ensuring the connectivity and success of Basil can be relaxed to S<b+1 while still guaranteeing the success of Basil (benign subgraph connectivity) with high probability. In this case, the failure probability of Basil is bounded as

\mathbb{P}(\text{Failure})\leq\frac{b!(N-S)!}{(b-S)!(N-1)!}, (4)

where N and b are the total number of nodes and the number of Byzantine nodes, respectively. The proofs of Propositions 1 and 2 are given in Appendix A.

Remark 1

In order to further illustrate the impact of choosing S on the probability of failure given in (4), we consider the following numerical examples. Let the total number of nodes in the system be N=100, where b=33 of them are Byzantine, and let the storage parameter be S=15. The failure probability in (4) turns out to be \sim 4\times 10^{-7}, which is negligible. For the case when S=10, the probability of failure becomes \sim 5.34\times 10^{-4}, which remains reasonably small.
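
The numbers in this remark follow directly from the bound in (4); a minimal sketch of our own that simply evaluates it:

from math import factorial

def basil_failure_bound(N, b, S):
    # Evaluate the right-hand side of (4): b!(N-S)! / ((b-S)!(N-1)!).
    return factorial(b) * factorial(N - S) / (factorial(b - S) * factorial(N - 1))

print(basil_failure_bound(N=100, b=33, S=15))  # ~4e-7, as quoted above
print(basil_failure_bound(N=100, b=33, S=10))  # ~5.3e-4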

III-B Theoretical Guarantees

We derive the convergence guarantees of Basil under the following standard assumptions.

Assumption 1 (IID data distribution). The local dataset \mathcal{Z}_{i} at node i consists of IID data samples from a distribution \mathcal{P}_{i}, where \mathcal{P}_{i}=\mathcal{P} for i\in\mathcal{R}. In other words, f_{i}(\mathbf{x})=\mathbb{E}_{\zeta_{i}\sim\mathcal{P}_{i}}[l(\mathbf{x},\zeta_{i})]=\mathbb{E}_{\zeta_{j}\sim\mathcal{P}_{j}}[l(\mathbf{x},\zeta_{j})]=f_{j}(\mathbf{x})\;\forall i,j\in\mathcal{R}. Hence, the global loss function is f(\mathbf{x})=\mathbb{E}_{\zeta_{i}\sim\mathcal{P}_{i}}[l(\mathbf{x},\zeta_{i})].

Assumption 2 (Bounded variance). The stochastic gradient g_{i}(\mathbf{x}) is unbiased and has bounded variance, i.e., \mathbb{E}_{\mathcal{P}_{i}}[g_{i}(\mathbf{x})]=\nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x}) and \mathbb{E}_{\mathcal{P}_{i}}||g_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{x})||^{2}\leq\sigma^{2}, where g_{i}(\mathbf{x}) is the stochastic gradient computed by node i using a random sample \zeta_{i} from its local dataset \mathcal{Z}_{i}.

Assumption 3 (Smoothness). The loss functions f_{i}'s are L-smooth and twice differentiable, i.e., for any \mathbf{x}\in\mathbb{R}^{d}, we have ||\nabla^{2}f_{i}(\mathbf{x})||_{2}\leq L.

Let b^{i} be the number of counterclockwise Byzantine neighbors of node i. We divide the set of stored models \mathcal{N}_{k}^{i} at node i in the k-th round into two sets. The first set \mathcal{G}^{i}_{k}=\{\mathbf{y}_{1},\dots,\mathbf{y}_{r^{i}}\} contains the benign models, where r^{i}=S-b^{i}. We consider scenarios with S=b+1, where b is the total number of Byzantine nodes in the network. Without loss of generality, we assume the models in this set are arranged such that the first model comes from the closest benign counterclockwise neighbor of node i, while the last model comes from the farthest one. Similarly, we define the second set \mathcal{B}_{k}^{i} to be the set of models from the counterclockwise Byzantine neighbors of node i, such that \mathcal{B}_{k}^{i}\cup\mathcal{G}_{k}^{i}=\mathcal{N}_{k}^{i}.

Theorem 1

When the learning rate \eta_{k}^{(i)} for node i\in\mathcal{R} in round k satisfies \eta_{k}^{(i)}\geq\frac{1}{L}, the expected loss function \mathbb{E}\left[l_{i}(\cdot)\right] of node i evaluated on the set of models in \mathcal{N}_{k}^{i} can be arranged as follows:

\mathbb{E}\left[l_{i}(\mathbf{y}_{1})\right]\leq\mathbb{E}\left[l_{i}(\mathbf{y}_{2})\right]\leq\dots\leq\mathbb{E}\left[l_{i}(\mathbf{y}_{r^{i}})\right]<\mathbb{E}\left[l_{i}(\mathbf{x})\right]\quad\forall\mathbf{x}\in\mathcal{B}_{k}^{i}, (5)

where \mathcal{G}^{i}_{k}=\{\mathbf{y}_{1},\dots,\mathbf{y}_{r^{i}}\} is the set of benign models stored at node i. Hence, the Basil aggregation rule in Definition 1 reduces to \bar{\mathbf{x}}_{k}^{(i)}=\mathcal{A}_{\text{Basil}}(\mathcal{N}^{i}_{k})=\mathbf{y}_{1}, and the model update step in (2) can be simplified as follows:

\mathbf{x}_{k}^{(i)}=\mathbf{y}_{1}-\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1}). (6)
Remark 2

For the Basil aggregation rule in Definition 1, (5) in Theorem 1 implies that for the convergence analysis we can consider only the benign subgraph, which is generated by removing the Byzantine nodes. As described in Section III-A, this benign subgraph is connected. Furthermore, due to (6) in Theorem 1, training via Basil reduces to sequential training over a logical ring with only the set \mathcal{R} of benign nodes and connectivity parameter S=1.
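
To make this reduction explicit, the per-step recursion that results (our restatement of (6), with a fixed learning rate \eta and the global step index s=rk+i used in Theorem 2 below) is

\mathbf{x}^{s}=\mathbf{x}^{s-1}-\eta\,g_{i}(\mathbf{x}^{s-1}),\qquad s=rk+i,\;i=1,\dots,r,\;k=0,\dots,K-1,

where \mathbf{x}^{s-1} is the model that the i-th benign node receives from its closest benign counterclockwise neighbor. This is the single SGD trajectory analyzed in Theorem 2.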

Leveraging the results in Theorem 1 and the discussion in Remark 2, we prove a linear convergence rate for Basil under the additional assumption that the loss functions are convex.

Theorem 2

Assume that f(\mathbf{x}) is convex. Under Assumptions 1-3 stated in this section, Basil with a fixed learning rate \eta=\frac{1}{L} at all users achieves linear convergence with a constant error as follows:

\mathbb{E}\left[f\left(\frac{1}{T}\sum_{s=1}^{T}\mathbf{x}^{s}\right)\right]-f(\mathbf{x}^{*})\leq\frac{||\mathbf{x}^{0}-\mathbf{x}^{*}||^{2}L}{2T}+\frac{1}{L}\sigma^{2}, (7)

where T=Kr, K is the total number of rounds over the ring, and r is the number of benign nodes. Here, \mathbf{x}^{s} represents the model after s update steps starting from the initial model \mathbf{x}^{0}, where s=rk+i with i=1,\dots,r and k=0,\dots,K-1. Furthermore, \mathbf{x}^{*} is the optimal solution of (1) and \sigma^{2} is defined in Assumption 2.

Remark 3

The error bound for Basil decreases as the total number of benign nodes r=(1-\beta)N increases, where \beta\in(0,1) is the proportion of Byzantine nodes.

To extend Basil to be robust against software/hardware faults in the non-IID setting, i.e., when the local dataset \mathcal{Z}_{i} at node i consists of data samples from a distribution \mathcal{P}_{i} with \mathcal{P}_{i}\neq\mathcal{P}_{j} for i,j\in\mathcal{N} and i\neq j, we present our Anonymous Cyclic Data Sharing (ACDS) algorithm in the following section.

IV Generalizing Basil to Non-IID Setting via Anonymous Cyclic Data Sharing

We propose Anonymous Cyclic Data Sharing (ACDS), an algorithm that can be integrated on top of Basil to guarantee robustness against software/hardware faults in the non-IID setting. This algorithm allows each node to anonymously share a fraction of its local non-sensitive dataset with the other nodes. In other words, ACDS guarantees that the identity of the owner of the shared data is kept hidden from all other nodes, assuming no collusion between nodes.

Figure 3: ACDS algorithm within group g with n=4 users, where each node i_{g}\in\mathcal{N}_{g} has two batches \{c^{1}_{i_{g}},c^{2}_{i_{g}}\}. Each round starts from node 1_{g} and continues in a clockwise order. The dummy round is introduced to make sure that node 2_{g} and node 3_{g} get their missing batches \{c^{2}_{3_{g}},c^{2}_{4_{g}}\} and c^{2}_{4_{g}}, respectively. Here, \widehat{c}_{i_{g}} represents the dummy batch, with the same size as the other batches, used by node i_{g}\in\mathcal{N}_{g}. This dummy batch could be a batch of public data that shares the same features used in the learning task.

IV-A ACDS Algorithm

The ACDS procedure has three phases which are formally described next. The overall algorithm has been summarized in Algorithm 2 and illustrated in Fig. 3.

Phase 1: Initialization. ACDS starts by first clustering the N nodes into G groups of rings, where the set of nodes in each group g\in[G] is denoted by \mathcal{N}_{g}=\{1_{g},\dots,n_{g}\}. Here, node 1_{g} is the starting node in ring g, and without loss of generality, we assume all groups have the same size n=\frac{N}{G}. Node i_{g}\in\mathcal{N}_{g}, for g\in[G], divides its dataset \mathcal{Z}_{i_{g}} into sensitive (\mathcal{Z}_{i_{g}}^{s}) and non-sensitive (\mathcal{Z}_{i_{g}}^{ns}) portions, which can be done during the data processing phase by each node. Then, for a given hyperparameter \alpha\in(0,1), each node selects \alpha D points from its local non-sensitive dataset at random, where |\mathcal{Z}_{i_{g}}|=D, and then partitions these data points into H batches denoted by \{c_{i_{g}}^{1},\dots,c_{i_{g}}^{H}\}, where each batch has M=\frac{\alpha D}{H} data points.

Input: \mathcal{N} (nodes); \{\mathcal{Z}_{i}\}_{i\in\mathcal{N}} (local datasets); \alpha (data fraction); H (number of batches); G (number of groups)
Phase 1: Initialization
\{\mathcal{N}_{g}\}_{g\in[G]}\leftarrow Clustering(\mathcal{N},G) // cluster the nodes \mathcal{N} into G groups, each of size n
for each node i_{g}\in\mathcal{N}_{g} in parallel do
    \mathcal{Z}_{i_{g}}^{s}\cup\mathcal{Z}_{i_{g}}^{ns}\leftarrow Partition(\mathcal{Z}_{i_{g}}) // partition the local data \mathcal{Z}_{i_{g}} into sensitive and non-sensitive parts
    \{c^{1}_{i_{g}},\dots,c^{H}_{i_{g}}\}\leftarrow RandomSelection(\mathcal{Z}_{i_{g}}^{ns},\alpha,H) // random selection of H batches, each of size \frac{\alpha D}{H}, from \mathcal{Z}_{i_{g}}^{ns}
    DStored[i_{g}] = list() // a list used by node i_{g} to store the data shared by the other nodes
end for
DShared[g] = list(), \forall g\in[G] // a list used to circulate the data within group g
Phase 2: Within Group Anonymous Data Sharing
for each group g=1,\dots,G in parallel do
    for each batch h=1,\dots,H do
        for each node i_{g}\in\mathcal{N}_{g} in sequence do
            DShared[g]; DStored[i_{g}]\leftarrow RobustShare(DShared[g]; DStored[i_{g}], g, i_{g}, h, c^{h-1}_{i_{g}}, c^{h}_{i_{g}})
            Send DShared[g] to the next clockwise neighbor
        end for
    end for
    // start the dummy round
    for each node i_{g}\in\mathcal{N}_{g}\backslash\{1_{g},n_{g}\} in sequence do
        DShared[g]; DStored[i_{g}]\leftarrow RobustShare(DShared[g]; DStored[i_{g}], g, i_{g}, H, c^{H}_{i_{g}}, \widehat{c_{i_{g}}})
    end for
end for
Function RobustShare(DShared[g]; DStored[i_{g}], g, i_{g}, h, c^{h-1}_{i_{g}}, c^{h}_{i_{g}}):
    if h>1 then
        DShared[g].remove(c^{h-1}_{i_{g}}) // remove the batch c^{h-1}_{i_{g}} from “DShared[g]”
    end if
    DStored[i_{g}].add(DShared[g]) // copy the data in “DShared[g]” to “DStored[i_{g}]”
    DShared[g].add(c^{h}_{i_{g}}) // add the h-th batch c^{h}_{i_{g}} that will be shared with the other nodes to “DShared[g]”
    DShared[g].shuffle() // shuffle the data in the list “DShared[g]”
    return DShared[g]; DStored[i_{g}]
Phase 3: Global Sharing
for g=1,\dots,G in parallel do
    node 1_{g} multicasts DStored[1_{g}]\bigcup\{c^{1}_{1_{g}},\dots,c^{H}_{1_{g}}\} to all nodes in \bigcup_{g^{\prime}\in[G]\backslash\{g\}}\mathcal{N}_{g^{\prime}}
end for
Algorithm 2 ACDS

Phase 2: Within Group Anonymous Data Sharing. In this phase, for g\in[G], the set of nodes \mathcal{N}_{g} in group g anonymously share their data batches \{c^{j}_{1_{g}},\dots,c^{j}_{n_{g}}\}_{j\in[H]} with each other. The anonymous data sharing within group g takes H+1 rounds. In particular, as shown in Fig. 3, within group g and during the first round h=1, node 1_{g} sends its first batch c^{1}_{1_{g}} to its clockwise neighbor, node 2_{g}. Node 2_{g} then stores the received batch. After that, node 2_{g} augments the received batch with its own first batch c^{1}_{2_{g}} and shuffles them together before sending them to node 3_{g}. More generally, in this intra-group cyclic sharing over the ring, node i_{g} stores the received set of shuffled data points from the batches \{c^{1}_{1_{g}},\dots,c^{1}_{(i-1)_{g}}\} of its counterclockwise neighbor, node (i-1)_{g}. Then, it adds its own batch c^{1}_{i_{g}} to the received set of data points and shuffles them together, before sending the combined and shuffled dataset to node (i+1)_{g}.

For round 1<h\leq H, as shown in Fig. 3 (round 2), node 1_{g} has the data points from the set of batches \{c^{h-1}_{1_{g}},\dots,c^{h-1}_{n_{g}}\}, which were received from node n_{g} at the end of round h-1. It first removes its old batch of data points c^{h-1}_{1_{g}} and then stores the remaining set of data points. After that, it adds its h-th batch c^{h}_{1_{g}} to this remaining set and shuffles the entire set of data points in the new set of batches \{c^{h}_{1_{g}},c^{h-1}_{2_{g}},\dots,c^{h-1}_{n_{g}}\}, before sending them to node 2_{g}. More generally, in the h-th round for 1<h\leq H, node i_{g} first removes its batch c^{h-1}_{i_{g}} from the received set of batches and then stores the set of remaining data points. Thereafter, node i_{g} adds its batch c^{h}_{i_{g}} to the set \{c^{h}_{1_{g}},\dots,c^{h}_{(i-1)_{g}},c^{h-1}_{i_{g}},\dots,c^{h-1}_{n_{g}}\}\backslash\{c^{h-1}_{i_{g}}\}, and then shuffles the data points in the new set of batches \{c^{h}_{1_{g}},\dots,c^{h}_{i_{g}},c^{h-1}_{(i+1)_{g}},\dots,c^{h-1}_{n_{g}}\} before sending them to node (i+1)_{g}.

After the H intra-group communication rounds within each group described above, all nodes within each group have completely shared their first H-1 batches with each other. However, due to the sequential nature of the unicast communications, some nodes are still missing the batches shared by some clients in the H-th round. For instance, in Fig. 3, after the completion of the second round, node 2_{g} is missing the last batches c^{2}_{3_{g}} and c^{2}_{4_{g}}. Therefore, we propose a final dummy round, in which we repeat the same procedure adopted in rounds 1<h\leq H, but with the following slight modification: node i_{g} replaces its batch c^{H}_{i_{g}} with a dummy batch \widehat{c}_{i_{g}}. This dummy batch could be a batch of public data points that share the same feature space used in the learning task. This completes the anonymous cyclic data sharing within group g\in[G].
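
The following Python sketch (a simplified simulation of our own, not part of ACDS itself; for simplicity it runs the dummy round over all n nodes and only mimics the shuffling) illustrates the within-group circulation of batches, including the removal of a node's previous batch and the final dummy round:

import random

def acds_within_group(batches, dummy, seed=0):
    # batches[i][h]: the h-th batch of node i (a list of data points);
    # dummy[i]: node i's dummy batch used in the final round.
    rng = random.Random(seed)
    n, H = len(batches), len(batches[0])
    stored = [set() for _ in range(n)]   # data collected by each node
    shared = []                          # the list circulated clockwise around the ring

    for h in range(H + 1):               # rounds 1..H plus the final dummy round
        for i in range(n):
            if h > 0:                    # from round 2 on: drop own batch from the previous round
                prev = batches[i][h - 1]
                shared = [p for p in shared if p not in prev]
            stored[i].update(shared)     # keep a copy of everything currently circulating
            own = batches[i][h] if h < H else dummy[i]
            shared = shared + list(own)
            rng.shuffle(shared)          # shuffle before passing to the clockwise neighbor
    return stored

# Tiny example: n = 3 nodes, H = 2 batches of 2 points each.
batches = [[[f"node{i}-batch{h}-pt{p}" for p in range(2)] for h in range(2)] for i in range(3)]
dummy = [[f"node{i}-dummy"] for i in range(3)]
stored = acds_within_group(batches, dummy)
print(sorted(stored[0]))  # node 0 ends up with all batches of nodes 1 and 2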

Phase 3: Global Sharing. For each g\in[G], node 1_{g} shares the dataset \{c^{j}_{1_{g}},\dots,c^{j}_{n_{g}}\}_{j\in[H]}, which it has gathered in Phase 2, with all nodes in the other groups.

We note that ACDS only needs to be run once before training. As we demonstrate later in Section VI, this one-time overhead dramatically improves convergence performance when data is non-IID. In the following proposition, we characterize the communication cost and time of ACDS.

Proposition 3 (Communication cost/time of ACDS). Consider the ACDS algorithm with a set of N nodes divided equally into G groups with n nodes in each group. Each node i\in[N] has H batches, each of size M=\frac{\alpha D}{H} data points, where \alpha D is the amount of shared local data, with \alpha\in(0,1) and D the local dataset size. Letting I be the size of each data point in bits, we have the following:
(1) Worst-case communication cost per node (C_{\text{ACDS}}):

C_{\text{ACDS}}=\alpha DI\left(\frac{1}{H}+n(G+1)\right). (8)

(2) Total communication time for completing ACDS (T_{\text{ACDS}}): When the upload/download bandwidth of each node is R b/s, we have

T_{\text{ACDS}}=\frac{\alpha DI}{HR}\left[n^{2}(H+0.5)+n\left(H(G-1)-1.5\right)\right]. (9)
Remark 4

The worst-case communication cost in (8) corresponds to the first node 1_{g}, for g\in[G], which incurs a higher communication cost than the other nodes in group g due to its participation in the global sharing phase of ACDS.

Remark 5

To illustrate the communication overhead resulting from using ACDS, we consider the following numerical example. Let the total number of nodes in the system be N=100, and let each node have D=500 images from the CIFAR10 dataset, where each image has size I=24.5 Kbits (each image in the CIFAR10 dataset has 3 channels, each of 32\times 32 pixels, and each pixel takes a value in 0-255). When the communication bandwidth at each node is R=100 Mb/s (e.g., 4G speed), and each node shares only \alpha=5\% of its dataset in the form of H=5 batches, each of size M=5 images, the latency and communication cost of ACDS with G=4 groups are 11 seconds and \sim 75 Mbits, respectively. We note that the communication cost and completion time of ACDS are small compared to the training process, which requires sharing a large model over a large number of iterations, as demonstrated in Section VI.
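
For reference, the quantities in this remark can be obtained by plugging the stated parameters into (8) and (9); a minimal sketch of our own that simply evaluates the two expressions (units: bits and seconds):

def acds_cost_bits(alpha, D, I, H, n, G):
    # Worst-case per-node communication cost C_ACDS from (8), in bits.
    return alpha * D * I * (1.0 / H + n * (G + 1))

def acds_time_seconds(alpha, D, I, H, n, G, R):
    # Total ACDS completion time T_ACDS from (9), with bandwidth R in b/s.
    return alpha * D * I / (H * R) * (n**2 * (H + 0.5) + n * (H * (G - 1) - 1.5))

N, G = 100, 4
n = N // G                 # 25 nodes per group
alpha, D, H = 0.05, 500, 5
I = 24.5e3                 # bits per CIFAR10 image
R = 100e6                  # 100 Mb/s

print(acds_cost_bits(alpha, D, I, H, n, G) / 1e6, "Mbits")   # worst-case per-node cost, on the order of the ~75 Mbits above
print(acds_time_seconds(alpha, D, I, H, n, G, R), "seconds") # total completion time from (9)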

The proof of Proposition 3 is presented in Appendix D.

In the following, we discuss the anonymity guarantees of ACDS.

IV-B Anonymity Guarantees of ACDS

In the first round of the scheme, node 2_{g} knows that the source of the received batch c^{1}_{1_{g}} is node 1_{g}. Similarly, and more generally, node i_{g} knows that each data point in the received set of batches \{c^{1}_{1_{g}},\dots,c^{1}_{(i-1)_{g}}\} was generated by one of its previous i-1 counterclockwise neighbors. However, in the next H-1 rounds, each data point received by any node is equally likely to have been generated by any one of the remaining n-1 nodes in the group. Hence, the size of the candidate pool from which each node can guess the owner of a data point is small only in the first round, especially for the first few nodes in the ring. In order to provide anonymity for the entire data and decrease this risk in the first round of the ACDS scheme, the batch size can be reduced to just one data point, so that in the first round node 2_{g} learns only one data point from node 1_{g}. This comes at the expense of increasing the number of rounds. Another key consideration is that the number of nodes in each group trades the level of anonymity against the communication cost. In particular, the communication cost per node in the ACDS algorithm is \mathcal{O}(n), while the anonymity level, which we measure by the number of possible candidate owners of a given data point, is n-1. Therefore, increasing n, i.e., decreasing the number of groups G, increases the communication cost but also increases the anonymity level.

V Basil+: Parallelization of Basil

In this section, we describe our modified protocol Basil+ which allows for parallel training across multiple rings, along with sequential training over each ring. This results in decreasing the training time needed for completing one global epoch (visiting all nodes) compared to Basil which only considers sequential training over one ring.

V-A Basil+ Algorithm

At a high level, Basil+ divides the nodes into G groups such that each group, in parallel, performs sequential training over a logical ring using the Basil algorithm. After \tau rounds of sequential training within each group, a robust circular aggregation strategy is implemented to obtain a robust average model from all the groups. Following the robust circular aggregation stage, a final multicasting step is implemented such that each group can use the resulting robust average model. This entire process is repeated for K global rounds.

We now formalize the execution of Basil+ through the following four stages.

Stage 1: Basil+ Initialization. The protocol starts by clustering the set of N nodes equally into G groups of rings, with n=\frac{N}{G} nodes in each group. The set of nodes in group g is denoted by \mathcal{N}_{g}=\{u^{g}_{1},\dots,u_{n}^{g}\}, where node u_{1}^{g} is the starting node in ring g, for g=1,\dots,G. The clustering of nodes follows a random splitting agreement protocol similar to the one in Section III-A (details are presented in Section V-B). The connectivity parameter within each group is set to S=\min(n-1,b+1), where b is the worst-case number of Byzantine nodes. This choice of S guarantees that the benign subgraph within each group is connected with high probability, as described in Proposition 2.

Stage 2: Within Group Parallel Implementation of Basil. Each group g\in[G], in parallel, performs the training across its nodes for \tau rounds using the Basil algorithm.

Stage 3: Robust Circular Aggregation Strategy. We denote

\mathcal{S}_{g}=\{u^{g}_{n-1},u^{g}_{n-2},\dots,u_{n-S+1}^{g}\}, (10)

to be the set of S counterclockwise neighbors of node u^{g}_{1}. The robust circular aggregation strategy consists of G-1 steps performed sequentially over the G groups, where the G groups form a global ring. At step g, where g\in\{1,\dots,G-1\}, the set of nodes \mathcal{S}_{g} send their aggregated models to each node in the set \mathcal{S}_{g+1}. The reason for sending S models from one group to another is to ensure the connectivity of the global ring after removing the Byzantine nodes. The average aggregated model at node u^{g+1}_{i}\in\mathcal{S}_{g+1} is given as follows:

\mathbf{z}_{i}^{g+1}=\frac{1}{g+1}\left(\mathbf{x}^{(i,g+1)}_{\tau}+g\bar{\mathbf{z}}_{i}^{g+1}\right), (11)

where \mathbf{x}^{(i,g+1)}_{\tau} is the locally updated model at node u^{g+1}_{i} in ring g+1 after \tau rounds of updates according to the Basil algorithm. Here, \bar{\mathbf{z}}_{i}^{g+1} is the model selected by node u^{g+1}_{i} from the set of models received from the set \mathcal{S}_{g}. More specifically, by letting

\mathcal{L}_{g}=\{\mathbf{z}_{n-1}^{g},\mathbf{z}_{n-2}^{g},\dots,\mathbf{z}_{n-S+1}^{g}\} (12)

be the set of average aggregated models sent from the set of nodes \mathcal{S}_{g} to each node in the set \mathcal{S}_{g+1}, we define \bar{\mathbf{z}}_{i}^{g+1} to be the model selected from \mathcal{L}_{g} by node u^{g+1}_{i}\in\mathcal{S}_{g+1} using the Basil aggregation rule as follows:

\bar{\mathbf{z}}_{i}^{g+1}=\mathcal{A}_{\text{Basil}}(\mathcal{L}_{g})=\arg\min_{\mathbf{y}\in\mathcal{L}_{g}}\mathbb{E}\left[l^{g+1}_{i}(\mathbf{y},\zeta^{g+1}_{i})\right]. (13)

Stage 4: Robust Multicasting. The final stage is the multicasting stage. The set of nodes in \mathcal{S}_{G} send the final set of robust aggregated models \mathcal{L}_{G} to \mathcal{S}_{1}. Each node in the set \mathcal{S}_{1} applies the aggregation rule in (13) to the set of received models \mathcal{L}_{G}. Finally, each benign node in the set \mathcal{S}_{1} sends the filtered model \mathbf{z}_{i}^{1} to all nodes in the set \cup_{g=1}^{G}\mathcal{U}_{g}, where \mathcal{U}_{g} is defined as follows:

\mathcal{U}_{g}=\{u^{g}_{1},u^{g}_{2},\dots,u_{S}^{g}\}. (14)

Finally, all nodes in the set \cup_{g=1}^{G}\mathcal{U}_{g} use the aggregation rule in (13) to select the best model out of the received set \mathcal{L}_{1} before updating it according to (2). These four stages are repeated for K global rounds.
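
As an illustration of Stage 3 from the viewpoint of a single benign node in \mathcal{S}_{g+1}, the numpy sketch below (a toy example of our own with a linear model and squared loss; names and data are assumptions) applies the Basil selection rule (13) to the received models and then forms the running average in (11):

import numpy as np

rng = np.random.default_rng(1)

def basil_select(models, val_X, val_y):
    # Basil aggregation rule (13): keep the received model with the lowest
    # loss on local validation data (squared loss in this toy example).
    losses = [float(np.mean((val_X @ w - val_y) ** 2)) for w in models]
    return models[int(np.argmin(losses))]

def circular_aggregation_step(x_local, received_models, g, val_X, val_y):
    # Robust circular aggregation (11): filter the S models received from S_g,
    # then average the survivor with the node's own locally trained model.
    z_bar = basil_select(received_models, val_X, val_y)
    return (x_local + g * z_bar) / (g + 1)  # running average over the first g+1 groups

# Toy setting: d = 4, S = 3 received models, one of which is faulty.
d = 4
w_true = rng.normal(size=d)
val_X = rng.normal(size=(32, d))
val_y = val_X @ w_true
received = [w_true + 0.1 * rng.normal(size=d),
            w_true + 0.1 * rng.normal(size=d),
            1e3 * rng.normal(size=d)]           # model sent by a Byzantine node
x_local = w_true + 0.1 * rng.normal(size=d)
print(circular_aggregation_step(x_local, received, 2, val_X, val_y))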

Input: \mathcal{N}; S; \{\mathcal{Z}_{i}\}_{i\in\mathcal{N}}; \mathbf{x}^{0}; \tau; K
Stage 1: Initialization:
for each node i\in\mathcal{N} do
    \{\mathcal{N}_{g}\}_{g\in[G]}\leftarrow RandomClusteringAgreement(\mathcal{N},G) // cluster the nodes into G groups, each of size n, according to Section V-B
end for
\mathbf{x}^{(i,g)}\leftarrow\mathbf{x}^{0},\quad\forall i\in\mathcal{N}_{g},g\in[G]
for each global round k=1,\dots,K do
    Stage 2: Within Groups Robust Training:
    for each group g=1,\dots,G in parallel do
        Basil(\mathcal{N}_{g},S,\{\mathcal{Z}_{i}\}_{i\in\mathcal{N}_{g}},\{\mathbf{x}^{(i,g)}\}_{i\in\mathcal{N}_{g}},\tau)\rightarrow\{\mathbf{x}^{(i,g)}_{\tau}\}_{i\in\mathcal{N}_{g}} // apply the Basil algorithm within each group for \tau rounds
        \mathbf{z}_{i}^{g}\leftarrow\mathbf{x}^{(i,g)}_{\tau},\quad\forall i\in\mathcal{N}_{g}
    end for
    Stage 3: Robust Circular Aggregation:
    for each group g=1,\dots,G-1 in sequence do
        for each node u_{i}^{g+1}\in\mathcal{S}_{g+1} in parallel do
            // the set \mathcal{S}_{g+1} is defined in (10)
            \mathcal{L}_{g}\leftarrow\{\mathbf{z}_{i}^{g}\}_{u_{i}^{g}\in\mathcal{S}_{g}} // each node u_{i}^{g+1}\in\mathcal{S}_{g+1} receives the set of models \mathcal{L}_{g} (defined in (12)) from the nodes in \mathcal{S}_{g}
            if node u_{i}^{g+1}\in\text{benign set }\mathcal{R} then
                \bar{\mathbf{z}}_{i}^{g+1}\leftarrow\mathcal{A}_{\text{Basil}}(\mathcal{L}_{g}) // Basil performance-based strategy to select one model from \mathcal{L}_{g} using (13)
                \mathbf{z}_{i}^{g+1}\leftarrow\frac{1}{g+1}\left(\mathbf{x}^{(i,g+1)}_{\tau}+g\bar{\mathbf{z}}_{i}^{g+1}\right) // running average model over the first g+1 groups
            else
                \mathbf{z}_{i}^{g+1}\leftarrow* // Byzantine node sends a faulty model
            end if
        end for
    end for
    Stage 4: Robust Multicasting:
    for each node u_{i}^{1}\in\mathcal{S}_{1} in parallel do
        \mathcal{L}_{G}\leftarrow\{\mathbf{z}_{i}^{G}\}_{u_{i}^{G}\in\mathcal{S}_{G}}
        \bar{\mathbf{z}}_{i}^{1}\leftarrow\mathcal{A}_{\text{Basil}}(\mathcal{L}_{G})
        \mathbf{z}_{i}^{1}\leftarrow\bar{\mathbf{z}}_{i}^{1}
    end for
    for each node u_{i}^{g}\in\cup_{g=1}^{G}\mathcal{U}_{g} in parallel do
        // the set \mathcal{U}_{g} is defined in (14)
        \mathcal{L}_{1}\leftarrow\{\mathbf{z}_{i}^{1}\}_{u_{i}^{1}\in\mathcal{S}_{1}} // each node u_{i}^{g} receives the set of models \mathcal{L}_{1} from the nodes in \mathcal{S}_{1}
        \bar{\mathbf{z}}_{i}^{g}\leftarrow\mathcal{A}_{\text{Basil}}(\mathcal{L}_{1})
        \mathbf{x}^{(i,g)}\leftarrow\bar{\mathbf{z}}_{i}^{g}
    end for
end for
Return \{\mathbf{x}^{(i,g)}_{K}\}_{i\in\mathcal{N}_{g},g\in[G]}
Algorithm 3 Basil+

We compare the training times of Basil and Basil+ in the following proposition.

Proposition 4. Let TcommT_{\text{comm}}, Tperf-basedT_{\text{perf-based}}, and TSGDT_{\text{SGD}} respectively denote the time needed to multicast/receive one model, the time to evaluate SS models according to the Basil performance-based criterion, and the time to take one step of model update. The total training time for completing one global round of the Basil algorithm, where one global round consists of τ\tau sequential rounds over the ring, is bounded as

TBasil(τnG)Tperf-based+(τnG)Tcomm+(τnG)TSGD,T_{\text{{Basil}}}\leq(\tau nG)T_{\text{perf-based}}+(\tau nG)T_{\text{comm}}+(\tau nG)T_{\text{SGD}}, (15)

compared to the training time of the Basil+ algorithm, which is bounded as follows

TBasil+(τn+G+1)Tperf-based+(SG+τn1)Tcomm+(τn)TSGD.T_{\text{{Basil}+}}\leq(\tau n+G+1)T_{\text{perf-based}}+(SG+\tau n-1)T_{\text{comm}}+(\tau n)T_{\text{SGD}}. (16)
Remark 6

According to this proposition, the training time of Basil grows with the product nGnG (i.e., with the total number of nodes), while the training time of Basil+ grows only linearly in nn and GG individually. The proof of Proposition 4 is given in Appendix E.
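As a quick numerical illustration of this remark, the short sketch below evaluates the two upper bounds (15) and (16); the unit costs and the parameter values in the loop are hypothetical placeholders rather than measured quantities.

def basil_round_time(tau, n, G, T_perf=1.0, T_comm=1.0, T_sgd=1.0):
    # Bound (15): sequential training over the full ring of n*G nodes.
    return tau * n * G * (T_perf + T_comm + T_sgd)

def basil_plus_round_time(tau, n, G, S, T_perf=1.0, T_comm=1.0, T_sgd=1.0):
    # Bound (16): parallel training within groups plus sequential aggregation
    # across groups and a final multicast.
    return ((tau * n + G + 1) * T_perf
            + (S * G + tau * n - 1) * T_comm
            + tau * n * T_sgd)

for G in (1, 4, 16):
    print(G, basil_round_time(tau=1, n=25, G=G), basil_plus_round_time(tau=1, n=25, G=G, S=6))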

In the following section, we discuss the random clustering method used in Stage 1 of Basil+.

V-B Random Clustering Agreement

In practice, nodes can agree on a random clustering by using an approach similar to that in Section III-A, through the following simple steps. 1) All nodes first share their IDs with each other; we assume WLOG that the nodes’ IDs can be arranged in ascending order, and that Byzantine nodes cannot forge their identities or create multiple fake ones [27]. 2) Each node locally splits the nodes at random into GG subsets by using a pseudo random number generator (PRNG) initialized with a common seed (e.g., N). This ensures that all nodes generate the same partition into groups. To determine the order of the nodes within each group, the method in Section III-A can be used.
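A minimal sketch of this agreement step is given below; the function name is ours, and we assume, for illustration, that the IDs are integers, that G divides N, and that the common seed is N.

import random

def random_clustering_agreement(node_ids, G, seed):
    # Every node runs this locally with the same sorted ID list and the same
    # common seed, so all benign nodes obtain an identical partition into G groups.
    ids = sorted(node_ids)               # canonical order shared by all nodes
    rng = random.Random(seed)            # PRNG initialized with the common seed
    rng.shuffle(ids)                     # identical permutation at every node
    n = len(ids) // G                    # group size (assumes G divides N)
    return [ids[g * n:(g + 1) * n] for g in range(G)]

groups = random_clustering_agreement(range(100), G=4, seed=100)  # same result at every node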

V-C The Success of Basil+

We consider different scenarios for the connectivity parameter SS while evaluating the success of Basil+. Case 1: S=min(n1,b+1)S=\min(n-1,b+1). We set the connectivity parameter of Basil+ to S=min(n1,b+1)S=\min(n-1,b+1). Setting S=b+1S=b+1 when n>b+1n>b+1 ensures the connectivity of each ring (after removing the Byzantine nodes) along with the global ring. On the other hand, when setting S=n1S=n-1 for nb+1n\leq b+1, Basil+ fails only if at least one group contains nn or n1n-1 Byzantine nodes. We define BjB_{j} to be the failure event in which group jj contains nn or n1n-1 Byzantine nodes; the number of Byzantine nodes in a given group follows a Hypergeometric distribution with parameters (N,b,n)(N,b,n), where NN, bb, and nn are the total number of nodes, the total number of Byzantine nodes, and the number of nodes in each group, respectively. The probability of failure is bounded as follows

(Failure)=(j=1GBj)(a)j=1G(Bj)=(bn)+(bn1)(Nb1)(Nn)G,\displaystyle\mathbb{P}(\text{Failure}){=}\mathbb{P}(\bigcup_{j=1}^{G}B_{j})\overset{(a)}{\leq}\sum_{j=1}^{G}\mathbb{P}(B_{j}){=}\frac{\binom{b}{n}+\binom{b}{n-1}\binom{N-b}{1}}{\binom{N}{n}}G, (17)

where (a) follows from the union bound.

In order to further illustrate the impact of choosing the group size nn when setting S=n1S=n-1 on the probability of failure given in (17), we consider the following numerical examples. Let the total number of nodes in the system be N=100N=100, where b=33b=33 of them are Byzantine. By setting n=20n=20 nodes in each group, the probability in (17) turns out to be 5×1010\sim{5\times 10^{-10}}, which is negligible. For the case when n=10n=10, the probability of failure becomes 1.2×104\sim{1.2\times 10^{-4}}, which remains reasonably small.
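The probability in (17) is straightforward to evaluate numerically; the helper below (our own, not from the paper's code) reproduces the first example above.

from math import comb

def failure_prob_case1(N, b, n):
    # Union bound (17): probability that some group contains n or n-1 Byzantine nodes.
    G = N // n
    p_group = (comb(b, n) + comb(b, n - 1) * comb(N - b, 1)) / comb(N, n)
    return G * p_group

print(failure_prob_case1(N=100, b=33, n=20))  # on the order of 5e-10, as in the example above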

Case 2: S<n1S<n-1. Similar to the failure analysis of Basil given in Proposition 2, we relax the connectivity parameter SS as stated in the following proposition.

Proposition 5. The connectivity parameter SS in Basil+ can be relaxed to S<n1S<n-1 while guaranteeing the success of the algorithm (benign local/global subgraph connectivity) with high probability. The failure probability of Basil+ is given by

(F)Gi=0min(b,n)(ns=0S1max(is,0)(ns))(bi)(Nbni)(Nn),\mathbb{P}(F)\leq G\sum_{i=0}^{\min{(b,n)}}\left(n\prod_{s=0}^{S-1}\frac{\max\left(i-s,0\right)}{(n-s)}\right)\frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}, (18)

where NN, nn, GG, SS and bb are the total number of nodes, the number of nodes in each group, the number of groups, the connectivity parameter, and the number of Byzantine nodes, respectively.

The proof of Proposition 5 is presented in Appendix F.

Remark 7

In order to further illustrate the impact of choosing SS on the probability of failure given in (18), we consider the following numerical examples. Let the total number of nodes in the system be N=400N=400, where b=60b=60 of them are Byzantine, let the group size be n=100n=100, and let the connectivity parameter be S=10S=10. The probability of the failure event in (18) turns out to be 106\sim{10^{-6}}, which is negligible. For the case when S=7S=7, the probability of failure becomes 104\sim{10^{-4}}, which remains reasonably small.
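The bound in (18) can be evaluated in the same way; the sketch below is a direct numerical evaluation of (18) with our own helper function, and the parameter values in the example calls mirror the settings above.

from math import comb

def failure_prob_basil_plus(N, b, n, S):
    # Direct evaluation of the union bound (18).
    G = N // n
    total = 0.0
    for i in range(min(b, n) + 1):
        run = n                          # union over the n starting positions within a group
        for s in range(S):
            run *= max(i - s, 0) / (n - s)
        total += run * comb(b, i) * comb(N - b, n - i) / comb(N, n)
    return G * total

print(failure_prob_basil_plus(N=400, b=60, n=100, S=10))
print(failure_prob_basil_plus(N=400, b=60, n=100, S=7))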

VI Numerical Experiments

We start by evaluating the performance gains of Basil, and then present the experiments for Basil+. We note that Appendix H includes additional experiments for Basil, including its wall-clock time performance compared to UBAR, the performance of Basil and ACDS on the CIFAR100 dataset, and a performance comparison between Basil and Basil+.

VI-A Numerical Experiments for Basil

Schemes: We consider four schemes as described next.

  • G-plain: This is for the graph-based topology. At the start of each round, nodes exchange their models with their neighbors. Each node then averages its model with the received neighboring models and uses the result to carry out an SGD step over its local dataset.

  • R-plain: This is for the ring-based topology with S=1S=1. The current active node carries out an SGD step over its local dataset by using the model received from its previous counterclockwise neighbor.

  • UBAR: This is the prior state-of-the-art for mitigating Byzantine nodes in decentralized training over a graph, and is described in Appendix G.

  • Basil: This is our proposal.

Datasets and Hyperparameters: There are a total of 100100 nodes, of which 6767 are benign. For the decentralized network setting used to simulate the UBAR and G-plain schemes, we follow an approach similar to the experiments in [10] (we provide the details in Appendix H-B). For Basil and R-plain, the nodes are arranged in a logical ring, and 3333 of them are randomly set as Byzantine nodes. Furthermore, we set S=10S=10 for Basil, which gives us the connectivity guarantees discussed in Proposition 2. We use a decreasing learning rate of 0.03/(1+0.03k)0.03/(1+0.03\,k). We consider CIFAR10 [28] and use a neural network with 22 convolutional layers and 33 fully connected layers. The details are included in Appendix H-A. The training dataset is partitioned equally among all nodes. Furthermore, we report the worst test accuracy among the benign clients in our results. We also conduct similar evaluations on the MNIST dataset; the experimental results lead to the same conclusion and can be found in Appendix H. Additionally, we emphasize that Basil is based on sequential training over a logical ring, while UBAR is based on parallel training over a graph topology. Hence, for consistency of the experimental evaluations, we consider the following common definition of a training round:

Definition 2 (Training round)

With respect to the total number of SGD computations, we define a round over a logical ring to be equivalent to one parallel iteration over a graph. This definition aligns with our motivation of training with resource-constrained edge devices, where each user's computation power is limited.

Figure 4: Illustrating the performance of Basil using the CIFAR10 dataset under the IID data distribution setting. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip, (d) Hidden Attack.
Figure 5: Illustrating the performance of Basil using the CIFAR10 dataset under the non-IID data distribution setting. Panels: (a) No Attack, (b) No Attack, (c) Gaussian Attack, (d) Random Sign Flip.

Byzantine Attacks: We consider a variety of attacks, described as follows. Gaussian Attack: Each Byzantine node replaces its model parameters with entries drawn from a Gaussian distribution with mean 0 and standard deviation σ=1\sigma=1. Random Sign Flip: We observed in our experiments that the naive sign flip attack, in which Byzantine nodes flip the sign of each parameter before exchanging their models with their neighbors, is not strong in the R-plain scheme. To make the sign-flip attack stronger, we propose a layer-wise sign flip, in which Byzantine nodes randomly choose to flip or keep the sign of all the elements in each neural network layer. Hidden Attack: This is the attack that degrades the performance of distance-based defense approaches, as proposed in [23]. Essentially, the Byzantine nodes are assumed to be omniscient, i.e., they can collect the models uploaded by all the benign clients. The Byzantine nodes then design their models such that they are undetectable from the benign ones in terms of the distance metric, while still degrading the training process. Since the key idea of the hidden attack is to exploit the similarity of models from benign nodes, the Byzantine nodes launch this attack after 2020 rounds of training to make it more effective.
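For concreteness, a minimal sketch of the two simpler fault models (not the paper's exact implementation) is given below, acting on a model represented as a list of per-layer parameter arrays.

import numpy as np

def gaussian_attack(layers, rng):
    # Replace every model parameter with a draw from a Gaussian with mean 0 and std 1.
    return [rng.standard_normal(size=w.shape) for w in layers]

def layerwise_random_sign_flip(layers, rng):
    # For each layer, flip the sign of all of its elements with probability 1/2.
    return [w * rng.choice([-1.0, 1.0]) for w in layers]

rng = np.random.default_rng(0)
model = [rng.standard_normal((16, 8)), rng.standard_normal(8)]  # toy two-layer model
faulty_model = layerwise_random_sign_flip(model, rng)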

Results (IID Setting): We first present the results for the IID data setting, where the training dataset is shuffled randomly and then partitioned among the nodes. As can be seen from Fig. 4(a), Basil converges much faster than both UBAR and G-plain even in the absence of any Byzantine attacks, illustrating the benefits of ring-topology-based learning over graph-based topology. We note that the total number of gradient updates after kk rounds in the two setups is almost the same. We can also see that R-plain gives higher performance than Basil. This is because Basil uses a small mini-batch for performance evaluation; hence, in contrast to R-plain, the latest neighborhood model may not be chosen in each round, resulting in the loss of some update steps. Nevertheless, Figs. 4(b), 4(c) and 4(d) illustrate that Basil is not only Byzantine-resilient, but also maintains its superior performance over UBAR with 16%{\sim}{16\%} improvement in test accuracy, whereas R-plain suffers significantly. Furthermore, we would like to highlight that since Basil uses a performance-based criterion for mitigating Byzantine nodes, it is robust to the Hidden attack as well. Finally, considering the poor convergence of R-plain under different Byzantine attacks, we conclude that Basil is a good solution with fast convergence, strong Byzantine resiliency, and acceptable computation overhead.

Results (non-IID Setting): For simulating the non-IID setting, we sort the training data by class, partition the sorted data into NN subsets, and assign each node 11 partition. By applying ACDS in the absence of Byzantine nodes while trying different values for α\alpha, we found that α=5%\alpha=5\% gives good performance while requiring only a small amount of shared data from each node. Fig. 5(a) illustrates that the test accuracy of R-plain in the non-IID setting can be increased by up to 10%{\sim}10\% when each node shares only α=5%\alpha=5\% of its local data with other nodes. Fig. 5(c) and Fig. 5(d) illustrate that Basil on top of ACDS with α=5%\alpha=5\% is robust to software/hardware faults represented by the Gaussian model and random sign flip attacks. Furthermore, both Basil without ACDS and UBAR completely fail in the presence of these faults. This is because the two defenses use a performance-based criterion, which is not meaningful in the non-IID setting. In other words, each node has only data from one class, hence it becomes unclear whether a high loss value for a given model can be attributed to the Byzantine nodes or to the very heterogeneous nature of the data. Additionally, R-plain with α=0%,5%\alpha=0\%,5\% completely fails in the presence of these faults.

Furthermore, we can observe in Fig. 5(b) that Basil with α=0\alpha=0 gives low performance. This confirms that the non-IID data distribution degrades the convergence behavior. For UBAR, the performance is completely degraded, since in UBAR each node selects the set of models which gives a lower loss than its own local model before using them in the update rule. Since the performance-based criterion is not meaningful in this setting, each node might end up with only its own model. Hence, the model of each node does not fully propagate over the graph, as also demonstrated in Fig. 5(b), where UBAR fails completely. This is different from the ring setting, where the model is propagated over the ring.

Figure 6: Illustrating the performance of Basil compared with UBAR for CIFAR10 under the non-IID data distribution setting with α=5%\alpha=5\% data sharing. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip Attack.
Figure 7: Ablation studies. Here NN, bb, SS, α\alpha are the total nodes, Byzantine nodes, connectivity, and fraction of shared data. For non-IID data, the Gaussian attack is considered, while the other plots use the IID setting with the hidden attack. The same NNs as in Section H-A are used. Panels: (a) Varying N (b=0.33×N, S=10), (b) Varying b (N=100, S=10), (c) Varying S (N=100, b=33), (d) Basil with ACDS in the non-IID setting (N=100, b=33, S=10).

In Fig. 5, we showed that UBAR performs quite poorly in the non-IID data setting when no data is shared among the clients. We note that achieving anonymity in data sharing in graph-based decentralized learning in general, and UBAR in particular, is an open problem. Nevertheless, in Fig. 6, we further show that even when 5%5\% data sharing is done in UBAR, its performance remains quite low in comparison to Basil+ACDS.

Now, we compute the communication cost overhead due to leveraging ACDS for the experiments associated with Fig. 5. By considering the setting discussed in Remark 5 for ACDS with G=4G=4 groups for data sharing and each node sharing α=5%\alpha=5\% fraction of its local dataset, we can see from Fig. 5 that Basil takes 500500 rounds to get 50%\sim 50\% test accuracy. Hence, given that the model used in this section is of size 3.63.6 Mbits (117706117706 trainable parameters each represented by 3232 bits), the communication cost overhead resulting from using ACDS for data sharing is only 4%4\%.

Further ablation studies: We perform ablation studies to show the effect of different parameters on the performance of Basil: the number of nodes NN, the number of Byzantine nodes bb, the connectivity parameter SS, and the fraction of data sharing α\alpha. For the ablation studies corresponding to NN, bb, and SS, we consider the IID setting described previously, while for α\alpha, we consider the non-IID setting.

Fig. 7(a) demonstrates that, unlike UBAR, Basil performance scales with the number of nodes NN. This is because in any given round, the sequential training over the logical ring topology accumulates SGD updates of clients along the logical ring, as opposed to parallel training over the graph topology in which an update from any given node only propagates to its neighbors. Hence, Basil has better accuracy than UBAR. Additionally, as described in Section V, one can also leverage Basil+ to achieve further scalability for large NN by parallelizing Basil. We provide the experimental evaluations corresponding to Basil+ in Section VI-B.

To study the effect of different numbers of Byzantine nodes in the system, we conduct experiments with different values of bb. Fig. 7(b) demonstrates that Basil is quite robust to different numbers of Byzantine nodes.

Fig. 7(c) demonstrates the impact of the connectivity parameter SS. Interestingly, the convergence rate decreases as SS increases. We posit that due to the noisy SGD-based training process, the closest benign model is not always selected, resulting in the loss of some intermediate update steps. However, decreasing SS too much results in a larger connectivity failure probability for Basil. For example, the upper bound on the failure probability when S=6S=6 is less than 0.090.09. However, for an extremely low value of S=2S=2, we observed consistent failure across all simulation trials, as also illustrated in Fig. 7(c). Hence, a careful choice of SS is important.

Finally, to study the relationship between privacy and accuracy when Basil is used alongside ACDS, we carry out numerical analysis by varying α\alpha in the non-IID setting described previously. Fig. 7(d) demonstrates that as α\alpha is increased, i.e., as the amount of shared data is increased, the convergence rate increases as well. Furthermore, we emphasize that even having α=0.5%\alpha=0.5\% gives reasonable performance when data is non-IID, unlike UBAR which fails completely.

VI-B Numerical Experiments for Basil+

In this section, we demonstrate the achievable gains of Basil+ in terms of its scalability, Byzantine robustness, and superior performance over UBAR.

Schemes: We consider three schemes, as described next.

  • Basil+: Our proposed scheme.

  • R-plain+: We implement a parallel extension of R-plain. In particular, the nodes are divided into GG groups. Within each group, a sequential R-plain training process is carried out, wherein the current active node performs local training using the model received from its previous counterclockwise neighbor. After τ\tau rounds of sequential R-plain training within each group, a circular aggregation is carried out along the GG groups: the models from the last node of each group are averaged. The averaged model is then used by the first node of each group in the next global round. This entire process is repeated for KK global rounds.

  • UBAR: The state-of-the-art Byzantine-robust decentralized training scheme over a graph topology, described in Appendix G.

Setting: We use the CIFAR10 dataset [28] and the neural network with 22 convolutional layers and 33 fully connected layers described in Section VI. The training dataset is partitioned uniformly among the set 𝒩\mathcal{N} of all nodes, where |𝒩|=400|\mathcal{N}|=400. We set the batch size to 8080 for local training and performance evaluation in Basil+ as well as UBAR. Furthermore, we consider epoch-based local training, where we set the number of epochs to 33. We use a decreasing learning rate of 0.03/(1+0.03k)0.03/(1+0.03\,k), where kk denotes the global round. For all schemes, we report the average test accuracy among the benign clients. For Basil+, we set the connectivity parameter to S=6S=6 and the number of intra-group rounds to τ=1\tau=1. The implementation of UBAR is given in Appendix H-B.

Figure 8: The scalability gain of Basil+ in the presence of Byzantine nodes as the number of nodes increases. Here GG is the number of groups, where each group has n=25n=25 nodes.
Figure 9: Illustrating the performance gains of Basil+ over UBAR and R-plain+ for the CIFAR10 dataset under different numbers of active nodes NaN_{a}. Panels: (a) Na=25N_{a}=25, (b) Na=50N_{a}=50, (c) Na=100N_{a}=100, (d) Na=400N_{a}=400.

Results: For studying how the three schemes perform when more nodes participate in the training process, we consider different cases for the number of participating nodes 𝒩a𝒩\mathcal{N}_{a}\subseteq\mathcal{N}, where |𝒩a|=Na|\mathcal{N}_{a}|=N_{a}. Furthermore, for the three schemes, we set the total number of Byzantine nodes to be βNa\lfloor\beta N_{a}\rfloor with β=0.2\beta=0.2.

Fig. 8 shows the performance of Basil+ in the presence of Gaussian attack when the number of groups increases, i.e., when the number of participating nodes NaN_{a} increases. Here, for all the four scenarios in Fig. 8, we fix the number of nodes in each group to n=25n=25 nodes. As we can see Basil+ is able to mitigate the Byzantine behavior while achieving scalable performance as the number of nodes increases. In particular, Basil+ with G=16G=16 groups (i.e., Na=400N_{a}=400 nodes) achieves a test accuracy which is higher by an absolute margin of 20%20\% compared to the case of G=1G=1 (i.e., Na=25N_{a}=25 nodes). Additionally, while Basil+ provides scalable model performance when the number of groups increases, the overall increase in training time scales gracefully due to parallelization of training within groups. In particular, the key difference in the training time between the two cases of G=1G=1 and G=16G=16 is in the stages of robust aggregation and global multicast. In order to get further insight, we set TcommT_{\text{comm}}, Tperf-basedT_{\text{perf-based}} and TSGDT_{\text{SGD}} in equation (29) to one unit of time. Hence, using (29), one can show that the ratio of the training time for G=16G=16 with Na=400N_{a}=400 nodes to the training time for G=1G=1 with Na=25N_{a}=25 nodes is just 1.51.5. This further demonstrates the scalability of Basil+.

In Fig. 9, we compare the performance of the three schemes for different numbers of participating nodes NaN_{a}. In particular, in both Basil+ and R-plain+, we have Na=25GN_{a}=25\,G, where GG denotes the number of groups, while for UBAR, we consider a parallel graph with NaN_{a} nodes. As can be observed, Basil+ is not only robust to the Byzantine nodes, but also gives superior performance over UBAR in all four cases shown in Fig. 9. The key reason Basil+ outperforms UBAR is that training in Basil+ proceeds sequentially over logical rings within groups, which performs better than graph-based decentralized training.

VII Conclusion and Future Directions

We propose Basil, a fast and computationally efficient Byzantine-robust algorithm for decentralized training over a logical ring. We provide the theoretical convergence guarantees of Basil, demonstrating its linear convergence rate. Our experimental results in the IID setting show the superiority of Basil over the state-of-the-art algorithm for decentralized training. Furthermore, we generalize Basil to the non-IID setting by integrating it with our proposed Anonymous Cyclic Data Sharing (ACDS) scheme. Finally, we propose Basil+, which enables a parallel implementation of Basil that achieves further scalability.

One interesting future direction is to explore techniques such as data compression, or data placement and coded shuffling, to reduce the communication cost resulting from using ACDS. Additionally, it would be interesting to see how differential privacy (DP) methods can be adopted by adding noise to the shared data in ACDS to provide further privacy, while studying the impact of the added noise on the overall training performance.

References

  • [1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, 2019.
  • [2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16×\times16 words: Transformers for image recognition at scale,” International Conference on Learning Representations (ICLR), 2021.
  • [3] E. L. Feld, “United States Data Privacy Law: The Domino Effect After the GDPR,” in N.C. Banking Inst., vol. 24.   HeinOnline, 2020, p. 481.
  • [4] P. Kairouz, H. B. McMahan et al., “Advances and open problems in federated learning,” preprint arXiv:1912.04977, 2019.
  • [5] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” 2017.
  • [6] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 3478–3487.
  • [7] R. Dobbe, D. Fridovich-Keil, and C. Tomlin, “Fully decentralized policies for multi-agent systems: An information theoretic approach,” ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 2945–2954.
  • [8] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, “Asynchronous gossip algorithms for stochastic optimization,” in Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, 2009, pp. 3581–3586.
  • [9] L. Lamport, R. Shostak, and M. Pease, “The byzantine generals problem,” ACM Trans. Program. Lang. Syst., vol. 4, no. 3, p. 382–401, Jul. 1982.
  • [10] S. Guo, T. Zhang, H. Yu, X. Xie, L. Ma, T. Xiang, and Y. Liu, “Byzantine-resilient decentralized stochastic gradient descent,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
  • [11] Z. Yang and W. U. Bajwa, “Bridge: Byzantine-resilient decentralized gradient descent,” preprint arXiv:1908.08098, 2019.
  • [12] J. Regatti, H. Chen, and A. Gupta, “Bygars: Byzantine sgd with arbitrary number of attackers,” preprint arXiv:2006.13421, 2020.
  • [13] C. Xie, S. Koyejo, and I. Gupta, “Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 6893–6901.
  • [14] ——, “Zeno++: Robust fully asynchronous SGD,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 10 495–10 503.
  • [15] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   Curran Associates Inc., 2017, p. 118–128.
  • [16] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 5650–5659.
  • [17] A. R. Elkordy and A. Salman Avestimehr, “Heterosag: Secure aggregation with heterogeneous quantization in federated learning,” IEEE Transactions on Communications, pp. 1–1, 2022.
  • [18] J. So, B. Güler, and A. S. Avestimehr, “Byzantine-resilient secure federated learning,” IEEE Journal on Selected Areas in Communications, 2020.
  • [19] K. Pillutla, S. M. Kakade, and Z. Harchaoui, “Robust aggregation for federated learning,” preprint arXiv:1912.13445, 2019.
  • [20] L. Zhao, S. Hu, Q. Wang, J. Jiang, S. Chao, X. Luo, and P. Hu, “Shielding collaborative learning: Mitigating poisoning attacks through client-side detection,” IEEE Transactions on Dependable and Secure Computing, pp. 1–1, 2020.
  • [21] S. Prakash, H. Hashemi, Y. Wang, M. Annavaram, and S. Avestimehr, “Byzantine-resilient federated learning with heterogeneous data distribution,” arXiv preprint arXiv:2010.07541, 2020.
  • [22] T. Zhang, C. He, T.-S. Ma, M. Ma, and S. Avestimehr, “Federated learning for internet of things: A federated learning framework for on-device anomaly data detection,” ArXiv, vol. abs/2106.07976, 2021.
  • [23] E. M. El Mhamdi, R. Guerraoui, and S. Rouault, “The hidden vulnerability of distributed learning in Byzantium,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 3521–3530.
  • [24] J. woo Lee, J. Oh, S. Lim, S.-Y. Yun, and J.-G. Lee, “Tornadoaggregate: Accurate and scalable federated learning via the ring-based architecture,” preprint arXiv:1806.00582, 2020.
  • [25] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” preprint arXiv:1806.00582, 2018.
  • [26] L. A. Dunning and R. Kresman, “Privacy preserving data sharing with anonymous id assignment,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 2, pp. 402–413, 2013.
  • [27] M. Castro, B. Liskov, and et al., “Practical byzantine fault tolerance,” OSDI, vol. 173–186, 1999.
  • [28] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
  • [29] Y. LeCun, C. Cortes, and C. Burges, “The mnist database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [30] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.

Overview

In the following, we summarize the content of the appendix.

  • In Appendix A, we prove Propositions 1 and 2.

  • In Appendix B, we prove the convergence guarantees of Basil.

  • In Appendix C, we describe how Basil can be made robust to node dropouts.

  • In Appendix D, Appendix E and Appendix F, the proofs of Propositions 3, 4 and 5 are presented.

  • In Appendix G, we present UBAR [10], the recent Byzantine-robust decentralized algorithm.

  • In Appendix H, we provide additional experiments.

Appendix A Proof of Proposition 1 and Proposition 2

Proposition 1. The communication, computation, and storage complexities of Basil algorithm are all 𝒪(Sd)\mathcal{O}(Sd) for each node in each iteration, where dd is the model size.

Proof. Each node receives and stores the latest SS models, calculates the loss by using each model out of the SS stored models, and multicasts its updated model to the next SS clockwise neighbors. Thus, this results in 𝒪(Sd)\mathcal{O}(Sd) communication, computation, and storage costs. \hfill\square

Proposition 2. The number of models that each benign node needs to receive, store and evaluate from its counterclockwise neighbors for ensuring the connectivity and success of Basil can be relaxed to S<b+1S<b+1 while guaranteeing the success of Basil (benign subgraph connectivity) with high probability.

Proof. This can be proven by showing that the benign subgraph, which is generated by removing the Byzantine nodes, is connected with high probability when each node multicasts its updated model to the next S<b+1S<b+1 clockwise neighbors instead of b+1b+1 neighbors. Connectivity of the benign subgraph is important as it ensures that each benign node can still receive information from a few other non-faulty nodes. Hence, letting each node store and evaluate the latest SS model updates ensures that each benign node has the chance to select one of the benign updates.

More formally, when each node multicasts its model to the next SS clockwise neighbors, we define AjA_{j} to be the failure event in which SS Byzantine nodes come in a row, where jj is the index of the first of these SS nodes. When AjA_{j} occurs, there is at least one pair of benign nodes that have no link between them. The probability of AjA_{j} is given as follows:

(Aj)=i=0S1(bi)(Ni)=b!(NS)!(bS)!N!,\mathbb{P}(A_{j})=\prod_{i=0}^{S-1}\frac{(b-i)}{(N-i)}=\frac{b!(N-S)!}{(b-S)!N!}, (19)

where the second equality follows from the definition of factorial, while bb, and NN are the number of Byzantine nodes and the total number of nodes in the system, respectively. Thus, the probability of having a disconnected benign subgraph in Basil, i.e., SS Byzantine nodes coming in a row, is given as follows:

(Failure)=(j=1N(Aj))(a)j=1N(Aj)=(b)b!(NS)!(bS)!(N1)!,\displaystyle\mathbb{P}(\text{Failure})=\mathbb{P}(\bigcup_{j=1}^{N}(A_{j}))\overset{(a)}{\leq}\sum_{j=1}^{N}\mathbb{P}(A_{j})\overset{(b)}{=}\frac{b!(N-S)!}{(b-S)!(N-1)!}, (20)

where (a) follows from the union bound and (b) follows from (19). \hfill\square

Appendix B Convergence Analysis

In this section, we prove the two theorems presented in Section III.B in the main paper. We start the proofs by stating the main assumptions and the update rule of Basil.

Assumption 1 (IID data distribution). Local dataset 𝒵i\mathcal{Z}_{i} at node ii consists of IID data samples from a distribution 𝒫i\mathcal{P}_{i}, where 𝒫i=𝒫\mathcal{P}_{i}=\mathcal{P} for ii\in\mathcal{R}. In other words, fi(𝐱)=𝔼ζi𝒫i[l(𝐱,ζi)]=𝔼ζj𝒫j[l(𝐱,ζj)]=fj(𝐱)i,jf_{i}(\mathbf{x})=\mathbb{E}_{\zeta_{i}\sim\mathcal{P}_{i}}[l(\mathbf{x},\zeta_{i})]=\mathbb{E}_{\zeta_{j}\sim\mathcal{P}_{j}}[l(\mathbf{x},\zeta_{j})]=f_{j}(\mathbf{x})\,\forall i,j\in\mathcal{R}. Hence, the global loss function is f(𝐱)=𝔼ζi𝒫i[l(𝐱,ζi)]f(\mathbf{x})=\mathbb{E}_{\zeta_{i}\sim\mathcal{P}_{i}}[l(\mathbf{x},\zeta_{i})].

Assumption 2 (Bounded variance). Stochastic gradient gi(𝐱)g_{i}(\mathbf{x}) is unbiased and variance bounded, i.e., 𝔼𝒫i[gi(𝐱)]=fi(𝐱)=f(𝐱)\mathbb{E}_{\mathcal{P}_{i}}[g_{i}(\mathbf{x})]=\nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x}), and 𝔼𝒫igi(𝐱)fi(𝐱)2σ2\mathbb{E}_{\mathcal{P}_{i}}||g_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{x})||^{2}\leq\sigma^{2}, where gi(𝐱)g_{i}({\mathbf{x}}) is the stochastic gradient computed by node ii by using a random sample ζi\zeta_{i} from its local dataset 𝒵i\mathcal{Z}_{i}.

Assumption 3 (Smoothness). The loss functions fisf_{i}^{\prime}s are L-smooth and twice differentiable, i.e., for any 𝐱d\mathbf{x}\in\mathbb{R}^{d}, we have 2fi(𝐱)2L||\nabla^{2}f_{i}(\mathbf{x})||_{2}\leq L.

Let bib^{i} be the number of Byzantine nodes out of the SS counterclockwise neighbors of node ii. We divide the set of stored models 𝒩ki\mathcal{N}_{k}^{i} at node ii in the kk-th round into two sets. The first set 𝒢ki={𝐲1,,𝐲ri}\mathcal{G}^{i}_{k}=\{\mathbf{y}_{1},\dots,\mathbf{y}_{r^{i}}\} contains the benign models, where ri=Sbir^{i}=S-b^{i}. We consider scenarios with S=b+1S=b+1, where bb is the total number of Byzantine nodes in the network. Without loss of generality, we assume the models in this set are arranged such that the first model is from the closest benign node in the neighborhood of node ii, while the last model is from the farthest node. Similarly, we define the second set ki\mathcal{B}^{i}_{k} to be the set of models from the counterclockwise Byzantine neighbors of node ii such that ki𝒢ki=𝒩ki\mathcal{B}_{k}^{i}\cup\mathcal{G}_{k}^{i}=\mathcal{N}_{k}^{i}.

The general update rule in Basil is given as follows. At the kk-th round, the current active node ii updates the global model according to the following rule:

𝐱k(i)=\displaystyle\mathbf{x}^{(i)}_{k}= 𝐱¯k(i)ηk(i)gi(𝐱¯k(i)),\displaystyle\bar{\mathbf{x}}^{(i)}_{k}-\eta_{k}^{(i)}g_{i}(\bar{\mathbf{x}}^{(i)}_{k}), (21)

where 𝐱¯k(i)\bar{\mathbf{x}}^{(i)}_{k} is given as follows

𝐱¯k(i)=argmin𝐲𝒩ki𝔼[li(𝐲,ζi)].\bar{\mathbf{x}}_{k}^{(i)}=\arg\min_{\mathbf{y}\in\mathcal{N}_{k}^{i}}\mathbb{E}\left[{l}_{i}({\mathbf{y}},\zeta_{i})\right]. (22)

B-A Proof of Theorem 1

We first show that if node ii applies the performance-based criterion in (22), selects the model 𝐲1𝒢ki\mathbf{y}_{1}\in\mathcal{G}^{i}_{k}, and updates its model as follows:

𝐱k(i)=𝐲1ηk(i)gi(𝐲1),\mathbf{x}_{k}^{(i)}=\mathbf{y}_{1}-\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1}), (23)

we will have

𝔼[i+1(𝐱k(i))]𝔼[i+1(𝐲1)],\mathbb{E}\left[{\ell}_{i+1}(\mathbf{x}_{k}^{(i)})\right]\leq\mathbb{E}\left[{\ell}_{i+1}(\mathbf{y}_{1})\right], (24)

where i+1(𝐲1)=i+1(𝐲1,ζi+1){\ell}_{i+1}(\mathbf{y}_{1})={\ell}_{i+1}(\mathbf{y}_{1},\zeta_{i+1}) is the loss function of node i+1i+1 evaluated on a random sample ζi+1\zeta_{i+1} by using the model 𝐲1\mathbf{y}_{1}.

The proof of (24) is as follows: By using Taylor’s theorem, there exists a γ\gamma such that

i+1(𝐱k(i))=\displaystyle{\ell}_{i+1}({\mathbf{x}_{k}^{(i)}})= i+1(𝐲1ηk(i)gi(𝐲1))\displaystyle{\ell}_{i+1}\left(\mathbf{y}_{1}-\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1})\right) (25)
=\displaystyle= i+1(𝐲1)ηk(i)gi(𝐲1)Tgi+1(𝐲1)\displaystyle{\ell}_{i+1}(\mathbf{y}_{1})-\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1})^{T}g_{i+1}({\mathbf{y}_{1}})
+12ηk(i)gi(𝐲1)T2i+1(γ)ηk(i)gi(𝐲1),\displaystyle+\frac{1}{2}\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1})^{T}\nabla^{2}{\ell}_{i+1}({\gamma})\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1}), (26)

where 2i+1\nabla^{2}{\ell}_{i+1} is the stochastic Hessian matrix. By using the following assumption

2i+1(𝐱k(i))2L,||\nabla^{2}{\ell}_{i+1}({\mathbf{x}_{k}^{(i)}})||_{2}\leq L, (27)

for all random samples ζi+1\zeta_{i+1} and any model 𝐱d\mathbf{x}\in\mathbb{R}^{d}, where LL is the Lipschitz constant, we get

i+1(𝐱k(i))\displaystyle{\ell}_{i+1}({\mathbf{x}_{k}^{(i)}})\leq i+1(𝐲1)ηk(i)gi(𝐲1)Tgi+1(𝐲1)\displaystyle{\ell}_{i+1}(\mathbf{y}_{1})-\eta_{k}^{(i)}g_{i}(\mathbf{y}_{1})^{T}g_{i+1}({\mathbf{y}_{1}})
+\displaystyle+ (ηk(i))2L2gi(𝐲1)2.\displaystyle\frac{(\eta_{k}^{(i)})^{2}L}{2}||g_{i}(\mathbf{y}_{1})||^{2}. (28)

By taking the expected value of both sides of this expression (where the expectation is taken over the randomness in the sample selection), we get

𝔼[i+1(𝐱k(i))]𝔼[i+1(𝐲1)]ηk(i)𝔼[gi(𝐲1)Tgi+1(𝐲1)]\displaystyle\mathbb{E}\left[{\ell}_{i+1}({\mathbf{x}_{k}^{(i)}})\right]\leq\mathbb{E}\left[{\ell}_{i+1}(\mathbf{y}_{1})\right]-\eta_{k}^{(i)}\mathbb{E}\left[g_{i}(\mathbf{y}_{1})^{T}g_{i+1}({\mathbf{y}_{1}})\right]
+\displaystyle+ (ηk(i))2L2𝔼gi(𝐲1)2\displaystyle\frac{(\eta_{k}^{(i)})^{2}L}{2}\mathbb{E}||g_{i}(\mathbf{y}_{1})||^{2}
=𝑎\displaystyle\overset{a}{=} 𝔼[i+1(𝐲1)]ηk(i)𝔼[gi(𝐲1)T]𝔼[gi+1(𝐲1)]\displaystyle\mathbb{E}\left[{\ell}_{i+1}(\mathbf{y}_{1})\right]-\eta_{k}^{(i)}\mathbb{E}\left[g_{i}(\mathbf{y}_{1})^{T}\right]\mathbb{E}\left[g_{i+1}({\mathbf{y}_{1}})\right]
+\displaystyle+ (ηk(i))2L2𝔼gi(𝐲1)2\displaystyle\frac{(\eta_{k}^{(i)})^{2}L}{2}\mathbb{E}||g_{i}(\mathbf{y}_{1})||^{2}
\displaystyle\leq 𝔼[i+1(𝐲1)]ηk(i)𝔼[gi(𝐲1)T]𝔼[gi+1(𝐲1)]\displaystyle\mathbb{E}\left[{\ell}_{i+1}(\mathbf{y}_{1})\right]-\eta_{k}^{(i)}\mathbb{E}\left[g_{i}(\mathbf{y}_{1})^{T}\right]\mathbb{E}\left[g_{i+1}({\mathbf{y}_{1}})\right]
+\displaystyle+ (ηk(i))2L𝔼gi(𝐲1)2\displaystyle(\eta_{k}^{(i)})^{2}L\mathbb{E}||g_{i}(\mathbf{y}_{1})||^{2}
𝑏\displaystyle\overset{b}{\leq} 𝔼[i+1(𝐲1)]ηk(i)f(𝐲1)2+(ηk(i))2Lf(𝐲1)2\displaystyle\mathbb{E}\left[{\ell}_{i+1}(\mathbf{y}_{1})\right]-\eta_{k}^{(i)}||\nabla f(\mathbf{y}_{1})||^{2}+(\eta_{k}^{(i)})^{2}L||\nabla f(\mathbf{y}_{1})||^{2}
+\displaystyle+ (ηk(i))2Lσ2\displaystyle(\eta_{k}^{(i)})^{2}L\sigma^{2}
=\displaystyle= 𝔼[i+1(𝐲1)]f(𝐲1)2(ηk(i)(ηk(i))2L)+(ηk(i))2Lσ2,\displaystyle\mathbb{E}\left[{\ell}_{i+1}(\mathbf{y}_{1})\right]{-}||\nabla f(\mathbf{y}_{1})||^{2}\left(\eta_{k}^{(i)}{-}(\eta_{k}^{(i)})^{2}L\right)+(\eta_{k}^{(i)})^{2}L\sigma^{2}, (29)

where (a) follows from the fact that the samples are drawn from independent data distributions, while (b) follows from Assumption 1 along with

𝔼gi(𝐲1)2=\displaystyle\mathbb{E}||g_{i}(\mathbf{y}_{1})||^{2}= 𝔼gi(𝐲1)𝔼[gi(𝐲1)]2+𝔼[gi(𝐲1)]2\displaystyle\mathbb{E}||g_{i}(\mathbf{y}_{1})-\mathbb{E}\left[g_{i}(\mathbf{y}_{1})\right]||^{2}+||\mathbb{E}[g_{i}(\mathbf{y}_{1})]||^{2}
\displaystyle\leq σ2+f(𝐲1)2.\displaystyle\sigma^{2}+||\nabla f(\mathbf{y}_{1})||^{2}. (30)

Let Cki=f(𝐲1)2f(𝐲1)2+σ2=||𝔼[gi(𝐲1)]||2||𝔼[gi(𝐲1)]||2+σ2C_{k}^{i}=\frac{||\nabla f(\mathbf{y}_{1})||^{2}}{||\nabla f(\mathbf{y}_{1})||^{2}+\sigma^{2}}=\frac{||\mathbb{E}[g_{i}(\mathbf{y}_{1})]||^{2}}{||\mathbb{E}[g_{i}(\mathbf{y}_{1})]||^{2}+\sigma^{2}}, which implies that Cki[0,1]C_{k}^{i}\in[0,1]. By selecting the learning rate as ηk(i)1LCki\eta_{k}^{(i)}\geq\frac{1}{L}C_{k}^{i}, we get

𝔼[li+1(𝐱k(i))]𝔼[li+1(𝐲1)].\mathbb{E}\left[l_{i+1}(\mathbf{x}_{k}^{(i)})\right]\leq\mathbb{E}\left[l_{i+1}(\mathbf{y}_{1})\right]. (31)

Note that, nodes can just use a learning rate η1L\eta\geq\frac{1}{L}, since Cki[0,1]C_{k}^{i}\in[0,1], while still achieving (31). This completes the first part of the proof.

By using (31), it can be easily seen that the update rule in equation (21) reduces to the case where each node updates its model based on the model received from the closest benign node in its neighborhood, as in (23); this follows by induction.

Consider the following example: a ring with NN nodes and S=3S=3, where we ignore the Byzantine nodes for the moment (i.e., assume all nodes are benign), and consider the first round k=1k=1. With a slight abuse of notation, the updated model of node 1 (the first node in the ring) is 𝐱1=h(𝐱0)\mathbf{x}_{1}=h(\mathbf{x}_{0}), i.e., a function of the initial model 𝐱0\mathbf{x}_{0}. Now, node 2 has to select one model from the set of two models 𝒩k2={𝐱1=h(𝐱0),𝐱0}\mathcal{N}_{k}^{2}=\{\mathbf{x}_{1}=h(\mathbf{x}_{0}),\mathbf{x}_{0}\}. The selection is performed by evaluating the expected loss function of node 2, using the criterion given in (22), on the models in the set 𝒩k2\mathcal{N}_{k}^{2}. According to (31), node 2 selects the model 𝐱1\mathbf{x}_{1}, which results in a lower expected loss. Node 2 then updates its model based on 𝐱1\mathbf{x}_{1}, i.e., 𝐱2=h(𝐱𝟏)\mathbf{x}_{2}=h(\mathbf{x_{1}}). After that, node 3 applies the aggregation rule in (22) to select one model from the set 𝒩k3={𝐱2=h(𝐱𝟏),𝐱1=h(𝐱𝟎),𝐱0}\mathcal{N}_{k}^{3}=\{\mathbf{x}_{2}=h(\mathbf{x_{1}}),\mathbf{x}_{1}=h(\mathbf{x_{0}}),\mathbf{x}_{0}\}. By using (31) and Assumption 1, we get

𝔼[l3(𝐱2)]𝔼[l3(𝐱1)]𝔼[l3(𝐱0)],\displaystyle\mathbb{E}[l_{3}(\mathbf{x}_{2})]\leq\mathbb{E}[l_{3}(\mathbf{x}_{1})]\leq\mathbb{E}[l_{3}(\mathbf{x}_{0})], (32)

and node 3 will update its model according to the model 𝐱2\mathbf{x}_{2}, i.e., 𝐱3=h(𝐱2)\mathbf{x}_{3}=h(\mathbf{x}_{2}).

More generally, the set of stored benign models at node ii is given by 𝒩ki={𝐲1=h(𝐲2),𝐲2=h(𝐲3),,𝐲ri=h(𝐲ri1)}\mathcal{N}_{k}^{i}=\{\mathbf{y}_{1}=h(\mathbf{y}_{2}),\mathbf{y}_{2}=h(\mathbf{y}_{3}),\dots,\mathbf{y}_{r^{i}}=h(\mathbf{y}_{r^{i}-1})\}, where rir^{i} is the number of benign models in the set 𝒩ki\mathcal{N}_{k}^{i}. According to (31), we will have the following

𝔼[li(𝐲1)]𝔼[li(𝐲2)]\displaystyle\mathbb{E}\left[l_{i}(\mathbf{y}_{1})\right]{\leq}\mathbb{E}\left[l_{i}(\mathbf{y}_{2})\right]{\leq}\dots\leq 𝔼[li(𝐲ri)]𝔼[li(𝐱)]𝐱ki,\displaystyle\mathbb{E}\left[l_{i}(\mathbf{y}_{r^{i}})\right]{\leq}\mathbb{E}\left[l_{i}(\mathbf{x})\right]\;\forall\mathbf{x}\in\mathcal{B}_{k}^{i}, (33)

where the last inequality in (33) follows from the fact that the Byzantine nodes are sending faulty models and their expected loss is supposed to be higher than the expected loss of the benign nodes.

According to this discussion, and since (33) allows the Byzantine models to be filtered out, we can restrict attention to the benign subgraph generated by removing the Byzantine nodes, as discussed in Section III-A of the main paper. Note that by letting each active node send its updated model to the next b+1b+1 nodes, where bb is the total number of Byzantine nodes, the benign subgraph can always be kept connected. Considering the benign subgraph (the logical ring without Byzantine nodes), we assume without loss of generality that the indices of the benign nodes in the ring are arranged in ascending order from node 1 to node rr. In this benign subgraph, the update rule is given as follows

𝐱k(i)=𝐱k(i1)ηk(i)gi(𝐱k(i1)).\mathbf{x}_{k}^{(i)}=\mathbf{x}_{k}^{(i-1)}-\eta_{k}^{(i)}g_{i}(\mathbf{x}_{k}^{(i-1)}). (34)
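For intuition, the reduced rule (34) is simply sequential SGD over the benign ring; a toy sketch is given below, where grad is a hypothetical stand-in for the stochastic gradient oracle of benign node i.

def ring_sgd_round(x0, r, grad, lr):
    # One round of update rule (34): benign node i takes one SGD step on the
    # model received from benign node i-1.
    x = x0
    for i in range(1, r + 1):
        x = x - lr * grad(i, x)
    return x

# Toy usage: r = 10 benign nodes minimizing f(x) = x^2 / 2 with noiseless gradients.
x_final = ring_sgd_round(x0=5.0, r=10, grad=lambda i, x: x, lr=0.1)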

B-B Proof of Theorem 2

By using Taylor’s theorem, there exists a γ\gamma such that

f(𝐱k(i+1))=𝑎f(𝐱k(i)ηk(i+1)gi+1(𝐱k(i)))\displaystyle{f}({\mathbf{x}_{k}^{(i+1)}})\overset{a}{=}{f}\left(\mathbf{x}_{k}^{(i)}-\eta_{k}^{(i+1)}g_{i+1}({\mathbf{x}}^{(i)}_{k})\right)
=f(𝐱k(i))ηk(i+1)(gi+1(𝐱k(i)))Tf(𝐱k(i))\displaystyle={f}(\mathbf{x}_{k}^{(i)})-\eta_{k}^{(i+1)}\left(g_{i+1}({\mathbf{x}}^{(i)}_{k})\right)^{T}\nabla{f}({\mathbf{x}_{k}^{(i)}})
+12ηk(i+1)(gi+1(𝐱k(i)))T2f(γ)ηk(i+1)gi+1(𝐱k(i))\displaystyle+\frac{1}{2}\eta_{k}^{(i+1)}\left(g_{i+1}({\mathbf{x}}^{(i)}_{k})\right)^{T}\nabla^{2}{f}({\gamma})\eta_{k}^{(i+1)}g_{i+1}({\mathbf{x}}^{(i)}_{k})
𝑏f(𝐱k(i))ηk(i+1)(gi+1(𝐱k(i)))Tf(𝐱k(i))\displaystyle\overset{b}{\leq}{f}(\mathbf{x}_{k}^{(i)})-\eta_{k}^{(i+1)}\left(g_{i+1}({\mathbf{x}}^{(i)}_{k})\right)^{T}\nabla{f}({\mathbf{x}_{k}^{(i)}})
+L2ηk(i+1)||gi+1(𝐱k(i)))||2,\displaystyle+\frac{L}{2}\eta_{k}^{(i+1)}||g_{i+1}({\mathbf{x}}^{(i)}_{k}))||^{2}, (35)

where (a) follows from the update rule in (34), while ff is the global loss function in equation (1) in the main paper, and (b) follows from Assumption 3, where 2f(γ)L||\nabla^{2}{f}({\gamma})||\leq L. Given the model 𝐱k(i)\mathbf{x}_{k}^{(i)}, we take the expectation over the randomness in the selection of the sample ζi+1\mathcal{\zeta}_{i+1} (the random sample used to obtain the model 𝐱k(i+1)\mathbf{x}_{k}^{(i+1)}). We recall that ζi+1\mathcal{\zeta}_{i+1} is drawn according to the distribution 𝒫i+1\mathcal{P}_{i+1} and is independent of the model 𝐱k(i)\mathbf{x}_{k}^{(i)}. Therefore, we get the following set of equations:

𝔼[f(𝐱k(i+1))]f(𝐱k(i))ηk(i)𝔼[(gi+1(𝐱k(i)))]Tf(𝐱k(i))\displaystyle\mathbb{E}[{f}({\mathbf{x}_{k}^{(i+1)}})]\leq f(\mathbf{x}_{k}^{(i)})-\eta_{k}^{(i)}\mathbb{E}\left[(g_{i+1}({\mathbf{x}}^{(i)}_{k}))\right]^{T}\nabla{f}({\mathbf{x}_{k}^{(i)}})
+(ηk(i))2L2𝔼gi+1(𝐱k(i))2\displaystyle+\frac{(\eta_{k}^{(i)})^{2}L}{2}\mathbb{E}||g_{i+1}({\mathbf{x}}^{(i)}_{k})||^{2}
𝑎\displaystyle\overset{a}{\leq} f(𝐱k(i))ηk(i)f(𝐱k(i))2+(ηk(i))2L2f(𝐱k(i))2\displaystyle f(\mathbf{x}_{k}^{(i)})-\eta_{k}^{(i)}||\nabla f(\mathbf{x}_{k}^{(i)})||^{2}+\frac{(\eta_{k}^{(i)})^{2}L}{2}||\nabla f(\mathbf{x}_{k}^{(i)})||^{2}
+(ηk(i))2L2σ2\displaystyle+\frac{(\eta_{k}^{(i)})^{2}L}{2}\sigma^{2}
=\displaystyle= f(𝐱k(i))f(𝐱k(i))2(ηk(i)(ηk(i))2L2)\displaystyle f(\mathbf{x}_{k}^{(i)})-||\nabla f(\mathbf{x}_{k}^{(i)})||^{2}\left(\eta_{k}^{(i)}-\frac{(\eta_{k}^{(i)})^{2}L}{2}\right)
+(ηk(i))2L2σ2\displaystyle+\frac{(\eta_{k}^{(i)})^{2}L}{2}\sigma^{2}
𝑏\displaystyle\overset{b}{\leq} f(𝐱k(i))ηk(i)2f(𝐱k(i))2+ηk(i)2σ2,\displaystyle f(\mathbf{x}_{k}^{(i)})-\frac{\eta_{k}^{(i)}}{2}||\nabla f(\mathbf{x}_{k}^{(i)})||^{2}+\frac{\eta_{k}^{(i)}}{2}\sigma^{2}, (36)

where (a) follows from Assumption 2 and (30), and (b) from selecting ηk(i)1L\eta_{k}^{(i)}\leq\frac{1}{L}. Furthermore, in the proof of Theorem 1, we chose the learning rate to be ηk(i)1L\eta_{k}^{(i)}\geq\frac{1}{L}. Therefore, the learning rate is given by ηk(i)=1L\eta_{k}^{(i)}=\frac{1}{L}. By the convexity of the loss function ff, we get the following inequality from (36):

𝔼[f(𝐱k(i+1))]\displaystyle\mathbb{E}[{f}({\mathbf{x}_{k}^{(i+1)}})]\leq f(𝐱)+f(𝐱k(i)),𝐱k(i)𝐱\displaystyle f(\mathbf{x}^{*})+\langle\,\nabla f(\mathbf{x}_{k}^{(i)}),\mathbf{x}_{k}^{(i)}-\mathbf{x}^{*}\rangle
ηk(i)2f(𝐱k(i))2+ηk(i)2σ2.\displaystyle-\frac{\eta_{k}^{(i)}}{2}||\nabla f(\mathbf{x}_{k}^{(i)})||^{2}+\frac{\eta_{k}^{(i)}}{2}\sigma^{2}. (37)

We now back-substitute gi+1(𝐱k(i))g_{i+1}({\mathbf{x}}^{(i)}_{k}) into (37) by using 𝔼[gi+1(𝐱k(i))]=f(𝐱k(i))\mathbb{E}[g_{i+1}({\mathbf{x}}^{(i)}_{k})]=\nabla f(\mathbf{x}_{k}^{(i)}) and f(𝐱k(i))2𝔼gi+1(𝐱k(i))2σ2||\nabla f(\mathbf{x}_{k}^{(i)})||^{2}\geq\mathbb{E}||g_{i+1}({\mathbf{x}}^{(i)}_{k})||^{2}-\sigma^{2}:

𝔼[f(𝐱k(i+1))]f(𝐱)+𝔼[gi+1(𝐱k(i))],𝐱k(i)𝐱\displaystyle\mathbb{E}[{f}({\mathbf{x}_{k}^{(i+1)}})]\leq f(\mathbf{x}^{*})+\langle\,\mathbb{E}[g_{i+1}({\mathbf{x}}^{(i)}_{k})],\mathbf{x}_{k}^{(i)}-\mathbf{x}^{*}\rangle
ηk(i)2𝔼gi+1(𝐱k(i))2+ηk(i)σ2\displaystyle-\frac{\eta_{k}^{(i)}}{2}\mathbb{E}||g_{i+1}({\mathbf{x}}^{(i)}_{k})||^{2}+\eta_{k}^{(i)}\sigma^{2}
=f(𝐱)+𝔼[gi+1(𝐱k(i)),𝐱k(i)𝐱\displaystyle=f(\mathbf{x}^{*})+\mathbb{E}[\langle\,g_{i+1}({\mathbf{x}}^{(i)}_{k}),\mathbf{x}_{k}^{(i)}-\mathbf{x}^{*}\rangle
ηk(i)2||gi+1(𝐱k(i))||2]+ηk(i)σ2.\displaystyle-\frac{\eta_{k}^{(i)}}{2}||g_{i+1}({\mathbf{x}}^{(i)}_{k})||^{2}]+\eta_{k}^{(i)}\sigma^{2}. (38)

Completing the square in the middle two terms, we get:

𝔼[f(𝐱k(i+1))]f(𝐱)\displaystyle\mathbb{E}[{f}({\mathbf{x}_{k}^{(i+1)}})]\leq f(\mathbf{x}^{*})
+\displaystyle+ 𝔼[12ηk(i)(𝐱k(i)𝐱2𝐱k(i)𝐱ηk(i)gi+1(𝐱k(i))2)]\displaystyle\mathbb{E}\left[\frac{1}{2\eta_{k}^{(i)}}\left(||\mathbf{x}_{k}^{(i)}{-}\mathbf{x}^{*}||^{2}{-}||\mathbf{x}_{k}^{(i)}{-}\mathbf{x}^{*}{-}\eta_{k}^{(i)}g_{i+1}({\mathbf{x}}^{(i)}_{k})||^{2}\right)\right]
+\displaystyle+ ηk(i)σ2.\displaystyle\eta_{k}^{(i)}\sigma^{2}.
=\displaystyle= f(𝐱)+𝔼[12ηk(i)(𝐱k(i)𝐱2𝐱k(i+1)𝐱2)]\displaystyle f(\mathbf{x}^{*}){+}\mathbb{E}\left[\frac{1}{2\eta_{k}^{(i)}}\left(||\mathbf{x}_{k}^{(i)}{-}\mathbf{x}^{*}||^{2}{-}||\mathbf{x}_{k}^{(i+1)}{-}\mathbf{x}^{*}||^{2}\right)\right]
+ηk(i)σ2.\displaystyle{+}\eta_{k}^{(i)}\sigma^{2}. (39)

For KK rounds and rr benign nodes, we note that the total number of SGD steps is T=KrT=Kr. We let s=kr+is=kr+i represent the number of updates applied to the initial model 𝐱0\mathbf{x}^{0}, where i=1,,ri=1,\dots,r and k=0,,K1k=0,\dots,K-1. Therefore, 𝐱k(i){\mathbf{x}_{k}^{(i)}} can be written as 𝐱s\mathbf{x}^{s}. With this modified notation, we can now take the expectation in the above expression over the entire sampling process during training and then sum the above equations for s=0,,T1s=0,\dots,T-1; taking η=1L\eta=\frac{1}{L}, we have the following:

s=0T1(𝔼[f(𝐱s+1)]f(𝐱))\displaystyle\sum_{s=0}^{T-1}\left(\mathbb{E}[{f}({\mathbf{x}^{s+1}})]-f(\mathbf{x}^{*})\right)
L2(𝐱0𝐱2𝔼[𝐱k(T)𝐱2])+1LTσ2.\displaystyle\leq\frac{L}{2}\left(||\mathbf{x}_{0}-\mathbf{x}^{*}||^{2}-\mathbb{E}[||\mathbf{x}_{k}^{(T)}-\mathbf{x}^{*}||^{2}]\right)+\frac{1}{L}T\sigma^{2}. (40)

By using the convexity of f(.)f(.), we get

𝔼[f(1Ts=1T𝐱s)]f(𝐱)\displaystyle\mathbb{E}\left[f\left(\frac{1}{T}\sum_{s=1}^{T}\mathbf{x}^{s}\right)\right]-f(\mathbf{x}^{*})\leq 1Ts=0T1𝔼[f(𝐱s+1)]f(𝐱)\displaystyle\frac{1}{T}\sum_{s=0}^{T-1}\mathbb{E}\left[f(\mathbf{x}^{s+1})\right]-f(\mathbf{x}^{*})
𝐱0𝐱2L2T+1Lσ2.\displaystyle\leq\frac{||\mathbf{x}^{0}-\mathbf{x}^{*}||^{2}L}{2T}+\frac{1}{L}\sigma^{2}. (41)

Appendix C Joining and Leaving of Nodes

Basil can handle the scenarios of 1) node dropouts out of the NN available nodes, and 2) nodes rejoining the system.

C-A Nodes Dropout

For handling node dropouts, we allow for extra communication between nodes. In particular, each active node can multicast its model to the S=b+d+1S{=}b{+}d{+}1 clockwise neighbors, where bb and dd are respectively the number of Byzantine nodes and the worst case number of dropped nodes, and each node can store only the latest b+1b{+}1 model updates. By doing that, each benign node will have at least 1 benign update even in the worst case where all Byzantine nodes appear in a row and dd (out of SS) counterclockwise nodes drop out.

C-B Nodes Rejoining

To address a node rejoining the system, the rejoined node can re-multicast its ID to all other nodes. Since benign nodes know the correct order of the nodes (IDs) in the ring according to Section III-A, each active node out of the L=b+d+1L{=}b{+}d{+}1 counterclockwise neighbors of the rejoined node sends its model to it, and the rejoined node stores the latest b+1b{+}1 models. We note that handling the participation of fresh nodes during training is out of the scope of our paper, as we consider mitigating Byzantine nodes in decentralized training with a fixed number of NN nodes.

Appendix D Proof of Proposition 3

We first prove the communication cost given in Proposition 3, which corresponds to node 1g1_{g}, for g[G]g\in[G]. We recall from Section IV that in ACDS, each node i𝒩i\in\mathcal{N} has HH batches each of size αDH\frac{\alpha D}{H} data points. Furthermore, for each group g[G]g\in[G], the anonymous cyclic data sharing phase (phase 2) consists of H+1H+1 rounds. The communication cost of node 1g1_{g} in the first round is αDHI\frac{\alpha D}{H}I bits, where αDH\frac{\alpha D}{H} is the size of one batch and II is the size of one data point in bits. The cost of each round h[2,H+1]h\in[2,H+1] is nαDHIn\frac{\alpha D}{H}I, where node 1g1_{g} sends the set of shuffled data from the nn batches {c1gh,c2gh1,,cngh1}\{c^{h}_{1_{g}},c^{h-1}_{2_{g}},\dots,c^{h-1}_{n_{g}}\} to node 2g2_{g}. Hence, the total communication cost for node 1g1_{g} in this phase is given by CACDSphase-2=αDI(1H+n)C_{\text{ACDS}}^{\text{phase-2}}=\alpha DI(\frac{1}{H}+n). In phase 3, node 1g1_{g} multicasts its set of shuffled data from batches {c1gh,c2gh,,cngh}h[H]\{c^{h}_{1_{g}},c^{h}_{2_{g}},\dots,c^{h}_{n_{g}}\}_{h\in[H]} to all nodes in the other groups at a cost of nαDIn\alpha DI bits. Finally, node 1g1_{g} receives (G1)(G-1) set of batches {c1gh,c2gh,,cngh}h[H],g[G]\{g}\{c^{h}_{1_{g^{\prime}}},c^{h}_{2_{g^{\prime}}},\dots,c^{h}_{n_{g^{\prime}}}\}_{h\in[H],g^{\prime}\in[G]\backslash\{g\}} at a cost of (G1)nαDI(G-1)n\alpha DI. Hence, the communication cost of the third phase of ACDS is given by CACDSphase-3=αDnGIC_{\text{ACDS}}^{\text{phase-3}}=\alpha DnGI. By adding the cost of Phase 2 and Phase 3, we get the first result in Proposition 3.

Now, we prove the communication time of ACDS by first computing the time needed to complete the anonymous cyclic data sharing phase (phase 2), and then computing the time of the multicasting phase. The second phase of ACDS consists of H+1H+1 rounds. The communication time of the first round is given by TR1=i=1niTT_{R_{1}}=\sum_{i=1}^{n}iT, where nn is the number of nodes in each group. Here, T=αDIHRT=\frac{\alpha DI}{HR} is the time needed to send one batch of αDH\frac{\alpha D}{H} data points, with RR being the communication bandwidth in b/s and II the size of one data point in bits. On the other hand, the time for each round h[2,H]h\in[2,H] is given by TRh=n2TT_{R_{h}}=n^{2}T, where each node sends nn batches. Finally, the time for completing the dummy round, the (H+1)(H+1)-th round, is given by TRH+1=n(n2)TT_{R_{H+1}}=n(n-2)T, where only the first n2n-2 nodes in the ring participate in this round, as discussed in Section IV. Therefore, the total time for completing the anonymous cyclic data sharing phase (phase 2 of ACDS) is given by Tphase-2=TR1+(H1)TRh+TRH+1=T(n2(H+0.5)1.5n)T_{\text{phase-2}}=T_{R_{1}}+(H-1)T_{R_{h}}+T_{R_{H+1}}=T(n^{2}(H+0.5)-1.5n); note that this phase happens in parallel across all GG groups. The time for completing the multicasting phase is Tphase-3=(G1)nHTT_{\text{phase-3}}=(G-1)nHT, where each node in group gg receives nHnH batches from each node 1g1_{g^{\prime}} in group g[G]\{g}g^{\prime}\in[G]\backslash\{g\}. By adding Tphase-2T_{\text{phase-2}} and Tphase-3T_{\text{phase-3}}, we get the communication time of ACDS given in Proposition 3.
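For concreteness, the two totals derived in this appendix can be evaluated as follows; the numeric arguments in the example calls are hypothetical placeholders (e.g., I is taken as roughly the size of one CIFAR10 image in bits) rather than the exact values used in the experiments.

def acds_comm_cost_node1(alpha, D, I, H, n, G):
    # Total bits handled by node 1_g: phase 2 plus phase 3, as derived above.
    phase2 = alpha * D * I * (1.0 / H + n)
    phase3 = alpha * D * n * G * I
    return phase2 + phase3

def acds_comm_time(alpha, D, I, H, n, G, R):
    # Wall-clock time of phases 2 and 3, with T the time to send one batch.
    T = alpha * D * I / (H * R)
    phase2 = T * (n ** 2 * (H + 0.5) - 1.5 * n)
    phase3 = (G - 1) * n * H * T
    return phase2 + phase3

print(acds_comm_cost_node1(alpha=0.05, D=500, I=24576, H=4, n=25, G=4) / 1e6, "Mbits")
print(acds_comm_time(alpha=0.05, D=500, I=24576, H=4, n=25, G=4, R=1e8), "seconds")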

Appendix E Proof of Proposition 4

We recall from Section III that the per-round training time of Basil is divided into four parts. In particular, each active node (1) receives the models from its S counterclockwise neighbors; (2) evaluates these S models using the Basil aggregation rule; (3) updates the model by taking one step of SGD; and (4) multicasts the updated model to its next S clockwise neighbors.

Assuming training begins at time 0, we define E_i^{(k)} to be the wall-clock time at which node i finishes its training in round k. We also define T_com = 32d/R to be the time to receive one model of d elements, each represented by 32 bits; note that each node receives only one model at each step in the ring, as the training is sequential. Furthermore, we let T_comp = T_perf-based + T_SGD be the time needed to evaluate S models and perform one step of SGD update.

We assume that each node i ∈ 𝒩 becomes active and starts evaluating the S models (using (3)) and taking the SGD model update step (using (2)) only when it receives the model from its counterclockwise neighbor. Therefore, for the first round, we have the following time recursion:

E1(1)\displaystyle E_{1}^{(1)} =TSGD\displaystyle=T_{\text{SGD}} (42)
En(1)\displaystyle E_{n}^{(1)} =En1(1)+Tcom+(n1)Tperf-based+TSGD for n[2,S]\displaystyle{=}E_{n-1}^{(1)}+T_{\text{com}}{+}(n-1)T_{\text{perf-based}}+T_{\text{SGD}}\text{ for }n\in[2,S] (43)
En(1)\displaystyle E_{n}^{(1)} =En1(1)+Tcom+Tcomp for n[S+1,N],\displaystyle=E_{n-1}^{(1)}+T_{\text{com}}+T_{\text{comp}}\text{ for }n\in[S+1,N], (44)

where (42) follows from the fact that node 1 just takes one step of model update using the initial model x^0. Each node i ∈ [2, S] receives the model from its counterclockwise neighbor, evaluates the (i−1) received models, and then takes one step of model update. Each node i ∈ [S+1, N] has S models to evaluate, and its time recursion follows (44).

The time recursions for the remaining rounds, where the training is assumed to span τ rounds in total, are given by

E1(k+1)=En(k)+Tcom+Tcomp for k[τ1]\displaystyle E_{1}^{(k+1)}=E_{n}^{(k)}+T_{\text{com}}+T_{\text{comp}}\text{ for }k\in[\tau-1] (45)
En(k)=En1(k)+Tcom+Tcomp for n[N]\{1},k[τ]\displaystyle E_{n}^{(k)}=E_{n-1}^{(k)}+T_{\text{com}}+T_{\text{comp}}\;\text{ for }n\in[N]\backslash\{1\},k\in[\tau] (46)
E1(τ+1)=En(τ)+Tcom+STperf-based.\displaystyle E_{1}^{(\tau+1)}=E_{n}^{(\tau)}+T_{\text{com}}+ST_{\text{perf-based}}. (47)

By telescoping (42)-(47), we get the upper bound in (15).
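For readers who prefer code to telescoping, the recursions (42)-(47) can be transcribed directly as below; the function name is ours, and T_com, T_perf-based, and T_SGD are passed in as constants, so this is a sketch rather than the authors' implementation.

```python
def basil_finish_time(N, S, tau, t_com, t_perf, t_sgd):
    """Literal transcription of recursions (42)-(47) for Basil (sketch, assuming S >= 2 and N > S).
    Returns E_1^{(tau+1)}, the wall-clock time at which node 1 finishes after the final round."""
    t_comp = t_perf + t_sgd
    E = [0.0] * (N + 1)                    # E[i] holds E_i^{(k)} for the current round k
    E[1] = t_sgd                           # eq. (42)
    for i in range(2, S + 1):              # eq. (43)
        E[i] = E[i - 1] + t_com + (i - 1) * t_perf + t_sgd
    for i in range(S + 1, N + 1):          # eq. (44)
        E[i] = E[i - 1] + t_com + t_comp
    for _ in range(tau - 1):               # eqs. (45)-(46) for rounds 2, ..., tau
        E[1] = E[N] + t_com + t_comp
        for i in range(2, N + 1):
            E[i] = E[i - 1] + t_com + t_comp
    return E[N] + t_com + S * t_perf       # eq. (47)
```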

The training time of Basil+ in (V-A) can be proven by computing the time of each stage of the algorithm. In Stage 1, all groups apply the Basil algorithm in parallel, with the training within each group carried out sequentially. This results in a training time of T_stage1 ≤ nτT_perf-based + nτT_comm + nτT_SGD. The time of the robust circular aggregation stage is given by T_stage2 = GT_perf-based + SGT_comm. The first term comes from the fact that each node in the set 𝒮_g evaluates, in parallel, the S models received from the nodes in 𝒮_{g−1}, while the second term comes from the fact that each node in 𝒮_g receives S models from the nodes in 𝒮_{g−1}. The factor G in this stage results from the sequential aggregation over the G groups. The time of the final stage (the multicasting stage) is given by T_stage3 = T_perf-based + (G−1)T_comm, where the first term follows from the fact that all nodes in the set {𝒰_1, 𝒰_2, …, 𝒰_{G−1}} evaluate the S robust average models in parallel, while the second term follows from the time needed by each corresponding node in the remaining groups to receive the S robust average models. By combining the time of the three stages, we get the training time given in (V-A).
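The three stage times can likewise be combined in a few lines; the following sketch uses our own variable names for the quantities n, G, S, τ, T_perf-based, T_comm, and T_SGD appearing above, and is illustrative only.

```python
def basil_plus_time(n, G, S, tau, t_perf, t_comm, t_sgd):
    """Stage-wise training-time bound for Basil+ as derived above (illustrative sketch)."""
    t_stage1 = n * tau * (t_perf + t_comm + t_sgd)   # parallel Basil within each group
    t_stage2 = G * t_perf + S * G * t_comm           # robust circular aggregation over G groups
    t_stage3 = t_perf + (G - 1) * t_comm             # multicasting stage
    return t_stage1 + t_stage2 + t_stage3
```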

Appendix F Proof of Proposition 5

Proposition 5. The connectivity parameter S in Basil+ can be relaxed to S < n−1 while guaranteeing the success of the algorithm (benign local/global subgraph connectivity) with high probability. The failure probability of Basil+ is upper bounded by

(F)Gi=0min(b,n)(s=0S1max(is,0)(Ns)n)(bi)(Nbni)(Nn),\displaystyle\mathbb{P}(F)\leq G\sum_{i=0}^{\min{(b,n)}}\left(\prod_{s=0}^{S-1}\frac{\max\left(i-s,0\right)}{(N-s)}n\right)\frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}, (48)

where N, n, G, S, and b are the total number of nodes, the number of nodes in each group, the number of groups, the connectivity parameter, and the total number of Byzantine nodes, respectively.

Proof. At a high level, Basil+ fails if at least one group out of the G groups fails, i.e., the set of models ℒ_g sent from the set 𝒮_g of some group g to group g+1 is faulty. According to the discussion in the proof of Proposition 2, group g fails when S Byzantine nodes come in a row within that group.

Now, we formally prove the failure probability of Basil+. We start our proof by defining the failure event of Basil+ by

F=g=1GFg,F=\bigcup_{g=1}^{G}F_{g}, (49)

where FgF_{g} is the failure event of group gg and GG is the number of groups. The failure probability of group gg is given by

(Fg)=i=0min(b,n)(Fg|bg=i)(bg=i),\mathbb{P}(F_{g})=\sum_{i=0}^{\min{(b,n)}}\mathbb{P}(F_{g}\>\Big{|}\>b_{g}=i)\mathbb{P}(b_{g}=i), (50)

where b_g is the number of Byzantine nodes in group g. Equation (50) follows from the law of total probability. The conditional probability in (50) represents the failure probability of group g given that i nodes in that group are Byzantine. This conditional group failure probability can be bounded similarly to the failure probability in (20) in Proposition 2. In particular, it is bounded as

(Fg|bg=i)j=1n(Aj|bg=i),\mathbb{P}(F_{g}\>\Big{|}\>b_{g}=i)\leq\sum_{j=1}^{n}\mathbb{P}(A_{j}\>\Big{|}\>b_{g}=i), (51)

where A_j is the event that S Byzantine nodes come in a row starting from node j, given that there are i Byzantine nodes in that group. The probability P(A_j | b_g = i) is given as follows

(Aj|bg=i)=s=0S1max(is,0)(Ns),\mathbb{P}(A_{j}\>\Big{|}\>b_{g}=i)=\prod_{s=0}^{S-1}\frac{\max\left(i-s,0\right)}{(N-s)}, (52)

where i is the number of Byzantine nodes in group g and S is the connectivity parameter in that group. By combining (52) with (51), we obtain the following bound on the conditional probability in the first term of (50):

(Fg|bg=i)j=1n(Aj|bg=i)=s=0S1max(is,0)(Ns)n.\mathbb{P}(F_{g}\>\Big{|}\>b_{g}=i)\leq\sum_{j=1}^{n}\mathbb{P}(A_{j}\>\Big{|}\>b_{g}=i)=\prod_{s=0}^{S-1}\frac{\max\left(i-s,0\right)}{(N-s)}n. (53)

The probability P(b_g = i) in the second term of (50) follows a hypergeometric distribution with parameters (N, b, n), where N, b, and n are the total number of nodes, the total number of Byzantine nodes, and the number of nodes in each group, respectively. This probability is given by

(bg=i)=(bi)(Nbni)(Nn).\mathbb{P}(b_{g}=i)=\frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}. (54)

By substituting (54) and (53) into (50), we get the following bound on the failure probability of one group in Basil+:

(Fg)i=0min(b,n)(s=0S1max(is,0)(Ns)n)(bi)(Nbni)(Nn).\mathbb{P}(F_{g})\leq\sum_{i=0}^{\min{(b,n)}}\left(\prod_{s=0}^{S-1}\frac{\max\left(i-s,0\right)}{(N-s)}n\right)\frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}. (55)

Finally, the failure probability of Basil+ is bounded as

(F)\displaystyle\mathbb{P}(F) =(g=1GFg)\displaystyle=\mathbb{P}(\bigcup_{g=1}^{G}F_{g})
𝑎g=1G(Fg)\displaystyle\overset{a}{\leq}\sum_{g=1}^{G}\mathbb{P}(F_{g})
=Gi=0min(b,n)(s=0S1max(is,0)(Ns)n)(bi)(Nbni)(Nn),\displaystyle=G\sum_{i=0}^{\min{(b,n)}}\left(\prod_{s=0}^{S-1}\frac{\max\left(i-s,0\right)}{(N-s)}n\right)\frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}, (56)

where (a) follows from the union bound. \hfill\square
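The bound in (56) is straightforward to evaluate numerically, e.g., when choosing S for given N, n, G, and b; a minimal sketch (our code, not part of the paper's implementation) is given below.

```python
from math import comb

def basil_plus_failure_bound(N, n, G, S, b):
    """Upper bound on P(F) from (56): G * sum_i [prod_s max(i-s,0)/(N-s) * n] * P(b_g = i)."""
    total = 0.0
    for i in range(min(b, n) + 1):
        run = 1.0
        for s in range(S):
            run *= max(i - s, 0) / (N - s)                       # S Byzantine nodes in a row
        hyper = comb(b, i) * comb(N - b, n - i) / comb(N, n)     # hypergeometric P(b_g = i)
        total += run * n * hyper
    return G * total
```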

Appendix G UBAR

In this section, we describe UBAR [10], the state-of-the-art Byzantine-resilient approach for parallel decentralized training.

G-A Algorithm

This decentralized training setup is defined over an undirected graph 𝒢 = (V, E), where V denotes the set of N nodes and E denotes the set of edges representing communication links. Byzantine nodes are filtered in two stages in each training iteration. In the first stage, each benign node performs a distance-based strategy to select a candidate pool of potentially benign nodes from its neighbors. This selection is performed by comparing the Euclidean distance between its own model and the model of each neighbor. In the second stage, each benign node performs a performance-based strategy to pick the final nodes from the candidate pool resulting from stage 1. It reuses a training sample as validation data to compute the loss value of each model, selects the models whose loss values are smaller than that of its own model, and calculates the average of those models as the final updated value. Formally, the update rule in UBAR is given by

𝐱k+1(i)=α𝐱k(i)+(1α)UBAR(𝐱k(j),j𝒩i)ηfi(𝐱k(i)),\mathbf{x}^{(i)}_{k+1}=\alpha\mathbf{x}^{(i)}_{k}+(1-\alpha)\mathcal{R}_{\text{UBAR}}(\mathbf{x}^{(j)}_{k},j\in\mathcal{N}_{i})-\eta{\nabla}f_{i}({\mathbf{x}}^{(i)}_{k}), (57)

where 𝒩_i is the set of neighbors of node i, ∇f_i(x_k^{(i)}) is the local gradient of node i evaluated on a random sample from its local dataset using its own model, k is the training round, and ℛ_UBAR is given as follows:

UBAR={1𝒩i,krj𝒩i,kr𝐱k(j) if 𝒩i,krϕ𝐱kj Otherwise,\mathcal{R}_{\text{UBAR}}=\begin{cases}\frac{1}{\mathcal{N}^{r}_{i,k}}\sum_{j\in\mathcal{N}^{r}_{i,k}}\mathbf{x}^{(j)}_{k}&\text{ if }\mathcal{N}^{r}_{i,k}\neq\phi\\ \mathbf{x}^{j^{*}}_{k}&\text{ Otherwise,}\end{cases} (58)

where there are two stages of filtering:

(1) \mathcal{N}^{s}_{i,k}{=}\arg\min_{\begin{subarray}{c}\mathcal{N}^{*}\subset\mathcal{N}_{i},\\ |\mathcal{N}^{*}|=\rho_{i}|\mathcal{N}_{i}|\end{subarray}}\sum_{j\in\mathcal{N}^{*}}||\mathbf{x}^{(j)}_{k}-\mathbf{x}^{(i)}_{k}||,
(2) \mathcal{N}^{r}_{i,k}{=}\bigcup_{\begin{subarray}{c}j\in\mathcal{N}^{s}_{i,k}\\ {\ell}_{i}(\mathbf{x}^{(j)}_{k})\leq{\ell}_{i}(\mathbf{x}^{(i)}_{k})\end{subarray}}j,\text{ and }j^{*}{=}\arg\min_{j\in\mathcal{N}^{s}_{i,k}}{\ell}_{i}(\mathbf{x}^{(j)}_{k}).
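To make the two-stage rule concrete, the following NumPy sketch implements (57)-(58) for a single benign node. The function signature, the flattening of models into vectors, and the loss/gradient callbacks are our own simplifications for illustration; this is not the reference UBAR implementation.

```python
import numpy as np

def ubar_step(x_own, neighbor_models, loss_fn, grad_fn, rho=0.33, alpha=0.5, eta=0.01):
    """One UBAR update at a benign node (simplified sketch of (57)-(58)).
    neighbor_models: list of flattened models (np.ndarray) received from the neighbors.
    loss_fn / grad_fn: loss and gradient evaluated on a training sample of this node."""
    # Stage 1 (distance-based): keep the rho*|N_i| models closest to our own model.
    dists = np.array([np.linalg.norm(x - x_own) for x in neighbor_models])
    keep = max(1, int(rho * len(neighbor_models)))
    pool = [neighbor_models[j] for j in np.argsort(dists)[:keep]]
    # Stage 2 (performance-based): keep pooled models whose loss is no worse than our own.
    own_loss = loss_fn(x_own)
    selected = [x for x in pool if loss_fn(x) <= own_loss]
    if selected:
        agg = np.mean(selected, axis=0)       # average of the selected models
    else:
        agg = min(pool, key=loss_fn)          # fallback: best-loss model in the candidate pool
    return alpha * x_own + (1 - alpha) * agg - eta * grad_fn(x_own)   # update rule (57)
```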

G-B Time Analysis for UBAR

The training time of UBAR is divided into two parts: computation time and communication time. We start with the communication time. For modeling the communication time, we assume that each node can, in parallel, multicast its model to its neighbors and receive the models from its neighbors, where each node is assumed to be connected to S neighbors. Hence, the time for a node to multicast its model to its S neighbors is 32d/R, where d is the model size, each element of the model is represented by 32 bits, and R is the communication bandwidth in b/s. On the other hand, the time to receive S different models from the S neighbor nodes is given by 32Sd/R. We assume that in UBAR, each node starts the model evaluations and model update after receiving the S models (the same assumption is used when computing the training time for Basil). Therefore, given that each node starts the training procedure only when it has received the S models, while all communications in the graph happen in parallel, the communication time in one parallel round of UBAR is given as follows:

T_{\text{UBAR-communication}}=\frac{32Sd}{R}. (59)

The computation time of UBAR is given by

TUBAR-computation=Tdist-based+Tperf-based+Tagg+TSGD,T_{\text{UBAR-computation}}=T_{\text{dist-based}}+T_{\text{perf-based}}+T_{\text{agg}}+T_{\text{SGD}}, (60)

where Tdist-basedT_{\text{dist-based}}, Tperf-basedT_{\text{perf-based}}, TaggT_{\text{agg}} and TSGDT_{\text{SGD}} are respectively the times to apply the distance-based strategy, the performance-based strategy, the aggregation, and one step of SGD model update. Hence, the total training time when using UBAR for KK training rounds is given by

TUBAR=K(Tdist-based+Tperf-based+Tagg+TSGD+STcomm),T_{\text{UBAR}}=K(T_{\text{dist-based}}+T_{\text{perf-based}}+T_{\text{agg}}+T_{\text{SGD}}+ST_{\text{comm}}), (61)

where T_{\text{comm}}=\frac{32d}{R}.

Appendix H

In this section, we provide the details of the neural networks used in our experiments, some key details regarding the UBAR implementation, and multiple additional experiments to further demonstrate the superiority of our proposed algorithms. We start in Section H-A by describing the models that are used in our experiments in Section VI and in the additional experiments given in this section. In Section H-B, we discuss the implementation of UBAR. After that, we present additional experiments using the MNIST dataset [29] in Section H-C. In Section H-D, we study the computation time of Basil compared to UBAR, and the training performance of Basil and UBAR with respect to the training time. Finally, we study the performance of Basil and ACDS on the CIFAR100 dataset with non-IID data distribution in Section H-E, and the performance comparison between Basil and Basil+ in Section H-F.

Figure 10: Illustrating the results for MNIST under the IID data distribution setting. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip Attack, (d) Hidden Attack.
Figure 11: Illustrating the results for MNIST under the non-IID data distribution setting. Panels: (a) No Attack, (b) No Attack, (c) Gaussian Attack, (d) Random Sign Flip Attack.
Figure 12: Illustrating the performance of Basil compared with UBAR for MNIST under the non-IID data distribution setting with α = 5% data sharing. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip Attack.

H-A Models

We provide the details of the neural network architectures used in our experiments. For MNIST, we use a model with three fully connected layers, and the details for the same are provided in Table I. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer. A PyTorch sketch of this architecture is given after the table.

TABLE I: Details of the parameters in the architecture of the neural network used in our MNIST experiments.
Parameter | Shape
fc1 | 784 × 100
fc2 | 100 × 100
fc3 | 100 × 10
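The corresponding PyTorch module might look as follows; the class name is ours, and the softmax at the output is left to the loss function (a common implementation choice), so this is a sketch rather than the exact training code.

```python
import torch.nn as nn
import torch.nn.functional as F

class MnistMLP(nn.Module):
    """Three fully connected layers with the shapes listed in Table I (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 100)
        self.fc2 = nn.Linear(100, 100)
        self.fc3 = nn.Linear(100, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)      # flatten 28x28 MNIST images
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)             # softmax at the output is applied inside the loss
```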

For the CIFAR10 experiments in the main paper, we consider a neural network with two convolutional layers and three fully connected layers, and the specific details of these layers are provided in Table II. ReLU and maxpool are applied on the convolutional layers. The first maxpool has a kernel size of 3×3 and a stride of 3, and the second maxpool has a kernel size of 4×4 and a stride of 4. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer.

We initialize all biases to 0. Furthermore, for the weights in the convolutional layers, we use the Glorot uniform initializer, while for the weights in the fully connected layers, we use the default PyTorch initialization. A PyTorch sketch combining this architecture and initialization is given after Table II.

TABLE II: Details of the parameters in the architecture of the neural network used in our CIFAR10 experiments.
Parameter | Shape
conv1 | 3 × 16 × 3 × 3
conv2 | 16 × 64 × 4 × 4
fc1 | 64 × 384
fc2 | 384 × 192
fc3 | 192 × 10
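A PyTorch sketch consistent with Table II and the initialization above is given below. We assume stride-1, unpadded convolutions (our assumption, not stated explicitly in the text); with the maxpool sizes described above, this yields a flattened feature size of 64, matching fc1, and a total of 117,706 parameters, matching the model size d used in Appendix H-D.

```python
import torch.nn as nn
import torch.nn.functional as F

class Cifar10CNN(nn.Module):
    """Two convolutional and three fully connected layers per Table II (illustrative sketch)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)    # conv1: 3 x 16 x 3 x 3
        self.conv2 = nn.Conv2d(16, 64, kernel_size=4)   # conv2: 16 x 64 x 4 x 4
        self.fc1 = nn.Linear(64, 384)
        self.fc2 = nn.Linear(384, 192)
        self.fc3 = nn.Linear(192, num_classes)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)       # Glorot uniform for conv weights
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.zeros_(m.bias)                  # all biases initialized to 0

    def forward(self, x):                               # x: (batch, 3, 32, 32)
        x = F.max_pool2d(F.relu(self.conv1(x)), kernel_size=3, stride=3)  # -> (16, 10, 10)
        x = F.max_pool2d(F.relu(self.conv2(x)), kernel_size=4, stride=4)  # -> (64, 1, 1)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)              # softmax at the output is applied inside the loss
```

Setting num_classes=20 gives the output dimension used for the CIFAR100 experiments in Appendix H-E.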

H-B Implementing UBAR

We follow a similar approach to the one described in the experiments in [10]. Specifically, we first assign connections randomly among the benign nodes with probability 0.4 unless otherwise specified, and then randomly assign connections from the benign nodes to the Byzantine nodes, also with probability 0.4 unless otherwise specified. Furthermore, we set the Byzantine ratio for benign nodes to ρ = 0.33.
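A minimal sketch of this connectivity assignment is shown below; the function name, the use of an adjacency matrix, and the seeding are our own choices for illustration.

```python
import numpy as np

def build_ubar_graph(num_benign, num_byzantine, p=0.4, seed=0):
    """Randomly assign benign-benign and benign-Byzantine edges with probability p (sketch).
    Nodes 0..num_benign-1 are benign; the remaining nodes are Byzantine."""
    rng = np.random.default_rng(seed)
    N = num_benign + num_byzantine
    adj = np.zeros((N, N), dtype=bool)
    for i in range(num_benign):
        for j in range(i + 1, N):                 # benign-benign and benign-Byzantine pairs
            adj[i, j] = adj[j, i] = rng.random() < p
    return adj
```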

H-C Performance of Basil on MNIST

We present the results for MNIST in Fig. 10 and Fig. 11 under the IID and non-IID data distribution settings, respectively. As can be seen from Fig. 10 and Fig. 11, using Basil leads to the same conclusions drawn for the CIFAR10 dataset in the main paper in terms of fast convergence, high test accuracy, and Byzantine robustness compared to the different schemes. In particular, Fig. 10 under the IID data setting demonstrates that Basil is not only resilient to Byzantine attacks, but also maintains its superior convergence performance over UBAR. Furthermore, Fig. 11(a) and Fig. 11(b) illustrate that the test accuracy when using Basil and R-plain under the non-IID data setting increases when each node shares 5% of its local data with other nodes in the absence of Byzantine nodes. It can also be seen from Fig. 11(c) and Fig. 11(d) that ACDS with α = 5% on top of Basil provides the same robustness to software/hardware faults, represented by the Gaussian model and random sign flip, as concluded in the main paper. Additionally, we observe that both Basil without ACDS and UBAR completely fail in the presence of these faults.

Similar to the results in Fig. 6 in the main paper, Fig. 12 shows that even when 5% data sharing is done in UBAR, its performance remains quite low in comparison to Basil+ACDS.

H-D Wall-Clock Time Performance of Basil

In this section, we show the training performance of Basil and UBAR with respect to the training time instead of the number of rounds. To do so, we consider the following setting.

Figure 13: Illustrating the performance of Basil using the CIFAR10 dataset under the IID data distribution setting with respect to the training time. Panels: (a) using a communication bandwidth of 100 Mb/s, (b) using a communication bandwidth of 10 Mb/s.

Experimental setting. We consider the same setting discussed in Section VI-A in the main paper, where there is a total of 100 nodes, of which 67 are benign. For the dataset, we use CIFAR10. We also consider the Gaussian attack. We set the connectivity parameter for the two algorithms to S = 10.

Now, we start by giving the computation/communication time of Basil and UBAR.

Computation time. We measured the computation time of Basil and UBAR on a server with an AMD EPYC 7502 32-Core CPU processor. In particular, in TABLE III, we report the average running time of each main component (function) of UBAR and Basil. To do so, we take the average computation time over 10^3 runs of each component of the mitigation strategy in each training round, for 100 rounds. These components are the performance-based evaluation for Basil given in Section III, and the performance-based evaluation, the distance-based evaluation, and the model aggregation for UBAR given in Appendix G. We can see from TABLE III that the average time each benign node in UBAR takes to evaluate the received set of models and take one step of model update using SGD is about 2× that of Basil. The reason is that each benign node in UBAR performs two extra stages before taking the model update step: (1) distance-based evaluation and (2) model aggregation. The distance-based stage includes a comparison between the model of each benign node and the received set of models, which is a time-consuming operation compared to the performance-based evaluation, as shown in TABLE III.

TABLE III: The breakdown of the average computation time per node for Basil and UBAR.
Algorithm | Performance-based T_perf-based (s) | Distance-based T_dist-based (s) | Aggregation T_agg (s) | SGD step T_SGD (s) | Total computation time per node (s)
Basil | 0.019 | - | - | 0.006 | 0.025
UBAR | 0.012 | 0.027 | 0.002 | 0.006 | 0.047

Communication time. We consider an idealistic simulation, where the communication time of the trained model is proportional to the number of elements of the model. In particular, we simulate the time taken to send the model, as used in Section VI and described in Appendix H-B, by T_comm = 32d/R, where the model size is d = 117706 parameters, each represented by 32 bits, and R is the bandwidth in Mb/s. To get the training time after K rounds, we use the per-round time results for Basil and UBAR given in Proposition 4 in Section VI and in Appendix G-B, respectively, with K being the number of training rounds.

Results. Fig. 13 demonstrates the performance of Basil and UBAR with respect to the training time. As we can observe in Fig. 13, the time it takes for UBAR to reach its maximum accuracy is almost the same as the time it takes Basil to reach UBAR's maximum achievable accuracy. We recall from Section VI that UBAR needs 5× more computation/communication resources than Basil to reach ~41% test accuracy. The behavior of Basil and UBAR with respect to the training time is not surprising, since Basil takes far fewer training rounds to reach the same accuracy that UBAR can reach, as shown in Fig. 4. As a result, the latency resulting from the sequential training does not have a high impact in comparison to UBAR. Finally, we can see that the communication time is the bottleneck in this setting: when the bandwidth increases from 10 Mb/s to 100 Mb/s, the training time decreases significantly.

H-E Performance of Basil for Non-IID Data Distribution using CIFAR100

To demonstrate the practicality of ACDS and simulate the scenario where each node shares data only from the non-sensitive portion of its dataset, we consider the following experiment.

Dataset and hyperparameters. We run an image classification task on the CIFAR100 dataset [30]. This dataset is similar to CIFAR10 in having the same input dimension (32×32×3), except that it has 100 classes containing 600 images each, where each class has its own feature distribution. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR100 are grouped into 20 superclasses. For instance, the superclass Fish includes the five subclasses Aquarium fish, Flatfish, Ray, Shark, and Trout. In this experiment, we consider a system with a total of 100 nodes, of which 80 are benign. We set the connectivity parameter of Basil to S = 5. We use a decreasing learning rate of 0.03/(1+0.03k), where k denotes the training round. For the classification task, we only consider the superclasses as the target labels, i.e., we have 20 labels for our classification task.

Model architecture. We use the same neural network that is used for CIFAR10 in the main paper, which consists of 2 convolutional layers and 3 fully connected layers, with the modification that the output of the last layer has a dimension of 20. The details of this network are included in Appendix H-A.

Data distribution. For simulating the non-IID setting, we first shuffle the data within each superclass, then partition each superclass into 5 portions, and assign each node one partition. Hence, each node has data from only one superclass, including data from each of the corresponding 5 subclasses.

Data sharing. To simulate the case where each node shares data from the non-sensitive portion of its local data, we take advantage of the variation of the feature distribution across subclasses and model the sensitive and non-sensitive data at the subclass level. Towards this goal, we define γ ∈ (0,1] to represent the fraction of subclasses, out of the 5 available subclasses at a node, whose data the node considers non-sensitive. For instance, γ = 1 implies that all 5 subclasses are considered non-sensitive and nodes can share data from any of them (i.e., from their entire local dataset). On the other hand, γ = 0.4 means that all nodes consider only the first two subclasses of their data as non-sensitive and share data only from them. We note that for nodes that hold the same superclass, we assume the order of the subclasses is the same among them (e.g., if node 1 and node 2 both have data from the superclass Fish, then the subclass Aquarium fish is the first subclass at both of them). This ensures that, for γ = 0.4, data from 3 subclasses per superclass is never shared by any node. Finally, we allow each node to share αD data points from its non-sensitive subclasses, where D = 500 is the local dataset size at each node and α = 5%.
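The partitioning and sharing procedure described in this subsection and the previous one can be sketched as follows; the data format (a list of (superclass, subclass, image) tuples), the function name, and the seeding are our own assumptions for illustration.

```python
import random
from collections import defaultdict

def partition_and_share(samples, gamma=0.4, alpha=0.05, D=500, seed=0):
    """Non-IID CIFAR100 partition (one superclass per node, 5 nodes per superclass)
    and gamma-based sharing of alpha*D points from the non-sensitive subclasses (sketch)."""
    rng = random.Random(seed)
    by_super = defaultdict(list)
    for sample in samples:                                  # sample = (superclass, subclass, image)
        by_super[sample[0]].append(sample)
    local_data, shared_data, node = {}, {}, 0
    for superclass, items in by_super.items():
        rng.shuffle(items)                                  # shuffle within the superclass
        subclasses = sorted({s[1] for s in items})          # fixed subclass order per superclass
        non_sensitive = set(subclasses[: int(gamma * len(subclasses))])
        for part in (items[i::5] for i in range(5)):        # 5 partitions -> 5 nodes
            local_data[node] = part
            pool = [s for s in part if s[1] in non_sensitive]
            shared_data[node] = rng.sample(pool, min(int(alpha * D), len(pool)))
            node += 1
    return local_data, shared_data
```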

Figure 14: The performance of Basil under different data sharing scenarios in the presence of the Gaussian attack when the data distribution at the nodes is non-IID. Here, γ represents the fraction of subclasses, out of the 5 available subclasses at a node, whose data the node considers non-sensitive.

Results. Fig. 14 shows the performance of Basil in the presence of the Gaussian attack under different values of γ. As we can see, even when each node shares data from only two subclasses (γ = 0.4) out of the five available subclasses, Basil gives a performance comparable to the case where each node shares data from its entire dataset (γ = 1). The intuition behind this good performance in the presence of Byzantine nodes is that, although data from three subclasses in each superclass is never shared, there are nodes in the system that originally hold data from these sensitive subclasses; when they train the model on their local datasets, the augmented side information from the shared data helps maintain the model performance and resilience to Byzantine faults. Furthermore, we can see that R-plain fails in the presence of the attack, demonstrating that the ACDS data sharing is crucial for good performance when the data is non-IID.

H-F Performance Comparison Between Basil and Basil+

We compare the performance of Basil and Basil+ under the following setting:

Figure 15: Illustrating the performance of Basil and Basil+ for the CIFAR10 dataset. Panels: (a) the average test accuracy among the benign nodes in each round, (b) the worst-case test accuracy among the benign nodes in each round.

Setting. We consider a setting of 400 nodes, of which 80 are Byzantine. For the dataset, we use CIFAR10, while considering the inverse attack for the Byzantine nodes. We use a connectivity parameter of S = 6 for both Basil and Basil+, and consider epoch-based local training, where we set the number of epochs to 3.

Results. As we can see from Fig. 15, Basil and Basil+ with different numbers of groups retain high test accuracy over UBAR in the presence of the inverse attack. We can also see from Fig. 15(b) that when there is a large number of nodes in the system, increasing the number of groups (e.g., G = 8) makes the worst-case training performance of the benign nodes more stable across the training rounds compared to Basil, which exhibits high fluctuations in this large setting (e.g., 400 nodes).