
1 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
  Email: {shenglaizeng, lizhuestc, heyh.uestc}@gmail.com, yuhf@uestc.edu.cn
2 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
  Email: xuzenglin@hit.edu.cn
3 School of Computer Science and Engineering, Nanyang Technological University, Singapore
  Email: {dniyato, han.yu}@ntu.edu.sg

Heterogeneous Federated Learning via Grouped Sequential-to-Parallel Training

Shenglai Zeng¹, Zonghang Li¹,³, Hongfang Yu(✉)¹, Yihong He¹, Zenglin Xu², Dusit Niyato³, Han Yu³

Shenglai Zeng and Zonghang Li contributed equally to this work.
Abstract

Federated learning (FL) is a rapidly growing privacy-preserving collaborative machine learning paradigm. In practical FL applications, local data from each data silo reflect local usage patterns. Therefore, there exists heterogeneity of data distributions among data owners (a.k.a. FL clients). If not handled properly, this can lead to model performance degradation. This challenge has inspired the research field of heterogeneous federated learning, which currently remains open. In this paper, we propose a data heterogeneity-robust FL approach, FedGSP, to address this challenge by leveraging a novel concept of dynamic Sequential-to-Parallel (STP) collaborative training. FedGSP assigns FL clients to homogeneous groups to minimize the overall distribution divergence among groups, and increases the degree of parallelism by reassigning more groups in each round. It also incorporates a novel Inter-Cluster Grouping (ICG) algorithm to assist in group assignment, which uses the centroid equivalence theorem to simplify the NP-hard grouping problem and make it tractable. Extensive experiments have been conducted on the non-i.i.d. FEMNIST dataset. The results show that FedGSP improves the accuracy by 3.7% on average compared with seven state-of-the-art approaches, and reduces the training time and communication overhead by more than 90%.

Keywords: Federated learning · Distributed data mining · Heterogeneous data · Clustering-based learning

1 Introduction

Federated learning (FL) [1], as a privacy-preserving collaborative paradigm for training machine learning (ML) models with data scattered across a large number of data owners, has attracted increasing attention from both academia and industry. Under FL, data owners (a.k.a. FL clients) submit their local ML models to the FL server for aggregation, while local data remain private. FL has been applied in fields which are highly sensitive to data privacy, including healthcare [2], manufacturing [3] and next generation communication networks [4]. In practical applications, FL clients’ local data distributions can be highly heterogeneous due to diverse usage patterns. This problem is referred to as the non-independent and identically distributed (non-i.i.d.) data challenge, which negatively affects training convergence and the performance of the resulting FL model[5].

Recently, heterogeneous federated learning approaches have been proposed in an attempt to address this challenge. These works try to make class distributions of different FL clients similar to improve the performance of the resulting FL model. In [5, 6, 7], FL clients share a small portion of local data to build a common meta dataset to help correct deviations caused by non-i.i.d. data. In [8, 9], data augmentation is performed for categories with fewer samples to reduce the skew of local datasets. These methods are vulnerable to privacy attacks as misbehaving FL servers or clients can easily compromise the shared private data and the augmentation process. To align client data distributions without exposing the FL process to privacy risks, we group together heterogeneous FL clients so that each group can be perceived as a homogeneous “client” to participate in FL. This process does not involve any manipulation of private data itself and is therefore more secure.

An intuitive approach to achieve this goal is to assign FL clients to groups with similar overall class distributions, and to use collaborative training to coordinate model training within and among groups. However, designing such an approach is not trivial due to the following two challenges. Firstly, assigning FL clients to a specified number of equally sized groups so as to minimize the data divergence among groups is an NP-hard problem (the well-known bin packing problem [10] can be reduced to it). Moreover, this group assignment needs to be performed periodically in a dynamic FL environment, which places stringent demands on both its effectiveness and its execution efficiency. Secondly, even if the overall data distributions among groups are forced to be homogeneous, the data within each group can still be skewed. Owing to the robustness of the sequential training mode (STM) to data heterogeneity, some collaborative training approaches (e.g., [9]) adopt STM within a group to train on skewed client data, while the typical parallel training mode (PTM) is applied among the homogeneous groups. These methods are promising, but they remain limited by their static nature, which prevents them from adapting to the changing needs of FL at different stages. In FL, STM should be emphasized in the early stage to achieve a rapid increase in accuracy in the presence of non-i.i.d. data, while PTM should be emphasized in the later stage to promote convergence. In a static mode, the degree of parallelism must be carefully designed to trade off PTM's sensitivity to heterogeneous data against STM's tendency to overfit. Otherwise, the FL model performance may suffer.

To address these challenges, this paper proposes a new concept of dynamic collaborative Sequential-to-Parallel (STP) training to improve FL model performance in the presence of non-i.i.d. data. The core idea of STP is to gradually transform STM into PTM as FL model training progresses. In this way, STP can better refine unbiased model knowledge in the early stage, and promote convergence while avoiding overfitting in the later stage. To support STP, we propose the Federated Grouped Sequential-to-Parallel (FedGSP) training framework. FedGSP reassigns FL clients into a growing number of groups in each training round, and introduces group managers to manage the dynamically growing number of groups. It also coordinates model training and transmission within and among groups. In addition, we propose a novel Inter-Cluster Grouping (ICG) method to assign FL clients to a pre-specified number of groups, which uses the centroid equivalence theorem to simplify the original NP-hard grouping problem into a tractable constrained clustering problem with an equal group size constraint. ICG finds an effective solution with high efficiency (with a time complexity of $\mathcal{O}(\frac{K^{6}\mathcal{F}\tau}{M^{2}}\log{Kd})$). We evaluate FedGSP on the most widely adopted non-i.i.d. benchmark dataset FEMNIST [11] and compare it with seven state-of-the-art approaches, including FedProx [12], FedMMD [13], FedFusion [14], IDA [15], FedAdam, FedAdagrad and FedYogi [16]. The results show that FedGSP improves model accuracy by 3.7% on average, and reduces training time and communication overhead by more than 90%. To the best of our knowledge, FedGSP is the first dynamic collaborative training approach for FL.

2 Related Work

Existing heterogeneous FL solutions can be divided into three main categories: 1) data augmentation, 2) clustering-based learning, and 3) adaptive optimization.

Data Augmentation: Zhao et al. [5] proved that the FL model accuracy degradation due to heterogeneous local data can be quantified by the earth mover's distance (EMD) between the client and global data distributions. This result motivates some research works to balance the sample size of each class through data augmentation. Zhao et al. [5] proposed to build a globally shared dataset to expand client data. Jeong et al. [8] used a conditional generative network to generate new samples for categories with fewer samples. Similarly, Duan et al. [9] used augmentation techniques such as random cropping and rotation to expand client data. These methods are effective in improving FL model accuracy by reducing data skew. However, they involve modifying clients' local data, which can lead to serious privacy risks.

Clustering-based Learning: Another promising way to reduce data heterogeneity is clustering-based FL. Sattler et al. [17] group FL clients with similar class distributions into one cluster, so that FL clients with dissimilar data distributions do not interfere with each other. This method works well in personalized FL [18], where FL is performed within each cluster and an FL model is produced per cluster. However, this differs from our goal, which is to train one shared FL model that generalizes to all FL clients. Duan et al. [9] make the Kullback-Leibler divergence of class distributions similar among clusters, and propose a greedy best-fit strategy to assign FL clients.

Adaptive Optimization: Other research explores adaptive methods to better merge and optimize client- and server-side models. On the client side, Li et al. [12] added a proximal penalty term to the local loss function to constrain the local model to be closer to the global model. Yao et al. [13] adopted a two-stream framework and used transfer learning to transfer knowledge from the global model to the local model. A feature fusion method was further proposed to better merge the features of local and global models [14]. On the server side, Yeganeh et al. [15] down-weighted out-of-distribution models during aggregation using inverse-distance coefficients. Reddi et al. [16] focused on server-side optimization and introduced three advanced adaptive optimizers (Adagrad, Adam and Yogi) to obtain FedAdagrad, FedAdam and FedYogi, respectively. These methods perform well in improving FL model convergence.

Solutions based on data augmentation are risky due to potential data leakage, while solutions based on adaptive optimization do not solve the problem of class distribution divergence causing FL model performance to degrade. FedGSP focuses on clustering-based learning. Different from existing research, it takes a novel approach of dynamic collaborative training, which allows dynamic scheduling and reassignment of clients into groups according to the changing needs of FL.

3 Federated Grouped Sequential-to-Parallel Learning

In this section, we first describe the concept and design of the STP approach. Then, we present the FedGSP framework which is used to support STP. Finally, we mathematically formulate the group assignment problem in STP, and present our practical solution ICG.

3.1 STP: The Sequential-to-Parallel Training Mode

Under our grouped FL setting, FL clients are grouped such that clients in the same group have heterogeneous data but the overall data distributions among the groups are homogeneous. Due to the difference in data heterogeneity, the training modes within and among groups are designed separately. We refer to this jointly designed FL training mode as the “collaborative training mode”.

Figure 1: An example of STP. The ML model in each group is trained in sequence, while ML models among groups are trained in parallel. In each round r, the number of groups grows according to the function f, and FL clients are regrouped and shuffled.

Intuitively, the homogeneous groups can be trained in the simple parallel mode PTM, because the heterogeneity of their data has been eliminated by client grouping. In contrast, for FL clients in the same group, whose local data are still skewed, the sequential mode STM can be useful. In STM, FL clients train the model in a sequential manner: an FL client receives the model from its predecessor client and delivers the locally trained model to its successor client to continue training. In the special case of training with only one local epoch (i.e., e = 1), STM is equivalent to centralized SGD, which gives it robustness against data heterogeneity.

This naive collaborative training mode is static and has limitations. Therefore, we extend it into the more dynamic STP approach. As shown in Figure 1, STP reassigns FL clients into f(r) groups and shuffles their order in each round r, where f is a pre-specified group number growth function, with the goal of dynamically adjusting the degree of parallelism. In this way, STP transforms smoothly from a (fully) sequential mode into a (fully) parallel mode. This design prevents the catastrophic forgetting caused by a long "chain of clients", which makes the FL model forget the data of earlier clients and overfit the data of later clients, and also prevents the FL model from learning interfering information such as the order of clients. Moreover, the growing number of groups improves parallelism efficiency, which promotes convergence and speeds up training when the global FL model is close to convergence.

Algorithm 1 Sequential-To-Parallel (main)
1: Input: all FL clients 𝒞, the total number of FL clients K, the maximum number of training rounds R, the group number growth function f, the group sampling rate κ.
2: Output: the well-trained global FL model ω_global^R.
3: Initialize the global FL model ω_global^0;
4: for each round r = 1, …, R do
5:   Reassign all FL clients 𝒞 to f(r) groups to obtain 𝒢,
6:     𝒢 ← Inter-Cluster-Grouping(𝒞, K, f, r);
7:   Randomly sample a subset of groups 𝒢̃ ⊂ 𝒢 with proportion κ;
8:   for each group 𝒢_m in 𝒢̃ in parallel do
9:     The first FL client 𝒞_m^1 in 𝒢_m initializes ω_m^1 ← ω_global^{r−1};
10:    for each FL client 𝒞_m^k in 𝒢_m in sequence do
11:      Train ω_m^k on local data 𝒟_m^k using mini-batch SGD for one epoch;
12:      Send the trained ω_m^{k+1} ← ω_m^k to the next FL client 𝒞_m^{k+1};
13:    end for
14:    The last FL client 𝒞_m^{K/f(r)} in 𝒢_m uploads ω_m^{K/f(r)};
15:  end for
16:  Update the global FL model by aggregation, ω_global^r ← ( Σ_{𝒢_m ∈ 𝒢̃} ω_m^{K/f(r)} ) / f(r);
17: end for
18: return ω_global^R;

The pseudo code of STP is given in Algorithm 1. In round r, STP divides all FL clients into f(r) groups using the ICG grouping algorithm (Line 5), which is described in Section 3.3. Because the data distributions of the groups are similar, each group can independently represent the global distribution, so only a small proportion κ of the groups is required to participate in each round of training (Line 7). The first FL client in each group pulls the global model from the FL server (Line 9) and trains it on its local data using mini-batch SGD for one epoch (Line 11). The trained local model is then delivered to the next FL client to continue training (Line 12), until the last FL client is reached. The last FL client in each group sends the trained model to the FL server (Line 14). The models from all participating groups are aggregated to update the global FL model (Line 16). The above steps repeat until the maximum number of training rounds R is reached, and the well-trained global FL model is obtained (Line 18).
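For intuition, the following minimal Python sketch mirrors one round of Algorithm 1 (Lines 5-16). The helpers inter_cluster_grouping and local_train, as well as the representation of model weights as a list of NumPy arrays, are assumptions made for illustration; for simplicity, the aggregation here takes a plain average over the sampled group models, whereas Line 16 of Algorithm 1 divides by f(r).

```python
import random

def stp_round(clients, global_weights, r, f, kappa, inter_cluster_grouping, local_train):
    """One round of Sequential-to-Parallel training (sketch of Algorithm 1, Lines 5-16).

    `inter_cluster_grouping` and `local_train` are assumed helpers: the former
    returns a list of client groups, the latter runs one epoch of mini-batch SGD
    on a client's local data and returns the updated weights (NumPy arrays).
    """
    groups = inter_cluster_grouping(clients, f, r)                      # Lines 5-6: regroup clients into f(r) groups
    sampled = random.sample(groups, max(1, int(kappa * len(groups))))   # Line 7: sample a proportion kappa of groups

    group_models = []
    for group in sampled:                                # Line 8: groups would run in parallel in practice
        weights = [w.copy() for w in global_weights]     # Line 9: first client starts from the global model
        for client in group:                             # Line 10: clients in the group train in sequence
            weights = local_train(client, weights)       # Line 11: one local epoch of mini-batch SGD
        group_models.append(weights)                     # Line 14: last client uploads the group model

    # Line 16: aggregate the group models into the new global model
    # (plain average over sampled groups; the paper's formula divides by f(r)).
    new_global = [sum(ws) / len(group_models) for ws in zip(*group_models)]
    return new_global
```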

The choice of the growth function for the number of groups, f, is critical for the performance of STP. We give three representative growth functions: linear (smooth growth), logarithmic (fast first, slow later), and exponential (slow first, fast later):

\mathrm{Linear\ Growth\ Function:}\qquad f(r)=\beta\left\lfloor\alpha(r-1)+1\right\rfloor, \qquad (1)
\mathrm{Log\ Growth\ Function:}\qquad f(r)=\beta\left\lfloor\alpha\ln r+1\right\rfloor, \qquad (2)
\mathrm{Exp\ Growth\ Function:}\qquad f(r)=\beta\left\lfloor(1+\alpha)^{r-1}\right\rfloor, \qquad (3)

where the real-valued coefficient α controls the growth rate, and the integer coefficient β controls the initial number of groups and the growth span. We recommend initializing α and β to moderate values and exploring the best setting empirically.
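For reference, the three growth functions in Eqs. (1)-(3) can be implemented directly; the snippet below is a minimal sketch, with the paper's default coefficients used only as an example.

```python
import math

def group_growth(r, alpha, beta, kind="log"):
    """Number of groups f(r) in round r, following Eqs. (1)-(3).

    alpha controls the growth rate, beta the initial number of groups and
    the growth span (paper default for FEMNIST: kind='log', alpha=2, beta=10).
    """
    if kind == "linear":          # Eq. (1): smooth growth
        return beta * math.floor(alpha * (r - 1) + 1)
    elif kind == "log":           # Eq. (2): fast first, slow later
        return beta * math.floor(alpha * math.log(r) + 1)
    elif kind == "exp":           # Eq. (3): slow first, fast later
        return beta * math.floor((1 + alpha) ** (r - 1))
    raise ValueError(f"unknown growth function: {kind}")

# Example: the default Log function with alpha=2, beta=10 gives
# f(1)=10, f(2)=20, f(10)=50, f(100)=100 groups.
```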

3.2 FedGSP: The Grouped FL Framework To Enable STP

In this section, we describe the FedGSP framework that enables dynamic STP. FedGSP is generally a grouped FL framework that supports dynamic group management, as shown in Figure 2. The basic components include a top server (which acts as an FL server and performs functions related to group assignment) and a large number of FL clients. FL clients can be smart devices with certain available computing and communication capabilities, such as smart phones, laptops, mobile robots and drones. They collect data from the surrounding environment and use the data to train local ML models.

In addition, FedGSP creates group managers to facilitate the management of the growing number of groups in STP. The group managers can be virtual function nodes deployed in the same machine as the top server. Whenever a new group is built, a new group manager is created to assist the top server to manage this group by performing the following tasks:

1. Collect distribution information. The group manager needs to collect the class distributions of FL clients and report them to the top server. This meta-information will be used to assign FL clients to f(r) groups via ICG.

2. Coordinate model training. The group manager needs to coordinate the sequential training of the FL clients in its group, as well as the parallel training with other groups, according to the rules of STP. Specifically, it needs to shuffle the order of clients and report the resulting model to the top server for aggregation.

3. Schedule model transmission. In applications such as Industrial IoT systems, wireless devices can directly communicate with each other through wireless sensor networks (WSNs). However, this cannot be realized in most scenarios. Therefore, the group manager needs to act as a communication relay to schedule the transmission of ML models from one client to another.

Figure 2: An overview of the FedGSP framework.

3.3 ICG: The Inter-Cluster Grouping Algorithm

As required by STP, the equally sized groups containing heterogeneous FL clients should have similar overall class distributions. To achieve this goal, in this section we first formalize the FL client grouping problem, which is NP-hard, and then explain how we simplify it to obtain the ICG approach.

(A) Problem Modeling

Consider an ℱ-class classification task involving K FL clients. STP needs to assign these clients to M groups, where M is determined by the group number growth function f and the current round r. Our goal is to find a grouping strategy x ∈ 𝕀^{M×K} in the 0-1 space 𝕀 = {0, 1} that minimizes the difference in class distributions across groups, where x_m^k = 1 indicates that client k is assigned to group m, 𝒱 ∈ (ℤ⁺)^{ℱ×K} is the class distribution matrix composed of the ℱ-dimensional class distribution vectors of the K FL clients, 𝒱_m ∈ (ℤ⁺)^{ℱ×1} is the overall class distribution of group m, and ⟨·,·⟩ denotes the distance between two class distributions. The problem can be formalized as follows:

\underset{\mathbf{x}}{\mathrm{minimize}}\quad z=\sum_{m_{1}=1}^{M-1}\sum_{m_{2}=m_{1}+1}^{M}\left<\mathcal{V}_{m_{1}},\mathcal{V}_{m_{2}}\right>, \qquad (4)
\mathrm{s.t.}\quad M=f(r), \qquad (5)
\sum_{k=1}^{K}\mathbf{x}_{m}^{k}\leq\left\lceil\frac{K}{M}\right\rceil \quad \forall m=1,\cdots,M, \qquad (6)
\sum_{m=1}^{M}\mathbf{x}_{m}^{k}=1 \quad \forall k=1,\cdots,K, \qquad (7)
\mathcal{V}_{m}=\sum_{k=1}^{K}\mathbf{x}_{m}^{k}\mathcal{V}^{k} \quad \forall m=1,\cdots,M, \qquad (8)
\mathbf{x}_{m}^{k}\in\{0,1\}, \quad k\in[1,K],~m\in[1,M]. \qquad (9)

Constraint (5) ensures that the number of groups M equals f(r), as required by STP. Constraint (6) ensures that the groups have similar or equal size ⌈K/M⌉. Constraint (7) ensures that each client is assigned to exactly one group at a time. The overall class distribution 𝒱_m of group m is defined by Eq. (8), where 𝒱^k ∈ 𝒱 is the class distribution vector of client k. Constraint (9) restricts the decision variable x to take only the values 0 or 1.
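To make the objective concrete, the short sketch below evaluates z in Eq. (4) for a candidate assignment x and class distribution matrix 𝒱, using the squared ℓ2 distance as the divergence ⟨·,·⟩; the paper leaves the choice of distance abstract, so this particular metric is an assumption.

```python
import numpy as np

def grouping_objective(x, V):
    """Evaluate z in Eq. (4) for an assignment x (M x K, 0/1) and a class
    distribution matrix V (F x K), with squared L2 distance as <.,.>."""
    group_dists = V @ x.T            # F x M: column m is V_m = sum_k x[m, k] * V[:, k], i.e. Eq. (8)
    M = group_dists.shape[1]
    z = 0.0
    for m1 in range(M - 1):
        for m2 in range(m1 + 1, M):
            z += np.sum((group_dists[:, m1] - group_dists[:, m2]) ** 2)
    return z
```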

Proposition 1

The NP-hard bin packing problem (BPP) can be reduced to the grouping problem in Eq. (4) to Eq. (9), making it also an NP-hard problem.

Proof

The problem stated by Eq. (4) to Eq. (9) is in fact a BPP with additional constraints, in which K items with integer weights 𝒱^k and unit volume must be packed into the minimum number of bins of integer capacity ⌈K/M⌉. The difference is that Eq. (4) to Eq. (9) restricts the number of available bins to M rather than leaving it unlimited, and requires the difference in bin weights not to exceed ξ. The inputs and outputs of BPP and of Eq. (4) to Eq. (9) can be matched with only 𝒪(1) additional transformation cost, by setting M and ξ to infinity. Therefore, BPP can call a solver for Eq. (4) to Eq. (9) in 𝒪(1) time to obtain its own solution, which proves that the NP-hard BPP [10] reduces to the problem stated by Eq. (4) to Eq. (9). Hence, Eq. (4) to Eq. (9) is also NP-hard.

Therefore, it is almost impossible to find the optimal solution within polynomial time. To address this issue, we adopt the centroid equivalence theorem to simplify the original problem into a constrained clustering problem.

(B) Inter-Cluster Grouping (ICG)

Consider a constrained clustering problem with K points and L clusters, where all clusters are constrained to have exactly the same size K/L.

Assumption 1

We make the following assumptions:

1. K is divisible by L;

2. For any point 𝒱_l^m taken from cluster l, the squared ℓ2 distance ‖𝒱_l^m − C_l‖²₂ between the point and its cluster centroid C_l is bounded by σ_l²;

3. If one point 𝒱_l^m is taken at random from each of the L clusters, the sum of the deviations of these points from their cluster centroids, ε^m = Σ_{l=1}^{L} (𝒱_l^m − C_l), satisfies E[ε^m] = 0.

Definition 1 (Group Centroid)

Given L clusters of equal size, let group m be constructed from one point randomly sampled from each cluster, {𝒱_1^m, …, 𝒱_L^m}. Then, the centroid of group m is defined as C^m = (1/L) Σ_{l=1}^{L} 𝒱_l^m.

Proposition 2

If Assumption 1 holds, let the centroid of cluster l be C_l = (L/K) Σ_{i=1}^{K/L} 𝒱_l^i and the global centroid be C_global = (1/L) Σ_{l=1}^{L} C_l. We have:

1. The group and global centroids coincide in expectation, E[C^m] = C_global.

2. The error ‖C^m − C_global‖²₂ between the group and global centroids is bounded by (1/L²) Σ_{l=1}^{L} σ_l².

Proof
\begin{aligned}
\mathbf{E}[C^{m}] &= \mathbf{E}\Big[\frac{1}{L}\sum_{l=1}^{L}\mathcal{V}^{m}_{l}\Big] = \mathbf{E}\Big[\frac{1}{L}\sum_{l=1}^{L}(\mathcal{V}^{m}_{l}-C_{l}+C_{l})\Big] \\
&= \mathbf{E}\Big[\frac{1}{L}\sum_{l=1}^{L}(\mathcal{V}^{m}_{l}-C_{l})+\frac{1}{L}\sum_{l=1}^{L}C_{l}\Big] = \frac{1}{L}\mathbf{E}[\epsilon^{m}]+C_{\mathrm{global}} = C_{\mathrm{global}}, \\
\|C^{m}-C_{\mathrm{global}}\|_{2}^{2} &= \Big\|\frac{1}{L}\sum_{l=1}^{L}\mathcal{V}_{l}^{m}-\frac{1}{L}\sum_{l=1}^{L}C_{l}\Big\|_{2}^{2} = \frac{1}{L^{2}}\Big\|\sum_{l=1}^{L}(\mathcal{V}_{l}^{m}-C_{l})\Big\|_{2}^{2} \\
&\leq \frac{1}{L^{2}}\sum_{l=1}^{L}\|\mathcal{V}_{l}^{m}-C_{l}\|_{2}^{2} = \frac{1}{L^{2}}\sum_{l=1}^{L}\sigma_{l}^{2}.
\end{aligned}

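As a quick numerical sanity check of Proposition 2 (our own illustration, not part of the paper), one can sample synthetic clusters, form groups by drawing one point per cluster, and verify that the group centroid matches the global centroid in expectation and that its mean squared error stays within the stated bound:

```python
import numpy as np

rng = np.random.default_rng(0)
L, size, F = 5, 40, 10                        # L equal-size clusters of F-dimensional points
clusters = [rng.normal(loc=rng.uniform(0, 50, F), scale=3.0, size=(size, F)) for _ in range(L)]

centroids = np.stack([c.mean(axis=0) for c in clusters])           # cluster centroids C_l
c_global = centroids.mean(axis=0)                                  # global centroid C_global
sigma2 = [np.max(np.sum((c - c.mean(axis=0)) ** 2, axis=1)) for c in clusters]
bound = sum(sigma2) / L ** 2                                       # (1/L^2) * sum_l sigma_l^2

group_centroids = []
for _ in range(20000):
    group = np.stack([c[rng.integers(size)] for c in clusters])    # one random point per cluster
    group_centroids.append(group.mean(axis=0))                     # group centroid C^m
group_centroids = np.stack(group_centroids)

print("max |E[C^m] - C_global|:", np.abs(group_centroids.mean(axis=0) - c_global).max())
print("mean ||C^m - C_global||^2:", np.mean(np.sum((group_centroids - c_global) ** 2, axis=1)),
      "  bound:", bound)
```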
Proposition 2 indicates that there exists a grouping strategy x̃ with 𝒱_{m1} = Σ_{k=1}^{K} x̃_{m1}^k 𝒱^k = L·C^{m1} and 𝒱_{m2} = Σ_{k=1}^{K} x̃_{m2}^k 𝒱^k = L·C^{m2} (for all m1 ≠ m2), so that the objective in Eq. (4) becomes z = Σ_{m1 ≠ m2} L⟨C^{m1}, C^{m2}⟩ and its expected value reaches 0. This motivates us to use a constrained clustering model to solve for x̃ in the objective Eq. (4). We therefore consider the constrained clustering problem below,

\underset{\mathbf{y}}{\mathrm{minimize}}\quad \sum_{k=1}^{K}\sum_{l=1}^{L}\mathbf{y}^{k}_{l}\cdot\left(\frac{1}{2}\|\mathcal{V}^{k}-C_{l}\|_{2}^{2}\right), \qquad (10)
\mathrm{s.t.}\quad \sum_{k=1}^{K}\mathbf{y}_{l}^{k}=\frac{K}{L} \quad \forall l=1,\cdots,L, \qquad (11)
\sum_{l=1}^{L}\mathbf{y}_{l}^{k}=1 \quad \forall k=1,\cdots,K, \qquad (12)
\mathbf{y}_{l}^{k}\in\{0,1\}, \quad k\in[1,K],~l\in[1,L], \qquad (13)

where y ∈ 𝕀^{L×K} is a selector variable, y_l^k = 1 means that client k is assigned to cluster l and 0 means it is not, and C_l denotes the centroid of cluster l. Eq. (10) is the standard clustering objective, which assigns the K clients to L clusters so that the sum of squared ℓ2 distances between each class distribution vector 𝒱^k and its cluster centroid C_l is minimized. Constraint (11) ensures that every cluster has the same size K/L. Constraint (12) ensures that each client is assigned to exactly one cluster at a time. In this simplified problem, Constraint (7) is relaxed to Σ_{m=1}^{M} x_m^k ≤ 1 so that the assumption that K is divisible by L can be satisfied.

The above constrained clustering problem can be modeled as a minimum cost flow (MCF) problem and solved with network simplex algorithms [19], such as SimpleMinCostFlow in Google OR-Tools. We then alternately perform cluster assignment and cluster update to optimize y_l^k and C_l (for all k, l), respectively. Finally, we construct M groups, each consisting of one client randomly sampled from each cluster without replacement, so that the group centroids are expected to coincide with the global centroid. The pseudo code is given in Algorithm 2. ICG has a time complexity of $\mathcal{O}(\frac{K^{6}\mathcal{F}\tau}{M^{2}}\log{Kd})$, where d = max{σ_l² | l ∈ [1, L]}, and K, M, ℱ, τ are the numbers of clients, groups, classes, and iterations, respectively. In our experiments, ICG is quite fast: it completes group assignment within only 0.1 seconds for K = 364, M = 52, ℱ = 62 and τ = 10.

Algorithm 2 Inter-Cluster-Grouping
1: Input: all FL clients 𝒞 (each with class distribution vector 𝒱^k), the total number of FL clients K, the group number growth function f, the current training round r.
2: Output: the grouping strategy 𝒢.
3: Randomly sample L·⌊K/L⌋ clients from 𝒞 to meet Assumption 1, where L = ⌊K/f(r)⌋;
4: repeat
5:   Cluster Assignment: fix the cluster centroids C_l and optimize y in Eq. (10) to Eq. (13);
6:   Cluster Update: fix y and update each cluster centroid, C_l ← ( Σ_{k=1}^{K} y_l^k 𝒱^k ) / ( Σ_{k=1}^{K} y_l^k ), ∀ l = 1, …, L;
7: until C_l converges;
8: Group Assignment: randomly sample one client from each cluster without replacement to construct group 𝒢_m (∀ m = 1, …, f(r));
9: return 𝒢 = {𝒢_1, …, 𝒢_{f(r)}};
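To complement Algorithm 2, the sketch below illustrates its alternating structure. The paper solves the size-constrained assignment step (Eq. (10) to Eq. (13)) as a minimum cost flow with SimpleMinCostFlow in OR-Tools; this sketch instead expresses the same equal-size assignment as a linear assignment problem over replicated cluster "slots" and solves it with SciPy's linear_sum_assignment, an equivalent but less scalable formulation. All names and defaults are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def inter_cluster_grouping(V, num_groups, iters=10, seed=0):
    """ICG sketch: V is a (K, F) matrix of client class-distribution vectors.

    Returns `num_groups` groups, each holding one client index per cluster,
    so that the overall group class distributions are roughly homogeneous.
    """
    rng = np.random.default_rng(seed)
    K = V.shape[0]
    L = K // num_groups                                          # L clusters, each of size K/L = num_groups
    used = rng.choice(K, size=L * num_groups, replace=False)     # drop leftovers to meet Assumption 1
    X = V[used]

    centroids = X[rng.choice(len(X), size=L, replace=False)]     # initialize cluster centroids
    for _ in range(iters):
        # Cluster assignment: each cluster offers exactly `num_groups` slots.
        cost = 0.5 * ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # (K', L)
        slot_cost = np.repeat(cost, num_groups, axis=1)          # (K', L * num_groups), square matrix
        rows, cols = linear_sum_assignment(slot_cost)
        labels = cols // num_groups                              # slot index -> cluster id
        # Cluster update: recompute centroids from the assigned clients.
        centroids = np.stack([X[labels == l].mean(axis=0) for l in range(L)])

    # Group assignment: one client per cluster, sampled without replacement.
    groups = [[] for _ in range(num_groups)]
    for l in range(L):
        members = used[labels == l]
        rng.shuffle(members)
        for m in range(num_groups):
            groups[m].append(int(members[m]))
    return groups
```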

4 Experimental Evaluation

4.1 Experiment Setup and Evaluation Metrics

Environment and Hyperparameter Setup. The experiment platform contains K = 368 FL clients. The most commonly used FEMNIST [11] is selected as the benchmark dataset. It is specially designed for non-i.i.d. FL environments and is constructed by dividing 805,263 digit and character samples among 3,550 FL clients with non-uniform class distributions, with an average of n = 226 samples per client. To accommodate resource-limited mobile devices, a lightweight neural network composed of 2 convolutional layers and 2 fully connected layers, with a total of 6.3 million parameters, is adopted as the training model. FL clients train their local models with standard mini-batch SGD, using learning rate η = 0.01, batch size b = 5 and local epoch e = 1. We run FedGSP for R = 500 rounds. By default, we set the group sampling rate κ = 0.3, the group number growth function f = Log, and the corresponding coefficients α = 2, β = 10. The values of κ, f, α and β are further tuned in the experiments to observe their influence on performance.

Benchmark Algorithms. In order to highlight the effect of the proposed STP and ICG separately, we remove them from FedGSP to obtain the naive version, NaiveGSP. Then, we compare the performance of the following versions of FedGSP through ablation studies:

1. NaiveGSP: FL clients are randomly assigned to a fixed number of groups; the clients within a group are trained in sequence and the groups are trained in parallel (as in Astraea [9]).

2. NaiveGSP+ICG: The ICG grouping algorithm is added to NaiveGSP to assign FL clients to a fixed number of groups strategically.

3. NaiveGSP+ICG+STP (FedGSP): On top of NaiveGSP+ICG, FL clients are reassigned to a growing number of groups in each round, as required by STP.

In addition, seven state-of-the-art baselines are experimentally compared with FedGSP. They are FedProx [12], FedMMD [13], FedFusion [14], IDA [15], and FedAdagrad, FedAdam, FedYogi from [16].

Evaluation Metrics. In addition to the fundamental test accuracy and test loss, we also define the following metrics to assist in performance evaluation.

Class Probability Distance (CPD). The maximum mean discrepancy (MMD) is a probability distance measure in a reproducing kernel Hilbert space. We define CPD as the kernel two-sample estimate with a Gaussian radial basis function kernel 𝒦 [20], which measures the difference in class probability (i.e., normalized class distribution) 𝒫 = norm(𝒱_{m1}), 𝒬 = norm(𝒱_{m2}) between two groups m1 and m2. In general, the smaller the CPD, the smaller the data heterogeneity between the two groups, and hence the better the grouping strategy.

\mathrm{CPD}(m_{1},m_{2})=\mathrm{MMD}^{2}(\mathcal{P},\mathcal{Q})=\mathbf{E}_{x,x^{\prime}\sim\mathcal{P}}\left[\mathcal{K}(x,x^{\prime})\right]-2\,\mathbf{E}_{x\sim\mathcal{P},y\sim\mathcal{Q}}\left[\mathcal{K}(x,y)\right]+\mathbf{E}_{y,y^{\prime}\sim\mathcal{Q}}\left[\mathcal{K}(y,y^{\prime})\right]. \qquad (14)
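Eq. (14) can be computed in a few lines. The sketch below follows one plausible reading of the definition, treating the class indices 1, …, ℱ as the sample space and weighting the kernel entries by the class probabilities; the Gaussian bandwidth is an illustrative choice, as the paper does not specify it.

```python
import numpy as np

def cpd(p, q, bandwidth=8.0):
    """Squared MMD (Eq. (14)) between two class probability vectors p and q of
    length F, treating classes 0..F-1 as points under a Gaussian RBF kernel.
    The bandwidth value is an illustrative assumption."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    classes = np.arange(len(p))
    K = np.exp(-(classes[:, None] - classes[None, :]) ** 2 / (2 * bandwidth ** 2))
    return p @ K @ p - 2 * p @ K @ q + q @ K @ q

# Example: identical distributions give CPD = 0; one-hot distributions on two
# different classes i, j give CPD = 2 * (1 - K(i, j)).
```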

Computational Time. We define T_comp in Eq. (15) to estimate the computational time cost, where the number of floating point operations (FLOPs) is 𝒩_calc = 96M FLOPs per sample for local training and 𝒩_aggr = 6.3M FLOPs for global aggregation, and 𝒯_FLOPS = 567G FLOPs per second is the computing throughput of the Qualcomm Snapdragon 835 smartphone chip equipped with an Adreno 540 GPU.

T_{\mathrm{comp}}(R)=\sum_{r=1}^{R}\left(\underbrace{\frac{\mathcal{N}_{\mathrm{calc}}}{\mathcal{T}_{\mathrm{FLOPS}}}\cdot\frac{neK}{\min\{K,f(r)\}}}_{\mathrm{Local\ Training}}+\underbrace{\frac{\mathcal{N}_{\mathrm{aggr}}}{\mathcal{T}_{\mathrm{FLOPS}}}\cdot\left[\kappa f(r)-1\right]}_{\mathrm{Global\ Aggregation}}\right)~(\mathrm{s}). \qquad (15)

Communication Time and Traffic. We define T_comm in Eq. (16) to estimate the communication time cost and D_comm in Eq. (17) to estimate the total traffic, where the FL model size is ℳ = 25.2 MB and the inbound and outbound transmission rates are ℛ_in = ℛ_out = 567 Mbps (measured over the Internet on an AWS EC2 r4.large instance with 2 vCPUs and enhanced networking disabled). Eq. (16) and Eq. (17) consider only the cross-WAN traffic between FL clients and group managers; the traffic between the top server and the group managers is ignored because they are deployed on the same physical machine.

T_{\mathrm{comm}}(R)=8\kappa K\mathcal{M}R\left(\frac{1}{\mathcal{R}_{\mathrm{in}}}+\frac{1}{\mathcal{R}_{\mathrm{out}}}\right)~(\mathrm{s}), \qquad (16)
D_{\mathrm{comm}}(R)=2\kappa K\mathcal{M}R~(\mathrm{Bytes}). \qquad (17)

Please note that Eq. (15) and Eq. (16) are theoretical metrics, which do not account for memory I/O cost, network congestion, or platform configurations such as different versions of the cuDNN/MKL-DNN libraries.
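Eqs. (15) to (17) can be evaluated directly from the stated constants. The snippet below is a back-of-the-envelope calculator for the paper's default setting (Log growth with α = 2, β = 10, κ = 0.3, K = 368, n = 226, e = 1); treating 1 MB as 10^6 bytes is our assumption.

```python
import math

# Constants from the paper.
N_CALC = 96e6          # FLOPs per sample for local training
N_AGGR = 6.3e6         # FLOPs for one global aggregation
T_FLOPS = 567e9        # device throughput in FLOPs/s (Snapdragon 835 / Adreno 540)
MODEL_MB = 25.2        # model size M in MB (1 MB taken as 10^6 bytes)
R_IN = R_OUT = 567e6   # inbound/outbound rates in bit/s

def f_log(r, alpha=2, beta=10):
    return beta * math.floor(alpha * math.log(r) + 1)

def costs(R, K=368, n=226, e=1, kappa=0.3, f=f_log):
    t_comp = sum(
        N_CALC / T_FLOPS * (n * e * K) / min(K, f(r))        # local training term of Eq. (15)
        + N_AGGR / T_FLOPS * (kappa * f(r) - 1)              # global aggregation term of Eq. (15)
        for r in range(1, R + 1))
    t_comm = 8 * kappa * K * MODEL_MB * 1e6 * R * (1 / R_IN + 1 / R_OUT)   # Eq. (16), seconds
    d_comm = 2 * kappa * K * MODEL_MB * R                    # Eq. (17), in MB of traffic
    return t_comp, t_comm, d_comm

print(costs(R=34))     # the 34 rounds FedGSP needs to reach 80% accuracy
```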

4.2 Results and Discussion

The effect of ICG and STP. We first compare the CPD of FedAvg [21], NaiveGSP and NaiveGSP+ICG in Figure 3a. These CPDs are calculated between every pair of FL clients. The results show that NaiveGSP+ICG reduces the median CPD by 82% relative to FedAvg and by 41% relative to NaiveGSP. We also show their accuracy in Figure 3b. The baseline NaiveGSP converges quickly but only achieves accuracy similar to FedAvg, whereas NaiveGSP+ICG improves the accuracy by 6%. This shows that reducing the data heterogeneity among groups can indeed effectively improve FL performance in the presence of non-i.i.d. data. Although NaiveGSP+ICG is already very effective, it still has defects. Figure 3c shows a rise in the loss of NaiveGSP+ICG, which indicates overfitting: because its training mode is static, it may learn the client order and forget data seen earlier in the chain. In contrast, the dynamic FedGSP overcomes overfitting and eventually converges to a higher accuracy of 85.4%, which demonstrates the effectiveness of combining STP and ICG.

Figure 3: Comparison among FedAvg, NaiveGSP, NaiveGSP+ICG and FedGSP in (a) CPD, (b) accuracy curve and (c) loss curve. In subfigure (a), the orange line represents the median value and the green triangle represents the mean value.

The effect of the growth function f and its coefficients α, β. To explore the influence of different group number growth functions f, we conduct a grid search over f ∈ {Linear, Log, Exp} and over α, β. The test loss heatmaps are shown in Figure 4. The results show that the logarithmic growth function achieves the smallest loss, 0.453, with α = 2 and β = 10 among the three candidate functions. Besides, we find that both very small and very large values of α, β lead to higher loss. The likely reason is that a slow increase in the number of groups leads to more STM and results in overfitting, while a rapid increase makes FedGSP degenerate into FedAvg prematurely and suffer from data heterogeneity. Therefore, we recommend setting α·β to a moderate value, as shown in the green area.

Figure 4: Test loss heatmaps of the (a) linear, (b) logarithmic and (c) exponential growth functions over different α and β settings in FedGSP.
Figure 5: Comparison of (a) accuracy and the normalized (b) computational time and (c) communication time over different κ settings in FedGSP.

The effect of the group sampling rate κ. κ controls the participation rate of groups (and hence of FL clients) in each round. We set κ ∈ {0.1, 0.2, 0.3, 0.5, 1.0} to observe its effect on accuracy and time cost. Figure 5a shows that accuracy is robust to different values of κ. This is expected because ICG forces the data of each group to be homogeneous, which enables each group to individually represent the global data. In addition, Figures 5b and 5c show that κ has a negligible effect on the computational time T_comp, but a proportional effect on the communication time T_comm, because a larger κ means more model data are transmitted. Therefore, we recommend sampling only κ ∈ [0.1, 0.3] of the groups to participate in each round to reduce the overall time cost. In our experiments, we set κ = 0.3 by default.

The performance comparison of FedGSP. We compare FedGSP with seven state-of-the-art approaches and summarize their test accuracy, test loss and the number of training rounds required to reach 80% accuracy in Table 1. The results show that FedGSP achieves 5.3% higher accuracy than FedAvg and reaches 80% accuracy within only 34 rounds. Moreover, FedGSP outperforms all the compared approaches, with on average 3.7% higher accuracy, 0.123 lower loss and 84% fewer rounds, which demonstrates its effectiveness in improving FL performance in the presence of non-i.i.d. data.

The time and traffic cost of FedGSP. Figure 6 visualizes the time cost and total traffic of FedGSP and FedAvg when they reach 80% accuracy. The time cost consists of the computational time T_comp(R) and the communication time T_comm(R), of which T_comm(R) accounts for the majority because the huge traffic from hundreds of FL clients exacerbates the bandwidth bottleneck at the cloud server. Figure 6 also shows that FedGSP spends 93% less time and traffic than FedAvg, which benefits from the sharp reduction in the number of training rounds R (only 34 rounds to reach 80% accuracy). Therefore, FedGSP is not only accurate, but also training- and communication-efficient.

Figure 6: Comparison of time and traffic cost to reach 80% accuracy.
Algorithm     Accuracy   Loss    Rounds
FedAvg        80.1%      0.602   470
FedProx       78.7%      0.633   ×
FedMMD        81.7%      0.587   336
FedFusion     82.4%      0.554   230
IDA           82.0%      0.567   256
FedAdagrad    81.9%      0.582   297
FedAdam       82.1%      0.566   87
FedYogi       83.2%      0.543   93
FedGSP        85.4%      0.453   34
Table 1: Comparison of accuracy, loss, and the number of rounds required to reach 80% accuracy ("×" indicates that 80% accuracy was not reached).

5 Conclusions

In this paper, we addressed the problem of FL model performance degradation in the presence of non-i.i.d. data. We proposed a new concept of dynamic STP collaborative training that is robust against data heterogeneity, and a grouped framework FedGSP to support dynamic management of the continuously growing client groups. In addition, we proposed ICG to support efficient group assignment in STP by solving a constrained clustering problem with equal group size constraint, aiming to minimize the data distribution divergence among groups. We experimentally evaluated FedGSP on LEAF, a widely adopted FL benchmark platform, with the non-i.i.d. FEMNIST dataset. The results showed that FedGSP significantly outperforms seven state-of-the-art approaches in terms of model accuracy and convergence speed. In addition, FedGSP is both training- and communication-efficient, making it suitable for practical applications.

Acknowledgments. This work is supported, in part, by the National Key Research and Development Program of China (2019YFB1802800); PCL Future Greater-Bay Area Network Facilities for Large-Scale Experiments and Applications (LZC0019), China; National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-RP-2020-019); the RIE 2020 Advanced Manufacturing and Engineering (AME) Programmatic Fund (No. A20G8b0102), Singapore; and Nanyang Assistant Professorship (NAP). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.

References

  • [1] Kairouz, P., McMahan, H.B., Avent, B., et al.: Advances and open problems in federated learning. Foundations and Trends in Machine Learning 14(1-2), pp. 1-210, (2021). doi:10.1561/2200000083
  • [2] Xu, J., Glicksberg, B.S., Su, C., et al.: Federated learning for healthcare informatics. Journal of Healthcare Informatics Research 5(1), pp. 1-19, (2021). doi:10.1007/s41666-020-00082-4
  • [3] Khan, L.U., Saad, W., Han, Z., et al.: Federated learning for internet of things: Recent advances, taxonomy, and open challenges. IEEE Communications Surveys & Tutorials 23(3), pp. 1759-1799, (2021). doi:10.1109/COMST.2021.3090430
  • [4] Lim, W.Y.B., Luong, N.C., Hoang, D.T., et al.: Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials 22(3), pp. 2031-2063, (2020). doi:10.1109/COMST.2020.2986024
  • [5] Zhao, Y., Li, M., Lai, L., et al.: Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, (2018).
  • [6] Yao, X., Huang, T., Zhang, R.X., et al.: Federated learning with unbiased gradient aggregation and controllable meta updating. In: Workshop on Federated Learning for Data Privacy and Confidentiality, (2019).
  • [7] Yoshida, N., Nishio, T., Morikura, M., et al.: Hybrid-FL for wireless networks: Cooperative learning mechanism using non-iid data. In: ICC 2020 - 2020 IEEE International Conference on Communications (ICC), pp. 1-7, (2020). doi:10.1109/ICC40277.2020.9149323
  • [8] Jeong, E., Oh, S., Kim, H., et al.: Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. In: Workshop on Machine Learning on the Phone and other Consumer Devices, (2018).
  • [9] Duan, M., Liu, D., Chen, X., et al.: Astraea: Self-balancing federated learning for improving classification accuracy of mobile deep learning applications. In: 2019 IEEE 37th International Conference on Computer Design (ICCD), pp. 246-254, (2019). doi:10.1109/ICCD46524.2019.00038
  • [10] Garey, M.R. and Johnson, D.S.: “Strong” NP-completeness results: Motivation, examples, and implications. Journal of the Association for Computing Machinery 25(3), pp. 499-508, (1978). doi:10.1145/322077.322090
  • [11] Caldas, S., Duddu, S.M.K., Wu, P., et al.: Leaf: A benchmark for federated settings. In: 33rd Conference on Neural Information Processing Systems (NeurIPS), (2019).
  • [12] Li, T., Sahu, A.K., Zaheer, M., et al.: Federated optimization in heterogeneous networks. In: Proceedings of Machine Learning and Systems 2, pp. 429-450, (2020).
  • [13] Yao, X., Huang, C. and Sun, L.: Two-stream federated learning: Reduce the communication costs. In: 2018 IEEE Visual Communications and Image Processing (VCIP), pp. 1-4, (2018). doi:10.1109/VCIP.2018.8698609
  • [14] Yao, X., Huang, T., Wu, C., et al.: Towards faster and better federated learning: A feature fusion approach. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 175-179, (2019). doi:10.1109/ICIP.2019.8803001
  • [15] Yeganeh, Y., Farshad, A., Navab, N., et al.: Inverse distance aggregation for federated learning with non-iid data. In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pp. 150-159, (2020). doi:10.1007/978-3-030-60548-3_15
  • [16] Reddi, S., Charles, Z., Zaheer, M., et al.: Adaptive federated optimization. In: International Conference on Learning Representations, (2021).
  • [17] Sattler, F., Müller, K.R. and Samek, W.: Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems 32(8), pp. 3710-3722, (2021). doi:10.1109/TNNLS.2020.3015958
  • [18] Fallah, A., Mokhtari, A. and Ozdaglar, A.: Personalized federated learning: A meta-learning approach. In: 34th Conference on Neural Information Processing Systems (NeurIPS), (2020).
  • [19] Bradley, P.S., Bennett, K.P. and Demiriz, A.: Constrained k-means clustering. Microsoft Research Technical Report, Redmond, (2000).
  • [20] Gretton, A., Borgwardt, K.M., Rasch, M.J., et al.: A kernel two-sample test. Journal of Machine Learning Research 13(1), pp. 723-773, (2012).
  • [21] McMahan, B., Moore, E., Ramage, D., et al.: Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1273-1282, (2017).