CDFL: Efficient Federated Human Activity Recognition using Contrastive Learning and Deep Clustering

Ensieh Khazaei, Alireza Esmaeilzehi, Bilal Taha, Dimitrios Hatzinakos, Ensieh Khazaei, Alireza Esmaeilzehi, Bilal Taha, and Dimitrios Hatzinakos are with the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada (e-mail: ensieh.khazaei@mail.utoronto.ca; alireza.esmaeilzehi@utoronto.ca; bilal.taha@mail.utoronto.ca; dimitris@comm.utoronto.ca).

Abstract

In the realm of ubiquitous computing, Human Activity Recognition (HAR) is vital for the automation and intelligent identification of human actions through data from diverse sensors. However, traditional machine learning approaches by aggregating data on a central server and centralized processing are memory-intensive and raise privacy concerns. Federated Learning (FL) has emerged as a solution by training a global model collaboratively across multiple devices by exchanging their local model parameters instead of local data. However, in realistic settings, sensor data on devices is non-independently and identically distributed (Non-IID). This means that data activity recorded by most devices is sparse, and sensor data distribution for each client may be inconsistent. As a result, typical FL frameworks in heterogeneous environments suffer from slow convergence and poor performance due to deviation of the global model’s objective from the global objective. Most FL methods applied to HAR are either designed for overly ideal scenarios without considering the Non-IID problem or present privacy and scalability concerns. This work addresses these challenges, proposing CDFL, an efficient federated learning framework for image-based HAR. CDFL efficiently selects a representative set of privacy-preserved images using contrastive learning and deep clustering, reduces communication overhead by selecting effective clients for global model updates, and improves global model quality by training on privacy-preserved data. Our comprehensive experiments carried out on three public datasets, namely Stanford40, PPMI, and VOC2012, demonstrate the superiority of CDFL in terms of performance, convergence rate, and bandwidth usage compared to state-of-the-art approaches.

Index Terms:

Federated learning, human activity recognition, data heterogeneity, communication efficiency, naive aggregation.

I Introduction

In an increasingly connected world, Human Activity Recognition (HAR) has emerged as a pivotal component in the landscape of ubiquitous computing applications [1, 2, 3]. Of particular interest is the image-based HAR which pertains to the automatic and intelligent identification of actions/behaviors exhibited by humans using image data collected from various sources [4, 5]. These images are often captured by cameras embedded in a wide array of devices, such as smartphones, surveillance systems, and smart environments. Image-based HAR plays a significant role in various domains, including healthcare for patient monitoring, sports analytics, security surveillance systems, and human-computer interaction in smart environments [6, 7, 8]. As these devices continue to proliferate, they generate vast amounts of data with the potential to significantly enhance the accuracy and reliability of HAR algorithms. However, centralizing data from multiple sources on powerful servers or cloud platforms for training machine learning models raise several issues including privacy invasions, potential misuse of sensitive personal data, and the significant costs associated with data transmission and centralized processing.

Federated Learning (FL) has recently emerged as a machine learning paradigm that seeks to address the above-mentioned issues [9]. It allows a global model to be trained across multiple decentralized sources (or clients) holding local data samples, without exchanging them. Instead of transferring raw data to a central server, FL involves training machine learning models locally on each device and then aggregating the model weights (or gradients) to construct an improved global model. This approach ensures that sensitive raw data never leaves the local device, thereby significantly enhancing privacy and security of the users. In the context of HAR, FL is especially promising since it allows for the leveraging of extensive, rich, and diverse images collected from multiple sources employed by different individuals in various environments. Such a decentralized approach does not only preserve privacy but also promises to deliver more personalized and context-aware HAR models by learning from data that is representative of heterogeneous user behaviors. Nonetheless, there are three main challenges in FL that limit its potential: data heterogeneity, communication overhead, and naive aggregation.

In real-world applications, the data among the clients can be non-IID (independent and identically distributed) which can negatively affect the performance of typical FL system[9]. In this scenario, some clients may not contain any sample from some categories of data. Recent studies have proposed methods to address this problem. FedProx [10] utilizes an $\ell_{2}$ regularization term to control the distance between the local and global weights. Personalized FL is another approach that addresses the data heterogeneity issue by training personalized models [11, 12, 13, 14, 15, 16, 17]. In personalized FL, the focus on local data can prevent the clients from exploiting the information lying in the whole distributed dataset [18]. Furthermore, some research studies [19, 20, 21, 22, 23] employ clustered FL to handle the data heterogeneity. In clustered FL, clients with similar data distributions are grouped into the same cluster, where each cluster collaboratively trains its own global model independent of other clusters.

In every communication round of conventional FL systems, all active clients must send their weights to the server, which poses challenges in situations where there are limitations in bandwidth. Additionally, there are clients with lower convergence rates compared to others, known as stragglers. Therefore, a strategy is required to select clients intelligently not only to reduce communication overhead but also to achieve better performance in a smaller number of communication rounds. Ouyang et al. [24] have proposed cluster-wise straggler dropout and correlation-based client selection techniques to reduce communication overhead for HAR applications. In [25], an algorithm is proposed to select a subset of devices in resource-constrained environments while achieving the best resource efficiency. The algorithm dynamically selects clients and adaptively adjusts the frequency of model aggregation, considering various factors such as local loss, data size, computational power, resource demands, and the age of updates for each client.

For updating the server model in a conventional FL framework such as Federated Averaging (FedAvg) [9], a coefficient is assigned to each client according to its number of samples and then clients’ model weights are aggregated using weighted averaging. In this method, the global model’s objective is far from the global objective for the whole distributed dataset, and it has a negative effect on the convergence rate of the FL system especially in a heterogeneous state. To overcome this issue, the authors in [26] have proposed a novel aggregation method based on parameter sensitivity. Their method resolves the global model drifting by increasing updates for less sensitive parameters and diminishing updates for parameters with higher levels of sensitivity.

Although existing FL frameworks focus on addressing one or two of these challenges, there is a need to propose a comprehensive FL framework to resolve all of the above challenges. In this work, we propose, CDFL, an efficient federated learning framework for image-based human activity recognition to resolve the above challenges. Our proposed method not only shows an overall improvement in system performance in comparison to state-of-the-art methods but also reduces the communication overhead between the server and various clients. The contributions of the work of this paper are summarized as follows:

•

We address the issue of data heterogeneity by introducing a privacy-preserved subset of the dataset to the server. To choose the best images efficiently, we leverage the ideas of deep clustering and contrastive learning methods. Specifically, before transmitting the data of each client to the server, we manipulate it using image pixelization operation, in order to make it resilient to the leakage of client’s sensitive information. Since the chosen dataset is a good representative of the entire dataset, it effectively overcomes the data heterogeneity challenge.
•

We reduce the communication overhead by selecting the top-performing clients in each communication round. Notably, in each communication round, the most effective clients share their weights with the server for aggregation. Client selection not only decreases the bandwidth usage in our proposed scheme but also improves the performance because the less effective clients do not contribute to the global model update.
•

We improve the quality of the global model by training it on the privacy-preserved subset of data instead of a naive aggregation. In terms of distribution, the selected subset of privacy-preserved images and the entire dataset are similar. Therefore, training the global model on this subset will approach the global model’s objective to the global objective and enhance the system’s convergence.
•

We conduct comprehensive experiments on three HAR datasets under different settings and demonstrate that CDFL is able to outperform the state-of-the-art FL schemes existing in the literature in terms of communication efficiency and performance. In particular, we demonstrate that the proposed framework can increase the performance by up to 10% and speed up the convergence rate by up to 10 times compared to the state-of-the-art approaches. Additionally, it will reduce the bandwidth usage by up to $\sim$ 64% in large-scale scenarios.

The remainder of this paper is organized as follows. In Section II, we briefly explain previous work related to FL and HAR applications. Section III explains CDFL in detail. Specifically, we include how we train individual clients locally, choose images using deep clustering, and conduct server training along with client selection. The results of various experimentations on the proposed scheme, as well as, their analysis are described in Section IV. Finally, the concluding remarks on the work of this paper are made in Section V.

II Related Work

II-A Federated Learning

FL is a prominent distributed machine learning approach that has attracted attention in recent years. The main goal of FL is to preserve privacy through decentralized learning. The first attempt in this area is FedAvg [9] algorithm. Although the results obtained from FedAvg are promising compared to centralized learning, this scheme suffers from several challenges, including data heterogeneity, communication overhead, and naive aggregation.

To mitigate the data heterogeneity issue, Li et al. [27] presented a model-contrastive federated learning (MOON) to similarize the representation of local models to those of the global model. While MOON performs effectively when the global model is consistent with the global objective, it overlooks the fact that a deviated global model, particularly in strongly heterogeneous conditions, can lead to deviation of all local clients from the global objective. Another category of algorithms focuses on clustered FL to manage heterogeneous states such as cFedFN [18], IFCA [20], FedCluster [28], FedGroup [29], FlexCFL [23], CFL [19]. IFCA [20] is an iterative federated clustering algorithm that alternates between estimating the cluster identities and minimizing the loss function. This method does not need a centralized clustering algorithm, leading to a considerable reduction in server computational cost. It has been shown that the convergence rate of their proposed algorithm follows the exponential curve. CFL [19] is another cluster-based FL framework that does not require any prior knowledge about the number of clusters. In this framework, the geometric properties of the FL loss are leveraged for client clustering. Further, they discovered that by calculating the cosine similarity between client weights, the similarity of their data distribution can be perceived. In cFedFN [18], Wei et al. have argued that feature norm is a suitable metric to cluster clients in view of its capability to reflect the clients’ data distribution. By leveraging feature norms for clustering, a performance improvement is achieved without the need for any pre-clustering steps. The additional computation cost required for feature norm calculation of cFedFN is negligible, which makes it applicable in real-life situations. While clustered FL methods can mitigate data heterogeneity issues by identifying underlying client clusters, clustering becomes challenging as the number of clients increases.

A number of existing works [30] have focused on communication overhead reduction to handle bandwidth-limited situations. Chen et al. [31] have designed a probabilistic client selection such that clients that can potentially enhance the convergence rate are selected with higher probability in the following communication round. Besides, in this scheme, a quantization method has been developed to shrink the model parameters before transmission to the server. In [32], an asynchronous model update algorithm is proposed to reduce client-server communication. In this algorithm, the various layers of the model are divided into shallow and deep categories. The parameters in the deep category of layers are updated less often compared to the parameters in the shallow layers. With this strategy, the number of parameters to be exchanged between the server and clients is reduced per communication round. Although their proposed strategy is successful in reducing the total communication cost, it needs more communication rounds to achieve a target accuracy compared to FedAvg. The FetchSGD algorithm, presented in [33], is based on the Count Sketch data structure to overcome the communication bottleneck issue. The FetchSGD sends the sketches of each local client to the server for aggregation. In the server, the momentum and error accumulation are applied to the aggregated sketch. Then, top-k values are selected as sparse updates to broadcast to the connected clients in the following round.

On the server side, naive aggregation of local models suffers from global drift in heterogeneous situations [26], and therefore, deteriorates the FL performance. Chen et al. [32] have designed a temporally weighted aggregation strategy to resolve this issue. This scheme is based on timestamps, and local clients with recent updates get larger weights compared to other clients. FedMA [34] has used parameter averaging of the local clients by utilizing a permutation matrix. Specifically, this scheme ensures that similar elements in the model, like channels in convolution layers or neurons in fully connected layers, are matched and averaged appropriately. However, the computational cost of finding the optimal permutation matrix hinders its practical usage. EK et al. [35] introduced a new aggregation algorithm called FedDist, which follows a similar approach to FedMA but focuses on identifying differences between specific neurons among the clients to modify the global model’s architecture. It has been demonstrated that this algorithm considers the clients’ specificity without affecting the overall performance.

II-B Federated Learning and HAR

Many existing works in HAR applications still rely on central servers to train their models. This exposes the users’ data to substantial privacy threats, especially when dealing with sensitive user information. To maintain user privacy, federated learning has been increasingly adopted in the context of HAR [36, 35, 37].

Xiao et al. [38] paid attention to the structure of the model in the FL system. They designed an FL system with an advanced feature extractor for recognizing human activities. This feature extractor consists of two main components. The initial component utilizes convolutional layers to capture local features, while the second component combines LSTM and attention layers to uncover the global relationships within the data. Another line of studies tries to tackle data heterogeneity in HAR applications. Gad et al. [39] introduced Federated Learning via Augmented Knowledge Distillation (FedAKD) to manage the heterogeneous state. They have demonstrated that FedAKD not only surpasses FedAvg in terms of communication efficiency but also outperforms FedMD [40] due to the incorporation of augmentation techniques. FedDL, as presented in Tu et al.’s work [41], is another FL framework tailored for activity recognition purposes under non-IID circumstances. FedDL clusters clients based on the similarity of model weights and subsequently integrates models through an iterative layer-wise approach. With this strategy, the system dynamically acquires personalized models for distinct users. Although FedAKD [39] and FedDL [41] address the data heterogeneity issue, they do not resolve the communication overhead problem that can be challenging in large scale scenarios. ClusterFL [24] also directed its attention toward client clustering, in which, the cluster indicative matrix that captures the similarity between users, is utilized. By employing alternating optimization techniques, they iteratively update both local models and the cluster indicator matrix.

Some research endeavors focus on the FL system in a semi-supervised approach to address the limited availability of labeled samples in the HAR problem. Notably, Bettini et al. [42] introduced FedHAR which is a semi-supervised FL framework. Specifically, in this scheme, a combination of active learning and label propagation is employed to semi-automatically annotate data for each individual client, effectively addressing the scarcity of labeled samples. In [43], a semi-supervised learning-based HAR scheme is proposed, in which clients train autoencoders in an unsupervised approach with their unlabeled local data while the server trains an activity classifier through supervised learning. Their scheme can achieve competitive performance compared to the supervised approach in the HAR application.

III Proposed Approuch

In this section, we develop CDFL as an efficient FL framework for HAR. Our proposed scheme addresses the challenges of privacy-preserving and efficient model training in a distributed environment.

III-A Preliminaries

In the proposed framework, a given dataset, $D$ , containing $|C|$ distinct classes, is distributed across $N$ distinct clients, which $N$ represents the total number of clients. For any given client, designated as the $i$ -th party, their local dataset is represented as $D_{i}=\{(x_{i}^{j},z_{i}^{j},y_{i}^{j})\}_{j=1}^{n_{i}}$ , where $x_{i}^{j}$ refers to the original image of the $j$ -th sample in the $i$ -th client, $z_{i}^{j}$ is the pixelized version of $x^{i}_{j}$ , where the human identity (face) is covered using image pixelization operation, $y_{i}^{j}$ is the human activity label corresponding to $x^{i}_{j}$ . The term $n_{i}$ represents the total number of samples that are present within the $i$ -th client’s dataset.

It’s imperative to mention that due to the non-independent and identically distributed (non-iid) nature of our setup, there might be clients, in which certain classes are absent. Therefore, while $|C|$ represents the total number of classes in dataset $D$ , a client’s local dataset might contain fewer classes, represented as $|C_{i}|$ . This relationship can be mathematically defined as:

	$\displaystyle\|C_{i}\|$	$\displaystyle\leq\|C\|$
	$\displaystyle C$	$\displaystyle=\bigcup_{i=1}^{N}C_{i}.$

Furthermore, for each client, two neural networks $g(.;\theta_{i})$ and ${g}(.;\tilde{\theta}_{i})$ are associated with the personalized local model and local model of the $i$ -th client. The functionality of these two networks is explained in the following paragraphs.

Each of the two neural networks mentioned above consists of two principal segments; 1) a feature extractor that maps the input data (i.e., images in the context of our HAR application) into embedding vectors; 2) a classifier that maps the embedding vector into categorical vectors to make a classification decision.

Let $\theta_{i}$ and $\tilde{\theta}_{i}$ represent, respectively, the weights of personalized and local models for $i$ -th client. We denote the penultimate layers of personalized and local models of $i$ -th client with $g^{p}(.;\theta_{i})$ and ${g}^{p}(.;\tilde{\theta}_{i})$ , respectively. In fact, $g^{p}(.;\theta_{i})$ and ${g}^{p}(.;\tilde{\theta}_{i})$ are feature extractors in personalized and local models that provide the representations of the input in the embedding space. For simplicity, in the following sections, the notion ${g}^{p}(.;\tilde{\theta_{i}})$ and ${g}^{p}(.;{\theta_{i}})$ is represented as $\tilde{g}_{i}^{p}(.)$ and ${g}_{i}^{p}(.)$ , respectively.

III-B Model Contrastive Learning

Each client in our scheme partakes in the training of two distinct neural architectures: the personalized model and the local model.

•

Personalized Model: The objective of this model is to minimize the canonical cross-entropy loss using the original image data. For the $i$ -th client’s data sample $(x_{i}^{j},z_{i}^{j},y_{i}^{j})$ , the personalized model’s loss is given as:

$\mathcal{L}_{CE}^{personalized}=CE(g(x_{i}^{j};\theta_{i}),y_{i}^{j})$ (1)

Utilizing original images poses privacy concerns since they could inadvertently disclose participant identities. To address this, we propose another model for every client, that is trained with the pixelized visual signals. By employing image pixelization, it is assured that the sensitive data in the images, in our case the faces, are masked. Hence, in addition to sending the parameters of each local model to the server, we can transmit their pixelized data (that preserves the human identities) as well. This strategy could have a positive impact on enhancing the FL performance.

To generate the masked images, we employ the RetinaFace model [44], an efficient facial detection scheme, to detect faces within these images. Subsequently, the detected faces undergo pixelization through bilinear downsampling and nearest neighbor upsampling operations, both conducted with a factor of $a$ , as follows:

$z_{i}^{j}=\text{Nearest Neighbor}_{\uparrow a}(\text{Bilinear}_{\downarrow a}(x_{i}^{j}))$ (2)

A substantial value for $a$ ensures effective pixelization. Figure 1 demonstrates the face masking process for a sample of Stanford40 dataset [45].

Figure 1: Face masking workflow on a sample of Stanford40 dataset

Figure 2: Overview of local training and image selection within the $i$ -th client

•

Local Model: The local model is trained on pixelized images using a similar architecture as the personalized model. Its design addresses the aforementioned privacy concerns, aiming to achieve reasonable performance on pixelized images. The loss for this model is the standard cross-entropy based on pixelized images, which is represented as

\mathcal{L}_{CE}^{local}=CE({g}(z_{i}^{j};\tilde{\theta}_{i}),y_{i}^{j})

(3)

The proposed local model incorporates both knowledge distillation and contrastive learning to improve the model’s capability to generate efficient representations for pixelized images, as shown in Figure 2. With the addition of contrastive learning, the local model aims for congruent representations between original and pixelized images. Similar to [46], we define the contrastive loss as:

\mathcal{L}_{con}=-\log\frac{e^{\frac{cos(\tilde{g}_{i}^{p}(z_{i}^{j}),\tilde{g}_{i}^{p}(x_{i}^{j}))}{\tau}}}{e^{\frac{cos(\tilde{g}_{i}^{p}(z_{i}^{j}),\tilde{g}_{i}^{p}(x_{i}^{j}))}{\tau}}+\sum_{k\neq j}e^{\frac{cos(\tilde{g}_{i}^{p}(z_{i}^{j}),\tilde{g}_{i}^{p}(x_{k}^{j}))}{\tau}}}

(4)

Here, $\tilde{g}_{i}^{p}(z_{i}^{j})$ and $\tilde{g}_{i}^{p}(x_{i}^{j})$ represent the local model’s representations over pixelized and original images respectively. The term $\tau$ denotes the temperature parameter that controls the strength of penalties on hard negative samples, while $cos(.,.)$ is the cosine similarity metric.

To further refine the local model’s performance on original images, knowledge distillation is employed, leveraging insights from the personalized model which enhances the similarity between the local model representation on pixelized images and the representations of the personalized model on the original counterparts:

\mathcal{L}_{KD}=-\log\frac{e^{\frac{cos(\tilde{g}_{i}^{p}(z_{i}^{j}),{g}_{i}^{p}(x_{i}^{j}))}{\tau}}}{e^{\frac{cos(\tilde{g}_{i}^{p}(z_{i}^{j}),{g}_{i}^{p}(x_{i}^{j}))}{\tau}}+\sum_{k\neq j}e^{\frac{cos(\tilde{g}_{i}^{p}(z_{i}^{j}),{g}_{i}^{p}(x_{k}^{j}))}{\tau}}}

(5)

where ${g}_{i}^{p}(x_{i}^{j})$ denotes the personalized model’s representations over the original images.

Finally, the total loss used for training the local model is as:

\mathcal{L}^{local}=\mathcal{L}_{CE}^{local}+\lambda_{1}\mathcal{L}_{con}+\lambda_{2}\mathcal{L}_{KD}

(6)

here, $\lambda_{1}$ and $\lambda_{2}$ work as hyperparameters to calibrate the contributions from contrastive learning and knowledge distillation components, respectively.

III-C Deep Clustering for Image Selection

Clustered Federated Learning (FL) [47, 48, 49, 50] categorizes clients with similar data distributions into groups, and consequently, trains a distinct global model for every cluster. Previous investigations have adopted diverse criteria for client clustering, which range from using contextual information [28], model parameters [19, 51, 29, 23], loss functions [20], and embedding vectors [18, 52]. While earlier efforts largely focused on client clustering to mediate data heterogeneity, our research harnesses image clustering to transmit a minimal number of representative pixelized images to the server in order to enhance the FL performance. It should be noted that local data remains unshared with servers or other clients in conventional FL systems. As delineated in Section III-B, pixelized images effectively obscure individual sensitive data and contain the necessary information for performing the task of HAR. Nonetheless, the indiscriminate transmission of all pixelized images is memory-consuming and inefficient, emphasizing the need for efficient image selection methodologies.

In order to perform image selection, we utilize the idea of deep clustering in the feature space. Mathematically, deep clustering for the $i$ -th client is expressed as:

\displaystyle\sum\limits_{k=1}^{|C_{i}|}\mathbbm{1}_{jk}\|\tilde{g}_{i}^{p}(z_{i}^{j})-\mu_{ik}\|

(7)

where $\mu_{ik}$ is the centroids corresponding to the $k$ -th cluster in the $i$ -th client. The centroids will be learned during training of the local model. $\mathbbm{1}_{jk}$ is an indicator that becomes one if $\tilde{g}_{i}^{p}(z_{i}^{j})$ belongs to the $k$ -th cluster, and zero otherwise.

The training parameters in Equation (7) scale with the embedding space’s size. For obtaining high-quality centroids for each client, their sample count should be considerably large relative to this space. Given that FL systems rarely assign individual clients substantial samples, the deep clustering cannot be carried out in an efficient manner [53, 54, 55, 56]. In order to address this, we employ the pseudo-labeling [57, 58, 59] technique for carrying out the deep clustering operation. In this prevalent deep clustering strategy, first, cluster centroids are derived through methods like k-means clustering. These centroids serve as the reference points for assigning data samples to their corresponding clusters. These assigned cluster labels are then considered as the samples’ pseudo-labels. Subsequently, the model is re-trained with the objective of aligning the representations of the samples with their corresponding cluster centroids.

Our algorithm employs the global model in conjunction with K-means for pseudo-label prediction, leveraging the global model’s encompassing ability to derive representations from all clients. In the first stage, the global model is applied to pixelized images within each client, yielding individual representations in the embedding or penultimate layer for each client:

\displaystyle g(\{z_{i}^{j}\}_{j=1}^{n_{i}};\tilde{\theta})=\{\tilde{g}^{p}(z_{i}^{j})\}_{j=1}^{n_{i}}

(8)

Subsequently, K-means extracts cluster centroids and pseudo-labels, given the representations learned earlier as follows:

\displaystyle\{\tilde{\mu}_{ik}\}_{k=1}^{|C_{i}|}=\arg\min\limits_{\{\tilde{\mu}_{im}\}_{m=1}^{|C_{i}|}}\sum\limits_{j=1}^{n_{i}}a_{jm}^{i}\|\tilde{g}^{p}(z_{i}^{j})-\tilde{\mu}_{im}\|\

(9)

Then the pseudo-labels $a_{jk}^{i}$ is defined as

\displaystyle a_{jk}^{i}=\begin{cases}1&\text{if }k=\min\limits_{m}\|\tilde{g}^{p}(z_{i}^{j})-\tilde{\mu}_{im}\|\\ 0&\text{otherwise}\end{cases}

(10)

Therefore, Equation (7) can be reformulated as

\displaystyle\mathcal{L}_{cl}=\sum\limits_{k=1}^{|C_{i}|}a_{jk}^{i}\|\tilde{g}_{i}^{p}(z_{i}^{j})-\tilde{\mu}_{ik}\|

(11)

By introducing the deep clustering loss into Equation (6), the local model’s loss can be computed as:

\displaystyle\mathcal{L}^{local}=\mathcal{L}_{CE}^{local}+\lambda_{1}\mathcal{L}_{con}+\lambda_{2}\mathcal{L}_{KD}+\lambda_{3}\mathcal{L}_{cl}

(12)

where $\lambda_{3}$ adjusts the significance of the clustering term.

After the local model’s training using Equation (12), K-means determine updated centroids based on the representations generated by the local model:

\displaystyle\{\bar{\mu}_{ik}\}_{k=1}^{|C_{i}|}=\arg\min\limits_{\{\bar{\mu}_{im}\}_{m=1}^{|C_{i}|}}\sum\limits_{j=1}^{n_{i}}a_{jm}^{i}\|\tilde{g}_{i}^{p}(z_{i}^{j})-\bar{\mu}_{im}\|\

(13)

Subsequently, for each centroid $\bar{\mu}_{ik}$ , the Euclidean distance to all local samples is computed, and the $m$ nearest samples to each centroid are then selected as the optimal samples for transmission to the server. The collection of chosen images for the $k$ -th cluster is denoted as:

\displaystyle\mathcal{S}_{i}^{k}=\{(z_{i}^{j},y_{i}^{j})\}_{j=1}^{m}

(14)

By performing the selection process on all centroids for each client, a set containing the best images for transmission to the server is obtained as:

\displaystyle\mathcal{S}_{i}=\bigcup_{k=1}^{|C_{i}|}\{\mathcal{S}_{i}^{k}\}

(15)

where $\mathcal{S}_{i}$ will be shared with the server for further processing. It is worth mentioning that $\mathcal{S}_{i}$ remains stored on the server until the next communication round when the respective client is selected to contribute to the server update process and update its corresponding set, $\mathcal{S}_{i}$ . Consequently, in each communication round, privacy-preserved data samples from all clients are present on the server, regardless of their contributions.

The advantages of the proposed image selection method are two-fold. Firstly, it employs global model representations for obtaining the pseudo-labels, and hence, provides centroids with a global view from all clients. This prevents the deviation of the local model from the global model during local model training. Secondly, the global and local models are updated in each communication round. As a result, the selected images are dynamically refined to improve performance over time.

III-D Server Training and Client Selection in Federated Learning

In traditional FL systems, during each communication round, the central server aggregates model weights from all connected clients to update the global model. There exist three issues with using this approach. First, due to the data heterogeneity among clients, the aggregated global model is far from the global objective. Second, it necessitates significant memory and bandwidth resources, which might be unattainable in many real-world scenarios. Third, some clients may demonstrate slower convergence compared to others. Including such clients during the global model update can detrimentally impact the convergence rate of the entire FL framework.

Refer to caption — (a) Server Initialization

In order to address these challenges, we consolidate the global model after naive aggregation using the privacy-preserved selected images. Additionally, we propose a client selection algorithm that identifies and integrates contributions from the most useful clients in each round. This algorithm leverages the dataset of selected images from Section III-C to determine the optimal clients.

For each client, the received dataset, denoted as $\mathcal{S}_{i}$ , is divided into two sets of training ( $\mathcal{S}_{i}^{tr}$ ) and validation ( $\mathcal{S}_{i}^{val}$ ). The aggregation of local models follows the fundamental aggregation:

\displaystyle\tilde{\theta}=\sum\limits_{i\in\mathcal{C}}\frac{n_{i}}{\sum\limits_{i\in\mathcal{C}}n_{i}}\tilde{\theta_{i}}

(16)

Here, $\mathcal{C}$ represents the set of selected clients forwarding their model weights to the central server. The process of selecting clients will be explained later in this section.

A simple parameter aggregation in the server is not sufficient to approach the global objective, since the global model requires an efficient fine-tuning using the selected privacy-preserved data of all clients in order to align with the global objective and provide a superior FL performance. To mitigate this, we utilize images chosen through the methodology detailed in Section III-C to adjust the global weights, aligning them with the global objective.

After the naive aggregation of the global model, it undergoes training on the combined training set, $\mathcal{S}^{tr}=\bigcup\limits_{i=1}^{N}\mathcal{S}_{i}^{tr}$ , and is fine-tuned using the privacy-preserved data samples stored on the server from all clients. The loss for the global model, for any data instance $(z_{i}^{j},y_{i}^{j})\in\mathcal{S}^{tr}$ , is:

\displaystyle\mathcal{L}_{g}=CE({g}(z_{i}^{j};\tilde{\theta}),y_{i}^{j})

(17)

This loss undergoes updates via the Stochastic Gradient Descent (SGD) optimizer. Finally, the updated global model is broadcast to the clients in order to initialize the personalized and local models for the next communication round. This strategy will be helpful to overcome the slow convergence in the typical FL algorithms. The overall architecture of the global model update is shown in Figure 3.

To optimize the bandwidth, the trained global model is evaluated on the validation subsets and then clients are ranked based on their validation performance in a descending order. A certain percentage of clients with the highest validation accuracy are selected to contribute their weights in the next communication round. These selected clients are stored in $\mathcal{C}$ set. It should be noted that in the next round, the updated global weights are sent to all connected clients. However, after local training is done, only the chosen clients share their weights with the server and update their pixelized images on the central server.

IV Experiments

We evaluate the performance of CDFL on several image-based HAR datasets and compare it with the state-of-the-art frameworks FedAvg [9], SCAFFOLD [60], FedProx [10], MOON [27], FedDyn [61], and FedHKD [62].

Our experimental analysis of CDFL focuses on five key aspects: (i) comparing the accuracy of CDFL to state-of-the-art baselines, (ii) evaluating the CDFL’s ability to handle a large number of clients, (iii) investigating how CDFL performs in different levels of heterogeneity, (iv) assessing the CDFL’s efficiency and convergence speed, and (v) analyzing the impact of different of important parameters on the performance of CDFL.

These evaluations will provide insights into the effectiveness and suitability of CDFL for image-based HAR tasks.

IV-A Experimental Settings

IV-A1 Datasets

Three image-based HAR datasets are used in our experiments, namely Stanford40 [45], PPMI [63], and VOC2012 [64]. Stanford40 contains 40 categories of human actions and a total of 9,532 images that are split with ratios of 0.9 and 0.1 as the training and testing sets, respectively. The PPMI includes 4,800 images of people interacting with 12 different musical instruments, and VOC2012 contains 10 classes of actions and 3,744 images. It should be noted that images with more than one action are omitted in VOC2012 to avoid model confusion. PPMI and VOC2012 are split to train and test sets by the original publisher and we follow their partitioning. We use Dirichlet distribution to generate the non-IID data partitions among clients [27, 62, 61]. Specifically, we distribute samples among clients with $Dir_{N}(\alpha)$ , where $\alpha$ is the concentration parameter of Dirichlet distribution. Smaller values of $\alpha$ result in a more unbalanced data distribution among clients. We set $\alpha$ to 1.0 in all the experiments unless explicitly specified.

IV-A2 Baseline Methods

To assess the effectiveness of CDFL, we conduct a comparative analysis with six state-of-the-art baselines, including FedAvg [9], SCAFFOLD [60], FedProx [10], MOON [27], FedDyn [61], and FedHKD [62]. We also compare with the SOLO approach, where each client trains a model with its local data independent of other clients. In the SOLO approach, no server model is utilized.

IV-A3 Hyperparameters

We use the Stochastic Gradient Descent (SGD) optimizer with a constant learning rate of 0.01 across all settings. For the SGD optimizer, we set the weight decay to 0.00001 and the momentum to 0.9. The batch size is uniformly set to 32 for all experiments. In the case of SOLO, the number of local epochs is fixed to 100. For all federated learning approaches, the number of local epochs is set to 5. The performance evaluation is conducted after 20 communication rounds for the Stanford40 dataset and 25 communication rounds for both the PPMI and VOC2012 datasets. We chose these specific numbers of communication rounds because we observed that considerable improvement is not seen beyond these points. We employ the ResNet-50 as the deep network architecture for all experiments, and in each communication round, 80% of clients are randomly involved.

In the FedProx algorithm, we choose the best value for $\mu$ from the set of {0.001, 0.01, 0.1, 1.0}. For MOON, we tune $\mu$ over the range of {0.01, 0.1, 1, 5} and report the best result. The $\alpha$ parameter in FedDyn is adjusted with values from {0.001, 0.01, 0.1, 1.0}. In FedHKD, we tune the parameters $(\lambda,\gamma)$ using {(0.001, 0.001), (0.01, 0.01), (0.05, 0.05), (0.1, 0.1)}, and report the best result.

In CDFL, 50% of clients ( $|\mathcal{C}|=0.5N$ ) are selected to share their weights with the server and update their pixelized images on the server. $\lambda_{1}$ , $\lambda_{2}$ , and $\lambda_{3}$ are set to 0.1, 0.1, and 0.05, respectively. The parameter $m$ , the number of nearest selected samples to each centroid, is set to 3 for experiments involving 30 clients and 5 for experiments with 20 clients. When the number of clients is 10, the value of $m$ is 9 for the Stanford40 dataset and 7 for the PPMI and VOC2012 datasets.

IV-B Performance Comparison

At the end of each communication round, the local models are applied to the test samples to evaluate their individual performance. Then, the performance of all connected clients is averaged to obtain the performance of the scheme. The mean accuracy of all schemes, tested with varying numbers of clients (10, 20, and 30), is summarized in Table I. First, we have to say CDFL is the best. The SOLO approach consistently exhibits the poorest performance across all settings, representing the beneficial impact of the FL framework in decentralized learning. As the number of clients increases, there is a drop in performance across all schemes. In the case of FedAvg [9], this drop is particularly significant, ranging from 7% to 16% among different datasets. The relatively lower accuracy of FedAvg can be attributed to its limited ability to handle heterogeneous conditions. The performance improvement of CDFL in comparison to MOON [27] is up to 4.2%, 7.1%, and 9.8% for Stanford40, PPMI, and VOC2012, respectively. Among the baselines, the performance of FedDyn [61] is closer to CDFL. Compared to FedDyn [61], CDFL improves the performance up to 1.3%, 5.4%, and 5.7% in Stanford40, PPMI, and VOC2012, respectively.

As shown in Table I, CDFL consistently outperforms all other baselines across various settings, demonstrating the strength of CDFL in comparison to previous works. In general, CDFL has a smaller standard deviation compared to other baselines. This implies that it not only works well on the whole set of clients according to its higher mean accuracy but also performs well on all individual clients due to its low standard deviation. In contrast, the higher standard deviation in other baselines represents difficulties in achieving good performance on some clients due to non-IID data distribution. Furthermore, CDFL is still able to provide higher performance, even when the number of clients increases, which is crucial in realistic applications where the number of clients is often large.

TABLE I: Mean accuracies (%± standard deviation) for all datasets

	Stanford40			PPMI			VOC2012
	$N=10$	$N=20$	$N=30$	$N=10$	$N=20$	$N=30$	$N=10$	$N=20$	$N=30$
SOLO	$78.42\pm 8.81$	$58.32\pm 3.28$	$52.46\pm 4.56$	$67.46\pm 13.96$	$54.55\pm 13.27$	$47.42\pm 13.90$	$59.82\pm 4.14$	$52.58\pm 8.20$	$44.40\pm 8.05$
FedAvg [9]	$86.18\pm 5.07$	$81.22\pm 3.50$	$79.74\pm 5.34$	$88.22\pm 3.35$	$77.78\pm 14.93$	$75.60\pm 11.88$	$74.03\pm 4.82$	$67.20\pm 12.38$	$65.91\pm 10.40$
SCAFFOLD [60]	$83.81\pm 2.80$	$82.06\pm 2.00$	$82.01\pm 4.79$	$88.31\pm 2.67$	$81.74\pm 14.71$	$82.26\pm 12.26$	$70.07\pm 4.34$	$71.06\pm 6.23$	$73.48\pm 6.51$
FedProx [10]	$86.44\pm 4.05$	$81.41\pm 3.09$	$79.92\pm 7.51$	$88.13\pm 4.63$	$80.21\pm 9.68$	$72.94\pm 15.28$	$75.62\pm 6.64$	$72.5\pm 13.59$	$70.32\pm 9.37$
MOON [27]	$87.00\pm 3.35$	$81.73\pm 3.86$	$80.13\pm 5.74$	$86.97\pm 7.39$	$81.24\pm 10.78$	$77.91\pm 13.66$	$72.39\pm 9.27$	$66.26\pm 16.19$	$67.02\pm 9.44$
FedDyn [61]	$87.02\pm 3.48$	$83.79\pm 2.76$	$83.04\pm 5.12$	$88.75\pm 2.65$	$82.58\pm 13.06$	$81.19\pm 13.15$	$74.24\pm 4.07$	$73.05\pm 7.88$	$71.13\pm 10.88$
FedHKD [62]	$86.59\pm 5.39$	$82.31\pm 4.66$	$80.04\pm 10.23$	$89.69\pm 6.06$	$83.11\pm 6.97$	$73.86\pm 18.11$	$69.70\pm 10.32$	$69.28\pm 11.78$	$67.86\pm 11.06$
CDFL	$\mathbf{88.19\pm 3.65}$	$\mathbf{84.48\pm 3.12}$	$\mathbf{84.37\pm 2.41}$	$\mathbf{90.74\pm 2.98}$	$\mathbf{88.00\pm 4.66}$	$\mathbf{85.14\pm 8.39}$	$\mathbf{77.82\pm 2.49}$	$\mathbf{75.19\pm 8.82}$	$\mathbf{76.89\pm 5.71}$

IV-C Scalability

To demonstrate the scalability of CDFL to a large number of clients, we evaluate the Stanford40 dataset in two scenarios with 50 and 100 clients. The data samples were distributed among clients using Dirichlet distribution with $\alpha=1.0$ . It is worth noting that Stanford40 is selected for these large-scale scenarios due to the small size of other datasets, whose number of samples is approximately 50% of the samples available in Stanford40. Since the number of clients increases, the number of communication rounds is set to 30 to make sure that all schemes converge to optimal solutions.

TABLE II: Mean accuracies (%± standard deviation) for large clients on Stanford40

	Stanford40
	$N=50$	$N=100$
SOLO	$45.04\pm 8.99$	$42.24\pm 6.88$
FedAvg [9]	$77.96\pm 7.73$	$75.26\pm 10.14$
SCAFFOLD [60]	$81.63\pm 6.04$	$82.95\pm 8.41$
FedProx [10]	$78.18\pm 8.31$	$75.31\pm 12.00$
MOON [27]	$78.60\pm 8.19$	$76.29\pm 7.97$
FedDyn [61]	$82.28\pm 10.85$	$82.43\pm 13.06$
FedHKD [62]	$78.91\pm 10.55$	$76.15\pm 8.45$
CDFL	$\mathbf{83.01\pm 4.49}$	$\mathbf{83.87\pm 4.98}$

The performance of various schemes on a large number of clients for the Stanford40 dataset is presented in Table II. As expected, the SOLO approach exhibits a significant drop in performance when the number of clients increases. Among the baseline methods we considered, FedProx [10], MOON [27], and FedHKD [62] perform similarly to FedAvg [9], struggling to handle a large number of clients effectively. In contrast, SCAFFOLD [60] and FedDyn [61] show improved performance compared to the other baselines in large-scale settings. As shown in Table II, CDFL consistently outperforms the other baselines in the case of large-scale FL setups. The higher mean performance of CDFL implies its ability to effectively manage a large number of clients. Furthermore, its lower standard deviation compared to other baselines demonstrates its consistent performance across all the individual clients.

TABLE III: Mean accuracies (%± standard deviation) for different levels of heterogeneity (#clients=10)

	Stanford40		PPMI		VOC2012
	$\alpha=0.1$	$\alpha=10.0$	$\alpha=0.1$	$\alpha=10.0$	$\alpha=0.1$	$\alpha=10.0$
SOLO	$59.82\pm 14.93$	$71.52\pm 2.26$	$58.57\pm 20.62$	$74.20\pm 2.30$	$61.24\pm 25.85$	$57.76\pm 12.77$
FedAvg [9]	$78.31\pm 6.45$	$84.25\pm 1.73$	$72.74\pm 13.68$	$86.91\pm 5.96$	$73.73\pm 21.43$	$73.45\pm 13.36$
SCAFFOLD [60]	$81.31\pm 4.52$	$83.66\pm 1.82$	$76.39\pm 12.20$	$87.58\pm 3.09$	$69.71\pm 13.32$	$68.08\pm 14.02$
FedProx [10]	$80.11\pm 6.52$	$84.08\pm 2.17$	$74.09\pm 15.06$	$87.57\pm 4.48$	$75.36\pm 20.46$	$\mathbf{78.61\pm 3.45}$
MOON [27]	$78.49\pm 6.02$	$85.06\pm 2.32$	$73.51\pm 17.03$	$87.35\pm 5.94$	$74.89\pm 18.34$	$74.73\pm 9.82$
FedDyn [61]	$83.93\pm 4.83$	$84.25\pm 1.76$	$80.21\pm 10.62$	$87.89\pm 3.78$	$74.74\pm 13.15$	$75.33\pm 5.82$
FedHKD [62]	$80.22\pm 4.16$	$83.57\pm 3.09$	$77.09\pm 12.79$	$86.15\pm 10.74$	$75.59\pm 15.30$	$74.1\pm 13.01$
CDFL	$\mathbf{84.16\pm 4.94}$	$\mathbf{85.15\pm 1.19}$	$\mathbf{82.74\pm 11.29}$	$\mathbf{90.08\pm 2.02}$	$\mathbf{76.09\pm 13.62}$	$75.09\pm 6.51$

IV-D Heterogeneity

To assess the performance of our CDFL under different levels of data heterogeneity, we conduct experiments using two different values of $\alpha$ across all datasets. Specifically, we tested $\alpha=0.1$ to represent a higher level of heterogeneity as well as $\alpha=10$ to consider less heterogeneous data. The results are presented in Table III.

CDFL consistently outperforms other baselines when data heterogeneity is high ( $\alpha=0.1$ ), demonstrating its robustness in handling challenging, heterogeneous situations. In this scenario, CDFL achieves a performance improvement of up to $\sim 6\%$ , 10%, and 6% compared to other baselines for Stanford40, PPMI, and VOC2012, respectively.

For $\alpha=10.0$ , CDFL achieves the highest accuracy for the Stanford40 and PPMI datasets, while FedProx leads to the best accuracy for VOC2012. Generally, the performance of all schemes is quite similar under homogeneous conditions ( $\alpha=10.0$ ), which demonstrates the ability of FL systems to obtain desired performance in homogeneous data settings.

IV-E Efficiency and Convergence Speed

we compare CDFL with other baselines in terms of the number of communication rounds required to achieve a specific target accuracy as well as communication overhead in each communication round. Table IV provides an overview of the communication rounds needed to reach the test target accuracy for various numbers of clients across all datasets. In all settings, we continue training until we reach the target test accuracy or reach a maximum of 50 communication rounds.

As illustrated in Table IV, CDFL requires the fewest communication rounds in all settings and demonstrates a remarkable increase in convergence speed up to 10 times faster compared to other baselines. For the VOC2012 dataset with 10 clients, both SCAFFOLD [60] and FedHKD [62] failed to reach the target accuracy after 50 communication rounds.

Each communication round involves two types of communication overhead: uplink and downlink. Uplink includes the transmission of local models to the server, while downlink corresponds to broadcasting the updated global model to the clients. As the downlink overhead is considerably lower than the uplink, our analysis focuses on the uplink overhead. In this study, we define the number of tansmitted parameters as communication overhead metric.

TABLE IV: Comparison of the number of communication rounds for different FL schemes to achieve the target accuracy

	Stanford40					PPMI			VOC2012
	$N=10$	$N=20$	$N=30$	$N=50$	$N=100$	$N=10$	$N=20$	$N=30$	$N=10$	$N=20$	$N=30$
Target	85%	80%	80%	80%	80%	85%	85%	80%	75%	70%	70%
FedAvg [9]	14 ( $0.6\times$ )	16 ( $0.5\times$ )	24 ( $0.4\times$ )	35 ( $0.3\times$ )	50 ( $0.1\times$ )	9 ( $0.6\times$ )	34 ( $0.3\times$ )	37 ( $0.2\times$ )	34 ( $0.3\times$ )	44 ( $0.2\times$ )	38 ( $0.2\times$ )
SCAFFOLD [60]	22 ( $0.4\times$ )	12 ( $0.7\times$ )	13 ( $0.7\times$ )	16 ( $0.6\times$ )	17 ( $0.4\times$ )	19 ( $0.3\times$ )	31 ( $0.3\times$ )	12 ( $0.7\times$ )	-	10 ( $0.8\times$ )	11 ( $0.8\times$ )
FedProx [10]	10 ( $0.8\times$ )	15 ( $0.5\times$ )	24 ( $0.4\times$ )	35 ( $0.3\times$ )	50 ( $0.1\times$ )	11 ( $0.5\times$ )	35 ( $0.3\times$ )	38 ( $0.2\times$ )	18 ( $0.6\times$ )	10 ( $0.8\times$ )	25 ( $0.4\times$ )
MOON [27]	11 ( $0.7\times$ )	13 ( $0.6\times$ )	19 ( $0.5\times$ )	35 ( $0.3\times$ )	46 ( $0.1\times$ )	14 ( $0.4\times$ )	34 ( $0.3\times$ )	30 ( $0.3\times$ )	29 ( $0.4\times$ )	31 ( $0.3\times$ )	31 ( $0.3\times$ )
FedDyn [61]	14 ( $0.6\times$ )	8 ( $\mathbf{1\times}$ )	10 ( $0.9\times$ )	17 ( $0.6\times$ )	18 ( $0.3\times$ )	14 ( $0.4\times$ )	26 ( $0.3\times$ )	12 ( $0.7\times$ )	30 ( $0.4\times$ )	15 ( $0.5\times$ )	9 ( $\mathbf{1\times}$ )
FedHKD [62]	8 ( $\mathbf{1\times}$ )	11 ( $0.7\times$ )	15 ( $0.6\times$ )	29 ( $0.3\times$ )	46 ( $0.1\times$ )	6 ( $0.8\times$ )	36 ( $0.3\times$ )	32 ( $0.3\times$ )	-	30 ( $0.3\times$ )	31 ( $0.3\times$ )
CDFL	8 ( $\mathbf{1\times}$ )	8 ( $\mathbf{1\times}$ )	9 ( $\mathbf{1\times}$ )	10 ( $\mathbf{1\times}$ )	6 ( $\mathbf{1\times}$ )	5 ( $\mathbf{1\times}$ )	9 ( $\mathbf{1\times}$ )	8 ( $\mathbf{1\times}$ )	11 ( $\mathbf{1\times}$ )	8 ( $\mathbf{1\times}$ )	9 ( $\mathbf{1\times}$ )

In Table V, we summarize the communication overhead of various FL schemes. Since the uplink communication overhead of FedProx [10], MOON [27], and FedDyn [61] matches that of FedAvg [9], we only report the uplink overhead of FedAvg [9] in Table V. In this table, $p$ represents the ratio of connected clients to the total number of clients, and $|\theta_{i}|$ is the number of parameters in each client’s model. In the case of SCAFFOLD [60], $c_{i}$ is the control variate whose size matches the client’s model ( $|c_{i}|=|\theta_{i}|$ ). For the FedHKD [62] scheme, hyperknowledge vectors are transmitted to the server alongside the clients’ models, and $|\mathbf{v}|$ represents the size of the penultimate layer in the model. In most cases, the overhead arising from hyperknowledge vectors is negligible compared to the model transmission overhead. In CDFL, the pixelized images are transmitted beside the clients’ models but for a smaller number of clients ( $|\mathcal{C}|$ ).

We evaluate Stanford40, PPMI, and VOC2012 datasets with 100, 30, and 30 clients, respectively. We select the largest number of clients for each dataset, as the communication overhead would be more challenging with a larger number of clients. The overhead of all schemes is calculated in Table V. For CDFL, we assumed the size of images to be $|I|=224\times 224\times 3$ , and this is the same as the input size of ResNet50. However, there exists an opportunity to compress or downsample the images before transmission to the server, which can save more bandwidth. Since ResNet50 is utilized in our experiments, the total number of parameters is 24.1 M.

As shown in Table III, SCAFFOLD has the highest overhead among all schemes, while CDFL provides the lowest overhead due to its client selection. We achieved a reduction in overhead by up to $\sim$ 64% compared to other baselines for all datasets.

TABLE V: Communication overhead (

\times 10^{8}

) in each round of different FL schemes

	Uplink	Stanford40	PPMI	VOC2012
FedAvg [9]	$pN\|\theta_{i}\|$	$19.3$	$5.78$	$5.78$
SCAFFOLD [60]	$pN(\|\theta_{i}\|+\|c_{i}\|)$	$38.6$	$11.6$	$11.6$
FedHKD [62]	$pN(\|\theta_{i}\|+\|C\|(\|\mathbf{v}\|+\|C\|))$	$19.3$	$5.78$	$5.78$
CDFL	$\|\mathcal{C}\|(\|\theta_{i}\|+\|\mathcal{S}_{i}\|\|I\|)$	$\mathbf{14.3}$	$\mathbf{4.32}$	$\mathbf{4.19}$

IV-F Analysis of CDFL

In this section, we study the effects of key modules and hyper-parameters in CDFL. The experiments are mainly conducted on the PPMI datasets with 10 clients and $\alpha=1.0$ .

•

Imapact of Contrastive Learning & Deep Clustering Modules

We investigate the effect of the contrastive learning and deep clustering modules by excluding them from CDFL, as shown in Table VI. The contrastive learning module is eliminated by setting $\lambda_{1}=\lambda_{2}=0$ in local model loss function. Therefore, this test only relies on the training of pixelized images. In the second test, the deep clustering term is omitted from the local model loss function by setting $\lambda_{3}$ to zero, and random pixelized images are chosen for transmission to the server. As presented in Table VI, removing the contrastive learning module from CDFL results in a 1.27% decrease in performance. In addition, eliminating the deep clustering modules and transmitting random pixelized images to the server leads to a 3.9% reduction in performance, highlighting the effectiveness of our image selection method.

TABLE VI: Impact of contrastive learning and deep clustering modules of CDFL

Removing contrastive learning Removing deep clustering

$89.47\pm 5.368$ ( $1.27\%\downarrow$ ) $86.84\pm 4.30$ ( $3.9\%\downarrow$ )

Removing contrastive learning	Removing deep clustering
$89.47\pm 5.368$ ( $1.27\%\downarrow$ )	$86.84\pm 4.30$ ( $3.9\%\downarrow$ )

•

Impact of $\lambda_{1}$ and $\lambda_{2}$ of Model Contrastive Learning

We investigate the effect of various values of $\lambda_{1}$ and $\lambda_{2}$ on the performance of CDFL. Test performance results for different values of $\lambda_{1}$ and $\lambda_{2}$ are presented in Table VII. An increase in $\lambda_{1}$ results in a more similar representation between the local model on both original and pixelized images. Further, larger values of $\lambda_{2}$ lead to a more similar representation between personalized and local models on original images.

TABLE VII: Impact of

\lambda_{1}

and

\lambda_{2}

on the performance of CDFL

$\lambda_{1}=\lambda_{2}=0.01$	$\lambda_{1}=\lambda_{2}=0.1$	$\lambda_{1}=\lambda_{2}=1.0$
$89.63\pm 4.31$	$90.74\pm 2.98$	$88.68\pm 3.23$

•

Impact of $\lambda_{3}$ in Deep Clustering

Deep clustering is a crucial element in CDFL as it helps select representative images for the clients. The impact of $\lambda_{3}$ in deep clustering loss is investigated for three different values of $\{0.01,0.05,0.1\}$ , as summarized in Table VIII. Smaller values of $\lambda_{3}$ result in poor clustering and image selection, while larger values of $\lambda_{3}$ fade the impact of contrastive learning and knowledge distillation terms in the local loss function, causing the local model to diverge from the personalized model.

TABLE VIII: Impact of $\lambda_{3}$ on the performance of CDFL

$\lambda_{3}=0.01$ $\lambda_{3}=0.0.5$ $\lambda_{3}=0.1$

$90.11\pm 2.30$ $90.74\pm 2.98$ $90.19\pm 3.69$
•

Impact of Number of Selected Images

TABLE IX: Impact of the number of selected images on the performance of CDFL

$m=1$ $m=3$ $m=5$ $m=7$

$88.13\pm 5.39$ $89.32\pm 4.39$ $88.77\pm 5.26$ $90.74\pm 2.98$

In CDFL, we select the $m$ nearest neighbors of each centroid for transmission to the server. A very small $m$ selection leads to a non-representative image set, while a large $m$ increases communication overhead. The performance of CDFL for four different values of $m$ is shown in Table IX. As shown in Table IX, smaller $m$ values, while slightly sacrificing performance, reduce bandwidth usage and communication overhead. This shows that CDFL can be effective even with fewer selected images.
•

Impact of Number of Selected Clients

We also study the impact of the number of selected clients in CDFL. We repeat the experiments when only 3, 5, and 7 clients share their weights and images with the central server, and the results are summarized in Table X. As expected, when fewer clients contribute to the server aggregation, the performance drops, however, the performance drop is negligible. This demonstrates that CDFL works well even if only 30% of clients are involved in server aggregation resulting in a significant decrease in communication overhead.

TABLE X: Impact of the number of selected clients on the performance of CDFL

$|\mathcal{C}|=3$ $|\mathcal{C}|=5$ $|\mathcal{C}|=7$

$89.48\pm 4.44$ $90.74\pm 2.98$ $91.02\pm 2.65$

V Conclusion

This paper introduces CDFL as an efficient FL framework for recognizing human activities. It is designed specifically to address the prevalent challenges of data heterogeneity, communication overhead, and naive aggregation. CDFL introduces a privacy-preserved subset of samples to the central server to enhance the performance of the system in a secure manner. To optimize the selection of representative data samples for each activity, we integrated methods of contrastive learning, knowledge distillation, and deep clustering. Furthermore, our new client selection method offers the dual advantage of reducing communication overheads and enhancing performance. Extensive experiments on three well-known image-based HAR datasets highlight the efficacy of CDFL in both classification accuracy and communication efficiency. Overall, a performance increase of up to 10%, a bandwidth usage reduction of 64% as well as a faster convergence rate of up to 10 times in comparison to the state-of-the-art schemes is achieved.

In the future, several directions of the proposed work will be investigated. First, we want to study the influence of degraded image quality on the efficacy of our system. This assessment is pivotal given the real-world scenarios where image quality can vary depending on the sensor and external factors. Furthermore, we will design a robust and efficient FL framework for the multimodal HAR task that combines signals from various modalities, including visual, audio, inertial, and biological signals.

References

[1] M. Guo, Z. Wang, N. Yang, Z. Li, and T. An, “A multisensor multiclassifier hierarchical fusion model based on entropy weight for human activity recognition using wearable inertial sensors,” IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 105–111, 2018.
[2] S. Gaglio, G. L. Re, and M. Morana, “Human activity recognition process using 3-d posture data,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 5, pp. 586–597, 2014.
[3] A. Kamel, B. Sheng, P. Yang, P. Li, R. Shen, and D. D. Feng, “Deep convolutional neural networks for human action recognition using depth maps and postures,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 9, pp. 1806–1819, 2018.
[4] L. M. Dang, K. Min, H. Wang, M. J. Piran, C. H. Lee, and H. Moon, “Sensor-based and vision-based human activity recognition: A comprehensive survey,” Pattern Recognition, vol. 108, p. 107561, 2020.
[5] A. Esmaeilzehi, E. Khazaei, K. Wang, N. K. Kalsi, P. C. Ng, H. Liu, Y. Yu, D. Hatzinakos, and K. Plataniotis, “Harwe: A multi-modal large-scale dataset for context-aware human activity recognition in smart working environments,” Pattern Recognition Letters, 2024.
[6] X. Zhou, W. Liang, I. Kevin, K. Wang, H. Wang, L. T. Yang, and Q. Jin, “Deep-learning-enhanced human activity recognition for internet of healthcare things,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6429–6438, 2020.
[7] M. Moniruzzaman, Z. Yin, Z. He, R. Qin, and M. C. Leu, “Human action recognition by discriminative feature pooling and video segment attention model,” IEEE Transactions on Multimedia, vol. 24, pp. 689–701, 2021.
[8] W. Qi, H. Su, and A. Aliverti, “A smartphone-based adaptive recognition and real-time monitoring system for human activities,” IEEE Transactions on Human-Machine Systems, vol. 50, no. 5, pp. 414–423, 2020.
[9] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:14955348
[10] T. Li, A. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Third Conference on Machine Learning and Systems (MLSys), vol. 2, 2020, p. 429–450.
[11] Y. Deng, M. M. Kamani, and M. Mahdavi, “Adaptive personalized federated learning,” CoRR, vol. abs/2003.13461, 2020. [Online]. Available: https://arxiv.org/abs/2003.13461
[12] K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage, “Federated evaluation of on-device personalization,” 2019.
[13] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary, “Federated learning with personalization layers,” 2019.
[14] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” 2023.
[15] Y. Huang, L. Chu, Z. Zhou, L. Wang, J. Liu, J. Pei, and Y. Zhang, “Personalized cross-silo federated learning on non-iid data,” in AAAI Conference on Artificial Intelligence, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:227311284
[16] C. T. Dinh, N. H. Tran, and T. D. Nguyen, “Personalized federated learning with moreau envelopes,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
[17] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
[18] X.-X. Wei and H. Huang, “Edge devices clustering for federated visual classification: A feature norm based framework,” IEEE Transactions on Image Processing, vol. 32, pp. 995–1010, 2023.
[19] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3710–3722, 2021.
[20] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 8076–8091, 2022.
[21] C. Chen, Z. Chen, Y. Zhou, and B. Kailkhura, “Fedcluster: Boosting the convergence of federated learning via cluster-cycling,” in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 5017–5026.
[22] C. Briggs, Z. Fan, and P. Andras, “Federated learning with hierarchical clustering of local updates to improve training on non-iid data,” in 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–9.
[23] M. Duan, D. Liu, X. Ji, Y. Wu, L. Liang, X. Chen, Y. Tan, and A. Ren, “Flexible clustered federated learning for client-level data distribution shift,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 11, pp. 2661–2674, 2022.
[24] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, “Clusterfl: A similarity-aware federated learning system for human activity recognition,” in Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, ser. MobiSys ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 54–66. [Online]. Available: https://doi.org/10.1145/3458864.3467681
[25] A. Sultana, M. M. Haque, L. Chen, F. Xu, and X. Yuan, “Eiffel: Efficient and fair scheduling in adaptive federated learning,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4282–4294, 2022.
[26] D. Chen, J. Hu, V. Junkai Tan, X. Wei, and E. Wu, “Elastic aggregation for federated optimization,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[27] Q. Li, B. He, and D. Song, “Model-contrastive federated learning,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10 708–10 717.
[28] C. Chen, Z. Chen, Y. Zhou, and B. Kailkhura, “Fedcluster: Boosting the convergence of federated learning via cluster-cycling,” in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 5017–5026.
[29] M. Duan, D. Liu, X. Ji, R. Liu, L. Liang, X. Chen, and Y. Tan, “Fedgroup: Efficient federated learning via decomposed similarity-based clustering,” in 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 2021, pp. 228–237.
[30] A. Albaseer, M. Abdallah, A. Al-Fuqaha, and A. Erbad, “Data-driven participant selection and bandwidth allocation for heterogeneous federated edge learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
[31] M. Chen, N. Shlezinger, H. V. Poor, Y. C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proceedings of the National Academy of Sciences, vol. 118, no. 17, p. e2024789118, 2021. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.2024789118
[32] Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 10, pp. 4229–4238, 2020.
[33] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora, “FetchSGD: Communication-efficient federated learning with sketching,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 8253–8265. [Online]. Available: https://proceedings.mlr.press/v119/rothchild20a.html
[34] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni, “Federated learning with matched averaging,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=BkluqlSFDS
[35] S. EK, F. PORTET, P. LALANDA, and G. VEGA, “A federated learning aggregation algorithm for pervasive computing: Evaluation and comparison,” in 2021 IEEE International Conference on Pervasive Computing and Communications (PerCom), 2021, pp. 1–10.
[36] X. Zhou, W. Liang, J. Ma, Z. Yan, and K. I.-K. Wang, “2d federated learning for personalized human activity recognition in cyber-physical-social systems,” IEEE Transactions on Network Science and Engineering, vol. 9, no. 6, pp. 3934–3944, 2022.
[37] F. Concone, C. Ferdico, G. L. Re, and M. Morana, “A federated learning approach for distributed human activity recognition,” in 2022 IEEE International Conference on Smart Computing (SMARTCOMP), 2022, pp. 269–274.
[38] Z. Xiao, X. Xu, H. Xing, F. Song, X. Wang, and B. Zhao, “A federated learning system with enhanced feature extraction for human activity recognition,” Knowledge-Based Systems, vol. 229, p. 107338, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950705121006006
[39] G. Gad and Z. Fadlullah, “Federated learning via augmented knowledge distillation for heterogenous deep human activity recognition systems,” Sensors, vol. 23, no. 1, 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/1/6
[40] D. Li and J. Wang, “Fedmd: Heterogenous federated learning via model distillation,” 2019.
[41] L. Tu, X. Ouyang, J. Zhou, Y. He, and G. Xing, “Feddl: Federated learning via dynamic layer sharing for human activity recognition,” in Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, ser. SenSys ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 15–28. [Online]. Available: https://doi.org/10.1145/3485730.3485946
[42] C. Bettini, G. Civitarese, and R. Presotto, “Semi-supervised and personalized federated activity recognition based on active learning and label propagation,” Personal and Ubiquitous Computing, vol. 26, no. 5, pp. 1281–1298, jun 2022. [Online]. Available: https://doi.org/10.1007%2Fs00779-022-01688-8
[43] Y. Zhao, H. Liu, H. Li, P. Barnaghi, and H. Haddadi, “Semi-supervised federated learning for activity recognition,” 2021.
[44] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-shot multi-level face localisation in the wild,” in CVPR, 2020.
[45] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, “Human action recognition by learning bases of action attributes and parts,” in 2011 International Conference on Computer Vision, 2011, pp. 1331–1338.
[46] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, p. 1857–1865.
[47] B. Xu, W. Xia, H. Zhao, Y. Zhu, X. Sun, and T. Q. S. Quek, “Clustered federated learning in internet of things: Convergence analysis and resource optimization,” IEEE Internet of Things Journal, pp. 1–1, 2023.
[48] N. Shlezinger, S. Rini, and Y. C. Eldar, “The communication-aware clustered federated learning problem,” in 2020 IEEE International Symposium on Information Theory (ISIT), 2020, pp. 2610–2615.
[49] C. Li, G. Li, and P. K. Varshney, “Federated learning with soft clustering,” IEEE Internet of Things Journal, vol. 9, no. 10, pp. 7773–7782, 2022.
[50] Y. Kim, E. A. Hakim, J. Haraldson, H. Eriksson, J. M. B. da Silva, and C. Fischione, “Dynamic clustering in federated learning,” in ICC 2021 - IEEE International Conference on Communications, 2021, pp. 1–6.
[51] X. Tang, S. Guo, and J. Guo, “Personalized federated learning with clustered generalization,” CoRR, vol. abs/2106.13044, 2021. [Online]. Available: https://arxiv.org/abs/2106.13044
[52] H. Jamali-Rad, M. Abdizadeh, and A. Singh, “Federated learning with taskonomy for non-iid data,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–12, 2022.
[53] Z. Dang, C. Deng, X. Yang, K. Wei, and H. Huang, “Nearest neighbor matching for deep clustering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 13 693–13 702.
[54] H. Dong, W. Ma, L. Jiao, F. Liu, and L. Li, “A multiscale self-attention deep clustering for change detection in sar images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022.
[55] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 12 310–12 320. [Online]. Available: https://proceedings.mlr.press/v139/zbontar21a.html
[56] J. Lv, Z. Kang, X. Lu, and Z. Xu, “Pseudo-supervised deep subspace clustering,” IEEE Transactions on Image Processing, vol. 30, pp. 5252–5263, 2021.
[57] C. Niu, H. Shan, and G. Wang, “Spice: Semantic pseudo-labeling for image clustering,” IEEE Transactions on Image Processing, vol. 31, pp. 7264–7278, 2022.
[58] X. Zhang, Y. Ge, Y. Qiao, and H. Li, “Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3436–3445.
[59] L. Wang, D. Guo, G. Wang, and S. Zhang, “Annotation-efficient learning for medical image segmentation based on noisy pseudo labels and adversarial learning,” IEEE Transactions on Medical Imaging, vol. 40, no. 10, pp. 2795–2807, 2021.
[60] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020.
[61] D. A. E. Acar, Y. Zhao, R. Matas, M. Mattina, P. Whatmough, and V. Saligrama, “Federated learning based on dynamic regularization,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=B7v4QMR6Z9w
[62] H. Chen, H. Vikalo et al., “The best of both worlds: Accurate global and personalized models through federated learning with data-free hyper-knowledge distillation,” arXiv preprint arXiv:2301.08968, 2023.
[63] B. Yao and L. Fei-Fei, “Grouplet: A structured image representation for recognizing human and object interactions,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 9–16.
[64] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

$\lambda_{3}=0.01$	$\lambda_{3}=0.0.5$	$\lambda_{3}=0.1$
$90.11\pm 2.30$	$90.74\pm 2.98$	$90.19\pm 3.69$

$m=1$	$m=3$	$m=5$	$m=7$
$88.13\pm 5.39$	$89.32\pm 4.39$	$88.77\pm 5.26$	$90.74\pm 2.98$

$\|\mathcal{C}\|=3$	$\|\mathcal{C}\|=5$	$\|\mathcal{C}\|=7$
$89.48\pm 4.44$	$90.74\pm 2.98$	$91.02\pm 2.65$