
Learn More by Using Less: Distributed Learning with Energy-Constrained Devices

Roberto Pereira, Cristian J. Vaca-Rubio and Luis Blanco
Centre Tecnològic de Telecomunicacions de Catalunya (CTTC / CERCA)
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
email: {rpereira, cvaca, lblanco}@cttc.es
Abstract

Federated Learning (FL) has emerged as a solution for distributed model training across decentralized, privacy-preserving devices, but the different energy capacities of participating devices (system heterogeneity) constrain real-world implementations. These energy limitations not only reduce model accuracy but also increase dropout rates, impacting convergence in practical FL deployments. In this work, we propose LeanFed, an energy-aware FL framework designed to optimize client selection and training workloads on battery-constrained devices. LeanFed leverages adaptive data usage by dynamically adjusting the fraction of local data each device utilizes during training, thereby maximizing device participation across communication rounds while ensuring devices do not run out of battery during the process. We rigorously evaluate LeanFed against traditional FedAvg on the CIFAR-10 and CIFAR-100 datasets, simulating various levels of data heterogeneity and device participation rates. Results show that LeanFed consistently enhances model accuracy and stability, particularly in settings with high data heterogeneity and limited battery life, by mitigating client dropout and extending device availability. This approach demonstrates the potential of energy-efficient, privacy-preserving FL in real-world, large-scale applications, setting a foundation for robust and sustainable pervasive AI on resource-constrained networks.

Index Terms:
Energy efficiency, Federated learning, Resource-Constrained Devices, Pervasive AI.

I Introduction

Until recently, most machine learning (ML) solutions relied on centralized training, where data is transferred from local devices to powerful data centers. Although effective, this approach raises significant privacy and environmental concerns, largely due to the high energy consumption and carbon footprint associated with data transmission and intensive computation in large-scale data centers [1, 2]. As the use of AI expands globally, the environmental impact of centralized frameworks is becoming unsustainable, given their substantial energy requirements [3, 4]. Recent studies indicate that transitioning from centralized to distributed computing architectures could reduce the total carbon footprint of ML operations, advancing more sustainable, energy-efficient AI [5, 6].

In response, pervasive AI has emerged as a promising solution, enabling distributed learning directly on edge devices and bringing AI processing closer to data sources [7, 8, 9]. By processing data locally, pervasive AI minimizes the need to transmit large volumes of raw data to centralized servers, thereby also reducing bandwidth requirements and addressing privacy concerns. This shift not only improves response times and network efficiency but also enables a more privacy-preserving, sustainable, and scalable approach to machine learning [10].

Nonetheless, despite these benefits, enabling truly pervasive AI solutions comes with its own set of challenges. Unlike cloud-based AI, which relies on high-performance centralized infrastructure, edge AI must leverage collaboration across multiple heterogeneous, resource-constrained devices operating in dynamic environments. These devices often have limited computational power and energy resources, and can only access local data. Federated learning (FL) is a widely adopted distributed learning approach that moves training closer to data sources, minimizing the need for centralized processing [11, 12]. In FL, each client/device trains its model locally on private data and shares only updated model parameters with a central server, which then aggregates these updates to refine a global model for distribution to all clients. This iterative process of local computation and centralized aggregation significantly reduces data transfer, addressing both privacy and energy efficiency.

Unlike centralized learning, the convergence rate in FL models is strongly influenced by data distribution and client participation in each training round [13, 14, 15]. Particularly, when data is independent and identically distributed (iid) across devices, training on all devices may yield accuracy and convergence rates comparable to those in centralized solutions [16, 17]. In real-world applications, however, data is typically heterogeneously distributed (non-iid) among clients, and only a subset of devices participate in each training round. Therefore, convergence depends heavily on the number and specific selection of participating devices [18, 19, 20].

A larger device pool is generally associated with more data, which potentially accelerates convergence and increases model accuracy. However, to truly promote pervasive AI, FL must also be deployed on resource-constrained devices with varying, and often limited, battery capacities.

In this scenario, recent studies have shown that energy heterogeneity among devices can cause dropout and gradient divergence, impacting performance significantly [21]. For instance, client selection strategies that disregard battery life can accelerate battery drainage, leading to temporary or prolonged device unavailability during training [22]. In this work, we argue that prioritizing devices’ battery life is essential for enhancing global model performance in resource-limited FL scenarios. Particularly, we propose selecting a subset of each device’s local data to extend its battery life and participation in the FL process. Although training on reduced data may yield less accurate local models temporarily, it significantly conserves battery life, enabling devices to participate across more training rounds. Our numerical experiments demonstrate that it is generally more advantageous for devices to engage in more rounds with fewer data samples than to participate in fewer rounds using their entire dataset. This strategy mitigates client dropout, preserves device energy, and ensures balanced contributions, thereby promoting robust convergence and enhancing the quality of the global model.

II Background and Motivation

Before presenting our energy-aware sampling strategy in Section III, we first outline key concepts of federated learning, emphasizing centralized FL, the challenges posed by heterogeneous (non-iid) data distribution, and considerations related to local energy consumption.

In a typical centralized FL setup, a central server manages the learning process over a network of $E$ edge devices, each owning a local dataset $D_{e}$, for $e=1,\ldots,E$. The objective of centralized FL is to minimize a weighted global loss function by aggregating information across the different devices while ensuring data privacy. Formally, this optimization process can be defined as

$$\min_{w} f(w)=\sum_{e=1}^{E}\frac{|D_{e}|}{|D|}F_{e}(w), \qquad (1)$$

where $F_{e}(w)=\mathbb{E}_{x_{e}\sim D_{e}}[f_{e}(w;x_{e})]$ represents the local loss function computed by the $e$-th edge device on its local dataset $D_{e}$, and $|D_{e}|$ denotes the number of samples available to the $e$-th device. Here, $|D|$ quantifies the total number of samples across all devices.

This optimization is typically performed over multiple communication rounds. In each round $r=1,\ldots,R$, each device $e=1,\ldots,E$ initializes its local model $w_{e}^{(r)}$ with the current global model $w^{(r)}$, which has been broadcast by the server. Each device then performs $L_{e}$ epochs of local training on its dataset $D_{e}$. After the local updates are complete, the server aggregates the models from all devices to update the global model by computing the weighted average

$$w^{(r+1)}=\sum_{e=1}^{E}\frac{|D_{e}|}{|D|}\,w_{e}^{(r)}. \qquad (2)$$

In an ideal scenario where all devices have similar resources (balanced and similar data distributions, processing power, network stability, and battery life), the vanilla Federated Averaging (FedAvg) strategy [23] can yield optimal results. With independently and identically distributed (iid) data across devices, each local dataset closely approximates the global data distribution. This minimizes the divergence between local and global models during aggregation, enabling the FL system to converge toward a solution comparable to that of a centralized model (depending on factors such as $L_{e}$, $e=1,\ldots,E$, and network conditions).
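As a concrete illustration of the aggregation step in (2), the weighted average can be sketched as follows, with models represented as flat NumPy parameter vectors (a simplification of ours, not the paper's implementation):

```python
import numpy as np

def fedavg_aggregate(local_models, dataset_sizes):
    """Weighted average of local parameter vectors, as in Eq. (2)."""
    total = float(sum(dataset_sizes))
    global_model = np.zeros_like(local_models[0], dtype=float)
    for w_e, n_e in zip(local_models, dataset_sizes):
        global_model += (n_e / total) * w_e  # weight |D_e| / |D|
    return global_model

# Two devices: 100 samples with weights [1, 1], 300 samples with [3, 3].
w_next = fedavg_aggregate([np.array([1.0, 1.0]), np.array([3.0, 3.0])],
                          [100, 300])
# The larger dataset dominates the average: w_next == [2.5, 2.5]
```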

In real-world applications, however, devices often exhibit significant heterogeneity in terms of data distribution, available resources, and connectivity. Specifically, the local data on each device may follow diverse distributions, with variations such as covariate shift, concept drift, or label distribution skew [24, 25, 26]. In this work, we focus on label heterogeneity, where label distributions vary across devices. For instance, in a location-dependent data collection scenario, certain classes may be overrepresented or underrepresented on specific devices, leading to imbalances that can worsen model convergence and degrade accuracy.

In addition to data heterogeneity, differences in device resources also affect the FL process. For instance, FedAvg [23] assumes that all devices perform the same number of local epochs ($L=L_{1}=\cdots=L_{E}$). However, this assumption may not hold in real-world settings, where devices vary in processing power, battery life, and network stability. Resource-constrained devices, or "stragglers," can slow down FL by delaying communication rounds or providing less reliable updates.

To address these challenges, adaptive methods like FedProx [15] enable devices to perform a variable number of local training steps, promoting more flexible participation based on resource availability. However, many existing approaches that aim to enable learning in resource-constrained conditions primarily focus on the number of local epochs [15], networking conditions [7], and/or partial network updates [27], which can overlook the broader implications of resource constraints. In parallel, we argue that adaptive sampling strategies [28, 22] can dynamically select devices for each round based on criteria such as data representativeness or available energy. This dynamic selection helps balance the trade-offs between energy consumption, convergence speed, and model accuracy, ultimately optimizing FL performance in heterogeneous environments.

Building on the methods mentioned above, in our work we explore leveraging the variations in battery life and data availability across devices during training. As later shown in our results, it is crucial to ensure that all devices remain active until the last communication rounds, as their participation can significantly influence the convergence of the learning process and enhance the overall model performance.

III Energy-aware Federated Learning

We aim to explore strategies that enable distributed learning on energy-constrained devices, particularly focusing on scenarios where multiple devices may lack sufficient battery life to participate in all $R$ communication rounds pre-defined by the central server. More formally, consider a set of devices $e=1,\dots,E$, each with a fixed battery energy budget $B_{e}$.

During each communication round $r=1,\dots,R$, local training incurs an energy cost $\bar{b}_{e}$ for each device. This energy cost is assumed to be proportional to the device's dataset size and to remain relatively stable across rounds. Specifically, for $e=1,\dots,E$, we define $\bar{b}_{e}=b_{e}^{(1)}\approx\cdots\approx b_{e}^{(R)}$, where experimental observations support that, under fixed conditions, the (expected) energy consumption per round remains relatively stable across different rounds.¹

¹Our experiments suggest that, for a given device and dataset, energy consumption per round is largely invariant. These results are not presented here due to space constraints.

Considering the scenario described above and employing traditional FedAvg, a device is capable of participating in a maximum of

$$R_{e}=\frac{B_{e}}{\bar{b}_{e}L_{e}}, \qquad e=1,\dots,E \qquad (3)$$

communication rounds.

If, for any given device, the pre-defined number of communication rounds $R$ is larger than the maximum number of rounds $R_{e}$, the device will run out of battery, which will likely degrade the performance of the global model. This becomes especially problematic when $R$ is much larger than $R_{e}$ for a large number of devices (see Sec. IV-B for numerical results).
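As a quick numeric sanity check of (3), with illustrative values of ours:

```python
def max_rounds(B_e, b_e, L_e):
    """Maximum number of communication rounds a device can sustain
    under plain FedAvg, Eq. (3): R_e = B_e / (b_e * L_e)."""
    return B_e / (b_e * L_e)

# A device with budget B_e = 50, per-epoch full-data cost 0.1,
# and L_e = 5 local epochs per round:
R_e = max_rounds(50.0, 0.1, 5)  # 100.0
# If the server schedules R = 200 rounds, this device drains its
# battery halfway through training.
```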

To ensure that $R_{e}\geq R$ for all devices, a simple yet effective approach is to reduce the cost of local learning by reducing the amount of data used during local training at each device. Specifically, we propose that each device $e=1,\dots,E$ uses only a fraction of its local data

$$\eta_{e}(\lambda)=R\,\frac{\bar{b}_{e}L_{e}}{\lambda B_{e}}, \qquad (4)$$

where $\lambda\in[0,1]$ is defined by the central server. The data fraction $\eta_{e}(\lambda)$ is proportional to $R$ and inversely proportional to $R_{e}$ and to the device participation rate. By doing so, we ensure that each device's participation capacity meets the pre-defined number of communication rounds $R$, thus preventing devices from running out of battery before the end of training. We always enforce $\eta_{e}(\lambda)\in[0,1]$; hereafter, we drop the dependence on $\lambda$ and write $\eta_{e}=\eta_{e}(\lambda)$.

Finally, when only a subset of devices participates in each communication round, the scaling factor $\lambda$ enables each device to use the maximum amount of local data possible while still preserving battery life for the rounds in which it is expected to participate.
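A minimal sketch of how a device would evaluate (4), including the clipping to $[0,1]$ mentioned above (function name and values are illustrative):

```python
def data_fraction(R, b_e, L_e, B_e, lam):
    """Fraction of local data used per round, Eq. (4):
    eta_e = R * b_e * L_e / (lam * B_e), enforced to lie in [0, 1]."""
    eta = R * b_e * L_e / (lam * B_e)
    return min(max(eta, 0.0), 1.0)

# R = 100 rounds, per-epoch cost 0.1, L_e = 5, budget 100, full participation:
eta = data_fraction(100, 0.1, 5, 100.0, 1.0)      # 0.5
# A lower participation factor lets the device use more of its data:
eta_low = data_fraction(100, 0.1, 5, 100.0, 0.5)  # clipped to 1.0
```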

Our proposed framework is summarized in Algorithm 1. Setting $\eta_{e}=1$ for all devices, our approach recovers the vanilla FedAvg method, in which every device uses its entire dataset in each communication round.

Input: number of communication rounds $R$, number of devices $E$, numbers of local training epochs $L_{e}$, device energy budgets $B_{1},\dots,B_{E}$, energy consumption per round $\bar{b}_{1},\dots,\bar{b}_{E}$
for each communication round $r=1,\dots,R$ do
   Server broadcasts the global model $w^{(r)}$ to all devices;
   for each device $e=1,\dots,E$ in parallel do
      if $B_{e}>0$ then
         Device $e$ initializes $w_{e}^{(r)}\leftarrow w^{(r)}$;
         Device $e$ estimates the fraction of data $\eta_{e}$ to use according to (4);
         for local epoch $i=1,\dots,L_{e}$ do
            if $B_{e}>0$ then
               Device $e$ trains on an $\eta_{e}$ fraction of its local data $D_{e}$;
               Update remaining energy $B_{e}\leftarrow B_{e}-\bar{b}_{e}\cdot\eta_{e}$;
            end if
         end for
         if $B_{e}>0$ then
            Device $e$ sends updated model $w_{e}^{(r)}$ to the server;
         end if
      end if
   end for
   Server aggregates the local models of all active devices to update the global model $w^{(r+1)}$ according to Equation (2);
end for
Output: Final global model $w^{(R)}$
Algorithm 1 Energy-Aware Federated Learning (LeanFed)
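The energy bookkeeping of Algorithm 1 can be sketched as a minimal simulation; actual model training is stubbed out, and all device parameters below are illustrative choices of ours:

```python
def energy_sim(R, budgets, costs, epochs, eta):
    """Track which devices remain active over R rounds, following the
    battery updates of Algorithm 1 (B_e <- B_e - b_e * eta_e per epoch).
    Returns, per round, the indices of devices that send an update."""
    B = list(budgets)  # remaining energy per device
    history = []
    for _ in range(R):
        active = []
        for e in range(len(B)):
            if B[e] > 0:  # device joins the round
                for _ in range(epochs[e]):  # L_e local epochs
                    if B[e] > 0:
                        B[e] -= costs[e] * eta[e]
                if B[e] > 0:  # still has battery: send model to server
                    active.append(e)
        history.append(active)
    return history

# Two devices over R = 4 rounds: device 0 trains on all of its data and
# drains early; device 1 uses a 0.45 fraction and survives every round.
hist = energy_sim(R=4, budgets=[1.0, 1.0], costs=[0.25, 0.25],
                  epochs=[2, 2], eta=[1.0, 0.45])
# hist == [[0, 1], [1], [1], [1]]
```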

IV Numerical Evaluation

IV-A Experimental Setup

Datasets and baselines. In this section, we evaluate the impact of the energy-preserving strategy outlined above on both learning performance and energy consumption, comparing it to traditional implementations of FedAvg. We consider two standard benchmarks: CIFAR–10 and CIFAR–100 [29], with various levels of data heterogeneity and participation rates. We simulate data heterogeneity across devices by sampling the ratio of data associated with each label from a Dirichlet distribution [26], controlled by a concentration parameter $\gamma\in\{0.5, 1.0\}$.

For the i.i.d. scenario, we set $\gamma=10^{3}$, which results in a near-homogeneous distribution across devices. During evaluation, we use each dataset's full test set and record the highest test accuracy achieved during training, averaged over five independent executions of the experiment. We compare our method, dubbed LeanFed, with the vanilla FedAvg algorithm [23]. We consider varying device participation rates ($\lambda\in\{80\%, 50\%, 20\%, 10\%\}$) to assess how reducing the number of participating devices affects training duration and energy consumption. Furthermore, for both methods, we assume that any device with remaining energy can participate in a training round, ensuring that energy constraints are respected throughout the learning process.
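The Dirichlet-based label split described above can be sketched as follows (a standard construction for this kind of partitioning; function and parameter names are ours):

```python
import numpy as np

def dirichlet_partition(labels, num_devices, gamma, seed=0):
    """Assign sample indices to devices with per-class proportions drawn
    from Dir(gamma); small gamma yields highly skewed (non-iid) splits."""
    rng = np.random.default_rng(seed)
    device_indices = [[] for _ in range(num_devices)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Fraction of class-c samples assigned to each device
        props = rng.dirichlet(gamma * np.ones(num_devices))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for dev, part in enumerate(np.split(idx, cuts)):
            device_indices[dev].extend(part.tolist())
    return device_indices

labels = np.repeat(np.arange(10), 100)  # 1000 samples, 10 balanced classes
parts = dirichlet_partition(labels, num_devices=50, gamma=0.5)
# Every sample is assigned to exactly one device.
```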

Implementation details. We adopt ResNet-18 as the deep architecture and train it from scratch using the ADAM optimizer with a learning rate of $0.01$ and a weight decay of $10^{-4}$. Following prior works [23], we set the number of local epochs to $5$ and the batch size to $64$ throughout all experiments. To simulate different battery resources at each device, we model $B_{e}$ as $B_{e}=(\alpha_{e}|D_{e}|/|D|)\cdot(\beta_{e}R)$, where $\alpha_{e}$, $\beta_{e}$ are drawn from a Gaussian distribution, i.e., $\alpha_{e},\beta_{e}\sim\mathcal{N}(0.5, 0.5)$, and clipped to the range $\alpha_{e},\beta_{e}\in[0.1, 1]$.

The random variable $\beta_{e}$ controls the maximum number of communication rounds each device can perform with an $\alpha_{e}$ fraction of its local data, simulating the proportionality of energy consumption to data size. We set $\bar{b}_{e}=|D_{e}|/|D|$.
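Under the stated assumptions, the simulated battery budgets can be generated as follows; we read $\mathcal{N}(0.5, 0.5)$ as mean 0.5 and standard deviation 0.5 (our interpretation), with the clipping to $[0.1, 1]$ as described:

```python
import numpy as np

def sample_budgets(dataset_sizes, R, seed=0):
    """B_e = (alpha_e * |D_e| / |D|) * (beta_e * R), with
    alpha_e, beta_e ~ N(0.5, 0.5) clipped to [0.1, 1]."""
    rng = np.random.default_rng(seed)
    sizes = np.asarray(dataset_sizes, dtype=float)
    shares = sizes / sizes.sum()  # |D_e| / |D|, also used as b_e
    alpha = np.clip(rng.normal(0.5, 0.5, sizes.size), 0.1, 1.0)
    beta = np.clip(rng.normal(0.5, 0.5, sizes.size), 0.1, 1.0)
    return (alpha * shares) * (beta * R)

budgets = sample_budgets([100, 200, 300], R=200)
# Each budget is positive and at most share_e * R (alpha_e = beta_e = 1).
```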

Finally, we used the PyTorch framework for implementation and executed all experiments on a single NVIDIA GeForce RTX 3090 GPU.

We also make the simulation code available at https://github.com/robertomatheuspp/leanFed.

IV-B Learning on Battery Constrained Devices

We start by comparing the convergence rate of our proposed method, LeanFed, against the baseline FedAvg on the CIFAR–10 and CIFAR–100 datasets. Fig. 1 displays the test accuracy, averaged over five independent simulations, for both methods under varying data distribution settings (indicated by different colors in the plot). We consider two scenarios, one with $10$ devices and another with $50$ devices, training for $R=100$ and $R=200$ communication rounds, respectively. We notice that the test accuracy of LeanFed (dashed lines) consistently increases with the number of communication rounds, while the baseline (solid lines) reaches a peak accuracy before starting to degrade. During our experiments, we observed that this degradation occurs consistently regardless of the level of data heterogeneity ($\gamma$), the desired number of training rounds ($R$), the number of devices ($E$), and the dataset (CIFAR–10/100), suggesting that naively employing FedAvg may lead to undesired results in energy-constrained settings. We argue that this degradation in performance is a consequence of devices becoming inactive. In particular, as the number of communication rounds increases, a larger number of devices become inactive. Consequently, the local models of the active ones tend to overfit to their local data, degrading the overall behavior of the global model.²

²An alternative to avoid such degradation is for each device to only accept the global model if it does not degrade the performance on its local dataset. However, for simplicity, we opt to employ a vanilla FL approach where the global model always overwrites the local one.

[Figure 1: test accuracy vs. communication rounds. Panels: (a) CIFAR–10, 10 devices; (b) CIFAR–10, 50 devices; (c) CIFAR–100, 10 devices; (d) CIFAR–100, 50 devices. Legend: FedAvg and LeanFed at $\gamma=10.0$, $\gamma=1.0$, and $\gamma=0.5$.]
Figure 1: Test accuracy (y-axis) over communication rounds (x-axis) for the baseline FedAvg (solid lines) and LeanFed (dashed lines) across various levels of data heterogeneity (indicated by color) and full device participation. Top row shows results for the CIFAR-10 dataset, while the bottom row presents results for the CIFAR-100.
TABLE I: Average and variance of test accuracy of FedAvg and LeanFed considering a total of $50$ devices, for varying levels of data heterogeneity and pre-defined communication rounds $R=100$ and $200$. The highest accuracies are highlighted in bold.

| Dataset | Method | $\gamma=0.5$, 100 rounds | $\gamma=0.5$, 200 rounds | $\gamma=1.0$, 100 rounds | $\gamma=1.0$, 200 rounds | i.i.d., 100 rounds | i.i.d., 200 rounds |
|---|---|---|---|---|---|---|---|
| CIFAR–10 | FedAvg | 52.90 ± 2.67 | 57.29 ± 2.67 | 58.33 ± 2.45 | 60.95 ± 2.60 | 63.11 ± 0.32 | 65.72 ± 0.75 |
| | FedAvg (80%) | 52.34 ± 2.51 | 57.47 ± 2.87 | 57.56 ± 2.35 | 61.12 ± 2.01 | 63.25 ± 1.08 | 65.57 ± 0.44 |
| | FedAvg (50%) | 55.30 ± 2.89 | 61.23 ± 1.40 | 59.99 ± 2.29 | 62.84 ± 2.71 | 65.19 ± 0.99 | 67.79 ± 0.29 |
| | FedAvg (20%) | 59.41 ± 2.42 | 61.22 ± 1.42 | **62.87 ± 0.87** | 64.12 ± 0.96 | **67.16 ± 0.54** | 67.79 ± 0.42 |
| | FedAvg (10%) | 57.40 ± 1.49 | 59.84 ± 2.12 | 61.81 ± 0.89 | 63.27 ± 1.00 | 66.81 ± 0.64 | 67.81 ± 0.40 |
| | LeanFed (ours) | **60.30 ± 1.28** | **63.71 ± 1.27** | **62.91 ± 1.08** | **65.38 ± 1.60** | **66.68 ± 0.52** | **68.27 ± 0.81** |
| CIFAR–100 | FedAvg | 17.82 ± 0.99 | 20.23 ± 0.72 | 20.09 ± 1.41 | 23.25 ± 1.29 | 21.34 ± 1.70 | 23.83 ± 0.93 |
| | FedAvg (80%) | 18.17 ± 0.69 | 20.91 ± 0.16 | 19.79 ± 1.58 | 23.40 ± 0.86 | 21.54 ± 1.21 | 24.18 ± 1.14 |
| | FedAvg (50%) | 20.45 ± 1.13 | 24.99 ± 1.28 | 21.86 ± 1.27 | 25.76 ± 0.67 | 22.59 ± 1.30 | 27.95 ± 0.49 |
| | FedAvg (20%) | **25.44 ± 1.25** | 26.08 ± 1.07 | **26.60 ± 1.11** | 27.26 ± 0.88 | 27.38 ± 0.55 | 28.84 ± 0.84 |
| | FedAvg (10%) | 24.07 ± 1.49 | 26.86 ± 0.64 | **26.48 ± 1.28** | 27.70 ± 0.55 | **28.50 ± 0.61** | **29.25 ± 0.08** |
| | LeanFed (ours) | **25.20 ± 1.73** | **28.32 ± 1.05** | **26.02 ± 2.64** | **30.51 ± 1.24** | 26.85 ± 2.23 | **29.95 ± 1.98** |

Moreover, we also observe that the final accuracy achieved by LeanFed is consistently higher than the peak values obtained by FedAvg. While also evident in the homogeneous setup, this trend is more pronounced in highly heterogeneous scenarios, e.g., for $\gamma=1.0$ (red) and $\gamma=0.5$ (orange) in Fig. 1. In these settings, prolonging the participation of all devices until the last communication round also appears to increase global accuracy.

To further explore the benefit of prolonged device participation, we extend our analysis by comparing LeanFed with the FedAvg baseline across varying device participation rates on both the CIFAR-10 and CIFAR-100 datasets. Table I displays the highest test accuracy achieved by the global model (averaged over five independent simulations), showing that LeanFed consistently outperforms all baseline configurations across both datasets. As expected, increasing data heterogeneity among devices (lower values of $\gamma$) leads to a general degradation in performance across all methods, while increasing the number of communication rounds improves the overall test accuracy.

The above results suggest that in energy-constrained federated learning, a lower participation rate can be advantageous, as only selected devices expend battery resources during local training. This allows the remaining devices to conserve energy, potentially extending their active participation across more training rounds. However, if the participation rate is too low (e.g., $10\%$ in Table I), the global model may lack sufficient representation, leading to degraded performance. Conversely, if the participation rate is too high (e.g., $80\%$ or $50\%$), many devices may exhaust their batteries in the early training stages, reducing their availability for later rounds. Figure 2 illustrates this behavior through boxplots showing the round at which devices run out of battery, under fixed $R=200$, $\gamma=0.5$, and $E=50$ devices. Boxplots near the bottom represent scenarios where a high number of devices become inactive early, while higher-positioned boxplots indicate better device availability throughout the preceding rounds. In the first three scenarios, the FedAvg baseline with $100\%$, $80\%$, and $50\%$ participation rates, the majority of devices become inactive early in training. In contrast, for FedAvg at $20\%$ and $10\%$ participation rates, as well as LeanFed, the boxplots are closer to the maximum number of communication rounds, indicating prolonged device participation.

[Figure 2: boxplots for FedAvg at 100%, 80%, 50%, 20%, and 10% participation, and for LeanFed; y-axis: communication round, from 0 to 200.]
Figure 2: Number of communication rounds after which devices become inactive due to a drained battery, considering $50$ devices, $R=200$ and $\gamma=0.5$. Each boxplot corresponds to a different method and/or participation rate.

Finally, building on the observations above, we further investigate the behavior of LeanFed under partial device participation. Figure 3 presents results similar to those in Fig. 1 but focuses on a fixed heterogeneous data distribution ($\gamma=0.5$) and varying device participation rates $\lambda\in\{80\%, 50\%, 20\%, 10\%\}$. In low-participation settings ($\lambda\in\{10\%, 20\%\}$), where most devices remain active throughout training, LeanFed performs comparably to (or slightly better than) FedAvg. This is aligned with (4), which indicates that as $\lambda$ approaches zero, $\eta_{e}(\lambda)$ approaches one, effectively making LeanFed equivalent to FedAvg.³

³Using low participation rates, e.g., $\lambda=10\%$, does not guarantee that all the energy of the devices has been consumed. Therefore, many of the local devices could potentially have participated in more communication rounds.

A more interesting observation arises when the participation rate increases. In this setting, LeanFed consistently outperforms both FedAvg and LeanFed under full device participation. Specifically, with $\lambda=80\%$, LeanFed achieves peak test accuracies of $61.26~(\pm 1.44)$ for $E=10$ and $66.63~(\pm 1.23)$ for $E=50$, surpassing the accuracies presented in Table I, which are obtained under full device participation. We hypothesize that this is due to a higher proportion of local data used per device when participation is partial, leading to enhanced model training. However, further investigation is needed to fully understand this effect, which we leave as future work.

[Figure 3: test accuracy vs. communication rounds. Panels: (a) CIFAR-10, 10 devices; (b) CIFAR-10, 50 devices. Legend: FedAvg and LeanFed at $\lambda=80\%$, $50\%$, $20\%$, and $10\%$.]
Figure 3: Test accuracy of FedAvg (solid lines) and LeanFed (dashed lines) for fixed $\gamma=0.5$ on CIFAR-10, across different participation rates (different colors).

IV-C Limitations

While LeanFed demonstrates significant advantages in scenarios where energy-constrained devices must participate in federated learning with limited communication opportunities, we recognize that it has limitations. Specifically, LeanFed is most effective when the number of communication rounds required for convergence is relatively high and devices have insufficient battery to participate in every round. These benefits may become limited, or almost negligible, in scenarios where devices have large battery capacities or when the pre-defined number of communication rounds is significantly larger than the number of rounds actually required to achieve convergence. We also highlight that LeanFed relies on accurate battery-level estimates, which may not always be available in real-world conditions, potentially leading to suboptimal scheduling decisions. We leave such evaluations to future work.

V Conclusion

In this paper, we have presented a simple and efficient approach to federated learning in energy-constrained environments, where device battery limitations can hinder long-term participation and compromise model performance. By adaptively adjusting the amount of data each device uses during local training, our proposed LeanFed method ensures extended device availability across communication rounds without sacrificing model accuracy. Our experimental results demonstrated that LeanFed consistently outperforms traditional FedAvg, especially in scenarios with high data heterogeneity and limited battery life. Future work can extend this approach with dynamic energy adjustments, incorporate real-time device constraints, and explore LeanFed's application across diverse federated learning scenarios and real-world deployments.

References

  • [1] M. Avgerinou, P. Bertoldi, and L. Castellazzi, “Trends in data centre energy consumption under the european code of conduct for data centre energy efficiency,” Energies, vol. 10, no. 10, p. 1470, 2017.
  • [2] J. Baliga, R. W. Ayre, K. Hinton, and R. S. Tucker, “Green cloud computing: Balancing energy in processing, storage, and transport,” Proceedings of the IEEE, vol. 99, no. 1, pp. 149–167, 2010.
  • [3] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon Emissions and Large Neural Network Training,” arXiv preprint arXiv:2104.10350, 2021.
  • [4] E. Strubell, A. Ganesh, and A. McCallum, "Energy and Policy Considerations for Modern Deep Learning Research," in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 09, 2020, pp. 13693–13696.
  • [5] E. Ahvar, A.-C. Orgerie, and A. Lebre, “Estimating Energy Consumption of Cloud, Fog, and Edge Computing Infrastructures,” IEEE Transactions on Sustainable Computing, vol. 7, no. 2, pp. 277–288, 2022.
  • [6] E. Guerra, F. Wilhelmi, M. Miozzo, and D. Paolo, “The Cost of Training Machine Learning Models over Distributed Data Sources,” IEEE Open Journal of the Communications Society, 2023.
  • [7] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3579–3605, 2021.
  • [8] T. Wang, B. Sun, L. Wang, X. Zheng, and W. Jia, “Eidls: An edge-intelligence-based distributed learning system over internet of things,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 53, no. 7, pp. 3966–3978, 2023.
  • [9] Y. Yuan, S. Chen, D. Yu, Z. Zhao, Y. Zou, L. Cui, and X. Cheng, “Distributed Learning for Large-scale Models at Edge with Privacy Protection,” IEEE Transactions on Computers, 2024.
  • [10] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” Acm computing surveys (csur), vol. 53, no. 2, pp. 1–33, 2020.
  • [11] A. Imteaj, U. Thakker, S. Wang, J. Li, and M. H. Amini, “A Survey on Federated Learning for Resource-Constrained IoT Devices,” IEEE Internet of Things Journal, vol. 9, no. 1, pp. 1–24, 2022.
  • [12] S. Abdulrahman, H. Tout, H. Ould-Slimane, A. Mourad, C. Talhi, and M. Guizani, “A Survey on Federated Learning: The Journey From Centralized to Distributed On-Site Learning and Beyond,” IEEE Internet of Things Journal, vol. 8, no. 7, pp. 5476–5497, 2021.
  • [13] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the Convergence of FedAvg on Non-IID Data,” arXiv preprint arXiv:1907.02189, 2019.
  • [14] Z. Lu, H. Pan, Y. Dai, X. Si, and Y. Zhang, “Federated learning with non-iid data: A survey,” IEEE Internet of Things Journal, 2024.
  • [15] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated Learning: Challenges, Methods, and Future Directions,” IEEE signal processing magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [16] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,” Advances in neural information processing systems, vol. 30, 2017.
  • [18] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-iid data with reinforcement learning,” in IEEE INFOCOM 2020-IEEE conference on computer communications. IEEE, 2020, pp. 1698–1707.
  • [19] W. Zhang, X. Wang, P. Zhou, W. Wu, and X. Zhang, “Client selection for federated learning with non-iid data in mobile edge computing,” IEEE Access, vol. 9, pp. 24 462–24 474, 2021.
  • [20] Y. J. Cho, J. Wang, and G. Joshi, “Client selection in federated learning: Convergence analysis and power-of-choice selection strategies,” arXiv preprint arXiv:2010.01243, 2020.
  • [21] A. Arouj and A. M. Abdelmoniem, “Towards Energy-Aware Federated Learning on Battery-Powered Clients,” in Proceedings of the 1st ACM Workshop on Data Privacy and Federated Learning Technologies for Mobile Edge Network, 2022, pp. 7–12.
  • [22] R. Albelaihi, L. Yu, W. D. Craft, X. Sun, C. Wang, and R. Gazda, “Green Federated Learning via Energy-Aware Client Selection,” in GLOBECOM 2022-2022 IEEE Global Communications Conference. IEEE, 2022, pp. 13–18.
  • [23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
  • [24] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and trends® in machine learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [25] Q. Li, Y. Diao, Q. Chen, and B. He, “Federated Learning on Non-IID Data Silos: An Experimental Study,” in IEEE International Conference on Data Engineering, 2022.
  • [26] T.-M. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” arXiv preprint arXiv:1909.06335, 2019.
  • [27] H. Wang, X. Liu, J. Niu, W. Guo, and S. Tang, “Why go full? elevating federated learning through partial network updates,” arXiv preprint arXiv:2410.11559, 2024.
  • [28] K. Sreenivasan, K. Lee, J.-G. Lee, A. Lee, J. Cho, J.-y. Sohn, D. Papailiopoulos, and K. Lee, “Mini-batch Optimization of Contrastive Loss,” in ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.
  • [29] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.