
Flexible Parallel Learning in Edge Scenarios:
Communication, Computational and Energy Cost

Francesco Malandrino CNR-IEIIT and CNIT
Torino, Italy
   Carla Fabiana Chiasserini Politecnico di Torino, CNR-IEIIT and CNIT
Torino, Italy
Abstract

Traditionally, distributed machine learning takes the guise of (i) different nodes training the same model (as in federated learning), or (ii) one model being split among multiple nodes (as in distributed stochastic gradient descent). In this work, we highlight how fog- and IoT-based scenarios often require combining both approaches, and we present a framework for flexible parallel learning (FPL), achieving both data and model parallelism. Further, we investigate how different ways of distributing and parallelizing learning tasks across the participating nodes result in different computation, communication, and energy costs. Our experiments, carried out using state-of-the-art deep-network architectures and large-scale datasets, confirm that FPL allows for an excellent trade-off among computational (hence energy) cost, communication overhead, and learning performance.

Index Terms:
Edge computing; distributed machine learning

I Introduction

The emerging pervasiveness of machine learning (ML) and the fact that data generated by people, machines and sensors is expected to soon amount to 850 ZB [1] have led to an increasing adoption of distributed ML, involving multiple network nodes. Reasons to adopt this new paradigm include the ability to leverage more computational and energy resources, and the possibility of exploiting local data without disclosing it [2, 3] or transporting it to far-away data centers. This is particularly important for the training of ML models, especially in the case of the popular deep neural networks (DNNs) [4]. DNNs are indeed composed of many layers, require large amounts of data to set their many parameters, and take a high toll in terms of computing resources and energy consumption. There are currently two main approaches to distributed supervised learning: federated learning (FL) [2, 3, 5] and distributed stochastic gradient descent (D-SGD) [6], both depicted in Fig. 1. Under both FL and D-SGD, the total training time includes both (i) local computation, and (ii) network delay.

Under FL, all nodes share the same DNN architecture and each node trains its DNN with local data. After one or more local training epochs, local parameters are sent to a centralized learning server, which is in charge of combining them (by averaging them, or through more complex strategies [7]), and sending the results back to the learning nodes. The main appeal of FL is its ability to exploit local data for learning, without sharing it, which is especially important for private and/or potentially sensitive data. Further, FL is suitable for scenarios where learning nodes, such as mobile user devices, can appear or disappear in an unpredictable manner, including network resource scheduling [8] and the management of drone-powered MEC systems [9]. As multiple learning nodes train different instances of the same model on different sets of data, FL is said to implement data parallelism [10].

By contrast, the D-SGD paradigm allows for splitting a given DNN architecture among learning nodes, enabling each node to run only a part of the DNN. Nodes communicate with each other during each learning iteration (both forward- and back-passes), exchanging information on the values and gradients of model parameters. Compared to FL, D-SGD typically requires a tighter coordination among learning nodes and cannot handle their addition/removal, but the amount of data to transmit is lower; furthermore, unlike most FL variants, D-SGD does not require a learning server, and data can be exchanged directly between learning nodes, in a decentralized fashion. Due to its support for DNN splitting, D-SGD is a popular choice when the learning task involves nodes with limited capabilities, e.g., smart sensors [11]. D-SGD is said to implement model parallelism [10], as different nodes run different parts of the same model.

A further, recently emerging alternative is represented by the split learning (SL) paradigm [12]. SL envisions splitting the DNN layers in two parts: a local part, run by each node using its own local data, and a global part run at edge- or cloud-based servers leveraging the intermediate outputs of all learning nodes. It can be considered akin to D-SGD, in that the model is shared across multiple nodes, each of which only implements a part of it. Compared to D-SGD, SL offers more flexibility, e.g., by allowing data coming from different sources to be processed through duplicate, parallel subsets of the DNN architecture.

Figure 1: Two of the main existing approaches to distributed supervised learning: FL (left) and D-SGD (right).

In this work, we envision a new distributed supervised learning paradigm, called flexible parallel learning (FPL), that allows for an increased flexibility in leveraging the resources available at Internet-of-Things (IoT) and mobile user devices. FPL achieves this goal by combining both data and model parallelism, thus inherently bringing the benefits coming from both approaches. Like D-SGD and SL, it enables splitting one model across multiple nodes, thereby associating different learning tasks with different nodes, and, like FL, it can leverage an arbitrary number of nodes performing the same learning task on their own local data: knowledge of such a number is not needed at the time of designing the DNN architecture, so long as it remains constant for the duration of the training. At the same time, FPL supports partial replication of the model across multiple devices working on different data, thereby achieving both model and data parallelism. Such higher flexibility is obtained by including an additional layer in the DNN topology to accommodate data coming from different sources. Importantly, such a layer can also be effectively used to properly weigh input data depending on their quality. To assess the impact of this additional layer, we compare the FPL communication, computation, and energy costs against those of FL and D-SGD, using state-of-the-art DNN architectures and large-scale, de facto standard datasets.

The remainder of the paper is organized as follows. Sec. II introduces the FPL paradigm, along with the main use cases it targets. Then Sec. III and Sec. IV describe the experiments we perform and the insights they provide. Finally, Sec. V concludes the paper.

Figure 2: Top: example scenario, with two cameras observing the same object, capturing different views thereof. Bottom: example DNN architecture, taken from [13], for image classification.

II The Flexible Parallel Learning Paradigm

Learning problem formulation. A distributed learning problem includes a set $\mathcal{I}=\{1,\dots,I\}$ of learning nodes, all cooperating in order to optimize a model composed of layers $\mathcal{L}=\{l_{1},\dots,l_{L}\}$. Node $i$ owns a local dataset $\mathbf{X}_{i}$ and local parameters $\mathbf{w}_{i}^{m}$, with $m\in\mathcal{L}_{i}$ and $\mathcal{L}_{i}\subseteq\mathcal{L}$ denoting the layers running at node $i$. The global model is given by the set of all parameters for all layers, with parameters of the same layer present at different nodes being averaged:

$\mathbf{w}_{l}=\frac{1}{|\{i\in\mathcal{I}\colon l\in\mathcal{L}_{i}\}|}\sum_{i\in\mathcal{I}\colon l\in\mathcal{L}_{i}}\mathbf{w}_{i}^{l}.$ (1)

From the weights and the input data, it is possible to compute the output of the model $\hat{\mathbf{y}}=g(\mathbf{X}_{1},\dots,\mathbf{X}_{I},\mathbf{w})$; as an example, in a classification problem, the values in $\hat{\mathbf{y}}$ represent the probabilities assigned to each class. The learning objective is to find the parameters $\mathbf{w}^{\star}$ that minimize a loss function $f(\hat{\mathbf{y}},\mathbf{y})$, where $\mathbf{y}$ is the ground truth. For classification tasks, the most commonly used loss function is the categorical cross-entropy

$-\sum_{c=1}^{C}y_{c}\log\hat{y}_{c},$ (2)

with $c=1\dots C$ denoting the existing data classes.
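As a minimal, illustrative sketch of Eq. (1) – not part of the original formulation, and using hypothetical layer names and shapes – the following PyTorch snippet averages a layer's parameters over the nodes that hold a replica of it:

```python
import torch

# Hypothetical example: three nodes, each holding replicas of some layers.
# local_params[i][l] is node i's parameter tensor w_i^l for layer l.
local_params = {
    1: {"C1": torch.randn(32, 1, 3, 3)},
    2: {"C1": torch.randn(32, 1, 3, 3), "F1": torch.randn(128, 512)},
    3: {"F1": torch.randn(128, 512)},
}

def global_layer_params(layer, local_params):
    """Eq. (1): average w_i^l over the nodes i whose layer set L_i contains l."""
    replicas = [p[layer] for p in local_params.values() if layer in p]
    return torch.stack(replicas).mean(dim=0)

w_C1 = global_layer_params("C1", local_params)  # averaged over nodes 1 and 2
w_F1 = global_layer_params("F1", local_params)  # averaged over nodes 2 and 3
```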

The FPL paradigm targets IoT- and fog-based scenarios such as the one exemplified in Fig. 2(top), where multiple sources of data and multiple processing-capable devices are available and can be leveraged to perform a common learning task, i.e., train a DNN such as the one exemplified in Fig. 2(bottom). The high-level goal of FPL is to use all available data sources (hence, achieving data parallelism), while distributing different parts of the DNN across the devices, without duplicating them unless needed (hence, achieving model parallelism). As an example, one may want to run the two convolutional layers, $C_{1}$ and $C_{2}$, of Fig. 2(bottom) separately at each of the two cameras, and then run only one instance of each of the fully-connected layers, $F_{1}$ and $F_{2}$, at an edge-based server.

More specifically, given a DNN architecture, the FPL paradigm operates as follows:

  1. it identifies the DNN layers that should be replicated at each learning node to leverage multiple data sources;

  2. it merges the output of the above DNN layers through an appropriate junction layer;

  3. it distributes the remaining DNN layers across the most suitable available nodes.

With reference to the example in Fig. 3, FPL creates two copies of $C_{1}$ and $C_{2}$, i.e., one per data source. The outputs of the two $C_{2}$ copies, $C_{2}^{(a)}$ and $C_{2}^{(b)}$, are then merged at the network edge through the junction layer $J$. The input and output size of the junction layer match those of the preceding and succeeding layers; in the example of Fig. 3, $J$'s input size will be the sum of $C_{2}^{(a)}$'s and $C_{2}^{(b)}$'s output sizes, while $J$'s output size will be equal to the input size of $F_{1}$.

We remark that a fundamental aspect of FPL is how it handles DNN merges. One option would indeed be averaging the parameters, as in FL; however, such an approach is often suboptimal when the data sources are not equivalent, e.g., they observe different aspects of the phenomenon [7, 14]. D-SGD and SL opt instead for statically adapting the DNN architecture to the number of available data sources; however, this (i) requires changing the size of $F_{1}$ – and potentially the whole DNN architecture – when moving to a scenario with a different number of data sources, and (ii) gives all data coming from all sources the same weight, regardless of their quality or importance. A better option is represented by the FPL paradigm, which combines the information coming from different data sources by including the junction layer in the DNN architecture, represented in purple in Fig. 3. The junction layer is fully-connected and serves two purposes. The first is to adapt the number of parameters to the size of the input and output layers; such a purpose is achieved by setting the input and output sizes of $J$: with reference to Fig. 3, $J$'s input size is the sum of the output sizes of $C_{2}^{(a)}$ and $C_{2}^{(b)}$, while its output size is the same as the input size of $F_{1}$. The second, and arguably most important, is to automatically learn how to combine the output of DNN layers running at different nodes. Indeed, the values of the parameters of layer $J$ describe how information coming from different sources shall be processed, especially when it is of different quality. Dataset quality itself is generally linked with whether a dataset over- or under-represents some data classes: datasets that do so are of poor quality, while i.i.d. datasets, adequately representing all types of data, are of high quality.
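To make the junction layer concrete, the following PyTorch sketch assembles a small FPL-style DNN for $28\times 28$ gray-scale inputs with two data sources; the layer sizes are our own illustrative choices and are not meant to reproduce the exact architecture of [13]:

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Replicated part (C1, C2): one instance per data source, e.g., per camera."""
    def __init__(self):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2))
        self.c2 = nn.Sequential(nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):                       # x: [batch, 1, 28, 28]
        return self.c2(self.c1(x)).flatten(1)   # -> [batch, 64*7*7]

class FPLNet(nn.Module):
    def __init__(self, num_sources=2, f1_in=2048, num_classes=62):
        super().__init__()
        self.branches = nn.ModuleList([ConvBranch() for _ in range(num_sources)])
        branch_out = 64 * 7 * 7
        # Junction layer J: input size = sum of the branch output sizes,
        # output size = input size of F1 (so F1 and F2 need not change with num_sources).
        self.junction = nn.Linear(num_sources * branch_out, f1_in)
        self.f1 = nn.Sequential(nn.Linear(f1_in, 512), nn.ReLU())
        self.f2 = nn.Linear(512, num_classes)

    def forward(self, views):                   # views: one tensor per data source
        z = torch.cat([b(v) for b, v in zip(self.branches, views)], dim=1)
        return self.f2(self.f1(self.junction(z)))
```

In a deployment, each ConvBranch instance would run at its own camera, while the junction layer and the fully-connected layers would run at the edge-based server; the parameters of the junction layer are trained like any others.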

Dealing with low-quality, non-i.i.d. datasets is the focus of several FL studies, including [5, 15, 16, 17], and the main strategy they use is assigning to learning nodes weights reflecting their data quality, thus, the contribution they can give to the learning process. FPL achieves the same objective through the junction layer: the values of the parameters therein – hence, the importance to assign to different data sources – are found as a part of the DNN training process. Indeed, the parameters of the junction layer are model parameters like all others, and are optimized through the same process – forward- and back-passes, gradient optimization.

Figure 3: Example decisions made under the FPL paradigm. The two convolutional layers of the DNN in Fig. 2(bottom) are duplicated, and an instance of each layer is run at each of the two cameras in Fig. 2(top). The outputs are then sent to the edge-based server, passed through the junction layer $J$, and then fed to the rest of the original DNN.

Replicating parts of a given DNN and merging them through a junction layer allows FPL to support both data and model parallelism, as well as different combinations thereof. If the scenario or the nature of data require so, FPL allows processing data coming from different sources in different ways, and distributing the necessary layers across various devices. At the same time, FPL supports scenarios where most of the processing is executed at the edge, and learning nodes only perform the operations necessary to reduce the quantity of data to transfer (e.g., running a convolutional layer).

Difference w.r.t. FL, D-SGD, and SL. FPL shares with the three distributed learning paradigms discussed earlier, namely, FL, D-SGD, and SL, the high-level goal of allowing a distributed set of nodes to cooperatively perform a learning task. There are, however, fundamental differences in the computations performed by nodes, and the data they exchange:

  • under FL, all nodes run the same model, and exchange, after one or more local epochs, the weights (parameters) of the whole model;

  • under D-SGD and SL, each node runs a part, i.e., some layers, of the model, and nodes exchange gradient information during the forward- and backward-pass of each epoch.

Under FPL, nodes run a part of the model, similarly to D-SGD and SL, and also communicate during each epoch. FPL is, however, superior to all alternatives due to the flexibility afforded by the junction layer, during both the setup and the training of the DNN. Thanks to the junction layer, the central controller is able to adapt the DNN architecture to the number of data sources at setup time. Even more importantly, the junction layer can learn, during the DNN training, how to best combine the information coming from different data sources: this includes how to deal with different types of information, e.g., pictures and sensor readings, but also giving a lower weight (or even a negative one) to lower-quality information.

FPL scalability. The size of the junction layer added by FPL grows with the number of DNN branches to combine, which may pose a scalability problem. However, scalability is ensured by the fact that FPL allows flexibility as to where in the DNN the junction layer is placed. Indeed, the quantity of data exchanged between DNN layers decreases as we move from the input toward the output of the DNN; hence, moving the junction layer deeper into the DNN helps reduce the number of its parameters.
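As a back-of-the-envelope illustration, using the hypothetical layer sizes of the sketch above with two data sources (weights only, biases ignored), placing the junction layer deeper in the DNN shrinks it by over an order of magnitude:

```python
# Hypothetical sizes from the FPLNet sketch above; weight counts only, biases ignored.
branch_out_before_F1 = 64 * 7 * 7      # per-source output size of C2
branch_out_before_F2 = 512             # per-source output size of F1 (if F1 is replicated too)

weights_J_before_F1 = 2 * branch_out_before_F1 * 2048   # ~12.8M weights
weights_J_before_F2 = 2 * branch_out_before_F2 * 512    # ~0.5M weights
```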

FPL training. A DNN built according to the FPL paradigm, i.e., including a junction layer, is nonetheless a DNN shared across multiple devices, as in the D-SGD or SL paradigms. It follows that such a DNN is trained with the same methodology and tools as in D-SGD and SL, i.e., with forward- and backward-passes within each epoch. Nodes exchange gradient information during each pass of each epoch, and parameters are optimized in a distributed fashion through standard optimizers such as Adam. Indeed, the FPL paradigm seeks to innovate how DNNs are built and distributed across devices, while leveraging existing training techniques.
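A minimal sketch of one such training step, written for the FPLNet model sketched earlier and run here as a single process (in an actual deployment, the tensors crossing the branch/junction boundary – and their gradients – are what the nodes exchange over the network):

```python
import torch
import torch.nn.functional as F

model = FPLNet(num_sources=2)                 # hypothetical model from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(views, labels):
    """One forward/backward pass followed by an Adam update."""
    opt.zero_grad()
    logits = model(views)                     # forward pass across branches and junction
    loss = F.cross_entropy(logits, labels)    # categorical cross-entropy, Eq. (2)
    loss.backward()                           # backward pass (gradients flow through J)
    opt.step()
    return loss.item()

# Example: two 28x28 gray-scale views of the same batch of eight images
views = [torch.randn(8, 1, 28, 28), torch.randn(8, 1, 28, 28)]
labels = torch.randint(0, 62, (8,))
training_step(views, labels)
```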

Building DNN architectures with FPL. The flexibility of the FPL paradigm extends to its support for multiple decision-making strategies and algorithms: the paradigm itself does not mandate or require any specific strategy to choose which DNN layers to replicate and how to place them across the participating nodes. Indeed, any of the existing approaches in the literature [18, 19] can be accommodated within the FPL paradigm. However, the addition of a junction layer implies additional parameters to train and might lead to higher latency or processing (hence energy) costs. Assessing and quantifying such costs – if any – is indeed one of the main goals of our experiments.

Figure 4: An example image from EMNIST: original (top left), blurred (top center), erased (top right), horizontally flipped (bottom left), vertically flipped (bottom center), cropped (bottom right).

III Experiment design

Channel and transmission model. Similarly to [18], we assume that all nodes are equipped with cellular antennas and are covered by the same eNB, with the edge-based server being co-located with the eNB itself. Nodes are randomly distributed in a 500 m-radius circular area around the eNB. The data rate achieved for transmissions from node $i$ to node $j$ is given by [18]:

$rB\log_{2}\mathbb{E}_{h_{i}}\left(1+\frac{P_{i}h_{i}}{I+BN_{0}}\right),$ (3)

where $r$ is the number of resource blocks (RBs) assigned to the communication, $B$ is the bandwidth of each RB, $P_{i}$ is the transmission power (30 dBm for the eNB, 10 dBm for the UEs), $I$ is the interference power, and $N_{0}=-174$ dBm/Hz is the noise power spectral density. $h_{i}=o_{i}d_{ij}^{-2}$ is the channel gain, where $d_{ij}$ represents the distance between the terminals and $o_{i}$ the Rayleigh fading parameter. Consistently with LTE, we consider a 20 MHz bandwidth, divided into 100 RBs, assigned according to proportional-fair scheduling.
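For concreteness, the rate in Eq. (3) can be evaluated numerically by approximating the expectation over the fading with Monte Carlo sampling; the sketch below follows the parameter values given in the text (the 180 kHz RB bandwidth is the standard LTE value, and modeling the fading power gain as exponentially distributed is our own assumption):

```python
import numpy as np

def data_rate(r, d_ij, P_dBm, B=180e3, N0_dBm_Hz=-174.0, interference_W=0.0, samples=100_000):
    """Eq. (3): r * B * log2( E_h[ 1 + P*h / (I + B*N0) ] ), with h = o * d^-2."""
    P = 10 ** ((P_dBm - 30) / 10)                 # transmit power [W]
    N0 = 10 ** ((N0_dBm_Hz - 30) / 10)            # noise PSD [W/Hz]
    o = np.random.exponential(1.0, samples)       # Rayleigh fading (power gain samples)
    h = o * d_ij ** -2                            # channel gain
    sinr = P * h / (interference_W + B * N0)
    return r * B * np.log2(np.mean(1 + sinr))     # expectation inside the log, as in Eq. (3)

# Example: a UE (10 dBm) at 250 m from the eNB, using 10 of the 100 RBs
print(data_rate(r=10, d_ij=250.0, P_dBm=10.0) / 1e6, "Mb/s")
```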

We compare the FPL paradigm performance and costs against state-of-the-art alternatives from the viewpoint of (a) learning accuracy, (b) model size, (c) learning time, (d) network overhead, and (e) energy consumption. We assess the performance of FPL and its alternatives over a classification task, based on the EMNIST dataset [20].

EMNIST has been introduced to provide a more challenging variant of the famous MNIST handwritten number dataset, and includes a total of 814,255 images belonging to 62 classes (26 uppercase letters, 26 lowercase letters, and ten digits). All images are in gray-scale and have a size of $28\times 28$ pixels, hence, they can be represented as $28\times 28\times 1$ three-dimensional tensors. The main reasons for using the EMNIST dataset are its relative simplicity and wide availability. Due to the former, we can compare the performance of FPL to its alternatives using a streamlined DNN architecture, as detailed next. Thanks to the latter, our results can be reproduced, generalized and compared against other existing works.

As shown in Fig. 2(top), our goal is to assess the effect of FPL and its alternatives when dealing with different, partial views of the same phenomenon. To emulate this, we apply to the EMNIST images one of the five transformations exemplified in Fig. 4: Gaussian blur; random erasure; horizontal or vertical flip; random crop. Unless specified otherwise, to perform the classification task we use the DNN architecture proposed in [13] and represented in Fig. 2(bottom), including two convolutional layers (each followed by a max-pooling layer, not represented in the figure) and two fully-connected layers. The DNN, as well as the image transformations, is implemented in Python using the PyTorch framework. We selected PyTorch over the more popular TensorFlow framework due to the greater control the former affords over the manipulation of data, which simplifies implementing the newly proposed FPL paradigm.
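The five per-source transformations can be reproduced with standard torchvision operations; the sketch below is our own illustration, and the kernel size, erasing scale and crop size are arbitrary choices rather than the exact values used in the experiments:

```python
from torchvision import transforms

to_tensor = transforms.ToTensor()   # 28x28 gray-scale image -> 1x28x28 tensor

# One transformation per data source, emulating five partial views of the same data.
source_transforms = {
    "blur":  transforms.GaussianBlur(kernel_size=5, sigma=2.0),
    "erase": transforms.RandomErasing(p=1.0, scale=(0.1, 0.3)),
    "hflip": transforms.RandomHorizontalFlip(p=1.0),
    "vflip": transforms.RandomVerticalFlip(p=1.0),
    "crop":  transforms.Compose([transforms.RandomCrop(22),
                                 transforms.Resize(28)]),
}

def view_for_source(pil_image, source):
    """Return the partial view of an EMNIST image as seen by a given data source."""
    return source_transforms[source](to_tensor(pil_image))
```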

We compare FPL against the following alternatives:

Split Learning: we implement the “vertical partitioned data” variant of SL, introduced in [12, Sec. 2]: the convolutional layers are duplicated, with an instance thereof running at each data source. The input size of layer $F_{1}$, which is the first non-duplicated layer, is adjusted so that the concatenated outputs of the duplicated layers can be fit therein.

Transfer images: one single model is trained using all of the five image sources. This entails transferring the images to the edge, thus incurring additional network overhead and delay;

Generalized FL (gFL): models for each of the five image sources are separate, but some or all of their layers are averaged at the end of each epoch. This reproduces and generalizes FL. Since the input datasets are not i.i.d., the FedProx strategy [17] is used in lieu of the more common FedAvg to combine the updates coming from different nodes (a sketch of the FedProx proximal term follows this list). One averaging round is executed after each local computation round.
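For reference, the sketch below shows the proximal term that distinguishes FedProx from FedAvg at each learning node; the coefficient value below is only illustrative:

```python
import torch

def fedprox_loss(local_loss, local_model, global_params, mu=0.01):
    """Local FedProx objective: task loss plus (mu/2)*||w - w_global||^2,
    which keeps non-i.i.d. local updates close to the current global model."""
    prox = sum((w - wg.detach()).pow(2).sum()
               for w, wg in zip(local_model.parameters(), global_params))
    return local_loss + 0.5 * mu * prox
```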

We run our experiments using a server with a 40-core Intel Xeon E5-2690 v2 3.00 GHz CPU, 64 GB of memory, and equipped with a Tesla K80 GPU. We have opted not to use GPUs for training, even though PyTorch is able to exploit them and doing so would have resulted in faster learning times. The main reason is that GPU usage is comparatively tricky to track, while – via the psutil library – it is possible to accurately measure the CPU consumption (hence, learning time and energy cost) associated with both FPL and the benchmark solutions.
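A hedged sketch of how CPU usage can be tracked with psutil and turned into an energy estimate; the per-core power figure is an assumption that would be calibrated on the actual hardware, following the methodology of [21]:

```python
import psutil

proc = psutil.Process()                      # the training process itself

def measure(train_fn, watts_per_core=10.0):  # assumed average per-core power draw
    """Run a training function and return (CPU seconds, estimated energy in kWh)."""
    before = proc.cpu_times()
    train_fn()
    after = proc.cpu_times()
    cpu_seconds = (after.user - before.user) + (after.system - before.system)
    energy_kwh = cpu_seconds * watts_per_core / 3.6e6   # W*s -> kWh
    return cpu_seconds, energy_kwh
```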

Figure 5: Validation loss of FPL and its alternatives. Markers indicate the iteration at which the loss reaches its minimum value, hence, training can be considered complete.
Figure 6: Classification task over the EMNIST dataset: learning accuracy (a), number of model parameters (b), training time (c), and network overhead (d) under the FPL paradigm and its alternatives.

IV Performance evaluation and discussion

The first aspect we are interested in is the convergence behavior of FPL and its alternatives, depicted in Fig. 5, where $J\rightarrow F_{1}$ denotes that the junction layer in FPL is placed before layer $F_{1}$ of the original DNN, i.e., between the second convolutional layer $C_{2}$ and the first fully-connected layer $F_{1}$. For gFL, we list the layers that are averaged à la FL, e.g., $F_{1}/F_{2}$ indicates that the two fully-connected layers are averaged (while the convolutional ones are kept separate).

The plot shows the value of the loss function (2), computed over the validation set at each iteration; convergence is achieved when the loss function starts increasing, which signals overfitting. It is easy to observe how FPL has the fastest convergence, i.e., it takes fewer epochs to train than its alternatives. This translates into shorter training times, even if more operations (e.g., training the junction layer) must be performed at each epoch.

We next look at the classification accuracy, summarized in Fig. 6(a). Similarly to Fig. 5, for FPL we indicate the layer of the original DNN before which the junction layer is inserted, and for gFL we list the averaged layers.

As one might expect, transferring the images (bar with blue, horizontal line-pattern) results in the best performance; intuitively, this is because more data is used for training. The FPL paradigm (two rightmost bars with green patterns) yields the next-best performance; specifically, inserting the junction layer JJ just before the final fully-connected layer F2F_{2} is associated with the highest accuracy. gFL (bars with pink/purple slanting-line patterns) is associated with comparatively poor accuracy. Indeed, FL-like approaches are best suited when different learning nodes have access to similar images, e.g., the same view captured from different cameras. If, however, the images are qualitatively different, e.g., flipped as in Fig. 4, forcing the same model to process all images may be a suboptimal approach. Using advanced strategies like FedProx [17] to combine the updates coming from different nodes does improve the performance compared to FedAvg, due to FedProx’s ability to deal with non-i.i.d. data, but does not close the performance gap. As for SL, it outperforms gFL, but yields a lower accuracy than FPL. Furthermore, it is important to stress that the configuration of the DNN in SL is tied to the number of data sources (five in our case). While FPL and gFL could accommodate any number of data sources, doing so in SL would require restructuring the whole DNN, which does not suit dynamic scenarios such as IoT- and fog-based ones.

Fig. 6(b) shows the size of the models under the different paradigms, quantified through the number of their parameters. As one might expect, FPL is associated with a larger model size, due to the introduction of the junction layer $J$. However, it is important to remark that the increase in model size is moderate for both placements of the junction layer.

It is more critical to assess whether the extra parameters introduced by FPL also result in longer learning times. As highlighted in Fig. 6(c), FPL’s learning times are comparable with those of its alternatives, actually shorter than those of gFL: indeed, the higher number of parameters to train is compensated by the fact that learning itself is more effective, i.e., as per Fig. 5, requires fewer epochs. From Fig. 6(c) it is also possible to observe how, unless we transfer all images to all learning nodes, processing represents the main contribution to the global training time.

Tab. I summarizes the energy cost and carbon footprint of each approach, computed according to [21] for private servers running in northern Italy and supplied by the national energy provider Enel, which has a carbon efficiency of 0.243 kg/kWh according to electricitymap.org. Consistently with Fig. 6(c), it is clear that the FPL paradigm yields lower energy consumption and pollution.

Looking at Fig. 6(a)–Fig. 6(c), one may observe that transferring the images results in the highest accuracy, the smallest model size and a short training time, but Fig. 6(d), depicting the network overhead, reminds us why such a solution is not viable in virtually all scenarios. More importantly, the figure also highlights how FPL is associated with a very small network overhead – lower than that of the gFL approaches – regardless of where the junction layer $J$ is placed. This is due to the fact that, as highlighted in Fig. 1, FL (unlike FPL) requires transmitting all model parameters. Also notice how the scale of Fig. 6(d) is logarithmic, and the advantage of FPL over FL-based approaches is almost one order of magnitude. Comparing Fig. 6(b) to Fig. 6(c) and Fig. 6(d), it is also interesting to remark how the introduction of the junction layer $J$ under the FPL paradigm mildly increases the model size, but does not result in longer training times or higher network overhead with respect to gFL. Intuitively, there are more parameters, but they are easier to train and do not travel from one learning node to another.

TABLE I: Energy consumption and carbon footprint associated with FPL and its alternatives. Figures are computed according to [21]
Strategy Energy [kWh] Carbon [g CO2 eq.]
SL 0.11 38.74
Transfer images 0.13 45.07
gFL: $F_{1}$/$F_{2}$ 0.23 77.97
gFL: $C_{2}$/$F_{1}$/$F_{2}$ 0.33 112.68
FPL: $J\to F_{2}$ 0.21 71.45
FPL: $J\to F_{1}$ 0.25 84.98

In summary, the FPL paradigm proves to be an attractive alternative to present-day approaches like FL or D-SGD. It provides better learning accuracy, allows for much higher flexibility and, unlike D-SGD, is a good match for dynamic scenarios – at the price of a modest increase in model size and with comparable learning times and energy consumption.

V Conclusion

To meet the needs of IoT- and fog-based environments, we have introduced a new distributed supervised learning paradigm called flexible parallel learning (FPL). FPL exhibits the advantages of both federated learning and distributed stochastic gradient descent, supports the dynamic addition and removal of learning nodes, and achieves both data and model parallelism. One of the most distinctive features of FPL is the introduction of a junction layer, through which data coming from different sources can be properly combined. We have evaluated the performance of FPL and its present-day alternatives over an image classification task, using the EMNIST dataset. The results highlight how FPL outperforms its alternatives in terms of learning accuracy, with comparable or lower network overhead and energy consumption.

Future work will focus on extending the performance evaluation of the FPL paradigm considering the ability of the junction layer to differently weigh the input data, as well as additional datasets, e.g., CARS or ImageNet, and more complex DNN architectures, e.g., ResNet and VGG. Although the FPL paradigm itself works unmodified in these cases, the different sizes of the DNNs and the images they process may impact the relative performance of FPL and its alternative strategies.

References

  • [1] Cisco, “Cisco Annual Internet Report (2018–2023) White Paper,” https://www.cisco.com/c/en/us/solutions/executive-perspectives/annual-internet-report/index.html, accessed Feb. 2021.
  • [2] J. Konečný, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015.
  • [3] J. Kang et al., “Reliable federated learning for mobile networks,” IEEE Wireless Comm., 2020.
  • [4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker et al., “Large scale distributed deep networks,” in NIPS, 2012.
  • [5] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE JSAC, 2019.
  • [6] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in ACM SIGKDD, 2011.
  • [7] F. Malandrino and C. F. Chiasserini, “Federated learning at the network edge: When not all nodes are created equal,” IEEE Comm. Mag., 2021.
  • [8] F. Jiang, K. Wang, L. Dong, C. Pan, W. Xu, and K. Yang, “Deep-learning-based joint resource scheduling algorithms for hybrid MEC networks,” IEEE Internet of Things Journal, 2019.
  • [9] ——, “AI driven heterogeneous MEC system with UAV assistance for dynamic environment: Challenges and solutions,” IEEE Network, 2020.
  • [10] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint arXiv:1404.5997, 2014.
  • [11] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, 2019.
  • [12] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” arXiv preprint arXiv:1812.00564, 2018.
  • [13] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar, “LEAF: A benchmark for federated settings,” arXiv preprint arXiv:1812.01097, 2018.
  • [14] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data,” arXiv preprint arXiv:1811.11479, 2018.
  • [15] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-iid data with reinforcement learning,” in IEEE INFOCOM, 2020.
  • [16] T. Nishio and R. Yonetani, “Client Selection for Federated Learning with Heterogeneous Resources in Mobile Edge,” in IEEE ICC 2019, 2019.
  • [17] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” arXiv preprint arXiv:1812.06127, 2018.
  • [18] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. on Wireless Comm., 2020.
  • [19] Y. Tu, Y. Ruan, S. Wagle, C. G. Brinton, and C. Joe-Wong, “Network-Aware Optimization of Distributed Learning for Fog Computing,” in IEEE INFOCOM, 2020.
  • [20] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: Extending MNIST to handwritten letters,” in IEEE IJCNN, 2017.
  • [21] A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres, “Quantifying the carbon emissions of machine learning,” arXiv preprint arXiv:1910.09700, 2019.