Network-Aware Optimization of
Distributed Learning for Fog Computing

Su Wang, Yichen Ruan, Yuwei Tu, Satyavrat Wagle, Christopher G. Brinton, and Carlee Joe-Wong

This work was presented in part at the 2020 IEEE Conference on Computer Communications (INFOCOM) [1]. S. Wang and C. Brinton are with the School of Electrical and Computer Engineering at Purdue University. email: {wang2506,cgb}@purdue.edu. Y. Ruan, S. Wagle, and C. Joe-Wong are with the Department of Electrical and Computer Engineering at Carnegie Mellon University. email: {yichenr, srwagle, cjoewong}@andrew.cmu.edu

Abstract

Fog computing promises to enable machine learning tasks to scale to large amounts of data by distributing processing across connected devices. Two key challenges to achieving this goal are (i) heterogeneity in devices’ compute resources and (ii) topology constraints on which devices communicate with each other. We address these challenges by developing a novel network-aware distributed learning methodology where devices optimally share local data processing and send their learnt parameters to a server for periodic aggregation. Unlike traditional federated learning, our method enables devices to offload their data processing tasks to each other, with these decisions optimized to trade off costs associated with data processing, offloading, and discarding. We analytically characterize the optimal data transfer solution under different assumptions on the fog network scenario, showing for example that the value of offloading is approximately linear in the range of computing costs in the network when the cost of discarding is modeled as decreasing linearly in the amount of data processed at each node. Our experiments on real-world data traces from our testbed confirm that our algorithms improve network resource utilization substantially without sacrificing the accuracy of the learned model, for varying distributions of data across devices. We also investigate the effect of network dynamics on model learning and resource costs.

Index Terms:
federated learning, offloading, fog computing

I Introduction

New technologies like autonomous cars and smart factories are coming to rely extensively on data-driven machine learning (ML) algorithms [2, 3, 4] to produce near real-time insights based on historical data. Training ML models at realistic scales, however, is challenging, given the enormous computing power required to process today’s data volumes. The collected data is also dispersed across networks of devices, while ML models are traditionally managed in a centralized manner [5].

Fortunately, the rise in data generation in networks has occurred alongside a rise in the computing power of networked devices. Thus, a possible solution for real-time training of and inferencing from data-driven ML algorithms lies in the emerging paradigm of fog computing, which aims to design systems, architectures, and algorithms that leverage device capabilities between the network edge and cloud [6]. Deployment of 5G wireless networks and the Internet of Things (IoT) is accelerating adoption of this computing paradigm by expanding the set of connected devices with compute capabilities and enabling direct device-to-device communications between them [7]. Though centralized ML algorithms are not optimized for such environments, distributed data analytics is expected to be a major driver of 5G adoption [8].

Initial efforts in decentralizing ML have focused on decomposing model parameter updates over several nodes, typically managed by some centralized server [9, 10]. Most of these methods implicitly assume idealized network topologies where node and link properties are homogeneous. Fog environments, by contrast, are characterized by devices’ computation and communication resource heterogeneity, e.g., due to power constraints or privacy considerations. For example, consider the two common fog topologies depicted in Figure 1. In the hierarchical scenario, weaker edge devices are connected to powerful edge servers. In the social network case, devices tend to have similar compute resources, but connectivity may vary significantly depending on levels of trust between users [11].

A central question that arises, then, in adapting ML methodologies to fog environments is: How should each fog device contribute to the ML training and inference? The examples above motivate techniques for network-aware distributed learning, which (i) account for the potentially heterogeneous computation and communication resources across devices, and (ii) leverage the network topology for device communications to optimize the distribution of data processing through the network. In this paper, we develop a novel network-aware distributed learning methodology that optimizes the distribution of processing across a network of fog devices.

Figure 1: Cartoon illustrations of two example topologies for fog computing that we consider: (a) hierarchical and (b) social network. In the hierarchical case, less powerful devices are connected to more powerful ones, while for the social network, connections are denser and devices tend to be similar.

I-A Machine Learning in Fog Environments

ML models are generally trained by iteratively updating estimates of parameter values, such as weights in a neural network, that best “fit” empirical data, through data processing at one or more devices. We face two major challenges in adapting such training to fog networking environments: (i) heterogeneity in devices’ compute resources and (ii) constraints on devices’ abilities to communicate with each other. We outline these characteristics, and the potential benefits enabled by our network-aware distributed learning methodology, in some key applications below:

Privacy-sensitive applications. Many ML applications learn models on sensitive user data, e.g., for health monitoring [4]. Due to privacy concerns, most of these applications have devices train their models on local data to avoid revealing data to untrustworthy nodes [11]. Offloading ML data processing to trusted devices, e.g., if one user owns multiple devices, can reduce training times and improve model accuracy.

Internet-connected vehicles can collaboratively learn about their environment [2], e.g., by combining their data with that of road sensors to infer current road or traffic conditions. Since sensors have fewer computing capabilities than vehicles, they will likely offload their data to vehicles or roadside units for processing. This offloading must adapt as vehicles move and their connectivity with (stationary) road sensors changes.

Augmented reality (AR) uses ML algorithms for, e.g., image recognition [3] to overlay digital content onto users’ views of an environment. A network of AR-enabled devices can train ML models in a distributed manner, but may exhibit significant heterogeneity: they can range from generic smartphones to AR-specific headsets, with different battery levels. As the AR users move, connectivity between devices will also change.

Industrial IoT. 5G networks will allow sensors that power control loops in factory production lines to communicate across the factory floor [2, 12], enabling distributed ML algorithms to use this data for, e.g., predicting production delays. Determining which controllers should process data at specific sensors is an open question: it depends on sensor-controller connectivities, which may vary with factory activity.

I-B Outline and Summary of Contributions

First, Section II differentiates our work from relevant literature. To the best of our knowledge, we are the first to optimize the distribution of ML data processing (i.e., training) tasks across fog nodes, leading to several contributions:

Formulating the task distribution problem (Section III). In deciding which devices should process which datapoints, our formulation accounts for resource limitations and model accuracy. While ideally more of the data would be processed at devices with more computing resources, sending data samples to such devices may overburden the network. Moreover, processing too many data samples can incur large processing costs relative to the gain in model accuracy. We derive new bounds (Theorem 1) on the model accuracy when data can be moved between devices, showing that the optimal task distribution problem can be formulated as a convex optimization.

Characterizing the optimal task distribution (Section IV). Solving the optimization in Sec. III requires specifying network characteristics that may be unknown in advance. We analyze the expected deviations from our assumptions in Sec. III to derive guidelines on estimating these characteristics (Theorem 2). We then consider two different models of discard cost, one linear and one convex in the number of datapoints processed, and use them to derive the optimal distribution of data processing through the network for typical fog topologies (Theorems 3 and 4). Further, we use the result from the linear case to estimate the reduction in processing costs due to data movement (Theorem 5). Overall, these results are the first to characterize the integration of offloading into federated learning over graph topologies.

Experimental validation (Section V). We train classification models on the MNIST dataset to validate our algorithms. We use a Raspberry Pi testbed to emulate network delays and available compute resources, under both i.i.d. and non-i.i.d. device data. Our proposed algorithm nearly halves the processing overhead yet achieves an accuracy comparable to centralized model training, which validates the benefits of network-aware distributed learning. The experiments also reveal how the network connectivity, size, and topology structure impact the trained model accuracy and network resource costs.

II Related Work

We contextualize our work within prior results on (i) federated and distributed machine learning, and (ii) methods for offloading ML tasks from mobile devices to edge servers.

II-A Distributed and Federated Learning

In classical distributed learning, “workers” each compute a parameter value from their local data. These results are aggregated at a central server, and updated parameter values are sent to the workers for the next round of local computations. Several works have considered how network topologies can affect ML training. Specifically, [13] optimizes decentralized training of linear models when workers exchange parameters with their neighbors instead of a central server, while [14] studied D2D (device-to-device) message passing that emulates global communications with a server, and [15] optimizes performance when communications can fail. In fog networks, devices may not have the resources to send updates to a server at every time period [5], so we focus on the newer federated learning methodology for distributed model training.

Devices in federated learning perform a series of local updates between aggregations, and send their resulting model updates to the server [16, 17, 10]. This framework preserves user privacy by keeping data at local devices [18] and reduces the amount of communication between devices and the central server. Since its proposition in [10], federated learning has generated significant research interest; see [19] for a comprehensive survey. Compared to traditional distributed learning, federated learning introduces two new challenges: firstly, data may not be identically and independently distributed as devices locally generate the samples they process; and secondly, in fog/edge networks, devices may have limited ability to perform and communicate local updates due to resource constraints. Many works have attempted to address the first challenge; for instance, [20] trains user-specific models within a multi-task learning framework, while [21] showed that sharing small subsets of user data can significantly increase the accuracy of a single model. Recent efforts have also considered optimizing the frequency of parameter aggregations according to a fixed budget of network and computing resources [5], or adopting a peer-to-peer decentralized learning framework [22].

Our work is more related to the second challenge of resource heterogeneity. Existing works have focused on minimizing the communication costs between the workers and the server. Specifically, [23] proposes to reduce uplink costs by restricting and compressing the parameter space prior to transmission. [24] proposes a method for thresholding updates for transmission, while the method in [25] only communicates the most important individual gradient results, which [26] extends to compress both downlink and uplink communication. To reduce the number of uplink transfers, [27, 28] aggregate subsets of the network. For wireless networks in particular, [29] proposes methods to minimize power consumption and training time.

Unlike these works, we focus on the network topology between nodes to optimize the tradeoffs among communication, computation, and model performance in federated learning. Specifically, our work leverages D2D communications for offloading data from resource-constrained to resource-rich nodes, as is present in new 5G and IoT network technologies [6]. In integrating D2D offloading with federated learning, we derive novel results on optimizing the distribution of data processing over a fog computing system to train ML models.

II-B Offloading

Fog computing introduces opportunities to pool network resources and maximize the use of idle computing/storage in completing resource-intensive activities [6]. Offloading mechanisms improve system performance when high-bandwidth network connections are available between devices. Offloading has significantly accelerated ML tasks such as linear regression training [30] and neural network inference [31]. Existing literature has also considered splitting different layers in deep neural networks between fog devices and edge/cloud servers. Specifically, [32] proposed two network-dependent schemes and a min-cut problem to accelerate the training phase, while [33] developed an architecture to intelligently locate models on local devices or the cloud based on network reliability.

Prior works on deep learning offloading in fog/edge computing generally focus on network factors, such as cost/latency, rather than model accuracy. For example, [34, 35] maximize throughput in wireless and sensor IoT networks. [36] studies D2D data offloading in federated learning, but focuses on communication protocols to facilitate over-the-air parameter aggregations and dataset duplication across nodes. Our methodology considers more general ML models, optimizes tradeoffs between network cost and model accuracy, and provides novel theoretical performance bounds.

III Model and Optimization Formulation

We first define our model for fog networks (Section III-A) and machine learning training (Section III-B), and then formulate the task distribution optimization problem (Section III-C).

III-A Fog Computing System Model

We consider a set $V$ of $n$ fog devices forming a network, an aggregation server $s$, and discrete time intervals $t=1,\ldots,T$ as the period for training an ML model. Each device, e.g., a sensor or smartphone, can both collect data and process it to contribute to the ML task. The server $s$ aggregates the results of each device’s local analysis, as will be explained in Section III-B. Both the length and number of time intervals may depend on the specific ML application. In each interval $t$, we suppose a subset of devices $V(t)$, indexed by $i$, is active (i.e., available to collect and/or process data). For simplicity of notation, we omit $i$’s dependence on $t$.

III-A1 Data collection and processing

We use $D_{i}(t)$ to denote the set of data collected by device $i\in V(t)$ for the ML task at time $t$; $d\in D_{i}(t)$ denotes each datapoint. Note that $D_{i}(t)=\emptyset$ if a device does not collect data at time $t$. $G_{i}(t)$, by contrast, denotes the set of datapoints processed by each device at time $t$ for the ML task. In conventional distributed learning frameworks, $D_{i}(t)=G_{i}(t)$, as all devices process the data they collect [5]; separating these variables is one of our main contributions. We suppose that each device $i$ can process up to $C_{i}(t)$ datapoints at each time $t$, incurring a cost of $c_{i}(t)$ for each point. For example, devices with low battery will have lower capacities $C_{i}(t)$ and higher costs $c_{i}(t)$.

III-A2 Fog network connectivity

The devices $V$ are connected to each other via a set $E$ of directed links $(i,j)$ between devices $i$ and $j$; $E(t)\subseteq E$ denotes the set of functioning links at time $t$. The overall system can then be described as a directed graph $(\{s,V\},E)$ with vertices $V$ representing the devices and edges $E$ the links between them. We suppose that $(\{s,V(t)\},E(t))$ is connected at each time $t$ and that links between devices are single-hop, i.e., devices do not use each other as relays except possibly to communicate with the server. The scenarios outlined in Section I-A each possess such an architecture: in smart factories, for example, a subset of the floor sensors connect to each controller. Each link $(i,j)\in E(t)$ is characterized by a capacity $C_{ij}(t)$, i.e., the maximum datapoints transferable in a time interval, and a per-unit “cost of connectivity” $c_{ij}(t)$. This cost may reflect network conditions or a desire for privacy, and will be high if sending from $i$ to $j$ is less desirable at $t$.

III-A3 Data structure

Each datapoint $d$ takes the form $(x_{d},y_{d})$, where $x_{d}$ is an attribute/feature vector and $y_{d}$ is an associated label.¹ We use $D_{V}=\cup_{i,t}D_{i}(t)$ to denote the full set of datapoints collected by all devices over all time. Following prior work [37, 38], we model data collection as each device $i$ sampling points uniformly at random from a (usually unknown) distribution $\mathcal{D}_{i}$. Temporal changes in $\mathcal{D}_{i}$ are assumed to be slow compared to the time horizon $T$. We use $\mathcal{D}=\cup_{i}\mathcal{D}_{i}$ to denote the global distribution induced by these $\mathcal{D}_{i}$. Our model implies that the relationship between $x_{d}$ and $y_{d}$ is temporally invariant, which is common in the applications discussed in Section I-A, e.g., image recognition from road cameras at fixed locations or AR users with random mobility patterns. We use such a dataset for evaluation in Section V.

¹ For unsupervised ML tasks, there are no labels $y_{d}$. However, we can still minimize a loss defined in terms of the $x_{d}$, similar to (1).

III-B Machine Learning Model

Our goal is to learn a parameterized model that outputs $y_{d}$ given the input feature vector $x_{d}$. We use the vector $w$ to denote the set of model parameters, whose values are chosen so as to minimize a loss function $L(w|\mathcal{D})$ that is defined for the specific ML model (e.g., squared error for linear regression, cross-entropy loss for multi-class classifiers [39]). Since the overall distributions $\mathcal{D}_{i}$ are unknown, instead of minimizing $L(w|\mathcal{D})$ we minimize the empirical loss function, as commonly done in ML model training (e.g., [5, 10]):

$$\underset{w}{\text{minimize}}\;\;L(w|D_{V})=\frac{\sum_{t=1}^{T}\sum_{i\in V(t)}\sum_{d\in G_{i}(t)}l(w,x_{d},y_{d})}{|D_{V}|}, \tag{1}$$

where $l(w,x_{d},y_{d})$ is the error for datapoint $d$, and $|D_{V}|$ is the number of datapoints. Note that the function $l$ may include regularization terms that aim to prevent model overfitting [10].

Fog computing allows (1) to be solved in a distributed manner: instead of computing the solution at the server $s$, we can leverage computations at any device $i$ with available resources. Below, we follow the commonly used federated averaging framework [5] in specifying these local computations and the subsequent global aggregation by the server in each iteration, illustrated by device $n$ in Figure 2. To avoid excessive re-optimization at each device, the local updating algorithm does not depend on $G_{i}(t)$. We adjust the server averaging to account for the amount of data each device processes.

III-B1 Local loss minimization

To solve (1) in a distributed manner, we first decompose it into a weighted sum of local loss functions

$$L_{i}(w_{i}|G_{i})=\frac{\sum_{t=1}^{T}\sum_{d\in G_{i}(t)}l(w,x_{d},y_{d})}{|G_{i}|}, \tag{2}$$

where $G_{i}\equiv\cup_{t\leq T}G_{i}(t)$ denotes the set of datapoints processed by device $i$ over all times. The global loss (1) is then $L(w|D_{V})=\sum_{i}L_{i}(w|G_{i})\left|G_{i}\right|/\left|D_{V}\right|$ if $\cup_{i}G_{i}=D_{V}$, i.e., all datapoints $d\in D_{V}$ are eventually processed.
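The decomposition can be checked numerically. The following is a minimal sketch, assuming a one-dimensional linear model with squared-error loss; the device names, the datapoints, and the helper functions are hypothetical and are not taken from the paper. It compares the weighted sum of local losses (2) with the global loss (1) when every collected datapoint is processed.

```python
def squared_error(w, x, y):
    """Per-datapoint loss l(w, x_d, y_d) for a 1-D linear model (illustrative choice)."""
    return (w * x - y) ** 2


def local_loss(w, points):
    """Eq. (2): average loss over the datapoints G_i processed by one device."""
    if not points:
        return 0.0  # assumption: a device with no processed data contributes nothing
    return sum(squared_error(w, x, y) for x, y in points) / len(points)


def global_loss(w, processed, num_collected):
    """Eq. (1): summed loss over all processed datapoints, divided by |D_V|."""
    total = sum(squared_error(w, x, y)
                for points in processed.values()
                for x, y in points)
    return total / num_collected


# Hypothetical data: every collected point is processed, so the union of G_i equals D_V.
processed = {
    "dev_a": [(1.0, 2.0), (2.0, 4.1)],
    "dev_b": [(3.0, 5.9), (4.0, 8.2), (5.0, 9.8)],
}
num_collected = sum(len(points) for points in processed.values())
w = 1.9

weighted_sum = sum(local_loss(w, points) * len(points) / num_collected
                   for points in processed.values())
direct = global_loss(w, processed, num_collected)
print(abs(weighted_sum - direct) < 1e-12)
```

If some datapoints were discarded, `num_collected` would exceed the number of processed points and the two quantities would no longer share the same normalization, which is why the identity is stated only for the case where all data is eventually processed.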

Loss functions such as (2) are typically minimized using gradient descent techniques [10]. Specifically, the devices update their local parameter estimates at $t$ according to

$$w_{i}(t)=w_{i}(t-1)-\eta(t)\nabla L_{i}(w_{i}(t-1)|G_{i}(t)), \tag{3}$$

where $\eta(t)>0$ is the step size or learning rate, and $\nabla L_{i}(w_{i}(t-1)|G_{i}(t))=\sum_{d\in G_{i}(t)}\nabla l(w_{i}(t-1),x_{d},y_{d})/|G_{i}(t)|$ is the gradient with respect to $w$ of the average loss of points in the current dataset $G_{i}(t)$ at the parameter value $w_{i}(t-1)$. We define the loss only on $G_{i}(t)$ since future data in $G_{i}$ has not yet been revealed; since we assume each node’s data is i.i.d. over time, $L_{i}(w_{i}(t-1)|G_{i}(t))$ approximates the local loss $L_{i}(w_{i}|G_{i})$. The computational cost $c_{i}(t)$ of processing datapoint $d$ is then the cost of computing the gradient $\nabla l(w_{i}(t-1),x_{d},y_{d})$. If the local data distributions $\mathcal{D}_{i}$ are all the same, then all datapoints are i.i.d. samples of this distribution, and this process is similar to stochastic gradient descent with batch size $|G_{i}(t)|$.
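As an illustration of the update (3), the sketch below runs a few local steps for a one-dimensional linear model with squared-error loss. The model, the batches, and the constant step size are assumptions made for this example only. Leaving the parameter unchanged when $G_{i}(t)$ is empty is also an assumption, since (3) is not defined for an empty batch.

```python
def grad_squared_error(w, x, y):
    """Gradient with respect to w of the per-datapoint loss (w*x - y)^2."""
    return 2.0 * (w * x - y) * x


def local_update(w_prev, batch, eta):
    """One step of eq. (3): move against the average gradient over G_i(t)."""
    if not batch:
        return w_prev  # assumption: no processed data means no update this interval
    avg_grad = sum(grad_squared_error(w_prev, x, y) for x, y in batch) / len(batch)
    return w_prev - eta * avg_grad


# Hypothetical stream of processed batches G_i(1), G_i(2), G_i(3), drawn from y = 2x.
batches = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.0, 2.0), (4.0, 8.0)],
]

w = 0.0
for batch in batches:
    w = local_update(w, batch, eta=0.05)
print(w)  # moves from 0 toward the generating slope 2
```

The per-interval processing cost in the text corresponds to the number of calls to `grad_squared_error`, i.e., one gradient evaluation per datapoint in the batch.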

III-B2 Aggregation and synchronization

The aggregation server $s$ will periodically receive the local estimates $w_{i}(t)$ from the devices, compute a global update based on these models, and synchronize the devices with the global update. Formally, the $k$th aggregation is computed as

$$w(k)=\frac{\sum_{i}H_{i}(k\tau)\cdot w_{i}(k\tau)}{\sum_{i}H_{i}(k\tau)}, \tag{4}$$

where $\tau$ is the fixed aggregation period and $H_{i}(k\tau)=\sum_{t=(k-1)\tau+1}^{k\tau}|G_{i}(t)|$ is the number of datapoints node $i$ processed since the last aggregation. Thus, the update is a weighted average factoring in the sample size $H_{i}(t)$ on which each $w_{i}(t)$ is based. Intuitively, since the objective in (1) minimizes the empirical loss [5, 10], nodes that process more data should be weighted more, as in (1)–(4).

Figure 2: Federated learning updates between aggregations $k$ and $k+1$ in our system. Device 1 discards all of its data or offloads it to device $n$, which computes $\tau$ gradient updates on its local data. The final parameter values are averaged at the parameter server, with the result sent back to the devices to begin a new iteration.

Once this is computed, we synchronize local estimates, i.e., $w_{i}(t)\leftarrow w(t/\tau)\;\forall i$. Using a smaller $\tau$ generally results in faster convergence of $w$, while larger $\tau$ requires less network resources. Prior work [5] considered how to optimize $\tau$; we analyze its effect experimentally in Section V.
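The aggregation (4) and the subsequent synchronization can be sketched as follows. This is an illustrative implementation under stated assumptions: parameters are plain Python lists, the device names and sample counts are hypothetical, and raising an error when no device processed any data is a choice made here, not something the paper specifies.

```python
def aggregate(local_params, samples_since_last):
    """Eq. (4): average of local parameters weighted by H_i, the datapoints processed."""
    total = sum(samples_since_last.values())
    if total == 0:
        raise ValueError("no device processed data since the last aggregation")
    dim = len(next(iter(local_params.values())))
    return [
        sum(samples_since_last[i] * local_params[i][k] for i in local_params) / total
        for k in range(dim)
    ]


def synchronize(local_params, global_param):
    """Overwrite every local estimate w_i with a copy of the aggregated w(k)."""
    return {i: list(global_param) for i in local_params}


# Hypothetical two-device example with 2-D parameters.
local_params = {"dev_1": [1.0, 0.0], "dev_n": [3.0, 2.0]}
samples = {"dev_1": 10, "dev_n": 30}

w_global = aggregate(local_params, samples)
local_params = synchronize(local_params, w_global)
print(w_global)
```

A device that offloaded or discarded all of its data, like device 1 in Figure 2, has $H_{i}=0$ and therefore receives zero weight in the average while still being synchronized to the aggregated parameters.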

III-C Optimization Model for Data Processing Tasks

We now consider the choice of $G_{i}(t)$, which defines the data processing task executed by device $i$ at time $t$. There are two possible reasons that $G_{i}(t)\neq D_{i}(t)$: first, device $i$ may offload some of its collected data to another device $j$ or vice versa, e.g., if $i$ has insufficient capacity to process all of the data ($D_{i}(t)\geq C_{i}(t)$)² or if $j$ has lower computing costs ($c_{j}(t)\leq c_{i}(t)$). Second, device $i$ may discard data if processing it does not reduce the empirical loss (1) by much. In Figure 2, device 1 offloads or discards all of its data. We collectively refer to discarding and offloading as data movement.

² For notational convenience, $D_{i}(t)$ here refers to the number of datapoints $|D_{i}(t)|$, and similarly $G_{i}(t)$ refers to $|G_{i}(t)|$. The context will make the distinction clear throughout the paper.

III-C1 Data movement model

We define $s_{ij}(t)\in[0,1]$ as the fraction of data collected at device $i$ that is offloaded to device $j\neq i$ at time $t$. Thus, at time $t$, device $i$ offloads $D_{i}(t)s_{ij}(t)$ amount of data to $j$. Similarly, $s_{ii}(t)$ will denote the fraction of data collected at time $t$ that device $i$ also processes at time $t$. We suppose that as long as $D_{i}(t)s_{ij}(t)\leq C_{ij}(t)$, the capacity of the link between $i$ and $j\neq i$, then all offloaded data will reach $j$ within one time interval and can be processed at device $j$ in time interval $t+1$. Since devices must have a link between them to offload data, $s_{ij}(t)=0$ if $(i,j)\notin E(t)$.

We also define $r_{i}(t)\in[0,1]$ as the fraction of data collected by device $i$ at time $t$ that will be discarded. In doing so, we assume that device $j$ will not discard data that has been offloaded to it by others, since that has already incurred an offloading cost $D_{i}(t)s_{ij}(t)c_{ij}(t)$. The amount of data collected by device $i$ at time $t$ and discarded is then $D_{i}(t)r_{i}(t)$, and the amount of data processed by each device $i$ at time $t$ is

$$G_{i}(t)=s_{ii}(t)D_{i}(t)+\sum_{j\neq i}s_{ji}(t-1)D_{j}(t-1).$$

In defining the variables $s_{ij}(t)$ and $r_{i}(t)$, we have implicitly specified the constraint $r_{i}(t)+\sum_{j}s_{ij}(t)=1$: all data collected by device $i$ at time $t$ must either be processed by device $i$ at this time, offloaded to another device $j$, or discarded. We assume that devices will not store data for future processing, which would add another cost component to the model.
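The bookkeeping in this model can be made concrete with a short sketch. The dictionaries below (data volumes keyed by device, fractions keyed by ordered device pairs) are a hypothetical representation chosen for illustration, as are the numbers. The code computes $G_{i}(t)$ from data kept at time $t$ plus data offloaded by neighbors at time $t-1$, and recovers the discard fraction $r_{i}(t)$ from the constraint $r_{i}(t)+\sum_{j}s_{ij}(t)=1$.

```python
def processed_amounts(D_now, s_now, D_prev, s_prev):
    """G_i(t) = s_ii(t) D_i(t) + sum over j != i of s_ji(t-1) D_j(t-1)."""
    G = {}
    for i in D_now:
        kept = s_now.get((i, i), 0.0) * D_now[i]
        # Data offloaded to i during the previous interval arrives for processing now.
        received = sum(s_prev.get((j, i), 0.0) * D_prev[j] for j in D_prev if j != i)
        G[i] = kept + received
    return G


def discard_fraction(i, s_now, devices):
    """r_i(t) implied by r_i(t) + sum_j s_ij(t) = 1."""
    return 1.0 - sum(s_now.get((i, j), 0.0) for j in devices)


# Hypothetical two-device example: "a" is constrained, "b" has spare capacity.
D_prev = {"a": 100.0, "b": 50.0}
s_prev = {("a", "a"): 0.5, ("a", "b"): 0.3, ("b", "b"): 1.0}
D_now = {"a": 80.0, "b": 60.0}
s_now = {("a", "a"): 0.25, ("a", "b"): 0.5, ("b", "b"): 1.0}

G_now = processed_amounts(D_now, s_now, D_prev, s_prev)
r_a = discard_fraction("a", s_now, D_now)
print(G_now, r_a)
```

In this example device `b` processes its own 60 datapoints plus the 30 that `a` offloaded one interval earlier, which shows the one-interval delay between offloading and processing.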

To quantify the cost of discarding relative to processing/offloading, we also define $f_{i}(t)$ as the cost per unit model loss $L(w_{i}(t)|D_{V})$. The dependence on $i$ can capture the fact that certain devices in the network may weight their model error costs differently depending on the importance of the application. The dependence on $t$ can capture the fact that the loss may become less important relative to the resource utilization as the model approaches convergence.

III-C2 Data movement optimization

We formulate the following cost minimization problem for determining the data movement variables $s_{ij}(t)$ and $r_{i}(t)$ over the time period $T$:

$$
\begin{aligned}
\underset{s_{ij}(t),\,r_{i}(t)}{\text{minimize}}\quad & \sum_{t=1}^{T}\Bigg(\sum_{i}G_{i}(t)c_{i}(t)+\sum_{(i,j)\in E(t)}D_{i}(t)s_{ij}(t)c_{ij}(t)+\sum_{i}f_{i}(t)L\left(w_{i}(t)|D_{V}\right)\Bigg), && (5)\\
\text{subject to}\quad & G_{i}(t)=s_{ii}(t)D_{i}(t)+\sum_{j\neq i}s_{ji}(t-1)D_{j}(t-1), && (6)\\
& s_{ij}(t)=0,\;\;(i,j)\notin E(t),\;j\neq i, && (7)\\
& r_{i}(t)+\sum_{j}s_{ij}(t)=1,\;\;s_{ij}(t),r_{i}(t)\geq 0, && (8)\\
& G_{i}(t)\leq C_{i}(t),\;\;s_{ij}(t)D_{i}(t)\leq C_{ij}(t). && (9)
\end{aligned}
$$

Constraints (6)–(8) were introduced above and ensure that the solution is feasible. The capacity constraints in (9) ensure that the amounts of data transferred and processed are within link and node capacities, respectively.
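A candidate data movement plan for a single interval can be checked against constraints (7)–(9) with a small routine like the one below. This is only a sketch: the dictionary layout, the tolerance, and the example numbers are assumptions, and the processed amounts $G_{i}(t)$ are taken as given rather than recomputed from (6), which couples two consecutive intervals.

```python
def check_constraints(D, s, r, G, links, node_cap, link_cap, tol=1e-9):
    """Return a list of (constraint label, offender) pairs; empty means feasible."""
    violations = []
    devices = list(D)
    for i in devices:
        # (8): fractions are nonnegative and sum to one together with r_i.
        total = r[i] + sum(s.get((i, j), 0.0) for j in devices)
        if abs(total - 1.0) > tol or r[i] < -tol:
            violations.append(("(8)", i))
        # (9), node part: processed data fits within the device capacity.
        if G[i] > node_cap[i] + tol:
            violations.append(("(9) node", i))
        for j in devices:
            frac = s.get((i, j), 0.0)
            if frac < -tol:
                violations.append(("(8)", (i, j)))
            if j == i:
                continue
            if (i, j) not in links:
                # (7): no offloading without a functioning link.
                if frac > tol:
                    violations.append(("(7)", (i, j)))
            elif frac * D[i] > link_cap[(i, j)] + tol:
                # (9), link part: offloaded data fits within the link capacity.
                violations.append(("(9) link", (i, j)))
    return violations


# Hypothetical single-interval instance.
D = {"a": 80.0, "b": 60.0}
s = {("a", "a"): 0.25, ("a", "b"): 0.5, ("b", "b"): 1.0}
r = {"a": 0.25, "b": 0.0}
G = {"a": 20.0, "b": 90.0}
links = {("a", "b")}
node_cap = {"a": 30.0, "b": 100.0}

ok = check_constraints(D, s, r, G, links, node_cap, {("a", "b"): 50.0})
too_small = check_constraints(D, s, r, G, links, node_cap, {("a", "b"): 30.0})
print(ok, too_small)
```

With a link capacity of 50 the plan is feasible; shrinking the capacity to 30 makes the 40 datapoints that `a` sends to `b` violate the link part of (9).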

The three terms in the objective (5) correspond to the processing, offloading, and error costs, respectively. We do not include the cost of communicating parameter updates to/from the server in our model; unless a device processes no data, the number of updates stays constant. By including these three cost terms, our formulation accounts for the tradeoff between model performance and resource utilization:

(i) Processing, $G_{i}(t)c_{i}(t)$: This is the computing cost associated with processing $G_{i}(t)$ of data at node $i$ at time $t$.

(ii) Offloading, $D_{i}(t)s_{ij}(t)c_{ij}(t)$: This is the communication cost incurred from node $i$ offloading data to $j$.

(iii) Error, $f_{i}(t)L\left(w_{i}(t)|D_{V}\right)$: This cost quantifies the impact of data movement on the local model error at each device $i$. As the model approaches convergence for large $t$, improvement in the error term $L(w_{i}(t))$ with respect to the variables $G_{i}(t)$ becomes limited. Thus, we may have $f_{i}(t)$ decrease over time to prioritize network costs at later time periods.
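The three cost terms can be evaluated for one interval as in the sketch below. The numbers, the dictionary layout, and in particular the loss values are hypothetical: in the actual problem the loss $L(w_{i}(t)|D_{V})$ depends implicitly on the data movement through the local updates, whereas here it is simply passed in as a measured or estimated quantity.

```python
def interval_cost(G, c_proc, D, s, c_link, links, f, loss):
    """Return the (processing, offloading, error) terms of objective (5) for one t."""
    processing = sum(G[i] * c_proc[i] for i in G)
    offloading = sum(D[i] * s.get((i, j), 0.0) * c_link[(i, j)] for (i, j) in links)
    error = sum(f[i] * loss[i] for i in f)
    return processing, offloading, error


# Hypothetical two-device instance: "a" is expensive to compute on, "b" is cheap.
G = {"a": 20.0, "b": 90.0}
c_proc = {"a": 0.5, "b": 0.1}
D = {"a": 80.0, "b": 60.0}
s = {("a", "a"): 0.25, ("a", "b"): 0.5, ("b", "b"): 1.0}
links = {("a", "b")}
c_link = {("a", "b"): 0.05}
f = {"a": 10.0, "b": 10.0}
loss = {"a": 0.3, "b": 0.2}  # placeholder loss values, not derived from a model

processing, offloading, error = interval_cost(G, c_proc, D, s, c_link, links, f, loss)
total = processing + offloading + error
print(processing, offloading, error, total)
```

Summing `total` over all intervals gives the objective value of a plan, which makes it straightforward to compare, say, a plan with offloading against one where every device processes only its own data.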

Since $w_{i}(t)$ is computed as in (3), the model error term is an implicit function of $G_{i}(t)$. We include the error from each device $i$’s local model at each time $t$, instead of considering the error of the final model, since devices may need to make use of their local models as they are updated (e.g., if aggregations are infrequent due to resource constraints [5]).

If the local datasets are i.i.d. (i.e., each is representative of the global distribution), then discarding more data clearly increases the global loss, since less data is used to train the ML model. Offloading may also skew the local model if it is updated over a small $G_{i}(t)$. We can, however, upper bound the loss function $L(w_{i}(t))$ regardless of the data movement:

Theorem 1 (Upper bound on local loss).

If $L_{i}(w)$ is convex, $\rho$-Lipschitz, and $\beta$-smooth, if $\eta\leq\frac{1}{\beta}$, and if $L(w(T))-L(w^{\star})\geq\epsilon$ for a lower bound $\epsilon$, then, after $K=\lfloor t/\tau\rfloor$ aggregations at time $t$ with a period $\tau$ and defining the constant $\delta_{i}\geq\|\nabla L_{i}(w)-\nabla L(w)\|$, the local loss will satisfy

$$L(w_{i}(t))-L(w^{\star})\leq\epsilon_{0}+\rho g_{i}(t-K\tau), \tag{10}$$

where $g_{i}(x)=\frac{\delta_{i}}{\beta}\left((\eta\beta+1)^{x}-1\right)$, which implies $g_{i}(t-K\tau)$ is decreasing in $K$, and $\epsilon_{0}$ is given by

$$\frac{1}{t\omega\eta(2-\beta\eta)}+\sqrt{\frac{1}{t^{2}\omega^{2}\eta^{2}(2-\beta\eta)^{2}}+\frac{Kh(\tau)+g_{i}(t-K\tau)}{t\omega\eta(1-\beta\eta/2)}}.$$
Proof:

The full proof is contained in Appendix A of the supplementary material. We define $v_{k}(t)$, $t\in\{(k-1)\tau,\ldots,k\tau\}$, as the parameters under centralized gradient descent updates, $\theta_{k}(t)=L(v_{k}(t))-L(w^{\star})$, and assume $\theta_{k}(k\tau)\geq\epsilon$ as in [5]. After lower-bounding $\frac{1}{\theta_{K+1}(t)}-\frac{1}{\theta_{1}(0)}$ and $\frac{1}{L(w_{i}(t))-L(w^{\star})}-\frac{1}{\theta_{K+1}(t)}$, we can upper-bound $L(w_{i}(t))-L(w^{\star})$ as

$$\left(t\omega\eta\Big(1-\frac{\beta\eta}{2}\Big)-\frac{\rho}{\epsilon^{2}}\big(Kh(\tau)+g_{i}(t-K\tau)\big)\right)^{-1}=y(\epsilon).$$

Then, we let $\epsilon_{0}$ be the positive root of $y(\epsilon)=\epsilon$. The result follows since either $\underset{k\leq K}{\min}\;L(v_{k}(k\tau))-L(w^{\star})\leq\epsilon_{0}$ or $L(w_{i}(t))-L(w^{\star})\leq\epsilon_{0}$; both imply (10). ∎
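To give a feel for the term $\rho g_{i}(t-K\tau)$ in (10), the sketch below evaluates $g_{i}$ over one aggregation period. The parameter values are hypothetical and chosen only to satisfy $\eta\leq 1/\beta$. The term $\epsilon_{0}$ is not computed, because it depends on $\omega$ and $h(\tau)$, which are defined in the appendix and in [5] rather than in this section.

```python
def g(x, delta, beta, eta):
    """g_i(x) = (delta_i / beta) * ((eta * beta + 1)^x - 1), as in Theorem 1."""
    return (delta / beta) * ((eta * beta + 1.0) ** x - 1.0)


def local_divergence_term(t, tau, rho, delta, beta, eta):
    """rho * g_i(t - K * tau), where K = floor(t / tau) aggregations have occurred."""
    num_aggregations = t // tau
    return rho * g(t - num_aggregations * tau, delta, beta, eta)


# Hypothetical constants with eta <= 1 / beta.
params = dict(tau=5, rho=1.0, delta=0.1, beta=2.0, eta=0.1)
terms = [local_divergence_term(t, **params) for t in range(10, 16)]
print(terms)
```

The values start at zero right after an aggregation (`t = 10`), grow with every local step, and reset to zero at the next aggregation (`t = 15`). This matches the statement that $g_{i}(t-K\tau)$ is decreasing in $K$ for fixed $t$: more aggregations leave fewer local steps since the last synchronization.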

In Section IV, we will consider how to use Theorem 1’s result to find tractable forms of the loss expression that allow us to solve the optimization (5)–(9) efficiently. Moreover, without perfect information on the device costs, capacities, and error statistics over the time period $T$, we cannot solve (5)–(9) exactly, so we will propose methods for estimating them.

IV Optimization Model Analysis

We turn now to a theoretical analysis of the data movement optimization problem (5)–(9). We discuss the choice of error cost and capacity values (Section IV-A), and then characterize the optimal solution under different choices of the error cost and assumptions on the fog networking scenario for the ML use cases outlined in Section I (Section IV-B).

IV-A Choosing Costs and Capacities

We may not be able to reliably estimate the costs $c_{ij}(t)$, $c_i(t)$, and $f_i(t)$ or capacities $C_i(t)$ and $C_{ij}(t)$ in real time. Misestimations are likely in highly dynamic scenarios that use mobile devices, since the costs $c_{ij}(t)$ of offloading data depend on network conditions at the current device locations. Mobile devices are also prone to occasional processing delays called “straggler effects” [22], which can be modeled as variations in their capacities. The error cost, on the other hand, decreases over time as the model parameters move towards convergence. Here, we propose and analyze network characteristic selection methods. Although these methods also rely on some knowledge of the system, we show in Section V that a simple time-averaging of historical costs and capacities suffices to obtain reasonable data movement solutions.

IV-A1 Choosing capacities

Overestimating $C_i(t)$ leads to the deferral of some data processing until future time periods, which may cause a cascade of processing delays. Misestimations of the link capacities have similar effects. Here, to limit delays due to overestimation, we formalize guidelines for the capacities in (9)’s constraints. As commonly done [40], we assume that processing times on stragglers follow an exponential distribution $\exp(\mu)$ for parameter $\mu$.

For device capacities, we obtain the following result:

Theorem 2 (Processing time with compute stragglers).

Suppose that the service time of processing a datapoint at node $i$ follows $\exp(\mu_i)$, and that $c_{ij}(t)$, $c_i(t)$, $C_i(t)$ are time invariant. We can ensure the average waiting time of a datapoint to be processed is lower than a given threshold $\sigma$ by setting the device capacity $C_i$ such that $\phi(C_i) = \sigma\mu_i/(1+\sigma\mu_i)$, where $\phi(C_i)$ is the smallest solution to the equation $\phi = \exp\left(-\mu_i(1-\phi)/C_i\right)$, an increasing function of $C_i$.

Proof:

We note that the processing at node $i$ follows a D/M/1 queue with arrival rate $G_i(t) \leq C_i$, and the result follows from the average waiting time in such a queue. Details are in Appendix B of the supplementary material. ∎

For instance, $\sigma = 1$ guarantees an average processing time of less than one time slot, as assumed in Section III’s model. Thus, Theorem 2 shows that we can still (probabilistically) bound the data processing time when stragglers are present. As long as $G_i(t) \leq C_i(t)$, where $G_i(t)$ is defined based on any values of $s_{ij}(t)$ and $r_i(t)$, Theorem 2 holds for any data offloading and discarding policy, including the one provided by the solution to our optimization (5)–(9).
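The capacity rule in Theorem 2 can be evaluated numerically. The sketch below assumes the standard mean queueing delay of a G/M/1 queue, $\phi / (\mu_i(1-\phi))$, which we take to be the expression underlying the theorem; the function names and the values of $\mu_i$ and $\sigma$ are hypothetical. Because the fixed-point equation can be solved for $C_i$ once $\phi$ is known, the capacity is available in closed form, and a fixed-point iteration started from $\phi = 0$ confirms that the target $\phi$ is indeed the smallest root.

```python
import math


def target_phi(sigma, mu):
    """phi value required by Theorem 2: sigma * mu / (1 + sigma * mu)."""
    return sigma * mu / (1.0 + sigma * mu)


def capacity_for_threshold(sigma, mu):
    """Solve phi = exp(-mu * (1 - phi) / C) for C at phi = target_phi."""
    phi = target_phi(sigma, mu)
    return mu * (1.0 - phi) / (-math.log(phi))


def smallest_root(mu, C, iterations=2000):
    """Smallest solution of phi = exp(-mu * (1 - phi) / C).

    The map is increasing in phi, so iterating from phi = 0 rises
    monotonically to the smallest fixed point.
    """
    phi = 0.0
    for _ in range(iterations):
        phi = math.exp(-mu * (1.0 - phi) / C)
    return phi


mu, sigma = 2.0, 1.0                      # hypothetical service rate and threshold
C = capacity_for_threshold(sigma, mu)
phi = smallest_root(mu, C)
mean_wait = phi / (mu * (1.0 - phi))      # assumed G/M/1 mean queueing delay
print(C, phi, mean_wait)
```

Any capacity below the returned value yields a smaller $\phi$ and hence a shorter average wait, since $\phi(C_i)$ is increasing in $C_i$.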

Network link congestion encountered in transferring data may also delay its processing, which can be handled by choosing the network capacity $C_{ij}(t)$ analogously to Theorem 2.

IV-A2 Choosing error expressions

To solve the optimization (5)–(9), we also need an expression of the error objective $f_i(t)L(w_i(t)|D_V)$ in terms of our decision variables. As shown in Theorem 1, we can bound the local loss at time $t$ in terms of a gradient divergence constant $\delta_i \geq \left\|\nabla L_i(w) - \nabla L(w)\right\|$. The following in turn provides an upper bound for $\delta_i$ in terms of $G_i(t)$:

Lemma 1 (Error convergence).

Suppose the data distributions $\mathcal{D}_i$ have finite second and third moments. Then there exists a constant $\gamma_i > 0$ independent of $G_i(t)$ such that

$$\delta_i \equiv \left\|\nabla L_i\left(w|G_i(t)\right) - \nabla L(w)\right\| \leq \frac{\gamma_i}{\sqrt{G_i(t)}} + \frac{\gamma}{\sqrt{|D_V|}} + \Delta, \tag{11}$$

where $\Delta = \|\nabla L_i(w|D_i) - \nabla L(w|D)\|$, $\gamma = \sum_{i=1}^{N}\gamma_i$, and $|D_V|$ is the total number of datapoints generated.

Proof:

By the triangle inequality, we can bound $\|\nabla L_i(w|G_i(t)) - \nabla L(w)\|$ above by the sum of $\|\nabla L_i(w|G_i(t)) - \nabla L_i(w|D_i(t))\|$, $\|\nabla L(w|D) - \nabla L(w|D_V)\|$, and $\Delta$. We bound $\|\nabla L(w|D) - \nabla L(w|D_V)\|$ above by $\gamma/\sqrt{|D_V|}$ using the central limit theorem. Since $\nabla L(w|D_V)$ is the average of $\nabla L(w, x_d, y_d)$, $\forall (x_d, y_d) \in D_V$, we can view $\nabla L(w, x_d, y_d)$ as $|D_V|$ samples from a distribution whose expected value is $\nabla L(w|D)$. We repeat this argument for $\|\nabla L_i(w|G_i) - \nabla L_i(w|D_i)\|$. ∎

The bound in Lemma 1 is likely to be loose for non-i.i.d. data, since, with data movement, the resulting distribution of data at device $i$ will be a mixture of the original distributions at each device. The value of $\Delta$ quantifies the degree to which local datasets $\mathcal{D}_i$ are non-i.i.d. across devices. When the local datasets are i.i.d., $\mathcal{D}_i = \mathcal{D}_j$ $\forall i,j$ and $\Delta = 0$. Since we take the generated device datasets $D_i(t)$ as given in the optimization, $\Delta$ is independent of our decision variables $s_{ij}(t)$ and $r_i(t)$, and thus independent of $G_i(t)$. As a result, only the term $\gamma_i/\sqrt{G_i(t)}$ from (11) is dependent on our decision variables. By substituting Lemma 1 into $g_i(t)$ in Theorem 1, we see that $L(w_i(t)) - L(w^{\star}) \propto \sqrt{G_i(t)^{-1}}$. Since the other time-dependent terms in $g_i(t)$ can be absorbed into $f_i(t)$, one choice of the error cost $f_i(t)L(w_i(t)|D_V)$ in (5) is $f_i(t)\sqrt{G_i(t)^{-1}}$, capturing diminishing marginal returns in $G_i(t)$.
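As a small illustration of the diminishing marginal returns, the sketch below evaluates the right-hand side of (11) for increasing $G_i(t)$. The constants are hypothetical placeholders; in particular, $\gamma_i$ and $\gamma$ are not estimated from data here.

```python
import math


def divergence_bound(G_i, gamma_i, gamma, D_V, Delta):
    """Right-hand side of (11): gamma_i/sqrt(G_i) + gamma/sqrt(|D_V|) + Delta."""
    return gamma_i / math.sqrt(G_i) + gamma / math.sqrt(D_V) + Delta


# Hypothetical constants; only the first term depends on G_i(t).
gamma_i, gamma, D_V, Delta = 1.0, 10.0, 60000, 0.0
for G in (25, 100, 400):
    print(G, divergence_bound(G, gamma_i, gamma, D_V, Delta))
```

Quadrupling $G_i(t)$ only halves the term that depends on the decision variables, so each additional processed datapoint reduces the bound by less than the previous one.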

Since $\sqrt{G_i(t)^{-1}}$ is a convex function of $G_i(t)$, with this choice of error cost, (5)–(9) becomes a convex optimization problem and can be solved relatively easily in theory. When the number of variables is large, however (e.g., if the number of devices $n > 100$ with $T > 100$ time periods, as could be the case in the applications discussed in Section I-A), standard interior point solvers will be prohibitively slow [41]. In such cases, we may wish to approximate the error term with a linear function and leverage faster linear optimization techniques, i.e., to take the error cost as $-f_i(t)G_i(t)$ so the error decreases when $G_i(t)$ increases. This choice eliminates the diminishing marginal returns in $G_i(t)$. For learning tasks that are of a critical nature (e.g., object detection in smart vehicles [2]), this may actually be more desirable in order to prioritize the trained model accuracy. Ultimately, in practice, the desired relationship between the resource and error costs in the optimization depends on the particular application.

We can further express $-f_i(t)G_i(t) = -f_i(t)(1-r_i(t))D_i(t) + f_i(t)D_i(t)\sum_{j\neq i}s_{ij}(t) - f_i(t)\sum_{j\neq i}s_{ji}(t-1)D_j(t-1)$. Since these last two terms are linear functions of $s_{ij}(t)$, they can alternatively be viewed as part of the transmission cost $\sum_{(i,j)\in E(t)}D_i(t)s_{ij}(t)c_{ij}(t)$ in (5). Specifically, if we redefine $c_{ij}(t) \leftarrow c_{ij}(t) + f_i(t) - f_j(t+1)$ in the objective, we can treat the error cost as $-f_i(t)D_i(t)[1-r_i(t)]$, which is equivalent to minimizing $f_i(t)D_i(t)r_i(t)$, i.e., a cost proportional to the amount of data discarded at node $i$. The analytical results we present in Sec. IV-B will use this form.
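The redefinition of $c_{ij}(t)$ can be seen by summing the last two terms over devices and time and re-indexing the incoming data. The following is a derivation sketch, assuming that boundary terms at the start and end of the horizon can be neglected and that $s_{ij}(t)$ is nonzero only for $(i,j) \in E(t)$:

```latex
\begin{align*}
&\sum_{t}\sum_{i}\Big[f_i(t)D_i(t)\sum_{j\neq i}s_{ij}(t)
   - f_i(t)\sum_{j\neq i}s_{ji}(t-1)D_j(t-1)\Big] \\
&\quad= \sum_{t}\sum_{i}\sum_{j\neq i} f_i(t)D_i(t)s_{ij}(t)
   - \sum_{t}\sum_{i}\sum_{j\neq i} f_j(t+1)D_i(t)s_{ij}(t) \\
&\quad= \sum_{t}\sum_{(i,j)\in E(t)} D_i(t)s_{ij}(t)\big[f_i(t)-f_j(t+1)\big].
\end{align*}
```

The second line swaps the roles of $i$ and $j$ in the incoming-data term and shifts its time index by one slot, so each offloaded datapoint is charged $f_i(t) - f_j(t+1)$ on top of $c_{ij}(t)$.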

We can quantify the difference imposed by the linear approximation by considering the Taylor expansion of $h(x) = x^{-1/2}$, where $x$ represents the quantity of processed data, i.e., $G_i(t)$, at a device. Our linear approximation is equivalent to the first order Taylor approximation of $h(x)$, i.e., $h(x) \approx h(x_0) - \frac{1}{2}x_0^{-3/2}(x - x_0)$, with $x_0 = 0$. Taylor’s theorem shows that the error in this first order expansion is proportional to $c^{-5/2}x^2$ for some $0 < c < x$, which is increasing in $x$ [42, 43]. Thus, the approximation will have a larger difference when devices are processing more data. Later in Sec. V-B, our experiments will show that there can be a practical advantage to modeling the discard cost as $f_i(t)D_i(t)r_i(t)$ without modifying the $c_{ij}(t)$.

IV-B Optimal Task Distributions

Given a set of costs and capacities for the optimization (5)–(9), we now characterize the optimal data movement solutions. We consider two formulations that admit closed-form solutions and elucidate the relationships between our costs and decision variables: (i) convex discard costs over a hierarchical, static topology (Theorem 4), and (ii) linear discard costs over a general, possibly dynamic topology (Theorem 3).

IV-B1 General and dynamic network topology

First, we consider the case of general network topologies. In this case, we can derive a closed-form solution for the data processing variables under the linear error term $f_i(t)r_i(t)D_i(t)$:

Theorem 3 (Data movement with linear discard cost).

Suppose $C_i(t) \geq D_i(t) + \sum_{j\in\mathcal{N}_i(t-1)} D_j(t-1)$ for each device $i$, i.e., its compute capacity always exceeds the data it collects as well as any data offloaded to it by $\mathcal{N}_i(t-1) = \{j : (j,i)\in E(t-1)\}$. Then, if the error cost is modeled as $f_i(t)D_i(t)r_i(t)$, the optimal $s_{ij}^{\ast}(t)$ and $r_i^{\ast}(t)$ will each be $0$ or $1$, with the following conditions for them being $1$ at node $i$:

$$\begin{cases} s_{ik}^{\ast}(t) = 1 & \text{if } c_{ik}(t) + c_k(t+1) \leq \min\left\{f_i(t), c_i(t)\right\} \\ s_{ii}^{\ast}(t) = 1 & \text{if } c_i(t) \leq \min\left\{f_i(t), c_{ik}(t) + c_k(t+1)\right\} \\ r_i^{\ast}(t) = 1 & \text{if } f_i(t) \leq \min\left\{c_i(t), c_{ik}(t) + c_k(t+1)\right\} \end{cases} \tag{12}$$

where $k = \underset{j:\, j\neq i,\, (i,j)\in E(t)}{\arg\min}\{c_{ij}(t) + c_j(t+1)\}$.

Proof:

Since $r_i(t) + \sum_j s_{ij}(t) = 1$ in (8), each datapoint in $D_i(t)$ is either discarded, offloaded, or processed at $i$. It is optimal to choose the option with least marginal cost. ∎

This theorem implies that with a linear discard cost, in the absence of resource constraints, all data will either be processed, offloaded to the lowest cost neighbor, or discarded. Additionally, Theorem 3 quantifies how the link costs $c_{ij}(t)$ affect the amount of data offloaded, allowing us to write the fraction of data offloaded by device $i$ as $1 - s_{ii}(t) - r_i(t)$, which is $1$ if $c_{ik}(t) + c_k(t+1) \leq \min\{f_i(t), c_i(t)\}$ and $0$ otherwise. We later use this point in the proofs for Theorems 5 and 6.
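The rule in (12) amounts to a three-way comparison of marginal costs at each node. A minimal sketch follows; the function name, the dictionary layout for the neighbor costs, and the tie-breaking order are assumptions made for illustration, since (12) does not specify how ties are resolved.

```python
def optimal_action(f_i, c_i, neighbors):
    """Pick the least-marginal-cost option for node i at time t, as in (12).

    neighbors maps a neighbor id j to the pair (c_ij(t), c_j(t + 1)).
    Returns ("offload", k), ("process", None), or ("discard", None).
    Ties are broken in the order offload, process, discard (arbitrary choice).
    """
    options = []
    if neighbors:
        # k = argmin_j { c_ij(t) + c_j(t + 1) } over the out-neighbors of i.
        k = min(neighbors, key=lambda j: sum(neighbors[j]))
        options.append((sum(neighbors[k]), 0, ("offload", k)))
    options.append((c_i, 1, ("process", None)))
    options.append((f_i, 2, ("discard", None)))
    return min(options)[2]


# Hypothetical costs: offloading to neighbor 1 costs 0.1 + 0.2, below c_i and f_i.
print(optimal_action(0.9, 0.5, {1: (0.1, 0.2), 2: (0.3, 0.3)}))
```

Each device needs only its own costs and those of its out-neighbors, so the rule can be evaluated locally.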

Fog use cases. We next move to characterize solutions for specific fog use cases outlined in Section I. Table I summarizes the topologies of these four applications. Networks in smart factories have fairly static topologies, since they are deployed in controlled indoor settings. They also exhibit a hierarchical structure, with less powerful devices connected to more powerful ones in a tree-like manner with devices at the same level unable to communicate, as shown in Figure 1. Connected vehicles have a similar hierarchical structure, with sensors and vehicles connected to more powerful edge servers, but their architectures are more dynamic as vehicles are moving. Similarly, AR applications feature (possibly heterogeneous) mobile AR headsets connected to powerful edge servers. Applications that involve privacy-sensitive data may have very different, non-hierarchical topologies as the links between devices are based on trust, i.e., comfort in sharing private information. Since social relationships generally change slowly compared to ML model training, these topologies are relatively static.

While the connected vehicles and AR settings require the generic topology treatment from Theorem 3, we can derive additional results for static hierarchical and social topologies:

IV-B2 Static and hierarchical topologies

In hierarchical scenarios, more powerful edge servers will likely always have sufficient capacity to handle all offloaded data, and they will likely have lower computing costs when compared to other devices. If the cost of discarding is linear, then from Theorem 3, sensors offload their data to the edge servers, unless the cost of offloading exceeds the difference in computing costs. We show in Section V that the network cost can exceed the savings from offloading to more powerful devices; in such cases, devices process or discard their data instead.

| Use case | Topology | Dynamics |
|---|---|---|
| Smart factories [2] | Hierarchical | Fairly static |
| Connected vehicles [2] | Hierarchical | Rapid changes |
| Augmented reality [3] | Hierarchical, heterogeneous possible | Rapid changes |
| Privacy-sensitive [4, 22] | Social network | Fairly static |

TABLE I: Dominant characteristics of the four fog use cases.

When the cost of discarding is nonlinear, the optimal solution is less intuitive: it may be optimal to discard data if the additional processing cost outweighs the expected error reduction. If we consider multiple heterogeneous devices with static processing costs and data generation rates, each connected to a more powerful edge server over the same wireless network (and thus assumed with the same connectivity costs), we can find a closed form solution to our original optimization:

Theorem 4 (Data movement with nonlinear error costs).

Suppose that $n$ devices with static processing costs $c_i(t) = c_i$ and data generation rates $D_i(t) = D_i$ can offload to an edge server, indexed as $n+1$. Assume that there are no resource constraints, $c_i > c_{n+1}$, the costs $c_{ij}(t) = c_t$ of transmitting to the server are identical and constant, and the discard cost is given by $f_i(t)L(w_i(t)) = \gamma/\sqrt{G_i}$ as in Lemma 1. Then, letting $s_i$ denote the fraction of data offloaded for node $i$, for $D_i$ sufficiently large, the optimal amount of data discarded is

$$r_i^{\ast} = 1 - \frac{1}{D_i}\left(\frac{\gamma}{2c_i}\right)^{\frac{2}{3}} - s_i, \;\forall i. \tag{13}$$

Given the optimal $r_i^{\ast}$, an optimal solution for $s_i^{\ast}$ is given by

$$s_i^{\ast} = \frac{1}{\sum_j D_j}\left(\frac{\gamma}{2(c_{n+1} + c_t)}\right)^{\frac{2}{3}}, \;\forall i. \tag{14}$$
Proof:

The full proof is contained in Appendix C. There we note that in the hierarchical scenario, the cost objective (5) can be rewritten as $\sum_i (1-r_i-s_i)D_ic_i + \sum_i s_iD_i(c_{n+1}+c_t) + \sum_i \frac{\gamma}{\sqrt{(1-r_i-s_i)D_i}} + \frac{\gamma}{\sqrt{\sum_i s_iD_i}}$. Taking the partial derivatives with respect to $r_i$ and $s_i$, and noting that a large $D_i$ forces $r_i, s_i \in [0,1]$ for each node gives the result. ∎

Intuitively, as the costs $c_i$ or $c_{n+1}$ increase, so should the amount of data discarded, as in Theorem 4. Theorem 4 also shows that data is neither fully discarded nor offloaded, in contrast with Theorem 3, implying that convex error bounds lead to a more balanced distribution of data across nodes in the network. We do not include resource constraints in Theorem 4 in order to focus on the effects of the compute and transmission costs $c_i$, $c_{n+1}$, and $c_t$ on the optimal solution. In practice, an edge server would have nearly unlimited capacity $C_{n+1}$ to process the datapoints generated by the devices, as it is much more powerful, and often has access to a more reliable power source, than typical mobile devices. Intuitively, including the capacities $C_i$ and $C_t$ on the devices and links would further increase the amount of data discarded.
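The closed forms (13) and (14) are straightforward to evaluate. The sketch below does so for two hypothetical devices and checks the stationarity conditions obtained by differentiating the rewritten objective in the proof; the function name and all numerical values are illustrative assumptions.

```python
def theorem4_solution(D, c, c_server, c_t, gamma):
    """Evaluate (13) and (14) for devices with data rates D and costs c."""
    # Total data processed at the server: (gamma / (2 (c_{n+1} + c_t)))^(2/3).
    server_load = (gamma / (2.0 * (c_server + c_t))) ** (2.0 / 3.0)
    s = server_load / sum(D)                                  # (14), same for all i
    # Data processed locally at device i: (gamma / (2 c_i))^(2/3).
    local = [(gamma / (2.0 * ci)) ** (2.0 / 3.0) for ci in c]
    r = [1.0 - g / Di - s for g, Di in zip(local, D)]         # (13)
    return r, s


# Hypothetical instance with D_i large enough that r_i and s_i lie in [0, 1].
D, c, c_server, c_t, gamma = [100.0, 200.0], [0.5, 0.8], 0.1, 0.2, 10.0
r, s = theorem4_solution(D, c, c_server, c_t, gamma)
for Di, ci, ri in zip(D, c, r):
    G_i = (1.0 - ri - s) * Di
    # Stationarity: marginal processing cost equals marginal error reduction.
    print(ri, ci, 0.5 * gamma * G_i ** -1.5)
```

In this instance, the device with the higher processing cost keeps less data locally, consistent with the intuition above.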

IV-B3 Socially-defined topologies

When networks are larger and have more complex topologies, we extrapolate from Theorem 3’s characterization of device behaviors to understand data movement in the network as a whole. Specifically, we consider a social topology in which edges between devices are defined by willingness to share data (Figure 1(b)). In such cases, we assume $c_{ij}(t) = 0$ as nodes either trust each other to share data or do not; nodes that do not trust each other simply will not have an edge between them. We can find the fraction of devices that offload data, which allows us to determine the cost savings from offloading, when the error cost is linear:

Figure 3: Our Raspberry Pi devices running local computations.
Theorem 5 (Value of offloading).

Suppose the fraction of devices with $k$ neighbors equals $N(k)$. For a social network following a scale-free topology, for example, $N(k) = \Gamma k^{1-\gamma}$ for some constant $\Gamma$ and $\gamma \in (2,3)$. Suppose $c_i \sim U(0,C)$ and $c_{ij} = 0$ over all time, where $U(a,b)$ is the uniform distribution between $a$ and $b$ and no discarding occurs.

Then the average cost savings, compared to no offloading, equals

$$\sum_{k=1}^{n} N(k)\left(\frac{C}{2} - \frac{C(-1)^k}{k+2} - \sum_{l=0}^{k-1}\binom{k}{l}\frac{C(-1)^l(k+3)}{(l+2)(l+3)}\right). \tag{15}$$
Proof:

The full proof is contained in Appendix D of the supplementary materials. There, we use the result of Theorem 3 to find the probability that devices have lower processing cost neighbors, from which we can determine (15). ∎

Thus, the reduction in cost from enabling device offloading in such scenarios is approximately linear in $C$: as the range of computing costs increases, there is greater benefit from offloading, since devices are more likely to find a neighbor with lower cost. The processing cost model may for instance represent device battery levels drawn uniformly at random from $0$ (full charge) to $C$ (low charge). The expected reduction from offloading, however, may be less than the average computing cost $C/2$, as offloading data to another device does not entirely eliminate the computing cost. Note we can use a similar technique to analyze the scenario where $c_{ij}$ is either nonzero or follows a given distribution; however, the expression for cost savings is then more complex due to the need to compare $c_j(t)$ (drawn from $U(0,C)$) with $c_{jk}(t) + c_k(t)$ (drawn from a convolution of $U(0,C)$ with the distribution of $c_{ij}(t)$). Similarly, we do not generalize Theorem 5’s results to a non-uniform $c_i(t)$ distribution due to the complexity in writing a closed-form solution as in (15). However, we expect similar observations to hold, i.e., offloading becomes more beneficial as the cost range increases.
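The bracketed term in (15) can be evaluated exactly with rational arithmetic. In the sketch below, the function names and the example degree distribution are hypothetical. Evaluating the term suggests that it coincides with $C/2 - C/(k+2)$, i.e., the gap between the mean of one $U(0,C)$ cost and the mean of the minimum of $k+1$ such costs; this simplification is an observation from the computation and is not part of the theorem statement.

```python
from fractions import Fraction
from math import comb


def savings_term(k, C=Fraction(1)):
    """Bracketed term of (15) for a device with k neighbors, as an exact Fraction."""
    total = C / 2 - C * (-1) ** k / (k + 2)
    for l in range(k):
        total -= comb(k, l) * C * (-1) ** l * (k + 3) / ((l + 2) * (l + 3))
    return total


def average_savings(N, C=Fraction(1)):
    """Evaluate (15), where N maps a degree k to the fraction of devices N(k)."""
    return sum(frac * savings_term(k, C) for k, frac in N.items())


# Hypothetical degree distribution: half the devices have 1 neighbor, half have 3.
N = {1: Fraction(1, 2), 3: Fraction(1, 2)}
print(average_savings(N, C=Fraction(2)))
```

Since every term carries a factor of $C$, the evaluated savings scale linearly with the cost range for a fixed degree distribution.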

Next, we consider the case in which resource constraints are present, e.g., for less powerful edge devices. We find the expected number of devices that will have tight resource constraints:

Theorem 6 (Probability of resource constraint violation).

Let $N(k)$ be the number of devices with $k$ neighbors, and for each device $i$ with $k$ neighbors, let $p_k(n)$ be the probability that any one of its neighbors $j$ has $n$ neighbors. Also let $\tilde{C}$ denote the distribution of resource capacities, assumed to be i.i.d. across devices, and let $D_i(t) = D$ be constant. Then if devices offload as in Theorem 3, the expected number of devices whose capacity constraints are violated is

$$\underset{\tilde{C}(x)}{\int}\left(\sum_{k=1}^{N} N(k)\,\mathbb{P}\left[1 - P_o(k) + k\sum_{n=1}^{N}\left(\frac{P_o(n)p_k(n)}{n}\right) \geq \frac{x}{D}\right]\right), \tag{16}$$

with $P_o(k)$ defined as the probability a device with $k$ neighbors offloads its data based on the conditions in Theorem 3.

Proof:

This follows from Theorem 3, and from obtaining an expression for the expected amount of processed data at a node with $k$ neighbors when offloading is enabled. ∎

Theorem 6 makes two assumptions: $D_i(t) = D$ and resource capacities being i.i.d. across devices. The former assumption models a constant rate of data generation (e.g., from regular monitoring of user activity). If $D$ is stochastic (e.g., from event-driven sensor readings with random event sequences), we may generalize (16) by taking an additional expectation over $D$. The latter assumption is reasonable if we have no information about specific devices in the network, but know a range of possible hardware specifications that determine the compute capacities. This is likely to occur when there are a large number of devices; as we explain below, such a scenario is precisely where Theorem 6 is most useful.

Theorem 6 allows us to quantify the complexity of solving the data movement optimization problem when resource constraints are in effect. We observe that it depends on not just the resource constraints, but also on the distribution of computing costs (through $P_o(k)$), since these costs influence the probability devices will want to offload in the first place. Additionally, Theorem 6 provides a guide for the most efficient way to solve optimization (5)–(9). If the expected number of resource violations is low, then following the procedure in Theorem 3 will produce a near-optimal solution that violates only a few resource constraints. We can then ensure these constraints are satisfied with minimal adjustments to the solution, e.g., optimizing over only the $s_{ij}(t)$ and $r_i(t)$ variables for the affected nodes and their neighbors while fixing all other optimization variables, or even increasing the $r_i(t)$ until the capacity constraints are satisfied. While this solution will not be optimal, (12) can be implemented in a distributed manner, if each device $j$ sends each of its neighbors $i$ (i) its processing cost $c_j(t)$ and (ii) estimates of $c_{ij}(t)$, e.g., channel conditions at the receiver. Thus, it is significantly more efficient than solving (5)–(9) via a generic linear solver. On the other hand, if (16) is large, then we should solve (5)–(9) using a generic optimizer.
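One possible reading of the simplest repair step, increasing $r_i(t)$ until the capacity constraint holds, is sketched below. It assumes $G_i(t) = s_{ii}(t)D_i(t) + \sum_{j\neq i}s_{ji}(t-1)D_j(t-1)$ and that only locally generated data can be discarded at node $i$; the function name and return convention are hypothetical.

```python
def repair_by_discarding(s_ii, r_i, D_i, incoming, C_i):
    """Raise r_i(t) just enough that G_i(t) <= C_i(t), if possible.

    G_i(t) = s_ii * D_i + incoming, where incoming is the data offloaded to
    node i in the previous slot. Only locally processed data is discarded
    (an assumption); any excess that remains after s_ii reaches zero is
    returned so the caller can adjust the neighbors' decisions instead.
    """
    excess = s_ii * D_i + incoming - C_i
    if excess <= 0 or D_i == 0:
        return s_ii, r_i, max(excess, 0.0)
    shift = min(s_ii, excess / D_i)          # fraction moved from process to discard
    return s_ii - shift, r_i + shift, excess - shift * D_i


# Hypothetical node: processes all 100 local datapoints, receives 20, capacity 90.
print(repair_by_discarding(1.0, 0.0, 100.0, 20.0, 90.0))
```

When the returned excess is nonzero, the incoming data alone exceeds the capacity, which corresponds to the case where the affected node and its neighbors must be re-optimized jointly.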

V Experimental Evaluation

In this section, we experimentally evaluate our methodology. After discussing the setup in Sec. V-A, we investigate the performance of network-aware learning in Sec. V-B. Then, we examine the effects of network characteristics, structure, and dynamics on our methodology in Sec. V-C to V-E.

| Learning methodology | Synthetic costs, MLP | Synthetic costs, CNN | Testbed costs, MLP | Testbed costs, CNN |
|---|---|---|---|---|
| Centralized | 92.00% | 98.00% | 92.00% | 98.00% |
| Federated (i.i.d.) | 90.82% | 96.62% | 90.82% | 96.62% |
| Federated (non-i.i.d.) | 84.00% | 96.10% | 84.00% | 96.10% |
| Network-aware (i.i.d.) | 88.63% | 95.89% | 89.70% | 96.03% |
| Network-aware (non-i.i.d.) | 83.20% | 92.31% | 85.47% | 92.62% |

TABLE II: Comparison of accuracies obtained by learning methodologies for different cost settings and data distribution scenarios. Network-aware learning achieves within 4% accuracy of federated learning on test datasets in all cases.

V-A Experimental Setup

Machine learning task and models. We consider image recognition as our machine learning task, using the MNIST dataset [44], which contains 70K images of hand-written digits labeled 0-9, i.e., a 10-class classification problem. We use 60K images as the training dataset $D_V$, and the remainder as our test set. The number of samples $|D_i(t)|$ at node $i$ is modeled using a Poisson arrival process with mean $|D_V|/(nT)$. For i.i.d. scenarios, each device $i$ generates $D_i(t)$ by sampling uniformly at random and without replacement from $D_V$. For non-i.i.d. scenarios, each device is limited to a random selection of five of the 10 possible labels. Device $i$ then samples $D_i(t)$ uniformly at random from this chosen subset of labels. Data distributions are i.i.d. unless stated otherwise. We train multilayer perceptrons (MLP) and convolutional neural networks (CNN) for image recognition on MNIST. We use cross entropy [45] as the loss function $L(w|D_V)$, and a constant learning rate $\eta(t) = 0.01$. Unless otherwise stated, results are reported for CNN using $n = 10$ fog devices, an aggregation period $\tau = 10$, and $T = 100$ time intervals.
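The data generation process can be sketched as follows. This is an illustrative reconstruction rather than our experiment code: the function names are hypothetical, the Poisson sampler is Knuth's multiplication method, and the sketch produces only per-slot sample counts and allowed label sets, not the images themselves.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def device_plan(num_samples, n, T, non_iid, seed=0):
    """For each device, pick its allowed labels and its sample count per slot."""
    rng = random.Random(seed)
    mean = num_samples / (n * T)                 # |D_V| / (nT)
    plan = []
    for _ in range(n):
        # Non-i.i.d.: five of the ten labels; i.i.d.: all ten labels.
        labels = sorted(rng.sample(range(10), 5)) if non_iid else list(range(10))
        counts = [poisson(mean, rng) for _ in range(T)]
        plan.append({"labels": labels, "counts": counts})
    return plan


plan = device_plan(60000, n=10, T=100, non_iid=True)
print(plan[0]["labels"], plan[0]["counts"][:5])
```

With the default parameters, each device expects $60000/(10 \cdot 100) = 60$ new datapoints per time interval.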

Network cost and capacity parameters. To obtain realistic network costs for nodes $c_i(t)$ and links $c_{ij}(t)$, we collect measurements from a testbed consisting of six Raspberry Pis as nodes and AWS DynamoDB as a cloud-based parameter server (Figure 3). Three Pis collect data and transmit it over Bluetooth to another “gateway” Pi. The three gateway nodes receive this data and either perform a local gradient update or upload the data to DynamoDB. We measure 100 rounds of gradient update processing times and Pi-to-DynamoDB communication times while training a two-layer fully connected neural network (MLP), with devices communicating over 2.4 GHz WiFi or LTE cellular. These processing times are linearly scaled to range from 0 to 1, and recorded as $c_i(t)$, while the Pi-to-DynamoDB communication times are similarly scaled and saved as $c_{ij}(t)$. For completeness, we also evaluate performance for synthetic costs, where we take $c_{ij}(t), c_i(t) \sim U(0,1)$. Using both types of costs allows for a more thorough comparison between our network-aware and state-of-the-art learning frameworks, as network costs only affect network-aware learning. Unless otherwise stated, results are reported using the testbed-collected parameters.

Figure 4: Panels: (a) training loss over time; (b) data similarity between devices. Panel (a) shows training loss over time for each device with network-aware learning. The average and variance drop over time. Panel (b) captures average data similarity for 100 experiments across devices before ($x$-axis) and after ($y$-axis) offloading occurs for the non-i.i.d. scenario. The dotted line represents the change when no offloading occurs. Our methodology increases data similarity in almost all cases.

When imposed, the capacity constraints $C_i(t)$ and $C_{ij}(t)$ are taken as the average data generated per device in each time period, i.e., $|D_V|/(nT)$. The error costs $f_i(t)$ are modeled using the measurements from simulations on the Raspberry Pi testbed. All results are averaged over at least five iterations.

Centralized and federated learning. To see whether our method compromises learning accuracy in considering network costs as additional objectives, we compare against a baseline of centralized ML training where all data is processed at a single device (server). We also compare to federated learning with the same $\tau, T$ parameters where there is no data offloading or discarding, i.e., $G_i(t) = D_i(t)$.

Perfect information vs. estimation. As discussed in Section IV-A, solving (5)–(9) in practice requires estimating the costs and capacities over the time horizon $T$. To do this, we divide $T$ into $L$ intervals $T_1, \ldots, T_L$, and in each interval $l$, we use the time-averaged observations of $D_i(t)$, $c_i(t)$, $c_{ij}(t)$, and $C_i(t)$ over $T_{l-1}$ to compute the optimal data movement. The resulting $s^{\star}_{ij}(t)$ and $r^{\star}_i(t)$ for $t \in T_l$ are then used by device $i$ to transfer data in $T_l$. This “imperfect information” scheme will be compared with the ideal case in which the network costs and parameters are available (i.e., “perfect information”).
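The estimation step can be sketched for a single observed quantity, e.g., one device's $c_i(t)$ series. The sketch assumes equal-length intervals with $T$ divisible by $L$, and leaves the first interval without an estimate, since how $T_1$ is initialized is not specified above; the function name is hypothetical.

```python
def interval_estimates(observations, L):
    """Time-averaged estimate for each interval from the previous interval.

    observations: per-slot values of one quantity (e.g., c_i(t)) over the horizon.
    Returns a list of length L whose entry l is the mean over interval l - 1,
    with None for the first interval (no history; an assumption of this sketch).
    """
    size = len(observations) // L            # assumes T is divisible by L
    estimates = [None]
    for l in range(1, L):
        window = observations[(l - 1) * size : l * size]
        estimates.append(sum(window) / len(window))
    return estimates


# Hypothetical cost trace over T = 6 slots split into L = 3 intervals.
print(interval_estimates([1, 3, 5, 7, 9, 11], L=3))
```

The optimization for interval $T_l$ is then solved with these estimates in place of the true per-slot values.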

V-B Efficacy of Network-Aware Learning

We first investigate the efficacy of our method on a fully-connected network topology $E(t) = \{(i,j) : i \neq j\}$.

V-B1 Model accuracy

Table II compares the testing accuracies obtained by centralized, federated, and network-aware learning on both synthetic and testbed network costs. Our method achieves nearly the same (within 4%) accuracy as federated learning in both the i.i.d. and non-i.i.d. cases. The performance differences between the i.i.d. and non-i.i.d. cases of each algorithm are expected, as non-i.i.d. local datasets are not representative of the overall dataset. Note that network-aware learning produced more accurate models on testbed costs than on synthetic costs. In practical fog environments, such as those outlined in Section I-A, devices with faster computations are also likely to transmit faster. The testbed-derived costs contain such a correlation, which allows more cost-effective offloading and improves model accuracy.

The convergence of network-aware learning across individual devices is shown in Figure 4(a), with all devices exhibiting an overall decreasing trend. In Figure 4(b), we consider the similarity between local data distributions before and after offloading in the non-i.i.d. setting. Percent similarity is calculated between each pair of nodes $i$ and $j$ as the percent overlap in the labels they contain, i.e., $s_{i,j} = |\mathcal{Y}_i \cap \mathcal{Y}_j| / \min\{|\mathcal{Y}_i|, |\mathcal{Y}_j|\}$ where $\mathcal{Y}_i$ is the multiset of labels at device $i$, and is then averaged over all $(i,j)$ pairs. Offloading increases the average similarity in nearly all cases, with an average improvement of around 10%.
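The similarity metric can be computed directly with multiset intersections. The following is a minimal sketch in which the function names are hypothetical and each device's labels are assumed to be given as a non-empty list.

```python
from collections import Counter
from itertools import combinations


def label_similarity(labels_i, labels_j):
    """|Y_i ∩ Y_j| / min(|Y_i|, |Y_j|), with Y_i and Y_j treated as multisets."""
    yi, yj = Counter(labels_i), Counter(labels_j)
    overlap = sum((yi & yj).values())        # Counter & keeps the minimum counts
    return overlap / min(len(labels_i), len(labels_j))


def average_similarity(labels_per_device):
    """Mean pairwise similarity over all unordered device pairs."""
    pairs = list(combinations(labels_per_device, 2))
    return sum(label_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical label multisets at three devices.
print(average_similarity([[0, 0, 1, 2], [0, 1, 1], [3, 4]]))
```

Normalizing by the smaller multiset means that a device whose labels are fully contained in another's has similarity 1 with it, regardless of how much additional data the larger device holds.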

| Setting | Accuracy (%), i.i.d. | Accuracy (%), non-i.i.d. | Process cost | Transfer cost | Discard cost | Total cost | Unit cost |
|---|---|---|---|---|---|---|---|
| A | 89.72 | 87.92 | 1234 | 0 | 0 | 1234 | 0.265 |
| B | 89.81 | 81.29 | 322 | 120 | 136 | 578 | 0.118 |
| C | 89.54 | 80.47 | 302 | 117 | 138 | 558 | 0.121 |
| D | 83.14 | 78.25 | 336 | 63 | 187 | 586 | 0.119 |
| E | 82.83 | 77.39 | 307 | 46 | 274 | 627 | 0.136 |

TABLE III: Network costs and model accuracies obtained for i.i.d. and non-i.i.d. local data distributions in five different settings. The costs are the same for i.i.d. and non-i.i.d. local data distributions. The differences between A, where no data transfers are permitted, and B-E, which are variants of network-aware learning, show substantial improvements in resource utilization.
Figure 5: Panels: (a) process vs. discard ratio; (b) data movement rates; (c) unit cost breakdown; (d) testing accuracy. Impact of the number of nodes $n$ on (a) the ratio of nodes processed vs. discarded, (b) the data movement rate, (c) the unit cost and cost components, and (d) the learning accuracy (log scale). The shading in (b) shows the range observed over time periods. We see that network-aware learning scales well with the number of nodes, as the cost incurred per datapoint improves, and the testing accuracy for the non-i.i.d. case improves substantially.

V-B2 Offloading and imperfect information

While network-aware learning obtains similar accuracy to traditional federated learning, we expect it will reduce network resource costs. Table III compares the costs incurred and model accuracy for both the i.i.d. and non-i.i.d. scenarios in five settings, where settings B–E are applications of network-aware learning:

  A. Offloading and discarding disabled
  B. Perfect information and no capacity constraints
  C. Imperfect information and no capacity constraints
  D. Perfect information and capacity constraints
  E. Imperfect information and capacity constraints

Cost components in Table III (process, transfer, and discard, i.e., error) are summed over all nodes/links and time. The unit cost column is the total cost normalized by the total data generated in that setting, to account for variation in $D_{i}(t)$ across experiments. Simulations with imperfect information (cases C and E) allocate data based on historical network characteristics, which may be inaccurate and can cause unintentionally large transfer or processing costs. In the capacity-limited cases (D and E), the excess data must be discarded instead. Note also that the costs are the same for both the i.i.d. and non-i.i.d. settings, as our offloading optimization (5)-(9) does not account for local data distributions. Comparing A and B, we see that offloading reduces the total cost by 53% (and the unit cost from 0.265 to 0.118). The network takes advantage of transfer links, reducing the total processing cost by 74% by offloading more data. Our model is robust to estimation errors, similar to our observations from Section IV-A, as we observe only minor changes in cost or accuracy from B to C, which has imperfect network information. When devices have strict capacity constraints, as in D and E, their gradient updates are based on fewer samples, and each node's $L_{i}(w_{i}(t))$ will tend to have larger errors.
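The percentages quoted above can be recomputed directly from the entries of Table III. The short script below is only an arithmetic check; the dictionary layout and helper names are hypothetical. Component sums may differ from the printed Total column by one unit (setting C), presumably because the table entries are rounded.

```python
# Cost components copied from Table III: (process, transfer, discard, unit cost).
TABLE_III = {
    "A": (1234, 0, 0, 0.265),
    "B": (322, 120, 136, 0.118),
    "C": (302, 117, 138, 0.121),
    "D": (336, 63, 187, 0.119),
    "E": (307, 46, 274, 0.136),
}


def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


def total_cost(setting):
    """Sum of the process, transfer, and discard components."""
    return sum(TABLE_III[setting][:3])


for setting in "BCDE":
    process_cut = percent_reduction(TABLE_III["A"][0], TABLE_III[setting][0])
    total_cut = percent_reduction(total_cost("A"), total_cost(setting))
    unit_cut = percent_reduction(TABLE_III["A"][3], TABLE_III[setting][3])
    print(f"{setting}: process -{process_cut:.1f}%, "
          f"total -{total_cut:.1f}%, unit -{unit_cut:.1f}%")
```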

          Discard cost                Accuracy (%)           Network costs
Setting   objective                   i.i.d.   non-i.i.d.    Process   Transfer   Discard   Total
B         $f_i(t)D_i(t)r_i(t)$        87.89    79.68         322       120        136       578
D         $f_i(t)D_i(t)r_i(t)$        83.14    78.25         336       63         187       586
B         $-f_i(t)G_i(t)$             90.05    80.90         390       461        125       976
D         $-f_i(t)G_i(t)$             87.86    78.08         410       244        136       790
B         $f_i(t)/\sqrt{G_i(t)}$      86.81    78.04         323       79         172       574
D         $f_i(t)/\sqrt{G_i(t)}$      85.42    77.26         311       83         184       578
TABLE IV: Effect of varying the discard cost model used in the optimization on the network costs and trained model accuracies, for settings B and D in Table III. Compared with the convex $f_{i}(t)/\sqrt{G_{i}(t)}$, the linear $-f_{i}(t)G_{i}(t)$ prioritizes higher accuracy over lower costs, while $f_{i}(t)D_{i}(t)r_{i}(t)$ is close to the convex case.

Overall, there is a roughly 7% difference in accuracy for the i.i.d. case between settings A and E, and a 13% difference for the non-i.i.d. case; this comes at an improvement of more than 50% in network costs. The accuracy differences for the non-i.i.d. case are larger but exhibit the same trends across settings as the i.i.d. case. If higher accuracy is desired, a different error cost in (5) could be chosen, as we investigate next.

[Figure 6 panels: (a) Process vs. discard ratio; (b) Data movement rates; (c) Unit cost breakdown; (d) Testing accuracy]
Figure 6: Impact of network connectivity $\rho$ on the same aspects of network-aware learning as in Figure 5. Overall, we see that the costs have a linear relationship with $\rho$, that data movement rates increase in response to $\rho$, and that the learning accuracy tends to improve with $\rho$, particularly in the non-i.i.d. case. The shading in (b) indicates the range over time periods.

V-B3 Comparing error cost models

We compare the use of three error cost models from Section IV-A: $f_{i}(t)/\sqrt{G_{i}(t)}$, $-f_{i}(t)G_{i}(t)$, and $f_{i}(t)D_{i}(t)r_{i}(t)$ without modification to $c_{ij}(t)$. For this last case, note that $-f_{i}(t)D_{i}(t)(1-r_{i}(t))\geq-f_{i}(t)G_{i}(t)$ if $D_{i}(t)\sum_{j\neq i}s_{ij}(t)\leq\sum_{j\neq i}s_{ji}(t-1)D_{j}(t-1)$, as is likely since node $i$ is unlikely to offload more data at time $t$ than was offloaded to it at $t-1$. This upper bound on $-f_{i}(t)G_{i}(t)$ may work well for data-intensive applications as it neglects the diminishing marginal returns of $f_{i}(t)/\sqrt{G_{i}(t)}$.
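To make the relationship between the three models concrete, the sketch below evaluates them for a single node and checks the inequality stated above. It assumes that $G_i(t)$ is the data kept locally plus the data received from neighbors, i.e., $G_i(t)=D_i(t)(1-r_i(t)-\sum_{j\neq i}s_{ij}(t))+\sum_{j\neq i}s_{ji}(t-1)D_j(t-1)$; this form, the function names, and the numbers are assumptions made for illustration.

```python
import math


def processed_data(D_i, r_i, s_out, inbound):
    """Assumed G_i(t): data kept locally plus data received from neighbors.

    s_out is the total fraction offloaded by node i; inbound is the amount of
    data offloaded to node i in the previous period.
    """
    return D_i * (1.0 - r_i - s_out) + inbound


def convex_cost(f_i, G_i):
    """f_i / sqrt(G_i): diminishing returns in the amount of processed data."""
    return f_i / math.sqrt(G_i)


def linear_cost(f_i, G_i):
    """-f_i * G_i: every processed datapoint is equally valuable."""
    return -f_i * G_i


def discard_only_cost(f_i, D_i, r_i):
    """f_i * D_i * r_i: penalizes only the data discarded at node i."""
    return f_i * D_i * r_i


# Hypothetical values: the node receives at least as much as it offloads.
f_i, D_i, r_i, s_out, inbound = 1.0, 100.0, 0.2, 0.1, 15.0
G_i = processed_data(D_i, r_i, s_out, inbound)
# f*D*r equals f*D - f*D*(1 - r), so it differs from -f*D*(1 - r) only by f*D.
upper_bound = discard_only_cost(f_i, D_i, r_i) - f_i * D_i
print(G_i, convex_cost(f_i, G_i), linear_cost(f_i, G_i), upper_bound)
```

Here `upper_bound` is at least `linear_cost(f_i, G_i)` because the inbound data (15) exceeds the offloaded data (10), which is the condition stated in the text.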

We show the results for the different cost models in Table IV under settings B and D from Table III. The linear cost $-f_{i}(t)G_{i}(t)$ produces a higher accuracy than using $f_{i}(t)/\sqrt{G_{i}(t)}$, but incurs a higher total cost due to transfer costs from offloading more data for processing. The results for $f_{i}(t)D_{i}(t)r_{i}(t)$ are close to the convex case. By neglecting the offloading terms in the discard cost, we prevent the solution from offloading more data to further reduce the error cost when it is exceeded by the marginal transfer cost.

V-C Effect of Network System Characteristics

Our next experiments investigate the impact of the number of nodes, $n$, the network connectivity, $\rho$, and the aggregation period, $\tau$, on network-aware learning. We use a fully connected topology when varying $n$ and $\tau$, and, for varying $\rho$, we use a random graph in which each link exists with probability $P[(i,j)\in E(t),j\neq i]=\rho$.
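A random topology of this kind can be generated as follows. This is a minimal sketch, assuming directed links that are drawn independently; the function name and the seeded generator are hypothetical and are not part of our simulation code.

```python
import random


def random_topology(n, rho, rng=None):
    """Return the set of directed links (i, j), i != j, each kept with probability rho."""
    rng = rng or random.Random()
    return {
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and rng.random() < rho
    }


# rho = 0 gives a disconnected network and rho = 1 a fully connected one.
links = random_topology(10, 0.5, random.Random(0))
print(len(links), "of", 10 * 9, "possible links")
```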

V-C1 Varying number of nodes $n$

Figure 5 varies $n$ from 5 to 50 in increments of five nodes. Figure 5(a) depicts the fraction of data processed vs. discarded (footnote 3: rounding used during the solution of the optimization problem causes the sum of the processed and discarded data ratios to vary); 5(b) plots the change in the movement rate, i.e., the fraction of data that is either offloaded or discarded; 5(c) breaks down unit cost by component; and 5(d) shows the model accuracy for i.i.d. and non-i.i.d. data.

Our method scales well, as the unit cost in Figure 5(c) decreases with $n$. As the network grows, high-cost nodes are more likely to connect to low-cost nodes, which results in more offloading and is consistent with Theorems 5 and 6. Figure 5(b) confirms this: both the minimum and average data transfer rates grow with network size. As more offloading occurs, more data is processed in Figure 5(a). Although more data is processed, the increased processing cost is outweighed by the savings in discard cost in Figure 5(c). Training on more data then produces a more accurate ML model in Figure 5(d). In Figure 5(d), the non-i.i.d. case shows a substantial accuracy improvement as $n$ increases. Since each node in the non-i.i.d. case only contains half of the data labels for the ML problem, small networks are at a natural disadvantage compared to larger ones: their empirical training dataset is unlikely to be representative of the underlying distribution, no matter what offloading scheme is used. Hence, we see the growth from 55% to 95% testing accuracy when $n$ increases from 5 to 50.

[Figure 7 panels: (a) Process vs. discard ratio; (b) Data movement rates; (c) Unit cost breakdown; (d) Testing accuracy]
Figure 7: Impact of the aggregation period $\tau$ on the same aspects of network-aware learning as in Figure 5. Overall, we see that a higher $\tau$ reduces total costs in (c), but decreases performance in (d), particularly for non-i.i.d. data, as the device models overfit their local distributions.

V-C2 Varying network connectivity $\rho$

Figure 6 examines the same characteristics as Figure 5 as $\rho$ is varied from 0 (i.e., completely disconnected) to 1 (i.e., fully connected). Overall, we observe a similar trend to the effect of $n$: as connectivity grows, the unit cost per datapoint decreases in Figure 6(c), caused by cheaper alternatives to discarding. The relatively small change in total cost as $\rho$ varies indicates that network-aware learning is robust to variations in device connectivity.

High connectivity produces more opportunities for offloading in Figure 6(b), which increases the total data processed and decreases the total data discarded in Figure 6(a). Discard costs then take a smaller share of the unit costs in Figure 6(c). Intuitively, processing more data leads to a more accurate model in Figure 6(d). For non-i.i.d. data, the data transfer growth seen in Figures 6(b) and 6(c) increases the dataset similarity among devices, allowing model accuracy to improve. Increased network connectivity has a similar effect to having a larger network: there are more network links to nodes with low processing cost, which network-aware learning can leverage to produce cost savings without compromising model accuracy.

V-C3 Varying aggregation period $\tau$

Figure 7 varies the aggregation period $\tau$. Overall, a larger $\tau$ decreases unit cost and testing accuracy. We expect local models to converge as $\tau$ grows, similar to the effects studied in [5]. As devices train on their local datasets for longer, they begin to converge, reducing the value of processing data. As a result, discarding becomes cost-effective in Figure 7(a), and the discard costs dominate in Figure 7(c), with the data movement rate in Figure 7(b) increasing due to this extra discarding. Finally, since $T$ is constant, a higher $\tau$ reduces the number of global aggregations, which results in decreased test accuracy in Figure 7(d), consistent with our findings in Theorem 1. Frequent aggregations are more important in the non-i.i.d. case, where aggregations are needed to prevent overfitting to devices' distinct local distributions. Hence, a large $\tau$ results in significantly worse accuracy for non-i.i.d. data.
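The drift of local models between aggregations can be illustrated with a toy problem. The sketch below is not our experimental setup: it assumes scalar models, quadratic local losses $f_i(w)=\frac{1}{2}(w-a_i)^2$ with device-specific targets $a_i$ (a stand-in for non-i.i.d. data), an unweighted average at each aggregation, and a hypothetical step size.

```python
def local_drift(targets, tau, total_steps, lr=0.1):
    """Largest distance of any local model from the global optimum, measured
    just before an aggregation, together with the number of aggregations."""
    n = len(targets)
    optimum = sum(targets) / n  # minimizer of the average loss
    models = [optimum] * n      # start all devices at the global optimum
    worst_drift, aggregations = 0.0, 0
    for step in range(1, total_steps + 1):
        # One local gradient step on f_i(w) = 0.5 * (w - a_i)^2.
        models = [w - lr * (w - a) for w, a in zip(models, targets)]
        if step % tau == 0:
            worst_drift = max(worst_drift, max(abs(w - optimum) for w in models))
            average = sum(models) / n
            models = [average] * n  # global aggregation (unweighted average)
            aggregations += 1
    return worst_drift, aggregations


for tau in (1, 5, 10):
    print(tau, local_drift([-1.0, 1.0], tau, total_steps=100))
```

With a fixed number of steps, a larger `tau` yields fewer aggregations and lets each device move further toward its own target before being pulled back, which mirrors the local overfitting described above.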

V-D Effect of Fog Topology

Next, we evaluate network-aware learning on three fog computing topologies: hierarchical and social network topologies as in Section IV-B, and a fully-connected topology in which all nodes are neighbors. The social network is modeled as a Watts-Strogatz small world graph [46] with each node connected to $n/5$ of its neighbors, and the hierarchical network connects each of the $n/3$ nodes with the lowest processing costs to two of the $2n/3$ remaining nodes, randomly.
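The hierarchical construction can be sketched as below. This is an illustration under stated assumptions: integer division is used for $n/3$, links are stored as (leaf, hub) pairs to indicate the offloading direction, and the function name and cost values are hypothetical.

```python
import random


def hierarchical_topology(costs, rng=None):
    """Connect each of the n/3 cheapest nodes to two randomly chosen other nodes.

    Returns the list of low-cost "hub" nodes and a set of (leaf, hub) links.
    """
    rng = rng or random.Random()
    n = len(costs)
    order = sorted(range(n), key=lambda i: costs[i])
    hubs, rest = order[: n // 3], order[n // 3:]
    links = set()
    for hub in hubs:
        # rng.sample draws two distinct nodes from the remaining 2n/3 nodes.
        for leaf in rng.sample(rest, 2):
            links.add((leaf, hub))
    return hubs, links


# Hypothetical processing costs for nine devices.
costs = [5, 1, 9, 3, 7, 2, 8, 4, 6]
hubs, links = hierarchical_topology(costs, random.Random(0))
print(sorted(hubs), sorted(links))
```

Each hub has only two links in this construction, far fewer than the $n-1$ links per node of the fully-connected topology.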

Figure 8: Cost components for social, hierarchical, and fully connected topologies running network-aware learning on (a) LTE and (b) WiFi network media. Discard costs dominate for each topology in the case of WiFi, while the higher cost factor for LTE depends on the topology.

Our Raspberry Pi testbed provides LTE and WiFi network media for which we compare the network resource costs. Both media exhibit similar trends across the topologies in Figure 8. The topology determines offloading availability: the fully-connected topology maximizes the degree of each node, while the hierarchical topology minimizes the average degree. A smaller average degree limits offloading, which leads to more data processed locally and/or discarded. The major difference between LTE and WiFi is that WiFi skews more towards discarding. WiFi has fewer interference mitigation techniques than cellular, so, in the presence of several devices, we expect its links to exhibit longer delays. Consequently, both the discard and transfer costs are larger for WiFi than their LTE counterparts, regardless of topology. Results in both cases are consistent with the findings from varying the network connectivity in Figure 6 too: as networks become less connected and edges grow sparse, the ability of individual devices to offload their data to lower cost alternatives diminishes. Devices will marginally increase their data processing workloads, but ultimately a significant fraction of the data is discarded.

V-E Effect of Dynamic Networks

Finally, we consider network-aware learning when nodes may enter and exit the network. Initially, all devices are in the network. At each $t$, devices in the network will exit with probability $p_{exit}$, while devices outside the network will re-enter with probability $p_{entry}$. For worst-case analysis, nodes cannot transmit their local update results prior to exiting, and nodes rejoining the network cannot obtain the current global parameters until the ongoing aggregation period finishes.
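The entry and exit process can be simulated with a two-state chain per device. The sketch below is a simplification made for illustration: it tracks only network membership and does not model the worst-case rule that rejoining nodes wait for the next aggregation. The function names and the horizon are hypothetical; the closed-form expectation is the standard result for a two-state Markov chain.

```python
import random


def simulate_churn(n, p_exit, p_entry, periods, rng=None):
    """Number of devices in the network after each period, starting with all n present."""
    rng = rng or random.Random()
    present = [True] * n
    counts = []
    for _ in range(periods):
        present = [
            (rng.random() >= p_exit) if here else (rng.random() < p_entry)
            for here in present
        ]
        counts.append(sum(present))
    return counts


def expected_present(n, p_exit, p_entry, t):
    """Expected number of devices present after t periods (two-state Markov chain)."""
    limit = p_entry / (p_exit + p_entry)  # long-run fraction of devices present
    return n * (limit + (1.0 - limit) * (1.0 - p_exit - p_entry) ** t)


counts = simulate_churn(10, 0.01, 0.01, periods=50, rng=random.Random(0))
print(sum(counts) / len(counts), expected_present(10, 0.01, 0.01, 50))
```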

                              Cost
Setting   Acc (%)   Nodes     Process   Transfer   Discard   Unit
Static    95.83     10        399       66         328       0.135
Dynamic   94.79     7.8       300       56         256       0.144
TABLE V: Comparison of network-aware learning characteristics on static and dynamic networks, with the probability of nodes entering and exiting fixed at 1%. "Acc" represents model accuracy and "Nodes" represents the average number of active nodes per aggregation period. Overall, we see that node churn of 20% has an impact of roughly 6% on unit costs, and 1% on accuracy.

Table V compares a dynamic network with $p_{exit}=p_{entry}=1\%$ against the static case. In a dynamic network, network-aware learning operates with less overall network data and compute capability, as the number of active nodes per aggregation period decreases from 10 in the static case to an average of 7.8 in the dynamic case. Node exits always result in at least one inactive node: even if a new node enters, it must wait for the synchronized global parameters. Our methodology is nonetheless reasonably robust in dynamic networks, as a roughly 20% decline in active nodes per period leads to only a 6% increase in the unit cost incurred per datapoint and a 1% accuracy decline, due to less processed data and more discarding (also visible from the ratio of processed to discard costs).

[Figure 9 panels: (a) Mean active nodes; (b) Process vs. discard; (c) Data movement rates; (d) Total cost breakdown; (e) Testing accuracy]
Figure 9: Impact of increases in the probability of node exits on data movement and costs, with the probability of node re-entry fixed at 2%. The shading in (a) and (c) indicates the range over time periods. While the i.i.d. case is more robust to dynamic network environments, both i.i.d. and non-i.i.d. cases indicate the same decreasing trends in testing accuracy as $p_{exit}$ grows.
[Figure 10 panels: (a) Mean active nodes; (b) Process vs. discard; (c) Data movement rates; (d) Total cost breakdown; (e) Testing accuracy]
Figure 10: Impact of increases in the probability of node re-entry in each time period, with the probability of node exits fixed at 2%. The shading in (a) and (c) indicates the range over time periods. Both i.i.d. and non-i.i.d. cases improve in performance as $p_{entry}$ increases.

V-E1 Varying $p_{exit}$

Figure 9 varies $p_{exit}$ from 0 to 5% and fixes $p_{entry}=2\%$. Figure 9(a) shows the variation in average active nodes per period, and the remaining four subfigures display the same aspects of network-aware learning as in Figures 5-7.

Figure 9(a) depicts a sharp decline in the number of active nodes per period as $p_{exit}$ grows. At $p_{exit}=5\%$, the network averages fewer than 6 active nodes per period, a 40% decrease from network initialization. The total generated data and cost decrease sharply in Figures 9(b) and 9(c), respectively: since the network has fewer active nodes, there is less data overall and therefore less cost. While the ratio of processed to discarded data in Figure 9(b) skews towards processed data as $p_{exit}$ increases, the total cost in Figure 9(d) skews toward discard costs: fewer active nodes implies fewer offloading opportunities, and network-aware learning discards more data. As a result, the average data movement rate drops from 0.55 to 0.2 in Figure 9(c). The i.i.d. testing accuracy declines by approximately 4% in Figure 9(e), with a larger decline for non-i.i.d. data.

V-E2 Varying $p_{entry}$

Figure 10 varies $p_{entry}$ from 0 to 5% with $p_{exit}=2\%$, and the average number of active nodes per period increases with the probability of node entry until 4% in Figure 10(a). Both the total data generated (Figure 10(b)) and the average data movement rate (Figure 10(c)) increase with more active nodes, but more data leads to larger processing, discard, and total costs in Figure 10(d). As $p_{entry}$ increases, the network benefits from scale, i.e., more efficient offloading, which enables the network to train a more accurate ML model for the i.i.d. and non-i.i.d. cases in Figure 10(e). However, the overall non-i.i.d. testing accuracy is unable to match that of the i.i.d. case, as node exits distort the distribution of local data across active devices further from the true global distribution.

Figures 9 and 10 exhibit a consistent pattern: the costs change rapidly and then plateau around when $p_{exit}\geq p_{entry}$ for Figure 9 and $p_{entry}\geq p_{exit}$ for Figure 10. For the non-i.i.d. case, we notice that $p_{exit}$ has a stronger effect on the testing accuracy in Figure 9 than $p_{entry}$ does in Figure 10; in the non-i.i.d. case, there is less overlap between devices' datasets, so each node exit affects the empirical training dataset more substantially and results in a stronger impact on the accuracy.

VI Conclusion and Future Work

In this paper, we developed a novel methodology to distribute ML training tasks over devices in a fog computing network while considering the compute-communication-accuracy tradeoffs inherent in fog scenarios. We derived new error bounds when devices transfer their local data processing to each other, and bounded the impact of these transfers on the cost and accuracy of the model training. Through experimentation with a popular machine learning task, we showed that our network-aware scheme significantly reduces the cost of model training while achieving comparable accuracy to the recently popularized federated learning algorithm for distributed training, and demonstrated the effects of network characteristics and device data distributions on its performance.

Our framework and analysis point to several possible extensions. First, while we do not observe significant heterogeneity in compute times on our wireless testbed, in general fog devices may experience compute straggling and failures, which might benefit from more sophisticated offloading mechanisms. Second, predicting devices’ mobility patterns and network connectivity could likely further optimize the data offloading. Finally, learning device-specific models when local distributions are non-i.i.d. could introduce new performance tradeoffs between offloading and data processing.

Acknowledgment

This work was partially supported by NSF CNS-1909306, by ARO grant W911NF1910036, and by NSWC Crane. We thank the anonymous reviewers for their valuable comments.

References

  • [1] Y. Tu, Y. Ruan, S. Wagle, C. Brinton, and C. Joe-Wong, “Network-Aware Optimization of Distributed Learning for Fog Computing,” in IEEE INFOCOM, 2020.
  • [2] Cisco Systems, “Demystifying 5G in Industrial IOT,” White Paper, 2019. [Online]. Available: https://www.cisco.com/c/dam/en_us/solutions/iot/demystifying-5g-industrial-iot.pdf
  • [3] D. Chatzopoulos, C. Bermejo, Z. Huang, and P. Hui, “Mobile Augmented Reality Survey: From where we are to where we go,” IEEE Access, vol. 5, pp. 6917–6950, 2017.
  • [4] K. Rao, “The Path to 5G for Health Care,” IEEE Perspectives on 5G Applications and Services. [Online]. Available: https://futurenetworks.ieee.org/images/files/pdf/applications/5G--Health-Care030518.pdf
  • [5] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive Federated Learning in Resource Constrained Edge Computing Systems,” IEEE JSAC, vol. 37, no. 6, pp. 1205–1221, 2019.
  • [6] M. Chiang and T. Zhang, “Fog and IoT: An Overview of Research Opportunities,” IEEE Int. Things J., vol. 3, no. 6, pp. 854–864, 2016.
  • [7] IEEE Spectrum, “Applications of Device-to-Device Communication in 5G Networks,” White Paper. [Online]. Available: https://spectrum.ieee.org/computing/networks/applications-of-devicetodevice-communication-in-5g-networks
  • [8] M. Somisetty, “Big Data Analytics in 5G,” IEEE Perspectives on 5G Applications and Services. [Online]. Available: https://futurenetworks.ieee.org/images/files/pdf/applications/Data-Analytics-in-5G-Applications030518.pdf
  • [9] S. Pu, W. Shi, J. Xu, and A. Nedić, “A Push-Pull Gradient Method for Distributed Optimization in Networks,” in IEEE Conference on Decision and Control (CDC), 2018, pp. 3385–3390.
  • [10] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in AISTATS, 2017.
  • [11] T.-Y. Yang, C. Brinton, P. Mittal, M. Chiang, and A. Lan, “Learning Informative and Private Representations via Generative Adversarial Networks,” in IEEE Intl. Conf. on Big Data.   IEEE, 2018, pp. 1534–1543.
  • [12] S. A. Ashraf, I. Aktas, E. Eriksson, K. W. Helmersson, and J. Ansari, “Ultra-Reliable and Low-Latency Communication for Wireless Factory Automation: From LTE to 5G,” in IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 2016, pp. 1–8.
  • [13] L. He, A. Bian, and M. Jaggi, “Cola: Decentralized linear learning,” in NeurIPS, 2018, pp. 4536–4546.
  • [14] J. B. Predd, S. R. Kulkarni, and H. V. Poor, “A collaborative training algorithm for distributed learning,” IEEE Transactions on Information Theory, vol. 55, no. 4, pp. 1856–1871, 2009.
  • [15] H. Tang, C. Yu, C. Renggli, S. Kassing, A. Singla, D. Alistarh, J. Liu, and C. Zhang, “Distributed learning over unreliable networks,” ArXiv, vol. abs/1810.07766, 2019.
  • [16] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, “Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD,” in AISTATS, 2018, pp. 803–812.
  • [17] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated Learning: Strategies for Improving Communication Efficiency,” in NeurIPS, 2016.
  • [18] R. Shokri and V. Shmatikov, “Privacy-Preserving Deep Learning,” in ACM SIGSAC, 2015, pp. 1310–1321.
  • [19] S. A. Rahman, H. Tout, H. Ould-Slimane, A. Mourad, C. Talhi, and M. Guizani, “A survey on federated learning: The journey from centralized to distributed on-site learning and beyond,” IEEE Internet of Things Journal, 2020.
  • [20] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated Multi-Task Learning,” in NeurIPS, 2017, pp. 4424–4434.
  • [21] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated Learning with Non-IID Data,” arXiv:1806.00582, 2018.
  • [22] G. Neglia, G. Calbi, D. Towsley, and G. Vardoyan, “The Role of Network Topology for Distributed Machine Learning,” in IEEE Conference on Computer Communications (INFOCOM), 2019, pp. 2350–2358.
  • [23] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016.
  • [24] N. Strom, “Scalable distributed dnn training using commodity gpu cloud computing,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [25] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” Proceedings of the 2017 Conference on EMNLP, 2017. [Online]. Available: http://dx.doi.org/10.18653/v1/D17-1045
  • [26] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Robust and communication-efficient federated learning from non-iid data,” 2019.
  • [27] Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with asynchronous model update and temporally weighted aggregation,” 2019.
  • [28] A. Lalitha, O. C. Kilinc, T. Javidi, and F. Koushanfar, “Peer-to-peer federated learning on graphs,” 2019.
  • [29] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated Learning over Wireless Networks: Optimization Model Design and Analysis,” in IEEE INFOCOM, 2019, pp. 1387–1395.
  • [30] T. Chang, L. Zheng, M. Gorlatova, C. Gitau, C.-Y. Huang, and M. Chiang, “Demo: Decomposing Data Analytics in Fog Networks,” in ACM Conference on Embedded Networked Sensor Systems (SenSys), 2017.
  • [31] X. Ran, H. Chen, X. Zhu, Z. Liu, and J. Chen, “DeepDecision: A Mobile Deep Learning Framework for Edge Video Analytics,” in IEEE INFOCOM, 2018, pp. 1421–1429.
  • [32] C. Hu, W. Bao, D. Wang, and F. Liu, “Dynamic Adaptive DNN Surgery for Inference Acceleration on the Edge,” in IEEE INFOCOM, 2019, pp. 1423–1431.
  • [33] S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Distributed Deep Neural Networks over the Cloud, the Edge and End Devices,” in IEEE ICDCS, 2017, pp. 328–339.
  • [34] X. Xu, D. Li, Z. Dai, S. Li, and X. Chen, “A heuristic offloading method for deep learning edge services in 5g networks,” IEEE Access, vol. 7, pp. 67 734–67 744, 2019.
  • [35] H. Li, K. Ota, and M. Dong, “Learning iot in edge: Deep learning for the internet of things with edge computing,” IEEE Network, vol. 32, pp. 96–101, 2018.
  • [36] Y. Sun, S. Zhou, and D. Gündüz, “Energy-aware analog aggregation for federated learning with redundant data,” in IEEE ICC.   IEEE, 2020, pp. 1–7.
  • [37] T. Yang, “Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent,” in NeurIPS, 2013, pp. 629–637.
  • [38] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic Lower Bounds for Distributed Statistical Estimation with Communication Constraints,” in NeurIPS, 2013, pp. 2328–2336.
  • [39] K. P. Murphy, Machine Learning: A Probabilistic Perspective.   MIT Press, 2012.
  • [40] F. Farhat, D. Z. Tootaghaj, Y. He, A. Sivasubramaniam, M. Kandemir, and C. R. Das, “Stochastic Modeling and Optimization of Stragglers,” IEEE Trans. on Cloud Comp., vol. 6, no. 4, pp. 1164–1177, 2016.
  • [41] F. M. F. Wong, Z. Liu, and M. Chiang, “On the Efficiency of Social Recommender Networks,” IEEE/ACM Trans. Netw., vol. 24, pp. 2512–2524, 2016.
  • [42] C. A. Bouman and K. Sauer, “A unified approach to statistical tomography using coordinate descent optimization,” IEEE Transactions on image processing, vol. 5, no. 3, pp. 480–492, 1996.
  • [43] A. Nadh, J. Samuel, A. Sharma, S. Aniruddhan, and R. K. Ganti, “A taylor series approximation of self-interference channel in full-duplex radios,” IEEE Trans. Wirls. Comm., vol. 16, no. 7, pp. 4304–4316, 2017.
  • [44] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST Database of Handwritten Digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/
  • [45] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation Applied to Handwritten Zip Code Recognition,” Neural Comp., vol. 1, no. 4, pp. 541–551, 1989.
  • [46] M. Chiang, Networked Life: 20 Questions and Answers.   Cambridge University Press, 2012.
Su Wang is a second year PhD student in Electrical and Computer Engineering at Purdue University. He received his BS in Electrical Engineering from Purdue in 2018.
Yichen Ruan is a Ph.D. candidate in Electrical and Computer Engineering at Carnegie Mellon University. He received his B.S. and M.S. degrees respectively from Tsinghua University and UC Berkeley.
Yuwei Tu is an independent researcher conducting research on machine learning and network optimization. She graduated from the Center for Data Science at New York University with a master's degree in data science.
Satyavrat Wagle is a researcher in the Software Systems and Services Research Area (SSS-RA) at Tata Consultancy Services (TCS) Research. He holds an MS in Electrical Engineering from Carnegie Mellon University.
Christopher G. Brinton (SM’20) is an Assistant Professor of Electrical and Computer Engineering at Purdue University. He received his PhD degree in Electrical Engineering from Princeton University in 2016. Since joining Purdue in 2019, he has won several awards including the Seed for Success Award and the Ruth and Joel Spira Outstanding Teacher Award.
Carlee Joe-Wong (M’16) is the Robert E. Doherty Assistant Professor of Electrical and Computer Engineering at Carnegie Mellon University. She received her A.B., M.A., and Ph.D. degrees from Princeton University in 2011, 2013, and 2016, respectively. She has received several awards for her work, including the Army Young Investigator and NSF CAREER awards.

Appendix A Proofs of All Theorems

A-A Proof of Theorem 1

Proof:

To aid in analysis, we define $v_{k}(t)=v_{k}(t-1)-\eta\nabla L(v_{k}(t-1)|\mathcal{D})$ for $t\in\{(k-1)\tau,\ldots,k\tau\}$ as the centralized version of the gradient descent update, synchronized with the weighted average $w(t)$ after every global aggregation $k$, i.e., $v_{k+1}(k\tau)\leftarrow w(k)$. With $\theta_{k}(t)=L(v_{k}(t))-L(w^{\star})$ and letting $K=\lfloor t/\tau\rfloor$, we can write

$$\begin{aligned}
\frac{1}{\theta_{K+1}(t)}-\frac{1}{\theta_{1}(0)} &= \Big(\frac{1}{\theta_{K+1}(t)}-\frac{1}{\theta_{K+1}(K\tau)}\Big)+\Big(\frac{1}{\theta_{K+1}(K\tau)}-\frac{1}{\theta_{K}(K\tau)}\Big)+\Big(\frac{1}{\theta_{K}(K\tau)}-\frac{1}{\theta_{1}(0)}\Big) \\
&\geq \Big((t-K\tau)\omega\eta\big(1-\frac{\beta\eta}{2}\big)\Big)+\Big(-\frac{\rho h(\tau)}{\epsilon^{2}}\Big)+\Big(K\tau\omega\eta\big(1-\frac{\beta\eta}{2}\big)-(K-1)\frac{\rho h(\tau)}{\epsilon^{2}}\Big) \\
&= t\omega\eta\Big(1-\frac{\beta\eta}{2}\Big)-K\frac{\rho h(\tau)}{\epsilon^{2}}
\end{aligned} \tag{17}$$

where the three inequalities use the results $\frac{1}{\theta_{k}(k\tau)}-\frac{1}{\theta_{k}((k-1)\tau)}\geq\tau\omega\eta(1-\frac{\beta\eta}{2})$, $\frac{1}{\theta_{k+1}(k\tau)}-\frac{1}{\theta_{k}(k\tau)}\geq-\frac{\rho h(\tau)}{\epsilon^{2}}$, and $\frac{1}{\theta_{K}(T)}-\frac{1}{\theta_{1}(0)}\geq T\omega\eta(1-\frac{\beta\eta}{2})-(K-1)\frac{\rho h(\tau)}{\epsilon^{2}}$ from Lemma 2 in [5]. Here, $\omega=\min_{k}\frac{1}{\|v_{k}((k-1)\tau)-w^{\star}\|^{2}}$ and $h(x)=\frac{\delta}{\beta}((\eta\beta+1)^{x}-1)-\eta\delta x$ for $x\in\{0,1,\ldots\}$. Additionally, if we assume $\theta_{k}(k\tau)=L(v_{k}(k\tau))-L(w^{\star})\geq\epsilon$, we can write

$$\begin{aligned}
\frac{1}{L(w_{i}(t))-L(w^{\star})}-\frac{1}{\theta_{K+1}(t)} &= \frac{\theta_{K+1}(t)-\big(L(w_{i}(t))-L(w^{\star})\big)}{\big(L(w_{i}(t))-L(w^{\star})\big)\theta_{K+1}(t)} \\
&= \frac{L(v_{K+1}(t))-L(w_{i}(t))}{\big(L(w_{i}(t))-L(w^{\star})\big)\theta_{K+1}(t)}\geq-\frac{\rho g_{i}(t-K\tau)}{\epsilon^{2}}
\end{aligned} \tag{18}$$

where the inequality uses the result $\|w_{i}(t)-v_{k}(t)\|\leq g_{i}(t-(k-1)\tau)$ for any $k$ from Lemma 3 in [5], and the $\rho$-Lipschitz assumption on $L_{i}(w)$, which extends to $L(w)$ by the triangle inequality, i.e., $\|L(x)-L(y)\|\leq\rho\|x-y\|$ for any $x,y$. Adding the results from (17) and (18) and noting $\theta_{k}(t)\geq 0$, we have

$$\frac{1}{L(w_{i}(t))-L(w^{\star})}\geq\frac{1}{L(w_{i}(t))-L(w^{\star})}-\frac{1}{\theta_{1}(0)}\geq t\omega\eta\Big(1-\frac{\beta\eta}{2}\Big)-\frac{\rho}{\epsilon^{2}}\big(Kh(\tau)+g_{i}(t-K\tau)\big)$$

Taking the reciprocal, it follows that

$$L(w_{i}(t))-L(w^{\star})\leq\frac{1}{t\omega\eta\big(1-\frac{\beta\eta}{2}\big)-\frac{\rho}{\epsilon^{2}}\big(Kh(\tau)+g_{i}(t-K\tau)\big)}=y(\epsilon) \tag{19}$$

for $y(\epsilon)>0$. Now, let $\epsilon_{0}$ be the positive root of $y(\epsilon)=\epsilon$, which is easy to check exists. We can show that one of the following conditions must be true: (i) $\min_{k\leq K}L(v_{k}(k\tau))-L(w^{\star})\leq\epsilon_{0}$ or (ii) $L(w_{i}(t))-L(w^{\star})\leq\epsilon_{0}$. If we assume $L(v_{k}(k\tau))$ is non-increasing with $k$, then from (i) and (ii) we can write $L(w_{i}(t))\leq L(w^{\star})+\epsilon_{0}$ or $L(v_{K+1}(K\tau))\leq L(w^{\star})+\epsilon_{0}$. But we already know $\|w_{i}(t)-v_{k}(t)\|\leq g_{i}(t-(k-1)\tau)$ for any $k$, so with the $\rho$-Lipschitz assumption, $L(w_{i}(t))-L(v_{K+1}(t))\leq\rho g_{i}(t-K\tau)$. If (i) holds, then $L(w_{i}(t))\leq L(w^{\star})+\epsilon_{0}+\rho g_{i}(t-K\tau)$. Comparing the two cases, this bound must always be true since $\rho,g_{i}\geq 0$. ∎
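The function $h(x)$ and the penalty term $Kh(\tau)$ appearing in (17) can be evaluated numerically. The sketch below uses hypothetical values of $\eta$, $\beta$, and $\delta$ purely for illustration, with $K=\lfloor T/\tau\rfloor$ for a fixed horizon $T$.

```python
def h(x, eta, beta, delta):
    """h(x) = (delta / beta) * ((eta * beta + 1)^x - 1) - eta * delta * x."""
    return (delta / beta) * ((eta * beta + 1.0) ** x - 1.0) - eta * delta * x


def aggregation_penalty(T, tau, eta, beta, delta):
    """K * h(tau) with K = floor(T / tau), the term subtracted in (17)."""
    return (T // tau) * h(tau, eta, beta, delta)


# Hypothetical constants; h(0) and h(1) vanish, so tau = 1 incurs no penalty.
eta, beta, delta = 0.01, 1.0, 1.0
for tau in (1, 2, 5, 10):
    print(tau, aggregation_penalty(100, tau, eta, beta, delta))
```

For these values the penalty grows with $\tau$ even though the number of aggregations $K$ shrinks, in line with the accuracy loss observed for large aggregation periods in Figure 7(d).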

A-B Proof of Theorem 2

Proof:

Since we assume all costs and capacities are constant, we first note that $G_{i}(t)$ can also be assumed to be constant. Thus, the processing of data at node $i$ can be modeled as a D/M/1 queue with constant inter-arrival time $1/G_{i}(t)$ (i.e., arrival rate $G_{i}(t)$) and exponential service time. The expected waiting time of such a queue is then equal to $\delta/\left(\mu(1-\delta)\right)$, where $\delta$ is the smallest solution to $\delta=\exp\left(-\mu(1-\delta)/G_{i}(t)\right)$. Upon showing that the expected waiting time is an increasing function of $\delta$ and that $\delta$ is an increasing function of $G_{i}(t)$, it follows that to ensure an expected waiting time no larger than $\sigma$, we should choose $G_{i}(t)\leq C$, where $C$ is the maximum arrival rate such that $\delta(C)/\left(\mu(1-\delta(C))\right)=\sigma$. ∎
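To make the threshold $C$ concrete, the following minimal sketch computes $\delta$, the expected waiting time, and $C$ numerically. It assumes that $\mu$ is the exponential service rate and that the arrival rate is strictly below $\mu$; the function names, the fixed-point iteration, and the bisection search are illustrative choices, not part of the original proof.

```python
import math

def dm1_delta(arrival_rate, mu, tol=1e-12, max_iter=1_000_000):
    """Smallest root of delta = exp(-mu * (1 - delta) / arrival_rate).

    Iterating from delta = 0 increases monotonically to the smallest
    fixed point, because the right-hand side is increasing in delta.
    """
    if not 0 < arrival_rate < mu:
        raise ValueError("need 0 < arrival_rate < mu for a stable queue")
    delta = 0.0
    for _ in range(max_iter):
        new_delta = math.exp(-mu * (1.0 - delta) / arrival_rate)
        if abs(new_delta - delta) < tol:
            return new_delta
        delta = new_delta
    return delta

def expected_wait(arrival_rate, mu):
    """Expected waiting time delta / (mu * (1 - delta)) of the D/M/1 queue."""
    delta = dm1_delta(arrival_rate, mu)
    return delta / (mu * (1.0 - delta))

def max_arrival_rate(mu, sigma, iters=60):
    """Largest arrival rate C whose expected wait is at most sigma.

    Bisection relies on the monotonicity argued in the proof: the
    expected wait increases with the arrival rate.
    """
    lo, hi = 0.0, 0.999 * mu  # hypothetical search bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_wait(mid, mu) <= sigma:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical parameter values, for illustration only.
mu, sigma = 2.0, 0.5
C = max_arrival_rate(mu, sigma)
print(C, expected_wait(C, mu))
```

Any $G_{i}(t)$ at or below the returned value keeps the expected wait within $\sigma$ in this model.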

A-C Proof of Theorem 4

Proof:

In the hierarchical scenario described in the statement of the theorem, the cost objective (5) can be rewritten as

$$\begin{aligned}
&\sum_{i}(1-r_{i}-s_{i})D_{i}c_{i}+\sum_{i}s_{i}D_{i}(c_{n+1}+c_{t})\\
&+\sum_{i}\frac{\gamma}{\sqrt{(1-r_{i}-s_{i})D_{i}}}+\frac{\gamma}{\sqrt{\sum_{i}s_{i}D_{i}}}.
\end{aligned}$$

Taking the partial derivative of the cost objective with respect to $r_{i}$ and setting it to 0 gives:

$$-D_{i}c_{i}+\frac{\gamma(1/2)D_{i}}{((1-r_{i}-s_{i})D_{i})^{3/2}}=0$$

Rearranging gives $2c_{i}=\frac{\gamma}{((1-r_{i}-s_{i})D_{i})^{3/2}}$, which yields:

$$r_{i}^{*}=1-\frac{(\gamma/2c_{i})^{2/3}}{D_{i}}-s_{i}.$$

Using the expression for $r_{i}^{*}$, the objective function becomes:

$$\begin{aligned}
&\sum_{i}(1-r_{i}^{*}-s_{i})D_{i}c_{i}+\sum_{i}s_{i}D_{i}(c_{n+1}+c_{t})\\
&+\sum_{i}\frac{\gamma}{\sqrt{(1-r_{i}^{*}-s_{i})D_{i}}}+\frac{\gamma}{\sqrt{\sum_{i}s_{i}D_{i}}}.
\end{aligned}$$

Taking the partial derivative with respect to $s_{i}$ and setting it to 0 gives:

$$\begin{aligned}
&D_{i}c_{i}\Big(-\frac{dr_{i}^{*}}{ds_{i}}-1\Big)+D_{i}(c_{n+1}+c_{t})\\
&-\frac{\gamma D_{i}\big(-\frac{dr_{i}^{*}}{ds_{i}}-1\big)}{2((1-r_{i}^{*}-s_{i})D_{i})^{3/2}}-\frac{\gamma D_{i}}{2(\sum_{j}s_{j}D_{j})^{3/2}}=0.
\end{aligned}$$

Since $r_{i}+s_{i}=1$ at the optimal point, $\frac{dr_{i}^{*}}{ds_{i}}=-1$ and we obtain $D_{i}(c_{n+1}+c_{t})-\frac{\gamma D_{i}}{2(\sum_{j}s_{j}D_{j})^{3/2}}=0$ for each $i$. We can then enforce $s_{i}=s_{j}$ for all $i,j$ and rearrange for $s_{i}$ to obtain:

$$s_{i}^{*}=\frac{1}{\sum_{j}D_{j}}\left(\frac{\gamma}{2(c_{n+1}+c_{t})}\right)^{2/3}$$

Finally, note that a large $D_{i}$ forces $r_{i},s_{i}\in[0,1]$, which gives the result. ∎
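The closed-form expressions for $r_{i}^{*}$ and $s_{i}^{*}$ can be sanity-checked numerically. The sketch below evaluates the rewritten cost objective and confirms by central finite differences that its gradient vanishes at the closed-form point. All parameter values and function names are hypothetical, and the error terms are assumed to use $D_{i}$ under the square root.

```python
import math

def closed_form_fractions(D, c, c_server, c_t, gamma):
    """Closed-form r_i* and s_i* from the proof (c_server stands for c_{n+1})."""
    s = (gamma / (2.0 * (c_server + c_t))) ** (2.0 / 3.0) / sum(D)
    r = [1.0 - (gamma / (2.0 * ci)) ** (2.0 / 3.0) / Di - s
         for Di, ci in zip(D, c)]
    return r, [s] * len(D)

def cost(r, s, D, c, c_server, c_t, gamma):
    """Cost objective in the hierarchical scenario, term by term."""
    processed = [(1.0 - ri - si) * Di for ri, si, Di in zip(r, s, D)]
    offloaded = sum(si * Di for si, Di in zip(s, D))
    return (sum(p * ci for p, ci in zip(processed, c))
            + offloaded * (c_server + c_t)
            + sum(gamma / math.sqrt(p) for p in processed)
            + gamma / math.sqrt(offloaded))

def partial(f, x, j, h=1e-6):
    """Central finite-difference estimate of the j-th partial derivative."""
    up, dn = list(x), list(x)
    up[j] += h
    dn[j] -= h
    return (f(up) - f(dn)) / (2.0 * h)

# Hypothetical example values (not taken from the paper).
D = [1000.0, 2000.0, 1500.0]
c = [0.5, 0.8, 0.3]
c_server, c_t, gamma = 0.2, 0.1, 100.0

r_star, s_star = closed_form_fractions(D, c, c_server, c_t, gamma)
n = len(D)
objective = lambda x: cost(x[:n], x[n:], D, c, c_server, c_t, gamma)
grads = [partial(objective, r_star + s_star, j) for j in range(2 * n)]
print(r_star, s_star, grads)
```

With these values every $D_{i}$ is large enough that the computed fractions fall inside $[0,1]$, which is the regime the final step of the proof relies on.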

A-D Proof of Theorem 5

Proof:

From Theorem 3, we can write the expected cost savings as

$$\mathbb{E}\left[\max\left\{0,c_{i}-\min_{j}c_{j}\right\}\right]=\int_{0}^{C}\int_{0}^{c_{i}}\frac{k(c_{i}-y)}{C^{2}}\left(\frac{C-y}{C}\right)^{k-1}\,dy\,dc_{i} \tag{20}$$

where we take $y=\min_{j}c_{j}$ and use the fact that the minimum of $k$ i.i.d. uniform random variables on $[0,C]$ has the probability density $f(y)=\frac{k}{C}\left(\frac{C-y}{C}\right)^{k-1}$. Integrating and simplifying then yields the desired result:

$$\begin{aligned}
&\int_{0}^{C}\int_{0}^{c_{i}}\frac{k(c_{i}-y)}{C^{2}}\left(\frac{C-y}{C}\right)^{k-1}\,dy\,dc_{i}\\
&=\frac{k}{C^{k+1}}\int_{0}^{C}\int_{0}^{c_{i}}\Big(c_{i}(C-y)^{k-1}-y(C-y)^{k-1}\Big)\,dy\,dc_{i}\\
&=\frac{k}{C^{k+1}}\int_{0}^{C}\Bigg(\frac{c_{i}}{k}\left(C^{k}-(C-c_{i})^{k}\right)+\int_{0}^{c_{i}}\sum_{l=0}^{k-1}\binom{k-1}{l}(-y)^{l+1}C^{k-1-l}\,dy\Bigg)\,dc_{i}\\
&=\frac{k}{C^{k+1}}\Bigg(\frac{C^{k+2}}{2k}+\int_{0}^{C}\Bigg(\frac{1}{k}\sum_{l=0}^{k}\binom{k}{l}(-c_{i})^{l+1}C^{k-l}+\sum_{l=0}^{k-1}\binom{k-1}{l}\frac{c_{i}^{l+2}C^{k-1-l}(-1)^{l+1}}{l+2}\Bigg)\,dc_{i}\Bigg)\\
&=\frac{C}{2}+\sum_{l=0}^{k}\binom{k}{l}(-1)^{l+1}C^{-l-1}\frac{C^{l+2}}{l+2}+\sum_{l=0}^{k-1}\binom{k-1}{l}\frac{kC^{l+3}C^{k-1-l}(-1)^{l+1}}{(l+2)(l+3)C^{k+1}}\\
&=\frac{C}{2}+\sum_{l=0}^{k}\binom{k}{l}\frac{(-1)^{l+1}C}{l+2}+\sum_{l=0}^{k-1}\binom{k-1}{l}\frac{kC(-1)^{l+1}}{(l+2)(l+3)}\\
&=\frac{C}{2}-\frac{C(-1)^{k}}{k+2}-\sum_{l=0}^{k-1}\binom{k}{l}\frac{C(-1)^{l}(k+3)}{(l+2)(l+3)}.
\end{aligned}$$
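As a cross-check on the final expression, the sketch below evaluates it in exact rational arithmetic and compares it with a Monte Carlo estimate of $\mathbb{E}[\max\{0,c_{i}-\min_{j}c_{j}\}]$. The compact value $kC/(2(k+2))$ used in the comparison is obtained here by swapping the order of integration in (20); it is an independent check introduced for illustration and is not stated in the original derivation. Function names, sample size, and seed are arbitrary choices.

```python
import random
from fractions import Fraction
from math import comb

def savings_closed_form(k, C=1):
    """Evaluate the last line of the derivation exactly with rationals."""
    C = Fraction(C)
    total = C / 2 - C * (-1) ** k / (k + 2)
    for l in range(k):
        total -= Fraction(comb(k, l) * (-1) ** l * (k + 3),
                          (l + 2) * (l + 3)) * C
    return total

def savings_monte_carlo(k, C=1.0, samples=200_000, seed=0):
    """Simulate c_i and k i.i.d. uniform costs c_j on [0, C]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(samples):
        c_i = rng.uniform(0.0, C)
        y = min(rng.uniform(0.0, C) for _ in range(k))
        acc += max(0.0, c_i - y)
    return acc / samples

for k in (1, 2, 3, 5):
    exact = savings_closed_form(k)
    print(k, exact, Fraction(k, 2 * (k + 2)), savings_monte_carlo(k))
```

For each $k$, the exact sum and the compact form should coincide, with the simulated mean close to both.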