
Joint Data Deepening-and-Prefetching for Energy-Efficient Edge Learning

Sujin Kook, Won-Yong Shin, Seong-Lyun Kim, Seung-Woo Ko§
School of EEE, Yonsei University, Seoul, Korea, email: {sjkook, slkim}@ramo.yonsei.ac.kr Dept. of Comput. Science and Eng., Yonsei University, Seoul, Korea, email: wy.shin@yonsei.ac.kr §Dept. of Smart Mobility Eng., Inha University, Incheon, Korea, email: swko@inha.ac.kr
Abstract

The vision of pervasive machine learning (ML) services can be realized by training an ML model on time using real-time data collected by internet of things (IoT) devices. To this end, IoT devices need to offload their data to a nearby edge server. However, high-dimensional data with a heavy volume impose a significant burden on an IoT device with a limited energy budget. To cope with this limitation, we propose a novel offloading architecture, called joint data deepening and prefetching (JD2P), which performs feature-by-feature offloading and comprises two key techniques. The first is data deepening, where each data sample's features are sequentially offloaded in the order of importance determined by a data embedding technique such as principal component analysis (PCA). No more features are offloaded when those offloaded so far are sufficient to classify the data, thereby reducing the amount of offloaded data. The second is data prefetching, where some features that are likely to be required in the future are offloaded in advance, achieving high efficiency via precise prediction and parameter optimization. To verify the effectiveness of JD2P, we conduct experiments on the MNIST and fashion-MNIST datasets. Experimental results demonstrate that JD2P can significantly reduce the expected energy consumption compared with several benchmarks without degrading learning accuracy.

I Introduction

With the widespread deployment of internet of things (IoT) devices, a huge amount of real-time data has been continuously generated. Such data can fuel various on-device machine learning (ML) services, e.g., object detection and natural language processing, if provided on time. One viable technology to this end is edge learning, where an ML model is trained at a nearby edge server using the data offloaded from IoT devices [1]. Compared to learning at a cloud server, IoT devices can offer the latest data to the edge server before it becomes outdated, and the resultant ML model can reflect the current environment precisely without a dataset shift [2] or catastrophic forgetting [3].

On the other hand, as the environment of interest becomes complex, the data collected by each device tends to be high-dimensional with heavy volume, which imposes a significant offloading burden on an IoT device with a limited energy budget. Several attempts have been made in the literature to address this issue, whose main thrust is to selectively offload data depending on its importance to the concerned ML model. In [4], motivated by the classic support vector machine (SVM) technique, data importance was defined in terms of uncertainty, which is inversely proportional to the margin to the decision boundary. A selective retransmission decision was optimized by allowing more transmissions for data with high uncertainty, leading to fast convergence of the corresponding ML model. In the same vein, the scheduling issue of multi-device edge learning was tackled in [5], where a device having more important data samples is granted access to the medium more frequently. In [6], a data sample's gradient norm obtained during training of a deep neural network (DNN) was regarded as the corresponding importance metric, enabling each mobile device to select data samples that are likely to contribute to its local ML model training in a federated edge learning system. In [7], data importance was defined in terms of the dispersion of the dataset distribution, and a device with an important dataset is allowed more bandwidth to accelerate the training process.

Aligned with this trend, we aim to develop a novel edge learning architecture, called joint data deepening and prefetching (JD2P). The above prior works quantify the importance of each data sample or the entire dataset, incurring a significant communication overhead when raw data become complex and high-dimensional. In contrast, the proposed JD2P leverages data embedding to extract a few features from raw data and sort them in the order of importance. This allows us to design a feature importance-based offloading technique, called data deepening: features are sequentially offloaded in the order of importance, and offloading of the next feature stops once the desired performance is reached. Besides, several data samples' subsequent features can be offloaded proactively before being requested, called data prefetching, which extends the offloading duration and thus achieves higher energy efficiency. Through relevant parameter optimizations and extensive simulation studies on the MNIST and fashion-MNIST datasets, it is verified that JD2P reduces the expected energy consumption significantly compared with several benchmarks without degrading learning accuracy.

Figure 1: Edge learning network comprising a pair of edge device and edge server collocated with a wireless access point.

II System Model

This section describes our system model, including the concerned scenario, data structure, and offloading model.

II-A Edge Learning Scenario

Consider an edge learning network comprising a pair of an edge server and an IoT sensor acting as a data collector (see Fig. 1). We aim to train a binary classifier using local data with two classes collected by the IoT sensor. Due to the IoT sensor's limited computation capability, the edge server is requested to train the classifier with the local data offloaded from the IoT sensor instead of training it on the sensor itself.

II-B Data Embedding

Consider $M$ samples measured at the sensor, denoted by $\mathbf{y}_{m}\in\mathbb{R}^{D}$, where $m$ is the index of a data sample, i.e., $m=1,2,\cdots,M$. We assume that the class of each sample is known, denoted by $c_{m}\in\{0,1\}$. Each raw data sample is assumed to have the same dimension $D$, which is in general sufficiently high to reflect complex environments and is known as an obstacle to achieving high-accuracy classification [8]. Besides, a large amount of energy is required to offload such raw data to the edge server. To overcome these limitations, the high-dimensional raw data can be embedded into a low-dimensional space using data embedding techniques [9], such as principal component analysis (PCA) [10] and auto-encoders [11]. Specifically, given $F$ less than $D$, there exists a mapping function $\mathcal{F}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{F}$ such that

$$\mathbf{x}_{m}=\mathcal{F}(\mathbf{y}_{m}), \qquad (1)$$

where $\mathbf{x}_{m}=\left[x_{m,1},\cdots,x_{m,F}\right]^{T}$ represents the embedded data with $F$ features. We assume that the edge device knows the embedding function $\mathcal{F}$, which has been trained by the edge server using a historical dataset. We use PCA as the primary feature embedding technique due to its low computational overhead, while other techniques are straightforwardly applicable. Partial or all features of each embedded data sample are offloaded depending on the offloading and learning designs introduced in the sequel.
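As a concrete illustration, the sketch below embeds raw samples with scikit-learn's PCA; the dimensions, the placeholder historical dataset, and the choice of $F=10$ are illustrative assumptions rather than values fixed by the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

D, F = 784, 10                          # raw dimension (e.g., MNIST pixels) and number of kept features
historical = np.random.rand(1000, D)    # placeholder for the server's historical dataset

pca = PCA(n_components=F).fit(historical)   # the server learns the embedding function F(.)

def embed(y_m: np.ndarray) -> np.ndarray:
    """Map a raw sample y_m in R^D to x_m in R^F; components are ordered by explained variance."""
    return pca.transform(y_m.reshape(1, -1)).ravel()

x_m = embed(np.random.rand(D))          # x_m[0] is the most important feature, x_m[1] the next, and so on
```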

II-C Offloading Model

The entire offloading duration is slotted into $K$ rounds of $t_{0}$ seconds each. The channel gain in round $k$ is denoted by $g_{k}$ with $g_{k}>0$. We assume that channel gains are constant over one round and independently and identically distributed (i.i.d.) over different rounds. Following the models in [12] and [13], the transmission power required to transmit $b$ bits in round $k$, denoted by $e_{k}$, is modeled by a monomial function and is given as $e_{k}=\lambda\frac{(b/t)^{\ell}}{g_{k}}$, where $\lambda$ is the energy coefficient, $\ell$ represents the monomial order, and $t$ is the allowable transmission duration for the $b$ bits. The typical range of the monomial order is $2\leq\ell\leq 5$, depending on the specific modulation and coding scheme. Then, the energy consumption in round $k$, which is the product of $e_{k}$ and $t$, is given as

$$\mathbf{E}(b,t;g_{k})=e_{k}t=\lambda\frac{b^{\ell}}{g_{k}t^{\ell-1}}. \qquad (2)$$

It shows that the energy consumption grows with the transmitted data size $b$ and decreases with the transmission time $t$. For energy-efficient edge learning, it is thus necessary to decrease the amount of transmitted data and extend the transmission time.
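For reference, a small helper encoding the monomial model in (2) is sketched below; the default values $\lambda=10^{-17}$ and $\ell=3$ follow the simulation setup in Sec. V, while the example payload and durations are assumptions for illustration.

```python
def offload_energy(b_bits: float, t_sec: float, g: float,
                   lam: float = 1e-17, ell: int = 3) -> float:
    """Energy (Joule) of Eq. (2): E(b, t; g) = lam * b^ell / (g * t^(ell-1))."""
    return lam * (b_bits ** ell) / (g * t_sec ** (ell - 1))

# Doubling the allowed transmission time cuts the energy by 2^(ell-1) = 4x here.
e_short = offload_energy(b_bits=8 * 1000, t_sec=0.05, g=1.0)
e_long = offload_energy(b_bits=8 * 1000, t_sec=0.10, g=1.0)
print(e_short / e_long)  # -> 4.0
```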

III Joint Data Deepening-and-Prefetching

This section describes JD2P, a novel architecture that realizes energy-efficient edge learning. The overall architecture is briefly introduced first, and the detailed techniques of JD2P are elaborated next.

III-A Overview

The proposed JD2P is a feature-by-feature offloading control for energy-efficient classifier training, built on the following definition.

Definition 1 (Data Depth).

An embedded data sample $\mathbf{x}_{m}$ is said to have depth $k$ when features $1$ to $k$, i.e., $\mathbf{x}_{m}^{(k)}=[x_{m,1},\cdots,x_{m,k}]^{T}$, are enough to correctly predict its class.

By Definition 1, if the depths of all data samples were known in advance, only the amount of data required to train the classifier would need to be offloaded, reducing the resultant energy consumption. On the other hand, each data sample's depth can only be determined after the concerned classifier is trained. Consequently, depth identification and classifier training must be processed simultaneously to cope with this recursive relation, which is technically challenging. To this end, we propose the two key techniques summarized below.

Figure 2: An example of the data deepening process from the 1-dimensional to the 3-dimensional space.

III-A1 Data Deepening

Data deepening is a closed-loop decision on whether to offload a new feature based on the current version of the classifier. Specifically, consider the $k$-depth classifier, defined as the one trained on features $1$ to $k$, i.e., $\mathbf{x}_{m}^{(k)}$ for all $m\in\mathbb{S}^{(k)}$, where $\mathbb{S}^{(k)}$ denotes the index set of data samples that may have a depth of $k$. We use a classic SVM for each depth classifier (the extension to other classifiers such as DNNs and convolutional neural networks (CNNs) is straightforward and remains for future work), whose decision hyperplane is given as

$$(\mathbf{w}^{(k)})^{T}\mathbf{x}^{(k)}+b^{(k)}=0, \qquad (3)$$

where $\mathbf{w}^{(k)}\in\mathbb{R}^{k}$ is the vector perpendicular to the hyperplane and $b^{(k)}$ is the offset parameter. Given a data sample $\mathbf{x}_{m}^{(k)}$ for $m\in\mathbb{S}^{(k)}$, the distance to the hyperplane in (3) can be computed as

$$d_{m}^{(k)}=\left|(\mathbf{w}^{(k)})^{T}\mathbf{x}_{m}^{(k)}+b^{(k)}\right|/\|\mathbf{w}^{(k)}\|, \qquad (4)$$

where $\|\cdot\|$ denotes the Euclidean norm. The data sample $\mathbf{x}_{m}$ is said to be a clearly classified instance (CCI) by the $k$-depth classifier if $d_{m}^{(k)}$ is no less than a threshold $\bar{d}^{(k)}$ to be specified in Sec. III-B. Otherwise, it is said to be an ambiguously classified instance (ACI). In other words, CCIs are depth-$k$ data not requiring an additional feature. Only ACIs are thus included in the new set $\mathbb{S}^{(k+1)}$, given as

$$\mathbb{S}^{(k+1)}=\left\{m~|~d_{m}^{(k)}\leq\bar{d}^{(k)},\ m\in\mathbb{S}^{(k)}\right\}. \qquad (5)$$

As a result, the edge server requests the edge device to offload the next feature $x_{m,k+1}$ for $m\in\mathbb{S}^{(k+1)}$. Fig. 2 illustrates a graphical example of data deepening from the $1$-dimensional to the $3$-dimensional space. The detailed process is summarized in Algorithm 1, except for the design of the threshold $\bar{d}^{(k)}$.

Algorithm 1 Data Deepening
1: Embedded data $\mathbf{x}_{m}$ for all $m\in\{1,\cdots,M\}$.
2: Set $k=0$, $\mathbb{S}^{(1)}=\{1,\cdots,M\}$.
3: while $k\leq K$ do
4:     $k=k+1$.
5:     Using $\{\mathbf{x}_{m}^{(k)}\}$ for $m\in\mathbb{S}^{(k)}$, compute the hyperplane of the $k$-depth classifier, specified in (3).
6:     Compute the threshold $\bar{d}^{(k)}$ using Algorithm 2.
7:     for $m\in\mathbb{S}^{(k)}$ do
8:         Compute $d_{m}^{(k)}$ using (4).
9:         if $d_{m}^{(k)}\leq\bar{d}^{(k)}$ then
10:             $m\in\mathbb{S}^{(k+1)}$.
11:         else
12:             $m\notin\mathbb{S}^{(k+1)}$.
13:         end if
14:     end for
15: end while
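A minimal Python sketch of Algorithm 1 is given below. It assumes an embedded data matrix X (one row per sample, columns ordered by importance), integer labels c, and a threshold routine compute_threshold() implementing Algorithm 2; the linear SVC from scikit-learn stands in for the classic SVM.

```python
import numpy as np
from sklearn.svm import SVC

def data_deepening(X, c, K, compute_threshold):
    """Train k-depth linear SVMs and shrink the ACI set S^(k) feature by feature (Algorithm 1)."""
    S = np.arange(X.shape[0])                 # S^(1): initially every sample is an ACI
    classifiers, thresholds = [], []
    for k in range(1, K + 1):
        if S.size == 0 or np.unique(c[S]).size < 2:
            break                             # nothing ambiguous left to separate
        Xk = X[S, :k]                         # features 1..k of the current ACIs
        clf = SVC(kernel="linear").fit(Xk, c[S])
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        d = np.abs(Xk @ w + b) / np.linalg.norm(w)     # distance to the hyperplane, Eq. (4)
        d_bar = compute_threshold(Xk, c[S], clf, k)    # Algorithm 2 (Sec. III-B)
        classifiers.append(clf)
        thresholds.append(d_bar)
        S = S[d <= d_bar]                     # Eq. (5): keep only the ambiguous samples
    return classifiers, thresholds
```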

III-A2 Data Prefetching

As shown in Fig. 3, round $k$ comprises an offloading duration for the $k$-th features (i.e., $x_{m,k},\ \forall m\in\mathbb{S}^{(k)}$), a training duration for the $k$-depth classifier, and a feedback duration for the new ACI set $\mathbb{S}^{(k+1)}$ in (5). Without loss of generality, the feedback duration is assumed to be negligible due to its small data size and the edge server's high transmit power. Note that $\mathbb{S}^{(k+1)}$ becomes available only when round $(k+1)$ starts, and a sufficient amount of time should be reserved for training the $(k+1)$-depth classifier. Denote by $\tau_{k+1}$ the corresponding training duration. In other words, the offloading duration $t_{k+1}$ should be no more than $t_{0}-\tau_{k+1}$, which makes the energy consumption significant as $\tau_{k+1}$ becomes longer.

Figure 3: Data prefetching architecture

This can be overcome by offloading some data samples' $(k+1)$-th features in advance during the training process, called prefetching. The resultant offloading duration is extended from $t_{k+1}$ to $t_{0}$, enabling the IoT device to reduce energy consumption according to (2). On the other hand, the prefetching decision relies on predicting whether the concerned data samples become ACIs. Unless the prediction is correct, excessive energy is consumed to prefetch useless features. Balancing this tradeoff is key, which is addressed by formulating a stochastic optimization in Sec. IV-A.

III-B Threshold Design for Data Deepening

This subsection deals with the design of the threshold $\bar{d}^{(k)}$ used to categorize a data sample $\mathbf{x}_{m}^{(k)}$ as an ACI or a CCI based on the $k$-depth classifier. The stochastic distribution of each class can be approximated in the form of a $k$-variate Gaussian using the Gaussian mixture model (GMM) [14]. As shown in Fig. 4, an overlapped area between the two distributions is observed, and data samples in this area are likely to be misclassified. We aim to set the threshold $\bar{d}^{(k)}$ such that most data samples in the overlapped area are included except a few outliers located in each tail.

To this end, we introduce the Mahalanobis distance (MD) [15] as a metric representing the distance from each instance to the concerned distribution. Given class $c\in\{0,1\}$, the MD is defined as

$$\delta_{c}^{(k)}=\sqrt{\left(\mathbf{x}^{(k)}-\boldsymbol{\mu}_{c}^{(k)}\right)^{T}\left(\boldsymbol{\Sigma}_{c}^{(k)}\right)^{-1}\left(\mathbf{x}^{(k)}-\boldsymbol{\mu}_{c}^{(k)}\right)}, \qquad (6)$$

where $\mathbf{x}^{(k)}\sim\mathcal{N}(\boldsymbol{\mu}_{c}^{(k)},\boldsymbol{\Sigma}_{c}^{(k)})$, with $\boldsymbol{\mu}_{c}^{(k)}\in\mathbb{R}^{k}$ and $\boldsymbol{\Sigma}_{c}^{(k)}\in\mathbb{R}^{k\times k}$ being the distribution's mean vector and covariance matrix, respectively, which are obtainable through the GMM process. Since $\delta_{c}^{(k)}$ is a scale-free random variable, we set the threshold as the value at which the cumulative distribution function (CDF) of $\delta_{c}^{(k)}$ becomes $p_{\mathrm{th}}$, namely,

$$\mathsf{Pr}\left[\delta_{c}^{(k)}\leq\bar{\delta}_{c}^{(k)}\right]=p_{\mathrm{th}}. \qquad (7)$$

Noting that the square of $\delta_{c}^{(k)}$ follows a chi-square distribution with $k$ degrees of freedom, the CDF of this chi-square distribution for $r>0$ is given as

$$\mathscr{G}(r;k)=\mathsf{Pr}\left[\left(\delta_{c}^{(k)}\right)^{2}\leq r\right]=\frac{\gamma\left(\frac{k}{2},\frac{r}{2}\right)}{\Gamma\left(\frac{k}{2}\right)}, \qquad (8)$$

where $\Gamma$ is the gamma function defined as $\Gamma(k)=\int_{0}^{\infty}t^{k-1}e^{-t}dt$ and $\gamma$ is the lower incomplete gamma function defined as $\gamma(k,r)=\int_{0}^{r}t^{k-1}e^{-t}dt$. In closed form, the threshold $\bar{\delta}_{c}^{(k)}$ is given as

$$\bar{\delta}_{c}^{(k)}=\sqrt{\mathscr{G}^{-1}(p_{\mathrm{th}};k)}, \qquad (9)$$

where $\mathscr{G}^{-1}$ represents the inverse CDF of the chi-square distribution with $k$ degrees of freedom. Due to the scale-free property, the threshold $\bar{\delta}_{c}^{(k)}$ is set identically regardless of the concerned class; thus, the class index can be omitted, namely, $\bar{\delta}_{0}^{(k)}=\bar{\delta}_{1}^{(k)}=\bar{\delta}^{(k)}$. Given $\bar{\delta}^{(k)}$, each distribution can be truncated as

$$\mathcal{R}_{c}=\left\{\mathbf{x}^{(k)}\in\mathbb{R}^{k}~|~\delta_{c}^{(k)}\leq\bar{\delta}^{(k)}\right\},\quad c\in\{0,1\}. \qquad (10)$$

Last, the threshold $\bar{d}^{(k)}$ is set as the maximum distance from the hyperplane in (3) to an arbitrary $k$-dimensional point $\mathbf{x}^{(k)}$ in the overlapped area of $\mathcal{R}_{0}$ and $\mathcal{R}_{1}$, given as

$$\bar{d}^{(k)}=\max_{\mathbf{x}^{(k)}\in\mathcal{R}_{0}\cap\mathcal{R}_{1}}\left|(\mathbf{w}^{(k)})^{T}\mathbf{x}^{(k)}+b^{(k)}\right|/\|\mathbf{w}^{(k)}\|. \qquad (11)$$

The process to obtain the threshold $\bar{d}^{(k)}$ is summarized in Algorithm 2.
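A possible realization of Algorithm 2 is sketched below; it approximates each class by a single Gaussian (the GMM step), uses scipy's chi-square inverse CDF for (9), and, as a simplifying assumption, searches the maximum in (11) over the samples that fall inside both truncated regions instead of over the continuous set $\mathcal{R}_{0}\cap\mathcal{R}_{1}$. The confidence level p_th is an assumed value.

```python
import numpy as np
from scipy.stats import chi2
from scipy.spatial.distance import mahalanobis

def compute_threshold(Xk, labels, clf, k, p_th=0.95):
    """Return d_bar^(k) of Eq. (11) for the k-depth classifier clf."""
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    delta_bar = np.sqrt(chi2.ppf(p_th, df=k))          # Eq. (9), identical for both classes
    inside = []
    for c in (0, 1):
        Xc = Xk[labels == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + 1e-9 * np.eye(k)   # regularized class covariance (GMM step)
        inv_cov = np.linalg.inv(cov)
        md = np.array([mahalanobis(x, mu, inv_cov) for x in Xk])   # Eq. (6) for every sample
        inside.append(md <= delta_bar)                 # membership in the truncated region R_c, Eq. (10)
    overlap = inside[0] & inside[1]                    # samples lying in R_0 ∩ R_1
    if not overlap.any():
        return 0.0
    d = np.abs(Xk[overlap] @ w + b) / np.linalg.norm(w)
    return float(d.max())                              # Eq. (11)
```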

Figure 4: The ACI region in the 1-dimensional space, obtained from the class probability distributions and the distance from the hyperplane.
Remark 1 (Symmetric ACI Region).

Noting that the covariance matrices $\{\boldsymbol{\Sigma}_{c}^{(k)}\}_{c\in\{0,1\}}$ of the two classes differ, the resultant truncated areas $\mathcal{R}_{0}$ and $\mathcal{R}_{1}$ become asymmetric. To avoid the classifier being overfitted to one class, we choose a common distance threshold for both classes, namely $\bar{d}^{(k)}$ in (11), corresponding to the maximum distance between the two.

III-C Hierarchical Edge Inference

After $K$ rounds, the entire classifier has a hierarchical structure comprising the $1$-depth to $K$-depth classifiers. Consider a mobile device sending an unlabeled data sample to the edge server, which initially treats it as an ACI. Starting from the $1$-depth classifier, the data sample passes through the depth classifiers in sequence until it becomes a CCI. The depth of the last classifier is referred to as the data sample's depth, and its classification result becomes the final one.
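A short sketch of this cascaded inference is given below, assuming the lists of depth classifiers and thresholds produced during training (e.g., by the data_deepening sketch above); it is illustrative rather than the paper's implementation.

```python
import numpy as np

def hierarchical_predict(x, classifiers, thresholds):
    """Pass the embedded sample x through the 1-depth, 2-depth, ... classifiers until it becomes a CCI."""
    label = None
    for k, (clf, d_bar) in enumerate(zip(classifiers, thresholds), start=1):
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        d = abs(x[:k] @ w + b) / np.linalg.norm(w)
        label = int(clf.predict(x[:k].reshape(1, -1))[0])
        if d > d_bar:                     # CCI at depth k: accept this prediction
            return label, k
    return label, len(classifiers)        # still ambiguous: fall back to the deepest classifier
```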

Algorithm 2 Finding the threshold $\bar{d}^{(k)}$
1: Embedded data $\mathbf{x}_{m}^{(k)}$ for $m\in\mathbb{S}^{(k)}$, the $k$-depth classifier.
2: for $c\in\{0,1\}$ do
3:     Find $\boldsymbol{\mu}_{c}$, $\boldsymbol{\Sigma}_{c}$ through the GMM process.
4:     Compute $\bar{\delta}_{c}^{(k)}$ specified in (9).
5:     Compute the truncated domain of the distribution $\mathcal{R}_{c}$ using (10).
6: end for
7: Find the overlapped area $\mathcal{R}=\mathcal{R}_{0}\cap\mathcal{R}_{1}$.
8: Using the $k$-depth classifier in (3), compute $\bar{d}^{(k)}$ using (11).
9: return $\bar{d}^{(k)}$.

IV Optimal Data Prefetching

This section deals with selecting the amount of prefetched data so as to minimize the expected energy consumption of the sensor.

IV-A Problem Formulation

Consider the prefetching duration in round $k$, say $\tau_{k}$, which is equivalent to the training duration of the $k$-depth classifier, as shown in Fig. 3. The number of data samples in $\mathbb{S}^{(k)}$ is denoted by $s_{k}$. Among them, $p_{k}$ data samples are randomly selected and their $(k+1)$-th features are prefetched. The prefetched data size is $\alpha p_{k}$, where $\alpha$ represents the number of bits required to quantize a feature (the quantization bit rate depends on the intensity range; for example, one pixel of MNIST data takes 256 intensity levels and can be quantized with 8 bits, i.e., $\alpha=8$). Given the channel gain $g_{k}$, the resultant energy consumption for prefetching is

$$\mathbf{E}(\alpha p_{k},\tau_{k};g_{k})=\lambda\frac{\alpha^{\ell}}{g_{k}\tau_{k}^{\ell-1}}p_{k}^{\ell}. \qquad (12)$$

Here, the number of prefetched data samples $p_{k}$ is a discrete control parameter ranging from $0$ to $s_{k}$. For tractable optimization in the sequel, we regard $p_{k}$ as a continuous variable within this range, which is rounded to the nearest integer in practice.

Next, consider the offloading duration in round $(k+1)$, say $t_{k+1}=t_{0}-\tau_{k+1}$. Among the data samples in $\mathbb{S}^{(k+1)}$, a number of samples, denoted by $n_{k+1}$, remain after prefetching. Given the channel gain $g_{k+1}$, the resultant energy consumption is

$$\mathbf{E}(\alpha n_{k+1},t_{k+1};g_{k+1})=\lambda\frac{\alpha^{\ell}}{g_{k+1}t_{k+1}^{\ell-1}}n_{k+1}^{\ell}. \qquad (13)$$

Note that $n_{k+1}$ is determined only after the $k$-depth classifier is trained; in other words, $n_{k+1}$ is random at the instant of the prefetching decision. Denote by $\rho_{k}$ the probability that a data sample in $\mathbb{S}^{(k)}$ is included in $\mathbb{S}^{(k+1)}$. Then, $n_{k+1}$ follows a binomial distribution with parameters $(s_{k}-p_{k})$ and $\rho_{k}$, whose probability mass function is $P(j)=\binom{s_{k}-p_{k}}{j}\rho_{k}^{j}(1-\rho_{k})^{s_{k}-p_{k}-j}$ for $j=0,\cdots,s_{k}-p_{k}$. Given $p_{k}$, the expected energy consumption is

$$\mathbb{E}_{n_{k+1},g_{k+1}}\left[\mathbf{E}(\alpha n_{k+1},t_{k+1};g_{k+1})\right]=\lambda\frac{\nu\alpha^{\ell}}{t_{k+1}^{\ell-1}}\mathbb{E}_{n_{k+1}}\left[n_{k+1}^{\ell}\right], \qquad (14)$$

where $\nu=\mathbb{E}\left[\frac{1}{g_{k+1}}\right]$ is the expectation of the inverse channel gain, which can be known a priori due to the i.i.d. property.
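Since $n_{k+1}$ is binomial, the moment $\mathbb{E}_{n_{k+1}}[n_{k+1}^{\ell}]$ in (14) can be evaluated exactly by summing over the probability mass function, as in the sketch below; the parameter values are illustrative assumptions, and the last line also evaluates the upper bound (15) used in Sec. IV-B.

```python
import numpy as np
from scipy.stats import binom

def expected_offload_energy(p_k, s_k, rho_k, t_next, nu, alpha=8, lam=1e-17, ell=3):
    """Expected round-(k+1) offloading energy, Eq. (14), when p_k samples were prefetched."""
    n = np.arange(0, s_k - p_k + 1)
    moment = np.sum(binom.pmf(n, s_k - p_k, rho_k) * n ** ell)   # exact E[n_{k+1}^ell]
    return lam * nu * alpha ** ell / t_next ** (ell - 1) * moment

# Example: s_k = 200 ACIs, half of them prefetched, rho_k = 0.3, t_{k+1} = 0.05 s, nu = E[1/g] = 1.2
energy = expected_offload_energy(p_k=100, s_k=200, rho_k=0.3, t_next=0.05, nu=1.2)
bound = ((200 - 100) * 0.3 + 3 / 2) ** 3                         # upper bound (15) on E[n_{k+1}^3]
```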

Last, the sum of (12) and (14) is the expected energy consumption for the $(k+1)$-th feature when $p_{k}$ data samples are prefetched, leading to the following two-stage stochastic optimization:

$$\begin{aligned}
\text{(P1)}\quad \min_{p_{k}}~~ & \lambda\frac{\alpha^{\ell}}{g_{k}\tau_{k}^{\ell-1}}p_{k}^{\ell}+\lambda\frac{\nu\alpha^{\ell}}{t_{k+1}^{\ell-1}}\mathbb{E}_{n_{k+1}}\left[n_{k+1}^{\ell}\right]\\
\text{s.t.}~~ & 0\leq p_{k}\leq s_{k}.
\end{aligned}$$

The optimal prefetching policy can be designed by solving (P1), as explained in the following subsection.

IV-B Optimal Prefetching Control

This subsection aims at deriving a closed-form expression for the optimal prefetching number $p_{k}^{*}$ by solving (P1). The main difficulty lies in the $\ell$-th moment $\mathbb{E}_{n_{k+1}}[n_{k+1}^{\ell}]$, for which no simple form is known for general $\ell$. To address it, we use the upper bound on the $\ell$-th moment from [16],

$$\mathbb{E}_{n_{k+1}}\left[n_{k+1}^{\ell}\right]\leq\left(\mu_{n_{k+1}}+\frac{\ell}{2}\right)^{\ell}, \qquad (15)$$

where $\mu_{n_{k+1}}=\left(s_{k}-p_{k}\right)\rho_{k}$ is the mean of the binomial distribution with parameters $(s_{k}-p_{k})$ and $\rho_{k}$. It is proved in [16] that the above upper bound is tight when the order $\ell$ is less than the mean $\mu_{n_{k+1}}$. Therefore, instead of solving (P1) directly, we minimize the upper bound of its objective function, formulated as

$$\begin{aligned}
\text{(P2)}\quad \min_{p_{k}}~~ & \frac{p_{k}^{\ell}}{g_{k}\tau_{k}^{\ell-1}}+\frac{\nu}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell}\\
\text{s.t.}~~ & 0\leq p_{k}\leq s_{k}.
\end{aligned}$$

Note that (P2) is a convex optimization problem, enabling us to derive a closed-form solution. The main result is presented in the following proposition.

Proposition 1 (Optimal Prefetching Policy).

Given the ratio $\rho_{k}$ in round $k$, the optimal prefetching data size $p_{k}^{*}$, which is the solution to (P2), is

$$p^{*}_{k}=\left(\frac{\varphi\rho_{k}^{\frac{1}{\ell-1}}}{1+\varphi\rho_{k}^{\frac{\ell}{\ell-1}}}\right)\left(s_{k}\rho_{k}+\frac{\ell}{2}\right), \qquad (16)$$

where $\varphi=\left(g_{k}\nu\right)^{\frac{1}{\ell-1}}\frac{\tau_{k}}{t_{k+1}}$.

Proof:

Define the Lagrangian function of (P2) as

$$L=\frac{p_{k}^{\ell}}{g_{k}\tau_{k}^{\ell-1}}+\frac{\nu}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell}+\eta(p_{k}-s_{k}),$$

where $\eta$ is a Lagrange multiplier. Since (P2) is a convex optimization problem, the following KKT conditions are necessary and sufficient for optimality:

$$\frac{\ell p_{k}^{\ell-1}}{g_{k}\tau_{k}^{\ell-1}}-\frac{\ell\nu\rho_{k}}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell-1}+\eta\geq 0, \qquad (17a)$$
$$p_{k}\left(\frac{\ell p_{k}^{\ell-1}}{g_{k}\tau_{k}^{\ell-1}}-\frac{\ell\nu\rho_{k}}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell-1}+\eta\right)=0, \qquad (17b)$$
$$\eta\left(p_{k}-s_{k}\right)=0. \qquad (17c)$$

First, if $\eta$ is positive, then $p_{k}$ should be equal to $s_{k}$ due to the complementary slackness condition (17c), making the left-hand side of (17b) strictly positive. In other words, the optimal multiplier $\eta$ is zero to satisfy (17b). Second, with $p_{k}=0$, the left-hand side of condition (17a) is always strictly negative unless $\rho_{k}=0$. As a result, given $\rho_{k}>0$, $p_{k}$ should be strictly positive and satisfy the following equality condition:

$$\frac{\ell p_{k}^{\ell-1}}{g_{k}\tau_{k}^{\ell-1}}-\frac{\ell\nu\rho_{k}}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell-1}=0. \qquad (18)$$

Solving (18) leads to the optimal solution of (P2), which completes the proof of this proposition. $\Box$

Remark 2 (Effect of Parameters).

Assume that the number of ACIs in round $k$, say $s_{k}=|\mathbb{S}^{(k)}|$, is significantly larger than $\frac{\ell}{2}$. Then we can approximate (16) as $p^{*}_{k}\approx\left(\frac{\varphi\rho_{k}^{\frac{1}{\ell-1}}}{1+\varphi\rho_{k}^{\frac{\ell}{\ell-1}}}\right)s_{k}\rho_{k}$. Noting that the term $s_{k}\rho_{k}$ represents the expected number of ACIs in round $(k+1)$, the parameter $\varphi$ controls the portion of prefetching as follows (a numerical sketch is given after the list):

  • As the current channel gain $g_{k}$ becomes larger or the training duration $\tau_{k}$ increases, the parameter $\varphi$ increases and the optimal solution $p_{k}^{*}$ approaches $s_{k}\rho_{k}$;

  • As $g_{k}$ becomes smaller and $\tau_{k}$ decreases, both $\varphi$ and $p_{k}^{*}$ converge to zero.
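The closed form (16) is easy to evaluate numerically; the sketch below computes $p_{k}^{*}$ for two illustrative channel gains (all parameter values are assumptions, not taken from the paper's experiments) and shows that a larger $g_{k}$ or a longer $\tau_{k}$ increases $\varphi$ and hence the prefetched amount, while a weak channel suppresses prefetching.

```python
def optimal_prefetch(s_k, rho_k, g_k, nu, tau_k, t_next, ell=3):
    """Closed-form prefetching size p_k* from Eq. (16), capped at the constraint p_k <= s_k."""
    phi = (g_k * nu) ** (1.0 / (ell - 1)) * tau_k / t_next
    frac = phi * rho_k ** (1.0 / (ell - 1)) / (1.0 + phi * rho_k ** (ell / (ell - 1)))
    return min(frac * (s_k * rho_k + ell / 2.0), s_k)

# Stronger channel (larger g_k) -> larger phi -> more aggressive prefetching.
print(optimal_prefetch(s_k=200, rho_k=0.3, g_k=5.0, nu=1.0, tau_k=0.5, t_next=0.05))
# Weaker channel -> smaller phi -> fewer samples prefetched.
print(optimal_prefetch(s_k=200, rho_k=0.3, g_k=0.05, nu=1.0, tau_k=0.5, t_next=0.05))
```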

V Simulation Results

In this section, simulation results are presented to validate the superiority of JD2P over several benchmarks. Unless stated otherwise, the parameters are set as follows. The entire offloading duration consists of $10$ rounds ($K=10$), each of which is set to $t_{0}=0.1$ (sec). For offloading, the channel follows the Gamma distribution with shape parameter $\beta>1$ and probability density function $f_{g}(x)=\frac{x^{\beta-1}e^{-\beta x}}{(1/\beta)^{\beta}\Gamma(\beta)}$, where $\Gamma(\beta)=\int_{0}^{\infty}x^{\beta-1}e^{-x}dx$ is the gamma function and the mean is $\mathbb{E}[g_{k}]=1$. The energy coefficient $\lambda$ is set to $10^{-17}$, according to [12]. The monomial order of the energy consumption model in (2) is set to $\ell=3$. For computing and prefetching, the reserved training duration $\tau_{k}$ is assumed constant over all rounds and fixed to $\tau_{k}=\tau=0.5$ (sec) for $1\leq k\leq K$.
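The channel model above can be reproduced as in the sketch below: gains are drawn i.i.d. from a Gamma distribution with shape $\beta$ and scale $1/\beta$ so that $\mathbb{E}[g_{k}]=1$; the value $\beta=2$ is an illustrative assumption (the paper only requires $\beta>1$).

```python
import numpy as np

rng = np.random.default_rng(0)
beta, K = 2.0, 10
g = rng.gamma(shape=beta, scale=1.0 / beta, size=K)       # one gain per offloading round, E[g_k] = 1

# Monte-Carlo estimate of nu = E[1/g_k]; for Gamma(beta, 1/beta) this equals beta/(beta-1) when beta > 1.
nu = np.mean(1.0 / rng.gamma(shape=beta, scale=1.0 / beta, size=100_000))
```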

We use the MNIST and fashion-MNIST datasets for training and testing. Both datasets include $6\cdot 10^{4}$ training samples and $10^{4}$ test samples, each with $784$ gray-scale pixels, and each dataset has $10$ classes. We conduct experiments with every possible pair of classes, namely $\binom{10}{2}=45$ pairs. PCA is applied for data embedding. For comparison, we consider two benchmark schemes. The first uses data deepening only, without data prefetching. The second is full offloading, where all data samples' $10$ features are offloaded first and the classifier is trained on them. To be specific, the offloading duration of each round is $t_{0}$ except the last one, which is reduced to $t_{K}=t_{0}-\tau$.

Figure 5: Performance of JD2P compared with benchmark schemes. Each point represents the average value over all pairs of classes. The training duration is set as $\tau=0.5$ s.

First, the expected energy consumption (in Joule) versus the error rate (in %) is plotted in Fig. 5. The proposed JD2P consumes less energy than the full offloading scheme, namely, $23$ dB and $20$ dB energy gains for MNIST and fashion-MNIST, respectively, at the cost of a marginal degradation in the error rate. The effectiveness of data prefetching is demonstrated in Fig. 6, which plots the expected energy consumption gain against the prefetching duration $\tau$ for the MNIST dataset (the fashion-MNIST dataset follows a similar tendency, although the result is omitted in this paper). JD2P's expected energy consumption is always smaller than that of the data-deepening-only scheme thanks to the sophisticated control of prefetched data in Sec. IV. On the other hand, when compared with the full offloading scheme, the energy gain of JD2P decreases as $\tau$ increases. In other words, a shorter offloading duration compels more data samples to be prefetched, wasting more energy since many prefetched data samples are likely to become CCIs and thus not be used for the subsequent training.

VI Concluding Remarks

This study explored a multi-round offloading technique for energy-efficient edge learning. Two criteria for achieving energy efficiency are 1) reducing the amount of offloaded data and 2) extending the offloading duration. JD2P was proposed to address both, integrating data deepening and data prefetching by measuring feature-by-feature data importance and optimizing the amount of prefetched data to avoid wasting energy. Our comprehensive simulation study demonstrated that JD2P can significantly reduce the expected energy consumption compared to several benchmarks.

Figure 6: Effect of the prefetching duration on the expected energy consumption gain obtained by comparing with the full offloading scheme in the case of the MNIST dataset.

Though the current work targets a simple SVM-based binary classifier with PCA as the key data embedding technique, the proposed JD2P is straightforwardly applicable to more challenging scenarios, such as a multi-class DNN classifier with an advanced data embedding technique. Besides, it would be interesting to analyze the performance of JD2P with respect to various parameters, which is essential to derive rigorous guidelines for its practical use.

References

  • [1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Communications Magazine, vol. 58, no. 1, pp. 19–25, 2020.
  • [2] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. MIT Press, 2008.
  • [3] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” arXiv preprint arXiv:1312.6211, 2013.
  • [4] D. Liu, G. Zhu, Q. Zeng, J. Zhang, and K. Huang, “Wireless data acquisition for edge learning: Data-importance aware retransmission,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 406–420, 2020.
  • [5] D. Liu, G. Zhu, J. Zhang, and K. Huang, “Data-importance aware user scheduling for communication-efficient edge machine learning,” IEEE Transactions on Cognitive Communications and Networking, vol. 7, no. 1, pp. 265–278, 2020.
  • [6] Y. He, J. Ren, G. Yu, and J. Yuan, “Importance-aware data selection and resource allocation in federated edge learning system,” IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 13593–13605, 2020.
  • [7] A. Taïk, Z. Mlika, and S. Cherkaoui, “Data-aware device scheduling for federated edge learning,” IEEE Transactions on Cognitive Communications and Networking, vol. 8, no. 1, pp. 408–421, 2021.
  • [8] P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.
  • [9] L. Zheng, S. Wang, and Q. Tian, “Coupled binary embedding for large-scale image retrieval,” IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3368–3380, 2014.
  • [10] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
  • [11] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [12] Y. Tao, C. You, P. Zhang, and K. Huang, “Stochastic control of computation offloading to a helper with a dynamically loaded CPU,” IEEE Transactions on Wireless Communications, vol. 18, no. 2, pp. 1247–1262, 2019.
  • [13] W. Zhang, Y. Wen, K. Guan, D. Kilper, H. Luo, and D. O. Wu, “Energy-optimal mobile cloud computing under stochastic wireless channel,” IEEE Transactions on Wireless Communications, vol. 12, no. 9, pp. 4569–4581, 2013.
  • [14] C. M. Bishop and N. M. Nasrabadi, Pattern Recognition and Machine Learning. Springer, 2006, vol. 4, no. 4.
  • [15] M. Bensimhoun, “N-dimensional cumulative function, and other useful facts about Gaussians and normal densities,” Jerusalem, Israel, Tech. Rep., pp. 1–8, 2009.
  • [16] T. D. Ahle, “Sharp and simple bounds for the raw moments of the binomial and Poisson distributions,” Statistics & Probability Letters, vol. 182, p. 109306, 2022.