
Joint Data Deepening-and-Prefetching for Energy-Efficient Edge Learning

Sujin Kook, Won-Yong Shin, Seong-Lyun Kim, Seung-Woo Ko§
School of EEE, Yonsei University, Seoul, Korea, email: {sjkook, slkim}@ramo.yonsei.ac.kr Dept. of Comput. Science and Eng., Yonsei University, Seoul, Korea, email: wy.shin@yonsei.ac.kr §Dept. of Smart Mobility Eng., Inha University, Incheon, Korea, email: swko@inha.ac.kr
Abstract

The vision of pervasive machine learning (ML) services can be realized by training an ML model on time using real-time data collected by internet of things (IoT) devices. To this end, IoT devices need to offload their data to a nearby edge server. However, high-dimensional data with a heavy volume impose a significant burden on an IoT device with a limited energy budget. To cope with this limitation, we propose a novel offloading architecture, called joint data deepening and prefetching (JD2P), which performs feature-by-feature offloading and comprises two key techniques. The first is data deepening, where each data sample's features are sequentially offloaded in the order of importance determined by a data embedding technique such as principal component analysis (PCA). No more features are offloaded when those offloaded so far are sufficient to classify the data, thereby reducing the amount of offloaded data. The second is data prefetching, where some features that are likely to be required in the future are offloaded in advance, achieving high efficiency via precise prediction and parameter optimization. To verify the effectiveness of JD2P, we conduct experiments on the MNIST and fashion-MNIST datasets. Experimental results demonstrate that JD2P can significantly reduce the expected energy consumption compared with several benchmarks without degrading learning accuracy.

I Introduction

With the widespread deployment of internet of things (IoT) devices, a huge amount of real-time data has been continuously generated. Such data can fuel various on-device machine learning (ML) services, e.g., object detection and natural language processing, if provided on time. One viable technology to this end is edge learning, where an ML model is trained at a nearby edge server using the data offloaded from IoT devices [1]. Compared to learning at a cloud server, IoT devices can offer the latest data to the edge server before it becomes outdated, and the resultant ML model can reflect the current environment precisely without a dataset shift [2] or catastrophic forgetting [3].

On the other hand, as the environment of interest becomes complex, the data collected by each device tends to be high-dimensional with heavy volume, which imposes a significant offloading burden on an IoT device with a limited energy budget. Several attempts have been made in the literature to address this issue, whose main thrust is to selectively offload data depending on its importance to the concerned ML model. In [4], motivated by the classic support vector machine (SVM) technique, data importance was defined in terms of uncertainty, which is inversely proportional to the margin to the decision boundary. A selective retransmission decision was optimized by allowing more transmissions for data with high uncertainty, leading to fast convergence of the corresponding ML model. In the same vein, the scheduling issue of multi-device edge learning was tackled in [5], where a device having more important data samples is granted access to the medium more frequently. In [6], a data sample's gradient norm obtained during training of a deep neural network (DNN) was regarded as the corresponding importance metric, enabling each mobile device to select data samples that are likely to contribute to its local ML model training in a federated edge learning system. In [7], data importance was defined in terms of the dispersion of the dataset distribution, and a device with an important dataset is allowed more bandwidth to accelerate the training process.

Aligned with this trend, we aim to develop a novel edge learning architecture, called joint data deepening and prefetching (JD2P). The above prior works quantify the importance of each data sample or the entire dataset, incurring a significant communication overhead when raw data become complex and high-dimensional. In contrast, the proposed JD2P leverages data embedding to extract a few features from raw data and sort them in the order of importance. This allows us to design a feature importance-based offloading technique, called data deepening: features are sequentially offloaded in the order of importance, and offloading of the next feature stops once the desired performance is reached. Besides, several data samples' subsequent features can be offloaded proactively before being requested, called data prefetching, which extends the offloading duration and thus achieves higher energy efficiency. Through relevant parameter optimizations and extensive simulation studies on the MNIST and fashion-MNIST datasets, it is verified that JD2P reduces the expected energy consumption significantly compared with several benchmarks without degrading learning accuracy.

Figure 1: Edge learning network comprising a pair of edge device and edge server collocated with a wireless access point.

II System Model

This section describes our system model, including the concerned scenario, data structure, and offloading model.

II-A Edge Learning Scenario

Consider an edge learning network comprising a pair of an edge server and an IoT sensor acting as a data collector (see Fig. 1). We aim to train a binary classifier using local data with two classes collected by the IoT sensor. Due to the IoT sensor's limited computation capability, the edge server is requested to train the classifier with the local data offloaded from the IoT sensor instead of training it on the sensor itself.

II-B Data Embedding

Consider $M$ samples measured at the sensor, denoted by $\mathbf{y}_{m}\in\mathbb{R}^{D}$, where $m$ is the index of a data sample, i.e., $m=1,2,\cdots,M$. We assume that the class of each sample is known, denoted by $c_{m}\in\{0,1\}$. Each raw data sample is assumed to have the same dimension $D$, which is in general sufficiently high to reflect complex environments and is known as an obstacle to achieving high-accuracy classification [8]. Besides, a large amount of energy is required to offload such raw data to the edge server. To overcome these limitations, the high-dimensional raw data can be embedded into a low-dimensional space using data embedding techniques [9], such as principal component analysis (PCA) [10] and auto-encoders [11]. Specifically, given $F$ less than $D$, there exists a mapping function $\mathcal{F}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{F}$ such that

$$\mathbf{x}_{m}=\mathcal{F}(\mathbf{y}_{m}), \qquad (1)$$

where $\mathbf{x}_{m}=\left[x_{m,1},\cdots,x_{m,F}\right]^{T}$ represents the embedded data with $F$ features. We assume that the edge device knows the embedding function $\mathcal{F}$, which has been trained by the edge server using a historical dataset. We use PCA as the primary feature embedding technique due to its low computational overhead, while other techniques are straightforwardly applicable. Partial or all features of each embedded data sample are offloaded depending on the offloading and learning designs introduced in the sequel.
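As a concrete illustration, the sketch below embeds raw samples with scikit-learn's PCA; the dimensions, the placeholder historical dataset, and the choice of $F=10$ are illustrative assumptions rather than values fixed by the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

D, F = 784, 10                          # raw dimension (e.g., MNIST pixels) and number of kept features
historical = np.random.rand(1000, D)    # placeholder for the server's historical dataset

pca = PCA(n_components=F).fit(historical)   # the server learns the embedding function F(.)

def embed(y_m: np.ndarray) -> np.ndarray:
    """Map a raw sample y_m in R^D to x_m in R^F; components are ordered by explained variance."""
    return pca.transform(y_m.reshape(1, -1)).ravel()

x_m = embed(np.random.rand(D))          # x_m[0] is the most important feature, x_m[1] the next, and so on
```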

II-C Offloading Model

The entire offloading duration is slotted into $K$ rounds of $t_{0}$ seconds each. The channel gain in round $k$ is denoted by $g_{k}$ with $g_{k}>0$. We assume that channel gains are constant over one round and independently and identically distributed (i.i.d.) over different rounds. Following the models in [12] and [13], the transmission power required to transmit $b$ bits in round $k$, denoted by $e_{k}$, is modeled by a monomial function and is given as $e_{k}=\lambda\frac{(b/t)^{\ell}}{g_{k}}$, where $\lambda$ is the energy coefficient, $\ell$ represents the monomial order, and $t$ is the allowable transmission duration for the $b$ bits. The typical range of the monomial order is $2\leq\ell\leq 5$, depending on the specific modulation and coding scheme. Then, the energy consumption in round $k$, which is the product of $e_{k}$ and $t$, is given as

$$\mathbf{E}(b,t;g_{k})=e_{k}t=\lambda\frac{b^{\ell}}{g_{k}t^{\ell-1}}. \qquad (2)$$

It shows that the energy consumption grows with the transmitted data size $b$ and decreases with the transmission time $t$. For energy-efficient edge learning, it is thus necessary to decrease the amount of transmitted data and extend the transmission time.
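For reference, a small helper encoding the monomial model in (2) is sketched below; the default values $\lambda=10^{-17}$ and $\ell=3$ follow the simulation setup in Sec. V, while the example payload and durations are assumptions for illustration.

```python
def offload_energy(b_bits: float, t_sec: float, g: float,
                   lam: float = 1e-17, ell: int = 3) -> float:
    """Energy (Joule) of Eq. (2): E(b, t; g) = lam * b^ell / (g * t^(ell-1))."""
    return lam * (b_bits ** ell) / (g * t_sec ** (ell - 1))

# Doubling the allowed transmission time cuts the energy by 2^(ell-1) = 4x here.
e_short = offload_energy(b_bits=8 * 1000, t_sec=0.05, g=1.0)
e_long = offload_energy(b_bits=8 * 1000, t_sec=0.10, g=1.0)
print(e_short / e_long)  # -> 4.0
```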

III Joint Data Deepening-and-Prefetching

This section describes JD2P, a novel architecture that realizes energy-efficient edge learning. The overall architecture is briefly introduced first, and the detailed techniques of JD2P are elaborated next.

III-A Overview

The proposed JD2P is a feature-by-feature offloading control for energy-efficient classifier training, built on the following definition.

Definition 1 (Data Depth).

An embedded data sample $\mathbf{x}_{m}$ is said to have depth $k$ when features $1$ to $k$, i.e., $\mathbf{x}_{m}^{(k)}=[x_{m,1},\cdots,x_{m,k}]^{T}$, are enough to correctly predict its class.

By Definition 1, if the depths of all data samples were known in advance, only the amount of data required to train the classifier would need to be offloaded, reducing the resultant energy consumption. On the other hand, each data sample's depth can only be determined after the concerned classifier is trained. Consequently, depth identification and classifier training must be processed simultaneously to cope with this recursive relation, which is technically challenging. To this end, we propose the two key techniques summarized below.

Figure 2: An example of the data deepening process from the 1-dimensional to the 3-dimensional space.

III-A1 Data Deepening

Data deepening is a closed-loop decision on whether to offload a new feature based on the current version of the classifier. Specifically, consider the $k$-depth classifier, defined as the one trained on features $1$ to $k$, i.e., $\mathbf{x}_{m}^{(k)}$ for all $m\in\mathbb{S}^{(k)}$, where $\mathbb{S}^{(k)}$ denotes the index set of data samples that may have a depth of $k$. We use a classic SVM for each depth classifier (the extension to other classifiers such as DNNs and convolutional neural networks (CNNs) is straightforward and remains for future work), whose decision hyperplane is given as

$$(\mathbf{w}^{(k)})^{T}\mathbf{x}^{(k)}+b^{(k)}=0, \qquad (3)$$

where $\mathbf{w}^{(k)}\in\mathbb{R}^{k}$ is the vector perpendicular to the hyperplane and $b^{(k)}$ is the offset parameter. Given a data sample $\mathbf{x}_{m}^{(k)}$ for $m\in\mathbb{S}^{(k)}$, the distance to the hyperplane in (3) can be computed as

$$d_{m}^{(k)}=\left|(\mathbf{w}^{(k)})^{T}\mathbf{x}_{m}^{(k)}+b^{(k)}\right|/\|\mathbf{w}^{(k)}\|, \qquad (4)$$

where $\|\cdot\|$ denotes the Euclidean norm. The data sample $\mathbf{x}_{m}$ is said to be a clearly classified instance (CCI) by the $k$-depth classifier if $d_{m}^{(k)}$ is no less than a threshold $\bar{d}^{(k)}$ to be specified in Sec. III-B. Otherwise, it is said to be an ambiguously classified instance (ACI). In other words, CCIs are depth-$k$ data not requiring an additional feature. Only ACIs are thus included in the new set $\mathbb{S}^{(k+1)}$, given as

$$\mathbb{S}^{(k+1)}=\left\{m~|~d_{m}^{(k)}\leq\bar{d}^{(k)},\ m\in\mathbb{S}^{(k)}\right\}. \qquad (5)$$

As a result, the edge server requests the edge device to offload the next feature $x_{m,k+1}$ for $m\in\mathbb{S}^{(k+1)}$. Fig. 2 illustrates a graphical example of data deepening from the $1$-dimensional to the $3$-dimensional space. The detailed process is summarized in Algorithm 1, except for the design of the threshold $\bar{d}^{(k)}$.

Algorithm 1 Data Deepening
1: Embedded data $\mathbf{x}_{m}$ for all $m\in\{1,\cdots,M\}$.
2: Set $k=0$, $\mathbb{S}^{(1)}=\{1,\cdots,M\}$.
3: while $k\leq K$ do
4:     $k=k+1$.
5:     Using $\{\mathbf{x}_{m}^{(k)}\}$ for $m\in\mathbb{S}^{(k)}$, compute the hyperplane of the $k$-depth classifier, specified in (3).
6:     Compute the threshold $\bar{d}^{(k)}$ using Algorithm 2.
7:     for $m\in\mathbb{S}^{(k)}$ do
8:         Compute $d_{m}^{(k)}$ using (4).
9:         if $d_{m}^{(k)}\leq\bar{d}^{(k)}$ then
10:             $m\in\mathbb{S}^{(k+1)}$.
11:         else
12:             $m\notin\mathbb{S}^{(k+1)}$.
13:         end if
14:     end for
15: end while
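A minimal Python sketch of Algorithm 1 is given below. It assumes an embedded data matrix X (one row per sample, columns ordered by importance), integer labels c, and a threshold routine compute_threshold() implementing Algorithm 2; the linear SVC from scikit-learn stands in for the classic SVM.

```python
import numpy as np
from sklearn.svm import SVC

def data_deepening(X, c, K, compute_threshold):
    """Train k-depth linear SVMs and shrink the ACI set S^(k) feature by feature (Algorithm 1)."""
    S = np.arange(X.shape[0])                 # S^(1): initially every sample is an ACI
    classifiers, thresholds = [], []
    for k in range(1, K + 1):
        if S.size == 0 or np.unique(c[S]).size < 2:
            break                             # nothing ambiguous left to separate
        Xk = X[S, :k]                         # features 1..k of the current ACIs
        clf = SVC(kernel="linear").fit(Xk, c[S])
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        d = np.abs(Xk @ w + b) / np.linalg.norm(w)     # distance to the hyperplane, Eq. (4)
        d_bar = compute_threshold(Xk, c[S], clf, k)    # Algorithm 2 (Sec. III-B)
        classifiers.append(clf)
        thresholds.append(d_bar)
        S = S[d <= d_bar]                     # Eq. (5): keep only the ambiguous samples
    return classifiers, thresholds
```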

III-A2 Data Prefetching

As shown in Fig. 3, round $k$ comprises an offloading duration for the $k$-th features (i.e., $x_{m,k},\ \forall m\in\mathbb{S}^{(k)}$), a training duration for the $k$-depth classifier, and a feedback duration for the new ACI set $\mathbb{S}^{(k+1)}$ in (5). Without loss of generality, the feedback duration is assumed to be negligible due to its small data size and the edge server's high transmit power. Note that $\mathbb{S}^{(k+1)}$ becomes available only when round $(k+1)$ starts, and a sufficient amount of time should be reserved for training the $(k+1)$-depth classifier. Denote by $\tau_{k+1}$ the corresponding training duration. In other words, the offloading duration $t_{k+1}$ should be no more than $t_{0}-\tau_{k+1}$, which makes the energy consumption significant as $\tau_{k+1}$ becomes longer.

Figure 3: Data prefetching architecture

This can be overcome by offloading some data samples' $(k+1)$-th features in advance during the training process, called prefetching. The resultant offloading duration is extended from $t_{k+1}$ to $t_{0}$, enabling the IoT device to reduce energy consumption according to (2). On the other hand, the prefetching decision relies on predicting whether the concerned data samples become ACIs. Unless the prediction is correct, excessive energy is consumed to prefetch useless features. Balancing this tradeoff is key, which is addressed by formulating a stochastic optimization in Sec. IV-A.

III-B Threshold Design for Data Deepening

This subsection deals with the design of the threshold $\bar{d}^{(k)}$ used to categorize a data sample $\mathbf{x}_{m}^{(k)}$ as an ACI or a CCI based on the $k$-depth classifier. The stochastic distribution of each class can be approximated in the form of a $k$-variate Gaussian using the Gaussian mixture model (GMM) [14]. As shown in Fig. 4, an overlapped area between the two distributions is observed, and data samples in this area are likely to be misclassified. We aim to set the threshold $\bar{d}^{(k)}$ such that most data samples in the overlapped area are included except a few outliers located in each tail.

To this end, we introduce the Mahalanobis distance (MD) [15] as a metric representing the distance from each instance to the concerned distribution. Given class $c\in\{0,1\}$, the MD is defined as

$$\delta_{c}^{(k)}=\sqrt{\left(\mathbf{x}^{(k)}-\boldsymbol{\mu}_{c}^{(k)}\right)^{T}\left(\boldsymbol{\Sigma}_{c}^{(k)}\right)^{-1}\left(\mathbf{x}^{(k)}-\boldsymbol{\mu}_{c}^{(k)}\right)}, \qquad (6)$$

where $\mathbf{x}^{(k)}\sim\mathcal{N}(\boldsymbol{\mu}_{c}^{(k)},\boldsymbol{\Sigma}_{c}^{(k)})$, with $\boldsymbol{\mu}_{c}^{(k)}\in\mathbb{R}^{k}$ and $\boldsymbol{\Sigma}_{c}^{(k)}\in\mathbb{R}^{k\times k}$ being the distribution's mean vector and covariance matrix, respectively, which are obtainable through the GMM process. Since $\delta_{c}^{(k)}$ is a scale-free random variable, we set the threshold as the value at which the cumulative distribution function (CDF) of $\delta_{c}^{(k)}$ becomes $p_{\mathrm{th}}$, namely,

$$\mathsf{Pr}\left[\delta_{c}^{(k)}\leq\bar{\delta}_{c}^{(k)}\right]=p_{\mathrm{th}}. \qquad (7)$$

Noting that the square of $\delta_{c}^{(k)}$ follows a chi-square distribution with $k$ degrees of freedom, the CDF of this chi-square distribution for $r>0$ is given as

$$\mathscr{G}(r;k)=\mathsf{Pr}\left[\left(\delta_{c}^{(k)}\right)^{2}\leq r\right]=\frac{\gamma\left(\frac{k}{2},\frac{r}{2}\right)}{\Gamma\left(\frac{k}{2}\right)}, \qquad (8)$$

where $\Gamma$ is the gamma function defined as $\Gamma(k)=\int_{0}^{\infty}t^{k-1}e^{-t}dt$ and $\gamma$ is the lower incomplete gamma function defined as $\gamma(k,r)=\int_{0}^{r}t^{k-1}e^{-t}dt$. In closed form, the threshold $\bar{\delta}_{c}^{(k)}$ is given as

$$\bar{\delta}_{c}^{(k)}=\sqrt{\mathscr{G}^{-1}(p_{\mathrm{th}};k)}, \qquad (9)$$

where $\mathscr{G}^{-1}$ represents the inverse CDF of the chi-square distribution with $k$ degrees of freedom. Due to the scale-free property, the threshold $\bar{\delta}_{c}^{(k)}$ is set identically regardless of the concerned class; thus, the class index can be omitted, namely, $\bar{\delta}_{0}^{(k)}=\bar{\delta}_{1}^{(k)}=\bar{\delta}^{(k)}$. Given $\bar{\delta}^{(k)}$, each distribution can be truncated as

$$\mathcal{R}_{c}=\left\{\mathbf{x}^{(k)}\in\mathbb{R}^{k}~|~\delta_{c}^{(k)}\leq\bar{\delta}^{(k)}\right\},\quad c\in\{0,1\}. \qquad (10)$$

Last, the threshold $\bar{d}^{(k)}$ is set as the maximum distance from the hyperplane in (3) to an arbitrary $k$-dimensional point $\mathbf{x}^{(k)}$ in the overlapped area of $\mathcal{R}_{0}$ and $\mathcal{R}_{1}$, given as

$$\bar{d}^{(k)}=\max_{\mathbf{x}^{(k)}\in\mathcal{R}_{0}\cap\mathcal{R}_{1}}\left|(\mathbf{w}^{(k)})^{T}\mathbf{x}^{(k)}+b^{(k)}\right|/\|\mathbf{w}^{(k)}\|. \qquad (11)$$

The process to obtain the threshold $\bar{d}^{(k)}$ is summarized in Algorithm 2.
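A possible realization of Algorithm 2 is sketched below; it approximates each class by a single Gaussian (the GMM step), uses scipy's chi-square inverse CDF for (9), and, as a simplifying assumption, searches the maximum in (11) over the samples that fall inside both truncated regions instead of over the continuous set $\mathcal{R}_{0}\cap\mathcal{R}_{1}$. The confidence level p_th is an assumed value.

```python
import numpy as np
from scipy.stats import chi2
from scipy.spatial.distance import mahalanobis

def compute_threshold(Xk, labels, clf, k, p_th=0.95):
    """Return d_bar^(k) of Eq. (11) for the k-depth classifier clf."""
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    delta_bar = np.sqrt(chi2.ppf(p_th, df=k))          # Eq. (9), identical for both classes
    inside = []
    for c in (0, 1):
        Xc = Xk[labels == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + 1e-9 * np.eye(k)   # regularized class covariance (GMM step)
        inv_cov = np.linalg.inv(cov)
        md = np.array([mahalanobis(x, mu, inv_cov) for x in Xk])   # Eq. (6) for every sample
        inside.append(md <= delta_bar)                 # membership in the truncated region R_c, Eq. (10)
    overlap = inside[0] & inside[1]                    # samples lying in R_0 ∩ R_1
    if not overlap.any():
        return 0.0
    d = np.abs(Xk[overlap] @ w + b) / np.linalg.norm(w)
    return float(d.max())                              # Eq. (11)
```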

Figure 4: The ACI region in the 1-dimensional space, obtained from the class probability distributions and the distance from the hyperplane.
Remark 1 (Symmetric ACI Region).

Noting that the covariance matrices $\{\boldsymbol{\Sigma}_{c}^{(k)}\}_{c\in\{0,1\}}$ of the two classes differ, the resultant truncated areas $\mathcal{R}_{0}$ and $\mathcal{R}_{1}$ become asymmetric. To avoid the classifier being overfitted to one class, we choose a common distance threshold for both classes, namely $\bar{d}^{(k)}$ in (11), corresponding to the maximum distance between the two.

III-C Hierarchical Edge Inference

After $K$ rounds, the entire classifier has a hierarchical structure comprising the $1$-depth to $K$-depth classifiers. Consider a mobile device sending an unlabeled data sample to the edge server, which initially treats it as an ACI. Starting from the $1$-depth classifier, the data sample passes through the depth classifiers in sequence until it becomes a CCI. The depth of the last classifier is referred to as the data sample's depth, and its classification result becomes the final one.
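A short sketch of this cascaded inference is given below, assuming the lists of depth classifiers and thresholds produced during training (e.g., by the data_deepening sketch above); it is illustrative rather than the paper's implementation.

```python
import numpy as np

def hierarchical_predict(x, classifiers, thresholds):
    """Pass the embedded sample x through the 1-depth, 2-depth, ... classifiers until it becomes a CCI."""
    label = None
    for k, (clf, d_bar) in enumerate(zip(classifiers, thresholds), start=1):
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        d = abs(x[:k] @ w + b) / np.linalg.norm(w)
        label = int(clf.predict(x[:k].reshape(1, -1))[0])
        if d > d_bar:                     # CCI at depth k: accept this prediction
            return label, k
    return label, len(classifiers)        # still ambiguous: fall back to the deepest classifier
```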

Algorithm 2 Finding the threshold $\bar{d}^{(k)}$
1: Embedded data $\mathbf{x}_{m}^{(k)}$ for $m\in\mathbb{S}^{(k)}$, the $k$-depth classifier.
2: for $c\in\{0,1\}$ do
3:     Find $\boldsymbol{\mu}_{c}$, $\boldsymbol{\Sigma}_{c}$ through the GMM process.
4:     Compute $\bar{\delta}_{c}^{(k)}$ specified in (9).
5:     Compute the truncated domain of the distribution $\mathcal{R}_{c}$ using (10).
6: end for
7: Find the overlapped area $\mathcal{R}=\mathcal{R}_{0}\cap\mathcal{R}_{1}$.
8: Using the $k$-depth classifier in (3), compute $\bar{d}^{(k)}$ using (11).
9: return $\bar{d}^{(k)}$.

IV Optimal Data Prefetching

This section deals with selecting the amount of prefetched data so as to minimize the expected energy consumption of the sensor.

IV-A Problem Formulation

Consider the prefetching duration in round $k$, say $\tau_{k}$, which is equivalent to the training duration of the $k$-depth classifier, as shown in Fig. 3. The number of data samples in $\mathbb{S}^{(k)}$ is denoted by $s_{k}$. Among them, $p_{k}$ data samples are randomly selected and their $(k+1)$-th features are prefetched. The prefetched data size is $\alpha p_{k}$, where $\alpha$ represents the number of bits required to quantize a feature (the quantization bit rate depends on the intensity range; for example, one pixel of MNIST data takes 256 intensity levels and can be quantized with 8 bits, i.e., $\alpha=8$). Given the channel gain $g_{k}$, the resultant energy consumption for prefetching is

$$\mathbf{E}(\alpha p_{k},\tau_{k};g_{k})=\lambda\frac{\alpha^{\ell}}{g_{k}\tau_{k}^{\ell-1}}p_{k}^{\ell}. \qquad (12)$$

Here, the number of prefetched data samples $p_{k}$ is a discrete control parameter ranging from $0$ to $s_{k}$. For tractable optimization in the sequel, we regard $p_{k}$ as a continuous variable within this range, which is rounded to the nearest integer in practice.

Next, consider the offloading duration in round $(k+1)$, say $t_{k+1}=t_{0}-\tau_{k+1}$. Among the data samples in $\mathbb{S}^{(k+1)}$, a number of samples, denoted by $n_{k+1}$, remain after prefetching. Given the channel gain $g_{k+1}$, the resultant energy consumption is

$$\mathbf{E}(\alpha n_{k+1},t_{k+1};g_{k+1})=\lambda\frac{\alpha^{\ell}}{g_{k+1}t_{k+1}^{\ell-1}}n_{k+1}^{\ell}. \qquad (13)$$

Note that $n_{k+1}$ is determined only after the $k$-depth classifier is trained; in other words, $n_{k+1}$ is random at the instant of the prefetching decision. Denote by $\rho_{k}$ the probability that a data sample in $\mathbb{S}^{(k)}$ is included in $\mathbb{S}^{(k+1)}$. Then, $n_{k+1}$ follows a binomial distribution with parameters $(s_{k}-p_{k})$ and $\rho_{k}$, whose probability mass function is $P(j)=\binom{s_{k}-p_{k}}{j}\rho_{k}^{j}(1-\rho_{k})^{s_{k}-p_{k}-j}$ for $j=0,\cdots,s_{k}-p_{k}$. Given $p_{k}$, the expected energy consumption is

$$\mathbb{E}_{n_{k+1},g_{k+1}}\left[\mathbf{E}(\alpha n_{k+1},t_{k+1};g_{k+1})\right]=\lambda\frac{\nu\alpha^{\ell}}{t_{k+1}^{\ell-1}}\mathbb{E}_{n_{k+1}}\left[n_{k+1}^{\ell}\right], \qquad (14)$$

where $\nu=\mathbb{E}\left[\frac{1}{g_{k+1}}\right]$ is the expectation of the inverse channel gain, which can be known a priori due to the i.i.d. property.
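Since $n_{k+1}$ is binomial, the moment $\mathbb{E}_{n_{k+1}}[n_{k+1}^{\ell}]$ in (14) can be evaluated exactly by summing over the probability mass function, as in the sketch below; the parameter values are illustrative assumptions, and the last line also evaluates the upper bound (15) used in Sec. IV-B.

```python
import numpy as np
from scipy.stats import binom

def expected_offload_energy(p_k, s_k, rho_k, t_next, nu, alpha=8, lam=1e-17, ell=3):
    """Expected round-(k+1) offloading energy, Eq. (14), when p_k samples were prefetched."""
    n = np.arange(0, s_k - p_k + 1)
    moment = np.sum(binom.pmf(n, s_k - p_k, rho_k) * n ** ell)   # exact E[n_{k+1}^ell]
    return lam * nu * alpha ** ell / t_next ** (ell - 1) * moment

# Example: s_k = 200 ACIs, half of them prefetched, rho_k = 0.3, t_{k+1} = 0.05 s, nu = E[1/g] = 1.2
energy = expected_offload_energy(p_k=100, s_k=200, rho_k=0.3, t_next=0.05, nu=1.2)
bound = ((200 - 100) * 0.3 + 3 / 2) ** 3                         # upper bound (15) on E[n_{k+1}^3]
```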

Last, the sum of (12) and (14) is the expected energy consumption for the $(k+1)$-th feature when $p_{k}$ data samples are prefetched, leading to the following two-stage stochastic optimization:

$$\begin{aligned}
\text{(P1)}\quad \min_{p_{k}}~~ & \lambda\frac{\alpha^{\ell}}{g_{k}\tau_{k}^{\ell-1}}p_{k}^{\ell}+\lambda\frac{\nu\alpha^{\ell}}{t_{k+1}^{\ell-1}}\mathbb{E}_{n_{k+1}}\left[n_{k+1}^{\ell}\right]\\
\text{s.t.}~~ & 0\leq p_{k}\leq s_{k}.
\end{aligned}$$

The optimal prefetching policy can be designed by solving (P1), as explained in the following subsection.

IV-B Optimal Prefetching Control

This subsection aims at deriving a closed-form expression for the optimal prefetching number $p_{k}^{*}$ by solving (P1). The main difficulty lies in the $\ell$-th moment $\mathbb{E}_{n_{k+1}}[n_{k+1}^{\ell}]$, for which no simple form is known for general $\ell$. To address it, we use the upper bound on the $\ell$-th moment from [16],

$$\mathbb{E}_{n_{k+1}}\left[n_{k+1}^{\ell}\right]\leq\left(\mu_{n_{k+1}}+\frac{\ell}{2}\right)^{\ell}, \qquad (15)$$

where $\mu_{n_{k+1}}=\left(s_{k}-p_{k}\right)\rho_{k}$ is the mean of the binomial distribution with parameters $(s_{k}-p_{k})$ and $\rho_{k}$. It is proved in [16] that the above upper bound is tight when the order $\ell$ is less than the mean $\mu_{n_{k+1}}$. Therefore, instead of solving (P1) directly, we minimize the upper bound of its objective function, formulated as

$$\begin{aligned}
\text{(P2)}\quad \min_{p_{k}}~~ & \frac{p_{k}^{\ell}}{g_{k}\tau_{k}^{\ell-1}}+\frac{\nu}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell}\\
\text{s.t.}~~ & 0\leq p_{k}\leq s_{k}.
\end{aligned}$$

Note that (P2) is a convex optimization problem, enabling us to derive a closed-form solution. The main result is presented in the following proposition.

Proposition 1 (Optimal Prefetching Policy).

Given the ratio $\rho_{k}$ in round $k$, the optimal prefetching data size $p_{k}^{*}$, which is the solution to (P2), is

$$p^{*}_{k}=\left(\frac{\varphi\rho_{k}^{\frac{1}{\ell-1}}}{1+\varphi\rho_{k}^{\frac{\ell}{\ell-1}}}\right)\left(s_{k}\rho_{k}+\frac{\ell}{2}\right), \qquad (16)$$

where $\varphi=\left(g_{k}\nu\right)^{\frac{1}{\ell-1}}\frac{\tau_{k}}{t_{k+1}}$.

Proof:

Define the Lagrangian function of (P2) as

$$L=\frac{p_{k}^{\ell}}{g_{k}\tau_{k}^{\ell-1}}+\frac{\nu}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell}+\eta(p_{k}-s_{k}),$$

where $\eta$ is a Lagrange multiplier. Since (P2) is a convex optimization problem, the following KKT conditions are necessary and sufficient for optimality:

$$\frac{\ell p_{k}^{\ell-1}}{g_{k}\tau_{k}^{\ell-1}}-\frac{\ell\nu\rho_{k}}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell-1}+\eta\geq 0, \qquad (17a)$$
$$p_{k}\left(\frac{\ell p_{k}^{\ell-1}}{g_{k}\tau_{k}^{\ell-1}}-\frac{\ell\nu\rho_{k}}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell-1}+\eta\right)=0, \qquad (17b)$$
$$\eta\left(p_{k}-s_{k}\right)=0. \qquad (17c)$$

First, if $\eta$ is positive, then $p_{k}$ should be equal to $s_{k}$ due to the complementary slackness condition (17c), making the left-hand side of (17b) strictly positive. In other words, the optimal multiplier $\eta$ is zero to satisfy (17b). Second, with $p_{k}=0$, the left-hand side of condition (17a) is always strictly negative unless $\rho_{k}=0$. As a result, given $\rho_{k}>0$, $p_{k}$ should be strictly positive and satisfy the following equality condition:

$$\frac{\ell p_{k}^{\ell-1}}{g_{k}\tau_{k}^{\ell-1}}-\frac{\ell\nu\rho_{k}}{t_{k+1}^{\ell-1}}\left(\left(s_{k}-p_{k}\right)\rho_{k}+\frac{\ell}{2}\right)^{\ell-1}=0. \qquad (18)$$

Solving (18) leads to the optimal solution of (P2), which completes the proof of this proposition. $\Box$

Remark 2 (Effect of Parameters).

Assume that the number of ACIs in round $k$, say $s_{k}=|\mathbb{S}^{(k)}|$, is significantly larger than $\frac{\ell}{2}$. Then we can approximate (16) as $p^{*}_{k}\approx\left(\frac{\varphi\rho_{k}^{\frac{1}{\ell-1}}}{1+\varphi\rho_{k}^{\frac{\ell}{\ell-1}}}\right)s_{k}\rho_{k}$. Noting that the term $s_{k}\rho_{k}$ represents the expected number of ACIs in round $(k+1)$, the parameter $\varphi$ controls the portion of prefetching as follows (a numerical sketch is given after the list):

  • As the current channel gain $g_{k}$ becomes larger or the training duration $\tau_{k}$ increases, the parameter $\varphi$ increases and the optimal solution $p_{k}^{*}$ approaches $s_{k}\rho_{k}$;

  • As $g_{k}$ becomes smaller and $\tau_{k}$ decreases, both $\varphi$ and $p_{k}^{*}$ converge to zero.
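The closed form (16) is easy to evaluate numerically; the sketch below computes $p_{k}^{*}$ for two illustrative channel gains (all parameter values are assumptions, not taken from the paper's experiments) and shows that a larger $g_{k}$ or a longer $\tau_{k}$ increases $\varphi$ and hence the prefetched amount, while a weak channel suppresses prefetching.

```python
def optimal_prefetch(s_k, rho_k, g_k, nu, tau_k, t_next, ell=3):
    """Closed-form prefetching size p_k* from Eq. (16), capped at the constraint p_k <= s_k."""
    phi = (g_k * nu) ** (1.0 / (ell - 1)) * tau_k / t_next
    frac = phi * rho_k ** (1.0 / (ell - 1)) / (1.0 + phi * rho_k ** (ell / (ell - 1)))
    return min(frac * (s_k * rho_k + ell / 2.0), s_k)

# Stronger channel (larger g_k) -> larger phi -> more aggressive prefetching.
print(optimal_prefetch(s_k=200, rho_k=0.3, g_k=5.0, nu=1.0, tau_k=0.5, t_next=0.05))
# Weaker channel -> smaller phi -> fewer samples prefetched.
print(optimal_prefetch(s_k=200, rho_k=0.3, g_k=0.05, nu=1.0, tau_k=0.5, t_next=0.05))
```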

V Simulation Results

In this section, simulation results are presented to validate the superiority of JD2P over several benchmarks. Unless stated otherwise, the parameters are set as follows. The entire offloading duration consists of $10$ rounds ($K=10$), each of which is set to $t_{0}=0.1$ (sec). For offloading, the channel follows the Gamma distribution with shape parameter $\beta>1$ and probability density function $f_{g}(x)=\frac{x^{\beta-1}e^{-\beta x}}{(1/\beta)^{\beta}\Gamma(\beta)}$, where $\Gamma(\beta)=\int_{0}^{\infty}x^{\beta-1}e^{-x}dx$ is the gamma function and the mean is $\mathbb{E}[g_{k}]=1$. The energy coefficient $\lambda$ is set to $10^{-17}$, according to [12]. The monomial order of the energy consumption model in (2) is set to $\ell=3$. For computing and prefetching, the reserved training duration $\tau_{k}$ is assumed constant over all rounds and fixed to $\tau_{k}=\tau=0.5$ (sec) for $1\leq k\leq K$.
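The channel model above can be reproduced as in the sketch below: gains are drawn i.i.d. from a Gamma distribution with shape $\beta$ and scale $1/\beta$ so that $\mathbb{E}[g_{k}]=1$; the value $\beta=2$ is an illustrative assumption (the paper only requires $\beta>1$).

```python
import numpy as np

rng = np.random.default_rng(0)
beta, K = 2.0, 10
g = rng.gamma(shape=beta, scale=1.0 / beta, size=K)       # one gain per offloading round, E[g_k] = 1

# Monte-Carlo estimate of nu = E[1/g_k]; for Gamma(beta, 1/beta) this equals beta/(beta-1) when beta > 1.
nu = np.mean(1.0 / rng.gamma(shape=beta, scale=1.0 / beta, size=100_000))
```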

We use the MNIST and fashion-MNIST datasets for training and testing. Both datasets include $6\cdot 10^{4}$ training samples and $10^{4}$ test samples, each with $784$ gray-scale pixels, and each dataset has $10$ classes. We conduct experiments with every possible pair of classes, namely $\binom{10}{2}=45$ pairs. PCA is applied for data embedding. For comparison, we consider two benchmark schemes. The first uses data deepening only, without data prefetching. The second is full offloading, where all data samples' $10$ features are offloaded first and the classifier is trained on them. To be specific, the offloading duration of each round is $t_{0}$ except the last one, which is reduced to $t_{K}=t_{0}-\tau$.

Figure 5: Performance of JD2P compared with benchmark schemes. Each point represents the average value over all pairs of classes. The training duration is set as $\tau=0.5$ s.

First, the expected energy consumption (in Joule) versus the error rate (in %) is plotted in Fig. 5. The proposed JD2P consumes less energy than the full offloading scheme, namely, $23$ dB and $20$ dB energy gains for MNIST and fashion-MNIST, respectively, at the cost of a marginal degradation in the error rate. The effectiveness of data prefetching is demonstrated in Fig. 6, which plots the expected energy consumption gain against the prefetching duration $\tau$ for the MNIST dataset (the fashion-MNIST dataset follows a similar tendency, although the result is omitted in this paper). JD2P's expected energy consumption is always smaller than that of the data-deepening-only scheme thanks to the sophisticated control of prefetched data in Sec. IV. On the other hand, when compared with the full offloading scheme, the energy gain of JD2P decreases as $\tau$ increases. In other words, a shorter offloading duration compels more data samples to be prefetched, wasting more energy since many prefetched data samples are likely to become CCIs and thus not be used for the subsequent training.

VI Concluding Remarks

This study explored a multi-round offloading technique for energy-efficient edge learning. Two criteria for achieving energy efficiency are 1) reducing the amount of offloaded data and 2) extending the offloading duration. JD2P was proposed to address both, integrating data deepening and data prefetching by measuring feature-by-feature data importance and optimizing the amount of prefetched data to avoid wasting energy. Our comprehensive simulation study demonstrated that JD2P can significantly reduce the expected energy consumption compared to several benchmarks.

Figure 6: Effect of the prefetching duration on the expected energy consumption gain obtained by comparing with the full offloading scheme in the case of the MNIST dataset.

Though the current work targets a simple SVM-based binary classifier with PCA as the key data embedding technique, the proposed JD2P is straightforwardly applicable to more challenging scenarios, such as a multi-class DNN classifier with an advanced data embedding technique. Besides, it would be interesting to analyze the performance of JD2P with respect to various parameters, which is essential to derive rigorous guidelines for its practical use.

References

  • [1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Communications Magazine, vol. 58, no. 1, pp. 19–25, 2020.
  • [2] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. MIT Press, 2008.
  • [3] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” arXiv preprint arXiv:1312.6211, 2013.
  • [4] D. Liu, G. Zhu, Q. Zeng, J. Zhang, and K. Huang, “Wireless data acquisition for edge learning: Data-importance aware retransmission,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 406–420, 2020.
  • [5] D. Liu, G. Zhu, J. Zhang, and K. Huang, “Data-importance aware user scheduling for communication-efficient edge machine learning,” IEEE Transactions on Cognitive Communications and Networking, vol. 7, no. 1, pp. 265–278, 2020.
  • [6] Y. He, J. Ren, G. Yu, and J. Yuan, “Importance-aware data selection and resource allocation in federated edge learning system,” IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 13593–13605, 2020.
  • [7] A. Taïk, Z. Mlika, and S. Cherkaoui, “Data-aware device scheduling for federated edge learning,” IEEE Transactions on Cognitive Communications and Networking, vol. 8, no. 1, pp. 408–421, 2021.
  • [8] P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.
  • [9] L. Zheng, S. Wang, and Q. Tian, “Coupled binary embedding for large-scale image retrieval,” IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3368–3380, 2014.
  • [10] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
  • [11] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [12] Y. Tao, C. You, P. Zhang, and K. Huang, “Stochastic control of computation offloading to a helper with a dynamically loaded CPU,” IEEE Transactions on Wireless Communications, vol. 18, no. 2, pp. 1247–1262, 2019.
  • [13] W. Zhang, Y. Wen, K. Guan, D. Kilper, H. Luo, and D. O. Wu, “Energy-optimal mobile cloud computing under stochastic wireless channel,” IEEE Transactions on Wireless Communications, vol. 12, no. 9, pp. 4569–4581, 2013.
  • [14] C. M. Bishop and N. M. Nasrabadi, Pattern Recognition and Machine Learning. Springer, 2006, vol. 4, no. 4.
  • [15] M. Bensimhoun, “N-dimensional cumulative function, and other useful facts about Gaussians and normal densities,” Jerusalem, Israel, Tech. Rep., pp. 1–8, 2009.
  • [16] T. D. Ahle, “Sharp and simple bounds for the raw moments of the binomial and Poisson distributions,” Statistics & Probability Letters, vol. 182, p. 109306, 2022.