
Optimized Power Control for Over-the-Air Federated Edge Learning

Xiaowen Cao^{1,3}, Guangxu Zhu^{2}, Jie Xu^{3}, and Shuguang Cui^{3,2}
^{1}School of Information Engineering, Guangdong University of Technology, Guangzhou, China
^{2}Shenzhen Research Institute of Big Data, Shenzhen, China
^{3}FNii and SSE, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China
Email: caoxwen@outlook.com, gxzhu@sribd.cn, xujie@cuhk.edu.cn, shuguangcui@cuhk.edu.cn
Abstract

Over-the-air federated edge learning (Air-FEEL) is a communication-efficient solution for privacy-preserving distributed learning over wireless networks. Air-FEEL allows “one-shot” over-the-air aggregation of gradient/model updates by exploiting the waveform superposition property of wireless channels, and thus promises an extremely low aggregation latency that is independent of the network size. However, such communication efficiency may come at the cost of learning performance degradation due to the aggregation error caused by the non-uniform channel fading over devices and noise perturbation. Prior work adopted channel inversion power control (or its variants) to reduce the aggregation error by aligning the channel gains, which, however, could be highly suboptimal in deep fading scenarios due to noise amplification. To overcome this issue, we investigate power control optimization for enhancing the learning performance of Air-FEEL. Towards this end, we first analyze the convergence behavior of Air-FEEL by deriving the optimality gap of the loss function under any given power control policy. Then we optimize the power control to minimize the optimality gap for accelerating convergence, subject to a set of average and maximum power constraints at the edge devices. The problem is generally non-convex and challenging to solve due to the coupling of the power control variables over different devices and iterations. To tackle this challenge, we develop an efficient algorithm by jointly exploiting the successive convex approximation (SCA) and trust region methods. Numerical results show that the optimized power control policy achieves significantly faster convergence than benchmark policies such as channel inversion and uniform power transmission.

I Introduction

In the pursuit of the ubiquitous intelligence envisioned for future 6G networks, recent years have witnessed the spreading of artificial intelligence (AI) algorithms from the cloud to the network edge, resulting in an active area called edge intelligence [1, 2]. The core research issue therein is to allow low-latency and privacy-aware access to rich mobile data for intelligence distillation. To this end, a popular framework called federated edge learning (FEEL) was recently proposed, which distributes the task of model training over edge devices so as to reduce the communication overhead and keep the data local [3, 4]. Essentially, the FEEL framework is a distributed implementation of stochastic gradient descent (SGD) over wireless networks. A typical training process iterates between 1) broadcasting of the global model under training from the edge server to the devices for local SGD execution using local data, and 2) uploading of local models/gradients from the devices to the edge server for aggregation and global model updating. Although the uploading of high-volume raw data is avoided, the update-aggregation process in FEEL may still suffer from a communication bottleneck due to the high dimensionality of each update and the multiple access by many devices over wireless links. To tackle this issue, one promising solution called over-the-air FEEL (Air-FEEL) has been proposed, which exploits over-the-air computation (AirComp) for “one-shot” aggregation via concurrent update transmission, such that communication and computation are integrated in a joint design by exploiting the superposition property of a multiple access channel (MAC) [1, 5, 6].

The idea of AirComp was first proposed in [5] in the context of data aggregation in sensor networks, where it was surprisingly found that “interference” can be harnessed by structured codes to help functional computation over a MAC. Inspired by this finding, it was shown in the subsequent work [6] that for Gaussian independent and identically distributed (i.i.d.) data sources, uncoded transmission is optimal in terms of distortion minimization. Besides the information-theoretic studies, various practical issues faced by AirComp implementation were also considered in [7, 8, 9]. In particular, the synchronization issue in AirComp was addressed in [7] via an innovative idea of shared clock broadcasting from the edge server to the devices. The optimal power control policies for AirComp over fading channels were derived in [8] to minimize the average computation distortion, and a cooperative interference management framework for coordinating coexisting AirComp tasks over multi-cell networks was developed in [9].

More recently, AirComp has found its merits in the new context of FEEL, known as Air-FEEL, for communication-efficient update aggregation, as demonstrated in a rich set of prior works [10, 11, 12, 13, 14, 15]. Specifically, a broadband Air-FEEL solution was proposed in [10], where several communication-learning tradeoffs were derived to guide the design of device scheduling. Around the same time, a source-coding algorithm exploiting gradient sparsification was proposed in [11] to implement Air-FEEL with compressed updates for higher communication efficiency. In parallel, a joint design of device scheduling and beamforming in a multi-antenna system was presented in [12] to accelerate Air-FEEL. Subsequently, gradient-statistics-aware power control was investigated in [13] to further enhance the performance of Air-FEEL. Furthermore, to make Air-FEEL compatible with the digital chips embedded in modern edge devices, Air-FEEL based on digital modulation was proposed in [14], featuring one-bit quantization and modulation at the edge devices and majority-vote based decoding at the edge server. Besides the benefit of low latency, Air-FEEL was also found to be beneficial for data privacy enhancement, as individual updates are not accessible to the edge server, eliminating the risk of model inversion attacks [15].

Despite the promise of high communication efficiency, Air-FEEL may suffer from severe learning performance degradation due to the aggregation error caused by the non-uniform channel fading over devices and noise perturbation. Prior work in this field mostly assumed channel inversion power control (or its variants) [10, 11, 12, 15] in an effort to reduce the aggregation error by aligning the channel gains, which could be highly suboptimal in deep fading scenarios due to noise amplification. Although there exists one relevant study on power control for the Air-FEEL system in [13], it focused on minimizing the intermediate aggregation distortion (e.g., mean squared error) instead of the ultimate learning performance (e.g., the general loss function). Therefore, there remains a research gap in optimizing the learning performance of Air-FEEL by judicious power control, which motivates the current work. To close this gap, we first analyze the convergence behavior of Air-FEEL by deriving the optimality gap of the loss function under an arbitrary power control policy. Then the power control problem is formulated to minimize the optimality gap for convergence acceleration, subject to a set of average and maximum power constraints at the edge devices. The problem is generally non-convex and challenging to solve due to the coupling of the power control variables over different devices and iterations. This challenge is tackled by the joint use of successive convex approximation (SCA) and trust region methods in the derivation of the optimized power control algorithm. Numerical results show that the optimized power control policy achieves significantly faster convergence than benchmark policies such as channel inversion and uniform power transmission, thus opening up a new degree of freedom for regulating the performance of Air-FEEL by power control.

II System Model

Figure 1: Illustration of over-the-air federated edge learning.

We consider an Air-FEEL system consisting of an edge server and $K\geq 1$ edge devices, as shown in Fig. 1. With the coordination of the edge server, the edge devices cooperatively train a shared machine learning model via over-the-air update aggregation, as elaborated in the sequel.

II-A Learning Model

We assume that the learning model is represented by the parameter vector ${\bf w}\in\mathbb{R}^{q}$, with $q$ denoting the model size. Let ${\mathcal{D}}_{k}$ denote the local dataset at edge device $k$, in which the $i$-th sample and its ground-truth label are denoted by ${\bf x}_{i}$ and $y_{i}$, respectively. Then the local loss function of the model vector ${\bf w}$ on ${\mathcal{D}}_{k}$ is

$$F_{k}({\bf w})=\frac{1}{|{\mathcal{D}}_{k}|}\sum_{({\bf x}_{i},y_{i})\in{\mathcal{D}}_{k}}f({\bf w},{\bf x}_{i},y_{i})+\rho R({\bf w}),\qquad(1)$$

where $f({\bf w},{\bf x}_{i},y_{i})$ denotes the sample-wise loss function quantifying the prediction error of the model ${\bf w}$ on the sample ${\bf x}_{i}$ with respect to (w.r.t.) its ground-truth label $y_{i}$, and $R({\bf w})$ denotes a strongly convex regularization function scaled by a hyperparameter $\rho\geq 0$. For notational convenience, we simplify $f({\bf w},{\bf x}_{i},y_{i})$ as $f_{i}({\bf w})$. Then, the global loss function over all the distributed datasets is given by

$$F({\bf w})=\frac{1}{K}\sum_{k\in\mathcal{K}}F_{k}({\bf w}),\qquad(2)$$

where ${\mathcal{D}}=\cup_{k\in\mathcal{K}}{\mathcal{D}}_{k}$ with $D_{\rm tot}=|{\mathcal{D}}|$, ${\mathcal{K}}\triangleq\{1,\cdots,K\}$ denotes the set of edge devices, and the sizes of the datasets at all edge devices are assumed to be uniform for notational simplicity, i.e., $|{\mathcal{D}}_{k}|=\bar{D},\forall k\in\mathcal{K}$.

The objective of the training process is to minimize the global loss function $F({\bf w})$:

𝐰=argmin𝐰F(𝐰).\displaystyle{\bf w}^{\star}=\arg\min_{\bf w}F({\bf w}). (3)

Instead of directly uploading all the local data to the edge server for centralized training, the learning problem in (3) can be solved iteratively in a distributed manner based on a gradient-averaging approach, as illustrated in Fig. 1.

At each communication round $n$, the machine learning model is denoted by ${\bf w}^{(n)}$. Each edge device $k$ computes its local gradient, denoted by ${\bf g}_{k}^{(n)}$, using the local dataset ${\mathcal{D}}_{k}$:

$${\bf g}_{k}^{(n)}=\frac{1}{|{\mathcal{D}}_{k}|}\sum_{({\bf x}_{i},y_{i})\in{\mathcal{D}}_{k}}\nabla f_{i}({\bf w}^{(n)})+\rho\nabla R({\bf w}^{(n)}),\qquad(4)$$

where \nabla is the gradient operator and we assume that the whole local dataset is used to estimate the local gradients. Next, the edge devices upload all local gradients to the edge server, which are further averaged to obtain the global gradient:

$$\bar{\bf g}^{(n)}=\frac{1}{K}\sum_{k\in\mathcal{K}}{\bf g}_{k}^{(n)}.\qquad(5)$$

Then, the global gradient estimate is broadcast from the edge server to the edge devices, based on which each edge device updates its own model via

$${\bf w}^{(n+1)}={\bf w}^{(n)}-\eta\cdot\bar{\bf g}^{(n)},\qquad(6)$$

where $\eta$ is the learning rate. The above procedure continues until a convergence criterion is met or the maximum number of iterations is reached.
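To make the training protocol concrete, the following minimal Python sketch runs one noise-free FEEL round per (4)-(6), using the ridge-regression loss later adopted in Section V; the synthetic data and all numerical settings are illustrative assumptions, not the paper's.

```python
import numpy as np

# One FEEL round per (4)-(6) for ridge regression, f = 0.5*(x^T w - y)^2,
# R(w) = ||w||^2. K devices, q model size, eta learning rate (all assumed).
K, q, eta, rho = 20, 10, 0.05, 5e-5
rng = np.random.default_rng(0)
local_data = [(rng.normal(size=(25, q)), rng.normal(size=25)) for _ in range(K)]
w = np.zeros(q)  # global model, broadcast to all devices

# Step 1: each device computes its local gradient g_k per (4)
local_grads = []
for X_k, y_k in local_data:
    g_k = X_k.T @ (X_k @ w - y_k) / len(y_k) + rho * 2 * w  # grad of (1)
    local_grads.append(g_k)

# Step 2: the server averages the local gradients per (5)
g_bar = np.mean(local_grads, axis=0)

# Step 3: global model update per (6)
w = w - eta * g_bar
```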

II-B Basic Assumptions on Learning Model

To facilitate the subsequent convergence analysis, we make several standard assumptions on the loss function and the gradient estimates.

Assumption 1 (Smoothness).

Let ${\bf g}=\nabla F({\bf w})$ denote the gradient of the loss function evaluated at point ${\bf w}$. Then there exists a non-negative constant vector ${\bf L}\in\mathbb{R}^{q}$ such that

$$F({\bf w})-\left[F({\bf w}^{\prime})+\nabla F({\bf w}^{\prime})^{T}({\bf w}-{\bf w}^{\prime})\right]\leq\frac{1}{2}\sum_{i=1}^{q}L_{i}(w_{i}-w^{\prime}_{i})^{2},\quad\forall{\bf w},{\bf w}^{\prime},$$

where the superscript $T$ denotes the transpose operation.

Assumption 2 (Polyak-Łojasiewicz Inequality).

Let FF^{\star} denote the optimal loss function value to problem (3). There exists a constant μ0\mu\geq 0 such that the global loss function F(𝐰)F({\bf w}) satisfies the following Polyak-Lojasiewicz (PL) condition:

$$\|{\bf g}\|_{2}^{2}\geq 2\mu\left(F({\bf w})-F^{\star}\right).$$

Notice that the above assumption is more general than the standard assumption of strong convexity [16]. Typical loss functions that satisfy the above two assumptions include logistic regression, linear regression and least squares.

Assumption 3 (Variance Bound).

The local gradient estimates $\{{\bf g}_{k}\}$ defined in (4), where the index $(n)$ is omitted for simplicity, are assumed to be independent and unbiased estimates of the batch gradient ${\bf g}$ with coordinate-wise bounded variance, i.e.,

$$\mathbb{E}[{\bf g}_{k}]={\bf g},\quad\forall k\in\mathcal{K},\qquad(7)$$
$$\mathbb{E}[(g_{k,i}-g_{i})^{2}]\leq\sigma_{i}^{2},\quad\forall k\in\mathcal{K},\ \forall i,\qquad(8)$$

where $g_{k,i}$ and $g_{i}$ denote the $i$-th elements of ${\bf g}_{k}$ and ${\bf g}$, respectively, and ${\bm\sigma}=[\sigma_{1},\cdots,\sigma_{q}]$ is a vector of non-negative constants.

II-C Communication Model

The distributed training latency is dominated by the update aggregation process, especially when the number of devices becomes large. Therefore, we focus on the aggregation process over a MAC. Instead of treating different devices' updates as interference, we consider AirComp for fast update aggregation by exploiting the superposition property of the MAC. We assume that the channel coefficients remain unchanged within each communication round and may change across communication rounds. Besides, the channel state information (CSI) is assumed to be available at all edge devices, so that they can perfectly compensate for the phases introduced by the wireless channels.

Let $\hat{h}_{k}^{(n)}$ denote the complex channel coefficient from device $k$ to the edge server at communication round $n$, and let $h_{k}^{(n)}=|\hat{h}_{k}^{(n)}|$ denote its magnitude. During the gradient-uploading phase, all the devices transmit simultaneously over the same time-frequency block, and thus the received aggregate signal is given by

$${\bf y}^{(n)}=\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}{\bf g}_{k}^{(n)}+{\bf z}^{(n)},\qquad(9)$$

in which $p_{k}^{(n)}$ denotes the transmit power at device $k$, and ${\bf z}^{(n)}\in\mathbb{R}^{q}$ denotes the additive white Gaussian noise with ${\bf z}^{(n)}\sim\mathcal{CN}(0,N_{0}{\bf I})$, where $N_{0}$ is the noise power density and ${\bf I}$ is an identity matrix. Therefore, the global gradient estimate at the edge server is given by

$$\hat{\bf g}^{(n)}=\frac{{\bf y}^{(n)}}{K}.\qquad(10)$$

The devices can adaptively adjust their transmit powers to enhance the learning performance. In practice, the transmit power of each edge device $k\in\mathcal{K}$ at each communication round is constrained by a maximum power budget $\bar{P}_{k}$:

$$p_{k}^{(n)}\leq\bar{P}_{k},\quad\forall k\in{\mathcal{K}},\ \forall n.\qquad(11)$$

In addition, each device $k\in\mathcal{K}$ is subject to an average power budget $\tilde{P}_{k}$ over the whole training period, which consists of $N$ communication rounds indexed by ${\mathcal{N}}\triangleq\{1,\cdots,N\}$:

$$\frac{1}{N}\sum_{n\in\mathcal{N}}p_{k}^{(n)}\leq\tilde{P}_{k},\quad\forall k\in{\mathcal{K}}.\qquad(12)$$

Here, we generally have $\tilde{P}_{k}\leq\bar{P}_{k},\ \forall k\in{\mathcal{K}}$.
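The following minimal Python sketch simulates one AirComp aggregation round per (9)-(10) under a power choice that respects (11)-(12); the channel statistics and numerical values are illustrative assumptions rather than the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(1)
K, q, N0 = 20, 10, 0.1
P_max, P_avg = 5.0, 1.0  # per-round and average budgets per (11)-(12)

# Local gradients to be aggregated (placeholders for (4))
g_local = rng.normal(size=(K, q))

# Channel magnitudes h_k^(n): magnitudes of unit-variance CSCG samples
h = np.abs(rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)

# A feasible power choice: p <= P_max in every round, and the per-device
# average over the N rounds must additionally stay below P_avg
p = np.full(K, P_avg)
assert np.all(p <= P_max)

# Received superposed signal per (9) and server-side estimate per (10)
z = rng.normal(scale=np.sqrt(N0), size=q)         # aggregation noise
y = ((h * np.sqrt(p))[:, None] * g_local).sum(axis=0) + z
g_hat = y / K                                     # noisy global gradient

print("aggregation error:", np.linalg.norm(g_hat - g_local.mean(axis=0)))
```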

III Convergence Analysis for Air-FEEL with Adaptive Power Control

In this section, we formally characterize the learning performance of the Air-FEEL system, which is derived as a function of the transmit powers of all devices.

Let $N$ denote the number of communication rounds and $L\triangleq\|{\bf L}\|_{\infty}$. For notational convenience, we use $F^{(n+1)}$ to represent $F({\bf w}^{(n+1)})$. The optimality gap after $N$ communication rounds, defined as $F^{(N+1)}-F^{\star}$, is derived in the following theorem, from which we can understand the convergence behavior of Air-FEEL.

Theorem 1 (Optimality Gap).

The optimality gap of Air-FEEL under an arbitrary transmit power control policy $\{p_{k}^{(n)}\}$ is given by

$$\mathbb{E}\left[F^{(N+1)}\right]-F^{\star}\leq{\Phi}(\{p_{k}^{(n)}\},\eta)\triangleq\prod_{n=1}^{N}A^{(n)}\left(F^{(1)}-F^{\star}\right)+\sum_{n=1}^{N-1}\left(\prod_{i=n+1}^{N}A^{(i)}\right)B^{(n)}+B^{(N)},\qquad(13)$$

with $A^{(n)}=1-\frac{2\mu\eta}{K}\sum_{k\in\mathcal{K}}\left(h_{k}^{(n)}\sqrt{p_{k}^{(n)}}-\frac{\eta L}{2K}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)$ and $B^{(n)}=\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}}{2K^{2}}\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}+\frac{\eta^{2}LN_{0}q}{2K^{2}}$.

Proof:

The proof follows the widely adopted strategy of relating the norm of the gradient to the expected improvement made in a single algorithmic step, and comparing this with the total possible improvement.

$$F^{(n+1)}-F^{(n)}\overset{(a)}{\leq}({\bf g}^{(n)})^{T}({\bf w}^{(n+1)}-{\bf w}^{(n)})+\frac{1}{2}\sum_{i=1}^{q}L_{i}(w_{i}^{(n+1)}-w^{(n)}_{i})^{2}$$
$$\overset{(b)}{\leq}({\bf g}^{(n)})^{T}({\bf w}^{(n)}-\eta\hat{\bf g}^{(n)}-{\bf w}^{(n)})+\frac{L}{2}\left\|{\bf w}^{(n)}-\eta\hat{\bf g}^{(n)}-{\bf w}^{(n)}\right\|_{2}^{2}$$
$$=-\eta({\bf g}^{(n)})^{T}\hat{\bf g}^{(n)}+\frac{\eta^{2}L}{2}\left\|\hat{\bf g}^{(n)}\right\|_{2}^{2}$$
$$=-\frac{\eta}{K}({\bf g}^{(n)})^{T}\left(\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}{\bf g}_{k}^{(n)}+{\bf z}^{(n)}\right)+\frac{\eta^{2}L}{2K^{2}}\left\|\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}{\bf g}_{k}^{(n)}+{\bf z}^{(n)}\right\|_{2}^{2},$$

where inequality (a) follows from Assumption 1, and (b) uses $L\triangleq\|{\bf L}\|_{\infty}$ together with the over-the-air model update ${\bf w}^{(n+1)}={\bf w}^{(n)}-\eta\hat{\bf g}^{(n)}$. By subtracting $F^{\star}$ from and taking the expectation of both sides, the per-round convergence bound in (14) is obtained. Next, (15) follows by applying the PL condition in Assumption 2. Then, by applying the above inequality recursively over the $N$ iterations and performing some simple algebraic manipulations, we arrive at (13), which completes the proof.

$$\mathbb{E}\left[F^{(n+1)}\right]-F^{\star}\leq F^{(n)}-F^{\star}-\frac{\eta}{K}\left(\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}\right)\|{\bf g}^{(n)}\|_{2}^{2}+\frac{\eta^{2}L}{2K^{2}}\left(\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\left(\|{\bf g}^{(n)}\|_{2}^{2}+\|{\bm\sigma}\|_{2}^{2}\right)+\frac{\eta^{2}LN_{0}q}{2K^{2}}$$
$$=F^{(n)}-F^{\star}-\left[\sum_{k\in\mathcal{K}}\left(\frac{\eta}{K}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}-\frac{\eta^{2}L}{2K^{2}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\right]\|{\bf g}^{(n)}\|_{2}^{2}+\frac{\eta^{2}L}{2K^{2}}\left(\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\|{\bm\sigma}\|_{2}^{2}+\frac{\eta^{2}LN_{0}q}{2K^{2}}.\qquad(14)$$
$$\mathbb{E}\left[F^{(n+1)}\right]-F^{\star}\leq\underbrace{\left[1-2\mu\sum_{k\in\mathcal{K}}\left(\frac{\eta}{K}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}-\frac{\eta^{2}L}{2K^{2}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\right]}_{A^{(n)}}\left(F^{(n)}-F^{\star}\right)+\underbrace{\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}}{2K^{2}}\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}+\frac{\eta^{2}LN_{0}q}{2K^{2}}}_{B^{(n)}}.\qquad(15)$$

Further applying the arithmetic-geometric mean inequality $a_{1}a_{2}\cdots a_{m}\leq\left(\frac{a_{1}+a_{2}+\cdots+a_{m}}{m}\right)^{m}$ (for non-negative $a_{i}$'s), we can derive a more elegant upper bound on (13) to attain further insights, as follows:

$${\Phi}(\{p_{k}^{(n)}\},\eta)\leq\alpha^{N}\left(F^{(1)}-F^{\star}\right)+\sum_{n=1}^{N}B^{(n)}\beta_{(n)}^{N-n},\qquad(16)$$

where $\alpha=\frac{1}{N}\sum_{i=1}^{N}A^{(i)}$ and $\beta_{(n)}=\frac{1}{N-n}\sum_{i=n+1}^{N}A^{(i)}$ for $n=1,\cdots,N-1$, while $\beta_{(N)}=1$.

Remark 1.

The first term on the right-hand side of (16) suggests that the effect of the initial optimality gap vanishes as the number of communication rounds $N$ increases. The second term reflects the impact of the power control and the additive noise on the convergence process: transmitting with more power in the early learning iterations is more beneficial for decreasing the optimality gap, because the contribution of the power control at iteration $n$ is discounted by a factor $\beta_{(n)}^{N-n}$.
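As a sanity check on Theorem 1, the following Python sketch evaluates the exact gap bound $\Phi$ in (13) and its relaxed form (16) for a given power policy; all numerical inputs (channels, $\mu$, $L$, etc.) are illustrative assumptions.

```python
import numpy as np

def round_factors(p, h, eta, mu, L, sigma_sq, N0, q):
    """Per-round factors A^(n) and B^(n) from Theorem 1. p, h: (N, K)
    arrays of powers p_k^(n) and magnitudes h_k^(n); sigma_sq = ||sigma||^2."""
    K = p.shape[1]
    A = 1 - (2 * mu * eta / K) * np.sum(
        h * np.sqrt(p) - (eta * L / (2 * K)) * h**2 * p, axis=1)
    B = (eta**2 * L / (2 * K**2)) * (sigma_sq * np.sum(h**2 * p, axis=1) + N0 * q)
    return A, B

def gap_bound(p, h, eta, mu, L, sigma_sq, N0, q, F1_gap):
    """Exact optimality-gap bound Phi in (13); F1_gap is F^(1) - F_star.
    The n = N summand reduces to B^(N) since the empty product is 1."""
    A, B = round_factors(p, h, eta, mu, L, sigma_sq, N0, q)
    return np.prod(A) * F1_gap + sum(
        np.prod(A[n + 1:]) * B[n] for n in range(len(A)))

def gap_bound_relaxed(p, h, eta, mu, L, sigma_sq, N0, q, F1_gap):
    """Looser AM-GM bound (16) with round averages alpha and beta_(n)."""
    A, B = round_factors(p, h, eta, mu, L, sigma_sq, N0, q)
    N = len(A)
    beta = np.array([A[n + 1:].mean() if n < N - 1 else 1.0 for n in range(N)])
    return A.mean()**N * F1_gap + np.sum(B * beta ** (N - 1 - np.arange(N)))

# Illustrative evaluation under a uniform-power policy
rng = np.random.default_rng(2)
N, K = 30, 20
h = np.abs(rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K))) / np.sqrt(2)
p = np.ones((N, K))
args = dict(eta=0.05, mu=0.1, L=1.0, sigma_sq=1.0, N0=0.1, q=10, F1_gap=10.0)
print(gap_bound(p, h, **args), gap_bound_relaxed(p, h, **args))
```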

IV Power Control Optimization

In this section, we focus on speeding up the convergence by minimizing the optimality gap in Theorem 1, subject to the power constraints stated in (11) and (12). The optimization problem is thus formulated as

$$\mathbf{(P1)}:\ \min_{\{p_{k}^{(n)}\geq 0\},\ \eta\geq 0}\ {\Phi}(\{p_{k}^{(n)}\},\eta)\quad{\rm s.t.}\ (11)\ \text{and}\ (12).$$

Due to the coupling between the power control variables $\{p_{k}^{(n)}\}$ and the learning rate $\eta$, problem (P1) is non-convex and hard to solve. We resort to alternating optimization to solve it efficiently. In particular, we first solve problem (P1) under any given $\eta$, and then apply a one-dimensional search to find the $\eta$ that achieves the minimum objective value.

Let $\tilde{\Phi}(\{p_{k}^{(n)}\})={\Phi}(\{p_{k}^{(n)}\},\eta)$ denote the objective under a given $\eta$. Note that the transmit powers at different devices and different communication rounds remain coupled in the objective function in (13), leading to a highly non-convex problem:

$$\mathbf{(P2)}:\ \min_{\{p_{k}^{(n)}\geq 0\}}\ \tilde{\Phi}(\{p_{k}^{(n)}\})\quad{\rm s.t.}\ (11)\ \text{and}\ (12).$$

To tackle this problem, we propose an iterative algorithm that obtains an efficient solution via the SCA technique. The key idea is to approximate the non-convex objective by a constructed convex surrogate around the local point at each iteration. By solving a series of such approximate convex problems iteratively, we obtain a high-quality suboptimal solution to problem (P2).

Let $\{p_{k}^{(n)}[i]\}$ denote the local point at the $i$-th iteration with $i\geq 0$. Taking the first-order Taylor expansion of $\tilde{\Phi}(\{p_{k}^{(n)}\})$ w.r.t. $\{p_{k}^{(n)}\}$ at the local point $\{p_{k}^{(n)}[i]\}$, it follows that

$$\tilde{\Phi}(\{p_{k}^{(n)}\})\approx\bar{\Phi}(\{p_{k}^{(n)}\})\triangleq\tilde{\Phi}(\{p_{k}^{(n)}[i]\})+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{K}}}\left(p_{k}^{(n)}-p_{k}^{(n)}[i]\right)\nabla\tilde{\Phi}(p_{k}^{(n)}[i]),$$

where $\nabla\tilde{\Phi}(p_{k}^{(n)}[i])$ denotes the first-order partial derivative of $\tilde{\Phi}$ w.r.t. $p_{k}^{(n)}$ evaluated at $p_{k}^{(n)}[i]$, as given in (17) and (18).

$$\nabla\tilde{\Phi}(p_{k}^{(n)}[i])=-\frac{\mu\eta h_{k}^{(n)}\left(F^{(1)}-F^{\star}\right)}{K}\left(\frac{1}{\sqrt{p_{k}^{(n)}}}-\frac{\eta Lh_{k}^{(n)}}{K}\right)\prod_{j\in{\mathcal{N}}\setminus\{n\}}A^{(j)}+\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}(h_{k}^{(n)})^{2}}{2K^{2}}\frac{\prod_{j=n}^{N}A^{(j)}}{A^{(n)}}-\frac{\mu\eta h_{k}^{(n)}}{K}\left(\frac{1}{\sqrt{p_{k}^{(n)}}}-\frac{\eta Lh_{k}^{(n)}}{K}\right)\sum_{\ell=1}^{n-1}B^{(\ell)}\frac{\prod_{j=\ell}^{N}A^{(j)}}{A^{(n)}A^{(\ell)}},\quad\forall n\in\mathcal{N}\setminus\{1\},\qquad(17)$$
$$\nabla\tilde{\Phi}(p_{k}^{(1)}[i])=-\frac{\mu\eta h_{k}^{(1)}\left(F^{(1)}-F^{\star}\right)}{K}\left(\frac{1}{\sqrt{p_{k}^{(1)}}}-\frac{\eta Lh_{k}^{(1)}}{K}\right)\prod_{j\in{\mathcal{N}}\setminus\{1\}}A^{(j)}+\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}(h_{k}^{(1)})^{2}}{2K^{2}}\prod_{j\in{\mathcal{N}}\setminus\{1\}}A^{(j)}.\qquad(18)$$

In this case, $\bar{\Phi}(\{p_{k}^{(n)}\})$ is linear w.r.t. $\{p_{k}^{(n)}\}$. To ensure the approximation accuracy, a series of trust region constraints are imposed as [17]

$$|p_{k}^{(n)}[i]-p_{k}^{(n)}[i-1]|\leq\Gamma[i],\quad\forall k\in\mathcal{K},\ \forall n\in\mathcal{N},\qquad(19)$$

where $\Gamma[i]$ denotes the radius of the trust region. By using $\bar{\Phi}(\{p_{k}^{(n)}\})$ as the approximation of $\tilde{\Phi}(\{p_{k}^{(n)}\})$ and introducing an auxiliary variable $\gamma$, the approximate problem at the $i$-th iteration is formulated as the following convex problem:

$$\mathbf{(P2.1)}:\ \min_{\{p_{k}^{(n)}[i]\},\ \gamma\geq 0}\ \gamma$$
$${\rm s.t.}\quad\bar{\Phi}(\{p_{k}^{(n)}[i]\})\leq\gamma,\qquad(20)$$
$$\qquad\ \ (11),\ (12),\ \text{and}\ (19),$$

which can be directly solved by CVX [18].
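Problem (P2.1) maps directly to a disciplined convex program. Below is a minimal sketch in cvxpy (an open-source Python analogue of the CVX toolbox); the placeholder gradient `grad` stands in for the closed-form derivatives (17)-(18), and all dimensions and budgets are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def solve_P21(p_i, grad, phi_i, P_max, P_avg, Gamma):
    """Solve the linearized subproblem (P2.1) at the local point p_i.
    p_i, grad: (N, K) arrays; phi_i is Phi-tilde evaluated at p_i."""
    N, K = p_i.shape
    p = cp.Variable((N, K), nonneg=True)
    gamma = cp.Variable(nonneg=True)
    phi_bar = phi_i + cp.sum(cp.multiply(grad, p - p_i))  # linearization
    prob = cp.Problem(
        cp.Minimize(gamma),
        [phi_bar <= gamma,                 # epigraph constraint (20)
         p <= P_max,                       # per-round budget (11)
         cp.sum(p, axis=0) / N <= P_avg,   # average budget (12)
         cp.abs(p - p_i) <= Gamma])        # trust region (19)
    prob.solve()
    return p.value

# Illustrative call with a placeholder gradient
rng = np.random.default_rng(3)
p_star = solve_P21(p_i=np.ones((30, 20)), grad=rng.normal(size=(30, 20)),
                   phi_i=1.0, P_max=5.0, P_avg=1.0, Gamma=0.5)
```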

Let $\{p_{k}^{(n)*}[i]\}$ denote the optimal power control solution to problem (P2.1) at the local point $\{p_{k}^{(n)}[i]\}$. Based on this, we obtain an efficient iterative algorithm for solving problem (P2) as follows, where $\{p_{k}^{(n)}[0]\}$ denotes the initial power control. At each iteration $i\geq 0$, we solve problem (P2.1) at the local point $\{p_{k}^{(n)}[i]\}$ and evaluate the objective of problem (P2) at the obtained solution $\{p_{k}^{(n)*}[i]\}$. If the objective value decreases, we update the local point as $p_{k}^{(n)}[i+1]=p_{k}^{(n)*}[i],\ \forall n\in\mathcal{N},\ \forall k\in\mathcal{K}$, and proceed to the next iteration; otherwise, we shrink the trust region by setting $\Gamma[i]=\Gamma[i]/2$ and solve problem (P2.1) again. The algorithm terminates once $\Gamma[i]$ falls below a given threshold $\epsilon$. In summary, the proposed algorithm is presented in Algorithm 1.

 

Algorithm 1: Proposed Algorithm for Solving Problem (P2)

1) Initialization: Given the initial power control $\{p_{k}^{(n)}[0]\}$ and the initial trust region radius $\Gamma[0]$; let $i=0$.

2) Repeat:

  a) Solve problem (P2.1) under the given local point $\{p_{k}^{(n)}[i]\}$ to obtain the optimal solution $\{p_{k}^{(n)*}[i]\}$;

  b) If the objective value $\tilde{\Phi}(\{p_{k}^{(n)*}[i]\})$ of problem (P2) decreases, then update $p_{k}^{(n)}[i+1]=p_{k}^{(n)*}[i],\ \forall n\in\mathcal{N},\ \forall k\in\mathcal{K}$, and set $i=i+1$; otherwise, set $\Gamma[i]=\Gamma[i]/2$;

3) Until $\Gamma[i]\leq\epsilon$.

 

With the power control obtained from Algorithm 1, we can then find the optimal $\eta$ via a one-dimensional search.
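Putting the pieces together, the following sketch implements the loop of Algorithm 1 together with the outer one-dimensional search over $\eta$; it reuses `gap_bound` and `solve_P21` from the earlier sketches and, for brevity, computes the gradient of $\tilde{\Phi}$ by finite differences rather than via the closed forms (17)-(18). All tolerances and search grids are illustrative assumptions.

```python
import numpy as np

def solve_P2(h, eta, params, P_max, P_avg, Gamma0=1.0, eps=1e-3):
    """Algorithm 1: SCA with a shrinking trust region, for a fixed eta."""
    N, K = h.shape
    phi = lambda p: gap_bound(p, h, eta, **params)  # Phi-tilde per (13)
    p, Gamma = np.full((N, K), min(P_avg, P_max)), Gamma0
    while Gamma > eps:
        base = phi(p)
        grad, dp = np.zeros_like(p), 1e-6
        for idx in np.ndindex(N, K):          # numerical gradient of Phi-tilde
            q_ = p.copy(); q_[idx] += dp
            grad[idx] = (phi(q_) - base) / dp
        cand = np.clip(solve_P21(p, grad, base, P_max, P_avg, Gamma),
                       0.0, P_max)            # clip guards solver tolerance
        if phi(cand) < base:
            p = cand                          # accept the SCA step
        else:
            Gamma /= 2.0                      # shrink trust region and retry
    return p

# Outer one-dimensional search over the learning rate eta
params = dict(mu=0.1, L=1.0, sigma_sq=1.0, N0=0.1, q=10, F1_gap=10.0)
rng = np.random.default_rng(4)
h = np.abs(rng.normal(size=(30, 20)) + 1j * rng.normal(size=(30, 20))) / np.sqrt(2)
best = min((gap_bound(solve_P2(h, e, params, 5.0, 1.0), h, e, **params), e)
           for e in [0.01, 0.02, 0.05, 0.1])
print("best gap bound %.4f at eta=%.2f" % best)
```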

V Simulation Results

In this section, we provide simulation results to validate the performance of the proposed power control policy for Air-FEEL. In the simulation, the wireless channels from each device to the edge server follow i.i.d. Rayleigh fading, such that the $\hat{h}_{k}$'s are modeled as i.i.d. circularly symmetric complex Gaussian (CSCG) random variables with zero mean and unit variance. The dataset of size $D_{\rm tot}=600$ over all devices is randomly generated, where part of the data, namely $100$ pairs $({\bf x},y)$, is reserved for prediction, and the remaining samples are used for model training. Each data sample vector ${\bf x}$ follows the i.i.d. Gaussian distribution $\mathcal{N}(0,{\bf I})$, and the label $y$ is obtained as $y=x(2)+3x(5)+0.2z$, where $x(t)$ denotes the $t$-th entry of ${\bf x}$ and $z\sim\mathcal{N}(0,1)$ is i.i.d. observation noise. Unless stated otherwise, the data samples are evenly distributed among the $K=20$ devices, and thus $D_{k}=25$. Moreover, we apply ridge regression with the sample-wise loss function $f({\bf w},{\bf x},y)=\frac{1}{2}\|{\bf x}^{T}{\bf w}-y\|^{2}$ and the regularization function $R({\bf w})=\|{\bf w}\|^{2}$ with $\rho=5\times 10^{-5}$. Furthermore, recalling that $D_{\rm tot}=\sum_{k\in\mathcal{K}}D_{k}$, we obtain the smoothness parameter $L$ and the PL parameter $\mu$ as the largest and smallest eigenvalues of the data Gramian matrix ${\bf X}^{T}{\bf X}/D_{\rm tot}+10^{-4}{\bf I}$, where ${\bf X}=[{\bf x}_{1},\cdots,{\bf x}_{D_{\rm tot}}]^{T}$ is the data matrix. The optimal loss value $F^{\star}$ is computed from the optimal parameter vector ${\bf w}^{\star}$ of the learning problem (3), given by ${\bf w}^{\star}=({\bf X}^{T}{\bf X}+\rho{\bf I})^{-1}{\bf X}^{T}{\bf y}$ with ${\bf y}=[y_{1},\cdots,y_{D_{\rm tot}}]^{T}$. We set the initial parameter vector to the all-zero vector and the noise variance to $N_{0}=0.1$.
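For reference, the setup described above can be reproduced with a short script such as the sketch below; the random seed, the model size $q$, and the exact train/test split mechanics are illustrative assumptions beyond what the text specifies.

```python
import numpy as np

rng = np.random.default_rng(5)
D_tot, q, K, rho = 600, 10, 20, 5e-5  # q (model size) is an assumed value

# Generate data: y = x(2) + 3 x(5) + 0.2 z, with 1-based indexing in the text
X = rng.normal(size=(D_tot, q))
y = X[:, 1] + 3 * X[:, 4] + 0.2 * rng.normal(size=D_tot)

# Hold out 100 pairs for prediction; split the rest evenly over K devices
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]
device_data = [(X_train[k::K], y_train[k::K]) for k in range(K)]  # D_k = 25

# Smoothness L and PL parameter mu from the regularized Gramian matrix
# (computed here on the training portion of the data)
G = X_train.T @ X_train / len(X_train) + 1e-4 * np.eye(q)
eigs = np.linalg.eigvalsh(G)       # eigenvalues in ascending order
L, mu = eigs[-1], eigs[0]

# Optimal ridge solution w* = (X^T X + rho I)^{-1} X^T y, and F*
w_star = np.linalg.solve(X_train.T @ X_train + rho * np.eye(q),
                         X_train.T @ y_train)
F_star = 0.5 * np.mean((X_train @ w_star - y_train) ** 2) \
         + rho * np.sum(w_star ** 2)
print(f"L={L:.3f}, mu={mu:.3f}, F*={F_star:.4f}")
```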

We consider two benchmark schemes for performance comparison, namely uniform power transmission, which transmits with uniform power over different communication rounds under the average power budget, and channel inversion as adopted in [15]. As the performance metrics for comparison, we consider the optimality gap and the prediction error.

Figure 2: Effect of the number of devices on the learning performance of Air-FEEL. (a) Optimality gap versus the number of devices. (b) Prediction error versus the number of devices.

The effect of the device population on the learning performance is illustrated in Fig. 2 with $N=30$, where the power budgets at all devices are identically set as $\tilde{P}=1$ W and $\bar{P}=5$ W. Notice that increasing the device population introduces both positive and negative effects on the learning performance: the positive effect is that the training process can exploit more data, while the negative effect is the increased aggregation error induced by AirComp over more devices. As observed in Fig. 2, the positive effect can be cancelled or even outweighed by the negative effect under channel-inversion or uniform power control. The blessing of including more devices in Air-FEEL dominates the curse it brings only when the power control is judiciously optimized, showing the crucial role of power control in determining the learning performance of Air-FEEL.

Figure 3: Learning performance of Air-FEEL over iterations, where $\eta^{*}$ denotes the optimized learning rate after the one-dimensional search. (a) Tendency of the loss function. (b) Tendency of the prediction error.

Fig. 3 shows the learning performance during the training process under the optimized learning rate, where we set $K=20$, $\tilde{P}=1$ W, $\bar{P}=5$ W, and $N=80$. It is observed that the proposed power control scheme achieves faster convergence than both the channel-inversion and uniform power control schemes. This is attributed to the fact that the power control optimization directly targets convergence acceleration.

Figure 4: The optimized power allocation over iterations under static channels.

Fig. 4 shows the power allocation during the learning process over a static channel with uniform channel gains, where we set $K=20$, $\tilde{P}=1$ W, $\bar{P}=5$ W, and $N=30$. It is observed that the power allocation over a static channel follows a stair-wise, monotonically decreasing pattern. This power control behavior coincides with the analysis in Remark 1.

VI Conclusion

In this paper, we exploited power control as a new degree of freedom to optimize the learning performance of Air-FEEL, a promising communication-efficient solution towards edge intelligence. To this end, we first analyzed the convergence rate of Air-FEEL by deriving the optimality gap of the loss function under an arbitrary power control policy. Then we formulated the power control problem to minimize the optimality gap for accelerating convergence, subject to a set of average and maximum power constraints at the edge devices. The challenge posed by the coupling of the power control variables over different devices and iterations was tackled by the joint use of the SCA and trust region methods. Numerical results demonstrated that the optimized power control policy achieves significantly faster convergence than benchmark policies such as channel inversion and uniform power transmission.

References

  • [1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, Jan. 2020.
  • [2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
  • [3] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016. [Online]. Available: https://arxiv.org/pdf/1610.05492.pdf
  • [4] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
  • [5] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498–3516, Oct. 2007.
  • [6] M. Gastpar, “Uncoded transmission is exactly optimal for a simple Gaussian sensor network,” IEEE Trans. Inf. Theory, vol. 54, no. 11, pp. 5247–5251, Nov. 2008.
  • [7] O. Abari, H. Rahul, D. Katabi, and M. Pant, “Airshare: Distributed coherent transmission made seamless,” in Proc. IEEE INFOCOM, Kowloon, Hong Kong, Apr. 2015, pp. 1742–1750.
  • [8] X. Cao, G. Zhu, J. Xu, and K. Huang, “Optimized power control for over-the-air computation in fading channels,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
  • [9] X. Cao, G. Zhu, J. Xu, and K. Huang, “Cooperative interference management for over-the-air computation networks,” 2020. [Online]. Available: https://arxiv.org/pdf/2007.11765.pdf
  • [10] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
  • [11] M. Mohammadi Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [12] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [13] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning,” 2020. [Online]. Available: https://arxiv.org/pdf/2003.02089.pdf
  • [14] G. Zhu, Y. Du, D. Gunduz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” 2020. [Online]. Available: https://arxiv.org/pdf/2001.05713.pdf
  • [15] D. Liu and O. Simeone, “Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control,” 2020. [Online]. Available: https://arxiv.org/pdf/2006.05459.pdf
  • [16] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” Lecture Notes in Computer Science, pp. 795–811, 2016. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46128-1_50
  • [17] B. Dai, Y. Liu, and W. Yu, “Optimized base-station cache allocation for cloud radio access network with multicast backhaul,” IEEE J. Sel. Areas Commun., vol. 36, no. 8, pp. 1737–1750, Aug. 2018.
  • [18] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming,” 2016. [Online]. Available: http://cvxr.com/cvx