
Optimized Power Control for Over-the-Air Federated Edge Learning

Xiaowen Cao^{1,3}, Guangxu Zhu^{2}, Jie Xu^{3}, and Shuguang Cui^{3,2}
^{1}School of Information Engineering, Guangdong University of Technology, Guangzhou, China
^{2}Shenzhen Research Institute of Big Data, Shenzhen, China
^{3}FNii and SSE, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China
Email: caoxwen@outlook.com, gxzhu@sribd.cn, xujie@cuhk.edu.cn, shuguangcui@cuhk.edu.cn
Abstract

Over-the-air federated edge learning (Air-FEEL) is a communication-efficient solution for privacy-preserving distributed learning over wireless networks. Air-FEEL allows “one-shot” over-the-air aggregation of gradient/model updates by exploiting the waveform superposition property of wireless channels, and thus promises an extremely low aggregation latency that is independent of the network size. However, such communication efficiency may come at the cost of learning performance degradation due to the aggregation error caused by the non-uniform channel fading over devices and noise perturbation. Prior work adopted channel inversion power control (or its variants) to reduce the aggregation error by aligning the channel gains, which, however, could be highly suboptimal in deep fading scenarios due to noise amplification. To overcome this issue, we investigate power control optimization for enhancing the learning performance of Air-FEEL. Towards this end, we first analyze the convergence behavior of Air-FEEL by deriving the optimality gap of the loss function under any given power control policy. Then we optimize the power control to minimize the optimality gap for accelerating convergence, subject to a set of average and maximum power constraints at the edge devices. The problem is generally non-convex and challenging to solve due to the coupling of the power control variables over different devices and iterations. To tackle this challenge, we develop an efficient algorithm by jointly exploiting the successive convex approximation (SCA) and trust region methods. Numerical results show that the optimized power control policy achieves significantly faster convergence than benchmark policies such as channel inversion and uniform power transmission.

I Introduction

In the pursuit of the ubiquitous intelligence envisioned for future 6G networks, recent years have witnessed the spreading of artificial intelligence (AI) algorithms from the cloud to the network edge, resulting in an active area called edge intelligence [1, 2]. The core research issue therein is to allow low-latency and privacy-aware access to rich mobile data for intelligence distillation. To this end, a popular framework called federated edge learning (FEEL) was recently proposed, which distributes the task of model training over edge devices so as to reduce the communication overhead and keep the data local [3, 4]. Essentially, the FEEL framework is a distributed implementation of stochastic gradient descent (SGD) over wireless networks. A typical training process iterates between 1) broadcasting of the global model under training from the edge server to the devices for local SGD execution using local data, and 2) uploading of local models/gradients from the devices to the edge server for aggregation and global model updating. Although the uploading of high-volume raw data is avoided, the update-aggregation process in FEEL may still suffer from a communication bottleneck due to the high dimensionality of each update and the multiple access by many devices over wireless links. To tackle this issue, one promising solution called over-the-air FEEL (Air-FEEL) has been proposed, which exploits over-the-air computation (AirComp) for “one-shot” aggregation via concurrent update transmission, such that communication and computation are integrated in a joint design by exploiting the superposition property of a multiple access channel (MAC) [1, 5, 6].

The idea of AirComp was first proposed in [5] in the context of data aggregation in sensor networks, where it was surprisingly found that “interference” can be harnessed by structured codes to help functional computation over a MAC. Inspired by this finding, it was shown in the subsequent work [6] that for Gaussian independent and identically distributed (i.i.d.) data sources, uncoded transmission is optimal in terms of distortion minimization. Besides the information-theoretic studies, various practical issues faced by AirComp implementation were also considered in [7, 8, 9]. In particular, the synchronization issue in AirComp was addressed in [7] via an innovative idea of shared clock broadcasting from the edge server to the devices. The optimal power control policies for AirComp over fading channels were derived in [8] to minimize the average computation distortion, and a cooperative interference management framework for coordinating coexisting AirComp tasks over multi-cell networks was developed in [9].

More recently, AirComp has found its merits in the new context of FEEL, known as Air-FEEL, for communication-efficient update aggregation, as demonstrated in a rich set of prior works [10, 11, 12, 13, 14, 15]. Specifically, a broadband Air-FEEL solution was proposed in [10], where several communication-learning tradeoffs were derived to guide the design of device scheduling. Around the same time, a source-coding algorithm exploiting gradient sparsification was proposed in [11] to implement Air-FEEL with compressed updates for higher communication efficiency. In parallel, a joint design of device scheduling and beamforming in a multi-antenna system was presented in [12] to accelerate Air-FEEL. Subsequently, gradient-statistics-aware power control was investigated in [13] to further enhance the performance of Air-FEEL. Furthermore, to make Air-FEEL compatible with the digital chips embedded in modern edge devices, Air-FEEL based on digital modulation was proposed in [14], featuring one-bit quantization and modulation at the edge devices and majority-vote based decoding at the edge server. Besides the benefit of low latency, Air-FEEL was also found to be beneficial for data privacy enhancement, as individual updates are not accessible to the edge server, eliminating the risk of model inversion attacks [15].

Despite the promise of high communication efficiency, Air-FEEL may suffer from severe learning performance degradation due to the aggregation error caused by the non-uniform channel fading over devices and noise perturbation. Prior work in this field mostly assumed channel inversion power control (or its variants) [10, 11, 12, 15] in an effort to reduce the aggregation error by aligning the channel gains, which could be highly suboptimal in deep fading scenarios due to noise amplification. Although there exists one relevant study on power control for the Air-FEEL system in [13], it focused on minimizing the intermediate aggregation distortion (e.g., mean squared error) instead of the ultimate learning performance (e.g., the general loss function). Therefore, there remains a research gap in optimizing the learning performance of Air-FEEL by judicious power control, which motivates the current work. To close this gap, we first analyze the convergence behavior of Air-FEEL by deriving the optimality gap of the loss function under an arbitrary power control policy. Then the power control problem is formulated to minimize the optimality gap for convergence acceleration, subject to a set of average and maximum power constraints at the edge devices. The problem is generally non-convex and challenging to solve due to the coupling of the power control variables over different devices and iterations. This challenge is tackled by the joint use of successive convex approximation (SCA) and trust region methods in the derivation of the optimized power control algorithm. Numerical results show that the optimized power control policy achieves significantly faster convergence than benchmark policies such as channel inversion and uniform power transmission, thus opening up a new degree of freedom for regulating the performance of Air-FEEL by power control.

II System Model

Figure 1: Illustration of over-the-air federated edge learning.

We consider an Air-FEEL system consisting of an edge server and $K\geq 1$ edge devices, as shown in Fig. 1. With the coordination of the edge server, the edge devices cooperatively train a shared machine learning model via over-the-air update aggregation, as elaborated in the sequel.

II-A Learning Model

We assume that the learning model is represented by the parameter vector ${\bf w}\in\mathbb{R}^{q}$, with $q$ denoting the model size. Let ${\mathcal{D}}_{k}$ denote the local dataset at edge device $k$, in which the $i$-th sample and its ground-truth label are denoted by ${\bf x}_{i}$ and $y_{i}$, respectively. Then the local loss function of the model vector ${\bf w}$ on ${\mathcal{D}}_{k}$ is

$$F_{k}({\bf w})=\frac{1}{|{\mathcal{D}}_{k}|}\sum_{({\bf x}_{i},y_{i})\in{\mathcal{D}}_{k}}f({\bf w},{\bf x}_{i},y_{i})+\rho R({\bf w}),\qquad(1)$$

where $f({\bf w},{\bf x}_{i},y_{i})$ denotes the sample-wise loss function quantifying the prediction error of the model ${\bf w}$ on the sample ${\bf x}_{i}$ with respect to (w.r.t.) its ground-truth label $y_{i}$, and $R({\bf w})$ denotes a strongly convex regularization function scaled by a hyperparameter $\rho\geq 0$. For notational convenience, we simplify $f({\bf w},{\bf x}_{i},y_{i})$ as $f_{i}({\bf w})$. Then, the global loss function over all the distributed datasets is given by

$$F({\bf w})=\frac{1}{K}\sum_{k\in\mathcal{K}}F_{k}({\bf w}),\qquad(2)$$

where ${\mathcal{D}}=\cup_{k\in\mathcal{K}}{\mathcal{D}}_{k}$ with $D_{\rm tot}=|{\mathcal{D}}|$, ${\mathcal{K}}\triangleq\{1,\cdots,K\}$ denotes the set of edge devices, and the sizes of the datasets at all edge devices are assumed to be uniform for notational simplicity, i.e., $|{\mathcal{D}}_{k}|=\bar{D},\forall k\in\mathcal{K}$.

The objective of the training process is to minimize the global loss function $F({\bf w})$:

𝐰=argmin𝐰F(𝐰).\displaystyle{\bf w}^{\star}=\arg\min_{\bf w}F({\bf w}). (3)

Instead of directly uploading all the local data to the edge server for centralized training, the learning problem in (3) can be solved iteratively in a distributed manner based on a gradient-averaging approach, as illustrated in Fig. 1.

At each communication round $n$, the machine learning model is denoted by ${\bf w}^{(n)}$. Each edge device $k$ computes its local gradient, denoted by ${\bf g}_{k}^{(n)}$, using the local dataset ${\mathcal{D}}_{k}$:

$${\bf g}_{k}^{(n)}=\frac{1}{|{\mathcal{D}}_{k}|}\sum_{({\bf x}_{i},y_{i})\in{\mathcal{D}}_{k}}\nabla f_{i}({\bf w}^{(n)})+\rho\nabla R({\bf w}^{(n)}),\qquad(4)$$

where \nabla is the gradient operator and we assume that the whole local dataset is used to estimate the local gradients. Next, the edge devices upload all local gradients to the edge server, which are further averaged to obtain the global gradient:

$$\bar{\bf g}^{(n)}=\frac{1}{K}\sum_{k\in\mathcal{K}}{\bf g}_{k}^{(n)}.\qquad(5)$$

Then, the global gradient estimate is broadcast from the edge server to the edge devices, based on which each edge device updates its own model via

$${\bf w}^{(n+1)}={\bf w}^{(n)}-\eta\cdot\bar{\bf g}^{(n)},\qquad(6)$$

where $\eta$ is the learning rate. The above procedure continues until a convergence criterion is met or the maximum number of iterations is reached.
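To make the training protocol concrete, the following minimal Python sketch runs one noise-free FEEL round per (4)-(6), using the ridge-regression loss later adopted in Section V; the synthetic data and all numerical settings are illustrative assumptions, not the paper's.

```python
import numpy as np

# One FEEL round per (4)-(6) for ridge regression, f = 0.5*(x^T w - y)^2,
# R(w) = ||w||^2. K devices, q model size, eta learning rate (all assumed).
K, q, eta, rho = 20, 10, 0.05, 5e-5
rng = np.random.default_rng(0)
local_data = [(rng.normal(size=(25, q)), rng.normal(size=25)) for _ in range(K)]
w = np.zeros(q)  # global model, broadcast to all devices

# Step 1: each device computes its local gradient g_k per (4)
local_grads = []
for X_k, y_k in local_data:
    g_k = X_k.T @ (X_k @ w - y_k) / len(y_k) + rho * 2 * w  # grad of (1)
    local_grads.append(g_k)

# Step 2: the server averages the local gradients per (5)
g_bar = np.mean(local_grads, axis=0)

# Step 3: global model update per (6)
w = w - eta * g_bar
```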

II-B Basic Assumptions on Learning Model

To facilitate the subsequent convergence analysis, we make several standard assumptions on the loss function and the gradient estimates.

Assumption 1 (Smoothness).

Let ${\bf g}=\nabla F({\bf w})$ denote the gradient of the loss function evaluated at point ${\bf w}$. Then there exists a non-negative constant vector ${\bf L}\in\mathbb{R}^{q}$ such that

$$F({\bf w})-\left[F({\bf w}^{\prime})+\nabla F({\bf w}^{\prime})^{T}({\bf w}-{\bf w}^{\prime})\right]\leq\frac{1}{2}\sum_{i=1}^{q}L_{i}(w_{i}-w^{\prime}_{i})^{2},\quad\forall{\bf w},{\bf w}^{\prime},$$

where the superscript $T$ denotes the transpose operation.

Assumption 2 (Polyak-Łojasiewicz Inequality).

Let FF^{\star} denote the optimal loss function value to problem (3). There exists a constant μ0\mu\geq 0 such that the global loss function F(𝐰)F({\bf w}) satisfies the following Polyak-Lojasiewicz (PL) condition:

$$\|{\bf g}\|_{2}^{2}\geq 2\mu\left(F({\bf w})-F^{\star}\right).$$

Notice that the above assumption is more general than the standard assumption of strong convexity [16]. Typical loss functions that satisfy the above two assumptions include logistic regression, linear regression and least squares.

Assumption 3 (Variance Bound).

The local gradient estimates $\{{\bf g}_{k}\}$ defined in (4), where the index $(n)$ is omitted for simplicity, are assumed to be independent and unbiased estimates of the batch gradient ${\bf g}$ with coordinate-wise bounded variance, i.e.,

$$\mathbb{E}[{\bf g}_{k}]={\bf g},\quad\forall k\in\mathcal{K},\qquad(7)$$
$$\mathbb{E}[(g_{k,i}-g_{i})^{2}]\leq\sigma_{i}^{2},\quad\forall k\in\mathcal{K},\ \forall i,\qquad(8)$$

where $g_{k,i}$ and $g_{i}$ denote the $i$-th elements of ${\bf g}_{k}$ and ${\bf g}$, respectively, and ${\bm\sigma}=[\sigma_{1},\cdots,\sigma_{q}]$ is a vector of non-negative constants.

II-C Communication Model

The distributed training latency is dominated by the update aggregation process, especially when the number of devices becomes large. Therefore, we focus on the aggregation process over a MAC. Instead of treating different devices' updates as interference, we consider AirComp for fast update aggregation by exploiting the superposition property of the MAC. We assume that the channel coefficients remain unchanged within each communication round and may change across communication rounds. Besides, the channel state information (CSI) is assumed to be available at all edge devices, so that they can perfectly compensate for the phases introduced by the wireless channels.

Let $\hat{h}_{k}^{(n)}$ denote the complex channel coefficient from device $k$ to the edge server at communication round $n$, and let $h_{k}^{(n)}=|\hat{h}_{k}^{(n)}|$ denote its magnitude. During the gradient-uploading phase, all the devices transmit simultaneously over the same time-frequency block, and thus the received aggregate signal is given by

$${\bf y}^{(n)}=\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}{\bf g}_{k}^{(n)}+{\bf z}^{(n)},\qquad(9)$$

in which $p_{k}^{(n)}$ denotes the transmit power at device $k$, and ${\bf z}^{(n)}\in\mathbb{R}^{q}$ denotes the additive white Gaussian noise with ${\bf z}^{(n)}\sim\mathcal{CN}(0,N_{0}{\bf I})$, where $N_{0}$ is the noise power density and ${\bf I}$ is an identity matrix. Therefore, the global gradient estimate at the edge server is given by

$$\hat{\bf g}^{(n)}=\frac{{\bf y}^{(n)}}{K}.\qquad(10)$$

The devices can adaptively adjust their transmit powers to enhance the learning performance. In practice, the transmit power of each edge device $k\in\mathcal{K}$ at each communication round is constrained by a maximum power budget $\bar{P}_{k}$:

$$p_{k}^{(n)}\leq\bar{P}_{k},\quad\forall k\in{\mathcal{K}},\ \forall n.\qquad(11)$$

In addition, each device $k\in\mathcal{K}$ is subject to an average power budget $\tilde{P}_{k}$ over the whole training period, which consists of $N$ communication rounds indexed by ${\mathcal{N}}\triangleq\{1,\cdots,N\}$:

$$\frac{1}{N}\sum_{n\in\mathcal{N}}p_{k}^{(n)}\leq\tilde{P}_{k},\quad\forall k\in{\mathcal{K}}.\qquad(12)$$

Here, we generally have $\tilde{P}_{k}\leq\bar{P}_{k},\ \forall k\in{\mathcal{K}}$.
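The following minimal Python sketch simulates one AirComp aggregation round per (9)-(10) under a power choice that respects (11)-(12); the channel statistics and numerical values are illustrative assumptions rather than the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(1)
K, q, N0 = 20, 10, 0.1
P_max, P_avg = 5.0, 1.0  # per-round and average budgets per (11)-(12)

# Local gradients to be aggregated (placeholders for (4))
g_local = rng.normal(size=(K, q))

# Channel magnitudes h_k^(n): magnitudes of unit-variance CSCG samples
h = np.abs(rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)

# A feasible power choice: p <= P_max in every round, and the per-device
# average over the N rounds must additionally stay below P_avg
p = np.full(K, P_avg)
assert np.all(p <= P_max)

# Received superposed signal per (9) and server-side estimate per (10)
z = rng.normal(scale=np.sqrt(N0), size=q)         # aggregation noise
y = ((h * np.sqrt(p))[:, None] * g_local).sum(axis=0) + z
g_hat = y / K                                     # noisy global gradient

print("aggregation error:", np.linalg.norm(g_hat - g_local.mean(axis=0)))
```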

III Convergence Analysis for Air-FEEL with Adaptive Power Control

In this section, we formally characterize the learning performance of the Air-FEEL system, which is derived as a function of the transmit powers of all devices.

Let $N$ denote the number of communication rounds and $L\triangleq\|{\bf L}\|_{\infty}$. For notational convenience, we use $F^{(n+1)}$ to represent $F({\bf w}^{(n+1)})$. The optimality gap after $N$ communication rounds, defined as $F^{(N+1)}-F^{\star}$, is derived in the following theorem, from which we can understand the convergence behavior of Air-FEEL.

Theorem 1 (Optimality Gap).

The optimality gap of Air-FEEL under an arbitrary transmit power control policy $\{p_{k}^{(n)}\}$ is given by

$$\mathbb{E}\left[F^{(N+1)}\right]-F^{\star}\leq{\Phi}(\{p_{k}^{(n)}\},\eta)\triangleq\prod_{n=1}^{N}A^{(n)}\left(F^{(1)}-F^{\star}\right)+\sum_{n=1}^{N-1}\left(\prod_{i=n+1}^{N}A^{(i)}\right)B^{(n)}+B^{(N)},\qquad(13)$$

with $A^{(n)}=1-\frac{2\mu\eta}{K}\sum_{k\in\mathcal{K}}\left(h_{k}^{(n)}\sqrt{p_{k}^{(n)}}-\frac{\eta L}{2K}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)$ and $B^{(n)}=\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}}{2K^{2}}\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}+\frac{\eta^{2}LN_{0}q}{2K^{2}}$.

Proof:

The proof follows the widely adopted strategy of relating the norm of the gradient to the expected improvement made in a single algorithmic step, and comparing this with the total possible improvement.

$$F^{(n+1)}-F^{(n)}\overset{(a)}{\leq}({\bf g}^{(n)})^{T}({\bf w}^{(n+1)}-{\bf w}^{(n)})+\frac{1}{2}\sum_{i=1}^{q}L_{i}(w_{i}^{(n+1)}-w^{(n)}_{i})^{2}$$
$$\overset{(b)}{\leq}({\bf g}^{(n)})^{T}({\bf w}^{(n)}-\eta\hat{\bf g}^{(n)}-{\bf w}^{(n)})+\frac{L}{2}\left\|{\bf w}^{(n)}-\eta\hat{\bf g}^{(n)}-{\bf w}^{(n)}\right\|_{2}^{2}$$
$$=-\eta({\bf g}^{(n)})^{T}\hat{\bf g}^{(n)}+\frac{\eta^{2}L}{2}\left\|\hat{\bf g}^{(n)}\right\|_{2}^{2}$$
$$=-\frac{\eta}{K}({\bf g}^{(n)})^{T}\left(\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}{\bf g}_{k}^{(n)}+{\bf z}^{(n)}\right)+\frac{\eta^{2}L}{2K^{2}}\left\|\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}{\bf g}_{k}^{(n)}+{\bf z}^{(n)}\right\|_{2}^{2},$$

where inequality (a) follows from Assumption 1, and (b) uses $L\triangleq\|{\bf L}\|_{\infty}$ together with the over-the-air model update ${\bf w}^{(n+1)}={\bf w}^{(n)}-\eta\hat{\bf g}^{(n)}$. By subtracting $F^{\star}$ from and taking the expectation of both sides, the per-round convergence bound in (14) is obtained. Next, (15) follows by applying the PL condition in Assumption 2. Then, by applying the above inequality recursively over the $N$ iterations and performing some simple algebraic manipulations, we arrive at (13), which completes the proof.

$$\mathbb{E}\left[F^{(n+1)}\right]-F^{\star}\leq F^{(n)}-F^{\star}-\frac{\eta}{K}\left(\sum_{k\in\mathcal{K}}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}\right)\|{\bf g}^{(n)}\|_{2}^{2}+\frac{\eta^{2}L}{2K^{2}}\left(\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\left(\|{\bf g}^{(n)}\|_{2}^{2}+\|{\bm\sigma}\|_{2}^{2}\right)+\frac{\eta^{2}LN_{0}q}{2K^{2}}$$
$$=F^{(n)}-F^{\star}-\left[\sum_{k\in\mathcal{K}}\left(\frac{\eta}{K}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}-\frac{\eta^{2}L}{2K^{2}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\right]\|{\bf g}^{(n)}\|_{2}^{2}+\frac{\eta^{2}L}{2K^{2}}\left(\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\|{\bm\sigma}\|_{2}^{2}+\frac{\eta^{2}LN_{0}q}{2K^{2}}.\qquad(14)$$
$$\mathbb{E}\left[F^{(n+1)}\right]-F^{\star}\leq\underbrace{\left[1-2\mu\sum_{k\in\mathcal{K}}\left(\frac{\eta}{K}h_{k}^{(n)}\sqrt{p_{k}^{(n)}}-\frac{\eta^{2}L}{2K^{2}}(h_{k}^{(n)})^{2}p_{k}^{(n)}\right)\right]}_{A^{(n)}}\left(F^{(n)}-F^{\star}\right)+\underbrace{\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}}{2K^{2}}\sum_{k\in\mathcal{K}}(h_{k}^{(n)})^{2}p_{k}^{(n)}+\frac{\eta^{2}LN_{0}q}{2K^{2}}}_{B^{(n)}}.\qquad(15)$$

Further applying the arithmetic-geometric mean inequality $a_{1}a_{2}\cdots a_{m}\leq\left(\frac{a_{1}+a_{2}+\cdots+a_{m}}{m}\right)^{m}$ (for non-negative $a_{i}$'s), we can derive a more elegant upper bound on (13) to attain further insights, as follows:

$${\Phi}(\{p_{k}^{(n)}\},\eta)\leq\alpha^{N}\left(F^{(1)}-F^{\star}\right)+\sum_{n=1}^{N}B^{(n)}\beta_{(n)}^{N-n},\qquad(16)$$

where $\alpha=\frac{1}{N}\sum_{i=1}^{N}A^{(i)}$ and $\beta_{(n)}=\frac{1}{N-n}\sum_{i=n+1}^{N}A^{(i)}$ for $n=1,\cdots,N-1$, while $\beta_{(N)}=1$.

Remark 1.

The first term on the right-hand side of (16) suggests that the effect of the initial optimality gap vanishes as the number of communication rounds $N$ increases. The second term reflects the impact of the power control and the additive noise on the convergence process: transmitting with more power in the early learning iterations is more beneficial for decreasing the optimality gap, because the contribution of the power control at iteration $n$ is discounted by a factor $\beta_{(n)}^{N-n}$.
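As a sanity check on Theorem 1, the following Python sketch evaluates the exact gap bound $\Phi$ in (13) and its relaxed form (16) for a given power policy; all numerical inputs (channels, $\mu$, $L$, etc.) are illustrative assumptions.

```python
import numpy as np

def round_factors(p, h, eta, mu, L, sigma_sq, N0, q):
    """Per-round factors A^(n) and B^(n) from Theorem 1. p, h: (N, K)
    arrays of powers p_k^(n) and magnitudes h_k^(n); sigma_sq = ||sigma||^2."""
    K = p.shape[1]
    A = 1 - (2 * mu * eta / K) * np.sum(
        h * np.sqrt(p) - (eta * L / (2 * K)) * h**2 * p, axis=1)
    B = (eta**2 * L / (2 * K**2)) * (sigma_sq * np.sum(h**2 * p, axis=1) + N0 * q)
    return A, B

def gap_bound(p, h, eta, mu, L, sigma_sq, N0, q, F1_gap):
    """Exact optimality-gap bound Phi in (13); F1_gap is F^(1) - F_star.
    The n = N summand reduces to B^(N) since the empty product is 1."""
    A, B = round_factors(p, h, eta, mu, L, sigma_sq, N0, q)
    return np.prod(A) * F1_gap + sum(
        np.prod(A[n + 1:]) * B[n] for n in range(len(A)))

def gap_bound_relaxed(p, h, eta, mu, L, sigma_sq, N0, q, F1_gap):
    """Looser AM-GM bound (16) with round averages alpha and beta_(n)."""
    A, B = round_factors(p, h, eta, mu, L, sigma_sq, N0, q)
    N = len(A)
    beta = np.array([A[n + 1:].mean() if n < N - 1 else 1.0 for n in range(N)])
    return A.mean()**N * F1_gap + np.sum(B * beta ** (N - 1 - np.arange(N)))

# Illustrative evaluation under a uniform-power policy
rng = np.random.default_rng(2)
N, K = 30, 20
h = np.abs(rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K))) / np.sqrt(2)
p = np.ones((N, K))
args = dict(eta=0.05, mu=0.1, L=1.0, sigma_sq=1.0, N0=0.1, q=10, F1_gap=10.0)
print(gap_bound(p, h, **args), gap_bound_relaxed(p, h, **args))
```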

IV Power Control Optimization

In this section, we focus on speeding up the convergence by minimizing the optimality gap in Theorem 1, subject to the power constraints stated in (11) and (12). The optimization problem is thus formulated as

$$\mathbf{(P1)}:\ \min_{\{p_{k}^{(n)}\geq 0\},\ \eta\geq 0}\ {\Phi}(\{p_{k}^{(n)}\},\eta)\quad{\rm s.t.}\ (11)\ \text{and}\ (12).$$

Due to the coupling between the power control variables $\{p_{k}^{(n)}\}$ and the learning rate $\eta$, problem (P1) is non-convex and hard to solve. We resort to alternating optimization to solve it efficiently. In particular, we first solve problem (P1) under any given $\eta$, and then apply a one-dimensional search to find the $\eta$ that achieves the minimum objective value.

Let $\tilde{\Phi}(\{p_{k}^{(n)}\})={\Phi}(\{p_{k}^{(n)}\},\eta)$ denote the objective under a given $\eta$. Note that the transmit powers at different devices and different communication rounds remain coupled in the objective function in (13), leading to a highly non-convex problem:

$$\mathbf{(P2)}:\ \min_{\{p_{k}^{(n)}\geq 0\}}\ \tilde{\Phi}(\{p_{k}^{(n)}\})\quad{\rm s.t.}\ (11)\ \text{and}\ (12).$$

To tackle this problem, we propose an iterative algorithm that obtains an efficient solution via the SCA technique. The key idea is to approximate the non-convex objective by a constructed convex surrogate around the local point at each iteration. By solving a series of such approximate convex problems iteratively, we obtain a high-quality suboptimal solution to problem (P2).

Let $\{p_{k}^{(n)}[i]\}$ denote the local point at the $i$-th iteration with $i\geq 0$. Taking the first-order Taylor expansion of $\tilde{\Phi}(\{p_{k}^{(n)}\})$ w.r.t. $\{p_{k}^{(n)}\}$ at the local point $\{p_{k}^{(n)}[i]\}$, it follows that

$$\tilde{\Phi}(\{p_{k}^{(n)}\})\approx\bar{\Phi}(\{p_{k}^{(n)}\})\triangleq\tilde{\Phi}(\{p_{k}^{(n)}[i]\})+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{K}}}\left(p_{k}^{(n)}-p_{k}^{(n)}[i]\right)\nabla\tilde{\Phi}(p_{k}^{(n)}[i]),$$

where $\nabla\tilde{\Phi}(p_{k}^{(n)}[i])$ denotes the first-order partial derivative of $\tilde{\Phi}$ w.r.t. $p_{k}^{(n)}$ evaluated at $p_{k}^{(n)}[i]$, as given in (17) and (18).

$$\nabla\tilde{\Phi}(p_{k}^{(n)}[i])=-\frac{\mu\eta h_{k}^{(n)}\left(F^{(1)}-F^{\star}\right)}{K}\left(\frac{1}{\sqrt{p_{k}^{(n)}}}-\frac{\eta Lh_{k}^{(n)}}{K}\right)\prod_{j\in{\mathcal{N}}\setminus\{n\}}A^{(j)}+\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}(h_{k}^{(n)})^{2}}{2K^{2}}\frac{\prod_{j=n}^{N}A^{(j)}}{A^{(n)}}-\frac{\mu\eta h_{k}^{(n)}}{K}\left(\frac{1}{\sqrt{p_{k}^{(n)}}}-\frac{\eta Lh_{k}^{(n)}}{K}\right)\sum_{\ell=1}^{n-1}B^{(\ell)}\frac{\prod_{j=\ell}^{N}A^{(j)}}{A^{(n)}A^{(\ell)}},\quad\forall n\in\mathcal{N}\setminus\{1\},\qquad(17)$$
$$\nabla\tilde{\Phi}(p_{k}^{(1)}[i])=-\frac{\mu\eta h_{k}^{(1)}\left(F^{(1)}-F^{\star}\right)}{K}\left(\frac{1}{\sqrt{p_{k}^{(1)}}}-\frac{\eta Lh_{k}^{(1)}}{K}\right)\prod_{j\in{\mathcal{N}}\setminus\{1\}}A^{(j)}+\frac{\eta^{2}L\|{\bm\sigma}\|_{2}^{2}(h_{k}^{(1)})^{2}}{2K^{2}}\prod_{j\in{\mathcal{N}}\setminus\{1\}}A^{(j)}.\qquad(18)$$

In this case, $\bar{\Phi}(\{p_{k}^{(n)}\})$ is linear w.r.t. $\{p_{k}^{(n)}\}$. To ensure the approximation accuracy, a series of trust region constraints are imposed as [17]

$$|p_{k}^{(n)}[i]-p_{k}^{(n)}[i-1]|\leq\Gamma[i],\quad\forall k\in\mathcal{K},\ \forall n\in\mathcal{N},\qquad(19)$$

where $\Gamma[i]$ denotes the radius of the trust region. By using $\bar{\Phi}(\{p_{k}^{(n)}\})$ as the approximation of $\tilde{\Phi}(\{p_{k}^{(n)}\})$ and introducing an auxiliary variable $\gamma$, the approximate problem at the $i$-th iteration is formulated as the following convex problem:

$$\mathbf{(P2.1)}:\ \min_{\{p_{k}^{(n)}[i]\},\ \gamma\geq 0}\ \gamma$$
$${\rm s.t.}\quad\bar{\Phi}(\{p_{k}^{(n)}[i]\})\leq\gamma,\qquad(20)$$
$$\qquad\ \ (11),\ (12),\ \text{and}\ (19),$$

which can be directly solved by CVX [18].
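Problem (P2.1) maps directly to a disciplined convex program. Below is a minimal sketch in cvxpy (an open-source Python analogue of the CVX toolbox); the placeholder gradient `grad` stands in for the closed-form derivatives (17)-(18), and all dimensions and budgets are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def solve_P21(p_i, grad, phi_i, P_max, P_avg, Gamma):
    """Solve the linearized subproblem (P2.1) at the local point p_i.
    p_i, grad: (N, K) arrays; phi_i is Phi-tilde evaluated at p_i."""
    N, K = p_i.shape
    p = cp.Variable((N, K), nonneg=True)
    gamma = cp.Variable(nonneg=True)
    phi_bar = phi_i + cp.sum(cp.multiply(grad, p - p_i))  # linearization
    prob = cp.Problem(
        cp.Minimize(gamma),
        [phi_bar <= gamma,                 # epigraph constraint (20)
         p <= P_max,                       # per-round budget (11)
         cp.sum(p, axis=0) / N <= P_avg,   # average budget (12)
         cp.abs(p - p_i) <= Gamma])        # trust region (19)
    prob.solve()
    return p.value

# Illustrative call with a placeholder gradient
rng = np.random.default_rng(3)
p_star = solve_P21(p_i=np.ones((30, 20)), grad=rng.normal(size=(30, 20)),
                   phi_i=1.0, P_max=5.0, P_avg=1.0, Gamma=0.5)
```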

Let $\{p_{k}^{(n)*}[i]\}$ denote the optimal power control solution to problem (P2.1) at the local point $\{p_{k}^{(n)}[i]\}$. Based on this, we obtain an efficient iterative algorithm for solving problem (P2) as follows, where $\{p_{k}^{(n)}[0]\}$ denotes the initial power control. At each iteration $i\geq 0$, we solve problem (P2.1) at the local point $\{p_{k}^{(n)}[i]\}$ and evaluate the objective of problem (P2) at the obtained solution $\{p_{k}^{(n)*}[i]\}$. If the objective value decreases, we update the local point as $p_{k}^{(n)}[i+1]=p_{k}^{(n)*}[i],\ \forall n\in\mathcal{N},\ \forall k\in\mathcal{K}$, and proceed to the next iteration; otherwise, we shrink the trust region by setting $\Gamma[i]=\Gamma[i]/2$ and solve problem (P2.1) again. The algorithm terminates once $\Gamma[i]$ falls below a given threshold $\epsilon$. In summary, the proposed algorithm is presented in Algorithm 1.

 

Algorithm 1: Proposed Algorithm for Solving Problem (P2)

1) Initialization: Given the initial power control $\{p_{k}^{(n)}[0]\}$ and the initial trust region radius $\Gamma[0]$; let $i=0$.

2) Repeat:

  a) Solve problem (P2.1) under the given local point $\{p_{k}^{(n)}[i]\}$ to obtain the optimal solution $\{p_{k}^{(n)*}[i]\}$;

  b) If the objective value $\tilde{\Phi}(\{p_{k}^{(n)*}[i]\})$ of problem (P2) decreases, then update $p_{k}^{(n)}[i+1]=p_{k}^{(n)*}[i],\ \forall n\in\mathcal{N},\ \forall k\in\mathcal{K}$, and set $i=i+1$; otherwise, set $\Gamma[i]=\Gamma[i]/2$;

3) Until $\Gamma[i]\leq\epsilon$.

 

With the power control obtained from Algorithm 1, we can then find the optimal $\eta$ via a one-dimensional search.
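Putting the pieces together, the following sketch implements the loop of Algorithm 1 together with the outer one-dimensional search over $\eta$; it reuses `gap_bound` and `solve_P21` from the earlier sketches and, for brevity, computes the gradient of $\tilde{\Phi}$ by finite differences rather than via the closed forms (17)-(18). All tolerances and search grids are illustrative assumptions.

```python
import numpy as np

def solve_P2(h, eta, params, P_max, P_avg, Gamma0=1.0, eps=1e-3):
    """Algorithm 1: SCA with a shrinking trust region, for a fixed eta."""
    N, K = h.shape
    phi = lambda p: gap_bound(p, h, eta, **params)  # Phi-tilde per (13)
    p, Gamma = np.full((N, K), min(P_avg, P_max)), Gamma0
    while Gamma > eps:
        base = phi(p)
        grad, dp = np.zeros_like(p), 1e-6
        for idx in np.ndindex(N, K):          # numerical gradient of Phi-tilde
            q_ = p.copy(); q_[idx] += dp
            grad[idx] = (phi(q_) - base) / dp
        cand = np.clip(solve_P21(p, grad, base, P_max, P_avg, Gamma),
                       0.0, P_max)            # clip guards solver tolerance
        if phi(cand) < base:
            p = cand                          # accept the SCA step
        else:
            Gamma /= 2.0                      # shrink trust region and retry
    return p

# Outer one-dimensional search over the learning rate eta
params = dict(mu=0.1, L=1.0, sigma_sq=1.0, N0=0.1, q=10, F1_gap=10.0)
rng = np.random.default_rng(4)
h = np.abs(rng.normal(size=(30, 20)) + 1j * rng.normal(size=(30, 20))) / np.sqrt(2)
best = min((gap_bound(solve_P2(h, e, params, 5.0, 1.0), h, e, **params), e)
           for e in [0.01, 0.02, 0.05, 0.1])
print("best gap bound %.4f at eta=%.2f" % best)
```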

V Simulation Results

In this section, we provide simulation results to validate the performance of the proposed power control policy for Air-FEEL. In the simulation, the wireless channels from each device to the edge server follow i.i.d. Rayleigh fading, such that the $\hat{h}_{k}$'s are modeled as i.i.d. circularly symmetric complex Gaussian (CSCG) random variables with zero mean and unit variance. The dataset of size $D_{\rm tot}=600$ over all devices is randomly generated, where part of the data, namely $100$ pairs $({\bf x},y)$, is reserved for prediction, and the remaining samples are used for model training. Each data sample vector ${\bf x}$ follows the i.i.d. Gaussian distribution $\mathcal{N}(0,{\bf I})$, and the label $y$ is obtained as $y=x(2)+3x(5)+0.2z$, where $x(t)$ denotes the $t$-th entry of ${\bf x}$ and $z\sim\mathcal{N}(0,1)$ is i.i.d. observation noise. Unless stated otherwise, the data samples are evenly distributed among the $K=20$ devices, and thus $D_{k}=25$. Moreover, we apply ridge regression with the sample-wise loss function $f({\bf w},{\bf x},y)=\frac{1}{2}\|{\bf x}^{T}{\bf w}-y\|^{2}$ and the regularization function $R({\bf w})=\|{\bf w}\|^{2}$ with $\rho=5\times 10^{-5}$. Furthermore, recalling that $D_{\rm tot}=\sum_{k\in\mathcal{K}}D_{k}$, we obtain the smoothness parameter $L$ and the PL parameter $\mu$ as the largest and smallest eigenvalues of the data Gramian matrix ${\bf X}^{T}{\bf X}/D_{\rm tot}+10^{-4}{\bf I}$, where ${\bf X}=[{\bf x}_{1},\cdots,{\bf x}_{D_{\rm tot}}]^{T}$ is the data matrix. The optimal loss value $F^{\star}$ is computed from the optimal parameter vector ${\bf w}^{\star}$ of the learning problem (3), given by ${\bf w}^{\star}=({\bf X}^{T}{\bf X}+\rho{\bf I})^{-1}{\bf X}^{T}{\bf y}$ with ${\bf y}=[y_{1},\cdots,y_{D_{\rm tot}}]^{T}$. We set the initial parameter vector to the all-zero vector and the noise variance to $N_{0}=0.1$.
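For reference, the setup described above can be reproduced with a short script such as the sketch below; the random seed, the model size $q$, and the exact train/test split mechanics are illustrative assumptions beyond what the text specifies.

```python
import numpy as np

rng = np.random.default_rng(5)
D_tot, q, K, rho = 600, 10, 20, 5e-5  # q (model size) is an assumed value

# Generate data: y = x(2) + 3 x(5) + 0.2 z, with 1-based indexing in the text
X = rng.normal(size=(D_tot, q))
y = X[:, 1] + 3 * X[:, 4] + 0.2 * rng.normal(size=D_tot)

# Hold out 100 pairs for prediction; split the rest evenly over K devices
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]
device_data = [(X_train[k::K], y_train[k::K]) for k in range(K)]  # D_k = 25

# Smoothness L and PL parameter mu from the regularized Gramian matrix
# (computed here on the training portion of the data)
G = X_train.T @ X_train / len(X_train) + 1e-4 * np.eye(q)
eigs = np.linalg.eigvalsh(G)       # eigenvalues in ascending order
L, mu = eigs[-1], eigs[0]

# Optimal ridge solution w* = (X^T X + rho I)^{-1} X^T y, and F*
w_star = np.linalg.solve(X_train.T @ X_train + rho * np.eye(q),
                         X_train.T @ y_train)
F_star = 0.5 * np.mean((X_train @ w_star - y_train) ** 2) \
         + rho * np.sum(w_star ** 2)
print(f"L={L:.3f}, mu={mu:.3f}, F*={F_star:.4f}")
```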

We consider two benchmark schemes for performance comparison, namely uniform power transmission, which transmits with uniform power over different communication rounds under the average power budget, and channel inversion as adopted in [15]. As the performance metrics for comparison, we consider the optimality gap and the prediction error.

Figure 2: Effect of the number of devices on the learning performance of Air-FEEL. (a) Optimality gap versus the number of devices. (b) Prediction error versus the number of devices.

The effect of the device population on the learning performance is illustrated in Fig. 2 with $N=30$, where the power budgets at all devices are identically set as $\tilde{P}=1$ W and $\bar{P}=5$ W. Notice that increasing the device population introduces both positive and negative effects on the learning performance: the positive effect is that the training process can exploit more data, while the negative effect is the increased aggregation error induced by AirComp over more devices. As observed in Fig. 2, the positive effect can be cancelled or even outweighed by the negative effect under channel-inversion or uniform power control. The blessing of including more devices in Air-FEEL dominates the curse it brings only when the power control is judiciously optimized, showing the crucial role of power control in determining the learning performance of Air-FEEL.

Figure 3: Learning performance of Air-FEEL over iterations, where $\eta^{*}$ denotes the optimized learning rate after the one-dimensional search. (a) Tendency of the loss function. (b) Tendency of the prediction error.

Fig. 3 shows the learning performance during the training process under the optimized learning rate, where we set $K=20$, $\tilde{P}=1$ W, $\bar{P}=5$ W, and $N=80$. It is observed that the proposed power control scheme achieves faster convergence than both the channel-inversion and uniform power control schemes. This is attributed to the fact that the power control optimization directly targets convergence acceleration.

Figure 4: The optimized power allocation over iterations under static channels.

Fig. 4 shows the power allocation during the learning process over a static channel with uniform channel gains, where we set $K=20$, $\tilde{P}=1$ W, $\bar{P}=5$ W, and $N=30$. It is observed that the power allocation over a static channel follows a stair-wise, monotonically decreasing pattern. This power control behavior coincides with the analysis in Remark 1.

VI Conclusion

In this paper, we exploited power control as a new degree of freedom to optimize the learning performance of Air-FEEL, a promising communication-efficient solution towards edge intelligence. To this end, we first analyzed the convergence rate of Air-FEEL by deriving the optimality gap of the loss function under an arbitrary power control policy. Then we formulated the power control problem to minimize the optimality gap for accelerating convergence, subject to a set of average and maximum power constraints at the edge devices. The challenge posed by the coupling of the power control variables over different devices and iterations was tackled by the joint use of the SCA and trust region methods. Numerical results demonstrated that the optimized power control policy achieves significantly faster convergence than benchmark policies such as channel inversion and uniform power transmission.

References

  • [1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, Jan. 2020.
  • [2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
  • [3] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016. [Online]. Available: https://arxiv.org/pdf/1610.05492.pdf
  • [4] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
  • [5] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498–3516, Oct. 2007.
  • [6] M. Gastpar, “Uncoded transmission is exactly optimal for a simple Gaussian sensor network,” IEEE Trans. Inf. Theory, vol. 54, no. 11, pp. 5247–5251, Nov. 2008.
  • [7] O. Abari, H. Rahul, D. Katabi, and M. Pant, “Airshare: Distributed coherent transmission made seamless,” in Proc. IEEE INFOCOM, Kowloon, Hong Kong, Apr. 2015, pp. 1742–1750.
  • [8] X. Cao, G. Zhu, J. Xu, and K. Huang, “Optimized power control for over-the-air computation in fading channels,” IEEE Trans. Wireless Commun., pp. 1–1, 2020.
  • [9] X. Cao, G. Zhu, J. Xu, and K. Huang, “Cooperative interference management for over-the-air computation networks,” 2020. [Online]. Available: https://arxiv.org/pdf/2007.11765.pdf
  • [10] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
  • [11] M. Mohammadi Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [12] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [13] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning,” 2020. [Online]. Available: https://arxiv.org/pdf/2003.02089.pdf
  • [14] G. Zhu, Y. Du, D. Gunduz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” 2020. [Online]. Available: https://arxiv.org/pdf/2001.05713.pdf
  • [15] D. Liu and O. Simeone, “Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control,” 2020. [Online]. Available: https://arxiv.org/pdf/2006.05459.pdf
  • [16] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” Lecture Notes in Computer Science, pp. 795–811, 2016. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46128-1_50
  • [17] B. Dai, Y. Liu, and W. Yu, “Optimized base-station cache allocation for cloud radio access network with multicast backhaul,” IEEE J. Sel. Areas Commun., vol. 36, no. 8, pp. 1737–1750, Aug. 2018.
  • [18] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming,” 2016. [Online]. Available: http://cvxr.com/cvx