
Edge Learning for Large-Scale Internet of Things With Task-Oriented Efficient Communication

Haihui Xie, Minghua Xia, Peiran Wu,
Shuai Wang, and H. Vincent Poor
Manuscript received August 25, 2022; revised December 6, 2022 and April 11, 2023; accepted April 19, 2023. This work was supported in part by the National Natural Science Foundation of China under Grants 62171486 and U2001213, in part by the Guangdong Basic and Applied Basic Research Project under Grant 2021B1515120067, and in part by the U.S. National Science Foundation under Grant CNS-2128448. The associate editor coordinating the review of this paper and approving it for publication was A. S. Cacciapuoti. (Corresponding author: Minghua Xia.) Haihui Xie is with the School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China (e-mail: xiehh6@mail2.sysu.edu.cn). Minghua Xia and Peiran Wu are with the School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China, and also with the Southern Marine Science and Engineering Guangdong Laboratory, Zhuhai 519082, China (e-mail: {xiamingh, wupr3}@mail.sysu.edu.cn). Shuai Wang is with the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: s.wang@siat.ac.cn). H. Vincent Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu). Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org. Digital Object Identifier XXX
Abstract

In Internet of Things (IoT) networks, edge learning on data-driven tasks enables intelligent applications and services. As the network size grows, different users may generate distinct datasets. Thus, to accommodate multiple edge learning tasks in large-scale IoT networks, this paper achieves efficient communication under the task-oriented principle through the collaborative design of wireless resource allocation and edge learning error prediction. In particular, we start with multi-user scheduling to alleviate co-channel interference in dense networks. Then, we perform optimal power allocation in parallel for different learning tasks. Thanks to the high parallelization of the designed algorithm, extensive experimental results corroborate that the multi-user scheduling and task-oriented power allocation efficiently improve the performance of distinct edge learning tasks compared with state-of-the-art benchmark algorithms.

Index Terms:
Edge learning, multi-user scheduling, Internet of Things, parallel computing.
1536-1276 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I Introduction

Massive connectivity is a feature of many Internet of Things (IoT) deployments, and in such applications, the widely connected users generate an enormous amount of data. To extract information from these data, IoT users can train learning models to effectively represent different types of data [1]. Although IoT users have some capability to train simple learning models, their limited memory, computing power, and battery capacity deter the application of complicated models, such as deep neural networks [2]. To deal with this issue, edge learning techniques have emerged, transferring the burden of complex model updates to an edge server, i.e., leveraging the storage, communication, and computational capabilities of the edge server [3]. Moreover, the edge server also allows rapid access to the enormous amount of data distributed over end-user devices for fast model learning, providing intelligent services and applications for IoT users [4].

In edge learning, the main design objective is to acquire fast intelligence from the rich but highly distributed data of subscribed IoT users. This critically depends on data processing at edge servers, as well as efficient communication between edge servers and IoT users [5]. However, compared to the increasingly high processing speeds at edge servers, communication suffers from the hostility of wireless channels and consequently becomes the bottleneck for ultra-fast edge learning [6]. Moreover, the diversity of ubiquitous IoT users and complex transmission environments lead to additional interference. Such interference significantly deteriorates the reliability and communication latency of the IoT network while a vast amount of data is uploaded to an edge server [7]. Traditional data-oriented communication systems are designed to maximize network throughput based on Shannon's theory, which targets transmitting data reliably given the limited radio resources [8]. However, such approaches are often ineffective in edge learning, as they rely only on classical source and channel coding theory and fail to improve learning performance [9]. Therefore, a paradigm shift in wireless system design is required, from data-oriented to task-oriented communications.

I-A Related Works and Motivation

The initial attempts at task-oriented communications were to design task-aware transmission phases rather than end-to-end data reconstruction; see, e.g., [10, 11, 12] for task-aware reporting phases in the case of distributed inference tasks. Beyond task-aware transmission, several pioneering works [13, 14, 15, 16, 17] have also studied task-oriented schemes in edge learning systems. The work [13] designed a task-oriented communication scheme to realize a trade-off between preserving the relevant information and suiting bandwidth-limited edge inference. In [14, 15], task-oriented methods were proposed to maximize learning accuracy by jointly designing sensing, communication, and computation. The work [16] proposed a task-oriented transmission scheme to accelerate learning processes efficiently by capturing the semantic features of the correlated multimodal data of IoT users. Nevertheless, these task-oriented communication schemes are mostly heuristic-based, and additional optimization is necessary to improve their learning performance.

To overcome the drawback of heuristic methods, the work [17] proposed a learning-centric power allocation (LCPA) model to guide power allocation efficiently so as to optimize the limited network resources under the task-oriented principle. First, the learning performance was approximated by parameter fitting to capture the shape of learning models. Then, a majorization minimization (MM) algorithm was designed to allocate transmit power efficiently. However, the MM algorithm is inefficient for large-scale IoT networks due to its high computational complexity and lack of co-channel interference (CCI) management. In particular, for massive IoT users, it is necessary to collect multi-modal datasets and process heterogeneous learning tasks concurrently, making it imperative to design parallel and low-complexity task-oriented communication algorithms. Furthermore, concurrent transmissions of massive IoT devices inevitably yield severe CCI, thus degrading the performance of task-oriented communications [18]. But due to the highly coupled CCI, the associated power allocation problem is non-separable and non-convex, which makes interference management and algorithm parallelization non-trivial.

To fill the gap, this paper designs a task-oriented power allocation model for efficient communications in large-scale IoT networks with edge learning. On the one hand, as a task-oriented learning system involves heterogeneous learning tasks, it is necessary to predict the resources required to train different tasks. Therefore, our method is designed as an offline learning procedure that fits historical datasets to a performance prediction model and an online inference procedure that guides the IoT-edge communications with the pre-trained performance model. Note that this performance model can be fine-tuned by exploiting a small amount of real-time data from active IoT users. On the other hand, we formulate a task-oriented power allocation problem to guide communication-efficient data collection for large-scale IoT networks. To alleviate CCI, multi-user scheduling is first performed before power allocation. Then, a highly parallel algorithm is designed for different learning tasks. Lastly, we develop an accelerated algorithm to make the parallel algorithm more efficient. In brief, Table I compares the existing and proposed schemes.

TABLE I: Comparison of Existing and Proposed Schemes.

| Type          | Scheme   | Learning Efficiency^a | Algorithm Complexity | Parallelism & Acceleration | Optimal Objective | Multi-user Scheduling^b |
|---------------|----------|-----------------------|----------------------|----------------------------|-------------------|-------------------------|
| Task-aware    | [10]     | +                     | +++                  | +                          | N/A               | ✗                       |
|               | [11]     | +                     | +++                  | +                          | N/A               | ✗                       |
|               | [12]     | +                     | +++                  | +                          | N/A               | ✗                       |
| Task-oriented | [13]     | ++                    | ++                   | +                          | N/A               | ✗                       |
|               | [14, 15] | ++                    | ++                   | +                          | N/A               | ✗                       |
|               | [16]     | ++                    | ++                   | +                          | N/A               | ✗                       |
|               | [17]     | +++                   | +++                  | N/A                        | Min-max           | ✗                       |
|               | Proposed | +++                   | +                    | +++                        | Weighted sum      | ✓                       |

  a) The symbols “+, ++, +++” indicate low, moderate, and high capability, respectively.

  b) The tick “✓” indicates a functionality supported, whereas the cross “✗” indicates not supported.

I-B Summary of Main Results

Aiming at efficient communication for task-oriented edge learning, this paper starts with a multi-user scheduling strategy to mitigate CCI. In particular, a relaxation-and-rounding algorithm is exploited to identify scheduled users efficiently, and an approximate closed-form solution is obtained. Secondly, a parallel algorithm with Gauss-Seidel methods is developed. By a set of variable decompositions, we realize a highly parallel iteration. Thirdly, we design an accelerated algorithm to speed up this parallel algorithm. Finally, extensive experimental results demonstrate the efficiency of our design. In summary, the main contributions are as follows:

  1) A task-oriented power allocation model is proposed to process multiple distinct datasets at the edge. Moreover, a multi-user scheduling strategy is performed before power allocation to mitigate CCI in large-scale IoT networks efficiently.

  2) A highly parallel algorithm is designed for the task-oriented power allocation problem. Through variable decomposition and the elimination of auxiliary variables, power allocation in the presence of CCI is realized efficiently in parallel.

  3) An accelerated algorithm is developed to make the parallel algorithm more efficient for large-scale IoT networks. Specifically, this algorithm utilizes the Lipschitz continuity of the learning error and the identity mapping of the gain matrix to improve convergence.

  4) Extensive experimental results show that the multi-user scheduling strategy can mitigate CCI in large-scale IoT networks. Moreover, our parallel and accelerated algorithms solve task-oriented power allocation problems with significantly shorter computation time than existing algorithms.

I-C Organization

The rest of this paper is organized as follows. Section II describes the system model and formulates a task-oriented power allocation problem. Section III performs multi-user scheduling to mitigate CCI and designs a parallel algorithm for solving the task-oriented power allocation problem. Section IV develops an accelerated algorithm for large-scale IoT networks. Section V discusses the experimental results, and finally, Section VI concludes the paper.

Notation: Scalars, column vectors, and matrices are denoted by regular italic letters, bold lower-case letters, and bold upper-case letters, respectively. The symbol \bm{\mathsf{1}} indicates a column vector with all entries equal to unity. The superscripts (\cdot)^{T} and (\cdot)^{H} denote the transpose and Hermitian transpose of a vector or matrix, respectively, and \|\bm{x}\|_{2} denotes the two-norm of \bm{x}. The abbreviation \mathcal{CN}(\bm{0},\,\varrho\bm{I}) stands for a multi-variate complex Gaussian distribution with mean vector \bm{0} and covariance matrix \varrho\bm{I}. The notation |\mathcal{X}| denotes the cardinality of the set \mathcal{X}, and \mathcal{Y}\setminus\mathcal{X} denotes the complement of the set \mathcal{Y} with respect to \mathcal{X}. The matrix operation \bm{A}(\mathcal{K}_{i},\,\mathcal{K}_{j}) denotes a sub-matrix of size |\mathcal{K}_{i}|\times|\mathcal{K}_{j}| that collects the rows and columns of \bm{A} specified by the index sets \mathcal{K}_{i} and \mathcal{K}_{j}, respectively. The operations \bm{x}\succeq\bm{y}, \bm{x}\circ\bm{y}, and \langle\bm{x},\,\bm{y}\rangle denote the element-wise inequality (each element of \bm{x} is greater than or equal to its counterpart in \bm{y}), the Hadamard product, and the inner product of two vectors, respectively. The Landau notation \mathcal{O}(\cdot) denotes the order of arithmetic operations. Further, we define the rounding function \lfloor w\rceil=1 if w\geq 0.5 and \lfloor w\rceil=0 otherwise, with \lfloor\bm{w}\rceil\triangleq[\lfloor w_{1}\rceil,\,\lfloor w_{2}\rceil,\,\cdots,\,\lfloor w_{K}\rceil]^{T}\in\mathbb{R}^{K\times 1} computed element-wise on \bm{w}. Finally, the floor function \lfloor x\rfloor\triangleq\max\{n\in\mathbb{Z}:n\leq x\}, where \mathbb{Z} is the set of integers.

II System Model and Problem Formulation

In this section, we first describe a task-oriented edge learning system. Then, we formulate the task-oriented power allocation problem with multi-user scheduling.

II-A System Model

Figure 1 illustrates a task-oriented edge learning system consisting of an edge server equipped with N antennas and I different learning tasks \{\mathcal{T}_{1},\,\mathcal{T}_{2},\,\cdots,\,\mathcal{T}_{I}\} with corresponding user sets \mathcal{K}\triangleq\{\mathcal{K}_{1},\,\mathcal{K}_{2},\,\cdots,\,\mathcal{K}_{I}\} and power allocation parameters \bm{p}\triangleq[\bm{p}_{1}^{T},\,\bm{p}_{2}^{T},\,\cdots,\,\bm{p}_{I}^{T}]^{T}\in\mathbb{R}^{K}, where \bm{p}_{i}\triangleq[p_{i_{1}},\,p_{i_{2}},\,\cdots,\,p_{i_{|\mathcal{K}_{i}|}}]^{T}, i\in\mathcal{I}\triangleq\{1,\,2,\,\cdots,\,I\}, i_{j}\in\mathcal{K}_{i}, j\in\{1,\,2,\,\cdots,\,|\mathcal{K}_{i}|\}, K\triangleq|\mathcal{K}|, and p_{k}, k\in\mathcal{K}, denotes the transmit power of the k^{\rm th} user. Each of the I learning tasks in Fig. 1 involves a set of transmitted data, a multi-user scheduling algorithm, a learning model, a process of parameter fitting for the learning model, and a task-oriented power allocation problem.

To improve the performance of edge learning, multi-user scheduling is adopted to alleviate CCI in large-scale IoT networks, and task-oriented power allocation is performed to implement efficient communications. Recalling the seminal Shannon formula, to maximize the network utility of long-term average data rates, the achievable data rate of user k in the presence of multi-user scheduling can be expressed as [8]

R_{k}=\log_{2}\left(1+\dfrac{G_{k,k}p_{k}}{\sum_{\ell\in\Pi_{\mathcal{S}}(\mathcal{K})\setminus k}G_{k,\ell}p_{\ell}+\sigma^{2}}\right),\quad k\in\Pi_{\mathcal{S}}(\mathcal{K}_{i}), \qquad (1)

where \Pi_{\mathcal{S}}(\cdot) denotes the projection function of multi-user scheduling; \sigma^{2} is the variance of the additive white Gaussian noise; G_{k,\ell} represents the composite channel gain from the \ell^{\rm th} user to the edge server when decoding the data of the k^{\rm th} user, computed as G_{k,k}=\rho_{k}\|\bm{h}_{k}\|_{2}^{2} if \ell=k, and G_{k,\ell}=\rho_{\ell}|\bm{h}_{k}^{H}\bm{h}_{\ell}|^{2}/\|\bm{h}_{k}\|_{2}^{2} if \ell\neq k, with \bm{h}_{k}\in\mathbb{C}^{N\times 1} being the complex-valued fast-fading channel vector from the k^{\rm th} user to the edge server and \rho_{k} being the path loss of the k^{\rm th} user. By (1), the number of samples collected at the edge server for the learning task \mathcal{T}_{i} can be computed as

D_{i}=\sum_{k\in\Pi_{\mathcal{S}}(\mathcal{K}_{i})}\left\lfloor\dfrac{BTR_{k}}{V_{i}}\right\rfloor+A_{i}\approx\sum_{k\in\Pi_{\mathcal{S}}(\mathcal{K}_{i})}\dfrac{BTR_{k}}{V_{i}}+A_{i}, \qquad (2)

where B is the total bandwidth in Hz; T is the transmission time in seconds; V_{i} is the number of bits per data sample, and A_{i} is the initial number of historical data samples for the i^{\rm th} pre-trained task.

Figure 1: The system model of task-oriented edge learning.

This paper considers the average channel over a long transmission period instead of assuming a static channel. The reason is twofold. On the one hand, fine-tuning diverse learning models requires a relatively long transmission time, tens or hundreds of seconds, to obtain a large number of data samples. On the other hand, the effect of multi-user scheduling can only be revealed in the context of a long-term channel average rather than an instantaneous channel realization. Assume that the transmission period consists of multiple time slots. The channels are quasi-static within each time slot and vary across consecutive time slots. Therefore, G_{k,k} and G_{k,\ell} in (1) denote the average channel gains over these slots.
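To make the mapping from power allocation to collected samples concrete, the following sketch evaluates the rate expression (1) and the sample count (2) for a toy network. The function and variable names (`achievable_rates`, `num_samples`, the boolean `scheduled` mask standing in for the projection \Pi_{\mathcal{S}}) are illustrative, not from the paper.

```python
import numpy as np

def achievable_rates(G, p, scheduled, sigma2=1e-9):
    """Per-user rates under co-channel interference, as in Eq. (1).

    G[k, l] is the composite channel gain seen when decoding user k
    from interferer l; p is the transmit-power vector; `scheduled`
    is a boolean mask of active users (the projection Pi_S).
    """
    K = len(p)
    rates = np.zeros(K)
    for k in range(K):
        if not scheduled[k]:
            continue  # inactive users transmit nothing
        interf = sum(G[k, l] * p[l] for l in range(K)
                     if l != k and scheduled[l])
        rates[k] = np.log2(1.0 + G[k, k] * p[k] / (interf + sigma2))
    return rates

def num_samples(rates, users_i, B, T, V_i, A_i):
    """Samples collected for task i, Eq. (2), with the floor dropped."""
    return sum(B * T * rates[k] / V_i for k in users_i) + A_i
```

With the floor dropped as in the approximation of (2), the sample count is a smooth function of the powers, which is what the later optimization exploits.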

II-B Problem Formulation

To establish a connection between wireless resource allocation and the performance of machine learning, the work [17] conceived a non-linear function \Theta_{i}(D_{i}|a_{i},\,b_{i})\triangleq a_{i}D_{i}^{-b_{i}} to capture the shape of the learning error function, where a_{i} and b_{i} are tuning parameters that denote the model complexity and account for the non-independent and identically distributed (non-i.i.d.) parallel datasets, respectively. In practice, the values of a_{i} and b_{i} are obtained by fitting the learning error function to the historical dataset. This fitted function matches the experimental data of the machine learning model very well. In line with this idea and multi-user scheduling, we formulate a task-oriented power allocation problem:

\mathcal{P}1:\ \min_{\bm{p},\,\Pi_{\mathcal{S}}}\ \sum_{i\in\mathcal{I}}\lambda_{i}\,a_{i}D_{i}^{-b_{i}} \qquad (3a)
{\rm s.t.}\ \sum_{k\in\mathcal{K}}p_{k}=P,\ p_{k}\geq 0,\ \forall k\in\mathcal{K}, \qquad (3b)
\Pi_{\mathcal{S}}(\mathcal{K})\subseteq\mathcal{K}, \qquad (3c)
p_{k}=0,\ \forall k\in\mathcal{K}\setminus\Pi_{\mathcal{S}}(\mathcal{K}), \qquad (3d)

where \lambda_{i}\triangleq A_{i}V_{i}/(\sum_{j\in\mathcal{I}}A_{j}V_{j}) is a weight over the diverse datasets. In general, the power allocation of all scheduled users shall satisfy \sum_{k\in\mathcal{K}}p_{k}\leq P, i.e., not exceeding the total power budget P. Since a larger value of \sum_{k\in\mathcal{K}}p_{k} always improves the learning performance, this inequality is active at the optimum, yielding the equality in (3b) [17]. Constraint (3c) schedules a subset of users, and (3d) implies that no power is allocated to inactive users.

The conventional min-max objective used in [17] focuses only on the worst learning task, even if that task is not critical for the real-world application; thus, it is unsuitable for the multi-task multi-modal scenario considered in this paper. Instead, the weighted-sum objective in (3a) optimizes multiple tasks simultaneously. In particular, it can adapt to different learning tasks by adjusting the weight factors \lambda_{i}, i\in\mathcal{I}.

Remark 1 (On the learning loss model).

In theory, the training procedure of any smooth learning network can be modeled as a Gibbs distribution of networks characterized by a temperature parameter T_{g}. The asymptotic generalization loss \epsilon_{i} as the number of samples D_{i} for the i^{\rm th} learning task goes to infinity can be expressed as [19, Eq. 3.12]

\epsilon_{i}=\epsilon_{i,\mathrm{min}}+\left(\frac{T_{g}}{2}+\frac{\mathrm{Tr}(\bm{U}_{i}\bm{V}_{i}^{-1})}{2W_{i}}\right)W_{i}D_{i}^{-1},\quad\text{as }D_{i}\rightarrow+\infty, \qquad (4)

where \epsilon_{i,\mathrm{min}}\geq 0 is the minimum error of the considered learning system, W_{i} is the number of parameters, and D_{i} is the number of samples. The matrices \bm{U}_{i} and \bm{V}_{i} denote the second-order and first-order derivatives of the generalization loss with respect to the parameters of model i. By setting a_{i}=\left(\frac{T_{g}}{2}+\frac{\mathrm{Tr}(\bm{U}_{i}\bm{V}_{i}^{-1})}{2W_{i}}\right)W_{i}, b_{i}=1, and \epsilon_{i,\mathrm{min}}=0, (4) reduces to the proposed learning loss model \Theta_{i}(D_{i}|a_{i},\,b_{i})\triangleq a_{i}D_{i}^{-b_{i}}, implying that the proposed model holds in the asymptotic sense. In practice, \epsilon_{i,\mathrm{min}} in (4) cannot always approach zero as the number of samples grows without bound, even for some simple learning models. For mathematical tractability, we set \epsilon_{i,\mathrm{min}}=0 in this paper by assuming that the learning model is powerful enough that, given an infinite amount of data, the learning loss vanishes.
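The tuning parameters a_i and b_i of the loss model \Theta_{i}(D_{i}|a_{i},\,b_{i})=a_{i}D_{i}^{-b_{i}} are obtained by fitting historical (samples, error) pairs. The following is a minimal sketch of such a fit using ordinary least squares in the log domain; the exact fitting procedure of [17] may differ, and `fit_error_model` is an illustrative name.

```python
import numpy as np

def fit_error_model(D, err):
    """Fit Theta(D) = a * D**(-b) to historical (samples, error) pairs.

    Taking logs gives log(err) = log(a) - b*log(D), a linear model in
    log(D) solved here by ordinary least-squares regression.
    """
    slope, intercept = np.polyfit(np.log(D), np.log(err), 1)
    return np.exp(intercept), -slope  # recovered (a, b)
```

For noiseless data generated from the model itself, the fit recovers (a_i, b_i) exactly up to floating-point precision; with real training curves it yields the least-squares estimate in the log domain.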

III Algorithm Development in Parallel

In this section, we first describe a multi-user scheduling algorithm given a power allocation. Then, we design a parallel algorithm to solve the power allocation problem.

III-A Multi-user Scheduling Algorithm

In the task-oriented learning system, multi-user scheduling is an effective strategy for handling massive connectivity. Traditional approaches to multi-user scheduling mostly rely on non-convex combinatorial methods, e.g., greedy heuristic search. In particular, as the number of scheduling cases grows exponentially with the number of users, it is intractable to enumerate all possible subsets of users explicitly [20]. To deal with this problem, we introduce binary variables w_{k}, k\in\mathcal{K}, to replace \Pi_{\mathcal{S}} defined immediately after (1); that is, w_{k}=1 if k\in\Pi_{\mathcal{S}}(\mathcal{K}), and w_{k}=0 otherwise. As a result, given p_{k}, inserting (1)-(2) into \mathcal{P}1, the multi-user scheduling problem in the i^{\rm th} group, i\in\mathcal{I}, is formulated as

\mathcal{P}2:\ \min_{\bm{w}}\ a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}w_{k}R_{k}+A_{i}\right)^{-b_{i}} \qquad (5a)
{\rm s.t.}\ w_{k}\in\{0,\,1\}, \qquad (5b)
\sum_{k\in\mathcal{K}_{i}}w_{k}\leq N_{i}, \qquad (5c)

where R_{k}=\log_{2}(1+G_{k,k}p_{k}/(\sum_{\ell\in\mathcal{K}\setminus k}w_{\ell}G_{k,\ell}p_{\ell}+\sigma^{2})) as per (1), \bm{w}\triangleq[w_{1},\,w_{2},\,\cdots,\,w_{K}]^{T}, and (5c) is derived from (3c), with N_{i} being the maximal allowed number of active users for the i^{\rm th} learning task.

To solve \mathcal{P}2, we adopt a relaxation-and-rounding algorithm [21]. First, we relax the binary constraint (5b) to the real-valued constraint 0<w_{k}\leq 1. Then, Proposition 1 below yields an approximate closed-form solution to the relaxed version of \mathcal{P}2.

Proposition 1 (Multi-user scheduling algorithm).

Given p_{k}, k\in\mathcal{K}_{i}, the multi-user scheduling variable w_{k} is analytically determined by

\tilde{w}_{k}=\min\left(\max\left(\dfrac{\tilde{G}_{k,k}p_{k}}{\delta_{k}\left(\exp\left(\frac{G_{k,k}p_{k}}{\delta_{k}+G_{k,k}p_{k}}+\nu_{i}\right)-1\right)},\,\epsilon\right),\,1\right), \qquad (6)

where \delta_{k}\triangleq\sum_{\ell\in\mathcal{K}\setminus k}\tilde{G}_{k,\ell}p_{\ell}+\sigma^{2} with \tilde{G}_{k,\ell}\triangleq w_{k}G_{k,\ell}; \nu_{i}>0 is a tuning parameter controlling the sparsity of the solution, and \epsilon>0 is a small positive number close to zero. When the multi-user scheduling algorithm converges, \tilde{\bm{w}}\triangleq[\tilde{\bm{w}}_{1}^{T},\,\tilde{\bm{w}}_{2}^{T},\,\cdots,\,\tilde{\bm{w}}_{I}^{T}]^{T} with \tilde{\bm{w}}_{i}\triangleq[\tilde{w}_{i_{1}},\,\tilde{w}_{i_{2}},\,\cdots,\,\tilde{w}_{i_{|\mathcal{K}_{i}|}}]^{T} is the optimal solution, in which each element is rounded to the nearest integer 1 or 0, i.e., \lfloor\tilde{\bm{w}}\rceil.

Proof.

See Appendix A. ∎

Proposition 1 shows that the multi-user scheduling decision is determined analytically, with extremely low computational complexity proportional to the number of users, i.e., \mathcal{O}(K). It is also noted that our multi-user scheduling strategy is fair with respect to different learning tasks, whereas fairness among users is not accounted for, as it is beyond the scope of this paper.
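Proposition 1 can be read as a fixed-point iteration: each w_k is repeatedly updated by the closed form (6) and, upon convergence, rounded to {0, 1}. The sketch below is illustrative; it assumes the interference term \delta_k weights each interferer by its current relaxed scheduling variable, and the names (`schedule_users`, `nu`, `iters`) are not from the paper.

```python
import numpy as np

def schedule_users(G, p, nu, sigma2=1e-9, eps=1e-3, iters=100):
    """Fixed-point sketch of the closed-form update in Proposition 1.

    nu is the sparsity-controlling parameter nu_i of Eq. (6); larger
    values push more users out of the schedule. The relaxed variables
    w in (eps, 1] are finally rounded to {0, 1}.
    """
    K = len(p)
    w = np.ones(K)  # start with all users tentatively scheduled
    for _ in range(iters):
        for k in range(K):  # Gauss-Seidel-style in-place sweep
            delta_k = sum(w[l] * G[k, l] * p[l]
                          for l in range(K) if l != k) + sigma2
            num = w[k] * G[k, k] * p[k]
            den = delta_k * (np.exp(G[k, k] * p[k]
                                    / (delta_k + G[k, k] * p[k]) + nu) - 1.0)
            w[k] = min(max(num / den, eps), 1.0)
    return (w >= 0.5).astype(int)  # the rounding operator
```

Qualitatively, a small nu keeps users with strong direct gains active, while a large nu drives all relaxed variables toward eps and hence toward an empty schedule.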

III-B Parallel Power Allocation

After the multi-user scheduling is performed by Proposition 1, the task-oriented power allocation problem can be rewritten as

\mathcal{P}3:\ \min_{\bm{p}}\ \sum_{i\in\mathcal{I}}\lambda_{i}\,a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}\tilde{w}_{k}\tilde{R}_{k}+A_{i}\right)^{-b_{i}} \qquad (7a)
{\rm s.t.}\ \sum_{k\in\mathcal{K}}(1-\tilde{w}_{k})p_{k}\leq\epsilon, \qquad (7b)
(3b),

where \tilde{R}_{k}\triangleq\log_{2}(1+G_{k,k}p_{k}/(\sum_{\ell\in\mathcal{K}\setminus k}\tilde{w}_{\ell}G_{k,\ell}p_{\ell}+\sigma^{2})), and (7b) is the relaxation of (3d), which means that only negligible power is reserved for inactive users.

The optimization problem \mathcal{P}3 is non-convex; even worse, its computational complexity rises with the numbers of users and tasks. To address these issues, we propose a parallel first-order algorithm. As the power and CCI terms are coupled amongst different learning tasks, it is hard to parallelize the computation across tasks in a straightforward manner. An efficient strategy is to introduce auxiliary variables that separate these terms. The resulting problem decomposes into a set of sub-problems through variable decomposition, and these sub-problems are easier to solve in parallel [22, 23].

Now, we begin to extract the relevant sub-problems. First, to divide the interference term, we introduce additional variables 𝜹\bm{\delta}, defined as

\bm{\delta}\triangleq\bm{\Delta}\bm{p}+\sigma^{2}\bm{\mathsf{1}}, \qquad (8)

where \bm{\delta}=[\bm{\delta}_{1}^{T},\,\bm{\delta}_{2}^{T},\,\cdots,\,\bm{\delta}_{I}^{T}]^{T} with \bm{\delta}_{i}\triangleq[\delta_{i_{1}},\,\delta_{i_{2}},\,\cdots,\,\delta_{i_{|\mathcal{K}_{i}|}}]^{T}, and \bm{\Delta}\triangleq\tilde{\bm{G}}-\tilde{\bm{D}} with \tilde{\bm{G}}\triangleq[\bm{G}(1,\,:)^{T}\circ\tilde{\bm{w}}\ \cdots\ \bm{G}(K,\,:)^{T}\circ\tilde{\bm{w}}]^{T} and \tilde{\bm{D}}\triangleq[\bm{D}(:,\,1)\circ\tilde{\bm{w}}\ \cdots\ \bm{D}(:,\,K)\circ\tilde{\bm{w}}]. Here, the (k,\,\ell)^{\rm th} element of \bm{G} is G_{k,\ell}, and the k^{\rm th} diagonal element of the diagonal matrix \bm{D} is G_{k,k}. By partitioning the users \mathcal{K} into I groups \{\mathcal{K}_{i}\}_{i=1}^{I} and introducing the sets of variables \{\bm{z}_{i}\}_{i=1}^{I} and \{P_{i}\}_{i=1}^{I}, we have

\bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i}=\bm{z}_{i}, \qquad (9a)
\sum_{i\in\mathcal{I}}\bm{z}_{i}=\bm{\delta}-\sigma^{2}\bm{\mathsf{1}}, \qquad (9b)
\bm{\mathsf{1}}^{T}\bm{p}_{i}=P_{i}, \qquad (9c)
\sum_{i\in\mathcal{I}}P_{i}=P, \qquad (9d)

where \bm{z}_{i}\triangleq[z_{1,i},\,z_{2,i},\,\cdots,\,z_{K,i}]^{T} and \bm{p}_{i}\triangleq[p_{i_{1}},\,p_{i_{2}},\,\cdots,\,p_{i_{|\mathcal{K}_{i}|}}]^{T}. It is noteworthy that \bm{z}_{i} in (9a)-(9b) and P_{i} in (9c)-(9d) are auxiliary variables. As a result, \mathcal{P}3 can be transformed into:

\mathcal{P}4:\ \min_{\{\bm{p}_{i},\,\bm{\delta}_{i},\,\bm{z}_{i},\,P_{i}\}_{i\in\mathcal{I}}}\ \sum_{i\in\mathcal{I}}\lambda_{i}\,\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) \qquad (10a)
{\rm s.t.}\ \bm{\delta}\succeq\sigma^{2}\bm{\mathsf{1}},\ \bm{p}_{i}\succeq\bm{0}, \qquad (10b)
(7b), (9a), (9b), (9c), (9d),

where (10b) is naturally satisfied since \sum_{\ell\in\mathcal{K}\setminus k}\tilde{G}_{k,\ell}p_{\ell}\geq 0 and p_{\ell}\geq 0. Specifically, \Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) in (10a) is explicitly given by

\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})\triangleq a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}\tilde{w}_{k}\log_{2}\left(1+\dfrac{G_{k,k}p_{k}}{\delta_{k}}\right)+A_{i}\right)^{-b_{i}}.

It is clear that \mathcal{P}4 separates the interference term and the power constraint through auxiliary variables, which facilitates the parallelization of the algorithm design. However, since \mathcal{P}4 involves multiple auxiliary variables and constraints, the corresponding updates linearize the augmented terms and slow down the convergence. Even worse, convergence may not be guaranteed when there are more than two blocks of variables.
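The decomposition (8)-(9) can be sanity-checked numerically: summing the per-group products \bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i} must recover \bm{\Delta}\bm{p}, i.e., \bm{\delta}-\sigma^{2}\bm{\mathsf{1}}. The toy check below assumes all users are scheduled, so the weighting by \tilde{\bm{w}} is trivial; the construction of the off-diagonal matrix \bm{\Delta} is a sketch of the definitions after (8).

```python
import numpy as np

# Toy check of Eqs. (8)-(9): Delta = G_tilde - D_tilde collects the
# interference gains (diagonal removed); splitting users into groups
# K_i and summing Delta[:, K_i] @ p_i recovers delta - sigma^2 * 1.
K, sigma2 = 4, 0.1
rng = np.random.default_rng(0)
G = rng.uniform(0.1, 1.0, (K, K))      # composite channel gains G_{k,l}
w = np.ones(K)                         # all users scheduled (w_tilde = 1)
p = rng.uniform(0.0, 1.0, K)           # transmit powers
G_tilde = G * w                        # Hadamard weighting by w (trivial here)
Delta = G_tilde - np.diag(np.diag(G_tilde))  # zero the desired-signal diagonal
delta = Delta @ p + sigma2             # Eq. (8): delta = Delta p + sigma^2 1
groups = [[0, 1], [2, 3]]              # user groups K_1, K_2
z = [Delta[:, g] @ p[g] for g in groups]     # Eq. (9a): z_i
assert np.allclose(sum(z), delta - sigma2)   # Eq. (9b) holds by construction
```

The point of the decomposition is exactly this identity: once z_i carries each group's contribution to the interference, the per-task terms can be handled independently.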

To address this issue, we propose a method to eliminate auxiliary variables [24, pp. 249-251]. First, the augmented Lagrangian function (ALF) of 𝒫4\mathcal{P}4 can be written as

L(\{\bm{p}_{i}\}_{i=1}^{I},\,\{P_{i}\}_{i=1}^{I},\,\{\bm{z}_{i}\}_{i=1}^{I},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}_{i}\}_{i=1}^{I},\,\{\beta_{i}\}_{i=1}^{I})
=\sum_{i\in\mathcal{I}}\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\sum_{i\in\mathcal{I}}\beta_{i}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)+\dfrac{\mu}{2}\sum_{i\in\mathcal{I}}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}
\quad{}+\sum_{i=1}^{I}\left\langle\bm{\alpha}_{i},\,\bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i}-\bm{z}_{i}\right\rangle+\dfrac{\mu}{2}\sum_{i=1}^{I}\left\|\bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i}-\bm{z}_{i}\right\|_{2}^{2}, \qquad (11)

where \mu is drawn from an increasing positive sequence \{\mu(t)\} over the iterations. From (11), it is observed that (9b) and (9d) are not directly included in the ALF since different tasks are correlated through them. As will be shown shortly, this form of the ALF allows the sub-problems to be solved in parallel. It is noteworthy that this algorithm differs from conventional alternating direction method of multipliers (ADMM) algorithms, and its convergence is guaranteed [24, p. 255]. Given (11), we have the following proposition.

Proposition 2.

For the ALF given by (11), the variables PiP_{i} and 𝐳i\bm{z}_{i} are updated via the following iteration:

Pi(t)\displaystyle P_{i}(t) =𝟭T𝒑i(t)1I(𝟭T𝒑(t)P),\displaystyle=\bm{\mathsf{1}}^{T}\bm{p}_{i}(t)-\dfrac{1}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t)-P\right), (12a)
𝒛i(𝜹(t))\displaystyle\bm{z}_{i}(\bm{\delta}(t)) =𝚫(:,𝒦i)𝒑i(t)1I(𝚫𝒑(t)𝜹(t)+σ2𝟭).\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}(t)-\dfrac{1}{I}\left(\bm{\Delta}\bm{p}(t)-\bm{\delta}(t)+\sigma^{2}\bm{\mathsf{1}}\right). (12b)

The relative dual variables are updated by

β(t+1)=β(t)+μ(t)I(𝟭T𝒑(t+1)P),\displaystyle\beta(t+1)=\beta(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right), (13a)
𝜶(t+1)=𝜶(t)+μ(t)I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭),\displaystyle\bm{\alpha}(t+1)=\bm{\alpha}(t)+\dfrac{\mu(t)}{I}\left(\bm{\Delta}\bm{p}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right), (13b)

and βi(t+1)=β(t+1)\beta_{i}(t+1)=\beta(t+1) and 𝛂i(t+1)=𝛂(t+1)\bm{\alpha}_{i}(t+1)=\bm{\alpha}(t+1), for all i=1,,Ii=1,\cdots,I.

Proof.

See Appendix B. ∎

By Proposition 2, it is evident that the auxiliary variables have been eliminated and the dimension of the dual variables has been reduced. Next, we split the ALF given by (11) with respect to 𝒑\bm{p} and 𝜹\bm{\delta}.
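A quick sanity check on (12a): summing the per-task budgets gives Σi Pi(t) = 𝟭ᵀ𝒑(t) − (𝟭ᵀ𝒑(t) − P) = P, so the implied allocation always respects the total power budget. A minimal numeric sketch (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
I, K_per_task = 4, 6                 # I tasks with |K_i| users each (assumed)
p_blocks = [rng.uniform(0.0, 2e-3, K_per_task) for _ in range(I)]
P_total = 0.02                       # total power budget P = 20 mW (assumed)

p_all_sum = sum(b.sum() for b in p_blocks)
# Eq. (12a): P_i(t) = 1^T p_i(t) - (1/I) * (1^T p(t) - P)
P_i = np.array([b.sum() - (p_all_sum - P_total) / I for b in p_blocks])
```

By construction, the residual 𝟭ᵀ𝒑 − P is spread evenly over the I tasks, so the per-task budgets remain consistent however the primal blocks evolve.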

III-B1 Parallelizable splitting with respect to 𝒑\bm{p}

By Proposition 2, we divide the ALF given by (11) into a set of sub-functions Li(𝒑i,𝜹;𝜶,β)L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right), where the ithi^{\rm th} sub-function denotes the ALF of the ithi^{\rm th} task. To enable a parallel algorithm across the tasks, Li(𝒑i,𝜹;𝜶,β)L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right) is given by

Li(𝒑i,𝜹;𝜶,β)\displaystyle L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right)
=λiΦi(𝒑i|𝜹i)+β(𝟭T𝒑iPi)+μ2(𝟭T𝒑iPi)2\displaystyle=\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\beta\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)+\,\dfrac{\mu}{2}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}
+𝜶,𝚫(:,𝒦i)𝒑i𝒛i(𝜹)+μ2𝚫(:,𝒦i)𝒑i𝒛i(𝜹)22.\displaystyle\quad{}+\,\left\langle\bm{\alpha},\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\bm{z}_{i}(\bm{\delta})\right\rangle+\dfrac{\mu}{2}\left\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\bm{z}_{i}(\bm{\delta})\right\|^{2}_{2}. (14)

III-B2 Parallelizable splitting with respect to 𝜹\bm{\delta}

By Proposition 2, interference terms remain in (12b) and (13b); thus, it is still hard to update 𝜹\bm{\delta} in parallel. Therefore, we adopt the Gauss-Seidel method to obtain a highly parallelizable iteration for 𝜹i\bm{\delta}_{i} [24, p. 199], as formalized in the following proposition.

Proposition 3.

By (11) and Proposition 2, we obtain the following function about 𝛅\bm{\delta}:

L(𝒑,{𝜹i}i=1I;{𝜶i}i=1I)\displaystyle L(\bm{p},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}^{\prime}_{i}\}_{i=1}^{I})
=iλiΦi(𝒑i|𝜹i)+𝜶,𝚫𝒑+σ2𝟭𝜹\displaystyle=\sum_{i\in\mathcal{I}}\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\left\langle\bm{\alpha},\,\bm{\Delta}\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}\right\rangle
+μ2I𝚫𝒑+σ2𝟭𝜹22,\displaystyle\quad{}+\,\dfrac{\mu}{2I}\left\|\bm{\Delta}\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}\right\|_{2}^{2}, (15)

where 𝛂[𝛂1T,𝛂2T,,𝛂IT]T\bm{\alpha}\triangleq\left[{\bm{\alpha}^{\prime}_{1}}^{T},\,{\bm{\alpha}^{\prime}_{2}}^{T},\,\cdots,\,{\bm{\alpha}^{\prime}_{I}}^{T}\right]^{T}.

Proof.

From (11), we obtain the following ALF about 𝜹\bm{\delta}:

L(𝒑,{𝜹i}i=1I;{𝜶i}i=1I)\displaystyle L(\bm{p},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}_{i}\}_{i=1}^{I})
=i(λiΦi(𝒑i|𝜹i)+𝜶i,𝚫(:,𝒦i)𝒑i𝒛i(𝜹)\displaystyle=\sum_{i\in\mathcal{I}}\left(\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\left\langle\bm{\alpha}_{i},\,\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}-\bm{z}_{i}(\bm{\delta})\right\rangle\right.
+μ2𝚫(:,𝒦i)𝒑i𝒛i(𝜹)22).\displaystyle\quad{}\left.{}+\,\dfrac{\mu}{2}\left\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}-\bm{z}_{i}(\bm{\delta})\right\|_{2}^{2}\right). (16)

From Proposition 2, we also have 𝜶i=𝜶\bm{\alpha}_{i}=\bm{\alpha} and 𝒛i(𝜹)=𝚫(:,𝒦i)𝒑i1I(𝚫𝒑𝜹+σ2𝟭).\bm{z}_{i}(\bm{\delta})=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\dfrac{1}{I}\left(\bm{\Delta}\bm{p}-\bm{\delta}+\sigma^{2}\bm{\mathsf{1}}\right). Inserting them into (16) and performing algebraic manipulations, we obtain (15). ∎

With Proposition 3, to realize a parallel algorithm while updating 𝜹\bm{\delta}, we divide L(𝒑,{𝜹i}i=1I;{𝜶i}i=1I)L(\bm{p},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}^{\prime}_{i}\}_{i=1}^{I}) into a set of sub-functions Li(𝒑,𝜹i;𝜶i)L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i}) as follows:

Li(𝒑,𝜹i;𝜶i)\displaystyle L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i})
=λiΦi(𝒑i|𝜹𝒊)+𝜶i,𝚫(𝒦i,:)𝒑+σ2𝟭𝜹i\displaystyle=\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta_{i}})+\left\langle\bm{\alpha}^{\prime}_{i},\,\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}_{i}\right\rangle
+μ2I𝚫(𝒦i,:)𝒑+σ2𝟭𝜹i22.\displaystyle\quad{}+\,\dfrac{\mu}{2I}\left\|\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}_{i}\right\|_{2}^{2}. (17)

By (17), it is evident that 𝜹\bm{\delta} is divided into II blocks corresponding to II different learning tasks, which implies that we can efficiently update 𝜹i\bm{\delta}_{i} in parallel.
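To see why this splitting is parallelizable, note that the penalty gradient of (15) with respect to 𝜹 decomposes exactly into the per-block gradients of the sub-functions (17). A small numeric check (the cross-gain matrix and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
I, K = 3, 9
blocks = np.array_split(np.arange(K), I)     # user index sets K_1, ..., K_I
Delta = np.abs(rng.normal(size=(K, K)))
np.fill_diagonal(Delta, 0.0)                 # cross-gain matrix (assumed)
p = rng.uniform(0.0, 1e-3, K)
delta = rng.uniform(1e-10, 1e-9, K)
sigma2, mu = 1e-10, 5.0

# Gradient of the quadratic penalty in (15) w.r.t. the full vector delta:
full_grad = -(mu / I) * (Delta @ p + sigma2 - delta)

# The same gradient assembled from the per-task sub-functions (17),
# each of which touches only its own block delta_i:
block_grad = np.concatenate(
    [-(mu / I) * (Delta[idx, :] @ p + sigma2 - delta[idx]) for idx in blocks]
)
```

Because each block gradient depends on 𝜹 only through its own block 𝜹i, the I block updates can run on separate workers without coordination within an iteration.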

III-C Algorithm Development

We have derived the ALF of 𝒫4\mathcal{P}4 and obtained a set of sub-functions that enable a parallel algorithm across the learning tasks. Now, we compute the partial derivatives with respect to the relevant variables and then apply gradient descent in parallel.

III-C1 Update 𝒑i\bm{p}_{i} with other variables fixed

It is observed that Li(𝒑i,𝜹;𝜶,β)L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right) given by (14) is differentiable with respect to 𝒑i\bm{p}_{i}, and the gradient is computed as

𝒑iLi(𝒑i,𝜹;𝜶,β)\displaystyle\nabla_{\bm{p}_{i}}L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right)
=λi𝒑iΦi(𝒑i|𝜹i)+𝚫(:,𝒦i)T𝜶+β𝟭+μ(𝟭T𝒑iPi)𝟭\displaystyle=\lambda_{i}\nabla_{\bm{p}_{i}}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\bm{\alpha}+\beta\bm{\mathsf{1}}+\mu\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)\bm{\mathsf{1}}
+μ𝚫(:,𝒦i)T(𝚫(:,𝒦i)𝒑i𝒛i).\displaystyle\quad{}+\,\mu\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\bm{z}_{i}\right).

Then, we apply the gradient descent method to obtain 𝒑i(t+1)\bm{p}_{i}(t+1), explicitly given by

𝒑i(t+1)\displaystyle\bm{p}_{i}(t+1) =max(𝒑i(t)η𝒑iLi(𝒑i(t),𝜹(t);𝜶(t),β(t))\displaystyle=\max\left(\bm{p}_{i}(t)-\eta\nabla_{\bm{p}_{i}}L_{i}\left(\bm{p}_{i}(t),\,\bm{\delta}(t);\,\bm{\alpha}(t),\,\beta(t)\right)\right.
ν(𝟭𝒘~i), 0),\displaystyle\quad\left.{}-\nu(\bm{\mathsf{1}}-\tilde{\bm{w}}_{i}),\,\bm{0}\right), (18)

where η\eta is the step size and ν(𝟭𝒘~i)\nu(\bm{\mathsf{1}}-\tilde{\bm{w}}_{i}) denotes a sparsity-regularization term [25]. Moreover, it is seen from Proposition 2 that 𝟭T𝒑i(t)Pi(t)\bm{\mathsf{1}}^{T}\bm{p}_{i}(t)-P_{i}(t) and 𝚫(:,𝒦i)𝒑i(t)𝒛i(t)\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}(t)-\bm{z}_{i}(t) can be updated by (12a) and (12b), respectively.
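A minimal sketch of the projected gradient step (18); the gradient and scheduling vector below are placeholders, not the true gradient of Φi:

```python
import numpy as np

def update_p(p, grad, w_tilde, eta=1e-4, nu=1e-5):
    """One iteration of (18): a gradient step, a sparsity penalty pushing
    unscheduled users (w_tilde ~ 0) toward zero power, and a projection
    onto the nonnegative orthant."""
    return np.maximum(p - eta * grad - nu * (1.0 - w_tilde), 0.0)

rng = np.random.default_rng(3)
K = 6
p = rng.uniform(0.0, 1e-3, K)
grad = rng.normal(size=K)                   # placeholder gradient (assumed)
w_tilde = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # users 3, 5 unscheduled

p_next = update_p(p, grad, w_tilde)
```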

III-C2 Update 𝜹i\bm{\delta}_{i} with other variables fixed

It is seen that Li(𝒑,𝜹i;𝜶i)L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i}) given by (17) is differentiable with respect to 𝜹i\bm{\delta}_{i}, and the gradient is computed as

𝜹iLi(𝒑,𝜹i;𝜶i)\displaystyle\nabla_{\bm{\delta}_{i}}L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i})
=λi𝜹iΦi(𝒑i|𝜹i)𝜶iμI(𝚫(𝒦i,:)𝒑+σ2𝟭𝜹i).\displaystyle=\lambda_{i}\nabla_{\bm{\delta}_{i}}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})-\bm{\alpha}^{\prime}_{i}-\dfrac{\mu}{I}\left(\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}_{i}\right).

Then, we apply the gradient descent method to obtain

𝜹i(t+1)=max(𝜹i(t)η𝜹iLi(𝒑(t),𝜹i(t);𝜶i(t)),σ2𝟭).\bm{\delta}_{i}(t+1)=\max\left(\bm{\delta}_{i}(t)-\eta\nabla_{\bm{\delta}_{i}}L_{i}\left(\bm{p}(t),\,\bm{\delta}_{i}(t);\,\bm{\alpha}^{\prime}_{i}(t)\right),\,\sigma^{2}\bm{\mathsf{1}}\right). (19)
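The corresponding sketch of (19); the projection floor σ²𝟭 reflects that the interference-plus-noise power can never fall below the noise power (values hypothetical):

```python
import numpy as np

def update_delta(delta, grad, sigma2=1e-10, eta=1e-4):
    """One iteration of (19): a gradient step followed by projection onto
    delta >= sigma^2, the physical lower bound on interference-plus-noise."""
    return np.maximum(delta - eta * grad, sigma2)

rng = np.random.default_rng(4)
delta_next = update_delta(rng.uniform(1e-10, 1e-9, 6), rng.normal(size=6))
```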

To realize a highly parallelizable iteration of 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i}, as explicitly given by (III-C1) and (19), we denote variable blocks 𝒙p=[𝒙p1T,𝒙p2T,,𝒙pIT]T\bm{x}_{p}=\left[\bm{x}_{p_{1}}^{T},\,\bm{x}_{p_{2}}^{T},\,\cdots,\,\bm{x}_{p_{I}}^{T}\right]^{T} with 𝒙pi|𝒦i|\bm{x}_{p_{i}}\in\mathcal{R}^{\left|\mathcal{K}_{i}\right|}, and 𝒙δ=[𝒙δ1T,𝒙δ2T,,𝒙δIT]T\bm{x}_{\delta}=\left[\bm{x}_{\delta_{1}}^{T},\,\bm{x}_{\delta_{2}}^{T},\,\cdots,\,\bm{x}_{\delta_{I}}^{T}\right]^{T} with 𝒙δi|𝒦i|\bm{x}_{\delta_{i}}\in\mathcal{R}^{\left|\mathcal{K}_{i}\right|}. By using variable blocks 𝒙i,{p,δ}\bm{x}_{\ell_{i}},\ell\in\{p,\,\delta\}, we obtain

𝒙i(t+1)={𝒑i(t+1), if =p;𝜹i(t+1), otherwise. \bm{x}_{\ell_{i}}(t+1)=\begin{cases}\bm{p}_{i}(t+1),&\text{ if }\ell=p;\\ \bm{\delta}_{i}(t+1),&\text{ otherwise. }\end{cases} (20)

III-C3 Update relative dual variables with others fixed

It is obvious that the ALF given by (11) is a linear function with respect to all dual variables; thus, we have

𝜶i(t+1)\displaystyle\bm{\alpha}_{i}^{\prime}(t+1) =𝜶i(t)+μ(t)I(𝚫(𝒦i,𝒦i)𝒑i(t+1)𝜹i(t+1)\displaystyle=\bm{\alpha}_{i}^{\prime}(t)+\dfrac{\mu(t)}{I}\Bigg{(}\bm{\Delta}\left(\mathcal{K}_{i},\,\mathcal{K}_{i}\right)\bm{p}_{i}(t+1)-\bm{\delta}_{i}(t+1)
+σ2𝟭+j≠i(𝚫(𝒦i,𝒦j))𝒑j(t)).\displaystyle\quad{}+\,\sigma^{2}\bm{\mathsf{1}}+\sum_{j\neq i}\left(\bm{\Delta}\left(\mathcal{K}_{i},\mathcal{K}_{j}\right)\right)\bm{p}_{j}(t)\Bigg{)}. (21)

Using (21) in place of (13b) yields a Gauss-Seidel sequence that realizes a highly efficient iteration based on the most recent available information. Apart from the aforementioned dual variables, β(t+1)\beta(t+1) can be directly updated by (13a).
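The sketch below checks the consistency of the Gauss-Seidel dual update (21) with the centralized update (13b): when every task still holds the same iterate p(t), the stacked per-task updates must reproduce (13b) exactly (matrix and dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
I, K = 2, 6
blocks = np.array_split(np.arange(K), I)
Delta = np.abs(rng.normal(size=(K, K)))
np.fill_diagonal(Delta, 0.0)
sigma2, mu = 1e-10, 5.0
p_old = rng.uniform(0.0, 1e-3, K)        # p(t), known to every task
delta_new = rng.uniform(1e-10, 1e-9, K)  # delta(t+1)
alpha0 = np.zeros(K)

# Per-task update (21): task i combines its own block with the most recent
# blocks p_j(t) received from the other tasks.  Taking the own block equal
# to p(t) as well, the stacked result must match the centralized (13b).
alpha_blocks = []
for i, idx in enumerate(blocks):
    cross = sum(Delta[np.ix_(idx, jdx)] @ p_old[jdx]
                for j, jdx in enumerate(blocks) if j != i)
    resid = Delta[np.ix_(idx, idx)] @ p_old[idx] - delta_new[idx] + sigma2 + cross
    alpha_blocks.append(alpha0[idx] + (mu / I) * resid)

alpha_gs = np.concatenate(alpha_blocks)
alpha_central = alpha0 + (mu / I) * (Delta @ p_old - delta_new + sigma2)
```

In the actual algorithm, the own block is the fresh iterate 𝒑i(t+1) while the cross blocks remain at 𝒑j(t), which is exactly what makes the update parallelizable per task.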

In terms of computational complexity, this algorithm involves KK scheduling variables, KK primal variables, KK auxiliary variables, K(K1)K(K-1) interference terms, and K+IK+I dual variables, where the K+IK+I dual variables come from the KK interference constraints and the II learning tasks. The primal variables, auxiliary variables, interference terms, and dual variables can all be updated in parallel. Consequently, when the number KK of users is large, the per-iteration complexity is approximately 𝒪((K2+K)/I)\mathcal{O}\left((K^{2}+K)/I\right).

To sum up, Fig. 2 sketches the block diagram of the proposed parallel algorithm, and the detailed steps are formalized in Algorithm 1, where lines 3-8 are the main steps of the parallel algorithm, as shown in the parallelization module of Fig. 2. Specifically, lines 4-6 of Algorithm 1 realize the power and CCI optimization, and line 7 performs the dual and scheduling variable updates in parallel. Then, line 9 aggregates messages from different tasks and also constructs an increasing sequence μ(t+1)=min(μsμ(t),μmax)\mu(t+1)=\min(\mu_{\rm s}\mu(t),\,\mu_{\max}), which ensures that the equalities (9a)-(9d) hold when Algorithm 1 converges.

Algorithm 1 The task-oriented power allocation in parallel.
0:  Setting (I,N,K,P,T,B,σ2,{λi,ai,bi,Vi,Ai}i)\left(I,\,N,\,K,\,P,\,T,\,B,\,\sigma^{2},\,\{\lambda_{i},a_{i},\,b_{i},\,V_{i},\,A_{i}\}_{i\in\mathcal{I}}\right), channels {𝒉k}k𝒦\{\bm{h}_{k}\}_{k\in\mathcal{K}}, user set 𝒦\mathcal{K}, gain matrix 𝑮\bm{G}, gain diagonal matrix 𝑫\bm{D}, learning rate η\eta, error tolerance ε\varepsilon, μmax\mu_{\max}, and μs>1\mu_{\rm s}>1.
0:  The optimization solution 𝒑^\hat{\bm{p}};
1:  Initialize t=0,𝒘~=𝟭,𝒑(0)=P/K×𝟭,𝜹(0)=(𝑮𝑫)𝒑(0)+σ2𝟭,𝜶(0)=1/K×𝟭,β(0)=1,t=0,\,\tilde{\bm{w}}=\bm{\mathsf{1}},\,\bm{p}(0)=P/K\times\bm{\mathsf{1}},\,\bm{\delta}(0)=\left(\bm{G}-\bm{D}\right)\bm{p}(0)+\sigma^{2}\bm{\mathsf{1}},\,\bm{\alpha}(0)=1/K\times\bm{\mathsf{1}},\,\beta(0)=1, and μ(0)=1\mu(0)=1;
2:  repeat
3:     for ii\in\mathcal{I} in parallel do
4:        for {p,δ}\ell\in\{p,\,\delta\} in parallel do
5:           Update 𝒙i(t+1)\bm{x}_{\ell_{i}}(t+1) by (20);
6:        end for
7:        Update 𝜶i(t+1)\bm{\alpha}_{i}^{\prime}(t+1) by (21), and 𝒘~i\tilde{\bm{w}}_{i} as per (6);
8:     end for
9:     Compute β(t+1)\beta(t+1) as per (13a), and μ(t+1)=min(μsμ(t),μmax)\mu(t+1)=\min(\mu_{\rm s}\mu(t),\,\mu_{\max});
10:     Compute MSE{\rm MSE} as per (35);
11:     t=t+1t=t+1;
12:  until MSEε{\rm MSE}\leq\varepsilon;
13:  𝒑^=𝒘~𝒑(t)\hat{\bm{p}}=\lfloor\tilde{\bm{w}}\rceil\circ\bm{p}(t).

So far, we have developed a parallel algorithm to solve the task-oriented power allocation problem. However, since multiple variables must be relaxed to enable task parallelism, the convergence is slow. Although Proposition 2 eliminates the auxiliary variables, additional relaxed constraints are still needed to separate the CCI term, namely, the variables 𝜹\bm{\delta} and the related dual variables. Also, the per-iteration cost is usually high, in particular owing to the non-convexity of Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) and the non-unitary matrix 𝑮~𝑫~\tilde{\bm{G}}-\tilde{\bm{D}}, i.e., (𝑮~𝑫~)T(𝑮~𝑫~)(\tilde{\bm{G}}-\tilde{\bm{D}})^{T}(\tilde{\bm{G}}-\tilde{\bm{D}}) is not an identity mapping. To address these issues, we design an accelerated algorithm in the next section.

Refer to caption
Figure 2: Block diagram of the proposed parallel algorithm.
Refer to caption
Figure 3: Block diagram of the accelerated algorithm.

IV An Accelerated Algorithm: Fast Proximal Algorithms

Now, we design a fast proximal ADMM algorithm with parallelizable splitting [26], whose block diagram is sketched in Fig. 3. Specifically, to improve the convergence rate, we first exploit the smoothness property to linearize Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) (i.e., Step 2 in Fig. 3). The smoothness of Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) is established in Lemma 1.

Lemma 1.

The function Φi(𝐩i|𝛅i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) satisfies the following conditions:

  • i)

    Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}^{*}) is LpiL_{p_{i}}-smooth, i.e., 𝒙Φi(𝒙|𝜹i)𝒚Φi(𝒚|𝜹i)2Lpi𝒙𝒚2\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{2}\leq L_{p_{i}}\|\bm{x}-\bm{y}\|_{2} for any 𝒙,𝒚\bm{x},\bm{y};

  • ii)

    Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}^{*}|\bm{\delta}_{i}) is LδiL_{\delta_{i}}-smooth, i.e., 𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)2Lδi𝒙𝒚2\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{2}\leq L_{\delta_{i}}\|\bm{x}-\bm{y}\|_{2} for any 𝒙,𝒚\bm{x},\bm{y},

where 𝛅i,𝐩i,i\bm{\delta}_{i}^{*},\,\bm{p}_{i}^{*},\,i\in\mathcal{I} denote the currently stored values.

Proof.

See Appendix C. ∎

Given Lemma 1, the smoothness result enables us to linearize the learning error function Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}). Next, we construct an identity mapping through a matrix transformation to improve the convergence rate and solve the related sub-problems more efficiently.

IV-A Parallelization

In principle, the essence of our accelerated algorithm is to use an identity transformation of matrices to split the variable blocks. Using a fast proximal linearized ADMM algorithm with parallelizable splitting [26], we derive the ALFs associated with the variable blocks 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i}, respectively.

IV-A1 The ALF with respect to 𝒑i\bm{p}_{i} and 𝜹\bm{\delta}

We first define two block matrices

𝑨[𝚫(:,𝒦i)𝑰/I𝟭T𝟎T],𝑨1[𝚫(:,𝒦i)𝟭T].\bm{A}\triangleq\left[\begin{array}[]{cc}\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)&-{\bm{I}}/{I}\\ \bm{\mathsf{1}}^{T}&\bm{0}^{T}\end{array}\right],\,\bm{A}_{1}\triangleq\left[\begin{array}[]{c}\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\\ \bm{\mathsf{1}}^{T}\end{array}\right].

Then, we can rewrite (9a)-(9c) as a linear equation 𝑨𝒙=𝒓\bm{A}\bm{x}=\bm{r}, where 𝒓[σ2/I𝟭T,Pi]T\bm{r}\triangleq\left[-{\sigma^{2}}/{I}\bm{\mathsf{1}}^{T},\,P_{i}\right]^{T}, 𝒙[𝒑iT,𝜹T]T\bm{x}\triangleq\left[\bm{p}_{i}^{T},\,\bm{\delta}^{T}\right]^{T}, and 𝒛i(𝜹)=𝜹/Iσ2/I𝟏\bm{z}_{i}(\bm{\delta})=\bm{\delta}/{I}-{\sigma^{2}}/{I}\bm{1} given by (8), (9a), and (9b), respectively. Moreover, the ALF given by (14) can be rewritten as

Li(𝒙;𝝀)=Φi(𝒙)+𝝀,𝑨𝒙𝒓+μ2𝑨𝒙𝒓22,L_{i}(\bm{x};\,\bm{\lambda}^{\prime})=\Phi_{i}(\bm{x})+\langle\bm{\lambda}^{\prime},\,\bm{A}\bm{x}-\bm{r}\rangle+\dfrac{\mu}{2}\|\bm{A}\bm{x}-\bm{r}\|_{2}^{2}, (22)

where 𝝀[𝜶T,β]T\bm{\lambda}^{\prime}\triangleq\left[\bm{\alpha}^{T},\,\beta\right]^{T} and Φi(𝒙)λiΦi(𝒑i|𝜹i)\Phi_{i}(\bm{x})\triangleq\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}). Then, by means of the parallelizable splitting [26] and relaxing Li(𝒙;𝝀)L_{i}(\bm{x};\,\bm{\lambda}^{\prime}), we write an accelerated ALF of (22) with respect to 𝒑i\bm{p}_{i} as

Li(𝒑i|𝒚piΦi(𝒚pi(t+1)|𝜹i(t)),𝒑i(t),𝒛pi(t),𝒛i(t);𝜶(t),β(t))\displaystyle L_{i}\left(\bm{p}_{i}|\nabla_{\bm{y}_{p_{i}}}\Phi_{i}(\bm{y}_{p_{i}}(t+1)|\bm{\delta}_{i}(t)),\,\bm{p}_{i}(t),\,\bm{z}_{p_{i}}(t),\,\bm{z}_{i}(t);\,\bm{\alpha}(t),\,\beta(t)\right)
=λi𝒚piΦi(𝒚pi(t+1)|𝜹𝒊(t)),𝒑i+μ(t)𝑨1T(𝑨𝒛1(t)𝒓),𝒑i\displaystyle=\lambda_{i}\left\langle\nabla_{\bm{y}_{p_{i}}}\Phi_{i}(\bm{y}_{p_{i}}(t+1)|\bm{\delta_{i}}(t)),\,\bm{p}_{i}\right\rangle+\mu(t)\left\langle\bm{A}_{1}^{T}(\bm{A}\bm{z}_{1}(t)-\bm{r}),\,\bm{p}_{i}\right\rangle
+𝝀(t),𝑨1𝒑i+12(Lpiθ(t)+μ(t)λpi)𝒑i𝒛pi(t)22\displaystyle\quad{}+\,\langle\bm{\lambda}^{\prime}(t),\,\bm{A}_{1}\bm{p}_{i}\rangle+\frac{1}{2}\left(L_{p_{i}}\theta(t)+\mu(t)\lambda_{p_{i}}\right)\left\|\bm{p}_{i}-\bm{z}_{p_{i}}(t)\right\|_{2}^{2}
=𝜶(t),𝚫(:,𝒦i)𝒑i+μ(t)(𝟭T𝒛pi(t)Pi(t))𝟭,𝒑i\displaystyle=\left\langle\bm{\alpha}(t),\,\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}\right\rangle+\mu(t)\left\langle\left(\bm{\mathsf{1}}^{T}\bm{z}_{p_{i}}(t)-P_{i}(t)\right)\bm{\mathsf{1}},\,\bm{p}_{i}\right\rangle
+μ(t)𝚫(:,𝒦i)T(𝚫(:,𝒦i)𝒛pi(t)𝒛δ(t)I+σ2I𝟏),𝒑i\displaystyle\quad{}+\,\mu(t)\left\langle\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{z}_{p_{i}}(t)-\dfrac{\bm{z}_{\delta}(t)}{I}+\dfrac{\sigma^{2}}{I}\bm{1}\right),\,\bm{p}_{i}\right\rangle
+λi𝒚piΦ(𝒚pi(t+1)|𝜹i(t)),𝒑i+β(t)𝟭T𝒑i\displaystyle\quad{}+\,\lambda_{i}\left\langle\nabla_{\bm{y}_{p_{i}}}\Phi(\bm{y}_{p_{i}}(t+1)|\bm{\delta}_{i}(t)),\,\bm{p}_{i}\right\rangle+\beta(t)\bm{\mathsf{1}}^{T}\bm{p}_{i}
+12(Lpiθ(t)+μ(t)λpi)𝒑i𝒛pi(t)22,\displaystyle\quad{}+\,\frac{1}{2}\left(L_{p_{i}}\theta(t)+\mu(t)\lambda_{p_{i}}\right)\left\|\bm{p}_{i}-\bm{z}_{p_{i}}(t)\right\|_{2}^{2},\, (23)

where 𝒛1[𝒛piT,𝒛δT]T\bm{z}_{1}\triangleq\left[\bm{z}_{p_{i}}^{T},\,\bm{z}_{\delta}^{T}\right]^{T}, 𝒛δ[𝒛δ1T,,𝒛δIT]T\bm{z}_{\delta}\triangleq\left[\bm{z}_{\delta_{1}}^{T},\,\cdots,\,\bm{z}_{\delta_{I}}^{T}\right]^{T}; 𝒛pi\bm{z}_{p_{i}} and 𝒛δi\bm{z}_{\delta_{i}}, ii\in\mathcal{I}, denote the gradient update results of 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i}, respectively. Moreover, λpi2𝑨122\lambda_{p_{i}}\geq 2\|\bm{A}_{1}\|_{2}^{2} guarantees that (23) is a tight majorant surrogate function of (22) with respect to 𝒑i\bm{p}_{i} [26, 22]; therefore, we have

λpi\displaystyle\lambda_{p_{i}} 2K/I(𝚫(:,𝒦i)2+1)2\displaystyle\geq{}2K/I\left(\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\|_{2}+1\right)^{2}
2(𝒘i~2𝚫(:,𝒦i)2+K/I)2\displaystyle\geq{}2\left(\|\tilde{\bm{w}_{i}}\|_{2}\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\|_{2}+\sqrt{K/I}\right)^{2}
2(𝚫(:,𝒦i)2+𝟭2)22𝑨122.\displaystyle\geq{}2\left(\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\|_{2}+\|\bm{\mathsf{1}}\|_{2}\right)^{2}\geq 2\|\bm{A}_{1}\|_{2}^{2}. (24)

Lastly, the parameters 𝒚pi(t+1)\bm{y}_{p_{i}}(t+1), θ(t+1)\theta(t+1), and μ(t+1)\mu(t+1) can be updated by

𝒚pi(t+1)\displaystyle\bm{y}_{p_{i}}(t+1) =(1θ(t))𝒑i(t)+θ(t)𝒛pi(t),\displaystyle=(1-\theta(t))\bm{p}_{i}(t)+\theta(t)\bm{z}_{p_{i}}(t), (25a)
θ(t+1)\displaystyle\theta(t+1) =12(θ2(t)+θ4(t)+4θ2(t)),\displaystyle=\frac{1}{2}(-\theta^{2}(t)+\sqrt{\theta^{4}(t)+4\theta^{2}(t)}), (25b)
μ(t+1)\displaystyle\mu(t+1) =1/θ(t+1),\displaystyle=1/{\theta(t+1)}, (25c)

where (25a) accelerates the convergence by using the smoothness result given by Lemma 1; (25b) is the step size of the fast algorithm; and (25c) yields an increasing sequence, as explained for Algorithm 1. With careful choices of θ(t)\theta(t) and μ(t)\mu(t), the convergence rate can be accelerated from 𝒪(1/τ)\mathcal{O}(1/\tau) to 𝒪(1/τ2)\mathcal{O}(1/\tau^{2}) [26], where τ\tau is the number of iterations needed to converge.
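The recursion (25b) can be checked numerically: it is the standard accelerated step-size sequence satisfying θ(t+1)² = (1 − θ(t+1))·θ(t)², which decays roughly like 2/(t+2) and underlies the 𝒪(1/τ²) rate. A short sketch:

```python
import math

# Step-size recursion (25b): theta(t+1) = (-theta^2 + sqrt(theta^4 + 4*theta^2)) / 2,
# i.e., theta(t+1) is the positive root of x^2 = (1 - x) * theta(t)^2.
theta = [1.0]
for _ in range(50):
    th = theta[-1]
    theta.append(0.5 * (-th ** 2 + math.sqrt(th ** 4 + 4.0 * th ** 2)))

# Largest violation of the defining identity across the trajectory:
identity_gap = max(
    abs(theta[t + 1] ** 2 - (1.0 - theta[t + 1]) * theta[t] ** 2)
    for t in range(50)
)
```

The sequence is strictly decreasing and bounded by 2/(t+2), so μ(t) = 1/θ(t) in (25c) indeed grows without bound, tightening the penalty as the iterations proceed.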

IV-A2 The ALF with respect to 𝒑\bm{p} and 𝜹i\bm{\delta}_{i}

We first define 𝑨[𝚫(𝒦i,:)𝑰]\bm{A}^{\prime}\triangleq\left[\begin{array}[]{cc}\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)&-\bm{I}\end{array}\right], then we can also rewrite 𝚫(𝒦i,:)𝒑+σ2𝟭=𝜹i\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}=\bm{\delta}_{i} given by (8) as 𝑨𝒙=𝒓\bm{A}^{\prime}\bm{x}^{\prime}=\bm{r}^{\prime}, where 𝒓σ2𝟭\bm{r}^{\prime}\triangleq-\sigma^{2}\bm{\mathsf{1}} and 𝒙[𝒑T,𝜹iT]T\bm{x}^{\prime}\triangleq\left[\bm{p}^{T},\,\bm{\delta}_{i}^{T}\right]^{T}. Moreover, the ALF given by (17) can be re-expressed as

Li(𝒙;𝜶i)Φi(𝒙)+𝜶i,𝑨𝒙𝒓+μ2𝑨𝒙𝒓22,L_{i}(\bm{x}^{\prime};\,\bm{\alpha}^{\prime}_{i})\triangleq\Phi_{i}(\bm{x}^{\prime})+\langle\bm{\alpha}^{\prime}_{i},\,\bm{A}^{\prime}\bm{x}^{\prime}-\bm{r}^{\prime}\rangle+\dfrac{\mu}{2}\|\bm{A}^{\prime}\bm{x}^{\prime}-\bm{r}^{\prime}\|_{2}^{2}, (26)

where Φi(𝒙)λiΦi(𝒑i|𝜹i)\Phi_{i}(\bm{x}^{\prime})\triangleq\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}). By the parallelizable splitting and relaxing Li(𝒙;𝜶i)L_{i}(\bm{x}^{\prime};\,\bm{\alpha}^{\prime}_{i}), we also write compactly another accelerated ALF of (26) with respect to 𝜹i\bm{\delta}_{i} as

Li(𝜹i|𝒚δi(t+1),𝒑i(t+1),𝜹i(t),𝒛δi(t);𝜶i(t),β(t))\displaystyle L_{i}\left(\bm{\delta}_{i}|\nabla_{\bm{y}_{\delta_{i}}(t+1)},\,\bm{p}_{i}(t+1),\,\bm{\delta}_{i}(t),\,\bm{z}_{\delta_{i}}(t);\,\bm{\alpha}^{\prime}_{i}(t),\,\beta(t)\right)
=λi𝒚δi(t+1),𝜹i𝜶i(t),𝜹iμ(t)𝑨𝒛2(t)𝒓,𝜹i\displaystyle=\lambda_{i}\left\langle\nabla_{\bm{y}_{\delta_{i}}(t+1)},\,\bm{\delta}_{i}\right\rangle-\langle\bm{\alpha}^{\prime}_{i}(t),\,\bm{\delta}_{i}\rangle-\mu(t)\langle\bm{A}^{\prime}\bm{z}_{2}(t)-\bm{r}^{\prime},\,\bm{\delta}_{i}\rangle
+12(Lδiθ(t)+μ(t)λδi)𝜹i𝒛δi(t)22\displaystyle\quad{}+\,\frac{1}{2}\left(L_{\delta_{i}}\theta(t)+\mu(t)\lambda_{\delta_{i}}\right)\|\bm{\delta}_{i}-\bm{z}_{\delta_{i}}(t)\|_{2}^{2}
=λi𝒚δi(t+1),𝜹i+12(Lδiθ(t)+μ(t)λδi)𝜹i𝒛δi(t)22\displaystyle=\lambda_{i}\left\langle\nabla_{\bm{y}_{\delta_{i}}(t+1)},\,\bm{\delta}_{i}\right\rangle+\frac{1}{2}\left(L_{\delta_{i}}\theta(t)+\mu(t)\lambda_{\delta_{i}}\right)\|\bm{\delta}_{i}-\bm{z}_{\delta_{i}}(t)\|_{2}^{2}
𝜶i(t),𝜹i+μ(t)(𝚫(𝒦i,:))𝒛p(t)𝒛δi(t)+σ2𝟭,𝜹i,\displaystyle\quad{}-\,\left\langle\bm{\alpha}^{\prime}_{i}(t),\,\bm{\delta}_{i}\right\rangle+\mu(t)\left\langle\left(\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\right)\bm{z}_{p}(t)-\bm{z}_{\delta_{i}}(t)+\sigma^{2}\bm{\mathsf{1}},\,\bm{\delta}_{i}\right\rangle, (27)

where 𝒚δi(t+1)𝒚δiΦ(𝒑i(t+1)|𝒚δi(t+1))\nabla_{\bm{y}_{\delta_{i}}(t+1)}\triangleq\nabla_{\bm{y}_{\delta_{i}}}\Phi(\bm{p}_{i}(t+1)|\bm{y}_{\delta_{i}}(t+1)), 𝒛2[𝒛pT,𝒛δiT]T\bm{z}_{2}\triangleq\left[\bm{z}_{p}^{T},\,\bm{z}_{\delta_{i}}^{T}\right]^{T}, and 𝒛p[𝒛p1T,,𝒛pIT]T\bm{z}_{p}\triangleq\left[\bm{z}_{p_{1}}^{T},\,\cdots,\,\bm{z}_{p_{I}}^{T}\right]^{T}. Like (24), the choice of λδi2𝑰22=2\lambda_{\delta_{i}}\geq 2\|\bm{I}\|_{2}^{2}=2 also guarantees that (27) is a tight majorant surrogate function of (26) with respect to 𝜹i\bm{\delta}_{i} [26, 22]. Moreover, 𝒚δi(t+1)\bm{y}_{\delta_{i}}(t+1) is given by

𝒚δi(t+1)=(1θ(t))𝜹i(t)+θ(t)𝒛δi(t),\bm{y}_{\delta_{i}}(t+1)=(1-\theta(t))\bm{\delta}_{i}(t)+\theta(t)\bm{z}_{\delta_{i}}(t), (28)

whose effect is the same as that of (25a). As stated above, Φi(𝒑i|𝜹𝒊)\Phi_{i}(\bm{p}_{i}|\bm{\delta_{i}}) can be relaxed by Lemma 1, which tolerates very large Lipschitz constants LpiL_{p_{i}} and LδiL_{\delta_{i}} for the non-convex functions, as large as 𝒪(τ)\mathcal{O}(\tau), without affecting the convergence rate. Moreover, we also linearize the augmented terms 1/2𝑨𝒙𝒓221/{2}\|\bm{A}\bm{x}-\bm{r}\|_{2}^{2} and 1/2𝑨𝒙𝒓221/{2}\|\bm{A}^{\prime}\bm{x}^{\prime}-\bm{r}^{\prime}\|_{2}^{2} by λpi/2𝒑i𝒛pi(t)22\lambda_{p_{i}}/2\left\|\bm{p}_{i}-\bm{z}_{p_{i}}(t)\right\|_{2}^{2} and λδi/2𝜹i𝒛δi(t)22\lambda_{\delta_{i}}/2\|\bm{\delta}_{i}-\bm{z}_{\delta_{i}}(t)\|_{2}^{2}, respectively. With (23) and (27) in hand, the sub-functions given by (14) and (17) can be optimized more efficiently.
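The majorization claim behind choices like (24) can be verified numerically: with λ ≥ 2‖A‖₂², linearizing ½‖Ax − r‖² at a point z and adding (λ/2)‖x − z‖² upper-bounds the original quadratic everywhere, with equality at z. A sketch with a random stand-in matrix (the A below is generic, not the paper's block matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 8, 5
A = rng.normal(size=(m, n))             # generic stand-in matrix (assumed)
r = rng.normal(size=m)
lam = 2.0 * np.linalg.norm(A, 2) ** 2   # the choice lambda >= 2 ||A||_2^2

def f(x):
    return 0.5 * np.linalg.norm(A @ x - r) ** 2

def surrogate(x, z):
    """f linearized at z, with its curvature replaced by (lam/2)||x - z||^2."""
    return f(z) + (A.T @ (A @ z - r)) @ (x - z) + 0.5 * lam * np.sum((x - z) ** 2)

z = rng.normal(size=n)
min_gap = min(
    surrogate(x, z) - f(x) for x in (rng.normal(size=n) for _ in range(200))
)
```

The gap equals ½(x − z)ᵀ(λI − AᵀA)(x − z), which is nonnegative whenever λ exceeds the spectral norm squared; the factor 2 in (24) simply leaves slack for the tightness argument in [26].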

IV-B Algorithm Development

We have obtained the parallelizable splitting and derived ALFs of the accelerated algorithm. Now, we compute the partial derivatives of relative variables and apply the gradient descent algorithm to update these variables in parallel.

IV-B1 Update 𝒑i\bm{p}_{i} in parallel with other variables fixed

Here, the ALF given by (23) is a quadratic function of 𝒑i\bm{p}_{i}; thus, it admits a closed-form minimizer with respect to 𝒑i\bm{p}_{i}, given by

𝒛~pi(t+1)\displaystyle\tilde{\bm{z}}_{p_{i}}(t+1)
=1Lpiθ(t)+μ(t)λpi(λi𝒚piΦ(𝒚pi(t+1)|𝜹i(t))+β(t)𝟭\displaystyle=-\dfrac{1}{L_{p_{i}}\theta(t)+\mu(t)\lambda_{p_{i}}}\left(\lambda_{i}\nabla_{\bm{y}_{p_{i}}}\Phi(\bm{y}_{p_{i}}(t+1)|\bm{\delta}_{i}(t))+\beta(t)\bm{\mathsf{1}}\right.
+μ(t)𝚫(:,𝒦i)T(𝚫(:,𝒦i)𝒛pi(t)𝒛δ(t)I+σ2I𝟏)\displaystyle\quad{}+\,\mu(t)\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{z}_{p_{i}}(t)-\dfrac{\bm{z}_{\delta}(t)}{I}+\dfrac{\sigma^{2}}{I}\bm{1}\right)
+𝚫(:,𝒦i)T𝜶(t)+μ(t)(𝟭T𝒛pi(t)Pi(t))𝟭)+𝒛pi(t),\displaystyle\quad{}+\,\left.\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\bm{\alpha}(t)+\mu(t)\left(\bm{\mathsf{1}}^{T}\bm{z}_{p_{i}}(t)-P_{i}(t)\right)\bm{\mathsf{1}}\right)+\bm{z}_{p_{i}}(t), (29)

where Pi(t)P_{i}(t) can be computed by (12a). Then, we obtain

𝒛pi(t+1)\displaystyle\bm{z}_{p_{i}}(t+1) =max(𝒛~pi(t+1)ν(𝟭𝒘~i), 0),\displaystyle=\max(\tilde{\bm{z}}_{p_{i}}(t+1)-\nu(\bm{\mathsf{1}}-\tilde{\bm{w}}_{i}),\,\bm{0}), (30a)
𝒑i(t+1)\displaystyle\bm{p}_{i}(t+1) =(1θ(t))𝒑i(t)+θ(t)𝒛pi(t+1),\displaystyle=(1-\theta(t))\bm{p}_{i}(t)+\theta(t)\bm{z}_{p_{i}}(t+1), (30b)

where (30a) is an orthogonal projection combined with the sparsity-regularization term, and (30b) is the accelerated interpolation step.
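Stripped of the problem-specific coefficients, one accelerated iteration is a closed-form quadratic minimizer followed by a projection and an interpolation. The sketch below mimics (29)-(30b), with a generic linear coefficient c standing in for the stacked terms of (23) (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)
K = 5
z_p = rng.uniform(0.0, 1e-3, K)   # z_{p_i}(t)
c = rng.normal(size=K)            # stacked linear coefficient of (23) (assumed)
L_step = 50.0                     # L_{p_i}*theta(t) + mu(t)*lambda_{p_i} (assumed)
theta = 0.4
nu, w_tilde = 1e-5, np.ones(K)
p_old = rng.uniform(0.0, 1e-3, K)

z_tilde = z_p - c / L_step                               # closed-form minimizer, cf. (29)
z_new = np.maximum(z_tilde - nu * (1.0 - w_tilde), 0.0)  # projection step (30a)
p_new = (1.0 - theta) * p_old + theta * z_new            # interpolation step (30b)
```

The first line is exact minimization of a quadratic with curvature L_step, so no inner loop or line search is needed, which is precisely what makes each accelerated iteration cheap.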

IV-B2 Update 𝜹i\bm{\delta}_{i} in parallel with other variables fixed

Here, the ALF given by (27) is also a quadratic function of 𝜹i\bm{\delta}_{i}; hence, we obtain the closed-form solution

𝒛~δi(t+1)\displaystyle\tilde{\bm{z}}_{\delta_{i}}(t+1)
=𝒛δi(t)1Lδiθ(t)+μ(t)λδi(λi𝒚δiΦ(𝒑i(t+1)|𝒚δi(t+1))\displaystyle=\bm{z}_{\delta_{i}}(t)-\dfrac{1}{L_{\delta_{i}}\theta(t)+\mu(t)\lambda_{\delta_{i}}}\left(\lambda_{i}\nabla_{\bm{y}_{\delta_{i}}}\Phi(\bm{p}_{i}(t+1)|\bm{y}_{\delta_{i}}(t+1))\right.
𝜶i(t)+μ(t)((𝚫(𝒦i,:))𝒛p(t)𝒛δi(t)+σ2𝟭)).\displaystyle\quad{}-\,\left.\bm{\alpha}_{i}^{\prime}(t)+\mu(t)\left(\left(\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\right)\bm{z}_{p}(t)-\bm{z}_{\delta_{i}}(t)+\sigma^{2}\bm{\mathsf{1}}\right)\right). (31)

Next, we have

𝒛δi(t+1)\displaystyle\bm{z}_{\delta_{i}}(t+1) =max(𝒛~δi(t+1),σ2𝟭),\displaystyle=\max(\tilde{\bm{z}}_{\delta_{i}}(t+1),\,\sigma^{2}\bm{\mathsf{1}}), (32a)
𝜹i(t+1)\displaystyle\bm{\delta}_{i}(t+1) =(1θ(t))𝜹i(t)+θ(t)𝒛δi(t+1),\displaystyle=(1-\theta(t))\bm{\delta}_{i}(t)+\theta(t)\bm{z}_{\delta_{i}}(t+1), (32b)

where (32a) and (32b) denote an orthogonal projection and an accelerated interpolation step, respectively. In light of (29) and (31), it is obvious that 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i} can be updated in parallel; thus, we have

𝒚i(t+1)\displaystyle\bm{y}_{\ell_{i}}(t+1) ={𝒚pi(t+1), if =p;𝒚δi(t+1), otherwise,\displaystyle=\begin{cases}\bm{y}_{p_{i}}(t+1),&\text{ if }\ell=p;\\ \bm{y}_{\delta_{i}}(t+1),&\text{ otherwise},\end{cases} (33a)
𝒛i(t+1)\displaystyle\bm{z}_{\ell_{i}}(t+1) ={𝒛pi(t+1), if =p;𝒛δi(t+1), otherwise,\displaystyle=\begin{cases}\bm{z}_{p_{i}}(t+1),&\text{ if }\ell=p;\\ \bm{z}_{\delta_{i}}(t+1),&\text{ otherwise},\end{cases} (33b)

and 𝒙i(t+1),{p,δ}\bm{x}_{\ell_{i}}(t+1),\,\ell\in\{p,\,\delta\} is updated by (20).

IV-B3 Update relative dual variables

It is evident that the ALFs given by (14) and (17) are linear functions of all dual variables; hence we have

𝜶i(t+1)\displaystyle\bm{\alpha}_{i}^{\prime}(t+1) =𝜶i(t)+μ(t)I(𝚫(𝒦i,𝒦i)𝒛pi(t+1)+σ2𝟭\displaystyle=\bm{\alpha}_{i}^{\prime}(t)+\dfrac{\mu(t)}{I}\Bigg{(}\bm{\Delta}\left(\mathcal{K}_{i},\,\mathcal{K}_{i}\right)\bm{z}_{p_{i}}(t+1)+\sigma^{2}\bm{\mathsf{1}}
𝒛δi(t+1)+j∈ℐ∖{i}𝚫(𝒦i,𝒦j)𝒛pj(t)),\displaystyle\quad{}-\,\bm{z}_{\delta_{i}}(t+1)+\sum_{j\in\mathcal{I}\setminus\{i\}}\bm{\Delta}\left(\mathcal{K}_{i},\,\mathcal{K}_{j}\right)\bm{z}_{p_{j}}(t)\Bigg{)}, (34a)
β(t+1)\displaystyle\beta(t+1) =β(t)+μ(t)I(𝟭T𝒛p(t+1)P).\displaystyle=\beta(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{z}_{p}(t+1)-P\right). (34b)

Using (34a) in place of (13b) leads to a highly parallelizable iteration.

In terms of computational complexity, the per-iteration cost of the accelerated algorithm is proportional to that of the parallel algorithm; thus, the per-iteration complexity is also 𝒪((K2+K)/I)\mathcal{O}\left((K^{2}+K)/I\right). Beyond the per-iteration complexity, another important measure of convergence speed is the convergence rate. By [26, Theorem 22], this algorithm improves the convergence rate from 𝒪(1/τ)\mathcal{O}(1/\tau) to 𝒪(1/τ2)\mathcal{O}(1/\tau^{2}), which makes it particularly attractive for large-scale IoT networks. Moreover, this algorithm also tolerates large Lipschitz constants LpiL_{p_{i}} and LδiL_{\delta_{i}} when relaxing the non-convex objective functions, without affecting the convergence rate.
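The 𝒪(1/τ) versus 𝒪(1/τ²) distinction is easy to observe on a toy problem. The sketch below compares plain gradient descent with Nesterov-type acceleration on a generic strongly convex quadratic (not the paper's objective); after the same iteration budget, the accelerated iterate is markedly closer to the optimum:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
M = rng.normal(size=(n, n))
Q = M.T @ M + 0.1 * np.eye(n)      # ill-conditioned convex quadratic (assumed)
b = rng.normal(size=n)
f = lambda x: 0.5 * x @ Q @ x - b @ x
f_star = f(np.linalg.solve(Q, b))
L = np.linalg.eigvalsh(Q)[-1]      # Lipschitz constant of the gradient
T = 2000                           # common iteration budget

# Plain gradient descent: O(1/t) worst-case rate.
x = np.zeros(n)
for _ in range(T):
    x = x - (Q @ x - b) / L
gap_plain = f(x) - f_star

# Nesterov-type momentum: O(1/t^2) worst-case rate.
x, y, tk = np.zeros(n), np.zeros(n), 1.0
for _ in range(T):
    x_new = y - (Q @ y - b) / L
    tk_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tk * tk))
    y = x_new + (tk - 1.0) / tk_new * (x_new - x)
    x, tk = x_new, tk_new
gap_accel = f(x) - f_star
```

The gain grows with the conditioning of the problem, which is why acceleration matters most when K is large and the coupling matrix is far from identity.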

To sum up, the procedure is formalized in Algorithm 2, which is faster than Algorithm 1 owing to the acceleration of the error functions (i.e., (25a) and (28)) and of the equality constraints (i.e., (30b) and (32b)). Specifically, lines 5 and 7 of Algorithm 2 describe the parallel steps (i.e., Steps 3-6 in Fig. 3). Among them, line 5 specifies the acceleration steps (i.e., Steps 3 and 6 in Fig. 3). Moreover, line 9 describes the messages aggregated from different tasks (i.e., Step 7 in Fig. 3). Also, μ(t)\mu(t) in Algorithm 2 is adapted to the step size θ(t)\theta(t) to guide the convergence more efficiently.

Algorithm 2 The accelerated algorithm.
0:  Setting (I,N,K,P,T,B,σ2,{λi,λpi,λδi,ai,bi,Vi,Ai}i)\left(I,\,N,\,K,\,P,\,T,\,B,\,\sigma^{2},\,\{\lambda_{i},\lambda_{p_{i}},\,\lambda_{\delta_{i}},\,a_{i},\,b_{i},\,V_{i},\,A_{i}\}_{i\in\mathcal{I}}\right), user set 𝒦\mathcal{K}, channels {𝒉k}k𝒦\{\bm{h}_{k}\}_{k\in\mathcal{K}}, gain matrix 𝑮\bm{G}, gain diagonal matrix 𝑫\bm{D}, learning rate η\eta, and error tolerance ε\varepsilon.
0:  The optimization solution 𝒑^\hat{\bm{p}}.
1:  Initialize t=0,𝒙p(0)=𝒚p(0)=𝒛p(0)=P/K×𝟭,𝒙δ(0)=𝒚δ(0)=𝒛δ(0)=(𝑮𝑫)𝒙p(0)+σ2𝟭,𝒘~=𝟭,𝜶(0)=1/K×𝟭,β(0)=1,μ(0)=θ(0)=1t=0,\,\bm{x}_{p}(0)=\bm{y}_{p}(0)=\bm{z}_{p}(0)=P/K\times\bm{\mathsf{1}},\,\bm{x}_{\delta}(0)=\bm{y}_{\delta}(0)=\bm{z}_{\delta}(0)=(\bm{G}-\bm{D})\bm{x}_{p}(0)+\sigma^{2}\bm{\mathsf{1}},\,\tilde{\bm{w}}=\bm{\mathsf{1}},\,\bm{\alpha}(0)=1/K\times\bm{\mathsf{1}},\,\beta(0)=1,\,\mu(0)=\theta(0)=1;
2:  repeat
3:     for ii\in\mathcal{I} in parallel do
4:        for {p,δ}\ell\in\{p,\,\delta\} in parallel do
5:           Compute 𝒚i(t+1)\bm{y}_{\ell_{i}}(t+1), 𝒛i(t+1)\bm{z}_{\ell_{i}}(t+1), and 𝒙i(t+1)\bm{x}_{\ell_{i}}(t+1) as per (33a), (33b), and (20), respectively;
6:        end for
7:        Compute 𝜶i(t+1)\bm{\alpha}_{i}^{\prime}(t+1) and 𝒘~i\tilde{\bm{w}}_{i} as per (34a) and (6), respectively;
8:     end for
9:     Update β(t+1)\beta(t+1), θ(t+1)\theta(t+1), and μ(t+1)\mu(t+1) according to (34b), (25b), and (25c), respectively;
10:     Compute MSE{\rm MSE} by (35);
11:     t=t+1t=t+1;
12:  until MSEε{\rm MSE}\leq\varepsilon;
13:  𝒑^=𝒘~𝒙p(t)\hat{\bm{p}}=\lfloor\tilde{\bm{w}}\rceil\circ\bm{x}_{p}(t).

V Simulation Results and Discussions

This section presents simulation results to evaluate the performance of the designed algorithms against state-of-the-art benchmark algorithms. Unless specified otherwise, the simulation parameters are set as follows. On the one hand, we use a parameter setting for the wireless communication system similar to that of [17]. Specifically, we set the noise power σ2=77dBm\sigma^{2}=-77\,{\rm dBm}, the communication bandwidth B=180kHzB=180\,{\rm kHz}, and the path loss of the kthk^{\rm th} user ϱk=90dB\varrho_{k}=-90\,{\rm dB}, and the channel 𝒉k\bm{h}_{k} is generated according to 𝒞𝒩(𝟎,ϱk𝑰)\mathcal{CN}(\bm{0},\,\varrho_{k}\bm{I}). Also, we assume that the number of users is identical across tasks, i.e., |𝒦1|=|𝒦2|==|𝒦I|=120|\mathcal{K}_{1}|=|\mathcal{K}_{2}|=\cdots=|\mathcal{K}_{I}|=120. This is a valid assumption since we consider massive connectivity in large-scale IoT networks. On the other hand, for task-oriented learning at the edge, we consider a support vector machine (SVM) for classification of the digits dataset in Scikit-learn [27], a 6-layer convolutional neural network (CNN6) for classification of the MNIST dataset [28], a 110-layer deep residual network (ResNet110) using the CIFAR10 dataset [29], and a PointNet using 3D point clouds in the ModelNet40 dataset [30]. In the pertinent simulation experiments, the single-task case {SVM}, the two-task case {SVM, CNN6}, and the four-task case {SVM, CNN6, ResNet110, PointNet} are considered. For ease of reference, the relevant learning parameters are summarized in Table II; for more details on how these parameters are obtained, the interested reader is referred to Section III of [17]. Apart from the simulation experiments, we also investigate autonomous vehicle perception in the real world to demonstrate the generalization performance of the proposed model.

In the simulation experiments, we consider seven schemes. Four are ours: the parallel task-oriented power allocation scheme (i.e., Algorithm 1), its accelerated counterpart (i.e., Algorithm 2), and the two corresponding variants without multi-user scheduling (Algorithm 1 w/o SH and Algorithm 2 w/o SH for short). In addition, we simulate two benchmarks: a sum-rate maximization scheme [31] and an MM-based LCPA scheme [17]. The sum-rate maximization algorithm is typical of conventional wireless communications but considers only the wireless channel state information, without accounting for the learning factors. Finally, for a fair comparison of multi-user scheduling strategies, the user-fair scheduling (UFS) algorithm developed in [32] is also included in the simulation experiments.

TABLE II: Summary of the learning parameters [17].

    Models           Datasets     Symbols              Values                           Description
    SVM [27]         Digits       (a1, b1, A1, V1)     (5.2, 0.72, 200, 324 bits)       The 1st learning task
    CNN6 [28]        MNIST        (a2, b2, A2, V2)     (7.3, 0.69, 300, 6276 bits)      The 2nd learning task
    ResNet110 [29]   CIFAR10      (a3, b3, A3, V3)     (8.15, 0.44, 1600, 24584 bits)   The 3rd learning task
    PointNet [30]    ModelNet40   (a4, b4, A4, V4)     (0.96, 0.24, 800, 192008 bits)   The 4th learning task

V-A Convergence Performance and Complexity Analysis

In this subsection, the number of antennas is N = 2, the total transmit power is P = 13 dBm (i.e., 20 mW), and the transmit time is T = 10 s for the single-task case, T = 20 s for the two-task case, and T = 200 s for the four-task case. The dataset types and task parameters are defined in Table II. In particular, since the four-task case involves deep networks, T = 200 s is set to obtain enough data to fine-tune them. To evaluate the convergence process, we define the mean squared error (MSE) as

{\rm MSE}\triangleq\left\|\bm{p}(t)-\bm{p}(t-1)\right\|_{2}+\left\|\bm{\delta}(t)-\bm{\delta}(t-1)\right\|_{2}+\left|\|\bm{p}(t)\|_{1}-P\right|+\left\|\left(\bm{G}-\bm{D}\right)\bm{p}(t)-\bm{\delta}(t)+\sigma^{2}\bm{\mathsf{1}}\right\|_{2}. (35)
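The convergence metric (35) sums the successive differences of the primal variables and the residuals of the power-budget and interference-consistency constraints. A minimal sketch, assuming NumPy and our own variable names:

```python
import numpy as np

def mse_metric(p_t, p_prev, delta_t, delta_prev, G, D, sigma2, P):
    """Convergence metric (35): successive differences of p and delta, plus the
    residuals of the power budget 1^T p = P and the interference consistency
    (G - D) p + sigma^2 * 1 = delta."""
    ones = np.ones_like(delta_t)
    return (np.linalg.norm(p_t - p_prev)
            + np.linalg.norm(delta_t - delta_prev)
            + abs(np.sum(p_t) - P)
            + np.linalg.norm((G - D) @ p_t - delta_t + sigma2 * ones))
```

At a fixed point satisfying both constraints with stationary iterates, the metric evaluates to zero, which is what drives the curves in Fig. 4 toward the floor.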

Figure 4 depicts the MSE computed by (35) versus the number of iterations. On the one hand, we observe from Fig. 4a that Algorithms 1 and 2 with multi-user scheduling outperform their counterparts without it, in terms of both convergence speed and MSE. The reason is that, although the introduced redundant variable may slow down convergence and increase instability, the multi-user scheduling strategy activates only a small fraction of users, which greatly reduces the dimensionality of the corresponding variable; hence, the algorithms with multi-user scheduling are more stable and converge faster. On the other hand, Fig. 4a also shows that Algorithm 1 suffers from slower convergence and more severe stochastic fluctuations than Algorithm 2, because Algorithm 2 accelerates the convergence rate from 𝒪(1/τ) to 𝒪(1/τ²). Similarly, Figs. 4b and 4c illustrate that the multi-user scheduling and accelerated algorithms also enjoy faster convergence and lower MSE in the two-task and four-task learning cases, respectively.

Figure 4: Mean squared error vs. the number of iterations: (a) single-task case; (b) two-task case; (c) four-task case.

Figure 5 illustrates the computational complexity in terms of the average execution time. On the one hand, Fig. 5a shows that the MM-LCPA algorithm for the single-task case developed in [17] has a longer execution time than our algorithms and, even worse, exhibits a steeper increase. The reason is that, when the number of users K is large, the per-iteration complexity of the two proposed algorithms is 𝒪(K²+K), whereas that of MM-LCPA is as high as 𝒪((I+K²+K)^3.5). We also observe that Algorithm 2 has a shorter execution time than Algorithm 1, because the accelerated algorithm speeds up the convergence rate from 𝒪(1/τ) to 𝒪(1/τ²) and hence decreases the number of iterations, especially for large-scale IoT networks. On the other hand, Figs. 5b and 5c show that, compared with Fig. 5a, the execution time of our algorithms remains almost the same as the number of tasks grows from two to four. For example, the computational time is approximately 10 s for K = 200 in the single-task case, K = 400 in the two-task case, and K = 800 in the four-task case (i.e., each task has the same number of users). The reason is that the per-iteration complexity of our parallel algorithm is reduced from 𝒪(K²+K) to 𝒪((K²+K)/I); in other words, for fixed K, the computational complexity of our algorithms decreases with the number of tasks I. Thus, we infer that the proposed parallel algorithms efficiently solve the task-oriented power allocation problem.
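The per-task complexity reduction from 𝒪(K²+K) to 𝒪((K²+K)/I) comes from dispatching the I per-task subproblems to parallel workers. The following sketch is purely illustrative (the per-task update is a placeholder of our own, not Algorithm 1 itself):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_task(task_powers):
    """Placeholder per-task subproblem (hypothetical): in Algorithm 1 this
    would be the task-i power update; here it merely normalizes the powers."""
    total = sum(task_powers)
    return [p / total for p in task_powers]

def parallel_power_allocation(per_task_powers, max_workers=4):
    """Dispatch the I per-task subproblems to a worker pool, so each worker
    carries roughly O((K^2 + K)/I) of the per-iteration work."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_task, per_task_powers))
```

Because the subproblems are independent across tasks, the wall-clock time per iteration stays nearly flat as I grows, consistent with Figs. 5b and 5c.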

Figure 5: Average execution time vs. the number of users: (a) single-task case; (b) two-task case; (c) four-task case.

V-B Learning Error Performance

Figure 6 depicts the mean learning error (MLE) computed by (3a). On the one hand, Fig. 6a shows that the MM-LCPA algorithm developed in [17] performs similarly to the sum-rate maximization algorithm developed in [31], and both underperform our algorithms. The reason is that, in the single-task case, the objective function of the MM-LCPA algorithm degenerates into that of the sum-rate maximization algorithm due to the monotonicity of the learning error function, so the two attain similar performance; instead, as multi-user scheduling eliminates CCI in dense networks, the proposed algorithms outperform the others. On the other hand, in the multi-task learning cases, Figs. 6b and 6c show that the MLE of the MM-LCPA algorithm is lower than that of the sum-rate maximization algorithm, thanks to the joint design of efficient task-oriented communications for different learning models. Also, our algorithms achieve a smaller MLE than both the MM-LCPA algorithm and the UFS algorithm developed in [32]: the former owing to the multi-user scheduling and task fairness of our algorithms, and the latter because the UFS algorithm concentrates on user fairness at the expense of learning performance.

Figure 6: Mean learning error vs. the number of users: (a) single-task case; (b) two-task case; (c) four-task case.

In summary, Table III compares the four algorithms discussed above in terms of computational complexity, convergence rate, parallelization capability, and MLE. The designed Algorithms 1 and 2 are effective for task-oriented power allocation, thanks to their low computational complexity, fast convergence, high parallelism, and low learning error. In particular, the former suits small- or medium-scale IoT networks owing to its lower MLE, whereas the latter adapts to large-scale ones thanks to its faster convergence rate.

V-C Experimental Validation for Autonomous Vehicle Perception

To verify the robustness of the proposed algorithms in real-world applications, we consider three perception tasks in autonomous driving [33]: Task 1, weather classification using RGB images and a CNN; Task 2, traffic sign detection using RGB images and YOLOV5; and Task 3, object detection using point cloud data and sparsely embedded convolutional detection (SECOND). In the pertaining experiments, all the datasets are generated by the CarlaFLCAV framework, an open-source autonomous driving simulation platform available online at https://github.com/SIAT-INVS/CarlaFLCAV. In particular, the transmit time T = 500 s is set for the autonomous vehicle perception experiment. The size of each RGB image sample is V₁ = V₂ = 0.7 MB, and that of each point cloud sample is V₃ = 1.6 MB. The number of historical data samples is A₁ = A₂ = A₃ = 300. By fitting the error function to the historical data, we obtain the model parameters (a₁, b₁) = (10.34, 1.2), (a₂, b₂) = (8.89, 0.64), and (a₃, b₃) = (0.5, 0.1) for Tasks 1, 2, and 3, respectively. It can be seen from Fig. 7 that the three fitting curves match the experimental data very well. Note that with a smaller Aᵢ, the estimated parameters (aᵢ, bᵢ) may be less accurate. However, such parameters can still drive the power allocation efficiently, since our goal is to distinguish different tasks rather than to predict learning errors exactly.
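The error-function fitting mentioned above can be illustrated as follows: with the model err(n) ≈ a·n^(−b), the parameters are recoverable by linear least squares in log-log space. This is our own sketch (assuming NumPy), not the authors' fitting code; the data here are synthetic, generated from the Task-2 parameters (a₂, b₂) = (8.89, 0.64) quoted above.

```python
import numpy as np

def fit_error_curve(num_samples, errors):
    """Fit err(n) ~ a * n**(-b) by linear least squares in log-log space:
    log err = log a - b * log n."""
    slope, intercept = np.polyfit(np.log(num_samples), np.log(errors), 1)
    return np.exp(intercept), -slope   # (a, b)

# Synthetic check: noiseless data from a known (a, b) should be recovered.
n = np.array([50, 100, 200, 400, 800])
a_true, b_true = 8.89, 0.64            # Task-2 parameters from the text
err = a_true * n ** (-b_true)
a_hat, b_hat = fit_error_curve(n, err)
```

With noisy historical samples, the recovered (a, b) are only approximate, which matches the remark that a smaller Aᵢ yields less accurate estimates.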

TABLE III: Performance comparison.

    Algorithm            Complexity             Convergence rate   Parallelism(a)   MLE
    Sum-rate max. [31]   𝒪((K−1)^7)             𝒪(1/log τ)         ✗                high
    MM-based LCPA [17]   𝒪((I+K^2+K)^3.5)       𝒪(1/log τ)         ✗                low
    Algorithm 1          𝒪((K^2+K)/I)           𝒪(1/τ)             ✓                low
    Algorithm 2          𝒪((K^2+K)/I)           𝒪(1/τ^2)           ✓                low

    (a) The tick "✓" indicates a supported functionality, whereas the cross "✗" indicates an unsupported one.

Figure 7: Learning error vs. the number of samples.
Figure 8: Qualitative and quantitative results of multi-task perception for autonomous driving.

The top panel of Fig. 8 compares the perception accuracies of the proposed and benchmark algorithms. First, the actual perception accuracies obtained from the machine learning experiments coincide with the predicted ones obtained from the error functions, for all tasks and simulated schemes. Second, the proposed algorithm achieves significantly higher average perception accuracy than the MM-LCPA and sum-rate maximization schemes. This is because the proposed algorithm is a task-oriented scheme that computes the "learning curve", i.e., the derivative of the learning error w.r.t. the number of samples, for each task by leveraging the associated fitted error function. As such, it automatically allocates more power to a task with a steeper learning curve, since that task needs more samples to train its learning model. In our experiment, Task 2 has the steepest learning curve, as seen from Fig. 7. Accordingly, the proposed algorithm allocates more power to Task 2 and achieves the highest perception accuracy. In contrast, the MM-LCPA and sum-rate maximization schemes give more power to Tasks 1 and 3, whose learning errors saturate when the number of samples exceeds 400. Therefore, these benchmark schemes are less learning-efficient than the proposed scheme.
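The prioritization rule described above can be sketched numerically: for err(n) = a·n^(−b), the learning-curve slope magnitude is |d err/dn| = a·b·n^(−b−1), and the task with the steepest slope receives more power. The snippet below is our own illustration using the fitted (aᵢ, bᵢ) quoted in Section V-C; the evaluation point n = 300 is an assumption of ours.

```python
def curve_slope(a, b, n):
    """Magnitude of the learning-curve slope: |d/dn [a * n**(-b)]| = a*b*n**(-b-1)."""
    return a * b * n ** (-b - 1)

# Fitted (a_i, b_i) for Tasks 1-3 from the text; slopes evaluated at n = 300.
tasks = {"task1": (10.34, 1.2), "task2": (8.89, 0.64), "task3": (0.5, 0.1)}
slopes = {name: curve_slope(a, b, 300) for name, (a, b) in tasks.items()}
steepest = max(slopes, key=slopes.get)   # the task that should get more power
```

Evaluating the slopes at n = 300 indeed singles out Task 2, consistent with the discussion of Fig. 7.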

Lastly, the qualitative results of the different schemes are shown in the bottom panel of Fig. 8. There are three traffic lights and two traffic signs at the T-junction. The proposed Algorithm 1 successfully detects all the objects in the image. The MM-LCPA scheme fails to detect a far-away traffic sign and a traffic light (occluded by the wall), while misclassifying a door as a traffic sign. The sum-rate maximization scheme fails to detect a far-away traffic sign and also misclassifies a door as a traffic sign. The reason is that the proposed Algorithm 1 obtains more samples for multiple tasks under the task-oriented principle than the other schemes, whereas the MM-LCPA algorithm focuses on only one of the tasks, even if that task is unimportant, and the sum-rate maximization scheme may fail to collect data for multiple tasks since it ignores task-relevant information.

VI Conclusions

This paper has developed a task-oriented power allocation model to process distinct learning datasets in large-scale IoT networks, especially for multi-task multi-modal scenarios. To deal with massive connectivity, a multi-user scheduling algorithm has been designed to mitigate co-channel interference and to decouple multi-user scheduling from power allocation. Moreover, highly parallel and accelerated algorithms have been designed to solve the multi-objective and large-scale optimization problems. Extensive experimental results have shown that multi-user scheduling effectively mitigates the influence of interference in dense networks, and that the parallel algorithm and its accelerated version serve different learning tasks efficiently, including the real-world multi-task multi-modal scenario of autonomous vehicle perception. In practice, the proposed algorithms can be deployed at the edge, e.g., at the gateway of a large-scale IoT network, which can then inform the users of their transmit powers and other parameters through the downlink control channel, e.g., the narrowband physical downlink control channel in NB-IoT networks. However, as the offline-learning mode is not adaptive to a real-time wireless environment, developing an online-learning counterpart is a promising direction for future work.

Appendix A Proof of Proposition 1

Substituting δk𝒦kG~k,p+σ2\delta_{k}\triangleq\sum_{\ell\in\mathcal{K}\setminus k}\tilde{G}_{k,\ell}p_{\ell}+\sigma^{2} and G~k,kwkGk,k\tilde{G}_{k,k}\triangleq w_{k}{G}_{k,k} into the cost function of 𝒫2\mathcal{P}2, and performing some algebraic manipulations, we obtain

𝒫2a:min{wk}k𝒦i\displaystyle\mathcal{P}2a:\min_{\{w_{k}\}_{k\in\mathcal{K}_{i}}}\, ai(BTVik𝒦iwklog2(1+G~k,kpkwkδk)+Ai)bi\displaystyle a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}w_{k}\log_{2}\left(1+\dfrac{\tilde{G}_{k,k}p_{k}}{w_{k}\delta_{k}}\right)+A_{i}\right)^{-b_{i}} (A.1a)
s.t.\displaystyle{\rm s.t.}\ 0<wk1,\displaystyle 0<w_{k}\leq 1, (A.1b)
k𝒦iwkNi.\displaystyle\sum_{k\in\mathcal{K}_{i}}w_{k}\leq N_{i}. (A.1c)

In light of the non-increasing characteristic of a_{i}x^{-b_{i}} for x>0, and the sparsity constraint (A.1c), \mathcal{P}2a can be transformed into its equivalent penalized form:

𝒫2b:min𝒘i\displaystyle\mathcal{P}2b:\min_{\bm{w}_{i}}\, k𝒦iwkln(1+G~k,kpkδkwk)+νik𝒦iwk\displaystyle-\sum_{k\in\mathcal{K}_{i}}w_{k}\ln\left(1+\dfrac{\tilde{G}_{k,k}p_{k}}{\delta_{k}w_{k}}\right)+\nu_{i}\sum_{k\in\mathcal{K}_{i}}w_{k} (A.2a)
s.t.\displaystyle{\rm s.t.}\ (A.1b),(A.1c),\displaystyle\eqref{SA-EQ-A1a},\,\eqref{SA-EQ-A1b},

where 𝒘i[wi1,wi2,,wi|𝒦i|]T\bm{w}_{i}\triangleq\left[w_{i_{1}},w_{i_{2}},\cdots,w_{i_{\left|\mathcal{K}_{i}\right|}}\right]^{T}, and νi>0\nu_{i}>0 is a tuning parameter for the sparsity regulation.

Next, by setting the objective function of (A.2a) as J(w_{k})\triangleq-w_{k}\ln\left(1+{\tilde{G}_{k,k}p_{k}}/{(\delta_{k}w_{k})}\right)+\nu_{i}w_{k}, it follows that {\partial J(w_{k})}/{\partial w_{k}}=-\ln\left(1+{\tilde{G}_{k,k}p_{k}}/{(\delta_{k}w_{k})}\right)+{G_{k,k}p_{k}}/{(G_{k,k}p_{k}+\delta_{k})}+\nu_{i}. Setting \partial J(w_{k})/\partial w_{k}=0, we obtain

w^k=G~k,kpkδk(exp(Gk,kpkδk+Gk,kpk+νi)1).\hat{w}_{k}=\dfrac{\tilde{G}_{k,k}p_{k}}{\delta_{k}\left(\exp\left(\frac{G_{k,k}p_{k}}{\delta_{k}+G_{k,k}p_{k}}+\nu_{i}\right)-1\right)}. (A.3)

Considering 0<w_{k}\leq 1, there are three cases of \hat{w}_{k} to account for:

  • 1) if \hat{w}_{k}\leq\epsilon, the minimum of J(w_{k}) is attained at w_{k}=\epsilon;
  • 2) if \epsilon<\hat{w}_{k}<1, the minimum is attained at w_{k}=\hat{w}_{k};
  • 3) if \hat{w}_{k}\geq 1, the minimum is attained at w_{k}=1.

As a result, the optimization point is given by (6). This completes the proof.
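Proposition 1 amounts to evaluating the closed-form stationary point (A.3) and projecting it onto [ϵ, 1] according to the three cases. The sketch below is our own illustration with hypothetical variable names; the lower bound ϵ = 10⁻³ is an assumed value.

```python
import math

def scheduling_weight(G_kk, p_k, delta_k, nu, w_prev, eps=1e-3):
    """Closed-form minimizer (A.3) of J(w_k), projected onto [eps, 1].
    G_tilde follows the definition G~_{k,k} = w_k * G_{k,k}, evaluated at the
    previous weight w_prev."""
    G_tilde = w_prev * G_kk
    w_hat = (G_tilde * p_k
             / (delta_k * (math.exp(G_kk * p_k / (delta_k + G_kk * p_k) + nu) - 1)))
    # Three cases: clip the unconstrained minimizer into [eps, 1].
    return min(max(w_hat, eps), 1.0)
```

A larger sparsity penalty ν drives the weight toward ϵ (the user is deactivated), whereas a strong direct channel pushes it toward 1, which is exactly the scheduling behavior exploited in Section V.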

Appendix B Proof of Proposition 2

The Lagrange multiplier βi\beta_{i} for 𝟭T𝒑i=Pi\bm{\mathsf{1}}^{T}\bm{p}_{i}=P_{i} is given by

βi(t+1)=βi(t)+μ(t)(𝟭T𝒑i(t+1)Pi(t+1)),\beta_{i}(t+1)=\beta_{i}(t)+\mu(t)\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)-P_{i}(t+1)\right), (B.1)

where 𝒑i(t+1)\bm{p}_{i}(t+1) and Pi(t+1)P_{i}(t+1) are obtained by the minimization of the ALF given by (11). These minimizations concerning 𝒑i\bm{p}_{i} and PiP_{i} are computed iteratively:

𝒑i\displaystyle\bm{p}_{i} =argmin𝒑i𝟎λiΦi(𝐩i)+βi(t)(𝟭T𝐩iPi)\displaystyle=\underset{\bm{p}_{i}\succeq\bm{0}}{\rm argmin}\,\lambda_{i}\Phi_{i}(\bm{p}_{i})+\beta_{i}(t)\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)
+μ(t)2(𝟭T𝒑iPi)2,\displaystyle\quad{}+\dfrac{\mu(t)}{2}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}, (B.2a)
Pi\displaystyle P_{i} =argmin{Pi|iPi=P}{iβi(t)Pi+μ(t)2i(𝟭T𝐩iPi)2},\displaystyle=\underset{\left\{P_{i}\left|\underset{i\in\mathcal{I}}{\sum}P_{i}=P\right.\right\}}{\rm argmin}\left\{-\sum_{i\in\mathcal{I}}\beta_{i}(t)P_{i}+\dfrac{\mu(t)}{2}\sum_{i\in\mathcal{I}}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}\right\}, (B.2b)

where

Φi(𝒑i)ai(BTVik𝒦iw~kR~k+Ai)bi.\Phi_{i}(\bm{p}_{i})\triangleq a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}\tilde{w}_{k}\tilde{R}_{k}+A_{i}\right)^{-b_{i}}.

Note that the minimization with respect to \{P_{i}\left|i\in\mathcal{I}\right.\} in (B.2b) involves a separable quadratic cost and a single equality constraint, and can thus be carried out analytically. Given the optimal values \bm{p}_{i}(t+1), the optimal value P_{i}(t+1) in (B.2b) is given in closed form by

Pi(t+1)=𝟭T𝒑i(t+1)+βi(t)β(t+1)μ(t),P_{i}(t+1)=\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)+\dfrac{\beta_{i}(t)-\beta(t+1)}{\mu(t)}, (B.3)

where β(t+1)\beta(t+1) is a scalar Lagrange multiplier subject to iPi=P\sum_{i\in\mathcal{I}}P_{i}=P, and it is determined by

β(t+1)=1Iiβi(t)+μ(t)I(𝟭T𝒑(t+1)P).\beta(t+1)=\dfrac{1}{I}\sum_{i\in\mathcal{I}}\beta_{i}(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right). (B.4)

By comparing (B.3) with (B.1), we see that

βi(t+1)=β(t+1).\beta_{i}(t+1)=\beta(t+1). (B.5)

Then, summing (B.1) up for all ii\in\mathcal{I} yields

β(t+1)\displaystyle\beta(t+1) =β(t)+μ(t)Ii(𝟭T𝒑i(t+1)Pi(t+1))\displaystyle=\beta(t)+\dfrac{\mu(t)}{I}\sum_{i\in\mathcal{I}}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)-P_{i}(t+1)\right) (B.6a)
=β(t)+(β(t+1)1Iiβi(t))\displaystyle=\beta(t)+\left(\beta(t+1)-\dfrac{1}{I}\sum_{i\in\mathcal{I}}\beta_{i}(t)\right) (B.6b)
=β(t)+μ(t)I(𝟭T𝒑(t+1)P),\displaystyle=\beta(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right), (B.6c)

where (B.6b)-(B.6c) are derived by (B.3)-(B.4), respectively, and PiP_{i} is updated by

Pi(t+1)=𝟭T𝒑i(t+1)1I(𝟭T𝒑(t+1)P),P_{i}(t+1)=\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)-\dfrac{1}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right), (B.7)

where (B.7) is derived from (B.3) and (B.4). Hence, (12a) and (13a) are immediately proved.
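Updates (B.6c) and (B.7) admit a direct numerical sanity check: after the update, the per-task budgets P_i(t+1) sum exactly to P. The sketch below uses our own function and variable names.

```python
def update_multiplier_and_budgets(beta, mu, per_task_powers, P):
    """One step of (B.6c) and (B.7): the scalar multiplier beta tracks the
    total-power residual, and each per-task budget P_i absorbs 1/I of it."""
    totals = [sum(p) for p in per_task_powers]   # 1^T p_i for each task i
    residual = sum(totals) - P                   # 1^T p - P
    I = len(per_task_powers)
    beta_new = beta + (mu / I) * residual        # (B.6c)
    budgets = [t - residual / I for t in totals] # (B.7)
    return beta_new, budgets
```

Since the budgets satisfy Σᵢ Pᵢ(t+1) = Σᵢ 𝟙ᵀpᵢ(t+1) − (𝟙ᵀp(t+1) − P) = P, the total-power constraint holds after every update by construction.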

Next, we derive (12b) and (13b). Similar to (B.1), we consider the Lagrange multipliers \bm{\alpha}^{\prime}_{i}, for which the method of multipliers yields

𝜶i(t+1)=𝜶i(t)+μ(t)(𝚫(:,𝒦i)𝒑i(t+1)𝒛i(t+1)),\bm{\alpha}^{\prime}_{i}(t+1)=\bm{\alpha}^{\prime}_{i}(t)+\mu(t)\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)-\bm{z}_{i}(t+1)\right), (B.8)

where 𝒑i(t+1)\bm{p}_{i}(t+1), 𝜹i(t+1)\bm{\delta}_{i}(t+1), and 𝒛i(t+1)\bm{z}_{i}(t+1) are obtained by the minimization of the ALF (11). Similar to (B.4), a Lagrange multiplier vector 𝜶\bm{\alpha} is shown below:

𝜶(t+1)\displaystyle\bm{\alpha}(t+1)
=1Ii𝜶i(t)+μ(t)I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭)\displaystyle=\dfrac{1}{I}\sum_{i\in\mathcal{I}}\bm{\alpha}^{\prime}_{i}(t)+\dfrac{\mu(t)}{I}\left(\bm{\Delta}{\bm{p}}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right) (B.9a)
=𝜶(t)+μ(t)I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭),\displaystyle=\bm{\alpha}(t)+\dfrac{\mu(t)}{I}\left(\bm{\Delta}{\bm{p}}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right), (B.9b)

where (B.9b) is obtained by 𝜶i(t+1)=𝜶(t+1)\bm{\alpha}^{\prime}_{i}(t+1)=\bm{\alpha}(t+1). Moreover, we obtain the following optimization solution involving i=1I𝒛i=𝜹σ2𝟭\sum_{i=1}^{I}\bm{z}_{i}=\bm{\delta}-\sigma^{2}\bm{\mathsf{1}}:

𝒛i(𝜹(t+1))\displaystyle\bm{z}_{i}(\bm{\delta}(t+1))
=𝚫(:,𝒦i)𝒑i(t+1)+1μ(t)(𝜶i(t)𝜶(t+1))\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)+\dfrac{1}{\mu(t)}(\bm{\alpha}^{\prime}_{i}(t)-\bm{\alpha}(t+1)) (B.10a)
=𝚫(:,𝒦i)𝒑i(t+1)+1μ(t)(𝜶(t)𝜶(t+1))\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)+\dfrac{1}{\mu(t)}\left(\bm{\alpha}(t)-\bm{\alpha}(t+1)\right) (B.10b)
=𝚫(:,𝒦i)𝒑i(t+1)1I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭),\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)-\dfrac{1}{I}\left(\bm{\Delta}{\bm{p}}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right), (B.10c)

where (B.10b) is obtained by 𝜶i(t+1)=𝜶(t+1)\bm{\alpha}^{\prime}_{i}(t+1)=\bm{\alpha}(t+1), and (B.10c) by (B.9b).

Appendix C Proof of Lemma 1

First, we prove part i) of Lemma 1:

𝒙Φi(𝒙|𝜹i)𝒚Φi(𝒚|𝜹i)2\displaystyle\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{2} (C.1)
N1𝒙Φi(𝒙|𝜹i)𝒚Φi(𝒚|𝜹i)Lp𝒙𝒚2,\displaystyle\leq N_{1}\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{\infty}\leq L_{p}\|\bm{x}-\bm{y}\|_{2},

where N_{1}\triangleq\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{0}^{1/2} is a bounded constant and L_{p} is a positive constant. Following the definition in [17, Lemma 1], (C.1) is obtained in a straightforward manner.

To prove part ii) of Lemma 1, we notice that xkΦi(𝒑i|𝒙)\nabla_{x_{k}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x}) can be rewritten as xkΦi(𝒑i|𝒙)=h(𝒙)g(xk)\nabla_{x_{k}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})=h(\bm{x})g(x_{k}), with the auxiliary functions

h(𝒙)=biai(𝒦iBTViw~log2(1+G,px)+Ai)bi1,\displaystyle h(\bm{x})=b_{i}a_{i}\left(\sum_{\ell\in\mathcal{K}_{i}}\dfrac{BT}{V_{i}}\tilde{w}_{\ell}\log_{2}\left(1+\dfrac{G_{\ell,\ell}p_{\ell}^{*}}{x_{\ell}}\right)+A_{i}\right)^{-b_{i}-1}, (C.2a)
g(xk)=BTw~kGk,kpkln(2)Vixk(xk+Gk,kpk).\displaystyle g(x_{k})=\dfrac{BT\tilde{w}_{k}G_{k,k}p_{k}^{*}}{\ln(2)V_{i}x_{k}(x_{k}+G_{k,k}p_{k}^{*})}. (C.2b)

where xkx_{k} denotes the kthk^{\rm th} entry of 𝒙\bm{x}. The assumption Φi(𝒑i|𝜹i)u0\Phi_{i}(\bm{p}_{i}^{*}|\bm{\delta}_{i})\leq u_{0} gives

𝒦iBTViw~log2(1+G,px)+Ai(aiu0)1/bi,\sum_{\ell\in\mathcal{K}_{i}}\dfrac{BT}{V_{i}}\tilde{w}_{\ell}\log_{2}\left(1+\dfrac{G_{\ell,\ell}p_{\ell}^{*}}{x_{\ell}}\right)+A_{i}\geq\left(\dfrac{a_{i}}{u_{0}}\right)^{1/b_{i}}, (C.3)

then, we have

|h(𝒙)|aibi(u0ai)1+1/bi,\displaystyle|h(\bm{x})|\leq a_{i}b_{i}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+1/b_{i}}, (C.4a)
|g(xk)|BTU0ln(2)Viσ2(σ2+U0),\displaystyle|g(x_{k})|\leq\dfrac{BTU_{0}}{\ln(2)V_{i}\sigma^{2}(\sigma^{2}+U_{0})}, (C.4b)

where (C.4b) follows from U_{0}\geq G_{k,k}p_{k} and x_{k}\geq\sigma^{2}. Here, G_{k,\ell} follows a Gaussian distribution and p_{k}\leq P; hence, G_{k,k}p_{k} is upper bounded by U_{0} with high probability [34]. Furthermore, since h and g satisfy the Lipschitz conditions [35], we have

|h(𝒙)h(𝒚)|\displaystyle\left|h(\bm{x})-h(\bm{y})\right|
sup𝒙σ2𝟭𝒙h(𝒙)2×𝒙𝒚2\displaystyle\leq\sup_{\bm{x}\succeq\sigma^{2}\bm{\mathsf{1}}}\|\nabla_{\bm{x}}h(\bm{x})\|_{2}\times\|\bm{x}-\bm{y}\|_{2}
Kaibi(bi+1)BTU0ln(2)IViσ2(σ2+U0)(u0ai)1+2/bi𝒙𝒚2,\displaystyle\leq\dfrac{Ka_{i}b_{i}(b_{i}+1)BTU_{0}}{\ln(2)IV_{i}\sigma^{2}(\sigma^{2}+U_{0})}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+2/b_{i}}\|\bm{x}-\bm{y}\|_{2}, (C.5a)
|g(xk)g(yk)|\displaystyle|g(x_{k})-g(y_{k})|
supxkσ2|xkg(xk)|×|xkyk|\displaystyle\leq\sup_{x_{k}\geq\sigma^{2}}|\nabla_{x_{k}}g(x_{k})|\times|x_{k}-y_{k}|
BTU0(2σ2+U0)ln(2)Viσ4(σ2+U0)2|xkyk|\displaystyle\leq\dfrac{BTU_{0}(2\sigma^{2}+U_{0})}{\ln(2)V_{i}\sigma^{4}(\sigma^{2}+U_{0})^{2}}|x_{k}-y_{k}|
BTU0(2σ2+U0)ln(2)Viσ4(σ2+U0)2𝒙𝒚2.\displaystyle\leq\dfrac{BTU_{0}(2\sigma^{2}+U_{0})}{\ln(2)V_{i}\sigma^{4}(\sigma^{2}+U_{0})^{2}}\|\bm{x}-\bm{y}\|_{2}. (C.5b)

As a result, the following inequality is obtained:

𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)\displaystyle\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{\infty}
supk𝒦i|h(𝒙)||g(xk)g(yk)|+|h(𝒙)h(𝒚)||g(xk)|\displaystyle\leq\sup_{k\in\mathcal{K}_{i}}|h(\bm{x})||g(x_{k})-g(y_{k})|+\left|h(\bm{x})-h(\bm{y})\right||g(x_{k})|
L2𝒙𝒚2,\displaystyle\leq L_{2}\|\bm{x}-\bm{y}\|_{2},

where the first inequality is due to |ab+cd||a||b|+|c||d||ab+cd|\leq|a||b|+|c||d|, and the second inequality is obtained from (C.4a), (C.4b), (C.5a), and (C.5b); also, L2L_{2} is defined as

L2\displaystyle L_{2} aibiBTU0(2σ2+U0)ln(2)Viσ4(σ2+U0)2(u0ai)1+1/bi\displaystyle\triangleq\dfrac{a_{i}b_{i}BTU_{0}(2\sigma^{2}+U_{0})}{\ln(2)V_{i}\sigma^{4}(\sigma^{2}+U_{0})^{2}}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+1/b_{i}} (C.6)
+Kaibi(bi+1)B2T2U02ln2(2)IVi2σ4(σ2+U0)2(u0ai)1+2/bi.\displaystyle\quad{}+\,\dfrac{Ka_{i}b_{i}(b_{i}+1)B^{2}T^{2}U_{0}^{2}}{\ln^{2}(2)IV_{i}^{2}\sigma^{4}(\sigma^{2}+U_{0})^{2}}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+2/b_{i}}.

Thus, the gradient function 𝜹iΦi(𝒑i|𝜹)\nabla_{\bm{\delta}_{i}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{\delta}) satisfies the following inequality:

𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)L2𝒙𝒚2.\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{\infty}\leq L_{2}\|\bm{x}-\bm{y}\|_{2}. (C.7)

Based on (C.7), we have

𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)2\displaystyle\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{2} (C.8)
N2𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)Lf𝒙𝒚2,\displaystyle\leq N_{2}\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{\infty}\leq L_{f}\|\bm{x}-\bm{y}\|_{2},

where N2𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)01/2N_{2}\triangleq\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{0}^{1/2} and LfN2L2L_{f}\triangleq N_{2}L_{2}. This completes the proof.

References

  • [1] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource-constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
  • [2] Z. Dawy, W. Saad, A. Ghosh, J. G. Andrews, and E. Yaacoub, “Toward massive machine type cellular communications,” IEEE Wireless Commun., vol. 24, no. 1, pp. 120–128, Feb. 2017.
  • [3] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, Jan. 2020.
  • [4] S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya, “Edge intelligence: The confluence of edge computing and artificial intelligence,” IEEE Internet Things J., vol. 7, no. 8, pp. 7457–7469, Aug. 2020.
  • [5] M. Chen, N. Shlezinger, H. V. Poor, Y. C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proc. Nat. Acad. Sci. USA, vol. 118, no. 17, Apr. 2021, Art. no. e2024789118.
  • [6] K. B. Letaief, Y. Shi, J. Lu, and J. Lu, “Edge artificial intelligence for 6G: Vision, enabling technologies, and applications,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 5–36, Jan. 2022.
  • [7] A. Badi and I. Mahgoub, “ReapIoT: Reliable, energy-aware network protocol for large-scale Internet-of-Things (IoT) applications,” IEEE Internet Things J., vol. 8, no. 17, pp. 13 582–13 592, Sept. 2021.
  • [8] W. Cui, K. Shen, and W. Yu, “Spatial deep learning for wireless scheduling,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1248–1261, Jun. 2019.
  • [9] E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: On-demand accelerating deep neural network inference via edge computing,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 447–457, Jan. 2020.
  • [10] Q. Cheng, B. Chen, and P. K. Varshney, “Detection performance limits for distributed sensor networks in the presence of nonideal channels,” IEEE Trans. Wireless Commun., vol. 5, no. 11, pp. 3034–3038, Nov. 2006.
  • [11] D. Ciuonzo, P. S. Rossi, and P. K. Varshney, “Distributed detection in wireless sensor networks under multiplicative fading via generalized score tests,” IEEE Internet Things J., vol. 8, no. 11, pp. 9059–9071, Jun. 2021.
  • [12] X. Cheng, D. Ciuonzo, and P. S. Rossi, “Multibit decentralized detection through fusing smart and dumb sensors based on Rao test,” IEEE Trans. Aerosp. Electron. Syst., vol. 56, no. 2, pp. 1391–1405, Apr. 2020.
  • [13] J. Shao, Y. Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 197–211, Nov. 2022.
  • [14] D. Wen, P. Liu, G. Zhu, Y. Shi, J. Xu, Y. C. Eldar, and S. Cui, “Task-oriented sensing, computation, and communication integration for multi-device edge AI,” 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2207.00969
Haihui Xie received the B.S. and M.S. degrees in photonic and electronic engineering from Fujian Normal University, Fuzhou, China, in 2014 and 2016, respectively. He is currently pursuing the Ph.D. degree in information and communication engineering at Sun Yat-sen University, Guangzhou, China. His research interests include edge learning, optimization, and their applications in wireless communications.
Minghua Xia (Senior Member, IEEE) received the Ph.D. degree in telecommunications and information systems from Sun Yat-sen University, Guangzhou, China, in 2007. From 2007 to 2009, he was with the Electronics and Telecommunications Research Institute (ETRI) of South Korea, Beijing R&D Center, Beijing, China, where he worked as a member and then as a senior member of the engineering staff. From 2010 to 2014, he worked successively with The University of Hong Kong, Hong Kong, China; King Abdullah University of Science and Technology, Jeddah, Saudi Arabia; and the Institut National de la Recherche Scientifique (INRS), University of Quebec, Montreal, Canada, as a Postdoctoral Fellow. Since 2015, he has been a Professor at Sun Yat-sen University. Since 2019, he has also been an Adjunct Professor with the Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai). His research interests are in the general areas of wireless communications and signal processing.
Peiran Wu (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from The University of British Columbia (UBC), Vancouver, Canada, in 2015. From October 2015 to December 2016, he was a Postdoctoral Fellow at UBC. In the summer of 2014, he was a Visiting Scholar with the Institute for Digital Communications, Friedrich-Alexander-University Erlangen-Nuremberg (FAU), Erlangen, Germany. Since February 2017, he has been with Sun Yat-sen University, Guangzhou, China, where he is currently an Associate Professor. Since 2019, he has been an Adjunct Associate Professor with the Southern Marine Science and Engineering Guangdong Laboratory, Zhuhai, China. His research interests include mobile edge computing, wireless power transfer, and energy-efficient wireless communications. Dr. Wu was a recipient of the Four-Year Fellowship in 2010, the C. L. Wang Memorial Fellowship in 2011, the Graduate Support Initiative (GSI) Award from UBC in 2014, the German Academic Exchange Service (DAAD) Scholarship in 2014, and the Chinese Government Award for Outstanding Self-Financed Students Abroad in 2014.
Shuai Wang (Member, IEEE) received the Ph.D. degree in electrical and electronic engineering from The University of Hong Kong (HKU) in 2018. From 2018 to 2021, he was a Postdoctoral Fellow at HKU and then a Research Assistant Professor at the Southern University of Science and Technology (SUSTech). He is currently an Associate Professor with the Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences. His research interests include autonomous driving, machine learning, and communication networks. Dr. Wang has published more than 40 journal papers and 20 conference papers. He has received various awards from IEEE ICC, IEEE SPCC, IEEE ICCCS, IEEE TWC, IEEE WCL, and the National 5G Competition.
H. Vincent Poor (Life Fellow, IEEE) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990, he has been on the faculty at Princeton, where he is currently the Michael Henry Strater University Professor. From 2006 to 2016, he served as the dean of Princeton's School of Engineering and Applied Science. He has also held visiting appointments at several other universities, including most recently at Berkeley and Cambridge. His research interests are in the areas of information theory, machine learning and network science, and their applications in wireless networks, energy systems, and related fields. Among his publications in these areas is the recent book Machine Learning and Wireless Communications (Cambridge University Press, 2022). Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences and is a foreign member of the Chinese Academy of Sciences, the Royal Society, and other national and international academies. He received the IEEE Alexander Graham Bell Medal in 2017.