
Edge Learning for Large-Scale Internet of Things With Task-Oriented Efficient Communication

Haihui Xie, Minghua Xia, Peiran Wu,
Shuai Wang, and H. Vincent Poor
Manuscript received August 25, 2022; revised December 6, 2022 and April 11, 2023; accepted April 19, 2023. This work was supported in part by the National Natural Science Foundation of China under Grants 62171486 and U2001213, in part by the Guangdong Basic and Applied Basic Research Project under Grant 2021B1515120067, and in part by the U.S. National Science Foundation under Grant CNS-2128448. The associate editor coordinating the review of this paper and approving it for publication was A. S. Cacciapuoti. (Corresponding author: Minghua Xia.) Haihui Xie is with the School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China (e-mail: xiehh6@mail2.sysu.edu.cn). Minghua Xia and Peiran Wu are with the School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China, and also with the Southern Marine Science and Engineering Guangdong Laboratory, Zhuhai 519082, China (e-mail: {xiamingh, wupr3}@mail.sysu.edu.cn). Shuai Wang is with the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: s.wang@siat.ac.cn). H. Vincent Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu). Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org. Digital Object Identifier XXX
Abstract

In Internet of Things (IoT) networks, edge learning on data-driven tasks enables intelligent applications and services. As the network size grows, different users may generate distinct datasets. Thus, to accommodate multiple edge learning tasks in large-scale IoT networks, this paper achieves efficient communication under the task-oriented principle through the collaborative design of wireless resource allocation and edge learning error prediction. In particular, we start with multi-user scheduling to alleviate co-channel interference in dense networks. Then, we perform optimal power allocation in parallel for different learning tasks. Thanks to the high parallelization of the designed algorithm, extensive experimental results corroborate that the multi-user scheduling and task-oriented power allocation efficiently improve the performance of distinct edge learning tasks compared with state-of-the-art benchmark algorithms.

Index Terms:
Edge learning, multi-user scheduling, Internet of Things, parallel computing.
1536-1276 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I Introduction

Massive connectivity is a feature of many Internet of Things (IoT) deployments, and in such applications, the widely connected users generate an enormous amount of data. To extract information from these data, IoT users can train learning models to effectively represent different types of data [1]. Although IoT users have some capability to train simple learning models, their limited memory, computing power, and battery capacity deter the application of complicated models, such as deep neural networks [2]. To deal with this issue, edge learning techniques have emerged, transferring the burden of complex model updates to an edge server, i.e., leveraging the storage, communication, and computational capabilities of the edge server [3]. Moreover, the edge server also allows rapid access to the enormous amount of data distributed over end-user devices for fast model learning, providing intelligent services and applications for IoT users [4].

In edge learning, the main design objective is to acquire fast intelligence from the rich but highly distributed data of subscribed IoT users. This critically depends on data processing at edge servers, as well as efficient communication between edge servers and IoT users [5]. However, compared to the increasingly high processing speeds at edge servers, communication suffers from the hostility of wireless channels and consequently becomes the bottleneck for ultra-fast edge learning [6]. Moreover, the diversity of ubiquitous IoT users and complex transmission environments lead to additional interference. Such interference significantly deteriorates the reliability and communication latency of the IoT network while a vast amount of data is uploaded to an edge server [7]. Traditional data-oriented communication systems are designed to maximize network throughput based on Shannon's theory, which targets transmitting data reliably given the limited radio resources [8]. However, such approaches are often ineffective in edge learning, as they rely only on classical source and channel coding theory and fail to improve learning performance [9]. Therefore, a paradigm shift in wireless system design is required, from data-oriented to task-oriented communications.

I-A Related Works and Motivation

The initial attempts at task-oriented communications were to design task-aware transmission phases rather than end-to-end data reconstruction; see, e.g., [10, 11, 12] for task-aware reporting phases in the case of distributed inference tasks. Beyond task-aware transmission, several pioneering works [13, 14, 15, 16, 17] have also studied task-oriented schemes in edge learning systems. The work [13] designed a task-oriented communication scheme to realize a trade-off between preserving the relevant information and suiting bandwidth-limited edge inference. In [14, 15], task-oriented methods were proposed to maximize learning accuracy by jointly designing sensing, communication, and computation. The work [16] proposed a task-oriented transmission scheme to accelerate learning processes efficiently by capturing the semantic features of the correlated multimodal data of IoT users. Nevertheless, these task-oriented communication schemes are mostly heuristic-based, and additional optimization is necessary to improve their learning performance.

To overcome the drawback of heuristic methods, the work [17] proposed a learning-centric power allocation (LCPA) model to guide power allocation efficiently so as to optimize the limited network resources under the task-oriented principle. First, the learning performance was approximated by parameter fitting to capture the shape of learning models. Then, a majorization minimization (MM) algorithm was designed to allocate transmit power efficiently. However, the MM algorithm is inefficient for large-scale IoT networks due to its high computational complexity and lack of co-channel interference (CCI) management. In particular, for massive IoT users, it is necessary to collect multi-modal datasets and process heterogeneous learning tasks concurrently, making it imperative to design parallel and low-complexity task-oriented communication algorithms. Furthermore, concurrent transmissions of massive IoT devices inevitably yield severe CCI, thus degrading the performance of task-oriented communications [18]. But due to the highly coupled CCI, the associated power allocation problem is non-separable and non-convex, which makes interference management and algorithm parallelization non-trivial.

To fill the gap, this paper designs a task-oriented power allocation model for efficient communications in large-scale IoT networks with edge learning. On the one hand, as a task-oriented learning system involves heterogeneous learning tasks, it is necessary to predict the resources required to train different tasks. Therefore, our method is designed as an offline learning procedure that fits historical datasets to a performance prediction model and an online inference procedure that guides the IoT-edge communications with the pre-trained performance model. Note that this performance model can be fine-tuned by exploiting a small amount of real-time data from active IoT users. On the other hand, we formulate a task-oriented power allocation problem to guide communication-efficient data collection for large-scale IoT networks. To alleviate CCI, multi-user scheduling is first performed before power allocation. Then, a highly parallel algorithm is designed for different learning tasks. Lastly, we develop an accelerated algorithm to make the parallel algorithm more efficient. In brief, Table I compares the existing and proposed schemes.

TABLE I: Comparison of Existing and Proposed Schemes.

| Type          | Scheme   | Learning Efficiency^a | Algorithm Complexity | Parallelism & Acceleration | Optimal Objective | Multi-user Scheduling^b |
|---------------|----------|-----------------------|----------------------|----------------------------|-------------------|-------------------------|
| Task-aware    | [10]     | +                     | +++                  | +                          | N/A               | ✗                       |
|               | [11]     | +                     | +++                  | +                          | N/A               | ✗                       |
|               | [12]     | +                     | +++                  | +                          | N/A               | ✗                       |
| Task-oriented | [13]     | ++                    | ++                   | +                          | N/A               | ✗                       |
|               | [14, 15] | ++                    | ++                   | +                          | N/A               | ✗                       |
|               | [16]     | ++                    | ++                   | +                          | N/A               | ✗                       |
|               | [17]     | +++                   | +++                  | N/A                        | Min-max           | ✗                       |
|               | Proposed | +++                   | +                    | +++                        | Weighted sum      | ✓                       |

  a) The symbols “+, ++, +++” indicate low, moderate, and high capability, respectively.

  b) The tick “✓” indicates a functionality supported, whereas the cross “✗” indicates not supported.

I-B Summary of Main Results

Aiming at efficient communication for task-oriented edge learning, this paper starts with a multi-user scheduling strategy to mitigate CCI. In particular, a relaxation-and-rounding algorithm is exploited to identify scheduled users efficiently, and an approximate closed-form solution is obtained. Secondly, a parallel algorithm with Gauss-Seidel methods is developed. By a set of variable decompositions, we realize a highly parallel iteration. Thirdly, we design an accelerated algorithm to speed up this parallel algorithm. Finally, extensive experimental results demonstrate the efficiency of our design. In summary, the main contributions are as follows:

  1) A task-oriented power allocation model is proposed to process multiple distinct datasets at the edge. Moreover, a multi-user scheduling strategy is performed before power allocation to mitigate CCI in large-scale IoT networks efficiently.

  2) A highly parallel algorithm is designed for the task-oriented power allocation problem. Through variable decomposition and the elimination of auxiliary variables, power allocation in the presence of CCI is realized efficiently in parallel.

  3) An accelerated algorithm is developed to make the parallel algorithm more efficient for large-scale IoT networks. Specifically, this algorithm utilizes the Lipschitz continuity of the learning error and the identity mapping of the gain matrix to improve convergence.

  4) Extensive experimental results show that the multi-user scheduling strategy can mitigate CCI in large-scale IoT networks. Moreover, our parallel and accelerated algorithms solve task-oriented power allocation problems with significantly shorter computation time than existing algorithms.

I-C Organization

The rest of this paper is organized as follows. Section II describes the system model and formulates a task-oriented power allocation problem. Section III performs multi-user scheduling to mitigate CCI and designs a parallel algorithm for solving the task-oriented power allocation problem. Section IV develops an accelerated algorithm for large-scale IoT networks. Section V discusses the experimental results, and finally, Section VI concludes the paper.

Notation: Scalars, column vectors, and matrices are denoted by regular italic letters, bold lower-case letters, and bold upper-case letters, respectively. The symbol \bm{\mathsf{1}} indicates a column vector with all entries equal to unity. The superscripts (\cdot)^{T} and (\cdot)^{H} denote the transpose and Hermitian transpose of a vector or matrix, respectively, and \|\bm{x}\|_{2} denotes the two-norm of \bm{x}. The abbreviation \mathcal{CN}(\bm{0},\,\varrho\bm{I}) stands for a multi-variate complex Gaussian distribution with mean vector \bm{0} and covariance matrix \varrho\bm{I}. The notation |\mathcal{X}| denotes the cardinality of the set \mathcal{X}, and \mathcal{Y}\setminus\mathcal{X} denotes the complement of the set \mathcal{Y} with respect to \mathcal{X}. The matrix operation \bm{A}(\mathcal{K}_{i},\,\mathcal{K}_{j}) denotes a sub-matrix of size |\mathcal{K}_{i}|\times|\mathcal{K}_{j}| that collects the rows and columns of \bm{A} specified by the index sets \mathcal{K}_{i} and \mathcal{K}_{j}, respectively. The operations \bm{x}\succeq\bm{y}, \bm{x}\circ\bm{y}, and \langle\bm{x},\,\bm{y}\rangle denote the element-wise inequality (each element of \bm{x} is greater than or equal to its counterpart in \bm{y}), the Hadamard product, and the inner product of two vectors, respectively. The Landau notation \mathcal{O}(\cdot) denotes the order of arithmetic operations. Further, we define the rounding function \lfloor w\rceil=1 if w\geq 0.5 and \lfloor w\rceil=0 otherwise, with \lfloor\bm{w}\rceil\triangleq[\lfloor w_{1}\rceil,\,\lfloor w_{2}\rceil,\,\cdots,\,\lfloor w_{K}\rceil]^{T}\in\mathbb{R}^{K\times 1} computed element-wise on \bm{w}. Finally, the floor function \lfloor x\rfloor\triangleq\max\{n\in\mathbb{Z}:n\leq x\}, where \mathbb{Z} is the set of integers.

II System Model and Problem Formulation

In this section, we first describe a task-oriented edge learning system. Then, we formulate the task-oriented power allocation problem with multi-user scheduling.

II-A System Model

Figure 1 illustrates a task-oriented edge learning system consisting of an edge server equipped with N antennas and I different learning tasks \{\mathcal{T}_{1},\,\mathcal{T}_{2},\,\cdots,\,\mathcal{T}_{I}\} with corresponding user sets \mathcal{K}\triangleq\{\mathcal{K}_{1},\,\mathcal{K}_{2},\,\cdots,\,\mathcal{K}_{I}\} and power allocation parameters \bm{p}\triangleq[\bm{p}_{1}^{T},\,\bm{p}_{2}^{T},\,\cdots,\,\bm{p}_{I}^{T}]^{T}\in\mathbb{R}^{K}, where \bm{p}_{i}\triangleq[p_{i_{1}},\,p_{i_{2}},\,\cdots,\,p_{i_{|\mathcal{K}_{i}|}}]^{T}, i\in\mathcal{I}\triangleq\{1,\,2,\,\cdots,\,I\}, i_{j}\in\mathcal{K}_{i}, j\in\{1,\,2,\,\cdots,\,|\mathcal{K}_{i}|\}, K\triangleq|\mathcal{K}|, and p_{k}, k\in\mathcal{K}, denotes the transmit power of the k^{\rm th} user. Each of the I learning tasks in Fig. 1 involves a set of transmitted data, a multi-user scheduling algorithm, a learning model, a process of parameter fitting for the learning model, and a task-oriented power allocation problem.

To improve the performance of edge learning, multi-user scheduling is adopted to alleviate CCI in large-scale IoT networks, and task-oriented power allocation is performed to implement efficient communications. Recalling the seminal Shannon formula, to maximize the network utility of long-term average data rates, the achievable data rate of user k in the presence of multi-user scheduling can be expressed as [8]

R_{k}=\log_{2}\left(1+\dfrac{G_{k,k}p_{k}}{\sum_{\ell\in\Pi_{\mathcal{S}}(\mathcal{K})\setminus k}G_{k,\ell}p_{\ell}+\sigma^{2}}\right),\quad k\in\Pi_{\mathcal{S}}(\mathcal{K}_{i}), \qquad (1)

where \Pi_{\mathcal{S}}(\cdot) denotes the projection function of multi-user scheduling; \sigma^{2} is the variance of the additive white Gaussian noise; G_{k,\ell} represents the composite channel gain from the \ell^{\rm th} user to the edge server when decoding the data of the k^{\rm th} user, computed as G_{k,k}=\rho_{k}\|\bm{h}_{k}\|_{2}^{2} if \ell=k, and G_{k,\ell}=\rho_{\ell}|\bm{h}_{k}^{H}\bm{h}_{\ell}|^{2}/\|\bm{h}_{k}\|_{2}^{2} if \ell\neq k, with \bm{h}_{k}\in\mathbb{C}^{N\times 1} being the complex-valued fast-fading channel vector from the k^{\rm th} user to the edge server and \rho_{k} being the path loss of the k^{\rm th} user. By (1), the number of samples collected at the edge server for the learning task \mathcal{T}_{i} can be computed as

D_{i}=\sum_{k\in\Pi_{\mathcal{S}}(\mathcal{K}_{i})}\left\lfloor\dfrac{BTR_{k}}{V_{i}}\right\rfloor+A_{i}\approx\sum_{k\in\Pi_{\mathcal{S}}(\mathcal{K}_{i})}\dfrac{BTR_{k}}{V_{i}}+A_{i}, \qquad (2)

where B is the total bandwidth in Hz; T is the transmission time in seconds; V_{i} is the number of bits per data sample, and A_{i} is the initial number of historical data samples for the i^{\rm th} pre-trained task.

Figure 1: The system model of task-oriented edge learning.

This paper considers the average channel over a long transmission period instead of assuming a static channel. The reason is twofold. On the one hand, fine-tuning diverse learning models requires a relatively long transmission time, tens or hundreds of seconds, to obtain a large number of data samples. On the other hand, the effect of multi-user scheduling can only be revealed in the context of a long-term channel average rather than an instantaneous channel realization. Assume that the transmission period consists of multiple time slots. The channels are quasi-static within each time slot and vary across consecutive time slots. Therefore, G_{k,k} and G_{k,\ell} in (1) denote the average channel gains over these slots.
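To make the mapping from power allocation to collected samples concrete, the following sketch evaluates the rate expression (1) and the sample count (2) for a toy network. The function and variable names (`achievable_rates`, `num_samples`, the boolean `scheduled` mask standing in for the projection \Pi_{\mathcal{S}}) are illustrative, not from the paper.

```python
import numpy as np

def achievable_rates(G, p, scheduled, sigma2=1e-9):
    """Per-user rates under co-channel interference, as in Eq. (1).

    G[k, l] is the composite channel gain seen when decoding user k
    from interferer l; p is the transmit-power vector; `scheduled`
    is a boolean mask of active users (the projection Pi_S).
    """
    K = len(p)
    rates = np.zeros(K)
    for k in range(K):
        if not scheduled[k]:
            continue  # inactive users transmit nothing
        interf = sum(G[k, l] * p[l] for l in range(K)
                     if l != k and scheduled[l])
        rates[k] = np.log2(1.0 + G[k, k] * p[k] / (interf + sigma2))
    return rates

def num_samples(rates, users_i, B, T, V_i, A_i):
    """Samples collected for task i, Eq. (2), with the floor dropped."""
    return sum(B * T * rates[k] / V_i for k in users_i) + A_i
```

With the floor dropped as in the approximation of (2), the sample count is a smooth function of the powers, which is what the later optimization exploits.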

II-B Problem Formulation

To establish a connection between wireless resource allocation and the performance of machine learning, the work [17] conceived a non-linear function \Theta_{i}(D_{i}|a_{i},\,b_{i})\triangleq a_{i}D_{i}^{-b_{i}} to capture the shape of the learning error function, where a_{i} and b_{i} are tuning parameters that denote the model complexity and account for the non-independent and identically distributed (non-i.i.d.) parallel datasets, respectively. In practice, the values of a_{i} and b_{i} are obtained by fitting the learning error function to the historical dataset. This fitted function matches the experimental data of the machine learning model very well. In line with this idea and multi-user scheduling, we formulate a task-oriented power allocation problem:

\mathcal{P}1:\ \min_{\bm{p},\,\Pi_{\mathcal{S}}}\ \sum_{i\in\mathcal{I}}\lambda_{i}\,a_{i}D_{i}^{-b_{i}} \qquad (3a)
{\rm s.t.}\ \sum_{k\in\mathcal{K}}p_{k}=P,\ p_{k}\geq 0,\ \forall k\in\mathcal{K}, \qquad (3b)
\Pi_{\mathcal{S}}(\mathcal{K})\subseteq\mathcal{K}, \qquad (3c)
p_{k}=0,\ \forall k\in\mathcal{K}\setminus\Pi_{\mathcal{S}}(\mathcal{K}), \qquad (3d)

where \lambda_{i}\triangleq A_{i}V_{i}/(\sum_{j\in\mathcal{I}}A_{j}V_{j}) is a weight over the diverse datasets. In general, the power allocation of all scheduled users shall satisfy \sum_{k\in\mathcal{K}}p_{k}\leq P, i.e., not exceeding the total power budget P. Since a larger value of \sum_{k\in\mathcal{K}}p_{k} always improves the learning performance, this inequality is active at the optimum, yielding the equality in (3b) [17]. Constraint (3c) schedules a subset of users, and (3d) implies that no power is allocated to inactive users.

The conventional min-max objective used in [17] focuses only on the worst learning task, even if that task is not critical for the real-world application; thus, it is unsuitable for the multi-task multi-modal scenario considered in this paper. Instead, the weighted-sum objective in (3a) optimizes multiple tasks simultaneously. In particular, it can adapt to different learning tasks by adjusting the weight factors \lambda_{i}, i\in\mathcal{I}.

Remark 1 (On the learning loss model).

In theory, the training procedure of any smooth learning network can be modeled as a Gibbs distribution of networks characterized by a temperature parameter T_{g}. The asymptotic generalization loss \epsilon_{i} as the number of samples D_{i} for the i^{\rm th} learning task goes to infinity can be expressed as [19, Eq. 3.12]

\epsilon_{i}=\epsilon_{i,\mathrm{min}}+\left(\frac{T_{g}}{2}+\frac{\mathrm{Tr}(\bm{U}_{i}\bm{V}_{i}^{-1})}{2W_{i}}\right)W_{i}D_{i}^{-1},\quad\text{as }D_{i}\rightarrow+\infty, \qquad (4)

where \epsilon_{i,\mathrm{min}}\geq 0 is the minimum error of the considered learning system, W_{i} is the number of parameters, and D_{i} is the number of samples. The matrices \bm{U}_{i} and \bm{V}_{i} denote the second-order and first-order derivatives of the generalization loss with respect to the parameters of model i. By setting a_{i}=\left(\frac{T_{g}}{2}+\frac{\mathrm{Tr}(\bm{U}_{i}\bm{V}_{i}^{-1})}{2W_{i}}\right)W_{i}, b_{i}=1, and \epsilon_{i,\mathrm{min}}=0, (4) reduces to the proposed learning loss model \Theta_{i}(D_{i}|a_{i},\,b_{i})\triangleq a_{i}D_{i}^{-b_{i}}, implying that the proposed model holds in the asymptotic sense. In practice, \epsilon_{i,\mathrm{min}} in (4) cannot always approach zero as the number of samples grows without bound, even for some simple learning models. For mathematical tractability, we set \epsilon_{i,\mathrm{min}}=0 in this paper by assuming that the learning model is powerful enough that, given an infinite amount of data, the learning loss vanishes.
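The tuning parameters a_i and b_i of the loss model \Theta_{i}(D_{i}|a_{i},\,b_{i})=a_{i}D_{i}^{-b_{i}} are obtained by fitting historical (samples, error) pairs. The following is a minimal sketch of such a fit using ordinary least squares in the log domain; the exact fitting procedure of [17] may differ, and `fit_error_model` is an illustrative name.

```python
import numpy as np

def fit_error_model(D, err):
    """Fit Theta(D) = a * D**(-b) to historical (samples, error) pairs.

    Taking logs gives log(err) = log(a) - b*log(D), a linear model in
    log(D) solved here by ordinary least-squares regression.
    """
    slope, intercept = np.polyfit(np.log(D), np.log(err), 1)
    return np.exp(intercept), -slope  # recovered (a, b)
```

For noiseless data generated from the model itself, the fit recovers (a_i, b_i) exactly up to floating-point precision; with real training curves it yields the least-squares estimate in the log domain.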

III Algorithm Development in Parallel

In this section, we first describe a multi-user scheduling algorithm given a power allocation. Then, we design a parallel algorithm to solve the power allocation problem.

III-A Multi-user Scheduling Algorithm

In the task-oriented learning system, multi-user scheduling is an effective strategy for handling massive connectivity. Traditional approaches to multi-user scheduling mostly rely on non-convex combinatorial methods, e.g., greedy heuristic search. In particular, as the number of scheduling cases grows exponentially with the number of users, it is intractable to enumerate all possible subsets of users explicitly [20]. To deal with this problem, we introduce binary variables w_{k}, k\in\mathcal{K}, to replace \Pi_{\mathcal{S}} defined immediately after (1); that is, w_{k}=1 if k\in\Pi_{\mathcal{S}}(\mathcal{K}), and w_{k}=0 otherwise. As a result, given p_{k}, inserting (1)-(2) into \mathcal{P}1, the multi-user scheduling problem in the i^{\rm th} group, i\in\mathcal{I}, is formulated as

\mathcal{P}2:\ \min_{\bm{w}}\ a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}w_{k}R_{k}+A_{i}\right)^{-b_{i}} \qquad (5a)
{\rm s.t.}\ w_{k}\in\{0,\,1\}, \qquad (5b)
\sum_{k\in\mathcal{K}_{i}}w_{k}\leq N_{i}, \qquad (5c)

where R_{k}=\log_{2}(1+G_{k,k}p_{k}/(\sum_{\ell\in\mathcal{K}\setminus k}w_{\ell}G_{k,\ell}p_{\ell}+\sigma^{2})) as per (1), \bm{w}\triangleq[w_{1},\,w_{2},\,\cdots,\,w_{K}]^{T}, and (5c) is derived from (3c), with N_{i} being the maximal allowed number of active users for the i^{\rm th} learning task.

To solve \mathcal{P}2, we adopt a relaxation-and-rounding algorithm [21]. First, we relax the binary constraint (5b) to the real-valued constraint 0<w_{k}\leq 1. Then, Proposition 1 below yields an approximate closed-form solution to the relaxed version of \mathcal{P}2.

Proposition 1 (Multi-user scheduling algorithm).

Given p_{k}, k\in\mathcal{K}_{i}, the multi-user scheduling variable w_{k} is analytically determined by

\tilde{w}_{k}=\min\left(\max\left(\dfrac{\tilde{G}_{k,k}p_{k}}{\delta_{k}\left(\exp\left(\frac{G_{k,k}p_{k}}{\delta_{k}+G_{k,k}p_{k}}+\nu_{i}\right)-1\right)},\,\epsilon\right),\,1\right), \qquad (6)

where \delta_{k}\triangleq\sum_{\ell\in\mathcal{K}\setminus k}\tilde{G}_{k,\ell}p_{\ell}+\sigma^{2} with \tilde{G}_{k,\ell}\triangleq w_{k}G_{k,\ell}; \nu_{i}>0 is a tuning parameter controlling the sparsity of the solution, and \epsilon>0 is a small positive number close to zero. When the multi-user scheduling algorithm converges, \tilde{\bm{w}}\triangleq[\tilde{\bm{w}}_{1}^{T},\,\tilde{\bm{w}}_{2}^{T},\,\cdots,\,\tilde{\bm{w}}_{I}^{T}]^{T} with \tilde{\bm{w}}_{i}\triangleq[\tilde{w}_{i_{1}},\,\tilde{w}_{i_{2}},\,\cdots,\,\tilde{w}_{i_{|\mathcal{K}_{i}|}}]^{T} is the optimal solution, in which each element is rounded to the nearest integer 1 or 0, i.e., \lfloor\tilde{\bm{w}}\rceil.

Proof.

See Appendix A. ∎

Proposition 1 shows that the multi-user scheduling decision is determined analytically, with extremely low computational complexity proportional to the number of users, i.e., \mathcal{O}(K). It is also noted that our multi-user scheduling strategy is fair with respect to different learning tasks, whereas fairness among users is not accounted for, as it is beyond the scope of this paper.
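Proposition 1 can be read as a fixed-point iteration: each w_k is repeatedly updated by the closed form (6) and, upon convergence, rounded to {0, 1}. The sketch below is illustrative; it assumes the interference term \delta_k weights each interferer by its current relaxed scheduling variable, and the names (`schedule_users`, `nu`, `iters`) are not from the paper.

```python
import numpy as np

def schedule_users(G, p, nu, sigma2=1e-9, eps=1e-3, iters=100):
    """Fixed-point sketch of the closed-form update in Proposition 1.

    nu is the sparsity-controlling parameter nu_i of Eq. (6); larger
    values push more users out of the schedule. The relaxed variables
    w in (eps, 1] are finally rounded to {0, 1}.
    """
    K = len(p)
    w = np.ones(K)  # start with all users tentatively scheduled
    for _ in range(iters):
        for k in range(K):  # Gauss-Seidel-style in-place sweep
            delta_k = sum(w[l] * G[k, l] * p[l]
                          for l in range(K) if l != k) + sigma2
            num = w[k] * G[k, k] * p[k]
            den = delta_k * (np.exp(G[k, k] * p[k]
                                    / (delta_k + G[k, k] * p[k]) + nu) - 1.0)
            w[k] = min(max(num / den, eps), 1.0)
    return (w >= 0.5).astype(int)  # the rounding operator
```

Qualitatively, a small nu keeps users with strong direct gains active, while a large nu drives all relaxed variables toward eps and hence toward an empty schedule.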

III-B Parallel Power Allocation

After the multi-user scheduling is performed by Proposition 1, the task-oriented power allocation problem can be rewritten as

\mathcal{P}3:\ \min_{\bm{p}}\ \sum_{i\in\mathcal{I}}\lambda_{i}\,a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}\tilde{w}_{k}\tilde{R}_{k}+A_{i}\right)^{-b_{i}} \qquad (7a)
{\rm s.t.}\ \sum_{k\in\mathcal{K}}(1-\tilde{w}_{k})p_{k}\leq\epsilon, \qquad (7b)
(3b),

where \tilde{R}_{k}\triangleq\log_{2}(1+G_{k,k}p_{k}/(\sum_{\ell\in\mathcal{K}\setminus k}\tilde{w}_{\ell}G_{k,\ell}p_{\ell}+\sigma^{2})), and (7b) is the relaxation of (3d), which means that only negligible power is reserved for inactive users.

The optimization problem \mathcal{P}3 is non-convex; even worse, its computational complexity rises with the numbers of users and tasks. To address these issues, we propose a parallel first-order algorithm. As the power and CCI terms are coupled amongst different learning tasks, it is hard to parallelize the computation across tasks in a straightforward manner. An efficient strategy is to introduce auxiliary variables that separate these terms. The resulting problem decomposes into a set of sub-problems through variable decomposition, and these sub-problems are easier to solve in parallel [22, 23].

Now, we begin to extract the relevant sub-problems. First, to divide the interference term, we introduce additional variables 𝜹\bm{\delta}, defined as

\bm{\delta}\triangleq\bm{\Delta}\bm{p}+\sigma^{2}\bm{\mathsf{1}}, \qquad (8)

where \bm{\delta}=[\bm{\delta}_{1}^{T},\,\bm{\delta}_{2}^{T},\,\cdots,\,\bm{\delta}_{I}^{T}]^{T} with \bm{\delta}_{i}\triangleq[\delta_{i_{1}},\,\delta_{i_{2}},\,\cdots,\,\delta_{i_{|\mathcal{K}_{i}|}}]^{T}, and \bm{\Delta}\triangleq\tilde{\bm{G}}-\tilde{\bm{D}} with \tilde{\bm{G}}\triangleq[\bm{G}(1,\,:)^{T}\circ\tilde{\bm{w}}\ \cdots\ \bm{G}(K,\,:)^{T}\circ\tilde{\bm{w}}]^{T} and \tilde{\bm{D}}\triangleq[\bm{D}(:,\,1)\circ\tilde{\bm{w}}\ \cdots\ \bm{D}(:,\,K)\circ\tilde{\bm{w}}]. Here, the (k,\,\ell)^{\rm th} element of \bm{G} is G_{k,\ell}, and the k^{\rm th} diagonal element of the diagonal matrix \bm{D} is G_{k,k}. By partitioning the users \mathcal{K} into I groups \{\mathcal{K}_{i}\}_{i=1}^{I} and introducing the sets of variables \{\bm{z}_{i}\}_{i=1}^{I} and \{P_{i}\}_{i=1}^{I}, we have

\bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i}=\bm{z}_{i}, \qquad (9a)
\sum_{i\in\mathcal{I}}\bm{z}_{i}=\bm{\delta}-\sigma^{2}\bm{\mathsf{1}}, \qquad (9b)
\bm{\mathsf{1}}^{T}\bm{p}_{i}=P_{i}, \qquad (9c)
\sum_{i\in\mathcal{I}}P_{i}=P, \qquad (9d)

where \bm{z}_{i}\triangleq[z_{1,i},\,z_{2,i},\,\cdots,\,z_{K,i}]^{T} and \bm{p}_{i}\triangleq[p_{i_{1}},\,p_{i_{2}},\,\cdots,\,p_{i_{|\mathcal{K}_{i}|}}]^{T}. It is noteworthy that \bm{z}_{i} in (9a)-(9b) and P_{i} in (9c)-(9d) are auxiliary variables. As a result, \mathcal{P}3 can be transformed into:

\mathcal{P}4:\ \min_{\{\bm{p}_{i},\,\bm{\delta}_{i},\,\bm{z}_{i},\,P_{i}\}_{i\in\mathcal{I}}}\ \sum_{i\in\mathcal{I}}\lambda_{i}\,\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) \qquad (10a)
{\rm s.t.}\ \bm{\delta}\succeq\sigma^{2}\bm{\mathsf{1}},\ \bm{p}_{i}\succeq\bm{0}, \qquad (10b)
(7b), (9a), (9b), (9c), (9d),

where (10b) is naturally satisfied since \sum_{\ell\in\mathcal{K}\setminus k}\tilde{G}_{k,\ell}p_{\ell}\geq 0 and p_{\ell}\geq 0. Specifically, \Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) in (10a) is explicitly given by

\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})\triangleq a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}\tilde{w}_{k}\log_{2}\left(1+\dfrac{G_{k,k}p_{k}}{\delta_{k}}\right)+A_{i}\right)^{-b_{i}}.

It is clear that \mathcal{P}4 separates the interference term and the power constraint through auxiliary variables, which facilitates the parallelization of the algorithm design. However, since \mathcal{P}4 involves multiple auxiliary variables and constraints, the corresponding updates linearize the augmented terms and slow down the convergence. Even worse, convergence may not be guaranteed when there are more than two blocks of variables.
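The decomposition (8)-(9) can be sanity-checked numerically: summing the per-group products \bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i} must recover \bm{\Delta}\bm{p}, i.e., \bm{\delta}-\sigma^{2}\bm{\mathsf{1}}. The toy check below assumes all users are scheduled, so the weighting by \tilde{\bm{w}} is trivial; the construction of the off-diagonal matrix \bm{\Delta} is a sketch of the definitions after (8).

```python
import numpy as np

# Toy check of Eqs. (8)-(9): Delta = G_tilde - D_tilde collects the
# interference gains (diagonal removed); splitting users into groups
# K_i and summing Delta[:, K_i] @ p_i recovers delta - sigma^2 * 1.
K, sigma2 = 4, 0.1
rng = np.random.default_rng(0)
G = rng.uniform(0.1, 1.0, (K, K))      # composite channel gains G_{k,l}
w = np.ones(K)                         # all users scheduled (w_tilde = 1)
p = rng.uniform(0.0, 1.0, K)           # transmit powers
G_tilde = G * w                        # Hadamard weighting by w (trivial here)
Delta = G_tilde - np.diag(np.diag(G_tilde))  # zero the desired-signal diagonal
delta = Delta @ p + sigma2             # Eq. (8): delta = Delta p + sigma^2 1
groups = [[0, 1], [2, 3]]              # user groups K_1, K_2
z = [Delta[:, g] @ p[g] for g in groups]     # Eq. (9a): z_i
assert np.allclose(sum(z), delta - sigma2)   # Eq. (9b) holds by construction
```

The point of the decomposition is exactly this identity: once z_i carries each group's contribution to the interference, the per-task terms can be handled independently.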

To address this issue, we propose a method to eliminate auxiliary variables [24, pp. 249-251]. First, the augmented Lagrangian function (ALF) of 𝒫4\mathcal{P}4 can be written as

L(\{\bm{p}_{i}\}_{i=1}^{I},\,\{P_{i}\}_{i=1}^{I},\,\{\bm{z}_{i}\}_{i=1}^{I},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}_{i}\}_{i=1}^{I},\,\{\beta_{i}\}_{i=1}^{I})
=\sum_{i\in\mathcal{I}}\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\sum_{i\in\mathcal{I}}\beta_{i}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)+\dfrac{\mu}{2}\sum_{i\in\mathcal{I}}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}
\quad{}+\sum_{i=1}^{I}\left\langle\bm{\alpha}_{i},\,\bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i}-\bm{z}_{i}\right\rangle+\dfrac{\mu}{2}\sum_{i=1}^{I}\left\|\bm{\Delta}(:,\,\mathcal{K}_{i})\bm{p}_{i}-\bm{z}_{i}\right\|_{2}^{2}, \qquad (11)

where \mu is drawn from an increasing positive sequence \{\mu(t)\} over the iterations. From (11), it is observed that (9b) and (9d) are not directly included in the ALF since different tasks are correlated through them. As will be shown shortly, this form of the ALF allows the sub-problems to be solved in parallel. It is noteworthy that this algorithm differs from conventional alternating direction method of multipliers (ADMM) algorithms, and its convergence is guaranteed [24, p. 255]. Given (11), we have the following proposition.

Proposition 2.

For the ALF given by (11), the variables PiP_{i} and 𝐳i\bm{z}_{i} are updated via the following iteration:

Pi(t)\displaystyle P_{i}(t) =𝟭T𝒑i(t)1I(𝟭T𝒑(t)P),\displaystyle=\bm{\mathsf{1}}^{T}\bm{p}_{i}(t)-\dfrac{1}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t)-P\right), (12a)
𝒛i(𝜹(t))\displaystyle\bm{z}_{i}(\bm{\delta}(t)) =𝚫(:,𝒦i)𝒑i(t)1I(𝚫𝒑(t)𝜹(t)+σ2𝟭).\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}(t)-\dfrac{1}{I}\left(\bm{\Delta}\bm{p}(t)-\bm{\delta}(t)+\sigma^{2}\bm{\mathsf{1}}\right). (12b)

The relative dual variables are updated by

β(t+1)=β(t)+μ(t)I(𝟭T𝒑(t+1)P),\displaystyle\beta(t+1)=\beta(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right), (13a)
𝜶(t+1)=𝜶(t)+μ(t)I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭),\displaystyle\bm{\alpha}(t+1)=\bm{\alpha}(t)+\dfrac{\mu(t)}{I}\left(\bm{\Delta}\bm{p}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right), (13b)

and βi(t+1)=β(t+1)\beta_{i}(t+1)=\beta(t+1) and 𝛂i(t+1)=𝛂(t+1)\bm{\alpha}_{i}(t+1)=\bm{\alpha}(t+1), for all i=1,,Ii=1,\cdots,I.

Proof.

See Appendix B. ∎

By Proposition 2, it is evident that the auxiliary variables have been eliminated and the dimension of the dual variables has been reduced. Next, we split the ALF given by (11) with respect to 𝒑\bm{p} and 𝜹\bm{\delta}.
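A quick sanity check on (12a): summing the per-task budgets gives Σi Pi(t) = 𝟭ᵀ𝒑(t) − (𝟭ᵀ𝒑(t) − P) = P, so the implied allocation always respects the total power budget. A minimal numeric sketch (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
I, K_per_task = 4, 6                 # I tasks with |K_i| users each (assumed)
p_blocks = [rng.uniform(0.0, 2e-3, K_per_task) for _ in range(I)]
P_total = 0.02                       # total power budget P = 20 mW (assumed)

p_all_sum = sum(b.sum() for b in p_blocks)
# Eq. (12a): P_i(t) = 1^T p_i(t) - (1/I) * (1^T p(t) - P)
P_i = np.array([b.sum() - (p_all_sum - P_total) / I for b in p_blocks])
```

By construction, the residual 𝟭ᵀ𝒑 − P is spread evenly over the I tasks, so the per-task budgets remain consistent however the primal blocks evolve.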

III-B1 Parallelizable splitting with respect to 𝒑\bm{p}

By Proposition 2, we divide the ALF given by (11) into a set of sub-functions Li(𝒑i,𝜹;𝜶,β)L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right), where the ithi^{\rm th} sub-function denotes the ALF of the ithi^{\rm th} task. To enable a parallel algorithm across the tasks, Li(𝒑i,𝜹;𝜶,β)L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right) is given by

Li(𝒑i,𝜹;𝜶,β)\displaystyle L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right)
=λiΦi(𝒑i|𝜹i)+β(𝟭T𝒑iPi)+μ2(𝟭T𝒑iPi)2\displaystyle=\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\beta\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)+\,\dfrac{\mu}{2}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}
+𝜶,𝚫(:,𝒦i)𝒑i𝒛i(𝜹)+μ2𝚫(:,𝒦i)𝒑i𝒛i(𝜹)22.\displaystyle\quad{}+\,\left\langle\bm{\alpha},\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\bm{z}_{i}(\bm{\delta})\right\rangle+\dfrac{\mu}{2}\left\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\bm{z}_{i}(\bm{\delta})\right\|^{2}_{2}. (14)

III-B2 Parallelizable splitting with respect to 𝜹\bm{\delta}

By Proposition 2, interference terms remain in (12b) and (13b); thus, it is still hard to update 𝜹\bm{\delta} in parallel. Therefore, we adopt the Gauss-Seidel method to obtain a highly parallelizable iteration for 𝜹i\bm{\delta}_{i} [24, p. 199], as formalized in the following proposition.

Proposition 3.

By (11) and Proposition 2, we obtain the following function about 𝛅\bm{\delta}:

L(𝒑,{𝜹i}i=1I;{𝜶i}i=1I)\displaystyle L(\bm{p},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}^{\prime}_{i}\}_{i=1}^{I})
=iλiΦi(𝒑i|𝜹i)+𝜶,𝚫𝒑+σ2𝟭𝜹\displaystyle=\sum_{i\in\mathcal{I}}\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\left\langle\bm{\alpha},\,\bm{\Delta}\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}\right\rangle
+μ2I𝚫𝒑+σ2𝟭𝜹22,\displaystyle\quad{}+\,\dfrac{\mu}{2I}\left\|\bm{\Delta}\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}\right\|_{2}^{2}, (15)

where 𝛂[𝛂1T,𝛂2T,,𝛂IT]T\bm{\alpha}\triangleq\left[{\bm{\alpha}^{\prime}_{1}}^{T},\,{\bm{\alpha}^{\prime}_{2}}^{T},\,\cdots,\,{\bm{\alpha}^{\prime}_{I}}^{T}\right]^{T}.

Proof.

From (11), we obtain the following ALF about 𝜹\bm{\delta}:

L(𝒑,{𝜹i}i=1I;{𝜶i}i=1I)\displaystyle L(\bm{p},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}_{i}\}_{i=1}^{I})
=i(λiΦi(𝒑i|𝜹i)+𝜶i,𝚫(:,𝒦i)𝒑i𝒛i(𝜹)\displaystyle=\sum_{i\in\mathcal{I}}\left(\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\left\langle\bm{\alpha}_{i},\,\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}-\bm{z}_{i}(\bm{\delta})\right\rangle\right.
+μ2𝚫(:,𝒦i)𝒑i𝒛i(𝜹)22).\displaystyle\quad{}\left.{}+\,\dfrac{\mu}{2}\left\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}-\bm{z}_{i}(\bm{\delta})\right\|_{2}^{2}\right). (16)

From Proposition 2, we also have 𝜶i=𝜶\bm{\alpha}_{i}=\bm{\alpha} and 𝒛i(𝜹)=𝚫(:,𝒦i)𝒑i1I(𝚫𝒑𝜹+σ2𝟭).\bm{z}_{i}(\bm{\delta})=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\dfrac{1}{I}\left(\bm{\Delta}\bm{p}-\bm{\delta}+\sigma^{2}\bm{\mathsf{1}}\right). Inserting them into (16) and performing algebraic manipulations, we obtain (15). ∎

With Proposition 3, to realize a parallel algorithm while updating 𝜹\bm{\delta}, we divide L(𝒑,{𝜹i}i=1I;{𝜶i}i=1I)L(\bm{p},\,\{\bm{\delta}_{i}\}_{i=1}^{I};\,\{\bm{\alpha}^{\prime}_{i}\}_{i=1}^{I}) into a set of sub-functions Li(𝒑,𝜹i;𝜶i)L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i}) as follows:

Li(𝒑,𝜹i;𝜶i)\displaystyle L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i})
=λiΦi(𝒑i|𝜹𝒊)+𝜶i,𝚫(𝒦i,:)𝒑+σ2𝟭𝜹i\displaystyle=\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta_{i}})+\left\langle\bm{\alpha}^{\prime}_{i},\,\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}_{i}\right\rangle
+μ2I𝚫(𝒦i,:)𝒑+σ2𝟭𝜹i22.\displaystyle\quad{}+\,\dfrac{\mu}{2I}\left\|\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}_{i}\right\|_{2}^{2}. (17)

By (17), it is evident that 𝜹\bm{\delta} is divided into II blocks corresponding to II different learning tasks, which implies that we can efficiently update 𝜹i\bm{\delta}_{i} in parallel.
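To see why this splitting is parallelizable, note that the penalty gradient of (15) with respect to 𝜹 decomposes exactly into the per-block gradients of the sub-functions (17). A small numeric check (the cross-gain matrix and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
I, K = 3, 9
blocks = np.array_split(np.arange(K), I)     # user index sets K_1, ..., K_I
Delta = np.abs(rng.normal(size=(K, K)))
np.fill_diagonal(Delta, 0.0)                 # cross-gain matrix (assumed)
p = rng.uniform(0.0, 1e-3, K)
delta = rng.uniform(1e-10, 1e-9, K)
sigma2, mu = 1e-10, 5.0

# Gradient of the quadratic penalty in (15) w.r.t. the full vector delta:
full_grad = -(mu / I) * (Delta @ p + sigma2 - delta)

# The same gradient assembled from the per-task sub-functions (17),
# each of which touches only its own block delta_i:
block_grad = np.concatenate(
    [-(mu / I) * (Delta[idx, :] @ p + sigma2 - delta[idx]) for idx in blocks]
)
```

Because each block gradient depends on 𝜹 only through its own block 𝜹i, the I block updates can run on separate workers without coordination within an iteration.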

III-C Algorithm Development

We have derived the ALF of 𝒫4\mathcal{P}4 and obtained a set of sub-functions that enable a parallel algorithm across the learning tasks. Now, we compute the partial derivatives with respect to the relevant variables and then apply gradient descent in parallel.

III-C1 Update 𝒑i\bm{p}_{i} with other variables fixed

It is observed that Li(𝒑i,𝜹;𝜶,β)L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right) given by (14) is differentiable with respect to 𝒑i\bm{p}_{i}, and the gradient is computed as

𝒑iLi(𝒑i,𝜹;𝜶,β)\displaystyle\nabla_{\bm{p}_{i}}L_{i}\left(\bm{p}_{i},\,\bm{\delta};\,\bm{\alpha},\,\beta\right)
=λi𝒑iΦi(𝒑i|𝜹i)+𝚫(:,𝒦i)T𝜶+β𝟭+μ(𝟭T𝒑iPi)𝟭\displaystyle=\lambda_{i}\nabla_{\bm{p}_{i}}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})+\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\bm{\alpha}+\beta\bm{\mathsf{1}}+\mu\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)\bm{\mathsf{1}}
+μ𝚫(:,𝒦i)T(𝚫(:,𝒦i)𝒑i𝒛i).\displaystyle\quad{}+\,\mu\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}-\bm{z}_{i}\right).

Then, we apply the gradient descent method to obtain 𝒑i(t+1)\bm{p}_{i}(t+1), explicitly given by

𝒑i(t+1)\displaystyle\bm{p}_{i}(t+1) =max(𝒑i(t)η𝒑iLi(𝒑i(t),𝜹(t);𝜶(t),β(t))\displaystyle=\max\left(\bm{p}_{i}(t)-\eta\nabla_{\bm{p}_{i}}L_{i}\left(\bm{p}_{i}(t),\,\bm{\delta}(t);\,\bm{\alpha}(t),\,\beta(t)\right)\right.
ν(𝟭𝒘~i), 0),\displaystyle\quad\left.{}-\nu(\bm{\mathsf{1}}-\tilde{\bm{w}}_{i}),\,\bm{0}\right), (18)

where η\eta is the step size and ν(𝟭𝒘~i)\nu(\bm{\mathsf{1}}-\tilde{\bm{w}}_{i}) denotes a sparsity-regularization term [25]. Moreover, it is seen from Proposition 2 that 𝟭T𝒑i(t)Pi(t)\bm{\mathsf{1}}^{T}\bm{p}_{i}(t)-P_{i}(t) and 𝚫(:,𝒦i)𝒑i(t)𝒛i(t)\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}(t)-\bm{z}_{i}(t) can be updated by (12a) and (12b), respectively.
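A minimal sketch of the projected gradient step (18); the gradient and scheduling vector below are placeholders, not the true gradient of Φi:

```python
import numpy as np

def update_p(p, grad, w_tilde, eta=1e-4, nu=1e-5):
    """One iteration of (18): a gradient step, a sparsity penalty pushing
    unscheduled users (w_tilde ~ 0) toward zero power, and a projection
    onto the nonnegative orthant."""
    return np.maximum(p - eta * grad - nu * (1.0 - w_tilde), 0.0)

rng = np.random.default_rng(3)
K = 6
p = rng.uniform(0.0, 1e-3, K)
grad = rng.normal(size=K)                   # placeholder gradient (assumed)
w_tilde = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # users 3, 5 unscheduled

p_next = update_p(p, grad, w_tilde)
```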

III-C2 Update 𝜹i\bm{\delta}_{i} with other variables fixed

It is seen that Li(𝒑,𝜹i;𝜶i)L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i}) given by (17) is differentiable with respect to 𝜹i\bm{\delta}_{i}, and the gradient is computed as

𝜹iLi(𝒑,𝜹i;𝜶i)\displaystyle\nabla_{\bm{\delta}_{i}}L_{i}(\bm{p},\,\bm{\delta}_{i};\,\bm{\alpha}^{\prime}_{i})
=λi𝜹iΦi(𝒑i|𝜹i)𝜶iμI(𝚫(𝒦i,:)𝒑+σ2𝟭𝜹i).\displaystyle=\lambda_{i}\nabla_{\bm{\delta}_{i}}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i})-\bm{\alpha}^{\prime}_{i}-\dfrac{\mu}{I}\left(\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}-\bm{\delta}_{i}\right).

Then, we apply the gradient descent method to obtain

𝜹i(t+1)=max(𝜹i(t)η𝜹iLi(𝒑(t),𝜹i(t);𝜶i(t)),σ2𝟭).\bm{\delta}_{i}(t+1)=\max\left(\bm{\delta}_{i}(t)-\eta\nabla_{\bm{\delta}_{i}}L_{i}\left(\bm{p}(t),\,\bm{\delta}_{i}(t);\,\bm{\alpha}^{\prime}_{i}(t)\right),\,\sigma^{2}\bm{\mathsf{1}}\right). (19)
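The corresponding sketch of (19); the projection floor σ²𝟭 reflects that the interference-plus-noise power can never fall below the noise power (values hypothetical):

```python
import numpy as np

def update_delta(delta, grad, sigma2=1e-10, eta=1e-4):
    """One iteration of (19): a gradient step followed by projection onto
    delta >= sigma^2, the physical lower bound on interference-plus-noise."""
    return np.maximum(delta - eta * grad, sigma2)

rng = np.random.default_rng(4)
delta_next = update_delta(rng.uniform(1e-10, 1e-9, 6), rng.normal(size=6))
```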

To realize a highly parallelizable iteration of 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i}, as explicitly given by (III-C1) and (19), we denote variable blocks 𝒙p=[𝒙p1T,𝒙p2T,,𝒙pIT]T\bm{x}_{p}=\left[\bm{x}_{p_{1}}^{T},\,\bm{x}_{p_{2}}^{T},\,\cdots,\,\bm{x}_{p_{I}}^{T}\right]^{T} with 𝒙pi|𝒦i|\bm{x}_{p_{i}}\in\mathcal{R}^{\left|\mathcal{K}_{i}\right|}, and 𝒙δ=[𝒙δ1T,𝒙δ2T,,𝒙δIT]T\bm{x}_{\delta}=\left[\bm{x}_{\delta_{1}}^{T},\,\bm{x}_{\delta_{2}}^{T},\,\cdots,\,\bm{x}_{\delta_{I}}^{T}\right]^{T} with 𝒙δi|𝒦i|\bm{x}_{\delta_{i}}\in\mathcal{R}^{\left|\mathcal{K}_{i}\right|}. By using variable blocks 𝒙i,{p,δ}\bm{x}_{\ell_{i}},\ell\in\{p,\,\delta\}, we obtain

𝒙i(t+1)={𝒑i(t+1), if =p;𝜹i(t+1), otherwise. \bm{x}_{\ell_{i}}(t+1)=\begin{cases}\bm{p}_{i}(t+1),&\text{ if }\ell=p;\\ \bm{\delta}_{i}(t+1),&\text{ otherwise. }\end{cases} (20)

III-C3 Update relative dual variables with others fixed

It is obvious that the ALF given by (11) is a linear function with respect to all dual variables; thus, we have

𝜶i(t+1)\displaystyle\bm{\alpha}_{i}^{\prime}(t+1) =𝜶i(t)+μ(t)I(𝚫(𝒦i,𝒦i)𝒑i(t+1)𝜹i(t+1)\displaystyle=\bm{\alpha}_{i}^{\prime}(t)+\dfrac{\mu(t)}{I}\Bigg{(}\bm{\Delta}\left(\mathcal{K}_{i},\,\mathcal{K}_{i}\right)\bm{p}_{i}(t+1)-\bm{\delta}_{i}(t+1)
+σ2𝟭+j≠i(𝚫(𝒦i,𝒦j))𝒑j(t)).\displaystyle\quad{}+\,\sigma^{2}\bm{\mathsf{1}}+\sum_{j\neq i}\left(\bm{\Delta}\left(\mathcal{K}_{i},\mathcal{K}_{j}\right)\right)\bm{p}_{j}(t)\Bigg{)}. (21)

Using (21) in place of (13b) yields a Gauss-Seidel sequence that realizes a highly efficient iteration based on the most recent available information. Apart from the aforementioned dual variables, β(t+1)\beta(t+1) can be directly updated by (13a).
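The sketch below checks the consistency of the Gauss-Seidel dual update (21) with the centralized update (13b): when every task still holds the same iterate p(t), the stacked per-task updates must reproduce (13b) exactly (matrix and dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
I, K = 2, 6
blocks = np.array_split(np.arange(K), I)
Delta = np.abs(rng.normal(size=(K, K)))
np.fill_diagonal(Delta, 0.0)
sigma2, mu = 1e-10, 5.0
p_old = rng.uniform(0.0, 1e-3, K)        # p(t), known to every task
delta_new = rng.uniform(1e-10, 1e-9, K)  # delta(t+1)
alpha0 = np.zeros(K)

# Per-task update (21): task i combines its own block with the most recent
# blocks p_j(t) received from the other tasks.  Taking the own block equal
# to p(t) as well, the stacked result must match the centralized (13b).
alpha_blocks = []
for i, idx in enumerate(blocks):
    cross = sum(Delta[np.ix_(idx, jdx)] @ p_old[jdx]
                for j, jdx in enumerate(blocks) if j != i)
    resid = Delta[np.ix_(idx, idx)] @ p_old[idx] - delta_new[idx] + sigma2 + cross
    alpha_blocks.append(alpha0[idx] + (mu / I) * resid)

alpha_gs = np.concatenate(alpha_blocks)
alpha_central = alpha0 + (mu / I) * (Delta @ p_old - delta_new + sigma2)
```

In the actual algorithm, the own block is the fresh iterate 𝒑i(t+1) while the cross blocks remain at 𝒑j(t), which is exactly what makes the update parallelizable per task.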

In terms of computational complexity, this algorithm involves KK scheduling variables, KK primal variables, KK auxiliary variables, K(K1)K(K-1) interference terms, and K+IK+I dual variables, where the K+IK+I dual variables come from the KK interference constraints and the II learning tasks. The primal variables, auxiliary variables, interference terms, and dual variables can all be updated in parallel. Consequently, when the number KK of users is large, the per-iteration complexity is approximately 𝒪((K2+K)/I)\mathcal{O}\left((K^{2}+K)/I\right).

To sum up, Fig. 2 sketches the block diagram of the proposed parallel algorithm, and the detailed steps are formalized in Algorithm 1, where lines 3-8 are the main steps of the parallel algorithm, as shown in the parallelization module of Fig. 2. Specifically, lines 4-6 of Algorithm 1 realize the power and CCI optimization, and line 7 performs the dual and scheduling variable updates in parallel. Then, line 9 aggregates messages from different tasks and also constructs an increasing sequence μ(t+1)=min(μsμ(t),μmax)\mu(t+1)=\min(\mu_{\rm s}\mu(t),\,\mu_{\max}), which ensures that the equalities (9a)-(9d) hold when Algorithm 1 converges.

Algorithm 1 The task-oriented power allocation in parallel.
0:  Setting (I,N,K,P,T,B,σ2,{λi,ai,bi,Vi,Ai}i)\left(I,\,N,\,K,\,P,\,T,\,B,\,\sigma^{2},\,\{\lambda_{i},a_{i},\,b_{i},\,V_{i},\,A_{i}\}_{i\in\mathcal{I}}\right), channels {𝒉k}k𝒦\{\bm{h}_{k}\}_{k\in\mathcal{K}}, user set 𝒦\mathcal{K}, gain matrix 𝑮\bm{G}, gain diagonal matrix 𝑫\bm{D}, learning rate η\eta, error tolerance ε\varepsilon, μmax\mu_{\max}, and μs>1\mu_{\rm s}>1.
0:  The optimization solution 𝒑^\hat{\bm{p}};
1:  Initialize t=0,𝒘~=𝟭,𝒑(0)=P/K×𝟭,𝜹(0)=(𝑮𝑫)𝒑(0)+σ2𝟭,𝜶(0)=1/K×𝟭,β(0)=1,t=0,\,\tilde{\bm{w}}=\bm{\mathsf{1}},\,\bm{p}(0)=P/K\times\bm{\mathsf{1}},\,\bm{\delta}(0)=\left(\bm{G}-\bm{D}\right)\bm{p}(0)+\sigma^{2}\bm{\mathsf{1}},\,\bm{\alpha}(0)=1/K\times\bm{\mathsf{1}},\,\beta(0)=1, and μ(0)=1\mu(0)=1;
2:  repeat
3:     for ii\in\mathcal{I} in parallel do
4:        for {p,δ}\ell\in\{p,\,\delta\} in parallel do
5:           Update 𝒙i(t+1)\bm{x}_{\ell_{i}}(t+1) by (20);
6:        end for
7:        Update 𝜶i(t+1)\bm{\alpha}_{i}^{\prime}(t+1) by (21), and 𝒘~i\tilde{\bm{w}}_{i} as per (6);
8:     end for
9:     Compute β(t+1)\beta(t+1) as per (13a), and μ(t+1)=min(μsμ(t),μmax)\mu(t+1)=\min(\mu_{\rm s}\mu(t),\,\mu_{\max});
10:     Compute MSE{\rm MSE} as per (35);
11:     t=t+1t=t+1;
12:  until MSEε{\rm MSE}\leq\varepsilon;
13:  𝒑^=𝒘~𝒑(t)\hat{\bm{p}}=\lfloor\tilde{\bm{w}}\rceil\circ\bm{p}(t).

So far, we have developed a parallel algorithm to solve the task-oriented power allocation problem. However, since multiple variables must be relaxed to enable task parallelism, the convergence is slow. Although Proposition 2 eliminates the auxiliary variables, additional relaxed constraints are still needed to separate the CCI term, namely, the variables 𝜹\bm{\delta} and the related dual variables. Also, the per-iteration cost is usually high, in particular owing to the non-convexity of Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) and the non-unitary matrix 𝑮~𝑫~\tilde{\bm{G}}-\tilde{\bm{D}}, i.e., (𝑮~𝑫~)T(𝑮~𝑫~)(\tilde{\bm{G}}-\tilde{\bm{D}})^{T}(\tilde{\bm{G}}-\tilde{\bm{D}}) is not an identity mapping. To address these issues, we design an accelerated algorithm in the next section.

Refer to caption
Figure 2: Block diagram of the proposed parallel algorithm.
Refer to caption
Figure 3: Block diagram of the accelerated algorithm.

IV An Accelerated Algorithm: Fast Proximal Algorithms

Now, we design a fast proximal ADMM algorithm with parallelizable splitting [26], whose block diagram is sketched in Fig. 3. Specifically, to improve the convergence rate, we first exploit the smoothness property to linearize Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) (i.e., Step 2 in Fig. 3). The smoothness of Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) is established in Lemma 1.

Lemma 1.

The function Φi(𝐩i|𝛅i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}) satisfies the following conditions:

  • i)

    Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}^{*}) is LpiL_{p_{i}}-smooth, i.e., 𝒙Φi(𝒙|𝜹i)𝒚Φi(𝒚|𝜹i)2Lpi𝒙𝒚2\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{2}\leq L_{p_{i}}\|\bm{x}-\bm{y}\|_{2} for any 𝒙,𝒚\bm{x},\bm{y};

  • ii)

    Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}^{*}|\bm{\delta}_{i}) is LδiL_{\delta_{i}}-smooth, i.e., 𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)2Lδi𝒙𝒚2\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{2}\leq L_{\delta_{i}}\|\bm{x}-\bm{y}\|_{2} for any 𝒙,𝒚\bm{x},\bm{y},

where 𝛅i,𝐩i,i\bm{\delta}_{i}^{*},\,\bm{p}_{i}^{*},\,i\in\mathcal{I} denote the currently stored values.

Proof.

See Appendix C. ∎

Given Lemma 1, the smoothness result enables us to linearize the learning error function Φi(𝒑i|𝜹i)\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}). Next, we construct an identity mapping through a matrix transformation to improve the convergence rate and solve the related sub-problems more efficiently.

IV-A Parallelization

In principle, the essence of our accelerated algorithm is to use an identity transformation of matrices to split the variable blocks. Using a fast proximal linearized ADMM algorithm with parallelizable splitting [26], we derive the ALFs associated with the variable blocks 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i}, respectively.

IV-A1 The ALF with respect to 𝒑i\bm{p}_{i} and 𝜹\bm{\delta}

We first define two block matrices

𝑨[𝚫(:,𝒦i)𝑰/I𝟭T𝟎T],𝑨1[𝚫(:,𝒦i)𝟭T].\bm{A}\triangleq\left[\begin{array}[]{cc}\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)&-{\bm{I}}/{I}\\ \bm{\mathsf{1}}^{T}&\bm{0}^{T}\end{array}\right],\,\bm{A}_{1}\triangleq\left[\begin{array}[]{c}\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\\ \bm{\mathsf{1}}^{T}\end{array}\right].

Then, we can rewrite (9a)-(9c) as a linear equation 𝑨𝒙=𝒓\bm{A}\bm{x}=\bm{r}, where 𝒓[σ2/I𝟭T,Pi]T\bm{r}\triangleq\left[-{\sigma^{2}}/{I}\bm{\mathsf{1}}^{T},\,P_{i}\right]^{T}, 𝒙[𝒑iT,𝜹T]T\bm{x}\triangleq\left[\bm{p}_{i}^{T},\,\bm{\delta}^{T}\right]^{T}, and 𝒛i(𝜹)=𝜹/Iσ2/I𝟏\bm{z}_{i}(\bm{\delta})=\bm{\delta}/{I}-{\sigma^{2}}/{I}\bm{1} given by (8), (9a), and (9b), respectively. Moreover, the ALF given by (14) can be rewritten as

Li(𝒙;𝝀)=Φi(𝒙)+𝝀,𝑨𝒙𝒓+μ2𝑨𝒙𝒓22,L_{i}(\bm{x};\,\bm{\lambda}^{\prime})=\Phi_{i}(\bm{x})+\langle\bm{\lambda}^{\prime},\,\bm{A}\bm{x}-\bm{r}\rangle+\dfrac{\mu}{2}\|\bm{A}\bm{x}-\bm{r}\|_{2}^{2}, (22)

where 𝝀[𝜶T,β]T\bm{\lambda}^{\prime}\triangleq\left[\bm{\alpha}^{T},\,\beta\right]^{T} and Φi(𝒙)λiΦi(𝒑i|𝜹i)\Phi_{i}(\bm{x})\triangleq\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}). Then, by means of the parallelizable splitting [26] and relaxing Li(𝒙;𝝀)L_{i}(\bm{x};\,\bm{\lambda}^{\prime}), we write an accelerated ALF of (22) with respect to 𝒑i\bm{p}_{i} as

Li(𝒑i|𝒚piΦi(𝒚pi(t+1)|𝜹i(t)),𝒑i(t),𝒛pi(t),𝒛i(t);𝜶(t),β(t))\displaystyle L_{i}\left(\bm{p}_{i}|\nabla_{\bm{y}_{p_{i}}}\Phi_{i}(\bm{y}_{p_{i}}(t+1)|\bm{\delta}_{i}(t)),\,\bm{p}_{i}(t),\,\bm{z}_{p_{i}}(t),\,\bm{z}_{i}(t);\,\bm{\alpha}(t),\,\beta(t)\right)
=λi𝒚piΦi(𝒚pi(t+1)|𝜹𝒊(t)),𝒑i+μ(t)𝑨1T(𝑨𝒛1(t)𝒓),𝒑i\displaystyle=\lambda_{i}\left\langle\nabla_{\bm{y}_{p_{i}}}\Phi_{i}(\bm{y}_{p_{i}}(t+1)|\bm{\delta_{i}}(t)),\,\bm{p}_{i}\right\rangle+\mu(t)\left\langle\bm{A}_{1}^{T}(\bm{A}\bm{z}_{1}(t)-\bm{r}),\,\bm{p}_{i}\right\rangle
+𝝀(t),𝑨1𝒑i+12(Lpiθ(t)+μ(t)λpi)𝒑i𝒛pi(t)22\displaystyle\quad{}+\,\langle\bm{\lambda}^{\prime}(t),\,\bm{A}_{1}\bm{p}_{i}\rangle+\frac{1}{2}\left(L_{p_{i}}\theta(t)+\mu(t)\lambda_{p_{i}}\right)\left\|\bm{p}_{i}-\bm{z}_{p_{i}}(t)\right\|_{2}^{2}
=𝜶(t),𝚫(:,𝒦i)𝒑i+μ(t)(𝟭T𝒛pi(t)Pi(t))𝟭,𝒑i\displaystyle=\left\langle\bm{\alpha}(t),\,\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{p}_{i}\right\rangle+\mu(t)\left\langle\left(\bm{\mathsf{1}}^{T}\bm{z}_{p_{i}}(t)-P_{i}(t)\right)\bm{\mathsf{1}},\,\bm{p}_{i}\right\rangle
+μ(t)𝚫(:,𝒦i)T(𝚫(:,𝒦i)𝒛pi(t)𝒛δ(t)I+σ2I𝟏),𝒑i\displaystyle\quad{}+\,\mu(t)\left\langle\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{z}_{p_{i}}(t)-\dfrac{\bm{z}_{\delta}(t)}{I}+\dfrac{\sigma^{2}}{I}\bm{1}\right),\,\bm{p}_{i}\right\rangle
+λi𝒚piΦ(𝒚pi(t+1)|𝜹i(t)),𝒑i+β(t)𝟭T𝒑i\displaystyle\quad{}+\,\lambda_{i}\left\langle\nabla_{\bm{y}_{p_{i}}}\Phi(\bm{y}_{p_{i}}(t+1)|\bm{\delta}_{i}(t)),\,\bm{p}_{i}\right\rangle+\beta(t)\bm{\mathsf{1}}^{T}\bm{p}_{i}
+12(Lpiθ(t)+μ(t)λpi)𝒑i𝒛pi(t)22,\displaystyle\quad{}+\,\frac{1}{2}\left(L_{p_{i}}\theta(t)+\mu(t)\lambda_{p_{i}}\right)\left\|\bm{p}_{i}-\bm{z}_{p_{i}}(t)\right\|_{2}^{2},\, (23)

where 𝒛1[𝒛piT,𝒛δT]T\bm{z}_{1}\triangleq\left[\bm{z}_{p_{i}}^{T},\,\bm{z}_{\delta}^{T}\right]^{T}, 𝒛δ[𝒛δ1T,,𝒛δIT]T\bm{z}_{\delta}\triangleq\left[\bm{z}_{\delta_{1}}^{T},\,\cdots,\,\bm{z}_{\delta_{I}}^{T}\right]^{T}; 𝒛pi\bm{z}_{p_{i}} and 𝒛δi\bm{z}_{\delta_{i}}, ii\in\mathcal{I}, denote the gradient update results of 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i}, respectively. Moreover, λpi2𝑨122\lambda_{p_{i}}\geq 2\|\bm{A}_{1}\|_{2}^{2} guarantees that (23) is a tight majorant surrogate function of (22) with respect to 𝒑i\bm{p}_{i} [26, 22]; therefore, we have

λpi\displaystyle\lambda_{p_{i}} 2K/I(𝚫(:,𝒦i)2+1)2\displaystyle\geq{}2K/I\left(\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\|_{2}+1\right)^{2}
2(𝒘i~2𝚫(:,𝒦i)2+K/I)2\displaystyle\geq{}2\left(\|\tilde{\bm{w}_{i}}\|_{2}\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\|_{2}+\sqrt{K/I}\right)^{2}
2(𝚫(:,𝒦i)2+𝟭2)22𝑨122.\displaystyle\geq{}2\left(\|\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\|_{2}+\|\bm{\mathsf{1}}\|_{2}\right)^{2}\geq 2\|\bm{A}_{1}\|_{2}^{2}. (24)

Lastly, the parameters 𝒚pi(t+1)\bm{y}_{p_{i}}(t+1), θ(t+1)\theta(t+1), and μ(t+1)\mu(t+1) can be updated by

𝒚pi(t+1)\displaystyle\bm{y}_{p_{i}}(t+1) =(1θ(t))𝒑i(t)+θ(t)𝒛pi(t),\displaystyle=(1-\theta(t))\bm{p}_{i}(t)+\theta(t)\bm{z}_{p_{i}}(t), (25a)
θ(t+1)\displaystyle\theta(t+1) =12(θ2(t)+θ4(t)+4θ2(t)),\displaystyle=\frac{1}{2}(-\theta^{2}(t)+\sqrt{\theta^{4}(t)+4\theta^{2}(t)}), (25b)
μ(t+1)\displaystyle\mu(t+1) =1/θ(t+1),\displaystyle=1/{\theta(t+1)}, (25c)

where (25a) accelerates the convergence by using the smoothness result given by Lemma 1; (25b) is the step size of the fast algorithm; and (25c) yields an increasing sequence, as explained for Algorithm 1. With careful choices of θ(t)\theta(t) and μ(t)\mu(t), the convergence rate can be accelerated from 𝒪(1/τ)\mathcal{O}(1/\tau) to 𝒪(1/τ2)\mathcal{O}(1/\tau^{2}) [26], where τ\tau is the number of iterations needed to converge.
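The recursion (25b) can be checked numerically: it is the standard accelerated step-size sequence satisfying θ(t+1)² = (1 − θ(t+1))·θ(t)², which decays roughly like 2/(t+2) and underlies the 𝒪(1/τ²) rate. A short sketch:

```python
import math

# Step-size recursion (25b): theta(t+1) = (-theta^2 + sqrt(theta^4 + 4*theta^2)) / 2,
# i.e., theta(t+1) is the positive root of x^2 = (1 - x) * theta(t)^2.
theta = [1.0]
for _ in range(50):
    th = theta[-1]
    theta.append(0.5 * (-th ** 2 + math.sqrt(th ** 4 + 4.0 * th ** 2)))

# Largest violation of the defining identity across the trajectory:
identity_gap = max(
    abs(theta[t + 1] ** 2 - (1.0 - theta[t + 1]) * theta[t] ** 2)
    for t in range(50)
)
```

The sequence is strictly decreasing and bounded by 2/(t+2), so μ(t) = 1/θ(t) in (25c) indeed grows without bound, tightening the penalty as the iterations proceed.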

IV-A2 The ALF with respect to 𝒑\bm{p} and 𝜹i\bm{\delta}_{i}

We first define 𝑨[𝚫(𝒦i,:)𝑰]\bm{A}^{\prime}\triangleq\left[\begin{array}[]{cc}\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)&-\bm{I}\end{array}\right], then we can also rewrite 𝚫(𝒦i,:)𝒑+σ2𝟭=𝜹i\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\bm{p}+\sigma^{2}\bm{\mathsf{1}}=\bm{\delta}_{i} given by (8) as 𝑨𝒙=𝒓\bm{A}^{\prime}\bm{x}^{\prime}=\bm{r}^{\prime}, where 𝒓σ2𝟭\bm{r}^{\prime}\triangleq-\sigma^{2}\bm{\mathsf{1}} and 𝒙[𝒑T,𝜹iT]T\bm{x}^{\prime}\triangleq\left[\bm{p}^{T},\,\bm{\delta}_{i}^{T}\right]^{T}. Moreover, the ALF given by (17) can be re-expressed as

Li(𝒙;𝜶i)Φi(𝒙)+𝜶i,𝑨𝒙𝒓+μ2𝑨𝒙𝒓22,L_{i}(\bm{x}^{\prime};\,\bm{\alpha}^{\prime}_{i})\triangleq\Phi_{i}(\bm{x}^{\prime})+\langle\bm{\alpha}^{\prime}_{i},\,\bm{A}^{\prime}\bm{x}^{\prime}-\bm{r}^{\prime}\rangle+\dfrac{\mu}{2}\|\bm{A}^{\prime}\bm{x}^{\prime}-\bm{r}^{\prime}\|_{2}^{2}, (26)

where Φi(𝒙)λiΦi(𝒑i|𝜹i)\Phi_{i}(\bm{x}^{\prime})\triangleq\lambda_{i}\Phi_{i}(\bm{p}_{i}|\bm{\delta}_{i}). By the parallelizable splitting and relaxing Li(𝒙;𝜶i)L_{i}(\bm{x}^{\prime};\,\bm{\alpha}^{\prime}_{i}), we also write compactly another accelerated ALF of (26) with respect to 𝜹i\bm{\delta}_{i} as

Li(𝜹i|𝒚δi(t+1),𝒑i(t+1),𝜹i(t),𝒛δi(t);𝜶i(t),β(t))\displaystyle L_{i}\left(\bm{\delta}_{i}|\nabla_{\bm{y}_{\delta_{i}}(t+1)},\,\bm{p}_{i}(t+1),\,\bm{\delta}_{i}(t),\,\bm{z}_{\delta_{i}}(t);\,\bm{\alpha}^{\prime}_{i}(t),\,\beta(t)\right)
=λi𝒚δi(t+1),𝜹i𝜶i(t),𝜹iμ(t)𝑨𝒛2(t)𝒓,𝜹i\displaystyle=\lambda_{i}\left\langle\nabla_{\bm{y}_{\delta_{i}}(t+1)},\,\bm{\delta}_{i}\right\rangle-\langle\bm{\alpha}^{\prime}_{i}(t),\,\bm{\delta}_{i}\rangle-\mu(t)\langle\bm{A}^{\prime}\bm{z}_{2}(t)-\bm{r}^{\prime},\,\bm{\delta}_{i}\rangle
+12(Lδiθ(t)+μ(t)λδi)𝜹i𝒛δi(t)22\displaystyle\quad{}+\,\frac{1}{2}\left(L_{\delta_{i}}\theta(t)+\mu(t)\lambda_{\delta_{i}}\right)\|\bm{\delta}_{i}-\bm{z}_{\delta_{i}}(t)\|_{2}^{2}
=λi𝒚δi(t+1),𝜹i+12(Lδiθ(t)+μ(t)λδi)𝜹i𝒛δi(t)22\displaystyle=\lambda_{i}\left\langle\nabla_{\bm{y}_{\delta_{i}}(t+1)},\,\bm{\delta}_{i}\right\rangle+\frac{1}{2}\left(L_{\delta_{i}}\theta(t)+\mu(t)\lambda_{\delta_{i}}\right)\|\bm{\delta}_{i}-\bm{z}_{\delta_{i}}(t)\|_{2}^{2}
𝜶i(t),𝜹i+μ(t)(𝚫(𝒦i,:))𝒛p(t)𝒛δi(t)+σ2𝟭,𝜹i,\displaystyle\quad{}-\,\left\langle\bm{\alpha}^{\prime}_{i}(t),\,\bm{\delta}_{i}\right\rangle+\mu(t)\left\langle\left(\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\right)\bm{z}_{p}(t)-\bm{z}_{\delta_{i}}(t)+\sigma^{2}\bm{\mathsf{1}},\,\bm{\delta}_{i}\right\rangle, (27)

where 𝒚δi(t+1)𝒚δiΦ(𝒑i(t+1)|𝒚δi(t+1))\nabla_{\bm{y}_{\delta_{i}}(t+1)}\triangleq\nabla_{\bm{y}_{\delta_{i}}}\Phi(\bm{p}_{i}(t+1)|\bm{y}_{\delta_{i}}(t+1)), 𝒛2[𝒛pT,𝒛δiT]T\bm{z}_{2}\triangleq\left[\bm{z}_{p}^{T},\,\bm{z}_{\delta_{i}}^{T}\right]^{T}, and 𝒛p[𝒛p1T,,𝒛pIT]T\bm{z}_{p}\triangleq\left[\bm{z}_{p_{1}}^{T},\,\cdots,\,\bm{z}_{p_{I}}^{T}\right]^{T}. Like (24), the choice of λδi2𝑰22=2\lambda_{\delta_{i}}\geq 2\|\bm{I}\|_{2}^{2}=2 also guarantees that (27) is a tight majorant surrogate function of (26) with respect to 𝜹i\bm{\delta}_{i} [26, 22]. Moreover, 𝒚δi(t+1)\bm{y}_{\delta_{i}}(t+1) is given by

𝒚δi(t+1)=(1θ(t))𝜹i(t)+θ(t)𝒛δi(t),\bm{y}_{\delta_{i}}(t+1)=(1-\theta(t))\bm{\delta}_{i}(t)+\theta(t)\bm{z}_{\delta_{i}}(t), (28)

whose effect is the same as that of (25a). As stated above, Φi(𝒑i|𝜹𝒊)\Phi_{i}(\bm{p}_{i}|\bm{\delta_{i}}) can be relaxed by Lemma 1, which tolerates very large Lipschitz constants LpiL_{p_{i}} and LδiL_{\delta_{i}} for the non-convex functions, as large as 𝒪(τ)\mathcal{O}(\tau), without affecting the convergence rate. Moreover, we also linearize the augmented terms 1/2𝑨𝒙𝒓221/{2}\|\bm{A}\bm{x}-\bm{r}\|_{2}^{2} and 1/2𝑨𝒙𝒓221/{2}\|\bm{A}^{\prime}\bm{x}^{\prime}-\bm{r}^{\prime}\|_{2}^{2} by λpi/2𝒑i𝒛pi(t)22\lambda_{p_{i}}/2\left\|\bm{p}_{i}-\bm{z}_{p_{i}}(t)\right\|_{2}^{2} and λδi/2𝜹i𝒛δi(t)22\lambda_{\delta_{i}}/2\|\bm{\delta}_{i}-\bm{z}_{\delta_{i}}(t)\|_{2}^{2}, respectively. With (23) and (27) in hand, the sub-functions given by (14) and (17) can be optimized more efficiently.
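The majorization claim behind choices like (24) can be verified numerically: with λ ≥ 2‖A‖₂², linearizing ½‖Ax − r‖² at a point z and adding (λ/2)‖x − z‖² upper-bounds the original quadratic everywhere, with equality at z. A sketch with a random stand-in matrix (the A below is generic, not the paper's block matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 8, 5
A = rng.normal(size=(m, n))             # generic stand-in matrix (assumed)
r = rng.normal(size=m)
lam = 2.0 * np.linalg.norm(A, 2) ** 2   # the choice lambda >= 2 ||A||_2^2

def f(x):
    return 0.5 * np.linalg.norm(A @ x - r) ** 2

def surrogate(x, z):
    """f linearized at z, with its curvature replaced by (lam/2)||x - z||^2."""
    return f(z) + (A.T @ (A @ z - r)) @ (x - z) + 0.5 * lam * np.sum((x - z) ** 2)

z = rng.normal(size=n)
min_gap = min(
    surrogate(x, z) - f(x) for x in (rng.normal(size=n) for _ in range(200))
)
```

The gap equals ½(x − z)ᵀ(λI − AᵀA)(x − z), which is nonnegative whenever λ exceeds the spectral norm squared; the factor 2 in (24) simply leaves slack for the tightness argument in [26].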

IV-B Algorithm Development

We have obtained the parallelizable splitting and derived ALFs of the accelerated algorithm. Now, we compute the partial derivatives of relative variables and apply the gradient descent algorithm to update these variables in parallel.

IV-B1 Update 𝒑i\bm{p}_{i} in parallel with other variables fixed

Here, the ALF given by (23) is a quadratic function of 𝒑i\bm{p}_{i}; thus, it admits a closed-form minimizer with respect to 𝒑i\bm{p}_{i}, given by

𝒛~pi(t+1)\displaystyle\tilde{\bm{z}}_{p_{i}}(t+1)
=1Lpiθ(t)+μ(t)λpi(λi𝒚piΦ(𝒚pi(t+1)|𝜹i(t))+β(t)𝟭\displaystyle=-\dfrac{1}{L_{p_{i}}\theta(t)+\mu(t)\lambda_{p_{i}}}\left(\lambda_{i}\nabla_{\bm{y}_{p_{i}}}\Phi(\bm{y}_{p_{i}}(t+1)|\bm{\delta}_{i}(t))+\beta(t)\bm{\mathsf{1}}\right.
+μ(t)𝚫(:,𝒦i)T(𝚫(:,𝒦i)𝒛pi(t)𝒛δ(t)I+σ2I𝟏)\displaystyle\quad{}+\,\mu(t)\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)\bm{z}_{p_{i}}(t)-\dfrac{\bm{z}_{\delta}(t)}{I}+\dfrac{\sigma^{2}}{I}\bm{1}\right)
+𝚫(:,𝒦i)T𝜶(t)+μ(t)(𝟭T𝒛pi(t)Pi(t))𝟭)+𝒛pi(t),\displaystyle\quad{}+\,\left.\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right)^{T}\bm{\alpha}(t)+\mu(t)\left(\bm{\mathsf{1}}^{T}\bm{z}_{p_{i}}(t)-P_{i}(t)\right)\bm{\mathsf{1}}\right)+\bm{z}_{p_{i}}(t), (29)

where Pi(t)P_{i}(t) can be computed by (12a). Then, we obtain

𝒛pi(t+1)\displaystyle\bm{z}_{p_{i}}(t+1) =max(𝒛~pi(t+1)ν(𝟭𝒘~i), 0),\displaystyle=\max(\tilde{\bm{z}}_{p_{i}}(t+1)-\nu(\bm{\mathsf{1}}-\tilde{\bm{w}}_{i}),\,\bm{0}), (30a)
𝒑i(t+1)\displaystyle\bm{p}_{i}(t+1) =(1θ(t))𝒑i(t)+θ(t)𝒛pi(t+1),\displaystyle=(1-\theta(t))\bm{p}_{i}(t)+\theta(t)\bm{z}_{p_{i}}(t+1), (30b)

where (30a) is an orthogonal projection combined with the sparsity-regularization term, and (30b) is the accelerated interpolation step.
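Stripped of the problem-specific coefficients, one accelerated iteration is a closed-form quadratic minimizer followed by a projection and an interpolation. The sketch below mimics (29)-(30b), with a generic linear coefficient c standing in for the stacked terms of (23) (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)
K = 5
z_p = rng.uniform(0.0, 1e-3, K)   # z_{p_i}(t)
c = rng.normal(size=K)            # stacked linear coefficient of (23) (assumed)
L_step = 50.0                     # L_{p_i}*theta(t) + mu(t)*lambda_{p_i} (assumed)
theta = 0.4
nu, w_tilde = 1e-5, np.ones(K)
p_old = rng.uniform(0.0, 1e-3, K)

z_tilde = z_p - c / L_step                               # closed-form minimizer, cf. (29)
z_new = np.maximum(z_tilde - nu * (1.0 - w_tilde), 0.0)  # projection step (30a)
p_new = (1.0 - theta) * p_old + theta * z_new            # interpolation step (30b)
```

The first line is exact minimization of a quadratic with curvature L_step, so no inner loop or line search is needed, which is precisely what makes each accelerated iteration cheap.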

IV-B2 Update 𝜹i\bm{\delta}_{i} in parallel with other variables fixed

Here, the ALF given by (27) is also a quadratic function of 𝜹i\bm{\delta}_{i}; hence, we obtain the closed-form solution

𝒛~δi(t+1)\displaystyle\tilde{\bm{z}}_{\delta_{i}}(t+1)
=𝒛δi(t)1Lδiθ(t)+μ(t)λδi(λi𝒚δiΦ(𝒑i(t+1)|𝒚δi(t+1))\displaystyle=\bm{z}_{\delta_{i}}(t)-\dfrac{1}{L_{\delta_{i}}\theta(t)+\mu(t)\lambda_{\delta_{i}}}\left(\lambda_{i}\nabla_{\bm{y}_{\delta_{i}}}\Phi(\bm{p}_{i}(t+1)|\bm{y}_{\delta_{i}}(t+1))\right.
𝜶i(t)+μ(t)((𝚫(𝒦i,:))𝒛p(t)𝒛δi(t)+σ2𝟭)).\displaystyle\quad{}-\,\left.\bm{\alpha}_{i}^{\prime}(t)+\mu(t)\left(\left(\bm{\Delta}\left(\mathcal{K}_{i},\,:\right)\right)\bm{z}_{p}(t)-\bm{z}_{\delta_{i}}(t)+\sigma^{2}\bm{\mathsf{1}}\right)\right). (31)

Next, we have

𝒛δi(t+1)\displaystyle\bm{z}_{\delta_{i}}(t+1) =max(𝒛~δi(t+1),σ2𝟭),\displaystyle=\max(\tilde{\bm{z}}_{\delta_{i}}(t+1),\,\sigma^{2}\bm{\mathsf{1}}), (32a)
𝜹i(t+1)\displaystyle\bm{\delta}_{i}(t+1) =(1θ(t))𝜹i(t)+θ(t)𝒛δi(t+1),\displaystyle=(1-\theta(t))\bm{\delta}_{i}(t)+\theta(t)\bm{z}_{\delta_{i}}(t+1), (32b)

where (32a) and (32b) denote an orthogonal projection and an accelerated interpolation step, respectively. In light of (29) and (31), it is obvious that 𝒑i\bm{p}_{i} and 𝜹i\bm{\delta}_{i} can be updated in parallel; thus, we have

𝒚i(t+1)\displaystyle\bm{y}_{\ell_{i}}(t+1) ={𝒚pi(t+1), if =p;𝒚δi(t+1), otherwise,\displaystyle=\begin{cases}\bm{y}_{p_{i}}(t+1),&\text{ if }\ell=p;\\ \bm{y}_{\delta_{i}}(t+1),&\text{ otherwise},\end{cases} (33a)
𝒛i(t+1)\displaystyle\bm{z}_{\ell_{i}}(t+1) ={𝒛pi(t+1), if =p;𝒛δi(t+1), otherwise,\displaystyle=\begin{cases}\bm{z}_{p_{i}}(t+1),&\text{ if }\ell=p;\\ \bm{z}_{\delta_{i}}(t+1),&\text{ otherwise},\end{cases} (33b)

and 𝒙i(t+1),{p,δ}\bm{x}_{\ell_{i}}(t+1),\,\ell\in\{p,\,\delta\} is updated by (20).

IV-B3 Update relative dual variables

It is evident that the ALFs given by (14) and (17) are linear functions of all dual variables; hence we have

𝜶i(t+1)\displaystyle\bm{\alpha}_{i}^{\prime}(t+1) =𝜶i(t)+μ(t)I(𝚫(𝒦i,𝒦i)𝒛pi(t+1)+σ2𝟭\displaystyle=\bm{\alpha}_{i}^{\prime}(t)+\dfrac{\mu(t)}{I}\Bigg{(}\bm{\Delta}\left(\mathcal{K}_{i},\,\mathcal{K}_{i}\right)\bm{z}_{p_{i}}(t+1)+\sigma^{2}\bm{\mathsf{1}}
𝒛δi(t+1)+j∈ℐ∖{i}𝚫(𝒦i,𝒦j)𝒛pj(t)),\displaystyle\quad{}-\,\bm{z}_{\delta_{i}}(t+1)+\sum_{j\in\mathcal{I}\setminus\{i\}}\bm{\Delta}\left(\mathcal{K}_{i},\,\mathcal{K}_{j}\right)\bm{z}_{p_{j}}(t)\Bigg{)}, (34a)
β(t+1)\displaystyle\beta(t+1) =β(t)+μ(t)I(𝟭T𝒛p(t+1)P).\displaystyle=\beta(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{z}_{p}(t+1)-P\right). (34b)

Using (34a) in place of (13b) leads to a highly parallelizable iteration.

In terms of computational complexity, the per-iteration cost of the accelerated algorithm is proportional to that of the parallel algorithm; thus, the per-iteration complexity is also 𝒪((K2+K)/I)\mathcal{O}\left((K^{2}+K)/I\right). Beyond the per-iteration complexity, another important measure of convergence speed is the convergence rate. By [26, Theorem 22], this algorithm improves the convergence rate from 𝒪(1/τ)\mathcal{O}(1/\tau) to 𝒪(1/τ2)\mathcal{O}(1/\tau^{2}), which makes it particularly attractive for large-scale IoT networks. Moreover, this algorithm also tolerates large Lipschitz constants LpiL_{p_{i}} and LδiL_{\delta_{i}} when relaxing the non-convex objective functions, without affecting the convergence rate.
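The 𝒪(1/τ) versus 𝒪(1/τ²) distinction is easy to observe on a toy problem. The sketch below compares plain gradient descent with Nesterov-type acceleration on a generic strongly convex quadratic (not the paper's objective); after the same iteration budget, the accelerated iterate is markedly closer to the optimum:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
M = rng.normal(size=(n, n))
Q = M.T @ M + 0.1 * np.eye(n)      # ill-conditioned convex quadratic (assumed)
b = rng.normal(size=n)
f = lambda x: 0.5 * x @ Q @ x - b @ x
f_star = f(np.linalg.solve(Q, b))
L = np.linalg.eigvalsh(Q)[-1]      # Lipschitz constant of the gradient
T = 2000                           # common iteration budget

# Plain gradient descent: O(1/t) worst-case rate.
x = np.zeros(n)
for _ in range(T):
    x = x - (Q @ x - b) / L
gap_plain = f(x) - f_star

# Nesterov-type momentum: O(1/t^2) worst-case rate.
x, y, tk = np.zeros(n), np.zeros(n), 1.0
for _ in range(T):
    x_new = y - (Q @ y - b) / L
    tk_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tk * tk))
    y = x_new + (tk - 1.0) / tk_new * (x_new - x)
    x, tk = x_new, tk_new
gap_accel = f(x) - f_star
```

The gain grows with the conditioning of the problem, which is why acceleration matters most when K is large and the coupling matrix is far from identity.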

To sum up, the procedure is formalized in Algorithm 2, which is faster than Algorithm 1 owing to the acceleration of the error functions (i.e., (25a) and (28)) and of the equality constraints (i.e., (30b) and (32b)). Specifically, lines 5 and 7 of Algorithm 2 describe the parallel steps (i.e., Steps 3-6 in Fig. 3). Among them, line 5 specifies the acceleration steps (i.e., Steps 3 and 6 in Fig. 3). Moreover, line 9 describes the messages aggregated from different tasks (i.e., Step 7 in Fig. 3). Also, μ(t)\mu(t) in Algorithm 2 is adapted to the step size θ(t)\theta(t) to guide the convergence more efficiently.

Algorithm 2 The accelerated algorithm.
0:  Setting (I,N,K,P,T,B,σ2,{λi,λpi,λδi,ai,bi,Vi,Ai}i)\left(I,\,N,\,K,\,P,\,T,\,B,\,\sigma^{2},\,\{\lambda_{i},\lambda_{p_{i}},\,\lambda_{\delta_{i}},\,a_{i},\,b_{i},\,V_{i},\,A_{i}\}_{i\in\mathcal{I}}\right), user set 𝒦\mathcal{K}, channels {𝒉k}k𝒦\{\bm{h}_{k}\}_{k\in\mathcal{K}}, gain matrix 𝑮\bm{G}, gain diagonal matrix 𝑫\bm{D}, learning rate η\eta, and error tolerance ε\varepsilon.
0:  The optimization solution 𝒑^\hat{\bm{p}}.
1:  Initialize t=0,𝒙p(0)=𝒚p(0)=𝒛p(0)=P/K×𝟭,𝒙δ(0)=𝒚δ(0)=𝒛δ(0)=(𝑮𝑫)𝒙p(0)+σ2𝟭,𝒘~=𝟭,𝜶(0)=1/K×𝟭,β(0)=1,μ(0)=θ(0)=1t=0,\,\bm{x}_{p}(0)=\bm{y}_{p}(0)=\bm{z}_{p}(0)=P/K\times\bm{\mathsf{1}},\,\bm{x}_{\delta}(0)=\bm{y}_{\delta}(0)=\bm{z}_{\delta}(0)=(\bm{G}-\bm{D})\bm{x}_{p}(0)+\sigma^{2}\bm{\mathsf{1}},\,\tilde{\bm{w}}=\bm{\mathsf{1}},\,\bm{\alpha}(0)=1/K\times\bm{\mathsf{1}},\,\beta(0)=1,\,\mu(0)=\theta(0)=1;
2:  repeat
3:     for ii\in\mathcal{I} in parallel do
4:        for {p,δ}\ell\in\{p,\,\delta\} in parallel do
5:           Compute 𝒚i(t+1)\bm{y}_{\ell_{i}}(t+1), 𝒛i(t+1)\bm{z}_{\ell_{i}}(t+1), and 𝒙i(t+1)\bm{x}_{\ell_{i}}(t+1) as per (33a), (33b), and (20), respectively;
6:        end for
7:        Compute 𝜶i(t+1)\bm{\alpha}_{i}^{\prime}(t+1) and 𝒘~i\tilde{\bm{w}}_{i} as per (34a) and (6), respectively;
8:     end for
9:     Update β(t+1)\beta(t+1), θ(t+1)\theta(t+1), and μ(t+1)\mu(t+1) according to (34b), (25b), and (25c), respectively;
10:     Compute MSE{\rm MSE} by (35);
11:     t=t+1t=t+1;
12:  until MSEε{\rm MSE}\leq\varepsilon;
13:  𝒑^=𝒘~𝒙p(t)\hat{\bm{p}}=\lfloor\tilde{\bm{w}}\rceil\circ\bm{x}_{p}(t).

V Simulation Results and Discussions

This section presents simulation results to evaluate the performance of the designed algorithms against state-of-the-art benchmark algorithms. Unless specified otherwise, the simulation parameters are set as follows. On the one hand, we use a parameter setting for the wireless communication system similar to that of [17]. Specifically, we set the noise power σ2=77dBm\sigma^{2}=-77\,{\rm dBm}, the communication bandwidth B=180kHzB=180\,{\rm kHz}, and the path loss of the kthk^{\rm th} user ϱk=90dB\varrho_{k}=-90\,{\rm dB}, and the channel 𝒉k\bm{h}_{k} is generated according to 𝒞𝒩(𝟎,ϱk𝑰)\mathcal{CN}(\bm{0},\,\varrho_{k}\bm{I}). Also, we assume that the number of users is identical across tasks, i.e., |𝒦1|=|𝒦2|==|𝒦I|=120|\mathcal{K}_{1}|=|\mathcal{K}_{2}|=\cdots=|\mathcal{K}_{I}|=120. This is a valid assumption since we consider massive connectivity in large-scale IoT networks. On the other hand, for task-oriented learning at the edge, we consider a support vector machine (SVM) for classification of the digits dataset in Scikit-learn [27], a 6-layer convolutional neural network (CNN6) for classification of the MNIST dataset [28], a 110-layer deep residual network (ResNet110) using the CIFAR10 dataset [29], and a PointNet using 3D point clouds in the ModelNet40 dataset [30]. In the pertinent simulation experiments, the single-task case {SVM}, the two-task case {SVM, CNN6}, and the four-task case {SVM, CNN6, ResNet110, PointNet} are considered. For ease of reference, the relevant learning parameters are summarized in Table II; for more details on how these parameters are obtained, the interested reader is referred to Section III of [17]. Apart from the simulation experiments, we also investigate autonomous vehicle perception in the real world to demonstrate the generalization performance of the proposed model.

In the simulation experiments, we consider seven schemes. Four are ours: the parallel task-oriented power allocation scheme (i.e., Algorithm 1), its accelerated counterpart (i.e., Algorithm 2), and the two corresponding variants without multi-user scheduling (Algorithm 1 w/o SH and Algorithm 2 w/o SH for short). In addition, we simulate two benchmarks: a sum-rate maximization scheme [31] and an MM-based LCPA scheme [17]. The sum-rate maximization algorithm is typical of conventional wireless communications but considers only the wireless channel state information, without accounting for the learning factors. Finally, for a fair comparison of multi-user scheduling strategies, the user-fair scheduling (UFS) algorithm developed in [32] is also included in the simulation experiments.

TABLE II: Summary of the learning parameters [17].

    Models           Datasets     Symbols              Values                           Description
    SVM [27]         Digits       (a1, b1, A1, V1)     (5.2, 0.72, 200, 324 bits)       The 1st learning task
    CNN6 [28]        MNIST        (a2, b2, A2, V2)     (7.3, 0.69, 300, 6276 bits)      The 2nd learning task
    ResNet110 [29]   CIFAR10      (a3, b3, A3, V3)     (8.15, 0.44, 1600, 24584 bits)   The 3rd learning task
    PointNet [30]    ModelNet40   (a4, b4, A4, V4)     (0.96, 0.24, 800, 192008 bits)   The 4th learning task

V-A Convergence Performance and Complexity Analysis

In this subsection, the number of antennas is N = 2, the total transmit power is P = 13 dBm (i.e., 20 mW), and the transmit time is T = 10 s for the single-task case, T = 20 s for the two-task case, and T = 200 s for the four-task case. The dataset types and task parameters are defined in Table II. In particular, since the four-task case involves deep networks, T = 200 s is set to obtain enough data to fine-tune them. To evaluate the convergence process, we define the mean squared error (MSE) as

{\rm MSE}\triangleq\left\|\bm{p}(t)-\bm{p}(t-1)\right\|_{2}+\left\|\bm{\delta}(t)-\bm{\delta}(t-1)\right\|_{2}+\left|\|\bm{p}(t)\|_{1}-P\right|+\left\|\left(\bm{G}-\bm{D}\right)\bm{p}(t)-\bm{\delta}(t)+\sigma^{2}\bm{\mathsf{1}}\right\|_{2}. (35)
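The convergence metric (35) sums the successive differences of the primal variables and the residuals of the power-budget and interference-consistency constraints. A minimal sketch, assuming NumPy and our own variable names:

```python
import numpy as np

def mse_metric(p_t, p_prev, delta_t, delta_prev, G, D, sigma2, P):
    """Convergence metric (35): successive differences of p and delta, plus the
    residuals of the power budget 1^T p = P and the interference consistency
    (G - D) p + sigma^2 * 1 = delta."""
    ones = np.ones_like(delta_t)
    return (np.linalg.norm(p_t - p_prev)
            + np.linalg.norm(delta_t - delta_prev)
            + abs(np.sum(p_t) - P)
            + np.linalg.norm((G - D) @ p_t - delta_t + sigma2 * ones))
```

At a fixed point satisfying both constraints with stationary iterates, the metric evaluates to zero, which is what drives the curves in Fig. 4 toward the floor.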

Figure 4 depicts the MSE computed by (35) versus the number of iterations. On the one hand, we observe from Fig. 4a that Algorithms 1 and 2 with multi-user scheduling outperform their counterparts without it, in terms of both convergence speed and MSE. The reason is that, although the introduced redundant variable may slow down convergence and increase instability, the multi-user scheduling strategy activates only a small fraction of users, which greatly reduces the dimensionality of the corresponding variable; hence, the algorithms with multi-user scheduling are more stable and converge faster. On the other hand, Fig. 4a also shows that Algorithm 1 suffers from slower convergence and more severe stochastic fluctuations than Algorithm 2, because Algorithm 2 accelerates the convergence rate from 𝒪(1/τ) to 𝒪(1/τ²). Similarly, Figs. 4b and 4c illustrate that the multi-user scheduling and accelerated algorithms also enjoy faster convergence and lower MSE in the two-task and four-task learning cases, respectively.

Figure 4: Mean squared error vs. the number of iterations: (a) single-task case; (b) two-task case; (c) four-task case.

Figure 5 illustrates the computational complexity in terms of the average execution time. On the one hand, Fig. 5a shows that the MM-LCPA algorithm for the single-task case developed in [17] has a longer execution time than our algorithms and, even worse, exhibits a steeper increase. The reason is that, when the number of users K is large, the per-iteration complexity of the two proposed algorithms is 𝒪(K²+K), whereas that of MM-LCPA is as high as 𝒪((I+K²+K)^3.5). We also observe that Algorithm 2 has a shorter execution time than Algorithm 1, because the accelerated algorithm speeds up the convergence rate from 𝒪(1/τ) to 𝒪(1/τ²) and hence decreases the number of iterations, especially for large-scale IoT networks. On the other hand, Figs. 5b and 5c show that, compared with Fig. 5a, the execution time of our algorithms remains almost the same as the number of tasks grows from two to four. For example, the computational time is approximately 10 s for K = 200 in the single-task case, K = 400 in the two-task case, and K = 800 in the four-task case (i.e., each task has the same number of users). The reason is that the per-iteration complexity of our parallel algorithm is reduced from 𝒪(K²+K) to 𝒪((K²+K)/I); in other words, for fixed K, the computational complexity of our algorithms decreases with the number of tasks I. Thus, we infer that the proposed parallel algorithms efficiently solve the task-oriented power allocation problem.
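The per-task complexity reduction from 𝒪(K²+K) to 𝒪((K²+K)/I) comes from dispatching the I per-task subproblems to parallel workers. The following sketch is purely illustrative (the per-task update is a placeholder of our own, not Algorithm 1 itself):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_task(task_powers):
    """Placeholder per-task subproblem (hypothetical): in Algorithm 1 this
    would be the task-i power update; here it merely normalizes the powers."""
    total = sum(task_powers)
    return [p / total for p in task_powers]

def parallel_power_allocation(per_task_powers, max_workers=4):
    """Dispatch the I per-task subproblems to a worker pool, so each worker
    carries roughly O((K^2 + K)/I) of the per-iteration work."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_task, per_task_powers))
```

Because the subproblems are independent across tasks, the wall-clock time per iteration stays nearly flat as I grows, consistent with Figs. 5b and 5c.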

Figure 5: Average execution time vs. the number of users: (a) single-task case; (b) two-task case; (c) four-task case.

V-B Learning Error Performance

Figure 6 depicts the mean learning error (MLE) computed by (3a). On the one hand, Fig. 6a shows that the MM-LCPA algorithm developed in [17] performs similarly to the sum-rate maximization algorithm developed in [31], and both underperform our algorithms. The reason is that, in the single-task case, the objective function of the MM-LCPA algorithm degenerates into that of the sum-rate maximization algorithm due to the monotonicity of the learning error function, so the two attain similar performance; instead, as multi-user scheduling eliminates CCI in dense networks, the proposed algorithms outperform the others. On the other hand, in the multi-task learning cases, Figs. 6b and 6c show that the MLE of the MM-LCPA algorithm is lower than that of the sum-rate maximization algorithm, thanks to the joint design of efficient task-oriented communications for different learning models. Also, our algorithms achieve a smaller MLE than both the MM-LCPA algorithm and the UFS algorithm developed in [32]: the former owing to the multi-user scheduling and task fairness of our algorithms, and the latter because the UFS algorithm concentrates on user fairness at the expense of learning performance.

Figure 6: Mean learning error vs. the number of users: (a) single-task case; (b) two-task case; (c) four-task case.

In summary, Table III compares the four algorithms discussed above in terms of computational complexity, convergence rate, parallelization capability, and MLE. The designed Algorithms 1 and 2 are effective for task-oriented power allocation, thanks to their low computational complexity, fast convergence, high parallelism, and low learning error. In particular, the former suits small- or medium-scale IoT networks owing to its lower MLE, whereas the latter adapts to large-scale ones thanks to its faster convergence rate.

V-C Experimental Validation for Autonomous Vehicle Perception

To verify the robustness of the proposed algorithms in real-world applications, we consider three perception tasks in autonomous driving [33]: Task 1, weather classification using RGB images and a CNN; Task 2, traffic sign detection using RGB images and YOLOV5; and Task 3, object detection using point cloud data and sparsely embedded convolutional detection (SECOND). In the pertaining experiments, all the datasets are generated by the CarlaFLCAV framework, an open-source autonomous driving simulation platform available online at https://github.com/SIAT-INVS/CarlaFLCAV. In particular, the transmit time T = 500 s is set for the autonomous vehicle perception experiment. The size of each RGB image sample is V₁ = V₂ = 0.7 MB, and that of each point cloud sample is V₃ = 1.6 MB. The number of historical data samples is A₁ = A₂ = A₃ = 300. By fitting the error function to the historical data, we obtain the model parameters (a₁, b₁) = (10.34, 1.2), (a₂, b₂) = (8.89, 0.64), and (a₃, b₃) = (0.5, 0.1) for Tasks 1, 2, and 3, respectively. It can be seen from Fig. 7 that the three fitting curves match the experimental data very well. Note that with a smaller Aᵢ, the estimated parameters (aᵢ, bᵢ) may be less accurate. However, such parameters can still drive the power allocation efficiently, since our goal is to distinguish different tasks rather than to predict learning errors exactly.
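The error-function fitting mentioned above can be illustrated as follows: with the model err(n) ≈ a·n^(−b), the parameters are recoverable by linear least squares in log-log space. This is our own sketch (assuming NumPy), not the authors' fitting code; the data here are synthetic, generated from the Task-2 parameters (a₂, b₂) = (8.89, 0.64) quoted above.

```python
import numpy as np

def fit_error_curve(num_samples, errors):
    """Fit err(n) ~ a * n**(-b) by linear least squares in log-log space:
    log err = log a - b * log n."""
    slope, intercept = np.polyfit(np.log(num_samples), np.log(errors), 1)
    return np.exp(intercept), -slope   # (a, b)

# Synthetic check: noiseless data from a known (a, b) should be recovered.
n = np.array([50, 100, 200, 400, 800])
a_true, b_true = 8.89, 0.64            # Task-2 parameters from the text
err = a_true * n ** (-b_true)
a_hat, b_hat = fit_error_curve(n, err)
```

With noisy historical samples, the recovered (a, b) are only approximate, which matches the remark that a smaller Aᵢ yields less accurate estimates.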

TABLE III: Performance comparison.

    Algorithm            Complexity             Convergence rate   Parallelism(a)   MLE
    Sum-rate max. [31]   𝒪((K−1)^7)             𝒪(1/log τ)         ✗                high
    MM-based LCPA [17]   𝒪((I+K^2+K)^3.5)       𝒪(1/log τ)         ✗                low
    Algorithm 1          𝒪((K^2+K)/I)           𝒪(1/τ)             ✓                low
    Algorithm 2          𝒪((K^2+K)/I)           𝒪(1/τ^2)           ✓                low

    (a) The tick "✓" indicates a supported functionality, whereas the cross "✗" indicates an unsupported one.

Figure 7: Learning error vs. the number of samples.
Figure 8: Qualitative and quantitative results of multi-task perception for autonomous driving.

The top panel of Fig. 8 compares the perception accuracies of the proposed and benchmark algorithms. First, the actual perception accuracies obtained from the machine learning experiments coincide with the predicted ones obtained from the error functions, for all tasks and simulated schemes. Second, the proposed algorithm achieves significantly higher average perception accuracy than the MM-LCPA and sum-rate maximization schemes. This is because the proposed algorithm is a task-oriented scheme that computes the "learning curve", i.e., the derivative of the learning error w.r.t. the number of samples, for each task by leveraging the associated fitted error function. As such, it automatically allocates more power to a task with a steeper learning curve, since that task needs more samples to train its learning model. In our experiment, Task 2 has the steepest learning curve, as seen from Fig. 7. Accordingly, the proposed algorithm allocates more power to Task 2 and achieves the highest perception accuracy. In contrast, the MM-LCPA and sum-rate maximization schemes give more power to Tasks 1 and 3, whose learning errors saturate when the number of samples exceeds 400. Therefore, these benchmark schemes are less learning-efficient than the proposed scheme.
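The prioritization rule described above can be sketched numerically: for err(n) = a·n^(−b), the learning-curve slope magnitude is |d err/dn| = a·b·n^(−b−1), and the task with the steepest slope receives more power. The snippet below is our own illustration using the fitted (aᵢ, bᵢ) quoted in Section V-C; the evaluation point n = 300 is an assumption of ours.

```python
def curve_slope(a, b, n):
    """Magnitude of the learning-curve slope: |d/dn [a * n**(-b)]| = a*b*n**(-b-1)."""
    return a * b * n ** (-b - 1)

# Fitted (a_i, b_i) for Tasks 1-3 from the text; slopes evaluated at n = 300.
tasks = {"task1": (10.34, 1.2), "task2": (8.89, 0.64), "task3": (0.5, 0.1)}
slopes = {name: curve_slope(a, b, 300) for name, (a, b) in tasks.items()}
steepest = max(slopes, key=slopes.get)   # the task that should get more power
```

Evaluating the slopes at n = 300 indeed singles out Task 2, consistent with the discussion of Fig. 7.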

Lastly, the qualitative results of the different schemes are shown in the bottom panel of Fig. 8. There are three traffic lights and two traffic signs at the T-junction. The proposed Algorithm 1 successfully detects all the objects in the image. The MM-LCPA scheme fails to detect a far-away traffic sign and a traffic light (occluded by the wall), while misclassifying a door as a traffic sign. The sum-rate maximization scheme fails to detect a far-away traffic sign and also misclassifies a door as a traffic sign. The reason is that the proposed Algorithm 1 obtains more samples for multiple tasks under the task-oriented principle than the other schemes, whereas the MM-LCPA algorithm focuses on only one of the tasks, even if that task is unimportant, and the sum-rate maximization scheme may fail to collect data for multiple tasks since it ignores task-relevant information.

VI Conclusions

This paper has developed a task-oriented power allocation model to process distinct learning datasets in large-scale IoT networks, especially for multi-task multi-modal scenarios. To deal with massive connectivity, a multi-user scheduling algorithm has been designed to mitigate co-channel interference and to decouple multi-user scheduling from power allocation. Moreover, highly parallel and accelerated algorithms have been designed to solve the multi-objective and large-scale optimization problems. Extensive experimental results have shown that multi-user scheduling effectively mitigates the influence of interference in dense networks, and that the parallel algorithm and its accelerated version serve different learning tasks efficiently, including the real-world multi-task multi-modal scenario of autonomous vehicle perception. In practice, the proposed algorithms can be deployed at the edge, e.g., at the gateway of a large-scale IoT network, which can then inform the users of their transmit powers and other parameters through the downlink control channel, e.g., the narrowband physical downlink control channel in NB-IoT networks. However, as the offline-learning mode is not adaptive to a real-time wireless environment, developing an online-learning counterpart is a promising direction for future work.

Appendix A Proof of Proposition 1

Substituting δk𝒦kG~k,p+σ2\delta_{k}\triangleq\sum_{\ell\in\mathcal{K}\setminus k}\tilde{G}_{k,\ell}p_{\ell}+\sigma^{2} and G~k,kwkGk,k\tilde{G}_{k,k}\triangleq w_{k}{G}_{k,k} into the cost function of 𝒫2\mathcal{P}2, and performing some algebraic manipulations, we obtain

𝒫2a:min{wk}k𝒦i\displaystyle\mathcal{P}2a:\min_{\{w_{k}\}_{k\in\mathcal{K}_{i}}}\, ai(BTVik𝒦iwklog2(1+G~k,kpkwkδk)+Ai)bi\displaystyle a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}w_{k}\log_{2}\left(1+\dfrac{\tilde{G}_{k,k}p_{k}}{w_{k}\delta_{k}}\right)+A_{i}\right)^{-b_{i}} (A.1a)
s.t.\displaystyle{\rm s.t.}\ 0<wk1,\displaystyle 0<w_{k}\leq 1, (A.1b)
k𝒦iwkNi.\displaystyle\sum_{k\in\mathcal{K}_{i}}w_{k}\leq N_{i}. (A.1c)

In light of the non-increasing characteristic of a_{i}x^{-b_{i}} for x>0, and the sparsity constraint (A.1c), \mathcal{P}2a can be transformed into its equivalent penalized form:

𝒫2b:min𝒘i\displaystyle\mathcal{P}2b:\min_{\bm{w}_{i}}\, k𝒦iwkln(1+G~k,kpkδkwk)+νik𝒦iwk\displaystyle-\sum_{k\in\mathcal{K}_{i}}w_{k}\ln\left(1+\dfrac{\tilde{G}_{k,k}p_{k}}{\delta_{k}w_{k}}\right)+\nu_{i}\sum_{k\in\mathcal{K}_{i}}w_{k} (A.2a)
s.t.\displaystyle{\rm s.t.}\ (A.1b),(A.1c),\displaystyle\eqref{SA-EQ-A1a},\,\eqref{SA-EQ-A1b},

where 𝒘i[wi1,wi2,,wi|𝒦i|]T\bm{w}_{i}\triangleq\left[w_{i_{1}},w_{i_{2}},\cdots,w_{i_{\left|\mathcal{K}_{i}\right|}}\right]^{T}, and νi>0\nu_{i}>0 is a tuning parameter for the sparsity regulation.

Next, by setting the objective function of (A.2a) as J(w_{k})\triangleq-w_{k}\ln\left(1+{\tilde{G}_{k,k}p_{k}}/{(\delta_{k}w_{k})}\right)+\nu_{i}w_{k}, it follows that {\partial J(w_{k})}/{\partial w_{k}}=-\ln\left(1+{\tilde{G}_{k,k}p_{k}}/{(\delta_{k}w_{k})}\right)+{G_{k,k}p_{k}}/{(G_{k,k}p_{k}+\delta_{k})}+\nu_{i}. Setting \partial J(w_{k})/\partial w_{k}=0, we obtain

w^k=G~k,kpkδk(exp(Gk,kpkδk+Gk,kpk+νi)1).\hat{w}_{k}=\dfrac{\tilde{G}_{k,k}p_{k}}{\delta_{k}\left(\exp\left(\frac{G_{k,k}p_{k}}{\delta_{k}+G_{k,k}p_{k}}+\nu_{i}\right)-1\right)}. (A.3)

Considering 0<w_{k}\leq 1, there are three cases of \hat{w}_{k} to account for:

  • 1) if \hat{w}_{k}\leq\epsilon, the minimum of J(w_{k}) is attained at w_{k}=\epsilon;
  • 2) if \epsilon<\hat{w}_{k}<1, the minimum is attained at w_{k}=\hat{w}_{k};
  • 3) if \hat{w}_{k}\geq 1, the minimum is attained at w_{k}=1.

As a result, the optimization point is given by (6). This completes the proof.
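Proposition 1 amounts to evaluating the closed-form stationary point (A.3) and projecting it onto [ϵ, 1] according to the three cases. The sketch below is our own illustration with hypothetical variable names; the lower bound ϵ = 10⁻³ is an assumed value.

```python
import math

def scheduling_weight(G_kk, p_k, delta_k, nu, w_prev, eps=1e-3):
    """Closed-form minimizer (A.3) of J(w_k), projected onto [eps, 1].
    G_tilde follows the definition G~_{k,k} = w_k * G_{k,k}, evaluated at the
    previous weight w_prev."""
    G_tilde = w_prev * G_kk
    w_hat = (G_tilde * p_k
             / (delta_k * (math.exp(G_kk * p_k / (delta_k + G_kk * p_k) + nu) - 1)))
    # Three cases: clip the unconstrained minimizer into [eps, 1].
    return min(max(w_hat, eps), 1.0)
```

A larger sparsity penalty ν drives the weight toward ϵ (the user is deactivated), whereas a strong direct channel pushes it toward 1, which is exactly the scheduling behavior exploited in Section V.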

Appendix B Proof of Proposition 2

The Lagrange multiplier βi\beta_{i} for 𝟭T𝒑i=Pi\bm{\mathsf{1}}^{T}\bm{p}_{i}=P_{i} is given by

βi(t+1)=βi(t)+μ(t)(𝟭T𝒑i(t+1)Pi(t+1)),\beta_{i}(t+1)=\beta_{i}(t)+\mu(t)\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)-P_{i}(t+1)\right), (B.1)

where 𝒑i(t+1)\bm{p}_{i}(t+1) and Pi(t+1)P_{i}(t+1) are obtained by the minimization of the ALF given by (11). These minimizations concerning 𝒑i\bm{p}_{i} and PiP_{i} are computed iteratively:

𝒑i\displaystyle\bm{p}_{i} =argmin𝒑i𝟎λiΦi(𝐩i)+βi(t)(𝟭T𝐩iPi)\displaystyle=\underset{\bm{p}_{i}\succeq\bm{0}}{\rm argmin}\,\lambda_{i}\Phi_{i}(\bm{p}_{i})+\beta_{i}(t)\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)
+μ(t)2(𝟭T𝒑iPi)2,\displaystyle\quad{}+\dfrac{\mu(t)}{2}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}, (B.2a)
Pi\displaystyle P_{i} =argmin{Pi|iPi=P}{iβi(t)Pi+μ(t)2i(𝟭T𝐩iPi)2},\displaystyle=\underset{\left\{P_{i}\left|\underset{i\in\mathcal{I}}{\sum}P_{i}=P\right.\right\}}{\rm argmin}\left\{-\sum_{i\in\mathcal{I}}\beta_{i}(t)P_{i}+\dfrac{\mu(t)}{2}\sum_{i\in\mathcal{I}}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}-P_{i}\right)^{2}\right\}, (B.2b)

where

Φi(𝒑i)ai(BTVik𝒦iw~kR~k+Ai)bi.\Phi_{i}(\bm{p}_{i})\triangleq a_{i}\left(\dfrac{BT}{V_{i}}\sum_{k\in\mathcal{K}_{i}}\tilde{w}_{k}\tilde{R}_{k}+A_{i}\right)^{-b_{i}}.

Note that the minimization with respect to \{P_{i}\left|i\in\mathcal{I}\right.\} in (B.2b) involves a separable quadratic cost and a single equality constraint, and can thus be carried out analytically. Given the optimal values \bm{p}_{i}(t+1), the optimal value P_{i}(t+1) in (B.2b) is given in closed form by

Pi(t+1)=𝟭T𝒑i(t+1)+βi(t)β(t+1)μ(t),P_{i}(t+1)=\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)+\dfrac{\beta_{i}(t)-\beta(t+1)}{\mu(t)}, (B.3)

where β(t+1)\beta(t+1) is a scalar Lagrange multiplier subject to iPi=P\sum_{i\in\mathcal{I}}P_{i}=P, and it is determined by

β(t+1)=1Iiβi(t)+μ(t)I(𝟭T𝒑(t+1)P).\beta(t+1)=\dfrac{1}{I}\sum_{i\in\mathcal{I}}\beta_{i}(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right). (B.4)

By comparing (B.3) with (B.1), we see that

βi(t+1)=β(t+1).\beta_{i}(t+1)=\beta(t+1). (B.5)

Then, summing (B.1) up for all ii\in\mathcal{I} yields

β(t+1)\displaystyle\beta(t+1) =β(t)+μ(t)Ii(𝟭T𝒑i(t+1)Pi(t+1))\displaystyle=\beta(t)+\dfrac{\mu(t)}{I}\sum_{i\in\mathcal{I}}\left(\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)-P_{i}(t+1)\right) (B.6a)
=β(t)+(β(t+1)1Iiβi(t))\displaystyle=\beta(t)+\left(\beta(t+1)-\dfrac{1}{I}\sum_{i\in\mathcal{I}}\beta_{i}(t)\right) (B.6b)
=β(t)+μ(t)I(𝟭T𝒑(t+1)P),\displaystyle=\beta(t)+\dfrac{\mu(t)}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right), (B.6c)

where (B.6b)-(B.6c) are derived by (B.3)-(B.4), respectively, and PiP_{i} is updated by

Pi(t+1)=𝟭T𝒑i(t+1)1I(𝟭T𝒑(t+1)P),P_{i}(t+1)=\bm{\mathsf{1}}^{T}\bm{p}_{i}(t+1)-\dfrac{1}{I}\left(\bm{\mathsf{1}}^{T}\bm{p}(t+1)-P\right), (B.7)

where (B.7) is derived from (B.3) and (B.4). Hence, (12a) and (13a) are immediately proved.
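Updates (B.6c) and (B.7) admit a direct numerical sanity check: after the update, the per-task budgets P_i(t+1) sum exactly to P. The sketch below uses our own function and variable names.

```python
def update_multiplier_and_budgets(beta, mu, per_task_powers, P):
    """One step of (B.6c) and (B.7): the scalar multiplier beta tracks the
    total-power residual, and each per-task budget P_i absorbs 1/I of it."""
    totals = [sum(p) for p in per_task_powers]   # 1^T p_i for each task i
    residual = sum(totals) - P                   # 1^T p - P
    I = len(per_task_powers)
    beta_new = beta + (mu / I) * residual        # (B.6c)
    budgets = [t - residual / I for t in totals] # (B.7)
    return beta_new, budgets
```

Since the budgets satisfy Σᵢ Pᵢ(t+1) = Σᵢ 𝟙ᵀpᵢ(t+1) − (𝟙ᵀp(t+1) − P) = P, the total-power constraint holds after every update by construction.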

Next, we derive (12b) and (13b). Similar to (B.1), we consider the Lagrange multipliers \bm{\alpha}^{\prime}_{i}, for which the method of multipliers yields

𝜶i(t+1)=𝜶i(t)+μ(t)(𝚫(:,𝒦i)𝒑i(t+1)𝒛i(t+1)),\bm{\alpha}^{\prime}_{i}(t+1)=\bm{\alpha}^{\prime}_{i}(t)+\mu(t)\left(\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)-\bm{z}_{i}(t+1)\right), (B.8)

where 𝒑i(t+1)\bm{p}_{i}(t+1), 𝜹i(t+1)\bm{\delta}_{i}(t+1), and 𝒛i(t+1)\bm{z}_{i}(t+1) are obtained by the minimization of the ALF (11). Similar to (B.4), a Lagrange multiplier vector 𝜶\bm{\alpha} is shown below:

𝜶(t+1)\displaystyle\bm{\alpha}(t+1)
=1Ii𝜶i(t)+μ(t)I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭)\displaystyle=\dfrac{1}{I}\sum_{i\in\mathcal{I}}\bm{\alpha}^{\prime}_{i}(t)+\dfrac{\mu(t)}{I}\left(\bm{\Delta}{\bm{p}}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right) (B.9a)
=𝜶(t)+μ(t)I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭),\displaystyle=\bm{\alpha}(t)+\dfrac{\mu(t)}{I}\left(\bm{\Delta}{\bm{p}}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right), (B.9b)

where (B.9b) is obtained by 𝜶i(t+1)=𝜶(t+1)\bm{\alpha}^{\prime}_{i}(t+1)=\bm{\alpha}(t+1). Moreover, we obtain the following optimization solution involving i=1I𝒛i=𝜹σ2𝟭\sum_{i=1}^{I}\bm{z}_{i}=\bm{\delta}-\sigma^{2}\bm{\mathsf{1}}:

𝒛i(𝜹(t+1))\displaystyle\bm{z}_{i}(\bm{\delta}(t+1))
=𝚫(:,𝒦i)𝒑i(t+1)+1μ(t)(𝜶i(t)𝜶(t+1))\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)+\dfrac{1}{\mu(t)}(\bm{\alpha}^{\prime}_{i}(t)-\bm{\alpha}(t+1)) (B.10a)
=𝚫(:,𝒦i)𝒑i(t+1)+1μ(t)(𝜶(t)𝜶(t+1))\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)+\dfrac{1}{\mu(t)}\left(\bm{\alpha}(t)-\bm{\alpha}(t+1)\right) (B.10b)
=𝚫(:,𝒦i)𝒑i(t+1)1I(𝚫𝒑(t+1)𝜹(t+1)+σ2𝟭),\displaystyle=\bm{\Delta}\left(:,\,\mathcal{K}_{i}\right){\bm{p}}_{i}(t+1)-\dfrac{1}{I}\left(\bm{\Delta}{\bm{p}}(t+1)-\bm{\delta}(t+1)+\sigma^{2}\bm{\mathsf{1}}\right), (B.10c)

where (B.10b) is obtained by 𝜶i(t+1)=𝜶(t+1)\bm{\alpha}^{\prime}_{i}(t+1)=\bm{\alpha}(t+1), and (B.10c) by (B.9b).

Appendix C Proof of Lemma 1

First, we prove part i) of Lemma 1:

𝒙Φi(𝒙|𝜹i)𝒚Φi(𝒚|𝜹i)2\displaystyle\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{2} (C.1)
N1𝒙Φi(𝒙|𝜹i)𝒚Φi(𝒚|𝜹i)Lp𝒙𝒚2,\displaystyle\leq N_{1}\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{\infty}\leq L_{p}\|\bm{x}-\bm{y}\|_{2},

where N_{1}\triangleq\|\nabla_{\bm{x}}\Phi_{i}(\bm{x}|\bm{\delta}_{i}^{*})-\nabla_{\bm{y}}\Phi_{i}(\bm{y}|\bm{\delta}_{i}^{*})\|_{0}^{1/2} is a bounded constant and L_{p} is a positive constant. Following the definition in [17, Lemma 1], (C.1) is obtained in a straightforward manner.

To prove part ii) of Lemma 1, we notice that xkΦi(𝒑i|𝒙)\nabla_{x_{k}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x}) can be rewritten as xkΦi(𝒑i|𝒙)=h(𝒙)g(xk)\nabla_{x_{k}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})=h(\bm{x})g(x_{k}), with the auxiliary functions

h(𝒙)=biai(𝒦iBTViw~log2(1+G,px)+Ai)bi1,\displaystyle h(\bm{x})=b_{i}a_{i}\left(\sum_{\ell\in\mathcal{K}_{i}}\dfrac{BT}{V_{i}}\tilde{w}_{\ell}\log_{2}\left(1+\dfrac{G_{\ell,\ell}p_{\ell}^{*}}{x_{\ell}}\right)+A_{i}\right)^{-b_{i}-1}, (C.2a)
g(xk)=BTw~kGk,kpkln(2)Vixk(xk+Gk,kpk).\displaystyle g(x_{k})=\dfrac{BT\tilde{w}_{k}G_{k,k}p_{k}^{*}}{\ln(2)V_{i}x_{k}(x_{k}+G_{k,k}p_{k}^{*})}. (C.2b)

where xkx_{k} denotes the kthk^{\rm th} entry of 𝒙\bm{x}. The assumption Φi(𝒑i|𝜹i)u0\Phi_{i}(\bm{p}_{i}^{*}|\bm{\delta}_{i})\leq u_{0} gives

𝒦iBTViw~log2(1+G,px)+Ai(aiu0)1/bi,\sum_{\ell\in\mathcal{K}_{i}}\dfrac{BT}{V_{i}}\tilde{w}_{\ell}\log_{2}\left(1+\dfrac{G_{\ell,\ell}p_{\ell}^{*}}{x_{\ell}}\right)+A_{i}\geq\left(\dfrac{a_{i}}{u_{0}}\right)^{1/b_{i}}, (C.3)

then, we have

|h(𝒙)|aibi(u0ai)1+1/bi,\displaystyle|h(\bm{x})|\leq a_{i}b_{i}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+1/b_{i}}, (C.4a)
|g(xk)|BTU0ln(2)Viσ2(σ2+U0),\displaystyle|g(x_{k})|\leq\dfrac{BTU_{0}}{\ln(2)V_{i}\sigma^{2}(\sigma^{2}+U_{0})}, (C.4b)

where (C.4b) follows from U_{0}\geq G_{k,k}p_{k} and x_{k}\geq\sigma^{2}. Here, G_{k,\ell} follows a Gaussian distribution and p_{k}\leq P; hence, G_{k,k}p_{k} is upper bounded by U_{0} with high probability [34]. Furthermore, since h and g satisfy the Lipschitz conditions [35], we have

|h(𝒙)h(𝒚)|\displaystyle\left|h(\bm{x})-h(\bm{y})\right|
sup𝒙σ2𝟭𝒙h(𝒙)2×𝒙𝒚2\displaystyle\leq\sup_{\bm{x}\succeq\sigma^{2}\bm{\mathsf{1}}}\|\nabla_{\bm{x}}h(\bm{x})\|_{2}\times\|\bm{x}-\bm{y}\|_{2}
Kaibi(bi+1)BTU0ln(2)IViσ2(σ2+U0)(u0ai)1+2/bi𝒙𝒚2,\displaystyle\leq\dfrac{Ka_{i}b_{i}(b_{i}+1)BTU_{0}}{\ln(2)IV_{i}\sigma^{2}(\sigma^{2}+U_{0})}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+2/b_{i}}\|\bm{x}-\bm{y}\|_{2}, (C.5a)
|g(xk)g(yk)|\displaystyle|g(x_{k})-g(y_{k})|
supxkσ2|xkg(xk)|×|xkyk|\displaystyle\leq\sup_{x_{k}\geq\sigma^{2}}|\nabla_{x_{k}}g(x_{k})|\times|x_{k}-y_{k}|
BTU0(2σ2+U0)ln(2)Viσ4(σ2+U0)2|xkyk|\displaystyle\leq\dfrac{BTU_{0}(2\sigma^{2}+U_{0})}{\ln(2)V_{i}\sigma^{4}(\sigma^{2}+U_{0})^{2}}|x_{k}-y_{k}|
BTU0(2σ2+U0)ln(2)Viσ4(σ2+U0)2𝒙𝒚2.\displaystyle\leq\dfrac{BTU_{0}(2\sigma^{2}+U_{0})}{\ln(2)V_{i}\sigma^{4}(\sigma^{2}+U_{0})^{2}}\|\bm{x}-\bm{y}\|_{2}. (C.5b)

As a result, the following inequality is obtained:

𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)\displaystyle\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{\infty}
supk𝒦i|h(𝒙)||g(xk)g(yk)|+|h(𝒙)h(𝒚)||g(xk)|\displaystyle\leq\sup_{k\in\mathcal{K}_{i}}|h(\bm{x})||g(x_{k})-g(y_{k})|+\left|h(\bm{x})-h(\bm{y})\right||g(x_{k})|
L2𝒙𝒚2,\displaystyle\leq L_{2}\|\bm{x}-\bm{y}\|_{2},

where the first inequality is due to |ab+cd||a||b|+|c||d||ab+cd|\leq|a||b|+|c||d|, and the second inequality is obtained from (C.4a), (C.4b), (C.5a), and (C.5b); also, L2L_{2} is defined as

L2\displaystyle L_{2} aibiBTU0(2σ2+U0)ln(2)Viσ4(σ2+U0)2(u0ai)1+1/bi\displaystyle\triangleq\dfrac{a_{i}b_{i}BTU_{0}(2\sigma^{2}+U_{0})}{\ln(2)V_{i}\sigma^{4}(\sigma^{2}+U_{0})^{2}}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+1/b_{i}} (C.6)
+Kaibi(bi+1)B2T2U02ln2(2)IVi2σ4(σ2+U0)2(u0ai)1+2/bi.\displaystyle\quad{}+\,\dfrac{Ka_{i}b_{i}(b_{i}+1)B^{2}T^{2}U_{0}^{2}}{\ln^{2}(2)IV_{i}^{2}\sigma^{4}(\sigma^{2}+U_{0})^{2}}\left(\dfrac{u_{0}}{a_{i}}\right)^{1+2/b_{i}}.

Thus, the gradient function 𝜹iΦi(𝒑i|𝜹)\nabla_{\bm{\delta}_{i}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{\delta}) satisfies the following inequality:

𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)L2𝒙𝒚2.\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{\infty}\leq L_{2}\|\bm{x}-\bm{y}\|_{2}. (C.7)

Based on (C.7), we have

𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)2\displaystyle\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{2} (C.8)
N2𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)Lf𝒙𝒚2,\displaystyle\leq N_{2}\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{\infty}\leq L_{f}\|\bm{x}-\bm{y}\|_{2},

where N2𝒙Φi(𝒑i|𝒙)𝒚Φi(𝒑i|𝒚)01/2N_{2}\triangleq\|\nabla_{\bm{x}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{x})-\nabla_{\bm{y}}\Phi_{i}(\bm{p}_{i}^{*}|\bm{y})\|_{0}^{1/2} and LfN2L2L_{f}\triangleq N_{2}L_{2}. This completes the proof.

References

  • [1] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource-constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
  • [2] Z. Dawy, W. Saad, A. Ghosh, J. G. Andrews, and E. Yaacoub, “Toward massive machine type cellular communications,” IEEE Wireless Commun., vol. 24, no. 1, pp. 120–128, Feb. 2017.
  • [3] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, Jan. 2020.
  • [4] S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya, “Edge intelligence: The confluence of edge computing and artificial intelligence,” IEEE Internet Things J., vol. 7, no. 8, pp. 7457–7469, Aug. 2020.
  • [5] M. Chen, N. Shlezinger, H. V. Poor, Y. C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proc. Nat. Acad. Sci. USA, vol. 118, no. 17, Apr. 2021, Art. no. e2024789118.
  • [6] K. B. Letaief, Y. Shi, J. Lu, and J. Lu, “Edge artificial intelligence for 6G: Vision, enabling technologies, and applications,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 5–36, Jan. 2022.
  • [7] A. Badi and I. Mahgoub, “ReapIoT: Reliable, energy-aware network protocol for large-scale Internet-of-Things (IoT) applications,” IEEE Internet Things J., vol. 8, no. 17, pp. 13 582–13 592, Sept. 2021.
  • [8] W. Cui, K. Shen, and W. Yu, “Spatial deep learning for wireless scheduling,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1248–1261, Jun. 2019.
  • [9] E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: On-demand accelerating deep neural network inference via edge computing,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 447–457, Jan. 2020.
  • [10] Q. Cheng, B. Chen, and P. K. Varshney, “Detection performance limits for distributed sensor networks in the presence of nonideal channels,” IEEE Trans. Wireless Commun., vol. 5, no. 11, pp. 3034–3038, Nov. 2006.
  • [11] D. Ciuonzo, P. S. Rossi, and P. K. Varshney, “Distributed detection in wireless sensor networks under multiplicative fading via generalized score tests,” IEEE Internet Things J., vol. 8, no. 11, pp. 9059–9071, Jun. 2021.
  • [12] X. Cheng, D. Ciuonzo, and P. S. Rossi, “Multibit decentralized detection through fusing smart and dumb sensors based on Rao test,” IEEE Trans. Aerosp. Electron. Syst., vol. 56, no. 2, pp. 1391–1405, Apr. 2020.
  • [13] J. Shao, Y. Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 197–211, Nov. 2022.
  • [14] D. Wen, P. Liu, G. Zhu, Y. Shi, J. Xu, Y. C. Eldar, and S. Cui, “Task-oriented sensing, computation, and communication integration for multi-device edge AI,” 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2207.00969
Haihui Xie received the B.S. and M.S. degrees in photonic and electronic engineering from Fujian Normal University, Fuzhou, China, in 2014 and 2016, respectively. He is currently pursuing the Ph.D. degree in information and communication engineering at Sun Yat-sen University, Guangzhou, China. His research interests include edge learning, optimization, and their applications in wireless communications.
Minghua Xia (Senior Member, IEEE) received the Ph.D. degree in telecommunications and information systems from Sun Yat-sen University, Guangzhou, China, in 2007. From 2007 to 2009, he was with the Electronics and Telecommunications Research Institute (ETRI) of South Korea, Beijing R&D Center, Beijing, China, where he worked as a member and then as a senior member of the engineering staff. From 2010 to 2014, he worked successively with The University of Hong Kong, Hong Kong, China; King Abdullah University of Science and Technology, Jeddah, Saudi Arabia; and the Institut National de la Recherche Scientifique (INRS), University of Quebec, Montreal, Canada, as a Postdoctoral Fellow. Since 2015, he has been a Professor at Sun Yat-sen University. Since 2019, he has also been an Adjunct Professor with the Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai). His research interests are in the general areas of wireless communications and signal processing.
Peiran Wu (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from The University of British Columbia (UBC), Vancouver, Canada, in 2015. From October 2015 to December 2016, he was a Postdoctoral Fellow at UBC. In the summer of 2014, he was a Visiting Scholar with the Institute for Digital Communications, Friedrich-Alexander-University Erlangen-Nuremberg (FAU), Erlangen, Germany. Since February 2017, he has been with Sun Yat-sen University, Guangzhou, China, where he is currently an Associate Professor. Since 2019, he has been an Adjunct Associate Professor with the Southern Marine Science and Engineering Guangdong Laboratory, Zhuhai, China. His research interests include mobile edge computing, wireless power transfer, and energy-efficient wireless communications. Dr. Wu was a recipient of the Four-Year Fellowship in 2010, the C. L. Wang Memorial Fellowship in 2011, the Graduate Support Initiative (GSI) Award from UBC in 2014, the German Academic Exchange Service (DAAD) Scholarship in 2014, and the Chinese Government Award for Outstanding Self-Financed Students Abroad in 2014.
Shuai Wang (Member, IEEE) received the Ph.D. degree in electrical and electronic engineering from The University of Hong Kong (HKU) in 2018. From 2018 to 2021, he was a Postdoctoral Fellow at HKU and then a Research Assistant Professor at the Southern University of Science and Technology (SUSTech). He is currently an Associate Professor with the Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences. His research interests include autonomous driving, machine learning, and communication networks. Dr. Wang has published more than 40 journal papers and 20 conference papers. He has received various awards from IEEE ICC, IEEE SPCC, IEEE ICCCS, IEEE TWC, IEEE WCL, and the National 5G Competition.
H. Vincent Poor (Life Fellow, IEEE) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990, he has been on the faculty at Princeton, where he is currently the Michael Henry Strater University Professor. From 2006 to 2016, he served as the dean of Princeton's School of Engineering and Applied Science. He has also held visiting appointments at several other universities, including most recently at Berkeley and Cambridge. His research interests are in the areas of information theory, machine learning and network science, and their applications in wireless networks, energy systems, and related fields. Among his publications in these areas is the recent book Machine Learning and Wireless Communications (Cambridge University Press, 2022). Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences and is a foreign member of the Chinese Academy of Sciences, the Royal Society, and other national and international academies. He received the IEEE Alexander Graham Bell Medal in 2017.