
Quantized Analog Beamforming Enabled Multi-task Federated Learning Over-the-air

Jiacheng Yao$^{1,2}$, Wei Xu$^{1,2}$, Guangxu Zhu$^{3}$, Zhaohui Yang$^{4}$, Kaibin Huang$^{5}$, and Dusit Niyato$^{6}$
$^{1}$National Mobile Communications Research Laboratory, Southeast University, Nanjing, China
$^{2}$Purple Mountain Laboratories, Nanjing, China
$^{3}$Shenzhen Research Institute of Big Data, Shenzhen, China
$^{4}$College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
$^{5}$Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
$^{6}$School of Computer Science and Engineering, Nanyang Technological University, Singapore
Emails: {jcyao, wxu}@seu.edu.cn, gxzhu@sribd.cn, yang_zhaohui@zju.edu.cn

Abstract

Over-the-air computation (AirComp) has recently emerged as a pivotal technique for communication-efficient federated learning (FL) in resource-constrained wireless networks. Although AirComp leverages the superposition property of multiple access channels for computation, this same property inherently limits its ability to manage inter-task interference in multi-task computing. In this paper, we propose a quantized analog beamforming scheme at the receiver to enable simultaneous multi-task FL. Specifically, inspired by the favorable propagation and channel hardening properties of large-scale antenna arrays, we propose a targeted closed-form analog beamforming method for statistical interference elimination. Analytical results reveal that the interference power vanishes at the order of $\mathcal{O}\left(1/N_{r}\right)$ with the number of analog phase shifters, $N_{r}$, irrespective of their quantization precision. Numerical results demonstrate the effectiveness of the proposed analog beamforming method and show that the performance upper bound of ideal learning without errors can be approached by increasing the number of low-precision analog phase shifters.

Index Terms:
Multi-task federated learning, over-the-air computation (AirComp), analog beamforming.

I Introduction

Over-the-air computation (AirComp) has become a promising technique for enabling communication-efficient federated learning (FL) over resource-constrained wireless networks [1, 2, 3]. Specifically, by exploiting the superposition property of multiple access channels (MACs), the signals are amplitude-modulated and simultaneously transmitted over the same radio resource, and hence the summation is automatically achieved over the air. Therefore, applying AirComp to FL tightly integrates uplink gradient transmission with model aggregation, which alleviates the bottleneck of limited bandwidth [4].

However, existing AirComp-enabled FL (AirFL) designs have primarily focused on single-antenna systems, which limits them to handling at most one computational task at a receiver. Designs for multi-antenna systems have mainly focused on improving the diversity gain through beamforming, which allows more devices to access the network and minimizes the mean squared error (MSE) of gradient calculation [5]. The potential multiplexing capability of multi-antenna systems to support multi-task computing has rarely been considered. Yet emerging practical applications of FL, notably personalized FL scenarios [6, 7], urgently call for multi-task computing.

To better support multi-task FL, it is necessary to explore the implementation of multi-task AirComp based on typical multiple-input multiple-output (MIMO) systems. However, the additional inter-task interference becomes a decisive factor in determining the computational performance. Moreover, due to the non-orthogonal multiple access nature of AirComp, it is challenging to directly deploy the classical zero-forcing (ZF) receiver to eliminate interference. To facilitate the implementation of multi-task AirComp, various techniques have been proposed in recent years. For instance, a large number of antennas at the transceivers is considered in [8], together with coordinated ZF beamforming for complete interference elimination. To reduce the number of antennas, the authors in [9] combined beamforming optimization with device scheduling, which reduces the number of active devices and ensures interference-free computation. Without complete interference elimination, the spatial correlation between devices was exploited in [10] to develop a joint transceiver beamforming and device selection design for task-oriented interference suppression.

Nevertheless, deploying a large number of antennas to tackle inter-task interference results in significant hardware cost. Meanwhile, the training process of FL tasks has inherent robustness to noise [11], which reduces the necessity for complete interference elimination. Motivated by this background, in this paper, we adopt a low-cost hybrid beamforming architecture with limited radio frequency (RF) chains as a solution [12], and propose a closed-form analog beamforming method to enable multiple FL tasks simultaneously without channel state information at the transmitter (CSIT). Our theoretical analysis shows that the power of the inter-task interference vanishes at the order of $\mathcal{O}\left(1/N_{r}\right)$ with the number of analog phase shifters, $N_{r}$, irrespective of their quantization precision. The numerical results reveal that, by employing the proposed analog beamforming method, all FL tasks can approach the upper-bound performance of ideal learning with error-free transmission.

II System Model

In this paper, we consider a wireless FL system with $N$ individual FL tasks. Without loss of generality, we assume that $K$ distributed devices are evenly divided into $N$ clusters, so that each cluster contains $L=\frac{K}{N}$ users without overlap. A central parameter server (PS) is deployed to coordinate the parallel training of the $N$ FL tasks. The training procedure and transmission model are elaborated in the sequel.

II-A Federated Learning Model

Let $\mathcal{D}_{n,l}$, $\forall n\in[N],l\in[L]$, denote the local dataset of the $l$-th device in the $n$-th FL task, and let $D_{n,l}$ denote the number of training samples in $\mathcal{D}_{n,l}$. Then, the local loss function of the $l$-th device in the $n$-th FL task is given by $f_{n,l}(\mathbf{v}_{n},\mathcal{D}_{n,l})=\frac{1}{D_{n,l}}\sum_{\mathbf{u}\in\mathcal{D}_{n,l}}\mathcal{L}(\mathbf{v}_{n},\mathbf{u})$, where $\mathbf{u}$ is a data sample from the local dataset and $\mathcal{L}(\mathbf{v}_{n},\mathbf{u})$ represents the sample-wise loss function quantifying the prediction error of model parameter $\mathbf{v}_{n}$ on data sample $\mathbf{u}$. The learning process aims to optimize the model parameter $\mathbf{v}_{n}\in\mathbb{R}^{d}$ for the $n$-th task by minimizing the global loss function defined as $F_{n}(\mathbf{v}_{n})=\sum_{l=1}^{L}\alpha_{n,l}f_{n,l}(\mathbf{v}_{n},\mathcal{D}_{n,l})$, where $\alpha_{n,l}\triangleq\frac{D_{n,l}}{\sum_{\ell=1}^{L}D_{n,\ell}}$ represents the aggregation weight for the $l$-th device. For simplicity, we assume that the models of all tasks contain the same number of parameters, denoted by $d$. For models with different sizes, zero padding can be employed to ensure a uniform size, so the framework remains applicable to general scenarios.

For model training, we employ the typical FedAvg algorithm to iteratively minimize $F_{n}(\mathbf{v}_{n})$. The main steps of the $t$-th round of the FedAvg algorithm are listed as follows.

  • 1) Model broadcasting: The PS broadcasts the latest global model $\mathbf{v}_{n,t}$ to all devices with the $n$-th FL task.

  • 2) Local computing: Based on the received global model $\mathbf{v}_{n,t}$, each device updates its local model via mini-batch stochastic gradient descent (SGD), i.e., $\mathbf{v}_{n,t,i+1}^{l}=\mathbf{v}_{n,t,i}^{l}-\eta_{n,t}\nabla f_{n,l}\left(\mathbf{v}_{n,t,i}^{l},\bm{\xi}_{n,t,i}^{l}\right)$, where $\mathbf{v}_{n,t,i}^{l}$ denotes the local model of the $l$-th device obtained after the $i$-th local iteration, $\mathbf{v}_{n,t,0}^{l}$ is initialized as $\mathbf{v}_{n,t}$, $\eta_{n,t}$ denotes the learning rate, and $\bm{\xi}_{n,t,i}^{l}$ is a local mini-batch selected from $\mathcal{D}_{n,l}$. Each device performs $E$ local iterations in a specific round and updates its local model as $\mathbf{v}_{n,t,E}^{l}$.

  • 3) Local updates uploading: After local computing, each device transfers its local update $\mathbf{g}_{n,l,t}$ to the PS, where $\mathbf{g}_{n,l,t}\triangleq\frac{\mathbf{v}_{n,t,0}^{l}-\mathbf{v}_{n,t,E}^{l}}{\eta_{n,t}}=\sum_{i=0}^{E-1}\nabla f_{n,l}\left(\mathbf{v}_{n,t,i}^{l},\bm{\xi}_{n,t,i}^{l}\right)$.

  • 4) Model aggregation: Upon receiving all the local updates, the PS calculates the global update as $\mathbf{g}_{n,t}=\sum_{l=1}^{L}\alpha_{n,l}\mathbf{g}_{n,l,t}$, and then updates the global model according to $\mathbf{v}_{n,t+1}=\mathbf{v}_{n,t}-\eta_{n,t}\mathbf{g}_{n,t}$.
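The four steps above can be sketched for a single task as follows. This is only an illustrative toy under stated assumptions: the model is a scalar ($d=1$), the sample-wise loss is the hypothetical quadratic $\frac{1}{2}(v-u)^{2}$, and the function names `local_update` and `fedavg_round` are not from the paper. Transmission is assumed ideal here, so no channel effects appear.

```python
import random

def local_update(v, data, lr, num_local_iters, batch_size, rng):
    """Steps 2-3: run E mini-batch SGD steps on the toy loss 0.5*(v - u)^2
    and return the accumulated update g = (v_0 - v_E) / lr."""
    v_local = v
    for _ in range(num_local_iters):
        batch = rng.sample(data, min(batch_size, len(data)))
        grad = sum(v_local - u for u in batch) / len(batch)  # gradient of the toy loss
        v_local -= lr * grad
    return (v - v_local) / lr

def fedavg_round(v, datasets, lr, num_local_iters=5, batch_size=4, seed=0):
    """One round for one task: broadcast v, collect local updates,
    aggregate with weights alpha_l = D_l / sum(D), and update v."""
    rng = random.Random(seed)
    total = sum(len(d) for d in datasets)
    weights = [len(d) / total for d in datasets]            # alpha_{n,l}
    updates = [local_update(v, d, lr, num_local_iters, batch_size, rng)
               for d in datasets]                           # g_{n,l,t}
    g = sum(a * u for a, u in zip(weights, updates))        # g_{n,t}
    return v - lr * g                                       # v_{n,t+1}
```

In the multi-task setting, one such loop would run per task $n$, each on its own cluster of $L$ devices.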

II-B Multi-task AirComp Model

The downlink model broadcasting is usually assumed to be error-free due to the abundant resources at the PS [4]. For the uplink, we adopt the AirComp technique for simultaneous transmission and model aggregation. Moreover, to enable parallel model aggregation for multiple FL tasks, the PS exploits hybrid analog and digital beamforming with a typical sub-connected structure [12]. In particular, the PS is equipped with $N_{\text{RF}}$ RF chains and $N_{r}$ antennas. Each RF chain connects to an exclusive set of $M$ antennas through a dedicated phase shifter, where $M=\frac{N_{r}}{N_{\text{RF}}}$. We consider the scenario with $N_{\text{RF}}=N$, where each RF chain and its connected analog array is dedicated to serving a specific FL task.

We are now ready to introduce the considered multi-task AirComp. Before transmission, local updates are first normalized as $\mathbf{s}_{n,l,t}=\frac{\mathbf{g}_{n,l,t}-\mu_{n,l,t}\mathbf{1}}{v_{n,l,t}}$, where $\mu_{n,l,t}$ and $v_{n,l,t}$ respectively represent the mean and standard deviation of all entries in $\mathbf{g}_{n,l,t}$, and $\mathbf{1}$ is the all-ones vector. Unlike traditional AirComp relying on perfect CSIT, we assume that no CSIT is available at the devices, and consequently the corresponding preprocessing, such as the typical channel inversion scheme in [13], is not performed. Hence, the received signal at the PS, $\mathbf{Y}_{t}\in\mathbb{C}^{N\times d}$, is given as follows:

$$\mathbf{Y}_{t}=\mathbf{W}_{t}\mathbf{A}_{t}\sum_{i=1}^{N}\sum_{l=1}^{L}\sqrt{p_{i,l,t}}\,\mathbf{h}_{i,l,t}\mathbf{s}_{i,l,t}^{T}+\mathbf{W}_{t}\mathbf{A}_{t}\mathbf{N}_{t}, \tag{1}$$

where $\mathbf{h}_{i,l,t}\sim\mathcal{CN}\left(\mathbf{0}_{N_{r}\times 1},\mathbf{I}_{N_{r}}\right)$ denotes the channel between the PS and the $l$-th device in the $i$-th cluster, $p_{i,l,t}$ represents its transmit power, $\mathbf{W}_{t}=[\mathbf{w}_{1,t},\mathbf{w}_{2,t},\cdots,\mathbf{w}_{N,t}]^{H}\in\mathbb{C}^{N\times N}$ and $\mathbf{A}_{t}\in\mathbb{C}^{N\times N_{r}}$ represent the digital combiner and analog beamformer, respectively, and $\mathbf{N}_{t}\in\mathbb{C}^{N_{r}\times d}$ denotes the additive Gaussian noise matrix, whose entries are independent and identically Gaussian distributed with zero mean and variance $\sigma^{2}$. With the sub-connected structure, the analog beamformer can be expressed as $\mathbf{A}_{t}=\text{blkdiag}\left\{\mathbf{a}_{1,t}^{H},\cdots,\mathbf{a}_{N,t}^{H}\right\}$, where $\mathbf{a}_{i,t}$ denotes the analog beamforming vector of the $i$-th sub-array satisfying $\left|[\mathbf{a}_{i,t}]_{j}\right|^{2}=\frac{1}{M}$, $\forall j\in[M]$. Since the $n$-th RF chain is dedicated to serving the $n$-th FL task, we exploit the processed signal at the $n$-th RF chain as an estimate of the desired weighted aggregation, $\sum_{l=1}^{L}\alpha_{n,l}v_{n,l,t}\mathbf{s}_{n,l,t}$. It follows that

$$\mathbf{y}_{n,t}^{T}=\mathbf{w}_{n,t}^{H}\sum_{i=1}^{N}\sum_{l=1}^{L}\sqrt{p_{i,l,t}}\,\mathbf{A}_{t}\mathbf{h}_{i,l,t}\mathbf{s}_{i,l,t}^{T}+\mathbf{w}_{n,t}^{H}\mathbf{A}_{t}\mathbf{N}_{t}. \tag{2}$$

The global update for the $n$-th task is then calculated as

$$\hat{\mathbf{g}}_{n,t}=\mathbf{y}_{n,t}+\sum_{l=1}^{L}\alpha_{n,l}\mu_{n,l,t}\mathbf{1}, \tag{3}$$

where the statistics of the local updates are transmitted to the PS in advance with negligible communication overhead [14]. By comparing (3) with the ideal global update $\mathbf{g}_{n,t}$, we express the error of global update estimation, $\mathbf{e}_{n,t}\triangleq\hat{\mathbf{g}}_{n,t}-\mathbf{g}_{n,t}$, as

$$\begin{aligned}\mathbf{e}_{n,t}=&\sum_{l=1}^{L}\left(\sqrt{p_{n,l,t}}\,\mathbf{w}_{n,t}^{H}\mathbf{A}_{t}\mathbf{h}_{n,l,t}-\alpha_{n,l}v_{n,l,t}\right)\mathbf{s}_{n,l,t}\\&+\sum_{i\neq n}^{N}\sum_{l=1}^{L}\sqrt{p_{i,l,t}}\,\mathbf{w}_{n,t}^{H}\mathbf{A}_{t}\mathbf{h}_{i,l,t}\mathbf{s}_{i,l,t}+\mathbf{N}_{t}^{T}\mathbf{A}_{t}^{T}\mathbf{w}_{n,t}^{*}.\end{aligned} \tag{4}$$

Note that the error in (4) comprises two main components. First, there is distortion in the aggregation weights of the desired signal. The second component consists mainly of interference from other tasks and the additive Gaussian noise. To tackle the inter-task interference, existing works [8, 9] rely on sufficient spatial degrees of freedom (DoF) to apply ZF receivers for complete interference elimination, which requires a large number of RF chains, i.e., $N_{\text{RF}}>K$. However, in the scenario considered in this paper, the number of RF chains is limited, making complete interference elimination unattainable. In the subsequent section, we introduce how analog beamforming design is utilized to statistically eliminate interference and support multi-task FL over the air.
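The normalization before transmission and the mean restoration in (3) can be illustrated with a short sketch. It assumes an ideal link (no fading, interference, or noise), so that the RF chain output equals the desired aggregation $\sum_{l}\alpha_{n,l}v_{n,l,t}\mathbf{s}_{n,l,t}$ exactly; the helper names `normalize` and `ideal_aggregate` are hypothetical, and each update is assumed to have non-constant entries so that its standard deviation is nonzero.

```python
import math

def normalize(g):
    """Return (s, mu, v): zero-mean, unit-variance symbols plus the
    entry-wise mean and (population) standard deviation of g."""
    mu = sum(g) / len(g)
    v = math.sqrt(sum((x - mu) ** 2 for x in g) / len(g))
    s = [(x - mu) / v for x in g]
    return s, mu, v

def ideal_aggregate(updates, weights):
    """Ideal-link version of (3): y = sum_l alpha_l * v_l * s_l, then the
    PS adds back sum_l alpha_l * mu_l using the pre-shared statistics."""
    stats = [normalize(g) for g in updates]
    d = len(updates[0])
    y = [sum(a * v * s[k] for a, (s, mu, v) in zip(weights, stats))
         for k in range(d)]
    mean_term = sum(a * mu for a, (s, mu, v) in zip(weights, stats))
    return [yk + mean_term for yk in y]
```

Under these assumptions the output coincides with the weighted sum $\sum_{l}\alpha_{n,l}\mathbf{g}_{n,l,t}$, which is why only the scalar statistics $\mu_{n,l,t}$ and $v_{n,l,t}$ need to be shared separately.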

III Statistical Interference Elimination via Analog Beamforming

Unlike traditional communication tasks, the introduction of zero-mean noise does not necessarily have a serious impact on the SGD process while it can sometimes even impose a positive effect [15]. However, noise with non-zero mean inevitably introduces bias to the SGD process, leading to performance deterioration and potential failure of training convergence. These observations encourage us to avoid unnecessary resource overhead for complete interference elimination. Instead, we rely on low-cost analog beamforming to statistically eliminate the impact of inter-task interference.

III-A Typical Fully-Digital Beamforming Scheme

Favorable propagation is a unique characteristic of large-scale MIMO systems. When the number of antennas $N_{r}\to\infty$, the channels of different devices are asymptotically orthogonal, i.e., $\mathbf{h}_{i,l,t}^{H}\mathbf{h}_{i^{\prime},l^{\prime},t}\to 0$, $\forall(i,l)\neq(i^{\prime},l^{\prime})$. Inspired by the nature of favorable propagation, the inter-task interference can be statistically eliminated with fully-digital beamforming at the receiver by applying a linear projection [16], referred to as random orthogonalization (RO). To be specific, we set the fully-digital beamformer for the $n$-th task as $\mathbf{f}_{n,t}=\sum_{l=1}^{L}\mathbf{h}_{n,l,t}\in\mathbb{C}^{N_{r}}$ and the received signal in (2) becomes

$$\begin{aligned}\mathbf{y}_{n,t}^{T}&=\sum_{i=1}^{N}\sum_{l=1}^{L}\sqrt{p_{i,l,t}}\,\mathbf{f}_{n,t}^{H}\mathbf{h}_{i,l,t}\mathbf{s}_{i,l,t}^{T}+\mathbf{f}_{n,t}^{H}\mathbf{N}_{t}\\&\overset{\text{a.s.}}{\to}\sum_{l=1}^{L}\sqrt{p_{n,l,t}}\left\|\mathbf{h}_{n,l,t}\right\|^{2}\mathbf{s}_{n,l,t}^{T},\end{aligned} \tag{5}$$

where $\overset{\text{a.s.}}{\to}$ denotes almost sure convergence, which follows from the favorable propagation property as $N_{r}\to\infty$, so that the inter-device interference is asymptotically eliminated. Also, from the statistical perspective, we have

$$\mathbb{E}\left[\mathbf{h}_{i,l,t}^{H}\mathbf{h}_{i^{\prime},l^{\prime},t}\right]=0,\enspace\forall(i,l)\neq(i^{\prime},l^{\prime}). \tag{6}$$

Hence, it is straightforward to verify that the inter-task interference is statistically eliminated. The receive beamforming acts as a filter, which filters out interference components orthogonal to it. However, achieving better interference elimination requires a substantial deployment of extra RF chains, significantly increasing hardware cost. Alternatively, large-scale antenna arrays enable the direct use of ZF receivers to obtain individual signals from different devices without interference. The RO approach may not necessarily yield better performance compared to the ZF receiver.

III-B Proposed Analog Beamforming with Continuous Phases

The core idea of (5) is to include the target channel components, i.e., $\mathbf{h}_{n,l,t}$, $\forall l$, in the $n$-th digital beamformer, thus achieving statistical interference elimination via the favorable propagation property. However, the included amplitude components of $\mathbf{h}_{n,l,t}$, $\forall l$, do not exert a significant impact. Hence, we posit that utilizing analog phase shifters alone can still achieve this statistical interference elimination. In particular, considering that the $n$-th sub-array only serves the $n$-th FL task, we set its analog beamforming vector as

$$\mathbf{a}_{n,t}=\frac{1}{\sqrt{M}}\exp\left(j\angle\sum_{l=1}^{L}\mathbf{h}_{n,l,t,n}\right),\enspace\forall n, \tag{7}$$

where $\mathbf{h}_{n,l,t,n}\in\mathbb{C}^{M}$ denotes the channel between the $n$-th sub-array and the $l$-th device in the $n$-th cluster in round $t$, and $\mathbf{h}_{n,l,t}=\left[\mathbf{h}_{n,l,t,1}^{H},\cdots,\mathbf{h}_{n,l,t,N}^{H}\right]^{H}$, $\forall n\in[N],l\in[L]$. In (7), we only incorporate the phases of channels related to devices in the $n$-th cluster, aiming at filtering out interference from other clusters. With the analog beamforming in (7), the effective channel for the $l$-th device in the $n$-th cluster at the $t$-th round is given as $\bar{\mathbf{h}}_{n,l,t}=\mathbf{A}_{t}\mathbf{h}_{n,l,t}$, whose statistics are derived in the following lemma.

Lemma 1:

With the analog beamforming in (7), the effective channel follows

$$\mathbb{E}\left[\bar{\mathbf{h}}_{n,l,t}\right]=\frac{\sqrt{\pi M}}{2\sqrt{L}}\bm{\delta}_{n},\enspace\mathbb{V}\left[\bar{\mathbf{h}}_{n,l,t}\right]=\bm{\Sigma}_{n},\enspace\forall n, \tag{8}$$

where $\bm{\delta}_{n}$ is the Kronecker delta vector with $[\bm{\delta}_{n}]_{n}=1$, and $\bm{\Sigma}_{n}$ is a diagonal matrix whose $n$-th diagonal element is $1-\frac{\pi}{4L}$ and whose other diagonal elements are all 1.

Proof:

Please refer to the supplementary file in [7]. $\square$
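As a numerical sanity check of the mean in (8), the following Monte Carlo sketch draws i.i.d. $\mathcal{CN}(0,1)$ channels for one sub-array, forms the phase-matched weights of (7), and averages the $n$-th entry of the effective channel for one device of the served cluster. It is an illustration under the paper's Rayleigh fading model rather than part of the proof; the function names `cn` and `effective_channel_mean` and the chosen sizes are hypothetical.

```python
import cmath
import math
import random

def cn(rng):
    """Draw one CN(0, 1) sample (real and imaginary variance 1/2 each)."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def effective_channel_mean(M, L, trials, seed=0):
    """Monte Carlo estimate of E[a_n^H h_{n,l,t,n}] for device l = 0."""
    rng = random.Random(seed)
    acc = 0j
    for _ in range(trials):
        # h[l][m]: channel from device l of cluster n to antenna m of sub-array n
        h = [[cn(rng) for _ in range(M)] for _ in range(L)]
        # Eq. (7): constant-modulus weights matched to the phase of the summed channel
        a = [cmath.exp(1j * cmath.phase(sum(h[l][m] for l in range(L)))) / math.sqrt(M)
             for m in range(M)]
        acc += sum(a[m].conjugate() * h[0][m] for m in range(M))
    return acc / trials

M, L = 16, 4
estimate = effective_channel_mean(M, L, trials=3000)
predicted = math.sqrt(math.pi * M) / (2 * math.sqrt(L))  # n-th entry of (8)
print(abs(estimate), predicted)
```

The estimate should land close to the predicted value, with a residual that shrinks as the number of trials grows.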

Building upon the analytical results in Lemma 1, the statistical interference elimination can be achieved with analog beamforming alone. To be concrete, the digital combiner $\mathbf{W}_{t}$ does not need to cope with the inter-task interference. We set $\mathbf{W}_{t}=\zeta\mathbf{I}_{N}$, where $\zeta>0$ is a scaling factor. Then, it follows that

$$\mathbb{E}\left[\hat{\mathbf{g}}_{n,t}\right]=\sum_{l=1}^{L}\frac{\zeta\sqrt{\pi Mp_{n,l,t}}}{2\sqrt{L}}\mathbf{s}_{n,l,t}, \tag{9}$$

which no longer includes the inter-task interference terms.

To further demonstrate the statistical interference elimination method, we characterize the scaling law of the average power of the interference with respect to the number of antennas, $N_{r}$, in the following theorem. The average power of the interference for the $n$-th task is given by

$$P_{\text{I},n}\triangleq\mathbb{E}\left[\left\|\sum_{i\neq n}^{N}\sum_{l=1}^{L}\sqrt{p_{i,l,t}}\,\mathbf{w}_{n,t}^{H}\mathbf{A}_{t}\mathbf{h}_{i,l,t}\mathbf{s}_{i,l,t}\right\|^{2}\right]. \tag{10}$$

Theorem 1:

As the number of antennas, $N_{r}$, increases, the average power of the interference $P_{\text{I},n}$, $\forall n$, vanishes at the order of $\frac{1}{N_{r}}$.

Proof:

From (4), the average power of the interference for the $n$-th task is expressed as

$$\begin{aligned}P_{\text{I},n}&\overset{\text{(a)}}{=}d\sum_{i\neq n}^{N}\sum_{l=1}^{L}p_{i,l,t}\mathbb{E}\left[\left|\mathbf{w}_{n,t}^{H}\mathbf{A}_{t}\mathbf{h}_{i,l,t}\right|^{2}\right]\\&=d\zeta^{2}\sum_{i\neq n}^{N}\sum_{l=1}^{L}p_{i,l,t}\bm{\delta}_{n}^{H}\mathbb{E}\left[\bar{\mathbf{h}}_{i,l,t}\bar{\mathbf{h}}_{i,l,t}^{H}\right]\bm{\delta}_{n}=d\zeta^{2}\sum_{i\neq n}^{N}\sum_{l=1}^{L}p_{i,l,t},\end{aligned} \tag{11}$$

where (a) is due to the independence between signal vectors, the fact that $\mathbb{E}\left[\mathbf{s}_{i,l,t}\right]=\mathbf{0}$, and the normalization, which gives $\left\|\mathbf{s}_{i,l,t}\right\|^{2}=d$. Moreover, as observed in (9), the coefficient of the desired signal $\mathbf{s}_{n,l,t}$ is $\frac{\zeta\sqrt{\pi Mp_{n,l,t}}}{2\sqrt{L}}=\frac{\zeta\sqrt{\pi N_{r}p_{n,l,t}}}{2\sqrt{K}}$ since $\frac{M}{L}=\frac{N_{r}}{K}$, which scales up with $\sqrt{N_{r}}$. To ensure a fixed aggregation coefficient, the scaling factor $\zeta$ must satisfy $\zeta\propto\frac{1}{\sqrt{N_{r}}}$. Now, we can conclude that $P_{\text{I},n}\propto\frac{1}{N_{r}}$, which completes the proof. $\square$
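To make the last step of the proof concrete, the scaling can be written out explicitly under two simplifying assumptions that are not part of the theorem itself: all devices transmit with the same power $P$, and the scaling factor $\zeta$ is chosen so that the coefficient of the desired signal equals a fixed target $c$ (a hypothetical symbol introduced here only for illustration).

```latex
% Assumptions: p_{i,l,t} = P for all devices; c > 0 is a hypothetical
% fixed target for the desired-signal coefficient in (9).
\begin{align*}
\frac{\zeta\sqrt{\pi N_{r}P}}{2\sqrt{K}} = c
  \quad&\Longrightarrow\quad
  \zeta^{2} = \frac{4c^{2}K}{\pi N_{r}P},\\
P_{\text{I},n} = d\zeta^{2}\sum_{i\neq n}^{N}\sum_{l=1}^{L}P
  &= d\zeta^{2}(N-1)\frac{K}{N}P
  = \frac{4c^{2}d\,K^{2}(N-1)}{\pi N N_{r}}.
\end{align*}
```

For fixed $K$, $N$, $d$, and $c$, this expression decays as $1/N_{r}$, in line with Theorem 1, and it does not depend on $P$ because the scaling factor absorbs the transmit power.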

Remark 1:

By increasing the number of low-cost phase shifters, the proposed analog beamforming method can effectively mitigate interference without the need for additional RF chains. Moreover, with infinitely many phase shifters, i.e., $N_{r}\to\infty$, we have $P_{\text{I},n}\to 0$ and the inter-task interference is asymptotically completely eliminated.

Corollary 1:

The proposed analog beamforming method enjoys the same scaling law for interference power with the RO scheme using fully-digital beamforming in Sec. III-A.

Proof:

It has been demonstrated in [16, Eq. (10)] that the interference decreases at the order of $\frac{1}{N_{r}}$, which is the same as the conclusion in Theorem 1. $\square$

Remark 2:

The proposed analog beamforming method fully replicates the functionality of the RO scheme. This implies that achieving random orthogonalization requires only phase matching, thereby avoiding high hardware cost associated with numerous RF chains for the fully-digital beamforming.

III-C Extension to Discrete Phase Control

Due to hardware limitations, phase shifts are usually limited to a finite number of discrete values. We further extend the proposed analog beamforming design to scenarios with discrete phase control. Specifically, we denote the set of discrete phase shifts by $\mathcal{A}\triangleq\left\{0,\frac{2\pi}{2^{b}},\cdots,\frac{(2^{b}-1)2\pi}{2^{b}}\right\}$, where $b$ is the number of quantization bits. Then, we rewrite (7) as

$$\tilde{\mathbf{a}}_{n,t}=\frac{1}{\sqrt{M}}\exp\left(j\arg\min_{\bm{\phi}\in\mathcal{A}^{M}}\left\{\left\|\angle\sum_{l=1}^{L}\mathbf{h}_{n,l,t,n}-\bm{\phi}\right\|^{2}\right\}\right), \tag{12}$$

where $\bm{\phi}$ is an $M$-dimensional vector with elements selected from $\mathcal{A}$. Compared with the perfect case in (7), the configured phase is disturbed by a quantization error, which is modelled as uniformly distributed random variables (RVs) following $\mathcal{U}\left(-2^{-b}\pi,2^{-b}\pi\right)$ [17]. Then, the discrete phase control in (12) is rewritten as

$$\tilde{\mathbf{a}}_{n,t}=\frac{1}{\sqrt{M}}\exp\left(j\angle\sum_{l=1}^{L}\mathbf{h}_{n,l,t,n}+j\bm{\psi}_{n,t}\right), \tag{13}$$

where $\bm{\psi}_{n,t}$ denotes the quantization error and $[\bm{\psi}_{n,t}]_{m}\sim\mathcal{U}\left(-2^{-b}\pi,2^{-b}\pi\right)$, $\forall m$. Accordingly, we denote the effective channel for the $l$-th device in the $n$-th cluster at the $t$-th round as $\tilde{\mathbf{h}}_{n,l,t}=\tilde{\mathbf{A}}_{t}\mathbf{h}_{n,l,t}$, where $\tilde{\mathbf{A}}_{t}=\text{blkdiag}\left\{\tilde{\mathbf{a}}_{1,t}^{H},\cdots,\tilde{\mathbf{a}}_{N,t}^{H}\right\}$.

Lemma 2:

With the discrete phase control in (13), the expectation and variance of the effective channels are

$$\mathbb{E}\left[\tilde{\mathbf{h}}_{n,l,t}\right]=\frac{\sin\left(2^{-b}\pi\right)\sqrt{M}}{2^{-b+1}\sqrt{\pi L}}\bm{\delta}_{n},\enspace\mathbb{V}\left[\tilde{\mathbf{h}}_{n,l,t}\right]=\tilde{\bm{\Sigma}}_{n},\enspace\forall n, \tag{14}$$

where $\tilde{\bm{\Sigma}}_{n}$ is a diagonal matrix whose $n$-th diagonal element is $1-\frac{\sin^{2}\left(2^{-b}\pi\right)}{4^{-b+1}\pi L}$ and whose other diagonal elements are all 1.

Proof:

The expectation of $e^{j[\bm{\psi}_{n,t}]_{m}}$ is equal to $\frac{\sin\left(2^{-b}\pi\right)}{2^{-b}\pi}$. Then, combining this with Lemma 1, we complete the proof. $\square$
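The quantization step in (12) and the expectation used in this proof can be checked with a small sketch. It assumes that the nearest discrete phase is chosen in the wrapped (modulo $2\pi$) sense, which is what the uniform error model on $\left(-2^{-b}\pi,2^{-b}\pi\right)$ presumes; the function names are hypothetical, and the expectation is evaluated by a midpoint-rule integral rather than by sampling.

```python
import cmath
import math

def quantize_phase(theta, b):
    """Map theta to the nearest element of {0, 2*pi/2^b, ..., (2^b-1)*2*pi/2^b},
    where nearness is measured modulo 2*pi (an assumption of this sketch)."""
    step = 2 * math.pi / 2 ** b
    return (round(theta / step) % 2 ** b) * step

def wrapped_error(theta, b):
    """Quantization error psi, wrapped to (-pi, pi]."""
    return cmath.phase(cmath.exp(1j * (quantize_phase(theta, b) - theta)))

def mean_phase_error_factor(b, n=20000):
    """Midpoint-rule estimate of E[exp(j*psi)] for psi ~ U(-pi/2^b, pi/2^b)."""
    half = math.pi / 2 ** b
    width = 2 * half / n
    return sum(cmath.exp(1j * (-half + (k + 0.5) * width)) for k in range(n)) / n

for b in (1, 2, 3):
    closed_form = math.sin(math.pi / 2 ** b) / (math.pi / 2 ** b)
    print(b, mean_phase_error_factor(b).real, closed_form)
```

The two printed columns should agree closely for each $b$, and the factor approaches 1 as $b$ grows, which is the constant scaling that separates (14) from (8).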

Remark 3:

Compared to the results in Lemma 1, discrete phase shifts bring about only a constant scaling and do not change the statistical properties of the effective channels. Therefore, the statistical interference elimination is also attained by applying the discrete phase control.

Next, regarding the analysis of the average power of the interference, we derive the following lemma.

Lemma 3:

Compared with the continuous phase shifts, the average power of the interference obtained via discrete phase control, denoted by $P_{\text{I},n}^{\text{D}}$, satisfies $\frac{P_{\text{I},n}^{\text{D}}}{P_{\text{I},n}}=\frac{1}{\mathrm{sinc}^{2}\left(2^{-b}\right)}$, where $\mathrm{sinc}(x)\triangleq\frac{\sin(\pi x)}{\pi x}$.

Proof:

To achieve the same aggregation coefficient, we have

$$\frac{\zeta_{1}\sqrt{\pi Mp_{n,l,t}}}{2\sqrt{L}}=\frac{\zeta_{2}\sin\left(2^{-b}\pi\right)\sqrt{Mp_{n,l,t}}}{2^{-b+1}\sqrt{\pi L}}, \tag{15}$$

where $\zeta_{1}$ and $\zeta_{2}$ respectively denote the scaling factors for the continuous and discrete phase control, so that $\frac{\zeta_{2}}{\zeta_{1}}=\frac{1}{\mathrm{sinc}\left(2^{-b}\right)}$. Moreover, according to (11), the average power of the interference is proportional to $\zeta^{2}$. Hence, we complete the proof. $\square$
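The ratio in Lemma 3 is easy to evaluate numerically. The short sketch below computes the interference power penalty $\frac{1}{\mathrm{sinc}^{2}(2^{-b})}$ in decibels for a few bit widths; the function names are hypothetical and the code simply evaluates the closed form stated in the lemma.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi*x)/(pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def interference_penalty_db(b):
    """10*log10(P_I^D / P_I) = -20*log10(sinc(2^-b)) for b-bit phase shifters."""
    return -20.0 * math.log10(sinc(2.0 ** -b))

for b in range(1, 6):
    print(b, round(interference_penalty_db(b), 3))
```

Under this formula the penalty is roughly 3.9 dB for $b=1$, 0.9 dB for $b=2$, and 0.2 dB for $b=3$, and it keeps shrinking as $b$ increases.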

As observed in Lemma 3, the performance gap between the continuous and discrete phase control methods vanishes with increasing $b$. Moreover, $P_{\text{I},n}^{\text{D}}$ also decreases at the order of $\frac{1}{N_{r}}$ as $N_{r}$ increases, and we conclude the following corollary.

Corollary 2:

Regardless of the quantization precision of phase shifts, the proposed analog beamforming method always achieves the same scaling law for interference power.

The proposed analog beamforming method, designed to prioritize favorable propagation over precise signal processing, demonstrates resilience to the impacts of non-ideal hardware. Hence, it suggests using a large number of low-precision phase shifters to achieve satisfactory interference elimination performance.

Figure 1: Element-wise power of the interference versus $N_{r}$ ($K=100$, $N=4$).
Figure 2: Element-wise power of the interference versus $N$ ($K=100$, $N_{r}=200$).
Figure 3: Test accuracy versus communication rounds ($K=20$, $N=2$, $\text{SNR}=0$ dB).

IV Numerical Results

In this section, numerical simulations are presented to validate the proposed scheme. Without loss of generality, we neglect the large-scale path loss and normalize it as 1. Besides, we adopt the same transmit power budget for each device, denoted by $P$. The signal-to-noise ratio (SNR) is defined as $\frac{P}{\sigma^{2}}$. To evaluate the learning performance, we simultaneously conduct two FL tasks of image classification. The two tasks are performed on two popular datasets, i.e., MNIST and CIFAR-10, respectively. The trained AI model is a convolutional neural network (CNN). The learning parameters are set as follows: the batch size is 64, the learning rate is $\eta_{n,t}=0.01$, and the number of local iterations is $E=1$ and $E=5$ for the FL tasks on MNIST and CIFAR-10, respectively.

Fig. 1 depicts the element-wise power of the interference for different numbers of phase shifters (antennas), $N_{r}$. The element-wise power of the interference is defined as $\frac{\sum_{n=1}^{N}P_{\text{I},n}}{dN}$, representing the normalized interference power. It is observed that, in line with the theoretical results, the power of the interference decreases at the order of $\frac{1}{N_{r}}$ with $N_{r}$ using the proposed analog beamforming, and exhibits the same order as that using RO. Furthermore, we observe that using low-precision phase shifters does not result in significant performance loss. Performance comparable to that with ideal continuous phase shifters can be achieved with $b=3$. These results demonstrate that the additional expensive RF chains can be entirely replaced with low-cost phase shifters.

In Fig. 2, we show the impact of the number of tasks, $N$. As $N$ increases, all schemes experience greater interference. The performance gap between the proposed analog beamforming and the RO method gradually widens with increasing $N$. This is because, as the number of tasks increases, the number of phase shifters assigned to each task decreases, leading to reduced interference suppression performance. This indicates that when faced with more tasks, additional analog phase shifters are required to achieve satisfactory performance.

Finally, we evaluate the convergence performance of the FL tasks in Fig. 3. We exploit the ideal case with perfect transmission as a benchmark. When sufficient phase shifters are deployed, e.g., $N_{r}=1000$, the proposed analog beamforming method effectively eliminates interference and approaches the FL performance of the ideal case, even with low-precision phase shifters.

V Conclusion

In this paper, we have developed a multi-task AirFL framework based on analog beamforming at the PS. Exploiting the favorable propagation and channel hardening properties, we have designed a quantized analog beamforming method for statistical interference elimination. Numerical results have validated the cost-effectiveness of the proposed method compared to the existing fully-digital beamforming method based on RO.

References

  • [1] W. Xu et al., “Edge learning for B5G networks with distributed signal processing: Semantic communication, edge computing, and wireless sensing,” IEEE J. Sel. Topics Signal Process., vol. 17, no. 1, pp. 9–39, Jan. 2023.
  • [2] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, May 2020.
  • [3] W. Shi et al., “Combating interference for over-the-air federated learning: A statistical approach via RIS,” IEEE Trans. Signal Process., vol. 73, pp. 936–953, 2025.
  • [4] J. Yao et al., “Wireless federated learning over resource-constrained networks: Digital versus analog transmissions,” IEEE Trans. Wireless Commun., vol. 23, no. 10, pp. 14 020–14 036, Oct. 2024.
  • [5] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [6] A. Ghosh et al., “An efficient framework for clustered federated learning,” IEEE Trans. Inf. Theory, vol. 68, no. 12, pp. 8076–8091, Dec. 2022.
  • [7] W. Shi et al., “Empowering over-the-air personalized federated learning via RIS,” Sci. China Inf. Sci., vol. 67, no. 11, pp. 1–2, Nov. 2024.
  • [8] H. U. Sami and B. Güler, “Over-the-air clustered federated learning,” IEEE Trans. Wireless Commun., vol. 23, no. 7, pp. 7877–7893, Jul. 2024.
  • [9] G. Shi, S. Guo, J. Ye, N. Saeed, and S. Dang, “Multiple parallel federated learning via over-the-air computation,” IEEE Open J. Commun. Soc., vol. 3, pp. 1252–1264, Aug. 2022.
  • [10] C. Zhong, H. Yang, and X. Yuan, “Over-the-air federated multi-task learning over MIMO multiple access channels,” IEEE Trans. Wireless Commun., vol. 22, no. 6, pp. 3853–3868, Jun. 2023.
  • [11] Z. Zhang et al., “Turning channel noise into an accelerator for over-the-air principal component analysis,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 7926–7941, Oct. 2022.
  • [12] X. Yu et al., “Alternating minimization algorithms for hybrid precoding in millimeter wave MIMO systems,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 485–500, Apr. 2016.
  • [13] J. Yao, Z. Yang, W. Xu, D. Niyato, and X. You, “Imperfect CSI: A key factor of uncertainty to over-the-air federated learning,” IEEE Wireless Commun. Lett., vol. 12, no. 12, pp. 2273–2277, Dec. 2023.
  • [14] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 8, pp. 5115–5128, Aug. 2021.
  • [15] J. Wu et al., “On the noisy gradient descent that generalizes as SGD,” in Proc. Int. Conf. Machine Learn. (ICML), 2020, pp. 10 367–10 376.
  • [16] X. Wei, C. Shen, J. Yang, and H. V. Poor, “Random orthogonalization for federated learning in massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 23, no. 3, pp. 2469–2485, Mar. 2024.
  • [17] W. Shi, J. Xu, W. Xu, M. Di Renzo, and C. Zhao, “Secure outage analysis of RIS-assisted communications with discrete phase control,” IEEE Trans. Veh. Technol., vol. 72, no. 4, pp. 5435–5440, Apr. 2023.