This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

GP-FL: Model-Based Hessian Estimation for Second-Order Over-the-Air Federated Learning

Authors The results of this study have been submitted in part to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).    Shayan Mohajer Hamidi, , Ali Bereyhi, , Saba Asaad,  ,
and H. Vincent Poor
Shayan Mohajer Hamidi is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (email: smohajer@uwaterloo.ca).
Ali Bereyhi is with the Department of Electrical and Computer Engineering (ECE), University of Toronto, Toronto, M5S 2E4, Canada (e-mail: ali.bereyhi@utoronto.ca).
Saba Asaad is with the Department of Electrical Engineering and Computer Science, York University, Toronto, ON M3J 1P3, Canada (e-mail: asaads@yorku.ca).
H. Vincent Poor is with the Department of Electrical and Computer Engineering (ECE), Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
Abstract

Second-order methods are widely adopted to improve the convergence rate of learning algorithms. In federated learning (FL), these methods require the clients to share their local Hessian matrices with the parameter server (PS), which comes at a prohibitive communication cost. A classical solution to this issue is to approximate the global Hessian matrix from the first-order information. Unlike in idealized networks, this solution does not perform effectively in over-the-air FL settings, where the PS receives noisy versions of the local gradients. This paper introduces a novel second-order FL framework tailored for wireless channels. The pivotal innovation lies in the PS’s capability to directly estimate the global Hessian matrix from the received noisy local gradients via a non-parametric method: the PS models the unknown Hessian matrix as a Gaussian process, and then uses the temporal relation between the gradients and Hessian along with the channel model to find a stochastic estimator for the global Hessian matrix. We refer to this method as Gaussian process-based Hessian modeling for wireless FL (GP-FL) and show that it exhibits a linear-quadratic convergence rate. Numerical experiments on various datasets demonstrate that GP-FL outperforms all classical baseline first and second order FL approaches.

Index Terms:
Second-order federated learning, quasi-Newton method, non-parametric estimation, distributed learning, over-the-air computation.

I Introduction

Traditionally, the training of machine learning (ML) models has followed a centralized approach, with the training data residing in a data center or cloud parameter server (PS). However, in numerous contemporary applications, there is a growing reluctance among devices to share their private data with a remote PS. Addressing this challenge, federated learning (FL) has emerged as a viable solution [1]. FL allows devices to actively contribute to the training of a global model by leveraging only their local datasets while being facilitated by a central PS. In the FL framework, devices transmit solely their local updates to the PS, thus sidestepping the need to disclose raw datasets. The process unfolds in several steps: initially, the PS disseminates the current global model parameters to all participating devices. Subsequently, each device conducts local model training based on its unique dataset and transmits the resulting local updates back to the PS. In the final step, the PS aggregates these local updates, producing a new global model parameters for the subsequent iteration of distributed training.

Considering the large sizes of the model updates, the iterative communication between the PS and clients incurs a significant communication cost. This can significantly decelerate the convergence of FL, since the communication are typically rate-limited in practice [2, 3, 4]. Several lines of research have been dedicated to theoretically characterizing the fundamental trade-offs between learning performance and communication cost, as well as improving the communication efficiency, within the FL framework [5, 6].

I-A Communication-Efficient FL

Early efforts to improve the communication efficiency in the FL framework can be categorized roughly into two approaches: (ii) the first approach tries to reduce the communication overhead per round. One way to achieve this involves utilizing quantization [7] and sparsification [8] techniques. These techniques aim to diminish the transmitted bits and eliminate redundant parameter updates. In general, these approaches come with the trade-off of potentially degrading the model performance, and they must take into account the compatibility for the aggregation operation in FL [9]. (iiii) The second approach intends to minimize the total communication rounds. The most well-known example is the federated averaging (FedAvg) algorithm, in which clients execute multiple local iterations before sharing their models with the PS [1], which can significantly reduce the total number of rounds. Some variants of the gradient descent (GD) method have further demonstrated the ability to substantially reduce the overall communication rounds compared to the naive GD [10, 11].

These previous approaches ignore the properties of the communication network, and look at the communication links as rate-limited orthogonal noiseless channels. Nevertheless, later studies have shown that using the properties of the underlying network, further gain in terms of communication efficiency can be achieved [2]. A well-known case is wireless FL, in which clients can use the superposition property of multiple access channel (MAC) to perform the computation directly over the air and significantly reduce the communication costs [12]. In fact, in wireless networks the abstract view on communication links corresponds to the conventional digital transmission, where coded symbols are sent using techniques such as orthogonal frequency-division multiplexing (OFDM). For wireless FL, however, the clients can send their local parameters simultaneously via analog transmission, and the PS can apply the concept of over-the-air computation (AirComp) to aggregate the global parameter. This method allows aggregation to be performed directly over the air, significantly reducing the required bandwidth [12, 13]. This approach has gained traction in wireless FL as an effective strategy for mitigating communication costs [14, 15, 16, 17, 18, 19].

I-B Second-Order Wireless FL

A standard framework for wireless FL consider first-order algorithms, where the clients transmit their local gradients, and the PS aggregates them gradient directly over the air using the AirComp technique. We refer to this framework as first-order AirComp. Similar to other first-order methods, first-order AirComp at best achieves linear convergence, which leads to a relatively large number of iteration rounds to reach the desired accuracy. We further note that the analog nature of AirComp-aided computation leads to extra aggregation error from the channel. This can be seen as zero-mean fluctuations, which adds to the noiseless aggregated gradient, i.e., the estimator that PS computes in perfect FL settings. As a result, the estimator of global gradient computed in first-order AirComp has higher variance as compared with FL with noiseless aggregation. This leads to further degradation in terms of convergence behavior.

One potential solution is to incorporate second-order methods, e.g., Newton-type methods, into the wireless FL setting, as explored in [20]. Unlike first-order methods, second-order approaches benefit from a quadratic convergence rate by computing the update direction using both first and second-order derivatives of the loss function. The update direction computed in these methods is often called Newton direction.

Integrating second-order information into FL requires further access to the second-order information at the PS. This information is obtained by clients sharing their local Hessian matrices with the PS, which imposes a significant communication burden, especially in wireless settings with strictly-restricted links. To mitigate this issue, various studies have explored ways to approximate the Hessian matrix from the first-order information and/or reduced second-order information. For instance, GIANT [21] estimates the Newton direction by utilizing the global gradient and local Hessians on each client, combined with a global backtracking line search across all clients. Although this approach reduces the communication overhead, it still requires partial sharing of second-order information resulting in higher communication load as compared to first-order methods.

The study in [22] suggests an alternative approach, in which each client the computation of Newton direction is offloaded to the clients: the PS and devices invoke the conjugate gradient descent algorithm to compute iteratively an estimate of the global Newton direction. Although the information shared per communication round in this approach is the same as that of the first-order methods, a single iteration of gradient descent (on the model parameters) requires at least two rounds of communication. To circumvent further this overhead, the recent study in [23] proposes a novel second-order method that eliminates the aggregation of local gradients, enabling a single communication round per iteration. Motivated by this work, the authors in [20] develop a second-order wireless FL method, in which the clients compute their Newton directions locally and share them with the PS. The PS then estimates the global Newton direction from the aggregation of these local directions. This approach enables devices to communicate with the PS only once per iteration, reducing the communication overhead to that of the first-order methods.

As a natural extension to available proposals, some recent studies have proposed integration of communication-efficient second-order FL algorithms into the analog computation framework [21, 20]. Experiments however show that, unlike their first-order counterparts, these second-order AirComp algorithms struggle to perform adequately. This observation can be intuitively explained as follows: the noisy aggregation of the second-order information in AirComp-aided approaches introduces significant biases to the estimate of the Newton direction, leading to suboptimal performance. The investigations reveal that such biases can severely deteriorates the convergence, and hence the overall efficiency, of these methods [21, 20]. This study aims to develop a novel second-order AirComp algorithm that addresses this challenge in a communication-efficient way.

I-C Motivation and Contributions

This work proposes a novel second-order AirComp scheme for FL which estimates the Newton direction directly from the aggregated first-order information. It is hence extremely communication-efficient, in the sense that it does not require the exchange of local Hessians or a functions of them: in each round, the clients leverage AirComp to share only their local gradients. The PS then estimates the global Hessian matrix from a finite sequence of its noisy aggregations. For Hessian estimation, we deviate from the classical deterministic schemes and develop a nonparametric model-based estimator. The proposed estimator assumes a Gaussian prior for the Hessian matrix and determines its posterior distribution conditional to a window of most recent noisy aggregations. It then computes an unbiased estimator of the Newton direction by sampling from the posterior distribution. We refer to this method as Gaussian process-based Hessian modeling for FL (GP-FL). Our analysis and numerical experiments demonstrate that the scheme can efficiently suppress the undesired directional bias.

In summary, the contributions of this paper are three-fold:

\bullet We introduce GP-FL, a novel second-order AirComp algorithm for wireless FL. The key innovation lies in the PS’s ability to estimate global Hessian matrix based on the received noisy sum of the gradients. This algorithm represents a fundamental departure from most existing works, which typically focus solely on stochastic gradient (SGD) during training. The incorporation of second-order information markedly diminishes the total communication rounds in AirComp-based FL, thereby enhancing communication efficiency even further.

\bullet We analyze the convergence behavior of GP-FL and show that it achieves a linear-quadratic convergence rate. Specifically, the number of communication rounds required to reach an ε\varepsilon-accurate solution, TεT_{\varepsilon}, is either 𝒪(loglog1/ε)\mathcal{O}\left(\log\log{1}/{\varepsilon}\right) or 𝒪(log1/εlog1/μ)\mathcal{O}\left(\tfrac{\log{1}/{\varepsilon}}{\log{1/\mu}}\right). This contrasts with GD-based FL methods, which typically have a linear convergence rate of Tε=𝒪(1/ε)T_{\varepsilon}=\mathcal{O}({1}/{\varepsilon}).

\bullet Through extensive experiments on datasets—including three from the LIBSVM library [24], Fashion-MNIST [25], and CIFAR-{10,100} [26]—we show that GP-FL outperforms existing first- and second-order algorithms. The method is evaluated under varying levels of heterogeneity [27], and the impact of key hyperparameters is analyzed.

I-D Related Works

The FedAvg scheme relies solely on first-order gradients for updates [1], offering significantly faster computation compared to Hessian-based methods, especially in settings with fast communication. Adaptive step-size methods like AdaGrad [28] and ADAM [29] have also been adapted to distributed settings, showing improved convergence [30]. The adaptive step-size can be represented as a diagonal matrix 𝔻\mathbb{D}, providing an alternative preconditioning to the Newton direction 1f(𝜽)\mathbb{H}^{-1}\nabla f(\boldsymbol{\theta}), where the direction is computed as 𝔻1f(𝜽)\mathbb{D}^{-1}\nabla f(\boldsymbol{\theta}).

The second-order methods studied in the literature can be categorized into two groups: (ii) those that utilize second-order information implicitly [31, 32, 33], and (iiii) those that explicitly compute them [22, 21, 34]. DANE [31] computes a mirror descent update on the local loss functions, which is equivalent to the GIANT update for a quadratic function. The study in [32] proposes FedDANE as a variant of DANE, specifically tailored for FL, where it uses FedAvg as a baseline with 20 local epochs and observes no improvement with their proposed method. AIDE, proposed in [33], is an alternative accelerated inexact version of DANE. Another line of study in the first group, i.e., group ii, considers employing distributed quasi-Newton methods [35]. CoCoA [36] and its trust-region extension [37] also perform local steps on a second-order local subproblem, but they specifically address the special case of generalized linear model objectives.

The second group of studies employ the Hessian by computing it indirectly through the use of the so-called Hessian-free optimization approach. In DiSCO [22], the clients compute the Hessian-vector products, and subsequently the PS executes the conjugate gradient method [38]. This process entails one communication round for each iteration of the conjugate gradient method. GIANT [21] and LocalNewton [34] both employ the conjugate gradient method at the clients: GIANT utilizes the global gradient, while LocalNewton uses the local gradients. The study in [39] and its FL extension [40] iteratively approximate the global Hessian, requiring a similar number of communication rounds as GIANT but achieving better convergence rates. However, they lack experimental comparisons with FedAvg using multiple local steps.

In general, the performance of mentioned methods tends to degrade when integrated into the AirComp framework, in which the PS receives a distorted and noisy aggregation. The proposed scheme in this work, GP-FL, addresses this issue by explicitly accounting for the channel imperfections.

I-E Notation

The gradient and Hessian of a function f()f(\cdot) are denoted by f()\nabla f(\cdot) and 2f()\nabla^{2}f(\cdot), respectively. The operators ()𝖳(\cdot)^{\mathsf{T}} and ()𝖧(\cdot)^{\mathsf{H}} denote the transpose and Hermitian transpose, respectively; and 𝔼()\mathbb{E}(\cdot) represents the mathematical expectation. The notation 𝐱[j]\mathbf{x}[j] refers to the jj-th entry of vector 𝐱\mathbf{x}. The notation 𝒢𝒫(μ,κ)\mathcal{GP}(\mu,\kappa) denotes a Gaussian process with mean μ\mu and covariance κ\kappa.

For an integer KK, [K][K] denotes {1,,K}\{1,\cdots,K\}. For a scalar or function ff, {fk}k[K]={f1,,fK}\{f_{k}\}_{k\in[K]}=\{f_{1},\dots,f_{K}\}, and [fk]k[K]=[f1,,fK]𝖳[f_{k}]_{k\in[K]}=[f_{1},\dots,f_{K}]^{\mathsf{T}}. Scalars are denoted by non-boldface letters (e.g. aa), vectors by boldface lowercase letters (e.g. 𝕒\mathbb{a}), and matrices by boldface uppercase letters (e.g. 𝔸\mathbb{A}). The jj-th entry of vector 𝕒\mathbb{a} is denoted by 𝕒[j]\mathbb{a}[j]. The matrix 𝕀\mathbb{I} represents the identity matrix. For two matrices 𝔸\mathbb{A} and 𝔹\mathbb{B} of the same size, 𝔸𝔹\mathbb{A}\preceq\mathbb{B} implies that 𝔹𝔸\mathbb{B}-\mathbb{A} is positive semi-definite.

I-F Organization

The remainder of this paper is organized as follows: Section II provides an in-depth discussion of the preliminary knowledge on Newton and quasi-Newton methods. Section III formulates the problem. Section IV presents the proposed stochastic approach for estimating the quasi-Newton direction. Section V provides the convergence analysis for GP-FL. Numerical results and comparison with baselines are given in Section VI. Finally, Section VII concludes the paper.

II Preliminaries

Consider a wireless network with KK clients and a single PS. The clients aim to jointly train a common model for a supervised learning task over their local labeled datasets. Let 𝒟k\mathcal{D}_{k} denote the local dataset at client kk which contains |𝒟k||{\mathcal{D}_{k}}| independently-collected training tuples of the form (𝕦,v)(\mathbb{u},v) with 𝕦m\mathbb{u}\in\mathbb{R}^{m} representing a sample input to the model and vv denoting its ground-truth label.

The clients agree on a global model with dd learnable parameters. Let vector 𝜽d\boldsymbol{\theta}\in\mathbb{R}^{d} represent the collection of these parameters into a vector. The ultimate goal is to train this common model by minimizing the global empirical loss that is defined as f(𝜽)1|𝒟|(𝕦,v)𝒟(𝜽,𝕦,v)f(\boldsymbol{\theta})\triangleq\frac{1}{|\mathcal{D}|}\sum_{(\mathbb{u},v)\in\mathcal{D}}\ell(\boldsymbol{\theta},\mathbb{u},v), where 𝒟\mathcal{D} denotes the global dataset, i.e., 𝒟=k=1K𝒟k\mathcal{D}=\bigcup_{k=1}^{K}\mathcal{D}_{k} and (𝜽,𝕦,v)\ell(\boldsymbol{\theta},\mathbb{u},v) is the loss function measuring the prediction error of the model with parameters 𝜽\boldsymbol{\theta} on input sample 𝕦\mathbb{u} relative to the ground-truth vv.

With the dataset distributed among clients, the global training problem is expressed in terms of local empirical losses. The local empirical loss at client kk is fk(𝜽)1|𝒟k|(𝕦,v)𝒟k(𝜽,𝕦,v)f_{k}\left(\boldsymbol{\theta}\right)\triangleq\frac{1}{|\mathcal{D}_{k}|}\sum_{(\mathbb{u},v)\in\mathcal{D}_{k}}\ell(\boldsymbol{\theta},\mathbb{u},v). The global empirical loss is f(𝜽)=1|𝒟|k𝒮|𝒟k|fk(𝜽)f(\boldsymbol{\theta})=\frac{1}{|\mathcal{D}|}\sum_{k\in\mathcal{S}}|\mathcal{D}_{k}|f_{k}\left(\boldsymbol{\theta}\right), where 𝒮[K]\mathcal{S}\subseteq[K] is the set of participating devices. The distributed training problem is formulated as

min𝜽1|𝒟|k𝒮|𝒟k|fk(𝜽),\displaystyle\min_{\boldsymbol{\theta}}\frac{1}{|\mathcal{D}|}\sum_{k\in\mathcal{S}}|\mathcal{D}_{k}|f_{k}\left(\boldsymbol{\theta}\right), (1)

where fk(𝜽)f_{k}\left(\boldsymbol{\theta}\right) is computed locally at client kk.

II-A First-Order Federated Learning

The de-facto solution to problem (1) is to employ the distributed (stochastic) gradient descent method (DGD/DSGD). In the tt-th round of DGD, the PS shares parameter 𝜽t\boldsymbol{\theta}_{t} with all clients. Each client computes its local gradient at 𝜽t\boldsymbol{\theta}_{t}, i.e., fk(𝜽t)\nabla f_{k}(\boldsymbol{\theta}_{t}), and returns it to the PS. The PS then aggregates these local gradients into a global gradient as

f(𝜽t)=|𝒟k||𝒟|k𝒮tfk(𝜽t),\displaystyle\nabla f(\boldsymbol{\theta}_{t})=\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}\sum_{k\in\mathcal{S}_{t}}\nabla f_{k}(\boldsymbol{\theta}_{t}), (2)

and performs one step of gradient descent, i.e., it computes 𝜽t+1=𝜽tηtf(𝜽t)\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\eta_{t}\nabla f(\boldsymbol{\theta}_{t}), where ηt>0\eta_{t}>0 is the global learning rate at the tt-th round. First-order methods often converge slowly with linear (or sub-linear) convergence rates [41, 42], requiring many iterations to approach a local minimum. Since the number of communication rounds in FL corresponds to iterations, this increases communication overhead.

II-B Newton’s Method in Federated Learning

A method to improve FL communication efficiency is to use faster-converging optimizers, such as second-order methods. These methods estimate the loss landscape’s local curvature, enabling faster and more adaptive updates. While computationally intensive per iteration, they require fewer iterations to converge. Second-order methods are Newton-type techniques. Specifically, Newton’s method updates the model using the second-order Taylor series approximation of the empirical loss. For a small 𝔡d\mathfrak{d}\in\mathbb{R}^{d}, the Taylor series expansion of f(𝜽+𝔡)f(\boldsymbol{\theta}+\mathfrak{d}) around 𝜽\boldsymbol{\theta} up to the quadratic term is given by f(𝜽+𝔡)f(𝜽)+𝔡Tf(𝜽)+12𝔡T2f(𝜽)𝔡f(\boldsymbol{\theta}+\mathfrak{d})\approx f(\boldsymbol{\theta})+\mathfrak{d}^{T}\nabla f(\boldsymbol{\theta})+\frac{1}{2}\mathfrak{d}^{T}\nabla^{2}f(\boldsymbol{\theta})\mathfrak{d}, with 2f(𝜽)\nabla^{2}f(\boldsymbol{\theta}) being the Hessian matrix of ff computed at 𝜽\boldsymbol{\theta}. Assuming the Hessian is positive definite, the Newton direction that minimizes this quadratic approximation is given by 𝔡=(2f(𝜽))1f(𝜽)\mathfrak{d}^{\star}=-(\nabla^{2}f(\boldsymbol{\theta}))^{-1}\nabla f(\boldsymbol{\theta}). Newton’s method hence updates the model parameters by stepping proportional to 𝔡\mathfrak{d}^{\star}.

Deploying Newton’s method for the distributed training task in (1), the PS needs to update the global model as

𝜽t+1\displaystyle\boldsymbol{\theta}_{t+1} =𝜽tηt(2f(𝜽t))1f(𝜽t)\displaystyle=\boldsymbol{\theta}_{t}-\eta_{t}\Big{(}\nabla^{2}f(\boldsymbol{\theta}_{t})\Big{)}^{-1}\nabla f(\boldsymbol{\theta}_{t}) (3a)
=𝜽tηt(k𝒮t2fk(𝜽t))1k𝒮tfk(𝜽t).\displaystyle=\boldsymbol{\theta}_{t}-\eta_{t}\Big{(}\sum_{k\in\mathcal{S}_{t}}\nabla^{2}f_{k}(\boldsymbol{\theta}_{t})\Big{)}^{-1}\sum_{k\in\mathcal{S}_{t}}\nabla f_{k}(\boldsymbol{\theta}_{t}). (3b)

This requires transmitting local Hessians to the PS, creating a trade-off: second-order methods reduce communication rounds via faster convergence but increase communication costs per round due to larger transmission. Standard analyses show that the naive use of Newton’s method in distributed settings often leads to inefficiency, as communication overhead dominates.

II-C Quasi-Newton Search Directions

To address issues with Newton’s method in distributed settings, quasi-Newton methods provide an alternative by avoiding explicit computation of local Hessians. Instead of 2f(𝜽)\nabla^{2}f(\boldsymbol{\theta}) in (3a), an approximation matrix 𝔹𝜽\mathbb{B}_{\boldsymbol{\theta}} is built using the sequence of local gradients from previous iterations. This matrix is updated iteratively to include new information. To understand the quasi-Newton method, assume f()f(\cdot) is twice continuously differentiable. We can hence write

f(𝜽+𝔡)=f(𝜽)+012f(𝜽+τ𝔡)𝔡dτ,\displaystyle\nabla f(\boldsymbol{\theta}+\mathfrak{d})=\nabla f(\boldsymbol{\theta})+\int_{0}^{1}\nabla^{2}f(\boldsymbol{\theta}+\tau\mathfrak{d})\mathfrak{d}d\tau, (4)

By adding and subtracting the term 2f(𝜽)𝔡\nabla^{2}f(\boldsymbol{\theta})\mathfrak{d} to the right-hand side (r.h.s.) of (4), we have

f(𝜽+𝔡)\displaystyle\nabla f(\boldsymbol{\theta}+\mathfrak{d}) =f(𝜽)+2f(𝜽)𝔡\displaystyle=\nabla f(\boldsymbol{\theta})+\nabla^{2}f(\boldsymbol{\theta})\mathfrak{d}
+01[2f(𝜽+τ𝔡)2f(𝜽)]𝔡dτ.\displaystyle+\int_{0}^{1}\big{[}\nabla^{2}f(\boldsymbol{\theta}+\tau\mathfrak{d})-\nabla^{2}f(\boldsymbol{\theta})\big{]}\mathfrak{d}d\tau. (5)

Note that since the gradient function f()\nabla f(\cdot) is continuous, the magnitude of the integral on the r.h.s. of (II-C) is 𝒪(𝔡)\mathcal{O}(\|\mathfrak{d}\|). We now set 𝜽=𝜽t1\boldsymbol{\theta}=\boldsymbol{\theta}_{t-1} and 𝔡=𝜽t𝜽t1\mathfrak{d}=\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}. This leads to

f(𝜽t)f(𝜽t1)=2f(𝜽t1)(𝜽t𝜽t1)+𝜺,\displaystyle\nabla f(\boldsymbol{\theta}_{t})-\nabla f(\boldsymbol{\theta}_{t-1})=\nabla^{2}f(\boldsymbol{\theta}_{t-1})\left(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}\right)+\boldsymbol{\varepsilon}, (6)

with 𝜺=𝒪(𝜽t𝜽t1)\boldsymbol{\varepsilon}=\mathcal{O}(\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}\|). Note that when 𝜽t1\boldsymbol{\theta}_{t-1} and 𝜽t\boldsymbol{\theta}_{t} reside in a region close to the minimizer, the r.h.s. of (6) is dominated by the first term. We can hence write

f(𝜽t)f(𝜽t1)2f(𝜽t1)(𝜽t𝜽t1).\displaystyle\nabla f(\boldsymbol{\theta}_{t})-\nabla f(\boldsymbol{\theta}_{t-1})\approx\nabla^{2}f(\boldsymbol{\theta}_{t-1})\left(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}\right). (7)

The approximation in (7) suggests estimating the Hessian matrix using a matrix 𝔹t\mathbb{B}_{t} that satisfies the relation in (7) as an identity. Specifically, by defining 𝕨t=𝜽t𝜽t1\mathbb{w}_{t}=\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1} and 𝕪t=f(𝜽t)f(𝜽t1)\mathbb{y}_{t}=\nabla f(\boldsymbol{\theta}_{t})-\nabla f(\boldsymbol{\theta}_{t-1}), the quasi-Newton method computes Newton’s direction using 𝔹t\mathbb{B}_{t} that satisfies the following equation, commonly referred to as the secant equation:

𝔹t𝕨t=𝕪t,\displaystyle\mathbb{B}_{t}\mathbb{w}_{t}=\mathbb{y}_{t}, (8)

which is an estimator of the Hessian matrix. A classic solution to (8) given by Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [43], which iteratively updates matrix 𝔹t+1\mathbb{B}_{t+1} using information from 𝔹t,𝕨t,𝕪t{\mathbb{B}_{t},\mathbb{w}_{t},\mathbb{y}_{t}} according to 𝔹t+1=𝔹t𝔹t𝕨t𝕨t𝖳𝔹t𝕨t𝖳𝔹t𝕨t+𝕪t𝕪t𝖳𝕨t𝖳𝕪t\mathbb{B}_{t+1}=\mathbb{B}_{t}-\frac{\mathbb{B}_{t}\mathbb{w}_{t}\mathbb{w}_{t}^{\mathsf{T}}\mathbb{B}_{t}}{\mathbb{w}_{t}^{\mathsf{T}}\mathbb{B}_{t}\mathbb{w}_{t}}+\frac{\mathbb{y}_{t}\mathbb{y}_{t}^{\mathsf{T}}}{\mathbb{w}_{t}^{\mathsf{T}}\mathbb{y}_{t}}, ensuring the positive-definiteness of 𝔹t\mathbb{B}_{t} and consequently making the following quasi-Newton search direction a descent direction:

𝔡t=(𝔹t)1f(𝜽t).\displaystyle\mathfrak{d}_{t}=-(\mathbb{B}_{t})^{-1}\nabla f(\boldsymbol{\theta}_{t}). (9)

Unlike Newton’s method, the quasi-Newton approach relies only on computed gradients, reducing communication requirements. However, this comes with higher variance in estimating the optimal direction. Studies show that the quasi-Newton method strikes a good trade-off, achieving faster convergence than first-order methods with similar communication costs.

Remark 1.

Note that using BFGS involves only a rank-one update, allowing us to efficiently compute the inverse (𝔹t)1(\mathbb{B}_{t})^{-1} as required in (9) via the Sherman-Morrison formula [44].

II-D GP-FL: Quasi-Newton Method in Wireless Networks

We now develop the second-order algorithm GP-FL for FL in wireless networks based on the quasi-Newton method. The algorithm uses two key notions to adapt the quasi-Newton search directions approach to wireless networks: (ii) it uses AirComp to realize the aggregation directly over the air, and (iiii) it models the quasi-Newton estimator of the Hessian matrix as a Gaussian process and employs the maximum-likelihood method to estimate it from noisy aggregations.

In the next sections, we present GP-FL, sketch its derivation and analyze its convergence. For the sake of simplicity, we use the following notation hereafter in the paper: we denote the local gradient of client kk in iteration tt as 𝕘t,kfk(𝜽t)\mathbb{g}_{t,k}\triangleq\nabla f_{k}(\boldsymbol{\theta}_{t}) and global gradient as 𝕘tf(𝜽t)\mathbb{g}_{t}\triangleq\nabla f(\boldsymbol{\theta}_{t}), i.e.,

𝕘t=1|𝒟|k𝒮t|𝒟k|𝕘t,k.\displaystyle\mathbb{g}_{t}=\frac{1}{|\mathcal{D}|}\sum_{k\in\mathcal{S}_{t}}|\mathcal{D}_{k}|~{}\mathbb{g}_{t,k}. (10)

Using this notation, we can summarize the vanilla quasi-Newton method as follows: in iteration tt,

  1. 1.

    The PS aggregates local gradients as per (10).

  2. 2.

    It computes the difference

    𝕪t=𝕘t𝕘t1.\displaystyle\mathbb{y}_{t}=\mathbb{g}_{t}-\mathbb{g}_{t-1}. (11)
  3. 3.

    It finds a quasi-Newton matrix by solving (8) for 𝔹t\mathbb{B}_{t}.

  4. 4.

    It updates the global model parameter as 𝜽t+1=𝜽tηt𝔹t1𝕘t\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\eta_{t}\mathbb{B}_{t}^{-1}\mathbb{g}_{t}, and sets 𝕨t+1=𝜽t+1𝜽t\mathbb{w}_{t+1}=\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}.

III GP-FL: Over-the-Air Aggregation

During each FL communication round, two parameter exchanges occur. The first is a downlink transmission where the PS broadcasts the global model. Since only one set of parameters is transmitted, the communication cost is negligible and can be handled via encoded data over the broadcast channel [15]. The other parameter exchange occurs over uplink channels, where clients send their local gradients to the PS. This is the main source of communication overhead in wireless FL: using conventional orthogonal multiple-access transmission, the time-frequency resources scale linearly with the number of clients KK. For large KK, this leads to substantial overhead. To address this, we use AirComp to perform gradient aggregation directly over the air [15, 19, 12, 14, 16, 18]. This approach completes the aggregation within a single coherent block, significantly reducing communication overhead.

III-A Uplink Channel Model

We consider a Gaussian fading multiple access channel with slow frequency-flat fading, where the coherence time exceeds dd symbol intervals. Clients synchronously transmit dd-dimensional local gradient entries over dd consecutive intervals within a single channel coherence interval. The PS is assumed to have an array of NN antennas, while each client has a single antenna.

We denote the channel coefficient vector between client kk and the PS in round tt as 𝕙t,kN\mathbb{h}_{t,k}\in\mathbb{C}^{N}. Assuming perfect channel state information (CSI) at both ends, clients can adjust their transmitted signals based on the channel coefficients.

III-B Model Aggregation via AirComp

The AirComp-based model aggregation works as follows: client kk sends ϕk(𝕘t,k)\phi_{k}(\mathbb{g}_{t,k}) for some pre-processing function ϕk()\phi_{k}(\cdot) simultaneously with all other clients over the channel. The PS then receives a superimposed version of these transmissions. Denoting the received signal by 𝕣t\mathbb{r}_{t}, the PS estimates the aggregated gradient as 𝕘^t=ψ(𝕣t)\hat{\mathbb{g}}_{t}=\psi(\mathbb{r}_{t}) for some post-processing function ψ()\psi(\cdot). The pre- and post-processing functions serve as analog filters aiming to fulfill the transmission constraints, e.g, transmit power constraint, and minimize the estimation error.

Due to the diversity of 𝕘t,k\mathbb{g}_{t,k} among devices, a universal pre- and post-processing function cannot guarantee the joint stationarity of the information-bearing symbols. i.e., local gradients. We hence opt for a data-and-CSI-aware design: prior to uplink transmission, client kk normalizes its local gradient as

𝕤t,k=𝕘t,k𝕘t,k2.\displaystyle\mathbb{s}_{t,k}=\frac{\mathbb{g}_{t,k}}{\|\mathbb{g}_{t,k}\|_{2}}. (12)

As the normalized gradients are unit-norm, we can model them as jointly stationary processes. This means that 𝔼(𝕤t,k[j]2)=1/d\mathbb{E}(\|\mathbb{s}_{t,k}[j]\|^{2})={1}/{d} for entry j[d]j\in[d]. After normalization, client kk sends

𝕩t,k=bt,k𝕤t,k,\displaystyle\mathbb{x}_{t,k}=b_{t,k}\mathbb{s}_{t,k}, (13)

over its uplink channel, where bt,kb_{t,k}\in\mathbb{R} is a power scaling factor satisfying

𝔼(|bt,k𝕤t,k[j]|2)=bt,k2d𝖯𝟢,\displaystyle\mathbb{E}(|b_{t,k}\mathbb{s}_{t,k}[j]|^{2})=\frac{b_{t,k}^{2}}{d}\leq\mathsf{P_{0}}, (14)

with 𝖯𝟢\mathsf{P_{0}} denoting maximum transmit power of devices.

The PS receives the superimposed version of transmitted signals: let 𝕣t,jN\mathbb{r}_{t,j}\in\mathbb{C}^{N} denotes the received signal of the PS in the jj-th symbol interval of iteration tt. Using the channel model, we can write

𝕣t,j\displaystyle\mathbb{r}_{t,j} =k𝒮t𝕙t,k𝕩t,k[j]+𝕟t,j=k𝒮t𝕙^t,kbt,k𝕘t,k[j]+𝕟t,j,\displaystyle=\sum_{k\in\mathcal{S}_{t}}\mathbb{h}_{t,k}\mathbb{x}_{t,k}[j]+\mathbb{n}_{t,j}=\sum_{k\in\mathcal{S}_{t}}\hat{\mathbb{h}}_{t,k}b_{t,k}\mathbb{g}_{t,k}[j]+\mathbb{n}_{t,j},

where 𝕙^t,k𝕙t,k/𝕘t,k2\hat{\mathbb{h}}_{t,k}\triangleq{\mathbb{h}_{t,k}}/{\|\mathbb{g}_{t,k}\|_{2}} is the effective channel, and 𝕟t,jN\mathbb{n}_{t,j}\in\mathbb{C}^{N} is complex additive white Gaussian noise (AWGN) with mean zero and variance σ2\sigma^{2}, i.e., 𝕟t,j𝒩(𝟎,σ2𝕀)\mathbb{n}_{t,j}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\mathbb{I}).

The PS uses the received signals 𝕣t,j\mathbb{r}_{t,j} for j[d]j\in[d] to determine an estimator of the global gradient 𝐠t\mathbf{g}_{t}. Let 𝕕t[j]\mathbb{d}_{t}[j] denote the estimator of 𝐠t[j]\mathbf{g}_{t}[j]. Using linear post-processing, the PS determines 𝕕t[j]\mathbb{d}_{t}[j] from 𝕣t,j\mathbb{r}_{t,j} as

𝕕t[j]\displaystyle\mathbb{d}_{t}[j] =𝕔t𝖧𝕣t,jαt\displaystyle=\frac{\mathbb{c}_{t}^{\mathsf{H}}\mathbb{r}_{t,j}}{\sqrt{\alpha_{t}}} (15)
=1αt(𝕔t𝖧k𝒮t𝕙^t,kbt,k𝕘t,k[j]+𝕔t𝖧𝕟t,j),\displaystyle=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbb{c}_{t}^{\mathsf{H}}\sum_{k\in\mathcal{S}_{t}}\hat{\mathbb{h}}_{t,k}b_{t,k}\mathbb{g}_{t,k}[j]+\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,j}\right), (16)

for some linear receiver 𝕔tN\mathbb{c}_{t}\in\mathbb{C}^{N} and the power factor αt0\alpha_{t}\neq 0.

Then, the PS employs a zero-forcing approach to estimate 𝕘t\mathbb{g}_{t}. It first computes a linear receiver 𝕔t\mathbb{c}_{t} using the CSI (details on computing 𝕔t\mathbb{c}_{t} will be discussed later) and determines the power scaling factor αtZF\alpha^{\text{ZF}}_{t} as:

αtZF=𝖯𝟢dmink𝒮t𝕔t𝖧𝕙^t,k22|𝒟k|2.\displaystyle\alpha^{\text{ZF}}_{t}=\mathsf{P_{0}}d\min_{k\in\mathcal{S}_{t}}\frac{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}}{|\mathcal{D}_{k}|^{2}}. (17)

This way, the PS guarantees that all clients satisfy their transmit power constraint, and then it broadcasts αtZF\alpha^{\text{ZF}}_{t} to the clients. Client kk upon receiving αtZF\alpha^{\text{ZF}}_{t} determines its scaling as

bt,k=αtZF|𝒟k|𝕙^t,k𝖧𝕔t𝕔t𝖧𝕙^t,k22,\displaystyle b_{t,k}=\sqrt{\alpha^{\text{ZF}}_{t}}|\mathcal{D}_{k}|\frac{\hat{\mathbb{h}}_{t,k}^{\mathsf{H}}\mathbb{c}_{t}}{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|^{2}_{2}}, (18)

which satisfies the transmit power constraint (14).

The above coordination of clients realizes the desired superposition over the air in the absence of noise during uplink transmission. Substituting (17) and (18) in (15), it is concluded that the PS aggregates 𝕕t[j]=k𝒮t|𝒟k|𝕘t,k[j]+1αtZF𝕔t𝖧𝕟t,j\mathbb{d}_{t}[j]=\sum_{k\in\mathcal{S}_{t}}|\mathcal{D}_{k}|\mathbb{g}_{t,k}[j]+\frac{1}{\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,j}, which after scaling by 1/|𝒟|{1}/{|\mathcal{D}|} concludes the following

𝕘~t[j]\displaystyle\tilde{\mathbb{g}}_{t}[j] =1|𝒟|𝕕t[j]=𝕘t[j]+1|𝒟|αtZF𝕔t𝖧𝕟t,j.\displaystyle=\frac{1}{|\mathcal{D}|}\mathbb{d}_{t}[j]=\mathbb{g}_{t}[j]+\frac{1}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,j}. (19)

The estimator of the gradient in iteration tt is hence given by

𝕘~t=𝕘t+𝕟~t,\displaystyle\tilde{\mathbb{g}}_{t}=\mathbb{g}_{t}+\tilde{\mathbb{n}}_{t}, (20)

where, for notational simplicity, we defined the dd-dimensional column vector 𝕟~t\tilde{\mathbb{n}}_{t} as

𝕟~t=1|𝒟|αtZF[𝕔t𝖧𝕟t,1,,𝕔t𝖧𝕟t,d]𝖳.\displaystyle\tilde{\mathbb{n}}_{t}=\frac{1}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}[\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,1},\ldots,\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,d}]^{\mathsf{T}}. (21)
Remark 2.

In general we can assume 𝕘t\mathbb{g}_{t} to be complex, as we can represent every two entries of a complex gradient as a pair of in-phase and quadrature components.

Remark 3.

Note that 𝔼[𝕟~t]=0\mathbb{E}[\tilde{\mathbb{n}}_{t}]=0, and thus 𝕘~t\tilde{\mathbb{g}}_{t} is an unbiased estimator of 𝕘t\mathbb{g}_{t}. As such, assuming that 𝕘t\mathbb{g}_{t} itself is an unbiased estimator of the true gradient, 𝕘~t\tilde{\mathbb{g}}_{t} is an unbiased estimator of the true gradient, as well.

III-C Receiver Design for Model Aggregation

The variance of 𝕟~t\tilde{\mathbb{n}}_{t} which specifies the variance of the gradient estimator is given by σ~2(𝕔t)=σ2𝕔t2|𝒟|2αtZF\tilde{\sigma}^{2}(\mathbb{c}_{t})=\frac{\sigma^{2}\|\mathbb{c}_{t}\|^{2}}{|\mathcal{D}|^{2}{\alpha^{\text{ZF}}_{t}}}. Substituting (17) in this term, we have

σ~2(𝕔t)=σ2|𝒟|2𝖯𝟢dmaxk𝒮t|𝒟k|2𝕔t2𝕔t𝖧𝕙^t,k22.\displaystyle\tilde{\sigma}^{2}(\mathbb{c}_{t})=\frac{\sigma^{2}}{|\mathcal{D}|^{2}\mathsf{P_{0}}d}\max_{k\in\mathcal{S}_{t}}\frac{|\mathcal{D}_{k}|^{2}\|\mathbb{c}_{t}\|^{2}}{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}}. (22)

Thus, for a given set of selected devices 𝒮t\mathcal{S}_{t}, the optimal receiver that minimizes the estimation variance is given by a solution to the following optimization problem:

min𝕔tmaxk𝒮t|𝒟k|2𝕔t2𝕔t𝖧𝕙^t,k22.\displaystyle\min_{\mathbb{c}_{t}}\max_{k\in\mathcal{S}_{t}}\frac{|\mathcal{D}_{k}|^{2}\|\mathbb{c}_{t}\|^{2}}{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}}. (23)

Using a similar analysis as in [45], the optimization problem in (23) can be reformulated as

min𝕔t𝕔t2s.t.𝕔t𝖧𝕙^t,k22|𝒟k|2,k𝒮t\displaystyle\min_{\mathbb{c}_{t}}\|\mathbb{c}_{t}\|^{2}\qquad\text{s.t.}\quad\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}\geq|\mathcal{D}_{k}|^{2},\quad\forall k\in\mathcal{S}_{t} (24)

which is a quadratically constrained quadratic programming problem that is computationally challenging. However, it can be transformed into a difference-of-convex-function (DC) program, as shown in Appendix A.

Remark 4.

The receiver 𝕔t\mathbb{c}_{t} is updated only once per channel coherence time, and hence it is realistic to assume that we coordinate the network after each update of 𝐜t\mathbf{c}_{t}, as it occurs with the same rate as for the CSI update.

III-D Device Selection Algorithm

The device selection is a combinatorial problem, which lies int the set of nondeterministic polynomial (NP) hard problems. We therefore invoke the sub-optimal approach developed in [20] to approximate its solution via Gibbs sampling (GS) method. The key idea in the GS method is to iteratively sample a device set from the neighboring devices based on a suitable distribution. This iterative process allows the selected devices to gradually converge towards an optimal set. We omit the details of the algorithm due to lack of space and refer the interested readers to [20] for more details.

IV GP-FL: Hessian Estimation

In the quasi-Newton approach, the PS estimates the Newton direction via 𝔹t\mathbb{B}_{t} computed from the subsequent gradient estimators 𝕘t\mathbb{g}_{t} and 𝕘t1\mathbb{g}_{t-1} according to (8). With over-the-air aggregation, the gradient estimators, i.e., 𝕘~t\tilde{\mathbb{g}}_{t} and 𝕘~t1\tilde{\mathbb{g}}_{t-1}, are further distorted by effective channel noise process resulting in higher error variances. As demonstrated in our numerical investigations in Section VI, the direct derivation of 𝔹t\mathbb{B}_{t} from the aggregated gradients 𝕘~t\tilde{\mathbb{g}}_{t} and 𝕘~t1\tilde{\mathbb{g}}_{t-1} can significantly degrade the training performance as compared to the case with gradient estimators collected via noise-free aggregation. This observation can be intuitively explained as follows: the variance introduced by over-the-air aggregation leads (through the nonlinear procedure of solving (8)) to an estimator whose estimate of Newton’s direction is biased. The higher the aggregation variance is, the more this directional bias will be.

Unlike the primal estimation error in the mini-batch gradients, i.e., 𝕘t\mathbb{g}_{t}, whose statistics is unknown, the computation error introduced by the over-the-air aggregation is statistically known to us. This knowledge can be potentially used to suppress the impact of computation error and extract a better estimator of the Hessian matrix. The key challenge in this respect is however that the Hessian estimator is related to the noisy aggregations through a nonlinear transform, i.e., the solution to (8). To overcome this challenge we follow the model-based approach suggested in various lines of work in the literature, e.g., [46], to compute a better estimator of the Hessian matrix.

IV-A Model-based Hessian Estimation

In model-based estimation approach, the stochastic nature of the solution to (8) is described through a stochastic process, which approximates the original process. In the sequel, we adapt the Gaussian model for this problem. This choice is particularly useful as Gaussian processes offer non-parametric probabilistic models to capture complexities of nonlinear functions [47].111Modeling Hessians as Gaussian processes was proposed in [46] for use in noisy settings but has not yet been applied to tasks like FL.

To understand the idea of probabilistic modeling of Hessian, let us look more precisely into the noisy quasi-Newton matrix computed from the over-the-air aggregations: at round tt, the PS uses its latest two aggregations to compute

𝕪~t=𝕘~t𝕘~t1,\displaystyle\tilde{\mathbb{y}}_{t}=\tilde{\mathbb{g}}_{t}-\tilde{\mathbb{g}}_{t-1}, (25)

The quasi-Newton matrix 𝔹~t\tilde{\mathbb{B}}_{t}, which is an estimator of Hessian, can then be computed from 𝕪~t\tilde{\mathbb{y}}_{t} by solving

𝔹~t𝕨t=𝕪~t.\displaystyle\tilde{\mathbb{B}}_{t}\mathbb{w}_{t}=\tilde{\mathbb{y}}_{t}. (26)

With ideal links, 𝕪~t=𝕪t\tilde{\mathbb{y}}_{t}=\mathbb{y}_{t}, and using a deterministic scheme like BFGS, the solution to (26) matches that of (8). However, with over-the-air computation, deterministic methods yield a Hessian estimate with potentially higher bias and variance compared to the standard case with perfect aggregation222This is demonstrated in the numerical results.. To address this, we deviate from the classical approach of using deterministic schemes to solve (26). Instead, we model 𝔹~t\tilde{\mathbb{B}}_{t} and 𝕪~t\tilde{\mathbb{y}}_{t} as jointly Gaussian random processes. We approximate their statistics from the observation sequence and use these to derive a more robust Hessian estimator, effectively mitigating the impact of aggregation noise.

IV-B Quasi-Newton Matrix as Gaussian Process

We start by modeling the prior: let 𝔹~t\tilde{\mathbb{B}}_{t} be a Gaussian matrix with mean 𝐌\mathbf{M} and covariance \mathbb{C}. For i,j[d]i,j\in[d], entry (i,j)(i,j) of 𝔹~t\tilde{\mathbb{B}}_{t} is an independent Gaussian process with mean μi,j=[𝐌]i,j\mu_{i,j}=[\mathbf{M}]_{i,j} and variance ψi,j\psi_{i,j} given by \mathbb{C}. For simplicity, we focus on a single entry of 𝔹~t\tilde{\mathbb{B}}_{t}, denoting it as b~t𝒩(μ,ψ)\tilde{b}_{t}\sim\mathcal{N}(\mu,\psi), dropping the index (i,j)(i,j). This does not affect the generality of the analysis, as entries of 𝔹~t\tilde{\mathbb{B}}_{t} are statistically independent and follow the same estimation procedure.

Our ultimate goal is to model b~t\tilde{b}_{t} from the observations 𝕪~t\tilde{\mathbb{y}}_{t} collected through time tt. To this end, we collect the last rr gradient difference in a set 𝒴t\mathcal{Y}_{t}, i.e.,

𝒴t={𝕪~tr,,𝕪~t},\displaystyle\mathcal{Y}_{t}=\{\tilde{\mathbb{y}}_{t-r},\dots,\tilde{\mathbb{y}}_{t}\}, (27)

and model it jointly with entry b~t\tilde{b}_{t} as a Gaussian processes 𝒢𝒫\mathcal{GP}. More precisely, let us define 𝕫=[𝕠t𝖳,b~t]𝖳\mathbb{z}=[\mathbb{o}_{t}^{\sf T},\tilde{b}_{t}]^{\sf T}, where 𝕠t\mathbb{o}_{t} is defined as the concatenation of entries in 𝒴t\mathcal{Y}_{t}, i.e.,

𝕠t=[𝕪~tr+1𝖳,,𝕪~t𝖳]𝖳.\displaystyle\mathbb{o}_{t}=\left[\tilde{\mathbb{y}}_{t-r+1}^{\sf T},\ldots,\tilde{\mathbb{y}}_{t}^{\sf T}\right]^{\sf T}. (28)

We model 𝕫t\mathbb{z}_{t} as a Gaussian vector whose mean is 𝝃trd+1\boldsymbol{\xi}_{t}\in\mathbb{R}^{rd+1} and whose covariance matrix 𝚺t(rd+1)×(rd+1)\boldsymbol{\Sigma}_{t}\in\mathbb{R}^{(rd+1)\times(rd+1)}. Considering this statistical model, our goal is to use the sample data collected in the last rr communication rounds, i.e., 𝒴t\mathcal{Y}_{t}, to find an estimate of the mean and covariance. We then use the assumed statistical model to find an alternative estimator for b~t\tilde{b}_{t}.

IV-C Estimating Parameters of Gaussian Process

To estimate the covariance matrix, we follow the conventional approach in the literature [48]: we estimate the covariance of 𝕫t\mathbb{z}_{t} from its samples using the kernel function κ\kappa. The choice of an appropriate kernel is based on assumptions such as smoothness and likely patterns that are expected in the data. Given the nature of data in this problem, i.e., the fact that 𝕪~t\tilde{\mathbb{y}}_{t} is the gradient difference, one popular choice of the kernel is the radial basis function [48].

Definition 1 (Radial basis kernel).

For inputs 𝕦n\mathbb{u}\in\mathbb{R}^{n} and 𝕧m\mathbb{v}\in\mathbb{R}^{m}, the radial basis function κ\kappa computes an n×mn\times m matrix whose entry in the ii-th row and jj-th column is obtained as

[κ(𝕦,𝕧)]i,j=exp(|[𝕦]i[𝕧]j|22τ2).\displaystyle[\kappa(\mathbb{u},\mathbb{v})]_{i,j}=\exp\left(-\frac{|[\mathbb{u}]_{i}-[\mathbb{v}]_{j}|^{2}}{2\tau^{2}}\right). (29)

It is worth mentioning that using the radial basis kernel, the estimator of the covariance matrix computed from 𝕫𝟘\mathbb{z^{0}} sampled from 𝕫\mathbb{z}, i.e., κ(𝕫𝟘,𝕫𝟘)\kappa(\mathbb{z^{0}},\mathbb{z^{0}}), is a positive semi-definite matrix.

To use the above covariance estimator, we require a sample from 𝕫t\mathbb{z}_{t}. To this end, we use a deterministic solver to find a solution to (26) for the last rr communication rounds. This means that we find 𝐁~i0\tilde{\mathbf{B}}_{i}^{0} by solving (26) with 𝕪~i0\tilde{\mathbb{y}}_{i}^{0} for i{tr+1,,t}i\in\{t-r+1,\ldots,t\}, where 𝕪~i0\tilde{\mathbb{y}}_{i}^{0} denotes to the sample gradient difference computed from the observed received signals.333Superscript 0 is to distinguish random samples from stochastic processes. Let b~i0\tilde{b}_{i}^{0} be an entry of 𝐁~i0\tilde{\mathbf{B}}_{i}^{0}. We sample 𝕫\mathbb{z} as 𝕫0=[𝕠t0𝖳,b~t0]𝖳\mathbb{z}^{0}=[\mathbb{o}_{t}^{0\sf T},\tilde{b}_{t}^{0}]^{\sf T} with

𝕠t0=[𝕪~tr+10𝖳,,𝕪~t0𝖳]𝖳,\displaystyle\mathbb{o}_{t}^{0}=\left[\tilde{\mathbb{y}}_{t-r+1}^{0\sf T},\ldots,\tilde{\mathbb{y}}_{t}^{0\sf T}\right]^{\sf T}, (30)

and estimate the covariance of 𝕫t\mathbb{z}_{t} with the sample covariance matrix κ(𝕫t0,𝕫t0)\kappa(\mathbb{z}_{t}^{0},\mathbb{z}_{t}^{0}) computed by the radial basis kernel.

We next compute an estimator of the mean 𝝃t\boldsymbol{\xi}_{t} using a moving average: we estimate the mean of b~t\tilde{b}_{t} by arithmetic average of b~tr+10,,b~t0\tilde{b}_{t-r+1}^{0},\ldots,\tilde{b}_{t}^{0} denoted by 𝔼^{b~t0}\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}, and the mean of 𝕠i\mathbb{o}_{i} by the moving average of 𝕠i0\mathbb{o}_{i}^{0}, i.e., we set the mean of 𝕪~i\tilde{\mathbb{y}}_{i} to be the average of {𝕪~tr0,,𝕪~i0}\{\tilde{\mathbb{y}}_{t-r}^{0},\dots,\tilde{\mathbb{y}}_{i}^{0}\} for i{tr+1,,t}i\in\{t-r+1,\ldots,t\}. Let us denote this moving average by μ[𝕠t0]\mu[\mathbb{o}_{t}^{0}]. We can then approximate the random process 𝕫t\mathbb{z}_{t} as

𝕫t𝒩([𝔼^{b~t0}μ[𝕫t0]],[βtϕt𝖳ϕt𝕂t]),\displaystyle\mathbb{z}_{t}\sim\mathcal{N}\begin{pmatrix}\begin{bmatrix}\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}\\ \mu[\mathbb{z}_{t}^{0}]\end{bmatrix},&\begin{bmatrix}\beta_{t}&\boldsymbol{\phi}_{t}^{\sf T}\\ \boldsymbol{\phi}_{t}&\mathbb{K}_{t}\end{bmatrix}\end{pmatrix}, (31)

for scalar βt=κ(b~t0,b~t0)\beta_{t}=\kappa(\tilde{{b}}_{t}^{0},\tilde{{b}}_{t}^{0}), vector ϕt=κ(𝕠t0,b~t0)\boldsymbol{\phi}_{t}=\kappa(\mathbb{o}_{t}^{0},\tilde{b}_{t}^{0}), and symmetric and positive semi-definite matrix 𝕂t=κ(𝕠t0,𝕠t0)\mathbb{K}_{t}=\kappa(\mathbb{o}_{t}^{0},\mathbb{o}_{t}^{0}).

IV-D Estimating Hessian via Posterior Sampling

The statistical model considered for the quasi-Newton matrix and the sample observations enables us to find an alternative estimator for the Hessian matrix. To this end, we derive the posterior distribution of 𝔹~t\tilde{\mathbb{B}}_{t} in the following lemma.

Lemma 1 (Posterior of quasi-Newton matrix).

Consider the Gaussian model for 𝕫t\mathbb{z}_{t} given in (31). The entry b~t\tilde{b}_{t} conditional to observation 𝕠t\mathbb{o}_{t} is distributed Gaussian with mean ζ(𝕠t)\zeta(\mathbb{o}_{t}) and variance ψ(𝕠t)\psi(\mathbb{o}_{t}) that are given by

ζ(𝕠t)\displaystyle\zeta(\mathbb{o}_{t}) =𝔼^{b~t0}𝕒t𝖳(𝕠tμ[𝕠t0]),\displaystyle=\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}-\mathbb{a}_{t}^{\sf T}(\mathbb{o}_{t}-\mu[\mathbb{o}_{t}^{0}]), (32a)
ψ(𝕠t)\displaystyle\psi(\mathbb{o}_{t}) =βt𝕒t𝖳ϕt,\displaystyle=\beta_{t}-\mathbb{a}_{t}^{\sf T}\boldsymbol{\phi}_{t}, (32b)

where 𝕒t\mathbb{a}_{t} is the solution to 𝕂t𝕒t=ϕt\mathbb{K}_{t}\mathbb{a}_{t}=\boldsymbol{\phi}_{t} which can be computed via the conjugate gradient algorithm.

Proof.

Since 𝕫t\mathbb{z}_{t} is Gaussian, the posterior distribution of b~t\tilde{b}_{t} is also Gaussian. After some standards derivations, the mean and variance of this conditional distribution are given by [48]

ζ(𝕠t)\displaystyle\zeta(\mathbb{o}_{t}) =𝔼^{b~t0}ϕt𝖳𝕂t1(𝕠tμ[𝕠t0]),\displaystyle=\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}-\boldsymbol{\phi}_{t}^{\sf T}\mathbb{K}_{t}^{-1}(\mathbb{o}_{t}-\mu[\mathbb{o}_{t}^{0}]), (33a)
ψ(𝕠t)\displaystyle\psi(\mathbb{o}_{t}) =βtϕt𝖳𝕂t1ϕt.\displaystyle=\beta_{t}-\boldsymbol{\phi}_{t}^{\sf T}\mathbb{K}_{t}^{-1}\boldsymbol{\phi}_{t}. (33b)
By defining 𝕒t\mathbb{a}_{t} as in the lemma, the proof is concluded.

Using this posterior distribution, we can compute various estimators for b~t\tilde{b}_{t}. Noting that this estimator is used to estimate Newton’s direction, we compute a simple unbiased estimator by sampling from the posterior distribution:444From the Bayesian viewpoint, one might consider setting the estimator to the posterior mean, as it describes the minimum mean squared error (MMSE) estimate. Although MMSE returns minimal variance, it is in general biased and hence can lead to directional misalignment. for every entry of 𝔹~t\tilde{\mathbb{B}}_{t}, we compute the posterior mean ζ(𝕠t0)\zeta(\mathbb{o}_{t}^{0}) and variance ψ(𝕠t0)\psi(\mathbb{o}_{t}^{0}) using the sample observation 𝕠t0\mathbb{o}_{t}^{0}. Let b^i,j,t\hat{b}_{i,j,t} be the posterior sample for entry (i,j)(i,j). We build the Hessian estimator 𝔹^t\hat{\mathbb{B}}_{t} from entries b^i,j,t\hat{b}_{i,j,t} and compute

𝔡~t=𝔹^t1𝕘~t.\displaystyle\tilde{\mathfrak{d}}_{t}=-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}. (34)

The global model is then updated as 𝜽t+1=𝜽t+ηt𝔡~t\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}.

Although validated numerically, it is insightful to explain why (34) provides a better estimate of the true Newton direction compared to deterministic approaches like BFGS. Classical deterministic methods rely only on the latest two gradient samples, which can result in highly inaccurate curvature estimates under noisy aggregation. In contrast, the proposed scheme uses information from the last r+1r+1 gradient samples to estimate the loss curvature. This allows it to effectively mitigate the impact of aggregation noise and compute a more robust curvature estimate.

Remark 5.

From Lemma 1, computing the posterior involves solving an inverse problem to find 𝕒t\mathbb{a}_{t}, which may add computational cost compared to deterministic approaches. However, this inverse problem can be efficiently solved with reduced complexity using the conjugate gradient algorithm [38].

V Algorithm and Convergence Analysis

The derivations in Sections III and IV enable the realization of a second-order FL framework over the air, referred to as Gaussian Process FL (GP-FL). Compared to classical second-order FL, GP-FL introduces two key aspects: (ii) it uses analog function computation to aggregate local models directly over the air, and (iiii) it invokes the Gaussian model to compute a more robust estimation for Hessian from a finite window of recent aggregated gradients. The final algorithm is summarized in Algorithm 1, outlining the framework with device scheduling. At communication round tt, the PS selects a set of participating clients 𝒮t\mathcal{S}_{t} from the KK available ones using a scheduling scheme, specifically the GS method from [20]. The PS shares 𝜽t\boldsymbol{\theta}_{t} with the devices. Upon receiving the global model, the devices compute local gradients and transmit them over their uplink channels. The PS aggregates the global gradient over the air and estimates the Hessian matrix using the method from Section IV, which is then used to update the model via (34).

Remark 6.

We assume the PS has high computational capabilities, rendering the complexity of calculating a direction via (34) negligible. Consequently, the overall running times for GP-FL and conventional second-order FL are nearly identical, as confirmed by our numerical experiments.

Input: Number of global epochs TT, global learning rate ηt{\eta}_{t}, sindow size rr, local datasets {𝒟k}kK\{\mathcal{D}_{k}\}_{k\in K}.
for t=0,1,,T1t=0,1,\dots,T-1 do
       PS randomly selects a subset of devices 𝒮t\mathcal{S}_{t} and sends 𝜽t\boldsymbol{\theta}_{t} to them.
       for device k𝒮tk\in\mathcal{S}_{t} in parallel do
             Compute local gradient as 𝕘t,k=1|𝒟k|(𝕦,v)𝒟k(𝜽t,𝕦,v)\mathbb{g}_{t,k}=\frac{1}{|\mathcal{D}_{k}|}\sum_{(\mathbb{u},v)\in\mathcal{D}_{k}}\ell(\boldsymbol{\theta}_{t},\mathbb{u},v).
             Normalize 𝕘t,k\mathbb{g}_{t,k} according to (12), and find 𝕩t,k\mathbb{x}_{t,k} based on (13).
             Send 𝕩t,k\mathbb{x}_{t,k} to the PS through wireless channel.
       end for
      
      PS calculates 𝕘~t\tilde{\mathbb{g}}_{t} as per (19).
       PS finds the updating direction as per (34).
       PS updates the global model as 𝜽t+1=𝜽t+ηt𝔡~t\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}.
end for
Algorithm 1 GP-FL

V-A Scheduling the Learning Rate

The remaining of this section provides convergence analysis for GP-FL under a set of regularity assumptions, which are commonly considered in the literature [2]. We start the analysis by stating the first assumption.

Assumption 1 (Smoothness and convexity).

The global loss function is twice continuously differentiable, LL-Lipschitz gradient (LL-smooth) and λ\lambda-strongly convex. As such, we have

λ𝕀2f(𝜽)L𝕀.\displaystyle\lambda\mathbb{I}\preceq\nabla^{2}f(\boldsymbol{\theta})\preceq L\mathbb{I}. (35)

The strong convexity of the global loss function implies that there exists a unique optimal model parameter, which we denote by 𝛉\boldsymbol{\theta}^{\star} hereafter in the paper.

For global convergence of GP-FL, as a stochastic estimator of the Newton method, under Assumption 1 the magnitude of model update should be controlled. The classical approach for update control is to resort backtracking line search methods. In this work, we deviate from this classical scheme and control the update magnitude by scheduling the global learning rate. To this end, we invoke the results of [49], which proposes an adaptive learning rate to facilitate the global convergence of Newton-type methods. Casting GP-FL on the proposed scheme in [49], it is shown that for convergence guarantee of GP-FL to the minimizer of an LL-smooth and λ\lambda-strongly convex loss, it is sufficient to schedule the learning rate as

ηt=min{1,λ2L𝕘t}.\displaystyle\eta_{t}=\min\{1,\frac{\lambda^{2}}{L\|\mathbb{g}_{t}\|}\}. (36)

We hence consider this scheduling throughout the analysis.

V-B Convergence Results

To present the convergence guarantee for GP-FL, we first need to define the concept of δ\delta-approximate matrix.

Definition 2 (δ\delta-approximate).

Matrix ^\hat{\mathbb{H}} is said to be a δ\delta-approximate of \mathbb{H}, if the following inequality holds

^δ.\displaystyle\|\hat{\mathbb{H}}-\mathbb{H}\|\leq\delta\|\mathbb{H}\|. (37)

Using this definition, we can now present our first convergence result: Theorem 1 gives an upper-bound on the gap between the converging and optimal models.

Theorem 1.

Let Assumption 1 hold. Moreover, let the inverse of sample quasi-Hessian matrix drawn from the posterior in Lemma 1 in round tt be a δt\delta_{t}-approximate of the inverse Hessian matrix, and define δ=maxtδt\delta=\max_{t}\delta_{t}. Then, the distance between the global model updated in round tt, i.e., 𝛉t\boldsymbol{\theta}_{t} and the optimal model 𝛉\boldsymbol{\theta}^{\star} is bounded from above as

𝔼[𝜽t𝜽]μt𝔼[𝜽0𝜽]+Ct,\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq\mu^{t}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}+C_{t}, (38)

where μ=Lδ/λ\mu=L\delta/\lambda and CtC_{t} is given by

Ct=μt1μ1{C0tt0C1t>t0,\displaystyle C_{t}=\frac{\mu^{t}-1}{\mu-1}\begin{cases}C_{0}&t\leq t_{0}\\ C_{1}&t>t_{0}\end{cases}, (39)

for C0C_{0} and C1C_{1} being

C0\displaystyle C_{0} =λL(t0t+2γ1γ)+δ+1λσn𝕔t|𝒟|αtZF\displaystyle=\frac{\lambda}{L}(t_{0}-t+\frac{2\gamma}{1-\gamma})+\frac{\delta+1}{\lambda}\frac{\sigma_{n}\big{\|}\mathbb{c}_{t}\big{\|}}{|\mathcal{D}|\sqrt{\alpha^{\mathrm{ZF}}_{t}}} (40a)
C1\displaystyle C_{1} =2λγ2tt0L(1γ2tt0)+δ+1λσn𝕔t|𝒟|αtZF\displaystyle=\frac{2\lambda\gamma^{2^{t-t_{0}}}}{L(1-\gamma^{2^{t-t_{0}}})}+\frac{\delta+1}{\lambda}\frac{\sigma_{n}\big{\|}\mathbb{c}_{t}\big{\|}}{|\mathcal{D}|\sqrt{\alpha^{\mathrm{ZF}}_{t}}} (40b)

with t0=max{0,2Lλ2𝕘02}t_{0}=\max\Bigl{\{}0,\bigl{\lceil}\frac{2L}{\lambda^{2}\|\mathbb{g}_{0}\|}\bigr{\rceil}-2\Bigr{\}} and γ=L2λ2𝕘0t04\gamma=\frac{L}{2\lambda^{2}}\|\mathbb{g}_{0}\|-\frac{t_{0}}{4}.

Proof.

The proof is given in Appendix B. ∎

Theorem (1) implies that the GP-FL algorithm exhibits a linear-quadratic convergence rate. In fact, the quadratic term in (65) is exactly the same as the one reported in [49] for Newton-type methods. Nevertheless, our result has an extra linear term, as a result of model-based estimation of the Hessian matrix and the AirComp aggregation error.

We next present Corollaries 1 and 2, which characterize the scenarios under which the GP-FL algorithm converges quadratically and linearly.

Corollary 1.

Let the initial model be in a bounded distance of the optimal model as

𝔼[𝜽0𝜽]<μt1μt(μ1)C1.\displaystyle\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}<\frac{\mu^{t}-1}{\mu^{t}(\mu-1)}C_{1}. (41)

Then, the updated model in round tt satisfies

𝔼[𝜽t𝜽]2(μt1)μ1C1.\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq\frac{2(\mu^{t}-1)}{\mu-1}C_{1}. (42)

and lies within the ε\varepsilon-neighborhood of optimal model after TεT_{\varepsilon} rounds, i.e., 𝔼[𝛉Tε𝛉]ε\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{T_{\varepsilon}}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\leq\varepsilon, if μ<1\mu<1 and

Tε=𝒪(loglog1ε).\displaystyle T_{\varepsilon}=\mathcal{O}\left(\log\log\frac{1}{\varepsilon}\right). (43)

which is also called super-linear convergence rate.

Proof.

See Appendix C. ∎

Corollary 1 presents an intuitive result: when GP-FL is initiated in a close enough vicinity of the optimal model, it can quadratically converge to the optimal model. We next consider the case with linear convergence.

Corollary 2.

Let the initial model satisfies

𝔼[𝜽0𝜽]μt1μt(μ1)C1.\displaystyle\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\geq\frac{\mu^{t}-1}{\mu^{t}(\mu-1)}C_{1}. (44)

Then, the updated model in round tt satisfies

𝔼[𝜽t𝜽]2μt𝔼[𝜽0𝜽].\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq 2\mu^{t}\mathbb{E}\big{[}\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\|\big{]}. (45)

and lies within the ε\varepsilon-neighborhood of optimal model after TεT_{\varepsilon} rounds, i.e., 𝔼[𝛉Tε𝛉]ε\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{T_{\varepsilon}}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\leq\varepsilon, if μ<1\mu<1 and

Tε=𝒪(1log(1μ)log1ε).\displaystyle T_{\varepsilon}=\mathcal{O}\left(\frac{1}{\log(\frac{1}{\mu})}\log\frac{1}{\varepsilon}\right). (46)
Proof.

See Appendix D. ∎

It is worth mentioning that in classical distributed methods, the number of rounds required to achieve a desired precision ε\varepsilon, follows a linear convergence rate. Specifically, for vanilla federated averaging Tε=𝒪(Lλlog1ε)T_{\varepsilon}=\mathcal{O}\left(\frac{L}{\lambda}\log\frac{1}{\varepsilon}\right). This underscores the superiority of GP-FL in terms of convergence rate.

The convergence analysis can be extended to characterize the optimality gap, i.e., the difference between f(𝜽t)f(\boldsymbol{\theta}_{t}) and the minimal loss. Theorem 2 gives an upper bound on the optimality gap at round tt whose proof is differed to Appendix E.

Theorem 2.

Consider the assumptions in Theorem 1. Then, the optimality gap in round tt is bounded from above as

𝔼[f(𝜽t)]f(𝜽)L2(μ2t𝔼[𝜽0𝜽2]+Ct),\displaystyle\mathbb{E}\big{[}f(\boldsymbol{\theta}_{t})\big{]}-f(\boldsymbol{\theta}^{\star})\leq\frac{L}{2}\left({\mu^{2t}}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}^{2}\big{]}+C^{\prime}_{t}\right), (47)

where CtC^{\prime}_{t} is defined as

Ct=μ2t1μ21{C0tt0C1t>t0\displaystyle C^{\prime}_{t}=\frac{\mu^{2t}-1}{\mu^{2}-1}\begin{cases}C^{\prime}_{0}&t\leq t_{0}\\ C^{\prime}_{1}&t>t_{0}\end{cases} (48)

for μ\mu that is defined in Theorem 1, and

C0=λ2L2(t0t+2γ1γ)2+(δ+1λ)2σn2𝕔t𝕔t𝖧|𝒟|2αtZF,\displaystyle C^{\prime}_{0}=\frac{\lambda^{2}}{L^{2}}(t_{0}-t+\frac{2\gamma}{1-\gamma})^{2}+\left(\frac{\delta+1}{\lambda}\right)^{2}\frac{\sigma^{2}_{n}\big{\|}\mathbb{c}_{t}\mathbb{c}^{\mathsf{H}}_{t}\big{\|}}{|\mathcal{D}|^{2}\alpha^{\text{ZF}}_{t}}, (49a)
C1=(2λγ2tt0L(1γ2tt0))2+(δ+1λ)2σn2𝕔t𝕔t𝖧|𝒟|2αtZF.\displaystyle C^{\prime}_{1}=\left(\frac{2\lambda\gamma^{2^{t-t_{0}}}}{L(1-\gamma^{2^{t-t_{0}}})}\right)^{2}+\left(\frac{\delta+1}{\lambda}\right)^{2}\frac{\sigma^{2}_{n}\big{\|}\mathbb{c}_{t}\mathbb{c}^{\mathsf{H}}_{t}\big{\|}}{|\mathcal{D}|^{2}\alpha^{\text{ZF}}_{t}}. (49b)

VI Simulation Results

We validate GP-FL by numerical experiments. Throughout the experiments, we elaborate the effectiveness of our proposed scheme by comparing it with several benchmarks, namely the following first-order and second-order FL algorithms:

AirComp-FedAvg (First-order)

Our first benchmark is AirComp-FedAvg proposed in [12]. The algorithm schedules devices using difference-of-convex programming.

GIANT (Second-order)

We consider GIANT algorithm [21] when realized directly over the air via AirComp. We note that GIANT requires an additional aggregation of local gradients, resulting in two communication rounds per iteration. The communication model for this gradient aggregation follows the same procedure as we discussed in Section III-A.

DANE (Second-order)

We deploy AirComp to realize the DANE algorithm proposed in [31]. Similar to GIANT, this algorithm requires two aggregations per communication round.

Sec-Order (Second-order)

We consider the Sec-Order algorithm [20], where the clients locally find a Newton direction and send them to the PS via AirComp. Note that this approach loads more computations on local devices.

BFGS

The PS uses BFGS to find quasi-Newton direction based on the AirComp-based noisy gradients.

We follow the simulation setup in [7] and consider channels with 200 distinct noise standard deviation values, namely {0.005,0.0100,,1}\{0.005,0.0100,\dots,1\}. For each value, we generate a corresponding channel. Then, based on the number of devices KK, we sample KK values from this noise set. The server is equipped with N=5N=5 antennas. For a fair comparison, we apply the same receiver beamforming method described in Section III-C and the same client selection method discussed in Section III-D across all benchmark methods, including GP-FL. For all experiments, we use the SGD optimizer on local devices with a learning rate of η=0.1\eta=0.1, momentum set to 0.9, and weight decay of 0.0005. The batch size is set to 64. The hyper-parameters of the benchmark methods are aligned with those reported in their respective original papers. Additionally, we set the observation window width in GP-FL to r=20r=20. Other hyper-parameters are specified for each experiment separately.

Refer to caption
(a) w8a
Refer to caption
(b) a9a
Refer to caption
(c) phishing
Refer to caption
(d) Fashion-MNIST
Refer to caption
(e) CIFAR-10, Setup I
Refer to caption
(f) CIFAR-10, Setup II
Refer to caption
(g) CIFAR-100, Setup I
Refer to caption
(h) CIFAR-100, Setup II
Figure 1: Test accuracy vs. communication rounds for GP-FL and baseline methods across different datasets.

VI-A LIBSVM Dataset

First, we show results for a binary classification task using logistic regression and three datasets from the LIBSVM library [24]: a9a, w8a, and phishing. The data samples are uniformly distributed across K=20K=20 clients, all of whom participate in 50 communication rounds. We use the logistic regression model, a type of generalized linear model, as described in [50]. The results for a9a, w8a, and phishing datasets are depicted in Figures 1a, 1b and 1c. As observed, GP-FL achieves higher classification accuracy than benchmark methods.

VI-B Fashion-MNIST Dataset

We next use a fully-connected neural network with two hidden layers and perform 300 communication rounds, with all K=10K=10 clients participating in each round. The results are shown in Figure 1d. As observed, GP-FL not only achieves higher classification accuracy than the benchmark methods but also reaches 80% accuracy within just five rounds.

VI-C CIFAR-10 Dataset (Non-iid Settings)

We evaluate CIFAR-10 classification in two setups:

  • Setup I: Following [51, 52], we organize the dataset by class and divide it into 200 shards. Each client randomly selects two shards without replacement. We use a feedforward neural network with two hidden layers for the FL task, with K=100K=100 clients, 4000 communication rounds, and a 10% client sampling rate per round.

  • Setup II: We distribute the dataset among them using Dirichlet allocation with β=0.5\beta=0.5. Using ResNet-18 with Group Normalization, we conduct 200 communication rounds, wherein all clients participate. We set K=10K=10.

The test accuracy curves for our method and the benchmark methods are depicted in Figures 1e and 1f. As illustrated, compared to the benchmark methods, GP-FL (i) achieves a higher test accuracy, and (ii) demonstrates a faster rate of convergence. For example, in Figure 1e, GP-FL reaches 45% test accuracy in approximately 700 rounds, whereas the best-performing second-order benchmark methods require around 1600 rounds to reach the same accuracy.

VI-D CIFAR-100 Dataset (Non-iid Settings)

CIFAR-100 shares the same sample size as CIFAR-10, yet it encompasses a broader diversity with 100 distinct classes. We next train the ResNet-18 model for this classification. Through training, we use Group Normalization and let all clients participate in each communication round. For this experiment, we consider the following two setups:

  • Setup I: We set K=10K=10 and β=0.5\beta=0.5 for Dirichlet allocation, and use 400 communication rounds.

  • Setup II: We set K=50K=50 and β=0.05\beta=0.05 for Dirichlet allocation, and use 200 communication rounds.

The test accuracy curve for our method and the benchmark methods are depicted in Figures 1g and 1h. Similar conclusions can be drawn from these figures as those from the CIFAR-10 dataset. For example, in Figure 1g, GP-FL reaches 30% test accuracy in just 120 rounds, whereas the best benchmark method requires 180 rounds to achieve the same performance.

VI-E Comments on Required Observation Window

From an implementation perspective, GP-FL requires the PS to select an efficient observation window rr. We empirically examine how different rr values affect GP-FL’s performance by repeating the experiment in Setup I with CIFAR-10 (see Subsection VI-C) for r={0,5,10,15,20,50,100}r=\{0,5,10,15,20,50,100\}. Setting r=0r=0 corresponds to the standard quasi-Newton algorithm. The results, summarized in Table I, show no improvement beyond r=20r=20, with accuracy dropping at r=100r=100. This is because recent gradients carry the most relevant curvature information. Thus, the algorithm can be efficiently implemented with a reasonably sized observation window.

TABLE I: Impacts of hyper-parameter rr: the simulation setup is the same as Setup I with CIFAR-10.
rr 0 5 10 15 20 50 100
Average Accuracy (%) 45.24 46.75 47.88 48.41 48.59 48.61 48.55

VII Conclusion

While second-order methods are favored in FL for their fast convergence, the communication cost of sharing local Hessian matrices poses a challenge. Attempts to approximate these matrices with first-order information have limited effectiveness in real-world scenarios due to noisy updates from imperfect wireless channels. In this paper, we propose a novel second-order FL algorithm tailored for wireless channels, termed Gaussian process-based Hessian modeling for FL (GP-FL). The algorithm uses a non-parametric approach to estimate the global Hessian matrix from noisy local gradients received over uplink channels. Our analysis demonstrates that GP-FL achieves a linear-quadratic convergence rate. Extensive numerical experiments validate the proposed method’s efficiency. Extending GP-FL to other distributed learning approaches is a natural direction in this area.

Appendix A Solving Optimization (23)

To solve the joint optimization in (23), we begin by transforming it into a low-rank optimization problem. Specifically, let =𝕔t𝕔t𝖧\mathbb{C}=\mathbb{c}_{t}\mathbb{c}_{t}^{\mathsf{H}}, where rank()=1\text{rank}(\mathbb{C})=1, and let k=𝕙^t,k𝕙^t,k𝖧\mathbb{H}_{k}=\hat{\mathbb{h}}_{t,k}\hat{\mathbb{h}}_{t,k}^{\mathsf{H}}. This allows us to reformulate (23) as

minTr()\displaystyle\min_{\mathbb{C}}~{}\text{Tr}(\mathbb{C}) (50)
s.t.𝟘,rank()=1,Tr(k)|𝒟k|2,k𝒮t.\displaystyle\text{s.t.}~{}\mathbb{C}\succeq\mathbb{0},~{}\text{rank}(\mathbb{C})=1,~{}\text{Tr}(\mathbb{C}\mathbb{H}_{k})\geq|\mathcal{D}_{k}|^{2},\quad\forall k\in\mathcal{S}_{t}.

Effectively addressing the rank-one constraint is crucial for solving low-rank optimization problems. A common approach is semidefinite relaxation (SDR), which removes the constraint, reformulating the problem as semidefinite programming to approximate the solution. However, for larger problems, the rank-one constraint is often violated, requiring randomization methods to scale the solution.

An alternative approach to enforce the rank-one constraint is to express it as Tr()2=0\text{Tr}(\mathbb{C})-\|\mathbb{C}\|_{2}=0 with Tr()>0\text{Tr}(\mathbb{C})>0. This reformulates the problem into a difference-of-convex-functions (DC) program, allowing for a more accurate solution by satisfying all constraints. Building on methods in [20], we introduce the following modified DC programming, incorporating the new constraint as a penalty term:

minTr()+ζ(Tr()2)\displaystyle\min_{\mathbb{C}}~{}\text{Tr}(\mathbb{C})+\zeta(\text{Tr}(\mathbb{C})-\|\mathbb{C}\|_{2}) (51)
s.t.𝟘,Tr()>0,Tr(k)|𝒟k|2,k𝒮t,\displaystyle\text{s.t.}~{}\mathbb{C}\succeq\mathbb{0},~{}\text{Tr}(\mathbb{C})>0,~{}\text{Tr}(\mathbb{C}\mathbb{H}_{k})\geq|\mathcal{D}_{k}|^{2},\quad\forall k\in\mathcal{S}_{t},

where ζ\zeta is a regularizer. Although this remains a non-convex problem due to the concave term 2-\|\mathbb{C}\|_{2}, we can linearize 2\|\mathbb{C}\|_{2} and transform it into the following convex iterative subproblem:

min(1+ζ)Tr()ζj2\displaystyle\min_{\mathbb{C}}~{}(1+\zeta)\text{Tr}(\mathbb{C})-\zeta\partial\|\mathbb{C}_{j}\|_{2}\cdot\mathbb{C} (52)
s.t.𝟘,Tr()>0,Tr(k)|𝒟k|2,k𝒮t,\displaystyle\text{s.t.}~{}\mathbb{C}\succeq\mathbb{0},~{}\text{Tr}(\mathbb{C})>0,~{}\text{Tr}(\mathbb{C}\mathbb{H}_{k})\geq|\mathcal{D}_{k}|^{2},\quad\forall k\in\mathcal{S}_{t},

where j\mathbb{C}_{j} is jj-th iterative of \mathbb{C}, j2\partial\|\mathbb{C}_{j}\|_{2} is the sub-gradient of j2\|\mathbb{C}_{j}\|_{2}. For a complete description of the iterative algorithm, please refer to Algorithm 2 in [20].

Appendix B Proof of Theorem 1

Before starting the derivation, note that we frequently use the triangle inequality, which states that for two vectors 𝕧\mathbb{v} and 𝕦\mathbb{u}, 𝕧±𝕦𝕧+𝕦\|\mathbb{v}\pm\mathbb{u}\|\leq\|\mathbb{v}\|+\|\mathbb{u}\|. For clarity, we will omit explicitly mentioning its use when it is evident from the context.

We aim to bound the distance to the optimal model 𝜽\boldsymbol{\theta}^{\star} at round tt, i.e., 𝜽t𝜽\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|. This bound is derived by establishing a recursive relationship between the distances in consecutive communication rounds. To begin, we note that 𝜽t+1=𝜽t+ηt𝔡~t\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}, where 𝔡~t=𝔹^t1𝕘~t\tilde{\mathfrak{d}}_{t}=-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t} as given in (34). We hence have

𝜽t+1𝜽\displaystyle\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{\star}\| =𝜽t+ηt𝔡~t𝜽1+2,\displaystyle=\big{\|}\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}-\boldsymbol{\theta}^{\star}\big{\|}\leq\mathcal{M}_{1}+\mathcal{M}_{2}, (53)

where we define 1=𝜽tηtt1𝕘t𝜽\mathcal{M}_{1}=\big{\|}\boldsymbol{\theta}_{t}-\eta_{t}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}-\boldsymbol{\theta}^{\star}\big{\|}, and

2\displaystyle\mathcal{M}_{2} =t1𝕘t+𝔡~t=t1𝕘t𝔹^t1𝕘~t.\displaystyle=\big{\|}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}+\tilde{\mathfrak{d}}_{t}\big{\|}=\big{\|}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}\big{\|}. (54)

The term 1\mathcal{M}_{1} is upper-bounded using the results reported in [49]. In particular, defining t0t_{0} and γ\gamma as in Theorem 1, the result of [49] implies that

1λL{t0t+2γ1γ,tt02γ2tt01γ2tt0,t>t0.\displaystyle\mathcal{M}_{1}\leq\frac{\lambda}{L}\begin{cases}t_{0}-t+\dfrac{2\gamma}{1-\gamma},~{}&t\leq t_{0}\\ \dfrac{2\gamma^{2^{t-t_{0}}}}{1-\gamma^{2^{t-t_{0}}}},&t>t_{0}\end{cases}. (55)

In the next step, we find an upper bound on 2\mathcal{M}_{2}. To this end, we use the triangular inequality once again to write

2\displaystyle\mathcal{M}_{2} t1𝕘t𝔹^t1𝕘t+𝔹^t1𝕘t𝔹^t1𝕘~t\displaystyle\leq\big{\|}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}-\hat{\mathbb{B}}^{-1}_{t}\mathbb{g}_{t}\big{\|}+\big{\|}\hat{\mathbb{B}}^{-1}_{t}\mathbb{g}_{t}-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}\big{\|} (56a)
t1𝔹^t1𝕘t+𝔹^t1𝕘t𝕘~t.\displaystyle\leq\big{\|}\mathbb{H}^{-1}_{t}-\hat{\mathbb{B}}^{-1}_{t}\big{\|}\big{\|}\mathbb{g}_{t}\big{\|}+\big{\|}\hat{\mathbb{B}}^{-1}_{t}\big{\|}\big{\|}\mathbb{g}_{t}-\tilde{\mathbb{g}}_{t}\big{\|}. (56b)

Noting that 𝔹^t1\hat{\mathbb{B}}^{-1}_{t} is δt\delta_{t}-approximator of 𝐇t1\mathbf{H}_{t}^{-1}, we can write

t1𝔹^t1δtt1δt1,\displaystyle\big{\|}\mathbb{H}^{-1}_{t}-\hat{\mathbb{B}}^{-1}_{t}\big{\|}\leq\delta_{t}\big{\|}\mathbb{H}^{-1}_{t}\big{\|}\leq\delta\big{\|}\mathbb{H}^{-1}_{t}\big{\|}, (57)

with the latter inequality being concluded directly from the fact that δ=maxtδt\delta=\max_{t}\delta_{t}. We further note that λ\lambda-strong convexity of the loss yields t11λ\big{\|}\mathbb{H}^{-1}_{t}\big{\|}\leq\frac{1}{\lambda}. We hence conclude that

t1𝔹^t1δλ,\displaystyle\big{\|}\mathbb{H}^{-1}_{t}-\hat{\mathbb{B}}^{-1}_{t}\big{\|}\leq\frac{\delta}{\lambda}, (58)

and use the LL-smoothness of the global loss to write

𝕘tL𝜽t𝜽.\displaystyle\big{\|}\mathbb{g}_{t}\big{\|}\leq L\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}. (59)

Using (58) and (59), the first term in (56b) is straightforwardly bounded. To bound the second term, we have

𝔹^t1\displaystyle\big{\|}\hat{\mathbb{B}}^{-1}_{t}\big{\|} 𝔹^t1t1+t1\displaystyle\leq\big{\|}\hat{\mathbb{B}}^{-1}_{t}-\mathbb{H}^{-1}_{t}\big{\|}+\big{\|}\mathbb{H}^{-1}_{t}\big{\|} (60a)
δλ+1λ=δ+1λ.\displaystyle\leq\frac{\delta}{\lambda}+\frac{1}{\lambda}=\frac{\delta+1}{\lambda}. (60b)

Using this bound, we can finally write

2μ𝜽t𝜽+δ+1λ𝕘t𝕘~t,\displaystyle\mathcal{M}_{2}\leq\mu\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}+\frac{\delta+1}{\lambda}\big{\|}\mathbb{g}_{t}-\tilde{\mathbb{g}}_{t}\big{\|}, (61)

with μ\mu being defined as in Theorem 1.

The above bound leads to an instantaneous recursive bound. We now recall that in the AirComp-aided aggregation we have 𝕘~t𝕘t=𝕟~t\tilde{\mathbb{g}}_{t}-\mathbb{g}_{t}=\tilde{\mathbb{n}}_{t}. By taking expectation from both sides of (61) w.r.t. to the randomness in communication links, we obtain

𝔼[2]μ𝔼[𝜽t𝜽]+δ+1λ𝔼[𝕟~t].\displaystyle\mathbb{E}\big{[}\mathcal{M}_{2}\big{]}\leq\mu\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}+\frac{\delta+1}{\lambda}\mathbb{E}\big{[}\big{\|}\tilde{\mathbb{n}}_{t}\big{\|}\big{]}. (62)

Using (21), we can further write

𝔼[𝕟~t]\displaystyle\mathbb{E}\big{[}\big{\|}\tilde{\mathbb{n}}_{t}\big{\|}\big{]} =1|𝒟|αtZF𝔼[𝕟t,j𝖧𝕔t]\displaystyle=\frac{1}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{E}\big{[}\big{\|}\mathbb{n}_{t,j}^{\mathsf{H}}\mathbb{c}_{t}\big{\|}\big{]} (63a)
1|𝒟|αtZF𝕔t𝔼[𝕟t,j]=σn𝕔t|𝒟|αtZF.\displaystyle\leq\frac{1}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}\big{\|}\mathbb{c}_{t}\big{\|}\mathbb{E}\big{[}\big{\|}\mathbb{n}_{t,j}\big{\|}\big{]}=\frac{\sigma_{n}\big{\|}\mathbb{c}_{t}\big{\|}}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}. (63b)

We finally take an expectation from both sides of (53), and use (62) and (63a) to write (note that ηt1\eta_{t}\leq 1 as per (36))

𝔼[𝜽t𝜽]μ𝔼[𝜽t𝜽]+At,\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq\mu\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}+A_{t}, (64)

where AtA_{t} is defined as

At={C0tt0C1t>t0\displaystyle A_{t}=\begin{cases}C_{0}&t\leq t_{0}\\ C_{1}&t>t_{0}\end{cases} (65)

for C0C_{0} and C1C_{1} that were defined in Theorem 1. Applying (64) recursively, the proof is concluded.

Appendix C Proof of Corollary 1

From Theorem (1), it is concluded that if 𝔼[𝜽0𝜽]<Ct\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}<C_{t}

𝔼[𝜽t𝜽]\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]} 2Ct.\displaystyle\leq 2C_{t}. (66)

We aim to determine the growth order of TεT_{\varepsilon}. Assuming ε\varepsilon is sufficiently small, TεT_{\varepsilon} becomes large and satisfies Tε>t0T_{\varepsilon}>t_{0}. To remain within the ε\varepsilon-neighborhood of 𝜽\boldsymbol{\theta}^{\star}, it is sufficient to have: 2CTεε2C_{T_{\varepsilon}}\leq\varepsilon, which using the fact that Tε>t0T_{\varepsilon}>t_{0}, it reduces to

2μTε1μ1[2λγ2Tεt0L(1γ2Tεt0)+δ+1λσn𝕔Tε|𝒟|αTεZF]ε.\displaystyle 2\frac{\mu^{T_{\varepsilon}}-1}{\mu-1}\Big{[}{\frac{2\lambda\gamma^{2^{{T_{\varepsilon}}-t_{0}}}}{L(1-\gamma^{2^{{T_{\varepsilon}}-t_{0}}})}}+\frac{\delta+1}{\lambda}\frac{\sigma_{n}\big{\|}\mathbb{c}_{T_{\varepsilon}}\big{\|}}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{T_{\varepsilon}}}}\Big{]}\leq\varepsilon. (67)

Since μ<1\mu<1, as Tε{T_{\varepsilon}} grows large μTε0\mu^{T_{\varepsilon}}\to 0; therefore, the left-hand-side coefficient tends to 2/(1μ){2}/({1-\mu}). Noting that γ[0,12]\gamma\in[0,\frac{1}{2}], we can further replace the first term inside the brackets with 2λγ2Tεt0/L{2\lambda\gamma^{2^{T_{\varepsilon}-t_{0}}}}/{L} for large TεT_{\varepsilon}. By substituting into (67), and taking log\log from both sides we finally have

log(1ε)log(4λLL2δλ)2Tεt0log(γ).\displaystyle\log(\frac{1}{\varepsilon})\leq-\log(\frac{4\lambda}{L-\frac{L^{2}\delta}{\lambda}})-2^{T_{\varepsilon}-t_{0}}\log(\gamma). (68)

As log(γ)<0\log(\gamma)<0, the second term on the right-hand-side of (68) is positive. Noting that the first term does not scale, we can finally write Tε𝒪(loglog1ε)T_{\varepsilon}\leq\mathcal{O}\left(\log\log\frac{1}{\varepsilon}\right), which concludes the proof.

Appendix D Proof of Corollary 2

The proof is similar to that of Corollary 1: from Theorem 1, we know that if 𝔼[𝜽0𝜽]Ct\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\geq C_{t}; then, we have

𝔼[𝜽t𝜽]\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]} 2μt𝔼[𝜽0𝜽].\displaystyle\leq 2\mu^{t}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}. (69)

Thus, to lie within ε\varepsilon-neighborhood of optimal model, it is sufficient for Tε>t0T_{\varepsilon}>t_{0} to satisfy 2μTε𝔼[𝜽0𝜽]ε2\mu^{T_{\varepsilon}}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\leq\varepsilon, which is equivalent to

Tεlog(1μ)log(2𝔼[𝜽0𝜽]ε).\displaystyle T_{\varepsilon}\log(\frac{1}{\mu})\geq\log\left(\frac{2\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}}{\varepsilon}\right). (70)

This concludes the proof.

Appendix E Proof of Theorem 2

As the global loss is LL-Lipschitz gradient, we have

f(𝜽t)f(𝜽)f𝖳(𝜽)(𝜽t𝜽)L2𝜽t𝜽2.\displaystyle f(\boldsymbol{\theta}_{t})-f(\boldsymbol{\theta}^{\star})-\nabla f^{\mathsf{T}}(\boldsymbol{\theta}^{\star})(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star})\leq\frac{L}{2}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}. (71)

Taking expectation from both sides, and noting that f(𝜽)=𝟎f(\boldsymbol{\theta}^{\star})=\boldsymbol{0}, we obtain

𝔼[f(𝜽t)]f(𝜽)L2𝔼[𝜽t𝜽2].\displaystyle\mathbb{E}\big{[}f(\boldsymbol{\theta}_{t})\big{]}-f(\boldsymbol{\theta}^{\star})\leq\frac{L}{2}\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}. (72)

This implies that to bound 𝔼[f(𝜽t)]f(𝜽)\mathbb{E}\big{[}f(\boldsymbol{\theta}_{t})\big{]}-f(\boldsymbol{\theta}^{\star}), we should bound 𝔼[𝜽t𝜽2]\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}. From (53), we have

𝔼[𝜽t𝜽2]\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}\big{]} 𝔼[12]+𝔼[22].\displaystyle\leq\mathbb{E}\big{[}\mathcal{M}_{1}^{2}\big{]}+\mathbb{E}\big{[}\mathcal{M}_{2}^{2}\big{]}. (73)

Using the bounds in (55) to (63a), we obtain

𝔼[𝜽t+1𝜽2]μ2𝔼[𝜽t𝜽2]+Ct,\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}\leq\mu^{2}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}^{2}\big{]}+C^{\prime}_{t}, (74)

where CtC^{\prime}_{t} is given by

Ct={C0tt0C1t>t0,\displaystyle C^{\prime}_{t}=\begin{cases}C^{\prime}_{0}~{}&t\leq t_{0}\\ C^{\prime}_{1}&t>t_{0}\end{cases}, (75)

with C0C^{\prime}_{0} and C1C^{\prime}_{1} being defined in Theorem 2. Using (74) recursively, we next obtain

𝔼[𝜽t+1𝜽2]\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{\star}\|^{2}\big{]} μ2t𝔼[𝜽0𝜽2]\displaystyle\leq\mu^{2t}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}^{2}\big{]}
+μ2t1μ21{C0,tt0C1,t>t0\displaystyle+\frac{\mu^{2t}-1}{\mu^{2}-1}\begin{cases}C^{\prime}_{0},~{}&t\leq t_{0}\\ C^{\prime}_{1},&t>t_{0}\end{cases} (76)

Finally, using (E) in (72) the proof is concluded.

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).   PMLR, 2017, pp. 1273–1282.
  • [2] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wirel. Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
  • [3] S. M. Hamidi, S. Herath, A. Bayesteh, and A. K. Khandani, “Systems and methods for communication resource usage control,” May 30 2019, uS Patent App. 15/824,352.
  • [4] S. M. Hamidi, “Training neural networks on remote edge devices for unseen class classification,” IEEE Signal Processing Letters, vol. 31, pp. 1004–1008, 2024.
  • [5] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, “Communication lower bounds for statistical estimation problems via a distributed data processing inequality,” in Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, 2016, pp. 1011–1020.
  • [6] Y. Han, A. Özgür, and T. Weissman, “Geometric lower bounds for distributed parameter estimation under communication constraints,” in Proceedings of the 31st Conference On Learning Theory (COLT).   PMLR, 2018, pp. 3163–3188.
  • [7] M. Lan, Q. Ling, S. Xiao, and W. Zhang, “Quantization bits allocation for wireless federated learning,” IEEE Trans. Wirel. Commun., vol. 22, no. 11, pp. 8336–8351, 2023.
  • [8] A. Bereyhi, B. Liang, G. Boudreau, and A. Afana, “Novel gradient sparsification algorithm via bayesian inference,” in Proceedings of IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), 2024, pp. 1–6.
  • [9] J. Wang, Z. Charles, Z. Xu et al., “A field guide to federated optimization,” 2021. [Online]. Available: https://arxiv.org/abs/2107.06917
  • [10] S. J. Reddi, Z. Charles, M. Zaheer, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in Proceedings of the International Conference on Learning Representations, 2020.
  • [11] Q. Tong, G. Liang, and J. Bi, “Effective federated adaptive gradient methods with non-iid decentralized data,” 2020. [Online]. Available: https://arxiv.org/abs/2009.06557
  • [12] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wirel. Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
  • [13] A. Bereyhi, A. Vagollari, S. Asaad, R. R. Müller, W. Gerstacker, and H. V. Poor, “Device scheduling in over-the-air federated learning via matching pursuit,” IEEE Trans. Signal Process., vol. 71, pp. 2188–2203, 2023.
  • [14] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wirel. Commun., vol. 19, no. 1, pp. 491–506, 2019.
  • [15] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [16] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” IEEE Trans. Signal Process., vol. 68, pp. 2897–2911, 2020.
  • [17] S. M. Hamidi, M. Mehrabi, A. K. Khandani, and D. Gündüz, “Over-the-air federated learning exploiting channel perturbation,” in 2022 IEEE 23rd International Workshop on Signal Processing Advances in Wireless Communication (SPAWC), 2022, pp. 1–5.
  • [18] D. Liu and O. Simeone, “Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 170–185, 2020.
  • [19] A. Elgabli, J. Park, C. B. Issaid, and M. Bennis, “Harnessing wireless channels for scalable and privacy-preserving federated learning,” IEEE Trans. Commun., vol. 69, no. 8, pp. 5194–5208, 2021.
  • [20] P. Yang, Y. Jiang, T. Wang, Y. Zhou, Y. Shi, and C. N. Jones, “Over-the-air federated learning via second-order optimization,” IEEE Trans. Wirel. Commun., vol. 21, no. 12, pp. 10 560–10 575, 2022.
  • [21] S. Wang, F. Roosta, P. Xu, and M. W. Mahoney, “GIANT: Globally improved approximate newton method for distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 2332–2342.
  • [22] Y. Zhang and X. Lin, “Disco: Distributed optimization for self-concordant empirical loss,” in Proceedings of the International conference on machine learning.   PMLR, 2015, pp. 362–370.
  • [23] A. Ghosh, R. K. Maity, and A. Mazumdar, “Distributed newton can communicate less and resist byzantine workers,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 028–18 038, 2020.
  • [24] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
  • [25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1708.07747
  • [26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Ph.D. dissertation, University of Toronto, 2009, technical Report.
  • [27] S. M. Hamidi, R. Tan, L. Ye, and E.-H. Yang, “Fed-it: Addressing class imbalance in federated learning through an information- theoretic lens,” in 2024 IEEE International Symposium on Information Theory (ISIT), 2024, pp. 1848–1853.
  • [28] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.” J. Mach. Learn. Res., vol. 12, no. 7, 2011.
  • [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017. [Online]. Available: https://arxiv.org/abs/1412.6980
  • [30] S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Mime: Mimicking centralized stochastic algorithms in federated learning,” 2021. [Online]. Available: https://arxiv.org/abs/2008.03606
  • [31] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approximate newton-type method,” in Proceedings of the International Conference on Machine Learning.   PMLR, 2014, pp. 1000–1008.
  • [32] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smithy, “Feddane: A federated newton-type method,” in Proceedings of the 2019 53rd Asilomar Conference on Signals, Systems, and Computers.   IEEE, 2019, pp. 1227–1231.
  • [33] S. J. Reddi, J. Konečný, P. Richtárik, B. Póczós, and A. Smola, “Aide: Fast and communication efficient distributed optimization,” 2016. [Online]. Available: https://arxiv.org/abs/1608.06879
  • [34] V. Gupta, A. Ghosh, M. Derezinski, R. Khanna, K. Ramchandran, and M. Mahoney, “Localnewton: Reducing communication bottleneck for distributed learning,” 2021. [Online]. Available: https://arxiv.org/abs/2105.07320
  • [35] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford, “A reliable effective terascale linear learning system,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1111–1133, 2014.
  • [36] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi, “CoCoA: A general framework for communication-efficient distributed optimization,” J. Mach. Learn. Res., vol. 18, p. 230, 2018.
  • [37] C. Duenner, A. Lucchi, M. Gargiani, A. Bian, T. Hofmann, and M. Jaggi, “A distributed second-order algorithm you can trust,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80.   PMLR, 10–15 Jul 2018, pp. 1358–1366.
  • [38] M. R. Hestenes, E. Stiefel et al., Methods of conjugate gradients for solving linear systems.   NBS Washington, DC, 1952, vol. 49, no. 1.
  • [39] R. Islamov, X. Qian, and P. Richtárik, “Distributed second order methods with fast rates and compressed communication,” in International conference on machine learning.   PMLR, 2021, pp. 4617–4628.
  • [40] M. Safaryan, R. Islamov, X. Qian, and P. Richtarik, “Fednl: Making newton-type methods applicable to federated learning,” in Proceedings of the International Conference on Machine Learning.   PMLR, 2022, pp. 18 959–19 010.
  • [41] Y. Deng and M. Mahdavi, “Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,” in Proceedings of the International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 1387–1395.
  • [42] P. Sharma, R. Panda, G. Joshi, and P. Varshney, “Federated minimax optimization: Improved convergence analyses and algorithms,” in Proceedings of the International Conference on Machine Learning.   PMLR, 2022, pp. 19 683–19 730.
  • [43] J. Nocedal and S. J. Wright, Numerical Optimization.   Springer, 1999.
  • [44] J. Shermen and W. Morrison, “Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix,” Annals of Mathmatical Statistics, vol. 20, pp. 621–625, 1949.
  • [45] L. Chen, X. Qin, and G. Wei, “A uniform-forcing transceiver design for over-the-air function computation,” IEEE Wirel. Commun. Lett., vol. 7, no. 6, pp. 942–945, 2018.
  • [46] A. G. Wills and T. B. Schön, “Stochastic quasi-newton with line-search regularisation,” Automatica, vol. 127, p. 109503, 2021.
  • [47] E. Schulz, M. Speekenbrink, and A. Krause, “A tutorial on gaussian process regression: Modelling, exploring, and exploiting functions,” Journal of Mathematical Psychology, vol. 85, pp. 1–16, 2018.
  • [48] C. B. Do and H. Lee, “Gaussian processes,” Stanford University, Stanford, CA, accessed Dec, vol. 5, p. 2017, 2007.
  • [49] B. Polyak and A. Tremba, “New versions of newton method: step-size choice, convergence domain and under-determined equations,” Optimization Methods and Software, vol. 35, no. 6, pp. 1272–1303, 2020.
  • [50] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression.   John Wiley & Sons, 2013.
  • [51] S. M. Hamidi and E.-H. YANG, “Adafed: Fair federated learning via adaptive common descent direction,” Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=rFecyFpFUp
  • [52] S. Mohajer Hamidi and O. Damen, “Fair wireless federated learning through the identification of a common descent direction,” IEEE Communications Letters, vol. 28, no. 3, pp. 567–571, 2024.