GP-FL: Model-Based Hessian Estimation for Second-Order Over-the-Air Federated Learning

Authors The results of this study have been submitted in part to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Shayan Mohajer Hamidi, , Ali Bereyhi, , Saba Asaad, ,
and H. Vincent Poor Shayan Mohajer Hamidi is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (email: smohajer@uwaterloo.ca).
Ali Bereyhi is with the Department of Electrical and Computer Engineering (ECE), University of Toronto, Toronto, M5S 2E4, Canada (e-mail: ali.bereyhi@utoronto.ca).
Saba Asaad is with the Department of Electrical Engineering and Computer Science, York University, Toronto, ON M3J 1P3, Canada (e-mail: asaads@yorku.ca).
H. Vincent Poor is with the Department of Electrical and Computer Engineering (ECE), Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).

Abstract

Second-order methods are widely adopted to improve the convergence rate of learning algorithms. In federated learning (FL), these methods require the clients to share their local Hessian matrices with the parameter server (PS), which comes at a prohibitive communication cost. A classical solution to this issue is to approximate the global Hessian matrix from the first-order information. Unlike in idealized networks, this solution does not perform effectively in over-the-air FL settings, where the PS receives noisy versions of the local gradients. This paper introduces a novel second-order FL framework tailored for wireless channels. The pivotal innovation lies in the PS’s capability to directly estimate the global Hessian matrix from the received noisy local gradients via a non-parametric method: the PS models the unknown Hessian matrix as a Gaussian process, and then uses the temporal relation between the gradients and Hessian along with the channel model to find a stochastic estimator for the global Hessian matrix. We refer to this method as Gaussian process-based Hessian modeling for wireless FL (GP-FL) and show that it exhibits a linear-quadratic convergence rate. Numerical experiments on various datasets demonstrate that GP-FL outperforms all classical baseline first and second order FL approaches.

Index Terms:

Second-order federated learning, quasi-Newton method, non-parametric estimation, distributed learning, over-the-air computation.

I Introduction

Traditionally, the training of machine learning (ML) models has followed a centralized approach, with the training data residing in a data center or cloud parameter server (PS). However, in numerous contemporary applications, there is a growing reluctance among devices to share their private data with a remote PS. Addressing this challenge, federated learning (FL) has emerged as a viable solution [1]. FL allows devices to actively contribute to the training of a global model by leveraging only their local datasets while being facilitated by a central PS. In the FL framework, devices transmit solely their local updates to the PS, thus sidestepping the need to disclose raw datasets. The process unfolds in several steps: initially, the PS disseminates the current global model parameters to all participating devices. Subsequently, each device conducts local model training based on its unique dataset and transmits the resulting local updates back to the PS. In the final step, the PS aggregates these local updates, producing a new global model parameters for the subsequent iteration of distributed training.

Considering the large sizes of the model updates, the iterative communication between the PS and clients incurs a significant communication cost. This can significantly decelerate the convergence of FL, since the communication are typically rate-limited in practice [2, 3, 4]. Several lines of research have been dedicated to theoretically characterizing the fundamental trade-offs between learning performance and communication cost, as well as improving the communication efficiency, within the FL framework [5, 6].

I-A Communication-Efficient FL

Early efforts to improve the communication efficiency in the FL framework can be categorized roughly into two approaches: ( $i$ ) the first approach tries to reduce the communication overhead per round. One way to achieve this involves utilizing quantization [7] and sparsification [8] techniques. These techniques aim to diminish the transmitted bits and eliminate redundant parameter updates. In general, these approaches come with the trade-off of potentially degrading the model performance, and they must take into account the compatibility for the aggregation operation in FL [9]. ( $ii$ ) The second approach intends to minimize the total communication rounds. The most well-known example is the federated averaging (FedAvg) algorithm, in which clients execute multiple local iterations before sharing their models with the PS [1], which can significantly reduce the total number of rounds. Some variants of the gradient descent (GD) method have further demonstrated the ability to substantially reduce the overall communication rounds compared to the naive GD [10, 11].

These previous approaches ignore the properties of the communication network, and look at the communication links as rate-limited orthogonal noiseless channels. Nevertheless, later studies have shown that using the properties of the underlying network, further gain in terms of communication efficiency can be achieved [2]. A well-known case is wireless FL, in which clients can use the superposition property of multiple access channel (MAC) to perform the computation directly over the air and significantly reduce the communication costs [12]. In fact, in wireless networks the abstract view on communication links corresponds to the conventional digital transmission, where coded symbols are sent using techniques such as orthogonal frequency-division multiplexing (OFDM). For wireless FL, however, the clients can send their local parameters simultaneously via analog transmission, and the PS can apply the concept of over-the-air computation (AirComp) to aggregate the global parameter. This method allows aggregation to be performed directly over the air, significantly reducing the required bandwidth [12, 13]. This approach has gained traction in wireless FL as an effective strategy for mitigating communication costs [14, 15, 16, 17, 18, 19].

I-B Second-Order Wireless FL

A standard framework for wireless FL consider first-order algorithms, where the clients transmit their local gradients, and the PS aggregates them gradient directly over the air using the AirComp technique. We refer to this framework as first-order AirComp. Similar to other first-order methods, first-order AirComp at best achieves linear convergence, which leads to a relatively large number of iteration rounds to reach the desired accuracy. We further note that the analog nature of AirComp-aided computation leads to extra aggregation error from the channel. This can be seen as zero-mean fluctuations, which adds to the noiseless aggregated gradient, i.e., the estimator that PS computes in perfect FL settings. As a result, the estimator of global gradient computed in first-order AirComp has higher variance as compared with FL with noiseless aggregation. This leads to further degradation in terms of convergence behavior.

One potential solution is to incorporate second-order methods, e.g., Newton-type methods, into the wireless FL setting, as explored in [20]. Unlike first-order methods, second-order approaches benefit from a quadratic convergence rate by computing the update direction using both first and second-order derivatives of the loss function. The update direction computed in these methods is often called Newton direction.

Integrating second-order information into FL requires further access to the second-order information at the PS. This information is obtained by clients sharing their local Hessian matrices with the PS, which imposes a significant communication burden, especially in wireless settings with strictly-restricted links. To mitigate this issue, various studies have explored ways to approximate the Hessian matrix from the first-order information and/or reduced second-order information. For instance, GIANT [21] estimates the Newton direction by utilizing the global gradient and local Hessians on each client, combined with a global backtracking line search across all clients. Although this approach reduces the communication overhead, it still requires partial sharing of second-order information resulting in higher communication load as compared to first-order methods.

The study in [22] suggests an alternative approach, in which each client the computation of Newton direction is offloaded to the clients: the PS and devices invoke the conjugate gradient descent algorithm to compute iteratively an estimate of the global Newton direction. Although the information shared per communication round in this approach is the same as that of the first-order methods, a single iteration of gradient descent (on the model parameters) requires at least two rounds of communication. To circumvent further this overhead, the recent study in [23] proposes a novel second-order method that eliminates the aggregation of local gradients, enabling a single communication round per iteration. Motivated by this work, the authors in [20] develop a second-order wireless FL method, in which the clients compute their Newton directions locally and share them with the PS. The PS then estimates the global Newton direction from the aggregation of these local directions. This approach enables devices to communicate with the PS only once per iteration, reducing the communication overhead to that of the first-order methods.

As a natural extension to available proposals, some recent studies have proposed integration of communication-efficient second-order FL algorithms into the analog computation framework [21, 20]. Experiments however show that, unlike their first-order counterparts, these second-order AirComp algorithms struggle to perform adequately. This observation can be intuitively explained as follows: the noisy aggregation of the second-order information in AirComp-aided approaches introduces significant biases to the estimate of the Newton direction, leading to suboptimal performance. The investigations reveal that such biases can severely deteriorates the convergence, and hence the overall efficiency, of these methods [21, 20]. This study aims to develop a novel second-order AirComp algorithm that addresses this challenge in a communication-efficient way.

I-C Motivation and Contributions

This work proposes a novel second-order AirComp scheme for FL which estimates the Newton direction directly from the aggregated first-order information. It is hence extremely communication-efficient, in the sense that it does not require the exchange of local Hessians or a functions of them: in each round, the clients leverage AirComp to share only their local gradients. The PS then estimates the global Hessian matrix from a finite sequence of its noisy aggregations. For Hessian estimation, we deviate from the classical deterministic schemes and develop a nonparametric model-based estimator. The proposed estimator assumes a Gaussian prior for the Hessian matrix and determines its posterior distribution conditional to a window of most recent noisy aggregations. It then computes an unbiased estimator of the Newton direction by sampling from the posterior distribution. We refer to this method as Gaussian process-based Hessian modeling for FL (GP-FL). Our analysis and numerical experiments demonstrate that the scheme can efficiently suppress the undesired directional bias.

In summary, the contributions of this paper are three-fold:

$\bullet$ We introduce GP-FL, a novel second-order AirComp algorithm for wireless FL. The key innovation lies in the PS’s ability to estimate global Hessian matrix based on the received noisy sum of the gradients. This algorithm represents a fundamental departure from most existing works, which typically focus solely on stochastic gradient (SGD) during training. The incorporation of second-order information markedly diminishes the total communication rounds in AirComp-based FL, thereby enhancing communication efficiency even further.

$\bullet$ We analyze the convergence behavior of GP-FL and show that it achieves a linear-quadratic convergence rate. Specifically, the number of communication rounds required to reach an $\varepsilon$ -accurate solution, $T_{\varepsilon}$ , is either $\mathcal{O}\left(\log\log{1}/{\varepsilon}\right)$ or $\mathcal{O}\left(\tfrac{\log{1}/{\varepsilon}}{\log{1/\mu}}\right)$ . This contrasts with GD-based FL methods, which typically have a linear convergence rate of $T_{\varepsilon}=\mathcal{O}({1}/{\varepsilon})$ .

$\bullet$ Through extensive experiments on datasets—including three from the LIBSVM library [24], Fashion-MNIST [25], and CIFAR-{10,100} [26]—we show that GP-FL outperforms existing first- and second-order algorithms. The method is evaluated under varying levels of heterogeneity [27], and the impact of key hyperparameters is analyzed.

I-D Related Works

The FedAvg scheme relies solely on first-order gradients for updates [1], offering significantly faster computation compared to Hessian-based methods, especially in settings with fast communication. Adaptive step-size methods like AdaGrad [28] and ADAM [29] have also been adapted to distributed settings, showing improved convergence [30]. The adaptive step-size can be represented as a diagonal matrix $\mathbb{D}$ , providing an alternative preconditioning to the Newton direction $\mathbb{H}^{-1}\nabla f(\boldsymbol{\theta})$ , where the direction is computed as $\mathbb{D}^{-1}\nabla f(\boldsymbol{\theta})$ .

The second-order methods studied in the literature can be categorized into two groups: ( $i$ ) those that utilize second-order information implicitly [31, 32, 33], and ( $ii$ ) those that explicitly compute them [22, 21, 34]. DANE [31] computes a mirror descent update on the local loss functions, which is equivalent to the GIANT update for a quadratic function. The study in [32] proposes FedDANE as a variant of DANE, specifically tailored for FL, where it uses FedAvg as a baseline with 20 local epochs and observes no improvement with their proposed method. AIDE, proposed in [33], is an alternative accelerated inexact version of DANE. Another line of study in the first group, i.e., group $i$ , considers employing distributed quasi-Newton methods [35]. CoCoA [36] and its trust-region extension [37] also perform local steps on a second-order local subproblem, but they specifically address the special case of generalized linear model objectives.

The second group of studies employ the Hessian by computing it indirectly through the use of the so-called Hessian-free optimization approach. In DiSCO [22], the clients compute the Hessian-vector products, and subsequently the PS executes the conjugate gradient method [38]. This process entails one communication round for each iteration of the conjugate gradient method. GIANT [21] and LocalNewton [34] both employ the conjugate gradient method at the clients: GIANT utilizes the global gradient, while LocalNewton uses the local gradients. The study in [39] and its FL extension [40] iteratively approximate the global Hessian, requiring a similar number of communication rounds as GIANT but achieving better convergence rates. However, they lack experimental comparisons with FedAvg using multiple local steps.

In general, the performance of mentioned methods tends to degrade when integrated into the AirComp framework, in which the PS receives a distorted and noisy aggregation. The proposed scheme in this work, GP-FL, addresses this issue by explicitly accounting for the channel imperfections.

I-E Notation

The gradient and Hessian of a function $f(\cdot)$ are denoted by $\nabla f(\cdot)$ and $\nabla^{2}f(\cdot)$ , respectively. The operators $(\cdot)^{\mathsf{T}}$ and $(\cdot)^{\mathsf{H}}$ denote the transpose and Hermitian transpose, respectively; and $\mathbb{E}(\cdot)$ represents the mathematical expectation. The notation $\mathbf{x}[j]$ refers to the $j$ -th entry of vector $\mathbf{x}$ . The notation $\mathcal{GP}(\mu,\kappa)$ denotes a Gaussian process with mean $\mu$ and covariance $\kappa$ .

For an integer $K$ , $[K]$ denotes $\{1,\cdots,K\}$ . For a scalar or function $f$ , $\{f_{k}\}_{k\in[K]}=\{f_{1},\dots,f_{K}\}$ , and $[f_{k}]_{k\in[K]}=[f_{1},\dots,f_{K}]^{\mathsf{T}}$ . Scalars are denoted by non-boldface letters (e.g. $a$ ), vectors by boldface lowercase letters (e.g. $\mathbb{a}$ ), and matrices by boldface uppercase letters (e.g. $\mathbb{A}$ ). The $j$ -th entry of vector $\mathbb{a}$ is denoted by $\mathbb{a}[j]$ . The matrix $\mathbb{I}$ represents the identity matrix. For two matrices $\mathbb{A}$ and $\mathbb{B}$ of the same size, $\mathbb{A}\preceq\mathbb{B}$ implies that $\mathbb{B}-\mathbb{A}$ is positive semi-definite.

I-F Organization

The remainder of this paper is organized as follows: Section II provides an in-depth discussion of the preliminary knowledge on Newton and quasi-Newton methods. Section III formulates the problem. Section IV presents the proposed stochastic approach for estimating the quasi-Newton direction. Section V provides the convergence analysis for GP-FL. Numerical results and comparison with baselines are given in Section VI. Finally, Section VII concludes the paper.

II Preliminaries

Consider a wireless network with $K$ clients and a single PS. The clients aim to jointly train a common model for a supervised learning task over their local labeled datasets. Let $\mathcal{D}_{k}$ denote the local dataset at client $k$ which contains $|{\mathcal{D}_{k}}|$ independently-collected training tuples of the form $(\mathbb{u},v)$ with $\mathbb{u}\in\mathbb{R}^{m}$ representing a sample input to the model and $v$ denoting its ground-truth label.

The clients agree on a global model with $d$ learnable parameters. Let vector $\boldsymbol{\theta}\in\mathbb{R}^{d}$ represent the collection of these parameters into a vector. The ultimate goal is to train this common model by minimizing the global empirical loss that is defined as $f(\boldsymbol{\theta})\triangleq\frac{1}{|\mathcal{D}|}\sum_{(\mathbb{u},v)\in\mathcal{D}}\ell(\boldsymbol{\theta},\mathbb{u},v)$ , where $\mathcal{D}$ denotes the global dataset, i.e., $\mathcal{D}=\bigcup_{k=1}^{K}\mathcal{D}_{k}$ and $\ell(\boldsymbol{\theta},\mathbb{u},v)$ is the loss function measuring the prediction error of the model with parameters $\boldsymbol{\theta}$ on input sample $\mathbb{u}$ relative to the ground-truth $v$ .

With the dataset distributed among clients, the global training problem is expressed in terms of local empirical losses. The local empirical loss at client $k$ is $f_{k}\left(\boldsymbol{\theta}\right)\triangleq\frac{1}{|\mathcal{D}_{k}|}\sum_{(\mathbb{u},v)\in\mathcal{D}_{k}}\ell(\boldsymbol{\theta},\mathbb{u},v)$ . The global empirical loss is $f(\boldsymbol{\theta})=\frac{1}{|\mathcal{D}|}\sum_{k\in\mathcal{S}}|\mathcal{D}_{k}|f_{k}\left(\boldsymbol{\theta}\right)$ , where $\mathcal{S}\subseteq[K]$ is the set of participating devices. The distributed training problem is formulated as

\displaystyle\min_{\boldsymbol{\theta}}\frac{1}{|\mathcal{D}|}\sum_{k\in\mathcal{S}}|\mathcal{D}_{k}|f_{k}\left(\boldsymbol{\theta}\right),

(1)

where $f_{k}\left(\boldsymbol{\theta}\right)$ is computed locally at client $k$ .

II-A First-Order Federated Learning

The de-facto solution to problem (1) is to employ the distributed (stochastic) gradient descent method (DGD/DSGD). In the $t$ -th round of DGD, the PS shares parameter $\boldsymbol{\theta}_{t}$ with all clients. Each client computes its local gradient at $\boldsymbol{\theta}_{t}$ , i.e., $\nabla f_{k}(\boldsymbol{\theta}_{t})$ , and returns it to the PS. The PS then aggregates these local gradients into a global gradient as

\displaystyle\nabla f(\boldsymbol{\theta}_{t})=\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}\sum_{k\in\mathcal{S}_{t}}\nabla f_{k}(\boldsymbol{\theta}_{t}),

(2)

and performs one step of gradient descent, i.e., it computes $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\eta_{t}\nabla f(\boldsymbol{\theta}_{t})$ , where $\eta_{t}>0$ is the global learning rate at the $t$ -th round. First-order methods often converge slowly with linear (or sub-linear) convergence rates [41, 42], requiring many iterations to approach a local minimum. Since the number of communication rounds in FL corresponds to iterations, this increases communication overhead.

II-B Newton’s Method in Federated Learning

A method to improve FL communication efficiency is to use faster-converging optimizers, such as second-order methods. These methods estimate the loss landscape’s local curvature, enabling faster and more adaptive updates. While computationally intensive per iteration, they require fewer iterations to converge. Second-order methods are Newton-type techniques. Specifically, Newton’s method updates the model using the second-order Taylor series approximation of the empirical loss. For a small $\mathfrak{d}\in\mathbb{R}^{d}$ , the Taylor series expansion of $f(\boldsymbol{\theta}+\mathfrak{d})$ around $\boldsymbol{\theta}$ up to the quadratic term is given by $f(\boldsymbol{\theta}+\mathfrak{d})\approx f(\boldsymbol{\theta})+\mathfrak{d}^{T}\nabla f(\boldsymbol{\theta})+\frac{1}{2}\mathfrak{d}^{T}\nabla^{2}f(\boldsymbol{\theta})\mathfrak{d}$ , with $\nabla^{2}f(\boldsymbol{\theta})$ being the Hessian matrix of $f$ computed at $\boldsymbol{\theta}$ . Assuming the Hessian is positive definite, the Newton direction that minimizes this quadratic approximation is given by $\mathfrak{d}^{\star}=-(\nabla^{2}f(\boldsymbol{\theta}))^{-1}\nabla f(\boldsymbol{\theta})$ . Newton’s method hence updates the model parameters by stepping proportional to $\mathfrak{d}^{\star}$ .

Deploying Newton’s method for the distributed training task in (1), the PS needs to update the global model as


$\displaystyle\boldsymbol{\theta}_{t+1}$	$\displaystyle=\boldsymbol{\theta}_{t}-\eta_{t}\Big{(}\nabla^{2}f(\boldsymbol{\theta}_{t})\Big{)}^{-1}\nabla f(\boldsymbol{\theta}_{t})$	(3a)
	$\displaystyle=\boldsymbol{\theta}_{t}-\eta_{t}\Big{(}\sum_{k\in\mathcal{S}_{t}}\nabla^{2}f_{k}(\boldsymbol{\theta}_{t})\Big{)}^{-1}\sum_{k\in\mathcal{S}_{t}}\nabla f_{k}(\boldsymbol{\theta}_{t}).$	(3b)

This requires transmitting local Hessians to the PS, creating a trade-off: second-order methods reduce communication rounds via faster convergence but increase communication costs per round due to larger transmission. Standard analyses show that the naive use of Newton’s method in distributed settings often leads to inefficiency, as communication overhead dominates.

II-C Quasi-Newton Search Directions

To address issues with Newton’s method in distributed settings, quasi-Newton methods provide an alternative by avoiding explicit computation of local Hessians. Instead of $\nabla^{2}f(\boldsymbol{\theta})$ in (3a), an approximation matrix $\mathbb{B}_{\boldsymbol{\theta}}$ is built using the sequence of local gradients from previous iterations. This matrix is updated iteratively to include new information. To understand the quasi-Newton method, assume $f(\cdot)$ is twice continuously differentiable. We can hence write

\displaystyle\nabla f(\boldsymbol{\theta}+\mathfrak{d})=\nabla f(\boldsymbol{\theta})+\int_{0}^{1}\nabla^{2}f(\boldsymbol{\theta}+\tau\mathfrak{d})\mathfrak{d}d\tau,

(4)

By adding and subtracting the term $\nabla^{2}f(\boldsymbol{\theta})\mathfrak{d}$ to the right-hand side (r.h.s.) of (4), we have

	$\displaystyle\nabla f(\boldsymbol{\theta}+\mathfrak{d})$	$\displaystyle=\nabla f(\boldsymbol{\theta})+\nabla^{2}f(\boldsymbol{\theta})\mathfrak{d}$
		$\displaystyle+\int_{0}^{1}\big{[}\nabla^{2}f(\boldsymbol{\theta}+\tau\mathfrak{d})-\nabla^{2}f(\boldsymbol{\theta})\big{]}\mathfrak{d}d\tau.$		(5)

Note that since the gradient function $\nabla f(\cdot)$ is continuous, the magnitude of the integral on the r.h.s. of (II-C) is $\mathcal{O}(\|\mathfrak{d}\|)$ . We now set $\boldsymbol{\theta}=\boldsymbol{\theta}_{t-1}$ and $\mathfrak{d}=\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}$ . This leads to

\displaystyle\nabla f(\boldsymbol{\theta}_{t})-\nabla f(\boldsymbol{\theta}_{t-1})=\nabla^{2}f(\boldsymbol{\theta}_{t-1})\left(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}\right)+\boldsymbol{\varepsilon},

(6)

with $\boldsymbol{\varepsilon}=\mathcal{O}(\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}\|)$ . Note that when $\boldsymbol{\theta}_{t-1}$ and $\boldsymbol{\theta}_{t}$ reside in a region close to the minimizer, the r.h.s. of (6) is dominated by the first term. We can hence write

\displaystyle\nabla f(\boldsymbol{\theta}_{t})-\nabla f(\boldsymbol{\theta}_{t-1})\approx\nabla^{2}f(\boldsymbol{\theta}_{t-1})\left(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}\right).

(7)

The approximation in (7) suggests estimating the Hessian matrix using a matrix $\mathbb{B}_{t}$ that satisfies the relation in (7) as an identity. Specifically, by defining $\mathbb{w}_{t}=\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t-1}$ and $\mathbb{y}_{t}=\nabla f(\boldsymbol{\theta}_{t})-\nabla f(\boldsymbol{\theta}_{t-1})$ , the quasi-Newton method computes Newton’s direction using $\mathbb{B}_{t}$ that satisfies the following equation, commonly referred to as the secant equation:

\displaystyle\mathbb{B}_{t}\mathbb{w}_{t}=\mathbb{y}_{t},

(8)

which is an estimator of the Hessian matrix. A classic solution to (8) given by Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [43], which iteratively updates matrix $\mathbb{B}_{t+1}$ using information from ${\mathbb{B}_{t},\mathbb{w}_{t},\mathbb{y}_{t}}$ according to $\mathbb{B}_{t+1}=\mathbb{B}_{t}-\frac{\mathbb{B}_{t}\mathbb{w}_{t}\mathbb{w}_{t}^{\mathsf{T}}\mathbb{B}_{t}}{\mathbb{w}_{t}^{\mathsf{T}}\mathbb{B}_{t}\mathbb{w}_{t}}+\frac{\mathbb{y}_{t}\mathbb{y}_{t}^{\mathsf{T}}}{\mathbb{w}_{t}^{\mathsf{T}}\mathbb{y}_{t}}$ , ensuring the positive-definiteness of $\mathbb{B}_{t}$ and consequently making the following quasi-Newton search direction a descent direction:

\displaystyle\mathfrak{d}_{t}=-(\mathbb{B}_{t})^{-1}\nabla f(\boldsymbol{\theta}_{t}).

(9)

Unlike Newton’s method, the quasi-Newton approach relies only on computed gradients, reducing communication requirements. However, this comes with higher variance in estimating the optimal direction. Studies show that the quasi-Newton method strikes a good trade-off, achieving faster convergence than first-order methods with similar communication costs.

Remark 1.

Note that using BFGS involves only a rank-one update, allowing us to efficiently compute the inverse $(\mathbb{B}_{t})^{-1}$ as required in (9) via the Sherman-Morrison formula [44].

II-D GP-FL: Quasi-Newton Method in Wireless Networks

We now develop the second-order algorithm GP-FL for FL in wireless networks based on the quasi-Newton method. The algorithm uses two key notions to adapt the quasi-Newton search directions approach to wireless networks: ( $i$ ) it uses AirComp to realize the aggregation directly over the air, and ( $ii$ ) it models the quasi-Newton estimator of the Hessian matrix as a Gaussian process and employs the maximum-likelihood method to estimate it from noisy aggregations.

In the next sections, we present GP-FL, sketch its derivation and analyze its convergence. For the sake of simplicity, we use the following notation hereafter in the paper: we denote the local gradient of client $k$ in iteration $t$ as $\mathbb{g}_{t,k}\triangleq\nabla f_{k}(\boldsymbol{\theta}_{t})$ and global gradient as $\mathbb{g}_{t}\triangleq\nabla f(\boldsymbol{\theta}_{t})$ , i.e.,

\displaystyle\mathbb{g}_{t}=\frac{1}{|\mathcal{D}|}\sum_{k\in\mathcal{S}_{t}}|\mathcal{D}_{k}|~{}\mathbb{g}_{t,k}.

(10)

Using this notation, we can summarize the vanilla quasi-Newton method as follows: in iteration $t$ ,

1.

The PS aggregates local gradients as per (10).
2.

It computes the difference

$\displaystyle\mathbb{y}_{t}=\mathbb{g}_{t}-\mathbb{g}_{t-1}.$ (11)
3.

It finds a quasi-Newton matrix by solving (8) for $\mathbb{B}_{t}$ .
4.

It updates the global model parameter as $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\eta_{t}\mathbb{B}_{t}^{-1}\mathbb{g}_{t}$ , and sets $\mathbb{w}_{t+1}=\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}$ .

III GP-FL: Over-the-Air Aggregation

During each FL communication round, two parameter exchanges occur. The first is a downlink transmission where the PS broadcasts the global model. Since only one set of parameters is transmitted, the communication cost is negligible and can be handled via encoded data over the broadcast channel [15]. The other parameter exchange occurs over uplink channels, where clients send their local gradients to the PS. This is the main source of communication overhead in wireless FL: using conventional orthogonal multiple-access transmission, the time-frequency resources scale linearly with the number of clients $K$ . For large $K$ , this leads to substantial overhead. To address this, we use AirComp to perform gradient aggregation directly over the air [15, 19, 12, 14, 16, 18]. This approach completes the aggregation within a single coherent block, significantly reducing communication overhead.

III-A Uplink Channel Model

We consider a Gaussian fading multiple access channel with slow frequency-flat fading, where the coherence time exceeds $d$ symbol intervals. Clients synchronously transmit $d$ -dimensional local gradient entries over $d$ consecutive intervals within a single channel coherence interval. The PS is assumed to have an array of $N$ antennas, while each client has a single antenna.

We denote the channel coefficient vector between client $k$ and the PS in round $t$ as $\mathbb{h}_{t,k}\in\mathbb{C}^{N}$ . Assuming perfect channel state information (CSI) at both ends, clients can adjust their transmitted signals based on the channel coefficients.

III-B Model Aggregation via AirComp

The AirComp-based model aggregation works as follows: client $k$ sends $\phi_{k}(\mathbb{g}_{t,k})$ for some pre-processing function $\phi_{k}(\cdot)$ simultaneously with all other clients over the channel. The PS then receives a superimposed version of these transmissions. Denoting the received signal by $\mathbb{r}_{t}$ , the PS estimates the aggregated gradient as $\hat{\mathbb{g}}_{t}=\psi(\mathbb{r}_{t})$ for some post-processing function $\psi(\cdot)$ . The pre- and post-processing functions serve as analog filters aiming to fulfill the transmission constraints, e.g, transmit power constraint, and minimize the estimation error.

Due to the diversity of $\mathbb{g}_{t,k}$ among devices, a universal pre- and post-processing function cannot guarantee the joint stationarity of the information-bearing symbols. i.e., local gradients. We hence opt for a data-and-CSI-aware design: prior to uplink transmission, client $k$ normalizes its local gradient as

\displaystyle\mathbb{s}_{t,k}=\frac{\mathbb{g}_{t,k}}{\|\mathbb{g}_{t,k}\|_{2}}.

(12)

As the normalized gradients are unit-norm, we can model them as jointly stationary processes. This means that $\mathbb{E}(\|\mathbb{s}_{t,k}[j]\|^{2})={1}/{d}$ for entry $j\in[d]$ . After normalization, client $k$ sends

\displaystyle\mathbb{x}_{t,k}=b_{t,k}\mathbb{s}_{t,k},

(13)

over its uplink channel, where $b_{t,k}\in\mathbb{R}$ is a power scaling factor satisfying

\displaystyle\mathbb{E}(|b_{t,k}\mathbb{s}_{t,k}[j]|^{2})=\frac{b_{t,k}^{2}}{d}\leq\mathsf{P_{0}},

(14)

with $\mathsf{P_{0}}$ denoting maximum transmit power of devices.

The PS receives the superimposed version of transmitted signals: let $\mathbb{r}_{t,j}\in\mathbb{C}^{N}$ denotes the received signal of the PS in the $j$ -th symbol interval of iteration $t$ . Using the channel model, we can write

\displaystyle\mathbb{r}_{t,j}

\displaystyle=\sum_{k\in\mathcal{S}_{t}}\mathbb{h}_{t,k}\mathbb{x}_{t,k}[j]+\mathbb{n}_{t,j}=\sum_{k\in\mathcal{S}_{t}}\hat{\mathbb{h}}_{t,k}b_{t,k}\mathbb{g}_{t,k}[j]+\mathbb{n}_{t,j},

where $\hat{\mathbb{h}}_{t,k}\triangleq{\mathbb{h}_{t,k}}/{\|\mathbb{g}_{t,k}\|_{2}}$ is the effective channel, and $\mathbb{n}_{t,j}\in\mathbb{C}^{N}$ is complex additive white Gaussian noise (AWGN) with mean zero and variance $\sigma^{2}$ , i.e., $\mathbb{n}_{t,j}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\mathbb{I})$ .

The PS uses the received signals $\mathbb{r}_{t,j}$ for $j\in[d]$ to determine an estimator of the global gradient $\mathbf{g}_{t}$ . Let $\mathbb{d}_{t}[j]$ denote the estimator of $\mathbf{g}_{t}[j]$ . Using linear post-processing, the PS determines $\mathbb{d}_{t}[j]$ from $\mathbb{r}_{t,j}$ as

	$\displaystyle\mathbb{d}_{t}[j]$	$\displaystyle=\frac{\mathbb{c}_{t}^{\mathsf{H}}\mathbb{r}_{t,j}}{\sqrt{\alpha_{t}}}$		(15)
		$\displaystyle=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbb{c}_{t}^{\mathsf{H}}\sum_{k\in\mathcal{S}_{t}}\hat{\mathbb{h}}_{t,k}b_{t,k}\mathbb{g}_{t,k}[j]+\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,j}\right),$		(16)

for some linear receiver $\mathbb{c}_{t}\in\mathbb{C}^{N}$ and the power factor $\alpha_{t}\neq 0$ .

Then, the PS employs a zero-forcing approach to estimate $\mathbb{g}_{t}$ . It first computes a linear receiver $\mathbb{c}_{t}$ using the CSI (details on computing $\mathbb{c}_{t}$ will be discussed later) and determines the power scaling factor $\alpha^{\text{ZF}}_{t}$ as:

\displaystyle\alpha^{\text{ZF}}_{t}=\mathsf{P_{0}}d\min_{k\in\mathcal{S}_{t}}\frac{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}}{|\mathcal{D}_{k}|^{2}}.

(17)

This way, the PS guarantees that all clients satisfy their transmit power constraint, and then it broadcasts $\alpha^{\text{ZF}}_{t}$ to the clients. Client $k$ upon receiving $\alpha^{\text{ZF}}_{t}$ determines its scaling as

\displaystyle b_{t,k}=\sqrt{\alpha^{\text{ZF}}_{t}}|\mathcal{D}_{k}|\frac{\hat{\mathbb{h}}_{t,k}^{\mathsf{H}}\mathbb{c}_{t}}{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|^{2}_{2}},

(18)

which satisfies the transmit power constraint (14).

The above coordination of clients realizes the desired superposition over the air in the absence of noise during uplink transmission. Substituting (17) and (18) in (15), it is concluded that the PS aggregates $\mathbb{d}_{t}[j]=\sum_{k\in\mathcal{S}_{t}}|\mathcal{D}_{k}|\mathbb{g}_{t,k}[j]+\frac{1}{\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,j}$ , which after scaling by ${1}/{|\mathcal{D}|}$ concludes the following

\displaystyle\tilde{\mathbb{g}}_{t}[j]

\displaystyle=\frac{1}{|\mathcal{D}|}\mathbb{d}_{t}[j]=\mathbb{g}_{t}[j]+\frac{1}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,j}.

(19)

The estimator of the gradient in iteration $t$ is hence given by

\displaystyle\tilde{\mathbb{g}}_{t}=\mathbb{g}_{t}+\tilde{\mathbb{n}}_{t},

(20)

where, for notational simplicity, we defined the $d$ -dimensional column vector $\tilde{\mathbb{n}}_{t}$ as

\displaystyle\tilde{\mathbb{n}}_{t}=\frac{1}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{t}}}[\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,1},\ldots,\mathbb{c}_{t}^{\mathsf{H}}\mathbb{n}_{t,d}]^{\mathsf{T}}.

(21)

Remark 2.

In general we can assume $\mathbb{g}_{t}$ to be complex, as we can represent every two entries of a complex gradient as a pair of in-phase and quadrature components.

Remark 3.

Note that $\mathbb{E}[\tilde{\mathbb{n}}_{t}]=0$ , and thus $\tilde{\mathbb{g}}_{t}$ is an unbiased estimator of $\mathbb{g}_{t}$ . As such, assuming that $\mathbb{g}_{t}$ itself is an unbiased estimator of the true gradient, $\tilde{\mathbb{g}}_{t}$ is an unbiased estimator of the true gradient, as well.

III-C Receiver Design for Model Aggregation

The variance of $\tilde{\mathbb{n}}_{t}$ which specifies the variance of the gradient estimator is given by $\tilde{\sigma}^{2}(\mathbb{c}_{t})=\frac{\sigma^{2}\|\mathbb{c}_{t}\|^{2}}{|\mathcal{D}|^{2}{\alpha^{\text{ZF}}_{t}}}$ . Substituting (17) in this term, we have

\displaystyle\tilde{\sigma}^{2}(\mathbb{c}_{t})=\frac{\sigma^{2}}{|\mathcal{D}|^{2}\mathsf{P_{0}}d}\max_{k\in\mathcal{S}_{t}}\frac{|\mathcal{D}_{k}|^{2}\|\mathbb{c}_{t}\|^{2}}{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}}.

(22)

Thus, for a given set of selected devices $\mathcal{S}_{t}$ , the optimal receiver that minimizes the estimation variance is given by a solution to the following optimization problem:

\displaystyle\min_{\mathbb{c}_{t}}\max_{k\in\mathcal{S}_{t}}\frac{|\mathcal{D}_{k}|^{2}\|\mathbb{c}_{t}\|^{2}}{\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}}.

(23)

Using a similar analysis as in [45], the optimization problem in (23) can be reformulated as

\displaystyle\min_{\mathbb{c}_{t}}\|\mathbb{c}_{t}\|^{2}\qquad\text{s.t.}\quad\|\mathbb{c}_{t}^{\mathsf{H}}\hat{\mathbb{h}}_{t,k}\|_{2}^{2}\geq|\mathcal{D}_{k}|^{2},\quad\forall k\in\mathcal{S}_{t}

(24)

which is a quadratically constrained quadratic programming problem that is computationally challenging. However, it can be transformed into a difference-of-convex-function (DC) program, as shown in Appendix A.

Remark 4.

The receiver $\mathbb{c}_{t}$ is updated only once per channel coherence time, and hence it is realistic to assume that we coordinate the network after each update of $\mathbf{c}_{t}$ , as it occurs with the same rate as for the CSI update.

III-D Device Selection Algorithm

The device selection is a combinatorial problem, which lies int the set of nondeterministic polynomial (NP) hard problems. We therefore invoke the sub-optimal approach developed in [20] to approximate its solution via Gibbs sampling (GS) method. The key idea in the GS method is to iteratively sample a device set from the neighboring devices based on a suitable distribution. This iterative process allows the selected devices to gradually converge towards an optimal set. We omit the details of the algorithm due to lack of space and refer the interested readers to [20] for more details.

IV GP-FL: Hessian Estimation

In the quasi-Newton approach, the PS estimates the Newton direction via $\mathbb{B}_{t}$ computed from the subsequent gradient estimators $\mathbb{g}_{t}$ and $\mathbb{g}_{t-1}$ according to (8). With over-the-air aggregation, the gradient estimators, i.e., $\tilde{\mathbb{g}}_{t}$ and $\tilde{\mathbb{g}}_{t-1}$ , are further distorted by effective channel noise process resulting in higher error variances. As demonstrated in our numerical investigations in Section VI, the direct derivation of $\mathbb{B}_{t}$ from the aggregated gradients $\tilde{\mathbb{g}}_{t}$ and $\tilde{\mathbb{g}}_{t-1}$ can significantly degrade the training performance as compared to the case with gradient estimators collected via noise-free aggregation. This observation can be intuitively explained as follows: the variance introduced by over-the-air aggregation leads (through the nonlinear procedure of solving (8)) to an estimator whose estimate of Newton’s direction is biased. The higher the aggregation variance is, the more this directional bias will be.

Unlike the primal estimation error in the mini-batch gradients, i.e., $\mathbb{g}_{t}$ , whose statistics is unknown, the computation error introduced by the over-the-air aggregation is statistically known to us. This knowledge can be potentially used to suppress the impact of computation error and extract a better estimator of the Hessian matrix. The key challenge in this respect is however that the Hessian estimator is related to the noisy aggregations through a nonlinear transform, i.e., the solution to (8). To overcome this challenge we follow the model-based approach suggested in various lines of work in the literature, e.g., [46], to compute a better estimator of the Hessian matrix.

IV-A Model-based Hessian Estimation

In model-based estimation approach, the stochastic nature of the solution to (8) is described through a stochastic process, which approximates the original process. In the sequel, we adapt the Gaussian model for this problem. This choice is particularly useful as Gaussian processes offer non-parametric probabilistic models to capture complexities of nonlinear functions [47].¹¹1Modeling Hessians as Gaussian processes was proposed in [46] for use in noisy settings but has not yet been applied to tasks like FL.

To understand the idea of probabilistic modeling of Hessian, let us look more precisely into the noisy quasi-Newton matrix computed from the over-the-air aggregations: at round $t$ , the PS uses its latest two aggregations to compute

\displaystyle\tilde{\mathbb{y}}_{t}=\tilde{\mathbb{g}}_{t}-\tilde{\mathbb{g}}_{t-1},

(25)

The quasi-Newton matrix $\tilde{\mathbb{B}}_{t}$ , which is an estimator of Hessian, can then be computed from $\tilde{\mathbb{y}}_{t}$ by solving

\displaystyle\tilde{\mathbb{B}}_{t}\mathbb{w}_{t}=\tilde{\mathbb{y}}_{t}.

(26)

With ideal links, $\tilde{\mathbb{y}}_{t}=\mathbb{y}_{t}$ , and using a deterministic scheme like BFGS, the solution to (26) matches that of (8). However, with over-the-air computation, deterministic methods yield a Hessian estimate with potentially higher bias and variance compared to the standard case with perfect aggregation²²2This is demonstrated in the numerical results.. To address this, we deviate from the classical approach of using deterministic schemes to solve (26). Instead, we model $\tilde{\mathbb{B}}_{t}$ and $\tilde{\mathbb{y}}_{t}$ as jointly Gaussian random processes. We approximate their statistics from the observation sequence and use these to derive a more robust Hessian estimator, effectively mitigating the impact of aggregation noise.

IV-B Quasi-Newton Matrix as Gaussian Process

We start by modeling the prior: let $\tilde{\mathbb{B}}_{t}$ be a Gaussian matrix with mean $\mathbf{M}$ and covariance $\mathbb{C}$ . For $i,j\in[d]$ , entry $(i,j)$ of $\tilde{\mathbb{B}}_{t}$ is an independent Gaussian process with mean $\mu_{i,j}=[\mathbf{M}]_{i,j}$ and variance $\psi_{i,j}$ given by $\mathbb{C}$ . For simplicity, we focus on a single entry of $\tilde{\mathbb{B}}_{t}$ , denoting it as $\tilde{b}_{t}\sim\mathcal{N}(\mu,\psi)$ , dropping the index $(i,j)$ . This does not affect the generality of the analysis, as entries of $\tilde{\mathbb{B}}_{t}$ are statistically independent and follow the same estimation procedure.

Our ultimate goal is to model $\tilde{b}_{t}$ from the observations $\tilde{\mathbb{y}}_{t}$ collected through time $t$ . To this end, we collect the last $r$ gradient difference in a set $\mathcal{Y}_{t}$ , i.e.,

\displaystyle\mathcal{Y}_{t}=\{\tilde{\mathbb{y}}_{t-r},\dots,\tilde{\mathbb{y}}_{t}\},

(27)

and model it jointly with entry $\tilde{b}_{t}$ as a Gaussian processes $\mathcal{GP}$ . More precisely, let us define $\mathbb{z}=[\mathbb{o}_{t}^{\sf T},\tilde{b}_{t}]^{\sf T}$ , where $\mathbb{o}_{t}$ is defined as the concatenation of entries in $\mathcal{Y}_{t}$ , i.e.,

\displaystyle\mathbb{o}_{t}=\left[\tilde{\mathbb{y}}_{t-r+1}^{\sf T},\ldots,\tilde{\mathbb{y}}_{t}^{\sf T}\right]^{\sf T}.

(28)

We model $\mathbb{z}_{t}$ as a Gaussian vector whose mean is $\boldsymbol{\xi}_{t}\in\mathbb{R}^{rd+1}$ and whose covariance matrix $\boldsymbol{\Sigma}_{t}\in\mathbb{R}^{(rd+1)\times(rd+1)}$ . Considering this statistical model, our goal is to use the sample data collected in the last $r$ communication rounds, i.e., $\mathcal{Y}_{t}$ , to find an estimate of the mean and covariance. We then use the assumed statistical model to find an alternative estimator for $\tilde{b}_{t}$ .

IV-C Estimating Parameters of Gaussian Process

To estimate the covariance matrix, we follow the conventional approach in the literature [48]: we estimate the covariance of $\mathbb{z}_{t}$ from its samples using the kernel function $\kappa$ . The choice of an appropriate kernel is based on assumptions such as smoothness and likely patterns that are expected in the data. Given the nature of data in this problem, i.e., the fact that $\tilde{\mathbb{y}}_{t}$ is the gradient difference, one popular choice of the kernel is the radial basis function [48].

Definition 1 (Radial basis kernel).

For inputs $\mathbb{u}\in\mathbb{R}^{n}$ and $\mathbb{v}\in\mathbb{R}^{m}$ , the radial basis function $\kappa$ computes an $n\times m$ matrix whose entry in the $i$ -th row and $j$ -th column is obtained as

\displaystyle[\kappa(\mathbb{u},\mathbb{v})]_{i,j}=\exp\left(-\frac{|[\mathbb{u}]_{i}-[\mathbb{v}]_{j}|^{2}}{2\tau^{2}}\right).

(29)

It is worth mentioning that using the radial basis kernel, the estimator of the covariance matrix computed from $\mathbb{z^{0}}$ sampled from $\mathbb{z}$ , i.e., $\kappa(\mathbb{z^{0}},\mathbb{z^{0}})$ , is a positive semi-definite matrix.

To use the above covariance estimator, we require a sample from $\mathbb{z}_{t}$ . To this end, we use a deterministic solver to find a solution to (26) for the last $r$ communication rounds. This means that we find $\tilde{\mathbf{B}}_{i}^{0}$ by solving (26) with $\tilde{\mathbb{y}}_{i}^{0}$ for $i\in\{t-r+1,\ldots,t\}$ , where $\tilde{\mathbb{y}}_{i}^{0}$ denotes to the sample gradient difference computed from the observed received signals.³³3Superscript $0$ is to distinguish random samples from stochastic processes. Let $\tilde{b}_{i}^{0}$ be an entry of $\tilde{\mathbf{B}}_{i}^{0}$ . We sample $\mathbb{z}$ as $\mathbb{z}^{0}=[\mathbb{o}_{t}^{0\sf T},\tilde{b}_{t}^{0}]^{\sf T}$ with

\displaystyle\mathbb{o}_{t}^{0}=\left[\tilde{\mathbb{y}}_{t-r+1}^{0\sf T},\ldots,\tilde{\mathbb{y}}_{t}^{0\sf T}\right]^{\sf T},

(30)

and estimate the covariance of $\mathbb{z}_{t}$ with the sample covariance matrix $\kappa(\mathbb{z}_{t}^{0},\mathbb{z}_{t}^{0})$ computed by the radial basis kernel.

We next compute an estimator of the mean $\boldsymbol{\xi}_{t}$ using a moving average: we estimate the mean of $\tilde{b}_{t}$ by arithmetic average of $\tilde{b}_{t-r+1}^{0},\ldots,\tilde{b}_{t}^{0}$ denoted by $\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}$ , and the mean of $\mathbb{o}_{i}$ by the moving average of $\mathbb{o}_{i}^{0}$ , i.e., we set the mean of $\tilde{\mathbb{y}}_{i}$ to be the average of $\{\tilde{\mathbb{y}}_{t-r}^{0},\dots,\tilde{\mathbb{y}}_{i}^{0}\}$ for $i\in\{t-r+1,\ldots,t\}$ . Let us denote this moving average by $\mu[\mathbb{o}_{t}^{0}]$ . We can then approximate the random process $\mathbb{z}_{t}$ as

\displaystyle\mathbb{z}_{t}\sim\mathcal{N}\begin{pmatrix}\begin{bmatrix}\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}\\ \mu[\mathbb{z}_{t}^{0}]\end{bmatrix},&\begin{bmatrix}\beta_{t}&\boldsymbol{\phi}_{t}^{\sf T}\\ \boldsymbol{\phi}_{t}&\mathbb{K}_{t}\end{bmatrix}\end{pmatrix},

(31)

for scalar $\beta_{t}=\kappa(\tilde{{b}}_{t}^{0},\tilde{{b}}_{t}^{0})$ , vector $\boldsymbol{\phi}_{t}=\kappa(\mathbb{o}_{t}^{0},\tilde{b}_{t}^{0})$ , and symmetric and positive semi-definite matrix $\mathbb{K}_{t}=\kappa(\mathbb{o}_{t}^{0},\mathbb{o}_{t}^{0})$ .

IV-D Estimating Hessian via Posterior Sampling

The statistical model considered for the quasi-Newton matrix and the sample observations enables us to find an alternative estimator for the Hessian matrix. To this end, we derive the posterior distribution of $\tilde{\mathbb{B}}_{t}$ in the following lemma.

Lemma 1 (Posterior of quasi-Newton matrix).

Consider the Gaussian model for $\mathbb{z}_{t}$ given in (31). The entry $\tilde{b}_{t}$ conditional to observation $\mathbb{o}_{t}$ is distributed Gaussian with mean $\zeta(\mathbb{o}_{t})$ and variance $\psi(\mathbb{o}_{t})$ that are given by


$\displaystyle\zeta(\mathbb{o}_{t})$	$\displaystyle=\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}-\mathbb{a}_{t}^{\sf T}(\mathbb{o}_{t}-\mu[\mathbb{o}_{t}^{0}]),$	(32a)
$\displaystyle\psi(\mathbb{o}_{t})$	$\displaystyle=\beta_{t}-\mathbb{a}_{t}^{\sf T}\boldsymbol{\phi}_{t},$	(32b)

where $\mathbb{a}_{t}$ is the solution to $\mathbb{K}_{t}\mathbb{a}_{t}=\boldsymbol{\phi}_{t}$ which can be computed via the conjugate gradient algorithm.

Proof.

Since $\mathbb{z}_{t}$ is Gaussian, the posterior distribution of $\tilde{b}_{t}$ is also Gaussian. After some standards derivations, the mean and variance of this conditional distribution are given by [48]


	$\displaystyle\zeta(\mathbb{o}_{t})$	$\displaystyle=\hat{\mathbb{E}}\{\tilde{b}_{t}^{0}\}-\boldsymbol{\phi}_{t}^{\sf T}\mathbb{K}_{t}^{-1}(\mathbb{o}_{t}-\mu[\mathbb{o}_{t}^{0}]),$	(33a)
	$\displaystyle\psi(\mathbb{o}_{t})$	$\displaystyle=\beta_{t}-\boldsymbol{\phi}_{t}^{\sf T}\mathbb{K}_{t}^{-1}\boldsymbol{\phi}_{t}.$	(33b)
By defining $\mathbb{a}_{t}$ as in the lemma, the proof is concluded.

∎

Using this posterior distribution, we can compute various estimators for $\tilde{b}_{t}$ . Noting that this estimator is used to estimate Newton’s direction, we compute a simple unbiased estimator by sampling from the posterior distribution:⁴⁴4From the Bayesian viewpoint, one might consider setting the estimator to the posterior mean, as it describes the minimum mean squared error (MMSE) estimate. Although MMSE returns minimal variance, it is in general biased and hence can lead to directional misalignment. for every entry of $\tilde{\mathbb{B}}_{t}$ , we compute the posterior mean $\zeta(\mathbb{o}_{t}^{0})$ and variance $\psi(\mathbb{o}_{t}^{0})$ using the sample observation $\mathbb{o}_{t}^{0}$ . Let $\hat{b}_{i,j,t}$ be the posterior sample for entry $(i,j)$ . We build the Hessian estimator $\hat{\mathbb{B}}_{t}$ from entries $\hat{b}_{i,j,t}$ and compute

\displaystyle\tilde{\mathfrak{d}}_{t}=-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}.

(34)

The global model is then updated as $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}$ .

Although validated numerically, it is insightful to explain why (34) provides a better estimate of the true Newton direction compared to deterministic approaches like BFGS. Classical deterministic methods rely only on the latest two gradient samples, which can result in highly inaccurate curvature estimates under noisy aggregation. In contrast, the proposed scheme uses information from the last $r+1$ gradient samples to estimate the loss curvature. This allows it to effectively mitigate the impact of aggregation noise and compute a more robust curvature estimate.

Remark 5.

From Lemma 1, computing the posterior involves solving an inverse problem to find $\mathbb{a}_{t}$ , which may add computational cost compared to deterministic approaches. However, this inverse problem can be efficiently solved with reduced complexity using the conjugate gradient algorithm [38].

V Algorithm and Convergence Analysis

The derivations in Sections III and IV enable the realization of a second-order FL framework over the air, referred to as Gaussian Process FL (GP-FL). Compared to classical second-order FL, GP-FL introduces two key aspects: ( $i$ ) it uses analog function computation to aggregate local models directly over the air, and ( $ii$ ) it invokes the Gaussian model to compute a more robust estimation for Hessian from a finite window of recent aggregated gradients. The final algorithm is summarized in Algorithm 1, outlining the framework with device scheduling. At communication round $t$ , the PS selects a set of participating clients $\mathcal{S}_{t}$ from the $K$ available ones using a scheduling scheme, specifically the GS method from [20]. The PS shares $\boldsymbol{\theta}_{t}$ with the devices. Upon receiving the global model, the devices compute local gradients and transmit them over their uplink channels. The PS aggregates the global gradient over the air and estimates the Hessian matrix using the method from Section IV, which is then used to update the model via (34).

Remark 6.

We assume the PS has high computational capabilities, rendering the complexity of calculating a direction via (34) negligible. Consequently, the overall running times for GP-FL and conventional second-order FL are nearly identical, as confirmed by our numerical experiments.

Input: Number of global epochs

T

, global learning rate

{\eta}_{t}

, sindow size

r

, local datasets

\{\mathcal{D}_{k}\}_{k\in K}

for $t=0,1,\dots,T-1$ do

PS randomly selects a subset of devices

\mathcal{S}_{t}

and sends

\boldsymbol{\theta}_{t}

to them.

for device $k\in\mathcal{S}_{t}$ in parallel do

Compute local gradient as

\mathbb{g}_{t,k}=\frac{1}{|\mathcal{D}_{k}|}\sum_{(\mathbb{u},v)\in\mathcal{D}_{k}}\ell(\boldsymbol{\theta}_{t},\mathbb{u},v)

Normalize

\mathbb{g}_{t,k}

according to (12), and find

\mathbb{x}_{t,k}

based on (13).

Send

\mathbb{x}_{t,k}

to the PS through wireless channel.

end for

PS calculates

\tilde{\mathbb{g}}_{t}

as per (19).

PS finds the updating direction as per (34).

PS updates the global model as

\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}

end for

Algorithm 1 GP-FL

V-A Scheduling the Learning Rate

The remaining of this section provides convergence analysis for GP-FL under a set of regularity assumptions, which are commonly considered in the literature [2]. We start the analysis by stating the first assumption.

Assumption 1 (Smoothness and convexity).

The global loss function is twice continuously differentiable, $L$ -Lipschitz gradient ( $L$ -smooth) and $\lambda$ -strongly convex. As such, we have

\displaystyle\lambda\mathbb{I}\preceq\nabla^{2}f(\boldsymbol{\theta})\preceq L\mathbb{I}.

(35)

The strong convexity of the global loss function implies that there exists a unique optimal model parameter, which we denote by $\boldsymbol{\theta}^{\star}$ hereafter in the paper.

For global convergence of GP-FL, as a stochastic estimator of the Newton method, under Assumption 1 the magnitude of model update should be controlled. The classical approach for update control is to resort backtracking line search methods. In this work, we deviate from this classical scheme and control the update magnitude by scheduling the global learning rate. To this end, we invoke the results of [49], which proposes an adaptive learning rate to facilitate the global convergence of Newton-type methods. Casting GP-FL on the proposed scheme in [49], it is shown that for convergence guarantee of GP-FL to the minimizer of an $L$ -smooth and $\lambda$ -strongly convex loss, it is sufficient to schedule the learning rate as

\displaystyle\eta_{t}=\min\{1,\frac{\lambda^{2}}{L\|\mathbb{g}_{t}\|}\}.

(36)

We hence consider this scheduling throughout the analysis.

V-B Convergence Results

To present the convergence guarantee for GP-FL, we first need to define the concept of $\delta$ -approximate matrix.

Definition 2 ( $\delta$ -approximate).

Matrix $\hat{\mathbb{H}}$ is said to be a $\delta$ -approximate of $\mathbb{H}$ , if the following inequality holds

\displaystyle\|\hat{\mathbb{H}}-\mathbb{H}\|\leq\delta\|\mathbb{H}\|.

(37)

Using this definition, we can now present our first convergence result: Theorem 1 gives an upper-bound on the gap between the converging and optimal models.

Theorem 1.

Let Assumption 1 hold. Moreover, let the inverse of sample quasi-Hessian matrix drawn from the posterior in Lemma 1 in round $t$ be a $\delta_{t}$ -approximate of the inverse Hessian matrix, and define $\delta=\max_{t}\delta_{t}$ . Then, the distance between the global model updated in round $t$ , i.e., $\boldsymbol{\theta}_{t}$ and the optimal model $\boldsymbol{\theta}^{\star}$ is bounded from above as

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq\mu^{t}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}+C_{t},

(38)

where $\mu=L\delta/\lambda$ and $C_{t}$ is given by

\displaystyle C_{t}=\frac{\mu^{t}-1}{\mu-1}\begin{cases}C_{0}&t\leq t_{0}\\ C_{1}&t>t_{0}\end{cases},

(39)

for $C_{0}$ and $C_{1}$ being


$\displaystyle C_{0}$	$\displaystyle=\frac{\lambda}{L}(t_{0}-t+\frac{2\gamma}{1-\gamma})+\frac{\delta+1}{\lambda}\frac{\sigma_{n}\big{\\|}\mathbb{c}_{t}\big{\\|}}{\|\mathcal{D}\|\sqrt{\alpha^{\mathrm{ZF}}_{t}}}$	(40a)
$\displaystyle C_{1}$	$\displaystyle=\frac{2\lambda\gamma^{2^{t-t_{0}}}}{L(1-\gamma^{2^{t-t_{0}}})}+\frac{\delta+1}{\lambda}\frac{\sigma_{n}\big{\\|}\mathbb{c}_{t}\big{\\|}}{\|\mathcal{D}\|\sqrt{\alpha^{\mathrm{ZF}}_{t}}}$	(40b)

with $t_{0}=\max\Bigl{\{}0,\bigl{\lceil}\frac{2L}{\lambda^{2}\|\mathbb{g}_{0}\|}\bigr{\rceil}-2\Bigr{\}}$ and $\gamma=\frac{L}{2\lambda^{2}}\|\mathbb{g}_{0}\|-\frac{t_{0}}{4}$ .

Proof.

The proof is given in Appendix B. ∎

Theorem (1) implies that the GP-FL algorithm exhibits a linear-quadratic convergence rate. In fact, the quadratic term in (65) is exactly the same as the one reported in [49] for Newton-type methods. Nevertheless, our result has an extra linear term, as a result of model-based estimation of the Hessian matrix and the AirComp aggregation error.

We next present Corollaries 1 and 2, which characterize the scenarios under which the GP-FL algorithm converges quadratically and linearly.

Corollary 1.

Let the initial model be in a bounded distance of the optimal model as

\displaystyle\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}<\frac{\mu^{t}-1}{\mu^{t}(\mu-1)}C_{1}.

(41)

Then, the updated model in round $t$ satisfies

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq\frac{2(\mu^{t}-1)}{\mu-1}C_{1}.

(42)

and lies within the $\varepsilon$ -neighborhood of optimal model after $T_{\varepsilon}$ rounds, i.e., $\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{T_{\varepsilon}}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\leq\varepsilon$ , if $\mu<1$ and

\displaystyle T_{\varepsilon}=\mathcal{O}\left(\log\log\frac{1}{\varepsilon}\right).

(43)

which is also called super-linear convergence rate.

Proof.

See Appendix C. ∎

Corollary 1 presents an intuitive result: when GP-FL is initiated in a close enough vicinity of the optimal model, it can quadratically converge to the optimal model. We next consider the case with linear convergence.

Corollary 2.

Let the initial model satisfies

\displaystyle\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\geq\frac{\mu^{t}-1}{\mu^{t}(\mu-1)}C_{1}.

(44)

Then, the updated model in round $t$ satisfies

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq 2\mu^{t}\mathbb{E}\big{[}\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\|\big{]}.

(45)

\displaystyle T_{\varepsilon}=\mathcal{O}\left(\frac{1}{\log(\frac{1}{\mu})}\log\frac{1}{\varepsilon}\right).

(46)

Proof.

See Appendix D. ∎

It is worth mentioning that in classical distributed methods, the number of rounds required to achieve a desired precision $\varepsilon$ , follows a linear convergence rate. Specifically, for vanilla federated averaging $T_{\varepsilon}=\mathcal{O}\left(\frac{L}{\lambda}\log\frac{1}{\varepsilon}\right)$ . This underscores the superiority of GP-FL in terms of convergence rate.

The convergence analysis can be extended to characterize the optimality gap, i.e., the difference between $f(\boldsymbol{\theta}_{t})$ and the minimal loss. Theorem 2 gives an upper bound on the optimality gap at round $t$ whose proof is differed to Appendix E.

Theorem 2.

Consider the assumptions in Theorem 1. Then, the optimality gap in round $t$ is bounded from above as

\displaystyle\mathbb{E}\big{[}f(\boldsymbol{\theta}_{t})\big{]}-f(\boldsymbol{\theta}^{\star})\leq\frac{L}{2}\left({\mu^{2t}}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}^{2}\big{]}+C^{\prime}_{t}\right),

(47)

where $C^{\prime}_{t}$ is defined as

\displaystyle C^{\prime}_{t}=\frac{\mu^{2t}-1}{\mu^{2}-1}\begin{cases}C^{\prime}_{0}&t\leq t_{0}\\ C^{\prime}_{1}&t>t_{0}\end{cases}

(48)

for $\mu$ that is defined in Theorem 1, and


	$\displaystyle C^{\prime}_{0}=\frac{\lambda^{2}}{L^{2}}(t_{0}-t+\frac{2\gamma}{1-\gamma})^{2}+\left(\frac{\delta+1}{\lambda}\right)^{2}\frac{\sigma^{2}_{n}\big{\\|}\mathbb{c}_{t}\mathbb{c}^{\mathsf{H}}_{t}\big{\\|}}{\|\mathcal{D}\|^{2}\alpha^{\text{ZF}}_{t}},$		(49a)
	$\displaystyle C^{\prime}_{1}=\left(\frac{2\lambda\gamma^{2^{t-t_{0}}}}{L(1-\gamma^{2^{t-t_{0}}})}\right)^{2}+\left(\frac{\delta+1}{\lambda}\right)^{2}\frac{\sigma^{2}_{n}\big{\\|}\mathbb{c}_{t}\mathbb{c}^{\mathsf{H}}_{t}\big{\\|}}{\|\mathcal{D}\|^{2}\alpha^{\text{ZF}}_{t}}.$		(49b)

VI Simulation Results

We validate GP-FL by numerical experiments. Throughout the experiments, we elaborate the effectiveness of our proposed scheme by comparing it with several benchmarks, namely the following first-order and second-order FL algorithms:

AirComp-FedAvg (First-order)

Our first benchmark is AirComp-FedAvg proposed in [12]. The algorithm schedules devices using difference-of-convex programming.

GIANT (Second-order)

We consider GIANT algorithm [21] when realized directly over the air via AirComp. We note that GIANT requires an additional aggregation of local gradients, resulting in two communication rounds per iteration. The communication model for this gradient aggregation follows the same procedure as we discussed in Section III-A.

DANE (Second-order)

We deploy AirComp to realize the DANE algorithm proposed in [31]. Similar to GIANT, this algorithm requires two aggregations per communication round.

Sec-Order (Second-order)

We consider the Sec-Order algorithm [20], where the clients locally find a Newton direction and send them to the PS via AirComp. Note that this approach loads more computations on local devices.

BFGS

The PS uses BFGS to find quasi-Newton direction based on the AirComp-based noisy gradients.

We follow the simulation setup in [7] and consider channels with 200 distinct noise standard deviation values, namely $\{0.005,0.0100,\dots,1\}$ . For each value, we generate a corresponding channel. Then, based on the number of devices $K$ , we sample $K$ values from this noise set. The server is equipped with $N=5$ antennas. For a fair comparison, we apply the same receiver beamforming method described in Section III-C and the same client selection method discussed in Section III-D across all benchmark methods, including GP-FL. For all experiments, we use the SGD optimizer on local devices with a learning rate of $\eta=0.1$ , momentum set to 0.9, and weight decay of 0.0005. The batch size is set to 64. The hyper-parameters of the benchmark methods are aligned with those reported in their respective original papers. Additionally, we set the observation window width in GP-FL to $r=20$ . Other hyper-parameters are specified for each experiment separately.

VI-A LIBSVM Dataset

First, we show results for a binary classification task using logistic regression and three datasets from the LIBSVM library [24]: a9a, w8a, and phishing. The data samples are uniformly distributed across $K=20$ clients, all of whom participate in 50 communication rounds. We use the logistic regression model, a type of generalized linear model, as described in [50]. The results for a9a, w8a, and phishing datasets are depicted in Figures 1a, 1b and 1c. As observed, GP-FL achieves higher classification accuracy than benchmark methods.

VI-B Fashion-MNIST Dataset

We next use a fully-connected neural network with two hidden layers and perform 300 communication rounds, with all $K=10$ clients participating in each round. The results are shown in Figure 1d. As observed, GP-FL not only achieves higher classification accuracy than the benchmark methods but also reaches 80% accuracy within just five rounds.

VI-C CIFAR-10 Dataset (Non-iid Settings)

We evaluate CIFAR-10 classification in two setups:

•

Setup I: Following [51, 52], we organize the dataset by class and divide it into 200 shards. Each client randomly selects two shards without replacement. We use a feedforward neural network with two hidden layers for the FL task, with $K=100$ clients, 4000 communication rounds, and a 10% client sampling rate per round.
•

Setup II: We distribute the dataset among them using Dirichlet allocation with $\beta=0.5$ . Using ResNet-18 with Group Normalization, we conduct 200 communication rounds, wherein all clients participate. We set $K=10$ .

The test accuracy curves for our method and the benchmark methods are depicted in Figures 1e and 1f. As illustrated, compared to the benchmark methods, GP-FL (i) achieves a higher test accuracy, and (ii) demonstrates a faster rate of convergence. For example, in Figure 1e, GP-FL reaches 45% test accuracy in approximately 700 rounds, whereas the best-performing second-order benchmark methods require around 1600 rounds to reach the same accuracy.

VI-D CIFAR-100 Dataset (Non-iid Settings)

CIFAR-100 shares the same sample size as CIFAR-10, yet it encompasses a broader diversity with 100 distinct classes. We next train the ResNet-18 model for this classification. Through training, we use Group Normalization and let all clients participate in each communication round. For this experiment, we consider the following two setups:

•

Setup I: We set $K=10$ and $\beta=0.5$ for Dirichlet allocation, and use 400 communication rounds.
•

Setup II: We set $K=50$ and $\beta=0.05$ for Dirichlet allocation, and use 200 communication rounds.

The test accuracy curve for our method and the benchmark methods are depicted in Figures 1g and 1h. Similar conclusions can be drawn from these figures as those from the CIFAR-10 dataset. For example, in Figure 1g, GP-FL reaches 30% test accuracy in just 120 rounds, whereas the best benchmark method requires 180 rounds to achieve the same performance.

VI-E Comments on Required Observation Window

From an implementation perspective, GP-FL requires the PS to select an efficient observation window $r$ . We empirically examine how different $r$ values affect GP-FL’s performance by repeating the experiment in Setup I with CIFAR-10 (see Subsection VI-C) for $r=\{0,5,10,15,20,50,100\}$ . Setting $r=0$ corresponds to the standard quasi-Newton algorithm. The results, summarized in Table I, show no improvement beyond $r=20$ , with accuracy dropping at $r=100$ . This is because recent gradients carry the most relevant curvature information. Thus, the algorithm can be efficiently implemented with a reasonably sized observation window.

TABLE I: Impacts of hyper-parameter

r

: the simulation setup is the same as Setup I with CIFAR-10.

$r$	0	5	10	15	20	50	100
Average Accuracy (%)	45.24	46.75	47.88	48.41	48.59	48.61	48.55

VII Conclusion

While second-order methods are favored in FL for their fast convergence, the communication cost of sharing local Hessian matrices poses a challenge. Attempts to approximate these matrices with first-order information have limited effectiveness in real-world scenarios due to noisy updates from imperfect wireless channels. In this paper, we propose a novel second-order FL algorithm tailored for wireless channels, termed Gaussian process-based Hessian modeling for FL (GP-FL). The algorithm uses a non-parametric approach to estimate the global Hessian matrix from noisy local gradients received over uplink channels. Our analysis demonstrates that GP-FL achieves a linear-quadratic convergence rate. Extensive numerical experiments validate the proposed method’s efficiency. Extending GP-FL to other distributed learning approaches is a natural direction in this area.

Appendix A Solving Optimization (23)

To solve the joint optimization in (23), we begin by transforming it into a low-rank optimization problem. Specifically, let $\mathbb{C}=\mathbb{c}_{t}\mathbb{c}_{t}^{\mathsf{H}}$ , where $\text{rank}(\mathbb{C})=1$ , and let $\mathbb{H}_{k}=\hat{\mathbb{h}}_{t,k}\hat{\mathbb{h}}_{t,k}^{\mathsf{H}}$ . This allows us to reformulate (23) as

	$\displaystyle\min_{\mathbb{C}}~{}\text{Tr}(\mathbb{C})$		(50)
	$\displaystyle\text{s.t.}~{}\mathbb{C}\succeq\mathbb{0},~{}\text{rank}(\mathbb{C})=1,~{}\text{Tr}(\mathbb{C}\mathbb{H}_{k})\geq\|\mathcal{D}_{k}\|^{2},\quad\forall k\in\mathcal{S}_{t}.$

Effectively addressing the rank-one constraint is crucial for solving low-rank optimization problems. A common approach is semidefinite relaxation (SDR), which removes the constraint, reformulating the problem as semidefinite programming to approximate the solution. However, for larger problems, the rank-one constraint is often violated, requiring randomization methods to scale the solution.

An alternative approach to enforce the rank-one constraint is to express it as $\text{Tr}(\mathbb{C})-\|\mathbb{C}\|_{2}=0$ with $\text{Tr}(\mathbb{C})>0$ . This reformulates the problem into a difference-of-convex-functions (DC) program, allowing for a more accurate solution by satisfying all constraints. Building on methods in [20], we introduce the following modified DC programming, incorporating the new constraint as a penalty term:

	$\displaystyle\min_{\mathbb{C}}~{}\text{Tr}(\mathbb{C})+\zeta(\text{Tr}(\mathbb{C})-\\|\mathbb{C}\\|_{2})$		(51)
	$\displaystyle\text{s.t.}~{}\mathbb{C}\succeq\mathbb{0},~{}\text{Tr}(\mathbb{C})>0,~{}\text{Tr}(\mathbb{C}\mathbb{H}_{k})\geq\|\mathcal{D}_{k}\|^{2},\quad\forall k\in\mathcal{S}_{t},$

where $\zeta$ is a regularizer. Although this remains a non-convex problem due to the concave term $-\|\mathbb{C}\|_{2}$ , we can linearize $\|\mathbb{C}\|_{2}$ and transform it into the following convex iterative subproblem:

	$\displaystyle\min_{\mathbb{C}}~{}(1+\zeta)\text{Tr}(\mathbb{C})-\zeta\partial\\|\mathbb{C}_{j}\\|_{2}\cdot\mathbb{C}$		(52)
	$\displaystyle\text{s.t.}~{}\mathbb{C}\succeq\mathbb{0},~{}\text{Tr}(\mathbb{C})>0,~{}\text{Tr}(\mathbb{C}\mathbb{H}_{k})\geq\|\mathcal{D}_{k}\|^{2},\quad\forall k\in\mathcal{S}_{t},$

where $\mathbb{C}_{j}$ is $j$ -th iterative of $\mathbb{C}$ , $\partial\|\mathbb{C}_{j}\|_{2}$ is the sub-gradient of $\|\mathbb{C}_{j}\|_{2}$ . For a complete description of the iterative algorithm, please refer to Algorithm 2 in [20].

Appendix B Proof of Theorem 1

Before starting the derivation, note that we frequently use the triangle inequality, which states that for two vectors $\mathbb{v}$ and $\mathbb{u}$ , $\|\mathbb{v}\pm\mathbb{u}\|\leq\|\mathbb{v}\|+\|\mathbb{u}\|$ . For clarity, we will omit explicitly mentioning its use when it is evident from the context.

We aim to bound the distance to the optimal model $\boldsymbol{\theta}^{\star}$ at round $t$ , i.e., $\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|$ . This bound is derived by establishing a recursive relationship between the distances in consecutive communication rounds. To begin, we note that $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}$ , where $\tilde{\mathfrak{d}}_{t}=-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}$ as given in (34). We hence have

\displaystyle\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{\star}\|

\displaystyle=\big{\|}\boldsymbol{\theta}_{t}+\eta_{t}\tilde{\mathfrak{d}}_{t}-\boldsymbol{\theta}^{\star}\big{\|}\leq\mathcal{M}_{1}+\mathcal{M}_{2},

(53)

where we define $\mathcal{M}_{1}=\big{\|}\boldsymbol{\theta}_{t}-\eta_{t}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}-\boldsymbol{\theta}^{\star}\big{\|}$ , and

\displaystyle\mathcal{M}_{2}

\displaystyle=\big{\|}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}+\tilde{\mathfrak{d}}_{t}\big{\|}=\big{\|}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}\big{\|}.

(54)

The term $\mathcal{M}_{1}$ is upper-bounded using the results reported in [49]. In particular, defining $t_{0}$ and $\gamma$ as in Theorem 1, the result of [49] implies that

\displaystyle\mathcal{M}_{1}\leq\frac{\lambda}{L}\begin{cases}t_{0}-t+\dfrac{2\gamma}{1-\gamma},~{}&t\leq t_{0}\\ \dfrac{2\gamma^{2^{t-t_{0}}}}{1-\gamma^{2^{t-t_{0}}}},&t>t_{0}\end{cases}.

(55)

In the next step, we find an upper bound on $\mathcal{M}_{2}$ . To this end, we use the triangular inequality once again to write


$\displaystyle\mathcal{M}_{2}$	$\displaystyle\leq\big{\\|}\mathbb{H}^{-1}_{t}\mathbb{g}_{t}-\hat{\mathbb{B}}^{-1}_{t}\mathbb{g}_{t}\big{\\|}+\big{\\|}\hat{\mathbb{B}}^{-1}_{t}\mathbb{g}_{t}-\hat{\mathbb{B}}^{-1}_{t}\tilde{\mathbb{g}}_{t}\big{\\|}$	(56a)
	$\displaystyle\leq\big{\\|}\mathbb{H}^{-1}_{t}-\hat{\mathbb{B}}^{-1}_{t}\big{\\|}\big{\\|}\mathbb{g}_{t}\big{\\|}+\big{\\|}\hat{\mathbb{B}}^{-1}_{t}\big{\\|}\big{\\|}\mathbb{g}_{t}-\tilde{\mathbb{g}}_{t}\big{\\|}.$	(56b)

Noting that $\hat{\mathbb{B}}^{-1}_{t}$ is $\delta_{t}$ -approximator of $\mathbf{H}_{t}^{-1}$ , we can write

\displaystyle\big{\|}\mathbb{H}^{-1}_{t}-\hat{\mathbb{B}}^{-1}_{t}\big{\|}\leq\delta_{t}\big{\|}\mathbb{H}^{-1}_{t}\big{\|}\leq\delta\big{\|}\mathbb{H}^{-1}_{t}\big{\|},

(57)

with the latter inequality being concluded directly from the fact that $\delta=\max_{t}\delta_{t}$ . We further note that $\lambda$ -strong convexity of the loss yields $\big{\|}\mathbb{H}^{-1}_{t}\big{\|}\leq\frac{1}{\lambda}$ . We hence conclude that

\displaystyle\big{\|}\mathbb{H}^{-1}_{t}-\hat{\mathbb{B}}^{-1}_{t}\big{\|}\leq\frac{\delta}{\lambda},

(58)

and use the $L$ -smoothness of the global loss to write

\displaystyle\big{\|}\mathbb{g}_{t}\big{\|}\leq L\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}.

(59)

Using (58) and (59), the first term in (56b) is straightforwardly bounded. To bound the second term, we have


$\displaystyle\big{\\|}\hat{\mathbb{B}}^{-1}_{t}\big{\\|}$	$\displaystyle\leq\big{\\|}\hat{\mathbb{B}}^{-1}_{t}-\mathbb{H}^{-1}_{t}\big{\\|}+\big{\\|}\mathbb{H}^{-1}_{t}\big{\\|}$	(60a)
	$\displaystyle\leq\frac{\delta}{\lambda}+\frac{1}{\lambda}=\frac{\delta+1}{\lambda}.$	(60b)

Using this bound, we can finally write

\displaystyle\mathcal{M}_{2}\leq\mu\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}+\frac{\delta+1}{\lambda}\big{\|}\mathbb{g}_{t}-\tilde{\mathbb{g}}_{t}\big{\|},

(61)

with $\mu$ being defined as in Theorem 1.

The above bound leads to an instantaneous recursive bound. We now recall that in the AirComp-aided aggregation we have $\tilde{\mathbb{g}}_{t}-\mathbb{g}_{t}=\tilde{\mathbb{n}}_{t}$ . By taking expectation from both sides of (61) w.r.t. to the randomness in communication links, we obtain

\displaystyle\mathbb{E}\big{[}\mathcal{M}_{2}\big{]}\leq\mu\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}+\frac{\delta+1}{\lambda}\mathbb{E}\big{[}\big{\|}\tilde{\mathbb{n}}_{t}\big{\|}\big{]}.

(62)

Using (21), we can further write


$\displaystyle\mathbb{E}\big{[}\big{\\|}\tilde{\mathbb{n}}_{t}\big{\\|}\big{]}$	$\displaystyle=\frac{1}{\|\mathcal{D}\|\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{E}\big{[}\big{\\|}\mathbb{n}_{t,j}^{\mathsf{H}}\mathbb{c}_{t}\big{\\|}\big{]}$	(63a)
	$\displaystyle\leq\frac{1}{\|\mathcal{D}\|\sqrt{\alpha^{\text{ZF}}_{t}}}\big{\\|}\mathbb{c}_{t}\big{\\|}\mathbb{E}\big{[}\big{\\|}\mathbb{n}_{t,j}\big{\\|}\big{]}=\frac{\sigma_{n}\big{\\|}\mathbb{c}_{t}\big{\\|}}{\|\mathcal{D}\|\sqrt{\alpha^{\text{ZF}}_{t}}}.$	(63b)

We finally take an expectation from both sides of (53), and use (62) and (63a) to write (note that $\eta_{t}\leq 1$ as per (36))

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}\leq\mu\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}+A_{t},

(64)

where $A_{t}$ is defined as

\displaystyle A_{t}=\begin{cases}C_{0}&t\leq t_{0}\\ C_{1}&t>t_{0}\end{cases}

(65)

for $C_{0}$ and $C_{1}$ that were defined in Theorem 1. Applying (64) recursively, the proof is concluded.

Appendix C Proof of Corollary 1

From Theorem (1), it is concluded that if $\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}<C_{t}$

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}

\displaystyle\leq 2C_{t}.

(66)

We aim to determine the growth order of $T_{\varepsilon}$ . Assuming $\varepsilon$ is sufficiently small, $T_{\varepsilon}$ becomes large and satisfies $T_{\varepsilon}>t_{0}$ . To remain within the $\varepsilon$ -neighborhood of $\boldsymbol{\theta}^{\star}$ , it is sufficient to have: $2C_{T_{\varepsilon}}\leq\varepsilon$ , which using the fact that $T_{\varepsilon}>t_{0}$ , it reduces to

\displaystyle 2\frac{\mu^{T_{\varepsilon}}-1}{\mu-1}\Big{[}{\frac{2\lambda\gamma^{2^{{T_{\varepsilon}}-t_{0}}}}{L(1-\gamma^{2^{{T_{\varepsilon}}-t_{0}}})}}+\frac{\delta+1}{\lambda}\frac{\sigma_{n}\big{\|}\mathbb{c}_{T_{\varepsilon}}\big{\|}}{|\mathcal{D}|\sqrt{\alpha^{\text{ZF}}_{T_{\varepsilon}}}}\Big{]}\leq\varepsilon.

(67)

Since $\mu<1$ , as ${T_{\varepsilon}}$ grows large $\mu^{T_{\varepsilon}}\to 0$ ; therefore, the left-hand-side coefficient tends to ${2}/({1-\mu})$ . Noting that $\gamma\in[0,\frac{1}{2}]$ , we can further replace the first term inside the brackets with ${2\lambda\gamma^{2^{T_{\varepsilon}-t_{0}}}}/{L}$ for large $T_{\varepsilon}$ . By substituting into (67), and taking $\log$ from both sides we finally have

\displaystyle\log(\frac{1}{\varepsilon})\leq-\log(\frac{4\lambda}{L-\frac{L^{2}\delta}{\lambda}})-2^{T_{\varepsilon}-t_{0}}\log(\gamma).

(68)

As $\log(\gamma)<0$ , the second term on the right-hand-side of (68) is positive. Noting that the first term does not scale, we can finally write $T_{\varepsilon}\leq\mathcal{O}\left(\log\log\frac{1}{\varepsilon}\right)$ , which concludes the proof.

Appendix D Proof of Corollary 2

The proof is similar to that of Corollary 1: from Theorem 1, we know that if $\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\geq C_{t}$ ; then, we have

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|\big{]}

\displaystyle\leq 2\mu^{t}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}.

(69)

Thus, to lie within $\varepsilon$ -neighborhood of optimal model, it is sufficient for $T_{\varepsilon}>t_{0}$ to satisfy $2\mu^{T_{\varepsilon}}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}\leq\varepsilon$ , which is equivalent to

\displaystyle T_{\varepsilon}\log(\frac{1}{\mu})\geq\log\left(\frac{2\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\|}\big{]}}{\varepsilon}\right).

(70)

This concludes the proof.

Appendix E Proof of Theorem 2

As the global loss is $L$ -Lipschitz gradient, we have

\displaystyle f(\boldsymbol{\theta}_{t})-f(\boldsymbol{\theta}^{\star})-\nabla f^{\mathsf{T}}(\boldsymbol{\theta}^{\star})(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star})\leq\frac{L}{2}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}.

(71)

Taking expectation from both sides, and noting that $f(\boldsymbol{\theta}^{\star})=\boldsymbol{0}$ , we obtain

\displaystyle\mathbb{E}\big{[}f(\boldsymbol{\theta}_{t})\big{]}-f(\boldsymbol{\theta}^{\star})\leq\frac{L}{2}\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}.

(72)

This implies that to bound $\mathbb{E}\big{[}f(\boldsymbol{\theta}_{t})\big{]}-f(\boldsymbol{\theta}^{\star})$ , we should bound $\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}$ . From (53), we have

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}

\displaystyle\leq\mathbb{E}\big{[}\mathcal{M}_{1}^{2}\big{]}+\mathbb{E}\big{[}\mathcal{M}_{2}^{2}\big{]}.

(73)

Using the bounds in (55) to (63a), we obtain

\displaystyle\mathbb{E}\big{[}\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{\star}\|^{2}\big{]}\leq\mu^{2}\mathbb{E}\big{[}\big{\|}\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{\star}\big{\|}^{2}\big{]}+C^{\prime}_{t},

(74)

where $C^{\prime}_{t}$ is given by

\displaystyle C^{\prime}_{t}=\begin{cases}C^{\prime}_{0}~{}&t\leq t_{0}\\ C^{\prime}_{1}&t>t_{0}\end{cases},

(75)

with $C^{\prime}_{0}$ and $C^{\prime}_{1}$ being defined in Theorem 2. Using (74) recursively, we next obtain

	$\displaystyle\mathbb{E}\big{[}\\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{\star}\\|^{2}\big{]}$	$\displaystyle\leq\mu^{2t}\mathbb{E}\big{[}\big{\\|}\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{\star}\big{\\|}^{2}\big{]}$
		$\displaystyle+\frac{\mu^{2t}-1}{\mu^{2}-1}\begin{cases}C^{\prime}_{0},~{}&t\leq t_{0}\\ C^{\prime}_{1},&t>t_{0}\end{cases}$		(76)

Finally, using (E) in (72) the proof is concluded.

References

[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 2017, pp. 1273–1282.
[2] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wirel. Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
[3] S. M. Hamidi, S. Herath, A. Bayesteh, and A. K. Khandani, “Systems and methods for communication resource usage control,” May 30 2019, uS Patent App. 15/824,352.
[4] S. M. Hamidi, “Training neural networks on remote edge devices for unseen class classification,” IEEE Signal Processing Letters, vol. 31, pp. 1004–1008, 2024.
[5] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, “Communication lower bounds for statistical estimation problems via a distributed data processing inequality,” in Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, 2016, pp. 1011–1020.
[6] Y. Han, A. Özgür, and T. Weissman, “Geometric lower bounds for distributed parameter estimation under communication constraints,” in Proceedings of the 31st Conference On Learning Theory (COLT). PMLR, 2018, pp. 3163–3188.
[7] M. Lan, Q. Ling, S. Xiao, and W. Zhang, “Quantization bits allocation for wireless federated learning,” IEEE Trans. Wirel. Commun., vol. 22, no. 11, pp. 8336–8351, 2023.
[8] A. Bereyhi, B. Liang, G. Boudreau, and A. Afana, “Novel gradient sparsification algorithm via bayesian inference,” in Proceedings of IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), 2024, pp. 1–6.
[9] J. Wang, Z. Charles, Z. Xu et al., “A field guide to federated optimization,” 2021. [Online]. Available: https://arxiv.org/abs/2107.06917
[10] S. J. Reddi, Z. Charles, M. Zaheer, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in Proceedings of the International Conference on Learning Representations, 2020.
[11] Q. Tong, G. Liang, and J. Bi, “Effective federated adaptive gradient methods with non-iid decentralized data,” 2020. [Online]. Available: https://arxiv.org/abs/2009.06557
[12] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wirel. Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
[13] A. Bereyhi, A. Vagollari, S. Asaad, R. R. Müller, W. Gerstacker, and H. V. Poor, “Device scheduling in over-the-air federated learning via matching pursuit,” IEEE Trans. Signal Process., vol. 71, pp. 2188–2203, 2023.
[14] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wirel. Commun., vol. 19, no. 1, pp. 491–506, 2019.
[15] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
[16] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” IEEE Trans. Signal Process., vol. 68, pp. 2897–2911, 2020.
[17] S. M. Hamidi, M. Mehrabi, A. K. Khandani, and D. Gündüz, “Over-the-air federated learning exploiting channel perturbation,” in 2022 IEEE 23rd International Workshop on Signal Processing Advances in Wireless Communication (SPAWC), 2022, pp. 1–5.
[18] D. Liu and O. Simeone, “Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 170–185, 2020.
[19] A. Elgabli, J. Park, C. B. Issaid, and M. Bennis, “Harnessing wireless channels for scalable and privacy-preserving federated learning,” IEEE Trans. Commun., vol. 69, no. 8, pp. 5194–5208, 2021.
[20] P. Yang, Y. Jiang, T. Wang, Y. Zhou, Y. Shi, and C. N. Jones, “Over-the-air federated learning via second-order optimization,” IEEE Trans. Wirel. Commun., vol. 21, no. 12, pp. 10 560–10 575, 2022.
[21] S. Wang, F. Roosta, P. Xu, and M. W. Mahoney, “GIANT: Globally improved approximate newton method for distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 2332–2342.
[22] Y. Zhang and X. Lin, “Disco: Distributed optimization for self-concordant empirical loss,” in Proceedings of the International conference on machine learning. PMLR, 2015, pp. 362–370.
[23] A. Ghosh, R. K. Maity, and A. Mazumdar, “Distributed newton can communicate less and resist byzantine workers,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 028–18 038, 2020.
[24] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
[25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1708.07747
[26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Ph.D. dissertation, University of Toronto, 2009, technical Report.
[27] S. M. Hamidi, R. Tan, L. Ye, and E.-H. Yang, “Fed-it: Addressing class imbalance in federated learning through an information- theoretic lens,” in 2024 IEEE International Symposium on Information Theory (ISIT), 2024, pp. 1848–1853.
[28] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.” J. Mach. Learn. Res., vol. 12, no. 7, 2011.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017. [Online]. Available: https://arxiv.org/abs/1412.6980
[30] S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Mime: Mimicking centralized stochastic algorithms in federated learning,” 2021. [Online]. Available: https://arxiv.org/abs/2008.03606
[31] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approximate newton-type method,” in Proceedings of the International Conference on Machine Learning. PMLR, 2014, pp. 1000–1008.
[32] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smithy, “Feddane: A federated newton-type method,” in Proceedings of the 2019 53rd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2019, pp. 1227–1231.
[33] S. J. Reddi, J. Konečný, P. Richtárik, B. Póczós, and A. Smola, “Aide: Fast and communication efficient distributed optimization,” 2016. [Online]. Available: https://arxiv.org/abs/1608.06879
[34] V. Gupta, A. Ghosh, M. Derezinski, R. Khanna, K. Ramchandran, and M. Mahoney, “Localnewton: Reducing communication bottleneck for distributed learning,” 2021. [Online]. Available: https://arxiv.org/abs/2105.07320
[35] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford, “A reliable effective terascale linear learning system,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1111–1133, 2014.
[36] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi, “CoCoA: A general framework for communication-efficient distributed optimization,” J. Mach. Learn. Res., vol. 18, p. 230, 2018.
[37] C. Duenner, A. Lucchi, M. Gargiani, A. Bian, T. Hofmann, and M. Jaggi, “A distributed second-order algorithm you can trust,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 10–15 Jul 2018, pp. 1358–1366.
[38] M. R. Hestenes, E. Stiefel et al., Methods of conjugate gradients for solving linear systems. NBS Washington, DC, 1952, vol. 49, no. 1.
[39] R. Islamov, X. Qian, and P. Richtárik, “Distributed second order methods with fast rates and compressed communication,” in International conference on machine learning. PMLR, 2021, pp. 4617–4628.
[40] M. Safaryan, R. Islamov, X. Qian, and P. Richtarik, “Fednl: Making newton-type methods applicable to federated learning,” in Proceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 18 959–19 010.
[41] Y. Deng and M. Mahdavi, “Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,” in Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 1387–1395.
[42] P. Sharma, R. Panda, G. Joshi, and P. Varshney, “Federated minimax optimization: Improved convergence analyses and algorithms,” in Proceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 19 683–19 730.
[43] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 1999.
[44] J. Shermen and W. Morrison, “Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix,” Annals of Mathmatical Statistics, vol. 20, pp. 621–625, 1949.
[45] L. Chen, X. Qin, and G. Wei, “A uniform-forcing transceiver design for over-the-air function computation,” IEEE Wirel. Commun. Lett., vol. 7, no. 6, pp. 942–945, 2018.
[46] A. G. Wills and T. B. Schön, “Stochastic quasi-newton with line-search regularisation,” Automatica, vol. 127, p. 109503, 2021.
[47] E. Schulz, M. Speekenbrink, and A. Krause, “A tutorial on gaussian process regression: Modelling, exploring, and exploiting functions,” Journal of Mathematical Psychology, vol. 85, pp. 1–16, 2018.
[48] C. B. Do and H. Lee, “Gaussian processes,” Stanford University, Stanford, CA, accessed Dec, vol. 5, p. 2017, 2007.
[49] B. Polyak and A. Tremba, “New versions of newton method: step-size choice, convergence domain and under-determined equations,” Optimization Methods and Software, vol. 35, no. 6, pp. 1272–1303, 2020.
[50] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression. John Wiley & Sons, 2013.
[51] S. M. Hamidi and E.-H. YANG, “Adafed: Fair federated learning via adaptive common descent direction,” Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=rFecyFpFUp
[52] S. Mohajer Hamidi and O. Damen, “Fair wireless federated learning through the identification of a common descent direction,” IEEE Communications Letters, vol. 28, no. 3, pp. 567–571, 2024.


$\displaystyle\mathbb{E}\big{[}\big{\\|}\tilde{\mathbb{n}}_{t}\big{\\|}\big{]}$	$\displaystyle=\frac{1}{\|\mathcal{D}\|\sqrt{\alpha^{\text{ZF}}_{t}}}\mathbb{E}\big{[}\big{\\|}\mathbb{n}_{t,j}^{\mathsf{H}}\mathbb{c}_{t}\big{\\|}\big{]}$	(63a)
	$\displaystyle\leq\frac{1}{\|\mathcal{D}\|\sqrt{\alpha^{\text{ZF}}_{t}}}\big{\\|}\mathbb{c}_{t}\big{\\|}\mathbb{E}\big{[}\big{\\|}\mathbb{n}_{t,j}\big{\\|}\big{]}=\frac{\sigma_{n}\big{\\|}\mathbb{c}_{t}\big{\\|}}{\|\mathcal{D}\|\sqrt{\alpha^{\text{ZF}}_{t}}}.$	(63b)

GP-FL: Model-Based Hessian Estimation for Second-Order Over-the-Air Federated Learning

Abstract

Index Terms:

I Introduction

I-A Communication-Efficient FL

I-B Second-Order Wireless FL

I-C Motivation and Contributions

I-D Related Works

I-E Notation

I-F Organization

II Preliminaries

II-A First-Order Federated Learning

II-B Newton’s Method in Federated Learning

II-C Quasi-Newton Search Directions

Remark 1.

II-D GP-FL: Quasi-Newton Method in Wireless Networks

III GP-FL: Over-the-Air Aggregation

III-A Uplink Channel Model

III-B Model Aggregation via AirComp

Remark 2.

Remark 3.

III-C Receiver Design for Model Aggregation

Remark 4.

III-D Device Selection Algorithm

IV GP-FL: Hessian Estimation

IV-A Model-based Hessian Estimation

IV-B Quasi-Newton Matrix as Gaussian Process

IV-C Estimating Parameters of Gaussian Process

Definition 1 (Radial basis kernel).

IV-D Estimating Hessian via Posterior Sampling

Lemma 1 (Posterior of quasi-Newton matrix).

Proof.

Remark 5.

V Algorithm and Convergence Analysis

Remark 6.

V-A Scheduling the Learning Rate

Assumption 1 (Smoothness and convexity).

V-B Convergence Results

Definition 2 (δ\delta-approximate).

Theorem 1.

Proof.

Corollary 1.

Proof.

Corollary 2.

Proof.

Theorem 2.

VI Simulation Results

AirComp-FedAvg (First-order)

GIANT (Second-order)

DANE (Second-order)

Sec-Order (Second-order)

BFGS

VI-A LIBSVM Dataset

VI-B Fashion-MNIST Dataset

VI-C CIFAR-10 Dataset (Non-iid Settings)

VI-D CIFAR-100 Dataset (Non-iid Settings)

VI-E Comments on Required Observation Window

VII Conclusion

Appendix A Solving Optimization (23)

Appendix B Proof of Theorem 1

Appendix C Proof of Corollary 1

Appendix D Proof of Corollary 2

Appendix E Proof of Theorem 2

References

Definition 2 ( $\delta$ -approximate).