
Communication-Efficient Zeroth-Order Distributed Online Optimization: Algorithm, Theory, and Applications

Ege C. Kaya, M. Berk Sahin, and Abolfazl Hashemi are with the College of Engineering, Purdue University, West Lafayette, IN 47907, USA. Emails: {kayae, sahinm, abolfazl}@purdue.edu

Abstract

This paper focuses on a multi-agent zeroth-order online optimization problem in a federated learning setting for target tracking. The agents only sense their current distances to their targets and aim to maintain a minimum safe distance from each other to prevent collisions. The coordination among the agents and the dissemination of collision-prevention information are managed by a central server using the federated learning paradigm. The proposed formulation leads to an instance of a distributed online nonconvex optimization problem that is solved by a group of communication-constrained agents. To deal with the communication limitations of the agents, an error feedback-based compression scheme is utilized for agent-to-server communication. The proposed algorithm is analyzed theoretically for the general class of distributed online nonconvex optimization problems. We provide non-asymptotic convergence rates that show that the dominant term is independent of the characteristics of the compression scheme. Our theoretical results feature a new approach that employs significantly more relaxed assumptions than the standard literature. The performance of the proposed solution is further analyzed numerically in terms of tracking errors and collisions between agents in two relevant applications.

Index Terms—  communication efficiency, compression schemes, federated learning, online optimization, zeroth-order optimization

1 Introduction

As datasets and machine learning (ML) models continue to grow in size and complexity, training ML models increasingly requires carrying out the optimization process across multiple devices. This is often the result of parallel processing needs or the collaboration of multiple participants in the data acquisition and optimization processes. The federated learning (FL) paradigm [1, 2] addresses this by focusing on the latter scenario and training a global model through the cooperation of multiple clients (or agents), managed by a central server. However, FL is typically carried out by a large number of communication-constrained agents, making the transmission of model parameters to the central server a potential bottleneck that needs to be addressed for efficient model training.

In online learning (OL), where decisions are made in real time with limited information/feedback provided to the decision maker, limited communication resources become an even more severe problem. To address this, first-order FL algorithms like local stochastic gradient descent (SGD) use compression techniques such as quantization or sparsification [3, 4, 5] to reduce the size of local gradients before transmission, but this causes information loss which may adversely impact learning performance.

To counteract this loss of information, an error feedback (EF) mechanism can be added. The EF mechanism works by incorporating the error made by compression into subsequent steps, so that effectively each gradient is fully utilized, even if at later stages. Moreover, the EF mechanism theoretically achieves the same rate of convergence as the no-compression case, making compression come at no cost [6].

An additional consideration that we may need to account for in practical scenarios is the potentially limited nature of the available information. The zeroth-order (ZO) optimization setting presents an example of such a limitation. In an optimization problem arising from a real-life scenario, the information to be used in the optimization process may be the sensed values of physical quantities such as sound or light intensity, or relative distance [7]. For instance, assuming that sensing agents may only sense the current distances to their targets and other nearby agents, we can consider this to be a ZO setting [8], as agents do not have access to higher-order information such as velocity or acceleration.

As an example of a practical scenario combining all of the aforementioned considerations, consider delivery robots that are loaded from the same region and aim to find their customers. This situation may be viewed as a source localization problem with multiple mobile agents. We adopt the terms agent and source from the literature on this subject in the upcoming discussion. If the customers are also moving, this becomes a target-tracking problem [9, 10].

Fig. 1: Illustration of agent-server communication. The agents communicate compressed information to the server, whereas the server transmits back the full information.

In a multi-agent setting, collisions between these delivery robots may occur, which can be addressed by establishing communication between the agents using the FL framework and incorporating information about where nearby agents are. Supposing additionally that the robots are only capable of sensing their current distances to their respective targets and to other nearby robots moves our problem into the field of ZO optimization. However, doing so also results in an online optimization scenario, seeing as the relative locations of the robots with respect to one another would be continually changing, producing a time-varying sequence of optimization problems to solve. Finally, to overcome the inherent communication bottleneck engendered by the online and FL settings, compression schemes may be used along with the EF mechanism. Our novel formulation of this target tracking problem is illustrated and explained in detail in Section 4.1.

1.1 Contribution

Motivated by the previous problem formulation, the purpose of this work is to find an answer to the central question:

Is it possible to devise an algorithm for online, distributed non-convex optimization problems with compressed exchange of zeroth-order information, and with provable convergence guarantees for both single-agent and multi-agent settings?

To address this question, we focus on a general stochastic nonconvex optimization problem, taking into account the following factors: i) access to the stochastic cost function is limited to a zeroth-order oracle, meaning only function values at current locations and times are available, ii) due to communication constraints, only compressed or quantized gradients are exchanged between the agents and the server, iii) multiple agents use zeroth-order information to track their targets, and iv) the objective functions are time-varying in nature, resulting in an online optimization problem.

We prove the existence of a first-order solution in $\mathbb{R}^{d}$ that is $\xi$-accurate with $T=\mathcal{O}\left(\frac{d\sigma^{2}ML(\Delta+\bar{\omega})}{\xi^{2}}\right)$ in the dominant term, where $\sigma^{2}$, $L$, $M$, $\Delta$, and $\bar{\omega}$ denote the variance of the stochastic gradients, the smoothness constant in Assumption 3, the bound constant on the stochastic gradients' second moment in Assumption 2, the difference between the averages of the loss functions at the first and last iterates, and the summation of the drift bounds from Assumption 4, respectively. Hence, the dominant term in the convergence error does not depend on the compression ratio. This is achieved while using an EF mechanism and a ZO gradient estimator that requires two function evaluations. In deriving this result, we also relax the bounded second moment assumption commonly found in the related literature [6]. Instead of assuming that the second moment of the stochastic gradients is upper-bounded by a constant term greater than or equal to their variance, we adopt the relaxed assumption that it is upper-bounded by the variance plus a term proportional to the square of its expected value. In other words, we relax the assumption on the value of $M$ in Assumption 2, whereas other works commonly assume $M=0$ uniformly. That is, our upper bound depends on the current sample rather than a uniform bound. Whereas previous work deals with a single-agent scenario [11], we examine the effectiveness of the proposed approach in a multi-agent target tracking scenario with limited communication where collision avoidance is of paramount importance. The problem of reducing collisions among agents is addressed by incorporating the FL paradigm and a new regularization term. This task is formulated as an online, distributed nonconvex optimization problem that can be solved by a multi-agent variation of the proposed scheme. Theoretical analysis shows that a $\xi$-accurate first-order solution in $\mathbb{R}^{Nd}$ with $T=\mathcal{O}\left(\dfrac{\sigma^{2}dMQ\left(\Delta^{2}+\bar{\omega}^{2}\right)+M\left(\sigma^{2}+Z^{4}\right)}{\xi^{2}}\right)$ in the dominant term can be found in a scenario with $N$ agents, where $Z^{2}$ and $Q$ are constants arising from Assumption 6, which effectively bounds the norm of each client's gradient in terms of the average of these gradients over all clients [12]. These findings are further supported by experimental results.

Our preliminary work on the single-agent convergence analysis and experiments was accepted to and will be presented at the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing [13]. The current work presents a significantly more thorough analysis of the subject, with an additional part detailing the multi-agent algorithm and its analysis, presented in Theorem 2. The complete proofs of the two theorems are also provided. The experimental section is expanded with more descriptive results and an additional experiment involving an area coverage problem.

1.2 Related Work

Communication-efficient FL. FedAvg is the seminal FL paper in which the central server takes the average of the local gradients transmitted by the clients and distributes the updated parameters to the corresponding clients [1]. The crux of this work is the locality of data, in that data is acquired and trained on locally by a multitude of clients, without ever transporting it to a central server. Several difficulties, including privacy concerns [14], heterogeneity of client data [15], and high communication costs in agent-to-server links [16], arise in relation to this paradigm. Variations of FedAvg have been developed to mitigate these problems. For instance, to deal with high communication costs, [17, 18] propose a sparsification algorithm in communication for time-varying decentralized learning and optimization. Reference [19] proposes utilizing an adaptive learning rate for aggregation, which is relevant to both the client data heterogeneity and communication efficiency issues. Reference [20] suggests a novel aggregation technique which first quantizes gradients and then skips communicating less impactful quantized gradients in favor of reusing previous ones. Reference [21] proposes using a momentum-based global update at the server, which promotes communication efficiency through variance reduction. Reference [22] proposes a derivative-free federated ZO optimization (FedZO) algorithm and, to improve its communication efficiency over wireless networks, an over-the-air computation assisted variant. Reference [23] proposes a multiple local update strategy and a decentralized ZO algorithm to improve the communication efficiency and convergence rate in the decentralized FL scenario, in which there is no access to first-order derivatives. Reference [24] promotes the use of multiple gossip steps for communication efficiency. Various compression schemes such as Top-k [6], Rand-k [5], Biased and Unbiased Dropout-p [4], Quantized SGD (QSGD) [3], and their variants/generalizations are used to achieve the communication efficiency of FL algorithms. Compression schemes can be divided into contractive and non-contractive methods. With contractive compression schemes, which are our focus in this work, it is common to introduce an EF mechanism to compensate for the error due to compression by accumulating the compression error in memory and adding it back as feedback in subsequent rounds. In [6], it is shown that such a method used in conjunction with SGD has a rate of convergence comparable to non-compressed SGD. In this study, we relax the assumption of having a stochastic first-order oracle with bounded noise required in [6] by meticulously characterizing the impact of such relaxation on convergence. Furthermore, instead of the single-agent case as investigated in [6], we consider multiple agents with the additional contingency of preventing their collisions, which makes the theoretical analysis more challenging.

Multi-agent target tracking. In our setting, agents are limited to ZO information, since they are assumed to only be able to sense their distance to their targets and other nearby agents. As a result of this consideration, our method is applicable to different practical scenarios such as [25, 26]. In these kinds of scenarios, gradients of the loss function can still be estimated by finite differences [27], but doing so in a multi-agent setting under communication constraints still remains an open challenge. Reference [11] describes a setup comparable to online optimization employing ZO oracles, applied to a target tracking problem. In that work, the authors focus on the case where there is a single source pursued by a single agent, which we generalize to the multi-agent setting as part of our contribution. We further investigate an effective approach via nonconvex regularization for collision avoidance. The $\lambda$ parameter, which we refer to as the regularization parameter, is in essence similar to the penalty and augmented Lagrangian methods used in functional constrained optimization [28, 29]. However, these methods aim to adaptively tune the $\lambda$ parameter on the go, which is out of the scope of our work. It should also be noted that this line of research is very relevant to the area of safe reinforcement learning, see, e.g., [30, 31, 32, 33].

In [34], a cooperative, mobile multi-agent source localization problem is tackled using a distributed algorithm. In contrast to our setting, the agents sense first-order information as well as their neighboring agents, and they benefit from collaboration between agents to avoid collisions. Reference [35] deals with a source localization problem in a single-agent and single-source setting, where the source is stationary or near-stationary. Reference [36] studies the problem of OL using ZO information with convex cost functions and extends the problem out of the conventional Euclidean setting onto Riemannian manifolds. In [37], a ZO source localization problem is considered using distance information, where the agents are essentially multiple sensors of known position. References [38] and [39] deal with an online optimization problem using a decentralized network of multiple agents which have access to ZO information, and propose the usage of local information and information from neighboring agents in the network. In [38], an iterative algorithm with guarantees is proposed for time-varying online loss functions. The theoretical result there is established by assuming a certain bounded-drift-in-time assumption which is standard in the literature, see, e.g., [11].

Online optimization and online target tracking. In the general online optimization setting, we focus on literature within or similar to the online convex optimization framework, which we can consider as a sequential decision-making game in the presence of time-varying loss functions [40]. In [41], a distributed online optimization problem with multiple agents is considered, and the local loss function of each agent is convex and time-varying. The authors propose a randomized gradient-free distributed projected gradient descent procedure, where agents estimate the gradient of their local loss functions in a random direction using information from a locally built ZO oracle. In [42], a similar setting is considered, and a multi-agent distributed optimization problem is studied in continuous time, with time-varying convex loss functions. Reference [11] deals with a setting where zeroth-order oracles are used for optimization in the presence of time-varying cost functions. Besides the general online optimization setting, there is an abundance of literature focusing on the online optimization aspect of the target tracking problem. A large number of these works also involve a swarm of multiple agents working in coordination. Usually, literature in this area tends to consider the problem in the context of unmanned aerial vehicles (UAVs) or unmanned surface vehicles (USVs). As pointed out in [43], the approaches to the problem may be separated into three broad categories: filtering-based, control-theory-based, and machine-learning-based approaches. For instance, [43] examines the problem within the domain of reinforcement learning by formulating it as a constrained Markov decision process, with application to autonomous target tracking using a swarm of UAVs. The authors provide an algorithm with provable guarantees. In [44], the authors again consider a multi-agent multi-target pursuit-evasion scenario, where they propose the usage of a recurrent neural network for target trajectory prediction, in conjunction with a multi-agent deep deterministic policy gradient formulation for decision making. Reference [45] deals with a robust formulation of a similar scenario in the domain of supervised learning, using a game-theoretic approach. Reference [46] deals with a multi-target following scenario with consideration of external threats. The authors treat the problem as an online path planning problem and adopt a control-theoretic approach. In [47], an online adaptive Kalman filter is used in a target tracking problem where the sensor signals of the agents are assumed to have unknown noise statistics, to formulate a solution that is robust to noise. Lastly, [48] considers a decentralized control problem involving multiple agents with multiple control objectives, among which target tracking is one. The authors make use of a scheme based on adaptive dynamic programming and feedback from a critic neural network which approximates the control objectives in an online fashion.

1.3 Novelty w.r.t. Existing Works

Our work is focused on a nonconvex online distributed optimization problem with compressed exchange of zeroth-order information, along with the error feedback mechanism. Although these concepts were investigated individually in prior works [6, 11, 22, 27, 34], to the best of our knowledge we are the first to combine them in a single framework and propose an algorithm with a theoretical analysis and convergence guarantee.

Furthermore, we may compare our theoretical results in Section 3 with the result of the analysis in [49], which derives an upper bound for the offline, first-order case with contractive compressors and the error feedback mechanism for the optimization of a smooth, nonconvex function in the FL paradigm. Their result establishes, in this setting, an iteration complexity of $\mathcal{O}(1/N\xi^{2})$ to produce a $\xi$-accurate first-order solution. Our result agrees with theirs in its $\xi$-dependency. However, that analysis is carried out in an offline setting with access to first-order derivatives and is hence able to derive a convergence rate that is inversely proportional to the number of agents $N$. As opposed to that setting, we consider a more challenging online setting in which agents lack access to first-order derivatives but have access to finite differences. Thus, the convergence rate we establish is independent of $N$. Moreover, reference [49] assumes a uniform bound on the second moment of the gradients. We adopt a more relaxed version of this assumption, which does not assume a uniform bound on the second moment (Assumption 2), and this makes our analysis more involved. We make a similar relaxation of a standard assumption used in distributed optimization [50, 51, 52, 53, 54, 55, 56] in the same vein, by lifting the assumption of a uniform bound (Assumption 6).

The rest of the paper is organized as follows. In Section 2, we present the related background on stochastic gradient descent in the zeroth-order oracle setting and the assumptions necessary for our theoretical analysis. In Section 3, we propose the EF-ZO-SGD and FED-EF-ZO-SGD algorithms and present two theorems for their convergence along with sketches of the proofs. Experimental results for two different settings are presented in Section 4, followed by the conclusion in Section 5. The complete proofs of the theorems, along with the statements of relevant lemmas, are presented in the Appendix.

2 Preliminaries and Background

We start by providing a description of the problem in the single-agent setting. We deal with a sequence of time-varying optimization problems: $\min_{x\in\mathbb{R}^{d}}\ell_{t}(x)$, $t\in\mathbb{Z}^{+}$. Each $\ell_{t}:\mathbb{R}^{d}\to\mathbb{R}$ is a continuous loss function and $\ell_{t}(x):=\mathbb{E}_{z}[\tilde{\ell}_{t}(x)]$. We denote $\tilde{\ell}_{t}(x):=\ell_{t}(x,z)$, where $z$ is a random variable representing data points drawn from the unknown distribution $P_{z}$, i.e., $z\sim P_{z}$. In our target tracking application, $z$ is the position vector of the targets. We aim to find a sequence of solutions $\{x_{t}\}_{t=1}^{T}$ such that $\frac{1}{T}\sum_{t=1}^{T}\|\nabla\ell_{t}(x_{t})\|^{2}\leq\xi$ for some small $\xi>0$. Suppose that, at time $t$, we have somehow generated a (possibly non-optimal) solution $x_{t}$ to the problem $\min_{x\in\mathbb{R}^{d}}\ell_{t}(x)$. Motivated by online and time-critical missions, we would like to generate a solution $x_{t+1}$ to the problem $\min_{x\in\mathbb{R}^{d}}\ell_{t+1}(x)$ by applying a simple SGD-like update rule to $x_{t}$:

x_{t+1}=x_{t}-\eta_{t}\nabla\tilde{\ell}_{t}(x_{t}),   (1)

where $\eta_{t}$ is the step size or learning rate adopted at time $t$. As discussed in Section 1, we cannot directly apply such an update since we are in the ZO setting; that is, we only have access to evaluations of $\tilde{\ell}_{t}$ and not to its gradient or stochastic gradient. To overcome this limitation, we resort to a ZO estimator of the gradient:

\tilde{g}_{\mu,t}(x_{t}):=\frac{\tilde{\ell}_{t}(x_{t}+\mu u_{t})-\tilde{\ell}_{t}(x_{t})}{\mu}u_{t},   (2)

where $\mu\in\mathbb{R}$ is the so-called smoothing parameter, and each $u_{t}\sim\mathcal{N}(0,I_{d})$. Note that $\tilde{g}_{\mu,t}$ can be thought of as an approximation to the stochastic gradient of a Gaussian smoothing of $\tilde{\ell}_{t}$, i.e., $\tilde{\ell}_{\mu,t}(x):=\mathbb{E}_{u}[\tilde{\ell}_{t}(x+\mu u)]$. A final modification to the update rule arises due to the aforementioned communication constraints. We apply compression to the ZO estimator and use the resulting quantity in the update rule. To mitigate the negative effect of compression on the convergence of the method, we employ the error feedback mechanism. Essentially, at each time step this serves to partially recover information discarded in the previous compression steps. The details of our approach may be seen in Algorithm 1 (EF-ZO-SGD).
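For concreteness, the following is a minimal NumPy sketch of the two-point estimator in (2); the names (zo_gradient_estimate, loss_fn, rng) are illustrative and not taken from our implementation.

```python
import numpy as np

def zo_gradient_estimate(loss_fn, x, mu, rng):
    # Two-point zeroth-order estimate of the gradient of loss_fn at x, as in (2).
    # loss_fn returns a (possibly noisy) scalar evaluation; mu is the smoothing parameter.
    u = rng.standard_normal(x.shape)            # u_t ~ N(0, I_d)
    return (loss_fn(x + mu * u) - loss_fn(x)) / mu * u

# Example: noisy evaluations of a quadratic loss centered at a (hidden) target z_t.
rng = np.random.default_rng(0)
z_t = np.array([3.0, -1.0])
noisy_loss = lambda x: 0.5 * np.linalg.norm(x - z_t) ** 2 + 0.01 * rng.standard_normal()
g = zo_gradient_estimate(noisy_loss, np.zeros(2), mu=1e-3, rng=rng)
```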

In the multi-agent setting, we generalize the problem as follows. There are now $N$ sequences of continuous loss functions $\ell^{1}_{t},\ldots,\ell^{N}_{t}$, $t\in\mathbb{Z}^{+}$, each $\ell^{i}_{t}:\mathbb{R}^{Nd}\to\mathbb{R}$, belonging to agents $1$ through $N$. Similar to the previous part, $\ell^{i}_{t}(x):=\mathbb{E}_{z}[\tilde{\ell}^{i}_{t}(x)]$, $\tilde{\ell}^{i}_{t}(x):=\ell^{i}_{t}(x,z)$, and $z\sim P_{z}$. We call these the local loss functions, since they represent the loss of each specific agent. The objective is to find a sequence of solutions $\{x^{1:N}_{t}\}_{t=1}^{T}\subset\mathbb{R}^{Nd}$ that minimizes the global loss function $\bar{\tilde{\ell}}_{t}=\frac{1}{N}\sum_{i=1}^{N}\tilde{\ell}^{i}_{t}$. Akin to the single-agent setting, each agent computes a compressed version of the ZO estimator, corrected to some extent by feedback of the error incurred by compression in previous steps. The result of this computation is then transmitted to the central server, where the estimators are aggregated and used to update the locations of all agents. The full algorithm entailed by this approach is given in Algorithm 2 (FED-EF-ZO-SGD).

Next, we state the assumptions adopted in the forthcoming analyses of the single- and multi-agent settings.

Assumption 1.

(Unbiased Stochastic Zeroth-Order Oracle) For any $t\in\mathbb{Z}^{+}$, $i\in\{1,\ldots,N\}$, and $x\in\mathbb{R}^{d}$, we have

\mathbb{E}_{z}\left[\tilde{\ell}^{i}_{t}(x)\right]=\ell^{i}_{t}(x).   (3)

Although we do not explicitly utilize the stochastic gradient $\nabla\tilde{\ell}_{t}$ in the forthcoming algorithm, our analysis still requires a certain regularity assumption on it.

Assumption 2.

(Bounded Stochastic Gradients) For any $t\in\mathbb{Z}^{+}$, $i\in\{1,\ldots,N\}$, and $x\in\mathbb{R}^{d}$, there exist $\sigma,M>0$ such that

\mathbb{E}_{z}\left[\lVert\nabla\tilde{\ell}^{i}_{t}(x)\rVert^{2}\right]\leq\sigma^{2}+M\lVert\nabla\ell^{i}_{t}(x)\rVert^{2}.   (4)

We note that this assumption is significantly more relaxed compared to the assumption typically used in stochastic optimization [57] and EF-based compression [6]. In particular, [6] requires $M=0$, which effectively imposes a uniform bound on the gradient of $\ell_{t}$. As part of our contribution, we carry out the analysis under the relaxed assumption stated above.

Assumption 3.

(L-smoothness) Each $\tilde{\ell}^{i}_{t}(x)$ is continuously differentiable and $L$-smooth over $x$ on $\mathbb{R}^{d}$, that is, there exists an $L\geq 0$ such that for all $x,y\in\mathbb{R}^{d}$, $t\in\mathbb{Z}^{+}$, and $i\in\{1,\ldots,N\}$, we have

\lVert\nabla\tilde{\ell}^{i}_{t}(x)-\nabla\tilde{\ell}^{i}_{t}(y)\rVert\leq L\lVert x-y\rVert.   (5)

We denote this by $\tilde{\ell}^{i}_{t}(x)\in C^{1,1}_{L}(\mathbb{R}^{d})$. Note that this assumption implies $\ell^{i}_{t}(x)\in C^{1,1}_{L}(\mathbb{R}^{d})$.

Assumption 4.

(Bounded Drift in Time) There exist $N$ bounded sequences $\{\omega^{1}_{t}\}_{t=1}^{T},\ldots,\{\omega^{N}_{t}\}_{t=1}^{T}$ such that for all $t\in\mathbb{Z}^{+}$ and $i\in\{1,\ldots,N\}$, $\lvert\ell^{i}_{t}(x)-\ell^{i}_{t+1}(x)\rvert\leq\omega^{i}_{t}$ for any $x\in\mathbb{R}^{d}$. Note that in the case where $\ell^{i}_{t+1}=\ell^{i}_{t}$, this assumption holds with $\omega^{i}_{t}=0$.

Assumption 4 is standard in the literature on time-varying optimization [11, 58]. Since we work in the online optimization setting where the loss function is time-varying, this assumption upper-bounds the change in the loss function uniformly over $x$, with a possibly different constant at each time step.

The next assumption has to do with the aforementioned compression of the gradient estimator $g_{\mu,t}$. We assume that the schemes used for the compression satisfy the following assumption.

Assumption 5.

(Contractive Compression [6]) The compression function $\mathcal{C}$ is a contraction mapping, that is,

\mathbb{E}_{\mathcal{C}}\left[\lVert\mathcal{C}(x)-x\rVert^{2}\mid x\right]\leq\left(1-\delta\right)\lVert x\rVert^{2}   (6)

for all $x\in\mathbb{R}^{d}$, where $0<\delta\leq 1$ and the expectation is over the randomness generated by the compression $\mathcal{C}$.

One can see that $\delta$ effectively controls the scale of the compression: $\delta=1$ corresponds to the case of no compression, and the amount of compression increases as $\delta\to 0$.

The compression operators we use in the numerical experiments are as follows:

  • $\operatorname{top}_{k}$: We fix a parameter $k\in\{0,\ldots,d\}$. $\operatorname{top}_{k}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is defined as:

    (\operatorname{top}_{k}(x))_{i}:=\begin{cases}(x)_{\pi(i)}&i\leq k,\\ 0&\text{otherwise},\end{cases}   (7)

    where $\pi$ is a permutation of $\{1,\ldots,d\}$ such that $(\lvert x\rvert)_{\pi(i)}\geq(\lvert x\rvert)_{\pi(i+1)}$ for every $i\in\{1,\ldots,d-1\}$ [5]. In other words, $\operatorname{top}_{k}$ preserves the $k$ elements of $x$ that are largest in magnitude and sets the rest to $0$.

  • $\operatorname{rand}_{k}$: We fix a parameter $k\in\{0,\ldots,d\}$. $\operatorname{rand}_{k}:\mathbb{R}^{d}\times\Omega_{k}\rightarrow\mathbb{R}^{d}$ is defined as:

    (\operatorname{rand}_{k}(x,\omega_{0}))_{i}:=\begin{cases}x_{i}&i\in\omega_{0},\\ 0&\text{otherwise},\end{cases}   (8)

    where $\Omega_{k}=\{\omega:\omega\subseteq\{1,\ldots,d\},\lvert\omega\rvert=k\}$ and $\omega_{0}$ is chosen uniformly at random from $\Omega_{k}$ [5]. In other words, $\operatorname{rand}_{k}$ preserves $k$ randomly chosen elements of $x$ and sets the rest to $0$.

  • $\operatorname{dropout-b}_{p}$: We fix a parameter $p\in[0,1]$. $\operatorname{dropout-b}_{p}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is defined as:

    (\operatorname{dropout-b}_{p}(x))_{i}:=\begin{cases}(x)_{i}&u_{i}\leq p,\\ 0&\text{otherwise},\end{cases}   (9)

    where each $u_{i}\sim U[0,1]$. Note that $\operatorname{dropout-b}_{p}(x)$ is a biased estimator of $x$.

  • $\operatorname{dropout-u}_{p}$: We fix a parameter $p\in[0,1]$. $\operatorname{dropout-u}_{p}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is defined as:

    (\operatorname{dropout-u}_{p}(x))_{i}:=\begin{cases}\frac{1}{p}(x)_{i}&u_{i}\leq p,\\ 0&\text{otherwise},\end{cases}   (10)

    where each $u_{i}\sim U[0,1]$. Note that $\operatorname{dropout-u}_{p}(x)$ is an unbiased estimator of $x$.

  • $\operatorname{qsgd}_{b}$: We fix a parameter $b\in\mathbb{N}$ and perform $b$-bit random quantization (where $2^{b}$ is the quantization level):

    \operatorname{qsgd}_{b}(x)=\dfrac{\operatorname{sign}(x)\lVert x\rVert_{2}}{2^{b}w}\left[2^{b}\dfrac{|x|}{\lVert x\rVert_{2}}+u\right],   (11)

    where $w=1+\min(\sqrt{d}/2^{b},d/2^{2b})$, $u\sim(U[0,1])^{d}$, and $\operatorname{qsgd}_{b}(0)=0$ [3].

It is worth noting that all of these compression schemes respect Assumption 5, with the sole exception of $\operatorname{dropout-u}_{p}$.
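For reference, a minimal NumPy sketch of these operators is given below; the names are illustrative, edge cases are handled in the simplest way, and the bracket in (11) is interpreted as stochastic rounding via the floor function.

```python
import numpy as np

def top_k(x, k):
    # Keep the k entries of largest magnitude, zero the rest (eq. (7)).
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[x.size - k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    # Keep k uniformly chosen entries, zero the rest (eq. (8)).
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

def dropout_p(x, p, rng, unbiased=False):
    # Keep each entry independently with probability p (eqs. (9)-(10)).
    mask = rng.uniform(size=x.shape) <= p
    return (x * mask / p) if unbiased else (x * mask)

def qsgd(x, b, rng):
    # b-bit random quantization (eq. (11)); returns 0 for the zero vector.
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    s = 2 ** b
    w = 1 + min(np.sqrt(x.size) / s, x.size / s ** 2)
    levels = np.floor(s * np.abs(x) / norm + rng.uniform(size=x.shape))
    return np.sign(x) * norm / (s * w) * levels
```

Both sparsifiers, for instance, satisfy Assumption 5 with $\delta=k/d$.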

Our final assumption concerns only the analysis of the multi-agent case:

Assumption 6.

(Bounded Gradients) For any $x^{1:N}_{t}\in\mathbb{R}^{Nd}$, there exist $Z,Q>0$ such that

\mathbb{E}_{z}\left[\lVert\nabla\ell^{i}_{t}(x^{1:N}_{t})\rVert^{2}\right]\leq Z^{2}+Q\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}   (12)

for all $i\in\{1,\ldots,N\}$, where $\nabla\bar{\ell}_{t}(x^{1:N}_{t})=\frac{1}{N}\sum_{i=1}^{N}\nabla\ell^{i}_{t}(x^{1:N}_{t})$.

We note that this is a relaxation of the standard assumption capturing the effect of data heterogeneity, commonly employed in the analyses of decentralized optimization algorithms [59, 12, 60] and in the analysis of FedAvg-like methods in particular [50, 51, 52, 53, 54, 55, 56]. The standard assumption poses a uniform bound: $\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell^{i}_{t}(x^{1:N}_{t})-\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]\leq Z^{2}$. In [61], it is argued that this form usually holds in practice and may even be considered too pessimistic. However, one can easily come up with a counterexample where it does not, e.g., $\ell^{i}_{t}(x)=(ix)^{2}$ for all $t\in\mathbb{Z}^{+}$. We note that this relaxation is akin to the one adopted in Assumption 2.

3 Proposed Method

In this section, we present our EF-ZO-SGD and FED-EF-ZO-SGD algorithms along with their convergence results and provide sketches of the proofs of these results. The complete proofs may be found in the Appendix.

3.1 EF-ZO-SGD

We now present EF-ZO-SGD, an algorithm which uses compression along with the EF mechanism, in addition to the ZO estimator in (2), to achieve a communication-efficient method for the presented problem in the single-agent scenario. The complete procedure is given in Algorithm 1. Given an initial solution $x_{0}\in\mathbb{R}^{d}$, which for our problem represents the initial position of the agent, the algorithm works iteratively to construct subsequent solutions to the sequence of optimization problems. It first samples a random vector in $\mathbb{R}^{d}$ from the standard Gaussian distribution and uses this to construct a ZO estimator of the gradient (steps 3 and 4). Then, the error feedback vector, which keeps track of information discarded during compression in previous communication rounds (step 7), is added to this ZO estimator to produce the augmented estimator (step 5). In this manner, information previously lost to compression is re-utilized. The augmented estimator is the quantity used in the update rule to produce the subsequent solution (step 6), and it is further used to update the error feedback vector (step 7). This process is repeated for $t=1,\ldots,T$ to produce solutions to all terms of the sequence of optimization problems.

Algorithm 1 EF-ZO-SGD

Input: Number of time steps $T\in\mathbb{Z}^{+}$, smoothing parameter $\mu\in\mathbb{R}$, initial agent position $x_{0}\in\mathbb{R}^{d}$, learning rate $\eta\in\mathbb{R}$, sequence of target positions $\{z_{t}\}_{t=1}^{T}\subset\mathbb{R}^{d}$.
   Output: Sequence of optimal agent positions $\{x_{t}\}_{t=1}^{T}\subset\mathbb{R}^{d}$.

1: $e_{0}=0$
2: for $t=1,\ldots,T$ do
3:   $u_{t}\sim\mathcal{N}(0,I_{d})$
4:   $\tilde{g}_{\mu,t}(x_{t})=\dfrac{\tilde{\ell}_{t}(x_{t}+\mu u_{t})-\tilde{\ell}_{t}(x_{t})}{\mu}u_{t}$
5:   $p_{t}=\tilde{g}_{\mu,t}(x_{t})+e_{t}$
6:   $x_{t+1}=x_{t}-\eta\,\mathcal{C}(p_{t})$
7:   $e_{t+1}=p_{t}-\mathcal{C}(p_{t})$
8: end for
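For illustration, the following is a minimal NumPy sketch of the loop in Algorithm 1; loss_fns[t] stands in for the zeroth-order oracle of $\tilde{\ell}_{t}$, compress can be any contractive operator from Section 2, and all names are illustrative.

```python
import numpy as np

def ef_zo_sgd(loss_fns, x0, eta, mu, compress, rng):
    # Sketch of Algorithm 1 (EF-ZO-SGD); loss_fns[t](x) returns a noisy evaluation
    # of the time-t loss, the only feedback available in the ZO setting.
    x = x0.copy()
    e = np.zeros_like(x0)                           # error-feedback memory, e_0 = 0
    iterates = [x.copy()]
    for loss in loss_fns:
        u = rng.standard_normal(x.shape)            # step 3: random direction
        g = (loss(x + mu * u) - loss(x)) / mu * u   # step 4: ZO gradient estimator
        p = g + e                                   # step 5: add error feedback
        cp = compress(p)                            # compress before "transmission"
        x = x - eta * cp                            # step 6: update the iterate
        e = p - cp                                  # step 7: store compression residual
        iterates.append(x.copy())
    return iterates
```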

The convergence properties of EF-ZO-SGD are analyzed next. For the convergence of EF-ZO-SGD in a single-agent setting, we establish Theorem 1.

Note that although the EF-ZO-SGD algorithm can be thought of as an SGD-type scheme, the analysis is involved due to the interaction of EF and ZO estimation. In the proof, we leverage a new intertwined perturbation analysis, wherein we analyze the convergence of a virtual solution sequence on the smoothed functions $\ell_{\mu,t}$ and tie that to the performance of the real iterates $x_{t}$ on $\ell_{t}$, while utilizing the relaxed bounded stochastic gradient assumption.

Theorem 1.

Suppose Assumptions 1-2 hold. Consider the EF-ZO-SGD algorithm. Then, if $\eta=\dfrac{1}{\sigma\sqrt{(d+4)MTL}}$ and $\mu=\dfrac{1}{(d+4)\sqrt{T}}$, it holds that

\begin{split}&\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]\leq\frac{8\Delta\sigma(d+4)^{\frac{1}{2}}M^{\frac{1}{2}}L^{\frac{1}{2}}}{T^{\frac{1}{2}}}\\ &+\frac{8\sigma dL^{\frac{3}{2}}M^{\frac{1}{2}}}{T^{\frac{3}{2}}(d+3)^{\frac{3}{2}}}+\frac{2(d+6)^{\frac{3}{2}}L^{\frac{5}{2}}}{\sigma(d+4)^{\frac{5}{2}}T^{\frac{3}{2}}M^{\frac{1}{2}}}+\frac{8\sigma(d+4)^{\frac{1}{2}}L^{\frac{1}{2}}}{M^{\frac{1}{2}}T^{\frac{1}{2}}}\\ &+\frac{(d+3)^{3}L^{2}}{(d+2)^{2}T}+\frac{32L}{\delta^{2}\sigma^{2}MT}+\frac{8(d+6)^{3}L^{3}}{\delta^{2}\sigma^{2}(d+4)^{3}MT^{2}}\\ &+\frac{8\bar{\omega}\sigma(d+4)^{\frac{1}{2}}M^{\frac{1}{2}}L^{\frac{1}{2}}}{T^{\frac{1}{2}}},\end{split}   (13)

where $x_{T+1}^{*}\in\arg\!\min_{x\in\mathbb{R}^{d}}\ell_{T+1}(x)$, $\Delta=\ell_{1}(x_{1})-\ell_{T+1}(x_{T+1}^{*})$, $\bar{\omega}=\sum_{t=1}^{T}\omega_{t}$, and $\mathbb{E}[\cdot]$ denotes $\mathbb{E}_{z_{1:T}}[\cdot]$. Furthermore, the number of time steps $T$ needed to obtain a $\xi$-accurate first-order solution is

T=\mathcal{O}\left(\frac{d\sigma^{2}L\Delta M}{\xi^{2}}+\frac{dL\Delta}{\delta^{2}\xi}+\frac{\bar{\omega}\sigma^{2}dML}{\xi^{2}}\right).   (14)
Sketch of proof.

We begin by defining the perturbed quantity $\tilde{x}_{t}:=x_{t}-\eta e_{t}$. Then, using Assumptions 3 and 4, we obtain the inequality

\begin{split}\ell_{\mu,t+1}(\tilde{x}_{t+1})\leq\ell_{\mu,t}(\tilde{x}_{t})-\eta\langle\tilde{g}_{\mu,t}(x_{t}),\nabla\ell_{\mu,t}(\tilde{x}_{t})\rangle\\ +\frac{L\eta^{2}}{2}\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}+\omega_{t}.\end{split}   (15)

Taking expectations and performing algebraic manipulations produce the main inequality with the four terms:

\begin{split}\underbrace{\frac{\eta}{2}\lVert\nabla\ell_{\mu,t}(x_{t})\rVert^{2}}_{\text{Term I}}\leq\underbrace{\left[\ell_{\mu,t}(\tilde{x}_{t})-\ell_{\mu,t+1}(\tilde{x}_{t+1})\right]}_{\text{Term II}}+\\ \underbrace{\frac{L\eta^{2}}{2}\mathbb{E}_{u_{t},z_{t}}\left[\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}\right]}_{\text{Term III}}+\underbrace{\frac{L^{2}\eta^{3}}{2}\lVert e_{t}\rVert^{2}}_{\text{Term IV}}+\omega_{t}.\end{split}   (16)

We can upper-bound Term II by means of a telescoping sum. Then, using Assumptions 5 and 2, Term I can be lower-bounded and Terms III and IV can be upper-bounded by quantities involving $\mathbb{E}_{z_{1:T}}[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}]$. Rearranging, inserting the values of $\eta$ and $\mu$, and introducing $\xi$ to obtain an expression for the time complexity leads directly to the result. The complete proof may be found in the Appendix. ∎

We further note that (14) demonstrates that the dominant term in the complexity is independent of the compression parameter $\delta$. Therefore, for long sequences of time-varying optimization problems where $T$ is very large, the contribution of compression to the convergence error is negligible. Also notable is the fact that the complexity scales with the dimension $d$. While this dependence is undesirable, in the worst case it is unavoidable even without compression, as shown in [62].

We may discuss the implications of our results for the setting of learning the parameters of an overparameterized model, e.g., a deep learning predictor. It has been argued (see, e.g., [63, 64]) that such models typically satisfy a so-called strong growth condition, which implies $\sigma=0$ in Assumption 2. That is, as the EF-ZO-SGD algorithm converges to a stationary solution, it enters a virtuous cycle wherein the noise in the stochastic gradient reduces. As our analysis demonstrates, in such settings we can modify $\eta$ and $\mu$ accordingly (in particular, set $\eta$ independent of $T$) to improve the complexity of the proposed algorithm to $T=\mathcal{O}(\frac{1}{\xi})$.

Algorithm 2 FED-EF-ZO-SGD

Input: Number of time steps $T\in\mathbb{Z}^{+}$, number of agents $N\in\mathbb{Z}^{+}$, smoothing parameter $\mu\in\mathbb{R}$, initial agent positions $x_{0}^{1:N}\in\mathbb{R}^{Nd}$, learning rate $\eta\in\mathbb{R}$, sequence of target positions $\{z^{1:N}_{t}\}_{t=1}^{T}\subset\mathbb{R}^{Nd}$.
   Output: Sequence of optimal agent positions $\{x^{1:N}_{t}\}_{t=1}^{T}\subset\mathbb{R}^{Nd}$.

1: for $i=1,\ldots,N$ do
2:   $e_{0}^{i}=0$
3: end for
4: for $t=1,\ldots,T$ do
5: Runs on each agent:
6:   for $i=1,\ldots,N$ do
7:     $u_{t}^{i}\sim\mathcal{N}(0,I_{Nd})$
8:     $\tilde{g}^{i}_{\mu,t}(x^{1:N}_{t})=\dfrac{\tilde{\ell}^{i}_{t}(x^{1:N}_{t}+\mu u^{i}_{t})-\tilde{\ell}^{i}_{t}(x^{1:N}_{t})}{\mu}u^{i}_{t}$
9:     $p^{i}_{t}=\tilde{g}^{i}_{\mu,t}(x^{1:N}_{t})+e^{i}_{t}$
10:    $e^{i}_{t+1}=p^{i}_{t}-\mathcal{C}(p^{i}_{t})$
11:    $\operatorname{transmit\_to\_server}\left(\mathcal{C}(p^{i}_{t})\right)$
12:   end for
13: Runs on the server:
14:   $\mathcal{G}_{t}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{C}(p^{i}_{t})$
15:   $x^{1:N}_{t+1}=x^{1:N}_{t}-\eta\,\mathcal{G}_{t}$
16:   $\operatorname{transmit\_to\_clients}\left(x^{1:N}_{t+1}\right)$
17: end for

3.2 FED-EF-ZO-SGD

The FED-EF-ZO-SGD algorithm is a generalization of EF-ZO-SGD to the multi-agent, multi-target setting. In addition to compression, the EF mechanism, and the ZO estimator, the agents are coordinated by the central server, and their compressed gradients are averaged at the server as in [1]. The complete procedure is shown in Algorithm 2. Given an initial solution $x_{0}^{1:N}\in\mathbb{R}^{Nd}$, which in our problem represents the concatenation of the initial positions of the agents, the FED-EF-ZO-SGD algorithm works iteratively on both the agent side and the server side to generate consecutive solutions to the sequence of optimization problems. The agent side is similar to EF-ZO-SGD except for the content of the solution vectors. In our setting, without loss of generality, we consider agents that can sense the positions of nearby agents, called "neighbors", and merge those position vectors with their own current position to obtain $x_{t}^{1:N}$. Entries corresponding to agents that are not neighbors are set to $0$. The same algorithm can be implemented for agents having no knowledge of the nearby agents' positions. For every agent, the algorithm first samples a random vector in $\mathbb{R}^{Nd}$ from the standard Gaussian distribution, and the entries that do not correspond to the $i$th agent's position are set to $0$ (step 7). Thus, only the $i$th agent's position vector is perturbed to approximate the noisy gradient with finite differences (step 8). Steps 9 and 10 are the same as in EF-ZO-SGD. Lastly, each agent sends its compressed augmented estimator to the central server. After the server collects the estimators from every agent, it takes their average (step 14). This average is then used in the update (step 15), and the new positions are transmitted to the agents. This procedure is followed for $t=1,\ldots,T$ to produce solutions to all terms of the sequence of optimization problems.
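As a companion to Algorithm 2, the following is a minimal NumPy sketch of one way the agent and server updates could be organized; local_losses[i] stands in for agent $i$'s zeroth-order oracle on the concatenated position vector, and all names are illustrative.

```python
import numpy as np

def fed_ef_zo_sgd(local_losses, x0, eta, mu, compress, rng, T):
    # Sketch of Algorithm 2 (FED-EF-ZO-SGD). local_losses[i](t, x) returns a noisy
    # evaluation of agent i's loss at time t, with x the concatenated vector in R^{Nd}.
    N = len(local_losses)
    x = x0.copy()
    errors = [np.zeros_like(x0) for _ in range(N)]    # per-agent error-feedback memories
    for t in range(T):
        # Runs on each agent.
        transmitted = []
        for i, loss in enumerate(local_losses):
            u = rng.standard_normal(x.shape)          # u_t^i ~ N(0, I_{Nd})
            g = (loss(t, x + mu * u) - loss(t, x)) / mu * u
            p = g + errors[i]                         # add error feedback
            cp = compress(p)
            errors[i] = p - cp                        # keep the compression residual
            transmitted.append(cp)                    # compressed message to the server
        # Runs on the server.
        G = np.mean(transmitted, axis=0)              # aggregate the compressed estimators
        x = x - eta * G                               # global update, broadcast to agents
    return x
```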

Now, we proceed with the analysis extended to the multi-agent case, which involves FED-EF-ZO-SGD. We state the following theorem:

Theorem 2.

Suppose Assumptions 1-6 hold. Consider the FED-EF-ZO-SGD algorithm. Then, if $\eta=\dfrac{1}{\sigma\sqrt{(d+4)MQTL}}$ and $\mu=\dfrac{1}{(d+4)\sqrt{T}}$, it holds that

\begin{split}&\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]\leq\dfrac{8\Delta\sigma\left(d+4\right)^{\frac{1}{2}}M^{\frac{1}{2}}Q^{\frac{1}{2}}L^{\frac{1}{2}}}{T^{\frac{1}{2}}}\\ &+\dfrac{8L^{\frac{3}{2}}d\sigma M^{\frac{1}{2}}Q^{\frac{1}{2}}}{\left(d+4\right)^{\frac{3}{2}}T^{\frac{3}{2}}}+\dfrac{8L^{\frac{1}{2}}\left(d+4\right)^{\frac{1}{2}}M^{\frac{1}{2}}Z^{2}}{\sigma Q^{\frac{1}{2}}T^{\frac{1}{2}}}\\ &+\dfrac{8L^{\frac{1}{2}}\left(d+4\right)^{\frac{1}{2}}\sigma}{M^{\frac{1}{2}}Q^{\frac{1}{2}}T^{\frac{1}{2}}}+\dfrac{2L^{\frac{5}{2}}\left(d+6\right)^{3}}{\left(d+4\right)^{\frac{3}{2}}T^{\frac{3}{2}}\sigma M^{\frac{1}{2}}Q^{\frac{1}{2}}}\\ &+\dfrac{32LZ^{2}}{\sigma^{2}QT\delta^{2}}+\dfrac{32L}{MQT\delta^{2}}+\dfrac{8L^{3}\left(d+6\right)^{3}}{\left(d+4\right)^{3}T^{2}\sigma^{2}MQ}\\ &+\dfrac{8\bar{\omega}\sigma\left(d+4\right)^{\frac{1}{2}}M^{\frac{1}{2}}Q^{\frac{1}{2}}L^{\frac{1}{2}}}{T^{\frac{1}{2}}},\end{split}   (17)

where $\bar{\ell}_{t}(x)=\frac{1}{N}\sum_{i=1}^{N}\ell^{i}_{t}(x)$, $\bar{\omega}:=\sum_{t=1}^{T}\omega_{t}$, $x^{*}_{T+1}=\min_{i\in\{1,\ldots,N\}}\arg\!\min_{x}\ell^{i}_{T+1}(x)$, $\Delta=\bar{\ell}_{1}(x^{1:N}_{1})-\bar{\ell}_{T+1}(x_{T+1}^{*})$, and $\mathbb{E}[\cdot]$ denotes $\mathbb{E}_{z_{1:T}^{1:N}}[\cdot]$. Furthermore, the number of time steps $T$ needed to obtain a $\xi$-accurate first-order solution is

T=\mathcal{O}\left(\dfrac{\sigma^{2}dMQ\left(\Delta^{2}+\bar{\omega}^{2}\right)+M\left(\sigma^{2}+Z^{4}\right)}{\xi^{2}}+\dfrac{L^{\frac{5}{3}}}{\xi^{\frac{2}{3}}}+\dfrac{1}{\delta^{2}\xi}\right).   (18)
Sketch of proof.

The general outline of the proof is very similar to that of the single-agent case. We define and work with the perturbed quantity $\tilde{x}^{1:N}_{t}:=x^{1:N}_{t}-\eta\bar{e}_{t}$, where $\bar{e}_{t}:=\frac{1}{N}\sum_{i=1}^{N}e^{i}_{t}$. Additionally, the global loss function in this scenario is $\bar{\tilde{\ell}}_{t}\left(x^{1:N}_{t}\right)=\frac{1}{N}\sum_{i=1}^{N}\tilde{\ell}^{i}_{t}\left(x^{1:N}_{t}\right)$. Using Assumptions 3 and 4, we obtain

\begin{split}\bar{\ell}_{\mu,t+1}\left(\tilde{x}^{1:N}_{t+1}\right)&\leq\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\\ &-\eta\left\langle\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right),\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rangle\\ &+\dfrac{L\eta^{2}}{2}\left\lVert\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}+\omega_{t},\end{split}   (19)

where $\omega_{t}=\max\{\omega_{t}^{1},\ldots,\omega_{t}^{N}\}$. Taking expectations and performing algebraic manipulations lead to the main inequality with four terms:

\begin{split}\underbrace{\dfrac{\eta}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}}_{\text{Term I}}\leq\underbrace{\left[\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)-\bar{\ell}_{\mu,t+1}\left(\tilde{x}^{1:N}_{t+1}\right)\right]}_{\text{Term II}}\\ +\underbrace{\dfrac{L\eta^{2}}{2}\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\lVert\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}\right]}_{\text{Term III}}+\underbrace{\dfrac{L^{2}\eta^{3}}{2}\lVert\bar{e}_{t}\rVert^{2}}_{\text{Term IV}}+\omega_{t}.\end{split}   (20)

Term II may be upper-bounded by means of a telescoping sum. Term I may be lower-bounded, and Terms III and IV upper-bounded, by quantities involving $\mathbb{E}_{z_{1:T}^{1:N}}[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}]$, using Assumptions 5, 2, and 6. Rearranging, inserting the values of $\eta$ and $\mu$, and introducing $\xi$ to obtain an expression for the time complexity leads directly to the result. The complete proof may be found in the Appendix. ∎

Much like in the single-agent analysis, we note that the dominant term in the complexity is independent of the compression ratio $\delta$.

4 Experimental Results

In this section, we explore two applications of the proposed method to multi-agent target tracking under communication constraints. The first application deals with the main focus of the work, i.e., multi-agent target tracking. The second offers an alternative view of the problem in the form of an area coverage task. Our code used for the experiments is available online together with the simulation video [65].

4.1 Target Tracking

We begin with the application of the proposed FED-EF-ZO-SGD algorithm to the multi-agent target tracking scenario detailed in the previous sections. In all experiments, we instantiate a central server, $N$ agents $\{\mathcal{A}_{i}\}_{i=1}^{N}$, and $N$ sources $\{\mathcal{S}_{i}\}_{i=1}^{N}$. The initial location of each agent is chosen uniformly at random from $[-100,100]^{2}$ and that of each source from $[200,400]^{2}$. Hence, $d=2$, i.e., we consider the target tracking problem on a 2-dimensional plane, which is reasonable for the motivating example of delivery robots. Also, we instantiate the agents and sources in two separate clusters, with some initial distance between them. Each agent $\mathcal{A}_{i}$ aims to track source $\mathcal{S}_{i}$, and each $\mathcal{S}_{i}$ actively evades its tracker at maximum speed. This setting generalizes that of [11] to a scenario with multiple agents and sources. We use $x^{i}_{t}$, $z^{i}_{t}$ to denote the positions of $\mathcal{A}_{i}$ and $\mathcal{S}_{i}$ at time step $t$. Each $\mathcal{S}_{i}$ aims to maximize its distance to $\mathcal{A}_{i}$ by setting its velocity at each step to $\zeta^{i}_{t}=\beta(z^{i}_{t}-x^{i}_{t})/\lVert z^{i}_{t}-x^{i}_{t}\rVert$, i.e., moving directly away from $\mathcal{A}_{i}$ with speed $\beta=0.1$.
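In a simulation, one evasive move of a source can be sketched as follows (illustrative names, assuming the velocity rule above):

```python
import numpy as np

def evader_step(z_i, x_i, beta=0.1):
    # Source S_i moves directly away from its tracker A_i with speed beta.
    direction = (z_i - x_i) / np.linalg.norm(z_i - x_i)
    return z_i + beta * direction
```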

Fig. 2: Illustration of 5 agents tracking 5 sources. Sources evade the agents by moving directly away from them.

An illustration of the movements of agents and sources is given in Fig. 2.

As explained in the previous sections, an additional contingency we introduce to the above setting is the requirement of a collision avoidance mechanism to prevent agents from unsafe maneuvers. To this end, we propose a two-level approach: i) on the local level within each agent, by means of local neighbor detection leveraging a judicious regularization term, and ii) coordination via the FL paradigm. With regard to i), at every step of the simulation, we calculate the set of neighbors of each $\mathcal{A}_{i}$ as $D^{i}_{t}\coloneqq\{j\neq i:\lVert x_{t}^{i}-x_{t}^{j}\rVert\leq r\}$, where we set $r=10$. These neighbor sets determine the local loss function $\ell^{i}_{t}$ of $\mathcal{A}_{i}$ at time step $t$, which we define as:

\ell_{t}^{i}(x_{t}^{1:N},z^{i}_{t})=\dfrac{1}{2}\lVert x_{t}^{i}-z_{t}^{i}\rVert^{2}-\lambda\sum_{j\in D^{i}_{t}}\left(\lVert x_{t}^{i}-x_{t}^{j}\rVert^{2}-r^{2}\right),   (21)

where $x^{1:N}_{t}=[(x^{1}_{t})^{T}\cdots(x^{N}_{t})^{T}]^{T}\in\mathbb{R}^{Nd}$ and $\lambda$ is the predetermined regularization parameter. We note that the time-varying nature of these neighbor sets introduces time variance into the loss functions, which is exactly the setting we examine in the theoretical analysis. We divide the local loss function into two terms in order to simplify the notation in the subsequent calculation of the local ZO gradient estimator $g^{i}_{\mu,t}$:

\ell_{t}^{i}(x_{t}^{1:N},z^{i}_{t})=s_{t}^{i}(x_{t}^{i},z^{i}_{t})-\sum_{j\in D^{i}_{t}}r_{t}^{i,j}(x_{t}^{i},x_{t}^{j}),   (22)

where the loss due to the source, $s_{t}^{i}$, is given by $s_{t}^{i}(x_{t}^{i},z^{i}_{t})=\frac{1}{2}\lVert x_{t}^{i}-z_{t}^{i}\rVert^{2}$ and the loss due to regularization between agents $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$, $r_{t}^{i,j}$, by $r_{t}^{i,j}(x_{t}^{i},x_{t}^{j})=\lambda(\lVert x_{t}^{i}-x_{t}^{j}\rVert^{2}-r^{2})$. In terms of the scenario, one can view the regularization term as each agent being able to sense other agents within a radius $r$ around its position. With regard to ii), collision avoidance is ensured by means of federated aggregation of the local gradient estimators. The global loss function at time $t$ is defined as $\bar{\ell}_{t}(x_{t}^{1:N},z_{t}^{1:N})=\frac{1}{N}\sum_{i=1}^{N}\ell_{t}^{i}(x_{t}^{1:N},z^{i}_{t})$, where $z^{1:N}_{t}$ is defined similarly to $x^{1:N}_{t}$.
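The following is a minimal sketch of the local loss in (21)-(22), with illustrative names and the neighbor set computed without the dropout mechanism introduced below:

```python
import numpy as np

def local_loss(i, X, z_i, lam, r):
    # X is the (N, d) array of agent positions x_t^{1:N}; z_i is source S_i's position.
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = [j for j in range(len(X)) if j != i and dists[j] <= r]   # D_t^i
    tracking = 0.5 * np.linalg.norm(X[i] - z_i) ** 2                     # s_t^i
    collision = sum(np.linalg.norm(X[i] - X[j]) ** 2 - r ** 2 for j in neighbors)
    return tracking - lam * collision
```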

Fig. 3: Results of FED-EF-ZO-SGD via the proposed scheme: (a) shows the average tracking error over 100 runs of the simulation with different compression schemes and EF combinations in the FL paradigm, with SGDm being the non-FL benchmark algorithm, in which SGD with momentum is run locally on each agent with no communication, and FedAvg with 1-bit QSGD compression and an error feedback term being the FL benchmark algorithm. The difference between this benchmark algorithm and FED-EF-ZO-SGD is that FedAvg uses first-order information. (b) shows collision numbers for the same experiment. (c) shows the average tracking errors over 100 runs of the simulation with varying numbers of agents $N$, using the best-performing model of QSGD1b-EF (1-bit QSGD with an error feedback term). The learning rate $\eta$ is set proportionally to $\sqrt{N}$. (d) shows the average numbers of collisions over 100 runs of the simulation for varying values of the regularization parameter $\lambda$, using the best-performing model of QSGD1b-EF.

Defining the neighborhood of two agents $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$ in the above manner results in a symmetric relation. To make the setting more interesting, we also introduce the concept of neighbor dropout, which aims to capture practical considerations such as imperfections in communication links and sensing capabilities. At each $t$, if $\mathcal{A}_{i}$ is to be added to $\mathcal{D}^{j}_{t}$, a random number $X$ is sampled from $U[0,1]$. If $X>p$, $i$ is added to $\mathcal{D}_{t}^{j}$; otherwise, it is dropped out. This leads to a more realistic scenario and opens up room for more meaningful collaboration between agents by breaking the symmetry of the relation. If, for example, $\mathcal{A}_{i}$ is a neighbor of $\mathcal{A}_{j}$ but fails to detect it, we would expect $\mathcal{A}_{j}$ to compensate for this. Or worse, if both $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$ fail to detect each other, we would expect an agent $\mathcal{A}_{k}$ with $k\in\mathcal{D}_{t}^{i}\cap\mathcal{D}_{t}^{j}$ to compensate for these detection failures. With the local loss function defined in (21), every agent $\mathcal{A}_{i}$ calculates a ZO gradient estimator $g^{i}_{\mu,t}$. Following the setup in [11], we slightly modify the computation of the ZO estimator by introducing a small change in the argument of the first function evaluation. Let

\begin{split}\ell^{i}_{t^{+}}(x_{t}^{1:N},z^{i}_{t}):=\dfrac{1}{2}\lVert x^{i}_{t}+\mu u^{i,i}_{t}-(z^{i}_{t}+0.5\zeta^{i}_{t})\rVert^{2}\\ -\lambda\sum_{j\in\mathcal{D}^{i}_{t}}\left(\lVert x^{i}_{t}+\mu u^{i,j}_{t}-(x^{j}_{t}+0.5\xi^{j}_{t})\rVert^{2}-r^{2}\right),\end{split}   (23)

where $u^{i,j}_{t}$ for all $j\in\mathcal{D}_{t}^{i}$ are drawn from $\mathcal{N}(0,I_{d})$ at time $t$, and $\zeta^{i}_{t},\xi^{i}_{t}$ denote the velocities of source $\mathcal{S}_{i}$ and agent $\mathcal{A}_{i}$ at time $t$, respectively. Similar to $\ell_{t}^{i}$ and (22), we divide $\ell^{i}_{t^{+}}$ into two terms:

\ell^{i}_{t^{+}}(x_{t}^{1:N},z^{i}_{t})=s_{t^{+}}^{i}(x_{t}^{i},z^{i}_{t})-\sum_{j\in\mathcal{D}^{i}_{t}}r_{t^{+}}^{i,j}(x_{t}^{i},x_{t}^{j})   (24)

where $r_{t^{+}}^{i,j}(x_{t}^{i},x_{t}^{j})=\lambda(\lVert x^{i}_{t}+\mu u^{i,j}_{t}-(x^{j}_{t}+0.5\xi^{j}_{t})\rVert^{2}-r^{2})$ and $s_{t^{+}}^{i}(x_{t}^{i},z^{i}_{t})=\frac{1}{2}\lVert x^{i}_{t}+\mu u^{i,i}_{t}-(z^{i}_{t}+0.5\zeta^{i}_{t})\rVert^{2}$.

Now, we define $g^{i}_{\mu,t}=[(g^{i,1}_{\mu,t})^{T}\cdots(g^{i,N}_{\mu,t})^{T}]^{T}\in\mathbb{R}^{Nd}$, where

g^{i,j}_{\mu,t}=\begin{cases}\dfrac{s_{t^{+}}^{i}(x_{t}^{i},z^{i}_{t})-s^{i}_{t}(x_{t}^{i},z^{i}_{t})}{\mu}u^{i,i}_{t}&j=i,\\[10.0pt] -\dfrac{r_{t^{+}}^{i,j}(x_{t}^{i},x_{t}^{j},\xi^{j}_{t})-r_{t}^{i,j}(x_{t}^{i},x_{t}^{j})}{\mu}u^{i,j}_{t}&j\in\mathcal{D}^{i}_{t},\\[10.0pt] 0\in\mathbb{R}^{d}&\text{otherwise}.\end{cases}   (25)

In practice, it usually holds that $\lvert\mathcal{D}_{t}^{i}\rvert\ll N$ for any $i\in\{1,\ldots,N\}$, which results in a sparse $g^{i}_{\mu,t}$. Each agent then transmits its local ZO gradient estimator $g^{i}_{\mu,t}$ to the server. In scenarios with compression, each agent applies compression before transmission and transmits $\mathcal{C}(g^{i}_{\mu,t}+e^{i}_{t})$ (see step 11 in Algorithm 2). The compression schemes used in the experiments are those detailed in Section 2. The server collects all of the transmitted (and possibly compressed) local gradient estimators and averages them, producing the aggregated global gradient estimator $\mathcal{G}_{t}$ of $\bar{\ell}_{t}$: $\mathcal{G}_{t}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{C}(g^{i}_{\mu,t}+e^{i}_{t})$. Then, to keep the speed of the agents bounded and maintain a practically plausible simulation, the server normalizes $\mathcal{G}_{t}$ and computes its estimate of the optimal position of every agent as $x^{1:N}_{t+1}=x^{1:N}_{t}-\eta\mathcal{G}_{t}$, where $\eta$ is the learning rate. With this formulation, $\eta$ determines the speed of the agents in the practical sense, since $\lVert\mathcal{G}_{t}\rVert=1$; the normalized estimator therefore only plays a role in determining the directions of the agents. The subsequent positions are transmitted to the agents without compression, and the agents move to these positions. This process is illustrated in Fig. 1. To gauge the performance of the model with respect to the number of collisions, we keep track of the number of collisions between agents by checking whether the positions of any two agents $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$ are close in Euclidean norm, where the measure of closeness depends on the radii of the agents in the simulation. In all experiments, we set the collision radius $R=3$, i.e., we increment the collision counter whenever $\lVert x^{i}_{t}-x^{j}_{t}\rVert\leq 3$ for any two agents $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$ with $i\neq j$.
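A small sketch of the neighbor-dropout rule described above (illustrative names):

```python
import numpy as np

def neighbor_sets(X, r, p, rng):
    # Agent i enters D_t^j only if it is within radius r of A_j and a uniform
    # draw exceeds p, so the resulting neighbor relation need not be symmetric.
    N = len(X)
    D = [[] for _ in range(N)]
    for j in range(N):
        for i in range(N):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r and rng.uniform() > p:
                D[j].append(i)
    return D
```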

We conduct three types of experiments and depict the results in the four plots of Fig. 3: in Fig. 3 (a) and Fig. 3 (b) we test the FED-EF-ZO-SGD algorithm's performance in terms of loss and number of collisions with various compression schemes. Fig. 3 (c) compares the convergence of FED-EF-ZO-SGD for different numbers of agents N while scaling the learning rate in proportion to \sqrt{N}, since the application bears theoretical resemblance to mini-batch SGD. Fig. 3 (d) demonstrates the effect of varying the regularization parameter \lambda on the number of collisions. Unless otherwise stated, the parameters used in the experiments are K=0.5 for TopK and RandK, p=0.5 for Dropout, \eta=1, \beta=0.1, p_{N}=0.5, d=2, N=20, r=10, and steps=1000. 100 instances of the simulation are run for each experiment, with the same fixed random seeds across different methods. In the first experiment, we also run SGD with momentum (SGDm) locally on each agent with no communication, i.e., without the FL paradigm, as a benchmark algorithm. Additionally, as a benchmark within the FL paradigm, we also examine the performance of FedAvg with 1-bit QSGD and the error feedback mechanism; its key difference from FED-EF-ZO-SGD is that it uses first-order information.
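For convenience, the default values listed above can be gathered in one place; this is only a restatement of the stated parameters, with variable names of our own choosing.

default_config = {
    "K": 0.5,               # fraction kept by TopK / RandK
    "p": 0.5,               # Dropout probability
    "eta": 1.0,             # learning rate (agent speed after normalization)
    "beta": 0.1,
    "p_N": 0.5,
    "d": 2,                 # dimension of each agent position
    "N": 20,                # number of agents
    "r": 10,                # safe-distance radius in the regularizer
    "steps": 1000,          # iterations per run
    "runs": 100,            # simulation instances per experiment
    "collision_radius": 3,  # R used by the collision counter
}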

As Fig. 3 (a) demonstrates, the variant that leverages EF along with the QSGD compression scheme with 1-bit quantization (QSGD1b-EF) enjoys the fastest convergence and even outperforms the setting with no compression (No-Comp). This might be explained by the inherent noise introduced by quantization helping convergence. TopK with error feedback (TopK-EF), 1-bit QSGD without error feedback (QSGD1b), and TopK without error feedback (TopK) perform virtually on par with the no-compression setting. It is interesting to note that TopK seems to slightly outperform TopK-EF. These are followed in performance by RandK with error feedback (RandK-EF) and then RandK without error feedback (RandK). These are finally followed by Unbiased Dropout (Dropout-U) and Biased Dropout (Dropout-B), which perform equally well, but with a large gap to the best performers. It is expected for RandK-EF, RandK, Dropout-U, and Dropout-B to take longer to converge, due to the high compression error that they inject into the communicated gradient estimators; nevertheless, error feedback appears to help the convergence of RandK significantly. We note that QSGD1b-EF, No-Comp, QSGD1b, TopK, and TopK-EF all converge within 1000 iterations, with RandK-EF also coming very close. The non-FL benchmark algorithm SGDm outperforms all FL-based methods in terms of the number of iterations needed for convergence; however, the rate of convergence appears to be of the same order, and the results are comparable. The first-order FL benchmark algorithm FO-QSGD1b-EF enjoys slightly faster convergence than FED-EF-ZO-SGD, but the performance difference is marginal.
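For completeness, the sparsification-style compressors compared above can be sketched as follows. These follow the standard definitions; the exact operators used in the experiments are those of Section 2, and the helper names below are ours.

import numpy as np

def top_k(v, frac=0.5):
    # Biased TopK: keep the largest-magnitude coordinates, zero out the rest.
    k = max(1, int(frac * v.size))
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, frac=0.5, rng=None):
    # RandK: keep a uniformly random subset of coordinates.
    rng = rng or np.random.default_rng()
    k = max(1, int(frac * v.size))
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx]
    return out

def dropout_biased(v, p=0.5, rng=None):
    # Biased Dropout: zero each coordinate independently with probability p.
    rng = rng or np.random.default_rng()
    return v * (rng.random(v.size) > p)

def dropout_unbiased(v, p=0.5, rng=None):
    # Unbiased Dropout: rescale surviving coordinates so that E[C(v)] = v.
    rng = rng or np.random.default_rng()
    return v * (rng.random(v.size) > p) / (1.0 - p)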

Refer to caption
(a)
Refer to caption
(b)
Fig. 4: Convergence results of FED-EF-ZO-SGD under different compression rates: (a) shows the tracking error over 100 runs of the simulation for different values of \delta under the Dropout-B compression scheme with the error feedback term. (b) shows the tracking error over 100 runs of the simulation for different numbers of bits used in the QSGD compression scheme with the error feedback term.

To evaluate the effectiveness of collaboration, we compare the number of collisions versus iterations for the same experiment in Fig. 3 (b). The results show that all of our FL-based methods far outperform the non-FL benchmark method SGDm in terms of the number of collisions. SGDm, which has no regard for collision prevention, causes on average about 70 collisions, whereas all of our schemes, even the ones that do not achieve good convergence results such as RandK and Dropout-B, cause at most about 10 collisions on average. This demonstrates the effectiveness of the proposed regularization term.
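The collision count reported here follows the rule stated earlier (increment whenever two distinct agents are within the collision radius R=3); a direct sketch is:

import numpy as np

def count_collisions(x, R=3.0):
    # x: array of shape (N, d) holding the current agent positions.
    n = x.shape[0]
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(x[i] - x[j]) <= R:
                count += 1
    return count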

Refer to caption
Fig. 5: Illustration of the agents in the area coverage experiment. Each agent has the main objective of patrolling its designated area, following a circular route (indicated with the dashed curves). However, there is overlap between the areas, and the secondary objective is to discourage the agents from moving towards areas that are already covered by another agent.

In Fig. 3 (c), we show the results of the second experiment, where we use the best-performing scheme from the first experiment (QSGD1b-EF) and test the convergence results with varying numbers of agents N. We observe that increasing the number of agents in the described multi-agent scenario is akin to increasing the batch size in mini-batch SGD, since the server performs an update on the global objective by aggregating the messages received from the agents. Thus, motivated by the theoretical studies of mini-batch SGD (see, e.g., [66]), to see the effect of varying the number of agents, we set the \eta parameter proportional to \sqrt{N}. The values we use for N are 5, 10, 15, 20, and 25, with the respective \eta values being 0.5, 0.71, 0.87, 1, and 1.12. The main lines in the plot show the tracking errors averaged over 100 runs of the simulation. It can be seen that the comparison to mini-batch SGD may be justified, as the model converges at roughly the same iteration for all N values except N=5 when the \eta values are set proportionally.
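The listed \eta values are consistent with anchoring the default \eta=1 at N=20 and scaling with \sqrt{N}; the following one-liner (ours, for illustration) reproduces them:

import numpy as np

for N in (5, 10, 15, 20, 25):
    # eta proportional to sqrt(N), anchored so that N = 20 gives eta = 1
    print(N, round(float(np.sqrt(N / 20)), 2))   # 0.5, 0.71, 0.87, 1.0, 1.12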

In Fig. 4, we demonstrate the effect of the compression parameter \delta on the convergence of the tracking error. Although the theoretical analysis shows that the dominant term in the convergence bound is independent of the compression parameter \delta, the transient behavior of the convergence still depends on \delta, and this is reflected in the experimental results. In Fig. 4 (a), we consider the effect of varying \delta in the Dropout-B compression scheme. Here, \delta corresponds to the probability that a gradient component will be dropped, so \delta=0 corresponds to no compression. In these experiments, we set the step size \eta=3 to facilitate convergence in the highly compressed regime when \delta=0.9. It can be seen that even in the presence of extreme compression, convergence can be achieved by increasing the step size. Similarly, in Fig. 4 (b), we consider the effect of varying the number of bits used in QSGD on the convergence of the tracking error. We run the experiment with 1, 2, 4, and 8 bits and plot the results.
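As a point of reference for how the bit budget enters, the sketch below implements one standard form of stochastic b-bit quantization in the spirit of QSGD [3]; the exact quantizer used in the experiments is the one described in Section 2, so this should be read as an illustration only.

import numpy as np

def qsgd_quantize(v, bits=1, rng=None):
    # Map each coordinate of v to one of s = 2**bits - 1 uniform levels of
    # |v_i| / ||v||, rounding up or down at random so that E[Q(v)] = v.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    s = 2 ** bits - 1
    level = np.abs(v) / norm * s          # position in [0, s]
    lower = np.floor(level)
    prob_up = level - lower               # probability of rounding up
    q = lower + (rng.random(v.size) < prob_up)
    return norm * np.sign(v) * q / s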

Finally, in Fig. 3 (d), we compare the effect of the regularization parameter \lambda on the number of collisions. Similar to the second experiment, we use the best-performing scheme from the first experiment, QSGD1b-EF. The values tested for \lambda are 0, 1, 2, 5, 7, and 10. We observe that, as expected, increasing \lambda has a significant effect on decreasing the number of collisions. In the \lambda=0 scenario, which practically corresponds to no communication with regard to collision prevention among the agents, we observe on average upwards of 50 collisions, similar to the numbers observed in the first experiment with the benchmark SGDm method. Even the small value \lambda=1 cuts the number of collisions almost in half on average. We observe a drop in the number of collisions with each increment in \lambda, with \lambda=10 achieving fewer than 5 collisions on average. This is naturally expected: as we increase \lambda, the agents are penalized more severely when they get close to each other; hence, they maintain a safe distance to ensure a lower collision likelihood. The effect exhibits diminishing marginal gains, in that the decrease in the number of collisions seems to slow down beyond \lambda=5.

4.2 Area Coverage

For the second application, we consider a scenario where multiple agents patrol a designated area by following a fixed trajectory, illustrated in Fig. 5. The goal of each agent is to maintain maximum total area coverage by avoiding crossing into areas already covered by other agents, while generally maintaining its fixed trajectory. A motivating example is one where the agents are UAVs carrying out a ground-coverage task over their designated areas, and the areas overlap in certain regions. Ideally, to have the maximum amount of ground coverage at any given time, we would want to discourage a UAV from approaching an overlapping region of its designated coverage area if that region is already being covered by another UAV, since employing multiple UAVs to cover the same area reduces the total area covered. We claim that this setting can be viewed as analogous to the first experiment in the following manner: if we increase the r parameter of the agents to a suitable value, the collision-prevention mechanism causes the agents to avoid crossing into territories that are already covered by other agents. Also, the trajectory of each agent in its patrolling area can be modeled as perpetually tracking a target that follows said trajectory. In the experiments, we investigate three scenarios: the central server assigns agents new locations using compressed gradients with error feedback and a nonzero regularization term; the same scenario but with the regularization term set to 0 (which corresponds to a scenario without communication); and a scenario with no central aggregation, where agents run SGDm locally. We model the intended coverage area of each agent as a disk of radius 5, with overlapping regions ranging between 10% and 25%. We report the number of “collisions”, which in this case represents the number of area violations between agents, and present them in Table 1.

N SGDm No-Comp QSGD3b TopK Dropout-B RandK
2 0.8 0.0 0.0 0.0 0.0 0.0
3 9.0 0.8 0.2 1.2 0.6 0.4
4 12.4 1.0 2.59 0.8 0.2 0.2
Table 1: Average number of collisions over 5 runs of the simulation for varying N, for SGDm without a central server and for FED-EF-ZO-SGD with No-Comp, QSGD3b, TopK, Dropout-B, and RandK.

In the experiments, in addition to the 3-agent scenario illustrated in Fig. 5, we also test the 2- and 4-agent cases. We run each experiment for 7000 iterations with \lambda=100 and N=2,3,4; the rest of the parameters take the same values as in the first experiment of the target-tracking problem. Running the simulation for 7000 iterations corresponds to about 4 full cycles of the agents around their circular trajectories. Again, we observe that the number of collisions is reduced significantly by our FED-EF-ZO-SGD algorithm, and the results obtained using a compressed gradient with error feedback are very close to those of the case where no compression is used. In some cases, compression with error feedback leads to even better results. This can be explained as before: compression injects additional noise into the gradients, introducing randomness into the trajectories of the agents, which helps avoid collisions.
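The view of the patrol route as perpetually tracking a moving target can be sketched as follows; the numerical values are illustrative (only the rough correspondence of 7000 iterations to about 4 cycles is taken from the text), not the exact simulation parameters.

import numpy as np

def patrol_target(center, radius, angular_speed, t):
    # Virtual target moving along a circular patrol route; an agent that
    # perpetually tracks this target follows its designated trajectory.
    angle = angular_speed * t
    return np.asarray(center) + radius * np.array([np.cos(angle), np.sin(angle)])

# Example: one full cycle roughly every 7000 / 4 = 1750 iterations.
z = patrol_target(center=(0.0, 0.0), radius=5.0, angular_speed=2 * np.pi / 1750, t=100)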

5 Conclusion

In this study, we tackled a distributed online optimization problem with communication limitations, where multiple agents collaborate to track targets in a federated learning setting while being limited to zeroth-order information. The communication from the agents to the server was assumed to be constrained, and we addressed this constraint by compressing the communicated information along with an error feedback term. Our analysis showed that in the single-agent scenario, after \mathcal{O}(\frac{d\sigma^{2}}{\xi^{2}}) steps in the dominant term, the EF-ZO-SGD algorithm reaches a \xi-accurate first-order solution. In the multi-agent scenario, the FED-EF-ZO-SGD algorithm converges to a \xi-accurate first-order solution after \mathcal{O}\left(\frac{\sigma^{2}dMQ(\Delta^{2}+\bar{\omega}^{2})+M(\sigma^{2}+Z^{4})}{\xi^{2}}\right) steps in the dominant term. The dominant term in these convergence results is independent of the compression ratio \delta. The convergence of the FED-EF-ZO-SGD algorithm was confirmed through simulations.

As future work, one can investigate the collision constraints of each agent from the perspective of safe reinforcement learning, where, in addition to maximizing rewards, agents must satisfy certain constraints. This framework can be incorporated into our setting and analyzed from an optimization perspective. Additionally, rather than performing simple averaging at the central server, our work can be extended to a personalized federated learning setting in which the losses are minimized by looking one step further ahead for each agent. Further avenues for research include examining how adaptive tuning of the step sizes and the regularization parameter might change the convergence analysis. We note that the tuning of the regularization parameter is closely related to dual formulations and Lagrangian methods in the general functional constrained optimization context. Finally, following the intuition presented in the experimental section, the effect of the number of agents on the variance of the stochastic gradients of the local loss functions may be studied. In this manner, as in mini-batch SGD, one might discover that incorporating a factor of \sqrt{N} in the selection of the step size accelerates convergence in a multi-agent scenario with N agents.

6 Appendix. Proofs

6.1 Lemmas

We state several lemmas from [67], mainly related to the zeroth-order method, which will be used in the main proofs. Suppose f(x)CL1,1(d)f(x)\in C_{L}^{1,1}(\mathbb{R}^{d}). Then, the following hold:

Lemma 1.

fμ(x)CLμ1,1(d)f_{\mu}(x)\in C_{L_{\mu}}^{1,1}(\mathbb{R}^{d}), where LμLL_{\mu}\leq L [67].

Lemma 2.

fμ(x)f_{\mu}(x) has the following gradient with respect to xx:

fμ(x)=1(2π)d/2f(x+μu)f(x)μue(12u2)du,\nabla f_{\mu}(x)=\dfrac{1}{(2\pi)^{d/2}}\int\frac{f(x+\mu u)-f(x)}{\mu}ue^{(-\frac{1}{2}\lVert u\rVert^{2})}\mathrm{d}u, (26)

where u𝒩(0,Id)u\sim\mathcal{N}(0,I_{d}) [67].

Lemma 3.

For any xdx\in\mathbb{R}^{d}, we have

|fμ(x)f(x)|μ2Ld2,\lvert f_{\mu}(x)-f(x)\rvert\leq\frac{\mu^{2}Ld}{2}, (27)

[67].

Lemma 4.

For any xdx\in\mathbb{R}^{d}, we have

fμ(x)f(x)μ2L(d+3)32\lVert\nabla f_{\mu}(x)-\nabla f(x)\rVert\leq\frac{\mu}{2}L(d+3)^{\frac{3}{2}} (28)

[67].

Lemma 5.

For any xdx\in\mathbb{R}^{d}, we have

𝔼u[gμ(x)2]μ22L2(d+6)3+2(d+4)f(x)2,\mathbb{E}_{u}\left[\left\lVert g_{\mu}\left(x\right)\right\rVert^{2}\right]\leq\frac{\mu^{2}}{2}L^{2}(d+6)^{3}+2(d+4)\lVert\nabla f(x)\rVert^{2}, (29)

where u𝒩(0,Id)u\sim\mathcal{N}(0,I_{d}) and gμ(x)=f(x+μu)f(x)μug_{\mu}(x)=\frac{f(x+\mu u)-f(x)}{\mu}u [67].

Lemma 6.

(Young’s inequality) For any x,ydx,y\in\mathbb{R}^{d} and λ>0\lambda>0, we have

x,yx22λ+y2λ2\langle x,y\rangle\leq\dfrac{\rVert x\lVert^{2}}{2\lambda}+\dfrac{\lVert y\rVert^{2}\lambda}{2} (30)

[67].
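Before turning to the proofs, the two-point estimator g_{\mu}(x) in Lemma 5 can be sanity-checked numerically. The sketch below (our own illustration, with a hypothetical quadratic test function) averages a large number of samples of g_{\mu}(x) and compares the result against \nabla f(x); by Lemmas 2 and 4, the gap should be small for small \mu.

import numpy as np

rng = np.random.default_rng(0)
d, mu, n_samples = 5, 1e-3, 200_000

A = rng.standard_normal((d, d))
A = A @ A.T                                    # f(x) = 0.5 * x^T A x is smooth with L = ||A||
x = rng.standard_normal(d)

U = rng.standard_normal((n_samples, d))        # u ~ N(0, I_d)
Xp = x + mu * U                                # perturbed points x + mu * u
f_x = 0.5 * x @ A @ x
f_Xp = 0.5 * np.einsum('ij,jk,ik->i', Xp, A, Xp)
g_mu = ((f_Xp - f_x) / mu)[:, None] * U        # samples of g_mu(x) from Lemma 5
est = g_mu.mean(axis=0)

# Small relative error: E[g_mu(x)] = grad f_mu(x), which is close to grad f(x) for small mu.
print(np.linalg.norm(est - A @ x) / np.linalg.norm(A @ x))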

6.2 Proof of Theorem 1

Proof.

We assume that ztdz_{t}\in\mathbb{R}^{d} are i.i.d. random variables for all t+t\in\mathbb{Z}^{+}. Furthermore, we drop the superscript notation present in the assumptions, since ii is always 11 for the single-agent case. Let x~t\tilde{x}_{t} be defined as follows (following the analysis in [6]):

x~t:=xtηet.\tilde{x}_{t}:=x_{t}-\eta e_{t}. (31)

From EF-ZO-SGD, we know that et+1=pt𝒞(pt)e_{t+1}=p_{t}-\mathcal{C}(p_{t}) and pt=g~μ,t(xt)+etp_{t}=\tilde{g}_{\mu,t}(x_{t})+e_{t}, so we can rewrite x~t+1\tilde{x}_{t+1} as

x~t+1=xt+1ηpt+η𝒞(pt)=xtη𝒞(pt)ηg~μ,t(xt)ηet+η𝒞(pt)=xtηetηg~μ,t(xt)=x~tηg~μ,t(xt),\begin{split}\tilde{x}_{t+1}&=x_{t+1}-\eta p_{t}+\eta\mathcal{C}(p_{t})\\ &=x_{t}-\eta\mathcal{C}(p_{t})-\eta\tilde{g}_{\mu,t}(x_{t})-\eta e_{t}+\eta\mathcal{C}(p_{t})\\ &=x_{t}-\eta e_{t}-\eta\tilde{g}_{\mu,t}(x_{t})\\ &=\tilde{x}_{t}-\eta\tilde{g}_{\mu,t}(x_{t}),\end{split} (32)

where g~μ,t(xt)=~t(xt+μut)~t(xt)μut\tilde{g}_{\mu,t}(x_{t})=\frac{\tilde{\ell}_{t}(x_{t}+\mu u_{t})-\tilde{\ell}_{t}(x_{t})}{\mu}u_{t} and ut𝒩(0,Id)u_{t}\sim\mathcal{N}(0,I_{d}). By Assumption 3, we can write the following:

μ,t(x~t+1)μ,t(x~t)+μ,t(x~t),x~t+1x~t+L2x~t+1x~t2.\begin{split}\ell_{\mu,t}(\tilde{x}_{t+1})\leq\ell_{\mu,t}(\tilde{x}_{t})&+\langle\nabla\ell_{\mu,t}(\tilde{x}_{t}),\tilde{x}_{t+1}-\tilde{x}_{t}\rangle\\ &+\frac{L}{2}\lVert\tilde{x}_{t+1}-\tilde{x}_{t}\rVert^{2}.\end{split} (33)

Now by Assumption 4, we get:

μ,t+1(x~t+1)μ,t(x~t)ηg~μ,t(xt),μ,t(x~t)+Lη22g~μ,t(xt)2+ωt.\begin{split}\ell_{\mu,t+1}(\tilde{x}_{t+1})\leq\ell_{\mu,t}(\tilde{x}_{t})&-\eta\langle\tilde{g}_{\mu,t}(x_{t}),\nabla\ell_{\mu,t}(\tilde{x}_{t})\rangle\\ &+\frac{L\eta^{2}}{2}\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}+\omega_{t}.\end{split} (34)

Since μ,t(xt)=𝔼ut,zt[g~μ,t(xt)]\nabla\ell_{\mu,t}(x_{t})=\mathbb{E}_{u_{t},z_{t}}\left[\tilde{g}_{\mu,t}(x_{t})\right], taking the expectation of both sides with respect to utu_{t} and ztz_{t}, we have the following:

𝔼ut,zt[g~μ,t(xt),μ,t(x~t)]=μ,t(xt),μ,t(x~t),\begin{split}\mathbb{E}_{u_{t},z_{t}}\left[\langle\tilde{g}_{\mu,t}(x_{t}),\nabla\ell_{\mu,t}(\tilde{x}_{t})\rangle\right]=\langle\nabla\ell_{\mu,t}(x_{t}),\nabla\ell_{\mu,t}(\tilde{x}_{t})\rangle,\end{split} (35)

and

μ,t(xt),μ,t(x~t)=12μ,t(xt)2+12μ,t(x~t)212μ,t(xt)μ,t(x~t)2.\begin{split}\langle\nabla\ell_{\mu,t}(x_{t}),\nabla\ell_{\mu,t}(\tilde{x}_{t})\rangle&=\frac{1}{2}\lVert\nabla\ell_{\mu,t}(x_{t})\rVert^{2}+\frac{1}{2}\lVert\nabla\ell_{\mu,t}(\tilde{x}_{t})\rVert^{2}\\ &-\frac{1}{2}\lVert\nabla\ell_{\mu,t}(x_{t})-\nabla\ell_{\mu,t}(\tilde{x}_{t})\rVert^{2}.\end{split} (36)

In the last step, we use the fact that 2a,b=a2+b2ab22\langle a,b\rangle=\lVert a\rVert^{2}+\lVert b\rVert^{2}-\lVert a-b\rVert^{2}. Inserting this into (34), we get:

μ,t+1(x~t+1)μ,t(x~t)η2μ,t(xt)2η2μ,t(x~t)2+L2η2xtx~t2+Lη22𝔼ut,zt[g~μ,t(xt)2]+ωt.\begin{split}\ell_{\mu,t+1}(\tilde{x}_{t+1})&\leq\ell_{\mu,t}(\tilde{x}_{t})-\frac{\eta}{2}\lVert\nabla\ell_{\mu,t}(x_{t})\rVert^{2}\\ &-\frac{\eta}{2}\lVert\nabla\ell_{\mu,t}(\tilde{x}_{t})\rVert^{2}+\frac{L^{2}\eta}{2}\lVert x_{t}-\tilde{x}_{t}\rVert^{2}\\ &+\frac{L\eta^{2}}{2}\mathbb{E}_{u_{t},z_{t}}\left[\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}\right]+\omega_{t}.\end{split} (37)

Note that μ,t(xt)μ,t(x~t)2L2xtx~t2\lVert\nabla\ell_{\mu,t}(x_{t})-\nabla\ell_{\mu,t}(\tilde{x}_{t})\rVert^{2}\leq L^{2}\lVert x_{t}-\tilde{x}_{t}\rVert^{2} by Assumption 3, with subsequent application of Lemma 1. Also, we can drop -η2μ,t(x~t)2\frac{\eta}{2}\lVert\nabla\ell_{\mu,t}(\tilde{x}_{t})\rVert^{2} because it is nonpositive. Using the fact that x~txt=ηet\tilde{x}_{t}-x_{t}=\eta e_{t}, we get the main inequality:

η2μ,t(xt)2Term I[μ,t(x~t)μ,t+1(x~t+1)]Term II+Lη22𝔼ut,zt[g~μ,t(xt)2]Term III+L2η32et2Term IV+ωt.\begin{split}\underbrace{\frac{\eta}{2}\lVert\nabla\ell_{\mu,t}(x_{t})\rVert^{2}}_{\text{Term I}}&\leq\underbrace{\left[\ell_{\mu,t}(\tilde{x}_{t})-\ell_{\mu,t+1}(\tilde{x}_{t+1})\right]}_{\text{Term II}}\\ &+\underbrace{\frac{L\eta^{2}}{2}\mathbb{E}_{u_{t},z_{t}}\left[\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}\right]}_{\text{Term III}}\\ &+\underbrace{\frac{L^{2}\eta^{3}}{2}\lVert e_{t}\rVert^{2}}_{\text{Term IV}}+\omega_{t}.\end{split} (38)

We will upper bound Terms II, III, and IV and lower bound Term I. Starting with Term III, by Lemma 5, we know that

𝔼ut,z1:T[g~μ,t(xt)2]2(d+4)𝔼z1:T[~t(xt)2]+μ2L22(d+6)3,\begin{split}\mathbb{E}_{u_{t},z_{1:T}}\left[\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}\right]&\leq 2(d+4)\mathbb{E}_{z_{1:T}}\left[\lVert\tilde{\nabla}\ell_{t}(x_{t})\rVert^{2}\right]\\ &+\frac{\mu^{2}L^{2}}{2}(d+6)^{3},\end{split} (39)

where \mathbb{E}_{z_{1:T}}[\lVert\tilde{\nabla}\ell_{t}(x_{t})\rVert^{2}]\leq M\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]+\sigma^{2} by Assumption 2. Note that, in this step, we use the principle of causality and the fact that the z_{t} are i.i.d. random variables. We can upper bound Term II by means of a telescoping sum and a subsequent application of Lemma 3:

t=1T[μ,t(x~t)μ,t+1(x~t+1)]=μ,1(x~1)μ,T+1(x~T+1),\begin{split}\sum_{t=1}^{T}\left[\ell_{\mu,t}(\tilde{x}_{t})-\ell_{\mu,t+1}(\tilde{x}_{t+1})\right]&=\ell_{\mu,1}(\tilde{x}_{1})-\ell_{\mu,T+1}(\tilde{x}_{T+1}),\end{split} (40)

and

μ,1(x~1)μ,T+1(x~T+1)μ2Ld+1(x~1)T+1(x~T+1)=μ2Ld+1(x1)T+1(x~T+1),\begin{split}\ell_{\mu,1}(\tilde{x}_{1})-\ell_{\mu,T+1}(\tilde{x}_{T+1})&\leq\mu^{2}Ld+\ell_{1}(\tilde{x}_{1})-\ell_{T+1}(\tilde{x}_{T+1})\\ &=\mu^{2}Ld+\ell_{1}(x_{1})-\ell_{T+1}(\tilde{x}_{T+1}),\end{split} (41)

where we use the fact that \ell_{1}(x_{1})=\ell_{1}(\tilde{x}_{1}), since \tilde{x}_{1}=x_{1} by definition. Then, we can write the following:

\begin{split}\sum_{t=1}^{T}\left[\ell_{\mu,t}(\tilde{x}_{t})-\ell_{\mu,t+1}(\tilde{x}_{t+1})\right]&\leq\mu^{2}Ld+\ell_{1}(x_{1})-\ell_{T+1}(\tilde{x}_{T+1})\\&\leq\mu^{2}Ld+\ell_{1}(x_{1})-\ell_{T+1}(x^{*}_{T+1}),\end{split} (42)

where x_{T+1}^{*}\in\arg\!\min_{x}\ell_{T+1}(x). We can lower bound Term I as follows, using Lemmas 4 and 6:

12t(xt)2μ2L24(d+3)3μ,t(xt)2.\frac{1}{2}\lVert\nabla\ell_{t}(x_{t})\rVert^{2}-\frac{\mu^{2}L^{2}}{4}(d+3)^{3}\leq\lVert\nabla\ell_{\mu,t}(x_{t})\rVert^{2}. (43)

Lastly, we can upper bound Term IV using Assumption 5 and Lemma 6. (Due to space considerations, in the remainder of the proof, we denote the total expectation \mathbb{E}_{u_{1:T},z_{1:T},\mathcal{C}_{1:T}}[\cdot] as \mathbb{E}[\cdot].)

𝔼[et+12]=𝔼[pt𝒞t(pt)2](1δ)𝔼[pt2]=(1δ)𝔼[et+g~μ,t(xt)2](1δ)(1+φ)𝔼[et2]+(1δ)(1+1φ)𝔼u1:T,z1:T[g~μ,t(xt)2],\begin{split}\mathbb{E}\left[\lVert e_{t+1}\rVert^{2}\right]&=\mathbb{E}\left[\lVert p_{t}-\mathcal{C}_{t}(p_{t})\rVert^{2}\right]\\ &\leq(1-\delta)\mathbb{E}\left[\lVert p_{t}\rVert^{2}\right]\\ &=(1-\delta)\mathbb{E}\left[\lVert e_{t}+\tilde{g}_{\mu,t}(x_{t})\rVert^{2}\right]\\ &\leq(1-\delta)(1+\varphi)\mathbb{E}\left[\lVert e_{t}\rVert^{2}\right]+(1-\delta)\left(1+\frac{1}{\varphi}\right)\\ &\mathbb{E}_{u_{1:T},z_{1:T}}\left[\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}\right],\end{split} (44)

for some \varphi>0. Unrolling this recursion yields

\mathbb{E}\left[\lVert e_{t+1}\rVert^{2}\right]\leq\sum_{i=1}^{t}\left[(1-\delta)(1+\varphi)\right]^{t-i}(1-\delta)\left(1+\frac{1}{\varphi}\right)\mathbb{E}_{u_{i},z_{1:T}}\left[\lVert\tilde{g}_{\mu,i}(x_{i})\rVert^{2}\right], (45)

where z_{t},u_{t},\mathcal{C}_{t} are i.i.d. and \mathbb{E}_{\mathcal{C}_{t}}[\cdot] denotes the expectation over the randomness at time t due to the compression scheme. Note that, by using Lemma 5 and Assumption 2,

𝔼ut,z1:T[g~μ,t(xt)2]A𝔼z1:T[t(xt)2]+B,\mathbb{E}_{u_{t},z_{1:T}}[\lVert\tilde{g}_{\mu,t}(x_{t})\rVert^{2}]\leq A\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]+B, (46)

where

B=2σ2(d+4)+μ2L22(d+6)3andA=2M(d+4).\begin{split}&B=2\sigma^{2}(d+4)+\frac{\mu^{2}L^{2}}{2}(d+6)^{3}\;\text{and}\\ &A=2M(d+4).\end{split} (47)

So we can rewrite (44) as follows:

𝔼[et+12]i=1t[(1δ)(1+φ)]ti(1δ)(1+1φ)[A𝔼z1:T[i(xi)2]+B].\begin{split}\mathbb{E}\left[\lVert e_{t+1}\rVert^{2}\right]\leq\sum_{i=1}^{t}\left[(1-\delta)(1+\varphi)\right]^{t-i}(1-\delta)(1+\frac{1}{\varphi})\\ \left[A\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{i}(x_{i})\rVert^{2}\right]+B\right].\end{split} (48)

If we set φ:=δ2(1δ)\varphi\vcentcolon=\frac{\delta}{2(1-\delta)}, then 1+1φ2δ1+\frac{1}{\varphi}\leq\frac{2}{\delta} and (1δ)(1+φ)=(1δ2)(1-\delta)(1+\varphi)=(1-\frac{\delta}{2}), so we get:

𝔼[et+12]i=1t(1δ2)ti[A𝔼z1:T[i(xi)2]+B]2(1δ)δ.\begin{split}\mathbb{E}\left[\lVert e_{t+1}\rVert^{2}\right]\leq\sum_{i=1}^{t}\left(1-\frac{\delta}{2}\right)^{t-i}\left[A\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{i}(x_{i})\rVert^{2}\right]+B\right]\\ \frac{2(1-\delta)}{\delta}.\end{split} (49)

If we sum through all 𝔼[et2]\mathbb{E}[\lVert e_{t}\rVert^{2}], we get:

t=1T𝔼[et2]t=1Ti=1t1(1δ2)ti[A𝔼z1:T[i(xi)2]+B]2(1δ)δt=1T[A𝔼z1:T[t(xt)2]+B]i=0(1δ2)i2(1δ)δt=1T[A𝔼z1:T[t(xt)2]+B]K,\begin{split}\sum_{t=1}^{T}\mathbb{E}\left[\lVert e_{t}\rVert^{2}\right]&\leq\sum_{t=1}^{T}\sum_{i=1}^{t-1}\left(1-\frac{\delta}{2}\right)^{t-i}\\ &\left[A\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{i}(x_{i})\rVert^{2}\right]+B\right]\frac{2(1-\delta)}{\delta}\\ &\leq\sum_{t=1}^{T}\left[A\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]+B\right]\\ &\sum_{i=0}^{\infty}\left(1-\frac{\delta}{2}\right)^{i}\frac{2(1-\delta)}{\delta}\\ &\leq\sum_{t=1}^{T}\left[A\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]+B\right]K,\end{split} (50)

where K=\frac{2(1-\delta)}{\delta}\cdot\frac{2}{\delta}\leq\frac{4}{\delta^{2}}. If we define \Delta\vcentcolon=\ell_{1}(x_{1})-\ell_{T+1}(x_{T+1}^{*}), where x^{*}_{T+1}\in\arg\!\min_{x}\ell_{T+1}(x), combine the upper bounds derived in (39), (42), and (50) with the lower bound derived in (43), and insert them into (38), we get the following:

t=1Tη4𝔼z1:T[t(xt)2]ημ2L28(d+3)3Tμ2Ld+Δ+Tμ2L3η24(d+6)3+Lη22σ2T2(d+4)+Lη22×2M(d+4)t=1T𝔼z1:T[t(xt)2]+η3L22×4δ2T[2σ2(d+4)+μ2L22(d+6)3]+η3L22×4δ2t=1T2M(d+4)𝔼z1:T[t(xt)2]+t=1Tωt.\begin{split}&\sum_{t=1}^{T}\frac{\eta}{4}\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]-\frac{\eta\mu^{2}L^{2}}{8}(d+3)^{3}T\\ &\leq\mu^{2}Ld+\Delta+\frac{T\mu^{2}L^{3}\eta^{2}}{4}(d+6)^{3}+\frac{L\eta^{2}}{2}\sigma^{2}T2(d+4)\\ &+\frac{L\eta^{2}}{2}\times 2M(d+4)\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]+\frac{\eta^{3}L^{2}}{2}\\ &\times\frac{4}{\delta^{2}}T\left[2\sigma^{2}(d+4)+\frac{\mu^{2}L^{2}}{2}(d+6)^{3}\right]+\frac{\eta^{3}L^{2}}{2}\\ &\times\frac{4}{\delta^{2}}\sum_{t=1}^{T}2M(d+4)\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]+\sum_{t=1}^{T}\omega_{t}.\end{split} (51)

Now, since ztz_{t}’s are i.i.d. for all t+t\in\mathbb{Z}^{+}, we have:

ETt=1T𝔼z1:T[t(xt)2]μ2Ld+ΔT+η2L3μ2(d+6)34+Lη2σ2(d+4)+ημ2L2(d+3)38+η3L2δ24σ2(d+4)+η3L2δ2μ2L2(d+6)3+1Tt=1Tωt,\begin{split}&\frac{E}{T}\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]\\ &\leq\frac{\mu^{2}Ld+\Delta}{T}+\frac{\eta^{2}L^{3}\mu^{2}(d+6)^{3}}{4}+L\eta^{2}\sigma^{2}(d+4)\\ &+\frac{\eta\mu^{2}L^{2}(d+3)^{3}}{8}+\frac{\eta^{3}L^{2}}{\delta^{2}}4\sigma^{2}(d+4)\\ &+\frac{\eta^{3}L^{2}}{\delta^{2}}\mu^{2}L^{2}(d+6)^{3}+\frac{1}{T}\sum_{t=1}^{T}\omega_{t},\end{split} (52)

where

E=η4LMη2(d+4)L2η3δ24M(d+4)=η[14LMη(d+4)(1+4Lηδ2)].\begin{split}E&=\frac{\eta}{4}-LM\eta^{2}(d+4)-\frac{L^{2}\eta^{3}}{\delta^{2}}4M(d+4)\\ &=\eta\left[\frac{1}{4}-LM\eta(d+4)\left(1+\frac{4L\eta}{\delta^{2}}\right)\right].\end{split} (53)

If \eta\leq\frac{1}{4L}, the factor in parentheses can be bounded as:

1+4Lηδ21+1δ2=δ2+1δ22δ2.1+\dfrac{4L\eta}{\delta^{2}}\leq 1+\dfrac{1}{\delta^{2}}=\dfrac{\delta^{2}+1}{\delta^{2}}\leq\dfrac{2}{\delta^{2}}. (54)

We proceed to find an η\eta such that

2δ2LMη(d+4)18.\dfrac{2}{\delta^{2}}LM\eta(d+4)\leq\frac{1}{8}. (55)

Then, we get

ηδ216LM(d+4),\eta\leq\frac{\delta^{2}}{16LM(d+4)}, (56)

which implies Eη8E\geq\frac{\eta}{8}. Multiplying all terms in the bound by 8η\frac{8}{\eta},

1Tt=1T𝔼z1:T[t(xt)2]8Δ(ηT)+8μ2LdηT+2ηL3μ2(d+6)3+8Lησ2(d+4)+μ2L2(d+3)3+32η2L2δ2σ2(d+4)+8η2L4μ2(d+6)3δ2+8ηTt=1Tωt.\begin{split}&\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}}\left[\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\right]\leq\frac{8\Delta}{(\eta T)}+\frac{8\mu^{2}Ld}{\eta T}\\ &+2\eta L^{3}\mu^{2}(d+6)^{3}+8L\eta\sigma^{2}(d+4)+\mu^{2}L^{2}(d+3)^{3}\\ &+\frac{32\eta^{2}L^{2}}{\delta^{2}}\sigma^{2}(d+4)+\frac{8\eta^{2}L^{4}\mu^{2}(d+6)^{3}}{\delta^{2}}+\frac{8}{\eta T}\sum_{t=1}^{T}\omega_{t}.\end{split} (57)

Let

η=1σ(d+4)MTLandμ=1(d+4)T.\eta=\frac{1}{\sigma\sqrt{(d+4)MTL}}\quad\text{and}\quad\mu=\dfrac{1}{(d+4)\sqrt{T}}. (58)

Putting these values into (57), we get (13) as follows:

\begin{split}&\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\lVert\nabla\ell_{t}(x_{t})\rVert^{2}\leq\frac{8\Delta\sigma(d+4)^{\frac{1}{2}}M^{\frac{1}{2}}L^{\frac{1}{2}}}{T^{\frac{1}{2}}}+\frac{8\sigma dL^{\frac{3}{2}}M^{\frac{1}{2}}}{T^{\frac{3}{2}}(d+4)^{\frac{3}{2}}}\\&+\frac{2(d+6)^{3}L^{\frac{5}{2}}}{\sigma(d+4)^{\frac{5}{2}}T^{\frac{3}{2}}M^{\frac{1}{2}}}+\frac{8\sigma(d+4)^{\frac{1}{2}}L^{\frac{1}{2}}}{M^{\frac{1}{2}}T^{\frac{1}{2}}}+\frac{(d+3)^{3}L^{2}}{(d+4)^{2}T}\\&+\frac{32L}{\delta^{2}MT}+\frac{8(d+6)^{3}L^{3}}{\delta^{2}\sigma^{2}(d+4)^{3}MT^{2}}+\frac{8\bar{\omega}\sigma(d+4)^{\frac{1}{2}}M^{\frac{1}{2}}L^{\frac{1}{2}}}{T^{\frac{1}{2}}}.\end{split} (59)

where \bar{\omega}\vcentcolon=\sum_{t=1}^{T}\omega_{t}. Thus, the number of time steps T needed to obtain a \xi-accurate first-order solution is

T=𝒪(dσ2LΔMξ2+dLΔδ2ξ+ω¯σ2dMLξ2).T=\mathcal{O}\left(\frac{d\sigma^{2}L\Delta M}{\xi^{2}}+\frac{dL\Delta}{\delta^{2}\xi}+\frac{\bar{\omega}\sigma^{2}dML}{\xi^{2}}\right). (60)

6.3 Proof of Theorem 2

Proof.

We assume in the following that zt1:NNdz^{1:N}_{t}\in\mathbb{R}^{Nd} are i.i.d. random variables for all t+t\in\mathbb{Z}^{+}. Similar to the analysis in the single-agent case, we begin by defining:

e¯t:=1Ni=1Neti,\bar{e}_{t}\vcentcolon=\dfrac{1}{N}\sum_{i=1}^{N}e^{i}_{t}, (61)

and

x~t1:N:=xt1:Nηe¯t.\tilde{x}^{1:N}_{t}\vcentcolon=x^{1:N}_{t}-\eta\bar{e}_{t}. (62)

Additionally, our global loss function in this scenario is:

~¯t(xt1:N)=1Ni=1N~ti(xt1:N).\bar{\tilde{\ell}}_{t}\left(x^{1:N}_{t}\right)=\dfrac{1}{N}\sum_{i=1}^{N}\tilde{\ell}^{i}_{t}\left(x^{1:N}_{t}\right). (63)

Now, we have:

x~t+11:N=xt+11:Nηe¯t+1=xt+11:Nη1Ni=1N[pti𝒞(pti)]=xt1:Nη𝒢tη1Ni=1N[pti𝒞(pti)]=xt1:Nη1Ni=1Npti=xt1:Nη1Ni=1N[g~μ,ti(xt1:N)+eti]=x~t1:Nηg~¯μ,t(xt1:N),\begin{split}\tilde{x}^{1:N}_{t+1}&=x^{1:N}_{t+1}-\eta\bar{e}_{t+1}\\ &=x^{1:N}_{t+1}-\eta\dfrac{1}{N}\sum_{i=1}^{N}\left[p^{i}_{t}-\mathcal{C}\left(p^{i}_{t}\right)\right]\\ &=x^{1:N}_{t}-\eta\mathcal{G}_{t}-\eta\dfrac{1}{N}\sum_{i=1}^{N}\left[p^{i}_{t}-\mathcal{C}\left(p^{i}_{t}\right)\right]\\ &=x^{1:N}_{t}-\eta\dfrac{1}{N}\sum_{i=1}^{N}p^{i}_{t}\\ &=x^{1:N}_{t}-\eta\dfrac{1}{N}\sum_{i=1}^{N}\left[\tilde{g}^{i}_{\mu,t}\left(x^{1:N}_{t}\right)+e^{i}_{t}\right]\\ &=\tilde{x}^{1:N}_{t}-\eta\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right),\end{split} (64)

where we define g~¯μ,t(xt1:N):=1Ni=1Ng~μ,ti(xt1:N).\bar{\tilde{g}}_{\mu,t}(x^{1:N}_{t})\vcentcolon=\frac{1}{N}\sum_{i=1}^{N}\tilde{g}^{i}_{\mu,t}\left(x_{t}^{1:N}\right). Now, we have by Assumption 3 that each ti\ell^{i}_{t} is LL-smooth, therefore, our global loss function ¯t\bar{\ell}_{t} is also LL-smooth. Using Lemma 1, we write

¯μ,t(x~t+11:N)¯μ,t(x~t1:N)+¯μ,t(x~t1:N),x~t+11:Nx~t1:N+L2x~t+11:Nx~t1:N2.\begin{split}\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t+1}\right)\leq\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)+\left\langle\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right),\tilde{x}^{1:N}_{t+1}-\tilde{x}^{1:N}_{t}\right\rangle\\ +\dfrac{L}{2}\left\lVert\tilde{x}^{1:N}_{t+1}-\tilde{x}^{1:N}_{t}\right\rVert^{2}.\end{split} (65)

By Assumption 4, this implies

¯μ,t+1(x~t+11:N)¯μ,t(x~t1:N)ηg~¯μ,t(xt1:N),¯μ,t(x~t1:N)+Lη22g~¯μ,t(xt1:N)2+ωt,\begin{split}\bar{\ell}_{\mu,t+1}\left(\tilde{x}^{1:N}_{t+1}\right)&\leq\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\\ &-\eta\left\langle\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right),\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rangle\\ &+\dfrac{L\eta^{2}}{2}\left\lVert\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}+\omega_{t},\end{split} (66)

where \omega_{t}=\max\{\omega_{t}^{1},\ldots,\omega_{t}^{N}\}. Now, since we have

𝔼ut1:N[g~¯μ,t(xt1:N)]=𝔼ut1:N[1Ni=1Ng~μ,ti(xt1:N)]=1Ni=1N~μ,ti(xt1:N)=~¯μ,t(xt1:N),\begin{split}\mathbb{E}_{u^{1:N}_{t}}\left[\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right)\right]&=\mathbb{E}_{u^{1:N}_{t}}\left[\dfrac{1}{N}\sum_{i=1}^{N}\tilde{g}^{i}_{\mu,t}\left(x^{1:N}_{t}\right)\right]\\ &=\dfrac{1}{N}\sum_{i=1}^{N}\nabla\tilde{\ell}^{i}_{\mu,t}\left(x^{1:N}_{t}\right)\\ &=\nabla\bar{\tilde{\ell}}_{\mu,t}\left(x^{1:N}_{t}\right),\end{split} (67)

the following holds:

𝔼ut1:N,zt1:N[g~¯μ,t(xt1:N),¯μ,t(x~t1:N)]=¯μ,t(xt1:N),¯μ,t(x~t1:N)=12¯μ,t(xt1:N)2+12¯μ,t(x~t1:N)212¯μ,t(xt1:N)¯μ,t(x~t1:N)2,\begin{split}&\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\langle\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right),\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rangle\right]\\ &=\left\langle\nabla\bar{\ell}_{\mu,t}\left(x^{1:N}_{t}\right),\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rangle=\dfrac{1}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}\\ &+\dfrac{1}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rVert^{2}\\ &-\dfrac{1}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(x^{1:N}_{t}\right)-\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rVert^{2},\end{split} (68)

since 𝔼zt1:N[~¯(xt1:N)]=¯(xt1:N).\mathbb{E}_{z^{1:N}_{t}}[\nabla\bar{\tilde{\ell}}(x^{1:N}_{t})]=\nabla\bar{\ell}(x^{1:N}_{t}). Now, combining this with (66) and using LL-smoothness, we obtain:

¯μ,t+1(x~t+11:N)¯μ,t(x~t1:N)η2¯μ,t(xt1:N)2η2¯μ,t(x~t1:N)2+L2η2xt1:Nx~t1:N2+Lη22𝔼ut1:N,zt1:N[g~¯μ,t(xt1:N)2]+ωt\begin{split}\bar{\ell}_{\mu,t+1}\left(\tilde{x}^{1:N}_{t+1}\right)&\leq\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)-\dfrac{\eta}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}\\ &-\dfrac{\eta}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)\right\rVert^{2}\\ &+\dfrac{L^{2}\eta}{2}\left\lVert x^{1:N}_{t}-\tilde{x}^{1:N}_{t}\right\rVert^{2}\\ &+\dfrac{L\eta^{2}}{2}\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\lVert\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}\right]+\omega_{t}\end{split} (69)

Note that the third term at the right-hand side of the inequality can be dropped because it is nonpositive. Using the definition of x~t1:N\tilde{x}^{1:N}_{t}, and taking the expectation of both sides with respect to ut1:Nu^{1:N}_{t} and zt1:Nz^{1:N}_{t}, we have the following main inequality:

η2¯μ,t(xt1:N)2Term I[¯μ,t(x~t1:N)¯μ,t+1(x~t+11:N)]Term II+Lη22𝔼ut1:N,zt1:N[g~¯μ,t(xt1:N)2]Term III+L2η32e¯t2Term IV+ωt.\begin{split}\underbrace{\dfrac{\eta}{2}\left\lVert\nabla\bar{\ell}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}}_{\text{Term I}}&\leq\underbrace{\left[\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)-\bar{\ell}_{\mu,t+1}\left(\tilde{x}^{1:N}_{t+1}\right)\right]}_{\text{Term II}}\\ &+\underbrace{\dfrac{L\eta^{2}}{2}\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\lVert\bar{\tilde{g}}_{\mu,t}\left(x^{1:N}_{t}\right)\right\rVert^{2}\right]}_{\text{Term III}}\\ &+\underbrace{\dfrac{L^{2}\eta^{3}}{2}\lVert\bar{e}_{t}\rVert^{2}}_{\text{Term IV}}+\omega_{t}.\end{split} (70)

We continue the proof by upper bounding Terms II, III, and IV and lower bounding Term I. Starting with Term III, using Jensen’s inequality, we get

𝔼ut1:N,zt1:N[g~¯μ,t(xt1:N)2]=𝔼ut1:N,zt1:N[1Ni=1Ng~μ,ti(xt1:N)2]1Ni=1N𝔼ut1:N,zt1:N[g~μ,ti(xt1:N)2].\begin{split}&\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\lVert\bar{\tilde{g}}_{\mu,t}(x^{1:N}_{t})\right\rVert^{2}\right]\\ &=\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\lVert\frac{1}{N}\sum_{i=1}^{N}\tilde{g}^{i}_{\mu,t}(x^{1:N}_{t})\right\rVert^{2}\right]\\ &\leq\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{u^{1:N}_{t},z^{1:N}_{t}}\left[\left\lVert\tilde{g}^{i}_{\mu,t}(x^{1:N}_{t})\right\rVert^{2}\right].\end{split} (71)

Then, by Lemma 5 we know

𝔼u1:T1:N,z1:T1:N[g~μ,ti(xt1:N)2]2(d+4)𝔼z1:T1:N[~ti(xt1:N)2]+μ2L22(d+6)3.\begin{split}\mathbb{E}_{u^{1:N}_{1:T},z^{1:N}_{1:T}}\left[\lVert\tilde{g}^{i}_{\mu,t}(x^{1:N}_{t})\rVert^{2}\right]&\leq 2(d+4)\\ &\mathbb{E}_{z^{1:N}_{1:T}}\left[\lVert\nabla\tilde{\ell}^{i}_{t}(x^{1:N}_{t})\rVert^{2}\right]\\ &+\frac{\mu^{2}L^{2}}{2}(d+6)^{3}.\end{split} (72)

Using Assumption 2, we have 𝔼z1:T1:N[~ti(xt1:N)2]M𝔼z1:T1:N[ti(xt1:N)2]+σ2\mathbb{E}_{z^{1:N}_{1:T}}[\lVert\nabla\tilde{\ell}^{i}_{t}(x^{1:N}_{t})\rVert^{2}]\leq M\mathbb{E}_{z^{1:N}_{1:T}}\left[\lVert\nabla\ell^{i}_{t}(x^{1:N}_{t})\rVert^{2}\right]+\sigma^{2}. Then, through application of Assumption 6 and Lemma 6, we have:

𝔼u1:T1:N,z1:T1:N[g~μ,ti(xt1:N)2]2(d+4)(MZ2+σ2)+2(d+4)MQ𝔼z1:T1:N[¯t(xt1:N)2]+μ2L22(d+6)3.\begin{split}\mathbb{E}_{u^{1:N}_{1:T},z^{1:N}_{1:T}}\left[\lVert\tilde{g}^{i}_{\mu,t}(x^{1:N}_{t})\rVert^{2}\right]\leq 2(d+4)(MZ^{2}+\sigma^{2})\\ +2(d+4)MQ\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x_{t}^{1:N})\rVert^{2}\right]\\ +\frac{\mu^{2}L^{2}}{2}(d+6)^{3}.\end{split} (73)

For Term II, if we do a summation on both sides of (70) from t=1t=1 to TT, we get a telescoping sum:

t=1T[¯μ,t(x~t1:N)¯μ,t+1(x~t+11:N)]=¯μ,1(x~11:N)¯μ,T+1(x~T+11:N).\begin{split}&\sum_{t=1}^{T}\left[\bar{\ell}_{\mu,t}\left(\tilde{x}^{1:N}_{t}\right)-\bar{\ell}_{\mu,t+1}\left(\tilde{x}^{1:N}_{t+1}\right)\right]\\ &=\bar{\ell}_{\mu,1}\left(\tilde{x}^{1:N}_{1}\right)-\bar{\ell}_{\mu,T+1}\left(\tilde{x}^{1:N}_{T+1}\right).\end{split} (74)

By adding and subtracting ¯1(x~11:N)\bar{\ell}_{1}(\tilde{x}_{1}^{1:N}) and ¯T+1(x~T+11:N)\bar{\ell}_{T+1}(\tilde{x}_{T+1}^{1:N}) on both sides and using Lemma 3, we have:

¯μ,1(x~11:N)¯μ,T+1(x~T+11:N)μ2Ld+¯1(x11:N)¯T+1(x~T+11:N).μ2Ld+¯1(x11:N)¯T+1(xT+1)=μ2Ld+Δ,\begin{split}&\bar{\ell}_{\mu,1}\left(\tilde{x}^{1:N}_{1}\right)-\bar{\ell}_{\mu,T+1}\left(\tilde{x}^{1:N}_{T+1}\right)\\ &\leq\mu^{2}Ld+\bar{\ell}_{1}(x^{1:N}_{1})-\bar{\ell}_{T+1}(\tilde{x}^{1:N}_{T+1}).\\ &\leq\mu^{2}Ld+\bar{\ell}_{1}(x^{1:N}_{1})-\bar{\ell}_{T+1}(x_{T+1}^{*})\\ &=\mu^{2}Ld+\Delta,\end{split} (75)

where xT+1=mini{1,,N}argminxT+1i(x)x^{*}_{T+1}=\min_{i\in\{1,...,N\}}\arg\!\min_{x}\ell^{i}_{T+1}(x) and Δ=¯1(x11:N)¯T+1(xT+1)\Delta=\bar{\ell}_{1}(x^{1:N}_{1})-\bar{\ell}_{T+1}(x_{T+1}^{*}). Note that we use x~11:N=x11:N.\tilde{x}^{1:N}_{1}=x^{1:N}_{1}. For Term I, one should note that if ti(x)CL1,1\ell^{i}_{t}(x)\in C^{1,1}_{L}, then μ,ti(x)CL1,1\ell^{i}_{\mu,t}(x)\in C^{1,1}_{L} by Lemma 1. This implies that ¯μ,t(x)CL1,1\bar{\ell}_{\mu,t}(x)\in C^{1,1}_{L} because ¯μ,t(x)=1Ni=1Nμ,ti(x)\bar{\ell}_{\mu,t}(x)=\frac{1}{N}\sum_{i=1}^{N}\ell^{i}_{\mu,t}(x). Thus, using Lemmas 4 and 6, we get

\frac{1}{2}\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}-\frac{\mu^{2}L^{2}(d+3)^{3}}{4}\leq\lVert\nabla\bar{\ell}_{\mu,t}(x^{1:N}_{t})\rVert^{2}. (76)

Finally, for Term IV, we use a recursive summation similar to the one in the single-agent proof. We want to upper bound \lVert\bar{e}_{t}\rVert^{2}; we do so by taking the expectation of both sides in (70) with respect to u^{1:N}_{1:T},z^{1:N}_{1:T},\mathcal{C}_{1:T} and upper bounding \mathbb{E}_{u^{1:N}_{1:T},z^{1:N}_{1:T},\mathcal{C}_{1:T}}\left[\lVert\bar{e}_{t}\rVert^{2}\right] instead. (Due to space considerations, in the remainder of the proof, we denote the total expectation \mathbb{E}_{u^{1:N}_{1:T},z^{1:N}_{1:T},\mathcal{C}_{1:T}}[\cdot] as \mathbb{E}[\cdot].) By Jensen’s inequality, we have the following:

𝔼[e¯t2]=𝔼[1Ni=1Neti2]𝔼[1Ni=1Neti2]=1Ni=1N𝔼[eti2]\begin{split}\mathbb{E}\left[\lVert\bar{e}_{t}\rVert^{2}\right]=\mathbb{E}\left[\left\lVert\frac{1}{N}\sum_{i=1}^{N}e^{i}_{t}\right\rVert^{2}\right]&\leq\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\left\lVert e^{i}_{t}\right\rVert^{2}\right]\\ &=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\lVert e^{i}_{t}\rVert^{2}\right]\end{split} (77)

Note that upper bounding the terms inside the summation amounts to repeating the single-agent bound, which we have already derived in the proof of Section 6.2. Hence, we know

𝔼[et1i2]j=1t1[(1δ)(1+φ)]t1j(1δ)(1+1φ)[A𝔼z1:T1:N[ji(xj1:N)2]+B].\begin{split}\mathbb{E}\left[\lVert e^{i}_{t-1}\rVert^{2}\right]\leq\sum_{j=1}^{t-1}[(1-\delta)(1+\varphi)]^{t-1-j}(1-\delta)\left(1+\frac{1}{\varphi}\right)\\ \left[A\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\ell^{i}_{j}(x^{1:N}_{j})\rVert^{2}\right]+B\right].\end{split} (78)

Using this fact in (77), we obtain

𝔼[et1:N2]1Ni=1Nj=1t1[(1δ)(1+φ)]t1j(1δ)(1+1φ)[A𝔼z1:T1:N[ji(xj1:N)2]+B].\begin{split}\mathbb{E}\left[\lVert e^{1:N}_{t}\rVert^{2}\right]\leq\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{t-1}[(1-\delta)(1+\varphi)]^{t-1-j}\\ (1-\delta)\left(1+\frac{1}{\varphi}\right)\left[A\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\ell^{i}_{j}(x^{1:N}_{j})\rVert^{2}\right]+B\right].\end{split} (79)

Using the same procedure in (50), if we sum both sides through t=1t=1 to t=Tt=T, we get the following inequality:

t=1T𝔼[et1:N2]1Ni=1Nt=1T[A𝔼z1:T1:Nti(xt1:N)2+B]K,\begin{split}&\sum_{t=1}^{T}\mathbb{E}\left[\lVert e^{1:N}_{t}\rVert^{2}\right]\\ &\leq\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\left[A\mathbb{E}_{z_{1:T}^{1:N}}\lVert\nabla\ell^{i}_{t}(x_{t}^{1:N})\rVert^{2}+B\right]K,\end{split} (80)

where A=2M(d+4),B=2σ2(d+4)+μ2L2(d+6)32A=2M(d+4),B=2\sigma^{2}(d+4)+\frac{\mu^{2}L^{2}(d+6)^{3}}{2} and K=4(1δ)δ24δ2K=\frac{4(1-\delta)}{\delta^{2}}\leq\frac{4}{\delta^{2}}. Another way of expressing (80) is:

t=1T𝔼[et1:N2]t=1T[A(1Ni=1N𝔼z1:T1:N[ti(xt1:N)2])+B]K.\begin{split}&\sum_{t=1}^{T}\mathbb{E}\left[\lVert e^{1:N}_{t}\rVert^{2}\right]\\ &\leq\sum_{t=1}^{T}\left[A\left(\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\ell^{i}_{t}(x_{t}^{1:N})\rVert^{2}\right]\right)+B\right]K.\end{split} (81)

Using Assumption 6, we can write this as:

t=1T𝔼[et1:N2]t=1T[A(Z2+Q𝔼z1:T1:N[¯t(xt1:N)2])+B]K.\begin{split}&\sum_{t=1}^{T}\mathbb{E}\left[\lVert e^{1:N}_{t}\rVert^{2}\right]\\ &\leq\sum_{t=1}^{T}\left[A\left(Z^{2}+Q\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]\right)+B\right]K.\end{split} (82)

If we now combine the upper bounds derived for Terms II, III, and IV with the lower bound derived for Term I and insert them into (70), we get the following inequality:

\begin{split}\frac{\eta}{4}\sum_{t=1}^{T}&\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]-\frac{T\eta\mu^{2}L^{2}(d+3)^{3}}{8}\\ &\leq\mu^{2}Ld+\Delta+TL\eta^{2}(d+4)(MZ^{2}+\sigma^{2})\\ &+L\eta^{2}(d+4)MQ\left(\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]\right)\\ &+\frac{TL^{3}\mu^{2}\eta^{2}(d+6)^{3}}{4}+\frac{2TL^{2}\eta^{3}K}{\delta^{2}}\\ &+\frac{4L^{2}\eta^{3}(d+4)MQ}{\delta^{2}}\left(\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]\right)\\ &+\sum_{t=1}^{T}\omega_{t},\end{split} (83)

where K=2M(d+4)Z2+2σ2(d+4)+μ2L2(d+6)32K=2M(d+4)Z^{2}+2\sigma^{2}(d+4)+\frac{\mu^{2}L^{2}(d+6)^{3}}{2}. After rearranging the terms and dividing both sides by TT, we have the following inequality:

ETt=1T𝔼z1:T1:N[¯t(xt1:N)2]μ2Ld+ΔT+Lη2(d+4)(MZ2+σ2)+L3μ2η2(d+6)34+2L2η3Kδ2+ω¯T,\begin{split}\frac{E}{T}\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]&\leq\frac{\mu^{2}Ld+\Delta}{T}\\ &+L\eta^{2}(d+4)(MZ^{2}+\sigma^{2})\\ &+\frac{L^{3}\mu^{2}\eta^{2}(d+6)^{3}}{4}\\ &+\frac{2L^{2}\eta^{3}K}{\delta^{2}}+\dfrac{\bar{\omega}}{T},\end{split} (84)

where ω¯:=t=1Tωt\bar{\omega}\vcentcolon=\sum_{t=1}^{T}\omega_{t}, and

\begin{split}E&=\frac{\eta}{4}-LMQ\eta^{2}(d+4)-\frac{4L^{2}\eta^{3}MQ(d+4)}{\delta^{2}}\\ &=\eta\left[\frac{1}{4}-LMQ\eta(d+4)\left(1+\frac{4L\eta}{\delta^{2}}\right)\right].\end{split} (85)

If \eta<\frac{1}{4L}, the factor in parentheses can be bounded as:

1+4Lηδ21+1δ22δ2.1+\frac{4L\eta}{\delta^{2}}\leq 1+\frac{1}{\delta^{2}}\leq\frac{2}{\delta^{2}}. (86)

We proceed to find an η\eta such that

2Lη(d+4)MQδ218.\frac{2L\eta(d+4)MQ}{\delta^{2}}\leq\frac{1}{8}. (87)

Then, we get

ηδ216LMQ(d+4),\eta\leq\frac{\delta^{2}}{16LMQ(d+4)}, (88)

which implies Eη8E\geq\frac{\eta}{8}. Multiplying all the terms in the bound by 8η\frac{8}{\eta},

1Tt=1T𝔼z1:T1:N[¯t(xt1:N)2]8ΔηT+8μ2LdηT+8Lη(d+4)(MZ2+σ2)+2L3μ2η(d+6)3+32L2η2M(d+4)Z2δ2+32L2η2σ2(d+4)δ2+8L4μ2η2(d+6)3δ2+8w¯ηT.\begin{split}\frac{1}{T}&\sum_{t=1}^{T}\mathbb{E}_{z_{1:T}^{1:N}}\left[\lVert\nabla\bar{\ell}_{t}(x^{1:N}_{t})\rVert^{2}\right]\leq\frac{8\Delta}{\eta T}+\frac{8\mu^{2}Ld}{\eta T}\\ &+8L\eta(d+4)(MZ^{2}+\sigma^{2})+2L^{3}\mu^{2}\eta(d+6)^{3}\\ &+\frac{32L^{2}\eta^{2}M(d+4)Z^{2}}{\delta^{2}}+\frac{32L^{2}\eta^{2}\sigma^{2}(d+4)}{\delta^{2}}\\ &+\frac{8L^{4}\mu^{2}\eta^{2}(d+6)^{3}}{\delta^{2}}+\dfrac{8\bar{w}}{\eta T}.\end{split} (89)

Let

η=1σ(d+4)MQTLandμ=1(d+4)T.\eta=\frac{1}{\sigma\sqrt{(d+4)MQTL}}\quad\mathrm{and}\quad\mu=\frac{1}{(d+4)\sqrt{T}}. (90)

Then, the number of time steps T needed to obtain a \xi-accurate first-order solution is:

T=𝒪(σ2dMQ(Δ2+ω¯2)+M(σ2+Z4)ξ2+L53ξ23+1δ2ξ).\begin{split}&T=\\ &\mathcal{O}\left(\dfrac{\sigma^{2}dMQ\left(\Delta^{2}+\bar{\omega}^{2}\right)+M\left(\sigma^{2}+Z^{4}\right)}{\xi^{2}}+\dfrac{L^{\frac{5}{3}}}{\xi^{\frac{2}{3}}}+\dfrac{1}{\delta^{2}\xi}\right).\end{split} (91)

REFERENCES

  • [1] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. 2016.
  • [2] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, page 1310–1321. ACM, 2015.
  • [3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding, 2016.
  • [4] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication, 2019.
  • [5] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory, 2018.
  • [6] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes, 2019.
  • [7] Baris Fidan, Soura Dasgupta, and Brian D. O. Anderson. Guaranteeing practical convergence in algorithms for sensor and source localization. IEEE Transactions on Signal Processing, 56(9):4458–4469, 2008.
  • [8] Anthony Nguyen and Krishnakumar Balasubramanian. Stochastic zeroth-order functional constrained optimization: Oracle complexity and applications. INFORMS Journal on Optimization, 2022.
  • [9] Jean-Marc Valin, François Michaud, and Jean Rouat. Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems, 55(3):216–228, 2007.
  • [10] Dongming Luan, Yongjian Yang, En Wang, Qiyang Zeng, Zhaohui Li, and Li Zhou. An efficient target tracking approach through mobile crowdsensing. IEEE Access, 7:110749–110760, 2019.
  • [11] Iman Shames, Daniel Selvaratnam, and Jonathan H. Manton. Online optimization using zeroth order oracles. IEEE Control Systems Letters, 4(1):31–36, 2020.
  • [12] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. 2013.
  • [13] Ege C. Kaya, M. Berk Sahin, and Abolfazl Hashemi. Communication-constrained exchange of zeroth-order information with application to collaborative target tracking. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
  • [14] Truc Nguyen and My T. Thai. Preserving privacy and security in federated learning, 2022.
  • [15] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. 2018.
  • [16] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization, 2019.
  • [17] Yiyue Chen, Abolfazl Hashemi, and Haris Vikalo. Communication-efficient variance-reduced decentralized stochastic optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 2021.
  • [18] Yiyue Chen, Abolfazl Hashemi, and Haris Vikalo. Decentralized optimization on time-varying directed graphs under communication constraints. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3670–3674. IEEE, 2021.
  • [19] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization, 2020.
  • [20] Jun Sun, Tianyi Chen, Georgios B. Giannakis, and Zaiyue Yang. Communication-efficient distributed learning via lazily aggregated quantized gradients, 2019.
  • [21] Rudrajit Das, Anish Acharya, Abolfazl Hashemi, Sujay Sanghavi, Inderjit S Dhillon, and Ufuk Topcu. Faster non-convex federated learning via global and local momentum. In Uncertainty in Artificial Intelligence, pages 496–506. PMLR, 2022.
  • [22] Wenzhi Fang, Ziyi Yu, Yuning Jiang, Yuanming Shi, Colin N. Jones, and Yong Zhou. Communication-efficient stochastic zeroth-order optimization for federated learning. IEEE Transactions on Signal Processing, 70:5058–5073, 2022.
  • [23] Zan Li and Li Chen. Communication-efficient decentralized zeroth-order method on heterogeneous data. In 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP), pages 1–6, 2021.
  • [24] Abolfazl Hashemi, Anish Acharya, Rudrajit Das, Haris Vikalo, Sujay Sanghavi, and Inderjit Dhillon. On the benefits of multiple gossip steps in communication-constrained decentralized federated learning. IEEE Transactions on Parallel and Distributed Systems, 33(11):2727–2739, 2021.
  • [25] Seung-Jun Kim and Geogios B. Giannakis. An online convex optimization approach to real-time energy pricing for demand response. IEEE Transactions on Smart Grid, 8(6):2784–2793, 2017.
  • [26] Tianyi Chen and Georgios B. Giannakis. Bandit convex optimization for scalable and dynamic IoT management. IEEE Internet of Things Journal, 6(1):1276–1286, feb 2019.
  • [27] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, apr 2017.
  • [28] Jorge Nocedal and Stephen J Wright. Penalty and Augmented Lagrangian Methods, page 497–524. Springer Series in Operations Research. Springer Science+Business Media, 2nd edition, 2006.
  • [29] Songtao Lu. A single-loop gradient descent and perturbed ascent algorithm for nonconvex functional constrained optimization, 2022.
  • [30] David Isele, Reza Rahimi, Akansel Cosgun, Kaushik Subramanian, and Kikuo Fujimura. Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2034–2039, 2018.
  • [31] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016.
  • [32] Meixin Zhu, Yinhai Wang, Ziyuan Pu, Jingyun Hu, Xuesong Wang, and Ruimin Ke. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. Transportation Research Part C: Emerging Technologies, 117:102662, 2020.
  • [33] Songtao Lu, Kaiqing Zhang, Tianyi Chen, Tamer Başar, and Lior Horesh. Decentralized policy gradient descent ascent for safe multi-agent reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):8767–8775, May 2021.
  • [34] Mansoor Shaukat and Mandar Chitre. Adaptive behaviors in multi-agent source localization using passive sensing. Adaptive Behavior, 24(6):446–463, 2016. PMID: 28018121.
  • [35] Sandra H. Dandach, Baris Fidan, Soura Dasgupta, and Brian D. O. Anderson. Adaptive source localization by mobile agents. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 2045–2050, 2006.
  • [36] Alejandro I. Maass, Chris Manzie, Dragan Nešić, Jonathan H. Manton, and Iman Shames. Tracking and regret bounds for online zeroth-order euclidean and riemannian optimization. SIAM Journal on Optimization, 32(2):445–469, 2022.
  • [37] Barış Fidan, Soura Dasgupta, and Brian. D. O. Anderson. Guaranteeing practical convergence in algorithms for sensor and source localization. IEEE Transactions on Signal Processing, 56:4458–4469, 2008.
  • [38] Elad Michael, Daniel Zelazo, Tony A. Wood, Chris Manzie, and Iman Shames. Optimisation with zeroth-order oracles in formation. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 5354–5359, 2020.
  • [39] Yipeng Pang and Guoqiang Hu. Randomized gradient-free distributed optimization methods for a multiagent system with unknown cost function. IEEE Transactions on Automatic Control, 65(1):333–340, 2020.
  • [40] Elad Hazan. Introduction to online convex optimization. CoRR, abs/1909.05207, 2019.
  • [41] Yipeng Pang and Guoqiang Hu. Randomized gradient-free distributed online optimization with time-varying cost functions. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 4910–4915, 2019.
  • [42] Salar Rahili and Wei Ren. Distributed continuous-time convex optimization with time-varying cost functions. IEEE Transactions on Automatic Control, 62(4):1590–1605, 2017.
  • [43] Yu-Jia Chen, Deng-Kai Chang, and Cheng Zhang. Autonomous tracking using a swarm of uavs: A constrained multi-agent reinforcement learning approach. IEEE Transactions on Vehicular Technology, 69(11):13702–13717, 2020.
  • [44] Ruilong Zhang, Qun Zong, Xiuyun Zhang, Liqian Dou, and Bailing Tian. Game of drones: Multi-uav pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, pages 1–10, 2022.
  • [45] Luis Rodolfo Garcia Carrillo and Kyriakos G. Vamvoudakis. Deep-learning tracking for autonomous flying systems under adversarial inputs. IEEE Transactions on Aerospace and Electronic Systems, 56(2):1444–1459, 2020.
  • [46] Hao Jiang and Yueqian Liang. Online path planning of autonomous uavs for bearing-only standoff multi-target following in threat environment. IEEE Access, 6:22531–22544, 2018.
  • [47] Yuming Chen, Wei Li, and Yuqiao Wang. Online adaptive kalman filter for target tracking with unknown noise statistics. IEEE Sensors Letters, 5(3):1–4, 2021.
  • [48] Xuejing Lan, Lei Liu, and Yongji Wang. Adp-based intelligent decentralized control for multi-agent systems moving in obstacle environment. IEEE Access, 7:59624–59630, 2019.
  • [49] Shuai Zheng, Ziyue Huang, and James T. Kwok. Communication-efficient distributed blockwise momentum sgd with error-feedback, 2019.
  • [50] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization, 2019.
  • [51] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning, 2019.
  • [52] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data, 2019.
  • [53] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U. Stich. A unified theory of decentralized sgd with changing topology and local updates, 2020.
  • [54] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization, 2020.
  • [55] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization, 2020.
  • [56] Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. Slowmo: Improving communication-efficient distributed sgd with slow momentum, 2019.
  • [57] Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020.
  • [58] Andrea Simonetto, Emiliano Dall’Anese, Santiago Paternain, Geert Leus, and Georgios B Giannakis. Time-varying convex optimization: Time-structured algorithms and applications. Proceedings of the IEEE, 108(11):2032–2048, 2020.
  • [59] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent, 2017.
  • [60] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. 2018.
  • [61] Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, and Tong Zhang. On the unreasonable effectiveness of federated averaging with heterogeneous data, 2022.
  • [62] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
  • [63] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019.
  • [64] Anish Acharya, Abolfazl Hashemi, Prateek Jain, Sujay Sanghavi, Inderjit S Dhillon, and Ufuk Topcu. Robust training in high dimensions via block coordinate geometric median descent. In International Conference on Artificial Intelligence and Statistics, pages 11145–11168. PMLR, 2022.
  • [65] Supplementary material. https://github.com/Sunses-hub/FED-EF-ZO-SGD.git, 2022.
  • [66] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtarik. Sgd: General analysis and improved rates. 2019.
  • [67] Guanghui. Lan. First-order and stochastic optimization methods for machine learning. Springer Series in the Data Sciences. Springer, Cham, 2020.