
Personalized Federated Learning: A Unified Framework and Universal Optimization Techniques

Filip Hanzely* fhanzely@gmail.com
Toyota Technological Institute at Chicago
Chicago, IL 60637, USA
Boxin Zhao* boxinz@uchicago.edu
Mladen Kolar mkolar@chicagobooth.edu
The University of Chicago Booth School of Business
Chicago, IL 60637, USA
* Equal contribution
Abstract

We investigate the optimization aspects of personalized Federated Learning (FL). We propose general optimizers that can be applied to numerous existing personalized FL objectives, specifically a tailored variant of Local SGD and variants of accelerated coordinate descent/accelerated SVRCD. By examining a general personalized objective capable of recovering many existing personalized FL objectives as special cases, we develop a comprehensive optimization theory applicable to a wide range of strongly convex personalized FL models in the literature. We showcase the practicality and/or optimality of our methods in terms of communication and local computation. Remarkably, our general optimization solvers and theory can recover the best-known communication and computation guarantees for addressing specific personalized FL objectives. Consequently, our proposed methods can serve as universal optimizers, rendering the design of task-specific optimizers unnecessary in many instances.

1 Introduction

Modern personal electronic devices, such as mobile phones, wearable devices, and home assistants, can collectively generate and store vast amounts of user data. This data is essential for training and improving state-of-the-art machine learning models for tasks ranging from natural language processing to computer vision. Traditionally, the training process was performed by first collecting all the data into a data center (Dean et al., 2012), raising serious concerns about user privacy and placing a considerable burden on the storage capabilities of server suppliers. To address these issues, a novel paradigm – Federated Learning (FL) (McMahan et al., 2017; Kairouz et al., 2021) – has been proposed. Informally, the main idea of FL is to train a model locally on an individual’s device, rather than revealing their data, while communicating the model updates using private and secure protocols.

While the original goal of FL was to search for a single model to be deployed on each device, this objective has been recently questioned. As the distribution of user data can vary greatly across devices, a single model might not serve all devices simultaneously (Hard et al., 2018). Consequently, data heterogeneity has become the main challenge in the search for efficient federated learning models. Recently, a range of personalized FL approaches has been proposed to address data heterogeneity (Kulkarni et al., 2020), wherein different local models are used to fit user-specific data while also capturing the common knowledge distilled from data of other devices.

Since the motivation and goals of each of these personalized approaches vary greatly, examining them separately can only provide us with an understanding of a specific model. Fortunately, many personalized FL models in the literature are trained by minimizing a specially structured optimization program. In this paper, we analyze the general properties of such an optimization program, which in turn provides us with high-level principles for training personalized FL models. We aim to solve the following optimization problem

\min_{w,\beta}\left\{F(w,\beta)\coloneqq\frac{1}{M}\sum^{M}_{m=1}f_{m}(w,\beta_{m})\right\}, (1)

where $w\in\mathbb{R}^{d_{0}}$ corresponds to the shared parameters, $\beta=(\beta_{1},\dots,\beta_{M})$ with $\beta_{m}\in\mathbb{R}^{d_{m}}$ for all $m\in[M]$ corresponds to the local parameters, $M$ is the number of devices, and $f_{m}:\mathbb{R}^{d_{0}+d_{m}}\rightarrow\mathbb{R}$ is the objective that depends on the local data at the $m$-th client.

By carefully designing the local loss $f_{m}(w,\beta_{m})$, the objective (1) recovers many existing personalized FL approaches as special cases; the local objective $f_{m}$ need not correspond to the empirical loss of a given model on the $m$-th device’s data (see Section 2 for details). Consequently, (1) serves as a unified objective for personalized FL. The primary goal of our work is to explore the problem (1) from an optimization perspective. By doing so, we develop a universal convex optimization theory that applies to many personalized FL approaches.
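To make the template concrete, the following minimal NumPy sketch (all names are hypothetical, and quadratic local losses are assumed purely for illustration) evaluates the unified objective (1) and its two gradient blocks on a toy instance; any special case from Section 2 is obtained by swapping in the corresponding local loss $f_{m}$.

import numpy as np

def make_quadratic_client(A, b):
    """Hypothetical local loss f_m(w, beta_m) = 0.5 * ||A [w; beta_m] - b||^2."""
    def loss(w, beta_m):
        r = A @ np.concatenate([w, beta_m]) - b
        return 0.5 * r @ r
    def grad(w, beta_m):
        g = A.T @ (A @ np.concatenate([w, beta_m]) - b)
        return g[: w.size], g[w.size:]      # (gradient w.r.t. w, gradient w.r.t. beta_m)
    return loss, grad

def objective(clients, w, betas):
    """F(w, beta) = (1/M) sum_m f_m(w, beta_m); cf. (1)."""
    return np.mean([loss(w, beta_m) for (loss, _), beta_m in zip(clients, betas)])

# Toy instance: M = 3 clients, d_0 = 2 shared and d_m = 1 local parameters.
rng = np.random.default_rng(0)
clients = [make_quadratic_client(rng.normal(size=(4, 3)), rng.normal(size=4))
           for _ in range(3)]
w, betas = np.zeros(2), [np.zeros(1) for _ in range(3)]
print(objective(clients, w, betas))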

1.1 Contributions

We outline the main contributions of this work.

Single personalized FL objective. We propose a single objective (1) capable of recovering many existing convex personalized FL approaches by carefully constructing the local loss $f_{m}(w,\beta_{m})$. Consequently, training different personalized FL models is equivalent to solving a particular instance of (1).

Recovering best-known complexity and novel guarantees. We develop algorithms for solving (1) and prove sharp convergence rates for strongly convex objectives. Specializing our rates from the general setting to the individual personalized FL objectives, we recover the best-known optimization guarantees from the literature or advance beyond the state of the art, with a single exception: objective (11) with $\lambda>L^{\prime}$. Therefore, our results render optimizers tailored to a specific personalized FL objective unnecessary in many cases.

Universal (convex) optimization methods and theory for personalized FL. To develop an optimization theory for solving (1), we impose particular assumptions on the objective: $\mu$-strong convexity of $F$ and convexity and $(L^{w},ML^{\beta})$-smoothness of $f_{m}$ for all $m\in[M]$ (see Assumptions 1 and 2). These assumptions are naturally satisfied for the vast majority of personalized FL objectives in the literature, with the exception of personalized FL approaches that are inherently nonconvex, such as MAML (Finn et al., 2017). Under these assumptions, we propose three algorithms for solving the general personalized FL objective (1): i) Local Stochastic Gradient Descent for Personalized FL (LSGD-PFL), ii) Accelerated Block Coordinate Descent for Personalized FL (ACD-PFL), and iii) Accelerated Stochastic Variance Reduced Coordinate Descent for Personalized FL (ASVRCD-PFL). The convergence rates of these methods are summarized in Table 1. We emphasize that these optimizers can be used to solve many (convex) personalized FL objectives from the literature by casting a given objective as a special case of (1), oftentimes matching or outperforming algorithms originally designed for the particular scenario.

Minimax optimal rates. We provide lower complexity bounds for solving (1). Using the construction of Hendrikx et al. (2021), we show that to solve (1), one requires at least ${\cal O}\left(\sqrt{L^{w}/\mu}\log\epsilon^{-1}\right)$ communication rounds. Note that communication is often the bottleneck when training distributed and personalized FL models. Furthermore, one needs at least ${\cal O}\left(\sqrt{L^{w}/\mu}\log\epsilon^{-1}\right)$ evaluations of $\nabla_{w}F$ and at least ${\cal O}\left(\sqrt{L^{\beta}/\mu}\log\epsilon^{-1}\right)$ evaluations of $\nabla_{\beta}F$. Given the $n$-finite sum structure of $f_{m}$ with $({\cal L}^{w},M{\cal L}^{\beta})$-smooth components, we show that one requires at least ${\cal O}\left(n+\sqrt{n{\cal L}^{w}/\mu}\log\epsilon^{-1}\right)$ stochastic gradient evaluations with respect to the $w$-parameters and at least ${\cal O}\left(n+\sqrt{n{\cal L}^{\beta}/\mu}\log\epsilon^{-1}\right)$ stochastic gradient evaluations with respect to the $\beta$-parameters. We show that ACD-PFL is always optimal in terms of communication and local computation when the full gradients are available, while ASVRCD-PFL can be optimal either in terms of the number of evaluations of the $w$-stochastic gradient or of the $\beta$-stochastic gradient. However, ASVRCD-PFL cannot achieve the optimal rate for both simultaneously, which we leave for future research.

Personalization and communication complexity. When a specific FL objective contains a parameter that determines the amount of personalization, we observe that the value of $\sqrt{L^{w}/\mu}$ is always a non-increasing function of this parameter. Since the communication complexity of (1) equals $\sqrt{L^{w}/\mu}$ up to constant and log factors, we conclude that personalization has a positive effect on the communication complexity of training an FL model. For instance, for the objective (11) the communication complexity scales as $\sqrt{\lambda/\mu^{\prime}}$, which decreases as the coupling parameter $\lambda$ decreases, i.e., as more personalization is allowed.

New personalized FL objectives. The universal personalized FL objective (1) enables us to obtain a range of novel personalized FL formulations as special cases. While we study various (parametric) extensions of known models, we believe that the objective (1) can lead to easier development of brand new objectives as well. However, we stress that proposing novel personalized FL models is not the main focus of our work; the paper’s primary focus is on providing universal optimization guarantees for (convex) personalized FL.

Despite the aforementioned benefits of our proposed unified framework, we acknowledge that this is neither the only nor the universally best approach for personalized federated learning. However, providing a general framework that can include many existing methods as special cases can help us gain a clear understanding and motivate us to propose new personalized methods.

Alg. | Communication | # $\nabla_{w}$ | # $\nabla_{\beta}$
LSGD-PFL | $\frac{\max\left(L^{\beta}\tau^{-1},L^{w}\right)}{\mu}+\frac{\sigma^{2}}{MB\tau\mu\epsilon}+\frac{1}{\mu}\sqrt{\frac{L^{w}(\zeta_{*}^{2}+\sigma^{2}B^{-1})}{\epsilon}}$ | $\frac{\max\left(L^{\beta},\tau L^{w}\right)}{\mu}+\frac{\sigma^{2}}{MB\mu\epsilon}+\frac{\tau}{\mu}\sqrt{\frac{L^{w}(\zeta_{*}^{2}+\sigma^{2}B^{-1})}{\epsilon}}$ | $\frac{\max\left(L^{\beta},\tau L^{w}\right)}{\mu}+\frac{\sigma^{2}}{MB\mu\epsilon}+\frac{\tau}{\mu}\sqrt{\frac{L^{w}(\zeta_{*}^{2}+\sigma^{2}B^{-1})}{\epsilon}}$
ACD-PFL | $\sqrt{L^{w}/\mu}$ | $\sqrt{L^{w}/\mu}$ | $\sqrt{L^{\beta}/\mu}$
ASVRCD-PFL | $n+\sqrt{n{\cal L}^{w}/\mu}$ | $n+\sqrt{n{\cal L}^{w}/\mu}$ | $n+\sqrt{n{\cal L}^{\beta}/\mu}$
Table 1: Complexity guarantees of the proposed methods, ignoring constant and log factors. $\#\nabla_{w}/\#\nabla_{\beta}$: number of (stochastic) gradient calls with respect to the $w$/$\beta$-parameters. A marked entry indicates the minimax optimal complexity. Local Stochastic Gradient Descent (LSGD): local access to $B$-minibatches of stochastic gradients, each with $\sigma^{2}$-bounded variance; each device takes $(\tau-1)$ local steps in between the communication rounds. Accelerated Coordinate Descent (ACD): access to the full local gradient, yielding both the optimal communication complexity and the optimal computational complexity (in terms of both $\nabla_{w}$ and $\nabla_{\beta}$). ASVRCD: assuming that $f_{m}$ is an $n$-finite sum, the oracle provides access to a single stochastic gradient with respect to that sum; the corresponding local computation is optimal either with respect to $\nabla_{w}$ or with respect to $\nabla_{\beta}$. Achieving both optimal rates simultaneously remains an open problem.

1.2 Assumptions and Notations

Complexity Notations. For two sequences $\{a_{n}\}$ and $\{b_{n}\}$, $a_{n}={\cal O}(b_{n})$ if there exists $C>0$ such that $|a_{n}/b_{n}|\leq C$ for all $n$ large enough; $a_{n}=\Theta(b_{n})$ if $a_{n}={\cal O}(b_{n})$ and $b_{n}={\cal O}(a_{n})$ simultaneously. Similarly, $a_{n}=\tilde{\cal O}(b_{n})$ if $a_{n}={\cal O}(b_{n}\log^{k}b_{n})$ for some $k\geq 0$; $a_{n}=\tilde{\Theta}(b_{n})$ if $a_{n}=\tilde{\cal O}(b_{n})$ and $b_{n}=\tilde{\cal O}(a_{n})$ simultaneously.

Local Objective. We consider three different ways to access the gradient of the local objective $f_{m}$. The first and simplest case corresponds to having access to the full gradient of $f_{m}$ with respect to either $w$ or $\beta_{m}$ for all $m\in[M]$ simultaneously. The second case corresponds to a situation where $\nabla f_{m}(w,\beta_{m})$ is itself an expectation, i.e.,

\nabla f_{m}(w,\beta_{m})=\mathbb{E}_{\xi\in{\cal D}_{m}}\left[\nabla\hat{f}_{m}(w,\beta_{m};\xi)\right], (2)

while having access to stochastic gradients with respect to either $w$ or $\beta_{m}$ simultaneously for all $m\in[M]$, where $\hat{f}_{m}$ represents the loss function on a single data point. The third case corresponds to a finite-sum $f_{m}$:

f_{m}(w,\beta_{m})=\frac{1}{n}\sum_{i=1}^{n}f_{m,i}(w,\beta_{m}), (3)

with access to $\nabla_{w}f_{m,i}(w,\beta_{m})$ or to $\nabla_{\beta}f_{m,i}(w,\beta_{m})$ for all $m\in[M]$ and $i\in[n]$ selected uniformly at random.
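As an illustration of the last two access modes, the sketch below (hypothetical helper names, assuming the finite-sum structure (3) with per-component gradient routines) contrasts the full gradient of $f_{m}$ with an unbiased stochastic gradient obtained by sampling a single component uniformly at random.

import numpy as np

def full_grad(component_grads, w, beta_m):
    """Full gradient of f_m = (1/n) sum_i f_{m,i}; cf. (3)."""
    gws, gbs = zip(*(g(w, beta_m) for g in component_grads))
    return np.mean(gws, axis=0), np.mean(gbs, axis=0)

def stochastic_grad(component_grads, w, beta_m, rng):
    """Unbiased stochastic gradient: one component i in [n] sampled uniformly."""
    i = rng.integers(len(component_grads))
    return component_grads[i](w, beta_m)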

Assumptions. We argue that the objective (1) is capable of recovering virtually any (convex) personalized FL objective. Since the structure of the individual personalized FL objectives varies greatly, it is important to impose reasonable assumptions on the problem (1) in order to obtain meaningful rates in the special cases.

Assumption 1.

The function $F(w,\beta)$ is jointly $\mu$-strongly convex for $\mu\geq 0$, while for all $m\in[M]$, the function $f_{m}(w,\beta_{m})$ is jointly convex, $L^{w}$-smooth w.r.t. the parameter $w$ and $(ML^{\beta})$-smooth w.r.t. the parameter $\beta_{m}$. In the case when $\mu=0$, assume additionally that (1) has a unique solution, denoted by $w^{\star}$ and $\beta^{\star}=(\beta^{\star}_{1},\dots,\beta^{\star}_{M})$.

When $f_{m}$ is a finite sum (3), we require smoothness of the finite-sum components.

Assumption 2.

Suppose that for all $m\in[M]$ and $i\in[n]$, the function $f_{m,i}(w,\beta_{m})$ is jointly convex, ${\cal L}^{w}$-smooth w.r.t. the parameter $w$ and $(M{\cal L}^{\beta})$-smooth w.r.t. the parameter $\beta_{m}$. (It is easy to see that ${\cal L}^{w}\geq L^{w}\geq{\cal L}^{w}/n$ and ${\cal L}^{\beta}\geq L^{\beta}\geq{\cal L}^{\beta}/n$.)

In Section 2 we justify Assumptions 1 and 2 and characterize the constants $\mu$, $L^{w}$, $L^{\beta}$, ${\cal L}^{w}$, ${\cal L}^{\beta}$ for special cases of (1). Table 2 summarizes these parameters.

Price of generality. Since Assumption 1 is the only structural assumption we impose on (1), one cannot hope to recover the minimax optimal rates, i.e., the rates that match the lower complexity bounds, for all individual personalized FL objectives as special cases of our general guarantees. Any given instance of (1) has additional structure that is not captured by Assumption 1 but can be exploited by an optimization algorithm to improve either communication or local computation. Therefore, our convergence guarantees are optimal in light of Assumption 1 only. Despite this, our general rates specialize surprisingly well, as we show in Section 2: our complexities are state-of-the-art in all scenarios with a single exception, the communication/computation complexity of (11).

Individual treatment of $w$ and $\beta$. Throughout this work, we allow different smoothness of the objective with respect to the global parameters $w$ and the local parameters $\beta$. At the same time, our algorithms are allowed to exploit separate access to the gradients with respect to $w$ and $\beta$, given that these gradients can be efficiently computed separately. Without such a distinction, one cannot hope for a communication complexity better than $\Theta\left(\max\{L^{w},L^{\beta}\}/\mu\log\epsilon^{-1}\right)$, which is suboptimal in the special cases. Similarly, the computational guarantees would be suboptimal as well. See Section 2 for more details.

Data heterogeneity. While the convergence rate of LSGD-PFL relies on data heterogeneity (See Theorem 1), we allow for an arbitrary dissimilarity among the individual clients for analyzing ACD-PFL and ASVRCD-PFL (see Theorem 7 and Theorem 8). Our experimental results also support that ASCD-PFL (ACD-PFL with stochastic gradient to reduce computation) and ASVRCD-PFL are more robust to data heterogeneity compared to the widely used Local SGD.

The rest of the paper is organized as follows. In Section 2, we show how (1) can be used to recover various personalized federated learning objectives from the literature. In Section 3, we propose a local-SGD based algorithm, LSGD-PFL, for solving (1), and establish computational upper bounds for it in the strongly convex, weakly convex, and nonconvex cases. In Section 4, we discuss minimax optimal algorithms for solving (1): we first give minimax lower bounds on the number of communication rounds, the number of gradient evaluations with respect to the global parameters, and the number of gradient evaluations with respect to the local parameters, and then propose two coordinate-descent based algorithms, ACD-PFL and ASVRCD-PFL, that match these lower bounds. In Sections 5 and 6, we use experiments on synthetic and real data to illustrate the performance of the proposed algorithms and to empirically validate the theory. Finally, we conclude the paper in Section 7. Technical proofs are deferred to the Appendix.

2 Personalized FL objectives

We recover a range of known personalized FL approaches as special cases of (1). In this section, we detail the optimization challenges that arise in each of the special cases. We discuss the relation to our results, particularly focusing on how Assumptions 1 and 2 and our general rates (presented in Sections 3 and 4) behave in the special cases. Table 2 presents the smoothness and strong convexity constants with respect to (1) for the special cases, while Table 3 provides the corresponding convergence rates of our methods when applied to these specific objectives. For convenience, define

F_{i}(w,\beta)\coloneqq\frac{1}{M}\sum_{m=1}^{M}f_{m,i}(w,\beta_{m})

in the case when the functions $f_{m}$ have the finite-sum structure (3).

$F(w,\beta)$ | $\mu$ | $L^{w}$ | $L^{\beta}$ | ${\cal L}^{w}$ | ${\cal L}^{\beta}$ | Rate?
Traditional FL (4) (McMahan et al., 2017) | $\mu^{\prime}$ | $L^{\prime}$ | 0 | ${\cal L}^{\prime}$ | 0 | recovered
Fully Personalized FL (5) | $\frac{\mu^{\prime}}{M}$ | 0 | $\frac{L^{\prime}}{M}$ | 0 | $\frac{{\cal L}^{\prime}}{M}$ | recovered
MT2 (8) (Li et al., 2020) | $\frac{\lambda}{2M}$ | $\frac{\Lambda L^{\prime}+\lambda}{2M}$ | $\frac{L^{\prime}+\lambda}{2M}$ | $\frac{\Lambda{\cal L}^{\prime}+\lambda}{2M}$ | $\frac{{\cal L}^{\prime}+\lambda}{2M}$ | new
MX2 (11) (Smith et al., 2017) | $\frac{\mu^{\prime}}{3M}$ | $\frac{\lambda}{M}$ | $\frac{L^{\prime}+\lambda}{M}$ | $\frac{\lambda}{M}$ | $\frac{{\cal L}^{\prime}+\lambda}{M}$ | recovered$^{\spadesuit}$
APFL2 (14) (Deng et al., 2020) | $\frac{\mu^{\prime}(1-\alpha_{\max})^{2}}{M}$ | $\frac{(\Lambda+\alpha_{\max}^{2})L^{\prime}}{M}$ | $\frac{(1-\alpha_{\min})^{2}L^{\prime}}{M}$ | $\frac{(\Lambda+\alpha_{\max}^{2}){\cal L}^{\prime}}{M}$ | $\frac{(1-\alpha_{\min})^{2}{\cal L}^{\prime}}{M}$ | new
WS2 (16) (Liang et al., 2020) | $\mu^{\prime}$ | $L^{\prime}$ | $L^{\prime}$ | ${\cal L}^{\prime}$ | ${\cal L}^{\prime}$ | new
Fed Residual (18) (Agarwal et al., 2020) | $\mu$ | $L^{w}_{R}$ | $L^{\beta}_{R}$ | ${\cal L}^{w}_{R}$ | ${\cal L}^{\beta}_{R}$ | new
Table 2: Parameters in Assumptions 1 and 2 for the personalized FL objectives, with a note about the rate: we either recover the best-known rate for a given objective, or provide a novel rate that is the best under the given assumptions. Rates marked "new" correspond to novel personalized FL objectives (extensions of known ones). $\spadesuit$: the best-known communication complexity is recovered only for $\lambda={\cal O}(L^{\prime})$.
$F(w,\beta)$ | # Comm | # $\nabla_{w}F$ | # $\nabla_{\beta}F$ | # $\nabla_{w}F_{i}$ | # $\nabla_{\beta}F_{i}$
Traditional FL (4) (McMahan et al., 2017) | $\sqrt{\frac{L^{\prime}}{\mu^{\prime}}}$ | $\sqrt{\frac{L^{\prime}}{\mu^{\prime}}}$ | 0 | $n+\sqrt{\frac{n{\cal L}^{\prime}}{\mu^{\prime}}}$ | 0
Fully Personalized FL (5) | 0 | 0 | $\sqrt{\frac{L^{\prime}}{\mu^{\prime}}}$ | 0 | $n+\sqrt{\frac{n{\cal L}^{\prime}}{\mu^{\prime}}}$
MT2 (8) (Li et al., 2020) | $\sqrt{\frac{\Lambda L^{\prime}}{\lambda}}$ | $\sqrt{\frac{\Lambda L^{\prime}}{\lambda}}$ | $\sqrt{\frac{L^{\prime}}{\lambda}}$ | $n+\sqrt{\frac{n\Lambda{\cal L}^{\prime}}{\lambda}}$ | $n+\sqrt{\frac{n{\cal L}^{\prime}}{\lambda}}$
MX2 (11) (Smith et al., 2017) | $\sqrt{\frac{\lambda}{\mu^{\prime}}}$ | $\sqrt{\frac{\lambda}{\mu^{\prime}}}$ | $\sqrt{\frac{L^{\prime}+\lambda}{\mu^{\prime}}}$ | - | $n+\sqrt{\frac{n({\cal L}^{\prime}+\lambda)}{\mu^{\prime}}}$
APFL2 (14) (Deng et al., 2020) | $\sqrt{\frac{(\Lambda+\alpha_{\max}^{2})L^{\prime}}{(1-\alpha_{\max})^{2}\mu^{\prime}}}$ | $\sqrt{\frac{(\Lambda+\alpha_{\max}^{2})L^{\prime}}{(1-\alpha_{\max})^{2}\mu^{\prime}}}$ | $\sqrt{\frac{(1-\alpha_{\min})^{2}L^{\prime}}{(1-\alpha_{\max})^{2}\mu^{\prime}}}$ | $n+\sqrt{\frac{n(\Lambda+\alpha_{\max}^{2}){\cal L}^{\prime}}{(1-\alpha_{\max})^{2}\mu^{\prime}}}$ | $n+\sqrt{\frac{n(1-\alpha_{\min})^{2}{\cal L}^{\prime}}{(1-\alpha_{\max})^{2}\mu^{\prime}}}$
WS2 (16) (Liang et al., 2020) | $\sqrt{\frac{L^{\prime}}{\mu^{\prime}}}$ | $\sqrt{\frac{L^{\prime}}{\mu^{\prime}}}$ | $\sqrt{\frac{L^{\prime}}{\mu^{\prime}}}$ | $n+\sqrt{\frac{n{\cal L}^{\prime}}{\mu^{\prime}}}$ | $n+\sqrt{\frac{n{\cal L}^{\prime}}{\mu^{\prime}}}$
Fed Residual (18) (Agarwal et al., 2020) | $\sqrt{\frac{L^{w}_{R}}{\mu}}$ | $\sqrt{\frac{L^{w}_{R}}{\mu}}$ | $\sqrt{\frac{L^{\beta}_{R}}{\mu}}$ | $n+\sqrt{\frac{n{\cal L}^{w}_{R}}{\mu^{\prime}}}$ | $n+\sqrt{\frac{n{\cal L}^{\beta}_{R}}{\mu^{\prime}}}$
Table 3: Complexity of solving the personalized FL objectives by Algorithm 2 (second, third, and fourth columns) and Algorithm 3 (fifth and sixth columns). Constant and log factors are ignored.

2.1 Traditional FL

The traditional, non-personalized FL objective (McMahan et al., 2017) is given as

\min_{w\in\mathbb{R}^{d}}F^{\prime}(w)\coloneqq\frac{1}{M}\sum_{m=1}^{M}f_{m}^{\prime}(w), (4)

where $f^{\prime}_{m}$ corresponds to the loss on the $m$-th client’s data. Assume that $f^{\prime}_{m}$ is $L^{\prime}$-smooth and $\mu^{\prime}$-strongly convex for all $m\in[M]$. The minimax optimal communication complexity to solve (4) up to an $\epsilon$-neighborhood of the optimum is $\tilde{\Theta}\left(\sqrt{L^{\prime}/\mu^{\prime}}\log\epsilon^{-1}\right)$ (Scaman et al., 2018). When $f_{m}^{\prime}=\frac{1}{n}\sum_{j=1}^{n}f_{m,j}^{\prime}(w)$ is an $n$-finite sum with convex and ${\cal L}^{\prime}$-smooth components, the minimax optimal local stochastic gradient complexity is $\tilde{\Theta}\left(\left(n+\sqrt{n{\cal L}^{\prime}/\mu^{\prime}}\right)\log\epsilon^{-1}\right)$ (Hendrikx et al., 2021). The FL objective (4) is a special case of (1) with $d_{1}=\dots=d_{M}=0$, and our theory recovers the aforementioned rates.

2.2 Fully Personalized FL

At the other end of the spectrum lies fully personalized FL, where the $m$-th client trains its own model without any influence from other clients:

\min_{\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d}}F_{full}(\beta)\coloneqq\frac{1}{M}\sum_{m=1}^{M}f_{m}^{\prime}(\beta_{m}). (5)

The above objective is a special case of (1) with $d_{0}=0$. As the objective is separable in $\beta_{1},\dots,\beta_{M}$, no communication is required to train it. At the same time, we need $\tilde{\Theta}\left(\left(n+\sqrt{n{\cal L}^{\prime}/\mu^{\prime}}\right)\log\epsilon^{-1}\right)$ local stochastic oracle calls to solve it (Lan & Zhou, 2018), which is what our algorithms achieve.

2.3 Multi-Task FL of Li et al. (2020)

The objective is given as

\min_{\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d}}F_{MT}(\beta)=\frac{1}{M}\sum_{m=1}^{M}\left(f^{\prime}_{m}(\beta_{m})+\frac{\lambda}{2}\|\beta_{m}-(w^{\prime})^{*}\|^{2}\right), (6)

where $(w^{\prime})^{*}$ is a solution of the traditional FL problem (4) and $\lambda\geq 0$. Assuming that $(w^{\prime})^{*}$ is known (as Li et al. (2020) do), the problem (6) is a particular instance of (5); thus our approach achieves the optimal complexity.

A more challenging objective (in terms of optimization) is the following relaxed version of (6):

\min_{w,\beta}\frac{1}{M}\sum_{m=1}^{M}\left(\Lambda f^{\prime}_{m}(w)+f^{\prime}_{m}(\beta_{m})+\lambda\|w-\beta_{m}\|^{2}\right), (7)

where $\Lambda\geq 0$ is the relaxation parameter; the original objective is recovered for $\Lambda\rightarrow\infty$. Note that in the limit $\Lambda\rightarrow\infty$, finding a minimax optimal method for the optimization of (6) is straightforward. First, one computes a minimizer $(w^{\prime})^{*}$ of the classical FL objective (4), which can be done with minimax optimal complexity. Next, one computes the local solution $\beta_{m}^{*}=\operatorname*{arg\,min}_{\beta_{m}\in\mathbb{R}^{d}}f^{\prime}_{m}(\beta_{m})+\lambda\|(w^{\prime})^{*}-\beta_{m}\|^{2}$, which only depends on the local data and thus can also be computed with a minimax optimal algorithm.

A more interesting scenario is obtained when we do not let $\Lambda\rightarrow\infty$ in (7), but rather consider a finite $\Lambda>0$ that is sufficiently large. To obtain the right smoothness/strong convexity parameters (according to Assumption 1), we scale the global parameter $w$ by a factor of $M^{-\frac{1}{2}}$ and arrive at the following objective:

\min_{w,\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d}}F_{MT2}(w,\beta)=\frac{1}{M}\sum_{m=1}^{M}f_{m}(w,\beta_{m}), (8)

where

f_{m}(w,\beta_{m})=\Lambda f^{\prime}_{m}(M^{-\frac{1}{2}}w)+f^{\prime}_{m}(\beta_{m})+\frac{\lambda}{2}\|\beta_{m}-M^{-\frac{1}{2}}w\|^{2}.

The next lemma determines the parameters $\mu,L^{w},L^{\beta},{\cal L}^{w},{\cal L}^{\beta}$ in Assumptions 1 and 2. See the proof in Appendix B.1.

Lemma 1.

Let $\Lambda\geq 3\lambda/(2\mu^{\prime})$. Then the objective (8) is jointly $(\lambda/(2M))$-strongly convex, while the function $f_{m}$ is jointly convex, $((\Lambda L^{\prime}+\lambda)/M)$-smooth w.r.t. $w$ and $(L^{\prime}+\lambda)$-smooth w.r.t. $\beta_{m}$. Similarly, the function $f_{m,j}$ is jointly convex, $((\Lambda{\cal L}^{\prime}+\lambda)/M)$-smooth w.r.t. $w$ and $({\cal L}^{\prime}+\lambda)$-smooth w.r.t. $\beta_{m}$.

Evaluating gradients. Note that evaluating $\nabla_{w}f_{m}(w,\beta_{m})$ under the objective (8) can be perfectly decoupled from evaluating $\nabla_{\beta}f_{m}(w,\beta_{m})$. Therefore, we can make full use of our theory and take advantage of the different complexities w.r.t. $\nabla_{w}$ and $\nabla_{\beta}$. The resulting communication and computation complexities for solving (8) are presented in Table 3.
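As an illustration of this decoupling, under (8) each gradient block of $f_{m}$ is computed by its own routine that touches $\nabla f^{\prime}_{m}$ at a different point; the sketch below (hypothetical names, assuming a gradient oracle for $f^{\prime}_{m}$) follows directly from differentiating the definition of $f_{m}$.

import numpy as np

def grad_w_mt2(grad_fm_prime, w, beta_m, M, Lam, lam):
    """w-block gradient of f_m in (8): needs grad f'_m only at the rescaled point M^{-1/2} w."""
    u = w / np.sqrt(M)
    return (Lam * grad_fm_prime(u) - lam * (beta_m - u)) / np.sqrt(M)

def grad_beta_mt2(grad_fm_prime, w, beta_m, M, lam):
    """beta-block gradient of f_m in (8): needs grad f'_m only at beta_m and is independent of Lambda."""
    u = w / np.sqrt(M)
    return grad_fm_prime(beta_m) + lam * (beta_m - u)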

2.4 Multi-Task Personalized FL and Implicit MAML

In its simplest form, the multi-task personalized objective (Smith et al., 2017; Wang et al., 2018) is given as

\min_{\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d}}F_{MX}(\beta)=\frac{1}{M}\sum_{m=1}^{M}f_{m}^{\prime}(\beta_{m})+\frac{\lambda}{2M}\sum_{m=1}^{M}\|\bar{\beta}-\beta_{m}\|^{2}, (9)

where $\bar{\beta}\coloneqq\frac{1}{M}\sum_{m=1}^{M}\beta_{m}$ and $\lambda\geq 0$ (Hanzely & Richtárik, 2020). On the other hand, the goal of implicit MAML (Rajeswaran et al., 2019; Dinh et al., 2020) is to minimize

\min_{w\in\mathbb{R}^{d}}F_{ME}(w)=\frac{1}{M}\sum_{m=1}^{M}\left(\min_{\beta_{m}\in\mathbb{R}^{d}}\left(f^{\prime}_{m}(\beta_{m})+\frac{\lambda}{2}\|w-\beta_{m}\|^{2}\right)\right). (10)

By reparametrizing (1), we can recover an objective that is simultaneously equivalent to both (9) and (10). In particular, by setting

f_{m}(w,\beta_{m})=f^{\prime}_{m}(\beta_{m})+\frac{\lambda}{2}\|M^{-\frac{1}{2}}w-\beta_{m}\|^{2},

the objective (1) becomes

\min_{w,\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d}}F_{MX2}(w,\beta)\coloneqq\frac{1}{M}\sum_{m=1}^{M}f_{m}^{\prime}(\beta_{m})+\frac{\lambda}{2M}\sum_{m=1}^{M}\|M^{-\frac{1}{2}}w-\beta_{m}\|^{2}. (11)

It is a simple exercise to verify the equivalence of (11) to both (9) and (10). (To the best of our knowledge, we are the first to notice the equivalence of (9) and (10).) Indeed, we can always minimize (11) in $w$, arriving at $w^{*}=M^{\frac{1}{2}}\bar{\beta}$ and thus recovering the solution of (9). Similarly, by minimizing (11) in $\beta$ we arrive at (10).
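For completeness, the elimination step behind this claim is a one-line computation (a sketch of the $w$-minimization in (11) with $\beta$ fixed):

\nabla_{w}F_{MX2}(w,\beta)=\frac{\lambda}{M}\sum_{m=1}^{M}M^{-\frac{1}{2}}\left(M^{-\frac{1}{2}}w-\beta_{m}\right)=\frac{\lambda}{M}\left(w-M^{\frac{1}{2}}\bar{\beta}\right)=0\quad\Longleftrightarrow\quad w^{*}=M^{\frac{1}{2}}\bar{\beta},

and substituting $w^{*}$ back into (11) yields exactly (9).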

Next, we establish the parameters in Assumptions 1 and 2.

Lemma 2.

Let $\mu^{\prime}\leq\lambda/2$. Then the objective (11) is jointly $(\mu^{\prime}/(3M))$-strongly convex, while $f_{m}$ is $(\lambda/M)$-smooth w.r.t. $w$ and $(L^{\prime}+\lambda)$-smooth w.r.t. $\beta$. The function

f_{m,i}(w,\beta_{m})=f^{\prime}_{m,i}(\beta_{m})+(\lambda/2)\|M^{-\frac{1}{2}}w-\beta_{m}\|^{2}

is jointly convex, $(\lambda/M)$-smooth w.r.t. $w$ and $({\cal L}^{\prime}+\lambda)$-smooth w.r.t. $\beta$.

The proof is given in Appendix B.2. Hanzely et al. (2020a) showed that the minimax optimal communication complexity to solve (9) (and therefore to solve (10) and (11)) is $\Theta\left(\sqrt{\min(L^{\prime},\lambda)/\mu^{\prime}}\log\epsilon^{-1}\right)$. Furthermore, they showed that the minimax optimal number of gradients w.r.t. $f^{\prime}$ is $\tilde{\Theta}\left(\sqrt{L^{\prime}/\mu^{\prime}}\log\epsilon^{-1}\right)$ and proposed a method with complexity $\Theta\left(\left(n+\sqrt{n({\cal L}^{\prime}+\lambda)/\mu^{\prime}}\right)\log\epsilon^{-1}\right)$ in terms of the number of $f^{\prime}_{m,j}$-gradients. We match the aforementioned communication guarantees when $\lambda={\cal O}(L^{\prime})$ and the computation guarantees when $L^{\prime}={\cal O}(\lambda)$. Furthermore, when $\lambda={\cal O}(L^{\prime})$, our complexity guarantees are strictly better than the guarantees for solving the implicit MAML objective (10) directly (Rajeswaran et al., 2019; Dinh et al., 2020).

2.5 Adaptive Personalized FL (Deng et al., 2020)

The objective is given as

\min_{\beta_{1},\dots,\beta_{M}}F_{APFL}(\beta)=\frac{1}{M}\sum_{m=1}^{M}f^{\prime}_{m}\left((1-\alpha_{m})\beta_{m}+\alpha_{m}(w^{\prime})^{*}\right), (12)

where $(w^{\prime})^{*}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}F^{\prime}(w)$ is a solution to (4) and $0<\alpha_{1},\dots,\alpha_{M}<1$. Assuming that $(w^{\prime})^{*}$ is known, as was done in Deng et al. (2020), the problem (12) is an instance of (5); thus our approach achieves the optimal complexity.

A more interesting case (in terms of optimization) arises when considering a relaxed variant of (12), given as

\min_{w,\beta}\frac{1}{M}\sum_{m=1}^{M}\left(\Lambda f^{\prime}_{m}(w)+f_{m}^{\prime}((1-\alpha_{m})\beta_{m}+\alpha_{m}w)\right), (13)

where $\Lambda\geq 0$ is the relaxation parameter that recovers the original objective when $\Lambda\rightarrow\infty$. Such a choice, alongside the usual rescaling of the parameter $w$, results in the following objective:

\min_{w,\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d}}F_{APFL2}(w,\beta)\coloneqq\frac{1}{M}\sum_{m=1}^{M}f_{m}(w,\beta_{m}), (14)

where

f_{m}(w,\beta_{m})=\Lambda f^{\prime}_{m}(M^{-\frac{1}{2}}w)+f^{\prime}_{m}((1-\alpha_{m})\beta_{m}+\alpha_{m}M^{-\frac{1}{2}}w).
Lemma 3.

Let $\alpha_{\min}\coloneqq\min_{1\leq m\leq M}\alpha_{m}$ and $\alpha_{\max}\coloneqq\max_{1\leq m\leq M}\alpha_{m}$. If

\Lambda\geq\max_{1\leq m\leq M}\left(3\alpha_{m}^{2}+(1-\alpha_{m})^{2}/2\right),

then the function $F_{APFL2}$ is jointly $(\mu^{\prime}(1-\alpha_{\max})^{2}/M)$-strongly convex, $((\Lambda+\alpha_{\max}^{2})L^{\prime}/M)$-smooth w.r.t. $w$ and $((1-\alpha_{\min})^{2}L^{\prime}/M)$-smooth w.r.t. $\beta$.

The proof is given in Appendix B.3.

2.6 Personalized FL with Explicit Weight Sharing

The most typical example of the weight-sharing setting is when the parameters $w,\beta$ correspond to different layers of the same neural network. For example, $\beta_{1},\dots,\beta_{M}$ could be the weights of the first few layers of a neural network, while $w$ contains the weights of the remaining layers (Liang et al., 2020). Alternatively, each of $\beta_{1},\dots,\beta_{M}$ can correspond to the weights of the last few layers, while the remaining weights are included in the global parameter $w$ (Arivazhagan et al., 2019). Overall, we can write the objective as follows:

\min_{w\in\mathbb{R}^{d_{w}},\,\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d_{\beta}}}F_{WS}(w,\beta)=\frac{1}{M}\sum_{m=1}^{M}f^{\prime}_{m}([w,\beta_{m}]), (15)

where $d_{w}+d_{\beta}=d$. Using an equivalent reparameterization of the $w$-space, we aim to minimize

\min_{w\in\mathbb{R}^{d_{w}},\,\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d_{\beta}}}F_{WS2}(w,\beta)=\frac{1}{M}\sum_{m=1}^{M}f^{\prime}_{m}([M^{-\frac{1}{2}}w,\beta_{m}]), (16)

which is an instance of (1) with $f_{m}(w,\beta_{m})=f^{\prime}_{m}([M^{-\frac{1}{2}}w,\beta_{m}])$.

Lemma 4.

The function $F_{WS2}$ is jointly $\mu^{\prime}$-strongly convex, $(L^{\prime}/M)$-smooth w.r.t. $w$ and $L^{\prime}$-smooth w.r.t. $\beta$. Similarly, the function $f_{m}$ is jointly convex, $({\cal L}^{\prime}/M)$-smooth w.r.t. $w$ and ${\cal L}^{\prime}$-smooth w.r.t. $\beta$.

The proof is straightforward and therefore omitted. A distinctive characteristic of the explicit weight-sharing paradigm is that evaluating a gradient w.r.t. the $w$-parameters automatically grants either free or highly cost-effective access to the gradient w.r.t. the $\beta$-parameters (and vice versa).
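For intuition, the following sketch (hypothetical names, assuming a gradient oracle for $f^{\prime}_{m}$) shows that under (16) a single oracle call yields both blocks: one evaluates $\nabla f^{\prime}_{m}$ at the concatenated point and slices the result.

import numpy as np

def both_blocks_ws2(grad_fm_prime, w, beta_m, M):
    """Both gradient blocks of f_m(w, beta_m) = f'_m([M^{-1/2} w, beta_m]) from one oracle call; cf. (16)."""
    g = grad_fm_prime(np.concatenate([w / np.sqrt(M), beta_m]))
    # chain rule: the w-block picks up the extra M^{-1/2} factor from the rescaling
    return g[: w.size] / np.sqrt(M), g[w.size:]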

2.7 Federated Residual Learning (Agarwal et al., 2020)

Agarwal et al. (2020) proposed federated residual learning:

\min_{w\in\mathbb{R}^{d_{w}},\,\beta_{1},\dots,\beta_{M}\in\mathbb{R}^{d_{\beta}}}F_{R}(w,\beta)=\frac{1}{M}\sum_{m=1}^{M}l_{m}(A^{w}(w,x^{w}_{m}),A^{\beta}(\beta_{m},x^{\beta}_{m})), (17)

where $(x^{w}_{m},x^{\beta}_{m})$ is a local feature vector (there may be an overlap between $x^{w}_{m}$ and $x^{\beta}_{m}$), $A^{w}(w,x^{w}_{m})$ represents the model prediction using the global parameters/features, $A^{\beta}(\beta_{m},x^{\beta}_{m})$ denotes the model prediction using the local parameters/features, and $l(\cdot,\cdot)$ is a loss function. Clearly, we can recover (17) with

f_{m}(w,\beta_{m})=l(A^{w}(M^{-\frac{1}{2}}w,x^{w}_{m}),A^{\beta}(\beta_{m},x^{\beta}_{m})). (18)

Unlike for the other objectives, here we cannot relate the constants $\mu^{\prime},L^{\prime},{\cal L}^{\prime}$ to $F_{R}$, since we do not write $f_{m}$ as a function of $f^{\prime}_{m}$. However, it seems natural to assume $L^{w}_{R}$- (or $L^{\beta}_{R}$-) smoothness of $l(A^{w}(w,x^{w}_{m}),a^{\beta}_{m})$ (or $l(a^{w}_{m},A^{\beta}(\beta_{m},x^{\beta}_{m}))$) as a function of $w$ (or $\beta$) for any $a^{\beta}_{m},x^{\beta}_{m},a^{w}_{m},x^{w}_{m}$. We define ${\cal L}^{w}_{R}$, ${\cal L}^{\beta}_{R}$ analogously, given that $l$ has an $n$-finite sum structure. Assuming, furthermore, that $F$ is $\mu$-strongly convex and $f_{m}$ is convex (for each $m\in[M]$), we can apply our theory.

2.8 MAML Based Approaches

Meta-learning has recently been employed for personalization (Chen et al., 2018; Khodak et al., 2019; Jiang et al., 2019; Fallah et al., 2020; Lin et al., 2020). Notably, the model-agnostic meta-learning (MAML) (Finn et al., 2017) based personalized FL objective is given as

\min_{w\in\mathbb{R}^{d}}F_{MAML}(w)=\frac{1}{M}\sum_{m=1}^{M}f^{\prime}_{m}(w-\alpha\nabla f^{\prime}_{m}(w)). (19)

Although we can recover (19) as a special case of (1) by setting $f_{m}(w,\beta_{m})=f^{\prime}_{m}(w-\alpha\nabla f^{\prime}_{m}(w))$, our (convex) convergence theory does not apply due to the inherently non-convex structure of (19). Specifically, the objective $F_{MAML}$ is non-convex even if the function $f^{\prime}_{m}$ is convex. In this scenario, only our non-convex rates for Local SGD apply.

3 Local SGD

The most popular optimizer for training non-personalized FL models is local SGD/FedAvg (McMahan et al., 2016; Stich, 2019). We devise a local SGD variant tailored to the personalized FL objective (1), which we call LSGD-PFL; see the detailed description in Algorithm 1. Specifically, LSGD-PFL can be seen as a combination of local SGD applied to the global parameters $w$ and SGD applied to the local parameters $\beta$. To mimic the non-personalized setup of local SGD, we assume access to the local objective $f_{m}(w,\beta_{m})$ in the form of an unbiased stochastic gradient with bounded variance.

Algorithm 1 LSGD-PFL
Require: Stepsizes $\{\eta_{k}\}_{k\geq 0}$, starting points $w^{0}\in\mathbb{R}^{d_{0}}$ and $\beta^{0}_{m}\in\mathbb{R}^{d_{m}}$ for all $m\in[M]$, communication period $\tau$.
  for $k=0,1,2,\dots$ do
     if $k\text{ mod }\tau=0$ then
        Send all $w^{k}_{m}$’s to the server, let $w^{k}=\frac{1}{M}\sum^{M}_{m=1}w^{k}_{m}$
        Send $w^{k}$ to each device, set $w^{k}_{m}=w^{k}$ for all $m\in[M]$
     end if
     for $m=1,2,\dots,M$ in parallel do
        Sample $\xi^{k}_{1,m},\dots,\xi^{k}_{B,m}\sim{\cal D}_{m}$ independently
        Compute $g^{k}_{m}=\frac{1}{B}\sum_{j=1}^{B}\nabla\hat{f}_{m}(w^{k}_{m},\beta^{k}_{m};\xi^{k}_{j,m})$
        Update the iterates $(w^{k+1}_{m},\beta^{k+1}_{m})=(w^{k}_{m},\beta^{k}_{m})-\eta_{k}\,g^{k}_{m}$
     end for
  end for

Admittedly, LSGD-PFL was already proposed by Arivazhagan et al. (2019) and Liang et al. (2020) to solve a particular instance of (1); however, no optimization guarantees were provided. In contrast, we provide convergence guarantees for LSGD-PFL that recover the convergence rate of local SGD when $d_{1}=d_{2}=\dots=d_{M}=0$ and the rate of SGD when $d_{0}=0$. Next, we demonstrate that LSGD-PFL works best when applied to an objective with rescaled $w$-space, unlike what was proposed in the aforementioned papers.
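For concreteness, a minimal NumPy sketch of Algorithm 1 follows (hypothetical oracle interface; a constant stepsize and minibatch size $B=1$ are assumed for brevity).

import numpy as np

def lsgd_pfl(stoch_grads, w0, betas0, eta, tau, num_steps, rng):
    """Sketch of LSGD-PFL (Algorithm 1).

    stoch_grads[m](w, beta_m, rng) is assumed to return an unbiased stochastic
    gradient pair (g_w, g_beta) of f_m at (w, beta_m).
    """
    M = len(stoch_grads)
    ws = [w0.copy() for _ in range(M)]       # local copies w_m^k
    betas = [b.copy() for b in betas0]       # local parameters beta_m^k
    for k in range(num_steps):
        if k % tau == 0:                     # communication round
            w_avg = np.mean(ws, axis=0)      # server averages the w_m's ...
            ws = [w_avg.copy() for _ in range(M)]   # ... and broadcasts the average
        for m in range(M):                   # one local step on every device
            g_w, g_b = stoch_grads[m](ws[m], betas[m], rng)
            ws[m] = ws[m] - eta * g_w
            betas[m] = betas[m] - eta * g_b
    return np.mean(ws, axis=0), betas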

We will need the following assumption on the stochastic gradients.

Assumption 3.

Assume that the stochastic gradients $\nabla_{w}\hat{f}_{m}(w,\beta_{m},\zeta)$ and $\nabla_{\beta}\hat{f}_{m}(w,\beta_{m},\zeta)$ satisfy the following conditions for all $m\in[M]$, $w\in\mathbb{R}^{d_{0}}$, and $\beta_{m}\in\mathbb{R}^{d_{m}}$:

\mathbb{E}\left[\nabla_{w}\hat{f}_{m}(w,\beta_{m},\zeta)\right]=\nabla_{w}f_{m}(w,\beta_{m}),
\mathbb{E}\left[\nabla_{\beta}\hat{f}_{m}(w,\beta_{m},\zeta)\right]=\nabla_{\beta}f_{m}(w,\beta_{m}),
\mathbb{E}\left[\|\nabla_{w}\hat{f}_{m}(w,\beta_{m},\zeta)-\nabla_{w}f_{m}(w,\beta_{m})\|^{2}\right]\leq\sigma^{2},\text{ and}
\mathbb{E}\left[\sum^{M}_{m=1}\|\nabla_{\beta}\hat{f}_{m}(w,\beta_{m},\zeta)-\nabla_{\beta}f_{m}(w,\beta_{m})\|^{2}\right]\leq M\sigma^{2}.

Let

(\overline{w}^{K},\overline{\beta}^{K})\coloneqq\left(\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}\right)^{-1}\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}(w^{k},\beta^{k}).

We are now ready to state the convergence rate of LSGD-PFL.

Theorem 1.

Suppose that Assumptions 1 and 3 hold. Let $\eta_{k}=\eta$ for all $k\geq 0$, where $\eta$ satisfies

0<\eta\leq\min\left\{\frac{1}{4L^{\beta}},\frac{1}{8\sqrt{3e}(\tau-1)L^{w}}\right\}.

Let $\zeta_{*}^{2}\coloneqq\frac{1}{M}\sum_{m=1}^{M}\|\nabla f_{m}(w^{*},\beta^{*})\|^{2}$ be the data heterogeneity parameter at the optimum. The iteration complexity of Algorithm 1 to achieve $\mathbb{E}\left[f(\overline{w}^{K},\overline{\beta}^{K})\right]-f(w^{*},\beta^{*})\leq\epsilon$ is

\tilde{{\cal O}}\left(\frac{\max\left(L^{\beta},\tau L^{w}\right)}{\mu}+\frac{\sigma^{2}}{MB\mu\epsilon}+\frac{\tau}{\mu}\sqrt{\frac{L^{w}(\zeta_{*}^{2}+\sigma^{2}B^{-1})}{\epsilon}}\right).

The iteration complexity of LSGD-PFL can be seen as a sum of two complexities: the complexity of minibatch SGD for minimizing a problem with condition number $L^{\beta}/\mu$ and the complexity of local SGD for minimizing a problem with condition number $L^{w}/\mu$. Note that the key reason why we were able to obtain such a rate for LSGD-PFL is the rescaling of the $w$-space by the constant $M^{-\frac{1}{2}}$. Arivazhagan et al. (2019) and Liang et al. (2020), where LSGD-PFL was introduced without optimization guarantees, did not consider such a reparametrization.

We also have the following result for weakly convex objectives.

Theorem 2.

Suppose that the conditions of Theorem 1 hold and let $\mu=0$. The iteration complexity of Algorithm 1 to achieve $\mathbb{E}\left[f(\overline{w}^{K},\overline{\beta}^{K})\right]-f(w^{*},\beta^{*})\leq\epsilon$ is

\tilde{O}\left(\frac{\max\left(L^{\beta},\tau L^{w}\right)}{\epsilon}+\frac{\sigma^{2}}{MB\epsilon^{2}}+\frac{\tau\sqrt{L^{w}(\zeta_{*}^{2}+\sigma^{2}B^{-1})}}{\epsilon^{\frac{3}{2}}}\right).

The proofs of Theorem 1 and Theorem 2 can be found in Section 3.2. The reparametrization of ww plays a key role in proving the iteration complexity bound. Unlike the convergence guarantees for ACD-PFL and ASVRCD-PFL, which are introduced in Section 4, we do not claim any optimality properties for the rates obtained in Theorem 1 and Theorem 2. However, given the popularity of Local SGD as an optimizer for non-personalized FL models, Algorithm 1 is a natural extension of Local SGD to personalized FL models, and the corresponding convergence rate is an important contribution.

3.1 Nonconvex Theory for LSGD-PFL

To demonstrate that LSGD-PFL also works in the nonconvex setting, we develop a non-convex theory for it; the algorithm is therefore also applicable, for example, to solving the explicit MAML objective. Note that we do not claim optimality of these results.

Before stating the results, we need to make the following assumptions, which are slightly different from the rest of the paper. First, we need a smoothness assumption on local objective functions.

Assumption 4.

The local objective function $f_{m}(\cdot)$ is differentiable and $L$-smooth, that is, $\|\nabla f_{m}(u)-\nabla f_{m}(v)\|\leq L\|u-v\|$ for all $u,v\in\mathbb{R}^{d_{0}+d_{m}}$ and $m\in[M]$.

This condition is slightly different compared to the smoothness assumption on the objective stated in Assumption 1. Next, we need local stochastic gradients to have bounded variance.

Assumption 5.

The stochastic gradients $\nabla_{w}\hat{f}_{m}(w,\beta_{m},\zeta)$ and $\nabla_{\beta}\hat{f}_{m}(w,\beta_{m},\zeta)$ satisfy, for all $m\in[M]$, $w\in\mathbb{R}^{d_{0}}$, and $\beta_{m}\in\mathbb{R}^{d_{m}}$,

\mathbb{E}_{\zeta}\left[\|\nabla_{w}\hat{f}_{m}(w,\beta_{m},\zeta)-\nabla_{w}f_{m}(w,\beta_{m})\|^{2}\right]\leq C_{1}\|\nabla_{w}f_{m}(w,\beta_{m})\|^{2}+\frac{\sigma^{2}_{1}}{B},
\mathbb{E}_{\zeta}\left[\|\nabla_{\beta}\hat{f}_{m}(w,\beta_{m},\zeta)-\nabla_{\beta}f_{m}(w,\beta_{m})\|^{2}\right]\leq C_{2}\|\nabla_{\beta}f_{m}(w,\beta_{m})\|^{2}+\frac{\sigma^{2}_{2}}{B},

where $C_{1},C_{2},\sigma^{2}_{1},\sigma^{2}_{2}$ are positive constants.

This assumption is common in the literature. See, for example, Assumption 3 in Haddadpour & Mahdavi (2019). Note that this assumption is weaker than Assumption 3. We also need an assumption on data heterogeneity.

Assumption 6 (Bounded Dissimilarity).

There is a positive constant $\lambda>0$ such that for all $w\in\mathbb{R}^{d_{0}}$ and $\beta_{m}\in\mathbb{R}^{d_{m}}$, $m\in[M]$, we have

\frac{1}{M}\sum^{M}_{m=1}\|\nabla f_{m}(w,\beta_{m})\|^{2}\leq\lambda\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(w,\beta_{m})\right\|^{2}+\sigma^{2}_{\text{dif}}.

This way of characterizing data heterogeneity was used in Haddadpour & Mahdavi (2019) – see Definition 1 therein. Given the above assumptions, we can establish the following convergence rate of LSGD-PFL for general non-convex objectives.

Theorem 3.

Suppose that Assumptions 4-6 hold. Let $\eta_{k}=\eta$ for all $k\geq 0$, where $\eta$ is small enough to satisfy

-1+\eta L\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)+\lambda\eta^{2}L^{2}(\tau-1)\tau(C_{1}+1)\leq 0. (20)

Then

\frac{1}{K}\sum^{K-1}_{k=0}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(w^{k},\beta^{k}_{m})\right\|^{2}\right]\leq\frac{2\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(w^{0},\beta^{0}_{m})-f^{*}\right]}{\eta K}+\eta L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}+\eta^{2}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)+\frac{\eta^{2}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{B},

where $w^{k}\coloneqq\frac{1}{M}\sum_{m=1}^{M}w_{m}^{k}$ is the sequence of so-called virtual iterates.

The following assumption is commonly used to characterize non-convex objectives in the literature.

Assumption 7 (μ\mu-Polyak-Łojasiewicz (PL)).

There exists a positive constant $\mu>0$ such that for all $w\in\mathbb{R}^{d_{0}}$ and $\beta_{m}\in\mathbb{R}^{d_{m}}$, $m\in[M]$, we have

\frac{1}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(w,\beta_{m})\right\|^{2}\geq\mu\left(\frac{1}{M}\sum^{M}_{m=1}f_{m}(w,\beta_{m})-f^{*}\right).

When the local objective functions satisfy the PL-condition, we obtain a faster convergence rate stated in the theorem below.

Theorem 4.

Suppose that Assumptions 4-7 hold. Let $\eta_{k}=1/(\mu(k+\beta\tau+1))$, where $\beta$ is a positive constant satisfying

\beta>\max\left\{\frac{2\lambda L}{\mu}\left(\frac{C_{1}}{M}+C_{2}+1\right)-2,\frac{2L^{2}\lambda(C_{1}+1)}{\mu^{2}},1\right\}

and $\tau$ is large enough such that

\tau\geq\sqrt{\frac{\max\left\{(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{1/\beta}-4,0\right\}}{\beta^{2}-(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{\frac{1}{\beta}}}}.

Then

\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}\left(w^{K},\beta^{K}_{m}\right)-f^{*}\right]\leq\frac{b^{3}}{(K+\beta\tau)^{3}}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}\left(w^{0},\beta^{0}_{m}\right)-f^{*}\right]+\frac{2L^{2}(\tau-1)^{2}K}{\mu^{3}(K+\beta\tau)^{3}}\left\{\sigma^{2}_{\text{dif}}(C_{1}+1)+\frac{\sigma^{2}_{1}}{B}\right\}+\frac{LK(K+2\beta\tau+2)}{4\mu^{2}(K+\beta\tau)^{3}}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}.

The proofs of Theorem 3 and Theorem 4 can be found in Appendix B.7.

3.2 Proof of Theorem 1 and Theorem 2

The main idea consists of invoking the framework for analyzing local SGD methods introduced in Gorbunov et al. (2021), with several minor modifications. In particular, Algorithm 1 runs local SGD on the $w$-parameters and SGD on the $\beta$-parameters; therefore, we treat these parameter sets differently. Define $V_{k}\coloneqq\frac{1}{M}\sum_{m=1}^{M}\|w^{k}-w_{m}^{k}\|^{2}$, where $w^{k}\coloneqq\frac{1}{M}\sum_{m=1}^{M}w_{m}^{k}$ is defined as in Theorem 3.

The first step towards the convergence rate is to figure out the parameters of Assumption 2.3 from Gorbunov et al. (2021). The following lemma is an analog of Lemma G.1 in Gorbunov et al. (2021).

Lemma 5.

Let Assumptions 1 and 3 hold and let $L=\max\{L^{w},L^{\beta}\}$. Then we have

\frac{1}{M}\sum_{m=1}^{M}\|\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\|^{2}\leq 6L^{w}\left(f(w^{k},\beta_{m}^{k})-f(w^{*},\beta^{*})\right)+3(L^{w})^{2}V_{k}+3\zeta_{*}^{2}, (21)

and

\left\|\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}+\frac{1}{M^{2}}\sum_{m=1}^{M}\left\|\nabla_{\beta}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}\leq 4L\left(f(w^{k},\beta_{m}^{k})-f(w^{*},\beta^{*})\right)+2(L^{w})^{2}V_{k}. (22)

The proof is given in Appendix B.4. Using Lemma 5 we recover a set of crucial parameters of Assumption 2.3 from Gorbunov et al. (2021).

Lemma 6.

Let $g_{w,m}^{k}\coloneqq(g_{m}^{k})_{1:d_{0}}$ and $g_{\beta,m}^{k}\coloneqq(g_{m}^{k})_{(d_{0}+1):(d_{0}+d_{m})}$. Then

\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\left[\|g_{w,m}^{k}\|^{2}\right]\leq 6L^{w}\left(f(w^{k},\beta_{m}^{k})-f(w^{*},\beta^{*})\right)+3(L^{w})^{2}V_{k}+\frac{\sigma^{2}}{B}+3\zeta_{*}^{2}, (23)

and

\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^{M}g_{w,m}^{k}\right\|^{2}+\frac{1}{M^{2}}\sum_{m=1}^{M}\left\|g_{\beta,m}^{k}\right\|^{2}\right]\leq 4L\left(f(w^{k},\beta_{m}^{k})-f(w^{*},\beta^{*})\right)+2(L^{w})^{2}V_{k}+\frac{2\sigma^{2}}{BM}. (24)

The proof is given in Appendix B.5. Finally, the following lemma is an analog of Lemma E.1 in Gorbunov et al. (2021) and gives us the remaining parameters of Assumption 2.3 therein.

Lemma 7.

Suppose that Assumptions 1 and 3 hold and

\eta\leq\frac{1}{8\sqrt{3e}(\tau-1)L^{w}}.

Then

2L^{w}\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}\mathbb{E}\left[V_{k}\right]\leq\frac{1}{2}\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}\mathbb{E}\left[F(w^{k},\beta^{k})-F(w^{*},\beta^{*})\right]+2L^{w}D\eta^{2}\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}, (25)

where

D=2e(\tau-1)\tau\left(3\zeta_{*}^{2}+\frac{\sigma^{2}}{B}\right).

The proof is given in Appendix B.6. With these preliminary results, we are ready to state the main convergence result for Algorithm 1.

Theorem 5.

Suppose that Assumptions 1 and 3 hold and that the stepsize $\eta$ satisfies

0<\eta\leq\min\left\{\frac{1}{4L^{\beta}},\frac{1}{8\sqrt{3e}(\tau-1)L^{w}}\right\}.

Define

(\overline{w}^{K},\overline{\beta}^{K})\coloneqq\left(\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}\right)^{-1}\sum_{k=0}^{K}(1-\eta\mu)^{-k-1}(w^{k},\beta^{k}),
\Phi^{0}\coloneqq\frac{2\|w^{0}-w^{*}\|^{2}+\sum_{m=1}^{M}\|\beta_{m}^{0}-\beta_{m}^{*}\|^{2}}{\eta},\text{ and}
\Psi^{0}\coloneqq\frac{2\sigma^{2}}{BM}+8L^{w}\eta e(\tau-1)\tau\left(3\zeta^{2}_{*}+\frac{\sigma^{2}}{B}\right).

Then, if $\mu>0$, we have

\mathbb{E}\left[f(\overline{w}^{K},\overline{\beta}^{K})\right]-f(w^{*},\beta^{*})\leq\left(1-\eta\mu\right)^{K}\Phi^{0}+\eta\Psi^{0}, (26)

while, in the case when $\mu=0$, we have

\mathbb{E}\left[f(\overline{w}^{K},\overline{\beta}^{K})\right]-f(w^{*},\beta^{*})\leq\frac{\Phi^{0}}{K}+\eta\Psi^{0}. (27)

The proof follows directly from Theorem 2.1 of Gorbunov et al. (2021), once the conditions are verified, as has been done in Lemma 5, Lemma 6, and Lemma 7. Theorem 1 and Theorem 2 then follow from Corollary D.1 and Corollary D.2 of Gorbunov et al. (2021).

4 Minimax Optimal Methods

We discuss the complexity of solving (1) in terms of the number of communication rounds required to reach an $\epsilon$-solution and in terms of the amount of local computation, both in the number of (stochastic) gradients with respect to the global $w$-parameters and the local $\beta$-parameters.

4.1 Lower Complexity Bounds

We provide lower complexity bounds for solving (1) when $f_{m}$ is a finite sum (3). We show that any algorithm with access to the communication oracle and to the local (stochastic) gradient oracle with respect to the $w$ or $\beta$ parameters requires at least a certain number of oracle calls to approximately solve (1).

Oracle. The considered oracle allows us at any iteration to compute either:

  • $\nabla_{w}f_{m,i}(w_{m},\beta_{m})$ on each device for a randomly selected $i\in[n]$ and any $w_{m},\beta_{m}$; or

  • $\nabla_{\beta}f_{m,i}(w_{m},\beta_{m})$ on each device for a randomly selected $i\in[n]$ and any $w_{m},\beta_{m}$; or

  • the average of the $w_{m}$’s, alongside broadcasting the average back to the clients (communication step).

Our lower bound is provided for iterative algorithms whose iterates lie in the span of the historical oracle queries only. Let us denote such a class of algorithms by ${\cal A}$. In particular, for each $m,k$ we must have

\beta^{k}_{m}\in\text{Lin}\left(\beta^{0}_{m},\nabla_{\beta}f_{m}(w^{0}_{m},\beta^{0}_{m}),\dots,\nabla_{\beta}f_{m}(w^{k-1}_{m},\beta^{k-1}_{m})\right)

and

w^{k}_{m}\in\text{Lin}\left(w^{0},\nabla_{w}f_{m}(w^{0}_{m},\beta^{0}_{m}),\dots,\nabla_{w}f_{m}(w^{k-1}_{m},\beta^{k-1}_{m}),Q^{k}\right),

where

Q^{k}=\bigcup_{m=1}^{M}\left\{w^{0},\nabla_{w}f_{m}(w^{0}_{m},\beta^{0}_{m}),\dots,\nabla_{w}f_{m}(w^{l(k)}_{m},\beta^{l(k)}_{m})\right\},

with $l(k)$ being the index of the last communication round before iteration $k$. While such a restriction is widespread in the classical optimization literature (Nesterov, 2018; Scaman et al., 2018; Hendrikx et al., 2021; Hanzely et al., 2020a), it can be avoided with more complex arguments (Nemirovskij & Yudin, 1983; Woodworth & Srebro, 2016; Woodworth et al., 2018).

We then have the following theorem regarding the minimal number of oracle calls needed to solve (1).

Theorem 6.

Let $F$ satisfy Assumptions 1 and 2. Then any algorithm from the class ${\cal A}$ requires at least $\Omega\left(\sqrt{L^{w}/\mu}\log\epsilon^{-1}\right)$ communication rounds, $\Omega\left(n+\sqrt{n{\cal L}^{w}/\mu}\log\epsilon^{-1}\right)$ calls to the $\nabla_{w}$-oracle, and $\Omega\left(n+\sqrt{n{\cal L}^{\beta}/\mu}\log\epsilon^{-1}\right)$ calls to the $\nabla_{\beta}$-oracle to reach an $\epsilon$-solution.

The proof is given in Appendix B.8. In the special case where $n=1$, Theorem 6 provides a lower complexity bound for solving (1) with access to the full local gradient. Specifically, it shows that both the communication complexity and the local gradient complexity with respect to the $w$-variables are of the order $\Omega\left(\sqrt{\frac{L^{w}}{\mu}}\log\frac{1}{\epsilon}\right)$, and that the local gradient complexity with respect to the $\beta$-variables is of the order $\Omega\left(\sqrt{\frac{L^{\beta}}{\mu}}\log\frac{1}{\epsilon}\right)$.

4.2 Accelerated Coordinate Descent for PFL

We apply Accelerated Block Coordinate Descent (ACD) (Allen-Zhu et al., 2016; Nesterov & Stich, 2017; Hanzely & Richtárik, 2019) to solve (1). We separate the domain into two blocks of coordinates to sample from: the first one corresponding to ww parameters and the second one corresponding to β=[β1,β2,,βM]\beta=[\beta_{1},\beta_{2},\dots,\beta_{M}]. Specifically, at every iteration, we toss an unfair coin. With probability pw=Lw/(Lw+Lβ)p_{w}={\sqrt{L^{w}}}/{(\sqrt{L^{w}}+\sqrt{L^{\beta}})}, we compute wF(w,β)\nabla_{w}F(w,\beta) and update block ww. Alternatively, with probability pβ=1pwp_{\beta}=1-p_{w}, we compute βF(w,β)\nabla_{\beta}F(w,\beta) and update block β\beta. Plugging the described sampling of coordinate blocks into ACD, we arrive at Algorithm 2. Note that ACD from Allen-Zhu et al. (2016) only allows for subsampling individual coordinates and does not allow for “blocks.” A variant of ACD that provides the right convergence guarantees for block sampling was proposed in Nesterov & Stich (2017) and Hanzely & Richtárik (2019).
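To make the block-sampling step concrete, the following minimal Python sketch (with illustrative constants; the function name is ours) shows how the unfair coin of ACD-PFL decides which block to update at a given iteration.

import math
import random

def sample_block(L_w, L_beta):
    """Toss the unfair coin of ACD-PFL and return the block to update.

    With probability p_w = sqrt(L_w) / (sqrt(L_w) + sqrt(L_beta)) the global
    block w is updated (which requires a communication round); otherwise the
    local blocks beta_1, ..., beta_M are updated in parallel.
    """
    p_w = math.sqrt(L_w) / (math.sqrt(L_w) + math.sqrt(L_beta))
    return "w" if random.random() < p_w else "beta"

# Example: when L_beta >> L_w, the w-block (and hence communication) is sampled rarely.
print(sample_block(L_w=1.0, L_beta=100.0))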

Algorithm 2 ACD-PFL
0:  0<θ<10<\theta<1, η,ν>0\eta,\nu>0, wy0=wz0d0w_{y}^{0}=w_{z}^{0}\in\mathbb{R}^{d_{0}}, βy,m0=βz,m0dm\beta_{y,m}^{0}=\beta_{z,m}^{0}\in\mathbb{R}^{d_{m}} for 1mM1\leq m\leq M. 
  for k=0,1,2,k=0,1,2,\dots do
     wxk+1=(1θ)wyk+θwzkw_{x}^{k+1}=(1-\theta)w_{y}^{k}+\theta w_{z}^{k}
     for m=1,,Mm=1,\dots,M in parallel do
        βx,mk+1=(1θ)βy,mk+θβz,mk\beta_{x,m}^{k+1}=(1-\theta)\beta_{y,m}^{k}+\theta\beta_{z,m}^{k}
     end for
     ξ={1, with probability pw=LwLw+Lβ0, with probability pβ=LβLw+Lβ\xi=\begin{cases}1,&\text{ with probability }p_{w}=\frac{\sqrt{L^{w}}}{\sqrt{L^{w}}+\sqrt{L^{\beta}}}\\ 0,&\text{ with probability }p_{\beta}=\frac{\sqrt{L^{\beta}}}{\sqrt{L^{w}}+\sqrt{L^{\beta}}}\end{cases}
     if ξ=1\xi=1 then
         wyk+1=wxk+11Lw1Mm=1Mwfm(wxk+1,βx,mk+1)w_{y}^{k+1}=w_{x}^{k+1}-\frac{1}{L^{w}}\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m}(w_{x}^{k+1},\beta_{x,m}^{k+1})
        wzk+1=11+ην(wzk+ηνwxk+1ηLw(Lw+Lβ)1Mm=1Mwfm(wxk+1,βx,mk+1))w_{z}^{k+1}=\frac{1}{1+\eta\nu}\left(w_{z}^{k}+\eta\nu w_{x}^{k+1}-\frac{\eta}{\sqrt{L^{w}}(\sqrt{L^{w}}+\sqrt{L^{\beta}})}\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m}(w_{x}^{k+1},\beta_{x,m}^{k+1})\right)
        for m=1,,Mm=1,\dots,M in parallel do
           βz,mk+1=11+ην(βz,mk+ηνβx,mk+1)\beta_{z,m}^{k+1}=\frac{1}{1+\eta\nu}\left(\beta_{z,m}^{k}+\eta\nu\beta_{x,m}^{k+1}\right)
        end for
     else
        for m=1,,Mm=1,\dots,M in parallel do
            βy,mk+1=βx,mk+11Lββfm(wxk+1,βx,mk+1)\beta_{y,m}^{k+1}=\beta_{x,m}^{k+1}-\frac{1}{L^{\beta}}\nabla_{\beta}f_{m}(w_{x}^{k+1},\beta_{x,m}^{k+1})
            βz,mk+1=11+ην(βz,mk+ηνβx,mk+1ηLβ(Lw+Lβ)βfm(wxk+1,βx,mk+1))\beta_{z,m}^{k+1}=\frac{1}{1+\eta\nu}\left(\beta_{z,m}^{k}+\eta\nu\beta_{x,m}^{k+1}-\frac{\eta}{\sqrt{L^{\beta}}(\sqrt{L^{w}}+\sqrt{L^{\beta}})}\nabla_{\beta}f_{m}(w_{x}^{k+1},\beta_{x,m}^{k+1})\right)
         end for
         wzk+1=11+ην(wzk+ηνwxk+1)w_{z}^{k+1}=\frac{1}{1+\eta\nu}\left(w_{z}^{k}+\eta\nu w_{x}^{k+1}\right)
     end if
  end for

We provide an optimization guarantee for Algorithm 2 in the following theorem.

Theorem 7.

Suppose that Assumption 1 holds. Let

ν=μ(Lw+Lβ)2,θ=ν2+4νν2, and η=θ1.\nu=\frac{\mu}{(\sqrt{L^{w}}+\sqrt{L^{\beta}})^{2}},\,\theta=\frac{\sqrt{\nu^{2}+4\nu}-\nu}{2},\text{ and }\eta=\theta^{-1}.

The iteration complexity of ACD-PFL is

𝒪((Lw+Lβ)/μlogϵ1).{\cal O}\left(\sqrt{{(L^{w}+L^{\beta})}/{\mu}}\log{\epsilon^{-1}}\right).

The proof follows directly from Theorem 4.2 of Hanzely & Richtárik (2019). Since wF(w,β)\nabla_{w}F(w,\beta) is evaluated on average once every 1/pw{1}/{p_{w}} iterations, ACD-PFL requires 𝒪(Lw/μlogϵ1){\cal O}\left(\sqrt{{L^{w}}/{\mu}}\log{\epsilon^{-1}}\right) communication rounds and 𝒪(Lw/μlogϵ1){\cal O}\left(\sqrt{{L^{w}}/{\mu}}\log{\epsilon^{-1}}\right) gradient evaluations with respect to ww, thus matching the lower bound. Similarly, as βF(w,β)\nabla_{\beta}F(w,\beta) is evaluated on average once every 1/pβ{1}/{p_{\beta}} iterations, we require 𝒪(Lβ/μlogϵ1){\cal O}\left(\sqrt{{L^{\beta}}/{\mu}}\log{\epsilon^{-1}}\right) evaluations of βF(w,β)\nabla_{\beta}F(w,\beta) to reach an ϵ\epsilon-solution; again matching the lower bound. Consequently, ACD-PFL is minimax optimal in terms of all three quantities of interest simultaneously.
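For completeness, the accounting behind this statement can be written out explicitly; a short sketch for the ww-block:

% Expected number of w-block updates (= communication rounds and \nabla_w F
% evaluations) over the K = O(sqrt((L^w + L^beta)/mu) log(1/eps)) iterations of ACD-PFL:
p_{w}\cdot K=\frac{\sqrt{L^{w}}}{\sqrt{L^{w}}+\sqrt{L^{\beta}}}\cdot{\cal O}\left(\sqrt{\frac{L^{w}+L^{\beta}}{\mu}}\log\epsilon^{-1}\right)={\cal O}\left(\sqrt{\frac{L^{w}}{\mu}}\log\epsilon^{-1}\right),
% using \sqrt{L^{w}+L^{\beta}} \leq \sqrt{L^{w}}+\sqrt{L^{\beta}}.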

We are not the first to propose a variant of coordinate descent (Nesterov, 2012) for personalized FL. Wu et al. (2021) introduced block coordinate descent to solve a variant of (11) formulated over a network. However, they do not argue about any form of optimality for their approach, which is also less general as it only covers a single personalized FL objective.

4.3 Accelerated SVRCD for PFL

Despite being minimax optimal, the main drawback of ACD-PFL is that it requires access to the full gradient of the local loss fmf_{m} with respect to either ww or β\beta at each iteration. Computing such a full gradient might be very expensive when fmf_{m} is a finite sum (3). Ideally, one would like an algorithm that is i) subsampling the global/local variables ww and β\beta just as ACD-PFL does, ii) subsampling the local finite sum, iii) employing control variates to reduce the variance of the local stochastic gradient (Johnson & Zhang, 2013; Defazio et al., 2014), and iv) incorporating Nesterov's acceleration (Nesterov, 1983).

We propose a method – ASVRCD-PFL – that satisfies all four conditions above, by carefully designing an instance of ASVRCD (Accelerated proximal Stochastic Variance Reduced Coordinate Descent) (Hanzely et al., 2020b) applied to minimize an objective in a lifted space that is equivalent to (1). We are not aware of any other algorithm capable of satisfying i)-iv) simultaneously.

The construction of ASVRCD-PFL involves four main ingredients. First, we rewrite the original problem in a lifted space, which corresponds to the problem form discussed in Hanzely et al. (2020b). Second, we construct an unbiased stochastic gradient estimator by sampling coordinate blocks. Next, we enrich the stochastic gradient by control variates as in SVRG. Finally, we incorporate Nesterov’s momentum. We explain the construction of ASVRCD-PFL in detail below.

Lifting the problem space. ASVRCD-PFL is an instance of ASVRCD applied to an objective (1) in a lifted space. We have that

\min_{w\in\mathbb{R}^{d_{0}},\,\beta_{m}\in\mathbb{R}^{d_{m}},\,\forall m\in[M]}F(w,\beta)=\min_{\substack{X[1,:,:]\in\mathbb{R}^{M\times n\times d_{0}}\\ X[2,m,:]\in\mathbb{R}^{n\times d_{m}},\,\forall m\in[M]}}\left\{\mathbf{P}(X)\coloneqq\mathbf{F}(X)+\mathbf{\psi}(X)\right\},

where

𝐅(X)1Mm=1M(1nj=1nfm,j(X[1,m,j],X[2,m,j])){\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{F}}(X)\coloneqq\frac{1}{M}\sum_{m=1}^{M}\left(\frac{1}{n}\sum_{j=1}^{n}f_{m,j}(X[1,m,j],X[2,m,j])\right)

and

\mathbf{\psi}(X)\coloneqq\begin{cases}0&\text{if }X[1,m,j]=X[1,m^{\prime},j^{\prime}]\text{ and }X[2,m,j]=X[2,m,j^{\prime}]\text{ for all }m,m^{\prime}\in[M]\text{ and }j,j^{\prime}\in[n],\\ \infty&\text{otherwise}.\end{cases}

Variables X[1,m,j]X[1,m,j] correspond to ww for all m[M]m\in[M] and j[n]j\in[n], while variables X[2,m,j]X[2,m,j] correspond to βm\beta_{m} for all j[n]j\in[n]. The equivalence between the objective (1) and the objective in the lifted space is ensured with the indicator function ψ(X){\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{\psi}}(X), which forces different XX variables to take the same values. We apply ASVRCD with a carefully chosen non-uniform sampling of coordinate blocks to minimize 𝐏(X){\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{P}}(X).
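Since ψ\mathbf{\psi} is the indicator function of a linear consensus subspace, its proximal operator (used in Algorithm 3) follows directly from the definition: it is the Euclidean projection onto that subspace, i.e., averaging over the replicated coordinates. For any η>0\eta>0 and all m[M]m\in[M], j[n]j\in[n],

\left(\mathop{\mathrm{prox}}\nolimits_{\eta\psi}(X)\right)[1,m,j]=\frac{1}{Mn}\sum_{m^{\prime}=1}^{M}\sum_{j^{\prime}=1}^{n}X[1,m^{\prime},j^{\prime}],\qquad\left(\mathop{\mathrm{prox}}\nolimits_{\eta\psi}(X)\right)[2,m,j]=\frac{1}{n}\sum_{j^{\prime}=1}^{n}X[2,m,j^{\prime}].

In the federated setting, the first average corresponds to communication and averaging of the global ww-copies, while the second is a purely local operation on each device.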

Sampling of coordinate blocks. The key component of ASVRCD-PFL is the construction of the unbiased stochastic gradient estimator of 𝐅(X)\nabla{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{F}}(X), which we describe here. We consider two independent sources of randomness when sampling the coordinate blocks.

First, we toss an unfair coin ζ\zeta. With probability pwp_{w}, we have ζ=1\zeta=1. In such a case, we ignore the local variables and update the global variables only, corresponding to ww or X[1]X[1] in our current notation. Alternatively, ζ=2\zeta=2 with probability pβ1pwp_{\beta}\coloneqq 1-p_{w}. In such a case, we ignore the global variables and update local variables only, corresponding to β\beta or X[2]X[2] in our current notation.

Second, we consider local subsampling. At each iteration, the stochastic gradient is constructed using Fj\nabla F_{j} only, where Fj(w,β)1Mm=1Mfm,j(w,βm)F_{j}(w,\beta)\coloneqq\frac{1}{M}\sum_{m=1}^{M}f_{m,j}(w,\beta_{m}) and jj is selected uniformly at random from [n][n]. For the sake of simplicity, we assume that all clients sample the same index, i.e., the randomness is synchronized. A similar rate can be obtained without shared randomness.
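A minimal sketch of how such synchronized subsampling could be implemented (an implementation detail we assume, not part of the analysis; the seed-sharing scheme and function name below are ours):

import random

def sample_shared_index(shared_seed, iteration, n):
    """Index j used by *every* client at the given iteration.

    All clients hold the same shared_seed (e.g., broadcast during the last
    communication round), so they draw the same j without extra communication.
    """
    rng = random.Random(shared_seed * 1_000_003 + iteration)
    return rng.randrange(n)

# Every client calling this with the same arguments obtains the same index.
assert sample_shared_index(42, iteration=7, n=1000) == sample_shared_index(42, iteration=7, n=1000)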

With these sources of randomness, we arrive at the following construction of 𝐆(X){\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(X), which is an unbiased stochastic estimator of 𝐅(X)\nabla{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{F}}(X):

𝐆(X)[1,m,j]\displaystyle{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(X)[1,m,j^{\prime}] ={1pwwfj,m(X[1,m,j],X[2,m,j])if ζ=1 and j=j;0d0otherwise;\displaystyle=\begin{cases}\frac{1}{p^{w}}\nabla_{w}f_{j^{\prime},m}(X[1,m,j^{\prime}],X[2,m,j^{\prime}])&\text{if }\zeta=1\text{ and }j^{\prime}=j;\\ 0\in\mathbb{R}^{d_{0}}&\text{otherwise};\end{cases}
𝐆(X)[2,m,j]\displaystyle{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(X)[2,m,j^{\prime}] ={1pββfj,m(X[1,m,j],X[2,m,j])if ζ=2 and j=j;0dmotherwise.\displaystyle=\begin{cases}\frac{1}{p^{\beta}}\nabla_{\beta}f_{j^{\prime},m}(X[1,m,j^{\prime}],X[2,m,j^{\prime}])&\text{if }\zeta=2\text{ and }j^{\prime}=j;\\ 0\in\mathbb{R}^{d_{m}}&\text{otherwise}.\end{cases}

Control variates and acceleration. We enrich the stochastic gradient by incorporating control variates, resulting in an SVRG-style stochastic gradient estimator. In particular, the resulting stochastic gradient will take the form of 𝐆(X)𝐆(Y)+𝐅(Y){\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(X)-{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(Y)+\nabla{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{F}}(Y), where YY is another point that is updated upon a successful toss of a ρ\rho-coin. The last ingredient of the method is to incorporate Nesterov’s momentum.
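The following Python skeleton (with hypothetical callables; it abstracts away the lifted notation) illustrates the structure of the resulting estimator and the ρ\rho-coin update of the reference point used in Algorithm 3.

import random

def svrg_style_step(grad_stoch, grad_full, sample_fn, x, y, v, rho):
    """One variance-reduced step skeleton mirroring Algorithm 3.

    grad_stoch(point, xi): stochastic gradient evaluated with randomness xi,
                           which here plays the role of the pair (zeta, j)
    grad_full(point):      full gradient of F
    sample_fn():           draws fresh randomness xi
    x: extrapolated point X^k, y: latest Y-iterate, v: reference point V^k
    """
    xi = sample_fn()
    # Control-variate estimator: the *same* randomness xi is used at x and at
    # the reference point v, so the estimator is unbiased and its variance
    # shrinks as x and v approach the optimum.
    g = grad_stoch(x, xi) - grad_stoch(v, xi) + grad_full(v)
    # The reference point is refreshed with the latest y upon a rho-coin toss.
    v_next = y if random.random() < rho else v
    return g, v_next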

Combining the above ingredients, we arrive at the ASVRCD-PFL procedure, which is detailed in Algorithm 3 in the lifted notation. Algorithm 4 details ASVRCD-PFL in the notation consistent with the rest of the paper.

The following theorem provides convergence guarantees for ASVRCD-PFL.

Theorem 8.

Suppose Assumptions 1 and 2 hold. The iteration complexity of ASVRCD-PFL with

η=14,θ2=12,γ=1max{2μ,4θ1/η},\displaystyle\eta=\frac{1}{4{\cal L}},\quad\theta_{2}=\frac{1}{2},\quad\gamma=\frac{1}{\max\{2\mu,4\theta_{1}/\eta\}},
ν=1γμ,θ1=min{12,ημmax{12,θ2ρ}},andpw=wβ+w\displaystyle\nu=1-\gamma\mu,\quad\theta_{1}=\min\left\{\frac{1}{2},\sqrt{\eta\mu\max\left\{\frac{1}{2},\frac{\theta_{2}}{\rho}\right\}}\right\},\quad\text{and}\quad p_{w}=\frac{{\cal L}^{w}}{{\cal L}^{\beta}+{\cal L}^{w}}

is

𝒪((ρ1+(w+β)/(ρμ))logϵ1),{\cal O}\left(\left({\rho^{-1}}+\sqrt{{({\cal L}^{w}+{\cal L}^{\beta})}/{(\rho\mu)}}\right)\log{\epsilon^{-1}}\right),

where ρ\rho is the frequency of updating the control variates.

Setting ρ=w/((w+β)n)\rho={{\cal L}^{w}}/\left(({\cal L}^{w}+{\cal L}^{\beta})n\right) yields a communication complexity and a local stochastic gradient complexity with respect to the ww-parameters of order 𝒪((n+nw/μ)logϵ1){\cal O}\left(\left(n+\sqrt{{n{\cal L}^{w}}/{\mu}}\right)\log{\epsilon^{-1}}\right). Analogously, setting ρ=β/((w+β)n)\rho={{\cal L}^{\beta}}/(({\cal L}^{w}+{\cal L}^{\beta})n) yields a local stochastic gradient complexity with respect to the β\beta-parameters of order 𝒪((n+nβ/μ)logϵ1){\cal O}\left(\left(n+\sqrt{{n{\cal L}^{\beta}}/{\mu}}\right)\log{\epsilon^{-1}}\right). Compared with the lower bounds in Theorem 6, this shows that ASVRCD-PFL can be optimal in terms of the local computation either with respect to the ww-variables or with respect to the β\beta-variables. Unfortunately, these bounds are not achieved simultaneously unless w,β{\cal L}^{w},{\cal L}^{\beta} are of a similar order; closing this gap is left for future research. The proof is given in Appendix B.9. Additional discussion on how to choose the tuning parameters is given in Theorem 9.
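As a convenience, the parameter choices of Theorem 8 can be assembled as in the following Python sketch (the expression for {\cal L} is the one we use in the experiments of Section 5; the function name is ours).

import math

def asvrcd_pfl_parameters(L_w, L_beta, mu, rho):
    """Tuning parameters of ASVRCD-PFL following Theorem 8.

    L_w, L_beta: smoothness constants (cal L^w, cal L^beta);
    mu: strong convexity constant; rho: control-variate refresh probability.
    """
    p_w = L_w / (L_w + L_beta)
    p_beta = 1.0 - p_w
    # cal L as used in our experiments: 2 * max{L^w / p_w, L^beta / p_beta}.
    L = 2.0 * max(L_w / p_w, L_beta / p_beta)
    eta = 1.0 / (4.0 * L)
    theta2 = 0.5
    theta1 = min(0.5, math.sqrt(eta * mu * max(0.5, theta2 / rho)))
    gamma = 1.0 / max(2.0 * mu, 4.0 * theta1 / eta)
    nu = 1.0 - gamma * mu
    return {"p_w": p_w, "eta": eta, "theta1": theta1,
            "theta2": theta2, "gamma": gamma, "nu": nu}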

Algorithm 3 ASVRCD-PFL (lifted notation)
0:  0<θ1,θ2<10<\theta_{1},\theta_{2}<1, η,ν,γ>0\eta,\nu,\gamma>0, ρ(0,1)\rho\in(0,1), Y0=Z0=X0Y^{0}=Z^{0}=X^{0}. 
  for k=0,1,2,k=0,1,2,\dots do
     Xk=θ1Zk+θ2Vk+(1θ1θ2)YkX^{k}=\theta_{1}Z^{k}+\theta_{2}V^{k}+(1-\theta_{1}-\theta_{2})Y^{k}
     gk=𝐆(Xk)𝐆(Vk)+𝐅(Vk)g^{k}={\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(X^{k})-{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{G}}(V^{k})+\nabla{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\mathbf{F}}(V^{k})
     Yk+1=proxηψ(Xkηgk)Y^{k+1}=\mathop{\mathrm{prox}}\nolimits_{\eta\psi}(X^{k}-\eta g^{k})
     Zk+1=νZk+(1ν)Xk+γη(Yk+1Yk)Z^{k+1}=\nu Z^{k}+(1-\nu)X^{k}+\frac{\gamma}{\eta}(Y^{k+1}-Y^{k})
     Vk+1={Yk, with probability ρVk, with probability 1ρV^{k+1}=\begin{cases}Y^{k},&\text{ with probability }\rho\\ V^{k},&\text{ with probability }1-\rho\\ \end{cases}
  end for

5 Simulations

We present an extensive numerical evaluation to verify and support the theoretical claims. We perform experiments on both synthetic and real data, with a range of different objectives and methods (both ours and the baselines from the literature). The experiments are designed to shed light on various aspects of the theory. In this section, we present the results on synthetic data, while in the next section, we illustrate the performance of different methods on real data. The code to reproduce the experiments is publicly available at

https://github.com/boxinz17/PFL-Unified-Framework.

The experiments on synthetic data were conducted on a personal laptop with a CPU (Intel(R) Core(TM) i7-9750H CPU@2.60GHz). The results are summarized over 30 independent runs.

Algorithm 4 ASVRCD-PFL
0:  0<θ1,θ2<10<\theta_{1},\theta_{2}<1, η,ν,γ>0\eta,\nu,\gamma>0, ρ(0,1)\rho\in(0,1), pw(0,1)p_{w}\in(0,1), pβ=1pwp_{\beta}=1-p_{w}, wy0=wz0=wv0d0w_{y}^{0}=w_{z}^{0}=w_{v}^{0}\in\mathbb{R}^{d_{0}}, βy,m0=βz,m0=βv,m0dm\beta_{y,m}^{0}=\beta_{z,m}^{0}=\beta_{v,m}^{0}\in\mathbb{R}^{d_{m}} for 1mM1\leq m\leq M. 
  for k=0,1,2,k=0,1,2,\dots do
     wxk=θ1wzk+θ2wvk+(1θ1θ2)wykw_{x}^{k}=\theta_{1}w_{z}^{k}+\theta_{2}w_{v}^{k}+(1-\theta_{1}-\theta_{2})w_{y}^{k}
     for m=1,,Mm=1,\dots,M in parallel do
        βx,mk=θ1βz,mk+θ2βv,mk+(1θ1θ2)βy,mk\beta_{x,m}^{k}=\theta_{1}\beta_{z,m}^{k}+\theta_{2}\beta_{v,m}^{k}+(1-\theta_{1}-\theta_{2})\beta_{y,m}^{k}
     end for
     Sample random j{1,2,,n}j\in\{1,2,\dots,n\} and ζ={1, with probability pw2, with probability pβ\zeta=\begin{cases}1,&\text{ with probability }p_{w}\\ 2,&\text{ with probability }p_{\beta}\end{cases}
     
gwk={1pw(1Mm=1Mwfm,j(wxk,βx,mk)1Mm=1Mwfm,j(wvk,βv,mk))+wF(wvk,βvk) if ζ=1wF(wvk,βvk) if ζ=2\displaystyle g_{w}^{k}=\begin{cases}\begin{aligned} \frac{1}{p_{w}}&\left(\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m,j}(w_{x}^{k},\beta_{x,m}^{k})-\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m,j}(w_{v}^{k},\beta_{v,m}^{k})\right)+\nabla_{w}F(w_{v}^{k},\beta_{v}^{k})\end{aligned}&\text{ if }\zeta=1\\ \\ \nabla_{w}F(w_{v}^{k},\beta_{v}^{k})&\text{ if }\zeta=2\end{cases}
     wyk+1=wxkηgwkw_{y}^{k+1}=w_{x}^{k}-\eta g_{w}^{k}
     wzk+1=νwzk+(1ν)wxk+γη(wyk+1wxk)w_{z}^{k+1}=\nu w_{z}^{k}+(1-\nu)w_{x}^{k}+\frac{\gamma}{\eta}(w_{y}^{k+1}-w_{x}^{k})
     wvk+1={wyk, with probability ρwvk, with probability 1ρw_{v}^{k+1}=\begin{cases}w_{y}^{k},&\text{ with probability }\rho\\ w_{v}^{k},&\text{ with probability }1-\rho\\ \end{cases}
     for m=1,,Mm=1,\dots,M in parallel do
        
gβ,mk={1Mβfm(wvk,βv,mk) if ζ=11pβM(βfm,j(wxk,βx,mk)βfm,j(wvk,βv,mk))+1Mβfm(wvk,βv,mk) if ζ=2g_{\beta,m}^{k}=\begin{cases}\frac{1}{M}\nabla_{\beta}f_{m}(w_{v}^{k},\beta_{v,m}^{k})&\text{ if }\zeta=1\\ \\ \begin{aligned} \frac{1}{p_{\beta}M}&\left(\nabla_{\beta}f_{m,j}(w_{x}^{k},\beta_{x,m}^{k})-\nabla_{\beta}f_{m,j}(w_{v}^{k},\beta_{v,m}^{k})\right)+\frac{1}{M}\nabla_{\beta}f_{m}(w_{v}^{k},\beta_{v,m}^{k})\end{aligned}&\text{ if }\zeta=2\end{cases}
        βy,mk+1=βx,mkηgβ,mk\beta_{y,m}^{k+1}=\beta_{x,m}^{k}-\eta g_{\beta,m}^{k}
        βz,mk+1=νβz,mk+(1ν)βx,mk+γη(βy,mk+1βx,mk)\beta_{z,m}^{k+1}=\nu\beta_{z,m}^{k}+(1-\nu)\beta_{x,m}^{k}+\frac{\gamma}{\eta}(\beta_{y,m}^{k+1}-\beta_{x,m}^{k})
        βv,mk+1={βy,mk, with probability ρβv,mk, with probability 1ρ\beta_{v,m}^{k+1}=\begin{cases}\beta_{y,m}^{k},&\text{ with probability }\rho\\ \beta_{v,m}^{k},&\text{ with probability }1-\rho\\ \end{cases}
     end for
  end for

5.1 Multi-Task Personalized FL and Implicit MAML Objective

In this section, we focus on the performance of different methods when solving the objective (11). We implement three proposed algorithms – LSGD-PFL, ASCD-PFL (ASVRCD-PFL without control variates; see the detailed description in Appendix A), and ASVRCD-PFL – and compare them with two baselines – L2SGD+ (Hanzely & Richtárik, 2020) and pFedMe (Dinh et al., 2020). As both L2SGD+ and pFedMe were designed specifically to solve (11), the aim of this experiment is to demonstrate that our universal approach is competitive with these specifically designed methods.

Data and model. We perform this experiment on synthetically generated data which allows us to properly control the data heterogeneity level. As a model, we choose logistic regression. We generate wdw^{*}\in\mathbb{R}^{d} with i.i.d. entries from Uniform[0.49,0.51]\text{Uniform}[0.49,0.51], and set βm=w+Δβmd\beta^{*}_{m}=w^{*}+\Delta\beta^{*}_{m}\in\mathbb{R}^{d}, where entries of Δβm\Delta\beta^{*}_{m} are generated i.i.d. from Uniform[μm0.01,μm+0.01]\text{Uniform}[\mu_{m}-0.01,\mu_{m}+0.01] and μmN(0,σh2)\mu_{m}\sim N(0,\sigma^{2}_{h}) for all m=1,2,,Mm=1,2,\dots,M. Thus, σh\sigma_{h} can be regarded as a measure of heterogeneity level, with a large σh\sigma_{h} corresponding to large heterogeneity. Finally, for each device m=1,2,,Mm=1,2,\dots,M, we generate 𝚡m,id\mathtt{x}_{m,i}\in\mathbb{R}^{d} with entries i.i.d. from Uniform[0.2,0.5]\text{Uniform}[0.2,0.5] for all i=1,2,,ni=1,2,\dots,n, and ym,iBernoulli(pm,i)y_{m,i}\sim\text{Bernoulli}(p_{m,i}), where pm,i=1/(1+exp(βm𝚡m,i))p_{m,i}=1/(1+\exp(\beta^{*\top}_{m}\mathtt{x}_{m,i})). We set d=15d=15, n=1000n=1000, M=20M=20, and let σh{0.1,0.3,1.0}\sigma_{h}\in\{0.1,0.3,1.0\} to explore different levels of heterogeneity.
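A minimal sketch of this data-generating process (using NumPy; the function name is ours, and the released code may differ in details):

import numpy as np

def generate_synthetic_data(d=15, n=1000, M=20, sigma_h=0.3, seed=0):
    """Synthetic logistic-regression data for the experiment of Section 5.1."""
    rng = np.random.default_rng(seed)
    w_star = rng.uniform(0.49, 0.51, size=d)
    beta_star, X, y = [], [], []
    for _ in range(M):
        mu_m = rng.normal(0.0, sigma_h)
        # Local optimum: shared w_star plus a device-specific shift.
        beta_m = w_star + rng.uniform(mu_m - 0.01, mu_m + 0.01, size=d)
        X_m = rng.uniform(0.2, 0.5, size=(n, d))
        p_m = 1.0 / (1.0 + np.exp(X_m @ beta_m))   # Bernoulli success probabilities
        y_m = rng.binomial(1, p_m)
        beta_star.append(beta_m)
        X.append(X_m)
        y.append(y_m)
    return w_star, beta_star, X, y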

Objective function. We use objective (11) with fm(βm)f^{\prime}_{m}(\beta_{m}) being the cross-entropy loss function. We set λ=σh102\lambda=\sigma_{h}\cdot 10^{-2}, so that larger heterogeneity level will induce a larger penalty, which will further encourage parameters on each device to be closer to their geometric center. In addition to the training loss, we also record the estimation error in the training process, defined as w^w2+m=1Mβ^mβm2\|\hat{w}-w^{*}\|^{2}+\sum^{M}_{m=1}\|\hat{\beta}_{m}-\beta^{*}_{m}\|^{2}.

Tuning parameters of proposed algorithms. For LSGD-PFL (Algorithm 1), we set the batch size to compute the stochastic gradient B=1B=1, the average period τ=5\tau=5, and the learning rate η=0.01\eta=0.01. For pwp_{w} and pβp_{\beta} in ASCD-PFL (Algorithm 5) and ASVRCD-PFL (Algorithm 4), we set them as pw=w/(β+w)p_{w}=\mathcal{L}^{w}/(\mathcal{L}^{\beta}+\mathcal{L}^{w}) and pβ=1pwp_{\beta}=1-p_{w}, where w=λ/M\mathcal{L}^{w}=\lambda/M and β=(+λ)/M\mathcal{L}^{\beta}=(\mathcal{L}^{\prime}+\lambda)/M. We set =max1mM,1in𝚡m,i2/4\mathcal{L}^{\prime}=\max_{1\leq m\leq M,1\leq i\leq n}\|\mathtt{x}_{m,i}\|^{2}/4. For η,θ2,γ,ν\eta,\theta_{2},\gamma,\nu and θ1\theta_{1} in ASVRCD-PFL, we set them according to Theorem 9, where =2max{w/pw,β/pβ}\mathcal{L}=2\max\{\mathcal{L}^{w}/p_{w},\mathcal{L}^{\beta}/p_{\beta}\}, ρ=pw/n\rho=p_{w}/n, and μ=μ/(3M)\mu=\mu^{\prime}/(3M). We let μ\mu^{\prime} be the smallest eigenvalue of 1nMm=1Mi=1nexp(βm𝚡m,i)/(1+exp(βm𝚡m,i))2𝚡m,i𝚡m,i\frac{1}{nM}\sum^{M}_{m=1}\sum^{n}_{i=1}\exp(\beta^{*\top}_{m}\mathtt{x}_{m,i})/(1+\exp(\beta^{*\top}_{m}\mathtt{x}_{m,i}))^{2}\mathtt{x}_{m,i}\mathtt{x}^{\top}_{m,i}. The η,ν,γ,ρ\eta,\nu,\gamma,\rho in ASCD-PFL are the same as in ASVRCD-PFL, and we let θ=min{0.8,1/η}\theta=\min\{0.8,1/\eta\}. In addition, we initialize all iterates at zero for all algorithms.
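For concreteness, these constants can be computed from the data as in the following sketch (the function name is ours):

import numpy as np

def tuning_constants(X, lam, M):
    """Smoothness constants and sampling probability used in Section 5.1.

    X: list of per-device feature matrices; lam: penalty parameter lambda;
    M: number of devices.
    """
    # L' = max_{m,i} ||x_{m,i}||^2 / 4, the smoothness of the logistic loss.
    L_prime = max(float(np.max(np.sum(X_m ** 2, axis=1))) for X_m in X) / 4.0
    L_w = lam / M
    L_beta = (L_prime + lam) / M
    p_w = L_w / (L_w + L_beta)
    return L_prime, L_w, L_beta, p_w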

Figure 1: Comparison of various methods for minimizing (11). Each experiment was repeated 30 times, and the solid line represents the average performance, while the shaded region indicates the mean ±\pm standard error. We set the communication period of LSGD-PFL at 5, while other methods synchronize based on the corresponding theory. The first row shows the training loss, and the second row shows the estimation error. The different columns correspond to varying heterogeneity levels, which are parameterized by σh\sigma_{h}. As σh\sigma_{h} increases, the heterogeneity level also increases.

Tuning Parameters of pFedMe. For pFedMe (Algorithm 1 in Dinh et al. (2020)), we set all parameters according to the suggestions in Section 5.2 of Dinh et al. (2020). Specifically, we set the local computation rounds to R=20R=20, computation complexity to K=5K=5, Mini-Batch size to |𝒟|=5|\mathcal{D}|=5, and η=0.005\eta=0.005. We also set S=M=20S=M=20. To solve the objective (7) in Dinh et al. (2020), we use gradient descent, as suggested in the paper. Additionally, we initialize all iterates at zero.

Tuning Parameters of L2SGD+. For L2SGD+ (Algorithm 2 in Hanzely & Richtárik (2020)), we set the stepsize η\eta (the parameter α\alpha in Hanzely & Richtárik (2020)) and the probability of averaging pwp_{w} to be the same as in ASVRCD-PFL and ASCD-PFL. We also initialize all iterates at zero.

Results. The results are summarized in Figure 1. We observe that our general-purpose optimizers are competitive with L2SGD+ and pFedMe. In particular, both ASVRCD-PFL and L2SGD+ consistently achieve the same training error as the other methods, which is well predicted by our theory. Although L2SGD+ is slightly faster in terms of convergence due to the specific parameter setting, it is not as general as the methods we propose. Furthermore, we note that the widely used LSGD-PFL suffers from data heterogeneity on different devices, whereas ASVRCD-PFL is not affected by this heterogeneity, as predicted by our theory. The average running time over 30 independent runs is reported in Table 4.

Algorithm \ σh\sigma_{h}    0.1       0.3       1.0
LSGD-PFL                   260.98    256.16    237.18
ASCD-PFL                   28.79     51.71     142.59
ASVRCD-PFL                 47.94     97.61     274.60
pFedMe                     399.33    370.42    370.35
L2SGD+                     25.83     47.59     182.29
Table 4: The average wall-clock running time in seconds over 30 independent runs when solving the objective (11). Each entry in the table reports the average time for 1,000 communication rounds. We ignore any additional communication costs that might occur in practice.

5.2 Explicit Weight Sharing Objective

In this section, we present another experiment on synthetic data that aims to optimize the explicit weight sharing objective (16). Since there is no good baseline algorithm for this objective, the purpose of this experiment is to compare the three proposed algorithms: LSGD-PFL, ASCD-PFL, and ASVRCD-PFL.

Data and Model. As a model, we choose logistic regression. We generate wdgw^{*}\in\mathbb{R}^{d_{g}} with i.i.d. entries from N(0,1)N(0,1), and βmdl\beta^{*}_{m}\in\mathbb{R}^{d_{l}} with i.i.d. entries from Uniform[μm0.01,μm+0.01]\text{Uniform}[\mu_{m}-0.01,\mu_{m}+0.01], where μmN(0,σh2)\mu_{m}\sim N(0,\sigma^{2}_{h}) for all m=1,2,,Mm=1,2,\dots,M. Thus, σh\sigma_{h} can be regarded as a measure of the heterogeneity level, with a large σh\sigma_{h} corresponding to a high degree of heterogeneity. Finally, for each device m=1,2,,Mm=1,2,\dots,M, we generate 𝚡m,idg+dl\mathtt{x}_{m,i}\in\mathbb{R}^{d_{g}+d_{l}} with entries i.i.d. from Uniform[0.0,0.1]\text{Uniform}[0.0,0.1] for all i=1,2,,ni=1,2,\dots,n, and ym,iBernoulli(pm,i)y_{m,i}\sim\text{Bernoulli}(p_{m,i}), where pm,i=1/(1+exp((w,βm)𝚡m,i))p_{m,i}=1/(1+\exp((w^{*\top},\beta^{*\top}_{m})\mathtt{x}_{m,i})). We set dg=10d_{g}=10, dl=5d_{l}=5, n=1000n=1000, M=20M=20, and let σh{5.0,10.0,15.0}\sigma_{h}\in\{5.0,10.0,15.0\} to explore different levels of heterogeneity.

Figure 2: Comparison of the three proposed algorithms – LSGD-PFL, ASCD-PFL, and ASVRCD-PFL – when minimizing (16). Each experiment is repeated 30 times; the solid line represents the mean performance, and the shaded region covers the mean±standard error\text{mean}\pm\text{standard error}. We set the communication period of LSGD-PFL to 5, while other methods synchronize according to their respective theories. The first row represents the training loss, and the second row corresponds to the estimation error. Different columns indicate various heterogeneity levels, parameterized by σh\sigma_{h}. The heterogeneity level increases with σh\sigma_{h}.

Objective function. We use objective (16) with fm(βm)f^{\prime}_{m}(\beta_{m}) representing the cross-entropy loss function. We set λ=σh102\lambda=\sigma_{h}\cdot 10^{-2}, so that a larger heterogeneity level induces a larger penalty, encouraging parameters on each device to be closer to their geometric center. In addition to the training loss, we also record the estimation error during the training process, defined as w^w2+m=1Mβ^mβm2\|\hat{w}-w^{*}\|^{2}+\sum^{M}_{m=1}\|\hat{\beta}_{m}-\beta^{*}_{m}\|^{2}.

Algorithm \ σh\sigma_{h}    5.0       10.0      15.0
LSGD-PFL                   238.01    237.67    234.33
ASCD-PFL                   17.00     16.55     16.46
ASVRCD-PFL                 23.35     22.12     23.51
Table 5: The average wall-clock running time in seconds over 30 independent runs when solving the objective (16) is presented. Each entry of the table reports the average time for 1,000 communication rounds. We ignore any additional communication costs that might occur in practice.

Results. The results are summarized in Figure 2. When examining the training loss, we observe that ASCD-PFL drives the loss down quickly initially, while ASVRCD-PFL ultimately achieves a better optimum. This suggests that we can apply ASCD-PFL at the beginning of training and add control variates to reduce the variance at a later stage of training, thus combining the benefits of both algorithms. Both ASCD-PFL and ASVRCD-PFL perform much better than the widely used LSGD-PFL. When analyzing the estimation error, we observe that when the heterogeneity level is small, there is a tendency for overfitting, especially for ASCD-PFL; and when the heterogeneity level increases, there is less concern for overfitting. In general, however, ASCD-PFL and ASVRCD-PFL still achieve better estimation error than LSGD-PFL. The average running time over 30 independent runs is reported in Table 5.

6 Real Data Experiment Results

In this section, we use real data to illustrate the performance and various properties of the proposed methods. In Section 6.1, we compare the performance of the three proposed algorithms. In Section 6.2, we illustrate the effect of communication frequency of global parameters for ASCD-PFL and demonstrate that the theoretical choice based on Theorem 9 can generate the best communication complexity. Finally, in Section 6.3, we show the effect of reparametrizing ww for ASCD-PFL.

6.1 Performance of the Proposed Methods on Real Data

Figure 3: The real data experiment results for K=2K=2: (a) training loss; (b) validation accuracy. Different rows correspond to different objective functions, and columns correspond to different datasets.
Figure 4: The real data experiment results for K=4K=4: (a) training loss; (b) validation accuracy. Different rows correspond to different objective functions, and columns correspond to different datasets.
Figure 5: The real data experiment results for K=8K=8: (a) training loss; (b) validation accuracy. Different rows correspond to different objective functions, and columns correspond to different datasets.

We compare the three proposed algorithms – LSGD-PFL, ASCD-PFL (ASVRCD-PFL without control variates), and ASVRCD-PFL – across four image classification datasets – MNIST (Deng, 2012), KMNIST (Clanuwat et al., 2018), FMNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2009) – with three objective functions (8), (11), and (14). As a model, we use multiclass logistic regression, which is a single-layer fully connected neural network combined with a softmax function and cross-entropy loss. Experiments were conducted on a personal laptop (Intel(R) Core(TM) i7-9750H CPU@2.60GHz) with a GPU (NVIDIA GeForce RTX 2070 with Max-Q Design).

Data preparation. We set the number of devices M=20M=20. We focus on a non-i.i.d. setting of McMahan et al. (2017) and Liang et al. (2020) by assigning KK classes out of ten to each device. We let K=2,4,8K=2,4,8 to generate different levels of heterogeneity. A larger KK means a more balanced data distribution and thus smaller data heterogeneity. We then randomly select n=100n=100 samples for each device based on its class assignment for training and n=300n^{\prime}=300 samples for testing. We normalize each dataset in two steps: first, we normalize the columns (features) to have mean zero and unit variance; next, we normalize the rows (samples) so that every input vector has a unit norm.
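A sketch of the per-device data split and the two-step normalization described above (NumPy-based; the helper names are ours, and the column standardization is meant to be applied to the full dataset before the split):

import numpy as np

def normalize(X):
    """Two-step normalization: standardize columns, then unit-norm rows."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # guard all-constant pixels
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def device_split(X, y, classes, n_train=100, n_test=300, rng=None):
    """Sample one device's data in the non-i.i.d. split: K assigned classes."""
    rng = rng or np.random.default_rng()
    idx = np.flatnonzero(np.isin(y, classes))
    idx = rng.choice(idx, size=n_train + n_test, replace=False)
    return (X[idx[:n_train]], y[idx[:n_train]]), (X[idx[n_train:]], y[idx[n_train:]])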

Model. Given a grayscale picture with a label y{1,2,,C}y\in\{1,2,\dots,C\}, we unroll its pixel matrix into a vector xpx\in\mathbb{R}^{p}. Then, given a parameter matrix Θp×C\Theta\in\mathbb{R}^{p\times C}, the function fm()f^{\prime}_{m}(\cdot) in (8), (11), and (14) is defined as

f^{\prime}_{m}(\Theta)\coloneqq l_{\text{CE}}\left(\varsigma(\Theta^{\top}x)\,;y\right),

where ς():CC\varsigma(\cdot):\mathbb{R}^{C}\rightarrow\mathbb{R}^{C} is the softmax function and lCE()l_{\text{CE}}(\cdot) is the cross-entropy loss function. In this setting, the function fm()f^{\prime}_{m}(\cdot) is convex.
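A minimal sketch of this loss for a single sample (zero-indexed label; numerically stabilized log-softmax; the function name is ours):

import numpy as np

def f_m_prime(Theta, x, y):
    """Cross-entropy of multiclass logistic regression on one sample.

    Theta: (p, C) parameter matrix; x: (p,) unrolled pixel vector;
    y: integer label in {0, ..., C-1}.
    """
    logits = Theta.T @ x                                  # (C,)
    logits = logits - logits.max()                        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())     # log-softmax
    return -log_probs[y]                                  # l_CE(softmax(Theta^T x); y)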

Personalized FL objectives. We consider three different objectives:

  1. 1.

    the multitask FL objective (8) with Λ=1.0\Lambda=1.0 and λ=1.0/K\lambda=1.0/K;

  2. 2.

    the mixture FL objective (11), with λ=1.0/K\lambda=1.0/K; and

  3. 3.

    the adaptive personalized FL objective (14), with Λ=1.0\Lambda=1.0 and αm=0.05×K\alpha_{m}=0.05\times K for all m[M]m\in[M],

where KK is the number of labels for each device. When computing the testing accuracy, we only use the local parameters. Note that the choices of hyperparameters in the chosen objectives are purely heuristic. Our purpose is to demonstrate the convergence properties of the proposed algorithms on the training loss. Thus, it is possible that a smaller training loss does not necessarily imply better testing accuracy (or generalization ability). How to choose hyperparameters optimally is not the focus of this paper and requires further research.

Tuning parameters of proposed algorithms. For LSGD-PFL (Algorithm 1), we set the batch size to compute the stochastic gradient B=1B=1 and set the average period τ=5\tau=5. For pwp_{w} in ASCD-PFL (Algorithm 5) and ASVRCD-PFL (Algorithm 4), we set it as pw=w/(β+w)p_{w}=\mathcal{L}^{w}/(\mathcal{L}^{\beta}+\mathcal{L}^{w}). For the objective FMT2F_{MT2} in (8), we set w=(Λ+λ)/M\mathcal{L}^{w}=(\Lambda\mathcal{L}^{\prime}+\lambda)/M and β=+λ\mathcal{L}^{\beta}=\mathcal{L}^{\prime}+\lambda; for the objective FMX2F_{MX2} in (11), we set w=λ/M\mathcal{L}^{w}=\lambda/M and β=+λ\mathcal{L}^{\beta}=\mathcal{L}^{\prime}+\lambda; for the objective FAPFL2F_{APFL2} in (14), we set w=(Λ+max1mMαm2)/M\mathcal{L}^{w}=(\Lambda+\max_{1\leq m\leq M}\alpha_{m}^{2})\mathcal{L}^{\prime}/M and β=(1max1mMαm)2/M\mathcal{L}^{\beta}=(1-\max_{1\leq m\leq M}\alpha_{m})^{2}\mathcal{L}^{\prime}/M. We set =1.0\mathcal{L}^{\prime}=1.0 for all objectives. We set ρ=pw/n\rho=p_{w}/n for ASCD-PFL and ASVRCD-PFL. For η,θ2,γ,ν\eta,\theta_{2},\gamma,\nu and θ1\theta_{1} in ASVRCD-PFL, we set them according to Theorem 9, where =2max{w/pw,β/pβ}\mathcal{L}=2\max\{\mathcal{L}^{w}/p_{w},\mathcal{L}^{\beta}/p_{\beta}\}, ρ=pw/n\rho=p_{w}/n, and μ=μ/(3M)\mu=\mu^{\prime}/(3M). We set μ=0.01\mu^{\prime}=0.01. Strictly speaking, since the dimension of the iterates exceeds the sample size, the objective is convex but not strongly convex, so μ=0\mu^{\prime}=0; our choice of μ\mu^{\prime} is therefore a heuristic aimed at improving the numerical behavior of the algorithms. The η,ν,γ,ρ\eta,\nu,\gamma,\rho in ASCD-PFL are the same as in ASVRCD-PFL, and we let θ=min{0.8,1/η}\theta=\min\{0.8,1/\eta\}. In addition, we initialize all iterates at zero for all algorithms.

Results. The results are summarized in Figure 3, Figure 4, and Figure 5 for K=2,4,8K=2,4,8, respectively. We observe that ASCD-PFL outperforms the widely used LSGD-PFL. We also observe that ASVRCD-PFL converges slowly when minimizing the training loss. Since we are working in the overparametrized regime, where μ=0\mu^{\prime}=0, the assumptions of our theory are violated. As a result, it is more advisable to use ASCD-PFL during the initial phase of training and switch to ASVRCD-PFL once the iterates get closer to the optimum.

6.2 Subsampling of the Global and Local Parameters

Figure 6: Communication complexity for different choices of pwp_{w}: (a) training loss; (b) validation accuracy. The theoretical choice based on Theorem 9 can provide the best communication complexity. Specifically, the theoretical choice of pwp_{w} results in pw=0.5p_{w}=0.5 for FMT2F_{MT2}, pw=0.25p_{w}=0.25 for FMX2F_{MX2}, and pw=0.55p_{w}=0.55 for FAPFL2F_{APFL2}. Different rows correspond to different objective functions, and columns correspond to different datasets.
Figure 7: Effect of reparametrization of the global space in ASCD-PFL: (a) training loss; (b) validation accuracy. Reparametrization generally helps achieve faster convergence in minimizing training loss and consistently improves testing accuracy. Different rows correspond to different objective functions, and columns correspond to different datasets.

We demonstrate that the choice of pwp_{w} based on Theorem 9, specifically setting pw=w/(w+β)p_{w}={\cal L}^{w}/({\cal L}^{w}+{\cal L}^{\beta}), results in the best communication complexity of ASCD-PFL. More precisely, based on Theorem 9, we set the learning rate η=1/(4)\eta=1/(4\mathcal{L}), where 2max{w/pw,β/pβ}{\cal L}\coloneqq 2\max\left\{{\cal L}^{w}/p_{w},{\cal L}^{\beta}/p_{\beta}\right\}. The expressions of w{\cal L}^{w} and β{\cal L}^{\beta} for FMT2F_{MT2}, FMX2F_{MX2}, and FAPFL2F_{APFL2} are stated in Lemma 1, Lemma 2, Lemma 3, and also restated in the previous section, where {\cal L}^{\prime} is 11 after normalization. We set ρ=pw/n\rho=p_{w}/n. We compare the performance of ASCD-PFL using the theoretically suggested pwp_{w} with other choices of pw{0.1,0.3,0.5,0.7,0.9}p_{w}\in\{0.1,0.3,0.5,0.7,0.9\}. We fix those parameters that are independent of pwp_{w}. See more details about the choice of tuning parameters in the previous section.

We plot the loss against the number of communication rounds, which illustrates the communication complexity. The number of classes for each device is K=2K=2. The results are summarized in Figure 6, which also includes the test accuracy. We observe that choosing pwp_{w} based on theoretical considerations leads to the best communication complexity.

6.3 Effect of Reparametrization in ASCD-PFL

We demonstrate the importance of reparametrization of the global parameter ww by rescaling ww by a factor of M12M^{-\frac{1}{2}}. We run reparameterized and non-reparameterized ASCD-PFL across different objectives and datasets. We set the number of classes for each device K=2K=2. The results are summarized in Figure 7. For the training loss, we observe that reparametrization improves the convergence of ASCD-PFL, except for the APFL2 objective (14), where the non-reparametrized variant performs slightly better in three of the four datasets. On the other hand, when considering the testing accuracy, reparametrization always helps improve the results, indicating that reparametrization can help prevent overfitting. Based on the experiment, we suggest always using reparametrization to ensure the scale of the learning rate is appropriate for both global and local parameters.

6.4 Potential Benefits of Extended Objectives

We empirically justify the potential benefits of our extended objectives. More specifically, we vary the relaxation parameter Λ\Lambda in (7) to show the change in performance. Since the multitask objective proposed by Li et al. (2020) is equivalent to setting Λ\Lambda\to\infty, our main interest is to explore whether a larger Λ\Lambda always implies better performance. We set K=4K=4, n=30n=30, n=100n^{\prime}=100, M=20M=20, and λ=0.1\lambda=0.1. The remaining settings are the same as in Section 6.1. We vary Λ\Lambda over the values {1.0,10.0,100.0}\{1.0,10.0,100.0\}, and the resulting performance is shown in Figure 8. The plot shows that the performance slightly improves as Λ\Lambda increases from 1.01.0 to 10.010.0, but then drops when Λ=100.0\Lambda=100.0. This result suggests that by selecting an appropriate Λ\Lambda, it is possible to achieve better empirical performance. Furthermore, although proposing new personalized FL objectives is not the main focus of this paper, the above empirical result suggests the potential benefits of a general framework.

Figure 8: Empirical exploration of the effect of Λ\Lambda on the performance of the objective in (7).

7 Conclusions and Directions for Future Research

We proposed a general convex optimization theory for personalized FL. While our work answers a range of important questions, there are many directions in which our work can be extended in the future, such as partial participation, minimax optimal rates for specific personalized FL objectives, brand new personalized FL objectives, and non-convex theory.

Partial participation and client sampling. An essential aspect of FL that is not covered in this work is partial participation, or client sampling, where one has access to only a subset of devices at each iteration. While we did not cover partial participation and focused on answering orthogonal questions, we believe that it should be considered when extending our results in the future. When clients are sampled uniformly at random, the theorems in this paper should extend without much difficulty; a more interesting question is how to sample clients from a non-uniform distribution to speed up the convergence. We leave this problem for future study.

Minimax optimal rates for specific personalized FL objectives. As outlined in Section 1.2, one cannot hope for the general optimization framework to be minimax optimal in every single special case. Consequently, there is still a need to keep exploring the optimization aspects of individual personalized FL objectives as one might come up with a more efficient optimizer that exploits the specific structure not covered by Assumption 1 or Assumption 2.

Brand new personalized FL objectives. While in this work we propose a couple of novel personalized FL objectives obtained as an extension of known objectives, we believe that seeing personalized FL as an instance of (1) might lead to the development of brand new approaches for personalized FL.

Non-convex theory. In this work, we have focused on a general convex optimization theory for personalized FL. Our convex rates are meaningful – they are minimax optimal and correspond to the empirical convergence. However, an inherent drawback of such an approach is the inability to cover non-convex FL approaches, such as MAML (see Section 2.8), or non-convex FL models. We believe that obtaining minimax optimal rates in the non-convex world would be very valuable.

References

  • Agarwal et al. (2020) A. Agarwal, J. Langford, and C.-Y. Wei. Federated residual learning. arXiv preprint arXiv:2003.12880, 2020.
  • Allen-Zhu et al. (2016) Z. Allen-Zhu, Z. Qu, P. Richtárik, and Y. Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, 2016.
  • Arivazhagan et al. (2019) M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
  • Chen et al. (2018) F. Chen, M. Luo, Z. Dong, Z. Li, and X. He. Federated meta-learning with fast convergence and efficient communication. arXiv preprint arXiv:1802.07876, 2018.
  • Clanuwat et al. (2018) T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718, 2018.
  • Dean et al. (2012) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.
  • Defazio et al. (2014) A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
  • Deng (2012) L. Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • Deng et al. (2020) Y. Deng, M. M. Kamani, and M. Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
  • Dinh et al. (2020) C. T. Dinh, N. H. Tran, and T. D. Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems, 2020.
  • Fallah et al. (2020) A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
  • Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
  • Gorbunov et al. (2021) E. Gorbunov, F. Hanzely, and P. Richtárik. Local SGD: unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, 2021.
  • Haddadpour & Mahdavi (2019) F. Haddadpour and M. Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
  • Hanzely & Richtárik (2019) F. Hanzely and P. Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In International Conference on Artificial Intelligence and Statistics, 2019.
  • Hanzely & Richtárik (2020) F. Hanzely and P. Richtárik. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020.
  • Hanzely et al. (2020a) F. Hanzely, S. Hanzely, S. Horváth, and P. Richtárik. Lower bounds and optimal algorithms for personalized federated learning. In Advances in Neural Information Processing Systems, 2020a.
  • Hanzely et al. (2020b) F. Hanzely, D. Kovalev, and P. Richtárik. Variance reduced coordinate descent with acceleration: New method with a surprising application to finite-sum problems. In International Conference on Machine Learning, 2020b.
  • Hard et al. (2018) A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
  • Hendrikx et al. (2021) H. Hendrikx, F. Bach, and L. Massoulie. An optimal algorithm for decentralized finite-sum optimization. SIAM Journal on Optimization, 31(4):2753–2783, 2021.
  • Jiang et al. (2019) Y. Jiang, J. Konečný, K. Rush, and S. Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019.
  • Johnson & Zhang (2013) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
  • Kairouz et al. (2021) P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Khodak et al. (2019) M. Khodak, M. Balcan, and A. Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019.
  • Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Kulkarni et al. (2020) V. Kulkarni, M. Kulkarni, and A. Pant. Survey of personalization techniques for federated learning. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pp.  794–797, 2020.
  • Lan & Zhou (2018) G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Mathematical programming, 171(1-2):167–215, 2018.
  • Li et al. (2020) T. Li, S. Hu, A. Beirami, and V. Smith. Federated multi-task learning for competing constraints. arXiv preprint arXiv:2012.04221, 2020.
  • Liang et al. (2020) P. P. Liang, T. Liu, L. Ziyin, R. Salakhutdinov, and L.-P. Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
  • Lin et al. (2020) S. Lin, G. Yang, and J. Zhang. A collaborative learning framework via federated meta-learning. In International Conference on Distributed Computing Systems, 2020.
  • McMahan et al. (2017) B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.
  • McMahan et al. (2016) H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas. Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629, 2016.
  • Nemirovskij & Yudin (1983) A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, 1983.
  • Nesterov (2012) Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • Nesterov (1983) Y. Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2)o(1/k^{2}). In Doklady Akademii Nauk, volume 269, pp.  543–547, 1983.
  • Nesterov (2018) Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.
  • Nesterov & Stich (2017) Y. Nesterov and S. U. Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.
  • Rajeswaran et al. (2019) A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 2019.
  • Scaman et al. (2018) K. Scaman, F. R. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, 2018.
  • Smith et al. (2017) V. Smith, C. Chiang, M. Sanjabi, and A. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017.
  • Stich (2019) S. U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.
  • Wang et al. (2018) W. Wang, J. Wang, M. Kolar, and N. Srebro. Distributed stochastic multi-task learning with graph regularization. arXiv preprint arXiv:1802.03830, 2018.
  • Woodworth & Srebro (2016) B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, 2016.
  • Woodworth et al. (2018) B. E. Woodworth, J. Wang, A. D. Smith, B. McMahan, and N. Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems, 2018.
  • Wu et al. (2021) R. Wu, A. Scaglione, H. Wai, N. Karakoç, K. Hreinsson, and W. Ma. Federated block coordinate descent scheme for learning global and personalized models. In AAAI Conference on Artificial Intelligence, 2021.
  • Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Appendix A Additional Algorithms Used in Simulations

In this section, we detail algorithms that are used in Section 5. We detail ASCD-PFL in Algorithm 5. ASCD-PFL is a simplified version of ASVRCD-PFL that does not incorporate control variates. SCD-PFL is detailed in Algorithm 6. SCD-PFL is a simplified version of ASCD-PFL that does not incorporate the control variates or Nesterov’s acceleration. SVRCD-PFL is detailed in Algorithm 7. SVRCD-PFL is a simplified version of ASVRCD-PFL that does not incorporate Nesterov’s acceleration.

Algorithm 5 ASCD-PFL
0:  0<θ<10<\theta<1, η,ν,γ>0\eta,\nu,\gamma>0, ρ(0,1)\rho\in(0,1), pw(0,1)p_{w}\in(0,1), pβ=1pwp_{\beta}=1-p_{w}, wy0=wz0d0w_{y}^{0}=w_{z}^{0}\in\mathbb{R}^{d_{0}}, βy,m0=βz,m0dm\beta_{y,m}^{0}=\beta_{z,m}^{0}\in\mathbb{R}^{d_{m}} for 1mM1\leq m\leq M. 
  for k=0,1,2,k=0,1,2,\dots do
     wxk=θwzk+(1θ)wykw_{x}^{k}=\theta w_{z}^{k}+(1-\theta)w_{y}^{k}
     for m=1,,Mm=1,\dots,M in parallel do
        βx,mk=θβz,mk+(1θ)βy,mk\beta_{x,m}^{k}=\theta\beta_{z,m}^{k}+(1-\theta)\beta_{y,m}^{k}
     end for
     Sample random j{1,2,,n}j\in\{1,2,\dots,n\} and ζ={1 with probability pw2 with probability pβ\zeta=\begin{cases}1&\text{ with probability }p_{w}\\ 2&\text{ with probability }p_{\beta}\end{cases}
     if ζ=1\zeta=1 then
        gwk=1pwMm=1Mwfm,j(wxk,βx,mk)g_{w}^{k}=\frac{1}{p_{w}M}\sum_{m=1}^{M}\nabla_{w}f_{m,j}(w_{x}^{k},\beta_{x,m}^{k})
        wyk+1=wxkηgwkw_{y}^{k+1}=w_{x}^{k}-\eta g_{w}^{k}
        wzk+1=νwzk+(1ν)wxk+γη(wyk+1wxk)w_{z}^{k+1}=\nu w_{z}^{k}+(1-\nu)w_{x}^{k}+\frac{\gamma}{\eta}(w_{y}^{k+1}-w_{x}^{k})
        wvk+1={wyk, with probability ρwvk, with probability 1ρw_{v}^{k+1}=\begin{cases}w_{y}^{k},&\text{ with probability }\rho\\ w_{v}^{k},&\text{ with probability }1-\rho\\ \end{cases}
     else
        for m=1,,Mm=1,\dots,M in parallel do
           gβ,mk=1pβMβfm,j(wxk,βx,mk)g_{\beta,m}^{k}=\frac{1}{p_{\beta}M}\nabla_{\beta}f_{m,j}(w_{x}^{k},\beta_{x,m}^{k})
           βy,mk+1=βx,mkηgβ,mk\beta_{y,m}^{k+1}=\beta_{x,m}^{k}-\eta g_{\beta,m}^{k}
           βz,mk+1=νβz,mk+(1ν)βx,mk+γη(βy,mk+1βx,mk)\beta_{z,m}^{k+1}=\nu\beta_{z,m}^{k}+(1-\nu)\beta_{x,m}^{k}+\frac{\gamma}{\eta}(\beta_{y,m}^{k+1}-\beta_{x,m}^{k})
           βv,mk+1={βy,mk, with probability ρβv,mk, with probability 1ρ\beta_{v,m}^{k+1}=\begin{cases}\beta_{y,m}^{k},&\text{ with probability }\rho\\ \beta_{v,m}^{k},&\text{ with probability }1-\rho\\ \end{cases}
        end for
     end if
  end for

Algorithm 6 SCD-PFL
0:  η>0\eta>0, pw(0,1)p_{w}\in(0,1), pβ=1pwp_{\beta}=1-p_{w}, w0dw^{0}\in\mathbb{R}^{d}, βm0d\beta_{m}^{0}\in\mathbb{R}^{d} for 1mM1\leq m\leq M. 
  for k=0,1,2,K1k=0,1,2,\dots K-1 do
     Sample random jm{1,2,,nm}j_{m}\in\{1,2,\dots,n_{m}\} for 1mM1\leq m\leq M and ζ={1 with probability pw2 with probability pβ\zeta=\begin{cases}1&\text{ with probability }\,p_{w}\\ 2&\text{ with probability }\,p_{\beta}\end{cases}
     gwk={1pwMm=1Mwfm,jm(wk,βmk) if ζ=10 if ζ=2g_{w}^{k}=\begin{cases}\frac{1}{p_{w}M}\sum_{m=1}^{M}\nabla_{w}f_{m,j_{m}}(w^{k},\beta_{m}^{k})&\text{ if }\zeta=1\\ 0&\text{ if }\zeta=2\end{cases}
     wk+1=wkηgwkw^{k+1}=w^{k}-\eta g_{w}^{k}
     for m=1,,Mm=1,\dots,M in parallel do
        gβ,mk={0 if ζ=11pβMβfm,jm(wk,βmk) if ζ=2g_{\beta,m}^{k}=\begin{cases}0&\text{ if }\zeta=1\\ \frac{1}{p_{\beta}M}\nabla_{\beta}f_{m,j_{m}}(w^{k},\beta_{m}^{k})&\text{ if }\zeta=2\end{cases}
        βmk+1=βmkηgβ,mk\beta_{m}^{k+1}=\beta_{m}^{k}-\eta g_{\beta,m}^{k}
     end for
  end for
  wKw^{K}, βmK\beta^{K}_{m} for 1mM1\leq m\leq M.
Algorithm 7 SVRCD-PFL
0:  η>0\eta>0, pw(0,1)p_{w}\in(0,1), pβ=1pwp_{\beta}=1-p_{w}, ρ(0,1)\rho\in(0,1), wy0=wv0dw_{y}^{0}=w_{v}^{0}\in\mathbb{R}^{d}, βy,m0=βv,m0d\beta_{y,m}^{0}=\beta_{v,m}^{0}\in\mathbb{R}^{d} for 1mM1\leq m\leq M. 
  for k=0,1,2,K1k=0,1,2,\dots K-1 do
     Sample random jm{1,2,,nm}j_{m}\in\{1,2,\dots,n_{m}\} for 1mM1\leq m\leq M and ζ={1 with probability pw2 with probability pβ\zeta=\begin{cases}1&\text{ with probability }p_{w}\\ 2&\text{ with probability }p_{\beta}\end{cases}
      g_{w}^{k}=\begin{cases}\frac{1}{p_{w}M}\sum_{m=1}^{M}\left(\nabla_{w}f_{m,j_{m}}(w_{y}^{k},\beta_{y,m}^{k})-\nabla_{w}f_{m,j_{m}}(w_{v}^{k},\beta_{v,m}^{k})\right)+\nabla_{w}F(w_{v}^{k},\beta_{v}^{k})&\text{ if }\zeta=1\\ \nabla_{w}F(w_{v}^{k},\beta_{v}^{k})&\text{ if }\zeta=2\end{cases}
     wyk+1=wykηgwkw^{k+1}_{y}=w^{k}_{y}-\eta g_{w}^{k}
     wvk+1={wyk, with probability ρwvk, with probability 1ρw_{v}^{k+1}=\begin{cases}w_{y}^{k},&\text{ with probability }\rho\\ w_{v}^{k},&\text{ with probability }1-\rho\\ \end{cases}
     for m=1,,Mm=1,\dots,M in parallel do
         g_{\beta,m}^{k}=\begin{cases}\frac{1}{M}\nabla_{\beta}f_{m}(w_{v}^{k},\beta_{v,m}^{k})&\text{ if }\zeta=1\\ \frac{1}{p_{\beta}M}\left(\nabla_{\beta}f_{m,j_{m}}(w_{y}^{k},\beta_{y,m}^{k})-\nabla_{\beta}f_{m,j_{m}}(w_{v}^{k},\beta_{v,m}^{k})\right)+\frac{1}{M}\nabla_{\beta}f_{m}(w_{v}^{k},\beta_{v,m}^{k})&\text{ if }\zeta=2\end{cases}
        βy,mk+1=βy,mkηgβ,mk\beta_{y,m}^{k+1}=\beta_{y,m}^{k}-\eta g_{\beta,m}^{k}
        βv,mk+1={βy,mk, with probability ρβv,mk, with probability 1ρ\beta_{v,m}^{k+1}=\begin{cases}\beta_{y,m}^{k},&\text{ with probability }\rho\\ \beta_{v,m}^{k},&\text{ with probability }1-\rho\\ \end{cases}
     end for
  end for
  wyKw_{y}^{K}, βy,mK\beta^{K}_{y,m} for 1mM1\leq m\leq M.

Appendix B Technical Proofs

Throughout this section, we use 𝑰d{\bm{I}}_{d^{\prime}} to denote the d×dd^{\prime}\times d^{\prime} identity matrix, 0d1×d20_{d^{\prime}_{1}\times d^{\prime}_{2}} to denote the d1×d2d^{\prime}_{1}\times d^{\prime}_{2} zero matrix, and 𝟏d{\bf 1}_{d^{\prime}}\in\mathbb{R}^{d^{\prime}} to denote the vector of ones.

B.1 Proof of Lemma 1

To show the strong convexity, we shall verify the positive definiteness of

2FMT2(w,β)λ2M𝑰d(M+1)\displaystyle\nabla^{2}F_{MT2}(w,\beta)-\frac{\lambda}{2M}{\bm{I}}_{d(M+1)}
=(ΛMF(w)+λMIdλM32(𝟏MId)λM32(𝟏MId)λM(ImId)+Diag(2f1(β1),,2fM(βM)))λ2M𝑰d(M+1)\displaystyle\quad=\begin{pmatrix}\frac{\Lambda}{M}\nabla F^{\prime}(w)+\frac{\lambda}{M}I_{d}&-\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}^{\top}\otimes I_{d})\\ -\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}\otimes I_{d})&\frac{\lambda}{M}(I_{m}\otimes I_{d})+\mathrm{Diag}(\nabla^{2}f^{\prime}_{1}(\beta_{1}),\dots,\nabla^{2}f^{\prime}_{M}(\beta_{M}))\end{pmatrix}-\frac{\lambda}{2M}{\bm{I}}_{d(M+1)}
((ΛμM+λ2M)IdλM32(𝟏MId)λM32(𝟏MId)(λ2M+μM)(ImId))\displaystyle\quad\succeq\begin{pmatrix}\left(\frac{\Lambda\mu^{\prime}}{M}+\frac{\lambda}{2M}\right)I_{d}&-\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}^{\top}\otimes I_{d})\\ -\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}\otimes I_{d})&\left(\frac{\lambda}{2M}+\frac{\mu^{\prime}}{M}\right)(I_{m}\otimes I_{d})\end{pmatrix}
=1M(Λμ+λ2λM12𝟏MλM12𝟏M(λ2+2μ)Im)𝑴Id.\displaystyle\quad=\frac{1}{M}\underbrace{\begin{pmatrix}\Lambda\mu^{\prime}+\frac{\lambda}{2}&-\frac{\lambda}{M^{\frac{1}{2}}}{\bf 1}_{M}^{\top}\\ -\frac{\lambda}{M^{\frac{1}{2}}}{\bf 1}_{M}&\left(\frac{\lambda}{2}+2\mu^{\prime}\right)I_{m}\end{pmatrix}}_{\coloneqq{\bm{M}}}\otimes I_{d}.

Note that 𝑴{\bm{M}} can be written as a sum of MM matrices, each of them having

𝑴m=(Λμ+λ2MλM12λM12(λ2+2μ)){\bm{M}}_{m}=\begin{pmatrix}\frac{\Lambda\mu^{\prime}+\frac{\lambda}{2}}{M}&-\frac{\lambda}{M^{\frac{1}{2}}}\\ -\frac{\lambda}{M^{\frac{1}{2}}}&\left(\frac{\lambda}{2}+2\mu^{\prime}\right)\end{pmatrix}

as a 2×22\times 2 submatrix occupying rows and columns 11 and m+1m+1, with zeros everywhere else. To verify the positive semidefiniteness of 𝑴m{\bm{M}}_{m}, since its diagonal entries are positive, it suffices to show that the determinant of this 2×22\times 2 block is nonnegative:

det(𝑴m)=1M((Λμ+λ2)(λ2+2μ)λ2)1M((2λ)(λ2+2μ)λ2)0\text{det}({\bm{M}}_{m})=\frac{1}{M}\left(\left(\Lambda\mu^{\prime}+\frac{\lambda}{2}\right)\left(\frac{\lambda}{2}+2\mu^{\prime}\right)-\lambda^{2}\right)\geq\frac{1}{M}\left(\left(2\lambda\right)\left(\frac{\lambda}{2}+2\mu^{\prime}\right)-\lambda^{2}\right)\geq 0

as desired. Verifying the smoothness constants is straightforward.
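As a quick numerical sanity check of the matrix bound above (not part of the proof), one can instantiate {\bm{M}} for illustrative, hypothetical parameter values for which the determinant step applies (i.e., \Lambda\mu^{\prime}+\frac{\lambda}{2}\geq 2\lambda) and verify that its smallest eigenvalue is nonnegative.

import numpy as np

# Hypothetical parameters with Lambda * mu' + lambda / 2 >= 2 * lambda.
lam, mu_p, Lam, M = 1.0, 0.5, 4.0, 5

# Assemble the (M + 1) x (M + 1) matrix M from the last display of the proof.
A = np.zeros((M + 1, M + 1))
A[0, 0] = Lam * mu_p + lam / 2.0
A[0, 1:] = -lam / np.sqrt(M)
A[1:, 0] = -lam / np.sqrt(M)
A[1:, 1:] = (lam / 2.0 + 2.0 * mu_p) * np.eye(M)

# Positive semidefiniteness corresponds to a (numerically) nonnegative spectrum.
print(np.linalg.eigvalsh(A).min())  # approximately 0.88 > 0 for these values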

B.2 Proof of Lemma 2

We have

\displaystyle\nabla^{2}F_{MFL2}(w,\beta)-\frac{\mu^{\prime}}{3M}{\bm{I}}_{d(M+1)}
\displaystyle=\begin{pmatrix}\frac{\lambda}{M}I_{d}&-\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}^{\top}\otimes I_{d})\\ -\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}\otimes I_{d})&\frac{\lambda}{M}(I_{M}\otimes I_{d})+\mathrm{Diag}(\nabla^{2}f^{\prime}_{1}(\beta_{1}),\dots,\nabla^{2}f^{\prime}_{M}(\beta_{M}))\end{pmatrix}-\frac{\mu^{\prime}}{3M}{\bm{I}}_{d(M+1)}
\displaystyle\succeq\begin{pmatrix}\left(\frac{\lambda}{M}-\frac{\mu^{\prime}}{3M}\right)I_{d}&-\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}^{\top}\otimes I_{d})\\ -\frac{\lambda}{M^{\frac{3}{2}}}({\bf 1}_{M}\otimes I_{d})&\left(\frac{\lambda}{M}+\frac{2\mu^{\prime}}{3M}\right)(I_{M}\otimes I_{d})\end{pmatrix}
\displaystyle=\frac{1}{M}\underbrace{\begin{pmatrix}\lambda-\frac{\mu^{\prime}}{3}&-\frac{\lambda}{M^{\frac{1}{2}}}{\bf 1}_{M}^{\top}\\ -\frac{\lambda}{M^{\frac{1}{2}}}{\bf 1}_{M}&\left(\lambda+\frac{2\mu^{\prime}}{3}\right)I_{M}\end{pmatrix}}_{\coloneqq{\bm{M}}}\otimes I_{d}.

Note that {\bm{M}} can be written as a sum of M matrices {\bm{M}}_{1},\dots,{\bm{M}}_{M}, where {\bm{M}}_{m} has \frac{\lambda-\frac{\mu^{\prime}}{3}}{M} at position (1,1), -\frac{\lambda}{M^{\frac{1}{2}}} at positions (1,m+1) and (m+1,1), \lambda+\frac{2\mu^{\prime}}{3} at position (m+1,m+1), and zeros everywhere else. Using the assumption \mu^{\prime}\leq\frac{\lambda}{2}, it is easy to see that each of these matrices is positive semidefinite, and thus so is {\bm{M}}. Consequently, \nabla^{2}F_{MFL2}(w,\beta)-\frac{\mu^{\prime}}{3M}{\bm{I}}_{d(M+1)} is positive semidefinite and thus F_{MFL2} is jointly \frac{\mu^{\prime}}{3M}-strongly convex. Verifying the smoothness constants is straightforward.

B.3 Proof of Lemma 3

Let xm=(1αm)βm+αmM12wx_{m}=(1-\alpha_{m})\beta_{m}+\alpha_{m}M^{-\frac{1}{2}}w for notational simplicity. We have

\displaystyle\nabla^{2}f_{m}(w,\beta_{m})=\begin{pmatrix}\frac{\Lambda}{M}\nabla^{2}f^{\prime}(M^{-\frac{1}{2}}w)+\frac{\alpha^{2}_{m}}{M}\nabla^{2}f_{m}^{\prime}(x_{m})&\frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}\nabla^{2}f_{m}^{\prime}(x_{m})\\ \frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}\nabla^{2}f_{m}^{\prime}(x_{m})&(1-\alpha_{m})^{2}\nabla^{2}f_{m}^{\prime}(x_{m})\end{pmatrix}
\displaystyle=\begin{pmatrix}\frac{\Lambda}{M}\nabla^{2}f^{\prime}(M^{-\frac{1}{2}}w)&0_{d\times d}\\ 0_{d\times d}&0_{d\times d}\end{pmatrix}+\begin{pmatrix}\frac{\alpha_{m}^{2}}{M}&\frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}\\ \frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}&(1-\alpha_{m})^{2}\end{pmatrix}\otimes\nabla^{2}f_{m}^{\prime}(x_{m})
\displaystyle\succeq\begin{pmatrix}\frac{\Lambda\mu^{\prime}}{M}{\bm{I}}_{d}&0_{d\times d}\\ 0_{d\times d}&0_{d\times d}\end{pmatrix}+\begin{pmatrix}\frac{\alpha_{m}^{2}}{M}&\frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}\\ \frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}&(1-\alpha_{m})^{2}\end{pmatrix}\otimes\left(\mu^{\prime}{\bm{I}}_{d}\right)
\displaystyle=\mu^{\prime}\underbrace{\begin{pmatrix}\frac{\Lambda+\alpha_{m}^{2}}{M}&\frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}\\ \frac{\alpha_{m}(1-\alpha_{m})}{M^{\frac{1}{2}}}&(1-\alpha_{m})^{2}\end{pmatrix}}_{\coloneqq{\bm{M}}_{m}}\otimes{\bm{I}}_{d}.

Next, we show that

𝑴m((1αm)22M00(1αm)22).{\bm{M}}_{m}\succeq\begin{pmatrix}\frac{(1-\alpha_{m})^{2}}{2M}&0\\ 0&\frac{(1-\alpha_{m})^{2}}{2}\end{pmatrix}. (28)

For that, it suffices to show that

det(𝑴m((1αm)22M00(1αm)22))0,\text{det}\left({\bm{M}}_{m}-\begin{pmatrix}\frac{(1-\alpha_{m})^{2}}{2M}&0\\ 0&\frac{(1-\alpha_{m})^{2}}{2}\end{pmatrix}\right)\geq 0,

which holds since

\displaystyle\text{det}\left({\bm{M}}_{m}-\begin{pmatrix}\frac{(1-\alpha_{m})^{2}}{2M}&0\\ 0&\frac{(1-\alpha_{m})^{2}}{2}\end{pmatrix}\right)=\left(\frac{\Lambda+\alpha_{m}^{2}-\frac{(1-\alpha_{m})^{2}}{2}}{M}\right)\frac{(1-\alpha_{m})^{2}}{2}-\frac{\alpha_{m}^{2}(1-\alpha_{m})^{2}}{M}
\displaystyle\geq\left(\frac{2\alpha_{m}^{2}}{M}\right)\frac{(1-\alpha_{m})^{2}}{2}-\frac{\alpha_{m}^{2}(1-\alpha_{m})^{2}}{M}=0.

Finally, using (28) MM times, it is easy to see that

2FAPFL2(w,β)μ(1αmax)2M𝑰d(M+1)\nabla^{2}F_{APFL2}(w,\beta)\succeq\mu^{\prime}\frac{(1-\alpha_{\max})^{2}}{M}{\bm{I}}_{d(M+1)}

as desired. Verifying the smoothness constants is straightforward.

B.4 Proof of Lemma 5

We have

1Mm=1Mwfm(wmk,βmk)2\displaystyle\frac{1}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\|^{2} 3Mm=1Mwfm(wmk,βmk)wfm(wk,βmk)2\displaystyle\leq\frac{3}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})-\nabla_{w}f_{m}(w^{k},\beta_{m}^{k})\|^{2}
+3Mm=1Mwfm(wk,βmk)wfm(w,β)2\displaystyle\qquad+\frac{3}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w^{k},\beta_{m}^{k})-\nabla_{w}f_{m}(w^{*},\beta^{*})\|^{2}
+3Mm=1Mwfm(w,β)2.\displaystyle\qquad+\frac{3}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w^{*},\beta^{*})\|^{2}.

Then, using Assumption 1, the above display is bounded as

\frac{3(L^{w})^{2}}{M}\sum\limits_{m=1}^{M}\|w_{m}^{k}-w^{k}\|^{2}+\frac{6L^{w}}{M}\sum\limits_{m=1}^{M}D_{f_{m}}((w^{k},\beta_{m}^{k}),(w^{*},\beta^{*}))+3\zeta_{*}^{2}\\ =6L^{w}\left(f(w^{k},\beta^{k})-f(w^{*},\beta^{*})\right)+3(L^{w})^{2}V_{k}+3\zeta_{*}^{2},

which shows (21).

To establish (22), we have

1Mm=1Mwfm(wmk,βmk)2\displaystyle\left\|\frac{1}{M}\sum\limits_{m=1}^{M}\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2} +1M2m=1Mβfm(wmk,βmk)2\displaystyle+\frac{1}{M^{2}}\sum\limits_{m=1}^{M}\left\|\nabla_{\beta}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}
\displaystyle\leq\frac{2}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})-\nabla_{w}f_{m}(w^{k},\beta_{m}^{k})\|^{2}
\displaystyle\qquad+\frac{2}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w^{k},\beta_{m}^{k})-\nabla_{w}f_{m}(w^{*},\beta^{*})\|^{2}
\displaystyle\qquad+\frac{2}{M^{2}}\sum\limits_{m=1}^{M}\left\|\nabla_{\beta}f_{m}(w_{m}^{k},\beta_{m}^{k})-\nabla_{\beta}f_{m}(w^{*},\beta^{*})\right\|^{2}.

Then, using Assumption 1, the above display is bounded as

\frac{2(L^{w})^{2}}{M}\sum\limits_{m=1}^{M}\|w_{m}^{k}-w^{k}\|^{2}+\frac{4L}{M}\sum\limits_{m=1}^{M}D_{f_{m}}((w^{k},\beta_{m}^{k}),(w^{*},\beta^{*}))=4L\left(f(w^{k},\beta^{k})-f(w^{*},\beta^{*})\right)+2(L^{w})^{2}V_{k}.

This completes the proof.

B.5 Proof of Lemma 6

Let us start with establishing (23). We have

1Mm=1M𝔼gw,mk2\displaystyle\frac{1}{M}\sum\limits_{m=1}^{M}\mathbb{E}{\|g_{w,m}^{k}\|^{2}} =1Mm=1M(𝔼gw,mkwfm(wmk,βmk)2+wfm(wmk,βmk)2)\displaystyle=\frac{1}{M}\sum\limits_{m=1}^{M}\left(\mathbb{E}{\|g_{w,m}^{k}-\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\|^{2}}+\|\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\|^{2}\right)
\displaystyle\leq\frac{\sigma^{2}}{B}+\frac{1}{M}\sum\limits_{m=1}^{M}\|\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\|^{2}.

Now (23) follows from an application of (21). Similarly, to show (24), we have

𝔼1Mm=1Mgw,mk2+1M2m=1Mgβ,mk2\displaystyle\mathbb{E}{\left\|\frac{1}{M}\sum\limits_{m=1}^{M}g_{w,m}^{k}\right\|^{2}+\frac{1}{M^{2}}\sum\limits_{m=1}^{M}\left\|g_{\beta,m}^{k}\right\|^{2}}
=𝔼1Mm=1M(gw,mkwfm(wmk,βmk))2+1Mm=1Mwfm(wmk,βmk)2\displaystyle\qquad=\mathbb{E}{\left\|\frac{1}{M}\sum\limits_{m=1}^{M}(g_{w,m}^{k}-\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k}))\right\|^{2}}+\left\|\frac{1}{M}\sum\limits_{m=1}^{M}\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}
+1M2m=1M(𝔼gβ,mkβfm(wmk,βmk)2+βfm(wmk,βmk)2)\displaystyle\qquad\qquad+\frac{1}{M^{2}}\sum\limits_{m=1}^{M}\left(\mathbb{E}{\left\|g_{\beta,m}^{k}-\nabla_{\beta}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}}+\left\|\nabla_{\beta}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}\right)
σ2MB+1Mm=1Mwfm(wmk,βmk)2\displaystyle\qquad\leq\frac{\sigma^{2}}{MB}+\left\|\frac{1}{M}\sum\limits_{m=1}^{M}\nabla_{w}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}
+σ2MB+1M2m=1Mβfm(wmk,βmk)2.\displaystyle\qquad\qquad+\frac{\sigma^{2}}{MB}+\frac{1}{M^{2}}\sum\limits_{m=1}^{M}\left\|\nabla_{\beta}f_{m}(w_{m}^{k},\beta_{m}^{k})\right\|^{2}.

Now (24) follows from (22), which completes the proof.

B.6 Proof of Lemma 7

The proof is identical to the proof of Lemma E.1 from (Gorbunov et al., 2021) with a single difference – using inequality (23) instead of Assumption E.1 from (Gorbunov et al., 2021). We omit the details.

B.7 Proof of Theorem 3 and Theorem 4

We start by introducing additional notation. We set k_{p}=p\tau, where \tau\in\mathbb{N}^{+} is the length of the averaging period, and let v_{p}=p\tau+\tau-1=k_{p+1}-1. Denote the total number of iterations by K and assume that K=k_{\bar{p}} for some \bar{p}\in\mathbb{N}^{+}. The final result is set to be \hat{w}=w^{K} and \hat{\beta}_{m}=\beta^{K}_{m} for all m\in[M]. We assume that the solution to (1) is w^{*},\beta^{*}_{1},\dots,\beta^{*}_{M} and that the optimal value is f^{*}. Let w^{k}=\frac{1}{M}\sum^{M}_{m=1}w^{k}_{m} for all k. Note that this quantity is not actually computed in practice unless k=k_{p} for some p\in\mathbb{N}, in which case w^{k_{p}}=w^{k_{p}}_{m} for all m\in[M]. In addition, let \xi^{k}_{m}=\{\xi^{k}_{1,m},\xi^{k}_{2,m},\dots,\xi^{k}_{B,m}\} and \xi^{k}=\{\xi^{k}_{1},\xi^{k}_{2},\dots,\xi^{k}_{M}\}.

Let θm=((wm),(βm))\theta_{m}=((w_{m})^{\top},(\beta_{m})^{\top})^{\top}, θmk=((wmk),(βmk))\theta^{k}_{m}=((w^{k}_{m})^{\top},(\beta^{k}_{m})^{\top})^{\top}, θm=((w),(βm))\theta^{*}_{m}=((w^{*})^{\top},(\beta^{*}_{m})^{\top})^{\top} and θ^mk=((wk),(βmk))\hat{\theta}^{k}_{m}=((w^{k})^{\top},(\beta^{k}_{m})^{\top})^{\top}. Let

gmk=1Bf^m(wmk,βmk;ξmk),g^{k}_{m}=\frac{1}{B}\nabla\hat{f}_{m}(w^{k}_{m},\beta^{k}_{m};\xi^{k}_{m}), (29)

where

f^m(wmk,βmk;ξmk)=j=1Bf^m(wmk,βmk;ξj,mk).\nabla\hat{f}_{m}(w^{k}_{m},\beta^{k}_{m};\xi^{k}_{m})=\sum^{B}_{j=1}\nabla\hat{f}_{m}(w^{k}_{m},\beta^{k}_{m};\xi^{k}_{j,m}).

We assume that the gradient is unbiased, that is,

𝔼[gmk]=fm(wmk,βmk).\mathbb{E}\left[g^{k}_{m}\right]=\nabla f_{m}(w^{k}_{m},\beta^{k}_{m}).

Let

gm,1k=1Bwf^m(wmk,βmk;ξmk),gm,2k=1Bβmf^m(wmk,βmk;ξmk),\displaystyle g^{k}_{m,1}=\frac{1}{B}\nabla_{w}\hat{f}_{m}(w^{k}_{m},\beta^{k}_{m};\xi^{k}_{m}),\qquad g^{k}_{m,2}=\frac{1}{B}\nabla_{\beta_{m}}\hat{f}_{m}(w^{k}_{m},\beta^{k}_{m};\xi^{k}_{m}), (30)

so that gmk=((gm,1k),(gm,2k))g^{k}_{m}=((g^{k}_{m,1})^{\top},(g^{k}_{m,2})^{\top})^{\top}. We update the parameters by

(wmk+1,βmk+1)=(wmk,βmk)ηkgmk.(w^{k+1}_{m},\beta^{k+1}_{m})=(w^{k}_{m},\beta^{k}_{m})-\eta_{k}g^{k}_{m}.

In addition, we define

hk=1Mm=1Mgm,1k,Vk=1Mm=1Mwmkwk2.\displaystyle h^{k}=\frac{1}{M}\sum^{M}_{m=1}g^{k}_{m,1},\qquad V^{k}=\frac{1}{M}\sum^{M}_{m=1}\|w^{k}_{m}-w^{k}\|^{2}.

Then wk+1=wkηkhkw^{k+1}=w^{k}-\eta_{k}h^{k} for all kk.
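The update and averaging pattern just described can be summarized in code. The following is a minimal sketch, assuming access to a per-device stochastic gradient oracle; stoch_grad and the problem data are hypothetical and stand in for the mini-batch gradients in (29)-(30).

import numpy as np

def local_sgd_personalized(stoch_grad, w0, beta0, eta, tau, K, rng=None):
    """Sketch of the scheme above: each device runs SGD on (w_m, beta_m) and the
    shared parameters are averaged every tau iterations.

    stoch_grad(m, w_m, beta_m, rng) -> (g_w, g_beta)   # assumed oracle, cf. (30)
    eta may be a constant or a callable k -> eta_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    M = beta0.shape[0]
    w = np.tile(w0, (M, 1))                  # w_m^k, one copy of the shared part per device
    beta = beta0.copy()                      # beta_m^k
    step = eta if callable(eta) else (lambda k: eta)
    for k in range(K):
        for m in range(M):                   # executed in parallel across devices
            g_w, g_b = stoch_grad(m, w[m], beta[m], rng)
            w[m] = w[m] - step(k) * g_w
            beta[m] = beta[m] - step(k) * g_b
        if (k + 1) % tau == 0:               # communication: average the shared parameters
            w[:] = w.mean(axis=0)            # w_m <- w = (1/M) sum_m w_m
    return w[0], beta                        # hat{w} = w^K, hat{beta}_m = beta_m^K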

We denote the Bregman divergence associated with fmf_{m} for θm\theta_{m} and θ¯m\bar{\theta}_{m} as

D_{f_{m}}(\theta_{m},\bar{\theta}_{m})\coloneqq f_{m}(\theta_{m})-f_{m}(\bar{\theta}_{m})-\langle\nabla f_{m}(\bar{\theta}_{m}),\theta_{m}-\bar{\theta}_{m}\rangle.

Finally, we define the sum of residuals as

rk=wkw2+1Mm=1Mβmkβm2=1Mm=1Mθ^mkθm2r^{k}=\|w^{k}-w^{*}\|^{2}+\frac{1}{M}\sum^{M}_{m=1}\|\beta^{k}_{m}-\beta^{*}_{m}\|^{2}=\frac{1}{M}\sum^{M}_{m=1}\|\hat{\theta}^{k}_{m}-\theta^{*}_{m}\|^{2} (31)

and let σdif2=1Mm=1Mfm(θm)2\sigma^{2}_{\text{dif}}=\frac{1}{M}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{*}_{m})\|^{2}.

The following proposition states some useful results that will be used in the proofs below. The results are standard and can be found in, for example, Nesterov (2018).

Proposition 1.

If the function ff is differentiable and LL-smooth, then

f(x)f(y)f(y),xyL2xy2.f(x)-f(y)-\langle\nabla f(y),x-y\rangle\leq\frac{L}{2}\|x-y\|^{2}. (32)

If ff is also convex, then

f(x)f(y)22LDf(x,y)\|\nabla f(x)-\nabla f(y)\|^{2}\leq 2LD_{f}(x,y) (33)

for all x,yx,y.

For all vectors x,yx,y, we have

2x,y\displaystyle 2\langle x,y\rangle ξx2+ξ1y2,ξ>0,\displaystyle\leq\xi\|x\|^{2}+\xi^{-1}\|y\|^{2},\quad\forall\xi>0, (34)
x,y\displaystyle-\langle x,y\rangle =12x212y2+12xy2.\displaystyle=-\frac{1}{2}\|x\|^{2}-\frac{1}{2}\|y\|^{2}+\frac{1}{2}\|x-y\|^{2}. (35)

For vectors v_{1},v_{2},\dots,v_{n}, by Jensen's inequality and the convexity of the map x\mapsto\|x\|^{2}, we have

1ni=1nvi21ni=1nvi2.\left\|\frac{1}{n}\sum^{n}_{i=1}v_{i}\right\|^{2}\leq\frac{1}{n}\sum^{n}_{i=1}\left\|v_{i}\right\|^{2}. (36)

Next, we establish a few technical results.

Lemma 8.

Suppose Assumption 4 holds. Given {θmk}m[M]\{\theta^{k}_{m}\}_{m\in[M]}, we have

𝔼ξk[1Mm=1Mfm(θ^mk+1)]\displaystyle\mathbb{E}_{\xi^{k}}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k+1}_{m})\right] 1Mm=1Mfm(θ^mk)\displaystyle-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})
ηk1Mm=1Mwfm(θ^mk),1Mm=1Mwfm(θmk)\displaystyle\leq-\eta_{k}\left\langle\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\rangle
ηkMm=1Mβmfm(θ^mk),βmfm(θmk)\displaystyle\qquad-\frac{\eta_{k}}{M}\sum^{M}_{m=1}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m}),\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\rangle
+ηk2L2𝔼ξk[hk2]+ηk2L2Mm=1M𝔼ξmk[gm,2k2],\displaystyle\qquad+\frac{\eta^{2}_{k}L}{2}\mathbb{E}_{\xi^{k}}\left[\|h^{k}\|^{2}\right]+\frac{\eta^{2}_{k}L}{2M}\sum^{M}_{m=1}\mathbb{E}_{\xi_{m}^{k}}\left[\|g^{k}_{m,2}\|^{2}\right],

where the expectation is taken only with respect to the randomness in ξk\xi^{k}.

Proof.

By the LL-smoothness assumption on fm()f_{m}(\cdot) and (32), we have

fm(θ^mk+1)fm(θ^mk)fm(θ^mk),θ^mk+1θ^mkL2θ^mk+1θ^mk2.f_{m}(\hat{\theta}^{k+1}_{m})-f_{m}(\hat{\theta}^{k}_{m})-\langle\nabla f_{m}(\hat{\theta}^{k}_{m}),\hat{\theta}^{k+1}_{m}-\hat{\theta}^{k}_{m}\rangle\leq\frac{L}{2}\|\hat{\theta}^{k+1}_{m}-\hat{\theta}^{k}_{m}\|^{2}.

Thus, we have

fm(θ^mk+1)fm(θ^mk)ηkwfm(θ^mk),hkηkβmfm(θ^mk),gm,2k+ηk2L2hk2+ηk2L2gm,2k2,f_{m}(\hat{\theta}^{k+1}_{m})-f_{m}(\hat{\theta}^{k}_{m})\leq-\eta_{k}\langle\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),h^{k}\rangle-\eta_{k}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m}),g^{k}_{m,2}\rangle+\frac{\eta^{2}_{k}L}{2}\|h^{k}\|^{2}+\frac{\eta^{2}_{k}L}{2}\|g^{k}_{m,2}\|^{2},

which further implies that

1Mm=1Mfm(θ^mk+1)1Mm=1Mfm(θ^mk)ηk1Mm=1Mwfm(θ^mk),hkηkMm=1Mβmfm(θ^mk),gm,2k+ηk2L2hk2+ηk2L2Mm=1Mgm,2k2.\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k+1}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})\\ \leq-\eta_{k}\left\langle\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),h^{k}\right\rangle-\frac{\eta_{k}}{M}\sum^{M}_{m=1}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m}),g^{k}_{m,2}\rangle+\frac{\eta^{2}_{k}L}{2}\|h^{k}\|^{2}\\ +\frac{\eta^{2}_{k}L}{2M}\sum^{M}_{m=1}\|g^{k}_{m,2}\|^{2}.

The result follows by taking the expectation with respect to the randomness in ξk\xi^{k}, while keeping the other quantities fixed. ∎

Lemma 9.

Suppose Assumptions 5 and 6 hold. Given {θmk}m[M]\{\theta^{k}_{m}\}_{m\in[M]}, we have

𝔼ξk[hk2]\displaystyle\mathbb{E}_{\xi^{k}}\left[\|h^{k}\|^{2}\right] +1Mm=1M𝔼ξmk[gm,2k2]\displaystyle+\frac{1}{M}\sum^{M}_{m=1}\mathbb{E}_{\xi_{m}^{k}}\left[\|g^{k}_{m,2}\|^{2}\right]
(C1M+C2+1)1Mm=1Mfm(θmk)2+σ12MB+σ22B\displaystyle\leq\left(\frac{C_{1}}{M}+C_{2}+1\right)\frac{1}{M}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}
λ(C1M+C2+1)1Mm=1Mfm(θmk)2+(C1M+C2+1)σdif2+σ12MB+σ22B,\displaystyle\leq\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}+\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B},

where the expectation is taken only with respect to the randomness in ξk\xi^{k}.

Proof.

Note that

𝔼ξk[hk2]\displaystyle\mathbb{E}_{\xi^{k}}\left[\|h^{k}\|^{2}\right] =𝔼ξk[1Mm=1Mgm,1k2]\displaystyle=\mathbb{E}_{\xi^{k}}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}g^{k}_{m,1}\right\|^{2}\right]
=(i)𝔼ξk[1Mm=1M(gm,1kwfm(θmk))2]+1Mm=1Mwfm(θmk)2\displaystyle\overset{(i)}{=}\mathbb{E}_{\xi^{k}}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\left(g^{k}_{m,1}-\nabla_{w}f_{m}(\theta^{k}_{m})\right)\right\|^{2}\right]+\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2}
=(ii)1M2m=1M𝔼ξmk[gm,1kwfm(θmk)2]+1Mm=1Mwfm(θmk)2\displaystyle\overset{(ii)}{=}\frac{1}{M^{2}}\sum^{M}_{m=1}\mathbb{E}_{\xi^{k}_{m}}\left[\left\|g^{k}_{m,1}-\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2}\right]+\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2}
(iii)1M2m=1M(C1fm(θmk)2+σ12B)+1Mm=1Mwfm(θmk)2\displaystyle\overset{(iii)}{\leq}\frac{1}{M^{2}}\sum^{M}_{m=1}\left(C_{1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}+\frac{\sigma^{2}_{1}}{B}\right)+\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2}
(iv)C1M2m=1Mfm(θmk)2+σ12MB+1Mm=1Mwfm(θmk)2,\displaystyle\overset{(iv)}{\leq}\frac{C_{1}}{M^{2}}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}+\frac{\sigma^{2}_{1}}{MB}+\frac{1}{M}\sum^{M}_{m=1}\|\nabla_{w}f_{m}(\theta^{k}_{m})\|^{2},

where (i) is due to gm,1kg^{k}_{m,1} being unbiased, (ii) is by the fact that ξ1k,ξ2k,,ξMk\xi^{k}_{1},\xi^{k}_{2},\dots,\xi^{k}_{M} are independent, (iii) is by Assumption 5, and (iv) is by (36). Similarly, we have

1Mm=1M𝔼ξmk[gm,2k2]\displaystyle\frac{1}{M}\sum^{M}_{m=1}\mathbb{E}_{\xi^{k}_{m}}\left[\|g^{k}_{m,2}\|^{2}\right] =1Mm=1M𝔼ξmk[gm,2kβmfm(θmk)2]+1Mm=1Mβmfm(θmk)2\displaystyle=\frac{1}{M}\sum^{M}_{m=1}\mathbb{E}_{\xi^{k}_{m}}\left[\|g^{k}_{m,2}-\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}\right]+\frac{1}{M}\sum^{M}_{m=1}\|\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}
C2Mm=1Mfm(θmk)2+σ22B+1Mm=1Mβmfm(θmk)2.\displaystyle\leq\frac{C_{2}}{M}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}+\frac{\sigma^{2}_{2}}{B}+\frac{1}{M}\sum^{M}_{m=1}\|\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}.

The lemma follows by combining the two inequalities. ∎

Lemma 10.

Under Assumption 4, we have

ηk1Mm=1Mwfm(θ^mk),1Mm=1Mwfm(θmk)ηkMm=1Mβmfm(θ^k),βmfm(θmk)ηk21Mm=1Mfm(θ^mk)2ηk21Mm=1Mfm(θmk)2+ηkL22Vk.-\eta_{k}\left\langle\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\rangle-\frac{\eta_{k}}{M}\sum^{M}_{m=1}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}),\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\rangle\\ \leq-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}+\frac{\eta_{k}L^{2}}{2}V^{k}.
Proof.

By (35), we have

ηk\displaystyle-\eta_{k} 1Mm=1Mwfm(θ^mk),1Mm=1Mwfm(θmk)\displaystyle\left\langle\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\rangle
=ηk21Mm=1Mwfm(θ^mk)2ηk21Mm=1Mwfm(θmk)2\displaystyle=-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2}
+ηk21Mm=1M(wfm(θ^mk)wfm(θmk))2\displaystyle\qquad\qquad+\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\left(\nabla_{w}f_{m}(\hat{\theta}^{k}_{m})-\nabla_{w}f_{m}(\theta^{k}_{m})\right)\right\|^{2}
ηk21Mm=1Mwfm(θ^mk)2ηk21Mm=1Mwfm(θmk)2\displaystyle\leq-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2}
+ηk2Mm=1Mwfm(θ^mk)wfm(θmk)2,\displaystyle\qquad\qquad+\frac{\eta_{k}}{2M}\sum^{M}_{m=1}\left\|\nabla_{w}f_{m}(\hat{\theta}^{k}_{m})-\nabla_{w}f_{m}(\theta^{k}_{m})\right\|^{2},

where the last inequality follows from (36). We also have

ηkβmfm(θ^k),βmfm(θmk)=ηk2βmfm(θ^mk)2ηk2βmfm(θmk)2+ηk2βmfm(θ^mk)βmfm(θmk)2.-\eta_{k}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}),\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\rangle\\ =-\frac{\eta_{k}}{2}\|\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m})\|^{2}-\frac{\eta_{k}}{2}\|\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}+\frac{\eta_{k}}{2}\|\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m})-\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}.

Thus,

ηkMβmfm(θ^k),βmfm(θmk)\displaystyle-\frac{\eta_{k}}{M}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}),\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\rangle =ηk2Mm=1Mβmfm(θ^mk)2ηk2Mm=1Mβmfm(θmk)2\displaystyle=-\frac{\eta_{k}}{2M}\sum^{M}_{m=1}\|\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m})\|^{2}-\frac{\eta_{k}}{2M}\sum^{M}_{m=1}\|\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}
+ηk2Mm=1Mβmfm(θ^mk)βmfm(θmk)2\displaystyle\qquad+\frac{\eta_{k}}{2M}\sum^{M}_{m=1}\|\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m})-\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}
ηk21Mm=1Mβmfm(θ^mk)2ηk21Mm=1Mβmfm(θmk)2\displaystyle\leq-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\right\|^{2}
+ηk2Mm=1Mβmfm(θ^mk)βmfm(θmk)2.\displaystyle\qquad+\frac{\eta_{k}}{2M}\sum^{M}_{m=1}\|\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m})-\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\|^{2}.

Combining the above equations, we have

ηk\displaystyle-\eta_{k} 1Mm=1Mwfm(θ^mk),1Mm=1Mwfm(θmk)ηkMm=1Mβmfm(θ^mk),βmfm(θmk)\displaystyle\left\langle\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\rangle-\frac{\eta_{k}}{M}\sum^{M}_{m=1}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m}),\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\rangle
ηk21Mm=1Mfm(θ^mk)2ηk21Mm=1Mfm(θmk)2\displaystyle\leq-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}
+ηk2Mm=1Mfm(θ^mk)fm(θmk)2\displaystyle\qquad+\frac{\eta_{k}}{2M}\sum^{M}_{m=1}\left\|\nabla f_{m}(\hat{\theta}^{k}_{m})-\nabla f_{m}(\theta^{k}_{m})\right\|^{2}
(i)ηk21Mm=1Mfm(θ^mk)2ηk21Mm=1Mfm(θmk)2+ηkL22Mm=1Mwmkwk2\displaystyle\overset{(i)}{\leq}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}+\frac{\eta_{k}L^{2}}{2M}\sum^{M}_{m=1}\left\|w^{k}_{m}-w^{k}\right\|^{2}
=ηk21Mm=1Mfm(θ^mk)2ηk21Mm=1Mfm(θmk)2+ηkL22Vk,\displaystyle=-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}+\frac{\eta_{k}L^{2}}{2}V^{k},

where (i)(i) is by Assumption 4. ∎

Lemma 11.

Under Assumptions 4 and 7, we have

-\eta_{k}\left\langle\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\hat{\theta}^{k}_{m}),\frac{1}{M}\sum^{M}_{m=1}\nabla_{w}f_{m}(\theta^{k}_{m})\right\rangle-\frac{\eta_{k}}{M}\sum^{M}_{m=1}\langle\nabla_{\beta_{m}}f_{m}(\hat{\theta}^{k}_{m}),\nabla_{\beta_{m}}f_{m}(\theta^{k}_{m})\rangle\\ \leq-\eta_{k}\mu\left(\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})-f^{*}\right)-\frac{\eta_{k}}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}+\frac{\eta_{k}L^{2}}{2}V^{k}.
Proof.

The proof follows directly from Lemma 10 and Assumption 7. ∎

Lemma 12.

Suppose Assumptions 5 and 6 hold. For kp+1kvpk_{p}+1\leq k\leq v_{p}, we have

𝔼[Vk]λ(τ1)(C1+1)t=kpk1ηt2𝔼[1Mm=1Mfm(θmt)2]+σdif2(τ1)(C1+1)t=kpk1ηt2+σ12(τ1)Bt=kpk1ηt2.\mathbb{E}\left[V^{k}\right]\leq\lambda(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{t}_{m})\right\|^{2}\right]\\ +\sigma^{2}_{\text{dif}}(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}+\frac{\sigma^{2}_{1}(\tau-1)}{B}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}.

Note that Vkp=0V^{k_{p}}=0.

Proof.

Note that wkp=wmkpw^{k_{p}}=w^{k_{p}}_{m} for all m[M]m\in[M]. Thus, for kp+1kvpk_{p}+1\leq k\leq v_{p}, we have

wmkwk2=wmkpt=kpk1ηtgm,1twkpt=kpk1ηtht2=t=kpk1ηtgm,1tt=kpk1ηtht2.\displaystyle\|w^{k}_{m}-w^{k}\|^{2}=\left\|w^{k_{p}}_{m}-\sum^{k-1}_{t=k_{p}}\eta_{t}g^{t}_{m,1}-w^{k_{p}}-\sum^{k-1}_{t=k_{p}}\eta_{t}h^{t}\right\|^{2}=\left\|\sum^{k-1}_{t=k_{p}}\eta_{t}g^{t}_{m,1}-\sum^{k-1}_{t=k_{p}}\eta_{t}h^{t}\right\|^{2}.

Since

1Mm=1Mt=kpk1ηtgm,1t=t=kpk1ηtht,\frac{1}{M}\sum^{M}_{m=1}\sum^{k-1}_{t=k_{p}}\eta_{t}g^{t}_{m,1}=\sum^{k-1}_{t=k_{p}}\eta_{t}h^{t},

we have

1Mm=1Mwmkwk2\displaystyle\frac{1}{M}\sum^{M}_{m=1}\|w^{k}_{m}-w^{k}\|^{2} =1Mm=1Mt=kpk1ηtgm,1tt=kpk1ηtht2\displaystyle=\frac{1}{M}\sum^{M}_{m=1}\left\|\sum^{k-1}_{t=k_{p}}\eta_{t}g^{t}_{m,1}-\sum^{k-1}_{t=k_{p}}\eta_{t}h^{t}\right\|^{2} (37)
=1Mm=1Mt=kpk1ηtgm,1t2t=kpk1ηtht21Mm=1Mt=kpk1ηtgm,1t2\displaystyle=\frac{1}{M}\sum^{M}_{m=1}\left\|\sum^{k-1}_{t=k_{p}}\eta_{t}g^{t}_{m,1}\right\|^{2}-\left\|\sum^{k-1}_{t=k_{p}}\eta_{t}h^{t}\right\|^{2}\leq\frac{1}{M}\sum^{M}_{m=1}\left\|\sum^{k-1}_{t=k_{p}}\eta_{t}g^{t}_{m,1}\right\|^{2}
kkpMm=1Mt=kpk1ηt2gm,1t2τ1Mm=1Mt=kpk1ηt2gm,1t2.\displaystyle\leq\frac{k-k_{p}}{M}\sum^{M}_{m=1}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\|g^{t}_{m,1}\|^{2}\leq\frac{\tau-1}{M}\sum^{M}_{m=1}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\|g^{t}_{m,1}\|^{2}.

Given {θmk}m[M]\{\theta^{k}_{m}\}_{m\in[M]}, we have

𝔼ξk[1Mm=1Mgm,1k2]\displaystyle\mathbb{E}_{\xi^{k}}\left[\frac{1}{M}\sum^{M}_{m=1}\|g^{k}_{m,1}\|^{2}\right] =1Mm=1M𝔼ξmk[gm,1k2]\displaystyle=\frac{1}{M}\sum^{M}_{m=1}\mathbb{E}_{\xi^{k}_{m}}\left[\|g^{k}_{m,1}\|^{2}\right]
=1Mm=1M𝔼ξmk[gm,1kwfm(θmk)2]+1Mm=1Mwfm(θmk)2\displaystyle=\frac{1}{M}\sum^{M}_{m=1}\mathbb{E}_{\xi^{k}_{m}}\left[\|g^{k}_{m,1}-\nabla_{w}f_{m}(\theta^{k}_{m})\|^{2}\right]+\frac{1}{M}\sum^{M}_{m=1}\|\nabla_{w}f_{m}(\theta^{k}_{m})\|^{2}
\displaystyle\leq\frac{1}{M}\sum^{M}_{m=1}\left[C_{1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}+\frac{\sigma^{2}_{1}}{B}\right]+\frac{1}{M}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}
=C1+1Mm=1Mfm(θmk)2+σ12B,\displaystyle=\frac{C_{1}+1}{M}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{k}_{m})\|^{2}+\frac{\sigma^{2}_{1}}{B},

where the expectation is taken with respect to the randomness in \xi^{k}. Thus, by the independence of \xi^{1},\xi^{2},\dots,\xi^{k} and taking the unconditional expectation on both sides of (37), we have

\displaystyle\mathbb{E}\left[V^{k}\right]\leq(\tau-1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\mathbb{E}\left[\mathbb{E}_{\xi^{t}}\left[\frac{1}{M}\sum^{M}_{m=1}\|g^{t}_{m,1}\|^{2}\right]\right]
(τ1)(C1+1)t=kpk1ηt2𝔼[1Mm=1Mfm(θmt)2]+(τ1)σ12Bt=kpk1ηt2\displaystyle\leq(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}\|\nabla f_{m}(\theta^{t}_{m})\|^{2}\right]+\frac{(\tau-1)\sigma^{2}_{1}}{B}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}
λ(τ1)(C1+1)t=kpk1ηt2𝔼[1Mm=1Mfm(θmt)2]\displaystyle\leq\lambda(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{t}_{m})\right\|^{2}\right]
+σdif2(τ1)(C1+1)t=kpk1ηt2+(τ1)σ12Bt=kpk1ηt2,\displaystyle\qquad\qquad+\sigma^{2}_{\text{dif}}(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}+\frac{(\tau-1)\sigma^{2}_{1}}{B}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t},

where the last inequality follows Assumption 6. ∎

With these preliminaries, we are ready to prove Theorem 3 and Theorem 4.

B.7.1 Proof of Theorem 3

Under Assumptions 4-6, given {θmk}m[M]\{\theta^{k}_{m}\}_{m\in[M]}, it follows from Lemmas 8-10 that

𝔼ξk[1Mm=1Mfm(θ^mk+1)]\displaystyle\mathbb{E}_{\xi^{k}}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k+1}_{m})\right] 1Mm=1Mfm(θ^mk)\displaystyle-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})
\displaystyle\leq-\frac{\eta}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}-\frac{\eta}{2}\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}+\frac{\eta L^{2}}{2}V^{k}
\displaystyle\qquad+\frac{1}{2}\eta^{2}L\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}
+12η2Lλ{(C1M+C2+1)σdif2+σ12MB+σ22B},\displaystyle\qquad+\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\},

where the expectation is taken with respect to the randomness in ξk\xi^{k}. Thus, taking the unconditional expectation on both sides of the equation above, we have

𝔼\displaystyle\mathbb{E} [1Mm=1Mfm(θ^mk+1)1Mm=1Mfm(θ^mk)]\displaystyle\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k+1}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})\right]
\displaystyle\qquad\qquad\leq-\frac{\eta}{2}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]-\frac{\eta}{2}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]+\frac{\eta L^{2}}{2}\mathbb{E}\left[V^{k}\right]
\displaystyle\qquad\qquad\qquad+\frac{1}{2}\eta^{2}L\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]
+12η2Lλ{(C1M+C2+1)σdif2+σ12MB+σ22B},\displaystyle\qquad\qquad\qquad+\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\},

which implies that

\begin{aligned}\mathbb{E}&\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p+1}}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p}}_{m})\right]\\ &\quad\quad=\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k+1}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})\right]\\ &\quad\quad\leq-\frac{\eta}{2}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]\\ &\quad\quad\quad+\frac{\eta}{2}\left\{-1+\eta L\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)\right\}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]\\ &\quad\quad\quad+\frac{\eta L^{2}}{2}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[V^{k}\right]+\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}\sum^{v_{p}}_{k=k_{p}}1.\end{aligned} (38)

By Lemma 12, for all kpkvpk_{p}\leq k\leq v_{p}, we have that

\displaystyle\mathbb{E}\left[V^{k}\right]\leq\lambda\eta^{2}(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{t}_{m})\right\|^{2}\right]
+η2σdif2(τ1)(C1+1)(kkp)+η2σ12(τ1)B(kkp)\displaystyle\qquad+\eta^{2}\sigma^{2}_{\text{dif}}(\tau-1)(C_{1}+1)(k-k_{p})+\frac{\eta^{2}\sigma^{2}_{1}(\tau-1)}{B}(k-k_{p})
λη2(τ1)(C1+1)k=kpvp𝔼[1Mm=1Mfm(θmk)2]\displaystyle\leq\lambda\eta^{2}(\tau-1)(C_{1}+1)\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]
+η2σdif2(τ1)2(C1+1)+η2σ12(τ1)2B.\displaystyle\qquad+\eta^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)+\frac{\eta^{2}\sigma^{2}_{1}(\tau-1)^{2}}{B}.

Therefore, we have

ηL22k=kpvp𝔼[Vk]12λη3L2(τ1)τ(C1+1)k=kpvp𝔼[1Mm=1Mfm(θmk)2]+12η3L2σdif2(τ1)2(C1+1)k=kpvp1+η3L2σ12(τ1)22Bk=kpvp1.\frac{\eta L^{2}}{2}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[V^{k}\right]\leq\frac{1}{2}\lambda\eta^{3}L^{2}(\tau-1)\tau(C_{1}+1)\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]\\ +\frac{1}{2}\eta^{3}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)\sum^{v_{p}}_{k=k_{p}}1+\frac{\eta^{3}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{2B}\sum^{v_{p}}_{k=k_{p}}1.

Combined with (38), we have

𝔼\displaystyle\mathbb{E} [1Mm=1Mfm(θ^mkp+1)1Mm=1Mfm(θ^mkp)]\displaystyle\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p+1}}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p}}_{m})\right]
\displaystyle\leq-\frac{\eta}{2}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]
\displaystyle\ +\frac{\eta}{2}\left\{-1+\eta L\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)+\lambda\eta^{2}L^{2}(\tau-1)\tau(C_{1}+1)\right\}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]
+12η2Lλ{(C1M+C2+1)σdif2+σ12MB+σ22B}k=kpvp1\displaystyle\ +\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}\sum^{v_{p}}_{k=k_{p}}1
+12η3L2σdif2(τ1)2(C1+1)k=kpvp1+η3L2σ12(τ1)22Bk=kpvp1.\displaystyle\ +\frac{1}{2}\eta^{3}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)\sum^{v_{p}}_{k=k_{p}}1+\frac{\eta^{3}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{2B}\sum^{v_{p}}_{k=k_{p}}1.

Since we require that

-1+\eta L\lambda\left(\frac{C_{1}}{M}+C_{2}+1\right)+\lambda\eta^{2}L^{2}(\tau-1)\tau(C_{1}+1)\leq 0,

the equation above implies that

𝔼\displaystyle\mathbb{E} [1Mm=1Mfm(θ^mkp+1)1Mm=1Mfm(θ^mkp)]\displaystyle\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p+1}}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p}}_{m})\right]
\displaystyle\leq-\frac{\eta}{2}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]+\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}\sum^{v_{p}}_{k=k_{p}}1
+12η3L2σdif2(τ1)2(C1+1)k=kpvp1+η3L2σ12(τ1)22Bk=kpvp1.\displaystyle\quad+\frac{1}{2}\eta^{3}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)\sum^{v_{p}}_{k=k_{p}}1+\frac{\eta^{3}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{2B}\sum^{v_{p}}_{k=k_{p}}1.

Since we have assumed that K=kp¯K=k_{\bar{p}} for some p¯+\bar{p}\in\mathbb{N}^{+}, we further have

1K\displaystyle\frac{1}{K} 𝔼[(1Mm=1Mfm(θ^mK)f)(1Mm=1Mfm(θ^m0)f)]\displaystyle\mathbb{E}\left[\left(\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{K}_{m})-f^{*}\right)-\left(\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{0}_{m})-f^{*}\right)\right]
=1K𝔼[1Mm=1Mfm(θ^mK)1Mm=1Mfm(θ^m0)]\displaystyle=\frac{1}{K}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{K}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{0}_{m})\right]
=1Kp=0p¯1𝔼[1Mm=1Mfm(θ^mkp+1)1Mm=1Mfm(θ^mkp)]\displaystyle=\frac{1}{K}\sum^{\bar{p}-1}_{p=0}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p+1}}_{m})-\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k_{p}}_{m})\right]
\displaystyle\leq-\frac{\eta}{2K}\sum^{\bar{p}-1}_{p=0}\sum^{v_{p}}_{k=k_{p}}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]
+12η2Lλ{(C1M+C2+1)σdif2+σ12MB+σ22B}1Kp=0p¯1k=kpvp1\displaystyle\quad+\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}\frac{1}{K}\sum^{\bar{p}-1}_{p=0}\sum^{v_{p}}_{k=k_{p}}1
+12η3L2σdif2(τ1)2(C1+1)1Kp=0p¯1k=kpvp1+η3L2σ12(τ1)22B1Kp=0p¯1k=kpvp1\displaystyle\quad+\frac{1}{2}\eta^{3}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)\frac{1}{K}\sum^{\bar{p}-1}_{p=0}\sum^{v_{p}}_{k=k_{p}}1+\frac{\eta^{3}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{2B}\frac{1}{K}\sum^{\bar{p}-1}_{p=0}\sum^{v_{p}}_{k=k_{p}}1
\displaystyle=-\frac{\eta}{2K}\sum^{K-1}_{k=0}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]+\frac{1}{2}\eta^{2}L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}
+12η3L2σdif2(τ1)2(C1+1)+η3L2σ12(τ1)22B.\displaystyle\quad+\frac{1}{2}\eta^{3}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)+\frac{\eta^{3}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{2B}.

This implies that

\frac{1}{K}\sum^{K-1}_{k=0}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\hat{\theta}^{k}_{m})\right\|^{2}\right]\\ \leq\frac{2\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{0}_{m})-f^{*}\right]}{\eta K}+\eta L\lambda\left\{\left(\frac{C_{1}}{M}+C_{2}+1\right)\sigma^{2}_{\text{dif}}+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}\\ +\eta^{2}L^{2}\sigma^{2}_{\text{dif}}(\tau-1)^{2}(C_{1}+1)+\frac{\eta^{2}L^{2}\sigma^{2}_{1}(\tau-1)^{2}}{B}

and the proof is complete.

B.7.2 Proof of Theorem 4

By Lemmas 8, 9, 11 and 12, for kp+1kvpk_{p}+1\leq k\leq v_{p}, we have

𝔼\displaystyle\mathbb{E} [1Mm=1Mfm(θ^mk+1)f]\displaystyle\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k+1}_{m})-f^{*}\right]
Δk𝔼[1Mm=1Mfm(θ^mk)f]\displaystyle\qquad\leq\Delta_{k}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})-f^{*}\right]
+ηk2{1+ηkλL(C1M+C2+1)}𝔼[1Mm=1Mfm(θmk)2]\displaystyle\qquad\qquad+\frac{\eta_{k}}{2}\left\{-1+\eta_{k}\lambda L\left(\frac{C_{1}}{M}+C_{2}+1\right)\right\}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right]
+Bkt=kpk1ηt2𝔼[1Mm=1Mfm(θm(t))2]+ck,\displaystyle\qquad\qquad+B_{k}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{(t)}_{m})\right\|^{2}\right]+c_{k},

where

Δk\displaystyle\Delta_{k} =1ηkμ,\displaystyle=1-\eta_{k}\mu, (39)
Bk\displaystyle B_{k} =12ηkL2λ(τ1)(C1+1), and\displaystyle=\frac{1}{2}\eta_{k}L^{2}\lambda(\tau-1)(C_{1}+1),\text{ and}
ck\displaystyle c_{k} =ηkL22{σdif2(τ1)(C1+1)t=kpk1ηt2+σ12(τ1)Bt=kpk1ηt2}\displaystyle=\frac{\eta_{k}L^{2}}{2}\left\{\sigma^{2}_{\text{dif}}(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}+\frac{\sigma^{2}_{1}(\tau-1)}{B}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\right\}
+ηk2L2{σdif2(C1M+C2+1)+σ12MB+σ22B}.\displaystyle\qquad\qquad\qquad\qquad+\frac{\eta^{2}_{k}L}{2}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}.

Let

ak\displaystyle a_{k} =𝔼[1Mm=1Mfm(θ^mk)f],\displaystyle=\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}(\hat{\theta}^{k}_{m})-f^{*}\right],
D\displaystyle D =λL(C1M+C2+1), and\displaystyle=\lambda L\left(\frac{C_{1}}{M}+C_{2}+1\right),\text{ and}
ek\displaystyle e_{k} =𝔼[1Mm=1Mfm(θmk)2],\displaystyle=\mathbb{E}\left[\left\|\frac{1}{M}\sum^{M}_{m=1}\nabla f_{m}(\theta^{k}_{m})\right\|^{2}\right],

and denote

\displaystyle\sum^{k_{p}-1}_{t=k_{p}}\eta^{2}_{t}e_{t}=0,
\displaystyle c_{k_{p}}=\frac{\eta^{2}_{k_{p}}L}{2}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}.

Then

a_{k+1}\leq\Delta_{k}a_{k}+\frac{\eta_{k}}{2}(-1+D\eta_{k})e_{k}+B_{k}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}e_{t}+c_{k},

for all kpkvpk_{p}\leq k\leq v_{p}. Under the conditions on β\beta and τ\tau, by Lemmas 13 and 14, we have

avp+1(k=kpvpΔk)akp+k=kpvp1(i=k+1vpΔi)ck+cvp.a_{v_{p}+1}\leq\left(\prod^{v_{p}}_{k=k_{p}}\Delta_{k}\right)a_{k_{p}}+\sum^{v_{p}-1}_{k=k_{p}}\left(\prod^{v_{p}}_{i=k+1}\Delta_{i}\right)c_{k}+c_{v_{p}}. (40)

Let zk=(k+b)2z_{k}=(k+b)^{2}, where b=βτ+1b=\beta\tau+1. Then

Δkzkηk=(1μηk)μ(k+b)3=(11k+b)μ(k+b)3=μ(k+b1)(k+b)2μ(k+b1)3=zk1ηk1\Delta_{k}\frac{z_{k}}{\eta_{k}}=(1-\mu\eta_{k})\mu(k+b)^{3}=(1-\frac{1}{k+b})\mu(k+b)^{3}=\mu(k+b-1)(k+b)^{2}\leq\mu(k+b-1)^{3}=\frac{z_{k-1}}{\eta_{k-1}}

and, thus,

zvpηvp(i=k+1vpΔi)=zvpηvpΔvp(i=k+1vp1Δi)zvp1ηvp1(i=k+1vp1Δi)zkηk.\frac{z_{v_{p}}}{\eta_{v_{p}}}\left(\prod^{v_{p}}_{i=k+1}\Delta_{i}\right)=\frac{z_{v_{p}}}{\eta_{v_{p}}}\Delta_{v_{p}}\left(\prod^{v_{p}-1}_{i=k+1}\Delta_{i}\right)\leq\frac{z_{v_{p}-1}}{\eta_{v_{p}-1}}\left(\prod^{v_{p}-1}_{i=k+1}\Delta_{i}\right)\ldots\leq\frac{z_{k}}{\eta_{k}}.

Note that vp+1=kp+1v_{p}+1=k_{p+1}. Plugging the above inequality into (40), we then get

zvpηvpakp+1zkpηkpakp+k=kpvpzkηkck.\frac{z_{v_{p}}}{\eta_{v_{p}}}a_{k_{p+1}}\leq\frac{z_{k_{p}}}{\eta_{k_{p}}}a_{k_{p}}+\sum^{v_{p}}_{k=k_{p}}\frac{z_{k}}{\eta_{k}}c_{k}.

Since we have assumed that K=kp¯K=k_{\bar{p}}, we thus have

zK1ηK1aK=zvp¯1ηvp¯1akp¯zkp¯1ηkp¯1akp¯1+t=kp¯1vp¯1ztηtctz0η0a0+k=0K1zkηkck.\frac{z_{K-1}}{\eta_{K-1}}a_{K}=\frac{z_{v_{\bar{p}-1}}}{\eta_{v_{\bar{p}-1}}}a_{k_{\bar{p}}}\leq\frac{z_{k_{\bar{p}-1}}}{\eta_{k_{\bar{p}-1}}}a_{k_{\bar{p}-1}}+\sum^{v_{\bar{p}-1}}_{t=k_{\bar{p}-1}}\frac{z_{t}}{\eta_{t}}c_{t}\ldots\leq\frac{z_{0}}{\eta_{0}}a_{0}+\sum^{K-1}_{k=0}\frac{z_{k}}{\eta_{k}}c_{k}. (41)

Since, for kpkvpk_{p}\leq k\leq v_{p}, we have

\displaystyle c_{k}=\frac{\eta_{k}L^{2}}{2}\left\{\sigma^{2}_{\text{dif}}(\tau-1)(C_{1}+1)\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}+\frac{\sigma^{2}_{1}(\tau-1)}{B}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}\right\}
+ηk2L2{σdif2(C1M+C2+1)+σ12MB+σ22B}\displaystyle\qquad\qquad\qquad\qquad+\frac{\eta^{2}_{k}L}{2}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}
ηkηkττ2L2(τ1)22{σdif2(C1+1)+σ12B}\displaystyle\leq\frac{\eta_{k}\eta^{2}_{\left\lfloor\frac{k}{\tau}\right\rfloor\tau}L^{2}(\tau-1)^{2}}{2}\left\{\sigma^{2}_{\text{dif}}(C_{1}+1)+\frac{\sigma^{2}_{1}}{B}\right\}
+ηk2L2{σdif2(C1M+C2+1)+σ12MB+σ22B},\displaystyle\qquad\qquad\qquad\qquad+\frac{\eta^{2}_{k}L}{2}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\},

we also have

k=0K1zkηkckL2(τ1)22{σdif2(C1+1)+σ12B}k=0K1zkηkττ2+L2{σdif2(C1M+C2+1)+σ12MB+σ22B}k=0K1zkηk.\sum^{K-1}_{k=0}\frac{z_{k}}{\eta_{k}}c_{k}\leq\frac{L^{2}(\tau-1)^{2}}{2}\left\{\sigma^{2}_{\text{dif}}(C_{1}+1)+\frac{\sigma^{2}_{1}}{B}\right\}\sum^{K-1}_{k=0}z_{k}\eta^{2}_{\left\lfloor\frac{k}{\tau}\right\rfloor\tau}\\ +\frac{L}{2}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\}\sum^{K-1}_{k=0}z_{k}\eta_{k}. (42)

Assume that k=pτ+rk=p\tau+r, where 0rτ10\leq r\leq\tau-1. Then

\displaystyle\left\lfloor\frac{k}{\tau}\right\rfloor\tau+b=p\tau+\beta\tau+1=(p+\beta)\tau+1\geq\beta\tau\geq r,

as we have assumed that β>1\beta>1. Thus

2(kττ+b)(p+β)τ+1+r=k+b\displaystyle 2\left(\left\lfloor\frac{k}{\tau}\right\rfloor\tau+b\right)\geq(p+\beta)\tau+1+r=k+b

and

k=0K1zkηkττ2=1μ2k=0K1(k+bkττ+b)24Kμ2.\displaystyle\sum^{K-1}_{k=0}z_{k}\eta^{2}_{\left\lfloor\frac{k}{\tau}\right\rfloor\tau}=\frac{1}{\mu^{2}}\sum^{K-1}_{k=0}\left(\frac{k+b}{\left\lfloor\frac{k}{\tau}\right\rfloor\tau+b}\right)^{2}\leq\frac{4K}{\mu^{2}}.

Next, note that

k=0K1zkηk=1μk=0K1(k+b)K(K+2b)2μ.\sum^{K-1}_{k=0}z_{k}\eta_{k}=\frac{1}{\mu}\sum^{K-1}_{k=0}(k+b)\leq\frac{K(K+2b)}{2\mu}. (43)

Combining equations (41)-(43), we have

𝔼[1Mm=1Mfm(θ^mK)f]b3(K+βτ)3𝔼[1Mm=1Mfm(θ^m0)f]+2L2(τ1)2Kμ3(K+βτ)3{σdif2(C1+1)+σ12B}+LK(K+2βτ+2)4μ2(K+βτ)3{σdif2(C1M+C2+1)+σ12MB+σ22B},\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}\left(\hat{\theta}^{K}_{m}\right)-f^{*}\right]\\ \leq\frac{b^{3}}{(K+\beta\tau)^{3}}\mathbb{E}\left[\frac{1}{M}\sum^{M}_{m=1}f_{m}\left(\hat{\theta}^{0}_{m}\right)-f^{*}\right]+\frac{2L^{2}(\tau-1)^{2}K}{\mu^{3}(K+\beta\tau)^{3}}\left\{\sigma^{2}_{\text{dif}}(C_{1}+1)+\frac{\sigma^{2}_{1}}{B}\right\}\\ +\frac{LK(K+2\beta\tau+2)}{4\mu^{2}(K+\beta\tau)^{3}}\left\{\sigma^{2}_{\text{dif}}\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\sigma^{2}_{1}}{MB}+\frac{\sigma^{2}_{2}}{B}\right\},

which completes the proof.

B.7.3 Auxiliary Results

We give two technical lemmas that are used to prove Theorem 4.

Lemma 13.

Consider the sequence {ak}kpkvp\{a_{k}\}_{k_{p}\leq k\leq v_{p}} in the proof of Theorem 4 that satisfies

ak+1Δkak+ηk2(1+Dηk)ek+Bkt=kpk1ηt2et+ck,a_{k+1}\leq\Delta_{k}a_{k}+\frac{\eta_{k}}{2}(-1+D\eta_{k})e_{k}+B_{k}\sum^{k-1}_{t=k_{p}}\eta^{2}_{t}e_{t}+c_{k},

where Δk\Delta_{k}, BkB_{k}, and ckc_{k} are defined in (39). Suppose the sequence of learning rates {ηk}\{\eta_{k}\} satisfies

ηvp\displaystyle\eta_{v_{p}} D1,\displaystyle\leq D^{-1}, (44)
ηvp1\displaystyle\eta_{v_{p}-1} (D+2BvpΔvp)1,\displaystyle\leq\left(D+\frac{2B_{v_{p}}}{\Delta_{v_{p}}}\right)^{-1}, (45)
\displaystyle\vdots
ηkp\displaystyle\eta_{k_{p}} (D+2(Bvp+j=kp+1vp1Bji=j+1vpΔi)i=kp+1vpΔi)1.\displaystyle\leq\left(D+\frac{2\left(B_{v_{p}}+\sum_{j=k_{p}+1}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+1}\Delta_{i}}\right)^{-1}. (46)

Then

avp+1(k=kpvpΔk)akp+k=kpvp1(i=k+1vpΔi)ck+cvp.a_{v_{p}+1}\leq\left(\prod^{v_{p}}_{k=k_{p}}\Delta_{k}\right)a_{k_{p}}+\sum^{v_{p}-1}_{k=k_{p}}\left(\prod^{v_{p}}_{i=k+1}\Delta_{i}\right)c_{k}+c_{v_{p}}.
Proof.

We start by noting that

avp+1\displaystyle a_{v_{p}+1} Δvpavp+ηvp2(1+Dηvp)evp+Bvpk=kpvp1ηk2ek+cvp\displaystyle\leq\Delta_{v_{p}}a_{v_{p}}+\frac{\eta_{v_{p}}}{2}(-1+D\eta_{v_{p}})e_{v_{p}}+B_{v_{p}}\sum^{v_{p}-1}_{k=k_{p}}\eta^{2}_{k}e_{k}+c_{v_{p}}
Δvpavp+Bvpk=kpvp1ηk2ek+cvp,\displaystyle\leq\Delta_{v_{p}}a_{v_{p}}+B_{v_{p}}\sum^{v_{p}-1}_{k=k_{p}}\eta^{2}_{k}e_{k}+c_{v_{p}},

where the last inequality is due to (44). Thus, we have

avp+1\displaystyle a_{v_{p}+1} Δvpavp+Bvpk=kpvp1ηk2ek+cvp\displaystyle\leq\Delta_{v_{p}}a_{v_{p}}+B_{v_{p}}\sum^{v_{p}-1}_{k=k_{p}}\eta^{2}_{k}e_{k}+c_{v_{p}}
=Δvpavp+Bvp(k=kpvp2ηk2ek+ηvp12evp1)+cvp\displaystyle=\Delta_{v_{p}}a_{v_{p}}+B_{v_{p}}\left(\sum^{v_{p}-2}_{k=k_{p}}\eta^{2}_{k}e_{k}+\eta^{2}_{v_{p}-1}e_{v_{p}-1}\right)+c_{v_{p}}
Δvp(Δvp1avp1+ηvp12(1+Dηvp1)evp1+Bvp1k=kpvp2ηk2ek+cvp1)\displaystyle\leq\Delta_{v_{p}}\left(\Delta_{v_{p}-1}a_{v_{p}-1}+\frac{\eta_{v_{p}-1}}{2}(-1+D\eta_{v_{p}-1})e_{v_{p}-1}+B_{v_{p}-1}\sum^{v_{p}-2}_{k=k_{p}}\eta^{2}_{k}e_{k}+c_{v_{p}-1}\right)
+Bvp(k=kpvp2ηk2ek+ηvp12evp1)+cvp\displaystyle\quad+B_{v_{p}}\left(\sum^{v_{p}-2}_{k=k_{p}}\eta^{2}_{k}e_{k}+\eta^{2}_{v_{p}-1}e_{v_{p}-1}\right)+c_{v_{p}}
=ΔvpΔvp1avp1+ηvp1Δvp2[1+Dηvp1+2Bvpηvp1Δvp]evp1\displaystyle=\Delta_{v_{p}}\Delta_{v_{p-1}}a_{v_{p}-1}+\frac{\eta_{v_{p-1}}\Delta_{v_{p}}}{2}\left[-1+D\eta_{v_{p}-1}+\frac{2B_{v_{p}}\eta_{v_{p}-1}}{\Delta_{v_{p}}}\right]e_{v_{p}-1}
+(ΔvpBvp1+Bvp)k=kpvp2ηk2ek+(Δvpcvp1+cvp).\displaystyle\quad+\left(\Delta_{v_{p}}B_{v_{p}-1}+B_{v_{p}}\right)\sum^{v_{p}-2}_{k=k_{p}}\eta^{2}_{k}e_{k}+\left(\Delta_{v_{p}}c_{v_{p}-1}+c_{v_{p}}\right).

By (45), we have

1+Dηvp1+2Bvpηvp1Δvp0.-1+D\eta_{v_{p}-1}+\frac{2B_{v_{p}}\eta_{v_{p}-1}}{\Delta_{v_{p}}}\leq 0.

Therefore,

avp+1ΔvpΔvp1avp1+(ΔvpBvp1+Bvp)k=kpvp2ηk2ek+(Δvpcvp1+cvp).a_{v_{p}+1}\leq\Delta_{v_{p}}\Delta_{v_{p-1}}a_{v_{p}-1}+\left(\Delta_{v_{p}}B_{v_{p}-1}+B_{v_{p}}\right)\sum^{v_{p}-2}_{k=k_{p}}\eta^{2}_{k}e_{k}+\left(\Delta_{v_{p}}c_{v_{p}-1}+c_{v_{p}}\right).

Under the assumptions on ηk\eta_{k}, repeating the process above, we have

avp+1\displaystyle a_{v_{p}+1} (i=kp+1vpΔi)akp+1\displaystyle\leq\left(\prod^{v_{p}}_{i=k_{p}+1}\Delta_{i}\right)a_{k_{p}+1}
+[(i=kp+2vpΔi)Bkp+1+(i=kp+3vpΔi)Bkp+2++ΔvpBvp1+Bvp]ηkp2ekp\displaystyle\quad+\left[\left(\prod^{v_{p}}_{i=k_{p}+2}\Delta_{i}\right)B_{k_{p}+1}+\left(\prod^{v_{p}}_{i=k_{p}+3}\Delta_{i}\right)B_{k_{p}+2}+\dots+\Delta_{v_{p}}B_{v_{p}-1}+B_{v_{p}}\right]\eta^{2}_{k_{p}}e_{k_{p}}
+k=kpvp1(i=k+1vpΔi)ck.\displaystyle\quad+\sum^{v_{p}-1}_{k=k_{p}}\left(\prod^{v_{p}}_{i=k+1}\Delta_{i}\right)c_{k}.

Since

a_{k_{p}+1}\leq\Delta_{k_{p}}a_{k_{p}}+\frac{\eta_{k_{p}}}{2}(-1+D\eta_{k_{p}})e_{k_{p}}+c_{k_{p}},

combining with (46), the final result follows. ∎

Lemma 14.

Let ηk=(μ(k+βτ+1))1\eta_{k}=(\mu(k+\beta\tau+1))^{-1} where

β>max{2λLμ(C1M+C2+1)2,2L2λ(C1+1)μ2}.\beta>\max\left\{\frac{2\lambda L}{\mu}\left(\frac{C_{1}}{M}+C_{2}+1\right)-2,\frac{2L^{2}\lambda(C_{1}+1)}{\mu^{2}}\right\}.

and

τmax{(2L2λ(C1+1)/μ2)e1/β4,0}β2(2L2λ(C1+1)/μ2)e1β.\tau\geq\sqrt{\frac{\max\left\{(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{1/\beta}-4,0\right\}}{\beta^{2}-(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{\frac{1}{\beta}}}}.

Then the conditions in Lemma 13 are satisfied for ηk\eta_{k} for all k0k\geq 0.

Proof.

Let \Delta_{k} and B_{k} be defined as in (39). Since \Delta_{k}<1 for all k, after the p-th communication round, the right-hand side of (46) satisfies

(D+2(Bvp+j=kp+1vp1Bji=j+1vpΔi)i=kp+1vpΔi)1\displaystyle\left(D+\frac{2\left(B_{v_{p}}+\sum_{j=k_{p}+1}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+1}\Delta_{i}}\right)^{-1}
(D+2(Bvp+j=kp+2vp1Bji=j+1vpΔi)i=kp+1vpΔi)1\displaystyle\qquad\qquad\qquad\qquad\leq\left(D+\frac{2\left(B_{v_{p}}+\sum_{j=k_{p}+2}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+1}\Delta_{i}}\right)^{-1}
(D+2(Bvp+j=kp+2vp1Bji=j+1vpΔi)i=kp+2vpΔi)1.\displaystyle\qquad\qquad\qquad\qquad\leq\left(D+\frac{2\left(B_{v_{p}}+\sum_{j=k_{p}+2}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+2}\Delta_{i}}\right)^{-1}.

Thus, by induction, we have

(D+2(Bvp+j=kp+1vp1Bji=j+1vpΔi)i=kp+1vpΔi)1\displaystyle\left(D+\frac{2\left(B_{v_{p}}+\sum_{j=k_{p}+1}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+1}\Delta_{i}}\right)^{-1} (47)
(D+2(Bj+j=kp+2vp1Bji=j+1vpΔi)i=kp+2vpΔi)1\displaystyle\qquad\qquad\qquad\qquad\leq\left(D+\frac{2\left(B_{j}+\sum_{j=k_{p}+2}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+2}\Delta_{i}}\right)^{-1}
\displaystyle\qquad\qquad\qquad\qquad\leq\ldots
(D+2BvpΔvp)1\displaystyle\qquad\qquad\qquad\qquad\leq\left(D+\frac{2B_{v_{p}}}{\Delta_{v_{p}}}\right)^{-1}
D1.\displaystyle\qquad\qquad\qquad\qquad\leq D^{-1}.

As k increases, \eta_{k} decreases, \Delta_{k} increases, and B_{k} decreases. Thus, for 1\leq k\leq K, we have \eta_{K}\leq\eta_{K-1}\leq\dots\leq\eta_{1}. On the other hand, we can lower bound the right-hand side of (46) as

(D+2(Bvp+j=kp+1vp1Bji=j+1vpΔi)i=kp+1vpΔi)1\displaystyle\left(D+\frac{2\left(B_{v_{p}}+\sum_{j=k_{p}+1}^{v_{p}-1}B_{j}\prod_{i=j+1}^{v_{p}}\Delta_{i}\right)}{\prod^{v_{p}}_{i=k_{p}+1}\Delta_{i}}\right)^{-1} (48)
(D+2(B1+j=kp+1vp1B1i=j+1vpΔK)i=kp+1vpΔ1)1\displaystyle\qquad\qquad\qquad\qquad\geq\left(D+\frac{2\left(B_{1}+\sum_{j=k_{p}+1}^{v_{p}-1}B_{1}\prod_{i=j+1}^{v_{p}}\Delta_{K}\right)}{\prod^{v_{p}}_{i=k_{p}+1}\Delta_{1}}\right)^{-1}
(D+2B1(1+j=kp+1vp1ΔKvpj)Δ1τ1)1\displaystyle\qquad\qquad\qquad\qquad\geq\left(D+\frac{2B_{1}\left(1+\sum_{j=k_{p}+1}^{v_{p}-1}\Delta^{v_{p}-j}_{K}\right)}{\Delta^{\tau-1}_{1}}\right)^{-1}
(D+2B1(1+j=kp+1vp11)Δ1τ1)1\displaystyle\qquad\qquad\qquad\qquad\geq\left(D+\frac{2B_{1}\left(1+\sum_{j=k_{p}+1}^{v_{p}-1}1\right)}{\Delta^{\tau-1}_{1}}\right)^{-1}
=(D+2B1(τ1)Δ1τ1)1.\displaystyle\qquad\qquad\qquad\qquad=\left(D+\frac{2B_{1}(\tau-1)}{\Delta^{\tau-1}_{1}}\right)^{-1}.

If

η11D+2B1(τ1)Δ1τ1,\eta_{1}\leq\frac{1}{D+\frac{2B_{1}(\tau-1)}{\Delta^{\tau-1}_{1}}}, (49)

then the conditions on stepsizes in Lemma 13 are satisfied for all ηk\eta_{k} by combining (47)-(49). Thus, we only need to show that (49) is satisfied to complete the proof.

To that end, we need to have

\displaystyle\left(D+\frac{2B_{1}(\tau-1)}{\Delta^{\tau-1}_{1}}\right)\eta_{1}\leq 1
(λL(C1M+C2+1)+η1L2λ(τ1)2(C1+1)(1η1μ)τ1)η11\displaystyle\qquad\qquad\iff\left(\lambda L\left(\frac{C_{1}}{M}+C_{2}+1\right)+\frac{\eta_{1}L^{2}\lambda(\tau-1)^{2}(C_{1}+1)}{(1-\eta_{1}\mu)^{\tau-1}}\right)\eta_{1}\leq 1
λL(C1M+C2+1)(1η1μ)τ1+η1L2λ(τ1)2(C1+1)(1η1μ)τ1η1.\displaystyle\qquad\qquad\iff\lambda L\left(\frac{C_{1}}{M}+C_{2}+1\right)(1-\eta_{1}\mu)^{\tau-1}+\eta_{1}L^{2}\lambda(\tau-1)^{2}(C_{1}+1)\leq\frac{(1-\eta_{1}\mu)^{\tau-1}}{\eta_{1}}.

To satisfy the above equation, we need

{λL(C1M+C2+1)(1η1μ)τ1(2η1)1(1η1μ)τ1η1L2λ(τ1)2(C1+1)(2η1)1(1η1μ)τ1.\begin{cases}\lambda L\left(\frac{C_{1}}{M}+C_{2}+1\right)(1-\eta_{1}\mu)^{\tau-1}&\leq(2\eta_{1})^{-1}{(1-\eta_{1}\mu)^{\tau-1}}\\ \eta_{1}L^{2}\lambda(\tau-1)^{2}(C_{1}+1)&\leq(2\eta_{1})^{-1}{(1-\eta_{1}\mu)^{\tau-1}}.\end{cases} (50)

Note that η1=1/(μ(βτ+2))\eta_{1}=1/(\mu(\beta\tau+2)). Thus, to satisfy the first inequality in (50), we need

2λL(C1M+C2+1)1η1=μ(βτ+2).2\lambda L\left(\frac{C_{1}}{M}+C_{2}+1\right)\leq\frac{1}{\eta_{1}}=\mu(\beta\tau+2).

Since μ(βτ+2)μ(β+2)\mu(\beta\tau+2)\geq\mu(\beta+2), the condition above follows if

β2λLμ(C1M+C2+1)2.\beta\geq\frac{2\lambda L}{\mu}\left(\frac{C_{1}}{M}+C_{2}+1\right)-2. (51)

Next, to satisfy the second inequality in (50), we need

2η12L2λ(τ1)2(C1+1)(1η1μ)τ12L2λ(C1+1)μ2(τ1βτ+2)2(βτ+2βτ+1)τ11.2\eta^{2}_{1}L^{2}\lambda(\tau-1)^{2}(C_{1}+1)\leq(1-\eta_{1}\mu)^{\tau-1}\\ \iff\frac{2L^{2}\lambda(C_{1}+1)}{\mu^{2}}\left(\frac{\tau-1}{\beta\tau+2}\right)^{2}\left(\frac{\beta\tau+2}{\beta\tau+1}\right)^{\tau-1}\leq 1.

Since

(βτ+2βτ+1)τ1=(1+1βτ+1)τ1=(1+(τ1)/(βτ+1)τ1)τ1exp{τ1βτ+1}e1β,\left(\frac{\beta\tau+2}{\beta\tau+1}\right)^{\tau-1}=\left(1+\frac{1}{\beta\tau+1}\right)^{\tau-1}=\left(1+\frac{(\tau-1)/(\beta\tau+1)}{\tau-1}\right)^{\tau-1}\leq\exp\left\{\frac{\tau-1}{\beta\tau+1}\right\}\leq e^{\frac{1}{\beta}},

we need

2L2λ(C1+1)μ2(τ1βτ+2)2e1β1.\frac{2L^{2}\lambda(C_{1}+1)}{\mu^{2}}\left(\frac{\tau-1}{\beta\tau+2}\right)^{2}e^{\frac{1}{\beta}}\leq 1.

Let ν=2L2λ(C1+1)/μ2\nu=2L^{2}\lambda(C_{1}+1)/\mu^{2}. Then the above equation is equivalent to

(\beta^{2}-\nu e^{\frac{1}{\beta}})\tau^{2}+2(2\beta+\nu e^{\frac{1}{\beta}})\tau+(4-\nu e^{\frac{1}{\beta}})\geq 0.

First, we let β2νe1β>0\beta^{2}-\nu e^{\frac{1}{\beta}}>0 or equivalently

β2e1β>2L2λ(C1+1)μ2.\frac{\beta^{2}}{e^{\frac{1}{\beta}}}>\frac{2L^{2}\lambda(C_{1}+1)}{\mu^{2}}. (52)

Then we need τ\tau to be large enough such that

\tau\geq\frac{-2(2\beta+\nu e^{\frac{1}{\beta}})+\sqrt{4(2\beta+\nu e^{\frac{1}{\beta}})^{2}-\max\left\{4(\beta^{2}-\nu e^{\frac{1}{\beta}})(4-\nu e^{\frac{1}{\beta}}),0\right\}}}{2(\beta^{2}-\nu e^{\frac{1}{\beta}})}.

Since \sqrt{a^{2}+b}\leq|a|+\sqrt{|b|} for any a,b\in\mathbb{R}, the right-hand side of the inequality above is at most

max{νe1/β4,0}β2νe1β=max{(2L2λ(C1+1)/μ2)e1/β4,0}β2(2L2λ(C1+1)/μ2)e1β.\sqrt{\frac{\max\left\{\nu e^{1/\beta}-4,0\right\}}{\beta^{2}-\nu e^{\frac{1}{\beta}}}}=\sqrt{\frac{\max\left\{(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{1/\beta}-4,0\right\}}{\beta^{2}-(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{\frac{1}{\beta}}}}.

Therefore, we need

τmax{(2L2λ(C1+1)/μ2)e1/β4,0}β2(2L2λ(C1+1)/μ2)e1β.\tau\geq\sqrt{\frac{\max\left\{(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{1/\beta}-4,0\right\}}{\beta^{2}-(2L^{2}\lambda(C_{1}+1)/\mu^{2})e^{\frac{1}{\beta}}}}. (53)

The final result follows from the combination of (51)-(53). ∎
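To make the parameter choices in Lemma 14 concrete, the following helper computes admissible values of \beta and \tau and the resulting stepsizes \eta_{k}=\left(\mu(k+\beta\tau+1)\right)^{-1}. It is a direct transcription of the displayed conditions; the numerical slack eps and the illustrative constants below are assumptions.

import math

def lemma14_parameters(L, mu, lam, C1, C2, M, eps=1e-6):
    """Return (beta, tau, eta) consistent with the conditions of Lemma 14:
    beta > max{ 2*lam*L/mu*(C1/M + C2 + 1) - 2, 2*L^2*lam*(C1 + 1)/mu^2 },
    tau >= sqrt( max{nu*e^(1/beta) - 4, 0} / (beta^2 - nu*e^(1/beta)) ),
    with nu = 2*L^2*lam*(C1 + 1)/mu^2 and eta_k = 1/(mu*(k + beta*tau + 1)).
    """
    nu = 2.0 * L ** 2 * lam * (C1 + 1) / mu ** 2
    beta = max(2.0 * lam * L / mu * (C1 / M + C2 + 1) - 2.0, nu) + eps
    # The proof additionally uses beta^2 > nu * e^(1/beta); enlarge beta until it holds.
    while beta ** 2 <= nu * math.exp(1.0 / beta):
        beta *= 1.1
    denom = beta ** 2 - nu * math.exp(1.0 / beta)
    tau_min = math.sqrt(max(nu * math.exp(1.0 / beta) - 4.0, 0.0) / denom)
    tau = max(1, math.ceil(tau_min))
    eta = lambda k: 1.0 / (mu * (k + beta * tau + 1.0))
    return beta, tau, eta

# Hypothetical constants, for illustration only.
beta, tau, eta = lemma14_parameters(L=10.0, mu=1.0, lam=1.5, C1=1.0, C2=1.0, M=20)
print(beta, tau, eta(0))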

B.8 Proof of Theorem 6

Nesterov's worst case objective. (Nesterov, 2018) Let h^{\prime}:\mathbb{R}^{\infty}\rightarrow\mathbb{R} be Nesterov's worst case objective, i.e., h^{\prime}(y)=\frac{1}{2}y^{\top}Ay-e_{1}^{\top}y with tridiagonal A having diagonal elements equal to 2+c (for some c>0) and off-diagonal elements equal to 1. (This is the strongly convex case; the convex case is handled similarly.) The proof rationale is to show that the k-th iterate of any first-order method must satisfy \|y^{k}\|_{0}\leq k and consequently

yky2(κ1κ+1)2ky2\|y^{k}-y^{*}\|^{2}\geq\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k}\|y^{*}\|^{2} (54)

where y^{*}\coloneqq\operatorname*{arg\,min}_{y\in\mathbb{R}^{\infty}}h^{\prime}(y) and \kappa\coloneqq\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.
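As an illustration of (54), one can truncate the construction to a finite dimension and compute the condition number of A together with the implied contraction factor; the truncation size and the value of c below are hypothetical.

import numpy as np

# Finite truncation of the tridiagonal worst-case matrix A described above:
# diagonal entries 2 + c, off-diagonal entries 1 (c > 0 gives strong convexity).
d, c = 200, 0.01
A = (2.0 + c) * np.eye(d) + np.diag(np.ones(d - 1), 1) + np.diag(np.ones(d - 1), -1)

eigs = np.linalg.eigvalsh(A)
kappa = eigs.max() / eigs.min()              # condition number of the truncation

# Lower-bound factor from (54) after k iterations of any first-order method.
k = 100
rate = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** (2 * k)
print(kappa, rate)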

Finite sum worst case objective. (Lan & Zhou, 2018) The worst case finite-sum objective $h:\mathbb{R}^{\infty}\rightarrow\mathbb{R}$, $h(z)=\frac{1}{n}\sum_{j=1}^{n}h_{j}(z)$, is constructed so that $h_{j}$ depends only on the $j$-th block of coordinates; in particular, writing $z=[z_{1},z_{2},\dots,z_{n}]$ with $z_{1},z_{2},\dots,z_{n}\in\mathbb{R}^{\infty}$, we set $h_{j}(z)=h^{\prime}(z_{j})$. (We have lifted the construction to the infinite-dimensional space for simplicity; one can obtain a similar finite-dimensional result.) It was shown that to reach $\|z^{k}-z^{*}\|^{2}\leq\epsilon$ one requires at least $\Omega\left(\left(n+\sqrt{\frac{n{\cal L}}{\mu}}\right)\log\frac{1}{\epsilon}\right)$ iterations for ${\cal L}$-smooth functions $h_{j}$ and $\mu$-strongly convex $h$.
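
The block structure can be sketched as follows (a finite truncation with assumed dimensions, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
n_blocks, b = 4, 6                              # number of blocks and truncated block size
A = np.diag(np.full(b, 2.01)) + np.diag(np.ones(b - 1), 1) + np.diag(np.ones(b - 1), -1)
e1 = np.eye(b)[:, 0]

def h_prime(y):                                 # truncated Nesterov worst-case objective h'
    return 0.5 * y @ A @ y - e1 @ y

def h(z):                                       # z has shape (n_blocks, b); h_j sees only z[j]
    return np.mean([h_prime(z[j]) for j in range(n_blocks)])

print(h(rng.normal(size=(n_blocks, b))))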

Distributed worst case objective. (Scaman et al., 2018) Define

\begin{aligned} g^{\prime}_{1}(z)&\coloneqq\frac{1}{2}\left(c_{1}\|z\|^{2}+c_{2}\left(e_{1}^{\top}z+z^{\top}{\bm{M}}_{1}z\right)\right)\\ g^{\prime}_{2}(z)=g^{\prime}_{3}(z)=\dots=g^{\prime}_{M}(z)&\coloneqq\frac{1}{2(M-1)}\left(c_{1}\|z\|^{2}+c_{2}z^{\top}{\bm{M}}_{2}z\right)\end{aligned}

where ${\bm{M}}_{1}$ is an infinite block diagonal matrix with blocks $\begin{pmatrix}1&1&0&0\\ 1&2&1&0\\ 0&1&1&0\\ 0&0&0&0\end{pmatrix}$, ${\bm{M}}_{2}\coloneqq\begin{pmatrix}1&0&\\ 0&0&\\ &&{\bm{M}}_{1}\end{pmatrix}$, and $c_{1},c_{2}>0$ are constants determining the smoothness and strong convexity of the objective. The worst case objective of Scaman et al. (2018) is then $g(z)=\frac{1}{M}\sum_{m=1}^{M}g^{\prime}_{m}(z)$.
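
A finite truncation of this construction (dimensions and constants below are arbitrary assumptions) can be written down explicitly:

import numpy as np

def block_tridiag(d):
    """d x d truncation of M1: the 4 x 4 block tiled along the diagonal."""
    block = np.array([[1., 1., 0., 0.],
                      [1., 2., 1., 0.],
                      [0., 1., 1., 0.],
                      [0., 0., 0., 0.]])
    M1 = np.zeros((d, d))
    for k in range(0, d - 3, 4):
        M1[k:k + 4, k:k + 4] = block
    return M1

d, M, c1, c2 = 24, 4, 0.1, 1.0                  # truncated dimension, devices, constants
M1 = block_tridiag(d)
M2 = np.zeros((d, d))                           # M2 = diag(diag(1, 0), M1), truncated
M2[0, 0] = 1.
M2[2:, 2:] = block_tridiag(d - 2)
e1 = np.eye(d)[:, 0]

def g1(z):                                      # g'_1
    return 0.5 * (c1 * z @ z + c2 * (e1 @ z + z @ M1 @ z))

def g_other(z):                                 # g'_2 = ... = g'_M
    return (c1 * z @ z + c2 * z @ M2 @ z) / (2 * (M - 1))

def g(z):                                       # the averaged worst-case objective
    return (g1(z) + (M - 1) * g_other(z)) / M

print("g(e1) =", g(e1))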

Distributed worst case objective with local finite sum. (Hendrikx et al., 2021) The given construction is obtained from the one of Scaman et al. (2018) in the same way as the worst case finite-sum objective of Lan & Zhou (2018) was obtained from the construction of Nesterov (2018). In particular, one sets $g_{m,j}(z)=g^{\prime}_{m}(z_{j})$ where $z=[z_{1},z_{2},\dots,z_{n}]$. It was shown that such a construction with properly chosen $c_{1},c_{2}$ yields a lower bound on the communication complexity of order $\Omega\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ and a lower bound on the local computation of order $\Omega\left(\left(n+\sqrt{\frac{n{\cal L}}{\mu}}\right)\log\frac{1}{\epsilon}\right)$, where ${\cal L}$ is a smoothness constant of $g_{m,j}$, $L$ is a smoothness constant of $g_{m}(z)=\frac{1}{n}\sum_{j=1}^{n}g_{m,j}(z)$, and $\mu$ is the strong convexity constant of $g(z)=\frac{1}{M}\sum_{m=1}^{M}g_{m}(z)$.

Our construction and sketch of the proof. Our construction is straightforward: we set $f_{m}(w,\beta_{m})=g(w)+h(\beta_{m})$ with $g$, $h$ scaled appropriately so that the strong convexity ratio is as required by Assumption 1. Clearly, to minimize the global part $g(w)$, we require at least $\Omega\left(\sqrt{\frac{L^{w}}{\mu}}\log\frac{1}{\epsilon}\right)$ communication rounds and at least $\Omega\left(\left(n+\sqrt{\frac{n{\cal L}^{w}}{\mu}}\right)\log\frac{1}{\epsilon}\right)$ stochastic gradients of $g$. Similarly, to minimize $h$, we require at least $\Omega\left(\left(n+\sqrt{\frac{n{\cal L}^{\beta}}{\mu}}\right)\log\frac{1}{\epsilon}\right)$ stochastic gradients of $h$. Therefore, Theorem 6 is established.

B.9 Proof of Theorem 8

Taking the stochastic gradient step followed by the proximal step with respect to $\psi$, both with stepsize $\eta$, is equivalent to (Hanzely et al., 2020b):

\text{w.p. }p_{w}:\quad\begin{cases}w^{+}=w-\eta\left(\frac{1}{p_{w}}\left(\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m,j}(w,\beta_{m})-\frac{1}{M}\sum_{m=1}^{M}\nabla_{w}f_{m,j}(w^{\prime},\beta^{\prime}_{m})\right)+\nabla_{w}F(w^{\prime},\beta^{\prime})\right),\\ \beta^{+}_{m}=\beta_{m}-\frac{\eta}{M}\nabla_{\beta}f_{m}(w^{\prime},\beta^{\prime}_{m})\end{cases} (55)
\text{w.p. }p_{\beta}:\quad\begin{cases}w^{+}=w-\eta\nabla_{w}F(w^{\prime},\beta^{\prime}),\\ \beta_{m}^{+}=\beta_{m}-\frac{\eta}{M}\left(\frac{1}{p_{\beta}}\left(\nabla_{\beta}f_{m,j}(w,\beta_{m})-\nabla_{\beta}f_{m,j}(w^{\prime},\beta^{\prime}_{m})\right)+\nabla_{\beta}f_{m}(w^{\prime},\beta^{\prime}_{m})\right).\end{cases}
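
For concreteness, a minimal runnable sketch of this update (not the authors' implementation) is given below; the toy objective $f_{m,j}(w,\beta_{m})=\frac{1}{2}\|w-a_{m,j}\|^{2}+\frac{\lambda}{2}\|\beta_{m}-w\|^{2}$, the stepsize choice, and the occasional reference-point refresh are assumptions made only to keep the example self-contained:

import numpy as np

rng = np.random.default_rng(0)
M, n, d, lam = 5, 8, 3, 0.5                    # devices, local samples, dimension, penalty
a = rng.normal(size=(M, n, d))                 # toy data defining f_{m,j}
p_w, p_b = 0.5, 0.5
eta = 1 / (4 * 2 * max((1 + lam) / p_w, lam / p_b))   # eta = 1/(4 cal L), mirroring Theorem 9

def grad_w(w, bm, m, j):                       # nabla_w f_{m,j}(w, beta_m) for the toy model
    return (w - a[m, j]) + lam * (w - bm)

def grad_b(w, bm, m, j):                       # nabla_beta f_{m,j}(w, beta_m) for the toy model
    return lam * (bm - w)

def full_grads(w, beta):                       # nabla_w F(w, beta) and all nabla_beta f_m(w, beta_m)
    gw = np.mean([grad_w(w, beta[m], m, j) for m in range(M) for j in range(n)], axis=0)
    gb = np.array([np.mean([grad_b(w, beta[m], m, j) for j in range(n)], axis=0) for m in range(M)])
    return gw, gb

w, beta = np.zeros(d), np.zeros((M, d))
w_ref, beta_ref = w.copy(), beta.copy()
gw_ref, gb_ref = full_grads(w_ref, beta_ref)

for _ in range(5000):
    j = rng.integers(n)                                    # shared sample index
    if rng.random() < p_w:                                 # first case of (55): w takes the VR step
        diff = np.mean([grad_w(w, beta[m], m, j) - grad_w(w_ref, beta_ref[m], m, j)
                        for m in range(M)], axis=0)
        w = w - eta * (diff / p_w + gw_ref)
        beta = beta - (eta / M) * gb_ref
    else:                                                  # second case of (55): the beta_m take the VR step
        diff = np.array([grad_b(w, beta[m], m, j) - grad_b(w_ref, beta_ref[m], m, j)
                         for m in range(M)])
        beta = beta - (eta / M) * (diff / p_b + gb_ref)
        w = w - eta * gw_ref
    if rng.random() < 0.1:                                 # refresh the reference point (an L-SVRG-style
        w_ref, beta_ref = w.copy(), beta.copy()            # assumption; (55) itself keeps x' fixed)
        gw_ref, gb_ref = full_grads(w_ref, beta_ref)

print("||w - global mean of a|| =", np.linalg.norm(w - a.mean(axis=(0, 1))))
print("max_m ||beta_m - w||     =", np.abs(beta - w).max())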

Let $x=[w,\beta_{1},\dots,\beta_{M}]$ and $x^{\prime}=[w^{\prime},\beta_{1}^{\prime},\dots,\beta_{M}^{\prime}]$. The update rule (55) can be rewritten as

x^{+}=x-\eta\left(g(x)-g(x^{\prime})+\nabla F(x^{\prime})\right),

where $g(x)$ denotes the described unbiased stochastic gradient estimator obtained by subsampling the parameter space and the finite sum simultaneously. To give the rate of the above method, we need to determine the expected smoothness constant. To this end, we introduce the following two lemmas.

Lemma 15.

Suppose that Assumptions 1 and 2 hold. Then

\mathbb{E}\left\|(g(x)-g(x^{\prime})+\nabla F(x^{\prime}))-\nabla F(x)\right\|^{2}\leq 2{\cal L}D_{F}(x,x^{\prime}),

where ${\cal L}\coloneqq 2\max\left(\frac{{\cal L}^{w}}{p_{w}},\frac{{\cal L}^{\beta}}{p_{\beta}}\right)$.

Proof.

Let $d_{\beta}\coloneqq\sum_{m=1}^{M}d_{m}$. We have:

\begin{aligned} &\mathbb{E}\left\|(g(x)-g(x^{\prime})+\nabla F(x^{\prime}))-\nabla F(x)\right\|^{2}\\ &\qquad\leq\mathbb{E}\left\|g(x)-g(x^{\prime})\right\|^{2}\\ &\qquad=p_{w}\mathbb{E}\left[\left\|p_{w}^{-1}\frac{1}{M}\sum_{m=1}^{M}\left(\nabla_{w}f_{m,j}(w,\beta_{m})-\nabla_{w}f_{m,j}(w^{\prime},\beta^{\prime}_{m})\right)\right\|^{2}\,\middle|\,\zeta=1\right]\\ &\qquad\qquad+p_{\beta}\frac{1}{M^{2}}\sum_{m=1}^{M}\mathbb{E}\left[\left\|p_{\beta}^{-1}\nabla_{\beta}f_{m,j}(w,\beta_{m})-p_{\beta}^{-1}\nabla_{\beta}f_{m,j}(w^{\prime},\beta^{\prime}_{m})\right\|^{2}\,\middle|\,\zeta=2\right]\\ &\qquad=p_{w}^{-1}\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^{M}\left(\nabla_{w}f_{m,j}(w,\beta_{m})-\nabla_{w}f_{m,j}(w^{\prime},\beta^{\prime}_{m})\right)\right\|^{2}\,\middle|\,\zeta=1\right]\\ &\qquad\qquad+p_{\beta}^{-1}\frac{1}{M^{2}}\sum_{m=1}^{M}\mathbb{E}\left[\left\|\nabla_{\beta}f_{m,j}(w,\beta_{m})-\nabla_{\beta}f_{m,j}(w^{\prime},\beta^{\prime}_{m})\right\|^{2}\,\middle|\,\zeta=2\right]\\ &\qquad=\mathbb{E}\left[(\nabla F_{j}(x)-\nabla F_{j}(x^{\prime}))^{\top}\begin{pmatrix}p_{w}^{-1}I^{d_{0}\times d_{0}}&0\\ 0&p_{\beta}^{-1}I^{d_{\beta}\times d_{\beta}}\end{pmatrix}(\nabla F_{j}(x)-\nabla F_{j}(x^{\prime}))\right]\\ &\qquad\stackrel{(*)}{\leq}\mathbb{E}\left[4\max\left(\frac{{\cal L}^{w}}{p_{w}},\frac{{\cal L}^{\beta}}{p_{\beta}}\right)D_{F_{j}}(x,x^{\prime})\right]\\ &\qquad=4\max\left(\frac{{\cal L}^{w}}{p_{w}},\frac{{\cal L}^{\beta}}{p_{\beta}}\right)D_{F}(x,x^{\prime}),\end{aligned}

where $(*)$ holds due to the $({\cal L}^{w},{\cal L}^{\beta})$-smoothness of $F_{j}$ (from Assumption 2) and Lemma 16. ∎

Lemma 16.

Let $H(x,y):\mathbb{R}^{d_{x}+d_{y}}\rightarrow\mathbb{R}$ be a (jointly) convex function such that

\nabla^{2}_{x,x}H(x,y)\preceq L_{x}{\bm{I}}\quad\text{and}\quad\nabla^{2}_{y,y}H(x,y)\preceq L_{y}{\bm{I}}.

Then

\nabla^{2}H(x,y)\preceq 2\begin{pmatrix}L_{x}{\bm{I}}&0\\ 0&L_{y}{\bm{I}}\end{pmatrix} (56)

and

D_{H}((x,y),(x^{\prime},y^{\prime}))\geq\frac{1}{2}\left(\nabla H(x,y)-\nabla H(x^{\prime},y^{\prime})\right)^{\top}\begin{pmatrix}\frac{1}{2}L_{x}^{-1}{\bm{I}}&0\\ 0&\frac{1}{2}L_{y}^{-1}{\bm{I}}\end{pmatrix}\left(\nabla H(x,y)-\nabla H(x^{\prime},y^{\prime})\right). (57)
Proof.

To show (56), observe that

\begin{aligned} 2\begin{pmatrix}L_{x}{\bm{I}}&0\\ 0&L_{y}{\bm{I}}\end{pmatrix}-\nabla^{2}H(x,y)&=\begin{pmatrix}2L_{x}{\bm{I}}-\nabla^{2}_{x,x}H(x,y)&-\nabla^{2}_{x,y}H(x,y)\\ -\nabla^{2}_{y,x}H(x,y)&2L_{y}{\bm{I}}-\nabla^{2}_{y,y}H(x,y)\end{pmatrix}\\ &\succeq\begin{pmatrix}\nabla^{2}_{x,x}H(x,y)&-\nabla^{2}_{x,y}H(x,y)\\ -\nabla^{2}_{y,x}H(x,y)&\nabla^{2}_{y,y}H(x,y)\end{pmatrix}\\ &=\nabla^{2}G(x,-y)\\ &\succeq 0,\end{aligned}

where the inequality uses the assumed bounds on the diagonal Hessian blocks, and $G(x,y)\coloneqq H(x,-y)$ is jointly convex (a composition of $H$ with a linear map), so that $\nabla^{2}G\succeq 0$.

Finally, we note that (57) is a direct consequence of (56) and the joint convexity of $H$. ∎
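
As a sanity check (illustration only), for a convex quadratic $H(x,y)=\frac{1}{2}z^{\top}Qz$ with $z=(x,y)$ and $Q\succeq 0$, both (56) and (57) can be verified numerically:

import numpy as np

rng = np.random.default_rng(1)
dx, dy = 3, 4
B = rng.normal(size=(dx + dy, dx + dy))
Q = B.T @ B                                      # Hessian of H; H is jointly convex
Lx = np.linalg.eigvalsh(Q[:dx, :dx]).max()       # smoothness of x -> H(x, y)
Ly = np.linalg.eigvalsh(Q[dx:, dx:]).max()       # smoothness of y -> H(x, y)

gap = 2 * np.diag([Lx] * dx + [Ly] * dy) - Q     # (56): this matrix should be PSD
print("min eigenvalue of the (56) gap:", np.linalg.eigvalsh(gap).min())

z, zp = rng.normal(size=dx + dy), rng.normal(size=dx + dy)
breg = 0.5 * (z - zp) @ Q @ (z - zp)             # Bregman divergence of the quadratic H
gdiff = Q @ (z - zp)                             # gradient difference
lower = 0.5 * gdiff @ np.diag([0.5 / Lx] * dx + [0.5 / Ly] * dy) @ gdiff
print("Bregman divergence minus the (57) lower bound:", breg - lower)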

We are now ready to state the convergence rate of ASVRCD-PFL.

Theorem 9.

Iteration complexity of Algorithm 3 with

\eta=\frac{1}{4{\cal L}},\quad\theta_{2}=\frac{1}{2},\quad\gamma=\frac{1}{\max\{2\mu,4\theta_{1}/\eta\}},\quad\nu=1-\gamma\mu,\quad\text{and}\quad\theta_{1}=\min\left\{\frac{1}{2},\sqrt{\eta\mu\max\left\{\frac{1}{2},\frac{\theta_{2}}{\rho}\right\}}\right\}

is

{\cal O}\left(\left(\frac{1}{\rho}+\sqrt{\frac{\max\left(\frac{{\cal L}^{w}}{p_{w}},\frac{{\cal L}^{\beta}}{p_{\beta}}\right)}{\rho\mu}}\right)\log\frac{1}{\epsilon}\right).

Setting $p_{w}=\frac{{\cal L}^{w}}{{\cal L}^{\beta}+{\cal L}^{w}}$ yields the complexity

{\cal O}\left(\left(\frac{1}{\rho}+\sqrt{\frac{{\cal L}^{w}+{\cal L}^{\beta}}{\rho\mu}}\right)\log\frac{1}{\epsilon}\right).
Proof.

The proof follows from Lemma 15 and Theorem 4.1 of Hanzely et al. (2020b) and is thus omitted. ∎
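
To make the parameter choices of Theorem 9 concrete, one can evaluate them directly (illustration only; the constants below are arbitrary assumptions, ${\cal L}$ is taken from Lemma 15, and $p_{\beta}=1-p_{w}$):

L_w, L_b, p_w, mu, rho = 10.0, 5.0, 0.5, 0.1, 0.05
cal_L = 2 * max(L_w / p_w, L_b / (1 - p_w))      # the expected smoothness constant of Lemma 15
eta = 1 / (4 * cal_L)
theta2 = 0.5
theta1 = min(0.5, (eta * mu * max(0.5, theta2 / rho)) ** 0.5)
gamma = 1 / max(2 * mu, 4 * theta1 / eta)
nu = 1 - gamma * mu
print(dict(eta=eta, theta1=theta1, theta2=theta2, gamma=gamma, nu=nu))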

Overall, the algorithm requires

{\cal O}\left(\left(\frac{1}{\rho}+\sqrt{\frac{{\cal L}^{w}+{\cal L}^{\beta}}{\rho\mu}}\right)\left(\log\frac{1}{\epsilon}\right)(\rho n+p_{w})\right)

communication rounds and the same number of gradient calls with respect to the parameter $w$. Setting $\rho=\frac{p_{w}}{n}$, we have

\begin{aligned}\left(\frac{1}{\rho}+\sqrt{\frac{{\cal L}^{w}+{\cal L}^{\beta}}{\rho\mu}}\right)\left(\log\frac{1}{\epsilon}\right)(\rho n+p_{w})&=2\left(\frac{1}{\rho}+\sqrt{\frac{{\cal L}^{w}+{\cal L}^{\beta}}{\rho\mu}}\right)\left(\log\frac{1}{\epsilon}\right)\rho n\\ &=2\left(n+\sqrt{\frac{\rho n^{2}({\cal L}^{w}+{\cal L}^{\beta})}{\mu}}\right)\left(\log\frac{1}{\epsilon}\right)\\ &=2\left(n+\sqrt{\frac{n{\cal L}^{w}}{\mu}}\right)\left(\log\frac{1}{\epsilon}\right),\end{aligned}

which shows that Algorithm 3 enjoys both a communication complexity and a global gradient complexity of order ${\cal O}\left(\left(n+\sqrt{\frac{n{\cal L}^{w}}{\mu}}\right)\log\frac{1}{\epsilon}\right)$. Analogously, setting $\rho=\frac{p_{\beta}}{n}$ yields a personalized/local gradient complexity of order ${\cal O}\left(\left(n+\sqrt{\frac{n{\cal L}^{\beta}}{\mu}}\right)\log\frac{1}{\epsilon}\right)$.
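
The bookkeeping above can be reproduced with a small helper (illustration only; constants and the $\log\frac{1}{\epsilon}$ factor are ignored, and the sample values are arbitrary assumptions):

import math

def complexities(L_w, L_b, mu, n):
    """Evaluate the expressions above with p_w = L^w / (L^w + L^beta) and rho = p_w / n."""
    p_w = L_w / (L_w + L_b)
    rho = p_w / n
    iters = 1 / rho + math.sqrt((L_w + L_b) / (rho * mu))   # iteration count, up to log(1/eps)
    comm = iters * (rho * n + p_w)                          # communication rounds / w-gradient calls
    claim = 2 * (n + math.sqrt(n * L_w / mu))               # the claimed 2(n + sqrt(n L^w / mu))
    return comm, claim

print(complexities(L_w=10.0, L_b=5.0, mu=0.1, n=100))       # the two values coincide (up to rounding)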