
Federated Optimization of Smooth Loss Functions

Ali Jadbabaie, Anuran Makur, and Devavrat Shah

The author ordering is alphabetical. This research was supported in part by the ONR under Grant N000142012394, in part by the ARO MURI under Grant W911NF-19-1-0217, and in part by the NSF TRIPODS Foundations of Data Science.

A. Jadbabaie is with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: jadbabai@mit.edu). A. Makur is with the Department of Computer Science and the Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA (e-mail: amakur@purdue.edu). D. Shah is with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: devavrat@mit.edu).
Abstract

In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server seeks to minimize an ERM objective function using $n$ samples of training data that are stored across $m$ clients and the server. The recent flurry of research in this area has identified the Federated Averaging ($\mathsf{FedAve}$) algorithm as the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, e.g., stochastic gradient descent, the convergence analysis of $\mathsf{FedAve}$ and its variants relies only on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness in data in a federated learning context, we propose the Federated Low Rank Gradient Descent ($\mathsf{FedLRGD}$) algorithm. Since smoothness in data induces an approximate low rank structure on the gradient of the loss function, our algorithm first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients using its own gradients. Then, our algorithm solves the ERM problem at the server using an inexact gradient descent method. To theoretically demonstrate that $\mathsf{FedLRGD}$ can have superior performance to $\mathsf{FedAve}$, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity in the optimization literature. Under some assumptions on the loss function, e.g., strong convexity and smoothness in the parameter, $\eta$-Hölder class smoothness in the data, etc., we prove that the federated oracle complexity of $\mathsf{FedLRGD}$ scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of $\mathsf{FedAve}$ scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting typically sub-dominant factors), where $\phi\gg 1$ is the ratio of client-to-server communication time to gradient computation time, $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small compared to $n$ and the loss function is sufficiently smooth in the data, i.e., $\eta=\Theta(d)$, $\mathsf{FedLRGD}$ beats $\mathsf{FedAve}$ in federated oracle complexity. Finally, in the course of analyzing $\mathsf{FedLRGD}$, we also establish a general result on low rank approximation of smooth latent variable models.

Index Terms:
Federated learning, empirical risk minimization, gradient descent, Hölder class, low rank approximation.

I Introduction

Optimization for machine learning has become a prevalent subject of study due to the advent of large scale model learning tasks. Indeed, while optimization theory is a relatively mature field in applied mathematics (cf. [1, 2]), many of its recent developments have focused on its application to specialized machine learning settings. For instance, several common machine learning problems, such as classification and regression, can be posed as empirical risk minimization (ERM) problems. In the unconstrained case, ERM problems have the form [3], [4, Section 1]:

$$\min_{\theta\in\mathbb{R}^{p}} F(\theta) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(x^{(i)};\theta)\,, \quad (1)$$

where $\theta\in\mathbb{R}^{p}$ is the parameter over which we optimize, $x^{(1)},\dots,x^{(n)}\in[0,1]^{d}$ are some given training data, $f:[0,1]^{d}\times\mathbb{R}^{p}\rightarrow\mathbb{R}$, $f(x;\theta)$, denotes the loss function, and $F:\mathbb{R}^{p}\rightarrow\mathbb{R}$ defines the empirical risk (or objective function). There is a gargantuan literature that develops and analyzes iterative gradient-based optimization methods that solve the specific optimization in (1), e.g., (exact and projected) gradient descent (GD) [2, 4], inexact GD [5, 6, 7, 8], stochastic gradient descent (SGD) [9], GD with momentum [10], accelerated GD [11], mini-batch SGD [12], variance reduction methods [13, 14, 15, 16], and methods with adaptive learning rates [17, 18, 19] (also see the recent monographs [4, 20, 21, 22] and the references therein).
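To make (1) concrete, the following minimal sketch runs full-batch GD on a ridge-regularized least-squares ERM objective; the synthetic data, loss, step size, and iteration count are all illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def erm_gradient_descent(X, y, lam=0.1, lr=0.05, iters=500):
    """Full-batch GD on F(theta) = (1/n) sum_i f(x^(i); theta), where
    f(x; theta) = 0.5*(x^T theta - y)^2 + 0.5*lam*||theta||_2^2."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        residual = X @ theta - y                    # shape (n,)
        grad = X.T @ residual / n + lam * theta     # gradient of F at theta
        theta -= lr * grad                          # one GD step
    return theta

# Toy data: n = 100 samples x^(i) in [0,1]^d with d = p = 5.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.standard_normal(100)
print(erm_gradient_descent(X, y))
```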

However, as observed in [23], the optimization for machine learning literature typically analyzes convergence and oracle complexity of iterative gradient-based algorithms for (1) by posing the problem as, cf. [4, Section 1], [3]:

$$\min_{\theta\in\mathbb{R}^{p}} \frac{1}{n}\sum_{i=1}^{n} f_{i}(\theta)\,, \quad (2)$$

where each $f_{i}(\theta)=f(x^{(i)};\theta)$. Hence, canonical analyses of iterative methods for the representation (2) do not effectively utilize information about the smoothness of the loss function $f(x;\theta)$ with respect to the data $x$. As noted in [23], such smoothness properties are actually quite pervasive in ERM problems, e.g., (regularized) linear regression [3, Chapter 3], (regularized) logistic regression [3, Section 4.4], deep neural network training with smooth loss and activation functions [24, Chapter 5], etc. To address the paucity of theoretical analyses that exploit this additional smoothness information, [23] initiates a study of inexact gradient descent methods for ERM problems that learn the gradient oracle at every iteration by performing local polynomial regressions [25, 26, 27].

In this paper, we continue this important thread of exploiting the smoothness of loss functions in training data to improve optimization algorithms. But in contrast to [23], we consider the newly developed federated learning paradigm here, and utilize the approximate low rank structure induced by smoothness in data to theoretically ameliorate optimization performance in this paradigm. We briefly introduce the federated learning architecture and formally abstract it in Section I-A, and we subsequently explain our main contributions in Section I-B.

I-A Federated Learning

The recent escalation of mobile phone usage and emergence of the Internet of Things (IoT) has engendered unprecedented volumes of collected data on such devices. While it is clearly of utility to train machine learning models based on this data, various new challenges pertaining to privacy, scalability, distributed computation, etc. must be addressed before doing so. For instance, data collected on a user's mobile phone must be kept private, and even if privacy is not a concern, it is infeasible to communicate large quantities of data from millions of users to a central server for computation. To deal with such issues, a new paradigm known as federated learning has been formulated in the literature [28]. There are several classes of federated learning architectures (see, e.g., [29]): those with a synchronous central server, fully decentralized networks with synchronized clocks, and asynchronous versions of these topologies. We consider synchronous architectures with a central server in this paper.

In this synchronous setting, the architecture consists of a single central server and a collection of several clients (that all have synchronized clocks), cf. [28, 30, 31], and the references therein. Typically, the server and clients hold their own private data. So, to train machine learning models, federated optimization is carried out by clients training possibly inaccurate models based on their private data, and the server periodically aggregating client models to produce a more accurate model at the server. Several key dimensions are considered when studying such federated learning and optimization systems:

  1. Privacy: The training data resides with the clients and is not shared with the server (and vice versa) [28]. However, the server must leverage the data stored at the clients via several communication rounds to improve the statistical accuracy of the learnt model.

  2. Large client population with possibly inactive clients: Federated learning systems usually have a very large number of clients so that the client population size often dominates the number of training samples per client [28]. On the other hand, a nontrivial fraction of clients could be inactive during the learning process, e.g., mobile phones only send updates when they are online and being charged [30]. So, federated optimization algorithms must be robust to participation from smaller subsets of clients.

  3. Communication cost: There is significant cost associated with clients communicating with the server. Typically, clients have limited upload bandwidth and servers have limited download bandwidth. In particular, if a large number of clients seek to simultaneously communicate with the server, the server will put the clients in a queue to receive their updates. So, federated optimization schemes often try to increase the amount of (cheap) computation at each client in order to reduce the frequency of communication with the server (e.g., by employing several local updates at the clients), and utilize smaller fractions of active clients in certain rounds [28, 30]. (Moreover, such schemes may also utilize quantization to compress updates from clients to servers [30]; we do not consider data compression in this work.)

  4. Latency: Beyond the communication costs delineated above, the duration of client-to-server updates is determined by the slowest client among the set of active clients. Hence, using smaller fractions of active clients in certain rounds is also advantageous to eliminate stragglers. In fact, recent algorithms and analysis suggest that stragglers should be used at later stages of the learning process in order to reduce latency [31].

  5. Data heterogeneity: Different clients may have data that is drawn from different population distributions [28, 29]. Although some works in the literature, e.g., [30, 31], assume for simplicity of analysis that clients are homogeneous, i.e., the training data at every client is drawn from a common (unknown) probability distribution, we do not need to impose such a stringent assumption in this work. In particular, our new algorithm in Algorithm 1 and its theoretical analysis in Theorem 2 and Section V will hold in the general heterogeneous setting. Usually, data homogeneity simplifies the analysis of stochastic federated optimization methods since clients' gradients become unbiased estimators of true full gradients [30]. However, since we analyze the (sample-version of the) ERM problem given below in (3), which has the same form in both homogeneous and heterogeneous settings, and we will not use randomized estimators of true full gradients at the clients, our analysis is agnostic to whether the server and client data is homogeneous or heterogeneous.

Formally, we model synchronous federated learning with a single central server as follows. We assume that there are $m\in\mathbb{N}$ clients, indexed by $[m]$. The server is endowed with $r\in\mathbb{Z}_{+}$ training data samples $x^{(0,1)},\dots,x^{(0,r)}\in[0,1]^{d}$, and each client $i\in[m]$ is given $s\in\mathbb{N}$ training data samples $x^{(i,1)},\dots,x^{(i,s)}\in[0,1]^{d}$, where $n=ms+r$ so that $\{x^{(i,j)}:i\in\{0\}\cup[m],\,j\in[s]\}=\{x^{(k)}:k\in[n]\}$. (It is assumed for simplicity that every client has the same number of training data samples.) Under the assumptions outlined in Section II-B, the goal of the server is to utilize the clients to compute an $\epsilon$-approximate solution (in the sense of (11) or (12)) to the ERM problem (1), which may be rewritten as

$$\min_{\theta\in\mathbb{R}^{p}} F(\theta)=\frac{1}{n}\sum_{j=1}^{r} f(x^{(0,j)};\theta)+\frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{s} f(x^{(i,j)};\theta)\,. \quad (3)$$

To achieve this goal, we consider a specific family of federated optimization algorithms. In particular, any such algorithm runs over $T\in\mathbb{N}$ synchronized epochs (or rounds). During each epoch, the server may first communicate with $\tau m\in\mathbb{N}$ clients, where $\tau\in[0,1]$ denotes the proportion of active clients that participate in computation. Then, the server and the active clients perform gradient computations, or other less expensive computations, towards finding an $\epsilon$-approximate solution to (3), where we typically assume that all active clients perform the same number of gradient computations. Finally, each active client may communicate only one vector in $\mathbb{R}^{p}$ to the server (while taking care not to reveal their private data by transmitting vectors from which it could be easily extracted). This client-to-server communication has some nontrivial cost associated with it. On the other hand, it is usually assumed that server-to-client communication has negligible cost, because there are often no critical constraints on client download bandwidths and the server broadcasting bandwidth. So, the server broadcasts vectors in $\mathbb{R}^{p}$ to active clients during epochs at essentially no cost. We also note that every epoch need not contain the three aforementioned phases, i.e., the server need not always communicate with active clients or perform gradient computations, and clients need not always perform gradient computations and communicate with the server.

Several federated optimization algorithms for the synchronous central server setting have been proposed in the literature. The benchmark method in the field is known as Federated Averaging ($\mathsf{FedAve}$) [28], where clients perform several iterations of (mini-batch) SGD using their private data, and the server aggregates client updates by periodically averaging them over several epochs. Needless to say, $\mathsf{FedAve}$ belongs to the aforementioned family of algorithms, and has been extended in several directions. For example, the authors of [32] introduce an accelerated version of the algorithm by introducing momentum, and the authors of [30] include a compression (or quantization) step before any client-to-server communication in the algorithm to reduce communication cost. In particular, the iteration complexity and convergence of $\mathsf{FedAve}$ are carefully analyzed in [30]. More generally, a unified analysis of the class of communication-efficient SGD algorithms is presented in [33]. Various other federated optimization methods have also been proposed that address different drawbacks of $\mathsf{FedAve}$. For instance, federated optimization methods with better fixed point properties are developed in [34], latency of federated optimization algorithms is improved via adaptive node participation in [31], a federated optimization algorithm for training with only positive labels is presented in [35], and adaptive algorithms that incorporate more sophisticated gradient-based iterative methods, such as Adam [19], into the federated learning framework are proposed in [36]. In a different vein, one-shot federated learning is considered in [37], where only one round of client-to-server communication is used in the optimization process (also see follow-up work). We refer readers to [29] and the references therein for an in-depth further discussion of federated learning, distributed optimization, and related ideas.
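For intuition, here is a minimal sketch of the $\mathsf{FedAve}$ template described above (broadcast, local SGD at each client, server-side averaging), assuming squared-error losses, full client participation, and illustrative hyperparameters; it is a caricature of [28], not a faithful reproduction.

```python
import numpy as np

def fedave_round(theta, client_data, rng, local_steps=5, lr=0.1):
    """One FedAve epoch: each client runs a few local SGD steps from the
    broadcast parameter; the server then averages the returned vectors."""
    updates = []
    for X, y in client_data:                      # each client's private data
        th = theta.copy()
        for _ in range(local_steps):
            i = rng.integers(len(y))              # sample one local datum
            th -= lr * (X[i] @ th - y[i]) * X[i]  # SGD step on squared loss
        updates.append(th)                        # one vector in R^p per client
    return np.mean(updates, axis=0)               # server aggregation

rng = np.random.default_rng(1)
clients = [(rng.random((20, 3)), rng.random(20)) for _ in range(4)]
theta = np.zeros(3)
for _ in range(50):                               # T epochs
    theta = fedave_round(theta, clients, rng)
print(theta)
```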

Finally, we note that in the traditional federated learning literature, only the clients possess training data which the server cannot access. However, as mentioned above, our work assumes that the server also possesses some training data samples. As explained in [38], this is a reasonable assumption in many federated learning settings because the server could also function as a client and have access to its own private training data, it may have to collect some training data from clients (despite privacy concerns) for system monitoring purposes, it may collect data from test devices (or test clients) which do not participate in training, and it could even possess simulated training data from models. For these reasons, there is a growing recent literature on federated learning with server data, cf. [39, 40, 38] and the references therein. Furthermore, as remarked earlier, the server and client datasets can be heterogeneous in our work since the smoothness (or approximate low rankness) of loss functions will circumvent the need for data homogeneity in our analysis.

I-B Main Contributions

As mentioned earlier, the vast literature on federated optimization, including the aforementioned works, deals with the ERM problem representation in (2). Thus, these works do not exploit smoothness of loss functions in the training data for federated learning. In this paper, we fill this void by making the following key theoretical contributions:

  1. We define a notion of federated oracle complexity in Definition 1 to capture the running time of federated optimization algorithms, and argue its suitability using some simple probabilistic analysis. This formally generalizes classical oracle complexity of iterative optimization methods (see [2, Section 1.1]) to a federated learning setting. Our definition allows us to theoretically compare our proposed federated optimization algorithm (mentioned next) with the benchmark $\mathsf{FedAve}$ method from the literature. We believe that this definition will be of further utility in future research in the area.

  2. We propose a new Federated Low Rank Gradient Descent ($\mathsf{FedLRGD}$) method in Algorithm 1 to solve the ERM problem (1), or equivalently, (3), in the federated learning setting. Our method exploits the approximate low rankness of the gradient of the loss function, which is a consequence of its smoothness in data, by using a few rounds of communication to first learn weights that allow the server to approximate clients' gradients by computing weighted sums of its own gradients (see the toy sketch following this list). Then, the server can solve (1) by running an inexact GD method on its own, where the server estimates full gradients at each iteration from gradients corresponding to its private data. Moreover, like other federated optimization algorithms in the literature, our method preserves the privacy of the server and clients by never exchanging data samples, and it works in settings where server and client data is heterogeneous.

  3. Under the smoothness, strong convexity, and non-singularity assumptions outlined in Section II-B and Section III-C, we analyze the $\mathsf{FedLRGD}$ algorithm and show in Theorem 2 that its federated oracle complexity scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ (neglecting typically sub-dominant factors), where $\phi\gg 1$ is the ratio of communication cost to computation cost (see Section II-D), $\epsilon>0$ is the desired approximation accuracy of the solution (see Section II-B), and $\eta>0$ denotes a Hölder class parameter which determines the level of smoothness of the loss function in the data (see Section II-B).

  4. We also calculate the federated oracle complexity of the $\mathsf{FedAve}$ algorithm in Theorem 3 by building on known results from the literature, cf. [30], and show that it scales like $\phi m(p/\epsilon)^{3/4}$. Using this calculation, we demonstrate in Proposition 1 that when the data dimension is small, e.g., $d=O(\log(n)^{0.99})$, and the gradient of the loss function is sufficiently smooth in the training data so that it is approximately low rank, i.e., $\eta=\Theta(d)$, the federated oracle complexity of $\mathsf{FedLRGD}$ is much smaller than that of the benchmark $\mathsf{FedAve}$ algorithm.

  5. In the process of analyzing the $\mathsf{FedLRGD}$ algorithm, we also derive a new statistical result in Theorem 1 that utilizes the smoothness of loss functions in the training data to bound the Frobenius norm error in estimating a latent variable model (or graphon) corresponding to the loss function with its low rank approximation. This result generalizes the graphon approximation results in [41, 42, 43], and could be of independent utility for future research in high-dimensional statistics.
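To illustrate the weight-learning idea behind $\mathsf{FedLRGD}$ (contribution 2 above), consider the toy sketch below. It is our own illustration of the underlying principle, not Algorithm 1 itself: for a loss whose partial derivative $g(x;\theta)$ is smooth in $x$, gradients at a client's data points can be approximated as fixed weighted combinations of gradients at the server's data points, with weights fit at a few "training" parameter values and then reused at fresh parameters. The Gaussian $g$ and all sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
g = lambda x, th: np.exp(-(x - th) ** 2)     # a smooth scalar "partial derivative"

x_server = rng.random(10)                    # server's r data points in [0,1]
x_client = rng.random(50)                    # one client's data points
thetas_fit = rng.standard_normal(10)         # a few parameter values for fitting

# Rows index data points; columns index parameter values (cf. the matrix (7)).
G_server = g(x_server[:, None], thetas_fit[None, :])    # (10, 10)
G_client = g(x_client[:, None], thetas_fit[None, :])    # (50, 10)

# Learn weights W with G_client ~= W @ G_server (least squares fit).
W = G_client @ np.linalg.pinv(G_server)

# Test: at a fresh parameter, server-side gradients predict client gradients,
# so the server never needs the client's raw data again.
th_new = 0.3
error = np.abs(W @ g(x_server, th_new) - g(x_client, th_new)).max()
print(f"max approximation error: {error:.2e}")  # small, by smoothness in x
```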

I-C Outline

We briefly delineate the remainder of our discussion here. In Section II, we precisely state and explain the notation and assumptions used in our subsequent analysis, and also introduce and expound upon the concept of federated oracle complexity. In Section III, we present and discuss our main technical results and the $\mathsf{FedLRGD}$ algorithm mentioned above. Then, we provide proofs of our results on latent variable model approximation and the federated oracle complexity of $\mathsf{FedLRGD}$ in Section IV and Section V, respectively. The proofs of all other results are deferred to the appendices. Although the focus of this work is to theoretically analyze $\mathsf{FedLRGD}$ and demonstrate the resulting gain in federated oracle complexity, we also present some pertinent simulation results in Appendix C. These simulations illustrate that gradients of loss functions corresponding to neural networks trained on the MNIST database for handwritten digit recognition (see [44]) and the CIFAR-10 dataset of tiny images (see [45]) exhibit approximate low rankness. This suggests that algorithms which exploit low rank structure, such as $\mathsf{FedLRGD}$, could have potential impact in applications. Finally, we conclude our discussion in Section VI.

II Formal Setup

Before deriving our main results, we need to present some essential formalism. To that end, Section II-A contains various notational conventions used in this work, Section II-B illustrates the precise conditions needed for our results to hold, Section II-C explains the reasoning behind these conditions and provides some related literature, and Section II-D develops the notion of federated oracle complexity which measures the running time of federated optimization algorithms.

II-A Notational Preliminaries

In this paper, we let $\mathbb{N}\triangleq\{1,2,3,\dots\}$, $\mathbb{Z}_{+}\triangleq\mathbb{N}\cup\{0\}$, and $\mathbb{R}_{+}$ denote the sets of natural numbers, non-negative integers, and non-negative real numbers, respectively, and for $m\in\mathbb{N}$, we let $[m]\triangleq\{1,\dots,m\}$. We let $\mathbb{P}(\cdot)$ and $\mathbb{E}[\cdot]$ denote the probability and expectation operators, respectively, where the underlying probability laws will be clear from context. Furthermore, $\exp(\cdot)$ and $\log(\cdot)$ denote the natural exponential and logarithm functions with base $e$, respectively, $\mathds{1}\{\cdot\}$ denotes the indicator function that equals $1$ if its input proposition is true and equals $0$ otherwise, and $\lceil\cdot\rceil$ denotes the ceiling function. Lastly, we will utilize Bachmann-Landau asymptotic notation throughout this paper, e.g., $O(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$, etc., where the limiting variable will be clear from context.

In addition, for any matrix $A\in\mathbb{R}^{m\times k}$ with $m,k\in\mathbb{N}$, $A_{i,j}$ denotes the $(i,j)$th entry of $A$ for $i\in[m]$ and $j\in[k]$, $A^{\mathrm{T}}$ denotes the transpose of $A$, $A^{-1}$ denotes the inverse of $A$ (when $m=k$ and $A$ is non-singular), $\operatorname{rank}(A)$ denotes the rank of $A$, $\|A\|_{\mathrm{F}}$ denotes the Frobenius norm of $A$, and $\|A\|_{\mathrm{op}}$ denotes the spectral (or induced $\ell^{2}$) operator norm of $A$. Similarly, for any (column) vector $y\in\mathbb{R}^{m}$, $y_{i}$ denotes the $i$th entry of $y$ for $i\in[m]$, and $\|y\|_{q}$ denotes the $\ell^{q}$-norm of $y$ for $q\in[1,+\infty]$. Moreover, for any continuously differentiable function $h:\mathbb{R}^{m}\rightarrow\mathbb{R}$, $h(y)$, we define $\partial h/\partial y_{i}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ to be its partial derivative with respect to $y_{i}$ for $i\in[m]$, and $\nabla_{y}h:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$, $\nabla_{y}h=[\partial h/\partial y_{1}\,\cdots\,\partial h/\partial y_{m}]^{\mathrm{T}}$ to be its gradient. As in [23], we will also require some multi-index notation. So, for any $m$-tuple $s=(s_{1},\dots,s_{m})\in\mathbb{Z}_{+}^{m}$, we let $|s|=\|s\|_{1}=s_{1}+\cdots+s_{m}$, $s!=s_{1}!\cdots s_{m}!$, $y^{s}=y_{1}^{s_{1}}\cdots y_{m}^{s_{m}}$ for $y\in\mathbb{R}^{m}$, and $\nabla^{s}h=\nabla_{y}^{s}h=\partial^{|s|}h/(\partial y_{1}^{s_{1}}\cdots\partial y_{m}^{s_{m}})$ be the $s$th partial derivative of $h$ with respect to $y$.

II-B Assumptions on Loss Function

As mentioned earlier, for any $n,d,p\in\mathbb{N}$, assume that we are given $n$ samples of training data $x^{(1)},\dots,x^{(n)}\in[0,1]^{d}$ and a loss function $f:[0,1]^{d}\times\mathbb{R}^{p}\rightarrow\mathbb{R}$, $f(x;\theta)$, where $\theta\in\mathbb{R}^{p}$ is the parameter vector. In our ensuing analysis, we will impose the following assumptions on the loss function (cf. [2, 4, 23]):

  1. Smoothness in parameter $\theta$: There exists $L_{1}>0$ such that for all $x\in[0,1]^{d}$, $\nabla_{\theta}f(x;\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ exists and is $L_{1}$-Lipschitz continuous:

$$\forall\theta^{(1)},\theta^{(2)}\in\mathbb{R}^{p},\enspace\left\|\nabla_{\theta}f(x;\theta^{(1)})-\nabla_{\theta}f(x;\theta^{(2)})\right\|_{2}\leq L_{1}\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}\,. \quad (4)$$
  2. Strong convexity: There exists $\mu>0$ such that for all $x\in[0,1]^{d}$, $f(x;\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}$ is $\mu$-strongly convex:

$$\forall\theta^{(1)},\theta^{(2)}\in\mathbb{R}^{p},\enspace f(x;\theta^{(1)})\geq f(x;\theta^{(2)})+\nabla_{\theta}f(x;\theta^{(2)})^{\mathrm{T}}(\theta^{(1)}-\theta^{(2)})+\frac{\mu}{2}\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}^{2}\,. \quad (5)$$

  Furthermore, we let $\kappa\triangleq L_{1}/\mu\geq 1$ be the condition number of the loss function. (A short numerical sketch illustrating these first two assumptions follows this list.)

  3. Smoothness in data $x$: There exist $\eta>0$ and $L_{2}>0$ such that for all fixed $\vartheta\in\mathbb{R}^{p}$ and for all $i\in[p]$, the $i$th partial derivative of $f$ at $\vartheta$, denoted $g_{i}(\cdot;\vartheta)\triangleq\frac{\partial f}{\partial\theta_{i}}(\cdot;\vartheta):[0,1]^{d}\rightarrow\mathbb{R}$, belongs to the $(\eta,L_{2})$-Hölder class (cf. [27, Definition 1.2]), i.e., $g_{i}(\cdot;\vartheta):[0,1]^{d}\rightarrow\mathbb{R}$ is $l=\lceil\eta\rceil-1$ times differentiable, and for every $s=(s_{1},\dots,s_{d})\in\mathbb{Z}_{+}^{d}$ such that $|s|=l$, we have

$$\forall y^{(1)},y^{(2)}\in[0,1]^{d},\enspace\left|\nabla_{x}^{s}g_{i}(y^{(1)};\vartheta)-\nabla_{x}^{s}g_{i}(y^{(2)};\vartheta)\right|\leq L_{2}\left\|y^{(1)}-y^{(2)}\right\|_{1}^{\eta-l}\,. \quad (6)$$
  4. Non-singularity: For any (large) size $r\in\mathbb{N}$, any index $i\in[p]$, any sequence of distinct data samples $y^{(1)},\dots,y^{(r)}\in[0,1]^{d}$, and any sequence of distinct parameter values $\vartheta^{(1)},\dots,\vartheta^{(r)}\in\mathbb{R}^{p}$ (more formally, this assumption only holds for $\vartheta^{(1)},\dots,\vartheta^{(r)}$ almost everywhere, but we neglect this detail here and in the sequel for simplicity), the matrix of partial derivatives $G^{(i)}\in\mathbb{R}^{r\times r}$ given by

$$\forall j,k\in[r],\enspace G^{(i)}_{j,k}=g_{i}(y^{(j)};\vartheta^{(k)})\,, \quad (7)$$

  is invertible. Note that we suppress the dependence of $G^{(i)}$ on $r$ and the sequences of data and parameters for convenience.

  5. Conditioning: There exists an absolute constant $\alpha\geq 1$ such that for any (large) $r\in\mathbb{N}$, any $i\in[p]$, any sequence of distinct data samples $y^{(1)},\dots,y^{(r)}\in[0,1]^{d}$, and any sequence of distinct parameter values $\vartheta^{(1)},\dots,\vartheta^{(r)}\in\mathbb{R}^{p}$ (more formally, one would need to impose additional distributional assumptions to prove that this assumption holds for random $\vartheta^{(1)},\dots,\vartheta^{(r)}$ with high probability, but we also neglect this detail here and in the sequel for simplicity), the matrix $G^{(i)}\in\mathbb{R}^{r\times r}$ given in (7) satisfies

$$\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}\leq r^{\alpha}\,, \quad (8)$$

  i.e., the minimum singular value of $G^{(i)}$ decreases to $0$ at most polynomially fast in the dimension $r$.

  6. Boundedness: There exists an absolute constant $B>0$ such that for all $i\in[p]$, we have

$$\sup_{x\in[0,1]^{d}}\sup_{\vartheta\in\mathbb{R}^{p}}\left|g_{i}(x;\vartheta)\right|\leq B\,. \quad (9)$$
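As promised above, here is a small numerical sketch probing the first two assumptions for an $\ell^{2}$-regularized logistic-type loss (a standard example of the kind of smooth, strongly convex loss referenced in Appendix A); the particular loss, the constants $L_{1}$ and $\mu$, and the data are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d = p = 4
mu = 0.5                                   # regularization weight -> (5) holds
x = rng.random(d)                          # one fixed datum in [0,1]^d

f = lambda th: np.log1p(np.exp(x @ th)) + 0.5 * mu * th @ th
grad = lambda th: x / (1.0 + np.exp(-(x @ th))) + mu * th
L1 = mu + (x @ x) / 4.0                    # Hessian spectral norm bound -> (4)

for _ in range(1000):
    t1, t2 = rng.standard_normal(p), rng.standard_normal(p)
    # Smoothness in parameter, inequality (4):
    assert np.linalg.norm(grad(t1) - grad(t2)) <= L1 * np.linalg.norm(t1 - t2) + 1e-9
    # Strong convexity, inequality (5):
    lower = f(t2) + grad(t2) @ (t1 - t2) + 0.5 * mu * np.sum((t1 - t2) ** 2)
    assert f(t1) >= lower - 1e-9
print("inequalities (4) and (5) hold on all sampled parameter pairs")
```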

Next, let $\epsilon>0$ be the desired approximation accuracy of the solution, $\theta^{\star}\in\mathbb{R}^{p}$ be the unique global minimizer of the objective function $F$ in (1), and

$$F_{*}\triangleq\min_{\theta\in\mathbb{R}^{p}}{F(\theta)}=F(\theta^{\star})\,. \quad (10)$$

(Note that $\theta^{\star}$ exists and $F_{*}$ is finite due to the strong convexity assumption.) In the sequel, we consider federated optimization algorithms that output an $\epsilon$-approximate solution $\theta^{*}\in\mathbb{R}^{p}$ to (1) such that:

  1. If the algorithm generates a deterministic solution $\theta^{*}$, we have

$$F(\theta^{*})-F_{*}\leq\epsilon\,. \quad (11)$$
  2. If the (stochastic) algorithm produces a random solution $\theta^{*}$, we have

$$\mathbb{E}\!\left[F(\theta^{*})\right]-F_{*}\leq\epsilon\,, \quad (12)$$

  where the expectation is computed with respect to the law of $\theta^{*}$.

II-C Motivation and Background Literature for Assumptions

We briefly explain some of the important assumptions in Section II-B and contextualize them within the optimization and statistics literatures.

Firstly, smoothness in parameter is a standard assumption in the optimization for machine learning literature (see, e.g., [2, 4], and the references therein). This assumption implies that the empirical risk $F(\theta)$ has an $L_{1}$-Lipschitz continuous gradient (which is used in the proof of Theorem 2 in Section V, cf. [23, Section 4.1]). In particular, when the gradient of the objective function $F(\theta)$ is $L_{1}$-Lipschitz continuous, it can be shown that each iteration of GD reduces the objective value:

$$F\!\left(\theta-\frac{1}{L_{1}}\nabla_{\theta}F(\theta)\right)-F(\theta)\leq-\frac{1}{2L_{1}}\left\|\nabla_{\theta}F(\theta)\right\|_{2}^{2} \quad (13)$$

for all $\theta\in\mathbb{R}^{p}$ [4, Equation (3.5)]. The inequality (13) illustrates the greedy nature in which GD decreases its objective value and is fundamental in the analysis of iterative descent methods. For this reason, this assumption has been used extensively in the optimization literature to prove convergence guarantees in a variety of settings, e.g., [9, 8, 5, 12, 13, 16, 6, 14, 21], etc. We also impose this assumption since the analysis to obtain Lemma 2 in Section V, which is a restatement of [5, Lemma 2.1], requires it.
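The descent inequality (13) is easy to check numerically; the sketch below does so for a randomly generated strongly convex quadratic, which is an illustrative stand-in for $F$.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 6
A = rng.standard_normal((p, p))
H = A.T @ A + np.eye(p)                  # positive definite Hessian of F
L1 = np.linalg.eigvalsh(H).max()         # Lipschitz constant of the gradient

F = lambda th: 0.5 * th @ H @ th         # quadratic objective
gradF = lambda th: H @ th

for _ in range(1000):
    th = rng.standard_normal(p)
    g = gradF(th)
    decrease = F(th - g / L1) - F(th)    # one GD step with step size 1/L1
    assert decrease <= -0.5 / L1 * (g @ g) + 1e-9   # inequality (13)
print("descent inequality (13) verified on all sampled points")
```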

Similarly, strong convexity (in parameter) is another standard assumption in the optimization for machine learning literature (see, e.g., [2, 4], and the references therein). As before, this assumption implies that the empirical risk $F(\theta)$ is strongly convex (which is used in the proof of Theorem 2 in Section V, cf. [2, Lemma 2.1.4], [23, Section 4.1]). As explained in [4], the main utility of imposing strong convexity of the objective function $F(\theta)$ is a notable and provable speed-up in the convergence rate of gradient-based iterative methods. Specifically, GD can achieve a linear convergence rate (i.e., exponentially fast convergence) when the objective function is strongly convex [2, 4]. In addition, the strong convexity assumption often leads to far simpler proofs of convergence and oracle complexity, and easier illustration of the benefits of improvements like momentum (or the heavy-ball method) [10] and acceleration [11] on oracle complexity (cf. [2, 21]). For these reasons, this assumption has also been used extensively in the optimization literature to prove convergence guarantees in a variety of settings, e.g., [9, 8, 5, 13, 16, 6, 14, 21], etc. We also assume strong convexity for these reasons; indeed, strong convexity is needed for Lemma 2 in Section V, which is a restatement of [5, Lemma 2.1], to hold, and the first term in the bound in Lemma 2 illustrates the aforementioned exponential convergence.

Smoothness in data is a standard assumption in the non-parametric estimation literature (see, e.g., [27, 46, 47], and the references therein). In simple terms, we can think of a function in a Hölder class with positive integer parameter $\eta$ as an $(\eta-1)$-times differentiable function with Lipschitz continuous $(\eta-1)$th partial derivatives (or even an $\eta$-times differentiable function with bounded $\eta$th partial derivatives). Hence, the size of the parameter $\eta$ controls the degree of smoothness of functions in a Hölder class. Such Hölder class assumptions are used in various non-parametric regression techniques, e.g., kernel regression such as Nadaraya-Watson estimators [48, 49, 27], local polynomial regression [25, 26, 27], and nearest neighbor methods [50, 51, 46]. At a high level, the basic idea of imposing smoothness assumptions for regression is the following: if an unknown function living in a "large" (or non-parametric) class is known to be smoother, then it can be better estimated from samples using Taylor polynomials. When defining the Hölder condition in (6), one can in principle use any $\ell^{q}$-norm (see, e.g., [42, 47, 43, 23]). We use the $\ell^{1}$-norm to define (6) because this yields the largest class of functions for fixed parameters $(\eta,L_{2})$; indeed, if (6) is satisfied for any $\ell^{q}$-norm, then it is also satisfied for the $\ell^{1}$-norm using the monotonicity of $\ell^{q}$-norms. In this sense, our definition includes the largest class of functions. In Appendix A, we provide an example of a commonly used loss function in machine learning that satisfies the aforementioned smoothness and strong convexity assumptions.

In a recent sequence of works in high-dimensional statistics, it has been shown that matrices defined by bivariate graphon functions or latent variable models are approximately low rank when the function satisfies Hölder class assumptions on one of the variables [41, 42, 43]. (We generalize these results in Theorem 1.) Partial derivatives of loss functions $g_{i}$ can be construed as bivariate functions of the data $x$ and parameter $\theta$, and we impose Hölder class smoothness in data on them in order to exploit the resulting approximate low rank structure for federated optimization (as we will elaborate in Section III). As noted earlier and in [23], smoothness in data exists in a variety of ERM problems, e.g., (regularized) linear regression [3, Chapter 3], (regularized) logistic regression [3, Section 4.4], deep neural network training with smooth loss and activation functions [24, Chapter 5], etc. The close connection between smoothness of bivariate functions and approximate low rank structure may seem peculiar at first glance. Perceiving a bivariate function as a kernel of an integral operator, the results in [41, 42, 43] suggest that smoothness of the kernel is closely related to approximate low rankness of the associated integral operator. This relation has its roots in more familiar ideas from harmonic analysis. Indeed, in the setting of difference (or convolution) kernels that are defined by a univariate function, increasing the smoothness of the function is essentially equivalent to faster decay of its Fourier transform [52, Theorem 60.2] (also see [53, 54]). This, in turn, can be interpreted as approximate low rankness of the associated convolution operator, because the Fourier transform is precisely the spectrum of the convolution operator.

Approximate low rank structure, broadly construed, has been utilized in a variety of areas, e.g., matrix estimation [55], reinforcement learning [56], function approximation via ridge functions [57, 58], semidefinite programming [59], uncertainty quantification [60], etc. In this work, we seek to exploit approximate low rank structure in the context of federated optimization. Just as smoothness is prevalent in many ERM problems, a litany of empirical investigations in the literature have directly demonstrated various forms of approximate low rank structure in large-scale machine learning tasks. For instance, the empirical risk and loss functions corresponding to deep neural network training problems display approximate low rank structure in their gradients (which live in low-dimensional subspaces) or Hessians (which have decaying eigenvalues, thereby inducing low rank gradients via Taylor approximation arguments) [61, 62, 63, 64, 65, 66]. In particular, [61] shows that when training deep neural networks for $K$-ary classification, the computed gradients belong to $K$-dimensional eigenspaces of the Hessian corresponding to its dominant eigenvalues. On a related front, low rankness has also been observed in weight matrices of deep neural networks [67, 68], neural collapse has been demonstrated when training deep neural networks [69, 70, 71], and training memory for neural networks has been saved using low rank parameterizations of gradient weight matrices [72]. Our work on exploiting approximate low rank structure for federated optimization, albeit different, is inspired by many of these observations.

The non-singularity, conditioning, and boundedness assumptions in Section II-B and Section III-C have been imposed for technical purposes in the proofs. Variants of such assumptions pertaining to matrix invertibility and spectra are often made for analytical tractability in theoretical studies in high-dimensional statistics (see, e.g., the invertibility and spectral boundedness assumptions made in [73, Section 2] to analyze linear regression). While it is difficult to provide simple and easily interpretable sufficient conditions that ensure the non-singularity assumptions in Section II-B and Section III-C, we provide some intuition for when we expect such assumptions to hold. To this end, consider the non-singularity assumption in Section II-B. As before, let us construe $g_{i}(x,\theta)$ as the kernel of an integral operator. Then, the spectral theory (specifically, the singular value decomposition) of compact operators states that the kernel can be written as [56, Theorem 9] (also see [74]):

$$g_{i}(x,\theta)=\sum_{j=1}^{\infty}{\sigma_{j}u_{j}(x)v_{j}(\theta)}\,, \quad (14)$$

where $\{\sigma_{j}\geq 0:j\in\mathbb{N}\}$ are singular values that converge to $0$ as $j\rightarrow\infty$, $\{u_{j}:j\in\mathbb{N}\}$ and $\{v_{j}:j\in\mathbb{N}\}$ are orthonormal bases of singular vector functions, and although the equality above usually holds in an $\mathcal{L}^{2}$-sense, it can also hold in a pointwise sense under some assumptions. Hence, for data samples $y^{(1)},\dots,y^{(r)}$ and parameter values $\vartheta^{(1)},\dots,\vartheta^{(r)}$, we may write the matrix $G^{(i)}$ as

$$G^{(i)}=\sum_{j=1}^{\infty}\sigma_{j}\underbrace{\begin{bmatrix}u_{j}(y^{(1)})\\ \vdots\\ u_{j}(y^{(r)})\end{bmatrix}}_{=\,\mathsf{u}_{j}}\underbrace{\begin{bmatrix}v_{j}(\vartheta^{(1)})\,\cdots\,v_{j}(\vartheta^{(r)})\end{bmatrix}}_{=\,\mathsf{v}_{j}^{\mathrm{T}}}\,, \quad (15)$$

using (7) and (14), where we let $\mathsf{u}_{j}=[u_{j}(y^{(1)})\,\cdots\,u_{j}(y^{(r)})]^{\mathrm{T}}$ and $\mathsf{v}_{j}=[v_{j}(\vartheta^{(1)})\,\cdots\,v_{j}(\vartheta^{(r)})]^{\mathrm{T}}$. Since the integral operator corresponding to $g_{i}$ is a map between infinite-dimensional spaces, we typically have infinitely many non-zero $\sigma_{j}$'s. Moreover, since $\{u_{j}:j\in\mathbb{N}\}$ and $\{v_{j}:j\in\mathbb{N}\}$ are linearly independent sets of functions, we typically have $r$ non-zero unit-rank terms $\sigma_{j}\mathsf{u}_{j}\mathsf{v}_{j}^{\mathrm{T}}$ in (15), with linearly independent $\mathsf{u}_{j}$'s and $\mathsf{v}_{j}$'s, whose linear independence is not cancelled out by the other terms. Thus, $G^{(i)}$ has rank $r$ and is invertible. This provides some intuition behind the non-singularity assumption in Section II-B. The non-singularity of monomials and approximation assumptions in Section III-C have similar underlying intuition based on sampling of linearly independent functions (cf. (45) and (44), (46), respectively). In closing this subsection, we remark that one well-known setting where all sampled matrices of a symmetric bivariate function are invertible, and in fact positive definite, is when the bivariate function itself defines a valid positive definite kernel (in the reproducing kernel Hilbert space sense), cf. [75, Sections 4.2-4.6]. In the context of Section II-B, we cannot directly assume that every $g_{i}$ is a positive definite kernel, since $g_{i}$ is asymmetric in general and we only require invertibility of sampled matrices, not positive definiteness, which is a much stronger condition. However, alternative versions of such kernel conditions could be developed in future work to provide cleaner assumptions that imply various non-singularity conditions.
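The following sketch makes this intuition concrete for a hypothetical smooth kernel (a Gaussian, chosen purely for illustration): the sampled matrix $G$ is generically invertible in exact arithmetic, yet its singular values decay so quickly that it is approximately low rank, which is exactly the structure exploited by $\mathsf{FedLRGD}$.

```python
import numpy as np

rng = np.random.default_rng(5)
r = 200
y = rng.random(r)                          # data samples in [0,1]
theta = rng.random(r)                      # parameter values
G = np.exp(-(y[:, None] - theta[None, :]) ** 2)   # sampled smooth kernel, as in (7)

s = np.linalg.svd(G, compute_uv=False)
print("sigma_1, sigma_5, sigma_10:", s[0], s[4], s[9])   # rapid decay
# Eckart-Young: the best rank-k approximation error is the tail of s.
k = 10
rel_err = np.sqrt(np.sum(s[k:] ** 2)) / np.linalg.norm(G, "fro")
print(f"relative Frobenius error of best rank-{k} approximation: {rel_err:.2e}")
# The numerical rank is far below r because trailing singular values are tiny,
# even though G is invertible in exact arithmetic.
print("numerical rank:", np.linalg.matrix_rank(G))
```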

Finally, we note that, strictly speaking, the boundedness and strong convexity assumptions make sense simultaneously when the parameter space is compact. Although our analysis is carried out with the parameter space $\mathbb{R}^{p}$, since the iterates of inexact GD in Algorithm 1 only live in some compact space, our analysis carries over to the compact setting. For example, one way to rigorously carry over our analysis when the parameter space is assumed to be compact is to run projected inexact GD instead of inexact GD in Algorithm 1. Then, the parameter iterates of Algorithm 1 remain in the compact space, and we can execute the analysis of our algorithm using the convergence analysis in, e.g., [8, Proposition 3]. For simplicity of exposition, however, we do not use projected inexact GD and simply take the parameter space to be $\mathbb{R}^{p}$ in this paper.

II-D Federated Oracle Complexity

In the traditional (non-distributed) optimization literature, the running time of first-order iterative methods, such as GD and SGD, is abstractly measured by the notion of (first-order) oracle complexity, i.e., the total number of queries to an oracle that returns the gradient of the loss function with respect to the parameter [76] (also see [2, Section 1.1]). This is because gradient computations are usually the most expensive operations in such algorithms. However, the problem of measuring running time is quite different for federated optimization algorithms, because the cost of clients communicating with servers is typically much higher than the running time of gradient computations. Currently, simulations based on several informal principles (such as the aforementioned one) are often used to empirically compare the efficiency of different federated optimization algorithms, rather than theoretically analyzing any formal notion of complexity; see, e.g., [30] and the references therein. To partially remedy this, inspired by [30], we propose a new definition of federated oracle complexity by formalizing some of the informal rules delineated in Section I-A using simple probabilistic approximations. We note that other notions of oracle complexity for distributed optimization have either implicitly or explicitly been posed in the literature, cf. graph oracle models [77] or [78] and the references therein. In fact, our definition of federated oracle complexity is reminiscent of the complexity measure introduced in [79], but our definition is adapted to the centralized, synchronous federated learning setting rather than distributed optimization frameworks.

Consider any algorithm belonging to the family outlined in Section I-A. In the sequel, we will approximate the running times of the two time-intensive phases of each epoch: the gradient computation phase and the client-to-server communication phase.

II-D1 Gradient Computation Phase

In the algorithms we consider in this work, there will be two kinds of gradient computation phases: either only active clients will simultaneously perform $b\in\mathbb{N}$ gradient computations each, or only the server will perform $b$ gradient computations. In the former case, each active client computes $b$ gradients of the loss function with respect to the parameter in an epoch. As in the traditional setting, these gradient computations dominate the running time of the client. For each client, its $b$ successive gradient computations are modeled as $b$ successive delayed Poisson arrivals, cf. [80]. Equivalently, each gradient computation is assigned a statistically independent (abstract) running time of $1+X$ units, where $1$ is the delay and $X$ is an exponential random variable with rate $1$ [80]. (For simplicity, we assume that the shift and scale parameters of the shifted exponential distribution are equal.) Thus, the running time of each client in an epoch is $b+Y$ units, where the random variable $Y$ has an Erlang (or gamma) distribution with shape parameter $b$ and rate $1$ [81, Section 2.2.2]. Since only $\tau m$ clients are active in any epoch, the total computation time of an epoch is determined by the slowest client, and is equal to $b+\max\{Y_{1},\dots,Y_{\tau m}\}$ units, where $Y_{1},\dots,Y_{\tau m}$ are independent and identically distributed (i.i.d.) Erlang random variables with shape parameter $b$ and rate $1$.

In most federated learning contexts, $\tau m$ is very large, and we can employ extreme value theory to approximate the probability distribution of $\max\{Y_{1},\dots,Y_{\tau m}\}$. To this end, observe that the cumulative distribution function (CDF) $F_{Y}:[0,\infty)\rightarrow[0,1]$ of $Y$ is a von Mises function with auxiliary function, cf. [82, Section 3.3.3, Example 3.3.21]:

$$\forall x>0,\enspace a(x)=\sum_{k=0}^{b-1}{\frac{(b-1)!}{(b-k-1)!}x^{-k}}=1+\frac{b-1}{x}+O\!\left(\frac{1}{x^{2}}\right), \quad (16)$$

where the second equality holds when $x\gg b$ (as will be the case for us). Hence, employing the Fisher-Tippett-Gnedenko theorem [82, Theorem 3.2.3], we have convergence in distribution to the Gumbel distribution (or the generalized extreme value distribution of Type I):

$$\forall x\in\mathbb{R},\enspace\lim_{k\rightarrow\infty}{\mathbb{P}\!\left(\frac{\max\{Y_{1},\dots,Y_{k}\}-F_{Y}^{-1}(1-k^{-1})}{a(F_{Y}^{-1}(1-k^{-1}))}\leq x\right)}=\exp(-e^{-x})\,, \quad (17)$$

where $F_{Y}^{-1}:[0,1)\rightarrow[0,\infty)$ is the inverse of $F_{Y}$, i.e., the quantile function, and we also utilize [82, Proposition 3.3.25]. According to Proposition 2 in Appendix B, $\Theta(\log(k))+\Omega(\log(b))\leq F_{Y}^{-1}(1-k^{-1})\leq\Theta(\log(k))+O(b\log(b))$, and we estimate $F_{Y}^{-1}(1-k^{-1})$ as follows for simplicity:

$$F_{Y}^{-1}(1-k^{-1})\approx\log(k)+b\,. \quad (18)$$

Thus, using (17), the expected total computation time of an epoch can be approximated as

$$\begin{aligned}
b+\mathbb{E}\!\left[\max\{Y_{1},\dots,Y_{\tau m}\}\right]&\approx \gamma\, a(F_{Y}^{-1}(1-(\tau m)^{-1}))+F_{Y}^{-1}(1-(\tau m)^{-1})+b\\
&\approx \gamma+\frac{\gamma(b-1)}{F_{Y}^{-1}(1-(\tau m)^{-1})}+F_{Y}^{-1}(1-(\tau m)^{-1})+b\\
&\approx \log(\tau m)+2b+\underbrace{\gamma+\frac{\gamma b}{\log(\tau m)+b}}_{=\,\Theta(1)}\\
&\approx \log(\tau m)+2b\,, \quad (19)
\end{aligned}$$

where the Euler-Mascheroni constant $\gamma=-\Psi(1)=0.57721\dots$ (with $\Psi(\cdot)$ denoting the digamma function) is the expected value of the Gumbel distribution with CDF $x\mapsto\exp(-e^{-x})$, the second line follows from (16) (neglecting lower order terms), the third line follows from (18), and we neglect the constant in the final line. Since the server does not compute any gradients in this case, any other computation performed by the server or clients is assumed to have negligible cost compared to that in (19).
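A quick Monte Carlo experiment corroborates the approximation (19) under the stated model (i.i.d. Erlang client times); the values of $b$, $\tau m$, and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
b, k, trials = 20, 10_000, 500             # k plays the role of tau*m

# Each client's time in an epoch is b + Y with Y ~ Erlang(shape=b, rate=1).
Y = rng.gamma(shape=b, scale=1.0, size=(trials, k))
simulated = b + Y.max(axis=1).mean()       # estimate of E[b + max{Y_1,...,Y_k}]
approximation = np.log(k) + 2 * b          # the right-hand side of (19)
print(f"simulated: {simulated:.1f}, approximation (19): {approximation:.1f}")
```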

On the other hand, in the case where only the server computes $b$ gradients of the loss function with respect to the parameter in an epoch, we may construe the server as a single active client. So, the running time of the server in an epoch is $b+Y$ units, where $Y$ has an Erlang distribution with shape parameter $b$ and rate $1$. Hence, the expected total computation time of an epoch is

$$b+\mathbb{E}\!\left[Y\right]=2b\,. \quad (20)$$

Combining (19) and (20), the expected total computation time per epoch is given by

$$\mathbb{E}\!\left[\text{total computation time}\right]=\begin{cases}2b+\log(\tau m)\,,&\text{if active clients simultaneously compute gradients},\\ 2b\,,&\text{if the server computes gradients}.\end{cases} \quad (21)$$

II-D2 Client-to-Server Communication Phase

The total communication cost in any epoch is dominated by the client-to-server communication. Since the server has limited bandwidth and often uses a queue to receive updates from clients, this cost is proportional to the number of active clients $\tau m$. To see this, let us model the client-to-server communication as a first in, first out (FIFO) M/M/1 queue [81]. (We utilize Kendall's notation, where the first "M" implies that arrivals obey a Poisson process, the second "M" implies that the service time is memoryless, and the "1" denotes that we only have one server.) In such a queue, the updates from active clients arrive at the server according to a Poisson process with rate $1$ (for simplicity) [81, Chapter 2]. Moreover, the server only accepts the transmission of any one active client at any instant in time, and from the instant an update arrives at the server, it takes a total transmission time given by an independent exponential distribution with rate $\nu>1$. (The assumption that $\nu>1$ ensures that the M/M/1 queue is "stable," and the number of active clients in the queue does not diverge over time.) Any other active client updates that arrive during this transmission time are put in the FIFO queue for eventual transmission to the server. Finally, let us assume for simplicity that the M/M/1 queue is in steady state.

We seek to calculate the expected time it takes for all active clients to complete transmitting their updates to the server. So, let time $t=0$ denote the beginning of the (synchronized) client-to-server communication phase, and define the departure process $\{D(t)\in\mathbb{Z}_{+}:t\geq 0\}$, where $D(t)$ is the number of client updates that have arrived at the server and been entirely transmitted to the server up to time $t\geq 0$ (this excludes any client updates in the queue at time $t$). By Burke's theorem [81, Theorem 7.6.5], $D(t)$ is a Poisson process with rate $1$, i.e., the departure process has the same rate as the arrival process. Hence, the total time it takes for all $\tau m$ active clients to complete transmitting their updates to the server,

$$S\triangleq\inf\!\left\{t\geq 0:D(t)=\tau m\right\}\in\mathbb{R}_{+}\,, \quad (22)$$

has an Erlang distribution with shape parameter $\tau m$ and rate $1$. Therefore, we have

$$\mathbb{E}[S]=\tau m\,. \quad (23)$$
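The sketch below simulates this FIFO M/M/1 communication phase and compares the empirical mean of $S$ with the prediction (23). For simplicity it starts from an empty queue rather than exact steady state, so the agreement is only approximate; the arrival rate, service rate $\nu$, and number of active clients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
num_active, nu, trials = 50, 2.0, 2000     # tau*m clients, service rate nu > 1
S = np.empty(trials)

for t in range(trials):
    arrivals = np.cumsum(rng.exponential(1.0, num_active))  # rate-1 Poisson arrivals
    service = rng.exponential(1.0 / nu, num_active)          # rate-nu transmissions
    done = 0.0
    for a, s in zip(arrivals, service):    # FIFO: serve when free and arrived
        done = max(done, a) + s
    S[t] = done                            # time of last departure, cf. (22)

print(f"simulated E[S]: {S.mean():.1f}, prediction (23): {num_active}")
```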

Next, let $\phi>0$ be a possibly $p$-dependent hyper-parameter that represents the communication-to-computation ratio (cf. [30]); $\phi$ is the ratio of the average time it takes for a client to communicate a vector in $\mathbb{R}^{p}$ with the server and the average time it takes a client to compute a single gradient vector in $\mathbb{R}^{p}$. In most federated learning contexts, $\phi\gg 1$ because the average communication cost of sending a vector to the server far outweighs the average complexity of a gradient computation. Then, using (23), the expected total communication time per epoch is given by

$$\mathbb{E}\!\left[\text{total communication time}\right]=\phi\tau m\,, \quad (24)$$

which is on the "same scale" as the expected total computation time in (21) because we multiply by $\phi$.

II-D3 Overall Running Time

In the algorithms we consider in this work, there will be three kinds of epochs with non-negligible running time:

  1. Type A: Active clients simultaneously compute gradients and then communicate with the server. The server does not compute gradients.

  2. Type B: The server computes gradients. Clients do not compute gradients and do not communicate with the server.

  3. Type C: The server and clients do not compute gradients, but active clients communicate with the server.

Combining (21) and (24), the expected total running time per epoch is

$$\mathbb{E}\!\left[\text{total running time}\right]=\begin{cases}2b+\log(\tau m)+\phi\tau m\,,&\text{if Type A}\\ 2b\,,&\text{if Type B}\\ \phi\tau m\,,&\text{if Type C}\end{cases} \quad (25)$$

$$\qquad\qquad\qquad\qquad\;\;=\begin{cases}\Theta(b+\phi\tau m)\,,&\text{if Type A}\\ \Theta(b)\,,&\text{if Type B}\\ \Theta(\phi\tau m)\,,&\text{if Type C}\end{cases}. \quad (26)$$

The ensuing definition translates (26) into a formal notion of federated oracle complexity by summing the expected running times over all epochs. (Note that we neglect the constant $2$ and the sub-dominant logarithmic term in (25), because we are only concerned with how the complexity asymptotically scales with important problem parameters.)

Definition 1 (Federated Oracle Complexity).

For any federated optimization algorithm $\mathcal{A}$ belonging to the family of algorithms outlined in Section I-A, we define its (first-order) federated oracle complexity as

$$\Gamma(\mathcal{A})\triangleq\sum_{t=1}^{T}{\left(b_{t}+{\mathds{1}}\!\left\{\text{clients communicate in epoch }t\right\}\phi\tau m\right)}\,,$$

where $T\in\mathbb{N}$ is the number of epochs in $\mathcal{A}$, $b_{t}\in\mathbb{Z}_{+}$ (which is possibly $0$) is the number of gradients computed in epoch $t\in[T]$ by each active client or the server, $m\in\mathbb{N}$ is the total number of clients, $\tau\in(0,1]$ is the proportion of active clients that participate in each epoch, and $\phi>0$ denotes the communication-to-computation ratio.
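Definition 1 is mechanical to evaluate for a given epoch schedule; the short sketch below computes $\Gamma(\mathcal{A})$ for a hypothetical schedule mixing Type A and Type B epochs (all numbers are made up for illustration).

```python
def federated_oracle_complexity(epochs, phi, tau, m):
    """Gamma(A) per Definition 1: `epochs` is a list of pairs
    (b_t, clients_communicate_in_epoch_t), one pair per epoch t."""
    return sum(b_t + (phi * tau * m if communicates else 0.0)
               for b_t, communicates in epochs)

# Hypothetical schedule: two Type A epochs (active clients compute b_t = 5
# gradients each and communicate), then three Type B epochs (server alone
# computes b_t = 10 gradients, no communication).
schedule = [(5, True), (5, True), (10, False), (10, False), (10, False)]
print(federated_oracle_complexity(schedule, phi=100.0, tau=0.5, m=1000))
```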

We next make some pertinent remarks about Definition 1. Firstly, our model of federated oracle complexity neglects the effect of straggling clients on the latency of federated optimization algorithms. This is because the effect of stragglers is captured by a lower order logarithmic term in (21) (and (25)). So, we do not study latency issues in this work. Secondly, Definition 1 assumes that in any epoch, all clients compute the same number of gradients, because both the theoretical algorithms we analyze in this work, $\mathsf{FedLRGD}$ and $\mathsf{FedAve}$ [30], satisfy this assumption. In general settings, however, different clients may have to compute different numbers of gradients in an epoch. In such cases, one would need to carry out the earlier extreme value theoretic analysis of the gradient computation phase again to determine the expected total computation time of this phase and alter Definition 1 appropriately (e.g., $b_{t}$ could represent the maximum number of gradients computed by a client across all clients in epoch $t$). Thirdly, note that Definition 1 reduces to the usual notion of oracle complexity in non-federated settings [76, 2]. Indeed, since non-federated settings have no communication cost, we can set $\phi=0$; in this case, our definition simply counts the total number of gradient oracle calls. Finally, for simplicity in this work, we will assume in all cases that the proportion of active clients $\tau=\Theta(1)$ is constant (with respect to any asymptotically growing problem parameters). In the sequel, we evaluate the efficiency of federated optimization algorithms using Definition 1.

III Main Results and Discussion

We discuss our main results in this section. Specifically, we outline some key auxiliary results on low rank latent variable model approximation in Section III-A, describe our new Federated Low Rank Gradient Descent algorithm in Section III-B, present the federated oracle complexity (as defined in Definition 1) of this algorithm in Section III-C, and finally, demonstrate that our algorithm has better federated oracle complexity than the vanilla 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithm in Section III-D.

III-A Latent Variable Model Approximation

To establish the federated oracle complexity of our algorithm, we will require a basic result about uniformly approximating a smooth function using piecewise polynomials. To illustrate this result, consider any function g:[0,1]d×pg:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}}, g(x;θ)g(x;\theta) such that for all ϑp\vartheta\in{\mathbb{R}}^{p}, the function g(;ϑ):[0,1]dg(\cdot;\vartheta):[0,1]^{d}\rightarrow{\mathbb{R}} belongs to the (η,L2)(\eta,L_{2})-Hölder class as defined in (6) in Section II-B. Moreover, for any q>0q>0, consider a minimal 1\ell^{1}-covering of the hypercube [0,1]d[0,1]^{d} using closed 1\ell^{1}-balls (or cross-polytopes) with radius q1q^{-1} and centers 𝒩={z(1),,z(|𝒩|)}[0,1]d{\mathcal{N}}=\{z^{(1)},\dots,z^{(|{\mathcal{N}}|)}\}\subseteq[0,1]^{d}, which forms the associated q1q^{-1}-net, such that:

x[0,1]d,z𝒩,xz11q.\forall x\in[0,1]^{d},\,\exists z\in{\mathcal{N}},\enspace\left\|x-z\right\|_{1}\leq\frac{1}{q}\,. (27)

In particular, (27) shows that the union of the closed 1\ell^{1}-balls with radius q1q^{-1} and centers 𝒩{\mathcal{N}} contains the set [0,1]d[0,1]^{d}, and the q1q^{-1}-net 𝒩{\mathcal{N}} is minimal in the sense that its cardinality |𝒩||{\mathcal{N}}| is as small as possible while ensuring (27), i.e., |𝒩||{\mathcal{N}}| is the 1\ell^{1}-covering number of [0,1]d[0,1]^{d} (cf. [83, Definition 4.2.2]). Let B1,,B|𝒩|dB_{1},\dots,B_{|{\mathcal{N}}|}\subseteq{\mathbb{R}}^{d} be an enumeration of the closed 1\ell^{1}-balls with radius q1q^{-1} and centers z(1),,z(|𝒩|)𝒩z^{(1)},\dots,z^{(|{\mathcal{N}}|)}\in{\mathcal{N}}, respectively, and construct the sets I1,,I|𝒩|[0,1]dI_{1},\dots,I_{|{\mathcal{N}}|}\subseteq[0,1]^{d} such that I1=B1[0,1]dI_{1}=B_{1}\cap[0,1]^{d} and Ii=(Bi\(B1Bi1))[0,1]dI_{i}=(B_{i}\backslash(B_{1}\cup\cdots\cup B_{i-1}))\cap[0,1]^{d} for i{2,,|𝒩|}i\in\{2,\dots,|{\mathcal{N}}|\}. Then,

𝒮d,q{I1,,I|𝒩|}{\mathcal{S}}_{d,q}\triangleq\left\{I_{1},\dots,I_{|{\mathcal{N}}|}\right\} (28)

forms a partition of the hypercube [0,1]d[0,1]^{d} such that every point in IjI_{j} is at an 1\ell^{1}-distance of at most q1q^{-1} from z(j)z^{(j)}. With this 1\ell^{1}-ball based partition in place, the next lemma conveys that gg can be uniformly approximated using a piecewise polynomial function akin to the proofs of [84, Proposition 1] and [43, Proposition 3.3].

Lemma 1 (Uniform Piecewise Polynomial Approximation).

For any q>0q>0 and any ϑp\vartheta\in{\mathbb{R}}^{p}, there exists a piecewise polynomial function Pq,l(;ϑ):[0,1]dP_{q,l}(\cdot;\vartheta):[0,1]^{d}\rightarrow{\mathbb{R}} such that

supy[0,1]d|g(y;ϑ)Pq,l(y;ϑ)|L2l!qη,\sup_{y\in[0,1]^{d}}{\left|g(y;\vartheta)-P_{q,l}(y;\vartheta)\right|}\leq\frac{L_{2}}{l!q^{\eta}}\,,

where l=η1l=\lceil\eta\rceil-1, the piecewise polynomial function Pq,l(;ϑ):[0,1]dP_{q,l}(\cdot;\vartheta):[0,1]^{d}\rightarrow{\mathbb{R}} is given by the following summation:

y[0,1]d,Pq,l(y;ϑ)=I𝒮d,qPI(y;ϑ)𝟙{yI},\forall y\in[0,1]^{d},\enspace P_{q,l}(y;\vartheta)=\sum_{I\in{\mathcal{S}}_{d,q}}{P_{I}(y;\vartheta){\mathds{1}}\!\{y\in I\}}\,,

and PI(;ϑ):[0,1]dP_{I}(\cdot;\vartheta):[0,1]^{d}\rightarrow{\mathbb{R}} are Taylor polynomials with (total) degree at most ll as shown in (49).

Lemma 1 is proved in Section IV-A. While this lemma will suffice to establish federated oracle complexity in our problem, we also present a corollary of it in the context of latent variable models for completeness.
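Lemma 1 is also easy to verify numerically in the simplest case d = 1. The following sketch (with assumed values q = 8, η = 2, and test function sin(2πy), for which L2 = (2π)² suffices) builds a 1/q-net, assigns every point to its nearest center as in (28), forms the degree-l Taylor pieces of (49), and checks the uniform bound.

import math
import numpy as np

q, eta = 8, 2.0                         # net resolution and Holder exponent
l = math.ceil(eta) - 1                  # Taylor degree l = ceil(eta) - 1 = 1
g = lambda y: np.sin(2 * np.pi * y)     # test function on [0, 1]
dg = lambda y: 2 * np.pi * np.cos(2 * np.pi * y)
L2 = (2 * np.pi) ** 2                   # |g''| <= L2, so g' is (1, L2)-Holder

ys = np.linspace(0.0, 1.0, 2001)
centers = (np.arange(q) + 0.5) / q      # centers of a 1/q-net of [0, 1], cf. (27)
z = centers[np.argmin(np.abs(ys[:, None] - centers), axis=1)]  # nearest z_I
P = g(z) + dg(z) * (ys - z)             # degree-l Taylor pieces P_I, cf. (49)
assert np.max(np.abs(g(ys) - P)) <= L2 / (math.factorial(l) * q ** eta)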

For any n,kn,k\in{\mathbb{N}}, any y(1),,y(n)[0,1]dy^{(1)},\dots,y^{(n)}\in[0,1]^{d}, and any ϑ(1),,ϑ(k)p\vartheta^{(1)},\dots,\vartheta^{(k)}\in{\mathbb{R}}^{p}, define the matrix Mn×kM\in{\mathbb{R}}^{n\times k} such that:

i[n],j[k],Mi,jg(y(i),ϑ(j)).\forall i\in[n],\,\forall j\in[k],\enspace M_{i,j}\triangleq g(y^{(i)},\vartheta^{(j)})\,. (29)

We refer to MM as a latent variable model with latent parameters y(1),,y(n)[0,1]dy^{(1)},\dots,y^{(n)}\in[0,1]^{d} and ϑ(1),,ϑ(k)p\vartheta^{(1)},\dots,\vartheta^{(k)}\in{\mathbb{R}}^{p}, and latent function gg (which satisfies the Hölder smoothness assumptions outlined earlier); see, e.g., [43] and the references therein. Furthermore, in the special case where gg is symmetric in its two inputs and MM is a symmetric matrix, gg is known as a graphon function (which naturally appears in Aldous-Hoover representations of exchangeable random graphs [85]), cf. [41, 86, 42]. Next, let

M=i=1min{n,k}σiuiviTM=\sum_{i=1}^{\min\{n,k\}}{\sigma_{i}u_{i}v_{i}^{\mathrm{T}}} (30)

be the singular value decomposition of MM with singular values σ1σ2σmin{n,k}0\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{\min\{n,k\}}\geq 0 and orthonormal bases of singular vectors {u1,,un}n\{u_{1},\dots,u_{n}\}\subseteq{\mathbb{R}}^{n} and {v1,,vk}k\{v_{1},\dots,v_{k}\}\subseteq{\mathbb{R}}^{k}. Then, by the Eckart-Young-Mirsky theorem, for any r[min{n,k}]r\in[\min\{n,k\}], the best rank rr approximation of MM in Frobenius norm is given by

Mri=1rσiuiviT,M_{r}\triangleq\sum_{i=1}^{r}{\sigma_{i}u_{i}v_{i}^{\mathrm{T}}}\,, (31)

i.e., Mr=argminAn×k:rank(A)rMAFM_{r}=\operatorname*{arg\,min}_{A\in{\mathbb{R}}^{n\times k}:\,\operatorname{{rank}}(A)\leq r}{\|M-A\|_{\mathrm{F}}}, cf. [87, Section 7.4.2]. Akin to [42, Proposition 1] and [43, Proposition 3.3], the ensuing theorem demonstrates that the latent variable model MM can be well-estimated by its low rank approximation due to the smoothness of gg. We believe that this result may be of broader interest to the matrix and tensor estimation community.

Theorem 1 (Low Rank Approximation).

Consider the latent variable model Mn×kM\in{\mathbb{R}}^{n\times k} in (29) defined by the latent function g:[0,1]d×pg:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} such that g(;ϑ):[0,1]dg(\cdot;\vartheta):[0,1]^{d}\rightarrow{\mathbb{R}} belongs to the (η,L2)(\eta,L_{2})-Hölder class for all ϑp\vartheta\in{\mathbb{R}}^{p}. Then, for any red 2d(l+d)dr\geq e\sqrt{d}\,2^{d}(l+d)^{d}, we have

1nkMMrF2L22e2η/ddη/d4η(l+d)2η(l!)2(1r)2η/d,\frac{1}{nk}\left\|M-M_{r}\right\|_{\mathrm{F}}^{2}\leq\frac{L_{2}^{2}e^{2\eta/d}d^{\eta/d}4^{\eta}(l+d)^{2\eta}}{(l!)^{2}}\left(\frac{1}{r}\right)^{\!2\eta/d},

where l=η1l=\lceil\eta\rceil-1.

Theorem 1 is proved in Section IV-B; we will use ideas from this proof in our analysis of federated oracle complexity. It is worth comparing this result with the related results in [41], [84, Proposition 1], [42, Proposition 1], and [43, Proposition 3.3]. The author of [41] proves an analog of Theorem 1 in the special case where gg is a graphon function and g(;ϑ)g(\cdot;\vartheta) is Lipschitz continuous with respect to the \ell^{\infty}-norm for all ϑp\vartheta\in{\mathbb{R}}^{p}. In this special case, a piecewise constant function approximation of gg suffices for the proof. The results in [84, Proposition 1] and [42, Proposition 1] extend the result of [41] to the setting where gg is still a graphon function, but g(;ϑ)g(\cdot;\vartheta) belongs to an (η,L2)(\eta,L_{2})-Hölder class with respect to the \ell^{\infty}-norm. The key insight in [84, 42] is to utilize piecewise polynomial function approximations instead of piecewise constant functions; we also use this key idea in Lemma 1. Finally, the authors of [43, Proposition 3.3] apply the argument of [84, 42] to extend their result further to the setting where gg is a general latent variable model, similar to our result, but g(;ϑ)g(\cdot;\vartheta) only belongs to an (η,L2)(\eta,L_{2})-Hölder class with respect to the \ell^{\infty}-norm. In contrast, our result in Theorem 1 applies to general latent variable models gg such that g(;ϑ)g(\cdot;\vartheta) belongs to an (η,L2)(\eta,L_{2})-Hölder class with respect to the 1\ell^{1}-norm, which subsumes the settings of [41, 84, 42, 43]. To prove this more general result, we resort to using a modified minimal 1\ell^{1}-covering to construct our piecewise polynomial function approximation in the intermediate result in Lemma 1; this differs from the \ell^{\infty}-coverings used in [41], [84, Lemma 3], [42], and [43], and it can be verified that \ell^{\infty}-coverings produce a worse bound in Lemma 1. (Footnote 7: We note that our modified 1\ell^{1}-covering of [0,1]d[0,1]^{d} is more sophisticated than the \ell^{\infty}-coverings used in [41, 84, 42, 43], because \ell^{\infty}-balls can be used to form a (geometric) honeycomb in the hypercube [0,1]d[0,1]^{d}, but 1\ell^{1}-balls do not form such a honeycomb in [0,1]d[0,1]^{d} for general dd\in{\mathbb{N}}.) Therefore, in contrast to [41, 84, 42, 43], our proofs of Lemma 1 and Theorem 1 are adapted to deal with the 1\ell^{1}-norm Hölder class assumption on g(;ϑ)g(\cdot;\vartheta) and the modified minimal 1\ell^{1}-covering of the hypercube [0,1]d[0,1]^{d}.
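As a quick numerical illustration of Theorem 1 (and not a substitute for its proof), one can sample a latent variable model from a smooth latent function and observe the rapid decay of the normalized low rank approximation error; the Gaussian-shaped latent function below is an assumed example with d = 1 and scalar parameters.

import numpy as np

rng = np.random.default_rng(0)
n = k = 300
y = rng.random((n, 1))                  # latent variables y^(i) in [0, 1]
th = rng.random((1, k))                 # scalar latent parameters theta^(j)
M = np.exp(-(y - th) ** 2)              # M_{i,j} = g(y^(i); theta^(j)) as in (29)
sv = np.linalg.svd(M, compute_uv=False)
tail = np.cumsum(sv[::-1] ** 2)[::-1]   # tail[r] = sum of sigma_i^2 over i > r
for r in (1, 2, 4, 8):
    print(r, tail[r] / (n * k))         # (1/nk) * ||M - M_r||_F^2 decays rapidly in r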

III-B Federated Low Rank Gradient Descent

We now describe the new Federated Low Rank Gradient Descent (𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}}) algorithm for federated optimization under the setup of Section I-A and Section II-B. Recall that we are given a central server with rr private training data samples x(0,1),,x(0,r)[0,1]dx^{(0,1)},\dots,x^{(0,r)}\in[0,1]^{d}, and mm clients with ss private training data samples each; client c[m]c\in[m] has samples x(c,1),,x(c,s)[0,1]dx^{(c,1)},\dots,x^{(c,s)}\in[0,1]^{d}. The number of samples at the server, rr, depends on the approximate rank of the loss function, as we will see in the next subsection. The server’s objective is to compute an ϵ\epsilon-approximate solution in the sense of (11) to the ERM problem (3).

In the traditional non-federated setting, this objective could be achieved by running GD (or any of its many variants). Each iteration of GD would involve calculating the gradient:

θp,θF(θ)\displaystyle\forall\theta\in{\mathbb{R}}^{p},\enspace\nabla_{\theta}F(\theta) =1nk=1rθf(x(0,k);θ)\displaystyle=\frac{1}{n}\sum_{k=1}^{r}{\nabla_{\theta}f(x^{(0,k)};\theta)} (32)
+1nc=1mj=1sθf(x(c,j);θ),\displaystyle\quad\,+\frac{1}{n}\sum_{c=1}^{m}{\sum_{j=1}^{s}{\nabla_{\theta}f(x^{(c,j)};\theta)}}\,,

at different parameter values. The main idea of this work is to utilize the smoothness of θf\nabla_{\theta}f in the data to approximate the gradient in (32) by exploiting the approximate low rank structure induced by the smoothness (cf. Section III-A). Specifically, for every index i[p]i\in[p], we determine a set of weights {wj,k(i,c):c[m],j[s],k[r]}\{w_{j,k}^{(i,c)}\in{\mathbb{R}}:c\in[m],\,j\in[s],\,k\in[r]\} such that:

θp,Fθi(θ)1nk=1rgi(x(0,k);θ)(1+c=1mj=1swj,k(i,c)),\forall\theta\in{\mathbb{R}}^{p},\,\frac{\partial F}{\partial\theta_{i}}(\theta)\approx\frac{1}{n}\sum_{k=1}^{r}{g_{i}(x^{(0,k)};\theta)\left(1+\sum_{c=1}^{m}{\sum_{j=1}^{s}{w_{j,k}^{(i,c)}}}\right)}, (33)

where for each client c[m]c\in[m] and each private training sample x(c,j)[0,1]dx^{(c,j)}\in[0,1]^{d}, we re-weight the partial derivative gi(x(0,k);)g_{i}(x^{(0,k)};\cdot) at the server’s training sample x(0,k)[0,1]dx^{(0,k)}\in[0,1]^{d} by an additional amount wj,k(i,c)w_{j,k}^{(i,c)}. In our 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm, the clients compute the aforementioned weights after receiving some initial information from the server, and the server then uses these weights to run inexact GD. Intuitively, when θf\nabla_{\theta}f is very smooth with respect to the data, the number of private samples rr required at the server to ensure that (33) is a good approximation is reasonably small. Hence, the oracle complexity of GD at the server also remains reasonably small since it only computes gradients at rr data samples in every iteration (instead of nn).

Our 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm runs over T=r+2T=r+2 synchronized epochs and computes an ϵ\epsilon-approximate solution to (3) at the server. In the first epoch, all clients are idle and there is no communication between the server and clients. The server arbitrarily chooses rr parameter values ϑ(1),,ϑ(r)p\vartheta^{(1)},\dots,\vartheta^{(r)}\in{\mathbb{R}}^{p}, and performs r2r^{2} gradient computations corresponding to these rr parameter values and its rr private training samples x(0,1),,x(0,r)x^{(0,1)},\dots,x^{(0,r)}. Since each gradient computation produces a pp-dimensional vector, for every index i[p]i\in[p], the server obtains a matrix of partial derivatives G(i)r×rG^{(i)}\in{\mathbb{R}}^{r\times r} as in (7):

j,k[r],Gj,k(i)=gi(x(0,j);ϑ(k))=fθi(x(0,j);ϑ(k)).\forall j,k\in[r],\enspace G^{(i)}_{j,k}=g_{i}(x^{(0,j)};\vartheta^{(k)})=\frac{\partial f}{\partial\theta_{i}}(x^{(0,j)};\vartheta^{(k)})\,. (34)

Moreover, due to the non-singularity assumption in Section II-B, each G(i)G^{(i)} is an invertible matrix. So, the server computes the inverses of these pp matrices to get (G(1))1,,(G(p))1r×r(G^{(1)})^{-1},\dots,(G^{(p)})^{-1}\in{\mathbb{R}}^{r\times r}.

The second epoch proceeds in three phases. First, the server broadcasts the rr parameter values ϑ(1),,ϑ(r)\vartheta^{(1)},\dots,\vartheta^{(r)} and the pp matrices (G(1))1,,(G(p))1(G^{(1)})^{-1},\dots,(G^{(p)})^{-1} to all mm clients. We assume for simplicity that τ=1\tau=1, i.e., all clients are active, although our analysis would hold in the case τ=Θ(1)\tau=\Theta(1). In the second phase, every client performs rsrs gradient computations corresponding to its ss private training samples and the rr common parameter values. In particular, each client c[m]c\in[m] computes the following set of partial derivatives {gi(x(c,j);ϑ(k)):i[p],j[s],k[r]}\{g_{i}(x^{(c,j)};\vartheta^{(k)}):i\in[p],\,j\in[s],\,k\in[r]\}. Then, each client cc calculates the weight vectors v(i,c)rv^{(i,c)}\in{\mathbb{R}}^{r}:

(v(i,c))T\displaystyle(v^{(i,c)})^{\mathrm{T}} (35)
=(j=1s[gi(x(c,j);ϑ(1))gi(x(c,j);ϑ(2))gi(x(c,j);ϑ(r))]row vector of r partial derivatives at x(c,j))\displaystyle=\left(\!\sum_{j=1}^{s}{\underbrace{\left[g_{i}(x^{(c,j)};\vartheta^{(1)})\enspace g_{i}(x^{(c,j)};\vartheta^{(2)})\,\cdots\,g_{i}(x^{(c,j)};\vartheta^{(r)})\right]}_{\text{row vector of }r\text{ partial derivatives at }x^{(c,j)}}}\!\right)
(G(i))1\displaystyle\quad\enspace\cdot(G^{(i)})^{-1}

for every index i[p]i\in[p]. Note that the kkth weight vk(i,c)v^{(i,c)}_{k}\in{\mathbb{R}} for the iith partial derivative at client cc represents the accumulation of the aforementioned weights {wj,k(i,c):j[s]}\{w_{j,k}^{(i,c)}\in{\mathbb{R}}:j\in[s]\} (as explained in (39)).

To intuitively understand (35), for each index i[p]i\in[p], note that the induced approximate low rankness of gi(x;θ)g_{i}(x;\theta) implies that for every client c[m]c\in[m] and data sample x(c,j)x^{(c,j)}:

θp,gi(x(c,j);θ)k=1rwj,k(i,c)gi(x(0,k);θ),\forall\theta\in{\mathbb{R}}^{p},\enspace g_{i}(x^{(c,j)};\theta)\approx\sum_{k=1}^{r}{w_{j,k}^{(i,c)}g_{i}(x^{(0,k)};\theta)}\,, (36)

for some choices of weights {wj,k(i,c):k[r]}\{w_{j,k}^{(i,c)}\in{\mathbb{R}}:k\in[r]\} (which we formally define soon). Indeed, if we construe each gi(x;θ)g_{i}(x;\theta) as an infinite “matrix” with xx indexing its “rows” and θ\theta indexing its “columns,” then, if the approximate rank of gig_{i} is rr, the “rows” of gig_{i} corresponding to data samples at the clients can be approximately represented as linear combinations of the rr “rows” of gig_{i} corresponding to the data samples at the server x(0,1),,x(0,r)x^{(0,1)},\dots,x^{(0,r)}. Furthermore, this property in (36) implies the approximation in (33). Formally, we capture the approximate low rankness in (36) by sampling gi(x(c,j);θ)g_{i}(x^{(c,j)};\theta) and each gi(x(0,k);θ)g_{i}(x^{(0,k)};\theta) at the parameter values ϑ(1),,ϑ(r)\vartheta^{(1)},\dots,\vartheta^{(r)}:

[gi(x(c,j);ϑ(1))gi(x(c,j);ϑ(2))gi(x(c,j);ϑ(r))]\displaystyle\left[g_{i}(x^{(c,j)};\vartheta^{(1)})\enspace g_{i}(x^{(c,j)};\vartheta^{(2)})\,\cdots\,g_{i}(x^{(c,j)};\vartheta^{(r)})\right] (37)
=[wj,1(i,c)wj,2(i,c)wj,r(i,c)]G(i),\displaystyle\qquad\qquad=\left[w_{j,1}^{(i,c)}\enspace w_{j,2}^{(i,c)}\,\cdots\,w_{j,r}^{(i,c)}\right]G^{(i)},

which, after matrix inversion, equivalently defines the weights {wj,k(i,c):k[r]}\{w_{j,k}^{(i,c)}\in{\mathbb{R}}:k\in[r]\} via the relation

[wj,1(i,c)wj,2(i,c)wj,r(i,c)]=\displaystyle\left[w_{j,1}^{(i,c)}\enspace w_{j,2}^{(i,c)}\,\cdots\,w_{j,r}^{(i,c)}\right]= (38)
[gi(x(c,j);ϑ(1))gi(x(c,j);ϑ(2))gi(x(c,j);ϑ(r))](G(i))1.\displaystyle\left[g_{i}(x^{(c,j)};\vartheta^{(1)})\enspace g_{i}(x^{(c,j)};\vartheta^{(2)})\,\cdots\,g_{i}(x^{(c,j)};\vartheta^{(r)})\right](G^{(i)})^{-1}.

Additionally, (33) conveys that we only require the sums of weights, j=1swj,k(i,c)\sum_{j=1}^{s}{w_{j,k}^{(i,c)}}, for all i[p]i\in[p] and k[r]k\in[r] from each client cc to estimate the gradient of the empirical risk. So, it suffices for client cc to directly compute the accumulated weights:

i[p],k[r],vk(i,c)=j=1swj,k(i,c).\forall i\in[p],\,\forall k\in[r],\enspace v^{(i,c)}_{k}=\sum_{j=1}^{s}{w_{j,k}^{(i,c)}}\,. (39)

Combining (39) with (38) produces (35). This explains why each client cc directly calculates (35). In fact, the accumulated weights in (35) simplify (33) and yield the approximation:

θp,Fθi(θ)1nk=1rgi(x(0,k);θ)(1+c=1mvk(i,c))\forall\theta\in{\mathbb{R}}^{p},\enspace\frac{\partial F}{\partial\theta_{i}}(\theta)\approx\frac{1}{n}\sum_{k=1}^{r}{g_{i}(x^{(0,k)};\theta)\left(1+\sum_{c=1}^{m}{v_{k}^{(i,c)}}\right)} (40)

for all i[p]i\in[p].

After calculating (35) for all i[p]i\in[p], each client cc possesses a set of rprp weights {vk(i,c):k[r],i[p]}\{v^{(i,c)}_{k}:k\in[r],\,i\in[p]\}. Hence, in the third phase of the second epoch, every client cc communicates the first pp weights {v1(i,c):i[p]}\{v^{(i,c)}_{1}:i\in[p]\} from its set of rprp weights to the server. In the next r1r-1 epochs, every client c[m]c\in[m] communicates pp weights per epoch from its set of rprp weights to the server. Specifically, in epoch t{3,,r+1}t\in\{3,\dots,r+1\}, client cc communicates the weights {vt1(i,c):i[p]}\{v^{(i,c)}_{t-1}:i\in[p]\} to the server. There are no gradient computations performed in these epochs.

Finally, in the last epoch, the server runs inexact gradient descent using the weights {vk(i,c):k[r],i[p],c[m]}\{v^{(i,c)}_{k}:k\in[r],\,i\in[p],\,c\in[m]\} it has received from the clients. There is no communication between the server and clients, and all clients remain idle at this stage. To precisely describe the inexact GD iterations, for any θp\theta\in{\mathbb{R}}^{p}, define the approximation F^(θ)=[F^1(θ)F^p(θ)]Tp\widehat{\nabla F}(\theta)=\big{[}\widehat{\nabla F}_{1}(\theta)\,\cdots\,\widehat{\nabla F}_{p}(\theta)\big{]}^{\mathrm{T}}\in{\mathbb{R}}^{p} of θF(θ)\nabla_{\theta}F(\theta) such that:

i[p],F^i(θ)\displaystyle\forall i\in[p],\enspace\widehat{\nabla F}_{i}(\theta) 1nk=1rgi(x(0,k);θ)(1+c=1mvk(i,c))\displaystyle\triangleq\frac{1}{n}\sum_{k=1}^{r}{g_{i}(x^{(0,k)};\theta)\left(1+\sum_{c=1}^{m}{v_{k}^{(i,c)}}\right)}
=1nk=1rfθi(x(0,k);θ)(1+c=1mvk(i,c)),\displaystyle=\frac{1}{n}\sum_{k=1}^{r}{\frac{\partial f}{\partial\theta_{i}}(x^{(0,k)};\theta)\left(1+\sum_{c=1}^{m}{v_{k}^{(i,c)}}\right)}, (41)

as indicated in (40). Moreover, let SS\in{\mathbb{N}} be the number of iterations for which the server runs inexact GD; SS depends on the desired approximation accuracy ϵ>0\epsilon>0 in Section II-B and other problem parameters, as we will see in the next subsection. Then, starting with an arbitrary parameter vector θ(0)p\theta^{(0)}\in{\mathbb{R}}^{p}, the server performs the inexact GD updates

θ(γ)=θ(γ1)1L1F^(θ(γ1))\theta^{(\gamma)}=\theta^{(\gamma-1)}-\frac{1}{L_{1}}\,\widehat{\nabla F}(\theta^{(\gamma-1)}) (42)

for γ=1,,S\gamma=1,\dots,S, where L1>0L_{1}>0 is the common Lipschitz constant of the gradient θf(x;)\nabla_{\theta}f(x;\cdot) for all x[0,1]dx\in[0,1]^{d} (see Section II-B). Note that at the γ\gammath update, the server performs rr gradient computations corresponding to its private data {x(0,j)[0,1]d:j[r]}\{x^{(0,j)}\in[0,1]^{d}:j\in[r]\} at the parameter value θ(γ1)\theta^{(\gamma-1)}. Lastly, the output of the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm at the server is θ=θ(S)p\theta^{*}=\theta^{(S)}\in{\mathbb{R}}^{p}.
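A minimal Python sketch of this last epoch, with grad_hat standing in for the surrogate gradient in (41), reads as follows.

def inexact_gd(theta0, grad_hat, L1, S):
    # The inexact GD updates (42): theta <- theta - (1/L1) * grad_hat(theta).
    theta = theta0
    for _ in range(S):
        theta = theta - grad_hat(theta) / L1
    return theta                        # the server outputs theta* = theta^(S)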

In the next subsection, we analyze the federated oracle complexity of this algorithm and specify what values rr and SS should take to ensure that θ(S)p\theta^{(S)}\in{\mathbb{R}}^{p} is an ϵ\epsilon-approximate solution in the sense of (11) to the ERM problem (3). We also summarize the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm in Algorithm 1 for the readers’ convenience.

1:training data {x(0,j)[0,1]d:j[r]}{x(c,j)[0,1]d:c[m],j[s]}\{x^{(0,j)}\in[0,1]^{d}:j\in[r]\}\cup\{x^{(c,j)}\in[0,1]^{d}:c\in[m],\,j\in[s]\}
2:approximate rank rr\in{\mathbb{N}}
3:number of inexact GD iterations SS\in{\mathbb{N}}
4:approximate solution θp\theta^{*}\in{\mathbb{R}}^{p} to (3)
5:Epoch t=1t=1: Server pre-computation
6:Server chooses rr arbitrary parameter vectors ϑ(1),,ϑ(r)p\vartheta^{(1)},\dots,\vartheta^{(r)}\in{\mathbb{R}}^{p}
7:Server performs r2r^{2} gradient computations corresponding to ϑ(1),,ϑ(r)\vartheta^{(1)},\dots,\allowbreak\vartheta^{(r)} and its private data {x(0,j)[0,1]d:j[r]}\{x^{(0,j)}\in[0,1]^{d}:j\in[r]\} to obtain the set of partial derivatives {gi(x(0,j);ϑ(k)):i[p],j[r],k[r]}\{g_{i}(x^{(0,j)};\vartheta^{(k)}):i\in[p],\,j\in[r],\,k\in[r]\}
8:Server constructs the matrices G(1),,G(p)r×rG^{(1)},\dots,G^{(p)}\in{\mathbb{R}}^{r\times r} according to (34) using its partial derivatives
9:Server calculates the inverse matrices (G(1))1,,(G(p))1r×r(G^{(1)})^{-1},\dots,(G^{(p)})^{-1}\in{\mathbb{R}}^{r\times r}
10:Epoch t=2t=2: Client weights computation
11:Server broadcasts ϑ(1),,ϑ(r)\vartheta^{(1)},\dots,\vartheta^{(r)} to all mm clients
12:Server broadcasts (G(1))1,,(G(p))1(G^{(1)})^{-1},\dots,(G^{(p)})^{-1} to all mm clients
13:Every client c[m]c\in[m] performs rsrs gradient computations corresponding to ϑ(1),,ϑ(r)\vartheta^{(1)},\dots,\vartheta^{(r)} and its private data {x(c,j)[0,1]d:j[s]}\{x^{(c,j)}\in[0,1]^{d}:j\in[s]\} to obtain the set of partial derivatives {gi(x(c,j);ϑ(k)):i[p],j[s],k[r]}\{g_{i}(x^{(c,j)};\vartheta^{(k)}):i\in[p],\,j\in[s],\,k\in[r]\}
14:Every client c[m]c\in[m] calculates pp weight vectors v(1,c),,v(p,c)rv^{(1,c)},\dots,v^{(p,c)}\in\allowbreak{\mathbb{R}}^{r} according to (35) using its partial derivatives and (G(1))1,,(G(p))1(G^{(1)})^{-1},\dots,\allowbreak(G^{(p)})^{-1}
15:Every client c[m]c\in[m] communicates the pp weights {v1(i,c):i[p]}\{v_{1}^{(i,c)}:i\in[p]\} to the server
16:for Epochs t=3t=3 to r+1r+1: Client-to-server communication
17:Every client c[m]c\in[m] communicates the pp weights {vt1(i,c):i[p]}\{v_{t-1}^{(i,c)}:i\in[p]\} to the server
18:end for
19:Epoch t=r+2t=r+2: Server inexact GD
20:Server initializes an arbitrary parameter vector θ(0)p\theta^{(0)}\in{\mathbb{R}}^{p}
21:for γ=1\gamma=1 to SS do \triangleright Iterations of inexact GD
22:     Server performs rr gradient computations corresponding to its private data {x(0,j)[0,1]d:j[r]}\{x^{(0,j)}\in[0,1]^{d}:j\in[r]\} at the parameter value θ(γ1)\theta^{(\gamma-1)} to obtain the set of partial derivatives {gi(x(0,j);θ(γ1)):i[p],j[r]}\{g_{i}(x^{(0,j)};\theta^{(\gamma-1)}):i\in[p],\,j\in[r]\}
23:     Server constructs approximation of gradient F^(θ(γ1))\widehat{\nabla F}(\theta^{(\gamma-1)}) according to (41) using {gi(x(0,j);θ(γ1)):i[p],j[r]}\{g_{i}(x^{(0,j)};\theta^{(\gamma-1)}):i\in[p],\,j\in[r]\} and the weights {vk(i,c):k[r],i[p],c[m]}\{v^{(i,c)}_{k}:k\in[r],\,i\in[p],\,c\in[m]\}
24:     Server updates the parameter vector according to (42): θ(γ)=θ(γ1)1L1F^(θ(γ1))\theta^{(\gamma)}=\theta^{(\gamma-1)}-\frac{1}{L_{1}}\widehat{\nabla F}(\theta^{(\gamma-1)})
25:end for
26:return θ=θ(S)\theta^{*}=\theta^{(S)}
Algorithm 1 Federated Low Rank Gradient Descent (𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}}) algorithm to approximately solve the ERM problem (3).
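To illustrate the mechanics of Algorithm 1 end to end, the following self-contained Python sketch instantiates it on a synthetic loss f(x;θ) = Σ_w x^{w−1} exp(a_wᵀθ) whose partial derivatives are exactly rank r in the (scalar) data, so that the surrogate gradient (41) matches the true gradient (32) up to floating-point error; all sizes and the loss itself are illustrative assumptions rather than part of our formal development.

import numpy as np

rng = np.random.default_rng(1)
p, m, s, r = 3, 4, 20, 5                 # illustrative problem sizes
A = 0.3 * rng.standard_normal((r, p))    # directions a_w of the synthetic loss

def grad(x, theta):
    # Gradient in theta of f(x; theta) = sum_w x^(w-1) * exp(a_w @ theta); each
    # partial derivative is exactly rank r as a function of the scalar data x.
    return (x ** np.arange(r)) @ (A * np.exp(A @ theta)[:, None])

xs = rng.random(r)                       # server's r private samples
xc = rng.random((m, s))                  # clients' private samples
n = m * s + r
vth = rng.standard_normal((r, p))        # probe parameters theta^(1), ..., theta^(r)

# Epoch 1: the server forms the matrices G^(i) of (34) and inverts them.
G = np.stack([[grad(x, t) for t in vth] for x in xs])  # G[j, k, i] = g_i(x_j; th_k)
Ginv = [np.linalg.inv(G[:, :, i]) for i in range(p)]

# Epoch 2: every client accumulates its weight vectors, cf. (35) and (39).
V = np.zeros((p, m, r))                  # V[i, c, k] = v_k^(i,c)
for c in range(m):
    rows = sum(np.stack([grad(x, t) for t in vth]) for x in xc[c])  # shape (r, p)
    for i in range(p):
        V[i, c] = rows[:, i] @ Ginv[i]

# Last epoch: the surrogate gradient (41) versus the true gradient (32).
def grad_hat(theta):
    gs = np.stack([grad(x, theta) for x in xs])        # shape (r, p)
    return np.array([gs[:, i] @ (1.0 + V[i].sum(axis=0)) for i in range(p)]) / n

def grad_true(theta):
    return sum(grad(x, theta) for x in np.concatenate([xs, xc.ravel()])) / n

print(np.abs(grad_hat(np.zeros(p)) - grad_true(np.zeros(p))).max())  # ~1e-15

Because the synthetic gradient here is exactly (rather than approximately) low rank, the reconstruction is exact; for Hölder smooth losses, the analysis in the next subsection quantifies the residual approximation error instead.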

III-C Federated Oracle Complexity of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}}

In addition to the assumptions on the loss function in Section II-B, we will require several assumptions on the piecewise polynomial approximations introduced in Section III-A. Recall from Section II-B that the iith partial derivative gi:[0,1]d×pg_{i}:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} satisfies the (η,L2)(\eta,L_{2})-Hölder class condition. So, for any q>0q>0 and any fixed index i[p]i\in[p], let Pq,l:[0,1]d×pP_{q,l}:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} be the piecewise polynomial approximation of gi:[0,1]d×pg_{i}:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} satisfying Lemma 1:

x[0,1]d,θp,Pq,l(x;θ)=I𝒮d,qPI(x;θ)𝟙{xI},\forall x\in[0,1]^{d},\,\forall\theta\in{\mathbb{R}}^{p},\enspace P_{q,l}(x;\theta)=\sum_{I\in{\mathcal{S}}_{d,q}}{P_{I}(x;\theta){\mathds{1}}\!\{x\in I\}}\,, (43)

where 𝒮d,q{\mathcal{S}}_{d,q} is the partition of the hypercube [0,1]d[0,1]^{d} given in (28), PI(;θ):[0,1]dP_{I}(\cdot;\theta):[0,1]^{d}\rightarrow{\mathbb{R}} are Taylor polynomials with degree at most l=η1l=\lceil\eta\rceil-1 as defined in (49), and we suppress the dependence of Pq,lP_{q,l} on ii for simplicity. Then, for any sufficiently large rr\in{\mathbb{N}}, fix q>0q>0 such that r=(l+dd)|𝒮d,q|r=\binom{l+d}{d}|{\mathcal{S}}_{d,q}|. As shown in the proofs of Theorem 1 and Theorem 2 in Section IV-B and Section V, respectively, we can write Pq,lP_{q,l} in the form (cf. (55)):

x[0,1]d,θp,Pq,l(x;θ)=w=1rϕw(x)ψw(θ),\forall x\in[0,1]^{d},\,\forall\theta\in{\mathbb{R}}^{p},\enspace P_{q,l}(x;\theta)=\sum_{w=1}^{r}{\phi_{w}(x)\psi_{w}(\theta)}\,, (44)

where {ϕw:w[r]}\{\phi_{w}:w\in[r]\} is a family of linearly independent monomial functions on [0,1]d[0,1]^{d} (see (56)), and {ψw:w[r]}\{\psi_{w}:w\in[r]\} is another family of functions on p{\mathbb{R}}^{p}. Note that the form (44) conveys that Pq,lP_{q,l}, when perceived as the bivariate kernel of an integral operator, has rank at most rr. For every large enough rr\in{\mathbb{N}} and any fixed index i[p]i\in[p], we make the following assumptions about the approximation Pq,lP_{q,l}:

  1. 1.

    Non-singularity of monomials: The matrix Φr×r\Phi\in{\mathbb{R}}^{r\times r} with entries:

    w,j[r],Φw,j=ϕw(x(0,j)),\forall w,j\in[r],\enspace\Phi_{w,j}=\phi_{w}(x^{(0,j)})\,, (45)

    is invertible. Note that we suppress the dependence of Φ\Phi on ii, rr, and the server’s training data for simplicity.

  2. 2.

    Non-singularity of approximation: For any sequence of distinct parameter values ϑ(1),,ϑ(r)p\vartheta^{(1)},\dots,\vartheta^{(r)}\in{\mathbb{R}}^{p}, the matrix 𝒫(i)r×r{\mathcal{P}}^{(i)}\in{\mathbb{R}}^{r\times r} given by:

    j,k[r],𝒫j,k(i)=Pq,l(x(0,j);ϑ(k)),\forall j,k\in[r],\enspace{\mathcal{P}}^{(i)}_{j,k}=P_{q,l}(x^{(0,j)};\vartheta^{(k)})\,, (46)

    is invertible. Here, we suppress the dependence of 𝒫(i){\mathcal{P}}^{(i)} on rr, the server’s training data, and the parameter sequence for simplicity.
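Both non-singularity assumptions hold generically. For instance, in the simplest case d = 1 with a single cell in 𝒮_{d,q}, the monomial matrix Φ in (45) is a Vandermonde matrix, so the first assumption reduces to the server's samples being distinct, as the following sketch (with an assumed size r = 6) confirms.

import numpy as np

rng = np.random.default_rng(2)
r = 6                                      # number of monomials and server samples
x0 = rng.random(r)                         # server samples, distinct almost surely
Phi = np.vander(x0, r, increasing=True).T  # Phi[w, j] = phi_w(x^(0,j)), cf. (45)
assert np.linalg.matrix_rank(Phi) == r     # Vandermonde: invertible iff distinct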

With all our assumptions in place, the ensuing theorem presents the federated oracle complexity of the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm to generate an ϵ\epsilon-approximate solution. This is the key technical result of this work.

Theorem 2 (Federated Oracle Complexity of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}}).

Suppose that the two assumptions above and the assumptions in Section II-B hold. Assume further that the approximation accuracy satisfies 0<ϵ9(B+3)4p8μ0<\epsilon\leq\frac{9(B+3)^{4}p}{8\mu}, the Hölder smoothness parameter satisfies η>(2α+2)d\eta>(2\alpha+2)d, and the proportion of active clients is τ=1\tau=1. Then, the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm described in Algorithm 1 with approximate rank given by

r=\left\lceil\max\left\{e\sqrt{d}\,2^{d}(\eta+d)^{d},\left(\left(\frac{L_{2}^{d}\eta^{d}e^{\eta(d+1)}d^{\eta/2}(2\eta+2d)^{\eta d}}{(\eta-1)^{\eta d}}\right)\left(\frac{\kappa\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)}{(\kappa-1)\epsilon}\right)^{\!d/2}\right)^{\!\frac{1}{\eta-(2\alpha+2)d}}\right\}\right\rceil, (47)

and number of inexact GD iterations given by

S=(log(κκ1))1log(F(θ(0))F+9(B+3)4p2μϵ),S=\left\lceil\left(\log\!\left(\frac{\kappa}{\kappa-1}\right)\right)^{\!-1}\log\!\left(\frac{F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}}{\epsilon}\right)\right\rceil\!, (48)

produces an ϵ\epsilon-approximate solution in the sense of (11) with a federated oracle complexity of

Γ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)=r2+rs+rS+ϕmr,\Gamma({\mathsf{FedLRGD}})=r^{2}+rs+rS+\phi mr\,,

where ss is the number of training data samples per client, mm is the number of clients, ϕ\phi is the communication-to-computation ratio, and the various other constants and problem variables are defined in Section I and Section II.

Theorem 2 is established in Section V. It roughly states that Γ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)\Gamma({\mathsf{FedLRGD}}) scales like ϕm(p/ϵ)Θ(d/η)\phi m(p/\epsilon)^{\Theta(d/\eta)} (neglecting typically sub-dominant factors; see, e.g., the assumptions and proof of Proposition 1) as mentioned in Section I-B. At a high level, the proof of Theorem 2 has three parts. First, we use the uniform piecewise polynomial approximations from Lemma 1 to show that partial derivatives of the loss function are approximately low rank. We emphasize that this approximate low rankness is deduced from the smoothness of loss functions in the data—a property we had set out to exploit since Section I. This approximate low rankness implies that F^\widehat{\nabla F} and θF\nabla_{\theta}F are close to each other. In particular, we show that under the assumptions of the theorem statement, the error in approximating the true gradient of the empirical risk at the server satisfies F^θF22=O(p)\big{\|}\widehat{\nabla F}-\nabla_{\theta}F\big{\|}_{2}^{2}=O(p) (see (62) and (65) for details). Second, we control the error F(θ)FF(\theta^{*})-F_{*} using the aforementioned bound on F^θF22\big{\|}\widehat{\nabla F}-\nabla_{\theta}F\big{\|}_{2}^{2} from the first part and standard tools from the inexact GD literature, cf. [5]. Finally, we impose the condition in (11), which naturally engenders certain choices of rr and SS (specifically, (47) and (48)), and from there, a direct application of Definition 1 yields the aforementioned federated oracle complexity.
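The complexity expression in Theorem 2 can also be read off directly from Algorithm 1's epoch schedule via Definition 1; the short tally below (with arbitrary placeholder sizes and τ = 1) makes this bookkeeping explicit.

# Epochs of Algorithm 1: server pre-computation (r^2 gradients), client weight
# computation (r*s gradients per client, with upload), r - 1 upload-only epochs,
# and inexact GD at the server (r*S gradients).
r, s, S, m, phi = 12, 200, 40, 500, 64.0
epochs = [(r * r, False)] + [(r * s, True)] + [(0, True)] * (r - 1) + [(r * S, False)]
gamma = sum(b + (phi * m if comm else 0.0) for b, comm in epochs)
assert gamma == r ** 2 + r * s + r * S + phi * m * r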

We remark that it is the process of rigorously translating the smoothness of partial derivatives of loss functions in the data to an approximate low rankness property that requires us to make many of the assumptions at the outset of this subsection and in Section II-B. Once we have approximate low rankness of partial derivatives of loss functions, however, it is reasonably straightforward to intuitively understand why (33) holds, and in turn, why Algorithm 1 works and a result like Theorem 2 holds. Furthermore, canonical machine learning models can exhibit the desired approximate low rankness property that lies at the heart of our method. For example, in Appendix C, we illustrate that gradients of loss functions associated with neural networks used to train classifiers for the MNIST and CIFAR-10 datasets (see [44, 45]) are approximately low rank. This gives further credence to the potential utility of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}}.

III-D Comparison to Federated Averaging

We finally compare the federated oracle complexity of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} in Theorem 2 with the widely used benchmark 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithm from the literature [28]. Specifically, we consider the precise formulation of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} in [30, Algorithm 1, Theorem 1]. Recall that in the 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithm, active clients perform several iterations of SGD using their private data in each epoch, and the server aggregates individual client updates at the end of each epoch and broadcasts the aggregated update back to clients at the beginning of the next epoch. The authors of [30] provide a sharp convergence analysis of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}. (Footnote 8: In fact, a more sophisticated version of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} with quantization is analyzed in [30].) Building upon their results, we derive the federated oracle complexity of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} in the ensuing theorem. For simplicity, we assume that the number of clients mm grows (i.e., mm\rightarrow\infty), and other non-constant problem parameters vary at different rates with respect to mm.

Theorem 3 (Federated Oracle Complexity of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}).

Assume that:

  1. 1.

    The smoothness in θ\theta and strong convexity assumptions of Section II-B hold.

  2. 2.

    The parameters τ(0,1]\tau\in(0,1] and L1>μ>0L_{1}>\mu>0 are constants, i.e., Θ(1)\Theta(1), while other problem parameters may vary with mm\in{\mathbb{N}}.

  3. 3.

    Any stochastic gradient ~fc(θ)\widetilde{\nabla}f_{c}(\theta) computed by a client c[m]c\in[m] in the 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithm is unbiased and has variance bounded by σ2>0\sigma^{2}>0 (see [30, Assumption 3]; the scaling σ2=O(p)\sigma^{2}=O(p) below is natural since the squared 2\ell^{2}-norm in the expectation has pp terms):

    c[m],θp,\displaystyle\forall c\in[m],\,\forall\theta\in{\mathbb{R}}^{p},
    𝔼[~fc(θ)1sj=1sθf(x(c,j);θ)22]σ2=O(p).\displaystyle{\mathbb{E}}\!\left[\left\|\widetilde{\nabla}f_{c}(\theta)-\frac{1}{s}\sum_{j=1}^{s}{\nabla_{\theta}f(x^{(c,j)};\theta)}\right\|_{2}^{2}\right]\leq\sigma^{2}=O(p)\,.
  4. 4.

    The number of gradient computations bb\in{\mathbb{N}} per client in an epoch of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} satisfies b=O(m)b=O(m).

  5. 5.

    𝔼[θ(T0)θ22]=O(p){\mathbb{E}}\big{[}\big{\|}\theta^{(T_{0})}-\theta^{\star}\big{\|}_{2}^{2}\big{]}=O(p), where θ(T0)p\theta^{(T_{0})}\in{\mathbb{R}}^{p} denotes the parameter vector at the server after T0=Θ(mb)T_{0}=\Theta\big{(}\frac{m}{b}\big{)} epochs of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}. (Footnote 10: The O(p)O(p) scaling here is also natural because the parameter θ\theta usually resides in an \ell^{\infty}-norm ball, e.g., [0,1]p[0,1]^{p}, rather than all of p{\mathbb{R}}^{p} in applications. Since the diameter of such an \ell^{\infty}-norm ball scales as O(p)O(p), 𝔼[θ(T0)θ22]{\mathbb{E}}\big{[}\big{\|}\theta^{(T_{0})}-\theta^{\star}\big{\|}_{2}^{2}\big{]} is upper bounded by O(p)O(p).)

Then, for all ϵ(0,1]\epsilon\in(0,1], all ϕ>0\phi>0 larger than some constant, and all sufficiently large mm\in{\mathbb{N}}, we have

Γ(𝖥𝖾𝖽𝖠𝗏𝖾)=O(ϕm(pϵ)3/4),\Gamma({\mathsf{FedAve}})=O\!\left(\phi m\left(\frac{p}{\epsilon}\right)^{\!3/4}\right),

where we compute the federated oracle complexity above by minimizing over all choices of bb and all numbers of epochs TT\in{\mathbb{N}} in 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} such that it produces an ϵ\epsilon-approximate solution in the sense of (12).

Theorem 3 is established in Appendix D. To easily compare the federated oracle complexities of the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} and 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithms, we assume as before that the number of clients mm\rightarrow\infty, all the problem variables d,p,s,r,n,ϵ1,η,ϕd,p,s,r,n,\epsilon^{-1},\eta,\phi are non-decreasing functions of mm, and the other problem parameters τ,L1,L2,μ,α,B\tau,L_{1},L_{2},\mu,\alpha,B are constants with respect to mm. (We refer readers back to Section I and Section II for the definitions of these parameters.) In particular, this means that the condition number κ\kappa is also constant. Although improving the dependence of oracle complexity on condition number is a vast area of research in optimization theory (see, e.g., accelerated methods [11], variance reduction [16], etc.), we do not consider this aspect of the problem in this work. The proposition below elucidates a broad and important regime where 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} has better federated oracle complexity than 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}.

Proposition 1 (Comparison between 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} and 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}).

Suppose that the assumptions of Theorem 2 and Theorem 3 hold. Fix any absolute constants λ(0,1)\lambda\in(0,1), β(0,12]\beta\in\big{(}0,\frac{1}{2}\big{]}, and ξ(0,34)\xi\in\big{(}0,\frac{3}{4}\big{)}, and then consider any sufficiently large constants c1,c2>0c_{1},c_{2}>0 such that c2>c12/ξc_{2}>c_{1}\geq 2/\xi. Assume further that the number of samples per client s=O(ϕm)s=O(\phi m), the data dimension d=O(log(n)λ)d=O(\log(n)^{\lambda}), the approximation accuracy ϵ=Θ(nβ)\epsilon=\Theta(n^{-\beta}), the Hölder smoothness parameter η=Θ(d)\eta=\Theta(d) such that (c1+2α+2)dη(c2+2α+2)d(c_{1}+2\alpha+2)d\leq\eta\leq(c_{2}+2\alpha+2)d, and F(θ(0))F=O(p)F(\theta^{(0)})-F_{*}=O(p), where n=ms+rn=ms+r is the total number of training data samples, and pp\in{\mathbb{N}} is the parameter dimension. (Footnote 11: The condition F(θ(0))F=O(p)F(\theta^{(0)})-F_{*}=O(p) is natural, because
\begin{align*}
F(\theta^{(0)})-F_{*} &= F(\theta^{(0)})-F(\theta^{\star}) \\
&= \nabla_{\theta}F(\tau\theta^{(0)}+(1-\tau)\theta^{\star})^{\mathrm{T}}(\theta^{(0)}-\theta^{\star}) \\
&= \left(\nabla_{\theta}F(\tau\theta^{(0)}+(1-\tau)\theta^{\star})-\nabla_{\theta}F(\theta^{\star})\right)^{\!\mathrm{T}}(\theta^{(0)}-\theta^{\star}) \\
&\leq \left\|\nabla_{\theta}F(\tau\theta^{(0)}+(1-\tau)\theta^{\star})-\nabla_{\theta}F(\theta^{\star})\right\|_{2}\left\|\theta^{(0)}-\theta^{\star}\right\|_{2} \\
&\leq L_{1}\left\|\theta^{(0)}-\theta^{\star}\right\|_{2}^{2} = O(p)\,,
\end{align*}
where θp\theta^{\star}\in{\mathbb{R}}^{p} is the optimal parameter argument defined in Section II-B, the second equality follows from the mean value theorem for some τ(0,1)\tau\in(0,1) (see [88, Theorem 12.14]), the third equality follows from the fact that θF(θ)=0\nabla_{\theta}F(\theta^{\star})=0 at the minimum (which holds because FF is strongly convex due to the strong convexity assumption in Section II-B, cf. [23, Section 4.1]), the next inequality follows from the Cauchy-Schwarz inequality, the last inequality holds because θF\nabla_{\theta}F is L1L_{1}-Lipschitz continuous (which is inherited from the smoothness in parameter assumption in Section II-B, cf. [23, Section 4.1]), and the scaling with pp in the last line follows from the reasoning in Footnote 10.) Then, the federated oracle complexities of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} and 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} scale as

Γ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)\displaystyle\Gamma({\mathsf{FedLRGD}}) =O(ϕm(pϵ)ξ),\displaystyle=O\!\left(\phi m\left(\frac{p}{\epsilon}\right)^{\!\xi}\right),
Γ(𝖥𝖾𝖽𝖠𝗏𝖾)\displaystyle\Gamma({\mathsf{FedAve}}) Γ+(𝖥𝖾𝖽𝖠𝗏𝖾)Θ(ϕm(pϵ)3/4),\displaystyle\leq\Gamma_{+}({\mathsf{FedAve}})\triangleq\Theta\!\left(\phi m\left(\frac{p}{\epsilon}\right)^{\!3/4}\right),

and satisfy

limmΓ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)Γ+(𝖥𝖾𝖽𝖠𝗏𝖾)=0.\lim_{m\rightarrow\infty}{\frac{\Gamma({\mathsf{FedLRGD}})}{\Gamma_{+}({\mathsf{FedAve}})}}=0\,.

This result is proved in Appendix E. We briefly explain the relevance of the regime presented in Proposition 1 where 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} beats 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}. Firstly, we state conditions on the scaling of dd and ϵ\epsilon in terms of nn (rather than mm), because dimension and statistical error are typically compared to the number of available training samples in high-dimensional statistics and machine learning problems. Secondly, as noted in [23], we assume that ϵ=Θ(nβ)\epsilon=\Theta(n^{-\beta}) for β(0,12]\beta\in\big{(}0,\frac{1}{2}\big{]}, because the empirical risk in (1) is an approximation of some true risk, and the statistical error between these quantities is known to be O(n1/2)O(n^{-1/2}) (with high probability). As we only minimize empirical risk in machine learning problems as a proxy for true risk minimization, there is no gain in reducing the approximation accuracy ϵ\epsilon of federated optimization algorithms below this n1/2n^{-1/2} threshold. Thirdly, our result subsumes one of the most widely studied regimes for training neural networks in recent times—the overparametrized regime where pnp\gg n (see [89] and the references therein). Fourthly, as discussed in [23], when λ\lambda is close to 11, e.g., λ=0.99\lambda=0.99, the d=O(log(n)0.99)d=O(\log(n)^{0.99}) assumption is not too restrictive, because several machine learning problems have meaningful feature vectors with dimension that is logarithmic in the number of training samples. Indeed, results like the Johnson-Lindenstrauss lemma, cf. [90], illustrate how nn high-dimensional feature vectors can be mapped to O(log(n))O(\log(n))-dimensional feature vectors while approximately preserving the distances between pairs of vectors, and hence, conserving the relative geometry of the vectors. Fifthly, we informally summarize the main theoretical message of Proposition 1 for clarity: If the data dimension d=O(log(n)0.99)d=O(\log(n)^{0.99}) is small and the gradient of the loss function possesses sufficient smoothness in the data η=Θ(d)\eta=\Theta(d), then our 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm provably beats the benchmark 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithm in federated oracle complexity. (In other regimes, the 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} algorithm could perform better than our method.) This theoretically portrays the utility of exploiting smoothness of loss functions in the data for federated optimization. Finally, we note that although we show the theoretical advantage of our 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm under several assumptions, running Algorithm 1 itself does not require most of these assumptions, e.g., strong convexity, small data dimension, etc., to hold. Therefore, our main conceptual contribution of exploiting smoothness in data, or approximate low rank structure, of loss functions for federated optimization could be of utility in settings beyond the assumptions considered in this work.
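For intuition on the limit in Proposition 1, note that the common factor ϕm cancels, so the ratio of the two scalings is (p/ϵ)^{ξ−3/4}, which vanishes as p/ϵ grows; a few hypothetical values (with the placeholder choice ξ = 1/2) illustrate this.

xi = 0.5                                 # any xi < 3/4 works; 1/2 is a placeholder
for pe in (1e2, 1e4, 1e6, 1e8):          # p / epsilon growing with m
    print(pe, pe ** xi / pe ** 0.75)     # Gamma(FedLRGD) / Gamma+(FedAve) -> 0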

IV Analysis of Latent Variable Model Approximation

We prove Lemma 1 and Theorem 1 in this section.

IV-A Proof of Lemma 1

Proof.

We will follow the proof idea of [84, Lemma 3], but the details of our proof are different since we use a more general Hölder class definition than [84]. Fix any vector ϑp\vartheta\in{\mathbb{R}}^{p}. For any set I𝒮d,qI\in{\mathcal{S}}_{d,q}, let BIB_{I} denote the closed 1\ell^{1}-ball corresponding to II in our minimal 1\ell^{1}-covering of [0,1]d[0,1]^{d}, and zI𝒩z_{I}\in{\mathcal{N}} denote the center of BIB_{I}, i.e., if I=IjI=I_{j} for j[|𝒩|]j\in[|{\mathcal{N}}|] according to the enumeration defined in Section III-A, then BI=BjB_{I}=B_{j} and zI=z(j)z_{I}=z^{(j)}. Moreover, for any I𝒮d,qI\in{\mathcal{S}}_{d,q}, construct the Taylor polynomial with degree ll of g(;ϑ)g(\cdot;\vartheta) around zIz_{I}:

y[0,1]d,PI(y;ϑ)s+d:|s|lxsg(zI;ϑ)s!(yzI)s.\forall y\in[0,1]^{d},\enspace P_{I}(y;\vartheta)\triangleq\sum_{s\in{\mathbb{Z}}_{+}^{d}:|s|\leq l}{\frac{\nabla_{x}^{s}g(z_{I};\vartheta)}{s!}(y-z_{I})^{s}}\,. (49)

Then, we have

supy[0,1]d|g(y;ϑ)Pq,l(y;ϑ)|\displaystyle\sup_{y\in[0,1]^{d}}{\left|g(y;\vartheta)-P_{q,l}(y;\vartheta)\right|}
=maxI𝒮d,qsupyI|g(y;ϑ)PI(y;ϑ)|\displaystyle=\max_{I\in{\mathcal{S}}_{d,q}}\sup_{y\in I}{\left|g(y;\vartheta)-P_{I}(y;\vartheta)\right|}
maxI𝒮d,qsupyBI[0,1]d|g(y;ϑ)PI(y;ϑ)|\displaystyle\leq\max_{I\in{\mathcal{S}}_{d,q}}\sup_{y\in B_{I}\cap[0,1]^{d}}{\left|g(y;\vartheta)-P_{I}(y;\vartheta)\right|}
maxI𝒮d,qsupyBI[0,1]dL2yzI1ηls+d:|s|=l|(yzI)s|s!\displaystyle\leq\max_{I\in{\mathcal{S}}_{d,q}}\sup_{y\in B_{I}\cap[0,1]^{d}}{L_{2}\left\|y-z_{I}\right\|_{1}^{\eta-l}\sum_{s\in{\mathbb{Z}}_{+}^{d}:|s|=l}{\frac{|(y-z_{I})^{s}|}{s!}}}
=L2l!maxI𝒮d,qsupyBI[0,1]dyzI1η\displaystyle=\frac{L_{2}}{l!}\max_{I\in{\mathcal{S}}_{d,q}}\sup_{y\in B_{I}\cap[0,1]^{d}}{\left\|y-z_{I}\right\|_{1}^{\eta}}
L2l!qη,\displaystyle\leq\frac{L_{2}}{l!q^{\eta}}\,,

where the piecewise polynomial function Pq,l(;ϑ)P_{q,l}(\cdot;\vartheta) is defined using the Taylor polynomials PI(;ϑ)P_{I}(\cdot;\vartheta), the second inequality follows from the fact that IBI[0,1]dI\subseteq B_{I}\cap[0,1]^{d} by construction, the fourth equality follows from the multinomial theorem, the last inequality holds because each BIB_{I} is an 1\ell^{1}-ball with center zIz_{I} and radius q1q^{-1}, and the third inequality holds because

|g(y;ϑ)PI(y;ϑ)|\displaystyle\left|g(y;\vartheta)-P_{I}(y;\vartheta)\right|
=|s+d:|s|=lxsg(zI+τ(yzI);ϑ)xsg(zI;ϑ)s!(yzI)s|\displaystyle=\left|\sum_{s\in{\mathbb{Z}}_{+}^{d}:|s|=l}{\!\!\!\!\frac{\nabla_{x}^{s}g(z_{I}+\tau(y-z_{I});\vartheta)-\nabla_{x}^{s}g(z_{I};\vartheta)}{s!}(y-z_{I})^{s}}\right|
s+d:|s|=l|xsg(zI+τ(yzI);ϑ)xsg(zI;ϑ)|s!|(yzI)s|\displaystyle\leq\!\!\sum_{s\in{\mathbb{Z}}_{+}^{d}:|s|=l}{\!\!\!\!\frac{|\nabla_{x}^{s}g(z_{I}+\tau(y-z_{I});\vartheta)-\nabla_{x}^{s}g(z_{I};\vartheta)|}{s!}|(y-z_{I})^{s}|}
L2yzI1ηls+d:|s|=l|(yzI)s|s!,\displaystyle\leq L_{2}\left\|y-z_{I}\right\|_{1}^{\eta-l}\sum_{s\in{\mathbb{Z}}_{+}^{d}:|s|=l}{\frac{|(y-z_{I})^{s}|}{s!}}\,,

where we use Taylor’s theorem with Lagrange remainder term which holds for some τ(0,1)\tau\in(0,1) [88, Theorem 12.14], the triangle inequality, and the (η,L2)(\eta,L_{2})-Hölder class assumption. This completes the proof. ∎

IV-B Proof of Theorem 1

Proof.

As before, we follow the proof idea of [84, Proposition 1], but the details of our proof are again different since we use a more general Hölder class definition than [84]. For any q>0q>0 (to be selected later), construct the family of piecewise polynomial functions Pq,l:[0,1]d×pP_{q,l}:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} as in Lemma 1 using the families of Taylor polynomials PI:[0,1]d×pP_{I}:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} defined in (49) for I𝒮d,qI\in{\mathcal{S}}_{d,q}. Moreover, define the latent variable model Nn×kN\in{\mathbb{R}}^{n\times k} such that:

i[n],j[k],Ni,j=Pq,l(y(i);ϑ(j)),\forall i\in[n],\,\forall j\in[k],\enspace N_{i,j}=P_{q,l}(y^{(i)};\vartheta^{(j)})\,,

where y(1),,y(n)[0,1]dy^{(1)},\dots,y^{(n)}\in[0,1]^{d} and ϑ(1),,ϑ(k)p\vartheta^{(1)},\dots,\vartheta^{(k)}\in{\mathbb{R}}^{p} are the latent parameters that define MM in (29). It is straightforward to verify that NN is a “good” induced (1,)(1,\infty)-norm approximation of MM. Indeed, applying Lemma 1, we get

maxi[n],j[k]|Mi,jNi,j|L2l!qη.\max_{i\in[n],\,j\in[k]}{\left|M_{i,j}-N_{i,j}\right|}\leq\frac{L_{2}}{l!q^{\eta}}\,. (50)

We next bound the rank of NN. Let U(x)=[xs:s+d,|s|l]TDU(x)=[x^{s}:s\in{\mathbb{Z}}_{+}^{d},\,|s|\leq l]^{\mathrm{T}}\in{\mathbb{R}}^{D} be the vector of monomials with degree at most ll for x[0,1]dx\in[0,1]^{d}, where D=(l+dd)D=\binom{l+d}{d} is the number of such monomials. For each I𝒮d,qI\in{\mathcal{S}}_{d,q}, suppose PI(x;θ)=U(x)TVI(θ)P_{I}(x;\theta)=U(x)^{\mathrm{T}}V_{I}(\theta) for all x[0,1]dx\in[0,1]^{d} and θp\theta\in{\mathbb{R}}^{p}, where VI(θ)DV_{I}(\theta)\in{\mathbb{R}}^{D} is the vector of coefficients of the Taylor polynomial PI(;θ)P_{I}(\cdot;\theta). Then, we have

N=I𝒮d,q\displaystyle N=\sum_{I\in{\mathcal{S}}_{d,q}} [𝟙{y(1)I}U(y(1)) 1{y(n)I}U(y(n))]Tn×D with rank D\displaystyle\underbrace{\left[{\mathds{1}}\{y^{(1)}\in I\}U(y^{(1)})\,\cdots\,{\mathds{1}}\{y^{(n)}\in I\}U(y^{(n)})\right]^{\mathrm{T}}}_{\in\,{\mathbb{R}}^{n\times D}\text{ with rank }\leq D}
[VI(ϑ(1))VI(ϑ(k))]D×k with rank D,\displaystyle\cdot\underbrace{\left[V_{I}(\vartheta^{(1)})\,\cdots\,V_{I}(\vartheta^{(k)})\right]}_{\in\,{\mathbb{R}}^{D\times k}\text{ with rank }\leq D}\,,

which implies that

rank(N)|𝒮d,q|D=|𝒩|(l+dd),\operatorname{{rank}}(N)\leq|{\mathcal{S}}_{d,q}|D=|{\mathcal{N}}|\binom{l+d}{d}\,,

where we use the sub-additivity of rank, and 𝒩{\mathcal{N}} is the q1q^{-1}-net introduced in Section III-A. Hence, to upper bound rank(N)\operatorname{{rank}}(N), it suffices to upper bound the 1\ell^{1}-covering number |𝒩||{\mathcal{N}}| of the hypercube [0,1]d[0,1]^{d}. Let BB denote the closed 1\ell^{1}-ball with radius (2q)1(2q)^{-1} that is centered at the origin. Using the volumetric argument in [83, Proposition 4.2.12] and [91, Theorem 14.2], we obtain

|𝒩|\displaystyle|{\mathcal{N}}| vol([0,1]d+B)vol(B)\displaystyle\leq\frac{\operatorname{{vol}}([0,1]^{d}+B)}{\operatorname{{vol}}(B)}
d!qdvol([12q,1+12q]d)\displaystyle\leq d!q^{d}\operatorname{{vol}}\!\left(\left[-\frac{1}{2q},1+\frac{1}{2q}\right]^{d}\right)
=d!(q+1)d,\displaystyle=d!(q+1)^{d}\,,

where vol()\operatorname{{vol}}(\cdot) denotes the volume in d{\mathbb{R}}^{d} and ++ denotes the Minkowski sum of sets in d{\mathbb{R}}^{d}, the second inequality follows from substituting the well-known volume of the 1\ell^{1}-ball as well as the fact that [0,1]d+B[12q,1+12q]d[0,1]^{d}+B\subseteq\big{[}-\frac{1}{2q},1+\frac{1}{2q}\big{]}^{d}, and the final equality follows from substituting the volume of a hypercube. This produces the following upper bound on the rank of NN:

rank(N)d!(l+dd)(q+1)d.\operatorname{{rank}}(N)\leq d!\binom{l+d}{d}(q+1)^{d}\,. (51)

Finally, for any r[min{n,k}]r\in[\min\{n,k\}] satisfying red 2d(l+d)dr\geq e\sqrt{d}\,2^{d}(l+d)^{d}, choose q=r1/d(d!)1/d(l+dd)1/d1q=r^{1/d}(d!)^{-1/d}\binom{l+d}{d}^{\!-1/d}-1 such that r=d!(l+dd)(q+1)dr=d!\binom{l+d}{d}(q+1)^{d}. Note that q>0q>0 because

red 2d(l+d)d\displaystyle r\geq e\sqrt{d}\,2^{d}(l+d)^{d}\quad r>ed(l+d)d\displaystyle\Rightarrow\quad r>e\sqrt{d}\,(l+d)^{d}
r>d!(l+dd)\displaystyle\Rightarrow\quad r>d!\binom{l+d}{d}
q>0,\displaystyle\Leftrightarrow\quad q>0\,,

where the second implication follows from the bound

d!(l+dd)ed(de)d(e(l+d)d)d=ed(l+d)d,d!\binom{l+d}{d}\leq e\sqrt{d}\left(\frac{d}{e}\right)^{\!d}\left(\frac{e(l+d)}{d}\right)^{\!d}=e\sqrt{d}(l+d)^{d}\,, (52)

which follows from a standard upper bound on binomial coefficients and Stirling’s approximation [92, Chapter II, Section 9, Equation (9.15)]. Then, observe that

1nkMMrF2\displaystyle\frac{1}{nk}\left\|M-M_{r}\right\|_{\mathrm{F}}^{2} 1nkMNF2\displaystyle\leq\frac{1}{nk}\left\|M-N\right\|_{\mathrm{F}}^{2}
L22(l!)2q2η\displaystyle\leq\frac{L_{2}^{2}}{(l!)^{2}q^{2\eta}}
L22(l!)2((rd!(l+dd))1/d1)2η\displaystyle\leq\frac{L_{2}^{2}}{(l!)^{2}}\left(\left(\frac{r}{d!\binom{l+d}{d}}\right)^{\!1/d}-1\right)^{\!-2\eta}
L22(l!)2(r1/de1/dd1/(2d)(l+d)1)2η\displaystyle\leq\frac{L_{2}^{2}}{(l!)^{2}}\left(\frac{r^{1/d}}{e^{1/d}d^{1/(2d)}(l+d)}-1\right)^{\!-2\eta}
L22e2η/ddη/d4η(l+d)2η(l!)2(1r)2η/d,\displaystyle\leq\frac{L_{2}^{2}e^{2\eta/d}d^{\eta/d}4^{\eta}(l+d)^{2\eta}}{(l!)^{2}}\left(\frac{1}{r}\right)^{\!2\eta/d},

where the first inequality follows from (51) and the Eckart-Young-Mirsky theorem [87, Section 7.4.2], the second inequality follows from (50), the fourth inequality uses (52), and the fifth inequality follows from the assumption that

red 2d(l+d)d\displaystyle r\geq e\sqrt{d}\,2^{d}(l+d)^{d}\quad\Leftrightarrow
r1/de1/dd1/(2d)(l+d)1r1/d2e1/dd1/(2d)(l+d).\displaystyle\qquad\qquad\frac{r^{1/d}}{e^{1/d}d^{1/(2d)}(l+d)}-1\geq\frac{r^{1/d}}{2e^{1/d}d^{1/(2d)}(l+d)}\,.

This completes the proof. ∎
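As a quick numerical sanity check of the bound (52) used above, note that d!\binom{l+d}{d}=(l+d)!/l!=\prod_{i=1}^{d}(l+i)\leq(l+d)^{d}, which is dominated by e\sqrt{d}(l+d)^{d}; the following sketch verifies this over a small grid of (d, l).

import math

for d in range(1, 9):
    for l in range(0, 9):
        lhs = math.factorial(d) * math.comb(l + d, d)        # equals (l+d)! / l!
        assert lhs <= math.e * math.sqrt(d) * (l + d) ** d   # the bound (52)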

V Analysis of Federated Low Rank Gradient Descent

In this section, we establish the federated oracle complexity of the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm as stated in Theorem 2. To do this, we need the ensuing lemma from the literature which bounds how close the ERM objective value F(θ(S))F(\theta^{(S)}), at the output parameter θ=θ(S)\theta^{*}=\theta^{(S)} of Algorithm 1, is to the global minimum in (10) [5, Lemma 2.1] (cf. [7, Theorem 2, Equation (19)] and [23, Lemma 4]).

Lemma 2 (Inexact GD Bound [5, Lemma 2.1]).

Suppose the empirical risk F:pF:{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} is continuously differentiable and μ\mu-strongly convex, and its gradient θF:pp\nabla_{\theta}F:{\mathbb{R}}^{p}\rightarrow{\mathbb{R}}^{p} is L1L_{1}-Lipschitz continuous. Then, for all SS\in{\mathbb{N}}, we have

0F(θ(S))F(11κ)S(F(θ(0))F)\displaystyle 0\leq F(\theta^{(S)})-F_{*}\leq\left(1-\frac{1}{\kappa}\right)^{\!S}\!\left(F(\theta^{(0)})-F_{*}\right)
+12L1γ=1S(11κ)SγF^(θ(γ1))θF(θ(γ1))22,\displaystyle+\frac{1}{2L_{1}}\sum_{\gamma=1}^{S}{\left(1-\frac{1}{\kappa}\right)^{\!S-\gamma}\left\|\widehat{\nabla F}(\theta^{(\gamma-1)})-\nabla_{\theta}F(\theta^{(\gamma-1)})\right\|_{2}^{2}}\,,

where {θ(γ)p:γ{0,,S}}\{\theta^{(\gamma)}\in{\mathbb{R}}^{p}:\gamma\in\{0,\dots,S\}\} are the inexact GD updates in (42) in Algorithm 1 with arbitrary initialization θ(0)\theta^{(0)}, F^(θ(γ))\widehat{\nabla F}(\theta^{(\gamma)}) is our approximation of θF(θ(γ))\nabla_{\theta}F(\theta^{(\gamma)}) for γ{0,,S1}\gamma\in\{0,\dots,S-1\} in Algorithm 1 as shown in (41), and the condition number κ=L1/μ\kappa=L_{1}/\mu is defined in Section II-B.

We next prove Theorem 2 using Lemma 1, Lemma 2, and the various assumptions in Section II-B and Section III-C.

Proof of Theorem 2.

First, to avoid notational clutter in this proof, we re-index the training datasets of the server and clients using [n][n] with n=ms+rn=ms+r (cf. Section I-A). Specifically, we let x(j)=x(0,j)x^{(j)}=x^{(0,j)} for j[r]j\in[r] (i.e., the server’s training data is indexed first), and x(b)=x(c,j)x^{(b)}=x^{(c,j)} where br=(c1)s+jb-r=(c-1)s+j for b[n]\[r]b\in[n]\backslash[r], c[m]c\in[m], and j[s]j\in[s]. Correspondingly, we re-index the weights {wj,k(i,c):i[p],j[s],k[r],c[m]}\{w_{j,k}^{(i,c)}\in{\mathbb{R}}:i\in[p],\,j\in[s],\,k\in[r],\,c\in[m]\}, which are defined by (38) and implicitly used in the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm, so that wb,k(i)=wj,k(i,c)w^{(i)}_{b,k}=w_{j,k}^{(i,c)} where br=(c1)s+jb-r=(c-1)s+j for b[n]\[r]b\in[n]\backslash[r].
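This re-indexing is the usual flattening of a (client, sample) pair into a single index; a short helper makes the convention explicit (the sizes in the checks are placeholders).

def flat_index(c, j, r, s):
    # x^(b) = x^(c,j) with b - r = (c - 1) * s + j; indices 1..r are the server's.
    return r + (c - 1) * s + j

assert flat_index(1, 1, r=5, s=20) == 6       # first client sample follows x^(0,r)
assert flat_index(4, 20, r=5, s=20) == 85     # b = n = ms + r when c = m, j = s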

The majority of this proof focuses on analyzing the iteration complexity of the inexact GD method employed in the last epoch of Algorithm 1. Once we have control over this iteration complexity, we can easily deduce federated oracle complexity. Our main tool for studying iteration complexity will be Lemma 2. But to use this lemma, we require a bound on F^θF22\|\widehat{\nabla F}-\nabla_{\theta}F\|_{2}^{2}. So, we begin our analysis by fixing any (sufficiently large) rr\in{\mathbb{N}}, any index i[p]i\in[p], and any parameter θp\theta\in{\mathbb{R}}^{p}, and noticing that

|Fθi(θ)F^i(θ)|\displaystyle\left|\frac{\partial F}{\partial\theta_{i}}(\theta)-\widehat{\nabla F}_{i}(\theta)\right|
=1n|k=1rgi(x(0,k);θ)+c=1mj=1sgi(x(c,j);θ)\displaystyle=\frac{1}{n}\left|\sum_{k=1}^{r}{g_{i}(x^{(0,k)};\theta)}+\sum_{c=1}^{m}{\sum_{j=1}^{s}{g_{i}(x^{(c,j)};\theta)}}\right.
k=1rgi(x(0,k);θ)(1+c=1mj=1swj,k(i,c))|\displaystyle\quad\quad\enspace\left.-\sum_{k=1}^{r}{g_{i}(x^{(0,k)};\theta)\left(1+\sum_{c=1}^{m}\sum_{j=1}^{s}{w_{j,k}^{(i,c)}}\right)}\right|
=1n|k=1ngi(x(k);θ)k=1rgi(x(k);θ)(1+j=r+1nwj,k(i))|\displaystyle=\frac{1}{n}\left|\sum_{k=1}^{n}{g_{i}(x^{(k)};\theta)}-\sum_{k=1}^{r}{g_{i}(x^{(k)};\theta)\left(1+\sum_{j=r+1}^{n}{w_{j,k}^{(i)}}\right)}\right|
=1n|j=r+1ngi(x(j);θ)k=1rgi(x(k);θ)j=r+1nwj,k(i)|\displaystyle=\frac{1}{n}\left|\sum_{j=r+1}^{n}{g_{i}(x^{(j)};\theta)}-\sum_{k=1}^{r}{g_{i}(x^{(k)};\theta)\sum_{j=r+1}^{n}{w_{j,k}^{(i)}}}\right|
1nj=r+1n|gi(x(j);θ)k=1rgi(x(k);θ)wj,k(i)|\displaystyle\leq\frac{1}{n}\sum_{j=r+1}^{n}\left|g_{i}(x^{(j)};\theta)-\sum_{k=1}^{r}g_{i}(x^{(k)};\theta)w_{j,k}^{(i)}\right|
maxj[n]\[r]|gi(x(j);θ)k=1rgi(x(k);θ)wj,k(i)|,\displaystyle\leq\max_{j\in[n]\backslash[r]}\left|g_{i}(x^{(j)};\theta)-\sum_{k=1}^{r}g_{i}(x^{(k)};\theta)w_{j,k}^{(i)}\right|, (53)

where the first equality follows from (32), (39), and (41), and we use the weights {wj,k(i,c):j[s],k[r],c[m]}\{w_{j,k}^{(i,c)}\in{\mathbb{R}}:j\in[s],\,k\in[r],\,c\in[m]\} defined by (38), the second equality follows from the aforementioned re-indexing, the fourth inequality follows from swapping the order of summations in the second term on the right hand side and applying the triangle inequality, and the fifth inequality holds because nmsn\geq ms. To upper bound (53), we next develop some properties of a uniform piecewise polynomial approximation of gig_{i}.

Let D=(l+dd)D=\binom{l+d}{d} and 𝒮d,q{\mathcal{S}}_{d,q} be the partition of the hypercube [0,1]d[0,1]^{d} defined in (28) for any q>0q>0. Moreover, fix the parameter q>0q>0 such that r=D|𝒮d,q|r=D|{\mathcal{S}}_{d,q}|. Since gi(;θ):[0,1]dg_{i}(\cdot;\theta):[0,1]^{d}\rightarrow{\mathbb{R}} satisfies the smoothness in data assumption (i.e., the (η,L2)(\eta,L_{2})-Hölder class assumption) in Section II-B, Lemma 1 implies that there exists a piecewise polynomial function Pq,l(;θ):[0,1]dP_{q,l}(\cdot;\theta):[0,1]^{d}\rightarrow{\mathbb{R}} with l=η1l=\lceil\eta\rceil-1 such that

supx[0,1]d|gi(x;θ)Pq,l(x;θ)|L2l!qη.\sup_{x\in[0,1]^{d}}{\left|g_{i}(x;\theta)-P_{q,l}(x;\theta)\right|}\leq\frac{L_{2}}{l!q^{\eta}}\,. (54)

Here, as shown in the proof of Theorem 1 in Section IV-B, we have

Pq,l(x;θ)\displaystyle P_{q,l}(x;\theta) =I𝒮d,q𝟙{xI}U(x)TVI(θ)\displaystyle=\sum_{I\in{\mathcal{S}}_{d,q}}{{\mathds{1}}\!\{x\in I\}U(x)^{\mathrm{T}}V_{I}(\theta)}
=I𝒮d,qv+d:|v|l𝟙{xI}xv[VI(θ)]v\displaystyle=\sum_{I\in{\mathcal{S}}_{d,q}}\sum_{v\in{\mathbb{Z}}_{+}^{d}:\,|v|\leq l}{{\mathds{1}}\!\{x\in I\}x^{v}[V_{I}(\theta)]_{v}}

for all x[0,1]dx\in[0,1]^{d}, where U(x)=[xv:v+d,|v|l]TDU(x)=[x^{v}:v\in{\mathbb{Z}}_{+}^{d},\,|v|\leq l]^{\mathrm{T}}\in{\mathbb{R}}^{D} is the vector of monomials with degree at most ll, and VI(θ)DV_{I}(\theta)\in{\mathbb{R}}^{D} are vectors of coefficients (see Section IV-B) which we may index with {v+d:|v|l}\{v\in{\mathbb{Z}}_{+}^{d}:|v|\leq l\}. Equivalently, we can write:

x[0,1]d,Pq,l(x;θ)=w=1rϕw(x)ψw(θ),\forall x\in[0,1]^{d},\enspace P_{q,l}(x;\theta)=\sum_{w=1}^{r}{\phi_{w}(x)\psi_{w}(\theta)}\,, (55)

where we enumerate 𝒮d,q×{v+d:|v|l}{\mathcal{S}}_{d,q}\times\{v\in{\mathbb{Z}}_{+}^{d}:\,|v|\leq l\} using the index set [r][r], i.e., we employ a bijection [r]𝒮d,q×{v+d:|v|l}[r]\mapsto{\mathcal{S}}_{d,q}\times\{v\in{\mathbb{Z}}_{+}^{d}:\,|v|\leq l\}, and for all I𝒮d,qI\in{\mathcal{S}}_{d,q} and v+dv\in{\mathbb{Z}}_{+}^{d} with |v|l|v|\leq l such that [r]w(I,v)[r]\ni w\mapsto(I,v), we let:

x[0,1]d,ϕw(x)\displaystyle\forall x\in[0,1]^{d},\enspace\phi_{w}(x) =𝟙{xI}xv,\displaystyle={\mathds{1}}\!\{x\in I\}x^{v}\,, (56)
ϑd,ψw(ϑ)\displaystyle\forall\vartheta\in{\mathbb{R}}^{d},\enspace\psi_{w}(\vartheta) =[VI(ϑ)]v.\displaystyle=[V_{I}(\vartheta)]_{v}\,.

(Hence, as noted in Section III-C, the family of monomial functions {ϕw:w[r]}\{\phi_{w}:w\in[r]\} is linearly independent.) For every j[n]\[r]j\in[n]\backslash[r], define the coefficients aj,1,,aj,ra_{j,1},\dots,a_{j,r}\in{\mathbb{R}} via

[aj,1aj,2aj,r][ϕ1(x(j))ϕ2(x(j))ϕr(x(j))]ΦT,\left[a_{j,1}\enspace a_{j,2}\,\cdots\,a_{j,r}\right]\triangleq\left[\phi_{1}(x^{(j)})\enspace\phi_{2}(x^{(j)})\,\cdots\,\phi_{r}(x^{(j)})\right]\Phi^{-\mathrm{T}}\,,

where the matrix Φr×r\Phi\in{\mathbb{R}}^{r\times r} is defined in (45) and is assumed to be invertible (cf. Section III-C). Then, observe that for every w[r]w\in[r] and j[n]\[r]j\in[n]\backslash[r],

ϕw(x(j))=k=1raj,kϕw(x(k)),\phi_{w}(x^{(j)})=\sum_{k=1}^{r}{a_{j,k}\phi_{w}(x^{(k)})}\,,

which implies that

Pq,l(x(j);θ)\displaystyle P_{q,l}(x^{(j)};\theta) =w=1rψw(θ)k=1raj,kϕw(x(k))\displaystyle=\sum_{w=1}^{r}{\psi_{w}(\theta)\sum_{k=1}^{r}{a_{j,k}\phi_{w}(x^{(k)})}}
=k=1raj,kw=1rϕw(x(k))ψw(θ)\displaystyle=\sum_{k=1}^{r}{a_{j,k}\sum_{w=1}^{r}{\phi_{w}(x^{(k)})\psi_{w}(\theta)}}
=k=1raj,kPq,l(x(k);θ)\displaystyle=\sum_{k=1}^{r}{a_{j,k}P_{q,l}(x^{(k)};\theta)} (57)

using (55). Therefore, letting 𝒫(i)r×r{\mathcal{P}}^{(i)}\in{\mathbb{R}}^{r\times r} be the matrix given by (46) in Section III-C for the arbitrary set of parameter vectors ϑ(1),,ϑ(r)p\vartheta^{(1)},\dots,\vartheta^{(r)}\in{\mathbb{R}}^{p} chosen by the server in the first epoch of Algorithm 1, we obtain that for all j[n]\[r]j\in[n]\backslash[r],

[aj,1aj,2aj,r]\displaystyle\left[a_{j,1}\enspace a_{j,2}\,\cdots\,a_{j,r}\right] (58)
=[Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(2))Pq,l(x(j);ϑ(r))]\displaystyle=\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\enspace P_{q,l}(x^{(j)};\vartheta^{(2)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]
(𝒫(i))1,\displaystyle\quad\enspace\cdot({\mathcal{P}}^{(i)})^{-1}\,,

where we use (57) and the assumption in Section III-C that 𝒫(i){\mathcal{P}}^{(i)} is invertible. From the relation in (58), we now show that the aj,ka_{j,k}’s are “close” to the wj,k(i)w^{(i)}_{j,k}’s.

To this end, notice that for any j[n]\[r]j\in[n]\backslash[r],

[aj,1aj,2aj,r][wj,1(i)wj,2(i)wj,r(i)]2\displaystyle\left\|\left[a_{j,1}\enspace a_{j,2}\,\cdots\,a_{j,r}\right]-\left[w^{(i)}_{j,1}\enspace w^{(i)}_{j,2}\,\cdots\,w^{(i)}_{j,r}\right]\right\|_{2}
=\eqmakebox[A][c](i)[Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(r))](𝒫(i))1\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(i)}}{=}\left\|\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]({\mathcal{P}}^{(i)})^{-1}\right.
[gi(x(j);ϑ(1))gi(x(j);ϑ(r))](G(i))12\displaystyle\quad\quad\quad\left.-\left[g_{i}(x^{(j)};\vartheta^{(1)})\,\cdots\,g_{i}(x^{(j)};\vartheta^{(r)})\right](G^{(i)})^{-1}\right\|_{2}
\eqmakebox[A][c](ii)[Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(r))]\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(ii)}}{\leq}\left\|\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]\right.
((𝒫(i))1(G(i))1)2\displaystyle\quad\quad\quad\left.\cdot\left(({\mathcal{P}}^{(i)})^{-1}-(G^{(i)})^{-1}\right)\right\|_{2}
+([Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(r))]\displaystyle\quad\enspace\,\,+\left\|\left(\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]\right.\right.
[gi(x(j);ϑ(1))gi(x(j);ϑ(r))])(G(i))12\displaystyle\quad\quad\quad\quad\left.\left.-\left[g_{i}(x^{(j)};\vartheta^{(1)})\,\cdots\,g_{i}(x^{(j)};\vartheta^{(r)})\right]\right)(G^{(i)})^{-1}\right\|_{2}
\eqmakebox[A][c](iii)[Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(r))]2\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(iii)}}{\leq}\left\|\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]\right\|_{2}
(𝒫(i))1(G(i))1op\displaystyle\quad\enspace\,\,\cdot\left\|({\mathcal{P}}^{(i)})^{-1}-(G^{(i)})^{-1}\right\|_{\mathrm{op}}
+[Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(r))]\displaystyle\quad\enspace\,\,+\left\|\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]\right.
[gi(x(j);ϑ(1))gi(x(j);ϑ(r))]2(G(i))1op\displaystyle\quad\quad\quad\enspace\left.-\left[g_{i}(x^{(j)};\vartheta^{(1)})\,\cdots\,g_{i}(x^{(j)};\vartheta^{(r)})\right]\right\|_{2}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}
\eqmakebox[A][c](iv)[Pq,l(x(j);ϑ(1))Pq,l(x(j);ϑ(r))]\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(iv)}}{\leq}\left\|\left[P_{q,l}(x^{(j)};\vartheta^{(1)})\,\cdots\,P_{q,l}(x^{(j)};\vartheta^{(r)})\right]\right.
[gi(x(j);ϑ(1))gi(x(j);ϑ(r))]2\displaystyle\quad\quad\quad\left.-\left[g_{i}(x^{(j)};\vartheta^{(1)})\,\cdots\,g_{i}(x^{(j)};\vartheta^{(r)})\right]\right\|_{2}
((G(i))1op+(𝒫(i))1(G(i))1op)\displaystyle\quad\enspace\,\,\cdot\left(\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}+\left\|({\mathcal{P}}^{(i)})^{-1}-(G^{(i)})^{-1}\right\|_{\mathrm{op}}\right)
+[gi(x(j);ϑ(1))gi(x(j);ϑ(r))]2\displaystyle\quad\enspace\,\,+\left\|\left[g_{i}(x^{(j)};\vartheta^{(1)})\,\cdots\,g_{i}(x^{(j)};\vartheta^{(r)})\right]\right\|_{2}
(𝒫(i))1(G(i))1op\displaystyle\quad\quad\quad\cdot\left\|({\mathcal{P}}^{(i)})^{-1}-(G^{(i)})^{-1}\right\|_{\mathrm{op}}
\eqmakebox[A][c](v)L2rl!qη(G(i))1op\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(v)}}{\leq}\frac{L_{2}\sqrt{r}}{l!q^{\eta}}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}
+r(B+L2l!qη)(𝒫(i))1(G(i))1op\displaystyle\quad\enspace\,\,+\sqrt{r}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)\!\left\|({\mathcal{P}}^{(i)})^{-1}-(G^{(i)})^{-1}\right\|_{\mathrm{op}}
\eqmakebox[A][c](vi)L2rl!qη(G(i))1op\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(vi)}}{\leq}\frac{L_{2}\sqrt{r}}{l!q^{\eta}}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}
+r(B+L2l!qη)((G(i))1op2G(i)𝒫(i)F1(G(i))1opG(i)𝒫(i)F)\displaystyle\quad\enspace\,\,+\sqrt{r}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)\!\!\left(\!\frac{\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}^{2}\left\|G^{(i)}-{\mathcal{P}}^{(i)}\right\|_{\mathrm{F}}}{1-\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}\left\|G^{(i)}-{\mathcal{P}}^{(i)}\right\|_{\mathrm{F}}}\!\right)
\eqmakebox[A][c](vii)L2rl!qη(G(i))1op\displaystyle\overset{\eqmakebox[A][c]{\footnotesize(vii)}}{\leq}\frac{L_{2}\sqrt{r}}{l!q^{\eta}}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}
+L2r3/2l!qη(B+L2l!qη)((G(i))1op21((G(i))1oprL2l!qη)),\displaystyle\quad\enspace\,\,+\frac{L_{2}r^{3/2}}{l!q^{\eta}}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)\!\left(\frac{\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}^{2}}{1-\left(\frac{\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}rL_{2}}{l!q^{\eta}}\right)}\right), (59)

where (i) uses (58) and (38), (ii) and (iv) follow from the triangle inequality, (iii) follows from the definition of operator norm, (v) follows from (54) and the boundedness assumption in (9), (vii) follows from (54), and (vi) holds because for any two non-singular matrices X,YX,Y with common dimensions, we have

X1Y1op\displaystyle\left\|X^{-1}-Y^{-1}\right\|_{\mathrm{op}}
=X1(YX)Y1op\displaystyle=\|X^{-1}(Y-X)Y^{-1}\|_{\mathrm{op}}
X1opY1opXYop\displaystyle\leq\left\|X^{-1}\right\|_{\mathrm{op}}\left\|Y^{-1}\right\|_{\mathrm{op}}\left\|X-Y\right\|_{\mathrm{op}}
X1op(Y1X1op+X1op)XYop\displaystyle\leq\left\|X^{-1}\right\|_{\mathrm{op}}\left(\left\|Y^{-1}-X^{-1}\right\|_{\mathrm{op}}+\left\|X^{-1}\right\|_{\mathrm{op}}\right)\left\|X-Y\right\|_{\mathrm{op}}
=X1op2XYop\displaystyle=\left\|X^{-1}\right\|_{\mathrm{op}}^{2}\left\|X-Y\right\|_{\mathrm{op}}
+X1opY1X1opXYop\displaystyle\quad\,+\left\|X^{-1}\right\|_{\mathrm{op}}\left\|Y^{-1}-X^{-1}\right\|_{\mathrm{op}}\left\|X-Y\right\|_{\mathrm{op}}

using the sub-multiplicativity of operator norms, which implies that

X1Y1op\displaystyle\left\|X^{-1}-Y^{-1}\right\|_{\mathrm{op}} X1op2XYop1X1opXYop\displaystyle\leq\frac{\left\|X^{-1}\right\|_{\mathrm{op}}^{2}\left\|X-Y\right\|_{\mathrm{op}}}{1-\left\|X^{-1}\right\|_{\mathrm{op}}\left\|X-Y\right\|_{\mathrm{op}}}
X1op2XYF1X1opXYF\displaystyle\leq\frac{\left\|X^{-1}\right\|_{\mathrm{op}}^{2}\left\|X-Y\right\|_{\mathrm{F}}}{1-\left\|X^{-1}\right\|_{\mathrm{op}}\left\|X-Y\right\|_{\mathrm{F}}}

as long as X1opXYF<1\|X^{-1}\|_{\mathrm{op}}\|X-Y\|_{\mathrm{F}}<1 (also see [87, Section 5.8]). Note that (59) only holds when (G(i))1oprL2<l!qη\|(G^{(i)})^{-1}\|_{\mathrm{op}}rL_{2}<l!q^{\eta}.
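This inverse-perturbation bound is classical; the following NumPy sketch (with random, well-conditioned test matrices that are purely illustrative) spot-checks it numerically:

import numpy as np

# Numerical spot-check (on illustrative random instances) of the bound
#   ||X^{-1} - Y^{-1}||_op <= ||X^{-1}||_op^2 ||X - Y||_F
#                              / (1 - ||X^{-1}||_op ||X - Y||_F),
# which is valid whenever ||X^{-1}||_op * ||X - Y||_F < 1.

rng = np.random.default_rng(0)
op = lambda A: np.linalg.norm(A, 2)      # operator (spectral) norm
fro = lambda A: np.linalg.norm(A, "fro")

for _ in range(1000):
    X = 10 * np.eye(5) + rng.standard_normal((5, 5))   # well conditioned
    Y = X + 1e-2 * rng.standard_normal((5, 5))         # small perturbation
    s = op(np.linalg.inv(X)) * fro(X - Y)
    assert s < 1                          # regime where the bound applies
    lhs = op(np.linalg.inv(X) - np.linalg.inv(Y))
    rhs = op(np.linalg.inv(X)) ** 2 * fro(X - Y) / (1 - s)
    assert lhs <= rhs + 1e-12
print("inverse-perturbation bound verified on 1000 random instances")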

Next, we upper bound (53): observe that

|Fθi(θ)F^i(θ)|\displaystyle\left|\frac{\partial F}{\partial\theta_{i}}(\theta)-\widehat{\nabla F}_{i}(\theta)\right|
\eqmakebox[B][c]maxj[n]\[r]|gi(x(j);θ)k=1rgi(x(k);θ)wj,k(i)|\displaystyle\overset{\eqmakebox[B][c]{\footnotesize}}{\leq}\max_{j\in[n]\backslash[r]}\left|g_{i}(x^{(j)};\theta)-\sum_{k=1}^{r}g_{i}(x^{(k)};\theta)w_{j,k}^{(i)}\right|
\eqmakebox[B][c](i)L2l!qη(1+maxj[n]\[r]k=1r|wj,k(i)|)\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(i)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+\max_{j\in[n]\backslash[r]}\sum_{k=1}^{r}{\left|w_{j,k}^{(i)}\right|}\right)
+maxj[n]\[r]|Pq,l(x(j);θ)k=1rPq,l(x(k);θ)wj,k(i)|\displaystyle\quad\enspace\enspace+\max_{j\in[n]\backslash[r]}\left|P_{q,l}(x^{(j)};\theta)-\sum_{k=1}^{r}P_{q,l}(x^{(k)};\theta)w_{j,k}^{(i)}\right|
\eqmakebox[B][c](ii)L2l!qη(1+maxj[n]\[r]k=1r|wj,k(i)|)\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(ii)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+\max_{j\in[n]\backslash[r]}\sum_{k=1}^{r}{\left|w_{j,k}^{(i)}\right|}\right)
+maxj[n]\[r]|k=1r(aj,kwj,k(i))Pq,l(x(k);θ)|\displaystyle\quad\enspace\enspace+\max_{j\in[n]\backslash[r]}\left|\sum_{k=1}^{r}{\left(a_{j,k}-w_{j,k}^{(i)}\right)P_{q,l}(x^{(k)};\theta)}\right|
\eqmakebox[B][c](iii)L2l!qη(1+maxj[n]\[r]k=1r|wj,k(i)|)\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(iii)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+\max_{j\in[n]\backslash[r]}\sum_{k=1}^{r}{\left|w_{j,k}^{(i)}\right|}\right)
+maxj[n]\[r](k=1r(aj,kwj,k(i))2)12(k=1rPq,l(x(k);θ)2)12\displaystyle\quad\enspace\enspace+\max_{j\in[n]\backslash[r]}\left(\sum_{k=1}^{r}{\left(a_{j,k}-w_{j,k}^{(i)}\right)^{\!2}}\right)^{\!\frac{1}{2}}\!\!\left(\sum_{k=1}^{r}{P_{q,l}(x^{(k)};\theta)^{2}}\right)^{\!\frac{1}{2}}
\eqmakebox[B][c](iv)L2l!qη(1+maxj[n]\[r]k=1r|wj,k(i)|)+(k=1rPq,l(x(k);θ)2)12\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(iv)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+\max_{j\in[n]\backslash[r]}\sum_{k=1}^{r}{\left|w_{j,k}^{(i)}\right|}\right)+\left(\sum_{k=1}^{r}{P_{q,l}(x^{(k)};\theta)^{2}}\right)^{\!\frac{1}{2}}
(L2rl!qη(G(i))1op+L2r3/2l!qη(B+L2l!qη)\displaystyle\quad\enspace\enspace\cdot\left(\rule{0.0pt}{19.91684pt}\right.\!\frac{L_{2}\sqrt{r}}{l!q^{\eta}}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}+\frac{L_{2}r^{3/2}}{l!q^{\eta}}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)
((G(i))1op21((G(i))1oprL2)/(l!qη)))\displaystyle\qquad\qquad\qquad\qquad\cdot\left(\frac{\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}^{2}}{1-\left(\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}rL_{2}\right)\!/\!\left(l!q^{\eta}\right)}\right)\!\left.\rule{0.0pt}{19.91684pt}\right)
\eqmakebox[B][c](v)L2l!qη(1+maxj[n]\[r]k=1r|wj,k(i)|)+(B+L2l!qη)r\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(v)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+\max_{j\in[n]\backslash[r]}\sum_{k=1}^{r}{\left|w_{j,k}^{(i)}\right|}\right)+\left(B+\frac{L_{2}}{l!q^{\eta}}\right)\!\sqrt{r}
(L2rl!qη(G(i))1op+L2r3/2l!qη(B+L2l!qη)\displaystyle\quad\enspace\enspace\cdot\left(\rule{0.0pt}{19.91684pt}\right.\!\frac{L_{2}\sqrt{r}}{l!q^{\eta}}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}+\frac{L_{2}r^{3/2}}{l!q^{\eta}}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)
((G(i))1op21((G(i))1oprL2)/(l!qη)))\displaystyle\qquad\qquad\qquad\qquad\cdot\left(\frac{\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}^{2}}{1-\left(\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}rL_{2}\right)\!/\!\left(l!q^{\eta}\right)}\right)\!\left.\rule{0.0pt}{19.91684pt}\right)
\eqmakebox[B][c](vi)L2l!qη(1+(G(i))1opBr)+(B+L2l!qη)\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(vi)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}Br\right)+\left(B+\frac{L_{2}}{l!q^{\eta}}\right)
(L2rl!qη(G(i))1op+L2r2l!qη(B+L2l!qη)\displaystyle\quad\enspace\enspace\cdot\left(\rule{0.0pt}{19.91684pt}\right.\!\frac{L_{2}r}{l!q^{\eta}}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}+\frac{L_{2}r^{2}}{l!q^{\eta}}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)
((G(i))1op21((G(i))1oprL2)/(l!qη)))\displaystyle\qquad\qquad\qquad\qquad\cdot\left(\frac{\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}^{2}}{1-\left(\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}rL_{2}\right)\!/\!\left(l!q^{\eta}\right)}\right)\!\left.\rule{0.0pt}{19.91684pt}\right)
\eqmakebox[B][c](vii)L2l!qη(1+Brα+1)+(B+L2l!qη)(L2rα+1l!qη\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(vii)}}{\leq}\frac{L_{2}}{l!q^{\eta}}\left(1+Br^{\alpha+1}\right)+\left(B+\frac{L_{2}}{l!q^{\eta}}\right)\!\Bigg{(}\frac{L_{2}r^{\alpha+1}}{l!q^{\eta}}
+L2r2α+2l!qη(B+L2l!qη)(11(L2rα+1)/(l!qη)))\displaystyle\quad\enspace\enspace+\frac{L_{2}r^{2\alpha+2}}{l!q^{\eta}}\left(B+\frac{L_{2}}{l!q^{\eta}}\right)\!\left(\frac{1}{1-\left(L_{2}r^{\alpha+1}\right)\!/\!\left(l!q^{\eta}\right)}\right)\Bigg{)}
\eqmakebox[B][c](viii)L2r2α+2l!qη(2+B+L2l!qη)2\displaystyle\overset{\eqmakebox[B][c]{\footnotesize(viii)}}{\leq}\frac{L_{2}r^{2\alpha+2}}{l!q^{\eta}}\left(2+B+\frac{L_{2}}{l!q^{\eta}}\right)^{\!2}
(11(L2rα+1)/(l!qη)+1),\displaystyle\quad\enspace\enspace\cdot\left(\frac{1}{1-\left(L_{2}r^{\alpha+1}\right)\!/\!\left(l!q^{\eta}\right)}+1\right), (60)

where (i) follows from (54) and the triangle inequality, (ii) follows from (57), (iii) follows from the Cauchy-Schwarz inequality, (iv) utilizes (59), (v) follows from the triangle inequality, (54), and the boundedness assumption in (9), (vi) holds because

maxj[n]\[r]k=1r|wj,k(i)|\displaystyle\max_{j\in[n]\backslash[r]}\sum_{k=1}^{r}{\left|w_{j,k}^{(i)}\right|}
maxj[n]\[r]rk=1r(wj,k(i))2\displaystyle\qquad\leq\max_{j\in[n]\backslash[r]}\sqrt{r\sum_{k=1}^{r}{\left(w_{j,k}^{(i)}\right)^{\!2}}}
maxj[n]\[r](G(i))1oprk=1rgi(x(j);ϑ(k))2\displaystyle\qquad\leq\max_{j\in[n]\backslash[r]}\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}\sqrt{r\sum_{k=1}^{r}{g_{i}(x^{(j)};\vartheta^{(k)})^{2}}}
(G(i))1opBr\displaystyle\qquad\leq\left\|(G^{(i)})^{-1}\right\|_{\mathrm{op}}Br

using the Cauchy-Schwarz inequality (or the equivalence of 1\ell^{1} and 2\ell^{2}-norms), (38), the definition of operator norm, and (9), (vii) follows from the conditioning assumption in (8), and (viii) follows from basic algebraic manipulations. As before, note that (60) only holds when L2rα+1<l!qηL_{2}r^{\alpha+1}<l!q^{\eta}.

We are finally in a position to bound F^θF22\|\widehat{\nabla F}-\nabla_{\theta}F\|_{2}^{2}. Indeed, for any δ(0,12]\delta\in\big{(}0,\frac{1}{2}\big{]}, suppose that

L2r2α+2l!qηδ.\frac{L_{2}r^{2\alpha+2}}{l!q^{\eta}}\leq\delta\,. (61)

(Notice that (61) together with δ(0,12]\delta\in\big{(}0,\frac{1}{2}\big{]} implies the desired condition L2rα+1<l!qηL_{2}r^{\alpha+1}<l!q^{\eta}, since rα+1r2α+2r^{\alpha+1}\leq r^{2\alpha+2}.) Then, we have from (60) that

|Fθi(θ)F^i(θ)|\displaystyle\left|\frac{\partial F}{\partial\theta_{i}}(\theta)-\widehat{\nabla F}_{i}(\theta)\right| δ(2+B+δ)2(11δ+1)\displaystyle\leq\delta(2+B+\delta)^{2}\!\left(\frac{1}{1-\delta}+1\right)
3(B+3)2δ,\displaystyle\leq 3(B+3)^{2}\delta\,,

which implies that for any θp\theta\in{\mathbb{R}}^{p},

F^(θ)θF(θ)229(B+3)4δ2p.\left\|\widehat{\nabla F}(\theta)-\nabla_{\theta}F(\theta)\right\|_{2}^{2}\leq 9(B+3)^{4}\delta^{2}p\,. (62)

Therefore, applying Lemma 2, we obtain

F(θ(S))F\displaystyle F(\theta^{(S)})-F_{*}
(11κ)S(F(θ(0))F)\displaystyle\leq\left(1-\frac{1}{\kappa}\right)^{\!S}\!\left(F(\theta^{(0)})-F_{*}\right)
+12L1γ=1S(11κ)SγF^(θ(γ1))θF(θ(γ1))22\displaystyle\quad\,+\frac{1}{2L_{1}}\sum_{\gamma=1}^{S}{\left(1-\frac{1}{\kappa}\right)^{\!S-\gamma}\left\|\widehat{\nabla F}(\theta^{(\gamma-1)})-\nabla_{\theta}F(\theta^{(\gamma-1)})\right\|_{2}^{2}}
(11κ)S(F(θ(0))F)\displaystyle\leq\left(1-\frac{1}{\kappa}\right)^{\!S}\!\left(F(\theta^{(0)})-F_{*}\right)
+9(B+3)4δ2p2L1γ=1S(11κ)Sγκ=L1/μ\displaystyle\quad\,+\frac{9(B+3)^{4}\delta^{2}p}{2L_{1}}\underbrace{\sum_{\gamma=1}^{S}{\left(1-\frac{1}{\kappa}\right)^{\!S-\gamma}}}_{\leq\,\kappa\,=\,L_{1}/\mu}
(11κ)S(F(θ(0))F)+9(B+3)4δ2p2μ,\displaystyle\leq\left(1-\frac{1}{\kappa}\right)^{\!S}\!\left(F(\theta^{(0)})-F_{*}\right)+\frac{9(B+3)^{4}\delta^{2}p}{2\mu}\,, (63)

where {θ(γ)p:γ{0,,S}}\{\theta^{(\gamma)}\in{\mathbb{R}}^{p}:\gamma\in\{0,\dots,S\}\} are the inexact GD updates in (42) in Algorithm 1, SS\in{\mathbb{N}} is the number of iterations for which inexact GD is run in the last epoch of Algorithm 1, and the third inequality follows from the usual geometric series formula. We remark that to apply Lemma 2 here, we need to verify that the empirical risk F:pF:{\mathbb{R}}^{p}\rightarrow{\mathbb{R}} is continuously differentiable and μ\mu-strongly convex, and its gradient θF\nabla_{\theta}F is L1L_{1}-Lipschitz continuous. These properties of FF are in fact inherited from the corresponding strong convexity and smoothness in parameter assumptions imposed on the loss function in Section II-B; we refer readers to [23, Section 4.1] for the formal arguments.
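For intuition about the recursion in (63), the following sketch runs an inexact GD method of the form (42) on a toy strongly convex quadratic with artificially injected gradient errors; the dimension, the spectrum, and the error level below are all hypothetical:

import numpy as np

# Toy illustration of the inexact GD recursion (42) and the bound (63):
# minimize F(theta) = 0.5 * theta^T A theta (so F_* = 0) with gradient
# queries corrupted by artificial noise of norm e.

rng = np.random.default_rng(1)
p, mu, L1 = 10, 1.0, 20.0
A = np.diag(np.linspace(mu, L1, p))      # eigenvalues in [mu, L1]
kappa = L1 / mu
F = lambda th: 0.5 * th @ A @ th

theta, e, S = np.ones(p), 1e-3, 200
F0 = F(theta)
for _ in range(S):
    noise = rng.standard_normal(p)
    noise *= e / np.linalg.norm(noise)   # gradient error of norm exactly e
    theta -= (1.0 / L1) * (A @ theta + noise)

# (63) with each error norm equal to e gives, after the geometric sum,
# F(theta_S) - F_* <= (1 - 1/kappa)^S * F0 + e^2 * kappa / (2 * L1).
bound = (1 - 1 / kappa) ** S * F0 + e**2 * kappa / (2 * L1)
print(f"F(theta_S) = {F(theta):.3e} <= bound = {bound:.3e}")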

We can easily obtain the iteration complexity of the inexact GD method in Algorithm 1 from (63). Letting

δ=(11κ)S/2,\delta=\left(1-\frac{1}{\kappa}\right)^{\!S/2}, (64)

we see that

F(θ(S))F(11κ)S(F(θ(0))F+9(B+3)4p2μ).F(\theta^{(S)})-F_{*}\leq\left(1-\frac{1}{\kappa}\right)^{\!S}\!\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right).

Hence, for Algorithm 1 to produce an ϵ\epsilon-approximate solution θ=θ(S)\theta^{*}=\theta^{(S)} in the sense of (11), it suffices to ensure that

(11κ)S(F(θ(0))F+9(B+3)4p2μ)ϵ,\left(1-\frac{1}{\kappa}\right)^{\!S}\!\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)\leq\epsilon\,,

or equivalently,

S(log(κκ1))1log(F(θ(0))F+9(B+3)4p2μϵ).S\geq\left(\log\!\left(\frac{\kappa}{\kappa-1}\right)\right)^{\!-1}\log\!\left(\frac{F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}}{\epsilon}\right).

Thus, setting SS as in (48) yields an ϵ\epsilon-approximate solution θ=θ(S)\theta^{*}=\theta^{(S)}. Furthermore, it is straightforward to verify that δ12\delta\leq\frac{1}{2}. Indeed, from (64) and (48), we have

δ2ϵF(θ(0))F+9(B+3)4p2μϵ(9(B+3)4p2μ)14,\delta^{2}\leq\frac{\epsilon}{F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}}\leq\frac{\epsilon}{\left(\frac{9(B+3)^{4}p}{2\mu}\right)}\leq\frac{1}{4}\,, (65)

where the last inequality holds because we have assumed in the theorem statement that ϵ9(B+3)4p8μ\epsilon\leq\frac{9(B+3)^{4}p}{8\mu}.

Having computed the number of iterations (48), we close this proof by calculating the federated oracle complexity of the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm. To execute this final step, we must first ensure that (61) holds by choosing an appropriate value of rr\in{\mathbb{N}}. Recall that qq was chosen so that r=D|𝒮d,q|r=D|{\mathcal{S}}_{d,q}|. Using (51), (52), and the fact that lηl\leq\eta, we have

r=D|𝒮d,q|d!(l+dd)(q+1)ded(η+d)d(q+1)d,r=D|{\mathcal{S}}_{d,q}|\leq d!\binom{l+d}{d}(q+1)^{d}\leq e\sqrt{d}(\eta+d)^{d}(q+1)^{d}\,,

which in turn produces the following lower bound on qq:

qr1/de1/dd1/(2d)(η+d)1.q\geq\frac{r^{1/d}}{e^{1/d}d^{1/(2d)}(\eta+d)}-1\,.

Since we require q>0q>0, or to be more precise, q1q\geq 1 (see Section III-A), we can verify that

q1\displaystyle q\geq 1 r1/de1/dd1/(2d)(η+d)2\displaystyle\quad\Leftarrow\quad\frac{r^{1/d}}{e^{1/d}d^{1/(2d)}(\eta+d)}\geq 2
red 2d(η+d)d\displaystyle\quad\Leftrightarrow\quad r\geq e\sqrt{d}\,2^{d}(\eta+d)^{d}

by assumption (47) in the theorem statement. Furthermore, similar to the argument in the proof of Theorem 1 in Section IV-B, (47) also yields:

red 2d(η+d)d\displaystyle r\geq e\sqrt{d}\,2^{d}(\eta+d)^{d}
qr1/de1/dd1/(2d)(η+d)1r1/d2e1/dd1/(2d)(η+d).\displaystyle\Leftrightarrow\quad q\geq\frac{r^{1/d}}{e^{1/d}d^{1/(2d)}(\eta+d)}-1\geq\frac{r^{1/d}}{2e^{1/d}d^{1/(2d)}(\eta+d)}\,.

Substituting this lower bound on qq into the condition (61), it suffices to ensure that

L22ηeη/ddη/(2d)(η+d)ηl!r(η/d)2α2δ.\frac{L_{2}2^{\eta}e^{\eta/d}d^{\eta/(2d)}(\eta+d)^{\eta}}{l!\,r^{(\eta/d)-2\alpha-2}}\leq\delta\,.

Then, using Stirling’s approximation [92, Chapter II, Section 9, Equation (9.15)] and the fact that η1lη\eta-1\leq l\leq\eta, the left hand side of the above inequality can be upper bounded by

L22ηeη/ddη/(2d)(η+d)ηl!r(η/d)2α2\displaystyle\frac{L_{2}2^{\eta}e^{\eta/d}d^{\eta/(2d)}(\eta+d)^{\eta}}{l!\,r^{(\eta/d)-2\alpha-2}}
L2η2ηeη(d+1)/ddη/(2d)(η+d)η(η1)ηr(η/d)2α2.\displaystyle\qquad\qquad\leq\frac{L_{2}\eta 2^{\eta}e^{\eta(d+1)/d}d^{\eta/(2d)}(\eta+d)^{\eta}}{(\eta-1)^{\eta}r^{(\eta/d)-2\alpha-2}}\,.

This means that ensuring

L2η2ηeη(d+1)/ddη/(2d)(η+d)η(η1)ηr(η/d)2α2δ\displaystyle\frac{L_{2}\eta 2^{\eta}e^{\eta(d+1)/d}d^{\eta/(2d)}(\eta+d)^{\eta}}{(\eta-1)^{\eta}r^{(\eta/d)-2\alpha-2}}\leq\delta
r(L2dηdeη(d+1)dη/2(2η+2d)ηd(η1)ηd)1/(η(2α+2)d)\displaystyle\Leftrightarrow\quad r\geq\left(\frac{L_{2}^{d}\eta^{d}e^{\eta(d+1)}d^{\eta/2}(2\eta+2d)^{\eta d}}{(\eta-1)^{\eta d}}\right)^{\!1/(\eta-(2\alpha+2)d)}
(1δ)d/(η(2α+2)d)\displaystyle\qquad\quad\quad\enspace\cdot\left(\frac{1}{\delta}\right)^{\!d/(\eta-(2\alpha+2)d)}

implies that (61) is satisfied, where we utilize the assumption in the theorem statement that η>(2α+2)d\eta>(2\alpha+2)d. Now, using (64) and (48), notice that

δ\displaystyle\delta =(11κ)S/2\displaystyle=\left(1-\frac{1}{\kappa}\right)^{\!S/2}
(κ1κ)1+log((F(θ(0))F+9(B+3)4p2μ)ϵ1)log(κκ1)\displaystyle\geq\left(\sqrt{\frac{\kappa-1}{\kappa}}\right)^{\!1+\frac{\log\left(\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)\epsilon^{-1}\right)}{\log\left(\frac{\kappa}{\kappa-1}\right)}}
=(κ1)ϵκ(F(θ(0))F+9(B+3)4p2μ).\displaystyle=\sqrt{\frac{(\kappa-1)\epsilon}{\kappa\!\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)}}\,.

Hence, to satisfy (61), it suffices to ensure that

r\displaystyle r (L2dηdeη(d+1)dη/2(2η+2d)ηd(η1)ηd)1/(η(2α+2)d)\displaystyle\geq\left(\frac{L_{2}^{d}\eta^{d}e^{\eta(d+1)}d^{\eta/2}(2\eta+2d)^{\eta d}}{(\eta-1)^{\eta d}}\right)^{\!1/(\eta-(2\alpha+2)d)}
(κ(F(θ(0))F+9(B+3)4p2μ)(κ1)ϵ)d/(2η(4α+4)d),\displaystyle\quad\enspace\cdot\left(\frac{\kappa\!\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)}{(\kappa-1)\epsilon}\right)^{\!d/(2\eta-(4\alpha+4)d)},

which is already assumed in (47). Finally, for rr and SS given by (47) and (48), respectively, observe using Definition 1 that the federated oracle complexity of Algorithm 1 is

Γ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)\displaystyle\Gamma({\mathsf{FedLRGD}}) =r2epoch 1+rs+ϕmepoch 2+(r1)ϕmepochs 3 to r+1+rSepoch r+2\displaystyle=\underbrace{r^{2}}_{\text{epoch }1}\!+\,\,\underbrace{rs+\phi m}_{\text{epoch }2}\,\,+\underbrace{(r-1)\phi m}_{\text{epochs }3\text{ to }r+1}+\!\!\underbrace{rS}_{\text{epoch }r+2}
=r2+rs+rS+ϕmr.\displaystyle=r^{2}+rs+rS+\phi mr\,.
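Before closing, as a quick arithmetic sanity check on this accounting, the following snippet (with hypothetical values of r, s, S, phi, and m) tallies the per-epoch costs and confirms that they collapse to the expression above:

# Tally of the per-epoch costs, following Definition 1: epoch 1 costs r^2,
# epoch 2 costs r*s + phi*m, epochs 3 through r + 1 cost phi*m each, and
# the final inexact GD epoch costs r*S.  Values below are hypothetical.

def fedlrgd_complexity(r, s, S, phi, m):
    return r * r + (r * s + phi * m) + (r - 1) * phi * m + r * S

r, s, S, phi, m = 50, 1000, 200, 100.0, 20
assert fedlrgd_complexity(r, s, S, phi, m) == r**2 + r*s + r*S + phi*m*r
print(fedlrgd_complexity(r, s, S, phi, m))      # here: 162500.0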

This completes the proof. ∎

VI Conclusion

In closing, we summarize the main technical contributions of this work. Firstly, as a counterpart to traditional oracle complexity of iterative optimization methods, we provided a formalism for measuring the running time of federated optimization algorithms by presenting the notion of federated oracle complexity in Section II-D. This notion afforded us the ability to theoretically compare the performance of different algorithms. Secondly, in order to exploit the smoothness of loss functions in training data (a generally unexplored direction in the optimization for machine learning literature [23]), we proposed the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm for federated learning in Section III-B. This algorithm crucially used the approximate low rankness induced by the smoothness of gradients of loss functions in data. Thirdly, we analyzed the federated oracle complexity of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} in Theorem 2 under various assumptions, including strong convexity, smoothness in parameter, smoothness in data, non-singularity, etc. In particular, we proved that Γ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)\Gamma({\mathsf{FedLRGD}}) scales like ϕm(p/ϵ)Θ(d/η)\phi m(p/\epsilon)^{\Theta(d/\eta)} (neglecting typically sub-dominant factors). Moreover, to compare 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} to the standard 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} method in the literature [28], we also evaluated the federated oracle complexity of 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}} in Theorem 3 and saw that Γ(𝖥𝖾𝖽𝖠𝗏𝖾)\Gamma({\mathsf{FedAve}}) scales like ϕm(p/ϵ)3/4\phi m(p/\epsilon)^{3/4}. Then, we demonstrated that when the data dimension is small and the loss function is sufficiently smooth in the data, Γ(𝖥𝖾𝖽𝖫𝖱𝖦𝖣)\Gamma({\mathsf{FedLRGD}}) is indeed smaller than Γ(𝖥𝖾𝖽𝖠𝗏𝖾)\Gamma({\mathsf{FedAve}}). Finally, in a complementary direction, we built upon aspects of our analysis of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} to generalize a well-known result from high-dimensional statistics regarding low rank approximations of smooth latent variable models in Theorem 1.

We also mention a few future research directions to conclude. It is natural to try to weaken some of the assumptions made to derive the technical results in this work, cf. Section II-B. For example, the strong convexity assumption could potentially be relaxed to cover convex or even non-convex loss functions. As another example, the non-singularity and conditioning assumptions could perhaps be replaced by simpler assumptions on the gradient of the loss function, e.g., by developing variants of positive definite kernel ideas and imposing them on the gradient, viewed as a bivariate kernel. The non-singularity and conditioning assumptions could then be deduced from such assumptions. In a different vein, we also leave the thorough empirical evaluation of practical versions of 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} and their comparisons to known federated learning methods, such as 𝖥𝖾𝖽𝖠𝗏𝖾{\mathsf{FedAve}}, in various senses (including federated oracle complexity and wall-clock running time) as future work.

Appendix A Smoothness Assumptions for Logistic Regression with Soft Labels

In this appendix, we delineate an example of a commonly used loss function in logistic regression satisfying the important smoothness and strong convexity assumptions in Section II-B. Let the parameter space be [1,1]d1[-1,1]^{d-1}, so that every parameter vector satisfies θ[1,1]d1\theta\in[-1,1]^{d-1}, and let each labeled training data sample (x,y)[0,1]d1×[0,1](x,y)\in[0,1]^{d-1}\times[0,1] be such that the label yy is a “soft” belief. The compactness of the parameter space is reasonable since training usually takes place in a compact subset of the Euclidean space d1{\mathbb{R}}^{d-1}, and the use of soft labels instead of canonical binary classification labels has also been studied in the literature, cf. [93]. Consider the 2\ell^{2}-regularized cross-entropy loss function:

f𝖾𝗇𝗍(x,y;θ)\displaystyle f_{\mathsf{ent}}(x,y;\theta) ylog(1+eθTx)\displaystyle\triangleq y\log\!\left(1+e^{-\theta^{\mathrm{T}}x}\right) (66)
+(1y)log(1+eθTx)+μ2θ22,\displaystyle\quad\,+(1-y)\log\!\left(1+e^{\theta^{\mathrm{T}}x}\right)+\frac{\mu}{2}\left\|\theta\right\|_{2}^{2}\,,

where μ>0\mu>0 is a hyper-parameter that determines the level of regularization, and we assume for convenience that the bias parameter in logistic regression is 0. Given a training dataset (x(1),y1),,(x(n),yn)[0,1]d1×[0,1](x^{(1)},y_{1}),\dots,(x^{(n)},y_{n})\in[0,1]^{d-1}\times[0,1], the associated empirical risk is

F(θ)\displaystyle F(\theta) =1ni=1nf𝖾𝗇𝗍(x(i),yi;θ)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}{f_{\mathsf{ent}}(x^{(i)},y_{i};\theta)} (67)
=1ni=1n(yilog(1+eθTx(i))\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bigg{(}y_{i}\log\!\left(1+e^{-\theta^{\mathrm{T}}x^{(i)}}\right)
+(1yi)log(1+eθTx(i)))\displaystyle\qquad\qquad\enspace+(1-y_{i})\log\!\left(1+e^{\theta^{\mathrm{T}}x^{(i)}}\right)\!\bigg{)}
+μ2θ22,\displaystyle\quad\,+\frac{\mu}{2}\left\|\theta\right\|_{2}^{2}\,,

which corresponds to the well-known setting of 2\ell^{2}-regularized logistic regression [3, Section 4.4]. We next verify that f𝖾𝗇𝗍f_{\mathsf{ent}} in (66) satisfies the smoothness and strong convexity assumptions in Section II-B (a numerical sketch implementing (66) and spot-checking the Lipschitz constant appears after this list):

  1. 1.

    Observe that θf𝖾𝗇𝗍(x,y;θ)=((1y)1+eθTxy1+eθTx)x+μθ\nabla_{\theta}f_{\mathsf{ent}}(x,y;\theta)=\big{(}\frac{(1-y)}{1+e^{-\theta^{\mathrm{T}}x}}-\frac{y}{1+e^{\theta^{\mathrm{T}}x}}\big{)}x+\mu\theta, which means that for all x,yx,y and all θ(1),θ(2)d1\theta^{(1)},\theta^{(2)}\in{\mathbb{R}}^{d-1},

    θf𝖾𝗇𝗍(x,y;θ(1))θf𝖾𝗇𝗍(x,y;θ(2))2\displaystyle\left\|\nabla_{\theta}f_{\mathsf{ent}}(x,y;\theta^{(1)})-\nabla_{\theta}f_{\mathsf{ent}}(x,y;\theta^{(2)})\right\|_{2}
    x2|(1y)1+eθ(1)Txy1+eθ(1)Tx\displaystyle\quad\leq\left\|x\right\|_{2}\left|\frac{(1-y)}{1+e^{-{\theta^{(1)}}^{\mathrm{T}}x}}-\frac{y}{1+e^{{\theta^{(1)}}^{\mathrm{T}}x}}\right.
    (1y)1+eθ(2)Tx+y1+eθ(2)Tx|\displaystyle\quad\qquad\quad\enspace\left.-\frac{(1-y)}{1+e^{-{\theta^{(2)}}^{\mathrm{T}}x}}+\frac{y}{1+e^{{\theta^{(2)}}^{\mathrm{T}}x}}\right|
    +μθ(1)θ(2)2\displaystyle\quad\quad\,+\mu\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}
    x2(|11+eθ(1)Tx11+eθ(2)Tx|\displaystyle\quad\leq\left\|x\right\|_{2}\bigg{(}\left|\frac{1}{1+e^{-{\theta^{(1)}}^{\mathrm{T}}x}}-\frac{1}{1+e^{-{\theta^{(2)}}^{\mathrm{T}}x}}\right|
    +|11+eθ(2)Tx11+eθ(1)Tx|)\displaystyle\quad\qquad\quad\quad\enspace+\left|\frac{1}{1+e^{{\theta^{(2)}}^{\mathrm{T}}x}}-\frac{1}{1+e^{{\theta^{(1)}}^{\mathrm{T}}x}}\right|\bigg{)}
    +μθ(1)θ(2)2\displaystyle\quad\quad\,+\mu\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}
    x22|(θ(1)θ(2))Tx|+μθ(1)θ(2)2\displaystyle\quad\leq\frac{\left\|x\right\|_{2}}{2}\left|(\theta^{(1)}-\theta^{(2)})^{\mathrm{T}}x\right|+\mu\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}
    x222θ(1)θ(2)2+μθ(1)θ(2)2\displaystyle\quad\leq\frac{\left\|x\right\|_{2}^{2}}{2}\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}+\mu\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}
    (d12+μ)θ(1)θ(2)2,\displaystyle\quad\leq\left(\frac{d-1}{2}+\mu\right)\!\left\|\theta^{(1)}-\theta^{(2)}\right\|_{2}\,, (68)

    where the first inequality follows from the triangle inequality, the second inequality holds because max{|y|,|1y|}1\max\{|y|,|1-y|\}\leq 1, the third inequality holds due to the Lipschitz continuity of the sigmoid function and the fact that maxt|ddt(1+et)1|=14\max_{t\in{\mathbb{R}}}{\big{|}\frac{\operatorname{{d\!}}}{\operatorname{{d\!}}t}(1+e^{-t})^{-1}\big{|}}\allowbreak=\frac{1}{4}, the fourth inequality follows from the Cauchy-Schwarz inequality, and the final inequality follows from the bound x22d1\|x\|_{2}^{2}\leq d-1. Hence, the Lipschitz constant L1=d12+μL_{1}=\frac{d-1}{2}+\mu.

  2. 2.

    It is well-known that the (unregularized) cross-entropy loss function θylog(1+eθTx)+(1y)log(1+eθTx)\theta\mapsto y\log\big{(}1+e^{-\theta^{\mathrm{T}}x}\big{)}+(1-y)\log\big{(}1+e^{\theta^{\mathrm{T}}x}\big{)} is convex for all x,yx,y [3, Section 4.4]. This can be directly checked by computing derivatives. This implies that θf𝖾𝗇𝗍(x,y;θ)\theta\mapsto f_{\mathsf{ent}}(x,y;\theta) is μ\mu-strongly convex for all x,yx,y [4, Section 3.4].

  3. 3.

    Observe that for any fixed ϑ[1,1]d1\vartheta\in[-1,1]^{d-1} and any i[d1]i\in[d-1], the iith partial derivative gi(x,y;ϑ)=((1y)1+eϑTxy1+eϑTx)xi+μϑig_{i}(x,y;\vartheta)=\big{(}\frac{(1-y)}{1+e^{-\vartheta^{\mathrm{T}}x}}-\frac{y}{1+e^{\vartheta^{\mathrm{T}}x}}\big{)}x_{i}+\mu\vartheta_{i}. Let us arbitrarily choose η=2\eta=2 for the purposes of illustration. Then, for any s=(0,,0,1,0,,0)+ds=(0,\dots,0,1,0,\dots,0)\in{\mathbb{Z}}_{+}^{d} with a value of 11 at the jjth position and |s|=η1=1|s|=\eta-1=1, we have for all x[0,1]d1x\in[0,1]^{d-1} and y[0,1]y\in[0,1],

    x,ysgi(x,y;ϑ)\displaystyle\nabla_{x,y}^{s}g_{i}(x,y;\vartheta) (69)
    ={2f𝖾𝗇𝗍θixi(x,y;ϑ),j=i2f𝖾𝗇𝗍θixj(x,y;ϑ),j[d1],ji2f𝖾𝗇𝗍θiy(x,y;ϑ),j=d\displaystyle=\begin{cases}\frac{\partial^{2}f_{\mathsf{ent}}}{\partial\theta_{i}\partial x_{i}}(x,y;\vartheta)\,,&j=i\\ \frac{\partial^{2}f_{\mathsf{ent}}}{\partial\theta_{i}\partial x_{j}}(x,y;\vartheta)\,,&j\in[d-1],\,j\neq i\\ \frac{\partial^{2}f_{\mathsf{ent}}}{\partial\theta_{i}\partial y}(x,y;\vartheta)\,,&j=d\end{cases}
    ={ϑixieϑTx(1+eϑTx)2+11+eϑTxy,j=iϑjxieϑTx(1+eϑTx)2,j[d1],jixi,j=d.\displaystyle=\begin{cases}\frac{\vartheta_{i}x_{i}e^{-\vartheta^{\mathrm{T}}x}}{\left(1+e^{-\vartheta^{\mathrm{T}}x}\right)^{2}}+\frac{1}{1+e^{-\vartheta^{\mathrm{T}}x}}-y\,,&j=i\\ \frac{\vartheta_{j}x_{i}e^{-\vartheta^{\mathrm{T}}x}}{\left(1+e^{-\vartheta^{\mathrm{T}}x}\right)^{2}}\,,&j\in[d-1],\,j\neq i\\ -x_{i}\,,&j=d\end{cases}.

    Hence, for all x(1),x(2)[0,1]d1x^{(1)},x^{(2)}\in[0,1]^{d-1} and y1,y2[0,1]y_{1},y_{2}\in[0,1], we obtain

    |x,ysgi(x(1),y1;ϑ)x,ysgi(x(2),y2;ϑ)|\displaystyle\left|\nabla_{x,y}^{s}g_{i}(x^{(1)},y_{1};\vartheta)-\nabla_{x,y}^{s}g_{i}(x^{(2)},y_{2};\vartheta)\right|\leq
    {34x(1)x(2)1+|y1y2|,j=i12x(1)x(2)1,j[d1],ji|xi(1)xi(2)|,j=d,\displaystyle\begin{cases}\frac{3}{4}\left\|x^{(1)}-x^{(2)}\right\|_{1}+\left|y_{1}-y_{2}\right|,&j=i\\ \frac{1}{2}\left\|x^{(1)}-x^{(2)}\right\|_{1},&j\in[d-1],\,j\neq i\\ \left|x_{i}^{(1)}-x_{i}^{(2)}\right|,&j=d\end{cases}, (70)

    where the third case follows from direct calculation, the first case follows from a similar argument to the second case, and the second case holds because

    |x,ysgi(x(1),y1;ϑ)x,ysgi(x(2),y2;ϑ)|\displaystyle\left|\nabla_{x,y}^{s}g_{i}(x^{(1)},y_{1};\vartheta)-\nabla_{x,y}^{s}g_{i}(x^{(2)},y_{2};\vartheta)\right|
    =|ϑjxi(1)eϑTx(1)(1+eϑTx(1))2ϑjxi(2)eϑTx(2)(1+eϑTx(2))2|\displaystyle\quad=\left|\frac{\vartheta_{j}x_{i}^{(1)}e^{-\vartheta^{\mathrm{T}}x^{(1)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(1)}}\right)^{2}}-\frac{\vartheta_{j}x_{i}^{(2)}e^{-\vartheta^{\mathrm{T}}x^{(2)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(2)}}\right)^{2}}\right|
    |xi(1)eϑTx(1)(1+eϑTx(1))2xi(2)eϑTx(2)(1+eϑTx(2))2|\displaystyle\quad\leq\left|\frac{x_{i}^{(1)}e^{-\vartheta^{\mathrm{T}}x^{(1)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(1)}}\right)^{2}}-\frac{x_{i}^{(2)}e^{-\vartheta^{\mathrm{T}}x^{(2)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(2)}}\right)^{2}}\right|
    |eϑTx(1)(1+eϑTx(1))2eϑTx(2)(1+eϑTx(2))2|\displaystyle\quad\leq\left|\frac{e^{-\vartheta^{\mathrm{T}}x^{(1)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(1)}}\right)^{2}}-\frac{e^{-\vartheta^{\mathrm{T}}x^{(2)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(2)}}\right)^{2}}\right|
    +eϑTx(2)(1+eϑTx(2))2|xi(1)xi(2)|\displaystyle\quad\quad\,+\frac{e^{-\vartheta^{\mathrm{T}}x^{(2)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(2)}}\right)^{2}}\left|x_{i}^{(1)}-x_{i}^{(2)}\right|
    |eϑTx(1)(1+eϑTx(1))2eϑTx(2)(1+eϑTx(2))2|\displaystyle\quad\leq\left|\frac{e^{-\vartheta^{\mathrm{T}}x^{(1)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(1)}}\right)^{2}}-\frac{e^{-\vartheta^{\mathrm{T}}x^{(2)}}}{\left(1+e^{-\vartheta^{\mathrm{T}}x^{(2)}}\right)^{2}}\right|
    +14|xi(1)xi(2)|\displaystyle\quad\quad\,+\frac{1}{4}\left|x_{i}^{(1)}-x_{i}^{(2)}\right|
    14|ϑT(x(1)x(2))|+14|xi(1)xi(2)|\displaystyle\quad\leq\frac{1}{4}\left|\vartheta^{\mathrm{T}}\!\left(x^{(1)}-x^{(2)}\right)\right|+\frac{1}{4}\left|x_{i}^{(1)}-x_{i}^{(2)}\right|
    14x(1)x(2)1+14|xi(1)xi(2)|\displaystyle\quad\leq\frac{1}{4}\left\|x^{(1)}-x^{(2)}\right\|_{1}+\frac{1}{4}\left|x_{i}^{(1)}-x_{i}^{(2)}\right|
    12x(1)x(2)1,\displaystyle\quad\leq\frac{1}{2}\left\|x^{(1)}-x^{(2)}\right\|_{1}, (71)

    where the second line follows from the fact that |ϑj|1|\vartheta_{j}|\leq 1, the third line follows from the triangle inequality and the fact that xi(1)[0,1]x_{i}^{(1)}\in[0,1], the fourth line holds because et(1+et)214e^{-t}(1+e^{-t})^{-2}\leq\frac{1}{4} for all tt\in{\mathbb{R}}, the fifth line follows from Lipschitz continuity and the fact that maxt|ddtet(1+et)2|14\max_{t\in{\mathbb{R}}}{\big{|}\frac{\operatorname{{d\!}}}{\operatorname{{d\!}}t}e^{-t}(1+e^{-t})^{-2}\big{|}}\leq\frac{1}{4}, and the sixth line follows from Hölder’s inequality and the fact that ϑ1\|\vartheta\|_{\infty}\leq 1. This implies that for all x(1),x(2)[0,1]d1x^{(1)},x^{(2)}\in[0,1]^{d-1} and y1,y2[0,1]y_{1},y_{2}\in[0,1], we have

    |x,ysgi(x(1),y1;ϑ)x,ysgi(x(2),y2;ϑ)|\displaystyle\left|\nabla_{x,y}^{s}g_{i}(x^{(1)},y_{1};\vartheta)-\nabla_{x,y}^{s}g_{i}(x^{(2)},y_{2};\vartheta)\right| (72)
    (x(1),y1)(x(2),y2)1.\displaystyle\qquad\qquad\qquad\quad\leq\left\|(x^{(1)},y_{1})-(x^{(2)},y_{2})\right\|_{1}.

    Thus, L2=1L_{2}=1 in this example, and each gi(;ϑ)g_{i}(\cdot;\vartheta) belongs to a (2,1)(2,1)-Hölder class.
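The following NumPy sketch implements (66) and its gradient, and empirically spot-checks the Lipschitz constant L1 = (d-1)/2 + mu derived in item 1; the dimension and random test points are illustrative:

import numpy as np

# Minimal NumPy sketch of the regularized cross-entropy loss (66), the
# gradient from item 1, and an empirical spot-check of the Lipschitz
# constant L1 = (d - 1)/2 + mu.  log(1 + e^t) is evaluated stably via
# np.logaddexp.

def f_ent(x, y, theta, mu):
    t = theta @ x
    return (y * np.logaddexp(0.0, -t) + (1 - y) * np.logaddexp(0.0, t)
            + 0.5 * mu * (theta @ theta))

def grad_f_ent(x, y, theta, mu):
    sigma = 1.0 / (1.0 + np.exp(-(theta @ x)))   # sigmoid(theta^T x)
    return (sigma - y) * x + mu * theta          # matches item 1's formula

rng = np.random.default_rng(0)
dm1, mu = 8, 0.1                                 # d - 1 = 8 features
L1 = dm1 / 2 + mu
worst = 0.0
for _ in range(5000):
    x, y = rng.uniform(0, 1, dm1), rng.uniform()
    th1, th2 = rng.uniform(-1, 1, dm1), rng.uniform(-1, 1, dm1)
    gap = np.linalg.norm(grad_f_ent(x, y, th1, mu) - grad_f_ent(x, y, th2, mu))
    worst = max(worst, gap / np.linalg.norm(th1 - th2))
print(f"largest empirical ratio {worst:.3f} <= L1 = {L1:.3f}")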

Appendix B Bounds on Erlang Quantile Function

For any bb\in{\mathbb{N}}, let YY be an Erlang distributed random variable with shape parameter bb, rate 11, and CDF FY:[0,)[0,1]F_{Y}:[0,\infty)\rightarrow[0,1] given by [82, Example 3.3.21]:

y0,FY(y)=1eyk=0b1ykk!.\forall y\geq 0,\enspace F_{Y}(y)=1-e^{-y}\sum_{k=0}^{b-1}{\frac{y^{k}}{k!}}\,. (73)

The ensuing proposition provides bounds on the quantile function (or inverse CDF) FY1:[0,1)[0,)F_{Y}^{-1}:[0,1)\rightarrow[0,\infty) in the neighborhood of unity.

Proposition 2 (Quantile Function of Erlang Distribution).

For any integer qq\in{\mathbb{N}} such that (q1)b55(q-1)b\geq 55, we have

12log(q1)+12log(b)\displaystyle\frac{1}{2}\log(q-1)+\frac{1}{2}\log(b) FY1(11q)\displaystyle\leq F_{Y}^{-1}\!\left(1-\frac{1}{q}\right)
2log(q1)+2blog(2b).\displaystyle\leq 2\log(q-1)+2b\log(2b)\,.
Proof.

Recall from the definition of generalized inverse that

FY1(11q)\displaystyle F_{Y}^{-1}\!\left(1-\frac{1}{q}\right)
=inf{y>0:FY(y)11q}\displaystyle=\inf\!\left\{y>0:F_{Y}(y)\geq 1-\frac{1}{q}\right\}
=inf{y>0:k=0b1ykk!1qk=0ykk!}\displaystyle=\inf\!\left\{y>0:\sum_{k=0}^{b-1}{\frac{y^{k}}{k!}}\leq\frac{1}{q}\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}}\right\}
=inf{y>0:(q1)k=0b1ykk!k=bykk!}\displaystyle=\inf\!\left\{y>0:(q-1)\sum_{k=0}^{b-1}{\frac{y^{k}}{k!}}\leq\sum_{k=b}^{\infty}{\frac{y^{k}}{k!}}\right\}
=inf{y>0:(q1)k=0b1ykk!ybb!k=0ykk!(k+bb)1}\displaystyle=\inf\!\left\{y>0:(q-1)\sum_{k=0}^{b-1}{\frac{y^{k}}{k!}}\leq\frac{y^{b}}{b!}\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}\binom{k+b}{b}^{\!-1}}\right\}
=inf{y>0:b!(q1)k=0b1ykbk!k=0ykk!(k+bb)1},\displaystyle=\inf\!\left\{y>0:b!(q-1)\sum_{k=0}^{b-1}{\frac{y^{k-b}}{k!}}\leq\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}\binom{k+b}{b}^{\!-1}}\right\}, (74)

where the second equality uses (73) and the Maclaurin series of ey=k=0ykk!e^{y}=\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}}, and the fourth equality follows from straightforward manipulations of the summation on the right hand side.

To prove the lower bound, notice that for all y>0y>0,

b!(q1)k=0b1ykbk!\displaystyle b!(q-1)\sum_{k=0}^{b-1}{\frac{y^{k-b}}{k!}} k=0ykk!(k+bb)1\displaystyle\leq\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}\binom{k+b}{b}^{\!-1}}
b!(q1)k=0b1ykbk!\displaystyle\Rightarrow\qquad b!(q-1)\sum_{k=0}^{b-1}{\frac{y^{k-b}}{k!}} k=0ykk!=ey\displaystyle\leq\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}}=e^{y}
b(q1)\displaystyle\Rightarrow\qquad b(q-1) yey\displaystyle\leq ye^{y}
log(b)+log(q1)\displaystyle\Leftrightarrow\qquad\log(b)+\log(q-1) log(y)+y\displaystyle\leq\log(y)+y
12log(b)+12log(q1)\displaystyle\Rightarrow\qquad\frac{1}{2}\log(b)+\frac{1}{2}\log(q-1) y,\displaystyle\leq y\,,

where the first implication holds because (k+bb)1\binom{k+b}{b}\geq 1, the second implication follows from only retaining the k=b1k=b-1 term in the summation on the left hand side, and the last implication holds because log(y)y1y\log(y)\leq y-1\leq y. Hence, using (74), we obtain

FY1(11q)\displaystyle F_{Y}^{-1}\!\left(1-\frac{1}{q}\right) inf{y12log(q1)+12log(b)}\displaystyle\geq\inf\!\left\{y\geq\frac{1}{2}\log(q-1)+\frac{1}{2}\log(b)\right\}
=12log(q1)+12log(b).\displaystyle=\frac{1}{2}\log(q-1)+\frac{1}{2}\log(b)\,.

Next, note that since we assume that (q1)b55e4(q-1)b\geq 55\geq e^{4}, the lower bound just established yields FY1(1q1)2F_Y^{-1}(1-q^{-1})\geq 2. So, it suffices to consider y2y\geq 2 within the infimum in (74). To establish the upper bound, observe that for all y2y\geq 2,

b!(q1)k=0b1ykbk!\displaystyle b!(q-1)\sum_{k=0}^{b-1}{\frac{y^{k-b}}{k!}} k=0ykk!(k+bb)1\displaystyle\leq\sum_{k=0}^{\infty}{\frac{y^{k}}{k!}\binom{k+b}{b}^{\!-1}}
2bb!(q1)k=0b1ykbk!\displaystyle\Leftarrow\qquad 2^{b}b!(q-1)\sum_{k=0}^{b-1}{\frac{y^{k-b}}{k!}} k=0(y/2)kk!=ey/2\displaystyle\leq\sum_{k=0}^{\infty}{\frac{(y/2)^{k}}{k!}}=e^{y/2}
2bb!(q1)yk=0b11yk\displaystyle\Leftarrow\qquad\frac{2^{b}b!(q-1)}{y}\sum_{k=0}^{b-1}{\frac{1}{y^{k}}} ey/2\displaystyle\leq e^{y/2}
2bb!(q1)y1\displaystyle\Leftarrow\qquad\frac{2^{b}b!(q-1)}{y-1} ey/2\displaystyle\leq e^{y/2}
blog(2)+log(b!)+log(q1)\displaystyle\Leftrightarrow\qquad b\log(2)+\log(b!)+\log(q-1) log(y1)+y2\displaystyle\leq\log(y-1)+\frac{y}{2}
2blog(2b)+2log(q1)\displaystyle\Leftarrow\qquad 2b\log(2b)+2\log(q-1) y,\displaystyle\leq y\,,

where the first implication holds because (k+bb)2k+b\binom{k+b}{b}\leq 2^{k+b}, the third implication follows from computing the geometric series on the left hand side in the previous line over all k{0}k\in{\mathbb{N}}\cup\!\{0\}, and the last implication holds because b!bbb!\leq b^{b} and log(y1)0\log(y-1)\geq 0. Therefore, using (74), we get

FY1(11q)\displaystyle F_{Y}^{-1}\!\left(1-\frac{1}{q}\right) inf{y2log(q1)+2blog(2b)}\displaystyle\leq\inf\!\left\{y\geq 2\log(q-1)+2b\log(2b)\right\}
=2log(q1)+2blog(2b).\displaystyle=2\log(q-1)+2b\log(2b)\,.

This completes the proof. ∎
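Proposition 2 is straightforward to spot-check numerically, e.g., with SciPy (assuming it is available), since an Erlang random variable with shape b and rate 1 is a Gamma random variable with shape b and unit scale:

import numpy as np
from scipy.stats import gamma

# Numerical spot-check of Proposition 2 (illustrative test points): for an
# Erlang(b, rate 1) variable, i.e., Gamma(shape=b, scale=1),
#   0.5*log(q-1) + 0.5*log(b) <= F^{-1}(1 - 1/q) <= 2*log(q-1) + 2b*log(2b)
# whenever (q - 1)*b >= 55.

for b in (1, 2, 5, 10):
    for q in (60, 200, 1000):
        if (q - 1) * b < 55:
            continue
        quantile = gamma.ppf(1 - 1 / q, a=b, scale=1.0)
        lo = 0.5 * np.log(q - 1) + 0.5 * np.log(b)
        hi = 2 * np.log(q - 1) + 2 * b * np.log(2 * b)
        assert lo <= quantile <= hi, (b, q)
print("Proposition 2 bounds hold at all test points")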

Appendix C Low Rank Structure in Neural Networks

In this appendix, we present some Keras (TensorFlow) based simulation results which provide evidence that a variant of the approximate low rank structure used in our analysis of the 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}} algorithm appears in neural network training problems. We perform our experiments using the well-known MNIST database for handwritten digit recognition [44] as well as the CIFAR-10 dataset of tiny images [45]. We next describe our experiment for both databases.

MNIST training. The MNIST database consists of a training dataset of n=60,000n=60,000 image-label pairs as well as a test dataset of 10,00010,000 pairs. Each handwritten digit image has 28×2828\times 28 grayscale pixels taking intensity values in {0,,255}\{0,\dots,255\}, which we can stack and re-normalize to obtain a feature vector y[0,1]784y\in[0,1]^{784}. Each label belongs to the set of digits {0,,9}\{0,\dots,9\}, which we transform into an elementary basis vector z{0,1}10z\in\{0,1\}^{10} via one hot encoding. So, each data sample (y,z)(y,z) has dimension d=794d=794.

To train a classifier that solves the multi-class classification problem of inferring digits based on images, we construct a fully connected feed-forward neural network with one hidden layer (cf. [24, Chapter 5] for basic definitions and terminology). The input layer of this network has 784784 nodes (since the feature vectors are 784784-dimensional), the hidden layer has KK\in{\mathbb{N}} nodes (we typically set K=784K=784 as well), and the output layer has 1010 nodes (since there are 1010 possible label values). The network also possesses the usual bias nodes, which are not included in the above counts. We use all logistic (or sigmoid) activation functions (and in one case, all rectified linear unit, i.e., ReLU, activation functions) in the hidden layer, and the softmax activation function in the output layer. Note that these neural networks are highly overparametrized, because dnp=623,290d\ll n\ll p=623,290, where pp is the number of weight parameters in the network including all bias nodes. Finally, we use the cross-entropy loss function, and minimize the empirical risk (with respect to the training data) over the weights of the neural network using the Adam stochastic optimization algorithm [19] (and in one case, using the root mean square propagation, i.e., RMSprop, algorithm [18]). Specifically, we always use a batch size of 200200 and run the algorithm over 2525 epochs. In almost all instances in our experiments, this setting of hyper-parameters for the optimization algorithm yields test validation accuracies of over 98%98\% (and training accuracies of over 99%99\%). (We note that perturbing the batch size or number of epochs does not change our qualitative observations.)
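For concreteness, a minimal Keras sketch of this prototypical MNIST setup (one hidden layer of K = 784 logistic units, softmax output, cross-entropy loss, and Adam with batch size 200 for 25 epochs) is given below; it matches the stated parameter count p = 623,290, though incidental details such as random seeds are left unspecified:

import tensorflow as tf

# Minimal sketch of the MNIST setup described above: a 784-K-10 fully
# connected network with logistic activations and softmax output, trained
# with Adam on the cross-entropy loss (batch size 200, 25 epochs).  With
# K = 784, the parameter count is 785*784 + 785*10 = 623,290, matching p.

K = 784
(y_tr, z_tr), (y_te, z_te) = tf.keras.datasets.mnist.load_data()
y_tr = y_tr.reshape(-1, 784) / 255.0    # stack and re-normalize pixels
y_te = y_te.reshape(-1, 784) / 255.0
z_tr = tf.keras.utils.to_categorical(z_tr, 10)   # one hot encoding
z_te = tf.keras.utils.to_categorical(z_te, 10)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(K, activation="sigmoid", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(y_tr, z_tr, batch_size=200, epochs=25,
          validation_data=(y_te, z_te))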

CIFAR-10 training. The CIFAR-10 database consists of a training dataset of n=50,000n=50,000 image-label pairs as well as a test dataset of 10,00010,000 pairs. Each image has 32×3232\times 32 color pixels with 33 channels per pixel taking intensity values in {0,,255}\{0,\dots,255\}, which we can stack and re-normalize to obtain a feature vector y[0,1]3072y\in[0,1]^{3072}. Each label belongs to a set of 1010 possible values, e.g., airplane, bird, horse, truck, etc., which we transform into an elementary basis vector z{0,1}10z\in\{0,1\}^{10} via one hot encoding. So, each data sample (y,z)(y,z) has dimension d=3082d=3082.

This time, to train a classifier that solves the problem of inferring labels based on images, we construct a convolutional neural network (cf. [24, Chapter 5]). The input layer of this network has 32×32×332\times 32\times 3 nodes (since there are 33 channels), the next two hidden convolutional layers use 4040 filters per input channel each with 3×33\times 3 kernels with a stride of 11 and zero padding (to retain the image dimensions of 32×3232\times 32), the fourth layer performs sub-sampling via 2×22\times 2 max-pooling with a stride of 22, the fifth layer is fully connected with 150150 output nodes, and the sixth output layer is also fully connected and has 1010 nodes (since there are 1010 possible label values). The network also possesses the usual bias nodes, which are not included in the above counts of nodes. We use all logistic activation functions (and in one case, all ReLU activation functions) in the two convolutional hidden layers and the fifth fully connected layer, and the softmax activation function in the output layer. As before, these deep neural networks are highly overparametrized, because dnp=1,553,220d\ll n\ll p=1,553,220, where pp is the number of weight parameters in the network. Finally, we use the cross-entropy loss function, and minimize the empirical risk over the weights of the neural network using the Adam stochastic optimization algorithm (and in one case, using the RMSprop algorithm) as before. Specifically, we always use a batch size of 200200 and run the algorithm over 1515 epochs. In almost all instances in our experiments, this setting of hyper-parameters for the optimization algorithm yields test validation accuracies of around 60%60\% (with slightly higher test accuracies for ReLU activations). (On the other hand, training accuracies are usually higher, e.g., over 98%98\% for ReLU activations. Although these accuracies can be further increased through careful tuning using techniques like dropout for regularization and batch normalization, we do not utilize such ideas here for simplicity. We are able to illustrate low rank structure without these additional bells and whistles. Furthermore, we note that perturbing the batch size, number of epochs, the architecture of hidden layers, etc. does not change our qualitative observations.)

Experimental details. In either the MNIST or the CIFAR-10 settings delineated above, let θp\theta^{*}\in{\mathbb{R}}^{p} be the optimal network weights output by the training process. Then, for any k{30,60,90}k\in\{30,60,90\}, we randomly choose kk test data samples {x(i)=(y(i),z(i))[0,1]d:i[k]}\{x^{(i)}=(y^{(i)},z^{(i)})\in[0,1]^{d}:i\in[k]\} uniformly among all subsets of kk test data samples. Moreover, we fix any kk independent random perturbations of θ\theta^{*}, namely, vectors of network weights θ(1),,θ(k)p\theta^{(1)},\dots,\theta^{(k)}\in{\mathbb{R}}^{p} such that

θ(i)=θ+θ1pgi\theta^{(i)}=\theta^{*}+\frac{\left\|\theta^{*}\right\|_{1}}{p}\,g_{i} (75)

for all i{1,,k}i\in\{1,\dots,k\}, where g1,,gkg_{1},\dots,g_{k} are i.i.d. Gaussian distributed random vectors with zero mean and identity covariance. We next construct a k×k×pk\times k\times p tensor k×k×p{\mathcal{M}}\in{\mathbb{R}}^{k\times k\times p} of partial derivatives:

i,j{1,,k},q{1,,p},\displaystyle\forall i,j\in\{1,\dots,k\},\,\forall q\in\{1,\dots,p\}, (76)
i,j,qfθq(x(i);θ(j)),\displaystyle\qquad\qquad\qquad\qquad{\mathcal{M}}_{i,j,q}\triangleq\frac{\partial f}{\partial\theta_{q}}(x^{(i)};\theta^{(j)})\,,

where f:[0,1]d×pf:[0,1]^{d}\times{\mathbb{R}}^{p}\rightarrow{\mathbb{R}}, f(x;θ)f(x;\theta) denotes the cross-entropy loss for a data sample xx when the neural network has weights θ\theta, the first dimension of {\mathcal{M}} is indexed by x(1),,x(k)x^{(1)},\dots,x^{(k)}, the second dimension is indexed by θ(1),,θ(k)\theta^{(1)},\dots,\theta^{(k)}, and the third dimension is indexed by the network weight coordinates θ1,,θp\theta_{1},\dots,\theta_{p}. Note that the partial derivatives to obtain each i,j,q{\mathcal{M}}_{i,j,q} can be computed in a variety of ways; we use one iteration of a standard GD procedure in Keras with a single data sample. For any q[p]q\in[p], let qk×k{\mathcal{M}}_{q}\in{\mathbb{R}}^{k\times k} be the qqth matrix slice of the tensor {\mathcal{M}} (where the rows of q{\mathcal{M}}_{q} are indexed by x(1),,x(k)x^{(1)},\dots,x^{(k)} and the columns are indexed by θ(1),,θ(k)\theta^{(1)},\dots,\theta^{(k)}). Furthermore, letting σ1(q)σ2(q)σk(q)0\sigma_{1}({\mathcal{M}}_{q})\geq\sigma_{2}({\mathcal{M}}_{q})\geq\cdots\geq\sigma_{k}({\mathcal{M}}_{q})\geq 0 denote the ordered singular values of q{\mathcal{M}}_{q}, we define the approximate rank of the (non-zero) matrix q{\mathcal{M}}_{q} as the minimum number of ordered singular values that capture 90%90\% of the squared Frobenius norm of q{\mathcal{M}}_{q}:

approximate rank of q\displaystyle\text{approximate rank of }{\mathcal{M}}_{q} (77)
min{J[k]:i=1Jσi(q)20.9qF2}.\displaystyle\qquad\triangleq\min\!\left\{J\in[k]:\sum_{i=1}^{J}{\sigma_{i}({\mathcal{M}}_{q})^{2}}\geq 0.9\left\|{\mathcal{M}}_{q}\right\|_{\mathrm{F}}^{2}\right\}.

(Note that the choice of 90%90\% is arbitrary, and our qualitative observations do not change if we perturb this value.) Next, we fix a sufficiently small ε>0\varepsilon>0; specifically, we use ε=0.005\varepsilon=0.005 after eyeballing several instances of optimal weight vectors θ\theta^{*} to determine a small enough threshold to rule out zero weights (although the precise choice of this value does not alter our qualitative observations). Ruling out zero weights is a stylistic choice in our experiments since these weights are effectively not used by the final trained model. However, keeping them in the sequel would not change the depicted low rank structure. Then, for every q[p]q\in[p] such that θq\theta^{*}_{q} is “non-trivial,” by which we mean |θq|ε|\theta_{q}^{*}|\geq\varepsilon, we compute the approximate rank of q{\mathcal{M}}_{q}. We finally plot a histogram of this multiset of approximate rank values, where the multiset contains one value for each q[p]q\in[p] such that θq\theta^{*}_{q} is non-trivial. (To be precise, for computational efficiency, we plot histograms of around 500500 randomly and uniformly chosen approximate rank values from this multiset.)
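A condensed NumPy sketch of this pipeline is given below; since reproducing the Keras gradient computations here would be lengthy, a synthetic noisy low rank tensor stands in for the tensor of partial derivatives, while the perturbation, slicing, thresholding, and approximate rank steps mirror (75)-(77):

import numpy as np

# Sketch of the experimental pipeline: perturb the trained weights as in
# (75), slice the k x k x p tensor M of per-sample partial derivatives,
# and compute the approximate rank of each slice per (77).  A synthetic
# noisy low rank tensor stands in for the Keras gradient computations.

rng = np.random.default_rng(3)
k, p = 60, 1000                         # illustrative sizes

theta_star = rng.standard_normal(p)     # stand-in for trained weights
perturbed = [theta_star
             + (np.linalg.norm(theta_star, 1) / p) * rng.standard_normal(p)
             for _ in range(k)]         # the perturbations in (75); the real
                                        # experiment evaluates gradients here

# Synthetic stand-in for M[i, j, q] = (df/dtheta_q)(x_i; theta_j).
U, V, w = (rng.standard_normal((k, 3)), rng.standard_normal((k, 3)),
           rng.standard_normal(p))
M = np.einsum("ia,ja,q->ijq", U, V, w) + 0.05 * rng.standard_normal((k, k, p))

def approximate_rank(Mq, frac=0.9):
    """Smallest J whose top-J singular values capture frac of ||Mq||_F^2."""
    sv2 = np.linalg.svd(Mq, compute_uv=False) ** 2
    return int(np.searchsorted(np.cumsum(sv2), frac * sv2.sum()) + 1)

eps = 0.005                             # threshold for non-trivial weights
ranks = [approximate_rank(M[:, :, q])
         for q in range(p) if abs(theta_star[q]) >= eps]
print(np.bincount(ranks))               # histogram of approximate ranks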

[Figure 1 comprises nine histogram panels: (a) K=784, k=30; (b) K=784, k=60; (c) K=784, k=90; (d) K=784, k=60, all weights; (e) K=500, k=60; (f) K=1000, k=60; (g) K=784, k=60, RMSprop; (h) K=784, k=60, two hidden layers; (i) K=784, k=60, ReLU.]
Figure 1: Histograms of approximate ranks for MNIST neural networks with different choices of hyper-parameters. The different values of KK and kk are stated under each plot. Unless stated otherwise, the neural network architectures for all the plots have one hidden layer, use logistic activation functions in all hidden layers, and use the Adam algorithm for training. Moreover, unless stated otherwise, the plots only consider approximate ranks of q{\mathcal{M}}_{q}’s corresponding to non-trivial weights θq\theta^{*}_{q}. Note that Figure 1(d) displays approximate ranks for q{\mathcal{M}}_{q}’s corresponding to all weights θq\theta^{*}_{q}, Figure 1(g) uses the RMSprop algorithm for training, Figure 1(h) has two hidden layers of size K=784K=784 each, and Figure 1(i) uses ReLU activation functions in the hidden layer.

MNIST plots. In the MNIST setting, Figure 1 plots histograms generated using the aforementioned procedure for various choices of network hyper-parameters. We now explain these precise choices and discuss associated observations. Figure 1(b) depicts the prototypical histogram, where we use K=784K=784 and logistic activation functions in the hidden layer to define the neural network, Adam to train the network, and k=60k=60 to construct the tensor. The ordinate (vertical axis) of this plot represents frequency, i.e., the number of times we see a particular value, and the abscissa (horizontal axis) represents the set of possible approximate rank values. Moreover, the plot portrays that almost all q{\mathcal{M}}_{q}’s corresponding to non-trivial θq\theta_{q}^{*}’s have approximate rank bounded by 1010. In fact, about half of these q{\mathcal{M}}_{q}’s have approximate rank bounded by 55. This suggests that the gradient of the loss function for this neural network model indeed exhibits some approximate low rank structure. We also note that ff is “smooth” here, because we have used smooth activation functions and the cross-entropy loss. Furthermore, the observation of low rankness does not change when we use other smooth losses instead of the cross-entropy loss.

Compared to Figure 1(b), Figure 1(a) and Figure 1(c) only change the value of kk to k=30k=30 and k=90k=90, respectively. It is clear from these plots that the broad qualitative observation explained above remains unchanged if kk is varied. In a different vein, compared to Figure 1(b), Figure 1(e) and Figure 1(f) only change the number of nodes in the hidden layer to K=500K=500 and K=1000K=1000, respectively. These plots illustrate that our observations are qualitatively unchanged by different hidden layer sizes. Finally, compared to Figure 1(b), Figure 1(g) only changes the optimization algorithm for training from Adam to RMSprop, and demonstrates (once again) that the choice of algorithm has essentially no effect on our observations. Figure 1(d) uses the same setting of network hyper-parameters as Figure 1(b), but it plots a histogram of the multiset of approximate ranks of all q{\mathcal{M}}_{q}’s, i.e., q{\mathcal{M}}_{q}’s corresponding to all weights θq\theta^{*}_{q}. As mentioned earlier, the purpose of this plot is to illustrate that keeping all weights in the histograms does not change the depicted low rank structure, i.e., all q{\mathcal{M}}_{q}’s have reasonably small approximate rank.

All of the experiments until now focus on smooth ff and neural networks with one hidden layer. To understand the effect of changing the number of hidden layers, Figure 1(h) illustrates the histogram obtained when the neural network has two hidden layers with size K=784K=784 each and all logistic activation functions, and all other hyper-parameters are kept the same as Figure 1(b). In this plot, the approximate rank of the bulk of q{\mathcal{M}}_{q}’s corresponding to non-trivial θq\theta_{q}^{*}’s is bounded by 1010. This suggests that the addition of a few hidden layers does not increase the approximate rank of the gradient of ff by much. On the other hand, to investigate the effect of using non-smooth activation functions, Figure 1(i) uses all ReLU activation functions in the hidden layer, and the same hyper-parameters as Figure 1(b) in all other respects. This histogram is qualitatively the same as that in Figure 1(b). Therefore, the non-smoothness of ff does not seem to affect the approximate low rankness of the gradient of ff. This final observation indicates that algorithms like 𝖥𝖾𝖽𝖫𝖱𝖦𝖣{\mathsf{FedLRGD}}, which exploit approximate low rankness, could potentially be useful in non-smooth settings as well.

[Figure 2 images omitted; panels: (a) $k=30$, (b) $k=60$, (c) $k=60$, RMSprop, (d) $k=60$, ReLU.]
Figure 2: Histograms of approximate ranks for CIFAR-10 neural networks with different choices of hyper-parameters. The different values of $k$ are stated under each plot. Unless stated otherwise, the neural network architectures for all the plots have two hidden convolutional layers, a fourth max-pooling layer, and two final fully connected layers, and they use logistic activation functions in all non-sub-sampling hidden layers and the Adam algorithm for training. Moreover, unless stated otherwise, the plots only consider approximate ranks of ${\mathcal{M}}_{q}$'s corresponding to non-trivial weights $\theta^{*}_{q}$. Note that Figure 2(c) uses the RMSprop algorithm for training and Figure 2(d) uses ReLU activation functions in all non-sub-sampling hidden layers.

CIFAR-10 plots. In the CIFAR-10 setting, Figure 2 plots histograms generated using the aforementioned procedure for various choices of network hyper-parameters. The purpose of this figure is to demonstrate that the low rank structure depicted in Figure 1 can also be observed in more complex neural network architectures used for more challenging supervised learning tasks. As before, we briefly explain these precise choices and discuss associated observations. Many of our comments are similar in spirit to those in the MNIST setting.

Figure 2(b) depicts the prototypical histogram, where we use the convolutional architecture outlined earlier, logistic activation functions in all non-sub-sampling hidden layers, Adam to train the network, and $k=60$ to construct the tensor. The axes of the plot are the same as those in Figure 1. The plot illustrates that almost all ${\mathcal{M}}_{q}$'s corresponding to non-trivial $\theta_{q}^{*}$'s have approximate rank bounded by $6$. This corroborates the MNIST experiments that the gradient of the loss function for CIFAR-10 also exhibits some approximate low rank structure. Compared to Figure 2(b), Figure 2(a) changes the value of $k$ to $k=30$ to show that our qualitative observations remain unchanged if $k$ is varied, and Figure 2(c) only changes the optimization algorithm for training to RMSprop, and demonstrates that the choice of algorithm has essentially no effect on our observations (as before). Lastly, while all of these plots are for a smooth loss function $f$ (because we have used smooth activation functions and the cross-entropy loss), compared to Figure 2(b), Figure 2(d) only changes the activation functions in all non-sub-sampling hidden layers to ReLU. Once again, we observe that there is no qualitative difference in the depicted approximate low rank structure even when non-smooth activation functions are used.

Appendix D Proof of Theorem 3

To prove Theorem 3, we will need two auxiliary lemmata. The first of these is the following convergence result which is distilled from [30, Theorem 1].

Lemma 3 (Convergence of ${\mathsf{FedAve}}$ [30, Theorem 1]).

Let $\theta^{(T)}\in{\mathbb{R}}^{p}$ be the parameter vector at the server after $T$ epochs of ${\mathsf{FedAve}}$ (see [30, Algorithm 1]). Suppose assumptions 1 and 3 in the statement of Theorem 3 hold. Then, for every

\[
T\geq T_{0}=4\max\!\left\{\kappa,\,4+\frac{32\kappa^{2}(1-\tau)}{\tau(m-1)},\,\frac{4m}{\mu^{2}b}\right\},
\]

we have

\begin{align*}
{\mathbb{E}}\!\left[F(\theta^{(T)})\right]-F_{*}
&\leq \frac{L_{1}}{2}\,{\mathbb{E}}\!\left[\left\|\theta^{(T)}-\theta^{*}\right\|_{2}^{2}\right]\\
&\leq \frac{L_{1}(T_{0}b+1)^{2}}{2(Tb+1)^{2}}\,{\mathbb{E}}\!\left[\left\|\theta^{(T_{0})}-\theta^{*}\right\|_{2}^{2}\right]\\
&\quad\, +\frac{8\kappa\sigma^{2}}{\mu m}\left(1+\frac{8(1-\tau)m}{\tau(m-1)}\right)\frac{b}{Tb+1}\\
&\quad\, +\left(\frac{8eL_{1}\kappa^{2}\sigma^{2}}{m}\right)\frac{(b-1)^{2}}{Tb+1}\\
&\quad\, +\frac{128e\kappa^{3}\sigma^{2}}{\mu}\left(1+\frac{8(1-\tau)}{\tau(m-1)}\right)\frac{b-1}{(Tb+1)^{2}},
\end{align*}

where the first inequality follows from the smoothness in $\theta$ assumption in Section II-B.

The second lemma provides a useful estimate of a trigonometric expression we will utilize in the proof.

Lemma 4 (Taylor Approximation of Root).

For all $t\in[0,1]$, we have (note that we use $\arcsin(1-t^{2})$ here instead of $\arcsin(1-t)$, because the latter can only be expanded as a Newton–Puiseux series)

\begin{align*}
\frac{1}{2}+\frac{t}{\sqrt{6}}-\frac{5t^{2}}{9}
&\leq \sin\!\left(\frac{1}{3}\arcsin\!\left(1-t^{2}\right)+\frac{2\pi}{3}\right)\\
&\leq \frac{1}{2}+\frac{t}{\sqrt{6}}+\frac{5t^{2}}{9}\,.
\end{align*}
Proof.

Define the function $h:[0,1]\rightarrow[-1,1]$ as

\[
h(t)\triangleq\sin\!\left(\frac{1}{3}\arcsin\!\left(1-t^{2}\right)+\frac{2\pi}{3}\right).
\]

The first and second derivatives of $h$ are

\begin{align*}
h^{\prime}(t) &= -\frac{2}{3\sqrt{2-t^{2}}}\cos\!\left(\frac{1}{3}\arcsin\!\left(1-t^{2}\right)+\frac{2\pi}{3}\right),\\
h^{\prime\prime}(t) &= -\frac{4}{9(2-t^{2})}\sin\!\left(\frac{1}{3}\arcsin\!\left(1-t^{2}\right)+\frac{2\pi}{3}\right)\\
&\quad\, -\frac{2t}{3(2-t^{2})^{3/2}}\cos\!\left(\frac{1}{3}\arcsin\!\left(1-t^{2}\right)+\frac{2\pi}{3}\right),
\end{align*}

for all $t\in[0,1]$. Moreover, we have $h(0)=\frac{1}{2}$, $h^{\prime}(0)=\frac{1}{\sqrt{6}}$, and for any $t\in[0,1]$,

\[
|h^{\prime\prime}(t)|\leq\frac{4}{9(2-t^{2})}+\frac{2t}{3(2-t^{2})^{3/2}}\leq\frac{10}{9}\,,
\]

where the first inequality follows from the triangle inequality, and the second inequality follows from setting $t=1$, since $[0,1]\ni t\mapsto\frac{4}{9(2-t^{2})}+\frac{2t}{3(2-t^{2})^{3/2}}$ is an increasing function. Hence, using Taylor's theorem (with Lagrange remainder term), we get:

\[
\forall t\in[0,1],\,\exists s\in[0,t],\enspace h(t)=\frac{1}{2}+\frac{t}{\sqrt{6}}+\frac{h^{\prime\prime}(s)t^{2}}{2}\,.
\]

This implies that:

\[
\forall t\in[0,1],\enspace\left|h(t)-\frac{1}{2}-\frac{t}{\sqrt{6}}\right|\leq\frac{5t^{2}}{9}\,,
\]

which completes the proof. ∎
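As a quick sanity check, the two-sided bound in Lemma 4 can be verified numerically on a fine grid; a minimal sketch (the grid resolution and numerical tolerance are the only choices made here):

```python
import numpy as np

# Numerically verify Lemma 4 on [0, 1]:
# |h(t) - 1/2 - t/sqrt(6)| <= 5*t^2/9 for h(t) = sin(arcsin(1 - t^2)/3 + 2*pi/3).
t = np.linspace(0.0, 1.0, 100001)
h = np.sin(np.arcsin(1.0 - t**2) / 3.0 + 2.0 * np.pi / 3.0)
gap = np.abs(h - 0.5 - t / np.sqrt(6.0))
assert np.all(gap <= 5.0 * t**2 / 9.0 + 1e-12)  # small tolerance for roundoff
print(float(gap.max()))  # well below 5/9
```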

We next establish Theorem 3.

Proof of Theorem 3.

Since we assume that $\sigma^{2}=O(p)$, $b=O(m)$, ${\mathbb{E}}\big[\big\|\theta^{(T_{0})}-\theta^{*}\big\|_{2}^{2}\big]=O(p)$, and $\tau,L_{1},\mu=\Theta(1)$ with respect to $m$ (see assumptions 2-5 in the theorem statement), for every $T\geq T_{0}=\Theta\big(\frac{m}{b}\big)$, each term on the right hand side of Lemma 3 satisfies

\begin{align*}
\frac{L_{1}(T_{0}b+1)^{2}}{2(Tb+1)^{2}}\,{\mathbb{E}}\!\left[\left\|\theta^{(T_{0})}-\theta^{*}\right\|_{2}^{2}\right] &= O\!\left(\frac{m^{2}p}{T^{2}b^{2}}\right),\\
\frac{8\kappa\sigma^{2}}{\mu m}\left(1+\frac{8(1-\tau)m}{\tau(m-1)}\right)\frac{b}{Tb+1} &= O\!\left(\frac{p}{mT}\right),\\
\left(\frac{8eL_{1}\kappa^{2}\sigma^{2}}{m}\right)\frac{(b-1)^{2}}{Tb+1} &= O\!\left(\frac{bp}{mT}\right),\\
\frac{128e\kappa^{3}\sigma^{2}}{\mu}\left(1+\frac{8(1-\tau)}{\tau(m-1)}\right)\frac{b-1}{(Tb+1)^{2}} &= O\!\left(\frac{p}{T^{2}b}\right).
\end{align*}

Clearly, the third term above dominates the second term, and the first term dominates the fourth term for all sufficiently large $m$ (as $b=O(m)$). Hence, Lemma 3 implies that

\[
{\mathbb{E}}\!\left[F(\theta^{(T)})\right]-F_{*}\leq C\left(\frac{m^{2}p}{T^{2}b^{2}}+\frac{bp}{mT}\right),
\]

where $C>0$ is a constant that depends on the various constant problem parameters.

Observe that if we seek an approximation accuracy of $\epsilon>0$, i.e., ${\mathbb{E}}\big[F(\theta^{(T)})\big]-F_{*}\leq\epsilon$, we need to ensure that

\[
\frac{m^{2}}{T^{2}b^{2}}+\frac{b}{mT}\leq\frac{\epsilon}{Cp}\,. \tag{78}
\]

According to Definition 1, the associated federated oracle complexity of the ${\mathsf{FedAve}}$ algorithm is $Tb+\phi\tau mT$, because the active clients each compute $b$ gradients in all $T$ epochs of ${\mathsf{FedAve}}$. However, we have control over the parameters $b$ and $T$ in ${\mathsf{FedAve}}$. So, we can minimize this complexity over all $b\in{\mathbb{N}}$ and $T\in{\mathbb{N}}$ subject to the approximation accuracy constraint above. Indeed, the (minimum) federated oracle complexity of ${\mathsf{FedAve}}$ is

\[
\Gamma({\mathsf{FedAve}})=\min_{\substack{b,T\in{\mathbb{R}}_{+}:\\ \frac{m^{2}}{T^{2}b^{2}}+\frac{b}{mT}\leq\frac{\epsilon}{Cp}}}{T(b+\phi\tau m)}\,, \tag{79}
\]

where $\phi>0$, $\tau\in(0,1]$, $m\in{\mathbb{N}}$, $C>0$, $\epsilon\in(0,1]$, and $p\in{\mathbb{N}}$ are held constant in the optimization. (Note that we allow $b$ and $T$ to be real numbers in the optimization rather than natural numbers, because this does not affect the final scaling of federated oracle complexity with respect to the problem parameters. Moreover, the optimization in (79) is really an upper bound on $\Gamma({\mathsf{FedAve}})$ if we are to be pedantic.) We next calculate how $\Gamma({\mathsf{FedAve}})$ scales with $m$, $\phi$, $\epsilon$, and $p$. For convenience, we let $\varepsilon=\epsilon/p$ in the sequel.
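Before carrying out the calculation analytically, one can probe the optimization in (79) numerically. The sketch below uses hypothetical values of $C$, $\tau$, $\phi$, and $m$; for each candidate $b$, it takes the smallest feasible $T$ as the positive root of the equality version of the constraint (derived as (80) below), and compares the resulting minimum against the $\phi\tau m(C/\varepsilon)^{3/4}$ scaling established in this proof.

```python
import numpy as np

# Hypothetical problem constants (not from the paper's experiments).
C, tau, phi, m = 1.0, 0.5, 100.0, 50

def fedave_complexity(eps):
    # Grid-search over b; for each b, T solves eps*T^2/C - b*T/m - m^2/b^2 = 0.
    best = np.inf
    for b in np.geomspace(1e-2 * m, 1e2 * m, 4000):
        T = b * C / (2 * m * eps) + np.sqrt((b * C / (2 * m * eps))**2
                                            + C * m**2 / (eps * b**2))
        best = min(best, T * (b + phi * tau * m))
    return best

for eps in [1e-2, 1e-3, 1e-4]:
    # The ratio should stay bounded (roughly between 3/4 and 8) as eps shrinks.
    print(eps, fedave_complexity(eps) / (phi * tau * m * (C / eps)**0.75))
```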

First, notice that the inequality constraint in (78) can be changed into an equality constraint

\[
\frac{m^{2}}{T^{2}b^{2}}+\frac{b}{mT}=\frac{\varepsilon}{C}\,.
\]

Indeed, if the optimal $T$ and $b$ for (79) do not achieve (78) with equality, we can reduce $T$ to achieve (78) with equality, and this reduces the objective value of $T(b+\phi\tau m)$ further (producing a contradiction). Rearranging this equality constraint, we obtain the following quadratic equation in $T$:

\[
\frac{\varepsilon T^{2}}{C}-\frac{bT}{m}-\frac{m^{2}}{b^{2}}=0\,,
\]

whose unique non-negative solution is

\[
T=\frac{bC}{2m\varepsilon}+\sqrt{\left(\frac{bC}{2m\varepsilon}\right)^{\!2}+\frac{Cm^{2}}{\varepsilon b^{2}}}\,. \tag{80}
\]

This solution satisfies the bounds

\[
\frac{bC}{2m\varepsilon}+\frac{m}{b}\sqrt{\frac{C}{\varepsilon}}\leq T\leq 2\left(\frac{bC}{2m\varepsilon}+\frac{m}{b}\sqrt{\frac{C}{\varepsilon}}\right), \tag{81}
\]

where the lower bound follows from neglecting the $(bC/(2m\varepsilon))^{2}\geq 0$ term in (80), and the upper bound follows from the sub-additivity of the square root function. (Note that $T\geq T_{0}=\Theta\big(\frac{m}{b}\big)$ is satisfied.) Now define the constants

\begin{align}
C_{1} &= \frac{C}{2m\varepsilon}\,, \tag{82}\\
C_{2} &= m\sqrt{\frac{C}{\varepsilon}}\,, \tag{83}\\
C_{3} &= \phi\tau m\,, \tag{84}
\end{align}

and the function $g:(0,\infty)\rightarrow{\mathbb{R}}$ as:

\begin{align*}
\forall b>0,\enspace g(b) &\triangleq \left(C_{1}b+\frac{C_{2}}{b}\right)\!(b+C_{3})\\
&= C_{1}b^{2}+C_{1}C_{3}b+\frac{C_{2}C_{3}}{b}+C_{2}\,.
\end{align*}

Then, using (79), (81), (82), (83), and (84), we have

\[
\min_{b>0}{g(b)}\leq\Gamma({\mathsf{FedAve}})\leq 2\min_{b>0}{g(b)}\,. \tag{85}
\]

So, we focus on evaluating $\min_{b>0}{g(b)}$ from hereon.
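As an aside, this one-dimensional minimization can also be cross-checked numerically against the closed-form stationary point derived in the remainder of the proof (via Viète's trigonometric formula applied to (86)–(87)). A sketch with the same hypothetical constants as in the previous code block:

```python
import numpy as np

# Hypothetical constants; C1, C2, C3 follow (82)-(84).
C, tau, phi, m, eps = 1.0, 0.5, 100.0, 50, 1e-3
C1, C2, C3 = C / (2 * m * eps), m * np.sqrt(C / eps), phi * tau * m

def g(b):
    return (C1 * b + C2 / b) * (b + C3)

# Brute-force minimizer of g on a fine geometric grid.
grid = np.geomspace(1e-2, 1e5, 2_000_001)
b_grid = grid[np.argmin(g(grid))]

# Closed-form stationary point b* of (86) via Viete's trigonometric formula.
b_star = (C3 / 3.0) * (np.sin(np.arcsin(1.0 - 54.0 * C2 / (C1 * C3**2)) / 3.0
                              + 2.0 * np.pi / 3.0) - 0.5)
print(b_star, b_grid)  # the two should agree to grid resolution
```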

Since $C_{1},C_{2},C_{3}>0$, it is straightforward to verify that $g$ is a convex function on $(0,\infty)$. This means that any first order stationary point of $g$ in $(0,\infty)$ is a global minimum. We will find such a stationary point in the sequel. To this end, notice that

\[
\forall b>0,\enspace g^{\prime}(b)=2C_{1}b+C_{1}C_{3}-\frac{C_{2}C_{3}}{b^{2}}\,.
\]

Setting this equal to 0 yields the stationarity condition

\[
2C_{1}b^{3}+C_{1}C_{3}b^{2}-C_{2}C_{3}=0\,. \tag{86}
\]

Using the substitution $b=a-\frac{C_{3}}{6}$ above, we obtain:

\begin{align*}
0 &= 2C_{1}\left(a-\frac{C_{3}}{6}\right)^{\!3}+C_{1}C_{3}\left(a-\frac{C_{3}}{6}\right)^{\!2}-C_{2}C_{3}\\
&= (2C_{1})a^{3}-\left(\frac{C_{1}C_{3}^{2}}{6}\right)\!a+\left(\frac{C_{1}C_{3}^{3}}{54}-C_{2}C_{3}\right),
\end{align*}

which we rearrange to get the (monic) depressed cubic equation:

\[
a^{3}-\left(\frac{C_{3}^{2}}{12}\right)\!a+\left(\frac{C_{3}^{3}}{108}-\frac{C_{2}C_{3}}{2C_{1}}\right)=0\,. \tag{87}
\]

The discriminant of this depressed cubic equation is

\begin{align*}
4\left(\frac{C_{3}^{2}}{12}\right)^{\!3}-27\left(\frac{C_{3}^{3}}{108}-\frac{C_{2}C_{3}}{2C_{1}}\right)^{\!2}
&= \frac{C_{2}C_{3}^{4}}{4C_{1}}-\frac{27C_{2}^{2}C_{3}^{2}}{4C_{1}^{2}}\\
&= \frac{\phi^{4}\tau^{4}m^{6}\sqrt{\varepsilon}}{2\sqrt{C}}-\frac{27\phi^{2}\tau^{2}m^{6}\varepsilon}{C}\\
&= \underbrace{\phi^{2}\tau^{2}m^{6}\sqrt{\frac{\varepsilon}{C}}}_{>0}\left(\frac{1}{2}\phi^{2}\tau^{2}-27\sqrt{\frac{\varepsilon}{C}}\right)\\
&> 0\,,
\end{align*}

where the second equality follows from (82), (83), and (84), and the strict positivity holds because $\tau$ and $C$ are constants, $\varepsilon=\epsilon/p\in(0,1]$, and $\phi$ is sufficiently large (compared to these constant problem parameters). In particular, we will assume throughout this proof that

\[
\phi\geq\frac{40}{C^{1/4}\tau}\,. \tag{88}
\]

Indeed, (88) clearly implies that

\[
\phi>\frac{3\sqrt{6}}{\tau}\left(\frac{\varepsilon}{C}\right)^{\!1/4}\quad\Leftrightarrow\quad\frac{1}{2}\phi^{2}\tau^{2}-27\sqrt{\frac{\varepsilon}{C}}>0\,.
\]

Thus, the depressed cubic equation in (87) has three distinct real solutions. Using Viète’s trigonometric formula for the roots of a depressed cubic equation [94, 95], we have that

\[
a^{*}=\frac{C_{3}}{3}\sin\!\left(\frac{1}{3}\arcsin\!\left(1-\frac{54C_{2}}{C_{1}C_{3}^{2}}\right)+\frac{2\pi}{3}\right)
\]

is one of the solutions to (87), and hence,

\[
b^{*}=\frac{C_{3}}{3}\left(\sin\!\left(\frac{1}{3}\arcsin\!\left(1-\frac{54C_{2}}{C_{1}C_{3}^{2}}\right)+\frac{2\pi}{3}\right)-\frac{1}{2}\right)
\]

is one of the solutions to (86). Applying Lemma 4 to $b^{*}$ and simplifying yields

\[
\sqrt{\frac{C_{2}}{C_{1}}}-\frac{10C_{2}}{C_{1}C_{3}}\leq b^{*}\leq\sqrt{\frac{C_{2}}{C_{1}}}+\frac{10C_{2}}{C_{1}C_{3}}\,,
\]

where we use (88) and the fact that $\varepsilon\in(0,1]$, which imply that the condition required by Lemma 4 is satisfied:

\[
\phi\geq\frac{6\sqrt{3}}{\tau}\left(\frac{\varepsilon}{C}\right)^{\!1/4}\quad\Leftrightarrow\quad\frac{3}{C_{3}}\sqrt{\frac{6C_{2}}{C_{1}}}\leq 1\,.
\]

Then, substituting (82), (83), and (84) into these bounds gives

\[
\sqrt{2}\,m\!\left(\frac{\varepsilon}{C}\right)^{\!1/4}-\frac{20m}{\phi\tau}\sqrt{\frac{\varepsilon}{C}}\leq b^{*}\leq\sqrt{2}\,m\!\left(\frac{\varepsilon}{C}\right)^{\!1/4}+\frac{20m}{\phi\tau}\sqrt{\frac{\varepsilon}{C}}\,.
\]

Since $\varepsilon\in(0,1]$ and (88) holds, we have

\[
\phi\geq\frac{40\varepsilon^{1/4}}{C^{1/4}\tau}\quad\Leftrightarrow\quad\frac{20m}{\phi\tau}\sqrt{\frac{\varepsilon}{C}}\leq\frac{m}{2}\!\left(\frac{\varepsilon}{C}\right)^{\!1/4}\,,
\]

and we see that $b^{*}=\Theta(m\varepsilon^{1/4})$; specifically,

\[
\frac{m}{2}\left(\frac{\varepsilon}{C}\right)^{\!1/4}\leq b^{*}\leq 2m\!\left(\frac{\varepsilon}{C}\right)^{\!1/4}. \tag{89}
\]

(Note that $b^{*}=O(m)$ is satisfied.) Furthermore, $b^{*}>0$ is the unique positive solution to the stationarity condition (86) by Descartes' rule of signs. Therefore, using (82), (83), and (84), we have

\begin{align*}
\min_{b>0}{g(b)} &= C_{1}(b^{*})^{2}+C_{1}C_{3}b^{*}+\frac{C_{2}C_{3}}{b^{*}}+C_{2}\\
&= \frac{C}{2m\varepsilon}(b^{*})^{2}+\frac{C\phi\tau}{2\varepsilon}b^{*}+\frac{m^{2}\phi\tau}{b^{*}}\sqrt{\frac{C}{\varepsilon}}+m\sqrt{\frac{C}{\varepsilon}}\,.
\end{align*}

We can now estimate the scaling of $\min_{b>0}{g(b)}$. Indeed, by applying the inequalities in (89), we have

\begin{align*}
\frac{3\phi\tau m}{4}\left(\frac{C}{\varepsilon}\right)^{\!3/4}
&\leq \frac{9m}{8}\sqrt{\frac{C}{\varepsilon}}+\frac{3\phi\tau m}{4}\left(\frac{C}{\varepsilon}\right)^{\!3/4}\\
&\leq \min_{b>0}{g(b)}\\
&\leq 3m\sqrt{\frac{C}{\varepsilon}}+3\phi\tau m\!\left(\frac{C}{\varepsilon}\right)^{\!3/4}\leq 4\phi\tau m\!\left(\frac{C}{\varepsilon}\right)^{\!3/4},
\end{align*}

where the last inequality again uses (88) and the fact that $\varepsilon\in(0,1]$, which imply that

\[
\phi\geq\frac{3}{\tau}\left(\frac{\varepsilon}{C}\right)^{\!1/4}\quad\Leftrightarrow\quad 3m\sqrt{\frac{C}{\varepsilon}}\leq\phi\tau m\!\left(\frac{C}{\varepsilon}\right)^{\!3/4}.
\]

Finally, from the bounds in (85), we can estimate the scaling of $\Gamma({\mathsf{FedAve}})$:

\[
\frac{3\phi\tau m}{4}\left(\frac{C}{\varepsilon}\right)^{\!3/4}\leq\Gamma({\mathsf{FedAve}})\leq 8\phi\tau m\!\left(\frac{C}{\varepsilon}\right)^{\!3/4}.
\]

Since $\varepsilon=\epsilon/p$, this establishes that $\Gamma({\mathsf{FedAve}})=\Theta(\phi m(p/\epsilon)^{3/4})$. More pedantically, this argument shows that $\Gamma({\mathsf{FedAve}})=O(\phi m(p/\epsilon)^{3/4})$, because the constraint (78) uses the (worst case $O(p)$ scaling) bounds in assumptions 3 and 5 of the theorem statement. This completes the proof. ∎

Appendix E Proof of Proposition 1

Proof.

We first calculate the federated oracle complexity of ${\mathsf{FedLRGD}}$. Since $d=O(\log(n)^{\lambda})$, without loss of generality, suppose that $d\leq c_{3}\log(n)^{\lambda}$ for some constant $c_{3}>0$. From (47) in Theorem 2,

\[
r=\left\lceil\max\!\left\{e\sqrt{d}\,2^{d}(\eta+d)^{d},\;\left(\left(\frac{L_{2}^{d}\eta^{d}e^{\eta(d+1)}d^{\eta/2}(2\eta+2d)^{\eta d}}{(\eta-1)^{\eta d}}\right)\left(\frac{\kappa\!\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)}{(\kappa-1)\epsilon}\right)^{\!\frac{d}{2}}\right)^{\!\frac{1}{\eta-(2\alpha+2)d}}\right\}\right\rceil.
\]

Let us analyze each salient term in this equation:

\begin{align*}
e\sqrt{d}\,2^{d}(\eta+d)^{d} &\leq e(2c_{2}+4\alpha+6)^{d}d^{d+(1/2)}\\
&= O\!\left(d^{2d+(1/2)}\right)\\
&= O\!\left(\exp\!\left(2\lambda c_{3}\log(n)^{(\lambda+1)/2}\right)\right),\\
\left(L_{2}^{d}\eta^{d}(2e)^{\eta(d+1)}d^{\eta/2}\left(\frac{\eta+d}{\eta-1}\right)^{\!\eta d}\right)^{\!\frac{1}{\eta-(2\alpha+2)d}}
&= O\!\left(d^{(c_{2}+2\alpha+4)/(2c_{1})}\left(\frac{2e(c_{2}+2\alpha+3)}{c_{1}+2\alpha+1}\right)^{\!(c_{2}+2\alpha+2)d/c_{1}}\right)\\
&= O\!\left(\log(n)^{\lambda(c_{2}+2\alpha+4)/(2c_{1})}\left(\frac{2e(c_{2}+2\alpha+3)}{c_{1}+2\alpha+1}\right)^{\!c_{3}(c_{2}+2\alpha+2)\log(n)^{\lambda}/c_{1}}\right),\\
\left(\frac{\kappa\!\left(F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}\right)}{(\kappa-1)\epsilon}\right)^{\!\frac{d}{2\eta-(4\alpha+4)d}}
&= O\!\left(\left(\frac{p}{\epsilon}\right)^{\!1/(2c_{1})}\right),
\end{align*}

where the first estimate uses the facts that $\eta\leq(c_{2}+2\alpha+2)d$ and $d\leq c_{3}\log(n)^{\lambda}$, the second estimate uses the facts that $(c_{1}+2\alpha+2)d\leq\eta\leq(c_{2}+2\alpha+2)d$ and $d\leq c_{3}\log(n)^{\lambda}$, and the third estimate uses the facts that $(c_{1}+2\alpha+2)d\leq\eta$ and $F(\theta^{(0)})-F_{*}=O(p)$. Hence, since $\epsilon=\Theta(n^{-\beta})$, we have $p/\epsilon=\Omega(n^{\beta})$, so the third term above grows at least polynomially in $n$ while the first two terms are sub-polynomial, which implies that

\[
r=O\!\left(\operatorname{sub-poly}(n)\left(\frac{p}{\epsilon}\right)^{\!1/(2c_{1})}\right), \tag{90}
\]

where $\operatorname{sub-poly}(n)$ denotes a term that is dominated by any polynomial in $n$, i.e., $\operatorname{sub-poly}(n)=O(n^{\tau})$ for any $\tau>0$.

On the other hand, from (48) in Theorem 2,

\begin{align*}
S &= \left\lceil\left(\log\!\left(\frac{\kappa}{\kappa-1}\right)\right)^{\!-1}\log\!\left(\frac{F(\theta^{(0)})-F_{*}+\frac{9(B+3)^{4}p}{2\mu}}{\epsilon}\right)\right\rceil\\
&= \Theta\!\left(\log\!\left(\frac{p}{\epsilon}\right)\right)\\
&= O(\log(r))\,,
\end{align*}

where the second equality uses the fact that $F(\theta^{(0)})-F_{*}=O(p)$, and the third equality follows from (90). Therefore, the federated oracle complexity of ${\mathsf{FedLRGD}}$ from Theorem 2 scales as

\begin{align*}
\Gamma({\mathsf{FedLRGD}}) &= r^{2}+rs+rS+\phi mr\\
&\leq O(\phi mr^{2})\\
&= O\!\left(\operatorname{sub-poly}(n)\,\phi m\left(\frac{p}{\epsilon}\right)^{\!1/c_{1}}\right)\\
&\leq O\!\left(\phi m\left(\frac{p}{\epsilon}\right)^{\!2/c_{1}}\right)\\
&\leq O\!\left(\phi m\left(\frac{p}{\epsilon}\right)^{\!\xi}\right),
\end{align*}

where the second line holds because $s=O(\phi m)$ and $S=O(\log(r))$, the fourth line holds because $\operatorname{sub-poly}(n)=O((p/\epsilon)^{1/c_{1}})$ since $\epsilon=\Theta(n^{-\beta})$, and the last line holds because $c_{1}\geq 2/\xi$.

Furthermore, rewriting Theorem 3, the federated oracle complexity of ${\mathsf{FedAve}}$ scales as

\[
\Gamma({\mathsf{FedAve}})\leq\Gamma_{+}({\mathsf{FedAve}})=\Theta\!\left(\phi m\left(\frac{p}{\epsilon}\right)^{\!3/4}\right).
\]

Hence, since $\xi<\frac{3}{4}$, we have

\[
\lim_{m\rightarrow\infty}{\frac{\Gamma({\mathsf{FedLRGD}})}{\Gamma_{+}({\mathsf{FedAve}})}}\leq\lim_{m\rightarrow\infty}{c_{4}\left(\frac{p}{\epsilon}\right)^{\!\xi-(3/4)}}=0\,,
\]

as desired, where $c_{4}>0$ is some constant. This completes the proof. ∎
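To see the exponent gap concretely, here is a tiny numeric illustration of the final comparison; the value of $\xi$ below is hypothetical and only needs to satisfy $\xi<3/4$, as in the proposition.

```python
# Hypothetical exponent xi < 3/4 for FedLRGD: the ratio of the two
# oracle-complexity scalings, (p/eps)**(xi - 3/4), decays polynomially
# as p/eps grows (e.g., as eps = Theta(n**-beta) shrinks).
xi = 0.5
for p_over_eps in [1e2, 1e4, 1e6, 1e8]:
    print(p_over_eps, p_over_eps ** (xi - 0.75))
```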

References

  • [1] S. Boyd and L. Vandenberghe, Convex Optimization.   New York, NY, USA: Cambridge University Press, 2004.
  • [2] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, ser. Applied Optimization.   New York, NY, USA: Springer, 2004, vol. 87.
  • [3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., ser. Springer Series in Statistics.   New York, NY, USA: Springer, 2009.
  • [4] S. Bubeck, Convex Optimization: Algorithms and Complexity, ser. Foundations and Trends in Machine Learning.   Hanover, MA, USA: now Publishers Inc., 2015, vol. 8, no. 2-4.
  • [5] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM Journal on Scientific Computing, vol. 34, no. 3, pp. A1380–A1405, 2012.
  • [6] O. Devolder, F. Glineur, and Y. Nesterov, “First-order methods with inexact oracle: the strongly convex case,” CORE Discussion Papers 2013016, vol. 2013, no. 16, March 2013.
  • [7] A. M.-C. So and Z. Zhou, “Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity,” Optimization Methods and Software, vol. 32, no. 4, pp. 963–992, May 2017.
  • [8] M. Schmidt, N. Le Roux, and F. Bach, “Convergence rates of inexact proximal-gradient methods for convex optimization,” in Proceedings of the Advances in Neural Information Processing Systems 24 (NeurIPS), Granada, Spain, December 12-17 2011, pp. 1–9.
  • [9] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, January 2009.
  • [10] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, December 1964.
  • [11] Y. E. Nesterov, “A method of solving a convex programming problem with convergence rate $O\big(\frac{1}{k^{2}}\big)$,” Doklady Akademii Nauk SSSR, vol. 269, no. 3, pp. 543–547, 1983, in Russian.
  • [12] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal distributed online prediction using mini-batches,” Journal of Machine Learning Research, vol. 13, no. 6, pp. 165–202, January 2012.
  • [13] N. Le Roux, M. Schmidt, and F. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” in Proceedings of the Advances in Neural Information Processing Systems 25 (NeurIPS), Lake Tahoe, NV, USA, December 3-8 2012, pp. 1–9.
  • [14] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, Series A, vol. 162, pp. 83–112, March 2017.
  • [15] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss minimization,” Journal of Machine Learning Research, vol. 14, pp. 567–599, February 2013.
  • [16] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proceedings of the Advances in Neural Information Processing Systems 26 (NeurIPS), Lake Tahoe, NV, USA, December 5-10 2013, pp. 315–323.
  • [17] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 61, pp. 2121–2159, July 2011.
  • [18] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” 2012, COURSERA: Neural Networks for Machine Learning.
  • [19] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7-9 2015, pp. 1–13.
  • [20] P. Jain and P. Kar, Non-convex Optimization for Machine Learning, ser. Foundations and Trends in Machine Learning, M. Jordan, Ed.   Hanover, MA, USA: now Publishers Inc., 2017, vol. 10, no. 3-4.
  • [21] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, May 2018.
  • [22] P. Netrapalli, “Stochastic gradient descent and its variants in machine learning,” Journal of Indian Institute of Science, vol. 99, no. 2, pp. 201–213, June 2019.
  • [23] A. Jadbabaie, A. Makur, and D. Shah, “Gradient-based empirical risk minimization using local polynomial regression,” November 2020, arXiv:2011.02522 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2011.02522
  • [24] C. M. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics.   New York, NY, USA: Springer, 2006.
  • [25] J. Fan and I. Gijbels, Local Polynomial Modeling and Its Applications, ser. Monographs on Statistics and Applied Probability.   London, UK: Chapman and Hall, 1996, vol. 66.
  • [26] W. S. Cleveland and C. Loader, “Smoothing by local regression: Principles and methods,” in Statistical Theory and Computational Aspects of Smoothing: Proceedings of the COMPSTAT ’94 Satellite Meeting held in Semmering, Austria 27-28 August 1994, ser. Contributions to Statistics, W. Härdle and M. G. Schimek, Eds.   Heidelberg, Germany: Physica-Verlag, 1996, pp. 10–49.
  • [27] A. B. Tsybakov, Introduction to Nonparametric Estimation, ser. Springer Series in Statistics.   New York, NY, USA: Springer, 2009.
  • [28] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, April 20-22 2017, pp. 1273–1282.
  • [29] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, H. Qi, D. Ramage, R. Raskar, M. Raykova, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao, Advances and Open Problems in Federated Learning, ser. Foundations and Trends in Machine Learning, M. Jordan, Ed.   Hanover, MA, USA: now Publishers Inc., 2021, vol. 14, no. 1-2.
  • [30] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, August 26-28 2020, pp. 2021–2031.
  • [31] A. Reisizadeh, I. Tziotis, H. Hassani, A. Mokhtari, and R. Pedarsani, “Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity,” December 2020, arXiv:2012.14453 [cs.LG]. [Online]. Available: https://arxiv.org/pdf/2012.14453.pdf
  • [32] Z. Huo, Q. Yang, B. Gu, L. Carin, and H. Huang, “Faster on-device training using new federated momentum algorithm,” February 2020, arXiv:2002.02090v1 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2002.02090
  • [33] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” January 2019, arXiv:1808.07576v3 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1808.07576
  • [34] R. Pathak and M. J. Wainwright, “FedSplit: an algorithmic framework for fast federated optimization,” in Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS), Vancouver, BC, Canada, December 6-12 2020, pp. 7057–7066.
  • [35] F. X. Yu, A. S. Rawat, A. K. Menon, and S. Kumar, “Federated learning with only positive labels,” in Proceedings of the 37th Annual International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research (PMLR), vol. 119, Vienna, Austria, July 13-18 2020, pp. 10 946–10 956.
  • [36] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in Proceedings of the 10th International Conference on Learning Representations (ICLR), Vienna, Austria, April 25-29 2021, pp. 1–38.
  • [37] N. Guha, A. Talwalkar, and V. Smith, “One-shot federated learning,” March 2019, arXiv:1902.11175v2 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1902.11175
  • [38] V. S. Mai, R. J. La, T. Zhang, Y. Huang, and A. Battou, “Federated learning with server learning for non-iid data,” in Proceedings of the 57th Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 22-24 2023, pp. 1–6.
  • [39] K. Yang and C. Shen, “On the convergence of hybrid federated learning with server-clients collaborative training,” in Proceedings of the 56th Annual Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, March 9-11 2022, pp. 252–257.
  • [40] V. S. Mai, R. J. La, and T. Zhang, “Federated learning with server learning: Enhancing performance for non-iid data,” April 2023, arXiv:2210.02614v3 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2210.02614
  • [41] S. Chatterjee, “Matrix estimation by universal singular value thresholding,” The Annals of Statistics, vol. 43, no. 1, pp. 177–214, February 2015.
  • [42] J. Xu, “Rates of convergence of spectral methods for graphon estimation,” in Proceedings of the 35th Annual International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research (PMLR), vol. 80, Stockholm, Sweden, July 10-15 2018, pp. 5433–5442.
  • [43] A. Agarwal, D. Shah, D. Shen, and D. Song, “On robustness of principal component regression,” Journal of the American Statistical Association, vol. 116, no. 536, pp. 1731–1745, July 2021.
  • [44] Y. LeCun, C. Cortes, and C. J. C. Burges, “THE MNIST DATABASE of handwritten digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/
  • [45] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset.” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.htm
  • [46] G. H. Chen and D. Shah, Explaining the Success of Nearest Neighbor Methods in Prediction, ser. Foundations and Trends in Machine Learning, M. Jordan, Ed.   Hanover, MA, USA: now Publishers Inc., 2018, vol. 10, no. 5-6.
  • [47] L. Wasserman, “Statistical methods for machine learning,” May 2019, Department of Statistics and Data Science, CMU, Pittsburgh, PA, USA, Lecture Notes 36-708.
  • [48] E. A. Nadaraya, “On estimating regression,” Theory of Probability and Its Applications, vol. 9, no. 1, pp. 141–142, 1964.
  • [49] G. S. Watson, “Smooth regression analysis,” Sankhyā: The Indian Journal of Statistics, Series A, vol. 26, no. 4, pp. 359–372, December 1964.
  • [50] E. Fix and J. L. Hodges, “Discriminatory analysis. Nonparametric discrimination: Consistency properties,” University of California, Berkeley, USAF School of Aviation Medicine, Randolph Field, San Antonio, TX, USA, Tech. Rep., February 1951.
  • [51] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. IT-13, no. 1, pp. 21–27, January 1967.
  • [52] T. W. Körner, Fourier Analysis.   Cambridge, UK: Cambridge University Press, 1988.
  • [53] E. M. Stein and R. Shakarchi, Fourier Analysis: An Introduction, ser. Princeton Lectures in Analysis.   Princeton, NJ, USA: Princeton University Press, 2003, vol. 1.
  • [54] E. M. Stein and R. Shakarchi, Complex Analysis, ser. Princeton Lectures in Analysis.   Princeton, NJ, USA: Princeton University Press, 2003, vol. 2.
  • [55] E. J. Candès and Y. Plan, “Matrix completion with noise,” Proceedings of the IEEE, vol. 98, no. 6, pp. 925–936, June 2010.
  • [56] D. Shah, D. Song, Z. Xu, and Y. Yang, “Sample efficient reinforcement learning via low-rank matrix estimation,” June 2020, arXiv:2006.06135 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2006.06135
  • [57] B. F. Logan and L. A. Shepp, “Optimal reconstruction of a function from its projections,” Duke Mathematical Journal, vol. 42, no. 4, pp. 645–659, December 1975.
  • [58] D. L. Donoho and I. M. Johnstone, “Projection-based approximation and a duality with kernel methods,” The Annals of Statistics, vol. 17, no. 1, pp. 58–106, March 1989.
  • [59] J. Chen, T. Yang, and S. Zhu, “Efficient low-rank stochastic gradient descent methods for solving semidefinite programs,” in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland, April 22-25 2014, pp. 122–130.
  • [60] P. G. Constantine, E. Dow, and Q. Wang, “Active subspace methods in theory and practice: Applications to kriging surfaces,” SIAM Journal on Scientific Computing, vol. 36, no. 4, pp. A1500–A1524, 2014.
  • [61] G. Gur-Ari, D. A. Roberts, and E. Dyer, “Gradient descent happens in a tiny subspace,” December 2018, arXiv:1812.04754 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1812.04754
  • [62] L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou, “Empirical analysis of the Hessian of over-parametrized neural networks,” in Proceedings of the Sixth International Conference on Learning Representations (ICLR) Workshop, Vancouver, BC, Canada, April 30-May 3 2018, pp. 1–14.
  • [63] V. Papyan, “The full spectrum of deepnet Hessians at scale: Dynamics with SGD training and sample size,” June 2019, arXiv:1811.07062v2 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1811.07062
  • [64] C. Cui, K. Zhang, T. Daulbaev, J. Gusak, I. Oseledets, and Z. Zhang, “Active subspace of neural networks: Structural analysis and universal attacks,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 4, pp. 1096–1122, 2020.
  • [65] Y. Wu, X. Zhu, C. Wu, A. Wang, and R. Ge, “Dissecting Hessian: Understanding common structure of Hessian in neural networks,” June 2021, arXiv:2010.04261v5 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2010.04261
  • [66] S. P. Singh, G. Bachmann, and T. Hofmann, “Analytic insights into structure and rank of neural network Hessian maps,” in Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS), Virtual, December 6-14 2021, pp. 23 914–23 927.
  • [67] T. Le and S. Jegelka, “Training invariances and the low-rank phenomenon: beyond linear networks,” in Proceedings of the Tenth International Conference on Learning Representations (ICLR), Virtual, April 25-29 2022, pp. 1–26.
  • [68] T. Galanti, Z. S. Siegel, A. Gupte, and T. Poggio, “SGD and weight decay provably induce a low-rank bias in neural networks,” January 2023, arXiv:2206.05794v3 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2206.05794
  • [69] V. Papyan, X. Y. Han, and D. L. Donoho, “Prevalence of neural collapse during the terminal phase of deep learning training,” Proceedings of the National Academy of Sciences (PNAS), vol. 117, no. 40, pp. 24 652–24 663, October 2020.
  • [70] C. Fang, H. He, Q. Long, and W. J. Su, “Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training,” Proceedings of the National Academy of Sciences (PNAS), vol. 118, no. 43, pp. 1–12, October 2021.
  • [71] L. Hui, M. Belkin, and P. Nakkiran, “Limitations of neural collapse for understanding generalization in deep learning,” February 2022, arXiv:2202.08384 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2202.08384
  • [72] M. Gooneratne, K. C. Sim, P. Zadrazil, A. Kabel, F. Beaufays, and G. Motta, “Low-rank gradient approximation for memory-efficient on-device training of deep neural network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 4-8 2020, pp. 3017–3021.
  • [73] D. Hsu, S. M. Kakade, and T. Zhang, “Random design analysis of ridge regression,” in 25th Annual Conference on Learning Theory (COLT), Proceedings of Machine Learning Research (PMLR), vol. 23, no. 9, Edinburgh, Scotland, June 25-27 2012, pp. 1–24.
  • [74] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces, ser. Princeton Lectures in Analysis.   Princeton, NJ, USA: Princeton University Press, 2005, vol. 3.
  • [75] J. H. Manton and P.-O. Amblard, A Primer on Reproducing Kernel Hilbert Spaces, ser. Foundations and Trends in Signal Processing, Y. Eldar, Ed.   Hanover, MA, USA: now Publishers Inc., 2014, vol. 8, no. 1-2.
  • [76] A. S. Nemirovskiĭ and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization, ser. Wiley-Interscience Series in Discrete Mathematics and Optimization.   New York, NY, USA: John Wiley & Sons Inc., 1983, Translated from Russian by E. R. Dawson.
  • [77] B. Woodworth, J. Wang, A. Smith, B. McMahan, and N. Srebro, “Graph oracle models, lower bounds, and gaps for parallel stochastic optimization,” in Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS), Montréal, QC, Canada, December 2-8 2018, pp. 1–11.
  • [78] N. Michelusi, G. Scutari, and C.-S. Lee, “Finite-bit quantization for distributed algorithms with linear convergence,” May 2022, arXiv:2107.11304v3 [math.OC]. [Online]. Available: https://arxiv.org/abs/2107.11304v3
  • [79] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing communication and computation in distributed optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3141–3155, August 2019.
  • [80] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, March 2018.
  • [81] R. G. Gallager, Stochastic Processes: Theory for Applications.   New York, NY, USA: Cambridge University Press, 2013.
  • [82] P. Embrechts, C. Klüppelberg, and T. Mikosch, Modelling Extremal Events for Insurance and Finance, ser. Stochastic Modelling and Applied Probability.   Berlin, Heidelberg, Germany: Springer, 1997, vol. 33.
  • [83] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, ser. Cambridge Series in Statistical and Probabilistic Mathematics.   New York, NY, USA: Cambridge University Press, 2018, vol. 47.
  • [84] J. Xu, “Rates of convergence of spectral methods for graphon estimation,” September 2017, arXiv:1709.03183 [stat.ML]. [Online]. Available: https://arxiv.org/abs/1709.03183
  • [85] D. J. Aldous, “Representations for partially exchangeable arrays of random variables,” Journal of Multivariate Analysis, Elsevier, vol. 11, no. 4, pp. 581–598, December 1981.
  • [86] Y. Zhang, E. Levina, and J. Zhu, “Estimating network edge probabilities by neighbourhood smoothing,” Biometrika, vol. 104, no. 4, pp. 771–783, December 2017.
  • [87] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed.   New York, NY, USA: Cambridge University Press, 2013.
  • [88] T. M. Apostol, Mathematical Analysis, 2nd ed.   Reading, MA, USA: Addison-Wesley, 1974.
  • [89] T. Poggio, A. Banburski, and Q. Liao, “Theoretical issues in deep networks,” Proceedings of the National Academy of Sciences of the United States of America (PNAS), vol. 117, no. 48, pp. 30 039–30 045, June 2020.
  • [90] S. Dasgupta and A. Gupta, “An elementary proof of a theorem of Johnson and Lindenstrauss,” Random Structures and Algorithms, vol. 22, no. 1, pp. 60–65, January 2003.
  • [91] Y. Wu, “Information-theoretic methods for high-dimensional statistics,” January 2020, Department of Statistics and Data Science, Yale University, New Haven, CT, USA, Lecture Notes S&DS 677.
  • [92] W. Feller, An Introduction to Probability Theory and Its Applications, 3rd ed.   New York, NY, USA: John Wiley & Sons, Inc., 1968, vol. 1.
  • [93] Q. Nguyen, H. Valizadegan, and M. Hauskrecht, “Learning classification with auxiliary probabilistic information,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), Vancouver, BC, Canada, December 11-14 2011, pp. 477–486.
  • [94] R. W. D. Nickalls, “A new approach to solving the cubic: Cardan’s solution revealed,” The Mathematical Gazette, vol. 77, no. 480, pp. 354–359, November 1993.
  • [95] R. W. D. Nickalls, “Viète, Descartes and the cubic equation,” The Mathematical Gazette, vol. 90, no. 518, pp. 203–208, July 2006.