Federated Optimization of Smooth Loss Functions
Abstract
In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server seeks to minimize an ERM objective function using samples of training data that are stored across $m$ clients and the server. The recent flurry of research in this area has identified the Federated Averaging (FedAve) algorithm as the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, e.g., stochastic gradient descent, the convergence analysis of FedAve and its variants only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness in data in a federated learning context, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the gradient of the loss function, our algorithm first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients using its own gradients. Then, our algorithm solves the ERM problem at the server using an inexact gradient descent method. To theoretically demonstrate that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity in the optimization literature. Under some assumptions on the loss function, e.g., strong convexity and smoothness in the parameter, $(\eta,L_2)$-Hölder class smoothness in the data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting typically sub-dominant factors), where $\phi \gg 1$ is the ratio of client-to-server communication time to gradient computation time, $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small compared to the Hölder smoothness parameter $\eta$ and the loss function is sufficiently smooth in the data, i.e., $\eta/d$ is sufficiently large, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a general result on low rank approximation of smooth latent variable models.
Index Terms:
Federated learning, empirical risk minimization, gradient descent, Hölder class, low rank approximation.

I Introduction
Optimization for machine learning has become a prevalent subject of study due to the advent of large scale model learning tasks. Indeed, while optimization theory is a relatively mature field in applied mathematics (cf. [1, 2]), many of its recent developments have focused on its application to specialized machine learning settings. For instance, several common machine learning problems, such as classification and regression, can be posed as empirical risk minimization (ERM) problems. In the unconstrained case, ERM problems have the form [3], [4, Section 1]:
$$\min_{\theta \in \mathbb{R}^p} \hat{F}(\theta) \triangleq \frac{1}{n}\sum_{i=1}^{n} f\big(x^{(i)};\theta\big) \tag{1}$$
where $\theta \in \mathbb{R}^p$ is the parameter over which we optimize, $x^{(1)},\dots,x^{(n)} \in \mathbb{R}^d$ are some given training data, $f(x;\theta)$ denotes the loss function, and $\hat{F}$ defines the empirical risk (or objective function). There is a gargantuan literature that develops and analyzes iterative gradient-based optimization methods that solve the specific optimization in (1), e.g., (exact and projected) GD [2, 4], inexact GD [5, 6, 7, 8], stochastic gradient descent (SGD) [9], GD with momentum [10], accelerated GD [11], mini-batch SGD [12], variance reduction methods [13, 14, 15, 16], and methods with adaptive learning rates [17, 18, 19] (also see the recent monographs [4, 20, 21, 22], and the references therein).
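To make (1) concrete, the following sketch instantiates the ERM objective and its gradient for a hypothetical squared-error loss and runs plain GD on it; the data, loss, and step size are illustrative choices of ours, not ones prescribed by this paper.

```python
# Hypothetical instantiation of the ERM problem (1) with a squared-error loss
# f(x; theta) = 0.5 * (<x, theta> - y)^2 and plain gradient descent.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.uniform(size=(n, d))        # training data x^(1), ..., x^(n) in [0,1]^d
y = X @ rng.normal(size=d)          # synthetic regression targets

def empirical_risk(theta):
    """F(theta) = (1/n) * sum_i f(x^(i); theta), as in (1)."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

def empirical_risk_grad(theta):
    """Gradient of the empirical risk with respect to theta."""
    return X.T @ (X @ theta - y) / n

theta = np.zeros(d)
for _ in range(500):                # exact GD iterations on (1)
    theta -= 0.5 * empirical_risk_grad(theta)
print(empirical_risk(theta))        # approaches the minimum value of F
```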
However, as observed in [23], the optimization for machine learning literature typically analyzes convergence and oracle complexity of iterative gradient-based algorithms for (1) by posing the problem as, cf. [4, Section 1], [3]:
$$\min_{\theta \in \mathbb{R}^p} \hat{F}(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta) \tag{2}$$
where each $f_i(\cdot) \triangleq f\big(x^{(i)};\cdot\big)$. Hence, canonical analyses of iterative methods for the representation (2) do not effectively utilize information about the smoothness of the loss function with respect to the data $x$. As noted in [23], such smoothness properties are actually quite pervasive in ERM problems, e.g., (regularized) linear regression [3, Chapter 3], (regularized) logistic regression [3, Section 4.4], deep neural network training with smooth loss and activation functions [24, Chapter 5], etc. To address the paucity of theoretical analyses that exploit this additional smoothness information, [23] initiates a study of inexact gradient descent methods for ERM problems that learn the gradient oracle at every iteration by performing local polynomial regressions [25, 26, 27].
In this paper, we continue this important thread of exploiting the smoothness of loss functions in training data to improve optimization algorithms. But in contrast to [23], we consider the newly developed federated learning paradigm here, and utilize the approximate low rank structure induced by smoothness in data to theoretically ameliorate optimization performance in this paradigm. We briefly introduce the federated learning architecture and formally abstract it in Section I-A, and we subsequently explain our main contributions in Section I-B.
I-A Federated Learning
The recent escalation of mobile phone usage and emergence of the Internet of Things (IoT) has engendered unprecedented volumes of collected data on such devices. While it is clearly of utility to train machine learning models based on this data, various new challenges pertaining to privacy, scalability, distributed computation, etc. must be addressed before doing so. For instance, data collected on a user's mobile phone must be kept private, and even if privacy is not a concern, it is infeasible to communicate large quantities of data from millions of users to a central server for computation. To deal with such issues, a new paradigm known as federated learning has been formulated in the literature [28]. There are several classes of federated learning architectures (see, e.g., [29]): those with a synchronous central server, fully decentralized networks with synchronized clocks, and asynchronous versions of these topologies. We consider synchronous architectures with a central server in this paper.
In this synchronous setting, the architecture consists of a single central server and a collection of several clients (that all have synchronized clocks), cf. [28, 30, 31], and the references therein. Typically, the server and clients hold their own private data. So, to train machine learning models, federated optimization is carried out by clients training possibly inaccurate models based on their private data, and the server periodically aggregating client models to produce a more accurate model at the server. Several key dimensions are considered when studying such federated learning and optimization systems:
1. Privacy: The training data resides with the clients and is not shared with the server (and vice versa) [28]. However, the server must leverage the data stored at the clients via several communication rounds to improve the statistical accuracy of the learnt model.

2. Large client population with possibly inactive clients: Federated learning systems usually have a very large number of clients so that the client population size often dominates the number of training samples per client [28]. On the other hand, a nontrivial fraction of clients could be inactive during the learning process, e.g., mobile phones only send updates when they are online and being charged [30]. So, federated optimization algorithms must be robust to participation from smaller subsets of clients.

3. Communication cost: There is significant cost associated with clients communicating with the server. Typically, clients have limited upload bandwidth and servers have limited download bandwidth. In particular, if a large number of clients seek to simultaneously communicate with the server, the server will put the clients in a queue to receive their updates. So, federated optimization schemes often try to increase the amount of (cheap) computation at each client in order to reduce the frequency of communication with the server (e.g., by employing several local updates at the clients), and utilize smaller fractions of active clients in certain rounds [28, 30]. (Moreover, such schemes may also utilize quantization to compress updates from clients to servers [30]; we do not consider data compression in this work.)

4. Latency: Beyond the communication costs delineated above, the duration of client-to-server updates is determined by the slowest client among the set of active clients. Hence, using smaller fractions of active clients in certain rounds is also advantageous to eliminate stragglers. In fact, recent algorithms and analysis suggest that stragglers should be used at later stages of the learning process in order to reduce latency [31].

5. Data heterogeneity: Different clients may have data that is drawn from different population distributions [28, 29]. Although some works in the literature, e.g., [30, 31], assume for simplicity of analysis that clients are homogeneous, i.e., the training data at every client is drawn from a common (unknown) probability distribution, we do not need to impose such a stringent assumption in this work. In particular, our new algorithm in Algorithm 1 and its theoretical analysis in Theorem 2 and Section V will hold in the general heterogeneous setting. Usually, data homogeneity simplifies the analysis of stochastic federated optimization methods since clients' gradients become unbiased estimators of true full gradients [30]. However, since we analyze the (sample version of the) ERM problem given below in (3), which has the same form in both homogeneous and heterogeneous settings, and we will not use randomized estimators of true full gradients at the clients, our analysis is agnostic to whether the server and client data is homogeneous or heterogeneous.
Formally, we model synchronous federated learning with a single central server as follows. We assume that there are $m \in \mathbb{N}$ clients, indexed by $[m]$. The server is endowed with $r \in \mathbb{N}$ training data samples $x^{(0,1)},\dots,x^{(0,r)} \in [0,1]^d$, and each client $c \in [m]$ is given $s \in \mathbb{N}$ training data samples $x^{(c,1)},\dots,x^{(c,s)} \in [0,1]^d$, so that the total number of training data samples is $n = r + ms$.111It is assumed for simplicity that every client has the same number of training data samples. Under the assumptions outlined in Section II-B, the goal of the server is to utilize the clients to compute an $\epsilon$-approximate solution (in the sense of (11) or (12)) to the ERM problem (1), which may be rewritten as
$$\min_{\theta \in \mathbb{R}^p} \hat{F}(\theta) = \frac{1}{n}\left(\sum_{i=1}^{r} f\big(x^{(0,i)};\theta\big) + \sum_{c=1}^{m}\sum_{j=1}^{s} f\big(x^{(c,j)};\theta\big)\right). \tag{3}$$
To achieve this goal, we consider a specific family of federated optimization algorithms. In particular, any such algorithm runs over $T \in \mathbb{N}$ synchronized epochs (or rounds). During each epoch, the server may first communicate with $\tau m$ clients, where $\tau \in (0,1]$ denotes the proportion of active clients that participate in computation. Then, the server and the active clients perform gradient computations, or other less expensive computations, towards finding an $\epsilon$-approximate solution to (3), where we typically assume that all active clients perform the same number of gradient computations. Finally, each active client may communicate only one vector in $\mathbb{R}^p$ to the server (while taking care not to reveal their private data by transmitting vectors from which it could be easily extracted). This client-to-server communication has some nontrivial cost associated with it. On the other hand, it is usually assumed that server-to-client communication has negligible cost, because there are often no critical constraints on client download bandwidths and the server broadcasting bandwidth. So, the server broadcasts vectors in $\mathbb{R}^p$ to active clients during epochs at essentially no cost. We also note that every epoch need not contain the three aforementioned phases, i.e., the server need not always communicate with active clients or perform gradient computations, and clients need not always perform gradient computations and communicate with the server.
Several federated optimization algorithms for the synchronous central server setting have been proposed in the literature. The benchmark method in the field is known as Federated Averaging (FedAve) [28], where clients perform several iterations of (mini-batch) SGD using their private data, and the server aggregates client updates by periodically averaging them over several epochs. Needless to say, FedAve belongs to the aforementioned family of algorithms, and has been extended in several directions. For example, the authors of [32] introduce an accelerated version of the FedAve algorithm by introducing momentum, and the authors of [30] include a compression (or quantization) step before any client-to-server communication in the FedAve algorithm to reduce communication cost. In particular, the iteration complexity and convergence of FedAve are carefully analyzed in [30]. More generally, a unified analysis of the class of communication-efficient SGD algorithms is presented in [33]. Various other federated optimization methods have also been proposed that address different drawbacks of FedAve. For instance, federated optimization methods with better fixed point properties are developed in [34], latency of federated optimization algorithms is improved via adaptive node participation in [31], a federated optimization algorithm for training with only positive labels is presented in [35], and adaptive algorithms that incorporate more sophisticated gradient-based iterative methods, such as Adam [19], into the federated learning framework are proposed in [36]. In a different vein, one-shot federated learning is considered in [37], where only one round of client-to-server communication is used in the optimization process (also see follow-up work). We refer readers to [29] and the references therein for an in-depth further discussion of federated learning, distributed optimization, and related ideas.
Finally, we note that in the traditional federated learning literature, only the clients possess training data which the server cannot access. However, as mentioned above, our work assumes that the server also possesses some training data samples. As explained in [38], this is a reasonable assumption in many federated learning settings because the server could also function as a client and have access to its own private training data, it may have to collect some training data from clients (despite privacy concerns) for system monitoring purposes, it may collect data from test devices (or test clients) which do not participate in training, and it could even possess simulated training data from models. For these reasons, there is a growing recent literature on federated learning with server data, cf. [39, 40, 38] and the references therein. Furthermore, as remarked earlier, the server and client datasets can be heterogeneous in our work since the smoothness (or approximate low rankness) of loss functions will circumvent the need for data homogeneity in our analysis.
I-B Main Contributions
As mentioned earlier, the vast literature on federated optimization, including the aforementioned works, deals with the ERM problem representation in (2). Thus, these works do not exploit smoothness of loss functions in the training data for federated learning. In this paper, we fill this void by making the following key theoretical contributions:
1. We define a notion of federated oracle complexity in Definition 1 to capture the running time of federated optimization algorithms, and argue its suitability using some simple probabilistic analysis. This formally generalizes classical oracle complexity of iterative optimization methods (see [2, Section 1.1]) to a federated learning setting. Our definition allows us to theoretically compare our proposed federated optimization algorithm (mentioned next) with the benchmark FedAve method from the literature. We believe that this definition will be of further utility in future research in the area.

2. We propose a new Federated Low Rank Gradient Descent (FedLRGD) method in Algorithm 1 to solve the ERM problem (1), or equivalently, (3), in the federated learning setting. Our method exploits the approximate low rankness of the gradient of the loss function, which is a consequence of its smoothness in data, by using a few rounds of communication to first learn weights that allow the server to approximate clients' gradients by computing weighted sums of its own gradients. Then, the server can solve (1) by running an inexact GD method on its own, where the server estimates full gradients at each iteration from gradients corresponding to its private data. Moreover, like other federated optimization algorithms in the literature, our method preserves the privacy of the server and clients by never exchanging data samples, and it works in settings where server and client data is heterogeneous.

3. Under the smoothness, strong convexity, and non-singularity assumptions outlined in Section II-B and Section III-C, we analyze the FedLRGD algorithm and show in Theorem 2 that its federated oracle complexity scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ (neglecting typically sub-dominant factors), where $\phi$ is the ratio of communication cost to computation cost (see Section II-D), $\epsilon$ is the desired approximation accuracy of the solution (see Section II-B), and $\eta$ denotes a Hölder class parameter which determines the level of smoothness of the loss function in the data (see Section II-B).

4. We also calculate the federated oracle complexity of the FedAve algorithm in Theorem 3 by building on known results from the literature, cf. [30], and show that it scales like $\phi m(p/\epsilon)^{3/4}$. Using this calculation, we demonstrate in Proposition 1 that when the data dimension $d$ is small and the gradient of the loss function is sufficiently smooth in the training data so that it is approximately low rank, i.e., the ratio $d/\eta$ is sufficiently small, the federated oracle complexity of FedLRGD is much smaller than that of the benchmark FedAve algorithm.

5. In the process of analyzing the FedLRGD algorithm, we also derive a new statistical result in Theorem 1 that utilizes the smoothness of loss functions in the training data to bound the Frobenius norm error in estimating a latent variable model (or graphon) corresponding to the loss function with its low rank approximation. This result generalizes the graphon approximation results in [41, 42, 43], and could be of independent utility for future research in high-dimensional statistics.
I-C Outline
We briefly delineate the remainder of our discussion here. In Section II, we precisely state and explain the notation and assumptions used in our subsequent analysis, and also introduce and expound upon the concept of federated oracle complexity. In Section III, we present and discuss our main technical results and the FedLRGD algorithm mentioned above. Then, we provide proofs of our results on latent variable model approximation and the federated oracle complexity of FedLRGD in Section IV and Section V, respectively. The proofs of all other results are deferred to the appendices. Although the focus of this work is to theoretically analyze FedLRGD and demonstrate the resulting gain in federated oracle complexity, we also present some pertinent simulation results in Appendix C. These simulations illustrate that gradients of loss functions corresponding to neural networks trained on the MNIST database for handwritten digit recognition (see [44]) and CIFAR-10 dataset of tiny images (see [45]) exhibit approximate low rankness. This suggests that algorithms which exploit low rank structure, such as FedLRGD, could have potential impact in applications. Finally, we conclude our discussion in Section VI.
II Formal Setup
Before deriving our main results, we need to present some essential formalism. To that end, Section II-A contains various notational conventions used in this work, Section II-B illustrates the precise conditions needed for our results to hold, Section II-C explains the reasoning behind these conditions and provides some related literature, and Section II-D develops the notion of federated oracle complexity which measures the running time of federated optimization algorithms.
II-A Notational Preliminaries
In this paper, we let $\mathbb{N}$, $\mathbb{Z}_{\ge 0}$, and $\mathbb{R}_{\ge 0}$ denote the sets of natural numbers, non-negative integers, and non-negative real numbers, respectively, and for $n \in \mathbb{N}$, we let $[n] \triangleq \{1,\dots,n\}$. We let $\mathbb{P}$ and $\mathbb{E}$ denote the probability and expectation operators, respectively, where the underlying probability laws will be clear from context. Furthermore, $\exp(\cdot)$ and $\log(\cdot)$ denote the natural exponential and logarithm functions with base $e$, respectively, $\mathbb{1}\{\cdot\}$ denotes the indicator function that equals $1$ if its input proposition is true and equals $0$ otherwise, and $\lceil\cdot\rceil$ denotes the ceiling function. Lastly, we will utilize Bachmann-Landau asymptotic notation throughout this paper, e.g., $O(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$, etc., where the limiting variable will be clear from context.
In addition, for any matrix $A \in \mathbb{R}^{q \times v}$ with $q, v \in \mathbb{N}$, $A_{i,j}$ denotes the $(i,j)$th entry of $A$ for $i \in [q]$ and $j \in [v]$, $A^{\mathrm{T}}$ denotes the transpose of $A$, $A^{-1}$ denotes the inverse of $A$ (when $q = v$ and $A$ is non-singular), $\operatorname{rank}(A)$ denotes the rank of $A$, $\|A\|_{\mathrm{F}}$ denotes the Frobenius norm of $A$, and $\|A\|_{\mathrm{op}}$ denotes the spectral or (induced $\ell^2$) operator norm of $A$. Similarly, for any (column) vector $y \in \mathbb{R}^v$, $y_i$ denotes the $i$th entry of $y$ for $i \in [v]$, and $\|y\|_q$ denotes the $\ell^q$-norm of $y$ for $q \in [1,\infty]$. Moreover, for any continuously differentiable function $h : \mathbb{R}^v \to \mathbb{R}$, $y \mapsto h(y)$, we define $\frac{\partial h}{\partial y_i}$ to be its partial derivative with respect to $y_i$ for $i \in [v]$, and $\nabla h \triangleq \big(\frac{\partial h}{\partial y_1},\dots,\frac{\partial h}{\partial y_v}\big)^{\mathrm{T}}$ to be its gradient. As in [23], we will also require some multi-index notation. So, for any $d$-tuple $t = (t_1,\dots,t_d) \in \mathbb{Z}_{\ge 0}^d$, we let $|t| \triangleq t_1 + \cdots + t_d$, $t! \triangleq t_1! \cdots t_d!$, $y^t \triangleq y_1^{t_1}\cdots y_d^{t_d}$ for $y \in \mathbb{R}^d$, and $\nabla^t h \triangleq \frac{\partial^{|t|} h}{\partial y_1^{t_1}\cdots \partial y_d^{t_d}}$ be the $t$th partial derivative of $h$.
II-B Assumptions on Loss Function
As mentioned earlier, for any $n \in \mathbb{N}$, assume that we are given $n$ samples of training data $x^{(1)},\dots,x^{(n)} \in [0,1]^d$ and a loss function $f : [0,1]^d \times \mathbb{R}^p \to \mathbb{R}$, $(x;\theta) \mapsto f(x;\theta)$, where $\theta \in \mathbb{R}^p$ is the parameter vector. In our ensuing analysis, we will impose the following assumptions on the loss function (cf. [2, 4, 23]):
1. Smoothness in parameter $\theta$: There exists $L_1 > 0$ such that for all $x \in [0,1]^d$, the gradient $\nabla_\theta f(x;\cdot)$ exists and is $L_1$-Lipschitz continuous:
$$\forall \theta^{(1)},\theta^{(2)} \in \mathbb{R}^p, \quad \big\|\nabla_\theta f\big(x;\theta^{(1)}\big) - \nabla_\theta f\big(x;\theta^{(2)}\big)\big\|_2 \le L_1\big\|\theta^{(1)} - \theta^{(2)}\big\|_2. \tag{4}$$

2. Strong convexity: There exists $\mu > 0$ such that for all $x \in [0,1]^d$, $f(x;\cdot)$ is $\mu$-strongly convex:
$$\forall \theta^{(1)},\theta^{(2)} \in \mathbb{R}^p, \quad f\big(x;\theta^{(1)}\big) \ge f\big(x;\theta^{(2)}\big) + \nabla_\theta f\big(x;\theta^{(2)}\big)^{\mathrm{T}}\big(\theta^{(1)} - \theta^{(2)}\big) + \frac{\mu}{2}\big\|\theta^{(1)} - \theta^{(2)}\big\|_2^2. \tag{5}$$
Furthermore, we let $\kappa \triangleq L_1/\mu \ge 1$ be the condition number of the loss function.

3. Smoothness in data $x$: There exist $\eta > 0$ and $L_2 > 0$ such that for all fixed $\theta \in \mathbb{R}^p$ and for all $k \in [p]$, the $k$th partial derivative of $f$ at $\theta$, denoted $g_k(\cdot;\theta) \triangleq \frac{\partial f}{\partial \theta_k}(\cdot;\theta)$, belongs to the $(\eta,L_2)$-Hölder class (cf. [27, Definition 1.2]), i.e., $g_k(\cdot;\theta)$ is $l \triangleq \lceil\eta\rceil - 1$ times differentiable, and for every multi-index $t \in \mathbb{Z}_{\ge 0}^d$ such that $|t| = l$, we have
$$\forall x^{(1)},x^{(2)} \in [0,1]^d, \quad \big|\nabla^t g_k\big(x^{(1)};\theta\big) - \nabla^t g_k\big(x^{(2)};\theta\big)\big| \le L_2\big\|x^{(1)} - x^{(2)}\big\|_1^{\eta-l}. \tag{6}$$

4. Non-singularity: For any (large) size $v \in \mathbb{N}$, any index $k \in [p]$, any sequence of distinct data samples $y^{(1)},\dots,y^{(v)} \in [0,1]^d$, and any sequence of distinct parameter values $\alpha^{(1)},\dots,\alpha^{(v)} \in \mathbb{R}^p$,222More formally, this assumption only holds for almost every such sequence, but we neglect this detail here and in the sequel for simplicity. the matrix of partial derivatives $G^{(k)} \in \mathbb{R}^{v \times v}$ given by:
$$G^{(k)}_{i,j} = g_k\big(y^{(i)};\alpha^{(j)}\big), \quad \text{for } i,j \in [v], \tag{7}$$
is invertible. Note that we suppress the dependence of $G^{(k)}$ on $v$ and the sequences of data and parameters for convenience.

5. Conditioning: There exists an absolute constant $\alpha \ge 0$ such that for any (large) $v \in \mathbb{N}$, any $k \in [p]$, any sequence of distinct data samples $y^{(1)},\dots,y^{(v)} \in [0,1]^d$, and any sequence of distinct parameter values $\alpha^{(1)},\dots,\alpha^{(v)} \in \mathbb{R}^p$,333More formally, one would need to impose additional distributional assumptions to prove that this assumption holds for random data and parameters with high probability, but we also neglect this detail here and in the sequel for simplicity. the matrix $G^{(k)}$ given in (7) satisfies
$$\sigma_{\min}\big(G^{(k)}\big) = \Omega\big(v^{-\alpha}\big), \tag{8}$$
i.e., the minimum singular value of $G^{(k)}$ decreases to $0$ at most polynomially fast in the dimension $v$.

6. Boundedness: There exists an absolute constant $B_0 > 0$ such that for all $k \in [p]$, we have
$$\sup_{x \in [0,1]^d}\,\sup_{\theta \in \mathbb{R}^p}\big|g_k(x;\theta)\big| \le B_0. \tag{9}$$
Next, let $\epsilon > 0$ be the desired approximation accuracy of the solution, $\theta^* \in \mathbb{R}^p$ be the unique global minimizer of the objective function $\hat{F}$ in (1), and
$$\hat{F}^* \triangleq \min_{\theta \in \mathbb{R}^p} \hat{F}(\theta) = \hat{F}(\theta^*). \tag{10}$$
(Note that $\theta^*$ exists and $\hat{F}^*$ is finite due to the strong convexity assumption.) In the sequel, we consider federated optimization algorithms that output an $\epsilon$-approximate solution $\hat{\theta}$ to (1) such that:

1. If the algorithm generates a deterministic solution $\hat{\theta} \in \mathbb{R}^p$, we have
$$\hat{F}(\hat{\theta}) - \hat{F}^* \le \epsilon. \tag{11}$$

2. If the (stochastic) algorithm produces a random solution $\hat{\theta} \in \mathbb{R}^p$, we have
$$\mathbb{E}\big[\hat{F}(\hat{\theta})\big] - \hat{F}^* \le \epsilon, \tag{12}$$
where the expectation is computed with respect to the law of $\hat{\theta}$.
II-C Motivation and Background Literature for Assumptions
We briefly explain some of the important assumptions in Section II-B and contextualize them within the optimization and statistics literatures.
Firstly, smoothness in parameter $\theta$ is a standard assumption in the optimization for machine learning literature (see, e.g., [2, 4], and the references therein). This assumption implies that the empirical risk $\hat{F}$ has $L_1$-Lipschitz continuous gradient (which is used in the proof of Theorem 2 in Section V, cf. [23, Section 4.1]). In particular, when the gradient of the objective function is $L_1$-Lipschitz continuous, it can be shown that each iteration of GD reduces the objective value:
$$\hat{F}\!\left(\theta - \frac{1}{L_1}\nabla\hat{F}(\theta)\right) - \hat{F}(\theta) \le -\frac{1}{2L_1}\big\|\nabla\hat{F}(\theta)\big\|_2^2 \tag{13}$$
for all $\theta \in \mathbb{R}^p$ [4, Equation (3.5)]. The inequality (13) illustrates the greedy nature in which GD decreases its objective value and is fundamental in the analysis of iterative descent methods. For this reason, this assumption has been used extensively in the optimization literature to prove convergence guarantees in a variety of settings, e.g., [9, 8, 5, 12, 13, 16, 6, 14, 21], etc. We also impose this assumption since the analysis to obtain Lemma 2 in Section V, which is a restatement of [5, Lemma 2.1], requires it.
Similarly, strong convexity (in parameter) is another standard assumption in the optimization for machine learning literature (see, e.g., [2, 4], and the references therein). As before, this assumption implies that the empirical risk $\hat{F}$ is $\mu$-strongly convex (which is used in the proof of Theorem 2 in Section V, cf. [2, Lemma 2.1.4], [23, Section 4.1]). As explained in [4], the main utility of imposing strong convexity of the objective function is a notable and provable speed-up in the convergence rate of gradient-based iterative methods. Specifically, GD can achieve linear convergence rate (i.e., exponentially fast convergence) when the objective function is strongly convex [2, 4]. In addition, the strong convexity assumption often leads to far simpler proofs of convergence and oracle complexity, and easier illustration of the benefits of improvements like momentum (or heavy-ball method) [10] and acceleration [11] on oracle complexity (cf. [2, 21]). For these reasons, this assumption has also been used extensively in the optimization literature to prove convergence guarantees in a variety of settings, e.g., [9, 8, 5, 13, 16, 6, 14, 21], etc. We also assume strong convexity for these reasons; indeed, strong convexity is needed for Lemma 2 in Section V, which is a restatement of [5, Lemma 2.1], to hold and the first term in the bound in Lemma 2 illustrates the aforementioned exponential convergence.
Smoothness in data is a standard assumption in the non-parametric estimation literature (see, e.g., [27, 46, 47], and the references therein). In simple terms, we can think of a function in a Hölder class with positive integer parameter $\eta$ as an $(\eta-1)$-times differentiable function with Lipschitz continuous $(\eta-1)$th partial derivatives (or even an $\eta$-times differentiable function with bounded $\eta$th partial derivatives). Hence, the size of the parameter $\eta$ controls the degree of smoothness of functions in a Hölder class. Such Hölder class assumptions are used in various non-parametric regression techniques, e.g., kernel regression such as Nadaraya-Watson estimators [48, 49, 27], local polynomial regression [25, 26, 27], and nearest neighbor methods [50, 51, 46]. At a high-level, the basic idea of imposing smoothness assumptions for regression is the following: If an unknown function living in a "large" (or non-parametric) class is known to be smoother, then it can be better estimated from samples using Taylor polynomials. When defining the Hölder condition in (6), one can in principle use any $\ell^q$-norm (see, e.g., [42, 47, 43, 23]). We use the $\ell^1$-norm to define (6) because this yields the largest class of functions for fixed parameters $\eta$ and $L_2$; indeed, if (6) is satisfied for any $\ell^q$-norm with $q \ge 1$, then it is also satisfied for the $\ell^1$-norm using the monotonicity of $\ell^q$-norms. In this sense, our definition includes the largest class of functions. In Appendix A, we provide an example of a commonly used loss function in machine learning that satisfies the aforementioned smoothness and strong convexity assumptions.
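As a quick numerical illustration of this smoothness-in-data assumption, the sketch below empirically estimates the $\ell^1$-Lipschitz constant (the $\eta = 1$ case of (6)) of one partial derivative $g_k$ for the smooth loss $f(x;\theta) = \log(1 + \exp(\langle\theta, x\rangle))$; this loss and the Monte Carlo estimation procedure are our own illustrative choices, not constructions from the paper.

```python
# Hypothetical check that a partial derivative g_k(x; theta) = df/d(theta_k)
# is Lipschitz in the data x (the eta = 1 Holder condition in (6)) for the
# smooth loss f(x; theta) = log(1 + exp(<theta, x>)).
import numpy as np

rng = np.random.default_rng(1)
d = 4                                  # here the data and parameter dims coincide
theta = rng.normal(size=d)

def g(x, k):
    """k-th partial derivative of f w.r.t. theta: sigmoid(<theta, x>) * x_k."""
    return x[k] / (1.0 + np.exp(-theta @ x))

ratios = []
for _ in range(10_000):
    x1, x2 = rng.uniform(size=d), rng.uniform(size=d)
    ratios.append(abs(g(x1, 0) - g(x2, 0)) / np.sum(np.abs(x1 - x2)))
print(max(ratios))   # empirical l1-Lipschitz constant of g_0(.; theta) on [0,1]^d
```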
In a recent sequence of works in high-dimensional statistics, it has been shown that matrices defined by bivariate graphon functions or latent variable models are approximately low rank when the function satisfies Hölder class assumptions on one of the variables [41, 42, 43]. (We generalize these results in Theorem 1.) Partial derivatives of loss functions can be construed as bivariate functions of the data $x$ and parameter $\theta$, and we impose Hölder class smoothness in data on them in order to exploit the resulting approximate low rank structure for federated optimization (as we will elaborate in Section III). As noted earlier and in [23], smoothness in data exists in a variety of ERM problems, e.g., (regularized) linear regression [3, Chapter 3], (regularized) logistic regression [3, Section 4.4], deep neural network training with smooth loss and activation functions [24, Chapter 5], etc. The close connection between smoothness of bivariate functions and approximate low rank structure may seem peculiar at first glance. Perceiving a bivariate function as a kernel of an integral operator, the results in [41, 42, 43] suggest that smoothness of the kernel is closely related to approximate low rankness of the associated integral operator. This relation has its roots in more familiar ideas from harmonic analysis. Indeed, in the setting of difference (or convolution) kernels that are defined by a univariate function, increasing the smoothness of the function is essentially equivalent to faster decay of its Fourier transform [52, Theorem 60.2] (also see [53, 54]). This, in turn, can be interpreted as approximate low rankness of the associated convolution operator, because the Fourier transform is precisely the spectrum of the convolution operator.
Approximate low rank structure, broadly construed, has been utilized in a variety of areas, e.g., matrix estimation [55], reinforcement learning [56], function approximation via ridge functions [57, 58], semidefinite programming [59], uncertainty quantification [60], etc. In this work, we seek to exploit approximate low rank structure in the context of federated optimization. Just as smoothness is prevalent in many ERM problems, a litany of empirical investigations in the literature have directly demonstrated various forms of approximate low rank structure in large-scale machine learning tasks. For instance, the empirical risk and loss functions corresponding to deep neural network training problems display approximate low rank structure in their gradients (which live in low-dimensional subspaces) or Hessians (which have decaying eigenvalues, thereby inducing low rank gradients via Taylor approximation arguments) [61, 62, 63, 64, 65, 66]. In particular, [61] shows that when training deep neural networks for $K$-ary classification, the computed gradients belong to $K$-dimensional eigenspaces of the Hessian corresponding to its dominant eigenvalues. On a related front, low rankness has also been observed in weight matrices of deep neural networks [67, 68], neural collapse has been demonstrated when training deep neural networks [69, 70, 71], and training memory for neural networks has been saved using low rank parameterizations of gradient weight matrices [72]. Our work on exploiting approximate low rank structure for federated optimization, albeit different, is inspired by many of these observations.
The non-singularity, conditioning, and boundedness assumptions in Section II-B and Section III-C have been imposed for technical purposes for the proofs. Variants of such assumptions pertaining to matrix invertibility and spectra are often made for analytical tractability in theoretical studies in high-dimensional statistics (see, e.g., the invertibility and spectral boundedness assumptions made in [73, Section 2] to analyze linear regression). While it is difficult to provide simple and easily interpretable sufficient conditions that ensure the non-singularity assumptions in Section II-B and Section III-C, we provide some intuition for when we expect such assumptions to hold. To this end, consider the non-singularity assumption in Section II-B. As before, let us construe $g_k(x;\theta)$ as the kernel of an integral operator. Then, the spectral theory (specifically, singular value decomposition) of compact operators states that the kernel can be written as [56, Theorem 9] (also see [74]):
$$g_k(x;\theta) = \sum_{j=1}^{\infty} \sigma_j\,\phi_j(x)\,\psi_j(\theta), \tag{14}$$
where $\{\sigma_j \ge 0\}$ are singular values that converge to $0$ as $j \to \infty$, and $\{\phi_j\}$ and $\{\psi_j\}$ are orthonormal bases of singular vector functions, and although the equality above usually holds in an $L^2$-sense, it can also hold in a pointwise sense under some assumptions. Hence, for data samples $y^{(1)},\dots,y^{(v)}$ and parameter values $\alpha^{(1)},\dots,\alpha^{(v)}$, we may write the matrix $G^{(k)}$ as
$$G^{(k)} = \sum_{j=1}^{\infty} \sigma_j\,\phi_j\,\psi_j^{\mathrm{T}}, \tag{15}$$
using (7) and (14), where we let $\phi_j = \big(\phi_j(y^{(1)}),\dots,\phi_j(y^{(v)})\big)^{\mathrm{T}}$ and $\psi_j = \big(\psi_j(\alpha^{(1)}),\dots,\psi_j(\alpha^{(v)})\big)^{\mathrm{T}}$. Since the integral operator corresponding to $g_k$ is a map between infinite-dimensional spaces, we typically have infinitely many non-zero $\sigma_j$'s. Moreover, since $\{\phi_j\}$ and $\{\psi_j\}$ are linearly independent sets of functions, we typically have at least $v$ non-zero unit-rank terms in (15), with linearly independent $\phi_j$'s and $\psi_j$'s, whose linear independence is not cancelled out by the other terms. Thus, $G^{(k)}$ typically has rank $v$ and is invertible. This provides some intuition behind the non-singularity assumption in Section II-B. The non-singularity of monomials and approximation assumptions in Section III-C have similar underlying intuition based on sampling of linearly independent functions (cf. (45) and (44), (46), respectively). In closing this subsection, we remark that one well-known setting where all sampled matrices of a symmetric bivariate function are invertible, and in fact, positive definite, is when the bivariate function itself defines a valid positive definite kernel (in the reproducing kernel Hilbert space sense), cf. [75, Sections 4.2-4.6]. In the context of Section II-B, we cannot directly assume that every $g_k$ is a positive definite kernel since $g_k$ is asymmetric in general and we only require invertibility of sampled matrices, not positive definiteness, which is a much stronger condition. However, alternative versions of such kernel conditions could be developed in future work to provide cleaner assumptions that imply various non-singularity conditions.
Finally, we note that, strictly speaking, the boundedness and strong convexity assumptions make sense simultaneously when the parameter space is compact. Although our analysis is carried out with the parameter space $\mathbb{R}^p$, since iterates of inexact GD in Algorithm 1 only live in some compact space, our analysis carries over to the compact setting. For example, one way to rigorously carry over our analysis when it is assumed that the parameter space is compact is to run projected inexact GD instead of inexact GD in Algorithm 1. Then, the parameter iterates of Algorithm 1 remain in the compact space, and we can execute the analysis of our algorithm using the convergence analysis in, e.g., [8, Proposition 3]. For simplicity of exposition, however, we do not use projected inexact GD and simply take the parameter space as $\mathbb{R}^p$ in this paper.
II-D Federated Oracle Complexity
In the traditional (non-distributed) optimization literature, the running time of first-order iterative methods, such as GD and SGD, is abstractly measured by the notion of (first-order) oracle complexity, i.e., the total number of queries to an oracle that returns the gradient of the loss function with respect to the parameter [76] (also see [2, Section 1.1]). This is because gradient computations are usually the most expensive operations in such algorithms. However, the problem of measuring running time is quite different for federated optimization algorithms, because the cost of clients communicating with servers is typically much higher than the running time of gradient computations. Currently, simulations based on several informal principles (such as the aforementioned one) are often used to empirically compare the efficiency of different federated optimization algorithms, rather than theoretically analyzing any formal notion of complexity; see, e.g., [30] and the references therein. To partially remedy this, inspired by [30], we propose a new definition of federated oracle complexity by formalizing some of the informal rules delineated in Section I-A using simple probabilistic approximations. We note that other notions of oracle complexity for distributed optimization have either implicitly or explicitly been posed in the literature, cf. graph oracle models [77] or [78] and the references therein. In fact, our definition of federated oracle complexity is reminiscent of the complexity measure introduced in [79], but our definition is adapted to the centralized, synchronous federated learning setting rather than distributed optimization frameworks.
Consider any algorithm belonging to the family outlined in Section I-A. In the sequel, we will approximate the running times of the two time intensive phases of each epoch: the gradient computation phase and the client-to-server communication phase.
II-D1 Gradient Computation Phase
In the algorithms we consider in this work, there will be two kinds of gradient computation phases; either only active clients will simultaneously perform $B$ gradient computations each, or only the server will perform $B$ gradient computations. In the former case, each active client computes $B$ gradients of the loss function with respect to the parameter in an epoch. As in the traditional setting, these gradient computations dominate the running time of the client. For each client, its successive gradient computations are modeled as successive delayed Poisson arrivals, cf. [80]. Equivalently, each gradient computation is assigned a statistically independent (abstract) running time of $\frac{1}{\lambda} + X$ units, where $\frac{1}{\lambda} > 0$ is the delay and $X$ is an exponential random variable with rate $\lambda > 0$ [80].444For simplicity, we assume that the shift and scale parameters of the shifted exponential distribution are equal. Thus, the running time of each client in an epoch is $\frac{B}{\lambda} + Y$ units, where the random variable $Y$ has an Erlang (or gamma) distribution with shape parameter $B$ and rate $\lambda$ [81, Section 2.2.2]. Since only $\tau m$ clients are active in any epoch, the total computation time of an epoch is determined by the slowest client, and is equal to $\frac{B}{\lambda} + \max\{Y_1,\dots,Y_{\tau m}\}$ units, where $Y_1,\dots,Y_{\tau m}$ are independent and identically distributed (i.i.d.) Erlang random variables with shape parameter $B$ and rate $\lambda$.
In most federated learning contexts, $\tau m$ is very large and we can employ extreme value theory to approximate the probability distribution of $\max\{Y_1,\dots,Y_{\tau m}\}$. To this end, observe that the cumulative distribution function (CDF) $F_Y$ of $Y_1$ is a von Mises function with auxiliary function, cf. [82, Section 3.3.3, Example 3.3.21]:
$$a(x) \triangleq \frac{1 - F_Y(x)}{F_Y'(x)} = \frac{1}{\lambda}\sum_{j=0}^{B-1}\frac{(B-1)!}{j!}\,(\lambda x)^{j-B+1} \approx \frac{1}{\lambda}, \tag{16}$$
where the second (approximate) equality holds when $\lambda x \gg B$ (as will be the case for us). Hence, employing the Fisher-Tippett-Gnedenko theorem [82, Theorem 3.2.3], we have convergence in distribution to the Gumbel distribution (or the generalized extreme value distribution of Type I):
$$\lim_{\tau m \to \infty} \mathbb{P}\!\left(\frac{\max\{Y_1,\dots,Y_{\tau m}\} - F_Y^{-1}\big(1 - \frac{1}{\tau m}\big)}{a\big(F_Y^{-1}\big(1 - \frac{1}{\tau m}\big)\big)} \le z\right) = \exp\!\big({-e^{-z}}\big) \quad \text{for } z \in \mathbb{R}, \tag{17}$$
where $F_Y^{-1}$ is the inverse of $F_Y$, i.e., the quantile function, and we also utilize [82, Proposition 3.3.25]. According to Proposition 2 in Appendix B, $F_Y^{-1}\big(1 - \frac{1}{\tau m}\big) = O\big(\frac{B + \log(\tau m)}{\lambda}\big)$, and we estimate it as follows for simplicity:
$$F_Y^{-1}\!\left(1 - \frac{1}{\tau m}\right) \approx \frac{B + \log(\tau m)}{\lambda}. \tag{18}$$
Thus, using (17), the expected total computation time of an epoch can be approximated as
$$\mathbb{E}\!\left[\frac{B}{\lambda} + \max\{Y_1,\dots,Y_{\tau m}\}\right] \approx \frac{B}{\lambda} + F_Y^{-1}\!\left(1 - \frac{1}{\tau m}\right) + \frac{\gamma}{\lambda} \approx \frac{2B + \log(\tau m)}{\lambda} + \frac{\gamma}{\lambda} \approx \frac{2B + \log(\tau m)}{\lambda}, \tag{19}$$
where the Euler-Mascheroni constant $\gamma = -\psi(1)$ (with $\psi(\cdot)$ denoting the digamma function) is the expected value of the Gumbel distribution with CDF $\exp(-e^{-z})$, the first approximation follows from (16) and (17) (neglecting lower order terms), the second follows from (18), and we neglect the constant $\frac{\gamma}{\lambda}$ in the final step. Since the server does not compute any gradients in this case, any other computation performed by the server or clients is assumed to have negligible cost compared to that in (19).
On the other hand, in the case where only the server computes $B$ gradients of the loss function with respect to the parameter in an epoch, we may construe the server as a single active client. So, the running time of the server in an epoch is $\frac{B}{\lambda} + Y$ units, where $Y$ has an Erlang distribution with shape parameter $B$ and rate $\lambda$. Hence, the expected total computation time of an epoch is
$$\mathbb{E}\!\left[\frac{B}{\lambda} + Y\right] = \frac{2B}{\lambda}. \tag{20}$$
Combining (19) and (20), the expected total computation time per epoch is given by
$$\frac{1}{\lambda}\Big(2B + \log(\tau m)\,\mathbb{1}\{\text{active clients compute gradients in the epoch}\}\Big). \tag{21}$$
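The following Monte Carlo sketch checks this extreme value approximation numerically; the constants follow our reading of (16)-(19), so the comparison is indicative rather than exact, and the parameter values are arbitrary.

```python
# Monte Carlo check of (19): with tau*m active clients, each computing B
# gradients whose durations are 1/lambda + Exp(lambda), the expected epoch
# time should be roughly (2B + log(tau*m))/lambda.
import numpy as np

rng = np.random.default_rng(2)
lam, B, active = 1.0, 2, 1000                     # rate, gradients/client, tau*m

Y = rng.gamma(shape=B, scale=1.0 / lam, size=(5_000, active))
simulated = B / lam + Y.max(axis=1).mean()        # B/lambda shift + slowest client
approximation = (2 * B + np.log(active)) / lam    # right-hand side of (19)
print(simulated, approximation)                   # the two should be close
```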
II-D2 Client-to-Server Communication Phase
The total communication cost in any epoch is dominated by the client-to-server communication. Since the server has limited bandwidth and often uses a queue to receive updates from clients, this cost is proportional to the number of active clients $\tau m$. To see this, let us model the client-to-server communication as a first in, first out (FIFO) M/M/1 queue [81].555We utilize Kendall's notation, where the first "M" implies that arrivals obey a Poisson process, the second "M" implies that the service time is memoryless, and the "1" denotes that we only have one server. In such a queue, the updates from active clients arrive at the server according to a Poisson process with rate $\nu > 0$ (for simplicity) [81, Chapter 2]. Moreover, the server only accepts the transmission of any one active client at any instant in time, and from the instant an update arrives at the server, it takes a total transmission time given by an independent exponential distribution with rate $\mu_0$.666The assumption that $\nu < \mu_0$ ensures that the M/M/1 queue is "stable," and the number of active clients in the queue does not diverge over time. Any other active client updates that arrive during this transmission time are put in the FIFO queue for eventual transmission to the server. Finally, let us assume for simplicity that the M/M/1 queue is in steady state.
We seek to calculate the expected time it takes for all active clients to complete transmitting their updates to the server. So, let time $0$ denote the beginning of the (synchronized) client-to-server communication phase, and define the departure process $\{N(t) : t \ge 0\}$, where $N(t)$ is the number of client updates that have arrived at the server and been entirely transmitted to the server up to time $t$ (this excludes any client updates in the queue at time $t$). By Burke's theorem [81, Theorem 7.6.5], $\{N(t) : t \ge 0\}$ is a Poisson process with rate $\nu$, i.e., the departure process has the same rate as the arrival process. Hence, the total time it takes for all active clients to complete transmitting their updates to the server,
$$\Delta \triangleq \inf\{t \ge 0 : N(t) = \tau m\}, \tag{22}$$
has an Erlang distribution with shape parameter $\tau m$ and rate $\nu$. Therefore, we have
$$\mathbb{E}[\Delta] = \frac{\tau m}{\nu}. \tag{23}$$
Next, let $\phi > 0$ be a possibly $p$-dependent hyper-parameter that represents the communication-to-computation ratio (cf. [30]); $\phi$ is the ratio of the average time it takes for a client to communicate a vector in $\mathbb{R}^p$ with the server to the average time it takes a client to compute a single gradient vector in $\mathbb{R}^p$. In most federated learning contexts, $\phi \gg 1$, because the average communication cost of sending a vector to the server far outweighs the average complexity of a gradient computation. Then, taking the average inter-departure time to be $\frac{1}{\nu} = \frac{\phi}{\lambda}$ and using (23), the expected total communication time per epoch is given by
$$\mathbb{E}[\Delta] = \frac{\phi\,\tau m}{\lambda}, \tag{24}$$
which is in the "same scale" as the expected total computation time in (21), because we multiply the number of active clients $\tau m$ by $\frac{\phi}{\lambda}$.
II-D3 Overall Running Time
In the algorithms we consider in this work, there will be three kinds of epochs with non-negligible running time:
1. Type A: Active clients simultaneously compute gradients and then communicate with the server. The server does not compute gradients.

2. Type B: The server computes gradients. Clients do not compute gradients and do not communicate with the server.

3. Type C: The server and clients do not compute gradients, but active clients communicate with the server.
Combining (21) and (24), the expected total running time per epoch is
$$\frac{1}{\lambda}\Big(2B + \log(\tau m)\,\mathbb{1}\{\text{epoch is of Type A}\} + \phi\,\tau m\,\mathbb{1}\{\text{epoch is of Type A or C}\}\Big) \tag{25}$$
$$\propto B + \phi\,\tau m\,\mathbb{1}\{\text{active clients communicate with the server in the epoch}\}. \tag{26}$$
The ensuing definition translates (26) into a formal notion of federated oracle complexity by summing the expected running times over all epochs. (Note that we neglect the constants and the sub-dominant logarithmic term in (25), because we are only concerned with how the complexity asymptotically scales with important problem parameters.)
Definition 1 (Federated Oracle Complexity).
For any federated optimization algorithm $\mathcal{A}$ belonging to the family of algorithms outlined in Section I-A, we define its (first-order) federated oracle complexity as
$$\Gamma(\mathcal{A}) \triangleq \sum_{t=1}^{T}\Big(B_t + \phi\,\tau m\,\mathbb{1}\{\text{active clients communicate with the server in epoch } t\}\Big),$$
where $T \in \mathbb{N}$ is the number of epochs in $\mathcal{A}$, $B_t \in \mathbb{Z}_{\ge 0}$ (which is possibly $0$) is the number of gradients computed in epoch $t$ by each active client or the server, $m$ is the total number of clients, $\tau \in (0,1]$ is the proportion of active clients that participate in each epoch, and $\phi > 0$ denotes the communication-to-computation ratio.
We next make some pertinent remarks about Definition 1. Firstly, our model of federated oracle complexity neglects the effect of straggling clients on the latency of federated optimization algorithms. This is because the effect of stragglers is captured by a lower order logarithmic term in (21) (and (25)). So, we do not study latency issues in this work. Secondly, Definition 1 assumes that in any epoch, all clients compute the same number of gradients, because both the theoretical algorithms we analyze in this work, FedLRGD and FedAve, cf. [30], satisfy this assumption. In general settings, however, different clients may have to compute different numbers of gradients in an epoch. In such cases, one would need to carry out the earlier extreme value theoretic analysis of the gradient computation phase again to determine the expected total computation time of this phase and alter Definition 1 appropriately (e.g., $B_t$ could represent the maximum number of gradients computed by a client across all clients in epoch $t$). Thirdly, note that Definition 1 reduces to the usual notion of oracle complexity in non-federated settings [76, 2]. Indeed, since non-federated settings have no communication cost, we can set $\phi = 0$; in this case, our definition simply counts the total number of gradient oracle calls. Finally, for simplicity in this work, we will assume in all cases that the proportion of active clients $\tau$ is constant (with respect to any asymptotically growing problem parameters). In the sequel, we evaluate the efficiency of federated optimization algorithms using Definition 1.
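As a simple illustration, Definition 1 can be evaluated programmatically for a given epoch schedule; the helper below implements $\Gamma(\cdot)$ under our notation, and the schedule shown is a hypothetical FedLRGD-style instantiation (see Section III-B), not a prescription from this section.

```python
# A small helper implementing Definition 1: an epoch is a pair
# (B_t, communicates_t), and Gamma(A) = sum_t B_t + phi*tau*m*1{communicates_t}.
def federated_oracle_complexity(epochs, phi, tau, m):
    """epochs: list of (gradients per active worker, clients_communicate)."""
    return sum(B + phi * tau * m * communicates for B, communicates in epochs)

# Hypothetical FedLRGD-style schedule with approximate rank r, s samples per
# client, m clients, and num_gd_iters inexact GD iterations at the server.
r, s, m, phi, tau, num_gd_iters = 50, 100, 10_000, 500.0, 1.0, 200
schedule = (
    [(r * r, False)]                  # epoch 1: server-only gradient computations
    + [(r * s, True)]                 # epoch 2: clients compute and communicate
    + [(0, True)] * (r - 1)           # epochs 3..r+1: communication only
    + [(r, False)] * num_gd_iters     # final phase: inexact GD at the server
)
print(federated_oracle_complexity(schedule, phi, tau, m))
```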
III Main Results and Discussion
We discuss our main results in this section. Specifically, we outline some key auxiliary results on low rank latent variable model approximation in Section III-A, describe our new Federated Low Rank Gradient Descent (FedLRGD) algorithm in Section III-B, present the federated oracle complexity (as defined in Definition 1) of this algorithm in Section III-C, and finally, demonstrate that our algorithm has better federated oracle complexity than the vanilla FedAve algorithm in Section III-D.
III-A Latent Variable Model Approximation
To establish the federated oracle complexity of our FedLRGD algorithm, we will require a basic result about uniformly approximating a smooth function using piecewise polynomials. To illustrate this result, consider any function $g : [0,1]^d \times \mathbb{R}^p \to \mathbb{R}$, such that for all $\theta \in \mathbb{R}^p$, the function $g(\cdot;\theta)$ belongs to the $(\eta,L_2)$-Hölder class as defined in (6) in Section II-B. Moreover, for any $\delta > 0$, consider a minimal $\delta$-covering of the hypercube $[0,1]^d$ using closed $\ell^1$-balls (or cross-polytopes) with radius $\delta$ and centers $z^{(1)},\dots,z^{(K)} \in [0,1]^d$, which forms the associated $\delta$-net, such that:
$$\forall x \in [0,1]^d, \ \exists\, i \in [K], \quad \big\|x - z^{(i)}\big\|_1 \le \delta. \tag{27}$$
In particular, (27) shows that the union of the closed $\ell^1$-balls with radius $\delta$ and centers $z^{(1)},\dots,z^{(K)}$ contains the set $[0,1]^d$, and the $\delta$-net is minimal in the sense that its cardinality $K$ is as small as possible while ensuring (27), i.e., $K$ is the $\delta$-covering number of $[0,1]^d$ (cf. [83, Definition 4.2.2]). Let $\mathcal{B}_1,\dots,\mathcal{B}_K$ be an enumeration of the closed $\ell^1$-balls with radius $\delta$ and centers $z^{(1)},\dots,z^{(K)}$, respectively, and construct the sets $\mathcal{I}_1,\dots,\mathcal{I}_K$ such that $\mathcal{I}_1 = \mathcal{B}_1 \cap [0,1]^d$ and $\mathcal{I}_i = \big(\mathcal{B}_i \cap [0,1]^d\big)\backslash\big(\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_{i-1}\big)$ for $i \in \{2,\dots,K\}$. Then,
$$\big\{\mathcal{I}_1,\dots,\mathcal{I}_K\big\} \tag{28}$$
forms a partition of the hypercube $[0,1]^d$ such that every point in $\mathcal{I}_i$ is at an $\ell^1$-distance of at most $\delta$ from $z^{(i)}$. With this $\ell^1$-ball based partition in place, the next lemma conveys that $g(\cdot;\theta)$ can be uniformly approximated using a piecewise polynomial function akin to the proofs of [84, Proposition 1] and [43, Proposition 3.3].
Lemma 1 (Uniform Piecewise Polynomial Approximation).
For any $\delta > 0$ and any $\theta \in \mathbb{R}^p$, there exists a piecewise polynomial function $P_\delta(\cdot;\theta) : [0,1]^d \to \mathbb{R}$ such that
$$\sup_{x \in [0,1]^d}\big|g(x;\theta) - P_\delta(x;\theta)\big| \le \frac{L_2\,\delta^{\eta}}{l!},$$
where $l = \lceil\eta\rceil - 1$, the piecewise polynomial function is given by the following summation:
$$P_\delta(x;\theta) = \sum_{i=1}^{K} P_i(x;\theta)\,\mathbb{1}\{x \in \mathcal{I}_i\},$$
and $P_1(\cdot;\theta),\dots,P_K(\cdot;\theta)$ are Taylor polynomials with (total) degree at most $l$ as shown in (49).
Lemma 1 is proved in Section IV-A. While this lemma will suffice to establish federated oracle complexity in our problem, we also present a corollary of it in the context of latent variable models for completeness.
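For intuition, the sketch below carries out the construction of Lemma 1 in dimension $d = 1$ with degree-2 Taylor polynomials for the analytic function $g = \sin$ (so one may take $\eta = 3$); halving $\delta$ should shrink the sup-error by roughly $2^\eta = 8$. The function and parameters are illustrative choices of ours.

```python
# One-dimensional sketch of Lemma 1: piecewise Taylor polynomials of degree
# l = 2 on cells of radius delta approximate g = sin uniformly to O(delta^3).
import math
import numpy as np

g = np.sin
derivs = [np.sin, np.cos, lambda x: -np.sin(x)]       # g, g', g''
xs = np.linspace(0.0, 1.0, 10_001)

for num_cells in (4, 8, 16, 32):
    delta = 0.5 / num_cells                           # cell "radius"
    centers = (np.arange(num_cells) + 0.5) / num_cells
    cell = np.minimum((xs * num_cells).astype(int), num_cells - 1)
    z = centers[cell]                                 # Taylor center for each x
    approx = sum(derivs[j](z) * (xs - z) ** j / math.factorial(j)
                 for j in range(3))
    print(num_cells, np.max(np.abs(g(xs) - approx)))  # sup-error ~ delta^3
```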
For any $q, v \in \mathbb{N}$, any sequence $y^{(1)},\dots,y^{(q)} \in [0,1]^d$, and any sequence $\alpha^{(1)},\dots,\alpha^{(v)} \in \mathbb{R}^p$, define the matrix $M \in \mathbb{R}^{q \times v}$ such that:
$$M_{i,j} = g\big(y^{(i)};\alpha^{(j)}\big), \quad \text{for } i \in [q] \text{ and } j \in [v]. \tag{29}$$
We refer to $M$ as a latent variable model with latent parameters $\{y^{(i)} : i \in [q]\}$ and $\{\alpha^{(j)} : j \in [v]\}$, and latent function $g$ (which satisfies the Hölder smoothness assumptions outlined earlier); see, e.g., [43] and the references therein. Furthermore, in the special case where $g$ is symmetric in its two inputs and $M$ is a symmetric matrix, $g$ is known as a graphon function (which naturally appears in Aldous-Hoover representations of exchangeable random graphs [85]), cf. [41, 86, 42]. Next, let
$$M = \sum_{i=1}^{\min\{q,v\}} \sigma_i\, u^{(i)}\big(v^{(i)}\big)^{\mathrm{T}} \tag{30}$$
be the singular value decomposition of $M$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{q,v\}} \ge 0$ and orthonormal bases of singular vectors $\{u^{(i)} \in \mathbb{R}^q : i \in [\min\{q,v\}]\}$ and $\{v^{(i)} \in \mathbb{R}^v : i \in [\min\{q,v\}]\}$. Then, by the Eckart-Young-Mirsky theorem, for any $r \in [\min\{q,v\}]$, the best rank-$r$ approximation of $M$ in Frobenius norm is given by
$$M_r = \sum_{i=1}^{r} \sigma_i\, u^{(i)}\big(v^{(i)}\big)^{\mathrm{T}}, \tag{31}$$
i.e., $M_r \in \operatorname*{arg\,min}\big\{\|M - A\|_{\mathrm{F}} : A \in \mathbb{R}^{q \times v},\ \operatorname{rank}(A) \le r\big\}$, cf. [87, Section 7.4.2]. Akin to [42, Proposition 1] and [43, Proposition 3.3], the ensuing theorem demonstrates that the latent variable model $M$ can be well-estimated by its low rank approximation $M_r$ due to the smoothness of $g$. We believe that this result may be of broader interest to the matrix and tensor estimation community.
Theorem 1 (Low Rank Approximation).
Consider the latent variable model $M \in \mathbb{R}^{q \times v}$ in (29) defined by the latent function $g$ such that $g(\cdot;\theta)$ belongs to the $(\eta,L_2)$-Hölder class for all $\theta \in \mathbb{R}^p$. Then, for any $r \in [\min\{q,v\}]$, we have
$$\frac{\|M - M_r\|_{\mathrm{F}}}{\sqrt{q v}} \le C\, r^{-\eta/d},$$
where $C > 0$ is a constant that depends only on $d$, $\eta$, and $L_2$.
Theorem 1 is proved in Section IV-B; we will use ideas from this proof in our analysis of federated oracle complexity. It is worth comparing this result with the related results in [41], [84, Proposition 1], [42, Proposition 1], and [43, Proposition 3.3]. The author of [41] proves an analog of Theorem 1 in the special case where $g$ is a graphon function and $g(\cdot;\theta)$ is Lipschitz continuous with respect to the $\ell^\infty$-norm for all $\theta$. In this special case, a piecewise constant function approximation of $g$ suffices for the proof. The results in [84, Proposition 1] and [42, Proposition 1] extend the result of [41] to the setting where $g$ is still a graphon function, but $g(\cdot;\theta)$ belongs to an $\eta$-Hölder class with respect to the $\ell^\infty$-norm. The key insight in [84, 42] is to utilize piecewise polynomial function approximations instead of piecewise constant functions; we also use this key idea in Lemma 1. Finally, the authors of [43, Proposition 3.3] apply the argument of [84, 42] to extend their result further to the setting where $g$ is a general latent variable model, similar to our result, but $g(\cdot;\theta)$ only belongs to an $\eta$-Hölder class with respect to the $\ell^\infty$-norm. In contrast, our result in Theorem 1 applies to general latent variable models such that $g(\cdot;\theta)$ belongs to an $\eta$-Hölder class with respect to the $\ell^1$-norm, which subsumes the settings of [41, 84, 42, 43]. To prove this more general result, we resort to using a modified minimal $\ell^1$-covering to construct our piecewise polynomial function approximation in the intermediate result in Lemma 1; this differs from the $\ell^\infty$-coverings used in [41], [84, Lemma 3], [42], and [43], and it can be verified that $\ell^\infty$-coverings produce a worse bound in Lemma 1.777We note that our modified minimal $\ell^1$-covering of $[0,1]^d$ is more sophisticated than the $\ell^\infty$-coverings used in [41, 84, 42, 43], because $\ell^\infty$-balls can be used to form a (geometric) honeycomb in the hypercube $[0,1]^d$, but $\ell^1$-balls do not form such a honeycomb in $\mathbb{R}^d$ for general $d$. Therefore, in contrast to [41, 84, 42, 43], our proofs of Lemma 1 and Theorem 1 are adapted to deal with the $\ell^1$-norm Hölder class assumption on $g(\cdot;\theta)$ and the modified minimal $\ell^1$-covering of the hypercube $[0,1]^d$.
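The next sketch illustrates Theorem 1 numerically: it samples a matrix from a smooth latent function with scalar latent variables, truncates its SVD per (30)-(31), and prints the relative Frobenius error, which decays rapidly in the rank $r$. The latent function below is a hypothetical smooth choice of ours, not one from the paper.

```python
# Numerical illustration of Theorem 1: a matrix sampled from a smooth latent
# function is approximately low rank, i.e., its Frobenius error after rank-r
# truncation (Eckart-Young-Mirsky) decays quickly in r.
import numpy as np

rng = np.random.default_rng(3)
q = v = 300
ys = rng.uniform(size=q)                   # latent data variables in [0, 1]
alphas = rng.normal(size=v)                # latent parameter variables
M = np.exp(-np.subtract.outer(ys, np.tanh(alphas)) ** 2)   # M_ij = g(y_i; a_j)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
for r in (1, 2, 4, 8):
    M_r = (U[:, :r] * S[:r]) @ Vt[:r]      # best rank-r approximation of M
    print(r, np.linalg.norm(M - M_r) / np.linalg.norm(M))
```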
III-B Federated Low Rank Gradient Descent
We now describe the new Federated Low Rank Gradient Descent (FedLRGD) algorithm for federated optimization under the setup of Section I-A and Section II-B. Recall that we are given a central server with $r$ private training data samples $x^{(0,1)},\dots,x^{(0,r)}$, and $m$ clients with $s$ private training data samples each; client $c \in [m]$ has samples $x^{(c,1)},\dots,x^{(c,s)}$. The number of samples at the server, $r$, depends on the approximate rank of the loss function, as we will see in the next subsection. The server's objective is to compute an $\epsilon$-approximate solution in the sense of (11) to the ERM problem (3).
In the traditional non-federated setting, this objective could be achieved by running GD (or any of its many variants). Each iteration of GD would involve calculating the gradient:
$$\nabla_\theta \hat{F}(\theta) = \frac{1}{n}\left(\sum_{i=1}^{r} \nabla_\theta f\big(x^{(0,i)};\theta\big) + \sum_{c=1}^{m}\sum_{j=1}^{s} \nabla_\theta f\big(x^{(c,j)};\theta\big)\right) \tag{32}$$
at different parameter values. The main idea of this work is to utilize the smoothness of $f$ in the data to approximate the gradient in (32) by exploiting the approximate low rank structure induced by the smoothness (cf. Section III-A). Specifically, for every index $k \in [p]$, we determine a set of weights $\{w^{(k)}_{c,j,i} \in \mathbb{R} : c \in [m],\, j \in [s],\, i \in [r]\}$ such that:
$$\sum_{c=1}^{m}\sum_{j=1}^{s} g_k\big(x^{(c,j)};\theta\big) \approx \sum_{c=1}^{m}\sum_{j=1}^{s}\sum_{i=1}^{r} w^{(k)}_{c,j,i}\, g_k\big(x^{(0,i)};\theta\big), \tag{33}$$
where for each client $c$ and each private training sample $x^{(c,j)}$, we re-weigh the partial derivative $g_k$ at the server's training sample $x^{(0,i)}$ by an additional amount $w^{(k)}_{c,j,i}$. In our algorithm, the clients compute the aforementioned weights after receiving some initial information from the server, and the server then uses these weights to run inexact GD. Intuitively, when $f$ is very smooth with respect to the data, the number of private samples $r$ required at the server to ensure that (33) is a good approximation is reasonably small. Hence, the oracle complexity of GD at the server also remains reasonably small, since it only computes gradients at $r$ data samples in every iteration (instead of all $n$).
Our FedLRGD algorithm runs over $r+2$ synchronized epochs and computes an $\epsilon$-approximate solution to (3) at the server. In the first epoch, all clients are idle and there is no communication between the server and clients. The server arbitrarily chooses parameter values $\theta^{(1)},\dots,\theta^{(r)} \in \mathbb{R}^p$, and performs $r^2$ gradient computations corresponding to these parameter values and its private training samples $x^{(0,1)},\dots,x^{(0,r)}$. Since each gradient computation produces a $p$-dimensional vector, for every index $k \in [p]$, the server obtains a matrix of partial derivatives $G^{(k)} \in \mathbb{R}^{r \times r}$ as in (7):
$$G^{(k)}_{i,j} = g_k\big(x^{(0,i)};\theta^{(j)}\big), \quad \text{for } i,j \in [r]. \tag{34}$$
Moreover, due to the non-singularity assumption in Section II-B, each $G^{(k)}$ is an invertible matrix. So, the server computes the inverses of these matrices to get $\big(G^{(1)}\big)^{-1},\dots,\big(G^{(p)}\big)^{-1}$.
The second epoch proceeds in three phases. First, the server broadcasts the parameter values $\theta^{(1)},\dots,\theta^{(r)}$ and the inverse matrices $\big(G^{(1)}\big)^{-1},\dots,\big(G^{(p)}\big)^{-1}$ to all clients. We assume for simplicity that $\tau = 1$, i.e., all clients are active, although our analysis would hold in the case $\tau \in (0,1)$ as well. In the second phase, every client performs $rs$ gradient computations corresponding to its private training samples and the common parameter values. In particular, each client $c \in [m]$ computes the following set of partial derivatives $\{g_k(x^{(c,j)};\theta^{(i)}) : k \in [p],\, i \in [r],\, j \in [s]\}$. Then, each client $c$ calculates the weight vectors $w^{(k)}_c = \big(w^{(k)}_{c,1},\dots,w^{(k)}_{c,r}\big) \in \mathbb{R}^r$:
$$\big(w^{(k)}_c\big)^{\mathrm{T}} = \left[\sum_{j=1}^{s} g_k\big(x^{(c,j)};\theta^{(1)}\big),\,\dots,\,\sum_{j=1}^{s} g_k\big(x^{(c,j)};\theta^{(r)}\big)\right]\big(G^{(k)}\big)^{-1} \tag{35}$$
for every index $k \in [p]$. Note that the $i$th weight $w^{(k)}_{c,i}$ for the $k$th partial derivative at client $c$ represents the accumulation of the aforementioned weights (as explained in (39)).
To intuitively understand (35), for each index $k \in [p]$, note that the induced approximate low rankness of $g_k$ implies that for every client $c \in [m]$ and data sample $x^{(c,j)}$:
$$\forall \theta \in \mathbb{R}^p, \quad g_k\big(x^{(c,j)};\theta\big) \approx \sum_{i=1}^{r} w^{(k)}_{c,j,i}\, g_k\big(x^{(0,i)};\theta\big) \tag{36}$$
for some choices of weights $w^{(k)}_{c,j,1},\dots,w^{(k)}_{c,j,r} \in \mathbb{R}$ (which we formally define soon). Indeed, if we construe each $g_k$ as an infinite "matrix" with $x \in [0,1]^d$ indexing its "rows" and $\theta \in \mathbb{R}^p$ indexing its "columns," then if the approximate rank of $g_k$ is $r$, the "rows" of $g_k$ corresponding to data samples at the clients can be approximately represented as linear combinations of the "rows" of $g_k$ corresponding to the data samples at the server $x^{(0,1)},\dots,x^{(0,r)}$. Furthermore, this property in (36) implies the approximation in (33). Formally, we capture the approximate low rankness in (36) by sampling $g_k\big(x^{(c,j)};\cdot\big)$ and each $g_k\big(x^{(0,i)};\cdot\big)$ at the parameter values $\theta^{(1)},\dots,\theta^{(r)}$:
$$g_k\big(x^{(c,j)};\theta^{(i)}\big) = \sum_{t=1}^{r} w^{(k)}_{c,j,t}\, g_k\big(x^{(0,t)};\theta^{(i)}\big), \quad \text{for } i \in [r], \tag{37}$$
which, after matrix inversion, equivalently defines the weights via the relation
$$\big(w^{(k)}_{c,j,1},\dots,w^{(k)}_{c,j,r}\big) = \Big[g_k\big(x^{(c,j)};\theta^{(1)}\big),\,\dots,\,g_k\big(x^{(c,j)};\theta^{(r)}\big)\Big]\big(G^{(k)}\big)^{-1}. \tag{38}$$
Additionally, (33) conveys that we only require the sums of weights, $\sum_{j=1}^{s} w^{(k)}_{c,j,i}$, for all $k \in [p]$ and $i \in [r]$ from each client to estimate the gradient of the empirical risk. So, it suffices for client $c$ to directly compute the accumulated weights:
$$w^{(k)}_{c,i} \triangleq \sum_{j=1}^{s} w^{(k)}_{c,j,i}, \quad \text{for } k \in [p] \text{ and } i \in [r]. \tag{39}$$
Combining (39) with (38) produces (35). This explains why each client directly calculates (35). In fact, the accumulated weights in (35) simplify (33) and yield the approximation:
$$\sum_{c=1}^{m}\sum_{j=1}^{s} g_k\big(x^{(c,j)};\theta\big) \approx \sum_{c=1}^{m}\sum_{i=1}^{r} w^{(k)}_{c,i}\, g_k\big(x^{(0,i)};\theta\big) \tag{40}$$
for all $\theta \in \mathbb{R}^p$.
After calculating (35) for all $k \in [p]$, each client $c$ possesses a set of $rp$ weights $\{w^{(k)}_{c,i} : k \in [p],\, i \in [r]\}$. Hence, in the third phase of the second epoch, every client communicates the first $p$ weights $\{w^{(k)}_{c,1} : k \in [p]\}$ from its set of weights to the server. In the next $r-1$ epochs, every client communicates $p$ weights per epoch from its set of weights to the server. Specifically, in epoch $i+1$ for $i \in \{2,\dots,r\}$, client $c$ communicates the weights $\{w^{(k)}_{c,i} : k \in [p]\}$ to the server. There are no gradient computations performed in these epochs.
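The sketch below walks through this weight-learning pipeline, (34), (35), and (40), for a single partial derivative (so $p = 1$), with a hypothetical smooth kernel $g$ standing in for $g_k$ and the parameter dimension collapsed to the data dimension; it then checks the gradient-sum approximation (40) at a parameter value not used for training. All concrete choices here are illustrative assumptions.

```python
# Sketch of the FedLRGD weight computation (35)-(40) for one partial
# derivative: the server samples G_ij = g(x0_i; th_j), inverts it, and each
# client folds its per-sample weights (38) into accumulated weights (39).
import numpy as np

rng = np.random.default_rng(4)
r, s, m, d = 12, 40, 5, 3                      # rank, samples/client, clients, dim

def g(x, th):                                  # smooth "partial derivative" kernel
    return np.exp(-np.sum((x - np.tanh(th)) ** 2, axis=-1))

x_server = rng.uniform(size=(r, d))            # server data x^(0,1), ..., x^(0,r)
thetas = rng.normal(size=(r, d))               # sampled parameters theta^(1..r)
clients = [rng.uniform(size=(s, d)) for _ in range(m)]

G = np.array([[g(x, th) for th in thetas] for x in x_server])   # (34), r x r
G_inv = np.linalg.inv(G)

# Each client c computes accumulated weights w_c via (35): the row vector of
# summed derivatives at theta^(1..r), multiplied by G^{-1}.
weights = [np.array([g(Xc, th).sum() for th in thetas]) @ G_inv for Xc in clients]

theta_new = rng.normal(size=d)                 # a parameter unseen during training
true_sum = sum(g(Xc, theta_new).sum() for Xc in clients)
approx_sum = sum(w @ g(x_server, theta_new) for w in weights)   # RHS of (40)
print(true_sum, approx_sum)                    # the two sums should be close
```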
Finally, in the last epoch, the server runs inexact gradient descent using the weights it has received from the clients. There is no communication between the server and clients, and all clients remain idle at this stage. To precisely describe the inexact GD iterations, for any $\theta \in \mathbb{R}^p$, define the approximation $\tilde{\nabla}\hat{F}(\theta) \in \mathbb{R}^p$ of $\nabla_\theta\hat{F}(\theta)$ such that for every $k \in [p]$:
$$\tilde{\nabla}\hat{F}(\theta)_k = \frac{1}{n}\sum_{i=1}^{r}\left(1 + \sum_{c=1}^{m} w^{(k)}_{c,i}\right) g_k\big(x^{(0,i)};\theta\big), \tag{41}$$
as indicated in (40). Moreover, let $I \in \mathbb{N}$ be the number of iterations for which the server runs inexact GD; $I$ depends on the desired approximation accuracy $\epsilon$ in Section II-B and other problem parameters, as we will see in the next subsection. Then, starting with an arbitrary parameter vector $\vartheta^{(0)} \in \mathbb{R}^p$, the server performs the inexact GD updates
$$\vartheta^{(t)} = \vartheta^{(t-1)} - \frac{1}{L_1}\,\tilde{\nabla}\hat{F}\big(\vartheta^{(t-1)}\big) \tag{42}$$
for $t \in [I]$, where $L_1$ is the common Lipschitz constant of the gradient $\nabla_\theta f(x;\cdot)$ for all $x \in [0,1]^d$ (see Section II-B). Note that at the $t$th update, the server performs $r$ gradient computations corresponding to its private data at the parameter value $\vartheta^{(t-1)}$. Lastly, the output of the FedLRGD algorithm at the server is $\vartheta^{(I)}$.
In the next subsection, we analyze the federated oracle complexity of this algorithm and specify what values and should take to ensure that is an -approximate solution in the sense of (11) to the ERM problem (3). We also summarize the algorithm in Algorithm 1 for the readers’ convenience.
III-C Federated Oracle Complexity of FedLRGD
In addition to the assumptions on the loss function in Section II-B, we will require several assumptions on the piecewise polynomial approximations introduced in Section III-A. Recall from Section II-B that the th partial derivative satisfies the -Hölder class condition. So, for any and any fixed index , let be the piecewise polynomial approximation of satisfying Lemma 1:
(43)
where is the partition of the hypercube given in (28), are Taylor polynomials with degree at most as defined in (49), and we suppress the dependence of on for simplicity. Then, for any sufficiently large , fix such that . As shown in the proofs of Theorem 1 and Theorem 2 in Section IV-B and Section V, respectively, we can write in the form (cf. (55)):
(44)
where is a family of linearly independent monomial functions on (see (56)), and is another family of functions on . Note that the form (44) conveys that , when perceived as the bivariate kernel of an integral operator, has rank at most . For every large enough and any fixed index , we make the following assumptions about the approximation :
1. Non-singularity of monomials: The matrix with entries:
(45) is invertible. Note that we suppress the dependence of on , , and the server’s training data for simplicity.
2. Non-singularity of approximation: For any sequence of distinct parameter values , the matrix given by:
(46) is invertible. Here, we suppress the dependence of on , the server’s training data, and the parameter sequence for simplicity.
With all our assumptions in place, the ensuing theorem presents the federated oracle complexity of the algorithm to generate an -approximate solution. This is the key technical result of this work.
Theorem 2 (Federated Oracle Complexity of FedLRGD).
Suppose that the two assumptions above and the assumptions in Section II-B hold. Assume further that the approximation accuracy satisfies , the Hölder smoothness parameter satisfies , and the proportion of active clients is . Then, the algorithm described in Algorithm 1 with approximate rank given by
(47)
and number of inexact GD iterations given by
(48)
produces an -approximate solution in the sense of (11) with a federated oracle complexity of
where is the number of training data samples per client, is the number of clients, is the communication-to-computation ratio, and the various other constants and problem variables are defined in Section I and Section II.
Theorem 2 is established in Section V. It roughly states that the federated oracle complexity of FedLRGD scales like (neglecting typically sub-dominant factors; see, e.g., the assumptions and proof of Proposition 1) as mentioned in Section I-B. At a high level, the proof of Theorem 2 has three parts. First, we use the uniform piecewise polynomial approximations from Lemma 1 to show that partial derivatives of the loss function are approximately low rank. We emphasize that this approximate low rankness is deduced from the smoothness of loss functions in the data—a property we had set out to exploit since Section I. This approximate low rankness implies that and are close to each other. In particular, we show that under the assumptions of the theorem statement, the error in approximating the true gradient of the empirical risk at the server satisfies (see (62) and (65) for details). Second, we control the error using the aforementioned bound on from the first part and standard tools from the inexact GD literature, cf. [5]. Finally, we impose the condition in (11), which naturally engenders certain choices of and (specifically, (47) and (48)), and from there, a direct application of Definition 1 yields the aforementioned federated oracle complexity.
We remark that it is the process of rigorously translating the smoothness of partial derivatives of loss functions in the data to an approximate low rankness property that requires us to make many of the assumptions at the outset of this subsection and in Section II-B. Once we have approximate low rankness of partial derivatives of loss functions, however, it is reasonably straightforward to intuitively understand why (33) holds, and in turn, why Algorithm 1 works and a result like Theorem 2 holds. Furthermore, canonical machine learning models can exhibit the desired approximate low rankness property that lies at the heart of our method. For example, in Appendix C, we illustrate that gradients of loss functions associated with neural networks used to train classifiers for the MNIST and CIFAR-10 datasets (see [44, 45]) are approximately low rank. This gives further credence to the potential utility of FedLRGD.
III-D Comparison to Federated Averaging
We finally compare the federated oracle complexity of FedLRGD in Theorem 2 with the widely used FedAve benchmark algorithm from the literature [28]. Specifically, we consider the precise formulation of FedAve in [30, Algorithm 1, Theorem 1]. Recall that in the FedAve algorithm, active clients perform several iterations of SGD using their private data in each epoch, and the server aggregates individual client updates at the end of each epoch and broadcasts the aggregated update back to clients at the beginning of the next epoch. The authors of [30] provide a sharp convergence analysis of FedAve. (Footnote 8: In fact, a more sophisticated version of FedAve with quantization is analyzed in [30].) Building upon their results, we derive the federated oracle complexity of FedAve in the ensuing theorem. For simplicity, we assume that the number of clients grows (i.e., ), and other non-constant problem parameters vary at different rates with respect to .
Theorem 3 (Federated Oracle Complexity of FedAve).
Assume that:
1. The smoothness in and strong convexity assumptions of Section II-B hold.
2. The parameters and are constants, i.e., , while other problem parameters may vary with .
3. Any stochastic gradient computed by a client in the FedAve algorithm is unbiased and has variance bounded by (see [30, Assumption 3]). (Footnote 9: The scaling is natural since the squared -norm in the expectation has terms.)
4. The number of gradient computations per client in an epoch of FedAve satisfies .
5. , where denotes the parameter vector at the server after epochs of FedAve. (Footnote 10: The scaling here is also natural because the parameter usually resides in an -norm ball, e.g., , rather than all of in applications. Since the diameter of such an -norm ball scales as , is upper bounded by .)
Then, for all , all larger than some constant, and all sufficiently large , we have
where we compute the federated oracle complexity above by minimizing over all choices of and all numbers of epochs in such that it produces an -approximate solution in the sense of (12).
Theorem 3 is established in Appendix D. To easily compare the federated oracle complexities of the FedLRGD and FedAve algorithms, we assume as before that the number of clients , all the problem variables are non-decreasing functions of , and the other problem parameters are constants with respect to . (We refer readers back to Section I and Section II for the definitions of these parameters.) In particular, this means that the condition number is also constant. Although improving the dependence of oracle complexity on condition number is a vast area of research in optimization theory (see, e.g., accelerated methods [11], variance reduction [16], etc.), we do not consider this aspect of the problem in this work. The proposition below elucidates a broad and important regime where FedLRGD has better federated oracle complexity than FedAve.
Proposition 1 (Comparison between FedLRGD and FedAve).
Suppose that the assumptions of Theorem 2 and Theorem 3 hold. Fix any absolute constants , , and , and then consider any sufficiently large constants such that . Assume further that the number of samples per client , the data dimension , the approximation accuracy , the Hölder smoothness parameter such that , and , where is the total number of training data samples, and is the parameter dimension. (Footnote 11: Note that , because where is the optimal parameter argument defined in Section II-B, the second equality follows from the mean value theorem for some (see [88, Theorem 12.14]), the third equality follows from the fact that at the minimum (which holds because is strongly convex due to the strong convexity assumption in Section II-B, cf. [23, Section 4.1]), the next inequality follows from the Cauchy-Schwarz inequality, the last inequality holds because is -Lipschitz continuous (which is inherited from the smoothness in parameter assumption in Section II-B, cf. [23, Section 4.1]), and the scaling with in the last line follows from the reasoning in Footnote 10.) Then, the federated oracle complexities of FedLRGD and FedAve scale as
and satisfy
This result is proved in Appendix E. We briefly explain the relevance of the regime presented in Proposition 1 where FedLRGD beats FedAve. Firstly, we state conditions on the scaling of and in terms of (rather than ), because dimension and statistical error are typically compared to the number of available training samples in high-dimensional statistics and machine learning problems. Secondly, as noted in [23], we assume that for , because the empirical risk in (1) is an approximation of some true risk, and the statistical error between these quantities is known to be (with high probability). As we only minimize empirical risk in machine learning problems as a proxy for true risk minimization, there is no gain in reducing the approximation accuracy of federated optimization algorithms below this threshold. Thirdly, our result subsumes one of the most widely studied regimes for training neural networks in recent times—the overparametrized regime where (see [89] and the references therein). Fourthly, as discussed in [23], when is close to , e.g., , the assumption is not too restrictive, because several machine learning problems have meaningful feature vectors with dimension that is logarithmic in the number of training samples. Indeed, results like the Johnson-Lindenstrauss lemma, cf. [90], illustrate how high-dimensional feature vectors can be mapped to -dimensional feature vectors while approximately preserving the distances between pairs of vectors, and hence, conserving the relative geometry of the vectors. Fifthly, we informally summarize the main theoretical message of Proposition 1 for clarity: If the data dimension is small and the gradient of the loss function possesses sufficient smoothness in the data, then our FedLRGD algorithm provably beats the FedAve benchmark algorithm in federated oracle complexity. (In other regimes, the FedAve algorithm could perform better than our method.) This theoretically portrays the utility of exploiting smoothness of loss functions in the data for federated optimization. Finally, we note that although we show the theoretical advantage of our FedLRGD algorithm under several assumptions, running Algorithm 1 itself does not require most of these assumptions, e.g., strong convexity, small data dimension, etc., to hold. Therefore, our main conceptual contribution of exploiting smoothness in data, or approximate low rank structure, of loss functions for federated optimization could be of utility in settings beyond the assumptions considered in this work.
IV Analysis of Latent Variable Model Approximation
IV-A Proof of Lemma 1
Proof.
We will follow the proof idea of [84, Lemma 3], but the details of our proof are different since we use a more general Hölder class definition than [84]. Fix any vector . For any set , let denote the closed -ball corresponding to in our minimal -covering of , and denote the center of , i.e., if for according to the enumeration defined in Section III-A, then and . Moreover, for any , construct the Taylor polynomial with degree of around :
(49)
Then, we have
where the piecewise polynomial function is defined using the Taylor polynomials , the second inequality follows from the fact that by construction, the fourth equality follows from the multinomial theorem, the last inequality holds because each is an -ball with center and radius , and the third inequality holds because
where we use Taylor’s theorem with Lagrange remainder term which holds for some [88, Theorem 12.14], the triangle inequality, and the -Hölder class assumption. This completes the proof. ∎
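To visualize this construction, the following one-dimensional sketch (with d = 1 and f = sin on [0, 1], an arbitrary smooth choice) builds the piecewise Taylor approximation from (49) on a uniform partition and exhibits the sup-norm error decaying at the polynomial rate the lemma predicts:

import math
import numpy as np

f = np.sin
derivs = [np.sin, np.cos, lambda x: -np.sin(x), lambda x: -np.cos(x)]  # derivatives of sin cycle with period 4

def piecewise_taylor(x, k, l):
    cells = np.clip(np.floor(x * k), 0, k - 1)
    centers = (cells + 0.5) / k                      # center of the cell containing x
    out = np.zeros_like(x)
    for s in range(l + 1):                           # degree-l Taylor polynomial, as in (49)
        out += derivs[s % 4](centers) * (x - centers) ** s / math.factorial(s)
    return out

xs = np.linspace(0.0, 1.0, 10001)
for k in [2, 4, 8, 16]:
    err = np.max(np.abs(f(xs) - piecewise_taylor(xs, k, l=2)))
    print(k, err)   # sup-norm error shrinks roughly like k**(-(l + 1)) as the partition refines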
IV-B Proof of Theorem 1
Proof.
As before, we follow the proof idea of [84, Proposition 1], but the details of our proof are again different since we use a more general Hölder class definition than [84]. For any (to be selected later), construct the family of piecewise polynomial functions as in Lemma 1 using the families of Taylor polynomials defined in (49) for . Moreover, define the latent variable model such that:
where and are the latent parameters that define in (29). It is straightforward to verify that is a “good” induced -norm approximation of . Indeed, applying Lemma 1, we get
(50)
We next bound the rank of . Let be the vector of monomials with degree at most for , where is the number of such monomials. For each , suppose for all and , where is the vector of coefficients of the Taylor polynomial . Then, we have
which implies that
where we use the sub-additivity of rank, and is the -net introduced in Section III-A. Hence, to upper bound , it suffices to upper bound the -covering number of the hypercube . Let denote the closed -ball with radius that is centered at the origin. Using the volumetric argument in [83, Proposition 4.2.12] and [91, Theorem 14.2], we obtain
where denotes the volume in and denotes the Minkowski sum of sets in , the second inequality follows from substituting the well-known volume of the -ball as well as the fact that , and the final equality follows from substituting the volume of a hypercube. This produces the following upper bound on the rank of :
(51)
Finally, for any satisfying , choose such that . Note that because
where the second implication follows from the bound
(52)
which follows from a standard upper bound on binomial coefficients and Stirling’s approximation [92, Chapter II, Section 9, Equation (9.15)]. Then, observe that
where the first inequality follows from (51) and the Eckart-Young-Mirsky theorem [87, Section 7.4.2], the second inequality follows from (50), the fourth inequality uses (52), and the fifth inequality follows from the assumption that
This completes the proof. ∎
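The mechanism behind Theorem 1 is easy to visualize numerically: discretizing a kernel that is smooth in one of its arguments yields a matrix whose singular values decay rapidly, so the best low rank approximation guaranteed by the Eckart-Young-Mirsky theorem is already accurate at small rank. A sketch with an arbitrary smooth kernel (our own illustrative choice):

import numpy as np

n = 500
x = np.linspace(0, 1, n)                      # latent "row" variables
y = np.linspace(0, 1, n)                      # data "column" variables
F = np.exp(-np.subtract.outer(x, y) ** 2)     # a smooth bivariate kernel, sampled on a grid

s = np.linalg.svd(F, compute_uv=False)
frob2 = np.sum(s ** 2)
for r in [1, 2, 4, 8]:
    # By Eckart-Young-Mirsky, the best rank-r approximation error in Frobenius
    # norm is the energy in the trailing singular values.
    tail = np.sqrt(np.sum(s[r:] ** 2) / frob2)
    print(r, tail)                            # decays extremely fast in r for smooth kernels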
V Analysis of Federated Low Rank Gradient Descent
In this section, we establish the federated oracle complexity of the algorithm as stated in Theorem 2. To do this, we need the ensuing lemma from the literature which bounds how close the ERM objective value , at the output parameter of Algorithm 1, is to the global minimum in (10) [5, Lemma 2.1] (cf. [7, Theorem 2, Equation (19)] and [23, Lemma 4]).
Lemma 2 (Inexact GD Bound [5, Lemma 2.1]).
Suppose the empirical risk is continuously differentiable and -strongly convex, and its gradient is -Lipschitz continuous. Then, for all , we have
where are the inexact GD updates in (42) in Algorithm 1 with arbitrary initialization , is our approximation of for in Algorithm 1 as shown in (41), and the condition number is defined in Section II-B.
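The qualitative content of Lemma 2 (geometric decay of the optimality gap down to a floor determined by the gradient approximation error) can be checked on a strongly convex quadratic; the dimensions and constants below are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(1)
mu, L = 1.0, 10.0
A = np.diag(np.linspace(mu, L, 20))           # Hessian spectrum in [mu, L], so kappa = 10
f = lambda th: 0.5 * th @ A @ th              # minimum value 0 at the origin

delta = 1e-3                                  # bound on the gradient approximation error
theta = rng.standard_normal(20)
for t in range(201):
    err = rng.standard_normal(20)
    err *= delta / np.linalg.norm(err)        # gradient perturbation of norm exactly delta
    theta = theta - (1.0 / L) * (A @ theta + err)   # inexact GD update, cf. (42)
    if t % 50 == 0:
        print(t, f(theta))                    # geometric decay, then a plateau of order delta**2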
We next prove Theorem 2 using Lemma 1, Lemma 2, and the various assumptions in Section II-B and Section III-C.
Proof of Theorem 2.
First, to avoid notational clutter in this proof, we re-index the training datasets of the server and clients using with (cf. Section I-A). Specifically, we let for (i.e., the server’s training data is indexed first), and where for , , and . Correspondingly, we re-index the weights , which are defined by (38) and implicitly used in the algorithm, so that where for .
The majority of this proof focuses on analyzing the iteration complexity of the inexact GD method employed in the last epoch of Algorithm 1. Once we have control over this iteration complexity, we can easily deduce federated oracle complexity. Our main tool for studying iteration complexity will be Lemma 2. But to use this lemma, we require a bound on . So, we begin our analysis by fixing any (sufficiently large) , any index , and any parameter , and noticing that
(53)
where the first equality follows from (32), (39), and (41), and we use the weights defined by (38), the second equality follows from the aforementioned re-indexing, the fourth inequality follows from swapping the order of summations in the second term on the right hand side and applying the triangle inequality, and the fifth inequality holds because . To upper bound (53), we next develop some properties of a uniform piecewise polynomial approximation of .
Let and be the partition of the hypercube defined in (28) for any . Moreover, fix the parameter such that . Since satisfies the smoothness in data assumption (i.e., the -Hölder class assumption) in Section II-B, Lemma 1 implies that there exists a piecewise polynomial function with such that
(54)
Here, as shown in the proof of Theorem 1 in Section IV-B, we have
for all , where is the vector of monomials with degree at most , and are vectors of coefficients (see Section IV-B) which we may index with . Equivalently, we can write:
(55)
where we enumerate using the index set , i.e., we employ a bijection , and for all and with such that , we let:
(56)
(Hence, as noted in Section III-C, the family of monomial functions are linearly independent.) For every , define the coefficients via
where the matrix is defined in (45) and is assumed to be invertible (cf. Section III-C). Then, observe that for every and ,
which implies that
(57)
using (55). Therefore, letting be the matrix given by (46) in Section III-C for the arbitrary set of parameter vectors chosen by the server in the first epoch of Algorithm 1, we obtain that for all ,
(58)
where we use (57) and the assumption in Section III-C that is invertible. From the relation in (58), we now show that the ’s are “close” to the ’s.
To this end, notice that for any ,
(59)
where (i) uses (58) and (38), (ii) and (iv) follow from the triangle inequality, (iii) follows from the definition of operator norm, (v) follows from (54) and the boundedness assumption in (9), (vii) follows from (54), and (vi) holds because for any two non-singular matrices with common dimensions, we have
using the sub-multiplicativity of operator norms, which implies that
as long as (also see [87, Section 5.8]). Note that (59) only holds when .
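For concreteness, the inverse-perturbation inequality used in step (vi) follows from the identity A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1} together with sub-multiplicativity; a quick numerical check with an arbitrary well-conditioned matrix:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6)) + 6 * np.eye(6)    # well-conditioned, hence invertible
B = A + 1e-2 * rng.standard_normal((6, 6))         # a small perturbation of A

op = lambda M: np.linalg.norm(M, ord=2)            # spectral (operator) norm
lhs = op(np.linalg.inv(A) - np.linalg.inv(B))
rhs = op(np.linalg.inv(A)) * op(np.linalg.inv(B)) * op(A - B)
print(lhs <= rhs, lhs, rhs)                        # the claimed inequality holds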
Next, we proceed to upper bounding (53). Starting from (53), observe that
(60)
where (i) follows from (54) and the triangle inequality, (ii) follows from (57), (iii) follows from the Cauchy-Schwarz inequality, (iv) utilizes (59), (v) follows from the triangle inequality, (54), and the boundedness assumption in (9), (vi) holds because
using the Cauchy-Schwarz inequality (or the equivalence of and -norms), (38), the definition of operator norm, and (9), (vii) follows from the conditioning assumption in (8), and (viii) follows from basic algebraic manipulations. As before, note that (60) only holds when .
We are finally in a position to bound . Indeed, for any , suppose that
(61)
(Notice that implies the desired condition .) Then, we have from (60) that
which implies that for any ,
(62)
Therefore, applying Lemma 2, we obtain
(63)
where are the inexact GD updates in (42) in Algorithm 1, is the number of iterations for which inexact GD is run in the last epoch of Algorithm 1, and the third inequality follows from the usual geometric series formula. We remark that to apply Lemma 2 here, we need to verify that the empirical risk is continuously differentiable and -strongly convex, and its gradient is -Lipschitz continuous. These properties of are in fact inherited from the corresponding strong convexity and smoothness in parameter assumptions imposed on the loss function in Section II-B; we refer readers to [23, Section 4.1] for the formal arguments.
We can easily obtain the iteration complexity of the inexact GD method in Algorithm 1 from (63). Letting
(64)
we see that
Hence, for Algorithm 1 to produce an -approximate solution in the sense of (11), it suffices to ensure that
or equivalently,
Thus, setting as in (48) yields an -approximate solution . Furthermore, it is straightforward to verify that . Indeed, from (64) and (48), we have
(65)
where the last inequality holds because we have assumed in the theorem statement that .
Having computed the number of iterations (48), we close this proof by calculating the federated oracle complexity of the algorithm. To execute this final step, we must first ensure that (61) holds by choosing an appropriate value of . Recall that was chosen so that . Using (51), (52), and the fact that , we have
which in turn produces the following lower bound on :
Since we require , or to be more precise, (see Section III-A), we can check to see that
by assumption (47) in the theorem statement. Furthermore, similar to the argument in the proof Theorem 1 in Section IV-B, (47) also yields:
Substituting this lower bound on into the condition (61), it suffices to ensure that
Then, using Stirling’s approximation [92, Chapter II, Section 9, Equation (9.15)] and the fact that , the left hand side of the above inequality can be upper bounded by
This means that ensuring
implies that (61) is satisfied, where we utilize the assumption in the theorem statement that . Now, using (64) and (48), notice that
Hence, to satisfy (61), it suffices to ensure that
which is already assumed in (47). Finally, for and given by (47) and (48), respectively, observe using Definition 1 that the federated oracle complexity of Algorithm 1 is
This completes the proof. ∎
VI Conclusion
In closing, we summarize the main technical contributions of this work. Firstly, as a counterpart to traditional oracle complexity of iterative optimization methods, we provided a formalism for measuring running time of federated optimization algorithms by presenting the notion of federated oracle complexity in Section II-D. This notion afforded us the ability to theoretically compare the performance of different algorithms. Secondly, in order to exploit the smoothness of loss functions in training data (a generally unexplored direction in the optimization for machine learning literature [23]), we proposed the FedLRGD algorithm for federated learning in Section III-B. This algorithm crucially used the approximate low rankness induced by the smoothness of gradients of loss functions in data. Thirdly, we analyzed the federated oracle complexity of FedLRGD in Theorem 2 under various assumptions, including strong convexity, smoothness in parameter, smoothness in data, non-singularity, etc. In particular, we proved that this complexity scales like (neglecting typically sub-dominant factors). Moreover, to compare FedLRGD to the standard FedAve method in the literature [28], we also evaluated the federated oracle complexity of FedAve in Theorem 3 and saw that it scales like . Then, we demonstrated that when the data dimension is small and the loss function is sufficiently smooth in the data, the federated oracle complexity of FedLRGD is indeed smaller than that of FedAve. Finally, in a complementary direction, we built upon aspects of our analysis of FedLRGD to generalize a well-known result from high-dimensional statistics regarding low rank approximations of smooth latent variable models in Theorem 1.
We also mention a few future research directions to conclude. It is natural to try to weaken some of the assumptions made to derive the technical results in this work, cf. Section II-B. For example, the strong convexity assumption could potentially be weakened to convex or even non-convex loss functions. As another example, the non-singularity and conditioning assumptions could perhaps be replaced by simpler assumptions on the gradient of the loss function, e.g., by developing variants of positive definite kernel ideas and assuming these on the gradient perceived as a bivariate kernel. The non-singularity and conditioning assumptions could then be deduced from such assumptions. In a different vein, we also leave the thorough empirical evaluation of practical versions of FedLRGD and their comparisons to known federated learning methods, such as FedAve, in various senses (including federated oracle complexity and wall-clock running time) as future work.
Appendix A Smoothness Assumptions for Logistic Regression with Soft Labels
In this appendix, we delineate an example of a commonly used loss function in logistic regression satisfying the important smoothness and strong convexity assumptions in Section II-B. Let the parameter space be so that any parameter vector , and each labeled training data sample be such that the label is a “soft” belief. The compactness of the parameter space is reasonable since training usually takes place in a compact subset of the Euclidean space , and the use of soft labels instead of canonical binary classification labels has also been studied in the literature, cf. [93]. Consider the -regularized cross-entropy loss function:
(66)
where is a hyper-parameter that determines the level of regularization, and we assume for convenience that the bias parameter in logistic regression is . Given a training dataset , the associated empirical risk is
(67)
which corresponds to the well-known setting of -regularized logistic regression [3, Section 4.4]. We next verify that in (66) satisfies the smoothness and strong convexity assumptions in Section II-B:
1. Observe that , which means that for all and all ,
(68) where the first inequality follows from the triangle inequality, the second inequality holds because , the third inequality holds due to the Lipschitz continuity of the sigmoid function and the fact that , the fourth inequality follows from the Cauchy-Schwarz inequality, and the final inequality follows from the bound . Hence, the Lipschitz constant .
2.
3. Observe that for any fixed and any , the th partial derivative . Let us arbitrarily choose for the purposes of illustration. Then, for any with a value of at the th position and , we have for all and ,
(69) Hence, for all and , we obtain
(70) where the third case follows from direct calculation, the first case follows from a similar argument to the second case, and the second case holds because
(71) where the second line follows from the fact that , the third line follows from the triangle inequality and the fact that , the fourth line holds because for all , the fifth line follows from Lipschitz continuity and the fact that , and the sixth line follows from Hölder’s inequality and the fact that . This implies that for all and , we have
(72) Thus, in this example, and each belongs to a -Hölder class. (A numerical sketch of these computations follows this list.)
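A short numerical companion to the verification above is given below; the specific values and the regularization convention in the code are our own illustrative assumptions. Since the Hessian in the parameter is the sigmoid derivative times the outer product of the feature vector with itself, plus the regularization term, and the sigmoid derivative is at most 1/4, its eigenvalues certify both smoothness and strong convexity:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y, lam):
    # Regularized cross-entropy loss with a soft label y in [0, 1], cf. (66).
    z = sigmoid(theta @ x)
    return -y * np.log(z) - (1 - y) * np.log(1 - z) + lam * theta @ theta

def grad(theta, x, y, lam):
    return (sigmoid(theta @ x) - y) * x + 2 * lam * theta

def hessian(theta, x, lam):
    z = sigmoid(theta @ x)
    return z * (1 - z) * np.outer(x, x) + 2 * lam * np.eye(len(theta))

rng = np.random.default_rng(3)
theta, x, y, lam = rng.standard_normal(5), rng.uniform(-1, 1, 5), 0.7, 0.1
eig = np.linalg.eigvalsh(hessian(theta, x, lam))
print(eig.min() >= 2 * lam - 1e-12)                # strong convexity modulus 2 * lam
print(eig.max() <= x @ x / 4 + 2 * lam + 1e-12)    # smoothness since sigmoid'(z) <= 1/4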
Appendix B Bounds on Erlang Quantile Function
For any , let be an Erlang distributed random variable with shape parameter , rate , and CDF given by [82, Example 3.3.21]:
(73)
The ensuing proposition provides bounds on the quantile function (or inverse CDF) in the neighborhood of unity.
Proposition 2 (Quantile Function of Erlang Distribution).
For any integer such that , we have
Proof.
Recall from the definition of generalized inverse that
(74)
where the second equality uses (73) and the Maclaurin series of , and the fourth equality follows from straightforward manipulations of the summation on the right hand side.
To prove the lower bound, notice that for all ,
where the first implication holds because , the second implication follows from only retaining the term in the summation on the left hand side, and the last implication holds because . Hence, using (74), we obtain
Next, note that since we assume that , . So, it suffices to consider within the infimum in (74). To establish the upper bound, observe that for all ,
where the first implication holds because , the third implication follows from computing the geometric series on the left hand side in the previous line over all , and the last implication holds because and . Therefore, using (74), we get
This completes the proof. ∎
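Since the Erlang distribution with integer shape and a given rate coincides with the Gamma distribution with that shape and reciprocal scale, the quantile function studied in Proposition 2 is directly computable; the sketch below (the values of the shape and rate are arbitrary choices) checks the closed-form CDF (73) against scipy and evaluates quantiles in the neighborhood of unity:

import math
from scipy.stats import gamma

k, lam = 5, 2.0

def erlang_cdf(x):
    # CDF (73): F(x) = 1 - exp(-lam * x) * sum_{j = 0}^{k - 1} (lam * x)^j / j!
    return 1.0 - math.exp(-lam * x) * sum((lam * x) ** j / math.factorial(j) for j in range(k))

x = 3.0
print(abs(erlang_cdf(x) - gamma.cdf(x, a=k, scale=1.0 / lam)) < 1e-12)   # the CDFs agree

for p in [0.9, 0.99, 0.999]:
    print(p, gamma.ppf(p, a=k, scale=1.0 / lam))   # quantile function near unity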
Appendix C Low Rank Structure in Neural Networks
In this appendix, we present some Keras (TensorFlow) based simulation results which provide evidence that a variant of the approximate low rank structure used in our analysis of the algorithm appears in neural network training problems. We perform our experiments using the well-known MNIST database for handwritten digit recognition [44] as well as the CIFAR-10 dataset of tiny images [45]. We next describe our experiment for both databases.
MNIST training. The MNIST database consists of a training dataset of image-label pairs as well as a test dataset of pairs. Each handwritten digit image has grayscale pixels taking intensity values in , which we can stack and re-normalize to obtain a feature vector . Each label belongs to the set of digits , which we transform into an elementary basis vector via one hot encoding. So, each data sample has dimension .
To train a classifier that solves the multi-class classification problem of inferring digits based on images, we construct a fully connected feed-forward neural network with one hidden layer (cf. [24, Chapter 5] for basic definitions and terminology). The input layer of this network has nodes (since the feature vectors are -dimensional), the hidden layer has nodes (we typically set as well), and the output layer has nodes (since there are possible label values). The network also possesses the usual bias nodes, which are not included in the above counts. We use all logistic (or sigmoid) activation functions (and in one case, all rectified linear unit, i.e., ReLU, activation functions) in the hidden layer, and the softmax activation function in the output layer. Note that these neural networks are highly overparametrized, because , where is the number of weight parameters in the network including all bias nodes. Finally, we use the cross-entropy loss function, and minimize the empirical risk (with respect to the training data) over the weights of the neural network using the Adam stochastic optimization algorithm [19] (and in one case, using the root mean square propagation, i.e., RMSprop, algorithm [18]). Specifically, we always use a batch size of and run the algorithm over epochs. In almost all instances in our experiments, this setting of hyper-parameters for the optimization algorithm yields test validation accuracies of over (and training accuracies of over ). (We note that perturbing the batch size or number of epochs does not change our qualitative observations.)
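For concreteness, a minimal Keras sketch of this setup is shown below. The hidden-layer width, batch size, and epoch count are placeholder assumptions (the precise values used in our experiments are described above), and we use the sparse variant of the cross-entropy loss so that explicit one hot encoding is unnecessary:

import tensorflow as tf

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr.reshape(-1, 784) / 255.0, x_te.reshape(-1, 784) / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(784, activation="sigmoid", input_shape=(784,)),  # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),                       # output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, batch_size=128, epochs=10, validation_data=(x_te, y_te))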
CIFAR-10 training. The CIFAR-10 database consists of a training dataset of image-label pairs as well as a test dataset of pairs. Each image has color pixels with channels per pixel taking intensity values in , which we can stack and re-normalize to obtain a feature vector . Each label belongs to a set of possible values, e.g., airplane, bird, horse, truck, etc., which we transform into an elementary basis vector via one hot encoding. So, each data sample has dimension .
This time, to train a classifier that solves the problem of inferring labels based on images, we construct a convolutional neural network (cf. [24, Chapter 5]). The input layer of this network has nodes (since there are channels), the next two hidden convolutional layers use filters per input channel each with kernels with a stride of and zero padding (to retain the image dimensions of ), the fourth layer performs sub-sampling via max-pooling with a stride of , the fifth layer is fully connected with output nodes, and the sixth output layer is also fully connected and has nodes (since there are possible label values). The network also possesses the usual bias nodes, which are not included in the above counts of nodes. We use all logistic activation functions (and in one case, all ReLU activation functions) in the two convolutional hidden layers and the fifth fully connected layer, and the softmax activation function in the output layer. As before, these deep neural networks are highly overparametrized, because , where is the number of weight parameters in the network. Finally, we use the cross-entropy loss function, and minimize the empirical risk over the weights of the neural network using the Adam stochastic optimization algorithm (and in one case, using the RMSprop algorithm) as before. Specifically, we always use a batch size of and run the algorithm over epochs. In almost all instances in our experiments, this setting of hyper-parameters for the optimization algorithm yields test validation accuracies of around (with slightly higher test accuracies for ReLU activations). (On the other hand, training accuracies are usually higher, e.g., over for ReLU activations. Although these accuracies can be further increased through careful tuning using techniques like dropout for regularization and batch normalization, we do not utilize such ideas here for simplicity. We are able to illustrate low rank structure without these additional bells and whistles. Furthermore, we note that perturbing the batch size, number of epochs, the architecture of hidden layers, etc. does not change our qualitative observations.)
Experimental details. In either the MNIST or the CIFAR-10 setting delineated above, let be the optimal network weights output by the training process. Then, for any , we randomly choose test data samples uniformly among all subsets of test data samples. Moreover, we fix any independent random perturbations of , namely, vectors of network weights such that
(75)
for all , where are i.i.d. Gaussian distributed random vectors with zero mean and identity covariance. We next construct a tensor of partial derivatives:
(76)
where , denotes the cross-entropy loss for a data sample when the neural network has weights , the first dimension of is indexed by , the second dimension is indexed by , and the third dimension is indexed by the network weight coordinates . Note that the partial derivatives to obtain each can be computed in a variety of ways; we use one iteration of a standard GD procedure in Keras with a single data sample. For any , let be the th matrix slice of the tensor (where the rows of are indexed by and the columns are indexed by ). Furthermore, letting denote the ordered singular values of , we define the approximate rank of the (non-zero) matrix as the minimum number of ordered singular values that capture of the squared Frobenius norm of :
(77)
(Note that the choice of is arbitrary, and our qualitative observations do not change if we perturb this value.) Next, we fix a sufficiently small ; specifically, we use after eyeballing several instances of optimal weight vectors to determine a small enough threshold to rule out zero weights (although the precise choice of this value does not alter our qualitative observations). Ruling out zero weights is a stylistic choice in our experiments since these weights are effectively not used by the final trained model. However, keeping them in the sequel would not change the depicted low rank structure. Then, for every such that is “non-trivial,” by which we mean , we compute the approximate rank of . We finally plot a histogram of this multiset of approximate rank values, where the multiset contains one value for each such that is non-trivial. (Footnote 12: To be precise, for computational efficiency, we plot histograms of around randomly and uniformly chosen approximate rank values from this multiset.)
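A minimal sketch of the approximate rank computation in (77) follows; the 90% energy threshold and the synthetic noisy low rank matrix are our own illustrative assumptions:

import numpy as np

def approx_rank(M, frac=0.9):
    # Smallest number of leading singular values capturing the given fraction
    # of the squared Frobenius norm of M, as in (77).
    s2 = np.linalg.svd(M, compute_uv=False) ** 2
    cumulative = np.cumsum(s2) / np.sum(s2)
    return int(np.searchsorted(cumulative, frac) + 1)

rng = np.random.default_rng(4)
M = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 200))  # planted rank 3
M += 1e-2 * rng.standard_normal((500, 200))                        # small additive noise
print(approx_rank(M))   # small (at most 3), reflecting the planted rank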
[Figure 1: Histograms of the approximate ranks of the gradient matrices in the MNIST experiments; panels (a)-(i) correspond to the hyper-parameter settings discussed below.]
MNIST plots. In the MNIST setting, Figure 1 plots histograms generated using the aforementioned procedure for various choices of network hyper-parameters. We now explain these precise choices and discuss associated observations. Figure 1(b) depicts the prototypical histogram, where we use and logistic activation functions in the hidden layer to define the neural network, Adam to train the network, and to construct the tensor. The ordinate (vertical axis) of this plot represents frequency, i.e., the number of times we see a particular value, and the abscissa (horizontal axis) represents the set of possible approximate rank values. Moreover, the plot portrays that almost all ’s corresponding to non-trivial ’s have approximate rank bounded by . In fact, about half of these ’s have approximate rank bounded by . This suggests that the gradient of the loss function for this neural network model indeed exhibits some approximate low rank structure. We also note that is “smooth” here, because we have used smooth activation functions and the cross-entropy loss. Furthermore, the observation of low rankness does not change when we use other smooth losses instead of the cross-entropy loss.
Compared to Figure 1(b), Figure 1(a) and Figure 1(c) only change the value of to and , respectively. It is clear from these plots that the broad qualitative observation explained above remains unchanged if is varied. In a different vein, compared to Figure 1(b), Figure 1(e) and Figure 1(f) only change the number of nodes in the hidden layer to and , respectively. These plots illustrate that our observations are qualitatively unchanged by different hidden layer sizes. Finally, compared to Figure 1(b), Figure 1(g) only changes the optimization algorithm for training from Adam to RMSprop, and demonstrates (once again) that the choice of algorithm has essentially no effect on our observations. Figure 1(d) uses the same setting of network hyper-parameters as Figure 1(b), but it plots a histogram of the multiset of approximate ranks of all ’s, i.e., ’s corresponding to all weights . As mentioned earlier, the purpose of this plot is to illustrate that keeping all weights in the histograms does not change the depicted low rank structure, i.e., all ’s have reasonably small approximate rank.
All of the experiments until now focus on smooth and neural networks with one hidden layer. To understand the effect of changing the number of hidden layers, Figure 1(h) illustrates the histogram obtained when the neural network has two hidden layers with size each and all logistic activation functions, and all other hyper-parameters are kept the same as Figure 1(b). In this plot, the approximate rank of the bulk of ’s corresponding to non-trivial ’s is bounded by . This suggests that the addition of a few hidden layers does not increase the approximate rank of the gradient of by much. On the other hand, to investigate the effect of using non-smooth activation functions, Figure 1(i) uses all ReLU activation functions in the hidden layer, and the same hyper-parameters as Figure 1(b) in all other respects. This histogram is qualitatively the same as that in Figure 1(b). Therefore, the non-smoothness of does not seem to affect the approximate low rankness of the gradient of . This final observation indicates that algorithms like FedLRGD, which exploit approximate low rankness, could potentially be useful in non-smooth settings as well.
[Figure 2: Histograms of the approximate ranks of the gradient matrices in the CIFAR-10 experiments; panels (a)-(d) correspond to the hyper-parameter settings discussed below.]
CIFAR-10 plots. In the CIFAR-10 setting, Figure 2 plots histograms generated using the aforementioned procedure for various choices of network hyper-parameters. The purpose of this figure is to demonstrate that the low rank structure depicted in Figure 1 can also be observed in more complex neural network architectures used for more challenging supervised learning tasks. As before, we briefly explain these precise choices and discuss associated observations. Many of our comments are similar in spirit to those in the MNIST setting.
Figure 2(b) depicts the prototypical histogram, where we use the convolutional architecture outlined earlier, logistic activation functions in all non-sub-sampling hidden layers, Adam to train the network, and to construct the tensor. The axes of the plot are the same as those in Figure 1. The plot illustrates that almost all ’s corresponding to non-trivial ’s have approximate rank bounded by . This corroborates the MNIST experiments that the gradient of the loss function for CIFAR-10 also exhibits some approximate low rank structure. Compared to Figure 2(b), Figure 2(a) changes the value of to to show that our qualitative observations remain unchanged if is varied, and Figure 2(c) only changes the optimization algorithm for training to RMSprop, and demonstrates that the choice of algorithm has essentially no effect on our observations (as before). Lastly, while all of these plots are for a smooth loss function (because we have used smooth activation functions and the cross-entropy loss), compared to Figure 2(b), Figure 2(d) only changes the activation functions in all non-sub-sampling hidden layers to ReLU. Once again, we observe that there is no qualitative difference in the depicted approximate low rank structure even when non-smooth activation functions are used.
Appendix D Proof of Theorem 3
To prove Theorem 3, we will need two auxiliary lemmata. The first of these is the following convergence result which is distilled from [30, Theorem 1].
Lemma 3 (Convergence of FedAve [30, Theorem 1]).
The second lemma provides a useful estimate of a trigonometric expression we will utilize in the proof.
Lemma 4 (Taylor Approximation of Root).
For all , we have (Footnote 13: Note that we use here instead of , because the latter can only be expanded as a Newton-Puiseux series.)
Proof.
Define the function as
The first and second derivatives of are
for all . Moreover, we have , , and for any ,
where the first inequality follows from the triangle inequality, and the second inequality follows from setting since is an increasing function. Hence, using Taylor’s theorem (with Lagrange remainder term), we get:
This implies that:
which completes the proof. ∎
We next establish Theorem 3.
Proof of Theorem 3.
Since we assume that , , , and with respect to (see assumptions 2-5 in the theorem statement), for every , each term on the right hand side of Lemma 3 satisfies
Clearly, the third term above dominates the second term, and the first term dominates the fourth term for all sufficiently large (as ). Hence, Lemma 3 implies that
where is a constant that depends on the various constant problem parameters.
Observe that if we seek an approximation accuracy of , i.e., , we need to ensure that
(78)
According to Definition 1, the associated federated oracle complexity of the FedAve algorithm is , because the active clients each compute gradients in all epochs of FedAve. However, we have control over the parameters and in FedAve. So, we can minimize this complexity over all and subject to the approximation accuracy constraint above. Indeed, the (minimum) federated oracle complexity of FedAve is
(79)
where , , , , , and are held constant in the optimization. (Footnote 14: Note that we allow and to be real numbers in the optimization rather than natural numbers, because this does not affect the final scaling of federated oracle complexity with respect to the problem parameters. Moreover, the optimization in (79) is really an upper bound on if we are to be pedantic.) We next calculate how scales with , , , and . For convenience, we let in the sequel.
First, notice that the inequality constraint in (78) can be changed into an equality constraint
Indeed, if the optimal and for (79) do not achieve (78) with equality, we can reduce to achieve (78) with equality, and this reduces the objective value of further (producing a contradiction). Rearranging this equality constraint, we obtain the following quadratic equation in :
whose unique non-negative solution is
(80)
This solution satisfies the bounds
(81)
where the lower bound follows from neglecting the term in (80), and the upper bound follows from the sub-additivity of the square root function. (Note that is satisfied.) Now define the constants
(82)
(83)
(84)
and the function as:
Then, using (79), (81), (82), (83), and (84), we have
(85)
So, we focus on evaluating from hereon.
Since , it is straightforward to verify that is a convex function on . This means that any first order stationary point of in is a global minimum. We will find such a stationary point in the sequel. To this end, notice that
Setting this equal to yields the stationarity condition
(86)
Using the substitution above, we obtain:
which we rearrange to get the (monic) depressed cubic equation:
(87)
The discriminant of this depressed cubic equation is
where the second equality follows from (82), (83), and (84), and the strict positivity holds because and are constants, , and is sufficiently large (compared to these constant problem parameters). In particular, we will assume throughout this proof that
(88)
In the previous case, (88) clearly implies that
Thus, the depressed cubic equation in (87) has three distinct real solutions. Using Viète’s trigonometric formula for the roots of a depressed cubic equation [94, 95], we have that
is one of the solutions to (87), and hence,
is one of the solutions to (86). Applying Lemma 4 to and simplifying yields
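For reference, Viète's trigonometric formula expresses the three real roots of a depressed cubic t^3 + p t + q = 0 (with p < 0 and positive discriminant) in closed form; the following sketch checks it on an arbitrary example polynomial:

import math

def viete_roots(p, q):
    # Roots t_k = 2 sqrt(-p/3) cos( (1/3) arccos( (3q)/(2p) sqrt(-3/p) ) - 2 pi k / 3 ), k = 0, 1, 2,
    # valid when the discriminant -4 p^3 - 27 q^2 is positive (three distinct real roots).
    r = 2.0 * math.sqrt(-p / 3.0)
    phi = math.acos(3.0 * q / (2.0 * p) * math.sqrt(-3.0 / p))
    return [r * math.cos(phi / 3.0 - 2.0 * math.pi * k / 3.0) for k in range(3)]

for t in viete_roots(p=-7.0, q=6.0):        # t^3 - 7 t + 6 = (t - 1)(t - 2)(t + 3)
    print(t, t ** 3 - 7.0 * t + 6.0)        # residuals vanish up to floating point error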
where we use (88) and the fact that , which imply that the condition required by Lemma 4 is satisfied:
Then, substituting (82), (83), and (84) into these bounds gives
Since and (88) holds, we have
and we see that ; specifically,
(89)
(Note that is satisfied.) Furthermore, is the unique positive solution to the stationarity condition (86) by Descartes’ rule of signs. Therefore, using (82), (83), and (84), we have
We can now estimate the scaling of . Indeed, by applying the inequalities in (89), we have
where the last inequality again uses (88) and the fact that , which imply that
Appendix E Proof of Proposition 1
Proof.
We first calculate the federated oracle complexity of FedLRGD. Since , without loss of generality, suppose that for some constant . From (47) in Theorem 2,
Let us analyze each salient term in this equation:
where the first estimate uses the facts that and , the second estimate uses the facts that and , and the third estimate uses the facts that and . Hence, since , and grows at least polynomially in , which implies that
(90)
where denotes a term that is dominated by any polynomial in , i.e., for any .
On the other hand, from (48) in Theorem 2,
where the second equality uses the fact that , and the third equality follows from (90). Therefore, the federated oracle complexity of FedLRGD from Theorem 2 scales as
where the second line holds because and , the fourth line holds because since , and the last line holds because .
Furthermore, rewriting Theorem 3, the federated oracle complexity of FedAve scales as
Hence, since , we have
as desired, where is some constant. This completes the proof. ∎
References
- [1] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
- [2] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, ser. Applied Optimization. New York, NY, USA: Springer, 2004, vol. 87.
- [3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., ser. Springer Series in Statistics. New York, NY, USA: Springer, 2009.
- [4] S. Bubeck, Convex Optimization: Algorithms and Complexity, ser. Foundations and Trends in Machine Learning. Hanover, MA, USA: now Publishers Inc., 2015, vol. 8, no. 2-4.
- [5] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM Journal on Scientific Computing, vol. 34, no. 3, pp. A1380–A1405, 2012.
- [6] O. Devolder, F. Glineur, and Y. Nesterov, “First-order methods with inexact oracle: the strongly convex case,” CORE Discussion Papers 2013016, vol. 2013, no. 16, March 2013.
- [7] A. M.-C. So and Z. Zhou, “Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity,” Optimization Methods and Software, vol. 32, no. 4, pp. 963–992, May 2017.
- [8] M. Schmidt, N. Le Roux, and F. Bach, “Convergence rates of inexact proximal-gradient methods for convex optimization,” in Proceedings of the Advances in Neural Information Processing Systems 24 (NeurIPS), Granada, Spain, December 12-17 2011, pp. 1–9.
- [9] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, January 2009.
- [10] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, December 1964.
- [11] Y. E. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” Doklady Akademii Nauk SSSR, vol. 269, no. 3, pp. 543–547, 1983, in Russian.
- [12] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal distributed online prediction using mini-batches,” Journal of Machine Learning Research, vol. 13, no. 6, pp. 165–202, January 2012.
- [13] N. Le Roux, M. Schmidt, and F. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” in Proceedings of the Advances in Neural Information Processing Systems 25 (NeurIPS), Lake Tahoe, NV, USA, December 3-8 2012, pp. 1–9.
- [14] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, Series A, vol. 162, p. 83–112, March 2017.
- [15] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss minimization,” Journal of Machine Learning Research, vol. 14, pp. 567–599, February 2013.
- [16] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proceedings of the Advances in Neural Information Processing Systems 26 (NeurIPS), Lake Tahoe, NV, USA, December 5-10 2013, pp. 315–323.
- [17] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 61, pp. 2121–2159, July 2011.
- [18] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” 2012, COURSERA: Neural Networks for Machine Learning.
- [19] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7-9 2015, pp. 1–13.
- [20] P. Jain and P. Kar, Non-convex Optimization for Machine Learning, ser. Foundations and Trends in Machine Learning, M. Jordan, Ed. Hanover, MA, USA: now Publishers Inc., 2017, vol. 10, no. 3-4.
- [21] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, May 2018.
- [22] P. Netrapalli, “Stochastic gradient descent and its variants in machine learning,” Journal of Indian Institute of Science, vol. 99, no. 2, pp. 201–213, June 2019.
- [23] A. Jadbabaie, A. Makur, and D. Shah, “Gradient-based empirical risk minimization using local polynomial regression,” November 2020, arXiv:2011.02522 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2011.02522
- [24] C. M. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics. New York, NY, USA: Springer, 2006.
- [25] J. Fan and I. Gijbels, Local Polynomial Modeling and Its Applications, ser. Monographs on Statistics and Applied Probability. London, UK: Chapman and Hall, 1996, vol. 66.
- [26] W. S. Cleveland and C. Loader, “Smoothing by local regression: Principles and methods,” in Statistical Theory and Computational Aspects of Smoothing: Proceedings of the COMPSTAT ’94 Satellite Meeting held in Semmering, Austria 27-28 August 1994, ser. Contributions to Statistics, W. Härdle and M. G. Schimek, Eds. Heidelberg, Germany: Physica-Verlag, 1996, pp. 10–49.
- [27] A. B. Tsybakov, Introduction to Nonparametric Estimation, ser. Springer Series in Statistics. New York, NY, USA: Springer, 2009.
- [28] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, April 20-22 2017, pp. 1273–1282.
- [29] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, H. Qi, D. Ramage, R. Raskar, M. Raykova, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao, Advances and Open Problems in Federated Learning, ser. Foundations and Trends in Machine Learning, M. Jordan, Ed. Hanover, MA, USA: now Publishers Inc., 2021, vol. 14, no. 1-2.
- [30] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, August 26-28 2020, pp. 2021–2031.