On the Convergence of NEAR-DGD for Nonconvex Optimization with Second Order Guarantees
Abstract
We consider the setting where the nodes of an undirected, connected network collaborate to solve a shared objective modeled as the sum of smooth functions. We assume that each summand is privately known by a unique node. NEAR-DGD is a distributed first order method which permits adjusting the amount of communication between nodes relative to the amount of computation performed locally in order to balance convergence accuracy and total application cost. In this work, we generalize the convergence properties of a variant of NEAR-DGD from the strongly convex to the nonconvex case. Under mild assumptions, we show convergence to minimizers of a custom Lyapunov function. Moreover, we demonstrate that the gap between those minimizers and the second order stationary solutions of the original problem can become arbitrarily small depending on the choice of algorithm parameters. Finally, we accompany our theoretical analysis with a numerical experiment to evaluate the empirical performance of NEAR-DGD in the nonconvex setting.
Index Terms:
distributed optimization, decentralized gradient method, nonconvex optimization, second-order guarantees.
I Introduction
We focus on optimization problems where the cost function can be modeled as a summation of components,
$\min_{x \in \mathbb{R}^p} f(x) := \sum_{i=1}^{n} f_i(x) \qquad \text{(1)}$
where each $f_i: \mathbb{R}^p \to \mathbb{R}$ is a smooth and (possibly) nonconvex function.
Problems of this type frequently arise in a variety of decentralized systems such as wireless sensor networks, smart grids and systems of autonomous vehicles. A special case of this setting involves a connected, undirected network of $n$ nodes $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the sets of nodes and edges, respectively. Each node $i \in \mathcal{V}$ has private access to the cost component $f_i$ and maintains a local estimate $x_i \in \mathbb{R}^p$ of the global decision variable. This leads to the following equivalent reformulation of Problem (1),
$\min_{x_1, \dots, x_n \in \mathbb{R}^p} \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } x_i = x_j, \ \forall (i,j) \in \mathcal{E} \qquad \text{(2)}$
One of the first algorithms proposed for the solution of Problem (2) when the functions $f_i$ are convex is the Distributed (Sub)Gradient Descent (DGD) method [1], which relies on the combination of two elements: i) local gradient steps on the functions $f_i$ and ii) calculations of weighted averages of local and neighboring variables. For the remainder of this work, we will be referring to these two procedures as computation and consensus (or communication) steps, respectively. While DGD has been shown to converge to an approximate solution of Problem (2) under constant steplengths, a subset of methods known as gradient tracking algorithms [2, 3, 4] overcomes this limitation by iteratively estimating the average gradient across nodes.
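To make the computation and consensus steps concrete, the following minimal single-machine simulation runs DGD with a constant steplength on a toy instance; the 4-node ring, the quadratic local costs $f_i(x) = \frac{1}{2}\|x - b_i\|^2$ and all parameter values are illustrative assumptions, not the setup of this paper.

```python
import numpy as np

# Hypothetical local costs f_i(x) = 0.5*||x - b_i||^2, so grad f_i(x_i) = x_i - b_i
# and the global minimizer of the sum is the average of the b_i.
n, p, alpha = 4, 2, 0.1
rng = np.random.default_rng(0)
b = rng.standard_normal((n, p))

# Doubly stochastic mixing matrix for a 4-node ring (uniform 1/3 weights).
W = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])

X = np.zeros((n, p))                 # row i holds node i's local estimate x_i
for _ in range(200):
    X = W @ X - alpha * (X - b)      # consensus step followed by a local gradient step

print(np.linalg.norm(X - b.mean(axis=0)))
```

Consistent with the discussion above, under a constant steplength the iterates settle near, but not exactly at, the global minimizer: a residual error proportional to the steplength and the network disagreement remains.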
The convergence of DGD when the objective function is nonconvex has been studied in [5]. NEXT [6], SONATA [7, 8], xFilter [9] and MAGENTA [10] are some examples of distributed methods that utilize gradient tracking and can handle nonconvex objectives. Other approaches include primal-dual algorithms [11, 12] (we note that primal-dual and gradient tracking algorithms are equivalent in some cases [3]), the perturbed push-sum method [13], zeroth order methods [14, 15], and stochastic gradient algorithms [16, 17, 18, 19].
Providing second order guarantees when Hessian information is not available is a challenging task. As a result, the majority of the works listed in the previous paragraph establish convergence to critical points only. A recent line of research leverages existing results from dynamical systems theory and the structural properties of certain problems (which include matrix factorization, phase retrieval and dictionary learning, among others) to demonstrate that several centralized first order algorithms converge to minimizers almost surely when initialized randomly [20]. Specifically, if the objective function satisfies the strict saddle property, namely, if all critical points are either strict saddles or minimizers, then many first order methods converge to saddles only if they are initialized in a low-dimensional manifold with measure zero. Using similar arguments, almost sure convergence to second order stationary points of Problem (2) is proven in [8] for DOGT, a gradient tracking algorithm for directed networks, and in [12] for the first order primal-dual algorithms GPDA and GADMM. The convergence of DGD with constant steplength to a neighborhood of the minimizers of Problem (2) is also shown in [8]. The conditions under which the Distributed Stochastic Gradient method (D-SGD), and Distributed Gradient Flow (DGF), a continuous-time approximation of DGD, avoid saddle points are studied in [21] and [22], respectively. Finally, the authors of [13] prove almost sure convergence to local minima under the assumption that the objective function has no saddle points.
Given the diversity of distributed systems in terms of computing power, connectivity and energy consumption, among other concerns, the ability to adjust the relative amounts of communication and computation on a case-by-case basis is a desirable attribute for a distributed optimization algorithm. While some existing methods are designed to minimize overall communication load (for instance, the authors of [9] employ Chebyshev polynomials to improve communication complexity), all of the methods listed above perform fixed amounts of computation and communication at every iteration and lack adaptability to heterogeneous environments.
I-A Contributions
In this work, we extend the convergence analysis of the NEAR-DGD method, originally proposed in [23], from the strongly convex to the nonconvex setting. NEAR-DGD is a distributed first order method with a flexible framework, which allows for the exchange of computation with communication in order to reach a target accuracy level while simultaneously maintaining low overall application cost. We design a custom Lyapunov function which captures both progress on Problem (1) and distance to consensus, and demonstrate that under relatively mild assumptions, a variant of NEAR-DGD converges to the set of critical points of this Lyapunov function and to approximate critical points of the function $f$ of Problem (1). Moreover, we show that the distance between the limit points of NEAR-DGD and the critical points of $f$ can become arbitrarily small by appropriate selection of algorithm parameters. Finally, we employ recent results based on dynamical systems theory to prove that the same variant of NEAR-DGD almost surely avoids the saddle points of the Lyapunov function when initialized randomly. Our analysis is shorter and simpler compared to other works due to the convenient form of our Lyapunov function. The implication is that NEAR-DGD asymptotically converges to second order stationary solutions of Problem (1) as the values of algorithm parameters increase.
I-B Notation
In this paper, all vectors are column vectors. We will use the notation $v'$ to refer to the transpose of a vector $v$. The concatenation of local vectors $v_i$ is denoted by a lowercase boldface letter, i.e. $\mathbf{v} = [v_1', v_2', \dots, v_n']'$. The average of the vectors contained in $\mathbf{v}$ will be denoted by $\bar{v}$, i.e. $\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i$. We use uppercase boldface letters for matrices and will denote the element in the $i$-th row and $j$-th column of matrix $\mathbf{A}$ with $[\mathbf{A}]_{ij}$. We will refer to the $i$-th (real) eigenvalue in ascending order (i.e. the 1st is the smallest) of a matrix $\mathbf{A}$ as $\lambda_i(\mathbf{A})$. We use the notations $I_p$ and $1_n$ for the identity matrix of dimension $p$ and the vector of ones of dimension $n$, respectively. We will use $\|\cdot\|$ to denote the $\ell_2$-norm, i.e. for $v \in \mathbb{R}^p$ we have $\|v\| = \sqrt{\sum_{j=1}^{p} v_j^2}$, where $v_j$ is the $j$-th element of $v$. The inner product of vectors $u$ and $v$ will be denoted by $\langle u, v \rangle$. The symbol $\otimes$ will denote the Kronecker product operation. Finally, we define the averaging matrix $\mathbf{M} = \left(\frac{1}{n} 1_n 1_n'\right) \otimes I_p$.
I-C Organization
The rest of this paper is organized as follows. We briefly review the NEAR-DGD method and list our assumptions for the rest of this work in Section II. We analyze the convergence properties of NEAR-DGD when the function of Problem (1) is nonconvex in Section III. Finally, we present the results of a numerical experiment we conducted to assess the empirical performance of NEAR-DGD in the nonconvex setting in Section IV, and conclude this work in Section V.
II The NEAR-DGD method
In this section, we list our assumptions for the remainder of this work and briefly review the NEAR-DGD method, originally proposed for strongly convex optimization in [23]. We first introduce the following compact reformulation of Problem (2),
$\min_{\mathbf{x} \in \mathbb{R}^{np}} \mathbf{f}(\mathbf{x}) := \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } \left(\mathbf{I}_{np} - \mathbf{Z}\right) \mathbf{x} = 0 \qquad \text{(3)}$
where $\mathbf{x} = [x_1', x_2', \dots, x_n']'$ in $\mathbb{R}^{np}$ is the concatenation of the local variables $x_i$, $\mathbf{Z} = \mathbf{W} \otimes I_p$, and $\mathbf{W} \in \mathbb{R}^{n \times n}$ is a matrix satisfying the following condition.
Assumption II.1.
(Consensus matrix) The matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$ has the following properties: i) symmetry, ii) double stochasticity, iii) $[\mathbf{W}]_{ij} \neq 0$ if and only if $(i,j) \in \mathcal{E}$ or $i = j$, and $[\mathbf{W}]_{ij} = 0$ otherwise, and iv) positive-definiteness.
We can construct a matrix satisfying properties i)-iii) of Assumption II.1 by defining its elements to be max degree or Metropolis-Hastings weights [24], for instance. Such matrices are not necessarily positive-definite, so we can further enforce property iv) using simple linear transformations (for example, we could define $\hat{\mathbf{W}} = c \mathbf{I}_n + (1-c) \mathbf{W}$, where $c \in [1/2, 1)$ is a constant). For the rest of this work, we will be referring to the largest eigenvalue of $\mathbf{W} - \frac{1}{n} 1_n 1_n'$ as $\beta$, i.e. $\beta = \lambda_{n-1}(\mathbf{W}) < 1$.
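As an illustration, the sketch below builds Metropolis-Hastings weights for a small ring graph and then applies a simple shift to enforce positive-definiteness; the 5-node ring and the particular shift $(\mathbf{W} + \mathbf{I})/2$ are our own illustrative choices, one of several transformations of the kind mentioned above.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings mixing matrix for an undirected graph given as an
    (n, n) 0/1 adjacency matrix: w_ij = 1/(1 + max(deg_i, deg_j)) on edges,
    with the diagonal chosen so that every row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Ring of 5 nodes.
adj = np.zeros((5, 5), dtype=int)
for i in range(5):
    adj[i, (i + 1) % 5] = adj[(i + 1) % 5, i] = 1

W = metropolis_weights(adj)
W_pd = 0.5 * (W + np.eye(5))   # simple shift enforcing positive-definiteness

print(np.linalg.eigvalsh(W_pd))
```

The raw Metropolis matrix is symmetric and doubly stochastic but has a negative eigenvalue on this graph; the shifted matrix keeps properties i)-iii) and is positive-definite.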
We adopt the following standard assumptions for the global function of Problem (3).
Assumption II.2.
(Global Lipschitz gradient) The global objective function $\mathbf{f}$ has $L$-Lipschitz continuous gradients, i.e. $\|\nabla \mathbf{f}(\mathbf{x}) - \nabla \mathbf{f}(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^{np}$.
Assumption II.3.
(Coercivity) The global objective function $\mathbf{f}$ is coercive, i.e. $\mathbf{f}(\mathbf{x}^k) \to \infty$ for every sequence $\{\mathbf{x}^k\}$ such that $\|\mathbf{x}^k\| \to \infty$.
II-A The NEAR-DGD method
Starting from (arbitrary) initial points $y_i^0 \in \mathbb{R}^p$, the local iterates of NEAR-DGD at node $i$ and iteration count $k$ can be expressed as,
$x_i^k = \sum_{j=1}^{n} \left[\mathbf{W}^{t(k)}\right]_{ij} y_j^k \qquad \text{(4a)}$
$y_i^{k+1} = x_i^k - \alpha \nabla f_i\left(x_i^k\right) \qquad \text{(4b)}$
where $\{t(k)\}$ is a predefined sequence of consensus rounds per iteration, $\alpha > 0$ is a positive steplength and $[\mathbf{W}^{t(k)}]_{ij}$ is the element in the $i$-th row and $j$-th column of the matrix $\mathbf{W}^{t(k)}$, resulting from the composition of $t(k)$ consensus operations.
The system-wide iterates of NEAR-DGD can be written as,
$\mathbf{x}^k = \mathbf{Z}^{t(k)} \mathbf{y}^k \qquad \text{(5a)}$
$\mathbf{y}^{k+1} = \mathbf{x}^k - \alpha \nabla \mathbf{f}\left(\mathbf{x}^k\right) \qquad \text{(5b)}$
where $\mathbf{x}^k$ and $\mathbf{y}^k$ are the concatenations of the local iterates $x_i^k$ and $y_i^k$, $\mathbf{f}(\mathbf{x}) = \sum_{i=1}^{n} f_i(x_i)$, and $\mathbf{Z} = \mathbf{W} \otimes I_p$.
The sequence of consensus rounds per iteration $\{t(k)\}$ can be suitably chosen to balance convergence accuracy and total cost on a per-application basis. When the functions $f_i$ are strongly convex, NEAR-DGD paired with an increasing sequence $\{t(k)\}$ converges to the optimal solution of Problem (3), and achieves exact convergence with geometric rate (in terms of gradient evaluations) when the number of consensus rounds increases over time [23].
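The communication/computation trade-off can be sketched as follows: each iteration applies $t(k)$ rounds of mixing followed by one gradient step. The quadratic local costs and the small ring network below are hypothetical stand-ins, used only to show that more consensus rounds per iteration shrink the disagreement between nodes.

```python
import numpy as np

n, p, alpha = 4, 2, 0.1
rng = np.random.default_rng(1)
b = rng.standard_normal((n, p))          # hypothetical data: f_i(x) = 0.5*||x - b_i||^2
grads = lambda X: X - b

W = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])     # 4-node ring, uniform weights

def near_dgd(rounds, iters=300):
    """NEAR-DGD sketch: t(k) consensus rounds (x = W^t(k) y), then one gradient step."""
    Y = np.zeros((n, p))
    for k in range(iters):
        X = np.linalg.matrix_power(W, rounds(k)) @ Y   # t(k) communication rounds
        Y = X - alpha * grads(X)                       # one computation round
    return X

spread = lambda X: np.linalg.norm(X - X.mean(axis=0))
print(spread(near_dgd(lambda k: 1)), spread(near_dgd(lambda k: 5)))
```

The run with five consensus rounds per iteration ends with a much smaller consensus error than the run with one, at five times the communication cost per gradient evaluation.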
III Convergence Analysis
In this section, we present our theoretical results on the convergence of a variant of NEAR-DGD where the number of consensus steps per iteration is fixed, i.e. $t(k) = t$ in (4a) and (5a) for some positive integer $t$. We will refer to this method as NEAR-DGDt. First, we introduce the following Lyapunov function, which will play a key role in our analysis,
$\mathcal{L}(\mathbf{x}) = \mathbf{f}(\mathbf{x}) + \frac{1}{2\alpha} \mathbf{x}' \left(\mathbf{Z}^{-t} - \mathbf{I}_{np}\right) \mathbf{x} \qquad \text{(6)}$
Using (6), we can express the iterates of NEAR-DGDt as,
$\mathbf{x}^{k+1} = \mathbf{x}^k - \alpha \mathbf{Z}^t \nabla \mathcal{L}\left(\mathbf{x}^k\right) \qquad \text{(7)}$
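The displayed forms of (6) and (7) did not survive extraction cleanly. One candidate reconstruction consistent with the surrounding text (it requires the positive-definiteness of $\mathbf{W}$ from Assumption II.1, so that $\mathbf{Z}^{-t}$ is well defined) is sketched below, together with the two-line check that the resulting gradient-type update reproduces the composition of (5a) and (5b):

```latex
% Candidate Lyapunov function (a reconstruction; Z^{-t} exists because W, and
% hence Z = W \otimes I_p, is positive-definite by Assumption II.1):
\mathcal{L}(\mathbf{x}) = \mathbf{f}(\mathbf{x})
  + \frac{1}{2\alpha}\,\mathbf{x}'\bigl(\mathbf{Z}^{-t}-\mathbf{I}_{np}\bigr)\mathbf{x},
\qquad
\nabla\mathcal{L}(\mathbf{x}) = \nabla\mathbf{f}(\mathbf{x})
  + \frac{1}{\alpha}\bigl(\mathbf{Z}^{-t}-\mathbf{I}_{np}\bigr)\mathbf{x}.

% The update x^{k+1} = x^k - \alpha Z^t \nabla L(x^k) then recovers the
% NEAR-DGDt recursion obtained by composing (5a) and (5b):
\mathbf{x}^{k}-\alpha\mathbf{Z}^{t}\nabla\mathcal{L}(\mathbf{x}^{k})
 = \mathbf{x}^{k}-\alpha\mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{x}^{k})
   -\mathbf{Z}^{t}\bigl(\mathbf{Z}^{-t}-\mathbf{I}_{np}\bigr)\mathbf{x}^{k}
 = \mathbf{Z}^{t}\bigl(\mathbf{x}^{k}-\alpha\nabla\mathbf{f}(\mathbf{x}^{k})\bigr).
```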
We need one more assumption on the geometry of the Lyapunov function to guarantee the convergence of NEAR-DGDt. Namely, we require it to be “sharp” around its critical points, up to a reparametrization. This property is formally summarized below.
Definition 1.
(Kurdyka-Łojasiewicz (KL) property) [25] A function $h: \mathbb{R}^d \to \mathbb{R}$ has the KL property at $\bar{u}$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{u}$, and a continuous concave function $\varphi: [0, \eta) \to \mathbb{R}_+$ such that: i) $\varphi(0) = 0$, ii) $\varphi$ is $C^1$ (continuously differentiable) on $(0, \eta)$, iii) for all $s \in (0, \eta)$, $\varphi'(s) > 0$, and iv) for all $u \in U$ with $h(\bar{u}) < h(u) < h(\bar{u}) + \eta$, the KL inequality holds:
$\varphi'\bigl(h(u) - h(\bar{u})\bigr)\, \|\nabla h(u)\| \ge 1.$
Proper lower semicontinuous functions which satisfy the KL inequality at each point of their domain are called KL functions.
Assumption III.1.
(KL Lyapunov function) The Lyapunov function (6) is a KL function.
Assumption III.1 covers a broad range of functions, including real analytic, semialgebraic and globally subanalytic functions (see [26] for more details). For instance, if the function $\mathbf{f}$ is real analytic, then the Lyapunov function (6) is a sum of real analytic functions and by extension KL.
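As a one-line illustration of Definition 1 (our own example, not taken from the text): $h(u) = u^2$ satisfies the KL inequality at $\bar{u} = 0$ with the desingularizing function $\varphi(s) = 2\sqrt{s}$, since for every $u \neq 0$,

```latex
% KL inequality for h(u) = u^2 at \bar u = 0, with \varphi(s) = 2\sqrt{s}
% and therefore \varphi'(s) = s^{-1/2}:
\varphi'\bigl(h(u) - h(\bar u)\bigr)\,\lVert \nabla h(u) \rVert
  = (u^{2})^{-1/2} \cdot |2u| = 2 \;\ge\; 1 .
```

The square-root desingularizer is typical of strongly convex functions.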
III-A Convergence to critical points
In this subsection, we demonstrate that the iterates of NEAR-DGDt converge to a critical point of the Lyapunov function (6). We assume that Assumptions II.1-II.3 and Assumption III.1 hold for the rest of this work. We begin our analysis by showing that the sequence of Lyapunov function values is non-increasing in the following Lemma.
Lemma III.2.
Proof.
Let . Assumption II.2 then yields where we acquire the last equality from (5b). Substituting this relation in (8) applied at the iteration, we obtain,
where we obtain the equality after further application of (8). After setting and re-arranging the terms, we obtain,
Let , which is a positive definite matrix due to Assumption II.1 and the fact that . Moreover, by Eq. (5a). We can then re-write the immediately previous relation as . Applying the definition of concludes the proof. ∎
An important consequence of Lemma III.2 is that NEAR-DGDt can tolerate a wider range of steplengths than previously indicated for the strongly convex case in [23]. Moreover, Lemma III.2 implies that the sequence of Lyapunov function values is upper bounded by its initial value. We use this fact to prove that the iterates of NEAR-DGDt are also bounded in the next Lemma.
Lemma III.3.
Proof.
By Assumption II.3, the function is lower bounded and therefore is also lower bounded (sum of lower bounded functions). This proves the first claim of this Lemma.
To prove the second claim, we first notice that Lemma III.2 implies that the sequence is upper bounded by . Let us define the set . The set is compact, since for all due to the non-expansiveness of . Hence, by the continuity of and the Weierstrass Extreme Value Theorem, there exists such that for all . Moreover, Assumption II.1 yields for all positive integers , and therefore for all , where .
Since for all and , the sequence is upper bounded. Hence, by Assumption II.3, there exists positive constant such that for and . Moreover, Assumption II.2 yields where we obtain the first equality from (5b) and last inequality from the fact that . This relation combined with Assumption II.3 implies that there exists constant such that for and , which concludes the proof. ∎
Next, we use Lemma III.3 to show that the distance between the local iterates generated by NEAR-DGDt and their average is bounded.
Lemma III.4.
(Bounded distance to mean) Let and be the local NEAR-DGDt iterates produced under steplength by (4a) and (4b), respectively, and define the average iterates and . Then the distance between the local and the average iterates is bounded for and , i.e.
where is a positive constant defined in Lemma III.3.
Proof.
Multiplying both sides of (5a) with yields . Moreover, we observe that for any vector . Hence,
where we derive the last inequality from the spectral properties of and (we note that the matrix has a single non-zero eigenvalue at associated with the eigenvector ).
Similarly, for the local iterates we obtain,
Applying Lemma III.3 to the two preceding inequalities completes the proof. ∎
We are now ready to state the first Theorem of this section, namely that there exists a subsequence of the iterates that converges to a critical point of the Lyapunov function (6).
Theorem III.5.
Proof.
By Lemma III.3, the sequence is bounded and therefore there exists a convergent subsequence as . In addition, recursive application of Lemma III.2 over iterations yields,
where the sequence is non-increasing and bounded from below by Lemmas III.2 and III.3.
Hence, converges and the above relation implies that and . Moreover, by the non-expansiveness of and thus . Finally, Eq. 7 yields . We conclude that as and therefore . ∎
We note that Assumption III.1 is not necessary for Theorem III.5 to hold. However, Theorem III.5 does not guarantee the convergence of NEAR-DGDt; we will need Assumption III.1 to prove that NEAR-DGDt converges in Theorem III.8. Before that, we introduce the following two preliminary Lemmas that hold only under Assumption III.1.
Lemma III.6.
Proof.
Lemma III.2 yields for . We can multiply both sides of this relation with to obtain where we derive the last inequality from the concavity of . In addition, using Eq. 7, we can re-write (9) as . Combining these relations, we acquire,
Observing that due to the non-expansiveness of and re-arranging the terms of the relation above yields the final result. ∎
In the next Lemma, we show that if NEAR-DGDt is initialized from an appropriate subset of its domain and Assumption III.1 holds, then the sequence of iterates converges to a critical point of the Lyapunov function (6).
Lemma III.7.
(Local convergence) Let be the sequence of iterates generated by (5b) from initial point and with steplength . Moreover, let and be the objects in Def. 1 and suppose that the following relations are satisfied for some point ,
(10) | |||
(11) |
where is a positive constant and .
Then the sequence has finite length, i.e. , and converges to a critical point of (6).
Proof.
In the trivial case where , Lemma III.2 combined with (11) yield and . Let us now assume that and up to and including some index , which implies that (9) holds for all . Applying the triangle inequality twice, we obtain,
Application of Lemma III.6 then yields , for . Substituting this in the preceding relation, we acquire,
The above result implies that . Given that and thus , we have and for all . Hence,
Thus, the sequence has finite length and is therefore Cauchy (convergent), and , where is a critical point of (6) by Theorem III.5. ∎
Next, we combine our previous results to prove the global convergence of the iterates of NEAR-DGDt in Theorem III.8.
Theorem III.8.
(Global Convergence) Let be the sequence of NEAR-DGDt iterates produced by (5b) under steplength and let be a limit point of a convergent subsequence of as defined in Theorem III.5.
Then under Assumption III.1 the following statements hold: ) there exists an index such that the KL inequality with respect to holds for all , and ) the sequence converges to .
Proof.
We first observe that by Lemma III.2 the sequence is non-increasing, and therefore for all . If Assumption III.1 holds, then the objects and in Def. 1 exist and by the continuity of , it is possible to find an index that satisfies the following relations,
where .
Applying Lemma III.7 to the sequence with establishes the convergence of . Finally, since is the limit point of a subsequence of and is convergent, we conclude that . ∎
Since $\mathbf{Z}^t$ is a non-singular matrix, Theorem III.8 implies that the sequence $\{\mathbf{x}^k\} = \{\mathbf{Z}^t \mathbf{y}^k\}$ also converges. Moreover, using arguments similar to [27], we can prove the following result on the convergence rate of $\{\mathbf{x}^k\}$.
Lemma III.9.
(Rates) Let $\{\mathbf{x}^k\}$ be the sequence of iterates produced by (5a), where $\bar{\mathbf{x}}$ is the limit point of the sequence, and suppose $\varphi(s) = c s^{1-\theta}$ in Assumption III.1 for some constant $c > 0$ and $\theta \in [0, 1)$ (for a discussion on $\theta$, we direct readers to [26]). Then the following hold:
1. If $\theta = 0$, $\{\mathbf{x}^k\}$ converges in a finite number of iterations.
2. If $\theta \in (0, 1/2]$, then constants $C > 0$ and $\tau \in (0, 1)$ exist such that $\|\mathbf{x}^k - \bar{\mathbf{x}}\| \le C \tau^k$.
3. If $\theta \in (1/2, 1)$, then there exists a constant $C > 0$ such that $\|\mathbf{x}^k - \bar{\mathbf{x}}\| \le C k^{-(1-\theta)/(2\theta-1)}$.
Proof.
1) $\theta = 0$: From the definition of and we have . Let (by the non-singularity of , it also follows that for ). Then for large the KL inequality holds at and we obtain , or equivalently by (7), . Application of Lemma III.2 combined with the fact that yields . Given the convergence of the sequence , we conclude that the set is finite and the method converges in a finite number of steps.
2) and 3) $\theta \in (0, 1)$: Let where . Since , it suffices to bound . Using Lemma III.6 with and for , where is defined in Theorem III.8, we obtain,
(12) |
where .
Moreover, Eq. 7 yields . Using this relation and the definition of , we can express the KL inequality as,
(13) |
where .
If , raising both sides of the preceding inequality to the power of and re-arranging the terms yields . Due to the fact that , there exists some index such that and . Combining this relation with (12), we obtain
We conclude this subsection with one more result on the distance to optimality of the local iterates of NEAR-DGDt and their average as $k \to \infty$.
Corollary III.10.
(Distance to optimality) Suppose that and let . Moreover, let . Then is an approximate critical point of ,
where is a positive constant defined in Lemma III.3.
Proof.
We observe that and hence where we obtain the last equality due to the fact that for any vector .
Moreover, is a critical point of (6) and therefore satisfies . From the double stochasticity of , multiplying the above relation with yields . After combining all the preceding results, we obtain,
where used the spectral properties of and Assumption II.2 to get the first inequality and the spectral properties of to get the second inequality. Applying Lemma III.3 yields the result of this Corollary.
∎
III-B Second order guarantees
In this subsection, we provide second order guarantees for the NEAR-DGDt method. Specifically, using recent results stemming from dynamical systems theory, we will prove that NEAR-DGDt almost surely avoids the strict saddles of the Lyapunov function (6) when initialized randomly. Hence, if the Lyapunov function satisfies the strict saddle property, NEAR-DGDt converges to minima of (6) with probability 1. We begin by listing a number of additional assumptions and definitions.
Assumption III.11.
(Differentiability) The functions $f_i$ are twice continuously differentiable.
Assumption III.11 implies that the function $\mathbf{f}$ is also twice continuously differentiable.
Definition 2.
(Differential of a mapping) [Ch. 3, [28]] The differential of a mapping $g: \mathcal{M} \to \mathcal{M}$ at a point $x$, denoted as $Dg(x)$, is a linear operator from $T_x\mathcal{M}$ to $T_{g(x)}\mathcal{M}$, where $T_x\mathcal{M}$ is the tangent space of $\mathcal{M}$ at point $x$. Given a curve $\gamma$ in $\mathcal{M}$ with $\gamma(0) = x$ and $\frac{d\gamma}{dt}(0) = v \in T_x\mathcal{M}$, the linear operator is defined as $Dg(x)[v] = \frac{d(g \circ \gamma)}{dt}(0) \in T_{g(x)}\mathcal{M}$. The determinant of the linear operator, $\det(Dg(x))$, is the determinant of the matrix representing $Dg(x)$ with respect to an arbitrary basis.
Definition 3.
(Unstable fixed points) The set of unstable fixed points of a mapping $g$ is defined as
$\mathcal{A}_g = \left\{ x : g(x) = x, \ \max_i |\lambda_i(Dg(x))| > 1 \right\}.$
Definition 4.
(Strict saddles) The set of strict saddles of a function $h$ is defined as
$\mathcal{X} = \left\{ x : \nabla h(x) = 0, \ \lambda_1\left(\nabla^2 h(x)\right) < 0 \right\}.$
We can express NEAR-DGDt as a mapping $g: \mathbb{R}^{np} \to \mathbb{R}^{np}$, $\mathbf{x}^{k+1} = g(\mathbf{x}^k)$,
with $g(\mathbf{x}) = \mathbf{x} - \alpha \mathbf{Z}^t \nabla \mathcal{L}(\mathbf{x})$. Let us define the set $\mathcal{A}_g$ of unstable fixed points of NEAR-DGDt and the set $\mathcal{X}$ of strict saddles of the Lyapunov function (6) following Def. 3 and 4, respectively. A corollary of [20] implies that if $\det(Dg(\mathbf{x})) \neq 0$ for all $\mathbf{x} \in \mathbb{R}^{np}$ and $\mathcal{X} \subseteq \mathcal{A}_g$, then NEAR-DGDt almost surely avoids the strict saddles of (6). We will show that this is indeed the case in Theorem III.12.
Theorem III.12.
(Convergence to 2nd order stationary points) Let $\{\mathbf{x}^k\}$ be the sequence of iterates generated by NEAR-DGDt under steplength $\alpha$. Then if the Lyapunov function (6) satisfies the strict saddle property, $\{\mathbf{x}^k\}$ converges almost surely to second order stationary points of (6) under random initialization.
Proof.
We begin this proof by showing that $\det(Dg(\mathbf{x})) \neq 0$ for every $\mathbf{x} \in \mathbb{R}^{np}$. Let $\lambda_1 \le \dots \le \lambda_{np}$ be the eigenvalues of the Hessian $\nabla^2 \mathcal{L}(\mathbf{x})$. Assumption II.2 implies that for all . Using standard properties of the determinant, we obtain, . Thus, by the positive-definiteness of and .
We will now confirm that . Every critical point of (6) satisfies , namely . Since is positive-definite and by extension non-singular, we can multiply both sides of the equality above with and re-arrange the resulting terms to obtain . Finally, the Hessian of (6) at is given by,
$\nabla^2 \mathcal{L}(\mathbf{x}) = \nabla^2 \mathbf{f}(\mathbf{x}) + \frac{1}{\alpha} \left(\mathbf{Z}^{-t} - \mathbf{I}_{np}\right) \qquad \text{(14)}$
We define the matrix . Using the positive-definiteness of , we obtain from (14) which implies that and are similar matrices and have identical spectra. Moreover, the matrix is symmetric by Assumption II.1. Hence, and are congruent and by Sylvester’s law of inertia [Theorem 4.5.8, [29]] they have the same number of negative eigenvalues. Given that has at least one negative eigenvalue by Def. 4, we conclude that so does and there exists index such that or . Applying [Corollary , [20]] establishes the desired result. ∎
Before we conclude this section, we make one final remark on the asymptotic behavior of NEAR-DGDt as the parameter $t$ becomes large.
Corollary III.13.
(Convergence to SOS) Let and be the sequences of NEAR-DGDt iterates produced by (5a) and (5b), respectively, from initial point with and steplength . Moreover, suppose that is the limit point of NEAR-DGDt and let . Then the local limit points approach the second order stationary solutions (SOS) of Problem (1) as $t \to \infty$, for all $i \in \mathcal{V}$.
Proof.
By Theorems III.8 and III.12, we have , where is a minimizer of . Since is non-singular, we also have . As , Lemma III.4 and Corollary III.10 yield and , respectively, implying that and approach each other and the critical points of . Finally, by Theorem III.12, where . Multiplying this relation with the matrix on both sides, we obtain . As , Lemma III.4 yields . Therefore, by Sylvester’s law of inertia for congruent matrices [Theorem 4.5.8, [29]]. Based on the above, we conclude that NEAR-DGDt approaches the second order stationary solutions of Problem (1) as $t \to \infty$. ∎
IV Numerical Results
We evaluate the empirical performance of NEAR-DGD on the following regularized quadratic problem,
where is some positive index and and are diagonal matrices constructed as follows: if and otherwise, and , where is a constant and is the indicator vector for the element. It is easy to check that has a unique saddle point at and two minima at . We can distribute this problem to nodes by setting . Moreover, each has Lipschitz gradients in any compact subset of .
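The exact problem data above did not survive extraction. Purely as a stand-in with the same qualitative features (one strict saddle, two symmetric minima), consider the classic one-dimensional-saddle example below; the function, its coefficients, and the dimension are our own illustrative choices.

```python
import numpy as np

# Stand-in nonconvex objective f(x) = -x1^2/2 + x2^2/2 + x1^4/4, with a unique
# strict saddle at the origin and two minima at (+1, 0) and (-1, 0).
def f(x):
    return -0.5 * x[0]**2 + 0.5 * x[1]**2 + 0.25 * x[0]**4

def grad(x):
    return np.array([-x[0] + x[0]**3, x[1]])

def hess(x):
    return np.array([[-1.0 + 3.0 * x[0]**2, 0.0],
                     [0.0, 1.0]])

# The origin is a critical point with a negative curvature direction (strict
# saddle), while (+/-1, 0) are second order stationary (gradient zero, Hessian PD).
print(np.linalg.eigvalsh(hess(np.zeros(2))))
```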
For the purposes of our numerical experiment, the matrices were constructed randomly, and we allocated each cost component to a unique node in a network with ring graph topology. We tested six methods in total, including DGD [1, 5], DOGT (with doubly stochastic consensus matrix) [8], and four variants of the NEAR-DGD method: i) NEAR-DGD1 (one consensus round per gradient evaluation); ii) NEAR-DGD5 (5 consensus rounds per gradient evaluation); iii) a variant of NEAR-DGD where the sequence of consensus rounds increases at every iteration, to which we will refer as NEAR-DGD+; and iv) a practical variant of NEAR-DGD+ where, starting from one consensus round per iteration, we double the number of consensus rounds at fixed intervals of gradient evaluations. All algorithms were initialized from the same randomly chosen point, and the stepsize was manually tuned for all methods.
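The consensus-round schedules of the four NEAR-DGD variants can be sketched as follows; the increment of one per iteration for NEAR-DGD+ and the doubling period are assumptions for illustration.

```python
def t_fixed(t):
    """NEAR-DGDt: a constant number of consensus rounds per iteration."""
    return lambda k: t

def t_plus(k):
    """NEAR-DGD+ (assumed increment of one consensus round per iteration)."""
    return k + 1

def t_doubling(period):
    """Practical variant: start from one round and double every `period`
    gradient evaluations (the period is a hypothetical choice)."""
    return lambda k: 2 ** (k // period)

print(t_fixed(5)(10), [t_plus(k) for k in range(4)], [t_doubling(3)(k) for k in range(7)])
```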
In Fig. 1, we plot the objective function error (Fig. 1a) and the distance of the average iterates to the saddle point (Fig. 1b) versus the number of iterations/gradient evaluations for all methods. In Fig. 1a, we observe that convergence accuracy increases with the value of the parameter $t$ of NEAR-DGDt, as predicted by our theoretical results. NEAR-DGD1 performs comparably to DGD, while the two variants of NEAR-DGD paired with increasing sequences of consensus rounds per iteration, i.e. NEAR-DGD+ and its practical variant, achieve exact convergence to the optimal value at faster rates than DOGT. All methods successfully escape the saddle point with approximately the same speed (Fig. 1b). We noticed that the trends in Fig. 1b were very sensitive to small changes in the problem parameters and the selection of the initial point.
In Fig. 2, we plot the objective function error versus the cumulative application cost (per node) for all methods, where we calculated the cost per iteration using the framework proposed in [23], i.e. the cost of one iteration is (number of communication rounds) $\times \, c_c$ plus (number of computation rounds) $\times \, c_g$,
where $c_c$ and $c_g$ are constants representing the application-specific costs of one communication and one computation operation, respectively. In Fig. 2a, the costs of communication and computation are equal ($c_c = c_g$) and DOGT outperforms NEAR-DGD+ and its practical variant, since it requires only two communication rounds per gradient evaluation to achieve exact convergence. Conversely, in Fig. 2b, the cost of communication is relatively low compared to the cost of computation ($c_c < c_g$). In this case, NEAR-DGD+ converges to the optimal value faster than the remaining methods in terms of total application cost.
[Fig. 1: objective function error (a) and distance of the average iterates to the saddle point (b) versus iterations/gradient evaluations, for all methods.]
[Fig. 2: objective function error versus cumulative application cost per node, for equal communication and computation costs (a) and for relatively cheap communication (b).]
V Conclusion
NEAR-DGD [23] is a distributed first order method that permits adjusting the amounts of computation and communication carried out at each iteration to balance convergence accuracy and total application cost. We have extended to the nonconvex setting the analysis of NEAR-DGDt, a variant of NEAR-DGD performing a fixed number of communication rounds at every iteration, controlled by the parameter $t$. Given a connected, undirected network with general topology, we have shown that NEAR-DGDt converges to minimizers of a custom Lyapunov function and locally approaches the minimizers of the original objective function as $t$ becomes large. Our numerical results confirm our theoretical analysis and indicate that NEAR-DGD can achieve exact convergence to the second order stationary points of Problem (1) when the number of consensus rounds increases over time.
References
- [1] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48 –61, Jan 2009.
- [2] W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [3] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
- [4] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. PP, pp. 1–1, 04 2017.
- [5] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
- [6] P. D. Lorenzo and G. Scutari, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
- [7] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimization over time-varying digraphs,” Math. Program., vol. 176, no. 1–2, p. 497–544, Jul. 2019. [Online]. Available: https://doi.org/10.1007/s10107-018-01357-w
- [8] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” SIAM Journal on Optimization, vol. 30, no. 4, pp. 3029–3068, 2020. [Online]. Available: https://doi.org/10.1137/18M121784X
- [9] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 38–42.
- [10] M. Hong, S. Zeng, J. Zhang, and H. Sun, “On the Divergence of Decentralized Non-Convex Optimization,” arXiv:2006.11662 [cs, math], Jun. 2020, arXiv: 2006.11662. [Online]. Available: http://arxiv.org/abs/2006.11662
- [11] M. Hong, “A Distributed, Asynchronous, and Incremental Algorithm for Nonconvex Optimization: An ADMM Approach,” IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS, vol. 5, no. 3, p. 11, 2018.
- [12] M. Hong, M. Razaviyayn, and J. Lee, “Gradient primal-dual algorithm converges to second-order stationary solution for nonconvex distributed optimization over networks,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 2009–2018. [Online]. Available: http://proceedings.mlr.press/v80/hong18a.html
- [13] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017.
- [14] D. Hajinezhad, M. Hong, and A. Garcia, “ZONE: Zeroth-Order Nonconvex Multiagent Optimization Over Networks,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 3995–4010, Oct. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8629972/
- [15] Y. Tang and N. Li, “Distributed zero-order algorithms for nonconvex multi-agent optimization,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 781–786.
- [16] P. Bianchi and J. Jakubowicz, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 391–405, 2013.
- [17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/f75526659f31040afeb61cb7133e4e6d-Paper.pdf
- [18] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “$D^2$: Decentralized training over decentralized data,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 4848–4856. [Online]. Available: http://proceedings.mlr.press/v80/tang18a.html
- [19] H. Sun, S. Lu, and M. Hong, “Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 9217–9228. [Online]. Available: http://proceedings.mlr.press/v119/sun20a.html
- [20] J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht, “First-order methods almost always avoid strict saddle points,” Mathematical Programming, vol. 176, no. 1, pp. 311–337, Jul. 2019. [Online]. Available: https://doi.org/10.1007/s10107-019-01374-3
- [21] B. Swenson, R. Murray, S. Kar, and H. V. Poor, “Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima,” arXiv:2003.02818 [math.OC], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2003.02818
- [22] B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed gradient flow: Nonsmoothness, nonconvexity, and saddle point evasion,” arXiv:2008.05387 [math.OC], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2008.05387
- [23] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing Communication and Computation in Distributed Optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3141–3155, Aug. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8528465/
- [24] L. Xiao, S. Boyd, and S.-J. Kim, “Distributed average consensus with least-mean-square deviation,” Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731506001808
- [25] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods,” Mathematical Programming, vol. 137, no. 1-2, pp. 91–129, Feb. 2013. [Online]. Available: http://link.springer.com/10.1007/s10107-011-0484-9
- [26] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, “Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality,” Math. Oper. Res., vol. 35, no. 2, p. 438–457, May 2010. [Online]. Available: https://doi.org/10.1287/moor.1100.0449
- [27] H. Attouch and J. Bolte, “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features,” Math. Program., vol. 116, no. 1–2, p. 5–16, Jan. 2009. [Online]. Available: https://doi.org/10.1007/s10107-007-0133-5
- [28] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press, 2008.
- [29] R. A. Horn and C. R. Johnson, Matrix analysis, 2nd ed. Cambridge ; New York: Cambridge University Press, 2012.