Second-order Properties of Noisy Distributed Gradient Descent
Abstract
We study a fixed step-size noisy distributed gradient descent algorithm for solving optimization problems in which the objective is a finite sum of smooth but possibly non-convex functions. Random perturbations are introduced to the gradient descent directions at each step to actively evade saddle points. Under certain regularity conditions, and with a suitable step-size, it is established that each agent converges to a neighborhood of a local minimizer and the size of the neighborhood depends on the step-size and the confidence parameter. A numerical example is presented to illustrate the effectiveness of the random perturbations in terms of escaping saddle points in fewer iterations than without the perturbations.
Index Terms:
Non-convex optimization; first-order methods; random perturbations; evading saddle points
I Introduction
We consider the optimization problem
$\min_{x \in \mathbb{R}^{d}} \; f(x) \triangleq \sum_{i=1}^{n} f_i(x) \qquad (1)$
where each $f_i : \mathbb{R}^{d} \to \mathbb{R}$ is smooth but possibly non-convex, and $x \in \mathbb{R}^{d}$ is the decision vector. The aim is to employ $n$ agents to iteratively solve the optimization problem in (1), over an undirected and connected network graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Each agent $i$ only knows the corresponding function $f_i$ and its gradient. The pair of agents $(i, j)$ is able to directly exchange information if and only if $(i, j) \in \mathcal{E}$. Collaborative distributed optimization over a network is of significant interest in the contexts of control, learning and estimation, particularly in large-scale system scenarios, such as unmanned vehicle systems [1], electric power systems [2], transit systems [3], and wireless sensor networks [4].
In the field of optimization, two primary classes of distributed methods can be identified: dual decomposition methods and consensus-based methods. Dual decomposition methods involve minimizing an augmented Lagrangian formulated on the basis of constraints that enforce agreement between agents, via iterative updates of the corresponding primal and dual variables [5]. The distributed dual decomposition algorithm in [6] involves agents alternating between updating their primal and dual variables and communicating with their neighbors. In [7], it is established that the distributed alternating direction method of multipliers (ADMM) exhibits linear convergence rates in strongly convex settings. Consensus-based methods can be traced back to the distributed computation models proposed in [8], which seek to eliminate agent disagreements through local iterate exchange and weighted averaging to achieve consensus. This idea underlies the distributed (sub)gradient methods proposed in [9] and [10] to solve problem (1) with all $f_i$ convex. In the case of a diminishing step-size, each agent converges to an optimizer [10]; with a constant step-size, convergence is typically faster, but only to the vicinity of an optimizer [9].
The focus in this paper is on consensus-based distributed (sub)gradient methods, owing to their simplicity as first-order methods, which enables easy adaptation to diverse settings. In particular, a fixed step-size Distributed Gradient Descent (DGD) algorithm is considered, as an instance of the distributed (sub)gradient methods. In unperturbed form, the update for each agent $i$ at iteration $k$ is given by
$x_i^{(k+1)} = \sum_{j=1}^{n} w_{ij}\, x_j^{(k)} - \alpha\, \nabla f_i\big(x_i^{(k)}\big) \qquad (2)$
where $\alpha > 0$ is the constant step-size, $\nabla f_i$ is the gradient of $f_i$, $x_i^{(k)}$ is the local copy of the decision vector at agent $i$, and $w_{ij}$ is the scalar entry in the $i$-th row and $j$-th column of a given mixing matrix $W$. The mixing matrix is consistent with the graph $\mathcal{G}$, in the sense that for all $i \neq j$, $w_{ij} > 0$ if $(i, j) \in \mathcal{E}$, and $w_{ij} = 0$ otherwise. The convergence rates of fixed step-size and diminishing step-size DGD algorithms in strongly convex settings are examined in [11] and [12], respectively. Several variants of DGD have been proposed in (strongly) convex settings. Nesterov momentum is used in the distributed gradient descent update with diminishing step-sizes in [13] to improve convergence rates. An inexact proximal-gradient method is considered in [14] for problems involving non-smooth functions. To reach exact consensus with a constant step-size, the gradient tracking method [15, 16, 17] is used to neutralize the gradient term in (2), since the gradient descent part cannot vanish by itself.
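To make the update (2) concrete, the following is a minimal sketch of one DGD iteration; the quadratic local objectives, the two-agent network, and the particular mixing matrix below are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def dgd_step(X, W, grads, alpha):
    """One fixed step-size DGD iteration as in (2).

    X     : (n, d) array; row i is agent i's local copy x_i^{(k)}.
    W     : (n, n) symmetric, doubly stochastic mixing matrix.
    grads : list of n callables; grads[i](x) returns the gradient of f_i at x.
    alpha : constant step-size.
    """
    mixed = W @ X                                              # weighted averaging with neighbors
    descent = np.stack([grads[i](X[i]) for i in range(X.shape[0])])
    return mixed - alpha * descent                             # x_i^{(k+1)} = sum_j w_ij x_j^{(k)} - alpha grad f_i(x_i^{(k)})

# Illustrative data: two agents with f_i(x) = 0.5 * ||x - c_i||^2.
c = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
grads = [lambda x, ci=ci: x - ci for ci in c]
W = np.array([[0.75, 0.25],
              [0.25, 0.75]])                                   # symmetric, doubly stochastic, diagonally dominant
X = np.zeros((2, 2))
for _ in range(200):
    X = dgd_step(X, W, grads, alpha=0.1)
print(X)  # both rows end up in a neighborhood of (0, 1), the minimizer of f_1 + f_2
```

With a constant step-size, the two local copies settle near, but not exactly at, the common minimizer, consistent with the vicinity-convergence behavior reported in [9].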
In non-convex settings, gradient descent methods may face issues due to saddle points, as converging to a first-order stationary point does not guarantee local minimality. While Hessian-based methods can avoid saddle points, their computational cost can be prohibitive for large-scale problems. In [18], the fixed step-size DGD algorithm is shown to retain the property of convergence to a neighborhood of a consensus stationary solution under some regularity assumptions in non-convex settings. It is also shown in [19] that DGD with a constant step-size converges almost surely to a neighborhood of second-order stationary solutions. However, this requires random initialization to avoid the zero-measure manifold of saddle-point attraction, and moreover, the underlying analysis does not support techniques for actively escaping saddle points.
Recently, it has been shown that standard (centralized) gradient descent methods can take exponential time to escape saddle points [20], while noise (random perturbations) has proven effective for escaping saddle points in non-convex optimization. The Noisy Gradient Descent algorithm in [21] and [22] is proven to escape saddle points efficiently while converging to a neighborhood of a local minimizer with high probability. However, all of these works are limited to centralized methods. In [23] it is shown that distributed stochastic gradient descent converges to local minima almost surely when diminishing step-sizes are used. With a constant step-size, the diffusion strategy with stochastic gradients in [24] and [25] only returns approximately second-order stationary points, rather than an outcome that lies in a neighborhood of a local minimizer of controllable size.
In this paper, the main contribution is that we analyze a fixed step-size noisy distributed gradient descent (NDGD) algorithm for solving the optimization problem in (1) by expanding upon and combining ideas from [21] and [22] on centralized stochastic gradient descent, and from [11, 18, 19] on distributed unperturbed gradient descent. In this combination, random perturbations are added to the gradient descent directions at each step to actively evade saddle points. It is established that under certain regularity conditions, and with a suitable step-size, each agent converges to a neighborhood of a local minimizer. In particular, we determine a probabilistic upper bound for the distance between the iterate at each agent and the set of local minimizers after a sufficient number of iterations. A numerical example is presented to illustrate the effectiveness of the algorithm in terms of escaping from the vicinity of a saddle point in fewer iterations than the standard (i.e., unperturbed) fixed step-size DGD method.
I-A Notation
Let $I_d$ denote the $d \times d$ identity matrix, $\mathbf{1}_n$ denote the $n$-vector with all entries equal to $1$, and $[A]_{ij}$ denote the entry in the $i$-th row and $j$-th column of the matrix $A$. For a square symmetric matrix $A$, we use $\lambda_{\min}(A)$, $\lambda_{\max}(A)$ and $\|A\|$ to denote its minimum eigenvalue, maximum eigenvalue and spectral norm, respectively. The Kronecker product is denoted by $\otimes$. The distance from the point $x$ to a given set $\mathcal{X}$ is denoted by $\mathrm{dist}(x, \mathcal{X})$. We say that a point $x$ is $\epsilon$-close to a point $y$ (resp., a set $\mathcal{X}$) if $\|x - y\| \le \epsilon$ (resp., $\mathrm{dist}(x, \mathcal{X}) \le \epsilon$). We use the Bachmann–Landau (asymptotic) notations, including $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$, to hide dependence on variables other than the step-size $\alpha$ and the confidence parameter $\delta$.
II Problem Setup and Supporting Results
In this section, we present a reformulation of the optimization problem defined in (1) and provide a list of assumptions used in subsequent analysis. Then, we briefly recall aspects of the fixed step-size DGD algorithm (2) and present some existing results for non-convex optimization problems. Next, some intermediate results are derived to establish certain properties of the local minimizers of $f$ defined in (1) (see Theorem 2.1), on the basis of a collection of supporting lemmas (see Lemmas 2.3.1–2.3.4).
II-A Problem Setup
Definition 2.1.
For a differentiable function $f : \mathbb{R}^{d} \to \mathbb{R}$, a point $x \in \mathbb{R}^{d}$ is said to be first-order stationary if $\nabla f(x) = 0$.
Definition 2.2.
For a twice differentiable function $f : \mathbb{R}^{d} \to \mathbb{R}$, a first-order stationary point $x$ is: (i) a local minimizer, if $\lambda_{\min}(\nabla^{2} f(x)) > 0$; (ii) a local maximizer, if $\lambda_{\max}(\nabla^{2} f(x)) < 0$; and (iii) a saddle point, if $\lambda_{\min}(\nabla^{2} f(x)) < 0$ and $\lambda_{\max}(\nabla^{2} f(x)) > 0$.
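As a concrete numerical illustration of Definition 2.2, a first-order stationary point can be classified from the eigenvalues of the Hessian; the function used below is a hypothetical example, not one drawn from this paper.

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a first-order stationary point from its Hessian H (Definition 2.2)."""
    eig = np.linalg.eigvalsh(H)
    if eig.min() > tol:
        return "local minimizer"
    if eig.max() < -tol:
        return "local maximizer"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "degenerate (second-order test inconclusive)"

# Hypothetical example: f(x, y) = x^2 - y^2 is stationary at the origin with Hessian diag(2, -2).
print(classify(np.diag([2.0, -2.0])))  # -> saddle point
```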
By introducing additional local variables, the optimization problem in (1) can be reformulated as
$\min_{x_1, \dots, x_n \in \mathbb{R}^{d}} \; F(\mathbf{x}) \triangleq \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i = x_j \ \text{ for all } (i, j) \in \mathcal{E} \qquad (5)$
where $x_i \in \mathbb{R}^{d}$ is the local copy of the decision vector at agent $i$, and $\mathbf{x} = [x_1^{\top}, \dots, x_n^{\top}]^{\top} \in \mathbb{R}^{nd}$.
Assumption 2.1 (Local regularity).
The function $f$ in (1) is such that for all first-order stationary points $x$, either $\lambda_{\min}(\nabla^{2} f(x)) > 0$ (i.e., $x$ is a local minimizer), or $\lambda_{\min}(\nabla^{2} f(x)) < 0$ (i.e., $x$ is a saddle point or a maximizer).
Assumption 2.2 (Lipschitz gradient).
Each objective $f_i$ has $L$-Lipschitz continuous gradient, i.e.,
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\, \|x - y\|$
for all $x, y \in \mathbb{R}^{d}$ and each $i \in \{1, \dots, n\}$.
Assumption 2.3 (Lipschitz Hessian).
Each objective $f_i$ has $\rho$-Lipschitz continuous Hessian, i.e.,
$\|\nabla^{2} f_i(x) - \nabla^{2} f_i(y)\| \le \rho\, \|x - y\|$
for all $x, y \in \mathbb{R}^{d}$ and each $i \in \{1, \dots, n\}$.
If Assumption 2.2 holds, then $F$ defined in (5) has $L$-Lipschitz continuous gradient, since its Hessian is block diagonal with blocks $\nabla^{2} f_i(x_i)$. Further, if Assumption 2.3 holds, then $F$ has $\rho$-Lipschitz continuous Hessian.
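A short calculation makes this explicit; it is a sketch under the stacked reformulation (5) as reconstructed above.

```latex
% Block-diagonal Hessian of the stacked objective F(x) = \sum_i f_i(x_i):
\[
  \nabla^2 F(\mathbf{x})
  = \operatorname{blkdiag}\!\big(\nabla^2 f_1(x_1), \dots, \nabla^2 f_n(x_n)\big),
\]
% so, taking spectral norms blockwise,
\[
  \|\nabla^2 F(\mathbf{x})\| = \max_i \|\nabla^2 f_i(x_i)\| \le L,
  \qquad
  \|\nabla^2 F(\mathbf{x}) - \nabla^2 F(\mathbf{y})\|
  = \max_i \|\nabla^2 f_i(x_i) - \nabla^2 f_i(y_i)\|
  \le \rho\, \|\mathbf{x} - \mathbf{y}\|.
\]
```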
Assumption 2.4 (Coercivity and properness).
Each local objective $f_i$ is coercive (i.e., its sublevel sets are compact) and proper (i.e., not everywhere infinite).
II-B Distributed Gradient Descent
Assumption 2.5 (Network).
The undirected graph is connected.
The DGD algorithm in (2), with constant step-size $\alpha$, can be written in matrix/vector form as
$\mathbf{x}^{(k+1)} = \mathbf{W}\, \mathbf{x}^{(k)} - \alpha\, \nabla F\big(\mathbf{x}^{(k)}\big) \qquad (6)$
where $\mathbf{x}^{(k)} = [x_1^{(k)\top}, \dots, x_n^{(k)\top}]^{\top}$ and $\mathbf{W} = W \otimes I_d$. Note that from this point on, the mixing matrix $W$ is taken to be symmetric, doubly stochastic and strictly diagonally dominant, i.e., $w_{ii} > \sum_{j \neq i} w_{ij}$ for all $i$. Thus, $W$ is positive definite by the Gershgorin circle theorem. As proposed in some early works, including [11, 18, 19], we can analyze the convergence properties using an auxiliary function. Let $G_{\alpha}$ denote the auxiliary function
$G_{\alpha}(\mathbf{x}) \triangleq F(\mathbf{x}) + \frac{1}{2\alpha}\, \mathbf{x}^{\top} \big(I_{nd} - \mathbf{W}\big)\, \mathbf{x} \qquad (7)$
consisting of the objective function in (5) and a quadratic penalty, which depends on the step-size and the mixing matrix. We use $\mathbf{x}^{*}_{\alpha}$ to denote a local minimizer of $G_{\alpha}$. Note that the DGD update (6) applied to (5) can be interpreted as an instance of the standard gradient descent algorithm applied to (7), i.e.,
$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha\, \nabla G_{\alpha}\big(\mathbf{x}^{(k)}\big) \qquad (8)$
Thus, iteratively running (6) and (8) from the same initialization yields the same sequence of iterates.
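A quick numerical check of this equivalence, written under the reconstruction of (6)–(8) above and with illustrative quadratic local objectives, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 3, 2, 0.1

# Symmetric, doubly stochastic, strictly diagonally dominant mixing matrix (illustrative).
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
Wbig = np.kron(W, np.eye(d))                       # W Kronecker I_d, acting on the stacked vector

# Illustrative local objectives f_i(x) = 0.5 * ||x - c_i||^2, so the stacked gradient is x - c.
c = rng.standard_normal(n * d)
grad_F = lambda x: x - c

# Gradient of the auxiliary function G_alpha(x) = F(x) + (1 / (2 alpha)) x^T (I - Wbig) x.
grad_G = lambda x: grad_F(x) + (np.eye(n * d) - Wbig) @ x / alpha

x_dgd = x_gd = rng.standard_normal(n * d)
for _ in range(50):
    x_dgd = Wbig @ x_dgd - alpha * grad_F(x_dgd)   # DGD step (6)
    x_gd = x_gd - alpha * grad_G(x_gd)             # gradient descent on G_alpha, step (8)

print(np.max(np.abs(x_dgd - x_gd)))                # ~1e-15: the two trajectories coincide
```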
If Assumption 2.2 holds, then $G_{\alpha}$ defined in (7) has Lipschitz continuous gradient with constant $L_G = L + \frac{1}{\alpha}\lambda_{\max}(I_{nd} - \mathbf{W})$. We have that $L_G < L + \frac{1}{\alpha}$ because the spectrum of a symmetric, positive definite and doubly stochastic matrix is contained in the interval $(0, 1]$ by the Perron–Frobenius theorem, with $1$ being the only largest eigenvalue (the Perron root). Further, if Assumption 2.3 holds, then $G_{\alpha}$ has $\rho$-Lipschitz continuous Hessian, since the quadratic penalty in (7) contributes only a constant term to the Hessian.
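The identities behind these constants can be recorded compactly; this is a sketch based on the reconstruction of (7) above.

```latex
% Hessian of the auxiliary function and the resulting Lipschitz constants:
\[
  \nabla^2 G_\alpha(\mathbf{x})
  = \nabla^2 F(\mathbf{x}) + \tfrac{1}{\alpha}\,(I_{nd} - \mathbf{W}),
\]
\[
  \|\nabla^2 G_\alpha(\mathbf{x})\|
  \le L + \tfrac{1}{\alpha}\big(1 - \lambda_{\min}(W)\big) < L + \tfrac{1}{\alpha},
  \qquad
  \|\nabla^2 G_\alpha(\mathbf{x}) - \nabla^2 G_\alpha(\mathbf{y})\|
  = \|\nabla^2 F(\mathbf{x}) - \nabla^2 F(\mathbf{y})\|
  \le \rho\,\|\mathbf{x} - \mathbf{y}\|.
\]
```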
II-C Relationships between Local Minimizers of $f$ and $G_{\alpha}$
In this section, we prove that , the component of a local minimizer of associated with agent , can be made arbitrarily close to the set of local minimizers of by choosing sufficiently small . The proof is based on a collection of intermediate results. First, Lemma 2.3.1 shows that by choosing sufficiently small step-size , at a local minimizer of , the component corresponding to agent can be arbitrarily close to , where denotes the average of across all agents. Lemma 2.3.2 and Lemma 2.3.3 show that given and , one can always find constant step-size such that and . Finally, Lemma 2.3.4 shows that for each agent, can be made arbitrarily close to a local minimizer of , by choosing a sufficiently small step-size . Let and denote the set of local minimizers of and , respectively:
$\mathcal{X}^{*} \triangleq \big\{ x \in \mathbb{R}^{d} : \nabla f(x) = 0,\ \lambda_{\min}\big(\nabla^{2} f(x)\big) > 0 \big\}, \qquad \mathcal{X}^{*}_{\alpha} \triangleq \big\{ \mathbf{x} \in \mathbb{R}^{nd} : \nabla G_{\alpha}(\mathbf{x}) = 0,\ \lambda_{\min}\big(\nabla^{2} G_{\alpha}(\mathbf{x})\big) > 0 \big\} \qquad (9)$
Lemma 2.3.1.
Let Assumption 2.5 hold. Given , let be a local minimizer of . Then, for each ,
where , and is the second-largest eigenvalue of $W$.
Proof.
Since is a local minimizer of , we have , and thus,
for all . Therefore, we have that . Now, since , we have . Further, by connectivity of the underlying communication graph (see Assumption 2.5) and the Perron–Frobenius theorem, , and as such,
where the last inequality holds because only the largest eigenvalue of is . In the limit , it follows that
as claimed.
Lemma 2.3.2.
Lemma 2.3.3.
Proof.
Proof.
Following the approach used in Lemma 3.8 of [19] to establish a similar result in the absence of the second-order requirement in the definition of $\mathcal{X}^{*}$, we prove the lemma by contradiction. Suppose
(10) |
Then, there exists a sequence with and for all . Since and are continuous functions (see Assumptions 2.2, 2.3), is closed. Thus, is compact. Since , we can find a convergent sub-sequence with limit point satisfying . Since , it follows that, for all . This means , for all , implying , . By Assumption 2.1, we have . Hence , which contradicts the initial hypothesis (10).
By combining Lemmas 2.3.1 through 2.3.4, the following intermediate theorem can be established. It is used to prove Theorem 3.1 in the next section.
Theorem 2.1.
Proof.
Given , by the triangle inequality, , where . By coercivity and properness of each (see Assumption 2.4), is coercive and proper. Therefore, is bounded, and there exists an upper bound such that for all , . By Lemma 2.3.1, if
and defined in (9), then holds for each . Now, note that in view of Lemmas 2.3.2 and 2.3.3, with as defined in Lemma 2.3.4. As such, by application of Lemma 2.3.4 with , there exists such that if , then holds. Therefore, if
then as claimed.
III Method and Main Results
In this section, a Noisy Distributed Gradient Descent algorithm (see Algorithm 1), a variant of the fixed step-size DGD algorithm, is formulated. The main analysis result, also formulated in this section, establishes the second-order properties of the NDGD algorithm. The key idea is to add random noise to the distributed gradient descent directions at each iteration. The required properties of the noise in Algorithm 1 are presented in Theorem 3.1.
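Since Algorithm 1 is not reproduced here, the following is a minimal sketch of one plausible NDGD iteration loop; the Gaussian perturbations are an illustrative assumption, whereas Theorem 3.1 only asks for i.i.d. zero-mean noise with a suitably chosen variance.

```python
import numpy as np

def ndgd(X0, W, grads, alpha, sigma, num_iters, seed=None):
    """Noisy distributed gradient descent (a sketch in the spirit of Algorithm 1).

    At each iteration every agent mixes with its neighbors, then takes a local
    gradient step whose direction is perturbed by i.i.d. zero-mean noise, so as
    to actively evade saddle points.
    """
    rng = np.random.default_rng(seed)
    X = X0.copy()
    n, d = X.shape
    for _ in range(num_iters):
        descent = np.stack([grads[i](X[i]) for i in range(n)])
        noise = sigma * rng.standard_normal((n, d))          # assumed Gaussian; only the mean/variance matter here
        X = W @ X - alpha * (descent + noise)                # perturbed version of the DGD update (2)
    return X
```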
Recall that and denote the set of local minimizers of and respectively, as per (9). Given , , , , and , define
(11) |
where . The next theorem, which establishes second-order properties of the NDGD algorithm, is the main result of the paper; a proof is given in Section IV. Note that this result focuses on the dependency on the given step-size and confidence parameter , hiding the factors that have polynomial dependence on all other parameters (including , , , , and ).
Theorem 3.1.
Suppose Assumptions 2.1, 2.2, 2.3, 2.4, 2.5 hold. Given and , also suppose the following:
- 1.
- 2. the random perturbation at step is i.i.d. and zero mean with variance
- 3. the generated sequence is bounded.
Then, with probability at least , after iterations, Algorithm 1 reaches a point that is -close to , where . Moreover, is such that is -close to , whereby for all ,
Remark 1.
Remark 2.
Second-order guarantees of DGD have been studied in [19] and [23] based on the almost sure non-convergence to saddle points under random initialization. In this paper, we propose to use random perturbations to actively evade saddle points. The second-order guarantees of NDGD stated in Theorem 3.1 do not require any additional initialization conditions. Second-order guarantees of the stochastic variant of DGD have been studied in [24] and [25], although they only show convergence to an approximate second-order stationary point. Here, an upper bound is given for the distance between the iterate at each agent and the set of local minimizers after a sufficient number of iterations.
IV Proof of Theorem 3.1
This section provides a proof of Theorem 3.1. First, we study the behavior of NDGD in three different cases, in line with the development of the related result in [21] for centralized gradient descent: i) the gradient of $G_{\alpha}$ is large in norm (see Lemma 4.1.1); ii) the smallest eigenvalue of the Hessian of $G_{\alpha}$ is sufficiently negative (see Lemma 4.1.2); and iii) the iterate lies in a neighborhood of the local minimizers of $G_{\alpha}$, where local strong convexity holds (see Lemma 4.1.3). Combining the outcome of this with Theorem 2.1, we then prove that with probability at least , after iterations, NDGD yields a point whose component at each agent is -close to some local minimizer of $f$.
IV-A Behavior of NDGD for three different cases
The following lemmas rely on the proofs of Lemma 16 and Lemma 17 in [21]. Given , , , , and , define
(12) |
where .
We first analyze the behavior of the NDGD algorithm in the case that . Intuitively, when the norm of is large enough, the expectation of the function value decreases by a certain amount after one iteration.
Lemma 4.1.1.
Proof.
Since is symmetric, doubly stochastic and strictly diagonally dominant, it is positive definite by the Gershgorin circle theorem. Given , note that . Since has -Lipschitz continuous gradient, using Taylor’s theorem gives
As such, choosing gives
as claimed.
Next, we analyze the behavior of the NDGD algorithm in the case that . Intuitively, for with small gradient and sufficiently negative , there exists an upper bound such that the expectation of the function value decreases by a certain amount after iterations.
Lemma 4.1.2.
Let Assumptions 2.2, 2.3 hold. Let . Given , suppose the random perturbation in Algorithm 1 is i.i.d. and zero mean with variance . Further, given , suppose that the generated sequence is bounded. Then, for any with and , there exists a number of steps such that
where . The number of steps has a fixed upper bound that is independent of , i.e., for all .
Proof.
Given , note that . Thus, choosing , the result holds as shown in the proof of Lemma 17 in [21].
Finally, we analyze the behavior of the algorithm in the case that . Intuitively, when the iterate is close enough to a local minimizer, with high probability subsequent iterates do not leave the neighborhood.
Lemma 4.1.3.
Proof.
Given , note that because the eigenvalues of the symmetric, doubly stochastic, diagonally dominant, and thus positive definite matrix , are contained in the interval $(0, 1]$. By , we have local strong convexity with modulus . Since also has Lipschitz continuous gradient, by Theorem 2.1.12 of [26],
Therefore,
Since , note that with . Then,
Thus, choosing , the result holds as shown in the proof of Lemma 16 in [21].
IV-B Main proof
Proof.
The main proof consists of two steps: i) it is shown that the three sets defined in (12) cover all possible points with respect to ; ii) it is shown that the upper bound on the decrease in can be used to derive a lower bound for the probability that the -th update at each agent is close to a local minimizer of .
Step 1. By the supposition in Theorem 3.1, given and , there exist , , , , and
such that , with respect to (11), and thus , where the superscript denotes set complement. If ,
if , then by Weyl’s inequality,
if , then again by Weyl’s inequality,
and . Therefore, , , , whereby
Step 2. Define stochastic process as
(13) |
where for all as per Lemma 4.1.2. By Lemma 4.1.1 and Lemma 4.1.2, decreases by a certain amount after a certain number of iterations for , and , respectively, as follows
(14) |
Defining the event , by the law of total expectation,
where and . Since , we obtain
Since the generated sequence is assumed bounded, there exists such that for all . As such,
Summing both sides of the inequality over gives
Since for all , it follows that , which leads to the following upper bound for the probability of event :
Therefore, if grows larger than , then . Since , after steps, must enter at least once with probability at least . Therefore, by repeating this step times, the probability of entering at least once is lower bounded:
where . Combining this with Lemma 4.1.3, we have that, after iterations, Algorithm 1 produces a point that is -close to with probability at least , where . For given , since satisfies requirements of Theorem 2.1, is such that is -close to . To summarize, we have for ,
as claimed.
V Numerical Example
Consider the following non-convex optimization problem over :
where , , , , and . The mixing matrix is taken to be
It can be verified that is a saddle point of , and that and are two local minimizers. We compare the performance of DGD and NDGD with constant step-size , both initialized from (i.e., close to a saddle point).
[Figure 1: (a) iterates of DGD; (b) iterates of NDGD, both initialized near the saddle point.]
From Figure 1(a), although not trapped forever, it does take DGD about 6000 iterations to escape the vicinity of the saddle point and converge to the neighborhood of a local minimizer. From Figure 1(b), we can see that NDGD escapes the vicinity of the saddle point in about 2000 iterations and converges to the neighborhood of a local minimizer. The effectiveness of NDGD over DGD is evident from this example.
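Since the local objectives and the mixing matrix of this example are only partially reproduced above, the following sketch uses a stand-in non-convex problem (a hypothetical choice, not the paper's) to illustrate the same qualitative comparison: starting very close to a saddle point, the unperturbed DGD iterates drift away only slowly, while the NDGD perturbations typically push the agents toward a local minimizer in far fewer iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha, sigma, T = 4, 2, 0.02, 0.1, 2000

# Stand-in local objectives (hypothetical): f_i(x) = 0.5 * ((x1^2 - 1)^2 + x2^2).
# Their sum has a saddle point at the origin and local minimizers at (+-1, 0).
def grad_f(x):
    return 0.5 * np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 2.0 * x[1]])

grads = [grad_f] * n

# Ring-topology mixing matrix: symmetric, doubly stochastic, strictly diagonally dominant.
W = 0.6 * np.eye(n) + 0.2 * (np.roll(np.eye(n), 1, axis=0) + np.roll(np.eye(n), -1, axis=0))

def iterations_to_escape(noisy):
    X = np.full((n, d), 1e-8)                       # all agents start very close to the saddle point
    for k in range(T):
        G = np.stack([g(x) for g, x in zip(grads, X)])
        if noisy:
            G = G + sigma * rng.standard_normal((n, d))
        X = W @ X - alpha * G
        if np.all(np.abs(np.abs(X[:, 0]) - 1.0) < 0.1):
            return k                                # every agent is within 0.1 of a local minimizer
    return T

print("DGD  iterations to escape:", iterations_to_escape(noisy=False))
print("NDGD iterations to escape:", iterations_to_escape(noisy=True))   # typically several times fewer
```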
VI Conclusion
A fixed step-size noisy distributed gradient descent (NDGD) algorithm is formulated for solving optimization problems in which the objective is a finite sum of smooth but possibly non-convex functions. Random perturbations are added to the gradient descent directions at each step to actively evade saddle points. Under certain regularity conditions, and with a suitable step-size, each agent converges (in probability, with specified confidence) to a neighborhood of a local minimizer. In particular, we determine a probabilistic upper bound on the distance between the iterate at each agent and the set of local minimizers, after a sufficient number of iterations.
The potential applications of the NDGD algorithm are vast and varied, including multi-agent systems control, federated learning and sensor networks location estimation, particularly in large-scale network scenarios. Further exploration of different approaches to introducing random perturbations, and analysis of convergence rate performance can be pursued in future work.
References
- [1] X. Dong, Y. Hua, Y. Zhou, Z. Ren, and Y. Zhong, “Theory and experiment on formation-containment control of multiple multirotor unmanned aerial vehicle systems,” IEEE Transactions on Automation Science and Engineering, vol. 16, no. 1, pp. 229–240, 2018.
- [2] Z. Qiu, G. Deconinck, and R. Belmans, “A literature survey of optimal power flow problems in the electricity market context,” in 2009 IEEE/PES Power Systems Conference and Exposition. IEEE, 2009, pp. 1–6.
- [3] W. Gao, J. Gao, K. Ozbay, and Z.-P. Jiang, “Reinforcement-learning-based cooperative adaptive cruise control of buses in the Lincoln Tunnel corridor with time-varying topology,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3796–3805, 2019.
- [4] A. Nedić and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.
- [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
- [6] H. Terelius, U. Topcu, and R. M. Murray, “Decentralized multi-agent optimization via dual decomposition,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 11245–11251, 2011.
- [7] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
- [8] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
- [9] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [10] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
- [11] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
- [12] K. I. Tsianos and M. G. Rabbat, “Distributed strongly convex optimization,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 593–600.
- [13] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
- [14] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient method,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 601–608.
- [15] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [16] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
- [17] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically convergent distributed optimization with uncoordinated step-sizes,” in 2017 American Control Conference (ACC). IEEE, 2017, pp. 3950–3955.
- [18] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
- [19] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” SIAM Journal on Optimization, vol. 30, no. 4, pp. 3029–3068, 2020.
- [20] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos, “Gradient descent can take exponential time to escape saddle points,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [21] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Conference on learning theory. PMLR, 2015, pp. 797–842.
- [22] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points efficiently,” in International Conference on Machine Learning. PMLR, 2017, pp. 1724–1732.
- [23] B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 14751–14812, 2022.
- [24] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments—part i: Agreement at a linear rate,” IEEE Transactions on Signal Processing, vol. 69, pp. 1242–1256, 2021.
- [25] ——, “Distributed learning in non-convex environments—part ii: Polynomial escape from saddle-points,” IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021.
- [26] Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003, vol. 87.