
Second-order Properties of Noisy Distributed Gradient Descent

Lei Qin, Michael Cantoni, and Ye Pu. This work was supported by a Melbourne Research Scholarship and the Australian Research Council (DE220101527 and DP210103272). L. Qin, M. Cantoni, and Y. Pu are with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville VIC 3010, Australia leqin@student.unimelb.edu.au, {cantoni, ye.pu}@unimelb.edu.au.
Abstract

We study a fixed step-size noisy distributed gradient descent algorithm for solving optimization problems in which the objective is a finite sum of smooth but possibly non-convex functions. Random perturbations are introduced to the gradient descent directions at each step to actively evade saddle points. Under certain regularity conditions, and with a suitable step-size, it is established that each agent converges to a neighborhood of a local minimizer and the size of the neighborhood depends on the step-size and the confidence parameter. A numerical example is presented to illustrate the effectiveness of the random perturbations in terms of escaping saddle points in fewer iterations than without the perturbations.

Index Terms:
Non-convex optimization; first-order methods; random perturbations; evading saddle points

I Introduction

We consider the optimization problem

\min_{\mathbf{x}\in\mathbb{R}^{n}} f(\mathbf{x}) \triangleq \min_{\mathbf{x}\in\mathbb{R}^{n}} \sum_{i=1}^{m} f_{i}(\mathbf{x}), \qquad (1)

where each $f_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is smooth but possibly non-convex, and $\mathbf{x}\in\mathbb{R}^{n}$ is the decision vector. The aim is to employ $m$ agents to iteratively solve the optimization problem in (1) over an undirected and connected network graph $\mathcal{G}(\mathcal{V},\mathcal{E})$. Each agent $i\in\mathcal{V}:=\{1,\ldots,m\}$ knows only its own function $f_{i}$ and its gradient. A pair of agents $(i,j)\in\mathcal{V}\times\mathcal{V}$ can exchange information directly if and only if $(i,j)\in\mathcal{E}$. Collaborative distributed optimization over a network is of significant interest in control, learning and estimation, particularly in large-scale settings such as unmanned vehicle systems [1], electric power systems [2], transit systems [3], and wireless sensor networks [4].

In the field of optimization, two primary classes of distributed methods can be identified: dual decomposition methods and consensus-based methods. Dual decomposition methods minimize an augmented Lagrangian formulated from constraints that enforce agreement between agents, via iterative updates of the corresponding primal and dual variables [5]. The distributed dual decomposition algorithm in [6] involves agents alternating between updating their primal and dual variables and communicating with their neighbors. In [7], it is established that the distributed alternating direction method of multipliers (ADMM) exhibits linear convergence rates in strongly convex settings. Consensus-based methods can be traced back to the distributed computation models proposed in [8], which eliminate agent disagreements through local iterate exchange and weighted averaging to achieve consensus. This idea underlies the distributed (sub)gradient methods proposed in [9] and [10] to solve problem (1) with all $f_{i}$ convex. With a diminishing step-size, each agent converges to an optimizer [10]; with a constant step-size, convergence is typically faster, but only to a neighborhood of an optimizer [9].

This paper focuses on consensus-based distributed (sub)gradient methods, whose simplicity as first-order methods makes them easy to adapt to diverse settings. In particular, a fixed step-size Distributed Gradient Descent (DGD) algorithm is considered, as an instance of the distributed (sub)gradient methods. In unperturbed form, the update for each agent $i\in\mathcal{V}$ at iteration $k$ is given by

\hat{\mathbf{x}}^{k+1}_{i}=\sum_{j=1}^{m}\mathbf{W}_{ij}\hat{\mathbf{x}}^{k}_{j}-\alpha\nabla f_{i}(\hat{\mathbf{x}}^{k}_{i}), \qquad (2)

where $\alpha>0$ is the constant step-size, $\nabla f_{i}$ is the gradient of $f_{i}$, $\hat{\mathbf{x}}_{i}\in\mathbb{R}^{n}$ is the local copy of the decision vector $\mathbf{x}$ at agent $i\in\mathcal{V}$, and $\mathbf{W}_{ij}$ is the scalar entry in the $i$-th row and $j$-th column of a given mixing matrix $\mathbf{W}\in\mathbb{R}^{m\times m}$. The mixing matrix is consistent with the graph $\mathcal{G}(\mathcal{V},\mathcal{E})$, in the sense that $\mathbf{W}_{ii}>0$ for all $i\in\mathcal{V}$, $\mathbf{W}_{ij}>0$ if $(i,j)\in\mathcal{E}$, and $\mathbf{W}_{ij}=0$ otherwise. The convergence rates of fixed step-size and diminishing step-size DGD algorithms in strongly convex settings are examined in [11] and [12], respectively. Several variants of DGD have been proposed in (strongly) convex settings. Nesterov momentum is combined with diminishing step-sizes in [13] to improve convergence rates. An inexact proximal-gradient method is considered in [14] for problems involving non-smooth functions. To reach exact consensus with a constant step-size, gradient tracking [15, 16, 17] is used to compensate for the term $-\alpha\nabla f_{i}$ in (2), which does not vanish on its own.
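As a concrete illustration of update (2), the following sketch runs unperturbed DGD on a toy instance; the quadratic local objectives, the 3-agent complete graph, and the mixing matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of the unperturbed DGD update (2) on an assumed toy problem.
m, n = 3, 2
alpha = 0.1

# Local objectives f_i(x) = 0.5*||x - b_i||^2, so grad f_i(x) = x - b_i.
b = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
grad = lambda i, x: x - b[i]

# Symmetric, doubly stochastic, strictly diagonally dominant mixing matrix.
W = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])

x = np.zeros((m, n))  # local copies of the decision vector, one row per agent
for k in range(500):
    x = W @ x - alpha * np.array([grad(i, x[i]) for i in range(m)])

# With a constant step-size, agents reach approximate consensus near the
# minimizer of sum_i f_i (here mean(b) = [2/3, 2/3]), but not exactly at it.
print(x)
```

The run exhibits the behavior described above for constant step-sizes: the agents cluster around the global minimizer of the sum without exactly agreeing on it.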

In non-convex settings, gradient descent methods may face issues due to saddle points, as convergence to a first-order stationary point does not guarantee local minimality. While Hessian-based methods can avoid saddle points, their computational cost can be prohibitive for large-scale problems. In [18], the fixed step-size DGD algorithm is shown to retain the property of convergence to a neighborhood of a consensus stationary solution under some regularity assumptions in non-convex settings. It is also shown in [19] that DGD with a constant step-size converges almost surely to a neighborhood of second-order stationary solutions. However, this requires random initialization to avoid the zero-measure manifold of saddle point attraction, and moreover, the underlying analysis does not support techniques for actively escaping saddle points.

Recently, it has been shown that standard (centralized) gradient descent methods can take exponential time to escape saddle points [20], while noise (random perturbations) has proven effective for escaping saddle points in non-convex optimization. The Noisy Gradient Descent algorithm in [21] and [22] is proven to escape saddle points efficiently while converging to a neighborhood of a local minimizer with high probability. However, all of these works are limited to centralized methods. In [23] it is shown that distributed stochastic gradient descent converges to local minima almost surely when diminishing step-sizes are used. With a constant step-size, the diffusion strategy with stochastic gradients in [24] and [25] only returns approximately second-order stationary points, rather than an outcome that lies in a neighborhood of a local minimizer of controllable size.

In this paper, the main contribution is that we analyze a fixed step-size noisy distributed gradient descent (NDGD) algorithm for solving the optimization problem in (1) by expanding upon and combining ideas from [21] and [22] on centralized stochastic gradient descent, and from [11, 18, 19] on distributed unperturbed gradient descent. In this combination, random perturbations are added to the gradient descent directions at each step to actively evade saddle points. It is established that under certain regularity conditions, and with a suitable step-size, each agent converges to a neighborhood of a local minimizer. In particular, we determine a probabilistic upper bound for the distance between the iterate at each agent and the set of local minimizers after a sufficient number of iterations. A numerical example is presented to illustrate the effectiveness of the algorithm in terms of escaping from the vicinity of a saddle point in fewer iterations than the standard (i.e., unperturbed) fixed step-size DGD method.

I-A Notation

Let $\mathbf{I}_{n}$ denote the $n\times n$ identity matrix, $\bm{1}_{n}$ denote the $n$-vector with all entries equal to $1$, and $\mathbf{A}_{ij}$ denote the entry in the $i$-th row and $j$-th column of the matrix $\mathbf{A}$. For a square symmetric matrix $\mathbf{B}$, we use $\lambda_{\min}(\mathbf{B})$, $\lambda_{\max}(\mathbf{B})$ and $\|\mathbf{B}\|$ to denote its minimum eigenvalue, maximum eigenvalue and spectral norm, respectively. The Kronecker product is denoted by $\otimes$. The distance from a point $\mathbf{x}\in\mathbb{R}^{n}$ to a given set $\mathcal{Y}\subseteq\mathbb{R}^{n}$ is denoted by $\textup{dist}(\mathbf{x},\mathcal{Y}):=\inf_{\mathbf{y}\in\mathcal{Y}}\|\mathbf{x}-\mathbf{y}\|$. We say that a point $\mathbf{x}$ is $\delta$-close to a point $\mathbf{y}$ (resp., a set $\mathcal{Y}$) if $\|\mathbf{x}-\mathbf{y}\|\leq\delta$ (resp., $\textup{dist}(\mathbf{x},\mathcal{Y})\leq\delta$). We use the Bachmann–Landau (asymptotic) notations $\mathcal{O}(g(x,y))$, $\Omega(g(x,y))$ and $\Theta(g(x,y))$ to hide dependence on variables other than $x$ and $y$.

II Problem Setup and Supporting Results

In this section, we present a reformulation of the optimization problem defined in (1) and provide a list of assumptions used in subsequent analysis. Then, we briefly recall aspects of the fixed step-size DGD algorithm (2) and present some existing results for non-convex optimization problems. Next, some intermediate results are derived to establish certain properties of the local minimizers of $f$ defined in (1) (see Theorem 2.1), on the basis of a collection of supporting lemmas (see Lemmas 2.3.1–2.3.4).

II-A Problem Setup

Definition 2.1.

For a differentiable function $h$, a point $\mathbf{x}$ is said to be first-order stationary if $\|\nabla h(\mathbf{x})\|=0$.

Definition 2.2.

For a twice differentiable function $h$, a first-order stationary point $\mathbf{x}$ is: (i) a local minimizer, if $\nabla^{2}h(\mathbf{x})\succ 0$; (ii) a local maximizer, if $\nabla^{2}h(\mathbf{x})\prec 0$; and (iii) a saddle point, if $\lambda_{\min}(\nabla^{2}h(\mathbf{x}))<0$ and $\lambda_{\max}(\nabla^{2}h(\mathbf{x}))>0$.
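Definition 2.2 can be checked numerically from Hessian eigenvalues. The following sketch (the example functions are illustrative assumptions, not from the paper) classifies a stationary point accordingly.

```python
import numpy as np

# Classify a stationary point per Definition 2.2 using the extreme
# eigenvalues of the Hessian at that point.
def classify(hessian):
    eigs = np.linalg.eigvalsh(hessian)   # ascending order
    lam_min, lam_max = eigs[0], eigs[-1]
    if lam_min > 0:
        return "local minimizer"
    if lam_max < 0:
        return "local maximizer"
    if lam_min < 0 and lam_max > 0:
        return "saddle point"
    return "degenerate (not covered by Definition 2.2)"

# h(x, y) = x^2 - y^2 is stationary at the origin with Hessian diag(2, -2).
print(classify(np.diag([2.0, -2.0])))   # saddle point

# h(x, y) = x^2 + y^2 has Hessian diag(2, 2) at the origin.
print(classify(np.diag([2.0, 2.0])))    # local minimizer
```

Note that a point with $\lambda_{\min}=0$ falls outside the trichotomy of Definition 2.2; Assumption 2.1 below rules out such degenerate stationary points for $f$.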

By introducing additional local variables, the optimization problem in (1) can be reformulated as

\min_{\hat{\mathbf{x}}\in(\mathbb{R}^{n})^{m}} F(\hat{\mathbf{x}}) \triangleq \min_{\hat{\mathbf{x}}\in(\mathbb{R}^{n})^{m}} \sum_{i=1}^{m} f_{i}(\hat{\mathbf{x}}_{i}), \quad \textup{s.t. } \hat{\mathbf{x}}_{i}=\hat{\mathbf{x}}_{j} \text{ for all } (i,j)\in\mathcal{E}, \qquad (5)

where $\hat{\mathbf{x}}_{i}\in\mathbb{R}^{n}$ is the local copy of the decision vector $\mathbf{x}$ at agent $i\in\mathcal{V}$, and $\hat{\mathbf{x}}=[\hat{\mathbf{x}}_{1}^{T},\cdots,\hat{\mathbf{x}}_{m}^{T}]^{T}\in(\mathbb{R}^{n})^{m}$.

Assumption 2.1 (Local regularity).

The function $f$ in (1) is such that for all first-order stationary points $\mathbf{x}$, either $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))>0$ (i.e., $\mathbf{x}$ is a local minimizer), or $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))<0$ (i.e., $\mathbf{x}$ is a saddle point or a local maximizer).

Assumption 2.2 (Lipschitz gradient).

Each objective $f_{i}$ has an $L_{f_{i}}^{g}$-Lipschitz continuous gradient, i.e.,

\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{y})\|\leq L_{f_{i}}^{g}\|\mathbf{x}-\mathbf{y}\|

for all $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$ and each $i\in\mathcal{V}$.

Assumption 2.3 (Lipschitz Hessian).

Each objective $f_{i}$ has an $L_{f_{i}}^{H}$-Lipschitz continuous Hessian, i.e.,

\|\nabla^{2}f_{i}(\mathbf{x})-\nabla^{2}f_{i}(\mathbf{y})\|\leq L_{f_{i}}^{H}\|\mathbf{x}-\mathbf{y}\|

for all $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$ and each $i\in\mathcal{V}$.

If Assumption 2.2 holds, then $F$ defined in (5) has an $L_{F}^{g}$-Lipschitz continuous gradient with $L_{F}^{g}=\max_{i}\{L_{f_{i}}^{g}\}$. Further, if Assumption 2.3 holds, then $F$ has an $L_{F}^{H}$-Lipschitz continuous Hessian with $L_{F}^{H}=\max_{i}\{L_{f_{i}}^{H}\}$.

Assumption 2.4 (Coercivity and properness).

Each local objective $f_{i}$ is coercive (i.e., its sublevel sets are compact) and proper (i.e., not everywhere infinite).

II-B Distributed Gradient Descent

Assumption 2.5 (Network).

The undirected graph $\mathcal{G}(\mathcal{V},\mathcal{E})$ is connected.

The DGD algorithm in (2), with constant step-size $\alpha>0$, can be written in matrix/vector form as

\hat{\mathbf{x}}^{k+1}=\hat{\mathbf{W}}\hat{\mathbf{x}}^{k}-\alpha\nabla F(\hat{\mathbf{x}}^{k}), \qquad (6)

where $\hat{\mathbf{W}}:=\mathbf{W}\otimes\mathbf{I}_{n}$. Note that from this point on, the mixing matrix $\mathbf{W}$ is taken to be symmetric, doubly stochastic and strictly diagonally dominant, i.e., $\mathbf{W}_{ii}>\sum_{j\neq i}\mathbf{W}_{ij}$ for all $i\in\mathcal{V}$. Thus, $\mathbf{W}$ is positive definite by the Gershgorin circle theorem. As proposed in early works, including [11, 18, 19], the convergence properties can be analyzed via an auxiliary function. Let $Q_{\alpha}$ denote the auxiliary function

Q_{\alpha}(\hat{\mathbf{x}}) = \sum_{i=1}^{m} f_{i}(\hat{\mathbf{x}}_{i}) + \frac{1}{2\alpha}\sum_{i=1}^{m}\sum_{j=1}^{m}(\mathbf{I}_{m}-\mathbf{W})_{ij}\,\hat{\mathbf{x}}_{i}^{T}\hat{\mathbf{x}}_{j} = F(\hat{\mathbf{x}}) + \frac{1}{2\alpha}\|\hat{\mathbf{x}}\|^{2}_{\mathbf{I}_{mn}-\hat{\mathbf{W}}}, \qquad (7)

consisting of the objective function in (5) and a quadratic penalty that depends on the step-size and the mixing matrix. We use $\hat{\mathbf{x}}^{*}$ to denote a local minimizer of $Q_{\alpha}$. Note that the DGD update (6) applied to (5) can be interpreted as an instance of the standard gradient descent algorithm applied to (7), i.e.,

\hat{\mathbf{x}}^{k+1}=\hat{\mathbf{x}}^{k}-\alpha\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k}). \qquad (8)

Thus, iteratively running (6) and (8) from the same initialization yields the same sequence of iterates.
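The equivalence of (6) and (8) can be verified directly, since $\nabla Q_{\alpha}(\hat{\mathbf{x}})=\nabla F(\hat{\mathbf{x}})+\alpha^{-1}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}$. The following sketch checks this identity for one step on a random toy instance (the quadratic objectives and the mixing matrix are assumptions for illustration).

```python
import numpy as np

# One DGD step (6) equals one gradient step (8) on Q_alpha.
rng = np.random.default_rng(0)
m, n, alpha = 4, 3, 0.05

# Symmetric, doubly stochastic, strictly diagonally dominant mixing matrix.
W = 0.7 * np.eye(m) + 0.3 * np.ones((m, m)) / m
W_hat = np.kron(W, np.eye(n))

# grad F for quadratics f_i(x) = 0.5 x^T A_i x, A_i symmetric PSD.
A = [M @ M.T for M in rng.standard_normal((m, n, n))]
def grad_F(x_stack):
    return np.concatenate([A[i] @ x_stack[i*n:(i+1)*n] for i in range(m)])

x = rng.standard_normal(m * n)
dgd_step = W_hat @ x - alpha * grad_F(x)                   # update (6)
grad_Q = grad_F(x) + (np.eye(m*n) - W_hat) @ x / alpha     # gradient of (7)
gd_step = x - alpha * grad_Q                               # update (8)

print(np.allclose(dgd_step, gd_step))  # True
```

The identity holds exactly (up to floating point), since $\hat{\mathbf{x}}-\alpha\nabla Q_{\alpha}(\hat{\mathbf{x}})=\hat{\mathbf{W}}\hat{\mathbf{x}}-\alpha\nabla F(\hat{\mathbf{x}})$ algebraically.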

If Assumption 2.2 holds, then $Q_{\alpha}$ defined in (7) has an $L_{Q_{\alpha}}^{g}$-Lipschitz continuous gradient with $L_{Q_{\alpha}}^{g}=L_{F}^{g}+\alpha^{-1}(1-\lambda_{\min}(\mathbf{W}))=\max_{i}\{L_{f_{i}}^{g}\}+\alpha^{-1}(1-\lambda_{\min}(\mathbf{W}))$. We have $1-\lambda_{\min}(\mathbf{W})\geq 0$ because the spectrum of a symmetric, positive definite and doubly stochastic matrix is contained in the interval $(0,1]$, with $1$ being the largest eigenvalue (the Perron root) by the Perron–Frobenius theorem. Further, if Assumption 2.3 holds, then $Q_{\alpha}$ has an $L_{Q_{\alpha}}^{H}$-Lipschitz continuous Hessian with $L_{Q_{\alpha}}^{H}=\max_{i}\{L_{f_{i}}^{H}\}=L_{F}^{H}$.
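For quadratic local objectives the gradient Lipschitz constant of $Q_{\alpha}$ is the spectral norm of its (constant) Hessian, so the expression for $L_{Q_{\alpha}}^{g}$ can be checked numerically. The sketch below does this on assumed toy data.

```python
import numpy as np

# Check that ||Hess Q_alpha|| <= L_F^g + (1 - lambda_min(W))/alpha for
# quadratic f_i, where Hess Q_alpha = Hess F + (I - W_hat)/alpha.
rng = np.random.default_rng(2)
m, n, alpha = 4, 3, 0.1

W = 0.7 * np.eye(m) + 0.3 * np.ones((m, m)) / m   # symmetric, doubly stochastic
W_hat = np.kron(W, np.eye(n))

# Hess F is block diagonal with blocks A_i = Hess f_i (symmetric PSD).
A = [M @ M.T for M in rng.standard_normal((m, n, n))]
hess_F = np.zeros((m*n, m*n))
for i in range(m):
    hess_F[i*n:(i+1)*n, i*n:(i+1)*n] = A[i]

hess_Q = hess_F + (np.eye(m*n) - W_hat) / alpha

L_F = max(np.linalg.norm(Ai, 2) for Ai in A)            # max_i L_{f_i}^g
L_Q_bound = L_F + (1 - np.linalg.eigvalsh(W)[0]) / alpha

print(np.linalg.norm(hess_Q, 2) <= L_Q_bound + 1e-9)  # True
```

The bound follows from the triangle inequality, since $\|\mathbf{I}_{mn}-\hat{\mathbf{W}}\|=1-\lambda_{\min}(\mathbf{W})$ when the spectrum of $\mathbf{W}$ lies in $(0,1]$.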

II-C Relationships between Local Minimizers of $f$ and $Q_{\alpha}$

In this section, we prove that $\hat{\mathbf{x}}_{i}^{*}$, the component of a local minimizer $\hat{\mathbf{x}}^{*}$ of $Q_{\alpha}$ associated with agent $i$, can be made arbitrarily close to the set of local minimizers of $f$ by choosing sufficiently small $\alpha>0$. The proof is based on a collection of intermediate results. First, Lemma 2.3.1 shows that, at a local minimizer of $Q_{\alpha}$, the component corresponding to agent $i$ can be made arbitrarily close to $\bar{\mathbf{x}}^{*}$, the average of $\hat{\mathbf{x}}_{i}^{*}$ across all agents, by choosing a sufficiently small step-size $\alpha>0$. Lemmas 2.3.2 and 2.3.3 show that given $\epsilon_{g}>0$ and $\epsilon_{H}>0$, one can always find a constant step-size $\alpha>0$ such that $\|\nabla f(\bar{\mathbf{x}}^{*})\|\leq\epsilon_{g}$ and $\lambda_{\min}(\nabla^{2}f(\bar{\mathbf{x}}^{*}))\geq-\epsilon_{H}$. Finally, Lemma 2.3.4 shows that for each agent, $\hat{\mathbf{x}}_{i}^{*}$ can be made arbitrarily close to a local minimizer of $f$ by choosing a sufficiently small step-size $\alpha>0$. Let $\mathcal{X}_{f}^{*}$ and $\hat{\mathcal{X}}_{Q_{\alpha}}^{*}$ denote the sets of local minimizers of $f$ and $Q_{\alpha}$, respectively:

\mathcal{X}_{f}^{*}:=\{\mathbf{x}\in\mathbb{R}^{n}: \nabla f(\mathbf{x})=0,~\nabla^{2}f(\mathbf{x})\succ 0\}, \quad \hat{\mathcal{X}}_{Q_{\alpha}}^{*}:=\{\hat{\mathbf{x}}\in(\mathbb{R}^{n})^{m}: \nabla Q_{\alpha}(\hat{\mathbf{x}})=0,~\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}})\succ 0\}. \qquad (9)
Lemma 2.3.1.

Let Assumption 2.5 hold. Given $\alpha>0$, let $\hat{\mathbf{x}}^{*}$ be a local minimizer of $Q_{\alpha}$. Then, for each $i\in\mathcal{V}$,

\|\hat{\mathbf{x}}_{i}^{*}-\bar{\mathbf{x}}^{*}\|\leq\alpha\cdot\frac{\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}},

where $\bar{\mathbf{x}}^{*}=\frac{1}{m}(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\hat{\mathbf{x}}^{*}$, and $0<\lambda_{2}<1$ is the second-largest eigenvalue of $\mathbf{W}$.

Proof.

Since $\hat{\mathbf{x}}^{*}$ is a local minimizer of $Q_{\alpha}$, we have $\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})=\nabla F(\hat{\mathbf{x}}^{*})+\alpha^{-1}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}^{*}=0$, and thus,

\hat{\mathbf{x}}^{*}=\hat{\mathbf{W}}\hat{\mathbf{x}}^{*}-\alpha\nabla F(\hat{\mathbf{x}}^{*})=\hat{\mathbf{W}}^{s+1}\hat{\mathbf{x}}^{*}-\alpha\sum_{t=0}^{s}\hat{\mathbf{W}}^{t}\nabla F(\hat{\mathbf{x}}^{*})

for all $s\in\mathbb{N}$. Therefore, $\|\hat{\mathbf{x}}^{*}-\hat{\mathbf{W}}^{s+1}\hat{\mathbf{x}}^{*}\|=\alpha\|\sum_{t=0}^{s}\hat{\mathbf{W}}^{t}\nabla F(\hat{\mathbf{x}}^{*})\|\leq\alpha\sum_{t=0}^{s}\|\hat{\mathbf{W}}^{t}\nabla F(\hat{\mathbf{x}}^{*})\|$. Now, since $(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}^{*}=0$, we have $(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\nabla F(\hat{\mathbf{x}}^{*})=(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}(\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})-\alpha^{-1}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}^{*})=0$. Further, by connectivity of the underlying communication graph (see Assumption 2.5) and the Perron–Frobenius theorem, $0<\lambda_{2}<1$, and as such,

\|\hat{\mathbf{x}}_{i}^{*}-(\hat{\mathbf{W}}^{s+1}\hat{\mathbf{x}}^{*})_{i}\| \leq \alpha\sum_{t=0}^{s}\Big\|\Big(\big(\mathbf{W}^{t}-\tfrac{\bm{1}_{m}\bm{1}_{m}^{T}}{m}\big)\otimes\mathbf{I}_{n}\Big)\nabla F(\hat{\mathbf{x}}^{*})\Big\| \leq \alpha\sum_{t=0}^{s}\lambda_{2}^{t}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|,

where the last inequality holds because $1$ is a simple eigenvalue of $\mathbf{W}$ and all other eigenvalues lie in $(0,\lambda_{2}]$, so $\|\mathbf{W}^{t}-\bm{1}_{m}\bm{1}_{m}^{T}/m\|\leq\lambda_{2}^{t}$. In the limit $s\rightarrow\infty$, it follows that

\|\hat{\mathbf{x}}_{i}^{*}-\bar{\mathbf{x}}^{*}\| = \lim_{s\rightarrow\infty}\|\hat{\mathbf{x}}_{i}^{*}-(\hat{\mathbf{W}}^{s+1}\hat{\mathbf{x}}^{*})_{i}\| \leq \alpha\cdot\frac{\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}}

as claimed.

Lemma 2.3.2.

Let Assumptions 2.2 and 2.5 hold. Given $\alpha>0$, let $\hat{\mathbf{x}}^{*}$ be a local minimizer of $Q_{\alpha}$. Then

\|\nabla f(\bar{\mathbf{x}}^{*})\|\leq\alpha\cdot L_{F}^{g}\frac{m\sqrt{m}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}},

where $\bar{\mathbf{x}}^{*}=\frac{1}{m}(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\hat{\mathbf{x}}^{*}$ and $0<\lambda_{2}<1$ is the second-largest eigenvalue of $\mathbf{W}$.

Proof.

Since $(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\nabla F(\hat{\mathbf{x}}^{*})=(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}(\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})-\alpha^{-1}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}^{*})=0$, by Assumption 2.2 and Lemma 2.3.1,

\|\nabla f(\bar{\mathbf{x}}^{*})\| = \|(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}(\nabla F(\bm{1}_{m}\otimes\bar{\mathbf{x}}^{*})-\nabla F(\hat{\mathbf{x}}^{*}))\| \leq \sqrt{m}\,L_{F}^{g}\,\|\bm{1}_{m}\otimes\bar{\mathbf{x}}^{*}-\hat{\mathbf{x}}^{*}\| \leq \alpha\cdot L_{F}^{g}\frac{m\sqrt{m}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}}

as claimed.

Lemma 2.3.3.

Let Assumptions 2.3 and 2.5 hold. Given $\alpha>0$, let $\hat{\mathbf{x}}^{*}$ be a local minimizer of $Q_{\alpha}$. Then

\lambda_{\min}(\nabla^{2}f(\bar{\mathbf{x}}^{*}))\geq-\alpha\cdot L_{F}^{H}\frac{m^{2}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}},

where $\bar{\mathbf{x}}^{*}=\frac{1}{m}(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\hat{\mathbf{x}}^{*}$ and $0<\lambda_{2}<1$ is the second-largest eigenvalue of $\mathbf{W}$.

Proof.

Since $(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\nabla^{2}F(\hat{\mathbf{x}}^{*})(\bm{1}_{m}\otimes\mathbf{I}_{n})=(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}^{*})-\frac{1}{\alpha}(\mathbf{I}_{mn}-\hat{\mathbf{W}}))(\bm{1}_{m}\otimes\mathbf{I}_{n})=(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}^{*})(\bm{1}_{m}\otimes\mathbf{I}_{n})\succeq 0$, by Assumption 2.3, we have

\nabla^{2}f(\bar{\mathbf{x}}^{*}) = (\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\nabla^{2}F(\bm{1}_{m}\otimes\bar{\mathbf{x}}^{*})(\bm{1}_{m}\otimes\mathbf{I}_{n}) \succeq (\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\big(\nabla^{2}F(\bm{1}_{m}\otimes\bar{\mathbf{x}}^{*})-\nabla^{2}F(\hat{\mathbf{x}}^{*})\big)(\bm{1}_{m}\otimes\mathbf{I}_{n}) \succeq -mL_{F}^{H}\|\hat{\mathbf{x}}^{*}-\bm{1}_{m}\otimes\bar{\mathbf{x}}^{*}\|\,\mathbf{I}_{n},

where the last inequality holds because $\mathbf{A}\succeq-\|\mathbf{A}\|\mathbf{I}_{d}$ for any symmetric matrix $\mathbf{A}\in\mathbb{R}^{d\times d}$, with $\|\mathbf{A}\|=\max_{\mathbf{x}^{T}\mathbf{x}=1}|\mathbf{x}^{T}\mathbf{A}\mathbf{x}|$. So by Lemma 2.3.1,

\lambda_{\min}(\nabla^{2}f(\bar{\mathbf{x}}^{*}))\geq-\alpha\cdot L_{F}^{H}\frac{m^{2}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}}

as claimed.

Lemma 2.3.4.

Let Assumptions 2.1, 2.2 and 2.3 hold. Then, for any given compact set $\mathcal{X}\subset\mathbb{R}^{n}$,

\lim_{\alpha\downarrow 0}\big(\sup\{\textup{dist}(\mathbf{x},\mathcal{X}_{f}^{*}):\mathbf{x}\in\mathcal{X}^{\alpha}_{f}\cap\mathcal{X}\}\big)=0,

where $\mathcal{X}_{f}^{\alpha}:=\{\mathbf{x}:\|\nabla f(\mathbf{x})\|\leq\alpha\cdot c_{1},~\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq-\alpha\cdot c_{2}\}$, with

c_{1}=L_{F}^{g}\frac{m\sqrt{m}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}},\qquad c_{2}=L_{F}^{H}\frac{m^{2}\,\|\nabla F(\hat{\mathbf{x}}^{*})\|}{1-\lambda_{2}}.

Proof.

Following the approach used in Lemma 3.8 of [19] to establish a similar result in the absence of the second-order requirement in the definition of $\mathcal{X}_{f}^{\alpha}$, we prove the lemma by contradiction. Suppose

\inf_{\alpha>0}\{\sup\{\textup{dist}(\mathbf{x},\mathcal{X}_{f}^{*}):\mathbf{x}\in\mathcal{X}^{\alpha}_{f}\cap\mathcal{X}\}\}=d>0. \qquad (10)

Then, there exists a sequence $\{\mathbf{x}^{k}\}$ with $\mathbf{x}^{k}\in\mathcal{X}_{f}^{1/k}\cap\mathcal{X}$ and $\textup{dist}(\mathbf{x}^{k},\mathcal{X}_{f}^{*})\geq d$ for all $k\in\mathbb{N}^{+}$. Since $\|\nabla f\|$ and $\lambda_{\min}(\nabla^{2}f)$ are continuous functions (see Assumptions 2.2, 2.3), $\mathcal{X}_{f}^{1/k}$ is closed. Thus, $\mathcal{X}_{f}^{1/k}\cap\mathcal{X}$ is compact. Since $\{\mathbf{x}^{k}\}\subset\mathcal{X}$, we can find a convergent sub-sequence $\{\mathbf{x}^{t_{k}}\}$ with limit point $\mathbf{x}^{\infty}$ satisfying $\textup{dist}(\mathbf{x}^{\infty},\mathcal{X}_{f}^{*})\geq d$. Since $(\mathcal{X}_{f}^{1/k}\cap\mathcal{X})\supset(\mathcal{X}_{f}^{1/(k+1)}\cap\mathcal{X})$, it follows that $\mathbf{x}^{\infty}\in\mathcal{X}_{f}^{1/k}\cap\mathcal{X}$ for all $k\in\mathbb{N}^{+}$. This means $\|\nabla f(\mathbf{x}^{\infty})\|\leq c_{1}/k$ and $\lambda_{\min}(\nabla^{2}f(\mathbf{x}^{\infty}))\geq-c_{2}/k$ for all $k\in\mathbb{N}^{+}$, implying $\|\nabla f(\mathbf{x}^{\infty})\|=0$ and $\lambda_{\min}(\nabla^{2}f(\mathbf{x}^{\infty}))\geq 0$. By Assumption 2.1, we then have $\lambda_{\min}(\nabla^{2}f(\mathbf{x}^{\infty}))>0$. Hence $\textup{dist}(\mathbf{x}^{\infty},\mathcal{X}_{f}^{*})=0$, which contradicts the initial hypothesis (10).

By combining Lemmas 2.3.1 through 2.3.4, the following intermediate theorem can be established. It is used to prove Theorem 3.1 in the next section.

Theorem 2.1.

Let Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5 hold. Given $\Delta_{1}>0$, there exists a threshold $\bar{\alpha}(\Delta_{1})>0$ such that, if $0<\alpha\leq\bar{\alpha}(\Delta_{1})$ and $\hat{\mathbf{x}}^{*}$ is a local minimizer of $Q_{\alpha}$, then

\textup{dist}(\hat{\mathbf{x}}_{i}^{*},\mathcal{X}_{f}^{*})\leq\Delta_{1}

for each $i\in\mathcal{V}$.

Proof.

Given $\alpha>0$, by the triangle inequality, $\textup{dist}(\hat{\mathbf{x}}_{i}^{*},\mathcal{X}_{f}^{*})\leq\|\hat{\mathbf{x}}_{i}^{*}-\bar{\mathbf{x}}^{*}\|+\textup{dist}(\bar{\mathbf{x}}^{*},\mathcal{X}_{f}^{*})$, where $\bar{\mathbf{x}}^{*}=\frac{1}{m}(\bm{1}_{m}\otimes\mathbf{I}_{n})^{T}\hat{\mathbf{x}}^{*}$. By coercivity and properness of each $f_{i}$ (see Assumption 2.4), $F$ is coercive and proper. Therefore, $\hat{\mathcal{X}}^{*}_{Q_{\alpha}}$ is bounded, and there exists an upper bound $G>0$ such that $\|\nabla F(\hat{\mathbf{x}}^{*})\|\leq G$ for all $\hat{\mathbf{x}}^{*}\in\hat{\mathcal{X}}^{*}_{Q_{\alpha}}$. By Lemma 2.3.1, if

0<\alpha\leq\bar{\alpha}_{1}(\Delta_{1}):=\frac{\Delta_{1}(1-\lambda_{2})}{2G}

and $\hat{\mathbf{x}}^{*}\in\hat{\mathcal{X}}^{*}_{Q_{\alpha}}$ as defined in (9), then $\|\hat{\mathbf{x}}_{i}^{*}-\bar{\mathbf{x}}^{*}\|\leq\Delta_{1}/2$ holds for each $i\in\mathcal{V}$. Now, note that $\bar{\mathbf{x}}^{*}\in\mathcal{X}_{f}^{\alpha}$ in view of Lemmas 2.3.2 and 2.3.3, with $\mathcal{X}_{f}^{\alpha}$ as defined in Lemma 2.3.4. As such, by application of Lemma 2.3.4 with $\mathcal{X}=\{\bar{\mathbf{x}}^{*}\}$, there exists $\bar{\alpha}_{2}(\Delta_{1})>0$ such that if $0<\alpha\leq\bar{\alpha}_{2}(\Delta_{1})$, then $\textup{dist}(\bar{\mathbf{x}}^{*},\mathcal{X}_{f}^{*})\leq\Delta_{1}/2$ holds. Therefore, if

0<\alpha\leq\bar{\alpha}(\Delta_{1}):=\min\{\bar{\alpha}_{1}(\Delta_{1}),\bar{\alpha}_{2}(\Delta_{1})\},

then $\textup{dist}(\hat{\mathbf{x}}_{i}^{*},\mathcal{X}_{f}^{*})\leq\Delta_{1}$ as claimed.

III Method and Main Results

In this section, a Noisy Distributed Gradient Descent (NDGD) algorithm (see Algorithm 1), a variant of the fixed step-size DGD algorithm, is formulated. The main analysis result, also formulated in this section, establishes the second-order properties of the NDGD algorithm. The key idea is to add random noise to the distributed gradient descent directions at each iteration. The required properties of the noise $\xi_{i}^{k}$ in Algorithm 1 are presented in Theorem 3.1.

Initialization;
for $k=0,1,\cdots$ do
     for $i=1,2,\cdots,m$ do
         Sample i.i.d. $\xi_{i}^{k}$;
         $\hat{\mathbf{x}}_{i}^{k+1}=\sum_{j=1}^{m}\mathbf{W}_{ij}\hat{\mathbf{x}}_{j}^{k}-\alpha(\nabla f_{i}(\hat{\mathbf{x}}_{i}^{k})+\xi_{i}^{k})$;
     end for
end for
Algorithm 1 Noisy Distributed Gradient Descent (NDGD)
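Algorithm 1 can be sketched as follows on an illustrative non-convex problem; the objective, network, noise radius and step-size below are assumptions for the sketch, not the paper's numerical example. Each local objective is $f_{i}(\mathbf{x})=(0.5\,\mathbf{x}^{T}H\mathbf{x}+0.25\|\mathbf{x}\|^{4})/m$ with $H=\mathrm{diag}(1,-1)$, so $f=\sum_{i}f_{i}$ has a saddle point at the origin and local minimizers at $(0,\pm 1)$.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (NDGD) with uniform-on-sphere perturbations.
rng = np.random.default_rng(1)
m, n, alpha, r = 3, 2, 0.05, 0.01
H = np.diag([1.0, -1.0])

def grad_fi(x):                 # gradient of each (identical) local objective
    return (H @ x + np.dot(x, x) * x) / m

def sphere_noise():             # perturbation sampled uniformly on radius-r sphere
    v = rng.standard_normal(n)
    return r * v / np.linalg.norm(v)

W = 0.7 * np.eye(m) + 0.3 * np.ones((m, m)) / m   # doubly stochastic mixing

x = np.zeros((m, n))            # initialize every agent AT the saddle point
for k in range(3000):
    g = np.array([grad_fi(x[i]) + sphere_noise() for i in range(m)])
    x = W @ x - alpha * g       # NDGD update from Algorithm 1

# Unperturbed DGD started at the saddle would stay there (zero gradient);
# the perturbations push the iterates off it, toward (0, +1) or (0, -1).
print(x)
```

Starting exactly at the saddle, the unperturbed update (2) is stationary, so this run also illustrates why the perturbations are the active escape mechanism.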

Recall that $\mathcal{X}^{*}_{f}$ and $\hat{\mathcal{X}}^{*}_{Q_{\alpha}}$ denote the sets of local minimizers of $f$ and $Q_{\alpha}$ respectively, as per (9). Given $\epsilon>0$, $\gamma>0$, $\mu>0$, $\delta>0$, and $\alpha>0$, define

\mathcal{L}_{\alpha,\epsilon}^{1}:=\{\hat{\mathbf{x}}: \|\nabla F(\hat{\mathbf{x}})+\alpha^{-1}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}\|\geq\epsilon\}, \quad \mathcal{L}_{\alpha,\gamma}^{2}:=\{\hat{\mathbf{x}}: \lambda_{\min}(\nabla^{2}F(\hat{\mathbf{x}}))\leq-\gamma-\alpha^{-1}\}, \quad \mathcal{L}_{\alpha,\mu,\delta}^{3}:=\{\hat{\mathbf{x}}: \lambda_{\min}(\nabla^{2}F(\hat{\mathbf{x}}))\geq\mu,~\textup{dist}(\hat{\mathbf{x}},\hat{\mathcal{X}}^{\prime})\leq\delta\}, \qquad (11)

where $\hat{\mathcal{X}}^{\prime}:=\{\hat{\mathbf{x}}:\|\nabla F(\hat{\mathbf{x}})+\alpha^{-1}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}\|=0,~\lambda_{\min}(\nabla^{2}F(\hat{\mathbf{x}}))>0\}$. The next theorem, which establishes second-order properties of the NDGD algorithm, is the main result of the paper; a proof is given in Section IV. Note that this result focuses on the dependency on the given step-size $\alpha$ and confidence parameter $\zeta$, hiding factors that depend polynomially on all other parameters (including $\Delta_{1}$, $\epsilon$, $\gamma$, $\mu$, $\delta$ and $\sigma$).

Theorem 3.1.

Suppose Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5 hold. Given $\Delta_{1}>0$ and $0<\zeta<1$, also suppose the following:

  1.

    There exist $\epsilon>0$, $\gamma\in(0,L_{F}^{g}]$, $\mu\in(0,L_{F}^{g}]$, $\delta>0$ and $\alpha\in(0,\hat{\alpha}(\Delta_{1},\zeta)]$ such that

    \mathcal{L}_{\alpha,\epsilon}^{1}\cup\mathcal{L}_{\alpha,\gamma}^{2}\cup\mathcal{L}_{\alpha,\mu,\delta}^{3}=(\mathbb{R}^{n})^{m},

    where $\hat{\alpha}(\Delta_{1},\zeta):=$

    \min\Big\{\bar{\alpha}(\Delta_{1}),~\frac{\sqrt{2}-1}{L_{F}^{g}},~\frac{\lambda_{\min}(\mathbf{W})}{L_{F}^{g}\cdot\max\{1,\log(\zeta^{-1})\}}\Big\}>0

    with $\bar{\alpha}(\Delta_{1})$ as per Theorem 2.1;

  2.

    the random perturbations $\xi_{i}^{k}$ are i.i.d. and zero mean with variance

    \sigma^{2}\leq\sigma_{\max}^{2}(\epsilon):=\frac{\lambda_{\min}(\mathbf{W})\,\epsilon^{2}}{mn};
  3.

    the generated sequence $\{Q_{\alpha}(\hat{\mathbf{x}}^{k})\}$ is bounded.

Then, with probability at least $1-\zeta$, after $K(\alpha,\zeta)=\mathcal{O}(\alpha^{-2}\log\zeta^{-1})$ iterations, Algorithm 1 reaches a point $\hat{\mathbf{x}}^{K(\alpha,\zeta)}\in(\mathbb{R}^{n})^{m}$ that is $\Delta_{2}(\alpha,\zeta)$-close to $\hat{\mathcal{X}}_{Q_{\alpha}}^{*}$, where $\Delta_{2}(\alpha,\zeta)=\mathcal{O}(\sqrt{\alpha\log(\alpha^{-1}\zeta^{-1})})$. Moreover, the nearest local minimizer $\hat{\mathbf{x}}^{*}\in\arg\min_{\hat{\mathbf{x}}\in\hat{\mathcal{X}}_{Q_{\alpha}}^{*}}\|\hat{\mathbf{x}}^{K(\alpha,\zeta)}-\hat{\mathbf{x}}\|$ is such that each component $\hat{\mathbf{x}}_{i}^{*}$ is $\Delta_{1}$-close to $\mathcal{X}_{f}^{*}$, whereby for all $i\in\mathcal{V}$,

dist(𝐱^iK(α,ζ),𝒳f)Δ1+Δ2(α,ζ).\displaystyle\mathrm{dist}(\hat{\mathbf{x}}_{i}^{K(\alpha,\zeta)},\mathcal{X}_{f}^{*})\leq\Delta_{1}+\Delta_{2}(\alpha,\zeta).

Remark 1.

Intuitively, condition 1) in Theorem 3.1 requires that every point at which the gradient of $Q_{\alpha}$ is small either admits a sufficient descent direction or lies within a neighborhood of a local minimizer of $Q_{\alpha}$ on which $Q_{\alpha}$ is locally strongly convex.

Remark 2.

For condition 2) in Theorem 3.1, one way to generate such i.i.d. noise is to sample $\xi_{i}^{k}$ uniformly from an $n$-dimensional sphere of radius $r$. This ensures $\mathbb{E}(\xi_{i}^{k})=\bm{0}$, $\mathbb{E}(\xi_{i}^{k}(\xi_{i}^{k})^{T})=(r^{2}/n)\mathbf{I}_{n}$, and $\|\xi_{i}^{k}\|\leq r$ for all $i\in\mathcal{V}$ and $k\in\mathbb{N}$. By choosing $r^{2}\leq n\sigma_{\max}^{2}(\epsilon)$, we have $\mathbb{E}(\xi^{k}(\xi^{k})^{T})=(r^{2}/n)\mathbf{I}_{mn}\preceq\sigma_{\max}^{2}(\epsilon)\mathbf{I}_{mn}$.
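The sampling scheme of Remark 2 can be checked empirically. The sketch below (not part of the paper; the values of $n$, $r$ and the sample count are illustrative) draws points uniformly from a radius-$r$ sphere by normalizing Gaussian vectors, and verifies the stated moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, N = 4, 0.5, 200000          # illustrative dimension, radius, sample count

# Uniform samples on the radius-r sphere in R^n: normalize Gaussian draws.
g = rng.standard_normal((N, n))
samples = r * g / np.linalg.norm(g, axis=1, keepdims=True)

norms = np.linalg.norm(samples, axis=1)   # every sample has norm exactly r
mean = samples.mean(axis=0)               # empirically close to 0
cov = samples.T @ samples / N             # empirically close to (r^2 / n) * I_n
```

With $r^2 \leq n\sigma_{\max}^2(\epsilon)$, the empirical covariance $(r^2/n)\mathbf{I}_n$ is dominated by $\sigma_{\max}^2(\epsilon)\mathbf{I}_n$, as in the remark.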

Second-order guarantees of DGD have been studied in [19] and [23] based on the almost sure non-convergence to saddle points with random initialization. In this paper, we propose to use random perturbations to actively evade saddle points. The second-order guarantees of NDGD stated in Theorem 3.1 do not require any additional initialization conditions. Second-order guarantees of the stochastic variant of DGD have been studied in [24] and [25], although they only show convergence to an approximate second-order stationary point. Here, an upper bound is given for the distance between the iterate at each agent and the set of local minimizers after a sufficient number of iterations.

IV Proof of Theorem 3.1

First, we study the behavior of NDGD in three cases, in line with the development of the related result in [21] for centralized gradient descent: i) $\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})$ large in norm (see Lemma 4.1.1); ii) $\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}^{k}))$ sufficiently negative (see Lemma 4.1.2); and iii) $\hat{\mathbf{x}}^{k}$ in a neighborhood of a local minimizer of $Q_{\alpha}$ on which local strong convexity holds (see Lemma 4.1.3). Combining these results with Theorem 2.1, we then prove that, with probability at least $1-\zeta$, after $K(\alpha,\zeta)$ iterations NDGD yields a point whose component at each agent is $(\Delta_{1}+\Delta_{2}(\alpha,\zeta))$-close to some local minimizer of $f$.

IV-A Behavior of NDGD for three different cases

The following lemmas rely on the proofs of Lemma 16 and Lemma 17 in [21]. Given ϵ>0\epsilon>0, γ>0\gamma>0, μ>0\mu>0, δ>0\delta>0, and α>0\alpha>0, define

α,ϵ1:={𝐱^:Qα(𝐱^)ϵ},α,γ2:={𝐱^:Λα(𝐱^)γ},α,μ,δ3:={𝐱^:Λα(𝐱^)μ,dist(𝐱^,𝒳^Qα)δ},\displaystyle\begin{aligned} \mathcal{I}_{\alpha,\epsilon}^{1}:=\{\hat{\mathbf{x}}:&~{}\|\nabla Q_{\alpha}(\hat{\mathbf{x}})\|\geq\epsilon\},\\ \mathcal{I}_{\alpha,\gamma}^{2}:=\{\hat{\mathbf{x}}:&~{}\Lambda_{\alpha}(\hat{\mathbf{x}})\leq-\gamma\},\\ \mathcal{I}_{\alpha,\mu,\delta}^{3}:=\{\hat{\mathbf{x}}:&~{}\Lambda_{\alpha}(\hat{\mathbf{x}})\geq\mu,~{}\textup{dist}(\hat{\mathbf{x}},\hat{\mathcal{X}}^{*}_{Q_{\alpha}})\leq\delta\},\end{aligned} (12)

where Λα(𝐱^)=λmin(2Qα(𝐱^))\Lambda_{\alpha}(\hat{\mathbf{x}})=\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}})).

We first analyze the behavior of the NDGD algorithm in the case that 𝐱^kα,ϵ1\hat{\mathbf{x}}^{k}\in\mathcal{I}_{\alpha,\epsilon}^{1}. Intuitively, when the norm of Qα(𝐱^)\nabla Q_{\alpha}(\hat{\mathbf{x}}) is large enough, the expectation of the function value decreases by a certain amount after one iteration.

Lemma 4.1.1.

Let Assumption 2.2 hold. Given ϵ>0\epsilon>0, suppose the random perturbation ξik\xi^{k}_{i} in Algorithm 1 is i.i.d. and zero mean with variance σ2σmax2(ϵ):=(λmin(𝐖)ϵ2)/(mn)\sigma^{2}\leq\sigma_{\max}^{2}(\epsilon):=(\lambda_{\min}(\mathbf{W})\epsilon^{2})/(mn). Then given 0<α1/LFg0<\alpha\leq 1/L_{F}^{g}, for any 𝐱^k\hat{\mathbf{x}}^{k} such that Qα(𝐱^k)ϵ\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})\|\geq\epsilon, after one iteration,

𝔼[Qα(𝐱^k+1)|𝐱^k]Qα(𝐱^k)l1(α),\displaystyle\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{k+1})~{}|~{}\hat{\mathbf{x}}^{k}]-Q_{\alpha}(\hat{\mathbf{x}}^{k})\leq-l_{1}(\alpha),

where l1(α)=Ω(α)l_{1}(\alpha)=\Omega(\alpha).

Proof.

Since $\mathbf{W}$ is symmetric, doubly stochastic and strictly diagonally dominant, it is positive definite by the Gershgorin disk theorem. Given $\sigma^{2}\leq(\lambda_{\min}(\mathbf{W})\epsilon^{2})/(mn)$, note that $\sqrt{mn\sigma^{2}/\lambda_{\min}(\mathbf{W})}\leq\epsilon$. Since $Q_{\alpha}$ has $L_{Q_{\alpha}}^{g}$-Lipschitz continuous gradient, using Taylor's theorem gives

𝔼[Q\displaystyle\mathbb{E}[Q (𝐱^k+1)α|𝐱^k]Qα(𝐱^k){}_{\alpha}(\hat{\mathbf{x}}^{k+1})~{}|~{}\hat{\mathbf{x}}^{k}]-Q_{\alpha}(\hat{\mathbf{x}}^{k})
=\displaystyle= 𝔼[Qα(𝐱^k)T(𝐱^k+1𝐱^k)+(𝐱^k+1𝐱^k)T(01(1t)\displaystyle~{}\mathbb{E}[\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})^{T}(\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{k})+(\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{k})^{T}(\int^{1}_{0}(1-t)
2Qα(𝐱^k+t(𝐱^k+1𝐱^k))dt)(𝐱^k+1𝐱^k)|𝐱^k]\displaystyle~{}\quad\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}^{k}+t(\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{k}))\textup{d}t)(\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{k})~{}|~{}\hat{\mathbf{x}}^{k}]
\displaystyle\leq Qα(𝐱^k)T𝔼[𝐱^k+1𝐱^k|𝐱^k]\displaystyle~{}\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})^{T}\mathbb{E}[\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{k}~{}|~{}\hat{\mathbf{x}}^{k}]
+LQαg2𝔼[𝐱^k+1𝐱^k2|𝐱^k]\displaystyle~{}\quad+\frac{L_{Q_{\alpha}}^{g}}{2}~{}\mathbb{E}[\|\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{k}\|^{2}~{}|~{}\hat{\mathbf{x}}^{k}]
=\displaystyle= Qα(𝐱^k)T𝔼[α(Qα(𝐱^k)+ξk)|𝐱^k]\displaystyle~{}\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})^{T}\mathbb{E}[-\alpha(\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})+\xi^{k})~{}|~{}\hat{\mathbf{x}}^{k}]
+LQαg2𝔼[α(Qα(𝐱^k)+ξk)2|𝐱^k]\displaystyle~{}\quad+\frac{L_{Q_{\alpha}}^{g}}{2}~{}\mathbb{E}[\|-\alpha(\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})+\xi^{k})\|^{2}~{}|~{}\hat{\mathbf{x}}^{k}]
=\displaystyle= (α+LQαg2α2)Qα(𝐱^k)2+LQαg2α2mnσ2\displaystyle~{}(-\alpha+\frac{L_{Q_{\alpha}}^{g}}{2}\alpha^{2})\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})\|^{2}+\frac{L_{Q_{\alpha}}^{g}}{2}\alpha^{2}mn\sigma^{2}
=\displaystyle= (1+λmin(𝐖)2α+LFg2α2)Qα(𝐱^k)2\displaystyle~{}(-\frac{1+\lambda_{\min}(\mathbf{W})}{2}\alpha+\frac{L_{F}^{g}}{2}\alpha^{2})\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})\|^{2}
+(1λmin(𝐖)2α+LFg2α2)mnσ2.\displaystyle~{}\quad+(\frac{1-\lambda_{\min}(\mathbf{W})}{2}\alpha+\frac{L_{F}^{g}}{2}\alpha^{2})mn\sigma^{2}.

As such, choosing 0<α1/LFg0<\alpha\leq 1/L_{F}^{g} gives

𝔼[Q\displaystyle\mathbb{E}[Q (𝐱^k+1)α|𝐱^k]Qα(𝐱^k){}_{\alpha}(\hat{\mathbf{x}}^{k+1})~{}|~{}\hat{\mathbf{x}}^{k}]-Q_{\alpha}(\hat{\mathbf{x}}^{k})
\displaystyle\leq λmin(𝐖)2αQα(𝐱^k)2+2λmin(𝐖)2αmnσ2\displaystyle-\frac{\lambda_{\min}(\mathbf{W})}{2}\alpha\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})\|^{2}+\frac{2-\lambda_{\min}(\mathbf{W})}{2}\alpha mn\sigma^{2}
\displaystyle\leq λmin(𝐖)2αmnσ2\displaystyle-\frac{\lambda_{\min}(\mathbf{W})}{2}\alpha mn\sigma^{2}

as claimed.
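The structure of this descent estimate can be checked in closed form on a toy problem. The sketch below (not part of the paper's argument; it uses a generic $L$-smooth quadratic in place of $Q_{\alpha}$, with illustrative constants, and with isotropic zero-mean noise of variance $\sigma^2$ per coordinate) computes the exact expected one-step decrease of a noisy gradient step and compares it against the smoothness bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, sigma = 5, 0.4, 0.1            # illustrative; alpha <= 1/L below

# Random SPD matrix A with spectrum in [0.5, 2], so Q(x) = x'Ax/2 is L-smooth.
Qm, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(0.5, 2.0, d)
L = eigs.max()
A = Qm @ np.diag(eigs) @ Qm.T

x = np.ones(d)
g = A @ x                                # gradient of Q at x
# For x+ = x - alpha*(g + xi), with E[xi] = 0 and Cov(xi) = sigma^2 I:
# E[Q(x+)] - Q(x) = -alpha*||g||^2 + (alpha^2/2)*(g'Ag + sigma^2*tr(A))   (exact)
exact = -alpha * (g @ g) + 0.5 * alpha**2 * (g @ A @ g + sigma**2 * np.trace(A))
# Smoothness bound, mirroring the lemma's estimate:
bound = (-alpha + 0.5 * L * alpha**2) * (g @ g) + 0.5 * L * alpha**2 * d * sigma**2
```

When $\|g\|^2$ exceeds a noise-dependent threshold (here $2d\sigma^2$, the analogue of $\epsilon^2$), the exact expected change is strictly negative, matching the lemma's conclusion.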

Next, we analyze the behavior of the NDGD algorithm in the case that $\hat{\mathbf{x}}^{k}\in\mathcal{I}_{\alpha,\gamma}^{2}$. Intuitively, for $\hat{\mathbf{x}}^{k}$ with a small gradient and sufficiently negative $\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}^{k}))$, there exists an upper bound $T_{\max}(\alpha)$ such that the expected function value decreases by a certain amount after $T\leq T_{\max}(\alpha)$ iterations.

Lemma 4.1.2.

Let Assumptions 2.2, 2.3 hold. Let $\gamma\in(0,L_{F}^{g}]$. Given $\epsilon>0$, suppose the random perturbation $\xi^{k}_{i}$ in Algorithm 1 is i.i.d. and zero mean with variance $\sigma^{2}\leq\sigma_{\max}^{2}(\epsilon):=(\lambda_{\min}(\mathbf{W})\epsilon^{2})/(mn)$. Further, given $0<\alpha\leq(\sqrt{2}-1)/L_{F}^{g}$, suppose that the generated sequence $\{Q_{\alpha}(\hat{\mathbf{x}}^{k})\}$ is bounded. Then, for any $\hat{\mathbf{x}}^{k}$ with $\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})\|<\epsilon$ and $\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}^{k}))\leq-\gamma$, there exists a number of steps $T(\hat{\mathbf{x}}^{k})>0$ such that

𝔼[Qα(𝐱^k+T(𝐱^k))|𝐱^k]Qα(𝐱^k)l2(α),\displaystyle\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{k+T(\hat{\mathbf{x}}^{k})})~{}|~{}\hat{\mathbf{x}}^{k}]-Q_{\alpha}(\hat{\mathbf{x}}^{k})\leq-l_{2}(\alpha),

where l2(α)=Ω(α)l_{2}(\alpha)=\Omega(\alpha). The number of steps T(𝐱^k)T(\hat{\mathbf{x}}^{k}) has a fixed upper bound Tmax(α)T_{\max}(\alpha) that is independent of 𝐱^k\hat{\mathbf{x}}^{k}, i.e., T(𝐱^k)Tmax(α)=𝒪(α1)T(\hat{\mathbf{x}}^{k})\leq T_{\max}(\alpha)=\mathcal{O}(\alpha^{-1}) for all 𝐱^k\hat{\mathbf{x}}^{k}.

Proof.

Given $\sigma^{2}\leq(\lambda_{\min}(\mathbf{W})\epsilon^{2})/(mn)$, note that $\sqrt{mn\sigma^{2}/\lambda_{\min}(\mathbf{W})}\leq\epsilon$. Thus, choosing $0<\alpha\leq(\sqrt{2}-1)/L_{F}^{g}\leq(2mn)/L_{F}^{g}\leq(2mn)/\gamma$, the result holds as shown in the proof of Lemma 17 in [21].

Finally, we analyze the behavior of the algorithm in the case that 𝐱^kα,μ,δ3\hat{\mathbf{x}}^{k}\in\mathcal{I}_{\alpha,\mu,\delta}^{3}. Intuitively, when the iterate 𝐱^k\hat{\mathbf{x}}^{k} is close enough to a local minimizer, with high probability subsequent iterates do not leave the neighborhood.

Lemma 4.1.3.

Let Assumption 2.2 hold. Let $\mu\in(0,L_{F}^{g}]$. Given $\epsilon>0$, suppose the random perturbation $\xi^{k}_{i}$ in Algorithm 1 is i.i.d. and zero mean with variance $\sigma^{2}\leq\sigma_{\max}^{2}(\epsilon):=(\lambda_{\min}(\mathbf{W})\epsilon^{2})/(mn)$. Further, given $\delta>0$, $0<\alpha\leq\lambda_{\min}(\mathbf{W})/(L_{F}^{g}\cdot\max\{1,\log(\zeta^{-1})\})$ and a local minimizer $\hat{\mathbf{x}}^{*}\in\mathcal{X}_{Q_{\alpha}}^{*}$, suppose $\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}))\geq\mu$ for all $\hat{\mathbf{x}}$ such that $\|\hat{\mathbf{x}}-\hat{\mathbf{x}}^{*}\|<\delta$. Then, there exists $\delta_{1}(\alpha)=\mathcal{O}(\sqrt{\alpha})\in[0,\delta)$ such that, for any $\hat{\mathbf{x}}^{k}$ with $\|\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*}\|<\delta_{1}(\alpha)$, with probability at least $1-\zeta/2$,

𝐱^k+s𝐱^Δ2(α,ζ)\displaystyle\|\hat{\mathbf{x}}^{k+s}-\hat{\mathbf{x}}^{*}\|\leq\Delta_{2}(\alpha,\zeta)

for all sK(α,ζ)=𝒪(α2log(ζ1))s\leq K(\alpha,\zeta)=\mathcal{O}(\alpha^{-2}\log(\zeta^{-1})), where Δ2(α,ζ)=𝒪(αlog(α1ζ1))<δ\Delta_{2}(\alpha,\zeta)=\mathcal{O}(\sqrt{\alpha\log(\alpha^{-1}\zeta^{-1})})<\delta.

Proof.

Given $0<\alpha\leq\lambda_{\min}(\mathbf{W})/L_{F}^{g}$, note that $\alpha\leq 1/L_{Q_{\alpha}}^{g}$ because the eigenvalues of $\mathbf{W}$, which is symmetric, doubly stochastic, diagonally dominant, and thus positive definite, are contained in the interval $(0,1]$. Since $\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}))\geq\mu>0$ in the $\delta$-neighborhood of $\hat{\mathbf{x}}^{*}$, $Q_{\alpha}$ is locally strongly convex there with modulus $\mu$. Since $Q_{\alpha}$ also has a Lipschitz continuous gradient, by Theorem 2.1.12 of [26],

Qα(𝐱^)T(𝐱^𝐱^)LQαgμLQαg+μ𝐱^𝐱^2+1LQαg+μQα(𝐱^)Qα(𝐱^)2.\nabla Q_{\alpha}(\hat{\mathbf{x}})^{T}(\hat{\mathbf{x}}-\hat{\mathbf{x}}^{*})\geq\frac{L_{Q_{\alpha}}^{g}\mu}{L_{Q_{\alpha}}^{g}+\mu}\|\hat{\mathbf{x}}-\hat{\mathbf{x}}^{*}\|^{2}\\ +\frac{1}{L_{Q_{\alpha}}^{g}+\mu}\|\nabla Q_{\alpha}(\hat{\mathbf{x}})-\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})\|^{2}.

Therefore,

𝔼[\displaystyle\mathbb{E}[\| 𝐱^k+1𝐱^2|𝐱^k]\displaystyle\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{*}\|^{2}~{}|~{}\hat{\mathbf{x}}^{k}]
=\displaystyle= 𝔼[𝐱^kα(Qα(𝐱^k)+ξk)𝐱^2|𝐱^k]\displaystyle~{}\mathbb{E}[\|\hat{\mathbf{x}}^{k}-\alpha(\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})+\xi^{k})-\hat{\mathbf{x}}^{*}\|^{2}~{}|~{}\hat{\mathbf{x}}^{k}]
\displaystyle\leq (𝐱^k𝐱^)2+α2Qα(𝐱^k)Qα(𝐱^)2\displaystyle~{}\|(\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*})\|^{2}+\alpha^{2}\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})-\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})\|^{2}
2αQα(𝐱^k)T(𝐱^k𝐱^)+α2mnσ2\displaystyle~{}\quad-2\alpha\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})^{T}(\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*})+\alpha^{2}mn\sigma^{2}
\displaystyle\leq (𝐱^k𝐱^)2+α2Qα(𝐱^k)Qα(𝐱^)2\displaystyle~{}\|(\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*})\|^{2}+\alpha^{2}\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})-\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})\|^{2}
2α(1LQαg+μQα(𝐱^k)Qα(𝐱^)2\displaystyle~{}\quad-2\alpha(\frac{1}{L_{Q_{\alpha}}^{g}+\mu}\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})-\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})\|^{2}
+LQαgμLQαg+μ𝐱^k𝐱^2)+α2mnσ2\displaystyle~{}\quad+\frac{L_{Q_{\alpha}}^{g}\mu}{L_{Q_{\alpha}}^{g}+\mu}\|\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*}\|^{2})+\alpha^{2}mn\sigma^{2}
=\displaystyle\quad= (12αLQαgμLQαg+μ)𝐱^k𝐱^2+(α22α1LQαg+μ)\displaystyle~{}(1-2\alpha\frac{L_{Q_{\alpha}}^{g}\mu}{L_{Q_{\alpha}}^{g}+\mu})\|\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*}\|^{2}+(\alpha^{2}-2\alpha\frac{1}{L_{Q_{\alpha}}^{g}+\mu})
Qα(𝐱^k)Qα(𝐱^)2+α2mnσ2.\displaystyle~{}\quad\|\nabla Q_{\alpha}(\hat{\mathbf{x}}^{k})-\nabla Q_{\alpha}(\hat{\mathbf{x}}^{*})\|^{2}+\alpha^{2}mn\sigma^{2}.

Since α1/LQαg\alpha\leq 1/L_{Q_{\alpha}}^{g}, note that α2/(LQαg+μ)<0\alpha-2/(L_{Q_{\alpha}}^{g}+\mu)<0 with μLQαg\mu\leq L_{Q_{\alpha}}^{g}. Then,

\mathbb{E}[\|\hat{\mathbf{x}}^{k+1}-\hat{\mathbf{x}}^{*}\|^{2}~|~\hat{\mathbf{x}}^{k}]\leq\Big(1-\frac{2\alpha L_{Q_{\alpha}}^{g}\mu}{L_{Q_{\alpha}}^{g}+\mu}+\alpha^{2}\mu^{2}-\frac{2\alpha\mu^{2}}{L_{Q_{\alpha}}^{g}+\mu}\Big)\|\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*}\|^{2}+\alpha^{2}mn\sigma^{2}=(1-\alpha\mu)^{2}\|\hat{\mathbf{x}}^{k}-\hat{\mathbf{x}}^{*}\|^{2}+\alpha^{2}mn\sigma^{2}.

Thus, choosing α1/LQαg1/LFgμ1\alpha\leq 1/L_{Q_{\alpha}}^{g}\leq 1/L_{F}^{g}\leq\mu^{-1}, the result holds as shown in the proof of Lemma 16 in [21].
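The contraction just derived can be checked exactly on a toy strongly convex quadratic (a sketch with assumed constants, standing in for $Q_{\alpha}$ near $\hat{\mathbf{x}}^{*}$; the noise is isotropic with per-coordinate variance $\sigma^2$, so $mn\sigma^2$ becomes $d\sigma^2$ here).

```python
import numpy as np

rng = np.random.default_rng(1)
d, mu, L, sigma = 6, 0.5, 4.0, 0.1
alpha = 1 / L                            # step-size satisfying alpha <= 1/L

# Random SPD A with spectrum in [mu, L]; Q(x) = x'Ax/2 has minimizer x* = 0.
Qm, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Qm @ np.diag(np.linspace(mu, L, d)) @ Qm.T

x = rng.standard_normal(d)
# For x+ = x - alpha*(Ax + xi), E[xi] = 0, Cov(xi) = sigma^2 I:
# E||x+ - x*||^2 = ||(I - alpha*A)x||^2 + alpha^2 * d * sigma^2    (exact)
lhs = np.linalg.norm((np.eye(d) - alpha * A) @ x)**2 + alpha**2 * d * sigma**2
# Contraction bound from the proof:
rhs = (1 - alpha * mu)**2 * np.linalg.norm(x)**2 + alpha**2 * d * sigma**2
```

Since the eigenvalues of $I-\alpha A$ lie in $[0, 1-\alpha\mu]$ when $\alpha \leq 1/L$, the exact expectation never exceeds the $(1-\alpha\mu)^2$ recursion, which is what drives the high-probability confinement in Lemma 4.1.3.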

IV-B Main proof

Proof.

The main proof comprises two steps: i) it is shown that the three sets defined in (12) cover $(\mathbb{R}^{n})^{m}$ with respect to $Q_{\alpha}$; ii) the bound on the expected decrease in $Q_{\alpha}$ is used to derive a lower bound on the probability that the $K(\alpha,\zeta)$-th iterate at each agent is close to a local minimizer of $f$.

Step 1. By the supposition in Theorem 3.1, given Δ1>0\Delta_{1}>0 and 0<ζ<10<\zeta<1, there exist ϵ>0\epsilon>0, 0<γLFg0<\gamma\leq L_{F}^{g}, 0<μLFg0<\mu\leq L_{F}^{g}, δ>0\delta>0, and

0<αmin{α¯(Δ1),21LFg,λmin(𝐖)LFgmax{1,log(ζ1)}},\displaystyle 0<\alpha\leq\min\{\bar{\alpha}(\Delta_{1}),\frac{\sqrt{2}-1}{L_{F}^{g}},\frac{\lambda_{\min}(\mathbf{W})}{L_{F}^{g}\cdot\max\{1,\log(\zeta^{-1})\}}\},

such that $\mathcal{L}_{\alpha,\epsilon}^{1}\cup\mathcal{L}_{\alpha,\gamma}^{2}\cup\mathcal{L}_{\alpha,\mu,\delta}^{3}=(\mathbb{R}^{n})^{m}$, with respect to (11), and thus $\mathcal{L}_{\alpha,\mu,\delta}^{3}\supseteq(\mathcal{L}_{\alpha,\epsilon}^{1}\cup\mathcal{L}_{\alpha,\gamma}^{2})^{c}$, where the superscript $c$ denotes the set complement. If $\hat{\mathbf{x}}\in\mathcal{L}_{\alpha,\epsilon}^{1}$,

Qα(𝐱^)\displaystyle\|\nabla Q_{\alpha}(\hat{\mathbf{x}})\| =F(𝐱^)+1α(𝐈mn𝐖^)𝐱^ϵ;\displaystyle=\|\nabla F(\hat{\mathbf{x}})+\frac{1}{\alpha}(\mathbf{I}_{mn}-\hat{\mathbf{W}})\hat{\mathbf{x}}\|\geq\epsilon;

if 𝐱^α,γ2\hat{\mathbf{x}}\in\mathcal{L}_{\alpha,\gamma}^{2}, then by Weyl’s inequality,

λmin(2Qα(𝐱^))=λmin(2F(𝐱^)+1α(𝐈mn𝐖^))γ;\displaystyle\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}))=\lambda_{\min}(\nabla^{2}F(\hat{\mathbf{x}})+\frac{1}{\alpha}(\mathbf{I}_{mn}-\hat{\mathbf{W}}))\leq-\gamma;

if 𝐱^α,μ,δ3\hat{\mathbf{x}}\in\mathcal{L}_{\alpha,\mu,\delta}^{3}, then again by Weyl’s inequality,

λmin(2Qα(𝐱^))=λmin(2F(𝐱^)+1α(𝐈mn𝐖^))μ,\displaystyle\lambda_{\min}(\nabla^{2}Q_{\alpha}(\hat{\mathbf{x}}))=\lambda_{\min}(\nabla^{2}F(\hat{\mathbf{x}})+\frac{1}{\alpha}(\mathbf{I}_{mn}-\hat{\mathbf{W}}))\geq\mu,

and dist(𝐱^,𝒳^Qα)δ\textup{dist}(\hat{\mathbf{x}},\hat{\mathcal{X}}^{*}_{Q_{\alpha}})\leq\delta. Therefore, α,ϵ1α,ϵ1\mathcal{L}_{\alpha,\epsilon}^{1}\subseteq\mathcal{I}_{\alpha,\epsilon}^{1}, α,γ2α,γ2\mathcal{L}_{\alpha,\gamma}^{2}\subseteq\mathcal{I}_{\alpha,\gamma}^{2}, α,μ,δ3α,μ,δ3\mathcal{L}_{\alpha,\mu,\delta}^{3}\subseteq\mathcal{I}_{\alpha,\mu,\delta}^{3}, whereby

α,ϵ1α,γ2α,μ,δ3=(n)m,α,μ,δ3(α,ϵ1α,γ2)c.\displaystyle\mathcal{I}_{\alpha,\epsilon}^{1}\cup\mathcal{I}_{\alpha,\gamma}^{2}\cup\mathcal{I}_{\alpha,\mu,\delta}^{3}=(\mathbb{R}^{n})^{m},~{}\mathcal{I}_{\alpha,\mu,\delta}^{3}\supseteq(\mathcal{I}_{\alpha,\epsilon}^{1}\cup\mathcal{I}_{\alpha,\gamma}^{2})^{c}.

Step 2. Define the stochastic process $\{\kappa_{i}\}\subset\mathbb{N}$ by

κi:={0,i=0κi1+1,𝐱^κi1α,ϵ1α,μ,δ3κi1+T(𝐱^κi1),𝐱^κi1α,γ2,\displaystyle\kappa_{i}:=\begin{cases}0,&i=0\\ \kappa_{i-1}+1,&\hat{\mathbf{x}}^{\kappa_{i-1}}\in\mathcal{I}_{\alpha,\epsilon}^{1}\cup\mathcal{I}_{\alpha,\mu,\delta}^{3}\\ \kappa_{i-1}+T(\hat{\mathbf{x}}^{\kappa_{i-1}}),&\hat{\mathbf{x}}^{\kappa_{i-1}}\in\mathcal{I}_{\alpha,\gamma}^{2}\end{cases}, (13)

where $T(\hat{\mathbf{x}})\leq T_{\max}(\alpha)=\mathcal{O}(\alpha^{-1})$ for all $\hat{\mathbf{x}}$ as per Lemma 4.1.2. By Lemma 4.1.1 and Lemma 4.1.2, $Q_{\alpha}$ decreases in expectation by a certain amount after a certain number of iterations for $\hat{\mathbf{x}}\in\mathcal{I}_{\alpha,\epsilon}^{1}$ and $\hat{\mathbf{x}}\in\mathcal{I}_{\alpha,\gamma}^{2}$, respectively, as follows:

𝔼[Qα(𝐱^κi+1)Qα(𝐱^κi)|𝐱^κiα,ϵ1]l1(α),𝔼[Qα(𝐱^κi+1)Qα(𝐱^κi)|𝐱^κiα,γ2]l2(α),\displaystyle\begin{aligned} \mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})-Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})~{}|~{}\hat{\mathbf{x}}^{\kappa_{i}}\in\mathcal{I}_{\alpha,\epsilon}^{1}]\leq-l_{1}(\alpha),\\ \mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})-Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})~{}|~{}\hat{\mathbf{x}}^{\kappa_{i}}\in\mathcal{I}_{\alpha,\gamma}^{2}]\leq-l_{2}(\alpha),\end{aligned} (14)

where l1(α)=Ω(α)l_{1}(\alpha)=\Omega(\alpha) and l2(α)=Ω(α)l_{2}(\alpha)=\Omega(\alpha) are defined in Lemma 4.1.1 and Lemma 4.1.2.

Defining the event $\mathcal{E}_{i}:=\{(\exists j\leq i)~\hat{\mathbf{x}}^{\kappa_{j}}\in\mathcal{I}_{\alpha,\mu,\delta}^{3}\}$, by the law of total expectation,

𝔼[Qα(𝐱^κi+1)Qα(𝐱^κi)]=𝔼[Qα(𝐱^κi+1)Qα(𝐱^κi)|i][i]+𝔼[Qα(𝐱^κi+1)Qα(𝐱^κi)|ic][ic].\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})-Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})]\\ =\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})-Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})~{}|~{}\mathcal{E}_{i}]\cdot\mathbb{P}[\mathcal{E}_{i}]\\ +\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})-Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})~{}|~{}\mathcal{E}_{i}^{c}]\cdot\mathbb{P}[\mathcal{E}_{i}^{c}].

Combining (13) and (14) gives

𝔼[Qα(𝐱^κi+1)Qα(𝐱^κi)|ic]l(α)Δκi,\displaystyle\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})-Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})~{}|~{}\mathcal{E}_{i}^{c}]\leq-l(\alpha)\cdot\Delta\kappa_{i},

where l(α)=min{l1(α),l2(α)/Tmax(α)}=Ω(α2)l(\alpha)=\min\{l_{1}(\alpha),l_{2}(\alpha)/T_{\max}(\alpha)\}=\Omega(\alpha^{2}) and Δκi=κi+1κi\Delta\kappa_{i}=\kappa_{i+1}-\kappa_{i}. Since [i1][i]\mathbb{P}[\mathcal{E}_{i-1}]\leq\mathbb{P}[\mathcal{E}_{i}], we obtain

𝔼[Qα(𝐱^κi+1)]𝔼[Qα(𝐱^κi)]𝔼[Qα(𝐱^κi)|i]([i][i1])l(α)Δκi.\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})]-\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})]\\ \leq\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})~{}|~{}\mathcal{E}_{i}]\cdot(\mathbb{P}[\mathcal{E}_{i}]-\mathbb{P}[\mathcal{E}_{i-1}])-l(\alpha)\cdot\Delta\kappa_{i}.

Since the generated sequence $\{Q_{\alpha}(\hat{\mathbf{x}}^{k})\}$ is assumed bounded, there exists $b>0$ such that $|Q_{\alpha}(\hat{\mathbf{x}}^{k})|\leq b$ for all $k=0,1,\cdots$. As such,

𝔼[Qα(𝐱^κi+1)]𝔼[Qα(𝐱^κi)]b([i][i1])l(α)Δκi[ic].\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i+1}})]-\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})]\\ \leq b\cdot(\mathbb{P}[\mathcal{E}_{i}]-\mathbb{P}[\mathcal{E}_{i-1}])-l(\alpha)\cdot\Delta\kappa_{i}\cdot\mathbb{P}[\mathcal{E}_{i}^{c}].

Summing both sides of the inequality over ii gives

𝔼[Qα(𝐱^κi)]𝔼[Qα(𝐱^κ1)]b([i1][0])l(α)(κiκ1)[ic].\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{i}})]-\mathbb{E}[Q_{\alpha}(\hat{\mathbf{x}}^{\kappa_{1}})]\\ \leq b\cdot(\mathbb{P}[\mathcal{E}_{i-1}]-\mathbb{P}[\mathcal{E}_{0}])-l(\alpha)\cdot(\kappa_{i}-\kappa_{1})\cdot\mathbb{P}[\mathcal{E}_{i}^{c}].

Since $|Q_{\alpha}(\hat{\mathbf{x}}^{k})|\leq b$ for all $k=0,1,\cdots$, it follows that $-2b\leq b-l(\alpha)\cdot(\kappa_{i}-\kappa_{1})\cdot\mathbb{P}[\mathcal{E}_{i}^{c}]$, which leads to the following upper bound on the probability of the event $\mathcal{E}_{i}^{c}$:

[ic]3bl(α)(κiκ1).\displaystyle\mathbb{P}[\mathcal{E}_{i}^{c}]\leq\frac{3b}{l(\alpha)(\kappa_{i}-\kappa_{1})}.

Therefore, if $\kappa_{i}-\kappa_{1}$ grows larger than $6b/l(\alpha)$, then $\mathbb{P}[\mathcal{E}_{i}^{c}]\leq 1/2$. Since $\kappa_{1}\leq T_{\max}(\alpha)=\mathcal{O}(\alpha^{-1})$, after $K^{\prime}(\alpha)=6b/l(\alpha)+T_{\max}(\alpha)=\mathcal{O}(\alpha^{-2})$ steps, $\{\hat{\mathbf{x}}^{k}\}$ must enter $\mathcal{I}_{\alpha,\mu,\delta}^{3}$ at least once with probability at least $1/2$. Therefore, by repeating this argument $\log\zeta^{-1}$ times, the probability of $\{\hat{\mathbf{x}}^{k}\}$ entering $\mathcal{I}_{\alpha,\mu,\delta}^{3}$ at least once is lower bounded:

\mathbb{P}[\{(\exists k\leq K(\alpha,\zeta))~\hat{\mathbf{x}}^{k}\in\mathcal{I}_{\alpha,\mu,\delta}^{3}\}]\geq 1-\frac{\zeta}{2},

where $K(\alpha,\zeta)=\mathcal{O}(\alpha^{-2}\log\zeta^{-1})$. Combining this with Lemma 4.1.3, we have that, after $K(\alpha,\zeta)$ iterations, Algorithm 1 produces a point that is $\Delta_{2}(\alpha,\zeta)$-close to $\hat{\mathcal{X}}^{*}_{Q_{\alpha}}$ with probability at least $1-\zeta$, where $\Delta_{2}(\alpha,\zeta)=\mathcal{O}(\sqrt{\alpha\log(\alpha^{-1}\zeta^{-1})})$. For the given $\Delta_{1}>0$, since $\alpha\leq\bar{\alpha}(\Delta_{1})$ satisfies the requirements of Theorem 2.1, $\hat{\mathbf{x}}^{*}=\arg\min_{\hat{\mathbf{x}}\in\mathcal{X}_{Q_{\alpha}}^{*}}\|\hat{\mathbf{x}}^{K(\alpha,\zeta)}-\hat{\mathbf{x}}\|$ is such that each $\hat{\mathbf{x}}_{i}^{*}$ is $\Delta_{1}$-close to $\mathcal{X}_{f}^{*}$. To summarize, for all $i\in\mathcal{V}$,

dist(𝐱^iK(α,ζ),𝒳f)Δ1+Δ2(α,ζ)\displaystyle\mathrm{dist}(\hat{\mathbf{x}}_{i}^{K(\alpha,\zeta)},\mathcal{X}_{f}^{*})\leq\Delta_{1}+\Delta_{2}(\alpha,\zeta)

as claimed.
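The repetition step above is the standard probability-doubling trick. A small numerical check of the arithmetic (illustrative, not from the paper: it assumes each block of $K^{\prime}(\alpha)$ iterations succeeds with probability at least $1/2$ independently of the past):

```python
import math

# If each block of K'(alpha) iterations enters the target set with probability
# at least 1/2, then after ceil(log2(1/zeta)) + 1 blocks the probability of
# never entering is at most zeta/2, consistent with K(alpha, zeta) growing
# proportionally to log(1/zeta).
for zeta in (0.1, 0.01, 0.001):
    blocks = math.ceil(math.log2(1 / zeta)) + 1
    fail = 0.5 ** blocks
    assert fail <= zeta / 2
```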

V Numerical Example

Consider the following non-convex optimization problem over 𝐱=(x1,x2)\mathbf{x}=(x_{1},x_{2}):

min𝐱2f(𝐱)=min𝐱2i=15fi(𝐱)=x14x12+x24+x22,\displaystyle\min_{\mathbf{x}\in\mathbb{R}^{2}}f({\mathbf{x}})=\min_{\mathbf{x}\in\mathbb{R}^{2}}\sum_{i=1}^{5}f_{i}(\mathbf{x})=x_{1}^{4}-x_{1}^{2}+x_{2}^{4}+x_{2}^{2},

where f1(𝐱)=0.25x14x12x22f_{1}(\mathbf{x})=0.25x_{1}^{4}-x_{1}^{2}-x_{2}^{2}, f2(𝐱)=0.25x14+0.5x24+1.5x22f_{2}(\mathbf{x})=0.25x_{1}^{4}+0.5x_{2}^{4}+1.5x_{2}^{2}, f3(𝐱)=x12+x22f_{3}(\mathbf{x})=-x_{1}^{2}+x_{2}^{2}, f4(𝐱)=0.5x140.5x22f_{4}(\mathbf{x})=0.5x_{1}^{4}-0.5x_{2}^{2}, and f5(𝐱)=x12+0.5x24f_{5}(\mathbf{x})=x_{1}^{2}+0.5x_{2}^{4}. The mixing matrix is taken to be

𝐖=[0.600.200.200.600.20.20.200.60.2000.20.20.600.20.2000.6].\displaystyle\mathbf{W}=\begin{bmatrix}0.6&0&0.2&0&0.2\\ 0&0.6&0&0.2&0.2\\ 0.2&0&0.6&0.2&0\\ 0&0.2&0.2&0.6&0\\ 0.2&0.2&0&0&0.6\end{bmatrix}.

It can be verified that 𝐱=(x1,x2)=(0,0)\mathbf{x}=(x_{1},x_{2})=(0,0) is a saddle point of ff, and that 𝐱=(x1,x2)=(22,0)\mathbf{x}=(x_{1},x_{2})=(-\frac{\sqrt{2}}{2},0) and 𝐱=(x1,x2)=(22,0)\mathbf{x}=(x_{1},x_{2})=(\frac{\sqrt{2}}{2},0) are two local minimizers. We compare the performance of DGD and NDGD with constant step-size α=0.005\alpha=0.005, both initialized from 𝐱0=(x10,x20)=(106,106)\mathbf{x}^{0}=(x_{1}^{0},x_{2}^{0})=(10^{-6},10^{-6}) (i.e., close to a saddle point).
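The experiment can be reproduced with a short script. The sketch below implements the DGD/NDGD update $x_{i}^{k+1}=\sum_{j}W_{ij}x_{j}^{k}-\alpha(\nabla f_{i}(x_{i}^{k})+\xi_{i}^{k})$ for this example; the noise radius $r$, iteration count, and random seed are illustrative choices not specified in the paper.

```python
import numpy as np

# Gradients of the local objectives f_1, ..., f_5 from the example.
def grads(X):
    x1, x2 = X[:, 0], X[:, 1]
    G = np.empty_like(X)
    G[0] = [x1[0]**3 - 2*x1[0], -2*x2[0]]    # f1 = 0.25x1^4 - x1^2 - x2^2
    G[1] = [x1[1]**3, 2*x2[1]**3 + 3*x2[1]]  # f2 = 0.25x1^4 + 0.5x2^4 + 1.5x2^2
    G[2] = [-2*x1[2], 2*x2[2]]               # f3 = -x1^2 + x2^2
    G[3] = [2*x1[3]**3, -x2[3]]              # f4 = 0.5x1^4 - 0.5x2^2
    G[4] = [2*x1[4], 2*x2[4]**3]             # f5 = x1^2 + 0.5x2^4
    return G

W = np.array([[0.6, 0.0, 0.2, 0.0, 0.2],
              [0.0, 0.6, 0.0, 0.2, 0.2],
              [0.2, 0.0, 0.6, 0.2, 0.0],
              [0.0, 0.2, 0.2, 0.6, 0.0],
              [0.2, 0.2, 0.0, 0.0, 0.6]])

def run(noisy, alpha=0.005, r=0.05, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    X = np.full((5, 2), 1e-6)                # all agents start near the saddle
    for _ in range(iters):
        xi = np.zeros_like(X)
        if noisy:                            # uniform on the radius-r sphere
            g = rng.standard_normal(X.shape)
            xi = r * g / np.linalg.norm(g, axis=1, keepdims=True)
        X = W @ X - alpha * (grads(X) + xi)
    return X.mean(axis=0)                    # network-average iterate

dgd, ndgd = run(False), run(True)
# Both averages should end near a local minimizer (+-sqrt(2)/2, 0) of f;
# the perturbed run typically leaves the saddle's vicinity in fewer iterations.
```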

Figure 1: Iterations of $\mathbf{x}_{1}$: (a) $\mathbf{x}_{1}^{k}$ under DGD; (b) $\mathbf{x}_{1}^{k}$ under NDGD.

From Figure 1(a), although not trapped forever, DGD takes about 6000 iterations to escape the vicinity of the saddle point and converge to the neighborhood of a local minimizer. From Figure 1(b), NDGD escapes the vicinity of the saddle point in about 2000 iterations and converges to the neighborhood of a local minimizer. The effectiveness of NDGD over DGD is evident in this example.

VI Conclusion

A fixed step-size noisy distributed gradient descent (NDGD) algorithm is formulated for solving optimization problems in which the objective is a finite sum of smooth but possibly non-convex functions. Random perturbations are added to the gradient descent directions at each step to actively evade saddle points. Under certain regularity conditions, and with a suitable step-size, each agent converges (in probability with specified confidence) to a neighborhood of a local minimizer. In particular, we determine a probabilistic upper bound on the distance between the iterate at each agent and the set of local minimizers, after a sufficient number of iterations.

The potential applications of the NDGD algorithm are vast and varied, including multi-agent systems control, federated learning, and location estimation in sensor networks, particularly in large-scale network scenarios. Further exploration of different approaches to introducing random perturbations, and analysis of convergence rates, can be pursued in future work.

References

  • [1] X. Dong, Y. Hua, Y. Zhou, Z. Ren, and Y. Zhong, “Theory and experiment on formation-containment control of multiple multirotor unmanned aerial vehicle systems,” IEEE Transactions on Automation Science and Engineering, vol. 16, no. 1, pp. 229–240, 2018.
  • [2] Z. Qiu, G. Deconinck, and R. Belmans, “A literature survey of optimal power flow problems in the electricity market context,” in 2009 IEEE/PES Power Systems Conference and Exposition.   IEEE, 2009, pp. 1–6.
  • [3] W. Gao, J. Gao, K. Ozbay, and Z.-P. Jiang, “Reinforcement-learning-based cooperative adaptive cruise control of buses in the Lincoln Tunnel corridor with time-varying topology,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3796–3805, 2019.
  • [4] A. Nedić and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.
  • [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [6] H. Terelius, U. Topcu, and R. M. Murray, “Decentralized multi-agent optimization via dual decomposition,” IFAC proceedings volumes, vol. 44, no. 1, pp. 11 245–11 251, 2011.
  • [7] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
  • [8] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
  • [9] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [10] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
  • [11] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
  • [12] K. I. Tsianos and M. G. Rabbat, “Distributed strongly convex optimization,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2012, pp. 593–600.
  • [13] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
  • [14] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient method,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2012, pp. 601–608.
  • [15] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [16] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [17] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically convergent distributed optimization with uncoordinated step-sizes,” in 2017 American Control Conference (ACC).   IEEE, 2017, pp. 3950–3955.
  • [18] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • [19] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” SIAM Journal on Optimization, vol. 30, no. 4, pp. 3029–3068, 2020.
  • [20] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos, “Gradient descent can take exponential time to escape saddle points,” Advances in neural information processing systems, vol. 30, 2017.
  • [21] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Conference on learning theory.   PMLR, 2015, pp. 797–842.
  • [22] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points efficiently,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1724–1732.
  • [23] B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 14 751–14 812, 2022.
  • [24] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments—part i: Agreement at a linear rate,” IEEE Transactions on Signal Processing, vol. 69, pp. 1242–1256, 2021.
  • [25] ——, “Distributed learning in non-convex environments—part ii: Polynomial escape from saddle-points,” IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021.
  • [26] Y. Nesterov, Introductory lectures on convex optimization: A basic course.   Springer Science & Business Media, 2003, vol. 87.