Second-order Properties of Noisy Distributed Gradient Descent
Abstract
We study a fixed step-size noisy distributed gradient descent algorithm for solving optimization problems in which the objective is a finite sum of smooth but possibly non-convex functions. Random perturbations are introduced to the gradient descent directions at each step to actively evade saddle points. Under certain regularity conditions, and with a suitable step-size, it is established that each agent converges to a neighborhood of a local minimizer and the size of the neighborhood depends on the step-size and the confidence parameter. A numerical example is presented to illustrate the effectiveness of the random perturbations in terms of escaping saddle points in fewer iterations than without the perturbations.
Index Terms:
Non-convex optimization; first-order methods; random perturbations; evading saddle points
I Introduction
We consider the optimization problem
$\min_{x \in \mathbb{R}^{d}} \; f(x) \triangleq \sum_{i=1}^{n} f_i(x) \qquad (1)$
where each $f_i : \mathbb{R}^{d} \to \mathbb{R}$ is smooth but possibly non-convex, and $x \in \mathbb{R}^{d}$ is the decision vector. The aim is to employ $n$ agents to iteratively solve the optimization problem in (1), over an undirected and connected network graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Each agent $i$ only knows the corresponding function $f_i$ and its gradient. The pair of agents $(i, j)$ is able to directly exchange information if and only if $(i, j) \in \mathcal{E}$. Collaborative distributed optimization over a network is of significant interest in the contexts of control, learning and estimation, particularly in large-scale system scenarios, such as unmanned vehicle systems [1], electric power systems [2], transit systems [3], and wireless sensor networks [4].
In the field of optimization, two primary classes of distributed methods can be identified: dual decomposition methods and consensus-based methods. Dual decomposition methods involve minimizing an augmented Lagrangian formulated on the basis of constraints that enforce agreement between agents, via iterative updates of the corresponding primal and dual variables [5]. The distributed dual decomposition algorithm in [6] involves agents alternating between updating their primal and dual variables and communicating with their neighbors. In [7], it is established that the distributed alternating direction method of multipliers (ADMM) exhibits linear convergence rates in strongly convex settings. Consensus-based methods can be traced back to the distributed computation models proposed in [8], which seek to eliminate agent disagreements through local iterate exchange and weighted averaging to achieve consensus. This idea underlies the distributed (sub)gradient methods proposed in [9] and [10] to solve problem (1) with all $f_i$ convex. In the case of a diminishing step-size, each agent converges to an optimizer [10]; with a constant step-size, convergence is typically faster, but only to the vicinity of an optimizer [9].
The focus in this paper is on consensus-based distributed (sub)gradient methods, owing to their simplicity as first-order methods, which enables easy adaptation to diverse settings. In particular, a fixed step-size Distributed Gradient Descent (DGD) algorithm is considered, as an instance of the distributed (sub)gradient methods. In unperturbed form, the update for each agent $i$ at iteration $k$ is given by
$x_i^{(k+1)} = \sum_{j=1}^{n} w_{ij}\, x_j^{(k)} - \alpha\, \nabla f_i\big(x_i^{(k)}\big) \qquad (2)$
where $\alpha > 0$ is the constant step-size, $\nabla f_i$ is the gradient of $f_i$, $x_i^{(k)}$ is the local copy of the decision vector at agent $i$, and $w_{ij}$ is the scalar entry in the $i$-th row and $j$-th column of a given mixing matrix $W$. The mixing matrix is consistent with the graph $\mathcal{G}$, in the sense that for all $i \neq j$, $w_{ij} > 0$ if $(i, j) \in \mathcal{E}$, and $w_{ij} = 0$ otherwise. The convergence rates of fixed step-size and diminishing step-size DGD algorithms in strongly convex settings are examined in [11] and [12], respectively. Several variants of DGD have been proposed in (strongly) convex settings. Nesterov momentum is used in the distributed gradient descent update with diminishing step-sizes in [13] to improve convergence rates. An inexact proximal-gradient method is considered in [14] for problems involving non-smooth functions. To reach exact consensus with a constant step-size, the gradient tracking method [15, 16, 17] is used to neutralize the gradient term in (2), since the gradient descent part cannot vanish by itself.
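To make the update (2) concrete, the following is a minimal sketch of one DGD iteration; the quadratic local objectives, the two-agent network, and the particular mixing matrix below are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def dgd_step(X, W, grads, alpha):
    """One fixed step-size DGD iteration as in (2).

    X     : (n, d) array; row i is agent i's local copy x_i^{(k)}.
    W     : (n, n) symmetric, doubly stochastic mixing matrix.
    grads : list of n callables; grads[i](x) returns the gradient of f_i at x.
    alpha : constant step-size.
    """
    mixed = W @ X                                              # weighted averaging with neighbors
    descent = np.stack([grads[i](X[i]) for i in range(X.shape[0])])
    return mixed - alpha * descent                             # x_i^{(k+1)} = sum_j w_ij x_j^{(k)} - alpha grad f_i(x_i^{(k)})

# Illustrative data: two agents with f_i(x) = 0.5 * ||x - c_i||^2.
c = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
grads = [lambda x, ci=ci: x - ci for ci in c]
W = np.array([[0.75, 0.25],
              [0.25, 0.75]])                                   # symmetric, doubly stochastic, diagonally dominant
X = np.zeros((2, 2))
for _ in range(200):
    X = dgd_step(X, W, grads, alpha=0.1)
print(X)  # both rows end up in a neighborhood of (0, 1), the minimizer of f_1 + f_2
```

With a constant step-size, the two local copies settle near, but not exactly at, the common minimizer, consistent with the vicinity-convergence behavior reported in [9].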
In non-convex settings, gradient descent methods may face issues due to saddle points, as converging to a first-order stationary point does not guarantee local minimality. While Hessian-based methods can avoid saddle points, their computational cost can be prohibitive for large-scale problems. In [18], the fixed step-size DGD algorithm is shown to retain the property of convergence to a neighborhood of a consensus stationary solution under some regularity assumptions in non-convex settings. It is also shown in [19] that DGD with a constant step-size converges almost surely to a neighborhood of second-order stationary solutions. However, this requires random initialization to avoid the zero-measure manifold of saddle-point attraction, and moreover, the underlying analysis does not support techniques for actively escaping saddle points.
Recently, it has been shown that standard (centralized) gradient descent methods can take exponential time to escape saddle points [20], while noise (random perturbations) has proven effective for escaping saddle points in non-convex optimization. The Noisy Gradient Descent algorithm in [21] and [22] is proven to escape saddle points efficiently while converging to a neighborhood of a local minimizer with high probability. However, all of these works are limited to centralized methods. In [23] it is shown that distributed stochastic gradient descent converges to local minima almost surely when diminishing step-sizes are used. With a constant step-size, the diffusion strategy with stochastic gradients in [24] and [25] only returns approximately second-order stationary points, rather than an outcome that lies in a neighborhood of a local minimizer of controllable size.
In this paper, the main contribution is that we analyze a fixed step-size noisy distributed gradient descent (NDGD) algorithm for solving the optimization problem in (1) by expanding upon and combining ideas from [21] and [22] on centralized stochastic gradient descent, and from [11, 18, 19] on distributed unperturbed gradient descent. In this combination, random perturbations are added to the gradient descent directions at each step to actively evade saddle points. It is established that under certain regularity conditions, and with a suitable step-size, each agent converges to a neighborhood of a local minimizer. In particular, we determine a probabilistic upper bound for the distance between the iterate at each agent and the set of local minimizers after a sufficient number of iterations. A numerical example is presented to illustrate the effectiveness of the algorithm in terms of escaping from the vicinity of a saddle point in fewer iterations than the standard (i.e., unperturbed) fixed step-size DGD method.
I-A Notation
Let $I_d$ denote the $d \times d$ identity matrix, $\mathbf{1}_n$ denote the $n$-vector with all entries equal to $1$, and $[A]_{ij}$ denote the entry in the $i$-th row and $j$-th column of the matrix $A$. For a square symmetric matrix $A$, we use $\lambda_{\min}(A)$, $\lambda_{\max}(A)$ and $\|A\|$ to denote its minimum eigenvalue, maximum eigenvalue and spectral norm, respectively. The Kronecker product is denoted by $\otimes$. The distance from the point $x$ to a given set $\mathcal{X}$ is denoted by $\mathrm{dist}(x, \mathcal{X})$. We say that a point $x$ is $\epsilon$-close to a point $y$ (resp., a set $\mathcal{X}$) if $\|x - y\| \le \epsilon$ (resp., $\mathrm{dist}(x, \mathcal{X}) \le \epsilon$). We use the Bachmann–Landau (asymptotic) notations, including $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$, to hide dependence on variables other than the step-size $\alpha$ and the confidence parameter $\delta$.
II Problem Setup and Supporting Results
In this section, we present a reformulation of the optimization problem defined in (1) and provide a list of assumptions used in subsequent analysis. Then, we briefly recall aspects of the fixed step-size DGD algorithm (2) and present some existing results for non-convex optimization problems. Next, some intermediate results are derived to establish certain properties of the local minimizers of $f$ defined in (1) (see Theorem 2.1), on the basis of a collection of supporting lemmas (see Lemmas 2.3.1–2.3.4).
II-A Problem Setup
Definition 2.1.
For a differentiable function $f : \mathbb{R}^{d} \to \mathbb{R}$, a point $x \in \mathbb{R}^{d}$ is said to be first-order stationary if $\nabla f(x) = 0$.
Definition 2.2.
For a twice differentiable function $f : \mathbb{R}^{d} \to \mathbb{R}$, a first-order stationary point $x$ is: (i) a local minimizer, if $\lambda_{\min}(\nabla^{2} f(x)) > 0$; (ii) a local maximizer, if $\lambda_{\max}(\nabla^{2} f(x)) < 0$; and (iii) a saddle point, if $\lambda_{\min}(\nabla^{2} f(x)) < 0$ and $\lambda_{\max}(\nabla^{2} f(x)) > 0$.
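As a concrete numerical illustration of Definition 2.2, a first-order stationary point can be classified from the eigenvalues of the Hessian; the function used below is a hypothetical example, not one drawn from this paper.

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a first-order stationary point from its Hessian H (Definition 2.2)."""
    eig = np.linalg.eigvalsh(H)
    if eig.min() > tol:
        return "local minimizer"
    if eig.max() < -tol:
        return "local maximizer"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "degenerate (second-order test inconclusive)"

# Hypothetical example: f(x, y) = x^2 - y^2 is stationary at the origin with Hessian diag(2, -2).
print(classify(np.diag([2.0, -2.0])))  # -> saddle point
```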
By introducing additional local variables, the optimization problem in (1) can be reformulated as
$\min_{x_1, \dots, x_n \in \mathbb{R}^{d}} \; F(\mathbf{x}) \triangleq \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i = x_j \ \text{ for all } (i, j) \in \mathcal{E} \qquad (5)$
where $x_i \in \mathbb{R}^{d}$ is the local copy of the decision vector at agent $i$, and $\mathbf{x} = [x_1^{\top}, \dots, x_n^{\top}]^{\top} \in \mathbb{R}^{nd}$.
Assumption 2.1 (Local regularity).
The function $f$ in (1) is such that for all first-order stationary points $x$, either $\lambda_{\min}(\nabla^{2} f(x)) > 0$ (i.e., $x$ is a local minimizer), or $\lambda_{\min}(\nabla^{2} f(x)) < 0$ (i.e., $x$ is a saddle point or a maximizer).
Assumption 2.2 (Lipschitz gradient).
Each objective $f_i$ has $L$-Lipschitz continuous gradient, i.e.,
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\, \|x - y\|$
for all $x, y \in \mathbb{R}^{d}$ and each $i \in \{1, \dots, n\}$.
Assumption 2.3 (Lipschitz Hessian).
Each objective $f_i$ has $\rho$-Lipschitz continuous Hessian, i.e.,
$\|\nabla^{2} f_i(x) - \nabla^{2} f_i(y)\| \le \rho\, \|x - y\|$
for all $x, y \in \mathbb{R}^{d}$ and each $i \in \{1, \dots, n\}$.
If Assumption 2.2 holds, then $F$ defined in (5) has $L$-Lipschitz continuous gradient, since its Hessian is block diagonal with blocks $\nabla^{2} f_i(x_i)$. Further, if Assumption 2.3 holds, then $F$ has $\rho$-Lipschitz continuous Hessian.
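A short calculation makes this explicit; it is a sketch under the stacked reformulation (5) as reconstructed above.

```latex
% Block-diagonal Hessian of the stacked objective F(x) = \sum_i f_i(x_i):
\[
  \nabla^2 F(\mathbf{x})
  = \operatorname{blkdiag}\!\big(\nabla^2 f_1(x_1), \dots, \nabla^2 f_n(x_n)\big),
\]
% so, taking spectral norms blockwise,
\[
  \|\nabla^2 F(\mathbf{x})\| = \max_i \|\nabla^2 f_i(x_i)\| \le L,
  \qquad
  \|\nabla^2 F(\mathbf{x}) - \nabla^2 F(\mathbf{y})\|
  = \max_i \|\nabla^2 f_i(x_i) - \nabla^2 f_i(y_i)\|
  \le \rho\, \|\mathbf{x} - \mathbf{y}\|.
\]
```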
Assumption 2.4 (Coercivity and properness).
Each local objective $f_i$ is coercive (i.e., its sublevel sets are compact) and proper (i.e., not everywhere infinite).
II-B Distributed Gradient Descent
Assumption 2.5 (Network).
The undirected graph is connected.
The DGD algorithm in (2), with constant step-size $\alpha$, can be written in matrix/vector form as
$\mathbf{x}^{(k+1)} = \mathbf{W}\, \mathbf{x}^{(k)} - \alpha\, \nabla F\big(\mathbf{x}^{(k)}\big) \qquad (6)$
where $\mathbf{x}^{(k)} = [x_1^{(k)\top}, \dots, x_n^{(k)\top}]^{\top}$ and $\mathbf{W} = W \otimes I_d$. Note that from this point on, the mixing matrix $W$ is taken to be symmetric, doubly stochastic and strictly diagonally dominant, i.e., $w_{ii} > \sum_{j \neq i} w_{ij}$ for all $i$. Thus, $W$ is positive definite by the Gershgorin circle theorem. As proposed in some early works, including [11, 18, 19], we can analyze the convergence properties using an auxiliary function. Let $G_{\alpha}$ denote the auxiliary function
$G_{\alpha}(\mathbf{x}) \triangleq F(\mathbf{x}) + \frac{1}{2\alpha}\, \mathbf{x}^{\top} \big(I_{nd} - \mathbf{W}\big)\, \mathbf{x} \qquad (7)$
consisting of the objective function in (5) and a quadratic penalty, which depends on the step-size and the mixing matrix. We use $\mathbf{x}^{*}_{\alpha}$ to denote a local minimizer of $G_{\alpha}$. Note that the DGD update (6) applied to (5) can be interpreted as an instance of the standard gradient descent algorithm applied to (7), i.e.,
$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha\, \nabla G_{\alpha}\big(\mathbf{x}^{(k)}\big) \qquad (8)$
Thus, iteratively running (6) and (8) from the same initialization yields the same sequence of iterates.
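A quick numerical check of this equivalence, written under the reconstruction of (6)–(8) above and with illustrative quadratic local objectives, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 3, 2, 0.1

# Symmetric, doubly stochastic, strictly diagonally dominant mixing matrix (illustrative).
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
Wbig = np.kron(W, np.eye(d))                       # W Kronecker I_d, acting on the stacked vector

# Illustrative local objectives f_i(x) = 0.5 * ||x - c_i||^2, so the stacked gradient is x - c.
c = rng.standard_normal(n * d)
grad_F = lambda x: x - c

# Gradient of the auxiliary function G_alpha(x) = F(x) + (1 / (2 alpha)) x^T (I - Wbig) x.
grad_G = lambda x: grad_F(x) + (np.eye(n * d) - Wbig) @ x / alpha

x_dgd = x_gd = rng.standard_normal(n * d)
for _ in range(50):
    x_dgd = Wbig @ x_dgd - alpha * grad_F(x_dgd)   # DGD step (6)
    x_gd = x_gd - alpha * grad_G(x_gd)             # gradient descent on G_alpha, step (8)

print(np.max(np.abs(x_dgd - x_gd)))                # ~1e-15: the two trajectories coincide
```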
If Assumption 2.2 holds, then $G_{\alpha}$ defined in (7) has Lipschitz continuous gradient with constant $L_G = L + \frac{1}{\alpha}\lambda_{\max}(I_{nd} - \mathbf{W})$. We have that $L_G < L + \frac{1}{\alpha}$ because the spectrum of a symmetric, positive definite and doubly stochastic matrix is contained in the interval $(0, 1]$ by the Perron–Frobenius theorem, with $1$ being the only largest eigenvalue (the Perron root). Further, if Assumption 2.3 holds, then $G_{\alpha}$ has $\rho$-Lipschitz continuous Hessian, since the quadratic penalty in (7) contributes only a constant term to the Hessian.
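The identities behind these constants can be recorded compactly; this is a sketch based on the reconstruction of (7) above.

```latex
% Hessian of the auxiliary function and the resulting Lipschitz constants:
\[
  \nabla^2 G_\alpha(\mathbf{x})
  = \nabla^2 F(\mathbf{x}) + \tfrac{1}{\alpha}\,(I_{nd} - \mathbf{W}),
\]
\[
  \|\nabla^2 G_\alpha(\mathbf{x})\|
  \le L + \tfrac{1}{\alpha}\big(1 - \lambda_{\min}(W)\big) < L + \tfrac{1}{\alpha},
  \qquad
  \|\nabla^2 G_\alpha(\mathbf{x}) - \nabla^2 G_\alpha(\mathbf{y})\|
  = \|\nabla^2 F(\mathbf{x}) - \nabla^2 F(\mathbf{y})\|
  \le \rho\,\|\mathbf{x} - \mathbf{y}\|.
\]
```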
II-C Relationships between Local Minimizers of $f$ and $G_{\alpha}$
In this section, we prove that , the component of a local minimizer of associated with agent , can be made arbitrarily close to the set of local minimizers of by choosing sufficiently small . The proof is based on a collection of intermediate results. First, Lemma 2.3.1 shows that by choosing sufficiently small step-size , at a local minimizer of , the component corresponding to agent can be arbitrarily close to , where denotes the average of across all agents. Lemma 2.3.2 and Lemma 2.3.3 show that given and , one can always find constant step-size such that and . Finally, Lemma 2.3.4 shows that for each agent, can be made arbitrarily close to a local minimizer of , by choosing a sufficiently small step-size . Let and denote the set of local minimizers of and , respectively:
$\mathcal{X}^{*} \triangleq \big\{ x \in \mathbb{R}^{d} : \nabla f(x) = 0,\ \lambda_{\min}\big(\nabla^{2} f(x)\big) > 0 \big\}, \qquad \mathcal{X}^{*}_{\alpha} \triangleq \big\{ \mathbf{x} \in \mathbb{R}^{nd} : \nabla G_{\alpha}(\mathbf{x}) = 0,\ \lambda_{\min}\big(\nabla^{2} G_{\alpha}(\mathbf{x})\big) > 0 \big\} \qquad (9)$
Lemma 2.3.1.
Let Assumption 2.5 hold. Given , let be a local minimizer of . Then, for each ,
where , and is the second-largest eigenvalue of $W$.
Proof.
Since is a local minimizer of , we have , and thus,
for all . Therefore, we have that . Now, since , we have . Further, by connectivity of the underlying communication graph (see Assumption 2.5) and the Perron–Frobenius theorem, , and as such,
where the last inequality holds because only the largest eigenvalue of is . In the limit , it follows that
as claimed.
Lemma 2.3.2.
Lemma 2.3.3.
Proof.
Proof.
Following the approach used in Lemma 3.8 of [19] to establish a similar result in the absence of the second-order requirement in the definition of $\mathcal{X}^{*}$, we prove the lemma by contradiction. Suppose
(10) |
Then, there exists a sequence with and for all . Since and are continuous functions (see Assumptions 2.2, 2.3), is closed. Thus, is compact. Since , we can find a convergent sub-sequence with limit point satisfying . Since , it follows that, for all . This means , for all , implying , . By Assumption 2.1, we have . Hence , which contradicts the initial hypothesis (10).
By combining Lemmas 2.3.1 through 2.3.4, the following intermediate theorem can be established. It is used to prove Theorem 3.1 in the next section.
Theorem 2.1.
Proof.
Given , by the triangle inequality, , where . By coercivity and properness of each (see Assumption 2.4), is coercive and proper. Therefore, is bounded, and there exists an upper bound such that for all , . By Lemma 2.3.1, if
and defined in (9), then holds for each . Now, note that in view of Lemmas 2.3.2 and 2.3.3, with as defined in Lemma 2.3.4. As such, by application of Lemma 2.3.4 with , there exists such that if , then holds. Therefore, if
then as claimed.
III Method and Main Results
In this section, a Noisy Distributed Gradient Descent algorithm (see Algorithm 1), a variant of the fixed step-size DGD algorithm, is formulated. The main analysis result, also formulated in this section, establishes the second-order properties of the NDGD algorithm. The key idea is to add random noise to the distributed gradient descent directions at each iteration. The required properties of the noise in Algorithm 1 are presented in Theorem 3.1.
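Since Algorithm 1 is not reproduced here, the following is a minimal sketch of one plausible NDGD iteration loop; the Gaussian perturbations are an illustrative assumption, whereas Theorem 3.1 only asks for i.i.d. zero-mean noise with a suitably chosen variance.

```python
import numpy as np

def ndgd(X0, W, grads, alpha, sigma, num_iters, seed=None):
    """Noisy distributed gradient descent (a sketch in the spirit of Algorithm 1).

    At each iteration every agent mixes with its neighbors, then takes a local
    gradient step whose direction is perturbed by i.i.d. zero-mean noise, so as
    to actively evade saddle points.
    """
    rng = np.random.default_rng(seed)
    X = X0.copy()
    n, d = X.shape
    for _ in range(num_iters):
        descent = np.stack([grads[i](X[i]) for i in range(n)])
        noise = sigma * rng.standard_normal((n, d))          # assumed Gaussian; only the mean/variance matter here
        X = W @ X - alpha * (descent + noise)                # perturbed version of the DGD update (2)
    return X
```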
Recall that and denote the set of local minimizers of and respectively, as per (9). Given , , , , and , define
(11) |
where . The next theorem, which establishes second-order properties of the NDGD algorithm, is the main result of the paper; a proof is given in Section IV. Note that this result focuses on the dependency on the given step-size and confidence parameter , hiding the factors that have polynomial dependence on all other parameters (including , , , , and ).
Theorem 3.1.
Suppose Assumptions 2.1, 2.2, 2.3, 2.4, 2.5 hold. Given and , also suppose the following:
- 1.
- 2. the random perturbation at step is i.i.d. and zero mean with variance
- 3. the generated sequence is bounded.
Then, with probability at least , after iterations, Algorithm 1 reaches a point that is -close to , where . Moreover, is such that is -close to , whereby for all ,
Remark 1.
Remark 2.
Second-order guarantees of DGD have been studied in [19] and [23] based on the almost sure non-convergence to saddle points under random initialization. In this paper, we propose to use random perturbations to actively evade saddle points. The second-order guarantees of NDGD stated in Theorem 3.1 do not require any additional initialization conditions. Second-order guarantees of the stochastic variant of DGD have been studied in [24] and [25], although they only show convergence to an approximate second-order stationary point. Here, an upper bound is given for the distance between the iterate at each agent and the set of local minimizers after a sufficient number of iterations.
IV Proof of Theorem 3.1
This section provides a proof of Theorem 3.1. First, we study the behavior of NDGD in three different cases, in line with the development of the related result in [21] for centralized gradient descent: i) the gradient of $G_{\alpha}$ is large in norm (see Lemma 4.1.1); ii) the smallest eigenvalue of the Hessian of $G_{\alpha}$ is sufficiently negative (see Lemma 4.1.2); and iii) the iterate lies in a neighborhood of the local minimizers of $G_{\alpha}$, where local strong convexity holds (see Lemma 4.1.3). Combining the outcome of this with Theorem 2.1, we then prove that with probability at least , after iterations, NDGD yields a point whose component at each agent is -close to some local minimizer of $f$.
IV-A Behavior of NDGD for three different cases
The following lemmas rely on the proofs of Lemma 16 and Lemma 17 in [21]. Given , , , , and , define
(12) |
where .
We first analyze the behavior of the NDGD algorithm in the case that . Intuitively, when the norm of is large enough, the expectation of the function value decreases by a certain amount after one iteration.
Lemma 4.1.1.
Proof.
Since is symmetric, doubly stochastic and strictly diagonally dominant, it is positive definite by the Gershgorin circle theorem. Given , note that . Since has -Lipschitz continuous gradient, using Taylor’s theorem gives
As such, choosing gives
as claimed.
Next, we analyze the behavior of the NDGD algorithm in the case that . Intuitively, for with small gradient and sufficiently negative , there exists an upper bound such that the expectation of the function value decreases by a certain amount after iterations.
Lemma 4.1.2.
Let Assumptions 2.2, 2.3 hold. Let . Given , suppose the random perturbation in Algorithm 1 is i.i.d. and zero mean with variance . Further, given , suppose that the generated sequence is bounded. Then, for any with and , there exists a number of steps such that
where . The number of steps has a fixed upper bound that is independent of , i.e., for all .
Proof.
Given , note that . Thus, choosing , the result holds as shown in the proof of Lemma 17 in [21].
Finally, we analyze the behavior of the algorithm in the case that . Intuitively, when the iterate is close enough to a local minimizer, with high probability subsequent iterates do not leave the neighborhood.
Lemma 4.1.3.
Proof.
Given , note that because the eigenvalues of the symmetric, doubly stochastic, diagonally dominant, and thus positive definite matrix , are contained in the interval $(0, 1]$. By , we have local strong convexity with modulus . Since also has Lipschitz continuous gradient, by Theorem 2.1.12 of [26],
Therefore,
Since , note that with . Then,
Thus, choosing , the result holds as shown in the proof of Lemma 16 in [21].
IV-B Main proof
Proof.
The main proof consists of two steps: i) it is shown that the three sets defined in (12) cover all possible points with respect to ; ii) it is shown that the upper bound on the decrease in can be used to derive a lower bound for the probability that the -th update at each agent is close to a local minimizer of .
Step 1. By the supposition in Theorem 3.1, given and , there exist , , , , and
such that , with respect to (11), and thus , where the superscript denotes set complement. If ,
if , then by Weyl’s inequality,
if , then again by Weyl’s inequality,
and . Therefore, , , , whereby
Step 2. Define stochastic process as
(13) |
where for all as per Lemma 4.1.2. By Lemma 4.1.1 and Lemma 4.1.2, decreases by a certain amount after a certain number of iterations for , and , respectively, as follows
(14) |
Defining the event , by the law of total expectation,
where and . Since , we obtain
Since the generated sequence is assumed bounded, there exists such that for all . As such,
Summing both sides of the inequality over gives
Since for all , it follows that , which leads to the following upper bound for the probability of event :
Therefore, if grows larger than , then . Since , after steps, must enter at least once with probability at least . Therefore, by repeating this step times, the probability of entering at least once is lower bounded:
where . Combining this with Lemma 4.1.3, we have that, after iterations, Algorithm 1 produces a point that is -close to with probability at least , where . For given , since satisfies requirements of Theorem 2.1, is such that is -close to . To summarize, we have for ,
as claimed.
V Numerical Example
Consider the following non-convex optimization problem over :
where , , , , and . The mixing matrix is taken to be
It can be verified that is a saddle point of , and that and are two local minimizers. We compare the performance of DGD and NDGD with constant step-size , both initialized from (i.e., close to a saddle point).
[Figure 1: (a) iterates of DGD; (b) iterates of NDGD, both initialized near the saddle point.]
From Figure 1(a), although not trapped forever, it does take DGD about 6000 iterations to escape the vicinity of the saddle point and converge to the neighborhood of a local minimizer. From Figure 1(b), we can see that NDGD escapes the vicinity of the saddle point in about 2000 iterations and converges to the neighborhood of a local minimizer. The effectiveness of NDGD over DGD is evident from this example.
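Since the local objectives and the mixing matrix of this example are only partially reproduced above, the following sketch uses a stand-in non-convex problem (a hypothetical choice, not the paper's) to illustrate the same qualitative comparison: starting very close to a saddle point, the unperturbed DGD iterates drift away only slowly, while the NDGD perturbations typically push the agents toward a local minimizer in far fewer iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha, sigma, T = 4, 2, 0.02, 0.1, 2000

# Stand-in local objectives (hypothetical): f_i(x) = 0.5 * ((x1^2 - 1)^2 + x2^2).
# Their sum has a saddle point at the origin and local minimizers at (+-1, 0).
def grad_f(x):
    return 0.5 * np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 2.0 * x[1]])

grads = [grad_f] * n

# Ring-topology mixing matrix: symmetric, doubly stochastic, strictly diagonally dominant.
W = 0.6 * np.eye(n) + 0.2 * (np.roll(np.eye(n), 1, axis=0) + np.roll(np.eye(n), -1, axis=0))

def iterations_to_escape(noisy):
    X = np.full((n, d), 1e-8)                       # all agents start very close to the saddle point
    for k in range(T):
        G = np.stack([g(x) for g, x in zip(grads, X)])
        if noisy:
            G = G + sigma * rng.standard_normal((n, d))
        X = W @ X - alpha * G
        if np.all(np.abs(np.abs(X[:, 0]) - 1.0) < 0.1):
            return k                                # every agent is within 0.1 of a local minimizer
    return T

print("DGD  iterations to escape:", iterations_to_escape(noisy=False))
print("NDGD iterations to escape:", iterations_to_escape(noisy=True))   # typically several times fewer
```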
VI Conclusion
A fixed step-size noisy distributed gradient descent (NDGD) algorithm is formulated for solving optimization problems in which the objective is a finite sum of smooth but possibly non-convex functions. Random perturbations are added to the gradient descent directions at each step to actively evade saddle points. Under certain regularity conditions, and with a suitable step-size, each agent converges (in probability, with specified confidence) to a neighborhood of a local minimizer. In particular, we determine a probabilistic upper bound on the distance between the iterate at each agent and the set of local minimizers, after a sufficient number of iterations.
The potential applications of the NDGD algorithm are vast and varied, including multi-agent systems control, federated learning and sensor networks location estimation, particularly in large-scale network scenarios. Further exploration of different approaches to introducing random perturbations, and analysis of convergence rate performance can be pursued in future work.
References
- [1] X. Dong, Y. Hua, Y. Zhou, Z. Ren, and Y. Zhong, “Theory and experiment on formation-containment control of multiple multirotor unmanned aerial vehicle systems,” IEEE Transactions on Automation Science and Engineering, vol. 16, no. 1, pp. 229–240, 2018.
- [2] Z. Qiu, G. Deconinck, and R. Belmans, “A literature survey of optimal power flow problems in the electricity market context,” in 2009 IEEE/PES Power Systems Conference and Exposition. IEEE, 2009, pp. 1–6.
- [3] W. Gao, J. Gao, K. Ozbay, and Z.-P. Jiang, “Reinforcement-learning-based cooperative adaptive cruise control of buses in the Lincoln Tunnel corridor with time-varying topology,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3796–3805, 2019.
- [4] A. Nedić and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.
- [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
- [6] H. Terelius, U. Topcu, and R. M. Murray, “Decentralized multi-agent optimization via dual decomposition,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 11245–11251, 2011.
- [7] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
- [8] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
- [9] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [10] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
- [11] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
- [12] K. I. Tsianos and M. G. Rabbat, “Distributed strongly convex optimization,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 593–600.
- [13] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
- [14] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient method,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 601–608.
- [15] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [16] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
- [17] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically convergent distributed optimization with uncoordinated step-sizes,” in 2017 American Control Conference (ACC). IEEE, 2017, pp. 3950–3955.
- [18] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
- [19] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” SIAM Journal on Optimization, vol. 30, no. 4, pp. 3029–3068, 2020.
- [20] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos, “Gradient descent can take exponential time to escape saddle points,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [21] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Conference on learning theory. PMLR, 2015, pp. 797–842.
- [22] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points efficiently,” in International Conference on Machine Learning. PMLR, 2017, pp. 1724–1732.
- [23] B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 14751–14812, 2022.
- [24] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments—part i: Agreement at a linear rate,” IEEE Transactions on Signal Processing, vol. 69, pp. 1242–1256, 2021.
- [25] ——, “Distributed learning in non-convex environments—part ii: Polynomial escape from saddle-points,” IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021.
- [26] Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003, vol. 87.