
On the Convergence of NEAR-DGD for Nonconvex Optimization with Second Order Guarantees

Charikleia Iakovidou and Ermin Wei
Department of Electrical and Computer Engineering
Northwestern University
Evanston, IL
Email: chariako@u.northwestern.edu, ermin.wei@northwestern.edu

This work was supported by NSF NRI 2024774.
Abstract

We consider the setting where the nodes of an undirected, connected network collaborate to solve a shared objective modeled as the sum of smooth functions. We assume that each summand is privately known by a unique node. NEAR-DGD is a distributed first order method which permits adjusting the amount of communication between nodes relative to the amount of computation performed locally in order to balance convergence accuracy and total application cost. In this work, we generalize the convergence properties of a variant of NEAR-DGD from the strongly convex to the nonconvex case. Under mild assumptions, we show convergence to minimizers of a custom Lyapunov function. Moreover, we demonstrate that the gap between those minimizers and the second order stationary solutions of the original problem can become arbitrarily small depending on the choice of algorithm parameters. Finally, we accompany our theoretical analysis with a numerical experiment to evaluate the empirical performance of NEAR-DGD in the nonconvex setting.

Index Terms:
distributed optimization, decentralized gradient method, nonconvex optimization, second-order guarantees.

I Introduction

We focus on optimization problems where the cost function can be modeled as a summation of n components,

\min_{x\in\mathbb{R}^{p}}f(x)=\sum_{i=1}^{n}f_{i}(x), \quad (1)

where f:\mathbb{R}^{p}\rightarrow\mathbb{R} is a smooth and (possibly) nonconvex function.

Problems of this type frequently arise in a variety of decentralized systems such as wireless sensor networks, smart grids and systems of autonomous vehicles. A special case of this setting involves a connected, undirected network of n nodes \mathcal{G}(\mathcal{V},\mathcal{E}), where \mathcal{V} and \mathcal{E} denote the sets of nodes and edges, respectively. Each node i\in\mathcal{V} has private access to the cost component f_{i} and maintains a local estimate x_{i} of the global decision variable x. This leads to the following equivalent reformulation of Problem (1),

\min_{x_{i}\in\mathbb{R}^{p}}\sum_{i=1}^{n}f_{i}(x_{i}),\quad\text{s.t. }x_{i}=x_{j},\quad\forall(i,j)\in\mathcal{E}. \quad (2)

One of the first algorithms proposed for the solution of Problem (2) when the functions f_{i} are convex is the Distributed (Sub)Gradient Descent (DGD) method [1], which relies on the combination of two elements: i) local gradient steps on the functions f_{i} and ii) calculations of weighted averages of local and neighbor variables x_{i}. For the remainder of this work, we will be referring to these two procedures as computation and consensus (or communication) steps, respectively. While DGD has been shown to converge to an approximate solution of Problem (2) under constant steplengths, a subset of methods known as gradient tracking algorithms [2, 3, 4] overcomes this limitation by iteratively estimating the average gradient between nodes.

The convergence of DGD when the function f is nonconvex has been studied in [5]. NEXT [6], SONATA [7, 8], xFilter [9] and MAGENTA [10] are some examples of distributed methods that utilize gradient tracking and can handle nonconvex objectives. Other approaches include primal-dual algorithms [11, 12] (we note that primal-dual and gradient tracking algorithms are equivalent in some cases [3]), the perturbed push-sum method [13], zeroth order methods [14, 15], and stochastic gradient algorithms [16, 17, 18, 19].

Providing second order guarantees when Hessian information is not available is a challenging task. As a result, the majority of the works listed in the previous paragraph establish convergence to critical points only. A recent line of research leverages existing results from dynamical systems theory and the structural properties of certain problems (which include matrix factorization, phase retrieval and dictionary learning, among others) to demonstrate that several centralized first order algorithms converge to minimizers almost surely when initialized randomly [20]. Specifically, if the objective function satisfies the strict saddle property, namely, if all critical points are either strict saddles or minimizers, then many first order methods converge to saddles only if they are initialized in a low-dimensional manifold with measure zero. Using similar arguments, almost sure convergence to second order stationary points of Problem (2) is proven in [8] for DOGT, a gradient tracking algorithm for directed networks, and in [12] for the first order primal-dual algorithms GPDA and GADMM. The convergence of DGD with constant steplength to a neighborhood of the minimizers of Problem (2) is also shown in [8]. The conditions under which the Distributed Stochastic Gradient method (D-SGD), and Distributed Gradient Flow (DGF), a continuous-time approximation of DGD, avoid saddle points are studied in [21] and [22], respectively. Finally, the authors of [13] prove almost sure convergence to local minima under the assumption that the objective function has no saddle points.

Given the diversity of distributed systems in terms of computing power, connectivity and energy consumption, among other concerns, the ability to adjust the relative amounts of communication and computation on a case-by-case basis is a desirable attribute for a distributed optimization algorithm. While some existing methods are designed to minimize overall communication load (for instance, the authors of [9] employ Chebyshev polynomials to improve communication complexity), all of the methods listed above perform fixed amounts of computation and communication at every iteration and lack adaptability to heterogeneous environments.

I-A Contributions

In this work, we extend the convergence analysis of the NEAR-DGD method, originally proposed in [23], from the strongly convex to the nonconvex setting. NEAR-DGD is a distributed first order method with a flexible framework, which allows for the exchange of computation with communication in order to reach a target accuracy level while simultaneously maintaining low overall application cost. We design a custom Lyapunov function which captures both progress on Problem (1) and distance to consensus, and demonstrate that under relatively mild assumptions, a variant of NEAR-DGD converges to the set of critical points of this Lyapunov function and to approximate critical points of the function f of Problem (1). Moreover, we show that the distance between the limit points of NEAR-DGD and the critical points of f can become arbitrarily small by appropriate selection of algorithm parameters. Finally, we employ recent results based on dynamical systems theory to prove that the same variant of NEAR-DGD almost surely avoids the saddle points of the Lyapunov function when initialized randomly. Our analysis is shorter and simpler compared to other works due to the convenient form of our Lyapunov function. Taken together, these results imply that NEAR-DGD asymptotically converges to second order stationary solutions of Problem (1) as the values of the algorithm parameters increase.

I-B Notation

In this paper, all vectors are column vectors. We will use the notation v^{\prime} to refer to the transpose of a vector v. The concatenation of local vectors v_{i}\in\mathbb{R}^{p} is denoted by \mathbf{v}=[v_{1}^{\prime},...,v_{n}^{\prime}]^{\prime}\in\mathbb{R}^{np} with a lowercase boldface letter. The average of the vectors v_{1},...,v_{n} contained in \mathbf{v} will be denoted by \bar{v}, i.e. \bar{v}=\frac{1}{n}\sum_{i=1}^{n}v_{i}. We use uppercase boldface letters for matrices and will denote the element in the i^{th} row and j^{th} column of matrix \mathbf{H} with h_{ij}. We will refer to the i^{th} (real) eigenvalue in ascending order (i.e. the 1st is the smallest) of a matrix \mathbf{H} as \lambda_{i}(\mathbf{H}). We use the notations I_{p} and 1_{n} for the identity matrix of dimension p and the vector of ones of dimension n, respectively. We will use \|\cdot\| to denote the l_{2}-norm, i.e. for v\in\mathbb{R}^{p} we have \left\|v\right\|=\sqrt{\sum_{i=1}^{p}\left[v\right]_{i}^{2}}, where \left[v\right]_{i} is the i-th element of v. The inner product of vectors v,u will be denoted by \langle v,u\rangle. The symbol \otimes will denote the Kronecker product operation. Finally, we define the averaging matrix \mathbf{M}:=\left(\frac{1_{n}1^{\prime}_{n}}{n}\otimes I_{p}\right).

I-C Organization

The rest of this paper is organized as follows. We briefly review the NEAR-DGD method and list our assumptions for the rest of this work in Section II. We analyze the convergence properties of NEAR-DGD when the function f of Problem (1) is nonconvex in Section III. Finally, we present the results of a numerical experiment we conducted to assess the empirical performance of NEAR-DGD in the nonconvex setting in Section IV, and conclude this work in Section V.

II The NEAR-DGD method

In this section, we list our assumptions for the remainder of this work and briefly review the NEAR-DGD method, originally proposed for strongly convex optimization in [23]. We first introduce the following compact reformulation of Problem (2),

\min_{\mathbf{x}\in\mathbb{R}^{np}}\mathbf{f}\left(\mathbf{x}\right):=\sum_{i=1}^{n}f_{i}(x_{i}),\quad\text{s.t. }\left(\mathbf{W}\otimes I_{p}\right)\mathbf{x}=\mathbf{x}, \quad (3)

where \mathbf{x}=[x_{1}^{\prime},...,x_{n}^{\prime}]^{\prime}\in\mathbb{R}^{np} is the concatenation of the local variables x_{i}, \mathbf{f}:\mathbb{R}^{np}\rightarrow\mathbb{R}, and \mathbf{W}\in\mathbb{R}^{n\times n} is a matrix satisfying the following condition.

Assumption II.1.

(Consensus matrix) The matrix \mathbf{W}\in\mathbb{R}^{n\times n} has the following properties: i) symmetry, ii) double stochasticity, iii) w_{ij}>0 if and only if (i,j)\in\mathcal{E} or i=j, and w_{ij}=0 otherwise, and iv) positive-definiteness.

We can construct a matrix \tilde{\mathbf{W}} satisfying properties (i)-(iii) of Assumption II.1 by defining its elements to be max degree or Metropolis-Hastings weights [24], for instance. Such matrices are not necessarily positive-definite, so we can further enforce property (iv) using simple linear transformations (for example, we could define \mathbf{W}=(1-\delta)^{-1}(\tilde{\mathbf{W}}-\delta I_{n}), where \delta<\lambda_{1}(\tilde{\mathbf{W}}) is a constant). For the rest of this work, we will be referring to the 2^{nd} largest eigenvalue of \mathbf{W} as \beta, i.e. \beta=\lambda_{n-1}(\mathbf{W}).
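For concreteness, the sketch below (in Python; not part of the original paper) builds Metropolis-Hastings weights for a small hypothetical graph and applies the linear transformation described above to enforce property (iv). The ring graph and the margin used for \delta are illustrative assumptions.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings weights for an undirected graph given by adjacency matrix adj."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()            # each row sums to one
    return W

def make_positive_definite(W_tilde, margin=1e-3):
    """Shift-and-scale so that all eigenvalues become positive (property iv)."""
    delta = np.linalg.eigvalsh(W_tilde).min() - margin   # any delta < lambda_1(W_tilde) works
    return (W_tilde - delta * np.eye(W_tilde.shape[0])) / (1.0 - delta)

# Example: ring graph on 5 nodes (hypothetical instance).
n = 5
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
W = make_positive_definite(metropolis_weights(adj))
assert np.allclose(W.sum(axis=1), 1.0) and np.linalg.eigvalsh(W).min() > 0
```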

We adopt the following standard assumptions for the global function \mathbf{f} of Problem (3).

Assumption II.2.

(Global Lipschitz gradient) The global objective function \mathbf{f}:\mathbb{R}^{np}\rightarrow\mathbb{R} has L-Lipschitz continuous gradients, i.e. \|\nabla\mathbf{f}(\mathbf{x})-\nabla\mathbf{f}(\mathbf{y})\|\leq L\|\mathbf{x}-\mathbf{y}\|,\quad\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{np}.

Assumption II.3.

(Coercivity) The global objective function \mathbf{f}:\mathbb{R}^{np}\rightarrow\mathbb{R} is coercive, i.e. \lim_{k\rightarrow\infty}\mathbf{f}(\mathbf{x}_{k})=\infty for every sequence \{\mathbf{x}_{k}\} such that \|\mathbf{x}_{k}\|\rightarrow\infty.

II-A The NEAR-DGD method

Starting from (arbitrary) initial points y_{i,0}=x_{i,0}, the local iterates of NEAR-DGD at node i\in\mathcal{V} and iteration count k can be expressed as,

x_{i,k}=\sum_{j=1}^{n}w^{t(k)}_{ij}y_{j,k}, \quad (4a)
y_{i,k+1}=x_{i,k}-\alpha\nabla f_{i}\left(x_{i,k}\right), \quad (4b)

where \{t(k)\} is a predefined sequence of consensus rounds per iteration, \alpha>0 is a positive steplength and w_{ij}^{t(k)} is the element in the i^{th} row and j^{th} column of the matrix \mathbf{W}^{t(k)}, resulting from the composition of t(k) consensus operations,

\mathbf{W}^{t(k)}=\underbrace{\mathbf{W}\cdot\mathbf{W}\cdot...\cdot\mathbf{W}}_{t(k)\text{ times}}\in\mathbb{R}^{n\times n}.

The system-wide iterates of NEAR-DGD can be written as,

\mathbf{x}_{k}=\mathbf{Z}^{t(k)}\mathbf{y}_{k}, \quad (5a)
\mathbf{y}_{k+1}=\mathbf{x}_{k}-\alpha\nabla\mathbf{f}(\mathbf{x}_{k}), \quad (5b)

where \mathbf{x}_{k}=[x_{1,k}^{\prime},...,x_{n,k}^{\prime}]^{\prime}\in\mathbb{R}^{np}, \mathbf{y}_{k+1}=[y_{1,k+1}^{\prime},...,y_{n,k+1}^{\prime}]^{\prime}\in\mathbb{R}^{np} and \mathbf{Z}^{t(k)}=(\mathbf{W}^{t(k)}\otimes I_{p})\in\mathbb{R}^{np\times np}.

The sequence of consensus rounds per iteration \{t(k)\} can be suitably chosen to balance convergence accuracy and total cost on a per-application basis. When the functions f_{i} are strongly convex, NEAR-DGD paired with an increasing sequence \{t(k)\} converges to the optimal solution of Problem (3), and achieves exact convergence with geometric rate (in terms of gradient evaluations) when t(k)=k [23].
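The recursion (5a)-(5b) is straightforward to prototype. The sketch below is a minimal illustration, not the authors' implementation; it assumes numpy, a consensus matrix W satisfying Assumption II.1, a list of local gradient callables, and a user-chosen schedule t(k) (all names here are ours).

```python
import numpy as np

def near_dgd(W, grads, y0, alpha, t_schedule, num_iters):
    """Run NEAR-DGD: x_k = (W^{t(k)} (x) I_p) y_k,  y_{k+1} = x_k - alpha * grad f(x_k).

    W          : (n, n) consensus matrix (Assumption II.1)
    grads      : list of n callables, grads[i](x_i) = grad f_i(x_i), with x_i in R^p
    y0         : (n, p) array of initial local variables (row i is y_{i,0})
    t_schedule : callable, t_schedule(k) = number of consensus rounds at iteration k
    """
    y = y0.copy()
    history = []
    for k in range(num_iters):
        # Consensus step (5a): t(k) rounds of averaging, i.e. multiply by W^{t(k)}.
        Wt = np.linalg.matrix_power(W, t_schedule(k))
        x = Wt @ y                                        # (n, p): row i is x_{i,k}
        # Computation step (5b): local gradient step at x_{i,k}.
        g = np.vstack([grads[i](x[i]) for i in range(len(grads))])
        y = x - alpha * g
        history.append(x)
    return history

# Usage sketch (hypothetical quadratic costs f_i(x) = 0.5 * x' A_i x):
# grads = [lambda x, A=A_i: A @ x for A_i in A_list]
# xs = near_dgd(W, grads, y0, alpha=0.1, t_schedule=lambda k: 3, num_iters=500)
# Passing t_schedule=lambda k: k + 1 gives an increasing schedule as in NEAR-DGD+.
```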

III Convergence Analysis

In this section, we present our theoretical results on the convergence of a variant of NEAR-DGD where the number of consensus steps per iteration is fixed, i.e. t(k)=t in (4a) and (5a) for some t\in\mathbb{N}^{+}. We will refer to this method as NEAR-DGDt. First, we introduce the following Lyapunov function, which will play a key role in our analysis,

\mathcal{L}_{t}\left(\mathbf{y}\right)=\mathbf{f}\left(\mathbf{Z}^{t}\mathbf{y}\right)+\frac{1}{2\alpha}\left\|\mathbf{y}\right\|^{2}_{\mathbf{Z}^{t}}-\frac{1}{2\alpha}\left\|\mathbf{y}\right\|^{2}_{\mathbf{Z}^{2t}}. \quad (6)

Using (6), we can express the \mathbf{x}_{k} iterates of NEAR-DGDt as,

\mathbf{x}_{k+1}=\mathbf{x}_{k}-\alpha\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k}\right). \quad (7)
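Relation (7) follows because \nabla\mathcal{L}_{t}(\mathbf{y})=\mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})+\alpha^{-1}(\mathbf{Z}^{t}-\mathbf{Z}^{2t})\mathbf{y}, so that \mathbf{x}_{k}-\alpha\nabla\mathcal{L}_{t}(\mathbf{y}_{k})=\mathbf{Z}^{t}\left(\mathbf{x}_{k}-\alpha\nabla\mathbf{f}(\mathbf{x}_{k})\right)=\mathbf{Z}^{t}\mathbf{y}_{k+1}=\mathbf{x}_{k+1}. The snippet below verifies this identity numerically on a small hypothetical quadratic instance; all problem data are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t, alpha = 5, 3, 2, 0.05

# Hypothetical local costs: f_i(x) = 0.5 * x' A_i x, so grad f_i(x) = A_i x.
A = [np.diag(rng.uniform(0.5, 2.0, p)) for _ in range(n)]

# Symmetric doubly stochastic W (lazy Metropolis weights on a ring; positive definite for odd n).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25
    W[i, i] = 0.5
Z_t = np.kron(np.linalg.matrix_power(W, t), np.eye(p))      # Z^t = W^t (x) I_p

def grad_f(x_stacked):
    x = x_stacked.reshape(n, p)
    return np.concatenate([A[i] @ x[i] for i in range(n)])

def grad_L(y):                                               # gradient of (6)
    Zty = Z_t @ y
    return Z_t @ grad_f(Zty) + (Z_t @ y - Z_t @ (Z_t @ y)) / alpha

y_k = rng.standard_normal(n * p)
x_k = Z_t @ y_k
y_next = x_k - alpha * grad_f(x_k)                           # (5b)
x_next = Z_t @ y_next                                        # (5a)
assert np.allclose(x_next, x_k - alpha * grad_L(y_k))        # identity (7)
```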

We need one more assumption on the geometry of the Lyapunov function \mathcal{L}_{t} to guarantee the convergence of NEAR-DGDt. Namely, we require \mathcal{L}_{t} to be "sharp" around its critical points, up to a reparametrization. This property is formally summarized below.

Definition 1.

(Kurdyka-Łojasiewicz (KL) property) [25] A function h:\mathbb{R}^{p}\rightarrow\mathbb{R}\cup\{+\infty\} has the KL property at x^{\star}\in\text{dom}(\partial h) if there exists \eta\in(0,+\infty], a neighborhood U of x^{\star}, and a continuous concave function \phi:[0,\eta)\rightarrow\mathbb{R}^{+} such that: i) \phi(0)=0, ii) \phi is \mathcal{C}^{1} (continuously differentiable) on (0,\eta), iii) for all s\in(0,\eta), \phi^{\prime}(s)>0, and iv) for all x\in U\cap\{x:h(x^{\star})<h(x)<h(x^{\star})+\eta\}, the KL inequality holds:

\phi^{\prime}(h(x)-h(x^{\star}))\cdot\text{dist}(0,\partial h(x))\geq 1.

Proper lower semicontinuous functions which satisfy the KL inequality at each point of \text{dom}(\partial h) are called KL functions.

Assumption III.1.

(KL Lyapunov function) The Lyapunov function \mathcal{L}_{t}:\mathbb{R}^{np}\rightarrow\mathbb{R} is a KL function.

Assumption III.1 covers a broad range of functions, including real analytic, semialgebraic and globally subanalytic functions (see [26] for more details). For instance, if the function \mathbf{f} is real analytic, then \mathcal{L}_{t} is the sum of real analytic functions and by extension KL.

III-A Convergence to critical points

In this subsection, we demonstrate that the \mathbf{y}_{k} iterates of NEAR-DGDt converge to a critical point of the Lyapunov function \mathcal{L}_{t} (6). We assume that Assumptions II.1-II.3 and Assumption III.1 hold for the rest of this work. We begin our analysis by showing that the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is non-increasing in the following Lemma.

Lemma III.2.

(Sufficient Descent) Let \{\mathbf{y}_{k}\} be the sequence of NEAR-DGDt iterates generated by (5b) and suppose that the steplength \alpha satisfies \alpha<2/L, where L is defined in Assumption II.2. Then the following inequality holds for the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\},

\mathcal{L}_{t}\left(\mathbf{y}_{k+1}\right)\leq\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)-\rho\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2},

where \rho=(2\alpha)^{-1}\min_{i}\left(\lambda^{t}_{i}(\mathbf{Z})\left(1+\left(1-\alpha L\right)\lambda^{t}_{i}(\mathbf{Z})\right)\right)>0.

Proof.

Combining (5a) and (6), we obtain for k\geq 0,

\mathcal{L}_{t}(\mathbf{y}_{k})=\mathbf{f}(\mathbf{x}_{k})+\frac{1}{2\alpha}\langle\mathbf{y}_{k},\mathbf{x}_{k}\rangle-\frac{1}{2\alpha}\|\mathbf{x}_{k}\|^{2}. \quad (8)

Let \mathbf{d}_{x}:=\mathbf{x}_{k+1}-\mathbf{x}_{k}. Assumption II.2 then yields \mathbf{f}(\mathbf{x}_{k+1})\leq\mathbf{f}(\mathbf{x}_{k})+\langle\nabla\mathbf{f}(\mathbf{x}_{k}),\mathbf{d}_{x}\rangle+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}=\mathbf{f}(\mathbf{x}_{k})-\frac{1}{\alpha}\langle\mathbf{y}_{k+1}-\mathbf{x}_{k},\mathbf{d}_{x}\rangle+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}, where we acquire the last equality from (5b). Substituting this relation in (8) applied at the (k+1)^{th} iteration, we obtain,

\begin{split}\mathcal{L}_{t}(\mathbf{y}_{k+1})&\leq\mathbf{f}(\mathbf{x}_{k})-\frac{1}{\alpha}\langle\mathbf{y}_{k+1}-\mathbf{x}_{k},\mathbf{d}_{x}\rangle+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}\\ &\quad+\frac{1}{2\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k+1}\rangle-\frac{1}{2\alpha}\|\mathbf{x}_{k+1}\|^{2}\\ &=\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\langle\mathbf{y}_{k},\mathbf{x}_{k}\rangle+\frac{1}{2\alpha}\|\mathbf{x}_{k}\|^{2}-\frac{1}{\alpha}\langle\mathbf{y}_{k+1}-\mathbf{x}_{k},\mathbf{d}_{x}\rangle\\ &\quad+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}+\frac{1}{2\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k+1}\rangle-\frac{1}{2\alpha}\|\mathbf{x}_{k+1}\|^{2},\end{split}

where we obtain the equality after further application of (8). After setting \mathbf{d}_{y}:=\mathbf{y}_{k+1}-\mathbf{y}_{k} and re-arranging the terms, we obtain,

\begin{split}\mathcal{L}_{t}(\mathbf{y}_{k+1})&\leq\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\langle\mathbf{y}_{k},\mathbf{x}_{k}\rangle+\frac{1}{\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k}\rangle\\ &\quad-\frac{1}{2\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k+1}\rangle-\left(\frac{1}{2\alpha}-\frac{L}{2}\right)\|\mathbf{d}_{x}\|^{2}\\ &=\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\|\mathbf{y}_{k}\|^{2}_{\mathbf{Z}^{t}}+\frac{1}{\alpha}\langle\mathbf{y}_{k+1},\mathbf{y}_{k}\rangle_{\mathbf{Z}^{t}}\\ &\quad-\frac{1}{2\alpha}\|\mathbf{y}_{k+1}\|^{2}_{\mathbf{Z}^{t}}-\left(\frac{1}{2\alpha}-\frac{L}{2}\right)\|\mathbf{d}_{x}\|^{2}\\ &=\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\|\mathbf{d}_{y}\|^{2}_{\mathbf{Z}^{t}}-\left(\frac{1}{2\alpha}-\frac{L}{2}\right)\|\mathbf{d}_{x}\|^{2}.\end{split}

Let \mathbf{H}:=(2\alpha)^{-1}\mathbf{Z}^{t}\left(I+(1-\alpha L)\mathbf{Z}^{t}\right), which is a positive definite matrix due to Assumption II.1 and the fact that \alpha<2/L. Moreover, \|\mathbf{d}_{x}\|^{2}=\|\mathbf{d}_{y}\|^{2}_{\mathbf{Z}^{2t}} by Eq. (5a). We can then re-write the immediately previous relation as \mathcal{L}_{t}(\mathbf{y}_{k+1})\leq\mathcal{L}_{t}(\mathbf{y}_{k})-\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\|^{2}_{\mathbf{H}}. Applying the definition of \rho=\lambda_{1}(\mathbf{H}) concludes the proof. ∎

An important consequence of Lemma III.2 is that NEAR-DGDt can tolerate a bigger range of steplengths than previously indicated (\alpha<2/L vs. \alpha<1/L in [23]). Moreover, Lemma III.2 implies that the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is upper bounded by \mathcal{L}_{t}(\mathbf{y}_{0}). We use this fact to prove that the iterates of NEAR-DGDt are also bounded in the next Lemma.

Lemma III.3.

(Boundedness) Let \{\mathbf{x}_{k}\} and \{\mathbf{y}_{k}\} be the sequences of NEAR-DGDt (t(k)=t) iterates generated by (5a) and (5b), respectively, from initial point \mathbf{y}_{0} and under steplength \alpha<2/L. Then the following hold: i) the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is lower bounded, and ii) there exist universal positive constants B_{x} and B_{y} such that \|\mathbf{x}_{k}\|\leq B_{x} and \|\mathbf{y}_{k+1}\|\leq B_{y} for all k\geq 0 and t\in\mathbb{N}^{+}.

Proof.

By Assumption II.3, the function \mathbf{f} is lower bounded and therefore \mathcal{L}_{t} is also lower bounded (sum of lower bounded functions). This proves the first claim of this Lemma.

To prove the second claim, we first notice that Lemma III.2 implies that the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is upper bounded by \mathcal{L}_{t}(\mathbf{y}_{0}). Let us define the set \mathcal{X}_{0}:=\{\mathbf{Z}^{t}\mathbf{y}_{0},t\in\mathbb{N}^{+}\}. The set \mathcal{X}_{0} is compact, since \|\mathbf{Z}^{t}\mathbf{y}_{0}\|\leq\|\mathbf{y}_{0}\| for all t\in\mathbb{N}^{+} due to the non-expansiveness of \mathbf{Z}. Hence, by the continuity of \mathbf{f} and the Weierstrass Extreme Value Theorem, there exists \hat{\mathbf{x}}_{0}\in\mathcal{X}_{0} such that \mathbf{f}(\mathbf{x}_{0})\leq\mathbf{f}(\hat{\mathbf{x}}_{0}) for all \mathbf{x}_{0}\in\mathcal{X}_{0}. Moreover, Assumption II.1 yields \|\mathbf{y}_{0}\|^{2}_{\mathbf{Z}^{t}(I-\mathbf{Z}^{t})}\leq\|\mathbf{y}_{0}\|^{2} for all positive integers t, and therefore \mathcal{L}_{t}(\mathbf{y}_{0})\leq\hat{\mathcal{L}} for all t\in\mathbb{N}^{+}, where \hat{\mathcal{L}}=\mathbf{f}(\hat{\mathbf{x}}_{0})+(2\alpha)^{-1}\|\mathbf{y}_{0}\|^{2}.

Since \hat{\mathcal{L}}\geq\mathcal{L}_{t}(\mathbf{y}_{0})\geq\mathcal{L}_{t}(\mathbf{y}_{k})\geq\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}_{k})=\mathbf{f}(\mathbf{x}_{k}) for all k\geq 0 and t>0, the sequence \{\mathbf{f}(\mathbf{x}_{k})\} is upper bounded. Hence, by Assumption II.3, there exists a positive constant B_{x} such that \|\mathbf{x}_{k}\|\leq B_{x} for k\geq 0 and t>0. Moreover, Assumption II.2 yields \mathbf{f}(\mathbf{y}_{k+1})\leq\mathbf{f}(\mathbf{x}_{k})+\langle\nabla\mathbf{f}(\mathbf{x}_{k}),\mathbf{y}_{k+1}-\mathbf{x}_{k}\rangle+\frac{L}{2}\|\mathbf{y}_{k+1}-\mathbf{x}_{k}\|^{2}=\mathbf{f}(\mathbf{x}_{k})-\alpha\|\nabla\mathbf{f}(\mathbf{x}_{k})\|^{2}+\frac{\alpha^{2}L}{2}\|\nabla\mathbf{f}(\mathbf{x}_{k})\|^{2}=\mathbf{f}(\mathbf{x}_{k})-\alpha\left(1-\frac{\alpha L}{2}\right)\|\nabla\mathbf{f}(\mathbf{x}_{k})\|^{2}\leq\mathbf{f}(\mathbf{x}_{k}), where we obtain the first equality from (5b) and the last inequality from the fact that \alpha<2/L. This relation combined with Assumption II.3 implies that there exists a constant B_{y}>0 such that \|\mathbf{y}_{k+1}\|\leq B_{y} for k\geq 0 and t>0, which concludes the proof. ∎

Next, we use Lemma III.3 to show that the distance between the local iterates generated by NEAR-DGDt and their average is bounded.

Lemma III.4.

(Bounded distance to mean) Let x_{i,k} and y_{i,k} be the local NEAR-DGDt iterates produced under steplength \alpha<2/L by (4a) and (4b), respectively, and define the average iterates \bar{x}_{k}:=\frac{1}{n}\sum_{i=1}^{n}x_{i,k} and \bar{y}_{k}:=\frac{1}{n}\sum_{i=1}^{n}y_{i,k}. Then the distance between the local and the average iterates is bounded for i=1,...,n and k=1,2,..., i.e.

\left\|x_{i,k}-\bar{x}_{k}\right\|\leq\beta^{t}B_{y},\text{ and }\left\|y_{i,k}-\bar{y}_{k}\right\|\leq B_{y},

where B_{y} is a positive constant defined in Lemma III.3.

Proof.

Multiplying both sides of (5a) with \mathbf{M}=\left(\frac{1_{n}1_{n}^{\prime}}{n}\otimes I_{p}\right) yields \bar{x}_{k}=\bar{y}_{k}. Moreover, we observe that \left\|\mathbf{v}_{k}-\mathbf{M}\mathbf{v}_{k}\right\|^{2}=\sum_{i=1}^{n}\left\|v_{i,k}-\bar{v}_{k}\right\|^{2} for any vector \mathbf{v}\in\mathbb{R}^{np}. Hence,

\begin{split}\left\|x_{i,k}-\bar{x}_{k}\right\|&=\left\|x_{i,k}-\bar{y}_{k}\right\|\leq\left\|\mathbf{x}_{k}-\mathbf{M}\mathbf{y}_{k}\right\|\\ &\leq\left\|\mathbf{Z}^{t}\mathbf{y}_{k}-\mathbf{M}\mathbf{y}_{k}\right\|\leq\beta^{t}\|\mathbf{y}_{k}\|,\end{split}

where we derive the last inequality from the spectral properties of \mathbf{Z} and \mathbf{M} (we note that the matrix 1_{n}1_{n}^{\prime}/n has a single non-zero eigenvalue at 1, associated with the eigenvector 1_{n}).

Similarly, for the local iterates y_{i,k} we obtain,

\left\|y_{i,k}-\bar{y}_{k}\right\|\leq\left\|\mathbf{y}_{k}-\mathbf{M}\mathbf{y}_{k}\right\|=\left\|\left(I-\mathbf{M}\right)\mathbf{y}_{k}\right\|\leq\|\mathbf{y}_{k}\|.

Applying Lemma III.3 to the two preceding inequalities completes the proof. ∎

We are now ready to state the first Theorem of this section, namely that there exists a subsequence of \{\mathbf{y}_{k}\} that converges to a critical point of \mathcal{L}_{t}.

Theorem III.5.

(Subsequence convergence) Let \{\mathbf{y}_{k}\} be the sequence of NEAR-DGDt iterates generated by (5b) with steplength \alpha<2/L. Then \{\mathbf{y}_{k}\} has a convergent subsequence whose limit point is a critical point of (6).

Proof.

By Lemma III.3, the sequence \{\mathbf{y}_{k}\} is bounded and therefore there exists a convergent subsequence \{\mathbf{y}_{k_{s}}\}_{s\in\mathbb{N}}\rightarrow\mathbf{y}^{\infty} as s\rightarrow\infty. In addition, recursive application of Lemma III.2 over iterations 0,1,...,k yields,

\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\leq\mathcal{L}_{t}\left(\mathbf{y}_{0}\right)-\rho\sum_{j=0}^{k-1}\left\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\right\|^{2},

where the sequence \{\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\} is non-increasing and bounded from below by Lemmas III.2 and III.3.

Hence, \{\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\} converges and the above relation implies that \sum_{k=1}^{\infty}\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}<+\infty and \left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|\rightarrow 0. Moreover, \left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|=\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|_{\mathbf{Z}^{2t}}\leq\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\| by the non-expansiveness of \mathbf{Z}, and thus \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\rightarrow 0. Finally, Eq. (7) yields \|\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\|=\alpha^{-1}\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\rightarrow 0. We conclude that \left\|\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k_{s}}\right)\right\|\rightarrow 0 as s\rightarrow\infty and therefore \nabla\mathcal{L}_{t}\left(\mathbf{y}^{\infty}\right)=\mathbf{0}. ∎

We note that Assumption III.1 is not necessary for Theorem III.5 to hold. However, Theorem III.5 does not guarantee the convergence of NEAR-DGDt; we will need Assumption III.1 to prove that NEAR-DGDt converges in Theorem III.8. Before that, we introduce the following two preliminary Lemmas that hold only under Assumption III.1.

Lemma III.6.

(Bounded difference under the KL property) Let \mathbf{x}_{k} and \mathbf{y}_{k} be the NEAR-DGDt iterates generated by (5a) and (5b), respectively, under steplength \alpha<2/L. Moreover, suppose that the KL inequality with respect to some point \mathbf{y}^{\star}\in\mathbb{R}^{np} holds at \mathbf{y}_{k}, i.e.,

\phi^{\prime}(\mathcal{L}_{t}(\mathbf{y}_{k})-\mathcal{L}_{t}(\mathbf{y}^{\star}))\|\nabla\mathcal{L}_{t}(\mathbf{y}_{k})\|\geq 1. \quad (9)

Then the following relation holds,

\left\|\mathbf{v}_{k+1}-\mathbf{v}_{k}\right\|\leq\frac{1}{\alpha\rho}\left(\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right)\right),

where \|\mathbf{v}_{k+1}-\mathbf{v}_{k}\| can be \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\| or \|\mathbf{y}_{k+1}-\mathbf{y}_{k}\|, and l_{k}:=\mathcal{L}_{t}(\mathbf{y}_{k})-\mathcal{L}_{t}(\mathbf{y}^{\star}).

Proof.

Lemma III.2 yields \rho\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}\leq\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)-\mathcal{L}_{t}\left(\mathbf{y}_{k+1}\right)=l_{k}-l_{k+1} for k\geq 0. We can multiply both sides of this relation with \phi^{\prime}\left(l_{k}\right)>0 to obtain \rho\phi^{\prime}\left(l_{k}\right)\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}\leq-\phi^{\prime}\left(l_{k}\right)\left(l_{k+1}-l_{k}\right)\leq\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right), where we derive the last inequality from the concavity of \phi. In addition, using Eq. (7), we can re-write (9) as \alpha^{-1}\phi^{\prime}(l_{k})\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\geq 1. Combining these relations, we acquire,

\frac{\alpha\rho\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}}{\left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|}\leq\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right).

Observing that \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\leq\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\| due to the non-expansiveness of \mathbf{Z} and re-arranging the terms of the relation above yields the final result. ∎

In the next Lemma, we show that if NEAR-DGDt is initialized from an appropriate subset of \mathbb{R}^{np} and Assumption III.1 holds, then the sequence \{\mathbf{y}_{k}\} converges to a critical point of the Lyapunov function \mathcal{L}_{t}.

Lemma III.7.

(Local convergence) Let \{\mathbf{y}_{k}\} be the sequence of iterates generated by (5b) from initial point \mathbf{y}_{0} and with steplength \alpha<2/L. Moreover, let U and \eta be the objects in Def. 1 and suppose that the following relations are satisfied for some point \mathbf{y}^{\star}\in\mathbb{R}^{np},

(\alpha\rho)^{-1}\phi\left(\mathcal{L}_{t}(\mathbf{y}_{0})-\mathcal{L}_{t}(\mathbf{y}^{\star})\right)+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|<r, \quad (10)
\mathcal{L}_{t}(\mathbf{y}^{\star})\leq\mathcal{L}_{t}(\mathbf{y}_{k})<\mathcal{L}_{t}(\mathbf{y}^{\star})+\eta,\quad k\geq 0, \quad (11)

where r is a positive constant and \mathcal{B}(\mathbf{y}^{\star},r)\subset U.

Then the sequence \{\mathbf{y}_{k}\} has finite length, i.e. \sum_{j=0}^{\infty}\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\|<\infty, and converges to a critical point of (6).

Proof.

In the trivial case where \mathcal{L}_{t}(\mathbf{y}_{k})=\mathcal{L}_{t}(\mathbf{y}^{\star}), Lemma III.2 combined with (11) yield \mathcal{L}_{t}(\mathbf{y}_{k+1})=\mathcal{L}_{t}(\mathbf{y}_{k})=\mathcal{L}_{t}(\mathbf{y}^{\star}) and \|\mathbf{y}_{k+1}-\mathbf{y}_{k}\|=0. Let us now assume that \mathcal{L}_{t}(\mathbf{y}_{k})\in\big(\mathcal{L}_{t}(\mathbf{y}^{\star}),\mathcal{L}_{t}(\mathbf{y}^{\star})+\eta\big) and \mathbf{y}_{k}\in\mathcal{B}(\mathbf{y}^{\star},r) up to and including some index \tau\in\mathbb{N}^{+}, which implies that (9) holds for all k\leq\tau. Applying the triangle inequality twice, we obtain,

\begin{split}\|\mathbf{y}_{\tau+1}-\mathbf{y}^{\star}\|&\leq\|\mathbf{y}_{\tau+1}-\mathbf{y}_{0}\|+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|\\ &=\left\|\sum_{j=0}^{\tau}\left(\mathbf{y}_{j+1}-\mathbf{y}_{j}\right)\right\|+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|\\ &\leq\sum_{j=0}^{\tau}\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\|+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|.\end{split}

Application of Lemma III.6 then yields \left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|\leq(\alpha\rho)^{-1}\left(\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right)\right) for k\leq\tau. Substituting this in the preceding relation, we acquire,

\begin{split}\|\mathbf{y}_{\tau+1}-\mathbf{y}^{\star}\|&\leq\frac{1}{\alpha\rho}\left(\phi\left(l_{0}\right)-\phi\left(l_{\tau+1}\right)\right)+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|\\ &\leq\frac{\phi\left(l_{0}\right)}{\alpha\rho}+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|<r.\end{split}

The above result implies that \mathbf{y}_{\tau+1}\in\mathcal{B}(\mathbf{y}^{\star},r). Given that \|\mathbf{y}_{0}-\mathbf{y}^{\star}\|<r and thus \mathbf{y}_{0}\in\mathcal{B}(\mathbf{y}^{\star},r), we obtain by induction that \mathbf{y}_{k}\in\mathcal{B}(\mathbf{y}^{\star},r) and \left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|\leq(\alpha\rho)^{-1}\left(\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right)\right) for all k\geq 0. Hence,

\begin{split}\sum_{j=0}^{\infty}\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\|&\leq\frac{1}{\alpha\rho}\sum_{j=0}^{\infty}\left(\phi(l_{j})-\phi(l_{j+1})\right)\\ &\leq\frac{1}{\alpha\rho}\left(\phi(l_{0})-\phi(l_{\infty})\right)\leq\frac{\phi(l_{0})}{\alpha\rho}.\end{split}

Thus, the sequence \{\mathbf{y}_{k}\} has finite length and is Cauchy (convergent), and \{\mathbf{y}_{k}\}\rightarrow\tilde{\mathbf{y}}, where \tilde{\mathbf{y}} is a critical point of (6) by Theorem III.5. ∎

Next, we combine our previous results to prove the global convergence of the \mathbf{y}_{k} iterates of NEAR-DGDt in Theorem III.8.

Theorem III.8.

(Global Convergence) Let \{\mathbf{y}_{k}\} be the sequence of NEAR-DGDt iterates produced by (5b) under steplength \alpha<2/L and let \mathbf{y}^{\infty} be a limit point of a convergent subsequence of \{\mathbf{y}_{k}\} as defined in Theorem III.5.

Then under Assumption III.1 the following statements hold: i) there exists an index k_{0}\in\mathbb{N}^{+} such that the KL inequality with respect to \mathbf{y}^{\infty} holds for all k\geq k_{0}, and ii) the sequence \{\mathbf{y}_{k}\} converges to \mathbf{y}^{\infty}.

Proof.

We first observe that by Lemma III.2 the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is non-increasing, and therefore \mathcal{L}_{t}(\mathbf{y}^{\infty})\leq\mathcal{L}_{t}(\mathbf{y}_{k}) for all k\geq 0. If Assumption III.1 holds, then the objects U and \eta in Def. 1 exist and by the continuity of \phi, it is possible to find an index k_{0} that satisfies the following relations,

(\alpha\rho)^{-1}\phi\left(\mathcal{L}_{t}(\mathbf{y}_{k_{0}})-\mathcal{L}_{t}(\mathbf{y}^{\infty})\right)+\|\mathbf{y}_{k_{0}}-\mathbf{y}^{\infty}\|<r,
\mathcal{L}_{t}(\mathbf{y}_{k})\in[\mathcal{L}_{t}(\mathbf{y}^{\infty}),\mathcal{L}_{t}(\mathbf{y}^{\infty})+\eta),\quad\forall k\geq k_{0},

where \mathcal{B}(\mathbf{y}^{\infty},r)\subset U.

Applying Lemma III.7 to the sequence \{\mathbf{y}_{k}\}_{k\geq k_{0}} with \mathbf{y}^{\star}=\mathbf{y}^{\infty} establishes the convergence of \{\mathbf{y}_{k}\}. Finally, since \mathbf{y}^{\infty} is the limit point of a subsequence of \{\mathbf{y}_{k}\} and \{\mathbf{y}_{k}\} is convergent, we conclude that \{\mathbf{y}_{k}\}\rightarrow\mathbf{y}^{\infty}. ∎

Since \mathbf{Z} is a non-singular matrix, Theorem III.8 implies that the sequence \{\mathbf{x}_{k}\} also converges. Moreover, using arguments similar to [27], we can prove the following result on the convergence rate of \{\mathbf{x}_{k}\}.

Lemma III.9.

(Rates) Let \{\mathbf{x}_{k}\} be the sequence of iterates produced by (5a), \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty} where \mathbf{y}^{\infty} is the limit point of the sequence \{\mathbf{y}_{k}\}, and suppose \phi(s)=cs^{1-\theta} in Assumption III.1 for some constant c>0 and \theta\in[0,1) (for a discussion on \phi, we direct readers to [26]). Then the following hold:

  1. If \theta=0, \{\mathbf{x}_{k}\} converges in a finite number of iterations.

  2. If \theta\in(0,1/2], then constants c>0 and Q\in[0,1) exist such that \|\mathbf{x}_{k}-\mathbf{x}^{\infty}\|\leq cQ^{k}.

  3. If \theta\in(1/2,1), then there exists a constant c>0 such that \|\mathbf{x}_{k}-\mathbf{x}^{\infty}\|\leq ck^{-\frac{1-\theta}{2\theta-1}}.

Proof.

i) \theta=0: From the definition of \phi and \theta=0 we have \phi^{\prime}(l_{k})=c(1-\theta)l_{k}^{-\theta}=c. Let I:=\{k\in\mathbb{N}:\mathbf{x}_{k+1}\neq\mathbf{x}_{k}\} (by the non-singularity of \mathbf{Z}, it also follows that \mathbf{y}_{k+1}\neq\mathbf{y}_{k} for k\in I). Then for large k the KL inequality holds at \mathbf{y}_{k} and we obtain \|\nabla\mathcal{L}_{t}(\mathbf{y}_{k})\|\geq c^{-1}, or equivalently by (7), \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\geq\alpha c^{-1}. Application of Lemma III.2 combined with the fact that \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\leq\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\| yields \mathcal{L}_{t}(\mathbf{y}_{k+1})\leq\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)-\rho\alpha^{2}c^{-2}. Given the convergence of the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\}, we conclude that the set I is finite and the method converges in a finite number of steps.

ii) \theta\in(0,1): Let S_{k}:=\sum_{j=k}^{\infty}\|\mathbf{x}_{j+1}-\mathbf{x}_{j}\|, where \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}. Since \|\mathbf{x}_{k}-\mathbf{x}^{\infty}\|\leq S_{k}, it suffices to bound S_{k}. Using Lemma III.6 with \mathbf{y}^{\star}=\mathbf{y}^{\infty} and for k\geq k_{0}, where k_{0} is defined in Theorem III.8, we obtain,

S_{k}\leq\frac{1}{\alpha\rho}\sum_{j=k}^{\infty}\left(\phi(l_{j})-\phi(l_{j+1})\right)=\frac{1}{\alpha\rho}\phi(l_{k})=\frac{1}{\nu}l_{k}^{1-\theta}, \quad (12)

where \nu=\alpha\rho/c.

Moreover, Eq. (7) yields \left\|\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\right\|=\alpha^{-1}\left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|=\alpha^{-1}\left(S_{k}-S_{k+1}\right). Using this relation and the definition of \phi, we can express the KL inequality as,

\mu l_{k}^{-\theta}\left(S_{k}-S_{k+1}\right)\geq 1, \quad (13)

where \mu=\alpha^{-1}c(1-\theta).

If \theta\in(0,1/2], raising both sides of the preceding inequality to the power of \gamma=\frac{1-\theta}{\theta}>1 and re-arranging the terms yields \mu^{\gamma}\left(S_{k}-S_{k+1}\right)^{\gamma}\geq l_{k}^{1-\theta}. Due to the fact that S_{k}-S_{k+1}=\alpha\|\nabla\mathcal{L}_{t}(\mathbf{y}_{k})\|\rightarrow 0, there exists some index k such that S_{k}-S_{k+1}>\left(S_{k}-S_{k+1}\right)^{\gamma} and \mu^{\gamma}\left(S_{k}-S_{k+1}\right)\geq l_{k}^{1-\theta}. Combining this relation with (12), we obtain \nu S_{k}\leq\mu^{\gamma}\left(S_{k}-S_{k+1}\right)\Leftrightarrow S_{k+1}\leq\left(1-\frac{\nu}{\mu^{\gamma}}\right)S_{k}.

If \theta\in(1/2,1), raising both sides of (12) to the power of \theta/(1-\theta)>1 yields S_{k}^{\theta/(1-\theta)}\leq\nu^{-\theta/(1-\theta)}l_{k}^{\theta}. After substituting this relation in (13) and re-arranging we obtain 1\leq C\left(S_{k}-S_{k+1}\right)(S_{k}^{\theta/(1-\theta)})^{-1}, where C=\mu\nu^{-\theta/(1-\theta)}. Define h:(0,+\infty)\rightarrow\mathbb{R} to be h(s)=s^{-\theta/(1-\theta)}. The preceding relation then yields 1\leq C(S_{k}-S_{k+1})h(S_{k})\leq C\int_{S_{k+1}}^{S_{k}}h(s)ds=C\zeta^{-1}\left(S_{k}^{\zeta}-S_{k+1}^{\zeta}\right), where \zeta=(1-2\theta)/(1-\theta)<0. After setting \tilde{C}=-C^{-1}\zeta>0 and re-arranging, we obtain \tilde{C}\leq S^{\zeta}_{k+1}-S^{\zeta}_{k}. Summing this relation over iterations k=k_{0},...,K-1 yields (K-k_{0})\tilde{C}\leq S_{K}^{\zeta}-S_{k_{0}}^{\zeta}\Leftrightarrow S_{K}\leq\left((K-k_{0})\tilde{C}+S_{k_{0}}^{\zeta}\right)^{1/\zeta}\leq cK^{1/\zeta}, for some c>0. ∎

We conclude this subsection with one more result on the distance to optimality of the local x_{i,k} iterates of NEAR-DGDt and their average \bar{x}_{k}=\frac{1}{n}\sum_{i=1}^{n}x_{i,k} as k\rightarrow\infty.

Corollary III.10.

(Distance to optimality) Suppose that \{\mathbf{y}_{k}\}\rightarrow\mathbf{y}^{\infty} and let \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}. Moreover, let \bar{x}^{\infty}=\bar{y}^{\infty}=\frac{1}{n}\sum_{i=1}^{n}x^{\infty}_{i}. Then \bar{x}^{\infty} is an approximate critical point of f,

\left\|\nabla f(\bar{x}^{\infty})\right\|\leq\beta^{t}\sqrt{n}LB_{y},

where B_{y} is a positive constant defined in Lemma III.3.

Proof.

We observe that \mathbf{M}\nabla\mathbf{f}(\mathbf{M}\mathbf{y}^{\infty})=\frac{1}{n}\cdot 1_{n}\otimes\nabla f(\bar{y}^{\infty}) and hence \|\mathbf{M}\nabla\mathbf{f}(\mathbf{M}\mathbf{y}^{\infty})\|=n^{-1}\|1_{n}\otimes\nabla f(\bar{y}^{\infty})\|=(\sqrt{n})^{-1}\|\nabla f(\bar{y}^{\infty})\|, where we obtain the last equality due to the fact that \|1_{n}\otimes v\|^{2}=n\|v\|^{2} for any vector v.

Moreover, \mathbf{y}^{\infty} is a critical point of (6) and therefore satisfies \nabla\mathcal{L}_{t}(\mathbf{y}^{\infty})=\mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\infty})+\frac{1}{\alpha}\mathbf{Z}^{t}\mathbf{y}^{\infty}-\frac{1}{\alpha}\mathbf{Z}^{2t}\mathbf{y}^{\infty}=\mathbf{0}. From the double stochasticity of \mathbf{Z}, multiplying the above relation with \mathbf{M} yields \mathbf{M}\nabla\mathcal{L}_{t}(\mathbf{y}^{\infty})=\mathbf{M}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\infty})=\mathbf{0}. After combining all the preceding results, we obtain,

\begin{split}\|\nabla f(\bar{x}^{\infty})\|&=\sqrt{n}\|\mathbf{M}\nabla\mathbf{f}(\mathbf{M}\mathbf{y}^{\infty})-\mathbf{M}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\infty})\|\\ &\leq\sqrt{n}L\|\mathbf{M}\mathbf{y}^{\infty}-\mathbf{Z}^{t}\mathbf{y}^{\infty}\|\leq\beta^{t}\sqrt{n}L\|\mathbf{y}^{\infty}\|,\end{split}

where we used the spectral properties of \mathbf{M} and Assumption II.2 to get the first inequality and the spectral properties of \mathbf{Z} to get the second inequality. Applying Lemma III.3 yields the result of this Corollary. ∎

III-B Second order guarantees

In this subsection, we provide second order guarantees for the NEAR-DGDt method. Specifically, using recent results stemming from dynamical systems theory, we will prove that NEAR-DGDt almost surely avoids the strict saddles of the Lyapunov function \mathcal{L}_{t} when initialized randomly. Hence, if \mathcal{L}_{t} satisfies the strict saddle property, NEAR-DGDt converges to minima of \mathcal{L}_{t} with probability 1. We begin by listing a number of additional assumptions and definitions.

Assumption III.11.

(Differentiability) The function \mathbf{f} is \mathcal{C}^{2}.

Assumption III.11 implies that the function \mathcal{L}_{t} is also \mathcal{C}^{2}.

Definition 2.

(Differential of a mapping) [Ch. 3, [28]] The differential of a mapping g:\mathcal{X}\rightarrow\mathcal{X}, denoted as Dg(x), is a linear operator from \mathcal{T}(x)\rightarrow\mathcal{T}(g(x)), where \mathcal{T}(x) is the tangent space of \mathcal{X} at point x. Given a curve \gamma in \mathcal{X} with \gamma(0)=x and \frac{d\gamma}{dt}(0)=v\in\mathcal{T}(x), the linear operator is defined as Dg(x)v=\frac{d(g\circ\gamma)}{dt}(0)\in\mathcal{T}(g(x)). The determinant of the linear operator \det(Dg(x)) is the determinant of the matrix representing Dg(x) with respect to an arbitrary basis.

Definition 3.

(Unstable fixed points) The set of unstable fixed points \mathcal{A}^{\star}_{g} of a mapping g:\mathcal{X}\rightarrow\mathcal{X} is defined as \mathcal{A}^{\star}_{g}=\{x\in\mathcal{X}:g(x)=x,\max_{i}|\lambda_{i}(Dg(x))|>1\}.

Definition 4.

(Strict saddles) The set of strict saddles \mathcal{X}^{\star} of a function f:\mathcal{X}\rightarrow\mathbb{R} is defined as \mathcal{X}^{\star}=\{x^{\star}\in\mathcal{X}:\nabla f(x^{\star})=0,\lambda_{1}(\nabla^{2}f(x^{\star}))<0\}.

We can express NEAR-DGDt as a mapping g:\mathbb{R}^{np}\rightarrow\mathbb{R}^{np},

g(\mathbf{y})=\mathbf{Z}^{t}\mathbf{y}-\alpha\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}),

with Dg(\mathbf{y})=\mathbf{Z}^{t}\left(I-\alpha\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})\right). Let us define the set of unstable fixed points \mathcal{A}_{g}^{\star} of NEAR-DGDt and the set of strict saddles \mathcal{Y}^{\star} of the Lyapunov function (6) following Def. 3 and 4, respectively. Corollary 11 of [20] implies that if \det(Dg(\mathbf{y}))\neq 0 for all \mathbf{y}\in\mathbb{R}^{np} and \mathcal{Y}^{\star}\subset\mathcal{A}^{\star}_{g}, then NEAR-DGDt almost surely avoids the strict saddles of (6). We will show that this is indeed the case in Theorem III.12.
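As an informal numerical sanity check (with entirely hypothetical data, and with a quadratic \mathbf{f} so that its Hessian is constant), the sketch below forms Dg and \nabla^{2}\mathcal{L}_{t}, confirms that \det(Dg)\neq 0 when \alpha<1/L, and confirms that Dg has an eigenvalue exceeding one in magnitude whenever \nabla^{2}\mathcal{L}_{t} has a negative eigenvalue, which is what Theorem III.12 below establishes in general.

```python
import numpy as np

n, alpha, t = 5, 0.04, 1

# Symmetric, doubly stochastic, positive-definite W: lazy Metropolis weights on a ring (odd n).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25
    W[i, i] = 0.5
Zt = np.linalg.matrix_power(W, t)

# Hypothetical quadratic objective with p = 1: f_i(x) = 0.5 * h_i * x^2,
# so the stacked Hessian is H = diag(h) and L = max|h_i|.
h = np.array([-20.0, 0.5, 0.5, 0.5, 0.5])
H = np.diag(h)
assert alpha < 1.0 / np.max(np.abs(h))           # alpha < 1/L

Dg = Zt @ (np.eye(n) - alpha * H)                # differential of the NEAR-DGDt map
hess_L = Zt @ H @ Zt + (Zt @ (np.eye(n) - Zt)) / alpha   # Hessian of (6)

assert abs(np.linalg.det(Dg)) > 1e-12            # Dg is non-singular
if np.linalg.eigvalsh(hess_L).min() < 0:         # strict-saddle-type curvature present
    # ...then the corresponding fixed point of g is unstable: some |eigenvalue| exceeds 1.
    assert np.max(np.abs(np.linalg.eigvals(Dg))) > 1
```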

Theorem III.12.

(Convergence to 2nd order stationary points) Let \{\mathbf{y}_{k}\} be the sequence of iterates generated by NEAR-DGDt under steplength \alpha<1/L. Then if the Lyapunov function \mathcal{L}_{t} satisfies the strict saddle property, \{\mathbf{y}_{k}\} converges almost surely to 2^{nd} order stationary points of \mathcal{L}_{t} under random initialization.

Proof.

We begin this proof by showing that \det(Dg(\mathbf{y}))\neq 0 for every \mathbf{y}\in\mathbb{R}^{np}. Let \lambda_{i}(\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})) be the eigenvalues of the Hessian \nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}). Assumption II.2 implies that \lambda_{i}(\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}))\leq L for all i\in\{1,...,np\}. Using standard properties of the determinant, we obtain \det\left(Dg(\mathbf{y})\right)=\det(\mathbf{Z}^{t})\det(I-\alpha\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}))=\left(\prod_{i}\lambda^{t}_{i}(\mathbf{Z})\right)\left(\prod_{i}(1-\alpha\lambda_{i}(\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})))\right). Thus, \det\left(Dg(\mathbf{y})\right)\neq 0 by the positive-definiteness of \mathbf{Z} and the fact that \alpha<1/L.

We will now confirm that \mathcal{Y}^{\star}\subset\mathcal{A}^{\star}_{g}. Every critical point \mathbf{y}^{\star} of (6) satisfies \nabla\mathcal{L}_{t}(\mathbf{y}^{\star})=\mathbf{0}, namely \mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\star})+\frac{1}{\alpha}\mathbf{Z}^{t}\mathbf{y}^{\star}-\frac{1}{\alpha}\mathbf{Z}^{2t}\mathbf{y}^{\star}=\mathbf{0}. Since \mathbf{Z} is positive-definite and by extension non-singular, we can multiply both sides of the equality above with \alpha\mathbf{Z}^{-t} and re-arrange the resulting terms to obtain \mathbf{y}^{\star}=g(\mathbf{y}^{\star}). Finally, the Hessian of (6) at \mathbf{y}^{\star} is given by,

\begin{split}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})&=\mathbf{Z}^{t}\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\star})\mathbf{Z}^{t}+\frac{1}{\alpha}\mathbf{Z}^{t}(I-\mathbf{Z}^{t})\\ &=\frac{1}{\alpha}\left(I-Dg(\mathbf{y}^{\star})\right)\mathbf{Z}^{t}.\end{split} \quad (14)

We define the matrix \mathbf{P}:=\alpha\mathbf{Z}^{-\frac{t}{2}}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})\mathbf{Z}^{-\frac{t}{2}}. Using the positive-definiteness of \mathbf{Z}, we obtain from (14) I-Dg(\mathbf{y}^{\star})=\alpha\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})\mathbf{Z}^{-t}=\mathbf{Z}^{\frac{t}{2}}\mathbf{P}\mathbf{Z}^{-\frac{t}{2}}, which implies that \left(I-Dg(\mathbf{y}^{\star})\right) and \mathbf{P} are similar matrices and have identical spectra. Moreover, the matrix \mathbf{Z}^{-\frac{t}{2}} is symmetric by Assumption II.1. Hence, \mathbf{P} and \left(\alpha\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})\right) are congruent and by Sylvester's law of inertia [Theorem 4.5.8, [29]] they have the same number of negative eigenvalues. Given that \nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star}) has at least one negative eigenvalue by Def. 4, we conclude that so does \mathbf{P}, and there exists an index i such that 1-\lambda_{i}(Dg(\mathbf{y}^{\star}))<0, i.e. \lambda_{i}(Dg(\mathbf{y}^{\star}))>1. Applying [Corollary 11, [20]] establishes the desired result. ∎

Before we conclude this section, we make one final remark on the asymptotic behavior of NEAR-DGDt as the parameter t becomes large.

Corollary III.13.

(Convergence to SOS) Let \{\mathbf{x}_{k}\} and \{\mathbf{y}_{k}\} be the sequences of NEAR-DGDt iterates produced by (5a) and (5b), respectively, from initial point \mathbf{y}_{0} with t(k)=t\in\mathbb{N}^{+} and steplength \alpha<1/L. Moreover, suppose that \mathbf{y}^{\infty} is the limit point of NEAR-DGDt and let \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}=[(x_{1}^{\infty})^{\prime},...,(x_{n}^{\infty})^{\prime}]^{\prime}. Then, as t\rightarrow\infty, \|x_{i}^{\infty}-x_{j}^{\infty}\|\rightarrow 0 for all i\neq j, and each x_{i}^{\infty}, i\in\mathcal{V}, approaches the 2^{nd} order stationary solutions (SOS) of Problem (1).

Proof.

By Theorems III.8 and III.12, we have \{\mathbf{y}_{k}\}\rightarrow\mathbf{y}^{\infty}, where \mathbf{y}^{\infty} is a minimizer of \mathcal{L}_{t}. Since \mathbf{Z} is non-singular, we also have \{\mathbf{x}_{k}\}\rightarrow\mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}. As t\rightarrow\infty, Lemma III.4 and Corollary III.10 yield \|x_{i}^{\infty}-\bar{x}^{\infty}\|\rightarrow 0 and \|\nabla f(\bar{x}^{\infty})\|\rightarrow 0, respectively, implying that x_{i}^{\infty} and \bar{x}^{\infty} approach each other and the critical points of f. Finally, \nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})\succeq 0 by Theorem III.12, where \nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})=\mathbf{Z}^{t}\nabla^{2}\mathbf{f}(\mathbf{x}^{\infty})\mathbf{Z}^{t}+\alpha^{-1}\mathbf{Z}^{t}(I-\mathbf{Z}^{t}). Multiplying this relation with the matrix \mathbf{M} on both sides, we obtain \mathbf{M}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})\mathbf{M}=\mathbf{M}\nabla^{2}\mathbf{f}(\mathbf{x}^{\infty})\mathbf{M}. As t\rightarrow\infty, Lemma III.4 yields \mathbf{M}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})\mathbf{M}\rightarrow\mathbf{M}\nabla^{2}\mathbf{f}(1_{n}\otimes\bar{x}^{\infty})\mathbf{M}=n^{-2}1_{n}1_{n}^{\prime}\otimes\nabla^{2}f(\bar{x}^{\infty}). Therefore, \nabla^{2}f(\bar{x}^{\infty})\succeq 0 by Sylvester's law of inertia for congruent matrices [Theorem 4.5.8, [29]]. Based on the above, we conclude that NEAR-DGDt approaches the 2^{nd} order stationary solutions of Problem (1) as t\rightarrow\infty. ∎

IV Numerical Results

We evaluate the empirical performance of NEAR-DGD on the following regularized quadratic problem,

\min_{x\in\mathbb{R}^{p}}f(x)=\frac{1}{2}\sum_{i=1}^{n}\left(\|x\|^{2}_{Q^{i}}\right)+\frac{1}{4}\|x\|^{4}_{D_{I}},

where I\in\{1,...,p\} is some positive index and Q^{i}\in\mathbb{R}^{p\times p} and D_{I}\in\mathbb{R}^{p\times p} are diagonal matrices constructed as follows: q^{i}_{jj}<0 if j=I and q^{i}_{jj}>0 otherwise, and D_{I}=c\cdot e_{I}e_{I}^{\prime}, where c>0 is a constant and e_{I} is the indicator vector for the I^{th} element. It is easy to check that f has a unique saddle point at x=\mathbf{0} and two minima at x^{\star}=\pm\frac{1}{c}\left(\sqrt{\sum_{i=1}^{n}-q^{i}_{II}}\right)e_{I}. We can distribute this problem to n nodes by setting f_{i}(x)=\frac{1}{2}\|x\|^{2}_{Q^{i}}+\frac{1}{4n}\|x\|^{4}_{D_{I}}. Moreover, each f_{i} has Lipschitz gradients in any compact subset of \mathbb{R}^{p}.
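A sketch of this test problem (with hypothetical random data, not the exact instances used in our experiments) is given below; it builds the local gradients \nabla f_{i} and numerically confirms that the gradient of f vanishes at the saddle point x=\mathbf{0} and at the two minimizers x^{\star}.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, I, c = 12, 4, 3, 1.0          # I is the (0-based) index of the saddle coordinate

# Diagonal Q^i with q_II < 0 and the remaining diagonal entries positive.
Q = []
for _ in range(n):
    q = rng.uniform(0.0, 1.0, p)
    q[I] = -rng.uniform(0.0, 1.0)
    Q.append(np.diag(q))
D_I = np.zeros((p, p)); D_I[I, I] = c

def grad_fi(i, x):
    # f_i(x) = 0.5 * x'Q^i x + (1/(4n)) * (x'D_I x)^2  =>  grad = Q^i x + (1/n)(x'D_I x) D_I x
    return Q[i] @ x + (x @ D_I @ x) * (D_I @ x) / n

def grad_f(x):
    return sum(grad_fi(i, x) for i in range(n))

# Critical points: the saddle at 0 and the minimizers x* = +/- (1/c) sqrt(sum_i -q_II^i) e_I.
x_star = np.zeros(p)
x_star[I] = np.sqrt(sum(-Q[i][I, I] for i in range(n))) / c
assert np.allclose(grad_f(np.zeros(p)), 0.0)
assert np.allclose(grad_f(x_star), 0.0) and np.allclose(grad_f(-x_star), 0.0)
```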

We set p=I=4 for the purposes of our numerical experiment. The matrices Q^{i} were constructed randomly with q^{i}_{jj}\in(-1,0) for j=I and q^{i}_{jj}\in(0,1) otherwise, and the parameter c of matrix D_{I} was set to 1. We allocated each f_{i} to a unique node in a network of size n=12 with ring graph topology. We tested 6 methods in total, including DGD [1, 5], DOGT (with doubly stochastic consensus matrix) [8], and 4 variants of the NEAR-DGD method: i) NEAR-DGD1 (one consensus round per gradient evaluation), ii) NEAR-DGD5 (5 consensus rounds per gradient evaluation), iii) a variant of NEAR-DGD where the sequence of consensus rounds increases by 1 at every iteration, and to which we will refer as NEAR-DGD+, and iv) a practical variant of NEAR-DGD+, where starting from one consensus round per iteration, we double the number of consensus rounds every 100 gradient evaluations. We will refer to this last modification as NEAR-DGD^{+}_{100}. All algorithms were initialized from the same randomly chosen point in the interval [-1,1]^{np}. The stepsize was manually tuned to \alpha=10^{-1} for all methods.

In Fig. 1, we plot the objective function error f(\bar{x}_{k})-f^{\star}, where f^{\star}=f(x^{\star}) (Fig. 1a), and the distance \|\bar{x}_{k}\| of the average iterates to the saddle point x=\mathbf{0} (Fig. 1b) versus the number of iterations/gradient evaluations for all methods. In Fig. 1a, we observe that convergence accuracy increases with the value of the parameter t of NEAR-DGDt, as predicted by our theoretical results. NEAR-DGD1 performs comparably to DGD, while the two variants of NEAR-DGD paired with increasing sequences of consensus rounds per iteration, i.e. NEAR-DGD+ and NEAR-DGD^{+}_{100}, achieve exact convergence to the optimal value with faster rates compared to NEXT. All methods successfully escape the saddle point of f with approximately the same speed (Fig. 1b). We noticed that the trends in Fig. 1b were very sensitive to small changes in problem parameters and the selection of initial point.

In Fig. 2, we plot the objective function error f(\bar{x}_{k})-f^{\star} versus the cumulative application cost (per node) for all methods, where we calculated the cost per iteration using the framework proposed in [23],

\text{Cost}=c_{c}\times\#\text{Communications}+c_{g}\times\#\text{Computations},

where c_{c} and c_{g} are constants representing the application-specific costs of one communication and one computation operation, respectively. In Fig. 2a, the costs of communication and computation are equal (c_{c}=c_{g}) and NEXT outperforms NEAR-DGD+ and NEAR-DGD^{+}_{100}, since it requires only two communication rounds per gradient evaluation to achieve exact convergence. Conversely, in Fig. 2b, the cost of communication is relatively low compared to the cost of computation (c_{c}=10^{-2}c_{g}). In this case, NEAR-DGD+ converges to the optimal value faster than the remaining methods in terms of total application cost.
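To make this cost model concrete, the short sketch below accumulates the per-node cost of a few consensus schedules over K gradient evaluations; the values of K, c_{c} and c_{g} are placeholders chosen only for illustration.

```python
def cumulative_cost(t_schedule, num_grad_evals, c_c, c_g):
    """Per-node cost after num_grad_evals iterations: each iteration performs
    t(k) communication rounds and one gradient computation."""
    comms = sum(t_schedule(k) for k in range(num_grad_evals))
    return c_c * comms + c_g * num_grad_evals

K = 500
for name, sched in [("NEAR-DGD^1", lambda k: 1),
                    ("NEAR-DGD^5", lambda k: 5),
                    ("NEAR-DGD^+", lambda k: k + 1)]:
    # Example costs: communication as expensive as computation vs. 100x cheaper.
    print(name, cumulative_cost(sched, K, c_c=1.0, c_g=1.0),
                cumulative_cost(sched, K, c_c=0.01, c_g=1.0))
```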

[Figure 1: Distance to $f^{\star}$ (left) and to the saddle point (right); (a) objective function error, (b) distance to $x=\mathbf{0}$ (saddle).]

[Figure 2: Objective function error as a function of cumulative application cost (per node); (a) $c_{g}=1$, $c_{c}=1$, (b) $c_{g}=1$, $c_{c}=10^{-2}$.]

V Conclusion

NEAR-DGD [23] is a distributed first order method that permits adjusting the amounts of computation and communication carried out at each iteration in order to balance convergence accuracy and total application cost. We have extended the analysis of NEAR-DGD$^{t}$, a variant of NEAR-DGD that performs a fixed number of communication rounds, controlled by the parameter $t$, at every iteration, to the nonconvex setting. Given a connected, undirected network with general topology, we have shown that NEAR-DGD$^{t}$ converges to minimizers of a custom Lyapunov function and locally approaches the minimizers of the original objective function as $t$ grows large. Our numerical results confirm our theoretical analysis and indicate that NEAR-DGD can achieve exact convergence to second order stationary points of Problem (1) when the number of consensus rounds increases over time.

References

  • [1] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan 2009.
  • [2] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [3] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [4] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. PP, pp. 1–1, Apr. 2017.
  • [5] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • [6] P. D. Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
  • [7] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimization over time-varying digraphs,” Math. Program., vol. 176, no. 1–2, p. 497–544, Jul. 2019. [Online]. Available: https://doi.org/10.1007/s10107-018-01357-w
  • [8] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” SIAM Journal on Optimization, vol. 30, no. 4, pp. 3029–3068, 2020. [Online]. Available: https://doi.org/10.1137/18M121784X
  • [9] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 38–42.
  • [10] M. Hong, S. Zeng, J. Zhang, and H. Sun, “On the Divergence of Decentralized Non-Convex Optimization,” arXiv:2006.11662 [cs, math], Jun. 2020, arXiv: 2006.11662. [Online]. Available: http://arxiv.org/abs/2006.11662
  • [11] M. Hong, “A Distributed, Asynchronous, and Incremental Algorithm for Nonconvex Optimization: An ADMM Approach,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, p. 11, 2018.
  • [12] M. Hong, M. Razaviyayn, and J. Lee, “Gradient primal-dual algorithm converges to second-order stationary solution for nonconvex distributed optimization over networks,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 2009–2018. [Online]. Available: http://proceedings.mlr.press/v80/hong18a.html
  • [13] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017.
  • [14] D. Hajinezhad, M. Hong, and A. Garcia, “ZONE: Zeroth-Order Nonconvex Multiagent Optimization Over Networks,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 3995–4010, Oct. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8629972/
  • [15] Y. Tang and N. Li, “Distributed zero-order algorithms for nonconvex multi-agent optimization,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 781–786.
  • [16] P. Bianchi and J. Jakubowicz, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 391–405, 2013.
  • [17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/f75526659f31040afeb61cb7133e4e6d-Paper.pdf
  • [18] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “$D^{2}$: Decentralized training over decentralized data,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 4848–4856. [Online]. Available: http://proceedings.mlr.press/v80/tang18a.html
  • [19] H. Sun, S. Lu, and M. Hong, “Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 9217–9228. [Online]. Available: http://proceedings.mlr.press/v119/sun20a.html
  • [20] J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht, “First-order methods almost always avoid strict saddle points,” Mathematical Programming, vol. 176, no. 1, pp. 311–337, Jul. 2019. [Online]. Available: https://doi.org/10.1007/s10107-019-01374-3
  • [21] B. Swenson, R. Murray, S. Kar, and H. V. Poor, “Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima,” arXiv:2003.02818 [math.OC], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2003.02818
  • [22] B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed gradient flow: Nonsmoothness, nonconvexity, and saddle point evasion,” arXiv:2008.05387 [math.OC], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2008.05387
  • [23] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing Communication and Computation in Distributed Optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3141–3155, Aug. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8528465/
  • [24] L. Xiao, S. Boyd, and S.-J. Kim, “Distributed average consensus with least-mean-square deviation,” Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731506001808
  • [25] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods,” Mathematical Programming, vol. 137, no. 1-2, pp. 91–129, Feb. 2013. [Online]. Available: http://link.springer.com/10.1007/s10107-011-0484-9
  • [26] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, “Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality,” Math. Oper. Res., vol. 35, no. 2, p. 438–457, May 2010. [Online]. Available: https://doi.org/10.1287/moor.1100.0449
  • [27] H. Attouch and J. Bolte, “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features,” Math. Program., vol. 116, no. 1–2, p. 5–16, Jan. 2009. [Online]. Available: https://doi.org/10.1007/s10107-007-0133-5
  • [28] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds.   Princeton, NJ: Princeton University Press, 2008.
  • [29] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed.   Cambridge; New York: Cambridge University Press, 2012.