
On the Convergence of NEAR-DGD for Nonconvex Optimization with Second Order Guarantees

Charikleia Iakovidou and Ermin Wei
Department of Electrical and Computer Engineering
Northwestern University
Evanston, IL
Email: chariako@u.northwestern.edu, ermin.wei@northwestern.edu

This work was supported by NSF NRI 2024774.
Abstract

We consider the setting where the nodes of an undirected, connected network collaborate to solve a shared objective modeled as the sum of smooth functions. We assume that each summand is privately known by a unique node. NEAR-DGD is a distributed first order method which permits adjusting the amount of communication between nodes relative to the amount of computation performed locally in order to balance convergence accuracy and total application cost. In this work, we generalize the convergence properties of a variant of NEAR-DGD from the strongly convex to the nonconvex case. Under mild assumptions, we show convergence to minimizers of a custom Lyapunov function. Moreover, we demonstrate that the gap between those minimizers and the second order stationary solutions of the original problem can become arbitrarily small depending on the choice of algorithm parameters. Finally, we accompany our theoretical analysis with a numerical experiment to evaluate the empirical performance of NEAR-DGD in the nonconvex setting.

Index Terms:
distributed optimization, decentralized gradient method, nonconvex optimization, second-order guarantees.

I Introduction

We focus on optimization problems where the cost function can be modeled as a summation of n components,

\min_{x\in\mathbb{R}^{p}}f(x)=\sum_{i=1}^{n}f_{i}(x), \quad (1)

where f:\mathbb{R}^{p}\rightarrow\mathbb{R} is a smooth and (possibly) nonconvex function.

Problems of this type frequently arise in a variety of decentralized systems such as wireless sensor networks, smart grids and systems of autonomous vehicles. A special case of this setting involves a connected, undirected network of n nodes \mathcal{G}(\mathcal{V},\mathcal{E}), where \mathcal{V} and \mathcal{E} denote the sets of nodes and edges, respectively. Each node i\in\mathcal{V} has private access to the cost component f_{i} and maintains a local estimate x_{i} of the global decision variable x. This leads to the following equivalent reformulation of Problem (1),

\min_{x_{i}\in\mathbb{R}^{p}}\sum_{i=1}^{n}f_{i}(x_{i}),\quad\text{s.t. }x_{i}=x_{j},\quad\forall(i,j)\in\mathcal{E}. \quad (2)

One of the first algorithms proposed for the solution of Problem (2) when the functions f_{i} are convex is the Distributed (Sub)Gradient Descent (DGD) method [1], which relies on the combination of two elements: i) local gradient steps on the functions f_{i} and ii) calculations of weighted averages of local and neighbor variables x_{i}. For the remainder of this work, we will be referring to these two procedures as computation and consensus (or communication) steps, respectively. While DGD has been shown to converge to an approximate solution of Problem (2) under constant steplengths, a subset of methods known as gradient tracking algorithms [2, 3, 4] overcomes this limitation by iteratively estimating the average gradient between nodes.

The convergence of DGD when the function f is nonconvex has been studied in [5]. NEXT [6], SONATA [7, 8], xFilter [9] and MAGENTA [10] are some examples of distributed methods that utilize gradient tracking and can handle nonconvex objectives. Other approaches include primal-dual algorithms [11, 12] (we note that primal-dual and gradient tracking algorithms are equivalent in some cases [3]), the perturbed push-sum method [13], zeroth order methods [14, 15], and stochastic gradient algorithms [16, 17, 18, 19].

Providing second order guarantees when Hessian information is not available is a challenging task. As a result, the majority of the works listed in the previous paragraph establish convergence to critical points only. A recent line of research leverages existing results from dynamical systems theory and the structural properties of certain problems (which include matrix factorization, phase retrieval and dictionary learning, among others) to demonstrate that several centralized first order algorithms converge to minimizers almost surely when initialized randomly [20]. Specifically, if the objective function satisfies the strict saddle property, namely, if all critical points are either strict saddles or minimizers, then many first order methods converge to saddles only if they are initialized in a low-dimensional manifold with measure zero. Using similar arguments, almost sure convergence to second order stationary points of Problem (2) is proven in [8] for DOGT, a gradient tracking algorithm for directed networks, and in [12] for the first order primal-dual algorithms GPDA and GADMM. The convergence of DGD with constant steplength to a neighborhood of the minimizers of Problem (2) is also shown in [8]. The conditions under which the Distributed Stochastic Gradient method (D-SGD), and Distributed Gradient Flow (DGF), a continuous-time approximation of DGD, avoid saddle points are studied in [21] and [22], respectively. Finally, the authors of [13] prove almost sure convergence to local minima under the assumption that the objective function has no saddle points.

Given the diversity of distributed systems in terms of computing power, connectivity and energy consumption, among other concerns, the ability to adjust the relative amounts of communication and computation on a case-by-case basis is a desirable attribute for a distributed optimization algorithm. While some existing methods are designed to minimize overall communication load (for instance, the authors of [9] employ Chebyshev polynomials to improve communication complexity), all of the methods listed above perform fixed amounts of computation and communication at every iteration and lack adaptability to heterogeneous environments.

I-A Contributions

In this work, we extend the convergence analysis of the NEAR-DGD method, originally proposed in [23], from the strongly convex to the nonconvex setting. NEAR-DGD is a distributed first order method with a flexible framework, which allows for the exchange of computation with communication in order to reach a target accuracy level while simultaneously maintaining low overall application cost. We design a custom Lyapunov function which captures both progress on Problem (1) and distance to consensus, and demonstrate that under relatively mild assumptions, a variant of NEAR-DGD converges to the set of critical points of this Lyapunov function and to approximate critical points of the function f of Problem (1). Moreover, we show that the distance between the limit points of NEAR-DGD and the critical points of f can become arbitrarily small by appropriate selection of algorithm parameters. Finally, we employ recent results based on dynamical systems theory to prove that the same variant of NEAR-DGD almost surely avoids the saddle points of the Lyapunov function when initialized randomly. Our analysis is shorter and simpler compared to other works due to the convenient form of our Lyapunov function. Taken together, these results imply that NEAR-DGD asymptotically converges to second order stationary solutions of Problem (1) as the values of the algorithm parameters increase.

I-B Notation

In this paper, all vectors are column vectors. We will use the notation v^{\prime} to refer to the transpose of a vector v. The concatenation of local vectors v_{i}\in\mathbb{R}^{p} is denoted by \mathbf{v}=[v_{1}^{\prime},...,v_{n}^{\prime}]^{\prime}\in\mathbb{R}^{np} with a lowercase boldface letter. The average of the vectors v_{1},...,v_{n} contained in \mathbf{v} will be denoted by \bar{v}, i.e. \bar{v}=\frac{1}{n}\sum_{i=1}^{n}v_{i}. We use uppercase boldface letters for matrices and will denote the element in the i^{th} row and j^{th} column of matrix \mathbf{H} with h_{ij}. We will refer to the i^{th} (real) eigenvalue in ascending order (i.e. the 1st is the smallest) of a matrix \mathbf{H} as \lambda_{i}(\mathbf{H}). We use the notations I_{p} and 1_{n} for the identity matrix of dimension p and the vector of ones of dimension n, respectively. We will use \|\cdot\| to denote the l_{2}-norm, i.e. for v\in\mathbb{R}^{p} we have \left\|v\right\|=\sqrt{\sum_{i=1}^{p}\left[v\right]_{i}^{2}}, where \left[v\right]_{i} is the i-th element of v. The inner product of vectors v,u will be denoted by \langle v,u\rangle. The symbol \otimes will denote the Kronecker product operation. Finally, we define the averaging matrix \mathbf{M}:=\left(\frac{1_{n}1^{\prime}_{n}}{n}\otimes I_{p}\right).

I-C Organization

The rest of this paper is organized as follows. We briefly review the NEAR-DGD method and list our assumptions for the rest of this work in Section II. We analyze the convergence properties of NEAR-DGD when the function f of Problem (1) is nonconvex in Section III. Finally, we present the results of a numerical experiment we conducted to assess the empirical performance of NEAR-DGD in the nonconvex setting in Section IV, and conclude this work in Section V.

II The NEAR-DGD method

In this section, we list our assumptions for the remainder of this work and briefly review the NEAR-DGD method, originally proposed for strongly convex optimization in [23]. We first introduce the following compact reformulation of Problem (2),

\min_{\mathbf{x}\in\mathbb{R}^{np}}\mathbf{f}\left(\mathbf{x}\right):=\sum_{i=1}^{n}f_{i}(x_{i}),\quad\text{s.t. }\left(\mathbf{W}\otimes I_{p}\right)\mathbf{x}=\mathbf{x}, \quad (3)

where \mathbf{x}=[x_{1}^{\prime},...,x_{n}^{\prime}]^{\prime}\in\mathbb{R}^{np} is the concatenation of the local variables x_{i}, \mathbf{f}:\mathbb{R}^{np}\rightarrow\mathbb{R}, and \mathbf{W}\in\mathbb{R}^{n\times n} is a matrix satisfying the following condition.

Assumption II.1.

(Consensus matrix) The matrix \mathbf{W}\in\mathbb{R}^{n\times n} has the following properties: i) symmetry, ii) double stochasticity, iii) w_{ij}>0 if and only if (i,j)\in\mathcal{E} or i=j, and w_{ij}=0 otherwise, and iv) positive-definiteness.

We can construct a matrix \tilde{\mathbf{W}} satisfying properties (i)-(iii) of Assumption II.1 by defining its elements to be max degree or Metropolis-Hastings weights [24], for instance. Such matrices are not necessarily positive-definite, so we can further enforce property (iv) using simple linear transformations (for example, we could define \mathbf{W}=(1-\delta)^{-1}(\tilde{\mathbf{W}}-\delta I_{n}), where \delta<\lambda_{1}(\tilde{\mathbf{W}}) is a constant). For the rest of this work, we will be referring to the 2^{nd} largest eigenvalue of \mathbf{W} as \beta, i.e. \beta=\lambda_{n-1}(\mathbf{W}).
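For concreteness, the sketch below (in Python; not part of the original paper) builds Metropolis-Hastings weights for a small hypothetical graph and applies the linear transformation described above to enforce property (iv). The ring graph and the margin used for \delta are illustrative assumptions.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings weights for an undirected graph given by adjacency matrix adj."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()            # each row sums to one
    return W

def make_positive_definite(W_tilde, margin=1e-3):
    """Shift-and-scale so that all eigenvalues become positive (property iv)."""
    delta = np.linalg.eigvalsh(W_tilde).min() - margin   # any delta < lambda_1(W_tilde) works
    return (W_tilde - delta * np.eye(W_tilde.shape[0])) / (1.0 - delta)

# Example: ring graph on 5 nodes (hypothetical instance).
n = 5
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
W = make_positive_definite(metropolis_weights(adj))
assert np.allclose(W.sum(axis=1), 1.0) and np.linalg.eigvalsh(W).min() > 0
```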

We adopt the following standard assumptions for the global function \mathbf{f} of Problem (3).

Assumption II.2.

(Global Lipschitz gradient) The global objective function \mathbf{f}:\mathbb{R}^{np}\rightarrow\mathbb{R} has L-Lipschitz continuous gradients, i.e. \|\nabla\mathbf{f}(\mathbf{x})-\nabla\mathbf{f}(\mathbf{y})\|\leq L\|\mathbf{x}-\mathbf{y}\|,\quad\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{np}.

Assumption II.3.

(Coercivity) The global objective function \mathbf{f}:\mathbb{R}^{np}\rightarrow\mathbb{R} is coercive, i.e. \lim_{k\rightarrow\infty}\mathbf{f}(\mathbf{x}_{k})=\infty for every sequence \{\mathbf{x}_{k}\} such that \|\mathbf{x}_{k}\|\rightarrow\infty.

II-A The NEAR-DGD method

Starting from (arbitrary) initial points y_{i,0}=x_{i,0}, the local iterates of NEAR-DGD at node i\in\mathcal{V} and iteration count k can be expressed as,

x_{i,k}=\sum_{j=1}^{n}w^{t(k)}_{ij}y_{j,k}, \quad (4a)
y_{i,k+1}=x_{i,k}-\alpha\nabla f_{i}\left(x_{i,k}\right), \quad (4b)

where \{t(k)\} is a predefined sequence of consensus rounds per iteration, \alpha>0 is a positive steplength and w_{ij}^{t(k)} is the element in the i^{th} row and j^{th} column of the matrix \mathbf{W}^{t(k)}, resulting from the composition of t(k) consensus operations,

\mathbf{W}^{t(k)}=\underbrace{\mathbf{W}\cdot\mathbf{W}\cdot...\cdot\mathbf{W}}_{t(k)\text{ times}}\in\mathbb{R}^{n\times n}.

The system-wide iterates of NEAR-DGD can be written as,

\mathbf{x}_{k}=\mathbf{Z}^{t(k)}\mathbf{y}_{k}, \quad (5a)
\mathbf{y}_{k+1}=\mathbf{x}_{k}-\alpha\nabla\mathbf{f}(\mathbf{x}_{k}), \quad (5b)

where \mathbf{x}_{k}=[x_{1,k}^{\prime},...,x_{n,k}^{\prime}]^{\prime}\in\mathbb{R}^{np}, \mathbf{y}_{k+1}=[y_{1,k+1}^{\prime},...,y_{n,k+1}^{\prime}]^{\prime}\in\mathbb{R}^{np} and \mathbf{Z}^{t(k)}=(\mathbf{W}^{t(k)}\otimes I_{p})\in\mathbb{R}^{np\times np}.

The sequence of consensus rounds per iteration \{t(k)\} can be suitably chosen to balance convergence accuracy and total cost on a per-application basis. When the functions f_{i} are strongly convex, NEAR-DGD paired with an increasing sequence \{t(k)\} converges to the optimal solution of Problem (3), and achieves exact convergence with geometric rate (in terms of gradient evaluations) when t(k)=k [23].
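The recursion (5a)-(5b) is straightforward to prototype. The sketch below is a minimal illustration, not the authors' implementation; it assumes numpy, a consensus matrix W satisfying Assumption II.1, a list of local gradient callables, and a user-chosen schedule t(k) (all names here are ours).

```python
import numpy as np

def near_dgd(W, grads, y0, alpha, t_schedule, num_iters):
    """Run NEAR-DGD: x_k = (W^{t(k)} (x) I_p) y_k,  y_{k+1} = x_k - alpha * grad f(x_k).

    W          : (n, n) consensus matrix (Assumption II.1)
    grads      : list of n callables, grads[i](x_i) = grad f_i(x_i), with x_i in R^p
    y0         : (n, p) array of initial local variables (row i is y_{i,0})
    t_schedule : callable, t_schedule(k) = number of consensus rounds at iteration k
    """
    y = y0.copy()
    history = []
    for k in range(num_iters):
        # Consensus step (5a): t(k) rounds of averaging, i.e. multiply by W^{t(k)}.
        Wt = np.linalg.matrix_power(W, t_schedule(k))
        x = Wt @ y                                        # (n, p): row i is x_{i,k}
        # Computation step (5b): local gradient step at x_{i,k}.
        g = np.vstack([grads[i](x[i]) for i in range(len(grads))])
        y = x - alpha * g
        history.append(x)
    return history

# Usage sketch (hypothetical quadratic costs f_i(x) = 0.5 * x' A_i x):
# grads = [lambda x, A=A_i: A @ x for A_i in A_list]
# xs = near_dgd(W, grads, y0, alpha=0.1, t_schedule=lambda k: 3, num_iters=500)
# Passing t_schedule=lambda k: k + 1 gives an increasing schedule as in NEAR-DGD+.
```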

III Convergence Analysis

In this section, we present our theoretical results on the convergence of a variant of NEAR-DGD where the number of consensus steps per iteration is fixed, i.e. t(k)=t in (4a) and (5a) for some t\in\mathbb{N}^{+}. We will refer to this method as NEAR-DGDt. First, we introduce the following Lyapunov function, which will play a key role in our analysis,

\mathcal{L}_{t}\left(\mathbf{y}\right)=\mathbf{f}\left(\mathbf{Z}^{t}\mathbf{y}\right)+\frac{1}{2\alpha}\left\|\mathbf{y}\right\|^{2}_{\mathbf{Z}^{t}}-\frac{1}{2\alpha}\left\|\mathbf{y}\right\|^{2}_{\mathbf{Z}^{2t}}. \quad (6)

Using (6), we can express the \mathbf{x}_{k} iterates of NEAR-DGDt as,

\mathbf{x}_{k+1}=\mathbf{x}_{k}-\alpha\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k}\right). \quad (7)
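Relation (7) follows because \nabla\mathcal{L}_{t}(\mathbf{y})=\mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})+\alpha^{-1}(\mathbf{Z}^{t}-\mathbf{Z}^{2t})\mathbf{y}, so that \mathbf{x}_{k}-\alpha\nabla\mathcal{L}_{t}(\mathbf{y}_{k})=\mathbf{Z}^{t}\left(\mathbf{x}_{k}-\alpha\nabla\mathbf{f}(\mathbf{x}_{k})\right)=\mathbf{Z}^{t}\mathbf{y}_{k+1}=\mathbf{x}_{k+1}. The snippet below verifies this identity numerically on a small hypothetical quadratic instance; all problem data are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t, alpha = 5, 3, 2, 0.05

# Hypothetical local costs: f_i(x) = 0.5 * x' A_i x, so grad f_i(x) = A_i x.
A = [np.diag(rng.uniform(0.5, 2.0, p)) for _ in range(n)]

# Symmetric doubly stochastic W (lazy Metropolis weights on a ring; positive definite for odd n).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25
    W[i, i] = 0.5
Z_t = np.kron(np.linalg.matrix_power(W, t), np.eye(p))      # Z^t = W^t (x) I_p

def grad_f(x_stacked):
    x = x_stacked.reshape(n, p)
    return np.concatenate([A[i] @ x[i] for i in range(n)])

def grad_L(y):                                               # gradient of (6)
    Zty = Z_t @ y
    return Z_t @ grad_f(Zty) + (Z_t @ y - Z_t @ (Z_t @ y)) / alpha

y_k = rng.standard_normal(n * p)
x_k = Z_t @ y_k
y_next = x_k - alpha * grad_f(x_k)                           # (5b)
x_next = Z_t @ y_next                                        # (5a)
assert np.allclose(x_next, x_k - alpha * grad_L(y_k))        # identity (7)
```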

We need one more assumption on the geometry of the Lyapunov function \mathcal{L}_{t} to guarantee the convergence of NEAR-DGDt. Namely, we require \mathcal{L}_{t} to be "sharp" around its critical points, up to a reparametrization. This property is formally summarized below.

Definition 1.

(Kurdyka-Łojasiewicz (KL) property) [25] A function h:\mathbb{R}^{p}\rightarrow\mathbb{R}\cup\{+\infty\} has the KL property at x^{\star}\in\text{dom}(\partial h) if there exists \eta\in(0,+\infty], a neighborhood U of x^{\star}, and a continuous concave function \phi:[0,\eta)\rightarrow\mathbb{R}^{+} such that: i) \phi(0)=0, ii) \phi is \mathcal{C}^{1} (continuously differentiable) on (0,\eta), iii) for all s\in(0,\eta), \phi^{\prime}(s)>0, and iv) for all x\in U\cap\{x:h(x^{\star})<h(x)<h(x^{\star})+\eta\}, the KL inequality holds:

\phi^{\prime}(h(x)-h(x^{\star}))\cdot\text{dist}(0,\partial h(x))\geq 1.

Proper lower semicontinuous functions which satisfy the KL inequality at each point of \text{dom}(\partial h) are called KL functions.

Assumption III.1.

(KL Lyapunov function) The Lyapunov function \mathcal{L}_{t}:\mathbb{R}^{np}\rightarrow\mathbb{R} is a KL function.

Assumption III.1 covers a broad range of functions, including real analytic, semialgebraic and globally subanalytic functions (see [26] for more details). For instance, if the function \mathbf{f} is real analytic, then \mathcal{L}_{t} is the sum of real analytic functions and by extension KL.

III-A Convergence to critical points

In this subsection, we demonstrate that the \mathbf{y}_{k} iterates of NEAR-DGDt converge to a critical point of the Lyapunov function \mathcal{L}_{t} (6). We assume that Assumptions II.1-II.3 and Assumption III.1 hold for the rest of this work. We begin our analysis by showing that the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is non-increasing in the following Lemma.

Lemma III.2.

(Sufficient Descent) Let \{\mathbf{y}_{k}\} be the sequence of NEAR-DGDt iterates generated by (5b) and suppose that the steplength \alpha satisfies \alpha<2/L, where L is defined in Assumption II.2. Then the following inequality holds for the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\},

\mathcal{L}_{t}\left(\mathbf{y}_{k+1}\right)\leq\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)-\rho\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2},

where \rho=(2\alpha)^{-1}\min_{i}\left(\lambda^{t}_{i}(\mathbf{Z})\left(1+\left(1-\alpha L\right)\lambda^{t}_{i}(\mathbf{Z})\right)\right)>0.

Proof.

Combining (5a) and (6), we obtain for k\geq 0,

\mathcal{L}_{t}(\mathbf{y}_{k})=\mathbf{f}(\mathbf{x}_{k})+\frac{1}{2\alpha}\langle\mathbf{y}_{k},\mathbf{x}_{k}\rangle-\frac{1}{2\alpha}\|\mathbf{x}_{k}\|^{2}. \quad (8)

Let \mathbf{d}_{x}:=\mathbf{x}_{k+1}-\mathbf{x}_{k}. Assumption II.2 then yields \mathbf{f}(\mathbf{x}_{k+1})\leq\mathbf{f}(\mathbf{x}_{k})+\langle\nabla\mathbf{f}(\mathbf{x}_{k}),\mathbf{d}_{x}\rangle+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}=\mathbf{f}(\mathbf{x}_{k})-\frac{1}{\alpha}\langle\mathbf{y}_{k+1}-\mathbf{x}_{k},\mathbf{d}_{x}\rangle+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}, where we acquire the last equality from (5b). Substituting this relation in (8) applied at the (k+1)^{th} iteration, we obtain,

\begin{split}\mathcal{L}_{t}(\mathbf{y}_{k+1})&\leq\mathbf{f}(\mathbf{x}_{k})-\frac{1}{\alpha}\langle\mathbf{y}_{k+1}-\mathbf{x}_{k},\mathbf{d}_{x}\rangle+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}\\ &\quad+\frac{1}{2\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k+1}\rangle-\frac{1}{2\alpha}\|\mathbf{x}_{k+1}\|^{2}\\ &=\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\langle\mathbf{y}_{k},\mathbf{x}_{k}\rangle+\frac{1}{2\alpha}\|\mathbf{x}_{k}\|^{2}-\frac{1}{\alpha}\langle\mathbf{y}_{k+1}-\mathbf{x}_{k},\mathbf{d}_{x}\rangle\\ &\quad+\frac{L}{2}\|\mathbf{d}_{x}\|^{2}+\frac{1}{2\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k+1}\rangle-\frac{1}{2\alpha}\|\mathbf{x}_{k+1}\|^{2},\end{split}

where we obtain the equality after further application of (8). After setting \mathbf{d}_{y}:=\mathbf{y}_{k+1}-\mathbf{y}_{k} and re-arranging the terms, we obtain,

\begin{split}\mathcal{L}_{t}(\mathbf{y}_{k+1})&\leq\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\langle\mathbf{y}_{k},\mathbf{x}_{k}\rangle+\frac{1}{\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k}\rangle\\ &\quad-\frac{1}{2\alpha}\langle\mathbf{y}_{k+1},\mathbf{x}_{k+1}\rangle-\left(\frac{1}{2\alpha}-\frac{L}{2}\right)\|\mathbf{d}_{x}\|^{2}\\ &=\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\|\mathbf{y}_{k}\|^{2}_{\mathbf{Z}^{t}}+\frac{1}{\alpha}\langle\mathbf{y}_{k+1},\mathbf{y}_{k}\rangle_{\mathbf{Z}^{t}}\\ &\quad-\frac{1}{2\alpha}\|\mathbf{y}_{k+1}\|^{2}_{\mathbf{Z}^{t}}-\left(\frac{1}{2\alpha}-\frac{L}{2}\right)\|\mathbf{d}_{x}\|^{2}\\ &=\mathcal{L}_{t}(\mathbf{y}_{k})-\frac{1}{2\alpha}\|\mathbf{d}_{y}\|^{2}_{\mathbf{Z}^{t}}-\left(\frac{1}{2\alpha}-\frac{L}{2}\right)\|\mathbf{d}_{x}\|^{2}.\end{split}

Let \mathbf{H}:=(2\alpha)^{-1}\mathbf{Z}^{t}\left(I+(1-\alpha L)\mathbf{Z}^{t}\right), which is a positive definite matrix due to Assumption II.1 and the fact that \alpha<2/L. Moreover, \|\mathbf{d}_{x}\|^{2}=\|\mathbf{d}_{y}\|^{2}_{\mathbf{Z}^{2t}} by Eq. (5a). We can then re-write the immediately previous relation as \mathcal{L}_{t}(\mathbf{y}_{k+1})\leq\mathcal{L}_{t}(\mathbf{y}_{k})-\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\|^{2}_{\mathbf{H}}. Applying the definition of \rho=\lambda_{1}(\mathbf{H}) concludes the proof. ∎

An important consequence of Lemma III.2 is that NEAR-DGDt can tolerate a bigger range of steplengths than previously indicated (\alpha<2/L vs. \alpha<1/L in [23]). Moreover, Lemma III.2 implies that the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is upper bounded by \mathcal{L}_{t}(\mathbf{y}_{0}). We use this fact to prove that the iterates of NEAR-DGDt are also bounded in the next Lemma.

Lemma III.3.

(Boundedness) Let \{\mathbf{x}_{k}\} and \{\mathbf{y}_{k}\} be the sequences of NEAR-DGDt (t(k)=t) iterates generated by (5a) and (5b), respectively, from initial point \mathbf{y}_{0} and under steplength \alpha<2/L. Then the following hold: i) the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is lower bounded, and ii) there exist universal positive constants B_{x} and B_{y} such that \|\mathbf{x}_{k}\|\leq B_{x} and \|\mathbf{y}_{k+1}\|\leq B_{y} for all k\geq 0 and t\in\mathbb{N}^{+}.

Proof.

By Assumption II.3, the function \mathbf{f} is lower bounded and therefore \mathcal{L}_{t} is also lower bounded (sum of lower bounded functions). This proves the first claim of this Lemma.

To prove the second claim, we first notice that Lemma III.2 implies that the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is upper bounded by \mathcal{L}_{t}(\mathbf{y}_{0}). Let us define the set \mathcal{X}_{0}:=\{\mathbf{Z}^{t}\mathbf{y}_{0},t\in\mathbb{N}^{+}\}. The set \mathcal{X}_{0} is compact, since \|\mathbf{Z}^{t}\mathbf{y}_{0}\|\leq\|\mathbf{y}_{0}\| for all t\in\mathbb{N}^{+} due to the non-expansiveness of \mathbf{Z}. Hence, by the continuity of \mathbf{f} and the Weierstrass Extreme Value Theorem, there exists \hat{\mathbf{x}}_{0}\in\mathcal{X}_{0} such that \mathbf{f}(\mathbf{x}_{0})\leq\mathbf{f}(\hat{\mathbf{x}}_{0}) for all \mathbf{x}_{0}\in\mathcal{X}_{0}. Moreover, Assumption II.1 yields \|\mathbf{y}_{0}\|^{2}_{\mathbf{Z}^{t}(I-\mathbf{Z}^{t})}\leq\|\mathbf{y}_{0}\|^{2} for all positive integers t, and therefore \mathcal{L}_{t}(\mathbf{y}_{0})\leq\hat{\mathcal{L}} for all t\in\mathbb{N}^{+}, where \hat{\mathcal{L}}=\mathbf{f}(\hat{\mathbf{x}}_{0})+(2\alpha)^{-1}\|\mathbf{y}_{0}\|^{2}.

Since \hat{\mathcal{L}}\geq\mathcal{L}_{t}(\mathbf{y}_{0})\geq\mathcal{L}_{t}(\mathbf{y}_{k})\geq\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}_{k})=\mathbf{f}(\mathbf{x}_{k}) for all k\geq 0 and t>0, the sequence \{\mathbf{f}(\mathbf{x}_{k})\} is upper bounded. Hence, by Assumption II.3, there exists a positive constant B_{x} such that \|\mathbf{x}_{k}\|\leq B_{x} for k\geq 0 and t>0. Moreover, Assumption II.2 yields \mathbf{f}(\mathbf{y}_{k+1})\leq\mathbf{f}(\mathbf{x}_{k})+\langle\nabla\mathbf{f}(\mathbf{x}_{k}),\mathbf{y}_{k+1}-\mathbf{x}_{k}\rangle+\frac{L}{2}\|\mathbf{y}_{k+1}-\mathbf{x}_{k}\|^{2}=\mathbf{f}(\mathbf{x}_{k})-\alpha\|\nabla\mathbf{f}(\mathbf{x}_{k})\|^{2}+\frac{\alpha^{2}L}{2}\|\nabla\mathbf{f}(\mathbf{x}_{k})\|^{2}=\mathbf{f}(\mathbf{x}_{k})-\alpha\left(1-\frac{\alpha L}{2}\right)\|\nabla\mathbf{f}(\mathbf{x}_{k})\|^{2}\leq\mathbf{f}(\mathbf{x}_{k}), where we obtain the first equality from (5b) and the last inequality from the fact that \alpha<2/L. This relation combined with Assumption II.3 implies that there exists a constant B_{y}>0 such that \|\mathbf{y}_{k+1}\|\leq B_{y} for k\geq 0 and t>0, which concludes the proof. ∎

Next, we use Lemma III.3 to show that the distance between the local iterates generated by NEAR-DGDt and their average is bounded.

Lemma III.4.

(Bounded distance to mean) Let x_{i,k} and y_{i,k} be the local NEAR-DGDt iterates produced under steplength \alpha<2/L by (4a) and (4b), respectively, and define the average iterates \bar{x}_{k}:=\frac{1}{n}\sum_{i=1}^{n}x_{i,k} and \bar{y}_{k}:=\frac{1}{n}\sum_{i=1}^{n}y_{i,k}. Then the distance between the local and the average iterates is bounded for i=1,...,n and k=1,2,..., i.e.

\left\|x_{i,k}-\bar{x}_{k}\right\|\leq\beta^{t}B_{y},\text{ and }\left\|y_{i,k}-\bar{y}_{k}\right\|\leq B_{y},

where B_{y} is a positive constant defined in Lemma III.3.

Proof.

Multiplying both sides of (5a) with \mathbf{M}=\left(\frac{1_{n}1_{n}^{\prime}}{n}\otimes I_{p}\right) yields \bar{x}_{k}=\bar{y}_{k}. Moreover, we observe that \left\|\mathbf{v}_{k}-\mathbf{M}\mathbf{v}_{k}\right\|^{2}=\sum_{i=1}^{n}\left\|v_{i,k}-\bar{v}_{k}\right\|^{2} for any vector \mathbf{v}\in\mathbb{R}^{np}. Hence,

\begin{split}\left\|x_{i,k}-\bar{x}_{k}\right\|&=\left\|x_{i,k}-\bar{y}_{k}\right\|\leq\left\|\mathbf{x}_{k}-\mathbf{M}\mathbf{y}_{k}\right\|\\ &\leq\left\|\mathbf{Z}^{t}\mathbf{y}_{k}-\mathbf{M}\mathbf{y}_{k}\right\|\leq\beta^{t}\|\mathbf{y}_{k}\|,\end{split}

where we derive the last inequality from the spectral properties of \mathbf{Z} and \mathbf{M} (we note that the matrix 1_{n}1_{n}^{\prime}/n has a single non-zero eigenvalue at 1, associated with the eigenvector 1_{n}).

Similarly, for the local iterates y_{i,k} we obtain,

\left\|y_{i,k}-\bar{y}_{k}\right\|\leq\left\|\mathbf{y}_{k}-\mathbf{M}\mathbf{y}_{k}\right\|=\left\|\left(I-\mathbf{M}\right)\mathbf{y}_{k}\right\|\leq\|\mathbf{y}_{k}\|.

Applying Lemma III.3 to the two preceding inequalities completes the proof. ∎

We are now ready to state the first Theorem of this section, namely that there exists a subsequence of \{\mathbf{y}_{k}\} that converges to a critical point of \mathcal{L}_{t}.

Theorem III.5.

(Subsequence convergence) Let \{\mathbf{y}_{k}\} be the sequence of NEAR-DGDt iterates generated by (5b) with steplength \alpha<2/L. Then \{\mathbf{y}_{k}\} has a convergent subsequence whose limit point is a critical point of (6).

Proof.

By Lemma III.3, the sequence \{\mathbf{y}_{k}\} is bounded and therefore there exists a convergent subsequence \{\mathbf{y}_{k_{s}}\}_{s\in\mathbb{N}}\rightarrow\mathbf{y}^{\infty} as s\rightarrow\infty. In addition, recursive application of Lemma III.2 over iterations 0,1,...,k yields,

\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\leq\mathcal{L}_{t}\left(\mathbf{y}_{0}\right)-\rho\sum_{j=0}^{k-1}\left\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\right\|^{2},

where the sequence \{\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\} is non-increasing and bounded from below by Lemmas III.2 and III.3.

Hence, \{\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\} converges and the above relation implies that \sum_{k=1}^{\infty}\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}<+\infty and \left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|\rightarrow 0. Moreover, \left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|=\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|_{\mathbf{Z}^{2t}}\leq\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\| by the non-expansiveness of \mathbf{Z}, and thus \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\rightarrow 0. Finally, Eq. (7) yields \|\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\|=\alpha^{-1}\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\rightarrow 0. We conclude that \left\|\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k_{s}}\right)\right\|\rightarrow 0 as s\rightarrow\infty and therefore \nabla\mathcal{L}_{t}\left(\mathbf{y}^{\infty}\right)=\mathbf{0}. ∎

We note that Assumption III.1 is not necessary for Theorem III.5 to hold. However, Theorem III.5 does not guarantee the convergence of NEAR-DGDt; we will need Assumption III.1 to prove that NEAR-DGDt converges in Theorem III.8. Before that, we introduce the following two preliminary Lemmas that hold only under Assumption III.1.

Lemma III.6.

(Bounded difference under the KL property) Let \mathbf{x}_{k} and \mathbf{y}_{k} be the NEAR-DGDt iterates generated by (5a) and (5b), respectively, under steplength \alpha<2/L. Moreover, suppose that the KL inequality with respect to some point \mathbf{y}^{\star}\in\mathbb{R}^{np} holds at \mathbf{y}_{k}, i.e.,

\phi^{\prime}(\mathcal{L}_{t}(\mathbf{y}_{k})-\mathcal{L}_{t}(\mathbf{y}^{\star}))\|\nabla\mathcal{L}_{t}(\mathbf{y}_{k})\|\geq 1. \quad (9)

Then the following relation holds,

\left\|\mathbf{v}_{k+1}-\mathbf{v}_{k}\right\|\leq\frac{1}{\alpha\rho}\left(\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right)\right),

where \|\mathbf{v}_{k+1}-\mathbf{v}_{k}\| can be \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\| or \|\mathbf{y}_{k+1}-\mathbf{y}_{k}\|, and l_{k}:=\mathcal{L}_{t}(\mathbf{y}_{k})-\mathcal{L}_{t}(\mathbf{y}^{\star}).

Proof.

Lemma III.2 yields \rho\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}\leq\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)-\mathcal{L}_{t}\left(\mathbf{y}_{k+1}\right)=l_{k}-l_{k+1} for k\geq 0. We can multiply both sides of this relation with \phi^{\prime}\left(l_{k}\right)>0 to obtain \rho\phi^{\prime}\left(l_{k}\right)\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}\leq-\phi^{\prime}\left(l_{k}\right)\left(l_{k+1}-l_{k}\right)\leq\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right), where we derive the last inequality from the concavity of \phi. In addition, using Eq. (7), we can re-write (9) as \alpha^{-1}\phi^{\prime}(l_{k})\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\geq 1. Combining these relations, we acquire,

\frac{\alpha\rho\left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|^{2}}{\left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|}\leq\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right).

Observing that \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\leq\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\| due to the non-expansiveness of \mathbf{Z} and re-arranging the terms of the relation above yields the final result. ∎

In the next Lemma, we show that if NEAR-DGDt is initialized from an appropriate subset of \mathbb{R}^{np} and Assumption III.1 holds, then the sequence \{\mathbf{y}_{k}\} converges to a critical point of the Lyapunov function \mathcal{L}_{t}.

Lemma III.7.

(Local convergence) Let \{\mathbf{y}_{k}\} be the sequence of iterates generated by (5b) from initial point \mathbf{y}_{0} and with steplength \alpha<2/L. Moreover, let U and \eta be the objects in Def. 1 and suppose that the following relations are satisfied for some point \mathbf{y}^{\star}\in\mathbb{R}^{np},

(\alpha\rho)^{-1}\phi\left(\mathcal{L}_{t}(\mathbf{y}_{0})-\mathcal{L}_{t}(\mathbf{y}^{\star})\right)+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|<r, \quad (10)
\mathcal{L}_{t}(\mathbf{y}^{\star})\leq\mathcal{L}_{t}(\mathbf{y}_{k})<\mathcal{L}_{t}(\mathbf{y}^{\star})+\eta,\quad k\geq 0, \quad (11)

where r is a positive constant and \mathcal{B}(\mathbf{y}^{\star},r)\subset U.

Then the sequence \{\mathbf{y}_{k}\} has finite length, i.e. \sum_{j=0}^{\infty}\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\|<\infty, and converges to a critical point of (6).

Proof.

In the trivial case where \mathcal{L}_{t}(\mathbf{y}_{k})=\mathcal{L}_{t}(\mathbf{y}^{\star}), Lemma III.2 combined with (11) yield \mathcal{L}_{t}(\mathbf{y}_{k+1})=\mathcal{L}_{t}(\mathbf{y}_{k})=\mathcal{L}_{t}(\mathbf{y}^{\star}) and \|\mathbf{y}_{k+1}-\mathbf{y}_{k}\|=0. Let us now assume that \mathcal{L}_{t}(\mathbf{y}_{k})\in\big(\mathcal{L}_{t}(\mathbf{y}^{\star}),\mathcal{L}_{t}(\mathbf{y}^{\star})+\eta\big) and \mathbf{y}_{k}\in\mathcal{B}(\mathbf{y}^{\star},r) up to and including some index \tau\in\mathbb{N}^{+}, which implies that (9) holds for all k\leq\tau. Applying the triangle inequality twice, we obtain,

\begin{split}\|\mathbf{y}_{\tau+1}-\mathbf{y}^{\star}\|&\leq\|\mathbf{y}_{\tau+1}-\mathbf{y}_{0}\|+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|\\ &=\left\|\sum_{j=0}^{\tau}\left(\mathbf{y}_{j+1}-\mathbf{y}_{j}\right)\right\|+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|\\ &\leq\sum_{j=0}^{\tau}\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\|+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|.\end{split}

Application of Lemma III.6 then yields \left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|\leq(\alpha\rho)^{-1}\left(\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right)\right) for k\leq\tau. Substituting this in the preceding relation, we acquire,

\begin{split}\|\mathbf{y}_{\tau+1}-\mathbf{y}^{\star}\|&\leq\frac{1}{\alpha\rho}\left(\phi\left(l_{0}\right)-\phi\left(l_{\tau+1}\right)\right)+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|\\ &\leq\frac{\phi\left(l_{0}\right)}{\alpha\rho}+\|\mathbf{y}_{0}-\mathbf{y}^{\star}\|<r.\end{split}

The above result implies that \mathbf{y}_{\tau+1}\in\mathcal{B}(\mathbf{y}^{\star},r). Given that \|\mathbf{y}_{0}-\mathbf{y}^{\star}\|<r and thus \mathbf{y}_{0}\in\mathcal{B}(\mathbf{y}^{\star},r), we obtain by induction that \mathbf{y}_{k}\in\mathcal{B}(\mathbf{y}^{\star},r) and \left\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\right\|\leq(\alpha\rho)^{-1}\left(\phi\left(l_{k}\right)-\phi\left(l_{k+1}\right)\right) for all k\geq 0. Hence,

\begin{split}\sum_{j=0}^{\infty}\|\mathbf{y}_{j+1}-\mathbf{y}_{j}\|&\leq\frac{1}{\alpha\rho}\sum_{j=0}^{\infty}\left(\phi(l_{j})-\phi(l_{j+1})\right)\\ &\leq\frac{1}{\alpha\rho}\left(\phi(l_{0})-\phi(l_{\infty})\right)\leq\frac{\phi(l_{0})}{\alpha\rho}.\end{split}

Thus, the sequence \{\mathbf{y}_{k}\} has finite length and is Cauchy (convergent), and \{\mathbf{y}_{k}\}\rightarrow\tilde{\mathbf{y}}, where \tilde{\mathbf{y}} is a critical point of (6) by Theorem III.5. ∎

Next, we combine our previous results to prove the global convergence of the \mathbf{y}_{k} iterates of NEAR-DGDt in Theorem III.8.

Theorem III.8.

(Global Convergence) Let \{\mathbf{y}_{k}\} be the sequence of NEAR-DGDt iterates produced by (5b) under steplength \alpha<2/L and let \mathbf{y}^{\infty} be a limit point of a convergent subsequence of \{\mathbf{y}_{k}\} as defined in Theorem III.5.

Then under Assumption III.1 the following statements hold: i) there exists an index k_{0}\in\mathbb{N}^{+} such that the KL inequality with respect to \mathbf{y}^{\infty} holds for all k\geq k_{0}, and ii) the sequence \{\mathbf{y}_{k}\} converges to \mathbf{y}^{\infty}.

Proof.

We first observe that by Lemma III.2 the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\} is non-increasing, and therefore \mathcal{L}_{t}(\mathbf{y}^{\infty})\leq\mathcal{L}_{t}(\mathbf{y}_{k}) for all k\geq 0. If Assumption III.1 holds, then the objects U and \eta in Def. 1 exist and by the continuity of \phi, it is possible to find an index k_{0} that satisfies the following relations,

(\alpha\rho)^{-1}\phi\left(\mathcal{L}_{t}(\mathbf{y}_{k_{0}})-\mathcal{L}_{t}(\mathbf{y}^{\infty})\right)+\|\mathbf{y}_{k_{0}}-\mathbf{y}^{\infty}\|<r,
\mathcal{L}_{t}(\mathbf{y}_{k})\in[\mathcal{L}_{t}(\mathbf{y}^{\infty}),\mathcal{L}_{t}(\mathbf{y}^{\infty})+\eta),\quad\forall k\geq k_{0},

where \mathcal{B}(\mathbf{y}^{\infty},r)\subset U.

Applying Lemma III.7 to the sequence \{\mathbf{y}_{k}\}_{k\geq k_{0}} with \mathbf{y}^{\star}=\mathbf{y}^{\infty} establishes the convergence of \{\mathbf{y}_{k}\}. Finally, since \mathbf{y}^{\infty} is the limit point of a subsequence of \{\mathbf{y}_{k}\} and \{\mathbf{y}_{k}\} is convergent, we conclude that \{\mathbf{y}_{k}\}\rightarrow\mathbf{y}^{\infty}. ∎

Since \mathbf{Z} is a non-singular matrix, Theorem III.8 implies that the sequence \{\mathbf{x}_{k}\} also converges. Moreover, using arguments similar to [27], we can prove the following result on the convergence rate of \{\mathbf{x}_{k}\}.

Lemma III.9.

(Rates) Let \{\mathbf{x}_{k}\} be the sequence of iterates produced by (5a), \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty} where \mathbf{y}^{\infty} is the limit point of the sequence \{\mathbf{y}_{k}\}, and suppose \phi(s)=cs^{1-\theta} in Assumption III.1 for some constant c>0 and \theta\in[0,1) (for a discussion on \phi, we direct readers to [26]). Then the following hold:

  1. If \theta=0, \{\mathbf{x}_{k}\} converges in a finite number of iterations.

  2. If \theta\in(0,1/2], then constants c>0 and Q\in[0,1) exist such that \|\mathbf{x}_{k}-\mathbf{x}^{\infty}\|\leq cQ^{k}.

  3. If \theta\in(1/2,1), then there exists a constant c>0 such that \|\mathbf{x}_{k}-\mathbf{x}^{\infty}\|\leq ck^{-\frac{1-\theta}{2\theta-1}}.

Proof.

i) \theta=0: From the definition of \phi and \theta=0 we have \phi^{\prime}(l_{k})=c(1-\theta)l_{k}^{-\theta}=c. Let I:=\{k\in\mathbb{N}:\mathbf{x}_{k+1}\neq\mathbf{x}_{k}\} (by the non-singularity of \mathbf{Z}, it also follows that \mathbf{y}_{k+1}\neq\mathbf{y}_{k} for k\in I). Then for large k the KL inequality holds at \mathbf{y}_{k} and we obtain \|\nabla\mathcal{L}_{t}(\mathbf{y}_{k})\|\geq c^{-1}, or equivalently by (7), \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\geq\alpha c^{-1}. Application of Lemma III.2 combined with the fact that \|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|\leq\|\mathbf{y}_{k+1}-\mathbf{y}_{k}\| yields \mathcal{L}_{t}(\mathbf{y}_{k+1})\leq\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)-\rho\alpha^{2}c^{-2}. Given the convergence of the sequence \{\mathcal{L}_{t}(\mathbf{y}_{k})\}, we conclude that the set I is finite and the method converges in a finite number of steps.

ii) \theta\in(0,1): Let S_{k}:=\sum_{j=k}^{\infty}\|\mathbf{x}_{j+1}-\mathbf{x}_{j}\|, where \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}. Since \|\mathbf{x}_{k}-\mathbf{x}^{\infty}\|\leq S_{k}, it suffices to bound S_{k}. Using Lemma III.6 with \mathbf{y}^{\star}=\mathbf{y}^{\infty} and for k\geq k_{0}, where k_{0} is defined in Theorem III.8, we obtain,

S_{k}\leq\frac{1}{\alpha\rho}\sum_{j=k}^{\infty}\left(\phi(l_{j})-\phi(l_{j+1})\right)=\frac{1}{\alpha\rho}\phi(l_{k})=\frac{1}{\nu}l_{k}^{1-\theta}, \quad (12)

where \nu=\alpha\rho/c.

Moreover, Eq. (7) yields \left\|\nabla\mathcal{L}_{t}\left(\mathbf{y}_{k}\right)\right\|=\alpha^{-1}\left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|=\alpha^{-1}\left(S_{k}-S_{k+1}\right). Using this relation and the definition of \phi, we can express the KL inequality as,

\mu l_{k}^{-\theta}\left(S_{k}-S_{k+1}\right)\geq 1, \quad (13)

where \mu=\alpha^{-1}c(1-\theta).

If \theta\in(0,1/2], raising both sides of the preceding inequality to the power of \gamma=\frac{1-\theta}{\theta}>1 and re-arranging the terms yields \mu^{\gamma}\left(S_{k}-S_{k+1}\right)^{\gamma}\geq l_{k}^{1-\theta}. Due to the fact that S_{k}-S_{k+1}=\alpha\|\nabla\mathcal{L}_{t}(\mathbf{y}_{k})\|\rightarrow 0, there exists some index k such that S_{k}-S_{k+1}>\left(S_{k}-S_{k+1}\right)^{\gamma} and \mu^{\gamma}\left(S_{k}-S_{k+1}\right)\geq l_{k}^{1-\theta}. Combining this relation with (12), we obtain \nu S_{k}\leq\mu^{\gamma}\left(S_{k}-S_{k+1}\right)\Leftrightarrow S_{k+1}\leq\left(1-\frac{\nu}{\mu^{\gamma}}\right)S_{k}.

If \theta\in(1/2,1), raising both sides of (12) to the power of \theta/(1-\theta)>1 yields S_{k}^{\theta/(1-\theta)}\leq\nu^{-\theta/(1-\theta)}l_{k}^{\theta}. After substituting this relation in (13) and re-arranging we obtain 1\leq C\left(S_{k}-S_{k+1}\right)(S_{k}^{\theta/(1-\theta)})^{-1}, where C=\mu\nu^{-\theta/(1-\theta)}. Define h:(0,+\infty)\rightarrow\mathbb{R} to be h(s)=s^{-\theta/(1-\theta)}. The preceding relation then yields 1\leq C(S_{k}-S_{k+1})h(S_{k})\leq C\int_{S_{k+1}}^{S_{k}}h(s)ds=C\zeta^{-1}\left(S_{k}^{\zeta}-S_{k+1}^{\zeta}\right), where \zeta=(1-2\theta)/(1-\theta)<0. After setting \tilde{C}=-C^{-1}\zeta>0 and re-arranging, we obtain \tilde{C}\leq S^{\zeta}_{k+1}-S^{\zeta}_{k}. Summing this relation over iterations k=k_{0},...,K-1 yields (K-k_{0})\tilde{C}\leq S_{K}^{\zeta}-S_{k_{0}}^{\zeta}\Leftrightarrow S_{K}\leq\left((K-k_{0})\tilde{C}+S_{k_{0}}^{\zeta}\right)^{1/\zeta}\leq cK^{1/\zeta}, for some c>0. ∎

We conclude this subsection with one more result on the distance to optimality of the local x_{i,k} iterates of NEAR-DGDt and their average \bar{x}_{k}=\frac{1}{n}\sum_{i=1}^{n}x_{i,k} as k\rightarrow\infty.

Corollary III.10.

(Distance to optimality) Suppose that \{\mathbf{y}_{k}\}\rightarrow\mathbf{y}^{\infty} and let \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}. Moreover, let \bar{x}^{\infty}=\bar{y}^{\infty}=\frac{1}{n}\sum_{i=1}^{n}x^{\infty}_{i}. Then \bar{x}^{\infty} is an approximate critical point of f,

\left\|\nabla f(\bar{x}^{\infty})\right\|\leq\beta^{t}\sqrt{n}LB_{y},

where B_{y} is a positive constant defined in Lemma III.3.

Proof.

We observe that \mathbf{M}\nabla\mathbf{f}(\mathbf{M}\mathbf{y}^{\infty})=\frac{1}{n}\cdot 1_{n}\otimes\nabla f(\bar{y}^{\infty}) and hence \|\mathbf{M}\nabla\mathbf{f}(\mathbf{M}\mathbf{y}^{\infty})\|=n^{-1}\|1_{n}\otimes\nabla f(\bar{y}^{\infty})\|=(\sqrt{n})^{-1}\|\nabla f(\bar{y}^{\infty})\|, where we obtain the last equality due to the fact that \|1_{n}\otimes v\|^{2}=n\|v\|^{2} for any vector v.

Moreover, \mathbf{y}^{\infty} is a critical point of (6) and therefore satisfies \nabla\mathcal{L}_{t}(\mathbf{y}^{\infty})=\mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\infty})+\frac{1}{\alpha}\mathbf{Z}^{t}\mathbf{y}^{\infty}-\frac{1}{\alpha}\mathbf{Z}^{2t}\mathbf{y}^{\infty}=\mathbf{0}. From the double stochasticity of \mathbf{Z}, multiplying the above relation with \mathbf{M} yields \mathbf{M}\nabla\mathcal{L}_{t}(\mathbf{y}^{\infty})=\mathbf{M}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\infty})=\mathbf{0}. After combining all the preceding results, we obtain,

\begin{split}\|\nabla f(\bar{x}^{\infty})\|&=\sqrt{n}\|\mathbf{M}\nabla\mathbf{f}(\mathbf{M}\mathbf{y}^{\infty})-\mathbf{M}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\infty})\|\\ &\leq\sqrt{n}L\|\mathbf{M}\mathbf{y}^{\infty}-\mathbf{Z}^{t}\mathbf{y}^{\infty}\|\leq\beta^{t}\sqrt{n}L\|\mathbf{y}^{\infty}\|,\end{split}

where we used the spectral properties of \mathbf{M} and Assumption II.2 to get the first inequality and the spectral properties of \mathbf{Z} to get the second inequality. Applying Lemma III.3 yields the result of this Corollary. ∎

III-B Second order guarantees

In this subsection, we provide second order guarantees for the NEAR-DGDt method. Specifically, using recent results stemming from dynamical systems theory, we will prove that NEAR-DGDt almost surely avoids the strict saddles of the Lyapunov function \mathcal{L}_{t} when initialized randomly. Hence, if \mathcal{L}_{t} satisfies the strict saddle property, NEAR-DGDt converges to minima of \mathcal{L}_{t} with probability 1. We begin by listing a number of additional assumptions and definitions.

Assumption III.11.

(Differentiability) The function \mathbf{f} is \mathcal{C}^{2}.

Assumption III.11 implies that the function \mathcal{L}_{t} is also \mathcal{C}^{2}.

Definition 2.

(Differential of a mapping) [Ch. 3, [28]] The differential of a mapping g:\mathcal{X}\rightarrow\mathcal{X}, denoted as Dg(x), is a linear operator from \mathcal{T}(x)\rightarrow\mathcal{T}(g(x)), where \mathcal{T}(x) is the tangent space of \mathcal{X} at point x. Given a curve \gamma in \mathcal{X} with \gamma(0)=x and \frac{d\gamma}{dt}(0)=v\in\mathcal{T}(x), the linear operator is defined as Dg(x)v=\frac{d(g\circ\gamma)}{dt}(0)\in\mathcal{T}(g(x)). The determinant of the linear operator \det(Dg(x)) is the determinant of the matrix representing Dg(x) with respect to an arbitrary basis.

Definition 3.

(Unstable fixed points) The set of unstable fixed points \mathcal{A}^{\star}_{g} of a mapping g:\mathcal{X}\rightarrow\mathcal{X} is defined as \mathcal{A}^{\star}_{g}=\{x\in\mathcal{X}:g(x)=x,\max_{i}|\lambda_{i}(Dg(x))|>1\}.

Definition 4.

(Strict saddles) The set of strict saddles \mathcal{X}^{\star} of a function f:\mathcal{X}\rightarrow\mathbb{R} is defined as \mathcal{X}^{\star}=\{x^{\star}\in\mathcal{X}:\nabla f(x^{\star})=0,\lambda_{1}(\nabla^{2}f(x^{\star}))<0\}.

We can express NEAR-DGDt as a mapping g:\mathbb{R}^{np}\rightarrow\mathbb{R}^{np},

g(\mathbf{y})=\mathbf{Z}^{t}\mathbf{y}-\alpha\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}),

with Dg(\mathbf{y})=\mathbf{Z}^{t}\left(I-\alpha\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})\right). Let us define the set of unstable fixed points \mathcal{A}_{g}^{\star} of NEAR-DGDt and the set of strict saddles \mathcal{Y}^{\star} of the Lyapunov function (6) following Def. 3 and 4, respectively. Corollary 11 of [20] implies that if \det(Dg(\mathbf{y}))\neq 0 for all \mathbf{y}\in\mathbb{R}^{np} and \mathcal{Y}^{\star}\subset\mathcal{A}^{\star}_{g}, then NEAR-DGDt almost surely avoids the strict saddles of (6). We will show that this is indeed the case in Theorem III.12.
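As an informal numerical sanity check (with entirely hypothetical data, and with a quadratic \mathbf{f} so that its Hessian is constant), the sketch below forms Dg and \nabla^{2}\mathcal{L}_{t}, confirms that \det(Dg)\neq 0 when \alpha<1/L, and confirms that Dg has an eigenvalue exceeding one in magnitude whenever \nabla^{2}\mathcal{L}_{t} has a negative eigenvalue, which is what Theorem III.12 below establishes in general.

```python
import numpy as np

n, alpha, t = 5, 0.04, 1

# Symmetric, doubly stochastic, positive-definite W: lazy Metropolis weights on a ring (odd n).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25
    W[i, i] = 0.5
Zt = np.linalg.matrix_power(W, t)

# Hypothetical quadratic objective with p = 1: f_i(x) = 0.5 * h_i * x^2,
# so the stacked Hessian is H = diag(h) and L = max|h_i|.
h = np.array([-20.0, 0.5, 0.5, 0.5, 0.5])
H = np.diag(h)
assert alpha < 1.0 / np.max(np.abs(h))           # alpha < 1/L

Dg = Zt @ (np.eye(n) - alpha * H)                # differential of the NEAR-DGDt map
hess_L = Zt @ H @ Zt + (Zt @ (np.eye(n) - Zt)) / alpha   # Hessian of (6)

assert abs(np.linalg.det(Dg)) > 1e-12            # Dg is non-singular
if np.linalg.eigvalsh(hess_L).min() < 0:         # strict-saddle-type curvature present
    # ...then the corresponding fixed point of g is unstable: some |eigenvalue| exceeds 1.
    assert np.max(np.abs(np.linalg.eigvals(Dg))) > 1
```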

Theorem III.12.

(Convergence to 2nd order stationary points) Let \{\mathbf{y}_{k}\} be the sequence of iterates generated by NEAR-DGDt under steplength \alpha<1/L. Then if the Lyapunov function \mathcal{L}_{t} satisfies the strict saddle property, \{\mathbf{y}_{k}\} converges almost surely to 2^{nd} order stationary points of \mathcal{L}_{t} under random initialization.

Proof.

We begin this proof by showing that \det(Dg(\mathbf{y}))\neq 0 for every \mathbf{y}\in\mathbb{R}^{np}. Let \lambda_{i}(\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})) be the eigenvalues of the Hessian \nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}). Assumption II.2 implies that \lambda_{i}(\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}))\leq L for all i\in\{1,...,np\}. Using standard properties of the determinant, we obtain \det\left(Dg(\mathbf{y})\right)=\det(\mathbf{Z}^{t})\det(I-\alpha\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}))=\left(\prod_{i}\lambda^{t}_{i}(\mathbf{Z})\right)\left(\prod_{i}(1-\alpha\lambda_{i}(\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y})))\right). Thus, \det\left(Dg(\mathbf{y})\right)\neq 0 by the positive-definiteness of \mathbf{Z} and the fact that \alpha<1/L.

We will now confirm that \mathcal{Y}^{\star}\subset\mathcal{A}^{\star}_{g}. Every critical point \mathbf{y}^{\star} of (6) satisfies \nabla\mathcal{L}_{t}(\mathbf{y}^{\star})=\mathbf{0}, namely \mathbf{Z}^{t}\nabla\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\star})+\frac{1}{\alpha}\mathbf{Z}^{t}\mathbf{y}^{\star}-\frac{1}{\alpha}\mathbf{Z}^{2t}\mathbf{y}^{\star}=\mathbf{0}. Since \mathbf{Z} is positive-definite and by extension non-singular, we can multiply both sides of the equality above with \alpha\mathbf{Z}^{-t} and re-arrange the resulting terms to obtain \mathbf{y}^{\star}=g(\mathbf{y}^{\star}). Finally, the Hessian of (6) at \mathbf{y}^{\star} is given by,

\begin{split}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})&=\mathbf{Z}^{t}\nabla^{2}\mathbf{f}(\mathbf{Z}^{t}\mathbf{y}^{\star})\mathbf{Z}^{t}+\frac{1}{\alpha}\mathbf{Z}^{t}(I-\mathbf{Z}^{t})\\ &=\frac{1}{\alpha}\left(I-Dg(\mathbf{y}^{\star})\right)\mathbf{Z}^{t}.\end{split} \quad (14)

We define the matrix \mathbf{P}:=\alpha\mathbf{Z}^{-\frac{t}{2}}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})\mathbf{Z}^{-\frac{t}{2}}. Using the positive-definiteness of \mathbf{Z}, we obtain from (14) I-Dg(\mathbf{y}^{\star})=\alpha\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})\mathbf{Z}^{-t}=\mathbf{Z}^{\frac{t}{2}}\mathbf{P}\mathbf{Z}^{-\frac{t}{2}}, which implies that \left(I-Dg(\mathbf{y}^{\star})\right) and \mathbf{P} are similar matrices and have identical spectra. Moreover, the matrix \mathbf{Z}^{-\frac{t}{2}} is symmetric by Assumption II.1. Hence, \mathbf{P} and \left(\alpha\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star})\right) are congruent and by Sylvester's law of inertia [Theorem 4.5.8, [29]] they have the same number of negative eigenvalues. Given that \nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\star}) has at least one negative eigenvalue by Def. 4, we conclude that so does \mathbf{P}, and there exists an index i such that 1-\lambda_{i}(Dg(\mathbf{y}^{\star}))<0, i.e. \lambda_{i}(Dg(\mathbf{y}^{\star}))>1. Applying [Corollary 11, [20]] establishes the desired result. ∎

Before we conclude this section, we make one final remark on the asymptotic behavior of NEAR-DGDt as the parameter t becomes large.

Corollary III.13.

(Convergence to SOS) Let \{\mathbf{x}_{k}\} and \{\mathbf{y}_{k}\} be the sequences of NEAR-DGDt iterates produced by (5a) and (5b), respectively, from initial point \mathbf{y}_{0} with t(k)=t\in\mathbb{N}^{+} and steplength \alpha<1/L. Moreover, suppose that \mathbf{y}^{\infty} is the limit point of NEAR-DGDt and let \mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}=[(x_{1}^{\infty})^{\prime},...,(x_{n}^{\infty})^{\prime}]^{\prime}. Then, as t\rightarrow\infty, \|x_{i}^{\infty}-x_{j}^{\infty}\|\rightarrow 0 for all i\neq j, and each x_{i}^{\infty}, i\in\mathcal{V}, approaches the 2^{nd} order stationary solutions (SOS) of Problem (1).

Proof.

By Theorems III.8 and III.12, we have \{\mathbf{y}_{k}\}\rightarrow\mathbf{y}^{\infty}, where \mathbf{y}^{\infty} is a minimizer of \mathcal{L}_{t}. Since \mathbf{Z} is non-singular, we also have \{\mathbf{x}_{k}\}\rightarrow\mathbf{x}^{\infty}=\mathbf{Z}^{t}\mathbf{y}^{\infty}. As t\rightarrow\infty, Lemma III.4 and Corollary III.10 yield \|x_{i}^{\infty}-\bar{x}^{\infty}\|\rightarrow 0 and \|\nabla f(\bar{x}^{\infty})\|\rightarrow 0, respectively, implying that x_{i}^{\infty} and \bar{x}^{\infty} approach each other and the critical points of f. Finally, \nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})\succeq 0 by Theorem III.12, where \nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})=\mathbf{Z}^{t}\nabla^{2}\mathbf{f}(\mathbf{x}^{\infty})\mathbf{Z}^{t}+\alpha^{-1}\mathbf{Z}^{t}(I-\mathbf{Z}^{t}). Multiplying this relation with the matrix \mathbf{M} on both sides, we obtain \mathbf{M}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})\mathbf{M}=\mathbf{M}\nabla^{2}\mathbf{f}(\mathbf{x}^{\infty})\mathbf{M}. As t\rightarrow\infty, Lemma III.4 yields \mathbf{M}\nabla^{2}\mathcal{L}_{t}(\mathbf{y}^{\infty})\mathbf{M}\rightarrow\mathbf{M}\nabla^{2}\mathbf{f}(1_{n}\otimes\bar{x}^{\infty})\mathbf{M}=n^{-2}1_{n}1_{n}^{\prime}\otimes\nabla^{2}f(\bar{x}^{\infty}). Therefore, \nabla^{2}f(\bar{x}^{\infty})\succeq 0 by Sylvester's law of inertia for congruent matrices [Theorem 4.5.8, [29]]. Based on the above, we conclude that NEAR-DGDt approaches the 2^{nd} order stationary solutions of Problem (1) as t\rightarrow\infty. ∎

IV Numerical Results

We evaluate the empirical performance of NEAR-DGD on the following regularized quadratic problem,

\min_{x\in\mathbb{R}^{p}}f(x)=\frac{1}{2}\sum_{i=1}^{n}\left(\|x\|^{2}_{Q^{i}}\right)+\frac{1}{4}\|x\|^{4}_{D_{I}},

where I\in\{1,...,p\} is some positive index and Q^{i}\in\mathbb{R}^{p\times p} and D_{I}\in\mathbb{R}^{p\times p} are diagonal matrices constructed as follows: q^{i}_{jj}<0 if j=I and q^{i}_{jj}>0 otherwise, and D_{I}=c\cdot e_{I}e_{I}^{\prime}, where c>0 is a constant and e_{I} is the indicator vector for the I^{th} element. It is easy to check that f has a unique saddle point at x=\mathbf{0} and two minima at x^{\star}=\pm\frac{1}{c}\left(\sqrt{\sum_{i=1}^{n}-q^{i}_{II}}\right)e_{I}. We can distribute this problem to n nodes by setting f_{i}(x)=\frac{1}{2}\|x\|^{2}_{Q^{i}}+\frac{1}{4n}\|x\|^{4}_{D_{I}}. Moreover, each f_{i} has Lipschitz gradients in any compact subset of \mathbb{R}^{p}.
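A sketch of this test problem (with hypothetical random data, not the exact instances used in our experiments) is given below; it builds the local gradients \nabla f_{i} and numerically confirms that the gradient of f vanishes at the saddle point x=\mathbf{0} and at the two minimizers x^{\star}.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, I, c = 12, 4, 3, 1.0          # I is the (0-based) index of the saddle coordinate

# Diagonal Q^i with q_II < 0 and the remaining diagonal entries positive.
Q = []
for _ in range(n):
    q = rng.uniform(0.0, 1.0, p)
    q[I] = -rng.uniform(0.0, 1.0)
    Q.append(np.diag(q))
D_I = np.zeros((p, p)); D_I[I, I] = c

def grad_fi(i, x):
    # f_i(x) = 0.5 * x'Q^i x + (1/(4n)) * (x'D_I x)^2  =>  grad = Q^i x + (1/n)(x'D_I x) D_I x
    return Q[i] @ x + (x @ D_I @ x) * (D_I @ x) / n

def grad_f(x):
    return sum(grad_fi(i, x) for i in range(n))

# Critical points: the saddle at 0 and the minimizers x* = +/- (1/c) sqrt(sum_i -q_II^i) e_I.
x_star = np.zeros(p)
x_star[I] = np.sqrt(sum(-Q[i][I, I] for i in range(n))) / c
assert np.allclose(grad_f(np.zeros(p)), 0.0)
assert np.allclose(grad_f(x_star), 0.0) and np.allclose(grad_f(-x_star), 0.0)
```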

We set p=I=4 for the purposes of our numerical experiment. The matrices Q^{i} were constructed randomly with q^{i}_{jj}\in(-1,0) for j=I and q^{i}_{jj}\in(0,1) otherwise, and the parameter c of matrix D_{I} was set to 1. We allocated each f_{i} to a unique node in a network of size n=12 with ring graph topology. We tested 6 methods in total, including DGD [1, 5], DOGT (with doubly stochastic consensus matrix) [8], and 4 variants of the NEAR-DGD method: i) NEAR-DGD1 (one consensus round per gradient evaluation), ii) NEAR-DGD5 (5 consensus rounds per gradient evaluation), iii) a variant of NEAR-DGD where the sequence of consensus rounds increases by 1 at every iteration, and to which we will refer as NEAR-DGD+, and iv) a practical variant of NEAR-DGD+, where starting from one consensus round per iteration, we double the number of consensus rounds every 100 gradient evaluations. We will refer to this last modification as NEAR-DGD^{+}_{100}. All algorithms were initialized from the same randomly chosen point in the interval [-1,1]^{np}. The stepsize was manually tuned to \alpha=10^{-1} for all methods.

In Fig. 1, we plot the objective function error f(\bar{x}_{k})-f^{\star}, where f^{\star}=f(x^{\star}) (Fig. 1a), and the distance \|\bar{x}_{k}\| of the average iterates to the saddle point x=\mathbf{0} (Fig. 1b) versus the number of iterations/gradient evaluations for all methods. In Fig. 1a, we observe that convergence accuracy increases with the value of the parameter t of NEAR-DGDt, as predicted by our theoretical results. NEAR-DGD1 performs comparably to DGD, while the two variants of NEAR-DGD paired with increasing sequences of consensus rounds per iteration, i.e. NEAR-DGD+ and NEAR-DGD^{+}_{100}, achieve exact convergence to the optimal value with faster rates compared to NEXT. All methods successfully escape the saddle point of f with approximately the same speed (Fig. 1b). We noticed that the trends in Fig. 1b were very sensitive to small changes in problem parameters and the selection of initial point.

In Fig. 2, we plot the objective function error f(\bar{x}_{k})-f^{\star} versus the cumulative application cost (per node) for all methods, where we calculated the cost per iteration using the framework proposed in [23],

\text{Cost}=c_{c}\times\#\text{Communications}+c_{g}\times\#\text{Computations},

where c_{c} and c_{g} are constants representing the application-specific costs of one communication and one computation operation, respectively. In Fig. 2a, the costs of communication and computation are equal (c_{c}=c_{g}) and NEXT outperforms NEAR-DGD+ and NEAR-DGD^{+}_{100}, since it requires only two communication rounds per gradient evaluation to achieve exact convergence. Conversely, in Fig. 2b, the cost of communication is relatively low compared to the cost of computation (c_{c}=10^{-2}c_{g}). In this case, NEAR-DGD+ converges to the optimal value faster than the remaining methods in terms of total application cost.
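To make this cost model concrete, the short sketch below accumulates the per-node cost of a few consensus schedules over K gradient evaluations; the values of K, c_{c} and c_{g} are placeholders chosen only for illustration.

```python
def cumulative_cost(t_schedule, num_grad_evals, c_c, c_g):
    """Per-node cost after num_grad_evals iterations: each iteration performs
    t(k) communication rounds and one gradient computation."""
    comms = sum(t_schedule(k) for k in range(num_grad_evals))
    return c_c * comms + c_g * num_grad_evals

K = 500
for name, sched in [("NEAR-DGD^1", lambda k: 1),
                    ("NEAR-DGD^5", lambda k: 5),
                    ("NEAR-DGD^+", lambda k: k + 1)]:
    # Example costs: communication as expensive as computation vs. 100x cheaper.
    print(name, cumulative_cost(sched, K, c_c=1.0, c_g=1.0),
                cumulative_cost(sched, K, c_c=0.01, c_g=1.0))
```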

[Figure 1: Distance to $f^{\star}$ (left) and to the saddle point (right); (a) objective function error, (b) distance to $x=\mathbf{0}$ (saddle).]

[Figure 2: Objective function error as a function of cumulative application cost (per node); (a) $c_{g}=1$, $c_{c}=1$, (b) $c_{g}=1$, $c_{c}=10^{-2}$.]

V Conclusion

NEAR-DGD [23] is a distributed first order method that permits adjusting the amounts of computation and communication carried out at each iteration in order to balance convergence accuracy and total application cost. We have extended the analysis of NEAR-DGD$^{t}$, a variant of NEAR-DGD that performs a fixed number of communication rounds, controlled by the parameter $t$, at every iteration, to the nonconvex setting. Given a connected, undirected network with general topology, we have shown that NEAR-DGD$^{t}$ converges to minimizers of a custom Lyapunov function and locally approaches the minimizers of the original objective function as $t$ grows large. Our numerical results confirm our theoretical analysis and indicate that NEAR-DGD can achieve exact convergence to second order stationary points of Problem (1) when the number of consensus rounds increases over time.

References

  • [1] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan 2009.
  • [2] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [3] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [4] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. PP, pp. 1–1, Apr. 2017.
  • [5] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • [6] P. D. Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
  • [7] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimization over time-varying digraphs,” Math. Program., vol. 176, no. 1–2, p. 497–544, Jul. 2019. [Online]. Available: https://doi.org/10.1007/s10107-018-01357-w
  • [8] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” SIAM Journal on Optimization, vol. 30, no. 4, pp. 3029–3068, 2020. [Online]. Available: https://doi.org/10.1137/18M121784X
  • [9] H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 38–42.
  • [10] M. Hong, S. Zeng, J. Zhang, and H. Sun, “On the Divergence of Decentralized Non-Convex Optimization,” arXiv:2006.11662 [cs, math], Jun. 2020, arXiv: 2006.11662. [Online]. Available: http://arxiv.org/abs/2006.11662
  • [11] M. Hong, “A Distributed, Asynchronous, and Incremental Algorithm for Nonconvex Optimization: An ADMM Approach,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, p. 11, 2018.
  • [12] M. Hong, M. Razaviyayn, and J. Lee, “Gradient primal-dual algorithm converges to second-order stationary solution for nonconvex distributed optimization over networks,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 2009–2018. [Online]. Available: http://proceedings.mlr.press/v80/hong18a.html
  • [13] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017.
  • [14] D. Hajinezhad, M. Hong, and A. Garcia, “ZONE: Zeroth-Order Nonconvex Multiagent Optimization Over Networks,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 3995–4010, Oct. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8629972/
  • [15] Y. Tang and N. Li, “Distributed zero-order algorithms for nonconvex multi-agent optimization,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 781–786.
  • [16] P. Bianchi and J. Jakubowicz, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 391–405, 2013.
  • [17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/f75526659f31040afeb61cb7133e4e6d-Paper.pdf
  • [18] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “$D^{2}$: Decentralized training over decentralized data,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 4848–4856. [Online]. Available: http://proceedings.mlr.press/v80/tang18a.html
  • [19] H. Sun, S. Lu, and M. Hong, “Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 9217–9228. [Online]. Available: http://proceedings.mlr.press/v119/sun20a.html
  • [20] J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht, “First-order methods almost always avoid strict saddle points,” Mathematical Programming, vol. 176, no. 1, pp. 311–337, Jul. 2019. [Online]. Available: https://doi.org/10.1007/s10107-019-01374-3
  • [21] B. Swenson, R. Murray, S. Kar, and H. V. Poor, “Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima,” arXiv:2003.02818 [math.OC], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2003.02818
  • [22] B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed gradient flow: Nonsmoothness, nonconvexity, and saddle point evasion,” arXiv:2008.05387 [math.OC], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2008.05387
  • [23] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing Communication and Computation in Distributed Optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3141–3155, Aug. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8528465/
  • [24] L. Xiao, S. Boyd, and S.-J. Kim, “Distributed average consensus with least-mean-square deviation,” Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731506001808
  • [25] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods,” Mathematical Programming, vol. 137, no. 1-2, pp. 91–129, Feb. 2013. [Online]. Available: http://link.springer.com/10.1007/s10107-011-0484-9
  • [26] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, “Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality,” Math. Oper. Res., vol. 35, no. 2, p. 438–457, May 2010. [Online]. Available: https://doi.org/10.1287/moor.1100.0449
  • [27] H. Attouch and J. Bolte, “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features,” Math. Program., vol. 116, no. 1–2, p. 5–16, Jan. 2009. [Online]. Available: https://doi.org/10.1007/s10107-007-0133-5
  • [28] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds.   Princeton, NJ: Princeton University Press, 2008.
  • [29] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed.   Cambridge; New York: Cambridge University Press, 2012.