
Decentralized Optimization On Time-Varying Directed Graphs Under Communication Constraints

Yiyue Chen, Student Member, IEEE, Abolfazl Hashemi, Student Member, IEEE, and Haris Vikalo, Senior Member, IEEE
Yiyue Chen and Haris Vikalo are with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA. Abolfazl Hashemi is with the Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, TX 78712 USA. This work was supported in part by NSF grant 1809327.
Abstract

We consider the problem of decentralized optimization where a collection of agents, each having access to a local cost function, communicate over a time-varying directed network and aim to minimize the sum of those functions. In practice, the amount of information that can be exchanged between the agents is limited due to communication constraints. We propose a communication-efficient algorithm for decentralized convex optimization that relies on sparsification of the local updates exchanged between neighboring agents in the network. In directed networks, message sparsification alters column-stochasticity, a property that plays an important role in establishing convergence of decentralized learning tasks. We propose a decentralized optimization scheme that relies on local modification of mixing matrices, and show that it achieves an $\mathcal{O}(\frac{\ln T}{\sqrt{T}})$ convergence rate in the considered settings. Experiments validate the theoretical results and demonstrate efficacy of the proposed algorithm.

1 Introduction

In recent years, decentralized optimization has attracted considerable interest from the machine learning, signal processing, and control communities [13, 14, 11, 9]. We consider the setting where a collection of agents attempts to minimize an objective that consists of functions distributed among the agents; each agent evaluates one of the functions on its local data. Formally, this optimization task can be stated as

$\min_{\mathbf{x}\in\mathbb{R}^{d}}\left[f(\mathbf{x}):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x})\right],$ (1)

where $n$ is the number of agents and $f_{i}:\mathbb{R}^{d}\to\mathbb{R}$ is the function assigned to the $i$th node, $i\in[n]:=\{1,\dots,n\}$. The agents collaborate by exchanging information over a network modeled by a time-varying directed graph ${\mathcal{G}}(t)=([n],{\mathcal{E}}(t))$, where ${\mathcal{E}}(t)$ denotes the set of edges at time $t$; agent $i$ can send a message to agent $j$ at time $t$ if there exists an edge from $i$ to $j$ at $t$, i.e., if $(i,j)\in{\mathcal{E}}(t)$.

The described setting has been a subject of extensive studies over the last decade, leading to a number of seminal results [12, 4, 17, 3, 10, 15, 7]. The majority of prior work assumes symmetry in the agents' communication capabilities, i.e., models the problem using undirected graphs. However, the assumption of symmetry is often violated in practice, and the graph that captures the properties of the communication network should be directed. Providing provably convergent decentralized convex optimization schemes over directed graphs is challenging; technically, this stems from the fact that, unlike in undirected graphs, the so-called mixing matrix of a directed graph is not doubly stochastic. Existing work in the directed graph setting includes the grad-push algorithm [5, 11], which compensates for the imbalance in a column-stochastic mixing matrix by relying on local normalization scalars, and the directed distributed gradient descent (D-DGD) scheme [19], which carefully tracks link changes over time and their impact on the mixing matrices. Assuming convex local functions, both of these methods achieve an $\mathcal{O}(\frac{\ln T}{\sqrt{T}})$ convergence rate.

In practice, communication bandwidth is often limited and thus the amount of information that can be exchanged between the agents is restricted. This motivates the design of decentralized optimization schemes capable of operating under communication constraints; none of the aforementioned methods considers such settings. Recently, techniques that address communication constraints in decentralized optimization by quantizing or sparsifying messages exchanged between participating agents have been proposed in the literature [18, 20, 15]. Such schemes have been deployed in the context of decentralized convex optimization over undirected networks [7] as well as over fixed directed networks [16]. However, there has been no prior work on communication-constrained decentralized learning over time-varying directed networks.

In this paper we propose, to our knowledge the first, communication-sparsifying scheme for decentralized convex optimization over time-varying directed networks, and provide formal guarantees of its convergence; in particular, we show that the proposed method achieves an $\mathcal{O}(\frac{\ln T}{\sqrt{T}})$ convergence rate. Experiments demonstrate efficacy of the proposed scheme.

2 Problem Setting

Assume that a collection of agents aims to collaboratively find the unique solution to the decentralized convex optimization problem (1); let us denote this solution by ${\mathbf{x}}^{*}$ and assume, for simplicity, that $\mathcal{X}=\mathbb{R}^{d}$. The agents, represented by nodes of a directed time-varying graph, are allowed to exchange sparsified messages. In the following, we do not assume smoothness or strong convexity of the objective; however, our analysis can be extended to such settings.

Let $W_{in}^{t}$ (row-stochastic) and $W_{out}^{t}$ (column-stochastic) denote the in-neighbor and out-neighbor connectivity matrices at time $t$, respectively. Moreover, let $\mathcal{N}_{in,i}^{t}$ be the set of nodes that can send information to node $i$ (including $i$), and $\mathcal{N}_{out,j}^{t}$ the set of nodes that can receive information from node $j$ (including $j$) at time $t$. We assume that both $\mathcal{N}_{in,i}^{t}$ and $\mathcal{N}_{out,i}^{t}$ are known to node $i$. A simple policy for designing $W^{t}_{in}$ and $W^{t}_{out}$ is to set

$[W^{t}_{in}]_{ij}=1/|\mathcal{N}^{t}_{in,i}|,\quad[W^{t}_{out}]_{ij}=1/|\mathcal{N}^{t}_{out,j}|.$ (2)

We assume that the constructed mixing matrices have non-zero spectral gaps; this is readily satisfied in a variety of settings, including when the union graph is jointly connected. Matrices $W_{in}^{t}$ and $W_{out}^{t}$ can be used to synthesize the mixing matrix, as formally stated in Section 3 (see Definition 1).
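To make the construction in (2) concrete, the following Python sketch (helper name and the edge-list graph representation are our assumptions) builds a row-stochastic $W_{in}^{t}$ and a column-stochastic $W_{out}^{t}$ from a directed edge set; self-loops are included, as in the definition of $\mathcal{N}_{in,i}^{t}$ and $\mathcal{N}_{out,j}^{t}$.

```python
import numpy as np

def mixing_weights(edges, n):
    """Build W_in (row-stochastic) and W_out (column-stochastic) per (2).
    edges is a set of ordered pairs (i, j) meaning node i can send to node j."""
    adj = np.eye(n, dtype=bool)               # self-loops: i is in its own neighborhoods
    for (i, j) in edges:
        adj[j, i] = True                      # adj[j, i] = True iff i can send to j
    W_in = np.zeros((n, n))
    W_out = np.zeros((n, n))
    for i in range(n):
        in_nbrs = np.flatnonzero(adj[i, :])   # N_in,i: nodes that send to i
        W_in[i, in_nbrs] = 1.0 / len(in_nbrs)
    for j in range(n):
        out_nbrs = np.flatnonzero(adj[:, j])  # N_out,j: nodes that receive from j
        W_out[out_nbrs, j] = 1.0 / len(out_nbrs)
    return W_in, W_out
```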

To reduce the size of the messages exchanged between agents in the network, we perform sparsification. In particular, each node uniformly at random selects and communicates $k$ out of the $d$ entries of a $d$-dimensional message. To formalize this, we introduce a sparsification operator $Q:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$. The operator $Q$ is biased, i.e., $\mathbb{E}[Q({\mathbf{x}})]\neq{\mathbf{x}}$, and has variance that depends on the norm of its argument, $\mathbb{E}[\|Q({\mathbf{x}})-{\mathbf{x}}\|^{2}]\propto\|{\mathbf{x}}\|^{2}$. Biased compression operators have previously been considered in the context of time-invariant networks [15, 7, 6, 16] but have not been studied in time-varying network settings.
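As a minimal illustration (function name and the empirical check are ours), the random-$k$ operator and its bias can be sketched as follows; for this operator $\mathbb{E}[Q(\mathbf{x})]=\frac{k}{d}\mathbf{x}$ and $\mathbb{E}[\|Q(\mathbf{x})-\mathbf{x}\|^{2}]=(1-\frac{k}{d})\|\mathbf{x}\|^{2}$.

```python
import numpy as np

def sparsify(x, k, rng):
    """Random-k sparsification: keep k of the d entries chosen uniformly at
    random, zero out the rest.  Biased: E[Q(x)] = (k/d) x, not x."""
    q = np.zeros_like(x)
    keep = rng.choice(x.shape[0], size=k, replace=False)
    q[keep] = x[keep]
    return q

# Empirical check of the bias and the norm-dependent error (assumed toy setup).
rng = np.random.default_rng(0)
x = rng.standard_normal(128)
Q = np.stack([sparsify(x, k=12, rng=rng) for _ in range(20000)])
print(np.linalg.norm(Q.mean(axis=0) - (12 / 128) * x))              # close to 0
print(((Q - x) ** 2).sum(axis=1).mean() / np.linalg.norm(x) ** 2)   # close to 1 - 12/128
```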

3 Compressed Time-Varying Decentralized Optimization

A common strategy for solving decentralized optimization problems is to orchestrate an exchange of messages between agents such that each update consists of a combination of compressed messages from neighboring nodes and a gradient term. The gradient term is rendered vanishing by adopting a decreasing stepsize schedule; this ensures that the agents in the network reach a consensus state which is the optimal solution of the optimization problem.

To meet communication constraints, messages may be sparsified; however, a simplistic introduction of sparsification to the existing methods, e.g., [5, 1, 2, 11], may have an adverse effect on their convergence: the modified schemes may only converge to a neighborhood of the optimal solution or even end up diverging. This is caused by the non-vanishing error due to the bias and variance of the sparsification operator. We note that the impact of sparsification on the entries of a state vector in the network can be interpreted as that of link failures; this motivates us to account for it in the structure of the connectivity matrices. Specifically, we split the vector-valued decentralized problem into $d$ individual scalar-valued sub-problems with the coordinate in-neighbor and out-neighbor connectivity matrices, $\{W_{in,m}^{t}\}_{m=1}^{d}$ and $\{W_{out,m}^{t}\}_{m=1}^{d}$, specified for each time $t$. If an entry is sparsified at time $t$ (i.e., set to zero and not communicated), the corresponding coordinate connectivity matrices are no longer stochastic. To handle this issue, we re-normalize the connectivity matrices $\{W_{in,m}^{t}\}_{m=1}^{d}$ and $\{W_{out,m}^{t}\}_{m=1}^{d}$, ensuring their row stochasticity and column stochasticity, respectively; node $i$ performs re-normalization of the $i$th row of $\{W_{in,m}^{t}\}_{m=1}^{d}$ and the $i$th column of $\{W_{out,m}^{t}\}_{m=1}^{d}$ locally. We denote by $\{A_{m}^{t}\}_{m=1}^{d}$ and $\{B_{m}^{t}\}_{m=1}^{d}$ the weight matrices resulting from the re-normalization of $\{W_{in,m}^{t}\}_{m=1}^{d}$ and $\{W_{out,m}^{t}\}_{m=1}^{d}$, respectively.

Following the work of [1] on average consensus, we introduce an auxiliary vector $\mathbf{y}_{i}\in\mathbb{R}^{d}$ for each node. Referred to as the surplus vector, $\mathbf{y}_{i}$ records variations of the state vectors over time and is used to help ensure the state vectors approach the consensus state. At time step $t$, node $i$ compresses $\mathbf{x}_{i}^{t}$ and $\mathbf{y}_{i}^{t}$ and sends both to its current out-neighbors. To allow a succinct expression of the update rule, we introduce $\mathbf{z}_{i}^{t}\in\mathbb{R}^{d}$ defined as

$\mathbf{z}_{i}^{t}=\begin{cases}\mathbf{x}_{i}^{t},&i\in\{1,\dots,n\}\\ \mathbf{y}_{i-n}^{t},&i\in\{n+1,\dots,2n\}.\end{cases}$ (3)

The sparsification operator $Q(\cdot)$ is applied to $\mathbf{z}_{i}^{t}$, resulting in $Q(\mathbf{z}_{i}^{t})$; we denote the $m$th entry of the sparsified vector by $[Q(\mathbf{z}_{i}^{t})]_{m}$. The aforementioned weight matrix $A^{t}_{m}$ is formed as

$[A^{t}_{m}]_{ij}=\begin{cases}\frac{[W^{t}_{in,m}]_{ij}}{\sum_{l\in\mathcal{S}_{m}^{t}(i)}[W^{t}_{in,m}]_{il}}&\text{if }j\in\mathcal{S}_{m}^{t}(i)\\ 0&\text{otherwise},\end{cases}$ (4)

where $\mathcal{S}_{m}^{t}(i):=\{j\,|\,j\in\mathcal{N}^{t}_{in,i},\,[Q(\mathbf{z}_{j}^{t})]_{m}\neq 0\}\cup\{i\}$. Likewise, $B^{t}_{m}$ is defined as

$[B^{t}_{m}]_{ij}=\begin{cases}\frac{[W^{t}_{out,m}]_{ij}}{\sum_{l\in\mathcal{T}_{m}^{t}(j)}[W^{t}_{out,m}]_{lj}}&\text{if }i\in\mathcal{T}_{m}^{t}(j)\\ 0&\text{otherwise},\end{cases}$ (5)

where $\mathcal{T}_{m}^{t}(j):=\{i\,|\,i\in\mathcal{N}^{t}_{out,j},\,[Q(\mathbf{z}_{i}^{t})]_{m}\neq 0\}\cup\{j\}$.
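A minimal Python sketch of this local re-normalization (the helper name and the boolean mask argument are our assumptions; the mask indicates which nodes transmitted coordinate $m$, and self-loops as in (2) keep the denominators non-zero):

```python
import numpy as np

def renormalize(W_in, W_out, sent):
    """Re-normalize the m-th coordinate connectivity matrices after sparsification,
    cf. (4)-(5).  sent[j] is True iff node j transmitted coordinate m at time t."""
    n = W_in.shape[0]
    A = np.zeros((n, n))
    B = np.zeros((n, n))
    for i in range(n):                       # node i re-normalizes its own row of W_in
        S = [j for j in range(n) if (W_in[i, j] > 0 and sent[j]) or j == i]
        A[i, S] = W_in[i, S] / W_in[i, S].sum()
    for j in range(n):                       # node j re-normalizes its own column of W_out
        T = [i for i in range(n) if (W_out[i, j] > 0 and sent[i]) or i == j]
        B[T, j] = W_out[T, j] / W_out[T, j].sum()
    return A, B                              # A row-stochastic, B column-stochastic
```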

To obtain the update rule for the optimization algorithm, we first need to define the mixing matrix of a directed network with sparsified messages.

Definition 1.

At time $t$, the $m$th mixing matrix of a time-varying directed network deploying sparsified messages, $\bar{M}_{m}^{t}\in\mathbb{R}^{2n\times 2n}$, is a matrix with eigenvalues $1=|\lambda_{1}(\bar{M}_{m}^{t})|=|\lambda_{2}(\bar{M}_{m}^{t})|\geq|\lambda_{3}(\bar{M}_{m}^{t})|\geq\cdots\geq|\lambda_{2n}(\bar{M}_{m}^{t})|$ that is constructed from the current network topology as

$\bar{M}_{m}^{t}=\begin{bmatrix}A_{m}^{t}&\mathbf{0}\\ I-A_{m}^{t}&B_{m}^{t}\end{bmatrix},$ (6)

where $A_{m}^{t}$ and $B_{m}^{t}$ represent the $m$th normalized in-neighbor and out-neighbor connectivity matrices at time $t$, respectively.
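For concreteness, a short sketch assembling the block matrix of Definition 1 (helper name ours, reusing the output of the re-normalization sketch above):

```python
import numpy as np

def mixing_matrix(A_m, B_m):
    """Assemble the 2n x 2n mixing matrix of Definition 1 from the re-normalized
    in-/out-neighbor connectivity matrices, cf. (6)."""
    n = A_m.shape[0]
    M_bar = np.block([[A_m, np.zeros((n, n))],
                      [np.eye(n) - A_m, B_m]])
    assert np.allclose(M_bar.sum(axis=0), 1.0)   # columns sum to one (see the Remark below)
    return M_bar
```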

With $\mathbf{z}_{i}^{t}$ and $\bar{M}_{m}^{t}$ defined in (3) and (6), respectively, node $i$ updates the $m$th component of its message according to

$z_{im}^{t+1}=\sum_{j=1}^{2n}\Big([\bar{M}^{t}_{m}]_{ij}[Q(\mathbf{z}_{j}^{t})]_{m}+\mathbbm{1}_{\{t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1\}}\,\epsilon[F]_{ij}z_{jm}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}\Big)-\mathbbm{1}_{\{t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1\}}\,\alpha_{\lfloor t/\mathcal{B}\rfloor}g_{im}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor},$ (7)

where $g_{im}^{t}$ denotes the $m$th entry of the gradient vector $\mathbf{g}_{i}^{t}$ constructed as

$\mathbf{g}_{i}^{t}=\begin{cases}\nabla f_{i}(\mathbf{x}_{i}^{t}),&i\in\{1,\dots,n\}\\ \mathbf{0},&i\in\{n+1,\dots,2n\}.\end{cases}$ (8)

Moreover, $F=\begin{bmatrix}\mathbf{0}&I\\ \mathbf{0}&-I\end{bmatrix}$, and $\alpha_{t}$ is the stepsize at time $t$.

In (7), the update of the vectors $\mathbf{z}_{i}^{t}$ consists of a mixture of the compressed state and surplus vectors, and includes a vanishing gradient term computed from past iterates. The mixture of compressed messages can be interpreted as being obtained by sparsification and multiplication with the mixing matrix from the previous time steps, except at the times when

$t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1.$ (9)

When $t$ satisfies (9), the update of $\mathbf{z}_{i}^{t}$ incorporates the stored vectors $\mathbf{z}_{i}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}$. Note that $\mathbf{z}_{i}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}$ is multiplied by $\epsilon F$, where the perturbation parameter $\epsilon$ determines the extent to which $F$ affects the update. One can show that $\epsilon F$, in combination with the mixing matrix $\bar{M}_{m}^{t}$, guarantees a non-zero spectral gap of the product matrix over $\mathcal{B}$ consecutive time steps starting from $t=k\mathcal{B}$. Similarly, the gradient term $\alpha_{\lfloor t/\mathcal{B}\rfloor}g_{im}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}$, computed using the state vectors $\mathbf{x}_{i}^{t-(\mathcal{B}-1)}$, participates in the update when (9) holds. We formalize the proposed procedure as Algorithm 1.

Algorithm 1 Communication-Sparsifying Jointly-Connected Gradient Descent
1:  Input: $T$, $\epsilon$, $\mathbf{x}^{0}$, $\mathbf{y}^{0}=\mathbf{0}$
2:  set $\mathbf{z}^{0}=[\mathbf{x}^{0};\mathbf{y}^{0}]$
3:  for each $t\in\{0,1,\dots,T\}$ do
4:     generate non-negative matrices $W^{t}_{in}$, $W^{t}_{out}$
5:     for each $m\in\{1,\dots,d\}$ do
6:        construct a row-stochastic $A^{t}_{m}$ and a column-stochastic $B^{t}_{m}$ according to (4) and (5)
7:        construct $\bar{M}^{t}_{m}$ according to (6)
8:        for each $i\in\{1,\dots,2n\}$ do
9:           Update $z_{im}^{t+1}$ according to (7)
10:        end for
11:     end for
12:  end for

Remark. It is worth pointing out that in Algorithm 1 each node needs to store local messages of size $4d$ (four $d$-dimensional vectors: the current state and surplus vectors, the past surplus vector, and the local gradient vector). Only the two current vectors may be communicated to the neighboring nodes, while the other two vectors are used locally when (9) holds. Note that $\bar{M}_{m}^{t}$ has column sums equal to one but is not column-stochastic because it may have negative entries. Finally, note that when $\mathcal{B}=1$, the network is strongly connected at all times. A simplified end-to-end sketch of the procedure is given below.
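The following Python sketch strings the earlier helper sketches together into one pass of Algorithm 1 on a single machine (a simulation of the network rather than a distributed implementation; all names, the shared coordinate mask, and the interface of grads are our assumptions).

```python
import numpy as np

def algorithm1(grads, W_in_seq, W_out_seq, x0, T, B, eps, k_keep, alpha_fn, rng):
    """Simulate Algorithm 1.  grads(X) returns the n x d matrix of local gradients
    at the local states X; W_*_seq[t] are the connectivity matrices at time t;
    alpha_fn(k) is the decreasing stepsize schedule; B is the window length."""
    n, d = x0.shape
    F = np.block([[np.zeros((n, n)), np.eye(n)],
                  [np.zeros((n, n)), -np.eye(n)]])
    z = np.vstack([x0, np.zeros((n, d))])            # stacked states and surpluses, cf. (3)
    for t in range(T):
        if t % B == 0:
            z_window = z.copy()                      # z at time B*floor(t/B)
        Qz = np.stack([sparsify(z[i], k_keep, rng) for i in range(2 * n)])
        z_next = np.zeros_like(z)
        for m in range(d):                           # coordinate-wise mixing, cf. (7)
            sent = Qz[:n, m] != 0                    # mask of transmitted m-th entries (simplified)
            A_m, B_m = renormalize(W_in_seq[t], W_out_seq[t], sent)
            M_bar = mixing_matrix(A_m, B_m)
            # each node substitutes its own unsparsified entry locally (cf. (25))
            z_next[:, m] = M_bar @ Qz[:, m] + np.diag(M_bar) * (z[:, m] - Qz[:, m])
        if t % B == B - 1:                           # end of the connectivity window, cf. (9)
            g = np.vstack([grads(z_window[:n]), np.zeros((n, d))])
            z_next += eps * (F @ z_window) - alpha_fn(t // B) * g
        z = z_next
    return z[:n]                                     # local estimates x_i
```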

Figure 1: (a) Residual, $\mathcal{B}=1$; (b) residual, $\mathcal{B}=3$; (c) correct rate, $\mathcal{B}=1$; (d) correct rate, $\mathcal{B}=5$. Linear regression on a jointly connected network with $\mathcal{B}\in\{1,3\}$, $\epsilon=0.05$, see (a), (b); logistic regression on a jointly connected network with $\mathcal{B}\in\{1,5\}$, $\epsilon=0.01$, see (c), (d).

3.1 Convergence Analysis

Let $\bar{M}_{m}(T:s)=\bar{M}_{m}^{T}\bar{M}_{m}^{T-1}\cdots\bar{M}_{m}^{s}$ denote the product of a sequence of consecutive mixing matrices from time $s$ to $T$, with the superscript indicating the time and the subscript indicating the coordinate. The perturbed product, $M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})$, is obtained by adding the perturbation term $\epsilon F$ to the product of mixing matrices as

$M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})=\bar{M}_{m}((k+1)\mathcal{B}-1:k\mathcal{B})+\epsilon F.$ (10)

To proceed, we require the following assumptions.

Assumption 1.

The mixing matrices, stepsizes, and the local objectives satisfy:

  (i) For all $k\geq 0$ and $1\leq m\leq d$, there exists some $0<\epsilon_{0}<1$ such that the perturbed product $M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})$ has a non-zero spectral gap for all $\epsilon$ such that $0<\epsilon<\epsilon_{0}$.

  (ii) For a fixed $\epsilon\in(0,1)$, the set of all possible mixing matrices $\{\bar{M}_{m}^{t}\}$ is a finite set.

  (iii) The sequence of stepsizes, $\{\alpha_{t}\}$, is non-negative and satisfies $\sum_{t=0}^{\infty}\alpha_{t}=\infty$ and $\sum_{t=0}^{\infty}\alpha_{t}^{2}<\infty$.

  (iv) For all $1\leq i\leq n$, $1\leq m\leq d$, and $t\geq 0$, there exists some $D>0$ such that $|g_{im}^{t}|<D$.

Given the weight matrix construction in (2), assumptions (i) and (ii) hold for a variety of network structures. Assumptions (iii) and (iv) are common in decentralized optimization [12, 11, 19] and help guide the nodes in the network to a consensus that approaches the global optimal solution. We formalize our main theoretical result in Theorem 1, which establishes convergence of Algorithm 1 to the optimal solution. The details of the proof of the theorem are provided in the appendix.

Theorem 1.

Suppose Assumption 1 holds. Let ${\mathbf{x}}^{*}$ be the unique optimal solution and $f^{*}=f({\mathbf{x}}^{*})$. Then

$2\sum_{k=0}^{\infty}\alpha_{k}(f(\bar{\mathbf{z}}^{k\mathcal{B}})-f^{*})\leq n\|\bar{\mathbf{z}}^{0}-\mathbf{x}^{*}\|+nD^{\prime 2}\sum_{k=0}^{\infty}\alpha_{k}^{2}+\frac{4D^{\prime}}{n}\sum_{i=1}^{n}\sum_{k=0}^{\infty}\alpha_{k}\|\mathbf{z}_{i}^{k\mathcal{B}}-\bar{\mathbf{z}}^{k\mathcal{B}}\|,$ (11)

where $D^{\prime}=\sqrt{d}D$ and $\bar{\mathbf{z}}^{t}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_{i}^{t}+\frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_{i}^{t}$.

Note that since $\sum_{t=0}^{\infty}\alpha_{t}=\infty$, it is straightforward to see that Theorem 1 implies $\lim_{t\to\infty}f(\mathbf{z}_{i}^{t})=f^{*}$ for every agent $i$, thereby establishing convergence of Algorithm 1 to the global minimum of (1). Additionally, for the stepsize $\alpha_{t}=\mathcal{O}(1/\sqrt{t})$, Algorithm 1 attains the convergence rate $\mathcal{O}(\frac{\ln T}{\sqrt{T}})$.

4 Numerical Simulations

We test Algorithm 1 in applications to linear and logistic regression, and compare the results to Q-Grad-Push, obtained by applying simple quantization to the push-sum scheme [11], and Q-De-DGD [16]. Neither of these two schemes was developed with communication-constrained optimization over time-varying directed networks in mind: the former was originally proposed for unconstrained communication, while the latter is concerned with static networks. However, since there is no prior work on decentralized optimization over time-varying directed networks under communication constraints, we adopt them for the purpose of benchmarking.

We use the Erdős–Rényi model to generate strongly connected instances of a directed graph with $10$ nodes and edge appearance probability $0.9$. Two uni-directional edges are dropped randomly from each such graph while still preserving strong connectivity. We then remove in-going and out-going edges of randomly selected nodes to create a scenario where an almost-surely strongly connected network is formed only after taking a union of graphs over $\mathcal{B}$ time instances (see Assumption 1). Finally, let $q$ denote the fraction of entries that nodes communicate to their neighbors (small $q$ implies high compression); a sketch of the graph generation is given below.
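A minimal sketch of the graph generation described above (helper names and the reachability-based connectivity test are ours; the further edge removals used to obtain $\mathcal{B}$-jointly-connected sequences are omitted):

```python
import numpy as np

def strongly_connected(adj):
    """Strong-connectivity test via boolean reachability (repeated squaring)."""
    n = adj.shape[0]
    reach = (np.eye(n, dtype=int) + adj.astype(int)) > 0
    for _ in range(int(np.ceil(np.log2(n))) + 1):
        reach = (reach.astype(int) @ reach.astype(int)) > 0
    return bool(reach.all())

def random_digraph(n=10, p=0.9, n_drop=2, rng=None):
    """Directed Erdos-Renyi graph with edge probability p, then drop n_drop random
    uni-directional edges while preserving strong connectivity."""
    rng = rng or np.random.default_rng()
    adj = rng.random((n, n)) < p
    np.fill_diagonal(adj, False)
    while not strongly_connected(adj):           # resample until strongly connected
        adj = rng.random((n, n)) < p
        np.fill_diagonal(adj, False)
    dropped = 0
    while dropped < n_drop:
        i, j = rng.integers(n, size=2)
        if i != j and adj[i, j]:
            adj[i, j] = False
            if strongly_connected(adj):
                dropped += 1
            else:
                adj[i, j] = True                 # undo a drop that breaks connectivity
    return adj
```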

Decentralized linear regression. First, consider the optimization problem $\min_{{\mathbf{x}}}\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{y}_{i}-D_{i}\mathbf{x}_{i}\|^{2}$, where $D_{i}\in\mathbb{R}^{200\times 128}$ is a local data matrix with $200$ data points of size $d=128$ at node $i$, and $\mathbf{y}_{i}\in\mathbb{R}^{200}$ represents the local measurement vector at node $i$. We generate $\mathbf{x}^{*}$ from a normal distribution and set up the measurement model as $\mathbf{y}_{i}=D_{i}\mathbf{x}^{*}+\eta_{i}$, where $D_{i}$ is randomly generated from the standard normal distribution; finally, the rows of the data matrix are normalized to sum to one. The local additive noise $\eta_{i}$ is generated from a zero-mean Gaussian distribution with variance $0.01$. In Algorithm 1 and Q-Grad-Push, local vectors are initialized randomly to $\mathbf{x}_{i}^{0}$; Q-De-DGD is initialized with an all-zero vector. The quantization level of the benchmarking algorithms is selected to ensure that the number of bits those algorithms communicate is equal to that of Algorithm 1 when $q=0.09$. All algorithms are run with stepsize $\alpha_{t}=\frac{0.2}{t}$. The performance of the different schemes is quantified by the residual $\frac{\|\mathbf{x}^{t}-\bar{\mathbf{x}}\|}{\|\mathbf{x}^{0}-\bar{\mathbf{x}}\|}$. The results are shown in Fig. 1 (a), (b). As shown in the subplots, for all the considered sparsification rates Algorithm 1 converges at a rate proportional to $q$, while the benchmarking algorithms do not converge to the optimal solution.
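For reference, the synthetic data and the residual metric used above can be generated along the following lines (a sketch under the stated measurement model; variable names are ours).

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, d = 10, 200, 128
x_star = rng.standard_normal(d)
D_local, y_local = [], []
for i in range(n):
    Di = rng.standard_normal((N, d))
    Di /= Di.sum(axis=1, keepdims=True)              # normalize rows to sum to one
    D_local.append(Di)
    y_local.append(Di @ x_star + np.sqrt(0.01) * rng.standard_normal(N))

def linreg_grads(X):
    """n x d matrix of local least-squares gradients at the local states X."""
    return np.stack([2.0 * D_local[i].T @ (D_local[i] @ X[i] - y_local[i])
                     for i in range(n)])

def residual(X_t, X_0, x_bar):
    """Relative residual ||x^t - x_bar|| / ||x^0 - x_bar|| reported in Fig. 1(a)-(b)."""
    return np.linalg.norm(X_t - x_bar) / np.linalg.norm(X_0 - x_bar)
```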

Decentralized logistic regression. Next, we consider a multi-class classification task on the MNIST dataset [8]. The logistic regression problem is formulated as

$\min_{{\mathbf{x}}}\left\{\frac{\mu}{2}\|\mathbf{x}\|^{2}+\sum_{i=1}^{n}\sum_{j=1}^{N}\ln\left(1+\exp\left(-(\mathbf{m}_{ij}^{T}\mathbf{x}_{i})y_{ij}\right)\right)\right\}.$

The data is distributed across the network such that each node $i$ has access to $N=120$ training samples $(\mathbf{m}_{ij},y_{ij})\in\mathbb{R}^{64}\times\{0,\dots,9\}$, where $\mathbf{m}_{ij}$ denotes a vectorized image of size $d=64$ and $y_{ij}$ denotes the corresponding digit label. Performance of Algorithm 1 is again compared with Q-Grad-Push and Q-De-DGD; all algorithms are initialized with zero vectors. The quantization level of the benchmarking algorithms is selected such that the number of bits they communicate is equal to that of Algorithm 1 for $q=0.07$. The experiment is run using the stepsize $\alpha_{t}=\frac{0.02}{t}$; we set $\mu=10^{-5}$. Fig. 1 (c), (d) show the correct classification rate of Algorithm 1 for different sparsification and connectivity levels. As can be seen there, all sparsified schemes achieve the same level of the correct classification rate. The schemes communicating fewer messages in less connected networks converge more slowly, while the two benchmarking algorithms converge only to a neighborhood of the optimal solution.
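A sketch of the corresponding local gradients (assuming, for illustration, a one-vs-rest reduction of the digit labels to $y_{ij}\in\{-1,+1\}$ and an even split of the regularizer across nodes; names are ours):

```python
import numpy as np

def logistic_grads(X, M_local, y_local, mu=1e-5):
    """Local gradients of the regularized logistic loss in the display above.
    X: n x d local models; M_local[i]: N x d images; y_local[i]: labels in {-1, +1}."""
    n = X.shape[0]
    g = np.zeros_like(X)
    for i in range(n):
        margins = (M_local[i] @ X[i]) * y_local[i]             # y_ij * m_ij^T x_i
        weights = -y_local[i] / (1.0 + np.exp(margins))        # per-sample loss derivative
        g[i] = M_local[i].T @ weights + (mu / n) * X[i]        # data term plus split regularizer
    return g
```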

5 Conclusion

We considered the problem of decentralized learning over time-varying directed graphs where, due to communication constraints, nodes communicate sparsified messages. We proposed a communication-efficient algorithm that achieves an $\mathcal{O}(\frac{\ln T}{\sqrt{T}})$ convergence rate for general decentralized convex optimization tasks. As part of future work, it is of interest to reduce the computational cost of the optimization procedure by extending the results to the setting where network agents rely on stochastic gradients.

Appendices

The appendix presents the analysis of Theorem 1 and derives the convergence rate.

Appendix A Elaborating on Assumption 1(i)

The analysis of the algorithm presented in the paper is predicated on the property of the product of consecutive mixing matrices of general time-varying graphs stated in Assumption 1. Here we establish conditions under which this property holds for a specific graph structure, i.e., we identify $\epsilon_{0}$ in Assumption 1 for graphs that are jointly connected over $\mathcal{B}$ consecutive time steps. Note that when $\mathcal{B}=1$, such graphs reduce to the special case of graphs that are strongly connected at each time step. For convenience, we formally state the $\mathcal{B}>1$ and $\mathcal{B}=1$ settings as Assumptions 2 and 3, respectively.

Assumption 2.

The graphs ${\mathcal{G}}_{m}(t)=([n],{\mathcal{E}}_{m}(t))$, modeling the network connectivity for the $m$th entry of the sparsified parameter vectors, are $\mathcal{B}$-jointly-connected.

Assumption 2 implies that starting from any time step $t=k\mathcal{B}$, $k\in\mathbb{N}$, the union graph over $\mathcal{B}$ consecutive time steps is strongly connected. This is a weaker requirement than the standard assumption (Assumption 3 given below) often encountered in the literature on convergence analysis of algorithms for distributed optimization and consensus problems.

Assumption 3.

The graphs ${\mathcal{G}}_{m}(t)=([n],{\mathcal{E}}_{m}(t))$, modeling the network connectivity for the $m$th entry of the sparsified parameter vectors, are strongly connected at any time $t$.

Next, we state a lemma adopted from [1] which helps establish that under Assumptions 1(ii) and 3, the so-called spectral gap of the product of mixing matrices taken over a number of consecutive time steps is non-zero.

Lemma 1.

[1] Suppose Assumptions 1(ii) and 3 hold. Let $\bar{M}_{m}^{t}$ be the mixing matrix in (6) and $M_{m}^{t}=\bar{M}_{m}^{t}+\epsilon F$ with $\epsilon\in(0,\gamma_{m})$, where $\gamma_{m}=\frac{1}{(20+8n)^{n}}(1-|\lambda_{3}(\bar{M}_{m}^{t})|)^{n}$. Then the mixing matrix $M_{m}^{t}$ has a simple eigenvalue $1$ and all its other eigenvalues have magnitude smaller than $1$.

Note that Lemma 1 implies that

$\lim_{k\to\infty}(M_{m}^{t})^{k}=\begin{bmatrix}\frac{\mathbf{1}_{n}\mathbf{1}^{T}_{n}}{n}&\frac{\mathbf{1}_{n}\mathbf{1}^{T}_{n}}{n}\\ \mathbf{0}&\mathbf{0}\end{bmatrix},$ (12)

where the rate of convergence is geometric and determined by the non-zero spectral gap of the perturbed mixing matrix $M_{m}^{t}$, i.e., $1-|\lambda_{2}(M_{m}^{t})|$. This implies that the local estimate $x^{t}_{im}$ approaches the average consensus value $\bar{z}^{t}_{m}=\frac{1}{n}\sum_{i=1}^{n}x^{t}_{im}+\frac{1}{n}\sum_{i=1}^{n}y^{t}_{im}$ while the auxiliary variable $y^{t}_{im}$ vanishes to $0$.
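As a quick numerical illustration of (12) (a toy example of our own construction: a complete graph with uniform weights, and a perturbation $\epsilon$ chosen much larger than the conservative bound $\gamma_{m}$ of Lemma 1 so that the decay is visible; convergence for this particular matrix can be verified directly from its spectrum):

```python
import numpy as np

n, eps = 4, 0.05
A = np.full((n, n), 1.0 / n)                  # uniform row-stochastic in-neighbor weights
Bm = np.full((n, n), 1.0 / n)                 # uniform column-stochastic out-neighbor weights
M_bar = np.block([[A, np.zeros((n, n))], [np.eye(n) - A, Bm]])
F = np.block([[np.zeros((n, n)), np.eye(n)], [np.zeros((n, n)), -np.eye(n)]])
M = M_bar + eps * F
limit = np.block([[np.full((n, n), 1.0 / n), np.full((n, n), 1.0 / n)],
                  [np.zeros((n, n)), np.zeros((n, n))]])
for k in (1, 50, 200, 800):                   # geometric decay toward the limit in (12)
    print(k, np.linalg.norm(np.linalg.matrix_power(M, k) - limit))
```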

We now utilize the insight of (12) to establish a result that will facilitate the convergence analysis of the setting described by Assumption 2. In particular, we consider the setting where in any time window of size $\mathcal{B}$ starting from time $t=k\mathcal{B}$ for some integer $k$, the union of the associated directed graphs is strongly connected. The following lemma helps establish that if a small perturbation $\epsilon F$ is added to the product of mixing matrices $\bar{M}_{m}((k+1)\mathcal{B}-1:k\mathcal{B})$, then the product $M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})$ has a simple eigenvalue $1$ while all its other eigenvalues have moduli smaller than one.

Lemma 2.

Suppose that Assumptions 1(ii) and 2 hold. Let $\gamma_{m}=\min_{k}\frac{1}{(20+8n)^{n}}\left(1-|\lambda_{3}(\bar{M}_{m}((k+1)\mathcal{B}-1:k\mathcal{B}))|\right)^{n}$ and $\bar{\epsilon}=\min_{m}\gamma_{m}$. Then for each $m$ and all integers $k\geq 0$, the mixing matrix product $M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})$ has a simple eigenvalue $1$ for all $\epsilon\in(0,\bar{\epsilon})$.

Proof.

Consider a fixed realization of a $\mathcal{B}$-strongly-connected graph sequence, $\{{\mathcal{G}}(0),\dots,{\mathcal{G}}(\mathcal{B}-1)\}$. For $s\in\{0,\dots,\mathcal{B}-1\}$, $\bar{M}^{s}$ is block (lower) triangular, so its spectrum is determined by the spectra of the $(1,1)$-block and the $(2,2)$-block. Furthermore, for such $s$, the matrices $A^{s}$ (row-stochastic) and $B^{s}$ (column-stochastic) have non-negative entries. Owing to the fact that the union graph over $\mathcal{B}$ iterations is strongly connected, $\Pi_{s=0}^{\mathcal{B}-1}A^{s}=A^{\mathcal{B}-1}\cdots A^{0}$ and $\Pi_{s=0}^{\mathcal{B}-1}B^{s}=B^{\mathcal{B}-1}\cdots B^{0}$ are both irreducible. Thus, $\Pi_{s=0}^{\mathcal{B}-1}A^{s}$ and $\Pi_{s=0}^{\mathcal{B}-1}B^{s}$ both have a simple eigenvalue $1$. Recall that $\Pi_{s=0}^{\mathcal{B}-1}\bar{M}^{s}$ has column sums equal to $1$, and thus one can verify that $\mathrm{rank}(\Pi_{s=0}^{\mathcal{B}-1}\bar{M}^{s}-I)=2n-2$; therefore, the eigenvalue $1$ is semi-simple.
 
Next, we characterize the change of the semi-simple eigenvalue $\lambda_{1}=\lambda_{2}=1$ of $\Pi_{s=0}^{\mathcal{B}-1}\bar{M}^{s}$ when a small perturbation $\epsilon F$ is added. Consider the eigenvalues of the perturbed matrix product, $\lambda_{1}(\epsilon)$ and $\lambda_{2}(\epsilon)$, which correspond to $\lambda_{1}$ and $\lambda_{2}$, respectively. For all $s\in\{0,\dots,\mathcal{B}-1\}$, the matrices $\bar{M}^{s}$ share two common right eigenvectors and two common left eigenvectors for the eigenvalue $1$; these are also the right and left eigenvectors of the matrix product. The right eigenvectors $y_{1}$, $y_{2}$ and left eigenvectors $z_{1}$, $z_{2}$ of the semi-simple eigenvalue $1$ are

$Y:=[y_{1}\ \ y_{2}]=\begin{bmatrix}\mathbf{0}&\mathbf{1}\\ v_{2}&-nv_{2}\end{bmatrix},\quad Z:=\begin{bmatrix}z_{1}^{\prime}\\ z_{2}^{\prime}\end{bmatrix}=\begin{bmatrix}\mathbf{1}^{\prime}&\mathbf{1}^{\prime}\\ v_{1}^{\prime}&\mathbf{0}^{\prime}\end{bmatrix}.$ (13)

By following exactly the same steps and using Proposition 11 in [1], we can show that for small $\epsilon>0$ the perturbed matrix product has a simple eigenvalue $1$. Further, it can be guaranteed that for $\epsilon<\frac{1}{(20+8n)^{n}}(1-|\lambda_{3}(\bar{M}(\mathcal{B}-1:0))|)^{n}$, the perturbed matrix product $M(\mathcal{B}-1:0)$ has a simple eigenvalue $1$.
 
From Assumption 1(ii), there is only a finite number of possible mixing matrices $\{\bar{M}^{t}_{m}\}$. Letting $\gamma_{m}=\min_{k}\frac{1}{(20+8n)^{n}}\left(1-|\lambda_{3}(\bar{M}_{m}((k+1)\mathcal{B}-1:k\mathcal{B}))|\right)^{n}$, starting from any time step $t=k\mathcal{B}$ the perturbed mixing matrix product $M_{m}(t+\mathcal{B}-1:t)$ has a simple eigenvalue $1$ for all $\epsilon<\bar{\epsilon}=\min_{m}\gamma_{m}$. ∎

Appendix B Lemma 3 and its proof

The following lemma implies that the product of mixing matrices converges to its limit at a geometric rate; this intermediate result is used to establish the convergence rate of the proposed optimization algorithm.

Lemma 3.

Suppose that Assumptions 1(i) and 1(ii) hold. There exists $\epsilon_{0}>0$ such that if $\epsilon\in(0,\epsilon_{0})$, then for all $m=1,\dots,d$ the following statements hold.

  (a) The spectral norm of $M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})$ satisfies

$\rho\left(M_{m}((k+1)\mathcal{B}-1:k\mathcal{B})-\frac{1}{n}[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}[\mathbf{1}^{T}\ \mathbf{1}^{T}]\right)\leq\sigma<1$ (14)

for all $k\in\mathbb{N}$.

  (b) There exists $\Gamma=\sqrt{2nd}>0$ such that

$\left\|M_{m}(n\mathcal{B}-1:0)-\frac{1}{n}[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}[\mathbf{1}^{T}\ \mathbf{1}^{T}]\right\|_{\infty}\leq\Gamma\sigma^{n}.$ (15)

We start this proof by introducing two intermediate lemmas: Lemma 4 and Lemma 5.

Lemma 4.

Assume that $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$ has a non-zero spectral gap for each $m$. Then the following statements hold.

  (a) The sequence of matrix products $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$ converges to the limit matrix

$\lim_{t\to\infty}\left(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})\right)^{t}=\begin{bmatrix}\frac{\mathbf{1}_{n}\mathbf{1}^{T}_{n}}{n}&\frac{\mathbf{1}_{n}\mathbf{1}^{T}_{n}}{n}\\ \mathbf{0}&\mathbf{0}\end{bmatrix}.$ (16)

  (b) Let $1=|\lambda_{1}(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B}))|>|\lambda_{2}(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B}))|\geq\cdots\geq|\lambda_{2n}(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B}))|$ be the eigenvalues of $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$, and let $\sigma_{m}=|\lambda_{2}(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B}))|$; then there exists $\Gamma_{m}^{\prime}>0$ such that

$\left\|\left(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})\right)^{t}-\mathcal{I}\right\|_{\infty}\leq\Gamma_{m}^{\prime}\sigma_{m}^{t},$ (17)

where $\mathcal{I}:=\frac{1}{n}[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}[\mathbf{1}^{T}\ \mathbf{1}^{T}]$.

Proof.

For each $m$, $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$ has column sums equal to $1$. According to Assumption 1, the definition of the mixing matrix (6), and the construction of the product (10), $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$ has a simple eigenvalue $1$ with corresponding left eigenvector $[\mathbf{1}^{T}\ \mathbf{1}^{T}]$ and right eigenvector $[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}$. Following the Jordan decomposition associated with the simple eigenvalue, there exist some $P\in\mathbb{R}^{2n\times(2n-1)}$, $Q\in\mathbb{R}^{(2n-1)\times 2n}$, and $J_{m}\in\mathbb{R}^{(2n-1)\times(2n-1)}$ such that

$\left(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})\right)^{t}=\mathcal{I}^{t}+PJ_{m}^{t}Q=\mathcal{I}+PJ_{m}^{t}Q.$ (18)

Let $\sigma_{m}$ be the second largest eigenvalue magnitude of $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$; then $\sigma_{m}$ is also the spectral radius of $J_{m}$. The proof of part (a) follows by noting that $\lim_{t\to\infty}J_{m}^{t}=\mathbf{0}$. Since $\|P\|$, $\|Q\|$ and $\|J_{m}\|$ are finite, there exists some $\Gamma_{m}^{\prime}>0$ such that

$\left\|\left(M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})\right)^{t}-\mathcal{I}\right\|_{\infty}\leq\|PJ_{m}^{t}Q\|_{\infty}\leq\Gamma_{m}^{\prime}\sigma_{m}^{t},$ (19)

which completes the proof of part (b). ∎

Lemma 5.

Suppose that for each $m$, $M_{m}((s+1)\mathcal{B}-1:s\mathcal{B})$ has a non-zero spectral gap. Let $\sigma=\max_{m}\sigma_{m}$, where $\sigma_{m}$ is as defined in Lemma 4. Then, for each $m$ it holds that

$\rho\left(M_{m}(T\mathcal{B}-1:0)-\frac{1}{n}[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}[\mathbf{1}^{T}\ \mathbf{1}^{T}]\right)\leq\sigma^{T}.$ (20)
Proof.

We prove this lemma by induction.
 
Base step. $T=1$. According to the selection rule of $M_{m}(\mathcal{B}-1:0)$ and the definition of $\sigma$, the statement holds.
 
Inductive step. Suppose the statement holds for all $T_{1}<T$; we now consider $T_{1}=T$. Since for each $k$ the block product $M_{m}(k\mathcal{B}-1:(k-1)\mathcal{B})$ has column sums equal to $1$ and a simple eigenvalue $1$ with corresponding left eigenvector $[\mathbf{1}^{T}\ \mathbf{1}^{T}]$ and right eigenvector $[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}$, we have

$M_{m}(T\mathcal{B}-1:0)-\mathcal{I}=M_{m}(T\mathcal{B}-1:0)-\mathcal{I}M_{m}(\mathcal{B}-1:0)=\left(M_{m}(T\mathcal{B}-1:\mathcal{B})-\mathcal{I}\right)M_{m}(\mathcal{B}-1:0).$

Taking the spectral norm of both sides after applying the recursion and using corollaries of Gelfand's formula, we complete the proof. ∎

We now continue with the proof of Lemma 3. Lemma 5 implies the result in part (a) of Lemma 3. Due to the equivalence of matrix norms, we can obtain the desired result in Lemma 3(b). In particular, for a matrix $A\in\mathbb{R}^{m\times n}$ it holds that

$\frac{1}{\sqrt{n}}\|A\|_{\infty}\leq\|A\|_{2}\leq\sqrt{m}\|A\|_{\infty}.$

Since Lemma 5 shows that $\|M_{m}(T\mathcal{B}-1:0)-\mathcal{I}\|_{2}\leq\sigma^{T}$, there exists $\Gamma=\sqrt{2nd}>0$ such that

$\left\|M_{m}(T\mathcal{B}-1:0)-\frac{1}{n}[\mathbf{1}^{T}\ \mathbf{0}^{T}]^{T}[\mathbf{1}^{T}\ \mathbf{1}^{T}]\right\|_{\infty}\leq\Gamma\sigma^{T},$

which completes the proof.

Appendix C Lemma 6 and its proof

In this part we state an intermediate lemma that establishes an upper bound on the disagreement term $\|\mathbf{z}_{i}^{t}-\bar{\mathbf{z}}^{t}\|$.

Lemma 6.

Let $\Gamma=\sqrt{2nd}$. Assumptions 1(i)-(iv) imply the following statements:

  (a) For $1\leq i\leq n$ and $t=k\mathcal{B}-1+t^{\prime}$, where $t^{\prime}=1,\dots,\mathcal{B}$, it holds that

$\|\mathbf{z}_{i}^{k\mathcal{B}}-\bar{\mathbf{z}}^{k\mathcal{B}}\|\leq\Gamma\sigma^{k}\sum_{j=1}^{2n}\sum_{m=1}^{d}|z^{0}_{jm}|+\sqrt{d}n\Gamma D\sum_{r=1}^{k-1}\sigma^{k-r}\alpha_{r-1}+2\sqrt{d}D\alpha_{k-1},$ (21)

$\|\mathbf{z}_{i}^{t}-\bar{\mathbf{z}}^{t}\|\leq\Gamma(\sigma^{1/\mathcal{B}})^{t-(t^{\prime}-1)}\sum_{j=1}^{2n}\sum_{m=1}^{d}|z^{0}_{jm}|+\sqrt{d}n\Gamma D\sum_{r=1}^{\lfloor t/\mathcal{B}\rfloor-1}\sigma^{\lfloor t/\mathcal{B}\rfloor-r}\alpha_{r-1}+2\sqrt{d}D\alpha_{\lfloor t/\mathcal{B}\rfloor-1}\mathbbm{1}_{t^{\prime}=1}.$ (22)

  (b) For $n+1\leq i\leq 2n$ and $t=k\mathcal{B}-1+t^{\prime}$, where $t^{\prime}=1,\dots,\mathcal{B}$, it holds that

$\|\mathbf{z}_{i}^{k\mathcal{B}}\|\leq\Gamma\sigma^{k}\sum_{j=1}^{2n}\sum_{m=1}^{d}|z^{0}_{jm}|+\sqrt{d}n\Gamma D\sum_{r=1}^{k-1}\sigma^{k-r}\alpha_{r-1}+2\sqrt{d}D\alpha_{k-1},$ (23)

$\|\mathbf{z}_{i}^{t}\|\leq\Gamma(\sigma^{1/\mathcal{B}})^{t-(t^{\prime}-1)}\sum_{j=1}^{2n}\sum_{m=1}^{d}|z^{0}_{jm}|+\sqrt{d}n\Gamma D\sum_{r=1}^{\lfloor t/\mathcal{B}\rfloor-1}\sigma^{\lfloor t/\mathcal{B}\rfloor-r}\alpha_{r-1}+2\sqrt{d}D\alpha_{\lfloor t/\mathcal{B}\rfloor-1}\mathbbm{1}_{t^{\prime}=1}.$ (24)

Consider the time step $t=k\mathcal{B}-1$ for some integer $k$ and rewrite the update (7) as

$z_{im}^{t+1}=\sum_{j=1}^{2n}[\bar{M}^{t}_{m}]_{ij}[Q(\mathbf{z}_{j}^{t})]_{m}+\mathbbm{1}_{\{t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1\}}\epsilon\sum_{j=1}^{2n}[F]_{ij}z_{jm}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}-\mathbbm{1}_{\{t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1\}}\alpha_{\lfloor t/\mathcal{B}\rfloor}g_{im}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}=\sum_{j=1}^{2n}[\bar{M}^{t}_{m}]_{ij}z_{jm}^{t}+\mathbbm{1}_{\{t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1\}}\epsilon\sum_{j=1}^{2n}[F]_{ij}z_{jm}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}-\mathbbm{1}_{\{t\ \mathrm{mod}\ \mathcal{B}=\mathcal{B}-1\}}\alpha_{\lfloor t/\mathcal{B}\rfloor}g_{im}^{\mathcal{B}\lfloor t/\mathcal{B}\rfloor}.$ (25)

Establishing the recursion, we obtain

$z_{im}^{k\mathcal{B}}=\sum_{j=1}^{2n}[M_{m}(k\mathcal{B}-1:0)]_{ij}z_{jm}^{0}-\sum_{r=1}^{k-1}\sum_{j=1}^{2n}[M_{m}((k-1)\mathcal{B}-1:(r-1)\mathcal{B})]_{ij}\alpha_{r-1}g_{jm}^{r-1}-\alpha_{k-1}g_{im}^{(k-1)\mathcal{B}}.$ (26)

Using the fact that $M_{m}(s_{2}:s_{1})$ has column sums equal to $1$ for all $s_{2}\geq s_{1}\geq 0$, we can represent $\bar{z}_{m}^{k\mathcal{B}}$ as

$\bar{z}_{m}^{k\mathcal{B}}=\frac{1}{n}\sum_{j=1}^{2n}z_{jm}^{0}-\frac{1}{n}\sum_{r=1}^{k-1}\sum_{j=1}^{2n}\alpha_{r-1}g_{jm}^{r-1}-\frac{1}{n}\sum_{j=1}^{n}\alpha_{k-1}g_{jm}^{(k-1)\mathcal{B}}.$ (27)

By combining the last two expressions,

$\|\mathbf{z}_{i}^{k\mathcal{B}}-\bar{\mathbf{z}}^{k\mathcal{B}}\|\leq\Big\|\sum_{j=1}^{2n}\Big([M_{m}(k\mathcal{B}-1:0)]_{ij}-\frac{1}{n}\Big)z_{jm}^{0}\Big\|+\Big\|\sum_{r=1}^{k-1}\sum_{j=1}^{2n}\Big([M_{m}((k-1)\mathcal{B}-1:(r-1)\mathcal{B})]_{ij}-\frac{1}{n}\Big)\alpha_{r-1}g_{jm}^{r-1}\Big\|+\Big\|\alpha_{k-1}\Big(\mathbf{g}_{i}^{(k-1)\mathcal{B}}-\frac{1}{n}\sum_{j=1}^{n}\mathbf{g}_{j}^{(k-1)\mathcal{B}}\Big)\Big\|.$ (28)

The proof of part (a) is completed by summing over $m$ from $1$ to $d$ and applying the results of Lemma 3, recalling that $\bar{M}_{m}(t-1:0)$ has a non-zero spectral gap for all $m$ and $t$, and invoking the relationship $\|{\mathbf{x}}\|_{2}\leq\|{\mathbf{x}}\|_{1}\leq\sqrt{d}\|{\mathbf{x}}\|_{2}$ for ${\mathbf{x}}\in\mathbb{R}^{d}$. The proof of the first inequality in part (b) follows the same line of reasoning.

To show the correctness of the second inequality in both (a) and (b), we use the fact that for $t\ \mathrm{mod}\ \mathcal{B}\neq\mathcal{B}-1$,

$z_{im}^{t+1}=\sum_{j=1}^{2n}[\bar{M}^{t}_{m}]_{ij}[Q(\mathbf{z}_{j}^{t})]_{m}=\sum_{j=1}^{2n}[\bar{M}^{t}_{m}]_{ij}z_{jm}^{t}$ (29)

and rewrite $k=\frac{t-(t^{\prime}-1)}{\mathcal{B}}$. This concludes the proof of Lemma 6.

Appendix D Proof of Theorem 1

Recall the update (7) and note that $\bar{\mathbf{z}}^{(k+1)\mathcal{B}}=\bar{\mathbf{z}}^{k\mathcal{B}}-\frac{\alpha_{k}}{n}\sum_{i=1}^{n}\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})$. We thus have that

$\|\bar{\mathbf{z}}^{(k+1)\mathcal{B}}-\mathbf{x}^{*}\|^{2}=\Big\|\frac{\alpha_{k}}{n}\sum_{i=1}^{n}\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\Big\|^{2}+\|\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*}\|^{2}-\frac{2\alpha_{k}}{n}\sum_{i=1}^{n}\langle\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*},\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\rangle.$ (30)

On the other hand,

$\|\bar{\mathbf{z}}^{t}-\mathbf{x}^{*}\|^{2}=\|\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*}\|^{2}+\Big\|\frac{\alpha_{k}}{n}\sum_{i=1}^{n}\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\Big\|^{2}-\frac{2\alpha_{k}}{n}\sum_{i=1}^{n}\langle\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*},\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\rangle$ (31)

for $t=k\mathcal{B}-1+t^{\prime}$ and $t^{\prime}=1,\dots,\mathcal{B}-1$. Therefore,

$\|\bar{\mathbf{z}}^{(k+1)\mathcal{B}}-\mathbf{x}^{*}\|^{2}=\|\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*}\|^{2}+\Big\|\frac{\alpha_{k}}{n}\sum_{i=1}^{n}\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\Big\|^{2}-\frac{2\alpha_{k}}{n}\sum_{i=1}^{n}\langle\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*},\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\rangle.$ (32)

Since $|g_{im}^{t}|\leq D$ and $D^{\prime}=\sqrt{d}D$, by invoking the convexity of the local functions $f_{i}$,

$\langle\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{x}^{*},\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\rangle=\langle\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{z}_{i}^{k\mathcal{B}},\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\rangle+\langle\mathbf{z}_{i}^{k\mathcal{B}}-\mathbf{x}^{*},\nabla f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})\rangle\geq-D^{\prime}\|\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{z}_{i}^{k\mathcal{B}}\|+f_{i}(\mathbf{z}_{i}^{k\mathcal{B}})-f_{i}(\bar{\mathbf{z}}^{k\mathcal{B}})+f_{i}(\bar{\mathbf{z}}^{k\mathcal{B}})-f_{i}(\mathbf{x}^{*})\geq-2D^{\prime}\|\bar{\mathbf{z}}^{k\mathcal{B}}-\mathbf{z}_{i}^{k\mathcal{B}}\|+f_{i}(\bar{\mathbf{z}}^{k\mathcal{B}})-f_{i}(\mathbf{x}^{*}).$ (33)

Rearranging the terms above and summing over $k$ from $0$ to $\infty$ completes the proof.

Appendix E Proof of the convergence rate

First we derive an intermediate proposition.

Proposition 1.

For each $m$, the following inequalities hold:

  (a) For $1\leq i\leq n$, $\sum_{k=0}^{\infty}\alpha_{k}|z_{im}^{k\mathcal{B}}-\bar{z}_{m}^{k\mathcal{B}}|<\infty$.

  (b) For $n+1\leq i\leq 2n$, $\sum_{k=0}^{\infty}\alpha_{k}|z_{im}^{k\mathcal{B}}|<\infty$.

Proof.

Using the result of Lemma 6(a), for $1\leq i\leq n$,

$\sum_{k=1}^{T}\alpha_{k}\|\mathbf{z}_{i}^{k\mathcal{B}}-\bar{\mathbf{z}}^{k\mathcal{B}}\|\leq\Gamma\Big(\sum_{j=1}^{2n}\sum_{s=1}^{d}|z^{0}_{js}|\Big)\sum_{k=1}^{T}\alpha_{k}\sigma^{k}+\sqrt{d}n\Gamma D\sum_{k=1}^{T}\sum_{r=1}^{k-1}\sigma^{k-r}\alpha_{k}\alpha_{r-1}+2\sqrt{d}D\sum_{k=0}^{T-1}\alpha_{k}^{2}.$ (34)

Applying the inequality $ab\leq\frac{1}{2}(a^{2}+b^{2})$, $a,b\in\mathbb{R}$, we obtain

$\sum_{k=1}^{T}\alpha_{k}\sigma^{k}\leq\frac{1}{2}\sum_{k=1}^{T}(\alpha_{k}^{2}+\sigma^{2k})\leq\frac{1}{2}\sum_{k=1}^{T}\alpha_{k}^{2}+\frac{1}{1-\sigma^{2}},$ (35)

$\sum_{k=1}^{T}\sum_{r=1}^{k-1}\sigma^{k-r}\alpha_{k}\alpha_{r-1}\leq\frac{1}{2}\sum_{k=1}^{T}\alpha_{k}^{2}\sum_{r=1}^{k-1}\sigma^{k-r}+\frac{1}{2}\sum_{r=1}^{T-1}\alpha_{r-1}^{2}\sum_{k=r+1}^{T}\sigma^{k-r}\leq\frac{1}{1-\sigma}\sum_{k=0}^{T}\alpha_{k}^{2}.$ (36)

Using the assumption that the stepsizes satisfy $\sum_{k=0}^{\infty}\alpha_{k}^{2}<\infty$ and letting $T\to\infty$, we complete the proof of part (a). The same techniques can be used to prove part (b). ∎

We can now continue the proof of the stated convergence rate. Since the mixing matrices have columns that sum up to one, we have $\bar{\mathbf{z}}^{k\mathcal{B}+t^{\prime}-1}=\bar{\mathbf{z}}^{k\mathcal{B}}$ for all $t^{\prime}=1,\dots,\mathcal{B}$.

In the following step, we consider $t=k\mathcal{B}$ for some integer $k\geq 0$. Defining $f_{\min}:=\min_{t}f(\bar{\mathbf{z}}^{t})$, we have

$(f_{\min}-f^{*})\sum_{t=0}^{T}\alpha_{t}\leq\sum_{t=0}^{T}\alpha_{t}(f(\bar{\mathbf{z}}^{t})-f^{*})\leq C_{1}+C_{2}\sum_{t=0}^{T}\alpha_{t}^{2},$ (37)

where

$C_{1}=\frac{n}{2}\left(\|\bar{\mathbf{z}}^{0}-\mathbf{x}^{*}\|^{2}-\|\bar{\mathbf{z}}^{T+1}-\mathbf{x}^{*}\|^{2}\right)+D^{\prime}\Gamma\sum_{j=1}^{2n}\frac{\|\mathbf{z}_{j}^{0}\|}{1-\sigma^{2}},$ (38)

$C_{2}=\frac{nD^{\prime 2}}{2}+4D^{\prime 2}+D^{\prime}\Gamma\sum_{j=1}^{2n}\|\mathbf{z}_{j}^{0}\|+\frac{2D^{\prime 2}\Gamma}{1-\sigma}.$ (39)

Note that we can express (37) equivalently as

$f_{\min}-f^{*}\leq\frac{C_{1}}{\sum_{t=0}^{T}\alpha_{t}}+\frac{C_{2}\sum_{t=0}^{T}\alpha_{t}^{2}}{\sum_{t=0}^{T}\alpha_{t}}.$ (40)

Now, recalling the statement of Assumption 1(iii), we have that $\alpha_{t}=o(1/\sqrt{t})$. If we select the stepsize schedule $\alpha_{t}=1/\sqrt{t}$, the two terms on the right hand side of (40) satisfy

$\frac{C_{1}}{\sum_{t=0}^{T}\alpha_{t}}=C_{1}\frac{1/2}{\sqrt{T}-1}=\mathcal{O}\Big(\frac{1}{\sqrt{T}}\Big),\qquad\frac{C_{2}\sum_{t=0}^{T}\alpha_{t}^{2}}{\sum_{t=0}^{T}\alpha_{t}}=C_{2}\frac{\ln T}{2(\sqrt{T}-1)}=\mathcal{O}\Big(\frac{\ln T}{\sqrt{T}}\Big).$ (41)

This completes the proof.

References

  • [1] Cai, K., and Ishii, H. Average consensus on general strongly connected digraphs. Automatica 48, 11 (2012), 2750–2761.
  • [2] Cai, K., and Ishii, H. Average consensus on arbitrary strongly connected digraphs with time-varying topologies. IEEE Transactions on Automatic Control 59, 4 (2014), 1066–1071.
  • [3] Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control 57, 3 (2011), 592–606.
  • [4] Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization 20, 3 (2010), 1157–1170.
  • [5] Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings. (2003), IEEE, pp. 482–491.
  • [6] Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. Decentralized deep learning with arbitrary communication compression. arXiv preprint arXiv:1907.09356 (2019).
  • [7] Koloskova, A., Stich, S., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning (2019), pp. 3478–3487.
  • [8] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
  • [9] McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (2017), pp. 1273–1282.
  • [10] Nedić, A., Lee, S., and Raginsky, M. Decentralized online optimization with global objectives and local communication. In 2015 American Control Conference (ACC) (2015), IEEE, pp. 4497–4503.
  • [11] Nedić, A., and Olshevsky, A. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control 60, 3 (2014), 601–615.
  • [12] Nedic, A., and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54, 1 (2009), 48–61.
  • [13] Ren, W., and Beard, R. W. Consensus seeking in multiagent systems under dynamically changing interaction topologies. IEEE Transactions on automatic control 50, 5 (2005), 655–661.
  • [14] Ren, W., Beard, R. W., and Atkins, E. M. Information consensus in multivehicle cooperative control. IEEE Control systems magazine 27, 2 (2007), 71–82.
  • [15] Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified sgd with memory. In Advances in Neural Information Processing Systems (2018), pp. 4447–4458.
  • [16] Taheri, H., Mokhtari, A., Hassani, H., and Pedarsani, R. Quantized decentralized stochastic learning over directed graphs. In International Conference on Machine Learning (ICML) (2020).
  • [17] Wei, E., and Ozdaglar, A. Distributed alternating direction method of multipliers. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC) (2012), IEEE, pp. 5445–5450.
  • [18] Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems (2017), pp. 1509–1519.
  • [19] Xi, C., Wu, Q., and Khan, U. A. On the distributed optimization over directed networks. Neurocomputing 267 (2017), 508–515.
  • [20] Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang, C. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (2017), JMLR. org, pp. 4035–4043.