Distributed gradient-based optimization in the presence of dependent aperiodic communication
Abstract
Iterative distributed optimization algorithms involve multiple agents that communicate with each other, over time, in order to minimize/maximize a global objective. In the presence of unreliable communication networks, the Age-of-Information (AoI), which measures the freshness of the data received, may be large and hence hinder algorithmic convergence. In this paper, we study the convergence of general distributed gradient-based optimization algorithms in the presence of communication that neither happens periodically nor at stochastically independent points in time. We show that convergence is guaranteed provided the random variables associated with the AoI processes are stochastically dominated by a random variable with finite first moment. This improves on previous requirements of boundedness of more than the first moment. We then introduce stochastically strongly connected (SSC) networks, a new stochastic form of strong connectedness for time-varying networks. We show: If for any the processes that describe the success of communication between agents in an SSC network are α-mixing with summable, then the associated AoI processes are stochastically dominated by a random variable with finite -th moment. In combination with our first contribution, this implies that distributed stochastic gradient descent converges in the presence of AoI, if is summable.
1 Introduction
Distributed optimization of stochastically approximated loss functions lies at the heart of many system-level problems that arise in multi-agent learning [26], resource allocation for data centers [12], or decentralized control of power systems [18]. In these scenarios, distributed implementations have many advantages such as balanced workload or the avoidance of a single point of failure. However, this usually comes with high communication costs for coordination [29], entailing that information can only be exchanged rarely, causing local versions of global information to be significantly outdated. Hence, it is of high interest to characterize conditions such that a distributed optimization algorithm can converge when only significantly outdated information with sporadic updates is available.
We therefore consider distributed stochastic optimization problems (SOPs) where the choice of local optimization variables has to be coordinated over an uncertain time-varying communication network. A typical distributed SOP can take the following form:
(1.1) |
The objective is to minimize a real-valued function , which is a function of an optimization variable and a random variable representing noise or uncertainty taken from a set . The optimization variable is composed of local components that are associated with local agents of a distributed system. Hence, no global control of is possible. Moreover, the distribution of is typically unknown in practical scenarios and the agents can only observe samples of the uncertainty at discrete time steps . Thus, the problem is that each agent has to coordinate the local choice for its variable with all other agents by exchanging information over a network and to iteratively refine this choice based on the observed samples of the uncertainty .
To solve this problem, we propose the following solution. Suppose every agent runs a local distributed stochastic gradient descent (SGD) algorithm that generates a sequence to solve problem (1.1). Ideally, every agent would like to have direct access to every new element of the sequences from every other agent during run-time of its own local algorithm. However, due to the distributed nature of the considered setting, the agents have to communicate the updates for their local optimization variables to other agents via a communication network. Because of the uncertainty of communication networks, each agent can therefore only use delayed versions for all to update its own local variable . Here, denotes the newest update of available at agent at time and is its corresponding age. We refer to the ’s as the Age of Information (AoI) variables. The resulting distributed algorithm is therefore in essence a “straightforward” implementation of SGD, where the true values of local variables are replaced by their aged counterparts. Due to the size of generated information in large distributed systems, and the uncertainty and high cost of communication over networks, the AoI variables cannot be expected to be bounded and should therefore be modelled as an unbounded sequence of random variables. The problem is therefore to formulate mild network and communication assumptions that are representative and easily verifiable, such that this SGD algorithm, which uses highly aged information, still converges.
A major challenge for this problem is the multitude of potential factors that affect the AoI random variables. Information exchange between some pairs of agents might experience unbounded delays; mobility of agents or network scheduling algorithms can induce a varying set of network topologies. This can create dependencies among successive network transmissions, preventing agents from exchanging data for extended periods of time. In general, transmissions that happen close in some domain (e.g. time, frequency, or space in wireless communication) are expected to be highly correlated. It is therefore important to formulate a communication network model and associated assumptions that can represent these cases while being mathematically tractable for analysis. Notably, the assumption of guaranteed periodic or stochastically independent communication is practically unrealistic.
1.1 Network models in the literature
One of the most common models in the distributed optimization literature is a time-varying network model that is represented by a time-varying graph (Definition 5). For this graph, the most common assumption is that there is a constant such that the union graph associated with all time intervals is strongly connected [32, 1, 34, 15]. A network with this property is typically called uniformly strongly connected [21], -strongly connected [22, 28] or jointly strongly connected [33]. This model implies guaranteed periodic communication. Another common model is to assume a time-varying network graph whose expected union graph is strongly connected, where the events that describe the success of communication across network edges are independent across time [3, 16, 14, 25].
In ref. [16, 22, 32, 28, 34, 14, 3, 15] the objective is that agents come to a consensus on one global optimization variable to minimize the sum of real-valued functions, each of which is associated with one of the local agents. Although such consensus-type problems might appear quite different from (1.1), it turns out that an algorithm for (1.1) can also find a solution for consensus problems after a minor reformulation at the cost of additional communication, which we discuss in [27]. In contrast to the consensus-type problems, ref. [33, 1, 25] and this work consider distributed optimization problems where each agent has to select a local optimization variable, such that the combination of all local variables solves a global optimization problem.
Observe that the literature exclusively considers network models that either guarantee periodic communication or require communication based on independent events. We believe that these are restrictive assumptions that do not represent real-world communication networks well. To close this gap, we present a less restrictive network model and verifiable network conditions that guarantee that an SGD algorithm finds a solution to problems of the form (1.1). We also show that the aforementioned typical network assumptions from the literature are stronger versions of our new set of network assumptions (Assumptions 5 and 6). Our assumptions only require a stochastic form of strong connectivity and a dependency decay (mixing) property. To the best of our knowledge, ours is the first work that guarantees asymptotic convergence of a distributed optimization scheme under such mildly restrictive conditions, connecting an abstract optimization theory with a wide range of verifiable network conditions. However, it must be noted that other papers (such as those discussed above) can provide rate-of-convergence results, while we merely give an almost sure convergence analysis.
1.2 Summary of contribution
Our work contributes to the literature on network conditions that guarantee asymptotic convergence to the set of stationary points of distributed stochastic optimization problems with potentially non-convex objective functions. Our work builds on our previous work on SGD for time-varying networks [25]. However, whereas in [25] the focus was on the optimization iteration, with a strong and restrictive i.i.d. network assumption, this work focuses on guaranteeing convergence under significantly weaker network conditions. Most importantly, our network conditions cover time-varying network topologies, unbounded communication delays, non-independent aperiodic communication, asynchronous local updates and event-driven communication.
As the first step, we describe a distributed stochastic gradient descent algorithm (Algorithm 1) that, instead of true local variables, uses aged variables as a consequence of network communication. The AoI variables therefore induce gradient errors when comparing Algorithm 1 with and without AoI. As our first major contribution, we show in Lemma 2 that the aforementioned gradient errors vanish asymptotically under an asymptotic growth condition for the AoI variables. Specifically, we require that all for all are stochastically dominated by a non-negative integer-valued random variable with at least finite first moment. This provides a significant weakening of traditional assumptions from the stochastic approximation literature in the present setting, since traditionally a dominating random variable with at least a bounded moment greater than one was required. With Lemma 2 we then show the convergence of Algorithm 1 in Theorem 1.
Our second contribution is a universally applicable time-varying network model and associated assumptions to verify, in general, the existence of dominating random variables with arbitrary required moment conditions. Our time-varying network model is formulated using events , each of which represents successful information exchange from some agent to another agent during some time slot . We then introduce the notion of a -stochastically strongly connected (SSC) network with and . This notion requires that there is a set of network edges that form a strongly connected graph for which for all . In other words, for those edges communication is successful at least once over every interval of length with at least probability . We then present a general recipe to validate stochastic dominance properties with required moment conditions. Afterwards, Theorem 2 presents our main result: Fix any and consider a -SSC network. If there exists some , such that the processes are α-mixing with , then all AoI variables are stochastically dominated by a non-negative integer-valued random variable with . This result, together with Theorem 1, therefore implies our final convergence result for Algorithm 1 under the minimal requirement of an SSC network with summable α-mixing coefficients.
The rest of the paper is structured as follows: In Section 2 we state notation and preliminaries from probability and graph theory. In Section 3 we discuss the problem formulation and our distributed SGD algorithm. Afterwards we prove the almost sure convergence of Algorithm 1 in Section 4 under asymptotic growth conditions for the AoI variables. We then introduce our time-varying network model and associated assumptions in Section 5. Section 6 then presents our construction to validate stochastic dominance properties and our main results. Finally, we discuss the verifiability of our network assumptions and future work in Section 7.
2 Notation, definitions and preliminaries
This section presents notation and preliminaries from probability and graph theory. Throughout our work, discrete points in time are indicated by superscript letters . We refer to a time slot as the time interval from time step to . We use to denote .
We make frequent use of the big-O notation: Consider two real-valued sequences , . Then , if .
From probability theory we need the concepts of stochastic dominance, expectation of non-negative integer-valued random variables, measure of dependency, and α-mixing:
Definition 1.
A non-negative integer-valued random variable is said to be stochastically dominated by a random variable if for all .
Proposition 1 ([9]).
Suppose is a non-negative integer-valued random variable, then
(2.1) |
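As a quick numerical sanity check of the first-moment instance of such tail-sum identities, namely E[X] = Σ_{k≥0} P(X > k) for a non-negative integer-valued X, the following sketch compares a direct Monte Carlo estimate with the tail-sum estimate for a geometric random variable. All names and the chosen distribution are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Geometric random variable on {0, 1, 2, ...} with success probability q.
q = 0.3
samples = rng.geometric(q, size=200_000) - 1   # numpy's geometric starts at 1

# Direct Monte Carlo estimate of E[X].
mean_direct = samples.mean()

# Tail-sum estimate: E[X] = sum_{k >= 0} P(X > k), truncated where the tail vanishes.
ks = np.arange(0, 100)
mean_tailsum = sum((samples > k).mean() for k in ks)

print(f"direct estimate     : {mean_direct:.4f}")
print(f"tail-sum estimate   : {mean_tailsum:.4f}")
print(f"closed form (1-q)/q : {(1 - q) / q:.4f}")
```

The two estimates agree up to Monte Carlo error, which is all the identity asserts.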
Let be a probability space, and let and be two sub-σ-algebras of . The measure of dependency between and is defined as
(2.2) |
Consider a stochastic process . For , define the sub-σ-algebra generated from up to by
(2.3) |
Informally, the σ-algebra generated by a stochastic process from a time interval describes the information that can be extracted from the associated process realizations, see [11] for details.
Definition 2.
The α-mixing coefficients of the process are
(2.4) |
for every . The process is called strongly mixing (or α-mixing), if as .
Mixing is a notion of asymptotic independence. We refer to [8] for a survey about different mixing notions. We now introduce a subclass of strongly mixing processes with different rates of convergence:
Definition 3.
The process is called -strongly mixing for some , if
(2.5) |
We will use this new mixing property to describe dependency decay of different orders.
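To make the notion of dependency decay concrete, the following sketch computes |P(X_0 = 0, X_n = 0) − P(X_0 = 0) P(X_n = 0)| for a stationary two-state Markov chain. For binary variables this quantity equals the dependency measure between the σ-algebras generated by X_0 and X_n; it is only a proxy for the mixing coefficients of Definition 2 (which take a supremum over the whole past and future), but it already exhibits the geometric decay typical of finite ergodic chains. The transition matrix is an arbitrary illustrative choice.

```python
import numpy as np

# Two-state Markov chain with transition matrix P (rows sum to 1),
# started in its stationary distribution pi (which solves pi @ P = pi).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0

def dependency(n):
    """|P(X_0 = 0, X_n = 0) - P(X_0 = 0) P(X_n = 0)| for the stationary chain."""
    Pn = np.linalg.matrix_power(P, n)
    joint = pi[0] * Pn[0, 0]            # P(X_0 = 0, X_n = 0)
    return abs(joint - pi[0] * pi[0])   # P(X_n = 0) = pi[0] by stationarity

for n in [1, 2, 5, 10, 20, 40]:
    print(f"n = {n:3d}   dependency = {dependency(n):.2e}")
# The values decay geometrically at rate 0.7 (the second eigenvalue of P),
# so any reasonable summability requirement on them is met.
```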
For details on graph theory we refer the reader to [30]. We require the following concepts:
Definition 4.
A directed graph is called strongly connected, if every pair of nodes is connected by a directed path.
Definition 5.
A time-varying network is defined as a sequence
(2.6) |
where each element is a directed graph.
We will use the following new connectivity notion for time-varying networks:
Definition 6.
A time-varying network is called -stochastically strongly connected (SSC) with and , if there exists a strongly connected graph , such that for all and for all
(2.7) |
3 Problem description
We consider a -agent distributed optimization problem, where each agent has to choose values for a local variable to minimize a global objective function . The global optimization variable is the concatenation of the local optimization variables associated with the local agents. The objective function is assumed to be stochastic and given by
(3.1) |
with a random real-valued function, where the randomness is modeled by an -valued random variable that represents noise or uncertainty.
As discussed in the introduction, if a central agent had direct control of the optimization vector , it would be straightforward to find a local minimum of (3.1) using stochastic gradient descent (SGD) under suitable assumptions [7, Ch. 10]. However, as the components of are associated with distributed agents, we consider that the agents need to coordinate their choice for the local optimization variables by exchanging information via a communication network.
We assume a synchronized communication setting according to a global clock . Each agent updates its local variable at every time step based on a local gradient descent iteration. The iterations will be defined in Section 3.1. For each agent , the local iterations generate a sequence starting from an initial candidate value for the optimal value . To execute the local gradient iteration, agent requires a locally available estimate of the current optimization variable of agent for all . We consider that this information has to be communicated over a communication network. Specifically, every agent will use the newest available local optimization variable from every other agent to update its own local variable. Due to the potential uncertainty of the network, only aged/delayed versions of the local variables of the other agents are available at agent at any time step. Therefore, agent only has access to the delayed version
(3.2) |
of at every time step . Here, denotes the newest update of available at agent at time and the ’s are the AoI random variables. Further, we refer to as the local belief vector of agent at time . As the next step, we describe the gradient-based iteration that uses instead of to solve problem (3.1).
3.1 Algorithm
We consider that the agents iteratively refine their local variables using the partial derivatives . We assume that the agents do not know the distribution of , but during any time slot an agent can observe an i.i.d. realisation of . For simplicity, we assume that all agents are affected by the same realisation of the random variable . In other words, when agent and agent calculate their partial derivatives during some time slot , they use the same realisation of , i.e. and . The extension to agent-specific realisations of is merely a technical reformulation that was already described in [24].
To evaluate the partial derivatives , agent uses the most recent available version of the optimization variable of agent for all , i.e. it calculates instead of . The following SGD iteration is used by each agent to update its local variable:
(3.3) |
where is a given step-size sequence and is a local stochastic additive error that may arise during the calculation of . Algorithm 1 summarizes the protocol that runs locally on every agent. For now, we assume that the agents use some communication protocol to exchange their local belief vectors over a network. The protocol and the network properties therefore induce the distribution of the AoI variables . In the next section, we will prove the convergence of Algorithm 1 under an abstract growth condition for the AoI variables. Section 5 then formulates a communication network model and associated assumptions to satisfy this growth condition.
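To fix ideas, here is a minimal, self-contained sketch of the structure of iteration (3.3) for a toy quadratic objective. All concrete choices (the coupled quadratic loss, the step-size rule 1/(t+1), and the per-link success probability) are ours and purely illustrative; the point is only that each agent descends along its own partial derivative evaluated at its aged belief vector.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 5000                  # number of agents / number of time steps
a = lambda t: 1.0 / (t + 1)     # step-size sequence (illustrative choice)

# Toy coupled objective: f(x, xi) = 0.5 * ||A x - xi||^2 with xi = b + noise,
# so F(x) = E[f(x, xi)] is minimized at x* = A^{-1} b.  Agent i owns x_i and
# needs (aged) copies of the other coordinates to evaluate its partial derivative.
A = np.eye(N) + 0.2 * np.ones((N, N))
b = np.arange(1.0, N + 1.0)
x_star = np.linalg.solve(A, b)

x = np.zeros(N)                 # true local variables x_i(t)
belief = np.zeros((N, N))       # belief[i, j]: aged copy of x_j held by agent i
for t in range(T):
    xi = b + rng.normal(0.0, 1.0, size=N)        # common noise sample xi(t)
    for i in range(N):
        v = belief[i].copy()
        v[i] = x[i]                              # own coordinate is always current
        grad = A.T @ (A @ v - xi)                # gradient of f at the belief vector
        x[i] -= a(t) * grad[i]                   # agent i only updates x_i
    # Unreliable communication: each ordered pair (j -> i) succeeds with prob. 0.2;
    # on failure agent i keeps its (now older) copy of x_j.
    success = rng.random((N, N)) < 0.2
    for i in range(N):
        for j in range(N):
            if i != j and success[i, j]:
                belief[i, j] = x[j]
    if (t + 1) % 1000 == 0:
        print(f"t = {t+1:5d}   ||x - x*|| = {np.linalg.norm(x - x_star):.3f}")
```

Despite the stale beliefs, the printed distances shrink because the step size decays while the staleness stays moderate, which is exactly the trade-off analyzed in Section 4.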
Remark 1.
In our previous work [25], we also included asynchronous gradient updates in (3.3). The agents are then not required to update their local variables at every time step . This may be included here using the associated assumptions from [25]. Our previous work considers (3.3) for a restrictive network model with independent communication (see Section 5.4 for further details). This work resolves this issue, but we use synchronous gradient updates to avoid notational overload.
4 Asymptotic convergence of Algorithm 1
In this section, we will show the asymptotic convergence of Algorithm 1. Specifically, we show that the iterations in (3.3) converge to a neighbourhood of a local stationary point of (3.1). The main part of the proof is to show that the gradient errors
(4.1) |
due to AoI vanish asymptotically. This error captures the difference between the gradient descent step some agent would take given its local belief vector and the step it would take given the true global state.
To show that the gradient errors vanish, we require that the AoI variables satisfy an asymptotic growth condition. Observe that the gradient errors depend on the AoI variables and the step-size sequence , since determines how much successive steps of iteration (3.3) differ. If the step-size sequence decays quickly enough relative to the maximal potential growth of the AoI variables, we expect the errors to decay to zero. This is because even significantly outdated information stays relevant if the steps taken during that time were comparatively small. The convergence of Algorithm 1 will then follow from the convergence of (3.3) when one considers no AoI, i.e. the case .
The following assumption formalizes the required trade-off between the choice of the step-size sequence and the required network quality.
Assumption 1.
-
1.
There exists and a non-negative integer-valued random variable , such that stochastically dominates (Definition 1) all for all and all with
-
2.
The step-size sequence satisfies:
-
(i)
, .
-
(ii)
with as in 1.
-
(i)
Assumption 1.1 requires that the network quality is good enough that the tail distribution of the AoI variables decays sufficiently fast for at least a dominating random variable with finite mean to exist. This assumption contributes a significant weakening of the traditional assumptions required for convergence in the present setting. The traditional assumptions formulated in [6] required at least a dominating random variable with finite -th moment for . In this work, we show for the first time that a finite first moment is actually sufficient to achieve asymptotic convergence. We show that under Assumption 1.1 the growth of each cannot exceed any fraction of after some potentially large time step. We formulate this in Lemma 1.
Assumption 1.2(i) is standard in the stochastic approximation literature. Assumption 1.2(ii) requires that we choose the step size depending on the network quality. For example, if only the worst network quality can be verified, i.e. that there is only a dominating variable with finite mean, then we have to choose . In addition to the aforementioned weakening of assumptions, we also do not require that the step-size sequence is eventually monotonically decreasing, and we only require instead of . Both conditions were traditionally assumed.
We will now present additional assumptions associated with the objective function in (3.1) and the iterations in (3.3). After that, we show the convergence of Algorithm 1. In Section 5 we will then present verifiable network conditions that ensure that Assumption 1.1 holds. We will also see that it is easy to formulate very restrictive network conditions under which the growth of the AoI variables behaves very well. For example, one can show that all moments of the AoI variables are bounded under the standard assumptions in the distributed optimization literature (see Section 5.4). That is, Assumption 1.1 would be satisfied for all .
In addition to Assumption 1, we require the following assumptions.
Assumption 2.
-
1.
is continuous and locally Lipschitz continuous in the -coordinate, where the associated constant may depend on .
-
2.
.
-
3.
is an -valued random variable, where is a one-point compactificable space.
Assumption 3.
For all , we have a.s.
Assumption 4.
Almost surely, for some fixed .
We refer to [25] for detailed discussion on the verifiability of Assumptions 2, 3 and 4.
Recall the gradient errors due to the AoI variables in (4.1). Next, we will show that these gradient errors vanish asymptotically. We start with an asymptotic growth property for the AoI variables under Assumption 1.1.
Lemma 1.
Under Assumption 1.1, we have for every and for all that
(4.2) |
Proof.
Fix . By Assumption 1 there is a non-negative integer-valued random variable , such that
(4.3) |
for all and . Hence, we have
(4.4) |
(4.5) |
(4.6) |
where the sets are defined as
(4.7) |
for every . We use these sets to consider all in every interval . The second inequality then follows from the monotonicity of the cumulative distribution function (CDF) by definition of the sets . Since , we have therefore shown that
(4.8) |
The last equality follows from Proposition 1, since is a non-negative integer-valued random variable. ∎
We are now ready to prove that the gradient errors due to AoI vanish asymptotically.
Lemma 2.
Under Assumptions 2, 3 and 1, we have that .
Proof.
By Assumption 3, we have that for some sample path dependent radius . Then, [25, Lemma 1] shows that is locally Lipschitz continuous with a constant independent of . Hence, is globally Lipschitz continuous with a constant when restricted to . Using the triangle inequality, the established Lipschitz continuity of and Assumption 3, we have that
(4.9) |
for a sample path dependent constant . We will now show that
(4.10) |
which will imply that .
By Assumption 1, . Hence, there are constants and , such that
(4.11) |
Also by Assumption 1, there is some that stochastically dominates all , with . Now fix . By Lemma 1 we have that
(4.12) |
It now follows from the Borel-Cantelli Lemma that . Hence, there is a sample path dependent , such that
(4.13) |
Equations 4.11 and 4.13 therefore yield that
(4.14) |
for all with and . Finally, using the monotonicity of , we have
(4.15) |
Hence,
(4.16) |
and (4.10) follows, since the choice of is arbitrary. ∎
In [25, Theorem 1] we proved the convergence of Algorithm 1 for for all . The following theorem is now an immediate consequence of this result and Lemma 2.
Theorem 1.
Under Assumptions 2, 3, 1 and 4, we have that Algorithm 1 converges almost surely to a -neighbourhood of the set of stationary points of F, where is the almost sure bound of the additive errors according to Assumption 4.
5 A new set of network conditions for distributed optimization
In the previous section, we presented a convergence proof for Algorithm 1 under Assumption 1.1 on the network. This assumption directly requires that some -th moment of all AoI variables is bounded. However, the distribution of the AoI variables will typically be the consequence of direct agent-to-agent communication. We are therefore interested in more concrete conditions on the network and the agent communication that imply the required AoI moment conditions. To this end, this section introduces a network model and associated assumptions to verify Assumption 1.1.
5.1 Network model
Recall that Algorithm 1 requires that the agents exchange their local variables over a network. The network and an associated communication protocol should allow local variables to spread frequently across the network and reach every agent. We will now introduce a network model where the agents try to exchange their local belief vectors . The agents therefore try to share their latest available version of every other agent's local variable with other agents. This might potentially flood the network with data; however, there are well-known protocols to reduce the number of possibly redundant transmissions [17].
We assume a time-varying network (Definition 5)
(5.1) |
which is a sequence of directed graphs. Each agent is in one-to-one correspondence with one node in the graph. For every time step , an edge represents the event that agent successfully exchanges its local belief vector during time slot with agent . We denote this event by . Therefore, the sequence of directed graphs and the sequences of events are in one-to-one correspondence: an edge is present if and only if the event occurs. An edge therefore does not represent the possibility of communication, but the actual event of communication.
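The correspondence between the edge events and the AoI variables of Section 3 can be made explicit with a few lines of bookkeeping: when an edge event occurs, the receiving agent keeps, for every coordinate, whichever copy of the corresponding local variable is fresher, and all other ages grow by one. The sketch below records only the ages; the event model (independent links with a fixed success probability) is purely illustrative and is not one of the assumptions made later.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5, 50
INF = 10**6   # placeholder age for "nothing received yet"

# age[j, m] = current age of the newest version of x_m available at agent j.
age = np.full((N, N), INF, dtype=np.int64)
np.fill_diagonal(age, 0)              # every agent always has its own variable

for t in range(T):
    # Sample which edge events occur in this slot (illustrative model:
    # every ordered pair communicates independently with probability 0.3).
    events = rng.random((N, N)) < 0.3
    np.fill_diagonal(events, False)

    new_age = age + 1                 # all copies age by one slot ...
    np.fill_diagonal(new_age, 0)      # ... except each agent's own variable
    for i in range(N):
        for j in range(N):
            if events[i, j]:
                # Agent j receives agent i's belief vector and keeps, for every
                # coordinate m, whichever copy is fresher (multi-hop relaying).
                new_age[j] = np.minimum(new_age[j], age[i] + 1)
    age = new_age

print("AoI matrix after", T, "slots (entry (j, m) = age of x_m at agent j):")
print(age)
```

The minimum in the relay step is exactly the mechanism that later makes stochastic dominance of the AoI a transitive property of the network (Lemma 5).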
One may add additional complexity to the model, e.g. using a graph that represents the possibility of communication. Additionally, the model may be extended to scenarios where multiple successive events need to occur to guarantee the exchange of a single realization of a belief vector . This might be necessary if the dimension of is very large and/or the network bandwidth is small.
Note that although we defined the events for all , some of those events might never occur over the whole time horizon. In particular, we do not require that all agents communicate directly! However, at least some of the events should occur “frequently” enough such that the time-varying network satisfies certain connectivity properties. This will be formulated in Section 5.2 with Assumption 5.
The formulation of the time-varying communication network using the edge events has several advantages. The model allows for an underlying time-varying graph that may be the consequence of a network scheduling algorithm or the physical dynamics of the agents themselves. Each event can be represented as a multistage process, for example: (i) the availability of a channel, (ii) the use of an access protocol given the availability of a channel, (iii) the success of the transmission given successful channel access. In general, the event-based formulation appears to be very convenient for analysis.
In the next two subsections, we will formulate our assumptions for the time-varying network using the events .
5.2 Stochastic strong connectedness
The following assumption formalizes our required network connectivity property.
Assumption 5 (Network connectivity assumption).
We assume that the time-varying network is -stochastically strongly connected (SSC) (Definition 6) for some and some .
Using the events , a -SSC network requires that there exists a strongly connected graph , such that for all and for all , we have
(5.2) |
A -SSC network therefore requires that there are some pairs of agents that can communicate directly at least once in every time interval of the form with positive probability . Notice that SSC does not require direct communication between every pair of agents. The only pairs of agents that are required to communicate directly are those given in the set . An SSC network reflects our intuition of a non-degenerate communication network: some agents can “frequently” exchange information with positive probability, and information can spread across the network since is strongly connected.
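A simple Monte Carlo sanity check of the SSC property under an assumed link model could look as follows: estimate, for every ordered pair of agents, the probability that at least one transmission succeeds within a window of length T, keep the pairs whose estimate exceeds the threshold p, and test whether the kept pairs form a strongly connected graph. The bursty two-state link model, the numbers, and the function names below are all illustrative assumptions, not part of the paper's model.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
N, T, p = 4, 10, 0.8            # agents, window length, probability threshold
runs = 2000                     # Monte Carlo windows per ordered pair

def window_success():
    """Simulate one window of length T for a link and report whether at least
    one transmission succeeded.  The link is 'bursty': a sticky two-state
    channel (good/bad) modulates the success probability."""
    good = rng.random() < 0.5
    for _ in range(T):
        good = rng.random() < (0.9 if good else 0.3)   # sticky channel state
        if good and rng.random() < 0.5:                # success only in good state
            return True
    return False

# Estimate P(at least one success per window) for every ordered pair
# (here all links share the same illustrative model).
est = np.zeros((N, N))
for i, j in product(range(N), range(N)):
    if i != j:
        est[i, j] = np.mean([window_success() for _ in range(runs)])

# Keep the edges that clear the threshold p and test strong connectivity:
# every node must be able to reach every other node over the kept edges.
adj = est >= p

def reachable(src):
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in range(N):
            if adj[u, v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

strongly_connected = all(len(reachable(s)) == N for s in range(N))
print("estimated window-success probabilities:\n", np.round(est, 2))
print("edges above threshold form a strongly connected graph:", strongly_connected)
```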
Note that an SSC network does not guarantee periodic transmissions. We will see shortly that SSC is significantly weaker than plain guaranteed periodic communication. With stochastic strong connectivity we cannot draw any conclusions about the dependency of events in the network. On the other hand, assuming guaranteed periodic communication does imply a strong form of dependency decay, as shown in Section 5.4. Recall that our objective is to verify Assumption 1.1. However, SSC alone is not sufficient even to guarantee the existence of a dominating random variable as required in Assumption 1.1. The next subsection therefore formulates dependency decay conditions using strong mixing (Definition 2).
5.3 Network dependency decay
Recall that our time-varying network is given by a sequence of directed graphs . The sequence is in one-to-one correspondence with the events that represent the presence of an edge at time . We will now formulate a dependency decay assumption based on the notion of strongly mixing processes. We can then show that the AoI variables associated with a -SSC network satisfy specific moment conditions depending on the assumed rate at which dependency decays in the network.
Assumption 6 (Dependency decay assumption).
We assume that the time-varying network is such that there is some such that each process is -strongly mixing (Definition 3) for some .
With this assumption, we do not require that the dependency of subsequent events decays at any specific rate. However, there should be an interval size , such that the dependency of subsequent union events decays sufficiently fast. Notice that Assumption 6 is a dependency decay assumption for the network processes associated with all network edges . However, we actually only require the assumption for those edges in an edge set according to Assumption 5. Additionally, notice that we do not require any form of independence or dependency decay between transmissions over different edges. The reason for this is Lemma 5, which will show that the existence of a dominating random variable for the AoI variables is, in a natural way, a transitive property of the network.
In this work, we do not give a recipe to verify Assumption 6. However, we will see in the next subsection that the standard assumptions in the distributed optimization literature all imply Assumption 6. Another set of examples where Assumption 6 is directly satisfied are scenarios where the network events are driven by a geometrically ergodic Markov process [10, 8]. Of course, it can be comparatively difficult to verify this in practice. However, traditionally and also more recently, it has been quite common to model network fading channels by finite Markov chains [31, 4, 23, 19, 5]. We further discuss the verifiability of Assumption 6 in Section 7.
5.4 Comparison of Assumptions 5 and 6 to assumptions in the literature
In this subsection we show that the typical network assumptions in the literature imply Assumptions 5 and 6.
First, consider the network models in [25, 3, 16, 14]. It is easy to check that these network models imply the following properties:
-
1.
There is a strongly connected graph and some , such that for all and for all .
-
2.
The events are independent for different time-steps or different edges.
Independence is particularly unrealistic for wireless communication systems, since transmissions that occur close in time, space, frequency or code can be highly correlated. Notably, these assumptions do not exhibit any trade-off between the choice of the step-size sequence and a network-related property. Hence, there is no trade-off between the growth of the AoI variables and the choice of the step-size sequence. In fact, it is easy to show that under these assumptions all moments of all are bounded; see Example 1 in Section 6.1.
We can now show that the above properties imply Assumptions 5 and 6. Assumption 5 is directly satisfied for . Define the σ-algebras
(5.3) |
Then Assumption 6 holds trivially, since the independence of the events implies that
(5.4) |
for and for all . Hence, the mixing coefficients for each process satisfy for every .
Second, consider the time-varying network in [33, 22, 32, 28, 1, 34, 15]. The authors assume that their network is -strongly connected. Hence, they assume guaranteed periodic communication. Assumption 5 is therefore directly satisfied by choosing . Then for all . Assumption 6 is also directly satisfied by choosing . To see this, fix any with . Then
(5.5) |
since the intersection of almost sure events is an almost sure event. Therefore,
(5.6) |
and Assumption 6 follows.
We have therefore shown that the network models in the literature satisfy Assumptions 5 and 6. Moreover, Assumptions 5 and 6 are significantly weaker, since they do not require independent communication or guaranteed periodic communication, but merely asymptotic independence.
6 Stochastic dominance properties of AoI for Time-Varying Networks
In this section, we show that Assumptions 5 and 6 imply Assumption 1.1. Recall that the AoI variables , as defined in Section 3, are now a consequence of the network model formulated in Section 5.1. Each agent tries to send its local belief vector (Equation 3.2) to some other agents. A successful transmission to some other agent is represented by an edge of the time-varying network or, equivalently, by the event .
Recall that Assumption 1.1 requires finite moment properties of a random variable that stochastically dominates (Definition 1) all . The following definition will be useful to formulate our main result and the subsequent proof.
Definition 7.
We say an AoI variable is stochastically dominated with finite -th moment for some , if there exists a non-negative integer-valued random variable that stochastically dominates all for and all with .
The following theorem formulates the main result of this section.
Theorem 2.
Let be a time-varying network that is -SSC (Definition 6) with associated strongly connected graph . If for each , there is some , such that the process is -strongly mixing (Definition 3) for some , then all AoI variables are stochastically dominated by a single random variable with finite -th moment.
Stochastic dominance with finite -th moment corresponds to the mere existence of a dominating random variable without any necessary moment condition. Theorem 2 shows a more general result than is required for the convergence of Algorithm 1, since it is shown for all . The following corollary is now immediate and requires Theorem 2 for .
Corollary 1.
Under Assumptions 2, 3, 4, 5 and 6, we have that Algorithm 1 converges almost surely to a -neighbourhood of the set of stationary points of F, where is the almost sure bound of the additive errors according to Assumption 4.
Proof.
Under Assumptions 5 and 6, it follows from Theorem 2 that Assumption 1.1 holds for some . We can then choose a step-size sequence that is not summable but square summable with , and that therefore also satisfies Assumption 1.2. The requirements of Theorem 1 are therefore satisfied and the statement of the corollary follows. ∎
The rest of this section is devoted to the proof of Theorem 2. We begin by describing a general construction/recipe to establish the stochastic dominance properties for AoI variables of time-varying networks. In addition, we illustrate the recipe for the scenario where the edge events are independent. Afterwards, we give the proof of Theorem 2. Before proceeding, we show a preliminary property of the AoI variables for an -SSC network.
Lemma 3.
Let be a time-varying network that is -SSC with associated strongly connected graph , then for all we have
(6.1) |
Proof.
First, we have for , since . We therefore concentrate on . Fix , i.e. and are agents that can communicate directly. Observe that successful direct communication from to during any time interval of the form implies that the AoI at time is less than . In other words, we have the following inclusion
(6.2) |
Since the network is -SSC, we have that
(6.3) |
The complementary event of the previous expression therefore concludes the proof of the lemma. ∎
6.1 A construction to establish stochastic dominance properties
We now describe a general construction to establish the stochastic dominance properties with some finite -th moment (Definition 7) for an AoI variable . The idea is to find a uniform upper bound , such that
for all independent of for some and . We can now use this bound to define the CDF of a new random variable. Since tends to zero, there is some , such that for all . Now define a non-negative integer-valued random variable by describing its CDF (more precisely its complementary CDF) as follows:
(6.4) |
(6.5) |
By definition stochastically dominates all for all . Moreover, if
for some , then it will follow from Proposition 1 that is stochastically dominated with finite -th moment.
As the next step, we describe how we can find a function for the above construction. Consider a -SSC network. Let be the strongly connected graph associated with the -SSC network and fix an edge . Let be an increasing sequence in , with . Now for each use this sequence to define time indices
(6.6) |
as long as . Let be the number of constructed time indices and observe that
(6.7) |
This follows since implies for all by the very construction of the time indices . In general, we can now derive as an upper bound to the right-hand side in (6.7), which we illustrate immediately for the case of independent network communication, i.e. where the events are independent. For the case of dependent network communication, this will be formulated in Lemma 4 in the next section.
Example 1 (Independent network communication).
Let be the strongly connected graph associated with a -SSC network and consider an edge . Using the independence assumed in this example together with Lemma 3, we have from (6.7) that
(6.8) |
for all large enough such that . Now define
and hence . The construction described above then yields a dominating random variable for all for all . It is now easy to verify that
for all , since the series is a version of a weighted geometric series. We have therefore established that with independent communication, each AoI variable with is stochastically dominated with finite -th moment for every . This underlines how strong the assumption of independent communication is.
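The recipe of this subsection can also be carried out numerically. Assuming a geometric tail bound of the form g(k) = C·rho^k, as in the independent case above (the constants C and rho below are placeholders), the following sketch constructs the complementary CDF of the dominating variable and evaluates its moments with the standard tail-sum identity for non-negative integer-valued random variables.

```python
import numpy as np

# Uniform tail bound g(k) >= sup_t P(AoI(t) > k), here of geometric type as in
# the independent-communication case; the constants C and rho are placeholders.
C, rho = 5.0, 0.8
g = lambda k: C * rho ** k

# k0 = first index at which the bound is a valid probability (g(k) <= 1).
k0 = int(np.ceil(np.log(1.0 / C) / np.log(rho)))

# Complementary CDF of the dominating variable gamma:
# P(gamma > k) = 1 for k < k0 and = g(k) for k >= k0.
def tail(k):
    return 1.0 if k < k0 else g(k)

# p-th moment via the tail-sum identity for non-negative integer-valued variables:
# E[gamma^p] = sum_{k >= 0} ((k+1)^p - k^p) * P(gamma > k).  The sum is truncated;
# the geometric tail makes the remainder negligible.
def moment(p, kmax=2000):
    ks = np.arange(kmax)
    return float(np.sum(((ks + 1.0) ** p - ks.astype(float) ** p)
                        * np.array([tail(k) for k in ks])))

for p in [1, 2, 4]:
    print(f"E[gamma^{p}] = {moment(p):.2f}")
# Every moment is finite because the geometric tail beats any polynomial factor,
# matching the observation that independent communication bounds all moments.
```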
6.2 Proof of Theorem 2
In the previous example, we used the independence of the edge events to establish a uniform upper bound for with geometric decay. Recall that was used in (6.6) to define the time indices , such that . Now consider the case where the edge events are not independent but merely mixing. We will see that we can then find a new upper bound to (6.7), such that
(6.9) |
with an error term due to the non independence.
Now, if the mixing coefficients associated with the processes decay rapidly enough, we expect that decays sufficiently fast, such that the new upper bound still satisfies some summability properties and hence allows us to establish stochastic dominance properties. The following lemma makes this intuition precise. It establishes the stochastic dominance property of order for those network edges that ensure that the network is -SSC.
Lemma 4.
Let be a time-varying network that is -SSC (Definition 6) with associated strongly connected graph . If for any the process is -strongly mixing (Definition 3) for some and some , then is stochastically dominated with finite -th moment (Definition 7).
Proof.
Fix an edge . The theme of the proof is to establish a uniform upper bound to the complementary CDF of independent of , such that the construction from Section 6.1 yields the required dominating random variable.
Step 1 (Reduction to ): The -strongly mixing property of the network guarantees mixing of the process for some . W.l.o.g. we can assume that . This is justified as follows. Let us denote by a new random variable that captures the time since the last interval of the form with at least one successful transmission from to . The case then yields the conclusion of the lemma for , i.e. there will be a random variable that stochastically dominates all with . For any and , we have . Therefore,
(6.10) |
and by Minkowski’s inequality. Hence, would be the required dominating random variable for and we may therefore assume .
Step 2 (Initial CDF bound): Fix and recall the definition of and the associated sequence for each from Section 6.1. We have
(6.7 recalled) |
With a slight abuse of notation, we will now use to refer to the age of information associated with the direct information exchange from to . The age of information associated with direct information exchange by definition stochastically dominates the actual AoI. Without this step, we would technically require a stronger mixing assumption, specifically one for the events generated by all and not only for the events generated by for the pair . Note that Lemma 3 also directly holds for this case, since we used the direct information exchange to prove it anyway.
We will now establish an upper bound to (6.7 recalled) using that is -strongly mixing. For this, define the following sub-σ-algebras generated by the events :
(6.11) |
The important generated events are whether the AoI variables at some time step exceed a threshold , i.e. whether . Since the event is generated by the events with , we have that
(6.12) |
For this, we required the reduction to the age of information associated with direct information exchange. It then follows by the definition of the time indices that
(6.13) |
and
(6.14) |
for every . Hence,
(6.15) |
By construction of the indices , we have . The strong mixing property of the process therefore implies that
(6.16) |
where are the mixing coefficients associated with the process . It now follows from Lemma 3 that for , since the network is -SSC. Hence,
(6.17) |
Applying (6.16) and (6.17) successively yields:
(6.18) |
(6.19) |
for
For , we can now apply the construction presented in Section 6.1 with the bound (6.19) to obtain a dominating random variable. Here we may choose as in Example 1. For it is now crucial to choose , such that both terms in (6.19) decay rapidly enough to obtain the required stochastic dominance property with finite -th moment. However, it turns out that the bound (6.19) is only sufficient to achieve this for all , due to the merely geometric decay of the first term. The next step therefore uses (6.19) to obtain a better upper bound for (6.18).
Step 3: To improve the CDF bound for , we use that is summable. It then follows that for we have
(6.20) |
and for we have
(6.21) |
since for this case is guaranteed to be decreasing as . Both cases show that there is a constant and some , such that
(6.22) |
for sufficiently large . With (6.19) it then follows that
(6.23) |
for sufficiently large . Since the first term above is exponential and the second is rational, we can find a new , such that
(6.24) |
for sufficiently large. For this, one may again choose .
Step 4 (Verifying the stochastic dominance property with finite -th moment):
We now insert the CDF bound from Step 3 into (6.18) and obtain
(6.25) |
for sufficiently large. Now choose , such that
(6.26) |
and then choose . We choose this to guarantee the required summability property of the first term in (6.25), since
(6.27) |
for . Hence, we have
(6.28) |
for .
Now define
and apply the construction presented in Section 6.1. This yields a non-negative integer-valued random variable that stochastically dominates for all . Moreover, we have , if
(6.29) |
The first part of the series is finite, since
(6.30) |
where we used that for .
For the second part of the series, note that is by construction a monotonically decreasing function from to [8]. Now extend by linear interpolation to a monotonically decreasing function from to . Then for all , we have by monotonicity. Hence the second part is finite, since
(6.31) |
The second inequality can be shown using a similar construction as in Lemma 1. Finally, the finiteness of the last summation follows from the assumed -strongly mixing property.
∎
We have thus established the stochastic dominance property of order for those network edges that ensure that the network is SSC under the -strongly mixing condition. As the next step, we show an elementary lemma associated with the AoI variables of a time-varying network. The lemma shows that the existence of stochastically dominating random variables associated with the AoI variables of a time-varying network is a transitive property.
Lemma 5.
For nodes of a time-varying network suppose and are stochastically dominated by and , respectively. Then
-
1.
There is a random variable that stochastically dominates .
-
2.
If moreover for some , then also .
Proof.
Fix and some . Now observe the following inclusion associated with events of the three AoI variables and :
(6.32) |
The inclusion states that the two events
-
1.
The AoI is less than for information received at node from node at time
-
2.
The AoI is less than for information received at node from node at time
imply the event that the AoI is less than for information received at node from node at time . By taking the complement of the inclusion in (6.32), we have that
(6.33) |
(6.34) |
(6.35) |
In the last step, we used the assumption that there are random variables and that stochastically dominate and , respectively, for all .
Now and are integer-valued, so there is some such that
(6.36) |
for all . Define a non-negative integer-valued random variable by defining its CDF:
(6.37) |
(6.38) |
This proves part of the lemma.
Now suppose for some . We can now write the -th moment of using its CDF from above:
(6.39) |
(6.40) |
(6.41) |
where the equality follows from Proposition 1, since and are non-negative integer-valued random variables. This proves part of the lemma. ∎
Lemma 5 allows us to extend the stochastic dominance properties from Lemma 4 for node pairs to arbitrary node pairs . We are now ready to prove Theorem 2.
Proof of Theorem 2.
First, fix an arbitrary pair of nodes . Since the network is SSC, it follows from Lemma 4 that there is a sequence of edges for some , with and , such that for each , there is a non-negative integer-valued random variable that stochastically dominates all for all , with . It now follows by induction, using the transitive property of the AoI variables from Lemma 5(b), that there is a non-negative integer-valued random variable that stochastically dominates all for all , with .
It is now left to verify that there is a single dominating random variable for all pairs . This essentially follows since we consider finitely many agents. For every , define
(6.42) |
Since , there is some , such that for all . Define a non-negative integer-valued random variable by describing its CDF as follows:
(6.43) |
(6.44) |
By construction stochastically dominates all for all and for all . Finally, we have
(6.45) |
where the equality simply follows from continuity of addition and since all are convergent.
∎
7 Conclusions and future work
In this work, we presented an asymptotic convergence analysis of distributed stochastic gradient descent that uses aged information. The required network assumptions have been weakened to the mere existence of a non-negative integer-valued random variable with finite first moment that stochastically dominates all age-of-information random variables. This assumption can be satisfied with the new network Assumptions 5 and 6. These assumptions are significantly weaker than the common network assumptions in the literature. We hope that our assumptions will stimulate future work in distributed optimization under less restrictive network assumptions. Notably, instead of periodic or independent communication, we merely require asymptotically independent communication formulated using α-mixing with the minimal requirement that .
It would be interesting to see whether summability properties of α-mixing coefficients indeed hold for representative physical wireless communication systems. This might be possible when the underlying physical system has a mixing property in an ergodic sense. For example, hyperbolic systems are common models to describe electromagnetic wave propagation, and it was shown in [2] that hyperbolic systems admit a strong mixing property in an ergodic sense.
To apply Assumption 6 in practice, it would be most desirable if the α-mixing coefficients (or an upper bound on them) for the network processes could be estimated from data. Unfortunately, there are only a handful of methods that estimate or approximate mixing coefficients from data. One method, based on histogram approximations, was presented in [20]; however, it suffers from high complexity. Very recently, a new method was presented in [13]. Most notably, that work presents a hypothesis test to decide whether the sum of the α-mixing coefficients is below a given upper bound. With this, it is now possible to verify with high confidence, using data, whether Assumption 6 holds.
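As a rough, purely heuristic diagnostic, and not a substitute for the methods of [20] or [13], one can at least inspect how quickly an empirical analogue of the dependency measure decays along a recorded binary link trace. The plug-in estimator below assumes stationarity, synthesizes its trace from a sticky two-state chain only for illustration, and is merely meant to flag links whose dependence clearly does not decay.

```python
import numpy as np

rng = np.random.default_rng(4)

# A recorded link trace: 1 = transmission success in slot t, 0 = failure.
# Here it is synthesized from a sticky two-state Markov chain for illustration.
T = 200_000
trace = np.empty(T, dtype=np.int64)
state = 1
for t in range(T):
    state = int(rng.random() < (0.85 if state else 0.25))
    trace[t] = state

def empirical_dependency(x, n):
    """Plug-in estimate of |P(X_t = 1, X_{t+n} = 1) - P(X_t = 1) P(X_{t+n} = 1)|
    from a single (assumed stationary) trajectory."""
    a, b = x[:-n], x[n:]
    return abs(np.mean(a * b) - np.mean(a) * np.mean(b))

for n in [1, 2, 5, 10, 20, 40]:
    print(f"lag {n:3d}: {empirical_dependency(trace, n):.2e}")
# A fast decay with the lag is consistent with (but of course does not prove)
# a summable mixing profile for this link's event process.
```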
Adrian Redder was supported by the German Research Foundation (DFG) - 315248657 and SFB 901.
References
- [1] Aybat, N. S. and Hamedani, E. Y. (2019). A distributed ADMM-like method for resource sharing over time-varying networks. SIAM Journal on Optimization 29, 3036–3068.
- [2] Babillot, M. (2002). On the mixing property for hyperbolic systems. Israel Journal of Mathematics 129, 61–76.
- [3] Bastianello, N., Carli, R., Schenato, L. and Todescato, M. (2020). Asynchronous distributed optimization over lossy networks via relaxed ADMM: Stability and linear convergence. IEEE Transactions on Automatic Control 66, 2620–2635.
- [4] Bianchi, G. (2000). Performance analysis of the IEEE 802.11 distributed coordination function. IEEE Journal on Selected Areas in Communications 18, 535–547.
- [5] Boban, M., Gong, X. and Xu, W. (2016). Modeling the evolution of line-of-sight blockage for V2V channels. In 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall), 1–7. IEEE.
- [6] Borkar, V. S. (1998). Asynchronous stochastic approximations. SIAM Journal on Control and Optimization 36, 840–851.
- [7] Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint. Springer.
- [8] Bradley, R. C. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys 2, 107–144.
- [9] Chakraborti, S., Jardim, F. and Epprecht, E. (2018). Higher-order moments using the survival function: The alternative expectation formula. The American Statistician.
- [10] Davydov, Yu. A. (1974). Mixing conditions for Markov chains. Theory of Probability & Its Applications 18, 312–328.
- [11] Durrett, R. (2019). Probability: Theory and Examples. Cambridge University Press.
- [12] Haghshenas, K., Pahlevan, A., Zapater, M., Mohammadi, S. and Atienza, D. (2019). Magnetic: Multi-agent machine learning-based approach for energy efficient dynamic consolidation in data centers. IEEE Transactions on Services Computing.
- [13] Khaleghi, A. and Lugosi, G. (2021). Inferring the mixing properties of an ergodic process. arXiv preprint arXiv:2106.07054.
- [14] Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M. and Stich, S. (2020). A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, 5381–5393. PMLR.
- [15] Kovalev, D., Shulgin, E., Richtárik, P., Rogozin, A. and Gasnikov, A. (2021). ADOM: Accelerated decentralized optimization method for time-varying networks. In International Conference on Machine Learning. PMLR.
- [16] Lei, J., Chen, H.-F. and Fang, H.-T. (2018). Asymptotic properties of primal-dual algorithm for distributed stochastic optimization over random networks with imperfect communications. SIAM Journal on Control and Optimization 56, 2159–2188.
- [17] Lim, H. and Kim, C. (2001). Flooding in wireless ad hoc networks. Computer Communications 24, 353–363.
- [18] Lin, W. and Bitar, E. (2017). Decentralized stochastic control of distributed energy resources. IEEE Transactions on Power Systems 33, 888–900.
- [19] Lin, S., Kong, L., He, L., Guan, K., Ai, B., Zhong, Z. and Briso-Rodríguez, C. (2015). Finite-state Markov modeling for high-speed railway fading channels. IEEE Antennas and Wireless Propagation Letters 14, 954–957.
- [20] McDonald, D. J., Shalizi, C. R. and Schervish, M. (2015). Estimating beta-mixing coefficients via histograms. Electronic Journal of Statistics 9, 2855–2883.
- [21] Nedić, A. and Olshevsky, A. (2014). Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control 60, 601–615.
- [22] Nedic, A., Olshevsky, A. and Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization 27, 2597–2633.
- [23] Pimentel, C., Falk, T. H. and Lisbôa, L. (2004). Finite-state Markov modeling of correlated Rician-fading channels. IEEE Transactions on Vehicular Technology 53, 1491–1501.
- [24] Ramaswamy, A., Redder, A. and Quevedo, D. E. (2019). Optimization over time-varying networks with unbounded delays. arXiv preprint arXiv:1912.07055.
- [25] Ramaswamy, A., Redder, A. and Quevedo, D. E. (2021). Distributed optimization over time-varying networks with stochastic information delays. IEEE Transactions on Automatic Control.
- [26] Redder, A., Ramaswamy, A. and Karl, H. (2022a). Asymptotic convergence of deep multi-agent actor-critic algorithms. arXiv preprint arXiv:2201.00570.
- [27] Redder, A., Ramaswamy, A. and Karl, H. (2022b). Multi-agent gradient-based resource allocation for networked systems. To appear.
- [28] Scutari, G. and Sun, Y. (2019). Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming 176, 497–544.
- [29] Tang, H., Gan, S., Zhang, C., Zhang, T. and Liu, J. (2018). Communication compression for decentralized training. Advances in Neural Information Processing Systems 31, 7652–7662.
- [30] Trudeau, R. J. (1993). Introduction to Graph Theory. Courier Corporation.
- [31] Wang, H. S. and Moayeri, N. (1995). Finite-state Markov channel - a useful model for radio communication channels. IEEE Transactions on Vehicular Technology 44, 163–171.
- [32] Wang, Y., Zhao, W., Hong, Y. and Zamani, M. (2019). Distributed subgradient-free stochastic optimization algorithm for nonsmooth convex functions over time-varying networks. SIAM Journal on Control and Optimization 57, 2821–2842.
- [33] Xu, Y., Han, T., Cai, K., Lin, Z., Yan, G. and Fu, M. (2017). A distributed algorithm for resource allocation over dynamic digraphs. IEEE Transactions on Signal Processing 65, 2600–2612.
- [34] Yu, Z., Ho, D. W. C. and Yuan, D. (2020). Distributed stochastic optimization over time-varying noisy network. arXiv preprint arXiv:2005.03982.